SENTIMENT ANALYSIS OF PEDULILINDUNGI APPLICATION REVIEWS USING NAIVE BAYES CLASSIFIER AND SUPPORT VECTOR MACHINE

 

 

Irfani Firdausy1, Suhartono2, M. Imamudin3

1-3 UIN Maulana Malik Ibrahim, Malang, Indonesia

*Email for Correspondence: [email protected]

 

Keywords: sentiment analysis, PeduliLindungi, Naive Bayes Classifier, Support Vector Machine

 

ABSTRACT

 

The spread of the Covid-19 virus in Indonesia since the end of 2019 has led the Indonesian government to take action in various sectors. One of the government's efforts to handle and monitor the Covid-19 pandemic using information technology was the launch of the PeduliLindungi application. With this application the government can monitor public data related to vaccination, tracing, and telemedicine, and users can look for rooms at the nearest hospital. The launch of the application and the obligation to use it drew a response from the public, which can be seen in reviews on social media, in the news, and on the Google Play Store. User reviews on the Google Play Store can serve as input or feedback. The volume of this data is large and takes a long time to process, even though the reviews could be useful as criticism and suggestions for future application development. User review data can be classified by sentiment type. Sentiment analysis is a branch of text classification that aims to determine whether a text contains a negative, positive, or neutral opinion. The aim of this research is to classify user reviews of the PeduliLindungi application into positive and negative classes using the Naive Bayes Classifier and Support Vector Machine classification algorithms. After data pre-processing, the dataset contained 10,616 records. Testing and model evaluation, carried out by randomly dividing the data into training and testing sets, gave an accuracy of 77% for the Naive Bayes Classifier and a higher accuracy of 81% for the Support Vector Machine.

This is an open access article under the CC BY-SA license.

 

 

INTRODUCTION

The spread of the Covid-19 virus in Indonesia since the end of 2019 has led the Indonesian government to take action in various sectors. One of the government's efforts to handle and monitor the Covid-19 pandemic using information technology was the launch of the PeduliLindungi application. With this application the government can monitor public data related to vaccination, tracing, and telemedicine, and users can find rooms at the nearest hospital (Putra, 2020).

The launch of the application and the obligation to use it drew a response from the public, which can be seen in reviews on social media, in the news, and on the Google Play Store. The PeduliLindungi application page on the Google Play Store carries many comments and reviews from users. When this report was written, there were more than 1 million reviews in total, and the PeduliLindungi application had been downloaded more than 50 million times. User reviews on the Google Play Store can serve as input or feedback (Saputra et al., 2022). The volume of this data is large and takes a long time to process, yet the reviews could be useful as criticism and suggestions for future application development. User review data can be classified by sentiment type. Sentiment analysis is a branch of text classification that aims to classify the sentiment (opinion) of a text automatically, that is, to determine whether the text contains a negative opinion (negative sentiment), a positive opinion (positive sentiment), or a neutral one (neutral sentiment). In Islamic teaching, people are encouraged to express their opinions positively, because doing so is an act that contains goodness (Saputra et al., 2022).

Data mining is a process that uses statistical, mathematical, artificial intelligence, and machine learning techniques to extract and identify useful information and related knowledge from large databases. In essence, data mining is a scientific discipline whose main goal is to discover, explore, or mine knowledge from the data or information that we have (Kawani, 2019). Data mining is often also referred to as Knowledge Discovery in Databases (KDD). KDD is an activity that includes collecting and using historical data to find regularities, patterns, or relationships in large data sets (Deolika et al., 2019).

With the increasing number of electronic documents and text from various sources, text mining has become a promising field of study. Text mining is the study of how to extract information and look for patterns in documents automatically. The working principle of text mining is the same as that of data mining, namely mining information from large data sets (Pessôa et al., 2021); the difference is that the data used in text mining is text. Text mining has many applications, such as text classification, information retrieval, topic clustering, topic extraction, document summarization, and sentiment analysis (Widaningsih & Suheri, 2018).

Important concepts in text classification are features and algorithms. Algorithms for sentiment classification fall into two groups: rule-based methods and statistical methods. In a rule-based method, the word features used for classification are built manually by human experts, while in a statistical method the word features are selected automatically by a classification algorithm through machine learning (Mustopa et al., 2020).

Sentiment analysis, also called opinion mining, is a field of study that analyzes people's opinions, sentiments, judgments, attitudes, and emotions toward entities and their attributes as expressed in written text. Entities can be products, services, organizations, individuals, events, issues, or topics (Wilie, 2023). Many related names for slightly different tasks, for example sentiment analysis, opinion mining, opinion analysis, opinion extraction, sentiment mining, subjectivity analysis, affect analysis, emotion analysis, and review mining, now all come under the umbrella of sentiment analysis (Azizah et al., 2022).

Statistical sentiment classification with the Naive Bayes Classifier (NBC) and Support Vector Machine (SVM) algorithms (Fitriani, 2019) has been widely used in previous sentiment analysis studies. Data pre-processing and feature extraction, such as the frequency of a word in a document (Term Frequency) or that frequency weighted by how rare the word is across the whole collection of documents (Term Frequency-Inverse Document Frequency, TF-IDF), are expected to increase the accuracy of the classification results. The authors were therefore motivated to conduct further research on sentiment analysis of user reviews of the PeduliLindungi application (Aripin, 2018).
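As a rough illustration of the two weighting schemes, the sketch below computes Term Frequency and TF-IDF by hand for three made-up tokenized reviews. The example documents, the function names, and the particular IDF form (natural logarithm of N/df without smoothing) are assumptions for illustration; library implementations use slightly different variants.

```python
import math
from collections import Counter

def term_frequency(tokens):
    """Relative frequency of each term within one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()}

def inverse_document_frequency(corpus):
    """log(N / df) per term, where df is the number of documents containing it."""
    n_docs = len(corpus)
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / d) for term, d in df.items()}

def tf_idf(corpus):
    """One {term: weight} dictionary per document."""
    idf = inverse_document_frequency(corpus)
    return [
        {term: tf * idf[term] for term, tf in term_frequency(doc).items()}
        for doc in corpus
    ]

# Made-up tokenized reviews (not taken from the study's dataset).
corpus = [
    ["aplikasi", "bagus", "membantu"],
    ["aplikasi", "error", "update"],
    ["aplikasi", "membantu", "vaksin"],
]
weights = tf_idf(corpus)
```

A word that occurs in every document, such as "aplikasi" here, receives a weight of zero, while a word found in only one document receives the largest weight.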

 

 

METHOD

Data was collected by scraping with the Python programming language (Sukma, 2021), using the google_play_scraper library. The review data for the PeduliLindungi application is located at https://play.google.com/store/apps/details?id=com.telkom.tracencare&showAllReviews=true, and only text reviews in Indonesian were taken (Devasia et al., 2016).
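A minimal sketch of such a collection step is shown below. It assumes the google-play-scraper package is installed and that network access is available. The sort order, the country code, the CSV column names, and the helper names are illustrative choices, not details reported in the study.

```python
import csv

APP_ID = "com.telkom.tracencare"  # application id taken from the Google Play URL

def fetch_reviews(count=20000):
    """Download Indonesian-language reviews (requires network access)."""
    # Imported lazily so the pure helpers below work without the package.
    from google_play_scraper import Sort, reviews
    result, _continuation_token = reviews(
        APP_ID, lang="id", country="id", sort=Sort.NEWEST, count=count
    )
    return result

def to_rows(raw_reviews):
    """Keep only the review text and the 1-5 star score of each review."""
    return [{"content": r["content"], "score": r["score"]} for r in raw_reviews]

def save_csv(rows, path):
    """Write the rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["content", "score"])
        writer.writeheader()
        writer.writerows(rows)

# Example usage (hypothetical file name; performs a network request):
# save_csv(to_rows(fetch_reviews()), "pedulilindungi_reviews.csv")
```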

The pre-processing stage consists of several steps: Case Folding, Cleaning, Tokenizing, Spelling Normalization, Stemming, and Filtering / Stopword removal (Rozi et al., 2021). Two approaches can be used in the feature extraction process, namely the statistical approach and the semantic approach. The classification stage forms a model that is used to classify new data, and it is carried out with two methods, Naive Bayes Classifier (NBC) and Support Vector Machine (SVM) (Mustopa et al., 2020). The classification results of the Naive Bayes Classifier and the Support Vector Machine were then compared for each combination using measures derived from a confusion matrix (Accuracy, Recall, Precision, F1-score) (Qadrini et al., 2021).
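The pre-processing steps listed above can be sketched as a single function. The spelling table and stopword list below are tiny hypothetical stand-ins for the much larger resources a real study would need. Stemming is left as a comment, because it requires an Indonesian stemmer that is outside the standard library.

```python
import re

# Hypothetical lookup tables, for illustration only.
SPELLING = {"gk": "tidak", "ga": "tidak", "bgt": "banget", "apk": "aplikasi"}
STOPWORDS = {"yang", "dan", "di", "ini", "itu", "saya"}

def preprocess(text):
    """Turn one raw review into a list of cleaned tokens."""
    text = text.lower()                               # case folding
    text = re.sub(r"https?://\S+", " ", text)         # cleaning: remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)             # cleaning: digits, punctuation, emoji
    tokens = text.split()                             # tokenizing
    tokens = [SPELLING.get(t, t) for t in tokens]     # spelling normalization
    # Stemming would be applied here; it is omitted in this sketch.
    return [t for t in tokens if t not in STOPWORDS]  # filtering / stopword removal

tokens = preprocess("Apk ini BAGUS bgt!!! 100% https://x.co")
```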

 

 

RESULTS AND DISCUSSION

This section presents the classification results of the two models, Naive Bayes Classifier (NBC) and Support Vector Machine (SVM) (Deolika et al., 2019), on the test data. Before the Naive Bayes Classifier and Support Vector Machine methods can be applied, the PeduliLindungi user review data must be prepared through a series of pre-processing and transformation (feature extraction) steps with TF-IDF. These steps must be completed before the data is analyzed with either method (Fuadin, 2017).
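To make the Naive Bayes step more concrete, the following is a minimal multinomial Naive Bayes sketch with add-one (Laplace) smoothing. It is a simplification and not the authors' implementation: it works on raw token counts rather than TF-IDF weights, and the class name and training sentences are made up for illustration.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {t for c in self.word_counts.values() for t in c}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n / n_docs)  # log prior of the class
            for t in tokens:
                if t in self.vocab:  # words never seen in training are ignored
                    score += math.log(
                        (self.word_counts[label][t] + 1) / (total + len(self.vocab))
                    )
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Made-up training data, already tokenized.
train_docs = [
    ["aplikasi", "bagus", "membantu"],
    ["sangat", "membantu"],
    ["aplikasi", "error", "update"],
    ["update", "gagal", "error"],
]
train_labels = ["Positive", "Positive", "Negative", "Negative"]
model = NaiveBayes().fit(train_docs, train_labels)
```

Working with log probabilities avoids numerical underflow when many small word probabilities are multiplied together.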

 

 

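Table 1. Comparison of Accuracy, Precision, Recall, and F1-score for NBC and SVM

| Training (%) | Testing (%) | NBC Accuracy (%) | NBC Precision (%) | NBC Recall (%) | NBC F1-score (%) | SVM Accuracy (%) | SVM Precision (%) | SVM Recall (%) | SVM F1-score (%) |
|---|---|---|---|---|---|---|---|---|---|
| 80 | 20 | 77 | 81 | 73 | 74 | 81 | 81 | 80 | 80 |
| 70 | 30 | 77 | 81 | 73 | 73 | 80 | 80 | 79 | 79 |
| 60 | 40 | 77 | 81 | 72 | 73 | 81 | 80 | 79 | 80 |
| 50 | 50 | 77 | 80 | 72 | 73 | 80 | 79 | 78 | 79 |
| Average | | 77 | 81 | 73 | 73 | 81 | 80 | 79 | 80 |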

 

Table 1 compares the results of a series of four tests with varying percentages of training and testing data. The accuracy of the Naive Bayes Classifier is stable at 77% across all four splits, with no large differences. Figure 1 presents a graph of the accuracy test of the Naive Bayes Classifier.
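The four training/testing ratios in Table 1 correspond to random splits of the pre-processed data. A minimal sketch of such a split is given below; the fixed seed, the rounding rule, and the function name are assumptions, since the study does not report them.

```python
import random

def split_data(records, train_fraction, seed=42):
    """Shuffle a copy of the records and cut it into training and testing parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(10616))  # stand-ins for the 10,616 pre-processed reviews
for fraction in (0.8, 0.7, 0.6, 0.5):
    train, test = split_data(records, fraction)
    print(f"{fraction:.0%} training: {len(train)} train, {len(test)} test")
```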

 

Figure 1. Naive Bayes Classifier results

 

Figure 1 shows that accuracy rises as the proportion of training data increases, although the differences are small: rounded to whole percentages, Table 1 reports 77% for every split. The highest accuracy appears at a ratio of 80% training data to 20% testing data (trial 1), reaching 77%. Accuracy remained steady despite the imbalance in the data and the random selection of the data. The accuracy testing of the Support Vector Machine is shown in Figure 2.

Figure 2. Support Vector Machine results

 

Figure 2 shows a similar pattern: accuracy tends to be higher when the proportion of training data is larger. The highest accuracy was recorded at a ratio of 80% training data to 20% testing data, where it reached 81%. Across the four splits in Table 1 the accuracy stayed between 80% and 81%, despite the imbalance in the data and the random selection of the data. The two models are compared in Figure 3.


Figure 3. Comparison graph of model results

 

Figure 3 presents the average metric values of the two methods: the Naive Bayes Classifier has 77% Accuracy, 81% Precision, 73% Recall, and 73% F1-score, while the Support Vector Machine shows 81% Accuracy, 81% Precision, 80% Recall, and 79% F1-score. The performance of the two methods differs as follows.

a.     The Naive Bayes Classifier model achieved an average accuracy of 77%. This means that of all the predictions made by the Naive Bayes Classifier model, around 77% of them are correct predictions.

b.    The SVM model has an average accuracy of 81%. This indicates that the SVM model performs better than the NBC model, with a higher level of accuracy.
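For reference, the four measures reported in Table 1 can be derived from the cells of a confusion matrix. The sketch below computes them for one class treated as the positive class; the example labels are made up, and whether the study averaged the per-class values is not stated, so this single-class form is an assumption.

```python
def classification_metrics(y_true, y_pred, positive="Positive"):
    """Accuracy, precision, recall and F1-score for one target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up example: 4 positive and 6 negative reviews.
y_true = ["Positive"] * 4 + ["Negative"] * 6
y_pred = ["Positive", "Positive", "Positive", "Negative",
          "Positive", "Negative", "Negative", "Negative", "Negative", "Negative"]
metrics = classification_metrics(y_true, y_pred)
```

In this example 8 of the 10 predictions are correct, so the accuracy is 0.8, which is the same reading as the 77% and 81% figures above.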

The trained model of each method is stored in a model file, naive_model.pkl and svm_model.pkl respectively. To check the models, the researchers wrote 5 sentences and had both models predict their sentiment. The results can be seen in Figure 4.


Figure 4. Prediction results of sentiment analysis from several sentences using NBC and SVM

 

As Figure 4 shows, both methods correctly predicted whether each of the 5 sentences expressed positive or negative sentiment.

The information and knowledge obtained from this research are presented below:

a.    Distribution graph of user review ratings for the PeduliLindungi application

After 20,000 records were taken from Google Play using the Python programming language and the google_play_scraper library, the data was saved in a CSV file. The distribution of user ratings (scores) obtained from this data is shown in Figure 5.

 

 

Figure 5. Distribution graph of user review ratings for the PeduliLindungi application

 

In Figure 5, users rate the PeduliLindungi application with a score from 1 to 5, where a higher score means a better assessment and a lower score means a poorer one. Score 1 was given 5681 times (28.4%), score 2 was given 1218 times (6.1%), score 3 was given 1156 times (5.8%), score 4 was given 1470 times (7.3%), and score 5 was given 10475 times (52.4%).

b.    Graph of rating scores grouped into positive and negative labels

The rating data (scores) is grouped into 2 categories: a score > 3 is labeled Positive and a score <= 3 is labeled Negative. The grouping results are stored in a column named rating_category, from which the graph in Figure 6 is created.
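The labeling rule described above can be written as a one-line function. In the sketch below it is applied to the per-score counts reported for Figure 5; the function and variable names are illustrative.

```python
from collections import Counter

def rating_category(score):
    """Scores above 3 are Positive; scores of 3 or below are Negative."""
    return "Positive" if score > 3 else "Negative"

# Number of reviews per score, as reported for Figure 5.
score_counts = {1: 5681, 2: 1218, 3: 1156, 4: 1470, 5: 10475}

label_counts = Counter()
for score, n in score_counts.items():
    label_counts[rating_category(score)] += n
```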

Figure 6. Graph of Rating Score Division into Positive and Negative Labels

 

In Figure 6, of the 20,000 records, 11,945 (59.7%) ratings fall in the positive category (label) and 8,055 (40.3%) fall in the negative category.

c.     Graph of the number of positive and negative labels after data pre-processing

After pre-processing, the amount of data was reduced from 20,000 to 10,616 records. The reduction was due to the deletion of empty text (no values) and of duplicate records, because identical or duplicate text could affect the accuracy results.
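A minimal sketch of this filtering step is shown below. It assumes each record is a dictionary with a "content" field holding the review text, and it keeps the first occurrence of each duplicated text; both choices are assumptions, since the study does not describe them.

```python
def drop_empty_and_duplicates(rows):
    """Remove rows with empty text and rows whose text was already seen."""
    seen = set()
    kept = []
    for row in rows:
        text = row["content"].strip() if row["content"] else ""
        if not text or text in seen:
            continue  # skip empty or duplicate text
        seen.add(text)
        kept.append(row)
    return kept

# Made-up rows for illustration.
rows = [
    {"content": "bagus", "score": 5},
    {"content": "", "score": 1},
    {"content": "bagus", "score": 4},
    {"content": None, "score": 2},
    {"content": "error", "score": 1},
]
cleaned = drop_empty_and_duplicates(rows)
```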

 


Figure 7. Graph of the Number of Positive and Negative Labels After Data Pre-processing

 

In Figure 7, after data pre-processing, 4132 (38.9%) reviews have a positive category (label) and 6485 (61.1%) have a negative label.

d.    Data on positive sentiment and negative sentiment words that often appear as a result of the Naive Bayes Classifier and Support Vector Machine methods

To display positive and negative sentiment words, the author uses a wordcloud, which is a visualization of the words that are mentioned most often in a particular medium (Mustopa et al., 2020). On social media, for example, a wordcloud collects the words that are trending. Words that appear frequently are drawn largest and most prominently, while less frequent words appear smaller and surround them, as seen in Figure 8 and Figure 9.

 

Figure 8. Wordcloud of Positive Sentiment Words from the Naive Bayes Classifier Method (left) and Support Vector Machine (right)

Figure 9. Wordcloud of Negative Sentiment Words from the Naive Bayes Classifier Method (left) and Support Vector Machine (right)

 

The words labeled with positive sentiment that have the highest frequency of occurrence are shown in Table 2, while the words labeled with negative sentiment are shown in Table 3.

 

Table 2. Comparison of the 3 Positive Sentiment Words with the Highest Frequency of Occurrence

| No | Positive Words | Frequency of Occurrence (NBC) | Frequency of Occurrence (SVM) |
|---|---|---|---|
| 1 | help | 479 | 528 |
| 2 | good | 441 | 517 |
| 3 | application | 433 | 499 |

 

In Table 2, the word "help" ranks first, with the highest frequency of occurrence for positive sentiment: 479 times with the Naive Bayes Classifier method and 528 times with the Support Vector Machine.

 

Table 3. Comparison of the 3 Negative Sentiment Words with the Highest Frequency of Occurrence

| No | Negative Words | Frequency of Occurrence (NBC) | Frequency of Occurrence (SVM) |
|---|---|---|---|
| 1 | application | 2452 | 2394 |
| 2 | updates | 2136 | 2115 |
| 3 | vaccine | 1632 | 1564 |

 

In Table 3, the word "application" ranks first, with the highest frequency of occurrence for negative sentiment: 2452 times with the Naive Bayes Classifier method and 2394 times with the Support Vector Machine.
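Frequencies such as those in Tables 2 and 3 can be obtained by counting tokens over all reviews that a model assigned to one sentiment class. The sketch below shows one way to do this; the function name and the sample reviews are made up for illustration.

```python
from collections import Counter

def top_words(token_lists, n=3):
    """Return the n most frequent tokens and their counts across all reviews."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(n)

# Made-up tokenized reviews that a model predicted as positive.
positive_reviews = [
    ["aplikasi", "membantu"],
    ["membantu", "bagus"],
    ["membantu", "bagus", "sekali"],
]
most_frequent = top_words(positive_reviews, n=2)
```

Running the same count separately on the predictions of each model gives one frequency column per method, as in the tables above.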

 

 

CONCLUSION

Based on the research conducted, the comparison of two data mining-based classification methods, Naive Bayes Classifier (NBC) and Support Vector Machine (SVM), for sentiment analysis of PeduliLindungi application reviews leads to the following conclusions. This study succeeded in building classification models that predict sentiment in two classes, positive and negative. The data consisted of Indonesian-language user reviews from Indonesia on Google Play, retrieved on October 20, 2022, and was evaluated using the Naive Bayes Classifier (NBC) and Support Vector Machine (SVM) methods. The comparison and evaluation of the two methods show that the Support Vector Machine (SVM) has a higher accuracy, 81%, than the Naive Bayes Classifier (NBC), which has an accuracy of 77%. The positive sentiment word with the highest frequency of occurrence is "help", which appeared 479 times with the Naive Bayes Classifier method and 528 times with the Support Vector Machine. The negative sentiment word with the highest frequency of occurrence is "application", which appeared 2452 times with the Naive Bayes Classifier (NBC) method and 2394 times with the Support Vector Machine (SVM).

 

 

REFERENCES

Fitriani, A. F. (2019). Analisis Kemampuan Technological Pedagogical Content Knowledge (TPCK) Calon Guru Biologi Universitas Islam Negeri Raden Intan Lampung. Jurnal Kajian Pendidikan Ekonomi Dan Ilmu Ekonomi, 1-71.

Aripin, A. A. (2018). Potensi Pemanfaatan Teknologi Blockchain Terhadap Ketepatan Waktu, Efisiensi Dan Keamanan Proses Operasi Pada Subsektor Perbankan. Skripsi Universitas Katolik Parahyangan.

Azizah, L. M., Ajipratama, D. B., Putri, N. A. R., & Damarjati, C. (2022). Analisa Sentimen Masyarakat Terhadap Kebijakan Vaksinasi Covid-19 Di Indonesia Pada Twitter Menggunakan Algoritma LSTM La. Jurnal Iptekkom Jurnal Ilmu Pengetahuan & Teknologi Informasi, 24(2), 161-172. https://doi.org/10.17933/iptekkom.24.2.2022.161-172

Deolika, A., Kusrini, K., & Luthfi, E. T. (2019). Analisis Pembobotan Kata Pada Klasifikasi Text Mining. Jurnal Teknologi Informasi, 3(2), 179. https://doi.org/10.36294/jurti.v3i2.1077

Devasia, T., Vinushree, T. P., & Hegde, V. (2016). Prediction Of Students Performance Using Educational Data Mining. International Conference On Data Mining And Advanced Computing (SAPIENCE), 91-95.

Fuadin, D. N. (2017). Deteksi Botnet Menggunakan Naïve Bayes Classifier Dengan SMOTE Dan Metode BFS. Telematika-CIO, Bidang Keahlian Teknik.

Kawani, G. P. (2019). Implementasi Naive Bayes. Journal Of Informatics, Information System, Software Engineering And Applications (INISTA), 1(2), 73-81. https://doi.org/10.20895/inista.v1i2.73

Mustopa, A., Hermanto, Anna, Pratama, E. B., Hendini, A., & Risdiansyah, D. (2020). Analysis Of User Reviews For The PeduliLindungi Application On Google Play Using The Support Vector Machine And Naive Bayes Algorithm Based On Particle Swarm Optimization. 2020 5th International Conference On Informatics And Computing, ICIC 2020, 2. https://doi.org/10.1109/ICIC50835.2020.9288655

Pessôa, L. C., Deamici, K. M., Pontes, L. A. M., Druzian, J. I., & Assis, D. De J. (2021). Technological Prospection Of Microalgae-Based Biorefinery Approach For Effluent Treatment. Algal Research, 60(May). https://doi.org/10.1016/j.algal.2021.102504

Putra, S. (2020). Analisis TOWS (Threats, Opportunity, Weakness, Strenghts) Terhadap Strategi Pemasaran Pada CV .

Qadrini, L., Sepperwali, A., & Aina, A. (2021). Decision Tree Dan Adaboost Pada Klasifikasi Penerima Program Bantuan Sosial. Jurnal Inovasi Penelitian, 2(7), 1959-1966.

Rozi, F., Sukmana, F., & Adani, M. N. (2021). Pengelompokkan Judul Buku Dengan Menggunakan Algoritma K-Nearest Neighbor (K-NN) Dan Term Frequency - Inverse Document Frequency (TF-IDF). JIMP: Jurnal Informatika Merdeka Pasuruan, 6(3), 1-5.

Saputra, I., Djatna, T., Siregar, R. R. A., Kristiyanti, D. A., Yani, H. R., & Riyadi, A. A. (2022). Text Mining Of PeduliLindungi Application Reviews On Google Play Store. Faktor Exacta, 15(2), 101-108. https://doi.org/10.30998/faktorexacta.v15i2.10629

Sukma, H. (2021). Clustering Data Siswa SMPN-6 Palangka Raya Untuk Menentukan Kelayakan Bantuan. 1-59.

Widaningsih, S., & Suheri, A. (2018). Klasifikasi Jurnal Ilmu Komputer Berdasarkan Pembagian Web Of Science Dengan Menggunakan Text Mining. Seminar Nasional Teknologi Informasi Dan Komunikasi (SENTIKA), 2018(March), 23-24.

Wilie, D. P. (2023). Analisis Sentimen Opini Publik Terhadap ChatGPT Di Twitter Menggunakan Metode Naive Bayes. Jurnal Nasional Ilmu Komputer, 4(4), 2746-1343.
