Covid-19 Hoax Detection Using KNN in Jaccard Space

other models in this research. The accuracy of the KNN model in Jaccard space with Nazief & Adriani stemming and K=5 is 75.89%, while that of Naïve Bayes is 65.18%.


Background
A hoax is information that has been added to or altered from the content of actual news [1]. This element of manipulation or modification is often used to provoke a response so that the news goes viral. The hoax news phenomenon in Indonesia is seen as causing various problems, because hoax news spreads rapidly on social media, where users communicate with each other without checking the news first [2]. News is said to be true once the reader has verified its content. Jabar Saber Hoaks [3] lists the levels of misinformation and disinformation in hoax news found on social media, from lowest to highest. Hoax news has been found on news portals [4], on social media [5], and in clickbait news [6]. Researchers have used machine learning models such as Naïve Bayes [4], SVM [5], and KNN [6] to detect hoax news. KNN uses the distance d(a,b) between the training data and the testing data, together with the parameter K (the number of neighbors) [6] [7]. KNN with Euclidean distance has not shown its best performance. One of the most widely used alternatives is Jaccard, whose value is obtained by comparing two documents and calculating their similarity. The Jaccard coefficient counts the words that the two documents share and divides by the total number of distinct words in both documents [8].
The Nazief & Adriani algorithm [9] affects the results of the TF-IDF (Term Frequency-Inverse Document Frequency) process because each word is converted into its root word. Applied in a classification process using Naïve Bayes, it can optimize word extraction from the data and affect the N-Gram results [10]. This stemming algorithm has rules for reducing affixed words to their standard (root) form.
Based on the explanation above, this research aims to develop a model that can predict the level of hoax news using a modified KNN in Jaccard space, with Nazief & Adriani stemming as text preprocessing. The modified model is applied to find the most similar hoax data and is expected to increase the performance of the KNN model.

Previous Works
A Naïve Bayes model with modified TF-IDF was used by Mustofa [11] to classify Indonesian hoax news. The data consisted of 360 documents. In testing with cross validation (CV) run 10 times, an accuracy of 85% was achieved with CV equal to 6, with an error rate of 15%. Hoax detection in news has also applied Naïve Bayes and cosine similarity [4]. The best detection performance with the Naïve Bayes model was 91% precision, 100% recall, and 95% F-measure.
Clickbait news classification [6] used three scenarios of KNN with Euclidean distance and the parameter K ranging from 1 to 15, on 1000 news items published from January 2020. The scenarios combined different proportions of training and testing data (80:20, 50:50, and 20:80). The accuracy of KNN with K=11 was 71% with an 80:20 split of training and testing data. Fake news research using data from 2282 Facebook posts has not achieved the best performance [7]. Of these posts, 1669 are labeled "Mostly True", 264 "No Factual", 245 "Mixture of True and False", and 104 "Mostly False". The KNN model obtained an accuracy of 79%. Hoax news related to Covid-19 in UNNES group chats [12] was studied with 500 hoax data, 4519 messages, and 1435 media items contained in the messages. The classifier in this scenario was KNN with K=1, which obtained an average accuracy of 54%, with a minimum accuracy of 14% and a maximum accuracy of 91%.
Research on hoax news detection [13] used the SVM model. The accuracy obtained using TF-IDF was 85%, with 90% for Non-Hoax labels and 80% for Hoax labels. The potential for spreading hoaxes on Twitter [5] was studied with TF-IDF and an SVM model. Two test scenarios were carried out: the first used different splits of training and testing data, namely 90:10, 80:20, and 50:50, while the second compared the existing features to determine their effect. The best accuracy, 78%, was obtained in the first scenario with a 90:10 split of training and testing data.
Research on detecting fake news about the Covid-19 disease used the C4.5 Decision Tree model [14] with a combination of N-Gram and TF-IDF. The data were Twitter hoaxes about the coronavirus, politics, and the environment, with 49.4% labeled hoax and 50.6% labeled non-hoax. The best accuracy, 72.04%, was obtained by Unigram with a 90:10 split of training and testing data. The Jaccard Index has been implemented to determine the similarity index in the classification of ham and spam e-mails [15]. The method used is the Document Similarity Index (DSI), which is calculated from the Jaccard Index and achieves 98% precision for Ham labels and 98% for Spam labels. Sangwan [16] improved truth detection in social media using Scalable and Robust Truth Discovery (SRTD) computed with Jaccard similarity. The key points of this research are that the attitude, uncertainty, and independent scores can be determined using WORDNET programmed in Java. This scenario is used to enhance the exploration of truth through similar terms. In the URL scenario, the best precision, recall, and F-measure are 91%, 100%, and 95%.

Limitation: the research does not compare the performance of NB across several scenarios.
Naïve Bayes [11]. Result: the best accuracy obtained for classifying hoax news using NB is 85% with CV 6. Limitation: the research does not compare the performance of NB across several scenarios.
SVM [13]. Result: the SVM model achieves an average testing accuracy of 85%. Limitation: the calculation used in the testing stage is not explained in detail.
KNN [6]. Result: classification of clickbait news with KNN obtained an accuracy of 71% at K = 11 with an 80:20 data split. Limitation: the research focuses on only one model (KNN) with Euclidean distance, so the accuracy is not optimal.
Hoax in Social Media
KNN [7]. Result: classification of fake news using data from Facebook with the KNN model obtained an accuracy of 79%. Limitation: there is no modification to the KNN model.
SVM [5]. Result: in the first scenario, the research obtained an accuracy of 78% (90:10 training and testing data). Limitation: the amount of data on each label is unbalanced, namely 67% for the Hoax label and 33% for the Non-Hoax label.
KNN [12]. Result: the KNN model with K=1 obtained an average accuracy of 54%, with a minimum accuracy of 14% and a maximum accuracy of 91%, on data related to Covid-19. Limitation: there is no modification to the KNN model and no model comparison.
Jaccard Similarity [16]. Result: the scenario using WORDNET improves the score, which helps to determine truth better. Limitation: the research does not compare the similarity with Euclidean, Manhattan, or other distances.
Mail Spam
Jaccard Similarity [15]. Result: the research achieves 98% precision for the ham and spam labels. Limitation: the research does not compare other similarity methods.
According to Table 1, this research uses scenarios with and without Nazief & Adriani stemming to classify hoaxes using modified KNN and Naïve Bayes. The main focus of this research is to compare the results of KNN classification with those of the modified KNN in Jaccard space with Nazief & Adriani stemming in the classification of hoax news related to Covid-19.
In summary, the contribution of this work is to process and classify hoax news related to Covid-19 from Jabar Saber Hoaks and Jala Hoaks using modified KNN in Jaccard space with Nazief & Adriani stemming. This research is organized as follows. Section 1 discusses the background, hoax concepts, and recent research about hoax detection. Section 2 explains

METHODS
This section presents the dataset, data preprocessing, modeling KNN with Jaccard Space, and the scenario of evaluation.

Data
Data were collected with a web crawler that extracts and processes the text information on a web page. The information is then indexed by keyword and stored in a csv file. The web pages crawled are Jabar Saber Hoaks and Jala Hoaks (a sample is shown in Table 2).
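The extraction step can be illustrated with a minimal sketch that uses only the Python standard library. The HTML fragment, the tag layout (`article`, `h2`, `p`), the column names, and the helper `html_to_csv` are all hypothetical, since the text does not describe the markup of the crawled pages or the crawler that was used.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment; the real markup of the crawled sites is not
# given in the text, so this layout is an assumption for illustration only.
SAMPLE_HTML = """
<article><h2>Vaksin covid-19 ditanami barcode</h2><p>Klaim ini tidak benar.</p></article>
<article><h2>Video virus corona</h2><p>Video ini sudah lama beredar.</p></article>
"""


class ArticleParser(HTMLParser):
    """Collects the <h2> title and <p> text of each <article> element."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one dict per article
        self._field = None  # name of the field currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.rows.append({"title": "", "content": ""})
        elif tag == "h2" and self.rows:
            self._field = "title"
        elif tag == "p" and self.rows:
            self._field = "content"

    def handle_endtag(self, tag):
        if tag in ("h2", "p"):
            self._field = None

    def handle_data(self, data):
        # Text outside <h2> and <p> is ignored.
        if self._field and self.rows:
            self.rows[-1][self._field] += data.strip()


def html_to_csv(html):
    """Parse the page text and return the extracted rows and their CSV form."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "content"])
    writer.writeheader()
    writer.writerows(parser.rows)
    return parser.rows, buffer.getvalue()


rows, csv_text = html_to_csv(SAMPLE_HTML)
print(csv_text)
```

A real crawler would also need to fetch the pages and follow links, which this sketch leaves out.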

Text Processing and Stemming Nazief & Adriani
Text preprocessing changes unstructured data into structured data [17].
The steps of text preprocessing are case folding, tokenization, filtering, and stemming. Stemming applies the Nazief & Adriani algorithm [18] with the following rules:
1. Check the word against the root-word dictionary; if it is found, the process stops, otherwise the following steps continue.
2. Remove the inflectional suffixes {"-kah", "-lah", "-tah", "-pun"} and the suffixes {"-ku", "-mu", [...] or root-word.
4. Remove the derivational prefixes {"be-", "di-", "ke-", "me-", "pe-", "se-", "te-"}.
5. Re-check the word after removing the prefix, changing it according to the rules, and look it up in the root-word dictionary again; if it is still not found, the following step is carried out.
6. If all steps have been completed and no result is found, the word is considered a root word and the initial word is returned.
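The preprocessing steps can be sketched as follows. This is a simplified illustration and not the full Nazief & Adriani algorithm: it only covers the dictionary lookup, the suffixes and prefixes listed above, and the fallback to the original word. The root-word dictionary, the stop-word list, and the function names are hypothetical toy values chosen for the example.

```python
import re

# Hypothetical toy resources; a real run would use a full Indonesian
# root-word dictionary and stop-word list.
ROOT_WORDS = {"vaksin", "tanam", "baca", "buku", "masuk", "tubuh"}
STOP_WORDS = {"yang", "akan", "dalam", "di", "dan"}

PARTICLES = ("kah", "lah", "tah", "pun")
POSSESSIVES = ("ku", "mu")
PREFIXES = ("be", "di", "ke", "me", "pe", "se", "te")


def strip_suffix(word, suffixes):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def strip_prefix(word, prefixes):
    """Remove the first matching prefix, keeping at least three characters."""
    for prefix in prefixes:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            return word[len(prefix):]
    return word


def stem(word, roots=ROOT_WORDS):
    """Simplified stemming: dictionary check, then suffix and prefix removal."""
    if word in roots:
        return word
    candidate = strip_suffix(word, PARTICLES)
    if candidate in roots:
        return candidate
    candidate = strip_suffix(candidate, POSSESSIVES)
    if candidate in roots:
        return candidate
    candidate = strip_prefix(candidate, PREFIXES)
    if candidate in roots:
        return candidate
    # No root found: return the initial word unchanged.
    return word


def preprocess(text, roots=ROOT_WORDS, stop_words=STOP_WORDS):
    """Case folding, tokenization, filtering, and stemming."""
    folded = text.lower()
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", folded)
    kept = [token for token in tokens if token not in stop_words]
    return [stem(token, roots) for token in kept]


print(preprocess("Bukumu yang DIBACA"))
```

With this toy dictionary, a word such as "ditanami" is returned unchanged, because the sketch does not handle every affix of the full algorithm.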

KNN in Jaccard Space
KNN is a model that classifies objects by voting over a given collection [19]. The classifier first determines the distance in Equation (1), sorts the training data by distance to find the K nearest, and takes the majority vote among those K neighbors [7].
(1)
The distance in Equation (1) is defined between two data points. Jaccard is the most commonly used measure of the similarity between two sets. Let A and B be two sets. The Jaccard index is the size of their intersection compared to all items in both sets [8].
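A minimal sketch of KNN with a Jaccard distance is given below. It assumes that each document is represented as a set of tokens and that the distance is one minus the Jaccard index, that is, 1 - |A ∩ B| / |A ∪ B|. The text does not state the exact form used in the experiments, and the training documents, labels, and function names are hypothetical.

```python
from collections import Counter


def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two token collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty documents are treated as identical
    return 1.0 - len(a & b) / len(union)


def knn_predict(train_docs, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training docs."""
    # Sort by distance; the index breaks ties between equal distances.
    distances = sorted(
        (jaccard_distance(doc, query), index)
        for index, doc in enumerate(train_docs)
    )
    nearest = [train_labels[index] for _, index in distances[:k]]
    # On a tied vote, the label that appears first among the nearest wins.
    return Counter(nearest).most_common(1)[0][0]


# Hypothetical toy data, not taken from the dataset of this research.
train_docs = [
    ["vaksin", "covid", "barcode"],
    ["vaksin", "covid", "chip"],
    ["video", "virus", "corona"],
    ["video", "corona", "klaim"],
    ["fakta", "virus", "masker"],
]
train_labels = ["Class_2", "Class_2", "Class_1", "Class_1", "Class_3"]
query = ["vaksin", "covid", "tubuh"]

print(knn_predict(train_docs, train_labels, query, k=3))
```

Because the distance depends only on which words two documents share, it is unaffected by document length in the way that Euclidean distance on raw counts is.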

Evaluation Model
Similar to other research on hoax detection, this research focuses on accuracy as the evaluation metric. Accuracy evaluates the number of predicted labels that correspond to the actual labels [20]. Accuracy is obtained from the confusion matrix in Figure 1.

Figure 1 Confusion Matrix
True Positive (TP) is positive data that is correctly predicted as positive, True Negative (TN) is negative data that is correctly predicted as negative, False Positive (FP), or Type I Error, is negative data that is predicted as positive, and False Negative (FN), or Type II Error, is positive data that is predicted as negative. The equation for accuracy is shown in Equation (3).
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$
Accuracy is used as a reference for algorithm performance when the dataset has similar numbers of FN and FP.
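The accuracy computation can be sketched as follows, for the binary case and for a multi-class confusion matrix such as the three-class setting used later. The function names and all counts are hypothetical and serve only to illustrate the arithmetic.

```python
def binary_accuracy(tp, tn, fp, fn):
    """Share of correct predictions: (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def multiclass_accuracy(confusion):
    """Accuracy from a square matrix with rows = actual, columns = predicted."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total if total else 0.0


# Hypothetical counts, not results from this research.
print(binary_accuracy(tp=40, tn=45, fp=5, fn=10))

confusion = [
    [30, 5, 5],
    [4, 40, 6],
    [3, 2, 15],
]
print(round(multiclass_accuracy(confusion), 4))
```

In the multi-class case, the correct predictions are the diagonal entries of the matrix, which generalizes TP + TN from the binary case.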

Exploration
Text preprocessing eliminates words to reduce noise in the dataset. The results of TF-IDF word weighting can be seen in the word cloud in Figure 2 and the word scatter plot in Figure 3. The word cloud in Figure 2 shows that the words most often used in Covid-19-related hoaxes are "Virus", "Covid", "Vaksin", "Fakta", "Video", "Klaim", "Corona", and others. These words were selected by TF-IDF. The word scatter plot in Figure 3 describes the layout of words that are often used together in several sentences. For example, the orange cluster contains several words, namely "vaksinasi", "suntik", and "uji". It can be seen that hoaxes in this cluster are dominated by topics related to Covid-19 vaccines or effects after vaccine injection.
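The word weighting can be illustrated with a minimal TF-IDF sketch. The text does not state which TF-IDF variant was used, so the formula here is an assumption: term frequency normalized by document length, multiplied by log(N / df). The function name `tf_idf` and the documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        length = len(tokens)
        weights.append({
            term: (count / length) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy documents, not taken from the dataset of this research.
documents = [
    ["vaksin", "covid", "vaksin"],
    ["covid", "video"],
    ["covid", "klaim"],
]
for weight in tf_idf(documents):
    print(weight)
```

Under this variant, a word that occurs in every document, such as "covid" in the toy data, receives a weight of zero, while words concentrated in a few documents are weighted more heavily.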

Classification
The classification was carried out using two approaches, namely without stemming and with Nazief & Adriani stemming. The classification uses three labels, namely Class_1, Class_2, and Class_3.

Without Stemming Nazief & Adriani
Classification without Nazief & Adriani stemming is based on TF-IDF and words (Jaccard similarity) that have not been normalized by Nazief & Adriani stemming. The performance of classification without Nazief & Adriani stemming is shown in Table 3. Table 3 shows that the best accuracy, 69.64%, is obtained by KNN in Jaccard space, with a precision of 73.61%. The best recall and F1-score, 60.28% and 62.80%, are also obtained by KNN in Jaccard space. Classification of Covid-19-related hoaxes without Nazief & Adriani stemming therefore performs best with the KNN model in Jaccard space. The difference in accuracy between Naïve Bayes and the KNN model in Jaccard space is small, below 5%.

With Stemming Nazief & Adriani
The model and distance are the same as in the previous scenario. This scenario examines the effect of Nazief & Adriani stemming. The performance of classification with Nazief & Adriani stemming is shown in Table 4. Based on Table 4, the best classification performance is obtained by KNN with Jaccard distance at K=5, with 75.89% accuracy, 79.55% precision, 67.50% recall, and 71.13% F1-score. From the first scenario (without stemming) to the second scenario (with stemming), Nazief & Adriani stemming improves the accuracy of KNN in Jaccard space at K=5 by 6.25 percentage points.

Discussion
The accuracy of KNN in Jaccard space improved in the second scenario. This is because the words are broken down into combinations of characters, which also helps to handle typos in the sentence. The best accuracy for the first scenario is 69.64%, while for the second scenario it is 75.89%, a difference of 6.25 percentage points from the first scenario for KNN in Jaccard space with K = 5. The confusion matrix of KNN in Jaccard space with K = 5 for the second scenario is shown in Figure 4. For example, take the sentence "vaksin covid-19 ditanami barcode yang akan masuk dalam tubuh manusia". The result of the test is shown in Table 5. In Table 5, the hoax category predicted by the KNN in Jaccard space model with K = 5 is 2, while Naïve Bayes predicts 1. In this prediction, the KNN in Jaccard space model with K = 5 correctly predicts the class category. This shows that the hoax "vaksin covid-19 ditanami barcode yang akan masuk dalam tubuh manusia" belongs to the hoax category of Misleading Content, False Context, or Imposter Content.
The improvement of this research over previous research lies in the distance and stemming used, while the other parameters are almost the same as in previous research.