Transfer Learning of Pre-trained Transformers for Covid-19 Hoax Detection in Indonesian Language

in this study, a hoax detection system


INTRODUCTION
Hoaxes now spread easily through the internet, one of the dark sides of information technology development. Information technology helps people connect with others and exchange information without constraint. The information available for consumption is also increasing in quantity and variety. However, the information exchanged is often false or inaccurate, and may be deliberately distributed for a specific purpose; such information is called a hoax.
Various dictionaries define a hoax in broadly similar ways [13] [14] [15]. In principle, a hoax is false information, regardless of the purpose for which it is disseminated. In Indonesia, hoaxes are reported to originate from 800 thousand websites [16].
Hoaxes related to Covid-19 have been circulating in the community through social media and various news sites. More than 2,300 reports of hoaxes and conspiracy theories about Covid-19 have been recorded. Misinformation about Covid-19 has affected the lives of at least 800 people worldwide [17]. Considering the harmful effects of Covid-19 hoaxes, we address the hoax detection problem by classifying articles as either hoax or fact.
Several studies have addressed hoax detection using classic classification models such as K Nearest Neighbor (KNN) [10], the naive Bayes classifier [11] [18], and random forest [1]. These models require manual feature engineering, in which representative features are defined beforehand, and choosing good features requires deeper analysis. To overcome this problem, deep learning models were introduced to solve classification tasks without feature engineering. Features are extracted through word embeddings that represent the meaning of each token in the input text. A recent study has shown that deep learning models are superior to classic classifiers in a hoax detection system [12].
Due to the complexity of their architectures, deep learning models require a lot of training data. However, constructing a large dataset, especially for supervised tasks, is expensive. Building deep learning models on such a dataset also requires considerable resources, such as high-performance computing hardware. To avoid the need to train a new model from scratch, public pre-trained language models were proposed.
Recent studies have shown that models pre-trained on a large corpus can successfully solve various downstream natural language processing tasks through transfer learning. One pre-trained language representation model that outperforms many task-specific architectures is BERT. BERT, Bidirectional Encoder Representations from Transformers, is designed to pre-train deep representations of text conditioned on both left and right context [2]. BERT can be fine-tuned for a classification task by adding a single classification layer, which avoids training a new model from scratch. Since our dataset is limited, we aim to take advantage of larger pre-trained language representations by transferring the pre-trained models to our article classification task and developing an accurate hoax detection system.
The promising results of the BERT architecture in obtaining deeper contextual representations of input text have encouraged the development of BERT models trained on different corpora. Devlin et al. also offered a multilingual BERT (mBERT) [2], a BERT model pre-trained on Wikipedia documents in 104 languages, which achieves impressive performance in zero-shot cross-lingual transfer [3]. Several studies have used mBERT in Indonesian language processing tasks, such as aspect-based sentiment analysis on Indonesian review datasets [7]. Even the original pre-trained BERT has been applied to other Indonesian tasks, such as hate speech classification [8] and summarization [9].
Recently, two different versions of BERT trained on Indonesian corpora, both called IndoBERT, were released in the same year. The first version was proposed by the IndoNLU team and is trained on a large, clean Indonesian dataset (Indo4B) collected from publicly available sources such as social media texts, blogs, news, and websites [4]. The other version, proposed by the IndoLEM team, is trained on Indonesian Wikipedia, news articles (Kompas, Tempo, and Liputan6), and an Indonesian Web Corpus [5]. These monolingual BERT models trained on Indonesian corpora encourage further research on transferring pre-trained BERT models to various Indonesian language processing tasks.
In this study, we applied the original pre-trained BERT, the multilingual BERT (mBERT), and the monolingual IndoBERT to develop a hoax detection system and present the experimental results. The performance of each model was compared and analyzed in order to select the hoax detection system with the best accuracy.

METHODS
We propose transfer learning by fine-tuning pre-trained Transformer models for the hoax detection task. A flow chart of our proposed system is illustrated in Figure 1.

1 Dataset
We used a dataset of Covid-19 articles in the Indonesian language collected by [1], consisting of hoax articles from Turnbackhoax.id and fact articles from Detik.com. The keywords used for selecting the Covid-19 articles are "covid", "corona", and "pandemik". Page interfaces of hoax and fact article examples are shown in Figure 2 and Table 1.
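The keyword-based selection can be sketched as follows. This is only an illustration: the three keywords come from the text, but the case-insensitive substring matching, the record fields (`title`, `body`, `label`), and the sample records are assumptions and are not taken from [1].

```python
# Keywords quoted in the text; everything else in this sketch is illustrative.
KEYWORDS = ("covid", "corona", "pandemik")


def is_covid_article(title, body):
    """Return True if any keyword occurs in the title or body.

    Assumption: matching is case-insensitive and substring-based.
    """
    text = (title + " " + body).lower()
    return any(keyword in text for keyword in KEYWORDS)


def select_covid_articles(articles):
    """Keep only the articles that mention at least one keyword."""
    return [a for a in articles if is_covid_article(a["title"], a["body"])]


# Hypothetical records; the field names are assumptions.
sample = [
    {"title": "Vaksin Corona tersedia", "body": "Pemerintah mengumumkan jadwal.", "label": "fact"},
    {"title": "Harga beras naik", "body": "Pasar tradisional ramai.", "label": "fact"},
    {"title": "Pesan berantai", "body": "Obat COVID-19 dari bawang putih.", "label": "hoax"},
]

selected = select_covid_articles(sample)
```

Here `selected` keeps the first and third records and drops the second, which mentions none of the keywords.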

2 Pre-processing
We used a pre-trained tokenizer to transform texts into sub-word tokens, which avoids out-of-vocabulary problems. Following [2], a [CLS] token is added at the beginning of the article tokens and a [SEP] token at the end. We then separated the title and body of each article into two segments by inserting a [SEP] token between them. Finally, all tokens were transformed into token ids. An example of our tokenization process in the pre-processing phase is illustrated in Figure 4.

Figure 4 An Example of Tokenization Process
In our study, we fed the original texts to the tokenizer as inputs. However, when uncased models were used, we applied case folding by converting all letters to lowercase.
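The pre-processing steps can be illustrated with a minimal, self-contained sketch. It is not the pre-trained tokenizer used in the experiments: the tiny vocabulary, the whitespace splitting, and the function names `wordpiece` and `encode` are all assumptions made for illustration. It shows greedy longest-match sub-word splitting, the `[CLS] title [SEP] body [SEP]` layout, segment ids, token ids, and optional case folding.

```python
CLS, SEP, UNK = "[CLS]", "[SEP]", "[UNK]"

# Toy vocabulary (hypothetical); a real pre-trained tokenizer ships its own,
# much larger vocabulary. "##" marks a piece that continues a word.
VOCAB = {tok: i for i, tok in enumerate(
    ["[PAD]", UNK, CLS, SEP, "vaksin", "covid", "hoaks", "##19", "##nya", "aman"]
)}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:          # no piece matches: the whole word is unknown
            return [UNK]
        pieces.append(piece)
        start = end
    return pieces


def encode(title, body, vocab, uncased=True):
    """Build tokens, token ids, and segment ids for a (title, body) pair."""
    if uncased:                    # case folding, applied only for uncased models
        title, body = title.lower(), body.lower()
    # Simplification: split on whitespace only (no punctuation handling).
    title_tokens = [p for w in title.split() for p in wordpiece(w, vocab)]
    body_tokens = [p for w in body.split() for p in wordpiece(w, vocab)]
    tokens = [CLS] + title_tokens + [SEP] + body_tokens + [SEP]
    # Segment 0 covers [CLS], the title, and the first [SEP]; segment 1 the rest.
    segment_ids = [0] * (len(title_tokens) + 2) + [1] * (len(body_tokens) + 1)
    token_ids = [vocab[t] for t in tokens]
    return tokens, token_ids, segment_ids


tokens, token_ids, segment_ids = encode("Vaksin Covid19", "Vaksinnya aman", VOCAB)
```

With this toy vocabulary, "Covid19" is split into `covid` and `##19` rather than being mapped to an unknown token, which is the out-of-vocabulary behaviour described above.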

3 Transfer Learning of Pre-trained Models
We fine-tuned pre-trained BERT Transformer models to obtain contextual representations of our input texts. We adapted the fine-tuning architecture of BERT in [2] to solve the classification task for our hoax detection system. Our proposed fine-tuned architecture is depicted in Figure 5. We assigned token embeddings representing the meaning of each token, segment embeddings to distinguish the title from the body of the article, and position embeddings encoding the position of each token in the input sequence. The sum of these embeddings was fed to the Transformer layers of BERT. We used the top-layer contextual representation of the [CLS] token as the representation of the whole token sequence. Then, we added a classification layer to detect whether an article is a hoax or not. We used several BERT models trained on different corpora: original BERT, multilingual BERT, and IndoBERT.
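The input and output ends of this architecture can be sketched in plain Python. This is a toy illustration under stated assumptions, not the implementation used in the experiments: the hidden size, table sizes, random weights, and function names are hypothetical, and the Transformer encoder itself is skipped, so the [CLS] input embedding stands in for the top-layer [CLS] representation.

```python
import math
import random

random.seed(0)
HIDDEN = 4  # toy hidden size, chosen for readability only


def random_table(rows, cols):
    """Small random matrix standing in for learned parameters."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


token_emb = random_table(10, HIDDEN)     # one row per vocabulary entry
segment_emb = random_table(2, HIDDEN)    # 0 = title segment, 1 = body segment
position_emb = random_table(16, HIDDEN)  # one row per position in the sequence


def input_embeddings(token_ids, segment_ids):
    """Element-wise sum of token, segment, and position embeddings."""
    return [
        [t + s + p for t, s, p in zip(token_emb[tid], segment_emb[sid], position_emb[pos])]
        for pos, (tid, sid) in enumerate(zip(token_ids, segment_ids))
    ]


def classify(cls_vector, weights, bias):
    """Linear layer plus softmax over two classes (index 0 = fact, 1 = hoax)."""
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, bias)]
    peak = max(logits)                       # subtract the max for stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# [CLS] title-token [SEP] body-token [SEP], with hypothetical ids.
embedded = input_embeddings([2, 4, 3, 9, 3], [0, 0, 0, 1, 1])

# Placeholder: a real model would run `embedded` through the Transformer
# layers and take the top-layer vector at position 0 ([CLS]).
cls_vector = embedded[0]

head_weights = random_table(2, HIDDEN)
head_bias = [0.0, 0.0]
probabilities = classify(cls_vector, head_weights, head_bias)
```

During fine-tuning, both the classification head and the pre-trained encoder weights would be updated; here the weights are random, so `probabilities` only demonstrates the shape of the output.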

3.1 BERT
The original BERT model is trained on the BooksCorpus (800M words) and English Wikipedia (2,500M words) [2]. We used BERT-base-cased and BERT-base-uncased in our experiments.

3.2 Multilingual BERT
Multilingual BERT (mBERT) is trained on Wikipedia documents in 104 languages, including Indonesian, and has been efficiently fine-tuned for document classification in several languages [2] [3]. mBERT-base-cased and mBERT-base-uncased were used in our study.

3.3 IndoBERT
There are two kinds of IndoBERT trained on different corpora, proposed by the IndoNLU [4] and IndoLEM [5] teams. The IndoNLU IndoBERT is trained on around four billion words of pre-processed Indonesian text (≈ 23 GB) from publicly available sources such as social media texts, blogs, news, and websites [4]. The IndoLEM IndoBERT is trained on Indonesian Wikipedia (74M words), Indonesian news articles (55M words), and an Indonesian Web Corpus (90M words) [5]. Both are available only as uncased models.

RESULTS AND DISCUSSION
We implemented our models in PyTorch using the Transformers library built by the Hugging Face team [6]. For optimization in the fine-tuning phase, we used Adam as the optimizer with a batch size of 32 and a learning rate of 3×10⁻⁶. As evaluation metrics, we report accuracy, precision, recall, and F1 scores.
We fine-tuned the models for seven epochs. Based on our experimental results, the models tended to overfit after the seventh epoch. The improvement in accuracy during the fine-tuning phase of our proposed models is shown in Figure 6. Since no validation set is available in this dataset, we evaluated our models after each epoch on the training and test sets.
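For reference, the four evaluation metrics can be computed from predicted and true labels as in the sketch below. Treating "hoax" as the positive class is an assumption; the text does not state which class is positive or whether the scores are averaged over both classes. The function name and the example labels are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive="hoax"):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels for five articles.
y_true = ["hoax", "hoax", "fact", "fact", "hoax"]
y_pred = ["hoax", "fact", "fact", "hoax", "hoax"]
scores = binary_metrics(y_true, y_pred)
```

In this example three of five predictions are correct, and there are two true positives, one false positive, and one false negative, so accuracy is 0.6 and precision, recall, and F1 are each 2/3.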

Figure 6 Accuracies of Our Proposed Models for Each Epoch in Train and Test Sets
As reported in Figure 6, the accuracy of all models improved after each epoch. This indicates that transfer learning from the pre-trained models increases model performance. We successfully took advantage of models pre-trained on a large corpus by fine-tuning them on our task, even with a limited dataset. The test accuracy of the fine-tuned original BERT models rose sharply over the first three epochs and then increased only slightly in the following epochs. In contrast, the accuracy of the fine-tuned mBERT and IndoBERT models increased only slightly after the first epoch, but it already started above 90%.
We compared our proposed transfer learning models with Random Forest with and without feature engineering, the best classic classification models for this task reported in [1]. Our proposed models did not need feature engineering, since text representations were obtained automatically through their token embeddings. Our experimental results are shown in Table 2. As seen in Table 2, our fine-tuned models achieved better accuracy, precision, and F1 scores than those reported in the previous work. The fine-tuned original BERT gave similar performance in its uncased and cased versions, with accuracy and F1 scores of 96.38. Unlike the original BERT, the cased fine-tuned mBERT model performed better than the uncased one. Since mBERT was trained on a larger corpus in various languages, differentiating uppercase from lowercase letters in the article texts improved our hoax detection performance. The cased models handled capitalized words efficiently by using different embeddings for them.
Both IndoBERT models, IndoNLU and IndoLEM, are available only in an uncased version. After fine-tuning, both achieved similar performance, with an accuracy of 97.67, and outperformed the fine-tuned uncased BERT and mBERT models. As a monolingual pre-trained model, IndoBERT scored better than the uncased multilingual mBERT. However, the cased fine-tuned mBERT model outperformed all other fine-tuned models, since mBERT-cased was trained on a larger corpus than the others.

CONCLUSIONS
We proposed transfer learning of pre-trained models to solve the Covid-19 hoax detection task. We fine-tuned the original pre-trained BERT, the multilingual pre-trained mBERT, and the monolingual pre-trained IndoBERT on our classification task and reported the results. Our fine-tuned IndoBERT models, trained on monolingual Indonesian corpora, outperformed the fine-tuned uncased versions of the original and multilingual BERT. However, the fine-tuned cased mBERT model, trained on a larger corpus, achieved the best performance.