Hate Speech Detection in Indonesian Twitter using Contextual Embedding Approach

Hate speech develops along with the rapid development of social media. Hate speech is often issued due to a lack of public awareness of the difference between criticism and statements that might contribute to this crime. Therefore, it is essential to detect such sentences early, before public ignorance leads to a criminal act. In this paper, we use deep neural networks to predict whether a sentence contains hate speech or an abusive tone. We examine the robustness of different word and contextual embeddings for representing the semantics of hate speech words. In addition, we use a document embedding built by a recurrent neural network with gated recurrent units as the main architecture to provide a richer representation. Compared to the syntactic representations of previous approaches, the contextual embeddings in our model give a significant boost in performance.


INTRODUCTION
Hate speech is an expression, writing, action, or performance intended to provoke violence or discrimination against someone based on the characteristics of the community they represent, such as race, ethnicity, gender, sexual orientation, religion, and other characteristics [1]. Hate speech develops along with the rapid development of social media. It is a problem that affects the dynamics and interactions of the online social community. In the last two years, a criminal act of hate speech has been committed. Hate speech is often issued due to a lack of public awareness of the difference between criticism and statements that might contribute to this crime. Therefore, it is essential to detect such sentences early, before public ignorance leads to a criminal act.
Furthermore, Indonesia governs hate speech in the Electronic Information and Transactions (UU ITE) Law No. 11 of 2008, amended by Law No. 19/2016. The law includes prohibitions and criminal threats for offenders who make hate speech or fake news. Article 28 paragraph (1), under Article 45 of this Law, includes criminal threats against anyone who spreads false and misleading information that causes customer losses in electronic transactions knowingly and without authority [2]. One way to deal with hate speech found on online platforms is by reporting the content to the authorities and removing the content. Other actions in overcoming hate speech are by conducting surveillance, advocacy, and counter-speech [3]. However, these approaches are time-consuming and require human labour.
In addition to the previously mentioned countermeasures, some researchers have attempted to counteract hate speech through machine learning. Machine learning has proved to be a good tool for understanding human language. The discipline that specifically deals with human language is called Natural Language Processing (NLP). Most current NLP approaches use supervised learning algorithms. Supervised learning requires human intervention to label whether sentences are deemed hate speech or not. Differences in opinion among humans about whether a piece of writing is hate speech are part of the difficulty in determining hate speech. This introduces a risk of misclassification in machine learning algorithms that are trained on human labelling. For example, in a bag-of-words approach, we can have a dictionary of words that are classified as hate speech, such as "black", "gay", and "transgender".
Currently, most of the language resources for NLP are developed for English. This poses an additional challenge for detecting hate speech in other languages. This research aims to predict hate speech in Bahasa Indonesia, which poses a further challenge to the language model.
As mentioned above, most research tackles hate speech detection as a supervised classification task by applying a machine learning approach. Spertus [4] utilizes a decision tree algorithm to automatically detect messages containing offensive language on social media. Vigna and Warner [1], [5] use the Support Vector Machine (SVM) algorithm, which achieved an accuracy of 80% in the automatic detection of hate speech.
Several studies used simple linguistic features such as Bag of Words (BoW), n-gram, and Part-of-Speech (PoS) as fundamental features. Waseem and Hovy [6] conducted a study to detect hate speech on the Twitter platform. They classified hate speech into two classes, namely racism and sexism. The authors use several features, such as character n-grams, along with the user's demographic, linguistic, and geographic features. The results showed that character n-grams with n = 4 gave the best results, and adding the user's gender feature could result in a slight improvement. Word embedding features, such as paragraph2vec [7], [8], are also used to classify user comments. Nobata et al. [8] combined the paragraph2vec feature with several features, such as n-gram features, linguistics and syntax, and semantic distribution. The addition of features shows an increase in the area under the curve (AUC).
Aside from the supervised approach, Watanabe et al. [9] apply unsupervised machine learning with lexical features and word rules to detect sentiment in the text. The algorithm focuses on word features, emoticons, hashtags, punctuation, and grammatical patterns. In addition, to detect harsh words in the text, they used a dictionary-based approach. Automatic hate speech detection with such grammatical features is mostly applied to English texts because English has a standard grammatical pattern. Meanwhile, tools for extracting grammatical features in Indonesian, such as part-of-speech taggers and automatic dependency parsers, remain limited.
Along with the development of research in the field of deep neural networks (DNN), some researchers [5], [10], [11] use DNN to solve the problem of automatic hate speech detection. Deep learning methods use word embeddings to represent the features of the text. These features can detect words that contain hate speech more effectively than syntactic and linguistic features. The deep learning architectures that are often used are Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), especially Long Short-Term Memory (LSTM). On the baseline dataset [6], CNN achieves an F1 score of about 80% [12], [13]. Zhang et al. [14] combined the CNN architecture with Gated Recurrent Units (GRU) to improve CNN's performance. Several studies have shown that LSTM performs better than CNN [10], [15], with an F1 score of 93%.
Our contribution in this paper is two-fold. Firstly, we use various contextual embeddings, as well as stacking of different embedding approaches, for hate speech detection in Indonesian. Secondly, our model outperforms previous work in prediction performance. We use deep neural networks to predict whether a sentence contains hate speech. Furthermore, we predict whether a sentence contains abusive language. Finally, to provide a richer representation, we use a text embedding representation through a recurrent neural network (RNN) with GRU as the main architecture.
This paper consists of four sections. The first section introduces the motivation for our work. We describe our method in detail in Section 2. The results of our experiments are described in Section 3, and finally, we conclude our work in Section 4.

METHODS
The method of detecting hate speech uses a supervised learning approach with contextual embedding and a recurrent neural network. Data labelled as hate speech or not hate speech is the primary input to the machine learning model. This study uses a recurrent neural network (RNN) representation that learns sequential patterns in text. Unlike previous studies [16], [17], we did not pre-process the data because we assumed that each word occurrence contributes to a sequential pattern that could indicate hate speech or not. Before entering the RNN document representation, we tokenize the text and pass each token into the token embedding. The document embedding then combines the token embeddings into one document embedding vector. This document embedding becomes the input to the classifier, which determines whether the sentence is hate speech or not. The architecture of our model can be seen in Figure 1.

Dataset
This study used two datasets on hate speech on social media from Alfina [16] and Ibrohim [17]. Both of these datasets contain Indonesian tweets.
Figure 1. The architecture of our model. The model starts with the sentence, which is then chunked into tokens. Aside from words, the tokens can be chunked into characters. These tokens are then represented as token embeddings. The dashed embeddings indicate that an embedding can be stacked with another embedding to provide a richer representation. The token embeddings are then combined by the document recurrent neural network embedding. Finally, the model predicts the output class, which is treated as multi-label classification in this architecture.
In the second dataset from Ibrohim, 5,540 tweets have been annotated as hate speech and 7,593 tweets as not hate speech. Here are two examples of hate speech data in the second dataset [17]:
Amit amit itu mulut apa congor satwa yang namanya anjing sih (In English: God forbid is the mouth of an animal called a dog)
hari hari makan babi berbentuk wang haram. muka pun mcm babi. perangai lebih babi dari babi. politikus ronggeng babi (In English: every day eating a haram pig. even face like a pig. the temperament of a pig more than a pig. ronggeng pig politician)
We obtained the two datasets from Alfina's and Ibrohim's GitHub repositories. The two researchers did not provide the training and testing folds used in their experiments. Thus, we divided the two datasets into training and testing sets ourselves to evaluate the model and compare its performance with the two baselines [16], [17].

Token Embedding
The first step in our method is the creation of a token embedding. In this study, we did not pre-process the data so that every word in the input sentence was converted into tokens. We perform tokenization by dividing the sentence into words. In the flair embedding method, the token is character-based. Each token is then converted into a vector. We call this token embedding. In general, there are two types of token embedding that we used. The first type is a classic word embedding type, and the second is a contextual embedding type.
The first type is the pre-trained embedding from Indonesian fastText [18], [19], which was trained on the Indonesian Wikipedia and Common Crawl [19]. The model is trained so that words with similar semantics have similar vectors. For example, the words Jakarta and Bandung will have similar vectors because they represent the same semantic category, namely cities. The fastText model also takes subwords into account to handle out-of-vocabulary words, that is, words that did not appear in the pre-training data. This kind of word embedding assigns one embedding to each word, so it does not pay attention to context.
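As a rough illustration of how subword information helps with out-of-vocabulary words, the sketch below extracts character n-grams in the style of fastText, which wraps each word in boundary markers and by default uses n-grams of length 3 to 6. The function names are hypothetical, and the sketch leaves out the hashing and vector summation that the actual model performs.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


def shared_ngrams(word_a, word_b):
    """N-grams that two words have in common (the basis for related vectors)."""
    return set(char_ngrams(word_a)) & set(char_ngrams(word_b))


# Trigrams of a short word, and the overlap between two related words.
trigrams = char_ngrams("babi", n_min=3, n_max=3)
overlap = shared_ngrams("makan", "makanan")
```

Because an unseen word still shares n-grams with known words, a vector can be composed for it from the vectors of those n-grams.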
The second type of token embedding in our experiment is contextual embedding. Contextual embedding is a token embedding method that encodes semantic information relevant to the context in which a token appears. In other words, the representations created by contextual embedding can differ depending on the sentence's context. In this study, we used Flair Embedding [20]. Flair Embedding is a contextual embedding that is trained by predicting the next character in a sequence of characters. This training objective has been shown to encode linguistic concepts such as words, sentences, and even sentiment. Flair Embedding is trained without explicit word feature information. It fundamentally models words as sequences of characters contextualized by the surrounding text, which means that the same word will have different embeddings depending on its contextual usage. However, character-based approaches such as Flair Embedding have a drawback: it is difficult to produce meaningful embeddings for character strings that are rarely used. To overcome this shortcoming, the same researchers [21] proposed a method in which the contextual embeddings of each unique string are dynamically aggregated. A pooling operation is then used to distill a global word representation from all contextual instances. This method is called Pooled Flair Embedding [21].
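The pooling idea can be sketched as follows, assuming mean pooling and toy two-dimensional vectors. The class `PooledMemory` and its method are hypothetical names, not part of any library, and the concatenation step reflects our reading of the method in [21]. Every contextual vector seen for a word is stored, the stored vectors are pooled, and the pooled vector is appended to the current contextual vector.

```python
from collections import defaultdict


class PooledMemory:
    """Keeps all contextual vectors seen per word and pools them (mean pooling assumed)."""

    def __init__(self):
        self.memory = defaultdict(list)

    def embed(self, word, contextual_vector):
        # Store the new contextual instance of this word.
        self.memory[word].append(list(contextual_vector))
        seen = self.memory[word]
        # Element-wise mean over every instance seen so far.
        pooled = [sum(values) / len(seen) for values in zip(*seen)]
        # Final embedding: current contextual vector followed by the pooled vector.
        return list(contextual_vector) + pooled


memory = PooledMemory()
first = memory.embed("babi", [1.0, 0.0])
second = memory.embed("babi", [0.0, 1.0])
```

The second call returns a pooled part that already reflects both usages of the word, which is how a rarely seen word can borrow evidence from its earlier contexts.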

Document Embedding
In contrast to token embedding, which creates a vector for each word or character, document embedding creates a single vector for the entire text. This is necessary to ensure that sentences with different numbers of words are represented by vectors of the same size. This study uses a recurrent neural network (RNN) that is trained on the sequential pattern of token embeddings [22]. An RNN is a neural network architecture in which the same processing is repeated over sequential input data. Because data is processed across multiple layers, RNN falls into the deep learning category. For long sequences, the RNN suffers from gradients that tend toward very small values, close to zero. This problem is often called the vanishing gradient. In this study, we used Gated Recurrent Units (GRU). GRU is a variant of RNN related to Long Short-Term Memory (LSTM). GRU can overcome the vanishing gradient by adding a gate mechanism to the RNN architecture [23]. GRU has been shown to handle long sequence patterns and has a more straightforward gate mechanism than LSTM. The advantages of GRU are a shorter computation time and competitive accuracy while avoiding the vanishing gradient problem. The two main gates in the GRU are the update gate and the reset gate.
The update gate determines how much information from the previous units is carried over to the next unit. This mechanism helps the model avoid the vanishing gradient problem. The update gate is computed by equation (1), where $x_t$ is the input at time step $t$, $h_{t-1}$ is the hidden state of the previous unit, and $\sigma$ is the sigmoid function. The equation is almost the same as the linear layer in a vanilla neural network, which multiplies a weight $W_z$ with the input at time step $t$. However, a second term is added, in which a weight $U_z$ multiplies the hidden state of the previous unit. Finally, the result is passed to the sigmoid activation function.
$$z_t = \sigma(W_z x_t + U_z h_{t-1}) \tag{1}$$
The second gate is the reset gate, which determines how much information from the past should be discarded. The reset gate is computed by equation (2). The equation has the same form as the update gate, with its own weights $W_r$ and $U_r$. However, the output of each gate is used differently in the later steps.
$$r_t = \sigma(W_r x_t + U_r h_{t-1}) \tag{2}$$
The final step of the GRU is to calculate the current memory content and the final memory at the current time step. The reset gate is used in the current memory content, where it determines how much past information is discarded. The current memory content is calculated by equation (3), where $\odot$ denotes element-wise multiplication.
$$\tilde{h}_t = \tanh(W x_t + r_t \odot U h_{t-1}) \tag{3}$$
The update gate is used in the final memory at the current time step to decide what information is passed to the next unit. The final memory at the current time step is obtained by equation (4).
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t \tag{4}$$
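The gate computations can be sketched in plain Python, assuming a commonly used GRU formulation in which the update gate weights the previous state and bias terms are omitted. The weight names (`Wz`, `Uz`, `Wr`, `Ur`, `W`, `U`), the toy values, and the function names are assumptions for illustration only, not the trained model.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gru_step(x, h_prev, p):
    """One GRU time step; p holds the (hypothetical) weight matrices."""
    # Update gate: how much of the previous state is kept.
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h_prev))]
    # Reset gate: how much of the past is discarded in the candidate.
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h_prev))]
    # Current memory content, with the reset gate applied to the past.
    cand = [math.tanh(a + r_i * b)
            for a, r_i, b in zip(matvec(p["W"], x), r, matvec(p["U"], h_prev))]
    # Final memory: blend of the previous state and the current memory content.
    return [z_i * h_i + (1.0 - z_i) * c_i for z_i, h_i, c_i in zip(z, h_prev, cand)]


def document_embedding(token_vectors, p, hidden_size):
    """Run the GRU over all token vectors and return the last hidden state."""
    h = [0.0] * hidden_size
    for x in token_vectors:
        h = gru_step(x, h, p)
    return h


weights = [[0.1, -0.2], [0.3, 0.4]]
params = {name: weights for name in ("Wz", "Uz", "Wr", "Ur", "W", "U")}
short_doc = document_embedding([[1.0, 0.0]], params, hidden_size=2)
long_doc = document_embedding([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], params, hidden_size=2)
```

Both `short_doc` and `long_doc` have the same length, which is the property the document embedding relies on: sentences of any length map to a vector of fixed size.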

Output Layer
The output of the document embedding is passed to a linear output layer. In this study, we did not add a hidden layer after the embedding layer. We assume that the combination of token embedding and document embedding provides representations that describe the semantic and sequential patterns that lead to hate speech. In the output layer, we use cross-entropy loss because our classifier is binary. The formula for cross-entropy loss can be seen in (5), where $H$ is the cross-entropy, $y_i$ is the class label of sentence $i$ (hate speech or non-hate speech), $p_i$ is the predicted probability that sentence $i$ is hate speech, and $N$ is the number of sentences in the corpus.
$$H = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right] \tag{5}$$
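A minimal sketch of a linear output layer followed by a binary cross-entropy loss is given below. It assumes a single sigmoid output unit and averages the loss over the sentences; the weights, the document vector, and the function names are made up for illustration, and the actual framework may compute the loss differently.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def linear(weights, bias, vector):
    """Linear output layer with a single output unit."""
    return sum(w * v for w, v in zip(weights, vector)) + bias


def binary_cross_entropy(labels, probabilities, eps=1e-12):
    """Mean binary cross-entropy; labels are 1 (hate speech) or 0 (not hate speech)."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from zero
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


# Hypothetical document embedding and output-layer parameters.
doc_vector = [0.2, -0.1]
probability = sigmoid(linear([1.5, -2.0], 0.1, doc_vector))

uncertain = binary_cross_entropy([1, 0], [0.5, 0.5])
confident = binary_cross_entropy([1, 0], [0.9, 0.1])
```

The loss shrinks as the predicted probabilities move toward the correct labels, which is the signal used to update the network during training.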

RESULTS AND DISCUSSION
In this section, we present our experimental setup, followed by our model's performance.

Experimental Setup
Our experiments were conducted on three different sets of data. Unlike our previous work [24], we did not combine the two datasets, so that each can serve as a benchmark for our model's performance. The first set of data is from Alfina [16], which contains only two labels, hate speech (HS) or not hate speech (NHS). The second and third sets of data are from Ibrohim and Budi [17]. For the second dataset, we only took the hate speech label. Thus, we treat the second dataset as a binary classification problem. For the third dataset, we took the hate speech (HS) and abusive language (AB) columns to benchmark the performance against the original paper [17]. As a result, we treat the third dataset as a multi-label classification problem with four possible labels: HS, NHS, AB, and non-abusive language (NAB).
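The difference between the binary and the multi-label setups can be illustrated with a small label-mapping sketch. The row layout and the key names `hs` and `abusive` are hypothetical stand-ins for the annotation columns, not the actual file format.

```python
def binary_label(hate_speech):
    """Second dataset: only the hate speech column is used."""
    return "HS" if hate_speech else "NHS"


def multi_labels(hate_speech, abusive):
    """Third dataset: one label per column, giving four possible labels in total."""
    return [binary_label(hate_speech), "AB" if abusive else "NAB"]


# Hypothetical annotated rows.
rows = [
    {"text": "tweet a", "hs": 1, "abusive": 1},
    {"text": "tweet b", "hs": 0, "abusive": 1},
    {"text": "tweet c", "hs": 0, "abusive": 0},
]

binary_targets = [binary_label(row["hs"]) for row in rows]
multi_targets = [multi_labels(row["hs"], row["abusive"]) for row in rows]
```

Each tweet in the multi-label setup carries two labels at once, so a tweet can be abusive without being hate speech.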
We separated each dataset into training and test sets, allocating 80% of the data for training and 20% for testing. The training data is further divided into training data and validation data, where the validation data is used to monitor the model's performance during training. Table 1 shows the distribution of training and testing data for the first and second datasets, while Table 2 shows the distribution for the third dataset.
We trained our model on an Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz machine with 20 logical central processing unit cores and a GeForce GTX 1080 Ti graphics processing unit. The significant difference in size between the two datasets makes training take much longer on the second dataset. We use the Flair framework to implement our model [25].
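The split can be sketched as below. The 80/20 train/test ratio comes from the text, while the validation fraction, the random seed, and the function name are assumptions chosen only to make the example concrete.

```python
import random


def split_dataset(rows, test_fraction=0.2, validation_fraction=0.1, seed=13):
    """Shuffle rows, hold out a test set, then carve a validation set from the rest."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_fraction)
    test, remaining = rows[:n_test], rows[n_test:]
    # The validation fraction is an assumption; the text does not state it.
    n_validation = round(len(remaining) * validation_fraction)
    validation, train = remaining[:n_validation], remaining[n_validation:]
    return train, validation, test


tweets = [f"tweet {i}" for i in range(100)]
train, validation, test = split_dataset(tweets)
```

Fixing the seed keeps the split reproducible, which matters here because the original datasets do not come with predefined folds.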

Experiment Result
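The experiment aims to determine which embeddings have the best prediction results for each dataset. Precision, recall, and F1-Measure are the three key measures we use in our experiment. The precision equation can be found in (6), recall in (7), and F1-Measure in (8), where $TP$, $FP$, and $FN$ denote the numbers of true positives, false positives, and false negatives for a class. We show the performance for every class, Hate Speech (HS) and non-Hate Speech (NHS). We add two more classes for the third experiment: Abusive Language (AB) and non-Abusive Language (NAB).
$$\text{Precision} = \frac{TP}{TP + FP} \tag{6}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{7}$$
$$\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{8}$$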
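The three measures can be computed per class as in the sketch below, which uses the standard definitions based on true positives, false positives, and false negatives. The gold and predicted labels are toy values, not results from our experiments.

```python
def precision_recall_f1(gold, predicted, positive):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels for illustration only.
gold = ["HS", "HS", "NHS", "NHS", "HS"]
predicted = ["HS", "NHS", "NHS", "HS", "HS"]

hs_scores = precision_recall_f1(gold, predicted, "HS")
nhs_scores = precision_recall_f1(gold, predicted, "NHS")
average_f1 = (hs_scores[2] + nhs_scores[2]) / 2
```

Averaging the per-class F1 values gives the kind of class-averaged score reported alongside the per-class results.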
We use a variety of embeddings in each experiment. We also stack or combine traditional word embedding with contextual embedding. There are seven different types of embeddings in total. The first three embeddings use classical word embeddings trained on subword information from fastText. For the first type, we use the fastText pre-trained model on Wikipedia (FastText Wiki). In the second model, we use the pre-trained model on Common Crawl (FastText crawl). In the third model, we perform stacking between those two embeddings (FastText wiki+crawl).
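Stacking can be read as concatenating the vectors that each embedding produces for the same token. The sketch below illustrates this with toy lookup tables; the vectors, dimensions, and names are made up for illustration and do not come from the pre-trained models.

```python
def stack_embeddings(*vectors):
    """Concatenate several embedding vectors of one token into a single vector."""
    stacked = []
    for vector in vectors:
        stacked.extend(vector)
    return stacked


# Toy stand-ins for two pre-trained lookups (hypothetical values).
fasttext_wiki = {"babi": [0.1, 0.2, 0.3]}
fasttext_crawl = {"babi": [0.7, 0.8]}

token = "babi"
stacked = stack_embeddings(fasttext_wiki[token], fasttext_crawl[token])
```

The stacked vector has as many dimensions as its parts combined, so the downstream network sees the information from both embeddings side by side.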
We use contextual embeddings in the fourth and fifth experiments: Flair Embeddings and Pooled Flair Embeddings. For both, we use the forward and backward models. The embeddings were pre-trained on the JW300 corpus, which contains Bahasa Indonesia [26]. Lastly, we stack the classical word embedding with the contextual embeddings to provide a richer representation. For the classical word embedding, we only use the Wikipedia model, which performed better in our previous experiment [24].
Table 3 shows the performance of our model on the first dataset [16]. We also show the performance of the baseline model from Alfina et al. [16], a random forest trained on a combination of features. These features are word unigram, word bigram, char trigram, char quadragram, and negative sentiment. One of our models achieves better performance than the baseline. The best model uses the FastText representation trained on Wikipedia. Overall, the first dataset performed better with classical word embeddings than with contextual word embeddings. The second-best model uses the same embedding stacked with FastText crawl. Thus, FastText Common Crawl gives no improvement over the basic FastText Wikipedia.
Table 3. The performance of the classification on the first dataset [16]. We also show the performance for the baseline model.
In contrast to the first dataset, the second dataset performs better with contextual word embeddings. The last four experiments, which use contextual word embeddings, outperform the classical word embeddings. However, stacking the classical word embeddings (FastText Wiki) gave a boost in performance to the Flair Embeddings. The best performance was achieved by Pooled Flair Embeddings, which give an average F1-Measure of 87.3%. We cannot benchmark the second experiment because the previous research [17] does not perform binary classification.
In the third experiment, we conduct multi-label text classification. The classes are either hate speech (HS) or non-hate speech (NHS), and either abusive (AB) or non-abusive (NAB). Due to space limitations, we only show the average over HS and NHS, denoted µHS. Likewise, for AB and NAB we only show their average, denoted µAB. Both results are then aggregated to show the overall average. Based on the experiment, our best model (FastText Wiki + Pooled Flair Embedding) significantly outperformed the baseline [17].
Based on the three experiments we have conducted, contextual embedding has proven to be robust on a larger dataset. Flair contextual embedding can capture linguistic information, including subclauses [20], which can take the form of Indonesian slang. In the multi-label classification task, stacking classical word embedding with contextual embedding gives the best result. It is followed by the second-best model, which also uses a stacked embedding. The stacked embedding provides information about the global context of a word, given by fastText trained on Wikipedia, together with contextual information, given by the Flair embedding.

CONCLUSIONS
In this paper, we build a prediction model for hate speech and abusive language, focusing on social media datasets. We demonstrated the use of word and contextual embedding approaches to provide a semantic representation of the tokens. To test the robustness of the embeddings, we did not conduct any pre-processing of the data. The results show that, with contextual embedding, the model can still work well without any pre-processing; we leave the effect of pre-processing for future work. We also experimented with stacking one embedding with another to provide a richer representation of the sentence. Moreover, we use the document recurrent neural network embedding to capture the sequence information from the sentence. Our model improved on the baseline for the dataset provided by Alfina et al. [16] and gave a significant improvement on the larger dataset by Ibrohim and Budi [17].
In the future, we suggest using stopword elimination, slang substitution, stemming, and other pre-processing techniques. We believe that several pre-processing methods will boost the model's performance due to the high noise in social media data. We also recommend testing transformer models, including Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformers (GPT).