Bidirectional Long Short Term Memory Method and Word2vec Extraction Approach for Hate Speech Detection

Hate speech is currently a heated topic of discussion in Indonesia, primarily on social media. Hate speech is communication that disparages a person or group based on characteristics such as race, ethnicity, gender, citizenship, religion, and organization. Twitter is one of the social media platforms people use to express their feelings and opinions through tweets, including tweets that contain expressions of hatred, and Twitter has a significant influence on the success or destruction of one's image. This study aims to classify Indonesian tweets as hate speech or non-hate speech using the Bidirectional Long Short Term Memory (BiLSTM) method and the word2vec feature extraction method with the Continuous Bag-of-Words (CBOW) architecture. The BiLSTM is evaluated by calculating accuracy, precision, recall, and F-measure. The use of word2vec and the Bidirectional Long Short Term Memory method with the CBOW architecture, with 10 epochs, a learning rate of 0.001, and 200 neurons in the hidden layer, produces an accuracy of 94.66%, with a precision of 99.08%, recall of 93.74%, and F-measure of 96.29%. In contrast, the Bidirectional Long Short Term Memory with three layers has an accuracy of 96.93%; the addition of one layer to the BiLSTM increases accuracy by 2.27%.


INTRODUCTION
Social media are web-based internet technologies that turn communication into interactive conversations between their users [1]. Twitter is a social medium that allows users to express feelings and opinions through tweets, including tweets that contain expressions of hatred [2]. Twitter has a significant influence on the success or destruction of one's image [3].
Hate speech is communication that belittles a person or group based on characteristics such as race, ethnicity, gender, nationality, religion, and organization [4]. Expressing hate speech has become a trend, and many people use it as a shortcut to gain instant popularity without much effort [5]. Information containing hate speech can spread quickly and widely. According to investigators from the Cyber Crime Directorate (Direktorat Tindak Pidana Siber) of Bareskrim, the majority of cybercrime is dominated by defamation and hate speech on social media, with a share of 80% [6].
Research on hate speech detection for Indonesian has been conducted by [4]. That study used Bag-of-Words feature extraction, namely word n-grams and character n-grams, and compared the performance of four machine learning algorithms: Bayesian Logistic Regression, Naive Bayes, Support Vector Machine, and Random Forest Decision Tree. However, Bag-of-Words models have trouble extracting the semantic meaning of a sentence. Various feature extraction methods have been proposed, including single words, single-character n-grams, multi-word n-grams, and lexical-syntactic features, but semantic relations between words are rarely considered in text classification. Grammatical features can reveal deep and implicit semantic relationships between words, which can be more useful in text classification [7]. The LSTM method has been used in [8] and [9] to analyze sentiment in tweets, with better results than conventional methods. This shows that the LSTM method is suitable for text classification.
Based on the background described above, this study takes a deep learning approach to detecting hate speech in tweets, using the Bidirectional Long Short Term Memory method with a word embedding approach, namely word2vec, as feature extraction. The LSTM is outstanding at classification and prediction on time series data of unknown duration [10].

1. General System Architecture
The system architecture to be built has four supporting parts: data collection, preprocessing, feature extraction, and finally classification and evaluation. The architecture designed in this study can be seen in Figure 1.

2. Data Collection
The tweet data used in this study come from the dataset of [4]: Indonesian-language tweets that have been labelled as hate speech (HS) or non-hate speech (Non_HS).

3. Preprocessing
Preprocessing cleans the tweet data of noise to facilitate the next process. The preprocessing stage consists of several steps: 1) case folding, 2) filtering, 3) tokenizing, and 4) stopword removal. Case folding converts every word in a tweet to lower case. Filtering deletes special tokens that are often found in tweets, such as hashtags (#), @user mentions, and retweet markers (RT); it also removes punctuation (',', '.', '?', '!', etc.), numeric digits (0-9), and other characters ('$', '%', '*', etc.). Tokenizing separates the words of a sentence into tokens, where spaces separate each word. Stopword removal eliminates words that carry little meaning (e.g., which, and, or, to, from). Examples of the preprocessing stage can be seen in Table 1. A minimal sketch of these four steps is shown below.
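The following sketch illustrates the four preprocessing steps in plain Python; the regular expressions and the small stopword list are illustrative assumptions, not the paper's exact implementation.

```python
import re

# Illustrative (not the paper's) Indonesian stopword list.
STOPWORDS = {"yang", "dan", "atau", "ke", "dari", "di", "itu"}

def preprocess(tweet):
    # 1) Case folding: lower-case the whole tweet.
    text = tweet.lower()
    # 2) Filtering: drop retweet markers, @user mentions, hashtags, and URLs...
    text = re.sub(r"\brt\b|@\w+|#\w+|https?://\S+", " ", text)
    # ...then punctuation, digits, and any other special characters.
    text = re.sub(r"[^a-z\s]", " ", text)
    # 3) Tokenizing: split the sentence into tokens on whitespace.
    tokens = text.split()
    # 4) Stopword removal: drop words with little discriminative value.
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("RT @user kamu dan dia itu #contoh https://t.co/x"))
# -> ['kamu', 'dia']
```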

4. Sentence Conversion
The stages carried out in sentence conversion are making a word dictionary, converting sentences into numbers, and padding. The results of this conversion process are used as input to the BiLSTM method. The first process creates a word dictionary that assigns an id to each word appearing in the preprocessed tweet data: sentences are separated into units of words, duplicate words are deleted, and each word in the corpus is given a value.
The second process converts each sentence into a sequence of word ids using this dictionary, and padding then brings all sequences to a uniform length.
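A minimal sketch of the dictionary construction, id conversion, and padding, assuming plain Python and a padding id of 0 (the paper does not state its padding convention):

```python
# Build a word-id dictionary, map tweets to id sequences, pad to equal length.
def build_vocab(tweets):
    vocab = {}
    for tokens in tweets:
        for word in tokens:
            if word not in vocab:          # skip duplicate words
                vocab[word] = len(vocab) + 1
    return vocab

def to_ids(tokens, vocab):
    return [vocab[w] for w in tokens if w in vocab]

def pad(seqs, max_len):
    # Truncate to max_len, then fill the remainder with the padding id 0.
    return [s[:max_len] + [0] * (max_len - len(s)) for s in seqs]

tweets = [["kamu", "dia"], ["dia", "benci", "kamu", "sekali"]]
vocab = build_vocab(tweets)            # {'kamu': 1, 'dia': 2, 'benci': 3, 'sekali': 4}
seqs = [to_ids(t, vocab) for t in tweets]
print(pad(seqs, 4))                    # [[1, 2, 0, 0], [2, 3, 1, 4]]
```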

5. Word2vec
The word2vec architectures used in this study are CBOW (Continuous Bag-of-Words) and skip-gram, with a vector size of 200 dimensions and a window size of 5 for both the CBOW and skip-gram architectures. Larger windows tend to capture more information about the topic of the sentence, while smaller windows tend to capture more about the word itself, such as equivalent words or synonyms [11].
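A minimal training sketch, assuming the gensim library (the paper does not name its word2vec implementation), with the reported vector size and window:

```python
from gensim.models import Word2Vec

# sentences: token lists produced by the preprocessing stage.
sentences = [["kamu", "dia", "benci"], ["dia", "suka", "kamu"]]

# CBOW (sg=0) with 200-dimensional vectors and a window of 5,
# matching the parameters reported in the paper.
cbow = Word2Vec(sentences, vector_size=200, window=5, sg=0, min_count=1)

# Skip-gram (sg=1) with the same settings, for comparison.
skipgram = Word2Vec(sentences, vector_size=200, window=5, sg=1, min_count=1)

vector = cbow.wv["kamu"]   # the 200-dimensional embedding of one word
```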

6. Bidirectional LSTM
Bidirectional Long Short Term Memory is a neural network based on the Long Short Term Memory (LSTM) that consists of two LSTM layers: a forward LSTM layer to model the preceding context and a backward LSTM layer to model the following context [12]. A Bidirectional LSTM connects two hidden layers running in opposite directions to the same output. With this form of deep learning, a layer of neurons can obtain information from past and future states simultaneously. The classification and validation process is shown in Figure 2. The first layer is the input layer of Indonesian tweets that have been converted into vector form. In the LSTM layer, several numbers of LSTM units will be tested. An LSTM unit is a memory cell that consists of four main components: an input gate, a self-recurrent connection, a forget gate, and an output gate. In this study, two directions are used, namely forward and backward, so that the output layer receives information from the past and the future simultaneously.

6.2. Embedding
The next layer is the embedding layer ($e_t$). The purpose of this layer is to learn a mapping from each word in the word dictionary to a vector of lower dimension. This layer changes each positive integer index in the input into a fixed-size vector, based on the vector dimensions of the word dictionary under the word2vec model. A sketch of building this layer from the trained word2vec vectors is given below.
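The following sketch assumes TensorFlow/Keras (the paper does not name its framework) and reuses `vocab` and `cbow` from the earlier sketches; both the framework and the variable names are assumptions.

```python
import numpy as np
from tensorflow.keras.layers import Embedding

# Row i of the matrix holds the word2vec vector of the word with id i;
# row 0 stays zero and is reserved for padding.
embedding_dim = 200
embedding_matrix = np.zeros((len(vocab) + 1, embedding_dim))
for word, idx in vocab.items():
    if word in cbow.wv:
        embedding_matrix[idx] = cbow.wv[word]

embedding_layer = Embedding(
    input_dim=len(vocab) + 1,      # dictionary size plus the padding id
    output_dim=embedding_dim,      # 200-dimensional word2vec vectors
    weights=[embedding_matrix],    # initialise from the pretrained model
    trainable=False,               # keep the pretrained vectors fixed
)
```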

6.3. LSTM Layer
In the LSTM, this layer determines whether the previous input can pass into the cell state or not. What determines whether the data is carried forward is a sigmoid layer called the "forget gate" ($f_t$): an output of 1 means "let it pass" and 0 means "forget the information". The forget gate value is calculated by Equation (1):

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \quad (1)$$

The next step is a sigmoid layer called the input gate ($i_t$), which determines which parts will be updated, and a tanh layer, which creates a vector of new candidate values ($\tilde{C}_t$) that can be added to the cell state ($C_t$). The input gate and candidate values are calculated by Equations (2) and (3), and the cell state is updated by Equation (4):

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \quad (2)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \quad (3)$$
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \quad (4)$$

In the final step we obtain the output. First, we run a sigmoid gate, called the output gate, to decide which parts of the context we will produce. The cell state goes through tanh, bringing its values between -1 and 1, and is multiplied by the output of the sigmoid gate. The output is calculated by Equations (5) and (6):

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \quad (5)$$
$$h_t = o_t * \tanh(C_t) \quad (6)$$
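Putting the pieces together, a minimal sketch of the classifier, again assuming TensorFlow/Keras: the gate Equations (1)-(6) are computed inside the LSTM cell, and the Bidirectional wrapper runs the forward and backward passes whose outputs feed the same output layer.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dense
from tensorflow.keras.regularizers import l2

# embedding_layer comes from the sketch above; 200 units and L2 = 0.001
# follow the best parameters reported in the paper.
model = Sequential([
    embedding_layer,
    Bidirectional(LSTM(200, kernel_regularizer=l2(0.001))),
    Dense(1, activation="sigmoid"),   # HS vs. Non_HS
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```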

RESULTS AND DISCUSSION
This section discusses the testing phase of the system being built. Testing is done by measuring system performance: the accuracy, precision, recall, and F-measure of the tweet classifications produced by the system. The tweet data used for this test were taken from previous research, namely [4], with a total of 713 tweets. The data are divided into two classes: 260 hate speech (HS) tweets and 453 non-hate speech (Non_HS) tweets. Testing uses k-fold cross-validation with k = 10, sketched below.
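A sketch of this evaluation protocol, assuming scikit-learn for the folds and metrics; `build_model` is a hypothetical factory returning a freshly compiled BiLSTM like the one sketched earlier, `X` is the padded id matrix (a NumPy array), and `y` holds the binary labels (1 = HS, 0 = Non_HS).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(build_model, X, y, epochs=10):
    scores = []
    # 10-fold cross-validation, stratified so each fold keeps the HS/Non_HS ratio.
    for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True).split(X, y):
        model = build_model()
        model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
        pred = (model.predict(X[test_idx]) > 0.5).astype(int).ravel()
        scores.append((
            accuracy_score(y[test_idx], pred),
            precision_score(y[test_idx], pred),
            recall_score(y[test_idx], pred),
            f1_score(y[test_idx], pred),
        ))
    return np.mean(scores, axis=0)   # mean accuracy, precision, recall, F-measure
```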
This testing phase is carried out to find the best accuracy among the parameters tested. Not all combinations of settings could be tried due to limited time and resources; the parameters tested and their values can be seen in Table 2.

Word2vec Architecture Testing
The first test compares the word2vec architectures used, namely the Continuous Bag-of-Words (CBOW) and skip-gram types. In addition, one-hot encoding is tested as a comparison method besides word2vec. This test aims to determine the best of the word2vec and one-hot encoding models. Based on the results shown in Table 3, the CBOW type achieves an accuracy of 93%, better than the skip-gram type by a margin of 2.99%, while one-hot encoding obtains an accuracy of only 74.13%. This is because one-hot encoding cannot represent the semantic meaning of words; it only produces sparse count vectors. Between CBOW and one-hot encoding, the difference in accuracy is 18.87%. Word2vec with CBOW produces better word embeddings because CBOW attends to the semantic meaning of each word. This test used 10 epochs and an L2 value of 0.001.

Testing the Number of Neurons Against the Bi-LSTM Method
According to Table 4, up to 50 neurons the accuracy shows no significant improvement. The number of neurons was increased further, up to 300: with 200 neurons the accuracy reaches 94.66%, while from 220 to 300 neurons the accuracy drops by 24.47%, with 300 neurons giving an accuracy of 70.19%. From this experiment, the highest accuracy, 94.66%, is obtained with 200 neurons. The more neurons, the longer the computation takes; on the other hand, a large number of neurons does not guarantee a significant increase in accuracy, precision, recall, or F-measure. During testing, no specific formula was found for determining the optimal number of neurons for the model. A comparison of the Bi-LSTM accuracy over the number of neurons can be seen in Table 5, where the best result, an accuracy of 96.93%, is found for the Bi-LSTM with 3 layers. The best number of neurons is used for the subsequent parameter tests.

Testing the Number of Epochs Against the Bi-LSTM Method
In previous tests, the number of neurons was determined from the results that showed the best value. In this test, the numbers of epochs tested were 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100, with 200 neurons (the value with the highest accuracy in the previous test) and an L2 value of 0.001. Table 6 shows that 100 epochs gives the best accuracy.

L2 Regularization Testing
Previous tests established the word2vec architecture, the best number of neurons, and the number of epochs. The final step is to determine the optimal L2 regularization value. The L2 values tested were 0.1, 0.01, and 0.001; the results can be seen in Table 7. The L2 value is directly proportional to the time required, and a larger L2 value can increase accuracy, but not significantly.

Bi-LSTM Performance Testing
Based on the previous tests, the best word2vec architecture (CBOW), number of neurons, number of epochs, and L2 regularization value were obtained. These best results are used to test the performance of the Bi-LSTM method, which is compared with the LSTM method and the 3-layer Bi-LSTM. The test results are shown in Table 8. From these results, the 3-layer Bi-LSTM method gets the highest accuracy, 96.93%. This shows that adding hidden layers to the LSTM can increase its accuracy, even though the increase is not very significant; moreover, adding layers lengthens the time needed to run the test. A sketch of how such a stacked Bi-LSTM might be assembled follows.
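The stacking details below are an assumption, as the paper does not specify the 3-layer architecture; intermediate Bi-LSTM layers must return full sequences so the next layer receives one vector per time step.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dense

# Three stacked Bi-LSTM layers over the same pretrained embedding layer.
deep_model = Sequential([
    embedding_layer,
    Bidirectional(LSTM(200, return_sequences=True)),  # pass sequences onward
    Bidirectional(LSTM(200, return_sequences=True)),
    Bidirectional(LSTM(200)),                         # final layer: one vector
    Dense(1, activation="sigmoid"),
])
deep_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```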

CONCLUSIONS
The use of word2vec and the Bidirectional Long Short Term Memory method with the CBOW architecture, with 10 epochs, a learning rate of 0.001, and 200 neurons in the hidden layer, generates an accuracy of 94.66%, with a precision of 99.08%, recall of 93.74%, and F-measure of 96.29%. The Bidirectional Long Short Term Memory with three layers has an accuracy of 96.93%; the addition of one layer to the BiLSTM increases accuracy by 2.27%.
However, this study still has a shortcoming that could be addressed by adding training data beyond the Indonesian-language Wikipedia corpus, to avoid words that cannot be represented.