Aspect-Based Sentiment Analysis on Indonesian Restaurant Review Using a Combination of Convolutional Neural Network and Contextualized Word Embedding

Opinions about a product or service expressed in reviews are important to both business owners and potential customers. However, the large number of reviews makes it difficult for them to analyze the information the reviews contain. Aspect-based sentiment analysis is the process of determining the sentiment polarity of a sentence with respect to predetermined aspects. This study analyzes Indonesian restaurant reviews using a combination of a Convolutional Neural Network (CNN) and contextualized word embedding models, and compares it with a combination of a CNN and a traditional word embedding model. Among the three models evaluated (BERT-CNN, ELMo-CNN, and Word2vec-CNN), ELMo-CNN gives the best aspect classification results, with a micro-average precision of 0.88, a micro-average recall of 0.84, and a micro-average F1-score of 0.86. Meanwhile, BERT-CNN gives the best sentiment classification results, with a precision of 0.89, a recall


INTRODUCTION
In this digital era, it is easier than ever to express opinions in the form of status updates, tweets, reviews, and so on. Platforms such as Zomato, Tripadvisor, Traveloka, and Qraved list various places, especially restaurants, across Indonesia and let users express their opinions about a restaurant in a text called a review. Restaurant reviews are very useful for both potential customers and restaurant owners. Customers can use the information in a review to learn more about the products or services available at the restaurant, and owners can use it as an evaluation to better meet customers' needs [14].
Positive reviews can improve potential customers' impression of a restaurant and make them more interested in visiting it. Negative reviews, on the other hand, give potential customers something to weigh before visiting. However, given the large number of reviews available, potential customers often read only the few that appear at the top, which makes it difficult to draw conclusions and make decisions. A solution is therefore needed, namely a model that can classify a review into a positive or negative class. Sentiment analysis is one method that can solve this problem.
Aspect-based sentiment analysis (ABSA) addresses a limitation of plain sentiment analysis, which cannot classify a review as positive or negative with respect to aspect categories. ABSA classifies a review into positive or negative classes per aspect category (Liu, 2015). At this level, a model can classify a text document by both aspect category and sentiment polarity. For example, given the review sentence "The food is delicious but the price is quite expensive", an ABSA model can assign the clause "the food is delicious" to the food aspect with positive polarity and the clause "the price is quite expensive" to the price aspect with negative polarity.
Aspect-based sentiment analysis was studied by [7], who used Conditional Random Fields (CRF) and Maximum Entropy (MaxEnt) models and obtained an average F1-score of 0.642. [1] continued the research of [7] using the deep learning methods Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (Bi-LSTM), improving performance to an average F1-score of 0.764. [2] used a combination of CNN-GRU and word2vec models, obtaining an F1-score of 0.67 for aspect classification and 0.66 for sentiment classification. However, [5] and [13] show better results when contextualized word embeddings are combined with the model.
Based on this background, this study conducts aspect-based sentiment analysis on Indonesian restaurant reviews using a combination of a Convolutional Neural Network and contextualized word embedding approaches, namely BERT and ELMo. These are compared with a model that uses a traditional word embedding approach, namely Word2vec. This study also tests the effect of stemming during preprocessing on the three models for each classification task.

Data Collection
The data used in this research are Indonesian restaurant reviews from [2]. Four aspect categories are used: food, service, price, and place.

Data Labelling
The data are manually labelled with the aspects mentioned in 2.1.

1) Food: food tastes great, have a variety of menus, delicious dishes, etc.
2) Service: great service, staff are extremely knowledgeable, staff is really friendly, etc.
3) Price: price is reasonable, quite expensive for the price, etc.
4) Place: cozy restaurant, the ambience is welcoming and charming, such a lovely place, etc.

Data Splitting
The data are first split into 80% for training and 20% for testing, and 15% of the training portion is then held out as the validation set. Of the 2400 reviews used, 1632 form the train set, 288 the validation set, and 480 the test set.
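
The split sizes above follow from simple arithmetic, which the short sketch below reproduces. The function name and its parameters are illustrative only and are not taken from the study's code.

```python
# Sketch of the split arithmetic: 20% test, then 15% of the remaining
# training portion held out for validation. Names are illustrative.
def split_sizes(n_total, test_frac=0.20, val_frac_of_train=0.15):
    n_test = round(n_total * test_frac)              # 20% of all reviews
    n_train_full = n_total - n_test                  # remaining 80%
    n_val = round(n_train_full * val_frac_of_train)  # 15% of the 80%
    n_train = n_train_full - n_val                   # what is left for training
    return n_train, n_val, n_test

print(split_sizes(2400))  # (1632, 288, 480)
```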

Preprocessing
The preprocessing steps in this research are: 1) case folding, 2) non-alphanumeric removal, 3) normalization, and 4) stemming. Case folding converts a review to lower case, non-alphanumeric removal deletes characters that are not alphanumeric, and normalization replaces non-standard words, abbreviations, and typos with their standard forms. Stemming reduces a word to its root word.
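
As a rough illustration of these four steps, the sketch below chains them in order. The normalization table is a hypothetical toy dictionary (the study's actual word list is not given here), and stemming is left as an optional function argument because the stemmer used is not specified in this section.

```python
import re

# Hypothetical normalization table; the real dictionary is not listed in the paper.
NORMALIZATION = {"yg": "yang", "bgt": "banget", "tdk": "tidak"}

def preprocess(review, normalization=NORMALIZATION, stem=None):
    text = review.lower()                               # 1) case folding
    text = re.sub(r"[^a-z0-9\s]", " ", text)            # 2) non-alphanumeric removal
    tokens = text.split()
    tokens = [normalization.get(t, t) for t in tokens]  # 3) normalization
    if stem is not None:                                # 4) optional stemming
        tokens = [stem(t) for t in tokens]
    return " ".join(tokens)

print(preprocess("Makanannya ENAK bgt, tapi harganya mahal!!!"))
# makanannya enak banget tapi harganya mahal
```

Passing `stem=None` corresponds to the "without stemming" setting compared in the experiments; any function that maps a token to its root word can be supplied for the stemmed setting.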

Feature Extraction
We use IndoBERT [6], a pre-trained BERT embedding for Bahasa Indonesia. The pre-trained embedding is used as is and is not fine-tuned on the downstream task, as it was trained on the same corpus domain as our review data. For comparison, we also use the pre-trained ELMo embedding for Bahasa Indonesia from ELMoForManyLangs [18], and a Word2vec embedding trained on the corpus of our own data so that the resulting word embedding covers all the words contained in the review data.

Classification Model
Three models are evaluated: BERT-CNN, ELMo-CNN, and Word2vec-CNN. For both aspect classification and sentiment classification, each model has four layers: a convolutional layer, a pooling layer, a dropout layer, and a fully-connected layer. Hyperparameter tuning with the Keras-Tuner Hyperband tuner [16] is used to find the best combination of parameters. This process is carried out for each model in both aspect classification and sentiment classification to provide a fair comparison. Four parameters are tuned: convolutional filters, kernel size, pool size, and dense units.

Aspect Classification
The model architecture for aspect classification can be seen in Figure 1. The preprocessed data is converted into vector form using contextualized word embedding and used as input to the CNN model. The CNN model has the following layers:
1) Input layer: the input is the word vectors produced by feature extraction in the embedding layer, with size x × y × z (x = number of reviews, y = number of hidden layers, z = number of tokens). Padding is applied at the end of each review.
2) ConvPool layer: convolution and max-pooling are performed in this layer. In the convolution layer, we try 128 and 256 filters with the ReLU activation function. We try kernel sizes of (3×3) and (5×5) as well as pool sizes of (2×2), (3×3), and (4×4).
3) Dropout layer: dropout regularization with a rate of 0.5 is applied to the ConvPool layer output to prevent overfitting.
4) Fully-connected layer: the output of the dropout layer is flattened and passed to two fully-connected layers. The first uses the ReLU activation function, with 128 and 256 units tried during tuning. The second uses four units, one per aspect label, with a sigmoid activation function so that each label gets a binary representation.
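
To make the size of this search space concrete, the sketch below enumerates every combination of the candidate values listed in this section. The dictionary keys are illustrative names, not identifiers from the study's code, and a Hyperband tuner allocates training budget adaptively across sampled configurations rather than training each one to completion, so this is only a view of the space being searched.

```python
from itertools import product

# Candidate values as listed in this section; key names are illustrative.
SEARCH_SPACE = {
    "conv_filters": [128, 256],
    "kernel_size": [3, 5],
    "pool_size": [2, 3, 4],
    "dense_units": [128, 256],
}

def enumerate_configs(space):
    """Return every combination of hyperparameter values as a dict."""
    names = list(space)
    return [dict(zip(names, values))
            for values in product(*(space[name] for name in names))]

configs = enumerate_configs(SEARCH_SPACE)
print(len(configs))  # 24
print(configs[0])    # {'conv_filters': 128, 'kernel_size': 3, 'pool_size': 2, 'dense_units': 128}
```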

Sentiment Classification
The model architecture for sentiment classification can be seen in Figure 2. The process is similar to aspect classification. One difference is the input, which consists of the sentiment data for each aspect, separated by aspect category. In addition, the second fully-connected layer uses a sigmoid activation function with a single output neuron that produces a real number between 0 and 1, to which the following rule is applied: if the sigmoid output is greater than or equal to 0.5, the model outputs 1, meaning the sentiment polarity is positive; if it is less than 0.5, the model outputs 0, meaning the sentiment polarity is negative.
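
The decision rule on the sigmoid output can be written as a small function. This is a minimal sketch: the function names are illustrative, and the input is assumed to be the pre-activation value (logit) of the single output neuron.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def polarity(logit, threshold=0.5):
    """Return 1 (positive) if sigmoid(logit) >= threshold, else 0 (negative)."""
    return 1 if sigmoid(logit) >= threshold else 0

print(polarity(2.0))   # 1
print(polarity(-2.0))  # 0
print(polarity(0.0))   # 1, because sigmoid(0) is exactly 0.5
```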

Model Evaluation
To evaluate the model we use label-based evaluation. Label-based evaluation is based on the confusion matrix shown in Figure 3, where TP (true positive) is data labelled True that is predicted True, FP (false positive) is data labelled False that is predicted True, FN (false negative) is data labelled True that is predicted False, and TN (true negative) is data labelled False that is predicted False. From these counts we compute three metrics: 1) micro-averaged precision, 2) micro-averaged recall, and 3) micro-averaged F-measure.
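
The sketch below shows how micro-averaged metrics are conventionally computed for multi-label output. It assumes the standard definitions: TP, FP, and FN are summed over all labels, precision is TP / (TP + FP), recall is TP / (TP + FN), and the F-measure is their harmonic mean. The function name and the toy label matrices are illustrative and are not data from this study.

```python
def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 over binary label matrices."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1          # labelled True, predicted True
            elif t == 0 and p == 1:
                fp += 1          # labelled False, predicted True
            elif t == 1 and p == 0:
                fn += 1          # labelled True, predicted False
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example; columns stand for food, service, price, place.
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]]
y_pred = [[1, 0, 0, 0], [0, 1, 1, 0], [1, 1, 0, 1]]
p, r, f = micro_prf(y_true, y_pred)
print(f"{p:.2f} {r:.2f} {f:.2f}")  # 0.83 0.83 0.83
```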

Data Exploration Results
Exploring the collected and labelled data, we obtained the label distribution of the aspect data shown in Table 1 and of the sentiment data shown in Table 2.

Aspect Classification Results
The aspect classification results can be seen in Table 3. The models that use contextualized word embedding achieve better results than the model that uses traditional word embedding. Interestingly, the ELMo-CNN model achieved a better micro-averaged F1-score than the BERT-CNN model.
The ELMo-CNN model, which gives the best micro-averaged F1-score, uses the following hyperparameters: 128 convolutional filters, a kernel size of 4, a pool size of 3, and 256 dense units. The aspect classification results on stemmed data can be seen in Table 4. They are lower than those of the models trained without stemming. This is because the stemming process can cause the text to lose context due to the deletion and simplification of some words.

Sentiment Classification Results
The sentiment classification results can be seen in Table 5. The models that use contextualized word embedding achieve better results on each aspect than the model that uses traditional word embedding.
The BERT-CNN model, which gives the best micro-averaged F1-score, uses hyperparameters for each label as follows:
Meanwhile, the sentiment classification results on stemmed data can be seen in Table 6. For the models that use contextualized word embedding, stemming does not have a significant effect on sentiment classification, whereas for the model that uses traditional word embedding, stemming has a significant effect, as can be seen from the difference in the average F1-score. This is because, with Word2vec as the feature extractor, words with affixes and their base words receive fairly distant representations.

CONCLUSION
Combining a Convolutional Neural Network with contextualized word embedding gave better results than combining a Convolutional Neural Network with traditional word embedding. For aspect classification, the ELMo-CNN model gave the best results, with a micro-average precision of 0.88, a micro-average recall of 0.84, and a micro-average F1-score of 0.86. For sentiment classification, the BERT-CNN model gave the best results, with a precision of 0.89, a recall of 0.89, and an F1-score of 0.91.
Using data without stemming in the BERT-CNN, ELMo-CNN, and Word2vec-CNN models gives similar or even better results than using data with stemming, with a significant difference shown only in the Word2vec-CNN model. This study still has shortcomings that future work can address, such as employing a better data labelling method. In addition, more diverse and balanced data could be used, given that contextualized word embedding works better with large data.