Exploring the Impact of Back-Translation on BERT's Performance in Sentiment Analysis of Code-Mixed Language Data

Nisrina Hanifa Setiono; Yunita Sari

doi:10.22146/ijccs.104757

Nisrina Hanifa Setiono (1), Yunita Sari (2*)

(1) Department of Computer Science and Electronics, FMIPA UGM, Yogyakarta
(2) Department of Computer Science and Electronics, FMIPA UGM, Yogyakarta
(*) Corresponding Author

Abstract

Social media, particularly Twitter, has become a key platform for communication and opinion-sharing, where code-mixing, the blending of multiple languages in a single sentence, is common. In Indonesia, Indonesian-English code-mixing is widely used, especially in urban areas. Sentiment analysis of code-mixed text is challenging for natural language processing (NLP) because the data is informal and models are typically trained on formal text. This study applies back-translation to address these challenges and improve BERT-based sentiment analysis. The method is tested on the INDONGLISH dataset, which consists of 5,067 labeled tweets. Results show that applying back-translation directly to raw tweets preserves the original meaning and improves model accuracy. When back-translation instead follows monolingual translation, accuracy declines because of semantic distortion: repeated translation alters sentence structure and the sentiment a tweet conveys, so the text no longer reliably matches its label. These findings indicate that each additional translation step risks lowering sentiment analysis accuracy, particularly for code-mixed datasets, which are highly sensitive to linguistic shifts. Applied directly to raw tweets, back-translation proves to be an effective way to formalize data while maintaining contextual integrity, enhancing sentiment analysis performance on code-mixed text.
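
The back-translation step described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical: the abstract does not say which translation system or pivot language the authors used, so the sketch assumes English as the pivot and Indonesian as the target, and replaces the machine-translation system with toy word-level lexicons. It shows only the shape of the step (raw code-mixed tweet, to pivot, back to target), not the authors' implementation.

```python
from typing import Callable, Dict

# A translator is any function from text to text. In practice this would wrap
# a machine-translation system; here it is a toy stand-in (assumption).
Translator = Callable[[str], str]


def make_word_translator(lexicon: Dict[str, str]) -> Translator:
    """Build a toy word-by-word translator; unknown words pass through unchanged."""
    def translate(text: str) -> str:
        return " ".join(lexicon.get(tok, tok) for tok in text.lower().split())
    return translate


def back_translate(text: str, to_pivot: Translator, from_pivot: Translator) -> str:
    """Translate into the pivot language, then back into the target language."""
    return from_pivot(to_pivot(text))


# Hypothetical lexicons: informal Indonesian -> English, English -> formal Indonesian.
TO_ENGLISH = make_word_translator(
    {"gue": "i", "banget": "very", "hari": "day", "ini": "this"}
)
TO_INDONESIAN = make_word_translator(
    {"i": "saya", "happy": "senang", "very": "sangat", "day": "hari", "this": "ini"}
)

# A made-up code-mixed tweet, not taken from the INDONGLISH dataset.
tweet = "Gue happy banget hari ini"
formalized = back_translate(tweet, TO_ENGLISH, TO_INDONESIAN)
print(formalized)
```

In a full pipeline the formalized text, rather than the raw tweet, would be passed to the BERT tokenizer and classifier. Chaining a second translation stage in front of `back_translate` would correspond to the multi-step setting that the abstract reports as less accurate.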

Keywords

Code-mixing; Sentiment Analysis; Back-Translation; BERT; Informal Text Processing

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Copyright of: IJCCS (Indonesian Journal of Computing and Cybernetics Systems), ISSN 1978-1520 (print); ISSN 2460-7258 (online), a scientific journal of research results in Computing and Cybernetics Systems.
A publication of IndoCEISS. Gedung S1 Ruang 416 FMIPA UGM, Sekip Utara, Yogyakarta 55281. Fax: +62274 555133; email: ijccs.mipa@ugm.ac.id | http://jurnal.ugm.ac.id/ijccs
