Unsupervised Text Style Transfer for Authorship Obfuscation in Bahasa Indonesia

https://doi.org/10.22146/ijccs.79623

Yunita Sari(1*), Fadhlan Pasyah Al Faridzi(2)

(1) Department of Computer Science and Electronics, FMIPA UGM, Yogyakarta
(2) Bachelor Program of Computer Science, FMIPA UGM, Yogyakarta
(*) Corresponding Author

Abstract


Authorship attribution is an NLP task that identifies the author of a text through stylometric analysis. Authorship obfuscation, in contrast, aims to defeat authorship attribution by modifying a text’s style. The main challenge in authorship obfuscation is preserving the content of the text while modifying it. In this research, we apply text style transfer methods to modify the writing style while preserving the content of the input text. We implemented two unsupervised text style transfer methods, dictionary-based and back-translation, to change the formality level of the text. Experimental results show that the back-translation method outperformed the dictionary-based method. With back-translation, authorship attribution performance decreased by up to 16.15% and 23.66% in F1-score for 3 and 10 authors, respectively. With the dictionary-based method, the F1-score dropped by up to 1.99% and 11.56% for 3 and 10 authors, respectively. Evaluation of sensibleness and soundness shows that the back-translation method preserves the semantics of the obfuscated texts. Moreover, the modified texts are well-formed and inconspicuous.
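As a rough illustration of what a dictionary-based formality change can look like, the sketch below swaps informal Indonesian words for formal equivalents using a lookup table. It is a minimal sketch under stated assumptions, not the implementation used in this research: the lexicon entries, the informal-to-formal direction, and the names INFORMAL_TO_FORMAL and formalize are all hypothetical.

```python
import re

# Hypothetical informal-to-formal lexicon. The entries are illustrative only
# and are not taken from the paper's dictionary.
INFORMAL_TO_FORMAL = {
    "gak": "tidak",
    "udah": "sudah",
    "aja": "saja",
}

# Split text into alternating runs of word characters and non-word characters,
# so that spacing and punctuation are preserved exactly when tokens are rejoined.
TOKEN = re.compile(r"\w+|\W+")


def formalize(text, lexicon=INFORMAL_TO_FORMAL):
    """Replace every word found in the lexicon and leave all other tokens untouched."""
    out = []
    for token in TOKEN.findall(text):
        replacement = lexicon.get(token.lower())
        if replacement is None:
            # Not in the lexicon (or whitespace/punctuation): keep as-is.
            out.append(token)
        elif token[0].isupper():
            # Keep sentence-initial capitalization on the replacement.
            out.append(replacement.capitalize())
        else:
            out.append(replacement)
    return "".join(out)


if __name__ == "__main__":
    print(formalize("Aku gak tahu, udah malam aja."))
```

A word-level lookup of this kind can only change words that appear in the lexicon and leaves word order and sentence structure untouched, whereas back-translation rewrites the whole sentence.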

Keywords


authorship obfuscation; style transfer; formality











Copyright (c) 2023 IJCCS (Indonesian Journal of Computing and Cybernetics Systems)

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.



Copyright of:
IJCCS (Indonesian Journal of Computing and Cybernetics Systems)
ISSN 1978-1520 (print); ISSN 2460-7258 (online)
is a scientific journal of research results in Computing
and Cybernetics Systems.
A publication of IndoCEISS.


