Author Obfuscation on Indonesian News Articles Using Genetic Algorithms

Rayhan Naufal Ramadhan; Yunita Sari; Aina Musdholifah

doi:10.22146/ijccs.64526


Rayhan Naufal Ramadhan⁽¹*⁾, Yunita Sari⁽²⁾, Aina Musdholifah⁽³⁾

(1) Undergraduate Program of Computer Science, FMIPA UGM, Yogyakarta
(2) Department of Computer Science and Electronics, FMIPA UGM, Yogyakarta
(3) Department of Computer Science and Electronics, FMIPA UGM, Yogyakarta
(*) Corresponding Author

Abstract

Authorship attribution is a method for identifying the author of a text from a group of potential authors, and it can be used to unmask anonymous authors. Such a method threatens the privacy of anyone who wishes to write anonymously. To address this issue, author obfuscation has been proposed: it modifies a text to disguise its author.

In this research, a genetic algorithm-based author obfuscation model was created to modify Indonesian news articles so that they avoid identification by authorship attribution while preserving their meaning. The model iteratively changes some words in an article using crossover and mutation, guided by a fitness function that combines the identification probability and the similarity to the original article.
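
The loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `SYNONYMS` table, the marker-word `attribution_prob` stub, the position-overlap `similarity`, the weight `alpha`, and all hyperparameters are hypothetical stand-ins for components the abstract does not specify.

```python
import random

# Hypothetical word-substitution table (the abstract does not name the real source).
SYNONYMS = {
    "besar": ["raya", "agung"],
    "kata": ["ujar", "ucap"],
    "cepat": ["lekas", "kilat"],
}

# Assumed toy attribution model: the author is "recognised" by these marker words.
AUTHOR_MARKERS = {"besar", "kata", "cepat"}


def attribution_prob(words):
    """Toy identification probability: share of marker words still present."""
    return len(AUTHOR_MARKERS & set(words)) / len(AUTHOR_MARKERS)


def similarity(original, candidate):
    """Assumed proxy for semantic similarity: fraction of unchanged positions."""
    same = sum(a == b for a, b in zip(original, candidate))
    return same / len(original)


def fitness(original, candidate, alpha=0.75):
    """Higher is better: low identification probability, high similarity."""
    return (alpha * (1.0 - attribution_prob(candidate))
            + (1.0 - alpha) * similarity(original, candidate))


def mutate(words, rng, rate=0.3):
    """Replace some substitutable words with a randomly chosen alternative."""
    out = list(words)
    for i, word in enumerate(out):
        if word in SYNONYMS and rng.random() < rate:
            out[i] = rng.choice(SYNONYMS[word])
    return out


def crossover(a, b, rng):
    """Single-point crossover; parents are position-aligned and equally long."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def obfuscate(text, generations=30, pop_size=20, seed=0):
    """Evolve word-substituted variants of `text` and return the fittest one."""
    rng = random.Random(seed)
    original = text.split()
    population = [mutate(original, rng) for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the fitter half as parents, then refill with mutated offspring.
        population.sort(key=lambda c: fitness(original, c), reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    best = max(population, key=lambda c: fitness(original, c))
    return " ".join(best)


text = "kata dia kota besar itu tumbuh cepat"
result = obfuscate(text)
```

In this toy setup, `alpha` trades off evading the attribution stub against staying close to the original; a real system would presumably replace both stubs with a trained classifier and an embedding-based similarity measure.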

The model is evaluated on three parameters: safety, soundness, and sensibleness. Its safety is good against the targeted model: it reduces the accuracy of the given authorship attribution model by 0.3018, although the reduction drops to 0.1179 when tested on different models. Its soundness is fairly good, with the similarity between the modified and original articles reaching 0.7817. For sensibleness, the model scored 2.571 on a scale of 0 to 4, which indicates that some articles are grammatically acceptable but a considerable number are messy.
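
As a rough illustration of how such figures could be computed, the sketch below treats safety as the drop in attribution accuracy after obfuscation, and soundness and sensibleness as simple averages. The averaging, the function names, and the toy data are assumptions made for illustration; the abstract does not state the exact formulas.

```python
def accuracy(true_authors, predicted_authors):
    """Share of articles whose predicted author matches the true author."""
    correct = sum(t == p for t, p in zip(true_authors, predicted_authors))
    return correct / len(true_authors)


def safety(true_authors, preds_before, preds_after):
    """Accuracy drop caused by obfuscation (larger means safer)."""
    return accuracy(true_authors, preds_before) - accuracy(true_authors, preds_after)


def mean(values):
    """Arithmetic mean, assumed here for soundness and sensibleness scores."""
    return sum(values) / len(values)


# Toy data, invented purely for this example.
true_authors = ["a", "b", "c", "d"]
preds_before = ["a", "b", "c", "a"]   # 3 of 4 correct on the original articles
preds_after = ["a", "c", "c", "a"]    # 2 of 4 correct on the obfuscated articles

safety_score = safety(true_authors, preds_before, preds_after)
soundness_score = mean([0.8, 0.7, 0.9, 0.6])   # per-article similarity to the original
sensibleness_score = mean([3, 2, 4, 1])        # per-article ratings on a 0 to 4 scale
```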

Keywords

author obfuscation; authorship attribution; genetic algorithm




This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Copyright of: IJCCS (Indonesian Journal of Computing and Cybernetics Systems), ISSN 1978-1520 (print); ISSN 2460-7258 (online), is a scientific journal of the results of Computing and Cybernetics Systems.
A publication of IndoCEISS. Gedung S1 Ruang 416 FMIPA UGM, Sekip Utara, Yogyakarta 55281. Fax: +62274 555133. Email: ijccs.mipa@ugm.ac.id | http://jurnal.ugm.ac.id/ijccs
