Semantic Similarity Measurement Evaluation of KBBI Synonyms Using a Word Embedding Approach

  • Muhammad Rafli Aditya H., Departemen Teknik Informatika dan Komputer, Fakultas Teknik, Universitas Negeri Makassar, Makassar, Sulawesi Selatan 90224, Indonesia
  • Muhammad Ilham, Departemen Teknik Informatika dan Komputer, Fakultas Teknik, Universitas Negeri Makassar, Makassar, Sulawesi Selatan 90224, Indonesia
  • Dewi Fatmarani Surianto, Departemen Teknik Informatika dan Komputer, Fakultas Teknik, Universitas Negeri Makassar, Makassar, Sulawesi Selatan 90224, Indonesia
  • Abdul Muis Mappalotteng, Departemen Pendidikan Teknik Elektro, Fakultas Teknik, Universitas Negeri Makassar, Makassar, Sulawesi Selatan 90224, Indonesia
Keywords: KBBI, Word2Vec, Cosine Similarity, FastText, GloVe, Sentence-BERT, Semantic Similarity Measurement

Abstract

Kamus Besar Bahasa Indonesia (KBBI) is a primary data source for research on word-meaning similarity in Indonesian. This study investigates how effectively word embedding methods and term frequency–inverse document frequency (TF-IDF) weighting assess the semantic similarity of synonym pairs. The objective is to measure the similarity of synonym pairs listed in KBBI using cosine similarity in combination with TF-IDF weighting, several word embedding models, and latent semantic analysis (LSA). The methodology comprised data collection followed by text preprocessing: case folding, stopword removal, stemming, and tokenization. The processed data were then transformed into vector representations using TF-IDF and the word embedding models Word2Vec, fastText, GloVe, and Sentence-BERT (S-BERT). LSA was applied to reduce the dimensionality of the vectors, after which similarity was computed with cosine similarity and the results were evaluated. The findings show that fastText yielded markedly higher similarity scores between synonym pairs, with an average similarity score of 0.901 across 30 synonym pairs. The evaluation gave an accuracy of 0.88, a recall of 1.00, a precision of 0.81, and an F1 score of 0.90. These results suggest that fastText is the more effective option for measuring the meaning similarity of synonyms. Future research is encouraged to expand the corpus and to explore further the use of word embeddings for semantic similarity tasks. This study contributes to the advancement of natural language processing and offers a potential foundation for more accurate language-based applications that assess word-meaning similarity in KBBI.
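
The two measurement steps named in the abstract, cosine similarity between word vectors and evaluation by accuracy, precision, recall, and F1, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy three-dimensional vectors, and the 0.5 decision threshold are all assumptions made for illustration, since the abstract does not state how similarity scores were turned into synonym or non-synonym decisions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumed convention: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


def classification_metrics(scores, labels, threshold=0.5):
    """Threshold similarity scores and compare them with gold labels.

    `threshold` is a hypothetical cut-off, not a value taken from the study.
    """
    tp = fp = tn = fn = 0
    for score, is_synonym in zip(scores, labels):
        predicted = score >= threshold
        if predicted and is_synonym:
            tp += 1
        elif predicted and not is_synonym:
            fp += 1
        elif not predicted and is_synonym:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(labels) if labels else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy vectors standing in for embeddings of two words (made-up values).
vec_a = [0.2, 0.7, 0.1]
vec_b = [0.25, 0.65, 0.05]
similarity = cosine_similarity(vec_a, vec_b)

# Made-up similarity scores with gold labels (True = synonym pair).
scores = [0.9, 0.8, 0.4, 0.7]
labels = [True, True, False, False]
metrics = classification_metrics(scores, labels)
```

As a consistency check on the reported figures, F1 is the harmonic mean of precision and recall, and 2 × 0.81 × 1.00 / (0.81 + 1.00) ≈ 0.895, which agrees with the reported F1 score of 0.90 after rounding.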

Published
2025-05-28
How to Cite
Muhammad Rafli Aditya H., Muhammad Ilham, Dewi Fatmarani Surianto, & Abdul Muis Mappalotteng. (2025). Semantic Similarity Measurement Evaluation of KBBI Synonyms Using a Word Embedding Approach. Jurnal Nasional Teknik Elektro Dan Teknologi Informasi, 14(2), 112-120. https://doi.org/10.22146/jnteti.v14i2.17117
Section
Articles