Information Retrieval for Early Detection of Disease Using Semantic Similarity

Aszani Aszani(1), Hayyu Ilham Wicaksono(2*), Uffi Nadzima(3), Lukman Heryawan(4)

(1) Department of Computer Science and Electronics, FMIPA UGM, Yogyakarta
(2) Department of Computer Science and Electronics, FMIPA UGM, Yogyakarta
(3) Department of Computer Science and Electronics, FMIPA UGM, Yogyakarta
(4) Department of Computer Science and Electronics, FMIPA UGM, Yogyakarta
(*) Corresponding Author


The volume of medical records continues to grow, and these records can be used to help doctors diagnose disease. This study proposes an information retrieval method that returns diagnostic recommendations based on symptoms in a medical record dataset, applying the TF-IDF and cosine similarity methods. The main challenge was that the symptoms in the dataset were dirty data, obtained from patients who were not familiar with biological terms. The symptoms in the medical record data were therefore matched with the symptom terms used in the system, and the matched results were used for data augmentation, increasing the amount of data by up to about three times. With TF-IDF, the highest accuracy with  is only , while after augmentation of the test data the accuracy becomes . With the cosine similarity method, the highest accuracy at the same  value is , increasing to  with the augmented test data. The study concludes that a system given sufficient and relevant symptom input provides more accurate disease predictions. Predictions using the TF-IDF method with  are more accurate than predictions using the cosine similarity method.
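
As a rough illustration of the kind of retrieval described above, the following minimal sketch ranks candidate diagnoses for a free-text symptom query. It is not the authors' implementation: the toy records, the function names (`build_idf`, `tfidf_vector`, `cosine`, `recommend`), the smoothed IDF formula, and the whitespace tokenizer are all assumptions made for illustration. The sketch also pairs TF-IDF weighting with cosine scoring in a single pipeline, which is a common arrangement but may differ from how the study separates and compares the two methods.

```python
import math
from collections import Counter

# Hypothetical toy records (symptom text, diagnosis); not taken from the paper's dataset.
RECORDS = [
    ("fever cough sore throat", "influenza"),
    ("itchy skin red rash", "dermatitis"),
    ("stomach pain nausea vomiting", "gastritis"),
]


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return text.lower().split()


def build_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in the corpus are ignored."""
    tf = Counter(t for t in tokens if t in idf)
    return {term: freq * idf[term] for term, freq in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(query, records, k=2):
    """Return the k (score, diagnosis) pairs most similar to the query symptoms."""
    docs = [tokenize(symptoms) for symptoms, _ in records]
    idf = build_idf(docs)
    doc_vectors = [tfidf_vector(doc, idf) for doc in docs]
    query_vector = tfidf_vector(tokenize(query), idf)
    scored = [
        (cosine(query_vector, vec), diagnosis)
        for vec, (_, diagnosis) in zip(doc_vectors, records)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    for score, diagnosis in recommend("cough and fever", RECORDS, k=2):
        print(f"{diagnosis}: {score:.3f}")
```

In this sketch a query that shares no vocabulary with any record scores zero against every record, which mirrors the abstract's point that patient wording which does not match the system's symptom terms limits retrieval quality.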


Cosine Similarity; Data Augmentation; Disease Detection; Information Retrieval; TF-IDF

Copyright (c) 2023 IJCCS (Indonesian Journal of Computing and Cybernetics Systems)

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
