A Multilevel and Hierarchical Approach for Multilabel Classification Model in SDGs Research
Abstract
The growing number of research publications makes it difficult to identify how research is being implemented, especially in relation to the sustainable development goals (SDGs). Research publications have not yet been categorized into SDG levels. The Center for Research and Community Service (Pusat Penelitian dan Pengabdian Masyarakat, PPPM) of Politeknik Statistika (Polstat) STIS needs such a categorization to monitor how lecturers implement the SDGs. This study aimed to implement and evaluate problem transformation methods and machine learning classification algorithms with multilevel and hierarchical approaches to categorize research publications into SDG levels. The problem transformation methods used were binary relevance, label powerset (LP), and classifier chains. The machine learning classification algorithms used were logistic regression (LR) and support vector machine (SVM). Three inputs were compared: titles only, abstracts only, and titles combined with abstracts. The best filter model, which classified data as SDG or non-SDG, used titles and SVM and achieved an accuracy of 0.8634. The best level model, which classified data into SDG levels, used titles, LP, and SVM with the multilevel approach. The level model classified data into the four pillars, goals, targets, and indicators of the SDGs, with accuracies of 0.8067, 0.7501, 0.6792, and 0.6194, respectively. Although the other inputs carry more comprehensive information, title inputs yielded the best accuracy, which was attributed to the simultaneous use of English and Indonesian. Future research could modify the model to use single-language input to optimize the term frequency-inverse document frequency (TF-IDF) process, so that words with the same meaning in the two languages are not treated as different important terms.
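The three problem transformation methods named in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the label names, the toy label sets, the single numeric feature, and all function names are hypothetical and chosen only for illustration. The sketch shows how each method reshapes a multilabel target before a base classifier such as LR or SVM would be trained on it.

```python
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple

# Hypothetical toy data: each publication carries a set of SDG goal labels.
LABELS: List[str] = ["goal_1", "goal_4", "goal_8"]
Y: List[Set[str]] = [
    {"goal_1"},
    {"goal_1", "goal_4"},
    {"goal_8"},
    {"goal_1", "goal_4"},
]
# Hypothetical one-dimensional features (stand-ins for TF-IDF vectors).
X: List[List[float]] = [[0.1], [0.2], [0.3], [0.4]]


def binary_relevance_targets(
    y: Sequence[Set[str]], labels: Sequence[str]
) -> Dict[str, List[int]]:
    """Binary relevance: one independent 0/1 target vector per label."""
    return {label: [int(label in s) for s in y] for label in labels}


def label_powerset_targets(
    y: Sequence[Set[str]],
) -> Tuple[List[int], Dict[FrozenSet[str], int]]:
    """Label powerset: each distinct label combination becomes one class."""
    classes: Dict[FrozenSet[str], int] = {}
    targets: List[int] = []
    for s in y:
        key = frozenset(s)
        if key not in classes:
            classes[key] = len(classes)  # assign the next unused class id
        targets.append(classes[key])
    return targets, classes


def classifier_chain_training_sets(
    features: Sequence[Sequence[float]],
    y: Sequence[Set[str]],
    order: Sequence[str],
) -> List[Tuple[str, List[List[float]], List[int]]]:
    """Classifier chain: the k-th label's features are extended with the
    true values of the k preceding labels (training-time view only)."""
    chain = []
    for k, label in enumerate(order):
        previous = order[:k]
        augmented = [
            list(row) + [int(p in s) for p in previous]
            for row, s in zip(features, y)
        ]
        target = [int(label in s) for s in y]
        chain.append((label, augmented, target))
    return chain


BR = binary_relevance_targets(Y, LABELS)
LP_TARGETS, LP_CLASSES = label_powerset_targets(Y)
CHAIN = classifier_chain_training_sets(X, Y, LABELS)
```

Binary relevance and classifier chains both train one binary classifier per label, while label powerset trains a single multiclass classifier over label combinations, so it can only predict combinations that appear in the training data.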
© Jurnal Nasional Teknik Elektro dan Teknologi Informasi, under the terms of the Creative Commons Attribution-ShareAlike 4.0 International License.