Exploring Pre-Trained Model and Language Model for Translating Image to Bahasa
Ade Nurhopipah(1*), Jali Suhaman(2), Anan Widianto(3)
(1) Department of Informatics, Universitas Amikom Purwokerto
(2) Department of Informatics, Universitas Amikom Purwokerto
(3) Department of Information Technology, Universitas Amikom Purwokerto
(*) Corresponding Author
Abstract
In the last decade, there have been significant developments in Image Caption Generation research to translate images into English descriptions. The task has also been applied to produce descriptions in languages other than English, including Bahasa. However, references for this task are still limited, so opportunities for exploration remain wide open. This paper presents a comparative study of several state-of-the-art Deep Learning algorithms for extracting image features and generating their descriptions in Bahasa. We extracted image features using three pre-trained models, namely InceptionV3, Xception, and EfficientNetV2S. For the language model, we examined four architectures: LSTM, GRU, Bidirectional LSTM, and Bidirectional GRU. The dataset used was Flickr8k, translated into Bahasa. Models were evaluated using BLEU and Meteor. Among the pre-trained models, EfficientNetV2S gave significantly higher scores than the others. Among the language models, on the other hand, the differences in performance were slight, although the Bidirectional GRU generally scored higher. We also found that the step size used in training affected overfitting: larger step sizes tended to give better generalization. The best model was obtained with EfficientNetV2S and a Bidirectional GRU at step size = 4096, which achieved average scores of BLEU-1 = 0.5828 and Meteor = 0.4520.
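To make the pipeline described above concrete, the following is a minimal sketch in Keras, not the authors' published code: a frozen, ImageNet-pre-trained EfficientNetV2S supplies pooled image features, and a Bidirectional GRU language model predicts the next caption token. The vocabulary size, maximum caption length, layer widths, and the "merge" style of combining the image and text branches are illustrative assumptions.

```python
# Minimal sketch of a pre-trained-CNN + Bidirectional-GRU caption model.
# Hyperparameters and the "merge" architecture are assumptions, not the paper's exact setup.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 8000   # assumed size of the Bahasa Flickr8k vocabulary
MAX_LEN    = 35     # assumed maximum caption length (in tokens)
EMBED_DIM  = 256
UNITS      = 256

# 1) Image encoder: EfficientNetV2S pre-trained on ImageNet, global-average pooled,
#    used as a fixed feature extractor.
cnn = tf.keras.applications.EfficientNetV2S(
    include_top=False, weights="imagenet", pooling="avg")
cnn.trainable = False

# 2) Caption model: image features + partial caption -> probability of the next word.
img_features = layers.Input(shape=(cnn.output_shape[-1],), name="image_features")
img_dense = layers.Dense(UNITS, activation="relu")(layers.Dropout(0.5)(img_features))

caption_in = layers.Input(shape=(MAX_LEN,), name="caption_tokens")
emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(caption_in)
seq = layers.Bidirectional(layers.GRU(UNITS // 2))(layers.Dropout(0.5)(emb))

merged = layers.add([img_dense, seq])          # fuse visual and textual branches
hidden = layers.Dense(UNITS, activation="relu")(merged)
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)

caption_model = Model([img_features, caption_in], next_word)
caption_model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
caption_model.summary()
```

At inference time a caption would be generated token by token, feeding back the words predicted so far, and the resulting sentences can then be scored against the reference Bahasa captions with BLEU-1 to BLEU-4 and Meteor (for example via NLTK's corpus_bleu and meteor_score).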
DOI: https://doi.org/10.22146/ijccs.76389
Copyright (c) 2023 IJCCS (Indonesian Journal of Computing and Cybernetics Systems)
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.