The Impact of Data Augmentation Techniques on Improving Speech Recognition Performance for English in Indonesian Children Based on Wav2Vec 2.0

Maimunah Maskur(1*), Amalia Zahra(2)
(1) Bina Nusantara University
(2) Bina Nusantara University
(*) Corresponding Author
Abstract
Early childhood education is a crucial phase in shaping children's character and language skills. This study develops an Automatic Speech Recognition (ASR) model to recognize the speech of Indonesian children speaking English. The process begins with collecting and processing a dataset of children's speech recordings, which is then expanded using data augmentation techniques to increase pronunciation variation. The pre-trained Wav2Vec 2.0 ASR model is fine-tuned on both the original and the augmented datasets. Evaluation using Word Error Rate (WER) and Character Error Rate (CER) shows a significant accuracy improvement: WER decreases from 53% to 45% and CER from 33% to 27%, a relative improvement of approximately 15%. Further analysis reveals pronunciation errors in phonemes such as /ð/, /θ/, /r/, /v/, /z/, and /ʃ/, which are uncommon in Indonesian, manifesting as substitutions, omissions, or additions in words such as "three," "that," "rabbit," "very," and "zebra." These findings highlight the need for targeted phoneme training, audio-based approaches with ASR feedback, and the listen-and-repeat technique in English language instruction for children.
Keywords— Early childhood education, Automatic Speech Recognition, Augmentation, Character Error Rate, Word Error Rate
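The paper does not publish its code, so the sketch below only illustrates the kind of pipeline the abstract describes: waveform-level augmentation (speed perturbation and additive noise, assumed here via torchaudio), transcription with a Wav2Vec 2.0 CTC model from Hugging Face Transformers (a public checkpoint stands in for the fine-tuned model), and WER/CER scoring with jiwer. The checkpoint name, audio file, and reference transcript are hypothetical placeholders.

```python
# A minimal, hypothetical sketch (not the authors' published code): augment a
# child-speech clip, transcribe it with a Wav2Vec 2.0 CTC model, and score the
# output with WER/CER. The checkpoint, file name, and reference transcript are
# placeholders, and the transform choices are assumptions.
import torch
import torchaudio
import jiwer
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "facebook/wav2vec2-base-960h"  # stand-in for the fine-tuned model
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).eval()


def speed_perturb(wav: torch.Tensor, sr: int, factor: float) -> torch.Tensor:
    """Change speaking rate by resampling while keeping the nominal rate."""
    return torchaudio.functional.resample(wav, orig_freq=sr, new_freq=int(sr / factor))


def add_noise(wav: torch.Tensor, snr_db: float = 20.0) -> torch.Tensor:
    """Mix in white noise at a fixed signal-to-noise ratio."""
    noise_power = wav.pow(2).mean() / (10 ** (snr_db / 10))
    return wav + torch.randn_like(wav) * noise_power.sqrt()


def transcribe(wav: torch.Tensor, sr: int) -> str:
    """Greedy CTC decoding of a single utterance."""
    if sr != 16_000:  # Wav2Vec 2.0 expects 16 kHz input
        wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=16_000)
    mono = wav.mean(dim=0) if wav.dim() > 1 else wav
    inputs = processor(mono.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    return processor.batch_decode(torch.argmax(logits, dim=-1))[0].lower()


if __name__ == "__main__":
    wav, sr = torchaudio.load("child_sample.wav")  # hypothetical recording
    reference = "the rabbit is very fast"          # hypothetical transcript
    for name, variant in [("original", wav),
                          ("speed 1.1x", speed_perturb(wav, sr, 1.1)),
                          ("noisy", add_noise(wav))]:
        hypothesis = transcribe(variant, sr)
        print(f"{name:10s} WER={jiwer.wer(reference, hypothesis):.2f} "
              f"CER={jiwer.cer(reference, hypothesis):.2f}")
```

Note that speed perturbation is implemented here by resampling and keeping the nominal sample rate, which changes tempo and pitch together; the paper's actual augmentation operations and fine-tuned checkpoint may differ, so this serves only to show how WER/CER gains from augmentation would be measured.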

Copyright (c) 2025 IJCCS (Indonesian Journal of Computing and Cybernetics Systems)

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.