Improving Phoneme to Viseme Mapping for Indonesian Language

Anung Rachman(1*), Risanuri Hidayat(2), Hanung Adi Nugroho(3)

(1) Universitas Gadjah Mada; Institut Seni Indonesia (ISI) Surakarta
(2) Universitas Gadjah Mada
(3) Universitas Gadjah Mada
(*) Corresponding Author


Lip synchronization for animation can be automated through a phoneme-to-viseme map. Because the complexity of the facial muscles makes mouth shapes vary greatly, phoneme-to-viseme mapping poses persistent challenges. One of them is the vowel-allophone problem: since allophones resemble one another, many researchers cluster them into a single class. This paper examines whether vowel allophones should be treated as distinct variables in a phoneme-to-viseme map. As the proposed method, vowel allophones are pre-processed by extracting their formant frequencies, which are then compared with a t-test to determine whether their differences are significant. The pre-processing results serve as the initial reference data for building the phoneme-to-viseme map. The study was conducted on maps and allophones of the Indonesian language. The resulting map was compared with other maps using the HMM method in terms of word correctness and accuracy. The results show that viseme mapping preceded by allophone pre-processing yields more accurate map performance than the other maps.
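The pre-processing step described above, comparing formant frequencies of two vowel allophones with a t-test, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the F1 values below are hypothetical placeholder measurements, and in practice they would be extracted from recorded speech (e.g. by LPC analysis) before testing.

```python
import numpy as np
from scipy import stats

# Hypothetical first-formant (F1) measurements in Hz for two vowel
# allophones, e.g. a close-mid [e] and its open-mid allophone.
# Real values would come from formant extraction of recorded speech.
f1_allophone_a = np.array([390.0, 410.0, 405.0, 398.0, 415.0, 402.0, 395.0, 408.0])
f1_allophone_b = np.array([545.0, 560.0, 552.0, 548.0, 566.0, 558.0, 550.0, 562.0])

# Independent-samples t-test: is the F1 difference significant?
t_stat, p_value = stats.ttest_ind(f1_allophone_a, f1_allophone_b)

# A small p-value (< 0.05) suggests the two allophones are acoustically
# distinct, so they may warrant separate viseme classes rather than
# being clustered into one.
significant = p_value < 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4g}, distinct: {significant}")
```

In this sketch a significant result argues for keeping the allophones as separate variables when building the phoneme-to-viseme map, mirroring the paper's use of pre-processing results as initial reference data.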


Phoneme-to-Viseme Mapping; Allophones; Vowels; Formant Frequencies; Lip-Reading







Copyright (c) 2020 IJITEE (International Journal of Information Technology and Electrical Engineering)

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

ISSN: 2550-0554 (online)