Combination of Hilbert Multispectrum and Cochleagram Features for Speech Emotion Identification

  • Agustinus Bimo Gumelar, Institut Teknologi Sepuluh Nopember
  • Eko Mulyanto Yuniarno, Institut Teknologi Sepuluh Nopember
  • Wiwik Anggraeni, Institut Teknologi Sepuluh Nopember
  • Indar Sugiarto, Universitas Kristen Petra
  • Andreas Agung Kristanto, Universitas Mulawarman
  • Mauridhi Hery Purnomo, Institut Teknologi Sepuluh Nopember, https://orcid.org/0000-0002-6221-7382
Keywords: Speech Emotion, Feature Combination, Convolutional Neural Networks (CNN), Cochleagram, Hilbert Spectrum, Deep Learning

Abstract

In social interaction, the human voice is one of the primary channels carrying the expressive attributes of a person's emotional and mental state. Human speech is the product of vocal articulation arranged word by word into sentences, forming speech patterns that express the speaker's psychological condition. These patterns provide distinctive characteristics for biometric identification based on speech, and visualization techniques in the form of spectrum images have proven able to represent the results of speech signal processing. This paper identifies the type of emotion in speech using a combination of Hilbert multispectrum and cochleagram features. The Hilbert spectrum represents the output of the Hilbert-Huang transform (HHT), which processes the nonlinear and nonstationary emotional speech signal instantaneously through its intrinsic mode functions. By mimicking the way the outer and middle ear components work, the emotional speech signal is naturally decomposed into different frequencies, represented as a cochleagram. Both spectrum inputs are processed with Convolutional Neural Networks (CNN), known to excel at recognizing image data because they mirror the working mechanism of the human retina, and with a Long Short-Term Memory (LSTM) network. In experiments on three public speech emotion datasets spanning eight emotion classes, the method achieved an accuracy of 90.97% with the CNN and 80.62% with the LSTM.
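The abstract does not include the authors' implementation. As an illustration only, the sketch below shows how a Hilbert spectrum of the kind described (empirical mode decomposition into intrinsic mode functions, then instantaneous amplitude and frequency from the analytic signal) can be computed in Python. It assumes the third-party PyEMD package; the function and parameter names are ours, not the paper's.

```python
# Illustrative sketch only (not the authors' code): Hilbert spectral analysis
# of a signal via EMD + the analytic signal. Assumes PyEMD is installed
# (pip install EMD-signal) alongside NumPy and SciPy.
import numpy as np
from scipy.signal import hilbert
from PyEMD import EMD

def hilbert_spectrum(signal, fs):
    """Return per-IMF instantaneous amplitude and frequency (Hz)."""
    imfs = EMD().emd(signal)                  # empirical mode decomposition
    analytic = hilbert(imfs, axis=-1)         # analytic signal of each IMF
    amp = np.abs(analytic)                    # instantaneous amplitude
    phase = np.unwrap(np.angle(analytic), axis=-1)
    freq = np.diff(phase, axis=-1) * fs / (2.0 * np.pi)  # inst. frequency
    return amp[:, :-1], freq                  # trim one sample after np.diff

# Toy usage on a two-tone test signal instead of real speech.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 1760 * t)
amp, freq = hilbert_spectrum(x, fs)
print(amp.shape, freq.shape)                  # (n_imfs, n_samples - 1)
```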
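Similarly, a cochleagram can be approximated with a gammatone filterbank whose center frequencies are spaced on the ERB scale. The minimal sketch below uses the Patterson-Holdsworth gammatone impulse response (as popularized by Slaney's implementation) with band energies averaged per frame and log-compressed; the channel count and frame length are our assumptions, not the paper's settings.

```python
# Illustrative cochleagram sketch (not the authors' code): a 4th-order
# gammatone filterbank on the ERB scale, per-frame band energy, log compression.
import numpy as np
from scipy.signal import fftconvolve

def erb_space(f_min, f_max, n):
    """Center frequencies equally spaced on the ERB-rate scale."""
    e = np.linspace(21.4 * np.log10(1 + 0.00437 * f_min),
                    21.4 * np.log10(1 + 0.00437 * f_max), n)
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437

def gammatone_ir(fc, fs, dur=0.05, order=4, b=1.019):
    """Patterson-Holdsworth gammatone impulse response at center frequency fc."""
    t = np.arange(int(dur * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)   # Glasberg-Moore ERB(fc)
    return (t ** (order - 1) * np.exp(-2 * np.pi * b * erb * t)
            * np.cos(2 * np.pi * fc * t))

def cochleagram(x, fs, n_channels=64, f_min=50.0, frame=0.02):
    """Log band energies per frame across the gammatone channels."""
    hop = int(frame * fs)
    n_frames = len(x) // hop
    cg = np.empty((n_channels, n_frames))
    for i, fc in enumerate(erb_space(f_min, 0.45 * fs, n_channels)):
        y = fftconvolve(x, gammatone_ir(fc, fs), mode="same")
        cg[i] = (y[: n_frames * hop].reshape(n_frames, hop) ** 2).mean(axis=1)
    return np.log10(cg + 1e-10)               # log-compressed band energies
```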
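For the classification stage, the abstract names CNN and LSTM models but not their architectures. The sketch below is a minimal Keras baseline under assumed settings: spectrum images resized to 128x128 with two channels (Hilbert spectrum and cochleagram stacked) and the eight emotion classes mentioned; all layer sizes are illustrative.

```python
# Minimal classifier sketches (architectures assumed, not from the paper).
import tensorflow as tf

def build_cnn(input_shape=(128, 128, 2), n_classes=8):
    """CNN over stacked Hilbert-spectrum + cochleagram images."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

def build_lstm(time_steps=128, n_features=64, n_classes=8):
    """LSTM over the same features read as a time sequence."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(time_steps, n_features)),
        tf.keras.layers.LSTM(128),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

model = build_cnn()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```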

References

C. Saint-Georges, M. Chetouani, R. Cassel, F. Apicella, A. Mahdhaoui, F. Muratori, M.-C. Laznik, and D. Cohen, “Motherese in Interaction: At the Cross-Road of Emotion and Cognition? (A Systematic Review),” PLoS One, Vol. 8, No. 10, pp. 1–17, 2013.

R.J. Sternberg and J.S. Mio, Cognitive Psychology, 4th ed., Belmont, USA: Wadsworth, 2005.

J.E. Shackman and S.D. Pollak, “Experiential Influences on Multimodal Perception of Emotion,” Child Dev., Vol. 76, No. 5, pp. 1116–1126, Sep. 2005.

R. Plutchik, “The Nature of Emotions: Human Emotions Have Deep Evolutionary Roots, a Fact that May Explain Their Complexity and Provide Tools for Clinical Practice,” Am. Sci., Vol. 89, No. 4, pp. 344–350, 2001.

M. Gendron, D. Roberson, J.M. van der Vyver, and L.F. Barrett, “Perceptions of Emotion from Facial Expressions are not Culturally Universal: Evidence from a Remote Culture,” Emotion, Vol. 14, No. 2, pp. 251–262, 2014.

C.-H. Wu, J.-F. Yeh, and Z.-J. Chuang, “Emotion Perception and Recognition from Speech,” in Affective Information Processing, J. Tao and T. Tan, Eds., London, UK: Springer London, 2009, pp. 93–110.

V. Arora, A. Lahiri, and H. Reetz, “Phonological Feature-based Speech Recognition System for Pronunciation Training in Non-native Language Learning,” J. Acoust. Soc. Am., Vol. 143, No. 1, pp. 98–108, 2018.

B.J. Mohan and N.R. Babu, “Speech Recognition Using MFCC and DTW,” 2014 Int. Conf. Adv. Electr. Eng. (ICAEE), 2014, pp. 1–4.

R. Sharma, R.K. Bhukya, and S.R.M. Prasanna, “Analysis of the Hilbert Spectrum for Text-Dependent Speaker Verification,” Speech Commun., Vol. 96, pp. 207–224, 2018.

A.R. Avila, S.R. Kshirsagar, A. Tiwari, D. Lafond, D. O’Shaughnessy, and T.H. Falk, “Speech-based Stress Classification Based on Modulation Spectral Features and Convolutional Neural Networks,” Eur. Signal Process. Conf., 2019, pp. 1–5.

D.P. Adi, A.B. Gumelar, and R.P. Arta Meisa, “Interlanguage of Automatic Speech Recognition,” 2019 International Seminar on Application for Technology of Information and Communication (iSemantic), 2019, pp. 88–93.

H.L. Hawkins, T.A. McMullen, A.N. Popper, and R.R. Fay, Auditory Computation, New York, USA: Springer New York, 1996.

K. Venkataramanan and H.R. Rajamohan, “Emotion Recognition from Speech,” arXiv preprint arXiv:1912.10458, pp. 1–14, Dec. 2019.

C. Etienne, G. Fidanza, A. Petrovskii, L. Devillers, and B. Schmauch, “CNN+LSTM Architecture for Speech Emotion Recognition with Data Augmentation,” arXiv preprint arXiv:1802.05630, pp. 1–5, Feb. 2018.

J. Zhao, X. Mao, and L. Chen, “Speech Emotion Recognition Using Deep 1D & 2D CNN LSTM Networks,” Biomed. Signal Process. Control, Vol. 47, pp. 312–323, Jan. 2019.

S. Basu, J. Chakraborty, and M. Aftabuddin, “Emotion Recognition from Speech Using Convolutional Neural Network with Recurrent Neural Network Architecture,” 2017 2nd International Conference on Communication and Electronics Systems (ICCES), 2017, pp. 333–336.

M. Slaney and R.F. Lyon, “On the Importance of Time - A Temporal Representation of Sound,” in Visual Representations of Speech Signals, M. Cooke, S. Beet, and M. Crawford, Eds., Hoboken, USA: John Wiley & Sons Ltd, 1993, pp. 95–116.

V.K. Rai and A.R. Mohanty, “Bearing Fault Diagnosis Using FFT of Intrinsic Mode Functions in Hilbert-Huang Transform,” Mech. Syst. Signal Process., Vol. 21, No. 6, pp. 2607–2615, Aug. 2007.

A.B. Gumelar, A. Kurniawan, A.G. Sooai, M.H. Purnomo, E.M. Yuniarno, I. Sugiarto, A. Widodo, A.A. Kristanto, and T.M. Fahrudin, “Human Voice Emotion Identification Using Prosodic and Spectral Feature Extraction Based on Deep Neural Networks,” 2019 IEEE 7th International Conference on Serious Games and Applications for Health (SeGAH), 2019, pp. 1–8.

Y.K. Muthusamy, R.A. Cole, and M. Slaney, “Speaker-independent Vowel Recognition: Spectrograms Versus Cochleagrams,” International Conference on Acoustics, Speech, and Signal Processing, 1990, pp. 533–536.

S. Sandoval, P.L. de Leon, and J.M. Liss, “Hilbert Spectral Analysis of Vowels Using Intrinsic Mode Functions,” 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2015, pp. 569–575.

S. Camacho, D. Renza, and D.M. Ballesteros L., “A Semi-supervised Speaker Identification Method for Audio Forensics Using Cochleagrams,” in Applied Computer Sciences in Engineering: 4th Workshop on Engineering Applications (WEA 2017), J.C. Figueroa-García, E.R. López-Santana, J.L. Villa-Ramírez, and R. Ferro-Escobar, Eds., Cham, Switzerland: Springer, 2017, pp. 55–64.

B. Gao, W.L. Woo, and L.C. Khor, “Cochleagram-based Audio Pattern Separation Using Two-Dimensional Non-negative Matrix Factorization with Automatic Sparsity Adaptation,” J. Acoust. Soc. Am., Vol. 135, No. 3, pp. 1171–1185, Mar. 2014.

X.-L. Zhang and D.L. Wang, “Boosted Deep Neural Networks and Multi-Resolution Cochleagram Features for Voice Activity Detection,” Proc. Annu. Conf. Int. Speech Commun. Assoc. (INTERSPEECH), 2014, pp. 1534–1538.

R.V. Sharan and T.J. Moir, “Cochleagram Image Feature for Improved Robustness in Sound Recognition,” Int. Conf. Digit. Signal Process. (DSP), 2015, pp. 441–444.

C. Darwin, “Computational Auditory Scene Analysis: Principles, Algorithms and Applications,” J. Acoust. Soc. Am., Vol. 124, No. 1, pp. 13–13, Jul. 2008.

M.K.I. Molla and K. Hirose, “Single-Mixture Audio Source Separation by Subspace Decomposition of Hilbert Spectrum,” IEEE Trans. Audio, Speech, Lang. Process., Vol. 15, No. 3, pp. 893–900, Mar. 2007.

R.D. Patterson, K. Robinson, J. Holdsworth, D. McKeown, C. Zhang, and M. Allerhand, “Complex Sounds and Auditory Images,” Proceedings of the 9th International Symposium on Hearing, 1991, pp. 429–446.

H. Yin, V. Hohmann, and C. Nadeu, “Acoustic Features for Speech Recognition Based on Gammatone Filterbank and Instantaneous Frequency,” Speech Commun., Vol. 53, No. 5, pp. 707–715, May 2011.

M. Russo, M. Stella, M. Sikora, and V. Pekić, “Robust Cochlear-Model-Based Speech Recognition,” Computers, Vol. 8, No. 1, pp. 1–15, Jan. 2019.

M. Slaney, “An Efficient Implementation of the Patterson-Holdsworth Auditory Filter Bank,” Apple Computer, Inc., Cupertino, USA, Technical Report, 1993.

N.E. Huang, Z. Shen, S.R. Long, M.C. Wu, H.H. Shih, Q. Zheng, N.-C. Yen, C.C. Tung, and H.H. Liu, “The Empirical Mode Decomposition and the Hilbert Spectrum for Non-linear and Non-stationary Time Series Analysis,” Proc. R. Soc. A Math. Phys. Eng. Sci., Vol. 454, No. 1971, pp. 903–995, 1998.

H. Huang and X.-X. Chen, “Speech Formant Frequency Estimation Based on Hilbert-Huang Transform,” Journal of Zhejiang University (Engineering Science), Vol. 40, pp. 1926–1930, 2006.

H. Huang and J. Pan, “Speech Pitch Determination Based on Hilbert-Huang Transform,” Signal Processing, Vol. 86, No. 4, pp. 792–803, Apr. 2006.

X. Li and X. Li, “Speech Emotion Recognition Using Novel HHT-TEO Based Features,” J. Comput., Vol. 6, No. 5, pp. 989–998, May 2011.

A.B. Gumelar, M.H. Purnomo, E.M. Yuniarno, and I. Sugiarto, “Spectral Analysis of Familiar Human Voice Based on Hilbert-Huang Transform,” 2018 International Conference on Computer Engineering, Network and Intelligent Multimedia (CENIM), 2018, pp. 311–316.

N.E. Huang and S.S.P. Shen, Hilbert-Huang Transform and Its Applications, Singapore: World Scientific, 2005.

S.K. Phan and C. Chen, “Big Data and Monitoring the Grid,” in The Power Grid, B.W. D’Andrade, Ed., Cambridge, USA: Academic Press, 2017, pp. 253–285.

S.R. Livingstone and F.A. Russo, “The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A Dynamic, Multimodal Set of Facial and Vocal Expressions in North American English,” PLoS One, Vol. 13, No. 5, pp. 1–35, 2018.

K. Dupuis and M.K. Pichora-Fuller, “Toronto Emotional Speech Set (TESS),” Scholars Portal Dataverse, 2010.

X.-L. Zhang and J. Wu, “Deep Belief Networks Based Voice Activity Detection,” IEEE Trans. Audio, Speech, Lang. Process., Vol. 21, No. 4, pp. 697–710, Apr. 2013.

L. Mateju, P. Cerva, and J. Zdansky, “Study on the Use of Deep Neural Networks for Speech Activity Detection in Broadcast Recordings,” Proceedings of the 13th International Joint Conference on e-Business and Telecommunications, 2016, pp. 45–51.

S. Thomas, S. Ganapathy, G. Saon, and H. Soltau, “Analyzing Convolutional Neural Networks for Speech Activity Detection in Mismatched Acoustic Conditions,” 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 2519–2523.

L. Deng and J. Platt, “Ensemble Deep Learning for Speech Recognition,” Proc. Interspeech, 2014, pp. 1–5.

T. Xu, H. Li, H. Zhang, and X. Zhang, “Improve Data Utilization with Two-stage Learning in CNN-LSTM-based Voice Activity Detection,” 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2019, pp. 1185–1189.

T.N. Sainath, O. Vinyals, A. Senior, and H. Sak, “Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks,” 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4580–4584.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, and D. Cournapeau, “Scikit-learn: Machine Learning in Python,” J. Mach. Learn. Res., Vol. 12, pp. 2825–2830, 2011.

S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Comput., Vol. 9, No. 8, pp. 1735–1780, Nov. 1997.

F.A. Gers, “Learning to Forget: Continual Prediction with LSTM,” 9th International Conference on Artificial Neural Networks (ICANN ’99), 1999, pp. 850–855.

J. LeDoux and A.R. Damasio, “Emotions and Feelings,” in Principles of Neural Science, 5th ed., E.R. Kandel, J.H. Schwartz, T.M. Jessell, S.A. Siegelbaum, and A.J. Hudspeth, Eds., New York, USA: McGraw-Hill, 2013, pp. 1079–1094.

J.E. LeDoux and R. Brown, “A Higher-order Theory of Emotional Consciousness,” Proc. Natl. Acad. Sci. USA, Vol. 114, No. 10, pp. E2016–E2025, Mar. 2017.

J.E. LeDoux, “Semantics, Surplus Meaning, and the Science of Fear,” Trends Cogn. Sci., Vol. 21, No. 5, pp. 303–306, May 2017.

K.R. Scherer, E. Clark-Polner, and M. Mortillaro, “In the Eye of the Beholder? Universality and Cultural Specificity in the Expression and Perception of Emotion,” Int. J. Psychol., Vol. 46, No. 6, pp. 401–435, Dec. 2011.

Published
2020-05-29
How to Cite
Gumelar, A. B., Yuniarno, E. M., Anggraeni, W., Sugiarto, I., Kristanto, A. A., & Purnomo, M. H. (2020). Kombinasi Fitur Multispektrum Hilbert dan Cochleagram untuk Identifikasi Emosi Wicara. Jurnal Nasional Teknik Elektro dan Teknologi Informasi, 9(2), 180-189. https://doi.org/10.22146/jnteti.v9i2.166
Section
Articles