Spectrum Feature Combination of Hilbert and Cochleagram for Speech Emotion Identification
Abstract
In human social interaction, the voice is one of the channels through which the emotional expression of mental states is conveyed. The voice carries speech, organized as word sequences, and produces a speech pattern that reflects the speaker's psychological condition. This pattern provides distinctive characteristics that can also be exploited in biometric identification. Spectrum image visualization techniques are employed to represent the speech signal. This study aims to identify the types of emotion in the human voice using a combination of two spectral features: the Hilbert spectrum and the cochleagram. The Hilbert spectrum is obtained from the Hilbert-Huang Transform (HHT), which decomposes the non-linear, non-stationary emotional speech signal into intrinsic mode functions and extracts its instantaneous frequencies. The cochleagram is obtained by imitating the function of the outer and middle ear, decomposing the emotional speech signal into frequency bands and representing their energy over time as an image. The two spectrum images are processed with a Convolutional Neural Network (CNN), which is well known for recognizing image data because it mimics the mechanism of the human retina, and with the Long Short-Term Memory (LSTM) method. Using three public speech emotion datasets, each with the same eight emotion classes, the experiments obtained an accuracy of 90.97% with the CNN and 80.62% with the LSTM.
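As a rough illustration of the first feature branch, the sketch below computes a Hilbert-spectrum-style time-frequency image from a speech file via empirical mode decomposition and the analytic signal. It is a minimal sketch under stated assumptions, not the authors' implementation: the file path, sampling rate, bin count, and the use of the librosa, SciPy, and PyEMD (EMD-signal) packages are assumptions, and the cochleagram branch and the CNN/LSTM classifiers are omitted.

# Minimal sketch, not the authors' code. Assumed packages: numpy, librosa, scipy, PyEMD (pip package EMD-signal).
import numpy as np
import librosa
from scipy.signal import hilbert
from PyEMD import EMD

def hilbert_spectrum(path, sr=16000, n_freq_bins=128):
    # Load mono speech and decompose it into intrinsic mode functions (IMFs).
    x, sr = librosa.load(path, sr=sr)
    imfs = EMD()(x)                                  # shape: (n_imfs, len(x))
    # The analytic signal of each IMF gives instantaneous amplitude and frequency.
    analytic = hilbert(imfs, axis=1)
    amp = np.abs(analytic)
    phase = np.unwrap(np.angle(analytic), axis=1)
    inst_freq = np.diff(phase, axis=1) * sr / (2.0 * np.pi)   # Hz, length len(x) - 1
    # Accumulate IMF amplitudes on a time-frequency grid (Hilbert spectrum image).
    spec = np.zeros((n_freq_bins, x.size - 1))
    edges = np.linspace(0.0, sr / 2.0, n_freq_bins + 1)
    t = np.arange(x.size - 1)
    for k in range(imfs.shape[0]):
        bins = np.clip(np.digitize(inst_freq[k], edges) - 1, 0, n_freq_bins - 1)
        spec[bins, t] += amp[k, 1:]
    return spec   # frequency x time image, usable as CNN input or, transposed, as an LSTM sequence

In the pipeline described above, an analogous cochleagram image would be produced with a cochlear (gammatone-style) filterbank, and both spectrum images would then be fed to the CNN and LSTM models.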