Deep Learning-Based Speech Emotion Recognition
DOI:
https://doi.org/10.15680/IJCTECE.2020.0305001
Keywords:
Speech Emotion Recognition, Deep Learning, Convolutional Neural Networks, Recurrent Neural Networks, Long Short-Term Memory, Emotion Classification, Audio Signal Processing, Feature Extraction, Machine Learning
Abstract
Speech Emotion Recognition (SER) is an essential component of human-computer interaction, enabling systems to understand and respond to human emotions. Traditional emotion recognition methods often rely on handcrafted features, which can miss much of the complexity of emotional cues. In contrast, deep learning approaches, particularly convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory (LSTM) networks, offer more robust solutions by automatically learning hierarchical features from raw audio data. This paper reviews recent advancements in deep learning-based speech emotion recognition, discusses the architectures used, and evaluates the challenges in real-world applications. We focus on the application of deep learning models to improve the accuracy and robustness of SER, particularly in noisy environments. The study also discusses future directions for research, including multimodal emotion recognition and transfer learning, to address challenges such as small datasets and cross-domain applications.