Multimodal recognition with deep learning: audio, image, and text

International Journal of Reconfigurable and Embedded Systems

Abstract

Emotion detection is essential in many domains, including affective computing, psychological assessment, and human-computer interaction (HCI). This study contrasts emotion detection across the text, image, and speech modalities to evaluate state-of-the-art approaches in each area and to identify their strengths and shortcomings. We surveyed current methods, datasets, and evaluation criteria through a comprehensive literature review. For our experiments, we collected data, cleaned it, extracted features, and then applied deep learning (DL) models: text-based emotion identification used a long short-term memory (LSTM) network with a term frequency-inverse document frequency (TF-IDF) vectorizer, and image-based emotion recognition used a convolutional neural network (CNN). Contributing to the body of knowledge in emotion recognition, our results shed light on the inner workings of the different modalities. Experimental findings validate the efficacy of the proposed methods while also highlighting areas for improvement.
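The TF-IDF feature-extraction step used in the text pipeline can be sketched in pure Python. This is a minimal illustration, not the paper's actual preprocessing: the tiny example corpus, the tokenization by whitespace, and the smoothed-IDF variant are all assumptions made here for clarity.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute TF-IDF weight dictionaries for a tokenized corpus.

    `corpus` is a list of documents, each a list of tokens.
    Returns one {term: weight} dict per document.
    """
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    # Smoothed inverse document frequency (the "+1" keeps weights positive).
    idf = {term: math.log(n_docs / df[term]) + 1.0 for term in df}
    # Term frequency (normalized by document length) scaled by IDF.
    vectors = []
    for doc in corpus:
        tf = Counter(doc)
        vectors.append({term: (count / len(doc)) * idf[term]
                        for term, count in tf.items()})
    return vectors

# Illustrative toy corpus (hypothetical, not the study's dataset).
corpus = [
    "i feel so happy today".split(),
    "this is sad and gloomy".split(),
    "happy and excited about today".split(),
]
vectors = tf_idf(corpus)
```

Terms that occur in fewer documents receive higher IDF, so a rarer emotion word such as "sad" in the toy corpus above ends up with a larger weight than the more common "happy"; the resulting vectors would then be fed to a classifier such as the LSTM described in the abstract.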
