IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 14, No. 4, August 2025, pp. 3421-3434
ISSN: 2252-8938, DOI: 10.11591/ijai.v14.i4.pp3421-3434

Exploring bibliometric trends in speech emotion recognition (2020-2024)

Yesy Diah Rosita 1,2, Muhammad Raa'u Firmansyah 2, Annisaa Utami 2
1 Center of Excellence for Human Centric Engineering, Institute of Sustainable Society, Telkom University, Main Campus, Bandung City, Indonesia
2 Informatics Engineering Study Program, Telkom University, Purwokerto Campus, Banyumas City, Indonesia

Article history: Received Apr 21, 2024; Revised Jun 12, 2025; Accepted Jul 10, 2025

Keywords: Audio features; Classification model; Emotions; Preprocessing; Speech emotion recognition

ABSTRACT
Speech emotion recognition (SER) is crucial in various real-world applications, including healthcare, human-computer interaction, and affective computing. By enabling systems to detect and respond to human emotions through vocal cues, SER enhances user experience, supports mental health monitoring, and improves adaptive technologies. This research presents a bibliometric analysis of SER based on 68 articles from 2020 to early 2024. The findings show a significant increase in publications each year, reflecting the growing interest in SER research. The analysis highlights various approaches in preprocessing, data sources, feature extraction, and emotion classification. India and China emerged as the most active contributors, with external funding, particularly from the National Natural Science Foundation of China (NSFC), playing a significant role in the advancement of SER research. Support vector machine (SVM) remains the most widely used classification model, followed by K-nearest neighbors (KNN) and convolutional neural networks (CNN).
However, several critical challenges persist, including inconsistent data quality, cross-linguistic variability, limited emotional diversity in datasets, and the complexity of real-time implementation. These limitations hinder the generalizability and scalability of SER systems in practical environments. Addressing these gaps is essential to enhance SER performance, especially for multimodal and multilingual applications. This study provides a detailed understanding of SER research trends, offering valuable insights for future advances in speech-based emotion recognition.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Yesy Diah Rosita
Informatics Engineering Study Program, Telkom University, Purwokerto Campus
Jln. D.I. Panjaitan No. 128, Purwokerto, Banyumas City, 53147, Indonesia
Email: yesydr@telkomuniversity.ac.id
Journal homepage: http://ijai.iaescore.com

1. INTRODUCTION
Hate speech is often driven by strong negative emotions, such as hatred or anger, which can trigger social conflicts and escalate tensions between individuals or groups [1], [2]. In this context, speech emotion recognition (SER) emerges as a technology capable of detecting and interpreting emotions from human speech. By identifying negative emotions such as anger in speech, SER can be utilized to support online content moderation, enhance hate speech detection systems, and analyze social interactions to prevent conflict escalation. SER was first introduced by Rosalind W. Picard and her team at the MIT Media Laboratory in the early 2000s [3]. Since then, this field has experienced rapid growth, with advancements in feature extraction, classification models, and multimodal approaches. Over the past decade, the rise of deep learning and the
availability of larger speech datasets have significantly improved the accuracy of SER systems. Today, this technology is widely applied in various domains, including healthcare, human-computer interaction, and digital security. This study aims to provide a bibliometric analysis of 68 articles on SER published between 2020 and early 2024. This timeframe was selected due to the increasing number of publications in recent years, reflecting the growing interest in SER research, particularly following the COVID-19 pandemic, which accelerated the adoption of voice-based technologies in communication and emotion analysis. The analysis is conducted by collecting data from the Scopus database, covering key trends in SER, methodological developments, and research collaborations among scholars from various countries.

Bibliometrics is a statistical analysis technique used to understand the historical development of a scientific field [4]. This method helps uncover collaboration patterns in multidisciplinary research [5], identify trends in scientific publications, and analyze inter-article relationships [6]. Additionally, bibliometric analysis enables the evaluation of research impact and the mapping of scientific structures using various statistical indicators [7]. Research collaboration tends to enhance the influence of a study compared to individual research efforts [8], particularly when it involves multiple relevant disciplines.

SER has become a rapidly evolving research field, employing various approaches to recognize emotions in human speech [9]. In several studies, SER has been applied in sentiment analysis [8] and human-computer interaction [10], helping to identify dominant topics in scientific publications. Moreover, its application in speech and video data analysis demonstrates significant potential for understanding emotional dynamics across different contexts.
Although SER research has seen substantial growth in the past five years, several key challenges remain. These include difficulties in collecting and analyzing accurate speech data, the complexity of understanding and interpreting human emotions through speech, and limitations in handling linguistic and cultural variations. This study aims to identify SER's contributions and impacts on other fields while highlighting areas that require further exploration. One of the primary challenges in this study is ensuring the accuracy and consistency of the data collected from various sources. A rigorous data-cleaning process and manual review are necessary to ensure that the analyzed articles are relevant and of high quality. Additionally, the complexity of interpreting bibliometric analysis results presents another challenge. While SER has been widely explored, bibliometric studies specifically mapping the distribution of classification models, interdisciplinary collaborations, and cross-cultural gaps in emotion recognition remain limited. This study seeks to fill these gaps by providing a comprehensive overview of trends, collaborations, and underexplored areas in the SER literature between 2020 and 2024.

This research advances existing methodologies by integrating SER models with a comprehensive review of the relevant literature. Data collection is based on titles, abstracts, and the full content of selected articles, followed by a manual review to ensure relevance to the research topic. The main objectives of this study are: i) to provide an overview of SER research trends using the 5W+1H approach (what, who, when, why, where, and how); and ii) to identify potential research subtopics that warrant further exploration. The structure of this article is organized as follows. Section 2 explains the data collection methodology. Section 3 presents the research findings related to SER.
Lastly, section 4 provides the study's conclusions.

2. METHOD
Scientific articles were obtained from the Scopus database, where some articles can be accessed directly, while others are closed access. The availability of access to scientific articles can influence the ease of obtaining relevant information and literature. Additionally, access-restricted articles may require extra effort to gain full access, such as through an institutional library or a database subscription service. In the context of academic research, it is important to consider the available information sources and the access methods that can be used to optimize the use of existing information resources.

Data collection was a crucial step in this research. To ensure that the study can be replicated, a well-defined query strategy was implemented, which had been tested for effectiveness in retrieving relevant articles from the Scopus database. This structured approach helped obtain consistent and high-quality data for the bibliometric analysis. Moreover, the user-friendly interface of the Scopus database facilitated the data retrieval process, enabling the researchers to focus more on data interpretation and analysis. The article selection and data analysis were primarily carried out using spreadsheet software, which allowed for efficient organization, filtering, and summarization of the dataset. This approach was chosen to maintain flexibility in the review process and
to adapt to the evolving nature of the research. The article search was conducted using an advanced search query applied to titles, abstracts, and keywords, with additional filters based on publication year, article source, publication stage, and document type. The last data collection was performed on February 2, 2024, resulting in an initial set of 80 articles. Since these articles originated from diverse sources, a rigorous data-cleaning process was undertaken to remove duplicates and inconsistencies. The names of authors, publishers, journals, and research funders were cross-checked for duplication. The workflow for data collection and analysis is illustrated in Figure 1 (data collection flowchart) and Figure 2 (query instruction). The data collection steps were carried out as follows:
- Search by query: articles were retrieved using an advanced search query applied to title, abstract, and keywords, with filters based on publication year, article source, publication stage, and document type.
- Data cleaning: the names of authors, publishers, journals, and funding institutions were checked to eliminate duplication.
- Article numbering: each article was assigned a unique identification number to facilitate tracking during the review and analysis process.
- Manual review: a detailed manual review of each article was conducted by a team of three independent reviewers to ensure relevance to the research topic, verify the adequacy of the information presented, and assess the quality of the journal. Discrepancies in the selection of articles were resolved through discussion.
- Reviewed summary: compile a summary of each reviewed article to provide reference material for writing the literature review.
- Nomenclature: compile a nomenclature, or list of terms used in the articles, to facilitate readers' understanding.
- Dataset source: compile the dataset source, or list of data sources, used in the articles.
- List document: compile a list of the articles to be reviewed, based on defined criteria, to serve as the basis for compiling the literature review.

Figure 1. The flowchart for collecting data

Figure 2. The query for collecting data

From the 80 initially retrieved articles, 68 (85%) were deemed relevant and included in the final dataset. The selection process was carried out through a structured screening procedure to ensure that only the
most pertinent and impactful studies were retained. This process involved a thorough review of each article's abstract, keywords, and, when necessary, full text to determine its suitability for inclusion. The selection process was based on the following criteria:
- Relevance to SER: articles that explicitly discuss SER methodologies, datasets, or applications were prioritized.
- Citation impact: articles with significant citations (when available) were given preference to ensure academic influence.
- Publication year: articles published between 2020 and early 2024 were included to reflect recent developments in SER research.

The bibliometric analysis revealed a significant increase in SER-related research over the past five years, indicating growing interest and impact in this domain. However, the number of publications alone does not necessarily correlate with high citation counts. Some journals with a high number of publications had relatively low citation averages, while others with fewer articles had a substantial citation impact. Table 1 shows the summary per provenance as a result of that query. Based on it, there are variations in the number of articles published and the average citations per year for each journal. In general, journals that publish more articles tend to have a higher average of citations per year. However, this correlation between the number of articles and the average citations per year does not always hold. For instance, the Multimedia Tools and Applications journal, despite having a high share of articles (13.24%), has a relatively low average of citations per year (7 citations per year); the International Journal of Advanced Computer Science and Applications, with only 2.94% of the 68 articles, has a very high average of citations per year (59 citations per year).
This suggests that other factors, such as article quality, research novelty, and journal indexing, also contribute to citation impact.

Table 1. The summary per provenance

| Journal | Number of articles | Percentage (%) | Average number of citations/year |
|---|---|---|---|
| Multimedia Tools and Applications [3]-[7], [9]-[12] | 9 | 13.24 | 8 (8 citations) |
| IEEE Access [13]-[19] | 7 | 10.29 | 20 (20 citations) |
| Applied Acoustics [20]-[24] | 5 | 7.35 | 39.8 (40 citations) |
| International Journal of Speech Technology [25]-[27] | 3 | 4.41 | 7.4 (7 citations) |
| Journal of Supercomputing [28]-[30] | 3 | 4.41 | 1.4 (1 citation) |
| Signal, Image and Video Processing [31], [32] | 2 | 2.94 | 0.8 (1 citation) |
| Electronics (Switzerland) [33], [34] | 2 | 2.94 | 2.8 (3 citations) |
| Sensors (Switzerland) [35], [36] | 2 | 2.94 | 1.4 (1 citation) |
| IEEE/ACM Transactions on Audio Speech and Language Processing [37], [38] | 2 | 2.94 | 2.2 (2 citations) |
| Journal of Ambient Intelligence and Humanized Computing [39], [40] | 2 | 2.94 | 4.4 (4 citations) |
| International Journal of Advanced Computer Science and Applications [41], [42] | 2 | 2.94 | 59.2 (59 citations) |
| Journals that have only 1 article [8], [43]-[70] | 29 | 42.65 | 2.83 (3 citations) |
| Total | 68 | 100 | - |

The review applies the 5W+1H concept (what, who, where, when, why, and how) as follows:
- What: data sources used, features, and types of emotions. Information on the data sources used in research can be obtained from the data/material section, while the methods for extracting voice characteristics/features and determining emotions as output are obtained from the method section.
- Where: country of origin of the main researcher and the correspondent. Information on the author's country of origin can be obtained on the first page, commonly below the author's name.
- When: year of publication. Information regarding the year of publication of the article can be obtained from the first page.
It is generally placed before the abstract and states the time of submission, revision, acceptance, and publication of the article in the journal.
- Who: research funding agent. Information about the institution that funded the research was obtained from the acknowledgment section, where researchers express their thanks. Some articles did not mention the institution that provided the research funding, which could mean that the research was funded independently.
- Why: the root of the problem. Reviewing the root of the problem is carried out by observing the background, including the problem formulation defined by the researcher.
- How: classifier model used. An overview of the classifier models used by researchers can be found in the method section. Several researchers compared various classifier methods; others applied a single method but refined the model's architectural configuration.

3. RESULTS AND DISCUSSION
After conducting a thorough review of the literature, the researchers decided to include only 68 articles in the final analysis, revealing the results of the processed data. This selection provides a comprehensive overview of various aspects of recognizing emotions in speech and offers valuable insights into the current state of research in this field.

3.1. What
In the review of the articles, four key aspects were identified as crucial components within the "What" category of SER. These aspects include the preprocessing stages used in the analysis, the data sources employed for training the models, the types of features extracted from the speech signals, and the emotions that serve as the target or class for emotion detection in speech. These four factors play an essential role in shaping the methodologies used to recognize and classify emotions in speech and have a significant impact on the accuracy and applicability of the models. Figure 3 presents a detailed map of the data, illustrating the distribution of articles that discuss each of these four critical aspects. The map categorizes the number of articles into eight distinct ranges based on the frequency with which each aspect is covered. These ranges are as follows: fewer than 6 articles, 6-10 articles, 11-20 articles, 21-30 articles, 31-40 articles, 41-50 articles, 51-60 articles, and more than 60 articles.
This classication allo ws for a better understanding of which aspects of emotion recognition in speech are most frequently addressed in the literature, highlighting the areas of the eld that are recei ving the most attention and those that may require further e xploration. Figure 3. Distrib ution of re vie w data based on the concepts of ‘What’ 3.1.1. Pr epr ocessing Preprocessing, or the preprocessing stage, is a critical step in processing speech signals for the recog- nition of emotions in speech. This stage aims to impro v e the quality of the sound signal before further analysis is carried out. In this research, preprocessing includes three main stages: silence remo v al, noise remo v al, and unspecied. The results of the analysis sho w that researches in v olving silence and noise remo v al processes are only 7 articles [9], [23], [25], [36], [39], [57], [61] and studies e xamining preprocessing of s ilence remo v al only are 2 articles [21], [65] and others focusing noise remo v al only are 23 articles [3], [6], [7], [11]-[13], [15], Exploring bibliometric tr ends in speec h emotion r eco gnition (2020-2024) (Y esy Diah Rosita) Evaluation Warning : The document was created with Spire.PDF for Python.
[16], [20], [30], [32], [40], [42], [46], [48]-[55], [66], [69]. However, most of the articles (32 articles) did not specifically mention the preprocessing steps they used. There is still considerable variation in the preprocessing approaches used in SER research, and most researchers did not provide specific details about the preprocessing steps they undertook. The main challenge at this stage is to ensure that the resulting sound signal is free from interference and ready for further analysis. Therefore, further research needs to explore various preprocessing methods that can improve the quality of speech signals and the accuracy of emotion recognition in speech.

3.1.2. Data sources
Data sources are an important component in research into emotion recognition in speech, as the quality and representativeness of the data can have a major impact on the results of the analysis. In this research, there are variations in the data sources used by researchers. The Berlin database of emotional speech (EMO-DB) is the most commonly used data source, with 31-40 articles drawing data from it [4]-[7], [9]-[14], [18], [20]-[23], [26], [28], [31], [32], [34], [35], [37], [38], [42], [45], [49], [50], [52], [56]-[60], [63]-[65]. Other popular data sources include the interactive emotional dyadic motion capture (IEMOCAP) database, used by 21-30 articles [8], [11], [14], [15], [17]-[22], [28], [29], [33], [35]-[38], [43], [49], [52], [54]-[56], [58], [59], [62], [65], [68], the Ryerson audio-visual database of emotional speech and song (RAVDESS) [5], [6], [8], [10], [11], [23], [27], [30], [31], [35]-[37], [39], [40], [42], [43], [45], [52], [53], [58], [59], [64], [65], [69], [70], and the Surrey audio-visual expressed emotion (SAVEE) database, a fairly common source used in 11-20 articles [3], [10], [13], [18], [21], [23], [31], [34], [35], [39], [42], [45], [50]-[53], [58], [60], [63], [69].
Meanwhile, the Toronto emotional speech set (TESS) is used as a source in fewer than 11 articles [3], [6], [25], [37], [38], [40], [53], [69]. In addition to these main data sources, there are other data sources used by fewer than 6 articles, which are included in the "Others" category. The variation in data sources indicates that researchers have diverse choices in selecting data for their research. It also shows the importance of having good access to a variety of relevant data sources to ensure the representativeness of research results. In the context of SER research, it is important to select data sources that are appropriate to the research objectives and capable of representing a variety of different emotional states. Parameters that can influence data quality include the distance between the recorder and the respondent, the specifications of the equipment used, the duration of the recording, and the significance of the emotions conveyed by the respondents.

Despite the frequent use of well-known datasets such as EMO-DB and IEMOCAP, this analysis reveals a lack of diversity in the selection of data sources, particularly those that capture spontaneous emotional expressions or represent non-Western cultural contexts. This suggests a research gap in cross-cultural emotional representation and real-world data variability, which may limit the generalizability of current SER models. By identifying this gap through bibliometric mapping, this study encourages future research to explore and develop more inclusive, diverse, and naturalistic datasets to enhance the robustness of SER systems.

3.1.3. Features
The features used in speech analysis play an important role in the recognition of emotions in speech.
In this research, the mel-frequency cepstral coefficients (MFCC) feature is the most commonly used, with more than 41 articles employing it [3], [4], [6], [7], [9], [10], [12], [13], [15], [16], [19], [21]-[23], [25]-[30], [34], [36], [38]-[40], [42], [43], [45], [46], [48], [49], [51]-[53], [57], [60], [61], [63], [67], [68], [70]. Pitch is also a popular feature, found in 12 articles [6], [7], [9], [14], [21], [25], [27], [29], [34], [46], [68], [70]. In addition, several other features are used by 6-10 articles, including the mel-spectrogram [3], [5], [10], [46], [48], [51], [54], [58], linear predictive coding (LPC) [6], [9], [13], [26], [29], [40], [61], formants [9], [14], [27], [46], [57], [59], energy [6], [9], [29], [46], [51], and chroma [25], [28], [46], [48], [51], [61]. These features reflect the variety of speech analysis approaches used to identify emotional patterns in speech. Apart from these main features, other features used in fewer than six articles fall into the "Others" category. This variation shows that researchers have applied varied approaches in analyzing sound signals for emotion recognition, with each feature having its advantages and disadvantages. Therefore, selecting appropriate features is a critical step in the development of an effective emotion recognition system. As in previous research, the use of the dominant weight normalization feature selection algorithm also influences the level of accuracy, achieving adequate performance with a relatively small amount of data: with 300 data points, it achieved an accuracy rate of 86%, so this algorithm can be considered for use in developing SER research [71].
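To make the dominant MFCC feature concrete, the following is a minimal numpy-only sketch of the standard MFCC pipeline: framing and windowing, power spectrum, triangular mel filterbank, log compression, and an (unnormalized) DCT-II. It is illustrative only; the frame size, hop length, filter count, and coefficient count below are hypothetical defaults, not settings taken from any of the surveyed articles, and practical work would typically rely on a library such as librosa.

```python
import numpy as np

def mfcc_like(signal, sr=16000, n_fft=512, hop=256, n_mels=26, n_ceps=13):
    """Compute MFCC-style features from a mono signal (simplified sketch)."""
    # Frame the signal with a Hamming window
    frames = np.array([
        signal[start:start + n_fft] * np.hamming(n_fft)
        for start in range(0, len(signal) - n_fft + 1, hop)
    ])
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular mel filterbank between 0 Hz and the Nyquist frequency
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        if c > l:
            fbank[i - 1, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fbank[i - 1, c:r] = (r - np.arange(c, r)) / (r - c)

    # Log mel energies, then DCT-II to decorrelate -> cepstral coefficients
    logmel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return logmel @ dct.T

# Example: 1 s of a 440 Hz tone yields one 13-coefficient vector per frame
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
feats = mfcc_like(np.sin(2 * np.pi * 440 * t), sr=sr)
print(feats.shape)
```

Pitch, energy, and chroma would be computed from the same framed signal with different per-frame reductions; the filterbank-plus-DCT structure above is what distinguishes MFCCs from the raw mel-spectrogram.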
3.1.4. Emotions
The analysis of emotions in speech involves identifying the different types of emotions that can be expressed through sound. In this study, the emotion "Happy", sometimes referred to as "Joy/Joyful", was found to be the most commonly investigated, appearing in more than 62 articles (91.18%). The emotions "Angry" (60 articles, 88.24%), "Sad" (60 articles, 88.24%), and "Neutral" (59 articles, 86.76%) are also frequently chosen. The emotions "Fear" (47 articles, 69.12%) and "Disgust" (46 articles, 67.65%) are also quite commonly used. Apart from that, "Surprise" appears in 36 articles (52.94%) and "Boredom" in 24 articles (35.29%). Emotions used by fewer than 11 articles fall into the "Others" category. The variation in the types of emotions shows that researchers have varied interests in understanding and identifying different types of emotions in speech. Research has classified these emotions into 3 types, namely positive, negative, and neutral [35]. This variation also reflects the complexity of human emotional expression and the challenges in developing systems capable of recognizing emotions with high accuracy in a variety of contexts and situations.

3.2. Where
The discussion regarding "Where" stems from an in-depth analysis of the countries of origin and the institutions affiliated with the first author and the corresponding author, as illustrated in Figure 4. This aspect of the research is crucial for understanding the geographical and institutional spread of the studies on SER. By examining whether the first and corresponding authors come from the same or different countries, we can gain insights into the international collaboration patterns within the field of emotion detection in speech.
A signicant number of articles authored by researchers from multiple countries suggests a broad, global netw ork of researchers eng aged i n v oice emotion identication. Con v ersely , studies with authors from a single country or institution may indicate more localized research ef f orts. Figure 4 sho ws the number of authors from countries, either as the rst author or corresponding author , who ha v e published on SER with more than three contrib uting authors. Others come from T aiw an, T urk e y , P akistan, Australia, Indonesia, Japan, Egypt, France, Iraq, Italy , Kazakhstan, London, Portug al, Saudi Arabia, V ietnam, Bhutan, and Malaysia. Figure 4. T op 4 countries by number of authors Moreo v er , this geographical analysis indicates the le v el of global interest and in v olv ement in SER research, reecting ho w research in this domain is distrib uted across dif ferent re gions. It also allo ws for the identication of leading countries or institutions that are dri ving inno v ation and contrib uting to adv ancements in this eld. Understanding the “Where” thus highlights not only the scope of international collaboration b ut also the potential for future netw orking opportunities and the sharing of kno wledge across borders. 3.2.1. The rst author The rst author of a study often reects the institution or country where the research w as conduct ed. In this study , the rst authors came from countries around the w orld. The countries contrib uting the most rst authors are India, with 27 articles [3], [5]-[8], [10], [11], [13], [20], [23], [25], [27], [30], [31], [39], [40], [43], [46], [48], [50], [52], [53], [56], [60], [65], [66], [70] and China, with 17 articles [15], [16], [19], [32], [33], [34], [38], [44], [45], [49], [54], [55], [59], [61], [62], [67], [68]. 
Apart from that, there are several other contributing countries with far fewer articles, such as Iran (4 articles) [9], [21], [28], [57], South Korea
(2 articles) [29], [36], Pakistan (2 articles) [33], [39], Taiwan (2 articles) [14], [58], and Turkey (2 articles) [22], [24], as well as several countries with one article each, including Egypt, France, Indonesia, Iraq, Italy, Japan, Kazakhstan, London, Portugal, Saudi Arabia, and Vietnam. Among the authors, Banusree Yalamanchili from India was the most active, publishing 3 articles as first author over the last 5 years.

3.2.2. The corresponding author
Corresponding authors often have an important role in research, especially in terms of communication with journal editors and other researchers. In this research, they come from various countries of origin, and a pattern similar to the first-author trend is seen. Again, India is the country with the most corresponding-author contributions, with 25 articles [3], [5]-[8], [10], [11], [20], [23], [25], [27], [31], [39], [40], [43], [46], [48], [50], [52], [53], [56], [60], [65], [66], [70], followed by China with 16 articles [15], [16], [19], [30], [32]-[34], [38], [44], [45], [49], [54], [59], [61], [67], [68]. Several other countries also contribute, such as Iran (4 articles) [9], [21], [28], [57], South Korea (3 articles) [29], [35], [36], Australia (2 articles) [12], [62], Indonesia (2 articles) [26], [42], Japan (2 articles) [17], [55], Taiwan (2 articles) [14], [58], and Turkey (2 articles) [22], [24], while several countries have only one article each, such as Bhutan, Egypt, France, Iraq, Italy, Kazakhstan, London, Malaysia, Pakistan, and Portugal. Among the corresponding authors, 43 are also the first authors. This shows that they have a significant role in the research carried out, both as the main initiator and as the person responsible for communication and coordination with other parties, such as journal editors and other researchers.
This also shows the high level of involvement and contribution of these researchers in the development and dissemination of knowledge in the field of SER.

3.3. When
The distribution of articles about SER by year shows an interesting trend over the last five years, as shown in Figure 5. In 2020, 14 articles [12]-[14], [20], [21], [26], [29], [35], [36], [41], [54], [57], [61], [62] were published, indicating a moderate level of research activity in this area. The following year, in 2021, the number of articles decreased slightly to 11 [7], [15], [22], [23], [30], [33], [34], [42], [58], [59], [64], showing a temporary dip in research output. An increase is seen in 2022, with 17 articles [10], [11], [17], [24], [27], [32], [40], [43], [46], [51], [53], [55], [56], [62], [63], [65], [67], showing renewed interest in SER research. This trend continues in 2023, with the highest count of 22 articles [3]-[5], [8], [9], [16], [18], [19], [25], [28], [37]-[39], [44], [45], [48], [50], [52], [60], [66], [68], [70], indicating continued growth in research activity and perhaps also the maturity of the field.

Figure 5. Number of articles published by year

As of February 2024, four articles [6], [47], [50], [69] show that research in this field remains sustainable, despite an apparent downturn, as 2024 had only just begun at the time of data collection. During this period, some articles began discussing calm in SER, and it is possible that by the end of 2024, there will be a significant rise compared to 2023. The five-year publication trend shows that emotion recognition in speech remains a relevant and interesting topic, and it can be expected that further research will continue
to be conducted to expand understanding of the technologies that can be used to detect and interpret human emotions.

3.4. Who
An analysis of funding sources for research into emotion recognition in speech shows the diverse origins of the funds used to support this work. This diversity is reflected in several patterns identified in the dataset:
- A single institution funds the research, totaling thirteen studies [12], [13], [17], [18], [20], [29], [32], [37], [45], [62]-[64], [68];
- One funding institution supports many studies, such as the National Natural Science Foundation of China (NSFC), which supported seven projects [32], [38], [45], [54], [55], [59], [61];
- A study is funded by many institutions: five institutions [15], [69], four institutions [61], three institutions [55], and two institutions [35], [58], [68];
- The remaining studies are self-funded.

Funding plays a crucial role in research, with the NSFC reflecting China's commitment. However, many studies do not list funding sources, suggesting a combination of external, internal, or independent funding. Interestingly, the most cited studies are independently funded, particularly in the journal Multimedia Tools and Applications.

3.5. Why
The analysis of the research motivations behind emotion recognition in speech is visualized in Figure 6. Most research in this area is driven by three main reasons: classification model selection, feature selection, and implementation in other cases. Seven articles discuss not only classification models but also the selection of voice features [3], [4], [17], [32], [44], [52], [70].

Figure 6.
Distribution of research reasons in emotion recognition

- Classification model: a total of 26 articles cite the selection of a classification model as the sole reason behind their methods [3], [4], [10], [11], [14], [15], [17], [25]-[27], [32]-[34], [41], [43]-[45], [47]-[49], [52], [54], [60], [66], [68], [70]. This reflects the importance of selecting an appropriate and effective classification model for building a speech emotion recognition system; classification models have become the main choice for researchers seeking to classify emotions in speech with high accuracy;
- Feature selection: a total of 39 articles use feature selection as the sole main reason behind their research methods [3]-[6], [8], [9], [12], [13], [17], [18], [20]-[24], [29]-[32], [35], [37], [39], [40], [44], [46], [50]-[53], [56]-[59], [61], [62], [64], [65], [67], [70]. Appropriate and representative features are essential for building a reliable emotion recognition system. Features such as MFCC, pitch, Mel-spectrogram, and others have become the focus of research to extract important information from sound signals that can be used to identify emotions;
- Implementation: a total of 7 articles cite implementation in other cases as the rationale behind their research methods [16], [28], [38], [42], [55], [63], [69]. This suggests that some researchers have applied their approach in a broader application context, beyond emotion recognition in speech, such as sentiment analysis, human-computer interaction, or psychological research.

Although most of the research was driven by these reasons, some studies cite other motivations beyond these categories [59]. This shows that there is still variation in research motivations and approaches in speech emotion recognition, and there is potential for further exploration in developing more innovative methods and techniques.

3.6. How
This research notes variations in the use of classifier models to identify emotions in human speech. More than 20 articles use the support vector machine (SVM) as the main model, showing its popularity and effectiveness in emotion classification. Meanwhile, around 14 articles employ K-nearest neighbors (KNN) and 12 articles use 1D convolutional neural networks (CNN), while about 5-10 articles each apply approaches such as decision tree (DT), deep neural network (DNN), long short-term memory (LSTM), multilayer perceptron (MLP), and random forest (RF); the details are shown in Table 2. The use of various classifier models reflects an effort to explore different approaches to the challenge of emotion classification in human speech. In addition, several other approaches are used in smaller numbers, demonstrating the diversity of strategies and techniques in SER research.

Table 2.
The summary of model classifiers

Number of articles    Model classifiers
> 20                  SVM
10–15                 KNN, 1D CNN
5–10                  DT, DNN, LSTM, MLP, RF

A clear trend shows that deep learning models, particularly CNN and LSTM, are gaining popularity due to their ability to capture the complexity of speech signals and to outperform traditional models like SVM, especially on large datasets. These models learn features automatically from raw data, offering better generalization in noisy environments. In contrast, while traditional models like SVM work well with smaller, structured datasets, they struggle with raw audio data, where deep learning excels. Therefore, deep learning models are becoming more prevalent in SER research due to their higher accuracy and adaptability.

4. CONCLUSION
This research presents a bibliometric analysis of 68 articles on SER published between 2020 and early 2024. There have been significant developments in SER research over the last five years, with India as the top contributor. The exploration of research topics provides a comprehensive overview of developments and trends in this field. The use of preprocessing techniques, such as silence removal and noise removal, is a main focus. The most commonly used data sources are EmoDB, IEMOCAP, and RAVDESS, while features such as MFCC and pitch are the most frequently used in the analysis. More diverse data sources, including real-world noisy data, can significantly improve SER models. By integrating datasets that reflect real-world conditions, including a broader range of emotional variations and loud environments, SER models can be trained to be more resilient to the challenges faced in everyday situations. This will help address current limitations, such as inconsistent data quality and a lack of emotional diversity in datasets, thereby enhancing the accuracy and generalizability of models in practical applications.
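To make the classifier comparison in Section 3.6 concrete, the following is a minimal, illustrative sketch of a K-nearest neighbors classifier, the second most common model in the surveyed articles. It runs on synthetic 13-dimensional vectors standing in for MFCC features (13 is a common MFCC coefficient count); the two emotion labels, cluster locations, and sample sizes are hypothetical placeholders, not data drawn from any surveyed study.

```python
# Hedged sketch: a from-scratch K-nearest-neighbors emotion classifier
# on synthetic "MFCC-like" feature vectors. A real SER pipeline would
# extract MFCCs from audio and use labels from an annotated corpus.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: 20 utterances per emotion class,
# each represented by a 13-dimensional feature vector.
happy = rng.normal(loc=1.0, scale=0.5, size=(20, 13))
angry = rng.normal(loc=-1.0, scale=0.5, size=(20, 13))
X_train = np.vstack([happy, angry])
y_train = np.array(["happy"] * 20 + ["angry"] * 20)

def knn_predict(x, X, y, k=5):
    """Classify one feature vector by majority vote among its
    k nearest training vectors (Euclidean distance)."""
    dists = np.linalg.norm(X - x, axis=1)
    nearest = y[np.argsort(dists)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

# A query vector drawn near the "happy" cluster.
query = rng.normal(loc=1.0, scale=0.5, size=13)
print(knn_predict(query, X_train, y_train))
```

In real SER work the feature matrix would come from an MFCC extractor and the labels from an annotated corpus such as RAVDESS or EmoDB, which the survey identifies among the most common data sources; deep models like 1D CNNs instead learn their features directly from the raw signal.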
Based on these findings, further research is suggested to develop multimodal approaches that integrate acoustic features with non-auditory data, such as facial expressions, body movements, or physiological signals. The combination of multimodal features can capture a more holistic representation of emotions, overcoming the limitations of single-voice-based systems, which are susceptible to environmental noise or ambiguous contexts. The most frequently analyzed emotions are happy, angry, sad, neutral, fear, disgust, and surprise. In terms of classification modeling, SVM is the most widely used model, followed by KNN, 1D CNN, and several other approaches. Overall, this study provides an in-depth understanding of SER research trends and the techniques most commonly used in this analysis. It is recommended to develop more sophisticated preprocessing techniques and classification models that are more