IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 15, No. 2, April 2026, pp. 1891-1908
ISSN: 2252-8938, DOI: 10.11591/ijai.v15.i2.pp1891-1908
Journal homepage: http://ijai.iaescore.com

TunDC: a public benchmark dataset for sentiment analysis and language modeling in the Tunisian dialect

Ahmed Khalil Boulahia 1, Mourad Mars 2
1 Tunis Dauphine University, Tunis, Tunisia
2 Department of Computer Science and Artificial Intelligence, College of Computing, Umm Al-Qura University, Mecca, Saudi Arabia

Article history:
Received May 8, 2025
Revised Jan 29, 2026
Accepted Feb 6, 2026

Keywords:
Arabic dataset
Artificial intelligence
Fine-tuning
Large language model
Low-resource language
Sentiment analysis
Tunisian dialect

ABSTRACT
The development of natural language processing (NLP) applications has increasingly focused on dialectal variations of languages. The Tunisian dialect (TD), a widely spoken variant of Arabic, poses unique linguistic challenges due to its lack of standardized writing conventions and influences from multiple languages, including French, Italian, Turkish, and Berber. In this work, we introduce TunDC, a dataset of 20,044 labeled comments designed to advance NLP research on the TD. The dataset covers diverse linguistic forms (Arabic, Latin, and mixed scripts), and each comment was manually annotated for positive or negative sentiment by native speakers, achieving high inter-annotator agreement. To evaluate its effectiveness, we fine-tuned various models on TunDC. The bert-base-arabic-TunDC-mixed model achieved an accuracy of 0.84 and a macro-averaged F1-score of 0.83, demonstrating strong generalization across sentiment categories and writing systems. A stratified data-splitting strategy considering both sentiment and script type further improved accuracy by approximately 8% compared to standard splits. As a publicly available resource, TunDC contributes to the computational linguistics community, fostering advancements in language modeling and applications tailored to the TD.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Mourad Mars
Department of Computer Science and Artificial Intelligence, College of Computing, Umm Al-Qura University
Makkah 24382, Saudi Arabia
Email: msmars@uqu.edu.sa

1. INTRODUCTION
Recent years have witnessed remarkable advancements in large language models (LLMs) and generative AI, leading to significant breakthroughs in various natural language processing (NLP) tasks. From machine translation and text summarization to writing different kinds of creative content, these models have demonstrated exceptional capabilities in understanding and generating human language. However, a major obstacle to broader adoption and linguistic inclusivity is the persistent scarcity of annotated data, especially for low-resource languages and dialects. This limitation often leads to significant performance gaps, undermining the linguistic richness of these communities and restricting their access to NLP technologies. One such case is Tunisian Arabic (TA), an Arabic dialect spoken by over 12 million individuals in Tunisia, which remains a captivating and underexplored subject of study. Unlike standard Arabic, it possesses unique linguistic characteristics shaped by historical and cultural influences.
Its vocabulary draws heavily from Arabic but also incorporates a wide range of loanwords from Berber, Turkish, French, English, and Italian, enriching its expressive power and reflecting its vibrant sociolinguistic landscape (Table 1).
Yet, the lack of large labeled datasets tailored for sentiment analysis in the Tunisian dialect (TD) poses a barrier to deeper exploration and hinders the development of robust NLP applications. When it comes to writing in the TD, people use different writing systems: Arabic script (abjad), Latin script, and even numbers to represent some characters. Most of the time, the TD has no well-defined structure and does not conform to any conventions or orthographic rules. These characteristics of the TD present an additional challenge in effectively processing it using NLP techniques. Table 2 presents some of the most common expressions that translate the English sentence "when are you going to the doctor?".

Table 1. Examples of loanwords in the TD
Tunisian dialect | Origin | English translation
[Arabic script] | Italian "banca" | Bank
[Arabic script] | French "appartement" | Apartment
[Arabic script] | Berber "labes" | Fine
[Arabic script] | English "talifoun" | Telephone

Table 2. Different TD expressions that translate "when are you going to the doctor?"
Reference sentence | When are you going to the doctor?
Sentence 1 | [Arabic script]
Sentence 2 | Wa9tash timshi litbib?
Sentence 3 | Waktech temchi lel doctour?
Sentence 4 | Ana wa9t machi 3and tbib?

Processing the TD is a challenging task, mainly due to its ambiguous and complex structure, not only for machines but sometimes for Tunisians themselves. People can write in both directions (right to left and left to right) using the Arabic alphabet (abjad) and the Latin alphabet. On many occasions, they use both simultaneously. Usage also varies according to a person's age, origins, and culture. The majority of Arabic users on social media platforms use dialects to express themselves; most of these dialects can be described as unstructured, non-grammatical slang Arabic. This non-uniformity makes it more difficult for machine learning (ML) algorithms and LLMs to perform tasks such as sentiment analysis. Hence, there is an increasing need for larger datasets to improve the performance of these models.

This paper aims to tackle this challenge by introducing TunDC, a new benchmark corpus specifically designed for sentiment analysis in TA. We leverage the power of social media, collecting and annotating more than 20K comments with sentiment labels (positive or negative) provided by native speakers. TunDC is intended not only as a training resource but also as a standardized benchmark for evaluating and comparing NLP systems on TD sentiment analysis, with potential use in future shared tasks and competitions. Moreover, the dataset's scale and script diversity make it suitable for pre-training dialect-specific language models and for enabling effective transfer learning through fine-tuning of multilingual or Arabic-centric transformers such as Arabic bidirectional encoder representations from transformers (AraBERT), multidialectal Arabic BERT (MARBERT), or the cross-lingual language model-robustly optimized BERT pre-training approach (XLM-R). By addressing the data scarcity issue, TunDC aims to empower future research and development of sentiment analysis solutions tailored to the unique linguistic characteristics of TA [1], [2]. This paper makes three key contributions.
First, it provides a survey of available resources for the TD. Second, it introduces TunDC, a novel, publicly available benchmark dataset for sentiment analysis in TA, designed to support both model training and standardized evaluation. Third, it presents the training, evaluation, and public release on Hugging Face of multiple pre-trained LLMs fine-tuned via transfer learning, including CamemBERT (AhmedBou/camembert-TunDC), bert-base-uncased (AhmedBou/camembert-TunDC), ModernBERT (AhmedBou/ModernBERT-TunDC), and bert-base-arabic (AhmedBou/bert-base-arabic-TunDC-mixed), on the TunDC dataset.

The rest of the paper is organized as follows: section 2 provides an overview of related work in sentiment analysis and datasets for Arabic dialects. Section 3 details the dataset creation process, covering data collection, preprocessing, annotation, statistical analysis, and evaluation. Section 4 describes the experimental setup, presents the results, and discusses the findings. Section 5 discusses ethical considerations, including bias and fairness in Tunisian dialect NLP. Finally, section 6 concludes the paper and outlines potential directions for future research.
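Since the fine-tuned models listed among the contributions are published on the Hugging Face Hub, they can be loaded directly for inference. The following minimal sketch assumes the transformers library is installed; the printed output and label names are illustrative and depend on the uploaded model configuration.

```python
# Minimal usage sketch for one of the released TunDC models.
# Assumes: pip install transformers torch; label names are illustrative.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="AhmedBou/bert-base-arabic-TunDC-mixed",
)

print(classifier("wallahi nhebek barcha wo n9adrek ena"))
# e.g. [{'label': 'positive', 'score': 0.97}]  (illustrative output)
```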
2. RELATED WORKS
Sentiment analysis in the TD presents a unique challenge due to the dialect's distinct vocabulary, morphology, and syntax compared to modern standard Arabic (MSA) [3], [4]. Several research efforts have focused on addressing this challenge, employing diverse approaches and datasets [5]-[11]. This section provides a comprehensive overview of existing work, with a particular focus on recent advances in multilingual models, dialect adaptation, and the specific challenges and opportunities within TA NLP.

2.1. Advances in multilingual transformers and dialect adaptation
The advent of the Transformer architecture has revolutionized NLP, with multilingual large language models (MLLMs) like XLM-R and more recent generative models demonstrating impressive cross-lingual capabilities. Recent work has focused on adapting these powerful models to low-resource dialects. A key challenge is the dialect adaptation of these MLLMs. While models pre-trained on massive amounts of data, including some Arabic content, show a baseline performance, they often struggle with the nuances of specific dialects like TD. Studies such as those presented at Arabic natural language processing 2025 (ArabicNLP 2025) and empirical methods in natural language processing 2025 (EMNLP 2025) have shown that MLLMs exhibit a significant "Arabic gap" in handling dialectal variations, code-mixing, and script-switching [12]. Model merging and continual pre-training have emerged as effective strategies for dialect adaptation, moving beyond conventional fine-tuning. For instance, research in 2025 explored model merging to adapt multilingual models for code-mixed tasks, a highly relevant challenge for TD, which frequently mixes Arabic script, Latin script (Arabizi), and French loanwords [13]. Furthermore, the evaluation of generative LLMs in zero-shot and few-shot settings for Arabic tasks, including sense disambiguation and translation, highlights the growing trend of leveraging the inherent knowledge within these models to overcome data scarcity in dialectal NLP [14], [15].

2.2. Zero- and few-shot learning for low-resource dialects
The scarcity of large, high-quality annotated datasets for dialects necessitates the exploration of zero-shot learning (ZSL) and few-shot learning (FSL) techniques. These methods are crucial for advancing NLP in low-resource settings, including TA. Recent surveys (2024-2025) on Arabic dialect processing emphasize the shift towards ZSL and FSL, often utilizing the power of LLMs. In the context of sentiment analysis, ZSL and FSL allow models to generalize from instructions or a handful of examples, bypassing the need for extensive, costly manual annotation [16]. For TD specifically, researchers are benchmarking LLMs' performance, finding that while they struggle initially, prompting techniques and FSL can significantly improve their ability to recognize and adhere to the dialect's unique linguistic structure [17]. This approach is particularly promising for tasks like sentiment analysis, where the underlying sentiment concept is universal but the dialectal expression is unique.

2.3. Technical gaps and linguistic challenges in Tunisian dialect natural language processing
The TD presents a unique set of linguistic and technical challenges that impede the development of robust NLP systems.
The most signicant linguistic hurdle is the script-s witching phenomenon, where users uidly switch between Arabic and Latin characters, often within the same sentence, to represent TD phonemes. This necessitates resources lik e T unisian Arabish corpus (T ArC) [18] and the ne wly introduced LinT O datasets [19], which focus on transliteration and linguistic annotation to bridge the g ap between the written forms. The technical g ap is primarily the lack of a lar ge-scale, multi-domain, and multi-script benchmark dataset, which T unDC aims to address. As sho wn in T able 3, e xtreme code-switching and script-mixing complicate model design, while the lack of orthographic standardizati on increases le xical v ariability . On the technical side, data scarcity and domain adaptation challenges further limit the performance of NLP systems, moti v ating the de v elopment of the T unDC dataset. 2.4. Arabic r egional dialects Unlik e English, Arabic language presents additional challenges due to its multiple dialects, the lim ited a v ailability of lar ge corpora, and the absence of v ocalization. Therefore, creating high-quality datasets and de v eloping NLP tools capable of accurately processing dialectal Arabic is crucial. Signicant ef forts ha v e been made to construct datasets for specic dialects, with the Egyptian (EGY) [20]–[22] and Le v antine (LEV) dialects being the most e xtensi v ely studied [23], [24]. More recently , research has e xpanded to include the P alestinian (P AL) [25], Khaliji [26], [27], Syro-P alestinian [22], Gulf (GLF) [28], Mesopotamian (Iraqi) [22], T unDC: a public benc hmark dataset for sentiment analysis and langua g e ... (Ahmed Khalil Boulahia) Evaluation Warning : The document was created with Spire.PDF for Python.
and Maghrebi (MGR) [29] dialects [30], [31]. However, the TD remains underexplored, with limited linguistic resources and NLP tools available.

Table 3. Technical gaps and linguistic challenges in TA NLP
Challenge type | Specific challenge in TD | Impact on NLP development
Linguistic | Extreme code-switching/mixing: frequent, unstandardized mixing of Arabic script, Latin script (Arabizi), and French/Italian/Turkish loanwords. | Requires models to handle multiple orthographies and languages simultaneously, increasing model complexity and data requirements.
Linguistic | Lack of orthographic standardization: no fixed rules for writing TA, leading to high lexical variability (e.g., multiple ways to write the same word). | Hinders the effectiveness of traditional tokenization, stemming, and lexicon-based methods.
Technical | Data scarcity and fragmentation: limited availability of large, publicly accessible, and high-quality annotated corpora. Existing datasets are often small and task-specific. | Prevents the effective pre-training of dedicated, high-performing TD language models.
Technical | Domain adaptation: models trained on one domain (e.g., political tweets) perform poorly on others (e.g., e-commerce comments). | Requires continuous adaptation strategies and diverse datasets like TunDC to ensure generalizability.

2.5. Tunisian dialect datasets
The first interest in TD sentiment analysis dates to 2016, when Sayadi et al. [32] presented a sentiment analysis study on the first labeled and publicly available dataset, the Tunisian election corpus (TEC). This dataset is composed of 5,514 tweets collected during the 2014 Tunisian election period; 3,760 of them are in MSA, and 1,754 are in the TD. Several ML approaches were presented and compared; the reported results showed that support vector machines (SVM) achieved the highest accuracy, 71.09%. The Tunisian Arabic corpus (TAC) [33] consists of 800 tweets covering various topics, including media, telecommunications, and politics. This dataset was gathered by Karmani [33] and labeled with sentiment categories: positive, negative, and neutral. In 2017, the Tunisian sentiment analysis corpus (TSAC) was presented and made publicly available for the Tunisian NLP community [34]. The dataset was obtained from Facebook comments about popular TV shows and is written only with Arabic letters. The authors reported the first application of deep learning to sentiment analysis on the TD: a multi-layer perceptron (MLP) produced a lower error rate than SVM and naive Bayes and reached 78% accuracy. Tunisian Arabizi (TUNIZI) [35] contains 9,210 comments gathered from the YouTube platform and labeled positive or negative. Many topics are covered in this dataset, such as sports, politics, comedy, and TV shows. Both classes are similarly represented, with 47% positive and 53% negative comments. Masmoudi et al. [36] introduced a manually annotated dataset for sentiment analysis of the TD, composed of comments collected from official Facebook pages of Tunisian supermarkets. The dataset was labeled with five sentiment categories (very positive, positive, neutral, negative, and very negative) and twenty aspect-based categories.
To analyze sentiment, the authors experimented with three deep learning models: convolutional neural networks (CNN), long short-term memory (LSTM), and bidirectional long short-term memory (Bi-LSTM). Their evaluation showed that CNN and Bi-LSTM achieved the best classification performance, demonstrating the effectiveness of deep learning in processing TD text. Gugliotta and Dinarelli [18] introduced TArC, a publicly available dataset designed for processing TA written in Arabizi. The corpus was developed alongside an NLP tool that provides various levels of linguistic annotation, including word classification, transliteration, tokenization, part-of-speech (POS) tagging, and lemmatization. The authors outlined their computational and linguistic methodologies, discussing strategies to enhance annotation accuracy. Their experiments demonstrated the effectiveness of these resources for both computational applications and linguistic research. Mulki et al. [37] investigated sentiment analysis of the TD using both supervised and lexicon-based models. They evaluated preprocessing techniques such as stemming, emoji recognition, and negation detection on three datasets of varying sizes. Their results showed that these preprocessing steps significantly improved sentiment classification performance, with named entity tagging further enhancing lexicon-based models and benefiting supervised models on smaller datasets.
To summarize, Table 4 provides an overview of the datasets collected for the TD sentiment analysis task. The existing datasets are either small, limited in script coverage, or focused on specific tasks such as transliteration. TunDC distinguishes itself by offering a large, publicly available, and script-diverse corpus (Arabic, Latin, and mixed scripts) with high-quality, manually verified sentiment annotations, positioning it as a robust benchmark for multi-script and multi-domain sentiment analysis.

Table 4. Summary of available datasets for TD sentiment analysis
Study | Dataset | Size | Source | Labels
Sayadi et al. [32] | TEC [32] | 5,514 tweets | Twitter | Positive, Negative
Karmani [33] | TAC [33] | 800 tweets | Facebook | Positive, Negative, Neutral
Medhaffar et al. [34] | TSAC [34] | 17k comments | Facebook | Positive, Negative
Fourati et al. [35] | TUNIZI [35] | 9,210 YouTube comments | YouTube | Positive, Negative
Masmoudi et al. [36] | Comments about Tunisian supermarkets [36] | 17k Arabic-script posts, 27k Arabizi-script posts | Facebook | Very Positive, Positive, Neutral, Negative, Very Negative
Gugliotta and Dinarelli [18] | TArC [18] | 11,291 comments | Facebook | Positive, Negative, Neutral

3. TUNISIAN DIALECT CORPUS
TunDC was developed through a three-step process consisting of data gathering, content filtering and preprocessing, and manual annotation. This dataset is designed to be diverse and well-structured, providing a valuable resource for training and evaluating deep learning models for sentiment analysis in the TD.

3.1. Data gathering, pre-processing, and labeling
A closer inspection of TD typesetting commonly used on social media shows that many factors affect its structure, such as the user's age, sex, region, and interests. The goal is to build a large dataset that covers the majority of this vocabulary in use. We gathered approximately 24K comments through data scraping from over 300 social media posts and videos across various Tunisian Facebook pages and YouTube channels, with the most recent scraping conducted in January 2025. These comments cover a broad array of topics, including music, politics, sports, news, and TV shows. To ensure vocabulary diversity, we limited the maximum number of comments scraped per post to 200. Additionally, we intentionally included longer comments in the dataset to better process and understand extended contextual entries. The longest comment in the TunDC dataset spans 873 tokens, while the average comment length is 42 tokens.

We filtered out non-TD comments and comments fully written in MSA using fastText scoring [38], [39]. Further filters were then applied to exclude inappropriate (harmful and offensive) comments and to retain only comments from Tunisian people written in TD. Additionally, comments consisting of only one word were excluded from the dataset to maintain the quality and relevance of the content; this step ensures that the remaining comments contain sufficient context and information, allowing for more accurate sentiment analysis. The last filtering step was deduplication [40]: removing near-duplicate examples and long repetitive sub-strings to improve the quality of the dataset, allowing for more accurate evaluation and better model performance [41]-[43].
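To make the filtering stage concrete, the sketch below shows one way such a pipeline could be assembled. It uses the off-the-shelf fastText language-identification model (lid.176.ftz) as a stand-in for the TD/MSA scoring step (the actual TD-versus-MSA filter would require a classifier trained for that distinction), and a simple normalized-hash deduplication in place of the cited deduplication tooling; thresholds are assumptions.

```python
# Illustrative filtering sketch; thresholds and the language-ID model are
# assumptions, not the authors' exact configuration.
import fasttext

lang_id = fasttext.load_model("lid.176.ftz")  # downloadable from fasttext.cc

def keep_comment(text: str, threshold: float = 0.5) -> bool:
    # Drop one-word comments: too little context for sentiment analysis.
    if len(text.split()) < 2:
        return False
    labels, probs = lang_id.predict(text.replace("\n", " "))
    # Keep Arabic-script text; low-confidence predictions often correspond
    # to Arabizi, which a dedicated TD classifier would have to score.
    return labels[0] == "__label__ar" or probs[0] < threshold

def deduplicate(comments: list[str]) -> list[str]:
    # Near-duplicate removal via whitespace/case normalization.
    seen, unique = set(), []
    for c in comments:
        key = " ".join(c.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(c)
    return unique
```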
The complete step-by-step filtering, script classification, and sentiment labeling process applied to the TunDC dataset is illustrated in Figure 1. After the gathering and filtering phases, data preprocessing starts by removing all kinds of links and special characters. Recognizing that emojis carry significant semantic meaning and are prevalent in real-world text, we intentionally preserved them during preprocessing rather than replacing them with decoded formats (as might be done with packages like the emoji package [44]). This approach aims to improve the model's understanding of nuanced expressions and enhance its generalization capabilities on authentic data.
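The cleaning step described above can be approximated with a few regular expressions, as in the following sketch; the exact patterns are assumptions rather than the authors' released pipeline, and emojis are deliberately left untouched.

```python
# Illustrative cleaning step: strip URLs and mentions, keep emojis.
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")

def clean_comment(text: str) -> str:
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # No emoji decoding or removal: emojis carry sentiment signal.
    return " ".join(text.split())

print(clean_comment("wallahi nhebek barcha @user https://t.co/x 😍"))
# -> 'wallahi nhebek barcha 😍'
```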
To ensure the accuracy and reliability of the labels, a careful labeling process was implemented. Initially, a representative sample of approximately 20K comments was carefully selected, ensuring the inclusion of diverse linguistic styles, topics, and sentiment expressions. To establish a consistent labeling framework, a comprehensive description of the labels, detailed guidelines, and a list of key instructions were provided. In addition, annotated comment examples were presented to the annotators to foster a clear understanding of the labeling criteria. The primary labeling task involved determining whether each comment conveyed a positive or negative sentiment.

Each comment underwent a three-stage annotation process that involved three native Arabic and Tunisian speakers volunteering as annotators. In the first stage, each annotator independently assigned a sentiment label, either positive or negative, to the comment. In the second stage, the labels assigned by the three annotators were compared: if all three or two of the annotators agreed on the sentiment label, that label was considered the final annotation for the comment. This majority-vote approach ensured that the final sentiment labels were consistent and reflected the consensus of the annotators. However, if one of the three annotators was uncertain about the labeling of the comment, the comment was excluded from the dataset. This exclusion step ensured that only comments with clear and consistent sentiment labels were retained, further enhancing the quality of the dataset. The agreement between our three annotators, measured using Fleiss' kappa, was almost perfect (kappa = 0.97) [45], [46].

Ambiguity in labeling often arose from comments containing sarcasm, rhetorical questions, quotes from others, or mixed sentiments within a single statement. Such nuances made it challenging for annotators to assign a clear positive or negative label consistently. Translating these comments into English is especially challenging, as much of their meaning, tone, and cultural context would be lost. Table 5 provides several examples illustrating these cases. We constructed TunDC, a sentiment analysis corpus for TA, comprising 9,088 positive and 10,956 negative examples, ensuring a good distribution for robust ML and language modeling applications.

Figure 1. Step-by-step filtering, script classification, and sentiment labeling process applied to the TunDC dataset

Table 5. Examples of ambiguous and borderline sentiment cases and their assigned labels
Ambiguous example | Challenge | Assigned label
[Arabic script] (May God bless his parents; if Tunisia had more people like him, things would be much better for us.) | Mixed sentiment, colloquial nuances | Positive
[Arabic script] (The future of the country depends on the future of its youth. Save Tunisia's youth; it is the responsibility of the state.) | General statement with implied sentiment | Positive
Cest bien bech hata rbo3 rajil maytjara wymed yedo 3la marto ydhrbha (It's great, so that even a quarter of a man won't dare to raise his hand and beat his wife.) | Sarcasm, indirect expression | Negative
[Arabic script] (History is important to know, but not at the expense of the present and future.) | Statement with contrasting ideas | Positive
Brabbi ya mosaique iktbou 7aja s7i7a. Martou hia awal 3amla l Amazon wa wa9t el charika 3la sa9iha m3ah. Ya3ni chrika 50/50 kan mouch akthar (Please, Mosaique, write something accurate. His wife was the first employee at Amazon and helped build the company with him. It's a 50/50 company, if not more.) | Long, detailed comment with mixed facts and opinions | Negative
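For reference, the agreement computation and majority vote can be reproduced as follows, assuming the three annotators' labels are stored as one row per comment (0 = negative, 1 = positive); the toy ratings array is illustrative.

```python
# Sketch of inter-annotator agreement (Fleiss' kappa) and majority voting.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# ratings[i, j] = label from annotator j for comment i (toy data).
ratings = np.array([
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
])

table, _ = aggregate_raters(ratings)  # per-comment category counts
print(f"Fleiss' kappa = {fleiss_kappa(table):.2f}")

# Majority vote; with three annotators and two classes a tie is impossible,
# and comments flagged as uncertain were excluded upstream.
final_labels = (ratings.sum(axis=1) >= 2).astype(int)
```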
3.2. TunDC dataset description
TunDC's primary purpose is to facilitate sentiment analysis tasks. To achieve this goal, we crafted a near-balanced dataset in terms of the proportion of writing systems (Arabic and Latin) within each sentiment label class. This representation ensures that the dataset accurately reflects the distribution of writing systems in real-world TD usage, enabling sentiment analysis models to effectively capture sentiment patterns across both writing conventions. The labels are distributed as follows: 45.3% positive and 54.7% negative. Comments written in the Arabic alphabet represent around 68.6% of the total data, whereas comments written in the Latin alphabet only or in mixed script represent the remaining 31.4%. To the best of our knowledge, this is the largest public TD dataset, with over 20K manually labeled comments. Table 6 presents the exact distribution of comments in the dataset, categorized by sentiment (positive and negative) and writing system (Arabic, Latin, and mixed).

Table 6. Distribution of comments by sentiment and writing system in TunDC
TunDC | # Comments | # Arabic comments | # Latin comments | # Mixed comments
Positive comments | 9,088 | 5,735 | 3,061 | 292
Negative comments | 10,956 | 8,023 | 2,651 | 282

Figure 2 illustrates the distribution of each writing system (Arabic and Latin) per label (positive or negative). It displays the distribution of comments written in TD across positive and negative sentiment labels; the x-axis represents the sentiment label (positive and negative), while the y-axis denotes the number of comments. The results indicate that approximately 55% of the Arabic comments express a negative sentiment, whereas 45% convey a positive sentiment. Figure 2 also presents a donut chart of the distribution of comments written in Latin script across positive and negative sentiment labels. The data reveal that around 60% of the Latin comments are positive, while 40% are negative. This pattern aligns with the sentiment distribution observed in Arabic-script comments, indicating a similar sentiment trend across both writing systems. These figures clearly illustrate that the distribution of writing systems is almost balanced across both sentiment labels. This balanced representation is crucial for ensuring that sentiment analysis models trained on TunDC can accurately capture sentiment patterns across both writing conventions and sentiment classes. The following analysis examines the word and token distribution within the dataset, offering insights into its composition and suitability for different NLP tasks.

Figure 2. Distribution of positive and negative comments across writing systems (Arabic, Latin, and mixed) in the TunDC dataset, along with the overall proportion of each script type

The word count distribution (Figure 3) highlights a dominant range between 2 and 10 words, showing that short entries make up the majority of the dataset. The alignment between token and word counts follows typical patterns, with slight variations due to tokenization complexities such as special characters or encoding rules. This distribution implies a dataset that is well suited for classification or lightweight NLP tasks rather than generation-heavy applications.
Expanding the dataset to include longer texts may help improve versatility if required for more comprehensive NLP solutions.
Similarly, the token distribution plot (Figure 4) reveals a skewed pattern, with the majority of entries containing fewer than 50 tokens. This indicates that most texts in the dataset are concise, with a sharp decline in frequency as token counts increase. The cl100k_base tokenizer appears to compress text efficiently, as expected, with relatively short token sequences dominating the data. The steep drop-off after 50 tokens suggests that long-form content is rare, possibly reflecting a dataset built around brief comments or text snippets.

Figure 3. The word count distribution in TunDC

Figure 4. The token distribution in TunDC

In addition to the overall word and token distributions, we also analyzed the lexical diversity of the TunDC dataset. Specifically, TunDC contains 71,372 unique words, highlighting the broad range of expressions used in TD content. Among these, 47,889 unique words are in Arabic script, while 21,306 unique words are written in Latin script. This considerable presence of Latin-script terms reflects the multilingual and code-switching nature of TA, which frequently blends elements from French, English, and other languages into everyday digital communication. Such diversity underscores the linguistic richness of the dataset and reinforces the importance of tailoring NLP models to effectively handle both scripts in low-resource dialects.
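The token statistics reported here can be reproduced with the cl100k_base encoding via the tiktoken library, as in this brief sketch (the sample comments are placeholders).

```python
# Counting words and cl100k_base tokens per comment.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

comments = ["wallahi nhebek barcha wo n9adrek ena", "Bravo"]
token_counts = [len(enc.encode(c)) for c in comments]
word_counts = [len(c.split()) for c in comments]
print(token_counts, word_counts)
```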
Table 7 presents a selection of randomly chosen comments from the TunDC dataset. These comments represent a diverse range of sentiment labels and writing systems (Arabic, Latin, and mixed), providing insight into the linguistic and contextual variations within the dataset. By showcasing real examples, this table highlights the complexity of sentiment analysis in the TD and the challenges associated with processing dialectal Arabic text. Table 8 compares TunDC's key features and characteristics with other publicly available TD datasets.

Table 7. Sample of randomly selected comments from the TunDC dataset
Script type | Comment | Translation | Label
Arabic | [Arabic script] | A million barely gets you anything... This is a poverty handout, not a proper salary. | Negative
Arabic | [Arabic script] | What a fantastic series! Excellent actors... The script is absolutely crazy good. Bravo! So much creativity! | Positive
Latin | 9adeh 5alsouk ya ta7foun bech tahki hal kelmtin li kif wejhek | How much did they pay you, cutie, just to say those two worthless words? | Negative
Latin | wallahi nhebek barcha wo n9adrek ena | I swear to God, I love you very much and I respect/appreciate you. | Positive
Mixed | Fils à maman ... [Arabic script] | Mama's boy, and there are so many of them... | Negative
Mixed | Bravo [Arabic script] | That's a sound decision. Bravo | Positive

Table 8. Comparing TunDC to other publicly available sentiment datasets
Dataset | Size | Label classes | Data source | #Pos | #Neg
TunDC | 20,044 | 2 | Facebook and YouTube | 9,088 | 10,956
TEC [32] | 5,514 | 2 | X (Twitter) | - | -
TSAC [34] | 17K | 2 | Facebook | 8,845 | 8,215
TUNIZI [35] | 9,210 | 2 | YouTube | 4,372 | 4,838

4. RESULTS AND DISCUSSION
This section presents the experimental setup and the results used to evaluate the quality and utility of the TunDC dataset for sentiment classification in TD NLP. We assess the performance of a transformer-based model, fine-tuned on TunDC, to determine its effectiveness in capturing dialectal nuances. Key evaluation metrics, including accuracy, precision, recall, and F1-score, are used to analyze model performance. The following subsections outline the dataset preprocessing, model setup, evaluation criteria, and training-testing procedures.

4.1. Experimental setup
4.1.1. Dataset preparation
The TunDC dataset was divided into an 85% training set and a 15% validation set. We employed a stratified splitting strategy based not only on sentiment classes (positive and negative) but also explicitly on writing systems (Arabic, Latin, and mixed). This multi-criteria stratification approach proved effective, contributing to an approximate 8% increase in accuracy compared to simpler splits. Subsequent preprocessing focused on data cleaning and normalization. Specific cleaning steps included: filtering out comments consisting of only one word to ensure meaningful content; removing all URLs and user mentions for anonymization and noise reduction; and retaining emojis and hashtags when they appeared alongside text, acknowledging their significant semantic and emotional value. These preprocessing steps were essential for improving the model's ability to generalize across diverse linguistic variations in the TD.
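A minimal sketch of the multi-criteria stratified split follows, assuming the corpus is loaded as a pandas DataFrame with text, label, and script columns (the column names and file path are illustrative): stratifying on the joint (sentiment, script) key balances both factors across the 85/15 split.

```python
# Stratified 85/15 split on the joint sentiment x script key.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("tundc.csv")  # hypothetical path
strata = df["label"].astype(str) + "_" + df["script"]  # e.g. 'positive_latin'

train_df, val_df = train_test_split(
    df, test_size=0.15, stratify=strata, random_state=42
)
```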
4.1.2. Baseline models
Classical approaches: to contextualize the performance of our transformer-based solution, we established a set of classical ML baselines spanning distinct architectural paradigms: an SVM, an ensemble gradient-boosted tree model (extreme gradient boosting (XGBoost)), and a lightweight recurrent neural network (a single Bi-LSTM layer with 16 units followed by a dense layer of 8 units). All models were trained using default hyperparameters and evaluated on the same stratified 15% test split of the TunDC dataset described above, ensuring a fair comparison. No architecture-specific tuning was performed, allowing us to assess out-of-the-box performance under identical data conditions. Table 9 summarizes their results on the test set in terms of accuracy, F1-score, and area under the curve (AUC).
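As an illustration of one such baseline, the sketch below trains a linear SVM over character n-gram TF-IDF features, reusing the train_df/val_df split from the earlier sketch; the feature choice is an assumption (character n-grams are a common remedy for non-standardized orthography), not the paper's reported configuration.

```python
# Character n-gram TF-IDF + linear SVM baseline (illustrative features).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

svm = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),
    LinearSVC(),
)
svm.fit(train_df["text"], train_df["label"])
print(classification_report(val_df["label"], svm.predict(val_df["text"])))
```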
Table 9. Performance of classical baseline models on the TunDC test set
Model | Accuracy | F1-score | AUC
SVM | 0.77 | 0.77 | 0.85
XGBoost | 0.76 | 0.75 | 0.84
LSTM (Bi-LSTM-16 + Dense-8) | 0.78 | 0.76 | 0.85

Transformer-based approaches: we further evaluated four pre-trained Arabic BERT-style Transformer models, each trained on different combinations of MSA and dialectal content, to assess their out-of-the-box suitability for TD sentiment analysis. The selected models were bert-base-arabic-camelbert-da (dialect-focused), UBC-NLP/MARBERTv2 (trained on dialect-heavy social media text), CAMeL-Lab/bert-base-arabic-camelbert-mix (mixed MSA and dialect), and asafaya/bert-base-arabic (primarily MSA with some dialectal coverage). All models were fine-tuned identically using the same training protocol (learning rate 3e-5, batch size 8, 4 epochs) and evaluated on the same stratified test split of TunDC. As shown in Table 10, all four models significantly outperformed the classical baselines, with test F1-scores ranging from 0.826 to 0.837. Notably, both camelbert-mix and asafaya/bert-base-arabic achieved the highest evaluation accuracy and F1-score (0.837). Given its strong performance, broader linguistic coverage, and established use as a foundation for Arabic NLP tasks, we selected asafaya/bert-base-arabic as the base architecture for our subsequent hyperparameter tuning, error analysis, and final model development, leading to the TunDC-mixed model described in the next section.

Table 10. Performance of transformer-based baseline models on the TunDC test set
Model | Test accuracy | Test F1-score
camelbert-da | 0.826 | 0.826
MARBERTv2 | 0.836 | 0.835
camelbert-mix | 0.837 | 0.837
asafaya/bert-base-arabic | 0.837 | 0.837

4.1.3. Evaluated model
For this study, we evaluated TunDC-mixed [47], our fine-tuned transformer-based model built upon asafaya/bert-base-arabic [48], [49]. To optimize performance on the TunDC dataset, we conducted systematic hyperparameter tuning using Ray Tune from the Ray library. Specifically, we employed random search over a predefined hyperparameter space, exploring the following dimensions:
- learning_rate: log-uniform sampling between 1e-5 and 5e-5;
- per_device_train_batch_size: categorical selection from {4, 8, 16};
- weight_decay: uniform sampling in [0.0, 0.1];
- num_train_epochs: integer values from 3 to 6;
- lr_scheduler_type: categorical choice between linear and cosine.
The search yielded the following optimal configuration: learning rate 3e-5, batch size 8, weight decay 0.017, 4 training epochs, and a linear learning rate scheduler. Using these hyperparameters, the final model, comprising 111 million parameters, was trained with the AdamW optimizer. Its strong performance underscores its capacity to capture the linguistic nuances and contextual variability inherent in low-resource TA dialects, making it well suited for sentiment classification in this domain.
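The random search described above maps naturally onto the Hugging Face Trainer's Ray Tune backend; the sketch below mirrors the reported search space, while the tokenized datasets, compute_metrics function, and trial budget are assumptions.

```python
# Random hyperparameter search with Ray Tune over the reported space.
from ray import tune
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        "asafaya/bert-base-arabic", num_labels=2)

trainer = Trainer(
    model_init=model_init,
    args=TrainingArguments(output_dir="tundc-hpo", eval_strategy="epoch"),
    train_dataset=train_ds,           # assumed tokenized TunDC splits
    eval_dataset=val_ds,
    compute_metrics=compute_metrics,  # assumed to report accuracy/F1
)

best = trainer.hyperparameter_search(
    hp_space=lambda trial: {
        "learning_rate": tune.loguniform(1e-5, 5e-5),
        "per_device_train_batch_size": tune.choice([4, 8, 16]),
        "weight_decay": tune.uniform(0.0, 0.1),
        "num_train_epochs": tune.choice([3, 4, 5, 6]),
        "lr_scheduler_type": tune.choice(["linear", "cosine"]),
    },
    backend="ray",
    direction="maximize",
    n_trials=20,  # illustrative trial budget
)
print(best.hyperparameters)
```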
4.1.4. Evaluation metrics
The model's performance was assessed using standard classification metrics: accuracy, precision, recall, and F1-score. Accuracy measured the overall correctness of predictions, while precision evaluated the proportion of correctly predicted positive (or negative) labels. Recall measured the model's ability to capture all relevant instances of a sentiment class. The F1-score, as the harmonic mean of precision and recall, ensured a balanced evaluation, particularly useful when handling variations in sentiment representation. Additionally, the macro-averaged F1-score was reported to treat each sentiment class equally, regardless of class size, and the weighted F1-score was computed to account for class distribution, providing a more representative measure of overall model performance. Together, these evaluation metrics offered a robust assessment of bert-base-arabic-TunDC-mixed in processing TD text.
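These metrics correspond directly to scikit-learn's standard implementations, as the following sketch shows (gold labels and predictions are assumed to be integer arrays with 0 = negative, 1 = positive).

```python
# Accuracy plus macro- and weighted-averaged F1, as reported above.
from sklearn.metrics import accuracy_score, classification_report, f1_score

def evaluate(y_true, y_pred):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "weighted_f1": f1_score(y_true, y_pred, average="weighted"),
    }

# Per-class precision/recall/F1 in one call:
# print(classification_report(y_true, y_pred, target_names=["neg", "pos"]))
```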