IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 15, No. 2, April 2026, pp. 1891∼1908
ISSN: 2252-8938, DOI: 10.11591/ijai.v15.i2.pp1891-1908
TunDC: a public benchmark dataset for sentiment analysis and language modeling in the Tunisian dialect

Ahmed Khalil Boulahia 1, Mourad Mars 2
1 Tunis Dauphine University, Tunis, Tunisia
2 Department of Computer Science and Artificial Intelligence, College of Computing, Umm Al-Qura University, Mecca, Saudi Arabia
Article Info

Article history:
Received May 8, 2025
Revised Jan 29, 2026
Accepted Feb 6, 2026

Keywords:
Arabic dataset
Artificial intelligence
Fine-tuning
Large language model
Low-resource language
Sentiment analysis
Tunisian dialect

ABSTRACT
The development of natural language processing (NLP) applications has increasingly focused on dialectal variations of languages. The Tunisian dialect (TD), a widely spoken variant of Arabic, poses unique linguistic challenges due to its lack of standardized writing conventions and influences from multiple languages, including French, Italian, Turkish, and Berber. In this work, we introduce TunDC, a dataset of 20,044 labeled comments designed to advance NLP research on the TD. The dataset covers diverse linguistic forms (Arabic, Latin, and mixed scripts), and each comment was manually annotated for positive or negative sentiment by native speakers, achieving high inter-annotator agreement. To evaluate its effectiveness, we fine-tuned various models on TunDC. The bert-base-arabic-TunDC-mixed model achieved an accuracy of 0.84 and a macro-averaged F1-score of 0.83, demonstrating strong generalization across sentiment categories and writing systems. A stratified data-splitting strategy considering both sentiment and script type further improved accuracy by approximately 8% compared to standard splits. As a publicly available resource, TunDC contributes to the computational linguistics community, fostering advancements in language modeling and applications tailored to the TD.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Mourad Mars
Department of Computer Science and Artificial Intelligence, College of Computing, Umm Al-Qura University
Makkah 24382, Saudi Arabia
Email: msmars@uqu.edu.sa
1. INTRODUCTION
Recent years have witnessed remarkable advancements in large language models (LLMs) and generative AI, leading to significant breakthroughs in various natural language processing (NLP) tasks. From machine translation and text summarization to writing different kinds of creative content, these models have demonstrated exceptional capabilities in understanding and generating human language. However, a major obstacle to broader adoption and linguistic inclusivity is the persistent scarcity of annotated data, especially for low-resource languages and dialects. This limitation often leads to significant performance gaps, undermining the linguistic richness of these communities and restricting their access to NLP technologies. One such case is Tunisian Arabic (TA), an Arabic dialect spoken by over 12 million individuals in Tunisia, which offers a captivating and underexplored aspect. Unlike standard Arabic, it possesses unique linguistic characteristics shaped by historical and cultural influences. Its vocabulary draws heavily from Arabic but also incorporates a wide range of loanwords from Berber, Turkish, French, English, and Italian, enriching its expressive power and reflecting its vibrant sociolinguistic landscape (Table 1).

Journal homepage: http://ijai.iaescore.com
Yet, the lack of large labeled datasets tailored for sentiment analysis in the Tunisian dialect (TD) poses a barrier to deeper exploration and hinders the development of robust NLP applications. When it comes to writing in the TD, people use different writing systems: Arabic script (abjad), Latin script, and even numbers to represent some characters. Most of the time, the TD doesn't have any well-defined structure and does not conform to any conventions or orthographic rules. These characteristics of the TD present an additional challenge in effectively processing it using NLP techniques. Table 2 presents some of the most common expressions that represent translations of the English sentence "when are you going to the doctor?".
Table 1. Examples of loanwords in the TD
Tunisian dialect | Origin | English translation
(Arabic script) | Italian "banca" | Bank
(Arabic script) | French "appartement" | Apartment
(Arabic script) | Berber "labes" | Fine
(Arabic script) | English "talifoun" | Telephone
Table 2. Different TD expressions that represent a translation of "when are you going to the doctor?"
Reference sentence | When are you going to the doctor?
Sentence 1 | (Arabic script)
Sentence 2 | Wa9tash timshi litbib?
Sentence 3 | Waktech temchi lel doctour?
Sentence 4 | Ana wa9t machi 3and tbib?
The processing of the TD represents a challenging task, mainly due to its ambiguous and complex structure, not only for machines but sometimes for Tunisians themselves. People can write in both directions (right to left and left to right) using the Arabic alphabet (abjad) and the Latin alphabet. On many occasions, they use both simultaneously. It also varies according to a person's age, origins, and culture. The majority of Arabic users on social media platforms use dialects to express themselves; most of these dialects can be described as unstructured, non-grammatical slang Arabic. This non-uniformity in dialects makes it more difficult for machine learning (ML) algorithms and LLMs to be able to perform tasks such as sentiment analysis. Hence, there is an increasing need for larger datasets to improve the performance of these models.
This paper aims to tackle this challenge by introducing TunDC, a new benchmark corpus specifically designed for sentiment analysis in TA. We leverage the power of social media, collecting and annotating more than 20K comments with sentiment labels (positive or negative) provided by native speakers. TunDC is intended not only as a training resource but also as a standardized benchmark for evaluating and comparing NLP systems on TD sentiment analysis, with potential use in future shared tasks and competitions. Moreover, the dataset's scale and script diversity make it suitable for pre-training dialect-specific language models and for enabling effective transfer learning through fine-tuning of multilingual or Arabic-centric transformers such as Arabic bidirectional encoder representations from transformers (AraBERT), multidialectal Arabic BERT (MARBERT), or cross-lingual language model-robustly optimized BERT pre-training approach (XLM-R). By addressing the data scarcity issue, TunDC aims to empower future research and development of sentiment analysis solutions tailored to the unique linguistic characteristics of TA [1], [2].
This paper makes three key contributions. First, it provides a survey of available resources for the TD. Second, it introduces TunDC, a novel, publicly available benchmark dataset for sentiment analysis in TA, designed to support both model training and standardized evaluation. Third, it presents the training, evaluation, and public release on Huggingface of multiple pre-trained LLMs fine-tuned via transfer learning, including Camembert-BERT (CamemBERT) (AhmedBou/camembert-TunDC), bert-base-uncased (AhmedBou/camembert-TunDC), Modernized BERT (ModernBERT) (AhmedBou/ModernBERT-TunDC), and bert-base-arabic (AhmedBou/bert-base-arabic-TunDC-mixed), on the TunDC dataset.
The rest of the paper is organized as follows: section 2 provides an overview of related work in sentiment analysis and datasets for Arabic dialects. Section 3 details the dataset creation process, covering data collection, preprocessing, annotation, statistical analysis, and evaluation. Section 4 describes the experimental setup, presents the results, and discusses the findings. Section 5 discusses ethical considerations, including bias and fairness in Tunisian dialect NLP. Finally, section 6 concludes the paper and outlines potential directions for future research.
2. RELATED WORKS
Sentiment analysis in the TD presents a unique challenge due to the dialect's distinct vocabulary, morphology, and syntax compared to modern standard Arabic (MSA) [3], [4]. Several research efforts have focused on addressing this challenge, employing diverse approaches and datasets [5]–[11]. This section provides a comprehensive overview of existing work, with a particular focus on recent advances in multilingual models, dialect adaptation, and the specific challenges and opportunities within TA NLP.
2.1. Advances in multilingual transformers and dialect adaptation
The advent of the Transformer architecture has revolutionized NLP, with multilingual large language models (MLLMs) like XLM-R and the more recent generative models demonstrating impressive cross-lingual capabilities. Recent work has focused on adapting these powerful models to low-resource dialects. A key challenge is the dialect adaptation of these MLLMs. While models pre-trained on massive amounts of data, including some Arabic content, show a baseline performance, they often struggle with the nuances of specific dialects like TD. Studies such as those presented at Arabic natural language processing 2025 (ArabicNLP 2025) and empirical methods in natural language processing 2025 (EMNLP 2025) have shown that MLLMs exhibit a significant "Arabic gap" in handling dialectal variations, code-mixing, and script-switching [12].
Model merging and continual pre-training have emerged as effective strategies for dialect adaptation, moving beyond conventional fine-tuning. For instance, research in 2025 explored model merging to adapt multilingual models for code-mixed tasks, a highly relevant challenge for TD, which frequently mixes Arabic script, Latin script (Arabizi), and French loanwords [13]. Furthermore, the evaluation of generative LLMs in zero-shot and few-shot settings for Arabic tasks, including sense disambiguation and translation, highlights the growing trend of leveraging the inherent knowledge within these models to overcome data scarcity in dialectal NLP [14], [15].
2.2. Zero- and few-shot learning for low-resource dialects
The scarcity of large, high-quality annotated datasets for dialects necessitates the exploration of zero-shot (ZSL) and few-shot learning (FSL) techniques. These methods are crucial for advancing NLP in low-resource settings, including TA. Recent surveys (2024-2025) on Arabic dialect processing emphasize the shift towards ZSL and FSL, often utilizing the power of LLMs. In the context of sentiment analysis, ZSL and FSL allow models to generalize from instructions or a handful of examples, bypassing the need for extensive, costly manual annotation [16]. For TD specifically, researchers are benchmarking LLMs' performance, finding that while they struggle initially, prompting techniques and FSL can significantly improve their ability to recognize and adhere to the dialect's unique linguistic structure [17]. This approach is particularly promising for tasks like sentiment analysis, where the underlying sentiment concept is universal, but the dialectal expression is unique.
2.3. Technical gaps and linguistic challenges in Tunisian dialect natural language processing
The TD presents a unique set of linguistic and technical challenges that impede the development of robust NLP systems. The most significant linguistic hurdle is the script-switching phenomenon, where users fluidly switch between Arabic and Latin characters, often within the same sentence, to represent TD phonemes. This necessitates resources like the Tunisian Arabish corpus (TArC) [18] and the newly introduced LinTO datasets [19], which focus on transliteration and linguistic annotation to bridge the gap between the written forms. The technical gap is primarily the lack of a large-scale, multi-domain, and multi-script benchmark dataset, which TunDC aims to address. As shown in Table 3, extreme code-switching and script-mixing complicate model design, while the lack of orthographic standardization increases lexical variability. On the technical side, data scarcity and domain adaptation challenges further limit the performance of NLP systems, motivating the development of the TunDC dataset.
2.4. Arabic regional dialects
Unlike English, the Arabic language presents additional challenges due to its multiple dialects, the limited availability of large corpora, and the absence of vocalization. Therefore, creating high-quality datasets and developing NLP tools capable of accurately processing dialectal Arabic is crucial. Significant efforts have been made to construct datasets for specific dialects, with the Egyptian (EGY) [20]–[22] and Levantine (LEV) dialects being the most extensively studied [23], [24]. More recently, research has expanded to include the Palestinian (PAL) [25], Khaliji [26], [27], Syro-Palestinian [22], Gulf (GLF) [28], Mesopotamian (Iraqi) [22], and Maghrebi (MGR) [29] dialects [30], [31]. However, the TD remains underexplored, with limited linguistic resources and NLP tools available.
Table 3. Technical gaps and linguistic challenges in TA NLP
Challenge type | Specific challenge in TD | Impact on NLP development
Linguistic | Extreme code-switching/mixing: frequent, unstandardized mixing of Arabic script, Latin script (Arabizi), and French/Italian/Turkish loanwords. | Requires models to handle multiple orthographies and languages simultaneously, increasing model complexity and data requirements.
Linguistic | Lack of orthographic standardization: no fixed rules for writing TA, leading to high lexical variability (e.g., multiple ways to write the same word). | Hinders the effectiveness of traditional tokenization, stemming, and lexicon-based methods.
Technical | Data scarcity and fragmentation: limited availability of large, publicly accessible, and high-quality annotated corpora. Existing datasets are often small and task-specific. | Prevents the effective pre-training of dedicated, high-performing TD language models.
Technical | Domain adaptation: models trained on one domain (e.g., political tweets) perform poorly on others (e.g., e-commerce comments). | Requires continuous adaptation strategies and diverse datasets like TunDC to ensure generalizability.
2.5. Tunisian dialect datasets
The first interest in TD sentiment analysis was in 2016, when Sayadi et al. [32] presented a sentiment analysis study on the first labeled and publicly available dataset, called the Tunisian election corpus (TEC). This dataset is composed of 5,514 tweets collected during the Tunisian elections period of 2014; 3,760 of them are in MSA, and 1,754 are in the TD. Many ML approaches were presented, and a comparative study was conducted. The presented results showed that support vector machines (SVM) achieved a higher accuracy of 71.09% than the other methods used.
The Tunisian Arabic corpus (TAC) [33] consists of 800 tweets covering various topics, including media, telecommunications, and politics. This dataset was gathered by Karmani [33] and labeled with sentiment categories: positive, negative, and neutral.
In 2017, a dataset called the Tunisian sentiment analysis corpus (TSAC) was presented and made available publicly for the NLP Tunisian community [34]. The dataset was obtained from Facebook comments about popular TV shows, and it is written only with Arabic letters. The authors reported the first application of deep learning in sentiment analysis on the TD, where they used a multi-layer perceptron (MLP), which produced a lower error rate than SVM and naïve-Bayes and reached 78% accuracy.
Tunisian Arabizi (TUNIZI) [35] contains 9,210 comments gathered from the YouTube platform and labeled positive and negative. Many topics were covered in this dataset, such as sports, politics, comedy, and TV shows. Both classes are similarly represented in this dataset, with 47% positive comments and 53% negative comments.
Masmoudi et al. [36] introduced a manually annotated dataset for sentiment analysis of the TD, composed of comments collected from official Facebook pages of Tunisian supermarkets. The dataset was labeled based on five sentiment categories (very positive, positive, neutral, negative, and very negative) and twenty aspect-based categories. To analyze sentiment, the authors experimented with three deep learning models: convolutional neural networks (CNN), long short-term memory (LSTM), and bidirectional long short-term memory (Bi-LSTM). Their evaluation showed that CNN and Bi-LSTM achieved the best classification performance, demonstrating the effectiveness of deep learning in processing TD text.
Gugliotta and Dinarelli [18] introduced TArC, a publicly available dataset designed for processing TA written in Arabizi. The corpus was developed alongside an NLP tool that provides various levels of linguistic annotation, including word classification, transliteration, tokenization, part-of-speech tagging (POS-tagging), and lemmatization. The authors outlined their computational and linguistic methodologies, discussing strategies to enhance annotation accuracy. Their experiments demonstrated the effectiveness of these resources for both computational applications and linguistic research.
Mulki et al. [37] investigated sentiment analysis of the TD using both supervised and lexicon-based models. They evaluated preprocessing techniques such as stemming, emoji recognition, and negation detection on three datasets of varying sizes. Their results showed that these preprocessing steps significantly improved sentiment classification performance, with named entity tagging further enhancing lexicon-based models and benefiting supervised models on smaller datasets.
To summarize, Table 4 provides an overview of the datasets collected for the TD sentiment analysis task. The existing datasets are either small, limited in script coverage, or focused on specific tasks, such as transliteration. TunDC distinguishes itself by offering a large, publicly available, and script-diverse corpus (Arabic, Latin, and mixed scripts) with high-quality, manually verified sentiment annotations, positioning it as a robust benchmark for multi-script and multi-domain sentiment analysis.
Table 4. Summary of available datasets for TD sentiment analysis
Study | Dataset | Size | Source | Labels
Sayadi et al. [32] | TEC [32] | 5,514 tweets | Twitter | Positive, Negative
Karmani [33] | TAC [33] | 800 tweets | Facebook | Positive, Negative, Neutral
Medhaffar et al. [34] | TSAC [34] | 17k tweets | Facebook | Positive, Negative
Fourati et al. [35] | TUNIZI [35] | 9,210 YouTube comments | YouTube | Positive, Negative
Masmoudi et al. [36] | Comments about Tunisian supermarkets [36] | 17k Arabic script posts; 27k Arabizi script posts | Facebook | Very Positive, Positive, Neutral, Negative, Very Negative
Gugliotta and Dinarelli [18] | TArC [18] | 11,291 comments | Facebook | Positive, Negative, Neutral
3. TUNISIAN DIALECT CORPUS
TunDC was developed through a three-step process consisting of data gathering, content filtering and preprocessing, and one-stage annotation. This dataset is designed to be diverse and well-structured, providing a valuable resource for training and evaluating deep learning models for sentiment analysis in the TD.
3.1. Data gathering, pre-processing, and labelling
A further inspection of the TD typesetting commonly used on social media shows that there are many factors affecting its structure, such as the user's age, sex, region, and interests. The goal is to build a large dataset that contains the majority of this used vocabulary. We have gathered approximately 24K comments through data scraping from over 300 social media posts and videos across various Tunisian Facebook pages and YouTube channels, with the most recent scraping conducted in January 2025. These comments cover a broad array of topics, including music, politics, sports, news, and TV shows. To ensure vocabulary diversity, we limited the maximum number of comments scraped per post to 200. Additionally, we intentionally included longer comments in the dataset to better process and understand extended contextual entries. The longest comment in the TunDC dataset spans 873 tokens, while the average comment length is 42 tokens.
We filtered out the non-TD comments and comments fully written in MSA using fasttext scoring [38], [39]. Then, other filters were performed on the collected dataset to exclude inappropriate comments (harmful and offensive) and to involve only the comments from Tunisian people that were written in TD. Additionally, comments consisting of only one word were excluded from the dataset to maintain the quality and relevance of the content. This step ensures that the remaining comments contain sufficient context and information, allowing for more accurate sentiment analysis.
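Alongside this language filtering, each retained comment is assigned a script type (Arabic, Latin, or mixed). A minimal sketch of such a script classifier based on Unicode ranges follows; this is an illustrative heuristic, not the authors' exact implementation, and the function and label names are ours:

```python
import re

ARABIC = re.compile(r'[\u0600-\u06FF]')  # Arabic Unicode block
LATIN = re.compile(r'[A-Za-z]')          # basic Latin letters

def classify_script(comment: str) -> str:
    """Label a comment as 'arabic', 'latin', or 'mixed' by the letters it contains."""
    has_arabic = bool(ARABIC.search(comment))
    has_latin = bool(LATIN.search(comment))
    if has_arabic and has_latin:
        return "mixed"
    if has_arabic:
        return "arabic"
    if has_latin:
        return "latin"
    return "other"  # digits or emoji only

print(classify_script("behi barcha"))  # latin
```

In practice a frequency threshold (e.g., requiring a minimum share of letters from each script before calling a comment "mixed") would make the rule more robust to isolated stray characters.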
The last step in filtering was to perform deduplication [40] by removing near-duplicate examples and long repetitive sub-strings to improve the quality of the dataset, allowing for more accurate evaluation and better model performance [41]–[43].
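A simplified stand-in for this deduplication step is hash-based matching after light text normalization; the cited approach [40] also catches near-duplicates and repetitive sub-strings, which this sketch does not attempt:

```python
from hashlib import md5

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial variants hash identically.
    return " ".join(text.lower().split())

def deduplicate(comments):
    """Keep only the first occurrence of each normalized comment."""
    seen, kept = set(), []
    for c in comments:
        key = md5(normalize(c).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(c)
    return kept

print(len(deduplicate(["Behi barcha", "behi  barcha", "mouch behi"])))  # 2
```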
The complete step-by-step filtering, script classification, and sentiment labeling process applied to the TunDC dataset is illustrated in Figure 1.
After the gathering and filtering phases, the data preprocessing starts by removing all kinds of links and special characters. Recognizing that emojis carry significant semantic meaning and are prevalent in real-world text, we intentionally preserved them during preprocessing, rather than replacing them with decoded formats (as might be done with packages like the emoji package [44]). This approach aims to improve the model's understanding of nuanced expressions and enhance its generalization capabilities on authentic data.
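The link- and special-character removal, with emojis deliberately left untouched, might look like the following. This is a minimal sketch; the exact character set treated as "special" is an assumption of ours, not taken from the paper:

```python
import re

URL = re.compile(r'https?://\S+|www\.\S+')
# Characters treated as "special" here are illustrative; emojis are not matched
# by this class and therefore survive cleaning.
SPECIAL = re.compile(r'[#@*_~^<>\\|\[\]{}]')

def clean(comment: str) -> str:
    comment = URL.sub(" ", comment)       # strip links
    comment = SPECIAL.sub(" ", comment)   # strip special characters
    return " ".join(comment.split())      # normalize whitespace

print(clean("behi barcha 😍 https://t.co/xyz #trend"))  # behi barcha 😍 trend
```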
To ensure the accuracy and reliability of the labels, a careful labeling process was implemented. Initially, a representative sample of approximately 20K comments was carefully selected, ensuring the inclusion of diverse linguistic styles, topics, and sentiment expressions. To establish a consistent labeling framework, a comprehensive description of labels, detailed guidelines, and a list of key instructions were provided. In addition, annotated comment examples were presented to the annotators to foster a clear understanding of the labeling criteria. The primary labeling task involved determining whether each comment conveyed a positive or negative sentiment.
Each comment underwent a three-stage annotation process that involved three native Arabic and Tunisian speakers volunteering as annotators. In the first stage, each annotator independently assigned a sentiment label, either positive or negative, to the comment. In the second stage, the labels assigned by the three annotators were compared. If three or two annotators agreed on the sentiment label, that label was considered the final annotation for the comment. This majority vote approach ensured that the final sentiment labels were consistent and reflected the consensus of the annotators. However, if one of the three annotators was uncertain about the labeling of the comment, the comment was excluded from the dataset. This exclusion step ensured that only comments with clear and consistent sentiment labels were retained, further enhancing the quality of the dataset.
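The label-resolution rule described above (majority vote over three annotators, with exclusion when any annotator is uncertain) can be sketched compactly; the "uncertain" marker and function names are ours:

```python
from collections import Counter

def resolve_label(votes):
    """Resolve three annotator votes into a final label.

    Returns the majority label, or None when any annotator marked the
    comment as 'uncertain' (such comments were dropped from TunDC).
    """
    if "uncertain" in votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else None

print(resolve_label(["positive", "positive", "negative"]))   # positive
print(resolve_label(["positive", "uncertain", "positive"]))  # None
```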
The agreement between our three annotators, measured using the Fleiss kappa measurement, was almost perfect (kappa = 0.97) [45], [46].
The ambiguity in labeling often arose from comments containing sarcasm, rhetorical questions, quotes from others, or mixed sentiments within a single statement. Such nuances made it challenging for annotators to assign a clear positive or negative label consistently. Translating these comments into English is especially challenging, as much of their meaning, tone, and cultural context would be lost. Table 5 provides several examples illustrating these cases.
We constructed TunDC, a sentiment analysis corpus for TA, comprising 9,088 positive and 10,956 negative examples, ensuring a good distribution for robust ML and language modeling applications.
Figure 1. Step-by-step filtering, script classification, and sentiment labeling process applied to the TunDC dataset
Table 5. Examples of ambiguous and borderline sentiment cases and their assigned labels
Ambiguous example | Challenge | Assigned label
(Arabic script) "May God bless his parents; if Tunisia had more people like him, things would be much better for us." | Mixed sentiment, colloquial nuances | Positive
(Arabic script) "The future of the country depends on the future of its youth. Save Tunisia's youth; it is the responsibility of the state." | General statement with implied sentiment | Positive
"Cest bien bech hata rbo3 rajil maytjara wymed yedo 3la marto ydhrbha" ("It's great, so that even a quarter of a man won't dare raise his hand to beat his wife.") | Sarcasm, indirect expression | Negative
(Arabic script) "History is important to know, but not at the expense of the present and future." | Statement with contrasting ideas | Positive
"Brabbi ya mosaique iktbou 7aja s7i7a. Martou hia awal 3amla l Amazon wa wa9t el charika 3la sa9iha m3ah. Ya3ni chrika 50/50 kan mouch akthar" ("Please, Mosaique, write something accurate. His wife was the first employee at Amazon and helped build the company with him. It's a 50/50 company, if not more.") | Long, detailed comment with mixed facts and opinions | Negative
3.2. TunDC dataset description
TunDC's primary purpose is to facilitate sentiment analysis tasks. To achieve this goal, we crafted a near-balanced dataset in terms of the proportion of writing systems (Arabic and Latin) within each sentiment label class. This representation ensures that the dataset accurately reflects the distribution of writing systems in real-world TD usage, enabling sentiment analysis models to effectively capture sentiment patterns across both writing conventions. The labels are represented as follows: 45.3% are positive labels and 54.6% are negative labels. The comments written in the Arabic alphabet represent around 68.6% of the total data, whereas the comments written in the Latin alphabet only or mixed represent the remaining 31.4%. To the best of our knowledge, this is the largest public TD dataset, with over 20K manually labeled comments. Table 6 presents the exact distribution of comments in the dataset, categorizing them by sentiment (positive and negative) and writing system (Arabic, Latin, and mixed).
Table 6. Distribution of comments by sentiment and writing system in TunDC
TunDC | # Comments | # Arabic comments | # Latin comments | # Mixed comments
Positive comments | 9,088 | 5,735 | 3,061 | 292
Negative comments | 10,956 | 8,023 | 2,651 | 282
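The aggregate percentages quoted in section 3.2 follow directly from the Table 6 counts:

```python
# Counts taken from Table 6 (sentiment × writing system).
counts = {
    ("positive", "arabic"): 5735, ("positive", "latin"): 3061, ("positive", "mixed"): 292,
    ("negative", "arabic"): 8023, ("negative", "latin"): 2651, ("negative", "mixed"): 282,
}
total = sum(counts.values())
positive = sum(v for (s, _), v in counts.items() if s == "positive")
arabic = sum(v for (_, w), v in counts.items() if w == "arabic")

print(total)                              # 20044
print(round(100 * positive / total, 1))   # 45.3 (% positive labels)
print(round(100 * arabic / total, 1))     # 68.6 (% Arabic-script comments)
```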
Figure 2 illustrates the distribution of each writing system (Arabic and Latin) per label (positive or negative). It displays the distribution of comments written in TD across positive and negative sentiment labels. The x-axis represents the sentiment label (positive and negative), while the y-axis denotes the number of comments. The results indicate that approximately 55% of the Arabic comments express a negative sentiment, whereas 45% convey a positive sentiment. Figure 2 also presents a donut chart of the distribution of comments written in Latin script across positive and negative sentiment labels. The data reveal that around 60% of the Latin comments are positive, while 40% are negative. This pattern aligns with the sentiment distribution observed in Arabic-script comments, indicating a similar sentiment trend across both writing systems.
These figures clearly illustrate that the distribution of writing systems is almost balanced across both sentiment labels. This balanced representation is crucial for ensuring that the sentiment analysis models trained on TunDC can accurately capture sentiment patterns across both writing conventions and sentiment classes.
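The abstract reports that splitting the data while stratifying jointly on sentiment and script type improved accuracy over standard splits. One way to realize such a split in pure Python is sketched below; the split ratio, seed, and field names are illustrative assumptions, not the paper's actual configuration:

```python
import random

def stratified_split(items, key, test_frac=0.2, seed=42):
    """Split items into train/test while preserving the joint proportions of
    the strata returned by `key`, e.g. (sentiment, script)."""
    rng = random.Random(seed)
    strata = {}
    for it in items:
        strata.setdefault(key(it), []).append(it)
    train, test = [], []
    for group in strata.values():
        rng.shuffle(group)
        cut = int(len(group) * test_frac)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test

data = [{"label": l, "script": s}
        for l in ("pos", "neg") for s in ("ar", "lat") for _ in range(10)]
train, test = stratified_split(data, key=lambda c: (c["label"], c["script"]))
print(len(train), len(test))  # 32 8
```

The same effect can be obtained with scikit-learn's train_test_split by passing a composite stratify key.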
The following analysis examines the word and token distribution within the dataset, offering insights into its composition and suitability for different NLP tasks.
Figure 2. Distribution of positive and negative comments across writing systems (Arabic, Latin, and Mixed) in the TunDC dataset, along with the overall proportion of each script type
The word count distribution (Figure 3) highlights a dominant range between 2 and 10 words, showing that short entries make up the majority of the dataset. The alignment between token and word counts follows typical patterns, with slight variations due to tokenization complexities, such as special characters or encoding rules. This distribution implies a dataset that is well-suited for classification or lightweight NLP tasks rather than generation-heavy applications. Expanding the dataset to include longer texts may help improve versatility if required for more comprehensive NLP solutions.
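The per-comment length statistics behind Figures 3 and 4 reduce to a single pass over the corpus. A whitespace-token sketch follows (the paper's token counts are computed with the cl100k_base tokenizer instead, so absolute numbers differ):

```python
def length_stats(comments):
    """Return (max_length, mean_length) in whitespace-separated tokens."""
    lengths = [len(c.split()) for c in comments]
    return max(lengths), sum(lengths) / len(lengths)

sample = ["behi barcha", "mouch behi", "film zin barcha yaatik essaha"]
print(length_stats(sample))  # (5, 3.0)
```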
Similarly, the token distribution plot (Figure 4) reveals a skewed pattern, with the majority of entries containing fewer than 50 tokens. This indicates that most texts in the dataset are concise, with a sharp decline in frequency as token counts increase. The cl100k_base tokenizer appears to efficiently compress text, as expected, with relatively short token sequences dominating the data. The steep drop-off after 50 tokens suggests that long-form content is rare, possibly reflecting a dataset built around brief comments or text snippets.
Figure 3. The word count distribution in TunDC

Figure 4. The token distribution in TunDC
In addition to the overall word and token distributions, we also analyzed the lexical diversity of the TunDC dataset. Specifically, TunDC contains 71,372 unique words, highlighting the broad range of expressions used in TD content. Among these, 47,889 unique words are in Arabic script, while 21,306 unique words are written in Latin script. This considerable presence of Latin-scripted terms reflects the multilingual and code-switching nature of TA, which frequently blends elements from French, English, and other languages into everyday digital communication. Such diversity underscores the linguistic richness of the dataset and reinforces the importance of tailoring NLP models to effectively handle both scripts in low-resource dialects.
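One way such per-script vocabulary counts can be derived is by classifying each word via Unicode ranges; the regexes and function names below are illustrative assumptions, not the paper's published procedure:

```python
import re

ARABIC = re.compile(r"[\u0600-\u06FF]")  # main Arabic Unicode block
LATIN = re.compile(r"[A-Za-z]")

def script_of(word):
    """Classify a word as 'arabic', 'latin', 'mixed', or 'other'."""
    has_ar, has_la = bool(ARABIC.search(word)), bool(LATIN.search(word))
    if has_ar and has_la:
        return "mixed"
    if has_ar:
        return "arabic"
    if has_la:
        return "latin"
    return "other"

def unique_words_by_script(comments):
    """Count unique words overall and per script across a corpus."""
    vocab = {w for c in comments for w in c.split()}
    counts = {"arabic": 0, "latin": 0, "mixed": 0, "other": 0}
    for w in vocab:
        counts[script_of(w)] += 1
    return len(vocab), counts
```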
Table 7 presents a selection of randomly chosen comments from the TunDC dataset. These comments represent a diverse range of sentiment labels and writing systems (Arabic, Latin, and mixed), providing insight into the linguistic and contextual variations within the dataset. By showcasing real examples, this table highlights the complexity of sentiment analysis in the TD and the challenges associated with processing dialectal Arabic text. Table 8 provides a comparison of TunDC's key features and characteristics with other publicly available TD datasets.
Table 7. Sample of randomly selected comments from TunDC dataset
Script type | Comment | Translation | Label
Arabic | Tn ©ÐA¡ ...®} TA m` z wyl T§rhJ Mw rq | A million barely gets you anything... This is a poverty handout, not a proper salary. | Negative
Arabic | w§CAnys... §EAtm ylmm..T¤C sls d AJr ¤¤¤¤¤wr Ab¡ | What a fantastic series! Excellent actors... The script is absolutely crazy good. Bravo! So much creativity! | Positive
Latin | 9adeh 5alsouk ya ta7foun bech tahki Hal kelmtin li kif wejhek | How much did they pay you, cutie, just to say those two worthless words? | Negative
Latin | wallahi nhebek barcha wo n9adrek ena | I swear to God, I love you very much and I respect/appreciate you. | Positive
Mixed | Fils à maman ....¡r A¤ | Mama's boy, and there are so many of them... | Negative
Mixed | Bravo ¢l ¨ Cr | That's a sound decision. Bravo | Positive
Table 8. Comparing TunDC to other publicly available sentiment datasets
Dataset | Dataset size | Label classes | Data source | #Pos | #Neg
TunDC | 20,044 | 2 | Facebook and YouTube | 9,088 | 10,956
TEC [32] | 5,514 | 2 | X (Twitter) | - | -
TSAC [34] | 17K | 2 | Facebook | 8,845 | 8,215
TUNIZI [35] | 9,210 | 2 | YouTube | 4,372 | 4,838
4. RESULTS AND DISCUSSION
This section presents the experimental setup and results to evaluate the quality and utility of the TunDC dataset for sentiment classification in TD NLP. We assess the performance of a transformer-based model, fine-tuned on TunDC, to determine its effectiveness in capturing dialectal nuances. Key evaluation metrics, including accuracy, precision, recall, and F1-score, are used to analyze model performance. The following subsections outline the dataset preprocessing, model setup, evaluation criteria, and training-testing procedures.
4.1. Experimental setup
4.1.1. Dataset preparation
The TunDC dataset was divided into an 85% training set and a 15% validation set. We employed a stratified splitting strategy based not only on sentiment classes (positive and negative) but also explicitly on writing systems (Arabic, Latin, and mixed). This multi-criteria stratification approach proved effective, contributing to an approximate 8% increase in accuracy compared to simpler splits. Subsequent preprocessing focused on data cleaning and normalization. Specific cleaning steps included: filtering out comments consisting of only one word to ensure meaningful content; removing all URLs and user mentions for anonymization and noise reduction; and retaining emojis and hashtags if they appeared alongside text, acknowledging their significant semantic and emotional value. These preprocessing steps were essential for improving the model's ability to generalize across diverse linguistic variations in the TD.
4.1.2. Baseline models
Classical approaches: to contextualize the performance of our transformer-based solution, we established a set of classical ML baselines spanning distinct architectural paradigms: an SVM, an ensemble gradient-boosted tree model (extreme gradient boosting (XGBoost)), and a lightweight recurrent neural network (a single Bi-LSTM layer with 16 units followed by a dense layer of 8 units). All models were trained using default hyperparameters and evaluated on the same stratified 15% test split of the TunDC dataset described above, ensuring a fair comparison. No architecture-specific tuning was performed, allowing us to assess out-of-the-box performance under identical data conditions. Table 9 summarizes their results on the test set in terms of accuracy, F1-score, and area under the curve (AUC).
Table 9. Performance of classical baseline models on the TunDC test set
Model | Accuracy | F1-score | AUC
SVM | 0.77 | 0.77 | 0.85
XGBoost | 0.76 | 0.75 | 0.84
LSTM (Bi-LSTM-16 + Dense-8) | 0.78 | 0.76 | 0.85
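One way such a baseline can be assembled is shown below; the paper does not specify the SVM's feature representation, so the TF-IDF character n-grams and the toy comments here are assumptions for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Illustrative toy comments (Arabizi-style); TunDC itself is not reproduced here.
texts = ["behi barcha nhebou", "mouch behi khayeb", "top top bravo", "khayeb yeser"]
labels = ["pos", "neg", "pos", "neg"]

# Character n-grams cope better than word features with the spelling
# variation common in dialectal, code-switched text.
baseline = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LinearSVC(),  # default hyperparameters, matching the paper's protocol
)
baseline.fit(texts, labels)
preds = baseline.predict(["bravo behi", "khayeb"])
```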
Transformer-based approaches: we further evaluated four pre-trained ArabBERT-style transformer models, each trained on different combinations of MSA and dialectal content, to assess their out-of-the-box suitability for TD sentiment analysis. The selected models were: bert-base-arabic-camelbert-da (dialect-focused), UBC-NLP/MARBERTv2 (trained on dialect-heavy social media text), CAMeL-Lab/bert-base-arabic-camelbert-mix (mixed MSA and dialect), and asafaya/bert-base-arabic (primarily MSA with some dialectal coverage). All models were fine-tuned identically using the same training protocol (learning rate 3e-5, batch size 8, 4 epochs) and evaluated on the same stratified test split of TunDC. As shown in Table 10, all four models significantly outperformed the classical baselines, with test F1-scores ranging from 0.826 to 0.837. Notably, both camelbert-mix and asafaya/bert-base-arabic achieved the highest evaluation accuracy and F1-score (0.837). Given its strong performance, broader linguistic coverage, and established use as a foundation for Arabic NLP tasks, we selected asafaya/bert-base-arabic as the base architecture for our subsequent hyperparameter tuning, error analysis, and final model development, leading to the TunDC-mixed model described in the next section.
Table 10. Performance of transformer-based baseline models on the TunDC test set
Model | Test accuracy | Test F1-score
camelbert-da | 0.826 | 0.826
MARBERTv2 | 0.836 | 0.835
camelbert-mix | 0.837 | 0.837
asafaya/bert-base-arabic | 0.837 | 0.837
4.1.3. Evaluated model
For this study, we evaluated TunDC-mixed [47], our fine-tuned transformer-based model built upon asafaya/bert-base-arabic [48], [49]. To optimize performance on the TunDC dataset, we conducted systematic hyperparameter tuning using Ray Tune from the Ray library. Specifically, we employed random search over a predefined hyperparameter space, exploring the following dimensions:
− learning_rate: log-uniform sampling between 1e-5 and 5e-5.
− per_device_train_batch_size: categorical selection from {4, 8, 16}.
− weight_decay: uniform sampling in [0.0, 0.1].
− num_train_epochs: integer values from 3 to 6.
− lr_scheduler_type: categorical choice between linear and cosine.
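Setting Ray Tune's own API aside, the random-search sampling over this space can be sketched with the standard library; the dictionary and function names are illustrative:

```python
import math
import random

# Each entry draws one value from the corresponding dimension of the space.
SEARCH_SPACE = {
    "learning_rate": lambda rng: 10 ** rng.uniform(math.log10(1e-5), math.log10(5e-5)),
    "per_device_train_batch_size": lambda rng: rng.choice([4, 8, 16]),
    "weight_decay": lambda rng: rng.uniform(0.0, 0.1),
    "num_train_epochs": lambda rng: rng.randint(3, 6),  # inclusive bounds
    "lr_scheduler_type": lambda rng: rng.choice(["linear", "cosine"]),
}

def sample_config(seed=None):
    """Draw one random-search trial configuration from the space above."""
    rng = random.Random(seed)
    return {name: draw(rng) for name, draw in SEARCH_SPACE.items()}
```

In a random search, each trial fine-tunes the model with one such sampled configuration, and the configuration with the best validation score is retained.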
The search yielded the following optimal configuration: learning rate 3e-5, batch size 8, weight decay 0.017, 4 training epochs, and a linear learning rate scheduler. Using these hyperparameters, the final model, comprising 111 million parameters, was trained with the AdamW optimizer. Its strong performance underscores its capacity to capture the linguistic nuances and contextual variability inherent in low-resource TA dialects, making it well-suited for sentiment classification in this domain.
4.1.4. Evaluation metrics
The model's performance was assessed using standard classification metrics, including accuracy, precision, recall, and F1-score. Accuracy measured the overall correctness of predictions, while precision evaluated the proportion of correctly predicted positive (or negative) labels. Recall measured the model's ability to capture all relevant instances of a sentiment class. The F1-score, as the harmonic mean of precision and recall, ensured a balanced evaluation, particularly useful when handling variations in sentiment representation. Additionally, macro-averaged F1-score was reported to treat each sentiment class equally, regardless of class size. The weighted F1-score was computed to account for class distribution, providing a more representative measure of overall model performance. These evaluation metrics offered a robust assessment of bert-base-arabic-TunDC-mixed in processing TD text.
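A small self-contained sketch of the per-class, macro-averaged, and weighted F1 computations (a real evaluation would more likely rely on a library such as scikit-learn):

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Per-class, macro-averaged, and support-weighted F1 for label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        per_class[c] = (2 * precision * recall / (precision + recall)
                        if precision + recall else 0.0)
    macro = sum(per_class.values()) / len(labels)  # each class weighs equally
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return per_class, macro, weighted
```

On a perfectly balanced test set the macro and weighted averages coincide; they diverge when one class dominates, which is why both are reported.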