International Journal of Electrical and Computer Engineering (IJECE)
Vol. 11, No. 3, June 2021, pp. 2327–2334
ISSN: 2088-8708, DOI: 10.11591/ijece.v11i3.pp2327-2334
ATAR: Attention-based LSTM for Arabizi transliteration

Bashar Talafha 1, Analle Abuammar 2, Mahmoud Al-Ayyoub 3
1,3 Jordan University of Science and Technology, Jordan
2 University of Southampton, UK
Article Info

Article history:
Received Apr 4, 2020
Revised Sep 30, 2020
Accepted Oct 9, 2020

Keywords:
Arabizi transliteration
Attention
Benchmark dataset
LSTM
Seq2seq
ABSTRACT

A non-standard romanization of Arabic script, known as Arabizi, is widely used in Arabic online and SMS/chat communities. However, since state-of-the-art tools and applications for Arabic NLP expect Arabic to be written in Arabic script, handling content written in Arabizi requires special attention, either by building customized tools or by transliterating it into Arabic script. The latter approach is the more common one, and this work presents two significant contributions in this direction. The first is to collect and publicly release the first large-scale "Arabizi to Arabic script" parallel corpus focusing on the Jordanian dialect and consisting of more than 25k pairs carefully created and inspected by native speakers to ensure the highest quality. Second, we present ATAR, an ATtention-based LSTM model for ARabizi transliteration. Training and testing this model on our dataset yields impressive accuracy (79%) and BLEU score (88.49).

This is an open access article under the CC BY-SA license.
Corresponding Author:
Mahmoud Al-Ayyoub
Department of Computer Science
Jordan University of Science and Technology
Irbid, Jordan
Email: maalshbool@just.edu.jo
1. INTRODUCTION

As stated by many researchers [1–3], social media users express themselves in ways different from the standard format. Social media content exhibits frequent use of informal vocabulary, non-standard abbreviations, typos, and many idiosyncrasies such as repeating letters for emphasis and writing out non-linguistic content like emojis and sound reactions [4–6]. For several reasons, these issues are more complicated for Arabic content. Examples of these reasons include the prevalent use of dialectal Arabic (DA) and its grave deviations from modern standard Arabic (MSA) [7]. Another reason is the common use of a non-standard romanized way of writing Arabic words known as Arabizi. There are many reasons for the widespread use of Arabizi, such as the lack of support for Arabic script on some devices/platforms, the existence of some difficulties in using Arabic script, and the relative ease of code-switching between Arabizi and English or French compared with Arabic script. Even though Arabizi is not known to all social media users, it is common enough to warrant studies focusing solely on it [8–17].

For most state-of-the-art tools and applications for natural language processing (NLP) and information retrieval (IR) of Arabic text, the expected input is Arabic words written in Arabic script [18–20]. Therefore, there is an obvious need for a system to automatically transliterate content written in Arabizi into Arabic orthography [2]. Previous studies [2, 9, 21–30] presented tools and resources for this problem. However, to the best of our knowledge, very few of them [27–30] followed deep learning approaches such as recurrent neural
Journal homepage: http://ijece.iaescore.com
networks (RNN) and its extensions such as long short-term memory (LSTM) [31]. The others mostly follow character-level rule-based approaches. In this work, we address the problem of Arabizi transliteration by presenting ATAR, an ATtention-based encoder-decoder model for ARabizi transliteration. This novel neural network-based approach follows the celebrated attention-based encoder-decoder model of [32]. To evaluate ATAR, we present a "first of its kind" dataset consisting of 21.5K words from the Jordanian dialect.

The rest of this paper is organized as follows: The following section gives a high-level view of the related work, while section 3 presents our ATAR model and discusses its details. Section 4 discusses the dataset we create and section 5 presents our evaluation of the proposed model on the collected dataset. Finally, the paper is concluded in section 6.
2. RELATED WORK

Due to the importance of the Arabizi-Arabic script transliteration problem, several companies, such as Google and Microsoft, have invested money and effort into developing tools for this problem. Examples of such tools include: Google Ta3reeb (http://www.google.com/ta3reeb); Microsoft Maren (https://www.microsoft.com/en-us/download/details.aspx?id=20530); Facebook's automatic translation services (https://engineering.fb.com/ml-applications/expanding-automatic-machine-translation-to-more-languages/); Rosette Chat Translator (https://www.basistech.com/text-analytics/rosette/chat-translator/); and Yamli (https://www.yamli.com/). However, these tools are mostly closed-source, and very little is known about the approaches they follow or the resources they employ.

On the other hand, the effort within the Arabic NLP research community to address the Arabizi-Arabic script transliteration problem has been rather shy. The existing resources are limited and not publicly available, and the proposed approaches do not follow the new and exciting approaches in the field of sequence learning [33]. Existing work on Arabizi transliteration, such as [2, 22–24, 34, 35], followed basic approaches that used character-to-character mappings in order to generate lattices of multiple alternative words. The approach proposed by [36] combines a rule-based model and a discriminative model based on conditional random fields (CRF) for transliterating Tunisian dialect Arabizi texts to standard Arabic. A further selection from these words is done using language models. As for the datasets they used, only that of [23] is reported to be publicly available [2]; however, it is very small with only 2.2K word pairs. It was used in the development of [24]'s system in addition to 6,300 Arabic-English proper name pairs from [37]. The reported accuracy of [24]'s system is 69.4% and it was later used by [2].

Another interesting effort in creating useful resources for the Arabizi transliteration problem is the work of Bies et al. [2]. The authors discussed how the linguistic data consortium (LDC) collected and annotated a huge parallel corpus of Arabizi content and its Arabic script counterpart as part of the DARPA broad operational language translation (BOLT) program (Phase 2). The corpus consisted of more than 408K words and it mainly focused on the Egyptian dialect.

Few papers [27–30] discussed the use of deep learning for the problem of Arabic transliteration. In [27, 28], the authors claimed to use a standard RNN encoder-decoder model for transliterating sentences written in Algerian dialect, but they did not provide any details of the model. Moreover, the dataset they considered is rather small (1.3k sentences). In a more detailed work, Younes et al. [29] used a standard RNN encoder-decoder model for transliterating words in Tunisian dialect. Their dataset was relatively big with 45.6k word pairs. In a follow-up work [30], they expanded their work and discussed how to adapt three well-known models from machine translation for the problem of transliterating Tunisian dialect. The first one was a CRF, while the second one was a bidirectional RNN with long short-term memory cells (BLSTM). As for the third one, it was a BLSTM with a CRF decoder. The results show the superiority of the latter approach over the former two approaches.

Transliteration systems have been proposed for many languages other than Arabic. However, such systems are usually designed to transliterate between two closely related languages. Examples include the work of Musleh et al. [38] on transliterating Urdu to Hindi, the work of Nakov et al. [39] on transliterating Portuguese and Italian to look like Spanish, and the work of Nakov et al. [40] on transliterating Macedonian to Bulgarian.
3. ATAR: ATTENTION-BASED LSTM FOR ARABIZI TRANSLITERATION

Over the past decade, deep learning approaches have made a ground-breaking impact on many fields such as NLP, image processing, and computer vision [41–44]. A particularly interesting and challenging set of problems, known as sequence learning problems, has been heavily studied by deep learning researchers. A special kind of neural networks, known as recurrent neural networks (RNN), has been shown to perform very well for many sequence learning problems in natural language understanding (NLU) and natural language generation (NLG). However, RNN suffers from some issues like the vanishing gradient problem. To address this problem, Hochreiter and Schmidhuber [31] proposed to equip RNN with memory cells, creating what they called LSTM networks. For sequence-to-sequence problems (like the one we have at our hands), a general approach known as the encoder-decoder approach was found to be very successful. The approach is based on the idea of learning efficient representations of the input using an RNN (or LSTM) as an "encoder network" and using another RNN (or LSTM) as a "decoder network" to take this feature representation as input, process it to make its decision, and produce an output (see https://www.quora.com/What-is-an-Encoder-Decoder-in-Deep-Learning). In the rest of this section, we present the details of our attention-based LSTM model for Arabizi transliteration, which we call ATAR.
3.1. Model architecture

Our transliteration model is inspired by the attentional sequence-to-sequence (seq2seq) model proposed by [32], which is based on the encoder-decoder architecture as shown in Figure 1.
Figure 1. Illustration of the sequence-to-sequence architecture based on LSTM with the attention mechanism
The seq2seq architecture consists of an RNN encoder that learns representations of an input sequence X = {x_1, x_2, ..., x_n} of varying length and an RNN decoder, which reads the hidden representation produced by the encoder and generates an output sequence Y = {y_1, y_2, ..., y_m} of varying length.
The model takes input from the embedding layer that maps a one-hot encoding vector of vocabulary size, which in our case is the number of letters (47 different letters in Arabizi and 36 in Arabic), and generates a fixed-size dense vector that represents the semantic features of the input letter. It is worth mentioning that there is no <UNK> token in our case because each word (i.e., sequence) is a combination of a limited set of predefined letters.
In our architecture, each unit in the encoder and decoder is an LSTM cell, which solves the problem of vanishing gradients with its memory cells [31]. Instead of relying on one thought vector from the encoder, many researchers [32, 45] proposed the encoder-decoder architecture with attention.
The idea behind the attention mechanism is to link each time step
of the decoder with the most "convenient" time step(s) of the encoder input sequence. This is done by utilizing the idea of a global attentional model, which takes all the hidden states of the encoder h_s and the current target state h_t into consideration to calculate the attention score. In this paper, the dot product function is used to perform the attention score calculation:

score(h_t, h_s) = h_t^T h_s
Following the previous step, the alignment vector a_ts is computed for each state by applying a softmax function to normalize all scores; therefore, a probability distribution based on the target state will be produced:

a_ts = exp(score(h_t, h_s)) / Σ_s' exp(score(h_t, h_s'))
The decoder then computes a global context vector c_t as a weighted average, based on the alignment vector a_t, over all the source states:

c_t = Σ_s a_ts h_s
Therefore, the decoder will take the context vector as an additional input vector at the next time step s_t.
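The three steps above (dot-product score, softmax alignment, and weighted context vector) can be checked with a few lines of NumPy. This is an illustrative re-implementation of the formulas with arbitrary dimensions, not the model's actual TensorFlow code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_src, hidden = 5, 4                    # source time steps, hidden size (illustrative)
h_s = rng.normal(size=(n_src, hidden))  # encoder hidden states
h_t = rng.normal(size=(hidden,))        # current decoder target state

# Dot-product attention score for every source state: score(h_t, h_s) = h_t^T h_s
scores = h_s @ h_t                      # shape: (n_src,)

# Alignment vector a_ts: softmax over the scores
a_ts = np.exp(scores) / np.exp(scores).sum()

# Global context vector c_t: weighted average of the source states
c_t = a_ts @ h_s                        # shape: (hidden,)

assert np.isclose(a_ts.sum(), 1.0)      # a valid probability distribution
```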
4. DATASET

We use Arabizi-Arabic script parallel words in order to perform our Arabizi transliteration experiments. Due to the lack of such available parallel data, we have crawled only Arabizi data written in the Jordanian dialect from different resources, such as Twitter, Facebook and ASK. These crawled words are regularly used on a daily basis. We were able to collect 21.5K unique Arabizi words, which were then translated to the Jordanian dialect using only Arabic letters. A group of native speakers validated the parallel data by correcting any spelling mistakes, removing redundant letters and omitting any unneeded special characters. One of the contributions of this work is to make this "first of its kind" dataset publicly available (https://github.com/bashartalafha/Arabizi-Transliteration). Table 1 shows samples of our parallel data. The average length of the collected words is about 5 letters per word, with a maximum word length of 12 letters and a minimum of 2 letters.
Table 1. Examples of our parallel corpus
It is worth mentioning that the same word in Arabizi could have different representations in the Jordanian dialect, since not all people would write it in the same way, but still they are all correct. Table 2 shows a few such examples. This issue was faced by earlier work on Arabizi transliteration such as [2, 9] and it is discussed in detail therein. As stated by these researchers, such things could penalize the model and give it a lower score, considering some transliterations are right but the reference is different.
Table 2. Examples with different representations
5. EXPERIMENTS AND EVALUATION

To evaluate the performance of our proposed model, we implement it using TensorFlow (we select TensorFlow for its efficiency and ease of use; for a comparison of different deep learning frameworks, the interested reader is directed to [46]), and perform several experiments using our dataset. After shuffling and lowercasing the data, we use the first 80% of the dataset as the training set, the next 10% as the validation set and the remaining 10% as the testing set. As for the evaluation metrics, we use the two most common measures for the Arabizi transliteration task: accuracy and bilingual evaluation understudy (BLEU) [47]. Finally, to aid the reproducibility of our results, both the dataset and the model are made publicly available at https://github.com/bashartalafha/Arabizi-Transliteration.
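The shuffling, lowercasing, and 80/10/10 split described above can be sketched as follows; the toy pairs and variable names are assumptions for illustration, not the released code:

```python
import random

# Toy parallel pairs (Arabizi source, Arabic-script target placeholder);
# the real dataset holds 21.5K word pairs.
pairs = [(f"Word{i}", f"target{i}") for i in range(100)]

random.seed(42)
random.shuffle(pairs)
pairs = [(src.lower(), tgt) for src, tgt in pairs]  # lowercase the Arabizi side

# First 80% training, next 10% validation, last 10% testing
n = len(pairs)
train = pairs[: int(0.8 * n)]
valid = pairs[int(0.8 * n): int(0.9 * n)]
test = pairs[int(0.9 * n):]

print(len(train), len(valid), len(test))  # 80 10 10
```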
Using an attentional encoder-decoder sequence-to-sequence translation model, we have to worry about the many hyperparameters that can affect its performance. This issue is so important that complete studies have been dedicated to it, such as [48], which reported the use of more than 250K GPU hours for experimentation. For our work, we use the work of Britz et al. [48] as well as Ruder's blog (http://ruder.io/deep-learning-nlp-best-practices/) and Brownlee's blog (https://machinelearningmastery.com/configure-encoder-decoder-model-neural-machine-translation/) to guide our search for the best values of the hyperparameters. The ones that give the best performance are listed in Table 3. For this configuration, the accuracy is 79% and the BLEU score is 88.49.
Table 3. The values of our model's hyperparameters that give the best performance

Hyperparameter      Value
LEARNING RATE       0.001
BATCH SIZE          64
HIDDEN NODES        256
NUMBER OF LAYERS    1
EMB SIZE            50
EPOCHS              30
LOSS                "categorical crossentropy"
OPT                 "adam"
BI                  "No"
DROPOUT             0.2
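For reference, the Table 3 configuration can be collected into a single dictionary, as one might pass to a model-building function (the key names are our assumptions, not the released code's actual format); the last lines estimate the weight count of one LSTM layer of this size:

```python
# Hyperparameters from Table 3 as a config dict (key names are illustrative).
config = {
    "learning_rate": 0.001,
    "batch_size": 64,
    "hidden_nodes": 256,      # LSTM units per layer
    "num_layers": 1,
    "embedding_size": 50,
    "epochs": 30,
    "loss": "categorical_crossentropy",
    "optimizer": "adam",
    "bidirectional": False,   # "BI: No" in Table 3
    "dropout": 0.2,
}

# Rough weight count of one LSTM layer: 4 gates, each with a kernel over
# the concatenated [input; hidden] vector plus a bias vector.
h, e = config["hidden_nodes"], config["embedding_size"]
lstm_params = 4 * (h * (h + e) + h)
print(lstm_params)  # 314368
```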
ATAR does achieve good results. However, it does have its limitations, such as the lack of support for the various Arabic dialects. To address this, one might benefit from existing multi-dialect parallel datasets [49–54] or build new ones (perhaps by benefiting from unsupervised approaches for dialect translation [7]). Another issue that can be addressed before adopting ATAR in real-life scenarios is trying to increase the model's accuracy. This can be done by either considering other sequence-to-sequence models, such as Facebook's convolutional sequence-to-sequence model [55] and Google's attention-only Transformer model [56], or by combining it with a neural diacritization model [57, 58].
6. CONCLUSION

In this paper, we addressed the Arabizi transliteration problem. This work has two significant contributions to this problem. The first one is to collect and publicly distribute the first large-scale Arabizi-Arabic script parallel corpus focusing on the Jordanian dialect and consisting of more than 25k pairs carefully created and inspected by native speakers to ensure the highest quality. In the second contribution, we presented one of the first detailed and reproducible efforts to employ the celebrated attention-based seq2seq model for Arabizi transliteration. The presented model, which we called ATAR, performed very well in the experiments we conducted. It reached an impressive level with an accuracy of 79% and a BLEU score of 88.49. Future directions include experimenting with other sequence-to-sequence models, such as Facebook's convolutional sequence-to-sequence model and Google's attention-only Transformer model. We are also thinking of ways to expand our work to other Arabic dialects. Finally, we will explore the generation of more accurate MSA text from the transliteration by looking into combining our model with a neural diacritization model.
ACKNOWLEDGEMENT

The authors would like to thank the Deanship of Research at the Jordan University of Science and Technology for supporting this work (through Grant #20190180). The authors would also like to thank Nesreen Al-Qasem and Areen Bany Salim for their efforts in creating the dataset.
REFERENCES
[1] N. Y. Habash, "Introduction to arabic natural language processing," Synthesis Lectures on Human Language Technologies, vol. 3, no. 1, pp. 1–187, 2010.
[2] A. Bies, Z. Song, M. Maamouri, S. Grimes, H. Lee, J. Wright, S. Strassel, N. Habash, R. Eskander, and O. Rambow, "Transliteration of arabizi into arabic orthography: Developing a parallel annotated arabizi-arabic script sms/chat corpus," Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), 2014, pp. 93–103.
[3] W. A. Hussien, Y. M. Tashtoush, M. Al-Ayyoub, and M. N. Al-Kabi, "Are emoticons good enough to train emotion classifiers of arabic tweets?" 7th International Conference on Computer Science and Information Technology (CSIT), 2016, pp. 1–6.
[4] A. I. Alharbi and M. Lee, "Combining character and word embeddings for affect in arabic informal social media microblogs," International Conference on Applications of Natural Language to Information Systems, Springer, 2020, pp. 213–224.
[5] W. Hussien, M. Al-Ayyoub, Y. Tashtoush, and M. Al-Kabi, "On the use of emojis to train emotion classifiers," arXiv preprint arXiv:1902.08906, 2019.
[6] K. A. Kwaik, S. Chatzikyriakidis, S. Dobnik, M. Saad, and R. Johansson, "An arabic tweets sentiment analysis dataset (atsad) using distant supervision and self training," Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, 2020, pp. 1–8.
[7] W. Farhan, B. Talafha, A. Abuammar, R. Jaikat, M. Al-Ayyoub, A. B. Tarakji, and A. Toma, "Unsupervised dialectal neural machine translation," Information Processing and Management, vol. 57, no. 3, 2020.
[8] J. May, Y. Benjira, and A. Echihabi, "An arabizi-english social media statistical machine translation system," Proceedings of the 11th Conference of the Association for Machine Translation in the Americas, 2014, pp. 329–341.
[9] M. van der Wees, A. Bisazza, and C. Monz, "A simple but effective approach to improve arabizi-to-english statistical machine translation," Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT), 2016, pp. 43–50.
[10] R. M. Duwairi, M. Alfaqeh, M. Wardat, and A. Alrabadi, "Sentiment analysis for arabizi text," 2016 7th International Conference on Information and Communication Systems (ICICS), 2016, pp. 127–132.
[11] A. M. Abd Al-Aziz, M. Gheith, and A. S. E. Ahmed, "Toward building arabizi sentiment lexicon based on orthographic variants identification," The 2nd International Conference on Arabic Computational Linguistics (ACLing), 2016.
[12] I. Guellil, A. Adeel, F. Azouaou, F. Benali, A.-e. Hachani, and A. Hussain, "Arabizi sentiment analysis based on transliteration and automatic corpus annotation," Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, 2018, pp. 335–341.
[13] I. Guellil, F. Azouaou, F. Benali, A. E. Hachani, and M. Mendoza, "The role of transliteration in the process of arabizi translation/sentiment analysis," Recent Advances in NLP: The Case of Arabic Language, Springer, 2020, pp. 101–128.
[14] T. Tobaili, M. Fernandez, H. Alani, S. Sharafeddine, H. Hajj, and G. Glavas, "Senzi: A sentiment analysis lexicon for the latinised arabic (arabizi)," Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), 2019, pp. 1203–1211.
[15] F. Aqlan, X. Fan, A. Alqwbani, and A. Al-Mansoub, "Arabic–chinese neural machine translation: Romanized arabic as subword unit for arabic-sourced translation," IEEE Access, vol. 7, pp. 133122–133135, 2019.
[16] E. Gugliotta and M. Dinarelli, "Tarc: Incrementally and semi-automatically collecting a tunisian arabish corpus," arXiv preprint arXiv:2003.09520, 2020.
[17] M. Alkhatib and K. Shaalan, "Boosting arabic named entity recognition transliteration with deep learning," The Thirty-Third International Flairs Conference, 2020.
[18] I. El Bazi and N. Laachfoubi, "Arabic named entity recognition using deep learning approach," International Journal of Electrical and Computer Engineering, vol. 9, no. 3, pp. 2025–2032, 2019.
[19] H. G. Hassan, H. M. A. Bakr, and B. E. Ziedan, "A framework for arabic concept-level sentiment analysis using senticnet," International Journal of Electrical and Computer Engineering, vol. 8, no. 5, pp. 4015–4022, 2018.
[20] M. A. Ahmed, R. A. Hasan, A. H. Ali, and M. A. Mohammed, "The classification of the modern arabic poetry using machine learning," TELKOMNIKA Telecommunication, Computing, Electronics and Control, vol. 17, no. 5, pp. 2667–2674, 2019.
[21] K. Shaalan, H. Bakr, and I. Ziedan, "Transferring egyptian colloquial dialect into modern standard arabic," International Conference on Recent Advances in Natural Language Processing (RANLP–2007), Borovets, Bulgaria, 2007, pp. 525–529.
[22] A. Chalabi and H. Gerges, "Romanized arabic transliteration," Proceedings of the Second Workshop on Advances in Text Input Methods, 2012, pp. 89–96.
[23] K. Darwish, "Arabizi detection and conversion to arabic," arXiv preprint arXiv:1306.6755, 2013.
[24] M. Al-Badrashiny, R. Eskander, N. Habash, and O. Rambow, "Automatic transliteration of romanized dialectal arabic," Proceedings of the Eighteenth Conference on Computational Natural Language Learning, 2014, pp. 30–38.
[25] R. Eskander, M. Al-Badrashiny, N. Habash, and O. Rambow, "Foreign words and the automatic processing of arabic social media text written in roman script," Proceedings of The First Workshop on Computational Approaches to Code Switching, 2014, pp. 1–12.
[26] N. Altrabsheh, M. El-Masri, and H. Mansour, "Proposed novel algorithm for transliterating arabic terms into arabizi," Research in Computer Science, 2017.
[27] I. Guellil, F. Azouaou, M. Abbas, and S. Fatiha, "Arabizi transliteration of algerian arabic dialect into modern standard arabic," Social MT 2017/First Workshop on Social Media and User Generated Content Machine Translation, 2017.
[28] I. Guellil, F. Azouaou, and M. Abbas, "Neural vs statistical translation of algerian arabic dialect written with arabizi and arabic letter," The 31st Pacific Asia Conference on Language, Information and Computation (PACLIC), vol. 31, 2017.
[29] J. Younes, E. Souissi, H. Achour, and A. Ferchichi, "A sequence-to-sequence based approach for the double transliteration of tunisian dialect," Procedia Computer Science, vol. 142, pp. 238–245, 2018.
[30] J. Younes, H. Achour, E. Souissi, and A. Ferchichi, "Romanized tunisian dialect transliteration using sequence labelling techniques," Journal of King Saud University-Computer and Information Sciences, 2020.
[31] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[32] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," arXiv preprint arXiv:1508.04025, 2015.
[33] M. Al-Ayyoub, A. Nuseir, K. Alsmearat, Y. Jararweh, and B. Gupta, "Deep learning for arabic nlp: A survey," Journal of Computational Science, vol. 26, pp. 522–531, 2018.
[34] G. Lancioni, E. Gugliotta, and V. Pettinari, "Lahajat: A rule-based converter of standard arabic lexical databases into spoken arabic forms," 2016 4th IEEE International Colloquium on Information Science and Technology (CiSt), 2016, pp. 395–399.
[35] I. Guellil, F. Azouaou, F. Benali, A.-E. Hachani, and H. Saadane, "Hybrid approach for transliteration of algerian arabizi: a primary study," arXiv preprint arXiv:1808.03437, 2018.
[36] A. Masmoudi, M. E. Khmekhem, M. Khrouf, and L. H. Belguith, "Transliteration of arabizi into arabic script for tunisian dialect," ACM Trans. Asian Low-Resour. Lang. Inf. Process., vol. 19, no. 2, Nov. 2019, doi: 10.1145/3364319.
[37] T. Buckwalter, "Buckwalter arabic morphological analyzer version 2.0," Web Download, 2004.
[38] A. Musleh, N. Durrani, I. Temnikova, P. Nakov, S. Vogel, and O. Alsaad, "Enabling medical translation for low-resource languages," International Conference on Intelligent Text Processing and Computational Linguistics, Springer, 2016, pp. 3–16.
[39] P. Nakov and H. T. Ng, "Improving statistical machine translation for a resource-poor language using related resource-rich languages," Journal of Artificial Intelligence Research, vol. 44, pp. 179–222, 2012.
[40] P. Nakov and J. Tiedemann, "Combining word-level and character-level models for machine translation between closely-related languages," Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2, Association for Computational Linguistics, 2012, pp. 301–305.
[41] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[42] I. Goodfellow, Y. Bengio, and A. Courville, "Deep learning," MIT Press, 2016.
[43] J. Patterson and A. Gibson, "Deep learning: A practitioner's approach," O'Reilly Media, Inc., 2017.
[44] G. Al-Bdour, R. Al-Qurran, M. Al-Ayyoub, and A. Shatnawi, "A detailed comparative study of open source deep learning frameworks," arXiv preprint arXiv:1903.00102, 2019.
[45] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[46] G. Al-Bdour, R. Al-Qurran, M. Al-Ayyoub, and A. Shatnawi, "Benchmarking open source deep learning frameworks," Submitted to the International Journal of Electrical and Computer Engineering (IJECE), 2020.
[47] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "Bleu: a method for automatic evaluation of machine translation," Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, 2002, pp. 311–318.
[48] D. Britz, A. Goldie, M.-T. Luong, and Q. Le, "Massive exploration of neural machine translation architectures," Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 1442–1451.
[49] H. Bouamor, S. Hassan, and N. Habash, "The madar shared task on arabic fine-grained dialect identification," Proceedings of the Fourth Arabic Natural Language Processing Workshop, 2019, pp. 199–207.
[50] B. Talafha, A. Fadel, M. Al-Ayyoub, Y. Jararweh, A.-S. Mohammad, and P. Juola, "Team just at the madar shared task on arabic fine-grained dialect identification," Proceedings of the Fourth Arabic Natural Language Processing Workshop, 2019, pp. 285–289.
[51] B. Talafha, W. Farhan, A. Altakrouri, and H. Al-Natsheh, "Mawdoo3 ai at madar shared task: Arabic tweet dialect identification," Proceedings of the Fourth Arabic Natural Language Processing Workshop, 2019, pp. 239–243.
[52] A. Ragab, H. Seelawi, M. Samir, A. Mattar, H. Al-Bataineh, M. Zaghloul, A. Mustafa, B. Talafha, A. A. Freihat, and H. Al-Natsheh, "Mawdoo3 ai at madar shared task: Arabic fine-grained dialect identification with ensemble learning," Proceedings of the Fourth Arabic Natural Language Processing Workshop, 2019, pp. 244–248.
[53] C. Zhang, H. Bouamor, M. Abdul-Mageed, and N. Habash, "The shared task on nuanced arabic dialect identification (nadi)," Proceedings of the Fifth Arabic Natural Language Processing Workshop, 2020.
[54] B. Talafha, M. Ali, M. E. Za'ter, H. Seelawi, I. Tuffaha, M. Samir, W. Farhan, and H. T. Al-Natsheh, "Multi-dialect arabic bert for country-level dialect identification," arXiv preprint arXiv:2007.05612, 2020.
[55] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, "Convolutional sequence to sequence learning," Proceedings of the 34th International Conference on Machine Learning-Volume 70, JMLR.org, 2017, pp. 1243–1252.
[56] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[57] A. Fadel, I. Tuffaha, M. Al-Ayyoub et al., "Arabic text diacritization using deep neural networks," 2019 2nd International Conference on Computer Applications and Information Security (ICCAIS), 2019, pp. 1–7.
[58] A. Fadel, I. Tuffaha, B. Al-Jawarneh, and M. Al-Ayyoub, "Neural arabic text diacritization: State of the art results and a novel approach for machine translation," arXiv preprint arXiv:1911.03531, 2019.