International Journal of Electrical and Computer Engineering (IJECE)
Vol. 8, No. 6, December 2018, pp. 4352~4355
ISSN: 2088-8708, DOI: 10.11591/ijece.v8i6.pp4352-4355
Journal homepage: http://iaescore.com/journals/index.php/IJECE
A Survey of Arabic Text Classification Models
Ahed M. F. Al Sbou
Department of Computer Science, Faculty of Information Technology, Al_Hussein Bin Talal University, Jordan
Article Info
Article history:
Received Nov 25, 2017
Revised Feb 17, 2018
Accepted Mar 2, 2018
ABSTRACT
A huge amount of Arabic text is available online, and it needs to be organized. As a result, many natural language processing (NLP) applications are concerned with text organization. One of them is text classification (TC). TC makes unorganized text easier to deal with by classifying it into suitable classes or labels. This paper is a survey of Arabic text classification. It also presents a comparison among different methods for classifying Arabic texts, where Arabic text is regarded as complex due to its vocabulary. Arabic is one of the richest languages in the world and has many linguistic bases. Research on Arabic language processing is scarce compared with English. As a result, these problems represent challenges in classifying and organizing specific Arabic text. TC helps to access documents or information that have already been classified into one or more specific classes or categories. In addition, classifying documents helps search engines reduce the number of documents to consider, which makes searching and matching against queries easier.
Keyword:
Arabic language processing
Arabic text categorization
Arabic text mining
Classification algorithms
Clustering algorithms
Natural languages processing
Text classification
Copyright © 2018 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
Ahed M. F. Al Sbou,
Department of Computer Science, Faculty of Information Technology,
Al Hussein Bin Talal University,
Rawdat Al-Amir Rashid, Ma'an, Jordan.
Email: ahed_alsbou@ahu.edu.jo
1. INTRODUCTION
The Arabic language is one of the most common languages, with more than 420 million speakers around the world. Unlike English, Arabic has no upper-case letters. It also differs from other natural languages in its use of diacritics, which act as small vowel marks, such as "fatha, kasra, damma, sukun, shadda, and tanween". The Arabic orthographic system is based on the effect of diacritics: each type of diacritic produces a different word with a different meaning. The language also has specific letters known as Arabic vowels (waw, yaa, alf) that require a special system of morphology and grammar. Arabic is further distinguished by its huge vocabulary and number of concepts [1].
Although Arabic texts are viewed as the most difficult ones to process, there are few studies on processing them, for reasons related to the linguistic characteristics of the Arabic language.
Given the strict linguistic characteristics of Arabic texts and the limited number of studies on processing them [2], this study deals with the variety of NLP applications that have recently emerged to handle languages such as Arabic, English, and Urdu. One of these applications is text classification (TC), which aims to produce a structured set of documents from unstructured documents. This structured set of texts includes a description of the content of the documents. TC is the process of classifying textual documents into groups based on subject similarity or other features [3].
This paper is organized as follows. The present section offers a brief introduction to the topic and the design of the paper. The second section briefly describes related work in the area of Arabic text classification. The third section presents the challenges of Arabic text. The fourth section provides a brief explanation of the common techniques and algorithms used in Arabic text classification. The last section provides a summary of the paper and suggests future work.
2. RELATED WORK
Many classification algorithms have been applied to Arabic texts. El-Kourdi et al. applied the Naive Bayes (NB) algorithm to classify 1500 Arabic text documents into five major categories; their results indicated an accuracy of around 68.78% [4]. Sawaf et al. conducted another study on data collected from the Arabic NEWSWIRE corpus using statistical methods. The results were 62.7% [5]. El-Halees et al. also classified 300 Arabic documents by applying different algorithms such as the vector space model (VSM), K-Nearest Neighbor (KNN), and Naïve Bayes (NB). The classification accuracy was 74.41% [6]. The same accuracy was obtained in Al-Zoghby's study, which used the CHARM algorithm to classify Arabic text documents from 5524 records [7]. In another study, Mesleh classified Arabic documents using Support Vector Machines (SVMs) with Chi-square features. He conducted an experimental study on an online Arabic corpus of 1445 documents drawn from Al-Nahar, Al-hayat, Al-Jazeera, Al-Ahram, and Al-Dostor, classified into 9 categories. The F-measure was 88.11% [8]. Harrag et al. developed an Arabic TC system using a hybrid approach with a tree algorithm to select the features. The data was collected from several Arabian scientific encyclopedias in many fields. The accuracy was 91% and 93% for the literary and scientific corpora, respectively [9].
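
The studies above report their results as accuracy or F-measure. As a minimal sketch, assuming the standard definitions (the cited papers may average over categories differently), the two measures can be computed from gold and predicted labels as follows. The function names and the label lists are illustrative and are not taken from any of the surveyed studies.

```python
def accuracy(gold, predicted):
    """Fraction of documents whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def f_measure(gold, predicted, category):
    """Balanced F-measure (harmonic mean of precision and recall) for one category."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == category and p == category)
    fp = sum(1 for g, p in pairs if g != category and p == category)
    fn = sum(1 for g, p in pairs if g == category and p != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative labels only; these are not data from any surveyed study.
gold = ["sport", "sport", "economy", "economy", "culture"]
predicted = ["sport", "economy", "economy", "economy", "culture"]
```

Accuracy treats all categories together, while the F-measure is computed per category, so a classifier can score well on one and poorly on the other when the categories are unbalanced.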
3. ARABIC TEXT CHALLENGES
Natural languages such as Arabic have been processed with different methods. Arabic has several textual features and therefore requires a specific categorical environment for its morphology, concepts, and ontology. It is one of the most complex natural languages. It comprises 28 characters [1]. Each character is written in different forms depending on its position in the word: a character may come at the beginning, in the middle, or at the end of the word [10].
TC seeks to collect similar documents into specific categories, assigning categories to Arabic texts and handling the related categories produced by other text classifications [11]. Retrieving information from the large amount of Arabic text accessible on the web is very challenging. Retrieving all documents relevant to a query is also very important to users. Therefore, using TC to access the different categories makes querying easier, so that users can obtain the information they need [12].
Further, Arabic texts raise some problematic issues due to the nature of the language. To the best of my knowledge, studies on the Arabic language are very limited: there is a lack of Arabic corpora, language tools, and comprehensive studies on preprocessing Arabic texts. All these problems create diverse challenges in categorizing specific Arabic textual data into a closed category.
4. TEXT CLASSIFICATION
TC includes different phases. The first phase is preprocessing the text: removing punctuation and stop words, and normalizing the text. The second and third phases are classifying the text and evaluating the classified text [7], [11], [13].
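
The preprocessing phase can be sketched as below. This is a minimal illustration under stated assumptions, not the pipeline of any cited study: the stop-word list is a tiny hypothetical sample, and the normalization rules (dropping diacritics and tatweel, unifying alef variants) are common choices that a real system may extend or change.

```python
import re

# Arabic diacritics (tanween, fatha, damma, kasra, shadda, sukun): U+064B..U+0652.
DIACRITICS = re.compile("[\u064B-\u0652]")
TATWEEL = "\u0640"  # elongation character with no lexical meaning
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")  # alef with madda or hamza
BARE_ALEF = "\u0627"

# Tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"في", "من", "على", "الى", "عن", "و"}


def normalize(text):
    """Strip diacritics and tatweel, then map alef variants to bare alef."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    return ALEF_VARIANTS.sub(BARE_ALEF, text)


def preprocess(text):
    """Return normalized tokens with punctuation and stop words removed."""
    text = normalize(text)
    # Anything that is neither a word character nor whitespace becomes a space;
    # this covers Arabic punctuation such as the Arabic comma.
    text = re.sub(r"[^\w\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOP_WORDS]
```

Normalizing before stop-word removal matters here: the stop words are stored in normalized form, so a diacritized or hamza-bearing spelling of the same word is still filtered out.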
TC is the best mechanism to manage and organize data. It helps a machine access data categories and text labels using a predefined process [14], [15]. This mechanism can be used to classify a group of documents into kinds of documents using features such as contents, authors, or publisher [16]. The core goal of TC is to convert unstructured text into an organized or structured form that can be used in different NLP applications such as summarization or retrieval [10], [17].
Two methods are used in TC: machine learning, in which text is classified using a set of training documents, and rule-based TC, which uses the knowledge of experts or engineers to classify the text [18].
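
To make the rule-based option concrete, the following is a minimal sketch in which the expert knowledge takes the form of keyword sets per category. The categories, keywords, and function name are hypothetical and only illustrate the idea; rule-based systems described in the literature may use much richer rules.

```python
# Hypothetical expert-written rules: category -> keywords that signal it.
RULES = {
    "sport": {"مباراة", "فريق", "لاعب"},
    "economy": {"بنك", "سوق", "اسعار"},
}


def classify_by_rules(tokens, rules=RULES, default="unknown"):
    """Assign the category whose keyword set matches the most tokens."""
    best_category, best_hits = default, 0
    for category, keywords in rules.items():
        hits = sum(1 for tok in tokens if tok in keywords)
        if hits > best_hits:
            best_category, best_hits = category, hits
    return best_category
```

No training documents are needed, but every new category or change in vocabulary requires an expert to edit the rules by hand, which is the main contrast with the machine learning approach.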
Furthermore, TC can be used in several computer science applications such as spam or e-mail filtering, or as a tool for accessing interesting information in particular documents [4], [9].
4.1. Common Models and Algorithms of Arabic Text Classification
Different algorithms are used to classify Arabic documents. This section focuses on the following models: the Naïve Bayesian algorithm (NB), the K-Nearest Neighbor algorithm (KNN), the Support Vector Model (SVM), and the Artificial Neural Network (ANN).
4.1.1. Naïve Bayesian Algorithm
NB is a machine learning technique used to classify text into predefined categories based on similar features. NB has been applied to improve the processing and handling of texts or information from different sources. The algorithm is a probabilistic method. In other words, the NB classifier assumes that the presence or absence of one feature in a class is unrelated to the presence or absence of the other features. NB is commonly used to classify documents because it gives good classification performance. NB computes the probability that a document belongs to each of the different classes, and then assigns the document to the class with the highest probability [19].
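
The highest-probability rule can be sketched as follows. This is a minimal multinomial Naive Bayes with add-one (Laplace) smoothing, which is one common variant and an assumption here; the cited studies may use a different event model or smoothing. The class name and the placeholder tokens are illustrative, and documents are assumed to be already tokenized.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        # documents: list of token lists; labels: one class label per document.
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def log_scores(self, tokens):
        """Log prior plus summed log likelihoods, one score per class."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok]
                # Smoothing keeps unseen words from zeroing out a class.
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Placeholder tokens stand in for preprocessed Arabic words.
train_docs = [["team", "match", "goal"], ["match", "player"],
              ["bank", "market"], ["market", "prices", "bank"]]
train_labels = ["sport", "sport", "economy", "economy"]
model = NaiveBayes().fit(train_docs, train_labels)
```

Working with log probabilities avoids numeric underflow on long documents, and the per-token sum is exactly where the independence assumption enters: each word contributes to the score without regard to the others.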
Like many other models, NB has numerous advantages. It is generally considered the most powerful model used in this field. NB is understandable and very simple to implement. As for its disadvantages, NB has some limitations: because it depends on probabilities, which are usually estimated from frequencies, it needs each class to occur in the training data.
4.1.2. K-Nearest Neighbor Algorithms
K-Nearest Neighbor (KNN) is another machine learning algorithm used for TC. It is a non-parametric technique that classifies documents or objects according to the closest training examples. It uses a value k, which is always a positive integer, and an object is assigned to the class of its close neighbors. KNN assigns the object to the class that receives the most votes among its neighbors [16]. KNN remains applicable when the training data is large or very large. Yet KNN has some disadvantages, including the need to define the value of the parameter k, where k is the number of nearest neighbors. Also, this model is computationally expensive in comparison with other algorithms [20].
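
The voting scheme can be sketched as follows. This minimal version assumes bag-of-words count vectors and cosine similarity as the closeness measure; other distance functions and term weightings are equally possible. The function names and placeholder tokens are illustrative.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(tokens, training, k=3):
    """Return the majority label among the k training documents most similar to tokens.

    training is a list of (token_list, label) pairs.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    if not training:
        raise ValueError("training set is empty")
    query = Counter(tokens)
    # Every query is compared against every training document.
    ranked = sorted(training,
                    key=lambda item: cosine(query, Counter(item[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Placeholder tokens stand in for preprocessed Arabic words.
training = [(["team", "match"], "sport"), (["match", "goal"], "sport"),
            (["bank", "market"], "economy"), (["market", "prices"], "economy"),
            (["player", "team"], "sport")]
```

The sketch makes the cost visible: there is no training step, but each classification scans the whole training set, which is why KNN is expensive at query time on large collections.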
4.1.3. Support Vector Model (SVM)
VSM is one of the supervised learning models that have been applied to TC. It represents the different objects and documents in a finite-dimensional space. VSM is also used to analyze data, texts, and documents in order to compute the similarity among them [21]. VSM has several helpful aspects as an important model used in computer science. First, the method depends on linear algebra and does not involve any complex algebraic equations [8]. Another advantage is the efficiency of the weights assigned to concepts or terms. The model is also notably easy to use in comparison with other methods. It lets the machine compute the similarity among documents [22].
However, VSM has some limitations that prevent some researchers from using it. Handling synonyms in Arabic is a major challenge, since the Arabic language has many synonyms for each word or concept. Another limitation is that VSM assumes that terms are statistically independent, while most Arabic terms have a strong relationship with other terms.
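
The similarity computation described in this subsection can be sketched as follows. TF-IDF is assumed as the term weighting and cosine as the similarity measure; both are common choices for the vector space model, but the text above does not fix them. Function names and placeholder tokens are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Map each tokenized document to a {term: tf * idf} dictionary."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in documents:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Placeholder tokens stand in for preprocessed Arabic words.
docs = [["team", "match", "goal"], ["team", "match", "player"],
        ["bank", "market", "prices"]]
vectors = tfidf_vectors(docs)
```

Each term is its own dimension, so two documents that express the same idea with different synonyms share no dimensions and get a similarity of zero. This is the synonym and term-independence limitation noted above.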
4.1.4. Artificial Neural Network
ANN is a machine learning model that processes information in a way that resembles the human brain. It has been applied in different computing areas such as classification and pattern recognition. It consists of a set of inputs, adaptive weights, a non-linear function, and outputs [23].
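
The components listed above can be illustrated with the smallest possible network, a single neuron. This sketch assumes a sigmoid as the non-linear function and plain gradient descent for adapting the weights; practical networks for TC have many neurons and layers. All names and the toy data are illustrative.

```python
import math


def sigmoid(x):
    """Non-linear function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, weights, bias):
    """Output of one neuron: weighted sum of inputs passed through the sigmoid."""
    return sigmoid(sum(x * w for x, w in zip(inputs, weights)) + bias)


def train(samples, targets, epochs=200, rate=0.5):
    """Adapt the weights of a single neuron by gradient descent on the log loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in zip(samples, targets):
            error = forward(inputs, weights, bias) - target
            weights = [w - rate * error * x for w, x in zip(weights, inputs)]
            bias -= rate * error
    return weights, bias


# Toy data: each sample holds the counts of two hypothetical terms;
# target 1 marks one category and target 0 the other.
samples = [[1, 0], [2, 0], [0, 1], [0, 2]]
targets = [1, 1, 0, 0]
weights, bias = train(samples, targets)
```

After training, an output above 0.5 is read as the first category and an output below 0.5 as the second. The adaptive weights are the only part that changes during learning.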
Several advantages and disadvantages of artificial neural networks are worth mentioning. ANN is one of the easy models to use. It is also usually appropriate for complex problems or large texts. Its disadvantages include that it is less recommended for simpler problems or small texts. This method also requires loading the training data.
5. CONCLUSION AND FUTURE WORK
This paper is a survey of the importance of TC and of the current methods used in the NLP field. We have discussed the traditional TC models used to classify Arabic texts, corpora, and documents into different categories. Future work needs more effort to build and develop a new standard model for Arabic TC. This model must be more efficient than the current traditional methods. Another important aspect that needs improvement in such a model is the handling of language dialects: because Arabic has different dialects, the model must be compatible with them. It should also be applicable to any Arabic text.
ACKNOWLEDGEMENT
We would like to thank Al_Hussein bin Talal University (AHU) for providing us with a good scientific environment in which to produce this simple work.
REFERENCES
[1] Duwairi, R. M. (2007). Arabic text categorization. Int. Arab J. Inf. Technol., 4(2), 125-132.
[2] Fauzi, M. A., Arifin, A. Z., & Yuniarti, A. (2017). Arabic Book Retrieval using Class and Book Index Based Term Weighting. International Journal of Electrical and Computer Engineering (IJECE), 7(6), 3705-3711.
[3] El-Halees, A. (2006). Mining Arabic association rules for text classification. Paper presented at the Proceedings of the first international conference on Mathematical Sciences. Al-Azhar University of Gaza, Palestine.
[4] Caballero, Y., Bello, R., Alvarez, D., & Garcia, M. M. (2006). Two new feature selection algorithms with Rough Sets Theory. Paper presented at the IFIP International Conference on Artificial Intelligence in Theory and Practice.
[5] El Kourdi, M., Bensaid, A., & Rachidi, T.-e. (2004). Automatic Arabic document categorization based on the Naïve Bayes algorithm. Paper presented at the Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages.
[6] Al-Zoghby, A., Eldin, A. S., Ismail, N. A., & Hamza, T. (2007). Mining Arabic text using soft-matching association rules. Paper presented at the Computer Engineering & Systems, 2007. ICCES'07. International Conference on.
[7] Al-Harbi, S., Almuhareb, A., Al-Thubaity, A., Khorsheed, M., & Al-Rajeh, A. (2008). Automatic Arabic text classification.
[8] Mesleh, A. (2008). Support vector machines based Arabic language text classification system: feature selection comparative study. Advances in Computer and Information Sciences and Engineering (pp. 11-16): Springer.
[9] Dharmadhikari, S. C., Ingle, M., & Kulkarni, P. (2011). Empirical studies on machine learning based text classification algorithms. Advanced Computing, 2(6), 161.
[10] Khan, A., Baharudin, B., Lee, L. H., & Khan, K. (2010). A review of machine learning algorithms for text-documents classification. Journal of advances in information technology, 1(1), 4-20.
[11] Ababneh, J., Almomani, O., Hadi, W., El-Omari, N. K. T., & Al-Ibrahim, A. (2014). Vector space models to classify Arabic text. International Journal of Computer Trends and Technology (IJCTT), 7(4), 219-223.
[12] Mesleh, A. (2007). Chi square feature extraction based svms arabic language text categorization system. Journal of Computer Science, 3(6), 430-435.
[13] Khorsheed, M. S., & Al-Thubaity, A. O. (2013). Comparative evaluation of text classification techniques using a large diverse Arabic dataset. Language resources and evaluation, 47(2), 513-538.
[14] Sebastiani, F. (2002). Machine learning in automated text categorization. ACM computing surveys (CSUR), 34(1), 1-47.
[15] Khreisat, L. (2009). A machine learning approach for Arabic text classification using N-gram frequency statistics. Journal of Informetrics, 3(1), 72-77.
[16] Alaa, E. (2008). A comparative study on arabic text classification. Egypt. Comput. Sci. J, 2.
[17] Mesleh, A. (2008). Support Vector Machine Text Classifier for Arabic Articles: Ant Colony Optimization-Based Feature Subset Selection. The Arab Academy for Banking and Financial Sciences.
[18] Sebastiani, F. (2005). Text categorization. Encyclopedia of Database Technologies and Applications (pp. 683-687): IGI Global.
[19] Abu-Errub, A. (2014). Arabic Text Classification Algorithm using TFIDF and Chi Square Measurements. International Journal of Computer Applications, 93(6).
[20] Al-Shalabi, R., Kanaan, G., & Gharaibeh, M. (2006). Arabic text categorization using KNN algorithm. Paper presented at the Proc. of Int. multi conf. on computer science and information technology CSIT06.
[21] Jackson, P., & Moulinier, I. (2007). Natural language processing for online applications: Text retrieval, extraction and categorization (Vol. 5): John Benjamins Publishing.
[22] Sawaf, H., Zaplo, J., & Ney, H. (2001). Statistical classification methods for Arabic news articles. Natural Language Processing in ACL2001, Toulouse, France.
[23] Harrag, F., El-Qawasmeh, E., & Pichappan, P. (2009). Improving Arabic text categorization using decision trees. Paper presented at the Networked Digital Technologies, 2009. NDT'09. First International Conference on.
BIOGRAPHY OF AUTHOR
Ahed Al-Sbou is a lecturer in the Information Technology School of Computer Science at the University of Al-Hussein Bin Talal, where he has been a faculty member since 2014. He holds a master's degree in computer science. Ahed completed his Master's degree at Al-Balqa Applied University, Salt, Jordan in 2012 and his B.S. degree in computer science at Al-Hussein Bin Talal University, Ma'an, Jordan in 2006. His research interests in computer science lie in the area of programming languages, ranging from theory to design to implementation, as well as Database, Data Mining, Natural Languages Processing (NLP), and information systems. Ahed worked as a computer lab supervisor (2006-2014) at Al-Hussein Bin Talal University.