Internati
o
nal
Journal of Ele
c
trical
and Computer
Engineering
(IJE
CE)
Vol.
5, No. 6, Decem
ber
2015, pp. 1458~
1
467
I
S
SN
: 208
8-8
7
0
8
1
458
Jo
urn
a
l
h
o
me
pa
ge
: h
ttp
://iaesjo
u
r
na
l.com/
o
n
lin
e/ind
e
x.ph
p
/
IJECE
Ontol
ogi
es and B
i
gram-b
as
ed Ap
proach f
o
r Is
olat
ed Non-
word
Errors Correction
in OCR S
y
stem
Aicha Eutame
ne*, Mohame
d Khireddi
ne
Kholladi**, Hacene
Belhade
f
*
* IFA Department, NTIC
Facu
lty
,
University
of
Constantine 2
–
ABEDLHAMID
MEHRI,
Algeria
** Computer Science Depar
t
ment,
El Oued Univer
sity
,
Alge
ria
Article Info
A
B
STRAC
T
Article histo
r
y:
Received
J
u
l 15, 2015
Rev
i
sed
Sep
13
, 20
15
Accepted
Sep 30, 2015
In the presen
t
paper, we d
e
scr
i
be a n
e
w and
original
approach for post-
processing step in optical ch
arac
ter recognition s
y
stems (OCR). This
approach is based on a new m
e
thod of sp
elling
correctio
n to correct
automatically
misspelled words
resulti
ng from character r
ecognition step o
f
scanned documents b
y
combinin
g both ont
ologies and bigram co
de in order
to create
a robust sy
stem able to
solve auto
matically
th
e anomalies of
classical appro
a
ches. Th
e propo
sed a
pproach
is based on a h
ybrid method
which is spread
over two stages, first
on
e is
character r
ecognitio
n b
y
using
the ontolog
ical
model and the
second
one is
word recognitio
n
based on
spelling
correction approach bas
e
d on bi
gr
am co
dification
for detection
and
correction of
err
o
rs. The spelling
error
is broadly
classified in
two
categories
namely
non-wor
d
error and
real-
w
ord error.
In
th
is paper
,
we in
terested on
ly
on detect
ion and
correct
ion of non-word errors
because th
is is the onl
y
t
y
p
e
of errors
treate
d
b
y
an OCR. In addition, th
e
us
e of an onli
n
e extern
al
resource such as
WordNet proves
ne
cessar
y
to
im
prove its perfor
m
ances.
Keyword:
Big
r
am
OCR
Ont
o
lo
gy
Sp
ellin
g Correctio
n
Word
Recogn
itio
n
Wo
r
d
Net
Copyright ©
201
5 Institut
e
o
f
Ad
vanced
Engin
eer
ing and S
c
i
e
nce.
All rights re
se
rve
d
.
Co
rresp
ond
i
ng
Autho
r
:
Hacene
Belha
d
ef,
IFA
De
pa
rtm
e
nt,
NTIC
Fac
u
lty
Uni
v
ersi
t
y
o
f
C
onst
a
nt
i
n
e 2 – AB
ED
LH
AM
I
D
M
E
HR
I,
C
onst
a
nt
i
n
e 25
00
0,
Al
geri
a
Em
a
il: hacene.belha
d
ef@u
ni
v-constantine
2
.dz
1.
INTRODUCTION
W
i
t
h
t
h
e a
d
ve
nt
o
f
di
gi
t
a
l
opt
i
cal
scanne
r
s
, a l
o
t
o
f
pa
per
-
ba
sed
b
o
o
k
s, t
e
xt
b
o
o
k
s,
m
a
gazi
nes
,
art
i
c
l
e
s, and d
o
cum
e
nt
s are bei
n
g t
r
ans
f
or
m
e
d i
n
t
o
an
electronic ve
rsi
on that
can
be
m
a
nipulated by a
co
m
p
u
t
er.
For th
is
p
u
rpo
s
e,
OCR, sh
ort for Op
tical
Ch
aracter Recogn
itio
n
was d
e
v
e
l
o
p
e
d
fo
r
h
a
n
d
writing
recogn
itio
n
(HWR). It con
s
ists to
tran
slate scan
n
e
d
gr
aph
i
cal tex
t
in
to ed
itab
l
e co
m
p
u
t
er tex
t
[1
],
[2
].
OCR (Op
tical Ch
aracter R
eco
gn
itio
n) mean
s th
at a
co
m
p
u
t
er an
alyses ch
aracter im
ag
es
au
to
m
a
tical
ly
to
ach
iev
e
the tex
t
in
formatio
n
,
a
nd
O
C
R
engi
ne i
s
an IC
R
(I
n
d
epe
n
dent
C
h
aract
er
Reco
gn
itio
n) en
g
i
n
e
u
s
u
a
lly. It reco
gn
izes ch
aracters in every po
sitio
n of an
im
ag
e and
g
i
v
e
s all th
e
po
ssib
l
e
can
d
i
d
a
te resu
l
t
s. In
ICR recog
n
ition
p
r
o
cess, o
n
l
y i
m
ag
e i
n
fo
rm
atio
n
is
u
s
ed
, and
th
e resu
lts co
n
t
ain
man
y
incorrect
words. So the s
p
elling c
o
r
r
ect
i
o
n app
r
oaches
, w
h
i
c
h pl
ay
t
h
e m
o
st
im
port
a
nt
rol
e
i
n
OC
R
po
st
-
pr
ocessi
ng
, are
re
qui
re
d
[
3
]
.
Th
e task
of word
recogn
itio
n
is to
find
all th
e v
a
lid
en
tries o
f
th
e d
i
ctio
n
a
ry that
m
a
tch
th
e
reco
g
n
i
zed
wo
rd
. To
d
o
t
h
at
,
one
or s
e
ve
ral
candi
dat
e
s co
ul
d
be ge
ne
rat
e
d f
o
r eac
h
wo
rd
. Each ca
n
d
i
d
at
e i
s
al
so p
r
ovi
ded
wi
t
h
a c
o
nfi
d
e
n
ce sc
ore
.
T
h
e
p
o
st
-
p
r
o
cessi
n
g
task is to
p
i
ck
th
e b
e
st cand
id
ate
for each wo
rd
by
expl
oi
t
i
ng
di
ffe
re
nt
t
y
pes of i
n
f
o
rm
at
i
on. The
post
-
pro
cessing
step
h
e
lp
s to
sign
ifican
tly red
u
c
e errors.
Ho
we
ver
,
i
t
i
s
not
a c
o
m
p
l
e
tel
y
separat
e
o
f
prece
di
n
g
st
e
p
s.
In
ge
neral
,
t
h
i
s
st
ep sh
o
w
s t
h
e
un
rec
o
gni
ze
d
characte
r
s a
n
d
words that
ha
ve not
been
fo
un
d in
t
h
e
d
i
ction
a
ry
o
r
in th
e l
e
x
i
con
.
Th
is step
p
r
ov
id
es the u
s
er
t
o
m
a
ke l
e
xi
cal
and
sy
nt
act
i
c
co
rrect
i
o
ns
us
i
ng
di
ct
i
o
nari
e
s
an
d
gram
m
a
r r
u
l
e
s t
o
i
m
prove
t
h
e
rec
o
g
n
i
t
i
o
n
Evaluation Warning : The document was created with Spire.PDF for Python.
I
J
ECE
I
S
SN
:
208
8-8
7
0
8
Ont
o
l
o
gi
es
an
d
Bi
gr
am
-
base
d
Ap
pr
oac
h f
o
r
Isol
at
ed N
o
n-w
o
rd
Err
o
rs
C
o
r
r
ect
i
on…
(
A
i
c
h
a
E
u
t
a
me
ne)
1
459
rat
e
[4]
,
seve
r
a
l
app
r
oac
h
es
have
bee
n
re
so
rt
ed t
o
pl
ay
in
g a con
s
o
lid
ation
ro
le in
th
is
step
. In
o
u
r case, we
foc
u
s
on c
o
r
r
ect
i
on o
f
i
s
ol
a
t
ed w
o
r
d
s.
Th
e use o
f
e
x
ternal res
o
urces
suc
h
as lexica
l ontol
ogies
proves
n
ecessary to
i
m
p
r
o
v
e
t
h
e
qu
ality o
f
t
h
e
resu
lts. Th
e on
to
log
i
es
p
l
ay an
im
p
o
r
tan
t
ro
le in
t
h
e
field
of
kn
o
w
l
e
d
g
e re
p
r
esent
a
t
i
o
n an
d sha
r
i
n
g of i
n
fo
rm
ati
on, t
h
e
y
can be
m
odi
fi
ed, a
d
apt
e
d and re
use
d
i
n
di
ffe
ren
t
appl
i
cat
i
o
ns a
n
d
d
o
m
a
i
n
s. I
n
rece
nt
y
ears
,
o
n
t
o
l
o
gi
es
ha
ve
been
sh
o
w
n t
o
be a
n
i
m
po
rt
ant
i
n
st
r
u
m
e
nt
and
useful for the
represe
n
tation, shar
ing and reuse of knowl
e
dge
, also tha
nks to
ont
ology'
s
langua
ges
whic
h
ex
press a
rich
sem
a
n
tic an
d
prov
id
e b
e
st
reaso
n
i
ng
cap
a
b
ilities.
Th
is
is
why
we op
ted
for
th
e u
s
e of o
n
t
o
l
og
y
as a tool a
n
d a
way to
desi
gn
a ne
w
rec
o
g
n
i
t
i
on a
p
pr
oac
h
o
f
c
h
aract
ers
a
n
d
wo
rd
s.
Indee
d
, we
found t
h
at whate
v
er of t
h
e wealt
h
of do
cu
m
e
n
t
s co
n
t
en
t, th
is
wealth
rem
a
in
s in
su
fficien
t
to
h
e
l
p
th
e
pro
cess i
n
ord
e
r to
acco
m
p
lis
h
its task
. Th
e app
r
o
aches that ex
ist currently d
o
no
t take in
to
account t
h
e se
mantic aspect
whic
h is
n'
t
explicitly visible for the
m
achine, thi
n
g that
ori
e
nted us
to the
use
of
o
n
t
o
l
og
y to
com
p
le
te th
is g
a
p (called
sem
a
n
tic g
a
p).
Th
e on
to
log
y
h
a
s b
e
en
app
lied
in
th
e ch
aracter reco
g
n
ition
step
wh
ich
we will d
e
scrib
e
later. On
the
ot
he
r ha
n
d
i
n
t
h
e p
o
st
-
p
roc
e
ssi
ng
st
ep,
w
e
ad
opt
e
d
a s
p
el
l
i
ng c
h
ec
ki
ng
ap
pr
oac
h
by
usi
ng t
h
e
bi
g
r
am
codi
fi
cat
i
on
f
o
r
det
ect
i
on a
n
d
i
d
ent
i
f
i
cat
i
o
n
of
er
ro
rs i
n
m
i
sspel
l
e
d
w
o
r
d
s.
Th
e task
of spell-ch
eck
i
n
g
is sp
lit in
to
two
p
a
rts,
error-detectio
n
,
and
erro
r-correction
.
Th
ere are
err
o
r
det
ect
i
o
n
com
p
l
e
x sy
stem
s whi
c
h m
a
y
be use
d
t
o
detect words
that are co
rrectly sp
elled
,
b
u
t
are
u
n
s
u
itab
l
e i
n
t
h
e sy
n
t
actic o
r
sem
a
n
tic co
n
t
ex
t; th
is is
re
fe
rr
ed t
o
as
real
-
w
or
d e
r
r
o
r
det
e
c
t
i
on i
n
co
nt
e
x
t
[5]
.
Th
e
wo
rd
er
ro
r is a m
a
j
o
r
h
i
nd
r
a
n
ce to
th
e real w
o
r
l
d app
licatio
n
s
. In
tex
t
u
a
l do
cu
m
e
n
t
s,
wo
rd
-
e
rror
can
be
of
t
w
o
t
y
pes.
O
n
e i
s
n
o
n
-
wo
rd
er
r
o
r
w
h
i
c
h
has
no
m
eani
ng a
n
d
ot
her
i
s
rea
l
wo
r
d
er
r
o
r
whi
c
h i
s
mean
in
gfu
l
b
u
t n
o
t
th
e in
tend
ed
wo
rd
in
t
h
e con
t
ex
t
o
f
t
h
e sente
n
ce.
Of these,
non-word ha
s bee
n
widely
studie
d
and al
gorithm
s
to detect and s
u
ggest correc
tio
n w
o
r
d
fo
r th
e err
o
r
ha
v
e
b
een
pr
opo
sed.
Th
ese
algorithm
s
are gene
rally termed as
spell-c
h
ecke
r
, which are
in
tegr
ated
in
v
a
r
i
ou
s wo
rd
pr
o
c
essing
so
f
t
w
a
r
e
lik
e Micro
s
o
f
t
Word
, Li
b
r
e Office
Writer ,Isp
ell, Asp
e
lletc
[6
].
The resea
r
ch i
n
to algorithm
i
c tec
hni
q
u
es f
o
r det
ect
i
ng an
d
correct
i
n
g spe
l
l
i
ng err
o
rs i
n
t
h
e t
e
xt
has a
lo
ng
h
i
story in
co
m
p
u
t
er scien
ce. As an
a
m
alg
a
m
a
ti
o
n
o
f
th
e trad
ition
a
l field
s
of artificial in
telli
g
e
n
c
e,
p
a
ttern
reco
gnitio
n
,
stri
n
g
m
a
tch
i
n
g
, co
m
p
u
t
atio
n
a
l ling
u
istics, an
d
o
t
hers, t
h
is fu
nd
amen
tal p
r
ob
lem
in
i
n
f
o
rm
at
i
on
sci
e
nce has bee
n
st
udi
e
d
f
r
o
m
th
e early 196
0’s
to
the present
[7].
In
th
is p
a
p
e
r,
we are in
terested
in
d
e
tecting
an
d
id
en
tifi
catio
n
of erro
rs in
th
e
m
i
ssp
elled
word
i
nde
pen
d
e
n
t
l
y
of i
t
s
c
o
nt
ext
,
so-cal
l
e
d
i
s
ol
at
ed
no
n-
real
wo
rd
. T
o
do
t
h
at
,
we
pr
o
p
o
s
ed
a ne
w m
e
t
hod
base
d
o
n
a
g
e
n
e
ral lo
oku
p
d
i
ctio
n
a
ry an
d
th
e sp
ecific lex
i
co
n
of a d
o
m
ain
.
If
th
e word
do
es n
o
t
ex
ist, it will b
e
co
d
i
fied
in
b
i
gram co
d
e
in
ord
e
r to
facilitate th
e d
e
t
ectio
n an
d
id
en
tificatio
n
th
e erro
r i
n
th
e wo
rd
. A
set o
f
can
d
i
d
a
tes will b
e
g
e
n
e
rated
by ap
p
l
yin
g
a set o
f
ru
les th
at match
th
e typ
e
o
f
erro
r
d
e
tected
. Fo
r m
o
re d
e
tails,
the rea
d
er
ca
n
refe
r to
the se
c
tion that
desc
ri
bes
o
u
r m
e
tho
d
.
In
section
2
of th
is p
a
p
e
r,
we p
r
esen
t sh
ort
d
e
fi
n
itio
n
o
f
th
e sp
elling
correction
an
d
it
s typ
e
s. In
Sectio
n
3, we will rev
i
ew so
m
e
related
w
o
rk
s.
We p
r
esen
t o
u
r m
e
th
od
in
sectio
n
4
,
an
d
we fin
i
sh b
y
a
gene
ral
c
oncl
u
si
on
.
2.
SPELLING CORRECTION
Sp
ell ch
eck
i
ng is th
e process o
f
find
ing
m
i
ssp
elle
d
wo
rd
s in
a written
tex
t
and
p
o
s
sib
l
y to
correct
them
. This problem
has bee
n
widely studied,
and s
p
ell-c
h
ec
kers
are
proba
bly am
ong t
h
e
first s
u
ccess
f
ul
NL
P
appl
i
cat
i
o
ns
wi
del
y
use
d
by
t
h
e ge
neral
p
ubl
i
c
[
8
]
.
Sp
ellin
g
co
rrectio
n
is th
e task
o
f
correcting
words in
tex
t
s. Mo
st o
f
th
e av
ailab
l
e sp
elling
co
rrection
to
o
l
s
on
ly wo
rk
o
n
iso
l
ated
wo
rd
s and
co
m
p
u
t
e a list
o
f
sp
ellin
g
su
gg
estion
s
rank
ed
b
y
ed
it-d
i
stan
ce, letter-n-
g
r
am
si
milarit
y
o
r
co
m
p
arab
le m
easu
r
es. Alth
oug
h
t
h
e p
r
o
b
ab
ility o
f
th
e
b
e
st-rank
ed
sug
g
estion
b
e
i
n
g
correct in the
current c
onte
x
t is high, use
r
i
n
tervention is
usually n
ecess
a
ry to choose the m
o
st appropriate
su
gg
estion
.
Acco
rd
ing
to
Kuk
i
ch
[6
], th
e prob
lem
o
f
sp
ell checking can be classifi
ed in three cate
g
ories of
i
n
creasi
n
g
di
f
f
i
cul
t
y
:
no
n-
wo
rd
er
ro
r
det
ect
i
o
n
,
i
s
ol
at
ed-
w
or
d e
r
r
o
r
c
o
r
r
e
c
t
i
on,
an
d c
o
nt
ext
-
depe
n
d
ent
wo
r
d
correction.
We ca
n cl
assi
f
y
spel
l
i
ng e
r
r
o
rs i
n
t
w
o
m
a
i
n
gr
o
ups:
n
o
n
-
w
o
r
d
e
r
r
o
rs
an
d
real
-w
o
r
d e
r
r
o
r
s
. T
h
e n
o
n
-
wo
rd
er
r
o
rs
m
i
ght
be c
o
r
r
ect
ed
wi
t
h
o
u
t
c
o
nsi
d
e
r
i
n
g t
h
e c
ont
e
x
t
i
n
w
h
i
c
h t
h
e
er
r
o
r
oc
curs
,
but
a
rea
l
-w
or
d
error can be
corrected only by
ta
ki
n
g
c
ont
e
x
t
i
n
t
o
acc
o
unt
[
9
]
.
1)
No
n-w
o
rd
erro
r
o
ccurs
wh
en
a wo
rd
in
t
h
e OCR tex
t
is in
terpreted
as a strin
g
th
at
d
o
e
s no
t
cor
r
es
po
n
d
t
o
any
val
i
d
w
o
r
d
i
n
a gi
ve
n wo
rd l
i
s
t
(l
exi
c
on
) o
r
di
ct
i
o
nary
. F
o
r e
x
a
m
pl
e, The
bok
is o
n
th
e
tab
l
e.
The word
bok
doe
s not exist in English, a
n
d it pr
oba
bly de
rives from
a typo
of the
noun
bo
ok
. The
correct phrase
is: The
bo
ok
is
o
n
th
e tab
l
e.
Evaluation Warning : The document was created with Spire.PDF for Python.
I
S
SN
:
2
088
-87
08
IJECE
Vol. 5, No. 6, D
ecem
ber
2015 :
1458 –
1467
1
460
2)
R
e
al
-w
or
d
error
occurs
whe
n
a source-
text wo
rd
is in
terp
reted
as a strin
g
th
at actu
a
l
l
y d
o
e
s
o
ccur i
n
th
e d
i
ctio
n
a
ry, bu
t is d
i
fferen
t fro
m
the s
o
urce-te
xt word.
For e
x
a
m
ple, I saw
tree
trees in th
e park.
The n
o
un
tree
exi
s
t
s
i
n
En
gl
i
s
h, b
u
t
i
n
t
h
i
s
cont
e
x
t
i
t
i
s
most
l
i
k
el
y
a typo
of t
h
e
num
eral
th
ree
: I
saw
th
r
e
e
trees in
th
e p
a
rk
.
We
recall th
at in
th
is stud
y presen
ted in
t
h
is pap
e
r,
we are only in
terested
i
n
th
e
first typ
e
of erro
r.
3.
RELATED WORK
Th
e task
o
f
wo
rd
error d
e
tectio
n
in
OCR
pro
cessing
is o
f
ten
con
s
id
ered triv
ial o
r
so
l
v
ed
in
m
a
n
y
research
p
a
p
e
rs
d
ealing
with
sp
ellin
g
correctio
n
:
Xi
an
g T
o
n
g
a
nd
Da
vi
d
A.
Eva
n
s,
descri
b
e
an a
u
t
o
m
a
t
i
c
, co
nt
ext
-
s
e
nsi
t
i
v
e, w
o
r
d
-e
rr
or c
o
r
r
ect
i
o
n
system
based
on statistical
langua
ge m
odelling (S
LM) as
a
pplied to
optical character re
c
o
gnition (OCR
) post-
processi
ng. Gi
ven a se
ntence
to be co
rrecte
d
, the syste
m
decom
poses each
string in the s
e
ntence int
o
letter n-
gram
s and
ret
r
i
e
ves w
o
r
d
ca
ndi
dat
e
s f
r
om
t
h
e l
e
xi
co
n
by
com
p
ari
ng
st
r
i
ng
n-
gram
s w
i
t
h
l
e
xi
co
n-e
n
t
r
y
n-
g
r
am
s. Th
e ret
r
iev
e
d
can
d
i
d
a
tes are
rank
ed
b
y
th
e co
n
d
iti
o
n
a
l
prob
ab
ilit
y o
f
m
a
tch
e
s with
th
e string, g
i
v
e
n
ch
aracter con
f
u
s
ion
p
r
o
b
a
b
ilities. Th
e
word-b
i
g
ram
m
o
d
e
l an
d Viterb
i alg
o
rith
m
are u
s
ed
to
d
e
term
in
e th
e
b
e
st scoring
word sequ
en
ce for th
e
sen
t
en
ce.
In
ad
d
itio
n, th
e system can
learn
t
h
e ch
aracter co
nfu
s
ion
p
r
ob
ab
ilities fo
r a sp
ecific OCR en
v
i
ron
m
e
n
t and
u
s
e t
h
em
in
self-calibratio
n to
ach
i
ev
e
b
e
tter
p
e
rform
a
n
ce
[1
0]
.
Mo
h
a
mm
ad
Ali El
mi an
d
Marth
a
Ev
en
s, d
e
scrib
e
a sp
elling
correction
sy
ste
m
th
at fu
n
c
t
i
o
n
s
as
p
a
rt
o
f
an
in
tellig
en
t tu
to
r th
at carries on
a n
a
tu
ral lan
g
u
a
g
e
d
i
alo
g
u
e
with its u
s
ers. Th
e b
a
sic id
ea o
f
th
eir
ap
pro
ach
is the in
teractio
n
between
the pa
rser and the spelling corrector
. Alternative correction targets are
fed
bac
k
t
o
t
h
e
par
s
er,
w
h
i
c
h
doe
s a se
ri
es o
f
sy
nt
act
i
c
an
d
sem
a
nt
i
c
checks,
base
d
on
t
h
e di
al
o
gue
co
n
t
ext
,
the se
ntence c
o
ntext, a
n
d t
h
e
phrase c
o
ntext
[11].
Davi
deF
o
ssat
i
and B
a
r
b
a
r
a D
i
Euge
ni
o, ad
d
r
ess t
h
e p
r
o
b
l
e
m
of real
-w
or
d spel
l
checki
n
g
,
i
.
e., t
h
e
d
e
tectio
n and correction
o
f
typ
o
s
th
at
resu
lt in
real
wo
rd
s
of th
e targ
et lan
g
u
a
g
e
. Au
tho
r
s pro
p
o
s
e
a
m
e
t
hod
ol
o
g
y
base
d o
n
a m
i
xed t
r
i
g
ram
s
lang
ua
ge m
ode
l
.
Thei
r e
x
peri
m
e
nt
s sho
w
pr
om
i
s
i
ng resul
t
s wi
t
h
resp
ect t
o
the
hit rates of
b
o
t
h
d
e
tectio
n and
co
rrectio
n,
ev
en th
oug
h th
e
fal
s
e po
sitiv
e
rate is still h
i
g
h
[8
].
Sebast
i
a
n
Deo
r
o
w
o
cz, M
A
R
C
IN G
.
C
I
UR
A, Acc
o
unt
f
o
r a new t
e
c
hni
que
of c
o
r
r
ect
i
ng i
s
ol
at
e
d
words in
typ
e
d tex
t
s. A lan
guag
e
-d
ep
end
e
n
t
set o
f
string
sub
s
titu
tio
n
s
refl
ects th
e su
rface fo
rm
o
f
erro
rs th
at
resul
t
f
r
o
m
vo
cabul
a
r
y
i
n
c
o
m
p
et
ence,
m
i
sspel
l
i
ngs
,
or
m
i
st
y
p
i
ngs. C
a
ndi
dat
e
co
rre
ct
i
ons a
r
e f
o
r
m
ed by
ap
p
l
ying
th
e su
b
s
titu
tion
s
to
tex
t
word
s ab
sen
t
fro
m
th
e c
o
m
p
u
t
er lex
i
con
.
A m
i
n
i
m
a
l
acyclic d
e
ter
m
in
istic
fin
ite au
t
o
m
a
t
o
n storing
th
e
lex
i
co
n allo
ws qu
ick
rej
ection of nonsense
correctio
ns,
while costs a
ssociated
with
th
e su
b
s
titu
tio
ns serv
e to
rank
t
h
e rem
a
in
ing
o
n
e
s [12
]
.
M
a
rt
i
n
Schi
erl
e
, Sascha Sc
h
u
l
z
and M
a
r
k
us Ac
kerm
ann
,
have
devel
o
ped a
n
effi
ci
e
n
t
cont
e
x
t
sen
s
itiv
e sp
ellin
g
co
rrectio
n
syste
m
ca
lled
d
c
Clean
b
y
com
b
in
in
g
two
ap
pro
ach
es: th
e ed
it d
i
stan
ce b
a
sed
ran
k
i
n
g
of
an
ope
n s
o
urce
sp
el
l
i
ng co
rrect
o
r
an
d
nei
ghbour co-occurre
nc
e statistics
com
puted from
a dom
a
in
specific c
o
rpus. In c
o
m
b
ination
with
dom
ain speci
fic re
place
m
e
nt and
abbre
v
iation li
sts, they are a
b
le to
sig
n
i
fican
tly im
p
r
o
v
e
t
h
e correction
p
r
ecisi
o
n
co
m
p
ared
t
o
ed
it d
i
stan
ce o
r
con
t
ex
t b
a
sed
sp
ellin
g correcto
r
s
appl
i
e
d
o
n
t
h
ei
r
ow
n
[1
3]
.
Yo
usse
f B
a
ssi
l
and M
oham
m
a
d Al
wani
pr
o
pos
e a p
o
st
-
p
r
o
cessi
n
g
c
ont
e
x
t
-
based e
r
r
o
r
cor
r
ect
i
o
n
al
go
ri
t
h
m
for d
e
t
ect
i
ng and c
o
rrect
i
n
g OC
R
no
n
-
w
o
r
d
an
d
real
-w
o
r
d er
r
o
r
s
. The p
r
o
p
o
se
d al
go
ri
t
h
m
i
s
base
d
o
n
Goog
le’s on
lin
e sp
elling
su
gg
estion
wh
i
c
h
h
a
rn
esses an
in
tern
al d
a
tab
a
se con
t
ain
i
ng
a h
u
g
e
co
llectio
n
of
term
s
and word seque
n
ces gathere
d
from
a
ll over the
we
b, convenie
nt to suggest pos
s
i
ble replacem
e
n
ts for
wo
rd
s t
h
at
ha
v
e
been m
i
sspel
l
e
d d
u
ri
ng t
h
e OC
R
pr
ocess
.
According to the aut
h
ors, t
h
e
expe
rim
e
nts c
a
rrie
d
out
re
veal
ed
a
si
gni
fi
ca
nt
i
m
pro
v
em
ent
i
n
O
C
R
err
o
r
c
o
r
r
e
c
t
i
on
rat
e
a
n
d
t
h
ey
ca
n i
m
pro
v
e
u
p
o
n
t
h
e
pr
op
ose
d
alg
o
rith
m
so
mu
ch so
th
at it can
b
e
p
a
rallelized
an
d ex
ecu
t
ed
o
v
e
r m
u
ltip
ro
cessing
p
l
atform
s [1
].
Bak
k
a
li
Ham
z
a et al.,
h
a
v
e
p
r
op
o
s
ed
a n
e
w ap
pro
a
ch
for sp
ell-ch
eck
i
ng
erro
rs co
mmitted
in
th
e
Ara
b
i
c
l
a
ng
ua
ge. T
h
e aut
h
o
r
s i
n
t
r
od
uce
d
t
h
e co
ncept
of
m
o
rp
hol
ogi
ca
l
anal
y
s
i
s
i
n
the p
r
oce
ss o
f
spel
l
-
ch
eck
i
n
g. Th
ei
r syste
m
u
s
es a ste
m
s d
i
ctio
n
a
ry of red
u
ced
size rath
er than
exp
l
o
iting
a larg
e d
i
ction
a
ry n
o
t
co
v
e
ring
th
e
all Arab
ic
word
s. Accord
i
n
g
to
t
h
e au
tho
r
s, th
e
ob
tain
ed
resu
lts are h
i
gh
ly po
sitiv
e and
sat
i
s
fact
ory
[1
4]
,
[
1
5]
.
Gol
d
i
n
g an
d S
c
habe
s [
16]
ha
ve i
n
t
r
od
uce
d
a hy
bri
d
ap
p
r
o
ach cal
l
e
d "Tri
bay
e
s" com
b
i
n
i
ng T
r
i
g
r
a
m
an
d Bayes'
m
e
th
od
.
Tr
i
g
r
a
m
meth
o
d
u
s
es
par
t
-
o
f
-
sp
eech
t
r
ig
ram
s
to
en
co
d
e
th
e con
t
ext wh
ereas Bayes is a
feat
ure
-
based
m
e
t
hod.
T
h
ey
use t
w
o
t
y
pes
of
feat
ures:
c
o
nt
ext
wo
r
d
a
n
d c
o
l
l
o
cat
i
o
ns.
Lat
e
r
G
o
l
d
i
n
g
wi
t
h
R
o
t
h
[
16]
ha
ve
pr
op
ose
d
a
Wi
nn
ow
-
b
ased
m
e
t
hod f
o
r
rea
l
wor
d
det
ect
i
o
n an
d co
rrect
i
o
n. T
h
ey
m
odi
fi
ed t
h
e
pre
v
ious m
e
thod
by ap
plying a winnow m
u
ltiplicative algorithm
co
m
b
ining varia
n
ts
of winnow
a
nd weighted
maj
o
rity vo
ting
an
d ach
iev
e
d b
e
tter accuracy.
Evaluation Warning : The document was created with Spire.PDF for Python.
I
J
ECE
I
S
SN
:
208
8-8
7
0
8
Ont
o
l
o
gi
es
an
d
Bi
gr
am
-
base
d
Ap
pr
oac
h f
o
r
Isol
at
ed N
o
n-w
o
rd
Err
o
rs
C
o
r
r
ect
i
on…
(
A
i
c
h
a
E
u
t
a
me
ne)
1
461
Hi
rst
a
n
d B
u
d
a
ni
t
s
ky
[1
8]
m
a
de a
st
u
d
y
of
t
h
e
p
r
o
b
l
e
m
of
sp
el
l
i
ng c
o
r
r
ect
i
o
n
o
n
sam
e
co
rp
us
o
f
Wall Street Jou
r
n
a
l. Th
eir meth
od
id
en
tifies to
k
e
n
s
th
at
are sem
a
n
ticall
y
u
n
related
t
o
th
eir co
n
t
ex
t an
d was
not re
stricted
to chec
king words
f
r
om
pre
d
efi
n
ed c
o
n
f
us
i
on set
.
T
h
ey
achi
e
ve
d R
eca
l
l
of 2
3
%-
5
0
%
and
Pr
ecision
o
f
18%-
2
5
%
.
Em
illia and al. [19] proposed a
hybri
d
m
e
thod
for isol
ate
d
word
rec
o
gnition
base
d on the Ergodic
Hi
d
d
en
M
a
r
k
o
v
M
odel
an
d
t
h
ey
use
d
t
h
e
genet
i
c
al
go
ri
t
h
m
t
o
o
p
t
i
m
i
ze t
h
e B
a
um
-wel
ch m
e
t
hod i
n
t
h
e
train
i
ng
p
r
o
cess in
o
r
d
e
r to im
p
r
o
v
e
t
h
e accu
racy
o
f
th
e reco
gn
itio
n resu
lt wh
ich is pro
d
u
c
ed
b
y
th
e
HMM
param
e
ters that ge
nerate t
h
e low acc
uracy
when the
HMM a
r
e tested.
In
an
ot
he
r W
o
r
d
N
e
t
-
ba
sed
app
r
oach
,
Pe
d
d
l
e
r [
20]
s
howed that t
h
e s
e
m
a
ntic associ
ation ca
n
be
usef
ul
i
n
d
e
t
ect
i
ng a real
-
w
or
d er
ro
r u
s
i
n
g som
e
conf
u
s
i
on set
s
es
pe
ci
al
l
y
i
n
case of
Dy
sl
exi
c
t
e
xt
. S
h
e
achieve
d recall
and precision of
correction 40
% a
n
d
81%
,
respectively for Dyslexic te
xt.
4.
OU
R CO
NTR
I
BUTIO
N
Fi
gu
re
1 s
h
o
w
s t
h
e
di
ffe
rent
com
pone
nt
o
f
ou
r sy
st
em
wh
i
c
h co
nt
ai
ns t
w
o m
a
i
n
m
odul
e
s
:
C
h
aract
er
Reco
gn
itio
n
(fi
rst
pro
cessing
u
n
it)
and
Wo
rd
Recogn
itio
n
(second
po
st-pro
cessing
un
it). In
th
e fi
rst unit, th
e
syste
m
receives a seque
n
ce
of gra
p
he
m
e
s resulting
from
a pre-processi
ng
st
ep which consists in segm
entation
of te
xt doc
u
m
e
nt's i
m
age in orde
r to ge
ne
rate a sequ
e
n
tial series of
gra
phe
m
e
s
according thei
r
position in the
doc
um
ent
.
In
th
is un
it called
OMCR (On
t
o
l
og
ical Mo
d
e
l fo
r Ch
aracter Reco
gn
itio
n), th
e system reco
gn
izes
doc
um
ent
'
s charact
ers
by
bas
i
ng
o
n
a
n
o
n
t
o
l
ogi
cal
m
odel
[2
1]
-[
2
3
]
.
T
h
e
conce
p
t
s
o
f
on
t
o
l
ogy
re
prese
n
t
t
h
e
gra
p
hem
e
s, by cont
rast
t
h
e r
e
l
a
t
i
onshi
ps re
prese
n
t
t
h
e spa
t
i
a
l
i
t
y
bet
w
een t
h
e grap
hem
e
s by
respect
i
n
g t
h
ei
r
order of a
p
pearance
(Figure
2
descri
bes all these steps
)
.
The t
w
o
uni
t
s
of o
u
r
sy
st
em
have a com
p
l
e
m
e
nt
ar
y role. The first one generates the characters one
by
one
by
fo
r
m
i
ng t
h
e l
e
xi
cal
t
okens (
s
et
of w
o
rds se
par
a
t
e
d by
bl
an
k)
.
These t
oke
ns
are co
nsi
d
e
r
ed
as t
h
e
input of the se
cond un
it whic
h consists in the check
ing and correction in
order to
gene
ra
te an accepted word
by
t
h
e
use
r
.
Fi
gu
re
1
sh
ow
s t
h
e
use
o
f
t
w
o
t
y
pes
o
f
o
n
t
o
l
o
gi
es,
t
h
e
dom
ai
n o
n
t
o
l
o
gy
t
h
at
we
ha
ve c
r
eat
ed
t
o
m
odel
t
h
e t
e
xt
ual
st
r
u
ct
u
r
e o
f
a
doc
um
ent
and
w
h
i
c
h c
o
ns
ists to
sup
e
rv
ise th
e ch
ar
acter recognition
process
.
The seco
n
d
i
s
a l
e
xi
cal
ont
ol
ogy
cal
l
e
d
Wo
rd
Net
whi
c
h i
s
used as di
ct
i
ona
ry
t
o
veri
fy
t
h
e exi
s
t
e
nce of t
o
ke
n
i
n
t
h
e l
a
ng
ua
ge
v
o
cab
ul
ary
.
Fi
gu
re
1.
Sy
st
em
Overvi
e
w
4.
1. Proces
si
n
g
S
t
ep o
f
Ch
a
r
acter
Rec
o
gn
i
t
i
o
n
The st
ep
o
f
ch
aract
er rec
o
gni
t
i
on i
s
base
d
o
n
creat
i
o
n
of
ont
ol
o
g
y
.
It
de
scri
bes t
h
e v
o
cabul
a
r
y
of
p
r
o
cessed
do
cu
m
e
n
t
s an
d the relatio
nsh
i
p
s
th
at ex
ist b
e
t
w
een the
elements
of t
h
is
vocabula
r
y. T
h
is
step is
wel
l
desc
ri
be
d
i
n
Fi
g
u
r
e
2.
Evaluation Warning : The document was created with Spire.PDF for Python.
I
S
SN
:
2
088
-87
08
IJECE
Vol. 5, No. 6, D
ecem
ber
2015 :
1458 –
1467
1
462
Fi
gu
re
2.
P
r
oc
essi
ng
u
n
i
t
o
v
e
r
vi
e
w
It
i
s
com
posed
of
fo
ur st
a
g
es,
i
n
t
h
e be
gi
n
n
i
ng
we m
u
st
st
art
by
t
h
e ext
r
act
i
on
of
gra
phem
e
s an
d
spat
i
a
l
rel
a
t
i
onshi
ps
bet
w
e
e
n t
h
em
. Se
e Fi
g
u
re
3
t
o
ha
ve a
n
i
d
ea
o
n
di
ffe
re
nt
t
y
pe
s o
f
rel
a
t
i
ons
hi
pi
m
p
l
e
m
e
nt
ed i
n
"
P
rot
é
gé" e
n
vi
r
onm
ent
f
o
r
o
n
t
o
l
o
gi
es
devel
o
pm
ent
.
Fi
gu
re
3.
Tax
o
nom
y
of
gra
p
h
e
m
e
s and
rel
a
t
i
ons
hi
ps
In t
h
e case
of a m
a
nuscript
script, a
step of no
rm
alization
proves
ne
cessary for clustering a
nd
id
en
tificatio
n
o
f
g
r
aph
e
m
e
's
class. On
ce we id
en
tify th
ese ele
m
en
ts, we p
r
o
ceed
to
th
e in
stan
tiatio
n
of the
dom
ai
n ont
ol
o
g
y
i
n
o
r
der t
o
i
d
ent
i
f
y
t
h
e
gr
aphem
e
s whi
c
h are c
o
nnect
e
d
by
s
p
at
i
a
l
rel
a
t
i
ons
hi
ps.
Th
ese l
a
st
will d
e
v
e
l
o
p
a
ch
aracter th
at can
b
e
recogn
ised
b
y
app
l
yin
g
a ru
le
o
f
SW
R
L
typ
e
(see
ru
l
e
b
e
low).
Ex
am
p
l
e o
f
SWRL
ru
le t
o
iden
tify th
e letter ‘
L
’
Fig
u
re
4
.
Recog
n
ition
o
f
th
e l
e
tter
L
by
o
u
r
ont
ol
o
g
y
Tro
n
c
(
?
X
1)
an
d B
a
r
(
?
X
2
)
C
h
arL
(
?
X
)
an
d
pos
Let
t
e
r(i
)
Evaluation Warning : The document was created with Spire.PDF for Python.
I
J
ECE
I
S
SN
:
208
8-8
7
0
8
Ont
o
l
o
gi
es
an
d
Bi
gr
am
-
base
d
Ap
pr
oac
h f
o
r
Isol
at
ed N
o
n-w
o
rd
Err
o
rs
C
o
r
r
ect
i
on…
(
A
i
c
h
a
E
u
t
a
me
ne)
1
463
Fi
gu
re
4 s
h
o
w
s t
h
e i
n
st
a
n
t
i
a
t
i
o
n
o
f
dom
ai
n ont
ol
o
g
y
by
di
ffe
rent
gra
p
he
m
e
s fo
un
d i
n
t
h
e
doc
um
ent
;
it sh
ows also
t
h
e two
graph
e
mes an
d th
e
rel
a
tio
n
s
h
i
p
b
e
tween
th
em
wh
ich
form
s th
e lett
er
L
.
4.
2. Pos
t
-proc
e
ssing Step
: Word's
Rec
ognition
and Re
cons
truc
tion
Post-proces
sing is t
h
e last sta
g
e
of an OCR
syste
m
whose
purpose
is to detect and to c
o
rrect s
p
elling
erro
rs resu
lting fro
m
th
e in
ab
i
lity o
f
th
e syste
m
to
prop
erly reco
gn
ize t
h
e
ch
aracters.
Th
i
s
stag
e p
l
ays a ro
le
o
f
recogn
itio
n an
d
reco
n
s
t
r
uctio
n
of
words; it h
e
lp
s to
red
u
ce
sign
ifican
tly th
e erro
rs. Ho
wev
e
r, th
i
s
isn
'
t
co
m
p
letely sep
a
rate stag
e of t
h
e
p
r
eced
i
ng on
e,
bu
t it
plays
a c
o
m
p
le
m
e
ntary role.
It c
o
m
p
rises severa
l steps
whi
c
h ca
n
be
p
e
rf
orm
e
d o
n
t
h
e rec
o
g
n
i
zer
o
u
t
p
ut
,
very
o
f
t
e
n
di
ct
i
ona
ri
es
o
r
e
v
en
wo
rd'
s
l
e
xi
co
ns
are
use
d
t
o
i
m
p
r
ov
e th
e reco
gn
itio
n resu
l
t
b
y
u
s
ing
approp
riate sim
ila
rity
m
easu
r
es. In
t
h
is
p
a
p
e
r, we pro
p
o
s
e a n
e
w
app
r
oach f
o
r det
ect
i
ng an
d
cor
r
ect
i
on
of
m
i
sspel
l
e
d wo
rds
.
The algorith
m
that re
p
r
esen
ts th
is app
r
o
a
ch
com
p
rises seve
ral steps to be execute
d in order to co
rrect misspelled word. The fl
owch
art fo
r th
e algorith
m is
sho
w
n i
n
Fi
g
u
r
e
5 t
h
at
sum
m
a
ri
zes t
h
e di
f
f
e
r
ent
st
eps
of t
h
e pr
op
ose
d
al
g
o
ri
t
h
m
.
Each of t
h
ese c
o
m
pone
nt
s
will b
e
d
i
scu
ssed
in th
e
fo
llowing
section
s
.
Fig
u
r
e
5
.
Flowch
ar
t o
f
th
e post-
pr
o
cessing
pr
o
cess
4.
2.
1.
Checki
n
g
an
d
L
o
oki
n
g U
p
The i
n
put
o
f
t
h
i
s
st
ep i
s
a Toke
n, t
h
i
s
l
a
st
i
s
a st
ri
ng
previ
o
usl
y
generat
e
d by
t
h
e cha
r
act
e
r
reco
g
n
i
t
i
on pr
ocess. Fi
rst
l
y
,
we
m
u
st
check the existence
of this T
o
ken i
n
a standard
dictionary, i
n
our case
we prefe
rre
d t
o
use t
h
e online lexical ontol
ogy (WordN
et) as external re
source
because
it is open acce
ss and
reg
u
l
a
rl
y
up
dat
e
d.
Fi
g
u
re
6
s
h
ows
a
ja
va
pr
o
g
ram
t
o chec
k t
h
e e
x
i
s
t
e
nce
o
f
t
h
e
w
o
r
d
"
wri
t
e
" in
th
e
Word
Net
di
ct
i
ona
ry
.
Fi
gu
re
6.
Exa
m
pl
e of Java
p
r
o
g
ram
fo
r acc
ess an
d l
o
o
k
u
p
i
n
Wo
r
d
Net
Evaluation Warning : The document was created with Spire.PDF for Python.
I
S
SN
:
2
088
-87
08
IJECE
Vol. 5, No. 6, D
ecem
ber
2015 :
1458 –
1467
1
464
If t
h
e t
o
ken e
x
i
s
t
s
i
n
t
h
e di
ct
i
ona
ry
, s
o
i
t
be
l
o
n
g
s t
o
t
h
e v
o
cab
ul
ary
of t
h
e w
r
i
t
i
ng l
a
n
gua
ge
of t
h
e
d
o
c
u
m
en
t an
d
it will
b
e
co
n
s
i
d
ered
as a real word
well
recog
n
i
zed
b
y
ou
r
syste
m
an
d
an
in
sert op
eration
will
b
e
tri
g
g
e
red in o
r
d
e
r to in
sert it in
th
e
o
u
t
p
u
t
file.
Ot
h
e
rwise, if th
e t
o
k
e
n do
es
no
t
ex
ist in
t
h
e st
an
d
a
rd
di
ct
i
ona
ry
, we
pass t
o
a
n
ot
he
r
checki
ng
st
ep
pri
o
r t
o
i
t
s
c
o
r
r
ect
i
o
n
.
Thi
s
se
con
d
l
e
vel
of c
h
ecki
ng i
s
o
p
t
i
ona
l
an
d
it co
nsists to
v
e
rify th
e
ex
isten
ce
o
f
t
h
is To
ken
i
n
to
a sp
ecific lex
i
co
n
(o
r sp
ecifi
c d
o
m
ain
on
tolo
g
y
)
related to s
u
bject dealt in t
h
e docum
e
nt (e
g. m
e
dical,
che
m
ical, astronom
ica
l
, hist
orical, etc.).If the
toke
n
d
o
e
s no
t ex
ist
yet, so
it is co
n
s
id
ered
as a
n
on-real word
an
d
we
p
a
ss t
o
correct it in
o
r
d
e
r t
o
g
e
n
e
rate it
s
corres
ponding real-word.
Figure
7. Graphical interf
ace
for lookup i
n
WordNet
4.
2.
2.
Candid
a
tes
G
e
neration
In t
h
i
s
st
ep
, we
gene
rat
e
a set
of ca
ndi
dat
e
s f
o
r t
h
e m
i
sspel
l
e
d w
o
r
d
w
h
i
c
h
has bee
n
det
e
ct
ed d
u
ri
n
g
th
e ch
eck
i
ng
st
ep
.
In th
e
ou
tpu
t
,
we will have a set of
words g
e
n
e
rated
b
y
ap
p
l
ying
so
m
e
tran
sform
a
t
i
o
n
ru
les
(see table
3).
In
ord
e
r to faci
litate th
e d
e
tectio
n
an
d id
en
tificati
o
n
o
f
errors,
we
p
r
op
ose a n
e
w cod
i
ficatio
n
o
f
th
e
m
i
sspel
l
e
d wo
r
d
ba
sed
o
n
bi
g
r
am
code t
h
at
c
onsi
s
t
s
t
o
t
r
ansform
a wo
rd
to a p
a
ir
o
f
con
s
ecu
tiv
e letters with
a
space cha
r
acte
r
in the fi
rst and t
h
e last position. Ta
ble 1 shows a
n
exa
m
ple of a
m
i
s
s
pelled word a
nd its
bi
g
r
am
code, a
n
d
t
a
bl
e2
s
h
o
w
s t
h
e st
ruct
ure
of
suc
h
a
co
di
f
i
cat
i
on.
Tabl
e
1. B
i
gra
m
code
Cor
r
ect wor
d
M
i
sspelled wor
d
Bigr
am
code
learn
learii
(-l,
le,
ea,
ar,
ri,
ii,
i-)
Table 1 s
h
ows
that the word "learn" ha
sn'
t
been rec
o
gn
ized co
rrectly. In
t
h
is case th
e erro
r lies in
th
e
fal
s
e rec
o
g
n
i
t
i
on
of c
h
a
r
act
er
"n";
i
t
has bee
n
rec
o
gni
ze
d b
y
bot
h l
e
t
t
e
rs "
i
i
"
. Tabl
e 3 s
h
ows
som
e
cl
asses o
f
errors
that we have
treated.
Table 2. Data
s
t
ructure of
Bigram
’s
codi
fi
cat
i
on
o
f
m
i
sspel
led
wo
r
d
‘
lea
r
ii
’
Bigr
am
’
s
nu
m
b
er
Bigr
am
code
Absolute value of
differ
e
nce
between one bigr
am
Rank of the second letter
in the one bigr
am
1 -
l
15
12
2 le
7
5
3 ea
4
1
4 ar
16
18
5 r
i
9
8
6 ii
0
8
7 i-
18
27
Evaluation Warning : The document was created with Spire.PDF for Python.
I
J
ECE
I
S
SN
:
208
8-8
7
0
8
Ont
o
l
o
gi
es
an
d
Bi
gr
am
-
base
d
Ap
pr
oac
h f
o
r
Isol
at
ed N
o
n-w
o
rd
Err
o
rs
C
o
r
r
ect
i
on…
(
A
i
c
h
a
E
u
t
a
me
ne)
1
465
Tabl
e
3. M
a
ppi
ng
r
u
l
e
s
fo
r e
r
r
o
rs
co
rrect
i
o
n
4.
2.
2.
1. E
r
r
o
rs
De
tecti
o
n
Th
e cod
i
ficatio
n shown in
Tab
l
e 2 allo
ws
d
e
tectin
g
errors in
th
e m
i
ssp
elled
wo
rd
.
For ex
am
p
l
e, if
our system
detect a zero in t
h
e thir
d col
u
m
n
o
f
bi
gram
codi
fi
cat
i
on (
s
e
e
Table 2), it autom
a
tically r
ealizes
that this is a
n
e
r
r
o
r
of
one
of t
h
e types: aa
,
bb,
...
ii ... zz.
We can als
o
dete
ct som
e
other types
of e
r
rors.
4.
2.
2.
2. E
r
r
o
r
Identi
fi
c
a
ti
on
On
ly,
b
y
con
s
ultin
g
th
e co
rresp
ond
ing
v
a
l
u
e
in
th
e fo
urt
h
c
o
lum
n
that we
can
detect the
exact error.
Th
e ex
am
p
l
e o
f
Tab
l
e
2
sh
ows an
error of
th
is typ
e
. Here, is abo
u
t
th
e l
e
tter ‘i’ wh
ich
h
a
s th
e
rank
8
in
th
e
al
pha
bet
(se
e
T
a
bl
e 4
)
.
Tabl
e
4. R
a
n
k
of
l
e
t
t
e
rs i
n
t
h
e
al
pha
bet
1 2
3 …
…
8 …
…
26
27
a
B
c
….
….
i
…..
….
z
-(
Space
lette
r)
In
t
h
i
s
case
,
ou
r sy
st
em
gener
a
t
e
s a set
of ca
ndi
dat
e
o
f
t
h
e
m
i
sspel
l
e
d w
o
r
d
by
re
pl
aci
n
g
t
h
e er
ro
r
b
y
i
t
s
cor
r
esp
o
ndi
ng
r
u
l
e
(s
).
For t
h
e m
i
sspel
l
e
d wo
rd
‘l
ear
i
i
’
sho
w
n i
n
T
a
bl
e 2,
we ca
n gene
rat
e
f
o
u
r
c
a
ndi
dat
e
s by
a
ppl
y
i
n
g
t
h
e fi
r
s
t
fo
ur
r
u
les: (lea
rn
, l
earu, lea
r
o, learv).
4.
2.
3. C
a
ndi
d
a
tes
Sel
ecti
o
n
Aft
e
r
ge
nerat
i
ng t
h
e can
di
d
a
t
e
s, t
h
e sy
st
em
m
u
st
sel
e
ct onl
y
o
n
e o
f
t
h
em
,t
he one t
h
at
has
be
e
n
gene
rat
e
d
by
appl
y
i
n
g
t
h
e r
u
l
e
of hi
g
h
pri
o
ri
t
y
. For ex
am
ple, i
n
Tabl
e 3
,
t
h
e r
u
l
e
(i
i
n) has a hi
g
h
pri
o
rity com
p
ared to t
h
e
othe
r ru
les of t
h
e sa
m
e
type of error.
5.
CO
NCL
USI
O
N
AN
D F
U
T
U
RE W
O
R
K
In
t
h
i
s
pape
r, we prese
n
t
e
d
a
ne
w
a
p
p
r
oac
h
t
h
at
c
o
nsol
i
d
at
es
t
h
e w
o
r
k
of
a
n
OC
R
sy
st
em
,
t
h
i
s
ap
pro
ach is
based
on
two
main
step
s; the first
on
e con
s
ists to ch
aracter recog
n
ition
b
y
u
s
ing
a
d
o
m
ain
ont
ology which m
odels the docum
e
nt'
s
language
. The re
c
o
gnition is done
by inst
antiation and infere
nc
e of
th
e on
to
log
y
by u
s
in
g
typ
e
SWRL's ru
les.
Th
e seco
nd
step
co
m
p
letes th
e first on
e b
y
co
rrectin
g
t
h
e po
ssib
l
e
errors
obtaine
d
from
the previ
ous
step.
It takes as i
nput the
recognized c
h
a
r
acters
by
f
o
r
m
i
ng t
h
em
i
n
gr
o
ups
(tok
en or lex
i
cal ite
m
)
. It d
e
t
ects th
e ex
isten
ce
o
f
th
e
po
ssib
le errors
o
n
t
h
e
word
b
y
check
ing
its ex
isten
ce i
n
th
e d
i
ctio
n
a
ry o
r
in
th
e
do
m
a
in
lex
i
con
.
If
th
is is
n
o
t
th
e case, a co
rrect
io
n
op
eratio
n
will b
e
lau
n
c
hed
to
correct these
errors
by using a
bigram
code t
h
at
s
u
pports t
h
e detection a
n
d
correction of words.
In t
h
e f
u
t
u
re
wo
rk
, we e
n
vi
sage t
o
ge
nera
l
i
ze our
a
p
proach s
o
that it can
deal the
real-word by
tak
i
n
g
in
to acco
u
n
t
th
e sem
a
n
tic an
d
con
t
ex
t
u
al inform
atio
n
d
e
fi
n
e
d in
t
h
e on
to
log
y
.
ACKNOWLE
DGE
M
ENTS
Th
e au
tho
r
s wo
u
l
d
lik
e t
o
than
k th
e
gu
est
ed
ito
rs
an
da
no
ny
m
ous re
vi
ewers
f
o
r
t
h
ei
r
val
u
a
b
l
e
a
n
d
co
nstru
c
tiv
e commen
t
s.
Erro
r
Co
rrectio
n
ii n
ii u
ii o
ii v
rn
m
vv
w
cl d
iii m
b h
Evaluation Warning : The document was created with Spire.PDF for Python.
I
S
SN
:
2
088
-87
08
IJECE
Vol. 5, No. 6, D
ecem
ber
2015 :
1458 –
1467
1
466
REFERE
NC
ES
[1]
Bassil Youssef
and Alwani Mohamma
d, "OCR
post-processing error co
rrectio
n algorithm using google onlin
e
spelling
suggesti
on", 2012.
[2]
Olakanmi Olufemi Oladay
o, "O
ptical Char
acter
Recogni
tion of
Off-Line
Ty
ped and Handwritten Eng
lish Tex
t
Using Morphological and
Templa
te Match
i
ng Techniqu
es",
IA
E
S
Internationa
l
Journal of
Artif
icial
Inte
llig
enc
e
,
Vol. 3
,
No. 3, 20
14.
[3]
Zhuang Li Bao
,
Ta
Zhu Xiao
y
a
n,
et
al.
,
"A Chi
n
es
e OCR s
p
elli
ng check
appro
ach bas
e
d on s
t
atis
ti
ca
l langu
ag
e
models",
In: S
y
stems, Man
and
Cybernetics, 200
4 IE
EE In
ternational C
onference on. IEEE,
pp. 4
727-4732, 2004
.
[4]
MaedaJunji, Iizawa Taku
y
a
, IshizakaTohru,
et a
l
.,
"Segm
e
ntation
of natural
im
ages
using anisotr
opic diffusion and
linking o
f
bound
ar
y
edg
e
s",
Patt
e
r
n Recogn
ition
,
Vol. 31
, No. 12,
pp. 1993-1999
,
1998.
[5]
Pirinen Tommi
A., and Krister
Lindé
n
.
"State-o
f-the-ar
t
in wei
ghted finit
e
-sta
t
e
spell-ch
ecking
"
,
Computationa
l
Linguistics and
I
n
telligent Text
Pro
cessing. Sprin
ger Berlin Heidelberg,
pp
. 519-5
32, 2014
.
[6]
Karen Kukich
, "
T
echn
i
ques for
Automatic
al
ly
Corre
c
t
i
ng Words i
n
T
e
xt
",
ACM
Computing Sur
veys
,
V
o
l. 24
, N
o
.
4, pp
. 377-439
,
1992.
[7]
P.
Sa
ma
nta
,
et
al.,
"A s
i
m
p
le r
eal-word
error d
e
te
ction
and
cor
r
ection using lo
cal word b
i
gram and trigram",
In
Pr
oceed
ings
of
ROCL
ING,
2013.
[8]
Fossa
ti D.
,
et
al
.,
"A mix
e
d tr
ig
rams approach
f
o
r
context sensitive
spell checking",
In Computa
tional
Lingu
istics
and Intellig
ent
Text
Processing"
, Springer
Ber
lin
Heidelb
e
rg,
pp. 623-633,
2007
.
[9]
Agarwal,
et al.,
"Uti
li
zing
Big Data i
n
Identif
ica
tio
n and Correc
tion of OCR Errors",
UNLV
Theses/Dissertations/Pro
fessiona
l Pap
e
rs/Capstones,
pp
. 1914
, 2
013.
[10]
Tong X.,
et al.
, "A statistica
l
ap
proach to autom
a
ti
c OCR error correct
ion in cont
ext".
The fourth workshop on very
large corpora,
p
p
. 88-100
, 1996
.
[11]
Elm
i
M. A.
,
et
al
.,
"Spel
ling
corre
ctio
n
using contex
t",
T
h
e 36th
Annual Meeting of th
e Association for
Computational L
i
nguistics and
1
7
th International
Conferen
ce on
Computational L
i
nguistics-Vo
lume 1,
pp. 360-364,
1998.
[12]
Deorowicz Seba
stian, and Mar
c
i
n
G. Ciura, "Correct
ing spelling
errors b
y
m
odelling the
i
r caus
e
s".
Internationa
l
Journal of Applied Math
ematics
and Computer S
c
ien
c
e
, Vol. 15
,
No. 2, pp. 275-2
85, 2005
.
[13]
Sc
hie
r
le
,
et al.
,
"From spelling correction to tex
t
cleaning–using
context information",
In
Data A
nalysis, Machin
e
Learning and Ap
plications
, Sprin
g
er Ber
lin
Heidelberg, pp. 397-4
04, 2008
.
[14]
Bakkali Hamza,
et al.
, "For an
Independen
t
Spell-Ch
ecking S
y
st
em from the Arabic Language Vocabular
y
.
”
International Jo
urnal of
Advan
c
ed Com
puter S
c
ience and
Applications (
I
JACSA)
,
Vol. 5, No. 1
,
2
014.
[15]
Hicham Gueddah, "Introdu
ction
of the weigh
t
ed
ition
errors in
the Lev
e
nshtein
dis
t
ance", 2012
[16]
A.
R. Golding,
et al.
, "Com
bi
ning Tr
igram
-
based and
Fea
t
u
r
e-based M
e
tho
d
s for Contex
t
sensitive
Spell
i
ng
Correct
ion",
The 34th
Annual Meeting
of
the Associ
ation
for Com
putational Lingu
istics,
pp. 71-78, 1996.
[17]
Golding Andre
w
R.,
and Ro
th
Dan,
"A winno
w-based appro
a
ch to
cont
ext-se
nsitive sp
ell
i
ng
correc
tion".
Machi
n
e
learning
, Vol. 3
4
, No 1-3
,
pp
. 10
7-130, 1999
.
[18]
Hirst Graeme, and Budanitsk
y
Alex
ander
,
"Correcti
ng r
eal-word spelling errors b
y
restoring lexical
cohesion",
Natur
a
l Language
En
gineering
, Vol.
11, No 01
, pp
. 8
7
-111, 2005
.
[19]
Rizkha Emillia
N
y
oman, Su
y
a
n
t
o N.
F. N,
Maharani Warih,
"
I
solated word recognition using ergodic hidden
markov models
and gen
e
tic
algo
rithm",
Telkomnika,
Vol. 10
, No
.
1, pp
. 129-136
,
2012.
[20]
Pedler Jenn
ifer
,
"Using semantic associations fo
r
the detection of
real
-word spelling errors",
In
:
P
r
oceedings
fr
om
the Cor
pus
L
i
ng
uis
tics
Conf
er
en
ce S
e
r
i
es
, 2005.
[21]
Eutamene Aich
a, Kholladi Mo
hame
d Khireddine,
and B
e
lhad
ef Hacene
, "O
ntological Model For Ch
aracter
Recognition B
a
sed On Spatial R
e
lations",
S
i
gnal
&
Image Processing: An
International Journal (
S
IPIJ)
, V
o
l. 04
,
No. 03, pp. 113-
124, 2013
.
[22]
Belhad
ef Hacen
e, Kholl
a
di M
oham
e
d Khireddi
ne, and
Eu
tam
e
ne Aicha
,
"Ontolog
y
of Graph
e
m
e
s
for Lati
n
Charac
ter
Recog
n
ition",
Procedia Engin
eering
,
Vol. 24
, pp
. 579
-584, 2011
.
[23]
Eutamene Aicha, Belh
adef
Hacene,
and Khollad
i
Moha
med Khireddine, "New
process ontolog
y
-
bas
e
d ch
aracter
recognition",
In:
Metada
ta and
S
e
mantic Res
e
arch
. Springer
Ber
l
in Heidelberg
, pp
. 137-144
, 2011
.
BIOGRAP
HI
ES
OF AUTH
ORS
Eutamene Aicha was born in Co
nsta
ntine, Alger
i
a. In 2009, she
received th
e Master degree in
computer engin
e
ering from Mentouri university
o
f
C
onstantine
.
Actua
l
l
y
she is Ph.D. student in
NTIC facult
y in
the Constantine
2 Universit
y
in
Algeria. Her re
s
earch int
e
res
t
s
includ
e P
a
ttern
recognition, dig
i
tal image proces
sing and Onto
lo
gies.
Evaluation Warning : The document was created with Spire.PDF for Python.
I
J
ECE
I
S
SN
:
208
8-8
7
0
8
Ont
o
l
o
gi
es
an
d
Bi
gr
am
-
base
d
Ap
pr
oac
h f
o
r
Isol
at
ed N
o
n-w
o
rd
Err
o
rs
C
o
r
r
ect
i
on…
(
A
i
c
h
a
E
u
t
a
me
ne)
1
467
Mohamed Khireddine Kholla
di
was born in Con
s
tantine, Alg
e
ri
a. In 1991
, he received
the Ph.D
degree
in computer eng
i
neer
ing f
r
om Claud Bern
ar
d University
o
f
Ly
on
-Fra
nce. Actually
,
he is
a P
r
ofes
s
o
r at the com
puter s
c
i
e
nce dep
a
rtm
e
n
t
at the Univ
er
s
i
t
y
E
l
Ouad in
Algeria
.
His
res
earch
in
teres
t
s
includ
e Geogr
a
phica
l Inform
a
tion S
y
stem, Complex s
y
st
em, image processing.
Belahd
ef Hacen
e
was born in
Constantine, Alge
ria. In 2010
he receiv
e
d th
e Ph.D degree in
computer scien
c
e from the Men
t
ouri University
of Constantin
e
and in 2013 h
e
received HDR
degree (
L
e
c
turer
)
from
the NTIC facul
t
y
in th
e Univ
ersit
y
of Con
s
tantine 2
.
Actu
all
y
,
he is a
n
as
s
o
ciat
e prof
es
s
o
r at
th
e s
a
m
e
u
n
ivers
i
t
y
.
His
r
e
s
earch
int
e
res
t
s
i
n
clude
Ontologi
es
fie
l
d,
Im
age
Processing, Arti
f
i
ci
al
inte
llig
enc
e
and D
a
ta
Minin
g
.
Evaluation Warning : The document was created with Spire.PDF for Python.