TELKOMNIKA Indonesian Journal of Electrical Engineering
Vol.12, No.6, June 2014, pp. 4579 ~ 4588
DOI: 10.11591/telkomnika.v12i6.5440
ISSN: 2302-4046

Received December 28, 2013; Revised February 27, 2014; Accepted March 12, 2014

A Preprocessing and Analyzing Method of Images in PDF Documents for Mathematical Expression Retrieval

Xuedong Tian*, Botao Yu, Jing Sun
College of Mathematics and Computer, Hebei University, Baoding, Hebei, China
*Corresponding author, e-mail: txdinfo@yahoo.com

Abstract
PDF documents are important information resources for a mathematical expression retrieval system. As a major component of PDF documents, image objects must first be converted to coded form with the help of character recognition and document analysis technology before content-based searching is possible. The quality of these images is therefore the key factor that determines the correctness of the conversion. Considering the characteristics of PDF images and mathematical expressions, a preprocessing and analyzing method is proposed which includes the modules of PDF image extraction, graying, binarization, denoising, skew correction and layout parameter detection. The features of mathematical expressions are taken into account to avoid information loss during image conversion and to limit the adverse interference that formulas cause in the analysis and correction processes. The experimental results show that the method is effective in improving the accuracy and efficiency of document image recognition, analysis and retrieval.

Keywords: PDF image, preprocessing, analysis, resolution

Copyright © 2014 Institute of Advanced Engineering and Science. All rights reserved.

1. Introduction
PDF documents [1] are now widely used in many fields, such as information storage, transmission and exchange. How to use PDF documents efficiently and conveniently has become a hot topic in the field of document analysis and processing. Although a variety of techniques for analyzing, processing and applying PDF documents, such as content editing and information extraction, have been developed, many problems in related topics remain unsolved because of the complexity and diversity of the PDF structure.
In a PDF document, information exists in many forms, including character codes and font boxes, images and graphs, and these forms provide the clues for extracting information from the document. Overall, the methods of extracting information from PDF documents can be divided into three categories [2-5]. Methods of the first kind are based only on the contents of the PDF file, including character codes and font boxes. By extracting the character codes, the character fonts and their boxes, the content of a PDF document can be obtained. The limitation of this strategy is that the geometric information of characters contained in a PDF document is coarse rather than the exact coordinates of each character's bounding box, which makes it difficult to extract information from PDF documents on the basis of geometrical features. The second strategy obtains information from the images converted from PDF documents. The effect of these methods depends on the correctness of layout analysis and character recognition, through which the contents and their relationships are determined.
The generally accepted strategy for PDF information extraction is a hybrid method that combines character content features with the corresponding image features to ensure the accuracy and efficiency of the information obtained. Baker et al. [2] proposed a method of extracting mathematical expressions from PDF documents. Its characteristic is to integrate the content features of characters and the geometrical features of the corresponding layout images when locating mathematical symbols in documents. The content of the PDF document, including the font bounding boxes and the geometric positions of characters, is first extracted from the PDF source. The actual bounding boxes of the characters are then obtained from the images of the corresponding font boxes. The method can thus obtain both the content information and the geometrical information of PDF documents. In [3], Baker et al. further discussed the extraction and analysis of mathematical formulas in PDF documents with the help of syntax features. First, the bounding box information of the PDF document (PDFBBs) is obtained by decompressing the document with the open source Java software Multivalent, and the true glyph bounding boxes (GBBs) are located with the help of the corresponding images. Then, Anderson's algorithm is employed and improved to turn the two-dimensional mathematical formula into a linear representation. Finally, the linear expression is analyzed and the result is represented as a formula syntax tree, from which a LaTeX formula consistent with the original expression can easily be generated.
Lin et al. [4] proposed a method of identifying mathematical expressions in PDF documents. The features used in formula extraction are divided into three layers: geometric layout, character and context content. In the first step, called preprocessing, mathematical expression elements are obtained from the original PDF symbols through a series of processing steps based on knowledge of PDF symbols, and the text lines are detected. Then, the isolated formulas are extracted with a hybrid method that combines rules and an SVM classifier based on geometric, character and context features. Finally, the embedded formulas are located with a rule-based algorithm and character features. In [5], the extraction of embedded formulas from PDF documents is further discussed. The method segments text lines into words. Unlike [4], these words are then divided into two classes, formulas and text, with an SVM classifier instead of the previous rule-based method. The words identified as components of formulas are merged into complete formulas.
In a mathematical retrieval system, the information in PDF documents, whether it exists in coded form or in image form, should be treated as content flows. The image objects contained in PDF documents (simply called PDF images) must therefore first be recognized and analyzed with OCR (Optical Character Recognition) technology, which lays the foundation for the subsequent indexing and retrieval process. However, ordinary OCR systems generally process high-quality document images generated by scanners at a default resolution, such as 300 dpi for character recognition or 600 dpi for mathematical formula recognition. When such OCR systems are used to recognize PDF images, the recognition rate decreases because the quality of these images, for example their resolution, is varied and uncertain. The correctness of a formula recognition and analysis system is influenced even more seriously by the variation in image resolution. This is because formulas may contain special symbols that consist of very small or few strokes, and formulas frequently express their computational meaning implicitly through the spatial arrangement of symbols. These characteristics make the performance of a formula recognition and analysis system sensitive to the quality of document images. It is therefore necessary, before recognition, to process the images to a higher quality and to analyze them to obtain parameters that guide the subsequent modules.
Although, in theory, all of the methods and algorithms of document image processing, analysis and recognition could be employed for the images converted from PDF documents, many unsolved problems exist in every step of a PDF image processing system because of the characteristics of PDF images and mathematical expressions.
For image extraction from PDF documents, Chen et al. [6] proposed a method to extract the images in PDF documents. Starting from an introduction to the PDF format, they designed a scheme for obtaining the image data and decoding it into normal data. The obtained images are saved as JPG files. Wang et al. [7] designed an algorithm for extracting the images in PDF documents that specifically considers the requirements of OCR systems on the image format. Li and Liu [8] put forward a PDF reader with a strategy of locating key information and ignoring secondary information. Their experimental results show that the proposed method can extract and display the information in PDF files accurately.
In the field of document analysis and recognition, much research has been done on document image preprocessing and analysis [9-20].
The methods mentioned above lay the foundation of our work. In this paper, a preprocessing and analyzing method for PDF images is designed to improve the quality of PDF images and the performance of the subsequent recognition and retrieval process. It includes the modules of PDF image extraction, graying, binarization, denoising, skew correction and layout parameter detection, all of which take the characteristics of PDF images into account. Throughout the process, the features of mathematical expressions are considered in order to avoid information loss during image conversion and to limit the adverse interference that formulas cause in the analysis and correction processes. The experimental results show that the method is effective in improving the accuracy and efficiency of document image recognition, analysis and retrieval.
The paper is organized as follows. The second part gives the procedure of the PDF image preprocessing and analyzing method. The third part discusses the design of the modules of the system. The last part presents the experimental results and the conclusions of the paper.

2. Description of the Preprocessing and Analyzing Method of PDF Images
The images are first extracted from PDF documents by a PDF document analysis module. Their quality is irregular. Degradations such as noise, skew and low resolution occur in the extracted images because the images are collected in different ways. These images therefore need to be preprocessed and analyzed before recognition and retrieval. The structure diagram of the process is shown in Figure 1.

Figure 1. Structure Diagram of PDF Image Preprocessing

The images in PDF documents may exist in various modes, such as colour images, gray images and binary images. For the sake of efficient recognition, analysis and retrieval, all extracted PDF images are transformed into binary images.
PDF images may come from different sources, and many of them contain a large number of noise pixels. If these noise pixels are wrongly treated as normal image pixels, the spatial relationships of characters are disrupted and erroneous results are produced. This happens more frequently in the symbol recognition and structural analysis modules of a mathematical formula recognition system. Because formula symbols come from several character sets, they include more similar symbols that are hard to distinguish from each other than normal text does. When noise pixels lie in a key area, errors are inevitable. It is therefore necessary to design an algorithm that deletes the noise pixels from PDF images while preventing normal pixels from being wrongly erased, with the help of the relevant spatial, syntactic and semantic knowledge.
When a document image consists mainly of text, skew is very harmful because it distorts the distribution features of characters on the layout, which leads to errors in the recovered logical structure. With the help of layout knowledge, we can detect the skew angle of a document image and rotate the image by the corresponding angle to obtain a corrected image.
Unlike the images handled by an ordinary OCR system, the images in PDF documents vary greatly in quality. In particular, their resolutions are not unified to a default value because the scanning operations differ, which influences the correctness of the whole system. Although various interpolation algorithms from digital image processing can improve the visual effect of the images, they cannot substantially increase the recognition rate. It is therefore necessary to measure the resolution of the images roughly, so that the subsequent steps of document recognition and analysis can adjust their parameters to fit the variations. This can be considered a special stage of the mathematical formula recognition used in mathematical expression retrieval.

3. Implementation of the PDF Image Preprocessing and Analyzing Method
To meet the needs of mathematical expression recognition and retrieval, a solution for PDF image preprocessing and analysis is designed that considers both the characteristics of the images in PDF documents and the features of formulas. Note that the preprocessing algorithm should take targeted measures according to the actual state of each PDF image, so as to meet the needs of recognition and retrieval, rather than apply all modules to every image indiscriminately.

3.1. Extraction of Images from PDF Documents
Images are saved as objects in PDF documents. They are first extracted according to the PDF document specification [1], [6-8]:
Step 1. Obtain the offset of the cross-reference table and the object number of the Catalog from the trailer of the PDF file.
Step 2. Get the content of the cross-reference table.
Step 3. Get the content of the Catalog object and of all Page objects by searching the Page tree. Save all the Page objects in a stack.
Step 4. If the stack is empty, end; otherwise, pop the Page object at the top of the stack and read its XObject information. If an XObject exists, go to Step 5; otherwise, go to Step 4.
Step 5. If the subtype of the XObject is Image, go to Step 6; otherwise, go to Step 4.
Step 6. Obtain the related information by reading the content of the XObject and extract the image. Decode the pixel data in the image to generate a normal image.
Step 7. Convert the image, whatever its form, into BMP format with the same bit depth. Go to Step 4.
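
The traversal in Steps 3 to 5 can be illustrated on a toy object graph. The sketch below is only an illustration under stated assumptions: it supposes the PDF has already been parsed into plain Python dictionaries, so `find_image_xobjects`, the `objects` mapping and its key layout are hypothetical stand-ins for a real cross-reference table and parser, and the decoding and BMP conversion of Steps 6 and 7 are left out.

```python
def find_image_xobjects(objects, catalog_id):
    """Walk the page tree from the Catalog and return ids of Image XObjects.

    `objects` is a hypothetical dict {object_id: dict} standing in for the
    objects reachable through the cross-reference table (Steps 1 and 2).
    """
    # Step 3: collect every Page object by walking the Pages tree.
    page_stack = []
    pending = [objects[catalog_id]["Pages"]]
    while pending:
        node = objects[pending.pop()]
        if node["Type"] == "Pages":
            pending.extend(node["Kids"])
        elif node["Type"] == "Page":
            page_stack.append(node)

    images = []
    # Step 4: pop pages until the stack is empty.
    while page_stack:
        page = page_stack.pop()
        xobjects = page.get("Resources", {}).get("XObject", {})
        for ref in xobjects.values():
            # Step 5: keep only XObjects whose subtype is Image.
            if objects[ref].get("Subtype") == "Image":
                images.append(ref)  # Steps 6-7 (decode, convert) would go here.
    return sorted(images)


# Toy document: two pages, one image XObject and one form XObject.
toy = {
    1: {"Type": "Catalog", "Pages": 2},
    2: {"Type": "Pages", "Kids": [3, 4]},
    3: {"Type": "Page", "Resources": {"XObject": {"Im0": 5, "Fm0": 6}}},
    4: {"Type": "Page", "Resources": {}},
    5: {"Type": "XObject", "Subtype": "Image", "Width": 8, "Height": 8},
    6: {"Type": "XObject", "Subtype": "Form"},
}
print(find_image_xobjects(toy, 1))
```

In the toy document only object 5 is an Image XObject, so it is the only id returned; the form XObject and the page without resources are skipped.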

3.2. Binarization and Denoising of PDF Images
When a PDF image is a colour image, it should first be transformed into a gray image. Graying algorithms for colour images are fully discussed in the related literature, so they are not discussed in this paper.
Binarization [9, 10] transforms a gray image, in which each pixel is stored with several bits, into an image whose pixels take only two values, black and white, each stored with a single bit. Let f(i, j) be the pixel value of the image at point (i, j); then:

\[ b(i,j) = \begin{cases} 0, & f(i,j) < \theta \\ 1, & f(i,j) \ge \theta \end{cases} \]

where b(i, j) is the pixel value of point (i, j) in the binarized image and θ is the binarization threshold, which lies between the minimum and maximum gray values of the image.
Binarization methods for gray images can be divided into two categories: global threshold methods and local threshold methods. The former define a single threshold θ for the binarization of the whole image. They are simple to implement and run fast, but they cannot handle images with an inhomogeneous distribution of gray values. The latter are more robust because they can use a different threshold θ for each pixel according to its neighbourhood, at the cost of a higher computational complexity. Considering the diversity of PDF images, we employ the local threshold method discussed in [11] as the binarization strategy.
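
The difference between the two categories can be seen in a small sketch. The code below is an assumption-laden illustration, not the method of [11]: the local rule used here (window mean minus a fixed `offset`) is a common stand-in chosen for brevity, and the function names, `radius` and `offset` are hypothetical.

```python
def binarize_global(img, theta):
    """b(i, j) = 0 if f(i, j) < theta else 1, with one theta for the whole image."""
    return [[0 if v < theta else 1 for v in row] for row in img]


def binarize_local_mean(img, radius=1, offset=0):
    """Local thresholding (assumed rule): theta is the mean of the
    (2*radius+1)^2 window around each pixel, clipped at the borders,
    minus `offset`."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [img[y][x]
                      for y in range(max(0, i - radius), min(h, i + radius + 1))
                      for x in range(max(0, j - radius), min(w, j + radius + 1))]
            theta = sum(window) / len(window) - offset
            row.append(0 if img[i][j] < theta else 1)
        out.append(row)
    return out


# One scan line with a bright half (background 200, stroke 120)
# and a dim half (background 100, stroke 20).
line = [[200, 200, 120, 200, 200, 100, 100, 20, 100, 100]]
print(binarize_global(line, 128))
print(binarize_local_mean(line, radius=1, offset=40))
```

On this scan line the global threshold turns the whole dim half black, while the local rule keeps the dim background white and still marks both strokes, which is the behaviour the paragraph above attributes to local methods.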
Denoising layout images that contain mathematical expressions is more complex than denoising normal text. Some special symbols must be considered so that components of mathematical symbols are not deleted by mistake.
Step 1. Obtain all connected components of the layout image using a connected component search algorithm.
Step 2. Identify the candidate noise components according to the geometric features obtained from the statistical histogram of component sizes.
Step 3. Remove from the candidate noise components those that appear to be parts of formula symbols, such as the dot of the symbols "i" and "j", with the help of knowledge about the geometric composition of formula symbols [12].
Step 4. Delete the remaining candidate noise components.
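
A minimal sketch of the four denoising steps follows. It rests on several assumptions that are not taken from the paper: ink pixels are encoded as 1, the candidate test is a fixed area limit `max_noise_area` instead of a value read from the size histogram, and `looks_like_dot` is a hypothetical geometric rule standing in for the composition knowledge of [12].

```python
from collections import deque


def connected_components(img):
    """Return a list of components, each a list of (row, col) ink pixels (8-connected)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if img[r][c] != 1 or seen[r][c]:
                continue
            seen[r][c] = True
            queue, comp = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                comp.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            comps.append(comp)
    return comps


def bbox(comp):
    """Bounding box as (top, left, bottom, right)."""
    ys = [p[0] for p in comp]
    xs = [p[1] for p in comp]
    return min(ys), min(xs), max(ys), max(xs)


def looks_like_dot(small, comps, max_gap=2):
    """Hypothetical rule: keep a small component if a larger stem starts
    at most `max_gap` blank rows below it and overlaps it horizontally."""
    top, left, bottom, right = bbox(small)
    for other in comps:
        if len(other) <= len(small):
            continue
        o_top, o_left, o_bottom, o_right = bbox(other)
        if 0 < o_top - bottom <= max_gap + 1 and o_left <= right and left <= o_right:
            return True
    return False


def denoise(img, max_noise_area=1):
    comps = connected_components(img)                                # Step 1
    candidates = [c for c in comps if len(c) <= max_noise_area]      # Step 2
    noise = [c for c in candidates if not looks_like_dot(c, comps)]  # Step 3
    out = [row[:] for row in img]
    for comp in noise:                                               # Step 4
        for y, x in comp:
            out[y][x] = 0
    return out


# A tiny "i" (dot above a stem) plus an isolated speck in the last column.
page = [
    [0, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
]
for row in denoise(page):
    print(row)
```

Both single-pixel components are candidates, but only the speck is erased, because the dot sits directly above a larger stem.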

3.3. Skew Detection and Correction of Document Images in PDF Files
The core of skew correction for document images is the detection of the skew of the layout image. Skew detection strategies can be divided into two classes, top-down methods and bottom-up methods [13-15]. The typical top-down method is the projection algorithm. The image is projected at several different angles to obtain the distribution of pixels, and the angle at which the projection histogram shows the largest amount of white space is identified as the skew angle of the layout image. This method is simple to implement, but it is only suitable for detecting the skew of simple layout images.
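
The projection idea can be sketched on synthetic data. The code below is a simplified illustration, assuming the ink pixels are given as (x, y) coordinates and that "white space" is measured as the number of empty rows in the horizontal projection; the function names and the synthetic page generator are hypothetical.

```python
import math


def white_rows(points, angle_deg):
    """Rotate ink points by -angle and count empty rows in the horizontal projection."""
    a = math.radians(angle_deg)
    rows = {round(-x * math.sin(a) + y * math.cos(a)) for x, y in points}
    return (max(rows) - min(rows) + 1) - len(rows)


def estimate_skew_by_projection(points, candidates):
    """Return the candidate angle whose projection has the most white rows."""
    return max(candidates, key=lambda ang: white_rows(points, ang))


def make_skewed_lines(angle_deg, n_lines=3, spacing=10, length=100):
    """Synthetic page: n_lines horizontal text lines rotated by angle_deg."""
    t = math.radians(angle_deg)
    pts = []
    for k in range(n_lines):
        y = k * spacing
        for x in range(length):
            pts.append((x * math.cos(t) - y * math.sin(t),
                        x * math.sin(t) + y * math.cos(t)))
    return pts


points = make_skewed_lines(5)  # three text lines skewed by 5 degrees
print(estimate_skew_by_projection(points, range(-10, 11)))
```

When the candidate angle matches the true skew, each text line collapses into a single projection row and the gaps between lines are widest; at other angles the lines smear over several rows. A page with formulas, figures or columns breaks this clean picture, which is why the method is limited to simple layouts.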
A bottom-up method first searches for layout components, such as connected areas, and then detects the skew angle of the layout while combining these components. For example, in a Hough transform based skew detection method [13-15], the layout components are used as the sampling points of the Hough transform to obtain the skew angle of the layout. This method can process any layout image as long as the layout contains enough characters for the Hough transform. Its disadvantage is that it is easily disturbed when a layout image contains linear elements, such as mathematical symbols, which the Hough transform may wrongly pick up as skew detection objects.
In this paper, the Hough transform based skew detection method [13-15] is employed and improved to detect the skew angle of document images in PDF files. Mathematical symbols are analyzed separately to avoid wrong skew detection.
Step 1. Obtain all connected components of the layout image using a connected component search algorithm.
Step 2. Combine the connected components into character boxes according to distance thresholds between character components, characters, text lines and paragraphs. These thresholds are acquired from the statistical histogram of the distances between connected components on the layout.
Step 3. Select effective sample points for the Hough transform with the help of symbol features in mathematical formulas and character features in normal text.
Step 4. Calculate the sample points of the Hough transform in every symbol box according to the sampling rules. Perform the Hough transform to get the skew angle.
Step 5. Rotate the image by the detected angle.
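
Steps 3 and 4 can be sketched as follows. This is a rough illustration under assumptions that do not come from the paper: each character box contributes one sample point at its bottom centre, boxes much wider than tall are skipped as likely fraction bars or rules (the `max_aspect` filter is hypothetical), and the Hough accumulator is a plain counter over rounded distances.

```python
import math
from collections import Counter


def select_sample_points(boxes, max_aspect=4.0):
    """One sample per character box (bottom centre). Boxes much wider than
    tall are skipped as likely fraction bars or rules (an assumed filter)."""
    samples = []
    for left, top, right, bottom in boxes:
        width, height = right - left, bottom - top
        if height > 0 and width / height > max_aspect:
            continue
        samples.append(((left + right) / 2.0, bottom))
    return samples


def hough_skew(samples, candidates, rho_step=1.0):
    """Vote in (angle, rho) space and return the angle of the strongest line."""
    best_angle, best_votes = None, -1
    for ang in candidates:
        t = math.radians(ang)
        votes = Counter(round((y * math.cos(t) - x * math.sin(t)) / rho_step)
                        for x, y in samples)
        top = max(votes.values())
        if top > best_votes:
            best_angle, best_votes = ang, top
    return best_angle


# Twenty character boxes on a baseline skewed by 2 degrees, plus one long
# horizontal bar (for example a fraction line) that should not vote.
slope = math.tan(math.radians(2))
boxes = [(10 * k - 3, 10 * k * slope - 8, 10 * k + 3, 10 * k * slope)
         for k in range(20)]
boxes.append((50, 40, 150, 42))

samples = select_sample_points(boxes)
print(len(samples), hough_skew(samples, range(-5, 6)))
```

Filtering before voting is the design point: the long bar would otherwise contribute points along its own direction, which is the kind of disturbance from linear formula elements described above.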

3.4. Detection and Estimation of the Layout Parameters of Document Images in PDF Files
Let P(L, S, N) be a three-tuple of the layout parameters of document images in PDF files, in which L is the language of the layout (limited to Chinese or English), S is the statistical size of the characters in the layout, which is correlated with the resolution of the layout image, and N is the number of characters used for obtaining the parameter S. P(L, S, N) is passed to the symbol recognition and structure analysis modules as the reference for parameter selection.
Parameter L is the precondition for the resolution analysis of PDF images because layouts in different languages have different character size features. In this paper, a method based on the running-number feature of character strokes is explored to identify the language of layouts [16-17].
The character size S is related to the language of the layout. Table 1 and Table 2 show the average actual size of printed Chinese and English characters in different type sizes, measured by a connected component search algorithm, from which the difference in size between the two kinds of layouts can be seen. S should therefore be identified by statistical means, considering not only the language of the layout but also the influence of non-text symbols. In our method, the histogram method [18-20] is employed to estimate the character size. Figure 2 and Figure 3 show that the parameter S can be detected by a suitable calculation on the data of the corresponding histogram.

Table 1. The Average Height and Width of Printed Chinese Characters in Various Font Sizes
No.   Type size (Hao)   Character number   Actual height (pixels)   Actual width (pixels)
1     0                 638                127                      119
2     2                 1837               70                       63
3     4                 4591               46                       43
4     6                 4354               26                       24
5     8                 3864               18                       18

Table 2. The Average Height and Width of Printed English Characters in Various Font Sizes
No.   Type size (pt)    Character number   Actual height (pixels)   Actual width (pixels)
1     42                750                88                       71
2     22                2497               48                       39
3     14                6196               31                       23
4     7.5               5905               18                       13
5     5                 3875               13                       14

[Figure 2: histogram of character quantity against character height]
Figure 2. Histogram of a Page of Printed Chinese Characters (Song, 6 Hao)

[Figure 3: histogram of character quantity against character height]
Figure 3. Histogram of a Page of Printed English Characters (Times New Roman, 14 pt)

Step 1. Obtain all connected components of the layout image using a connected component search algorithm.
Step 2. Combine the connected components into character boxes according to distance thresholds between character components, characters, text lines and paragraphs. These thresholds are acquired from the statistical histogram of the distances between connected components on the layout.
Step 3. Establish the statistical histogram of the sizes of the connected components in the layout image.
Step 4. Analyze the statistical histogram of the connected components in the layout image and obtain the parameter S.
For Chinese characters, let P be the character height, in pixels, at which the statistical histogram reaches its maximum. The significant character heights H in a Chinese layout are denoted as:

\[ H = (h_1, h_2, \ldots, h_n), \quad h_{i+1} > h_i, \quad h_n \le P + \Delta \tag{1} \]

where the lower limit on h_1 is set from P through θ and the constant 0.5, and θ and ∆ are parameters decided by experiment.
English characters can be classified into two types according to character height. Type 1 consists of the shorter letters such as a, c and e. Type 2 consists of the taller letters such as b, A and f. Let P1 be the character height, in pixels, at which the statistical histogram reaches its maximum for Type 1, and P2 the corresponding height for Type 2. From Figure 4 we obtain P1 = 28 and P2 = 42.

[Figure 4: histogram of character quantity against significant height of English characters]
Figure 4. Histogram of Character Heights of an English Layout

The significant character heights H in English layouts are denoted as:

\[ H = (h_1, h_2, \ldots, h_n), \quad h_{i+1} > h_i, \quad h_1 \ge P_1 - \Delta_1, \quad h_n \le P_2 + \Delta_2 \tag{2} \]

where ∆1 and ∆2 are parameters decided by experiment.
The remaining parameters are defined in the same way for Chinese and English characters, as follows.
The numbers of characters with the heights in H are defined as C_h in Equation (3).

\[ C_h = (c_1, c_2, \ldots, c_n) \tag{3} \]

The character widths, in pixels, corresponding to the characters in H are defined as W in Equation (4).

\[ W = (w_1, w_2, \ldots, w_n) \tag{4} \]

The numbers of characters with the widths in W are defined as C_w in Equation (5).

\[ C_w = (c_1, c_2, \ldots, c_n) \tag{5} \]

The average character height H_avg and the average character width W_avg are given by Equation (6) and Equation (7).
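
\[ H_{avg} = \frac{\sum_{i=1}^{n} h_i c_i}{\sum_{i=1}^{n} c_i} \tag{6} \]

\[ W_{avg} = \frac{\sum_{i=1}^{n} w_i c_i}{\sum_{i=1}^{n} c_i} \tag{7} \]

In Equation (6) the counts c_i are the elements of C_h, and in Equation (7) they are the elements of C_w. The parameter S is then given by Equation (8).

\[ S = (H_{avg}, W_{avg}) \tag{8} \]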
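
The weighted averages of Equations (6) to (8) can be computed directly from the histograms. The sketch below is illustrative only: the histograms are made-up numbers, and the selection window [P - delta, P + delta] around the histogram peak is an assumed simplification standing in for the experimentally tuned bounds on the significant sizes; `estimate_S`, `significant_bins` and `delta` are hypothetical names.

```python
def weighted_mean(values, counts):
    """Equations (6) and (7): sum(v_i * c_i) / sum(c_i)."""
    return sum(v * c for v, c in zip(values, counts)) / sum(counts)


def significant_bins(hist, low, high):
    """Keep histogram bins whose size lies in [low, high], in increasing order."""
    sizes = sorted(s for s in hist if low <= s <= high)
    return sizes, [hist[s] for s in sizes]


def estimate_S(height_hist, width_hist, delta=2):
    """Return S = (H_avg, W_avg) computed around the histogram peaks.

    The window [P - delta, P + delta] is an assumed stand-in for the
    experimentally tuned bounds on the significant sizes."""
    p_h = max(height_hist, key=height_hist.get)  # peak height P
    p_w = max(width_hist, key=width_hist.get)    # peak width
    H, C_h = significant_bins(height_hist, p_h - delta, p_h + delta)
    W, C_w = significant_bins(width_hist, p_w - delta, p_w + delta)
    return weighted_mean(H, C_h), weighted_mean(W, C_w)


# Made-up histograms {size in pixels: number of components}; the small
# sizes play the role of punctuation or noise and fall outside the window.
height_hist = {3: 40, 36: 100, 38: 300, 40: 100}
width_hist = {2: 30, 36: 100, 38: 400}
print(estimate_S(height_hist, width_hist))
```

For these made-up histograms the height average is (36*100 + 38*300 + 40*100) / 500 = 38.0 and the width average is (36*100 + 38*400) / 500 = 37.6, so the small components do not pull S down.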

4. Experimental Results and Analysis
We developed a PDF image preprocessing and analyzing system based on the proposed method in the Visual C++ development environment. The sample PDF documents, in Chinese and English, were collected from the Internet. Images extracted from the Chinese and English PDF documents are shown in Figure 5 and Figure 6.

Figure 5. Extracted Image of Chinese Layout

Figure 6. Extracted Image of English Layout

The histograms of significant character heights of the images in Figure 5 and Figure 6 are shown in Figure 7 and Figure 8 respectively.

[Figure 7: histogram of character quantity against character height]
Figure 7. Histogram of Image in Figure 5

[Figure 8: histogram of character quantity against character height]
Figure 8. Histogram of Image in Figure 6

The estimation result of S is shown in Table 3.

Table 3. Estimation Results of S in Figure 5 and Figure 6
Layout type   Character number   Character height (pixels)   Character width (pixels)
Chinese       1211               38                          38
English       3745               23                          17


5. Conclusion
In this paper, a scheme for analyzing and preprocessing PDF images for mathematical expression retrieval is put forward. It includes the modules of PDF image extraction, graying, binarization, denoising, skew correction and layout parameter detection. In particular, special strategies are designed to fit the characteristics of mathematical components. The experimental results show the effectiveness of the proposed method.
Although the proposed method can process normally printed images in PDF documents, it still has many shortcomings because of the variation in layout styles and image collection modes. Future work is to improve the robustness of the scheme so that it can process more kinds of layout images.

Acknowledgements
This work is supported by the National Natural Science Foundation of China (Grant No. 61375075) and the Natural Science Foundation of Hebei Province (Grant No. F2012201020).

References
[1] Adobe Systems Incorporated. PDF Reference Version 1.7. 2006.
[2] Baker J, Sexton AP, Sorge V. Extracting Precise Data on the Mathematics Content of PDF Documents. DML 2008: Towards Digital Mathematics Library. Birmingham. 2008: 75-79.
[3] Baker J, Sexton AP, Sorge V. A Linear Grammar Approach to Mathematical Formula Recognition from PDF. ICM 2009: Intelligent Computer Mathematics. Berlin. 2009: 201-216.
[4] Lin XY, Gao C, Tang Z. Mathematical Formula Identification in PDF Documents. Proceeding of International Conference on Document Analysis and Recognition. Beijing. 2011: 1419-1423.
[5] Lin XY, Gao LC, Tang Z. Identification of Embedded Mathematical Formulas in PDF Documents Using SVM. Document Recognition and Retrieval. San Francisco. 2012; 8297 0D 1-8.
[6] Chen Y, Liu LZ, Ye H. Automatically Extracting Images of JPEG Format from PDF Documents. Journal of Information Engineering University. 2007; 8(2): 213-216.
[7] Wang JT, Kang XD, Li M, et al. Extraction of Recognizable Images from PDF File. Computer Engineering and Design. 2006; 27(9): 1539-1541.
[8] Li Q, Liu SJ. Design and Implementation of PDF Reader. Computer Engineering and Design. 2010; 31(7): 1635-1638.
[9] Zhang XZ. Chinese Character Recognition Technology. Beijing: Tsinghua University Press. 1992.
[10] Hu JZ. Computer Character Recognition Technology. Beijing: China Meteorological Press. 1994.
[11] Tian DZ. Recognition Preprocessing of Visual Document Image. PhD Thesis. Baoding: Hebei University; 2007.
[12] Liu SL. PDF Images Pre-processing for Formula Recognition. Master Dissertation. Baoding: Hebei University; 2012.
[13] Hinds SC, Fisher JL, D'Amato DP. A Document Skew Detection Method Using Run-length Encoding and the Hough Transform. Proceedings of the 10th International Conference on Pattern Recognition (ICPR). Atlantic City. 1990: 464-468.
[14] Le DX, Thoma GR, Wechsler H. Automated Page Orientation and Skew Angle Detection for Binary Document Images. Pattern Recognition. 1994; 27(10): 1325-1344.
[15] Tian XD, Guo BL. The Method for Chinese Document Layout Analysis Based on Comprehensive Features. Journal of Chinese Information Processing. 1999; 13(4): 22-28.
[16] Lu XC, Yi BZ, Ping XJ, et al. English and Chinese Scripts Identification of Noised Document Image. Computer Engineering and Design. 2007; 28(21): 5150-5152.
[17] Liang X. An Extraction Method for Mathematical Expressions in English and Chinese Printed Documents. Master Dissertation. Baoding: Hebei University; 2010.
[18] Guo L, Ping XJ, Zhou L. Script Identification of Document Image Based on Stroke Direction Histogram. Journal of Information Engineering University. 2011; 12(2): 231-237.
[19] Li C, Ding XQ, Wu YS. An Algorithm for Text Location in Images Based on Histogram Features and Ada Boost. Journal of Image and Graphics. 2006; 11(3): 325-331.
[20] Liang HW. Direct Determination of Threshold from Bimodal Histogram. Pattern Recognition and Artificial Intelligence. 2002; 15(2): 253-256.