TELKOMNIKA, Vol. 13, No. 4, December 2015, pp. 1414~1421
ISSN: 1693-6930, accredited A by DIKTI, Decree No: 58/DIKTI/Kep/2013
DOI: 10.12928/TELKOMNIKA.v13i4.2389
Received August 25, 2015; Revised October 9, 2015; Accepted October 24, 2015
Automatically Generation and Evaluation of Stop Words List for Chinese Patents

Deng Na^1, Chen Xu*^2
^1 School of Computer, Hubei University of Technology, Wuhan 430068, P.R. China
^2 School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan 430073, P.R. China
*Corresponding author, email: iamdengna@163.com, xuchen2014@yeah.net
Abstract
Stop word elimination is an important preprocessing step in information retrieval and information processing, and its accuracy directly influences the final result of retrieval and mining. In information retrieval, eliminating stop words compresses the storage space of the index. In text mining, it greatly reduces the dimension of the vector space, saves the storage space of the vector space, and speeds up calculation. However, Chinese patents are legal documents containing technical information, and the general Chinese stop words list is not applicable to them. This paper proposes two methodologies for generating stop words lists for Chinese patents: one is based on word frequency and the other on statistics. Through experiments on real patent data, the accuracies of the two methodologies are compared under several corpuses of different scales, and both are compared with the general stop words list. The experimental results indicate that both methodologies can extract stop words suitable for Chinese patents, and that the accuracy of the methodology based on statistics is a little higher than that of the one based on word frequency.

Keywords: stop word; patent; statistics; information retrieval; word frequency

Copyright © 2015 Universitas Ahmad Dahlan. All rights reserved.

1. Introduction
With the development of the Internet and information technology, massive amounts of data are accumulating in every domain. According to the Survey Report on the Quantity of Chinese Internet Information Resources, the number of Chinese web pages had reached 2.4 billion by the end of 2005, and this number keeps increasing explosively. The volumes of other text carriers, such as academic papers and patents, are also growing at an alarming rate. How to find the needed data in such big data quickly and precisely, and how to mine useful information from it, are problems demanding prompt solutions.
In information retrieval there are three basic steps: word segmentation, stop word elimination, and indexing. In text mining, whatever kind of mining follows, word segmentation, stop word elimination, and keyword extraction are still the essential preparatory work [1]. Therefore, as an important preprocessing step of information retrieval and information processing, the accuracy of stop word elimination directly influences the final result of retrieval and mining.
Stop words are words that appear frequently in a corpus but carry no important information [2-3]. Zipf's Law [4] shows that in English only a few words are used regularly, while most words are rarely used. Other languages, including Chinese, have the same characteristic. Stop words are words that are used frequently but cannot differentiate documents. For example, the most frequent words in English are the, of, and, to, a, in, that and is [5], and all of them are included in English stop words lists.
Eliminating stop words brings advantages in two respects. In information retrieval, since stop words have no actual meaning, there is no need to index them, so eliminating them compresses the storage space of the index. In text mining, it greatly reduces the dimension of the vector space, saves the storage space of the vector space, and speeds up calculation.
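The indexing benefit described above can be illustrated on a toy scale. The following is a minimal sketch, not a measurement from the paper: the three English token lists, the four-word stop list, and the helper `index_size` are all hypothetical and chosen only to make the effect visible.

```python
def index_size(docs, stop_words=frozenset()):
    """Return (distinct indexed terms, postings) for a list of token lists.

    A posting is counted once per (term, document) pair, as in a simple
    inverted index. Both inputs are illustrative assumptions.
    """
    vocabulary = set()
    postings = 0
    for tokens in docs:
        kept = {t for t in tokens if t not in stop_words}
        vocabulary |= kept
        postings += len(kept)
    return len(vocabulary), postings


# Hypothetical, already tokenised documents.
docs = [
    ["the", "pump", "of", "the", "engine"],
    ["the", "valve", "of", "a", "pump"],
    ["a", "sensor", "in", "the", "engine"],
]
stop_words = {"the", "of", "a", "in"}

full = index_size(docs)                 # index every term
reduced = index_size(docs, stop_words)  # skip stop words
print(full, reduced)
```

In this toy case the vocabulary (the dimension of the vector space) halves and the number of postings drops by more than half; real savings depend on the corpus and on the stop list used.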
At present there are many stop words lists for English, such as [6] and [7], and there has been some research on English stop words. [8] [9] evaluate the influence on retrieval performance of stop words lists with different lengths and contents. [9] also gives its own method for generating a stop words list. The method builds on two existing lists and on the simplest and most general strategy, that is, computing the frequency of each word in the corpus, ranking the words in descending order, choosing those with high frequencies, and filtering them manually. Words that remain and are not in the two existing lists are added to the union of the two lists. Finally, some special words are added, such as times, file names, Roman numerals, prefixes, adjectives, adverbs, dates, foreign words, scale units and so on. [10] verifies that a meta-search engine obtains better results after using a stop words list. [11] uses a stop words list to detect plagiarism in papers.
Among languages other than English and Chinese, [12] [13] study the construction of stop words lists for Arabic, [14] makes use of union (joint) entropy to study stop words for Mongolian, and [15] uses the methodology of [2] to build a stop words list for Thai.
Currently, few scholars are researching stop words for Chinese. [2] [16] calculate the rank of words from two viewpoints, statistics and informatics, and combine both to obtain the final list. In the statistical model, the average probability and variance of each word are computed, and words with high average probability and low variance are deemed candidate stop words. In the information model, the entropy of each word is calculated, and the words with low entropy are extracted as candidates. Finally, the Borda ranking method is adopted to determine the final list. [17] defines stop words from the angle of statistics. It holds that a stop word should satisfy two conditions: it has a high document frequency, and it has little relationship with the classification categories. On the basis of a contingency table, the weighted Chi value of each word is calculated, and the words with the lowest values are ranked at the top of the list. [18] calculates the probability of a word in a sentence and of a sentence in the corpus, and extracts the stop words list according to their union (joint) entropy. [19] divides stop words into two groups, absolute words and relative words, and makes use of left/right entropy and N-grams to filter stop words from users' requests in information retrieval.
Our work in this paper is as follows: 1) we give two methodologies for generating stop words lists for Chinese patents; 2) through experiments on real patent data, we compare the accuracies of the two methodologies under corpuses of different scales, and compare them with the list for general Chinese texts.
The main innovations of our paper are:
1) Currently there is no related research on stop words lists for Chinese patents, and we fill this gap. We analyze the contents of the stop words list for general Chinese texts, classify them into categories, and clarify why this list is not suitable for Chinese patents.
2) Through experiments on real patent data, we find that the algorithm in [2] [15] is also not applicable to Chinese patents, and on that basis we modify and adjust it.
3) We compare and evaluate the accuracies of the two methodologies under corpuses of different scales, and compare them with the list for general Chinese texts.

2. The Stop Word List for General Chinese Texts
Several popular stop words lists are available on the Internet, such as those of Harbin Institute of Technology and Baidu Corporation. These lists contain words that occur frequently in general texts. We classify them into eight categories.
1) Modal particle. E.g.: “啊 (ah), 阿 (ah), 哎 (hey), 哎呀 (oh, my!), 哎哟 (ouch), 唉 (gosh)” and so on.
2) Onomatopoeia. E.g.: “嗡嗡 (buzz), 吧哒 (ba-da), 叮咚 (ding-dong), 沙沙 (rustle), 瑟瑟 (se-se)” and so on.
3) Local dialect and folk adage. E.g.: “敞开儿 (unrestrictedly), 打开天窗说亮话 (speak frankly and sincerely), 赶早不赶晚 (the earlier the better), 那末 (then)” and so on.
4) Conjunction. E.g.: “不管 (regardless of), 并非 (really not), 换句话说 (in other words)” and so on.
5) Adverb. E.g.: “马上 (immediately), 略微 (a little), 默默地 (silently), 必定 (certainly), 果然 (as expected)” and so on.
6) Pronoun. E.g.: “你 (you), 他们 (they)” and so on.
7) Preposition. E.g.: “在 (at), 从 (from), 当 (when)” and so on.
8) Emotional words. These words usually carry a commendatory or derogatory sense. E.g.: “不择手段 (play hard), 不亦乐乎 (pleasurably), 老老实实 (conscientiously), 故意 (intentionally), 成心 (on purpose)” and so on.
However, these lists are not suitable for Chinese patents, for two reasons.
1) Many words in the list for general Chinese texts will never appear in Chinese patents. Patents are legal documents containing technical information, and their wording is usually rigorous and serious. The kinds of words mentioned above are not serious enough to be used in patents. Since these words never appear in patents, the list for general texts does not fit patents.
2) Chinese patents are usually written according to specific modes and sentence patterns, and they include many conventional words, for example “发明 (invention), 实用新型 (utility model), 技术 (technology), 系统 (system), 装置 (equipment), 采用 (adopt)” and so on. These words have actual meanings in general texts, but in patents they are template words that are used by most patents and therefore cannot differentiate one patent from another.

3. Word Segmentation of Chinese
Word segmentation is required before stop words can be eliminated. At present there are many popular tools, such as IKAnalyzer [20] and JE-analysis [21]. JE-analysis is not open source and has its own built-in stop words list, so stop words have already been deleted before its segmentation result is output. By comparison, IKAnalyzer is open source, and users can customize their own stop words list. To guarantee that all words are kept, we choose IKAnalyzer as the word segmentation tool in this paper.

4. Two Methodologies of Generating Stop Words Lists for Chinese Patents
[22] mentions that the simplest and most general strategy is to compute the frequency of each word in the corpus, rank the words in descending order, choose those with high frequencies, and filter them manually. [2] [15] use a statistical method that calculates the average probability, variance, and SAT of each word and ranks the words accordingly. However, our experiment shows that this method is not suitable for Chinese patents, so we modify and adjust it.
Inspired by the research above, we give two methodologies for generating a stop words list in this paper. One is based on the simplest and most general strategy, and the other is based on the modified SAT. The details are as follows.

4.1. Methodology One: Based on the Most Simple and General Strategy
The procedure of generating a stop words list using Methodology One is as follows:
1. Segment the patents into words
2. Get the frequency of each word in the corpus
3. Rank the words according to their frequencies in descending order
4. Extract the words with high frequency as candidates, and filter them manually
Here, the frequency in step 2 refers to the number of occurrences of a word in the whole corpus, not its document frequency.
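Steps 2 to 4 of the procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the input is assumed to be already segmented (the paper uses IKAnalyzer for that step), the function name `rank_by_corpus_frequency`, the cut-off `top_k`, and the three tiny sample documents are hypothetical, and the manual filtering of step 4 is left to a human reviewer.

```python
from collections import Counter


def rank_by_corpus_frequency(segmented_docs, top_k):
    """Count every occurrence of each word over the whole corpus and
    return the top_k (word, count) pairs in descending order of count."""
    counts = Counter()
    for tokens in segmented_docs:
        counts.update(tokens)  # total occurrences, not document frequency
    # Ties are broken by code point order so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Hypothetical segmented patent abstracts.
docs = [
    ["本发明", "涉及", "一种", "方法", "方法"],
    ["本发明", "提供", "一种", "装置"],
    ["一种", "传感器", "装置"],
]
candidates = rank_by_corpus_frequency(docs, top_k=3)
print(candidates)
```

Note that 方法 is counted twice for the first document, which is exactly the difference between corpus frequency and document frequency mentioned in the text.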

4.2. Methodology Two: Based on Statistics
Suppose the corpus is $D = \{d_i\}$, $1 \le i \le N$, where $N$ refers to the count of patents. The set of words in the corpus is denoted as $W = \{w_j\}$.
Definition 1: average probability MP
The average probability of word $w_j$ in $D$ is:

$$MP(w_j) = \frac{\sum_{1 \le i \le N} p_{ij}}{N}$$

$p_{ij}$ is the frequency probability of $w_j$ in $d_i$. In other words, $p_{ij}$ equals the frequency of $w_j$ in $d_i$ divided by the number of words in $d_i$. If a word has a high MP value, it implies that this word occurs frequently in the whole corpus.
Definition 2: variance VP
The variance of $w_j$ in $D$ is:

$$VP(w_j) = \frac{\sum_{1 \le i \le N} \left(p_{ij} - MP(w_j)\right)^2}{N}$$

If a word has a low VP value, it implies that this word occurs uniformly in the whole corpus.
Definition 3: SAT
The SAT of $w_j$ in $D$ is:

$$SAT(w_j) = \frac{MP(w_j)}{VP(w_j)}$$

If a word has a high SAT value, it implies that this word occurs frequently and uniformly in the whole corpus. A word like this is very likely to be a stop word.
Modification and adjustment of SAT
On the basis of [2] [15], we modify and adjust SAT. Our experiment shows that the old definition of SAT is not suitable for Chinese patents. With the old definition, after the words are ranked by SAT in descending order, the words at the top are not correct stop words: many words with low MP and even lower VP are improperly ranked at the top. The reason is that MP is of the magnitude of frequency, whereas VP is of the magnitude of the square of frequency. Therefore, we modify the definition of SAT and use the square root of VP as its denominator. In this way, the numerator and the denominator are of the same magnitude.
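The magnitude argument can be made concrete with a short derivation. This is a sketch for illustration: the symbol $\mathrm{SAT}'$ for the modified measure and the two-document numbers are introduced here and do not appear in the paper.

```latex
% Modified measure (SAT' is notation introduced here for illustration)
\mathrm{SAT}'(w_j) = \frac{MP(w_j)}{\sqrt{VP(w_j)}}

% Effect of scaling every p_{ij} of a word by a factor c > 0
p_{ij} \to c\,p_{ij} \;\Rightarrow\; MP \to c\,MP, \qquad VP \to c^{2}\,VP

\frac{MP}{VP} \to \frac{1}{c}\cdot\frac{MP}{VP}, \qquad
\frac{MP}{\sqrt{VP}} \to \frac{MP}{\sqrt{VP}}

% Worked example with N = 2 documents
p = (0.04,\ 0.06):\quad MP = 0.05,\quad VP = 10^{-4},\quad
SAT = 500,\quad \mathrm{SAT}' = 5

p = (0.0004,\ 0.0006):\quad MP = 5\times10^{-4},\quad VP = 10^{-8},\quad
SAT = 5\times10^{4},\quad \mathrm{SAT}' = 5
```

Under the old definition the word that is one hundred times rarer scores one hundred times higher, although its relative spread is identical. With the square root in the denominator the score is unchanged by such scaling, so a word is no longer rewarded merely for being rare.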
The procedure of generating a stop words list using Methodology Two is as follows:
1. Segment the patents into words
2. Calculate each word's SAT value in the corpus
3. Rank the words according to SAT in descending order
4. Extract the words with high SAT as candidates, and filter them manually
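Step 2 of this procedure can be sketched directly from Definitions 1 and 2 and the modified SAT. The code below is a minimal illustration, not the authors' implementation: it assumes already segmented, non-empty documents, the function name `modified_sat` and the sample data are hypothetical, and treating a zero variance as an infinite score is our own assumption, since the paper does not discuss that case.

```python
import math
from collections import Counter


def modified_sat(segmented_docs):
    """Return {word: MP / sqrt(VP)} for a list of non-empty token lists."""
    n_docs = len(segmented_docs)
    # probs[i][w] = occurrences of w in document i / length of document i
    probs = []
    for tokens in segmented_docs:
        counts = Counter(tokens)
        probs.append({w: c / len(tokens) for w, c in counts.items()})
    vocabulary = set().union(*probs)
    scores = {}
    for w in vocabulary:
        p = [doc.get(w, 0.0) for doc in probs]  # 0 where the word is absent
        mp = sum(p) / n_docs
        vp = sum((x - mp) ** 2 for x in p) / n_docs
        # Assumption: VP == 0 (perfectly uniform use) is scored as +infinity.
        scores[w] = math.inf if vp == 0 else mp / math.sqrt(vp)
    return scores


# Hypothetical segmented patent abstracts.
docs = [
    ["一种", "方法", "一种", "装置"],
    ["一种", "材料", "系统", "装置"],
]
scores = modified_sat(docs)
ranking = sorted(scores, key=lambda w: -scores[w])
print(ranking[:2])
```

Here 装置 is used with the same probability in both documents and ranks first, 一种 follows with a score of 3.0, and the words that occur in only one document score 1.0. Candidates taken from the top of such a ranking would still need the manual filtering of step 4.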

5. Analysis and Evaluation of Experiment
5.1. Dataset
The data source of our experiment is more than 40000 Chinese patents applied for by Chinese universities and scientific research institutions from 1985-9-10 to 2010-10-6. These data include the application number, application date, IPC (International Patent Classification), applicant, patentee, title, abstract, deputy and so on. In this paper we are concerned only with the application number and the abstract. The application numbers are used as keys, and the abstracts compose the corpus.
The experiment is conducted under 9 corpuses of different scales, containing 500, 1000, 2000, 3000, 5000, 10000, 20000, 30000 and 40000 patents respectively.

5.2. Using Methodology One to Generate Stop Words List under Corpuses with Different Scales
The corpuses of different scales generate 9 lists in all. Figure 1 shows the proportion of common parts of these 9 lists. We can see that as the scale of the corpus increases, the proportion of common parts rises overall, although there are a few dips. In other words, as the scale of the corpus increases, the words at the top of the lists gradually become stable. In addition, Figure 1 indicates that when the compared scales are 30000 and 40000 patents, the proportion of common parts of the top 150 words reaches 0.987. The dips may be caused by the unbalanced distribution of the data in the corpus, and they disappear as the scale increases.
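The "proportion of common parts" is not defined formally in the text. A plausible reading, stated here as an assumption, is the number of words shared by the Top K of two lists divided by K; this is consistent with the reported 0.987 for the Top 150 words, since 148 / 150 ≈ 0.987. The helper name `common_proportion` and the sample lists below are hypothetical.

```python
def common_proportion(ranked_a, ranked_b, k):
    """Share of the Top k words that two ranked word lists have in common.

    Assumed definition: |top_k(A) & top_k(B)| / k. The order inside the
    Top k is ignored; only membership counts.
    """
    top_a = set(ranked_a[:k])
    top_b = set(ranked_b[:k])
    return len(top_a & top_b) / k


# Hypothetical ranked lists from a smaller and a larger corpus.
list_small = ["一种", "本发明", "方法", "装置", "系统"]
list_large = ["本发明", "一种", "方法", "材料", "装置"]
overlap = common_proportion(list_small, list_large, k=4)
print(overlap)
```

Applied to the lists of two adjacent scales (500 and 1000, 1000 and 2000, and so on), a function of this kind would produce one point per pair on each curve of Figure 1.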

Figure 1. The proportion of common parts of these 9 lists under corpuses with different scales

5.3. The Accuracies of Stop Words List under Corpuses with Different Scales Using Methodology One
Figure 2 shows the accuracies of the Top 150, 250, 350 and 450 words in the stop words lists under corpuses of different scales. All nine broken lines in the figure clearly show a downward trend. This indicates that the more Top words are taken, the lower the accuracy of the stop words list. In other words, in each list the accuracy of the Top 150 words is the highest. In addition, Figure 2 also shows that a bigger corpus does not necessarily give a more accurate list.
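The paper does not spell out how accuracy is computed. The sketch below assumes that it is the fraction of the Top K candidates that a human judge accepts as stop words, which matches the form of the reported values (for example, 121 / 150 ≈ 0.8067). The function name `top_k_accuracy`, the ranked list, and the set of judged stop words are all hypothetical.

```python
def top_k_accuracy(ranked_words, judged_stop_words, k):
    """Fraction of the Top k candidates accepted as true stop words.

    Assumed definition; judged_stop_words stands for the outcome of a
    manual review and is not data from the paper.
    """
    top = ranked_words[:k]
    hits = sum(1 for w in top if w in judged_stop_words)
    return hits / k


# Hypothetical ranked candidates and manual judgement.
ranked = ["一种", "本发明", "传感器", "方法", "电机"]
judged = {"一种", "本发明", "方法"}

acc_top2 = top_k_accuracy(ranked, judged, k=2)
acc_top5 = top_k_accuracy(ranked, judged, k=5)
print(acc_top2, acc_top5)
```

Even in this toy case the accuracy falls as K grows, because content words such as 传感器 enter the list further down the ranking.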

Figure 2. The accuracies of Top 150, 250, 350, 450 words in stop words lists under corpuses with different scales

[Plot for Figure 1. Title: "The proportion of common parts under corpuses with different scales"; x-axis: different scales (pairs of compared corpuses); y-axis: proportion (0.75 to 1); series: Top 150, Top 250, Top 350, Top 450. Data labels printed with the plot:]

Compared scales   Top 150   Top 250   Top 350   Top 450
500-1000          0.853     0.884     0.897     0.867
1000-2000         0.9       0.904     0.897     0.913
2000-3000         0.927     0.94      0.923     0.922
3000-5000         0.927     0.912     0.926     0.918
5000-10000        0.887     0.908     0.886     0.9
10000-20000       0.92      0.932     0.92      0.938
20000-30000       0.96      0.944     0.957     0.947
30000-40000       0.987     0.968     0.957     0.967

[Plot for Figure 2. Title: "The accuracies of stop word lists under corpuses with different scales"; x-axis: the amount of Top words (150, 250, 350, 450); y-axis: accuracy (0.6000 to 0.9000); series: corpus scales 500, 1000, 2000, 3000, 5000, 10000, 20000, 30000, 40000.]

The concrete accuracy data of Figure 2 are shown in Table 1.

Table 1. The accuracies of Top 150, 250, 350, 450 words in stop words lists under corpuses with different scales

Scale    Top 150   Top 250   Top 350   Top 450
500      0.8067    0.7440    0.7260    0.6820
1000     0.8200    0.7920    0.7540    0.7270
2000     0.8200    0.7760    0.7510    0.7400
3000     0.8467    0.8000    0.7600    0.7510
5000     0.8267    0.7960    0.7400    0.7240
10000    0.8330    0.8040    0.7690    0.7240
20000    0.8530    0.8160    0.7770    0.7490
30000    0.8330    0.8160    0.7910    0.7530
40000    0.8267    0.8200    0.8090    0.7620

5.4. Methodology One's List Compared with the List for General Texts
As mentioned above, some words have actual meanings in general texts but are common templates and sentence patterns in patents, where they carry no actual meaning. We compare a sublist from Methodology One (the Top 150 words when the scale is 20000) with the list of Harbin Institute of Technology (767 words). Our sublist contains 110 new words. 30 of them are shown in Table 2.

Table 2. Some new words not in the stop words list of Harbin Institute of Technology

本发明 (the invention)        一种 (a kind of)              方法 (method)
中 (in)                       装置 (device)                 制备 (equipment)
具有 (have)                   进行 (conducted)              上 (on)
包括 (containing)             涉及 (involved)               系统 (system)
在于 (lie in)                 采用 (adopt)                  属于 (belong to)
材料 (material)               后 (after)                    实用新型 (utility model)
实现 (achieve)                用于 (used for)               所述 (according to)
组成 (constitute)             处理 (handle)                 提供 (provide)
技术领域 (technology domain)  使 (make)                     特征 (characteristic)
形成 (form)                   利用 (utilize)                得到 (get)
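The comparison behind Table 2 amounts to a set difference between the patent sublist and the general list. The sketch below illustrates it on hypothetical data: the function name `new_words` and both word lists are made up for the example and are not the lists used in the paper.

```python
def new_words(patent_candidates, general_list):
    """Return the patent candidates that are absent from the general
    stop words list, keeping the order of the patent ranking."""
    general = set(general_list)
    return [w for w in patent_candidates if w not in general]


# Hypothetical Top words from patents and a hypothetical general list.
patent_top = ["本发明", "一种", "在", "方法", "的"]
general = ["在", "的", "从", "你"]

added = new_words(patent_top, general)
print(added, len(added))
```

With the paper's data, the same operation on the Top 150 sublist and the 767-word general list is what yields the 110 new words reported above.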

5.5. The Accuracies of Stop Words List under Corpuses with Different Scales Using Methodology Two

Figure 3. Methodology One and Two's accuracies of Top 150 words under corpuses with different scales

[Plot for Figure 3. Title: "The comparison of two methodologies' accuracies"; x-axis: corpus scale (500, 1000, 2000, 3000, 5000, 10000, 20000, 30000); y-axis: accuracy (0.7 to 0.95); series: Methodology One, Methodology Two.]

From Figure 3 we can see that, under corpuses of different scales, the accuracies of Methodology Two are higher than those of Methodology One. Moreover, when the scale lies between 500 and 10000, the Top 150 accuracies of the two methodologies differ little, and when the scale rises to 20000 and 30000, the difference becomes obvious.

6. Conclusion
Aiming at the problem that the general stop words list is not suitable for Chinese patents, this paper proposes two methodologies to automatically generate a stop words list. It classifies the words of the general list into several categories and clarifies why the general list is not applicable to Chinese patents. Through experiments on real patent data, we compare the accuracies of our two methodologies under corpuses of different scales, and also compare them with the general list. The experimental results indicate that both of our methodologies are suitable for Chinese patents, and that the accuracy of the methodology based on statistics is a little higher than that of the one based on word frequency.

Acknowledgment
This work was supported by Research Foundation for Advanced Talents of Hubei University of Technology (No. BSQD12131), the Fundamental Research Funds for the Young Teachers' Innovation project of Zhongnan University of Economics and Law (No. 2014147), the National Natural Science Foundation of China (No. 61201250), the Natural Science Foundation of Anhui Province (No. 1308085QF103), the Guangxi Natural Science Foundation (No. 2012GXNSFBA053174), and the Guangxi University Key Lab of Cloud Computing and Complex System Found. The authors sincerely thank the anonymous reviewers for their constructive comments and helpful suggestions.

References
[1] Erlin E, Rahmiati R, Rio U. Two Text Classifiers in Online Discussion: Support Vector Machine vs Back-Propagation Neural Network. TELKOMNIKA (Telecommunication Computing Electronics and Control). 2014; 12(1): 189-200.
[2] Zou F, Wang FL, Deng X, et al. Automatic construction of Chinese stop word list. Proceedings of the 5th WSEAS international conference on Applied computer science. Hangzhou, China. 2006: 1010-1015.
[3] Yuang CT, Banchs RE, Siong CE. An empirical evaluation of stop word removal in statistical machine translation. Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra). Association for Computational Linguistics, 2012: 30-37.
[4] K Zipf. Selective Studies and the Principle of Relative Frequency in Language. MIT Press. 1932.
[5] Timothy C Bell, John G Cleary and Ian H Witten. Text Compression. Prentice Hall. 1990.
[6] DTIC-DROLS English Stop Word List, http://dvl.dtic.mil/stop_list.html
[7] English Stop Word List in Word Net, http://www.d.umn.edu/~tpederse/Group01/WordNet/words.txt
[8] Dolamic L, Savoy J. When stopword lists make the difference. Journal of the American Society for Information Science and Technology. 2010; 61(1): 200-203.
[9] Zaman ANK, Matsakis P, Brown C. Evaluation of stop word lists in text retrieval using Latent Semantic Indexing. Proceedings of 2011 the Sixth International Conference on Digital Information Management (ICDIM). Melbourne, Australia. 2011: 133-136.
[10] Patel B, Shah D. Significance of stop word elimination in meta search engine. Proceedings of 2013 International Conference on Intelligent Systems and Signal Processing (ISSP). India. 2013: 52-55.
[11] Stamatatos E. Plagiarism detection using stopword n-grams. Journal of the American Society for Information Science and Technology. 2011; 62(12): 2512-2527.
[12] El-Khair IA. Effects of stop words elimination for Arabic information retrieval: a comparative study. International Journal of Computing & Information Sciences. 2006; 4(3): 119-133.
[13] Medhat W, Yousef AH, Korashy H. Corpora Preparation and Stopword List Generation for Arabic data in Social Network. arXiv preprint arXiv:1410.1135, 2014.
[14] Gong Zheng, Guan Gaowaa. Comparative Study on Between Mongolian Stop Words and English Stop Words. Journal of Chinese Information Processing. 2011; 25(4): 35-38. (in Chinese)
[15] Daowadung P, Chen YH. Stop Word in Readability Assessment of Thai Text. Proceedings of 2012 12th International Conference on Advanced Learning Technologies (ICALT). Rome, Italy. 2012: 497-499.
[16] Zou F, Wang FL, Deng X, et al. Stop word list construction and application in Chinese language processing. WSEAS Transactions on Information Science and Applications. 2006; 3(6): 1036-1044.
[17] Hao L, Hao L. Automatic identification of stop words in Chinese text classification. Proceedings of 2008 International Conference on Computer Science and Software Engineering. Wuhan, China. 2008; 1: 718-722.
[18] Gu Yijun, Fan Xiaozhong, Wang Jianhua, Wang Tao, Huang Weijin. Automatic Selection of Chinese Stoplist. Transactions of Beijing Institute of Technology. 2005; 25(4): 337-340. (in Chinese)
[19] Xiong Wenxin, Song Rou. Removal of Stop Word in Users' Request for Information Retrieval. Computer Engineering. 2007; 33(6): 195-197.
[20] ik-analyzer. https://code.google.com/p/ik-analyzer/
[21] je-analysis. https://code.google.com/p/jinhe-tss/downloads/detail?name=je-analysis1.5.1.jar
[22] Christopher D Manning, Prabhakar Raghavan, Hinrich Schütze. An Introduction to Information Retrieval. Cambridge University Press, 2008.