International Journal of Electrical and Computer Engineering (IJECE)
Vol. 6, No. 3, June 2016, pp. 925~935
ISSN: 2088-8708, DOI: 10.11591/ijece.v6i3.9663
Journal homepage: http://iaesjournal.com/online/index.php/IJECE
Automatic Extraction of Malay Compound Nouns Using A Hybrid of Statistical and Machine Learning Methods
Muneer A. S. Hazaa1, Nazlia Omar2, Fadl Mutaher Ba-Alwi3, Mohammed Albared3
1 Faculty of Computer Science and Information Technology, Thamar University, Yemen
2 University Kebangsaan Malaysia, Faculty of Information Science and Technology
3 Faculty of Computer and Information Technology, Sana'a University, Yemen
Article Info
Article history:
Received Oct 12, 2015
Revised Dec 7, 2015
Accepted Dec 21, 2015
ABSTRACT
Identifying compound nouns is important for a wide spectrum of applications in natural language processing, such as machine translation and information retrieval. Extraction of compound nouns requires deep or shallow syntactic preprocessing tools and large corpora. This paper investigates several methods for extracting noun compounds from Malay text corpora. First, we present the empirical results of sixteen statistical association measures for Malay <N+N> compound noun extraction. Second, we introduce the possibility of integrating multiple association measures. Third, this work provides a standard dataset intended as a common platform for evaluating research on the identification of compound nouns in the Malay language. The standard dataset contains 7,235 unique N-N candidates, 2,970 of which are N-N compound noun collocations. The extraction algorithms are evaluated against this reference dataset. The experimental results demonstrate that a group of association measures (T-test, Piatetsky-Shapiro (PS), C_value, FGM and the rank combination method) perform best and outperform the other association measures for <N+N> collocations in the Malay corpus. Finally, we describe several classification methods for combining the scores of the basic association measures, followed by their evaluation. Evaluation results show that classification algorithms significantly outperform individual association measures. The experimental results obtained are quite satisfactory in terms of Precision, Recall and F-score.
Keyword:
Association Measures
Classification Algorithms
Compound Nouns
Malay Language
Copyright © 2016 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
Muneer A.S. Hazaa,
Faculty of Computer and Information Technology,
Dhamar University, Yemen.
Email: muneer_hazaa@yahoo.com
1. INTRODUCTION
Compound nouns are a commonly occurring construction in natural languages. They are made up of two or more nouns which together function syntactically as a single noun, such as 'golf club' or 'computer science'. The syntax and semantics of compound nouns are discussed in detail in Levi [1]. Compound nouns which consist of two words are analyzed syntactically by means of the rule N → N N, or the rule N → N N̅ applied recursively. Compounds of more than two nouns are ambiguous in syntactic structure. Noun-noun compounds, as a subset of compound nouns, characteristically occur with high frequency and high lexical and semantic variability [2]. Noun compounds (NCs) have received a significant deal of attention in recent years in the computational linguistics literature.
Identification of compound noun multiword expressions (MWEs) and understanding their syntax and semantics is difficult but important for many Natural Language Processing (NLP) applications, particularly parsing and dictionary-based applications like machine translation [3] and question answering [4],[5]. Extracting Malay compound nouns is a challenging task in terms of obtaining accurate results. Hence, this study attempts to improve the effectiveness of Malay noun compound extraction by proposing a hybrid of statistical and machine learning methods.
The main activity in our research work is to observe and find an acceptable technique for extracting compound noun pairs in Malay.
Compounds have thus been a recurrent focus of attention within theoretical and cognitive linguistics and, in the last decade, also within computational linguistics. Considerable research has been carried out on the automatic identification of multiword units and noun-noun compounds and on classifying the semantic relationships between the components of compounds. Most of these studies on noun-noun compounds deal only with English and some other languages; not much research has been carried out at this level for Malay.
Various lexical association measures have been suggested in the literature for the identification of MWEs. These association measures are mathematical formulas that compute an association score between two or more words based on their occurrences and co-occurrences in a text corpus. The scores indicate the potential for a candidate to be a collocation. They can be used for ranking (candidates with high scores at the top), or for classification (by setting a threshold and discarding all bigrams below this threshold). An overview of the most widely used techniques is given in [6]-[10].
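The two uses of association scores described above, ranking and thresholding, can be sketched as follows. This is a minimal illustration and not the implementation used in this study; the function names, the example scores and the threshold value are all hypothetical.

```python
from typing import Dict, List, Tuple

Bigram = Tuple[str, str]


def rank_candidates(scores: Dict[Bigram, float]) -> List[Bigram]:
    """Ranking: order candidates so that the highest scores come first."""
    return sorted(scores, key=lambda pair: scores[pair], reverse=True)


def classify_by_threshold(scores: Dict[Bigram, float], threshold: float) -> List[Bigram]:
    """Classification: keep candidates at or above the threshold, discard the rest."""
    return [pair for pair in rank_candidates(scores) if scores[pair] >= threshold]


# Hypothetical association scores for three candidate bigrams.
example_scores = {
    ("kapal", "layar"): 12.4,
    ("kad", "kredit"): 61.65,
    ("tahun", "orang"): 0.8,
}

ranked = rank_candidates(example_scores)
accepted = classify_by_threshold(example_scores, threshold=10.0)
```

Ranking leaves the decision about where to cut the list to the user, while thresholding turns the same scores into a yes/no decision per candidate.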
Compound nouns in Malay have been classified into three major types based on their syntactic structure, as discussed in [11]. The syntactic structure of the first and the second categories is a noun followed by a noun, for example "gunung-ganang" (mountain) and "kapal layar" (sailing ship). For the third category, the syntactic structure is a noun followed by a non-noun word. The POS of the non-noun word can be a determiner, verb, adjective, adverb, preposition phrase or ordinal. In this study, our work is focused on the automatic extraction of N-N Malay compound noun multiword expressions.
In this paper, we first investigate several statistical association measures [7], [8], [12]-[15] for the identification of noun-noun compounds in a Malay corpus. After that, we present an automatic noun-noun compound extraction method based on a weighted combination of the ranked lists of multiple lexical association measures. Finally, we describe several classification methods which use association measure scores as their feature sets.
Experiments presented in this paper were performed on Malay data, and our attention was restricted to the first and second categories of Malay noun compounds.
This paper is organized as follows: In Section 2, we give a summary of related work. Section 3 describes our Malay noun compound extraction methods. Section 4 presents the evaluation methods, the experimental results and a discussion of the results. Finally, Section 5 concludes the study and outlines some future work.
2. RELATED WORK
Several approaches to MWE extraction have been proposed and carried out for various languages such as English, German and some other languages. Generally speaking, these approaches can be divided into four mainstream methodologies: statistical approaches [9], [16], [17], linguistic methods [18], [19], hybrid methods [16], [20], [21], and machine learning methods [7], [8].
In statistical methods for MWE extraction, Church and Hanks [22] first presented the concept of association measures and then proposed Mutual Information (MI) as an objective measure for estimating word association. Pecina (2005) presented an empirical evaluation of a comprehensive list of automatic collocation extraction methods (84 kinds of association measures for bigram collocation extraction) and concluded that, on Czech data, MI has the best performance. Yoshida et al. [23] proposed a new method (Enhanced Mutual Information and Collocation Optimization) to extract MWEs from text. Their results show that the new method significantly improves the performance of multiword expression extraction in comparison with a classic MI extraction method. Chakraborty [24] and Dandapat, Mitra et al. [25] used statistical measurements to extract Noun-Noun (N-N) and Noun-Verb (N-V) collocations, respectively, as MWEs in a Bengali corpus. Kunchukuttan and Damani [26] developed a system for Hindi compound noun MWE extraction from a Hindi corpus. Their extraction methods are based on statistical co-occurrence measures.
The linguistic methods for MWE extraction are based on the POS tags of words, which form the grammatical and syntactical requirements for a word sequence to be an MWE. Bourigault [27] proposed a grammatical analysis method for the extraction of terminological noun phrases. Argamon, Dagan et al. [28] proposed a memory-based approach to learn language patterns from corpora. Their method relies on local POS information of a word sequence instead of fully parsing a sentence.
The hybrid approach combines both statistical and linguistic information about word sequences. Dias [20] proposed a hybrid system which uses mutual expectation to score both the association of words and the association of POS patterns in tagged corpora. Su, Wu et al. [29] designed an automatic compound retrieval method to extract compounds within a text. They use n-gram mutual information, relative frequency count and POS as the features for compound extraction.
In machine learning methods, Pecina [6], [7] used a machine learning approach for MWE extraction. The method uses 55 kinds of association measures, such as joint probability, MI and t-score, to score each compound noun candidate. After that, a machine learning method (linear logistic regression, linear discriminant analysis or a neural network) is used to classify newly arriving collocation candidates, using the association measure scores as features, and to determine whether or not they are MWEs. On all of their data sets, the machine learning methods significantly improved the ranking of collocation candidates compared with the best single association measure. Duan, Lu et al. [30] developed a bio-inspired approach for multi-word expression extraction.
3. RESEARCH METHOD
We have developed a system that extracts bigram compound noun MWEs from a text corpus. The compound noun extractor creates a ranked list of Malay compound nouns. Several approaches which rely mainly on the statistical co-occurrence information of the compound nouns and on POS patterns have been implemented. The basic system architecture is shown in Figure 1. The following subsections discuss in detail the extraction methods used.
3.1. Corpus Acquisition
Corpora have been extensively employed in several NLP tasks as the basis for automatically learning models for language analysis and generation. In this step, we crawl and collect news articles written in the Malay language from the Malaysian National News Agency (BERNAMA) news source [http://www.bernama.com/bernama/v6/index.php]. The corpus contains 49,661 news articles and 13,346,381 tokens.
Figure 1. Extraction and Filtration of Compound Nouns Multiword Units
3.2. Preprocessing
In this phase, all crawled web pages are preprocessed by removing all HTML tags, identifying the main content, automatically removing noise and breaking the content down into a sequence of individual tokens. After that, all-uppercase, capitalized and mixed-case words are lowercased. Punctuation, special symbols and numbers are removed. Table 1 shows the n-gram statistics of our corpus.
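A minimal sketch of the text normalization steps listed above (tag removal, lowercasing, removal of punctuation, symbols and numbers, tokenization) is given below. It is an illustration only: main-content identification and noise removal are not covered, the regular expressions and the function name are hypothetical, and keeping word-internal hyphens (so that reduplicated forms stay in one token) is an assumption rather than something stated in the text.

```python
import re
from typing import List

TAG_RE = re.compile(r"<[^>]+>")           # crude HTML tag matcher (assumption)
NON_LETTER_RE = re.compile(r"[^a-z\s-]")  # not a lowercase letter, whitespace or hyphen


def preprocess(html: str) -> List[str]:
    """Strip tags, lowercase, drop punctuation/symbols/numbers, split on whitespace."""
    text = TAG_RE.sub(" ", html)
    text = text.lower()
    text = NON_LETTER_RE.sub(" ", text)
    # Keep word-internal hyphens but drop leading, trailing or stand-alone ones.
    tokens = [tok.strip("-") for tok in text.split()]
    return [tok for tok in tokens if tok]


tokens = preprocess("<p>Kad Kredit baru, 2015: <b>GUNUNG-GANANG</b>!</p>")
```

A production crawler would normally rely on a proper HTML parser instead of a regular expression; the regex keeps the sketch self-contained.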
Table 1. Statistics of the Malay corpus
Number of types                54,742
Number of tokens               13,346,381
Number of unique bi-grams      705,680
Number of bi-grams             13,296,724
Number of unique tri-grams     1,730,916
Number of tri-grams            13,247,067
[Figure 1 diagram, box labels: Corpus Acquisition; Corpus; Preprocessing; Candidate Generation; Candidate Compound Nouns; Automatic CNs Extraction Methods (ranking and classification); Final Compound Nouns lists]
3.3. Candidate Generation
In this phase, we tag all nouns in the text corpus using a Malay noun list obtained from a small manually annotated tagged corpus and a Malay lexicon which contains Malay words with their possible POS tags. This phase gives all possible N-N collocations that occur in the corpus. From the tagged corpus, any two consecutive words tagged as Noun and Noun respectively are extracted as a candidate N-N collocation. These compound noun candidates are then passed to the next phase for automatic compound noun extraction. Candidates which occur with very low frequency are discarded: only candidate collocations whose frequency in the corpus is greater than or equal to three are considered.
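The candidate generation step can be sketched as below. This is a simplified illustration under stated assumptions: nouns are recognised by plain lookup in a noun list (the study tags words using a noun list and a lexicon with possible POS tags), and the names `generate_candidates`, `MIN_FREQUENCY`, the toy noun list and the toy corpus are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Bigram = Tuple[str, str]
MIN_FREQUENCY = 3  # candidates seen fewer than three times are discarded


def generate_candidates(
    sentences: Iterable[List[str]], noun_list: Set[str]
) -> Dict[Bigram, int]:
    """Count adjacent noun-noun pairs and keep those with frequency >= MIN_FREQUENCY."""
    counts: Counter = Counter()
    for tokens in sentences:
        for left, right in zip(tokens, tokens[1:]):
            if left in noun_list and right in noun_list:
                counts[(left, right)] += 1
    return {pair: n for pair, n in counts.items() if n >= MIN_FREQUENCY}


# Hypothetical noun list and toy corpus (one token list per sentence).
nouns = {"kad", "kredit", "kapal", "layar", "bank"}
corpus = [
    ["kad", "kredit", "baru"],
    ["guna", "kad", "kredit"],
    ["kad", "kredit", "bank"],
    ["kapal", "layar", "besar"],
]
candidates = generate_candidates(corpus, nouns)
```

In this toy corpus only ("kad", "kredit") reaches the frequency cut-off of three; ("kapal", "layar") and ("kredit", "bank") are noun-noun pairs but occur once each and are dropped.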
3.4. Automatic Extraction
Once the candidate N-N compounds have been extracted in the candidate generation phase, each compound noun MWE candidate is ranked or classified. In our task, several statistical co-occurrence measures and sequence-type models are calculated for each of the extracted candidates, and the candidate collocations are ranked or classified by these measures. The scores can be used for ranking (candidates with high scores at the top), or for classification (by setting a threshold and discarding all bigrams below this threshold).
3.4.1. Statistical co-occurrence association model
The major statistical measures used and evaluated for N-N compound recognition in our study are presented below.
Pointwise Mutual Information (PMI), Z-score, T-test: These methods compare the observed frequencies of collocation candidates with the frequencies expected under the assumption of independence within the target pair (w1, w2). Krenn [31] carried out a thorough evaluation of the t-score, z-score and MI measures and showed that t-score outperformed the other association measures for <PP+Verb> collocations in a German corpus. The statistical measures t-score, z-score and MI are formulated in terms of the following counts: N, the total number of instances of NNCs; O, the number of instances of the pair (w1, w2); $f_{w1}$, the number of instances of w1; and $f_{w2}$, the number of instances of w2.
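As a hedged illustration, the sketch below computes the three measures in their usual textbook forms, which are assumed here rather than taken from this paper: with the expected frequency E = f_w1 · f_w2 / N, t-score = (O − E) / √O, z-score = (O − E) / √E and MI = log2(O / E). The function names and the example counts are hypothetical.

```python
import math


def expected_frequency(f_w1: int, f_w2: int, n: int) -> float:
    """Expected co-occurrence count of (w1, w2) if the two words were independent."""
    return f_w1 * f_w2 / n


def t_score(o: int, f_w1: int, f_w2: int, n: int) -> float:
    """(O - E) / sqrt(O); requires O > 0."""
    e = expected_frequency(f_w1, f_w2, n)
    return (o - e) / math.sqrt(o)


def z_score(o: int, f_w1: int, f_w2: int, n: int) -> float:
    """(O - E) / sqrt(E); requires E > 0."""
    e = expected_frequency(f_w1, f_w2, n)
    return (o - e) / math.sqrt(e)


def pmi(o: int, f_w1: int, f_w2: int, n: int) -> float:
    """log2(O / E); requires O > 0 and E > 0."""
    e = expected_frequency(f_w1, f_w2, n)
    return math.log2(o / e)


# Hypothetical counts: the pair occurs 30 times, w1 100 times, w2 200 times, N = 10,000.
O, F1, F2, N = 30, 100, 200, 10_000
scores = (t_score(O, F1, F2, N), z_score(O, F1, F2, N), pmi(O, F1, F2, N))
```

With these counts the expected frequency is 2, so the pair is observed fifteen times more often than independence would predict, and all three scores are positive.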
Chi-square test (χ²-test): Pearson's χ² test of independence can be used to test whether the words in a collocation are independent of each other. The χ²-test is a classical method that is widely used for this type of analysis. It is formulated in terms of the following counts: N, the total number of instances of NNCs; O, the number of instances of the pair (w1, w2); $f_{w1}$, the number of instances of w1; $f_{w2}$, the number of instances of w2; the number of pairs that do not contain w1 and w2 simultaneously; the number of pairs that contain w2 but not w1; and the number of pairs that contain w1 but not w2.
Phi coefficient: In statistics, the Phi coefficient Φ is a measure of association for two binary variables. The Phi coefficient has been adopted in several works on compound extraction [8], [24], [32]. It is computed from the counts of a 2×2 contingency table of the two words.
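Both the χ² statistic and the Phi coefficient can be computed from one 2×2 contingency table. The sketch below uses their standard closed forms for a 2×2 table, which are assumed here rather than taken from this paper; the cell names `o11` (w1 followed by w2), `o12` (w1 without w2), `o21` (w2 without w1), `o22` (neither) and the example counts are hypothetical.

```python
import math


def _margin_product(o11: int, o12: int, o21: int, o22: int) -> int:
    """Product of the two row totals and the two column totals."""
    return (o11 + o12) * (o21 + o22) * (o11 + o21) * (o12 + o22)


def chi_square(o11: int, o12: int, o21: int, o22: int) -> float:
    """Pearson's chi-square: N * (o11*o22 - o12*o21)^2 / product of margins."""
    n = o11 + o12 + o21 + o22
    return n * (o11 * o22 - o12 * o21) ** 2 / _margin_product(o11, o12, o21, o22)


def phi_coefficient(o11: int, o12: int, o21: int, o22: int) -> float:
    """Phi: (o11*o22 - o12*o21) / sqrt(product of margins)."""
    return (o11 * o22 - o12 * o21) / math.sqrt(_margin_product(o11, o12, o21, o22))


# Hypothetical table: pair seen 30 times, w1 100 times, w2 200 times, 10,000 pairs in total.
table = (30, 70, 170, 9730)
chi2 = chi_square(*table)
phi = phi_coefficient(*table)
```

For a 2×2 table the two statistics carry the same information up to sign and scaling, since Φ² = χ² / N.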
Log Likelihood Ratio (LLR): The likelihood-ratio test is a more general test of significance than the χ² test and makes no assumption of approximation to the normal distribution. The LLR has proved to give better results [33]. The log-likelihood is calculated with a formula adjusted for the co-occurrence contingency table as follows. For a given pair of words w1 and w2, let a be the number of windows in which w1 and w2 co-occur, let b be the number of windows in which only w1 occurs, let c be the number of windows in which only w2 occurs, and let d be the number of windows in which neither of them occurs; the LLR score is then computed from a, b, c and d.
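A hedged sketch of the computation from a, b, c and d follows. It uses the common contingency-table form of the log-likelihood ratio, LLR = 2 [a ln a + b ln b + c ln c + d ln d − (a+b) ln(a+b) − (a+c) ln(a+c) − (b+d) ln(b+d) − (c+d) ln(c+d) + (a+b+c+d) ln(a+b+c+d)], which is assumed here and not confirmed by this paper; the function names and example counts are hypothetical.

```python
import math


def _xlogx(x: float) -> float:
    """x * ln(x), with the usual convention 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def log_likelihood_ratio(a: int, b: int, c: int, d: int) -> float:
    """Log-likelihood ratio of a 2x2 co-occurrence table with cells a, b, c, d."""
    return 2.0 * (
        _xlogx(a) + _xlogx(b) + _xlogx(c) + _xlogx(d)
        - _xlogx(a + b) - _xlogx(a + c) - _xlogx(b + d) - _xlogx(c + d)
        + _xlogx(a + b + c + d)
    )


# Hypothetical counts: strongly associated pair vs. a perfectly independent table.
llr_associated = log_likelihood_ratio(30, 70, 170, 9730)
llr_independent = log_likelihood_ratio(10, 20, 30, 60)
```

The second table has a·d = b·c, so the observed counts equal the counts expected under independence and the score is zero up to floating-point error.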
Other methods: In addition to the methods described above, other statistical association measures such as the Dice coefficient, odds ratio, Jaccard (J), Normalized Expectation (NE), Mutual Dependency (MD) and Mutual Expectation (ME) are also used. These methods are widely used in collocation extraction [6]-[9],[17],[24],[25],[32],[34].
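For three of these measures the standard definitions are simple enough to sketch directly. The forms below are the usual textbook ones and are an assumption, not a transcription of this paper's formulas: Dice = 2·O / (f_w1 + f_w2), Jaccard = O / (f_w1 + f_w2 − O), and odds ratio = (o11 · o22) / (o12 · o21) on the 2×2 contingency table. All names and example counts are hypothetical.

```python
def dice(o: int, f_w1: int, f_w2: int) -> float:
    """Dice coefficient: twice the pair count over the sum of the word counts."""
    return 2 * o / (f_w1 + f_w2)


def jaccard(o: int, f_w1: int, f_w2: int) -> float:
    """Jaccard coefficient: pair count over the count of contexts with either word."""
    return o / (f_w1 + f_w2 - o)


def odds_ratio(o11: int, o12: int, o21: int, o22: int) -> float:
    """Odds ratio of a 2x2 table; undefined (division by zero) if o12 or o21 is 0."""
    return (o11 * o22) / (o12 * o21)


# Hypothetical counts: pair seen 30 times, w1 100 times, w2 200 times, 10,000 pairs.
example = {
    "dice": dice(30, 100, 200),
    "jaccard": jaccard(30, 100, 200),
    "odds_ratio": odds_ratio(30, 70, 170, 9730),
}
```

In practice a small smoothing constant is often added to each cell before taking the odds ratio so that empty cells do not cause a division by zero; that refinement is left out here.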
3.4.2. Methods based on the statistics of compound nouns and their components
The C-value Approach: The C-value method is an efficient domain-independent multi-word term recognition method [35], which combines linguistic and statistical information [13],[14],[36]. C-value is sensitive to nested compounding through its enhanced statistical measure of frequency of occurrence. For a candidate compound noun CN, the C-value is defined in terms of the number of simple nouns that CN consists of, its frequency of occurrence in the corpus, the set of extracted candidate terms that contain CN, and c(CN), the number of those candidate terms.
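The sketch below follows the commonly cited definition of C-value, assumed here and not confirmed by this paper: a candidate that is not nested in any longer candidate scores log2(length) · frequency, and a nested candidate scores log2(length) · (frequency − mean frequency of the longer candidates that contain it). The function name, parameter names and example numbers are hypothetical.

```python
import math
from typing import Sequence


def c_value(length: int, frequency: int, containing_frequencies: Sequence[int]) -> float:
    """C-value of a candidate with `length` simple nouns and corpus `frequency`.

    `containing_frequencies` holds the frequencies of the longer candidate terms
    that contain this candidate; it is empty when the candidate is not nested.
    """
    weight = math.log2(length)
    if not containing_frequencies:
        return weight * frequency
    nested_mean = sum(containing_frequencies) / len(containing_frequencies)
    return weight * (frequency - nested_mean)


# Hypothetical bigram seen 40 times, first on its own, then nested in two longer terms.
standalone = c_value(2, 40, [])
nested = c_value(2, 40, [10, 6])
```

For bigrams the length weight log2(2) equals 1, so the score reduces to the frequency, discounted by the average frequency of the longer terms in which the bigram appears.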
Combining frequency and geometric mean of nouns (FGM): The main advantage of this method is that it takes into account both the statistics of the compound noun space and actual use in a corpus within one scoring function [20],[37],[38]. The score is defined in terms of the following quantities: f(CN) is the number of independent occurrences of the noun CN; #LN(N) and #RN(N) are the numbers of distinct simple words which directly precede or succeed N; and LN(N) and RN(N) are the frequencies of nouns that directly precede or succeed N.
3.4.3. Rank combination
Each of the above association measures gives a ranked list. We tried the following approach to combine these ranked lists.
Rank Aggregation (RA): The aim is to combine the ranked lists produced by several association measures using the ordinal ranks of the elements in each list. The weighted combination method has been shown to give better results than the individual measures [24]-[26]. Given multiple ordered lists L1, L2, ..., Lk of CNs, the rank aggregation problem is to combine these lists into a single ranked list. We use the following rank aggregation heuristic, which is called Borda's positional ranking: given lists L1, L2, ..., Lm, where m ≤ k, for each candidate c in the NNCs and each list Li, the score $B_i(c)$ is the number of candidates ranked below c in Li. The total Borda score is $B(c) = \sum_{i=1}^{m} B_i(c)$. The candidates are then sorted by descending Borda score.
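Borda's positional ranking as described above can be sketched in a few lines. The sketch is illustrative only: the function name and the example lists are hypothetical, a candidate missing from a list simply receives no points from that list (an assumption), and ties are broken by first appearance because Python's sort is stable.

```python
from typing import Dict, List, Sequence


def borda_aggregate(ranked_lists: Sequence[List[str]]) -> List[str]:
    """Combine ranked lists (best candidate first) into one list.

    Each candidate receives, from every list, as many points as there are
    candidates ranked below it in that list; totals are sorted in descending order.
    """
    totals: Dict[str, int] = {}
    for ranking in ranked_lists:
        size = len(ranking)
        for position, candidate in enumerate(ranking):
            below = size - position - 1  # number of candidates ranked below this one
            totals[candidate] = totals.get(candidate, 0) + below
    return sorted(totals, key=lambda cand: totals[cand], reverse=True)


# Hypothetical ranked lists produced by three association measures.
list_1 = ["kad kredit", "kapal layar", "bank negara", "tahun lalu"]
list_2 = ["kapal layar", "kad kredit", "tahun lalu", "bank negara"]
list_3 = ["kad kredit", "bank negara", "kapal layar", "tahun lalu"]
combined = borda_aggregate([list_1, list_2, list_3])
```

Here "kad kredit" collects 3 + 2 + 3 = 8 points, "kapal layar" 6, "bank negara" 3 and "tahun lalu" 1, which fixes the combined order.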
3.4.4. Statistical Classification
The main idea is to feed statistical and linguistic information about two adjacent Malay nouns to a machine learning classification framework. As shown in the previous sections, the statistical association measures only measure the association strength of pairs of words. Their scores are then usually ranked, and thresholds or evaluation points are set by users to evaluate them against a standard test. However, even after ranking, the scores cannot indicate explicitly whether the scored pairs of words are compound nouns or not. For example, the word pair "kad kredit" ("credit card") is scored 61.65 by the t_test and ranked ninth in the t_test list, but this information alone cannot tell clearly whether "kad kredit" is a Malay CN or not.
The compound noun extraction problem can, however, be formulated as a binary classification problem [7] in which each candidate is assigned one of two classes (compound noun or not). Each compound noun candidate x is described by a feature (attribute) vector $x = (x_1, \dots, x_n)$, where $x_i$ is the statistical score given by one of the above association measures. We have several association scores, given by several association measures, for each candidate and want to combine them to achieve better performance. In other words, the classification algorithms integrate all the association measures described above and use their scores as attributes or features to classify N-N candidates. We evaluated several classification methods for compound noun extraction.
Linear Logistic Regression: Logistic regression predicts the probability of an outcome that can only have a binary response. It can handle several predictors (numerical and categorical). The multiple logistic regression model has the form:
$$\ln\frac{P}{1-P} = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$
The model defines the predicted probability as:
$$P = \frac{\exp(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}{1 + \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}$$
where the coefficient $\beta_i$ controls the effect of the predictor $x_i$. The farther $\beta_i$ falls from 0, the stronger the effect of the predictor $x_i$.
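To make the role of the coefficients concrete, the following sketch evaluates the predicted probability of a logistic regression model for one candidate. It is an illustration only: the function name, the intercept and the coefficient values are hypothetical, the coefficients are not fitted to any data, and no safeguard against floating-point overflow for extreme inputs is included.

```python
import math
from typing import Sequence


def predicted_probability(
    features: Sequence[float], intercept: float, coefficients: Sequence[float]
) -> float:
    """Logistic model: P = 1 / (1 + exp(-(b0 + sum_i b_i * x_i)))."""
    if len(features) != len(coefficients):
        raise ValueError("features and coefficients must have the same length")
    linear = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical candidate described by two association scores.
candidate = [2.0, 1.0]
p_weak = predicted_probability(candidate, intercept=-1.0, coefficients=[0.5, 0.0])
p_strong = predicted_probability(candidate, intercept=-1.0, coefficients=[0.5, 1.0])
```

Setting the second coefficient to 0 removes that feature's influence entirely (the probability is 0.5 here), while moving it away from 0 shifts the prediction, which is the effect described in the text.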
Linear Discriminant Analysis: Linear Discriminant Analysis (LDA) is a popular tool for multiclass discriminative dimensionality reduction. The basic idea of LDA is to find a one-dimensional projection, defined by a vector $v$, that maximizes class separation. This method maximizes the ratio of the between-class variance to the within-class variance in any particular data set, thereby guaranteeing maximal separability:
$$\max_{v} \frac{v^{t} S_B v}{v^{t} S_W v}$$
where $S_B$ and $S_W$ denote the between-class and within-class scatter matrices.
Support Vector Machines: SVM was proposed to solve two-class problems by finding the optimal separating hyperplane between two classes of data. Suppose that X is a set of labeled training points (feature vectors) $(x_1, y_1), \dots, (x_n, y_n)$, where each training point $x_i \in R^N$ is given a label $y_i \in \{-1, +1\}$, $i = 1, \dots, n$. The goal in SVM is to estimate a function that maps feature vectors to labels and to find a classifier, which can be obtained by solving a convex optimization problem with λ as a regularization parameter.
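One common way to write this convex problem, assumed here and not confirmed by this paper, is the regularized hinge-loss form: minimize over w and b the quantity (1/n) Σ max(0, 1 − y_i (w·x_i + b)) + λ‖w‖², with the classifier sign(w·x + b). The sketch below only evaluates this objective and the resulting decision rule for given parameters; it does not perform the optimization, and all names and numbers are hypothetical.

```python
from typing import Sequence


def decision_value(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed score w . x + b of a feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Linear classifier: +1 on or above the hyperplane, -1 below it."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def svm_objective(
    w: Sequence[float],
    b: float,
    points: Sequence[Sequence[float]],
    labels: Sequence[int],
    lam: float,
) -> float:
    """Mean hinge loss over the training points plus lam * ||w||^2."""
    hinge = sum(
        max(0.0, 1.0 - y * decision_value(w, b, x)) for x, y in zip(points, labels)
    )
    return hinge / len(points) + lam * sum(wi * wi for wi in w)


# Hypothetical two-feature training set; the third point is misclassified.
points = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
labels = [1, -1, -1]
objective = svm_objective((1.0, 0.0), 0.0, points, labels, lam=0.1)
```

Only the third point contributes to the loss (hinge value 1.5), so the objective is 1.5 / 3 + 0.1 · 1 = 0.6; a larger λ penalizes large weights more strongly and favors a wider margin.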
4. EXPERIMENTS AND DISCUSSION
4.1. Data set and experiment setup
To create an evaluation gold standard, manual identification of compound noun MWEs was carried out on a Malay corpus. All N-N compound noun collocations were manually annotated by a native speaker. The entire reference data set contains 16,535 N-N candidates (7,235 unique N-N candidates), and 2,970 of the 7,235 are N-N compound noun collocations. We evaluate the extraction algorithms against this reference set of compound noun collocations, which was manually extracted from 8,200 files.
As described above, the collocation statistics were collected from a larger corpus of 49,661 Malay news documents (13,346,381 words) from the Malaysian National News Agency (BERNAMA) news source [http://www.bernama.com/bernama/v6/index.php]. Using a larger corpus provided more evidence for the statistical measures we used.
Since we manually annotated the entire reference data set, we use the standard metrics Precision and Recall to evaluate the automatic compound noun extraction methods. These metrics are computed at different ranks, called Evaluation Points (EP), in the following way [6], [7], [24]-[26].
Precision at evaluation point k is defined as:
$$P_k = \frac{\text{number of true compound nouns among the top } k \text{ candidates}}{k}$$
Recall at evaluation point k is defined as:
$$R_k = \frac{\text{number of true compound nouns among the top } k \text{ candidates}}{\text{total number of true compound nouns in the reference set}}$$
F-1 score at evaluation point k is defined as:
$$F_k = \frac{2 \, P_k \, R_k}{P_k + R_k}$$
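The metrics at an evaluation point can be sketched as follows, using the standard definitions of precision, recall and F-1 over the top k candidates. The function name is hypothetical and the candidate identifiers are synthetic placeholders; the example only mirrors the situation in which 99 of the top 100 candidates are true compound nouns and the reference set holds 2,970 of them. A non-empty gold set and k > 0 are assumed.

```python
from typing import Sequence, Set, Tuple


def metrics_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> Tuple[float, float, float]:
    """Precision, recall and F-1 of the top-k ranked candidates against a gold set."""
    hits = sum(1 for candidate in ranked[:k] if candidate in gold)
    precision = hits / k
    recall = hits / len(gold)
    f1 = 0.0 if hits == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Synthetic stand-ins: 2,970 gold compound nouns, 99 of which are in the top 100.
gold_set = {f"cn_{i}" for i in range(2970)}
ranking = [f"cn_{i}" for i in range(99)] + ["not_a_compound"]
precision, recall, f1 = metrics_at_k(ranking, gold_set, k=100)
```

With a reference set of this size, recall at small k is necessarily low even for a near-perfect ranking, because 100 candidates can cover at most about 3% of the 2,970 gold items.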
4.2. Experimental results and analysis
In our experiment, we incrementally examined the n-highest ranked candidate lists returned by each method. The precision values are calculated for the first 100, 200, 500, 1000 and 2000 top ranked candidates. The precision metrics for the different methods are shown in Figure 2. The x-axis represents the Evaluation Points, while the y-axis represents the precision values (the percentage of true N-N compound nouns) achieved at these Evaluation Points. The performance metrics (Precision, Recall and F-score) for all methods are also shown in Table 2.
A first analysis of the precision curves and the other metrics in Table 2 reveals two distinct classes of curves. Some of the methods start with very high precision, which then decreases quite substantially. In contrast, other methods start with low precision, which then increases slightly. The precision curve of each measure is important for this purpose because a monotonically decreasing graph indicates that more N-N compound noun collocations are found in the upper ranks than in the lower ranks. Although all methods have approximately the same precision at the top 3000 ranked candidates, finding a bigger proportion of the true N-N compound nouns at an early stage is simply more economical.
It is clear from the results in Table 2 and Figure 2 that T-test, PS, C_value and FGM prove to be good measures for the automatic extraction of Malay N-N compound noun collocations as MWEs, since their Precision scores are higher at almost all evaluation points, while the worst measure appears to be the CS method. For example, 99, 99, 98 and 98 of the top 100 N-N candidates ranked by T-test, PS, C_value and FGM, respectively, are N-N compound noun collocations. The top five candidates for each method and their corresponding tags are shown in Table 4. In fact, these methods behave differently on Malay than in other languages: the results obtained using these algorithms on the Malay corpus are better than the results reported by other evaluation studies for other languages [6], [7], [9], [24]-[26].
It is important to note from Table 2 and Table 3 that some methods which are not mathematically equivalent (i.e., assigning identical scores to input candidates), such as T-test and PS, achieve the same average precision and produce the same lists of ranked candidates. The ability to identify such groups of association measures may help in simplifying their formulas [39].
Figure 2. Overall Precision of different measures at different evaluation points
For the rank combination experiments, we combined the four best methods (T-test, PS, C_value and FGM). Table 4 shows the performance (Precision, Recall and F-score) of Borda's positional ranking method and its top five candidates. Borda's positional ranking, which performs an approximate aggregation of the ranked lists, has been used as a standard ranking function in previous studies [26],[40]. However, in our case, Borda's positional ranking behaves in the same way as its individual component methods.
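A minimal sketch of Borda-style positional aggregation follows. The point scheme (a candidate at 0-based position i in a list of length m receives m - i points, and absent candidates receive none) and the alphabetical tie-break are assumptions made for illustration; the paper does not spell out its exact variant. The input lists are the first three entries of the T-test, PS, CV and FGM columns of Table 3, truncated for brevity.

```python
from typing import Dict, List


def borda_combine(rankings: List[List[str]]) -> List[str]:
    """Aggregate several ranked lists by summing positional points."""
    scores: Dict[str, int] = {}
    for ranking in rankings:
        m = len(ranking)
        for position, candidate in enumerate(ranking):
            # Higher-ranked candidates earn more points from each list.
            scores[candidate] = scores.get(candidate, 0) + (m - position)
    # Sort by descending total; break ties alphabetically for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))


t_test = ["kenaikan harga", "ehwal pengguna", "harga minyak"]
ps = ["kenaikan harga", "ehwal pengguna", "harga minyak"]
cv = ["kenaikan harga", "ehwal pengguna", "harga minyak"]
fgm = ["kenaikan harga", "harga minyak", "kementerian perdagangan"]

combined = borda_combine([t_test, ps, cv, fgm])
print(combined)
```

When the input rankings largely agree, as they do here, the aggregated list reproduces the order of the individual lists, which is consistent with the observation that the combination behaves like its components.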
Table 2. The performance metrics (Precision, Recall and F-score) for all methods at different evaluation points
| Evaluation Point | MI P | MI R | MI F | CHI P | CHI R | CHI F | T-test P | T-test R | T-test F | PHI P | PHI R | PHI F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 0.62 | 0.02 | 0.04 | 0.77 | 0.03 | 0.05 | 0.99 | 0.03 | 0.06 | 0.55 | 0.02 | 0.04 |
| 200 | 0.66 | 0.04 | 0.08 | 0.76 | 0.05 | 0.09 | 0.91 | 0.06 | 0.11 | 0.61 | 0.04 | 0.08 |
| 500 | 0.64 | 0.11 | 0.18 | 0.72 | 0.12 | 0.21 | 0.81 | 0.14 | 0.23 | 0.59 | 0.10 | 0.17 |
| 1000 | 0.63 | 0.21 | 0.32 | 0.68 | 0.23 | 0.34 | 0.73 | 0.24 | 0.36 | 0.55 | 0.19 | 0.28 |
| 1500 | 0.62 | 0.31 | 0.41 | 0.63 | 0.32 | 0.42 | 0.66 | 0.33 | 0.44 | 0.53 | 0.27 | 0.36 |
| 2000 | 0.60 | 0.40 | 0.48 | 0.60 | 0.41 | 0.48 | 0.62 | 0.41 | 0.50 | 0.58 | 0.39 | 0.47 |
| Evaluation Point | LLR P | LLR R | LLR F | MD P | MD R | MD F | NE P | NE R | NE F | ME P | ME R | ME F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 0.54 | 0.02 | 0.04 | 0.77 | 0.03 | 0.05 | 0.54 | 0.02 | 0.04 | 0.77 | 0.03 | 0.05 |
| 200 | 0.55 | 0.04 | 0.07 | 0.76 | 0.05 | 0.09 | 0.59 | 0.04 | 0.07 | 0.76 | 0.05 | 0.09 |
| 500 | 0.58 | 0.10 | 0.17 | 0.72 | 0.12 | 0.21 | 0.60 | 0.10 | 0.17 | 0.72 | 0.12 | 0.21 |
| 1000 | 0.56 | 0.19 | 0.28 | 0.68 | 0.23 | 0.34 | 0.61 | 0.21 | 0.31 | 0.68 | 0.23 | 0.34 |
| 1500 | 0.55 | 0.28 | 0.37 | 0.63 | 0.32 | 0.42 | 0.60 | 0.30 | 0.40 | 0.63 | 0.32 | 0.42 |
| 2000 | 0.56 | 0.38 | 0.45 | 0.60 | 0.41 | 0.49 | 0.59 | 0.39 | 0.47 | 0.60 | 0.41 | 0.49 |
| Evaluation Point | DICE P | DICE R | DICE F | KAPPA P | KAPPA R | KAPPA F | CV P | CV R | CV F | FGM P | FGM R | FGM F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 0.74 | 0.02 | 0.05 | 0.79 | 0.03 | 0.05 | 0.96 | 0.03 | 0.06 | 0.97 | 0.03 | 0.06 |
| 200 | 0.78 | 0.05 | 0.10 | 0.80 | 0.05 | 0.10 | 0.90 | 0.06 | 0.11 | 0.90 | 0.06 | 0.11 |
| 500 | 0.74 | 0.12 | 0.21 | 0.71 | 0.12 | 0.21 | 0.80 | 0.13 | 0.23 | 0.76 | 0.13 | 0.22 |
| 1000 | 0.68 | 0.23 | 0.34 | 0.67 | 0.23 | 0.34 | 0.71 | 0.24 | 0.35 | 0.69 | 0.23 | 0.35 |
| 1500 | 0.64 | 0.32 | 0.43 | 0.62 | 0.32 | 0.42 | 0.65 | 0.33 | 0.44 | 0.65 | 0.33 | 0.43 |
| 2000 | 0.61 | 0.41 | 0.49 | 0.59 | 0.40 | 0.48 | 0.61 | 0.41 | 0.49 | 0.61 | 0.41 | 0.49 |
| Evaluation Point | CS P | CS R | CS F | PS P | PS R | PS F | Odd P | Odd R | Odd F | Jacc. P | Jacc. R | Jacc. F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 0.42 | 0.01 | 0.03 | 0.99 | 0.03 | 0.06 | 0.47 | 0.02 | 0.03 | 0.79 | 0.03 | 0.05 |
| 200 | 0.40 | 0.03 | 0.05 | 0.91 | 0.06 | 0.12 | 0.52 | 0.03 | 0.07 | 0.80 | 0.05 | 0.10 |
| 500 | 0.41 | 0.07 | 0.12 | 0.81 | 0.14 | 0.23 | 0.54 | 0.09 | 0.16 | 0.71 | 0.12 | 0.21 |
| 1000 | 0.43 | 0.15 | 0.22 | 0.72 | 0.25 | 0.37 | 0.52 | 0.18 | 0.26 | 0.67 | 0.23 | 0.34 |
| 1500 | 0.44 | 0.22 | 0.30 | 0.66 | 0.33 | 0.44 | 0.51 | 0.26 | 0.34 | 0.63 | 0.32 | 0.42 |
| 2000 | 0.44 | 0.30 | 0.35 | 0.61 | 0.42 | 0.49 | 0.50 | 0.34 | 0.41 | 0.59 | 0.40 | 0.48 |
To avoid incommensurability of the association measures in our experiments, we used a common pre-processing technique for score standardization: all association measure values are centered on zero and scaled to unit variance.
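The standardization step described above can be sketched as follows. The sketch assumes the population variance (dividing by the number of scores) and returns zeros for a constant input; both choices are assumptions, since the paper does not state them, and the function name `standardize` is hypothetical.

```python
import math
from typing import List


def standardize(scores: List[float]) -> List[float]:
    """Center scores on zero and scale them to unit variance (z-scores)."""
    if not scores:
        return []
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / n  # population variance
    std = math.sqrt(variance)
    if std == 0.0:
        # A constant measure carries no ranking information.
        return [0.0 for _ in scores]
    return [(s - mean) / std for s in scores]


raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # toy scores of one measure
z = standardize(raw)
print(z)
```

Applying the same transformation to each measure separately puts all measures on a comparable scale before they are fed to a classifier.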
To evaluate the machine learning methods, the precision, recall and F1-measure of all classification methods were obtained by vertical averaging in ten-fold cross-validation on the same reference data as in the earlier experiments. In each cross-validation step, nine folds were used for training and one fold for testing.
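The ten-fold protocol can be sketched as below. The round-robin fold assignment, the helper names and the placeholder `dummy_evaluate` callback are all hypothetical; averaging the per-fold precision, recall and F1 values is one simple reading of the averaging step and is an assumption, not the authors' documented procedure. In practice the data would usually be shuffled before the folds are formed.

```python
from typing import Callable, List, Tuple

Metrics = Tuple[float, float, float]  # (precision, recall, F1)


def k_fold_indices(n_items: int, k: int = 10) -> List[List[int]]:
    """Partition indices 0..n_items-1 into k nearly equal folds (round-robin)."""
    return [list(range(start, n_items, k)) for start in range(k)]


def cross_validate(n_items: int,
                   evaluate_fold: Callable[[List[int], List[int]], Metrics],
                   k: int = 10) -> Metrics:
    """Train on k-1 folds, test on the remaining one, and average the metrics."""
    sums = [0.0, 0.0, 0.0]
    for test_idx in k_fold_indices(n_items, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(n_items) if i not in held_out]
        fold_metrics = evaluate_fold(train_idx, test_idx)
        for j in range(3):
            sums[j] += fold_metrics[j]
    return (sums[0] / k, sums[1] / k, sums[2] / k)


def dummy_evaluate(train_idx: List[int], test_idx: List[int]) -> Metrics:
    # Placeholder: a real callback would train a classifier on train_idx
    # and return its precision, recall and F1 on test_idx.
    return (len(train_idx) / 100.0, len(test_idx) / 100.0, 1.0)


folds = k_fold_indices(100)
avg = cross_validate(100, dummy_evaluate)
print(avg)
```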
All classification methods performed very well. Detailed results (precision, recall and F1-measure) of all classification methods are given in Table 5. The best result was achieved by the support vector machine (SVM), which achieves precision, recall and F1-measure of 75.44%, 87.78% and 81.14%, respectively.
Experiments show that classification algorithms which combine the association scores given by several association measures lead to a significant performance improvement in comparison with the individual basic methods. In fact, the experimental results obtained are quite satisfactory, especially when compared to results obtained in other works [6],[7].
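To make the idea of combining association scores with a classifier concrete, the following is a minimal sketch of linear logistic regression trained by batch gradient descent on standardized scores. It is not the authors' implementation: the learning rate, epoch count, feature values and labels are all assumed for illustration, and each row of `X` stands for the (hypothetical) standardized scores of one candidate under two measures.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def train_logistic(X: List[List[float]], y: List[int],
                   lr: float = 0.5, epochs: int = 500) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w: List[float], b: float, x: List[float]) -> int:
    """Label 1 (compound noun) if the predicted probability is at least 0.5."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Toy standardized scores from two hypothetical measures; 1 = CN, 0 = NCN.
X = [[1.2, 0.9], [0.8, 1.1], [1.0, 0.2], [-0.9, -1.0], [-1.1, -0.4], [-0.3, -1.2]]
y = [1, 1, 1, 0, 0, 0]

w, b = train_logistic(X, y)
preds = [predict(w, b, x) for x in X]
print(preds)
```

The learned weights indicate how much each measure contributes to the decision, which is what allows a combined model to do better than any single ranking.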
In [6],[7], a hybrid method combining linguistic and statistical approaches was proposed for identifying compound nouns. It is clear that the hybrid method which combines statistical and machine learning techniques outperforms the hybrid of linguistic and statistical methods.
Table 3. Top 10 Malay N-N candidates extracted by different methods
| MI | CHI | T-test | PHI |
|---|---|---|---|
| panggung wayang CN | sahabat handai CN | kenaikan harga CN | pengarah syarikat CN |
| sahabat handai CN | lubuk yu CN | ehwal pengguna CN | makanan ternakan CN |
| karenah birokrasi CN | jem madu CN | harga minyak CN | pakej umrah CN |
| makhluk perosak CN | pendingin hawa CN | kementerian perdagangan CN | perlindungan harta NCN |
| kanun keseksaan CN | laman web CN | bahan api CN | produk buatan CN |
| jejari kentang CN | harta intelek CN | kerajaan negeri CN | pegawai jabatan NCN |
| adat resam CN | kanun keseksaan CN | ketua pegawai CN | bot pukat CN |
| akar umbi CN | hukum syarak CN | harga barang CN | bulan april CN |
| wakaf mempelam CN | panggung wayang CN | kad kredit CN | permohonan lesen CN |
| karbon dioksida NCN | khabar angin CN | musim perayaan CN | muka surat CN |
| LLR | MD | NE | ME |
|---|---|---|---|
| sahabat handai CN | jem madu CN | lubuk yu CN | jem madu CN |
| panggung wayang CN | sahabat handai CN | sahabat handai CN | sahabat handai CN |
| karenah birokrasi CN | lubuk yu CN | jem madu CN | lubuk yu CN |
| barah otak CN | pendingin hawa CN | panggung wayang CN | pendingin hawa CN |
| paya pahlawan NCN | laman web CN | nira nipah NCN | laman web CN |
| hukum syarak CN | harta intelek CN | karenah birokrasi CN | harta intelek CN |
| makhluk perosak CN | kanun keseksaan CN | barah otak CN | kanun keseksaan CN |
| ais krim NCN | hukum syarak CN | era globalisasi CN | hukum syarak CN |
| milo ais CN | panggung wayang CN | ais krim NCN | panggung wayang CN |
| tumbuhan ubatan NCN | khabar angin CN | kondominium pangsapuri NCN | khabar angin CN |
| DICE | KAPPA | CV | FGM |
|---|---|---|---|
| jem madu CN | jaksa pendamai NCN | kenaikan harga CN | kenaikan harga CN |
| sahabat handai CN | laman web CN | ehwal pengguna CN | harga minyak CN |
| lubuk yu CN | kanun keseksaan CN | harga minyak CN | kementerian perdagangan CN |
| pendingin hawa CN | pendingin hawa CN | kementerian perdagangan CN | ehwal pengguna CN |
| laman web CN | harta intelek CN | kerajaan negeri CN | kerajaan negeri CN |
| harta intelek CN | musim perayaan CN | bahan api CN | harga barang CN |
| kanun keseksaan CN | penghilang dahaga CN | ketua pegawai CN | bahan api CN |
| hukum syarak CN | tali pinggang CN | harga barang CN | ketua pegawai CN |
| panggung wayang CN | topi keledar CN | kad kredit CN | harga bahan CN |
| khabar angin CN | akar umbi CN | musim perayaan CN | stesen minyak CN |
| CS | PS | Odd | Jacc. |
|---|---|---|---|
| sanak sudara CN | kenaikan harga CN | roti canai CN | jaksa pendamai NCN |
| pustaka sufi NCN | ehwal pengguna CN | mahkamah sesyen CN | laman web CN |
| angin sakal NCN | harga minyak CN | kanun keseksaan CN | kanun keseksaan CN |
| tuanku maharajalela NCN | kementerian perdagangan CN | sungai nyiur NCN | pendingin hawa CN |
| pokok mempisang NCN | bahan api CN | penyaman udara CN | harta intelek CN |
| online kegilaan NCN | kerajaan negeri CN | musim tengkujuh CN | musim perayaan CN |
| emas kerajang NCN | ketua pegawai CN | kacang buncis CN | penghilang dahaga CN |
| penubuhan platun CN | harga barang CN | muka sauk CN | tali pinggang CN |
| mesyuarat informal NCN | kad kredit CN | setebal muka NCN | topi keledar CN |
| bukit tekoh CN | musim perayaan CN | panggung wayang CN | akar umbi CN |
Table 4. Results for the rank combination method
| Evaluation Point | Precision | Recall | F-score | Top 5 ranked candidates |
|---|---|---|---|---|
| 100 | 0.98 | 0.03 | 0.06 | kenaikan harga CN |
| 200 | 0.92 | 0.06 | 0.12 | ehwal pengguna CN |
| 500 | 0.79 | 0.13 | 0.23 | harga minyak CN |
| 1000 | 0.72 | 0.24 | 0.36 | kementerian perdagangan CN |
| 1500 | 0.66 | 0.33 | 0.44 | bahan api CN |
Table 5. Performance of Classification Methods Combining All Association Measures
| Method | Precision | Recall | F1 |
|---|---|---|---|
| SVM | 75.44 | 87.78 | 81.14 |
| LDA | 72.21 | 81.09 | 76.39 |
| GLM | 69.94 | 83.48 | 76.11 |
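As a quick consistency check on Table 5, the F1 column can be recomputed from the precision and recall columns, assuming F1 is the usual harmonic mean of the two. The helper name `f1` is hypothetical; the input values are those reported in the table.

```python
from typing import Dict, Tuple


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) as reported in Table 5.
table5: Dict[str, Tuple[float, float]] = {
    "SVM": (75.44, 87.78),
    "LDA": (72.21, 81.09),
    "GLM": (69.94, 83.48),
}

f1_scores = {name: round(f1(p, r), 2) for name, (p, r) in table5.items()}
print(f1_scores)
```

The recomputed values agree with the reported F1 figures to two decimal places.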
5. CONCLUSIONS
In the present work, we have developed a compound noun MWE extraction system which ranks collocations using statistical methods. We developed and manually annotated a reference data set containing 5,610 Malay N-N bigrams, of which 1,854 were agreed to be N-N compound nouns. We implemented several lexical association measures, employed them for N-N compound noun extraction and evaluated them against the reference data set. The results obtained using these algorithms on the Malay corpus are better than the results reported by other evaluation studies for other languages. The results also show that T-test, PS, C_value, FLR and RC are good measures for automatic extraction of Malay N-N compound noun collocations. Finally, we employed three classification models (linear logistic regression, linear discriminant analysis and support vector machines) to combine the association scores of the individual measures. Evaluation results show that these models significantly outperform the individual association measures. SVM achieves precision, recall and F1-measure of 75.44%, 87.78% and 81.14%, respectively.
In the future, we will implement and evaluate other available methods suitable for this task. In addition, we will focus especially on automatically interpreting compound noun relations and improving the quality of the training and testing data. Finally, we will attempt to demonstrate the contribution of collocations in selected application areas, such as machine translation or information retrieval.
REFERENCES
[1] A. Rahman, et al., “Construction of compound nouns (CNs) for noun phrase in Malay sentence,” Presented at Information Retrieval & Knowledge Management (CAMP), 2012 International Conference on.
[2] Ahn, K., et al., “Question Answering with QED at TREC-2005,” Presented at Proceedings of TREC, 2005.
[3] Alias, N. A. R., et al., “Application of semantic technology in digital library.”
[4] Argamon, S., et al., “A memory-based approach to learning shallow natural language patterns,” Presented at Proceedings of the 17th international conference on Computational linguistics, volume 1.
[5] Baldwin, T. and Tanaka, T., “Translation by machine of complex nominals: getting it right,” Presented at Proceedings of the Workshop on Multiword Expressions: Integrating Processing.
[6] Bourigault, D., “An endogeneous corpus-based method for structural noun phrase disambiguation,” Presented at Proc.
[7] Church, K. W. and Hanks, P., “Word association norms, mutual information, and lexicography,” Computational linguistics, vol/issue: 16(1), pp. 22-29, 1990.
[8] Chakraborty, T., “Identification of Noun-Noun (NN) Collocations as Multi-Word Expressions in Bengali Corpus,” Presented at Student Session, International Conference of Natural Language Processing (ICON).
[9] Dandapat, S., et al., “Statistical investigation of Bengali noun-verb (NV) collocations as multi-word expressions,” Proceedings of Modeling and Shallow Parsing of Indian Languages (MSPIL), 2006, pp. 230-233.
[10] Dias, G., “Multiword unit hybrid extraction,” Presented at Proceedings of the ACL 2003 workshop on Multiword expressions: analysis, acquisition and treatment, volume 18, 2003.
[11] Duan, J., et al., “A bio-inspired approach for multi-word expression extraction,” Presented at Proceedings of the COLING/ACL on Main conference poster sessions.
[12] Frantzi, K., et al., “Automatic recognition of multi-word terms: the C-value/NC-value method,” International Journal on Digital Libraries, vol/issue: 3(2), pp. 115-130, 2000.
[13] Gurrutxaga, A. and Alegria, I., “Measuring the compositionality of NV expressions in Basque by means of distributional similarity techniques,” 2012.
[14] Hoang, H. H., et al., “A re-examination of lexical association measures,” Presented at Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications.
[15] Kit, C. and Liu, X., “Measuring mono-word termhood by rank difference via corpus comparison,” Terminology, vol/issue: 14(2), pp. 204-229, 2008.