International Journal of Electrical and Computer Engineering (IJECE)
Vol. 5, No. 3, June 2015, pp. 483~490
ISSN: 2088-8708
Journal homepage: http://iaesjournal.com/online/index.php/IJECE
Partial Context Similarity of Gene/Proteins in Leukemia Using Context Rank Based Hierarchical Clustering Algorithm
Shahana Bano1, K. Rajasekhara Rao2
1 Department of Computer Science and Engineering, K L University
2 Department of Computer Science and Engineering, Sri Prakash College of Engineering
Article Info
Article history:
Received Dec 31, 2014
Revised Feb 24, 2015
Accepted Mar 16, 2015
ABSTRACT
In this paper we propose a method that avoids natural language processing tools such as POS taggers and parsers and thereby reduces the processing overhead. Moreover, we suggest a structure for quickly creating a large-scale corpus annotated with disease names, which can be used to train our probabilistic model. A context rank based hierarchical clustering method is applied to three medical disease datasets, namely Colon, Leukemia and MLL. An optimal rule filtering algorithm is applied to these datasets to remove unwanted special characters for gene/protein identification. Finally, experimental results show that the proposed method outperforms existing methods in terms of time and cluster search space.
Keyword:
Biomedical
Clustering
Gene/protein
Machine learning
Medline
Pubmed
Copyright © 2015 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
Shahana Bano,
Department of Computer Science and Engineering,
K L University,
Email: Shahanabano_cse@kluniversity.in
1. INTRODUCTION
Life science studies are characterized by the construction of large and heterogeneous bodies of biological data, including protein or gene sequences. Therefore, a number of text-mining methods have been used to improve the identification of protein and gene names in medical texts. Text mining has been defined as the discovery by computer of new, previously unknown information by automatically extracting it from different written resources. Machine learning is the development and study of systems that can learn from data; it is a technique for teaching computers to form and improve behaviors based on data. Machine learning is a huge field with hundreds of algorithms for addressing different problems. It poses challenging problems in terms of algorithmic approach, data representation, computational efficiency, and quality of the resulting program.
Biomedical data, along with its updates, is stored in natural language form. Because of the growing number of biomedical sources, it is becoming more and more challenging to find useful and relevant information on a specific topic. New research findings enter the repositories at a high rate, which makes finding and disseminating quality information a very difficult task. Manual assessment of such a large amount of data would be very difficult and time-consuming. The issue is further magnified by the use of many evaluation measures and of datasets that have essentially different annotation formats and task definitions.
Medical text documents often contain hidden but valuable structured data. For example, a collection of newspaper articles will contain details on the location of the headquarters of various entities. If we need to find the location of the headquarters of, say, Microsoft, we could try to use conventional information retrieval techniques to discover documents that contain the answer to the query.
An application of systems biology is to uncover the biological processes underlying the behavior of a cell. Relationships among genes encode most of this information, and they are occasionally discovered and represented as key products. Understanding these
relationships is an extremely challenging problem, as even the simplest organisms contain a variety of genes that interact in complex combinations to respond to environmental conditions. Another complicating factor is that current high-throughput techniques designed to determine the activity level of genes are extremely noisy [8]. As very few genetic activities are well understood, unsupervised clustering is a common first step towards understanding these data.
Clustering is a basic tool for organizing a collection of objects within a metric space into a set of smaller partitions called clusters. By using clusters, the representation of the object pool can be simplified and the computational expense of data management can be reduced. The resulting clusters can be used to derive higher-level rules that describe the common characteristics of the data objects.
In grammar induction, the rules of a grammar are stated over word classes, because words within the same category are transformed similarly. If the word categories are known, grammar principles can be explored in a better way.
Nearest neighbor is a machine learning method from the literature that learns by comparing each new case to prior examples. Machine learning is an area of artificial intelligence that focuses on the development of approaches which permit computers to learn. More precisely, machine learning is a method for generating computer programs for the evaluation of datasets. Instance-based learning, of which nearest neighbor is a subset, is one branch of machine learning techniques; other branches include rule-based methods, genetic algorithms, artificial neural networks (ANN) and support vector machines.
In the nearest neighbor algorithm, all tuples are saved in memory during training. When a new query instance is received, the memory is searched to find the instance that matches the query instance most closely. Nearest neighbor then infers that the concept label of the query instance is the same as the concept label of the most similar instance stored in memory.
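The lookup described above can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the function name `nearest_neighbor`, the use of Euclidean distance and the toy data are all assumptions made for the example.

```python
import math

def nearest_neighbor(training, query):
    """Return the label of the stored instance closest to `query`.

    `training` is a list of (feature_vector, label) tuples kept in memory.
    Euclidean distance is an assumed similarity measure.
    """
    best_label, best_dist = None, math.inf
    for features, label in training:
        dist = math.dist(features, query)  # Euclidean distance (Python 3.8+)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical stored instances: (features, concept label)
stored = [((1.0, 1.0), "pos"), ((5.0, 5.0), "neg"), ((1.5, 0.5), "pos")]
print(nearest_neighbor(stored, (4.0, 4.5)))
```

Because every stored tuple is compared with the query, the cost of one lookup grows linearly with the size of the memory.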
Noise in data is a significant challenge that keeps machine learners from being more accurate or applicable to a large range of domains. Noise is an incorrect attribute or model value, which can result from errors in manual data entry, compilation or measurement, or from corruption of data. If the potential for noise is not recognized, machine learning algorithms can end up fitting the noise. Fitting the noise happens when the machine learner learns the noisy data as if it were not noisy. Noise will often make instances in memory contradict one another.
2. RESEARCH METHOD
The following are the limitations of the related work discussed in this section.
Eliminate the Non-Functional Characters
Apply Heuristic Policies to Remove Non-Functional Symbols
Replace the following symbols with spaces: #“?$&*ó@|~!\
Remove the following characters if they are followed by a space: ; : . ,
Eliminate the following pairs of brackets if the open bracket is preceded by a space and the closed bracket is followed by a space: [] ()
Eliminate the single quotation symbol if it is followed by a space or if it is preceded by a space.
Remove s and t if they are followed by a space.
Eliminate the slash / if it is followed by a space.
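The heuristic rules above can be sketched with regular expressions. This is a hedged illustration rather than the authors' filter: the function name `remove_non_functional` is hypothetical, the rule order is an assumption, and the "s and t" rule is interpreted here as the suffixes 's and 't, which the source does not state explicitly.

```python
import re

# Symbols that rule 1 replaces with a space (copied from the list above).
SYMBOLS = '#“?$&*ó@|~!\\'

def remove_non_functional(text):
    # Rule 1: replace the listed symbols with spaces.
    text = re.sub("[" + re.escape(SYMBOLS) + "]", " ", text)
    # Rule 2: drop ; : . , when followed by a space.
    text = re.sub(r"[;:.,](?= )", "", text)
    # Rule 3: drop a bracket pair when the opening bracket is preceded by a
    # space and the closing bracket is followed by a space.
    text = re.sub(r"(?<= )\[([^\[\]]*)\](?= )", r"\1", text)
    text = re.sub(r"(?<= )\(([^()]*)\)(?= )", r"\1", text)
    # Rule 5 (assumed reading): drop the suffixes 's and 't before a space.
    text = re.sub(r"'[st](?= )", "", text)
    # Rule 4: drop a single quote that touches a space on either side.
    text = re.sub(r"'(?= )|(?<= )'", "", text)
    # Rule 6: drop a slash followed by a space.
    text = re.sub(r"/(?= )", "", text)
    # Collapse the runs of spaces left behind by the rules.
    return re.sub(r" +", " ", text).strip()

print(remove_non_functional("the (TP53) gene's role, BRCA1/ BRCA2 @ p53. "))
```

Tokens such as `p53(TP53)` or `BRCA1/BRCA2` are left untouched, because none of the rules fires without an adjacent space.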
Our proposed work overcomes all these limitations.
We take three biomedical disease datasets offline and extract hidden patterns using feature extraction and hierarchical clustering approaches. Each dataset is preprocessed to remove non-functional characters, and disease names are then identified using a gene/protein database. Hierarchical methods for supervised and unsupervised data mining give a multilevel indexing of the data. This can be relevant for several applications related to data extraction, pattern retrieval and data organization.
Figure 1. Proposed method for eliminating the Non-Functional Characters.
Figure 2. Proposed method flow chart for eliminating the Non-Functional Characters
Hierarchical Clustering Algorithm:
Input: Name entity Gene/Protein tags Tgp using NER approach, Gene/Protein DB, Probability P, Classes Positive pos, Negative neg, Tokenset Tk, Sentenceset Sen.
Read k, Threshold, Entropy weight;
Output: Quality k-abstracts.
Tgp = Get(Name_Entity_Gene/Protein_Tags)
for each tg in Tgp
  For each token in Tk
    Calculate tag probability
    List.add(tg)
[Flow chart blocks (figure residue): Offline Data; Online Data; Gene/Protein Keyword; Medline Abstracts; Prepare Dataset; Biomedical Datasets; Pre-processing; Feature Extraction; Hierarchical Clustering; Top K-Ranking Algorithm; Context Similarity; Fetch Top K Results From Biomedical Datasets; Gene/Protein Synonym]
    List.add()
    count=count+1
  end
end.
For each token t in Tk
  For each sen in Sentenceset
    If ((t ∈ Sen) && (t ∈ Tgp) && ([…] > getProb(t)))
      List Data ← Sentence_id, token, Pmid, Entropy_weight, Synonyms, Data, Title, PositiveClass
    Else
      List Data ← Sentence_id, token, Pmid, Entropy_weight, Synonyms, Data, Title, NegativeClass
  End
End
For each pair of objects in Data
  Calculate the distance between the two objects (correlation distance metric).
6.
a. Start with the disjoint clustering that has level 0 and sequence_number m = 0.
b. Rank the pairs from the smallest distance (similarities in common) to the maximal distance.
c. Calculate and count the pairs, say n pairs.
If n >= 0 do,
c.1 Take the median as the root hierarchical node.
c.2 Split the pairs into left and right side branches based on the median.
c.3 Find the least dissimilar pair of clusters in the left side and right side of the current clustering, say pair rs, ls, according to d[(rs),(ls)] = min r[(i),(j)], in which the minimum value is taken over all pairs of clusters in the current clustering.
c.4 If the left side and right side have at least one similar object, merge them into one cluster and look up the smallest value over all pairs of clusters in the current clustering.
Else
c.5 Find the most dissimilar pair of clusters in the left side and right side of the current clustering, say pair rs, ls, according to d[(rs),(ls)] = max r[(i),(j)], in which the maximum value is taken over all pairs of clusters in the current clustering.
d. Increment the sequence number: m = m + 1. (In both left and right sides) Merge clusters (r) and (s) into a single cluster to form the subsequent clustering m. Set the level of this clustering to L(m) = r[(r),(s)].
e. Revise the tree, T, by eliminating the nodes corresponding to clusters (p) and (q) and adding a node corresponding to the newly composed cluster. The proximity between the new cluster, denoted (p,q), and an old cluster (m) is defined as: d[(m),(p,q)] = min { r[(m),(p)], d[(m),(q)] }.
If d < 0 Then
$$D(c_1, c_2) = (1 - r) \times 0.5, \qquad r = \frac{\sum_{i=1}^{d} (c_{1i} - \bar{c}_1)(c_{2i} - \bar{c}_2)}{\sqrt{\sum_{i=1}^{d} (c_{1i} - \bar{c}_1)^2 \, \sum_{i=1}^{d} (c_{2i} - \bar{c}_2)^2}}$$
f. If all objects are in one cluster, stop. Else, go to step b.
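The merge loop in steps a, d, e and f resembles standard single-linkage agglomerative clustering, which can be sketched as below. This is a simplified illustration under stated assumptions: it does not implement the median split of steps c.1 to c.5, and the function name `single_linkage` and the distance-table layout are hypothetical.

```python
def single_linkage(dist):
    """Agglomerative clustering with single linkage.

    `dist` maps frozenset({a, b}) to the distance between items a and b.
    Returns the merges as (cluster_r, cluster_s, level) in merge order.
    """
    items = set()
    for pair in dist:
        items |= pair
    # Step a: start with the disjoint clustering (one cluster per object).
    clusters = [frozenset([x]) for x in sorted(items)]

    def d(c1, c2):
        # Step e: proximity of two clusters is the minimum pairwise distance.
        return min(dist[frozenset((a, b))] for a in c1 for b in c2)

    merges = []
    # Step f: repeat until all objects are in one cluster.
    while len(clusters) > 1:
        pairs = ((c1, c2) for i, c1 in enumerate(clusters)
                 for c2 in clusters[i + 1:])
        r, s = min(pairs, key=lambda p: d(*p))   # closest pair of clusters
        level = d(r, s)                          # step d: level of the merge
        clusters = [c for c in clusters if c not in (r, s)] + [r | s]
        merges.append((set(r), set(s), level))
    return merges

# Toy distances between three hypothetical objects.
toy = {frozenset("AB"): 0.1, frozenset("AC"): 0.5, frozenset("BC"): 0.4}
for r, s, level in single_linkage(toy):
    print(sorted(r), sorted(s), level)
```

The list of merge levels is what a dendrogram would plot; cutting it at a chosen level yields a flat set of clusters.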
Algorithm 2:
Input: Hierarchical clusters from top to bottom
Output: Top K Disease Results.
6.1 For each cluster in Cluster-set
6.1.1 t1 = gene/protein search keyword.
6.1.2 For each synonym in the cluster
t2 = synonym.
Find context similarity between t1 and t2.
Context Similarity Score:
End for
6.2 Sort <t1,t2> according to context similarity score.
6.3 Get abstracts from biomedical databases according to tag pair score.
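Steps 6.1 to 6.2 can be sketched as a score-and-sort loop. The sketch below assumes a bag-of-words cosine similarity between the context of the search keyword and the context of each synonym; the helper names `cosine` and `top_k`, the context strings and the synonym labels are hypothetical, and the normalization by cluster size is omitted.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two texts using word-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def top_k(keyword_context, synonym_contexts, k):
    """Rank synonyms by similarity to the keyword context (steps 6.1-6.2)."""
    scored = [(cosine(keyword_context, ctx), name)
              for name, ctx in synonym_contexts.items()]
    scored.sort(reverse=True)  # highest context similarity first
    return scored[:k]

# Made-up contexts for three hypothetical synonyms.
contexts = {
    "syn_a": "melanoma antigen protein",
    "syn_b": "folate receptor gene antigen",
    "syn_c": "unrelated sequence",
}
for score, name in top_k("leukemia antigen gene", contexts, 2):
    print(name, round(score, 3))
```

The ranked pairs would then drive step 6.3, where abstracts are fetched in order of score.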
Table 1. The Performance of ...
Variable    Speed (rpm)    Power (kW)
x           10             8.6
y           15             12.4
z           20             15.3
3. RESULTS AND ANALYSIS
Figure 3. Loading leukemia disease data
Partial Context Simiarity of Gene/Proteins in leukemia:
Context Simiarity %5.3f====>0.2526455026455026
<=== U19107_rna1_at ===> synonyms are ZNF127 (ZNF127) gene
Context Simiarity %5.3f====>0.3436507936507936
<=== U19142_at ===> synonyms are GAGE1 G antigen 1 (GAGE-1)
Context Simiarity %5.3f====>0.4829059829059829
<=== U19180_at ===> synonyms are BAGE B melanoma antigen
Context Simiarity %5.3f====>0.4363929146537842
<=== U19261_at ===> synonyms are Epstein-Barr virus-induced protein mRNA
Context Simiarity %5.3f====>0.2578347578347578
<=== U19345_at ===> synonyms are AR1 protein (AR) mRNA
Context Simiarity %5.3f====>0.43915343915343913
<=== U19487_at ===> synonyms are Prostaglandin E2 receptor mRNA
Context Simiarity %5.3f====>0.26296296296296295
<=== U19517_at ===> synonyms are (apoargC) long mRNA
Context Simiarity %5.3f====>0.38791423001949316
<=== U19523_at ===> synonyms are GCH1 GTP cyclohydrolase 1 (dopa-responsive dystonia) {alternative products}
Context Simiarity %5.3f====>0.41629629629629633
<=== U19718_at ===> synonyms are MFAP2 Microfibrillar-associated protein 2
Context Simiarity %5.3f====>0.3785004516711834
<=== U19796_at ===> synonyms are Melanoma antigen p15 mRNA
Context Simiarity %5.3f====>0.43407407407407406
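Context similarity score (Algorithm 2, step 6.1.2):
$$\sum_{i=1}^{m} \frac{\mathrm{Cos}(t_1, t_2)}{\mathrm{sizeof}(\mathrm{cluster}_i)}, \qquad t_1 \in \mathrm{keyword},\ t_2 \in \mathrm{cluster}_i$$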
<=== U19878_at ===> synonyms are Transmembrane protein mRNA
Context Simiarity %5.3f====>0.43304843304843305
<=== U19906_at ===> synonyms are VASOPRESSIN V1A RECEPTOR
Context Simiarity %5.3f====>0.0
<=== U19948_at ===> synonyms are Protein disulfide isomerase (PDIp) mRNA
Context Simiarity %5.3f====>0.2578347578347578
<=== U19977_at ===> synonyms are Preprocarboxypeptidase A2 (proCPA2) mRNA
Context Simiarity %5.3f====>0.42407407407407405
<=== U20158_at ===> synonyms are 76 kDa tyrosine phosphoprotein SLP-76 mRNA
Context Simiarity %5.3f====>0.42328042328042326
<=== U20230_at ===> synonyms are "GB DEF = Guanyl cyclase C gene, partial cds"
Context Simiarity %5.3f====>0.37777777777777777
<=== U20240_at ===> synonyms are "CEBPG CCAAT/enhancer binding protein (C/EBP), gamma"
Context Simiarity %5.3f====>0.4199860237596087
<=== U20285_at ===> synonyms are Gps1 (GPS1) mRNA
Context Simiarity %5.3f====>0.0
<=== U20325_at ===> synonyms are Cocaine and amphetamine regulated transcript CART (hCART) mRNA
Context Simiarity %5.3f====>0.41816009557945044
<=== U20350_at ===> synonyms are CMKRL1 Chemokine receptor-like 1
Context Simiarity %5.3f====>0.38078703703703703
<=== U20362_at ===> synonyms are Tg737 mRNA
Context Simiarity %5.3f====>0.4037037037037037
<=== U20391_rna6_at ===> synonyms are Folate receptor (FOLR1) gene
Context Simiarity %5.3f====>0.32936507936507936
<=== U20428_at ===> synonyms are SNC19 mRNA sequence
Context Simiarity %5.3f====>0.0
<=== U20530_at ===> synonyms are GB DEF = Bone phosphoprotein spp-24 precursor mRNA
Context Simiarity %5.3f====>0.37703703703703706
Correlation Distance Metric:
Correlation Distances:0.5246662304909925
Correlation Distances:0.5619362422999764
Correlation Distances:0.6513947712224407
Correlation Distances:0.48759512587181975
Correlation Distances:0.5319049159237761
Correlation Distances:0.5246662304909925
Correlation Distances:0.5619362422999764
Correlation Distances:0.6513947712224407
Correlation Distances:0.5319049159237761
Correlation Distances:0.5246662304909925
Correlation Distances:0.5619362422999764
Correlation Distances:0.6513947712224407
Correlation Distances:0.5319049159237761
Correlation Distances:0.5619362422999764
Correlation Distances:0.6221864849879517
Correlation Distances:0.6058234775837336
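A correlation distance of the kind listed above can be computed as in the following sketch. It assumes the distance has the form (1 - r) × 0.5, where r is the Pearson correlation of two feature vectors; this form is inferred from the partly legible formula in Section 2 and is not confirmed by the source. The function name `correlation_distance` is hypothetical.

```python
import math

def correlation_distance(x, y):
    """Return 0.5 * (1 - r), with r the Pearson correlation of x and y.

    The result is 0 for perfectly correlated vectors and 1 for perfectly
    anti-correlated ones. Assumed form, not taken verbatim from the paper.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    r = cov / (sx * sy)
    return 0.5 * (1.0 - r)

print(correlation_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.5]))
```

Under this form, values near 0.5, like most of those in the listing, would correspond to weakly correlated objects.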
=== Clustering stats for training data ===
Clustered Instances
0    11 ( 92%)
1     1 (  8%)
=== ACCURACY DETAILS===
TOTAL GENE DETECTION ACCURACY       12    100 %
ERROR RATE OF PROPOSED ALGORITHM     0      0 %
Correlation Efficiency               1
Total Number of Instances           12
Figure 4. Comparison between data size and accuracy in different datasets
Figure 5. Comparison between proposed and traditional algorithms for leukemia dataset
4. CONCLUSION
In this work, a context rank based hierarchical clustering method is applied to three medical disease datasets, namely Colon, Leukemia and MLL. An optimal rule filtering algorithm is applied to these datasets to remove unwanted special characters for gene/protein identification. This work addresses some of the limitations in the literature, such as noise elimination in medical datasets, robustness, disease prediction rate, cluster quality with a smaller search space, and true positive rate. Finally, experimental results show that the proposed method performs well in terms of time and cluster search space. In future, this work can be extended to build similar disease clusters from online medical documents such as Medline and PubMed.
REFERENCES
[1] B.F. Momin, S. Mitra, and R.D. Gupta, “Reduct Generation and Classification of Gene Expression Data”, in Proceedings of the 2006 International Conference on Hybrid Information Technology, pp. 699-708, 2006.
[2] Jung-Hsien Chiang and Shing-Hua Ho, “A Combination of Rough-Based Feature Selection and RBF Neural Network for Classification Using Gene Expression Data”, IEEE Transactions on Nanobioscience, Vol. 7, No. 1, March 2008.
[3] Masser, M.B., White, M. Katherine, Hyde and K. Melissa et al., “Predicting blood donation intentions and behavior among Australian blood donors: Testing an extended theory of planned Behavior model”, Transfusion, 49: 320-329, DOI: 10.1111/j.1537-2995.2008.01981.x, 2009.
[4] S. Gopal, A. Haake, R.P. Jones et al., Bioinformatics: a computing perspective, Int. Ed. ed.: McGraw-Hill Higher Education, 2009.
[5] Anil Rajput, Ramesh Prasad Aharwal, Nidhi Chandel, Devenra Singh Solanki and Ritu Soni, “Approaches of Classifications to Policy of Analysis of Medical Data”, IJCSNS International Journal of Computer Science and Network Security, Vol. 9, No. 11, November 2009, pp. 01-09.
[Figure 4 plot. Axes: Data size, Accuracy; datasets: Leukemia, MLL, Colon]
[Figure 5 plot. Axes: Algorithms, Accuracy; algorithms: NNGE, IBL, RBHCA]
[6] T. Santhanam and Shyam Sundaram, “Application of CART Algorithm in Blood Donors Classification”, Journal of Computer Science 6 (5): 548-552, 2010, ISSN 1549-3636, © 2010 Science Publications.
[7] Rossen Dimov et al., Weka: Practical machine Learning Tools and Techniques, April 30, 2010.
[8] Zhiwen Yu, Hau-San Wong, Jane You, Qinmin Yang, and Hongying Liao, “Knowledge Based Cluster Ensemble for Cancer Discovery from Biomolecular Data”, IEEE Transactions on Nanobioscience, Vol. 10, No. 2, June 2011.
[9] Devchand J Chaudhari, Mamta Ramteke and Manoj G Lade. Article: Data Mining in Blood Platelets Transfusion using Classification Rule. IJCA Proceedings on Emerging Trends in Computer Science and Information Technology (ETCSIT2012) etcsit1001 ETCSIT (2): 14-17, April 2012.
[10] Shahana Bano and Dr. K. Rajasekhara Rao, “Key Word Based Word Sense Extraction in A Index For TextFiles: Design Approach”, CIIT International Journal Of Data Mining And Knowledge Engineering, JAN '12.
[11] Shahana Bano and Dr. K. Rajasekhara Rao, “Key Word Based Word Sense Extraction in Text: Design Approach”, International Journal of Computer Science and Communication, March '12.
[12] Shahana Bano and Dr. K. Rajasekhara Rao, “Pattern Based Gene/Protein Synonyms Identification from Biological Databases”, International Journal of Applied Engineering Research (IJAER), Volume 9, Number 12 (2014).
BIOGRAPHIES OF AUTHORS
Shahana Bano received her MS (IS) degree in Computer Science from Montessori Mahila Kalasala, Vijayawada, and her M.Tech degree in Computer Science from K.L. College of Engineering, Vaddeswaram, and is pursuing her Ph.D at KL University. Currently, she is working as an Assistant Professor in the Department of Computer Science & Engineering at K.L University. She has 7 years of teaching experience. She has published eleven research papers in various national and international journals. She is a member of the professional society CSI.
Dr. K. Rajasekhara Rao received his Ph.D from Acharya Nagarjuna University. Currently, he is working as a Professor in the Department of Computer Science & Engineering at Sri Prakash College of Engineering. He has 28 years of teaching experience. He has published 30 research papers in various national and international journals. He is a member of the professional society CSI. He was awarded the "Best Dean" award on 30th December 2012, organized by ASDF's.