TELKOMNIKA Indonesian Journal of Electrical Engineering
Vol.12, No.6, June 2014, pp. 4322 ~ 4329
DOI: 10.11591/telkomnika.v12i6.4689

Received October 10, 2013; Revised December 25, 2013; Accepted January 20, 2014

A Java Program of Feature Extraction Algorithms for
Protein Sequences

Shanping Qiao* (1,2), Baoqiang Yan (3)
(1) School of Management Science and Engineering, Shandong Normal University
(2) Shandong Provincial Key Laboratory of Network Based Intelligent Computing, School of Information Science and Engineering, University of Jinan, No. 336, West Road of Nan Xinzhuang, Jinan 250022, China, Ph.: +86-531-89736503
(3) School of Mathematical Science, Shandong Normal University, No. 88, Cultural East Road, Jinan 250014, China, Ph.: +86-531-86182501
*Corresponding author, e-mail: qspzl@hotmail.com (1,2), yanbaoqiang666@gmail.com (3)

Abstract
The prediction of protein subcellular localizations has attracted many researchers, and over the last two decades a series of computational approaches has been developed that aim at designing an effective learning machine to handle newly found protein sequences on the basis of a feature vector. The feature extraction algorithm for protein sequences plays a vital role, because the information in the feature vector significantly influences the performance of the learning algorithm. To help users build prediction systems, three feature extraction algorithms based on amino acid composition were introduced, improved and implemented in a Java program. A comparison of its results with those from some web servers showed that the program runs correctly and performs well in both running time and user interface. Moreover, the results can easily be saved to a specified file for later use. It is anticipated that this program will be of help to researchers.

Keywords: protein subcellular location prediction, amino acid composition, feature extraction algorithm, Java

Copyright © 2014 Institute of Advanced Engineering and Science. All rights reserved.

1. Introduction
Prediction of protein subcellular localizations is an important and active topic in bioinformatics. Knowledge of a protein's subcellular location often offers important clues toward determining the function of an uncharacterized protein. A key step is therefore to determine the subcellular localization of each protein. The traditional approach is to perform physicochemical experiments, such as cell fractionation, electron microscopy and fluorescence microscopy. However, settling this problem purely by experiment is time-consuming and costly. Moreover, the number of annotated proteins has increased explosively in the post-genomic age, as illustrated in Table 1. Experimental annotation of protein subcellular localizations therefore cannot keep up with the huge number of sequences that continue to emerge from genome sequencing projects. To bridge this gap, it is highly desirable to develop computational methods for predicting protein subcellular localizations. In fact, many efforts have been made in the last two decades [1-5].

Table 1. Number of Protein Sequences in the UniProtKB/Swiss-Prot

Release Date   Database Version   Total     Experimental Annotations   Non-experimental Annotations
2003-12-15     1                  135,938   38,903                     45,391
2004-07-05     2                  148,277   41,031                     50,806
2005-05-10     5                  178,998   45,606                     65,084
2006-10-31     9                  239,174   53,510                     94,897
2007-07-24     12                 274,311   57,490                     113,135
2008-07-22     14                 390,787   64,733                     167,972
2009-09-01     15.7               495,368   68,029                     220,091
2010-07-13     2010_08            516,934   70,180                     232,546
2011-07-27     2011_08            531,326   70,552                     241,226
2012-05-16     2012_05            536,029   70,868                     245,342

These computational methods all aim at designing an effective learning machine to predict the subcellular locations of newly found protein sequences on the basis of a feature vector. Too many features in this vector may lead to the curse of dimensionality, whereas too few features would lose some necessary information. The information in the feature vector thus significantly influences the performance of the learning algorithms. As a result, how to extract informative features from a protein sequence is a key problem that deserves in-depth study. Many feature extraction algorithms have been proposed so far. In this paper, three widely used ones related to amino acid composition are introduced, improved and implemented in a Java program.
The paper is organized as follows. In the next section, we introduce the three algorithms formally. In Section 3, the Java-based implementation is described. Section 4 presents the results obtained from the program and discusses them. Finally, we conclude our work in Section 5.

2. The Proposed Three Algorithms
There are 20 native amino acids in nature. In alphabetical order, their single-letter codes are A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y. An amino acid is not only a basic building unit of proteins, but also an important physiologically active substance. The features of the amino acids determine the attributes of a protein to a great extent, so the amino acid composition is usually taken into account in feature extraction algorithms. A protein sequence that contains $N$ amino acids can be formulated as:

$$ R_1 R_2 \ldots R_N \tag{1} $$

where $R_1$ represents the first amino acid, $R_2$ the second one, and so forth. On the basis of the protein sequence and some physicochemical attributes of amino acids, three algorithms related to amino acid composition are introduced below.

2.1. Amino Acid Composition
The Amino Acid Composition (AAC) was first proposed by Nakashima [6] in 1986 in a study of the protein folding type problem. Later, it was used in predicting protein subcellular localizations and in other branches, such as predicting protein structural classes, predicting protein quaternary structure and so on. AAC is a 20-dimensional numeric vector defined as follows:

$$ X = [f_1, f_2, \ldots, f_{20}]^T \tag{2} $$

where $X$ denotes the feature vector of a protein, and $f_1$, $f_2$, …, $f_{20}$ are the composition components (i.e., frequencies) of the 20 amino acids. The percentage of amino acid $i$ in a protein is defined by:

$$ f_i = \frac{n_i}{N} \times 100, \quad (i = 1, 2, \ldots, 20) \tag{3} $$

where $n_i$ is the number of occurrences of amino acid $i$, and $N$ is the number of amino acid residues in the protein sequence. AAC is simple and was widely used in the early days. However, the sequence-order information is completely lost in AAC, so predicting protein attributes with AAC alone tends to give low accuracy.
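
As a quick check of Eq. (3), consider a hypothetical five-residue sequence ACDAA. It is chosen only for illustration and is not taken from the paper's test data.

```latex
N = 5,\quad n_A = 3,\; n_C = 1,\; n_D = 1
\;\Longrightarrow\;
f_A = \frac{3}{5}\times 100 = 60,\qquad
f_C = f_D = \frac{1}{5}\times 100 = 20
```

The remaining 17 components are 0, and the 20 components sum to 100. Any permutation of the same residues, such as AAACD, yields the same vector, which illustrates the loss of sequence-order information in AAC.
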
2.2. Pseudo Amino Acid Composition
The Pseudo Amino Acid Composition (PseAAC) algorithm was proposed by Chou [7] in 2001. It was designed to improve the prediction quality of protein attributes, including the subcellular localization and the membrane protein types. Compared with the conventional AAC, PseAAC not only converts protein sequences of various lengths to
fixed-length numeric vectors, but also retains considerable sequence-order information. The formulation of PseAAC is defined by:

$$ X = [x_1, \ldots, x_{20}, x_{20+1}, \ldots, x_{20+\lambda}]^T \tag{4} $$

where

$$ x_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \theta_k}, & (1 \le u \le 20) \\[2ex] \dfrac{w\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \theta_k}, & (20+1 \le u \le 20+\lambda) \end{cases} \tag{5} $$

where $f_i$ $(i = 1, 2, \ldots, 20)$ are the normalized occurrence frequencies of the 20 amino acids in protein $X$, $w$ is the weight factor for the sequence-order effect, and $\theta_k$ is the $k$-tier sequence correlation factor computed according to Equations (6)-(9) for this protein.

$$ \theta_k = \frac{1}{N-k} \sum_{i=1}^{N-k} J_{i,i+k}, \quad (k < N) \tag{6} $$

where $\theta_1$ is called the first-tier correlation factor, which reflects the sequence-order correlation between all the most contiguous residues along a protein chain, $\theta_2$ the second-tier correlation factor, and so forth.

$$ J_{i,i+k} = \frac{1}{3}\left\{ [H_1(R_{i+k}) - H_1(R_i)]^2 + [H_2(R_{i+k}) - H_2(R_i)]^2 + [M(R_{i+k}) - M(R_i)]^2 \right\} \tag{7} $$

where $H_1(R_{i+k})$, $H_2(R_{i+k})$ and $M(R_{i+k})$ are, respectively, the hydrophobicity value, hydrophilicity value, and side-chain mass of amino acid $R_{i+k}$; and $H_1(R_i)$, $H_2(R_i)$ and $M(R_i)$ are the corresponding values for amino acid $R_i$. These values are obtained by:

$$ H_1(R_i) = \frac{H_1^0(R_i) - \langle H_1^0 \rangle}{SD(H_1^0)}, \quad H_2(R_i) = \frac{H_2^0(R_i) - \langle H_2^0 \rangle}{SD(H_2^0)}, \quad M(R_i) = \frac{M^0(R_i) - \langle M^0 \rangle}{SD(M^0)} \tag{8} $$

where $H_1^0(R_i)$ is the original hydrophobicity value of the $i$th amino acid, $H_2^0(R_i)$ the original hydrophilicity value, and $M^0(R_i)$ the mass of the $i$th amino acid side chain. These three values are shown in Table 2. In Equation (8), $\langle \cdot \rangle$ denotes the mean of the corresponding value over all the 20 amino acids, and $SD$ denotes the standard deviation. These two quantities are defined by Equation (9), in which $\xi$ stands for any one of $H_1^0$, $H_2^0$ and $M^0$, and $\xi_i$ for its value for the $i$th of the 20 amino acids.

$$ \langle \xi \rangle = \frac{1}{20} \sum_{i=1}^{20} \xi_i, \qquad SD(\xi) = \sqrt{\frac{1}{20} \sum_{i=1}^{20} \left( \xi_i - \langle \xi \rangle \right)^2} \tag{9} $$

Only three physicochemical properties (i.e., hydrophobicity, hydrophilicity and mass) are used in the standard PseAAC. In order to add more properties to PseAAC, we enhanced it here. Let $n$ be the number of properties; the new formulations are given as follows:

$$ J_{i,i+k} = \frac{1}{n}\left\{ [V_1(R_{i+k}) - V_1(R_i)]^2 + [V_2(R_{i+k}) - V_2(R_i)]^2 + \cdots + [V_n(R_{i+k}) - V_n(R_i)]^2 \right\} \tag{10} $$

where $V_1(R_{i+k})$, $V_2(R_{i+k})$, ..., and $V_n(R_{i+k})$ are, respectively, the $j$th $(j = 1, 2, \ldots, n)$ property values of amino acid $R_{i+k}$; and $V_1(R_i)$, $V_2(R_i)$, ..., and $V_n(R_i)$ are the corresponding values for amino acid $R_i$. The values of these properties are obtained by:

$$ V_1(R_i) = \frac{V_1^0(R_i) - \langle V_1^0 \rangle}{SD(V_1^0)}, \quad V_2(R_i) = \frac{V_2^0(R_i) - \langle V_2^0 \rangle}{SD(V_2^0)}, \quad \ldots, \quad V_n(R_i) = \frac{V_n^0(R_i) - \langle V_n^0 \rangle}{SD(V_n^0)} \tag{11} $$

Here $V_j^0(R_i)$ is the original value of the $j$th property, and the mean $\langle V_j^0 \rangle$ and the standard deviation $SD(V_j^0)$ over the 20 amino acids are calculated in the same way as in Equation (9). After this reformulation, any valuable property can be added to Equation (10), which significantly improves the informativeness and flexibility of PseAAC.

Table 2. The Original Hydrophobicity, Hydrophilicity and Mass Values of 20 Amino Acids

Number of amino acid   Letter of amino acid   Hydrophobicity   Hydrophilicity   Mass
1                      A                       0.62            -0.5              15.0
2                      C                       0.29            -1.0              47.0
3                      D                      -0.90             3.0              59.0
4                      E                      -0.74             3.0              73.0
5                      F                       1.19            -2.5              91.0
6                      G                       0.48             0.0               1.0
7                      H                      -0.40            -0.5              82.0
8                      I                       1.38            -1.8              57.0
9                      K                      -1.50             3.0              73.0
10                     L                       1.06            -1.8              57.0
11                     M                       0.64            -1.3              75.0
12                     N                      -0.78             0.2              58.0
13                     P                       0.12             0.0              42.0
14                     Q                      -0.85             0.2              72.0
15                     R                      -2.53             3.0             101.0
16                     S                      -0.18             0.3              31.0
17                     T                      -0.05            -0.4              45.0
18                     V                       1.08            -1.5              43.0
19                     W                       0.81            -3.4             130.0
20                     Y                       0.26            -2.3             107.0
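
The standardization of Equations (8) and (9) can be sketched in Java using the hydrophobicity column of Table 2. This is a minimal illustration under stated assumptions, not the paper's implementation: the class and method names are hypothetical, and SD is taken in its population form (division by 20).

```java
/** Minimal sketch of Eqs. (8)-(9): standardizing one property over the 20 amino acids. */
final class StandardizeSketch {
    /** Original hydrophobicity values from Table 2, in the order A, C, D, ..., Y. */
    static final double[] HYDROPHOBICITY = {
        0.62, 0.29, -0.90, -0.74, 1.19, 0.48, -0.40, 1.38, -1.50, 1.06,
        0.64, -0.78, 0.12, -0.85, -2.53, -0.18, -0.05, 1.08, 0.81, 0.26
    };

    /** Returns (v_i - mean) / SD for every entry (assumption: population SD). */
    static double[] standardize(double[] raw) {
        double mean = 0.0;
        for (double v : raw) {
            mean += v;
        }
        mean /= raw.length;

        double sumSq = 0.0;
        for (double v : raw) {
            sumSq += (v - mean) * (v - mean);
        }
        double sd = Math.sqrt(sumSq / raw.length);
        if (sd == 0.0) {
            throw new IllegalArgumentException("property is constant; SD is zero");
        }

        double[] out = new double[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = (raw[i] - mean) / sd;
        }
        return out;
    }
}
```

After this step each property has zero mean and unit variance over the 20 amino acids, so properties measured on very different scales (for example hydrophobicity and mass) contribute comparably to the correlation factors.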

2.3. Amphiphilic Pseudo Amino Acid Composition
In 2005, Chou [8] proposed a novel representation of the protein feature vector, named “Amphiphilic Pseudo Amino Acid Composition” (AmPseAAC), for predicting enzyme subfamily classes. AmPseAAC contains $20 + 2\lambda$ discrete numbers: the first 20 numbers are the components of the conventional AAC; the next $2\lambda$ numbers are a set of correlation factors that reflect different hydrophobicity and hydrophilicity distribution patterns along a protein chain. The formulation of AmPseAAC is defined by:

$$ X = [x_1, \ldots, x_{20}, x_{20+1}, \ldots, x_{20+\lambda}, x_{20+\lambda+1}, \ldots, x_{20+2\lambda}]^T \tag{12} $$

where

$$ x_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{2\lambda} \tau_k}, & (1 \le u \le 20) \\[2ex] \dfrac{w\,\tau_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{2\lambda} \tau_k}, & (20+1 \le u \le 20+2\lambda) \end{cases} \tag{13} $$

where $f_i$ $(i = 1, 2, \ldots, 20)$ are the normalized occurrence frequencies of the 20 amino acids in the protein $X$, $\tau_k$ is the sequence correlation factor computed according to Equation (14) for this protein, and $w$ is the weight factor.

$$ \begin{cases} \tau_1 = \dfrac{1}{N-1} \sum_{k=1}^{N-1} H^1_{k,k+1} \\[1.5ex] \tau_2 = \dfrac{1}{N-1} \sum_{k=1}^{N-1} H^2_{k,k+1} \\ \quad\vdots \\ \tau_{2\lambda-1} = \dfrac{1}{N-\lambda} \sum_{k=1}^{N-\lambda} H^1_{k,k+\lambda} \\[1.5ex] \tau_{2\lambda} = \dfrac{1}{N-\lambda} \sum_{k=1}^{N-\lambda} H^2_{k,k+\lambda} \end{cases} \quad (\lambda < N) \tag{14} $$

where $\tau_1$ and $\tau_2$ are called the first-tier correlation factors, which reflect the sequence-order correlation between all the most contiguous residues along a protein chain through hydrophobicity and hydrophilicity, respectively, $\tau_3$ and $\tau_4$ the second-tier correlation factors, and so forth. In Equation (14), $H^1_{k,j}$ and $H^2_{k,j}$ are the hydrophobicity and hydrophilicity correlation functions given by:

$$ H^1_{k,j} = h^1(R_k) \cdot h^1(R_j), \qquad H^2_{k,j} = h^2(R_k) \cdot h^2(R_j) \tag{15} $$

where $h^1(R_i)$ and $h^2(R_i)$ are, respectively, the hydrophobicity and hydrophilicity values of the $i$th $(i = 1, 2, \ldots, N)$ amino acid in this protein. Note that $h^1(R_i)$ and $h^2(R_i)$ are standardized in the same way as described in Section 2.2.
Similarly, only two physicochemical properties (i.e., hydrophobicity and hydrophilicity) are considered in the standard AmPseAAC. Following the same approach as for PseAAC, we also enhanced AmPseAAC here. Let $n$ be the number of properties; the new formulations of Equations (12)-(15) are given as the following four equations:

$$ X = [x_1, \ldots, x_{20}, x_{20+1}, \ldots, x_{20+\lambda}, x_{20+\lambda+1}, \ldots, x_{20+n\lambda}]^T \tag{16} $$

$$ x_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{n\lambda} \tau_k}, & (1 \le u \le 20) \\[2ex] \dfrac{w\,\tau_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{n\lambda} \tau_k}, & (20+1 \le u \le 20+n\lambda) \end{cases} \tag{17} $$

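$$ \begin{cases} \tau_1 = \dfrac{1}{N-1} \sum_{k=1}^{N-1} V^1_{k,k+1} \\[1.5ex] \tau_2 = \dfrac{1}{N-1} \sum_{k=1}^{N-1} V^2_{k,k+1} \\ \quad\vdots \\ \tau_n = \dfrac{1}{N-1} \sum_{k=1}^{N-1} V^n_{k,k+1} \\ \quad\vdots \\ \tau_{n(\lambda-1)+1} = \dfrac{1}{N-\lambda} \sum_{k=1}^{N-\lambda} V^1_{k,k+\lambda} \\[1.5ex] \tau_{n(\lambda-1)+2} = \dfrac{1}{N-\lambda} \sum_{k=1}^{N-\lambda} V^2_{k,k+\lambda} \\ \quad\vdots \\ \tau_{n\lambda} = \dfrac{1}{N-\lambda} \sum_{k=1}^{N-\lambda} V^n_{k,k+\lambda} \end{cases} \quad (\lambda < N) \tag{18} $$

$$ V^1_{k,j} = v^1(R_k) \cdot v^1(R_j), \quad V^2_{k,j} = v^2(R_k) \cdot v^2(R_j), \quad \ldots, \quad V^n_{k,j} = v^n(R_k) \cdot v^n(R_j) \tag{19} $$
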
3. Research Method
To implement the three improved algorithms above, a class named Representation was defined in Java. At the same time, a flexible design was adopted so that more attributes can be used in the algorithms. That is, the number of attributes used in PseAAC and AmPseAAC is no longer limited to three and two, respectively. Users can use any attribute to calculate PseAAC and AmPseAAC. All the methods are described below in detail in pseudo-code.

3.1. AAC Feature Vector
Input the protein sequence P;
Calculate the sequence length N;
Initialize a vector named vs with the type of double[] and the length of 20;
For each amino acid AAi in P
    Count the number of AAi and save it into vs;
Endfor
Normalize vs into the [0, 100] interval according to eq. (3);
Return vs;
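
The AAC pseudo-code can be sketched in Java as follows. This is a minimal illustration rather than the paper's Representation class: the class name AacSketch and the decision to reject letters outside the 20 native codes are assumptions made for the example.

```java
/** Minimal sketch of the AAC feature vector, Eq. (3). */
final class AacSketch {
    /** The 20 native amino acids in alphabetical order of their one-letter codes. */
    static final String ALPHABET = "ACDEFGHIKLMNPQRSTVWY";

    /**
     * Returns the 20-dimensional AAC vector, in percent.
     * Assumption: characters outside ALPHABET (e.g. X, B, Z) are rejected.
     */
    static double[] aac(String sequence) {
        if (sequence == null || sequence.isEmpty()) {
            throw new IllegalArgumentException("sequence must be non-empty");
        }
        double[] vs = new double[ALPHABET.length()];
        for (char c : sequence.toUpperCase().toCharArray()) {
            int idx = ALPHABET.indexOf(c);
            if (idx < 0) {
                throw new IllegalArgumentException("unknown residue: " + c);
            }
            vs[idx] += 1.0;                 // n_i: occurrence count of amino acid i
        }
        int n = sequence.length();          // N: number of residues
        for (int i = 0; i < vs.length; i++) {
            vs[i] = vs[i] / n * 100.0;      // f_i = n_i / N * 100
        }
        return vs;
    }
}
```

For example, `AacSketch.aac("AACD")` places 50 in the A component and 25 in each of the C and D components, with all other components equal to 0.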

3.2. PseAAC Feature Vector
Input the protein sequence P, w, λ, and attribute values;
Calculate the sequence length N;
Call the AAC algorithm to get the vector of AAC named vs1;
Initialize a vector named vs2 with the type of double[] and the length of λ;
Convert each attribute value to its standard form according to eq. (9) and (11);
For i = 1 to λ
    For j = 1 to N – i
        For each attribute value
            Get the two amino acids at positions j and j + i;
            Calculate the attribute value according to eq. (10);
        Endfor
    Endfor
    Calculate the i-tier correlation factor according to eq. (6) and save it into vs2;
Endfor
Connect vs1 and vs2 together into a new vector named vs with the length of 20 + λ;
Normalize vs into [0, 100) interval;
Return vs;
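
A hedged Java sketch of the generalized PseAAC is given below. It follows Eqs. (5), (6) and (10) under several assumptions that are not taken from the paper's code: the class name is hypothetical, the property table `props` is assumed to be already standardized per Eq. (11) and indexed in the order A, C, D, ..., Y, and the result is returned as the components of Eq. (5) (summing to 1) without the final rescaling step of the pseudo-code.

```java
/** Minimal sketch of the generalized PseAAC, Eqs. (4)-(6) and (10). */
final class PseAacSketch {
    static final String ALPHABET = "ACDEFGHIKLMNPQRSTVWY";

    /**
     * @param seq    protein sequence over ALPHABET, with length N > lambda
     * @param props  props[j][a]: standardized value of property j for amino acid a
     * @param lambda number of correlation tiers
     * @param w      weight factor for the sequence-order effect
     * @return vector of length 20 + lambda (components of Eq. 5)
     */
    static double[] pseAac(String seq, double[][] props, int lambda, double w) {
        int n = seq.length();
        if (props.length == 0 || lambda < 0 || lambda >= n) {
            throw new IllegalArgumentException("need properties and 0 <= lambda < N");
        }
        int[] idx = new int[n];
        double[] f = new double[20];
        for (int i = 0; i < n; i++) {
            idx[i] = ALPHABET.indexOf(seq.charAt(i));
            if (idx[i] < 0) {
                throw new IllegalArgumentException("unknown residue at " + i);
            }
            f[idx[i]] += 1.0 / n;                 // normalized occurrence frequency
        }
        double[] theta = new double[lambda];
        for (int k = 1; k <= lambda; k++) {       // Eq. (6): k-tier correlation factor
            double sum = 0.0;
            for (int i = 0; i + k < n; i++) {
                double j = 0.0;                   // Eq. (10): mean squared difference
                for (double[] p : props) {
                    double d = p[idx[i + k]] - p[idx[i]];
                    j += d * d;
                }
                sum += j / props.length;
            }
            theta[k - 1] = sum / (n - k);
        }
        double denom = 0.0;                       // shared denominator of Eq. (5)
        for (double v : f) {
            denom += v;
        }
        for (double t : theta) {
            denom += w * t;
        }
        double[] x = new double[20 + lambda];
        for (int u = 0; u < 20; u++) {
            x[u] = f[u] / denom;
        }
        for (int k = 0; k < lambda; k++) {
            x[20 + k] = w * theta[k] / denom;
        }
        return x;
    }
}
```

Passing the property table as an argument is what lets any number of properties be used, which mirrors the flexibility described in Section 2.2.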

3.3. AmPseAAC Feature Vector
Input the protein sequence P, w, λ, and attribute values;
Calculate the sequence length N;
Call the AAC algorithm to get the vector of AAC named vs1;
Calculate the number of attributes n;
Initialize a vector named vs2 with the type of double[] and the length of λn;
Convert each attribute value to its standard form according to eq. (9) and (11);
For i = 1 to λn
    For each attribute value
        For j = 1 to N – ((i - 1) / n + 1)
            Get the two amino acids at positions j and j + ((i - 1) / n + 1);
            Calculate the attribute value according to eq. (19);
        Endfor
    Endfor
    Calculate the i-tier correlation factor according to eq. (18) and save it into vs2;
Endfor
Connect vs1 and vs2 together into a new vector named vs with the length of 20 + λn;
Normalize vs into [0, 100) interval;
Return vs;
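
The generalized AmPseAAC can be sketched in the same spirit. The code below is an illustrative reading of Eqs. (17)-(19), not the paper's implementation: the names are hypothetical, `props` is assumed to be standardized already, and factor index t (0-based) is mapped to tier t / numProps + 1 and property t % numProps, matching the (i - 1) / n + 1 expression in the pseudo-code under integer division.

```java
/** Minimal sketch of the generalized AmPseAAC, Eqs. (16)-(19). */
final class AmPseAacSketch {
    static final String ALPHABET = "ACDEFGHIKLMNPQRSTVWY";

    /**
     * @param seq    protein sequence over ALPHABET, with length N > lambda
     * @param props  props[j][a]: standardized value of property j for amino acid a
     * @param lambda number of correlation tiers
     * @param w      weight factor
     * @return vector of length 20 + lambda * props.length (components of Eq. 17)
     */
    static double[] amPseAac(String seq, double[][] props, int lambda, double w) {
        int length = seq.length();
        int numProps = props.length;
        if (numProps == 0 || lambda < 0 || lambda >= length) {
            throw new IllegalArgumentException("need properties and 0 <= lambda < N");
        }
        int[] idx = new int[length];
        double[] f = new double[20];
        for (int i = 0; i < length; i++) {
            idx[i] = ALPHABET.indexOf(seq.charAt(i));
            if (idx[i] < 0) {
                throw new IllegalArgumentException("unknown residue at " + i);
            }
            f[idx[i]] += 1.0 / length;            // normalized occurrence frequency
        }
        double[] tau = new double[numProps * lambda];
        for (int t = 0; t < tau.length; t++) {
            int tier = t / numProps + 1;          // 1-based tier of this factor
            double[] p = props[t % numProps];     // property used by this factor
            double sum = 0.0;
            for (int k = 0; k + tier < length; k++) {
                sum += p[idx[k]] * p[idx[k + tier]];   // Eq. (19): product form
            }
            tau[t] = sum / (length - tier);            // Eq. (18)
        }
        double denom = 0.0;                       // shared denominator of Eq. (17)
        for (double v : f) {
            denom += v;
        }
        for (double t : tau) {
            denom += w * t;
        }
        double[] x = new double[20 + tau.length];
        for (int u = 0; u < 20; u++) {
            x[u] = f[u] / denom;
        }
        for (int t = 0; t < tau.length; t++) {
            x[20 + t] = w * tau[t] / denom;
        }
        return x;
    }
}
```

Because Eq. (19) is a product of standardized values, individual correlation factors can be negative, so the corresponding components of the vector can be negative as well.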

4. Results and Discussion
To make these algorithms easy to use, a friendly graphical user interface, shown in Figure 1, was provided.

Figure 1. The Graphical User Interface

Through this interface, users can input or load sequences, choose and/or input attribute values, select algorithms and input the parameters. To allow batch calculation, users can load a number of protein sequences at one time from a file in FASTA format. The five frequently used attributes are provided directly in the interface. For more attributes, users can input them in the text area according to the specified format. The parameters w and λ are needed for PseAAC and AmPseAAC, but not for AAC. To make operation easier for users, the program enables or disables these two components automatically according
to the user's selection. After all the necessary data are given, the “Calculate” function computes the results and shows them dynamically in a JTable. The row for each protein contains its accession number, its sequence and the values of the selected algorithm(s). Pressing the “Save...” button saves the results to a specified file in text format, one line per protein.
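
The batch workflow described above (loading FASTA records and saving one line per protein) can be sketched as follows. The paper does not specify the program's file handling, so everything here is an assumption for illustration: the class and method names, the use of the full header line as the identifier, and the tab-separated output with four decimal places.

```java
/** Minimal sketch of FASTA loading and one-line-per-protein output. */
final class FastaSketch {
    /** Parses FASTA text into an insertion-ordered map from header (without '>') to sequence. */
    static java.util.Map<String, String> parse(String fastaText) {
        java.util.Map<String, StringBuilder> records = new java.util.LinkedHashMap<>();
        StringBuilder current = null;
        for (String line : fastaText.split("\\R")) {   // \R matches any line break
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                current = new StringBuilder();
                records.put(line.substring(1).trim(), current);
            } else if (current != null) {
                current.append(line);                  // sequences may span several lines
            }
        }
        java.util.Map<String, String> out = new java.util.LinkedHashMap<>();
        for (java.util.Map.Entry<String, StringBuilder> e : records.entrySet()) {
            out.put(e.getKey(), e.getValue().toString());
        }
        return out;
    }

    /** Formats one result line: header, sequence, then the feature values, tab-separated. */
    static String toLine(String header, String sequence, double[] features) {
        StringBuilder sb = new StringBuilder(header).append('\t').append(sequence);
        for (double v : features) {
            sb.append('\t').append(String.format(java.util.Locale.ROOT, "%.4f", v));
        }
        return sb.toString();
    }
}
```

Formatting with `Locale.ROOT` keeps the decimal separator a period regardless of the user's system locale, which makes the saved file easier to read back in other tools.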
We chose three sequences to test this program. The results showed that it runs normally. To verify whether the results are correct, we compared them with the results obtained from the PseAAC web server at http://www.csbio.sjtu.edu.cn/bioinf/PseAAC [9]. The comparison showed that our program gave the correct answers. Compared with the PseAAC web server, our program consumed much less time as the number of features increased. Moreover, the PseAAC web server cannot add features other than its predefined ones.

5. Conclusion
Feature extraction is a key part of designing a learning algorithm to predict the subcellular localizations of proteins. Three widely used feature extraction algorithms were improved in this work, and their corresponding pseudo-code was described in detail. The friendly interface provides convenience to users. Each algorithm can create a numeric feature vector quickly. On the basis of these vectors, a new vector can be created through a weighting or fusion strategy, which is expected to give prediction algorithms good performance. An intelligent computing algorithm, such as PSO [10] or GA [11], for creating a new optimized vector will be developed in the future.

Acknowledgements
The work was supported by National Natural Science Foundation of China under Grant No. 61302128 and Doctoral Foundation of University of Jinan under Grant No. XBS1318.

References
[1] Kuochen C, Hongbin S. Recent Progress in Protein Subcellular Location Prediction. Anal Biochem. 2007; 370(1): 1-16.
[2] Imai K, Nakai K. Prediction of Subcellular Locations of Proteins: Where to Proceed?. Proteomics. 2010; 10(22): 3970-3983.
[3] Kuochen C. Some Remarks on Protein Attribute Prediction and Pseudo Amino Acid Composition. J Theor Biol. 2011; 273(1): 236-247.
[4] Pufeng D, Chao X. Predicting Multisite Protein Subcellular Locations: Progress and Challenges. Expert Rev Proteomics. 2013; 10(3): 227-237.
[5] Kuochen C. Some Remarks on Predicting Multi-Label Attributes in Molecular Biosystems. Mol Biosyst. 2013; 9(6): 1092-1100.
[6] Nakashima H, Nishikawa K, Ooi T. The Folding Type of a Protein Is Relevant to the Amino Acid Composition. J Biochem. 1986; 99(1): 153-162.
[7] Kuochen C. Prediction of Protein Cellular Attributes Using Pseudo-Amino Acid Composition. Proteins: Struct, Funct, Genet. 2001; 43(3): 246-255.
[8] Kuochen C. Using Amphiphilic Pseudo Amino Acid Composition to Predict Enzyme Subfamily Classes. Bioinformatics. 2005; 21(1): 10-19.
[9] Hongbin S, Kuochen C. PseAAC: A Flexible Web Server for Generating Various Kinds of Protein Pseudo Amino Acid Composition. Anal Biochem. 2008; 373(2): 386-388.
[10] Yu M, Lichen G. Fuzzy Immune PID Control of Hydraulic System Based on PSO Algorithm. TELKOMNIKA Indonesian Journal of Electrical Engineering. 2013; 11(2): 890-895.
[11] Xuesong Y, Qinghua W, Can Z, et al. An Improved Genetic Algorithm and Its Application. TELKOMNIKA Indonesian Journal of Electrical Engineering. 2012; 10(5): 1081-1086.