Indonesian Journal of Electrical Engineering and Computer Science
Vol. 16, No. 3, December 2019, pp. 1286-1296
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v16i3.pp1286-1296
A streaming multi-class support vector machine classification architecture for embedded systems
Jeevan Sirkunan, J.W. Tang, N. Shaikh-Husin, M. N. Marsono
School of Electrical Engineering, Faculty of Engineering, Universiti Teknologi Malaysia, Johor, Malaysia
Article Info

Article history:
Received Dec 28, 2018
Revised Jun 1, 2019
Accepted Jun 28, 2019

Keywords:
Multi-class SVM
FPGA
embedded system

ABSTRACT
Pedestrian detection, face detection, speech recognition and object detection are some of the applications that have benefited from hardware-accelerated Support Vector Machine (SVM). The computational complexity of SVM classification makes it challenging to design a hardware architecture with real-time performance and low power consumption. On an embedded streaming architecture, testing data are mostly stored in external memory. Data are transferred in streams, with the maximum bandwidth limited by the bus bandwidth. The hardware implementation of SVM classification needs to be sufficiently fast to keep up with the data transfer speed. A prior implementation throttles the data input to avoid overwhelming the computational unit, which results in a bottleneck in the streaming architecture. In this work, we propose a streaming-architecture multi-class SVM classification for an embedded system that is fully pipelined and able to process data continuously without any need to throttle the input data stream. The proposed design is targeted at embedded platforms where testing data are transferred in streams from external memories. The architecture is modeled in Verilog and the evaluation targets an Altera Cyclone IV field programmable gate array platform. Performance profiling of the proposed architecture is done with regard to the number of features and support vectors. For validation, the proposed architecture is simulated using ModelSim and the results are compared with LibSVM. Based on the simulation results, the proposed architecture is able to produce a throughput of 1/N_f classifications per clock cycle, where N_f is the number of features.

Copyright (c) 2019 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
Jeevan Sirkunan,
School of Electrical Engineering, Faculty of Engineering,
Universiti Teknologi Malaysia, Johor, Malaysia.
Email: jeevan2@live.utm.my
1. INTRODUCTION
Pervasive computing leads to the emergence of applications that mandate computational intelligence and analytics. Entertainment, sensor networks, health care and environmental monitoring are some example applications that have embedded pervasive computing into their systems [1]. These applications need to process large amounts of data, which requires massive data parallelism and high-bandwidth data transfer between the processors and off-chip memory. Such a data access pattern renders on-chip caches mostly ineffective [2]. At the same time, some of these applications require a small, low-power processor with real-time operating performance, especially in an embedded system [1]. There are several ways of dealing with this demand. One method is to use reconfigurable hardware such as a field programmable gate array (FPGA), which provides high-performance computation at minimal cost overhead and power consumption [3].

Journal homepage: http://ijeecs.iaescore.com/index.php/IJEECS
Support Vector Machine (SVM) is an accurate binary classifier that is based on a solid theoretical background [4-6]. Some applications that have benefited from SVM hardware acceleration are pedestrian detection [7], face detection [8], speech recognition [9] and object detection [10, 11]. Due to SVM computational complexity, designing a hardware architecture with low power consumption and real-time classification performance still remains a challenge [12]. Many existing custom SVM hardware implementations focus on accelerating the decision function, which is application specific [8, 13]. These designs cannot be easily reused for other applications as they are optimized for specific applications [8]. Works that target acceleration of SVM with a co-processor unit [2, 14, 15, 16] tend to focus on the kernel, due to its compute-intensive nature and its innate suitability for parallelization. The performance of a single accelerator unit or a co-processor unit is highly dependent on the rate of data transfer. With limited on-chip memory in an embedded system, data are stored in off-chip memory. Data are transferred in streams and the maximum bandwidth depends on the bus width of a particular platform.

Kane et al. [12] proposed a fully-pipelined SVM architecture that supports multi-class classification and was tested with a wide range of data sets. However, in their implementation, input throttling was required to prevent the data input from overwhelming the processing unit. The throughput of their proposed architecture is therefore limited under high load.

In this work, we propose a multi-class, fully pipelined SVM classification streaming architecture for embedded systems. The architecture is able to produce output at the same rate as the data input, without the need for input throttling. The proposed hardware architecture is based on the LibSVM model data structure. Performance evaluation of the proposed hardware implementation is done with regard to the number of support vectors and features. Results show that the proposed design has an initial latency approximately equal to the number of support vectors and is able to process data continuously without any need to throttle the data transfer. This feature is crucial as testing data are transferred in streams from external memory.
2. RESEARCH METHOD
Traditionally, there have been two fundamentally different types of tasks in machine learning: unsupervised and supervised. The goal of unsupervised learning is to find patterns and structure in the data, while the goal of supervised learning is to learn a mapping from a labeled data set [18]. SVM falls under the latter category. In general, SVM tasks can be divided into two: training and classification. Based on a labeled data set, SVM is trained to obtain its decision function. With the decision function, unknown data can then be classified.
2.1. Training
SVM is inherently a binary classifier; in the training phase, it determines the decision boundary that maximizes the margin between two classes [19]. In order to handle data which cannot be separated linearly in a low-dimensional feature space, kernel functions are used to project the data into a high-dimensional space so that it can be linearly separated [20]. Equations (1) and (2) show the dual Lagrange problem to obtain the SVM decision function. K(x_i, x_j) is the kernel function, b is the bias parameter and C is the trade-off parameter between maximizing the margin and minimizing the error.

Maximize
    L = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
subject to
    0 \le \alpha_i \le C,  \quad  \sum_{i=1}^{N} \alpha_i y_i = 0    (1)

    b = y_j - \sum_{i=1}^{N} \alpha_i y_i K(x_i, x_j)    (2)

where:
    \alpha_i : Lagrangian multiplier
    N        : total amount of training data
    x_i      : training data (feature set)
    y_i      : training data (label)
The main aim of the training functions (equations (1) and (2)) is to obtain the \alpha_i and b values. If \alpha_i is 0, the corresponding training feature vector x_i is not a support vector. If \alpha_i is very high, then the corresponding training feature vector x_i has a high influence over the decision surface of the hyperplane (x_i becomes a support vector).
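To make the role of the \alpha_i values concrete, the following is an illustrative sketch (not from the paper) that trains a small linear SVM with scikit-learn, whose SVC class wraps LibSVM. The fitted model exposes exactly the quantities described above: the support vectors (training points with non-zero \alpha_i), the signed coefficients \alpha_i y_i, and the bias b. The toy data set is our own invention.

```python
# Illustrative sketch: train a linear SVM and inspect which training
# points became support vectors. scikit-learn's SVC wraps LibSVM.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_vectors_)   # training vectors x_i with alpha_i > 0
print(clf.dual_coef_)         # signed coefficients alpha_i * y_i
print(clf.intercept_)         # bias b
```

Note that the dual coefficients sum to zero, reflecting the constraint \sum \alpha_i y_i = 0 in equation (1).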
2.2. Classification
Equation (3) shows the SVM decision function d(u) for classification. The computation only extends over the number of support vectors N_sv, where the support vectors are a subset of the training data. The class of an unknown input u is determined by the sign of the function in equation (3).

    d(u) = sign\left( \sum_{i=1}^{N_{sv}} \alpha_i y_i K(x_i, u) + b \right)    (3)
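A minimal software sketch of equation (3) follows; the helper name `decide` and the toy values are ours, not the paper's. The sum runs only over the N_sv support vectors, and the predicted class is the sign of the result.

```python
# Sketch of the SVM decision function d(u) in equation (3).
import numpy as np

def decide(u, sv, coef, b, kernel):
    """sv: (N_sv, N_f) support vectors; coef: alpha_i * y_i; b: bias."""
    total = sum(a * kernel(x_i, u) for a, x_i in zip(coef, sv)) + b
    return 1 if total >= 0 else -1

linear = lambda x, u: float(np.dot(x, u))

sv   = np.array([[0.2, 0.1], [1.0, 1.0]])   # support vectors
coef = np.array([-0.5, 0.5])                # alpha_i * y_i from training
b    = -0.3
print(decide(np.array([1.2, 1.0]), sv, coef, b, linear))   # -> 1
```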
2.3.
K
er
nel
Function
K
ernel
functions
ef
ficiently
map
non-linear
datasets
to
a
high
dimensional
linear
feature
space.
K
ernel
function
allo
ws
SVM
to
handle
non-linear
data.
The
type
of
k
ernel
to
be
emplo
yed
is
entirely
dependent
on
the
characteristics
of
the
appli
cation
data
sets,
which
is
be
yond
the
scope
of
this
paper
.
Some
of
the
main
k
ernel
functions
that
are
used
in
literature
are
linear
(equation
(4)),
polynomial
(equation
(5)),
Gaussian
radial
function
(RBF)
(equation
(6))
and
sigmoid
(equation
(7)).
K
(
x
i
;
u
)
=
x
i
u
(4)
K
(
x
i
;
u
)
=
(
(
x
i
u
)
+
r
)
d
(5)
K
(
x
i
;
u
)
=
e
(
)
x
i
u
2
;
where
>
0
(6)
K
(
x
i
;
u
)
=
tanh
(
(
x
i
u
)
+
r
)
(7)
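The four kernels of equations (4)-(7) can be sketched directly, following LibSVM's parameterization (gamma, coef0 r, degree d); the function and parameter names below are our own choices.

```python
# Sketches of the kernels in equations (4)-(7).
import numpy as np

def linear(x, u):
    return np.dot(x, u)                              # eq. (4)

def polynomial(x, u, gamma=1.0, r=0.0, d=3):
    return (gamma * np.dot(x, u) + r) ** d           # eq. (5)

def rbf(x, u, gamma=1.0):                            # eq. (6), gamma > 0
    diff = np.asarray(x, dtype=float) - np.asarray(u, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

def sigmoid(x, u, gamma=1.0, r=0.0):
    return np.tanh(gamma * np.dot(x, u) + r)         # eq. (7)
```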
3. RELATED WORKS
Extant works on accelerating SVM on FPGA can be divided into two main groups: training and classification. Accelerating the training phase focuses on applications where online learning is required. Applications such as adaptive channel equalization [21] and sketch recognition [22] demand short training times for fast adaptation, since the characteristics of the data change over time. On the other hand, accelerating the classification phase is important when a task requires a fixed classification module with large amounts of data to be classified [8, 13].

Sequential Minimal Optimization (SMO) [23] is considered one of the most commonly used algorithms for solving the SVM training problem. However, the SMO algorithm itself was designed for single-threaded computers [16]. Besides SMO, there are also Gilbert's algorithm [24] and the Least Squares Support Vector Machine (LS-SVM) [25] for training. Works that target training either implement the whole system on-chip or offload to a co-processor [2, 14, 15, 16] to accelerate the training process. The task targeted for hardware implementation is the kernel, as it is a compute-intensive task that benefits from parallelization. The SVM training phase produces a model, which is later used in the classification phase. This model then dictates the architecture, logic resource and memory bits utilization of the hardware implementation for SVM classification.

For the classification part, many existing custom hardware implementations focus on accelerating the decision function for a specific task [8, 9, 11, 13, 26]. These designs cannot easily be reused for other purposes, as they were targeted for a fixed number of support vectors and classes. Another approach towards SVM classification on an FPGA system is cascaded SVM [27]. The cascaded SVM architecture of [27] contains a combination of a high-precision and a low-precision module that are employed depending on the rate of the incoming testing data. The low-precision module processes the incoming data since it has higher throughput potential. Data which cannot be classified with certainty are transferred to the high-precision module, which has lower throughput potential but is able to classify the data more accurately. However, [27] focused solely on the binary classification problem.

The implementation of SVM classification in hardware requires many multipliers, especially in the kernel modules. This has led to the use of the CORDIC algorithm in implementing SVM classification accelerator modules. Ruiz-Llata et al. [1] proposed an SVM classification with multi-class support using the
CORDIC algorithm to replace multipliers [1]. However, timing performance was not reported and the work was not benchmarked against any CPU or GPU implementation. As the CORDIC method is iterative, it would result in reduced performance for more complex data sets with larger numbers of support vectors and classes. A CORDIC unrolled loop can be implemented with fixed pipelined stages; however, this would consume additional hardware and would affect accuracy for certain input conditions. Multiplications in FPGA were avoided in earlier implementations since they were resource-hungry. However, newer FPGA families have dedicated hardware multipliers and MAC units [17].

Kane et al. [12] proposed a fully pipelined SVM decision architecture for multi-class classification with a generic design, which was then tested with a wide variety of datasets. This work performed better than GPU and CPU implementations in terms of throughput with a practically negligible trade-off in accuracy. However, in their proposed work, a stalling mechanism was used to stop input data from overflowing the processing unit. The rate of input data needs to be throttled to match the maximum output rate of the proposed hardware.

In this work, we propose a streaming architecture for embedded systems that is fully pipelined and able to operate without the need to throttle the rate of input data. This allows the proposed architecture to achieve the maximum allowable throughput, depending on the rate of data input. The proposed architecture targets embedded system platforms where testing data are stored in an external memory source and continuously streamed in through a fixed memory bandwidth. Testing data are computed as they arrive, which results in the throughput of the data output being equal to that of the data input. The architecture has an initial latency approximately equal to the number of support vectors.
4. LIBSVM MODEL
The proposed architecture is based on the model generated by LibSVM [28]. During the training phase, based on a set of training data, LibSVM creates an SVM model file. This file contains the important parameters for the decision function. Besides that, the classifier model also contains information on the type of classification, the kernel type, the kernel parameters, the number of support vectors from each class, and a bias value. For each binary training problem, LibSVM generates a set of coefficients that is used in the decision function. There are many multi-class approaches for SVM, such as pairwise, DAG, and one-versus-all. LibSVM uses the pairwise method. Therefore, for each multi-class SVM model, there will be n! / (2(n-2)!) sets of coefficients for computation, where n is the number of classes.
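The pairwise count above simplifies to n(n-1)/2, one binary classifier per unordered pair of classes; a quick sketch (our own, for illustration) makes the growth explicit:

```python
# Number of pairwise (one-versus-one) binary classifiers for n classes:
# n!/(2(n-2)!) = n(n-1)/2 coefficient sets in the LibSVM model.
from math import factorial

def num_pairwise_classifiers(n):
    return factorial(n) // (2 * factorial(n - 2))

for n in (2, 3, 4, 11):
    print(n, num_pairwise_classifiers(n))
# 4 classes -> 6 binary classifiers, as in the model of Figure 1
```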
Figure 1 shows the format of the SVM classifier model with four classes as generated by LibSVM. N_sv represents the number of support vectors, N_f is the number of features, and N_c is the number of classes. During the training phase, each support vector is compared with the other support vectors to ensure that no redundant SVs are stored in the model. In the coefficient columns, for an SV belonging to another binary classifier, the coefficient is set to zero. Based on this structure, each data input can be multiplied with all SVs and coefficients without requiring a multiplexer for selection in the SVM hardware module.

Figure 1. SVM model format
5. PROPOSED ARCHITECTURE
This section presents the proposed architecture. Figure 2 shows an overview of the architecture, which is divided into four sections: kernel computation, coefficient weighting, voting summation, and vote sorting. Each of these modules is structured so as to produce output at every clock cycle. Our implementation is parameterizable in terms of N_sv and N_f in order to handle various types of application. The input of the proposed architecture is an external memory connected to the FPGA. Unlike on-chip memory, where data can be accessed simultaneously, external memory bandwidth is limited. For FPGA devices that have higher memory transfer bandwidth to external memory, the proposed design can be expanded with the same structure to support the higher-bandwidth data input. For our implementation, we fixed the input at one feature per clock cycle.

Figure 2. Proposed architecture overview
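A back-of-envelope cycle model (our sketch, not the paper's code) captures the timing behaviour described above: one feature enters per clock, so a classification completes every N_f cycles (throughput 1/N_f) after an initial latency of roughly N_sv cycles while the stream fills the pipeline.

```python
# Approximate clock-cycle model of the streaming pipeline:
# initial latency ~ N_sv, then one classification per N_f cycles.
def cycles_to_classify(num_inputs, n_f, n_sv):
    return n_sv + num_inputs * n_f   # approximate total clock cycles

print(cycles_to_classify(1000, n_f=32, n_sv=100))   # -> 32100
```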
5.1. Kernel Computation
This module performs the kernel computation as the input data are streamed in continuously every clock cycle, resulting in a system throughput of 1/N_f. In each clock cycle, a single feature is transferred to the module. Figure 3 shows an overview of the linear kernel architecture. The number of multipliers needed is based on the number of SVs available in the model. The linear kernel implementation is based on LibSVM. This structure is maintained so that it can easily be substituted with a different type of kernel in future implementations. From the SVM model in Figure 1, parallelization of the kernel computation can be done based on either N_f or N_sv. Parallelizing based on N_f would not match the input data rate of one feature per clock cycle. Therefore, for the proposed design, parallelization based on the number of SVs is chosen.

Figure 3. Linear kernel architecture
In our implementation, each input datum is transferred through a shift register, in contrast to the work proposed by [26], where the input data are broadcast to each SV at the same time. This is done in order to reduce the fan-out from the input data, which spans across the support vector memory outputs and would be large for applications with many SVs. Since data traverse the SV memory with a clock cycle delay, each SV is also accessed with a one clock cycle delay. Therefore, in order to avoid multiple individual memory blocks, a counter and a complicated control unit, we concatenate all the support vectors into a single memory block based on the
format in Figure 4b. With this scheme, only a single memory block and counter are required. Figure 4a shows the conventional method [28] for storing support vectors in memory. The proposed method (Figure 4b) is based on the conventional method (Figure 4a), where each SV column is shifted one row from left to right.

(a) Conventional storage format of support vector memory [28]
(b) Proposed storage format of support vector memory

Figure 4. Storage format of support vector memory
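The column-shifted layout of Figure 4b can be sketched in software; this is our reading of the figure (rotate each SV column down by its column index), so the helper below is illustrative, not the paper's code. The skew lets a single address counter feed the shift-register pipeline, delivering each SV's word one cycle later than its left neighbour's.

```python
# Sketch of the storage rewrite in Figure 4: rotate SV column j down
# by j rows so one counter can address all SVs, skewed in time.
def skew(columns):
    """columns[j][k] = feature k of SV j; rotate column j down by j rows."""
    n_f = len(columns[0])
    return [[col[(k - j) % n_f] for k in range(n_f)]
            for j, col in enumerate(columns)]

conventional = [["a0", "a1", "a2"],   # SV 0
                ["b0", "b1", "b2"],   # SV 1
                ["c0", "c1", "c2"]]   # SV 2
print(skew(conventional))
```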
5.2. Coefficient Weighting
Based on the SVM classification equation (3), the kernel computation result for each SV is multiplied by its corresponding coefficient. The results are then summed together with the bias. The sign of the final result is then checked to see whether it is greater or less than zero, and based on this the class is determined. The pairwise multi-class approach from [28] compares one class with another. During the training phase, one class is assigned to be the positive class while the other is the negative class, and the generated coefficients are based on this notation. The sign notation of the SVM model coefficients for a multi-class problem with four classes is shown in Figure 5. If a coefficient from 1vs2 is positive, the corresponding 2vs1 coefficient would be negative. Based on this, equation (3) is expanded into equation (8). The bias parameter b can be placed on either side of equation (8) depending on its sign. Based on equation (8), the hardware module does not require signed integer computation.

Figure 5. Coefficient sign notation for SVM model with 4 classes
    \sum_{i=1}^{N_{sv}^{+}} \alpha_i K(x_i, x) + b > \sum_{i=1}^{N_{sv}^{-}} \alpha_i K(x_i, x)    (8)

where N_{sv}^{+} and N_{sv}^{-} index the support vectors with positive and negative coefficients respectively, with the \alpha_i on each side taken as magnitudes.
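A small sketch (our own; values are arbitrary) checks the rearrangement behind equation (8): splitting the sum by coefficient sign and moving b to one side turns the sign test of equation (3) into a comparison of two non-negative accumulations, so the datapath can stay unsigned.

```python
# Equivalence check between the signed sign test of eq. (3) and the
# unsigned two-accumulator comparison of eq. (8).
def sign_test(coefs, kernels, b):
    return sum(a * k for a, k in zip(coefs, kernels)) + b > 0

def unsigned_test(coefs, kernels, b):
    pos = sum(a * k for a, k in zip(coefs, kernels) if a > 0)
    neg = sum(-a * k for a, k in zip(coefs, kernels) if a < 0)
    # b joins whichever side keeps both accumulators non-negative
    return (pos + b > neg) if b >= 0 else (pos > neg - b)

coefs, kernels, b = [0.5, -0.3, 0.2], [1.0, 2.0, 0.5], -0.1
print(sign_test(coefs, kernels, b), unsigned_test(coefs, kernels, b))
```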
Figure 6 shows the structure of the coefficient weighting module. The \alpha values are loaded into the FPGA as constant values. The resultant value from each SV is summed and stored in an individual
FIFO. The FIFOs are needed in this architecture to store the results from the coefficient weighting module before they are computed in the next module. Testing data are computed from quadrant 1 to quadrant 4 (Figure 5). Once the data reach quadrant 2, the comparison for classes 1 and 2 can be performed. In quadrant 3, the comparison of classes 1 and 2 with class 3 can be performed, and finally at quadrant 4, the comparison of classes 1, 2 and 3 with class 4 can be performed. The FIFOs are loaded depending on the SV that has been completed.

Figure 6. Alpha coefficient weighting architecture
5.3. Voting Summation
The outputs of the coefficient weighting module are connected to comparators to determine the winner of each class comparison (Figure 6). Based on Figure 1, testing data inputs are computed with the SVs from class 1, followed by class 2, class 3 and finally class 4. The result of each comparator is stored in another set of FIFOs. Figure 7 shows an overview of the voting summation module. The FIFOs are shifted out depending on the SVs that have been completed. The inverter produces an inverted result for the opposing class. Once the SVs from all classes are completed, the results are summed up and stored in registers; each register corresponds to a particular class.

Figure 7. Voting summation architecture
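The vote accumulation described above amounts to a pairwise tally: each binary comparison i-vs-j casts one vote for its winner (the inverter supplies the complementary result to the opposing class), and the per-class registers hold the totals. A software sketch (our own; the example winners are arbitrary):

```python
# Pairwise voting tally, as accumulated by the voting summation module.
from itertools import combinations

def tally_votes(n_classes, winner_of):
    """winner_of[(i, j)] -> i or j, for every pair with i < j."""
    votes = [0] * n_classes
    for i, j in combinations(range(n_classes), 2):
        votes[winner_of[(i, j)]] += 1
    return votes

winners = {(0, 1): 0, (0, 2): 2, (0, 3): 0, (1, 2): 2, (1, 3): 1, (2, 3): 2}
print(tally_votes(4, winners))   # -> [2, 1, 3, 0]
```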
5.4. Voting Sort
Finally, the results of all summations are sorted to determine the class. Figure 8 shows the architecture of the sorting mechanism. Each result register from the voting summation is concatenated with its class index. Once sorted, the index is output as the labeled class. The architecture is based on bitonic sort [29], but our proposed architecture only determines the maximum value within a list. Similar to LibSVM, if two classes show identical summation results, the class with the smaller index number is the winner.
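The concatenation trick can be sketched as follows (our illustration; the bit widths are arbitrary): packing the vote count into the upper bits and the inverted class index into the lower bits makes a plain maximum comparison prefer the smaller index on equal counts, matching LibSVM's tie-break.

```python
# Tie-breaking maximum: pack {vote count | ~index} into one word so that
# max() resolves equal counts in favour of the smaller class index.
def winning_class(votes, index_bits=4):
    mask = (1 << index_bits) - 1
    keyed = [(v << index_bits) | (mask - idx)       # count | inverted index
             for idx, v in enumerate(votes)]
    return mask - (max(keyed) & mask)

print(winning_class([2, 3, 3, 0]))   # tie between classes 1 and 2 -> 1
```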
Figure 8. Voting sort architecture
6. RESULT AND DISCUSSION
Depending on the targeted application, the numbers of SVs, features and classes vary. Based on the implementation of [12] with a linear kernel, the SVM model for the DNA dataset [30] has three classes, 180 features and 402 SVs, while for the Vowel dataset [31] the SVM model has 11 classes, 10 features and 268 SVs. SVM configurations for particular applications differ based on these parameters. Our analysis has been carried out in two scenarios, evaluating the proposed architecture with different numbers of features and SVs. For Experiment 1, the number of support vectors is fixed at 20 and the input data size is 16-bit integer; the number of features is adjusted from 10 onwards in increments of 10, up to 100 features. For Experiment 2, the number of features is fixed at 32 and the input data size is also 16-bit integer; the number of support vectors is adjusted from 10 onwards in increments of 10, up to 100 support vectors. The number of classes is kept constant at four. For both experiments, the logic resource utilization, internal memory bits usage, maximum operating frequency (in MHz) and the number of embedded multipliers are analysed.
Performance profiling experiments on the proposed architecture were done using Quartus II v13.0 software, and the target FPGA device is a Cyclone IV EP4CE115F29C7. For validation, the proposed architecture was simulated using ModelSim v10 to obtain cycle-accurate results, and the results were compared with LibSVM. Tables 1 and 2 show the results for Experiments 1 and 2, respectively. The number of SVs has the greater impact on logic utilization: as the number of SVs increases, the number of multipliers increases, whereas if the number of features increases, only the number of adders increases. The internal memory bits utilization also increases with the number of SVs, mainly due to the SV memory and the FIFOs that are needed to store the temporary results. The number of SVs does not have a significant impact on the maximum operating frequency (Table 2); the minor differences are mainly due to placement and routing. On the other hand, as the number of features increases, the maximum operating frequency decreases, due to the increase in the size of the adders and multipliers in the architecture.
Table 1. Performance evaluation of the proposed architecture for different numbers of features

No. of features | Logic Elements | Memory Bits | Multipliers | Max Frequency (MHz)
10              | 6408           | 27776       | 320         | 90.46
20              | 7706           | 36736       | 376         | 85
30              | 9056           | 40576       | 432         | 79.13
40              | 10322          | 54656       | 488         | 76.34
50              | 12429          | 58496       | 532         | 71.78
60              | 14027          | 62336       | 532         | 68.48
70              | 16688          | 86656       | 532         | 66.54
80              | 19223          | 90496       | 532         | 63.4
90              | 21308          | 94336       | 532         | 60.51
100             | 23777          | 98176       | 532         | 58.2
Table 2. Performance evaluation of the proposed architecture for different numbers of support vectors

No. of SVs | Logic Elements | Memory Bits | Multipliers | Max Frequency (MHz)
10         | 4810           | 20672       | 244         | 79.94
20         | 9300           | 41344       | 488         | 81.31
30         | 15836          | 46464       | 532         | 80.75
40         | 24276          | 82688       | 532         | 76.38
50         | 33080          | 87808       | 532         | 77.22
60         | 42538          | 155136      | 532         | 74.77
70         | 52110          | 160256      | 532         | 76.16
80         | 59679          | 165376      | 532         | 75.41
90         | 69094          | 170496      | 532         | 71.12
100        | 80540          | 175616      | 532         | 75.48
The performance evaluations target a generic SVM classification hardware implementation. The 16-bit data input was assumed to be a 16-bit integer; therefore the adders, multipliers and registers that follow the linear kernel module (Figure 3) increase in width in order to maintain lossless computation. For a real application implementation, depending on the application's tolerance towards lossy results, the proposed architecture can be optimized to have smaller adders and multipliers, which will lower the logic utilization and increase the maximum operating frequency.
For this implementation, the focus was only on the linear kernel. The proposed design can be expanded to different types of kernel with some minor changes in the kernel computation module. For an RBF kernel implementation (equation (6)), exponential computation is required. For a pipelined architecture, a lookup table is more suitable than CORDIC, since it can produce results within a predictable time frame. In our architecture, since the input data traverse the shift register, a shared lookup table for the RBF kernel implementation can be used without compromising the throughput of the system. For applications where the number of SVs is less than the number of features, the proposed architecture would be able to share a single exponential lookup table. However, if the number of SVs is larger than the number of features, the number of required exponential lookup tables will be equivalent to N_sv / N_f, since N_sv determines the number of clock cycles needed for all support vectors to share a single lookup table, whereas N_f represents the number of clock cycles available for each SV to be computed before the next input datum starts to be computed.
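The lookup-table budget described above can be sketched numerically (our arithmetic, rounding N_sv / N_f up to a whole table; the model sizes are the DNA- and Vowel-like configurations quoted earlier):

```python
# Shared exponential lookup tables needed to sustain the streaming rate:
# each SV needs one lookup every N_f cycles, and a table serves one SV
# per cycle, so roughly ceil(N_sv / N_f) tables are required.
from math import ceil

def rbf_lookup_tables(n_sv, n_f):
    return max(1, ceil(n_sv / n_f))

print(rbf_lookup_tables(268, 10))   # Vowel-like model -> 27 tables
print(rbf_lookup_tables(402, 180))  # DNA-like model   -> 3 tables
```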
7. CONCLUSION
The proposed streaming architecture for SVM classification is able to produce output at the same rate as the data input. The throughput of the implementation is equivalent to 1/N_f classifications per clock cycle. Based on the performance profiling experiments, the resource utilization of the proposed architecture depends mainly on the number of SVs, while the number of features determines the maximum operating frequency. For future work, the proposed architecture needs to be tested with real-world data sets; from this, the trade-off in accuracy between the software and hardware implementations can be observed. Besides that, explorations of different streaming hardware architectures for different multi-class SVM methods need to be done. Other multi-class approaches can be applied to the same problem, which can result in different hardware utilization and accuracy.
ACKNOWLEDGEMENT
This work is supported in part by the Collaborative Research in Engineering, Science & Technology (CREST) grant P17C114 (vote no. 4B176) and the Universiti Teknologi Malaysia Matching grant (vote no. 00M75).
REFERENCES
[1] Ruiz-Llata M, Guarnizo G, Yébenes-Calvino M. FPGA implementation of a support vector machine for classification and regression. In Proceedings of the 2010 IEEE International Joint Conference on Neural Networks (IJCNN), 2010 (pp. 1-5).
[2] Cadambi S, Durdanovic I, Jakkula V, Sankaradass M, Cosatto E, Chakradhar S, Graf HP. A massively parallel FPGA-based coprocessor for support vector machines. In Proceedings of the 2009 17th IEEE Symposium on Field Programmable Custom Computing Machines, 2009 (pp. 115-122).
[3] Véstias MP. High-performance reconfigurable computing granularity. Encyclopedia of Information Science and Technology, Third Edition. IGI Global. 2015:3558-3567.
[4] Kotsiantis SB, Zaharakis I, Pintelas P. Supervised machine learning: A review of classification techniques. Emerging Artificial Intelligence Applications in Computer Engineering. 2007:160:3-24.
[5] Vapnik V. The nature of statistical learning theory. Springer Science & Business Media, 2013.
[6] Mahmoodi D, Soleimani A, Khosravi H, Taghizadeh M. FPGA simulation of linear and nonlinear support vector machine. Journal of Software Engineering and Applications. 2011, 4(05):320.
[7] Hahnle M, Saxen F, Hisung M, Brunsmann U, Doll K. FPGA-based real-time pedestrian detection on high-resolution images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2013 (pp. 629-635).
[8] Kyrkou C, Bouganis CS, Theocharides T, Polycarpou MM. Embedded hardware-efficient real-time classification with cascade support vector machines. IEEE Transactions on Neural Networks and Learning Systems. 2015, 27(1):99-112.
[9] Cutajar M, Gatt E, Grech I, Casha O, Micallef J. Hardware-based support vector machine for phoneme classification. In Proceedings of the 2013 IEEE International Conference on Smart Technologies (EUROCON), 2013 (pp. 1701-1708).
[10] Mizuno K, Terachi Y, Takagi K, Izumi S, Kawaguchi H, Yoshimoto M. An FPGA implementation of a HOG-based object detection processor. IPSJ Transactions on System LSI Design Methodology. 2013, 6:42-51.
[11] Qasaimeh M, Sagahyroon A, Shanableh T. FPGA-based parallel hardware architecture for real-time image classification. IEEE Transactions on Computational Imaging. 2015, 1(1):56-70.
[12] Kane J, Hernandez R, Yang Q. A reconfigurable multiclass support vector machine architecture for real-time embedded systems classification. In Proceedings of the 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2015 (pp. 244-251).
[13] Kryjak T, Komorkiewicz M, Gorgon M. FPGA implementation of real-time head-shoulder detection using local binary patterns, SVM and foreground object detection. In Proceedings of the 2012 IEEE Conference on Design and Architectures for Signal and Image Processing, 2012 (pp. 1-8).
[14] Bustio-Martínez L, Cumplido R, Hernández-Palancar J, Feregrino-Uribe C. On the design of a hardware-software architecture for acceleration of SVM's training phase. In Proceedings of the Mexican Conference on Pattern Recognition, 2010 (pp. 281-290).
[15] Wang JF, Peng JS, Wang JC, Lin PC, Kuan TW. Hardware/software co-design for fast-trainable speaker identification system based on SMO. In Proceedings of the 2011 IEEE International Conference on Systems, Man, and Cybernetics, 2011 (pp. 1621-1625).
[16] Venkateshan S, Patel A, Varghese K. Hybrid working set algorithm for SVM learning with a kernel coprocessor on FPGA. IEEE Transactions on Very Large Scale Integration (VLSI) Systems. 2015, 23(10):2221-32.
[17] Škoda P, Rogina BM, Sruk V. FPGA implementations of data mining algorithms. In Proceedings of the 35th International Convention on Information Communication Technology (MIPRO), 2012 (pp. 362-367).
[18] Chapelle O, Scholkopf B, Zien A. Semi-supervised learning. IEEE Transactions on Neural Networks. 2009, 20(3):542.
[19] Vapnik V. Statistical Learning Theory. John Wiley & Sons, Inc. 1998:156-60.