Indonesian Journal of Electrical Engineering and Computer Science
Vol. 13, No. 1, January 2019, pp. 102~108
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v13.i1.pp102-108
Journal homepage: http://iaescore.com/journals/index.php/ijeecs
Integration of synthetic minority oversampling technique for imbalanced class

Noviyanti Santoso, Wahyu Wibowo, Hilda Himawati
Department of Business Statistics, Faculty of Vocational, Institut Teknologi Sepuluh Nopember, Kampus ITS Sukolilo - Surabaya, Indonesia
Article Info

Article history:
Received Aug 1, 2018
Revised Nov 1, 2018
Accepted Nov 19, 2018
ABSTRACT

In data mining, class imbalance is a problematic issue that calls for solutions. This is likely because machine learning algorithms are constructed under the assumption that the number of instances in each class is balanced, so predictions made on an imbalanced class may be inaccurate. Several solutions have been offered to address class imbalance, including oversampling, undersampling, and the synthetic minority oversampling technique (SMOTE). Both oversampling and undersampling have their disadvantages, and SMOTE is an alternative that overcomes them. Integrating SMOTE with data mining classification methods such as Naive Bayes, Support Vector Machine (SVM), and Random Forest (RF) is expected to improve classification accuracy. In this research, it was found that the SMOTE-resampled data gave better accuracy than the original data. Among the three classification methods used, RF gives the highest average AUC, F-measure, and G-means scores.
Keywords:
Accuracy
Data mining
Imbalanced class
SMOTE
Copyright © 2019 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:

Noviyanti Santoso,
Department of Business Statistics, Faculty of Vocational,
Institut Teknologi Sepuluh Nopember,
Kampus ITS Sukolilo - Surabaya, 60111, Indonesia.
Email: noviyanti_s@statistika.its.ac.id
1. INTRODUCTION
A dataset with an unbalanced class distribution makes classification results more likely to belong to the majority class than to the minority class. Class imbalance in a dataset is a problem in machine learning, where the majority (negative) class is larger than the minority (positive) class. Class imbalance is a common problem found in datasets from various fields, including bankruptcy prediction, credit card fraud detection [1], and disease diagnosis [2]. Class imbalance is a serious disservice to researchers engaged in data mining, because data mining methods generally have difficulty classifying the minority class correctly. The algorithms assume that the class distribution of the tested data is already balanced, which leads to errors in classifying the instances of each class. Moreover, machine learning algorithms are designed to treat the tested data as equal and to generalize to the simplest hypothesis. This principle is embedded in various algorithms such as the decision tree, nearest neighbor, and support vector machine. Therefore, when such an algorithm is tested on an unbalanced dataset, it tends to focus on the majority class and ignore the minority class, causing errors in minority class classification. The minority class is treated as mere noise.
The problem with testing classification methods on imbalanced datasets is that the cost of misclassifying instances of the minority class is usually higher than the cost of misclassification in the majority class. Much research [3-6] has demonstrated the relevance of this matter in classification. In recent years, the classification problem for imbalanced datasets has become a challenging research topic; the challenge in overcoming it is how to classify the minority class more accurately. According to
research [7], the way to overcome class imbalance is to resample the original dataset, either in the minority class (oversampling) or in the majority class (undersampling). Oversampling is a mechanism for balancing the class distribution by randomly replicating minority class instances. However, the drawback of oversampling is an increased possibility of overfitting, because the procedure duplicates instances exactly. Undersampling is a procedure for balancing the class distribution by randomly removing majority class instances. The drawback of undersampling is the loss of data that is essential for the continuity of the decision-making process by machine learning [8]. Then [9] proposed a solution called the Synthetic Minority Oversampling Technique (SMOTE). SMOTE generates synthetic minority class samples by interpolating between minority class instances that are located adjacent to each other.
SMOTE is governed by the nearest neighbors factor and the desired oversampling level. Several works have integrated SMOTE with data mining techniques. According to [10], the combination of SMOTE and Tomek links as a resampling approach showed better performance on imbalanced class datasets. Furthermore, [11] concluded that the AUC score of an imbalanced dataset resampled using SMOTE increases, as does the accuracy of all the data mining methods it is integrated with. On a medical dataset, [12] applied a SMOTE-ensembled machine learning approach to predict diabetes mellitus; the result was that Random Forest (RF) and Naïve Bayes showed the highest scores on all evaluation measures. Meanwhile, [13] and [14] concluded that SVM and C4.5 are outstanding methods for predicting fish species based on DNA barcodes. No single approach consistently provides the best performance, because performance depends on the quality and characteristics of the dataset.
Based on the description above, this research integrates SMOTE with the data mining classification methods Naive Bayes, SVM, and RF, and evaluates their performance in overcoming an unbalanced class in a banking case. The results of this study are expected to offer an alternative for settling classification cases with unbalanced classes in various fields, so that it can serve as an early warning model that predicts coming events with a high degree of accuracy.
2. RESEARCH METHOD
2.1. Dataset
This research uses the Bank Marketing dataset from the UCI Machine Learning Repository. From 45,210 instances, a 10% sample is randomly taken, so the number of instances used is 4,521. A total of 521 instances (13%) belong to the minority (positive) class, and 4,000 instances (87%) belong to the majority (negative) class. This indicates that the Bank Marketing dataset has an unbalanced class category. To evaluate the model, we split the dataset into a training set and a testing set in four combinations, i.e., 90:10, 80:20, 70:30, and 50:50. Validation is performed using 5-fold cross validation, and the classification accuracy is then calculated using three evaluation measures, i.e., AUC, G-means, and F-measure.
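
To make this protocol concrete, here is a minimal sketch in Python with pandas and scikit-learn. The file name bank-full.csv, the ";" separator, and the label column "y" follow the public UCI Bank Marketing layout and are assumptions, not details stated in the paper.

```python
# Sketch of the sampling and splitting protocol described above.
import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split

bank = pd.read_csv("bank-full.csv", sep=";")        # ~45,210 instances
sample = bank.sample(frac=0.10, random_state=42)    # the 10% sample (~4,521)
X, y = sample.drop(columns="y"), sample["y"]

# The four train:test combinations, i.e., 90:10, 80:20, 70:30, and 50:50.
for test_size in (0.10, 0.20, 0.30, 0.50):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=42)
    # 5-fold cross validation on the training portion.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for tr_idx, va_idx in cv.split(X_tr, y_tr):
        pass  # fit on X_tr.iloc[tr_idx], validate on X_tr.iloc[va_idx]
```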
2.2. Methods
2.2.1 Synthetic Minority Oversampling Technique
The SMOTE method was proposed by [9] as one of the solutions for dealing with unbalanced data, with a different principle from the previously proposed oversampling methods. Where random oversampling simply replicates existing instances, the SMOTE method grows the minority class to the size of the majority class by generating artificial data. The artificial, or synthetic, data are made based on the k-nearest neighbors, where the number of neighbors k is determined by considering the ease of application. Generating artificial numerical data differs from generating categorical data: the distance between numerical instances is measured with the Euclidean distance, whereas categorical data are simpler and are handled by the mode value. In general, new data are generated using (1):

$x_{\text{syn}} = x_i + \delta \cdot (\hat{x}_i - x_i)$  (1)

where $x_i$ is a minority class instance, $\hat{x}_i$ is one of its k nearest minority class neighbors chosen at random, and $\delta$ is a uniform random number in $[0, 1]$.
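
The interpolation in (1) can be sketched directly. The helper below is an illustration, not the reference implementation from [9]: the function name and parameters are hypothetical, and it assumes purely numeric minority-class rows (categorical attributes would be filled by the mode, as described above).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_numeric(X_min, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples via (1): interpolate between a
    minority instance and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: first hit is self
    _, idx = nn.kneighbors(X_min)
    out = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))            # a minority instance x_i
        j = idx[i][rng.integers(1, k + 1)]      # a random neighbor x_hat_i
        delta = rng.random()                    # delta ~ U[0, 1)
        out.append(X_min[i] + delta * (X_min[j] - X_min[i]))  # eq. (1)
    return np.asarray(out)
```

In practice, a library implementation such as SMOTE (or SMOTENC for mixed numeric/categorical data) from the imbalanced-learn package would typically be used instead.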
2.2.2 Naïve Bayes
Naive Bayes is a simple probabilistic classifier that calculates probabilities by summing the frequencies and combinations of values in a given dataset. The algorithm uses Bayes' theorem and assumes that all attributes are independent, or non-mutual, given the value of the class variable [15]. Naive Bayes is a classification technique based on probability and statistical methods introduced by the British scientist Thomas Bayes; it predicts future opportunities based on past experience, a result known as Bayes' theorem. The theorem is combined with the "naive" assumption that the attributes are conditionally independent. The NB classifier assumes that the presence or absence of a certain feature has nothing to do with the characteristics of the other classes.
The NB calculation obtains the posterior probability of class category C given Xi, P(C | Xi), by multiplying the probability of Xi occurring in class category C, P(Xi | C), by the class prior probability P(C), and dividing by the probability of the variable Xi occurring, P(Xi). Mathematically, it is written in the following equation:

$P(C \mid X_i) = \dfrac{P(C)\, P(X_i \mid C)}{P(X_i)}$  (2)
The next process is optimal class selection, choosing the class with the largest probability value among the class probabilities. The formula for choosing the largest value is shown by (3):

$\hat{C} = \arg\max_{C} P(C) \prod_{i} P(X_i \mid C)$  (3)
Function (3) is the Naive Bayes model that will be used for classification when Xi is a random variable with categorical data. If Xi is continuous data, it is assumed to follow a Gaussian distribution with the density function in (4):

$f(x_i \mid C) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\dfrac{(x_i - \mu)^2}{2\sigma^2}\right)$  (4)

where µ is the mean and σ is the standard deviation.
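
Equations (2)-(4) combine into a single scoring routine: the Gaussian log-likelihood from (4) is added to the log-prior, giving the log of the numerator of (2) (the denominator P(Xi) is constant across classes and can be dropped), and (3) takes the argmax. A minimal NumPy sketch, assuming continuous features held in arrays; the names are illustrative:

```python
import numpy as np

def gaussian_nb_predict(X_train, y_train, X_test, eps=1e-9):
    classes = np.unique(y_train)
    log_post = []
    for c in classes:
        Xc = X_train[y_train == c]
        log_prior = np.log(len(Xc) / len(X_train))       # log P(C)
        mu, var = Xc.mean(axis=0), Xc.var(axis=0) + eps  # Gaussian per (4)
        ll = -0.5 * (np.log(2 * np.pi * var) + (X_test - mu) ** 2 / var)
        log_post.append(log_prior + ll.sum(axis=1))      # log numerator of (2)
    return classes[np.argmax(log_post, axis=0)]          # decision rule (3)
```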
2.2.3 Support Vector Machine
Support Vector Machine (SVM) is a learning method that uses a hypothesis space in a high-dimensional feature space. The algorithm is trained based on optimization theory by implementing a learning bias [16]. SVM became famous because of its success in recognizing handwritten digits with 1% error. The basic concept of SVM is to find an optimal function that can separate two datasets of two different classes. This technique has a convincing performance in predicting the class of new data. SVM is in the same class as the Artificial Neural Network (ANN), both belonging to supervised learning, but in its implementation SVM gives better results than ANN, especially in reaching solutions. SVM performs well in solving many identification problems [17]. Moreover, SVM can find the optimum solution in each run [18]. According to [19], the SVM method is efficient for solving binary classification. The maximum margin hyperplane gives the maximum separation between the decision classes, as shown in Figure 1. If the training dataset is imbalanced, the choice of the optimal hyperplane is affected dominantly by the sample vectors of the majority class, the class which has much more sample data [13]. The separator function to determine the data class for x is as in (5):

$f(x) = \operatorname{sign}\!\left(w^{T}\varphi(x) + b\right)$  (5)
Figure 1. The maximum margin hyperplane of SVM
where w and b are coefficients estimated by minimizing the regularized risk function. The kernel method is the solution used to extend SVM when the data are hard, or perhaps impossible, to classify with limited linear fields. The kernel method maps a data point x in input space into an F feature space of higher dimensionality through a map φ, as φ : x → φ(x). This mapping is done to preserve the data characteristics, or data topology. Some general kernel forms used for the SVM method are linear, polynomial, radial basis function, and sigmoid.
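
As a brief sketch, the four kernel forms can be compared with scikit-learn's SVC. X_tr, y_tr, X_te, and y_te are the hypothetical split variables from the earlier sketch (assumed already numerically encoded), and standardization is added because kernel SVMs are scale-sensitive:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    clf.fit(X_tr, y_tr)                  # learns a decision function of form (5)
    print(kernel, clf.score(X_te, y_te))
```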
2.2.4 Random Forest
Random forest is one of the ensemble methods for improving the accuracy of data classification from an unstable single classifier through multiple combinations of methods, similar to a voting process, to get the final classification prediction. The term RF was proposed by [20], derived from the Bootstrap Aggregating process, more popularly known as Bagging. In the bagging process, bootstrap resampling is used to generate a classification tree. The classification tree is a general technique of which multiple versions are built and then combined to obtain the final prediction. In the RF method, the randomization process is applied not only to the sample data but also to the collection of independent variables, so that the classification trees raised will have different sizes and shapes. RF is a development of the decision tree (DT). In the DT, only one classification tree is made, while in RF more than one is made, which overcomes noise and missing values. The RF algorithm proceeds by the following steps; a code sketch follows the list:
Step 1: To get training data, generate a new random sample with the bootstrap resampling method N times.
Step 2: Build a decision tree or regression tree based on the data from Step 1.
Step 3: Repeat Step 1 and Step 2, so that several trees are obtained and become a forest.
Step 4: Let each of the trees choose the class for Xi.
Step 5: Count the number of times each class is chosen for Xi. The class with the largest count determines the classification label for Xi.
Step 6: The percentage of improper classifications is the class error ratio of the random forest.
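
Steps 1-5 condense into a short bagging-and-voting sketch. This illustrates the algorithm as stated rather than the paper's exact configuration; the number of trees and the square-root feature subset are assumed defaults.

```python
from collections import Counter
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_predict(X_tr, y_tr, X_te, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    m = max(1, int(np.sqrt(X_tr.shape[1])))            # random predictor subset size
    votes = []
    for _ in range(n_trees):                           # Steps 1-3: grow the forest
        rows = rng.integers(0, len(X_tr), len(X_tr))   # bootstrap resample (Step 1)
        tree = DecisionTreeClassifier(max_features=m)  # randomized tree (Step 2)
        tree.fit(X_tr[rows], y_tr[rows])
        votes.append(tree.predict(X_te))               # Step 4: each tree votes
    votes = np.asarray(votes)
    # Step 5: the class chosen most often is the label for each test instance
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```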
According to [21], there are two approaches to imbalanced prediction using random forests: one is cost-sensitive learning, which incorporates class weights into the random forest classifier, and the other is using over-sampling methods on the minority class and/or under-sampling on the majority class to balance the original data.
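
Both approaches map onto short scikit-learn and imbalanced-learn calls; X_tr and y_tr are the hypothetical training variables from the earlier sketches.

```python
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

# (a) cost-sensitive learning: weight classes inversely to their frequency
rf_weighted = RandomForestClassifier(class_weight="balanced").fit(X_tr, y_tr)

# (b) resampling: oversample the minority class with SMOTE, then fit
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X_tr, y_tr)
rf_smote = RandomForestClassifier().fit(X_res, y_res)
```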
2.2.5 Accuracy Measurement
Classification accuracy is used to assess the goodness of a model in representing or classifying actual events. The measure of classification accuracy used for unbalanced data is the Area Under the ROC Curve (AUC). AUC gives a complete picture of accuracy in the context of imbalance. Performing the AUC calculation requires the sensitivity and specificity first; for easier calculation, a confusion matrix is usually used. The formulas to calculate the sensitivity, specificity, and AUC score are shown by (6), (7), and (8):

$\text{sensitivity} = \dfrac{TP}{TP + FN}$  (6)

$\text{specificity} = \dfrac{TN}{TN + FP}$  (7)

$\text{AUC} = \dfrac{\text{sensitivity} + \text{specificity}}{2}$  (8)
There are other classification evaluation measures; one is the Geometric Mean (G-means), which was introduced by [22]. The basic idea is to maximize the accuracy of each class while keeping the balance between both:

$\text{G-means} = \sqrt{\text{sensitivity} \times \text{specificity}}$  (9)
The study by [6] used the F-measure to evaluate classification accuracy on imbalanced class datasets. The F-measure is a combination of precision and sensitivity that is used to determine the best prediction result:

$F\text{-measure} = \dfrac{2 \times \text{precision} \times \text{sensitivity}}{\text{precision} + \text{sensitivity}}$  (10)
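
Measures (6)-(10) follow mechanically from the confusion matrix. A minimal sketch, assuming binary labels with the positive (minority) class encoded as 1:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def imbalance_scores(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)                   # (6)
    specificity = tn / (tn + fp)                   # (7)
    auc = (sensitivity + specificity) / 2          # (8)
    g_means = np.sqrt(sensitivity * specificity)   # (9)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)  # (10)
    return {"AUC": auc, "G-means": g_means, "F-measure": f_measure}
```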
3. RESULTS AND ANALYSIS
3.1. Data Balancing
Before performing a classification analysis using NB, SVM, and RF, it is essential to know the description of the data used. As explained in the methodology, at the pre-processing stage the data are divided into training and testing data. In this study, the proportion of training and testing samples is divided in four ways, i.e., 90:10 (meaning 90% training data and 10% testing data), 80:20, 70:30, and 50:50. This is done to find the most informative proportion by applying each proportion to the original data and to the data after SMOTE. Table 1 presents a summary of the percentage of testing data in the negative and positive classes for each combination.
After determining the sampling proportion, classification is performed on the data with the three methods, Naive Bayes, SVM, and Random Forest, and the classification methods are evaluated. The goodness of the methods is measured using the classification evaluation measures, i.e., accuracy, AUC, F-measure, and G-means.
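
The comparison described here reduces to one loop over the three classifiers and over the original versus SMOTE-resampled training data, reusing imbalance_scores() and the split variables from the earlier sketches. Default hyperparameters are an assumption, as the paper does not list its settings.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

datasets = {"Original": (X_tr, y_tr),
            "After SMOTE": SMOTE(random_state=0).fit_resample(X_tr, y_tr)}
for name, clf in {"NB": GaussianNB(), "SVM": SVC(),
                  "RF": RandomForestClassifier()}.items():
    for label, (Xf, yf) in datasets.items():
        pred = clf.fit(Xf, yf).predict(X_te)
        print(name, label, imbalance_scores(y_te, pred))
```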
Table 2 shows the accuracy of each sample proportion on the original data and on the data after SMOTE, while the AUC scores of each method are presented in Table 3.
Table 1. Class Distribution based on Sampling Proportion

Data         | 90:10             | 80:20             | 70:30             | 50:50
             | Negative Positive | Negative Positive | Negative Positive | Negative Positive
Original     | 88.05%   11.95%   | 89.05%   10.95%   | 88.35%   11.65%   | 88.36%   11.64%
After SMOTE  | 78.97%   21.03%   | 79.17%   20.83%   | 78.86%   20.62%   | 79.45%   20.55%
Table 2. The Accuracy of Each Classifier using the Original and After SMOTE Data

Data combination | Original             | After SMOTE
                 | NB     SVM    RF     | NB     SVM     RF
90:10            | 85.6%  89.2%  89.4%  | 83.5%  89.35%  91.1%
80:20            | 87.5%  89.7%  89.9%  | 83.7%  88.2%   89.2%
70:30            | 87.8%  89.4%  89.7%  | 84%    88.4%   89.3%
50:50            | 87.9%  89.8%  89.5%  | 83.5%  88%     89.4%
Table 3. The AUC Score of Each Classifier using the Original and After SMOTE Data

Data combination | Original             | After SMOTE
                 | NB     SVM    RF     | NB     SVM    RF
90:10            | 67%    61.8%  61.2%  | 74.4%  79%    82.2%
80:20            | 71.3%  61.9%  62.5%  | 75.7%  77.8%  78.5%
70:30            | 68.4%  60.8%  60.9%  | 76.3%  78.6%  80%
50:50            | 70%    61.5%  59.4%  | 78.4%  78.6%  79.7%
3.2. Comparison of classifiers
The classification accuracy of each classifier was compared using several evaluation measurements. Table 2 shows that the highest accuracy of the NB method is obtained with the 50:50 sampling proportion on the original data, with an accuracy of 87.9%; likewise for the SVM method, with an accuracy equal to 89.8%. For the RF method, the highest accuracy is obtained with the 90:10 sampling proportion on the SMOTE data. However, accuracy is considered inappropriate as an evaluation of the goodness of a classification model on a dataset with an unbalanced class, because the accuracy formula is based on correct observations in both the negative and the positive class. Table 3 shows that the AUC obtained by the NB method with the 50:50 sampling proportion is the largest among the sampling proportions. The highest AUC score of the SVM method was 79%, obtained with the 90:10 sampling proportion on the data after SMOTE, as was that of the RF method, with an AUC value equal to 82.2%. Table 3 also shows that the AUC on the SMOTE data tends to have a more significant value than on the original data. Across the methods, the most considerable value is obtained by RF with the 90:10 sampling proportion on the data after SMOTE.
Based on Table 4, the highest F-measure among the three methods is found on the SMOTE data with the 90:10 sampling proportion. The F-measure is one of the evaluation measures appropriate for data with an imbalanced class: the higher the F-measure, the better the classification method, because the F-measure is obtained from the classification accuracy of observations in the positive class only. The last evaluation measurement is G-means; the analysis result is presented in Table 5. Table 5 shows that the largest G-means, equal to 88.2%, is obtained by the NB method with the 80:20 sampling proportion on the original data. For the SVM method, the highest G-means score is equal to 76.9% with the 50:50 sampling
proportion on the SMOTE data, and the highest G-means score of the RF method is obtained with the 90:10 sampling proportion on the SMOTE data. Across the sampling proportions of training and testing data, most of the classification evaluation measurements of the three methods reach their highest value at the 90:10 proportion. This means that the larger the sample used to generate the classification model, the better it describes the unbalanced data conditions, so that when the testing data are used for validation, high AUC, F-measure, and G-means scores are obtained. Besides, if the original data are compared with the data after SMOTE, the analysis results show that performance on the SMOTE data is better than on the original data. This matches the theory and previous research, which state that SMOTE sampling is used to solve class imbalance so that the classification evaluation obtained is appropriate.
Table 4. The F-measure Score of Each Classifier using the Original and After SMOTE Data

Data combination | Original             | After SMOTE
                 | NB     SVM    RF     | NB     SVM    RF
90:10            | 40.4%  60.8%  65.0%  | 61.4%  83.3%  87.7%
80:20            | 43.9%  56.5%  58.7%  | 60.8%  73.0%  82.5%
70:30            | 47.6%  61.7%  66.1%  | 60.8%  77.2%  80.0%
50:50            | 47.9%  66.3%  65.4%  | 58.1%  75.0%  81.1%
Table 5. The G-means Score of Each Classifier using the Original and After SMOTE Data

Data combination | Original             | After SMOTE
                 | NB     SVM    RF     | NB     SVM    RF
90:10            | 62.4%  50.3%  48.6%  | 72.6%  77.0%  80.8%
80:20            | 88.2%  50.6%  51.6%  | 74.4%  75.7%  76.4%
70:30            | 63.5%  47.9%  48.0%  | 75.1%  76.8%  78.4%
50:50            | 66.1%  49.3%  44.6%  | 78.0%  76.9%  77.9%
The best method is determined by calculating, for each evaluation measure, the average of all evaluation measurements of the data after SMOTE across the sampling proportions. The method with the largest evaluation measurement is selected as the best one. Figure 2 shows that the method with the highest average AUC, F-measure, and G-means value among the methods is RF. Therefore, the best method for this study is RF with the data after SMOTE.
Figure 2. Comparison of the performance of each classifier
4. CONCLUSION
Based on the analysis, we conclude that data resampled by SMOTE obtained a better performance than the original data. This research has accomplished its objectives: three classifiers (NB, SVM, and RF) were applied to an imbalanced class dataset. The primary objective of this study was to identify the best technique for imbalanced class prediction before and after resampling by SMOTE. Hence, after applying the three methods, a comparative analysis was performed to determine the most appropriate one. The experimental results showed that RF performs well because of its ability to predict a larger portion of the data correctly, with higher AUC, F-measure, and G-means scores.
For future work, the following suggestions can be considered: combining other resampling methods such as Tomek links and random under-sampling, and using larger imbalanced dataset samples with different class distributions, which would be a valuable idea.
ACKNOWLEDGEMENT
The authors gratefully acknowledge the financial support from the Head of the Institute for Research and Community Service ITS through the Research Grant for Beginner Researcher scheme in 2018.
REFERENCES
[1] Phoungphol P. A Classification Framework for Imbalanced Data. Georgia State University. 2013.
[2] Rohini RR, Krishnamoorthi M. Learning from a Class Imbalanced Public Health Dataset: A Cost-based Comparison of Classifier Performance. International Journal of Electrical and Computer Engineering (IJECE). 2017; 7(4): 2215-2222.
[3] Qiang W. A Hybrid Sampling SVM Approach to Imbalanced Data Classification. Abstract and Applied Analysis. 2014; 1: 1-7.
[4] Sukarda B, Muhammad MI, Xiu Y, and Kazuyuki M. MWMOTE–Majority Weighted Minority Oversampling Technique for Imbalanced Dataset Learning. IEEE Transactions on Knowledge and Data Engineering. 2014; 26(2): 405–425.
[5] Giovanna M, Nicola T. Training and Assessing Classification Rules with Imbalanced Data. Data Mining and Knowledge Discovery. 2014; 28(1): 92–122.
[6] Galar M, Fernandez A, Barrenechea E and Herrera F. EUSBoost: Enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling. Pattern Recognition. 2013: 3460-3471.
[7] Choi MJ. A Selective Sampling Method for Imbalanced Data Learning on Support Vector Machines. Graduate Theses. US: Iowa State University; 2010.
[8] Yap BW, Rani KA, Aryani H, Rahman A, Fong S, Khairudin Z and Abdullah NN. An Application of Oversampling, Under-sampling, Bagging and Boosting in Handling Imbalanced Datasets. Proceedings of the First International Conference on Advanced Data and Information Engineering (DaEng-2013). Stanford. 2015; 285: 13–23.
[9] Chawla NV, Bowyer KW, Hall LO, and Kegelmeyer WP. SMOTE: Synthetic Minority Oversampling Technique. Journal of Artificial Intelligence and Research. 2002; 16: 321-357.
[10] Sain H and Purnami SW. Combine sampling support vector machine for imbalanced data classification. Proceeding of The Third Information Systems International Conference. Surabaya. 2015; 72: 59-66.
[11] Maira A and Mohsin A. Investigating the Performance of Smote for Class Imbalanced Learning: A Case Study of Credit Scoring Datasets. European Scientific Journal. 2017; 13(33): 340-353.
[12] Manal A, Mouaz A, Steven K, Clinton B, Jonathan E, and Sherif S. Predicting diabetes mellitus using SMOTE and ensemble machine learning approach: The Henry Ford ExercIse Testing (FIT) project. PLoS ONE. 2017; 12(7): e0179805.
[13] Kusuma WA, Noviana N, Hasibuan LS, Nurilmala M. Improving DNA Barcode-based Fish Identification System on Imbalanced Data using SMOTE. TELKOMNIKA (Telecommunication Computing Electronics and Control). 2017; 15(3): 1230-1238.
[14] Lokesh SK and John SU. Comparative Study of Recommendation Algorithms and Systems using WEKA. International Journal of Computer Applications. 2015; 110(3).
[15] Patil TR and Sherekar SS. Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification. International Journal of Computer Science and Applications. 2013; 6(2): 256-261.
[16] Vapnik VN. Support-vector networks. Machine Learning. 1995; 20: 273-297.
[17] Batuwita R and Palade V. Efficient resampling methods for training support vector machines with imbalanced datasets. Proceeding of International Joint Conference on Neural Networks. Barcelona, Spain. 2010: 1-8.
[18] Seiffert C, Khoshgoftaar TM, Hulse JV and Napolitano A. RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Trans. Syst. Man Cybernet. 2010; 40: 185-197.
[19] Miner G, Nisbet R, Elder J, Delen D and Fast A. Practical Text Mining and Statistical Analysis for Unstructured Text Data Applications. First Edition. USA: Academic Press. 2012: 1000.
[20] Breiman L. Random forests. Machine Learning. 2001; 45(1): 5-32.
[21] Zhou L, Wang H. Loan Default Prediction on Large Imbalanced Data Using Random Forests. TELKOMNIKA (Telecommunication Computing Electronics and Control). 2012; 10(6): 1519-1525.
[22] Kubat M and Matwin S. Addressing the Curse of Imbalanced Training Set: One Sided Selection. Proceeding of the 14th International Conference on Machine Learning. Nashville, USA. 1997: 179-186.