International Journal of Electrical and Computer Engineering (IJECE)
Vol. 15, No. 1, February 2025, pp. 741~754
ISSN: 2088-8708, DOI: 10.11591/ijece.v15i1.pp741-754
Journal homepage: http://ijece.iaescore.com
Handling class imbalance in education using data-level and deep learning methods

Rithesh Kannan1, Hu Ng1, Timothy Tzen Vun Yap2, Lai Kuan Wong1, Fang Fang Chua1, Vik Tor Goh3, Yee Lien Lee3, Hwee Ling Wong3
1Faculty of Computing and Informatics, Multimedia University, Cyberjaya, Malaysia
2School of Mathematical and Computer Sciences, Heriot-Watt University Malaysia, Putrajaya, Malaysia
3Faculty of Engineering, Multimedia University, Cyberjaya, Malaysia
Article Info

Article history:
Received Mar 28, 2024
Revised Aug 13, 2024
Accepted Oct 1, 2024

ABSTRACT

In the current field of education, universities must be highly competitive to thrive and grow. Education data mining has helped universities in bringing in new students and retaining old ones. However, there is a major issue in this task, which is the class imbalance between the successful students and at-risk students that causes inaccurate predictions. To address this issue, 12 methods from data-level sampling techniques and 2 methods from deep learning synthesizers were compared against each other, and an ideal class balancing method for the dataset was identified. The evaluation was done using the light gradient boosting machine ensemble model, and the metrics included receiver operating characteristic curve, precision, recall and F1 score. The two best methods were Tomek links and neighbourhood cleaning rule from the undersampling techniques, with F1 scores of 0.72 and 0.71 respectively. The results of this paper identified the best class balancing method between the two approaches and identified the limitations of the deep learning approach.
Keywords:
Academic at-risk
Class balancing
Educational data mining
Multi-classification
Resampling techniques
Synthetic datasets

This is an open access article under the CC BY-SA license.
Corresponding Author:
Hu Ng
Faculty of Computing and Informatics, Multimedia University
Persiaran Multimedia, 63100 Cyberjaya, Selangor, Malaysia
Email: nghu@mmu.edu.my
1. INTRODUCTION
In present times, education institutes operate in a highly complex environment and are fiercely competitive with each other. In order to stand out, an institute must be able to consistently attract new students and retain old ones. With the rapid rise in technology development allowing equipment to become cheaper, more data on students can be collected and stored. By effectively analyzing and utilizing this data, an educational institute can leverage its advantages and outperform its competitors. Educational data mining (EDM) is the emerging field concerned with automating the process of analyzing student data and providing actionable insights for members of staff to utilize. Predicting student academic performance is a common goal in the field of EDM. This is an important task, as a major contribution to a student's success is their academic performance in higher education institutes. Academic performance may consist of examination results, coursework, co-curricular achievement, whether the student has graduated on time, and so on. [1] used data mining to predict students' grades using their attendance, class test, assignment and midterm scores, and achieved a high accuracy of 88.6% using deep learning algorithms. Students with good academic results are likely to face a greater amount of career choices and better job security when compared to students with poor academic results. However, it must be noted that each student is unique, as they receive and process information differently from one another. It is possible for students with poor academic results to improve with some interventions. Thus, it is a critical task to identify the at-risk students as early as possible in the study year so that interventions can be applied quickly.
One major challenge faced in this task is the class imbalance between the successful students and the at-risk students. Generally, there are a lot more successful students compared to at-risk students in a program. Building a predictive model without addressing this issue can lead to meaningless results, as the model may show a higher accuracy by simply predicting that all students passed and are successful [2], [3]. Other researchers like Chawla et al. [4] have commonly used traditional oversampling techniques like synthetic minority oversampling technique (SMOTE) or undersampling techniques like Tomek links (TL) [5]. However, these techniques are not always applicable and have disadvantages, such as oversampling introducing additional noise and undersampling removing large samples of data to make the majority class samples equal to the minority class. The class imbalance issue is also not unique to the education domain; it can occur in many different domains including cybersecurity [6], air quality [7], and others.
In recent years, researchers have explored generating synthetic datasets to increase the amount of information and help improve the performance of the predictive models [8]. Moreover, through various deep learning synthesizers, it has been shown to be possible to generate synthetic datasets to solve the class imbalance issue [9]. The synthetic datasets are modelled by the synthesizers on a data distribution similar to the real dataset, but the target classes can be modified so that the classes are balanced appropriately. However, as this is still a relatively new approach, there has not been much focus on it.
In this paper, multiple methods from the two approaches to class balancing, which are data-level sampling and deep learning synthesizers, are compared and evaluated on a student education dataset. The multiclass evaluation is performed through the light gradient boosting machine (LightGBM) classifier using machine learning metrics such as receiver operating characteristic (ROC), precision, recall and F1 score. In addition, the datasets generated are compared visually and through data quality scores.
Data-level techniques are a group of techniques that involve modifying the actual dataset itself to make the class distribution balanced [10], [11]. There are three types under this approach: oversampling, undersampling and hybrid sampling. Oversampling techniques are generally concerned with increasing the minority class samples to make the class distribution like the majority class. This is done by sampling the minority class and replicating its samples or synthesizing new samples based on them. The main disadvantage of this method is that it relies heavily on the original data quality, and generating too many samples may lead to overfitting. The four methods of this nature considered in this paper are random oversampling (ROS) [12], synthetic minority oversampling technique (SMOTE) [4], borderline synthetic minority oversampling technique (BSMOTE) [13] and adaptive synthetic technique (ADASYN) [14].
Undersampling techniques, on the other hand, are concerned with reducing the majority class samples to make the class distribution equal to the minority class. This is done by randomly or strategically removing samples from the majority class. The main disadvantage of this method is the loss of data which could have been used to train the model. The four methods of this nature considered in this paper are random undersampling (RUS), TL [5], edited nearest-neighbor (ENN) [15] and neighbourhood cleaning rule (NCR) [16].
Finally, hybrid sampling techniques combine both oversampling and undersampling methods together to balance the minority and majority class distribution. Generally, they perform better than both oversampling and undersampling alone, as they combine the advantages and limit the disadvantages of both techniques. The four methods of this nature considered in this paper are SMOTE-TL [17], SMOTE-ENN [17], SMOTE-RUS and SMOTE-NCR [18].
Pratama et al. [19] conducted several experiments to show the effect of data-level techniques on imbalanced classification and compared several resampling methods including SMOTE, BSMOTE and SMOTE-TL to find the best method. They also utilized several machine learning classifiers, like logistic regression (LR), k-nearest neighbors (K-NN), classification and regression trees (CART), random forest (RF), support vector machine (SVM), and the stacking ensemble method. The results showed that the hybrid sampling method SMOTE-TL worked the best with the RF model, achieving 85.8% accuracy on a 10-fold cross-validation and a geometric mean score of 0.89, which was the best score among the models. The study could be improved by adding more undersampling and hybrid-sampling methods or focusing deeper on ensemble learning classifiers. More types of data, adjusted with feature selection methods, could also have been added.
On the other hand, Buraimoh et al. [20] focused on the importance of dimensionality reduction and data-sampling methods to improve the performance of imbalanced models predicting student success. They utilized principal component analysis (PCA) for their dimensionality reduction, along with comparing ROS, RUS, and SMOTE methods with six classifiers. These were SVM, K-NN, CART, gradient boosted tree (GBT), multilayer perceptron (MLP), and linear discriminant analysis (LDA). The results showed that SMOTE with PCA was the best preprocessing method along with the SVM classifier, as they achieved the highest accuracy of 0.93 and a Kappa accuracy of 0.89.
The other approach to class balancing is through using deep learning synthesizers to generate synthetic data that balance the target classes [21]. The two synthesizers used in this paper are Gaussian copula and generative adversarial network (GAN). Gaussian copulas are mathematical models used to map the marginal distribution of each variable in the given dataset to a standard normal distribution. Essentially, they create a joint probability for two or more variables while still preserving their marginal distributions. The disadvantage of Gaussian copula is that it cannot capture tail dependence within the data distribution. Thus, it cannot replicate the real data distribution completely.
GANs are a class of deep learning models that work by training two neural networks, called the generator and the discriminator. The generator network's goal is to try and produce new synthetic data comparable to the real data, and the discriminator's goal is to evaluate whether the input data is real or synthetic. GANs are typically used in computer vision to arbitrarily generate images for data augmentation, increasing the size of a dataset for better prediction. However, researchers have also adapted them into learning from real tabular data and generating synthetic data with high fidelity [22], [23].
In terms of deep learning approaches, most use models to generate synthetic data that can balance the classes. However, the most popular use cases for these methods are image data and video data. There were not many well-performing models for tabular data until researchers introduced one, improving upon previous methods with a model called conditional tabular generative adversarial network (CTGAN) [22]. They further ensured that the model could generate samples in a novel random manner to address the class imbalance issue. The CTGAN model is being continuously improved upon by other researchers; however, the base model is still relatively suitable for most tasks. Researchers have also experimented with improving CTGAN models to create a generative adversarial network modelling inspired from naive Bayes and logistic regression's relationship (GANBLR) [24].
Besides GAN, another popular deep learning (DL) based model is the variational autoencoder (VAE), proposed in [25]. It was an integrated framework that consisted of a latent VAE with a deep neural network (DNN) to address and alleviate the class imbalance issue and provide early warning for at-risk students. VAE has certain advantages, as it has a simpler loss function compared to GAN, for example. The researchers utilized this to train multiple models quickly and got a high result of 80% F1 score.
2. METHOD
In this section, the overall methodology of this paper, as shown in Figure 1, is explained. There are two separate research flows in the overall methodology, with the main difference being the stage at which the class balancing occurs. For data-level resampling, the class balancing occurs immediately after data splitting on the train set, whereas for the deep learning method, the class balancing occurs after feature selection, when the synthetic datasets are generated.
2.1. Dataset
The student academic dataset is collected from graduated students in the years 2020 to 2021, across different programs from a private university in Malaysia. There are a total of 5,488 students and 158 features, including demographic and academic performance features. Among the 158 features, 23 are categorical, 132 are numerical and 3 are datetime features. Each semester has its own grade point average (GPA) and cumulative grade point average (CGPA), and these are grouped together in Table 1.
2.2. Data pre-processing
The main step in pre-processing the data is handling missing data and removing any irrelevant data. In this paper, features with missing data above 80% are dropped. Some of the columns have overlapping information, such as _, _, and _ _. The feature that provides the most information is kept and the remaining columns are dropped. All remaining missing values are replaced with the median for quantitative features and the mode for qualitative features.
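As a minimal sketch of this cleaning step in Python with pandas (the 80% threshold and the median/mode rules follow the text; the function and DataFrame names are hypothetical):

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Drop features whose missing-value ratio exceeds 80%.
    df = df.loc[:, df.isna().mean() <= 0.80]
    # Impute the rest: median for quantitative, mode for qualitative features.
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df
```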
2.3. Data splitting
The dataset is then divided into a train and test set with a 70%-30% ratio respectively. The data-level class balancing methods are applied only to the train set, as the test set must remain unchanged to obtain an unbiased estimate of the performance. Subsequent steps besides the data-level methods are applied to both sets of data.
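A sketch of the split with scikit-learn, assuming X and y hold the cleaned features and target; stratify=y gives the stratified splitting mentioned in section 2.7:

```python
from sklearn.model_selection import train_test_split

# 70%-30% train-test split; stratification preserves the class distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
```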
Figure 1. Research flows for (a) data-level resampling techniques and (b) deep learning synthetic models
Table 1. Feature descriptions in the academic dataset
Name | Description
_ | Graduated on time or not
_ | Unique identifier (ID) for students
_ | Type of degree the student is taking
_ | End of program status of student
_ | End of program action for student
_ | The term when the student is admitted
_ | Date when student is admitted
_ | Date when student is graduated
_ | Date when student status is changed
_ | Expected term the student graduates
_ | Campus the student belongs to
_ | Mode of study the student is in
_ | Code for the specific study program the student is taking
_ | Short form of the study program
_ | Long, descriptive form of the study program
_ | Faculty the student belongs to
_ | Type of disability the student has
_ | Nationality of the student
_ | Race of the student
_ | Sex of the student
_ | Malaysian University English Test (MUET) score for the student
_ | International English Language Testing System (IELTS) score for the student
_ | Loan belonging to the student
_ | Sponsorship belonging to the student
_ | Scholarship belonging to the student
_ | Total cumulative credits
_ | Description of final program results
_ | Honors belonging to the student
_ | Total credits required to graduate
_ | Sijil Pelajaran Malaysia (SPM) and Sijil Tinggi Persekolahan Malaysia (STPM) grades of student
_ | Number of terms student has done
T1-17: _ | Current GPA for the term
T1-17: _ | Cumulative GPA (CGPA) for the term
2.4. Data scaling
For all quantitative features, the data is scaled to ensure no feature dominates the calculation in a predictive algorithm. For this paper, a robust scaler is applied, as it scales the data to the interquartile range, which makes the data robust to outliers. It is also suitable for handling skewed distributions, as it is based on percentiles, which are less affected by extreme values.
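A sketch with scikit-learn's RobustScaler, which centers on the median and scales by the interquartile range; num_cols (the list of quantitative columns) is an assumed variable, and the scaler is fit on the train split only so no test-set statistics leak in:

```python
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()  # (x - median) / IQR, robust to outliers
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
```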
2.5. Category encoding
For this paper, a label encoder has been used to convert the categorical target feature into numerical form. An ordinal encoder has been used to convert the ordinal categorical variables and those variables with only two unique values. For the remaining nominal variables, it was decided to use the M-estimate encoder, as seen in [26]. The M-estimate or M-probability estimate encoder is generally used when the cardinality of features is high. It uses the target variable to encode the nominal features and uses a regularization variable to control the target leakage and reduce the overfitting. OneHotEncoder cannot be used, as it causes the features to greatly increase in number. Examples of the mapping have been provided in Table 2. As there is a certain amount of randomness present in the M-estimate encoder, the mapping for the encoder cannot be clearly seen, which is a disadvantage of this method.
Table 2. Before and after category encoding
Name | Before encoding | After encoding
Label encoding | {NO GPA, PASS, PROBATION, TERMINATED, TERMINATED – REINSTATED} | {0, 1, 2, 3, 4}
Ordinal encoding | {N, Y} | {0, 1}
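A sketch of the three encoders, using scikit-learn for the label and ordinal steps and the category_encoders package for the M-estimate step; ordinal_cols and nominal_cols are assumed lists, and the m smoothing value is illustrative (the paper does not state it):

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from category_encoders import MEstimateEncoder

label_enc = LabelEncoder()                      # target: NO GPA, PASS, ... -> 0..4
y_train_enc = label_enc.fit_transform(y_train)
y_test_enc = label_enc.transform(y_test)

ordinal_enc = OrdinalEncoder()                  # ordinal and binary categoricals
X_train[ordinal_cols] = ordinal_enc.fit_transform(X_train[ordinal_cols])
X_test[ordinal_cols] = ordinal_enc.transform(X_test[ordinal_cols])

# Target-based encoding with regularization m; the randomized noise is one
# source of the randomness the text mentions.
m_enc = MEstimateEncoder(cols=nominal_cols, m=1.0, randomized=True, random_state=42)
X_train = m_enc.fit_transform(X_train, y_train_enc)
X_test = m_enc.transform(X_test)
```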
2.6. Feature selection
In this paper, recursive feature elimination (RFE) utilizing the LightGBM model is used to select the most important features from the dataset. RFE is widely used as one of the best feature selection methods and works by searching for the best subset of features, starting with all features in the training dataset and successively removing features until the desired number is reached. The reason why RFE was chosen is that it starts with all features and removes the less relevant ones, which helps conserve the most amount of data. The LightGBM model was selected because it provides generally good performance and is memory efficient [27], which helps when multiple models are being run in RFE. RFE identified 43 features as providing the best performance in terms of F1 score. Beyond 43 features, the performance does not increase substantially, as can be seen in Figure 2.
Figure 2. Comparing mean test accuracy of models against number of selected features in the models
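A sketch of the selection step, assuming the encoded train split from above; 43 is the feature count the paper identifies, and step=1 removes one least-important feature per round:

```python
from sklearn.feature_selection import RFE
from lightgbm import LGBMClassifier

selector = RFE(
    estimator=LGBMClassifier(random_state=42),  # ranks features by importance
    n_features_to_select=43,
    step=1,
)
selector.fit(X_train, y_train_enc)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
```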
2.7. Class balancing
When the chosen target feature, "_", otherwise known as the final status of the program, is examined, a clear class imbalance can be seen. There is a significant amount of class imbalance present in this dataset, as can be seen in Table 3 and Figure 3. The majority class, "PASS", occupies around 66% of all values, whereas the minority class, "TERMINATED – REINSTATED", does not occupy even 1%.
Table 3. Distribution of classes in target feature
Original variable | Encoded variable | Count | Percentage (%)
NO GPA | 0 | 929 | 17.33
PASS | 1 | 3576 | 66.72
PROBATION | 2 | 551 | 10.28
TERMINATED | 3 | 273 | 5.09
TERMINATED – REINSTATED | 4 | 31 | 0.58
Figure 3. Visualization of class imbalance in target feature
The amount of class imbalance can be calculated using the imbalance ratio (IR) formula, which is given in (1). For multiclass classification, the IR is the number of samples in the greatest majority class over the lowest minority class, which in this case is 3576/31 = 115.35. When the dataset is split into train and test datasets, stratified splitting is used to ensure the same class distribution is present in both datasets. In (1), $N_{maj}$ refers to the number of samples in the majority class and $N_{min}$ refers to the number of samples in the minority class.

$$IR = \frac{N_{maj}}{N_{min}} \qquad (1)$$
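The ratio can be checked directly from the class counts in Table 3; a short sketch with pandas:

```python
import pandas as pd

counts = pd.Series(y).value_counts()   # per-class sample counts (Table 3)
ir = counts.max() / counts.min()       # 3576 / 31 = 115.35 for this dataset
```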
2.8. Performance metrics
To evaluate the models and choose the optimal one, certain metrics need to be chosen to compare the results. In this paper, common metrics from the machine learning field that are suited for imbalanced multiclass classification problems have been chosen. These include threshold metrics such as precision, recall and F1 score. Precision is a metric that calculates the ratio of true positives predicted to the total number of predicted positives in the model. Recall, also known as the true positive rate, refers to the ratio of true positives predicted to the number of actual positives in the model. The F1 score can show the true model performance, as it is calculated using the harmonic mean of both precision and recall for each class. This score shows poor results if the model is simply predicting the majority class. As this is a multiclass classification problem, the average for each metric is calculated using the macro-average, as it treats each class equally, and if one class performs poorly the overall result becomes poor. This is useful for imbalanced datasets as it ensures that each class equally contributes to the result.
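Written out per class, these are the standard definitions, stated here for reference:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

and the macro average over the $C$ classes is $F_1^{macro} = \frac{1}{C} \sum_{c=1}^{C} F_1^{(c)}$, with precision and recall macro-averaged in the same way.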
Besides these, the ROC area under the curve (AUC) score is also included, as it is not biased towards the majority or minority classes, which makes it useful in imbalanced classification. It captures the trade-off between the true positive rate and the false positive rate for a model. Again, as this is a multiclass classification problem, the ROC AUC score uses the one-vs-rest scheme, which compares each class against all others together and averages the results.
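A sketch of the four scores with scikit-learn, assuming y_true holds the test labels, y_pred the predictions and y_proba the per-class probabilities from the classifier:

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

precision = precision_score(y_true, y_pred, average="macro")
recall = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
# One-vs-rest ROC AUC; needs class probabilities rather than hard labels.
roc_auc = roc_auc_score(y_true, y_proba, multi_class="ovr", average="macro")
```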
2.9. Application of class balancing methods
In this subsection, each of the class balancing methods is briefly explained, and the effect of each method on the dataset is shown in Tables 4 to 7. The different ways each method approaches class balancing, and to what extent they consider classes as balanced, can be seen as well. Each class balancing method has then been evaluated using a LightGBM classifier in section 3.
2.9.1. Oversampling techniques
Within the oversampling techniques, the simplest is ROS, where random samples from all classes besides the majority class are duplicated until the final amount from each class is equal to the majority class. In this paper, among the 5 target classes in the real dataset, '0' has 650 samples, '1' has 2,503, '2' has 386, '3' has 191 and '4' has 22; ROS causes all classes to have 2,503 samples, as can be seen in Table 4. SMOTE is more complex than ROS, as it generates synthetic samples of the minority class by considering the linear combinations of existing minority class neighbors. By default, it generates samples equal in number to the majority class. Thus, in this paper, samples from all other classes increase to 2,503. BSMOTE improves upon the SMOTE algorithm by selecting the minority class samples on the border of the classification line. These are then used to synthesize new samples, which helps improve the sample category distribution. For this paper, the number of samples from each class increases to 2,503, the same as the majority class samples. ADASYN is also an oversampling method that generates synthetic samples, but the difference between it and SMOTE is that ADASYN focuses on minority samples that are difficult to classify correctly, rather than oversampling all minority samples uniformly. It assigns a weight to each minority instance based on its difficulty. For this paper, '0' increased from 650 to 2,500 samples, '1' remained the same at 2,503 samples, '2' increased from 386 to 2,488 samples, '3' increased from 191 to 2,511 samples and '4' increased from 22 to 2,507 samples, as seen in Table 4.
Table 4. Distribution of class after oversampling
Encoded variable | Imbalanced | ROS | SMOTE | BSMOTE | ADASYN
0 | 650 | 2503 | 2503 | 2503 | 2500
1 | 2503 | 2503 | 2503 | 2503 | 2503
2 | 386 | 2503 | 2503 | 2503 | 2488
3 | 191 | 2503 | 2503 | 2503 | 2511
4 | 22 | 2503 | 2503 | 2503 | 2507
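A sketch of the four oversamplers using the imbalanced-learn package, applied to the selected train features only (variable names carried over from the earlier sketches):

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE, BorderlineSMOTE, ADASYN

oversamplers = {
    "ROS": RandomOverSampler(random_state=42),
    "SMOTE": SMOTE(random_state=42),
    "BSMOTE": BorderlineSMOTE(random_state=42),
    "ADASYN": ADASYN(random_state=42),
}
resampled = {
    name: s.fit_resample(X_train_sel, y_train_enc)  # returns (X_res, y_res)
    for name, s in oversamplers.items()
}
```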
2.9.2. Undersampling techniques
For undersampling techniques, like ROS, RUS is the simplest technique; it randomly removes samples from all other classes besides the minority class until the number of samples is equal to the minority class samples. For this paper, as the minority class is '4' with 22 samples, samples from all other classes are removed until there are 22 for each class. TL is an undersampling technique focused on removing noisy instances that are on the border of the classification line. This is done by finding pairs of very close instances that belong to different classes. These pairs are known as Tomek links, and removing the majority class instance of each pair increases the space between the classes, which improves the classification process as the noisy and borderline instances are removed. In this paper, the '0' class decreased from 650 to 605 samples, '1' decreased from 2,503 to 2,449, '2' decreased from 386 to 355, '3' decreased from 191 to 180 and '4' remained the same at 22 samples, as seen in Table 5.
ENN is another method for eliminating noisy samples from classes other than the minority class. For every sample, ENN calculates its nearest neighbours (by default, three); if the sample was not from the minority class and is misclassified by its neighbours, it is removed. If the sample was from a minority class, however, and is misclassified by its neighbours, then the neighbour instances not from the minority class are removed. For this paper, '0' decreased from 650 to 396 samples, '1' decreased from 2,503 to 2,105, '2' decreased from 386 to 173, '3' decreased from 191 to 94 and '4' remained the same at 22 samples. Meanwhile, NCR is an improvement to ENN, as it focuses less on improving the class distribution and more on increasing the unambiguity of the samples that are retained in all the classes except the minority class. NCR works by first selecting and removing all noisy samples in a similar manner to ENN. Then the samples from all classes except the minority class that are misclassified are removed, but only if the number of samples in those classes is larger than half the samples in the minority class. The NCR method changed the class distribution in the following manner: '0' decreased from 650 to 493 samples, '1' decreased from 2,503 to 2,357, '2' decreased from 386 to 247, '3' decreased from 191 to 142 and '4' remained the same at 22 samples.
Table 5. Distribution of class after undersampling
Encoded variable | Imbalanced | RUS | TL | ENN | NCR
0 | 650 | 22 | 605 | 396 | 493
1 | 2503 | 22 | 2449 | 2105 | 2357
2 | 386 | 22 | 355 | 173 | 247
3 | 191 | 22 | 180 | 94 | 142
4 | 22 | 22 | 22 | 22 | 22
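The corresponding imbalanced-learn sketch for the four undersamplers; TL and NCR are deterministic cleaning rules, so they take no random seed:

```python
from imblearn.under_sampling import (
    RandomUnderSampler, TomekLinks, EditedNearestNeighbours, NeighbourhoodCleaningRule,
)

undersamplers = {
    "RUS": RandomUnderSampler(random_state=42),
    "TL": TomekLinks(),                          # drops the majority side of each Tomek pair
    "ENN": EditedNearestNeighbours(n_neighbors=3),
    "NCR": NeighbourhoodCleaningRule(),
}
X_tl, y_tl = undersamplers["TL"].fit_resample(X_train_sel, y_train_enc)
```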
2.9.3. Hybrid sampling techniques
Hybrid sampling mostly consists of combining both oversampling and undersampling techniques together. Since SMOTE is so popular [28], it is widely used in combination with other undersampling methods. SMOTE-RUS is one such method, where SMOTE and RUS are combined by first generating synthetic samples for the minority class and then eliminating samples from classes other than the minority class. According to [4], who first implemented SMOTE, the best performing method was when SMOTE was combined with RUS and not just when using SMOTE alone. For this paper, samples from all classes were equal to the majority class after SMOTE-RUS was applied. Like SMOTE-RUS, SMOTE-TL first performs oversampling on the minority class using SMOTE and then undersampling using Tomek links. However, there are some slight differences in the class distribution after the method was applied when compared to SMOTE-RUS, due to RUS being a simpler and more naïve method than TL. The class distribution was changed as follows: '0' increased from 650 to 2,501, '1' decreased from 2,503 to 2,501, '2' increased from 386 to 2,503, '3' increased from 191 to 2,503 and '4' increased from 22 to 2,503 samples. A similar pattern to SMOTE-TL occurs when SMOTE-ENN is applied to the dataset, albeit with more differences in class distribution when compared to SMOTE-RUS. This is observed in Table 6, as class '0' samples increased from 650 to 2,405, '1' decreased from 2,503 to 1,990, '2' increased from 386 to 2,476, '3' increased from 191 to 2,500 and '4' increased from 22 to 2,503 samples. Whereas after SMOTE-NCR was applied, '0' increased from 650 to 2,503 samples, '1' decreased from 2,503 to 2,158, '2' increased from 386 to 2,495, '3' increased from 191 to 2,501 and '4' increased from 22 to 2,503 samples. SMOTE-NCR is also used commonly in network intrusion detection [18].
Table 6. Distribution of class after hybrid sampling
Encoded variable | Imbalanced | SMOTE-RUS | SMOTE-TL | SMOTE-ENN | SMOTE-NCR
0 | 650 | 2503 | 2501 | 2405 | 2503
1 | 2503 | 2503 | 2501 | 1990 | 2158
2 | 386 | 2503 | 2503 | 2476 | 2495
3 | 191 | 2503 | 2503 | 2500 | 2501
4 | 22 | 22 | 2503 | 2503 | 2503
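imbalanced-learn ships SMOTE-TL and SMOTE-ENN as ready-made combinations; SMOTE-RUS and SMOTE-NCR can be chained with its sampling pipeline. A sketch:

```python
from imblearn.combine import SMOTETomek, SMOTEENN
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler, NeighbourhoodCleaningRule

hybrids = {
    "SMOTE-TL": SMOTETomek(random_state=42),
    "SMOTE-ENN": SMOTEENN(random_state=42),
    # No ready-made class for these two; chain the oversampling and cleaning steps.
    "SMOTE-RUS": Pipeline([("smote", SMOTE(random_state=42)),
                           ("rus", RandomUnderSampler(random_state=42))]),
    "SMOTE-NCR": Pipeline([("smote", SMOTE(random_state=42)),
                           ("ncr", NeighbourhoodCleaningRule())]),
}
X_res, y_res = hybrids["SMOTE-TL"].fit_resample(X_train_sel, y_train_enc)
```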
2.9.4. Deep learning synthesizers
The class distribution for the Gaussian copula generated dataset is not similar to the real dataset, as seen in Table 7. The minority class was changed from '4' to '3', but the result was a balanced distribution. On the other hand, for the CTGAN dataset, the synthetic data distribution follows the real dataset more than Gaussian copula. There was no change in the minority class, but the overall class distribution for the synthetic data is more balanced than the real dataset.
Table 7. Distribution of class after deep learning synthesizers
Encoded variable | Imbalanced | Gaussian copula | CTGAN
0 | 650 | 662 | 527
1 | 2503 | 2758 | 2081
2 | 386 | 176 | 678
3 | 191 | 60 | 406
4 | 22 | 96 | 60
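A sketch with the SDV package (1.x API; earlier versions differ), fitting both synthesizers on the train split including the encoded target so the sampled rows carry class labels; train_df and the epochs value are assumptions:

```python
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer, CTGANSynthesizer

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(train_df)        # train_df = features + target

gc = GaussianCopulaSynthesizer(metadata)
gc.fit(train_df)
gc_synthetic = gc.sample(num_rows=len(train_df))

ctgan = CTGANSynthesizer(metadata, epochs=300)  # epochs value is illustrative
ctgan.fit(train_df)
ctgan_synthetic = ctgan.sample(num_rows=len(train_df))
```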
Figure 4 shows the differences in data distribution between Gaussian copula and CTGAN for one column, which is the semester 9 CGPA (T9: _). This column was chosen as it holds the final semester grades for certain programs, and any useful information gleaned can be used in future works. The Gaussian copula chart shows that the synthetic data distribution does not follow the peaks and troughs of the real data distribution, whereas the one made from CTGAN does. This indicates that, for this column, the Gaussian copula dataset does not follow the real data distribution as accurately as the CTGAN dataset.
Figure 4. Comparing distribution of semester 9 CGPA between real dataset and (a) Gaussian copula and (b) CTGAN
3. RESULTS AND DISCUSSION
This section analyses the results of the comparison between the different types of class balancing methods. The evaluation is done in two steps. The first is evaluating the dataset after the class balancing methods have been applied, using the popular classifier LightGBM [27]. LightGBM was chosen due to it being a histogram-based gradient boosting algorithm, which leads to lower memory usage, faster training time and good accuracy. It is generally much faster than other machine learning algorithms. The results from evaluating the dataset after the different class balancing methods were applied are recorded in Table 8. The second step is evaluating only the synthetic datasets generated by the deep learning synthesizers, using tabular data scores such as data quality, data coverage, column shapes score, and column pair trends score.
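A sketch of the evaluation loop over the resampled train sets from section 2.9, scoring each LightGBM model on the untouched test split (variable names carried over from the earlier sketches):

```python
from lightgbm import LGBMClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

scores = {}
for name, (X_res, y_res) in resampled.items():
    clf = LGBMClassifier(random_state=42)
    clf.fit(X_res, y_res)
    y_pred = clf.predict(X_test_sel)
    y_proba = clf.predict_proba(X_test_sel)
    scores[name] = {
        "precision": precision_score(y_test_enc, y_pred, average="macro"),
        "recall": recall_score(y_test_enc, y_pred, average="macro"),
        "f1": f1_score(y_test_enc, y_pred, average="macro"),
        "roc_auc": roc_auc_score(y_test_enc, y_proba,
                                 multi_class="ovr", average="macro"),
    }
```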
3.1. Evaluation of class balancing methods
Whether regarding the F1 score or the ROC AUC score, all metrics for both Gaussian copula and CTGAN are low. This could be due to the low amount of student data in the original dataset, as it contains only approximately 5,000 rows, which may not be enough to generate a good quality synthetic dataset using deep learning models. When datasets become very large, the deep learning models have been shown to achieve competitive performance for other researchers [9], [29], [30]. The F1 score was around 0.25 for Gaussian copula and 0.29 for CTGAN.
Oversampling and hybrid sampling generally perform better than undersampling [31]–[33]. However, there are some cases where undersampling performs better than other techniques [7], [34]. These cases depend on a variety of factors, such as the size of the dataset, the type of predictive model used, the domain, the performance metrics, and so on. In this paper as well, TL and NCR from the undersampling techniques outperformed the others. The reason they outperformed could be that when those methods are used, all the noisy or ambiguous samples are removed, which allows the classifiers to better separate the individual classes. This is especially important for multiclass classification problems.
That said, not all undersampling techniques performed as well as Tomek links and NCR. Excluding the deep learning synthesizer results, the next worst result was from RUS, with an F1 score of 50%. This can be explained as RUS randomly eliminates large amounts of samples such that the number of samples in all classes becomes equal to the minority class samples, which is 22. Reducing the number of samples from 3,652 samples to just 22 samples also leads to the loss of large amounts of information for the classifier, which may be the reason it performed so badly.
Table 8. Comparison of class balancing methods using LightGBM classifier
Class balancing approach | Class balancing method | Precision | Recall | F1 | ROC
Imbalanced | Baseline | 0.72 | 0.68 | 0.70 | 0.97
Oversampling | ROS | 0.66 | 0.72 | 0.69 | 0.97
Oversampling | SMOTE | 0.67 | 0.72 | 0.69 | 0.97
Oversampling | BSMOTE | 0.67 | 0.72 | 0.69 | 0.95
Oversampling | ADASYN | 0.66 | 0.72 | 0.68 | 0.95
Undersampling | RUS | 0.51 | 0.63 | 0.50 | 0.85
Undersampling | TL | 0.81 | 0.69 | 0.72 | 0.97
Undersampling | ENN | 0.76 | 0.65 | 0.68 | 0.96
Undersampling | NCR | 0.77 | 0.67 | 0.71 | 0.95
Hybrid sampling | SMOTE-RUS | 0.67 | 0.72 | 0.69 | 0.97
Hybrid sampling | SMOTE-TL | 0.67 | 0.71 | 0.69 | 0.96
Hybrid sampling | SMOTE-ENN | 0.63 | 0.70 | 0.66 | 0.95
Hybrid sampling | SMOTE-NCR | 0.65 | 0.71 | 0.67 | 0.96
Synthesizers | Gaussian copula | 0.35 | 0.29 | 0.25 | 0.84
Synthesizers | CTGAN | 0.31 | 0.31 | 0.29 | 0.75
3.2. Evaluation of synthetic datasets
The synthesizers and the evaluation metrics used in this paper were from a package called synthetic data vault (SDV) [35]. Typically, evaluating synthetic datasets can be done by inputting them into a classifier and comparing the results. However, they can also be evaluated based on their dataset structure, including their data quality, column shapes, column pair trends and data coverage. Data quality refers to the overall structure of the synthetic data. It measures how closely the synthetic data matches the real data. Data coverage refers to how much the synthetic data follows the data distributions of the real data. It checks whether the synthetic data covers the real data's value range. The column shapes score refers to how closely each column of the synthetic data follows the real data and describes the overall column distribution change. The end score is calculated using the overall average of each column shape score. Similarly, the column pair trends score calculates how column pairs vary in relation to each other.