IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 14, No. 5, October 2025, pp. 4162~4170
ISSN: 2252-8938, DOI: 10.11591/ijai.v14.i5.pp4162-4170
Ensemble reverse knowledge distillation: training robust model using weak models
Christopher Gavra Reswara, Tjeng Wawan Cenggoro
School of Computer Science, Bina Nusantara University, West Jakarta, Indonesia
Article Info

Article history:
Received Sep 19, 2024
Revised Jun 28, 2025
Accepted Jul 13, 2025

Keywords:
EfficientNet
Ensemble learning
Knowledge distillation
Transfer learning
Weak-to-strong
ABSTRACT
To ensure that artificial intelligence (AI) can be aligned with humans, AI models need to be developed and supervised by humans. Unfortunately, it is possible for an AI to exceed human capabilities; such models are commonly referred to as superalignment models. This raises the question of whether humans can still supervise a superalignment model, which is encapsulated in a concept called weak-to-strong generalization. To address this issue, we introduce ensemble reverse knowledge distillation (ERKD), which leverages two weaker models to supervise a more robust model. This technique is a potential solution for humans to manage superalignment models. ERKD enables a more robust model to achieve optimal performance with the assistance of two weaker models. We trained a more robust EfficientNet model with weaker convolutional neural network (CNN) models in a supervised fashion. With this method, the EfficientNet model performed better than the model trained with the standard transfer learning (STL) method. It also performed better than a model that was supervised by a single weaker model. Finally, ERKD-trained EfficientNet models can perform better than EfficientNet models that are one or even two levels stronger.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Christopher Gavra Reswara
School of Computer Science, Bina Nusantara University
Kebon Jeruk Raya No. 27, West Jakarta, Indonesia
Email: christopher.reswara@binus.ac.id
1. INTRODUCTION
The development of artificial intelligence (AI) models must be integrated with human supervision to obtain models that are useful for humans. For example, in the field of image classification, convolutional neural network (CNN) models, such as ResNet [1], DenseNet [2], EfficientNet [3], Inception V3 [4], and MobileNet V3 [5], were trained on collections of images labeled by experts, such as ImageNet [6], CIFAR-10 [7], Food-101 [8], Oxford 102 Flowers [9], Birdsnap [10], and other datasets. Large language models (LLMs) such as GPT-4 [11], Gemini 1.5 [12], and Llama-3 [13] were also built to learn from human-generated text datasets to perform natural language processing (NLP) tasks. To add an additional guarantee of their alignment with humans, LLMs were also trained with an additional step called reinforcement learning from human feedback (RLHF), which rewards or punishes the model during learning based on human judgment [14]–[16]. Until now, all forms of AI have always been intentionally directed to align with human knowledge, experience, evaluation, and feedback to assist in completing human tasks.
However, the emergence of AI models that have better capabilities than humans, commonly referred to as superalignment models, is unavoidable. This is largely due to the fact that AI supervision is now usually done by a large crowd of humans. Most of the datasets used to train AI models nowadays were curated via crowd-sourcing. This can theoretically crystallize the wisdom of the crowd within AI models, which can lead the models to be more intelligent than a single human. Superalignment models can also emerge from the practice of applying reinforcement learning without human supervision, which has been demonstrated multiple times in video games [17], board games [18], [19], and recently LLMs [20].

Journal homepage: http://ijai.iaescore.com
The emergence of superalignment models raises the question: how can we as humans supervise these models to better align with us if they are better than us? As superalignment models can emerge from the wisdom of the crowd, perhaps we can also supervise these models via another wisdom of the crowd. This study aims to simulate this idea by having an ensemble of weaker models supervise a stronger model. In the machine learning community, it is known that an ensemble of weaker models can form a strong model. This concept is named ensemble learning and has been used to build strong machine learning models such as random forest [21] and XGBoost [22].
To achieve our aim, we designed a schema in which more than one weaker teacher model supervises one stronger model within the knowledge distillation (KD) framework [23]. We named this schema ensemble reverse knowledge distillation (ERKD). Figure 1 illustrates the ERKD schema with two weak teacher models. To simulate the idea of supervising a model that is already intelligent, we use transfer learning as the main task. In particular, we use transfer learning for image classification. To measure the success of this study, we compare ERKD with a standard transfer learning (STL) procedure.
Figure 1. The ERKD schema with two weak teacher models
2. METHOD
2.1. Dataset
This study uses two image classification datasets, namely CIFAR-10 and CIFAR-100 [7]. Both datasets consist of 50,000 images for training and 10,000 images for testing. Both also have images with a resolution of 32 × 32 pixels. The difference between the two datasets is that CIFAR-10 has only ten classes, so each class consists of 6,000 images, while CIFAR-100 has 100 classes, so each class consists of 600 images. These two datasets are used in this study because they are commonly used in AI studies.
2.2. Train, validation, and test split data
The CIFAR-10 and CIFAR-100 datasets are already divided into 50,000 images for training and 10,000 for testing. All images in the training portion were shuffled. Then, we split the training portion into two parts, namely 40,000 images used for training and 10,000 images used for validation. The 40,000 training images are subjected to data augmentation. Meanwhile, the 10,000 validation images are used to compute the validation error rate and accuracy while the model learns. Finally, the 10,000 test images are used to measure the performance of the model.
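The split described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_train_validation` is hypothetical, and reusing the seed 42 from the training setup for the shuffle is an assumption, since the text does not say which seed was used for the split.

```python
import random


def split_train_validation(num_train=50_000, num_validation=10_000, seed=42):
    """Shuffle the training indices and hold out a validation subset."""
    indices = list(range(num_train))
    rng = random.Random(seed)  # assumed seed; fixed so the split is reproducible
    rng.shuffle(indices)
    validation_idx = indices[:num_validation]  # 10,000 held-out images
    train_idx = indices[num_validation:]       # remaining 40,000 images
    return train_idx, validation_idx


train_idx, validation_idx = split_train_validation()
```

The 10,000 official test images are left untouched by this step, so test accuracy is never used for model selection.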
2.3. Data preprocessing
We preprocessed the datasets by rescaling the pixel values to the range 0 to 1 and then applying z-score standardization. Firstly, we normalize the pixel values from the 0-to-255 scale to the 0-to-1 scale. Afterwards, we standardize the pixel values with z-score standardization, with the mean and standard deviation values derived from the dataset. For the CIFAR-10 dataset, the mean values were 0.4914, 0.4822, and 0.4465 for the red, green, and blue channels, respectively. The standard deviation values were 0.247, 0.243, and 0.261 for the red, green, and blue channels, respectively. For the CIFAR-100 dataset, the mean values were 0.5071, 0.4865, and 0.4409 for the red, green, and blue channels, respectively. The standard deviation values were 0.267, 0.256, and 0.276 for the red, green, and blue channels, respectively.
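As a minimal sketch of the two preprocessing steps, the snippet below rescales one 8-bit RGB pixel to the 0-to-1 range and then standardizes each channel with the CIFAR-10 statistics quoted above. The helper name `standardize_pixel` is hypothetical and the code works on a single pixel for clarity; a real pipeline would apply the same arithmetic to whole image tensors.

```python
# Per-channel statistics for CIFAR-10 as reported in the text (R, G, B).
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.247, 0.243, 0.261)


def standardize_pixel(rgb, mean=CIFAR10_MEAN, std=CIFAR10_STD):
    """Rescale an 8-bit RGB pixel to [0, 1], then apply a per-channel z-score."""
    scaled = [value / 255.0 for value in rgb]  # 0-to-255 -> 0-to-1
    return [(s - m) / d for s, m, d in zip(scaled, mean, std)]


pixel = standardize_pixel((255, 0, 128))
```

For CIFAR-100 the same function applies with the CIFAR-100 means and standard deviations passed as `mean` and `std`.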
2.4. Data augmentation
To avoid overfitting, we applied data augmentation with a random crop to 28 × 28 pixels and a random horizontal flip. This data augmentation procedure is applied only to the training dataset during model training. The data augmentation process was performed online for each epoch.
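The augmentation can be illustrated with a small sketch that operates on an image stored as a list of pixel rows. The function name `random_crop_and_flip` and the flip probability of 0.5 are assumptions made for illustration; the text specifies only the crop size and the type of flip.

```python
import random


def random_crop_and_flip(image, crop_size=28, flip_probability=0.5, rng=random):
    """Randomly crop a crop_size x crop_size window, then maybe mirror it."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_size)   # randint is inclusive on both ends
    left = rng.randint(0, width - crop_size)
    cropped = [row[left:left + crop_size] for row in image[top:top + crop_size]]
    if rng.random() < flip_probability:        # assumed 50% chance of flipping
        cropped = [row[::-1] for row in cropped]  # horizontal flip
    return cropped


# Toy 32 x 32 single-channel image whose pixel value encodes its position.
image = [[row * 32 + col for col in range(32)] for row in range(32)]
augmented = random_crop_and_flip(image, rng=random.Random(0))
```

Because the function draws new random offsets on every call, calling it once per image per epoch corresponds to the online augmentation described above.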
2.5. Models
For the transfer learning process in this study, we used EfficientNet and EfficientNet V2 [24] models, which were pre-trained on the ImageNet dataset [25], [26]. EfficientNet models form a hierarchy from weak to strong models due to the use of systematic model scaling, i.e., from the weakest B0 to the strongest B7 in EfficientNet and from the weakest V2S to the stronger V2M to the strongest V2L in EfficientNet V2. With this characteristic, EfficientNet models are well suited to the setup in this study.
2.6. Training process
The training process in all experiments in this study used Adam optimization [27] with a learning rate of $10^{-3}$ and a ridge (L2) regularization of $10^{-5}$. In addition, training was conducted for 100 epochs with a batch size of 32, and the random seed used was 42. Furthermore, the temperature used in the KD process was 2.0. Model checkpointing is used during training, keeping the model with the best validation accuracy. The input image resolution is also adjusted for each EfficientNet model in this study. EfficientNet models B0 to B7 use image sizes of 32, 34, 38, 44, 54, 66, 76, and 86, respectively. Meanwhile, the EfficientNet V2 models, V2S, V2M, and V2L, use image sizes of 32, 40, and 48, respectively.
2.7. Experiment setup
In ERKD, we used two weaker models to supervise a stronger model. For example, the stronger model EfficientNet B2 was supervised by EfficientNet B1 and B0. The weaker models were first trained with STL on the CIFAR-10 and CIFAR-100 datasets. Afterwards, these two models were used as teachers by producing soft labels to train a stronger student model in a response-based KD framework. The stronger student model was optimized to match the distribution of the soft labels using the Kullback-Leibler divergence (KL divergence) loss function.
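The following sketch shows one way the combined objective could be written for a single training example. It is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, the loss is a weighted sum of the cross-entropy on the hard label and one KL divergence term per teacher (here with the 10%/45%/45% proportion and temperature 2.0 mentioned in this paper), and no additional temperature-squared scaling is applied because the text does not mention one.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the maximum for stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Cross-entropy between the student's prediction and the hard label."""
    return -math.log(softmax(student_logits)[label])


def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student) over the class distribution."""
    return sum(p * math.log(p / q)
               for p, q in zip(teacher_probs, student_probs) if p > 0)


def erkd_loss(student_logits, teacher_logits_list, label,
              weights=(0.10, 0.45, 0.45), temperature=2.0):
    """Weighted sum of hard-label loss and one soft-label loss per teacher."""
    ce_weight, *kd_weights = weights
    student_soft = softmax(student_logits, temperature)
    loss = ce_weight * cross_entropy(student_logits, label)
    for weight, teacher_logits in zip(kd_weights, teacher_logits_list):
        teacher_soft = softmax(teacher_logits, temperature)  # soft labels
        loss += weight * kl_divergence(teacher_soft, student_soft)
    return loss


student = [2.0, 1.0, 0.1]
loss_agree = erkd_loss(student, [student, student], label=0)
loss_disagree = erkd_loss(student, [[0.1, 1.0, 2.0], [1.0, 2.0, 0.1]], label=0)
```

When both teachers produce exactly the student's logits, the KL terms vanish and only the weighted cross-entropy remains; any disagreement with a teacher adds a positive penalty. Passing a single teacher with weights such as `(0.5, 0.5)` gives the single-teacher variant.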
3. RESULTS AND DISCUSSION
In Table 1, we compare the accuracy of STL and three different variations of ERKD, each giving a different proportion to the loss functions: i) equal proportion, ii) 10% for the cross-entropy loss and 45% for each KL divergence loss, and iii) 30% for the cross-entropy loss and 35% for each KL divergence loss. The icons in the table indicate that ERKD outperforms STL. The square indicates the best accuracy, the circle indicates the second-best accuracy, and the triangle indicates the third-best accuracy. As seen in the table, all variations of ERKD outperform STL. This shows that two weaker models can still supervise the stronger model, e.g., EfficientNet B0 and EfficientNet B1 can still supervise EfficientNet B2.
In addition, we also experimented with using only one weaker model as the teacher of the stronger model. For example, EfficientNet B2 is taught only by B0 or only by B1. The cross-entropy loss and the KL divergence loss are each given a proportion of 50%. The results can be seen in Table 2. The icons in the table indicate that ERKD outperforms STL and the single-teacher method. The square indicates the best accuracy, the circle indicates the second-best accuracy, and the triangle indicates the third-best accuracy. We found that at least one variation of ERKD can outperform using only one weaker model. This shows that the ensemble learning concept in ERKD is also effective in improving model performance. For example, EfficientNet B2 performs better when supervised by both B0 and B1 than by B0 or B1 alone.
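Table 1. Comparison of the student model's accuracy between ERKD and STL

| Teacher 1 model | Teacher 1 image size | Teacher 2 model | Teacher 2 image size | Student model | Student image size | Dataset | STL (%) | Teacher 1 and 2, average (%) | Teacher 1 and 2, 10/45/45 (%) | Teacher 1 and 2, 30/35/35 (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| B1 | 34 | B0 | 32 | B2 | 38 | CIFAR-10 | 88.84 | 89.79 | 89.40 | 89.49 |
| B2 | 38 | B1 | 34 | B3 | 44 | CIFAR-10 | 90.63 | 91.34 | 91.21 | 91.18 |
| B3 | 44 | B2 | 38 | B4 | 54 | CIFAR-10 | 91.69 | 92.87 | 92.56 | 92.68 |
| B4 | 54 | B3 | 44 | B5 | 66 | CIFAR-10 | 92.63 | 92.94 | 93.23 | 93.43 |
| B5 | 66 | B4 | 54 | B6 | 76 | CIFAR-10 | 93.02 | 93.23 | 93.61 | 93.51 |
| B6 | 76 | B5 | 66 | B7 | 86 | CIFAR-10 | 93.18 | 93.78 | 93.75 | 93.94 |
| V2M | 40 | V2S | 32 | V2L | 48 | CIFAR-10 | 92.09 | 92.47 | 92.63 | 92.65 |
| B1 | 34 | B0 | 32 | B2 | 38 | CIFAR-100 | 64.93 | 68.77 | 68.39 | 68.68 |
| B2 | 38 | B1 | 34 | B3 | 44 | CIFAR-100 | 68.39 | 70.09 | 70.42 | 70.36 |
| B3 | 44 | B2 | 38 | B4 | 54 | CIFAR-100 | 71.27 | 72.72 | 74.01 | 73.84 |
| B4 | 54 | B3 | 44 | B5 | 66 | CIFAR-100 | 71.35 | 73.87 | 74.27 | 73.83 |
| B5 | 66 | B4 | 54 | B6 | 76 | CIFAR-100 | 72.63 | 75.61 | 75.12 | 75.80 |
| B6 | 76 | B5 | 66 | B7 | 86 | CIFAR-100 | 73.11 | 74.75 | 75.89 | 75.92 |
| V2M | 40 | V2S | 32 | V2L | 48 | CIFAR-100 | 69.82 | 70.98 | 70.99 | 72.13 |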
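Table 2. Comparison of the student model's accuracy between ERKD using teachers 1 and 2, and STL

| Teacher 1 model | Teacher 1 image size | Teacher 2 model | Teacher 2 image size | Student model | Student image size | Dataset | STL (%) | Teacher 1 only (%) | Teacher 2 only (%) | Teacher 1 and 2, average (%) | Teacher 1 and 2, 10/45/45 (%) | Teacher 1 and 2, 30/35/35 (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| B1 | 34 | B0 | 32 | B2 | 38 | CIFAR-10 | 88.84 | 89.54 | 89.24 | 89.79 | 89.40 | 89.49 |
| B2 | 38 | B1 | 34 | B3 | 44 | CIFAR-10 | 90.63 | 90.87 | 91.24 | 91.34 | 91.21 | 91.18 |
| B3 | 44 | B2 | 38 | B4 | 54 | CIFAR-10 | 91.69 | 92.19 | 92.33 | 92.87 | 92.56 | 92.68 |
| B4 | 54 | B3 | 44 | B5 | 66 | CIFAR-10 | 92.63 | 93.02 | 93.11 | 92.94 | 93.23 | 93.43 |
| B5 | 66 | B4 | 54 | B6 | 76 | CIFAR-10 | 93.02 | 93.17 | 93.33 | 93.23 | 93.61 | 93.51 |
| B6 | 76 | B5 | 66 | B7 | 86 | CIFAR-10 | 93.18 | 93.89 | 93.86 | 93.78 | 93.75 | 93.94 |
| V2M | 40 | V2S | 32 | V2L | 48 | CIFAR-10 | 92.09 | 92.52 | 91.98 | 92.47 | 92.63 | 92.65 |
| B1 | 34 | B0 | 32 | B2 | 38 | CIFAR-100 | 64.93 | 67.41 | 67.28 | 68.77 | 68.39 | 68.68 |
| B2 | 38 | B1 | 34 | B3 | 44 | CIFAR-100 | 68.39 | 69.29 | 69.34 | 70.09 | 70.42 | 70.36 |
| B3 | 44 | B2 | 38 | B4 | 54 | CIFAR-100 | 71.27 | 73.05 | 73.60 | 72.72 | 74.01 | 73.84 |
| B4 | 54 | B3 | 44 | B5 | 66 | CIFAR-100 | 71.35 | 74.12 | 72.86 | 73.87 | 74.27 | 73.83 |
| B5 | 66 | B4 | 54 | B6 | 76 | CIFAR-100 | 72.63 | 74.26 | 75.06 | 75.61 | 75.12 | 75.80 |
| B6 | 76 | B5 | 66 | B7 | 86 | CIFAR-100 | 73.11 | 74.51 | 74.70 | 74.75 | 75.89 | 75.92 |
| V2M | 40 | V2S | 32 | V2L | 48 | CIFAR-100 | 69.82 | 70.08 | 68.73 | 70.98 | 70.99 | 72.13 |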
To check whether architectural similarity can influence the performance of ERKD, we picked other CNN models to replace the EfficientNet models as teachers. The other CNN models were picked and mapped to replace EfficientNet models on the basis of similar accuracy on the ImageNet dataset. The other CNN model architectures we finally picked were ResNet, RegNet [28], ConvNeXt [29], and ResNeXt. Table 3 provides the mapping of the other CNN models to their EfficientNet equivalents.
With the addition of other CNN models, we now have four candidates to be used as teachers: two weaker EfficientNet models and two other CNN models equivalent to the EfficientNet models. For the sake of simplicity, we named the first two EfficientNet models teacher 1 and teacher 2, and the other two CNN models teacher 3 and teacher 4. For example, to supervise EfficientNet B2, teacher 1 and teacher 2 are B1 and B0, respectively, while teacher 3 and teacher 4 are ResNet-101 and ResNet-152, respectively.
In Tables 4 and 5, we show the results of experiments on substituting only one EfficientNet teacher with another CNN model. The results marked with icons in Table 4 indicate that ERKD outperforms STL and the single-teacher method. The square indicates the best accuracy, the circle indicates the second-best accuracy, and the triangle indicates the third-best accuracy. Meanwhile, the results marked with icons in Tables 5 and 6 indicate that ERKD outperforms STL and the single-teacher method. The square indicates the best accuracy, the circle indicates the second-best accuracy, the equilateral triangle indicates the third-best accuracy, and the right triangle indicates the fourth-best accuracy.
Table 3. The mapping of other CNN models to the EfficientNet models based on similar accuracy on ImageNet dataset

| EfficientNet model | Other CNN model | EfficientNet accuracy (%) | Other CNN model accuracy (%) |
|---|---|---|---|
| B0 | ResNet-101 | 77.692 | 77.374 |
| B1 | ResNet-152 | 77.692 | 77.374 |
| B2 | RegNet Y 16GF | 77.692 | 77.374 |
| B3 | ConvNeXt Tiny | 77.692 | 77.374 |
| B4 | ResNeXt101 64X4D | 77.692 | 77.374 |
| B5 | ResNeXt101 64X4D | 77.692 | 77.374 |
| B6 | ConvNeXt Small | 77.692 | 77.374 |
| V2S | ConvNeXt Base | 77.692 | 77.374 |
| V2M | ConvNeXt Large | 77.692 | 77.374 |
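Table 4. Comparison of the student model's accuracy between ERKD using teachers 1 and 4, and STL

| Teacher 1 model | Teacher 4 model | Student model | Dataset | STL (%) | Teacher 1 only (%) | Teacher 4 only (%) | Teacher 1 and 4, average (%) | Teacher 1 and 4, 10/45/45 (%) | Teacher 1 and 4, 30/35/35 (%) |
|---|---|---|---|---|---|---|---|---|---|
| B1 | B0 | B2 | CIFAR-10 | 88.84 | 89.54 | 89.46 | 89.54 | 89.57 | 89.67 |
| B2 | B1 | B3 | CIFAR-10 | 90.63 | 90.87 | 90.56 | 91.07 | 90.94 | 91.34 |
| B3 | B2 | B4 | CIFAR-10 | 91.69 | 92.19 | 92.20 | 92.54 | 92.83 | 92.62 |
| B4 | B3 | B5 | CIFAR-10 | 92.63 | 93.02 | 92.81 | 92.98 | 93.03 | 93.34 |
| B5 | B4 | B6 | CIFAR-10 | 93.02 | 93.17 | 93.62 | 94.05 | 93.42 | 94.03 |
| B6 | B5 | B7 | CIFAR-10 | 93.18 | 93.89 | 93.89 | 93.85 | 93.74 | 94.28 |
| V2M | V2S | V2L | CIFAR-10 | 92.09 | 92.52 | 92.30 | 92.63 | 92.35 | 92.33 |
| B1 | B0 | B2 | CIFAR-100 | 64.93 | 67.41 | 65.95 | 67.59 | 67.22 | 66.99 |
| B2 | B1 | B3 | CIFAR-100 | 68.39 | 69.29 | 68.57 | 69.53 | 69.17 | 69.60 |
| B3 | B2 | B4 | CIFAR-100 | 71.27 | 73.05 | 73.22 | 74.49 | 73.10 | 73.51 |
| B4 | B3 | B5 | CIFAR-100 | 71.35 | 74.12 | 72.76 | 74.32 | 73.90 | 74.50 |
| B5 | B4 | B6 | CIFAR-100 | 72.63 | 74.26 | 74.21 | 74.93 | 74.88 | 75.14 |
| B6 | B5 | B7 | CIFAR-100 | 73.11 | 74.51 | 74.66 | 75.62 | 75.23 | 75.26 |
| V2M | V2S | V2L | CIFAR-100 | 69.82 | 70.08 | 69.95 | 72.30 | 71.16 | 71.32 |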
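Table 5. Comparison of the student model's accuracy between ERKD using teachers 2 and 3, and STL method

| Teacher 2 model | Teacher 3 model | Student model | Dataset | STL (%) | Teacher 2 only (%) | Teacher 3 only (%) | Teacher 2 and 3, average (%) | Teacher 2 and 3, 10/45/45 (%) | Teacher 2 and 3, 20/40/40 (%) | Teacher 2 and 3, 30/35/35 (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| B1 | B0 | B2 | CIFAR-10 | 88.84 | 89.24 | 89.58 | 89.79 | 89.33 | 89.75 | 89.64 |
| B2 | B1 | B3 | CIFAR-10 | 90.63 | 91.24 | 91.14 | 91.56 | 91.39 | 91.45 | 91.40 |
| B3 | B2 | B4 | CIFAR-10 | 91.69 | 92.33 | 92.46 | 92.10 | 92.42 | 92.63 | 92.91 |
| B4 | B3 | B5 | CIFAR-10 | 92.63 | 93.11 | 93.26 | 93.46 | 93.20 | 93.30 | 93.20 |
| B5 | B4 | B6 | CIFAR-10 | 93.02 | 93.33 | 93.95 | 93.41 | 93.79 | 94.15 | 93.73 |
| B6 | B5 | B7 | CIFAR-10 | 93.18 | 93.86 | 94.24 | 93.66 | 93.71 | 94.04 | 93.83 |
| V2M | V2S | V2L | CIFAR-10 | 92.09 | 91.98 | 92.47 | 92.31 | 92.63 | 92.29 | 92.26 |
| B1 | B0 | B2 | CIFAR-100 | 64.93 | 67.28 | 66.39 | 67.81 | 67.03 | 67.65 | 67.49 |
| B2 | B1 | B3 | CIFAR-100 | 68.39 | 69.34 | 69.30 | 70.19 | 70.06 | 69.99 | 69.85 |
| B3 | B2 | B4 | CIFAR-100 | 71.27 | 73.60 | 72.96 | 73.26 | 73.93 | 73.24 | 73.41 |
| B4 | B3 | B5 | CIFAR-100 | 71.35 | 72.86 | 72.99 | 73.47 | 73.81 | 73.66 | 73.98 |
| B5 | B4 | B6 | CIFAR-100 | 72.63 | 75.06 | 74.67 | 74.93 | 75.61 | 76.19 | 75.65 |
| B6 | B5 | B7 | CIFAR-100 | 73.11 | 74.70 | 74.55 | 75.60 | 74.94 | 74.86 | 75.04 |
| V2M | V2S | V2L | CIFAR-100 | 69.82 | 68.73 | 70.80 | 70.92 | 70.91 | 71.18 | 71.39 |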
In Table 4, only teacher 1 and teacher 4 are used. Meanwhile, only teacher 2 and teacher 3 are used in Table 5. We add a new proportion of 20% for the cross-entropy loss and 40% for each KL divergence loss in Table 5. We also show the results of substituting all the EfficientNet teachers with other CNN models in Table 6. From these results, we found that ERKD can generally still improve the accuracy compared to STL and to using one teacher only. This is especially evident when we look at the accuracy of the different teacher combinations side by side in Table 7, where no combination dominantly outperforms the others. The squares in Table 7 indicate superior performance. Thus, ERKD still works regardless of the architectural similarity.
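Table 6. Comparison of the student model's accuracy between ERKD using teachers 3 and 4, and STL

| Teacher 3 model | Teacher 4 model | Student model | Dataset | STL (%) | Teacher 3 only (%) | Teacher 4 only (%) | Teacher 3 and 4, average (%) | Teacher 3 and 4, 10/45/45 (%) | Teacher 3 and 4, 30/35/35 (%) | Teacher 3 and 4, 50/25/25 (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| B1 | B0 | B2 | CIFAR-10 | 88.84 | 89.58 | 89.46 | 89.50 | 89.23 | 89.52 | 89.85 |
| B2 | B1 | B3 | CIFAR-10 | 90.63 | 91.14 | 90.56 | 91.03 | 91.35 | 91.04 | 91.39 |
| B3 | B2 | B4 | CIFAR-10 | 91.69 | 92.46 | 92.20 | 92.22 | 92.54 | 92.43 | 92.63 |
| B4 | B3 | B5 | CIFAR-10 | 92.63 | 93.26 | 92.81 | 93.46 | 92.80 | 93.35 | 92.95 |
| B5 | B4 | B6 | CIFAR-10 | 93.02 | 93.95 | 93.62 | 94.11 | 93.92 | 93.56 | 94.24 |
| B6 | B5 | B7 | CIFAR-10 | 93.18 | 94.24 | 93.89 | 94.00 | 93.99 | 93.91 | 93.92 |
| V2M | V2S | V2L | CIFAR-10 | 92.09 | 92.47 | 92.30 | 92.66 | 92.54 | 92.26 | 92.23 |
| B1 | B0 | B2 | CIFAR-100 | 64.93 | 66.39 | 65.95 | 66.33 | 66.64 | 66.73 | 66.80 |
| B2 | B1 | B3 | CIFAR-100 | 68.39 | 69.30 | 68.57 | 69.61 | 69.80 | 69.70 | 68.53 |
| B3 | B2 | B4 | CIFAR-100 | 71.27 | 72.96 | 73.22 | 72.82 | 73.31 | 72.65 | 72.33 |
| B4 | B3 | B5 | CIFAR-100 | 71.35 | 72.99 | 72.76 | 73.76 | 73.39 | 73.56 | 73.25 |
| B5 | B4 | B6 | CIFAR-100 | 72.63 | 74.67 | 74.21 | 74.85 | 75.03 | 74.70 | 75.20 |
| B6 | B5 | B7 | CIFAR-100 | 73.11 | 74.55 | 74.66 | 75.07 | 74.82 | 74.96 | 74.10 |
| V2M | V2S | V2L | CIFAR-100 | 69.82 | 70.80 | 69.95 | 71.58 | 71.03 | 71.69 | 71.10 |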
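Table 7. Accuracy comparison of ERKD with various teachers

| Student model | Dataset | Teacher 1 and 2 (%) | Teacher 1 and 4 (%) | Teacher 2 and 3 (%) | Teacher 3 and 4 (%) |
|---|---|---|---|---|---|
| B2 | CIFAR-10 | 89.79 | 89.67 | 89.79 | 89.85 |
| B3 | CIFAR-10 | 91.34 | 91.34 | 91.56 | 91.39 |
| B4 | CIFAR-10 | 92.87 | 92.83 | 92.91 | 92.63 |
| B5 | CIFAR-10 | 93.43 | 93.34 | 93.46 | 93.46 |
| B6 | CIFAR-10 | 93.61 | 94.05 | 94.15 | 94.24 |
| B7 | CIFAR-10 | 93.94 | 94.28 | 94.04 | 94.00 |
| V2L | CIFAR-10 | 92.65 | 92.63 | 92.63 | 92.66 |
| B2 | CIFAR-100 | 68.77 | 67.59 | 67.81 | 66.80 |
| B3 | CIFAR-100 | 70.42 | 69.60 | 70.19 | 69.80 |
| B4 | CIFAR-100 | 74.01 | 74.49 | 73.93 | 73.31 |
| B5 | CIFAR-100 | 74.27 | 74.50 | 73.98 | 73.76 |
| B6 | CIFAR-100 | 75.80 | 75.14 | 76.19 | 75.20 |
| B7 | CIFAR-100 | 75.92 | 75.62 | 75.60 | 75.07 |
| V2L | CIFAR-100 | 72.13 | 72.30 | 71.39 | 71.69 |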
When we compared the accuracy of models trained with ERKD against that of a stronger model trained with STL, we found a surprising result: sometimes a weaker model with ERKD can be stronger than a stronger model with STL. For example, we compared the performance of the EfficientNet B2 model using ERKD with the performance of the EfficientNet B3 model using the STL method. The results can be seen in Table 8, which shows that some models on certain datasets can beat stronger models. The results marked with icons in Table 8 indicate that ERKD outperforms STL applied to a model one level more robust. Similarly, the results marked with icons in Table 9 show that ERKD outperforms STL applied to a model two levels more robust. The square indicates the best accuracy, the circle indicates the second-best accuracy, the equilateral triangle indicates the third-best accuracy, and the right triangle indicates the fourth-best accuracy. For the CIFAR-10 dataset, EfficientNet models B4, B5, and B6 with ERKD can outperform EfficientNet models B5, B6, and B7 with STL. Meanwhile, for the CIFAR-100 dataset, EfficientNet models B2, B4, B5, and B6 with ERKD can outperform EfficientNet models B3, B5, B6, and B7 with STL.
Table 8. Accuracy comparison between ERKD and STL applied to a model one level more robust

| Student model | Dataset | STL 1 level higher (%) | Teacher 1 and 2 (%) | Teacher 1 and 4 (%) | Teacher 2 and 3 (%) | Teacher 3 and 4 (%) |
|---|---|---|---|---|---|---|
| B2 | CIFAR-10 | 90.63 | 89.79 | 89.67 | 89.79 | 89.85 |
| B3 | CIFAR-10 | 91.69 | 91.34 | 91.34 | 91.56 | 91.39 |
| B4 | CIFAR-10 | 92.63 | 92.87 | 92.83 | 92.91 | 92.63 |
| B5 | CIFAR-10 | 93.02 | 93.43 | 93.34 | 93.46 | 93.46 |
| B6 | CIFAR-10 | 93.18 | 93.61 | 94.05 | 94.15 | 94.24 |
| B2 | CIFAR-100 | 68.39 | 68.77 | 67.59 | 67.81 | 66.80 |
| B3 | CIFAR-100 | 71.27 | 70.42 | 69.60 | 70.19 | 69.80 |
| B4 | CIFAR-100 | 71.35 | 74.01 | 74.49 | 73.93 | 73.31 |
| B5 | CIFAR-100 | 72.63 | 74.27 | 74.50 | 73.98 | 73.76 |
| B6 | CIFAR-100 | 73.11 | 75.80 | 75.14 | 76.19 | 75.20 |
Table 9. Accuracy comparison between ERKD and STL applied to a model two levels more robust

| Student model | Dataset | STL 2 levels higher (%) | Teacher 1 and 2 (%) | Teacher 1 and 4 (%) | Teacher 2 and 3 (%) | Teacher 3 and 4 (%) |
|---|---|---|---|---|---|---|
| B2 | CIFAR-10 | 91.69 | 89.79 | 89.67 | 89.79 | 89.85 |
| B3 | CIFAR-10 | 92.63 | 91.34 | 91.34 | 91.56 | 91.39 |
| B4 | CIFAR-10 | 93.02 | 92.87 | 92.83 | 92.91 | 92.63 |
| B5 | CIFAR-10 | 93.18 | 93.43 | 93.34 | 93.46 | 93.46 |
| B2 | CIFAR-100 | 71.27 | 68.77 | 67.59 | 67.81 | 66.80 |
| B3 | CIFAR-100 | 71.35 | 70.42 | 69.60 | 70.19 | 69.80 |
| B4 | CIFAR-100 | 72.63 | 74.01 | 74.49 | 73.93 | 73.31 |
| B5 | CIFAR-100 | 73.11 | 74.27 | 74.50 | 73.98 | 73.76 |
We also compared ERKD with two-level stronger models trained with STL. For example, the performance of the EfficientNet B2 model using ERKD is compared with the performance of the EfficientNet B4 model using the STL method. The results can be seen in Table 9. Surprisingly, we still found that some weaker models can be stronger with ERKD than the two-level stronger models with STL. Using ERKD, EfficientNet B5 on the CIFAR-10 dataset performs better than EfficientNet B7 trained with STL on the same dataset. In addition, EfficientNet B4 and B5 on the CIFAR-100 dataset using ERKD also perform better than EfficientNet B6 and B7 trained with STL. These two surprising results show that ERKD effectively improves model performance.
4. CONCLUSION
All experiments showed that ERKD can improve the model's performance. The model's performance with the ERKD method can be better than with the STL and single-teacher methods. It can also be better than that of a one- or two-level stronger model trained with the STL method. Thus, the ERKD method is suitable for supervising stronger models using weaker models. This study also showed that the ERKD method can improve the model's performance even when the architectures of the weak and strong models are different. The EfficientNet models can still outperform STL even when assisted by other CNN models. Despite using weaker AI instead of humans, the result of this study shows a glimmer of hope that an AI with stronger intelligence than humans can still be supervised by humans. The trick is to have several humans collaborate in managing a superalignment model. Future studies could investigate a similar study but without using the trained model. They could also investigate ERKD methods in other computer vision tasks, such as image detection or image segmentation. In addition, they could experiment with using more than two weaker models to supervise a stronger model to find the optimal number of weaker models.
FUNDING INFORMATION
Authors state there is no funding involved.
AUTHOR CONTRIBUTIONS STATEMENT
This journal uses the Contributor Roles Taxonomy (CRediT) to recognize individual author contributions, reduce authorship disputes, and facilitate collaboration.
Name of Author: C, M, So, Va, Fo, I, R, D, O, E, Vi, Su, P, Fu
Christopher Gavra Reswara: ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Tjeng Wawan Cenggoro: ✓ ✓ ✓

C: Conceptualization
M: Methodology
So: Software
Va: Validation
Fo: Formal Analysis
I: Investigation
R: Resources
D: Data Curation
O: Writing - Original Draft
E: Writing - Review & Editing
Vi: Visualization
Su: Supervision
P: Project Administration
Fu: Funding Acquisition
CONFLICT OF INTEREST STATEMENT
Authors state there is no conflict of interest.
DATA AVAILABILITY
No new data were generated or analyzed during this study.
REFERENCES
[1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778, doi: 10.1109/CVPR.2016.90.
[2] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2261–2269, doi: 10.1109/CVPR.2017.243.
[3] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 6105–6114, 2019.
[4] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2818–2826, doi: 10.1109/CVPR.2016.308.
[5] A. Howard et al., “Searching for MobileNetV3,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 1314–1324, doi: 10.1109/ICCV.2019.00140.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255, doi: 10.1109/CVPR.2009.5206848.
[7] A. Krizhevsky, “Learning multiple layers of features from tiny images,” M.Sc. Thesis, Department of Computer Science, University of Toronto, Toronto, Canada, 2009.
[8] L. Bossard, M. Guillaumin, and L. V. Gool, “Food-101–mining discriminative components with random forests,” in Computer Vision - European Conference on Computer Vision (ECCV), pp. 446-461, 2014, doi: 10.1007/978-3-319-10599-4_29.
[9] M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, 2008, pp. 722–729, doi: 10.1109/ICVGIP.2008.47.
[10] T. Berg, J. Liu, S. W. Lee, M. L. Alexander, D. W. Jacobs, and P. N. Belhumeur, “Birdsnap: Large-scale fine-grained visual categorization of birds,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2019–2026, doi: 10.1109/CVPR.2014.259.
[11] J. Achiam et al., “GPT-4 technical report,” arXiv-Computer Science, Mar. 2023.
[12] P. Georgiev et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” arXiv-Computer Science, Dec. 2024.
[13] A. Grattafiori et al., “The Llama 3 herd of models,” arXiv-Computer Science, Nov. 2024.
[14] A. Glaese et al., “Improving alignment of dialogue agents via targeted human judgements,” arXiv-Computer Science, Sep. 2022.
[15] Y. Bai et al., “Training a helpful and harmless assistant with reinforcement learning from human feedback,” arXiv-Computer Science, Apr. 2022.
[16] L. Ouyang et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
[17] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529–533, Feb. 2015, doi: 10.1038/nature14236.
[18] D. Silver et al., “Mastering the game of go without human knowledge,” Nature, vol. 550, pp. 354–359, Oct. 2017, doi: 10.1038/nature24270.
[19] D. Silver et al., “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play,” Science, vol. 362, no. 6419, pp. 1140–1144, 2018, doi: 10.1126/science.aar6404.
[20] D. Guo et al., “Deepseek-R1: Incentivizing reasoning capability in llms via reinforcement learning,” arXiv-Computer Science, pp. 1-22, Jan. 2025.
[21] L. Breiman, “Random forests,” Machine Learning, vol. 45, pp. 5–32, Oct. 2001, doi: 10.1023/A:1010933404324.
[22] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, 2016, doi: 10.1145/2939672.2939785.
[23] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv-Statistics, pp. 1-9, Mar. 2015.
[24] M. Tan and Q. Le, “Efficientnetv2: Smaller models and faster training,” in Proceedings of the 38th International Conference on Machine Learning (ICML), Apr. 2021, pp. 10096–10106.
[25] L. Fei-Fei, J. Deng, and K. Li, “ImageNet: Constructing a large-scale image database,” Journal of Vision, vol. 9, no. 8, pp. 1037–1037, 2009, doi: 10.1167/9.8.1037.
[26] O. Russakovsky et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, pp. 211–252, 2015, doi: 10.1007/s11263-015-0816-y.
[27] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv-Computer Science, pp. 1-15, Jan. 2017.
[28] I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollár, “Designing network design spaces,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 10425–10433, doi: 10.1109/CVPR42600.2020.01044.
[29] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A ConvNet for the 2020s,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 11966–11976, doi: 10.1109/CVPR52688.2022.01167.
BIOGRAPHIES OF AUTHORS
Christopher Gavra Reswara received his bachelor's degree in Computer Science from Bina Nusantara University, where he is pursuing a master's degree in the same field. He also works as a programmer at the Bina Nusantara IT Division. His research focuses on artificial intelligence, recommendation systems, and computer vision, and he has authored two conference papers on recommendation systems. He can be contacted at email: christopher.reswara@binus.ac.id.
Tjeng Wawan Cenggoro received a bachelor's degree in Information Technology from STMIK Widya Cipta Dharma and a master's degree in Information Technology from Bina Nusantara University. He is currently an AI researcher focusing on developing deep learning algorithms for applications in computer vision, natural language processing, and bioinformatics. He is also an NVIDIA Deep Learning Institute certified instructor. Throughout his 9+ year career, he has led numerous research projects related to AI and data science, with applications in many domains such as e-commerce, agriculture, and health. He has published over 80 peer-reviewed publications and reviewed for prestigious journals, such as Scientific Reports, IEEE Access, and PLOS ONE. In addition to this, he also holds 4 copyrights for AI-based video/image analytics software. He can be contacted at email: tjeng.cenggoro@binus.ac.id.