International Journal of Informatics and Communication Technology (IJ-ICT)
Vol. 15, No. 1, March 2026, pp. 198∼206
ISSN: 2252-8776, DOI: 10.11591/ijict.v15i1.pp198-206 ❒ 198
Leveraging distillation token and weaker teacher model to improve DeiT transfer learning capability

Christopher Gavra Reswara, Gede Putra Kusuma
Department of Computer Science, BINUS Graduate Program - Master of Computer Science, Bina Nusantara University, Jakarta, Indonesia
Article Info

Article history:
Received Mar 6, 2025
Revised Oct 28, 2025
Accepted Nov 5, 2025

Keywords:
DeiT model
Distillation token
Knowledge distillation
Transfer learning
Transformers architecture
Weak-to-strong generalization
ABSTRACT

Recently, distilling knowledge from convolutional neural networks (CNNs) has positively impacted the data-efficient image transformer (DeiT) model. Thanks to the distillation token, this method can boost DeiT performance and help DeiT learn faster. Unfortunately, a distillation procedure based on that token has not yet been applied when transferring DeiT to a downstream dataset. This study proposes implementing a distillation procedure based on the distillation token for transfer learning. It boosts DeiT performance on downstream datasets. For example, our proposed method improves DeiT B 16 model performance by 1.75% on the OxfordIIIT-Pets dataset. Furthermore, we present using a weaker model as the teacher of DeiT. This can shorten the transfer learning process of the teacher model without reducing DeiT performance too much. For example, DeiT B 16 model performance decreased by only 0.42% on Oxford 102 Flowers with EfficientNet V2S compared to RegNetY 16GF. In contrast, in several cases, DeiT B 16 model performance could even improve with a weaker teacher model. For example, DeiT B 16 model performance improved by 1.06% on the OxfordIIIT-Pets dataset with EfficientNet V2S compared to RegNetY 16GF as the teacher model.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Christopher Gavra Reswara
Department of Computer Science, BINUS Graduate Program - Master of Computer Science
Bina Nusantara University
Jakarta, Indonesia
Email: christopher.reswara@binus.ac.id
1. INTRODUCTION
Recently, transformer [1] architectures have become the model of choice in natural language processing (NLP) and computer vision. Due to the self-attention module, the powerful network capacity of this architecture can handle various tasks. In NLP, transformer models achieve competitive results in developing large language models (LLMs), such as GPT-4o [2], Llama 3.2 [3], and Gemini 1.5 [4]. These LLMs help complete human tasks like text summarization [5], sentiment analysis [6], question answering [7], and others. In addition, transformer models achieve excellent performance in computer vision, including image classification [8], object detection [9], image matching [10], and other tasks. The first transformer-based model in computer vision was the vision transformer (ViT) [11]. That model leverages raw image patches as input and a classification token as output. Subsequently, transformer-based models in computer vision developed into various models, such as DeiT [12], Swin [8], Swin V2 [13], and others. For example, DeiT introduces a new procedure for knowledge distillation (KD) [14] in a transformer-based model.

Journal homepage: http://ijict.iaescore.com
This model has a new token compared to ViT: a distillation token. The token is used to calculate the loss between teacher and student outputs. To improve DeiT performance, the RegNetY 16GF [15] model was used as its teacher. RegNetY 16GF has 83.6M parameters, similar to DeiT B 16, which has 86.6M. While training the student model, the output of the teacher model acts as a supervisor. The classification token of DeiT is trained with cross-entropy loss, while the distillation token of DeiT and the teacher output are compared with Kullback-Leibler divergence (KL divergence) [16]. Both resulting losses are averaged to form the student loss used for backward propagation in the student model.
Unfortunately, this technique has not yet been used for transfer learning to downstream datasets. Therefore, this study investigated the effects of utilizing a distillation token for transfer learning to a downstream dataset. While the DeiT paper has explored the impact of the distillation token when training a model from scratch, it has not explicitly addressed its influence when used for transfer learning to a downstream dataset. We aim to enhance the transfer learning capability of the transformer-based model.
To prove it, we design a simple setup. In Figure 1, we utilize RegNetY 16GF as the teacher model. (a) We transfer-learn RegNetY 16GF, pre-trained on ImageNet-1k [17], to the downstream dataset, and we call the result the trained teacher model on the downstream dataset (as shown in Figure 1(a)). After that, (b) we leverage the trained teacher model on the downstream dataset to supervise DeiT, pre-trained on ImageNet-1k, as a student model during transfer learning to the same downstream dataset (as shown in Figure 1(b)). In this way, we show that this technique can improve DeiT's transfer learning capability.
The experiments in this study use CIFAR-10 [18], CIFAR-100 [18], Oxford 102 Flowers [19], and Oxford-IIIT Pets [20] as downstream datasets. In addition, we adopt the weak-to-strong generalization [21] concept to simplify step (a), the transfer learning of the pre-trained teacher model on ImageNet-1k to the downstream dataset. This concept is motivated by the rapid and robust development of artificial intelligence, especially transformer-based models. A superalignment model, which is more intelligent than humans, is possible; in contrast, humans need help to verify that such a model is still correct and safe. This concept therefore shows that a weaker supervisor model can still supervise a more robust model.
Figure 1. Illustration of our proposed method: (a) the teacher model pre-trained on the ImageNet-1k dataset is transfer-learned to the downstream dataset, and (b) the student model pre-trained on the ImageNet-1k dataset is trained with the trained teacher model on the downstream dataset
The teacher model used in the DeiT paper is RegNetY 16GF. That model has 83.6M parameters, similar to DeiT B 16 (the student model), which has 86.6M. To implement the weak-to-strong generalization concept, we propose using a weaker teacher model. We utilize EfficientNet B4 (19.3M) [22] and EfficientNet V2S (21.5M) [23]; that is, we use teacher models that are approximately 75% smaller than the student model. Our contributions are listed as follows:
(a) We propose using a distillation procedure based on a distillation token for transfer learning to the downstream dataset. We find this technique capable of improving DeiT model performance on the downstream dataset. (b) We introduce using a weaker model as a teacher of the DeiT model. This method can shorten step (a), the transfer learning process of the teacher model, because the teacher is approximately 75% smaller, without reducing DeiT (student) model performance. (c) We find that the CNN model is the best teacher for the transformer model in the transfer learning process to downstream datasets. In addition, we find that soft distillation outperforms hard distillation.
2. METHOD
2.1. Dataset
In this study, the ImageNet-1k dataset is used as a large dataset to train a model from scratch. That dataset has 1,000 classes and consists of 1,281,167 training images, 50,000 validation images, and 100,000 test images. Furthermore, a model trained on the ImageNet-1k dataset is used for transfer learning to the downstream datasets, i.e., CIFAR-10, CIFAR-100, Oxford 102 Flowers, and Oxford-IIIT Pets. Table 1 presents detailed information on the downstream datasets.
2.2. Train, validation, and test split data
The dataset is split into train, validation, and test sets for model training. A model uses the training set for training; that set is augmented so that the model can be trained well. In contrast, the validation and test sets are not augmented. The validation set is used to calculate the error rate of the model during training, which guides the training process. Meanwhile, the test set is used to evaluate model performance after training. In this study, we take 10% of the training images of CIFAR-10, CIFAR-100, and OxfordIIIT-Pets for the validation set. However, we use the default train and validation split of Oxford 102 Flowers, which assigns 50% each to the train and validation sets. Table 2 shows the detailed split of the datasets in this study.
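The 10% validation split described above can be sketched as follows. This is a framework-agnostic sketch; `make_split` is an illustrative helper name, and the fixed seed mirrors the random seed of 42 used in Section 2.6.

```python
import random

def make_split(num_train, val_fraction=0.1, seed=42):
    """Split training indices into disjoint train/validation index lists.

    A fixed seed keeps the split reproducible across runs.
    """
    rng = random.Random(seed)
    indices = list(range(num_train))
    rng.shuffle(indices)
    n_val = int(num_train * val_fraction)
    return indices[n_val:], indices[:n_val]

# Example: CIFAR-10 has 50,000 training images -> 45,000 train / 5,000 val
train_idx, val_idx = make_split(50_000)
```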
Table 1. Downstream datasets information
Dataset              Size (Train/Test)   Classes
CIFAR-10             50,000/10,000       10
CIFAR-100            50,000/10,000       100
Oxford 102 Flowers   2,040/6,149         102
OxfordIIIT-Pets      3,680/3,669         37

Table 2. Split dataset into train, validation, and test sets
Dataset              Size (Train/Val/Test)
CIFAR-10             45,000/5,000/10,000
CIFAR-100            45,000/5,000/10,000
Oxford 102 Flowers   1,020/1,020/6,149
OxfordIIIT-Pets      3,312/368/3,669
2.3. Data preprocessing
Data preprocessing is applied to all downstream datasets and all their parts: the train, validation, and test sets. The preprocessing techniques used in this study are image resizing, standardization, and normalization. First, all images are resized to 224 x 224 pixels. The downstream datasets are then standardized, converting image values from the range 0.0 to 255.0 into 0.0 to 1.0 and converting the image array structure from Height, Width, Channel to Channel, Height, Width. After standardization, the downstream datasets are normalized. Normalization is conducted based on each downstream dataset's per-channel average and standard deviation; therefore, the normalization values for CIFAR-10 differ from those of other downstream datasets, such as CIFAR-100. For CIFAR-10, the averages of each channel (Red, Green, Blue) are 0.4914, 0.4822, 0.4465, and the standard deviations of each channel are 0.247, 0.243, 0.261. Meanwhile, for CIFAR-100, the averages are 0.5071, 0.4865, 0.4409, and the standard deviations are 0.267, 0.256, 0.276. The Oxford 102 Flowers averages are 0.4330, 0.3819, 0.2964, and the standard deviations of each channel are 0.273, 0.224, 0.253. Finally, the OxfordIIIT-Pets averages are 0.4782, 0.4458, 0.3956, and the standard deviations are 0.247, 0.241, 0.249.
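The standardization and normalization steps can be sketched in plain Python as below (illustrative function names; in practice this is typically done with tensor operations). The CIFAR-10 statistics are the per-channel values listed above.

```python
def standardize(image_hwc):
    """Scale pixel values from [0, 255] to [0, 1] and reorder HWC -> CHW."""
    h, w = len(image_hwc), len(image_hwc[0])
    channels = len(image_hwc[0][0])
    return [[[image_hwc[y][x][c] / 255.0 for x in range(w)] for y in range(h)]
            for c in range(channels)]

def normalize(image_chw, means, stds):
    """Subtract the per-channel mean and divide by the per-channel std."""
    return [[[(v - means[c]) / stds[c] for v in row]
             for row in channel]
            for c, channel in enumerate(image_chw)]

# CIFAR-10 per-channel statistics (R, G, B) from the paper
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.247, 0.243, 0.261)

# A dummy 2x2 RGB image with values in [0, 255]
img = [[[0, 128, 255], [255, 0, 128]],
       [[64, 64, 64], [200, 200, 200]]]
chw = standardize(img)                          # 3 channels of 2x2 values in [0, 1]
out = normalize(chw, CIFAR10_MEAN, CIFAR10_STD)
```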
Int
J
Inf
&
Commun
T
echnol,
V
ol.
15,
No.
1,
March
2026:
198–206
Evaluation Warning : The document was created with Spire.PDF for Python.
Int
J
Inf
&
Commun
T
echnol
ISSN:
2252-8776
❒
201
2.4. Data augmentation
Data augmentation is only applied to the train set of each downstream dataset. The techniques used are random crop and random horizontal flip. For CIFAR-10 and CIFAR-100, the original images are 32 x 32 pixels; therefore, both datasets are randomly cropped to 28 x 28 pixels. Meanwhile, Oxford 102 Flowers and OxfordIIIT-Pets have a variety of image sizes; both datasets are randomly cropped to 196 x 196 pixels. After that, images are randomly flipped horizontally. After data augmentation, the train set of each downstream dataset is preprocessed as described above.
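A minimal sketch of the two augmentations, assuming images in channel-first nested-list form (the helper names are illustrative, not the exact implementation used in the experiments):

```python
import random

def random_crop(image_chw, size, rng):
    """Crop a random size x size window, using the same window for every channel."""
    h, w = len(image_chw[0]), len(image_chw[0][0])
    top = rng.randrange(h - size + 1)
    left = rng.randrange(w - size + 1)
    return [[row[left:left + size] for row in channel[top:top + size]]
            for channel in image_chw]

def random_hflip(image_chw, rng, p=0.5):
    """Flip the image horizontally with probability p."""
    if rng.random() < p:
        return [[row[::-1] for row in channel] for channel in image_chw]
    return image_chw

rng = random.Random(42)
# Dummy 1-channel 32x32 image, matching the CIFAR input size
img = [[[x + 32 * y for x in range(32)] for y in range(32)]]
aug = random_hflip(random_crop(img, 28, rng), rng)   # 28x28, maybe flipped
```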
2.5. Distillation loss
Distillation loss is a numerical metric that measures the difference between the student and teacher models' predicted outputs. Two techniques were used in this study to compute the distillation loss: hard distillation and soft distillation. Hard distillation uses cross-entropy loss to compute the distillation loss, while soft distillation uses the KL divergence function.
Hard distillation. Let Z_s be the logits of the DeiT (student) model and Z_t be the logits of the teacher model. We denote by L_CE the cross-entropy loss and by ψ the softmax function. For hard distillation specifically, the teacher model's predicted output is passed through the argmax function, which becomes the hard decision of the teacher model, denoted y_t = argmax_c Z_t(c). Finally, the hard distillation loss is defined as follows:

L_harddistill = L_CE(ψ(Z_s), y_t)
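A minimal numeric sketch of the hard distillation loss, assuming plain-Python logit lists (`softmax` and `hard_distillation_loss` are illustrative names):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_distillation_loss(student_logits, teacher_logits):
    """Cross-entropy between the student prediction and the teacher's argmax."""
    y_t = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    probs = softmax(student_logits)
    return -math.log(probs[y_t])

# Teacher is confident in class 2; the student mostly agrees
loss = hard_distillation_loss([0.1, 0.2, 3.0], [0.0, 0.5, 5.0])
```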
Soft distillation. We denote by KL the KL divergence function and by τ the temperature of the soft distillation. The temperature in soft distillation is used to smooth the probability distribution. In this study, we used τ = 2 for all soft distillation experiments. The soft distillation loss is defined as follows:

L_softdistill = τ² · KL(ψ(Z_s / τ), ψ(Z_t / τ))
After computing the distillation loss, we can compute the global loss for the backward propagation process. The global loss is the average of the student model loss and the distillation loss. Let L_global be the global loss and y the true class. The global loss is defined as follows:

L_global = ½ · L_CE(ψ(Z_s), y) + ½ · L_distill
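The soft distillation loss and the global loss above can be sketched similarly in plain Python. Note one assumption: the KL term is computed as KL(teacher ∥ student) over the temperature-smoothed distributions, which is the usual way the formula above is implemented.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_distillation_loss(student_logits, teacher_logits, tau=2.0):
    """tau^2 * KL between temperature-smoothed teacher and student distributions."""
    p_s = softmax([z / tau for z in student_logits])
    p_t = softmax([z / tau for z in teacher_logits])
    kl = sum(t * math.log(t / s) for t, s in zip(p_t, p_s) if t > 0)
    return tau ** 2 * kl

def cross_entropy(student_logits, true_class):
    """Standard supervised loss against the true label."""
    return -math.log(softmax(student_logits)[true_class])

def global_loss(student_logits, teacher_logits, true_class, tau=2.0):
    """Average of the supervised loss and the soft distillation loss."""
    ce = cross_entropy(student_logits, true_class)
    distill = soft_distillation_loss(student_logits, teacher_logits, tau)
    return 0.5 * ce + 0.5 * distill
```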
2.6. Training process
The training process in this study uses the AdamW [24] optimizer and CosineLRScheduler [25]. Moreover, training was conducted for 10 epochs, with a batch size of 32 and a random seed of 42. Model checkpointing based on the best validation accuracy was also used during training. Finally, the transfer learning of the DeiT model to the downstream dataset only updates the attention layers; all other layers were frozen.
2.7. Experiment setup
Based on our proposed method, as shown in Figure 1, we present two steps to improve DeiT transfer learning capability. First, we use a model pre-trained on the ImageNet-1k dataset as the teacher model. That teacher model is transfer-learned to the downstream datasets (Figure 1(a)). The results of this first step are the trained teacher model and the logits of the teacher model, which we denote by Z_t. In the next step, we use a DeiT model pre-trained on the ImageNet-1k dataset as the student. Then, while transfer-learning the student model to the downstream datasets, we use the trained teacher model as a helper through Z_t. Z_t is compared with the distillation token using the distillation method, producing the distillation loss. That loss is combined with the student loss to become the global loss, which is the loss the student model uses to update its weights. In this way, the teacher model can help the student model and improve its performance.
3. RESULTS AND DISCUSSION
3.1. CNN vs transformer teacher
First, we proposed using a distillation token for transfer learning to the downstream dataset. Therefore, we observed which teacher architecture is best for transfer learning. Figure 2 compares the performance of the DeiT B 16 model when transfer-learning to the CIFAR-10, CIFAR-100, Oxford 102 Flowers, and OxfordIIIT-Pets datasets with RegNetY 16GF and DeiT B 16 as the teacher model. We found that using the CNN architecture (RegNetY 16GF) as the teacher model outperforms the transformer architecture (DeiT B 16). Inductive bias from the CNN, transferred to the transformer through distillation, makes the CNN a better teacher, as explained by Abnar [26]. A CNN has a local inductive bias that can help DeiT learn faster, complementing the global inductive bias the transformer architecture is designed around. Hence, RegNetY 16GF produces better transfer learning in only 10 epochs. The following experiments in this study use a CNN architecture model, specifically RegNetY 16GF with 83.6M parameters, as the teacher model.
3.2. Hard vs soft distillation
In addition, we compare the two techniques for computing the distillation loss, as shown in Figure 3. Soft distillation outperforms hard distillation on all downstream datasets. For example, transfer learning the DeiT B 16 model to Oxford 102 Flowers with soft distillation reaches 95.83% accuracy, compared to only 92.51% with hard distillation. Likewise, the performance of soft distillation is 1.34% higher than that of hard distillation on the CIFAR-100 dataset. Soft distillation passes information about the predicted class probabilities from the teacher model to the distillation token through KL divergence. The distillation token can adjust better with this information because the actual class is not always the teacher's first prediction; it may be the teacher's second or third prediction. Hence, the following experiments in this study use the soft distillation technique with τ = 2.
Figure 2. Performance comparison of the DeiT B 16 model with different teacher architectures. The CNN architecture (RegNetY 16GF) outperforms the transformer architecture (DeiT B 16)

Figure 3. Performance comparison of the DeiT B 16 model with different distillation loss techniques. Soft distillation outperforms hard distillation
3.3. Transfer learning to downstream datasets
Finally, we have configured RegNetY 16GF as the teacher model with soft distillation, which gives the DeiT B 16 model its best performance. Furthermore, we showed that using the distillation token and the teacher-predicted output to compute the distillation loss (our proposed method) is better than just averaging the distillation token and the classification token (without a teacher, i.e., standard transfer learning) for transfer learning to downstream datasets. Figure 4 shows that our proposed method significantly improves the DeiT B 16 model. For example, the performance of the DeiT B 16 model increased by 1.75% on the OxfordIIIT-Pets dataset. Similarly, on Oxford 102 Flowers, DeiT B 16 performance with our proposed method is 95.83%, compared to only 94.99% without a teacher.
This happened because the student model could leverage the teacher's knowledge well. Our proposed method improves the student model because the distillation token receives information from the teacher model. That information makes the student model learn more directly and faster. The proof is that the standard transfer learning process needs 300 epochs in the DeiT paper, while this study needs only 10 epochs.
3.4. Using a weaker teacher model
Unfortunately, our proposed method creates an additional process: standard transfer learning of the teacher model to a downstream dataset. This makes the transfer learning process of DeiT to the downstream dataset longer. However, we are fortunate in using a CNN architecture as the teacher model: transfer-learning a CNN teacher model is faster than transfer-learning a transformer teacher model. Even though the CNN teacher is faster, we still tried to reduce the time of the standard transfer learning of the teacher model.
Furthermore, we proposed using a weaker CNN model as the teacher. A model is arguably weaker based on the size of its parameters. The previous experiment used RegNetY 16GF as the teacher model with 83.6M parameters, similar to the student model, DeiT B 16, which has 86.6M parameters. We then present two weaker models as teachers, i.e., EfficientNet B4 and EfficientNet V2S, with 19.3M and 21.5M parameters, respectively. Therefore, our experiment uses teacher models that are approximately 75% smaller than the student model, DeiT B 16. Table 3 shows a detailed description of the model parameter sizes.
The result is that EfficientNet V2S, whose model size is only 24.82% of the DeiT B 16 model, can outperform RegNetY 16GF on the CIFAR-10, CIFAR-100, and OxfordIIIT-Pets datasets, as shown in Figure 5. In addition, the performance of the student model with the weaker teacher is similar to that with RegNetY 16GF. For example, on the OxfordIIIT-Pets dataset, the performance of the student model with EfficientNet B4 as the teacher model, whose model size is only 22.28% of the student model, decreased by only 0.57% (90.16% vs 90.73%) compared to RegNetY 16GF. Our study shows that a weaker teacher model can simplify the training of the teacher. Additionally, a weaker teacher may yield better student model performance in some experiments.
Figure 4. Performance comparison of the DeiT B 16 model in transfer learning to downstream datasets between our proposed method and the student model without a teacher model (standard transfer learning)

Figure 5. Performance comparison of the DeiT B 16 model in transfer learning to downstream datasets with RegNetY 16GF, EfficientNet B4, and EfficientNet V2S as teachers
3.5. Using another student model
Finally, we showed that using a distillation token for transfer learning to downstream datasets can improve DeiT B 16 model performance. Moreover, we also showed that using a weaker model as the teacher can reduce the complexity of training the teacher model and improve student model performance on several downstream datasets. Therefore, we try the same concept with DeiT S 16 as the student model and EfficientNet B0 as a weaker teacher model, to show that our proposed method applies to a variety of models. A teacher model is arguably weaker based on the ratio of parameter sizes between the teacher and student models. We denote DeiT S 16 as the baseline of model size (100%). Then, we use EfficientNet B0 as a teacher model with 5.3M parameters, or 24.09% of the student model, and RegNetY 16GF as a teacher model with 83.6M parameters, or 380% of the student model. Table 4 shows detailed information on the model sizes in this experiment.
Like the DeiT B 16 model, DeiT S 16 with EfficientNet B0 as the teacher model outperforms RegNetY 16GF as the teacher model on CIFAR-100, Oxford 102 Flowers, and OxfordIIIT-Pets. Moreover, the EfficientNet B0 teacher could increase DeiT S 16 performance by 0.99% on the CIFAR-100 dataset, as shown in Figure 6. Conversely, the difference in performance on the CIFAR-10 dataset is only 0.07% (96.89% vs 96.82%) between RegNetY 16GF and EfficientNet B0 as the teacher. Thus, this experiment shows that our proposed method can improve performance across various models.
Table 3. Model size in the DeiT B 16 experiment
Model              Params   Model size
DeiT B 16          86.6M    100%
RegNetY 16GF       83.6M    96.53%
EfficientNet B4    19.3M    22.28%
EfficientNet V2S   21.5M    24.82%

Table 4. Model size in the DeiT S 16 experiment
Model              Params   Model size
RegNetY 16GF       83.6M    380%
DeiT S 16          22M      100%
EfficientNet B0    5.3M     24.09%
Figure 6. Performance comparison of the DeiT S 16 model in transfer learning to downstream datasets with RegNetY 16GF and EfficientNet B0 as teachers
4. CONCLUSION
Recent observations suggest that a new KD procedure in the ViT model with distillation tokens can improve performance when training from scratch. Our findings provide conclusive evidence that this new KD procedure can also enhance model performance when applied to a downstream dataset through transfer learning. Utilizing a distillation token to calculate the distillation loss between student and teacher outputs remains a helpful technique for the DeiT model in the transfer learning process. The DeiT (student) model can effectively learn from teacher knowledge, supported by a CNN architecture as the teacher model and by computing the distillation loss using soft distillation. In addition, we proposed using a weaker teacher model. We showed that on several downstream datasets this improves the performance of the DeiT model; otherwise, the performance of the DeiT model with a weaker teacher is similar to that with RegNetY 16GF as the teacher model. Meanwhile, the complexity of training the teacher model can be decreased by approximately 75%. Therefore, our proposed method of using a weaker teacher model can improve the efficiency of the training process. Our study demonstrates that utilizing a distillation token and a weaker teacher model can enhance the transfer learning capability of the DeiT model. Thus, future studies may explore the implementation of quantization and pruning methods, allowing the size of the DeiT model parameters to be similar to that of the weaker teacher model. Additionally, incorporating the distillation token technique into other transformer models, such as Swin and PVT, could also be explored.
FUNDING INFORMATION
Authors state there is no funding involved.
AUTHOR CONTRIBUTIONS STATEMENT
This journal uses the Contributor Roles Taxonomy (CRediT) to recognize individual author contributions, reduce authorship disputes, and facilitate collaboration.

Name of Author              C  M  So  Va  Fo  I  R  D  O  E  Vi  Su  P  Fu
Christopher Gavra Reswara   ✓  ✓  ✓   ✓   ✓   ✓  ✓  ✓  ✓  ✓  ✓
Gede Putra Kusuma           ✓  ✓  ✓   ✓   ✓

C: Conceptualization, M: Methodology, So: Software, Va: Validation, Fo: Formal Analysis, I: Investigation, R: Resources, D: Data Curation, O: Writing - Original Draft, E: Writing - Review & Editing, Vi: Visualization, Su: Supervision, P: Project Administration, Fu: Funding Acquisition
CONFLICT OF INTEREST STATEMENT
Authors state there is no conflict of interest.
DATA AVAILABILITY
− The datasets analyzed during the current study are publicly available.
− The CIFAR-10 and CIFAR-100 datasets are available at https://www.cs.toronto.edu/~kriz/cifar.html.
− The Oxford 102 Flowers dataset can be found at https://doi.org/10.1109/ICVGIP.2008.47.
− The Oxford-IIIT Pet dataset is available at https://doi.org/10.1109/CVPR.2012.6248092.
REFERENCES
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017, doi: 10.48550/arXiv.1706.03762.
[2] Y. Wu, X. Hu, Z. Fu, S. Zhou, and J. Li, "GPT-4o: visual perception performance of multimodal large language models in piglet activity understanding," arXiv, 2024, [Online]. Available: http://arxiv.org/abs/2406.09781.
[3] Meta, "The Llama 3 herd of models," arXiv, 2024.
[4] Gemini, "Gemini 1.5: unlocking multimodal understanding across millions of tokens of context," 2024.
[5] H. Shakil, Z. Ortiz, G. C. Forbes, and J. Kalita, "Utilizing GPT to enhance text summarization: a strategy to minimize hallucinations," Procedia Computer Science, vol. 244, pp. 238–247, 2024.
[6] J. Šmíd, P. Priban, and P. Kral, "LLaMA-based models for aspect-based sentiment analysis," in Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis, Aug. 2024, pp. 63–70, doi: 10.18653/v1/2024.wassa-1.6.
[7] J. Ding, H. Nguyen, and H. Chen, "Evaluation of question-answering based text summarization using LLM, invited paper," in Proceedings - 6th IEEE International Conference on Artificial Intelligence Testing, AITest 2024, 2024, pp. 142–149, doi: 10.1109/AITest62860.2024.00025.
[8] Z. Liu et al., "Swin transformer: hierarchical vision transformer using shifted windows," in Proceedings of the IEEE International Conference on Computer Vision, Oct. 2021, pp. 9992–10002, doi: 10.1109/ICCV48922.2021.00986.
[9] Y. Li, H. Mao, R. Girshick, and K. He, "Exploring plain vision transformer backbones for object detection," Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 13669 LNCS, pp. 280–296, 2022, doi: 10.1007/978-3-031-20077-9_17.
[10] C. Cao and Y. Fu, "Improving transformer-based image matching by cascaded capturing spatially informative keypoints," in Proceedings of the IEEE International Conference on Computer Vision, Oct. 2023, pp. 12095–12105, doi: 10.1109/ICCV51070.2023.01114.
[11] A. Dosovitskiy et al., "An image is worth 16x16 words: transformers for image recognition at scale," ICLR 2021 - 9th International Conference on Learning Representations, 2021.
[12] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, "Training data-efficient image transformers & distillation through attention," in Proceedings of Machine Learning Research, 2021, vol. 139, pp. 10347–10357.
[13] Z. Liu et al., "Swin transformer V2: scaling up capacity and resolution," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2022, vol. 2022-June, pp. 11999–12009, doi: 10.1109/CVPR52688.2022.01170.
[14] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," 2015, [Online]. Available: http://arxiv.org/abs/1503.02531.
[15] I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollár, "Designing network design spaces," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2020, pp. 10425–10433, doi: 10.1109/CVPR42600.2020.01044.
[16] S. Kullback and R. A. Leibler, "On information and sufficiency," The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951, doi: 10.1214/aoms/1177729694.
[17] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, "ImageNet: a large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, 2009, pp. 248–255, doi: 10.1109/CVPR.2009.5206848.
[18] A. Krizhevsky, "Learning multiple layers of features from tiny images," pp. 32–33, 2009.
[19] M. E. Nilsback and A. Zisserman, "Automated flower classification over a large number of classes," in Proceedings - 6th Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP 2008, 2008, pp. 722–729, doi: 10.1109/ICVGIP.2008.47.
[20] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar, "Cats and dogs," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2012, pp. 3498–3505, doi: 10.1109/CVPR.2012.6248092.
[21] C. Burns et al., "Weak-to-strong generalization: eliciting strong capabilities with weak supervision," Proceedings of Machine Learning Research, vol. 235, pp. 4971–5012, 2024.
[22] M. Tan and Q. V. Le, "EfficientNet: rethinking model scaling for convolutional neural networks," 36th International Conference on Machine Learning, ICML 2019, vol. 2019-June, pp. 10691–10700, 2019.
[23] M. Tan and Q. V. Le, "EfficientNetV2: smaller models and faster training," Proceedings of Machine Learning Research, vol. 139, pp. 10096–10106, 2021.
[24] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," arXiv preprint arXiv:1711.05101, 2019.
[25] I. Loshchilov and F. Hutter, "SGDR: stochastic gradient descent with warm restarts," arXiv preprint arXiv:1608.03983, 2017.
[26] S. Abnar, M. Dehghani, and W. Zuidema, "Transferring inductive biases through knowledge distillation," arXiv, 2020, [Online]. Available: http://arxiv.org/abs/2006.00555.
BIOGRAPHIES OF AUTHORS

Christopher Gavra Reswara received his bachelor's degree in computer science from Bina Nusantara University, where he is pursuing a master's degree in the same field. He also works as a Programmer at the Bina Nusantara IT Division. His research focuses on AI, recommendation systems, and computer vision, and he has authored two conference papers on recommendation systems. He can be contacted at: christopher.reswara@binus.ac.id.

Gede Putra Kusuma received his Ph.D. degree in Electrical and Electronic Engineering from Nanyang Technological University (NTU), Singapore, in 2013. He is currently working as a Lecturer and Head of Department of Master of Computer Science, Bina Nusantara University, Indonesia. Before joining Bina Nusantara University, he was working as a Research Scientist in I2R – A*STAR, Singapore. His research interests include computer vision, deep learning, face recognition, appearance-based object recognition, gamification of learning, and indoor positioning system. He can be contacted at: inegara@binus.edu.