Indonesian Journal of Electrical Engineering and Computer Science
Vol. 41, No. 1, January 2026, pp. 153∼167
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v41.i1.pp153-167
YOLOv8m enhancement using α-scaled gradient-normalized sigmoid activation for intelligent vehicle classification
Renz Raniel V. Serrano1, Jen Aldwayne B. Delmo1, Cristina Amor M. Rosales2
1Department of Electrical Engineering, College of Engineering, Batangas State University–The National Engineering University, Batangas City, Philippines
2Department of Civil Engineering, College of Engineering, Batangas State University–The National Engineering University, Batangas City, Philippines
Article Info

Article history:
Received Oct 19, 2025
Revised Dec 5, 2025
Accepted Dec 14, 2025

Keywords:
GeLU
LeakyReLU
Mish
Sigmoid linear unit
Swish
ABSTRACT

Vehicle classification plays a vital part in the development of intelligent transportation systems (ITS) and modern traffic management, where the ability to detect and identify vehicles accurately in real time is essential for maintaining road efficiency and safety. This paper presents an enhancement to the YOLOv8m model by refining its activation function to achieve higher accuracy and faster response in diverse traffic and environmental situations. In this study, two alternative activation functions, Mish and Swish, were integrated into the YOLOv8m structure and tested against the model's default sigmoid linear unit (SiLU). Training and evaluation were carried out using a comprehensive dataset of vehicles captured under different lighting and weather conditions. The experimental findings show that the modified activation design leads to better model convergence, improved generalization, and a noticeable boost in detection performance, recording up to 5.4% higher accuracy and 6.6% better mAP scores than the standard YOLOv8m. Overall, the results confirm that fine-tuning activation behavior can make deep learning models more adaptive and reliable for vehicle classification tasks in real-world intelligent transportation environments.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Renz Raniel V. Serrano
Department of Electrical Engineering, Batangas State University–The National Engineering University
Batangas City, Philippines
Email: renzraniel.serrano@g.batstate-u.edu.ph
1. INTRODUCTION

The rapid development of intelligent transportation systems (ITS) has become one of the defining features of modern smart cities. As urban populations continue to expand, the ability to effectively monitor and manage traffic flows has become crucial in ensuring road safety, reducing congestion, and improving urban mobility. One of the core technologies supporting ITS is vehicle classification, which involves identifying and grouping vehicles based on their physical and visual characteristics. Accurate classification plays an important role in applications such as autonomous driving, real-time traffic monitoring, and toll collection, where reliability and timely detection are critical [1], [2].
Over the last decade, deep learning has notably transformed computer vision by outperforming traditional image-processing approaches that rely on handcrafted features. Earlier models such as R-CNN and SSD provided robust detection results but suffered from high computational complexity and long inference times [3]. The you only look once (YOLO) family of algorithms addressed these limitations by combining feature extraction and classification into a single-stage detection pipeline, enabling real-time processing on embedded systems [4], [5].
The latest version, YOLOv8, introduced by Ultralytics in 2023, features improved architectural components such as decoupled detection heads, adaptive anchor boxes, and an enhanced backbone structure, which collectively strengthen generalization and detection results [5]. Among its model variants, YOLOv8m offers an optimal trade-off between computational efficiency and precision, making it particularly suitable for deployment in real-time vehicle classification systems [6]. However, despite these architectural refinements, YOLOv8's results remain highly dependent on the activation function, a fundamental mechanism that influences nonlinear transformation and gradient propagation during network learning [7], [8].
Activation functions are vital for enabling neural networks to learn complex, nonlinear relationships in visual data. Conventional functions such as the rectified linear unit (ReLU) and LeakyReLU are widely used due to their computational simplicity, yet they often suffer from problems such as neuron saturation and vanishing gradients, which reduce convergence stability [9]. In contrast, modern activation functions such as Swish, Mish, and the Gaussian error linear unit (GELU) introduce smoother gradient transitions and self-regularization, allowing the network to achieve better representation learning and generalization [10]–[12]. Empirical studies demonstrate that these newer functions can strengthen image classification and object detection results by improving convergence speed and robustness across varying data conditions [13], [14].
Despite these advancements, limited research has examined how activation function adjustment influences YOLOv8-based models, particularly for ITS applications where environmental conditions such as lighting, occlusion, and traffic density vary greatly [15]. These dynamic conditions present significant challenges to real-time detection and classification. Optimizing activation functions has also been found to reduce oscillations during training, prevent gradient vanishing, and strengthen overall model reliability, especially for edge-based implementations in traffic environments [16], [17].
Motivated by these challenges, this study explores the adjustment of the YOLOv8m activation function to strengthen vehicle classification results and model generalization in diverse conditions. The research focuses on integrating the Mish and Swish functions into the YOLOv8m framework and further introduces a gradient-normalized sigmoid (GNSig) activation that employs α-scaling and bias correction to refine training stability. The enhanced model is tested using a custom vehicle dataset gathered from various traffic scenarios in Batangas City, Philippines, encompassing different weather and illumination conditions. The models were evaluated using accuracy, mean average precision (mAP), and inference speed to examine improvements relative to the baseline YOLOv8m.
This research aims to contribute both theoretically and practically: theoretically, by deepening understanding of how activation-level modifications affect learning dynamics in deep detection architectures; and practically, by providing an adaptable and efficient framework for real-time vehicle classification in intelligent transportation environments. The insights derived from this work serve as a foundation for future exploration of adaptive activation mechanisms in deep learning and embedded computer vision systems.
To address the identified gaps in the literature and to advance the state-of-the-art in intelligent-transportation object detection using YOLO-based architectures, this work contributes the following:
− We propose a novel α-scaled GNSig activation function for the YOLOv8m model, an activation variant not previously explored in YOLO-family detectors.
− This work conducts the first systematic evaluation of activation-function replacements (rather than architectural modifications) within YOLOv8m targeted at intelligent transportation applications under real-world Philippine traffic conditions.
− This work isolates the effect of activation-function substitution by keeping the base network architecture unchanged, enabling clear attribution of performance gains to the activation alone.
− The proposed GNSig includes gradient normalization, α-scaling, and bias correction to enhance convergence stability and reduce oscillations, features absent in conventional activations like SiLU, Swish, or Mish.
− This work demonstrates improved performance not only on the target in-domain ITS dataset but also in a cross-domain pothole detection scenario, evidencing better generalization and robustness.
2. METHOD
2.1. Conceptual framework
The conceptual framework of this study shows the logical flow and interconnection among the key components involved in developing the modified YOLOv8m model for vehicle classification. As shown in Figure 1, the framework follows a systematic pipeline composed of five major stages: dataset acquisition, preprocessing and augmentation, image annotation, model training and activation function adjustment, and model evaluation. Each stage contributes to the enhancement of model results, efficiency, and generalization within the context of ITS.
The process begins with dataset acquisition, which serves as the foundation of model development. Real-world traffic videos were captured under varying conditions, including different illumination levels, weather types, and vehicle densities, to simulate the complex environments typically encountered in urban road networks. This step ensures dataset diversity, a key factor in achieving high generalization [1], [2].
Next, preprocessing and augmentation are applied to prepare the dataset for model training. Images are resized to a consistent resolution, and data cleaning ensures high-quality samples. Augmentation techniques such as horizontal flipping, random cropping, brightness adjustment, and mosaic composition are employed to expose the model to diverse visual contexts, reducing overfitting and improving robustness [13], [14].
The image annotation process is done using the Roboflow platform, where each vehicle instance is labeled with bounding boxes and category identifiers. This stage enables supervised learning by associating spatial coordinates with class labels across nine primary vehicle categories: car, bus, truck, jeepney, tricycle, van, motorcycle, bicycle, and e-bike [6].
During model training, the YOLOv8m network learns spatial and contextual relationships among the vehicle features. Hyperparameters such as learning rate, batch size, and momentum are tuned while monitoring loss metrics including box loss, classification loss, and distribution focal loss [4], [5].
The activation function plays a central role in model adjustment. This study replaces the default sigmoid linear unit (SiLU) activation in YOLOv8m with alternative configurations such as Mish, Swish, and a newly proposed GNSig with α-scaling and bias adjustment to strengthen learning stability and convergence [7]–[12].
Finally, model evaluation involves assessing the results of both the baseline and modified models on unseen validation and test datasets. Metrics such as accuracy, mAP@50, mAP@50–95, inference speed, and validation loss are computed. A cross-dataset test on a pothole detection dataset is also done to examine the modified model's adaptability and transfer learning capability [15], [16].
Overall, the conceptual framework underscores how every component, from data collection to algorithmic adjustment, contributes to developing a robust, adaptive, and efficient vehicle classification system for ITS. Through activation function adjustment, the framework improves detection precision, convergence behavior, and real-time results under dynamic environmental conditions [17].
Figure 1. Conceptual framework of the modified YOLOv8m model for vehicle classification
2.2. Data collection and preprocessing
The dataset used in this study was developed to reflect the complexity and variability of real-world traffic conditions in an urban setting. Traffic videos were collected using a DJI Osmo Pocket 2 camera placed along several sections of four-lane roads in Batangas City, Philippines. This data acquisition setup closely parallels the methodological approach pursued by Delmo [18], whose earlier work on YOLOv8-based vehicle speed estimation provided both a structural reference and valuable insight into capturing dynamic multi-lane traffic environments. The current study extends that foundation through a focus on vehicle classification and the integration of activation-level modifications to enhance detection performance.
Multiple days and different environmental conditions were sampled to ensure a high degree of representativeness. Illumination ranges from bright daylight to overcast skies, and the recordings include light to moderate rainfall with natural atmospheric effects. Regarding traffic conditions, the dataset contains scenes with traffic density ranging from free-flowing to heavy congestion, reflecting the typical fluctuations observed on roadways within an urban environment.
All videos were recorded in 1080p HD resolution, after which the footage was segmented into single frames at a sampling rate of one frame every three seconds. The processed dataset consists of 4,157 images that include various types of vehicles with a wide range of orientations, scales, and visibility conditions.
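As a concrete illustration of this sampling step, a minimal sketch using OpenCV follows; the directory layout, file naming, and FPS fallback are assumptions for illustration, not the authors' actual script.

import cv2
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, every_sec: float = 3.0) -> int:
    """Save one frame every `every_sec` seconds from a traffic video."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if FPS metadata is missing
    step = max(1, round(fps * every_sec))     # frames to skip between samples
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    saved, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(f"{out_dir}/{Path(video_path).stem}_{saved:05d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved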
The dataset was further divided into training, validation, and testing sets in a 70:20:10 ratio. This stratification ensured that the training subset captured enough variation to facilitate informative learning, while the validation subset supported hyperparameter tuning and helped to avoid overfitting. In turn, the test subset provided an independent benchmark for assessing model generalization. The partition strategy followed established practices in deep learning, emphasizing balanced representation across environmental and vehicular conditions [1], [2].
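A minimal sketch of such a 70:20:10 split is shown below, assuming the frames sit in a flat directory; the fixed seed keeps the partition reproducible, and directory names are illustrative only.

import random
import shutil
from pathlib import Path

random.seed(42)                               # reproducible partition
images = sorted(Path("frames").glob("*.jpg")) # hypothetical source directory
random.shuffle(images)

n = len(images)
splits = {
    "train": images[: int(0.7 * n)],
    "val":   images[int(0.7 * n): int(0.9 * n)],
    "test":  images[int(0.9 * n):],
}
for name, files in splits.items():
    dest = Path("dataset") / name / "images"
    dest.mkdir(parents=True, exist_ok=True)
    for f in files:
        shutil.copy(f, dest / f.name)         # copy each frame into its subset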
Once the raw frames were prepared, a rich preprocessing pipeline was applied to normalize input characteristics and improve the quality of the training data. Noise reduction approaches were utilized to remove artifacts caused by sensor limitations, motion-induced blur, or atmospheric interference. All images were then resized to 640 × 640 pixels to meet the architectural requirements of YOLOv8m and to unify image shape for improved computational efficiency during training. Pixel intensities were normalized to the range [0, 1], a preprocessing step linked to good gradient stability and faster convergence during optimization.
To further strengthen the robustness of the model, an extensive augmentation process was implemented using both Roboflow preprocessing tools and the built-in augmentation module of YOLOv8. Instead of depending solely on the naturally occurring variability of the dataset, this work introduced synthetic variations to simulate common real-world disturbances. These included geometric transformations of flipping, cropping, rotation, and scaling, which helped the model learn differences in camera angles, vehicle orientations, and spatial composition. Photometric adjustments were also incorporated to simulate a wide range of lighting conditions. For example, brightness and exposure variations allowed the model to deal with glare, shadow transitions, and low-light conditions typical of early morning or late afternoon traffic. Of particular importance was mosaic augmentation, where four images are merged into a single image, increasing scene complexity and exposing the model to a variety of object interactions within one frame [19], [20].
These data collection and preprocessing procedures together ensured that the resulting dataset was diverse and representative of the various challenges commonly found in intelligent transportation environments. By integrating real-world variability with synthetically enhanced augmentation, this study established a training foundation that supports stable network convergence and reduced overfitting, enhancing the ability of the modified YOLOv8m model to perform robustly under complex traffic conditions.
2.3. Image annotation
Following the data preprocessing phase, all images were subjected to a detailed annotation process to accurately label and localize vehicle objects within each frame. Annotation was done using the Roboflow platform, a web-based system designed for efficient object detection dataset preparation. Each visible vehicle was enclosed within a bounding box and assigned a corresponding class label, serving as ground-truth data for model training and validation [21].
A total of nine vehicle categories were identified and annotated: car, truck, bus, van, motorcycle, tricycle, jeepney, bicycle, and e-bike. These categories represent the most common vehicles observed on Philippine roadways, ensuring the dataset's contextual relevance to local ITS. Each image could contain multiple vehicle types, mirroring the congestion and mixed traffic patterns found in real environments. This multi-class labeling scheme allowed the YOLOv8m model to learn vehicle differentiation and scale variation, essential for real-time classification under dynamic traffic scenes [18].
Figure 2 shows the image annotation workflow, illustrating the step-by-step process from frame extraction to label export. The pipeline begins with uploading image frames to the Roboflow platform, where annotation projects are created and versioned. Annotators then perform bounding box labeling and assign the appropriate vehicle category. Once annotation is completed, a quality control and verification stage follows, where a secondary reviewer cross-checks the labels to identify and correct errors such as overlapping boxes or misclassified objects. Finally, all verified annotations are exported in YOLOv8-compatible format, containing normalized coordinates and class indices. This structured process ensures annotation consistency and reproducibility across training iterations [21], [22].
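For reference, each exported YOLO label file holds one line per object in the form described above: a class index followed by normalized center coordinates and box size. The small parser below illustrates the format; the class ordering mirrors the nine categories in this study but is an assumption, since the actual index mapping is set by the Roboflow export.

# One line per object: <class_id> <x_center> <y_center> <width> <height>,
# with all coordinates normalized to [0, 1] relative to the image size.
CLASSES = ["car", "truck", "bus", "van", "motorcycle",
           "tricycle", "jeepney", "bicycle", "e-bike"]  # assumed index order

def parse_label_line(line: str):
    cls, xc, yc, w, h = line.split()
    return CLASSES[int(cls)], float(xc), float(yc), float(w), float(h)

# Example: a car centered in the frame occupying 20% x 10% of the image
print(parse_label_line("0 0.50 0.50 0.20 0.10"))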
Figure 2. Image annotation process using Roboflow
To validate the quality of the labeling process, several sample annotated images were reviewed. A total of 13,081 images were selected for the nine vehicle categories most commonly seen on Philippine roads: car, motorcycle, tricycle, jeepney, truck, van, e-bike, bicycle, and bus. Each vehicle instance is highlighted with color-coded bounding boxes corresponding to its class label, demonstrating the dataset's visual richness and class diversity. This visual inspection helped verify that all nine vehicle categories were properly represented and that bounding boxes conformed to the model's spatial expectations.
Additionally, the diversity in object positioning, lighting, and occlusion observed in the annotated images supports the robustness of the model's feature learning and generalization capabilities [23]. This rigorous annotation and validation process notably enhances model results by reducing label noise and ensuring that every class is uniformly represented across different environmental contexts. As prior studies confirm, datasets with high annotation precision and consistent labeling structure contribute directly to improved mean average precision (mAP) in object detection systems [21], [24].
2.4. Data augmentation
To strengthen the YOLOv8m model's generalization capability, a comprehensive data augmentation process was implemented to increase the diversity and realism of the training dataset. This process artificially expands the available data by introducing controlled visual variations, allowing the model to recognize vehicles under different environmental and spatial conditions. Such augmentations are particularly essential for ITS applications, where lighting, traffic density, and camera viewpoints change continuously [19], [25].
The augmentation pipeline, implemented using the Roboflow preprocessing system and YOLOv8's built-in augmentation module, simulated various real-world scenarios. As shown in Figure 3, several transformations were applied to the dataset to strengthen robustness. Horizontal flipping was used to represent vehicles moving in opposite directions, while random rotation and scaling allowed the model to adapt to diverse camera angles and distances. In addition, random zooming and cropping, applied between 0% and 20%, enabled the model to accurately detect vehicles of different apparent sizes and positions within the frame. This technique is especially useful for simulating vehicles that suddenly appear closer or farther away, a common occurrence in dynamic traffic scenes.
Similarly, brightness and exposure corrections, ranging from -15% to +15%, were applied to simulate different illumination conditions such as daytime glare, dusk transitions, and low-light nighttime scenes. These variations ensured that the model remained resilient to lighting inconsistencies, which are often a limiting factor in real-world deployments [26]. Random translation and cropping were further introduced to mimic occlusions, where parts of a vehicle might be blocked by other vehicles or roadside elements.
Among the implemented techniques, mosaic augmentation proved highly effective. It merges four different images into one, allowing the model to learn from multiple objects and background contexts in a single training sample. This not only increases class diversity but also enhances spatial awareness and reduces overfitting [20]. Additionally, HSV color-space adjustments were utilized to generate subtle variations in hue, saturation, and value, reflecting the impact of environmental lighting and camera sensor differences on visual perception [26].
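The YOLOv8-side transformations above map onto Ultralytics training hyperparameters. The sketch below mirrors the stated ranges (roughly ±15% brightness-type jitter, up to 20% zoom/crop, flips, and mosaic); the hyperparameter names come from the Ultralytics documentation, while the specific values are assumptions rather than the authors' exact configuration.

# Augmentation-related Ultralytics hyperparameters mirroring the stated ranges.
# Values are illustrative assumptions, not the authors' exact settings.
augmentation_cfg = {
    "fliplr": 0.5,     # horizontal flip probability
    "degrees": 10.0,   # random rotation range in degrees (assumed magnitude)
    "scale": 0.2,      # random zoom/scale gain, up to ~20%
    "translate": 0.1,  # random translation fraction, mimics partial occlusion
    "hsv_h": 0.015,    # hue jitter fraction
    "hsv_s": 0.7,      # saturation jitter fraction
    "hsv_v": 0.15,     # value/brightness jitter, roughly +/-15%
    "mosaic": 1.0,     # probability of mosaic (four-image) augmentation
}
# These keys can be passed straight to Ultralytics training, e.g.:
# model.train(data="vehicles.yaml", **augmentation_cfg)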
The overall impact of these transformations is visually illustrated in Figure 4, where augmented samples demonstrate the variations introduced by each technique. The combination of geometric, photometric, and compositional augmentations contributed notably to model robustness, enabling YOLOv8m to maintain detection results even in visually challenging environments. By systematically applying these augmentations, the model reached more stable training behavior, better convergence, and reduced overfitting, ultimately leading to improved real-time results across diverse traffic conditions.
Figure 3. Data augmentation techniques applied to vehicle images
2.5. Activation function
Activation functions play a vital role in enabling deep networks to learn complex, non-linear mappings. They affect gradient flow, convergence speed, and feature extraction across convolutional layers. The baseline YOLOv8m network originally uses SiLU, which was compared with two modern functions, Mish and Swish, and a newly proposed α-scaled GNSig. These modifications were introduced to strengthen learning stability and classification performance in real-time vehicle detection.
The default SiLU (Swish-1) activation is defined as

f(x) = x · sigmoid(x)

SiLU offers smooth and continuous gradient propagation, which prevents the abrupt activations seen in ReLU-based functions. However, empirical results from earlier studies indicate that SiLU may underperform when dealing with rapid illumination changes or high intra-class variance, as its output tends to saturate for extreme negative inputs [8], [14].
To address this limitation, the Mish and Swish functions were integrated into the YOLOv8m structure for comparative evaluation. The Mish function, expressed as

f(x) = x · tanh(softplus(x))

introduces a self-regularizing property through its smooth non-monotonic curve. This feature enables deeper layers to capture subtle visual cues such as vehicle contours, edges, and reflections without destabilizing gradient updates. Prior studies have demonstrated that Mish can outperform SiLU in tasks involving fine-grained feature learning due to its stronger gradient flow and adaptive representation capabilities [8], [10]. The Swish function, on the other hand, defined as

f(x) = x · sigmoid(βx)

introduces a learnable parameter β that adjusts the slope dynamically. This adaptability allows Swish to maintain gradient sensitivity even in low-activation regions, resulting in smoother convergence during training and improved overall performance in visual recognition tasks [11], [27].
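These reference activations follow directly from the definitions above; a minimal PyTorch sketch is given below. Note that torch.nn already ships SiLU and Mish modules; the Swish variant with a learnable β is written out explicitly since it is not built in under that parameterization.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Swish(nn.Module):
    """Swish: f(x) = x * sigmoid(beta * x), with beta learnable."""
    def __init__(self, beta: float = 1.0):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(beta))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.beta * x)

class Mish(nn.Module):
    """Mish: f(x) = x * tanh(softplus(x))."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.tanh(F.softplus(x))

silu = nn.SiLU()  # f(x) = x * sigmoid(x), YOLOv8's default activation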
Building upon these principles, this research also explores a custom α-scaled GNSig activation, designed to balance gradient flow and avoid neuron saturation. The GNSig function modifies the classical sigmoid by introducing two additional parameters, an α-scaling factor and a bias correction term b, formulated as

f(x) = α · 1/(1 + e^(-x)) + b

The α term amplifies the activation response to mid-range input signals, while the bias correction shifts the activation threshold, improving sensitivity to subtle feature variations. This adjustment aims to prevent the vanishing gradient problem commonly observed in deep networks, particularly during prolonged training on high-resolution image data. The design was inspired by the findings of Xu and Wang [28], who emphasized that scaled-sigmoid activations strengthen both gradient consistency and training stability across diverse learning tasks.
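A minimal PyTorch sketch of the formula above, f(x) = α · sigmoid(x) + b, follows. The paper's exact gradient-normalization step is not fully specified in this section, so the sketch implements only the α-scaled, bias-corrected sigmoid; treating α and b as learnable parameters, and their initial values, are assumptions.

import torch
import torch.nn as nn

class GNSig(nn.Module):
    """alpha-scaled, bias-corrected sigmoid: f(x) = alpha * sigmoid(x) + b.

    The gradient-normalization behaviour described in the paper is assumed to
    act through the effective output scale alpha; the exact normalization
    scheme is not reproduced here.
    """
    def __init__(self, alpha: float = 1.0, bias: float = 0.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha))  # amplifies mid-range response
        self.bias = nn.Parameter(torch.tensor(bias))    # shifts the activation threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.alpha * torch.sigmoid(x) + self.bias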
During implementation, the YOLOv8m model architecture was kept structurally identical across all experiments, with the activation function as the sole modified component. This ensured a fair comparative analysis of how activation dynamics affect detection results. The modified functions were integrated into the C2f blocks and detection heads within the YOLOv8m architecture, maintaining consistent training conditions, including batch size, learning rate, and optimizer settings.
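One way to perform such a swap without touching the architecture is sketched below, assuming the Ultralytics implementation in which the Conv wrapper exposes a default_act class attribute (true of recent releases, but worth verifying against the installed version); GNSig refers to the module sketched above.

import torch.nn as nn
from ultralytics import YOLO
from ultralytics.nn.modules.conv import Conv

# Replace the activation used by every Conv block (including those inside C2f
# and the detection head) before the model graph is built.
Conv.default_act = GNSig()      # or nn.SiLU(), Mish(), Swish()

model = YOLO("yolov8m.yaml")    # build from config so the new default applies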
The comparative results were evaluated through metrics such as mean average precision (mAP@50–95), convergence rate, and validation loss reduction. Empirical observations revealed that both the Mish and Swish activations provided smoother loss curves and higher mAP compared to the default SiLU, confirming their effectiveness in capturing complex vehicle features under varied lighting and occlusion conditions. Additionally, the proposed GNSig activation exhibited the most stable training behavior, minimizing oscillations in validation loss and demonstrating improved adaptability to unseen traffic scenes. These outcomes affirm that proper activation tuning can notably strengthen model convergence, feature richness, and generalization capability, contributing to more reliable real-time vehicle classification.
2.6. Model training and evaluation
The modified YOLOv8m models were trained and evaluated under controlled experimental conditions to ensure a consistent and fair comparison among the four activation functions: SiLU, Mish, Swish, and the proposed α-scaled GNSig. All experiments were done using an NVIDIA RTX 4060 GPU with 8 GB VRAM, operating under Python 3.10 and PyTorch 2.2 within the Ultralytics YOLOv8 framework. Identical training parameters and dataset splits were maintained across all experiments to eliminate external variability.
The dataset was divided into 70% for training, 20% for validation, and 10% for testing. Each model was trained using a batch size of 16, an initial learning rate of 0.001, and a momentum coefficient of 0.937. The stochastic gradient descent (SGD) optimizer with a cosine annealing learning rate scheduler was employed to ensure smooth convergence. Training was done for 100 epochs, and early stopping was implemented to automatically terminate the process when no significant improvement in validation loss was observed for 15 consecutive epochs [29].
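The stated schedule maps onto the Ultralytics training API roughly as follows. Argument names come from the Ultralytics documentation; the dataset config name is hypothetical, and this is an illustrative sketch rather than the authors' exact script (note that Ultralytics' patience mechanism monitors validation fitness, whereas the paper describes monitoring validation loss).

from ultralytics import YOLO

model = YOLO("yolov8m.pt")
results = model.train(
    data="vehicles.yaml",  # hypothetical dataset config (70/20/10 split)
    epochs=100,
    batch=16,
    lr0=0.001,             # initial learning rate
    momentum=0.937,
    optimizer="SGD",
    cos_lr=True,           # cosine annealing LR schedule
    patience=15,           # early-stopping window of 15 stagnant epochs
    imgsz=640,             # 640 x 640 input resolution
)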
The YOLOv8m architecture was selected due to its balance between detection performance and computational efficiency, which makes it suitable for real-time vehicle classification tasks. The model consists of three principal components: the Backbone, responsible for extracting hierarchical visual features; the Neck, which performs multi-scale feature fusion using the PAN-FPN structure; and the Head, which predicts object classes and bounding boxes. The activation function modifications (SiLU, Mish, Swish, and GNSig) were integrated into the C2f convolutional blocks and the detection head layers while preserving all other architectural and training parameters. This configuration ensured that any observed performance differences could be attributed primarily to the effects of the activation functions [8], [27].
Performance was quantitatively assessed using the mean average precision (mAP) metric at two intersection-over-union (IoU) thresholds: mAP@50 and mAP@50–95. The mAP is defined as the mean of the average precision (AP) values across all classes:

mAP = (1/N) Σ_{i=1}^{N} AP_i

where AP_i denotes the average precision for class i and N represents the total number of vehicle categories. A higher mAP indicates superior performance in both localization and classification [30].
Complementary indicators such as precision (P), recall (R), and the F1-score were also computed to provide a more comprehensive evaluation. The F1-score, representing the harmonic mean of precision and recall, is defined as

F1 = 2PR / (P + R)

These metrics collectively examine the reliability of the model, ensuring that improvements in mAP do not come at the expense of increased false detections [30], [31].
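Both formulas reduce to a few lines of code; a self-contained sketch with purely illustrative numbers (not the paper's measured results) follows.

def mean_average_precision(ap_per_class: dict) -> float:
    """mAP = (1/N) * sum of per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)

def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall: F1 = 2PR / (P + R)."""
    return 2 * precision * recall / (precision + recall)

# Illustrative values only
ap = {"car": 0.91, "bus": 0.88, "truck": 0.85}
print(round(mean_average_precision(ap), 3))  # 0.88
print(round(f1_score(0.882, 0.869), 3))      # ~0.875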
Additionally, inference speed, measured in frames per second (FPS), was evaluated to determine the trade-off between detection performance and real-time applicability, an essential factor in ITS.
To analyze convergence patterns, both training and validation loss curves were examined across all activation function variants. Models utilizing Mish and Swish demonstrated smoother convergence and higher mAP values compared to the baseline SiLU, suggesting that their smoother non-linearities enable better gradient flow and improved representation learning. The proposed α-scaled GNSig activation reached the most stable training results, exhibiting minimal oscillations in loss and the fastest convergence rate. These results confirm that proper activation function tuning can notably strengthen model robustness, training efficiency, and classification reliability, contributing to the advancement of real-time computer vision systems for traffic analysis and vehicle classification.
3. RESULTS AND DISCUSSION
This section presents the comprehensive results and comparative analysis of the modified YOLOv8m models using various activation functions: SiLU, Mish, GELU, LeakyReLU, and the proposed α-scaled GNSig. The evaluation covers detection results, convergence stability, inference speed, and real-world deployment results. The results indicate that the modified model, equipped with the proposed GNSig activation, achieved the largest improvements across all metrics. Each figure and table in this section shows the effects of activation function choice and architectural modifications on YOLOv8m's overall behavior.
3.1. Quantitative evaluation
Table 1 shows the comparative results of the YOLOv8m models using different activation functions, including the baseline SiLU, Swish, Mish, and the proposed α-scaled GNSig. The GNSig variant achieved the highest overall performance, obtaining an mAP@50–95 of 86.7%, surpassing Mish (85.9%) and Swish (84.8%), while maintaining a real-time inference rate of 94 FPS. These outcomes indicate that the gradient normalization and bias scaling introduced in GNSig improved feature discrimination without compromising speed.
Table 1. Results of the YOLOv8m models using different activation functions

Activation function         Precision (%)   Recall (%)   F1-score   mAP@50   mAP@50-95   Inference speed (FPS)
SiLU (baseline)             83.2            81.7         0.82       90.5     82.4        97.3
Swish                       85.6            83.4         0.84       91.9     84.8        95.1
Mish                        86.8            85.1         0.86       93.4     85.9        92.8
α-scaled GNSig (proposed)   88.2            86.9         0.87       94.6     86.7        94.0
When compared with existing YOLO-based ITS research, the achieved improvements are notably higher. Prior enhancement studies such as Li et al. [15] and Al-Kaf et al. [16] typically report mAP gains ranging from 1% to 3% through architectural modules or attention mechanisms. In contrast, the proposed GNSig activation alone produced up to a 6.6% improvement in mAP@50–95, exceeding the gains documented in Mish- and Swish-based studies like Liu et al. [12] and Gao et al. [14]. This demonstrates that activation-level optimization, without additional architectural changes, can yield performance improvements greater than those achieved through heavier model modifications.
The results confirm that adaptive activation scaling enhances gradient consistency and model generalization. While Mish and Swish provided smoother learning curves than SiLU, GNSig's stability in maintaining the precision-recall balance makes it the most reliable activation for real-time intelligent transportation systems.
3.2. Comparative model behavior across activations
The series of visual comparisons shows the detection behavior of YOLOv8m under different activation configurations. The baseline SiLU model, as seen in Figure 4(a), reached an accuracy of 0.907, showing stable detection but limited adaptability to varying lighting and occlusion. The Mish variant (Figure 4(b)) yielded 0.863 accuracy, revealing better feature extraction at edges but slightly slower training due to increased computational load. The GELU-based model (Figure 4(c)) reached 0.854 accuracy and smoother activation gradients but displayed weaker sensitivity to low-contrast objects such as small or partially hidden vehicles. LeakyReLU (Figure 4(d)) offered early-stage gradient stability with 0.863 accuracy, yet plateaued in later epochs, indicating reduced learning flexibility for overlapping objects. Finally, the proposed GNSig model (Figure 4(e)) reached the highest accuracy at 0.961, demonstrating superior convergence behavior and feature sensitivity due to its gradient normalization, α-scaling, and bias correction mechanism.
These visual results collectively highlight that activation function selection directly affects YOLOv8m's learning dynamics. GNSig's smoother and more controlled gradients led to improved feature retention and reduced overfitting compared to the other tested functions.
These qualitative differences align with reports in the prior activation-function literature. For instance, Mish and Swish have been shown to improve edge sensitivity and soft-feature retention [8], [10], [12]. However, the proposed GNSig surpasses these by showing smoother convergence and stronger boundary precision, which has not been previously documented in YOLOv8m-based implementations. The higher sensitivity to occluded and low-contrast vehicles confirms the superior gradient consistency offered by GNSig relative to traditional activations used in YOLO systems, as shown in Figure 5.
Figure 4. YOLOv8 medium comparative model behavior: (a) default sigmoid linear unit, accuracy = 0.907; (b) Mish activation, accuracy = 0.863; (c) GELU activation, accuracy = 0.854; (d) LeakyReLU activation, accuracy = 0.863; and (e) gradient normalization, α, and bias adjustments, accuracy = 0.961
Figure 5. Modified YOLOv8 medium (gradient normalization, α, and bias adjustments, accuracy = 0.961)
3.3. Domain transfer and generalization performance
To examine the robustness of the modified model, an additional evaluation was done using a pothole detection dataset (Figure 6). The baseline YOLOv8m with SiLU reached an accuracy of 0.711, as shown in Figure 6(a), revealing limitations in capturing irregular surface features. In contrast, the modified GNSig model reached 0.759 accuracy, as shown in Figure 6(b), representing a 4.8% improvement. The GNSig variant effectively captured subtle textural variations and fine structural edges of potholes, confirming its enhanced generalization capability beyond vehicle datasets.
Figure 6. YOLOv8 medium accuracy on the pothole dataset: (a) default YOLOv8 medium on the pothole dataset (accuracy = 0.711) and (b) modified YOLOv8 medium on the pothole dataset (accuracy = 0.759)
This domain transfer experiment confirms the versatility of the proposed activation scheme. It shows that the improved gradient flow not only enhances in-domain classification results but also contributes to stability and adaptability in heterogeneous visual domains. Compared with previous cross-domain YOLO studies, which typically observe performance drops when transferring from traffic datasets to road-surface datasets [15], the proposed GNSig activation maintained strong generalization. The 4.8% improvement over the baseline YOLOv8m outperforms the 2-3% generalization improvements reported in related transfer-learning studies, suggesting that gradient-normalized activations enhance feature abstraction beyond vehicle-specific training, as shown in Table 2.
Table 2. Comparative results on the pothole detection dataset

Model                Activation function           Accuracy   Remarks
YOLOv8m (default)    Sigmoid linear unit           0.711      Slower convergence
YOLOv8m (modified)   Gradient-normalized sigmoid   0.759      Improved feature discrimination
3.4. Convergence, precision-recall, and validation analysis
The convergence behavior of all models was examined using precision-recall curves and validation loss tracking. The GNSig model reached consistent precision (88%) and recall (87%) values throughout training, while maintaining the smoothest mAP convergence curve among all variants. Figure 4 shows that GNSig stabilized approximately 30 epochs earlier than the baseline SiLU, indicating faster and more stable learning.
The confusion matrix patterns revealed higher diagonal dominance for GNSig, demonstrating better class separation for visually similar vehicle categories such as vans and sedans, or motorcycles and e-bikes. These findings emphasize that the proposed activation not only enhances the numerical evaluation metrics but also improves model stability, as shown in Table 3. The lower validation loss and early convergence confirm efficient gradient propagation, reducing oscillations and preventing overfitting during prolonged training.
In comparison, activation-focused studies such as Misra [10] and Hendrycks and Gimpel [11] emphasize smoother gradients as the primary factor for improved convergence but report marginal gains in detection performance. The GNSig activation integrates gradient normalization and α-scaling, producing larger reductions in validation loss (-30.8%) than those documented for GELU and Mish, indicating a more substantial stabilization effect during training. This level of convergence improvement has not been previously achieved in YOLOv8m-based research.