TELKOMNIKA Telecommunication, Computing, Electronics and Control
Vol. 23, No. 6, December 2025, pp. 1611-1625
ISSN: 1693-6930, DOI: 10.12928/TELKOMNIKA.v23i6.26829
Journal homepage: https://telkomnika.uad.ac.id/index.php/TELKOMNIKA

Mixed attention mechanism on ResNet-DeepLabV3+ for paddy field segmentation

Alya Khairunnisa Rizkita¹, Masagus Muhammad Luthfi Ramadhan¹, Yohanes Fridolin Hestrio¹,², Muhammad Hannan Hunafa¹, Danang Surya Candra², Wisnu Jatmiko¹
¹ Department of Computer Science, Faculty of Computer Science, University of Indonesia, Depok, Indonesia
² Research Center for Geoinformatics, Electronics and Informatics Research Organization, National Research and Innovation Agency, Bogor, Indonesia

Article Info
Article history: Received Dec 7, 2024; Revised Aug 27, 2025; Accepted Sep 10, 2025
Keywords: Attention mechanism; DeepLabV3+; Remote sensing; Residual network; Semantic segmentation

ABSTRACT
Rice cultivation monitoring is crucial for Indonesia, where paddy field areas declined by 2.45% according to the Central Bureau of Statistics due to land function changes and shifting crop preferences. Regular monitoring of paddy field distribution is essential for understanding agricultural land utilization by farmers and landowners. Satellite imagery has become increasingly common for agricultural land observation, but traditional neural networks alone provide insufficient segmentation accuracy. This study proposes an enhanced deep learning architecture combining residual network (ResNet)-DeepLabV3+ with coordinate attention (CA) and spatial group-wise enhancement (SGE) modules. The attention mechanisms establish direct connections between context vectors and inputs, enabling the model to prioritize relevant spatial and spectral features for precise paddy field identification. The CA module enhances spectral feature discrimination, whereas the SGE improves spatial characteristic representation. The experimental results demonstrate superior performance over the baseline methods, achieving intersection over union (IoU) of 0.85, dice coefficient of 0.89, and accuracy of 0.95. The proposed mixed attention mechanism significantly improves the accuracy and efficiency of automatic crop area identification from satellite imagery.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Alya Khairunnisa Rizkita
Department of Computer Science, Faculty of Computer Science, University of Indonesia
Depok, 16424, Indonesia
Email: alya.khairunnisa21@ui.ac.id

1. INTRODUCTION
Rice is a prominent agricultural product globally, and it thrives in expansive, unobstructed regions. In Indonesia, rice is the predominant commodity. Based on statistics from the Central Bureau of Statistics in Indonesia, there was a 2.45% decline in rice cultivation from 2022 to 2023 [1], [2]. This reduction in land area is due to changes in land functions and preferences for planted crop commodities. The decreased price of rice relative to other crops and the less-than-ideal paddy field conditions for rice growth are the main causes of the change in commodities and land usage [3], [4]. Therefore, it is important for farmers and paddy field owners to understand how existing paddy fields are distributed and used by conducting regular monitoring.

Remote sensing imagery provides a precise and geographically accurate depiction of the land that the device has captured [5]. Remote sensing imagery typically requires geographic information about agricultural land to accurately monitor crop conditions at specific locations and utilize time efficiently. Additionally, this information aids in monitoring the use of agricultural land, which is rapidly changing its function. Only large and open land crops, such as rice, corn, onions, and oil palm, are suitable for remote sensing monitoring on agricultural land. Consequently, it is possible to map paddy fields using remote sensing technology to monitor crop growth, manage paddy field use, and estimate crop yields [6].

Extracting information about agricultural land can employ methods such as segmentation or object detection to further process remote sensing images [7]. These approaches have been the focus of recent research. In line with the development of remote sensing image processing techniques, image segmentation methods have become one of the techniques relied upon to produce more detailed information from an image. Image segmentation is a technique that classifies each pixel in an image into semantic classes [8]. However, automatically identifying rice fields from satellite images is still very difficult. Current deep learning methods have major problems: difficulty in accurately tracing complex and irregular field boundaries; difficulty in distinguishing rice fields from other similar-looking features, such as water bodies, wetlands, or other crops; poor performance when fields have different sizes, shapes, and orientations in the same image; and unreliable results under different weather conditions, seasons, or rice growth stages. These problems lead to poor accuracy in identifying rice fields, resulting in unreliable farm monitoring and incorrect crop yield predictions.

In recent years, the development of segmentation methods has continued to grow. Neural network approaches are widely applied in agriculture, for example in crop type segmentation [9]-[12], crop disease detection [13], [14], and farmland mapping [15]. Shelhamer et al. [16] first proposed the fully convolutional network (FCN) segmentation architecture. The U-shaped convolutional neural network (U-Net) [17] is a highly improved version of FCN that integrates up-sampling, down-sampling, skip connections, and fully convolutional layers. U-Net is also more symmetric than FCN: each down-sampling stage is linked to the corresponding up-sampling stage by a skip connection at every level. The U-Net architecture is widely applied, and it continues to be upgraded as new segmentation studies emerge.

Zhang et al. [18] conducted research on agricultural land mapping with segmentation methods, using a refined pyramid scene parsing network (PSPNet) on polarimetric synthetic aperture radar (PolSAR) data. Their study proposes a modified PSPNet with the addition of a polarimetric CA module. This module captures polarimetric features from PolSAR data, allowing PSPNet to distinguish existing plant species. The results show that the refined PSPNet provides excellent segmentation results on PolSAR data, especially in terms of accuracy in mapping sharp contours and separating agricultural areas. Li et al. [19] developed a U-Net with an added linear attention mechanism (LAM) to segment remote sensing images, placing LAM on each skip connection of the U-Net structure. This research led to the development of MAResU-Net, which generally outperforms U-Net. Mahmud et al. [20] extended DeepLabV3+ by adding a spatial attention (SA) mechanism for road segmentation. Their findings demonstrate that adding an SA mechanism increases segmentation accuracy by 1.56% compared to the standard model. Previous research has thus demonstrated that the integration of attention modules can enhance segmentation accuracy.

Despite these advances, existing approaches exhibit several critical limitations: traditional U-Net architectures struggle with multi-scale feature extraction required for varying paddy field sizes, PSPNet-based methods suffer from high computational complexity limiting their practical deployment, FCN-based approaches lack sufficient contextual information for complex agricultural landscapes, and current attention mechanisms often fail to simultaneously capture both spatial relationships and channel dependencies crucial for distinguishing paddy fields from spectrally similar features. Most existing methods are evaluated on general segmentation datasets rather than domain-specific agricultural scenarios, limiting their applicability to real-world paddy field monitoring tasks.

Attention modules can distinguish between important and unimportant image elements. With the integration of attention modules, object boundaries are more precise and internal pixels are more uniform [18]-[20]. The convolutional block attention module (CBAM) is one of the attention modules designed to enhance the representational power of CNN feature maps. CBAM applies channel attention (CA) and SA. CA focuses on refining the importance of feature channels, allowing the model to learn what to look for in the features. Meanwhile, the SA module focuses on highlighting the important regions within the feature maps. The spatial group-wise enhance (SGE) attention module is another attention module that improves CNN feature maps. Unlike CBAM, which has two modules, SGE focuses on spatial information using a group-wise strategy [21], [22].

Previous research by Li et al. [23] proposes an improved U-Net model combining a multi-axis vision transformer (MaxViT) encoder with a CBAM decoder for automated crop-weed segmentation in precision agriculture. This study addresses the limitations of uniform pesticide spraying and existing segmentation models that struggle with agricultural images' complex backgrounds and multi-scale targets. Using MaxViT's dual attention mechanisms for local and global feature capture alongside CBAM's channel and SA for boundary enhancement, the model achieved 84.28% mean intersection over union (mIoU) and 88.59% mean pixel accuracy (mPA) on sugar beet datasets, improving 3.08% and 3.15% over baseline U-Net while reducing parameters from 43.93 M to 22.08 M and maintaining 0.0559 s inference time.

Developing effective rice field mapping methods faces several major challenges, including the need for computationally efficient solutions that work in real-time for farm monitoring, high data variability from seasonal changes and different growth stages across diverse geographical locations, limited availability of high-quality annotated datasets, requirements for methods that work across different sensor types and imaging conditions, and the need for practical deployment in existing agricultural systems with minimal infrastructure.

To address these challenges, this research aims to develop an enhanced deep learning architecture that combines a residual network (ResNet) encoder and DeepLabV3+ decoder to leverage both detailed features and multi-scale contextual information, investigate hybrid attention mechanisms (CBAM and SGE) for capturing spatial and channel dependencies that are crucial for accurate boundary detection, and systematically evaluate performance against existing state-of-the-art methods using comprehensive metrics, such as IoU, dice coefficient, and accuracy. Finally, this study provides a practical, computationally efficient solution for automated rice field monitoring from remote sensing imagery that can be readily deployed in operational agricultural systems.

Therefore, this research focuses on mapping agricultural land use, especially paddy fields, using the ResNet-DeepLabV3+ neural network segmentation method. The segmentation method will also incorporate CBAM and SGE attention modules to enhance ResNet-DeepLabV3+'s ability to segment paddy fields.

In summary, this article offers the following contributions:
- We propose an enhanced deep learning model, ResNet-DeepLabV3+ with attention, for segmenting paddy fields from remote sensing images. The proposed model comprises ResNet-34 as the encoder, augmented with an attention module to improve the model's segmentation capabilities, and DeepLabV3+ as the decoder.
- By incorporating the SGE technique, we enhance the CBAM by replacing its SA mechanism. This modification is intended to give a more accurate depiction of the spatial characteristics throughout the image.
- We compare the performance of the proposed method with other segmentation methods.

This paper is organized into several sections, the first of which covers the introduction to the research. The second section discusses related works. The third section describes the experimental setup. The fourth section discusses the proposed method. The results are discussed in the fifth section, and the last section discusses the research conclusions.

2. RELATED WORK
2.1. Residual network
ResNet was first introduced in 2016 by He et al. [24] as an innovation that brought the concept of residual learning to deep neural networks that are more than 100 layers deep. He et al. [24] proposed this concept to address optimization issues in deep neural networks, which frequently face performance degradation difficulties. The basic idea of residual learning is that the network learns the residual of the target function rather than the function itself. If the function to be learned is denoted as $H(x)$ and the network learns $F(x) = H(x) - x$, then the output of the block becomes $H(x) = F(x) + x$. ResNet uses residual blocks with many layers, each of which learns from the residuals of a previous one. A skip connection is used in the network to connect the input and output. There are also skip connections that perform identity mapping, where the outputs of layers get summed at a later point to reduce network complexity and help in optimizing it. ResNet development has shown that residual learning can help to successfully optimize and improve the accuracy of deep neural networks [24].
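
The residual formulation can be illustrated with a minimal sketch in plain Python. It is not the ResNet-34 implementation used in this study: the function names `residual_block`, `zero_residual`, and `small_correction` are hypothetical, and the "layers" are stand-in functions on lists of floats. The point it shows is that when the learned residual $F(x)$ is zero, the block reduces to the identity mapping.

```python
def residual_block(x, residual_fn):
    """Return H(x) = F(x) + x, where residual_fn plays the role of F."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("the residual branch must preserve the input shape")
    # Skip connection: add the unchanged input to the residual branch output.
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    """Hypothetical residual branch that has learned nothing (F(x) = 0)."""
    return [0.0 for _ in x]


def small_correction(x):
    """Hypothetical residual branch that learns a small correction."""
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_residual)      # equals x
corrected_out = residual_block(x, small_correction)  # x plus 10 percent
```

Because the identity is the default behaviour, stacking more blocks should not make the mapping harder to represent, which is the intuition behind the degradation argument above.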

2.2. DeepLabV3
Chen et al. [25] re-implemented their prior model, DeepLab, as a newer version: DeepLabV3. Atrous convolution, used in the DeepLabV3 model, enables a network to enlarge its field of view without adding new learned parameters. Atrous convolution highlights spatial information over a large receptive field while maintaining crisp resolution, and applying atrous convolution at multiple parallel levels forms atrous spatial pyramid pooling (ASPP). ASPP allows the method to extract multi-scale information effectively from a small parameter space using only images. DeepLabV3 combines many new ideas: using ASPP to perform multiscale feature learning, which helps the deep network be aware of more context and capture richer features from multiple scales of detail; adding global image-level features directly against the last convolution over grouping (ICOG) convolutional layer for segmenting boundaries; replacing SoftMax with sigmoid binary cross-entropy loss while employing the dice coefficient as an evaluation metric; and, importantly, employing batch normalization in the ASPP module, which has proven to make training considerably more stable and effective [25].
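
To make the "larger field of view without new parameters" point concrete, the following sketch implements a one-dimensional atrous convolution in plain Python. It is an illustrative simplification (one dimension, no padding, hypothetical function names), not the two-dimensional layers used in DeepLabV3. A kernel with $k$ weights and dilation rate $r$ spans $k + (k - 1)(r - 1)$ input positions while still using only $k$ weights.

```python
def effective_kernel_size(kernel_size, rate):
    """Number of input positions spanned by a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def atrous_conv1d(signal, kernel, rate):
    """Valid (unpadded) 1-D convolution whose taps are `rate` positions apart."""
    span = effective_kernel_size(len(kernel), rate)
    outputs = []
    for start in range(len(signal) - span + 1):
        # Each weight reads the input at a stride of `rate` from the start.
        outputs.append(sum(w * signal[start + i * rate]
                           for i, w in enumerate(kernel)))
    return outputs


signal = list(range(10))
kernel = [1, 1, 1]                             # three weights in every case
dense = atrous_conv1d(signal, kernel, rate=1)  # each output sees 3 inputs
dilated = atrous_conv1d(signal, kernel, rate=2)  # each output sees a span of 5
```

With the dilation rates 12, 24, and 36 and a 3 × 3 kernel, the same formula gives spans of 25, 49, and 73 pixels per axis, which is how parallel branches cover several scales at once.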

2.3. Attention mechanism
The attention mechanism creates a direct link between the context vector and the input, allowing the model to focus on the most important parts by assigning scores to each input element based on its relevance. This is especially helpful when working with large datasets where not all information is equally useful, such as in language translation or image recognition where only specific parts matter for accurate results. By using attention, models can learn more efficiently by concentrating on key features and ignoring less important details, which improves accuracy while saving computational resources [21], [22].

2.3.1. Convolutional block attention module
CBAM is an attention module that combines CA and SA to pay closer attention to relevant features at the channel and spatial levels. CBAM starts by applying CA to adjust the weight of each channel, followed by SA, which adjusts the weights spatially [21]. The combination of CA and SA is shown in (1) and (2). In (1), the initial feature map ($F$) is multiplied element-wise ($\otimes$) with the one-dimensional attention map generated by CA ($M_c(F)$). The resulting feature map ($F'$) is then multiplied element-wise with the two-dimensional attention map generated by SA ($M_s(F)$).

$$F' = M_c(F) \otimes F \tag{1}$$

$$F'' = M_s(F) \otimes F' \tag{2}$$

CA is an attention mechanism that assigns varying weights to each channel according to their relative importance. The goal of assigning this weight is to determine the most pertinent channel [21]. CA is formulated in (3) and (4). Average pooling ($AvgP$) and maximum pooling ($MaxP$) are applied to the feature map, and each pooled result is passed through the multilayer perceptron ($MLP$). The result of $AvgP$ is denoted $F^c_{avg}$ and the result of $MaxP$ is denoted $F^c_{max}$. Both MLP outputs are added and passed through a sigmoid function $\sigma$. The sigmoid function determines the most relevant part of the feature based on $AvgP$ and $MaxP$. The MLP has two layers, whose weights are denoted $W_0$ and $W_1$.

$$M_c(F) = \sigma\left(MLP(AvgP(F)) + MLP(MaxP(F))\right) \tag{3}$$

$$M_c(F) = \sigma\left(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\right) \tag{4}$$
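
Equations (3) and (4) can be traced with a small plain-Python sketch. It is a toy illustration rather than the module used in the experiments: the feature map is a nested list shaped [channels][height][width], the weights `w0` and `w1` are hand-picked hypothetical values, and a ReLU between the two MLP layers is assumed (as in the original CBAM design) even though (4) does not show it.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def mlp(vec, w0, w1):
    """Two-layer MLP W1(W0(vec)); the ReLU in between is an assumption."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w0]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w1]


def channel_attention(feature, w0, w1):
    """Return one weight in (0, 1) per channel, following (3) and (4)."""
    # F^c_avg and F^c_max: one pooled value per channel.
    f_avg = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]
    f_max = [max(max(row) for row in ch) for ch in feature]
    # The same MLP weights are applied to both pooled descriptors.
    summed = [a + m for a, m in zip(mlp(f_avg, w0, w1), mlp(f_max, w0, w1))]
    return [sigmoid(s) for s in summed]


def apply_channel_attention(feature, weights):
    """F' = M_c(F) (x) F: scale every channel by its attention weight."""
    return [[[wt * v for v in row] for row in ch]
            for ch, wt in zip(feature, weights)]


feature = [[[1.0, 2.0], [3.0, 4.0]],   # an active channel
           [[0.0, 0.0], [0.0, 0.0]]]   # an empty channel
w0 = [[1.0, 1.0]]      # hypothetical weights: 2 channels -> 1 hidden unit
w1 = [[1.0], [-1.0]]   # hypothetical weights: 1 hidden unit -> 2 channels
weights = channel_attention(feature, w0, w1)
refined = apply_channel_attention(feature, weights)
```

With these hand-picked weights the first channel receives a weight close to 1 and the second a weight close to 0; in a trained network the weights would be learned rather than chosen.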

SA, in contrast to CA, which looks at each channel's relevance, is an attention mechanism that focuses on weighting the spatial locations of features based on how important each location is. This weighting is intended to identify the feature's most relevant location [21]. SA is formulated in (5) and (6). Average pooling ($AvgP$) and maximum pooling ($MaxP$) are applied to the feature map ($F$), and the resulting two-dimensional feature maps are $F^s_{avg}$ and $F^s_{max}$. Both results are concatenated and convolved by the convolution operation ($f$) with a filter size of $7 \times 7$. To provide a feature map that represents the most relevant spatial values, the result of the convolution is passed through a sigmoid function.

$$M_s(F) = \sigma\left(f^{7 \times 7}\left([AvgP(F); MaxP(F)]\right)\right) \tag{5}$$

$$M_s(F) = \sigma\left(f^{7 \times 7}\left([F^s_{avg}; F^s_{max}]\right)\right) \tag{6}$$

2.3.2. Spatial group-wise enhanced attention
SGE is an attention module that performs feature grouping at the spatial and channel levels and applies an attention mechanism to determine the relevance level of sub-features based on their spatial location in each formed group. SGE generates a coefficient value ($c_i$) for each feature based on the similarity of global ($g$) and local features ($x_i$), allowing it to focus on the right semantic area. The division of features into several groups allows the processing of features on a smaller scale to obtain the desired details [22]. We compute the formulation of SGE using (7).

$$c_i = g \cdot x_i \tag{7}$$
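
A minimal plain-Python sketch of (7) for a single group is given below. It is a simplification under stated assumptions: the group is a nested list shaped [channels][height][width], the function name `sge_group` is hypothetical, and the learnable per-group scale and shift of the original SGE module are omitted (effectively fixed to 1 and 0).

```python
import math


def sge_group(group, eps=1e-5):
    """Apply a simplified SGE gate to one group shaped [channels][height][width]."""
    channels, height, width = len(group), len(group[0]), len(group[0][0])
    n = height * width
    # Global feature g: spatial average of each channel in the group.
    g = [sum(sum(row) for row in ch) / n for ch in group]
    # Coefficient c_i = g . x_i at every spatial position i, as in (7).
    coeff = [[sum(g[k] * group[k][i][j] for k in range(channels))
              for j in range(width)] for i in range(height)]
    # Normalize the coefficients over all positions of the group.
    flat = [v for row in coeff for v in row]
    mean = sum(flat) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in flat) / n)
    # Sigmoid gate in (0, 1) for every position.
    gate = [[1.0 / (1.0 + math.exp(-(v - mean) / (std + eps))) for v in row]
            for row in coeff]
    # Scale every local feature x_i by the gate of its position.
    enhanced = [[[group[k][i][j] * gate[i][j] for j in range(width)]
                 for i in range(height)] for k in range(channels)]
    return enhanced, gate


group = [[[1.0, 0.0], [0.0, 0.0]]]   # one channel, response only at the top-left
enhanced, gate = sge_group(group)
```

Positions whose local feature agrees with the group's global descriptor obtain a gate above 0.5 and are kept, while the remaining positions are suppressed.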

3. METHODOLOGY
3.1. Data collection
This research utilized remote sensing image data collected to support the field survey implementation of the National Research and Innovation Agency (BRIN) in developing a method to harmonize high-resolution satellite images using artificial intelligence for paddy field monitoring and mapping. Remote mapping activities occurred from 21 to 25 August 2023 in Yogyakarta Province. Thirteen image capture points coincided with rice fields in Sleman Regency, Bantul Regency, and Kulon Progo Regency. The remote sensing tool used to take images of rice fields is the Da-Jiang Innovations (DJI) Phantom 4 multispectral camera, specifically designed for agriculture and environmental monitoring. The camera model used in the drone is FC6360 with five color ranges, including blue, green, red, red edge, and near-infrared (NIR). The average height of the drone in taking images at all capture points is 100 m, resulting in images with a resolution of 1600 × 1300 pixels and a ground resolution of 5 cm/pixel.

3.2. Data creation
Before data from BRIN field survey activities can serve as a foundation for a machine learning model, it must undergo several stages of data processing. The stages carried out to prepare the image data of paddy fields include digitization or annotation of paddy field images and the formation of paddy field image datasets. Land designation resulted in five types of paddy field groupings, namely: background, first-phase paddy plants (vegetative), second-phase paddy plants (reproductive), third-phase paddy plants (ready to harvest), and non-paddy fields. Direct identification of the differences in each paddy field's land use is possible. However, some plants have characteristics similar to those of paddy plants, and the paddy phases also sometimes share the same characteristics. The output of this digitization process includes an image file and a shape file, which serve as the initial mask form for segmentation. All existing drone images are digitized to facilitate easier image selection. An example of digitized data is shown in Figure 1.

After digitization, each digitized shape file is divided into parts based on the color range from 0 to 255, forming a mask file. Segmentation uses the mask file as the ground truth of an image to map the classes pixel by pixel. Once all images had been digitized, the image and shape files were processed to create label masks and resized to a uniform size of 512 × 512 pixels. In summary, the dataset used in this study consisted of paddy field images with mask labels of 512 × 512 pixels. This study used a total of 546 paddy field images, grouped into 5 classes.

Figure 1. Example of the data

3.3. Data preparation
The research experiment used a dataset of 546 images and their labels. The size of each image and label is 512 × 512 pixels. The dataset labels contain five classes: background, first-phase rice plant, second-phase rice plant, third-phase rice plant, and non-rice paddy field. We split the dataset into 3 subsets: 70% for training data, 20% for validation data, and 10% for test data. Training data are used by the model to learn; validation data are used during training to check whether the validation score is improving, which helps tune the model faster; and test data are used to evaluate how well the model performs segmentation.
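
The 70/20/10 split can be sketched as follows. The exact subset sizes and the shuffling procedure are not stated in the text, so the rounding rule (floor for training and validation, remainder for testing), the fixed seed, and the function name `split_dataset` are assumptions made for illustration only.

```python
import random


def split_dataset(items, train_frac=0.7, val_frac=0.2, seed=0):
    """Shuffle once, then cut into train / validation / test subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_train = int(train_frac * len(shuffled))
    n_val = int(val_frac * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # remainder goes to the test set
    return train, val, test


image_ids = list(range(546))                # stand-ins for the 546 image files
train_ids, val_ids, test_ids = split_dataset(image_ids)
print(len(train_ids), len(val_ids), len(test_ids))  # 382 109 55
```

Each image and its mask should be assigned to a subset together, which is why the sketch splits identifiers rather than image arrays.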

3.4. Proposed architecture
In this study, we proposed an enhanced deep learning model, ResNet-DeepLabV3+ with attention, for segmenting paddy fields from remote sensing images. The basic idea behind this model is to create a segmentation model by combining ResNet-34 as an encoder and DeepLabV3+ as a decoder and placing an attention module on the encoder to improve the encoder's ability to extract features. Figure 2 shows an overview of the proposed method.

Figure 2. Proposed method

3.4.1. Encoder: ResNet-34 with attention
The ResNet-34 encoder begins with a convolutional layer with a 7 × 7 kernel that maps the incoming image (3 channels) into a space with 64 channels. The first layer also performs batch normalization, rectified linear unit (ReLU) activation, and max pooling to reduce the spatial dimension. Each ResNet layer comprises two configurations of convolutional layers with 3 × 3 kernels, followed by batch normalization and the added attention. The first ResNet layer has an output of 64 channels; the second layer reduces the spatial dimension while increasing the channel size to 128, and the third layer increases the channel size to 256. The last layer provides an output of 512 channels. Each ResNet layer receives attention from the CA and SGE after the convolution layer has finished. Figure 2 illustrates the combination of CA and SGE. Initially, we combined CA with SA to form CBAM, which increased attention at the channel and spatial levels. We also integrated CA with SGE. In addition, we added depth to the MLP.

Novel contribution.
CA-SGE attention integration: our key innovation lies in strategically combining CA [21] with SGE [22] mechanisms. Unlike the standard CBAM [21], which uses CA and SA sequentially, we replace the SA component with SGE to create a more effective CA-SGE hybrid. The CA-SGE attention integration is illustrated in Figure 3.

CA enhancement: the standard CA mechanism was modified by incorporating a 4-layer MLP (compared to the default 2-layer) to enable more sophisticated channel relationship modeling. The CA process uses both adaptive average pooling (AAP) and adaptive max pooling (AMP) followed by the enhanced MLP and sigmoid activation, as illustrated in Figure 4.

SGE integration: the SGE mechanism [22] receives the CA-enhanced features and applies group-wise spatial enhancement through: 1) feature grouping, 2) global average pooling (GAP), 3) normalization, 4) position-wise dot product (PWP), and 5) sigmoid activation, as shown in Figure 5.

We anticipated that the deeper MLP would enable the encoder to extract features from complex data with greater accuracy. In CA, we focused on retrieving the values from AAP and AMP, which reduce each channel to a single value based on the average and maximum channel values. Next, we used sigmoid activation to map the feature values into the range (0, 1). Figure 4 illustrates the CA process. For the SGE experiment, we multiplied the CA result by the initial feature before entering the SGE to produce a more focused feature value based on the most relevant channel. SGE receives the feature input from CA and splits it into several groups, with the same processing for each group. Following a series of processing steps on each group, all groups were merged back into the resulting feature, which provides a more accurate representation of the feature value. The processing in SGE includes GAP, normalization, PWP, and sigmoid activation. The GAP function takes all feature values and converts them into a single value. The normalization function makes the values independent of the input range. The PWP function computes the dot product between the existing feature vectors. The sigmoid activation function maps feature values into the range (0, 1). Figure 5 illustrates the SGE process.

Figure 3. Channel-SGE attention
Figure 4. CA
Figure 5. SGE attention

3.4.2. Decoder: DeepLabV3+
In this study, the decoder reconstructs the feature representation extracted by the encoder. The first part of DeepLabV3+ is the ASPP module, which applies different dilation levels in parallel. This study uses an ASPP module that consists of five parallel layers: a convolution layer without dilation, three convolution layers with dilation rates of 12, 24, and 36, and an AAP layer. This design is well suited to pooling information from multiscale images. The outputs of the 5 layers are concatenated and then projected, yielding a feature representation of 256 channels. The projection output then goes through feature concatenation in a layer block before the segmentation head. The final dense prediction layer consists of a convolutional layer and SoftMax to provide the final segmentation prediction.

3.5. Training configuration
Data augmentation: we applied several augmentation techniques to the training dataset to enhance training data diversity and improve model robustness: 1) random horizontal and vertical flipping with probability p=0.5, 2) brightness adjustment with linear scaling factor ranging from 0% to 10% of the original intensity, 3) contrast enhancement using nonlinear adjustment within 0% to 10% range, and 4) gamma correction with random gamma values between 0.8 and 1.2 (80% to 120% of original image gamma value).

Hardware and software setup: the proposed model was implemented using Python 3.8 with the PyTorch 2.0.1 framework and CUDA 11.8 for GPU acceleration. All experiments were conducted on the NVIDIA DGX-1 V100 GPU system at the Tokopedia-UI AI Center of Excellence, which provided sufficient computational resources for training the deep learning model.

Training parameters: the network was optimized using the Adam optimizer [26] with a learning rate of 0.00008, which was chosen based on preliminary experiments for stable convergence. Training was conducted for 100 epochs with a batch size of 4 for training data, limited by GPU memory constraints. For the validation and testing phases, the batch size was set to 1 to ensure consistent evaluation across all samples.

Regularization strategy: to mitigate overfitting and enhance model generalization, L2 regularization with a weight decay coefficient of 0.0001 was applied to all trainable parameters during the optimization process.
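
Two of the augmentations listed above, random flipping and gamma correction, can be sketched in plain Python. This is an illustration under assumptions rather than the training pipeline itself: images are single-channel nested lists with intensities scaled to [0, 1], gamma correction is taken to be the power law `pixel ** gamma`, and all function names are hypothetical. The sketch highlights one practical point for segmentation: geometric transforms must be applied identically to the image and its mask, while intensity transforms touch the image only.

```python
import random


def hflip(grid):
    """Mirror every row (left-right flip)."""
    return [row[::-1] for row in grid]


def vflip(grid):
    """Reverse the row order (up-down flip)."""
    return grid[::-1]


def gamma_correct(image, gamma):
    """Power-law intensity change for pixel values in [0, 1]."""
    return [[pixel ** gamma for pixel in row] for row in image]


def augment(image, mask, rng):
    # Geometric transforms: applied to image and mask together, p = 0.5 each.
    if rng.random() < 0.5:
        image, mask = hflip(image), hflip(mask)
    if rng.random() < 0.5:
        image, mask = vflip(image), vflip(mask)
    # Intensity transform: image only, gamma drawn from [0.8, 1.2].
    image = gamma_correct(image, rng.uniform(0.8, 1.2))
    return image, mask


rng = random.Random(0)
image = [[0.0, 1.0], [0.5, 0.25]]
mask = [[0, 1], [2, 3]]               # class labels per pixel
aug_image, aug_mask = augment(image, mask, rng)
```

The labels in `aug_mask` are only rearranged, never altered, so the ground truth stays aligned with the augmented image.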
3.6.
Ev
aluation
W
e
used
IoU,
the
dice
coef
cient,
accurac
y
,
and
the
F1
score
as
metrics
to
e
v
aluate
our
se
gmentation
model.
IoU
is
a
popular
e
v
aluation
metric
for
semantic
se
gmentation
is
IoU.
IoU
calculates
the
o
v
erlapping
area
between
predicted
se
gmentation
(A)
and
the
ground
truth
(B).
The
formula
for
calculating
IoU
is
sho
wn
in
(8).
I
oU
=
A
∩
B
A
∪
B
(8)
The accuracy (ACC) and F1 score are derived from the confusion matrix. A confusion matrix is a table used to assess a classification algorithm's performance [27]; it provides an in-depth analysis of the actual versus predicted classifications. The confusion matrix has four entries. A true positive (TP) occurs when the model correctly predicts a positive sample as positive. A true negative (TN) occurs when the model correctly predicts a negative sample as negative. If a sample is negative but predicted as positive, it is called a false positive (FP). Conversely, if a sample is positive but predicted as negative, it is called a false negative (FN).
Accuracy is the number of correct predictions divided by the total number of samples. The formula for calculating accuracy is given in (9). The F1 score is the harmonic mean of precision and recall. In the semantic segmentation task, the F1 score is also called the Dice coefficient score (DSC). Precision is the proportion of positive predictions that are correct, and recall is the proportion of actual positives that are correctly classified. The F1 score is a better metric than accuracy when the testing data are imbalanced. The F1 score, Dice coefficient, and Dice loss formulas are shown in (10), (11), and (12).
$$ \mathrm{Accuracy} = \frac{TP + TN}{FP + FN + TP + TN} \tag{9} $$
$$ \mathrm{F1\ Score} = \frac{TP}{TP + \frac{FP + FN}{2}} \tag{10} $$
$$ \mathrm{DiceCoefficient} = \frac{2|A \cap B|}{|A| + |B|} \tag{11} $$
$$ \mathrm{DiceLoss} = 1 - \mathrm{DiceCoefficient} \tag{12} $$
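The relationship between (9) to (12) can be checked with a short sketch that works directly on confusion-matrix counts. It relies on the identities |A ∩ B| = TP, |A| = TP + FP, and |B| = TP + FN for a binary mask, which is why the F1 score and the Dice coefficient coincide. The function names and the toy counts are assumptions chosen for illustration.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Equation (9): correct predictions over all predictions."""
    return (tp + tn) / (fp + fn + tp + tn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Equation (10): TP / (TP + (FP + FN) / 2)."""
    return tp / (tp + (fp + fn) / 2)


def dice_coefficient(tp: int, fp: int, fn: int) -> float:
    """Equation (11) with |A ∩ B| = TP, |A| = TP + FP, |B| = TP + FN."""
    return 2 * tp / ((tp + fp) + (tp + fn))


def dice_loss(tp: int, fp: int, fn: int) -> float:
    """Equation (12): one minus the Dice coefficient."""
    return 1.0 - dice_coefficient(tp, fp, fn)


# Toy counts for a 20-pixel image (hypothetical values).
tp, tn, fp, fn = 6, 10, 2, 2
print(accuracy(tp, tn, fp, fn))      # 16 correct out of 20
print(f1_score(tp, fp, fn))          # 6 / (6 + 2)
print(dice_coefficient(tp, fp, fn))  # 12 / 16, same value as F1
print(dice_loss(tp, fp, fn))
```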
4. RESULT AND DISCUSSION
4.1. Overview of the key findings
The proposed ResNet-DeepLabV3+ with the CA-SGE attention mechanism achieved superior performance for paddy field segmentation, improving on the baseline across all evaluation metrics. The key findings are that: 1) the CA-SGE combination provides synergistic benefits, with an IoU of 0.85, a DSC of 0.91, and an accuracy of 0.96, representing relative improvements of 3.7%, 4.6%, and 2.1%, respectively, over the baseline; 2) CA mechanisms are more effective than SA for paddy field segmentation tasks; and 3) the proposed method exhibits better training stability with reduced overfitting compared to the baseline model. The training process of both models indicates that the proposed ResNet-DeepLabV3+ + CA-SGE is preferable because it shows more stable performance and a lower potential for overfitting.
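The reported percentages are relative gains over the baseline ResNet-DeepLabV3+ scores (IoU 0.82, DSC 0.87, accuracy 0.94, as listed in Table 1). The short sketch below reproduces that arithmetic; the dictionary and function names are assumptions introduced for this example.

```python
# Scores taken from Table 1 of the paper.
baseline = {"IoU": 0.82, "DSC": 0.87, "Acc": 0.94}
proposed = {"IoU": 0.85, "DSC": 0.91, "Acc": 0.96}


def relative_gain_percent(old: float, new: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


relative_gains = {
    metric: round(relative_gain_percent(baseline[metric], proposed[metric]), 1)
    for metric in baseline
}
print(relative_gains)
```

Rounded to one decimal place, this gives 3.7% for IoU, 4.6% for DSC, and 2.1% for accuracy, matching the figures quoted in the text.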
4.2. Analysis of training performance
The baseline ResNet-DeepLabV3+ model exhibited rapid performance improvement, with training and validation IoU scores reaching approximately 0.9 within the initial 40 epochs. However, subsequent training phases showed gradual improvement accompanied by increasing divergence between the training and validation curves, suggesting that overfitting tendencies were emerging despite the gap remaining within acceptable limits.
The proposed ResNet-DeepLabV3+ with CA-SGE attention demonstrated distinctly different training characteristics. The model initially exhibited underfitting behavior, where the validation IoU exceeded the training IoU during the early epochs. This pattern was reversed around epoch 30, after which the training IoU began to surpass the validation IoU while gradually improving. Notably, the training-validation gap remained consistently smaller than that of the baseline model throughout the training process.
The reduced training-validation divergence observed in our CA-SGE model indicates superior generalization capability and enhanced training stability. This behavior suggests that the integrated attention mechanisms function as effective regularizers, mitigating overfitting risks while preserving strong model performance. In summary, the baseline method exhibits a large training-validation gap after epoch 40, indicating overfitting behavior, whereas the proposed CA-SGE method shows a smaller training-validation gap with better stability and more robust convergence. A summary of the training is shown in Figures 6 and 7.
Figure 6. ResNet-DeepLabV3+
Figure 7. ResNet-DeepLabV3+ CA-SGE 4FC
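The training-validation gap discussed in this subsection can be quantified per epoch as the training IoU minus the validation IoU. The sketch below shows one way to compute that gap and to locate the epoch at which training IoU first overtakes validation IoU. The curves in it are synthetic placeholder values invented for this example, not the measured curves of Figures 6 and 7, and the function names are hypothetical.

```python
from typing import List, Optional, Sequence


def generalization_gap(train_iou: Sequence[float],
                       val_iou: Sequence[float]) -> List[float]:
    """Per-epoch gap: positive when training IoU exceeds validation IoU."""
    return [t - v for t, v in zip(train_iou, val_iou)]


def first_crossover_epoch(gaps: Sequence[float]) -> Optional[int]:
    """1-based epoch where the gap first turns positive after being <= 0."""
    for i in range(1, len(gaps)):
        if gaps[i - 1] <= 0 < gaps[i]:
            return i + 1
    return None


# Synthetic curves (NOT the paper's data): validation leads at first,
# then training overtakes it, as described for the CA-SGE model.
toy_train = [0.50, 0.60, 0.70, 0.78, 0.84]
toy_val = [0.55, 0.63, 0.70, 0.76, 0.80]

toy_gaps = generalization_gap(toy_train, toy_val)
print(toy_gaps)
print(first_crossover_epoch(toy_gaps))
print(max(toy_gaps))
```

A smaller maximum gap late in training is the quantity the text associates with reduced overfitting.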
4.3. Quantitative evaluation of performance
4.3.1. Baseline method comparison
We conducted a comprehensive comparative evaluation against established segmentation architectures to establish the effectiveness of our proposed approach. The baseline methods include ResNet-UNet, ResNet-FPN, and ResNet-DeepLabV3+, all implemented without attention mechanisms to ensure fair comparison with our attention-enhanced variant. Table 1 presents the quantitative performance comparison across three standard evaluation metrics: IoU, DSC, and pixel-wise accuracy.
The proposed ResNet-DeepLabV3+ with CA-SGE attention outperformed all baseline methods on every evaluation metric. Specifically, the integration of the CA and SGE attention mechanisms improved the baseline ResNet-DeepLabV3+ performance from an IoU of 0.82 to 0.85, a DSC of 0.87 to 0.91, and an accuracy of 0.94 to 0.96. These improvements demonstrate the effectiveness of our hybrid attention approach in capturing both channel-wise feature importance and spatial relationships, which are crucial for accurate paddy field segmentation. The superior IoU and DSC scores indicate enhanced boundary delineation accuracy and improved region overlap precision, which are critical factors for practical agricultural monitoring applications. The consistent gains over all three baseline architectures support the robustness and generalizability of the proposed attention mechanism.
Table 1. Comparison of testing segmentation methods
Method                      IoU    DSC    Acc
ResNet-UNet                 0.79   0.86   0.94
ResNet-FPN                  0.80   0.87   0.93
ResNet-DeepLabV3+           0.82   0.87   0.94
ResNet-DeepLabV3+ CA-SGE    0.85   0.91   0.96
4.4. Ablation study
4.4.1. Individual attention mechanism performance
An ablation study was conducted on the attention mechanism. We compared the performance of CA, SA, CBAM, and SGE on the ResNet-DeepLabV3+ model. We also compared the depth of the MLP layers in the CA attention module. The addition of CA to ResNet-DeepLabV3+ slightly increases performance. As shown in Table 2, the addition of CA increased all metrics by 0.01. Increasing the number of MLP layers in CA to 4 did not have a significant impact, only increasing the DSC by 0.01 (from 0.88 to 0.89), whereas the IoU and accuracy did not increase.
The ablation results reveal several critical insights into the effectiveness of the attention mechanism in paddy field segmentation. CA demonstrated consistent performance improvements across all evaluation metrics, with IoU increasing from 0.82 to 0.83 when individually integrated. This improvement validates the effectiveness of channel-wise feature recalibration for agricultural segmentation tasks. Conversely, SA was counterproductive, decreasing IoU from 0.82 to 0.81 while DSC (0.87) and accuracy (0.94) stayed at the baseline values reported in Table 2. This degradation suggests that SA mechanisms may introduce unwanted noise rather than beneficial spatial focusing for paddy field boundary detection. The CBAM evaluation, which combines both CA and SA components, yielded performance identical to CA alone across all metrics with a 2-layer MLP, and slightly lower IoU and DSC than CA alone with a 4-layer MLP. This finding reinforces our observation that SA provides no additional benefit and potentially interferes with CA effectiveness, indicating the limited utility of SA in this specific application domain.
SGE demonstrated a unique performance characteristic, achieving exceptionally high pixel-wise accuracy (0.98) while simultaneously showing decreased performance in the IoU (0.81) and DSC (0.86) metrics. This pattern indicates that SGE excels at overall pixel classification accuracy but struggles with precise boundary delineation and region overlap precision, which are critical for accurate segmentation.
The optimal performance was achieved by our proposed CA-SGE combination with a 4-layer MLP depth, which demonstrated the best balance across all evaluation metrics. This configuration achieved the highest IoU (0.85) and DSC (0.91) scores while maintaining competitive accuracy (0.96), improving on the baseline values of 0.82, 0.87, and 0.94, respectively.
Table 2. Comparison of attention modules
Module    MLP depth   IoU    DSC    Acc
-         -           0.82   0.87   0.94
SA        -           0.81   0.87   0.94
CA        2           0.83   0.88   0.95
CBAM      2           0.83   0.88   0.95
CA+SGE    2           0.83   0.89   0.95
SGE       -           0.81   0.86   0.98
CA        4           0.83   0.89   0.95
CBAM      4           0.82   0.88   0.95
CA+SGE    4           0.85   0.91   0.96
4.4.2. Multilayer perceptron depth analysis
Investigation of MLP depth configurations within the CA module revealed that the 4-layer architecture provides optimal performance without introducing significant computational overhead. Comparison between the 2-layer and 4-layer MLP configurations showed that the deeper network yielded marginal improvements (+0.01 DSC) while maintaining computational efficiency. Further exploration of deeper MLP architectures (>4 layers) demonstrated diminishing returns with negligible performance gains, confirming the appropriateness of our 4-layer architectural choice for practical deployment scenarios.
4.4.3. Data augmentation impact
Table 3 displays the test results of each segmentation network with and without data augmentation. The results demonstrate substantial and consistent performance improvements across all segmentation architectures when data augmentation is applied. ResNet-FPN showed the most dramatic improvement, with IoU increasing from 0.55 to 0.80, indicating that this architecture particularly benefits from enhanced data diversity. ResNet-UNet also exhibited significant improvement, with IoU rising from 0.64 to 0.79. The ResNet-DeepLabV3+ baseline showed a more modest but meaningful improvement, increasing from 0.76 to 0.82. This smaller relative improvement suggests that the more sophisticated DeepLabV3+ architecture, with its atrous convolutions and multi-scale processing, already possesses some inherent robustness to variations in input data.
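The size of the augmentation effect can be compared across architectures by computing the absolute and relative IoU gains from the values quoted above. The following sketch does this arithmetic; the variable names are assumptions introduced for illustration, and only the IoU values stated in the text are used.

```python
# IoU without and with data augmentation, as quoted in the text.
iou_without_aug = {"ResNet-FPN": 0.55, "ResNet-UNet": 0.64, "ResNet-DeepLabV3+": 0.76}
iou_with_aug = {"ResNet-FPN": 0.80, "ResNet-UNet": 0.79, "ResNet-DeepLabV3+": 0.82}

augmentation_gains = {}
for name, before in iou_without_aug.items():
    after = iou_with_aug[name]
    absolute = round(after - before, 2)                  # IoU points gained
    relative = round((after - before) / before * 100.0, 1)  # percent of the starting IoU
    augmentation_gains[name] = (absolute, relative)

# Architectures ordered from largest to smallest absolute gain.
ranked_by_gain = sorted(augmentation_gains,
                        key=lambda n: augmentation_gains[n][0],
                        reverse=True)

for name in ranked_by_gain:
    print(name, augmentation_gains[name])
```

The ordering (ResNet-FPN, then ResNet-UNet, then ResNet-DeepLabV3+) mirrors the discussion above: the architecture with the lowest unaugmented IoU gains the most from augmentation.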