TELKOMNIKA Telecommunication, Computing, Electronics and Control
Vol. 23, No. 6, December 2025, pp. 1611–1625
ISSN: 1693-6930, DOI: 10.12928/TELKOMNIKA.v23i6.26829

Mixed attention mechanism on ResNet-DeepLabV3+ for paddy field segmentation

Alya Khairunnisa Rizkita 1, Masagus Muhammad Luthfi Ramadhan 1, Yohanes Fridolin Hestrio 1,2, Muhammad Hannan Hunafa 1, Danang Surya Candra 2, Wisnu Jatmiko 1

1 Department of Computer Science, Faculty of Computer Science, University of Indonesia, Depok, Indonesia
2 Research Center for Geoinformatics, Electronics and Informatics Research Organization, National Research and Innovation Agency, Bogor, Indonesia

Article Info
Article history: Received Dec 7, 2024; Revised Aug 27, 2025; Accepted Sep 10, 2025
Keywords: Attention mechanism; DeepLabV3+; Remote sensing; Residual network; Semantic segmentation

ABSTRACT
Rice cultivation monitoring is crucial for Indonesia, where paddy field areas declined by 2.45% according to the Central Bureau of Statistics due to land function changes and shifting crop preferences. Regular monitoring of paddy field distribution is essential for understanding agricultural land utilization by farmers and landowners. Satellite imagery has become increasingly common for agricultural land observation, but traditional neural networks alone provide insufficient segmentation accuracy. This study proposes an enhanced deep learning architecture combining residual network (ResNet)-DeepLabV3+ with channel attention (CA) and spatial group-wise enhancement (SGE) modules. The attention mechanisms establish direct connections between context vectors and inputs, enabling the model to prioritize relevant spatial and spectral features for precise paddy field identification. The CA module enhances spectral feature discrimination, whereas the SGE improves spatial characteristic representation.
The experimental results demonstrate superior performance over the baseline methods, achieving intersection over union (IoU) of 0.85, dice coefficient of 0.89, and accuracy of 0.95. The proposed mixed attention mechanism significantly improves the accuracy and efficiency of automatic crop area identification from satellite imagery.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Alya Khairunnisa Rizkita
Department of Computer Science, Faculty of Computer Science, University of Indonesia
Depok, 16424, Indonesia
Email: alya.khairunnisa21@ui.ac.id

Journal homepage: https://telkomnika.uad.ac.id/index.php/TELKOMNIKA

1. INTRODUCTION

Rice is a prominent agricultural product globally, and it thrives in expansive, unobstructed regions. In Indonesia, rice is the predominant commodity. Based on statistics from the Central Bureau of Statistics in Indonesia, there was a 2.45% decline in rice cultivation from 2022 to 2023 [1], [2]. This reduction in land area is due to changes in land functions and preferences for planted crop commodities. The decreased price of rice relative to other crops and the less-than-ideal paddy field conditions for rice growth are the main causes of the change in commodities and land usage [3], [4]. Therefore, it is important for farmers and owners of paddy fields to understand how existing paddy fields are distributed and used by conducting regular monitoring.

Remote sensing imagery provides a precise and geographically accurate depiction of the land that the device has captured [5]. Remote sensing imagery typically requires geographic information about agricultural land to accurately monitor crop conditions at specific locations and utilize time efficiently. Additionally, this information aids in monitoring the use of agricultural land, which is rapidly changing its function. Only large and open land crops, such as rice, corn, onions, and oil palm, are suitable for remote sensing monitoring on agricultural land. Consequently, it is possible to map paddy fields using remote sensing technology to monitor crop growth, manage paddy field use, and estimate crop yields [6].

Extracting information about agricultural land can employ methods such as segmentation or object detection to further process remote sensing images [7]. These approaches have been the focus of recent research. In line with the development of remote sensing image processing techniques, image segmentation has become one of the techniques relied upon to produce more detailed information from an image. Image segmentation is a technique that classifies each pixel in an image into semantic classes [8].

However, automatically identifying rice fields from satellite images is still very difficult. Current deep learning methods have major problems: difficulty in accurately tracing complex and irregular field boundaries; difficulty in distinguishing rice fields from other similar-looking features, such as water bodies, wetlands, or other crops; poor performance when fields have different sizes, shapes, and orientations in the same image; and unreliable behavior under different weather conditions, seasons, or rice growth stages. These problems lead to poor accuracy in identifying rice fields, resulting in unreliable farm monitoring and incorrect crop yield predictions. In recent years, the development of segmentation methods has continued to grow.
Agriculture widely applies neural network approaches, such as crop type segmentation [9]-[12], crop disease detection [13], [14], and farmland mapping [15]. Shelhamer et al. [16] first proposed a neural network architecture for segmentation, the fully convolutional network (FCN). The U-shaped convolutional neural network (U-Net) [17] is a substantially improved version of FCN that integrates up-sampling, down-sampling, skip connections, and fully convolutional layers. U-Net is more symmetric than FCN: every down-sampling stage is connected to the matching up-sampling stage by a skip connection at each level. The U-Net architecture is widely applied, and it continues to be refined as new segmentation studies emerge.

Zhang et al. [18] studied agricultural land mapping with a refined pyramid scene parsing network (PSPNet) on polarimetric synthetic aperture radar (PolSAR) data. Their study proposes a modified PSPNet with an added polarimetric CA module. This module captures polarimetric features from PolSAR data, allowing PSPNet to distinguish the plant species present. The results show that the refined PSPNet provides excellent segmentation results on PolSAR data, especially in mapping sharp contours and separating agricultural areas. Li et al. [19] extended U-Net with a linear attention mechanism (LAM) to segment remote sensing images, placing LAM on each skip connection of the U-Net structure. This research led to the development of MAResU-Net, which generally outperforms U-Net. Mahmud et al. [20] extended DeepLabV3+ with a spatial attention (SA) mechanism for road segmentation. Their findings demonstrate that adding an SA mechanism increases segmentation accuracy by 1.56% compared to the standard model.
Previous research has demonstrated that the integration of attention modules can enhance segmentation accuracy. Despite these advances, existing approaches exhibit several critical limitations: traditional U-Net architectures struggle with the multi-scale feature extraction required for varying paddy field sizes, PSPNet-based methods suffer from high computational complexity that limits their practical deployment, FCN-based approaches lack sufficient contextual information for complex agricultural landscapes, and current attention mechanisms often fail to simultaneously capture both the spatial relationships and the channel dependencies that are crucial for distinguishing paddy fields from spectrally similar features. Most existing methods are evaluated on general segmentation datasets rather than domain-specific agricultural scenarios, limiting their applicability to real-world paddy field monitoring tasks.

Attention modules can distinguish between important and unimportant image elements. With the integration of attention modules, object boundaries are more precise and internal pixels are more uniform [18]-[20]. The convolutional block attention module (CBAM) is one of the attention modules designed to enhance the representational power of CNN feature maps. CBAM applies channel attention (CA) and SA. CA focuses on refining the importance of feature channels, allowing the model to learn what it must look at in the features. Meanwhile, the SA module focuses on highlighting the important regions within the feature maps. The spatial group-wise enhance (SGE) attention module is another attention module that improves CNN feature maps. Unlike CBAM, which has two modules, SGE focuses on spatial information using a group-wise strategy [21], [22].

Li et al. [23] propose an improved U-Net model combining a multi-axis vision transformer (MaxViT) encoder with a CBAM decoder for automated crop-weed segmentation in precision agriculture. Their study addresses the limitations of uniform pesticide spraying and of existing segmentation models that struggle with the complex backgrounds and multi-scale targets of agricultural images. Using MaxViT's dual attention mechanisms for local and global feature capture alongside CBAM's channel and spatial attention for boundary enhancement, the model achieved 84.28% mean intersection over union (mIoU) and 88.59% mean pixel accuracy (mPA) on sugar beet datasets, improving 3.08% and 3.15% over the baseline U-Net while reducing parameters from 43.93 M to 22.08 M and maintaining a 0.0559 s inference time.

Developing effective rice field mapping methods faces several major challenges, including the need for computationally efficient solutions that work in real time for farm monitoring, high data variability from seasonal changes and different growth stages across diverse geographical locations, limited availability of high-quality annotated datasets, requirements for methods that work across different sensor types and imaging conditions, and the need for practical deployment in existing agricultural systems with minimal infrastructure.
To address these challenges, this research aims to develop an enhanced deep learning architecture that combines a residual network (ResNet) encoder and a DeepLabV3+ decoder to leverage both detailed features and multi-scale contextual information, to investigate hybrid attention mechanisms (CBAM and SGE) for capturing the spatial and channel dependencies that are crucial for accurate boundary detection, and to systematically evaluate performance against existing state-of-the-art methods using comprehensive metrics, such as IoU, dice coefficient, and accuracy. Finally, this study provides a practical, computationally efficient solution for automated rice field monitoring from remote sensing imagery that can be readily deployed in operational agricultural systems.

Therefore, this research focuses on mapping agricultural land use, especially paddy fields, using the ResNet-DeepLabV3+ neural network segmentation method. The segmentation method also incorporates CBAM and SGE attention modules to enhance ResNet-DeepLabV3+'s ability to segment paddy fields. In summary, this article offers the following contributions:
- We propose an enhanced deep learning model, ResNet-DeepLabV3+ with attention, for segmenting paddy fields from remote sensing images. The proposed model comprises ResNet-34 as the encoder, augmented with an attention module to improve the model's segmentation capabilities, and DeepLabV3+ as the decoder.
- By incorporating the SGE technique, we enhance CBAM by replacing its SA mechanism. This modification is intended to give a more accurate depiction of the spatial characteristics throughout the image.
- We compare the performance of the proposed method with other segmentation methods.

This paper is organized into several sections, the first of which covers the introduction to the research. The second section discusses related works. The third section describes the experimental setup.
The fourth section discusses the proposed method. The results are discussed in the fifth section, and the last section discusses the research conclusions.

2. RELATED WORK

2.1. Residual network

ResNet was first introduced in 2016 by He et al. [24], who brought the concept of residual learning to deep neural networks that are more than 100 layers deep. He et al. [24] proposed this concept to address optimization issues in deep neural networks, which frequently suffer from performance degradation. The basic idea of residual learning is that the network learns the residual of a target function rather than the function itself. If the function to be learned is denoted as $H(x)$ and the network learns $F(x) = H(x) - x$, then the original function is recovered as $H(x) = F(x) + x$. ResNet uses residual blocks with many layers, each of which learns the residual with respect to the previous one. A skip connection is used in the network to connect the input and output. These skip connections perform identity mapping: the input is summed with the outputs of the layers at a later point, which eases optimization without adding network complexity. The development of ResNet has shown that residual learning can help to successfully optimize and improve the accuracy of deep neural networks [24].

2.2. DeepLabV3

Chen et al. [25] revised their prior model, DeepLab, into a newer version: DeepLabV3. The atrous convolution used in the DeepLabV3 model enables a network to enlarge its field of view without adding new learned parameters. Atrous convolution highlights spatial information over a large receptive field while maintaining crisp resolution, and atrous convolutions applied at multiple parallel levels form atrous spatial pyramid pooling (ASPP). With ASPP, the method can effectively extract information at multiple scales from the image alone while keeping the parameter space small. DeepLabV3 combines several ideas: it uses ASPP to perform multiscale feature learning, which helps the deep network become aware of more context and capture richer features at multiple scales of detail; it adds global image-level features directly against the last convolution over grouping (ICOG) convolutional layer for segmenting boundaries; it replaces SoftMax with sigmoid binary cross-entropy loss while employing the dice coefficient as an evaluation metric; and, importantly, it employs batch normalization in the ASPP module, which has been shown to greatly improve training stability [25].

2.3. Attention mechanism

The attention mechanism creates a direct link between the context vector and the input, allowing the model to focus on the most important parts by assigning scores to each input element based on its relevance. This is especially helpful when working with large datasets where not all information is equally useful, such as in language translation or image recognition, where only specific parts matter for accurate results. By using attention, models can learn more efficiently by concentrating on key features and ignoring less important details, which improves accuracy while saving computational resources [21], [22].

2.3.1. Convolutional block attention module

CBAM is an attention module that combines CA and SA to attend more closely to relevant features at the channel and spatial levels.
CBAM starts by applying CA to adjust the weight of each channel, followed by SA, which adjusts the weights spatially [21]. The combination of CA and SA is shown in (1) and (2). In (1), the initial feature map $F$ is multiplied element-wise by the one-dimensional attention map generated by CA, $M_c(F)$. The resulting feature map $F'$ is then multiplied element-wise by the two-dimensional attention map generated by SA, $M_s(F')$.

$$F' = M_c(F) \otimes F \tag{1}$$

$$F'' = M_s(F') \otimes F' \tag{2}$$

CA is an attention mechanism that assigns varying weights to each channel according to its relative importance. The goal of assigning these weights is to determine the most pertinent channels [21]. CA is formulated in (3) and (4). The feature map is reduced by average pooling ($AvgP$) and maximum pooling ($MaxP$), and each pooled descriptor is passed through a shared multilayer perceptron ($MLP$). The result of $AvgP$ is denoted $F^c_{avg}$ and the result of $MaxP$ is denoted $F^c_{max}$. The two MLP outputs are added and passed through a sigmoid function $\sigma$, which determines the most relevant part of the feature based on $AvgP$ and $MaxP$. The MLP has two layers, whose weights are denoted $W_0$ and $W_1$.

$$M_c(F) = \sigma(MLP(AvgP(F)) + MLP(MaxP(F))) \tag{3}$$

$$M_c(F) = \sigma(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))) \tag{4}$$

In contrast to CA, which looks at each channel's relevance, SA is an attention mechanism that weights the spatial locations of features according to how important each location is. This weighting is intended to identify the feature's most relevant locations [21]. SA is formulated in (5) and (6). The feature map $F$ is reduced by average pooling ($AvgP$) and maximum pooling ($MaxP$), and the resulting two-dimensional feature maps are $F^s_{avg}$ and $F^s_{max}$.
Both results are concatenated and convolved with a convolution operation $f$ with a filter size of 7 × 7. To provide a feature map that represents the most relevant spatial values, the result of the convolution is passed through a sigmoid function.

$$M_s(F) = \sigma(f^{7 \times 7}([AvgP(F); MaxP(F)])) \tag{5}$$

$$M_s(F) = \sigma(f^{7 \times 7}([F^s_{avg}; F^s_{max}])) \tag{6}$$
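To make (1)-(6) concrete, the following is a minimal sketch of CBAM written with plain Python lists. It is an illustration under stated assumptions, not the authors' implementation: the function names (`channel_attention`, `spatial_attention`, `cbam`), the `[C][H][W]` list layout, the ReLU between the two MLP layers, and the zero padding of the 7 × 7 convolution are all choices made here for the example.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def mlp(vec, w0, w1):
    # Shared two-layer MLP of (4): W1(ReLU(W0 v)). The ReLU is an assumption.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w0]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w1]


def channel_attention(feat, w0, w1):
    # feat is [C][H][W]; returns one weight per channel, as in (3) and (4).
    area = len(feat[0]) * len(feat[0][0])
    f_avg = [sum(map(sum, ch)) / area for ch in feat]   # AvgP over H x W
    f_max = [max(map(max, ch)) for ch in feat]          # MaxP over H x W
    a, m = mlp(f_avg, w0, w1), mlp(f_max, w0, w1)
    return [sigmoid(x + y) for x, y in zip(a, m)]


def spatial_attention(feat, kernel, k=7):
    # Pools across channels, then applies a k x k convolution as in (5), (6).
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    f_avg = [[sum(feat[c][i][j] for c in range(C)) / C for j in range(W)]
             for i in range(H)]
    f_max = [[max(feat[c][i][j] for c in range(C)) for j in range(W)]
             for i in range(H)]
    pooled, pad = [f_avg, f_max], k // 2
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            s = 0.0
            for p in range(2):
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < H and 0 <= x < W:   # zero padding
                            s += kernel[p][di][dj] * pooled[p][y][x]
            out[i][j] = sigmoid(s)
    return out


def cbam(feat, w0, w1, kernel):
    mc = channel_attention(feat, w0, w1)
    f1 = [[[mc[c] * v for v in row] for row in feat[c]]          # (1)
          for c in range(len(feat))]
    ms = spatial_attention(f1, kernel)
    return [[[ms[i][j] * v for j, v in enumerate(row)]           # (2)
             for i, row in enumerate(ch)] for ch in f1]
```

Here `w0` has shape `[C/r][C]`, `w1` has shape `[C][C/r]`, and `kernel` has shape `[2][7][7]`; in a trained network these would be learned parameters. Because both attention maps lie in (0, 1), the output never exceeds the input in magnitude, which is why attention acts as a soft selection rather than an amplification.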
2.3.2. Spatial group-wise enhanced attention

SGE is an attention module that performs feature grouping at the spatial and channel levels and applies an attention mechanism to determine the relevance of sub-features based on their spatial location within each group. SGE generates a coefficient value ($c_i$) for each feature based on the similarity of the global feature ($g$) and the local feature ($x_i$), allowing it to focus on the right semantic area. Dividing the features into several groups allows them to be processed at a smaller scale to obtain the desired details [22]. The formulation of SGE is given in (7).

$$c_i = g \cdot x_i \tag{7}$$

3. METHODOLOGY

3.1. Data collection

This research utilized remote sensing image data collected to support the field survey of the National Research and Innovation Agency (BRIN), which is developing a method to harmonize high-resolution satellite images using artificial intelligence for paddy field monitoring and mapping. Remote mapping activities took place from 21 to 25 August 2023 in Yogyakarta Province. Thirteen image capture points coincided with rice fields in Sleman Regency, Bantul Regency, and Kulon Progo Regency. The remote sensing tool used to take images of the rice fields was the Da-Jiang Innovations (DJI) Phantom 4 multispectral camera, specifically designed for agriculture and environmental monitoring. The camera model used on the drone is the FC6360, which records five spectral bands: blue, green, red, red edge, and near-infrared (NIR). The average flight height of the drone at all capture points was 100 m, resulting in high-resolution images of 1600 × 1300 pixels with a ground resolution of 5 cm/pixel.

3.2. Data creation

Before data from the BRIN field survey activities can serve as a foundation for a machine learning model, it must undergo several stages of data processing.
The stages carried out to prepare the paddy field image data include digitization (annotation) of the paddy field images and the formation of the paddy field image dataset. Land designation resulted in five classes: background, first-phase paddy plants (vegetative), second-phase paddy plants (reproductive), third-phase paddy plants (ready to harvest), and non-paddy fields. The differences in land use between paddy fields can be identified directly. However, some plants have characteristics similar to those of paddy plants, and different paddy phases sometimes share the same characteristics. The output of the digitization process includes an image file and a shapefile, which serve as the initial mask for segmentation. All existing drone images were digitized to facilitate image selection. An example of digitized data is shown in Figure 1.

After digitization, each digitized shapefile is divided into parts based on the color range from 0 to 255, forming a mask file. Segmentation uses the mask file as the ground truth of an image to map classes pixel by pixel. Once all images were digitized, the image and shapefile were processed to create a label mask, and both were standardized to a size of 512 × 512 pixels. In summary, the dataset used in this study consisted of paddy field images with mask labels of 512 × 512 pixels. This study used a total of 546 paddy field images, grouped into 5 classes.

Figure 1. Example of the data
3.3. Data preparation

The experiment used a dataset of 546 images and their labels. The size of each image and label is 512 × 512 pixels. The dataset labels contain five classes: background, first-phase rice plant, second-phase rice plant, third-phase rice plant, and non-rice paddy field. We split the dataset into 3 subsets: 70% for training data, 20% for validation data, and 10% for test data. Training data are used by the model to learn; validation data are used during training to check at each epoch whether the validation score has improved; and test data are used to evaluate how well the model performs segmentation.

3.4. Proposed architecture

In this study, we propose a deep learning model, ResNet-DeepLabV3+ with attention, for segmenting paddy fields from remote sensing images. The basic idea behind this model is to create a segmentation model by combining ResNet-34 as an encoder and DeepLabV3+ as a decoder and placing an attention module on the encoder to improve the encoder's ability to extract features. Figure 2 shows an overview of the proposed method.

Figure 2. Proposed method

3.4.1. Encoder: ResNet-34 with attention

The ResNet-34 encoder begins with a convolutional layer with a 7 × 7 kernel that maps the incoming image (3 channels) into a space with 64 channels. This first layer also performs batch normalization, rectified linear unit (ReLU) activation, and max pooling to reduce the spatial dimension. Each ResNet layer comprises two configurations of convolutional layers with 3 × 3 kernels, followed by batch normalization and the added attention. The first ResNet layer has an output of 64 channels; the second layer reduces the spatial dimension while increasing the channel size to 128, and the third layer reduces the spatial dimension again while increasing the channel size to 256. The last layer provides an output of 512 channels.
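The channel progression described above can be traced for a 512 × 512 input with a few lines of arithmetic. The sketch below assumes the common ResNet-34 layout (stride 2 and padding 3 in the 7 × 7 stem, a 3 × 3 max pool with stride 2, and stride 2 at the start of layers 2 to 4); the paper does not state its strides, padding, or whether any stage uses dilation instead of striding, so the spatial sizes here are illustrative only. The function names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Output side length of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet34_encoder_shapes(size=512):
    """Return (stage name, channels, spatial side) for each encoder stage."""
    shapes = []
    size = conv_out(size, kernel=7, stride=2, padding=3)
    shapes.append(("stem conv 7x7", 64, size))
    size = conv_out(size, kernel=3, stride=2, padding=1)
    shapes.append(("max pool", 64, size))
    # Assumed strides: layer1 keeps the resolution, layers 2-4 halve it.
    for name, channels, stride in [("layer1", 64, 1), ("layer2", 128, 2),
                                   ("layer3", 256, 2), ("layer4", 512, 2)]:
        size = conv_out(size, kernel=3, stride=stride, padding=1)
        shapes.append((name, channels, size))
    return shapes


if __name__ == "__main__":
    for name, channels, side in resnet34_encoder_shapes(512):
        print(f"{name:>14}: {channels:3d} channels, {side} x {side}")
```

Under these assumptions the deepest feature map is 16 times smaller per side than the input, which is the resolution at which a decoder with an ASPP module would then operate.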
Each ResNet layer receives attention from the CA and SGE after the convolution layer has finished. Figure 2 illustrates the combination of CA and SGE. Initially, we combined CA with SA to form CBAM, which increased attention at the channel and spatial levels. We also integrated CA with SGE. In addition, we increased the depth of the MLP.

CA-SGE attention integration: our key innovation lies in strategically combining CA [21] with SGE [22] mechanisms. Unlike the standard CBAM [21], which uses CA and SA sequentially, we replace the SA component with SGE to create a more effective CA-SGE hybrid. The CA-SGE attention integration is illustrated in Figure 3.

CA enhancement: the standard CA mechanism was modified by incorporating a 4-layer MLP (compared to the default 2-layer MLP) to enable more sophisticated channel relationship modeling. The CA process uses both adaptive average pooling (AAP) and adaptive max pooling (AMP), followed by the enhanced MLP and sigmoid activation, as illustrated in Figure 4.

SGE integration: the SGE mechanism [22] receives the CA-enhanced features and applies group-wise spatial enhancement through: 1) feature grouping, 2) global average pooling (GAP), 3) normalization, 4) position-wise dot product (PWP), and 5) sigmoid activation, as shown in Figure 5.

We expected the deeper MLP to enable the encoder to extract features from complex data with greater accuracy. In CA, AAP and AMP each reduce every channel to a single value, based on the average and maximum channel values respectively. Next, sigmoid activation maps the feature values into the range (0, 1). Figure 4 illustrates the CA process.

For the SGE experiment, we multiplied the CA result by the initial feature before entering the SGE to produce a more focused feature value based on the most relevant channel. SGE receives the feature input from CA and splits it into several groups, with the same processing for each group. Following a series of processing steps on each group, all groups are merged back into the resulting feature, which provides a more accurate representation of the feature values. The processing in SGE includes GAP, normalization, PWP, and sigmoid activation. The GAP function takes all feature values of a channel and converts them into a single value. The normalization function makes the result independent of the input range. The PWP function computes the dot product between the existing feature vectors. The sigmoid activation function maps feature values into the range (0, 1). Figure 5 illustrates the SGE process.
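The SGE stage can be sketched on plain Python lists as follows. This is a hedged illustration rather than the authors' code: it follows the step order of the original SGE module [22] as commonly implemented (global pooling, dot product of (7), normalization over spatial positions, sigmoid gating), and the function names, the `[C][H][W]` layout, the `gamma` and `beta` scale and shift parameters, and the `eps` constant are assumptions introduced for this example. In the proposed model, `feat` would be the CA-reweighted feature map.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def sge_group(group, gamma=1.0, beta=0.0, eps=1e-5):
    """Enhance one group of channels; group is [C_g][H][W]."""
    C, H, W = len(group), len(group[0]), len(group[0][0])
    # GAP: one global descriptor value per channel of the group.
    g = [sum(map(sum, ch)) / (H * W) for ch in group]
    # Position-wise dot product c_i = g . x_i from (7).
    c = [[sum(g[k] * group[k][i][j] for k in range(C)) for j in range(W)]
         for i in range(H)]
    # Normalize the coefficients over all spatial positions of the group.
    flat = [v for row in c for v in row]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((v - mean) ** 2 for v in flat) / len(flat)) + eps
    # Sigmoid gate in (0, 1), then rescale every local feature vector.
    gate = [[sigmoid(gamma * (v - mean) / std + beta) for v in row]
            for row in c]
    return [[[gate[i][j] * group[k][i][j] for j in range(W)]
             for i in range(H)] for k in range(C)]


def sge(feat, groups):
    """Split [C][H][W] features into equal channel groups and enhance each."""
    C = len(feat)
    if C % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    size = C // groups
    out = []
    for start in range(0, C, size):
        out.extend(sge_group(feat[start:start + size]))
    return out
```

Positions whose local feature vector agrees with the group's global descriptor receive a gate above 0.5 and are kept relatively strong, while positions that disagree are suppressed; a spatially constant group is simply halved because every normalized coefficient is zero.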
Figure 3. Channel-SGE attention

Figure 4. CA

Figure 5. SGE attention

3.4.2. Decoder: DeepLabV3+

In this study, the decoder reconstructs the feature representation extracted by the encoder. The first part of DeepLabV3+ is the ASPP module, which applies different dilation levels in parallel. This study uses an ASPP module that consists of five parallel layers: a convolution layer without dilation, three convolution layers with dilation rates of 12, 24, and 36, and an AAP layer. This design is well suited to pooling information from images at multiple scales. The outputs of the 5 layers are concatenated and then projected into one feature representation of 256 channels. The projection output then goes through feature concatenation in a layer block before the segmentation head. The dense prediction layer consists of a CNN and SoftMax, which provide the final segmentation prediction.

3.5. Training configuration

Data augmentation: we applied several augmentation techniques to the training dataset to enhance training data diversity and improve model robustness: 1) random horizontal and vertical flipping with probability p=0.5, 2) brightness adjustment with a linear scaling factor ranging from 0% to 10% of the original intensity, 3) contrast enhancement using nonlinear adjustment within a 0% to 10% range, and 4) gamma correction with random gamma values between 0.8 and 1.2 (80% to 120% of the original image gamma value).

Hardware and software setup: the proposed model was implemented using Python 3.8 with the PyTorch 2.0.1 framework and CUDA 11.8 for GPU acceleration. All experiments were conducted on the NVIDIA DGX-1 V100 GPU system at the Tokopedia-UI AI Center of Excellence, which provided sufficient computational resources for training the deep learning model.

Training parameters: the network was optimized using the Adam optimizer [26] with a learning rate of 0.00008, which was chosen based on preliminary experiments for stable convergence. Training was conducted for 100 epochs with a batch size of 4 for training data, limited by GPU memory constraints. For the validation and testing phases, the batch size was set to 1 to ensure consistent evaluation across all samples.

Regularization strategy: to mitigate overfitting and enhance model generalization, L2 regularization with a weight decay coefficient of 0.0001 was applied to all trainable parameters during the optimization process.

3.6. Evaluation

We used IoU, the dice coefficient, accuracy, and the F1 score as metrics to evaluate our segmentation model. IoU is a popular evaluation metric for semantic segmentation. It measures the overlap between the predicted segmentation (A) and the ground truth (B) as the area of their intersection divided by the area of their union. The formula for calculating IoU is shown in (8).

$$IoU = \frac{|A \cap B|}{|A \cup B|} \tag{8}$$

The accuracy (ACC) and F1 score are derived from the confusion matrix. A confusion matrix is a table used to assess a classification algorithm's performance [27]. The confusion matrix provides an in-depth analysis of the actual vs. predicted classification.
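As a small worked illustration of (8) and of the confusion matrix, the sketch below counts actual versus predicted labels over flattened masks and derives one IoU per class from those counts. The helper names (`confusion_matrix`, `iou_per_class`, `mean_iou`) and the toy labels are hypothetical; the paper does not say whether its reported IoU is averaged over the five classes or computed in some other way.

```python
import math


def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts pixels whose actual class is t and predicted class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def iou_per_class(cm):
    """IoU of (8) for every class: |A intersect B| / |A union B|."""
    ious = []
    for k in range(len(cm)):
        intersection = cm[k][k]
        actual = sum(cm[k])                      # pixels labeled k
        predicted = sum(row[k] for row in cm)    # pixels predicted as k
        union = actual + predicted - intersection
        ious.append(intersection / union if union else float("nan"))
    return ious


def mean_iou(ious):
    """Average over the classes that actually occur (skips NaN entries)."""
    valid = [v for v in ious if not math.isnan(v)]
    return sum(valid) / len(valid)


if __name__ == "__main__":
    # Toy flattened masks with three classes (labels are made up).
    truth = [0, 0, 1, 1, 2, 2]
    pred = [0, 1, 1, 1, 2, 0]
    cm = confusion_matrix(truth, pred, num_classes=3)
    print(cm)
    print(iou_per_class(cm), mean_iou(iou_per_class(cm)))
```

Computing IoU from the confusion matrix rather than from sets of pixel coordinates gives the same value as (8) and lets all classes be scored in a single pass over the masks.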
The confusion matrix uses several terms. A true positive (TP) occurs when the model correctly predicts an actually positive sample as positive. A true negative (TN) occurs when the model predicts an actually negative sample as negative. If the sample is negative but predicted as positive, it is called a false positive (FP). Conversely, if the sample is positive but predicted as negative, it is called a false negative (FN). Accuracy is the number of correct predictions divided by the total number of samples. The formula for calculating accuracy is given in (9). The F1 score is the harmonic mean of precision and recall. In semantic segmentation, the F1 score is also called the dice coefficient score (DSC). Precision is the proportion of positive predictions that are correct, and recall is the proportion of actual positives that are correctly classified. The F1 score is a better metric than accuracy when the testing data are imbalanced. The formulas for the F1 score, dice coefficient, and dice loss are shown in (10), (11), and (12).

$$Accuracy = \frac{TP + TN}{FP + FN + TP + TN} \tag{9}$$

$$F1\ Score = \frac{TP}{TP + \frac{1}{2}(FP + FN)} \tag{10}$$

$$DiceCoefficient = \frac{2|A \cap B|}{|A| + |B|} \tag{11}$$

$$DiceLoss = 1 - DiceCoefficient \tag{12}$$

4. RESULT AND DISCUSSION

4.1. Overview of the key findings

The proposed ResNet-DeepLabV3+ with the CA-SGE attention mechanism achieved superior performance for paddy field segmentation, demonstrating significant improvements across all evaluation metrics. The key findings reveal that: 1) the CA-SGE combination provides synergistic benefits with IoU of 0.85, DSC of 0.91, and accuracy of 0.96, representing improvements of 3.7%, 4.6%, and 2.1% respectively over the baseline, 2) CA mechanisms are more effective than SA for paddy field segmentation tasks, and 3) the proposed method exhibits better training stability with reduced overfitting compared to the baseline model.
From the training process of both models, it can be concluded that the proposed ResNet-DeepLabV3+ with CA-SGE is better because it shows more stable performance and a lower potential for overfitting.
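The scores reported above rest on the definitions in (9) to (12). The following sketch works from raw confusion counts and shows that the F1 score in (10) and the dice coefficient in (11) give the same value when A is the set of predicted positives and B the set of actual positives. The function names and the flattened toy labels are assumptions for illustration only.

```python
def confusion_counts(pred, truth):
    """Count TP, TN, FP, FN for two flat sequences of 0/1 labels."""
    tp = tn = fp = fn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1
        elif not p and not t:
            tn += 1
        elif p and not t:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    # Equation (9): correct predictions over all predictions.
    return (tp + tn) / (fp + fn + tp + tn)


def f1_score(tp, fp, fn):
    # Equation (10).
    return tp / (tp + (fp + fn) / 2)


def dice_coefficient(tp, fp, fn):
    # Equation (11) with |A| = TP + FP (predicted positives),
    # |B| = TP + FN (actual positives), and |A ∩ B| = TP.
    return 2 * tp / ((tp + fp) + (tp + fn))


pred = [1, 1, 0, 0, 1, 0, 1, 0]
truth = [1, 0, 0, 0, 1, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(pred, truth)  # 3, 3, 1, 1

acc = accuracy(tp, tn, fp, fn)        # 0.75
f1 = f1_score(tp, fp, fn)             # 0.75
dice = dice_coefficient(tp, fp, fn)   # 0.75, equal to f1
dice_loss = 1 - dice                  # Equation (12): 0.25
print(acc, f1, dice, dice_loss)
```

This sketch does not guard against empty denominators (for example, an image with no positive pixels in either mask); a real evaluation loop would need a convention for that case.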
4.2. Analysis of training performance
The baseline ResNet-DeepLabV3+ model exhibited rapid performance improvement, with training and validation IoU scores reaching approximately 0.9 within the initial 40 epochs. However, subsequent training phases showed gradual improvement accompanied by increasing divergence between the training and validation curves, suggesting that overfitting tendencies were emerging, although the gap remained within acceptable limits.

The proposed ResNet-DeepLabV3+ with CA-SGE attention demonstrated distinctly different training characteristics. The model initially exhibited underfitting behavior, where the validation IoU exceeded the training IoU during the early epochs. This pattern reversed around epoch 30, after which the training IoU began to surpass the validation IoU while gradually improving. Notably, the training-validation gap remained consistently smaller than that of the baseline model throughout the training process.

The reduced training-validation divergence observed in our CA-SGE model indicates superior generalization capability and enhanced training stability. This behavior suggests that the integrated attention mechanisms function as effective regularizers, mitigating overfitting risks while preserving strong model performance. The comparative analysis demonstrates that the proposed architecture achieves more robust training dynamics with improved convergence stability compared to the baseline approach. In summary, the baseline method exhibits a large training-validation gap after epoch 40, indicating overfitting behavior, whereas the proposed CA-SGE method shows a smaller training-validation gap with better stability, effectively reducing overfitting. A summary of the training is shown in Figures 6 and 7.

Figure 6.
ResNet-DeepLabV3+

Figure 7. ResNet-DeepLabV3+ CA-SGE 4FC

4.3. Quantitative evaluation of performance
4.3.1. Baseline method comparison
We conducted a comprehensive comparative evaluation against established segmentation architectures to establish the effectiveness of our proposed approach. The baseline methods include ResNet-UNet, ResNet-FPN, and ResNet-DeepLabV3+, all implemented without attention mechanisms to ensure a fair comparison with our attention-enhanced variant. Table 1 presents the quantitative performance comparison across three standard evaluation metrics: IoU, DSC, and pixel-wise accuracy.

The proposed ResNet-DeepLabV3+ with CA-SGE attention consistently outperformed all baseline methods across every evaluation metric. Specifically, the integration of the CA and SGE attention mechanisms enhanced the baseline ResNet-DeepLabV3+ performance from an IoU of 0.82 to 0.85, a DSC of 0.87 to 0.91, and an accuracy of 0.94 to 0.96. These improvements demonstrate the effectiveness of our hybrid attention approach in capturing both channel-wise feature importance and spatial relationships, which are crucial for accurate paddy field segmentation. The superior IoU and DSC scores indicate enhanced boundary delineation accuracy and improved region overlap precision, which are critical factors for practical agricultural monitoring applications. Consistent performance gains across different architectural baselines validate the proposed attention mechanism's robustness and generalizability.

Table 1.
Comparison of testing segmentation methods

Method                     IoU    DSC    Acc
ResNet-UNet                0.79   0.86   0.94
ResNet-FPN                 0.80   0.87   0.93
ResNet-DeepLabV3+          0.82   0.87   0.94
ResNet-DeepLabV3+ CA-SGE   0.85   0.91   0.96
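The relative improvements quoted for the proposed model (3.7%, 4.6%, and 2.1%) can be reproduced from the ResNet-DeepLabV3+ rows of Table 1. The short check below uses only the tabulated values; the helper name `relative_gain_percent` is chosen for this example.

```python
# Scores copied from Table 1.
baseline = {"IoU": 0.82, "DSC": 0.87, "Acc": 0.94}   # ResNet-DeepLabV3+
proposed = {"IoU": 0.85, "DSC": 0.91, "Acc": 0.96}   # ResNet-DeepLabV3+ CA-SGE


def relative_gain_percent(old: float, new: float) -> float:
    """Relative change from old to new, expressed in percent."""
    return 100.0 * (new - old) / old


gains = {
    metric: round(relative_gain_percent(baseline[metric], proposed[metric]), 1)
    for metric in baseline
}
print(gains)  # {'IoU': 3.7, 'DSC': 4.6, 'Acc': 2.1}
```

These are relative gains over the baseline score, not absolute differences; the absolute differences are 0.03, 0.04, and 0.02.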
4.4. Ablation study
4.4.1. Individual attention mechanism performance
An ablation study was conducted on the attention mechanism. We compared the performance of CA, SA, CBAM, and SGE on the ResNet-DeepLabV3+ model. We also compared the depth of the MLP layers in the CA attention module. Adding CA to ResNet-DeepLabV3+ slightly increases the performance. As shown in Table 2, the addition of CA increased all metrics by 0.01. Increasing the number of MLP layers in CA to 4 did not have a significant impact, only increasing the DSC by 0.01, whereas the IoU and accuracy did not increase.

The ablation results reveal several critical insights into the effectiveness of the attention mechanism in paddy field segmentation. CA demonstrated consistent performance improvements across all evaluation metrics, with IoU increasing from 0.82 to 0.83 when individually integrated. This improvement validates the effectiveness of channel-wise feature recalibration for agricultural segmentation tasks. Conversely, SA showed counterproductive effects, decreasing IoU performance from 0.82 to 0.81 while maintaining similar DSC scores and reducing accuracy. This degradation suggests that SA mechanisms may introduce unwanted noise rather than beneficial spatial focusing for paddy field boundary detection.

The CBAM evaluation, which combines both CA and SA components, yielded performance identical to CA alone across all metrics at an MLP depth of 2 (Table 2). This finding reinforces our observation that SA provides no additional benefit and potentially interferes with CA effectiveness, confirming the limited utility of SA in this specific application domain.

SGE demonstrated a unique performance characteristic, achieving exceptionally high pixel-wise accuracy (0.98) while simultaneously showing decreased performance in the IoU (0.81) and DSC (0.86) metrics.
This pattern indicates that SGE excels at overall pixel classification accuracy but struggles with precise boundary delineation and region overlap precision, which are critical for accurate segmentation. The optimal performance was achieved through our proposed CA-SGE combination with 4-layer MLP depth, demonstrating superior balance across all evaluation metrics. This configuration achieved the highest IoU (0.85) and DSC (0.91) scores while maintaining competitive accuracy (0.96), representing improvements compared to the baseline.

Table 2. Comparison of attention modules

Module    MLP depth   IoU    DSC    Acc
-         -           0.82   0.87   0.94
SA        -           0.81   0.87   0.94
CA        2           0.83   0.88   0.95
CBAM      2           0.83   0.88   0.95
CA+SGE    2           0.83   0.89   0.95
SGE       -           0.81   0.86   0.98
CA        4           0.83   0.89   0.95
CBAM      4           0.82   0.88   0.95
CA+SGE    4           0.85   0.91   0.96

4.4.2. Multilayer perceptron depth analysis
Investigation of MLP depth configurations within the CA module revealed that the 4-layer architecture provides optimal performance without introducing significant computational overhead. Comparison between 2-layer and 4-layer MLP configurations showed that deeper networks yielded marginal improvements (+0.01 DSC) while maintaining computational efficiency. Further exploration of deeper MLP architectures (> 4 layers) demonstrated diminishing returns with negligible performance gains, confirming the appropriateness of our 4-layer architectural choice for practical deployment scenarios.

4.4.3. Data augmentation impact
Table 3 displays the results of testing each segmentation method with and without data augmentation. The results demonstrate substantial and consistent performance improvements across all segmentation architectures when data augmentation is applied. ResNet-FPN showed the most dramatic improvement, with IoU increasing from 0.55 to 0.80, indicating that this architecture particularly benefits from enhanced data diversity.
ResNet-UNet also exhibited significant improvement, with IoU rising from 0.64 to 0.79. The ResNet-DeepLabV3+ baseline showed more modest but meaningful improvements, increasing from 0.76 to 0.82. This smaller relative improvement suggests that the more sophisticated DeepLabV3+ architecture, with its atrous convolutions and multi-scale processing, already possesses some inherent robustness to variations in input data.
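For orientation, two of the photometric augmentations listed in the experimental setup (brightness scaling and gamma correction) might look roughly like the sketch below. It is not the authors' pipeline. It assumes pixel intensities normalized to [0, 1], reads "0% to 10%" as a brightening factor drawn from [1.0, 1.1], and omits the contrast step and the first augmentation because their exact form is not specified here. All function names are hypothetical.

```python
import random


def adjust_brightness(pixels, factor):
    """Scale intensities linearly by factor and clip the result to [0, 1]."""
    return [min(1.0, max(0.0, value * factor)) for value in pixels]


def adjust_gamma(pixels, gamma):
    """Apply gamma correction value -> value ** gamma to intensities in [0, 1]."""
    return [value ** gamma for value in pixels]


def augment(pixels, rng):
    """Apply a random brightness change followed by a random gamma correction."""
    # Assumed reading of the 0% to 10% range: brighten by a factor in [1.0, 1.1].
    factor = 1.0 + rng.uniform(0.0, 0.10)
    # Gamma drawn from [0.8, 1.2], the range stated in the setup.
    gamma = rng.uniform(0.8, 1.2)
    return adjust_gamma(adjust_brightness(pixels, factor), gamma)


rng = random.Random(0)           # fixed seed so the example is repeatable
row = [0.0, 0.25, 0.5, 1.0]      # one toy row of normalized intensities
augmented = augment(row, rng)
print(augmented)
```

Because both operations change only intensities, the segmentation mask would be left untouched; any geometric augmentation would need to be applied to image and mask together.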