Indonesian Journal of Electrical Engineering and Computer Science
Vol. 41, No. 1, January 2026, pp. 153–167
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v41.i1.pp153-167

YOLOv8m enhancement using α-scaled gradient-normalized sigmoid activation for intelligent vehicle classification

Renz Raniel V. Serrano 1, Jen Aldwayne B. Delmo 1, Cristina Amor M. Rosales 2
1 Department of Electrical Engineering, College of Engineering, Batangas State University–The National Engineering University, Batangas City, Philippines
2 Department of Civil Engineering, College of Engineering, Batangas State University–The National Engineering University, Batangas City, Philippines

Article Info
Article history:
Received Oct 19, 2025
Revised Dec 5, 2025
Accepted Dec 14, 2025

Keywords:
GeLU
LeakyReLU
Mish
Sigmoid linear unit
Swish

ABSTRACT
Vehicle classification plays a vital part in the development of intelligent transportation systems (ITS) and modern traffic management, where the ability to detect and identify vehicles accurately in real time is essential for maintaining road efficiency and safety. This paper presents an enhancement to the YOLOv8m model by refining its activation function to achieve higher accuracy and faster response in diverse traffic and environmental situations. In this study, two alternative activation functions, Mish and Swish, were integrated into the YOLOv8m structure and tested against the model's default sigmoid linear unit (SiLU). Training and evaluation were carried out using a comprehensive dataset of vehicles captured under different lighting and weather conditions. The experimental findings show that the modified activation design leads to better model convergence, improved generalization, and a noticeable boost in detection performance, recording up to 5.4% higher accuracy and 6.6% better mAP scores than the standard YOLOv8m. Overall, the results confirm that fine-tuning activation behavior can make deep learning models more adaptive and reliable for vehicle classification tasks in real-world intelligent transportation environments.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Renz Raniel V. Serrano
Department of Electrical Engineering, Batangas State University–The National Engineering University
Batangas City, Philippines
Email: renzraniel.serrano@g.batstate-u.edu.ph

1. INTRODUCTION
The rapid development of intelligent transportation systems (ITS) has become one of the defining features of modern smart cities. As urban populations continue to expand, the ability to effectively monitor and manage traffic flows has become crucial in ensuring road safety, reducing congestion, and improving urban mobility. One of the core technologies supporting ITS is vehicle classification, which involves identifying and grouping vehicles based on their physical and visual characteristics. Accurate classification plays an important role in applications such as autonomous driving, real-time traffic monitoring, and toll collection, where reliability and timely detection are critical [1], [2].

Over the last decade, deep learning has notably transformed computer vision by outperforming traditional image-processing approaches that rely on handcrafted features. Earlier models such as R-CNN and SSD
provided robust detection results but suffered from high computational complexity and long inference times [3]. The you only look once (YOLO) family of algorithms addressed these limitations by combining feature extraction and classification into a single-stage detection pipeline, enabling real-time processing on embedded systems [4], [5].

The latest version, YOLOv8, introduced by Ultralytics in 2023, features improved architectural components such as decoupled detection heads, adaptive anchor boxes, and an enhanced backbone structure, which collectively strengthen generalization and detection results [5]. Among its model variants, YOLOv8m offers an optimal trade-off between computational efficiency and precision, making it particularly suitable for deployment in real-time vehicle classification systems [6]. However, despite these architectural refinements, YOLOv8's performance remains highly dependent on the activation function, a fundamental mechanism that influences nonlinear transformation and gradient propagation during network learning [7], [8].

Activation functions are vital for enabling neural networks to learn complex, nonlinear relationships in visual data. Conventional functions such as the rectified linear unit (ReLU) and LeakyReLU are widely used due to their computational simplicity, yet they often suffer from problems such as neuron saturation and vanishing gradients, which reduce convergence stability [9]. In contrast, modern activation functions such as Swish, Mish, and the Gaussian error linear unit (GELU) introduce smoother gradient transitions and self-regularization, allowing the network to achieve better representation learning and generalization [10]–[12]. Empirical studies demonstrate that these newer functions can strengthen image classification and object detection performance by improving convergence speed and robustness across varying data conditions [13], [14].

Despite these advancements, limited research has examined how activation function adjustment influences YOLOv8-based models, particularly for ITS applications where environmental conditions such as lighting, occlusion, and traffic density vary greatly [15]. These dynamic conditions present significant challenges to real-time detection and classification. Optimizing activation functions has also been found to reduce oscillations during training, prevent gradient vanishing, and strengthen overall model reliability, especially for edge-based implementations in traffic environments [16], [17].

Motivated by these challenges, this study explores the adjustment of the YOLOv8m activation function to strengthen vehicle classification performance and model generalization in diverse conditions. The research focuses on integrating the Mish and Swish functions into the YOLOv8m framework and further introduces a gradient-normalized sigmoid (GNSig) activation that employs α-scaling and bias correction to refine training stability. The enhanced model is tested using a custom vehicle dataset gathered from various traffic scenarios in Batangas City, Philippines, encompassing different weather and illumination conditions. The models were evaluated using accuracy, mean average precision (mAP), and inference speed to examine performance improvements relative to the baseline YOLOv8m.
This research aims to contribute both theoretically and practically: theoretically, by deepening understanding of how activation-level modifications affect learning dynamics in deep detection architectures; and practically, by providing an adaptable and efficient framework for real-time vehicle classification in intelligent transportation environments. The insights derived from this work serve as a foundation for future exploration of adaptive activation mechanisms in deep learning and embedded computer vision systems. To address the identified gaps in the literature and to advance the state-of-the-art in intelligent-transportation object detection using YOLO-based architectures, this work contributes the following:
- We propose a novel α-scaled GNSig activation function for the YOLOv8m model, an activation variant not previously explored in YOLO-family detectors.
- We conduct the first systematic evaluation of activation-function replacements (rather than architectural modifications) within YOLOv8m targeted at intelligent transportation applications under real-world Philippine traffic conditions.
- We isolate the effect of activation-function substitution by retaining the base network architecture unchanged, enabling clear attribution of performance gains to the activation alone.
- The proposed GNSig combines gradient normalization, α-scaling, and bias correction to enhance convergence stability and reduce oscillations, features absent in conventional activations like SiLU, Swish, or Mish.
- We demonstrate improved performance not only on the target in-domain ITS dataset but also in a cross-domain pothole detection scenario, evidencing better generalization and robustness.
2. METHOD
2.1. Conceptual framework
The conceptual framework of this study shows the logical flow and interconnection among the key components involved in developing the modified YOLOv8m model for vehicle classification. As shown in Figure 1, the framework follows a systematic pipeline composed of five major stages: dataset acquisition, preprocessing and augmentation, image annotation, model training and activation function adjustment, and model evaluation. Each stage contributes to the enhancement of model performance, efficiency, and generalization within the context of ITS.

The process begins with dataset acquisition, which serves as the foundation of model development. Real-world traffic videos were captured under varying conditions, including different illumination levels, weather types, and vehicle densities, to simulate complex environments typically encountered in urban road networks. This step ensures dataset diversity, a key factor in achieving high generalization performance [1], [2].

Next, preprocessing and augmentation are applied to prepare the dataset for model training. Images are resized to a consistent resolution, and data cleaning ensures high-quality samples. Augmentation techniques such as horizontal flipping, random cropping, brightness adjustment, and mosaic composition are employed to expose the model to diverse visual contexts, reducing overfitting and improving robustness [13], [14].

The image annotation process is done using the Roboflow platform, where each vehicle instance is labeled with bounding boxes and category identifiers. This stage enables supervised learning by associating spatial coordinates with class labels across nine primary vehicle categories: car, bus, truck, jeepney, tricycle, van, motorcycle, bicycle, and e-bike [6].

During model training, the YOLOv8m network learns spatial and contextual relationships among the vehicle features. Hyperparameters such as learning rate, batch size, and momentum are tuned while monitoring loss metrics including box loss, classification loss, and distribution focal loss [4], [5]. The activation function plays a central role in model adjustment. This study replaces the default sigmoid linear unit (SiLU) activation in YOLOv8m with alternative configurations such as Mish, Swish, and a newly proposed GNSig with α-scaling and bias adjustment to strengthen learning stability and convergence [7]–[12].

Finally, model evaluation involves assessing the performance of both baseline and modified models on unseen validation and test datasets. Metrics such as accuracy, mAP@50, mAP@50–95, inference speed, and validation loss are computed. A cross-dataset test on a pothole detection dataset is also done to examine the modified model's adaptability and transfer learning capability [15], [16].

Overall, the conceptual framework underscores how every component, from data collection to algorithmic adjustment, contributes to developing a robust, adaptive, and efficient vehicle classification system for ITS. Through activation function adjustment, the framework improves detection precision, convergence behavior, and real-time performance under dynamic environmental conditions [17].

Figure 1. Conceptual framework of the modified YOLOv8m model for vehicle classification
2.2. Data collection and preprocessing
The dataset used in this study was developed to reflect the complexity and variability of real-world traffic conditions in an urban setting. Traffic videos were collected using a DJI Osmo Pocket 2 camera placed along several sections of four-lane roads in Batangas City, Philippines. This data acquisition setup closely parallels the methodological approach pursued by Delmo [18], whose earlier work on YOLOv8-based vehicle speed estimation provided both a structural reference and valuable insight into the capture of dynamic multi-lane traffic environments.
The current study extends that foundation through a focus on vehicle classification and the integration of activation-level modifications to enhance detection performance.

Multiple days and different environmental conditions were sampled to ensure a high degree of representativeness. Illumination varies from bright daylight to overcast conditions, as well as light and moderate rainfall; likewise, natural atmospheric effects were captured. Regarding traffic conditions, the dataset contains scenes with traffic densities ranging from free-flowing to heavy congestion, reflecting typical fluctuations observed on urban roadways.

All videos were recorded in 1080p HD resolution, and the footage was then segmented into single frames at a sampling rate of one frame every three seconds. The processed dataset consists of 4,157 images covering various vehicle types across a wide range of orientations, scales, and visibility conditions. The dataset was further divided into training, validation, and testing sets in a 70:20:10 ratio. This stratification ensured that the training subset captured enough variation to facilitate informative learning, while the validation subset supported hyperparameter tuning and helped to avoid overfitting. In turn, the test subset provided an independent benchmark for assessing model generalization. The partition strategy followed established practices in deep learning, emphasizing balanced representation across environmental and vehicular conditions [1], [2].

Once the raw frames were prepared, a preprocessing pipeline normalized the input characteristics and improved the quality of the training data. Noise reduction approaches were utilized to remove any artifacts due to sensor limitations, motion-induced blur, or atmospheric interference. All images were then resized to 640 × 640 pixels to meet the architectural requirements of YOLOv8m and to unify image shapes for improved computational efficiency during training. Pixel intensities were normalized to the range [0, 1], a preprocessing step linked to good gradient stability and faster convergence during optimization.

To further strengthen model robustness, an extensive augmentation process was implemented using both Roboflow preprocessing tools and the built-in augmentation module of YOLOv8. Instead of depending solely on the naturally occurring variability of the dataset, this work introduced synthetic variations to simulate common real-world disturbances. These included geometric transformations (flipping, cropping, rotation, and scaling), which helped the model learn differences in camera angles, vehicle orientations, and spatial composition. Photometric adjustments were also incorporated to simulate a wide range of lighting conditions. For example, brightness and exposure variations allowed the model to deal with glare, shadow transitions, and low-light conditions typical of early morning or late afternoon traffic. Of particular importance was mosaic augmentation, where four images are merged into a single image, increasing scene complexity and exposing the model to a variety of object interactions within one frame [19], [20].
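To make the sampling and resizing steps concrete, the following is a minimal sketch of the frame-extraction stage using OpenCV. The video filename and output directory are hypothetical, and the exact tooling used by the authors is not stated; only the one-frame-per-three-seconds rate and the 640 × 640 target size come from the text.

```python
import cv2                      # pip install opencv-python
from pathlib import Path

VIDEO = "traffic_clip.mp4"      # hypothetical input file (1080p footage)
OUT = Path("frames")
OUT.mkdir(exist_ok=True)

cap = cv2.VideoCapture(VIDEO)
fps = cap.get(cv2.CAP_PROP_FPS)        # frame rate of the recording
step = max(1, int(round(fps * 3)))     # one frame every three seconds
idx = saved = 0
while True:
    ok, frame = cap.read()
    if not ok:                         # end of video
        break
    if idx % step == 0:
        # Resize to the 640 x 640 input expected by YOLOv8m; the [0, 1]
        # pixel normalization described above happens later, inside the
        # training pipeline.
        frame = cv2.resize(frame, (640, 640))
        cv2.imwrite(str(OUT / f"frame_{saved:05d}.jpg"), frame)
        saved += 1
    idx += 1
cap.release()
print(f"Extracted {saved} frames")
```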
These data collection and preprocessing procedures together ensured that the resulting dataset was diverse and representative of the various challenges commonly found in intelligent transportation environments. By integrating real-world variability with synthetically enhanced augmentation, this study established a training foundation that supports stable network convergence and reduced overfitting, enhancing the ability of the modified YOLOv8m model to perform robustly under complex traffic conditions.

2.3. Image annotation
Following the data preprocessing phase, all images were subjected to a detailed annotation process to accurately label and localize vehicle objects within each frame. Annotation was done using the Roboflow platform, a web-based system designed for efficient object detection dataset preparation. Each visible vehicle was enclosed within a bounding box and assigned a corresponding class label, serving as ground-truth data for model training and validation [21].

A total of nine vehicle categories were identified and annotated: car, truck, bus, van, motorcycle, tricycle, jeepney, bicycle, and e-bike. These categories represent the most common vehicles observed on Philippine roadways, ensuring the dataset's contextual relevance to local ITS. Each image could contain multiple vehicle types, mirroring the congestion and mixed traffic patterns found in real environments. This multi-class labeling scheme allowed the YOLOv8m model to learn vehicle differentiation and scale variation, essential for real-time classification under dynamic traffic scenes [18].

Figure 2 shows the image annotation workflow, illustrating the step-by-step process from frame extraction to label export. The pipeline begins with the uploading of image frames to the Roboflow platform, where annotation projects are created and versioned. Annotators then perform bounding box labeling and assign the appropriate vehicle category. Once annotation is completed, a quality control and verification stage follows, where a secondary reviewer cross-checks the labels to identify and correct errors such as overlapping boxes or misclassified objects. Finally, all verified annotations are exported in YOLOv8-compatible format, containing normalized coordinates and class indices. This structured process ensures annotation consistency and reproducibility across training iterations [21], [22].
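For reference, each exported YOLO-format label file contains one line per object: the class index followed by the normalized box center and size. The small sketch below parses such a line; the class-index ordering is an assumption, since the actual mapping depends on how the Roboflow project was configured.

```python
# Assumed class ordering; the real indices depend on the Roboflow export.
CLASSES = ["car", "truck", "bus", "van", "motorcycle",
           "tricycle", "jeepney", "bicycle", "e-bike"]

def parse_yolo_label(line: str):
    """Parse '<class_id> <x_center> <y_center> <width> <height>',
    where all four coordinates are normalized to [0, 1]."""
    class_id, x, y, w, h = line.split()
    return CLASSES[int(class_id)], float(x), float(y), float(w), float(h)

print(parse_yolo_label("0 0.512 0.634 0.210 0.145"))
# -> ('car', 0.512, 0.634, 0.21, 0.145)
```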
Figure 2. Image annotation process using Roboflow

To validate the accuracy and quality of the labeling process, several sample annotated images were reviewed. A total of 13,081 images were selected for the 9 vehicle categories most commonly seen on Philippine roads: car, motorcycle, tricycle, jeepney, truck, van, e-bike, bicycle, and bus. Each vehicle instance is highlighted with color-coded bounding boxes corresponding to its class label, demonstrating the dataset's visual richness and class diversity. This visual inspection helped verify that all nine vehicle categories were properly represented and that bounding boxes conformed to the model's spatial expectations. Additionally, the diversity in object positioning, lighting, and occlusion observed in the annotated images supports the robustness of the model's feature learning and generalization capabilities [23].

This rigorous annotation and validation process notably enhances model performance by reducing label noise and ensuring that every class is uniformly represented across different environmental contexts. As prior studies confirm, datasets with high annotation precision and consistent labeling structure contribute directly to improved mean average precision (mAP) in object detection systems [21], [24].

2.4. Data augmentation
To strengthen the YOLOv8m model's generalization capability, a comprehensive data augmentation process was implemented to increase the diversity and realism of the training dataset. This process artificially expands the available data by introducing controlled visual variations, allowing the model to recognize vehicles under different environmental and spatial conditions. Such augmentations are particularly essential for ITS applications, where lighting, traffic density, and camera viewpoints change continuously [19], [25].

The augmentation pipeline, implemented using the Roboflow preprocessing system and YOLOv8's built-in augmentation module, simulated various real-world scenarios. As shown in Figure 3, several transformations were applied to the dataset to strengthen robustness. Horizontal flipping was used to represent vehicles moving in opposite directions, while random rotation and scaling allowed the model to adapt to diverse camera angles and distances. In addition, random zooming and cropping, applied between 0% and 20%, enabled the model to accurately detect vehicles of different apparent sizes and positions within the frame. This technique is especially useful for simulating vehicles that suddenly appear closer or farther away, a common occurrence in dynamic traffic scenes. Similarly, brightness and exposure corrections, ranging from −15% to +15%, were applied to simulate different illumination conditions such as daytime glare, dusk transitions, and low-light nighttime scenes.
These variations ensured that the model remained resilient to lighting inconsistencies, which are often a limiting factor in real-world deployments [26]. Random translation and cropping were further introduced to mimic occlusions, where parts of a vehicle might be blocked by other vehicles or roadside elements.

Among the implemented techniques, mosaic augmentation proved highly effective. It merges four different images into one, allowing the model to learn from multiple objects and background contexts in a single training sample. This not only increases class diversity but also enhances spatial awareness and reduces overfitting [20]. Additionally, HSV color-space adjustments were utilized to generate subtle variations in hue, saturation, and value, reflecting the impact of environmental lighting and camera sensor differences on visual perception [26].
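As a concrete illustration, the transformations above map naturally onto YOLOv8's built-in augmentation hyperparameters. The sketch below is an approximation under stated assumptions: only the ±15% brightness/exposure range and the 0-20% zoom/crop range come from the paper, so the remaining values are illustrative defaults rather than the authors' settings.

```python
# Hedged sketch of YOLOv8 augmentation hyperparameters approximating the
# transformations described in this section. Values marked "assumed" are
# illustrative; only the 0-20% zoom and +/-15% brightness ranges are stated.
AUGMENT = dict(
    fliplr=0.5,        # horizontal flip: vehicles moving either direction (assumed probability)
    degrees=10.0,      # random rotation range in degrees (assumed)
    scale=0.20,        # random zoom/crop up to 20%, per the stated 0-20% range
    translate=0.10,    # random translation to mimic partial occlusion (assumed)
    hsv_h=0.015,       # hue jitter (assumed)
    hsv_s=0.7,         # saturation jitter (assumed)
    hsv_v=0.15,        # value/brightness jitter, per the stated +/-15% range
    mosaic=1.0,        # mosaic augmentation: merge four images per sample
)
# These keyword arguments can be passed to Ultralytics' model.train(),
# e.g. model.train(data="vehicles.yaml", **AUGMENT), as sketched in section 2.6.
```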
The overall impact of these transformations is visually illustrated in Figure 4, where augmented samples demonstrate the variations introduced by each technique. The combination of geometric, photometric, and compositional augmentations contributed notably to model robustness, enabling YOLOv8m to maintain detection performance even in visually challenging environments. By systematically applying these augmentations, the model reached more stable training behavior, better convergence, and reduced overfitting, ultimately leading to improved real-time performance across diverse traffic conditions.

Figure 3. Data augmentation techniques applied to vehicle images

2.5. Activation function
Activation functions play a vital role in enabling deep networks to learn complex, non-linear mappings. They affect gradient flow, convergence speed, and feature extraction across convolutional layers. The baseline YOLOv8m network originally used SiLU, which was compared with two modern functions, Mish and Swish, and a newly proposed α-scaled GNSig. These modifications were introduced to strengthen learning stability and classification performance in real-time vehicle detection.

The default SiLU (Swish-1) activation is defined as

f(x) = x \cdot \sigma(x),

where \sigma denotes the sigmoid function. SiLU offers smooth and continuous gradient propagation, which prevents the abrupt activations seen in ReLU-based functions. However, empirical results from earlier studies indicate that SiLU may underperform when dealing with rapid illumination changes or high intra-class variance, as its output tends to saturate for extreme negative inputs [8], [14].

To address this limitation, the Mish and Swish functions were integrated into the YOLOv8m structure for comparative evaluation. The Mish function, expressed as

f(x) = x \cdot \tanh(\mathrm{softplus}(x)),

introduces a self-regularizing property through its smooth non-monotonic curve. This feature enables deeper layers to capture subtle visual cues such as vehicle contours, edges, and reflections without destabilizing gradient updates. Prior studies have demonstrated that Mish can outperform SiLU in tasks involving fine-grained feature learning due to its stronger gradient flow and adaptive representation capabilities [8], [10]. The Swish function, on the other hand, defined as

f(x) = x \cdot \sigma(\beta x),

introduces a learnable parameter \beta that adjusts the slope dynamically. This adaptability allows Swish to maintain gradient sensitivity even in low-activation regions, resulting in smoother convergence during training and improved overall performance in visual recognition tasks [11], [27].

Building upon these principles, this research also explores a custom α-scaled GNSig activation, designed to balance gradient flow and avoid neuron saturation. The GNSig function modifies the classical sigmoid by introducing two additional parameters, an α-scaling factor and a bias correction term b, and is formulated as

f(x) = \alpha \cdot \frac{1}{1 + e^{-x}} + b.

The α term amplifies the activation response to mid-range input signals, while the bias correction shifts the activation threshold, improving sensitivity to subtle feature variations.
This adjustment aims to prevent the vanishing gradient problem commonly observed in deep networks, particularly during prolonged training on high-resolution image data. The design was inspired by the findings of Xu and Wang [28], who emphasized that scaled-sigmoid activations strengthen both gradient consistency and training stability across diverse learning tasks.
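Since the paper specifies only the functional form f(x) = α · σ(x) + b, the following PyTorch sketch is one plausible realization: α and b are modeled as learnable scalars initialized to 1 and 0. The exact parameterization and the gradient-normalization mechanism are not fully detailed in the text, so those choices are assumptions.

```python
import torch
import torch.nn as nn

class GNSig(nn.Module):
    """Sketch of the alpha-scaled gradient-normalized sigmoid (GNSig):
    f(x) = alpha * sigmoid(x) + b. Alpha and b are modeled here as
    learnable scalars (an assumption; the paper does not publish the
    exact parameterization or the normalization mechanism)."""

    def __init__(self, alpha: float = 1.0, bias: float = 0.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha))  # scales mid-range response
        self.bias = nn.Parameter(torch.tensor(bias))    # shifts the activation threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.alpha * torch.sigmoid(x) + self.bias
```

Initializing α near 1 and b near 0 keeps the early-training response close to the standard sigmoid, so any divergence from the baseline comes from learned scaling rather than the initialization itself.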
During implementation, the YOLOv8m model architecture was kept structurally identical across all experiments, with the activation function as the sole modified component. This ensured a fair comparative analysis of how activation dynamics affect detection performance. The modified functions were integrated into the C2f blocks and detection heads within the YOLOv8m architecture, maintaining consistent training conditions, including batch size, learning rate, and optimizer settings. The comparative performance was evaluated through metrics such as mean average precision (mAP@50–95), convergence rate, and validation loss reduction.

Empirical observations revealed that both the Mish and Swish activations provided smoother loss curves and higher mAP compared to the default SiLU, confirming their effectiveness in capturing complex vehicle features under varied lighting and occlusion conditions. Additionally, the proposed GNSig activation exhibited the most stable training behavior, minimizing oscillations in validation loss and demonstrating improved adaptability to unseen traffic scenes. These outcomes affirm that proper activation tuning can notably strengthen model convergence, feature richness, and generalization capability, contributing to more reliable real-time vehicle classification.

2.6. Model training and evaluation
The modified YOLOv8m models were trained and evaluated under controlled experimental conditions to ensure a consistent and fair comparison among the four activation functions: SiLU, Mish, Swish, and the proposed α-scaled GNSig. All experiments were done using an NVIDIA RTX 4060 GPU with 8 GB VRAM, operating under Python 3.10 and PyTorch 2.2 within the Ultralytics YOLOv8 framework. Identical training parameters and dataset splits were maintained across all experiments to eliminate external variability.

The dataset was divided into 70% for training, 20% for validation, and 10% for testing. Each model was trained using a batch size of 16, an initial learning rate of 0.001, and a momentum coefficient of 0.937. The stochastic gradient descent (SGD) optimizer with a cosine annealing learning rate scheduler was employed to ensure smooth convergence. Training was done for 100 epochs, and early stopping was implemented to automatically terminate the process when no significant improvement in validation loss was observed for 15 consecutive epochs [29].

The YOLOv8m architecture was selected due to its balance between detection performance and computational efficiency, which makes it suitable for real-time vehicle classification tasks. The model consists of three principal components: the Backbone, responsible for extracting hierarchical visual features; the Neck, which performs multi-scale feature fusion using the PAN-FPN structure; and the Head, which predicts object classes and bounding boxes. The activation function modifications (SiLU, Mish, Swish, and GNSig) were integrated into the C2f convolutional blocks and the detection head layers while preserving all other architectural and training parameters. This configuration ensured that any performance differences observed could be attributed primarily to the effects of the activation functions [8], [27].
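A hedged sketch of how the substitution and training run could be reproduced with the Ultralytics API follows. The recursive module replacement is a generic PyTorch pattern rather than the authors' published code, the GNSig class refers to the sketch in section 2.5, and the dataset configuration file name is hypothetical; the training arguments mirror the hyperparameters listed above.

```python
import torch.nn as nn
from ultralytics import YOLO
# Assumes the GNSig class from the section 2.5 sketch is in scope.

def replace_activation(module: nn.Module, act_cls) -> None:
    """Recursively swap every SiLU in the network for act_cls,
    leaving the rest of the architecture untouched."""
    for name, child in module.named_children():
        if isinstance(child, nn.SiLU):
            setattr(module, name, act_cls())
        else:
            replace_activation(child, act_cls)

model = YOLO("yolov8m.pt")                 # pretrained YOLOv8m weights
replace_activation(model.model, GNSig)     # activation is the sole change

model.train(
    data="vehicles.yaml",  # hypothetical dataset config (70/20/10 split)
    epochs=100,
    batch=16,
    imgsz=640,
    optimizer="SGD",
    lr0=0.001,             # initial learning rate
    momentum=0.937,
    cos_lr=True,           # cosine annealing schedule
    patience=15,           # early stopping after 15 stagnant epochs
)
```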
The mAP is dened as t he mean of the A v erage Precision (AP) v alues across all classes: mAP = 1 N N X i =1 A i Where denotes the A v erage Precision f o r class and N represents the total number of v ehicle cate gories. A higher mAP sho ws superior helps both localization and classication results [30]. Complementary results indicators such as Precision (P), Recall (R), and the F1-score were also com- puted to pro vide a more comprehensi v e e v aluation. The F1-score, representing the harmonic mean of precision and recall, is dened as: F 1 = 2 P R P + R These metrics collecti v ely e xamine the reliability of the model, ensuring that impro v ements in mAP do not come at the e xpense of increased f alse detections [30], [31]. Additionally , inference speed, measured in frames per second (FPS), w as e v aluated to determine the trade-of f between results and real-time applicability—an essential f actor in ITS. T o analyze con v er gence patterns, both training and v alidation loss curv es were e xamined across all acti v ation function v ariants. Models utilizing Mish and Swish demonstrated smoother con v er gence and higher Y OLOv8m enhancement using α -scaled gr adient-normalized sigmoid activation ... (Renz Raniel V . Serr ano) Evaluation Warning : The document was created with Spire.PDF for Python.
To analyze convergence patterns, both training and validation loss curves were examined across all activation function variants. Models utilizing Mish and Swish demonstrated smoother convergence and higher mAP values compared to the baseline SiLU, suggesting that their smoother non-linearities enable better gradient flow and improved representation learning. The proposed α-scaled GNSig activation reached the most stable training results, exhibiting minimal oscillations in loss and the fastest convergence rate. These results confirm that proper activation function tuning can notably strengthen model robustness, training efficiency, and classification reliability, contributing to the advancement of real-time computer vision systems for traffic analysis and vehicle classification.

3. RESULTS AND DISCUSSION
This section presents the comprehensive results and comparative analysis of the modified YOLOv8m models using various activation functions: SiLU, Mish, GELU, LeakyReLU, and the proposed α-scaled GNSig. The evaluation covers detection performance, convergence stability, inference speed, and real-world deployment results. The results indicate that the modified model, equipped with the proposed GNSig activation, reached the highest performance improvements across all metrics. Each figure and table in this section shows the effects of activation function choice and architectural modifications on YOLOv8m's overall behavior.

3.1. Quantitative evaluation
Table 1 shows the comparative results of the YOLOv8m models using different activation functions, including the baseline SiLU, Swish, Mish, and the proposed α-scaled GNSig. The GNSig variant reached the highest overall performance, obtaining an mAP@50–95 of 86.7%, surpassing Mish (85.9%) and Swish (84.8%), while maintaining a real-time inference rate of 94 FPS. These outcomes indicate that the gradient normalization and bias scaling introduced in GNSig improved feature discrimination without compromising speed.

Table 1. Comparative results of the YOLOv8m models using different activation functions
Activation function          Precision (%)  Recall (%)  F1-score  mAP@50  mAP@50–95  Inference speed (FPS)
SiLU (baseline)              83.2           81.7        0.82      90.5    82.4       97.3
Swish                        85.6           83.4        0.84      91.9    84.8       95.1
Mish                         86.8           85.1        0.86      93.4    85.9       92.8
α-scaled GNSig (proposed)    88.2           86.9        0.87      94.6    86.7       94.0

When compared with existing YOLO-based ITS research, the achieved improvements are notably higher. Prior enhancement studies such as Li et al. [15] and Al-Kaf et al. [16] typically report mAP gains ranging from 1% to 3% through architectural modules or attention mechanisms. In contrast, the proposed GNSig activation alone produced up to a 6.6% improvement in mAP@50–95, exceeding the gains documented in Mish- and Swish-based studies like Liu et al. [12] and Gao et al. [14]. This demonstrates that activation-level optimization, without additional architectural changes, can yield performance improvements greater than those achieved through heavier model modifications.

The results confirm that adaptive activation scaling enhances gradient consistency and model generalization. While Mish and Swish provided smoother learning curves than SiLU, GNSig's stability in maintaining the precision-recall balance makes it the most reliable activation for real-time intelligent transportation systems.

3.2. Comparative model behavior across activations
The series of visual comparisons shows the detection behavior of YOLOv8m under different activation configurations.
The baseline SiLU model, as seen in Figure 4(a), reached an accuracy of 0.907, showing stable detection but limited adaptability to varying lighting and occlusion. The Mish variant (Figure 4(b)) yielded an accuracy of 0.863, revealing better feature extraction at edges but slightly slower training due to increased computational load. The GELU-based model (Figure 4(c)) reached an accuracy of 0.854 with smoother activation gradients but displayed weaker sensitivity to low-contrast objects such as small or partially hidden vehicles. LeakyReLU (Figure 4(d)) offered early-stage gradient stability with an accuracy of 0.863, yet plateaued in later epochs, indicating reduced learning flexibility for overlapping objects. Finally, the proposed GNSig model (Figure 4(e)) reached the highest accuracy at 0.961, demonstrating superior convergence behavior and feature sensitivity due to its gradient normalization and α-bias correction mechanism.

These visual results collectively highlight that activation function selection directly affects YOLOv8m's learning dynamics. GNSig's smoother and more controlled gradients led to improved feature retention and reduced overfitting compared to the other tested functions.
These qualitative differences align with reports in prior activation-function literature. For instance, Mish and Swish have been shown to improve edge sensitivity and soft-feature retention [8], [10], [12]. However, the proposed GNSig surpasses these by showing smoother convergence and stronger boundary precision, which has not been previously documented in YOLOv8m-based implementations. The higher sensitivity to occluded and low-contrast vehicles confirms the superior gradient consistency offered by GNSig relative to traditional activations used in YOLO systems, as shown in Figure 5.

Figure 4. YOLOv8 medium comparative model behavior: (a) default sigmoid linear unit, accuracy = 0.907; (b) Mish activation, accuracy = 0.863; (c) GELU activation, accuracy = 0.854; (d) LeakyReLU activation, accuracy = 0.863; and (e) gradient normalization, α, and bias adjustments, accuracy = 0.961

Figure 5. Modified YOLOv8 medium (gradient normalization, α, and bias adjustments, accuracy = 0.961)
3.3. Domain transfer and generalization performance
To examine the robustness of the modified model, an additional evaluation was done using a pothole detection dataset (Figure 6). The baseline YOLOv8m with SiLU reached an accuracy of 0.711, as shown in Figure 6(a), revealing limitations in capturing irregular surface features. In contrast, the modified GNSig model reached an accuracy of 0.759, as shown in Figure 6(b), representing a 4.8% improvement. The GNSig variant effectively captured subtle textural variations and fine structural edges of potholes, confirming its enhanced generalization capability beyond vehicle datasets.

Figure 6. YOLOv8 medium on the pothole dataset: (a) default YOLOv8 medium (accuracy = 0.711) and (b) modified YOLOv8 medium (accuracy = 0.759)

This domain transfer experiment confirms the versatility of the proposed activation scheme. It shows that the improved gradient flow not only enhances in-domain classification performance but also contributes to stability and adaptability in heterogeneous visual domains. Compared with previous cross-domain YOLO studies, which typically observe performance drops when transferring from traffic datasets to road-surface datasets [15], the proposed GNSig activation maintained strong generalization. The 4.8% improvement over the baseline YOLOv8m outperforms the 2-3% generalization improvements reported in related transfer-learning studies, suggesting that gradient-normalized activations enhance feature abstraction beyond vehicle-specific training, as shown in Table 2.

Table 2. Comparative results on the pothole detection dataset
Model                 Activation function            Accuracy  Remarks
YOLOv8m (default)     Sigmoid linear unit            0.711     Slower convergence
YOLOv8m (modified)    Gradient-normalized sigmoid    0.759     Improved feature discrimination

3.4. Convergence, precision-recall, and validation analysis
The convergence behavior of all models was examined using precision-recall curves and validation loss tracking. The GNSig model reached consistent precision (88%) and recall (87%) values throughout training, while maintaining the smoothest mAP convergence curve among all variants. Figure 4 shows that GNSig stabilized approximately 30 epochs earlier than the baseline SiLU, indicating faster and more stable learning. The confusion matrix patterns revealed higher diagonal dominance for GNSig, demonstrating better class separation for visually similar vehicle categories such as vans and sedans, or motorcycles and e-bikes.

These findings emphasize that the proposed activation not only enhances numerical performance metrics but also improves model stability, as shown in Table 3. The lower validation loss and early convergence confirm efficient gradient propagation, reducing oscillations and preventing overfitting during prolonged training.

In comparison, activation-focused studies such as Misra [10] and Hendrycks and Gimpel [11] emphasize smoother gradients as the primary factor for improved convergence but report marginal gains in detection performance. The GNSig activation integrates gradient normalization and α-scaling, producing larger reductions in validation loss (−30.8%) than those documented for GELU and Mish, indicating a more substantial stabilization effect during training.
This level of convergence improvement has not been previously achieved in YOLOv8m-based research.