International Journal of Electrical and Computer Engineering (IJECE)
Vol. 16, No. 3, June 2026, pp. 1286-1297
ISSN: 2088-8708, DOI: 10.11591/ijece.v16i3.pp1286-1297

Sepsis detection using biomarkers and machine learning

Tuan Anh Vu 1, Dang Hoai Bac 2, Minh Tuan Nguyen 3
1 Center for Development of Information Technology and Communications, Posts and Telecommunications Institute of Technology, Hanoi, Vietnam
2 Posts and Telecommunications Institute of Technology, Hanoi, Vietnam
3 Faculty of Telecommunications 1, Posts and Telecommunications Institute of Technology, Hanoi, Vietnam

Article Info
Article history: Received Jul 19, 2025; Revised Jan 23, 2026; Accepted Mar 16, 2026
Keywords: Biomarker; Deep learning; Immune-related genes; Machine learning; Sepsis detection

ABSTRACT
Sepsis is a life-threatening organ dysfunction caused by an imbalanced host response to infection. In this work, an efficient algorithm is proposed to identify vital biomarkers for sepsis detection using immune-related differentially expressed genes. A total of 16 gene datasets are processed to extract the intersection between the genes of the different datasets and the immune-related gene group, which improves the generalization of the final detection algorithm owing to the diversity of the input data. A novel gene selection method is proposed that combines sequential forward gene selection, machine learning, and genes ranked by their importance as calculated by a random forest model. A subset of 36 potential immune-related genes, identified as the biomarkers from 560 input genes, shows the efficiency of the proposed gene selection algorithm. The performance of the biomarkers is validated using various machine learning and deep learning models related to sepsis diagnosis.
The highest statistical performance is shown by the random forest model using the biomarkers as the input, with an accuracy of 96.83%, sensitivity of 98.86%, specificity of 86.70%, and AUC of 98.67%. The proposed detection algorithm comprises a random forest model and 36 biomarkers, and is simple, effective, and reliable for applications in clinical environments.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Minh Tuan Nguyen
Posts and Telecommunications Institute of Technology
No. 122, Hoang Quoc Viet, Hanoi 10000, Vietnam
Email: nmtuan@ptit.edu.vn
Journal homepage: http://ijece.iaescore.com

1. INTRODUCTION
Sepsis is caused by an imbalanced host response to infection and is also known as life-threatening organ dysfunction [1]. In patients with sepsis, the hyperactive inflammatory response in the early stages results in severe injuries, organ failures, and even septic shock [2]. Despite advances in the treatment of sepsis, mortality due to septic shock remains significant, ranging from 25% to 30% and even higher [3]. Furthermore, sepsis survivors quite frequently suffer long-term physical, psychological, and cognitive impairment [4], for which there are no effective treatments or approved drugs. Hence, intensive care, antibiotics, and hemodynamic stabilization are the main medical treatments, emphasizing the urgent need to identify biomarkers for prompt and accurate sepsis identification, which would significantly improve the clinical decision-making of technicians and experts in practical environments [5].

Immune-related genes (IRGs) play an important role in the response to infection, inflammation, and other immune-related processes of the immune system. In other words, IRGs are considered as biomarkers, which
are the diagnostic and prognostic signatures of various human diseases, such as cancer, exhibiting reliable sensitivity and specificity. Indeed, differential expression analysis of IRGs plays an essential role in biomarker identification for rapid sepsis detection [6]–[8].

Although IRG expression datasets offer valuable insight into diverse biological processes, identification of essential biomarkers in high-dimensional databases is challenging due to redundant and irrelevant genes. Various differential expression analysis techniques have been developed to overcome such obstacles and to improve the accuracy and efficiency of existing methods in selecting informative differentially expressed genes (DEGs) [9], which are primarily responsible for differences between biological states.

Recently, machine learning (ML) and deep learning (DL) approaches have been widely used to identify vital biomarkers for sepsis diagnosis [10]. Indeed, ensemble or multi-algorithm pipelines are proposed in different publications such as [11], which investigates the parameters of 18 ML models to select the optimal models based on the area under the curve (AUC) of the receiver operating characteristic (ROC) curve produced by 10-fold cross-validation (CV). Here, the ROC curve shows the trade-off between the sensitivity and specificity of a classifier over all possible classification thresholds, while the AUC measures the area under this curve. The optimal models are then validated with 72 additional samples, resulting in a highest AUC of 85% for the extreme gradient boosting model.
A large number of ML models are considered in [12] to identify a significant gene subset among immune-related DEGs, in which a weighted gene co-expression network analysis (WGCNA) is performed on the original data to identify sepsis-related genes. A total of 108 DEGs, the overlapping gene subset between the immune-related DEGs and the sepsis-related genes, are put into different ML models, which leads to a selection of 11 biomarkers. Among the ML models used for the estimation of optimal biomarkers, the penalized discriminant analysis model yields the highest AUC of 90.1% on the validation dataset. In [13], differentially expressed mRNAs are identified by the "limma" and "metaMA" packages and are then ranked by mean decrease accuracy values. Here, a forward-wrapper approach combined with different ML models is employed to identify a subset of 15 biomarkers for sepsis detection. The highest validation performance is generated by the random forest (RF) model with an AUC of 87.3% on the validation data.

Lin et al. [14] propose 5-methylcytosine (m5C)-related genes in terms of sepsis, focusing on different immune regulatory mechanisms. As a result, 3 biomarkers are identified from 3 subsets of ranked genes corresponding to the individual ML models, which generate AUC values over 70% on testing and validation data. A similar method is proposed in [15], which identifies 44 DEGs using the "limma" package and then 4 biomarkers using different ML models. Performance analysis shows an AUC of 92% on the testing data evaluated by the CIBERSORT algorithm. In [16], the GEO database, including GSE65682 and GSE95233, is employed to investigate the role of PANoptosis-related genes (PRGs) and their association with the characteristics of the immune system related to sepsis.
Here, the ConsensusClusterPlus algorithm classifies sepsis samples into molecular subtypes to identify the DEGs using the "limma" package with thresholds of |logFC| > 1 and p-value < 0.05. Furthermore, WGCNA in combination with the cluster analysis considers only the sepsis samples of GSE65682 to select the red module genes, which results in an intersection between these genes and the PANoptosis-related DEGs, giving a biomarker subset of 5 genes. In [17], a total of 308 potential genes are identified as the intersection between DEGs and MEturquoise module genes, which are subsequently subjected to 113 combinations of 12 ML algorithms for performance evaluation. The results indicate 22 biomarkers identified by the RF and Elastic Net models, which show the highest AUC of 88.1% among the model combinations.

Although many studies apply ML techniques to sepsis recognition, most of them rely on small gene-expression datasets, which limits the robustness and generalization of the resulting models [11]–[17]. Moreover, the identification of DEGs is often insufficiently addressed in existing approaches, which use fixed DEG-selection procedures. Therefore, informative genes that are essential for accurate diagnosis may be missed. To address these limitations, we use 16 public datasets covering various cell types, platforms, and age groups to ensure high generalization of the proposed prediction model. We further propose a novel algorithm to distinguish sepsis patients from normal people, known as controls, which contains an effective ML model and a subset of biomarkers. Here, the sequential forward gene selection algorithm with 5-fold cross-validation (CV) is employed to identify different potential gene subsets as immune-related DEGs (IRDEGs) using gene importance computed by an ML model.
The immune-related DEGs are then validated on a separate dataset by various ML and DL models to select the final biomarker subset of genes.
The most significant contributions of this work are as follows:
a. Investigation of different IRG frameworks for the extraction of a potential IRG subset which contributes significantly to the diagnosis of sepsis.
b. The use of a novel gene selection algorithm, based on an intelligent method, for the identification of the IRDEGs, which retains a relevant number of remarkable genes for distinguishing between sepsis patients and controls.
c. Proposal of an effective sepsis recognition algorithm based on ML techniques and immune-related biomarkers, which is suitable for deployment in medical facilities.

2. DATA
Table 1 shows the 16 gene expression datasets, which are downloaded from the GEO and BioStudies databases and cover eight platforms, namely Affymetrix Human Gene 2.0 ST Array, Custom Affymetrix Human Transcriptome Array, Affymetrix Human Gene 2.1 ST Array, Affymetrix Human Transcriptome Array 2.0, Agilent-026652 Whole Human Genome Microarray 4x44K v2, Affymetrix Human Genome U133 Plus 2.0, Affymetrix Human Genome U219 Array, and Agilent Human Gene Expression 4x44K v2 Microarray of the BioStudies database. There are 2151 participants in the entire database, including 468 normal people, known as controls, and 1683 sepsis patients. The data are randomly divided at the dataset level, which results in a validation set of GSE26378, GSE26440, GSE57065, GSE95233, and GSE119217, while the remaining datasets are allocated to the training set.

Table 1. Data description
| Order | Dataset | No. genes | Control | Sepsis | Cell type | Age |
|---|---|---|---|---|---|---|
| 1 | GSE119217 | 28376 | 12 | 122 | Peripheral blood | Children |
| 2 | GSE69686 | 20299 | 85 | 64 | Peripheral blood | Post-natal age |
| 3 | GSE69063 | 25512 | 33 | 57 | Peripheral blood | Adult |
| 4 | GSE134347 | 30905 | 83 | 215 | Whole blood | Adult |
| 5 | GSE131761 | 21754 | 15 | 81 | Peripheral blood | Adult |
| 6 | GSE57065 | 23520 | 25 | 82 | Whole blood | Adult |
| 7 | GSE95233 | 23520 | 22 | 102 | Whole blood | Adult |
| 8 | GSE28750 | 23520 | 20 | 10 | Whole blood | Adult |
| 9 | GSE26378 | 23520 | 21 | 82 | Whole blood | Children |
| 10 | GSE8121 | 23520 | 15 | 60 | Whole blood | Children |
| 11 | GSE13904 | 23520 | 18 | 52 | Whole blood | Children |
| 12 | GSE26440 | 23520 | 32 | 98 | Whole blood | Children |
| 13 | GSE9692 | 23520 | 15 | 30 | Whole blood | Children |
| 14 | GSE4067 | 23520 | 15 | 69 | Whole blood | Children |
| 15 | GSE65682 | 19040 | 42 | 479 | Whole blood | Adult |
| 16 | E-MTAB-1548 | 17028 | 15 | 80 | Peripheral blood | Adult |

3. METHOD
Figure 1 shows the proposed method, which includes three steps, namely gene processing, gene selection, and gene estimation. In the first step, the gene databases are compared across the different gene platforms, namely Affymetrix Human Genome U133 Plus 2.0, Affymetrix Human Genome U219 Array, Agilent Human Gene Expression 4x44K v2 Microarray, Affymetrix Human Gene 2.0 ST Array, Custom Affymetrix Human Transcriptome Array, Affymetrix Human Gene 2.1 ST Array, Affymetrix Human Transcriptome Array 2.0, and Agilent-026652 Whole Human Genome Microarray 4x44K v2, for the identification of IRGs, which are then preprocessed by different techniques to improve data quality for further analysis. In the second step, the RF model and a 5-fold CV procedure are applied to calculate the gene importance values by which the IRGs are ranked. A gene ranking-based gene selection algorithm, known as sequential forward gene selection (SFGS), is implemented with 3 ML models and the 5-fold CV method to select 3 IRG combinations, defined as 3 IRDEG subsets.
Finally, different ML and DL models, namely RF, K-nearest neighbors (KNN), logistic regression (LR), and long short-term memory (LSTM), are used to validate the performance of the selected IRDEGs using a 5-fold CV procedure to identify the most informative biomarkers. These models are representative of widely used ML and DL techniques. Furthermore, the LR, RF, and KNN models handle linear relationships, nonlinear interactions, and local similarity patterns, respectively, while LSTM is able to capture complex non-linear relationships in gene
expression data. In the 5-fold CV procedure, the input dataset is divided into 5 folds. One of these folds is used for testing, and the others are used for model training. The CV procedure is completed with 5 repetitions to ensure that every individual fold is used as the testing data.

[Figure 1 blocks: Gene processing (gene databases, extraction of immune-related genes, preprocessing, IRGs, training set, validation set); Gene selection (importance by RF with CV, ranking, SFGS with RF, LR, and KNN, grid search with CV, IRDEGs, optimal models); Gene estimation (RF, LR, KNN, LSTM, CV).]
Figure 1. Method diagram

3.1. Gene processing
The gene processing workflow in this study consists of two stages: preprocessing of raw gene expression data and extraction of IRGs. A total of 16 publicly available gene expression datasets from various microarray platforms are aggregated to ensure broad coverage of heterogeneous patient cohorts and measurement conditions. Detailed descriptions of the preprocessing procedures and IRG extraction steps are provided in the following subsections.

3.1.1. Preprocessing
We consider 16 raw gene expression datasets, which are preprocessed and normalized by the robust multi-array average (RMA) algorithm. Here, gene annotation is performed by mapping probe identifiers to gene symbols, based on the most recent SOFT files or chip description files (CDFs) available from the GEO database. SOFT files are used to process 14 gene datasets, setting the gene expression level to the mean of the probes that map to a common gene, while custom CDFs are adopted for GSE119217 and GSE69063 to ensure accurate gene mapping.
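The probe-to-gene averaging described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `collapse_probes`, the probe identifiers, and the expression values are hypothetical, and the sketch assumes that the probe-to-symbol mapping has already been read from the annotation file.

```python
from collections import defaultdict
from statistics import mean


def collapse_probes(probe_values, probe_to_gene):
    """Average the expression of all probes that map to the same gene symbol.

    probe_values:  {probe_id: [expression value per sample]}
    probe_to_gene: {probe_id: gene_symbol}; probes without a symbol are dropped.
    """
    grouped = defaultdict(list)
    for probe_id, values in probe_values.items():
        gene = probe_to_gene.get(probe_id)
        if gene:  # skip probes that have no annotation
            grouped[gene].append(values)
    # zip(*rows) walks the samples one by one across all probes of a gene
    return {gene: [mean(col) for col in zip(*rows)]
            for gene, rows in grouped.items()}


# Hypothetical probe identifiers and values, for illustration only.
probe_values = {
    "p1": [2.0, 4.0, 6.0],
    "p2": [4.0, 6.0, 8.0],
    "p3": [1.0, 1.0, 1.0],
    "p4": [9.0, 9.0, 9.0],  # unannotated probe, dropped
}
probe_to_gene = {"p1": "IL1R2", "p2": "IL1R2", "p3": "CCR7"}

gene_values = collapse_probes(probe_values, probe_to_gene)
print(gene_values)
```

In this toy input, the two probes annotated as IL1R2 are merged into one row, while the unannotated probe is discarded.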
Finally, the gene data are scaled to the range [0, 1] by Min-Max normalization. It is noteworthy that no batch-effect correction method is applied, in order to ensure the generalization of the model across independent datasets from various platforms.

3.1.2. Immune-related gene extraction
A total of 8 platforms of gene data are used to identify the IRGs from 16 publicly available sepsis databases. Each database is mapped to the IRG reference set to identify the subset of IRGs related to sepsis. There are 770 IRGs collected from [6] using the publicly accessible NanoString database (www.nanostring.com), which are compared with the 16 gene databases used in this work to identify potential IRGs related to sepsis. After filtering the IRGs of the platforms, overlapping genes across the platforms are identified by an intersection-based approach. Thereafter, these intersected genes are used as input for the gene selection step to identify IRDEGs.

3.2. Gene selection
The gene selection stage aims to identify IRDEGs for sepsis detection through a combination of gene ranking and SFGS. We employ a gene ranking-based gene selection method, namely SFGS, which uses gene importance computed by the RF model in combination with different ML models as the fitness functions and a 5-fold CV procedure. The gene selection framework is presented in detail in the following subsections.

3.2.1. Gene ranking
The preprocessed IRGs are put into the RF model to calculate their importance values, which represent the significance of the individual IRGs for the final sepsis detection performance. Specifically, the importance values are defined as scores for all input IRGs computed by a given ML model. Here, all IRGs are ranked by these scores from the highest to the lowest value, where a higher score shows a greater impact of a specific IRG on the ML model used to recognize sepsis.
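As a rough illustration of the ranking step, the sketch below uses scikit-learn's `RandomForestClassifier` and its impurity-based `feature_importances_`, averaged over the training part of each CV fold. These choices are assumptions: the text does not state which importance measure is used or how the RF model and the 5-fold CV are combined, and scikit-learn importances sum to 1, so their scale differs from the values reported in this paper. The data and gene names (`G0` to `G5`) are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold


def rank_genes(X, y, gene_names, n_splits=5, seed=0):
    """Rank genes by RF importance averaged over the training part of each fold."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    importances = np.zeros(X.shape[1])
    for train_idx, _ in cv.split(X, y):
        # Tree count is an arbitrary choice for this sketch
        model = RandomForestClassifier(n_estimators=55, random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        importances += model.feature_importances_
    importances /= n_splits
    order = np.argsort(importances)[::-1]  # highest importance first
    return [(gene_names[i], float(importances[i])) for i in order]


# Synthetic data: only the hypothetical gene "G2" determines the label.
rng = np.random.default_rng(0)
X = rng.random((200, 6))
y = (X[:, 2] > 0.5).astype(int)
gene_names = [f"G{i}" for i in range(6)]

ranking = rank_genes(X, y, gene_names)
print(ranking[0][0])
```

The returned list plays the role of the ranked gene list that the selection step consumes.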
3.2.2. Sequential forward gene selection
A gene selection method, namely SFGS, in combination with 3 ML models (KNN, LR, and RF) and a 5-fold CV procedure is deployed to select 3 optimal gene subsets, also known as 3 subsets of IRDEGs. The preprocessed IRGs are ranked according to their scores as presented in the previous step. The SFGS first selects the gene with the highest score as the input of the 3 ML models to calculate the classification performance for sepsis detection. Then, the two genes with the highest importance values are selected and put into the ML models to estimate their performance. The procedure is repeated until all preprocessed IRGs have been considered for the performance calculation of the ML models. Algorithm 1 shows the SFGS combined with different ML models and the 5-fold CV procedure.

Algorithm 1. Sequential forward gene selection with ML models
1) Sorting IRGs based on the importance values
   G: input set of IRGs; G(1): the IRG with the highest importance value; G(N): the IRG with the lowest importance value; N: number of IRGs.
   The IRGs of set G are sorted in descending order of importance value.
2) Calculating accuracy of ML models using different gene subsets
   Training data: P(i) = {T(:, G(1:i)), y}, where i = 1 ÷ N; y: L × 1 label vector; L: number of samples; T: L × N sample matrix, of which T(:, G(1:i)) keeps the i top-ranked genes.
   a) Start with the entire data and i = 1 gene: P(i) = {T(:, G(1:i)), y};
   b) Repeat
        Separation of P(i) into 5 folds by databases, P(i, k);
        for k = 1 to 5
          Model training on the training folds V(i, t), t ≠ k;
          Accuracy calculation on the testing fold S(i, k);
        end
        Calculation of the mean accuracy of the CV;
        Addition of the gene with the next highest score; i = i + 1;
   c) Until i > N
3) Immune-related differential gene expression selection
   The subsets of IRGs, namely IRDEGs, are selected with the highest accuracies of the corresponding models.
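Algorithm 1 can be sketched in Python as follows. This is an illustrative reading under stated assumptions, not the authors' code: scikit-learn's `GroupKFold` stands in for the separation into 5 folds by databases, a KNN classifier with K = 5 is used as the only fitness model, ties are resolved in favor of the smallest subset, and the data, the group labels, and the ranked index list are synthetic.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def sfgs(X, y, groups, ranked_idx, model, n_splits=5):
    """Evaluate the top-1, top-2, ..., top-N ranked genes and keep the best prefix."""
    cv = GroupKFold(n_splits=n_splits)  # a source dataset never spans two folds
    history = []
    for i in range(1, len(ranked_idx) + 1):
        cols = ranked_idx[:i]  # the i highest-ranked genes
        scores = cross_val_score(model, X[:, cols], y, groups=groups,
                                 cv=cv, scoring="accuracy")
        history.append(scores.mean())
    best = int(np.argmax(history))  # first prefix that reaches the top accuracy
    return ranked_idx[: best + 1], history


# Synthetic data: column 0 separates the classes, the other columns are noise.
rng = np.random.default_rng(0)
y = np.tile([0, 1], 100)
X = rng.random((200, 5))
X[:, 0] = y + rng.normal(0.0, 0.05, size=200)
groups = np.repeat(np.arange(10), 20)  # 10 hypothetical source datasets
ranked_idx = [0, 1, 2, 3, 4]           # assumed output of the ranking step

selected, history = sfgs(X, y, groups, ranked_idx,
                         KNeighborsClassifier(n_neighbors=5))
print(len(selected), [round(a, 3) for a in history])
```

Running the same function once per fitness model (KNN, LR, and RF) would give one selected subset per model, which mirrors how 3 IRDEG subsets arise from a single ranked list.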
In addition, a grid search-based optimization method is used to identify the optimal learning and structure parameters of the models and to address the overfitting problem. Indeed, the most important learning parameters investigated for the RF model are the number of trees, in [25, 55, 75, 95], and the number of leaves, in [15, 25, 35, 55], while K in [5, 8, 11, 14, 17, 20, 23, 26, 29] is considered for the KNN model. The learning parameters of the LSTM model are the optimizer in [Adam, SGD, RMSprop], batch size in [16, 32, 64], learning rate in [0.005, 0.01, 0.02], L2 regularization in [0.8, 0.9, 0.95], and number of epochs in [40, 60, 80]. Here, 5 structures of the LSTM model are employed, in which the first structure includes an LSTM layer and a batch normalization layer. The second is a combination of 2 first structures, while the third contains the first and the second structures, etc. As a result, there are 1, 16 (4 × 4), 9, and 1215 (3 × 3 × 3 × 3 × 3 × 5) configurations of the LR, RF, KNN, and LSTM models, respectively, which are implemented to identify the optimal models corresponding to the individual subsets of IRGs.

3.3. Gene estimation
We use 3 ML models and a DL model, namely RF [18], LR [19], KNN [20], and LSTM [21], to validate the entire set of input IRGs (AIRG) and the 3 subsets of IRDEGs selected by the SFGS algorithm on the validation set using a 5-fold CV procedure. Here, 5 folds are generated for the validation set, in which each fold corresponds to a complete dataset. Then, 4 folds are used for model training and one fold is used for testing. The CV procedure is repeated 5 times to ensure that every individual dataset is used as the testing gene data. The mean validation performance of the models and their standard deviations are calculated for further analysis and comparison with previous studies.

4. SIMULATION RESULTS
4.1. Performance measurement
We use accuracy (Ac), sensitivity (Se), specificity (Sp), Matthews correlation coefficient (MCC), and area under the curve (AUC) to estimate the performance of the different ML and DL models in this study. Ac shows
the rate of participants who are correctly predicted. Se and Sp present the proportions of correctly detected sepsis patients and control people, respectively. The MCC measures the correlation between the predicted and actual classes of patients and controls. Furthermore, the AUC evaluates the ability of the ML and DL models to distinguish sepsis patients from control people.

$$\mathrm{Ac} = \frac{TP + TN}{TP + FP + TN + FN} \tag{1}$$

$$\mathrm{Sp} = \frac{TN}{TN + FP} \tag{2}$$

$$\mathrm{Se} = \frac{TP}{TP + FN} \tag{3}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{4}$$

where TN, TP, FN, and FP are the numbers of true negatives, true positives, false negatives, and false positives, respectively.

4.2. Gene processing
4.2.1. Preprocessing
The preprocessing stage begins by applying the RMA method to all 16 gene expression datasets to perform background correction, normalization, and probe-level summarization. Following RMA, gene annotation is carried out using the corresponding SOFT files and CDFs to accurately map probe identifiers to standardized gene symbols across different platforms. As a result of this procedure, the processed datasets contain between 17028 and 30905 genes, as detailed in Table 1, which are scaled to the range [0, 1] by Min-Max normalization.

4.2.2. Immune-related gene extraction
The 16 gene expression datasets are filtered for IRGs based on a set of 770 IRGs. As a result, there are 760, 696, 742, 755, 751, 740, and 737 extracted IRGs from GSE119217, GSE69686, GSE69063, GSE134347, GSE131761, GSE65682, and E-MTAB-1548, respectively. Furthermore, the remaining gene datasets, namely GSE57065, GSE95233, GSE28750, GSE26378, GSE8121, GSE13904, GSE26440, GSE9692, and GSE4067, each yield 737 IRGs.
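The per-dataset filtering against the reference IRG set, followed by the intersection across datasets, can be sketched with plain Python sets. The function name `extract_common_irgs`, the dataset names, and the toy gene lists are hypothetical; real inputs would be the gene symbols of each preprocessed dataset and the 770 reference IRGs.

```python
from functools import reduce


def extract_common_irgs(dataset_genes, reference_irgs):
    """Filter each dataset to the reference IRGs, then intersect across datasets."""
    per_dataset = {name: set(genes) & reference_irgs
                   for name, genes in dataset_genes.items()}
    common = reduce(set.intersection, per_dataset.values())
    return per_dataset, sorted(common)


# Toy inputs; the gene lists per dataset are invented for illustration.
reference_irgs = {"IL1R2", "S100A12", "CCR7", "IL6ST", "ABCB1"}
dataset_genes = {
    "dataset_A": ["IL1R2", "S100A12", "CCR7", "ACTB"],
    "dataset_B": ["IL1R2", "CCR7", "IL6ST", "GAPDH"],
    "dataset_C": ["CCR7", "IL1R2", "ABCB1"],
}

per_dataset, common = extract_common_irgs(dataset_genes, reference_irgs)
print({name: len(genes) for name, genes in per_dataset.items()}, common)
```

The per-dataset counts correspond to the numbers of extracted IRGs listed above, and the intersection is the gene set that every dataset can supply.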
We consider a subset of 560 IRGs, which is the intersection across the 16 datasets, for further analysis to ensure that the proposed algorithm includes the most common characteristics of all input gene databases related to sepsis.

Table 2. Genes ranked by importance value
| Ord | Gene | Imp | Ord | Gene | Imp | Ord | Gene | Imp | Ord | Gene | Imp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | IL1R2 | 2.99 | 13 | GATA3 | 0.44 | 25 | ITGA4 | 0.34 | 37 | CEACAM8 | 0.28 |
| 2 | S100A12 | 1.80 | 14 | MAGEB2 | 0.42 | 26 | IFIT1 | 0.33 | 38 | KLRD1 | 0.27 |
| 3 | CCR7 | 1.58 | 15 | CD3E | 0.42 | 27 | CD274 | 0.32 | 39 | AMMECR1L | 0.26 |
| 4 | IL6ST | 0.74 | 16 | ARG1 | 0.42 | 28 | GZMA | 0.32 | 40 | PYCARD | 0.26 |
| 5 | ABCB1 | 0.66 | 17 | CCR9 | 0.39 | 29 | CR1 | 0.32 | 41 | CD80 | 0.25 |
| 6 | FCER1A | 0.64 | 18 | LRRN3 | 0.39 | 30 | BATF | 0.30 | 42 | ST6GAL1 | 0.25 |
| 7 | FCER1G | 0.62 | 19 | GNLY | 0.38 | 31 | LTB | 0.30 | 43 | TXK | 0.25 |
| 8 | C1QA | 0.62 | 20 | COLEC12 | 0.37 | 32 | CR2 | 0.30 | 44 | CD63 | 0.25 |
| 9 | C3AR1 | 0.61 | 21 | CD3D | 0.37 | 33 | HLA-DQA1 | 0.29 | 45 | C5 | 0.25 |
| 10 | CCL28 | 0.60 | 22 | BCL2 | 0.36 | 34 | KLRG1 | 0.29 | 46 | SSX1 | 0.24 |
| 11 | BST2 | 0.55 | 23 | KLRF1 | 0.35 | 35 | DUSP6 | 0.28 | | Others | < 0.24 |
| 12 | CFD | 0.46 | 24 | STAT3 | 0.35 | 36 | IL18R1 | 0.28 | | | |

Imp: importance value; Ord: order

4.3. Gene selection
4.3.1. Gene ranking
A total of 560 IRGs are evaluated and ranked according to their importance values, which are computed using the RF model as shown in Table 2. These importance scores represent the contribution of each IRG to the overall classification performance, allowing us to identify the genes that are most influential in distinguishing sepsis samples from non-sepsis samples. We present only the first 46 IRGs with the highest importance values due to the large number of IRGs investigated in this work.
4.3.2. Sequential forward gene selection
We employ 3 ML models, namely LR, KNN, and RF, as the fitness functions of the SFGS algorithm in combination with a 5-fold CV procedure to identify optimal IRG subsets. During the selection process, SFGS iteratively adds genes from the ranked list and evaluates each candidate subset using the 5-fold CV procedure to measure its classification performance. The optimal subsets, termed IRDEG1, IRDEG2, and IRDEG3, contain the first 31, 36, and 46 genes, respectively, corresponding to the highest average accuracy achieved by each ML model. These selected gene sets are summarized in Table 2, and their performance is illustrated in Figure 2.

Figure 2. Average accuracy of 5-fold CV for the individual immune gene subsets

Table 3. The highest validation performance of various models using 3 IRDEGs and AIRG on the validation set
| Model | DEG | Ac (%) | Se (%) | Sp (%) | MCC (%) | AUC (%) |
|---|---|---|---|---|---|---|
| RF | IRDEG1 | 94.71 ± 5.68 | 95.71 ± 6.08 | 90.65 ± 10.57 | 83.08 ± 20.29 | 97.60 ± 4.41 |
| RF | IRDEG2 | 96.83 ± 1.39 | 98.86 ± 2.03 | 86.70 ± 11.85 | 84.97 ± 13.56 | 98.67 ± 2.54 |
| RF | IRDEG3 | 95.60 ± 3.73 | 97.46 ± 3.18 | 85.56 ± 15.33 | 83.02 ± 19.28 | 97.48 ± 5.16 |
| RF | AIRG | 96.03 ± 5.03 | 98.07 ± 3.11 | 81.32 ± 27.81 | 80.69 ± 31.36 | 97.17 ± 6.02 |
| KNN | IRDEG1 | 95.94 ± 3.49 | 96.09 ± 2.77 | 93.42 ± 10.50 | 85.02 ± 17.48 | 97.68 ± 4.31 |
| KNN | IRDEG2 | 94.89 ± 3.87 | 94.95 ± 3.21 | 91.75 ± 14.17 | 81.82 ± 19.89 | 96.70 ± 6.60 |
| KNN | IRDEG3 | 94.67 ± 7.56 | 94.52 ± 7.23 | 94.37 ± 10.91 | 83.48 ± 25.33 | 97.17 ± 6.07 |
| KNN | AIRG | 94.59 ± 5.38 | 95.16 ± 3.45 | 85.71 ± 29.35 | 77.42 ± 31.53 | 93.27 ± 13.28 |
| LR | IRDEG1 | 75.25 ± 28.44 | 73.61 ± 40.22 | 81.54 ± 24.71 | 55.37 ± 31.52 | 80.26 ± 16.41 |
| LR | IRDEG2 | 71.68 ± 27.75 | 68.96 ± 39.40 | 82.58 ± 28.76 | 50.22 ± 30.67 | 79.41 ± 16.28 |
| LR | IRDEG3 | 65.77 ± 32.95 | 65.01 ± 46.52 | 72.09 ± 39.39 | 49.26 ± 22.63 | 72.54 ± 19.61 |
| LR | AIRG | 49.54 ± 26.84 | 40.34 ± 37.52 | 72.13 ± 40.50 | 47.81 ± 24.89 | 55.61 ± 9.40 |
| LSTM | IRDEG1 | 94.52 ± 4.70 | 95.71 ± 4.51 | 87.88 ± 10.43 | 80.69 ± 19.56 | 97.39 ± 4.30 |
| LSTM | IRDEG2 | 89.33 ± 7.76 | 92.28 ± 8.56 | 76.67 ± 38.16 | 65.37 ± 29.17 | 93.44 ± 8.60 |
| LSTM | IRDEG3 | 88.97 ± 13.43 | 90.79 ± 16.21 | 85.37 ± 18.06 | 74.02 ± 24.09 | 94.37 ± 7.48 |
| LSTM | AIRG | 91.67 ± 5.57 | 92.86 ± 7.74 | 84.29 ± 17.06 | 74.66 ± 19.82 | 97.86 ± 3.96 |

4.4. Gene estimation
The optimal parameters of the RF model are 55 trees and 25 leaves, while that of KNN is K = 17. Moreover, the structure of the optimal LSTM model consists of 3 sequential layers, in which an LSTM layer is followed by a batch normalization layer and a dropout layer for training stabilization and overfitting reduction, respectively. The extracted temporal representations are then passed through two layers, namely a fully connected layer and a softmax output layer, for binary classification. The optimal LSTM model uses the Adam optimizer with a batch size of 32, a learning rate of 0.01, and an L2 regularization coefficient of 0.9. These optimal models are then used for the performance validation of the 3 IRDEG subsets on the validation set using the 5-fold CV procedure. The mean performance of the different ML and DL models, namely RF, LR, KNN, and LSTM, is given in Table 3.
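To make the "mean ± standard deviation" entries of Table 3 concrete, the sketch below computes Ac, Se, Sp, and MCC from confusion-matrix counts according to equations (1) to (4) and then summarizes them over five folds, one per held-out dataset. The fold counts are made up, the function names are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator is reported.

```python
from math import sqrt
from statistics import mean, stdev


def confusion_metrics(tp, tn, fp, fn):
    """Ac, Se, Sp, and MCC as fractions (Table 3 reports them as percentages)."""
    ac = (tp + tn) / (tp + fp + tn + fn)
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Ac": ac, "Se": se, "Sp": sp, "MCC": mcc}


def summarize(fold_counts):
    """Mean and sample standard deviation of each metric over the CV folds."""
    per_fold = [confusion_metrics(**counts) for counts in fold_counts]
    return {name: (mean(f[name] for f in per_fold),
                   stdev(f[name] for f in per_fold))
            for name in per_fold[0]}


# Made-up counts for 5 held-out datasets (one fold per dataset).
fold_counts = [
    {"tp": 8, "tn": 3, "fp": 1, "fn": 2},
    {"tp": 9, "tn": 4, "fp": 0, "fn": 1},
    {"tp": 10, "tn": 2, "fp": 2, "fn": 0},
    {"tp": 7, "tn": 5, "fp": 0, "fn": 3},
    {"tp": 9, "tn": 3, "fp": 1, "fn": 1},
]

summary = summarize(fold_counts)
for name, (avg, sd) in summary.items():
    print(f"{name}: {100 * avg:.2f} ± {100 * sd:.2f}")
```

Because each fold is a whole dataset with its own size and class balance, the standard deviation across folds can be large even when the mean is high, as seen for Sp and MCC in Table 3.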
The RF model with IRDEG2 produces the highest average Ac of 96.83%, Se of 98.86%, and AUC of 98.67%, together with an Sp of 86.70% and an MCC of 84.97%, and is selected as the proposed algorithm to classify sepsis.

5. DISCUSSION
Sepsis is a dangerous disease for human health, which has received intense attention from medical experts, technicians, and researchers. Existing studies consider different gene databases to develop an effective method for sepsis detection. However, the number of gene databases is frequently small, resulting in unreliability and low performance of the proposed methods, and making practical application in clinical environments difficult [14]–[17]. A potential solution to enhance the detection performance of the algorithms proposed in previous works is the use of IRG databases, since IRGs are involved in immune regulation, response, and the proper functioning of the immune system to protect the human body from harmful substances, germs, and cell changes. Hence, we investigate a large number of 16 gene databases to produce better detection performance of the final algorithm in this work. The use of massive gene databases helps to avoid overfitting problems, improve the final classification performance, and increase the reliability of the proposed method. Additionally, the use of common IRGs from 16 gene databases improves the generalization of the proposed method in terms of sepsis recognition.

Another significant characteristic is the gene selection. Most existing studies adopt conventional methods such as log(fold-change) and p-value to identify the DEGs [16], [17]. Indeed, the log(fold-change) parameter represents the degree of change in gene expression, in which the up- and down-regulation of genes correspond to log(fold-change) values higher and lower than zero, respectively.
Moreover, the statistical method uses a threshold of 0.05, for which a p-value smaller than the threshold indicates a statistically significant change in expression. The combination of the above parameters results in an effective method for DEG identification. However, the large number of DEGs produced by the conventional method, such as 6361, 1230, and 405 [14], [16], [17], poses an obstacle for the further step of biomarker identification among the selected DEGs. Therefore, a gene ranking-based gene selection method, namely SFGS, is applied in this work to select the potential IRDEGs from the input gene set of 560 IRGs. Here, the gene importance values, which are computed by an RF model, are used to rank the 560 IRGs. A total of 560 IRG combinations, with the number of IRGs ranging from 1 to 560, are evaluated by 3 ML models, namely RF, LR, and KNN. Consequently, there are 3 subsets including 31, 36, and 46 IRGs selected by the SFGS in combination with the 3 ML models and the 5-fold CV procedure on the training set. The number of genes in these subsets is smaller than those of [14], [16], [17], which makes it easier to identify a subset of biomarkers.

We implement ML and DL models for comparison with existing publications based on different performance metrics and for the proposal of an effective sepsis diagnosis algorithm. Indeed, the AUC and MCC performance parameters are widely used for the evaluation of the proposed methods in previous works [13], [15], [17]. The AUC metric emphasizes the ability of the models to distinguish between sepsis and control groups, while the overall prediction is measured by the accuracy parameter. The high classification performance of the final algorithm for sepsis is one of the most important elements for those who develop novel methods related to sepsis recognition.
Therefore, the use of several metrics for the performance estimation of the sepsis detection algorithm plays an essential role. In this work, 5 parameters are employed to evaluate the performance of the various models in terms of sepsis classification, which provides a more reliable estimate of the proposed algorithm's ability to diagnose sepsis. Moreover, grid search combined with the 5-fold CV procedure is deployed to identify the optimal learning and structure parameters, which yields the best model with relatively high sepsis detection performance while limiting fundamental problems such as overfitting.

The average performance of the ML and DL models on the validation set is given in Table 3. The RF and KNN models achieve high performance for sepsis diagnosis, with mean Ac and AUC over 94% and 93%, respectively, while the LR model shows the lowest performance, with Ac and AUC below 75% and 80%. The highest performance, with mean Ac of 96.83%, Se of 98.86%, Sp of 86.70%, MCC of 84.97%, and AUC of 98.67%, is obtained by the RF model, which is selected as the final algorithm for sepsis detection. Here, the high sensitivity of the proposed model implies an accurate diagnosis of sepsis cases, which are then checked by clinical experts who make the final decision on delivering effective treatment. In the clinical context, emphasizing sensitivity is essential for early detection, as timely intervention can significantly reduce the risk of severe complications and mortality in patients with sepsis. It is noteworthy that examination
of experts is applied to people incorrectly identified by the proposed model, who are then given no medical treatment.

A comparison of the proposed algorithm with existing publications is presented in Table 4, which shows that the proposed algorithm outperforms existing studies. Hence, the proposed algorithm is effective for sepsis detection applications in practical facilities and hospitals.

Table 4. Comparisons of the proposed algorithm with previous studies

| Ref. | Method | Data | Ac (%) | AUC (%) | MCC (%) | Pros | Cons |
|---|---|---|---|---|---|---|---|
| [17], 2025 | Gene selection and classification using RF and Elastic Net | 4 datasets (359 samples); separated training and testing | NA | 88.1 | NA | 113 model combinations, enabling robust, thorough performance evaluation | Small dataset; non-optimized model; only using ML; only AUC |
| [13], 2022 | Gene selection using RF ranking and forward-wrapper; classification using RF | 5 datasets (958 samples); 4 datasets for training and testing; 1 dataset for validation | NA | 87.3 | 71.3 | Robust wrapper-based gene selection; independent dataset validation | Small dataset; non-optimized model; only using ML |
| [15], 2023 | Gene selection by intersecting LASSO, SVM-RF, and RF; classification using CIBERSORT | 3 datasets (253 samples); 2 datasets for training; 1 dataset for validation | NA | 92 | NA | Integration of multiple gene selection methods | Small dataset; non-optimized model; only using ML; model evaluation using only AUC |
| Ours | Gene importance-based gene ranking; gene selection using SFGS and ML models; classification using ML, DL | 16 datasets (2151 samples); 11 datasets for training; 5 datasets for validation; 5-fold CV | 96.83 | 98.67 | 84.97 | Multiple datasets to improve generalizability; grid search and CV to optimize model; SFGS and ML models for IRDEG selection; high classification performance | High number of selected biomarkers; limited exploration of DL models |

Existing clinical
tools for assessing sepsis, such as SOFA, qSOFA, and procalcitonin, provide important diagnostic guidance but still have several limitations. The SOFA score evaluates organ dysfunction across six physiological systems for the diagnosis of sepsis, and is more reactive than predictive [22]. Consequently, sepsis is often identified only after significant organ damage has occurred [23], [24]. Similarly, qSOFA was developed and validated in populations with suspected sepsis, making it less suitable as an early screening tool [25]. Procalcitonin shows low sensitivity and specificity in differentiating sepsis from other causes of systemic inflammatory responses, which emphasizes the need for more reliable molecular markers [26]. In contrast, our proposed method utilizes a subset of 36 IRDEGs to detect sepsis promptly and with high performance, such as a sensitivity of 98.86% and an AUC of 98.67%. These findings suggest that our model provides an alternative diagnostic approach with improved accuracy and timeliness in comparison with the existing clinical tools.

The first limitation of this work is the large number of 36 biomarkers, which increases the time, complexity, and cost of gene expression measurement in real-world applications. Secondly, the datasets used in this study exhibit class imbalance, which may introduce bias into model learning and inflate sensitivity while reducing specificity, potentially affecting the generalization of the predictive model. The omission of external validation and the restriction of the analysis to 3 ML models and one DL model are further limitations. The exploration of different DL models may offer a better chance of finding a model with better sepsis recognition performance, which will be considered in future research.

6. CONCLUSION

Sepsis is a serious medical condition that represents the body's uncontrolled response to infection, leading to organ failure and high mortality. Millions of cases are reported each year, which places a significant burden on healthcare systems worldwide. Prompt and accurate recognition is essential to improve patient outcomes. In this work, we propose an algorithm for sepsis prediction with high generalization
based on the use of multiple cases and diverse age groups in the input gene datasets, which are collected from different medical platforms. The proposed algorithm is developed with the RF model and a subset of 36 biomarkers chosen from the input IRGs. A gene ranking-based gene selection method, the SFGS algorithm, which uses different ML models as the fitness function together with the 5-fold CV procedure, is deployed to select the optimal subset of IRGs, known as IRDEGs, whose sepsis detection performance is then validated by the ML and DL models. The relatively high performance confirms the effectiveness of the SFGS using ML techniques in comparison with the conventional method using log(Fold-change) and p-value for the identification of DEGs. The RF model achieves the highest average sepsis recognition performance among the ML and DL models, with Ac of 96.83%, Se of 98.86%, Sp of 86.70%, MCC of 84.97%, and AUC of 98.67%, which shows the successful use of an ML model and biomarkers for sepsis diagnosis. We thus propose a simple but efficient method that achieves better processing of massive gene data and a high level of gene data separation for sepsis detection. As a result, we suggest that the proposed algorithm could be deployed in clinical environments and hospitals. However, the first limitation of this work is that the biomarker set includes 36 genes, which may increase the practical complexity of clinical implementation. Moreover, the imbalanced datasets, the lack of external validation, and the small number of models used for method development (4 ML and DL models) are additional limitations, which will be addressed in future research.

FUNDING INFORMATION

Authors state no funding involved.
AUTHOR CONTRIBUTIONS STATEMENT

This journal uses the Contributor Roles Taxonomy (CRediT) to recognize individual author contributions, reduce authorship disputes, and facilitate collaboration.

Name of author | C | M | So | Va | Fo | I | R | D | O | E | Vi | Su | P | Fu
Tuan Anh Vu
Dang Hoai Bac
Minh Tuan Nguyen

C: Conceptualization
M: Methodology
So: Software
Va: Validation
Fo: Formal Analysis
I: Investigation
R: Resources
D: Data Curation
O: Writing - Original Draft
E: Writing - Review & Editing
Vi: Visualization
Su: Supervision
P: Project Administration
Fu: Funding Acquisition

CONFLICT OF INTEREST STATEMENT

Authors state no conflict of interest.

DATA AVAILABILITY

The supporting data of this study are openly available at https://www.ncbi.nlm.nih.gov/geo/ and https://www.ebi.ac.uk/biostudies/arrayexpress.

REFERENCES

[1] S. Lin et al., “Multiple datasets to explore the molecular mechanism of sepsis,” BMC Genomic Data, vol. 23, pp. 1–13, 2022, doi: 10.1186/s12863-022-01078-2.
[2] L.-W. Duan et al., “Effects of viral infection and microbial diversity on patients with sepsis: A retrospective study based on metagenomic next-generation sequencing,” World Journal of Emergency Medicine, vol. 12, pp. 29–35, 2021, doi: 10.5847/wjem.j.1920-8642.2021.01.005.
[3] L. La Via et al., “The global burden of sepsis and septic shock,” Epidemiologia, vol. 5, pp. 456–478, 2024, doi: 10.3390/epidemiologia5030032.