International Journal of Electrical and Computer Engineering (IJECE)
Vol. 16, No. 3, June 2026, pp. 1286~1297
ISSN: 2088-8708, DOI: 10.11591/ijece.v16i3.pp1286-1297

Sepsis detection using biomarkers and machine learning

Tuan Anh Vu1, Dang Hoai Bac2, Minh Tuan Nguyen3
1 Center for Development of Information Technology and Communications, Posts and Telecommunications Institute of Technology, Hanoi, Vietnam
2 Posts and Telecommunications Institute of Technology, Hanoi, Vietnam
3 Faculty of Telecommunications 1, Posts and Telecommunications Institute of Technology, Hanoi, Vietnam

Article Info
Article history:
Received Jul 19, 2025
Revised Jan 23, 2026
Accepted Mar 16, 2026

Keywords:
Biomarker
Deep learning
Immune-related genes
Machine learning
Sepsis detection

ABSTRACT
Sepsis, a life-threatening organ dysfunction, is caused by an imbalanced host response to infection. In this work, an efficient algorithm is proposed to identify vital biomarkers for sepsis detection using immune-related differential expression genes. A total of 16 gene datasets are processed to extract the gene intersection between the different gene datasets and the immune-related gene group, which improves the generalization of the final detection algorithm due to the diversity of the input data. A novel gene selection method is developed using sequential forward gene selection, machine learning, and genes ranked by their importance as calculated by a random forest model. A subset of 36 potential immune-related genes, identified as biomarkers from 560 input genes, shows the efficiency of the proposed gene selection algorithm. The biomarkers are validated using various machine learning and deep learning models for sepsis diagnosis. The highest performance is obtained by the random forest model using the biomarkers as input, with an accuracy of 96.83%, sensitivity of 98.86%, specificity of 86.70%, and AUC of 98.67%. The proposed detection algorithm, comprising a random forest model and 36 biomarkers, is simple, effective, and reliable for applications in clinical environments.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Minh Tuan Nguyen
Posts and Telecommunications Institute of Technology
No. 122, Hoang Quoc Viet, Hanoi 10000, Vietnam
Email: nmtuan@ptit.edu.vn

1. INTRODUCTION
Sepsis is caused by an imbalanced host response to infection and is also known as life-threatening organ dysfunction [1]. In patients with sepsis, the hyperactive inflammatory response in the early stages results in severe injuries, organ failures, and even septic shock [2]. Despite advances in the treatment of sepsis, mortality due to septic shock remains significant, ranging from 25% to 30% and even higher [3]. Furthermore, sepsis survivors quite frequently suffer long-term physical, psychological, and cognitive impairment [4], for which there are no effective treatments or approved drugs. Hence, intensive care, antibiotics, and hemodynamic stabilization are the main medical treatments, emphasizing the urgent need to identify biomarkers for prompt and accurate sepsis identification, which would significantly improve the clinical decision-making of technicians and experts in practical environments [5].

Immune-related genes (IRGs) play an important role in the response to infection, inflammation, and other immune-related processes of the immune system. In other words, IRGs are considered as biomarkers, which are the diagnostic and prognostic signatures of various human diseases, such as cancer, exhibiting reliable sensitivity and specificity. Indeed, differential expression analysis of IRGs plays an essential role in biomarker identification related to rapid sepsis detection [6]–[8].

Journal homepage: http://ijece.iaescore.com

Although IRG expression datasets offer valuable insight into diverse biological processes, identification of essential biomarkers among high-dimensional databases is challenging due to redundant and irrelevant genes. Various differential expression analysis techniques have been developed to overcome such obstacles and to improve the accuracy and efficiency of existing methods with respect to the selection of informative differential expression genes (DEGs) [9], which are primarily responsible for differences between biological states.

Recently, machine learning (ML) and deep learning (DL) approaches have been widely used to identify vital biomarkers for sepsis diagnosis [10]. Indeed, ensemble or multi-algorithm pipelines are proposed in different publications such as [11], which investigates the parameters of 18 ML models to select the optimal models based on the area under the curve (AUC) of the receiver operating characteristic (ROC) curve produced by 10-fold cross validation (CV). Here, the ROC curve shows the trade-off between sensitivity and specificity of a classifier over all possible classification thresholds, while the AUC measures the area under this curve. The optimal models are then validated with 72 additional samples, resulting in a highest AUC of 85% for the extreme gradient boosting model. A large number of ML models are considered in [12] to identify a significant gene subset among immune-related DEGs, in which a weighted gene co-expression network analysis (WGCNA) is performed on the original data to identify sepsis-related genes. A total of 108 DEGs, the overlapping gene subset between the immune-related DEGs and the sepsis-related genes, are put into different ML models, which leads to a selection of 11 biomarkers. Among the ML models used for the estimation of optimal biomarkers, the penalized discriminant analysis model yields the highest AUC of 90.1% on the validation dataset.

In [13], differentially expressed mRNAs are identified by the "limma" and "metaMA" packages and are then ranked by mean decrease accuracy values. Here, a forward-wrapper approach combined with different ML models is employed to identify a subset of 15 biomarkers for sepsis detection. The largest validation performance is generated by the random forest (RF) model with an AUC of 87.3% on the validation data. Lin et al. [14] propose the 5-methylcytosine (m5C)-related genes in terms of sepsis, focusing on different immune regulatory mechanisms. As a result, 3 biomarkers are identified from 3 subsets of ranked genes corresponding to the individual ML models, which generate AUC values over 70% on testing and validation data. A similar method is proposed in [15], which identifies 44 DEGs using the "limma" package and then 4 biomarkers using different ML models. Performance analysis shows an AUC of 92% on the testing data evaluated by the CIBERSORT algorithm.

In [16], the GEO datasets GSE65682 and GSE95233 are employed to investigate the role of PANoptosis-related genes (PRGs) and their association with immune system characteristics related to sepsis. Here, the ConsensusClusterPlus algorithm classifies sepsis samples into molecular subtypes to identify the DEGs using the "limma" package with thresholds of $|\log FC| > 1$ and $p\text{-value} < 0.05$. Furthermore, WGCNA in combination with the cluster analysis considers only the sepsis samples of GSE65682 to select the red module genes, and the intersection between these genes and the PANoptosis-related DEGs gives a biomarker subset of 5 genes. In [17], a total of 308 potential genes are identified as the intersection between DEGs and MEturquoise module genes, which are subsequently subjected to 113 combinations of 12 ML algorithms for performance evaluation. The results indicate 22 biomarkers identified by the RF and Elastic Net models, which show the highest AUC of 88.1% among the model combinations.

Although many studies apply ML techniques to sepsis recognition, most of them rely on small gene-expression datasets, which limits the robustness and generalization of the resulting models [11]–[17]. Moreover, the identification of DEGs is often insufficiently addressed in existing approaches, which use fixed DEG-selection procedures. Therefore, informative genes that are essential for accurate diagnosis may be missed.

To address these limitations, we use 16 public datasets covering various cell types, platforms, and age groups to ensure high generalization of the proposed prediction model. We further propose a novel algorithm to distinguish sepsis patients from normal people, known as controls, which comprises an effective ML model and a subset of biomarkers. Here, the sequential forward gene selection algorithm with 5-fold cross-validation (CV) is employed to identify different potential gene subsets as immune-related DEGs (IRDEGs) using gene importance computed by an ML model. The IRDEGs are then validated on a separate dataset by various ML and DL models to select the final biomarker subset of genes.

The most significant contributions of this work are as follows:
a. Investigation of different IRG frameworks for the extraction of a potential IRG subset which contributes significantly to the diagnosis of sepsis.
b. Use of a novel gene selection algorithm based on an intelligent method for the identification of the IRDEGs, which retains a relevant number of remarkable genes for distinguishing sepsis patients from controls.
c. Proposal of an effective sepsis recognition algorithm based on ML techniques and immune-related biomarkers, which is suitable for deployment in medical facilities.

2. DATA
Table 1 shows the 16 gene expression datasets, which are downloaded from the GEO and BioStudies databases and cover eight platforms, namely Affymetrix Human Gene 2.0 ST Array, Custom Affymetrix Human Transcriptome Array, Affymetrix Human Gene 2.1 ST Array, Affymetrix Human Transcriptome Array 2.0, Agilent-026652 Whole Human Genome Microarray 4x44K v2, Affymetrix Human Genome U133 Plus 2.0, Affymetrix Human Genome U219 Array, and Agilent Human Gene Expression 4x44K v2 Microarray of the BioStudies database. There are 2151 participants in the entire database, including 468 normal people, known as controls, and 1683 sepsis patients. The data are randomly divided by dataset, which results in a validation set of GSE26378, GSE26440, GSE57065, GSE95233, and GSE119217, while the remaining datasets are allocated to the training set.

Table 1. Data description
Order  Dataset      No. Genes  Control  Sepsis  Cell type         Age
1      GSE119217    28376      12       122     Peripheral blood  Children
2      GSE69686     20299      85       64      Peripheral blood  Post-natal age
3      GSE69063     25512      33       57      Peripheral blood  Adult
4      GSE134347    30905      83       215     Whole blood       Adult
5      GSE131761    21754      15       81      Peripheral blood  Adult
6      GSE57065     23520      25       82      Whole blood       Adult
7      GSE95233     23520      22       102     Whole blood       Adult
8      GSE28750     23520      20       10      Whole blood       Adult
9      GSE26378     23520      21       82      Whole blood       Children
10     GSE8121      23520      15       60      Whole blood       Children
11     GSE13904     23520      18       52      Whole blood       Children
12     GSE26440     23520      32       98      Whole blood       Children
13     GSE9692      23520      15       30      Whole blood       Children
14     GSE4067      23520      15       69      Whole blood       Children
15     GSE65682     19040      42       479     Whole blood       Adult
16     E-MTAB-1548  17028      15       80      Peripheral blood  Adult
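
The dataset-level split described above can be checked with a few lines of Python. The sketch below only transcribes the control and sepsis counts from Table 1 and the validation dataset names from the text; the variable and function names are illustrative and are not taken from the authors' code.

```python
# Control and sepsis counts per dataset, transcribed from Table 1.
TABLE_1 = {
    "GSE119217": (12, 122), "GSE69686": (85, 64), "GSE69063": (33, 57),
    "GSE134347": (83, 215), "GSE131761": (15, 81), "GSE57065": (25, 82),
    "GSE95233": (22, 102), "GSE28750": (20, 10), "GSE26378": (21, 82),
    "GSE8121": (15, 60), "GSE13904": (18, 52), "GSE26440": (32, 98),
    "GSE9692": (15, 30), "GSE4067": (15, 69), "GSE65682": (42, 479),
    "E-MTAB-1548": (15, 80),
}
# Validation datasets as listed in the text; the rest form the training set.
VALIDATION = {"GSE26378", "GSE26440", "GSE57065", "GSE95233", "GSE119217"}


def totals(names):
    """Return (controls, sepsis) summed over the given dataset names."""
    controls = sum(TABLE_1[n][0] for n in names)
    sepsis = sum(TABLE_1[n][1] for n in names)
    return controls, sepsis


all_controls, all_sepsis = totals(TABLE_1)
val_controls, val_sepsis = totals(VALIDATION)
train_names = sorted(set(TABLE_1) - VALIDATION)
train_controls, train_sepsis = totals(train_names)

print("all:", all_controls, all_sepsis)
print("validation:", val_controls, val_sepsis)
print("training:", train_controls, train_sepsis)
```

Splitting by whole datasets rather than by individual samples keeps every validation cohort unseen during training, which is the property the text relies on when it discusses generalization.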

3. METHOD
Figure 1 shows the proposed method, which includes three steps, namely gene processing, gene selection, and gene estimation. In the first step, gene databases from different platforms, such as the Affymetrix Human Genome U133 Plus 2.0, Affymetrix Human Genome U219 Array, Agilent Human Gene Expression 4x44K v2 Microarray, Affymetrix Human Gene 2.0 ST Array, Custom Affymetrix Human Transcriptome Array, Affymetrix Human Gene 2.1 ST Array, Affymetrix Human Transcriptome Array 2.0, and Agilent-026652 Whole Human Genome Microarray 4x44K v2, are compared for the identification of IRGs, which are then preprocessed by different techniques to improve data quality for further analysis. In the second step, the RF model and a 5-fold CV procedure are applied to calculate gene importance values, by which the IRGs are ranked. A gene ranking-based gene selection algorithm known as sequential forward gene selection (SFGS) is implemented with 3 ML models and the 5-fold CV method to select 3 IRG combinations, defined as 3 IRDEGs. Finally, different ML and DL models, namely RF, K-nearest neighbors (KNN), logistic regression (LR), and long short-term memory (LSTM), are used to validate the performance of the selected IRDEGs using the 5-fold CV procedure to identify the most informative biomarkers. These models are representative of widely used ML and DL techniques. Furthermore, the LR, RF, and KNN models handle linear relationships, nonlinear interactions, and local similarity patterns, respectively, while LSTM is able to capture complex non-linear relationships in gene expression data. In the 5-fold CV procedure, the input dataset is divided into 5 folds. One of these folds is used for testing, and the others are used for model training. The CV procedure is completed with 5 repetitions to ensure that every fold is used once as the testing data.
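
The 5-fold CV procedure described above can be sketched as a small index generator. This is a minimal illustration: the source does not state whether samples are shuffled or stratified, so the sketch simply splits consecutive indices, and the function name `k_fold_indices` is an assumption made for this example.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs so that each fold is the test set once."""
    indices = list(range(n_samples))
    # Distribute the remainder so fold sizes differ by at most one sample.
    fold_sizes = [n_samples // k + (1 if f < n_samples % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Toy usage with 12 samples and 5 folds.
splits = list(k_fold_indices(12, k=5))
for train_idx, test_idx in splits:
    print("test:", test_idx, "train size:", len(train_idx))
```

Each sample appears in exactly one test fold, so the mean over the 5 repetitions uses every sample once for testing.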

[Figure 1: image not reproduced. Block labels recovered from the diagram: Gene processing (gene databases, extraction of immune-related genes, IRGs, preprocessing, training set, validation set); Gene selection (importance by RF with CV, ranking, SFGS with RF, LR, and KNN, grid search with CV, selection of IRDEGs and optimal models); Gene estimation (RF, LR, KNN, and LSTM with CV).]
Figure 1. Method diagram

3.1. Gene processing
The gene processing workflow in this study consists of two stages: preprocessing of raw gene expression data and extraction of IRGs. A total of 16 publicly available gene expression datasets from various microarray platforms are aggregated to ensure broad coverage of heterogeneous patient cohorts and measurement conditions. Detailed descriptions of the preprocessing procedures and IRG extraction steps are provided in the following subsections.

3.1.1. Preprocessing
We consider 16 raw gene expression datasets, which are preprocessed and normalized by the robust multi-array average (RMA) algorithm. Here, gene annotation is performed by mapping probe identifiers to gene symbols, based on the most recent SOFT files or chip description files (CDFs), which are available from the GEO database. SOFT files are used to process 14 gene datasets, setting the gene expression level to the mean of the probes that map to the same gene, while custom CDFs are adopted for GSE119217 and GSE69063 to ensure accurate gene mapping. Finally, the gene data are scaled by Min-Max normalization to the range [0, 1]. It is noteworthy that no batch-effect correction method is applied, in order to ensure model generalization across independent datasets from various platforms.
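
Two of the preprocessing steps above, averaging probes that map to the same gene and Min-Max scaling to [0, 1], can be illustrated in plain Python. This is a sketch under stated assumptions: the probe identifiers and values are invented toy data, the function names are hypothetical, and the source does not say whether scaling is applied per gene or per dataset, so the scaler here just takes one list of values.

```python
from collections import defaultdict


def average_probes(probe_values, probe_to_gene):
    """Collapse probe-level values to gene level by averaging probes of the same gene."""
    grouped = defaultdict(list)
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:  # probes without a gene symbol are skipped
            grouped[gene].append(value)
    return {gene: sum(vals) / len(vals) for gene, vals in grouped.items()}


def min_max_scale(values):
    """Scale numbers linearly into [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy data: probe IDs p1..p4 are hypothetical; p4 has no annotation.
probes = {"p1": 6.0, "p2": 8.0, "p3": 5.0, "p4": 9.0}
mapping = {"p1": "IL1R2", "p2": "IL1R2", "p3": "CCR7"}
gene_level = average_probes(probes, mapping)
scaled = min_max_scale([2.0, 4.0, 6.0])
print(gene_level, scaled)
```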

3.1.2. Immune-related gene extraction
A total of 8 platforms of gene data are used to identify the IRGs from 16 publicly available sepsis databases. Each database is mapped to the IRG reference set to identify the subset of IRGs related to sepsis. There are 770 IRGs collected from [6] using the publicly accessible NanoString database (www.nanostring.com), which are compared with the 16 gene databases used in this work to identify potential IRGs related to sepsis. After filtering the IRGs of the platforms, overlapping genes across the platforms are identified by an intersection-based approach. Thereafter, these intersected genes are used as input for the gene selection step to identify IRDEGs.
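
The intersection-based extraction described above amounts to two set operations: filter each dataset against the IRG reference set, then intersect the filtered sets across datasets. The sketch below uses toy gene lists and a hypothetical helper name `common_irgs`; it is not the authors' implementation.

```python
from functools import reduce


def common_irgs(reference_irgs, dataset_genes):
    """Return per-dataset IRGs and the IRGs shared by every dataset.

    reference_irgs: set of gene symbols in the IRG reference set.
    dataset_genes: dict mapping dataset name -> set of annotated gene symbols.
    """
    per_dataset = {name: reference_irgs & genes for name, genes in dataset_genes.items()}
    shared = reduce(set.intersection, per_dataset.values())
    return per_dataset, shared


# Toy reference set and two toy datasets (names and contents are illustrative).
reference = {"IL1R2", "S100A12", "CCR7", "IL6ST"}
datasets = {
    "toy_A": {"IL1R2", "S100A12", "CCR7", "ACTB"},
    "toy_B": {"IL1R2", "CCR7", "IL6ST", "GAPDH"},
}
per_dataset, shared = common_irgs(reference, datasets)
print(per_dataset, shared)
```

Keeping only genes present on every platform shrinks the feature set but guarantees that a model trained on one dataset can be applied to any other without missing inputs.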

3.2. Gene selection
The gene selection stage aims to identify IRDEGs for sepsis detection through a combination of gene ranking and SFGS. We employ a gene ranking-based gene selection method, namely SFGS, which uses gene importance computed by the RF model in combination with different ML models as the fitness functions and a 5-fold CV procedure. The gene selection framework is presented in detail in the following subsections.

3.2.1. Gene ranking
The preprocessed IRGs are put into the RF model to calculate their importance values, which represent the significance of the individual IRGs for the final sepsis detection performance. Specifically, the importance values are defined as scores for all input IRGs computed by a given ML model. Here, all IRGs are ranked by these scores from the highest to the lowest value, where a higher score indicates a greater impact of a specific IRG on the ML model used to recognize sepsis.

3.2.2. Sequential forward gene selection
The SFGS algorithm, in combination with 3 ML models (KNN, LR, and RF) and a 5-fold CV procedure, is deployed to select 3 optimal gene subsets, also known as 3 subsets of IRDEGs. The preprocessed IRGs are ranked according to their scores as presented in the previous step. The SFGS first selects the gene with the highest score as the input of the 3 ML models to calculate the classification performance related to sepsis detection. Then, the two genes with the highest importance values are put into the ML models to estimate their performance. The procedure is repeated until all preprocessed IRGs are considered for the performance calculation of the ML models. Algorithm 1 shows the SFGS combined with different ML models and the 5-fold CV procedure.

Algorithm 1. Sequential forward gene selection with ML models
1) Sorting IRGs based on the importance values
   G: input set of IRGs; N: number of IRGs;
   G(1): the IRG with the highest importance value;
   G(N): the IRG with the lowest importance value;
   The IRGs of G are sorted in descending order of importance value.
2) Calculating accuracy of ML models using different gene subsets
   Training data: P(i) = {T(:, G(1:i)), y}, where i = 1, ..., N;
   y (L × 1): label vector; L: number of samples;
   T (L × N): sample matrix; T(:, G(1:i)) keeps the i top-ranked genes.
   a) Start with the entire data and i = 1 gene: P(i) = {T(:, G(1:i)), y};
   b) Repeat
        Separate P(i) into 5 folds by databases, P(i, k);
        for k = 1 to 5
          • Train the model on P(i, t), t ≠ k;
          • Calculate the accuracy on P(i, k);
        end
        Calculate the mean accuracy of the CV;
        Add the gene with the next highest score; i = i + 1;
   c) Until all N genes have been evaluated
3) Immune-related differential expression gene selection
   The subsets of IRGs, namely the IRDEGs, are selected with the highest accuracies of the corresponding models.
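
The selection loop of Algorithm 1 can be sketched in plain Python as follows. To keep the example self-contained, a nearest-centroid classifier stands in for the KNN, LR, and RF fitness functions used in the paper, the data are invented, and ties between subsets are broken in favor of the smaller subset; all three choices are assumptions of this sketch, not details given in the source.

```python
def nearest_centroid_accuracy(train_X, train_y, test_X, test_y):
    """Stand-in fitness function: assign each test row to the closer class centroid."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_X, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    correct = 0
    for x, y in zip(test_X, test_y):
        pred = min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
        correct += pred == y
    return correct / len(test_y)


def sfgs(X, y, ranked_genes, folds, fitness=nearest_centroid_accuracy):
    """Evaluate the top-i ranked genes for i = 1..N with cross-validation.

    X: list of sample rows; ranked_genes: column indices in decreasing importance;
    folds: list of row-index lists, one per fold.
    Returns (best_i, mean_accuracies); ties go to the smallest subset.
    """
    mean_acc = []
    for i in range(1, len(ranked_genes) + 1):
        cols = ranked_genes[:i]
        select = lambda rows: [[X[r][c] for c in cols] for r in rows]
        scores = []
        for k, test_idx in enumerate(folds):
            train_idx = [r for t, fold in enumerate(folds) if t != k for r in fold]
            scores.append(fitness(select(train_idx), [y[r] for r in train_idx],
                                  select(test_idx), [y[r] for r in test_idx]))
        mean_acc.append(sum(scores) / len(scores))
    best_i = mean_acc.index(max(mean_acc)) + 1
    return best_i, mean_acc


# Toy data: column 0 separates the classes, column 1 is uninformative.
X = [[0.10, 0.5], [0.20, 0.1], [0.15, 0.9], [0.05, 0.4],
     [0.90, 0.6], [0.80, 0.2], [0.85, 0.8], [0.95, 0.3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
folds = [[0, 4], [1, 5], [2, 6], [3, 7]]
best_i, mean_acc = sfgs(X, y, ranked_genes=[0, 1], folds=folds)
print(best_i, mean_acc)
```

Because the genes are added in a fixed ranked order, only N subsets are evaluated per model instead of the exponential number an exhaustive search would need.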

In addition, a grid search-based optimization method is used to identify the optimal learning and structure parameters of the models and to address the overfitting problem. The most important learning parameters investigated for the RF model are the number of trees, [25, 55, 75, 95], and the number of leaves, [15, 25, 35, 55], while K in [5, 8, 11, 14, 17, 20, 23, 26, 29] is considered for the KNN model. The learning parameters of the LSTM model are the optimizer, [adam, SGD, RMSprop], the batch size, [16, 32, 64], the learning rate, [0.005, 0.01, 0.02], the L2 regularization, [0.8, 0.9, 0.95], and the number of epochs, [40, 60, 80]. Here, 5 structures of the LSTM model are employed, in which the first structure includes an LSTM and a batch normalization layer. The second is a combination of 2 first structures, while the third contains the first and the second structure, etc. As a result, there are 1, 16, 9, and 1215 configurations of the LR, RF, KNN, and LSTM models, respectively, which are implemented to identify the optimal models corresponding to the individual subsets of IRGs.
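
The configuration counts quoted above follow directly from the Cartesian product of the listed parameter values. The short sketch below enumerates the grids with `itertools.product`; the dictionary keys are illustrative names chosen for this example, not parameter names from any specific library.

```python
from itertools import product

# Parameter values as listed in the text; key names are illustrative.
rf_grid = {"n_trees": [25, 55, 75, 95], "n_leaves": [15, 25, 35, 55]}
knn_grid = {"k": [5, 8, 11, 14, 17, 20, 23, 26, 29]}
lstm_grid = {
    "structure": [1, 2, 3, 4, 5],
    "optimizer": ["adam", "SGD", "RMSprop"],
    "batch_size": [16, 32, 64],
    "learning_rate": [0.005, 0.01, 0.02],
    "l2": [0.8, 0.9, 0.95],
    "epochs": [40, 60, 80],
}


def configurations(grid):
    """Expand a parameter grid into a list of one dict per configuration."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


counts = {name: len(configurations(grid))
          for name, grid in [("RF", rf_grid), ("KNN", knn_grid), ("LSTM", lstm_grid)]}
print(counts)
```

The LSTM count is 5 structures times 3 values for each of the 5 learning parameters, which gives 5 × 3^5 = 1215.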

3.3. Gene estimation
We use 3 ML models and a DL model, namely RF [18], LR [19], KNN [20], and LSTM [21], to validate the entire set of input IRGs (AIRG) and the 3 subsets of IRDEGs selected by the SFGS algorithm on the validation set using the 5-fold CV procedure. Here, 5 folds are generated for the validation set, in which each fold corresponds to a complete dataset. Then, 4 folds are used for model training and one fold is used for testing. The CV procedure is repeated 5 times to ensure that every individual dataset is used once as the testing data. The mean validation performance of the models and its standard deviation are calculated for further analysis and comparison with previous studies.
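
Since each fold of the validation set is a complete dataset, the procedure above is a leave-one-dataset-out scheme. The sketch below builds the five splits from the validation dataset names and summarizes a list of fold scores. The scores are toy numbers, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, because the source does not say which estimator is used.

```python
from statistics import mean, stdev

VALIDATION_DATASETS = ["GSE26378", "GSE26440", "GSE57065", "GSE95233", "GSE119217"]


def leave_one_dataset_out(dataset_names):
    """Return (training_datasets, held_out_dataset) pairs, one per dataset."""
    return [([d for d in dataset_names if d != held_out], held_out)
            for held_out in dataset_names]


def summarize(fold_scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return mean(fold_scores), stdev(fold_scores)


splits = leave_one_dataset_out(VALIDATION_DATASETS)
for train, held_out in splits:
    print("test on", held_out, "train on", train)

# Toy per-fold accuracies, not results from the paper.
avg, sd = summarize([0.90, 1.00, 0.95])
print(round(avg, 4), round(sd, 4))
```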

4. SIMULATION RESULTS
4.1. Performance measurement
We use accuracy (Ac), sensitivity (Se), specificity (Sp), Matthews correlation coefficient (MCC), and area under the curve (AUC) to estimate the performance of the different ML and DL models in this study. Ac shows the rate of participants who are correctly predicted. Se and Sp present the proportions of correctly detected sepsis patients and control people, respectively. The MCC measures the correlation between the predicted and the true classes of patients and controls. Furthermore, the AUC evaluates the ability of the ML and DL models to distinguish sepsis patients from control people.

\[ Ac = \frac{TP + TN}{TP + FP + TN + FN} \tag{1} \]

\[ Sp = \frac{TN}{TN + FP} \tag{2} \]

\[ Se = \frac{TP}{TP + FN} \tag{3} \]

\[ MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{4} \]

where TN, TP, FN, and FP are the numbers of true negatives, true positives, false negatives, and false positives, respectively.
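
Equations (1) to (4) translate directly into code. The sketch below computes them from confusion counts and adds a pairwise AUC estimate, which equals the area under the ROC curve: the probability that a randomly chosen sepsis sample receives a higher score than a randomly chosen control, with ties counted as one half. The function names and the toy counts are illustrative, and returning 0 for an undefined MCC is a convention chosen for this example.

```python
from math import sqrt


def confusion_metrics(tp, tn, fp, fn):
    """Compute Ac, Sp, Se, and MCC following equations (1) to (4)."""
    ac = (tp + tn) / (tp + fp + tn + fn)
    sp = tn / (tn + fp)
    se = tp / (tp + fn)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # 0.0 if undefined
    return {"Ac": ac, "Sp": sp, "Se": se, "MCC": mcc}


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy confusion counts and scores, not values from the paper.
m = confusion_metrics(tp=90, tn=40, fp=10, fn=10)
a = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
print(m, a)
```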

4.2. Gene processing
4.2.1. Preprocessing
The preprocessing stage begins by applying the RMA method to all 16 gene expression datasets to perform background correction, normalization, and probe-level summarization. Following RMA, gene annotation is carried out using the corresponding SOFT files and CDFs to accurately map probe identifiers to standardized gene symbols across different platforms. As a result of this procedure, the processed datasets contain between 17028 and 30905 genes, as detailed in Table 1, which are scaled by Min-Max normalization to the range [0, 1].

4.2.2. Immune-related gene extraction
The 16 gene expression datasets are filtered for IRGs based on the set of 770 IRGs. As a result, 760, 696, 742, 755, 751, 740, and 737 IRGs are extracted from GSE119217, GSE69686, GSE69063, GSE134347, GSE131761, GSE65682, and E-MTAB-1548, respectively. Furthermore, the remaining gene datasets, namely GSE57065, GSE95233, GSE28750, GSE26378, GSE8121, GSE13904, GSE26440, GSE9692, and GSE4067, each yield 737 IRGs. We consider a subset of 560 IRGs, which is the intersection of the 16 datasets, for further analysis to ensure that the proposed algorithm includes the characteristics common to all input gene databases related to sepsis.

Table 2. Genes ranked by their importance values
Ord  Gene     Imp    Ord  Gene     Imp    Ord  Gene      Imp    Ord     Gene      Imp
1    IL1R2    2.99   13   GATA3    0.44   25   ITGA4     0.34   37      CEACAM8   0.28
2    S100A12  1.80   14   MAGEB2   0.42   26   IFIT1     0.33   38      KLRD1     0.27
3    CCR7     1.58   15   CD3E     0.42   27   CD274     0.32   39      AMMECR1L  0.26
4    IL6ST    0.74   16   ARG1     0.42   28   GZMA      0.32   40      PYCARD    0.26
5    ABCB1    0.66   17   CCR9     0.39   29   CR1       0.32   41      CD80      0.25
6    FCER1A   0.64   18   LRRN3    0.39   30   BATF      0.30   42      ST6GAL1   0.25
7    FCER1G   0.62   19   GNLY     0.38   31   LTB       0.30   43      TXK       0.25
8    C1QA     0.62   20   COLEC12  0.37   32   CR2       0.30   44      CD63      0.25
9    C3AR1    0.61   21   CD3D     0.37   33   HLA-DQA1  0.29   45      C5        0.25
10   CCL28    0.60   22   BCL2     0.36   34   KLRG1     0.29   46      SSX1      0.24
11   BST2     0.55   23   KLRF1    0.35   35   DUSP6     0.28   Others            < 0.24
12   CFD      0.46   24   STAT3    0.35   36   IL18R1    0.28
Imp: Importance value, Ord: Order

4.3. Gene selection
4.3.1. Gene ranking
A total of 560 IRGs are evaluated and ranked according to their importance values, which are computed using the RF model, as shown in Table 2. These importance scores represent the contribution of each IRG to the overall classification performance, allowing us to identify the genes that are most influential in distinguishing sepsis samples from non-sepsis samples. We only present the first 46 IRGs with the highest importance values due to the large number of IRGs investigated in this work.

4.3.2. Sequential forward gene selection
We employ 3 ML models, namely LR, KNN, and RF, as the fitness functions of the SFGS algorithm in combination with the 5-fold CV procedure to identify optimal IRG subsets. During the selection process, SFGS iteratively adds genes from the ranked list and evaluates each candidate subset using the 5-fold CV procedure to measure its classification performance. The optimal subsets, termed IRDEG1, IRDEG2, and IRDEG3, contain the first 31, 36, and 46 genes, respectively, corresponding to the highest average accuracy achieved by each ML model. These selected gene sets are summarized in Table 2, and their performance is illustrated in Figure 2.

Figure 2. Average accuracy of 5-fold CV for the individual immune gene subsets

Table 3. The highest validation performance of various models using the 3 IRDEGs and AIRG on the validation set
Model  DEG     Ac (%)         Se (%)         Sp (%)         MCC (%)        AUC (%)
RF     IRDEG1  94.71 ± 5.68   95.71 ± 6.08   90.65 ± 10.57  83.08 ± 20.29  97.60 ± 4.41
RF     IRDEG2  96.83 ± 1.39   98.86 ± 2.03   86.70 ± 11.85  84.97 ± 13.56  98.67 ± 2.54
RF     IRDEG3  95.60 ± 3.73   97.46 ± 3.18   85.56 ± 15.33  83.02 ± 19.28  97.48 ± 5.16
RF     AIRG    96.03 ± 5.03   98.07 ± 3.11   81.32 ± 27.81  80.69 ± 31.36  97.17 ± 6.02
KNN    IRDEG1  95.94 ± 3.49   96.09 ± 2.77   93.42 ± 10.50  85.02 ± 17.48  97.68 ± 4.31
KNN    IRDEG2  94.89 ± 3.87   94.95 ± 3.21   91.75 ± 14.17  81.82 ± 19.89  96.70 ± 6.60
KNN    IRDEG3  94.67 ± 7.56   94.52 ± 7.23   94.37 ± 10.91  83.48 ± 25.33  97.17 ± 6.07
KNN    AIRG    94.59 ± 5.38   95.16 ± 3.45   85.71 ± 29.35  77.42 ± 31.53  93.27 ± 13.28
LR     IRDEG1  75.25 ± 28.44  73.61 ± 40.22  81.54 ± 24.71  55.37 ± 31.52  80.26 ± 16.41
LR     IRDEG2  71.68 ± 27.75  68.96 ± 39.40  82.58 ± 28.76  50.22 ± 30.67  79.41 ± 16.28
LR     IRDEG3  65.77 ± 32.95  65.01 ± 46.52  72.09 ± 39.39  49.26 ± 22.63  72.54 ± 19.61
LR     AIRG    49.54 ± 26.84  40.34 ± 37.52  72.13 ± 40.50  47.81 ± 24.89  55.61 ± 9.40
LSTM   IRDEG1  94.52 ± 4.70   95.71 ± 4.51   87.88 ± 10.43  80.69 ± 19.56  97.39 ± 4.30
LSTM   IRDEG2  89.33 ± 7.76   92.28 ± 8.56   76.67 ± 38.16  65.37 ± 29.17  93.44 ± 8.60
LSTM   IRDEG3  88.97 ± 13.43  90.79 ± 16.21  85.37 ± 18.06  74.02 ± 24.09  94.37 ± 7.48
LSTM   AIRG    91.67 ± 5.57   92.86 ± 7.74   84.29 ± 17.06  74.66 ± 19.82  97.86 ± 3.96

4.4. Gene estimation
The optimal parameters of the RF model are 55 trees and 25 leaves, while that of the KNN model is K = 17. Moreover, the structure of the optimal LSTM model consists of 3 sequential layers, in which the LSTM layer is followed by a batch normalization layer and a dropout layer for training stabilization and overfitting reduction, respectively. The extracted temporal representations are then passed through two further layers, namely a fully connected layer and a softmax output layer, for binary classification. The optimal LSTM model uses the Adam optimizer with a batch size of 32, a learning rate of 0.01, and an L2 regularization coefficient of 0.9. These optimal models are then used for the performance validation of the 3 IRDEG subsets on the validation set using the 5-fold CV procedure. The mean performance of the different ML and DL models, namely RF, LR, KNN, and LSTM, is given in Table 3.

The RF model with IRDEG2 produces the highest average Ac of 96.83%, with Se of 98.86%, Sp of 86.70%, MCC of 84.97%, and AUC of 98.67%, and is selected as the proposed algorithm to classify sepsis.

5. DISCUSSION
Sepsis is a dangerous disease for human health, which has received intense attention from medical experts, technicians, and researchers. Existing studies consider different gene databases to develop effective methods for sepsis detection. However, the number of gene databases is frequently small, resulting in unreliability and low performance of the proposed methods and making practical application in clinical environments difficult [14]–[17]. A potential solution to enhance the detection performance of the algorithms proposed in previous works is the use of IRG databases, since IRGs are involved in immune regulation, response, and proper functioning of the immune system to protect the human body from harmful substances, germs, and cell changes. Hence, we investigate a large number of 16 gene databases to produce better detection performance of the final algorithm in this work. The use of many gene databases helps to avoid overfitting, improve the final classification performance, and increase the reliability of the proposed method. Additionally, the use of common IRGs from the 16 gene databases improves the generalization of the proposed method in terms of sepsis recognition.

Another significant characteristic is the gene selection. Most existing studies adopt conventional methods such as log(Fold-change) and P-value to identify the DEGs [16], [17]. Indeed, the log(Fold-change) parameter represents the degree of change in gene expression, where up- and down-regulation of genes correspond to log(Fold-change) values higher and lower than zero, respectively. Moreover, the statistical method uses a threshold of 0.05, and a p-value smaller than this threshold indicates a statistically significant change in expression. The combination of the above parameters results in an effective method for DEG identification. However, the large number of DEGs produced by the conventional method, such as 6361, 1230, and 405 [14], [16], [17], poses an obstacle for the further step of biomarker identification among the selected DEGs. Therefore, a gene ranking-based gene selection method, namely SFGS, is applied in this work to select the potential IRDEGs from the input gene set of 560 IRGs. Here, the gene importance values, which are computed by an RF model, are used to rank the 560 IRGs. A total of 560 IRG combinations, with the number of IRGs ranging from 1 to 560, are evaluated by 3 ML models, namely RF, LR, and KNN. Consequently, 3 subsets including 31, 36, and 46 IRGs are selected by the SFGS in combination with the 3 ML models and the 5-fold CV procedure on the training set. The number of genes in these subsets is smaller than those of [14], [16], [17], which makes it easier to identify a subset of biomarkers.
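
For contrast with the ranking-based SFGS, the conventional threshold filter discussed above can be written in a few lines. This is a sketch with invented statistics: the gene names and numbers are toy values, the function name is hypothetical, and the default thresholds (|log fold-change| > 1, p-value < 0.05) are the ones quoted in this paper for prior work.

```python
def conventional_deg_filter(stats, lfc_threshold=1.0, p_threshold=0.05):
    """Split genes into up- and down-regulated DEGs using fixed thresholds.

    stats: dict mapping gene name -> (log fold-change, p-value).
    """
    up, down = [], []
    for gene, (log_fc, p_value) in stats.items():
        if p_value < p_threshold and abs(log_fc) > lfc_threshold:
            (up if log_fc > 0 else down).append(gene)
    return sorted(up), sorted(down)


# Toy statistics for four hypothetical genes.
toy_stats = {
    "gene_a": (2.3, 0.001),   # up-regulated and significant
    "gene_b": (-1.7, 0.010),  # down-regulated and significant
    "gene_c": (0.4, 0.020),   # significant but small change
    "gene_d": (1.9, 0.300),   # large change but not significant
}
up, down = conventional_deg_filter(toy_stats)
print(up, down)
```

A fixed-threshold filter of this kind does not use classification performance at all, which is why it can return hundreds or thousands of genes, whereas SFGS stops at the subset size that maximizes cross-validated accuracy.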
We implement ML and DL models in order to compare with existing publications on different performance metrics and to propose an effective sepsis diagnosis algorithm. The AUC and MCC metrics are widely used to assess the proposed methods in previous works [13], [15], [17]. The AUC metric emphasizes the ability of a model to distinguish between the sepsis and control groups, while the overall prediction quality is measured by accuracy. High classification performance of the final algorithm is one of the most important considerations for those who develop novel methods for sepsis recognition. Therefore, using several metrics to estimate the performance of the sepsis detection algorithm is essential. In this work, 5 metrics are employed to evaluate the various models on sepsis classification, which provides a more reliable estimate of the proposed algorithm's ability to diagnose sepsis. Moreover, grid search combined with the 5-fold CV procedure is used to identify the optimal learning and structure parameters, which yields the best model with relatively high sepsis detection performance while reducing the risk of fundamental problems such as overfitting.
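Grid search combined with k-fold CV can be sketched in plain Python as below. This is an illustrative outline only: `k_fold_indices`, `grid_search_cv`, and the `fit_and_score` callback are hypothetical names, the parameter grid is invented, and the folds are contiguous with no shuffling or stratification, which a real pipeline on imbalanced data would likely add.

```python
from itertools import product
from statistics import mean


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) index lists for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, val_idx
        start = stop


def grid_search_cv(param_grid, fit_and_score, n_samples, k=5):
    """Return (best_params, best_mean_score) over every grid combination."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = [
            fit_and_score(params, train_idx, val_idx)
            for train_idx, val_idx in k_fold_indices(n_samples, k)
        ]
        avg = mean(fold_scores)      # average validation score over the folds
        if avg > best_score:
            best_params, best_score = params, avg
    return best_params, best_score


# Toy scoring callback that peaks at depth=4, n_trees=100. A real callback
# would train a model on train_idx and evaluate it on val_idx.
def toy_fit_and_score(params, train_idx, val_idx):
    return -((params["depth"] - 4) ** 2) - abs(params["n_trees"] - 100) / 100


toy_grid = {"depth": [2, 4, 8], "n_trees": [50, 100, 200]}
best_params, best_score = grid_search_cv(toy_grid, toy_fit_and_score, n_samples=12)
```

Each parameter combination is scored only on held-out folds, which is why the procedure helps limit overfitting during model selection.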
The average performance of the ML and DL models on the validation set is given in Table 3. The RF and KNN models achieve high performance for sepsis diagnosis, with mean Ac and AUC over 94% and 93%, respectively, while the LR model shows the lowest performance, with Ac and AUC below 75% and 80%, respectively. The highest performance, with a mean Ac of 96.83%, Se of 98.86%, Sp of 86.70%, MCC of 84.97%, and AUC of 98.67%, is achieved by the RF model, which is therefore selected as the final algorithm for sepsis detection. Here, the high sensitivity of the proposed model implies that few sepsis cases are missed; the flagged cases are then checked by clinical experts, who make the final decision on delivering effective treatment. In the clinical context, emphasizing sensitivity is essential for early detection, as timely intervention can significantly reduce the risk of severe complications and mortality in patients with sepsis.
It is noteworthy that patients who are incorrectly identified as septic by the proposed model are also examined by experts and are then given no medical treatment. A comparison of the proposed algorithm with existing publications is presented in Table 4, which shows that the proposed algorithm outperforms the existing studies on the reported metrics. Hence, the proposed algorithm is effective for sepsis detection applications in practical facilities and hospitals.
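For reference, the five metrics used in this discussion (Ac, Se, Sp, MCC, AUC) can be computed as in the sketch below. The function names and all counts and scores are hypothetical and chosen only for illustration; they are not the confusion matrix behind the reported results. The AUC is computed with the pairwise (Mann-Whitney) formulation, which is one standard way to obtain it.

```python
from math import sqrt


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)     # fraction of sepsis cases detected
    specificity = tn / (tn + fp)     # fraction of controls correctly cleared
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Ac": accuracy, "Se": sensitivity, "Sp": specificity, "MCC": mcc}


def auc_from_scores(pos_scores, neg_scores):
    """Probability that a random positive scores above a random negative."""
    wins = sum(
        (p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical counts: 100 sepsis samples and 100 controls.
metrics = classification_metrics(tp=90, fp=20, tn=80, fn=10)

# Hypothetical predicted probabilities for 3 sepsis and 2 control samples.
auc = auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3])
```

MCC uses all four cells of the confusion matrix, so it stays informative when the classes are imbalanced, whereas accuracy alone can look high in that situation.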
Table 4. Comparison of the proposed algorithm with previous studies

| Ref. | Method | Data | Ac (%) | AUC (%) | MCC (%) | Pros | Cons |
|---|---|---|---|---|---|---|---|
| [17] 2025 | Gene selection and classification using RF and Elastic Net | 4 datasets (359 samples); separated training and testing | NA | 88.1 | NA | 113 model combinations, enabling robust, thorough performance evaluation | Small dataset; non-optimized model; only using ML; only AUC |
| [13] 2022 | Gene selection using RF ranking and forward-wrapper; classification using RF | 5 datasets (958 samples); 4 datasets for training and testing; a dataset for validation | NA | 87.3 | 71.3 | Robust wrapper-based gene selection; independent dataset validation | Small dataset; non-optimized model; only using ML |
| [15] 2023 | Gene selection by intersecting LASSO, SVM-RF, and RF; classification using CIBERSOFT | 3 datasets (253 samples); 2 datasets for training; a dataset for validation | NA | 92 | NA | Integration of multiple gene selection methods | Small dataset; non-optimized model; only using ML; model evaluation using only AUC |
| Our | Gene importance-based gene ranking; gene selection using SFGS and ML models; classification using ML, DL | 16 datasets (2151 samples); 11 datasets for training; 5 datasets for validation; 5-fold CV | 96.83 | 98.67 | 84.97 | Multiple datasets to improve generalizability; grid search, CV to optimize model; SFGS and ML models for IRDEG selection; high classification performance | High number of selected biomarkers; limited exploration of DL models |
Existing clinical tools for assessing sepsis, such as SOFA, qSOFA, and procalcitonin, provide important diagnostic guidance but still have several limitations. The SOFA score evaluates organ dysfunction across six physiological systems for the diagnosis of sepsis and is more reactive than predictive [22]. Consequently, sepsis is often identified only after significant organ damage has occurred [23], [24]. Similarly, qSOFA was developed and validated in populations with suspected sepsis, making it less suitable as an early screening tool [25]. Procalcitonin shows low sensitivity and specificity in differentiating sepsis from other causes of systemic inflammatory responses, which emphasizes the need for more reliable molecular markers [26]. In contrast, our proposed method utilizes a subset of 36 IRDEGs to detect sepsis promptly and with high performance, namely a sensitivity of 98.86% and an AUC of 98.67%. These findings suggest that our model provides an alternative diagnostic approach with improved accuracy and timeliness in comparison with the existing clinical tools.
The first limitation of this work is the large number of biomarkers (36), which increases the time, complexity, and cost of gene expression measurement in real-world applications. Secondly, the datasets used in this study exhibit class imbalance, which may bias model learning and inflate sensitivity while reducing specificity, potentially affecting the generalization of the predictive model. The omission of external validation and the restriction of the analysis to 3 ML models and one DL model are further limitations. Exploring different DL models offers a better chance of finding a model with better sepsis recognition performance, and this will be considered in future research.
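The effect of class imbalance mentioned above can be made concrete with a small worked example. The cohort below is hypothetical (900 sepsis samples and 100 controls, not the paper's data), and balanced accuracy is added here purely for illustration; it is not one of the metrics reported in this work.

```python
def summarize(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity, and balanced accuracy from counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# A degenerate classifier that labels every sample as sepsis on a cohort of
# 900 sepsis samples and 100 controls (hypothetical numbers).
always_positive = summarize(tp=900, fn=0, tn=0, fp=100)
```

Here accuracy is 0.90 and sensitivity is 1.00 even though specificity is 0.00 and balanced accuracy is only 0.50. This is the kind of inflated sensitivity and reduced specificity that an imbalanced training set can push a model toward.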
6. CONCLUSION
Sepsis is a major cause of serious medical conditions; it represents the body's uncontrolled response to infection, leading to organ failure and high mortality. Millions of cases are reported each year, which poses a significant burden on healthcare systems worldwide. Prompt and accurate recognition is essential to improve patient outcomes. In this work, we propose an algorithm for sepsis prediction with high generalization, based on the use of multiple cases and diverse age groups in input gene datasets collected from different medical platforms. The proposed algorithm is developed with the RF model and a subset of 36 biomarkers, which are chosen from the input IRGs.
A gene-ranking-based gene selection method, known as the SFGS algorithm, which uses different ML models as the fitness function together with the 5-fold CV procedure, is deployed to select the optimal subset of IRGs, known as IRDEGs, whose sepsis detection performance is then validated by the ML and DL models. The relatively high performance supports the effectiveness of the SFGS using ML techniques in comparison with the conventional method using log(Fold-change) and p-value for the identification of DEGs. The RF model achieves the highest average sepsis recognition performance among the ML and DL models, with Ac of 96.83%, Se of 98.86%, Sp of 86.70%, MCC of 84.97%, and AUC of 98.67%, which shows the successful use of an ML model and biomarkers for sepsis diagnosis. We thus propose a simple but efficient method to achieve better processing of massive gene data and a high level of gene data separation for sepsis detection. As a result, we expect that the proposed algorithm can be deployed as an application in clinical environments and hospitals. However, the set of biomarkers includes 36 genes, which may increase the practical complexity of clinical implementation; this is the first limitation of this work. Moreover, the imbalanced datasets, the lack of external validation, and the use of a small number of models (4 ML and DL models) for method development are additional limitations, which will be addressed in future research.
FUNDING INFORMATION
Authors state no funding involved.
AUTHOR CONTRIBUTIONS STATEMENT
This journal uses the Contributor Roles Taxonomy (CRediT) to recognize individual author contributions, reduce authorship disputes, and facilitate collaboration.
Name of author    C  M  So  Va  Fo  I  R  D  O  E  Vi  Su  P  Fu
Tuan Anh Vu       ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Dang Hoai Bac     ✓ ✓ ✓ ✓
Minh Tuan Nguyen  ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓

C : Conceptualization
M : Methodology
So : Software
Va : Validation
Fo : Formal Analysis
I : Investigation
R : Resources
D : Data Curation
O : Writing - Original Draft
E : Writing - Review & Editing
Vi : Visualization
Su : Supervision
P : Project Administration
Fu : Funding Acquisition
CONFLICT OF INTEREST STATEMENT
Authors state no conflict of interest.
DATA AVAILABILITY
The supporting data of this study are openly available at https://www.ncbi.nlm.nih.gov/geo/ and https://www.ebi.ac.uk/biostudies/arrayexpress.
REFERENCES
[1] S. Lin et al., “Multiple datasets to explore the molecular mechanism of sepsis,” BMC Genomic Data, vol. 23, pp. 1–13, 2022, doi: 10.1186/s12863-022-01078-2.
[2] L.-W. Duan et al., “Effects of viral infection and microbial diversity on patients with sepsis: A retrospective study based on metagenomic next-generation sequencing,” World Journal of Emergency Medicine, vol. 12, pp. 29–35, 2021, doi: 10.5847/wjem.j.1920-8642.2021.01.005.
[3] L. La Via et al., “The global burden of sepsis and septic shock,” Epidemiologia, vol. 5, pp. 456–478, 2024, doi: 10.3390/epidemiologia5030032.