Indonesian Journal of Electrical Engineering and Computer Science
Vol. 39, No. 3, September 2025, pp. 1571∼1586
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v39.i3.pp1571-1586

Combination of MLF-VO-F and loss functions for VOE from RGB image sequence using deep learning

Van-Hung Le¹, Huu-Son Do¹, Thi-Ha-Phuong Nguyen¹, Van-Thuan Nguyen², Tat-Hung Do²
¹ Department of Information Technology, Tan Trao University, Tuyen Quang, Vietnam
² Faculty of Engineering Technology, Hung Vuong University, Viet Tri City, Vietnam

Article Info
Article history: Received Feb 21, 2025; Revised Apr 8, 2025; Accepted Jul 2, 2025
Keywords: Comparative study; Deep learning; Loss functions; MLF-VO-F; RGB image sequence; Visual odometry

ABSTRACT
Visual odometry estimation (VOE) is important in building navigation and path-finding systems. It helps entities find their way and estimate paths in the environment. Most computer vision (CV)-based VOE models are evaluated and compared on the KITTI dataset. The multi-layer fusion framework (MLF-VO-F) of Jiang et al. has achieved good VOE results from red, green, and blue (RGB) image sequences, using the DeepNet to extract low-level textures, edges, and deeper high-level semantic features for estimating motion between consecutive frames. This paper proposes a combined model of MLF-VO-F as a backbone and loss functions (LFs) ($L_{MSE}$, $L_{MSE-L2}$, $L_{CE}$, and $L_{combi}$) to optimize and supervise the training process of the VOE model. We evaluate and compare the effectiveness of the LFs for VOE on the KITTI and TQU-SLAM datasets against the original MLF-VO-F. From these results, we choose the appropriate LF to combine with the backbone for VOE. The evaluation results on the KITTI dataset show that $L_{CE}$ ($RTE$ is 0.075 m and 0.06 m on Seq. #9 and Seq. #10, respectively) and $L_{combi}$ ($t_{rel}$ is 2.21%, 2.67%, 3.59%, 1.01%, and 4.62% on Seq. #4, Seq. #5, Seq. #6, Seq. #7, and Seq. #10, respectively) have the lowest errors, and $L_{MSE}$ has the highest errors ($ATE$ is 133.36 m on Seq. #9).
This is an open access article under the CC BY-SA license.

Corresponding Author:
Tat-Hung Do
Faculty of Engineering Technology, Hung Vuong University
Nong Trang, Viet Tri City, Phu Tho, Vietnam
Email: dotathung@hvu.edu.vn

1. INTRODUCTION
Visual odometry estimation (VOE) is one of the two important problems of visual SLAM, and a problem of computer vision (CV) and robotics that has been studied for a long time. VOE focuses mainly on local consistency: it aims to incrementally estimate the camera pose path after each pose and can perform local optimization. Visual SLAM estimates the entire scene/map and the camera trajectory/VOE. This means that visual SLAM includes the VOE problem, which helps robots, or software that supports visually impaired people, to estimate the direction and path of movement in the environment, especially in new environments. The data used to build the VOE can be collected from IMU [1], [2], LiDAR [3]–[5], or image sensors. The data obtained from image sensors (RGB, depth, and stereo) can be used to build VOE at a reasonable cost.
Previously, with the traditional approach, VOE [6] was implemented with geometry-based methods. These methods use a keypoint detector to identify the salient points (keypoints) in the image, and feature vectors or descriptors are computed by considering the local region around each keypoint. Keypoints are tracked through descriptor matching to establish correspondence between different views (or image frames). For example, PTAM [7] used a FAST corner detector to detect keypoints in the image, and ORB-SLAM [8] used oriented FAST and rotated BRIEF descriptors to perform the tracking, mapping, and loop closure steps of the VOE model. A geometry-based VOE model usually includes modules such as feature extraction, feature matching, pose estimation, and local optimization [9].

Journal homepage: http://ijeecs.iaescore.com

In contrast, deep learning (DL) [6] uses recurrent convolutional neural networks (CNN) to extract motion features, representative points, or the relative pose between consecutive frames. In this model, deep neural networks [9] can replace the feature extraction, feature matching, and pose estimation modules of the traditional approach. With the DL-based approach, VOE can be implemented with the network architectures shown in Figure 1: the CNN-based framework in Figure 1(a), the CNN-based framework with two fully connected networks in Figure 1(b), the recurrent neural network (RNN)-based framework in Figure 1(c), the stereo-based framework in Figure 1(d), and the generative adversarial network (GAN)-based framework in Figure 1(e).

Figure 1. Illustration of DL architectures for VOE from consecutive frames: (a) CNN-based framework, (b) CNN-based framework with two fully connected networks, (c) RNN-based framework, (d) stereo-based framework, and (e) GAN-based framework

In their survey, Chen et al. [10] presented the advantages and disadvantages of DL for VOE as follows. A CNN-based framework learns features such as edges, corners, and textures well, which lets it estimate representative points between consecutive frames and eliminate irrelevant features, especially with end-to-end DL for VOE. However, the CNN-based framework often processes frames independently, without taking advantage of temporal features across consecutive frames. An RNN-based framework, with the prominent long short-term memory (LSTM) model, exploits temporal features of the frame sequence obtained from the environment, so the VOE prediction for the current state benefits from the previous states. However, this model requires a very large amount of memory to store the states of the frame sequence. A stereo-based framework is often used to estimate depth from RGB images, with good performance in low-complexity environments. However, this approach depends heavily on stereo data collected from stereo cameras. A GAN-based framework is often applied to build a real-world context dataset when labeled data of the environment is limited, so this approach can learn a self-supervised VOE model, which can fine-tune the predicted depth/optical flow results. However, this approach has a large memory cost and is difficult to train.
VOE is important in building navigation and path-finding systems for robots, autonomous vehicles, and blind people [11], [12], especially now that DL has achieved very convincing results on CV problems, both as individual modules and as end-to-end DL for VOE systems [12]–[15].
With the DL approach for VOE, the features used in the VOE process can be extracted by DL networks [14], [16]; alternatively, traditional features such as AKAZE, ORB, SIFT, and SURF [17] can be extracted first and a DL network then applied for VOE. In addition, VOE can be performed with transformer-based [18] and reinforcement learning [19] methods. To optimize DL networks for VOE, LFs [20] are often used for supervised, semi-supervised, and self-supervised training of features for VOE.
Jiang et al. [21] proposed the multi-layer fusion framework (MLF-VO-F) for VOE to fine-tune the VOE model with the RGB image as the input. MLF-VO-F used DepthNet to estimate the depth image and exploited several LFs, namely the geometry consistency loss ($L_{gc}$), the smoothness loss ($L_{smoo}$), and the photometric LF ($L_{pm}$), to supervise the training process and improve the depth image estimated for the input RGB image. It also used a regularization loss ($L_{regu}$) to control the scaling factors for channel exchange between the RGB image and the estimated depth image when combining the features of these two types of data for VOE.
Recently, the studies in [22], [23] used the mean squared error function ($L_{MSE}$) to optimize the training process of VOE models. Hwang et al. [24] proposed the aggregate LF ($L_{F2F}$), synthesized from the forward loss function ($L_{fl}$), the bi-directional LF ($L_{bd}$), and the correction LF ($L_{co}$).
MLF-VO-F [21] is currently evaluated only on the Seq. #9 and Seq. #10 frame sequences of the KITTI dataset. Recent improvements to VOE models have also focused on a few frame sequences of the KITTI dataset, such as frame-to-frame (F2F) [24], which is evaluated on Seq. #8, Seq. #9, and Seq. #10. Therefore, evaluating these models on other frame sequences of the KITTI dataset and on other datasets is necessary to confirm the robustness of the VOE model. At the same time, a suitable LF must be chosen to supervise and optimize the training process of the VOE model.
In this paper, we exploit the advantages of MLF-VO-F as a backbone for VOE and combine it with LFs: $L_{MSE}$, $L_{MSE-L2}$, the cross entropy loss ($L_{CE} = L_{vis} + L_{dyn}$), and $L_{combi}$, which is based on the component LFs: the forward loss function ($L_{fl}$), the bi-directional LF ($L_{bd}$), the correction LF ($L_{co}$), and the aggregate LF ($L_{F2F}$). The combined model is trained and evaluated on the KITTI and TQU-SLAM [25] datasets. From these results, we select the best LF for optimizing the training process of the VOE model.
Our paper makes the following main contributions: (i) proposing and testing the combination of LFs ($L_{MSE}$, $L_{MSE-L2}$, $L_{CE}$, and $L_{combi}$) with MLF-VO-F as a backbone for VOE; (ii) evaluating and comparing the combination of LFs with MLF-VO-F as a backbone against the original MLF-VO-F for VOE on the KITTI (Seq. #4, Seq. #5, Seq. #6, Seq. #7, Seq. #9, and Seq. #10) and TQU-SLAM datasets.
The paper is organized as follows. Section 1 introduces the VOE problem and related issues. The combined model of MLF-VO-F and LFs is presented in section 2. The datasets, experimental results, discussion, and challenges are presented in section 3. Finally, section 4 concludes and gives some ideas for future work.

2. METHOD
Based on the advantages and results of MLF-VO-F for VOE [21], in this paper we propose the combination of MLF-VO-F as a backbone with LFs to fine-tune the VOE model. The background on LFs and MLF-VO-F is presented in detail below.

2.1. Loss functions
VOE from image data is a regression problem in CV that outputs the future position of the camera in the environment based on the positions the model learned from previous frames. DL networks use LFs to supervise the learning process by calculating the error between the prediction and the ground truth (GT). The LF determines the difference between the predicted results and the GT data. It measures the quality of the prediction model on the observed dataset. If the model makes many mistakes, the value of the LF is large; conversely, if it predicts almost correctly, the value of the LF is lower. LFs can be used in unsupervised, supervised, semi-supervised, or self-supervised settings to optimize the VOE model during training.
The mean squared error loss ($L_{MSE}$) function [22] is a common function that calculates the square of the error as in formula (1). $L_{MSE}$ measures the average magnitude of the squared error between the GT camera motion $P_i$ and the predicted camera motion $\hat{P}_i$. This means that it pays more attention to larger errors, since squaring adds a large error value to the total value of $L_{MSE}$.

$$L_{MSE} = \| P_i - \hat{P}_i \|^2 \quad (1)$$

Additionally, Liu et al. [26] used the L1 loss in a self-supervised stereo matching loss on features of stereo data: it calculates the error between the warped stereo image and the reference image, and the error between the warped temporal image and the reference image according to the temporal model of the stereo data.
The Huber LF [27] describes the penalty imposed by an estimate $f$ by formula (2).

$$L_{\delta}(a) = \begin{cases} \frac{1}{2} a^2 & \text{for } |a| \le \delta, \\ \delta \cdot \left( |a| - \frac{1}{2}\delta \right), & \text{otherwise.} \end{cases} \quad (2)$$

Where $a$ is the difference between the ground truth data $y$ and the predicted value $f(x)$, meaning $a = y - f(x)$. $L_{\delta}(a)$ is quadratic for small values of $a$ and linear for large values, with equal values and slopes of the different parts at the two points where $|a| = \delta$.
The smooth-L1 LF ($L_{sm-L1}$) [28] is also used to calculate the error between the ground truth data $x$ and the prediction $y$, as in formula (3).

$$L_{sm-L1} = \begin{cases} 0.5 (x_n - y_n)^2 / \beta, & \text{if } |x_n - y_n| < \beta \\ |x_n - y_n| - 0.5 \beta, & \text{otherwise} \end{cases} \quad (3)$$

$L_{sm-L1}$ can be viewed as exactly the L1 loss, but with the part $|x - y| < \beta$ replaced by a quadratic function such that its slope is 1 at $|x - y| = \beta$. The quadratic part smooths the L1 loss near $|x - y| = 0$. If $\beta$ approaches 0, the smooth L1 loss converges to the L1 loss, while $L_{\delta}(a)$ converges to 0. When $\beta$ is 0, the smooth L1 loss is equivalent to the L1 loss. If $\beta$ approaches infinity, $L_{sm-L1}$ converges to 0, while $L_{\delta}(a)$ converges to $L_{MSE}$. As $\beta$ changes, the L1 segment of $L_{sm-L1}$ keeps a constant slope of 1, whereas the L1 segment of $L_{\delta}(a)$ has slope $\beta$ (taking $\delta = \beta$).
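
The relationship between formulas (2) and (3) can be checked numerically. The following is a minimal sketch in plain Python, not code from the paper; the function names `huber` and `smooth_l1` are illustrative, and the sketch assumes scalar inputs with $\delta = \beta$.

```python
def huber(a, delta):
    """Huber loss of formula (2) for a scalar residual a = y - f(x)."""
    abs_a = abs(a)
    if abs_a <= delta:
        return 0.5 * a * a                # quadratic part
    return delta * (abs_a - 0.5 * delta)  # linear part, slope delta


def smooth_l1(x, y, beta):
    """Smooth-L1 loss of formula (3) for scalar x (ground truth) and y (prediction)."""
    diff = abs(x - y)
    if beta == 0.0:
        return diff                       # degenerates to the plain L1 loss
    if diff < beta:
        return 0.5 * diff * diff / beta   # quadratic part
    return diff - 0.5 * beta              # linear part, slope 1


# With delta == beta, the smooth-L1 value equals the Huber value divided by beta.
for residual in (0.3, 1.0, 5.0):
    beta = 2.0
    assert abs(smooth_l1(residual, 0.0, beta) - huber(residual, beta) / beta) < 1e-12
```

The loop at the end illustrates why the two losses differ only by the factor $\beta$, which is what produces the different limits and slopes described above.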
Francani and Maximo [23] calculate the mean squared error LF with the L2 norm ($L_{MSE-L2}$) to optimize the VOE model training process. It is the mean squared error between all predicted motions and their GT motions, as in formula (4).

$$L_{MSE-L2} = \frac{1}{N_f - 1} \sum_{w=1}^{N_f - 1} \left\| y_w^k - \hat{y}_w^k \right\|_2^2 \quad (4)$$

Where $\| \cdot \|_2^2$ is the squared L2 norm, $y_w^k$ is the flattened 6-DoF (six degrees of freedom) relative pose in space, and $\hat{y}_w^k$ is its estimate predicted by the network.
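
As an illustration of formula (4), the sketch below averages the squared L2 distance between GT and predicted 6-DoF vectors over the $N_f - 1$ relative poses of a clip. It is a plain-Python sketch under assumptions: the function name `mse_l2_loss` is hypothetical, and the ordering of the six pose entries is not specified by the source.

```python
def mse_l2_loss(gt_poses, pred_poses):
    """Formula (4): mean of squared L2 distances between 6-DoF pose vectors.

    gt_poses and pred_poses are equal-length lists holding the N_f - 1
    relative poses of a clip, each a flat sequence of six numbers
    (the ordering of the six entries is an assumption of this sketch).
    """
    if len(gt_poses) != len(pred_poses) or not gt_poses:
        raise ValueError("need two non-empty pose lists of equal length")
    total = 0.0
    for y, y_hat in zip(gt_poses, pred_poses):
        # squared L2 norm of the difference between one GT pose and its estimate
        total += sum((a - b) ** 2 for a, b in zip(y, y_hat))
    return total / len(gt_poses)


gt = [[0.0] * 6, [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
pred = [[0.0] * 6, [0.0] * 6]
loss = mse_l2_loss(gt, pred)
```

In a training loop the same quantity would be computed on tensors so that gradients can flow; the list version here only makes the arithmetic explicit.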
Chen et al. [15] proposed LEAP-VO and a cross entropy LF ($L_{CE} = L_{vis} + L_{dyn}$). $L_{vis}$ supervises the visibility label and is calculated as in formula (5), where $V$ is the estimated visibility and $V^*$ is the GT visibility. $L_{dyn}$ supervises the dynamic track label and is calculated as in formula (6), where $m_d$ is the estimated dynamic track label and $m_d^*$ is the GT dynamic track label.

$$L_{vis} = (1 - V^*) \log(1 - V) + V^* \log V \quad (5)$$

$$L_{dyn} = (1 - m_d^*) \log(1 - m_d) + m_d^* \log m_d \quad (6)$$

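
Formulas (5) and (6) share the same binary cross entropy form, so one helper can evaluate both. The sketch below evaluates them literally as written, without a leading minus sign; an implementation that minimises the loss would normally negate the result, which is an assumption of this note and not stated in the source. Summing over tracks and the clamping constant `eps` are also assumptions, and all names are illustrative.

```python
import math


def log_likelihood_term(target, estimate, eps=1e-12):
    """(1 - t) * log(1 - p) + t * log(p), the common form of formulas (5) and (6)."""
    p = min(max(estimate, eps), 1.0 - eps)  # clamp to keep log() finite
    return (1.0 - target) * math.log(1.0 - p) + target * math.log(p)


def cross_entropy_terms(vis_gt, vis_est, dyn_gt, dyn_est):
    """L_CE = L_vis + L_dyn, summed over tracks (the summation is an assumption)."""
    l_vis = sum(log_likelihood_term(t, p) for t, p in zip(vis_gt, vis_est))
    l_dyn = sum(log_likelihood_term(t, p) for t, p in zip(dyn_gt, dyn_est))
    return l_vis + l_dyn


# Two tracks with visibility labels and one dynamic-track label.
value = cross_entropy_terms([1.0, 0.0], [0.5, 0.5], [1.0], [0.5])
```

The value is never positive and reaches 0 only when every estimate matches its label, which is why the negated form is the usual training objective.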
Hwang et al. [24] proposed the F2F method to reduce noise when estimating camera pose on the KITTI dataset, as shown in Figure 2. F2F consists of two stages. The first stage is the initial estimation, based on the combination of several encoder networks (visual geometry group (VGG), ResNet, and DenseNet) and the forward loss function ($L_{fl}$); the second stage is the error relaxation network. In the first stage, geometric features are used to approximate the camera pose prediction and are fine-tuned. In the second stage, the rotation and translation errors are reduced by rotation and translation networks, trained on geometric features using the skip method in the frame sequence. In the first stage, F2F uses the errors of the three Euler angles $\theta$ and the translation vectors $P$ to calculate the LF for fine-tuning the model, as in formula (7).

$$L_{fl} = \lambda_{\theta} \sum \| \theta - \hat{\theta} \|_2 + \sum \| P - \hat{P} \|_2 \quad (7)$$

Where $\theta$, $\hat{\theta}$ are the Euler angles in 3D space of the label and the estimated label, respectively; $P$, $\hat{P}$ are the corresponding translation vectors in 3D space; and $\lambda_{\theta}$ is the balance scale between the two spaces.
When training on the KITTI database, training is performed only in the positive direction, so the reverse direction has a large error. Therefore, F2F proposed a bi-directional LF ($L_{bd}$) in the second stage, according to formula (8).

$$L_{bd} = \sum \| G - \hat{G}_{i,i+1} \hat{G}_{i+1,i} \|_2 \quad (8)$$

where $G$ is the identity matrix, $\hat{G}_{i,i+1}$ is the result of F2F with the input images $G_i$, $G_{i+1}$, and $\hat{G}_{i+1,i}$ is the result of F2F with the input images $G_{i+1}$, $G_i$.

Figure 2. Illustration of the architecture of the two independent CNN models underlying ego-motion estimation [24]

In addition, F2F proposed a method to reduce noise when estimating camera pose, in which the neighbors of the current prediction are used in the calculation. F2F proposed a correction LF: assuming $G_{i,i+1}$ has an error $\phi_e$ as in Figure 3, the camera pose estimates at the neighboring positions, $G_{i-1,i}$ and $G_{i+1,i+2}$, can be used to reduce the error. The correction LF is calculated as in formula (9).

$$L_{co} = \sum \| G_{i-1,i+1} - \hat{G}_{i-1,i} \hat{G}_{i,i+1} \|_2 \quad (9)$$

Thus, the aggregate LF in F2F is calculated as in formula (10).

$$L_{F2F} = L_{bd} + L_{co} \quad (10)$$

Figure 3. Illustration of the calculation of the error between a pair of frames $G_{i,i+1}$ [24]

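
Formulas (8) to (10) can be illustrated on 4×4 rigid transforms. The sketch below is a plain-Python illustration and not the F2F implementation: the Frobenius norm is assumed for $\| \cdot \|_2$, and `yaw_translation`, `invert_rigid`, and the other helper names are hypothetical.

```python
import math


def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def frobenius(a, b):
    """Frobenius norm of a - b (assumed reading of the norm in formulas (8), (9))."""
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))


def identity4():
    return [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]


def yaw_translation(yaw, tx):
    """Hypothetical helper: rotation about z by yaw, translation tx along x."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, tx], [s, c, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]


def invert_rigid(g):
    """Inverse of a rigid transform: rotation R^T, translation -R^T t."""
    r_t = [[g[j][i] for j in range(3)] for i in range(3)]
    t = [-sum(r_t[i][k] * g[k][3] for k in range(3)) for i in range(3)]
    return [r_t[i] + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def bidirectional_loss(forward, backward):
    """Formula (8): forward and backward estimates should compose to the identity."""
    eye = identity4()
    return sum(frobenius(eye, matmul(f, b)) for f, b in zip(forward, backward))


def correction_loss(skip, first, second):
    """Formula (9): the skip estimate should equal the composed neighbouring estimates."""
    return sum(frobenius(s, matmul(a, b)) for s, a, b in zip(skip, first, second))


def f2f_loss(forward, backward, skip, first, second):
    """Formula (10): L_F2F = L_bd + L_co."""
    return bidirectional_loss(forward, backward) + correction_loss(skip, first, second)


g01 = yaw_translation(0.1, 1.0)
g12 = yaw_translation(0.2, 0.5)
consistent = f2f_loss([g01], [invert_rigid(g01)], [matmul(g01, g12)], [g01], [g12])
```

For perfectly consistent estimates, as constructed here, both terms vanish up to rounding; any drift between the forward and backward predictions, or between the skip and composed predictions, makes the loss positive.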
Jiang et al. [21] proposed MLF-VO-F for VOE. To optimize the training process of DepthNet depth estimation, Bian et al. [29] used the smoothness LF ($L_{smoo}$) on the RGB image to increase the difference between color pixels and increase the scene heterogeneity. $L_{smoo}$ is calculated according to formula (11).

$$L_{smoo} = \sum_{p} \left( e^{-\nabla I_a(p)} \cdot \nabla D_a(p) \right)^2 \quad (11)$$

Where $\nabla$ is the first derivative along the image's spatial directions, and the image's edges guide the smoothness.
To reduce the warping error between consecutive color frames during depth estimation on a frame sequence, the photometric LF ($L_{pm}$) is computed during unsupervised learning of the network. $L_{pm}$ is computed using formula (12).

$$L_{pm} = \frac{1}{|V|} \sum_{p \in V} \left( \lambda_i \| I_a(p) - I'_a(p) \|_1 + \lambda_s \frac{1 - SSIM_{aa'}(p)}{2} \right) \quad (12)$$

Where the SSIM function is used to calculate the element-by-element similarity between $I_a$ and $I'_a$, and $\lambda_i$, $\lambda_s$ are set to fixed values [30].
MLF-VO-F uses a smoothness loss ($L_{smoo}$) to ensure that the color pixels do not change abruptly. The output is the loss computed between adjacent color pixels at each scale (4 scales). The regularization loss ($L_{regu}$) for channel exchange is calculated according to formula (13).

$$L_{regu} = \sum_{m \in \text{self.slim.params}} \left( \| m \|_1 - 0.01 \| m - \bar{m} \|_1 \right) \quad (13)$$

Where $\| m \|_1$ is the $L_1$ regularization for parameter $m$, i.e. the sum of the absolute values of the elements in $m$; $\bar{m}$ is the average value of parameter $m$; and $\| m - \bar{m} \|_1$ is the polarization regularization, that is, the sum of the absolute values of the differences between the elements in $m$ and the mean value $\bar{m}$. The factor 0.01 adjusts the weight of the polarization regularization relative to the $L_1$ regularization.
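
A small numeric sketch of formula (13) makes the two terms concrete. It is written in plain Python on lists of scaling factors and is not taken from the MLF-VO-F source code; the function name and the `polarization_weight` argument are illustrative.

```python
def regularization_loss(scaling_factors, polarization_weight=0.01):
    """Formula (13): sum over parameter groups m of ||m||_1 - w * ||m - mean(m)||_1.

    scaling_factors is a list of non-empty lists of floats, one list per
    parameter group m; polarization_weight is the factor 0.01 of the formula.
    """
    total = 0.0
    for m in scaling_factors:
        mean = sum(m) / len(m)
        l1 = sum(abs(v) for v in m)                  # L1 regularization
        polarization = sum(abs(v - mean) for v in m)  # spread around the mean
        total += l1 - polarization_weight * polarization
    return total


uniform = regularization_loss([[1.0, 1.0]])    # no spread around the mean
polarized = regularization_loss([[0.0, 2.0]])  # same L1 norm, larger spread
```

Both groups have the same $L_1$ norm, but the polarized group receives a slightly lower loss, which shows how the second term rewards scaling factors that move apart from their mean.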
During training, the LF ($L_{total}$) is optimized as in formula (14).

$$L_{total} = L_{pm} + e^{-2} L_{gc} + e^{-3} L_{smoo} + e^{-5} L_{regu} \quad (14)$$

In this paper, we propose a combination LF ($L_{combi}$) to optimize the self-supervised training model based on MLF-VO-F. $L_{combi}$ is calculated as in formula (15).

$$L_{combi} = L_{total} + e^{-6} L_{F2F} \quad (15)$$

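
The weighting in formulas (14) and (15) can be sketched as follows. The sketch reads the coefficients $e^{-k}$ literally as `math.exp(-k)`; this is an assumption, since the notation could also denote $10^{-k}$, and the function names are illustrative and do not come from the MLF-VO-F code.

```python
import math


def total_loss(l_pm, l_gc, l_smoo, l_regu):
    """Formula (14), with e^{-k} read literally as exp(-k) (an assumption)."""
    return l_pm + math.exp(-2) * l_gc + math.exp(-3) * l_smoo + math.exp(-5) * l_regu


def combined_loss(l_pm, l_gc, l_smoo, l_regu, l_bd, l_co):
    """Formula (15): L_combi = L_total + e^{-6} * L_F2F, with L_F2F = L_bd + L_co."""
    l_f2f = l_bd + l_co  # formula (10)
    return total_loss(l_pm, l_gc, l_smoo, l_regu) + math.exp(-6) * l_f2f


# Illustrative component values only; they are not results from the paper.
example = combined_loss(l_pm=0.8, l_gc=0.2, l_smoo=0.1, l_regu=3.0, l_bd=0.5, l_co=0.4)
```

Under this reading the photometric term dominates and $L_{F2F}$ enters with the smallest weight, so it acts as a mild additional constraint on the pose estimates.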

2.2. MLF-VO-F backbone for VOE
Many visual SLAM and VOE models have recently been based on the DL method. This paper exploits MLF-VO-F [21] as a backbone and combines it with LFs to fine-tune the VOE model on the KITTI and TQU-SLAM datasets. MLF-VO-F was proposed by Jiang et al. [21] with a combination of different fusion strategies to estimate ego-motion from RGB images and depth images obtained from depth estimation. MLF-VO-F uses DepthNet to estimate the depth image corresponding to each color image/frame, as shown on the left side of Figure 4. Given consecutive video frames $I_t$, $I_{t+1}$ as input, the network first estimates the depth image corresponding to each input frame: $D_t = \theta_{depth}(I_t)$, $D_{t+1} = \theta_{depth}(I_{t+1})$. DepthNet is built on the structure of U-Net.

Figure 4. Illustration of the architecture of the two independent CNN models underlying ego-motion estimation [21]

To smooth out the color pixels between consecutive frames in the input frame sequence, MLF-VO-F uses a smoothness loss ($L_{smoo}$) to ensure that they do not change abruptly. The input is a pair of consecutive RGB images and a disparity map [31]–[33]. The output is the loss computed between adjacent color pixels at each scale (4 scales).
Consistency between frames helps transfer consistency to the entire frame sequence, creating scale-consistency for the entire sequence [31], [32]. To ensure this, the geometry consistency loss ($L_{gc}$) is used to calculate the loss between the current depth frame and the next depth frame. The input is the pixels of the current depth image and the pixels of the next depth image. The output is the loss calculated at each scale (4 scales).
To reduce the impact of outliers, the photometric LF ($L_{pm}$) is calculated based on $L_1$. The $L_1$ loss calculates the total absolute difference between the predicted results and the original data, making it less sensitive to outliers than the $L_2$ loss [31]–[33]. This function is used to calculate the loss between the current RGB frame and the next RGB frame. The input is the pixels in the current RGB image and the pixels in the next RGB image. The output is the loss calculated at each scale (4 scales).
To reduce and control the number of parameters $m$ in the model training process, the regularization loss ($L_{regu}$) for channel exchange is calculated according to formula (13), with the input being the weight parameters initialized before the training process.
The channel exchange (CE) process during MLF-VO-F training exchanges features between streams and synthesizes the LF $L_{total}$ as in formula (14), thereby helping to overcome the problems of missing data, noisy data, and inconsistent data. As a result, the entire learning data is exploited more fully and the trained model predicts VOE more accurately.
In particular, MLF-VO-F includes two main tasks in two stages. The first stage uses the baseline framework to estimate ego-motion with two independent CNN models for depth prediction and pose estimation, as illustrated in Figure 4. At this stage, MLF-VO-F uses the fully convolutional U-Net to predict depths at four scales. The second stage is relative pose estimation based on MLF-VO-F, with a multi-layer fusion strategy that combines features appearing in several intermediate layers of the encoder. To encode features from color and depth images, MLF-VO-F includes two structural streams. The CE strategy swaps components between the streams according to their importance, to combine features at multiple levels. In both streams, ResNet-18 [34] is used as the encoder.
To build an end-to-end DL network that learns automatically, MLF-VO-F has a self-learning mechanism with an LF ($L_{total}$) combined with the process of depth prediction and relative pose estimation, as illustrated in Figure 5.
In this paper, we are only interested in fine-tuning the VOE model using backbones such as ResNet-18. We use ResNet-18 as the backbone to encode the features extracted from color images because it has enough layers to provide good accuracy with fast computation time. We conducted experiments comparing several backbones for encoding features: VGG-16 has faster computation time but lower accuracy than ResNet-18 and ResNet-34 [35]; ResNet-50, ResNet-101, and ResNet-152 have slightly better accuracy than ResNet-18 and ResNet-34 but increased computation time; and ResNet-18 has higher accuracy than DenseNet-121 [36].

Figure 5. LF of MLF-VO-F for self-learning process [21]

MLF-VO-F [21] combines features at the early, middle, and late stages of the depth estimation process to detect keypoints between consecutive frames. The extracted features are based on DeepNet, with low-level textures, edges, and deeper high-level semantic features. MLF-VO-F is tested on the KITTI dataset and shows good performance on data with complex scenes and sudden lighting changes. The KITTI dataset is collected in an outdoor environment, so the scene and lighting are very complex.
In MLF-VO-F, a self-supervised learning mechanism monitors the training process of the VOE model by using LFs to calculate the error between the GT and the current VOE. This mechanism reduces the impact of external parameters on the operation of the model, thus increasing its adaptability to practical applications. However, MLF-VO-F also has limitations, such as requiring large memory and parallel computing resources, and poor results on small datasets.

2.3. Comparative study based on loss functions
In this paper, we study the impact of the LF on the training process of the VOE model. We propose a model that combines the MLF-VO-F backbone with LFs, and evaluate it, as shown in Figure 6. The combination includes the MLF-VO-F backbone and the LFs ($L_{MSE}$, $L_{MSE-L2}$, $L_{CE}$, and $L_{combi}$). The parameters of the MLF-VO-F backbone model are kept the same as in the original MLF-VO-F.

Figure 6. Combined model of MLF-VO-F as a backbone and LF for VOE

3. RESULTS AND DISCUSSION
3.1. Data collection
KITTI dataset: the KITTI dataset [37] is the most popular database for evaluating visual SLAM and VOE models and algorithms. The KITTI dataset is collected with two high-resolution camera systems (grayscale and color), a Velodyne HDL-64E laser scanner, and a state-of-the-art OXTS RT 3003 localization system (a combination of devices such as GPS, GLONASS, an IMU, and RTK correction signals). These devices are mounted on a car and collect data over a distance of 39.2 km. The resolution of the images is 1240 × 376 pixels. The GT data for evaluating visual SLAM and VOE models includes three-dimensional (3D) pose annotations of the scene. The GT data for evaluating object detection and 3D orientation estimation models includes accurate 3D bounding boxes for object classes. The point cloud data of 3D objects is labeled manually.
In the improved version of the KITTI dataset [37], additional data was developed to evaluate optical flow algorithms. The authors used 3D CAD models from the Google 3D Warehouse database to build 3D scenes with static elements and inserted moving objects.
In this paper, we only use the frame sequences that have ground truth trajectories, from the 0th sequence (Seq. #0) to the 10th sequence (Seq. #10): Seq. #0, Seq. #1, Seq. #2, Seq. #3, Seq. #4, Seq. #5, Seq. #6, Seq. #7, Seq. #8, Seq. #9, and Seq. #10.
TQU-SLAM dataset: the data collection was performed 4 times (1ST, 2ND, 3RD, 4TH). Each time, the movement following the blue arrow was in the forward direction (FO-D), and the movement following the red arrow was in the opposite direction (OP-D). We cross-divide TQU-SLAM [25] into 8 subsets, splitting the training and testing data in a cross-split form as follows:
Sub #1: 1ST-FO-D (21,333 frames), 2ND-FO-D (19,992 frames), and 3RD-FO-D (17,995 frames) for training, and 4TH-FO-D (17,885 frames) for testing;
Sub #2: 1ST-OP-D (22,948 frames), 2ND-OP-D (21,116 frames), and 3RD-OP-D (20,814 frames) for training, and 4TH-OP-D (18,548 frames) for testing;
Sub #3: 1ST-FO-D, 2ND-FO-D, and 4TH-FO-D for training, and 3RD-FO-D for testing;
Sub #4: 1ST-OP-D, 2ND-OP-D, and 4TH-OP-D for training, and 3RD-OP-D for testing;
Sub #5: 1ST-FO-D, 3RD-FO-D, and 4TH-FO-D for training, and 2ND-FO-D for testing;
Sub #6: 1ST-OP-D, 3RD-OP-D, and 4TH-OP-D for training, and 2ND-OP-D for testing;
Sub #7: 2ND-FO-D, 3RD-FO-D, and 4TH-FO-D for training, and 1ST-FO-D for testing;
Sub #8: 2ND-OP-D, 3RD-OP-D, and 4TH-OP-D for training, and 1ST-OP-D for testing.
Following statistical theory and machine learning practice, the VOE model is trained and tested on all subsets. In each subset, about 75% of the data is used for training the model and about 25% for testing it. This ratio is reasonable statistically and for machine learning problems.
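
The cross-split and its approximate 75%/25% ratio can be reproduced from the frame counts listed for the eight acquisitions. The sketch below is plain Python written for this illustration; the dictionary layout and the function name `make_subsets` are assumptions, and only the frame counts and subset ordering come from the text.

```python
# Frame counts per acquisition, taken from the description of Sub #1 and Sub #2.
FRAMES = {
    "FO-D": {"1ST": 21333, "2ND": 19992, "3RD": 17995, "4TH": 17885},
    "OP-D": {"1ST": 22948, "2ND": 21116, "3RD": 20814, "4TH": 18548},
}


def make_subsets(frames=FRAMES):
    """Build Sub #1 to Sub #8: hold out one acquisition per direction for testing."""
    subsets = []
    for held_out in ("4TH", "3RD", "2ND", "1ST"):  # order used for Sub #1 .. Sub #8
        for direction in ("FO-D", "OP-D"):
            counts = frames[direction]
            train = [run for run in counts if run != held_out]
            train_frames = sum(counts[run] for run in train)
            test_frames = counts[held_out]
            subsets.append({
                "name": f"Sub #{len(subsets) + 1}",
                "train": [f"{run}-{direction}" for run in train],
                "test": f"{held_out}-{direction}",
                "train_frames": train_frames,
                "test_frames": test_frames,
                "test_ratio": test_frames / (train_frames + test_frames),
            })
    return subsets


for subset in make_subsets():
    print(subset["name"], subset["test"], round(subset["test_ratio"], 3))
```

Because the four acquisitions per direction differ in length, the test share varies slightly from subset to subset around the nominal 25%.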
Since MLF-VO-F accepts input images of size 640 × 192 pixels, we resize the RGB-D images of TQU-SLAM to 640 × 192 pixels. In this paper, we use MLF-VO-F as a backbone and combine it with the LFs to fine-tune the VOE model on TQU-SLAM.
The MLF-VO-F source code is developed in Python 3.x and run on Ubuntu 18.04 with PyTorch 1.7.1 and CUDA 10.1. We used the code at https://github.com/Beniko95J/MLF-VO on a computer with the following configuration: CPU i5-12400F, 16 GB DDR4 RAM, and GPU RTX 3060 12 GB. We fine-tune the VOE model for 20 epochs, with the default MLF-VO-F parameters.

3.2. Evaluation metrics
To evaluate the results of VOE, we calculate the trajectory error ($Err_d$), which is the distance error between the GT trajectory $\widehat{AT}_i$ and the estimated motion trajectory $AT_i$. $Err_d$ is calculated according to formula (16).

$$Err_d = \frac{1}{N} \sqrt{\| AT_i - \widehat{AT}_i \|^2} \quad (16)$$

Where $N$ is the number of frames in the frame sequence used to estimate the camera's motion trajectory.
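
A short sketch of formula (16) follows. It reads the formula as the mean Euclidean distance between estimated and GT positions over the $N$ frames; the summation over $i$ is an assumption of this sketch, because the formula as printed shows no explicit sum, and the function name is illustrative.

```python
import math


def trajectory_error(estimated, ground_truth):
    """Mean Euclidean distance between estimated and GT camera positions.

    estimated and ground_truth are equal-length sequences of 3D points,
    one point per frame (AT_i and its GT counterpart in formula (16)).
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("need two non-empty trajectories of equal length")
    distances = [math.dist(p, q) for p, q in zip(estimated, ground_truth)]
    return sum(distances) / len(distances)


# Two frames: the first matches the GT, the second is 5 m away from it.
err = trajectory_error([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                       [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)])
```

Unlike $ATE$, this quantity does not align the two trajectories first, so a constant offset between them contributes fully to the error.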
We also calculate the absolute trajectory error ($ATE$) [38], which is the distance error between the GT trajectory $\widehat{AT}_i$ and the estimated motion trajectory $AT_i$, aligned with an optimal $SE(3)$ pose $T$. $ATE$ is calculated according to formula (17).

$$ATE = \min_{T \in SE(3)} \frac{1}{N} \sqrt{\sum_{i \in I_{gt}} \| T \, AT_i - \widehat{AT}_i \|^2} \quad (17)$$

Where
N
is
the
number
of
frames
in
the
e
v
aluation
frame
sequence.
$T_{rel}$ is the average translational $RMSE$ drift (%) on a length of 100 m-800 m [21]. $R_{rel}$ is the average rotational $RMSE$ drift (°/100 m) on a length of 100 m-800 m [21]. In addition, we evaluate the VOE results using the $RMSE$ measure. $RMSE$ is the standard deviation of the residuals (prediction errors) between the GT motion trajectory and the estimated motion trajectory. We also evaluate the VOE results with the relative translation error ($RTE$ (m)) and relative rotation error ($RPE$ (deg)) metrics, as presented in [15].
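The RMSE measure described above can be sketched as follows. This is an illustration only: it assumes the residual of a frame is the Euclidean distance between the estimated and GT positions, and the function name `trajectory_rmse` is hypothetical.

```python
import math


def trajectory_rmse(estimated, ground_truth):
    """Root mean square of the per-frame position residuals."""
    if len(estimated) != len(ground_truth):
        raise ValueError("trajectories must have the same number of frames")
    if not estimated:
        raise ValueError("trajectories must not be empty")
    squared_sum = 0.0
    for est, gt in zip(estimated, ground_truth):
        # Squared Euclidean residual of one frame
        squared_sum += sum((e - g) ** 2 for e, g in zip(est, gt))
    return math.sqrt(squared_sum / len(estimated))


# Residuals of 0 m and 5 m give sqrt((0 + 25) / 2).
est_positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt_positions = [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]
rmse = trajectory_rmse(est_positions, gt_positions)
```

Because the residuals are squared before averaging, a few frames with large drift raise the RMSE more than they raise a plain mean distance.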
3.3. Results and discussions
Table 1 presents the VOE evaluation results on Seq. #4, Seq. #5, Seq. #6, Seq. #7, Seq. #9, and Seq. #10 of the KITTI dataset for the original MLF-VO-F and for the MLF-VO-F backbone combined with $L_{MSE}$ (MLF-VO-F + $L_{MSE}$), with $L_{CE}$ (MLF-VO-F + $L_{CE}$), with $L_{MSE-L2}$ (MLF-VO-F + $L_{MSE-L2}$), and with $L_{combi}$ (MLF-VO-F + $L_{combi}$). The best result for each method and metric is highlighted.
The results also show that the original MLF-VO-F has the best results on Seq. #9 and Seq. #10 on the $R_{err}$ measure. The best evaluation results on Seq. #4, Seq. #5, Seq. #6, Seq. #7, and Seq. #10 are obtained with the MLF-VO-F + $L_{combi}$ method on the $T_{err}$ and $R_{err}$ measures.
In Table 1, MLF-VO-F + $L_{MSE}$ and MLF-VO-F + $L_{MSE-L2}$ have the largest errors. For example, MLF-VO-F + $L_{MSE}$ has $ATE = 133.36$ m and $T_{err} = 17.41$% on Seq. #9, which is a very large error compared with the best method on the $ATE$ measure (MLF-VO-F, $ATE = 9.86$ m).
The comparison of the motion trajectories estimated by MLF-VO-F + $L_{MSE}$, MLF-VO-F + $L_{MSE-L2}$, and MLF-VO-F + $L_{CE}$ on Seq. #7, Seq. #9, and Seq. #10 of the KITTI dataset is shown in Figure 7.
The $L_{CE} = L_{vis} + L_{dyn}$ LF (formulas (5) and (6)) is an important LF for optimizing the training process of the model in [15] for VOE on the MPI Sintel [39] and Replica [40] datasets; that model performs best when compared with models such as DROID-SLAM [41] and DytanVO [16]. The results also show that the $L_{CE}$ LF has a large impact on MLF-VO-F when training the VOE model on the KITTI dataset.
The $L_{combi}$ LF (formula (15)) combines the advantages of the $L_{total}$ LF (formula (14)) of the original MLF-VO-F and the $L_{F2F}$ LF (formula (10)) of F2F, which are the best LFs in MLF-VO-F and F2F for VOE, respectively. Therefore, the $L_{combi}$ LF gives the best results on the KITTI dataset.
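Table 1. VOE evaluation results of the original MLF-VO-F

| Method | Sequence | $T_{err}$ (%) | $R_{err}$ (deg/100 m) | $ATE$ (m) | $RTE$ (m) | $RPE$ (deg) |
|---|---|---|---|---|---|---|
| MLF-VO-F | Seq. #9 | 3.9 | 1.41 | 9.86 | - | - |
| MLF-VO-F | Seq. #10 | 4.88 | 1.38 | 7.36 | - | - |
| MLF-VO-F + $L_{CE}$ | Seq. #9 | 5.88 | 2.127 | 15.22 | 0.075 | 0.07 |
| MLF-VO-F + $L_{CE}$ | Seq. #10 | 6.73 | 2.124 | 9.34 | 0.06 | 0.09 |
| MLF-VO-F + $L_{MSE}$ | Seq. #9 | 17.41 | 6.66 | 133.36 | 0.09 | 0.10 |
| MLF-VO-F + $L_{MSE}$ | Seq. #10 | 12.99 | 5.957 | 32.27 | 0.08 | 0.11 |
| MLF-VO-F + $L_{MSE-L2}$ | Seq. #9 | 8.99 | 2.91 | 35.18 | 0.08 | 0.09 |
| MLF-VO-F + $L_{MSE-L2}$ | Seq. #10 | 8.99 | 3.03 | 9.744 | 0.07 | 0.1 |
| MLF-VO-F + $L_{combi}$ | Seq. #4 | 2.21 | 0.97 | - | - | - |
| MLF-VO-F + $L_{combi}$ | Seq. #5 | 2.67 | 1.18 | - | - | - |
| MLF-VO-F + $L_{combi}$ | Seq. #6 | 3.59 | 1.65 | - | - | - |
| MLF-VO-F + $L_{combi}$ | Seq. #7 | 1.01 | 0.67 | - | - | - |
| MLF-VO-F + $L_{combi}$ | Seq. #10 | 4.62 | 1.89 | - | - | - |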
Francani and Maximo [23] evaluated the error function on the 11 sequences of the KITTI dataset. Their best results were $t_{err} = 3.105$%, $r_{err} = 1.063$ (deg/100 m), and $ATE = 37.431$ m on Seq. #02, and $t_{err} = 9.867$%, $r_{err} = 4.295$ (deg/100 m), and $ATE = 8.696$ m on Seq. #03; on the other frame sequences, $L_{MSE}$ had lower results when combined with the $L_{MC}$ LF. It can therefore be seen that $L_{MSE}$ still has a large error in optimizing the training process of a DL-based model. Consequently, $L_{MSE}$ combined with MLF-VO-F has the highest error compared to the other LFs.
The VOE result on Seq. #9 in Figure 7 has the largest error when estimated with the MLF-VO-F + $L_{MSE}$ method, which is consistent with the result in Table 1 ($ATE = 133.36$ m).
Figure 7. The comparison results of VOE based on the combination of the MLF-VO-F backbone with the $L_{MSE}$ LF (MLF-VO-F + $L_{MSE}$) (Ours), the $L_{MSE-L2}$ LF (MLF-VO-F + $L_{MSE-L2}$) (Ours), the $L_{CE}$ LF (MLF-VO-F + $L_{CE}$) (Ours), and the GT VO (blue) on Seq. #7, Seq. #9, and Seq. #10 of the KITTI dataset