Indonesian Journal of Electrical Engineering and Computer Science
Vol. 39, No. 3, September 2025, pp. 1571-1586
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v39.i3.pp1571-1586
Journal homepage: http://ijeecs.iaescore.com

Combination of MLF-VO-F and loss functions for VOE from RGB image sequence using deep learning

Van-Hung Le 1, Huu-Son Do 1, Thi-Ha-Phuong Nguyen 1, Van-Thuan Nguyen 2, Tat-Hung Do 2
1 Department of Information Technology, Tan Trao University, Tuyen Quang, Vietnam
2 Faculty of Engineering Technology, Hung Vuong University, Viet Tri City, Vietnam

Article Info
Article history: Received Feb 21, 2025; Revised Apr 8, 2025; Accepted Jul 2, 2025
Keywords: Comparative study; Deep learning; Loss functions; MLF-VO-F; RGB image sequence; Visual odometry

ABSTRACT
Visual odometry estimation (VOE) is important in building navigation and path-finding systems. It helps entities find their way and estimate paths in the environment. Most computer vision (CV)-based VOE models are evaluated and compared on the KITTI dataset. The multi-layer fusion framework (MLF-VO-F) of Jiang et al. achieved good VOE results from red, green, and blue (RGB) image sequences, using the DeepNet to extract low-level textures, edges, and deeper high-level semantic features for estimating motion between consecutive frames. This paper proposes a combined model that uses MLF-VO-F as a backbone together with loss functions (LFs) ($L_{MSE}$, $L_{MSE}^{L2}$, $L_{CE}$, and $L_{combi}$) to optimize and supervise the training process of the VOE model. We evaluated and compared the effectiveness of the LFs for VOE against the original MLF-VO-F on the KITTI and TQU-SLAM datasets, and from there chose the appropriate LF to combine with the backbone for VOE. The evaluation results on the KITTI dataset show that $L_{CE}$ ($RTE$ is 0.075 m and 0.06 m on Seq. #9 and Seq. #10, respectively) and $L_{combi}$ ($t_{rel}$ is 2.21%, 2.67%, 3.59%, 1.01%, and 4.62% on Seq. #4, Seq. #5, Seq. #6, Seq. #7, and Seq. #10, respectively) have the lowest errors, and $L_{MSE}$ has the highest errors ($ATE$ is 133.36 m on Seq. #9).

This is an open access article under the CC BY-SA license.

Corresponding Author: Tat-Hung Do, Faculty of Engineering Technology, Hung Vuong University, Nong Trang, Viet Tri City, Phu Tho, Vietnam. Email: dotathung@hvu.edu.vn

1. INTRODUCTION
Visual odometry estimation (VOE) is one of the two important problems of visual SLAM and a long-studied problem in computer vision (CV) and robotics. VOE focuses mainly on local consistency: it incrementally estimates the camera pose path after each pose and can perform local optimization. Visual SLAM estimates the entire scene/map together with the camera trajectory (VOE). Visual SLAM therefore includes the VOE problem, which helps robots, or software that supports visually impaired people, to estimate the direction and path of movement in the environment, especially in new environments. The data used to build the VOE can be collected from IMU [1], [2], LiDAR [3]–[5], or image sensors. Data obtained from image sensors (RGB, depth, and stereo) can be used to build VOE at a reasonable cost.

Previously, VOE [6] could be implemented with traditional geometry-based methods. These methods use a keypoint detector to identify the salient points (keypoints) in the image, and feature vectors or descriptors are computed from the local region around each keypoint. Keypoints are tracked to establish correspondence between different views (or image frames) through descriptor matching. For example, PTAM [7] used a FAST corner detector to detect keypoints in the image, and ORB-SLAM [8] used oriented FAST and rotated BRIEF descriptors to perform the tracking, mapping, and loop closure steps of the VOE model. A geometry-based VOE model usually includes modules such as feature extraction, feature matching, pose estimation, and local optimization [9]. Deep learning (DL) [6], in contrast, uses recurrent convolutional neural networks (CNN) to extract motion features, representative points, or the relative pose between consecutive frames. In this model, deep neural networks [9] can replace the feature extraction, feature matching, and pose estimation modules of the traditional approach. With the DL-based approach, VOE can be implemented with the network architectures shown in Figure 1: a CNN-based framework (Figure 1(a)), a CNN-based framework with two fully connected networks (Figure 1(b)), a recurrent neural network (RNN)-based framework (Figure 1(c)), a stereo-based framework (Figure 1(d)), and a generative adversarial network (GAN)-based framework (Figure 1(e)).

Figure 1. Illustration of DL architectures for VOE from consecutive frames: (a) CNN-based framework, (b) CNN-based framework with two fully connected networks, (c) RNN-based framework, (d) stereo-based framework, and (e) GAN-based framework

In their survey, Chen et al. [10] presented the advantages and disadvantages of DL for VOE as follows.
A CNN-based framework has the advantage of learning features such as edges, corners, and textures well enough to estimate representative points between consecutive frames, and it can eliminate irrelevant features, especially with end-to-end DL for VOE. However, the CNN-based framework often processes frames independently, without taking advantage of temporal features across consecutive frames. An RNN-based framework, with the prominent long short-term memory (LSTM) model, has the advantage of exploiting temporal features of the frame sequence obtained from the environment, so the VOE prediction for the current state benefits from the previous states. However, this model has the disadvantage of requiring a very large amount of memory to store the states of the frame sequence. A stereo-based framework is often used to estimate depth from RGB images, with good performance in low-complexity environments. However, this approach depends heavily on stereo data collected from stereo cameras. A GAN-based framework is often applied to build a real-world context dataset when labeled data of the environment is limited, so this approach can learn a self-supervised VOE model, which can fine-tune the predicted depth/optical flow results. However, this approach requires a large memory cost and is difficult to train.

VOE is important in building navigation and path-finding systems for robots, autonomous vehicles, and blind people in the environment [11], [12]. DL now gives very convincing results in solving CV problems, both as individual modules and as end-to-end DL for VOE systems [12]–[15].
With the DL approach for VOE, the features used in the VOE process can be extracted through DL networks [14], [16], or traditional features such as AKAZE, ORB, SIFT, and SURF [17] can be extracted first and a DL network then applied for VOE. In addition, the VOE process can be performed with transformer-based [18] and reinforcement learning [19] methods. However, to optimize DL networks for VOE, LFs [20] are often used for supervised, semi-supervised, and self-supervised training of features for VOE. Jiang et al. [21] proposed the multi-layer fusion framework (MLF-VO-F) for VOE to fine-tune the VOE model with the RGB image as the input. MLF-VO-F used DepthNet to estimate the depth image and exploited several LFs, namely the geometry consistency loss ($L_{gc}$), the smoothness loss ($L_{smoo}$), and the photometric LF ($L_{pm}$), to supervise the training process and improve the depth image estimated for the input RGB image. It also used a regularization loss ($L_{regu}$), synthesized with the other LFs, to control the scaling factors for channel exchange between the RGB image and the estimated depth image when combining the features of these two types of data for VOE. Recently, the studies in [22], [23] used the mean squared error function ($L_{MSE}$) to optimize the training process of VOE models. In the study of Hwang et al. [24], the aggregate LF ($L_{F2F}$) was synthesized from the forward loss ($L_{fl}$), the bi-directional LF ($L_{bd}$), and the correction LF ($L_{co}$).

MLF-VO-F [21] is currently evaluated only on the Seq. #9 and Seq. #10 frame sequences of the KITTI dataset. Recent improvements to VOE models have also focused on a few frame sequences of the KITTI dataset; for example, frame-to-frame (F2F) [24] is evaluated on Seq. #8, Seq. #9, and Seq. #10.
Therefore, evaluating these models on other frame sequences of the KITTI dataset and on other datasets is necessary to confirm the robustness of the VOE model and, at the same time, to choose a suitable LF to supervise and optimize its training process. In this paper, we exploit the advantages of MLF-VO-F as a backbone for VOE and combine it with the LFs $L_{MSE}$, $L_{MSE}^{L2}$, the cross entropy loss ($L_{CE} = L_{vis} + L_{dyn}$), and $L_{combi}$, which is based on the component LFs: the forward loss ($L_{fl}$), the bi-directional LF ($L_{bd}$), the correction LF ($L_{co}$), and the aggregate LF ($L_{F2F}$). The combined model is trained and evaluated on the KITTI and TQU-SLAM [25] datasets. From there, we select the best LF for optimizing the training of the VOE model.

Our paper includes the following main contributions: (i) proposing and testing the combination of LFs ($L_{MSE}$, $L_{MSE}^{L2}$, $L_{CE}$, and $L_{combi}$) with MLF-VO-F as a backbone for VOE; (ii) evaluating and comparing the combination of LFs with MLF-VO-F as a backbone against the original MLF-VO-F for VOE on the KITTI (Seq. #4, Seq. #5, Seq. #6, Seq. #7, Seq. #9, and Seq. #10) and TQU-SLAM datasets.

The paper is organized as follows. Section 1 introduces the VOE problem and related issues. The combined model of MLF-VO-F and LFs is presented in section 2. The datasets, experimental results, discussion, and challenges are presented in section 3. Finally, section 4 concludes and gives some ideas for future work.

2. METHOD
Based on the advantages and results of MLF-VO-F for VOE [21], we propose the combination of MLF-VO-F as a backbone with LFs to fine-tune the VOE model. The background of the LFs and the details of MLF-VO-F are presented below.

2.1. Loss functions
VOE from image data is a regression problem in CV that outputs the future position of the camera in the environment based on the positions learned by the model trained on previous frames. DL networks use LFs to supervise the learning process by calculating the error between the prediction and the ground truth (GT). The LF determines the difference between the predicted results and the GT data, and thus measures the quality of the prediction model on the observed dataset. If the model makes many mistakes, the value of the LF is large; conversely, if it predicts almost correctly, the value of the LF is low. LFs can be used in unsupervised, supervised, semi-supervised, or self-supervised settings to optimize the VOE model during training.

The mean squared error loss ($L_{MSE}$) [22] is a common function that computes the squared error as in formula (1). $L_{MSE}$ measures the average magnitude of the squared error between the GT camera motion $P_i$ and the predicted camera motion $\hat{P}_i$. Because the errors are squared, larger errors contribute disproportionately to the total value of $L_{MSE}$.

$$L_{MSE} = \lVert P_i - \hat{P}_i \rVert^2 \tag{1}$$
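As a minimal sketch of formula (1), the following Python snippet averages the squared Euclidean error over a batch of pose vectors. The function name `mse_pose_loss`, the list-of-lists pose layout, and the averaging over frames are assumptions made for illustration; they are not taken from the MLF-VO-F code.

```python
def mse_pose_loss(gt_poses, pred_poses):
    """Mean over frames of the squared Euclidean error ||P_i - P_hat_i||^2.

    gt_poses, pred_poses: equally long lists of pose vectors (lists of floats).
    This layout is a hypothetical choice for the illustration.
    """
    if len(gt_poses) != len(pred_poses) or not gt_poses:
        raise ValueError("pose lists must be non-empty and of equal length")
    total = 0.0
    for p, p_hat in zip(gt_poses, pred_poses):
        # Squared L2 norm of the per-frame error vector.
        total += sum((a - b) ** 2 for a, b in zip(p, p_hat))
    return total / len(gt_poses)


# One frame with error vector (1, 2, 2): squared norm is 1 + 4 + 4 = 9.
single = mse_pose_loss([[0.0, 0.0, 0.0]], [[1.0, 2.0, 2.0]])

# Two frames with errors of magnitude 1 and 3: squared errors are 1 and 9.
pair = mse_pose_loss([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
                     [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
```

In the second call the frame with error 3 contributes nine times as much as the frame with error 1, which illustrates why the squared error emphasizes large errors.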
Additionally, Liu et al. [26] used the L1 loss in two ways: as a self-supervised stereo matching loss, computing the error between the warped stereo image and the reference image, and as a temporal loss, computing the error between the warped temporal image and the reference image according to the temporal model of the stereo data. The Huber LF [27] describes the penalty incurred by an estimate $f$, as in formula (2).

$$L_{\delta}(a) = \begin{cases} \frac{1}{2}a^{2}, & \text{for } |a| \le \delta \\ \delta \cdot \left(|a| - \frac{1}{2}\delta\right), & \text{otherwise} \end{cases} \tag{2}$$

where $a$ is the difference between the ground truth $y$ and the predicted value $f(x)$, i.e. $a = y - f(x)$. $L_{\delta}(a)$ is quadratic for small values of $a$ and linear for large values, with equal values and slopes of the two parts at the two points where $|a| = \delta$. The smooth-L1 LF ($L_{smL1}$) [28] is also used to calculate the error between the ground truth $x$ and the prediction $y$, as in formula (3).

$$L_{smL1} = \begin{cases} 0.5\,(x_n - y_n)^{2} / beta, & \text{if } |x_n - y_n| < beta \\ |x_n - y_n| - 0.5 \cdot beta, & \text{otherwise} \end{cases} \tag{3}$$

$L_{smL1}$ can be viewed as exactly the L1 loss, but with the part $|x - y| < beta$ replaced by a quadratic function such that its slope is 1 at $|x - y| = beta$. The quadratic part smooths the L1 loss near $|x - y| = 0$. As $beta$ approaches 0, $L_{smL1}$ converges to the L1 loss, while $L_{\delta}(a)$ converges to 0; when $beta$ is 0, $L_{smL1}$ is equivalent to the L1 loss. As $beta$ approaches infinity, $L_{smL1}$ converges to 0, while $L_{\delta}(a)$ converges to $L_{MSE}$. As $beta$ varies, the L1 segment of $L_{smL1}$ keeps a constant slope of 1, whereas the L1 segment of $L_{\delta}(a)$ has slope $beta$ (taking $\delta = beta$).

Francani and Maximo [23] used the mean squared error LF with the L2 norm ($L_{MSE}^{L2}$) to optimize the VOE model training process. It is the mean squared error between all predicted motions and their GT motions, as in formula (4).

$$L_{MSE}^{L2} = \frac{1}{N_f - 1}\sum_{w=1}^{N_f - 1} \lVert y_k^w - \hat{y}_k^w \rVert_2^2 \tag{4}$$

where $\lVert \cdot \rVert_2^2$ is the squared L2 norm.
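The relationship between the Huber loss of formula (2) and the smooth-L1 loss of formula (3) can be checked numerically. The sketch below implements both piecewise definitions for a scalar residual; the function names `huber` and `smooth_l1` are hypothetical. For $\delta = beta > 0$, dividing formula (2) by $beta$ gives formula (3) term by term, which is why the two losses differ only in how their linear segment is scaled.

```python
import math


def huber(a, delta=1.0):
    """Huber loss of formula (2) for a scalar residual a = y - f(x)."""
    abs_a = abs(a)
    if abs_a <= delta:
        return 0.5 * a * a                 # quadratic part
    return delta * (abs_a - 0.5 * delta)   # linear part with slope delta


def smooth_l1(a, beta=1.0):
    """Smooth-L1 loss of formula (3) for a scalar residual a = x_n - y_n."""
    abs_a = abs(a)
    if beta == 0.0:
        return abs_a                       # degenerates to the plain L1 loss
    if abs_a < beta:
        return 0.5 * a * a / beta          # quadratic part
    return abs_a - 0.5 * beta              # linear part with slope 1


# Compare the two losses on residuals inside, on, and outside the threshold.
beta = 0.5
residuals = [-2.0, -0.5, -0.3, 0.0, 0.3, 0.5, 2.0]
ratios_match = all(
    math.isclose(smooth_l1(a, beta), huber(a, beta) / beta, abs_tol=1e-12)
    for a in residuals
)
```

With `beta = 0.5`, `ratios_match` is true for every residual in the list, including the boundary case $|a| = beta$ where both branches agree.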
In formula (4), $y_k^w$ is the flattened 6-DoF (six degrees of freedom) relative pose in space, and $\hat{y}_k^w$ is its estimate predicted by the network.

Chen et al. [15] proposed LEAP-VO together with a cross entropy LF, $L_{CE} = L_{vis} + L_{dyn}$. $L_{vis}$ supervises the visibility label and is calculated as formula (5), where $V$ is the estimated visibility and $V^{*}$ is the GT visibility. $L_{dyn}$ supervises the dynamic track label and is calculated as formula (6), where $m_d$ is the estimated dynamic track label and $m_d^{*}$ is the GT dynamic track label. Both are written here in the standard binary cross-entropy form.

$$L_{vis} = -\left[(1 - V^{*})\log(1 - V) + V^{*}\log V\right] \tag{5}$$

$$L_{dyn} = -\left[(1 - m_d^{*})\log(1 - m_d) + m_d^{*}\log m_d\right] \tag{6}$$

Hwang et al. [24] proposed the F2F method to reduce noise when estimating the camera pose on the KITTI dataset, as shown in Figure 2. F2F consists of two stages. The first stage is the initial estimation, based on the combination of several encoder networks (visual geometry group (VGG), ResNet, and DenseNet), the forward loss ($L_{fl}$), and an error relaxation network; in this stage, geometric features are used to approximate the camera pose prediction and are fine-tuned. In the second stage, the rotation and translation errors are reduced by rotation and translation networks during the training of geometric features, using the skip method in the frame sequence. In the first stage, F2F uses the errors of the three Euler angles $\theta$ and the translation vector $P$ to calculate the LF for fine-tuning the model, as in formula (7).

$$L_{fl} = \lambda_{\theta} \sum \lVert \theta - \hat{\theta} \rVert^{2} + \sum \lVert P - \hat{P} \rVert^{2} \tag{7}$$

where $\theta$ and $\hat{\theta}$ are the Euler angles in 3D space of the label and the estimate, respectively; $P$ and $\hat{P}$ are the corresponding translation vectors in 3D space; and $\lambda_{\theta}$ is the balance scale between the two spaces (rotation and translation).
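The cross entropy terms of formulas (5) and (6) and the forward loss of formula (7) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the LEAP-VO or F2F implementation: the function names, the clamping constant `eps`, the per-pair list layout, and the example value of the balance weight are all hypothetical.

```python
import math


def binary_cross_entropy(gt, est, eps=1e-12):
    """Binary cross-entropy for one label, as in formulas (5) and (6).

    gt is the GT label (0 or 1); est is the estimated probability.
    eps is an assumed clamping constant that keeps log() finite.
    """
    est = min(max(est, eps), 1.0 - eps)
    return -((1.0 - gt) * math.log(1.0 - est) + gt * math.log(est))


def squared_norm(u, v):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def forward_loss(theta_gt, theta_est, p_gt, p_est, lam_theta):
    """Forward loss of formula (7): weighted rotation error plus translation error.

    Each argument except lam_theta is a list with one 3-vector per frame pair.
    """
    rot = sum(squared_norm(t, t_hat) for t, t_hat in zip(theta_gt, theta_est))
    trans = sum(squared_norm(p, p_hat) for p, p_hat in zip(p_gt, p_est))
    return lam_theta * rot + trans


# A visible point (gt = 1) predicted with probability 0.5 costs log(2).
l_vis = binary_cross_entropy(1.0, 0.5)

# One frame pair: 0.1 rad rotation error, 1 m translation error, weight 100.
l_fl = forward_loss([[0.0, 0.0, 0.0]], [[0.1, 0.0, 0.0]],
                    [[0.0, 0.0, 0.0]], [[1.0, 0.0, 0.0]], lam_theta=100.0)
```

The balance weight matters because rotation errors in radians are numerically much smaller than translation errors in meters; in the example, a weight of 100 makes the two terms contribute equally.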
When training on the KITTI database, only the forward direction is used, so the reverse direction has a large error. Therefore, F2F proposed a bi-directional LF ($L_{bd}$) in the second stage, according to formula (8).

$$L_{bd} = \sum \lVert G - \hat{G}_{i,i+1}\,\hat{G}_{i+1,i} \rVert^{2} \tag{8}$$

where $G$ is the identity matrix, $\hat{G}_{i,i+1}$ is the F2F result for the input images $G_i$, $G_{i+1}$, and $\hat{G}_{i+1,i}$ is the F2F result for the input images $G_{i+1}$, $G_i$ (the same pair in reverse order).

Figure 2. Illustration of the architecture of the two independent CNN models underlying ego-motion estimation [24]

In addition, F2F proposed a method to reduce noise when estimating the camera pose, in which the neighboring pixels of the current prediction are used in the calculation. F2F proposed a corrective LF: assuming $G_{i,i+1}$ has an error $\phi_e$, as in Figure 3, the camera pose estimates at the neighboring positions, $G_{i-1,i}$ and $G_{i+1,i+2}$, can be used to reduce the error. The correction LF is calculated as formula (9).

$$L_{co} = \sum \lVert G_{i-1,i+1} - \hat{G}_{i-1,i}\,\hat{G}_{i,i+1} \rVert^{2} \tag{9}$$

Thus, the aggregate LF in F2F is calculated as formula (10).

$$L_{F2F} = L_{bd} + L_{co} \tag{10}$$

Figure 3. Illustration of the calculation of the error between a pair of frames $G_{i,i+1}$ [24]

Jiang et al. [21] proposed the MLF-VO-F for VOE. To optimize the training of the DepthNet depth estimation, Bian et al. [29] used the smoothness LF ($L_{smoo}$) on the RGB image to increase the difference between color pixels and increase the scene heterogeneity. $L_{smoo}$ is calculated according to formula (11).

$$L_{smoo} = \sum_{p} \left(e^{-\nabla I_a(p)} \cdot \nabla D_a(p)\right)^{2} \tag{11}$$
where $\nabla$ is the first derivative along the image's spatial directions, so that the image's edges guide the smoothness. To reduce the warping error between consecutive color frames during depth estimation of a frame sequence, the photometric LF ($L_{pm}$) is computed during unsupervised learning of the network, using formula (12).

$$L_{pm} = \frac{1}{|V|}\sum_{p \in V}\left(\lambda_i \lVert I_a(p) - I'_a(p) \rVert_1 + \lambda_s \frac{1 - SSIM_{aa'}(p)}{2}\right) \tag{12}$$

where the SSIM function calculates the element-by-element similarity between $I_a$ and $I'_a$, and $\lambda_i$, $\lambda_s$ are set to fixed values [30]. MLF-VO-F uses the smoothness loss ($L_{smoo}$) to ensure that the color pixels do not change abruptly; its output is the loss computed between adjacent color pixels at each scale (4 scales). The regularization loss ($L_{regu}$) for channel exchange is calculated according to formula (13).

$$L_{regu} = \sum_{m \in self.slim.params}\left(\lVert m \rVert_1 - 0.01\,\lVert m - \bar{m} \rVert_1\right) \tag{13}$$

where $\lVert m \rVert_1$ is the $L_1$ regularization for parameter $m$, i.e. the sum of the absolute values of the elements in $m$; $\bar{m}$ is the average value of parameter $m$; and $\lVert m - \bar{m} \rVert_1$ is the polarization regularization, that is, the sum of the absolute values of the differences between the elements in $m$ and the mean value $\bar{m}$. The factor 0.01 adjusts the weight of the polarization regularization relative to the $L_1$ regularization. During training, the total LF ($L_{total}$) is optimized as in formula (14).

$$L_{total} = L_{pm} + e^{-2} L_{gc} + e^{-3} L_{smoo} + e^{-5} L_{regu} \tag{14}$$

In this paper, we propose a combination LF ($L_{combi}$) to optimize the self-supervised training model based on MLF-VO-F. $L_{combi}$ is calculated as formula (15).

$$L_{combi} = L_{total} + e^{-6} L_{F2F} \tag{15}$$

2.2. MLF-VO-F backbone for VOE
Many visual SLAM and VOE construction models have recently been based on the DL method.
This paper exploits MLF-VO-F [21] as a backbone and combines it with LFs to fine-tune the VOE model on the KITTI and TQU-SLAM datasets. MLF-VO-F was proposed by Jiang et al. [21] with a combination of different fusion strategies to estimate ego-motion from RGB images and depth images obtained from depth estimation. MLF-VO-F uses DepthNet to estimate the depth image corresponding to each color image/frame, as shown on the left side of Figure 4. Given consecutive video frames $I_t$, $I_{t+1}$ as input, the network first estimates the depth image of each frame: $D_t = \theta_{depth}(I_t)$, $D_{t+1} = \theta_{depth}(I_{t+1})$. DepthNet is built on the structure of U-Net.

Figure 4. Illustration of the architecture of the two independent CNN models underlying ego-motion estimation [21]
To smooth out the color pixels between consecutive frames in the input frame sequence, MLF-VO-F uses a smoothness loss ($L_{smoo}$) to ensure they do not change abruptly. The input is a pair of consecutive RGB images and a disparity map [31]–[33]. The output is the loss computed between adjacent color pixels at each scale (4 scales).

To ensure consistency between frames, and to transfer that consistency to the entire frame sequence so that the whole sequence is scale-consistent [31], [32], the geometry consistency loss ($L_{gc}$) is used to calculate the loss between the current depth frame and the next depth frame. The input is the pixels of the current depth image and the pixels of the next depth image. The output is the loss calculated at each scale (4 scales).

To reduce the impact of outliers, the photometric LF ($L_{pm}$) is calculated based on $L_1$. The $L_1$ loss calculates the total absolute difference between the predicted results and the original data, making it less sensitive to outliers than the $L_2$ loss [31]–[33]. This function calculates the loss between the current RGB frame and the next RGB frame. The input is the pixels of the current RGB image and the pixels of the next RGB image. The output is the loss calculated at each scale (4 scales).

To reduce and control the number of parameters $m$ of the model training process, the regularization loss ($L_{regu}$) for channel exchange is calculated according to formula (13), with the weight parameters initialized before the training process as input. The channel exchange (CE) process during MLF-VO-F training exchanges and synthesizes the losses into $L_{total}$, as in formula (14), which helps to overcome the problems of missing, noisy, and inconsistent data. As a result, the entire training set is exploited and the learned model predicts VOE more accurately.

In particular, MLF-VO-F includes two main tasks in two stages. The first stage uses the baseline framework to estimate ego-motion with two independent CNN models for depth prediction and pose estimation, as illustrated in Figure 4. At this stage, MLF-VO-F uses the fully convolutional U-Net to obtain depths at four scales. The second stage is relative pose estimation with a multi-layer fusion strategy that combines several features appearing in intermediate layers of the encoder. To encode features from color and depth images, MLF-VO-F includes two structural streams. The CE strategy is used to swap the positions of components according to their importance, combining features at multiple levels. In both streams, ResNet-18 [34] is used as the encoder. To build an end-to-end DL network that learns automatically, MLF-VO-F builds a self-learning mechanism with an LF ($L_{total}$) that combines the depth prediction and relative pose estimation processes, as illustrated in Figure 5. In this paper, we are only interested in fine-tuning the VOE model using backbones such as ResNet-18. We use ResNet-18 as the backbone to encode the features extracted from color images because it has enough layers to achieve good accuracy with fast computation time.
We conducted experiments comparing several backbones for encoding features: VGG-16 has faster computation time but lower accuracy than ResNet-18 and ResNet-34 [35]; ResNet-50, ResNet-101, and ResNet-152 have slightly better accuracy than ResNet-18 and ResNet-34 but increased computation time; and ResNet-18 has higher accuracy than Dense121 [36].

Figure 5. LF of MLF-VO-F for self-learning process [21]
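The way the individual losses are assembled in formulas (10), (14), and (15) can be summarized as a small weighted sum. The sketch below is illustrative only: the function names are hypothetical, the inputs are already-computed scalar loss values, and the default weights read the factors of formulas (14) and (15) literally as $e^{-2}$, $e^{-3}$, $e^{-5}$, and $e^{-6}$. That reading is an assumption, so the weights are exposed as parameters rather than hard-coded.

```python
import math


def total_loss(l_pm, l_gc, l_smoo, l_regu,
               w_gc=math.exp(-2), w_smoo=math.exp(-3), w_regu=math.exp(-5)):
    """L_total of formula (14): photometric loss plus weighted auxiliary losses.

    The default weights are an assumed literal reading of the formula.
    """
    return l_pm + w_gc * l_gc + w_smoo * l_smoo + w_regu * l_regu


def f2f_loss(l_bd, l_co):
    """L_F2F of formula (10): bi-directional loss plus correction loss."""
    return l_bd + l_co


def combi_loss(l_total, l_f2f, w_f2f=math.exp(-6)):
    """L_combi of formula (15): L_total plus the weighted F2F aggregate loss."""
    return l_total + w_f2f * l_f2f


# Example with made-up scalar loss values for a single training step.
l_total = total_loss(l_pm=0.8, l_gc=0.2, l_smoo=0.05, l_regu=1.5)
l_combi = combi_loss(l_total, f2f_loss(l_bd=0.3, l_co=0.1))
```

Because every weight is below 1, the photometric term dominates $L_{total}$ unless the auxiliary losses are much larger in magnitude, and the F2F term acts as a small correction on top of it.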
MLF-VO-F [21] combines features at the early, middle, and late stages of the depth estimation process to detect keypoints between consecutive frames. The extracted features are based on DeepNet, with low-level textures, edges, and deeper high-level semantic features. MLF-VO-F is tested on the KITTI dataset and shows good performance on data with complex scenes and sudden lighting changes; the KITTI dataset is collected in an outdoor environment, so its scenes and lighting are very complex. In MLF-VO-F, a self-supervised learning mechanism monitors the training of the VOE model by using LFs to calculate the error between the GT and the current VOE. This mechanism reduces the impact of external parameters on the operation of the model, thus increasing its adaptability to practical applications. However, MLF-VO-F also has limitations, such as requiring large, parallel computing resources and giving poor results on small datasets.

2.3. Comparative study based on loss functions
In this paper, we examine the impact of the LF on the training process of the VOE model. We propose a combination model, and its evaluation, between the MLF-VO-F backbone and the LFs, as shown in Figure 6. The combination includes the MLF-VO-F backbone and the LFs $L_{MSE}$, $L_{MSE}^{L2}$, $L_{CE}$, and $L_{combi}$. The parameters of the MLF-VO-F backbone are kept the same as in the original MLF-VO-F.

Figure 6. Combined model of MLF-VO-F as a backbone and LF for VOE

3. RESULTS AND DISCUSSION
3.1. Data collection
KITTI dataset: the KITTI dataset [37] is the most popular database for evaluating visual SLAM and VOE models and algorithms.
The KITTI dataset is collected from two high-resolution camera systems (grayscale and color), a Velodyne HDL-64E laser scanner, and a state-of-the-art OXTS RT 3003 localization system (a combination of devices such as GPS, GLONASS, an IMU, and RTK correction signals). These devices are mounted on a car and collect data over a distance of 39.2 km. The resolution of the images is 1240 × 376 pixels. The GT data for evaluating visual SLAM and VOE models includes three-dimensional (3D) pose annotations of the scene. The GT data for evaluating object detection and 3D orientation estimation models includes accurate 3D bounding boxes for the object classes; the 3D point cloud data of the objects is labeled manually. In the improved version of the KITTI dataset [37], additional data was developed to evaluate optical flow algorithms. The authors used 3D CAD models from the Google 3D Warehouse database to build 3D scenes with static elements and to insert moving objects. In this paper, we only use the frame sequences with ground truth trajectories: Seq. #0, Seq. #1, Seq. #2, Seq. #3, Seq. #4, Seq. #5, Seq. #6, Seq. #7, Seq. #8, Seq. #9, and Seq. #10.

TQU-SLAM dataset: the data collection was performed 4 times (1ST, 2ND, 3RD, 4TH). Each time, the movement along the blue arrow was in the forward direction (FO-D), and the movement along the red arrow was in the opposite direction (OP-D).
We cross-divide TQU-SLAM [25] into 8 subsets, splitting the training and testing data in a cross-split form as follows:
Sub #1: 1ST-FO-D (21,333 frames), 2ND-FO-D (19,992 frames), and 3RD-FO-D (17,995 frames) for training; 4TH-FO-D (17,885 frames) for testing.
Sub #2: 1ST-OP-D (22,948 frames), 2ND-OP-D (21,116 frames), and 3RD-OP-D (20,814 frames) for training; 4TH-OP-D (18,548 frames) for testing.
Sub #3: 1ST-FO-D, 2ND-FO-D, and 4TH-FO-D for training; 3RD-FO-D for testing.
Sub #4: 1ST-OP-D, 2ND-OP-D, and 4TH-OP-D for training; 3RD-OP-D for testing.
Sub #5: 1ST-FO-D, 3RD-FO-D, and 4TH-FO-D for training; 2ND-FO-D for testing.
Sub #6: 1ST-OP-D, 3RD-OP-D, and 4TH-OP-D for training; 2ND-OP-D for testing.
Sub #7: 2ND-FO-D, 3RD-FO-D, and 4TH-FO-D for training; 1ST-FO-D for testing.
Sub #8: 2ND-OP-D, 3RD-OP-D, and 4TH-OP-D for training; 1ST-OP-D for testing.

With this scheme, every collection run is used both for training and for testing across the subsets. In each subset, about 75% of the data is used for training the model and 25% for testing it, a ratio that is reasonable both statistically and for machine learning problems.

Since MLF-VO-F accepts input images of size 640 × 192 pixels, we resize the RGB-D images of TQU-SLAM to 640 × 192 pixels. In this paper, we use MLF-VO-F as a backbone and combine it with the LFs to fine-tune the VOE model on TQU-SLAM. The MLF-VO-F source code is developed in Python 3.x and run on Ubuntu 18.04 with PyTorch 1.7.1 and CUDA 10.1. We used the code at https://github.com/Beniko95J/MLF-VO on computers with the following configuration: CPU i5 12400f, 16 GB DDR4, GPU RTX 3060 12 GB. We fine-tune the VOE model for 20 epochs, with the default parameters of MLF-VO-F.

3.2. Evaluation metrics
To evaluate the results of VOE, we calculate the trajectory error ($Err_d$), which is the distance error between the GT trajectory $\widehat{AT}_i$ and the estimated motion trajectory $AT_i$. $Err_d$ is calculated according to formula (16).
$$Err_d = \frac{1}{N}\sqrt{\lVert AT_i - \widehat{AT}_i \rVert^{2}} \tag{16}$$

where $N$ is the number of frames of the frame sequence used to estimate the camera's motion trajectory. We also calculate the absolute trajectory error ($ATE$) [38], which is the distance error between the GT trajectory $\widehat{AT}_i$ and the estimated motion trajectory $AT_i$ after alignment with an optimal $SE(3)$ pose $T$. $ATE$ is calculated according to formula (17).

$$ATE = \min_{T \in SE(3)} \frac{1}{N}\sqrt{\sum_{i \in I_{gt}} \lVert T\,AT_i - \widehat{AT}_i \rVert^{2}} \tag{17}$$

where $N$ is the number of frames in the evaluated frame sequence. $T_{rel}$ is the average translational $RMSE$ drift (%) over lengths of 100 m-800 m [21]. $R_{rel}$ is the average rotational $RMSE$ drift (°/100 m) over lengths of 100 m-800 m [21]. In addition, we evaluate the VOE results using the $RMSE$ measure, the standard deviation of the residuals (prediction errors) between the GT motion trajectory and the estimated motion trajectory. We also evaluate the VOE results on the relative translation error ($RTE$ (m)) and relative rotation error ($RPE$ (deg)) metrics, as presented in [15].

3.3. Results and discussions
The VOE evaluation results of the original MLF-VO-F, the MLF-VO-F backbone with $L_{MSE}$ (MLF-VO-F + $L_{MSE}$), the MLF-VO-F backbone with $L_{CE}$ (MLF-VO-F + $L_{CE}$), the MLF-VO-F backbone with $L_{MSE}^{L2}$ (MLF-VO-F + $L_{MSE}^{L2}$), and the MLF-VO-F backbone with $L_{combi}$ (MLF-VO-F + $L_{combi}$) on Seq. #4, Seq. #5, Seq. #6, Seq. #7, Seq. #9, and Seq. #10 of the KITTI dataset are presented in Table 1. The best results for each method and metric are highlighted. The results show that the original MLF-VO-F has the best results on Seq. #9 and Seq. #10 on the $R_{err}$ measure. The best results on Seq. #4, Seq. #5, Seq. #6, Seq. #7, and Seq. #10 are obtained by the MLF-VO-F + $L_{combi}$ method on the $T_{err}$ and $R_{err}$ measures.
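As a concrete illustration of the trajectory metrics defined in Section 3.2, the sketch below computes a mean per-frame distance error and an RMSE between an estimated and a GT trajectory. Several points are assumptions rather than the paper's evaluation code: the function names are hypothetical, trajectories are plain lists of 3D positions, the per-frame distances of formula (16) are summed over all frames before dividing by $N$, and no $SE(3)$ alignment is performed, so this is not the aligned $ATE$ of formula (17).

```python
import math


def _dist(u, v):
    """Euclidean distance between two 3D positions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def trajectory_error(est_positions, gt_positions):
    """Mean per-frame distance between estimated and GT positions.

    Assumes formula (16) is summed over the N frames before dividing by N.
    """
    if len(est_positions) != len(gt_positions) or not est_positions:
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(_dist(e, g) for e, g in zip(est_positions, gt_positions))
    return total / len(est_positions)


def trajectory_rmse(est_positions, gt_positions):
    """Root mean squared per-frame distance (no alignment step)."""
    if len(est_positions) != len(gt_positions) or not est_positions:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared = [_dist(e, g) ** 2 for e, g in zip(est_positions, gt_positions)]
    return math.sqrt(sum(squared) / len(squared))


# Two-frame toy trajectory: the second estimate is off by (0, 3, 4), i.e. 5 m.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
est = [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]
err_d = trajectory_error(est, gt)
rmse = trajectory_rmse(est, gt)
```

The RMSE is larger than the mean distance for the same trajectory because squaring weights the single 5 m outlier more heavily, the same effect that makes $L_{MSE}$ sensitive to large errors.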
In Table 1, MLF-VO-F + $L_{MSE}$ and MLF-VO-F + $L_{MSE\,L2}$ have the largest errors. For example, the MLF-VO-F + $L_{MSE}$ method has $ATE = 133.36$ m and $T_{err} = 17.41\%$ on Seq. #9, which is a very large error compared to the best method on the $ATE$ measure (MLF-VO-F, with $ATE = 9.86$ m on the same sequence). The motion trajectories estimated by MLF-VO-F + $L_{MSE}$, MLF-VO-F + $L_{MSE\,L2}$, and MLF-VO-F + $L_{CE}$ on Seq. #7, Seq. #9, and Seq. #10 of the KITTI dataset are compared in Figure 7. The $L_{CE} = L_{vis} + L_{dyn}$ LF (formulas (5) and (6)) is an important LF for optimizing the training process of the model in [15] for VOE on the MPI Sintel [39] and Replica [40] datasets; that model is the best when compared with models such as DROID-SLAM [41] and DytanVO [16]. The results also show that the $L_{CE}$ LF has a large impact on MLF-VO-F when training the VOE model on the KITTI dataset.
The $L_{combi}$ LF (formula (15)) combines the advantages of the $L_{total}$ LF (formula (14)) of the original MLF-VO-F and the $L_{F2F}$ LF (formula (10)) of F2F, which are the best LFs in MLF-VO-F and F2F for VOE, respectively. Therefore, the $L_{combi}$ LF gives the best results on the KITTI dataset.

Table 1. VOE evaluation results of the original MLF-VO-F

Method                      Seq.   T_err (%)   R_err (deg/100 m)   ATE (m)   RTE (m)   RPE (deg)
MLF-VO-F                    #9     3.9         1.41                9.86      -         -
MLF-VO-F                    #10    4.88        1.38                7.36      -         -
MLF-VO-F + $L_{CE}$         #9     5.88        2.127               15.22     0.075     0.07
MLF-VO-F + $L_{CE}$         #10    6.73        2.124               9.34      0.06      0.09
MLF-VO-F + $L_{MSE}$        #9     17.41       6.66                133.36    0.09      0.10
MLF-VO-F + $L_{MSE}$        #10    12.99       5.957               32.27     0.08      0.11
MLF-VO-F + $L_{MSE\,L2}$    #9     8.99        2.91                35.18     0.08      0.09
MLF-VO-F + $L_{MSE\,L2}$    #10    8.99        3.03                9.744     0.07      0.1
MLF-VO-F + $L_{combi}$      #4     2.21        0.97                -         -         -
MLF-VO-F + $L_{combi}$      #5     2.67        1.18                -         -         -
MLF-VO-F + $L_{combi}$      #6     3.59        1.65                -         -         -
MLF-VO-F + $L_{combi}$      #7     1.01        0.67                -         -         -
MLF-VO-F + $L_{combi}$      #10    4.62        1.89                -         -         -

Francani and Maximo [23] evaluated the error function on the 11 sequences of the KITTI dataset. Their best results were $t_{err} = 3.105\%$, $r_{err} = 1.063$ (deg/100 m), $ATE = 37.431$ m on Seq. #02, and $t_{err} = 9.867\%$, $r_{err} = 4.295$ (deg/100 m), $ATE = 8.696$ m on Seq. #03; on the other frame sequences, $L_{MSE}$ had lower results when combined with the $L_{MC}$ LF. It can therefore be seen that $L_{MSE}$ still has a large error in optimizing the training process of a DL-based model, which is consistent with $L_{MSE}$ combined with MLF-VO-F having the highest error compared to the other LFs. The VOE result on Seq. #9 in Figure 7 has the largest error when estimated with the MLF-VO-F + $L_{MSE}$ method, which agrees with the result in Table 1 ($ATE = 133.36$ m).

Figure 7. The comparison results of VOE based on the combination of the MLF-VO-F backbone and the $L_{MSE}$ LF (MLF-VO-F + $L_{MSE}$) (Ours), the $L_{MSE\,L2}$ LF (MLF-VO-F + $L_{MSE\,L2}$) (Ours), the $L_{CE}$ LF (MLF-VO-F + $L_{CE}$) (Ours), and GT VO (blue) on Seq. #7, Seq. #9, and Seq. #10 of the KITTI dataset