IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 14, No. 5, October 2025, pp. 4162–4170
ISSN: 2252-8938, DOI: 10.11591/ijai.v14.i5.pp4162-4170

Ensemble reverse knowledge distillation: training robust model using weak models

Christopher Gavra Reswara, Tjeng Wawan Cenggoro
School of Computer Science, Bina Nusantara University, West Jakarta, Indonesia

Article Info
Article history: Received Sep 19, 2024; Revised Jun 28, 2025; Accepted Jul 13, 2025
Keywords: EfficientNet; Ensemble learning; Knowledge distillation; Transfer learning; Weak-to-strong

ABSTRACT
To ensure that artificial intelligence (AI) can be aligned with humans, AI models need to be developed and supervised by humans. Unfortunately, an AI may come to exceed human capabilities; such models are commonly referred to as superalignment models. This raises the question of whether humans can still supervise a superalignment model, which is encapsulated in a concept called weak-to-strong generalization. To address this issue, we introduce ensemble reverse knowledge distillation (ERKD), which leverages two weaker models to supervise a more robust model. This technique is a potential solution for humans to manage a superalignment model. ERKD enables a more robust model to achieve optimal performance with the assistance of two weaker models. We trained a more robust EfficientNet model with weaker convolutional neural network (CNN) models in a supervised fashion. With this method, the EfficientNet model performed better than the model trained with the standard transfer learning (STL) method. It also performed better than a model that was supervised by a single weaker model. Finally, ERKD-trained EfficientNet models can perform better than EfficientNet models that are one or even two levels stronger.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Christopher Gavra Reswara
School of Computer Science, Bina Nusantara University
Kebon Jeruk Raya No. 27, West Jakarta, Indonesia
Email: christopher.reswara@binus.ac.id

Journal homepage: http://ijai.iaescore.com

1. INTRODUCTION
The development of artificial intelligence (AI) models must be integrated with human supervision to obtain models that are useful for humans. For example, in the field of image classification, convolutional neural network (CNN) models, such as ResNet [1], DenseNet [2], EfficientNet [3], Inception V3 [4], and MobileNet V3 [5], are trained on collections of images labeled by experts, such as ImageNet [6], CIFAR-10 [7], Food-101 [8], Oxford 102 Flowers [9], Birdsnap [10], and other datasets. Large language models (LLMs) such as GPT-4 [11], Gemini 1.5 [12], and Llama-3 [13] were also built to learn from human-generated text datasets to perform natural language processing (NLP) tasks. To further guarantee their alignment with humans, LLMs are also trained with an additional step called reinforcement learning from human feedback (RLHF), which rewards or penalizes the model during learning based on human judgment [14]–[16]. Until now, all forms of AI have been intentionally directed to align with human knowledge, experience, evaluation, and feedback to assist in completing human tasks.

However, the emergence of AI models that have better capabilities than humans, commonly referred to as superalignment models, is unavoidable. This is largely because AI supervision is usually done by a large crowd of humans rather than by a single person: most of the datasets used to train AI models nowadays were curated via crowd-sourcing. This can theoretically crystallize the wisdom of the crowd within AI models, which can lead the models to be more intelligent than a single human. The emergence of superalignment models can also come from the practice of applying reinforcement learning without human supervision, which has been demonstrated multiple times in video games [17], board games [18], [19], and recently LLMs [20].

The emergence of superalignment models raises the question: how can we as humans supervise these models to better align with us if they are better than us? As superalignment models can emerge from the wisdom of the crowd, perhaps we can also supervise these models via another wisdom of the crowd. This study aims to simulate this idea by having an ensemble of weaker models supervise a stronger model. In the machine learning community, it is known that an ensemble of weaker models can form a strong model. This concept is named ensemble learning and has been used to form strong machine learning models such as random forest [21] and XGBoost [22]. To achieve our aim, we designed a schema in which more than one weaker teacher model supervises one stronger model in the knowledge distillation (KD) framework [23]. We named this schema ensemble reverse knowledge distillation (ERKD). Figure 1 illustrates the ERKD schema with two weak teacher models. To simulate the idea of supervising a model that is already intelligent, we use transfer learning, specifically transfer learning for image classification, as the main task. To measure the success of this study, we compare ERKD with a standard transfer learning (STL) procedure.

Figure 1. The ERKD schema with two weak teacher models

2. METHOD
2.1. Dataset
This study uses two image classification datasets, namely CIFAR-10 and CIFAR-100 [7]. Both datasets consist of 50,000 images for training and 10,000 images for testing.
Both also contain images with a resolution of 32 × 32 pixels. The difference between the two datasets is that CIFAR-10 has only ten classes, so each class consists of 6,000 images, while CIFAR-100 has 100 classes, so each class consists of 600 images. These two datasets are used in this study because they are commonly used in AI studies.

2.2. Train, validation, and test split data
The CIFAR-10 and CIFAR-100 datasets are already divided into 50,000 images for training and 10,000 images for testing. We shuffled all images in the training portion and then split it into two parts: 40,000 images for training and 10,000 images for validation. Data augmentation is applied to the 40,000 training images. The 10,000 validation images are used to compute the validation error rate and accuracy while the model learns. Finally, the 10,000 test images are used to measure the performance of the model.
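The split described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `split_train_validation` is hypothetical, and the shuffling seed of 42 is an assumption, since the text does not state how the randomization of the training portion was seeded.

```python
import random


def split_train_validation(num_images=50_000, num_validation=10_000, seed=42):
    """Shuffle image indices and split them into training and validation sets.

    The function name and the seed are illustrative assumptions; only the
    50,000 -> 40,000 / 10,000 proportions come from the text.
    """
    indices = list(range(num_images))
    # A dedicated Random instance keeps the shuffle reproducible.
    random.Random(seed).shuffle(indices)
    validation_indices = indices[:num_validation]
    train_indices = indices[num_validation:]
    return train_indices, validation_indices


train_idx, val_idx = split_train_validation()
print(len(train_idx), len(val_idx))  # 40000 10000
```

Working on indices rather than on the images themselves keeps the sketch independent of any particular data-loading library.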
2.3. Data preprocessing
We preprocessed the dataset with z-score standardization on a 0-to-1 scale. First, we normalized the pixel values from the 0-to-255 scale to the 0-to-1 scale. Afterwards, we standardized the pixel values with z-score standardization, with the mean and standard deviation values derived from the dataset. For the CIFAR-10 dataset, the mean values were 0.4914, 0.4822, and 0.4465 for the red, green, and blue channels, respectively. The standard deviation values were 0.247, 0.243, and 0.261 for the red, green, and blue channels, respectively. For the CIFAR-100 dataset, the mean values were 0.5071, 0.4865, and 0.4409 for the red, green, and blue channels, respectively. The standard deviation values were 0.267, 0.256, and 0.276 for the red, green, and blue channels, respectively.

2.4. Data augmentation
To avoid overfitting, we applied data augmentation with a random crop to 28 × 28 pixels and a random horizontal flip. This data augmentation procedure is applied only to the training dataset during model training. The data augmentation process was performed online for each epoch.

2.5. Models
For the transfer learning process in this study, we used EfficientNet and EfficientNet V2 [24] models, which were pre-trained on the ImageNet dataset [25], [26]. EfficientNet models form a hierarchy from weak to strong models due to the use of systematic model scaling, i.e., from the weakest B0 to the strongest B7 in EfficientNet, and from the weakest V2S through the stronger V2M to the strongest V2L in EfficientNet V2. With this characteristic, EfficientNet models are well suited to the setup in this study.

2.6. Training process
The training process in all experiments in this study used Adam optimization [27] with a learning rate of $10^{-3}$ and a ridge regularization of $10^{-5}$. In addition, training was conducted for 100 epochs with a batch size of 32, and the random seed used was 42.
Furthermore, the temperature used in the KD process was 2.0. Model checkpointing is used during training, keeping the model with the best validation accuracy. The image resolution scaling from the EfficientNet study is also adapted for each model in this study. EfficientNet models B0 to B7 use image sizes 32, 34, 38, 44, 54, 66, 76, and 86, respectively. Meanwhile, the EfficientNet V2 models V2S, V2M, and V2L use image sizes of 32, 40, and 48, respectively.

2.7. Experiment setup
In ERKD, we used two weaker models to supervise a stronger model. For example, the stronger model EfficientNet B2 was supervised by EfficientNet B1 and B0. The weaker models were first trained with STL on the CIFAR-10 and CIFAR-100 datasets. Afterwards, these two models were used as teachers by producing soft labels to train a stronger student model in a response-based KD framework. The stronger student model was optimized to match the distribution of the soft labels using the Kullback-Leibler divergence (KL divergence) loss function.

3. RESULTS AND DISCUSSION
In Table 1, we compare the accuracy of STL and three variations of ERKD that give different proportions to the loss functions: i) equal proportion, ii) 10% for the cross-entropy loss and 45% for the KL divergence loss of each teacher, and iii) 30% for the cross-entropy loss and 35% for the KL divergence loss of each teacher. The icons in the table indicate that ERKD outperforms STL. The square indicates the best accuracy, the circle indicates the second-best accuracy, and the triangle indicates the third-best accuracy. As seen in the table, all variations of ERKD outperform STL. This shows that two weaker models can still supervise the stronger model, e.g., EfficientNet B0 and EfficientNet B1 can still supervise EfficientNet B2.

In addition, we also experimented with using only one weaker model as the teacher of the stronger model. For example, EfficientNet B2 is taught only by B0 or only by B1.
In this single-teacher setting, the cross-entropy loss and the KL divergence loss are each weighted 50%. The results can be seen in Table 2. The icons in the table indicate that ERKD outperforms STL and the single-teacher method. The square indicates the best accuracy, the circle indicates the second-best accuracy, and the triangle indicates the third-best accuracy. We found that at least one variation of ERKD outperforms using only one weaker model. This shows that the ensemble learning concept in ERKD is also effective in improving model performance. For example, EfficientNet B2 performs better when supervised by both B0 and B1 than by B0 or B1 alone.
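The weighted objective used in these experiments can be made concrete with a small sketch in plain Python. It is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the KL term is taken in the direction KL(teacher || student), the hard-label cross-entropy is computed at temperature 1, and no temperature-squared rescaling is applied to the KL terms. None of these details is specified in the text; only the temperature of 2.0 and the weight proportions (for example 10/45/45) come from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def erkd_loss(student_logits, teacher_logits_list, label,
              weights=(0.10, 0.45, 0.45), temperature=2.0):
    """Weighted sum of cross-entropy and one KL term per teacher.

    weights[0] scales the cross-entropy with the ground-truth label;
    weights[1:] scale the KL divergence to each teacher's soft labels.
    """
    if len(weights) != len(teacher_logits_list) + 1:
        raise ValueError("expected one weight per teacher plus one for cross-entropy")
    # Hard-label term (assumption: computed at temperature 1).
    cross_entropy = -math.log(softmax(student_logits)[label])
    # Soft-label terms: student and teachers are softened with the same temperature.
    student_soft = softmax(student_logits, temperature)
    loss = weights[0] * cross_entropy
    for weight, teacher_logits in zip(weights[1:], teacher_logits_list):
        teacher_soft = softmax(teacher_logits, temperature)
        loss += weight * kl_divergence(teacher_soft, student_soft)
    return loss


# Toy three-class example with made-up logits.
student = [2.0, 0.5, -1.0]
teacher_1 = [1.5, 0.7, -0.5]
teacher_2 = [1.0, 1.0, -0.2]
print(erkd_loss(student, [teacher_1, teacher_2], label=0))
```

The single-teacher baseline corresponds to calling `erkd_loss(student, [teacher_1], label=0, weights=(0.5, 0.5))`, which mirrors the 50/50 weighting described above.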
Table 1. Comparison of the student model's accuracy between ERKD and STL

| Teacher 1 model | Teacher 1 image size | Teacher 2 model | Teacher 2 image size | Student model | Student image size | Dataset | STL (%) | Teachers 1 and 2, average (%) | Teachers 1 and 2, 10/45/45 (%) | Teachers 1 and 2, 30/35/35 (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| B1 | 34 | B0 | 32 | B2 | 38 | CIFAR-10 | 88.84 | 89.79 | 89.40 | 89.49 |
| B2 | 38 | B1 | 34 | B3 | 44 | CIFAR-10 | 90.63 | 91.34 | 91.21 | 91.18 |
| B3 | 44 | B2 | 38 | B4 | 54 | CIFAR-10 | 91.69 | 92.87 | 92.56 | 92.68 |
| B4 | 54 | B3 | 44 | B5 | 66 | CIFAR-10 | 92.63 | 92.94 | 93.23 | 93.43 |
| B5 | 66 | B4 | 54 | B6 | 76 | CIFAR-10 | 93.02 | 93.23 | 93.61 | 93.51 |
| B6 | 76 | B5 | 66 | B7 | 86 | CIFAR-10 | 93.18 | 93.78 | 93.75 | 93.94 |
| V2M | 40 | V2S | 32 | V2L | 48 | CIFAR-10 | 92.09 | 92.47 | 92.63 | 92.65 |
| B1 | 34 | B0 | 32 | B2 | 38 | CIFAR-100 | 64.93 | 68.77 | 68.39 | 68.68 |
| B2 | 38 | B1 | 34 | B3 | 44 | CIFAR-100 | 68.39 | 70.09 | 70.42 | 70.36 |
| B3 | 44 | B2 | 38 | B4 | 54 | CIFAR-100 | 71.27 | 72.72 | 74.01 | 73.84 |
| B4 | 54 | B3 | 44 | B5 | 66 | CIFAR-100 | 71.35 | 73.87 | 74.27 | 73.83 |
| B5 | 66 | B4 | 54 | B6 | 76 | CIFAR-100 | 72.63 | 75.61 | 75.12 | 75.80 |
| B6 | 76 | B5 | 66 | B7 | 86 | CIFAR-100 | 73.11 | 74.75 | 75.89 | 75.92 |
| V2M | 40 | V2S | 32 | V2L | 48 | CIFAR-100 | 69.82 | 70.98 | 70.99 | 72.13 |

Table 2. Comparison of the student model's accuracy between ERKD using teachers 1 and 2, and STL

| Teacher 1 model | Teacher 1 image size | Teacher 2 model | Teacher 2 image size | Student model | Student image size | Dataset | STL (%) | Teacher 1 only (%) | Teacher 2 only (%) | Teachers 1 and 2, average (%) | Teachers 1 and 2, 10/45/45 (%) | Teachers 1 and 2, 30/35/35 (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| B1 | 34 | B0 | 32 | B2 | 38 | CIFAR-10 | 88.84 | 89.54 | 89.24 | 89.79 | 89.40 | 89.49 |
| B2 | 38 | B1 | 34 | B3 | 44 | CIFAR-10 | 90.63 | 90.87 | 91.24 | 91.34 | 91.21 | 91.18 |
| B3 | 44 | B2 | 38 | B4 | 54 | CIFAR-10 | 91.69 | 92.19 | 92.33 | 92.87 | 92.56 | 92.68 |
| B4 | 54 | B3 | 44 | B5 | 66 | CIFAR-10 | 92.63 | 93.02 | 93.11 | 92.94 | 93.23 | 93.43 |
| B5 | 66 | B4 | 54 | B6 | 76 | CIFAR-10 | 93.02 | 93.17 | 93.33 | 93.23 | 93.61 | 93.51 |
| B6 | 76 | B5 | 66 | B7 | 86 | CIFAR-10 | 93.18 | 93.89 | 93.86 | 93.78 | 93.75 | 93.94 |
| V2M | 40 | V2S | 32 | V2L | 48 | CIFAR-10 | 92.09 | 92.52 | 91.98 | 92.47 | 92.63 | 92.65 |
| B1 | 34 | B0 | 32 | B2 | 38 | CIFAR-100 | 64.93 | 67.41 | 67.28 | 68.77 | 68.39 | 68.68 |
| B2 | 38 | B1 | 34 | B3 | 44 | CIFAR-100 | 68.39 | 69.29 | 69.34 | 70.09 | 70.42 | 70.36 |
| B3 | 44 | B2 | 38 | B4 | 54 | CIFAR-100 | 71.27 | 73.05 | 73.60 | 72.72 | 74.01 | 73.84 |
| B4 | 54 | B3 | 44 | B5 | 66 | CIFAR-100 | 71.35 | 74.12 | 72.86 | 73.87 | 74.27 | 73.83 |
| B5 | 66 | B4 | 54 | B6 | 76 | CIFAR-100 | 72.63 | 74.26 | 75.06 | 75.61 | 75.12 | 75.80 |
| B6 | 76 | B5 | 66 | B7 | 86 | CIFAR-100 | 73.11 | 74.51 | 74.70 | 74.75 | 75.89 | 75.92 |
| V2M | 40 | V2S | 32 | V2L | 48 | CIFAR-100 | 69.82 | 70.08 | 68.73 | 70.98 | 70.99 | 72.13 |

To check whether architectural similarity influences the performance of ERKD, we picked other CNN models to replace the EfficientNet models as teachers. The other CNN models were picked and mapped to the EfficientNet models they replace on the basis of similar accuracy on the ImageNet dataset. The other CNN architectures we finally picked were ResNet, RegNet [28], ConvNeXt [29], and ResNeXt. Table 3 provides the mapping of the other CNN models to their EfficientNet equivalents. With the addition of the other CNN models, we now have four candidates to be used as teachers: two weaker EfficientNet models and two other CNN models equivalent to those EfficientNet models. For the sake of simplicity, we named the two EfficientNet models teacher 1 and teacher 2, and the two other CNN models teacher 3 and teacher 4. For example, to supervise EfficientNet B2, teacher 1 and teacher 2 are B1 and B0, respectively, while teacher 3 and teacher 4 are ResNet-101 and ResNet-152, respectively.

In Tables 4 and 5, we show the results of experiments that substitute only one EfficientNet teacher with another CNN model. In Table 4, an icon indicates that ERKD outperforms STL and the single-teacher method: the square indicates the best accuracy, the circle indicates the second-best accuracy, and the triangle indicates the third-best accuracy. In Tables 5 and 6, an icon likewise indicates that ERKD outperforms STL and the single-teacher method: the square indicates the best accuracy, the circle indicates the second-best accuracy, the equilateral triangle indicates the third-best accuracy, and the right triangle indicates the fourth-best accuracy.

Table 3. The mapping of other CNN models to the EfficientNet models based on similar accuracy on the ImageNet dataset

| EfficientNet model | Other CNN model | EfficientNet accuracy (%) | Other CNN model accuracy (%) |
|---|---|---|---|
| B0 | ResNet-101 | 77.692 | 77.374 |
| B1 | ResNet-152 | 77.692 | 77.374 |
| B2 | RegNet Y 16GF | 77.692 | 77.374 |
| B3 | ConvNeXt Tiny | 77.692 | 77.374 |
| B4 | ResNeXt101 64X4D | 77.692 | 77.374 |
| B5 | ResNeXt101 64X4D | 77.692 | 77.374 |
| B6 | ConvNeXt Small | 77.692 | 77.374 |
| V2S | ConvNeXt Base | 77.692 | 77.374 |
| V2M | ConvNeXt Large | 77.692 | 77.374 |

Table 4. Comparison of the student model's accuracy between ERKD using teachers 1 and 4, and STL

| Teacher 1 model | Teacher 4 model | Student model | Dataset | STL (%) | Teacher 1 only (%) | Teacher 4 only (%) | Teachers 1 and 4, average (%) | Teachers 1 and 4, 10/45/45 (%) | Teachers 1 and 4, 30/35/35 (%) |
|---|---|---|---|---|---|---|---|---|---|
| B1 | B0 | B2 | CIFAR-10 | 88.84 | 89.54 | 89.46 | 89.54 | 89.57 | 89.67 |
| B2 | B1 | B3 | CIFAR-10 | 90.63 | 90.87 | 90.56 | 91.07 | 90.94 | 91.34 |
| B3 | B2 | B4 | CIFAR-10 | 91.69 | 92.19 | 92.20 | 92.54 | 92.83 | 92.62 |
| B4 | B3 | B5 | CIFAR-10 | 92.63 | 93.02 | 92.81 | 92.98 | 93.03 | 93.34 |
| B5 | B4 | B6 | CIFAR-10 | 93.02 | 93.17 | 93.62 | 94.05 | 93.42 | 94.03 |
| B6 | B5 | B7 | CIFAR-10 | 93.18 | 93.89 | 93.89 | 93.85 | 93.74 | 94.28 |
| V2M | V2S | V2L | CIFAR-10 | 92.09 | 92.52 | 92.30 | 92.63 | 92.35 | 92.33 |
| B1 | B0 | B2 | CIFAR-100 | 64.93 | 67.41 | 65.95 | 67.59 | 67.22 | 66.99 |
| B2 | B1 | B3 | CIFAR-100 | 68.39 | 69.29 | 68.57 | 69.53 | 69.17 | 69.60 |
| B3 | B2 | B4 | CIFAR-100 | 71.27 | 73.05 | 73.22 | 74.49 | 73.10 | 73.51 |
| B4 | B3 | B5 | CIFAR-100 | 71.35 | 74.12 | 72.76 | 74.32 | 73.90 | 74.50 |
| B5 | B4 | B6 | CIFAR-100 | 72.63 | 74.26 | 74.21 | 74.93 | 74.88 | 75.14 |
| B6 | B5 | B7 | CIFAR-100 | 73.11 | 74.51 | 74.66 | 75.62 | 75.23 | 75.26 |
| V2M | V2S | V2L | CIFAR-100 | 69.82 | 70.08 | 69.95 | 72.30 | 71.16 | 71.32 |

Table 5. Comparison of the student model's accuracy between ERKD using teachers 2 and 3, and STL method

| Teacher 2 model | Teacher 3 model | Student model | Dataset | STL (%) | Teacher 2 only (%) | Teacher 3 only (%) | Teachers 2 and 3, average (%) | Teachers 2 and 3, 10/45/45 (%) | Teachers 2 and 3, 20/40/40 (%) | Teachers 2 and 3, 30/35/35 (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| B1 | B0 | B2 | CIFAR-10 | 88.84 | 89.24 | 89.58 | 89.79 | 89.33 | 89.75 | 89.64 |
| B2 | B1 | B3 | CIFAR-10 | 90.63 | 91.24 | 91.14 | 91.56 | 91.39 | 91.45 | 91.40 |
| B3 | B2 | B4 | CIFAR-10 | 91.69 | 92.33 | 92.46 | 92.10 | 92.42 | 92.63 | 92.91 |
| B4 | B3 | B5 | CIFAR-10 | 92.63 | 93.11 | 93.26 | 93.46 | 93.20 | 93.30 | 93.20 |
| B5 | B4 | B6 | CIFAR-10 | 93.02 | 93.33 | 93.95 | 93.41 | 93.79 | 94.15 | 93.73 |
| B6 | B5 | B7 | CIFAR-10 | 93.18 | 93.86 | 94.24 | 93.66 | 93.71 | 94.04 | 93.83 |
| V2M | V2S | V2L | CIFAR-10 | 92.09 | 91.98 | 92.47 | 92.31 | 92.63 | 92.29 | 92.26 |
| B1 | B0 | B2 | CIFAR-100 | 64.93 | 67.28 | 66.39 | 67.81 | 67.03 | 67.65 | 67.49 |
| B2 | B1 | B3 | CIFAR-100 | 68.39 | 69.34 | 69.30 | 70.19 | 70.06 | 69.99 | 69.85 |
| B3 | B2 | B4 | CIFAR-100 | 71.27 | 73.60 | 72.96 | 73.26 | 73.93 | 73.24 | 73.41 |
| B4 | B3 | B5 | CIFAR-100 | 71.35 | 72.86 | 72.99 | 73.47 | 73.81 | 73.66 | 73.98 |
| B5 | B4 | B6 | CIFAR-100 | 72.63 | 75.06 | 74.67 | 74.93 | 75.61 | 76.19 | 75.65 |
| B6 | B5 | B7 | CIFAR-100 | 73.11 | 74.70 | 74.55 | 75.60 | 74.94 | 74.86 | 75.04 |
| V2M | V2S | V2L | CIFAR-100 | 69.82 | 68.73 | 70.80 | 70.92 | 70.91 | 71.18 | 71.39 |
In Table 4, only teacher 1 and teacher 4 are used. Meanwhile, only teacher 2 and teacher 3 are used in Table 5. In Table 5, we add a new proportion of 20% for the cross-entropy loss and 40% for the KL divergence loss of each teacher. We also show the result of substituting all the EfficientNet teachers with other CNN models in Table 6. From these results, we found that ERKD can generally still improve the accuracy compared to STL and to using one teacher only. This is especially clear when the accuracies of the different teacher combinations are placed side by side in Table 7, where no combination dominantly outperforms the others. The squares in Table 7 indicate superior performance. Thus, ERKD still works regardless of architectural similarity.

Table 6. Comparison of the student model's accuracy between ERKD using teachers 3 and 4, and STL

| Teacher 3 model | Teacher 4 model | Student model | Dataset | STL (%) | Teacher 3 only (%) | Teacher 4 only (%) | Teachers 3 and 4, average (%) | Teachers 3 and 4, 10/45/45 (%) | Teachers 3 and 4, 30/35/35 (%) | Teachers 3 and 4, 50/25/25 (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| B1 | B0 | B2 | CIFAR-10 | 88.84 | 89.58 | 89.46 | 89.50 | 89.23 | 89.52 | 89.85 |
| B2 | B1 | B3 | CIFAR-10 | 90.63 | 91.14 | 90.56 | 91.03 | 91.35 | 91.04 | 91.39 |
| B3 | B2 | B4 | CIFAR-10 | 91.69 | 92.46 | 92.20 | 92.22 | 92.54 | 92.43 | 92.63 |
| B4 | B3 | B5 | CIFAR-10 | 92.63 | 93.26 | 92.81 | 93.46 | 92.80 | 93.35 | 92.95 |
| B5 | B4 | B6 | CIFAR-10 | 93.02 | 93.95 | 93.62 | 94.11 | 93.92 | 93.56 | 94.24 |
| B6 | B5 | B7 | CIFAR-10 | 93.18 | 94.24 | 93.89 | 94.00 | 93.99 | 93.91 | 93.92 |
| V2M | V2S | V2L | CIFAR-10 | 92.09 | 92.47 | 92.30 | 92.66 | 92.54 | 92.26 | 92.23 |
| B1 | B0 | B2 | CIFAR-100 | 64.93 | 66.39 | 65.95 | 66.33 | 66.64 | 66.73 | 66.80 |
| B2 | B1 | B3 | CIFAR-100 | 68.39 | 69.30 | 68.57 | 69.61 | 69.80 | 69.70 | 68.53 |
| B3 | B2 | B4 | CIFAR-100 | 71.27 | 72.96 | 73.22 | 72.82 | 73.31 | 72.65 | 72.33 |
| B4 | B3 | B5 | CIFAR-100 | 71.35 | 72.99 | 72.76 | 73.76 | 73.39 | 73.56 | 73.25 |
| B5 | B4 | B6 | CIFAR-100 | 72.63 | 74.67 | 74.21 | 74.85 | 75.03 | 74.70 | 75.20 |
| B6 | B5 | B7 | CIFAR-100 | 73.11 | 74.55 | 74.66 | 75.07 | 74.82 | 74.96 | 74.10 |
| V2M | V2S | V2L | CIFAR-100 | 69.82 | 70.80 | 69.95 | 71.58 | 71.03 | 71.69 | 71.10 |

Table 7. The accuracy comparison of ERKD with various teachers

| Student model | Dataset | Teachers 1 and 2 (%) | Teachers 1 and 4 (%) | Teachers 2 and 3 (%) | Teachers 3 and 4 (%) |
|---|---|---|---|---|---|
| B2 | CIFAR-10 | 89.79 | 89.67 | 89.79 | 89.85 |
| B3 | CIFAR-10 | 91.34 | 91.34 | 91.56 | 91.39 |
| B4 | CIFAR-10 | 92.87 | 92.83 | 92.91 | 92.63 |
| B5 | CIFAR-10 | 93.43 | 93.34 | 93.46 | 93.46 |
| B6 | CIFAR-10 | 93.61 | 94.05 | 94.15 | 94.24 |
| B7 | CIFAR-10 | 93.94 | 94.28 | 94.04 | 94.00 |
| V2L | CIFAR-10 | 92.65 | 92.63 | 92.63 | 92.66 |
| B2 | CIFAR-100 | 68.77 | 67.59 | 67.81 | 66.80 |
| B3 | CIFAR-100 | 70.42 | 69.60 | 70.19 | 69.80 |
| B4 | CIFAR-100 | 74.01 | 74.49 | 73.93 | 73.31 |
| B5 | CIFAR-100 | 74.27 | 74.50 | 73.98 | 73.76 |
| B6 | CIFAR-100 | 75.80 | 75.14 | 76.19 | 75.20 |
| B7 | CIFAR-100 | 75.92 | 75.62 | 75.60 | 75.07 |
| V2L | CIFAR-100 | 72.13 | 72.30 | 71.39 | 71.69 |

When we compared the accuracy of models trained with ERKD against a stronger model trained with STL, we found the surprising result that a weaker model trained with ERKD can sometimes be more accurate than a stronger model trained with STL. For example, we compared the performance of the EfficientNet B2 model using ERKD with the performance of the EfficientNet B3 model using the STL method. The results can be seen in Table 8, which shows that some models on certain datasets can beat stronger models. The icons in Table 8 indicate that ERKD outperforms the STL of a model one level more robust. Similarly, the icons in Table 9 indicate that ERKD outperforms the STL of a model two levels more robust. The square indicates the best accuracy, the circle indicates the second-best accuracy, the equilateral triangle indicates the third-best accuracy, and the right triangle indicates the fourth-best accuracy. For the CIFAR-10 dataset, EfficientNet models B4, B5, and B6 with ERKD can outperform EfficientNet models B5, B6, and B7, respectively. Meanwhile, for the CIFAR-100 dataset, EfficientNet models B2, B4, B5, and B6 can outperform EfficientNet models B3, B5, B6, and B7, respectively.
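The one-level comparison can be checked mechanically from the numbers reported in Table 8. The sketch below is only an illustration: the dictionary layout and the function name are hypothetical, and counting a student as a winner when its best ERKD accuracy over the four teacher combinations exceeds the STL accuracy of the next-stronger model is an assumed reading of the comparison.

```python
# Accuracies (%) copied from Table 8: the STL accuracy of the one-level-stronger
# model, followed by the ERKD student accuracy for each teacher combination
# (teachers 1 and 2, 1 and 4, 2 and 3, 3 and 4).
TABLE_8 = {
    "CIFAR-10": {
        "B2": (90.63, [89.79, 89.67, 89.79, 89.85]),
        "B3": (91.69, [91.34, 91.34, 91.56, 91.39]),
        "B4": (92.63, [92.87, 92.83, 92.91, 92.63]),
        "B5": (93.02, [93.43, 93.34, 93.46, 93.46]),
        "B6": (93.18, [93.61, 94.05, 94.15, 94.24]),
    },
    "CIFAR-100": {
        "B2": (68.39, [68.77, 67.59, 67.81, 66.80]),
        "B3": (71.27, [70.42, 69.60, 70.19, 69.80]),
        "B4": (71.35, [74.01, 74.49, 73.93, 73.31]),
        "B5": (72.63, [74.27, 74.50, 73.98, 73.76]),
        "B6": (73.11, [75.80, 75.14, 76.19, 75.20]),
    },
}


def students_beating_stronger_stl(table):
    """Return, per dataset, the students whose best ERKD accuracy exceeds the STL baseline."""
    winners = {}
    for dataset, rows in table.items():
        winners[dataset] = [
            student
            for student, (stl_accuracy, erkd_accuracies) in rows.items()
            if max(erkd_accuracies) > stl_accuracy
        ]
    return winners


print(students_beating_stronger_stl(TABLE_8))
# {'CIFAR-10': ['B4', 'B5', 'B6'], 'CIFAR-100': ['B2', 'B4', 'B5', 'B6']}
```

Under this reading, the output agrees with the students listed in the text for both datasets.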
Table 8. The accuracy comparison between ERKD and the STL model with a one-level higher robust model

| Student model | Dataset | STL 1 level higher (%) | Teachers 1 and 2 (%) | Teachers 1 and 4 (%) | Teachers 2 and 3 (%) | Teachers 3 and 4 (%) |
|---|---|---|---|---|---|---|
| B2 | CIFAR-10 | 90.63 | 89.79 | 89.67 | 89.79 | 89.85 |
| B3 | CIFAR-10 | 91.69 | 91.34 | 91.34 | 91.56 | 91.39 |
| B4 | CIFAR-10 | 92.63 | 92.87 | 92.83 | 92.91 | 92.63 |
| B5 | CIFAR-10 | 93.02 | 93.43 | 93.34 | 93.46 | 93.46 |
| B6 | CIFAR-10 | 93.18 | 93.61 | 94.05 | 94.15 | 94.24 |
| B2 | CIFAR-100 | 68.39 | 68.77 | 67.59 | 67.81 | 66.80 |
| B3 | CIFAR-100 | 71.27 | 70.42 | 69.60 | 70.19 | 69.80 |
| B4 | CIFAR-100 | 71.35 | 74.01 | 74.49 | 73.93 | 73.31 |
| B5 | CIFAR-100 | 72.63 | 74.27 | 74.50 | 73.98 | 73.76 |
| B6 | CIFAR-100 | 73.11 | 75.80 | 75.14 | 76.19 | 75.20 |

Table 9. The accuracy comparison between ERKD and the STL model with a two-level higher robust model

| Student model | Dataset | STL 2 levels higher (%) | Teachers 1 and 2 (%) | Teachers 1 and 4 (%) | Teachers 2 and 3 (%) | Teachers 3 and 4 (%) |
|---|---|---|---|---|---|---|
| B2 | CIFAR-10 | 91.69 | 89.79 | 89.67 | 89.79 | 89.85 |
| B3 | CIFAR-10 | 92.63 | 91.34 | 91.34 | 91.56 | 91.39 |
| B4 | CIFAR-10 | 93.02 | 92.87 | 92.83 | 92.91 | 92.63 |
| B5 | CIFAR-10 | 93.18 | 93.43 | 93.34 | 93.46 | 93.46 |
| B2 | CIFAR-100 | 71.27 | 68.77 | 67.59 | 67.81 | 66.80 |
| B3 | CIFAR-100 | 71.35 | 70.42 | 69.60 | 70.19 | 69.80 |
| B4 | CIFAR-100 | 72.63 | 74.01 | 74.49 | 73.93 | 73.31 |
| B5 | CIFAR-100 | 73.11 | 74.27 | 74.50 | 73.98 | 73.76 |

We also compared ERKD with models two levels stronger trained with STL. For example, the performance of the EfficientNet B2 model using ERKD is compared with the performance of the EfficientNet B4 model using the STL method. The results can be seen in Table 9. Surprisingly, we still found that some weaker models trained with ERKD can be more accurate than models two levels stronger trained with STL. Using ERKD, EfficientNet B5 on the CIFAR-10 dataset performs better than EfficientNet B7 on the same dataset. In addition, EfficientNet B4 and B5 on the CIFAR-100 dataset using ERKD also perform better than EfficientNet B6 and B7, respectively. These two surprising results indicate that ERKD effectively improves model performance.

4. CONCLUSION
All experiments showed that ERKD can improve the model's performance.
The model's performance with the ERKD method can be better than with the STL and single-teacher methods. It can also be better than that of a model one or two levels stronger trained with the STL method. Thus, the ERKD method is suitable for supervising stronger models using weaker models. This study also showed that the ERKD method can improve the model's performance even when the architectures of the weak and strong models are different: the EfficientNet models can still outperform STL even when supervised by other CNN models. Despite using weaker AI models instead of humans, the results of this study show a glimmer of hope that an AI with stronger intelligence than humans can still be supervised by humans. The trick is to have several humans collaborate in managing a superalignment model. Future studies could investigate a similar setup but without using the trained model. They could also investigate ERKD methods in other computer vision tasks, such as image detection or image segmentation. In addition, they could experiment with using more than two weaker models to supervise a stronger model in order to find the optimal number of weaker models.

FUNDING INFORMATION
Authors state there is no funding involved.

AUTHOR CONTRIBUTIONS STATEMENT
This journal uses the Contributor Roles Taxonomy (CRediT) to recognize individual author contributions, reduce authorship disputes, and facilitate collaboration.
Name of Author | C | M | So | Va | Fo | I | R | D | O | E | Vi | Su | P | Fu
Christopher Gavra Reswara
Tjeng Wawan Cenggoro

C: Conceptualization
M: Methodology
So: Software
Va: Validation
Fo: Formal Analysis
I: Investigation
R: Resources
D: Data Curation
O: Writing - Original Draft
E: Writing - Review & Editing
Vi: Visualization
Su: Supervision
P: Project Administration
Fu: Funding Acquisition

CONFLICT OF INTEREST STATEMENT
Authors state there is no conflict of interest.

DATA AVAILABILITY
No new data were generated or analyzed during this study.

REFERENCES
[1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778, doi: 10.1109/CVPR.2016.90.
[2] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2261–2269, doi: 10.1109/CVPR.2017.243.
[3] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 6105–6114, 2019.
[4] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2818–2826, doi: 10.1109/CVPR.2016.308.
[5] A. Howard et al., “Searching for MobileNetV3,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 1314–1324, doi: 10.1109/ICCV.2019.00140.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255, doi: 10.1109/CVPR.2009.5206848.
[7] A. Krizhevsky, “Learning multiple layers of features from tiny images,” M.Sc. Thesis, Department of Computer Science, University of Toronto, Toronto, Canada, 2009.
[8] L. Bossard, M. Guillaumin, and L. V. Gool, “Food-101–mining discriminative components with random forests,” in Computer Vision - European Conference on Computer Vision (ECCV), pp. 446–461, 2014, doi: 10.1007/978-3-319-10599-4_29.
[9] M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, 2008, pp. 722–729, doi: 10.1109/ICVGIP.2008.47.
[10] T. Berg, J. Liu, S. W. Lee, M. L. Alexander, D. W. Jacobs, and P. N. Belhumeur, “Birdsnap: Large-scale fine-grained visual categorization of birds,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2019–2026, doi: 10.1109/CVPR.2014.259.
[11] J. Achiam et al., “GPT-4 technical report,” arXiv-Computer Science, Mar. 2023.
[12] P. Georgiev et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” arXiv-Computer Science, Dec. 2024.
[13] A. Grattafiori et al., “The Llama 3 herd of models,” arXiv-Computer Science, Nov. 2024.
[14] A. Glaese et al., “Improving alignment of dialogue agents via targeted human judgements,” arXiv-Computer Science, Sep. 2022.
[15] Y. Bai et al., “Training a helpful and harmless assistant with reinforcement learning from human feedback,” arXiv-Computer Science, Apr. 2022.
[16] L. Ouyang et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
[17] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529–533, Feb. 2015, doi: 10.1038/nature14236.
[18] D. Silver et al., “Mastering the game of go without human knowledge,” Nature, vol. 550, pp. 354–359, Oct. 2017, doi: 10.1038/nature24270.
[19] D. Silver et al., “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play,” Science, vol. 362, no. 6419, pp. 1140–1144, 2018, doi: 10.1126/science.aar6404.
[20] D. Guo et al., “Deepseek-R1: Incentivizing reasoning capability in llms via reinforcement learning,” arXiv-Computer Science, pp. 1–22, Jan. 2025.
[21] L. Breiman, “Random forests,” Machine Learning, vol. 45, pp. 5–32, Oct. 2001, doi: 10.1023/A:1010933404324.
[22] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, 2016, doi: 10.1145/2939672.2939785.
[23] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv-Statistics, pp. 1–9, Mar. 2015.
[24] M. Tan and Q. Le, “Efficientnetv2: Smaller models and faster training,” in Proceedings of the 38th International Conference on Machine Learning (ICML), Apr. 2021, pp. 10096–10106.
[25] L. Fei-Fei, J. Deng, and K. Li, “ImageNet: Constructing a large-scale image database,” Journal of Vision, vol. 9, no. 8, pp. 1037–1037, 2009, doi: 10.1167/9.8.1037.
[26] O. Russakovsky et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, pp. 211–252, 2015, doi: 10.1007/s11263-015-0816-y.
[27] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv-Computer Science, pp. 1–15, Jan. 2017.
[28] I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollár, “Designing network design spaces,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 10425–10433, doi: 10.1109/CVPR42600.2020.01044.
[29] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A ConvNet for the 2020s,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 11966–11976, doi: 10.1109/CVPR52688.2022.01167.

BIOGRAPHIES OF AUTHORS

Christopher Gavra Reswara received his bachelor's degree in Computer Science from Bina Nusantara University, where he is pursuing a master's degree in the same field. He also works as a programmer at the Bina Nusantara IT Division. His research focuses on artificial intelligence, recommendation systems, and computer vision, and he has authored two conference papers on recommendation systems. He can be contacted at email: christopher.reswara@binus.ac.id.

Tjeng Wawan Cenggoro received a bachelor's degree in Information Technology from STMIK Widya Cipta Dharma and a master's degree in Information Technology from Bina Nusantara University. He is currently an AI researcher focusing on developing deep learning algorithms for applications in computer vision, natural language processing, and bioinformatics. He is also an NVIDIA Deep Learning Institute certified instructor. Throughout his 9+ year career, he has led numerous research projects related to AI and data science, with applications in many domains such as e-commerce, agriculture, and health. He has published over 80 peer-reviewed publications and reviewed for prestigious journals, such as Scientific Reports, IEEE Access, and PLOS ONE. In addition to this, he also holds 4 copyrights for AI-based video/image analytics software. He can be contacted at email: tjeng.cenggoro@binus.ac.id.