International Journal of Informatics and Communication Technology (IJ-ICT)
Vol. 15, No. 1, March 2026, pp. 198-206
ISSN: 2252-8776, DOI: 10.11591/ijict.v15i1.pp198-206

Leveraging distillation token and weaker teacher model to improve DeiT transfer learning capability

Christopher Gavra Reswara, Gede Putra Kusuma
Department of Computer Science, BINUS Graduate Program - Master of Computer Science, Bina Nusantara University, Jakarta, Indonesia

Article Info
Article history: Received Mar 6, 2025; Revised Oct 28, 2025; Accepted Nov 5, 2025

Keywords: DeiT model; Distillation token; Knowledge distillation; Transfer learning; Transformers architecture; Weak-to-strong generalization

ABSTRACT
Recently, distilling knowledge from convolutional neural networks (CNN) has positively impacted the data-efficient image transformer (DeiT) model. Thanks to the distillation token, this method boosts DeiT performance and helps DeiT learn faster. However, a distillation procedure based on that token has not yet been applied when transferring DeiT to a downstream dataset. This study proposes a distillation procedure based on the distillation token for transfer learning, which boosts DeiT performance on downstream datasets. For example, our proposed method improves DeiT B 16 performance by 1.75% on the OxfordIIIT-Pets dataset. Furthermore, we propose using a weaker model as the teacher of DeiT. This shortens the transfer learning process of the teacher model without reducing DeiT performance much. For example, DeiT B 16 performance decreased by only 0.42% on Oxford 102 Flowers with EfficientNet V2S as the teacher compared to RegNetY 16GF. In contrast, in several cases, DeiT B 16 performance could even improve with a weaker teacher model.
For example, DeiT B 16 performance improved by 1.06% on the OxfordIIIT-Pets dataset with EfficientNet V2S as the teacher model compared to RegNetY 16GF.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Christopher Gavra Reswara
Department of Computer Science, BINUS Graduate Program - Master of Computer Science
Bina Nusantara University, Jakarta, Indonesia
Email: christopher.reswara@binus.ac.id

1. INTRODUCTION
Recently, transformer [1] architectures have become the model of choice in natural language processing (NLP) and computer vision. Thanks to the self-attention module, networks built on this architecture have the capacity to perform various tasks. In NLP, transformer models achieve competitive results in developing large language models (LLM), such as GPT-4o [2], Llama 3.2 [3], and Gemini 1.5 [4]. These LLMs help complete human tasks like text summarization [5], sentiment analysis [6], question answering [7], and others. In addition, transformer models achieve excellent performance in computer vision, including image classification [8], object detection [9], image matching [10], and other tasks.

The first transformer-based model in computer vision was the vision transformer (ViT) [11]. That model takes raw image patches as input and a classification token as output. Subsequently, transformer-based models in computer vision developed into various models, such as DeiT [12], Swin [8], Swin V2 [13], and others. For example, DeiT introduces a new procedure for knowledge distillation (KD) [14] in a transformer-

Journal homepage: http://ijict.iaescore.com
based model. Compared to ViT, this model has a new token: the distillation token. The token is used to calculate the loss between teacher and student output. To improve DeiT performance, the RegNetY 16GF [15] model was used as its teacher. RegNetY 16GF has 83.6M parameters, similar to DeiT B 16, which has 86.6M. While training the student model, the output of the teacher model acts as a supervisor. The classification token of DeiT is scored with cross-entropy loss, while the distillation token of DeiT and the teacher output are scored with Kullback-Leibler divergence (KL divergence) [16]. The two losses are averaged into the student loss used for backward propagation in the student model. Unfortunately, this technique has not yet been used for transfer learning to downstream datasets.

Therefore, this study investigates the effects of utilizing a distillation token for transfer learning to a downstream dataset. While the DeiT paper explored the impact of the distillation token when training a model from scratch, it did not explicitly address its influence on transfer learning to a downstream dataset. Our aim is to enhance the transfer learning capability of the transformer-based model. To prove it, we design a simple setup. In Figure 1, we utilize RegNetY 16GF as the teacher model. (a) We transfer the RegNetY 16GF model pre-trained on ImageNet-1k [17] to the downstream dataset; we call the result the trained teacher model on the downstream dataset (as shown in Figure 1(a)). After that, (b) we leverage the trained teacher model on the downstream dataset to supervise a DeiT student model pre-trained on ImageNet-1k while it transfers to the same downstream dataset (as shown in Figure 1(b)).
In this way, we show that this technique can improve DeiT's transfer learning capability. The experiments in this study use CIFAR-10 [18], CIFAR-100 [18], Oxford 102 Flowers [19], and Oxford-IIIT Pets [20] as downstream datasets.

In addition, we adopt the weak-to-strong generalization [21] concept to simplify step (a), transferring the teacher model pre-trained on ImageNet-1k to the downstream dataset. This concept arose from the rapid and robust development of artificial intelligence, especially transformer-based models. A superalignment model that is more intelligent than humans is possible, yet humans would struggle to verify that such a model is still correct and safe. The concept therefore shows that a weaker supervisor can still supervise a stronger model.

Figure 1. Illustration of our proposed method: (a) the teacher model pre-trained on the ImageNet-1k dataset is transferred to the downstream dataset and (b) the student model pre-trained on the ImageNet-1k dataset is supervised by the trained teacher model on the downstream dataset
The teacher model used in the DeiT paper is RegNetY 16GF. That model has 83.6M parameters, similar to DeiT B 16 (the student model), which has 86.6M. To implement the weak-to-strong generalization concept, we propose using a weaker teacher model. We utilize EfficientNet B4 (19.3M) [22] and EfficientNet V2S (21.5M) [23], i.e., teacher models approximately 75% smaller than the student model. Our contributions are as follows:
(a) We propose a distillation procedure based on the distillation token for transfer learning to the downstream dataset. We find this technique capable of improving DeiT model performance on the downstream dataset.
(b) We introduce using a weaker model as the teacher of the DeiT model. This shortens the transfer learning process of the teacher model, since the teacher is approximately 75% smaller, without reducing DeiT (student) model performance.
(c) We find that a CNN model is the best teacher for a transformer model in the transfer learning process to downstream datasets. In addition, we find that soft distillation outperforms hard distillation.

2. METHOD
2.1. Dataset
In this study, the ImageNet-1k dataset is used as a large dataset for training a model from scratch. That dataset has 1,000 classes and consists of 1,281,167 training images, 50,000 validation images, and 100,000 test images. Models trained on ImageNet-1k are then transferred to the downstream datasets, i.e., CIFAR-10, CIFAR-100, Oxford 102 Flowers, and Oxford-IIIT Pets. Table 1 presents detailed information on the downstream datasets.

2.2. Train, validation, and test split data
Each dataset is split into a train, validation, and test set for model training. A model uses the training set for training; that set is augmented so that the model can be trained well. In contrast, the validation and test sets are not augmented.
A validation set is used to calculate the error rate of a model during training and its impact on the backward propagation process. Meanwhile, a test set is used to evaluate model performance after training. In this study, we take 10% of the training images of CIFAR-10, CIFAR-100, and OxfordIIIT-Pets for the validation set. However, we use the default train and validation split of Oxford 102 Flowers, with 50% each for the train and validation sets. Table 2 shows the detailed split of each dataset.

Table 1. Downstream datasets information
Dataset | Size (Train/Test) | Classes
CIFAR-10 | 50,000/10,000 | 10
CIFAR-100 | 50,000/10,000 | 100
Oxford 102 Flowers | 2,040/6,149 | 102
OxfordIIIT-Pets | 3,680/3,669 | 37

Table 2. Split dataset into train, validation, and test sets
Dataset | Size (Train/Val/Test)
CIFAR-10 | 45,000/5,000/10,000
CIFAR-100 | 45,000/5,000/10,000
Oxford 102 Flowers | 1,020/1,020/6,149
OxfordIIIT-Pets | 3,312/368/3,669

2.3. Data preprocessing
Data preprocessing is applied to all downstream datasets and to all their parts: train, validation, and test sets. The preprocessing techniques used in this study are image resizing, standardization, and normalization. First, all images are resized to 224 x 224 pixels. Images are then standardized to convert pixel values from the range 0.0-255.0 into 0.0-1.0, and the image array layout is converted from Height, Width, Channel to Channel, Height, Width. After standardization, each downstream dataset is normalized using its own per-channel mean and standard deviation; the normalization values for CIFAR-10 therefore differ from those of the other downstream datasets, such as CIFAR-100. For CIFAR-10, the means of the channels (Red, Green, Blue) are 0.4914, 0.4822, 0.4465, and the standard deviations are 0.247, 0.243, 0.261.
Meanwhile, for CIFAR-100, the means are 0.5071, 0.4865, 0.4409, and the standard deviations are 0.267, 0.256, 0.276. For Oxford 102 Flowers, the means are 0.4330, 0.3819, 0.2964, and the standard deviations are 0.273, 0.224, 0.253. Finally, for OxfordIIIT-Pets, the means are 0.4782, 0.4458, 0.3956, and the standard deviations are 0.247, 0.241, 0.249.
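The standardize-then-normalize steps above can be sketched in plain Python (the nested-list image representation and the `preprocess` helper name are ours for illustration; an actual pipeline would presumably use torchvision transforms):

```python
# Per-channel statistics for CIFAR-10 as listed above (RGB order).
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.247, 0.243, 0.261)

def preprocess(image_hwc, mean, std):
    """Standardize an H x W x 3 image with 0..255 values into 0..1,
    normalize each channel with its mean/std, and reorder the layout
    from Height-Width-Channel to Channel-Height-Width."""
    height, width = len(image_hwc), len(image_hwc[0])
    return [
        [[(image_hwc[h][w][c] / 255.0 - mean[c]) / std[c] for w in range(width)]
         for h in range(height)]
        for c in range(3)
    ]
```

Resizing to 224 x 224 would happen before this step and is omitted here for brevity.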
2.4. Data augmentation
Data augmentation is applied only to the train set of each downstream dataset. The techniques used are random crop and random horizontal flip. For CIFAR-10 and CIFAR-100, the original images are 32 x 32 pixels, so both datasets are randomly cropped to 28 x 28 pixels. Meanwhile, Oxford 102 Flowers and OxfordIIIT-Pets have a variety of image sizes; both are randomly cropped to 196 x 196 pixels. After that, images are randomly flipped horizontally. After data augmentation, the train set of each downstream dataset goes through the data preprocessing described above.

2.5. Distillation loss
Distillation loss is a numerical metric that measures the difference between the predicted outputs of the student and teacher models. Two techniques are used in this study to compute distillation loss: hard distillation and soft distillation. Hard distillation uses cross-entropy loss, while soft distillation uses the KL divergence function.

Hard distillation. Let $Z_s$ be the logits of the DeiT (student) model and $Z_t$ be the logits of the teacher model. We denote the cross-entropy loss by $L_{CE}$ and the softmax function by $\psi$. For hard distillation, the teacher's predicted output is passed through the argmax function, which becomes the hard decision of the teacher model, denoted $y_t = \arg\max_c Z_t(c)$. The hard distillation loss is then defined as follows:

$$L_{hard\,distill} = L_{CE}(\psi(Z_s), y_t)$$

Soft distillation. We denote the KL divergence function by $KL$ and the temperature of the soft distillation by $\tau$. The temperature smooths the probability distribution. In this study, we used $\tau = 2$ for all soft distillation experiments.
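The hard decision and hard distillation loss defined above can be sketched in plain Python (softmax and cross-entropy are implemented inline to keep the sketch self-contained):

```python
import math

def softmax(logits):
    """psi: numerically stable softmax over a list of logits."""
    shifted = [v - max(logits) for v in logits]
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def hard_decision(z_t):
    """y_t = argmax_c Z_t(c): the teacher's hard class decision."""
    return max(range(len(z_t)), key=lambda c: z_t[c])

def hard_distill_loss(z_s, z_t):
    """L_CE(psi(Z_s), y_t): cross-entropy of the student softmax
    against the teacher's argmax class."""
    y_t = hard_decision(z_t)
    return -math.log(softmax(z_s)[y_t])
```

A student whose logits agree with the teacher's argmax class incurs a small loss; a student that disagrees incurs a large one.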
The soft distillation loss is defined as follows:

$$L_{soft\,distill} = \tau^2 \, KL\big(\psi(Z_s/\tau), \psi(Z_t/\tau)\big)$$

After computing the distillation loss, we can compute the global loss for the backward propagation process. The global loss is the average of the student model loss and the distillation loss. Let $L_{global}$ be the global loss and $y$ the true class. The global loss is defined as follows:

$$L_{global} = \frac{1}{2} L_{CE}(\psi(Z_s), y) + \frac{1}{2} L_{distill}$$

2.6. Training process
The training process in this study uses the AdamW [24] optimizer and CosineLRScheduler [25]. Training runs for 10 epochs with a batch size of 32 and a random seed of 42. Model checkpointing during training is based on the best validation accuracy. Finally, when transferring the DeiT model to a downstream dataset, only the attention layers are trained; all other layers are frozen.

2.7. Experiment setup
Based on our proposed method, as shown in Figure 1, we present two steps to improve DeiT transfer learning capability. First, we use a model pre-trained on the ImageNet-1k dataset as the teacher model and transfer it to the downstream datasets (Figure 1(a)). The results of this first step are the trained teacher model and the teacher logits, which we denote by $Z_t$. In the next step, we use a DeiT model pre-trained on the ImageNet-1k dataset. While transferring the student model to the downstream datasets, we use the trained teacher model as a helper through $Z_t$. $Z_t$ is compared with the distillation token using the distillation method, producing the distillation loss. That loss is combined with the student loss into the global loss, which the student model uses to update its weights.
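A matching plain-Python sketch of the soft distillation loss and the global loss (the KL direction, with the teacher as the target distribution, is our assumption following the standard distillation setup; helpers are implemented inline):

```python
import math

def softmax(logits, tau=1.0):
    """psi(Z/tau): temperature-softened, numerically stable softmax."""
    shifted = [v / tau - max(x / tau for x in logits) for v in logits]
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def soft_distill_loss(z_s, z_t, tau=2.0):
    """tau^2 * KL(psi(Z_s/tau), psi(Z_t/tau)), teacher as target."""
    p_s = softmax(z_s, tau)
    p_t = softmax(z_t, tau)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    return tau * tau * kl

def global_loss(z_s, z_t, y, tau=2.0):
    """0.5 * L_CE(psi(Z_s), y) + 0.5 * L_distill, as defined above."""
    ce = -math.log(softmax(z_s)[y])
    return 0.5 * ce + 0.5 * soft_distill_loss(z_s, z_t, tau)
```

Note how the temperature enters twice: logits are divided by tau before the softmax, and the KL term is rescaled by tau squared so gradient magnitudes stay comparable across temperatures.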
In that way, the teacher model can help the student model and improve the student model's performance.
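The freezing scheme and optimizer setup from Sections 2.6 and 2.7 might look as follows in PyTorch; the "attn" parameter-name pattern is an assumption based on common DeiT implementations such as timm, and the learning rate is illustrative rather than taken from the paper:

```python
import torch.nn as nn
from torch.optim import AdamW

def configure_student_for_transfer(model: nn.Module, lr: float = 5e-4):
    """Freeze every layer except the attention blocks, then build AdamW
    over the parameters that remain trainable."""
    for name, param in model.named_parameters():
        # Assumed naming convention: attention parameters contain "attn".
        param.requires_grad = "attn" in name
    trainable = [p for p in model.parameters() if p.requires_grad]
    return AdamW(trainable, lr=lr)
```

A cosine schedule (timm's CosineLRScheduler, or torch.optim.lr_scheduler.CosineAnnealingLR) would then wrap the returned optimizer for the 10-epoch run described above.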
3. RESULTS AND DISCUSSION
3.1. CNN vs transformer teacher
First, we proposed using a distillation token for transfer learning to the downstream dataset, so we investigated which teacher architecture works best. Figure 2 compares the performance of the DeiT B 16 model when transferred to the CIFAR-10, CIFAR-100, Oxford 102 Flowers, and OxfordIIIT-Pets datasets with RegNetY 16GF versus DeiT B 16 as the teacher model. We found that the CNN architecture (RegNetY 16GF) as teacher outperforms the transformer architecture (DeiT B 16). The inductive bias adapted from a CNN to a transformer through distillation makes the CNN a better teacher, as explained by Abnar [26]. A CNN has a local inductive bias that can help DeiT learn faster, complementing the global inductive bias designed into the transformer architecture. Hence, RegNetY 16GF can outperform in a transfer learning process of only 10 epochs. The following experiments in this study use a CNN model, specifically RegNetY 16GF with 83.6M parameters, as the teacher model.

3.2. Hard vs soft distillation
In addition, we compare the two techniques for computing distillation loss, as shown in Figure 3. Soft distillation outperforms on all downstream datasets. For example, transferring the DeiT B 16 model to Oxford 102 Flowers with soft distillation reaches 95.83% accuracy, compared to only 92.51% with hard distillation. Likewise, soft distillation is 1.34% higher than hard distillation on the CIFAR-100 dataset. Soft distillation passes the teacher model's predicted class probabilities to the distillation token through KL divergence. With this information, the distillation token can adjust better, because the actual class is not always the teacher's first prediction; it may be the teacher's second or third prediction.
Hence, the following experiments in this study use the soft distillation technique with $\tau = 2$.

Figure 2. Performance comparison of the DeiT B 16 model with different teacher architectures. The CNN architecture (RegNetY 16GF) outperforms the transformer architecture (DeiT B 16)

Figure 3. Performance comparison of the DeiT B 16 model with different distillation loss techniques. Soft distillation outperforms hard distillation

3.3. Transfer learning to downstream datasets
Finally, having configured RegNetY 16GF as the teacher model together with soft distillation, which gives the DeiT B 16 model its best performance, we show that using the distillation token and the teacher's predicted output to compute the distillation loss (our proposed method) is better than just averaging the distillation and classification tokens (without a teacher, i.e., standard transfer learning) for transfer learning to downstream datasets. Figure 4 shows that our proposed method significantly improves the DeiT B 16 model. For example, the performance of the DeiT B 16 model increased by 1.75% on the OxfordIIIT-Pets dataset. Similarly, on Oxford 102 Flowers, DeiT B 16 reaches 95.83% with our proposed method, compared to only 94.99% without a teacher.
This happens because the student model can leverage the teacher's knowledge well. Our proposed method improves the student model because the distillation token receives information from the teacher model. That information makes the student model learn more directly and faster. The proof is that the standard transfer learning process needs 300 epochs in the DeiT paper, while this study needs only 10 epochs.

3.4. Using a weaker teacher model
Unfortunately, our proposed method adds a step: standard transfer learning of the teacher model to the downstream dataset, which makes the overall transfer of DeiT to the downstream dataset longer. Fortunately, because we use a CNN architecture as the teacher model, transferring the CNN teacher is faster than transferring a transformer teacher. Even so, we still tried to reduce the time spent on standard transfer learning of the teacher model.

We therefore proposed using a weaker CNN model as the teacher, where "weaker" refers to the number of model parameters. The previous experiment used RegNetY 16GF as the teacher model with 83.6M parameters, similar to the student model DeiT B 16 with 86.6M parameters. We now present two weaker teachers, EfficientNet B4 and EfficientNet V2S, with 19.3M and 21.5M parameters, respectively. Our experiment thus uses teacher models approximately 75% weaker than the student model, DeiT B 16. Table 3 gives the detailed model parameter sizes. The result is that EfficientNet V2S, whose model size is only 24.82% of the DeiT B 16 model, can outperform on the CIFAR-10, CIFAR-100, and OxfordIIIT-Pets datasets, as shown in Figure 5.
In addition, the performance of the student model with the weaker teacher is similar to that with RegNetY 16GF. For example, on the OxfordIIIT-Pets dataset, the performance of the student model with EfficientNet B4 as the teacher, whose model size is only 22.28% of the student model, decreased by only 0.57% (90.16% vs 90.73%) compared to RegNetY 16GF. Our study shows that a weaker teacher model can simplify training the teacher. Additionally, a weaker teacher may even yield better student model performance in some experiments.

Figure 4. Performance comparison of the DeiT B 16 model when transferred to downstream datasets with our proposed method versus the student model without a teacher model (standard transfer learning)

Figure 5. Performance comparison of the DeiT B 16 model when transferred to downstream datasets with RegNetY 16GF, EfficientNet B4, and EfficientNet V2S as teachers

3.5. Using another student model
Finally, we showed that using a distillation token for transfer learning to downstream datasets can improve DeiT B 16 performance, and that a weaker teacher model can reduce the complexity of training the teacher while improving student performance on several downstream datasets. Therefore, we apply the same concept with DeiT S 16 as the student model and EfficientNet B0 as a weaker teacher model to show that our proposed method applies to a variety of models. A teacher model is considered weaker based on the ratio of teacher to student parameters. Taking DeiT S 16 as the baseline model size (100%), EfficientNet B0 is a teacher with 5.3M parameters, or 24.09% of the student model, and RegNetY 16GF is a teacher with 83.6M parameters, or 380% of the student model. Table 4 gives the detailed model sizes for this experiment.
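The model-size percentages quoted here and in Tables 3 and 4 follow from a simple ratio of parameter counts (the helper name is ours; minor rounding differences with the printed values are possible):

```python
def size_percent(teacher_params_m, student_params_m):
    """Teacher parameter count (in millions) as a percentage of the
    student's, rounded to two decimals as in Tables 3 and 4."""
    return round(100.0 * teacher_params_m / student_params_m, 2)

# DeiT S 16 (22M parameters) as the 100% baseline:
print(size_percent(5.3, 22.0))   # EfficientNet B0
print(size_percent(83.6, 22.0))  # RegNetY 16GF
```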
Like the DeiT B 16 model, DeiT S 16 with EfficientNet B0 as the teacher outperforms RegNetY 16GF as the teacher on CIFAR-100, Oxford 102 Flowers, and OxfordIIIT-Pets. Moreover, EfficientNet B0 increases DeiT S 16 performance by 0.99% on the CIFAR-100 dataset, as shown in Figure 6. Conversely, the performance difference on the CIFAR-10 dataset is only 0.07% (96.89% vs 96.82%) between RegNetY 16GF and EfficientNet B0 as teachers. This experiment thus shows that our proposed method can improve performance across various models.

Table 3. Model size in the DeiT B 16 experiment
Model | Params | Model size
DeiT B 16 | 86.6M | 100%
RegNetY 16GF | 83.6M | 96.53%
EfficientNet B4 | 19.3M | 22.28%
EfficientNet V2S | 21.5M | 24.82%

Table 4. Model size in the DeiT S 16 experiment
Model | Params | Model size
RegNetY 16GF | 83.6M | 380%
DeiT S 16 | 22M | 100%
EfficientNet B0 | 5.3M | 24.09%

Figure 6. Performance comparison of the DeiT S 16 model when transferred to downstream datasets with RegNetY 16GF and EfficientNet B0 as teachers

4. CONCLUSION
Recent observations suggest that a new KD procedure in the ViT family with a distillation token can improve performance when training from scratch. Our findings provide conclusive evidence that this new KD procedure can also enhance model performance when applied to a downstream dataset through transfer learning. Utilizing the distillation token to calculate the distillation loss between student and teacher output remains a helpful technique for the DeiT model in the transfer learning process; the DeiT (student) model can effectively learn from the teacher's knowledge. This works when a CNN architecture serves as the teacher model and the distillation loss is computed with soft distillation. In addition, we proposed using a weaker teacher model and showed that on several downstream datasets it can improve the performance of the DeiT model.
Otherwise, the performance of the DeiT model with a weaker teacher model is similar to that with RegNetY 16GF as the teacher, while the complexity of training the teacher model is decreased by approximately 75%. Therefore, our proposed method of using a weaker teacher model improves the efficiency of the training process. Our study demonstrates that utilizing a distillation token and a weaker teacher model can enhance the transfer learning capability of the DeiT model. Future studies may explore quantization and pruning methods, shrinking the DeiT model parameters to a size similar to the weaker teacher model. It could also be explored to incorporate the distillation token technique into other transformer models, such as Swin and PVT.

FUNDING INFORMATION
Authors state there is no funding involved.
AUTHOR CONTRIBUTIONS STATEMENT
This journal uses the Contributor Roles Taxonomy (CRediT) to recognize individual author contributions, reduce authorship disputes, and facilitate collaboration. Authors: Christopher Gavra Reswara, Gede Putra Kusuma.
Role codes: C: Conceptualization, M: Methodology, So: Software, Va: Validation, Fo: Formal Analysis, I: Investigation, R: Resources, D: Data Curation, O: Writing - Original Draft, E: Writing - Review & Editing, Vi: Visualization, Su: Supervision, P: Project Administration, Fu: Funding Acquisition.

CONFLICT OF INTEREST STATEMENT
Authors state there is no conflict of interest.

DATA AVAILABILITY
The datasets analyzed during the current study are publicly available. The CIFAR-10 and CIFAR-100 datasets are available at https://www.cs.toronto.edu/~kriz/cifar.html. The Oxford 102 Flowers dataset can be found at https://doi.org/10.1109/ICVGIP.2008.47. The Oxford-IIIT Pet dataset is available at https://doi.org/10.1109/CVPR.2012.6248092.

REFERENCES
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017, doi: 10.48550/arXiv.1706.03762.
[2] Y. Wu, X. Hu, Z. Fu, S. Zhou, and J. Li, "GPT-4o: visual perception performance of multimodal large language models in piglet activity understanding," arXiv, 2024. [Online]. Available: http://arxiv.org/abs/2406.09781.
[3] Meta, "The Llama 3 herd of models," arXiv, 2024.
[4] Gemini, "Gemini 1.5: unlocking multimodal understanding across millions of tokens of context," 2024.
[5] H. Shakil, Z. Ortiz, G. C. Forbes, and J. Kalita, "Utilizing GPT to enhance text summarization: a strategy to minimize hallucinations," Procedia Computer Science, vol. 244, pp. 238–247, 2024.
[6] J.
Šmíd, P. Priban, and P. Kral, "LLaMA-based models for aspect-based sentiment analysis," in Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis, Aug. 2024, pp. 63–70, doi: 10.18653/v1/2024.wassa-1.6.
[7] J. Ding, H. Nguyen, and H. Chen, "Evaluation of question-answering based text summarization using LLM," in Proceedings - 6th IEEE International Conference on Artificial Intelligence Testing, AITest 2024, 2024, pp. 142–149, doi: 10.1109/AITest62860.2024.00025.
[8] Z. Liu et al., "Swin transformer: hierarchical vision transformer using shifted windows," in Proceedings of the IEEE International Conference on Computer Vision, Oct. 2021, pp. 9992–10002, doi: 10.1109/ICCV48922.2021.00986.
[9] Y. Li, H. Mao, R. Girshick, and K. He, "Exploring plain vision transformer backbones for object detection," Lecture Notes in Computer Science, vol. 13669 LNCS, pp. 280–296, 2022, doi: 10.1007/978-3-031-20077-9_17.
[10] C. Cao and Y. Fu, "Improving transformer-based image matching by cascaded capturing spatially informative keypoints," in Proceedings of the IEEE International Conference on Computer Vision, Oct. 2023, pp. 12095–12105, doi: 10.1109/ICCV51070.2023.01114.
[11] A. Dosovitskiy et al., "An image is worth 16x16 words: transformers for image recognition at scale," ICLR 2021 - 9th International Conference on Learning Representations, 2021.
[12] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, "Training data-efficient image transformers & distillation through attention," in Proceedings of Machine Learning Research, 2021, vol. 139, pp. 10347–10357.
[13] Z. Liu et al.
, "Swin transformer V2: scaling up capacity and resolution," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2022, pp. 11999–12009, doi: 10.1109/CVPR52688.2022.01170.
[14] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," 2015. [Online]. Available: http://arxiv.org/abs/1503.02531.
[15] I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollár, "Designing network design spaces," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2020, pp. 10425–10433, doi: 10.1109/CVPR42600.2020.01044.
[16] S. Kullback and R. A. Leibler, "On information and sufficiency," The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951, doi: 10.1214/aoms/1177729694.
[17] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, "ImageNet: a large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, 2009, pp. 248–255, doi: 10.1109/CVPR.2009.5206848.
[18] A. Krizhevsky, "Learning multiple layers of features from tiny images," pp. 32–33, 2009.
[19] M. E. Nilsback and A. Zisserman, "Automated flower classification over a large number of classes," in Proceedings - 6th Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP 2008, 2008, pp. 722–729, doi: 10.1109/ICVGIP.2008.47.
[20] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar, "Cats and dogs," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2012, pp. 3498–3505, doi: 10.1109/CVPR.2012.6248092.
[21] C. Burns et al., "Weak-to-strong generalization: eliciting strong capabilities with weak supervision," Proceedings of Machine Learning Research, vol. 235, pp. 4971–5012, 2024.
[22] M. Tan and Q. V. Le, "EfficientNet: rethinking model scaling for convolutional neural networks," 36th International Conference on Machine Learning, ICML 2019, vol. 2019-June, pp. 10691–10700, 2019.
[23] M. Tan and Q. V. Le, "EfficientNetV2: smaller models and faster training," Proceedings of Machine Learning Research, vol. 139, pp. 10096–10106, 2021.
[24] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," arXiv preprint arXiv:1711.05101, 2019.
[25] I. Loshchilov and F. Hutter, "SGDR: stochastic gradient descent with warm restarts," arXiv preprint arXiv:1608.03983, 2017.
[26] S. Abnar, M. Dehghani, and W.
Zuidema, "Transferring inductive biases through knowledge distillation," arXiv, 2020. [Online]. Available: http://arxiv.org/abs/2006.00555.

BIOGRAPHIES OF AUTHORS
Christopher Gavra Reswara received his bachelor's degree in computer science from Bina Nusantara University, where he is pursuing a master's degree in the same field. He also works as a Programmer in the Bina Nusantara IT Division. His research focuses on AI, recommendation systems, and computer vision, and he has authored two conference papers on recommendation systems. He can be contacted at: christopher.reswara@binus.ac.id.

Gede Putra Kusuma received his Ph.D. degree in Electrical and Electronic Engineering from Nanyang Technological University (NTU), Singapore, in 2013. He is currently working as a Lecturer and Head of the Department of Master of Computer Science, Bina Nusantara University, Indonesia. Before joining Bina Nusantara University, he was working as a Research Scientist at I2R A*STAR, Singapore. His research interests include computer vision, deep learning, face recognition, appearance-based object recognition, gamification of learning, and indoor positioning systems. He can be contacted at: inegara@binus.edu.