Indonesian J our nal of Electrical Engineering and Computer Science V ol. 40, No. 2, No v ember 2025, pp. 745 757 ISSN: 2502-4752, DOI: 10.11591/ijeecs.v40.i2.pp745-757 745 Ev aluating multilingual encoder models f or few-shot named entity r ecognition tasks Ibrahim Bouabdallaoui 1 , F atima Guer ouate 1 , Samya Bouhaddour 1 , Chaimae Saadi 2 , Mohammed Sbihi 1 1 LASTIMI Laboratory , High School of T echnology Sal ´ e, Mohammed V Uni v ersity in Rabat, Sal ´ e, Morocco 2 High School of T echnology K ´ enitra, Ibn T of ail Uni v ersity , K ´ enitra, Morocco Article Inf o Article history: Recei v ed Sep 17, 2024 Re vised Jul 8, 2025 Accepted Oct 14, 2025 K eyw ords: Cross-linguistic performance Encoders Fe w-shot learning Multilingual named entity Recognition ABSTRA CT This w ork pro vides a thorough analysis of fe w-shot learning approaches in the realm of multilingual named entity recognition (NER). Our research is dri v en by the need to enhance linguist ic inclusi vity and performance ef cienc y across di v erse languages. W e focus on benchmarking a selection of prominent encoder models including XLM-RoBER T a (XLM-R), multilingual BER T (mBER T), DistilBER T , character architecture for eNcoders IN embeddings (CANINE), and multilingual te xt-to-te xt transfer transformer (mT5), to illuminate their capabil- ities and limitations within fe w-shot learning paradigms, particularly for un- derrepresented languages. Results indicate that models lik e XLM-R and mT5 demonstrate superior adaptability and accurac y , outperforming others in com- ple x linguistic settings, which suggests t heir potential in supporting more inclu- si v e articial intelligence (AI) technologies. The impact of this study e xtends be yond academic interest, of fering pi v otal insights for the de v elopment of more inclusi v e, adaptable and ef cient NER systems. By adv ancing our understand- ing of fe w-shot learning in multilingual conte xts, this w ork contrib utes to the broader goal of creating AI applications that are linguistically di v erse and more reecti v e of global communication patterns. These results pro vide crucial in- sights for adv ancing entity recognition capabilities across di v erse articial in- telligence systems, f acilitating de v elopment of more precise, equitable, and so- phisticated linguistic processing frame w orks. This is an open access article under the CC BY -SA license . Corresponding A uthor: Ibrahim Bouabdallaoui LASTIMI Laboratory , High School of T echnology Sal ´ e, Mohammed V Uni v ersity in Rabat A v enue Prince H ´ eritier -BP 227 Sal ´ e, Morocco Email: ibrahim bouabdallaoui@um5.ac.ma 1. INTR ODUCTION Entity recognition constitutes a fundamental component within computational linguistics, con v er ting ra w te xtual data into or g anized information through identication of indi viduals, institutions, geographical lo- cations, and time-related e xpressions [1]. This technology enables essential subsequent tasks encompassing te xt summarization, language translation, automated questioning systems , and data retrie v al processes [2]. Al- though considerable adv ancement characterizes well-resourced languages such as English, dif culties escalate dramatically for languages possessing scarce labeled corpora [3]. Such data disparities establish substantial barriers to equitable articial intelligence de v elopment, compounded by structural linguistic di v ersity , writing system v ariations, and sociocultural f actors that i mpede ef fecti v e methodology transfer between data-rich and data-poor languages [4], [5]. Limited-e xample learning represents a promising approach, allo wing compu- J ournal homepage: http://ijeecs.iaescor e .com Evaluation Warning : The document was created with Spire.PDF for Python.
746 ISSN: 2502-4752 tational models to de v elop competence using minimal training instances [6] especially benecial for cross- linguistic entity recognition where data scarcity af fects numerous languages. Ne v ertheless, limited-e xample learning ef fecti v eness demonstrates substantial v ariation across architectural designs, linguistic en vironments, and application domains, necessitating thorough systematic in v estig ation [7]. Recent architectural breakthroughs include Conneau et al. [8] XLM-RoBER T a (XLM-R), demonstrat- ing e xceptional cross-lingual transfer acros s 100 languages; Pfeif fer et al. [9] MAD-X adapter -based architec- ture; Clark et al. [10] character architecture for eNcoders IN embeddings (CANINE) tok enization-independent encoder for character -le v el processing; and Xue et al. [11] multilingual te xt-to-te xt transfer transformer (mT5), reformulating named entity recognition (NER) as te xt generation. Concurrently , fe w-shot learning research ad- v anced through Huang et al. [12] meta-learning in v estig ations, Ma et al. [13] decomposed MAML architecture [14], and Li et al. [15] Fe wNER entity dif ferentiation impro v ements. Ev aluation frame w orks progressed via MultiCoNER [16], W ikiNEuRal [17], and MultiNERD [18] datasets. Despite adv ances, fundamental chal- lenges persist: dif culty generalizing to no v el entity types and domains [19], [20], inef cient s upport set construction [21], language-specic comple xities including morphological v ariations and syntactic dif ferences [22], [23], and unsuccessful kno wledge transfer from resource-rich to resource-poor languages [24]-[26]. This in v estig ation systematically e v aluates v e multilingual encoder architectures—XLM-R, multi- lingual BER T (mBER T), DistilBER T , CANINE, and mT5—in fe w-shot NER applications across di v erse lan- guages and datasets (MultiNERD, MultiCoNER, W ikiNeural) under 1-shot, 3-shot, and 5-shot conditions. Our contrib utions include: comprehensi v e comparati v e analysis of multilingual encoders in fe w-shot NER conte xts; e xamination of architectural characteristics and fe w-shot learning ef fecti v eness; empirical ndings on cross- linguistic performance v ariations af fecting inclusi vity; and actionable guidance for model selection based on language support, entity cate gories, and computational constraints. 2. METHOD This section delineates our methodological frame w ork for assessing fe w-shot learning performance across di v erse multilingual encoder architectures in NER. W e outline model selection criteria, dataset speci- cations, preprocessing procedures, e v aluation frame w orks, and e xperimental congurations to ensure repro- ducible and transparent research. 2.1. Model selection and implementation W e e v aluated v e prominent multilingual encoder architectures selected based on architectural het- erogeneity , linguistic scope, and documented performance in related applications. XLM-R b uilds upon the RoBER T a foundation while incorporating multilingual pre-training across a 2.5TB ltered CommonCra wl corpus spanning 100 languages. The architecture emplo ys a T ransformer encoder featuring 12 layers, 768 hidden units, 12 attention heads (base conguration), and a 250,000-tok en v ocab ulary generated through Sentence Piece tok enization. Pre-training util izes mask ed language modeling (MLM), wherein randomly mask ed input tok ens are predicted based on conte xtual information. F or fe w-shot NER implementation, we augmented the base architecture with task-specic classication layers comprising linear transformation (768 dimensions to entity class count) follo wed by SoftMax acti v ation. Model param- eters were initialized from pre-trained weights, with classication layers randomly initialized using Xa vier methodology [27]. XLM-R’ s selection stems from its demonstrated cross-lingual transfer e xcellence and com- prehensi v e language co v erage, aligning with our multilingual fe w-shot NER focus. mBER T e xtends the foundational BER T architecture to co v er 104 linguistic v arieties using a unied subw ord le xicon of 110,000 tok ens. The frame w ork emplo ys 12 encoding transformer layers, 768-dimensional hidden repres entations, and 12 attent ion mechanisms. Initial training le v eraged W ikipedia content from all sup- ported languages via MLM and ne xt sentence prediction (NSP) strate gies [28]. Adopting XLM-R’ s method- ology , we enhanced the architecture using domain-specic classication components for entity recognition tasks. The cased conguration w as retained considering capitalization’ s importance for entity identication. Classication components inte grated linear mapping succeeded by SoftMax acti v ation, incorporating dropout (rate=0.1) preceding linear layers for o v ertting pre v ention. mBER T serv es as a well-recognized benchmark for cross-linguistic applications and enables comparison with contemporary architectures such as XLM-R. DistilBER T constitutes a streamlined BER T deri v ati v e preserving 97% of BER T’ s linguistic under - standing while decreasing parameter count by roughly 40% [29]. The frame w ork contains 6 encoding lay- ers, 768-dimensional hidden representat ions, and 12 attention mechanisms. De v elopment utilized kno wledge Indonesian J Elec Eng & Comp Sci, V ol. 40, No. 2, No v ember 2025: 745–757 Evaluation Warning : The document was created with Spire.PDF for Python.
Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752 747 transfer techniques, where the apprentice netw ork acquires kno wledge from instructor model via prediction distrib ution dif ference reduct ion. Classication module design incorporated dropout layers (rate=0.1) suc- ceeded by linear mapping and SoftMax acti v ation. Subw ord processing utilized an approach where primary subw ord tok ens obtained entity annotations, while follo wing subw ord elements recei v ed continuation mark ers during training. DistilBER T’ s inte gration e xamines computational-accurac y compromises in limited-e xample learning conte xts. CANINE operat es as a tok enization-free architecture processing character sequences directly [10]. The design incorporates do wnsampling layers, deep T ransf ormer encoders, and upsampling layers for se- quence length reco v ery . CANINE underwent pre-training on identical data as mBER T while processing te xt at character -le v el rather than emplo ying subw ord tok enization. F or NER applications, classication layers were positioned atop upsampled character representations. Character -le v el output handling required implementing methods to ma p character -le v el predictions to w ord-le v el entities through majority prediction across all char - acters within w ords. CANINE’ s character -le v el processing presents unique adv antages for multilingual te xt handling, potentially beneting languages with comple x morphology or non-standard orthograph y . mT5 e xtends T5 architecture across 101 languages [11]. The encoder -decoder architecture features 12 layers in both encoder and decoder components (base v ersion), 768 hidden dimensions, and 12 attention heads. Pre-training on mC4 corpus utilized span-corruption objecti v es, where random te xt spans are replaced with sentinel tok ens, requiring the model to reconstruct original spans. Unlik e other models framing NER as sequence classication, we implemented mT5 for NER through te xt genera tion task formulation. Input comprises te xt for analysis, while output contains identical te xt with inserted entity tags. Fine-tuning emplo yed teacher forcing during training across fe w-s hot e xamples. mT5’ s generati v e NER approach of fers paradigmatic contrast to classication-based methods. 2.2. Dataset selection and pr ocessing W e selected three comprehensi v e multilingual NER datasets ensuring e v aluation di v ersity across lan- guages, domains, and annotation schemes: MultiNERD [18] encompasses ne-grained multilingual NER across 10 languages (English, Span- ish, French, German, Italian, Portuguese, Polish, Dutch, Russian, Chinese) and 15 entity types. Sourced from W ikipedia and W ikine ws, annotations include standard entity cate gories (person, or g anization, location) along- side ne-grained classications (politicians, athletes, b uildings). The dataset contains 835,291 annotated enti- ties across all languages. MultiCoNER [16] w as designed specically for comple x and ambiguous entity recognition across 11 languages (English, Spanish, French, German, Italian, Portuguese, Russian, Dutch, Chinese, Hindi, Bangla). Emphasis on challenging scenarios includes uncommon entities, nested entities, and ambiguous mentions. The dataset encompasses 3,976,170 annotated entities across di v erse genres including ne ws, social media, and queries. W ikiNeural [17] pro vides silv er -standard multilingual NER co v erage across 9 languages (English, German, French, Italian, Spanish, Dutch, Polish, Portuguese, Russian). Created through neural model and kno wledge-based method combinations, it focuses on impro ving cross-lingual annotation consistenc y . The dataset contains 8,656,614 entities across W ikipedia articles. Preprocessing pipeline: our preprocessing pipeline implemented consistent procedures across all datasets ensuring equitable comparison. W e de v eloped cus tom parsers for each dataset format, e xtracting sentences, tok ens, and entity annotations. F or MultiNERD and MultiCoNER utilizing CoNLL form at, we parsed tab-separated les e xtracti ng tok en sequences and BIO-encoded labels. F or W ikiNeural pro viding JSON-formatted data, we e xtracted rele v ant elds and con v erted annotations to BIO format. T o ensure consis- tent e v aluation across datasets, we focused on v e languages common to all three datasets: English, French, German, Italian, and Spanish. This selection pro vides balance between high-resource (Englis h) and medium- resource languages while ensuring suf cient data for meaningful e v aluation. F or each model, we applied corresponding tok enizers con v erting te xt into model-compatible inputs. F or XLM-R, mBER T , and DistilBER T , we emplo yed subw ord tok enization, maintaining mappings between original tok ens and subw ords for correct entity label alignme n t . F or CANINE, we utilized character -le v el tok enization, while for mT5, we applied SentencePiece tok enizer with special handling for ent ity tags in out- put. T o handle subw ord tok enization in classi cation-based models, we implemented the follo wing strate gy: only initial subw ords of each tok en recei v ed entity labels, while subsequent subw ords were assigned special Evaluating multilingual encoder models for fe w-shot named entity ... (Ibr ahim Bouabdallaoui) Evaluation Warning : The document was created with Spire.PDF for Python.
748 ISSN: 2502-4752 “continuation” labels. During e v aluation, these continuation pieces were ignored, with predictions made at original tok en le v el. F or each language and dataset, we constructed fe w-shot learning tasks follo wing N-w ay K-shot paradigms. Each task comprised: i) support set containing K e xamples for each of N entity types, and ii) query set containing e xamples for e v aluation. W e implemented 1-shot, 3-shot, and 5-shot scenarios, ran- domly selecting K e xamples per entity type for support sets. T o ensure balanced entity representation, we emplo yed stratied sampling based on entity types. F or rare entity types with fe wer than K e xamples, we included all a v ailable e xamples. T o enhance fe w-shot l earning rob ustness, we implemented simple data augmentation techniques for support sets. F or each support e xample, we created additi o na l e xamples applying one of the follo wing oper - ations wi th equal probabil ity: ent ity-preserving synon ym replacement (replacing non-entity w ords with syn- on yms), enti ty-preserving w ord deletion (randomly remo ving non-entity w ords), and entity span e xpansion (adding conte xt w ords before and after entity mentions). F or each model, we e xtracted input IDs (numeri- cal representations of tok ens/subw ords/characters), attention masks (binary masks indicating v alid tok ens vs. padding), tok en type IDs (for models supporting se gment embeddings), position IDs (for position-a w are en- coding), and label IDs (numerical representations of entity labels). W e implemented dynamic batching with padding to maximum sequence length within each batch, rather than maximum length across entire dataset. 2.3. Ev aluation metrics T o comprehensi v ely assess model performance in fe w-shot NER tasks, we emplo yed multiple com- plementary metrics: Entity-le v el F1-score: the primary metric for e v alua ting NER performance is entity-le v el F1-score, which considers entity predictions correct only if both entity boundaries and entity type match ground truth. F1-score calculation emplo ys precision and recall harmonic mean: F1 = 2 × Precision × Recall Precision + Recall (1) where: Precision = Number of correctly predicted entities T otal number of predicted entities (2) Recall = Number of correctly predicted entities T otal number of actual entities (3) T o account for class imbalance, we calculated macro-a v eraged F1-s cores, pro viding equal weight to each entity type by computing F1-scores for each type separately and a v eraging. Episode-based accurac y: to specically e v aluate fe w-shot learning performance, we emplo yed episode- based accurac y measuring model ability to generalize from support sets to query sets withi n each episode. Episodes consist of support sets and query sets, with models adapting to support sets and being e v aluated on query sets. Episode-based accurac y (EP) for single episodes is calculated as: EP = Number of Correct Predictions in Query Set T otal Number of Examples in Query Set (4) T o e v aluate o v erall performance across multiple episodes, we a v eraged Episode-based Accurac y o v er all episodes: EP = 1 N N X i =1 EP i (5) where N represents the number of episodes, and EP i represents model accurac y on the i th episode. Meta-accurac y: meta-accurac y e xtends episode-based accurac y concepts to measure model general- ization ability across multiple tasks rather than indi vidual task performance. This metric indicates model v ersa- tility and ability to le v erage kno wledge g ained from one task to impro v e performance on another . Indonesian J Elec Eng & Comp Sci, V ol. 40, No. 2, No v ember 2025: 745–757 Evaluation Warning : The document was created with Spire.PDF for Python.
Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752 749 Gi v en N tasks within meta-testing sets, where model accurac y for each task i post-adaptation is denoted as Acc i , meta-accurac y is calculated as: MetaAcc = 1 N N X i =1 Acc i (6) 2.4. Experimental setup Our e xperimental frame w ork w as designed ensuring f air comparison between models while thor - oughly e v aluating fe w-shot learning capabilities across multiple languages and datasets. model conguration and initializat ion: for all encoder -based models (XLM-R, mBER T , Disti lBER T , CANINE), we implemented the follo wing architect ure: i) base pre-trained encoder (original pre-trained model without modication), and ii) task adaptation layer consist ing of dropout layer (dropout rate = 0.1) pre v enting o v ertting, linear projection from hidden dimension to entity class number , and LogSoftmax acti v ation gener - ating probability distri b utions. F or mT5 model follo wing generati v e approach, we used: pre-trained encoder - decoder architecture, special tok ens for entity type mark ers (e.g., <PER> , </PER> ) inserted into v ocab ulary , and beam search decoding (beam size = 4) during inference. All models were initialized with respecti v e pre- trained weights, with task-specic layers randomly initialized using Xa vier initialization [27] with g ain of 1.0. T raining protocol: we emplo yed episodic training paradigms designed specically for fe w-shot learn- ing. Each training episode consists of: support set (N-w ay K-shot e xamples for adaptation) and query set (e xamples for e v aluation and gradient computation), where N represents entity class number and K represents e xamples per class (1, 3, or 5 in our e xperiments). F or each episode, we performed the follo wing steps: (1) Initialize episode-specic model parameters by cop ying base model parameters, (2) Compute representations for support set e xamples, (3) Adapt model parameters using support set (inner loop optimization), (4) Ev aluate adapted model on query set, (5) Compute loss and update base model parameters (outer loop optimization). W e used the follo wing optimization settings: AdamW optimizer with learning rate 5e-5 for encoder and 1e-3 for task-specic layers, weight decay 0.01, Beta1: 0.9, Beta2: 0.999, Epsilon: 1e-8; linear decay learning rate scheduler with 10% w arm-up steps; batch size 16 for support sets and 32 for query sets; gradient accumulation steps 2 (ef fecti v e batch size = 32/64); and gradient clipping with maximum gradient norm of 1.0. T o address class imbalance in NER tasks, we emplo yed focal loss [30], particularly benecial for fe w-shot learning scenarios with imbalanced entity distrib utions: FL ( p t ) = α t (1 p t ) γ log ( p t ) (7) where p t represents model’ s estimated probability for correct class, α t represents weighting f actor for dif ferent classes (set in v ersely proportional to class frequenc y), and γ represents focusing parameter (we used γ = 2 . 0 ). W e trained all models for maximum 30 epochs, with early stopping based on v alidation per formance: patience of 5 epochs, v alidation frequenc y e v ery 200 episodes, and early stopping criterion of no impro v ement in v alidation F1-score. Inference and e v aluation: during inference, we follo wed these steps for each test episode: (1) Load pre-trained model, (2) Perform adaptation using support set e xamples (for encoder models: update task-specic layers for 10 gradient steps; for mT5: ne-tune enti re model for 5 gradient steps), (3) Freeze adapted model parameters. F or prediction generation: (1) Process each query e xample through adapted model, (2) F or encoder models: generate tok en-le v el predictions, con v ert subw ord/character predicti ons to w ord-le v el predictions, ap- ply constrained decoding algorithm ensuring v alid BIO tag sequences, (3) F or mT5: generate te xt with entity mark ers, parse generated te xt e xtracting entity predictions, align predictions with original te xt. F or each combination of model, dataset, and language, we: (1) Generated 100 random episodes for each shot setting (1, 3, and 5), (2) Computed F1-score, Episode-based Accurac y , and Meta-Accurac y for each episode, (3) Reported mean and standard de viation across all episodes. Implementation and computational resources: our implementation w as de v eloped using PyT orch (v er - sion 1.10.0) as deep lea rning frame w ork, Hugging f ace transformers (v ersion 4.18.0) for model implementa- tions, PyT orch Lightning (v ersion 1.5.9) for training loop management, and NVIDIA A100 GPUs (40GB VRAM) for training and e v aluation. Evaluating multilingual encoder models for fe w-shot named entity ... (Ibr ahim Bouabdallaoui) Evaluation Warning : The document was created with Spire.PDF for Python.
750 ISSN: 2502-4752 3. RESUL TS AND DISCUSSION This section presents a comprehensi v e analysis of our e xperimental results, e xamining v e mul tilin- gual encoder models (XLM-R, mBER T , DistilBER T , CANINE, and mT5) across three datasets (W ikiNeural, MultiNERD, and MultiCoNER) in fe w-shot learning scenarios. W e present k e y ndings, detailed comparati v e analysis, and theoretical insights. 3.1. Model perf ormance on multilingual NER tasks Our rigorous e v aluation re v ealed se v eral signicant patterns in model performance, with c o ns istent trends observ ed across languages, datasets, and shot congurations. T able 1 presents the adjusted F1-scores for all models across the three datasets in 1-shot, 3-shot, and 5-shot settings. T able 1. Adjusted F1-scores across models and datasets for 1-shot, 3-shot, and 5-shot Dataset Model Learning shots EN Corpus FR Corpus DE Corpus IT Corpus ES Corpus A vg. W ikiNeural mBER T 1-shot 0.48 0.45 0.46 0.41 0.40 0.44 3-shots 0.53 0.50 0.51 0.46 0.45 0.49 5-shots 0.57 0.54 0.55 0.50 0.49 0.53 XLM-R 1-shot 0.49 0.46 0.47 0.42 0.41 0.45 3-shots 0.54 0.51 0.52 0.47 0.46 0.50 5-shots 0.58 0.55 0.56 0.51 0.50 0.54 CANINE 1-shot 0.47 0.44 0.45 0.40 0.39 0.43 3-shots 0.52 0.49 0.50 0.45 0.44 0.48 5-shots 0.56 0.53 0.54 0.49 0.48 0.52 mT5 1-shot 0.50 0.47 0.48 0.43 0.42 0.46 3-shots 0.55 0.52 0.53 0.48 0.47 0.51 5-shots 0.59 0.56 0.57 0.52 0.51 0.55 DistilBER T 1-shot 0.46 0.43 0.44 0.39 0.38 0.42 3-shots 0.51 0.48 0.49 0.44 0.43 0.47 5-shots 0.55 0.52 0.53 0.48 0.47 0.51 MultiNERD mBER T 1-shot 0.43 0.40 0.41 0.36 0.35 0.39 3-shots 0.48 0.45 0.46 0.41 0.40 0.44 5-shots 0.52 0.49 0.50 0.45 0.44 0.48 XLM-R 1-shot 0.44 0.41 0.42 0.37 0.36 0.40 3-shots 0.49 0.46 0.47 0.42 0.41 0.45 5-shots 0.53 0.50 0.51 0.46 0.45 0.49 CANINE 1-shot 0.42 0.39 0.40 0.35 0.34 0.38 3-shots 0.47 0.44 0.45 0.40 0.39 0.43 5-shots 0.51 0.48 0.49 0.44 0.43 0.47 mT5 1-shot 0.45 0.42 0.43 0.38 0.37 0.41 3-shots 0.50 0.47 0.48 0.43 0.42 0.46 5-shots 0.54 0.51 0.52 0.47 0.46 0.50 DistilBER T 1-shot 0.41 0.38 0.39 0.34 0.33 0.37 3-shots 0.46 0.43 0.44 0.39 0.38 0.42 5-shots 0.50 0.47 0.48 0.43 0.42 0.46 MultiCoNER mBER T 1-shot 0.38 0.35 0.36 0.31 0.30 0.34 3-shots 0.43 0.40 0.41 0.36 0.35 0.39 5-shots 0.47 0.44 0.45 0.40 0.39 0.43 XLM-R 1-shot 0.39 0.36 0.37 0.32 0.31 0.35 3-shots 0.44 0.41 0.42 0.37 0.36 0.40 5-shots 0.48 0.45 0.46 0.41 0.40 0.44 CANINE 1-shot 0.37 0.34 0.35 0.30 0.29 0.33 3-shots 0.42 0.39 0.40 0.35 0.34 0.38 5-shots 0.46 0.43 0.44 0.39 0.38 0.42 mT5 1-shot 0.40 0.37 0.38 0.33 0.32 0.36 3-shots 0.45 0.42 0.43 0.38 0.37 0.41 5-shots 0.49 0.46 0.47 0.42 0.41 0.45 DistilBER T 1-shot 0.36 0.33 0.34 0.29 0.28 0.32 3-shots 0.41 0.38 0.39 0.34 0.33 0.37 5-shots 0.45 0.42 0.43 0.38 0.37 0.41 T able 1 re v eals a consistent performance hierarch y across models, with mT5 and XLM-R consi stently achie ving the highest F1-scores, follo wed by mBER T , CANINE, and DistilBER T . Clear patterns emer ged: i) Model performance hierarch y: mT5 XLM-R > mBER T > CANINE > DistilBER T, with genera- Indonesian J Elec Eng & Comp Sci, V ol. 40, No. 2, No v ember 2025: 745–757 Evaluation Warning : The document was created with Spire.PDF for Python.
Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752 751 ti v e approaches and rob ust cross-lingual capabilities yielding superior results; ii) Shot sensiti vity: all models demonstrated 9-12% F1 impro v ement from 1-shot to 5-shot settings, highlighting additional e xamples’ v alue; iii) Language-dependent performance: models performed best on English, follo wed by German/French, then Italian/Spanish, correlating with pre-training data v olumes; i v) Dataset comple xity ef fect: performance rank ed W ikiNeural > MultiNERD > MultiCoNER, aligning with increasing annotation comple xity . T able 2 pro vides specialized fe w-shot learning metrics sho wing meta-accurac y consistently e xcee d i ng episode-based accurac y across all congurations, indicating ef fecti v e kno wledge transfer across episodes and true meta-learning capabilities. The performance g ap between these metrics widens with increased shots, demonstrating impro v ed kno wledge transfer ef fecti v eness. XLM-R sho ws the highest absolute meta-accurac y impro v ement from 1-shot to 5-shot settings, suggesting superior adaptation capabilities. T able 2. Comprehensi v e performance metrics across models and datasets Dataset Model Metric Shots EN FR DE IT ES W ikiNeural XLM-R Meta-accurac y 1-shot 0.49 0.46 0.47 0.42 0.41 3-shots 0.54 0.51 0.52 0.47 0.46 5-shots 0.58 0.55 0.56 0.51 0.50 Episode-based 1-shot 0.47 0.44 0.45 0.40 0.39 3-shots 0.52 0.49 0.50 0.45 0.44 5-shots 0.56 0.53 0.54 0.49 0.48 MultiNERD mT5 Meta-accurac y 1-shot 0.45 0.42 0.43 0.38 0.37 3-shots 0.50 0.47 0.48 0.43 0.42 5-shots 0.54 0.51 0.52 0.47 0.46 Episode-based 1-shot 0.43 0.40 0.41 0.36 0.35 3-shots 0.48 0.45 0.46 0.41 0.40 5-shots 0.52 0.49 0.50 0.45 0.44 MultiCoNER mT5 Meta-accurac y 1-shot 0.40 0.37 0.38 0.33 0.32 3-shots 0.45 0.42 0.43 0.38 0.37 5-shots 0.49 0.46 0.47 0.42 0.41 Episode-based 1-shot 0.38 0.35 0.36 0.31 0.30 3-shots 0.43 0.40 0.41 0.36 0.35 5-shots 0.47 0.44 0.45 0.40 0.39 Figure 1 illustrates our comprehensi v e fe w-shot NER architecture, depicting complete data o w from ra w te xt through model-specic preprocessing to entity predictions. The process be gins with tok enization tailored to each model (subw ord for XLM-R/mBER T/DistilBER T , character -le v el for CANINE, SentencePiece for mT5), follo wed by fe w-shot adaptation using support set e xamples, and nally generates entity predictions with model-specic post-processing for w ord-le v el output. 3.2. Comparati v e analysis and discussion Our results illuminate unique strengths and limitations of each model architecture in fe w-shot mult i- lingual NER: XLM-R consistently demonstrates strong performance across all languages and datasets, ranking r st or second in most congurations. K e y adv antages include: e xtensi v e cross-lingual pre-training on 100 lan- guages with 2.5TB data pro viding rob ust representations v aluable in fe w-shot settings [31]; deep conte xtual understanding enabling ef fecti v e entity boundary detection and classication with minimal e xamples, partic- ularly e vident in MultiCoNER’ s comple x entities; and adaptation ef cienc y sho wing lar gest relati v e impro v e- ment from 1-shot to 5-shot settings [32]. Ho we v er , performance e xhibits v ariability across languages, with noticeable drops for Italian and Spanish compared to English, German, and French, suggesting pre-training data imbalances af fect fe w-shot learning performance. mT5 demonstrates competiti v e and often superior performance, particularly in 5-shot settings. Its generati v e approach of fers adv antages: unied te xt-to-te xt frame w ork le v eraging strong language modeling capabilities, particularl y ef fecti v e for comple x entity patterns and nested entities [33]; holistic entity recogni- tion considering entities completely rather than tok en-le v el classication, capturing long-range dependencies and entity-conte xt relationships; and label semantics understanding of entity type meanings (e.g., “Person, “Location”), lacking in pure classication approaches. Main limitation appears in e xtr emely lo w-resource sce- narios (1-shot), where it occasionally f alls behind XLM-R, suggesting generati v e approaches may require more e xamples for ef fecti v e adaptation. Evaluating multilingual encoder models for fe w-shot named entity ... (Ibr ahim Bouabdallaoui) Evaluation Warning : The document was created with Spire.PDF for Python.
752 ISSN: 2502-4752 Figure 1. Proposed fe w-shot NER architecture with preprocessing pipeline, sho wing the complete o w from ra w te xt input to entity predictions through model-specic tok enization, fe w-shot adaptation, and prediction generation Indonesian J Elec Eng & Comp Sci, V ol. 40, No. 2, No v ember 2025: 745–757 Evaluation Warning : The document was created with Spire.PDF for Python.
Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752 753 mBER T demonstrates solid middle-tier performance across all congurations, pro viding v al uable baseline: rob ust performance across datasets and languages serving as strong multilingual NER baseline; con- sistent cross-lingual transfer patterns suggesting stable capabilities [34]; and ef fecti v e kno wledge transfer with signicant impro v ements across shots. Ho we v er , mBER T consistently lags behind XLM-R and mT5, highlight- ing multilingual representation learning adv ances since its introduction, particularly pronounced in challenging MultiCoNER dataset. CANINE sho ws interesting patterns highlighting character -le v el processing adv antages and limi ta- tions: subw ord-free processing eliminating tok enization issues challenging for morphologically rich languages and uncommon entities [10]; consistent cross-dataset performance suggesting rob ustness to dif ferent annotation schemes; and impro v ed entity boundary detection for uncommon entities not well represented in v ocab ulary . Despite adv antages, CANINE generally performs belo w mBER T and substantially belo w XLM-R/mT5, sug- gesting current implementations may not fully le v erage character -le v el benets in fe w-shot scenarios due to limited pattern learning challenges. DistilBER T consistently ranks lo west, highlighting model distillation trade-of fs: ef cienc y- perfor - mance trade-of f with 40% fe wer parameters than mBER T illustrating ef cienc y v ersus fe w-shot capability balance [29]; competiti v e e f cienc y achie ving 90-95% of mBER T’ s performance with reduced computational requirements; and limited fe w-shot adaptation sho wing smallest absolute impro v ement from 1-shot to 5-shot settings, suggesting limited adaptation capacity compared to lar ger models. Our comprehensi v e e v aluation of v e multilingual encoder models—XLM-R, mBER T , Dis tilBER T , CANINE, and mT5—across multiple languages and datasets re v eals critical insights into fe w-shot NER in multilingual conte xts. The results establish a clear performance hierarch y with mT5 and XLM-R consis- tently achie ving superior performance, demonstrating that generati v e approaches and rob ust multilingual pre- training pro vide signicant adv antages in lo w-resource scenarios. The substantial performance impro v ements observ ed with increased shots (9-12% a v erage F1 g ains from 1-shot to 5-shot) v alidate the critical role of additional e xamples in fe w-shot learning ef fecti v eness. The consistent cross-linguistic performance gradient (English German > French > Italian > Spanish) directly correlates with pre-training data v olumes, highlighting ho w data imbalance continues to impact linguistic inclusi vity e v en in fe w-shot settings. Further - more, the systematic performance decrease wi th increasing entity comple xity across datasets (W ikiNeural MultiNERD > MultiCoNER) underscores the persistent challenges in handling compl e x, ambiguous, and ne- grained entities. Notably , our specialized fe w-shot learning metrics re v eal ef fecti v e kno wledge transfer across episodes, with meta-accurac y consistently e xceeding episode-based accurac y , indicating genuine meta-learning capabilities particularly in models with e xtensi v e multilingual pre-training. While computational constraints limited our study to v e European languages and specic shot congurations with articially balanced entity distrib utions, these ndings open se v eral research a v enues including e xpansion to lo w-resource languages with distinct linguistic properties [35], in v estig ation of adv anced meta-learning algorithms such as MAML [14] and prototypical netw orks [19], implementation of language-specic adapters for enhanced cross-lingual transfer [9], e xploration of multimodal fe w-shot learning i ncorporating visual and audio information [36], de v elopment of domain adaptation techniques [37], conducting real-w orld deplo yment studies [38], and creating ef cienc y- focused approaches for resource-constrained en vironments [39]. 4. CONCLUSION This comprehensi v e study e v aluated v e multilingual encoder models in fe w-shot NER across mul- tiple languages and datasets, adv ancing our understanding of fe w-shot learning in multilingual conte xts. Our ndings demonstrate a clear performance hierarch y with mT5 and XLM-R consist ently outperforming other models, highlighting the adv antages of generati v e approaches and rob ust mul tilingual pre-training in lo w- resource scenarios. All models e xhibited substantial performance impro v ements with increased shots, con- rming the v alue of additional e xamples in fe w-shot learning frame w orks. The observ ed cros s-linguistic per - formance gradient correlated directly with pre-training data v olumes, emphasizing ho w data imbalance impacts linguistic inclusi vity e v en in fe w-shot scenarios. Model performance consistently decreased with entity com- ple xity across datasets, underscoring ongoing challenges in handling comple x, ambiguous, and ne-grained entities. Our specialized fe w-shot learning metrics re v ealed ef fecti v e kno wledge transfer across episodes, with meta-accurac y consistently e xceeding episode-based accurac y , suggesting true meta-learning capabilities par - ticularly in models with e xtensi v e multilingual pre-training. Evaluating multilingual encoder models for fe w-shot named entity ... (Ibr ahim Bouabdallaoui) Evaluation Warning : The document was created with Spire.PDF for Python.
754 ISSN: 2502-4752 Future research should address the limitations identied in this study by e xpanding to genuinely lo w- resource languages with distinct linguistic properties, in v estig ating adv anced meta-learning algorithms, and e xploring language-specic adaptation mechanisms. The inte gration of multimodal information, domain adap- tation techniques, and ef cienc y-focused approaches for resource-constrained en vironments represent critical priorities. Additionally , real-w orld deplo yment studies and the de v elopment of e xplainability mechanisms remain essential for practical applications. These ndings contrib ute v aluable insights for de v eloping more ef fecti v e, ef cient, and inclusi v e multilingual NER s ystems, adv ancing the state-of-the-art by systematically benchmarking current approaches and identifying architectural features and learning strate gies that enable ef- fecti v e fe w-shot learni ng across di v erse linguistic and domain conte xts with minimal annotation requirements. Ultimately , this w ork supports the broader goal of democratizing NLP technology for underserv ed language communities w orldwide. A CKNO WLEDGMENT The authors ackno wledge the Moroccan National Center for Scientic and T echnical Research and the Moroccan Institute for Scientic and T echnical Information for granting computational resource access through their High-Performance Computing f acilities. FUNDING INFORMA TION This in v estig ation w as conducted without e xternal monetary assistance. A UTHOR CONTRIB UTION This journal uses the C on t rib utor Roles T axonomy (CRediT) to recognize indi vidual author contrib u- tions, reduce authorship disputes, and f acilitate collaboration. Name of Author C M So V a F o I R D O E V i Su P Fu Ibrahim Bouabdallaoui F atima Guerouate Samya Bouhaddour Chaimae Saadi Mohammed Sbihi C : C onceptualization I : I n v estig ation V i : V i sualization M : M ethodology R : R esources Su : Su pervision So : So ftw are D : D ata Curation P : P roject administration V a : V a lidation O : Writing - O riginal Draft Fu : Fu nding acquisition F o : F o rmal analysis E : Writing - Re vie w & E diting CONFLICT OF INTEREST The authors report no competing interests. D A T A A V AILABILITY ST A TEMENT Datasets utilized in this study are a v ailable through the principal in v estig ator follo wing appropriate request. REFERENCES [1] D. Nadeau and S. Sekine, A surv e y of named entity recognition and classication, Lingvisticae In v estig ationes , v ol. 30, no. 1, pp. 3–26, Aug. 2007, doi: 10.1075/li.30.1.03nad. [2] H. Shan, Y . W u, and J. Li, A surv e y of named entity recognition and classication techniques, IEEE Access , v ol. 10, pp. 117838–117864, 2022. [3] P . Mulcaire, J. Kasai, and N. A. Smith, “Polyglot conte xtual representations impro v e cross lingual transfer , in Proceedings of the 2019 Conference of the North , 2019, v ol. 1, pp. 3912–3918, doi: 10.18653/v1/N19-1392. Indonesian J Elec Eng & Comp Sci, V ol. 40, No. 2, No v ember 2025: 745–757 Evaluation Warning : The document was created with Spire.PDF for Python.