International Journal of Informatics and Communication Technology (IJ-ICT)
Vol. 14, No. 2, August 2025, pp. 737–750
ISSN: 2252-8776, DOI: 10.11591/ijict.v14i2.pp737-750

Malware detection using Gini, Simpson diversity, and Shannon-Wiener indexes

Yeong Tyng Ling¹, Kang Leng Chiew¹, Piau Phang¹, Xiaowei Zhang²
¹ Faculty of Computer Science and Information Technology, Universiti Malaysia Sarawak, Kota Samarahan, Malaysia
² Faculty of Biomedical Engineering, Chengde Medical University, Chengde, China

Article Info
Article history: Received Aug 20, 2024; Revised Nov 19, 2024; Accepted Dec 15, 2024
Keywords: Gini coefficient, Malware detection, MLP, Shannon-Wiener, Simpson diversity, XGBoost

ABSTRACT
The increasing number of malware attacks poses a significant challenge to cyber security. This paper proposes a methodology for static malware analysis using three biodiversity-inspired metrics, namely the Gini coefficient, Simpson diversity, and the Shannon-Wiener index, for malware detection. These metrics are used to build a structural feature representation of the raw binary file, which serves as the feature space. The effectiveness of these metrics is evaluated using multilayer perceptron (MLP) neural network and extreme gradient boosting (XGBoost) models. A deterministic algorithm is used to generate the features that represent the feature signature of the executable file. Additionally, we investigated the effectiveness of different byte sizes as the input feature for these two classifiers. According to the results, the Gini coefficient with a chunk size of 128 achieved an average F1 score of more than 98.7% using the XGBoost model.

This is an open access article under the CC BY-SA license.

Corresponding Author: Yeong Tyng Ling, Faculty of Computer Science and Information Technology, Universiti Malaysia Sarawak, Kota Samarahan, Sarawak, Malaysia. Email: ytling@unimas.my

Journal homepage: http://ijict.iaescore.com

1. INTRODUCTION
Malware attacks are among the most significant and prevalent issues in information security. According to [1], malicious activity has increased since Q1 2024. Hackers use malicious software to cause harm to a computer or its users in the form of viruses, worms, rootkits, keyloggers, trojan horses, ransomware, and spyware. Traditional commercial anti-malware tools that use signature-based detection are infamously inefficient when faced with newly launched (a.k.a. “zero-day”) malware. Essentially, this method extracts unique byte sequences that define the malware’s signature from the file contents of previously seen malware. However, the method is time-consuming and costly, since newly extracted signatures must be compared against large databases of malicious signatures [2]. It also needs periodic updates, since malware writers constantly develop new code to thwart detection. Hence, advanced protection technology using machine learning (ML) is needed.

There are two methods for malware analysis: dynamic and static analysis. In dynamic analysis [3]–[5], malware features such as runtime API or system call traces are generated by executing a malware file and observing its behavior in a controlled environment, e.g. a sandbox, to prevent infection and spreading during analysis. In static analysis [6]–[8], malware features such as n-grams, image representations, and opcodes are generated without executing the malware file. Table 1 summarizes the most recent studies related to our work.
Several compelling works have used ML models for malware detection and classification. They differ mainly in the algorithms and the types of analysis used. In dynamic analysis, [3] evaluated eight machine learning algorithms for malware detection by analyzing the frequency of Windows API system function calls. The authors observed the behavior of malware in an isolated environment using Cuckoo [9]. The behavioral events reported by Cuckoo served as the features fed into the ML models. The authors also applied the Gini index in the decision tree model. A similar technique was used by Syeda and Asghar [5], who applied both Chi2 and the Gini index to filter and select significant features before feeding them into six ML models. For static analysis, [6] applied a word embedding technique to opcode sequence features with long short-term memory (LSTM) networks for malware classification. They used the 30 most frequent opcodes extracted after disassembling the executable files of 20 different malware families. Other studies such as [7], [8] compared model performance for malware classification and showed promising results using extreme gradient boosting (XGBoost). The hybrid work in [4] conducted both static and dynamic malware analysis with different ML models. They obtained the best detection accuracy of 91.9% on the static analysis dataset and 96.4% on the dynamic analysis dataset using the XGBoost algorithm. Their study indicates that combining static and dynamic analysis with ML is an effective approach for identifying malware. Their results also show that the efficacy of an ML model depends on the respective algorithm and the type of data the model is built upon.

Table 1. Malware detection related studies

| Reference | Analysis | Feature | Approach | Dataset (size) | Accuracy |
|---|---|---|---|---|---|
| [3] | Dynamic | API | MLs | Malpedia (7400) | 99.50% |
| [5] | Dynamic | API | Random forest | MalwareBazaar (582) | 96.00% |
| [10] | Static | Aggregation metrics | ELM | APK (600) | 82.50% |
| [8] | Static | String | XGBoost | EMBER (5000K) | 98.50% |
| [6] | Static | Opcode | LSTM, CNN | Malicia; previous study (25901) | 81.00% |
| [7] | Static | Entropy, Gini | MLs, neural network | VirusShare (938) | 92.17% |
| [11] | Static | Image | CNNs, ELMs | Malimg (9300) | 97.70% |
| [12] | Static | Image | CNN | Malimg (9435) | 98.82% |
| [13] | Static | Image | CNN | Malimg (9389) | 97.32% |
| [14] | Static | Entropy, image | SNN | Andro-Dumpsys (906) | 91.20% |
| [4] | Both | Function, API | XGBoost | VirusShare (2747-2937) | 96.48% |

Lately, deep learning has been gaining popularity due to its superior accuracy when trained with huge amounts of data. Neural networks are a subset of ML and the heart of deep learning algorithms. Many studies have utilized neural networks with added convolution and pooling layers. The studies [12]–[14] used variants of convolutional neural network (CNN) models with image-based and other types of feature representation for malware detection. To address Android malware prediction, [10] studied the patterns of the intermediate code and source code of an apk file by extracting 16 types of metrics, such as mean, median, Gini index, and entropy. In their empirical study, the extreme learning machine (ELM) with polynomial kernels performed better than the other ML classifiers.

Regardless of whether ML or deep learning is used as the classifier, feature representation plays a crucial role in malware detection. In static analysis, most malware comes in the form of a raw binary executable file. To quantify the raw bytes, [15] introduced diversity indexes to quantify the qualitative value of malware data. The authors used [16] to compute different diversity indexes such as the Shannon index, Simpson, inverse Simpson, and Fisher’s log. Their experimental results show that ecological metrics can be used in the malware context to better understand the patterns in malware. Other studies such as [17], [18] adopted mathematical models of biodiversity in ecology for detection. These studies demonstrated that biodiversity-related metrics can improve the understanding of how diversity affects detection.

Inspired by the above related work, in this paper we explore the effectiveness of structural features based on the Gini index, Simpson diversity, and the Shannon-Wiener index with multilayer perceptron (MLP) neural network and XGBoost models. The rest of this paper is organized as follows. The construction of the feature representation is described in section 2. The detailed results and discussion are presented in section 3. Conclusions and suggested future work are discussed in section 4.
2. METHOD
In this section, the proposed method is presented. The general flow of the proposed approach is shown in Figure 1.

Figure 1. General architecture flow

2.1. Input files
The input files used in this study were in Windows executable format. A total of 7,852 binary files were collected as the dataset. Table 2 lists the information about the malware families used in this study. These malware samples were collected from [19]. As for the corresponding benign files, we randomly selected 1,000 Windows applications with sizes of 0.01–94 MB from https://download.cnet.com. Figure 2 shows the distribution of the malware families in the selected dataset.

Table 2. Malware information

| Family | Type | Size (MB) | Sample |
|---|---|---|---|
| Bho | Trojan | 0.005 - 16.0 | 1,391 |
| Ceeinject | Virtool | 0.004 - 7.67 | 1,077 |
| Fakerean | Rogue | 0.003 - 21.9 | 1,016 |
| Winwebsec | Rogue | 0.325 - 0.60 | 1,023 |
| Zbot | Trojan | 0.031 - 0.37 | 1,039 |
| Zeroaccess | Trojan | 0.048 - 0.28 | 1,306 |

Figure 2. Malware distribution used in the experiment (see online version for colors)

2.2. File splitting
In this step, an input file was split into a series of chunks. To achieve this, the technique in [20] was adopted to generate the proposed structural feature representation by splitting the entire file into fixed-size chunks. A chunk is a string of non-overlapping consecutive bytes, and every chunk contains the same number of bytes. To do this, a uniform file length, F, and therefore the number of chunks, N, have to be determined. We fixed the file length, measured in chunks, to be a power of 2, i.e. $N = 2^{\alpha}$ for some $\alpha \in \mathbb{N}$:

$$\alpha = \left\lceil \log_2 \frac{\min\{\operatorname{median}(M),\ \operatorname{median}(B)\}}{c} \right\rceil$$
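As a rough illustration of the formula above, the following sketch computes the exponent and the resulting chunk count from two lists of file sizes. It assumes that the outer brackets denote a ceiling and that the logarithm is base 2; the function names and the example sizes are hypothetical and do not come from the paper.

```python
import math
from statistics import median


def chunk_exponent(malware_sizes, benign_sizes, chunk_size):
    """Return alpha = ceil(log2(min(median(M), median(B)) / c)).

    Assumption: ceiling brackets and a base-2 logarithm, as read from the
    formula in the text. Sizes and chunk_size are in bytes.
    """
    smallest_median = min(median(malware_sizes), median(benign_sizes))
    ratio = smallest_median / chunk_size
    if ratio < 1:
        raise ValueError("the median file is smaller than a single chunk")
    return math.ceil(math.log2(ratio))


def chunk_count(malware_sizes, benign_sizes, chunk_size):
    """Return N = 2 ** alpha, the common number of chunks per file."""
    return 2 ** chunk_exponent(malware_sizes, benign_sizes, chunk_size)


# Hypothetical file sizes in bytes, for illustration only.
malware = [40_000, 60_000, 90_000]
benign = [50_000, 200_000, 1_000_000]
n_chunks_256 = chunk_count(malware, benign, 256)
n_chunks_128 = chunk_count(malware, benign, 128)
```

Halving the chunk size doubles the ratio inside the logarithm, so the exponent grows by one and the number of chunks per file doubles.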
For convenience, the steps to determine α are restated here:
Step 1: compute the median sizes of the group of malware files and the group of benign files, M and B, respectively. Here, unlike [20], the median is used, as it usually provides a better measure of the central tendency of the sample sizes.
Step 2: determine the minimum median size of these two groups.
Step 3: divide the minimum median size by the chunk size, say c = 256 bytes; this gives D.
Step 4: find the base-2 logarithm of the value from the previous step and take the largest whole integer.
Step 5: if the whole integer in the previous step is not a power of 2, reduce D in step 3 by 1 and repeat step 4 until the condition is met.

Different chunk sizes, namely 128, 256, 512, 1,024, and 2,048 bytes, were examined in this study. For convenience, the sliding window used for file splitting has the same length as the chunk size. These chunks provide granular variations and represent the structure of a file.

2.3. Feature generation
Based on [20], once the number of chunks, N, has been determined, a deterministic algorithm using the Procrustean notion is adopted to choose evenly spread chunks from each file and produce an ordered vector of N chunks. In other words, the number of chunks of a file is either reduced or increased to N. An example is provided here for illustration. Consider two files, P and Q, and a chunk size of c, with length(P) = 10c and length(Q) = 7c, which means there are 10 chunks for P and 7 chunks for Q. Suppose that α = 3 is chosen, so that $N = 2^{\alpha} = 8$ chunks. Since P has more chunks than N, it needs to be reduced from 10 to 8 chunks, and since Q has fewer chunks than N, it needs to be increased from 7 to 8 chunks. To choose these chunks, a subset of the current chunks is generated for each file using a jump factor. The chunk index is initially set to 0 and is incremented in every step by the jump factor, that is, (number of chunks − 1)/(N − 1): $inc_1 = 9/7 \approx 1.29$ for P and $inc_2 = 6/7 \approx 0.86$ for Q. The indices are selected using the floor of the accumulated jump value, so the chosen indices are:

$$I_P = (0, 1, 2, 3, 5, 6, 7, 9)$$
$$I_Q = (0, 0, 1, 2, 3, 4, 5, 6)$$

These indices give the locations of the chunks needed to build the structural feature representation of a file. Three biodiversity-inspired metrics were adopted in this study to build the structural feature representation on these chunks, namely the Gini coefficient, Simpson diversity, and Shannon-Wiener.

2.3.1. Gini coefficient
The Gini coefficient, also known as the Gini index [21] and named after the Italian statistician Corrado Gini, is a measure of statistical dispersion (inequality) used especially in economics and ecology [22]. The Gini coefficient is defined as:

$$\mathit{gini} = \frac{t}{b^{2}\,a} \qquad (1)$$

where t is the sum of the pairwise absolute differences among the elements of a list, b is the length of the list, and a is the mean value of that list.

2.3.2. Simpson diversity
The Simpson diversity index [23] was introduced by Edward H. Simpson to measure the probability that two samples belong to the same group. The value of Simpson diversity ranges from 0 to 1, with 0 representing large diversity and 1 representing no diversity. The formula is given as:

$$D = \frac{\sum_{i=1}^{R} n_i (n_i - 1)}{N (N - 1)} \qquad (2)$$

where R is the number of groups, $n_i$ is the number of individuals in group i, and N is the total number of individuals in the sample (not to be confused with the number of chunks).
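The chunk selection of section 2.3 and the two indexes defined so far can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, t in equation (1) is taken to count each unordered pair once, the Gini coefficient is applied to the raw byte values of a chunk, Simpson diversity is applied to the counts of each byte value, and a trailing partial chunk is dropped.

```python
from collections import Counter


def select_chunk_indices(n_file_chunks, n_target):
    """Pick n_target evenly spread chunk indices out of n_file_chunks.

    Index k is floor(k * (n_file_chunks - 1) / (n_target - 1)). Integer
    arithmetic is used so the accumulated jump cannot drift.
    """
    if n_target < 2 or n_file_chunks < 1:
        raise ValueError("need at least two target chunks and one file chunk")
    numerator, denominator = n_file_chunks - 1, n_target - 1
    return [(k * numerator) // denominator for k in range(n_target)]


def gini(values):
    """Equation (1): t / (b^2 * a), with t summed over unordered pairs."""
    b = len(values)
    a = sum(values) / b
    if a == 0:
        return 0.0  # assumption: an all-zero chunk is treated as fully equal
    t = sum(abs(values[i] - values[j])
            for i in range(b) for j in range(i + 1, b))
    return t / (b * b * a)


def simpson(values):
    """Equation (2): sum of n_i (n_i - 1) divided by N (N - 1)."""
    total = len(values)
    if total < 2:
        return 1.0  # assumption: a single value carries no diversity
    counts = Counter(values)
    return sum(n * (n - 1) for n in counts.values()) / (total * (total - 1))


def feature_vector(data, chunk_size, n_target, metric):
    """Split data into full chunks, select n_target of them, score each."""
    chunks = [data[i:i + chunk_size]
              for i in range(0, len(data) - chunk_size + 1, chunk_size)]
    indices = select_chunk_indices(len(chunks), n_target)
    return [metric(list(chunks[i])) for i in indices]


indices_p = select_chunk_indices(10, 8)  # file P: 10 chunks reduced to 8
indices_q = select_chunk_indices(7, 8)   # file Q: 7 chunks increased to 8
```

With the example files P and Q, `select_chunk_indices` reproduces the index tuples given in the text. Any other per-chunk metric with the same signature can be passed to `feature_vector` in place of `gini` or `simpson`.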
2.3.3. Shannon-Wiener
The Shannon-Wiener index [24] was developed from information theory and is based on measuring uncertainty. The computational formula is:

$$H = \frac{N \ln N - \sum_{i} n_i \ln n_i}{N} \qquad (3)$$

where N is the total number of individuals in the sample and $n_i$ is the number of individuals in group i.

Each metric, which is used to quantify the byte randomness in a chunk, defines the N-vector structural feature of a file. Thus, for a given executable file, three different types of structural features were generated.

2.4. Classifiers
Next, each type of structural feature generated in the previous step is fed into two selected classifiers, namely the MLP neural network and XGBoost. All the experiments were conducted under a Windows 10 64-bit operating system using the Python programming language (Scikit-learn library). A 10-fold cross validation was performed to estimate the generalization performance of the proposed approach. The prediction efficiency was measured in terms of accuracy rate, area under curve (AUC), and F1 score.

2.4.1. Multilayer perceptron
Based on the layer construction of the MLP in [25], a 3-layer MLP model is constructed with one input layer, two hidden layers, and an output layer. After several experiments using 3 and 4 layers, we decided to use the 3-layer MLP model as it gave better performance than the 4-layer model. In this study, the first layer (an input layer with a linear layer) has N nodes, where N is the number of chunks generated during the file splitting step. The activation function was set to the rectified linear unit (ReLU) to suppress negative values. This process is repeated for another hidden layer with half the number of nodes of the previous layer. The last layer is a sigmoid, which allows binary prediction. Additionally, dropout layers with a rate of 0.2 are added between the hidden layers to prevent overfitting.
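To make the layer structure concrete, the following is a minimal forward-pass sketch in plain Python. It is one possible reading of the description rather than the authors' model: the layer widths (N, N/2, 1), the weight initialization, and all function names are assumptions, the weights are untrained, and dropout is left out because it is only active during training.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    bound = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    """Affine map: one weighted sum plus bias per output node."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(x):
    """Rectified linear unit: negative values are set to zero."""
    return [max(0.0, v) for v in x]


def sigmoid(z):
    """Squash a real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def build_mlp(n_chunks, seed=0):
    """Assumed widths: N inputs -> N -> N/2 -> 1 output."""
    rng = random.Random(seed)
    sizes = [n_chunks, n_chunks, n_chunks // 2, 1]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def predict_proba(model, features):
    """Forward pass: ReLU after each hidden layer, sigmoid at the output."""
    hidden = features
    for layer in model[:-1]:
        hidden = relu(linear(hidden, layer))
    return sigmoid(linear(hidden, model[-1])[0])


model = build_mlp(8)
probability = predict_proba(model, [0.5] * 8)  # one 8-chunk feature vector
```

The single sigmoid output can be read as the predicted probability of the malware class and thresholded at 0.5 for a binary decision.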
The model was trained with forward passes and backpropagation of errors for 50 epochs. The learning rate is 0.01, with binary cross entropy (BCE) as the loss function and stochastic gradient descent as the optimizer.

2.4.2. XGBoost
XGBoost [26] is an implementation of gradient boosted decision trees (GBDTs). XGBoost provides a wrapper class that allows models to be treated like classifiers in the scikit-learn framework in Python. The XGBoost classifier has many parameters. For simplicity, we kept them at their defaults and only set objective='binary:logistic'. A 10-fold cross validation was performed on different chunk sizes (e.g., 128) of the proposed structural feature representation described in section 2.3.

3. RESULTS AND DISCUSSION
We want to study how effective the proposed structural features are in discriminating malware from benign files in terms of accuracy, AUC, and F1 score. We consider a validation result of at least 90% to be a high detection rate. The findings are followed by a discussion section.

3.1. Results
Figure 3 shows the average time (in seconds) taken to generate the proposed structural feature representations with chunk sizes of 128 and 2,048 bytes for each malware family. As expected, extracting features with a smaller chunk size, say 128 bytes, takes longer than with a chunk size of 2,048 bytes. Based on the figure, it can be observed that using the Simpson index to generate the structural feature is the fastest, followed by the Gini coefficient and Shannon-Wiener. Surprisingly, when using the Gini coefficient and Shannon-Wiener, the time taken by the Winwebsec family is longer than that of the other families, including Bho, which contains more samples than Winwebsec. One possible conjecture is that the Winwebsec family has file sizes that are much larger than those of the other families.

Figure 3. Average time of feature generation of the malware family

3.1.1. MLP model
Table 3 shows the comparison of the best F1 score performance out of the 10-fold cross validation. Bold indicates the highest score achieved among the six malware families for the respective chunk size. It is observed that the Gini coefficient produced the highest F1 scores on the Winwebsec family except with chunk size 256. With chunk sizes 128, 512, 1,024, and 2,048, the MLP model achieves F1 scores of 99.84%, 99.32%, 99.17%, and 98.75% on Winwebsec, respectively.

Table 3. The best F1 score using the MLP model

| Chunk size | Family | Gini coefficient | Simpson diversity | Shannon-Wiener |
|---|---|---|---|---|
| 128 | Bho | 87.51 | 89.19 | 88.73 |
| 128 | Ceeinject | 89.37 | 90.27 | 92.23 |
| 128 | Fakerean | 87.86 | 90.44 | 89.62 |
| 128 | Winwebsec | 99.84 | 99.00 | 98.89 |
| 128 | Zbot | 96.93 | 98.07 | 97.72 |
| 128 | Zeroaccess | 99.22 | 99.49 | 97.71 |
| 256 | Bho | 87.32 | 89.35 | 86.10 |
| 256 | Ceeinject | 89.15 | 90.41 | 90.60 |
| 256 | Fakerean | 86.07 | 91.53 | 86.34 |
| 256 | Winwebsec | 99.36 | 99.52 | 98.73 |
| 256 | Zbot | 98.11 | 98.49 | 94.90 |
| 256 | Zeroaccess | 98.73 | 99.25 | 96.83 |
| 512 | Bho | 89.24 | 89.26 | 83.44 |
| 512 | Ceeinject | 88.52 | 87.77 | 87.36 |
| 512 | Fakerean | 90.85 | 91.09 | 83.24 |
| 512 | Winwebsec | 99.32 | 99.21 | 98.05 |
| 512 | Zbot | 96.84 | 98.13 | 92.04 |
| 512 | Zeroaccess | 98.57 | 98.85 | 91.55 |
| 1024 | Bho | 88.30 | 87.66 | 81.50 |
| 1024 | Ceeinject | 84.53 | 85.46 | 80.26 |
| 1024 | Fakerean | 90.06 | 91.00 | 81.69 |
| 1024 | Winwebsec | 99.17 | 98.85 | 95.45 |
| 1024 | Zbot | 96.78 | 97.52 | 90.55 |
| 1024 | Zeroaccess | 97.81 | 98.08 | 85.39 |
| 2048 | Bho | 87.21 | 89.35 | 78.73 |
| 2048 | Ceeinject | 86.07 | 84.98 | 64.07 |
| 2048 | Fakerean | 90.63 | 92.15 | 83.75 |
| 2048 | Winwebsec | 98.75 | 97.99 | 90.59 |
| 2048 | Zbot | 97.11 | 97.00 | 85.40 |
| 2048 | Zeroaccess | 96.87 | 97.86 | 79.95 |

Figure 4 takes a closer look at the performance on two selected malware families, Winwebsec and Bho. Based on the figure, the Winwebsec family in Figure 4(a) is detected more easily than the Bho family in Figure 4(b) when using the Gini coefficient as the feature representation. The Simpson diversity index achieved stable performance across all chunk sizes. Table 4 shows the accuracy rate based on the best F1 score achieved across the malware families. It measures how often the MLP model correctly predicts the outcome. As the chunk size grows larger, the F1 score obtained with Shannon-Wiener as the structural feature representation decreases.
Figure 4. The F1 score for (a) Winwebsec and (b) Bho families (see online version for colors)

Table 4. The accuracy rate based on the best F1 score

| Chunk size | Family | Gini coefficient | Simpson diversity | Shannon-Wiener |
|---|---|---|---|---|
| 128 | Bho | 84.54 | 87.04 | 86.76 |
| 128 | Ceeinject | 89.08 | 89.24 | 91.81 |
| 128 | Fakerean | 87.76 | 90.08 | 89.09 |
| 128 | Winwebsec | 99.83 | 99.01 | 98.84 |
| 128 | Zbot | 97.05 | 98.03 | 97.71 |
| 128 | Zeroaccess | 99.13 | 99.42 | 97.25 |
| 256 | Bho | 87.32 | 86.62 | 82.86 |
| 256 | Ceeinject | 88.44 | 89.88 | 90.20 |
| 256 | Fakerean | 85.45 | 91.40 | 86.66 |
| 256 | Winwebsec | 99.34 | 99.50 | 98.68 |
| 256 | Zbot | 98.03 | 98.36 | 94.93 |
| 256 | Zeroaccess | 98.55 | 99.13 | 96.53 |
| 512 | Bho | 86.90 | 87.46 | 82.31 |
| 512 | Ceeinject | 87.64 | 87.47 | 86.99 |
| 512 | Fakerean | 96.41 | 91.23 | 84.29 |
| 512 | Winwebsec | 99.34 | 99.17 | 98.02 |
| 512 | Zbot | 96.73 | 98.03 | 91.50 |
| 512 | Zeroaccess | 98.41 | 98.29 | 89.73 |
| 1024 | Bho | 86.35 | 85.65 | 78.83 |
| 1024 | Ceeinject | 83.78 | 84.10 | 80.73 |
| 1024 | Fakerean | 89.42 | 91.23 | 84.95 |
| 1024 | Winwebsec | 99.17 | 98.84 | 95.38 |
| 1024 | Zbot | 96.73 | 97.38 | 90.52 |
| 1024 | Zeroaccess | 97.39 | 97.83 | 80.92 |
| 2048 | Bho | 84.81 | 88.02 | 71.86 |
| 2048 | Ceeinject | 85.87 | 84.91 | 70.30 |
| 2048 | Fakerean | 90.74 | 92.06 | 84.01 |
| 2048 | Winwebsec | 98.68 | 98.02 | 90.28 |
| 2048 | Zbot | 96.89 | 96.73 | 84.64 |
| 2048 | Zeroaccess | 96.38 | 97.68 | 72.39 |

It is observed that the accuracy rates are consistent with the F1 scores: the Gini coefficient achieves the highest accuracy rate for the Winwebsec family except with chunk size 256. Among the six families, the Gini coefficient can effectively distinguish malware from benign files for Winwebsec, Zbot, and Zeroaccess. This implies that these three malware families can easily be detected using the Gini coefficient, but the same does not hold for Bho, Ceeinject, and Fakerean when compared with the other two structural feature representations. On average, the feature representation using Simpson diversity showed a relatively higher accuracy rate than the Gini coefficient across all the malware families. A closer observation shows that this structural feature representation achieves more than 90% for four of the families, the exceptions being the Bho and Ceeinject families.

This shows that Simpson diversity demonstrated stronger discrimination when quantifying byte information. Shannon-Wiener proved to be the least significant structural feature representation in this study. The lowest accuracy rate it yields is 71.86% for the Bho family with chunk size 2,048, and the highest accuracy rate it yields is 98.68% for the Winwebsec family with chunk size 256. One can observe that as the chunk size increases, the discrimination power of this feature representation becomes worse.

Figure 5 depicts the AUC performance of the three proposed structural feature representations in Figures 5(a) to 5(c). AUC represents the degree or measure of separability. It can be observed that there is a clear distinction in AUC between certain types of malware families. For example, both the Gini coefficient and the Simpson index yielded higher AUC for the Winwebsec, Zbot, and Zeroaccess families, but not for the Bho, Ceeinject, and Fakerean families. The performance of all three structural feature representations declines as the chunk size increases.

Figure 5. AUC performance: (a) Gini coefficient, (b) Simpson index, and (c) Shannon-Wiener

3.1.2. XGBoost model
Table 5 shows the comparison of the best F1 score performance out of the 10-fold cross validation. Bold indicates the highest score achieved for the respective chunk size. It is observed that the average precision and recall vary for each of the feature representations. Using the XGBoost model, the Winwebsec family achieved a 100% F1 score with all three proposed structural feature representations across all chunk sizes. The Zeroaccess family, similar to the Winwebsec family, also reached an F1 score of 100% for all the chunk sizes except chunk size 1,024.
In terms of feature performance, Shannon-Wiener achieved the highest F1 score most often, followed by the Gini coefficient and Simpson diversity. However, its average F1 score is 98.26%, whereas the Gini coefficient and Simpson diversity yielded average F1 scores of 98.63% and 98.27%, respectively. In terms of chunk size, Shannon-Wiener and Simpson diversity demonstrated their discriminative power mostly with chunk sizes 256 and 512, respectively. The Gini coefficient demonstrated its discriminative power mostly with chunk sizes 128 and 1,024. Figure 6 takes a closer look at the performance on two malware families, Winwebsec and Bho. Based on the figure, the Winwebsec family in Figure 6(a) is again detected more easily than the Bho family in Figure 6(b) using all three proposed structural feature representations. However, for the Bho family, the Shannon-Wiener index performed best with chunk sizes 256 and 1,024 only. The Gini coefficient and the Simpson index produced their best results with chunk sizes 512 and 2,048, respectively.

Table 5. The F1 score of the best model performance using the XGBoost model

| Chunk size | Family | Gini coefficient | Simpson diversity | Shannon-Wiener |
|---|---|---|---|---|
| 128 | Bho | 95.53 | 94.94 | 95.20 |
| 128 | Ceeinject | 98.24 | 97.52 | 97.41 |
| 128 | Fakerean | 99.09 | 98.13 | 97.69 |
| 128 | Winwebsec | 100.0 | 100.0 | 100.0 |
| 128 | Zbot | 99.54 | 99.53 | 100.0 |
| 128 | Zeroaccess | 100.0 | 100.0 | 100.0 |
| 256 | Bho | 96.24 | 96.84 | 97.57 |
| 256 | Ceeinject | 96.96 | 97.41 | 96.96 |
| 256 | Fakerean | 98.21 | 97.32 | 95.32 |
| 256 | Winwebsec | 100.0 | 100.0 | 100.0 |
| 256 | Zbot | 99.09 | 99.50 | 100.0 |
| 256 | Zeroaccess | 100.0 | 100.0 | 100.0 |
| 512 | Bho | 97.27 | 97.14 | 96.86 |
| 512 | Ceeinject | 95.37 | 95.85 | 95.85 |
| 512 | Fakerean | 97.32 | 97.67 | 97.75 |
| 512 | Winwebsec | 100.0 | 100.0 | 100.0 |
| 512 | Zbot | 99.54 | 100.0 | 99.50 |
| 512 | Zeroaccess | 100.0 | 100.0 | 100.0 |
| 1024 | Bho | 96.86 | 96.50 | 96.88 |
| 1024 | Ceeinject | 95.39 | 96.10 | 94.68 |
| 1024 | Fakerean | 98.65 | 97.65 | 98.21 |
| 1024 | Winwebsec | 100.0 | 100.0 | 100.0 |
| 1024 | Zbot | 99.06 | 99.38 | 99.50 |
| 1024 | Zeroaccess | 100.0 | 99.28 | 99.63 |
| 2048 | Bho | 96.00 | 97.16 | 96.52 |
| 2048 | Ceeinject | 94.49 | 93.10 | 94.82 |
| 2048 | Fakerean | 98.64 | 97.65 | 98.18 |
| 2048 | Winwebsec | 100.0 | 100.0 | 100.0 |
| 2048 | Zbot | 99.09 | 99.49 | 99.50 |
| 2048 | Zeroaccess | 99.59 | 100.0 | 100.0 |

Figure 6. F1 score comparison between (a) Winwebsec and (b) Bho families (see online version for colors)
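The averages quoted for the XGBoost model can be cross-checked against Table 5 with a few lines of Python. The sketch below assumes that each average is a plain mean over the tabulated best F1 scores, which the paper does not state explicitly; the variable and function names are hypothetical. With the values as transcribed here, the mean for Simpson diversity and Shannon-Wiener comes out at about 98.27 and the Gini mean at chunk size 128 at about 98.73, while the overall Gini mean is about 98.34 rather than the 98.63 quoted in the text, so either the quoted figure or the transcription may be off.

```python
from statistics import mean

CHUNK_SIZES = [128, 256, 512, 1024, 2048]

# Table 5, one row per chunk size; within a row the families are ordered
# Bho, Ceeinject, Fakerean, Winwebsec, Zbot, Zeroaccess.
F1_XGBOOST = {
    "gini": [
        [95.53, 98.24, 99.09, 100.0, 99.54, 100.0],
        [96.24, 96.96, 98.21, 100.0, 99.09, 100.0],
        [97.27, 95.37, 97.32, 100.0, 99.54, 100.0],
        [96.86, 95.39, 98.65, 100.0, 99.06, 100.0],
        [96.00, 94.49, 98.64, 100.0, 99.09, 99.59],
    ],
    "simpson": [
        [94.94, 97.52, 98.13, 100.0, 99.53, 100.0],
        [96.84, 97.41, 97.32, 100.0, 99.50, 100.0],
        [97.14, 95.85, 97.67, 100.0, 100.0, 100.0],
        [96.50, 96.10, 97.65, 100.0, 99.38, 99.28],
        [97.16, 93.10, 97.65, 100.0, 99.49, 100.0],
    ],
    "shannon": [
        [95.20, 97.41, 97.69, 100.0, 100.0, 100.0],
        [97.57, 96.96, 95.32, 100.0, 100.0, 100.0],
        [96.86, 95.85, 97.75, 100.0, 99.50, 100.0],
        [96.88, 94.68, 98.21, 100.0, 99.50, 99.63],
        [96.52, 94.82, 98.18, 100.0, 99.50, 100.0],
    ],
}


def mean_by_feature(table):
    """Mean over all chunk sizes and families, per feature type."""
    return {name: mean(v for row in rows for v in row)
            for name, rows in table.items()}


def mean_by_chunk(table, name):
    """Mean over the six families, per chunk size, for one feature type."""
    return dict(zip(CHUNK_SIZES, (mean(row) for row in table[name])))


overall = mean_by_feature(F1_XGBOOST)
gini_per_chunk = mean_by_chunk(F1_XGBOOST, "gini")
```

The per-chunk view is the one that matches the figure of more than 98.7% reported in the abstract for the Gini coefficient with a chunk size of 128.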
Table 6 shows the accuracy rate based on the best F1 score across the malware families. Bold indicates the significant rate achieved among the malware families for the respective chunk size. It is observed that Shannon-Wiener and the Gini coefficient most often yield the higher accuracy rates across the malware families. Figure 7 depicts the AUC performance of the three structural feature representations, as shown in Figures 7(a) to 7(c). Here, it is clear that the Gini coefficient produced higher AUC for most of the chunk sizes, except chunk size 512. On the other hand, the Simpson diversity and Shannon-Wiener features performed better with chunk size 512. Shannon-Wiener yields high detection rates for the Winwebsec, Zbot, and Zeroaccess families with chunk size 2,048, that is, 100%, 99.01%, and 100%, respectively.

Table 6. The accuracy rate of the best model based on the F1 score

| Chunk size | Family | Gini coefficient | Simpson diversity | Shannon-Wiener |
|---|---|---|---|---|
| 128 | Bho | 95.53 | 94.94 | 95.20 |
| 128 | Ceeinject | 98.24 | 97.52 | 97.41 |
| 128 | Fakerean | 99.09 | 98.13 | 97.69 |
| 128 | Winwebsec | 100.0 | 100.0 | 100.0 |
| 128 | Zbot | 99.54 | 99.53 | 100.0 |
| 128 | Zeroaccess | 100.0 | 100.0 | 100.0 |
| 256 | Bho | 96.24 | 96.84 | 97.57 |
| 256 | Ceeinject | 96.96 | 97.41 | 96.96 |
| 256 | Fakerean | 98.21 | 97.32 | 95.32 |
| 256 | Winwebsec | 98.21 | 97.32 | 95.32 |
| 256 | Zbot | 100.0 | 100.0 | 100.0 |
| 256 | Zeroaccess | 99.09 | 99.50 | 100.0 |
| 512 | Bho | 97.27 | 97.14 | 96.86 |
| 512 | Ceeinject | 95.37 | 95.85 | 95.85 |
| 512 | Fakerean | 97.32 | 97.67 | 97.75 |
| 512 | Winwebsec | 100.0 | 100.0 | 100.0 |
| 512 | Zbot | 99.54 | 100.0 | 99.50 |
| 512 | Zeroaccess | 100.0 | 100.0 | 100.0 |
| 1024 | Bho | 96.86 | 96.50 | 96.88 |
| 1024 | Ceeinject | 95.39 | 96.10 | 94.68 |
| 1024 | Fakerean | 98.65 | 97.65 | 98.21 |
| 1024 | Winwebsec | 100.0 | 100.0 | 100.0 |
| 1024 | Zbot | 96.06 | 99.38 | 99.50 |
| 1024 | Zeroaccess | 100.0 | 99.28 | 99.63 |
| 2048 | Bho | 96.00 | 97.16 | 96.52 |
| 2048 | Ceeinject | 94.49 | 93.10 | 94.82 |
| 2048 | Fakerean | 98.64 | 97.65 | 98.18 |
| 2048 | Winwebsec | 100.0 | 100.0 | 100.0 |
| 2048 | Zbot | 99.09 | 99.49 | 99.50 |
| 2048 | Zeroaccess | 99.59 | 100.0 | 100.0 |

3.2. Discussion
While earlier studies have explored the impact of biodiversity-related metrics, they have not explicitly studied their effectiveness for quantifying binary files. This study investigated the effectiveness of three biodiversity-related metrics, namely the Gini coefficient, Simpson diversity, and Shannon-Wiener, on binary files for malware detection. The validation results from our study suggest that the computation steps needed to extract and generate the structural feature representation should be considered if speed is a concern. In this study, the number of large files in the Winwebsec family may explain why it required more feature generation time than the Bho family. Due to the computational steps in the Gini coefficient, it takes longer to generate the structural feature of a file. Comparing the three structural feature representations based on the F1 score, our study suggests that, on average, the Gini coefficient with XGBoost can be an effective metric for quantifying binary files for detection. This may be because this metric measures the probability for a value within a chunk of bytes, and the type of malware family can also affect the performance. It is unclear why the Shannon-Wiener index produces low performance in this experiment. It is suspected that the computation of ln causes the diversity value