International Journal of Reconfigurable and Embedded Systems (IJRES)
Vol. 14, No. 3, November 2025, pp. 843-854
ISSN: 2089-4864, DOI: 10.11591/ijres.v14.i3.pp843-854

Parallel graph algorithms on a RISC-V-based many-core

Ashuthosh Moolemajalu Ravikumar^1, Aakarsh Vinay^1, Krishna K. Nagar^2, Madhura Purnaprajna^1
^1 Department of Electronics and Communications Engineering, PES University, Bengaluru, India
^2 Google, San Francisco, United States of America

Article history:
Received May 13, 2025
Revised Jul 17, 2025
Accepted Aug 7, 2025

Keywords:
Analytical model
Graph algorithm
Parallel architecture
Performance model
Reduced instruction set computer-five many-core

ABSTRACT
Graph algorithms are essential in domains like social network analysis, web search, and bioinformatics. Their execution on modern hardware is vital due to the growing size and complexity of graphs. Traditional multi-core systems struggle with irregular memory access patterns in graph workloads. Reduced instruction set computer-five (RISC-V)-based many-core processors offer a promising alternative with their customizable, open-source architecture suitable for optimization. This work focuses on parallelizing graph algorithms such as breadth-first search (BFS) and PageRank (PR) on RISC-V many-core systems. We evaluated performance based on graph structure and processor architecture, and developed an analytical model to predict execution time. The model incorporates the unique characteristics of the RISC-V architecture and the types and numbers of instructions executed by multiple cores, with a maximum prediction error of 11%. Our experiments show a speedup of up to 11.55× for BFS and 7.56× for PR using 16 and 8 cores, respectively, over single-core performance. Comparisons with existing graph processing frameworks demonstrate that RISC-V systems can deliver up to 20× better energy efficiency on real-world graphs from the network repository.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Ashuthosh Moolemajalu Ravikumar
Department of Electronics and Communications Engineering, PES University
Bengaluru, Karnataka, India
Email: ashuthoshmr@pes.edu

1. INTRODUCTION
Graph algorithms have become increasingly important in applications such as social networks, navigation, and graph neural networks. The methods for processing graphs have evolved significantly, transitioning from classical problems to modern general-purpose platforms like central processing units (CPUs) and graphics processing units (GPUs). Recent advancements include algorithmic improvements [1] and the development of specialized accelerators and frameworks, such as GraphBLAST [2] and GraphLily [3], which leverage GPUs and field-programmable gate arrays (FPGAs) for enhanced performance.
However, these advancements also highlight the unique challenges inherent to graph processing, particularly in achieving efficient parallelization. Graph properties, such as connectivity between vertices, vary widely depending on the graph type, leading to irregular memory access patterns. Additionally, many graph algorithms are memory-bound, making efficient memory utilization critical. A prime example of application-architecture co-design is Google's TPU [4], which demonstrates a shift towards application-optimized parallel architectures to achieve greater efficiency.
Similarly, custom accelerators targeting graph applications have been implemented on FPGAs [3], [5]. However, the significant design time and effort required for FPGA-based accelerators often limit their practicality. GPUs [2] and vector
processors have emerged as alternative parallel architectures for graph application acceleration. In this context, reduced instruction set computer-five (RISC-V)-based many-core processors have gained prominence due to their customizable and open-source architecture.
However, there is still a need to understand how well graph workloads perform on RISC-V-based many-core processors. Our research aims to address this question by exploring graph algorithms and developing an analytical model to predict the performance of RISC-V-based many-core processors. This model allows us to project the performance of parallel graph processing for core counts up to 16 and various graph types. Our contributions in this work include the following:
- Parallelizing graph algorithms on RISC-V-based many-core processors.
- Developing an analytical model to predict the performance of parallel graph algorithms on a RISC-V-based many-core.
- Validating the accuracy of the model on real-world graphs.

2. PARALLELIZING GRAPH ALGORITHMS ON RISC-V MANY-CORE
Among various graph algorithms, this work focuses on two widely applicable ones: breadth-first search (BFS) and PageRank (PR). Parallelizing these algorithms on a RISC-V-based many-core involves distributing the workload across the available cores. In this section, we examine the effects of parallelizing BFS with respect to the number of cores and graph types, which motivates the development of the analytical model. Our analysis leverages a RISC-V many-core called Mempool [6], featuring 16 cores and 64 KB of tightly coupled memory.

2.1. Parallel breadth-first search
BFS is an algorithm that traverses all the vertices in the graph starting from a single vertex known as the source or root vertex. Traversal involves exploring the neighbors of a vertex and updating the properties of those neighbors, such as visited status, parent vertex, and distance from the root vertex. The traversal of active vertices is done in every iteration and continues until there are no active vertices. Traversal can be done in two main ways: the top-down approach and the bottom-up approach. Depending on the number of active vertices in each iteration, one approach is more beneficial than the other [1]. In our work, we consider bottom-up BFS, as considerable time is spent in this kernel compared to top-down BFS. As shown in Algorithm 1, in each iteration, all vertices (line 1) are checked to determine whether they have not been visited (line 2). The neighbors of these unvisited vertices are then explored (line 3). If an incoming neighbor of an unvisited vertex is active (line 4), the unvisited vertex is marked as visited and activated for the next iteration (lines 5 and 6). Parallelizing this kernel across N cores involves dividing the total vertices (line 1) into N parts. While this division ensures an equal distribution of vertices, the unvisited vertices (line 2) and their neighbors (line 3) are unevenly distributed. In graphs with highly irregular neighbor distributions (e.g., power-law graphs), this irregularity impacts performance by leaving some cores idle while others remain active. Increasing the number of cores in such cases amplifies the impact, as very few cores compute the vertices with a large number of neighbors.
Algorithm 1. Single iteration of bottom-up BFS
1: for i = 1 to v do
2:     if cost[i] < 0 then
3:         for each j in incoming neighbours of i do
4:             if active[j] > 0 then
5:                 active_list ← i
6:                 cost[i] ← iteration
7:                 break
8:             end if
9:         end for
10:     end if
11: end for
12: iteration ← iteration + 1
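To make the mapping onto N cores concrete, the following sketch (in C) shows one way the bottom-up step of Algorithm 1 can be split across cores by giving each core a contiguous slice of the vertex range, assuming the incoming edges are stored in CSR form. The type and function names (csr_graph_t, bfs_bottom_up_step) are illustrative and are not taken from the Mempool code base.

#include <stdint.h>

/* Incoming edges in compressed sparse row (CSR) form:
   row_ptr[v] .. row_ptr[v+1] indexes the in-neighbours of vertex v. */
typedef struct {
    uint32_t num_vertices;
    const uint32_t *row_ptr;   /* length: num_vertices + 1 */
    const uint32_t *col_idx;   /* length: num_edges        */
} csr_graph_t;

/* One bottom-up iteration executed by core `core_id` out of `num_cores`.
   Each core owns a contiguous slice of the vertex range (line 1 of
   Algorithm 1); the uneven in-degree of the unvisited vertices is what
   causes the load imbalance discussed above. Returns the number of
   vertices this core activated, so the caller can detect termination. */
static uint32_t bfs_bottom_up_step(const csr_graph_t *g,
                                   int32_t *cost,        /* < 0: unvisited   */
                                   const uint8_t *active,/* current frontier */
                                   uint8_t *active_next, /* next frontier    */
                                   int32_t iteration,
                                   uint32_t core_id, uint32_t num_cores)
{
    uint32_t chunk = (g->num_vertices + num_cores - 1) / num_cores;
    uint32_t begin = core_id * chunk;
    uint32_t end   = begin + chunk;
    if (end > g->num_vertices) end = g->num_vertices;

    uint32_t updated = 0;
    for (uint32_t i = begin; i < end; i++) {                   /* line 1 */
        if (cost[i] >= 0) continue;                            /* line 2 */
        for (uint32_t e = g->row_ptr[i]; e < g->row_ptr[i + 1]; e++) {
            uint32_t j = g->col_idx[e];                        /* line 3 */
            if (active[j]) {                                   /* line 4 */
                active_next[i] = 1;                            /* line 5 */
                cost[i] = iteration;                           /* line 6 */
                updated++;
                break;                                         /* line 7 */
            }
        }
    }
    /* A barrier across all cores follows here before the frontiers are
       swapped and `iteration` is incremented (line 12). */
    return updated;
}

Because each core writes only the cost and next-frontier entries of its own vertex slice, no locking is needed inside the loop; the only synchronization point is the barrier between iterations, which is where the imbalance caused by skewed in-degree distributions shows up as stall time.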
Int J Recongurable & Embedded Syst ISSN: 2089-4864 845 2.1.1. Graph type Figure 1 presents the prole of the Mempool, sho wing the breakdo wn (%) of compute and stall times as a function of graph type. The analyzed graphs are real-w orld graphs listed in T able 1, sourced from the netw ork repository [7]. F or graphs with po wer -la w characteristics, such as soc-wiki and web-polblogs, syn- chronization o v erhead contrib utes t o stall times. This irre gularity in the graph structure result s in an une v en distrib ution of e x ecuted instructions across cores in the parallel man y-core system. Figure 1. Percentage of time spent on compute and stalls for dif ferent types of graphs on a 16-core Mempool. Compute time includes instruction e x ecution, while stalls are caused by synchronization, instruction cache misses, load-store unit (LSU) delays, and read-after -write (RA W) hazards T able 1. Di v erse type of graphs used in this w ork are from netw ork repository [7] Data V ertices Edges T ype graph512 512 3114 Synthetic soc-wiki 889 5828 Social netw ork web-polblogs 643 4560 W eb netw ork bio-diseasome 516 2376 Biological netw ork 2.1.2. Number of cor es Increasing the number of cores reduces latenc y b ut also introduces synchronization o v erhead. Figure 2 illustrates the latenc y breakdo wn for BFS tra v ersal on soc-wiki with v arying core counts. A maximum speedup of 9 . 2 × is achie v ed with 16 cores compared to single -core latenc y . Using 8 cores pro vides a 5 . 6 × speedup, while doubling the cores to 16 adds only an additional 1 . 6 × speedup. Figure 2. Latenc y breakdo wn (in clock c ycles) for BFS tra v ersal of the soc-wiki graph across v arious core counts P ar allel gr aph algorithms on a RISCV -based many-cor e (Ashuthosh Moolemajalu Ravikumar) Evaluation Warning : The document was created with Spire.PDF for Python.
2.2. Need for performance prediction
In section 2.1, we examined how factors such as graph input type and core count influence the performance of graph algorithms on the parallel RISC-V architecture. However, accurately predicting performance prior to execution remains a challenge. To address this, we propose an analytical model that estimates cycle counts based on graph characteristics and system configuration. This model enables performance prediction for any given graph input on Mempool with N cores, without requiring execution on the actual hardware.

3. BUILDING THE ANALYTICAL MODEL: METHOD
Figure 3 shows the method of building an analytical model to predict the performance of graph algorithms. As shown in Figure 3, the graph algorithm is translated into a set of RISC-V instructions by the compiler. These instructions include compute and memory instructions that are mapped to N cores.

Figure 3. Illustration of the analytical model for predicting the performance of graph algorithms on a RISC-V many-core processor cluster

Along with the instructions, the graph is also distributed across the cores, thereby dividing the overall computation. Based on the number and type of RISC-V instructions, the latency to execute them varies and is given by Cycles_v and Cycles_e. Similarly, these instructions execute depending on graph properties such as the number of vertices (V) and edges (E) of the graph. Using this information, we build the analytical model based on the graph properties and the number and type of RISC-V instructions on N cores. We have developed an analytical model to predict the performance of BFS and PR. The notations used in this model for BFS and PR are given in Tables 2 and 3, which also indicate whether each notation depends on the core count of the many-core, the instruction set architecture (ISA), or the graph input.

3.1. Breadth-first search
As shown in Algorithm 1, lines 1 and 2 are executed for all vertices in every iteration. Of all the vertices, the unvisited vertices V_unvisited are explored (line 3). If an incoming neighbour of an unvisited vertex is active, i.e., V_active (line 4), the unvisited vertex is put in the new active list, i.e., V_update (lines 5, 6, and 7). Based on the numbers of V_active, V_unvisited, and V_neighbours, the number and type of instructions executed vary. The total cycles for each iteration of bottom-up BFS are given by (1). Using this equation, the latency of bottom-up BFS can be predicted ahead of time based on the number of vertices and edges, provided V_active, V_unvisited, V_neighbours, and V_update are known for every iteration. Knowing V_active, V_unvisited, V_neighbours, and V_update at compile time is only possible after running BFS once. Otherwise, assumptions can be made
Int J Recongurable & Embedded Syst ISSN: 2089-4864 847 by di viding the total number of v ertices equally among each iteration. Extending this to multiple cores in v olv es estimating the latenc y of the graph partition that tak es the longest latenc y . Since the amount of w ork done by each core is dependent on the size of the partition, we consider the partition with the lar gest v ertices and edges for our estimation V max and E max . C y cl es bottom bfs = Iteration X i =1 " V acti v e × C y cl es acti v e + V un visited × C y cl es un visited + V neighbours × C y cl es neighbours + V update × C y cl es update # (1) T able 2. Notations used in the model for bottom-up BFS and its dependencies Notation Description Dependence V activ e Number of acti v e v ertices in a gi v en iteration Input graph V unv isited Number of un visited v ertices tra v ersed Input graph V neig hbour s Number of neighbouring v ertices e xplored Input graph V update Number of ne w acti v e v ertices found in an iteration Input graph C y cl es activ e Cycles to e xplore each V activ e v ertices Multi-core architecture, ISA C y cl es unv isited Cycles to tra v erse each V unv isited Man y-core architecture, ISA C y cl es neig hbour s Cycles to check each V unv isited Man y-core architecture, ISA C y cl es update Cycles to put each un visited v erte x to acti v e list Man y-core architecture, ISA T able 3. Notations used in the model for PR and its dependencies Notation Description Dependence N Number of cores Multi-core architecture V max Maximum number of v ertices in a partition Input graph E max Maximum number of edges in a partition Input graph C y cl es v Cycles to operate on v ertices Multi-core architecture, ISA C y cl es e Cycles to operate on edges Multi-core architecture, ISA 3.2. P ageRank PR in v olv es ranking the v ertices of the graph based on the connecti vity . Each v erte x is initial ised with a score based on the number of outgoing edges (i.e. de gree of the v erte x) called outgoing contrib ution. As sho wn in the pseudo-code Algorithm 2, the outgoing contrib ution is accumulated into incoming total for each of the v ertices (lines 3,4). Based on the base scor e and incoming total , a ne w score is calculated (line 7). This process i s repeated until the dif ference between the old score and the ne w score i.e the error (line 8) is less than the threshold limit or until the maximum iteration. Since each v erte x’ s score can be estimated in parallel, PR can be accelerated by le v eraging the cores present in the man y-core system. Algorithm 2 . Single iteration of P ageRank 1: f or i = 1 to v v ertices do 2: incoming total 0 3: f or j : neig hbour s of i do 4: incoming total incoming total + outg oing contr ibution [ j ] 5: end f or 6: ol d scor e scor es [ i ] 7: scor es [ i ] base scor e + k damp incoming total 8: er r o r + = f abs ( scor es [ i ] ol d scor e ) 9: outg oing contr ib [ i ] = scor es [ i ] /out deg r ee [ i ] 10: end f or F or a single-core design, the latenc y to calculate the ranks of the graph G ( V , E ) with V number of v ertices and E number of edges in each iteration includes reading and accumulating the outgoing contrib utions of each v erte x and updating the score (line 2,3 of Algorithm 2). Reading outgoing contrib utions and accumu- lating for each v erte x depends on the total number of edges E in the graph. 
For a single-core design, the latency to calculate the ranks of a graph G(V, E) with V vertices and E edges in each iteration includes reading and accumulating the outgoing contributions of each vertex and updating its score (lines 2 and 3 of Algorithm 2). Reading and accumulating the outgoing contributions for each vertex depends on the total number of edges E in the graph. Updating the score of each vertex
based on the accumulated contributions depends on the number of vertices V (lines 6 to 9 of Algorithm 2). The cycles for calculating the ranks in each iteration over the vertices of G(V, E) can be formulated as (2):

Cycles_{PageRank} = [V \times Cycles_{v}] + [E \times Cycles_{e}]    (2)

Similarly, extending this to multiple cores involves estimating the latency of the group of vertices that takes the longest. The cycles for calculating the ranks in each iteration on the multi-core system can be formulated as (3):

Cycles_{PageRank} = [V_{max} \times Cycles_{v}] + [E_{max} \times Cycles_{e}]    (3)

Since the same set of operations is carried out in every iteration, we assume the total latency for multiple iterations can be obtained by multiplying the number of iterations by Cycles_PageRank. Equations (2) and (3) provide a simplified view of performance as a function of vertices and edges for PR.

4. EXPERIMENTAL SETUP
We evaluated the performance of our analytical models against the actual latency measured in terms of clock cycles. We used a variety of graphs ranging from synthetic to real-world social and biological networks, as shown in Table 1. The graphs in Table 1 are stored in the compressed sparse row (CSR) format. The focus of the experiments is to evaluate the model's predictive accuracy across different processor core counts and graph types.

4.1. Target reduced instruction set computer-five architecture
Mempool [6] is a RISC-V-based many-core cluster with tightly coupled L1 shared memory that can be scaled up to 256 cores supporting the RV32IMAXpulpimg instruction set. The hierarchical interconnect is responsible for a maximum latency of 5 cycles for conflict-free access to shared memory. As shown in Figure 4, the default configuration consists of 4 groups, 16 tiles per group, and 4 cores per tile. A total shared memory of 1 MB is available, divided into a sequentially addressed space and an interleaved addressed space. The size of the addressing is configurable.

Figure 4. Many-core architecture of Mempool

For our experiments, we used the MinPool configuration with 4 groups, 1 tile per group, and 4 cores per tile. This configuration has a total of 16 cores and 64 KB of tightly coupled data memory, of which each tile has 8 KB of sequentially addressed memory and 8 KB of interleaved addressed memory. We conducted cycle-accurate RTL simulations [8] to obtain latency. For the Mempool architecture, we measured Cycles_active, Cycles_unvisited, Cycles_neighbours, and Cycles_update by reading the timing register; these values remain fixed for the Mempool architecture. Using (1), and making assumptions about the per-partition vertex counts (V_part) and degrees due to the dynamic nature of BFS, we predict the latency. Similarly, we obtain the exact number of active vertices in each iteration and the edges explored by running bottom-up BFS once. We plug these numbers into (1) and calculate the latency, which reflects the static case for a given graph.
Since PR requires floating-point support, we utilized a similar setup with an 8-core cluster called Snitch [9] with a low-latency memory of 128 KB. We measured the architecture-dependent Cycles_v and Cycles_e by reading the timing register of the Snitch cluster for processing each vertex and edge. Using these values, we plugged in the corresponding V_max and E_max of the partitions under consideration, as shown in (3).
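To illustrate how the measured per-operation costs are combined with the graph counters, the helper functions below evaluate (1) and (3) directly. The op_costs_t structure and the idea of keeping the per-iteration BFS counters in arrays are our own framing of the model, not code from the Mempool or Snitch repositories.

#include <stdint.h>

/* Architecture-dependent costs, measured once on the target
   (e.g., by reading the timing register around each operation). */
typedef struct {
    uint64_t cycles_active;     /* per active vertex checked      (BFS) */
    uint64_t cycles_unvisited;  /* per unvisited vertex traversed (BFS) */
    uint64_t cycles_neighbours; /* per in-neighbour explored      (BFS) */
    uint64_t cycles_update;     /* per newly activated vertex     (BFS) */
    uint64_t cycles_v;          /* per vertex                      (PR) */
    uint64_t cycles_e;          /* per edge                        (PR) */
} op_costs_t;

/* Equation (1): sum the per-iteration work of the considered partition. */
uint64_t predict_bfs_cycles(const op_costs_t *c, uint32_t iterations,
                            const uint64_t *v_active,     /* per iteration */
                            const uint64_t *v_unvisited,
                            const uint64_t *v_neighbours,
                            const uint64_t *v_update)
{
    uint64_t total = 0;
    for (uint32_t i = 0; i < iterations; i++)
        total += v_active[i]     * c->cycles_active
               + v_unvisited[i]  * c->cycles_unvisited
               + v_neighbours[i] * c->cycles_neighbours
               + v_update[i]     * c->cycles_update;
    return total;
}

/* Equation (3): one PageRank iteration, bounded by the largest partition. */
uint64_t predict_pr_cycles(const op_costs_t *c, uint64_t v_max, uint64_t e_max)
{
    return v_max * c->cycles_v + e_max * c->cycles_e;
}

For the static BFS case, the per-iteration arrays hold the exact counters recorded from one profiling run; for the dynamic case, they hold the equal-division assumptions described above.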
Int J Recongurable & Embedded Syst ISSN: 2089-4864 849 Since PR required oat support, we utilized a similar setup with an 8-core cluster called Snitch [9] wi th lo w-latenc y memory of 128 KB. W e measured architecture dependent C y cl es v and C y cl es e c ycles by reading the timing re gister of Snitch cluster for processing each v erte x and edge. Using these v alues, we plugged the corresponding V max and E max of the partitions in consideration as sho wn in the (3). 5. RESUL TS In t his section, we v alidate our analytical model and sho wcase the accurac y of our analytical m od e l compared t o the c ycle-accurate simulation. W e compare our w ork with e xisting state-of-the-art graph process- ing frame w orks. 5.1. Accuracy of the analytical model In this sub-section, we e v aluate the performance of our analytical model in predicting the latenc y of BFS and PR. 5.1.1. Br eadth-rst sear ch Figure 5 sho ws the accurac y ranges of performance predictions made for core counts ranging from 1 to 16 for tra v ersing BFS. W e observ e an accurac y in the range 47 94% . This accurac y is due to the dynamic nature of BFS where the number of acti v e, visited and un visited v ertices are not kno wn during compile time. Due to this dynamic nature assumptions we observ e lo w predicti v e accurac y . The right hand side of the Figure 5 sho ws the accurac y of the predictions when the number of acti v e, visited and un visited v ertices are kno wn beforehand (by running BFS once). F or this static case of BFS, the accurac y ranges from 88 99% . A maximum speedup of 11 . 55 × is observ ed on the 16-core compared to single-core for the tra v ersal of the graph512. Figure 5. Comparison of performance prediction and actual latenc y obtained for Mempool with core count 1–16 P ar allel gr aph algorithms on a RISCV -based many-cor e (Ashuthosh Moolemajalu Ravikumar) Evaluation Warning : The document was created with Spire.PDF for Python.
5.1.2. PageRank
In Figure 6, we compare the predicted latency and the measured latency, and observe an accuracy in the range of 93% to 99%. Since PR involves updating all the scores in every iteration, (3) provides an accurate performance prediction. Using our model, we can accurately predict the achievable speedup for any input graph and core counts up to 8. A maximum speedup of 7.56× is observed on 8 cores compared to a single core for the graph512 input.

Figure 6. Comparison of the performance prediction and the actual latency obtained for Snitch with core counts 1-8. The deterministic nature of the algorithm leads to highly accurate predictions

5.2. Comparison of million traversed edges per second with state-of-the-art graph processing frameworks
Table 4 compares state-of-the-art graph accelerators with the RISC-V-based many-core. For comparison, we use the geometric mean of million traversed edges per second (MTEPS) and MTEPS per watt (MTEPS/W) as metrics. In our case, we evaluate an 8-core Snitch cluster implemented on an FPGA and on a 22 nm technology node, running both BFS and PR. For a fair comparison, all numbers correspond to bottom-up BFS and PR with graphs stored in CSR format. The MTEPS (geometric mean) and power values for comparison are sourced from experiments conducted in [3]. The GraphIt [10] results are based on a two-socket 32-core 2.8 GHz Intel Xeon Gold 6242 machine with 384 GB DDR4 memory and a bandwidth of 282 GB/s. GraphBLAST [2] experiments utilize a GTX 1080 Ti GPU with 3584 CUDA cores running at a peak frequency of 1582 MHz and 11 GB of GDDR5X memory, providing a bandwidth of 484 GB/s. The GraphLily [3] accelerator operates on a Xilinx Alveo U280 FPGA with a 165 MHz frequency and 16 HBM channels. With GraphLily, the MTEPS is comparable to GraphBLAST, but GraphLily achieves better MTEPS/W, indicating higher efficiency than GraphBLAST. Although the Snitch eight-core cluster provides low MTEPS, its MTEPS/W is 20× better than the state of the art, highlighting the energy efficiency of RISC-V-based many-core processors in graph processing.

Table 4. Comparison with state-of-the-art graph processing accelerators. The graphs considered in our work are very small and fit within the tightly coupled memory (128 KB) of the Snitch cluster
Related works | Technology (nm) | Algorithm | MTEPS (geometric mean) | Power (W) | MTEPS/W | Model
GraphIt [10] (CPU) | 22 | BFS | 1957 | 264 | 7 | No
GraphIt [10] (CPU) | 22 | PageRank | 2280 | 268 | 9 | No
GraphBLAST [2] (GPU) | 16 | BFS | 4114 | 146 | 28 | No
GraphBLAST [2] (GPU) | 16 | PageRank | 4940 | 182 | 27 | No
GraphLily [3] (FPGA) | 16 | BFS | 3581 | 45 | 80 | No
GraphLily [3] (FPGA) | 16 | PageRank | 5591 | 49 | 114 | No
This work [9] (FPGA) | 16 | BFS | 5 | 0.9 | 6 | Yes
This work [9] (FPGA) | 16 | PageRank | 20 | 0.9 | 23 | Yes
This work [9] | 22 | BFS | 103 | 0.17 | 605 | Yes
This work [9] | 22 | PageRank | 386 | 0.17 | 2275 | Yes
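For reference, the metrics in Table 4 follow directly from raw measurements; the helpers below only encode the usual definitions (MTEPS = traversed edges / runtime / 10^6, and MTEPS/W = MTEPS / average power) and are not part of any of the cited frameworks.

/* Million traversed edges per second, from a cycle count and clock frequency. */
static double mteps(uint64_t edges_traversed, uint64_t cycles, double freq_hz)
{
    double seconds = (double)cycles / freq_hz;
    return (double)edges_traversed / seconds / 1e6;
}

/* Energy efficiency as reported in Table 4. */
static double mteps_per_watt(double mteps_value, double avg_power_w)
{
    return mteps_value / avg_power_w;
}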
Int J Recongurable & Embedded Syst ISSN: 2089-4864 851 5.3. Scalability analysis f or cor e counts abo v e 16 Figure 7 sho ws the performance scaling of BFS tra v ersal on soc-wiki be yond 16 cores using a 256- core conguration of Mempool. Up to 16 cores, the analytical model predicts performance with an accurac y of 89% . Ho we v er , be yond 16 cores, the accurac y drops because the model does not account for increasing synchronization delays and contention in shared resources. As core counts increase, these f actors introduce additional latenc y , which impacts performance and accurac y of the model. At 256 cores, the accurac y drops to 21% . Impro ving the model to better capture these ef fects could enhance its accurac y at higher core counts. Figure 7. Comparison of performance predicted for BFS tra v ersal of soc-wiki on core counts higher than 16, up to 256 cores 6. RELA TED W ORK In this section, we re vie w prior ef forts in accelerating graph algorithms across v arious plat forms, including man y-core CPUs, GPUs, and FPGAs. W e e xamine the usability of analytical models for performance prediction, highlighting their potential to guide architecture and algorithm design decisions. 6.1. Many-cor e, central pr ocessing unit-based accelerators Eyerman et al . [11] e xplored the scalability of man y-core architectures for graph processing using OpenMP-based implementations from the GAP benchmark suite [12]. While their w ork e xamined x86-based systems, our research tar gets lo w-po wer RISC-V architectures. Domain-specic languages (DSLs) such as GraphIt [10] ha v e also been de v eloped to map graph algorithms onto di v erse platforms, including multi-core CPUs, GPUs, and RISC-V -based man y-core systems. 6.2. Graphics pr ocessing unit-based accelerators GPUs ha v e also been used to accelerate graph applications. GraphBLAST [2] is a well kno wn l inear algebra GraphBLAS-based [13] frame w ork. GraphBLAST implements partitioning schemes that align along v erte x-centric partitioning or edge-centric partitioning. 6.3. Field-pr ogrammable gate array-based accelerators Although FPGA-based accelerators require high design ef fort, often limiting their practicality , there ha v e been ef forts to accelerate graph processing using these platforms. Ea rly frame w orks lik e GraphGen [14] introduced a v erte x-centric approach to utilize FPGAs for graph computations. FPGP [15] adv anced this with an interv al-shard structure, enabling e xibility across v arious graph algorithms without requiring complete reimplementation. F ore Graph [16] le v eraged BRAM resources across multiple FPGA boards, while F abGraph [17] further optimized performance by introducing a tw o-le v el caching mechanism. T o address the challenges of high design comple xity , recent frame w orks ha v e focused on sim plify- ing implementation. HitGraph [18] used v ertical partitioning to increase partition size, though preprocessing P ar allel gr aph algorithms on a RISCV -based many-cor e (Ashuthosh Moolemajalu Ravikumar) Evaluation Warning : The document was created with Spire.PDF for Python.
overheads from edge sorting remained a limitation. AccuGraph [19] addressed data conflicts with a parallel conflict resolver but still relied on edge sorting during partitioning. High-level synthesis (HLS)-based solutions, such as ThunderGP [5], utilized the gather-apply-scatter (GAS) model to improve efficiency. Frameworks like GraphLily [3], designed for HBM-equipped FPGAs, and GraphSoC [20], which introduced custom graph instructions, further reduced design complexity and bitstream generation times.

6.4. Usability of performance models
GAIL [1] is one of the early works to analytically model graphs by examining the number of edges traversed, memory accesses, and access times. The graph algorithm iron law suggests that implementation or algorithm improvements are better evaluated empirically than analytically. However, our work uses an analytical approach to provide insights into implementation-specific improvements. Chhugani et al. [21] developed a scalable BFS for modern CPUs, using dynamic partitioning based on cache size to improve traversal speed and locality. They also introduced a performance model that achieved 85-90% accuracy in evaluating traversal phases. Verstraaten et al. [22] highlighted the importance of graph structure in determining the performance of different strategies during PR iterations. Similarly, our work considers graph structure as a key factor in performance analysis. Verstraaten et al. [23] also modeled graph application performance on GPUs using a data-driven approach. Although dynamic scheduling reduced the analytical model's accuracy to less than 50%, the use of large datasets and binary decision trees improved prediction accuracy. Their analysis helped identify the best implementation strategies for BFS traversal. Although such works accelerate and model graph applications on various parallel architectures, they do not develop an analytical model that helps derive an optimal design based on graph properties.

7. CONCLUSION AND FUTURE WORK
In this work, we developed an analytical model of the RISC-V-based many-core architecture for performance estimation of graph algorithms such as BFS and PR. This model predicts performance based on the graph structure, even before executing the workload on the many-core. By providing insights into the expected performance, the model enables informed decision-making regarding the optimal number of cores when deploying RISC-V many-cores on FPGAs for parallel graph processing. The adaptability of the model extends beyond RISC-V many-cores, as it can be tailored to other parallel many-core architectures by updating the architectural parameters. The analytical model shows a strong correlation with cycle-accurate RTL simulations of the RISC-V-based many-core system, with a maximum prediction error of 11% observed on real-world graphs from the network repository across various core counts. The target RISC-V many-core architecture delivers comparable MTEPS/W on FPGAs and achieves up to 20× better MTEPS/W on a 22 nm technology node compared to an FPGA-based graph accelerator implemented on a 16 nm process. Future work involves validating the versatility of the analytical model using different architectures, extending support for additional graph algorithms, and exploring higher core counts and larger graph sizes.
This work highlights the potential of analytical modeling for guiding hardware design in graph processing.

ACKNOWLEDGMENTS
We thank Rajesh Vivekanandham and Venkatraman Ramakrishnan for their guidance and support as industry liaisons. Their input was helpful in aligning our work with industry perspectives.

FUNDING INFORMATION
This work is supported by Semiconductor Research Corporation (SRC) as part of the Indian Research Program (Project ID 3055.001).

AUTHOR CONTRIBUTIONS STATEMENT
This journal uses the Contributor Roles Taxonomy (CRediT) to recognize individual author contributions, reduce authorship disputes, and facilitate collaboration.