International Journal of Reconfigurable and Embedded Systems (IJRES)
Vol. 14, No. 3, November 2025, pp. 843–854
ISSN: 2089-4864, DOI: 10.11591/ijres.v14.i3.pp843-854
Parallel graph algorithms on a RISC-V-based many-core

Ashuthosh Moolemajalu Ravikumar¹, Aakarsh Vinay¹, Krishna K. Nagar², Madhura Purnaprajna¹

¹Department of Electronics and Communications Engineering, PES University, Bengaluru, India
²Google, San Francisco, United States of America
Article Info

Article history:
Received May 13, 2025
Revised Jul 17, 2025
Accepted Aug 7, 2025

Keywords:
Analytical model
Graph algorithm
Parallel architecture
Performance model
Reduced instruction set computer–five many-core

ABSTRACT

Graph algorithms are essential in domains like social network analysis, web search, and bioinformatics. Their execution on modern hardware is vital due to the growing size and complexity of graphs. Traditional multi-core systems struggle with irregular memory access patterns in graph workloads. Reduced instruction set computer–five (RISC-V)-based many-core processors offer a promising alternative with their customizable open-source architecture suitable for optimization. This work focuses on parallelizing graph algorithms like breadth-first search (BFS) and PageRank (PR) on RISC-V many-core systems. We evaluated performance based on graph structure and processor architecture, and developed an analytical model to predict execution time. The model incorporates the unique characteristics of the RISC-V architecture and the types and numbers of instructions executed by multiple cores, with a maximum prediction error of 11%. Our experiments show a speedup of up to 11.55× for BFS and 7.56× for PR using 16 and 8 cores, respectively, over single-core performance. Comparisons with existing graph processing frameworks demonstrate that RISC-V systems can deliver up to 20× better energy efficiency on real-world graphs from the network repository.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Ashuthosh Moolemajalu Ravikumar
Department of Electronics and Communications Engineering, PES University
Bengaluru, Karnataka, India
Email: ashuthoshmr@pes.edu
1. INTRODUCTION

Graph algorithms have become increasingly important in applications such as social networks, navigation, and graph neural networks. The methods for processing graphs have evolved significantly, transitioning from classical problems to modern general-purpose platforms like central processing units (CPUs) and graphics processing units (GPUs). Recent advancements include algorithmic improvements [1] and the development of specialized accelerators and frameworks, such as GraphBLAST [2] and GraphLily [3], which leverage GPUs and field-programmable gate arrays (FPGAs) for enhanced performance.

However, these advancements also highlight the unique challenges inherent to graph processing, particularly in achieving efficient parallelization. Graph properties, such as connectivity between vertices, vary widely depending on the graph type, leading to irregular memory access patterns. Additionally, many graph algorithms are memory-bound, making efficient memory utilization critical. A prime example of application-architecture co-design is Google's TPU [4], which demonstrates a shift towards application-optimized parallel architectures to achieve greater efficiency. Similarly, custom accelerators targeting graph applications have been implemented on FPGAs [3], [5]. But the significant design time and effort required for FPGA-based accelerators often limit their practicality. GPUs [2] and vector processors have emerged as alternative parallel architectures for graph application acceleration.

In this context, reduced instruction set computer–five (RISC-V)-based many-core processors have gained prominence due to their customizable and open-source architecture. However, there is still a need to understand how well graph workloads perform on RISC-V-based many-core processors. Our research aims to address this question by exploring graph algorithms and developing an analytical model to predict the performance of RISC-V-based many-core processors. This model allows us to project the performance of parallel graph processing for core counts up to 16 and various graph types. Our contributions in this work include the following:
− Parallelizing graph algorithms on RISC-V-based many-core processors.
− Developing an analytical model to predict the performance of parallel graph algorithms on a RISC-V-based many-core.
− Validating the accuracy of the model on real-world graphs.
2. PARALLELIZING GRAPH ALGORITHMS ON RISC-V MANY-CORE

Among various graph algorithms, this work focuses on two widely applicable ones: breadth-first search (BFS) and PageRank (PR). Parallelizing these algorithms on a RISC-V-based many-core involves distributing the workload across available cores. In this section, we examine the effects of parallelizing BFS with respect to the number of cores and graph types, which motivates the development of the analytical model. Our analysis leverages a RISC-V many-core called Mempool [6], featuring 16 cores and 64 KB of tightly coupled memory.
2.1. Parallel breadth-first search

BFS is a recursive algorithm that tries to traverse all the vertices in the graph starting from a single vertex known as the source or root vertex. Traversal involves exploring the neighbors of a vertex and updating the properties of its neighbors, such as visited status, parent vertex, and distance from the root vertex. The traversal of active vertices is done in every iteration and continues until there are no active vertices. Traversal can be done mainly in two ways: the top-down approach and the bottom-up approach. Based on the number of active vertices in every iteration, one approach or the other is beneficial [1]. In our work, we consider bottom-up BFS, as considerable time is spent on this kernel compared to the top-down BFS.

As shown in Algorithm 1, in each iteration, all vertices (line 1) are checked to determine if they have not been visited (line 2). The neighbors of these unvisited vertices are then explored (line 3). If an incoming neighbor of an unvisited vertex is active (line 4), the unvisited vertex is marked as visited and activated for the next iteration (lines 5 and 6). Parallelizing this kernel across N cores involves dividing the total vertices (line 1) into N parts. While this division ensures an equal distribution of vertices, the unvisited vertices (line 2) and their neighbors (line 3) are unevenly distributed. In graphs with highly irregular neighbor distributions (e.g., power-law graphs), this irregularity impacts performance by leaving some cores idle while others remain active. Increasing the number of cores in such cases amplifies the impact, as very few cores compute the vertices with a large number of neighbors.
Algorithm 1. Single iteration of bottom-up BFS
1: for i = 1 to v do
2:   if cost[i] < 0 then
3:     for each j in incoming neighbours of i do
4:       if active[j] > 0 then
5:         active_list ← i
6:         cost[i] ← iteration
7:         break
8:       end if
9:     end for
10:   end if
11: end for
12: iteration ← iteration + 1
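To make the parallelization described above concrete, the sketch below shows one bottom-up iteration distributed over N cores by splitting the vertex range into contiguous chunks. It is a minimal illustration, not the paper's kernel: it assumes a CSR-style incoming-neighbour layout (row_ptr, col_idx), marks activation in a next_active bitmap instead of an explicit list, and relies on a runtime-provided barrier() (all hypothetical names).

```c
/* Minimal sketch of one bottom-up BFS iteration across N cores.
 * Assumptions (illustrative, not from the paper's code): CSR arrays
 * row_ptr/col_idx describe incoming neighbours; barrier() is provided
 * by the bare-metal runtime. */
extern void barrier(int core_id, int num_cores);  /* hypothetical runtime barrier */

void bfs_bottom_up_iter(int core_id, int num_cores, int num_vertices,
                        const int *row_ptr, const int *col_idx,
                        int *cost, const int *active, int *next_active,
                        int iteration) {
    /* Contiguous vertex partition: equal vertex counts per core, but the
     * neighbour counts (actual work) may still be unbalanced. */
    int chunk = (num_vertices + num_cores - 1) / num_cores;
    int begin = core_id * chunk;
    int end   = (begin + chunk > num_vertices) ? num_vertices : begin + chunk;

    for (int i = begin; i < end; i++) {
        if (cost[i] < 0) {                            /* line 2: unvisited   */
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
                int j = col_idx[k];                   /* line 3: neighbour   */
                if (active[j] > 0) {                  /* line 4: active?     */
                    next_active[i] = 1;               /* line 5: activate    */
                    cost[i] = iteration;              /* line 6: set level   */
                    break;                            /* line 7              */
                }
            }
        }
    }
    barrier(core_id, num_cores);  /* all cores finish before the frontier swap */
}
```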
2.1.1. Graph type

Figure 1 presents the profile of the Mempool, showing the breakdown (%) of compute and stall times as a function of graph type. The analyzed graphs are real-world graphs listed in Table 1, sourced from the network repository [7]. For graphs with power-law characteristics, such as soc-wiki and web-polblogs, synchronization overhead contributes to stall times. This irregularity in the graph structure results in an uneven distribution of executed instructions across cores in the parallel many-core system.
Figure 1. Percentage of time spent on compute and stalls for different types of graphs on a 16-core Mempool. Compute time includes instruction execution, while stalls are caused by synchronization, instruction cache misses, load-store unit (LSU) delays, and read-after-write (RAW) hazards
Table 1. Diverse types of graphs used in this work, sourced from the network repository [7]

Data            Vertices   Edges   Type
graph512        512        3114    Synthetic
soc-wiki        889        5828    Social network
web-polblogs    643        4560    Web network
bio-diseasome   516        2376    Biological network
2.1.2. Number of cores

Increasing the number of cores reduces latency but also introduces synchronization overhead. Figure 2 illustrates the latency breakdown for BFS traversal on soc-wiki with varying core counts. A maximum speedup of 9.2× is achieved with 16 cores compared to single-core latency. Using 8 cores provides a 5.6× speedup, while doubling the cores to 16 adds only an additional 1.6× speedup.
Figure 2. Latency breakdown (in clock cycles) for BFS traversal of the soc-wiki graph across various core counts
2.2. Need for performance prediction

In section 2.1, we examined how factors such as graph input type and core count influence the performance of graph algorithms on the parallel RISC-V architecture. However, accurately predicting performance prior to execution remains a challenge. To address this, we propose an analytical model that estimates cycle counts based on graph characteristics and system configuration. This model enables performance prediction for any given graph input on Mempool with N cores, without requiring execution on the actual hardware.
3. BUILDING THE ANALYTICAL MODEL: METHOD

Figure 3 shows the method of building an analytical model to predict the performance of graph algorithms. As shown in Figure 3, the graph algorithm is translated into a set of RISC-V instructions by the compiler. These instructions include compute and memory instructions that are mapped to N cores.
Figure 3. Illustration of the analytical model for predicting the performance of graph algorithms on a RISC-V many-core processor cluster
Along with the instructions, the graph is also distributed to the multi-core, thereby dividing the overall computations. Based on the number and type of RISC-V instructions, the latency to execute the instructions varies and is given by Cycles_V and Cycles_E. Similarly, this set of instructions executes depending on graph properties such as the number of vertices (V) and edges (E) of the graph. Using this information, we build the analytical model based on the graph properties and the number and type of RISC-V instructions on N cores.
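As a concrete illustration of the distribution step, the sketch below splits the vertex range of a CSR graph into N contiguous partitions and records, for each partition, the vertex and edge counts that later feed the model. The struct and function names are illustrative assumptions, not taken from the paper's code.

```c
/* Illustrative partition descriptor: a vertex range plus the counts the
 * analytical model consumes (V and E of the partition). */
typedef struct {
    int  v_begin, v_end;   /* vertex range [v_begin, v_end) owned by one core */
    int  num_vertices;     /* V of this partition */
    long num_edges;        /* E of this partition (from the CSR row pointers) */
} partition_t;

/* Split num_vertices of a CSR graph into n contiguous partitions. */
void partition_graph(int num_vertices, const int *row_ptr,
                     int n, partition_t *parts) {
    int chunk = (num_vertices + n - 1) / n;
    for (int c = 0; c < n; c++) {
        int b = c * chunk;
        if (b > num_vertices) b = num_vertices;
        int e = b + chunk;
        if (e > num_vertices) e = num_vertices;
        parts[c].v_begin      = b;
        parts[c].v_end        = e;
        parts[c].num_vertices = e - b;
        parts[c].num_edges    = row_ptr[e] - row_ptr[b];
    }
}
```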
We have developed an analytical model to predict the performance of BFS and PR. The notations used in this model for BFS and PR are given in Tables 2 and 3. Whether each notation depends on the core count of the many-core, the instruction set architecture (ISA), or the graph input is also indicated in the tables.
3.1. Breadth-first search

As shown in Algorithm 1, lines 1 and 2 are executed for all the vertices in every iteration. Of all the vertices, the unvisited vertices V_unvisited are explored (line 3). If the incoming neighbours of the V_unvisited are active, i.e., V_active (line 4), the unvisited vertices are put in the new active list, i.e., V_update (lines 5, 6, and 7). Based on the number of V_active, V_unvisited, and V_neighbours, the number and type of instructions executed varies. The total cycles for each iteration of the bottom-up BFS are given by (1). Using this equation, the latency of bottom-up BFS can be predicted ahead of time based on the number of vertices and edges, provided V_active, V_unvisited, V_neighbours, and V_update are known for every iteration. Knowing V_active, V_unvisited, V_neighbours, and V_update at compile time is only possible after running BFS once. Otherwise, assumptions can be made by dividing the total number of vertices equally among the iterations. Extending this to multiple cores involves estimating the latency of the graph partition that takes the longest. Since the amount of work done by each core depends on the size of its partition, we consider the partition with the largest number of vertices and edges, V_max and E_max, for our estimation.
Cycles_{bottom\_bfs} = \sum_{i=1}^{Iterations} \left[ V_{active} \times Cycles_{active} + V_{unvisited} \times Cycles_{unvisited} + V_{neighbours} \times Cycles_{neighbours} + V_{update} \times Cycles_{update} \right] \quad (1)
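A direct transcription of (1) into code is shown below: it sums the per-iteration terms from the per-iteration vertex counts and the architecture-dependent per-vertex cycle costs. The struct and field names are ours; the cycle constants would come from the timing-register measurements described in section 4 and are treated here as inputs.

```c
/* Per-iteration counts of the bottom-up BFS (determined by the input graph). */
typedef struct {
    long v_active;      /* V_active     */
    long v_unvisited;   /* V_unvisited  */
    long v_neighbours;  /* V_neighbours */
    long v_update;      /* V_update     */
} bfs_iter_counts_t;

/* Architecture/ISA-dependent per-vertex cycle costs (measured once). */
typedef struct {
    double c_active, c_unvisited, c_neighbours, c_update;
} bfs_cycle_costs_t;

/* Equation (1): total predicted cycles over all BFS iterations. */
double predict_bfs_cycles(const bfs_iter_counts_t *it, int num_iterations,
                          const bfs_cycle_costs_t *c) {
    double total = 0.0;
    for (int i = 0; i < num_iterations; i++) {
        total += it[i].v_active     * c->c_active
               + it[i].v_unvisited  * c->c_unvisited
               + it[i].v_neighbours * c->c_neighbours
               + it[i].v_update     * c->c_update;
    }
    return total;
}
```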
Table 2. Notations used in the model for bottom-up BFS and their dependencies

Notation            Description                                               Dependence
V_active            Number of active vertices in a given iteration            Input graph
V_unvisited         Number of unvisited vertices traversed                    Input graph
V_neighbours        Number of neighbouring vertices explored                  Input graph
V_update            Number of new active vertices found in an iteration       Input graph
Cycles_active       Cycles to explore each V_active vertex                    Multi-core architecture, ISA
Cycles_unvisited    Cycles to traverse each V_unvisited vertex                Many-core architecture, ISA
Cycles_neighbours   Cycles to check each V_unvisited vertex                   Many-core architecture, ISA
Cycles_update       Cycles to put each unvisited vertex into the active list  Many-core architecture, ISA
Table 3. Notations used in the model for PR and their dependencies

Notation   Description                                 Dependence
N          Number of cores                             Multi-core architecture
V_max      Maximum number of vertices in a partition   Input graph
E_max      Maximum number of edges in a partition      Input graph
Cycles_v   Cycles to operate on vertices               Multi-core architecture, ISA
Cycles_e   Cycles to operate on edges                  Multi-core architecture, ISA
3.2. PageRank

PR involves ranking the vertices of the graph based on their connectivity. Each vertex is initialised with a score based on the number of outgoing edges (i.e., the degree of the vertex), called the outgoing contribution. As shown in the pseudo-code of Algorithm 2, the outgoing contribution is accumulated into incoming_total for each of the vertices (lines 3, 4). Based on the base_score and incoming_total, a new score is calculated (line 7). This process is repeated until the difference between the old score and the new score, i.e., the error (line 8), is less than the threshold limit, or until the maximum iteration count is reached. Since each vertex's score can be estimated in parallel, PR can be accelerated by leveraging the cores present in the many-core system.
Algorithm 2. Single iteration of PageRank
1: for i = 1 to v vertices do
2:   incoming_total ← 0
3:   for j : neighbours of i do
4:     incoming_total ← incoming_total + outgoing_contribution[j]
5:   end for
6:   old_score ← scores[i]
7:   scores[i] ← base_score + k_damp × incoming_total
8:   error += fabs(scores[i] − old_score)
9:   outgoing_contrib[i] ← scores[i] / out_degree[i]
10: end for
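The sketch below parallelizes one PR iteration in the same way as the BFS kernel: each core owns a contiguous vertex range and updates only its own scores, so no locking is needed inside the loop. It is an illustrative assumption-laden version rather than the paper's kernel: the CSR arrays and barrier() are hypothetical names, and the updated outgoing contributions are written to a separate buffer so that cores do not read values already updated in the same iteration.

```c
#include <math.h>

extern void barrier(int core_id, int num_cores);  /* hypothetical runtime barrier */

/* Minimal sketch: one PR iteration over one core's vertex partition.
 * Returns the core's partial error; the caller reduces the partial errors. */
double pagerank_iter(int core_id, int num_cores, int num_vertices,
                     const int *row_ptr, const int *col_idx,
                     const int *out_degree, const double *outgoing_contrib,
                     double *scores, double *outgoing_contrib_next,
                     double base_score, double k_damp) {
    int chunk = (num_vertices + num_cores - 1) / num_cores;
    int begin = core_id * chunk;
    int end   = (begin + chunk > num_vertices) ? num_vertices : begin + chunk;
    double error = 0.0;

    for (int i = begin; i < end; i++) {
        double incoming_total = 0.0;                       /* lines 2-5 */
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            incoming_total += outgoing_contrib[col_idx[k]];
        double old_score = scores[i];                      /* line 6 */
        scores[i] = base_score + k_damp * incoming_total;  /* line 7 */
        error += fabs(scores[i] - old_score);              /* line 8 */
        outgoing_contrib_next[i] = scores[i] / out_degree[i]; /* line 9 */
    }
    barrier(core_id, num_cores);  /* wait before contributions are reused */
    return error;
}
```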
For a single-core design, the latency to calculate the ranks of the graph G(V, E), with V vertices and E edges, in each iteration includes reading and accumulating the outgoing contributions of each vertex and updating the score (lines 2, 3 of Algorithm 2). Reading and accumulating the outgoing contributions for each vertex depends on the total number of edges E in the graph. Updating the scores of the vertices based on the accumulated contributions depends on the number of vertices V (lines 6 to 9 of Algorithm 2). The cycles for calculating the ranks of the vertices of G(V, E) in every iteration can be formulated as (2):
Cycles_{PageRank} = [\, V \times Cycles_{v} \,] + [\, E \times Cycles_{e} \,] \quad (2)
Similarly, extending this to multiple cores involves estimating the latency of the group of vertices that takes the longest. The cycles for calculating the ranks in every iteration on the multi-core system can be formulated as (3):
Cycles_{PageRank} = [\, V_{max} \times Cycles_{v} \,] + [\, E_{max} \times Cycles_{e} \,] \quad (3)
Since the same set of operations is carried out in every iteration, we assume the total latency for multiple iterations can be obtained by multiplying the number of iterations by Cycles_PageRank. Equations (2) and (3) provide a simplified view of performance as a function of vertices and edges for PR.
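Because (2) and (3) have the same shape, a single helper covers both: passing the full V and E gives the single-core estimate of (2), while passing the V_max and E_max of the largest partition gives the multi-core estimate of (3). The function and parameter names below are ours, not from the paper's code.

```c
/* Equations (2)/(3): predicted PR cycles.
 * Pass (V, E) of the whole graph for the single-core case, or
 * (V_max, E_max) of the largest partition for the N-core case. */
double predict_pr_cycles(long vertices, long edges,
                         double cycles_v, double cycles_e,
                         int num_iterations) {
    double per_iter = vertices * cycles_v + edges * cycles_e;
    /* Every iteration performs the same operations, so the total is a
     * simple multiple of the per-iteration cost, as assumed in the text. */
    return per_iter * num_iterations;
}
```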
4. EXPERIMENTAL SETUP

We evaluated the performance of our analytical models against the actual latency measured in terms of clock cycles. We used a variety of graphs, ranging from synthetic to real-world social and biological networks, as shown in Table 1. The graphs in Table 1 are stored in the compressed sparse row (CSR) format. The focus of the experiments is to evaluate the model's predictive accuracy across different processor core counts and graph types.
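For reference, the sketch below shows a typical CSR layout of the kind used to store the graphs of Table 1: row_ptr holds, for each vertex, the offset of its adjacency list inside col_idx. The struct and field names are illustrative rather than the exact layout used in our kernels.

```c
/* Illustrative compressed sparse row (CSR) graph layout.
 * row_ptr has num_vertices + 1 entries; the neighbours of vertex v are
 * col_idx[row_ptr[v]] .. col_idx[row_ptr[v+1] - 1]. */
typedef struct {
    int  num_vertices;
    long num_edges;
    int *row_ptr;   /* size: num_vertices + 1 */
    int *col_idx;   /* size: num_edges        */
} csr_graph_t;

/* Example: apply fn to every neighbour of vertex v. */
static inline void visit_neighbours(const csr_graph_t *g, int v,
                                    void (*fn)(int neighbour)) {
    for (int k = g->row_ptr[v]; k < g->row_ptr[v + 1]; k++)
        fn(g->col_idx[k]);
}
```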
4.1. Target reduced instruction set computer–five architecture

Mempool [6] is a RISC-V-based many-core cluster with tightly coupled L1 shared memory that can be scaled up to 256 cores, supporting RV32IMAXpulpimg instructions. The hierarchical interconnect is responsible for a maximum latency of 5 cycles for conflict-free access to shared memory. As shown in Figure 4, the default configuration consists of 4 groups, 16 tiles per group, and 4 cores per tile. A total shared memory of 1 MB is available, where the memory is divided into sequentially addressed space and interleaved addressed space. The size of the addressing is configurable.
Figure 4. Many-core architecture of Mempool
For our experiments, we have used the Minpool configuration with 4 groups, 1 tile per group, and 4 cores per tile. This configuration has a total of 16 cores and 64 KB of tightly coupled data memory, of which each tile has 8 KB of sequentially addressed memory and 8 KB of interleaved addressed memory.
We conducted cycle-accurate RTL simulations [8] to obtain latency. For the Mempool architecture, we measured Cycles_active, Cycles_unvisited, Cycles_neighbours, and Cycles_update by reading the timing register; these values remain fixed for the Mempool architecture. Using (1), and assuming V_part and the degree due to the dynamic nature of BFS, we predict the latency. Similarly, we obtain the exact number of active vertices in each iteration and the edges explored by running the bottom-up BFS once. We plug these numbers into (1) and calculate the latency, which reflects the static case for a given graph.
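The per-vertex cycle constants are obtained by reading a hardware timing register around the corresponding code region. The sketch below shows the generic RISC-V flavour of this measurement using the mcycle CSR; Mempool and Snitch expose their own timing registers, so the exact register and access mechanism may differ, and this should be read as an assumption-laden illustration.

```c
#include <stdint.h>

/* Read the RISC-V machine cycle counter (RV32: mcycle CSR).
 * Illustrative stand-in for the platform-specific timing register. */
static inline uint32_t read_mcycle(void) {
    uint32_t c;
    __asm__ volatile("csrr %0, mcycle" : "=r"(c));
    return c;
}

/* Measure the cycles taken by a code region, e.g. exploring one active vertex. */
#define MEASURE_CYCLES(region, out)            \
    do {                                       \
        uint32_t _start = read_mcycle();       \
        region;                                \
        (out) = read_mcycle() - _start;        \
    } while (0)
```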
Since PR requires float support, we utilized a similar setup with an 8-core cluster called Snitch [9], with a low-latency memory of 128 KB. We measured the architecture-dependent Cycles_v and Cycles_e by reading the timing register of the Snitch cluster while processing each vertex and edge. Using these values, we plugged the corresponding V_max and E_max of the partitions under consideration into (3).
5. RESULTS

In this section, we validate our analytical model and showcase its accuracy compared to cycle-accurate simulation. We also compare our work with existing state-of-the-art graph processing frameworks.
5.1. Accuracy of the analytical model

In this sub-section, we evaluate the performance of our analytical model in predicting the latency of BFS and PR.
5.1.1. Breadth-first search

Figure 5 shows the accuracy ranges of the performance predictions made for core counts ranging from 1 to 16 for BFS traversal. We observe an accuracy in the range of 47-94%. This is due to the dynamic nature of BFS, where the number of active, visited, and unvisited vertices is not known at compile time; the assumptions made to cover this dynamic behaviour lead to low predictive accuracy. The right-hand side of Figure 5 shows the accuracy of the predictions when the number of active, visited, and unvisited vertices is known beforehand (by running BFS once). For this static case of BFS, the accuracy ranges from 88-99%. A maximum speedup of 11.55× is observed on the 16-core configuration compared to single-core for the traversal of graph512.
Figure 5. Comparison of performance prediction and actual latency obtained for Mempool with core counts 1–16
5.1.2. PageRank

In Figure 6, we compare the predicted latency with the measured latency and observe an accuracy in the range of 93% up to 99%. Since PR involves updating all the scores in every iteration, (3) provides an accurate performance prediction. Using our model, we can accurately predict the achievable speedup for any input graph and core counts up to 8. A maximum speedup of 7.56× is observed on the 8-core configuration compared to single-core for the graph512 input.
Figure 6. Comparison of performance prediction and actual latency obtained for Snitch with core counts 1–8. The deterministic nature of the algorithm leads to highly accurate predictions
5.2. Comparison of million traversed edges per second of state-of-the-art graph processing frameworks

Table 4 compares state-of-the-art graph accelerators with the RISC-V-based many-core. For comparison, we use the geometric mean of million traversed edges per second (MTEPS) and MTEPS per watt (MTEPS/W) as metrics. In our case, we evaluate an 8-core Snitch cluster, implemented on an FPGA and on a 22 nm technology node, running both BFS and PR. For a fair comparison, all numbers correspond to bottom-up BFS and PR with graphs stored in CSR format. The MTEPS (geometric mean) and power values for comparison are sourced from experiments conducted in [3]. The GraphIt [10] results are based on a two-socket 32-core 2.8 GHz Intel Xeon Gold 6242 machine with 384 GB of DDR4 memory and a bandwidth of 282 GB/s. The GraphBLAST [2] experiments utilize a GTX 1080 Ti GPU with 3584 CUDA cores running at a peak frequency of 1582 MHz and 11 GB of GDDR5X memory, providing a bandwidth of 484 GB/s. The GraphLily [3] accelerator operates on a Xilinx Alveo U280 FPGA with a 165 MHz frequency and 16 HBM channels. With GraphLily, the MTEPS is comparable to GraphBLAST, but GraphLily achieves better MTEPS/W, indicating higher efficiency than GraphBLAST. Although the Snitch eight-core cluster provides low MTEPS, its MTEPS/W is 20× better than the state-of-the-art, highlighting the energy efficiency of RISC-V-based many-core processors in graph processing.
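As a concrete reading of Table 4, MTEPS/W is simply the throughput metric divided by the average power; the short computation below reproduces the roughly 20× energy-efficiency gap quoted above from the PR rows of the table (GraphLily at 5591 MTEPS and 49 W versus this work at 386 MTEPS and 0.17 W).

```c
#include <stdio.h>

int main(void) {
    /* PR rows of Table 4 */
    double graphlily_mteps = 5591.0, graphlily_w = 49.0;
    double this_work_mteps = 386.0,  this_work_w = 0.17;

    double eff_graphlily = graphlily_mteps / graphlily_w;  /* ~114 MTEPS/W  */
    double eff_this_work = this_work_mteps / this_work_w;  /* ~2271 MTEPS/W */

    printf("GraphLily: %.0f MTEPS/W, this work: %.0f MTEPS/W, ratio %.1fx\n",
           eff_graphlily, eff_this_work, eff_this_work / eff_graphlily);
    return 0;
}
```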
Table 4. Comparison with state-of-the-art graph processing accelerators. The graphs considered in our work are very small and fit within the tightly coupled memory (128 KB) of the Snitch cluster

Related works          Technology (nm)   Algorithm   MTEPS (geometric mean)   Power (W)   MTEPS/W   Model
GraphIt [10] (CPU)     22                BFS         1957                     264         7         No
                                         PageRank    2280                     268         9
GraphBLAST [2] (GPU)   16                BFS         4114                     146         28        No
                                         PageRank    4940                     182         27
GraphLily [3] (FPGA)   16                BFS         3581                     45          80        No
                                         PageRank    5591                     49          114
This work [9] (FPGA)   16                BFS         5                        0.9         6         Yes
                                         PageRank    20                       0.9         23
This work [9]          22                BFS         103                      0.17        605       Yes
                                         PageRank    386                      0.17        2275
5.3. Scalability analysis for core counts above 16

Figure 7 shows the performance scaling of BFS traversal on soc-wiki beyond 16 cores, using a 256-core configuration of Mempool. Up to 16 cores, the analytical model predicts performance with an accuracy of 89%. However, beyond 16 cores, the accuracy drops because the model does not account for increasing synchronization delays and contention in shared resources. As core counts increase, these factors introduce additional latency, which impacts performance and the accuracy of the model. At 256 cores, the accuracy drops to 21%. Improving the model to better capture these effects could enhance its accuracy at higher core counts.
Figure 7. Comparison of performance predicted for BFS traversal of soc-wiki on core counts higher than 16, up to 256 cores
6. RELATED WORK

In this section, we review prior efforts in accelerating graph algorithms across various platforms, including many-core CPUs, GPUs, and FPGAs. We examine the usability of analytical models for performance prediction, highlighting their potential to guide architecture and algorithm design decisions.
6.1. Many-core, central processing unit-based accelerators

Eyerman et al. [11] explored the scalability of many-core architectures for graph processing using OpenMP-based implementations from the GAP benchmark suite [12]. While their work examined x86-based systems, our research targets low-power RISC-V architectures. Domain-specific languages (DSLs) such as GraphIt [10] have also been developed to map graph algorithms onto diverse platforms, including multi-core CPUs, GPUs, and RISC-V-based many-core systems.
6.2. Graphics processing unit-based accelerators

GPUs have also been used to accelerate graph applications. GraphBLAST [2] is a well-known linear-algebra framework based on GraphBLAS [13]. GraphBLAST implements partitioning schemes that follow either vertex-centric or edge-centric partitioning.
6.3. Field-programmable gate array-based accelerators

Although FPGA-based accelerators require high design effort, often limiting their practicality, there have been efforts to accelerate graph processing using these platforms. Early frameworks like GraphGen [14] introduced a vertex-centric approach to utilize FPGAs for graph computations. FPGP [15] advanced this with an interval-shard structure, enabling flexibility across various graph algorithms without requiring complete reimplementation. ForeGraph [16] leveraged BRAM resources across multiple FPGA boards, while FabGraph [17] further optimized performance by introducing a two-level caching mechanism.

To address the challenges of high design complexity, recent frameworks have focused on simplifying implementation. HitGraph [18] used vertical partitioning to increase partition size, though preprocessing overheads from edge sorting remained a limitation. AccuGraph [19] addressed data conflicts with a parallel conflict resolver but still relied on edge sorting during partitioning. High-level synthesis (HLS)-based solutions, such as ThunderGP [5], utilized the gather-apply-scatter (GAS) model to improve efficiency. Frameworks like GraphLily [3], designed for HBM-equipped FPGAs, and GraphSoC [20], which introduced custom graph instructions, further reduced design complexity and bitstream generation times.
6.4. Usability of performance models

GAIL [1] is one of the early works to analytically model graphs by examining the number of edges traversed, memory accesses, and access times. The graph algorithm iron law suggests that implementation or algorithm improvements are better evaluated empirically than analytically. However, our work uses an analytical approach to provide insights into implementation-specific improvements. Chhugani et al. [21] developed a scalable BFS for modern CPUs, using dynamic partitioning based on cache size to improve traversal speed and locality. They also introduced a performance model that achieved 85-90% accuracy in evaluating traversal phases. Verstraaten et al. [22] highlighted the importance of graph structure in determining the performance of different strategies during PR iterations. Similarly, our work considers graph structure as a key factor in performance analysis. Verstraaten et al. [23] also modeled graph application performance on GPUs using a data-driven approach. Although dynamic scheduling reduced the analytical model's accuracy to less than 50%, the use of large datasets and binary decision trees improved prediction accuracy. Their analysis helped identify the best implementation strategies for BFS traversal. Although such works accelerate and model graph applications on various parallel architectures, they do not develop an analytical model that aids in arriving at an optimal design on the basis of graph properties.
7. CONCLUSION AND FUTURE WORK

In this work, we developed an analytical model resembling the RISC-V-based many-core architecture for performance estimation of graph algorithms such as BFS and PR. This model predicts performance based on the graph structure, even before executing the workload on the many-core. By providing insights into the expected performance, the model enables informed decision-making regarding the optimal number of cores when deploying RISC-V many-cores on FPGAs for parallel graph processing. The adaptability of the model extends beyond RISC-V many-cores, as it can be tailored to other parallel many-core architectures by updating the architectural parameters. The analytical model shows a strong correlation with cycle-accurate RTL simulations of the RISC-V-based many-core system, with a maximum prediction error of 11% observed on real-world graphs from the network repository across various core counts. The target RISC-V many-core architecture delivers comparable MTEPS/W on FPGAs and achieves up to 20× better MTEPS/W on a 22 nm technology node compared to an FPGA-based graph accelerator implemented on a 16 nm process. Future work involves validating the versatility of the analytical model using different architectures, extending support for additional graph algorithms, and exploring higher core counts and larger graph sizes. This work highlights the potential of analytical modeling for guiding hardware design in graph processing.
ACKNOWLEDGMENTS

We thank Rajesh Vivekanandham and Venkatraman Ramakrishnan for their guidance and support as industry liaisons. Their input was helpful in aligning our work with industry perspectives.
FUNDING INFORMATION

This work is supported by the Semiconductor Research Corporation (SRC) as part of the Indian Research Program (Project ID 3055.001).
AUTHOR CONTRIBUTIONS STATEMENT

This journal uses the Contributor Roles Taxonomy (CRediT) to recognize individual author contributions, reduce authorship disputes, and facilitate collaboration.