International Journal of Reconfigurable and Embedded Systems (IJRES)
Vol. 14, No. 1, March 2025, pp. 1∼11
ISSN: 2089-4864, DOI: 10.11591/ijres.v14.i1.pp1-11
Implementing a very high-speed secure hash algorithm 3 accelerator based on PCI-express
Huu-Thuan Huynh, Tuan-Kiet Tran, Tan-Phat Dang
University of Science, Vietnam National University, Ho Chi Minh City, Vietnam
Article Info

Article history:
Received May 6, 2024
Revised Jul 25, 2024
Accepted Aug 12, 2024

Keywords:
Edge computing
Hardware accelerator
KECCAK
Peripheral component interconnect express
Secure hash algorithm 3

ABSTRACT
In this paper, a high-performance secure hash algorithm 3 (SHA-3) is proposed to handle massive amounts of data for applications such as edge computing, medical image encryption, and blockchain networks. This work not only focuses on the SHA-3 core as in previous works but also addresses the bottleneck phenomenon caused by transfer rates. Our proposed SHA-3 architecture serves as a hardware accelerator for personal computers (PC) connected via peripheral component interconnect express (PCIe), enhancing data transfer rates between the host PC and dedicated computation components like SHA-3. Additionally, the throughput of the SHA-3 core is enhanced based on two different proposals for the KECCAK-f algorithm: re-scheduled and sub-pipelined architectures. Multiple KECCAK-f instances are applied to maximize data transfer throughput. A configurable buffer in/out (BIO) is introduced to support all SHA-3 modes, which is suitable for devices that handle various hashing applications. The proposed SHA-3 architectures are implemented and tested on a DE10-Pro board featuring a Stratix 10 1SX280HU2F50E1VG FPGA and PCIe, achieving throughputs of up to 35.55 Gbps and 43.12 Gbps for the multiple-re-scheduled-KECCAK-f-based SHA-3 (MRS) and the multiple-sub-pipelined-KECCAK-f-based SHA-3 (MSS), respectively.
This is an open access article under the CC BY-SA license.
Corresponding Author:
Tan-Phat Dang
University of Science, Vietnam National University Ho Chi Minh City, Vietnam
Email: dtphat@hcmus.edu.vn
1. INTRODUCTION
The increasing demand for massive amounts of real-time data transformation and processing necessitates high-performance data servers. This process acquires data from various sources, such as remote devices via the Internet, and then transmits it to dedicated hardware like graphics processing units (GPU) or hardware accelerators through burst or streaming mechanisms. Peripheral component interconnect express (PCIe) has been utilized to enhance the data transfer rate to dedicated hardware [1], [2]. Operating on a point-to-point topology, PCIe enables devices to communicate directly with other components without sharing bandwidth with other devices on the bus. PCIe utilizes multiple independent lanes, ranging from one to 32, for data transfer between the host and the end device. Each lane comprises two pairs of differential signaling wires, one for transmitting data (Tx) and one for receiving data (Rx). The PCIe operation speed can therefore range from 2.5 GT/s to 32 GT/s for Gen1 to Gen5, respectively. Furthermore, the implementation of direct memory access (DMA) for PCIe, which eliminates central processing unit (CPU) intervention, has appeared in earlier works [1], [3]; it involves transferring data from the main memory of the host device to a temporary DMA local register before sending it to the address of the end device.

Journal homepage: http://ijres.iaescore.com
Regarding dedicated computing hardware, field programmable gate array (FPGA)-based hardware accelerators for cryptography have been attractive in research domains [4], [5], because of the increasing need for robust cryptographic algorithms to secure sensitive information and communications. Cryptographic hash functions play a fundamental role in ensuring integrity [6] and authenticity [7], [8]. In recent years, one such hash function that has garnered significant attention and adoption is SHA-3. Standardized by the National Institute of Standards and Technology (NIST) in 2015, SHA-3 represents the latest iteration in the secure hash algorithm (SHA) family [9]. Unlike its predecessor SHA-1, which has been susceptible to vulnerabilities and collision attacks [10], SHA-3 offers enhanced security properties and resistance to known cryptographic attacks. SHA-3 is designed to produce fixed-size hash values, or message digests, from input data of arbitrary length.

In high-performance applications, SHA-3 is utilized more and more frequently. For multimedia data, such as image encryption, the SHA-3 algorithm is employed to generate key streams from multiple blocks that are divided from the original images [11], [12]. In secure channels, the transmission of medical data and high-definition images between doctors and patients often necessitates hashing to prevent malicious modification, which is a crucial requirement in the medical field [13], [14]. Moreover, with the increase of internet of things (IoT) devices, the adoption of edge and fog computing has become increasingly common [15]. This trend has led to the emergence of high-performance devices optimized for processing speed, with a particular focus on security, including hash function algorithms [16], [17]. Consequently, there is a growing demand to enhance the performance of cryptographic algorithms to protect the vast amounts of data transmitted between these devices.
On the other hand, hash functions like SHA-3 play a crucial role in blockchain technology, ensuring the integrity, security, and transparency of distributed ledger systems. The hash function helps maintain transaction integrity based on the Merkle tree structure [18]. Notably, miners are tasked with validating transactions under consensus mechanisms such as proof of work (PoW) [19]. To be eligible for rewards, miners must quickly generate nonces, underscoring the need for a high-performance hash function [20].
To enhance the performance of SHA-3, various research has been conducted, ranging from software optimizations on GPUs to hardware accelerators [21]-[29]. The efficiency of implementing SHA-3 in a GPU environment has been demonstrated in [21]: Parallel Thread eXecution (PTX) is utilized to leverage the parallel permutation capabilities of the SPONGE construction, and compute unified device architecture (CUDA) streams are employed to enable GPUs to receive and compute data simultaneously. Moreover, hardware accelerators for SHA-3 on FPGA are more attractive than GPU implementations due to reduced technology dependence. A significant number of works aim to enhance the throughput and efficiency of SHA-3 through unrolling, pipelined, and sub-pipelined techniques and optimization of the arithmetic core (KECCAK-f) [23]-[29]. Unlike other works using a hardware description language (HDL), the work in [22] uses the open computing language (OpenCL) to implement SHA-3 as a co-processor on FPGA to demonstrate the efficiency of the hardware implementation of SHA-3.
In this paper, we adopt an FPGA-based hardware design approach for implementing the SHA-3 algorithm using the Verilog language. This choice is motivated by the fact that SHA-3 computations mainly involve permutations using XOR, AND, and NOT gates, as well as inherent parallel processing capabilities. Unlike previous works [23]-[29] that primarily focus on the KECCAK-f function, we also address other core components of the SHA-3 algorithm, namely the buffers in and out. Moreover, high-performance applications not only demand high-speed dedicated hardware but also require efficient data transmission. To address this requirement, we utilize PCIe, which enables high-throughput communication and leverages the computational power of the PC for data setup and management via software. The key contributions of our proposed methods are as follows:
- We present our SHA-3 design implemented on FPGA as a hardware accelerator for a PC via PCIe links. DMA read and write are used to accelerate the data transfer rate without CPU intervention. In addition, ping-pong memory enables simultaneous computation and data transmission between the PC and our SHA-3 accelerator, thereby maximizing the parallel processing capabilities of SHA-3.
- To support various applications, we introduce configurable buffers that are flexible enough to switch between modes and minimize buffer usage for both input and output data while maintaining flexibility and efficiency.
- Multiple KECCAK-f instances are introduced to maximize performance. Two architectures for KECCAK-f, the re-scheduled and sub-pipelined architectures, are presented, contributing to overall performance enhancement in our SHA-3 design.
Int J Reconfigurable & Embedded Syst, Vol. 14, No. 1, March 2025: 1–11
The remaining sections of the paper are organized as follows. Section 2 provides background information on the SHA-3 algorithm. Our hardware design is comprehensively analyzed in section 3, covering the model connecting the SHA-3 accelerator to the PC through PCIe, the configurable buffers, and the multiple KECCAK-f architecture. Evaluation and comparison of our results with other approaches are presented in section 4. Finally, section 5 concludes the paper.
2. SHA-3 PRELIMINARY
The construction of SHA-3 differs from the Merkle–Damgård design in SHA-1 and SHA-2, instead adopting the SPONGE construction [9], which comprises two main phases, absorbing and squeezing, as shown in Figure 1. Prior to the absorbing and squeezing phases, the input message m of arbitrary length undergoes a padding process. This ensures that the message is expanded to a multiple of r bits (1152, 1088, 832, or 576 bits) by appending the pattern "10*1". However, the SHA-3 hash function requires that the message m first append the suffix "01" to support domain separation [?]. Consequently, the pattern "0110*1" is appended to message m, as illustrated in Figure 1.

During the absorbing phase, the padded message m is partitioned into several blocks of size r. Each r-sized block is then combined with a capacity c to form a 1600-bit block, which is subsequently processed sequentially by each KECCAK-f function until all blocks are processed. In the squeezing phase, the length of the output d (224, 256, 384, or 512 bits) varies depending on the selected mode.
[Figure 1 diagram: message m is padded with "0110*1", absorbed as r-bit blocks 0 to (n−1) through successive KECCAK-f (f) calls, and squeezed to produce output d. Legend: r = 1152/1088/832/576 bits and d = 224/256/384/512 bits for SHA3-224/256/384/512; c = b − r, where b = 1600 bits.]

Figure 1. Padding and SPONGE construction
The KECCAK-f function is a fundamental component of SHA-3 and is utilized in both the absorbing and squeezing phases. The input message is converted into a three-dimensional array, indexed by x, y, and z, forming a 5 × 5 × 64 state array. The KECCAK-f function operates on this state array over 24 rounds of a round function (Rnd), with each round consisting of five step mappings: θ (theta), ρ (rho), π (pi), χ (chi), and ι (iota).
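As a behavioral reference (not the paper's Verilog design), the SPONGE construction and the five step mappings can be sketched in Python. The rotation offsets and round constants below come from the public KECCAK specification, and the result is checked against Python's built-in hashlib:

```python
import hashlib

# Rotation offsets RHO[x][y] and round constants RC from the KECCAK specification
RHO = [[0, 36, 3, 41, 18], [1, 44, 10, 45, 2], [62, 6, 43, 15, 61],
       [28, 55, 25, 21, 56], [27, 20, 39, 8, 14]]
RC = [0x0000000000000001, 0x0000000000008082, 0x800000000000808A, 0x8000000080008000,
      0x000000000000808B, 0x0000000080000001, 0x8000000080008081, 0x8000000000008009,
      0x000000000000008A, 0x0000000000000088, 0x0000000080008009, 0x000000008000000A,
      0x000000008000808B, 0x800000000000008B, 0x8000000000008089, 0x8000000000008003,
      0x8000000000008002, 0x8000000000000080, 0x000000000000800A, 0x800000008000000A,
      0x8000000080008081, 0x8000000000008080, 0x0000000080000001, 0x8000000080008008]

def rot(v, n):  # 64-bit left rotation
    return ((v << n) | (v >> (64 - n))) & 0xFFFFFFFFFFFFFFFF

def keccak_f(A):  # 24 rounds of Rnd on the 5x5 array of 64-bit lanes
    for rnd in range(24):
        # theta: column parities C, then D, XORed into every lane
        C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
        D = [C[(x - 1) % 5] ^ rot(C[(x + 1) % 5], 1) for x in range(5)]
        A = [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]
        # rho (lane rotations) and pi (lane permutation): pure rewiring in hardware
        B = [[0] * 5 for _ in range(5)]
        for x in range(5):
            for y in range(5):
                B[y][(2 * x + 3 * y) % 5] = rot(A[x][y], RHO[x][y])
        # chi (the only non-linear step) and iota (round-constant injection)
        A = [[B[x][y] ^ (~B[(x + 1) % 5][y] & B[(x + 2) % 5][y]) for y in range(5)]
             for x in range(5)]
        A[0][0] ^= RC[rnd]
    return A

def sha3_256(msg: bytes) -> bytes:
    r = 1088 // 8  # rate in bytes for SHA3-256
    q = r - len(msg) % r  # "0110*1" padding, byte-aligned: 0x06 ... 0x80
    msg += b'\x86' if q == 1 else b'\x06' + b'\x00' * (q - 2) + b'\x80'
    S = [[0] * 5 for _ in range(5)]
    for off in range(0, len(msg), r):  # absorbing phase
        for i in range(r // 8):
            S[i % 5][i // 5] ^= int.from_bytes(msg[off + 8 * i:off + 8 * i + 8], 'little')
        S = keccak_f(S)
    # squeezing phase: one pass suffices for a 256-bit digest
    return b''.join(S[i % 5][i // 5].to_bytes(8, 'little') for i in range(4))

assert sha3_256(b"abc") == hashlib.sha3_256(b"abc").digest()
```

The other modes differ only in the rate r and the digest length d, exactly as in the legend of Figure 1.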
3. DESIGN AND IMPLEMENTATION
In this section, we provide an overview of the proposed SHA-3 architecture at the system level. We analyze the comprehensive data flow, showing the interaction with each component within the system. Next, we discuss the implementation of configurable buffers to support multiple modes. Lastly, our re-scheduled and sub-pipelined techniques for the KECCAK-f function are introduced in detail.
3.1. Overview architecture
The system-level overall architecture and data flow for the SHA-3 accelerator are described in Figure 2. The system architecture of the proposed SHA-3 accelerator is depicted in Figure 2(a). The PC serves as a host server, receiving requests from various remote devices and responding with the results. Requests related to the hash function are transmitted to the SHA-3 accelerator via PCIe. In this work, we utilize the Intel intellectual property (IP) named Intel L/H-Tile Avalon-MM for PCIe on the DE10-Pro device [?]. Specifically, PCIe Gen 3x8 is employed, operating at a frequency of 250 MHz and allowing a throughput of up to 63 Gbps. Additionally, this IP serves as a bridge, converting the PCIe protocol to the Avalon bus.
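The 63 Gbps figure is consistent with the raw PCIe Gen 3 x8 link parameters (8 GT/s per lane with 128b/130b line encoding); a quick sanity check:

```python
lanes = 8
gt_per_s = 8e9          # PCIe Gen 3 signaling rate per lane (8 GT/s)
encoding = 128 / 130    # 128b/130b line-encoding efficiency
link_gbps = lanes * gt_per_s * encoding / 1e9
assert abs(link_gbps - 63.0) < 0.1  # ~63 Gbps, matching the Intel IP figure
```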
Before the SHA-3 accelerator begins operation, essential information, such as the size of the processed string, is transferred to its Control/Status Registers block. This is performed through a base address register (BAR) with 32-bit non-prefetchable memory. To enable high-performance transmission, a DMA engine is employed along with separate read and write data modules. Additionally, two random-access memories (RAM) operate in a ping-pong manner for both input and output data. After the hash computation is completed, the hash values cannot immediately be sent to the PC; they must wait for a request from the PC. Therefore, a ping-pong RAM Out is utilized to temporarily store the hash values. The ping-pong scheme ensures that the dedicated hardware accelerator remains fully utilized, minimizing any idle time. Two clock domains are utilized in this system. The first clock operates at a frequency of 250 MHz for the PCIe IP, while the second clock operates at the frequency of the SHA-3 accelerator. This configuration optimizes the throughput of each domain to accelerate the entire system.
The proposed SHA-3 accelerator comprises three main components: padding, buffers consisting of buffer in (BI) and buffer out (BO), and multiple KECCAK-f units. The padding and BI operations are executed concurrently. Once BI has serially accumulated sufficient data from RAM In 0/1, the output of this buffer is combined in parallel with the data output from the padding unit through an OR operation. This data is then fed into the multiple KECCAK-f units, which process multiple data blocks simultaneously. The hash values generated by the multiple KECCAK-f units are transferred to BO in parallel. Subsequently, BO serially writes the data to RAM Out. The data, comprising multiple short and long messages intended for SHA-3 processing, is stored in the system memory of the PC. Under CPU control using Terasic's PCIe driver, this data is continuously transferred from the system memory to the SHA-3 accelerator.
In cases where the data size exceeds the capacity of RAM In 0/1, the ping-pong mechanism comes into play. As depicted in Figure 2(b), long data is initially transferred from the system memory to RAM In 0, and subsequently to RAM In 1. Once RAM In 0 is filled with Data 0, the SHA-3 computes the hash value and temporarily stores the results in RAM Out 0. Concurrently, the SHA-3 initiates processing of Data 1 from RAM In 1. Once all results of Data 0 are available in RAM Out 0, they are read using DMA read and returned to the PC's system memory. The result of Data 1 is transferred from RAM Out 1 to the system memory once the DMA read for Data 0 is completed. Thus, the ping-pong approach for RAM In and RAM Out facilitates pipeline processing at the system level for increased performance.
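A back-of-the-envelope model (with hypothetical block counts and per-block times, not measured values) illustrates why the double-buffered scheme keeps the accelerator busy:

```python
def single_buffer_time(n_blocks, t_dma, t_hash):
    # one RAM In: each block must be transferred, then hashed, strictly in turn
    return n_blocks * (t_dma + t_hash)

def ping_pong_time(n_blocks, t_dma, t_hash):
    # two RAM Ins: after the first fill, DMA transfer overlaps hashing,
    # so steady-state cost per block is bounded by the slower stage
    return t_dma + (n_blocks - 1) * max(t_dma, t_hash) + t_hash

# e.g. 8 blocks, 2 time units to transfer and 3 to hash each block
assert single_buffer_time(8, 2, 3) == 40
assert ping_pong_time(8, 2, 3) == 26
```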
Figure 2(c) illustrates the three stages for data transfers and hashing computation. In stage 0, the PCIe link connected to the PC retrieves data from the system memory, operating at a speed of 63 Gbps according to Intel specifications [?]. Moving on to stage 1, the PCIe IP utilizes the Avalon-MM master to transfer data to RAM In using the DMA technique. The DMA process supports burst transfers on a 256-bit interface width at a frequency of 250 MHz. While the throughput of DMA can reach up to 64 Gbps, the burst count is limited to 5 bits. Consequently, when the data size exceeds 8192 bits, the throughput of DMA becomes unstable. Our experiments show that the throughput of the DMA stage fluctuates between approximately 20 Gbps and 55 Gbps, thereby impacting the SHA-3 block. To mitigate this issue, ping-pong memory is employed for RAM In, where one memory receives data while the other provides data for computation. Hash computation is initiated only when one of the two memories is filled, ensuring that DMA does not affect stage 2. In addition, another crucial factor influencing DMA throughput is the size of each RAM In. Small sizes can lead to unstable DMA throughput, while large sizes may result in redundancy. Based on experimental results, a size of 10 KB for each RAM In 0/1 is chosen, as detailed in section 4. Stage 2 relies on the SHA-3 computation rate, which can run up to 43 Gbps.
The detailed SHA-3 architecture is shown in Figure 3, progressing from a coarse to a fine level of granularity. Specifically, the two primary components, the buffer and the multiple KECCAK-f units illustrated in Figure 3(a), are explained in more depth in the subsequent subsections. Additionally, the two optimized KECCAK-f architectures, namely re-scheduled and sub-pipelined, which have the greatest impact on the throughput of the overall design, are depicted in Figure 3(b). Furthermore, a more detailed explanation of the θ and (ρ-π-χ-ι) stages is provided in Figure 3(c), offering deeper insight into the micro-architecture of the proposed design, which will also be elaborated on in the subsequent subsection.
[Figure 2 diagram: the PC (software library, Terasic's PCIe driver, CPU, system memory) connects through PCIe to the FPGA, where the Intel L/H-Tile Avalon-MM for PCI Express IP, the DMA engine (read/write), the BAR Control/Status Registers, the ping-pong RAM In 0/1 and RAM Out 0/1, and the SHA-3 core (padding, buffer, multiple KECCAK-f) communicate over a 256-bit Avalon bus across clock domains 0 and 1. The timing chart shows Data 0/1 moving through DMA write, SHA-3, and DMA read in the three pipelined stages 0-2.]

Figure 2. Overall architecture and data flow for the SHA-3 accelerator at the system level: (a) system architecture, (b) data flow at the system level, and (c) the three stages involved in transferring data from the system memory to the SHA-3 accelerator
Figure 3. The proposed SHA-3 architecture in detail: (a) the buffer and multiple KECCAK-f architectures, (b) the two Rnd architectures (re-scheduled and sub-pipelined), and (c) the θ and (ρ-π-χ-ι) architectures
3.2. Buffer
To facilitate flexibility in handling different modes, we introduce a configurable buffer capable of switching between BI and BO based on the selected mode. Each mode needs a distinct block size r, as illustrated in Figure 1. For SHA3-224 mode, with a data width of 256 bits, the input block size is 1152 bits, corresponding to five BIs. Similarly, for the SHA3-256/384/512 modes, the required number of BIs is 5/4/3, respectively. Conversely, the output size for the SHA3-224/256/384/512 modes on the 256-bit base is 1/1/2/2 BOs. To optimize buffer utilization, we propose four BIs (BI 0 to BI 3), one BO, and one BIO, as depicted in Figure 3(a). The buffers cascade into each other, with the output of the preceding buffer serving as the input for the subsequent one. BIs only accept new input when the valid in signal is activated; otherwise, they retain the current value. BIO exhibits slightly more complexity than BI, as it can receive data from the preceding buffer when the sel signal is triggered; otherwise, it functions as a BO, receiving hash values from the multiple KECCAK-f units. BO is responsible for retrieving hash values and serially pushing them to RAM Out. For the SHA3-224/256/384/512 modes, it takes 1/1/2/2 clock cycles to complete writing data to RAM Out. To streamline complexity, the receiving process in BI takes 4/4/5/5 clock cycles for the SHA3-224/256/384/512 modes, respectively. Each data word loaded into BI requires one clock cycle. Therefore, if a mode does not provide sufficient data within those clock cycles, zero inputs are inserted. For example, in the case of SHA3-512, which requires three blocks of 256 bits, the subsequent two blocks consist of zeros.
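The per-mode buffer counts quoted above follow directly from dividing the block size r and the digest size d by the 256-bit data path; a quick check (mode names and widths as given in the text):

```python
import math

WIDTH = 256  # data-path width in bits
R = {"SHA3-224": 1152, "SHA3-256": 1088, "SHA3-384": 832, "SHA3-512": 576}
D = {"SHA3-224": 224, "SHA3-256": 256, "SHA3-384": 384, "SHA3-512": 512}

bis = {m: math.ceil(r / WIDTH) for m, r in R.items()}  # input buffers per block
bos = {m: math.ceil(d / WIDTH) for m, d in D.items()}  # output words per digest

assert [bis[m] for m in R] == [5, 5, 4, 3]  # BIs: 5/5/4/3
assert [bos[m] for m in D] == [1, 1, 2, 2]  # BOs: 1/1/2/2
```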
3.3. Multiple KECCAK-f
The multiple KECCAK-f module comprises a mapping block and three KECCAK-f instances, as illustrated in Figure 3(a). The 1152-bit data from the preceding phase is fed into the mapping block, which appends zeros to expand it to a 1600-bit data size. In our design, we opt for three KECCAK-f instances to reduce the interval between input data to 8 clock cycles. The output of each KECCAK-f instance is 512 bits in size, and depending on the selected mode, truncation is applied to the output data.
In this work, we introduce two architectures for KECCAK-f, the re-scheduled and sub-pipelined architectures, depicted in Figure 3(b). In a conventional architecture, the sequence of steps is θ-ρ-π-χ-ι, with a register placed at the end of the ι step to indicate the completion of one round [?]. Our re-scheduled architecture reorders these steps to ρ-π-χ-ι-θ by inserting a register between the θ and ρ steps. As a result, the re-scheduled architecture requires 25 repetitions to complete the hash value, one more than the base architecture. During the first repetition, only the θ step is executed, while the remaining repetitions execute all steps in the sequence ρ-π-χ-ι-θ. The re-scheduled architecture offers higher efficiency than the conventional architecture. This is proven via synthesis results on the Stratix 10 device, revealing that the re-scheduled architecture achieves a frequency of 336.36 MHz, surpassing the conventional architecture's frequency of 321.85 MHz by 4.31%. Moreover, the re-scheduled architecture utilizes fewer resources, with a reduction in adaptive logic module (ALM) utilization of 16.67% (4214 ALMs compared to 5057 ALMs in the conventional architecture).
Unlike previous works [28], [29], where the sub-pipelined technique typically inserts two registers (one between the π and χ steps or between the θ and ρ steps, and another at the end of the ι step), our sub-pipelined architecture uses a register between the θ and ρ steps and another register before the θ step. This decision is based on the observation that the critical path of the θ step is longer than that of the ρ, π, χ, and ι steps. Specifically, the θ step requires at least four XOR gate levels to complete, while the remaining steps need only an AND and two XOR gates, as shown in Figure 3(c). By isolating the θ step, we aim to improve the delay of KECCAK-f. However, adding the register inside the round increases the number of clock cycles required, doubling it to 48 clock cycles. To mitigate this increase, our design is capable of handling two data blocks simultaneously at two different stages. For example, if data 1 is processed in the θ stage, data 2 is processed in the (ρ-π-χ-ι) stage. In the next clock cycle, data 1 moves to the (ρ-π-χ-ι) stage while data 2 transitions to the θ stage. Thus, the average time to generate one hash value is reduced to 24 clock cycles. The advantage of our sub-pipelined architecture is that it increases the frequency while maintaining a fixed number of clock cycles at 24, thereby increasing throughput.
In both the re-scheduled and sub-pipelined architectures, the five steps are consistently grouped into two parts: θ and (ρ-π-χ-ι). The formulation of the θ step is optimized by combining C[x] and D[x], as indicated by the red area in the θ part of Figure 3(c) and denoted as CD[x] in (1). CD[x] serves as the shared element utilized by A[x, y], and two levels of XOR operation are employed to reduce the delay of the θ step.
CD[x] = A[x−1, 0] ⊕ A[x−1, 1] ⊕ A[x−1, 2] ⊕ A[x−1, 3] ⊕ A[x−1, 4]
      ⊕ ROT(A[x+1, 0], 1) ⊕ ROT(A[x+1, 1], 1) ⊕ ROT(A[x+1, 2], 1)
      ⊕ ROT(A[x+1, 3], 1) ⊕ ROT(A[x+1, 4], 1)

A[x, y] = A[x, y] ⊕ CD[x]    (1)
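Because XOR and rotation distribute over each other, the merged CD[x] form in (1) is equivalent to the textbook two-step θ (column parities C[x], then D[x]); a small sketch verifying this on a random state:

```python
import random

MASK = (1 << 64) - 1
def rot(v, n):  # 64-bit left rotation
    return ((v << n) | (v >> (64 - n))) & MASK

def theta_two_step(A):
    # textbook theta: C[x] = column parity, D[x] = C[x-1] ^ ROT(C[x+1], 1)
    C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
    D = [C[(x - 1) % 5] ^ rot(C[(x + 1) % 5], 1) for x in range(5)]
    return [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]

def theta_merged(A):
    # merged form of (1): CD[x] built directly from the ten input lanes
    CD = [0] * 5
    for x in range(5):
        for y in range(5):
            CD[x] ^= A[(x - 1) % 5][y] ^ rot(A[(x + 1) % 5][y], 1)
    return [[A[x][y] ^ CD[x] for y in range(5)] for x in range(5)]

random.seed(0)
A = [[random.getrandbits(64) for _ in range(5)] for _ in range(5)]
assert theta_two_step(A) == theta_merged(A)
```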
The hardware implementation of the (ρ + π) steps utilizes a net connection, which requires no additional resources or delay, based on the combination of (ρ + π) steps illustrated in [?]. Furthermore, the combination of the (ρ-π-χ-ι) steps is depicted in Figure 3(c). Unlike previous works [?], which utilized the full 64-bit RC, we have simplified this process by storing only the non-zero bits of RC. Therefore, only the bit positions 0, 1, 3, 7, 15, 31, and 63 are stored, effectively reducing resource usage.
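That every KECCAK round constant is non-zero only at bit positions of the form 2^j − 1 can be verified directly against the published constants; a sketch (RC values taken from the KECCAK specification):

```python
RC = [0x0000000000000001, 0x0000000000008082, 0x800000000000808A, 0x8000000080008000,
      0x000000000000808B, 0x0000000080000001, 0x8000000080008081, 0x8000000000008009,
      0x000000000000008A, 0x0000000000000088, 0x0000000080008009, 0x000000008000000A,
      0x000000008000808B, 0x800000000000008B, 0x8000000000008089, 0x8000000000008003,
      0x8000000000008002, 0x8000000000000080, 0x000000000000800A, 0x800000008000000A,
      0x8000000080008081, 0x8000000000008080, 0x0000000080000001, 0x8000000080008008]

POS = [0, 1, 3, 7, 15, 31, 63]            # the only positions that may be set
mask = sum(1 << p for p in POS)
assert all(rc & ~mask == 0 for rc in RC)  # no RC sets a bit outside POS

# so each 64-bit constant compresses to 7 stored bits
compact = [sum(((rc >> p) & 1) << i for i, p in enumerate(POS)) for rc in RC]
assert all(c < 2 ** 7 for c in compact)
```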
4. EVALUATION AND COMPARISON
This section presents the performance evaluation of MRS and MSS on the DE10-Pro, considering several factors that impact the throughput, such as the size of each RAM In in the ping-pong scheme and the number of KECCAK-f instances. Furthermore, a comparison of KECCAK-f computation with previous works [23], [24] is conducted on a Virtex 7 to indicate the advantages and limitations of our multiple-re-scheduled-based KECCAK-f (MRK) and multiple-sub-pipelined-based KECCAK-f (MSK) architectures.
4.1. Performance evaluation on DE10-Pro
The SHA-3 hardware accelerator, utilizing the DE10-Pro device, is connected to the PC (Intel® Core™ i5-10400, 2.9 GHz) via PCIe Gen 3x8, as illustrated in Figure 2(a). This setup allows for functional testing and performance evaluations. On the PC side, C code manages the transfer of data to RAM In 0/1. Each DMA operation fills one RAM In slot, with subsequent transfers populating the remaining slots in a ping-pong manner. Subsequently, the PC promptly reads the hash values from RAM Out and compares them with golden data to verify their accuracy.
The experimental results of the two proposed architectures, MRS and MSS, in relation to factors such as throughput, RAM size, and the number of KECCAK-f units across all modes, are presented in the line charts in Figure 4. These charts visually represent the optimal configurations, highlighting the correlation between these key factors and their impact on the overall performance of the SHA-3 accelerator. Specifically, Figure 4(a) illustrates the relationship between throughput and various RAM In sizes, while Figure 4(c) shows the relationship between throughput and the number of KECCAK-f units. Additionally, Figure 4(b) provides a detailed view of the data flow, contributing to the analysis of bottlenecks and serving as a basis for determining the optimal number of KECCAK-f units for an efficient configuration.
The rate at which data is supplied plays a crucial role in determining the performance of the hardware accelerator. Specifically, in our system, which employs a ping-pong mechanism, the size of RAM In directly influences performance, as mentioned in subsection 3.1. As depicted in Figure 4(a), the relationship between SHA-3 performance and RAM In size was examined across all modes for both the MRS and MSS architectures. The MRS and MSS throughput experiences a notable increase within the range of 1 to 8 KB, followed by a gradual rise from 12 to 20 KB. Beyond this point, the throughput saturates for all modes in both architectures. Consequently, a RAM In size of 20 KB (the size of each RAM In 0/1 is 10 KB) was selected, striking a balance between maximizing MRS and MSS throughput and minimizing resource utilization.
Our proposed SHA-3 architecture employs multiple KECCAK-f instances to enhance throughput. However, if too many KECCAK-f instances are utilized, a bottleneck phenomenon arises when the multiple KECCAK-f instances operate faster than the preceding parts, resulting in resource redundancy. Conversely, the architecture becomes inefficient when only a small number of KECCAK-f instances is used. The selection of the number of KECCAK-f instances considers various factors, including the preceding block architecture and the algorithmic characteristics of KECCAK-f. As illustrated in Figure 4(b), stage 0 (BI + padding) necessitates a maximum of nine clock cycles, comprising five cycles for processing data in the worst-case scenarios of the SHA3-384/512 modes, three clock cycles for overhead, and one clock cycle for waiting for the ready signal from the multiple KECCAK-f block. Moreover, given that the KECCAK-f algorithm requires 24 repetitions for one digest value, stage 1 in Figure 4(b) must be completed within approximately eight clock cycles for optimal efficiency. Consequently, three KECCAK-f instances are chosen for the multiple KECCAK-f block. This relationship is further clarified in Figure 4(c), revealing the increasing throughput of the MRS and MSS architectures with one to three KECCAK-f instances. Beyond this range, however, the MRS and MSS throughput saturates with four KECCAK-f instances. Thus, the optimal number of KECCAK-f instances is determined to be three.
Our proposed architectures are evaluated based on throughput and efficiency. Throughput (TP), measured in Gbps, is calculated using (2), where #bit represents the number of bits of input data, Fmax denotes the maximum frequency obtained from synthesis results, and #clock indicates the number of clock cycles elapsed. Efficiency (Eff.), on the other hand, is determined by the ratio of throughput to the utilized resources, such as ALMs for Intel devices or slices for Xilinx devices. Equation (3) illustrates the calculation of efficiency based on throughput and resource utilization. The evaluation process involves the use of the Intel® Quartus® Prime Pro Edition Design Software Version 19.1 to obtain reports on frequency and resource utilization.

TP = (#bit × Fmax) / #clock    (2)
Eff. = TP / Area    (3)
The
throughput
measurement
results
of
the
tw
o
architectures,
MRS
and
MSS,
on
DE10-Pro
are
pre-
sented
in
T
able
??
.
T
o
determine
the
number
of
clock
c
ycles
(
#
clock),
each
architecture
is
equipped
with
a
counter
to
record
the
elapsed
clocks,
starting
immediately
when
the
core
be
gins
operation
and
stopping
upon
completion
of
the
process.
The
data
size
for
throughput
m
easurement
of
the
MRS
and
MSS
architectures
is
tested
up
to
32
KB.
Combining the data length with the operating frequencies of MRS and MSS, which are 280 MHz and 380 MHz, respectively, we compute the throughput for each mode. Specifically, the throughput for MRS is 35.55 Gbps, 33.60 Gbps, 27.69 Gbps, and 19.23 Gbps for the SHA3-224, SHA3-256, SHA3-384, and SHA3-512 modes, respectively. Similarly, for MSS, the throughput is 43.12 Gbps, 41.20 Gbps, 36.27 Gbps, and 25.11 Gbps for the SHA3-224, SHA3-256, SHA3-384, and SHA3-512 modes, respectively.
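The mode-to-mode throughput drop has a structural cause: FIPS 202 fixes each SHA-3 mode's capacity at twice the digest length, so the sponge rate (message bits absorbed per KECCAK-f permutation) shrinks as the digest grows. The short sketch below derives the rates with the standard library:

```python
import hashlib

# FIPS 202: capacity = 2 x digest bits, rate = 1600 - capacity, so larger
# digests absorb fewer message bits per KECCAK-f call -- which is why the
# SHA3-512 mode shows the lowest throughput of the four.
for name in ("sha3_224", "sha3_256", "sha3_384", "sha3_512"):
    digest_bits = hashlib.new(name).digest_size * 8
    rate = 1600 - 2 * digest_bits  # rates: 1152, 1088, 832, 576 bits
    print(f"{name}: rate = {rate} bits per permutation")
```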
The resources utilized by MRS and MSS are obtained from the Quartus tool, with MRS utilizing 8273 ALMs and 8374 registers, while MSS utilizes 9485 ALMs and 12832 registers. As a result, the efficiencies of MRS and MSS for all modes are as follows: 4.30 Mbps/ALM, 4.06 Mbps/ALM, 3.35 Mbps/ALM, and 2.32 Mbps/ALM for MRS, and 4.55 Mbps/ALM, 4.34 Mbps/ALM, 3.82 Mbps/ALM, and 2.65 Mbps/ALM for MSS, respectively.
[Figure 4: panels (a)-(c); panel (b) shows a timing chart over stages 0-2 spanning RAM In, Buffer In, Padding, KECCAK-f 0-2, Buffer Out, RAM Out, and transfer overhead]
Figure 4. The experiment results of MRS and MSS across all modes in (a) the relationship between throughput and different RAM In sizes, (b) data flow timing chart of multiple KECCAK-f units, and (c) the relationship between throughput and the different numbers of KECCAK-f units
Table 1. The implementation results of the MRS and MSS architectures on DE10-Pro

Architecture   Freq. (MHz)   Area (ALM)   Reg.    TP (Gbps)                       Eff. (Mbps/ALM)
                                                  224 / 256 / 384 / 512           224 / 256 / 384 / 512
MRS            280           8273         8374    35.55 / 33.60 / 27.69 / 19.23   4.30 / 4.06 / 3.35 / 2.32
MSS            380           9485         12832   43.12 / 41.20 / 36.27 / 25.11   4.55 / 4.34 / 3.82 / 2.65
4.2. Comparative analysis
For a fair evaluation and comparison between our proposed architectures (MRK and MSK) and previous ones [23], [26], we synthesize the designs on a Virtex-7 XC7VX485T using the Vivado 2020 tool. Since the previous works [23], [26] focused solely on the KECCAK-f computation, Table 2 displays the synthesis results of the KECCAK-f computation only. Given that all modes utilize the same KECCAK-f computation architecture, Table 2 only presents the results for the SHA3-512 mode for comparison. Additionally, this facilitates comparison with the proposal in [?] because that design only supports the SHA3-512 mode.
Table 2. The comparison of KECCAK-f computation architectures between our proposals and FPGA-based works on Virtex 7

Reference            [23], 2022   [26], 2023              Our proposed architectures
Approach             Dual Rnd     Unrolling factor of 2   MRK       MSK
Fmax (MHz)           -            378.73                  380.95    485.67
Area (Slice)         1521         1375                    3203      2917
Register             -            -                       4831      9669
#clock/hash          12           12                      8         8
TP (Gbps)*           22.90        18.18                   27.43     34.97
Eff. (Mbps/slice)*   15.11        13.22                   8.56      11.99
* For SHA3-512 mode
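The throughput row can be reproduced from (2): for SHA3-512 the sponge rate is 1600 - 2 × 512 = 576 bits absorbed per hashed block, and the #clock/hash row gives the cycle count. The sketch below is our cross-check, not code from the paper.

```python
# Our cross-check of eq. (2) on the SHA3-512 results: rate r = 576 bits
# processed per hash, divided over #clock/hash cycles at Fmax.
def tp_gbps(rate_bits, fmax_mhz, clocks_per_hash):
    return rate_bits * fmax_mhz / clocks_per_hash / 1e3

print(round(tp_gbps(576, 380.95, 8), 2))   # MRK  -> 27.43
print(round(tp_gbps(576, 485.67, 8), 2))   # MSK  -> 34.97
print(round(tp_gbps(576, 378.73, 12), 2))  # [26] -> 18.18
```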
Our MRK architecture requires 3203 slices and 4831 registers, operating at a maximum frequency of 380.95 MHz and achieving a throughput of 27.43 Gbps and an efficiency of 8.56 Mbps/slice. Conversely, the MSK architecture, aimed at reducing the critical path of Rnd, utilizes more registers than MRK (9669 > 4831). However, MSK outperforms MRK in terms of both throughput and efficiency, achieving 34.97 Gbps and 11.99 Mbps/slice, respectively.
Sravani and Durai [23] proposed the dual Rnd architecture, which cascades one Rnd, consisting of five steps (θ-ρ-π-χ-ι), and its registers with another Rnd and register to halve the number of clock cycles (#clock/hash = 12), achieving a throughput of 22.90 Gbps. However, the throughputs of our two architectures, MRK and MSK, are 1.20 times (27.43 vs. 22.90) and 1.53 times (34.97 vs. 22.90) higher than that achieved by the dual Rnd architecture. While our architectures prioritize high performance, their efficiency is slightly lower compared to the dual Rnd architecture of Sravani and Durai [23], with MRK at 0.56 times (8.56 vs. 15.11) and MSK at 0.79 times (11.99 vs. 15.11). However, despite the lower efficiency, the throughput acceleration of our MSK architecture (53%) surpasses the efficiency advantage of their dual Rnd architecture (26%).
When comparing our proposals with that of Sideris et al. [26], who implemented an unrolling factor of 2 to halve the number of clock cycles (#clock/hash = 12), we observe significant improvements in throughput for both our MRK and MSK architectures. Specifically, our MRK architecture achieves a throughput 1.51 times higher (27.43 vs. 18.18), while our MSK architecture achieves a throughput 1.92 times higher (34.97 vs. 18.18) than the proposal of Sideris et al. [26]. However, despite these substantial throughput improvements, our efficiency is slightly lower, with MRK at 0.65 times (8.56 vs. 13.22) and MSK at 0.91 times (11.99 vs. 13.22) their efficiency, respectively. Nonetheless, this decrease in efficiency is not considered significant when compared to the notable throughput accelerations of 51% and 92% for MRK and MSK, respectively.
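The relative figures quoted in these two comparisons follow directly from Table 2; the short sketch below (variable names are ours) reproduces each quoted ratio from the tabulated throughput and efficiency values.

```python
# Table 2 figures for SHA3-512: TP in Gbps, Eff. in Mbps/slice.
tp  = {"dual_rnd": 22.90, "unroll2": 18.18, "MRK": 27.43, "MSK": 34.97}
eff = {"dual_rnd": 15.11, "unroll2": 13.22, "MRK": 8.56,  "MSK": 11.99}

def ratio(a, b):
    return round(a / b, 2)

assert ratio(tp["MRK"], tp["dual_rnd"]) == 1.20   # "1.20 times" vs. dual Rnd
assert ratio(tp["MSK"], tp["dual_rnd"]) == 1.53
assert ratio(tp["MRK"], tp["unroll2"]) == 1.51    # vs. unrolling factor of 2
assert ratio(tp["MSK"], tp["unroll2"]) == 1.92
assert ratio(eff["MRK"], eff["unroll2"]) == 0.65  # efficiency trade-off
assert ratio(eff["MSK"], eff["unroll2"]) == 0.91
```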
5. CONCLUSION
The demand for high-performance hash functions in modern applications has emerged, especially for the latest hashing version, SHA-3. An improvement of SHA-3 throughput is proposed in this paper. Specifically, the full SHA-3 architecture is presented, from the buffers and the arithmetic core, KECCAK-f, to integration at the system level. The proposed architectures are designed on an FPGA platform, which is connected to a PC via PCIe. PCIe, which is widely used in modern applications, boosts the data transfer rate. The issue of data transfer in PCIe's DMA is analyzed and resolved through the implementation of ping-pong memory and the selection of appropriate memory sizes. Furthermore, the configurable BIO is presented to support multiple SHA-3 modes and minimize the number of buffer instances. This feature benefits modern applications which require various output lengths of the hash values. The proposed architectures, MRS and MSS, achieve a high throughput of up to 35.55 Gbps and 43.12 Gbps, respectively, thanks to the multiple KECCAK-f combined with either the re-scheduled or the sub-pipelined architecture. MSS demonstrates greater efficiency compared to MRS, for instance, with 4.55 Mbps/ALM > 4.30 Mbps/ALM for the SHA3-224 mode. In addition, our MRK and MSK achieve 27.43 Gbps and 34.97 Gbps for the SHA3-512 mode when implemented on Virtex 7, respectively.
REFERENCES
[1] L. Rota, M. Caselle, S. Chilingaryan, A. Kopmann, and M. Weber, "A PCIe DMA architecture for multi-gigabyte per second data transmission," IEEE Transactions on Nuclear Science, vol. 62, no. 3, pp. 972–976, 2015, doi: 10.1109/TNS.2015.2426877.
[2] J. Liu, J. Wang, Y. Zhou, and F. Liu, "A cloud server oriented FPGA accelerator for LSTM recurrent neural network," IEEE Access, vol. 7, pp. 122408–122418, 2019, doi: 10.1109/ACCESS.2019.2938234.
[3] H. Kavianipour, S. Muschter, and C. Bohm, "High performance FPGA-based DMA interface for PCIe," IEEE Transactions on Nuclear Science, vol. 61, no. 2, pp. 745–749, 2014, doi: 10.1109/RTC.2012.6418352.
[4] J.-S. Ng, J. Chen, K.-S. Chong, J. S. Chang, and B.-H. Gwee, "A highly secure FPGA-based dual-hiding asynchronous-logic AES accelerator against side-channel attacks," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 30, no. 9, pp. 1144–1157, 2022, doi: 10.1109/TVLSI.2022.3175180.
[5] M. Zeghid, H. Y. Ahmed, A. Chehri, and A. Sghaier, "Speed/area-efficient ECC processor implementation over GF(2^m) on FPGA via novel algorithm-architecture co-design," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 31, no. 8, pp. 1192–1203, 2023, doi: 10.1109/TVLSI.2023.3268999.
[6] S. Shin and T. Kwon, "A privacy-preserving authentication, authorization, and key agreement scheme for wireless sensor networks in 5G-integrated internet of things," IEEE Access, vol. 8, pp. 67555–67571, 2020, doi: 10.1109/ACCESS.2020.2985719.
[7] S. Jiang, X. Zhu, and L. Wang, "An efficient anonymous batch authentication scheme based on HMAC for VANETs," IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 8, pp. 2193–2204, 2016, doi: 10.1109/TITS.2016.2517603.
[8] L. Zhou, C. Su, and K.-H. Yeh, "A lightweight cryptographic protocol with certificateless signature for the internet of things," ACM Transactions on Embedded Computing Systems (TECS), vol. 18, no. 3, pp. 1–10, 2019, doi: 10.1145/3301306.
[9] Federal Information Processing Standards Publication, "SHA-3 standard: permutation-based hash and extendable-output functions," Aug. 2015, doi: 10.6028/NIST.FIPS.202.
[10] M. Stevens, E. Bursztein, P. Karpman, A. Albertini, and Y. Markov, "The first collision for full SHA-1," in Advances in Cryptology – CRYPTO 2017: 37th Annual International Cryptology Conference, Santa Barbara, CA, USA, August 20–24, 2017, Proceedings, Springer, 2017, pp. 570–596, doi: 10.1007/978-3-319-63688-7_19.
[11] X. Zhang, Z. Zhou, and Y. Niu, "An image encryption method based on the Feistel network and dynamic DNA encoding," IEEE Photonics Journal, vol. 10, no. 4, pp. 1–14, 2018, doi: 10.1109/JPHOT.2018.2859257.
[12] C. Zhu and K. Sun, "Cryptanalyzing and improving a novel color image encryption algorithm using RT-enhanced chaotic tent maps," IEEE Access, vol. 6, pp. 18759–18770, 2018, doi: 10.1109/ACCESS.2018.2817600.
[13] W.-K. Lee, R. C.-W. Phan, B.-M. Goi, L. Chen, X. Zhang, and N. N. Xiong, "Parallel and high speed hashing in GPU for telemedicine applications," IEEE Access, vol. 6, pp. 37991–38002, 2018, doi: 10.1109/ACCESS.2018.2849439.
[14] M. Sravani and S. A. Durai, "Bio-hash secured hardware e-health record system," IEEE Transactions on Biomedical Circuits and Systems, 2023, doi: 10.1109/TBCAS.2023.3263177.
[15] M. De Donno, K. Tange, and N. Dragoni, "Foundations and evolution of modern computing paradigms: cloud, IoT, edge, and fog," IEEE Access, vol. 7, pp. 150936–150948, 2019, doi: 10.1109/ACCESS.2019.2947652.
[16] T.-Y. Wu, Z. Lee, M. S. Obaidat, S. Kumari, S. Kumar, and C.-M. Chen, "An authenticated key exchange protocol for multi-server architecture in 5G networks," IEEE Access, vol. 8, pp. 28096–28108, 2020, doi: 10.1109/ACCESS.2020.2969986.
[17] W.-K. Lee, K. Jang, G. Song, H. Kim, S. O. Hwang, and H. Seo, "Efficient implementation of lightweight hash functions on GPU and quantum computers for IoT applications," IEEE Access, vol. 10, pp. 59661–59674, 2022, doi: 10.1109/ACCESS.2022.3179970.
[18] Z. Liu, L. Ren, Y. Feng, S. Wang, and J. Wei, "Data integrity audit scheme based on quad Merkle tree and blockchain," IEEE Access, 2023, doi: 10.1109/ACCESS.2023.3240066.
[19] S. Islam, M. J. Islam, M. Hossain, S. Noor, K.-S. Kwak, and S. R. Islam, "A survey on consensus algorithms in blockchain-based applications: architecture, taxonomy, and operational issues," IEEE Access, 2023, doi: 10.1109/ACCESS.2023.3267047.
[20] H. Cho, "ASIC-resistance of multi-hash proof-of-work mechanisms for blockchain consensus protocols," IEEE Access, vol. 6, pp. 66210–66222, 2018, doi: 10.1109/ACCESS.2018.2878895.
[21] H. Choi and S. C. Seo, "Fast implementation of SHA-3 in GPU environment," IEEE Access, vol. 9, pp. 144574–144586, 2021, doi: 10.1109/ACCESS.2021.3122466.
[22] H. Bensalem, Y. Blaquière, and Y. Savaria, "An efficient OpenCL-based implementation of a SHA-3 co-processor on an FPGA-centric platform," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 70, no. 3, pp. 1144–1148, 2022, doi: 10.1109/TCSII.2022.3223179.
[23] M. M. Sravani and S. A. Durai, "On efficiency enhancement of SHA-3 for FPGA-based multimodal biometric authentication," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 30, no. 4, pp. 488–501, 2022, doi: 10.1109/TVLSI.2022.3148275.
[24] S. El Moumni, M. Fettach, and A. Tragha, "High throughput implementation of SHA3 hash algorithm on field programmable gate array (FPGA)," Microelectronics Journal, vol. 93, p. 104615, 2019, doi: 10.1016/j.mejo.2019.104615.
[25] B. Li, Y. Yan, Y. Wei, and H. Han, "Scalable and parallel optimization of the number theoretic transform based on FPGA," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2023, doi: 10.1109/TVLSI.2023.3312423.
[26] A. Sideris, T. Sanida, and M. Dasygenis, "Hardware acceleration design of the SHA-3 for high throughput and low area on FPGA," Journal of Cryptographic Engineering, pp. 1–13, 2023, doi: 10.1007/s13389-023-00334-0.
[27] H. E. Michail, L. Ioannou, and A. G. Voyiatzis, "Pipelined SHA-3 implementations on FPGA: architecture and performance analysis," in Proceedings of the Second Workshop on Cryptography and Security in Computing Systems, 2015, pp. 13–18, doi: 10.1145/2694805.2694808.
[28] G. S. Athanasiou, G.-P. Makkas, and G. Theodoridis, "High throughput pipelined FPGA implementation of the new SHA-3 cryptographic hash algorithm," in 2014 6th International Symposium on Communications, Control and Signal Processing (ISCCSP), IEEE, 2014, pp. 538–541, doi: 10.1109/ISCCSP.2014.6877931.