Inter
national
J
our
nal
of
Recongurable
and
Embedded
Systems
(IJRES)
V
ol.
14,
No.
3,
No
v
ember
2025,
pp.
659
∼
675
ISSN:
2089-4864,
DOI:
10.11591/ijres.v14.i3.pp659-675
❒
659
Hard
war
e
design
f
or
fast
gate
bootstrapping
in
fully
homomor
phic
encryption
o
v
er
the
T
orus
Saru
V
ig
1
,
Ahmad
Al
Badawi
2
,
Mohd
F
aizal
Y
usof
2
1
Department
of
Cybersecurity
,
Institute
for
Infocomm
Research,
Astar
,
Sing
apore
2
Department
of
Homeland
Security
,
Rabdan
Academy
,
Ab
u
Dhabi,
United
Arab
Emirates
Article
Inf
o
Article
history:
Recei
v
ed
Feb
26,
2024
Re
vised
Sep
8,
2025
Accepted
Oct
9,
2025
K
eyw
ords:
Acceleration
Field-programmable
g
ate
array
F
ourier
transform
Hardw
are
design
T
orus
fully
homomorphic
encryption
ABSTRA
CT
Fully
homomorphic
encryption
(FHE)
is
a
promising
solution
for
pri
v
ac
y-
preserving
computations,
as
it
enables
operations
on
encrypted
data.
Despite
its
potential,
FHE
is
associated
with
high
computational
costs.
As
the
theoreti-
cal
foundations
of
FHE
mature,
mounting
interest
is
focused
to
w
ards
hardw
are
acceleration
of
established
FHE
schem
es.
In
this
w
ork,
we
present
a
hardw
are
implementation
of
the
f
ast
F
ourier
transform
(FFT)
tailored
for
polynomial
mul-
tiplication
and
aimed
at
accelerating
g
ate
bootst
rapping
in
T
orus
fully
homomor
-
phic
encryption
(TFHE)
schemes.
Our
study
includes
an
e
xtensi
v
e
design-space
e
xploration
at
v
arious
implementation
le
v
els,
le
v
era
ging
parallel
streaming
data
to
reduce
computational
latenc
y
.
W
e
introduce
a
ne
w
algorithm
to
e
xpedite
mod-
ular
polynomial
multiplication
using
ne
g
ati
v
e
wrapped
con
v
olution.
Our
imple-
mentation,
conducted
on
recongurable
hardw
are,
adheres
to
the
def
ault
TFHE
parameters
wit
h
1024-de
gree
polynomials.
The
results
demonstrate
a
signi-
cant
performance
enhancement,
with
impro
v
ements
of
up
to
30-fold,
depending
on
the
FFT
design
parameters.
Our
w
ork
contrib
utes
to
the
ongoing
ef
forts
to
optimize
FHE,
pa
ving
the
w
ay
for
more
ef
cient
and
secure
computations.
This
is
an
open
access
article
under
the
CC
BY
-SA
license
.
Corresponding
A
uthor:
Ahmad
Al
Bada
wi
Department
of
Homeland
Security
,
Rabdan
Academy
Ab
u
Dhabi
22401,
United
Arab
Emirates
Email:
ahmad@u.nus.edu
1.
INTR
ODUCTION
W
ith
the
rise
of
cloud
computing,
concerns
re
g
a
rding
the
pri
v
ac
y
of
sensiti
v
e
data
processed
by
t
hird-
party
services
ha
v
e
been
mounting.
This
has
dri
v
en
the
de
v
elopment
of
v
arious
pri
v
ac
y-preserving
techniques
(PETs),
including
dif
ferential
pri
v
ac
y
(DP)
[1],
secure
multi-party
computation
(SMPC)
[2],
trusted
e
x
ecution
en
vironments
(TEEs)
[3],
and
fully
homomorphic
encryption
(FHE)
[4].
Among
these,
FHE
of
fers
a
promising
adv
antage
by
enabling
computation
on
encrypted
data
without
decryption,
guaranteeing
pri
v
ac
y
e
v
en
from
the
cloud
pro
vider
,
as
sho
wn
in
Figure
1.
FHE
supports
arithmetic
operations
lik
e
addition
and
multiplication
on
encrypted
data,
allo
wing
for
computations
through
untrusted
third-party
services
without
compromising
the
underlying
information.
This
mak
es
FHE
particularly
suitable
for
scenarios
where
sensiti
v
e
data
analysis
is
required
while
maintaining
stringent
pri
v
ac
y
guarantees.
In
a
typical
FHE
w
orko
w
,
the
user
encrypts
a
pri
v
ate
input
v
ector
before
sending
it
to
an
untrusted
serv
er
that
performs
computations
homomorphically
.
The
resulting
encrypted
output
is
then
returned
to
the
user
for
de-
cryption,
re
v
ealing
the
computed
result
in
plainte
xt.
Existing
FHE
schemes
typically
produce
and
operate
on
noisy
cipherte
xts.
The
noise
magnitude
in-
J
ournal
homepage:
http://ijr
es.iaescor
e
.com
Evaluation Warning : The document was created with Spire.PDF for Python.
660
❒
ISSN:
2089-4864
creases
as
homomorphic
operations
are
performed
on
the
cipherte
xts,
with
homomorphic
multiplication
result-
ing
in
relati
v
ely
higher
noise
gro
wth
compared
to
homomorphic
addition.
The
noise
cannot
gro
w
indenitely
,
or
decryption
will
f
ail
to
reco
v
er
the
me
ssage.
T
o
enable
arbitrary
computations,
the
cipherte
xts
must
be
refreshed
to
reduce
their
noise
content.
This
noise
control
mechanism,
kno
wn
as
bootstrapping
[4],
distinguishes
FHE
schemes
from
partial
ones.
Ho
we
v
er
,
bootstrapping
is
computationally
e
xpensi
v
e,
especially
in
w
ord-based
FHE
schemes,
and
is
considered
the
main
performance
bottleneck
of
these
schemes
[5].
Enc
ry
pti
on
E
N
C
(
)
User
Serv
er
E
N
C
(
)
E
N
C
(
(
)
)
(
)
c
=
E
N
C
=
E
N
C
(
(
)
)
Un
en
c
r
yp
t
ed
Da
t
a
Know
s
func
ti
on
Homom
orphic
ev
al
uati
on
of
De
c
ry
pti
on
E
NC
(
(
)
)
En
c
r
yp
t
e
d
Da
t
a
En
c
r
yp
t
ed
r
esu
lt
Un
en
c
r
yp
t
ed
r
esu
lt
Figure
1.
Ov
ervie
w
of
homomorphic
computation
in
FHE
T
orus
fully
homomorphic
encryption
(TFHE),
which
is
an
impro
v
ed
v
ersion
of
the
FHEW
scheme
[6],
stands
out
as
a
prominent
FHE
scheme
recognized
for
its
ef
cient
bootstrapping
proce
du
r
e.
Notably
,
it
enables
homomorphic
computation
of
binary
operations
in
a
mere
13
milliseconds
on
a
single
core
processor
,
utilizing
a
16
MB
bootstrapping
k
e
y
within
the
TFHE
library
[7].
While
surpassing
the
performance
of
pre
vious
schemes,
TFHE
can
be
computationally
e
xpensi
v
e
for
arbitrary
applications.
This
stems
from
its
approach
of
encoding
input
data
as
indi
vidual
bits
or
small-bit
inte
gers
(typically
limited
to
8
bits)
and
representing
functions
as
circuits
of
binary
g
ates.
Ho
we
v
er
,
its
small
parameter
size
and
ability
to
compute
a
wide
range
of
functions
mak
e
it
a
promising
candidate
for
future
FHE
applications.
W
e
analyzed
TFHE
and
identied
FFT
-based
polynomial
multiplication
as
the
bottleneck
operation
during
bootstrapping.
This
nding
aligns
with
pre
vious
proling
of
HE
schemes
[8],
where
it
w
as
reported
that
75%
of
estimated
c
ycles
are
spent
in
FFT
con
v
olutions
used
for
polynomial
multiplication.
F
or
TFHE,
we
calculated
that
the
number
of
FFT
operations
required
for
a
comparison
function
of
tw
o
16-bit
numbers
is
approximately
3
×
10
6
.
This
highlights
the
signicant
computational
cost
as
sociated
with
FFT
-based
polyno-
mial
multiplication
in
TFHE
and
emphasizes
the
need
for
further
research
into
more
ef
cient
approaches
for
bootstrapping
in
homomorphic
encryption
schemes.
Hardw
are
implementations
on
eld-programmable
g
ate
arrays
(FPGAs)
ha
v
e
demonstrated
potential
for
signicant
performance
g
ains
in
computationally
intensi
v
e
algorithms.
This
is
particularly
rele
v
ant
t
o
the
eld
of
homomorphic
encryption,
where
applications
lik
e
TFHE
rely
hea
vily
on
operations
lik
e
the
FFT
.
Recent
w
ork
by
Gener
et
al
.
[9]
reinforces
this
notion
by
presenting
an
FPGA-
b
a
sed
polynomial
multiplication
imple-
mented
as
a
v
ector
-matrix
product,
achie
ving
substantial
acceleration
compared
to
softw
are
implementations.
Our
w
ork
aims
to
le
v
erage
the
inherent
parallelism
and
congurability
of
FPGAs
by
e
xploring
a
hardw
are
im-
plementation
of
the
FFT
algorithm
specically
tailored
for
TFHE-based
applicat
ions.
This
approach
has
the
potential
to
further
enhance
the
ef
cienc
y
and
practicality
of
this
f
amily
of
homomorphic
encryption
schemes.
In
this
w
ork,
we
present
an
optimized
polynomial
multiplication
algorithm
for
TFHE
schemes
based
on
an
e
xtended
v
ariant
of
the
ne
g
ati
v
e-wrapped
con
v
olution
(NWC)
method.
This
well-established
technique
of
fers
signicant
ef
cienc
y
impro
v
ements
for
polynomial
multiplication,
a
fundamental
operation
in
FHE.
Our
proposed
approach
b
uilds
upon
the
NWC
method
by
incorporating
additional
optimizations,
aiming
to
further
reduce
the
computational
comple
xity
and
memory
footprint
associated
with
polynomial
multiplication
in
FHE
schemes.
W
e
implemented
and
e
v
aluated
our
polynomial
multiplier
utilizing
the
comprehensi
v
e
suite
of
tools
pro
vided
by
Xilinx
V
i
v
ado,
specically
tar
geting
the
V
irte
x-7
xc7vx1140
FPGA.
This
included
simulation,
Int
J
Recongurable
&
Embedded
Syst,
V
ol.
14,
No.
3,
No
v
ember
2025:
659–675
Evaluation Warning : The document was created with Spire.PDF for Python.
Int
J
Recongurable
&
Embedded
Syst
ISSN:
2089-4864
❒
661
synthesis,
and
hardw
are
implementation.
Our
design
achie
v
es
a
signicant
performance
g
ain,
nearly
30
times
f
aster
compared
to
CPU
softw
are
implementation,
when
using
an
FFT
design
of
streaming
width
of
128
and
radix-4
conguration.
This
conguration
e
xhibits
the
optimal
performance
among
the
tested
radix
v
alues.
Our
contrib
utions:
the
main
contrib
utions
of
our
w
ork
are
summarized
as
follo
ws:
−
W
e
de
v
eloped
an
ef
cient
hardw
are
archi
tecture
for
the
TFHE
scheme
on
FPGAs,
specically
tar
geting
the
time-consuming
bootstrapping
operation.
−
W
e
introduce
a
no
v
el
architecture
for
f
ast
F
ourier
transform
(FFT)-based
polynomial
multiplication.
This
inno
v
ati
v
e
design
signicantly
enhances
the
ef
cienc
y
of
polynomial
multiplication,
a
critical
component
in
TFHE
bootstrapping.
−
W
e
ha
v
e
successfully
le
v
eraged
our
polynomial
multiplier
to
accelerate
the
bootstrapping
process
in
TFHE.
−
W
e
conducted
a
comprehensi
v
e
performance
e
v
aluation
of
basic
homomorphic
binary
g
ates
using
our
ac-
celerator
achie
ving
speedups
close
to
30
×
compared
with
CPU
implementation.
Or
g
anization:
the
remainder
of
this
paper
is
or
g
anized
as
follo
ws.
Section
2
introduces
the
methods,
notations,
and
background
concepts
used
throughout
the
paper
.
Section
3
then
presents
an
analysis
of
the
bottlenecks
in
the
bootstrapping
operati
on,
along
with
our
design
space
decisions,
and
concludes
with
a
proof-
of-concept
i
mplementation
and
e
v
aluation
of
our
re
gister
-transfer
le
v
el
(R
TL)
design.
Section
4
discusses
the
results,
while
section
5
re
vie
ws
r
elated
w
ork
on
hardw
are
acceleration.
Final
ly
,
section
6
concludes
the
paper
and
highlights
the
k
e
y
tak
ea
w
ays
and
future
directions.
2.
METHODS
This
section
details
the
methodologies
emplo
yed
throughout
the
paper
.
W
e
be
gin
by
introducing
the
notation
and
symbols
used,
follo
wed
by
a
comprehensi
v
e
description
of
the
TFHE
scheme,
highlighting
its
core
b
uilding
blocks.
Subsequently
,
we
delv
e
into
the
intricacies
of
the
NWC
approach,
and
nally
,
we
introduce
the
SGen
tool,
a
platform
for
automated
design
generation
at
the
R
TL.
2.1.
Notations
Capital
and
lo
wercase
letters
distinguish
sets
from
their
elements.
W
e
denote
the
sets
of
binary
,
inte
ger
,
real,
and
comple
x
numbers
by
B
,
Z
,
R
,
and
C
,
respecti
v
ely
.
Matrices
and
v
ectors
are
represented
by
boldf
ace
capital
and
lo
wercase
letters,
respecti
v
ely
.
The
dot
product
between
tw
o
v
ectors
u
and
v
is
denoted
by
⟨
u
·
u
⟩
.
W
e
use
the
symbols
⌊·⌋
,
⌈·⌉
,
and
⌊·⌉
to
denote
the
oor
,
ceiling,
and
nearest
inte
ger
functions,
respecti
v
ely
.
F
or
an
inte
ger
a
,
a
q
denotes
the
remainder
of
a
when
di
vided
by
q
.
If
a
is
a
polynomial,
the
reduction
is
performed
on
each
coef
cient.
Z
N
[
X
]
refers
to
the
ring
of
polynomials
Z
/
(
X
N
+
1)
.
The
symbol
a
←
−
S
denotes
sampling
an
element
a
from
the
set
S
.
2.2.
T
orus
fully
homomor
phic
encryption
This
section
re
visits
the
TFHE
scheme,
as
introduced
by
Chillotti
et
al
.
[7].
Built
upon
the
learning
with
errors
(L
WE)
hardness
problem
[10],
TFHE
utilizes
the
Ring
v
ariant
of
L
WE
(RingL
WE)
[11].
Originally
proposed
as
an
optimized
v
ersion
of
FHEW
[6],
the
scheme
f
acilitates
homomorphic
e
v
aluation
of
binary
operations.
It
also
supports
performing
homomorphic
arithmetic
in
Z
p
with
p
being
a
lo
w-width
(typically
¡
8-bit)
inte
ger
modulus.
TFHE
dif
ferentiates
itself
from
earlier
F
HE
generations
by
performing
bootstrapping
after
each
g
ate
operation,
a
technique
kno
wn
as
g
ate
bootstrapping.
This
approach
enables
computations
on
commodity
hard-
w
are,
achie
ving
speeds
of
less
than
1
second
for
FHEW
and
around
0.13
seconds
for
TFHE.
Subsequently
,
TFHE
introduced
the
concept
of
circuit
bootstrapping,
where
cipherte
xts
are
refreshed
after
a
series
of
g
ate
e
v
aluations.
The
k
e
y
distinction
between
the
tw
o
methods
lies
in
the
parameters
size
and
e
x
ecution
of
the
bootstrapping
procedure.
F
or
a
detailed
comparison,
we
refer
the
reader
to
[12].
2.2.1.
T
orus
fully
homomor
phic
encryption
samples
TFHE
operates
with
tw
o
distinct
types
of
samples
:
T
orus
L
WE
(TL
WE)
and
T
orus
GSW
(TGSW).
These
samples
play
a
crucial
role
in
the
TFHE
scheme,
each
serving
a
distinct
purpose
in
the
encryption
and
homomorphic
computation
procedures.
TL
WE
samples
are
primarily
used
for
the
encryption
of
indi
vidual
bits,
while
TGSW
samples
are
utilized
for
more
comple
x
operations
such
as
bootstrapping
and
homomorphic
com-
putations.
The
interplay
between
these
tw
o
types
of
samples
is
fundamental
to
the
functionality
and
ef
cienc
y
of
the
TFHE
scheme.
In
the
follo
wing
paragraphs,
we
introduce
these
tw
o
mathematical
notions.
Har
dwar
e
design
for
fast
gate
bootstr
apping
in
fully
homomorphic
encryption
o
ver
the
T
orus
(Saru
V
ig)
Evaluation Warning : The document was created with Spire.PDF for Python.
662
❒
ISSN:
2089-4864
The
torus,
denoted
by
T
,
is
dened
as
the
quotient
group
R
/
Z
,
representing
the
real
numbers
modulo
1
under
the
standard
addition
operation.
W
e
furthe
r
generalize
this
concept
to
T
q
=
(1
/q
)
R
/
Z
,
where
q
is
a
positi
v
e
inte
ger
modulus.
The
construction
of
T
N
,q
[
X
]
e
xtends
this
denition
to
the
space
of
polynomials
whose
coef
cients
reside
in
T
q
,
with
operations
performed
modulo
(
X
N
+
1)
and
modulo
q
.
Additionally
,
B
N
[
X
]
is
dened
as
a
subset
of
polynomials
in
Z
N
[
X
]
where
coef
cients
belong
to
a
specic
ring
B
.
The
scheme
ut
ilizes
dif
ferent
types
of
samples,
cate
gorized
as
either
TL
WE
and
TGSW
samples
which
are
used
within
internal
bootstrapping
procedures,
which
we
describe
belo
w:
TL
WE
Let
n,
N
∈
N
denote
the
dimension
and
de
gree,
respecti
v
ely
.
Let
P
=
T
be
the
plainte
xt
space,
the
k
e
y
space
S
=
{
s
1
,
.
.
.
,
s
n
}
∈
B
n
,
and
the
cipherte
xt
space
C
=
T
n
+1
q
.
F
or
a
message
m
∈
P
(which
can
be
a
bit,
modular
inte
ger
,
or
real
number
in
an
interv
al),
the
encryption
of
m
gi
v
es
cipherte
xt
sample
c
which
tak
es
the
form
c
=
((
a
0
,
a
1
,
.
.
.
,
a
n
−
1
)
,
b
)
,
where
(1).
b
=
n
−
1
X
i
=0
a
i
s
i
+
e
+
∆
m.
(1)
Here,
a
←
T
n
q
is
sampled
uniformly
,
s
is
the
secret
k
e
y
,
and
e
is
the
noise
f
actor
sampled
from
the
underlying
Gaussian
distrib
ution
of
TFHE.
Decryption
is
performed
as
(2).
b
−
a
·
s
=
∆
m
−
e
Round
−
−
−
→
m.
(2)
Building
upon
the
descriptions
abo
v
e,
TRL
WE
cipherte
xt
samples
can
also
be
generated
from
polynomials.
In
this
conte
xt,
the
plainte
xt
space
is
denoted
by
T
N
,q
[
X
]
,
the
secret
k
e
y
v
ector
by
S
=
{
s
1
,
...,
s
k
}
∈
B
N
[
X
]
k
,
and
the
cipherte
xt
by
C
∈
T
N
,q
[
X
]
n
+1
.
Notably
,
TL
WE
is
a
specic
instance
of
TRL
WE
with
N
=
1
.
TRL
WE
samples
enable
homomorphic
addition
and
constant
multiplication.
The
utilization
of
tw
o
distinct
cipher
schemes,
namely
TRL
WE
and
TL
WE,
is
primarily
moti
v
ated
by
f
acilitating
the
internal
bootstrapping
process.
TGSW
the
Gentry–Sahai–W
aters
(GSW)
scheme,
a
v
ariant
of
the
L
WE
encryption
scheme,
is
capable
of
performing
non-linear
operations
homomorphically
.
The
TFHE
scheme
emplo
ys
GSW
for
the
homomorphic
multiplication
of
tw
o
cipherte
xts.
This
is
particularly
useful
in
bootstrapping,
where
an
e
xternal
product,
de-
noted
as
⊡
,
is
dened
as
TGSW
×
TL
WE
→
TL
WE.
The
primary
application
of
the
e
xternal
product
in
TFHE
is
the
controlled
multiple
x
er
,
or
CMUX.
The
operation
of
the
CMUX
is
further
dened
later
in
the
subsequent
sections.
A
TGSW
sample
ca
n
be
conceptualized
as
a
matrix
of
TL
WE
elements.
TFHE
emplo
ys
a
’g
adget
decomposition’
scheme
to
construct
these
matrices.
Gi
v
en
the
notations
described
abo
v
e
for
the
TL
WE
sample,
the
TGSW
encryption
scheme
for
a
message
m
∈
Z
p
[
X
]
is
dened
as
(3).
C
=
Z
+
m
·
H
(3)
Each
ro
w
of
Z
is
a
homogeneous
TL
WE
sample
encrypted
under
k
e
y
s
.
The
g
adget
matrix,
H
,
belongs
to
T
(
n
+1)
×
(
n
+1)
l
q
.
Reciprocally
,
C
∈
T
(
n
+1)
l
×
(
n
+1)
q
is
considered
a
v
alid
TGSW
sample
if
there
e
xists
a
message
m
∈
Z
p
[
X
]
such
that
each
ro
w
of
C
−
m
·
H
is
a
v
alid
homogeneous
TRL
WE
sample
under
the
k
e
y
.
Both
TL
WE
and
TGSW
schemes,
along
with
their
operations,
can
be
e
xtended
to
polynomials
represented
as
TRL
WE
and
TRGSW
,
respecti
v
ely
.
The
combination
of
these
samples
plays
a
crucial
role
in
refr
eshing
noisy
cipherte
xts
during
bootstrapping.
2.2.2.
Gate
le
v
el
bootstrapping
Bootstrapping
is
a
widely
used
technique
to
manage
noise
gro
wth
in
homomorphic
encryption
schemes.
It
achie
v
es
this
by
homomorphically
e
v
aluating
the
decryption
procedure
on
a
cipherte
xt
via
an
encryption
of
the
secret
k
e
y
.
Here,
we
consider
the
e
xample
of
a
specic
homomorphic
encryption
scheme,
namely
TFHE,
which
utilizes
g
ate
bootstrapping.
This
process
tak
es
a
cipherte
xt
of
the
form
TL
WE
µ
,
where
µ
is
the
plainte
xt
message,
as
input.
The
output
is
another
cipherte
xt,
either
encrypting
0
or
µ
,
with
a
controlled
amount
of
noise.
The
g
ate
bootstrapping
procedure
in
TFHE
consists
of
three
k
e
y
steps:
i)
BlindRotate,
ii)
SampleEx-
tract,
and
iii)
K
e
ySwitch,
which
are
described
as
follo
ws.
Int
J
Recongurable
&
Embedded
Syst,
V
ol.
14,
No.
3,
No
v
ember
2025:
659–675
Evaluation Warning : The document was created with Spire.PDF for Python.
Int
J
Recongurable
&
Embedded
Syst
ISSN:
2089-4864
❒
663
−
The
BlindRotate
operation
tak
es
a
TL
WE
sample
as
input
and
multiplies
it
with
a
secret
encrypted
po
wer
of
X
,
ef
fecti
v
ely
producing
a
rotation
of
the
coef
ci
ents.
This
rotation
is
achie
v
ed
by
in
v
oking
the
CMUX
g
ate
in
a
loop
for
each
coef
cient
of
the
cipher
.
A
visual
representation
of
BlindRota
te
can
be
found
in
Figure
2.
Figure
2.
Ov
ervie
w
of
the
BLINDR
O
T
A
TE
operation
performed
during
BOO
TSTRAPPING
after
each
g
ate-le
v
el
operation
The
procedure
relies
on
polynomials,
require
shifting
the
input
cipher
sample
from
the
TL
WE(
T
N
[
X
]
)
space
to
the
TRL
WE
space.
Here,
c
denotes
a
TRL
WE
sample
of
a
polynomial
message
v
∈
T
N
[
X
]
under
the
k
e
y
s
=
(
s
1
,
...,
s
n
)
∈
B
n
,
where
the
constant
term
is
a
rounded
approximation
of
the
original
message
in
the
T
space.
Matching
bootstrapping
k
e
ys
s
′
=
(
s
′
1
,
...,
s
′
n
)
∈
B
n
e
xist,
with
C
′
1
,
...,
C
′
n
representing
their
corresponding
TGSW
samples.
The
process
be
gins
by
rounding
up
the
input
cipher
and
dening
˜
c
(
˜
a
1
,
...,
˜
a
n
,
˜
b
)
←
⌊
c
2
N
⌉
mod
2
N
.
The
process
is
e
x
ecuted
o
v
er
tw
o
steps
by
dening
an
accumulator
,
A
CC,
as
follo
ws:
a.
A
CC
←
X
−
˜
b
·
c
b.
Perform
A
CC
←
CMUX(
C
′
i
,
X
˜
a
i
·
AC
C
,
A
CC)
for
1
≤
i
≤
n
The
output
is
a
TRL
WE
sample
encryption
of
X
−
p
·
v
,
where
p
=
˜
b
−
P
n
i
=1
s
′
i
·
˜
a
i
mod
2
N
.
−
The
SampleExtract
operation
is
the
subsequent
step
follo
wing
the
con
v
ersion
of
plainte
xt
v
into
the
en-
cryption
of
the
polynomial
X
−
p
·
v
.
In
this
operation,
the
constant
term
is
e
xtracted
from
the
cipher
.
This
e
xtraction
process
results
in
the
retrie
v
al
of
the
cipher
of
the
original
message,
no
w
encrypted
under
a
ne
w
k
e
y
.
This
operation
is
a
crucial
component
in
the
k
e
y
switching
process,
enabling
the
transition
of
encrypted
data
between
dif
ferent
encryption
k
e
ys.
−
The
K
e
ySwitch
operation
is
a
fundamental
component
in
homomorphic
encryption
schemes.
This
standard
k
e
y
switching
algorithm
is
designed
to
con
v
ert
a
cipherte
xt
encrypted
under
one
k
e
y
into
a
cipherte
xt
en-
crypted
under
a
dif
ferent
k
e
y
.
The
implementation
of
this
operation
necessi
tates
the
use
of
k
e
y-switching
k
e
ys.
Specically
,
these
are
the
TL
WE
encryptions
of
the
k
e
y
bits
of
˜
s
,
with
respect
to
the
original
k
e
y
s
.
2.3.
Negati
v
e
wrapped
con
v
olution
W
ithin
the
conte
xt
of
FHE,
one
of
the
computationally
most
e
xpensi
v
e
operations
is
the
modular
multiplication
of
polynomial
elements.
A
pre
v
alent
approach
to
achie
v
e
this
task
ef
ciently
le
v
erages
FFT
-
based
con
v
olutions.
The
c
yclic
con
v
olution
for
tw
o
length-
D
signals
x
,
y
is
denoted
by
a
signal
z
=
x
×
y
ha
ving
D
elements
as
(4).
z
m
=
X
i
+
j
≡
m
(
modD
)
x
i
y
j
(4)
Whereas
the
ne
g
ac
yclic
con
v
olution
of
x
,
y
adds
a
f
actor
of
(
−
1)
with
v
=
x
×
m
y
and
is
denoted
by
[13]:
v
m
=
X
i
+
j
=
m
x
i
y
j
−
X
i
+
j
=
D
+
m
x
i
y
j
(5)
Har
dwar
e
design
for
fast
gate
bootstr
apping
in
fully
homomorphic
encryption
o
ver
the
T
orus
(Saru
V
ig)
Evaluation Warning : The document was created with Spire.PDF for Python.
664
❒
ISSN:
2089-4864
The
ac
yclic
con
v
olution
of
the
signal
gi
v
en
by
u
=
x
×
a
y
ha
ving
2
D
elements
is
gi
v
en
by:
u
m
=
X
i
+
j
=
m
x
i
y
j
(6)
for
m
∈
{
0
,
...,
2
D
−
2
}
.
Lastly
,
the
half-con
v
olution
is
a
length-
D
signal
gi
v
en
by
x
×
h
y
consisting
of
the
rst
D
elements
of
the
c
yclic
con
v
olution
u
.
All
the
basic
con
v
olutions
are
closely
related
to
one
another
.
As
sho
wn
in
[13],
if
the
length
D
is
e
v
en
and
x
j
,
y
j
=
0
for
j
>
D
/
2
,
L
(
x
)
×
a
L
(
y
)
=
x
×
y
=
x
×
m
y
(7)
gi
v
en
the
notion
of
the
splitting
of
signals
(of
e
v
en
length)
into
halv
es,
the
lo
wer
-inde
x
ed
coef
cients
of
a
are
denoted
as
L
(
a
)
.
The
abo
v
e
statement
introduces
us
to
the
concept
of
”zero-padding”
which
is
to
append
D
zeros
to
signals
of
length
D
so
that
the
signals’
ac
yclic
con
v
olution
is
identical
to
the
c
yclic
(or
the
ne
g
ac
yclic)
con
v
olution
of
the
tw
o
padded
sequences.
While
the
classical
c
yclic
con
v
olution
is
applied
to
accelerated
polynomial
multiplication
modulo
X
N
−
1
,
ne
g
ac
yclic
con
v
ol
ution
is
used
to
implement
polynomial
multiplication
modulo
X
N
+
1
.
No
w
,
that
we
ha
v
e
established
the
use
of
ne
g
ac
yclic
con
v
olution
to
perform
modular
multi
plication,
we
e
xplain
the
possible
approach
of
zero-padding
that
has
been
implemented
as
part
of
the
TFHE
Library
which
uses
the
standard
c
yclic
con
v
olution
and
FFT
to
perform
ne
g
ac
yclic
con
v
olution.
W
e
note
that:
X
2
N
−
1
=
(
X
N
+
1)(
X
N
−
1)
(8)
The
goal
is
to
obtain
the
product
of
tw
o
N
-point
polynomials
modulo
(
X
N
+
1)
.
First
,
the
product
modulo
(
X
2
N
−
1)
via
c
yclic
con
v
olution
is
computed,
i.e.,
using
the
standard
F
ourier
transform
denition.
T
o
obtain
the
product
modulo
(
X
N
+
1)
,
the
result
needs
to
be
reduced
to
t
he
intermediate
result
modulo
X
N
+
1
.
The
description
of
ho
w
this
reduction
w
orks
w
as
presented
in
[14]
and
is
used
here
for
compl
eteness.
Let
us
dene
a
signal
p
∈
Z
[
X
]
of
de
gree
N
−
1
.
Its
ne
g
ac
yclic
e
xtension
¯
p
(of
length
2
N
)
is
dened
as
(9):
¯
p
[
X
]
=
p
[
X
]
−
X
N
×
p
[
x
]
(9)
At
each
multiplication
by
X
with
the
polynomial,
the
coef
cients
are
circularly
shifted
one
position
to
the
right
and
the
entering
coef
cient
is
ne
g
ated.
The
resulting
signal
is
such
that
its
F
ourier
image
will
ha
v
e
zeros
at
all
the
e
v
en
positions
and
the
remai
n
i
ng
coef
cients
will
be
mirrored
and
conjug
ated.
Thus,
multiplication
of
tw
o
such
signals
p
and
q
can
be
performed
as
(10):
p
×
q
mo
d
(
X
N
+
1)
=
1
2
F
−
1
(
F
(
¯
p
)
·
F
(
¯
q
))[0
...N
−
1]
(10)
Note
that
after
F
−
1
,
the
coef
cients
are
ne
g
ac
yclic,
hence
we
can
only
tak
e
the
rst
half
of
the
v
ector
.
This
approach
of
ne
g
ati
v
e
wrapped
con
v
olution
follo
wed
in
the
TFHE
library
is
also
described
in
Algorithm
1.
Thus,
the
TFHE
library
uses
a
2
N
length
FFT
architecture
to
perform
multiplication.
In
our
proposed
approach,
we
use
length
N
FFT
ar
chitecture
to
achie
v
e
the
same
result
for
modular
polynomial
multiplication.
Int
J
Recongurable
&
Embedded
Syst,
V
ol.
14,
No.
3,
No
v
ember
2025:
659–675
Evaluation Warning : The document was created with Spire.PDF for Python.
Int
J
Recongurable
&
Embedded
Syst
ISSN:
2089-4864
❒
665
2.4.
A
utomatic
generation
of
r
egister
-transfer
le
v
el
le
v
el
design
An
important
step
in
g
ate
bootstrappi
n
g
is
the
generation
of
R
TL
designs
for
the
in
v
olv
ed
functions.
In
this
conte
xt,
we
e
xplore
the
generalization
of
the
hardw
are
back-end
of
the
SGen
tool,
presented
in
[15],
to
automatically
generate
data-path
designs
for
FFTs.
SGen
is
an
open-source
tool
that
generates
hardw
are
desi
gns
for
v
arious
signal
processing
trans
forms.
It
focuses
on
designs
that
operate
on
streaming
data,
where
the
input
dataset
is
di
vided
into
smaller
chunks
processed
o
v
er
multiple
c
ycles,
leading
to
reduced
resource
usage.
The
FFT
data
path
comprises
O
(log
2
N
)
cascaded
stages
for
a
signal
of
length
N
,
as
sho
wn
in
Figure
3.
Each
stage
processes
w
elements
(w
ords)
per
c
ycle,
with
a
streaming
width
of
w
.
Consequently
,
the
number
of
c
ycles
required
to
recei
v
e
the
entire
input
(or
output)
is
N
w
.
The
hardw
are
in
each
stage
is
reused
for
subsequent
sets
of
elements,
ef
fecti
v
ely
performing
a
”v
ertical
folding”
of
the
signal.
SGen
implements
the
Coole
y-T
uk
e
y
algorithm
for
streaming
FFT
and
of
fers
an
area-ef
cient
alternati
v
e
to
the
w
ork
presented
in
[16]
for
comple
x
data
type
FFT
problems
.
A
detailed
comparison
of
these
tools
is
presented
later
in
this
paper
.
Our
w
ork
in
v
estig
ates
the
applicability
of
SGen
as
a
tool
for
generating
FFT
data
paths
in
the
case
of
TFHE
with
128-bit
security
le
v
el.
W
e
will
le
v
erage
the
data-path
design
generated
by
SGen
and
apply
it
to
our
proposed
polynomial
multiplication
Algorithm
2.
Figure
3.
Streaming
FFT
architecture
3.
IMPLEMENT
A
TION
This
section
describes
the
fundamental
b
uilding
blocks
of
the
bootstrapping
approach
imple
mented
on
the
FPGA.
Figure
4
pro
vides
a
high-le
v
el
o
v
ervie
w
of
the
entire
architecture.
Through
comple
xity
analysis,
we
identied
the
time-critical
functions
and
tar
geted
them
for
hardw
are
acceleration.
Har
dwar
e
design
for
fast
gate
bootstr
apping
in
fully
homomorphic
encryption
o
ver
the
T
orus
(Saru
V
ig)
Evaluation Warning : The document was created with Spire.PDF for Python.
666
❒
ISSN:
2089-4864
Figure
4.
High-le
v
el
system
o
v
ervie
w
The
BLINDR
O
T
A
TE
function
is
the
core
of
the
bootstrapping
process
,
responsible
for
refreshing
the
noise.
It
performs
a
coef
cient
rotation
by
multiplying
an
encrypted
polynomial
with
a
n
encrypted
po
wer
of
X
.
This
rotation
is
achie
v
ed
using
the
CMUX
g
ate,
as
illustrated
in
Figure
2.
The
CMUX
g
ate
tak
es
tw
o
TL
WE
samples,
d
0
and
d
1
,
as
inputs,
along
with
a
control
TGSW
sample,
C
.
Its
output
is
a
TL
WE
sample
and
is
computed
as
(11):
C
⊙
(
d
1
−
d
0
)
+
d
0
,
(11)
where
⊙
denotes
the
e
xternal
pr
oduct
,
homomorphic
to
the
e
xternal
R
-modulo
product
between
the
respecti
v
e
plainte
xts.
The
e
xternal
pr
oduct
is
formally
dened
as
(12):
T
GS
W
×
T
LW
E
→
T
LW
E
.
(12)
The
e
xternal
product
operation
in
homomorphic
encryption
schemes
is
kno
wn
to
ha
v
e
tw
o
com-
putationally
e
xpensi
v
e
functions:
the
FFT
and
add-multiply
(AddMul).
As
illustrated
i
n
Figure
2,
each
B
LI
N
D
R
O
T
AT
E
operation,
performed
with
def
ault
security
parameters,
in
v
olv
es
630
C
M
U
X
operations,
one
for
each
coef
cient
of
the
bootstrapping
k
e
y
polynomial.
F
or
each
C
M
U
X
within
the
e
xternal
product,
the
forw
ard
FFT
is
e
x
ecuted
si
x
times,
follo
wed
by
the
AddMul
operation
and
the
in
v
erse
FFT
.
Our
e
xperiments
indicate
that,
for
a
simple
homomorphic
comparison
of
tw
o
16-bit
numbers,
the
FFT
is
performed
309,846
times.
T
able
1
details
the
FFT
counts
for
all
basic
operations
in
the
e
xample
code.
F
ollo
wing
the
bootstrapping
process,
a
k
e
y-switching
function
is
applied.
This
operation
also
presents
an
opportunity
for
optimization,
which
will
be
discussed
later
in
this
section.
T
able
1.
FFT
counts
for
16-bit
inte
ger
comparison
Operation
K
e
y
generation
2-input
g
ates
Mux
Comparison
function
FFT
counts
15120
3780
7554
309846
3.1.
F
ast
F
ourier
transf
orm
The
forw
ard
and
in
v
erse
FFTs
share
the
same
underlying
archi
tecture,
with
the
sole
distinction
lying
in
the
applied
multiplicati
v
e
twiddle
f
actors.
In
this
paper
,
we
only
discuss
the
data-path
a
forw
ard
FFT
.
The
core
ba
ck-end
architecture,
generated
by
SGen,
utilizes
the
Coole
y-T
uk
e
y
algorithm
iterati
v
ely
to
perform
the
FFT
operation.
As
illustrated
in
Figure
5,
SGen
implements
a
fully
streaming
FFT
,
thereby
enhancing
resource
utilization.
A
signal
of
length
2
n
is
permuted
o
v
er
2
t
c
ycles
in
chunks
of
size
2
k
elements,
where
n
=
k
+
t
.
The
generated
R
TL
incorporates
additional
optimizations,
including:
−
Simplication
of
read-only
memories
(R
OMs)
containing
periodic
v
alues
by
replacing
them
with
constants.
−
Replacement
of
single-v
alue
R
OMs
with
constants.
Int
J
Recongurable
&
Embedded
Syst,
V
ol.
14,
No.
3,
No
v
ember
2025:
659–675
Evaluation Warning : The document was created with Spire.PDF for Python.
Int
J
Recongurable
&
Embedded
Syst
ISSN:
2089-4864
❒
667
−
Simplication
of
tri
vial
arithmetic
operations.
−
P
airing
of
multiple
x
ers
and
R
OMs
sharing
identical
v
alues.
These
optimizations
further
impro
v
e
the
area
and
resource
ef
cienc
y
of
the
generated
hardw
are
design.
Figure
5.
Ov
ervie
w
of
FPGA
implementation
of
FFT
3.1.1.
Design
decisions
Automatic
R
TL
generator:
SGen
of
fers
the
e
xibility
to
customize
v
arious
parameters
based
on
spe-
cic
design
requirements.
T
o
e
v
aluate
its
performance,
we
compared
the
latenc
y
of
SGen
and
SPIRAL
[16]
for
the
FFT
of
a
64-point,
32-bit
x
ed-point
signal.
The
results
are
presented
in
T
able
2.
As
sho
wn,
SGen
e
xhibits
superior
latenc
y
performance
compared
to
SPIRAL.
Additionally
,
SGen
pro
vides
greater
e
xibility
in
selecting
data
size
and
type.
While
SPIRAL
is
limited
to
32-bit
x
ed-point
data,
SGen
is
capable
of
generating
R
TL
for
64-bit
x
ed-point
data
types.
T
able
2.
Comparison
between
SPIRAL
and
SGEN
Radix
2
4
8
Streaming
width
2
4
8
32
64
4
32
64
8
32
64
SPIRAL
165
116
92
57
42
92
40
25
76
46
31
SGEN
131
90
67
34
22
83
26
14
66
30
18
Data
type
and
size:
as
pre
viously
discussed,
SGen
allo
ws
selection
from
v
arious
data
types
(s
ingle,
double,
and
x
ed-point).
In
the
TFHE
library
,
FFT
operations
are
performed
using
the
FFTW3
library
[17]
with
double-precision
(64-bit)
data.
W
e
e
xperimented
with
the
single-precision
(32-bit)
v
ersion
of
FFTW3
for
the
def
ault
TFHE
use
case,
b
ut
switching
to
single-precision
oats
resulted
in
inaccurate
post-decryption
results.
Therefore,
we
opted
for
a
64-bit
w
ord
length
for
the
R
TL
design.
Ha
ving
determined
the
w
ord
length,
we
aimed
to
select
the
data
type
(x
ed-point
or
oat
ing-point)
by
performi
ng
simulations.
W
e
e
x
ecuted
FFTs
on
the
same
signals
using
MA
TLAB
R2020b
and
the
FFTW3
library
implementation
in
Eclipse.
The
resulting
error
mar
gins
are
presented
in
T
able
3.
Our
goal
w
as
to
choose
precision
with
the
minimal
dif
ference
compared
to
the
double-precision
FFTW3
implementation.
These
e
xper
-
iments
used
signals
within
the
operational
bounds
of
TFHE.
Research
indicated
that
the
inte
ger
coef
cients
of
polynomials
used
for
bootstrapping
FFTs
typically
reside
in
the
range
[
−
64
,
63]
.
W
ith
64-bit
x
ed-point
rep-
resentation,
the
maximum
error
mar
gin
reached
e
−
5
(14
decimal
points).
As
e
xpected,
increas
ing
the
decimal
point
precision
reduced
the
error
mar
gin,
ranging
from
e
−
10
for
30
decimal
points
to
e
−
5
for
14
decimal
points.
Double-precision
oating-point
yielded
the
lo
west
error
mar
gin,
reaching
e
−
15
.
Ho
we
v
er
,
for
a
gi
v
en
transform
length,
x
ed-point
implementations
are
generally
f
aster
than
oating-point
implementations
(approximately
3
times
f
aster).
Consequently
,
we
opted
for
64-bit
x
ed-point
data.
It
is
note
w
orth
y
that
in
x
ed-point
designs,
the
precision
length
for
the
fractional
part
does
not
af
fect
the
data
path
design.
T
able
3.
Output
dif
ference
between
FFT
function
on
MA
TLAB
and
FFTW3
library
for
dif
ferent
data
types
Precision
Double
x
ed,
14
x
ed,
18
x
ed,
22
x
ed,
26
x
ed,
30
Error
mar
gin
e
−
16
e
−
5
e
−
6
e
−
8
e
−
9
e
−
10
Har
dwar
e
design
for
fast
gate
bootstr
apping
in
fully
homomorphic
encryption
o
ver
the
T
orus
(Saru
V
ig)
Evaluation Warning : The document was created with Spire.PDF for Python.
668
❒
ISSN:
2089-4864
Radix
size:
the
Coole
y-T
uk
e
y
algorithm
implemented
by
SGen
recursi
v
ely
e
xpresses
the
FFT
as
a
composite
size
of
N
=
N
1
N
2
.
One
of
N
1
or
N
2
acts
as
the
radix,
which
controls
the
number
of
points
processed
by
the
basic
computational
block
of
the
FFT
,
oft
en
referred
to
as
the
b
uttery
due
to
its
data-path
shape
(see
the
magnied
section
of
Figure
5
for
a
high-le
v
el
vie
w).
The
streaming
width
must
be
a
multiple
of
the
radix;
that
is,
log
2
(
radix
)
should
be
di
visible
by
log
2
(
N
)
.
While
a
higher
radix
is
computationally
more
comple
x,
it
is
also
more
ef
cient
due
to
the
reduced
number
of
multiplications
and
additions
required.
T
o
achie
v
e
maximum
ef
cienc
y
,
high
I/O
bandwidth
between
the
CPU
and
FPGA
is
crucial
to
maintain
a
full
pipeline
at
all
times.
Streaming
width:
SGen
allo
ws
for
v
arying
the
streaming
width
from
2
to
256
comple
x
w
ords
per
c
ycle,
depending
on
the
chosen
radix.
A
fully
streaming
architec
ture
enables
continuous
data
o
w
into
and
out
of
the
system.
The
Coole
y-T
uk
e
y
FFT
data
path
comprises
O
(log
2
N
)
cascaded
stages
for
a
signal
of
length
N
.
A
streaming
width
of
w
w
ords
per
c
ycle
requires
N
w
c
ycles
to
recei
v
e
the
entire
input
(or
output).
The
ne
xt
section
presents
an
e
v
aluation
of
performance
(resource
utilization
and
speed)
as
inuenced
by
both
streaming
width
and
radix.
3.1.2.
Pr
oposed
con
v
olution
appr
oach
W
e
propose
a
ne
w
method
for
performing
multiplication
using
FFT
,
as
sho
wn
in
Algorithm
2.
This
method
is
applicable
to
signals
of
length
N
and
utili
zes
an
FFT
data
path
of
length
N
/
2
.
W
e
e
xtend
the
concept
of
NWC
from
nite
elds
to
the
eld
of
comple
x
numbers.
A
similar
approach
for
real-v
alued
numbers
w
as
pre
viously
proposed
in
[18].
W
e
assume
that
the
twiddle
f
actors
are
pre-computed.
The
inputs
are
real-v
alued
signals
of
length
N
,
which
are
folded
to
create
a
comple
x
signal
of
length
N
/
2
.
T
o
v
erify
the
correctness
of
our
approach,
we
performed
polynomial
multi
plication
using
the
imple-
mentation
pro
vided
by
the
TFHE
library
(described
in
Algorithm
1)
and
the
proposed
approach
(described
in
Algorithm
2)
on
v
e
dif
ferent
input
signals.
W
e
then
calculated
the
a
v
erage
dif
ference
between
the
correspond-
ing
coef
cients
of
the
nal
product
signals.
W
e
x
ed
the
range
of
the
inte
ger
input
coef
cients
to
[-63,
64],
consistent
with
the
TFHE
g
adget
decomposition
function.
Our
ndings
indicate
that
the
proposed
approach
achie
v
es
high
accurac
y
e
v
en
when
using
a
half-size
FFT
data
path.
The
total
error
mar
gins
were
a
v
eraged
at
7
.
81253
×
10
−
9
,
with
a
median
of
1
.
95309
×
10
−
9
.
3.2.
AddMul
The
AddMul
function
performs
multiplication
follo
wed
by
successi
v
e
addition
within
a
loop,
as
de-
tailed
in
Algorithm
3.
In
the
TFHE
library
,
this
operation
i
s
performed
in
the
frequenc
y
domain
on
signals
after
the
FFT
operation.
The
data
type
for
this
operation
is
double-precisi
on
oating-point.
Based
on
our
e
xperiments,
the
input
v
alues
are
bounded
within
the
range
[
−
20733
.
1
,
20316
.
4]
.
W
e
implemented
the
R
TL
design
using
MA
TLAB
Simulink
R2020b
.
Th
e
implementation
emplo
yed
64-bit
x
ed-point
data
points
with
an
input
representation
of
(1
,
64
,
48)
and
an
output
representation
of
(1
,
64
,
33)
.
This
x
ed-point
imple
menta-
tion
resulted
in
a
minimal
error
mar
gin
between
e
−
7
and
e
−
8
on
the
output
compared
to
the
double-precision
representation.
While
increasing
the
streaming
width
can
enable
f
aster
e
x
ecution
by
performing
operations
in
parallel,
this
needs
to
be
carefully
balanced
ag
ainst
the
resulting
increase
in
resource
utilization.
3.3.
K
ey
switching
K
e
y
switching
is
a
well-established
procedure
in
the
literature
[6],
[7],
allo
wing
the
transition
between
k
e
ys
with
dif
ferent
parameters.
Our
analysis
of
this
function’
s
proling
re
v
ealed
that
the
shift
operation,
as
presented
in
(3.3.),
is
the
most
time-consuming
step.
This
operation,
performed
on
32-bit
inte
ger
data
in
TFHE,
e
xhibits
high
frequenc
y
and
lo
w
pot
ential
for
hardw
are
acceleration.
Due
to
its
con
v
ersion
into
a
single-c
ycle
CPU
instruction,
the
shift
operation
will
not
benet
signicantly
from
an
R
TL
implementation.
Therefore,
the
implementation
of
this
function
in
R
TL
is
omitted
in
this
w
ork.
aij
=
(
aibar
>
>
(32
−
(
j
+
1)
∗
basebit
))&
mask
(13)
Int
J
Recongurable
&
Embedded
Syst,
V
ol.
14,
No.
3,
No
v
ember
2025:
659–675
Evaluation Warning : The document was created with Spire.PDF for Python.