Inter national J our nal of Recongurable and Embedded Systems (IJRES) V ol. 14, No. 3, No v ember 2025, pp. 659 675 ISSN: 2089-4864, DOI: 10.11591/ijres.v14.i3.pp659-675 659 Hard war e design f or fast gate bootstrapping in fully homomor phic encryption o v er the T orus Saru V ig 1 , Ahmad Al Badawi 2 , Mohd F aizal Y usof 2 1 Department of Cybersecurity , Institute for Infocomm Research, Astar , Sing apore 2 Department of Homeland Security , Rabdan Academy , Ab u Dhabi, United Arab Emirates Article Inf o Article history: Recei v ed Feb 26, 2024 Re vised Sep 8, 2025 Accepted Oct 9, 2025 K eyw ords: Acceleration Field-programmable g ate array F ourier transform Hardw are design T orus fully homomorphic encryption ABSTRA CT Fully homomorphic encryption (FHE) is a promising solution for pri v ac y- preserving computations, as it enables operations on encrypted data. Despite its potential, FHE is associated with high computational costs. As the theoreti- cal foundations of FHE mature, mounting interest is focused to w ards hardw are acceleration of established FHE schem es. In this w ork, we present a hardw are implementation of the f ast F ourier transform (FFT) tailored for polynomial mul- tiplication and aimed at accelerating g ate bootst rapping in T orus fully homomor - phic encryption (TFHE) schemes. Our study includes an e xtensi v e design-space e xploration at v arious implementation le v els, le v era ging parallel streaming data to reduce computational latenc y . W e introduce a ne w algorithm to e xpedite mod- ular polynomial multiplication using ne g ati v e wrapped con v olution. Our imple- mentation, conducted on recongurable hardw are, adheres to the def ault TFHE parameters wit h 1024-de gree polynomials. The results demonstrate a signi- cant performance enhancement, with impro v ements of up to 30-fold, depending on the FFT design parameters. Our w ork contrib utes to the ongoing ef forts to optimize FHE, pa ving the w ay for more ef cient and secure computations. This is an open access article under the CC BY -SA license . Corresponding A uthor: Ahmad Al Bada wi Department of Homeland Security , Rabdan Academy Ab u Dhabi 22401, United Arab Emirates Email: ahmad@u.nus.edu 1. INTR ODUCTION W ith the rise of cloud computing, concerns re g a rding the pri v ac y of sensiti v e data processed by t hird- party services ha v e been mounting. This has dri v en the de v elopment of v arious pri v ac y-preserving techniques (PETs), including dif ferential pri v ac y (DP) [1], secure multi-party computation (SMPC) [2], trusted e x ecution en vironments (TEEs) [3], and fully homomorphic encryption (FHE) [4]. Among these, FHE of fers a promising adv antage by enabling computation on encrypted data without decryption, guaranteeing pri v ac y e v en from the cloud pro vider , as sho wn in Figure 1. FHE supports arithmetic operations lik e addition and multiplication on encrypted data, allo wing for computations through untrusted third-party services without compromising the underlying information. This mak es FHE particularly suitable for scenarios where sensiti v e data analysis is required while maintaining stringent pri v ac y guarantees. In a typical FHE w orko w , the user encrypts a pri v ate input v ector before sending it to an untrusted serv er that performs computations homomorphically . The resulting encrypted output is then returned to the user for de- cryption, re v ealing the computed result in plainte xt. Existing FHE schemes typically produce and operate on noisy cipherte xts. The noise magnitude in- J ournal homepage: http://ijr es.iaescor e .com Evaluation Warning : The document was created with Spire.PDF for Python.
660 ISSN: 2089-4864 creases as homomorphic operations are performed on the cipherte xts, with homomorphic multiplication result- ing in relati v ely higher noise gro wth compared to homomorphic addition. The noise cannot gro w indenitely , or decryption will f ail to reco v er the me ssage. T o enable arbitrary computations, the cipherte xts must be refreshed to reduce their noise content. This noise control mechanism, kno wn as bootstrapping [4], distinguishes FHE schemes from partial ones. Ho we v er , bootstrapping is computationally e xpensi v e, especially in w ord-based FHE schemes, and is considered the main performance bottleneck of these schemes [5]. Enc ry pti on E N C ( ) User Serv er               E N C ( ) E N C ( ( ) ) ( ) c = E N C = E N C ( ( ) ) Un en c r yp t ed   Da t a Know s  func ti on  Homom orphic   ev al uati on  of    De c ry pti on E NC ( ( ) ) En c r yp t e d   Da t a En c r yp t ed   r esu lt Un en c r yp t ed   r esu lt Figure 1. Ov ervie w of homomorphic computation in FHE T orus fully homomorphic encryption (TFHE), which is an impro v ed v ersion of the FHEW scheme [6], stands out as a prominent FHE scheme recognized for its ef cient bootstrapping proce du r e. Notably , it enables homomorphic computation of binary operations in a mere 13 milliseconds on a single core processor , utilizing a 16 MB bootstrapping k e y within the TFHE library [7]. While surpassing the performance of pre vious schemes, TFHE can be computationally e xpensi v e for arbitrary applications. This stems from its approach of encoding input data as indi vidual bits or small-bit inte gers (typically limited to 8 bits) and representing functions as circuits of binary g ates. Ho we v er , its small parameter size and ability to compute a wide range of functions mak e it a promising candidate for future FHE applications. W e analyzed TFHE and identied FFT -based polynomial multiplication as the bottleneck operation during bootstrapping. This nding aligns with pre vious proling of HE schemes [8], where it w as reported that 75% of estimated c ycles are spent in FFT con v olutions used for polynomial multiplication. F or TFHE, we calculated that the number of FFT operations required for a comparison function of tw o 16-bit numbers is approximately 3 × 10 6 . This highlights the signicant computational cost as sociated with FFT -based polyno- mial multiplication in TFHE and emphasizes the need for further research into more ef cient approaches for bootstrapping in homomorphic encryption schemes. Hardw are implementations on eld-programmable g ate arrays (FPGAs) ha v e demonstrated potential for signicant performance g ains in computationally intensi v e algorithms. This is particularly rele v ant t o the eld of homomorphic encryption, where applications lik e TFHE rely hea vily on operations lik e the FFT . Recent w ork by Gener et al . [9] reinforces this notion by presenting an FPGA- b a sed polynomial multiplication imple- mented as a v ector -matrix product, achie ving substantial acceleration compared to softw are implementations. Our w ork aims to le v erage the inherent parallelism and congurability of FPGAs by e xploring a hardw are im- plementation of the FFT algorithm specically tailored for TFHE-based applicat ions. This approach has the potential to further enhance the ef cienc y and practicality of this f amily of homomorphic encryption schemes. In this w ork, we present an optimized polynomial multiplication algorithm for TFHE schemes based on an e xtended v ariant of the ne g ati v e-wrapped con v olution (NWC) method. This well-established technique of fers signicant ef cienc y impro v ements for polynomial multiplication, a fundamental operation in FHE. Our proposed approach b uilds upon the NWC method by incorporating additional optimizations, aiming to further reduce the computational comple xity and memory footprint associated with polynomial multiplication in FHE schemes. W e implemented and e v aluated our polynomial multiplier utilizing the comprehensi v e suite of tools pro vided by Xilinx V i v ado, specically tar geting the V irte x-7 xc7vx1140 FPGA. This included simulation, Int J Recongurable & Embedded Syst, V ol. 14, No. 3, No v ember 2025: 659–675 Evaluation Warning : The document was created with Spire.PDF for Python.
Int J Recongurable & Embedded Syst ISSN: 2089-4864 661 synthesis, and hardw are implementation. Our design achie v es a signicant performance g ain, nearly 30 times f aster compared to CPU softw are implementation, when using an FFT design of streaming width of 128 and radix-4 conguration. This conguration e xhibits the optimal performance among the tested radix v alues. Our contrib utions: the main contrib utions of our w ork are summarized as follo ws: W e de v eloped an ef cient hardw are archi tecture for the TFHE scheme on FPGAs, specically tar geting the time-consuming bootstrapping operation. W e introduce a no v el architecture for f ast F ourier transform (FFT)-based polynomial multiplication. This inno v ati v e design signicantly enhances the ef cienc y of polynomial multiplication, a critical component in TFHE bootstrapping. W e ha v e successfully le v eraged our polynomial multiplier to accelerate the bootstrapping process in TFHE. W e conducted a comprehensi v e performance e v aluation of basic homomorphic binary g ates using our ac- celerator achie ving speedups close to 30 × compared with CPU implementation. Or g anization: the remainder of this paper is or g anized as follo ws. Section 2 introduces the methods, notations, and background concepts used throughout the paper . Section 3 then presents an analysis of the bottlenecks in the bootstrapping operati on, along with our design space decisions, and concludes with a proof- of-concept i mplementation and e v aluation of our re gister -transfer le v el (R TL) design. Section 4 discusses the results, while section 5 re vie ws r elated w ork on hardw are acceleration. Final ly , section 6 concludes the paper and highlights the k e y tak ea w ays and future directions. 2. METHODS This section details the methodologies emplo yed throughout the paper . W e be gin by introducing the notation and symbols used, follo wed by a comprehensi v e description of the TFHE scheme, highlighting its core b uilding blocks. Subsequently , we delv e into the intricacies of the NWC approach, and nally , we introduce the SGen tool, a platform for automated design generation at the R TL. 2.1. Notations Capital and lo wercase letters distinguish sets from their elements. W e denote the sets of binary , inte ger , real, and comple x numbers by B , Z , R , and C , respecti v ely . Matrices and v ectors are represented by boldf ace capital and lo wercase letters, respecti v ely . The dot product between tw o v ectors u and v is denoted by u · u . W e use the symbols ⌊·⌋ , ⌈·⌉ , and ⌊·⌉ to denote the oor , ceiling, and nearest inte ger functions, respecti v ely . F or an inte ger a , a q denotes the remainder of a when di vided by q . If a is a polynomial, the reduction is performed on each coef cient. Z N [ X ] refers to the ring of polynomials Z / ( X N + 1) . The symbol a S denotes sampling an element a from the set S . 2.2. T orus fully homomor phic encryption This section re visits the TFHE scheme, as introduced by Chillotti et al . [7]. Built upon the learning with errors (L WE) hardness problem [10], TFHE utilizes the Ring v ariant of L WE (RingL WE) [11]. Originally proposed as an optimized v ersion of FHEW [6], the scheme f acilitates homomorphic e v aluation of binary operations. It also supports performing homomorphic arithmetic in Z p with p being a lo w-width (typically ¡ 8-bit) inte ger modulus. TFHE dif ferentiates itself from earlier F HE generations by performing bootstrapping after each g ate operation, a technique kno wn as g ate bootstrapping. This approach enables computations on commodity hard- w are, achie ving speeds of less than 1 second for FHEW and around 0.13 seconds for TFHE. Subsequently , TFHE introduced the concept of circuit bootstrapping, where cipherte xts are refreshed after a series of g ate e v aluations. The k e y distinction between the tw o methods lies in the parameters size and e x ecution of the bootstrapping procedure. F or a detailed comparison, we refer the reader to [12]. 2.2.1. T orus fully homomor phic encryption samples TFHE operates with tw o distinct types of samples : T orus L WE (TL WE) and T orus GSW (TGSW). These samples play a crucial role in the TFHE scheme, each serving a distinct purpose in the encryption and homomorphic computation procedures. TL WE samples are primarily used for the encryption of indi vidual bits, while TGSW samples are utilized for more comple x operations such as bootstrapping and homomorphic com- putations. The interplay between these tw o types of samples is fundamental to the functionality and ef cienc y of the TFHE scheme. In the follo wing paragraphs, we introduce these tw o mathematical notions. Har dwar e design for fast gate bootstr apping in fully homomorphic encryption o ver the T orus (Saru V ig) Evaluation Warning : The document was created with Spire.PDF for Python.
662 ISSN: 2089-4864 The torus, denoted by T , is dened as the quotient group R / Z , representing the real numbers modulo 1 under the standard addition operation. W e furthe r generalize this concept to T q = (1 /q ) R / Z , where q is a positi v e inte ger modulus. The construction of T N ,q [ X ] e xtends this denition to the space of polynomials whose coef cients reside in T q , with operations performed modulo ( X N + 1) and modulo q . Additionally , B N [ X ] is dened as a subset of polynomials in Z N [ X ] where coef cients belong to a specic ring B . The scheme ut ilizes dif ferent types of samples, cate gorized as either TL WE and TGSW samples which are used within internal bootstrapping procedures, which we describe belo w: TL WE Let n, N N denote the dimension and de gree, respecti v ely . Let P = T be the plainte xt space, the k e y space S = { s 1 , . . . , s n } B n , and the cipherte xt space C = T n +1 q . F or a message m P (which can be a bit, modular inte ger , or real number in an interv al), the encryption of m gi v es cipherte xt sample c which tak es the form c = (( a 0 , a 1 , . . . , a n 1 ) , b ) , where (1). b = n 1 X i =0 a i s i + e + m. (1) Here, a T n q is sampled uniformly , s is the secret k e y , and e is the noise f actor sampled from the underlying Gaussian distrib ution of TFHE. Decryption is performed as (2). b a · s = m e Round m. (2) Building upon the descriptions abo v e, TRL WE cipherte xt samples can also be generated from polynomials. In this conte xt, the plainte xt space is denoted by T N ,q [ X ] , the secret k e y v ector by S = { s 1 , ..., s k } B N [ X ] k , and the cipherte xt by C T N ,q [ X ] n +1 . Notably , TL WE is a specic instance of TRL WE with N = 1 . TRL WE samples enable homomorphic addition and constant multiplication. The utilization of tw o distinct cipher schemes, namely TRL WE and TL WE, is primarily moti v ated by f acilitating the internal bootstrapping process. TGSW the Gentry–Sahai–W aters (GSW) scheme, a v ariant of the L WE encryption scheme, is capable of performing non-linear operations homomorphically . The TFHE scheme emplo ys GSW for the homomorphic multiplication of tw o cipherte xts. This is particularly useful in bootstrapping, where an e xternal product, de- noted as , is dened as TGSW × TL WE TL WE. The primary application of the e xternal product in TFHE is the controlled multiple x er , or CMUX. The operation of the CMUX is further dened later in the subsequent sections. A TGSW sample ca n be conceptualized as a matrix of TL WE elements. TFHE emplo ys a ’g adget decomposition’ scheme to construct these matrices. Gi v en the notations described abo v e for the TL WE sample, the TGSW encryption scheme for a message m Z p [ X ] is dened as (3). C = Z + m · H (3) Each ro w of Z is a homogeneous TL WE sample encrypted under k e y s . The g adget matrix, H , belongs to T ( n +1) × ( n +1) l q . Reciprocally , C T ( n +1) l × ( n +1) q is considered a v alid TGSW sample if there e xists a message m Z p [ X ] such that each ro w of C m · H is a v alid homogeneous TRL WE sample under the k e y . Both TL WE and TGSW schemes, along with their operations, can be e xtended to polynomials represented as TRL WE and TRGSW , respecti v ely . The combination of these samples plays a crucial role in refr eshing noisy cipherte xts during bootstrapping. 2.2.2. Gate le v el bootstrapping Bootstrapping is a widely used technique to manage noise gro wth in homomorphic encryption schemes. It achie v es this by homomorphically e v aluating the decryption procedure on a cipherte xt via an encryption of the secret k e y . Here, we consider the e xample of a specic homomorphic encryption scheme, namely TFHE, which utilizes g ate bootstrapping. This process tak es a cipherte xt of the form TL WE µ , where µ is the plainte xt message, as input. The output is another cipherte xt, either encrypting 0 or µ , with a controlled amount of noise. The g ate bootstrapping procedure in TFHE consists of three k e y steps: i) BlindRotate, ii) SampleEx- tract, and iii) K e ySwitch, which are described as follo ws. Int J Recongurable & Embedded Syst, V ol. 14, No. 3, No v ember 2025: 659–675 Evaluation Warning : The document was created with Spire.PDF for Python.
Int J Recongurable & Embedded Syst ISSN: 2089-4864 663 The BlindRotate operation tak es a TL WE sample as input and multiplies it with a secret encrypted po wer of X , ef fecti v ely producing a rotation of the coef ci ents. This rotation is achie v ed by in v oking the CMUX g ate in a loop for each coef cient of the cipher . A visual representation of BlindRota te can be found in Figure 2. Figure 2. Ov ervie w of the BLINDR O T A TE operation performed during BOO TSTRAPPING after each g ate-le v el operation The procedure relies on polynomials, require shifting the input cipher sample from the TL WE( T N [ X ] ) space to the TRL WE space. Here, c denotes a TRL WE sample of a polynomial message v T N [ X ] under the k e y s = ( s 1 , ..., s n ) B n , where the constant term is a rounded approximation of the original message in the T space. Matching bootstrapping k e ys s = ( s 1 , ..., s n ) B n e xist, with C 1 , ..., C n representing their corresponding TGSW samples. The process be gins by rounding up the input cipher and dening ˜ c ( ˜ a 1 , ..., ˜ a n , ˜ b ) c 2 N mod 2 N . The process is e x ecuted o v er tw o steps by dening an accumulator , A CC, as follo ws: a. A CC X ˜ b · c b. Perform A CC CMUX( C i , X ˜ a i · AC C , A CC) for 1 i n The output is a TRL WE sample encryption of X p · v , where p = ˜ b P n i =1 s i · ˜ a i mod 2 N . The SampleExtract operation is the subsequent step follo wing the con v ersion of plainte xt v into the en- cryption of the polynomial X p · v . In this operation, the constant term is e xtracted from the cipher . This e xtraction process results in the retrie v al of the cipher of the original message, no w encrypted under a ne w k e y . This operation is a crucial component in the k e y switching process, enabling the transition of encrypted data between dif ferent encryption k e ys. The K e ySwitch operation is a fundamental component in homomorphic encryption schemes. This standard k e y switching algorithm is designed to con v ert a cipherte xt encrypted under one k e y into a cipherte xt en- crypted under a dif ferent k e y . The implementation of this operation necessi tates the use of k e y-switching k e ys. Specically , these are the TL WE encryptions of the k e y bits of ˜ s , with respect to the original k e y s . 2.3. Negati v e wrapped con v olution W ithin the conte xt of FHE, one of the computationally most e xpensi v e operations is the modular multiplication of polynomial elements. A pre v alent approach to achie v e this task ef ciently le v erages FFT - based con v olutions. The c yclic con v olution for tw o length- D signals x , y is denoted by a signal z = x × y ha ving D elements as (4). z m = X i + j m ( modD ) x i y j (4) Whereas the ne g ac yclic con v olution of x , y adds a f actor of ( 1) with v = x × m y and is denoted by [13]: v m = X i + j = m x i y j X i + j = D + m x i y j (5) Har dwar e design for fast gate bootstr apping in fully homomorphic encryption o ver the T orus (Saru V ig) Evaluation Warning : The document was created with Spire.PDF for Python.
664 ISSN: 2089-4864 The ac yclic con v olution of the signal gi v en by u = x × a y ha ving 2 D elements is gi v en by: u m = X i + j = m x i y j (6) for m { 0 , ..., 2 D 2 } . Lastly , the half-con v olution is a length- D signal gi v en by x × h y consisting of the rst D elements of the c yclic con v olution u . All the basic con v olutions are closely related to one another . As sho wn in [13], if the length D is e v en and x j , y j = 0 for j > D / 2 , L ( x ) × a L ( y ) = x × y = x × m y (7) gi v en the notion of the splitting of signals (of e v en length) into halv es, the lo wer -inde x ed coef cients of a are denoted as L ( a ) . The abo v e statement introduces us to the concept of ”zero-padding” which is to append D zeros to signals of length D so that the signals’ ac yclic con v olution is identical to the c yclic (or the ne g ac yclic) con v olution of the tw o padded sequences. While the classical c yclic con v olution is applied to accelerated polynomial multiplication modulo X N 1 , ne g ac yclic con v ol ution is used to implement polynomial multiplication modulo X N + 1 . No w , that we ha v e established the use of ne g ac yclic con v olution to perform modular multi plication, we e xplain the possible approach of zero-padding that has been implemented as part of the TFHE Library which uses the standard c yclic con v olution and FFT to perform ne g ac yclic con v olution. W e note that: X 2 N 1 = ( X N + 1)( X N 1) (8) The goal is to obtain the product of tw o N -point polynomials modulo ( X N + 1) . First , the product modulo ( X 2 N 1) via c yclic con v olution is computed, i.e., using the standard F ourier transform denition. T o obtain the product modulo ( X N + 1) , the result needs to be reduced to t he intermediate result modulo X N + 1 . The description of ho w this reduction w orks w as presented in [14] and is used here for compl eteness. Let us dene a signal p Z [ X ] of de gree N 1 . Its ne g ac yclic e xtension ¯ p (of length 2 N ) is dened as (9): ¯ p [ X ] = p [ X ] X N × p [ x ] (9) At each multiplication by X with the polynomial, the coef cients are circularly shifted one position to the right and the entering coef cient is ne g ated. The resulting signal is such that its F ourier image will ha v e zeros at all the e v en positions and the remai n i ng coef cients will be mirrored and conjug ated. Thus, multiplication of tw o such signals p and q can be performed as (10): p × q mo d ( X N + 1) = 1 2 F 1 ( F ( ¯ p ) · F ( ¯ q ))[0 ...N 1] (10) Note that after F 1 , the coef cients are ne g ac yclic, hence we can only tak e the rst half of the v ector . This approach of ne g ati v e wrapped con v olution follo wed in the TFHE library is also described in Algorithm 1. Thus, the TFHE library uses a 2 N length FFT architecture to perform multiplication. In our proposed approach, we use length N FFT ar chitecture to achie v e the same result for modular polynomial multiplication. Int J Recongurable & Embedded Syst, V ol. 14, No. 3, No v ember 2025: 659–675 Evaluation Warning : The document was created with Spire.PDF for Python.
Int J Recongurable & Embedded Syst ISSN: 2089-4864 665 2.4. A utomatic generation of r egister -transfer le v el le v el design An important step in g ate bootstrappi n g is the generation of R TL designs for the in v olv ed functions. In this conte xt, we e xplore the generalization of the hardw are back-end of the SGen tool, presented in [15], to automatically generate data-path designs for FFTs. SGen is an open-source tool that generates hardw are desi gns for v arious signal processing trans forms. It focuses on designs that operate on streaming data, where the input dataset is di vided into smaller chunks processed o v er multiple c ycles, leading to reduced resource usage. The FFT data path comprises O (log 2 N ) cascaded stages for a signal of length N , as sho wn in Figure 3. Each stage processes w elements (w ords) per c ycle, with a streaming width of w . Consequently , the number of c ycles required to recei v e the entire input (or output) is N w . The hardw are in each stage is reused for subsequent sets of elements, ef fecti v ely performing a ”v ertical folding” of the signal. SGen implements the Coole y-T uk e y algorithm for streaming FFT and of fers an area-ef cient alternati v e to the w ork presented in [16] for comple x data type FFT problems . A detailed comparison of these tools is presented later in this paper . Our w ork in v estig ates the applicability of SGen as a tool for generating FFT data paths in the case of TFHE with 128-bit security le v el. W e will le v erage the data-path design generated by SGen and apply it to our proposed polynomial multiplication Algorithm 2. Figure 3. Streaming FFT architecture 3. IMPLEMENT A TION This section describes the fundamental b uilding blocks of the bootstrapping approach imple mented on the FPGA. Figure 4 pro vides a high-le v el o v ervie w of the entire architecture. Through comple xity analysis, we identied the time-critical functions and tar geted them for hardw are acceleration. Har dwar e design for fast gate bootstr apping in fully homomorphic encryption o ver the T orus (Saru V ig) Evaluation Warning : The document was created with Spire.PDF for Python.
666 ISSN: 2089-4864 Figure 4. High-le v el system o v ervie w The BLINDR O T A TE function is the core of the bootstrapping process , responsible for refreshing the noise. It performs a coef cient rotation by multiplying an encrypted polynomial with a n encrypted po wer of X . This rotation is achie v ed using the CMUX g ate, as illustrated in Figure 2. The CMUX g ate tak es tw o TL WE samples, d 0 and d 1 , as inputs, along with a control TGSW sample, C . Its output is a TL WE sample and is computed as (11): C ( d 1 d 0 ) + d 0 , (11) where denotes the e xternal pr oduct , homomorphic to the e xternal R -modulo product between the respecti v e plainte xts. The e xternal pr oduct is formally dened as (12): T GS W × T LW E T LW E . (12) The e xternal product operation in homomorphic encryption schemes is kno wn to ha v e tw o com- putationally e xpensi v e functions: the FFT and add-multiply (AddMul). As illustrated i n Figure 2, each B LI N D R O T AT E operation, performed with def ault security parameters, in v olv es 630 C M U X operations, one for each coef cient of the bootstrapping k e y polynomial. F or each C M U X within the e xternal product, the forw ard FFT is e x ecuted si x times, follo wed by the AddMul operation and the in v erse FFT . Our e xperiments indicate that, for a simple homomorphic comparison of tw o 16-bit numbers, the FFT is performed 309,846 times. T able 1 details the FFT counts for all basic operations in the e xample code. F ollo wing the bootstrapping process, a k e y-switching function is applied. This operation also presents an opportunity for optimization, which will be discussed later in this section. T able 1. FFT counts for 16-bit inte ger comparison Operation K e y generation 2-input g ates Mux Comparison function FFT counts 15120 3780 7554 309846 3.1. F ast F ourier transf orm The forw ard and in v erse FFTs share the same underlying archi tecture, with the sole distinction lying in the applied multiplicati v e twiddle f actors. In this paper , we only discuss the data-path a forw ard FFT . The core ba ck-end architecture, generated by SGen, utilizes the Coole y-T uk e y algorithm iterati v ely to perform the FFT operation. As illustrated in Figure 5, SGen implements a fully streaming FFT , thereby enhancing resource utilization. A signal of length 2 n is permuted o v er 2 t c ycles in chunks of size 2 k elements, where n = k + t . The generated R TL incorporates additional optimizations, including: Simplication of read-only memories (R OMs) containing periodic v alues by replacing them with constants. Replacement of single-v alue R OMs with constants. Int J Recongurable & Embedded Syst, V ol. 14, No. 3, No v ember 2025: 659–675 Evaluation Warning : The document was created with Spire.PDF for Python.
Int J Recongurable & Embedded Syst ISSN: 2089-4864 667 Simplication of tri vial arithmetic operations. P airing of multiple x ers and R OMs sharing identical v alues. These optimizations further impro v e the area and resource ef cienc y of the generated hardw are design. Figure 5. Ov ervie w of FPGA implementation of FFT 3.1.1. Design decisions Automatic R TL generator: SGen of fers the e xibility to customize v arious parameters based on spe- cic design requirements. T o e v aluate its performance, we compared the latenc y of SGen and SPIRAL [16] for the FFT of a 64-point, 32-bit x ed-point signal. The results are presented in T able 2. As sho wn, SGen e xhibits superior latenc y performance compared to SPIRAL. Additionally , SGen pro vides greater e xibility in selecting data size and type. While SPIRAL is limited to 32-bit x ed-point data, SGen is capable of generating R TL for 64-bit x ed-point data types. T able 2. Comparison between SPIRAL and SGEN Radix 2 4 8 Streaming width 2 4 8 32 64 4 32 64 8 32 64 SPIRAL 165 116 92 57 42 92 40 25 76 46 31 SGEN 131 90 67 34 22 83 26 14 66 30 18 Data type and size: as pre viously discussed, SGen allo ws selection from v arious data types (s ingle, double, and x ed-point). In the TFHE library , FFT operations are performed using the FFTW3 library [17] with double-precision (64-bit) data. W e e xperimented with the single-precision (32-bit) v ersion of FFTW3 for the def ault TFHE use case, b ut switching to single-precision oats resulted in inaccurate post-decryption results. Therefore, we opted for a 64-bit w ord length for the R TL design. Ha ving determined the w ord length, we aimed to select the data type (x ed-point or oat ing-point) by performi ng simulations. W e e x ecuted FFTs on the same signals using MA TLAB R2020b and the FFTW3 library implementation in Eclipse. The resulting error mar gins are presented in T able 3. Our goal w as to choose precision with the minimal dif ference compared to the double-precision FFTW3 implementation. These e xper - iments used signals within the operational bounds of TFHE. Research indicated that the inte ger coef cients of polynomials used for bootstrapping FFTs typically reside in the range [ 64 , 63] . W ith 64-bit x ed-point rep- resentation, the maximum error mar gin reached e 5 (14 decimal points). As e xpected, increas ing the decimal point precision reduced the error mar gin, ranging from e 10 for 30 decimal points to e 5 for 14 decimal points. Double-precision oating-point yielded the lo west error mar gin, reaching e 15 . Ho we v er , for a gi v en transform length, x ed-point implementations are generally f aster than oating-point implementations (approximately 3 times f aster). Consequently , we opted for 64-bit x ed-point data. It is note w orth y that in x ed-point designs, the precision length for the fractional part does not af fect the data path design. T able 3. Output dif ference between FFT function on MA TLAB and FFTW3 library for dif ferent data types Precision Double x ed, 14 x ed, 18 x ed, 22 x ed, 26 x ed, 30 Error mar gin e 16 e 5 e 6 e 8 e 9 e 10 Har dwar e design for fast gate bootstr apping in fully homomorphic encryption o ver the T orus (Saru V ig) Evaluation Warning : The document was created with Spire.PDF for Python.
668 ISSN: 2089-4864 Radix size: the Coole y-T uk e y algorithm implemented by SGen recursi v ely e xpresses the FFT as a composite size of N = N 1 N 2 . One of N 1 or N 2 acts as the radix, which controls the number of points processed by the basic computational block of the FFT , oft en referred to as the b uttery due to its data-path shape (see the magnied section of Figure 5 for a high-le v el vie w). The streaming width must be a multiple of the radix; that is, log 2 ( radix ) should be di visible by log 2 ( N ) . While a higher radix is computationally more comple x, it is also more ef cient due to the reduced number of multiplications and additions required. T o achie v e maximum ef cienc y , high I/O bandwidth between the CPU and FPGA is crucial to maintain a full pipeline at all times. Streaming width: SGen allo ws for v arying the streaming width from 2 to 256 comple x w ords per c ycle, depending on the chosen radix. A fully streaming architec ture enables continuous data o w into and out of the system. The Coole y-T uk e y FFT data path comprises O (log 2 N ) cascaded stages for a signal of length N . A streaming width of w w ords per c ycle requires N w c ycles to recei v e the entire input (or output). The ne xt section presents an e v aluation of performance (resource utilization and speed) as inuenced by both streaming width and radix. 3.1.2. Pr oposed con v olution appr oach W e propose a ne w method for performing multiplication using FFT , as sho wn in Algorithm 2. This method is applicable to signals of length N and utili zes an FFT data path of length N / 2 . W e e xtend the concept of NWC from nite elds to the eld of comple x numbers. A similar approach for real-v alued numbers w as pre viously proposed in [18]. W e assume that the twiddle f actors are pre-computed. The inputs are real-v alued signals of length N , which are folded to create a comple x signal of length N / 2 . T o v erify the correctness of our approach, we performed polynomial multi plication using the imple- mentation pro vided by the TFHE library (described in Algorithm 1) and the proposed approach (described in Algorithm 2) on v e dif ferent input signals. W e then calculated the a v erage dif ference between the correspond- ing coef cients of the nal product signals. W e x ed the range of the inte ger input coef cients to [-63, 64], consistent with the TFHE g adget decomposition function. Our ndings indicate that the proposed approach achie v es high accurac y e v en when using a half-size FFT data path. The total error mar gins were a v eraged at 7 . 81253 × 10 9 , with a median of 1 . 95309 × 10 9 . 3.2. AddMul The AddMul function performs multiplication follo wed by successi v e addition within a loop, as de- tailed in Algorithm 3. In the TFHE library , this operation i s performed in the frequenc y domain on signals after the FFT operation. The data type for this operation is double-precisi on oating-point. Based on our e xperiments, the input v alues are bounded within the range [ 20733 . 1 , 20316 . 4] . W e implemented the R TL design using MA TLAB Simulink R2020b . Th e implementation emplo yed 64-bit x ed-point data points with an input representation of (1 , 64 , 48) and an output representation of (1 , 64 , 33) . This x ed-point imple menta- tion resulted in a minimal error mar gin between e 7 and e 8 on the output compared to the double-precision representation. While increasing the streaming width can enable f aster e x ecution by performing operations in parallel, this needs to be carefully balanced ag ainst the resulting increase in resource utilization. 3.3. K ey switching K e y switching is a well-established procedure in the literature [6], [7], allo wing the transition between k e ys with dif ferent parameters. Our analysis of this function’ s proling re v ealed that the shift operation, as presented in (3.3.), is the most time-consuming step. This operation, performed on 32-bit inte ger data in TFHE, e xhibits high frequenc y and lo w pot ential for hardw are acceleration. Due to its con v ersion into a single-c ycle CPU instruction, the shift operation will not benet signicantly from an R TL implementation. Therefore, the implementation of this function in R TL is omitted in this w ork. aij = ( aibar > > (32 ( j + 1) basebit ))& mask (13) Int J Recongurable & Embedded Syst, V ol. 14, No. 3, No v ember 2025: 659–675 Evaluation Warning : The document was created with Spire.PDF for Python.