International Journal of Reconfigurable and Embedded Systems (IJRES)
Vol. 14, No. 1, March 2025, pp. 1-11
ISSN: 2089-4864, DOI: 10.11591/ijres.v14.i1.pp1-11

Implementing a very high-speed secure hash algorithm 3 accelerator based on PCI-express

Huu-Thuan Huynh, Tuan-Kiet Tran, Tan-Phat Dang
University of Science, Vietnam National University, Ho Chi Minh City, Vietnam

Article history: Received May 6, 2024; Revised Jul 25, 2024; Accepted Aug 12, 2024

Keywords: Edge computing; Hardware accelerator; KECCAK; Peripheral component interconnect express; Secure hash algorithm 3

ABSTRACT
In this paper, a high-performance secure hash algorithm 3 (SHA-3) accelerator is proposed to handle massive amounts of data for applications such as edge computing, medical image encryption, and blockchain networks. This work not only focuses on the SHA-3 core, as in previous works, but also addresses the bottleneck caused by transfer rates. Our proposed SHA-3 architecture serves as a hardware accelerator for personal computers (PC) connected via peripheral component interconnect express (PCIe), enhancing data transfer rates between the host PC and dedicated computation components like SHA-3. Additionally, the throughput of the SHA-3 core is enhanced based on two different proposals for the KECCAK-f algorithm: re-scheduled and sub-pipelined architectures. Multiple KECCAK-f instances are applied to maximize data transfer throughput. A configurable buffer in/out (BIO) is introduced to support all SHA-3 modes, which is suitable for devices that handle various hashing applications. The proposed SHA-3 architectures are implemented and tested on a DE10-Pro board featuring a Stratix 10 1SX280HU2F50E1VG FPGA and PCIe, achieving throughputs of up to 35.55 Gbps and 43.12 Gbps for multiple-re-scheduled-KECCAK-f-based SHA-3 (MRS) and multiple-sub-pipelined-KECCAK-f-based SHA-3 (MSS), respectively.
This is an open access article under the CC BY-SA license.

Corresponding Author:
Tan-Phat Dang
University of Science, Vietnam National University Ho Chi Minh City, Vietnam
Email: dtphat@hcmus.edu.vn
Journal homepage: http://ijres.iaescore.com

1. INTRODUCTION
The increasing demand for massive amounts of real-time data transformation and processing necessitates high-performance data servers. This process acquires data from various sources, such as remote devices via the Internet, and then transmits it to dedicated hardware like graphics processing units (GPU) or hardware accelerators through burst or streaming mechanisms. Peripheral component interconnect express (PCIe) has been utilized to enhance the data transfer rate to dedicated hardware [1], [2]. Operating on a point-to-point topology, PCIe enables devices to communicate directly with other components without sharing bandwidth with other devices on the bus. PCIe utilizes multiple independent lanes, ranging from one to 32, for data transfer between the host and the end device. Each lane comprises two pairs of differential signaling wires, one for transmitting data (Tx) and one for receiving data (Rx). Therefore, the PCIe operation speed can range from 2.5 GT/s to 32 GT/s for Gen1 to Gen5, respectively. Furthermore, the implementation of direct memory access (DMA) for PCIe to eliminate central processing unit (CPU) intervention has appeared in earlier works [1], [3], which involves transferring data from the main memory of the host device to a temporary DMA local
register before sending it to the address of the end device.

Regarding dedicated computing hardware, field-programmable gate array (FPGA)-based hardware accelerators for cryptography have attracted considerable research attention [4], [5] because of the increasing need for robust cryptographic algorithms to secure sensitive information and communications. Cryptographic hash functions play a fundamental role in ensuring integrity [6] and authenticity [7], [8]. In recent years, one such hash function that has garnered significant attention and adoption is SHA-3. Standardized by the National Institute of Standards and Technology (NIST) in 2015, SHA-3 represents the latest iteration in the secure hash algorithm (SHA) family [9]. Unlike its predecessor SHA-1, which has been susceptible to vulnerabilities and collision attacks [10], SHA-3 offers enhanced security properties and resistance to known cryptographic attacks. SHA-3 is designed to produce fixed-size hash values, or message digests, from input data of arbitrary length.

In high-performance applications, SHA-3 is utilized more and more frequently. For multimedia data, such as image encryption, the SHA-3 algorithm is employed to generate key streams from multiple blocks that are divided from the original images [11], [12]. In secure channels, the transmission of medical data and high-definition images between doctors and patients often necessitates hashing to prevent malicious modification, which is a crucial requirement in the medical field [13], [14]. Moreover, with the increase of internet of things (IoT) devices, the adoption of edge and fog computing has become increasingly common [15]. This trend has led to the emergence of high-performance devices optimized for processing speed, with a particular focus on security, including hash function algorithms [16], [17].
Consequently, there is a growing demand to enhance the performance of cryptographic algorithms to protect the vast amounts of data transmitted between these devices. On the other hand, hash functions like SHA-3 play a crucial role in blockchain technology, ensuring the integrity, security, and transparency of distributed ledger systems. The hash function helps maintain transaction integrity based on the Merkle tree structure [18]. Notably, miners are tasked with validating transactions under consensus mechanisms such as proof of work (PoW) [19]. To be eligible for rewards, miners must quickly generate nonces, underscoring the need for a high-performance hash function [20].

To enhance the performance of SHA-3, various research efforts have been conducted, ranging from software optimizations on GPUs to hardware accelerators [21]-[29]. The efficiency of implementing SHA-3 in a GPU environment has been demonstrated in [21]. Parallel Thread eXecution (PTX) is utilized to leverage the parallel permutation capabilities of the SPONGE construction, and compute unified device architecture (CUDA) streams are employed to enable GPUs to receive and compute data simultaneously. Moreover, hardware accelerators for SHA-3 on FPGA are more attractive than implementing it in a GPU environment due to reduced technology dependence. A significant number of works aim to enhance the throughput and efficiency of SHA-3 through unrolling, pipelined, and sub-pipelined techniques and optimization of the arithmetic core (KECCAK-f) [23]-[29]. Unlike other works using a hardware description language (HDL), the work in [22] uses open computing language (OpenCL) to implement SHA-3 as a co-processor on FPGA to demonstrate the efficiency of the hardware implementation of SHA-3.

In this paper, we adopt an FPGA-based hardware design approach for implementing the SHA-3 algorithm using the Verilog language.
This choice is motivated by the fact that SHA-3 computations mainly involve permutations using XOR, AND, and NOT gates, as well as inherent parallel processing capabilities. Unlike previous works [23]-[29] that primarily focus on the KECCAK-f function, we also address other core components of the SHA-3 algorithm, such as the input and output buffers. Moreover, high-performance applications not only demand high-speed dedicated hardware but also require efficient data transmission. To address this requirement, we utilize PCIe, which enables high-throughput communication and leverages the computational power of the PC for data setup and management via software. The key contributions of our proposed methods are as follows:
- We present our SHA-3 design implemented on FPGA as a hardware accelerator for a PC via PCIe links. DMA read and write are used to accelerate the data transfer rate without CPU intervention. In addition, ping-pong memory enables simultaneous computation and data transmission between the PC and our SHA-3 accelerator, thereby maximizing the parallel processing capabilities of SHA-3.
- To support various applications, we introduce configurable buffers that are flexible enough to switch between modes and minimize buffer usage for both input and output data while maintaining flexibility and efficiency.
- Multiple KECCAK-f instances are introduced to maximize performance. Two architectures for KECCAK-f, the re-scheduled and sub-pipelined architectures, are presented, contributing to overall performance enhancement in our SHA-3 design.
Int J Recongurable & Embedded Syst ISSN: 2089-4864 3 The remaining sections of the paper are or g anized as follo ws. Section 2 pro vides background i nforma- tion on SHA-3 algorithms. Our hardw are design is comprehensi v ely analyzed in section 3, co v ering the model of the SHA-3 accelerator to PC through PCIe, congurable b uf fers, and multiple KECCAK- f architecture. Ev aluation and comparison of our resul ts with other approaches are presented in section 4. Finally , section 5 concludes the paper . 2. SHA-3 PRELIMIN AR Y The construction of SHA-3 dif fers from the Merkle–Damg ard design in SHA-1 and SHA-2, instead adopting the SPONGE construction [9], which is comprised of tw o main phases: absorbing and squeezing, as sho wn in Figure ?? . Prior to the absorbing and squeezing phases, the input message m of arbitrary length under goes a padding process. This ensures that the message is e xpanded to a multiple of r bits (1152, 1088, 832, or 576 bits) by appending the pattern ”10*1” . Ho we v er , the SHA-3 hash function requires that the message m must append the suf x ”01” to support domain separation [ ? ]. Consequently , the pattern ”0110*1” is appended to message m , as illustrated in Figure ?? . During the absorbing phase, the padded message m is partitioned into se v eral blocks of size r . Each r -sized block is then combined wit h a capacity c to form a 1600-bit block, which is subsequently processed sequentially by each KECCAK- f function until all blocks are processed. In the squeezing phase, the length of the output d (224, 256, 384, or 512 bits) can v ary depending on the selected mode. 0 0 f f f f r c f Block 0 Block 1 Block (n-1) 01 10...1 Absorbing Squeezing r r r message m padding output d     f: KECCAK-f     r = 1 152/1088/832/576 bits for SHA3-224/256/832/512.     d = 224/256/384/512 bits for SHA3-224/256/832/512.     c = b - r , where b = 1600 bits. θ ρ π χ ι Figure 1. 
Padding and SPONGE construction

The KECCAK-f function is a fundamental component of SHA-3 and is utilized in both the absorbing and squeezing phases. The input message is converted into a three-dimensional array, indexed by x, y, and z, forming a 5 × 5 × 64 state array. The KECCAK-f function operates on this state array during 24 rounds of a round function (Rnd), with each round consisting of five step mappings: θ (theta), ρ (rho), π (pi), χ (chi), and ι (iota).

3. DESIGN AND IMPLEMENTATION
In this section, we provide an overview of the proposed SHA-3 architecture at the system level. We analyze the comprehensive data flow, showing the interaction with each component within the system. Next, we discuss the implementation of configurable buffers to support multiple modes. Lastly, our re-scheduled and sub-pipelined techniques for the KECCAK-f function are introduced in detail.
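The "0110*1" padding rule and the 5 × 5 lane layout described in section 2 can be sketched in a few lines of Python. This is an illustrative model written for this text, not the authors' Verilog; with byte-aligned messages, the "01" suffix plus the first pad bit becomes the byte 0x06 and the final pad bit the byte 0x80, following the little-endian bit ordering of FIPS 202.

```python
# Illustrative SHA-3 padding and state mapping (not the paper's RTL).

RATE_BITS = {224: 1152, 256: 1088, 384: 832, 512: 576}  # block size r per mode

def pad(message: bytes, mode: int) -> bytes:
    """Append the '0110*1' pattern: '01' suffix, then '10*1' up to a multiple of r."""
    r_bytes = RATE_BITS[mode] // 8
    padded = bytearray(message)
    pad_len = r_bytes - (len(padded) % r_bytes)
    if pad_len == 1:
        padded.append(0x86)                 # suffix and final '1' share one byte
    else:
        padded.append(0x06)                 # '01' suffix + first pad bit
        padded.extend(b"\x00" * (pad_len - 2))
        padded.append(0x80)                 # final '1' of the '10*1' pattern
    return bytes(padded)

def to_state(block: bytes):
    """Zero-extend one r-bit block to 1600 bits and map it onto lanes A[x][y],
    where lane (x, y) holds bits 64*(5y + x) .. 64*(5y + x) + 63."""
    lanes = block + b"\x00" * (200 - len(block))
    return [[int.from_bytes(lanes[8 * (5 * y + x): 8 * (5 * y + x) + 8], "little")
             for y in range(5)] for x in range(5)]

p = pad(b"", 256)
print(len(p), hex(p[0]), hex(p[-1]))   # 136 0x6 0x80: one full 1088-bit block
```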
4 ISSN: 2089-4864 3.1. Ov er view ar chitectur e System le v el o v erall architecture and data o w for SHA-3 accelerator described in Figure 2. The sys- tem architecture of the proposed SHA-3 accelerator is depicted in Fi gure ?? (a). The PC serv es as a host serv er , recei ving requests from v arious remote de vices and responding to the resul ts. Requests related to the hash function are transmitted to the SHA-3 accelerator via PCIe. In this w ork, we utilize Intel i ntellectual property (IP) named Intel L/H-T ile A v alon-MM for PCIe on the DE10-Pro de vice [ ? ]. Specically , PCIe Gen 3x8 is emplo yed, operating at a frequenc y of 250 MHz, allo wing for a throughput of up to 63 Gbps. Additionally , this IP serv es as a bridge, con v erting PCIe protocol to the A v alon b us. Before the SHA-3 accelerator be gins opera- tion, essential information, such as the size of the processed string, is transferred to i ts Control/Status Re gisters block. This is emplo yed through the use of a base address re gister (B AR) with 32-bit non-prefetchable memory . T o enable high-performance transmission, a DMA engine is emplo yed along with separate read-and-write data modules. Additionally , tw o random-access memories (RAM) operate in a ping-pong manner for both input and output data. After the hash computation is completed, the hash v alues cannot immediately be sent t o the PC; the y must w ait for a request from the PC. Therefore, a ping-pong RAM Out is utilized to temporarily store the hash v alues. The ping-pong w ay ensures that the dedicated hardw are accelerator remains fully utilized, min- imizing an y idle time. T w o clock domains are utilized in this system. The rst clock operates at a frequenc y of 250 MHz for the PCIe IP , while the second clock operates at the frequenc y of the SHA-3 accelerator . This conguration optimizes the throughput of each domain to accelerate the entire system. 
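The benefit of the ping-pong scheme can be illustrated with a small discrete-time model (our simplification; the timings below are arbitrary placeholders, not measured values): with a single RAM the accelerator must alternate transfer and compute, while two RAMs let the DMA fill one slot as the SHA-3 core drains the other.

```python
def single_buffer_time(n_blocks, t_xfer, t_comp):
    """One shared RAM: each block is transferred, then hashed, strictly in turn."""
    return n_blocks * (t_xfer + t_comp)

def ping_pong_time(n_blocks, t_xfer, t_comp):
    """Two RAM slots: the DMA fills one slot while the core processes the other."""
    prev_xfer_end = 0          # the single DMA channel is serial
    slot_free = [0, 0]         # when each RAM In slot becomes reusable
    comp_free = 0              # when the SHA-3 core becomes free
    for i in range(n_blocks):
        s = i % 2
        start = max(prev_xfer_end, slot_free[s])
        prev_xfer_end = start + t_xfer
        comp_start = max(prev_xfer_end, comp_free)
        comp_free = comp_start + t_comp
        slot_free[s] = comp_free          # slot frees once its block is hashed
    return comp_free

# Arbitrary illustrative timings: transfer = 4 units, hash = 6 units.
print(single_buffer_time(4, 4, 6))   # 40
print(ping_pong_time(4, 4, 6))       # 28: transfers hidden behind computation
```

In steady state the ping-pong model is limited only by the slower of the two activities, which mirrors the paper's point that filling one RAM while computing from the other keeps the accelerator busy.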
The proposed SHA-3 accelerator comprises three main components: padding, buffers including buffer in (BI) and buffer out (BO), and multiple KECCAK-f units. The padding and BI operations are executed concurrently. Once BI accumulates sufficient data serially from RAM In 0/1, the output of this buffer is combined in parallel with the data output from the padding unit through an OR operation. This data is then fed into the multiple KECCAK-f units, which process multiple data blocks simultaneously. The hash values generated by the multiple KECCAK-f units are transferred to BO in parallel. Subsequently, BO serially writes the data to RAM Out.

The data, comprising multiple short and long messages intended for SHA-3 processing, is stored in the system memory of the PC. Under CPU control using Terasic's PCIe driver, this data is continuously transferred from the system memory to the SHA-3 accelerator. In cases where the data size exceeds the capacity of RAM In 0/1, the ping-pong mechanism comes into play. As depicted in Figure 2(b), long data is initially transferred from the system memory to RAM In 0, and subsequently to RAM In 1. Once RAM In 0 is filled with Data 0, the SHA-3 core computes the hash value and temporarily stores the results in RAM Out 0. Concurrently, the SHA-3 core initiates processing of Data 1 from RAM In 1. Once all results of Data 0 are available in RAM Out 0, they are read using DMA read and returned to the PC's system memory. The result of Data 1 is transferred from RAM Out 1 to the system memory once the DMA read for Data 0 is completed. Thus, the ping-pong approach for RAM In and RAM Out facilitates pipeline processing at the system level for increased performance.

Figure 2(c) illustrates three stages for data transfers and hashing computation. In stage 0, the PCIe link connected to the PC retrieves data from the system memory, operating at a speed of 63 Gbps according to Intel specifications [?].
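The 63 Gbps figure for the Gen 3x8 link can be reproduced from general PCIe parameters (8 GT/s per lane and 128b/130b line coding are standard PCIe Gen3 facts, not values taken from the paper):

```python
def pcie_usable_gbps(gts_per_lane: float, lanes: int, enc_num: int, enc_den: int) -> float:
    """Usable line rate after encoding overhead, in Gbps."""
    return gts_per_lane * lanes * enc_num / enc_den

# PCIe Gen3: 8 GT/s per lane with 128b/130b coding; eight lanes (x8).
gen3_x8 = pcie_usable_gbps(8.0, 8, 128, 130)
print(round(gen3_x8, 1))   # 63.0, matching the Intel IP figure quoted above
```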
Moving on to stage 1, the PCIe IP utilizes the Avalon-MM master to transfer data to RAM In according to the DMA technique. The DMA process supports burst transfers on a 256-bit interface width at a frequency of 250 MHz. While the throughput of DMA can reach up to 64 Gbps, the burst count is limited to 5 bits. Consequently, when the data size exceeds 8192 bits, the throughput of DMA becomes unstable. Our experiments show that the throughput of the DMA stage fluctuates between approximately 20 Gbps and 55 Gbps, thereby impacting the SHA-3 block. To mitigate this issue, ping-pong memory is employed for RAM In, where one memory receives data while the other provides data for computation. Hash computation is initiated only when one of the two memories is filled, ensuring that DMA does not affect stage 2. In addition, another crucial factor influencing DMA throughput is the size of each RAM In. Small sizes can lead to unstable DMA throughput, while large sizes may result in redundancy. Based on experimental results, a size of 10 KB for each RAM In 0/1 is chosen, as detailed in section 4. Stage 2 relies on the SHA-3 computation rate, which can run up to 43 Gbps.

The detailed SHA-3 architecture is shown in Figure 3, progressing from a coarse to a fine level of granularity. Specifically, the two primary components, the buffer and the multiple KECCAK-f units, illustrated in Figure 3(a), are explained in more depth in the subsequent subsections. Additionally, the two optimized KECCAK-f architectures, namely re-scheduled and sub-pipelined, which have the greatest impact on the throughput of the overall design, are depicted in Figure 3(b). Furthermore, a more detailed explanation of the θ and (ρ-π-χ-ι) stages is provided in Figure 3(c), offering a deeper insight into the micro-architecture of the proposed
Int J Recongurable & Embedded Syst ISSN: 2089-4864 5 design, which will also be elaborated on in the subsequent subsection. SHA-3  Application (multiple messages) Intel L/H-T ile  A valon-MM  for PCI Express RAM In 0 RAM In 1 PC DMA  engine Read W rite BAR Control/Status  Registers Padding Buf fer Multiple KECCAK-f SHA-3 Clock domain 0 A valon Bus Clock domain 1 FPGA T erasic's PCIe driver Software  Library System  memory CPU PCIe  DIMM 256 bits 256 bits 32 bits RAM Out 0 RAM Out 1 Data 1 Data 0 256 bits  System memory: RAM In 0: Data 0 RAM In 1: Data 1 DMA   W rite SHA-3: Data 0 SHA-3: Data 1 RAM Out 0: Data 0 RAM Out 1: Data 1  System memory Data 0  System memory Data 1 DMA  Read  System  memory RAM In SHA-3 RAM Out PCIe PC FPGA time System  memory PCIe IP RAM In/Out SHA-3 DMA   W rite DMA  Read 63Gbps Stage 0 Stage 1 Stage 2 256 bits (a) (b) (c) Figure 2. Ov erall architecture and data o w for SHA-3 accelerator on the system le v el (a) system architecture, (b) data o w on the system le v el, and (c) the three stages in v olv e transferring data from the system memory to the SHA-3 accelerator Figure 3. The proposed SHA-3 architecture in detail (a) the b uf fer and multiple KECCAK- f architectures and (b) the tw o Rnd architectures: re-scheduled and sub-pipelined w ays, and (c) θ and ( ρ - π - χ - ι ) architectures 3.2. Buffer T o f acilitate e xibility in handling dif ferent modes, we introduce a congurable b uf fer capable of switching between BI and BO based on the selected mode. Each mode needs a distinct block size r , as il- lustrated in Figure ?? . F or SHA3-224 mode, with a data width of 256 bits, the input block size is 1152 bits, corresponding to v e BIs. Similarly , for SHA3-256/384/512 modes, the required number of BIs is 5/4/3, re- specti v ely . Con v ers ely , the output s ize for SHA3-224/256/384/512 modes in the 256-bit base is 1/1/2/2 BOs. 
To optimize buffer utilization, we propose four BIs (BI 0 to BI 3), one BO, and one BIO, as depicted in Figure 3(a). The buffers are cascaded, with the output of the preceding buffer serving as the input of the subsequent one. BIs only accept new input when the valid in signal is activated; otherwise, they retain
the current value. The BIO exhibits slightly more complexity than a BI, as it can receive data from the preceding buffer when the sel signal is triggered; otherwise, it functions as a BO, receiving hash values from the multiple KECCAK-f units. BO is responsible for retrieving hash values and serially pushing them to RAM Out. For the SHA3-224/256/384/512 modes, it takes 1/1/2/2 clock cycles to complete writing data to RAM Out. To streamline complexity, the receiving process in BI takes 4/4/5/5 clock cycles for the SHA3-224/256/384/512 modes, respectively. Each data word loaded into BI requires one clock cycle. Therefore, if a mode does not provide sufficient data within those clock cycles, zero inputs are inserted. For example, in the case of SHA3-512 requiring three blocks of 256 bits, the subsequent two blocks consist of zeros.

3.3. Multiple KECCAK-f
The multiple KECCAK-f module comprises a mapping block and three KECCAK-f instances, as illustrated in Figure 3(a). The 1152-bit data from the preceding phase is fed into the mapping block, which appends zeros to expand it to a 1600-bit data size. In our design, we opt for three KECCAK-f instances to reduce the interval of input data to 8 clock cycles. The output of each KECCAK-f instance is 512 bits in size, and depending on the selected mode, truncation is applied to the output data.

In this work, we introduce two architectures for KECCAK-f: the re-scheduled and sub-pipelined architectures, depicted in Figure 3(b). In a conventional architecture, the sequence of steps is θ-ρ-π-χ-ι, with a register placed at the end of the ι step to indicate the completion of one round [?]. Our re-scheduled architecture reorders these steps to ρ-π-χ-ι-θ by inserting a register between the θ and ρ steps. As a result, the re-scheduled architecture requires 25 repetitions to complete the hash value, one more than the base architecture.
During the rst repetition, only the θ step is implemented, while the remaining repetitions e x ecute all steps in the sequence of ρ - π - χ - ι - θ . The re-scheduled architecture of fers higher ef cienc y compared to the con v entional architecture. This is pro v en via synthesis results on the Stratix 10 de vice, re v ealing that the re-scheduled architecture achie v es a frequenc y of 336.36 MHz, surpassing the con v entional architecture’ s frequenc y of 321.85 MHz by 4.31%. Moreo v er , the re-scheduled architecture utilizes fe wer resources, with a reduction in adapti v e logic module (ALM) utilization of 16.67% (4214 ALMs compared to 5057 ALMs in the con v entional architecture). Unlik e pre vious w orks [28], [29], where the sub-pipelined technique typically inserts tw o re gis ters: one between the π and χ steps or between the θ and ρ steps and another at the end of the ι step, our sub- pipelined architecture uses re giste rs between the θ and ρ steps and another re gister before the θ step. This decision is based on the observ ation that the critical path of the θ step is greater than that of the ρ , π , χ , and ι steps. Specically , the θ step requires at least four XOR g ate le v els to complete, while the remaining steps need only AND and tw o XOR g ates, as sho wn i n Figure ?? (c). By isolating the θ step, we aim to impro v e the delay for KECCAK- f . Ho we v er , adding the re gister in the round increases the number of clock c ycles required, doubling it to 48 clock c ycles. T o mitig ate this increase in clock c ycles, our design is capable of handling tw o data simultaneously at tw o dif ferent stages. F or e xample, if data 1 is processed in the θ stage, data 2 is processed in the ( ρ - π - χ - ι ) stage. In the ne xt clock c ycle, data 1 mo v es to the ( ρ - π - χ - ι ) stage while data 2 transitions to the θ stage. Thus, the a v erage time to generate one hash v alue is reduced to 24 clock c ycles. 
The advantage of our sub-pipelined architecture is that it increases the frequency while maintaining a fixed number of clock cycles at 24, thereby increasing throughput.

In both the re-scheduled and sub-pipelined architectures, the five steps are consistently grouped into two parts: θ and (ρ-π-χ-ι). The formulation of the θ step is optimized by combining C[x] and D[x], as indicated by the red area in the θ part of Figure 3(c), denoted as CD[x] in (1). CD[x] serves as the shared element, utilized by A[x, y], and two levels of XOR operation are employed to reduce the delay of the θ step.

CD[x] = A[x-1, 0] ⊕ A[x-1, 1] ⊕ A[x-1, 2] ⊕ A[x-1, 3] ⊕ A[x-1, 4]
        ⊕ ROT(A[x+1, 0], 1) ⊕ ROT(A[x+1, 1], 1) ⊕ ROT(A[x+1, 2], 1) ⊕ ROT(A[x+1, 3], 1) ⊕ ROT(A[x+1, 4], 1)
A[x, y] = A[x, y] ⊕ CD[x]    (1)

The hardware implementation of the (ρ + π) steps utilizes a net connection, which requires no additional resources or delay, based on the combination of the (ρ + π) steps illustrated in [?]. Furthermore, the combination of the (ρ-π-χ-ι) steps is depicted in Figure 3(c). Unlike previous works [?], which utilized a 64-bit RC, we have simplified this process by storing only the non-zero bits of RC. Therefore, only bit positions 0, 1, 3, 7, 15, 31, and 63 are stored, effectively reducing resource usage.
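The combined CD[x] form in (1) is algebraically identical to the textbook θ step, since CD[x] equals D[x] = C[x-1] ⊕ ROT(C[x+1], 1) once the column parities are expanded. The following Python check (our verification script, operating on 64-bit lanes) confirms the equivalence on a random state:

```python
import random

MASK = (1 << 64) - 1

def rot(v, n):
    """64-bit left rotation."""
    return ((v << n) | (v >> (64 - n))) & MASK

def theta_conventional(A):
    """Textbook theta: column parities C, then D, then XOR into every lane."""
    C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
    D = [C[(x - 1) % 5] ^ rot(C[(x + 1) % 5], 1) for x in range(5)]
    return [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]

def theta_combined(A):
    """Theta via CD[x] as in (1): C and D folded into one shared XOR tree."""
    CD = []
    for x in range(5):
        v = 0
        for y in range(5):
            v ^= A[(x - 1) % 5][y] ^ rot(A[(x + 1) % 5][y], 1)
        CD.append(v)
    return [[A[x][y] ^ CD[x] for y in range(5)] for x in range(5)]

rng = random.Random(0)
A = [[rng.getrandbits(64) for _ in range(5)] for _ in range(5)]
assert theta_conventional(A) == theta_combined(A)
print("CD[x] formulation matches conventional theta")
```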
Int J Recongurable & Embedded Syst ISSN: 2089-4864 7 4. EV ALU A TION AND COMP ARISON This section presents the performance e v aluation of MRS and MSS on DE10-Pro, considering se v eral f actors that impact the throughput such as the size of each RAM In in the ping-pong w ay and the number of KECCAK- f instances. Furthermore, a comparison of KECCAK- f computation with pre vious w orks [23], [24] is conducted on V irte x 7 to indicate the adv antages and limitations of our multiple-re-scheduled-based KECCAK- f (MRK) and multiple-sub-pipelined-based KECCAK- f (MSK) architectures. 4.1. P erf ormance e v aluation on DE10-Pr o The SHA-3 hardw are accelerator , utilizing the DE10-Pro de vice, is connected to the PC (I ntel® Core™ i5-10400 2.9 GHz) via PCIe Gen 3x8, as illustrated in Figure ?? (a). This setup allo ws for functional testing and performance e v aluations. On the PC side, C code manages the transfer of data to RAM In 0/1. Each DMA operation lls one RAM In slot, with subsequent transfers populating the remaining slots in a ping-pong w ay . Subsequently , the PC promptly reads hash v alues from RAM Out for comparison with the golden data to v erify their accurac y . The e xperimental results of the tw o proposed architectures, MRS and MSS, in relation to f actors such as throughput, RAM size, and the number of KECCAK-f units across all modes, are presented in the line charts in Figure 4. These charts visually represent the optimal congurations, highlighting the correlation between these k e y f actors and their impact on the o v erall performance of the SHA-3 accelerator . Specically , Figure 4(a) illust rates the relationship between throughput and v arious RAM In sizes, while Figure 4(c) sho ws the relationship between throughput and the numbe r of KECCAK-f units. 
Additionally, Figure 4(b) provides a detailed view of the data flow, contributing to the analysis of bottlenecks and serving as a basis for determining the optimal number of KECCAK-f units for an efficient configuration.

The rate at which data is supplied plays a crucial role in determining the performance of the hardware accelerator. Specifically, in our system, which employs a ping-pong mechanism, the size of RAM In directly influences performance, as mentioned in subsection 3.1. As depicted in Figure 4(a), the relationship between SHA-3 performance and RAM In size was examined across all modes for both the MRS and MSS architectures. The MRS and MSS throughput experiences a notable increase within the range of 1 to 8 KB, followed by a gradual rise from 12 to 20 KB. Beyond this point, the throughput saturates for all modes in both architectures. Consequently, a total RAM In size of 20 KB (the size of each RAM In 0/1 is 10 KB) was selected, striking a balance between maximizing MRS and MSS throughput and minimizing resource utilization.

Our proposed SHA-3 architecture employs multiple KECCAK-f instances to enhance throughput. However, if too many KECCAK-f instances are utilized, a bottleneck phenomenon arises when the multiple KECCAK-f instances operate faster than the preceding parts, resulting in resource redundancy. Conversely, the architecture becomes inefficient when only a small number of KECCAK-f instances is used. The selection of the number of KECCAK-f instances considers various factors, including the preceding block architecture and the algorithmic characteristics of KECCAK-f. As illustrated in Figure 4(b), stage 0 (BI + padding) necessitates a maximum of nine clock cycles, including five cycles for processing data in the worst-case scenarios of the SHA3-384/512 modes, three clock cycles for overhead, and one clock cycle for waiting for the ready signal from the multiple KECCAK-f block.
Moreover, given that the KECCAK-f algorithm requires 24 repetitions for one digest value, stage 1 in Figure 4(b) must be completed within approximately eight clock cycles for optimal efficiency. Consequently, three KECCAK-f instances are chosen for the multiple KECCAK-f block. This relationship is further clarified in Figure 4(c), revealing the increasing throughput of the MRS and MSS architectures with one to three KECCAK-f instances. Beyond this range, however, the MRS and MSS throughput saturates with four KECCAK-f instances. Thus, the optimal number of KECCAK-f instances is determined to be three.

Our proposed architectures are evaluated based on throughput and efficiency. Throughput (TP), measured in Gbps, is calculated using (2), where #bit represents the number of bits of input data, Fmax denotes the maximum frequency obtained from synthesis results, and #clock indicates the number of clock cycles elapsed. Efficiency (Eff.), on the other hand, is determined by the ratio of throughput to the utilized resources, such as ALMs for Intel devices or slices for Xilinx devices. Equation (3) illustrates the calculation of efficiency based on throughput and resource utilization. The evaluation process involves the use of the Intel® Quartus® Prime Pro Edition Design Software Version 19.1 to obtain reports on frequency and resource utilization.

TP = (#bit × Fmax) / #clock    (2)
Eff. = TP / Area    (3)

The throughput measurement results of the two architectures, MRS and MSS, on DE10-Pro are presented in Table 1. To determine the number of clock cycles (#clock), each architecture is equipped with a counter that records the elapsed clocks, starting immediately when the core begins operation and stopping upon completion of the process. The data size for throughput measurement of the MRS and MSS architectures is tested up to 32 KB. Combining the data length with the operating frequencies of MRS and MSS, which are 280 MHz and 380 MHz, respectively, we compute the throughput for each mode. Specifically, the throughput for MRS is 35.55 Gbps, 33.60 Gbps, 27.69 Gbps, and 19.23 Gbps for the SHA3-224, SHA3-256, SHA3-384, and SHA3-512 modes, respectively. Similarly, for MSS, the throughput is 43.12 Gbps, 41.20 Gbps, 36.27 Gbps, and 25.11 Gbps for the SHA3-224, SHA3-256, SHA3-384, and SHA3-512 modes, respectively. The resources utilized by MRS and MSS are obtained from the Quartus tool: MRS utilizes 8273 ALMs and 8374 registers, while MSS utilizes 9485 ALMs and 12832 registers. As a result, the efficiency for all modes is as follows: 4.30 Mbps/ALM, 4.06 Mbps/ALM, 3.35 Mbps/ALM, and 2.32 Mbps/ALM for MRS, and 4.55 Mbps/ALM, 4.34 Mbps/ALM, 3.82 Mbps/ALM, and 2.65 Mbps/ALM for MSS.

Figure 4. The experiment results of MRS and MSS across all modes in (a) the relationship between throughput and different RAM In sizes, (b) the data flow timing chart of multiple KECCAK-f units, and (c) the relationship between throughput and the different numbers of KECCAK-f units

Table 1. The implementation results of the MRS and MSS architectures on DE10-Pro
Architecture  Freq. (MHz)  Area (ALM)  Reg.   TP (Gbps) 224/256/384/512  Eff. (Mbps/ALM) 224/256/384/512
MRS           280          8273        8374   35.55/33.60/27.69/19.23    4.30/4.06/3.35/2.32
MSS           380          9485        12832  43.12/41.20/36.27/25.11    4.55/4.34/3.82/2.65
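As a sanity check (not part of the authors' evaluation flow), the efficiency column of Table 1 can be reproduced from the reported throughput and ALM counts using (3):

```python
# Reported Table 1 figures: throughput (Gbps) per mode and ALM usage
MRS_ALMS, MSS_ALMS = 8273, 9485
mrs_tp = [35.55, 33.60, 27.69, 19.23]  # SHA3-224, -256, -384, -512
mss_tp = [43.12, 41.20, 36.27, 25.11]

def efficiency_mbps_per_alm(tp_gbps: float, alms: int) -> float:
    """Equation (3): Eff. = TP / Area, here in Mbps per ALM."""
    return tp_gbps * 1000.0 / alms

mrs_eff = [round(efficiency_mbps_per_alm(tp, MRS_ALMS), 2) for tp in mrs_tp]
mss_eff = [round(efficiency_mbps_per_alm(tp, MSS_ALMS), 2) for tp in mss_tp]
print(mrs_eff)  # matches the 4.30 / 4.06 / 3.35 / 2.32 column of Table 1
print(mss_eff)  # matches the 4.55 / 4.34 / 3.82 / 2.65 column
```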
Int J Recongurable & Embedded Syst ISSN: 2089-4864 9 4.2. Comparati v e analysis F or a f air e v aluation and comparison between our proposed architectures (MRK and MSK) and pre- vious ones [ ? ], [ ? ], we synthesize designs on V irte x-7 XC7VX485T using the V i v ado 2020 tool. Since the pre vious w orks [ ? ], [ ? ] focused solely on the KECCAK- f computation, T able ?? displays the synthesis results of the KECCAK- f computation only . Gi v en that all modes utilize the same KECCAK- f computation archi- tecture, T able ?? only presents the results for the SHA3-512 mode for comparison. Additionally , it f aci litates comparison with the proposal in [ ? ] because the design only supports the SHA3-512 mode. T able 2. The comparison of KECCAK- f computation architectures between our proposals and FPGA-based w orks on V irte x 7 Reference [ ? ], 2022 [ ? ], 2023 Our proposed architecture Approach Dual Rnd Unrolling f actor of 2 MRK MSK Fmax (MHz) - 378.73 380.95 485.67 Area (Slice) 1521 1375 3203 2917 Re gister - - 4831 9669 # clock/hash 12 12 8 8 TP (Gbps)* 22.90 18.18 27.43 34.97 Ef f. (Mbps/slice)* 15.11 13.22 8.56 11.99 F or SHA3-512 mode Our MRK architecture requires 3203 slices and 4831 re gisters, operating at a maximum frequenc y of 380.95 MHz and achie ving a throughput of 27.43 Gbps and an ef cienc y of 8.56 Mbps/slice . Con v ersely , the MSK architecture, aimed at reducing the critic al path of Rnd, utilizes more re gisters than MRK (9669 > 4831). Ho we v er , MSK outperforms MRK in terms of both throughput and ef cienc y , achie ving 34.97 Gbps and 11.99 Mbps/slice, respecti v ely . Sra v ani and Durai [ ? ] proposed the dual Rnd architecture, which utilizes one Rnd consisting of v e steps ( θ - ρ - π - χ - ι ) and re gisters cascading another Rnd and re gister to halv e the number of clock c ycles ( # clock/hash = 12), achie ving a throughput of 22.90 Gbps. 
However, the throughputs of our two architectures, MRK and MSK, are 1.20 times (27.43 vs. 22.90) and 1.53 times (34.97 vs. 22.90) higher than that achieved by the dual Rnd architecture. While our architectures prioritize high performance, their efficiency is slightly lower compared to the dual Rnd architecture of Sravani and Durai [23], with MRK being 0.56 times (8.56 vs. 15.11) and MSK being 0.79 times (11.99 vs. 15.11) that of theirs. However, despite the lower efficiency, the throughput acceleration of our MSK architecture (53%) surpasses the efficiency advantage of their dual Rnd architecture (26%). When comparing our proposals with that of Sideris et al. [26], who implemented an unrolling factor of 2 to halve the number of clock cycles (#clock/hash = 12), we observe significant improvements in throughput for both our MRK and MSK architectures. Specifically, our MRK architecture achieves a throughput 1.51 times higher (27.43 vs. 18.18), while our MSK architecture achieves a throughput 1.92 times higher (34.97 vs. 18.18) than the proposal of Sideris et al. [26]. However, despite these substantial throughput improvements, our efficiency is slightly lower, with MRK being 0.65 times (8.56 vs. 13.22) and MSK being 0.91 times (11.99 vs. 13.22) that of their design. Nonetheless, this decrease in efficiency is not considered significant when compared to the notable throughput accelerations of 51% and 92% for MRK and MSK, respectively.

5. CONCLUSION
The demand for high-performance hash functions for modern applications has emerged, especially for the latest hashing version, SHA-3. An improvement of SHA-3 throughput is proposed in this paper. Specifically, the full SHA-3 architecture is presented, from the buffers and the arithmetic core, KECCAK-f, to integration at the system level. The proposed architectures are designed on an FPGA platform, which is connected to a PC via PCIe.
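As background for the multi-mode support summarized here: the four fixed-length SHA-3 modes differ in their digest lengths (224, 256, 384, and 512 bits). Python's standard hashlib module, used here purely as an illustrative software reference and not part of the paper's toolchain, exposes all four, which is convenient for cross-checking hardware digests in a host-side test harness:

```python
import hashlib

# The four fixed-length SHA-3 modes and their digest sizes in bits
for name in ("sha3_224", "sha3_256", "sha3_384", "sha3_512"):
    h = hashlib.new(name, b"test vector")
    print(name, h.digest_size * 8, h.hexdigest())
```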
PCIe, which is widely used in modern applications, boosts the data transfer rate. The issue of data transfer in PCIe's DMA is analyzed and resolved through the implementation of ping-pong memory and the selection of appropriate memory sizes. Furthermore, the configurable BIO is presented to support multiple SHA-3 modes and minimize the number of buffer instances. This feature benefits modern applications that require various output lengths of the hash values. The proposed architectures, MRS and MSS, achieve a high throughput of up to 35.55 Gbps and 43.12 Gbps, respectively, thanks to the multiple KECCAK-f block combined with either the re-scheduled or the sub-pipelined architecture. MSS demonstrates greater efficiency compared to MRS, for instance, with 4.55 Mbps/ALM > 4.30 Mbps/ALM for the SHA3-224 mode. In addition, our
MRK and MSK achieve 27.43 Gbps and 34.97 Gbps for the SHA3-512 mode when implemented on Virtex-7, respectively.

REFERENCES
[1] L. Rota, M. Caselle, S. Chilingaryan, A. Kopmann, and M. Weber, "A PCIe DMA architecture for multi-gigabyte per second data transmission," IEEE Transactions on Nuclear Science, vol. 62, no. 3, pp. 972-976, 2015, doi: 10.1109/TNS.2015.2426877.
[2] J. Liu, J. Wang, Y. Zhou, and F. Liu, "A cloud server oriented FPGA accelerator for LSTM recurrent neural network," IEEE Access, vol. 7, pp. 122408-122418, 2019, doi: 10.1109/ACCESS.2019.2938234.
[3] H. Kavianipour, S. Muschter, and C. Bohm, "High performance FPGA-based DMA interface for PCIe," IEEE Transactions on Nuclear Science, vol. 61, no. 2, pp. 745-749, 2014, doi: 10.1109/RTC.2012.6418352.
[4] J.-S. Ng, J. Chen, K.-S. Chong, J. S. Chang, and B.-H. Gwee, "A highly secure FPGA-based dual-hiding asynchronous-logic AES accelerator against side-channel attacks," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 30, no. 9, pp. 1144-1157, 2022, doi: 10.1109/TVLSI.2022.3175180.
[5] M. Zeghid, H. Y. Ahmed, A. Chehri, and A. Sghaier, "Speed/area-efficient ECC processor implementation over GF(2^m) on FPGA via novel algorithm-architecture co-design," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 31, no. 8, pp. 1192-1203, 2023, doi: 10.1109/TVLSI.2023.3268999.
[6] S. Shin and T. Kwon, "A privacy-preserving authentication, authorization, and key agreement scheme for wireless sensor networks in 5G-integrated internet of things," IEEE Access, vol. 8, pp. 67555-67571, 2020, doi: 10.1109/ACCESS.2020.2985719.
[7] S. Jiang, X. Zhu, and L. Wang, "An efficient anonymous batch authentication scheme based on HMAC for VANETs," IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 8, pp. 2193-2204, 2016, doi: 10.1109/TITS.2016.2517603.
[8] L. Zhou, C. Su, and K.-H. Yeh, "A lightweight cryptographic protocol with certificateless signature for the internet of things," ACM Transactions on Embedded Computing Systems (TECS), vol. 18, no. 3, pp. 1-10, 2019, doi: 10.1145/3301306.
[9] Federal Information Processing Standards Publication, "SHA-3 standard: permutation-based hash and extendable-output functions," Aug. 2015, doi: 10.6028/NIST.FIPS.202.
[10] M. Stevens, E. Bursztein, P. Karpman, A. Albertini, and Y. Markov, "The first collision for full SHA-1," in Advances in Cryptology - CRYPTO 2017: 37th Annual International Cryptology Conference, Santa Barbara, CA, USA, August 20-24, 2017, Proceedings, Springer, 2017, pp. 570-596, doi: 10.1007/978-3-319-63688-7_19.
[11] X. Zhang, Z. Zhou, and Y. Niu, "An image encryption method based on the Feistel network and dynamic DNA encoding," IEEE Photonics Journal, vol. 10, no. 4, pp. 1-14, 2018, doi: 10.1109/JPHOT.2018.2859257.
[12] C. Zhu and K. Sun, "Cryptanalyzing and improving a novel color image encryption algorithm using RT-enhanced chaotic tent maps," IEEE Access, vol. 6, pp. 18759-18770, 2018, doi: 10.1109/ACCESS.2018.2817600.
[13] W.-K. Lee, R. C.-W. Phan, B.-M. Goi, L. Chen, X. Zhang, and N. N. Xiong, "Parallel and high speed hashing in GPU for telemedicine applications," IEEE Access, vol. 6, pp. 37991-38002, 2018, doi: 10.1109/ACCESS.2018.2849439.
[14] M. Sravani and S. A. Durai, "Bio-hash secured hardware e-health record system," IEEE Transactions on Biomedical Circuits and Systems, 2023, doi: 10.1109/TBCAS.2023.3263177.
[15] M. De Donno, K. Tange, and N. Dragoni, "Foundations and evolution of modern computing paradigms: cloud, IoT, edge, and fog," IEEE Access, vol. 7, pp. 150936-150948, 2019, doi: 10.1109/ACCESS.2019.2947652.
[16] T.-Y. Wu, Z. Lee, M. S. Obaidat, S. Kumari, S. Kumar, and C.-M. Chen, "An authenticated key exchange protocol for multi-server architecture in 5G networks," IEEE Access, vol. 8, pp. 28096-28108, 2020, doi: 10.1109/ACCESS.2020.2969986.
[17] W.-K. Lee, K. Jang, G. Song, H. Kim, S. O. Hwang, and H. Seo, "Efficient implementation of lightweight hash functions on GPU and quantum computers for IoT applications," IEEE Access, vol. 10, pp. 59661-59674, 2022, doi: 10.1109/ACCESS.2022.3179970.
[18] Z. Liu, L. Ren, Y. Feng, S. Wang, and J. Wei, "Data integrity audit scheme based on quad Merkle tree and blockchain," IEEE Access, 2023, doi: 10.1109/ACCESS.2023.3240066.
[19] S. Islam, M. J. Islam, M. Hossain, S. Noor, K.-S. Kwak, and S. R. Islam, "A survey on consensus algorithms in blockchain-based applications: architecture, taxonomy, and operational issues," IEEE Access, 2023, doi: 10.1109/ACCESS.2023.3267047.
[20] H. Cho, "ASIC-resistance of multi-hash proof-of-work mechanisms for blockchain consensus protocols," IEEE Access, vol. 6, pp. 66210-66222, 2018, doi: 10.1109/ACCESS.2018.2878895.
[21] H. Choi and S. C. Seo, "Fast implementation of SHA-3 in GPU environment," IEEE Access, vol. 9, pp. 144574-144586, 2021, doi: 10.1109/ACCESS.2021.3122466.
[22] H. Bensalem, Y. Blaquière, and Y. Savaria, "An efficient OpenCL-based implementation of a SHA-3 co-processor on an FPGA-centric platform," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 70, no. 3, pp. 1144-1148, 2022, doi: 10.1109/TCSII.2022.3223179.
[23] M. M. Sravani and S. A. Durai, "On efficiency enhancement of SHA-3 for FPGA-based multimodal biometric authentication," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 30, no. 4, pp. 488-501, 2022, doi: 10.1109/TVLSI.2022.3148275.
[24] S. El Moumni, M. Fettach, and A. Tragha, "High throughput implementation of SHA3 hash algorithm on field programmable gate array (FPGA)," Microelectronics Journal, vol. 93, p. 104615, 2019, doi: 10.1016/j.mejo.2019.104615.
[25] B. Li, Y. Yan, Y. Wei, and H. Han, "Scalable and parallel optimization of the number theoretic transform based on FPGA," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2023, doi: 10.1109/TVLSI.2023.3312423.
[26] A. Sideris, T. Sanida, and M. Dasygenis, "Hardware acceleration design of the SHA-3 for high throughput and low area on FPGA," Journal of Cryptographic Engineering, pp. 1-13, 2023, doi: 10.1007/s13389-023-00334-0.
[27] H. E. Michail, L. Ioannou, and A. G. Voyiatzis, "Pipelined SHA-3 implementations on FPGA: architecture and performance analysis," in Proceedings of the Second Workshop on Cryptography and Security in Computing Systems, 2015, pp. 13-18, doi: 10.1145/2694805.2694808.
[28] G. S. Athanasiou, G.-P. Makkas, and G. Theodoridis, "High throughput pipelined FPGA implementation of the new SHA-3 cryptographic hash algorithm," in 2014 6th International Symposium on Communications, Control and Signal Processing (ISCCSP), IEEE, 2014, pp. 538-541, doi: 10.1109/ISCCSP.2014.6877931.