TELK OMNIKA , V ol. 15, No . 4, December 2017, pp . 1884 1893 ISSN: 1693-6930, accredited A b y DIKTI, Decree No: 58/DIKTI/K ep/2013 DOI: 10.12928/telk omnika.v15.i4.6049 1884 A Load-Balanced P arallelization of AKS Algorithm Ar dhi Wiratama Baskara Y udha and Reza Pulungan z Depar tment of Computer Science and Electronics , F aculty of Mathematics and Na tur al Sciences , Univ ersitas Gadjah Mada, Y ogy akar ta, Indonesia z Corresponding author , e-mail: pulungan@ugm.ac.id Abstract The best kno wn deter ministic polynomial-time algor ithm f or pr imality testing r ight no w is due to Ag r a w al, Ka y al, and Sax ena. This algor ithm has a time comple xity O (log 15 = 2 ( n )) . Although this algor ithm is polynomial, its reliance on the cong r uence of large polynomials results in en or mous computational require- ment. In this paper , w e propose a par allelization technique f or this algor ithm based on message-passing par allelism together with f our w or kload-distr ib ution str ategies . W e perf or m a ser ies of e xper iments on an implementation of this algor ithm in a high-perf or mance computing system consisting of 15 nodes , each with 4 CPU cores . The e xper iments indicate that our proposed par allelization technique introduces a significant speedup on e xisting implementa tions . Fur ther more , the dynamic w or kload-distr ib ution str at egy perf or ms better than the others . Ov er all, th e e xper iments sho w that the par allelization obtains up to 36 times speedup . K e yw or d: Pr imality testing, AKS algor ithm, par allelization, load balancing, high-perf or man ce computing. Cop yright c 2013 Univer sitas Ahmad Dahlan. All rights reser ved. 1. Intr oduction Pr ime n umbers are the cor nerstone of n umber theor y . Mathematicians and n umber the- or ist, since ancient times , ha v e been f ascinated b y man y prob lems concer ning pr ime n umbers . In moder n time , man y of the most impor tant cr yptog r aph ic algor ithms rely on big pr ime n umbers to perf or m encr yption and decr yption. One of them is Riv est-Shamir-Adleman (RSA) algor ithm [1], which is no w widely used in stor age encr yption [2], digital cer tificate [3], and w eb secur ity [4], including in banking tr ansaction secur ity . RSA algor ithm depends on the f act that it is difficult to find the pr ime f actors of a b ig integer . Electronic F rontier F oundation (EFF) off ers $250,000 as a re w ard to the first individual or g roup who disco v ers a pr ime n umber with at least 1,000,000,000 decimal digits [5]. Searching f or a pr ime n umber is usua lly based on an efficient algor ithm that deter mines whether a giv en n umber is pr ime or composite . Such algor ithms are called pr imality testing algor ithms . Most of pr imality testing algor ithms are probab ilistic , namely the y cannot ascer tain the pr imality of a giv en n umber , b ut only pro vide a probability that the giv en n umber is pr ime . Miller- Rabin pr imality test [6, 7], f or instance , has an error r ate belo w 25%, which means that if the giv en n umber pa ss e s this test n times , then the probability that the n umber is pr ime is 1 0 : 25 n [8]. Solo v a y-Str assen [9] pr imalit y test, on the other hand, has an error r ate belo w 50%. Probabilis- tic pr imality testing algor ithms are relativ ely f ast, of lo w comple xity , b ut with tunab le accur acy . Ho w e v er , there are cases that require cer tainty that a giv en n umber is pr ime or not; and thus , probabilistic algor ithms cannot be used. In 2002, three Indian computer scientists Ag r a w al, Ka y al, and Sax ena [10] proposed a deter ministic— i.e . , non-probabilistic—pr imality testing algor ithm that r uns in polynomia l time; w e will ref er to this algor ithm as AKS algor ithm . This is the first deter ministic polynomial-time al- gor ithm f or pr imality testing. Since this seminal paper , the pr imality testing prob lem no longer resides in the comple xity classes of NP-Hard, NP , or ZPP [11]. AKS algor ithm, interestingly , is relativ ely simple and str aightf orw ard, while pre vious w or k b y other researchers attempted to sho w that pr imality testing is of polynomial time comple xity b y making comple x modifications on e xisting pr imality testing algor ithms [12]. Receiv ed March 20, 2017; Re vised October 26, 2017; Accepted No v ember 14, 2017 Evaluation Warning : The document was created with Spire.PDF for Python.
TELK OMNIKA ISSN: 1693-6930 1885 Since this theoretical breakthrough, man y researchers ha v e proposed theoretical and pr actical impro v ements to this algor ithm soon after it w as released in pub lic. Notab le among them are Lenstr a [13] and Ber nstein [14]. Ber nstein [14] proposed tw o pr actical possibilities f or accel- er ating AKS algor ithm with lo w-le v el speedup b y impr o ving the integer squar ing method and high- le v el speedup b y reducing the n umber of f or loop iter ations in the last step of the algor ithm. This included all state-of-the-ar t impro v ements on reducing the last f or loop iter ations and produced speedup of man y orders of magnitudes . These ha v e been incor por ated in the latest v ersion of AKS algor ithm. Lenstr a and P omer ance [15, 16], on the other hand, proposed theoretical impro v ements to AKS algor ithm and obtained a ne w technique with time comple xity O (log 6 ( n )) . The y modi- fied the or iginal AKS algor ithm b y decreasing the n u mber of iter ations in the f or loop . This is done b y replacing the use of the cyclotomic polynomials in AKS algor ithm b y a monic polynomial f ( x ) of deg ree r with integer coefficients such that the r ing Z [ x ] = ( f ( x ) ; n ) is a pseudofield. Ber n- stein in [17] proposed a fur ther theoretical impro v ement to AKS algor ithm with time comple xity O (log 4 ( n )) . The proposal also attempted to decrease the n umber of iter ations in the f or loop b y replacing the use of the cyclotomic polynomials b y r andom K ummer e xtensions of Z [ x ] =n . Cr andall and P apadopoulos [18] implemented a v ar iant of AKS algor ithm b y Lenstr a [13] and f ound that empir ically the time comple xity of the v ar iant is c log 6 ( n ) , where c is around 1,000 cloc k cycles . Li [19] also implemented the Lenstr a v ar iant of AKS algor ithm using C++ and NTL libr ar y to handle the polynomial data str ucture . In this implementation, a 15-decimal-digit pr ime n umber required around 3,000 seconds to compute in a single-processor computer . Menon in [20] implemented AKS algor ithm in SA GE (Softw are f or Algebr a and Geometr y Exper imentation), and produced an implementation, in which a 25-decimal-digit pr ime n umber required more than 4,000 seconds to compute in a single-processor computer . Cao [21] analyz ed the stor age space re- quirement f or AK S algor ithm and sho w ed that the required stor age space f or testing a n umber with length 1,024 bits is about 1,000,000,000 Gigab yte , which is p r actically inf easib le . This is due to the need to store e xtremely large polynomials dur ing the computation. Man y scientists ha v e made impro v ements on the or iginal AKS algor ithm, b ut sequential implementations of the algor ithm are still impr actical to use due to the e xpensiv e computation in v olv ed in each step and its stor age requirements . Future direction seems to be to w ards par allel implementations . This paper repor ts on our eff or t to de v elop a par allelization technique f or AKS algor ithm based on message-passing par allelism (using MPI) and to find out the best w or kload- distr ib ution str ategy f or the par allelization technique . The rest of the paper is organiz ed as f ollo ws: Section 2 presents the basis of AKS algo- r ithm. Section 3 descr ibes the proposed par allelization technique , together with f our accompan y- ing w or kload-distr ib ution str ategies . In Section 4, w e present the result of our e xper iments with the proposed par allelization technique and the f our w or kload-distr ib ution str ategies and pro vide analysis . Section 5 concludes the paper . 2. Preliminaries Let Z be the set of integers and let a and b be tw o positiv e integers . Let g cd ( a; b ) be the g reatest common divisor of a and b . The tw o integers a and b are relativ ely pr ime if and only if g cd ( a; b ) = 1 . Let ( a ) be the Euler’ s totient function, namely the n umber of positiv e integers smaller than a that are relativ ely pr ime to a . F or relativ ely pr ime a and r , let o r ( a ) be the order of a modulo r , namely the smallest integer k such that a k 1 (mo d r ) . Let a rem b be the remainder of integer division betw een a and b . In the ear liest v ersion of their pub lication, Ag r a w al, Ka y al, an d Sax ena obtained an algo- r ithm with the w orst-case time comple xity of O (log 12 ( n )) , where n is the giv en n umber . In this paper , w e are ref err ing to the latest v ersion (v ersion 6) of their pub lication [10] , in which the latest AKS algor ithm w as presented. The latest v ersion has incor por ated man y impro v ements proposed b y man y researchers and the resulting algor ithm r uns in polynomial time with the w orst-case com- ple xity o f O (log 15 = 2 ( n )) . Pr ior to the pub lication of this algor ithm, there w ere oth er pr imality pro ving algor ithms that seemed to r un in polynomial time , b ut AKS algor ithm is the first one that is de- A Load-Balanced P ar allelization of AKS Algor ithm (Y udha and Pulungan) Evaluation Warning : The document was created with Spire.PDF for Python.
1886 ISSN: 1693-6930 ter ministic as w ell as of polynomial time [18]. The main idea of AKS algor ithm is descr ibed in Lemma 1, which is a gener alization of F er mat’ s little theorem. Lemma 1 ([10]) Let a 2 Z be relativ ely pr ime to n 2 Z and n 2 . Then n is pr ime if and only if: ( x + a ) n x n + a (mo d n ) : (1) T o reduce the n umber of oper ations perf or med, both sides of (1) can be simplified b y taking their respectiv e remainders modulo a polynomial x r 1 , f or some small positiv e r 2 Z , namely: ( x + a ) n x n + a (mo d x r 1 ; n ) : (2) Ho w e v er , r ight no w the bi-implication in Lemma 1 no longer applies , since non-pr ime n ma y also satisfy (2) f or some a and r . Theorem 1—as ref or m ulated b y Gr an ville in [12]—f or ms the cor nerstone of AKS algo r ithm. The theorem basically asser ts that f or appropr iately selected r s , if (2) is satisfied b y some a , then n m ust be a pr ime . Theref ore , r m ust be selected accordingly . Theorem 1 ([10, 12]) Giv en n 2 Z and n 2 , let r < n be a posit iv e integer satisfying o r ( n ) > log 2 ( n ) . Then n is pr ime if and only if: (1) n is not a perf ect po w er , (2) n does not ha v e an y pr ime f actor r , and (3) ( x + a ) n x n + a (mo d x r 1 ; n ) f or an y integer a , where 1 a p ( r ) log ( n ) . A str aightf orw ard implementation of Theorem 1 is giv en in Algor ithm 1, where condition (1) corresponds to the first if; and conditions (2) and (3) correspond to the first and the last f or , respectiv ely . Algorithm 1: AKS algor ithm Input: n 2 Z ; n 2 Output: A str ing Pr ime or Composite 1 begin 2 if n = a b , where a; b 2 Z and a; b > 1 then 3 return Composite 4 end 5 Find the smallest r that satisfies o r ( n ) > b log 2 ( n ) c 6 f or 2 to r do 7 if g cd ( a; n ) > 1 then 8 return Composite 9 end 10 end 11 f or a   1 to b p ( r ) log ( n ) c do 12 if ( x + a ) n 6 x n + a (mo d x r 1 ; n ) then 13 return Composite 14 end 15 end 16 return Pr ime 17 end TELK OMNIKA V ol. 15, No . 4, December 2017 : 1884 1893 Evaluation Warning : The document was created with Spire.PDF for Python.
TELK OMNIKA ISSN: 1693-6930 1887 3. Pr oposed Method 3.1. P arallel AKS Algorithm Scr utinizing Algor ithm 1, w e can see that the algor ithm basically compr ises 4 st eps: de- ter mining whether n is a perf ect po w er (lines 2–4); deter mining r (line 5); deter mining whether n has pr ime f actors r (lines 6–10); and deter mining the cong r uence of polynomials ( x + a ) n and x n + a modulo ( x r 1 ; n ) (lines 11–15) f or some v alues of a . Of these f our steps , the last tak es most of the computation times of the algor ithm, since w e are dealin g with an enor mous n . Fur- ther more , when r aising polynomial ( x + a ) to th e po w er n —albeit modulo ( x r 1 ; n ) —inter mediate results might be enor mous polynomials requir ing large stor age and hea vy computation. Our par- allelization eff or t will be f ocused on computing this last step . P ar allelizing the other steps will incur comm unication o v erhead that, with the current state of netw or king technology , renders the sa ving achie v ed b y the par allelization w or thless e v en f or hundreds-decimal-digit n . As has been noticed b y Cr andall and P apadopoulos in [18], AKS algor ithm is an em- barr assingly par allel algor ithm. It can easily be par alleliz ed using master-sla v e technique , b y distr ib uting the w or k of deter mining the cong r uence of polynomials ( x + a ) n and x n + a modulo ( x r 1 ; n ) f or diff erent v alues of a to diff erent computer nodes in a message-passing par allel system. Figure 1 illustr ates this master-sla v e technique . 0 2 1 u u -1 1 2 v 1 2 v 1 2 v 1 2 v Master node Slave nodes i j : node i : cor j : communication : owner Figure 1. The design of the par allelization technique In the beginning, the master node perf or ms the computation of the first three steps of AKS algor ithm sequentially . O nce the master node obtains the v alue of r , it broadcasts the v alues of n and r together with other necessar y inf or mation about distr ib ution of w or k (namely the distr ib ution of the v alues of a ) to all sla v e nodes . Each sla v e node then proceeds with the computation of deter mining the cong r uence of polynomials ( x + a ) n and x n + a modulo ( x r 1 ; n ) f or se v er al v alues of a . A sla v e node comm unicates only with the master node and only in tw o cases: (1) when f or some v alue of a the polynomials are not cong r uent, and (2) when f or all v alues of a assigned b y the master node , both polynomials are alw a ys cong r uent, and thus signaling that the w or k as- signed to the sla v e node has been completed. Upon receiving a comm unication of type (1) from a sla v e node , the master node immediately dismisses the last f or loop and thereb y announces that n is composite; and proceeds to command the rest of the sla v e nodes to abor t their computation. Receiving comm unication type (2) from all sla v e nodes indicates that all sla v e nodes ha v e com- pleted their w or k and all of them find that the tw o polynomials are cong r uent f or all v alues of a ; the master node then proceeds to announce that n is pr ime . A Load-Balanced P ar allelization of AKS Algor ithm (Y udha and Pulungan) Evaluation Warning : The document was created with Spire.PDF for Python.
1888 ISSN: 1693-6930 A moder n computer system usually has m ulti-core CPUs . A par allelizatio n technique where each of these CPU cores in a sla v e computer node is treated as a sla v e node as w ell is ref erred to as single-le v el par allelization . In this technique , each core is assigned b y the master node se v er al v alues of a to compute separ ately from other cores in the same computer node . Comm unications from all cores in a sla v e node to the master node m ust pass through the same channel of comm unication and this ma y result in contention. Ho w e v er , compared to the computa- tion time spent f or each v alue of a , the o v erhead produced b y this contention is negligib le . 3.2. W orkload-Distrib ution Strategies The single-le v el par allelization technique requires the distr ib ution of w or kload from the master node to all sla v e nodes . This basically entails distr ib uting the v alues of a f or sla v e nodes to w or k on. Recall from Algor ithm 1 that t he cong r uence of polyno mials ( x + a ) n and x n + a modulo x r 1 m ust be deter mined f or 1 a b p ( r ) log ( n ) c . Let q = b p ( r ) log ( n ) c and u be the n umber of sla v e nodes . Fur ther , let % stand f or the integer division oper ator . In the f ollo wing, w e present f our w or kload-distr ib ution str ategies that will be e xper imented on in this study . Strategy 1 Figure 2 illustr ates the first w or kload-distr ib ution str ategy . A rectangle in the figure represents a single v alue of a , while the circle r ight belo w the rectangl e represents the sla v e node responsib le f or computing that v alue . This str ategy is the simplest of the three str ategies , where sla v e node i is responsib le to deter mine the cong r uence of polynomials ( x + a ) n and x n + a modulo x r 1 , f or ( i 1)( q % u ) + 1 a i ( q % u ) . Hence sla v e node #1 w or ks on the first q % u v alues of a , sla v e node #2 w or ks on the second q % u v alues of a , and so on, while sla v e node # u w or ks on the remaining v alues of a . Th is last sla v e node ma y only w or k on f e w er than q % u v alues of a , if q is not divisib le b y u . 1 2 q % u q % u +1 q % u +2 2 (q % u ) 2 (q % u ) +1 2 (q % u ) +2 3 (q % u ) (u -1 ) (q % u ) (u -1 ) (q % u )+1 q 1 1 1 2 2 2 3 V alues of a Node 3 3 u u u V alues of a Node Figure 2. W or kload-distr ib ution str ategy 1 Strategy 2 One of the main concer ns with the first str ategy is that one sla v e node is assigned only with v alues of a that are consistently smaller or larger than those assigned to other sla v e nodes . All v alues of a assigned to sla v e node #1, f or instance , are smaller than those assigned to sla v e node #2. A larger v alue of a ma y result in a longer computation time , since the resulting inter mediate polynomials will ha v e larger coefficients , which in tur n tak e longer to m ultiply and require more stor age . The second and third str ategies tr y to address this . Figure 3 illustr ates the second w or kload-distr ib ution str ategy . The first sla v e node will get a = 1 , the second sla v e node will get a = 2 , and so on, until the last sla v e node will get a = u . This is then repeated until all v alues of a are e xhausted. Theref ore sla v e node i will be assigned the v alues of a of i; i + u; i + 2 u; : : : ; i + j u , where j is the largest int eger that still satisfies i + j u q . This str ategy manages to a v oid assigning one sla v e node v alues of a that are consistently smaller or larger than those assigned to other sla v e nodes . Ho w e v er , each v alue of a assigned to a sla v e node is alw a ys relativ ely smaller or larger than that assigned to other sla v e nodes . F or e v er y v alue i assigned to sla v e node #1, f or instance , the v alue i + 1 is assigned to sla v e nod e #2. Hence , if larger v alue of a alw a ys results in longer computation time , sla v e node #1 will complete its w or kload ear lier than sla v e node #2. This prob lem will be addressed b y the third str ategy . TELK OMNIKA V ol. 15, No . 4, December 2017 : 1884 1893 Evaluation Warning : The document was created with Spire.PDF for Python.
TELK OMNIKA ISSN: 1693-6930 1889 1 2 u u +1 u +2 2 u 2 u +1 2 u +2 3 u 3 u +1 3 u +2 4 u 4 u +1 q 1 2 u 1 2 u 1 2 u 1 2 u 1 q   re u V alues of a Node V alues of a Node Figure 3. W or kload-distr ib ution str ategy 2 Strategy 3 The third str ategy addresses the prob lem encountered in the second str ategy b y ensur ing that if a sla v e node is assigned a small v alue of a , it will be compensated b y another assignment with large v alue of a . Figure 4 illustr ates the third w or kload-distr ib ution str ategy . 1 2 u u +1 u +2 2 u 2 u +1 q -2 u -1 q -2 u q -u -2 q -u -1 q -u q -1 q 1 1 2 2 u u 1 2 1 2 u u 1 1 V alues of a Node V alues of a Node Figure 4. W or kload-distr ib ution str ategy 3 Hence , since sla v e node #1 is assigned the v alue a = 1 (the smallest), then it will also be assigned the v alue a = q (the largest). Similar ly , since sla v e node #2 is assigne d the v alue a = 2 (the second smallest), then it will also be assigned the v alue a = q 1 (the second largest). This is carr ied out subsequently until all v alues of a are assigned to all sla v e nodes in similar f ashion: if the v alue a = i is assigned to sla v e node j , then the v alue a = q i is also assigned to sla v e node j , f or i q %2 . Strategy 4 All pre vious str ategies are static , in the sense that w or kload distr ib utions are actually predefined e v en bef ore e x ecution; a specific node alw a ys gets the same set of a s when the input n is the same . In this f our th str ategy , w e propose a dynamic str ategy where the set of a s assigned to a sla v e node cannot be predicted bef ore it is r un. The idea is , firstly , a sla v e node is assigned an a according to its id, f or e xample sla v e node #1 gets a = 1 , sla v e node #2 gets a = 2 and so on until sla v e node # u gets a = u . After completing a w or k, a sla v e node requests to the master node f or another remaining a or inf or ms the master node if the polynomial cong r uence chec k produces f alse result. The master node then sends a remaining a to the requesting node or the master node simply ter minates all nodes and output composite result f or the other condition. When no remaining a e xists , the master node ter minates all sla v e nodes then outputs pr ime result. 4. Result and Anal ysis 4.1. Implementation Since w e are pr imar ily concer ned with big n umbers , w e use GNU Multiple Precision (GMP) ar ithmetic libr ar y v ersion 6.10 to handle integer of arbitr ar y length. In the first step of the algor ithm, GMP function mpz p erfect p o w er p() is used to chec k f or perf ect po w ers . T o com- pute the v al ue of r in the second step , w e use function P o w erMo d() of NTL libr ar y v ersion 9.10.0, which basically perf or ms integer modular e xpon entiations . F or chec king the e xistence of f actors of the input n umber that are no more than r in the third step , NTL function GCD() is used. A Load-Balanced P ar allelization of AKS Algor ithm (Y udha and Pulungan) Evaluation Warning : The document was created with Spire.PDF for Python.
1890 ISSN: 1693-6930 Comm unications betw een master and sla v e nodes is perf or med using MPICH libr ar y v er- sion 3.2. T o broadcast the input n umber n and the v alue of r , the master and sla v e nodes use function MPI Bcast() . Since the MPI does not suppor t data type mpz t defined b y GMP as w ell as data type ZZ defined b y NTL, n and r are first con v er ted to arr a ys of b ytes bef ore the y are broad- cast. Once arr iv ed, the y will be con v er ted bac k to type mpz t using function mpz init set str() . After all the v alues required to compute the last step are obtained b y a sla v e node , it t hen com- putes the left and r ight side of the cong r uence using function P o w erMo d() . 4.2. Experimental Setup All e xper iments in this study are conducted using High-P erf or mance Computing (HPC) system pro vided b y Director ate of Inf or mation System and Resources (DSSDI) of Univ ersitas Gadjah Mada. The HPC system has 15 sla v e nodes , each with 2 CPU Dual Core AMD Opteron TM Processor 280 (hence , 4 CPU cores), 4 GB DDR3 RAM, Op enSUSE 11.2 64 bits oper ating system, and GCC compiler v ersion 6.1. W e e xper iment o n pr ime n umbers r anging from 5 digits to 35 digits in length as sho wn in T ab le 1. The se v en pr ime n umbers selected are the largest pr ime n umbers f or the corresponding n umbers of digits according to [22]. T ab le 1. Pr ime n umbers used in e xper iments Digits Pr ime Number 5 99,929 10 9,999,999,929 15 999,998,727,899,999 20 99,999,999,999,999,999,989 25 9,989,999,899,883,889,989,999,899 30 909,090,909,090,909,090,909,090,909,091 35 68,476,562,763,327,854,359,085,599,065,855,383 4.3. Result Comparing the w orkload-distrib ution strategies T ab le 2 sho ws the r unning times of the se- quential as w ell as the par allel implementations of AKS algor ithm f o r the se v en pr ime n umbers . The par allel implementations are r u n on a 60-processor message-passing system, while the se- quential one is r un on one of the processors . It is e vident that the dynamic w or kload distr ib ution (str ategy 4) perf or ms consistently and significantly better than other str a tegies in all e xper iments , which means that this str ategy is the most load balanced among the proposed str ategies . This also indicates that the o v erheads associated with comm unication times bet w een the master node and sla v e nodes are insignificant compared to the computation times f or diff erent v alues of a . F rom T ab le 2, it is clear that patter ns from the r unning times of the first three w or kload- distr ib ution str at egies are not easy to discer n. This result is contr ar y to the authors’ or iginal e xpectation, as descr ibed in Section 3.2. The result indicates that the computation times required f or th e v alues of a are not propor tional to those v alues: a larger v alue of a ma y require less computation time than that of smaller one . Speedups f or v arious n umber of pr ocessor s The pre vious result sho ws that w or kload dis- tr ib ution str ategy 4 produces the b est par allel implementation f or AKS algor ithm. In this par t, w e f ocus on this str ategy and find out the speedups that are achie v ab le f or v ar ious n umber of processors . The result is presented in Figure 5. Figure 5 sho ws that f or almost all n umbers of digits , speedup mostly g ro ws linear ly as the n umber of processors used in the computation increases . The apparent e xception to this is f or TELK OMNIKA V ol. 15, No . 4, December 2017 : 1884 1893 Evaluation Warning : The document was created with Spire.PDF for Python.
TELK OMNIKA ISSN: 1693-6930 1891 T ab le 2. Running times (in seconds) of the diff erent w or kload-distr ib ution str ategies Digits Sequential P ar allel Str ategy 1 Str ategy 2 Str ategy 3 Str ategy 4 5 1.57168 0.41278 0.19912 0.20151 0.11407 10 105.7 5.3 2.1 2.2 1.7 15 712.2 35.3 22.6 24.0 19.6 20 3,236.0 128.2 130.4 128.8 112.3 25 8,848.8 446.6 371.5 374.1 325.4 30 32,343.2 1,421.6 1,317.3 1,972.4 1,179.6 35 70,901.7 4,121.8 4,177.6 4,152.3 2,457.4 0 5 10 15 20 25 30 35 40 0 10 20 30 40 50 60 Sp e e d u p Nu m b e r o f p r o c e s s o r s 5 d i g i t s 10 d i g i t s 15 d i g i t s 20 d i g i t s 25 d i g i t s 30 d i g i t s 35 d i g i t s Figure 5. Speedups obtained b y v ar ying the n umber of processors when the n umber of digits is 5. F or this case , there are only 275 diff erent v alues of a to chec k and each of them requires a relativ ely shor t computation time . When the n umber of processors e x- ceeds 30, comm unication o v erheads becomes large enough to offset the sa vings of computation times b y the par allelism. Ov er all, the largest speedup is obtained when the n umber of digits is 10 and it is a bit more than 36 times the computation time of the sequential implementation. The eff ects of m ulti-core pr ocessor s The high-perf or mance computing system used in the e xper iments consists of computers , each with m ulti-core processors . When each of this core is treated as a node , contentions ma y occur wh en there is more than one core comm unicating sim ultaneously with the master node . In this par t, f or each w or kload-distr ib ution str ategy , w e v ar y the n umber of cores per node used in the par allel computation to estab lish their eff ects on the o v er all computation time . F or this pur pose , three scenar ios are created, namely 1 core per node , 2 cores per node , and 4 cores per node . In all of these scenar ios , the o v er all n umber of cores is maintained at 8 in order to set a baseline . Figure 6 depicts the result of the e xper iments . F rom Figure 6, w e can conclude that w or kload distr ib ution str ategies 1, 2 and 3 ar e al- most not aff ected b y the n umber of cores per node used in the par allel computation. This is understandab le since , in these str ategies , a sla v e node r arely comm unicates with the master node . Comm unications betw een a sla v e node and the master node occurs only dur ing ter mina- tion, namely when the sla v e node finds that the polynomials are not cong r uent f or a specific v alue of a or when it finds that the polynomials are cong r uent f or all assigned v alues of a . The eff ect f or w or kload-distr ib ution str ategy 4, ho w e v er , is star k and the larger the pr ime n umber the more pronounced the eff ect. Ha ving more cores per node results in longer computa- tion time . This is in line with our e xpectation, since , ha ving more cores per node results in hea vier use of the comm unication line betw een the master and the sla v e nodes . What w e do not e xpect A Load-Balanced P ar allelization of AKS Algor ithm (Y udha and Pulungan) Evaluation Warning : The document was created with Spire.PDF for Python.
1892 ISSN: 1693-6930 0 5000 10000 15000 20000 25000 30000 0 5 10 15 20 25 30 Ru n n i n g t i m e ( s e c o n d s ) Di g i t s o f t h e p r i m e n u m b e r Wo r k l o a d - di s t r i but i o n s t r a t e g y 2 2 n o d e s ( 8 c o r e s ) 4 n o d e s ( 8 c o r e s ) 8 n o d e s ( 8 c o r e s ) 0 5000 10000 15000 20000 25000 30000 0 5 10 15 20 25 30 Ru n n i n g t i m e ( s e c o n d s ) Di g i t s o f t h e p r i m e n u m b e r Wo r k l o a d - di s t r i but i o n s t r a t e g y 1 2 n o d e s ( 8 c o r e s ) 4 n o d e s ( 8 c o r e s ) 8 n o d e s ( 8 c o r e s ) 0 5000 10000 15000 20000 25000 30000 0 5 10 15 20 25 30 Ru n n i n g t i m e ( s e c o n d s ) Di g i t s o f t h e p r i m e n u m b e r Wo r k l o a d - di s t r i but i o n s t r a t e g y 3 2 n o d e s ( 8 c o r e s ) 4 n o d e s ( 8 c o r e s ) 8 n o d e s ( 8 c o r e s ) 0 5000 10000 15000 20000 25000 30000 0 5 10 15 20 25 30 Ru n n i n g t i m e ( s e c o n d s ) Di g i t s o f t h e p r i m e n u m b e r Wo r k l o a d - di s t r i but i o n s t r a t e g y 4 2 n o d e s ( 8 c o r e s ) 4 n o d e s ( 8 c o r e s ) 8 n o d e s ( 8 c o r e s ) Figure 6. Running times f or v ar ious n umbers of cores per node is f or the eff ect to be so strong (1 core per node is f aster more than twice compared to 4 cores per node). This means that an HPC with single-core nodes will produces e v en better results . 5. Conc lusion In this paper , w e proposed a par allelization technique based on message passing par al- lelism f o r AKS algor ithm. W e also de v eloped f our w or kload-distr ib utio n str ategies f or this par- allelization technique . F rom the e xper iments w e ha v e conducted, w e conclude that dynamic w or kload-distr ib ution str ategy is the most load-balanced one . Fur ther more , the diff erence be- tw een the dynamic str ategy and static str ategies is so significant that it is difficult to en vision cir- cumstances when one wishes to use the static ones . Ov er all, the dynamic str ategy can achie v e a speedup of up to 36 times the sequential computation. Ne v er theless , the dynamic str ategy has one ob vious dr a wbac k, namely the bottlenec k in the comm unication line to w ards the master node . The more nodes in v olv ed in the par allelism, the b usier the master node and the hea vier the comm unication line to w ards the master node . W e did not manage to demonstr ate this due to the limited siz e of t he HPC a v ailab le to us . W e also sho w ed that the n umber of cores per node has a strong eff ect f or the dynamic w or kload-distr ib ution str ategy . Ac kno wledg ement The authors w ould lik e to thank Director ate of Inf or mation System and Resources (DSSDI) of Univ ersitas Gadjah Mada f or pro viding the high-perf or mance computing ser vice used in this re- search. TELK OMNIKA V ol. 15, No . 4, December 2017 : 1884 1893 Evaluation Warning : The document was created with Spire.PDF for Python.
TELK OMNIKA ISSN: 1693-6930 1893 Ref erences [1] R. L. Riv est, A. Shamir , and L. Adleman, “A method f or obtaini ng digital signatures and pub lic-k e y cr yptosystems , Comm unications of the A CM , v ol. 21, no . 2, pp . 120–126, F eb . 1978. [2] H. Y ang, “Research on design method based on hardw are encr yption and tw o-w a y id au- thentication f or secur ity mobile hard disk, TELK OMNIKA , v ol. 12, no . 4, pp . 300 1–3009, 2014. [3] Z. Qi, “Secure digital cer tificate design based on the pub lic k e y cr yptog r aph y algor ithm, TELK OMNIKA , v ol. 11, no . 12, pp . 7366–7372, 2013. [4] A. Muzakir and A. Ashar i, “Rancang bangun k eamanan w eb ser vice dengan metode ws- secur ity , IJCCS (Indonesian Jour nal of Computing and Cyber netics Systems) , v ol. 6, no . 1, pp . 1–10, 2012. [5] EFF , “EFF off ers cooper ativ e computing p r iz es , 2009, last accessed: 2017-03-19. [Online]. A v ailab le: https://www .eff .org/a w ards/coop [6] G. L. Miller , “Riemann’ s h ypothesis and tests f or pr imality , Jour nal of Computer and System Sciences , v ol. 13, no . 3, pp . 300–317, 1976. [7] M. O . Rabin, “Probabilistic algor ithm f or testing pr imality , Jour nal of Number Theor y , v ol. 12, no . 1, pp . 128–138, 1980. [8] C . Dong, “Math in netw or k secur ity: A cr ash course , 2016, last accessed: 2017-03-19. [Online]. A v ailab le: http://www .doc.ic.ac.uk/ mrh/330tutor/ [9] R. Solo v a y and V . Str assen, “A f ast Monte-Car lo test f or pr imality , SIAM Jour nal on Comput- ing , v ol. 6, no . 1, pp . 84–85, 1977. [10] M. Ag r a w al, N. Ka y al, and N. Sax ena, “PRIMES is in P, Annals of Mathematics , v ol. 2, pp . 781–793, 2002. [11] L. K. Nemana and V . C . V enkaiah, “An empir ical study to w ards refining the AKS pr imality testing algor ithm, IA CR Cr yptology ePr int Archiv e , v ol. 2016, pp . 362–387, 2016. [12] A. Gr an ville , “It is easy to deter mine whether a giv en integer is pr ime , Bulletin of the Amer i- can Mathematical Society , v ol. 42, no . 1, pp . 3–38, 2005. [13] H. W . Lenstr a, “Pr imality testing with cyclotomic r ings , Mathematic Institute , Univ ersity of Leiden, T ech. Rep ., 2002. [14] D . Ber nstein, “Pro ving pr imality after Ag r a w al-Ka y al-Sax ena, Depar tment of Mathematics , Statistics , and Computer Science , Univ ersity of Illinois , T ech. Rep ., 2003. [15] H. W . Lenstr a, “Pr imality testing with Gaussian per iods , in FST TCS 2002: F oundations of Softw are T echnology and Theoretical Computer Science , 22nd Conf erence Kanpur , India, December 12-14, 2002, Proceedings , ser . Lecture Notes in Computer Science , M. Ag r a w al and A. Seth, Eds ., v ol. 2556. Spr inger , 2002, pp . 1–45. [16] H. W . Lenstr a and C . P omer ance , “Pr imality testing with Gaussian per iods , Depar tment of Mathematics , Univ ersity of Dar tmouth, T ech. Rep ., 2011. [17] D . Ber nstein, “Pro ving pr imality in essentially quar tic r andom time , Mathematics of compu- tation , v ol. 76, no . 257, pp . 389–403, 2007. [18] R. E. Cr andall and J . S . P apadopoulos , “On the implementation of AKS-class pr imality tests , Univ ersity of Mar yland College P ar k, T ech. Rep ., 2003. [19] H. Li, “The analysis and implementation of the AKS algor ithm and its impro v ement algo- r ithms , Master’ s thesis , Depar tment of Computer Science , Univ ersity of Bath, 2007. [20] V . Menon, “Deter ministic pr imality testing - understanding the AKS algor ithm, CoRR , v ol. abs/1311.3785, 2013. [21] Z. Cao , “A note on the stor age requirement f or AKS pr imality testing algor ithm, IA CR Cr yp- tology ePr int Archiv e , v ol. 2013, pp . 449–452, 2013. [22] C . K. Caldw ell, “Th e pr ime pages: pr ime n umber research, records , and resources , 2017, last accessed: 2017-03-19. [Online]. A v ailab le: https://pr imes .utm.edu A Load-Balanced P ar allelization of AKS Algor ithm (Y udha and Pulungan) Evaluation Warning : The document was created with Spire.PDF for Python.