Indonesian J our nal of Electrical Engineering and Computer Science V ol. 23, No. 1, July 2021, pp. 510 518 ISSN: 2502-4752, DOI: 10.11591/ijeecs.v23.i1.pp510-518 r 510 Efficient multi-k eyw ord similarity sear ch o v er encrypted cloud documents A yad I. Abdulsada, Dhafer G. Honi, Salah Al-Darraji Department of Computer Science, Colle ge of Education for Pure Science, Uni v ersity of Basrah, Iraq Article Inf o Article history: Recei v ed Dec 6, 2020 Re vised Apr 12, 2021 Accepted Jun 16, 2021 K eyw ords: Cloud computing Multi-k e yw ords ranking search Pri v ac y preserving Searchable encryption Simhash ABSTRA CT Man y or g anizations and indi viduals are attracted to outsource the ir data into remote cloud service pro viders. T o ensure pri v ac y , sensiti v e dat a should be encrypted be- fore being hosted. Ho we v er , encryption disables the direct application of the essential data management operations lik e searching and inde xing. Searchable encryption is a cryptographic tool that gi v es users the ability to search the encrypted data while being encrypted. Ho we v er , the e xisting schemes either serv e a single e xact search that loss the ability to handle the misspell ed k e yw ords or multi-k e yw ord search that generate v ery long trapdoors. In this paper , we address the problem of designing a practical multi-k e yw ord similarity scheme that pro vides short trapdoors and returns the correct results according to their similarity scores. T o do so, each document is translated into a compressed trapdoor . T rapdoors are generated using k e y based hash functions to en- sure their pri v ac y . Only authorized users can issue v alid trapdoors. Similarity scores of tw o te xtual documents are e v aluated by comput ing the Hamming distance between their corresponding trapdoors. A rob ust security definition is pro vided together with its proof. Our e xperimental results illustrate that the proposed scheme impro v es the search ef ficienc y compared to the e xisting schemes. Furthermore, it sho ws a high le v el of performance. This is an open access article under the CC BY -SA license . Corresponding A uthor: A yad I. Abdulsada Department of Computer Science, Colle ge of Education for Pure Science Uni v ersity of Basrah, Iraq Email: ayad.abdulsada@uobasrah.edu.iq 1. INTR ODUCTION Cloud computing is a promising technology that supports cost-ef fecti v e solutions for storing and pro- cessing lar ge datasets. F or this reason, indi viduals and or g anizations with constrained-resource machines tend to outsource their data collections to such professional po wer serv ers. Ho we v er , such outsourced service may raise main concerns to w ards users pri v ac y , where personal data should be preserv ed [1]. Such data may include E-mail, medical information, pri v ate vi d e os, and photos. Therefore, users emplo y encryption to protect the pri v ac y of their confidential data. Unfortunately , e n c ryption disables traditional k e yw ord search operations on remote d a ta. Searchable encrypt ion schemes allo w preforming search o v er encrypted data at the serv er side without decryption. Lik e search o v er plainte xt data, searchable encryption methods b uild a searchable inde x from the entire dataset, such that during the search, only trapdoors generated using a secret k e ys can match inde x entries to get rele v ant results. Inde x contents should re v eal nothing to the adv ersary serv er . Inde x-based search not only enhances search ef ficienc y , b ut also isolates data and inde x encryption schemes. Under the set- ting of searchable symmetric encryption (SSE) schemes, the encrypt ed data are uploaded and retrie v ed by the J ournal homepage: http://ijeecs.iaescor e .com Evaluation Warning : The document was created with Spire.PDF for Python.
Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752 r 511 same party . Most of SSE methods focus on servi n g e xact single k e yw ord search queries [2–6]. Some methods [7, 8] focus on single k e yw ord fuzzy search. The results of such schemes are too lar ge especially when huge data collections ar e used. Us ers can filter the res ults locally by sending a list of single-k e yw ord quer ies and get the intersection of the returned results. This scheme is inef ficient and requires intensi v e computations for the user . Another solution is to find the intersection at the serv er side and ask the latter to return the final results to the user . This enables the serv er from learning undesirable information such as the intersection patter n s for all k e yw ords. Therefore, it is necessary to ha v e SSE schemes that enable the submission of search queries with a conjunction of se v eral k e yw ords. Such scheme reduces the computation b urden of users since only the best matching documents are retrie v ed. In this paper , we propose a ne w pri v ac y preserving yet ef ficient multi- k e yw ord similarity search scheme that returns the rele v ant documents according to their similarity scores e v en in case of pro viding misspelled k e yw ords. Contrib utions: Contrib utions are briefly described as follo ws: Firstly , we emplo y Simhash function to design pri v ac y-preserving scheme that supports multi-k e yw ord similarity search. Such a function transforms the te xtual documents into a fix ed size v ectors, such that tw o similar documents l ead to tw o similar v ectors. Secondly , we introduce adapti v e semantic security definition and pro v e that the proposed scheme satisfies this definition. Thirdly , we utilize a ranking method that emplo ys XOR operation between tw o bit v ectors. Finally , we implement the proposed scheme and sho w its ef ficienc y and ef fecti v eness. Or g anization: The paper is outlined as follo ws. Secti on 2 summarizes the pre vious w orks. Section 3 pro vides the problem description and illustrates its security definition. Section 4 sho ws the basic steps of the proposed scheme. Section 5 reports the formal pro v e of the scheme. Section 6 sho ws the results of the conducted e xperiments, and section 7 concludes our w ork. 2. RELA TED W ORK Single k e yw ord SSE schemes: The problem of searching o v er encrypted data is addressed by v arious w orks in the literature. The majority of these w orks focus on handling a single k e yw ord search requests. Under such a setting, searchable inde x is generated and encrypted before being uploaded. Search process scans the encrypted inde x to identify the documents that includes the pro vided queries. Goh et al. [2] introduced a Bloom filter based construction, where the unique k e yw ords of a gi v en document are hashed into a single filter . The major problem of this w ork is that it may return f alse positi v e results. Chang et al. [5] pro vided simulator -based security definition, which is stronger than [2]. Their inde x construction searches without f alse positi v e results. Later on, Curtmola et al. [6] emplo yed the in v erted inde x structure to describe the entire collection. Here, search time is log arithmic in the total number of the search k e yw ords stored on the serv er . Strizho v and Ray [9] and Cash et al. [3] de v eloped ne w schemes that gi v e the data o wner the ability to update his collection with a minimum leak. Dynamic SSE constructions with adv anced security definitions are presented in [10–15]. The main disadv antage of the abo v e mentioned schemes is that the y use only single k e yw ord search which return massi v e number of results, where most of them are not rele v ant to the users. Single Similarity SSE schemes: Some SSE schemes enhance query richness to support simil arity search, where the scheme can retrie v e the correct results e v en with pro viding misspelled k e yw ords. In the w ork of [7], each k e yw ord is e xpanded by listing all its v ariants that are within a specific edit distance. All k e yw ord v ariants are stored in the searchable inde x, which require more computation and communication costs. K uzu et al. [8] emplo y the Minhash technique to capture the Jaccard distance between tw o sets. Such schemes support only single k e yw ord similarity search. Multy-k e yw ord SSE schemes: T o enhance se arch functionality , multi-k e yw ord search is used to re- trie v e only the rele v ant data items. Ca o et al. [16] proposed the first scheme that performs rank ed multi-k e yw ord search, where each document is represented as a binary v ector . The size of this v ector is determined by the total number of k e yw ords in the v ocab ulary . V ectors are protected by multiplication with randomly generated ma- trices. This scheme requires from the data users to kno w the position of each k e yw ord to generate v alid search requests. Furthermore, binary v ectors ignore the importance of each k e yw ord within the pro vided document. Additionally , similarity scores are defined as the number of matched k e yw ords between tw o v ectors, which is not standard operation for ranking the returned results. Cash et al. [4] designed an ef ficient construction that supports conjuncti v e queries. Such a construction returns only the documents that match all items of the pro vided query . Ho we v er , pairing-based solutions require intensi v e computational costs. ¨ Orencik et al. [17] proposed an ef ficient rank ed multi-k e yw ord search scheme. In this scheme, a fix ed-size v ector is generated Ef ficient multi-k e ywor d similarity sear c h o ver encrypted cloud documents (A yad I. Abdulsada) Evaluation Warning : The document was created with Spire.PDF for Python.
512 r ISSN: 2502-4752 for each document. The underline method does not allo w for ranking results. The w ork of [18] described an ef ficient scheme that solv ed the problem of document ranking with multi-k e yw ord search. The y emplo yed Minhash method for generating fingerprint items for each file. The items of the entire collection are used to construct an in v erted inde x. Ho we v er , such scheme assumes the e xistence of tw o non-colluding serv ers to rank the returned results. Ne w security impro v ements are presented in [19]. Our scheme uses only one serv er which performs all ranking tasks. Strizho v et al. [9] used inner product similarity and tf-idf based ranking model. Such a model uses onl y one serv er to answer multi-k e yw ord queries, where results are returned without sorting. Such a model uses inef ficient fully homomorphi c encryption scheme. Baldimtsi et al. [20] designed recently a tool for sorting an encrypted data. Such a tool w as successfully utilized to solv e the problem of rank ed search o v er encrypted data, where tw o serv ers are needed. Recently , a secure multi-k e yw ord rank ed search that release dynamic search functionality is proposed in [21], [22]. W ang et al. [23] proposed a secure scheme that protects the search pattern. Recently , Rani et al. [24] designed a tree-based inde x to solv e the problem of secure search. 3. PR OBLEM DESCRIPTION AND SECURITY DEFINITION In this section, the notations, formal description of the problem, and the security definition are pre- sented. 3.1. Notations Let be the s ecurity parameter . D = f D 1 ; D 2 ; : : : ; D n g is a document collection of size n , D i = f w 1 ; :::; w m g is a set of m distinct k e yw ords, C = f C 1 ; C 2 ; : : : ; C n g is the encrypted collection, id ( C i ) is the identifier of document C i , j C i j is the size of the encrypted document C i , Q = f Q 1 ; Q 2 ; : : : ; Q q g is a collection of q successi v e queries, where Q i = f w 1 ; :::w s g is a list of s k e yw ords, D B ( k w ) is the collection of documents that match the multi-k e yw ord query k w , and S I = f S I 1 ; :::; S I n g is the inde x, where S I i is the document inde x of D i . K D and K S represent secret k e ys for encrypting document collection and searchable inde x, respecti v ely . The occurrence of k e yw ord w i within a gi v en document is denoted by tf i , whereas T is the secure trapdoor for a gi v en search query . Finally , t is the user defined search threshold. 3.2. Pr oblem description W e consider the follo wing setting: a data o wner who o wns a pri v ate collection of n te xtual documents D = f D 1 ; :::; D n g . He intend to upload his pri v ate collection into a profes sional yet untrusted serv er , which pro vides data storage and computing services with an ef ficient cost. Before outsourcing, data o wner generates a searchable inde x S I and uploads it together with the encrypted v ersion of the encrypted documents C to the remote serv er . During search time, authorized users pro vide multi-k e yw ord search request k w and use secret k e ys to construct a v alid trapdoor T . Once recei ving T , the ser v er scans the searchable inde x S I to find the most similar encrypted documents that match the pro vided trapdoor . Be yond what data o wner allo ws to leak, the security guarantee pre v ents the serv er from emplo ying the outsourced data set and the pro vided trapdoors to get additional information about the underline collection and the search k e yw ords. 3.3. Security definition The objecti v e of an y SSE scheme is to protect: (1) document collection pri v ac y , (2) user query pri v ac y , and (3) searchable inde x pri v ac y . Document collection pri v ac y is protected using encryption before outsourc- ing. Interestingly , documents are encrypted by AES which supports PCP A (pseudo-randomness ag ainst chosen plainte xt attack) security notion [6]. Such a notion guarantees that the cipherte xt is indistinguishable from truly random string. User queries and document indices are tr ansformed into binary v ectors using secret k e ys. Such v ectors hide both the number and content of the underline search k e yw ords. The majority of the current SSE schemes define a leakage function, which describes precisely what is allo wed to be leak ed for the adv ersary serv er . W e list some of the security definitions [6]. Definition 1: History ( H q ) represents the document collection D and the set of search queries Q . Such issue represents the sensiti v e information that should not be leak ed to the serv er . F ormally , H q = ( D ; Q ) . Defini- tion 2: Access pattern ( A ( Q i ) ) includes the collection of document identifiers D B ( Q i ) that satisfy the pro vided search query Q i . F ormally , A ( Q i ) = D B ( Q i ) . Definition 3: Search pattern ( P ) determines the repeated search queries, where it is determined by collecting the repeated queries. It is formulated as a square q q matrix, where P ( i; j ) = 1 if f Q i = Q j , 8 i; j = 1 ; : : : ; q . Definition 4: T race ( T r ( H q ) ) it is the leakage function. Gi v en the Indonesian J Elec Eng & Comp Sci, V ol. 23, No. 1, July 2021 : 510 518 Evaluation Warning : The document was created with Spire.PDF for Python.
Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752 r 513 history H q , trace is defined as: T r ( H q ) = f ( id ( C 1 ) ; : : : :; id ( C n )) ; ( j C 1 j ; : : : ; j C n j ) ; A ( H q ) ; P ( H q ) g . Defini- tion 5: V ie w ( V ( H q ) ) represents the actual information that the adv ersary serv er can see during the e x ecution of the SSE scheme. F ormally , V ( H q ) = f ( id ( C 1 ) ; : : : ; id ( C n )) ; S I ; C ; Q g . Definition 6: Adapti v e semantic security: SSE is adapti v e semantic secure if gi v en T r ( H q ) , there e xists S algorithm that can simulates V ( H q ) with probability 1 for all probabilistic polynomial time ( P P T ) adv ersaries, where is a ne gligible proba- bility . Our scheme assumes the cloud serv er to be semi-honest party , where it follo ws e xactly the construction steps, while try to analyze its accessed data to get additional information be yond what is allo wed to learn. 4. THE PR OPOSED SCHEME W e adopt a model of tw o-parties: the first one is the client (data o wner), and the second one is the serv er . The client constructs an encrypted inde x, and outsources it along with the encrypted documents to the serv er , who pro vides lar ge storage and responds to the search queries of the client. The proposed scheme is composed of v e polynomial-time algorithms: =( Gen , IndexBuilding , Makestrapdoor , Search , Dec ). 1. ( K S ; K D )   Gen( 1 ) : this algorithm is run by data o wner . It tak es the security parameter and returns tw o secret k e ys K S and K D . 2. ( S I ; C )   IndexBuilding( D ; K S ; K D ) : this algorithm is run by the data o wner to generate the searchable inde x S I and the encrypted collection C , gi v en the data collection D , and the secret k e ys K D and K S . 3. T   Makestrapdoor ( k w ; K S ) : Gi v en the k e yw ord set k w , data o wner runs this algorithm to gen- erate the secret trapdoor T utilizing the secret k e y K S . 4. R   Search ( T ; S I ) : Once recei ving search trapdoor T , cloud serv er compares it with all of S I entries to find the list R of the most similar documents. 5. D i   Dec ( C i ; K D ) : gi v en the sec ret k e y K D , data user runs this algorithm to get the plainte xt form of document C i . 4.1. Index b uilding In this w ork, a direct inde x [25] S I i is generated for each document D i . Under such a setting, search time is linear to the number of document collection files. The process of inde x generation includes three operations: k e yw ord set e xtraction, document inde x construction, and trapdoor generation. K e yw ord set e xtract ion: Gi v en the document D i , data o wner runs a preprocessing step to refine t he k e yw ord set of that document. Such step includes some techniques that are borro wed from information re- trie v al systems. Such techniques includes: lo wer case con v ersion, tok enization, remo ving stop w ords, and stemming [26]. Stemming con v erts the dif ferent forms of the same w ord into their unique stem. F or e xample, the w ords talks , talking , and talk ed are all con v erted to the w ord talk . This process increases the probability of finding documents that ha v e the pro vided k e yw ord e v en in dif ferent forms. After refining the k e yw ord set, the occurrence tf j of each k e yw ord w j within D i is computed. At the end, document D i is represented as high dimensional v ector: D i = f ( w 1 ; tf 1 ) ; : : : ; ( w m ; tf m ) g . Document inde x construction: Gi v en the document k e yw ords and their corresponding oc currences, we need to encode them together into a small fingerprint of fix ed size l (usually l =64). T o do so, we ap- ply the Simhash function [27], which reduces high dimensional v ectors such that tw o similar te xtual in- puts will produce tw o fingerprints of a minimum Hamming distance. The fingerprint T of the document D i = f ( w 1 ; tf 1 ) ; : : : ; ( w m ; tf m ) g is g e nerated as follo ws: initialize l zero counters < s 1 ; :::; s l > . Each k e yw ord w i 2 D i is hashed by SHA-1. If the bit j ( j 2 f 1 ; :::; l g ) equals 1, then its corresponding counter s j is incremented by tf i . Otherwise s j is decremented by tf i . After the processing of all document k e yw ords, the counter v alue s j ( j 2 f 1 ; ::; l g ) is set to 1 if it is positi v e. Otherwise, it is set to 0. The fingerprint is the final binary bits of < s 0 ; :::; s l > . T rapdoor generation: Fingerprints are generated using SHA-1, which is a public function. Ho we v er , using such a function raises security risks. This is because adv ersary serv er can guess search k e yw ords with brute force attack, where it tries to compute the fingerprint for se v eral inputs until it finds a collision. T o solv e this problem, we generate a trapdoor for each fingerprint using k e y based hashing function such as HMA C, where k e yw ords are hashed using a secret k e y . This k e y will mak e the brute force attack a hard job . HMA C K S : f 0 ; 1 g ! f 0 ; 1 g l [28] is a popular message authentication code that uses a secret k e y K S to hash Ef ficient multi-k e ywor d similarity sear c h o ver encrypted cloud documents (A yad I. Abdulsada) Evaluation Warning : The document was created with Spire.PDF for Python.
514 r ISSN: 2502-4752 each k e yw ord of the gi v en document. Other steps of Simhash function still without changing. Algorithm 1 illustrates inde x construction. T rapdoors of all document collection wil l be the final inde x that will be uploaded to the serv er . Algorithm 1: Inde x b uilding Input : T e xtual collection D = f D 1 ; D 2 ; : : : ; D n g , secret k e ys: K S , and K D , fingerprint size l . Output: Searchable inde x S I , encrypted collection C . f or i   1 to n do C i   E nc K D ( D i ); f ( w 1 ; tf 1 ) ; ( w 2 ; tf 2 ) ; ; : : : ; ( w m ; tf m ) g   PreProcessing ( D i ); Initialize zero v ector < s 1 : : : s l > ; f or j   1 to m do MA C   H M AC K S ( w j ); f or u   1 to l do if MA C u = 0 then s u = s u tf j else s u = s u + tf j f or u   1 to l do if s u 0 then S I i [ u ] = 0 else S I i [ u ] =1 S I = f S I 1 ; :::; S I n g ; C = f C 1 ; :::; C n g ; Retur n S I , C ; 4.2. T rapdoor construction Gi v en the search k e yw ord set k w = f w 1 ; w 2 ; : : : ; w s g , data user utilizes the secret k e y K S , which is obtained from the data o wner via a secret channel, to gener ate Simhash fingerprint T for such a set. T rapdoor is constructed in the same w ay of document inde x generation. Note that the trapdoor size is l , which is unrelated to the number of k e yw ords s , so s will be hidden from the adv ersary serv er . T o construct a trapdoor , data user performs only hash functions and fe w bitwise comparisons. 4.3. Pri v acy pr eser ving sear ch Cloud serv er e v aluates the Hamming distance between the trapdoor T and each document inde x S I i , i = f 1 ; :::; n g . P articularly , he e v aluates the XOR operation between tw o binary v ectors. Similarity scores of tw o binary v ectors are determined by identifying the number of matched bits. T o control the number of returned results, users pro vide a threshold v alue t , such that only the encrypted files of the top- t minimum scores are r eturned. The search operation cost is il lustrated as follo ws: cloud serv er needs to perform a binary comparison of l -bit fingerprint with n entries, then it sorts the similarity scores, and selects only the t -minimum scores. Algorithm 2 describes the steps of serv er search operation. Once recei ving the t encrypted documents that ha v e the best matching scores, data user utilizes the secret k e y K D to decrypt them. Algorithm 2: Pri v ac y preserving search Input : Simhash fingerprint T , threshold t , Searchable inde x S I , and encrypted collection C . Output: encrypted documents of the top- t similarity scores. f or i   1 to n do XV ect   XOR ( S I i , T ) ; Scores [ i ]   P l j =1 XV ect [ j ] ; Sor t score   Sort ( Scores ); Inde x   Minimum t ( Sor t score ); f or j   1 to t do R j   C ( Inde x [ j ]) ; Retur n R ; Indonesian J Elec Eng & Comp Sci, V ol. 23, No. 1, July 2021 : 510 518 Evaluation Warning : The document was created with Spire.PDF for Python.
Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752 r 515 5. SECURITY DISCUSSION W e emplo y the real vs ideal g ames approach to pro v e the security of our proposed scheme according to Definition 6. In the real g ame, the client beha v es as in the proposed scheme, while in the ideal g ame , we b uilds a simulator who emplo ys the leak ed information to simulate the beha vior of the client to w ards the serv er . If there is no PPT distinguisher that can distinguish the beha vior of the tw o g ames, then the scheme is considered a se- cure. The intuition of this approach is: gi v en user pri v ate inputs H q , accessible information to the serv er V ( H q ) , and the leakage function T r ( H q ) , then if there e xists a simulator S that emplo ys T r ( H q ) to simulate V ( H q ) such that no P P T distinguisher can distinguish V ( H q ) from the simulated one, then the leakage is not useful and hence the security is ensured. This is because T r ( H q ) does not help the adv ersary to learn an y useful in- formation from V ( H q ) . F ormally , let the original vie w V ( H q ) = V ( H q ) = f ( id ( C 1 ) ; : : : :; id ( C n ) ; S I ; C ; Q g , and the tr ace is T r ( H q ) = f ( id ( C 1 ) ; : : : :; id ( C n )) ; ( j C 1 j ; : : : ; j C n j ) ; A ( H q ) ; P ( H q ) g . Let V ( H q ) be the simulated vie w , which is defined as: V ( H q ) = f ( id ( C 1 ) ; : : : :; id ( C n )) ; S I ; C ; Q g . Our scheme is con- sidered to be secure if V ( H q ) is indistinguishable from V ( H q ) . Document identifiers id ( C j ) of V ( H q ) are a v ailable in T r ( H q ) . Thus S can simulate them simply by setting id ( C j ) = id ( C j ) . Simulator kno ws from T r ( H q ) the number of outsourced documents and the real size of e ach one. Thus, it can create a simulated v ersion C j for each document C j by encrypting a random stream of size j C j j . This beha vior cannot be distin- guished by an y PPT distinguis her due to the security of PCP A encryption method. Similarly , S simulates S I by generating n random binary v ectors S I j of size l . S I j can not be distinguished from S I j since it is generated using a secure hash function HMA C. No w , S simulates the q successi v e queries Q = f Q 1 ; Q 2 ; : : : ; Q q g of V ( H q ) as follo ws: if Q j w as not issued before, as indicated by A ( H q ) of T r ( H q ) , then it constructs a random Q j . Otherwise it sets Q j = Q p , where p is pre vious inde x for Q j . Note that 8 i; Q i is indistinguishable from Q i . W e e xplain ho w S can simulate V ( H q ) . Hence, our proposed scheme is secure. 6. EXPERIMENT AL EV ALU A TION W e demonstrate the ef ficienc y and ef fecti v eness of our proposed scheme by e v aluating thorough e x- periments. The proposed scheme is e v aluated by Ja v a language using 64-bit W indo ws 10 operating system with Intel Core-i7 processor of 1.8 GHz. During the e xperiments, 8000 randomly selected files are from the RCV1 database [29], a list of 570 stop w ords are used, Porter algorithm [30] is used to stem the remaining w ords. Inde x generation time is determined by the parameter l . Figure 1a sho ws the ef fect of l for v ariable collection sizes. It is ob vious that long Simhash fingerprints require more processing time. 0 2 ; 000 4 ; 000 6 ; 000 8 ; 000 0 100 200 300 400 Number of documents Inde x b uilding time (s) l = 64 l = 128 l = 256 l = 384 (a) 0 2 ; 000 4 ; 000 6 ; 000 8 ; 000 0 100 200 300 400 500 Number of documents Inde x b uilding time (s) Minhash Simhash (b) Figure 1. Inde x b uilding time: (a) ef fect of l on inde x generation time and (b) comparison of our scheme with [18] in terms of inde x b uilding time Our scheme is compared with the w ork of [18] in terms of inde x generation time. Figure 1b demon- strates that the method of [18] requires more time than our w ork, since it uses hea vy cryptographic primiti v es. W e used tw o e v aluation metrics to sho w the ef fecti v eness of our scheme: precision and recall. Preci- sion pr ec refers to the percentage of rele v ant documents from the o v erall returned documents, while recall r ec Ef ficient multi-k e ywor d similarity sear c h o ver encrypted cloud documents (A yad I. Abdulsada) Evaluation Warning : The document was created with Spire.PDF for Python.
516 r ISSN: 2502-4752 refers to the percentage of rele v ant documents retrie v ed among the total number of rele v ant documents. Let k w = f w 1 ; w 2 ; : : : ; w s g be the set of search k e yw ords, R ( k w ) be the set of retrie v ed documents and R ( k w ) be the set of retrie v ed documents that include all k e yw ord set k w , D B ( k w ) be set of documents in the entire collection that contains all k e yw ord set k w . Note R ( k w ) R ( k w ) and R ( k w ) D B ( k w ) . W e define the precision( P r ec ( k w ) ), recall ( R ec ( k w ) ), a v erage precision ( Av g pr ec ( k w ) ) and A v erage recall ( Av g r ec ( k w ) ) as follo ws: P r ec ( k w ) = j R ( k w ) j j R ( k w ) j ; (1) R ec ( k w ) = j R ( k w ) j j D B ( k w ) j (2) Av g pr ec ( k w ) = q X i =1 pr ec ( k w i ) q ; (3) Av g r e c ( k w ) = q X i =1 r ec ( k w i ) q (4) Figure 2a sho ws the accurac y of results of our proposed scheme with v ariable Simhash size l . In this e xperiment, 150 search queries are pro vided with v ariable number of k e yw ords and only the top 20 matching documents are returned. 0 100 200 300 400 0 0 : 2 0 : 4 0 : 6 0 : 8 1 Fingureprint size Accurac y Precision Recall (a) 0 1 2 3 4 5 6 7 8 9 10 0 0 : 2 0 : 4 0 : 6 0 : 8 1 Number of k e yw ords Precision Simhash Minhash (b) Figure 2. Ef fecti v eness e v aluation: (a) Fingureprint size and (b) Number of k e yw ords In the ne xt e xperiment, we e v aluate ho w the number of k e yw ords s in the search query can af fect the result accurac y . Figure 2b sho ws the precision of results for dif ferent schemes with v ariable number of search k e yw ords. F or both schemes, precision decreases as the number of k e yw ords increases from 1 to 10. This is because when the number of search k e yw ords increased the non rele v ant retrie v ed documents will be accumulated, leading to lo wer precision. Notice that, the accurac y of our proposed scheme outperforms MinHash-based scheme, since it better inte grates the term frequenc y of each k e yw ord in the generated v ector , whereas MinHash method ignores such v aluable information. 7. CONCLUSION In this paper , we proposed a practical scheme for a multi-k e yw ord similarity search o v er encrypted data. The proposed scheme answers the multi-k e yw ord queries and can handle the misspelled k e yw ords. Doc- uments are retrie v ed to the user according to specific ranking method. The scheme is constructed according to the setting of direct inde xing, where a specific inde x is constructed for each document. Simhash function Indonesian J Elec Eng & Comp Sci, V ol. 23, No. 1, July 2021 : 510 518 Evaluation Warning : The document was created with Spire.PDF for Python.
Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752 r 517 is emplo yed to generate a small fix ed size fingerprint for each document. M u l ti-k e yw ord search queries are constructed in the same w ay , so the search operation is e v aluated simply by performing Hamming distance. T o illustrate the security of our scheme, a precis definition of s ecurity is presented along with its formal proof. Extensi v e e xperiments were conducted to sho w the practical v alue of our scheme. The proposed scheme sho ws high precision of 70% and recall of 85%. Our scheme requires 395 ms to generate the inde x for 8000 docu- ments. REFERENCES [1] W . Hassan, T .-S. Chou, X. Li, P . Appiah-K ubi, and O. T amer , “Latest trends, challenges and solutions in security in the era of cloud computing and softw are defined netw orks, International J ournal of Informatics and Communic ation T ec hnolo gy (IJ-ICT) , v ol. 8, no. 3, pp. 162–183, 2019, doi: 10.11591/ijict.v8i3.pp162-183. [2] E. Goh, “Secure inde x es, IA CR Cryptol. ePrint Ar c h. , v ol. 2003, p. 216, 2003. [Online]. A v ailable: http://eprint.iacr .or g/2003/216 [3] D. Ca sh, J. Jae ger , S. Jarecki, C. S. Jutla, H. Kra wczyk, M. Rosu, and M. Steiner , “Dynamic searchable encryption in v ery-lar ge databases: Data structures and implementation, in 21st Annual Network and Distrib uted System Security Symposium, NDSS 2014, San Die go, California, USA, F ebruary 23-26, 2014 . The Internet Society , 2014. [Online]. A v ailable: https://www .ndss-symposium.or g/ndss2014/dynamic-sea rchable-encryption-v ery-lar ge- databases-data-structures-and-implementation [4] D. Cas h, S. Jarecki, C. S. Jutla, H. Kra wczyk, M. Rosu, and M. Steiner , “Highly-scalable searchable symmetric encryption with support for boolean queries, in Advances in Cryptolo gy - CR YPT O 2013 - 33r d Annual Cryptolo gy Confer ence , Santa Barbar a, CA, USA, A ugust 18-22, 2013. Pr oceedings, P art I , ser . Lecture Notes in Computer Science, R. Canett i and J. A. Garay , Eds., v ol. 8042. Springer , 2013, pp. 353–373, doi: 10.1007/978-3-642-40041-4 20. [5] Y . Chang and M. Mitz enmacher , “Pri v ac y preserving k e yw ord searches on remote encrypted data, in Applied Crypto gr aphy and Network Security , Thir d International Confer ence , A CNS 2005, Ne w Y ork, NY , USA, J une 7-10, 2005, Pr oceedings , ser . Lecture Notes in Computer Science, J. Ioannidis, A. D. K eromytis, and M. Y ung, Eds., v ol. 3531, 2005, pp. 442–455, doi: 10.1007/11496137 30. [6] R. Curtmola, J. A. Garay , S. Kamara, and R. Ostro vsk y , “Searchable symmetric encryption: Impro v ed definitions and ef ficient constructions, IA CR Cryptol. ePrint Ar c h. , v ol. 2006, p. 210, 2006. [Online]. A v ailable: http://eprint.iacr .or g/2006/210 [7] J. Li, Q. W ang, C. W a ng, N. Cao, K. Ren, and W . Lou, “Fuzzy k e yw ord search o v er encrypted data in cloud computing, in 2010 Pr oceedings IEEE INFOCOM , 2010, pp. 1-5, doi: 10.1109/INFCOM.2010.5462196. [8] M. K uzu, M. S. Isl am, and M. Kantarcioglu, “Ef ficient similarity search o v er encrypted data, in 2012 IEEE 28th International Confer ence on Data Engineering , 2012, pp. 1156-1167, doi: 10.1109/ICDE.2012.23. [9] M. Strizho v and I. Ray , “Secure multi-k e yw ord similarity search o v er encrypted cloud data supporting ef ficient multi-user setup, T r ans. Data Priv . , v ol. 9, no. 2, pp. 131–159, 2016, doi: 10.5555/2993206.2993208. [10] X. Song, C. Dong, D. Y uan, Q. Xu, and M. Zhao, “F orw a rd pri v ate searchable symmetric encryption with optimized I/O ef ficienc y , IEEE T r ans. Dependable Secur . Comput. , v ol. 17, no. 5, pp. 912–927, 2020, doi: 10.1109/TDSC.2018.2822294. [11] J. G. Chamani, D. P apadopoulos, C. P apamanthou, and R. Jalili, “Ne w constructions for forw ard and backw ard pri v ate symmetric searchable encryption, in Pr oceedings of the 2018 A CM SIGSA C Confer ence on Computer and Communications Security , CCS 2018, T or onto, ON, Canada, October 15-19, 2018 , D. Lie, M. Mannan, M. Back es, and X. W ang, Eds. A CM, 2018, pp. 1038–1055, doi: 10.1145/3243734.3243833. [12] S. Chatterjee, S. K. P . Puria, and A. Shah, “Ef ficient backw ard pri v ate searchable encryption, J . Comput. Secur . , v ol. 28, no. 2, pp. 229–267, 2020, doi: 10.3233/JCS-191322. [13] I. Demertzis, J. G. Chamani, D. P apadopoulos, and C. P apamanthou, “Dynamic searchable encryption with small client storage. IA CR Cryptol. ePrint Ar c h. , v ol. 2019, p. 1227, 2019. [Online]. A v ailable: https://eprint.iacr .or g/2019/1227 [14] G. Amjad, S. Kamara, and T . Moataz, “F orw ard and backw ard pri v ate searchable encryption with sgx, in Pr oceedings of the 12th Eur opean W orkshop on Systems Security , 2019, pp. 1–6, doi: 10.1145/3301417.3312496. [15] L. Bingjie, Z. Jun, and C. Zhenfu, A multi-user forw ard secure dynamic symmetric searchable encryption with enhanced security , J ournal of Computer Resear c h and De velopment , v ol. 57, no. 10, p. 2104, 2020. [16] N. Cao, C. W ang, M. Li, K. Ren, and W . Lou, “Pri v ac y-preserving multi-k e yw ord rank ed search o v er encrypted cloud data, in IEEE T r ansactions on P ar al lel and Distrib uted Systems , v ol. 25, no. 1, pp. 222-233, Jan. 2014, doi: 10.1109/TPDS.2013.45. [17] C. ¨ Orencik and E. Sa v as, An ef ficient pri v ac y-preserving multi-k e yw ord search o v er encrypted cloud data with ranking, Distrib uted P ar allel Databases , v ol. 32, no. 1, pp. 119–160, 2014, doi: 10.1007/s10619-013-7123-9. Ef ficient multi-k e ywor d similarity sear c h o ver encrypted cloud documents (A yad I. Abdulsada) Evaluation Warning : The document was created with Spire.PDF for Python.
518 r ISSN: 2502-4752 [18] C. ¨ Orencik, M. Kantarcioglu, and E. Sa v as, A practical and secure multi-k e yw ord search method o v er encrypted cloud data, in 2013 IEEE Sixth International Confer ence on Cloud Computing , 2013, pp. 390-397, doi: 10.1109/CLOUD.2013.18. [19] C. ¨ Orencik, A. Selcuk, E. Sa v as, and M. Kantarcioglu, “Multi-k e yw ord search o v er encrypted data with scoring and search pattern obfuscation, Int. J . Inf . Sec. , v ol. 15, no. 3, pp. 251–269, 2016, doi: 10.1007/s10207-015-0294-9. [20] F . Baldimtsi and O. Ohrimenk o, “Sorting and searching behind the curtain, in F inancial Crypto gr aphy and Data Security - 19th International Confer ence , FC 2015, San J uan, Puerto Rico, J anuary 26-30, 2015, Re vised Selected P aper s , ser . Lecture Notes in Computer Science, R. B ¨ ohme and T . Okamoto, Eds., v ol. 8975. Springer , 2015, pp. 127–146, doi: 10.1007/978-3-662-47854-7 8. [21] Z. Xia, X. W ang, X. Sun, and Q. W ang, A secure and dynamic multi-k e yw ord rank ed search s cheme o v er encrypted cloud data, IEEE T r ansactions on P ar allel and Dist rib uted Systems , v ol. 27, no. 2, pp. 340-352, 1 Feb . 2016, doi: 10.1109/TPDS.2015.2401003. [22] C. Guo, R. Zhuang, C. Chang, and Q. Y uan, “Dynamic multi-k e yw ord rank ed search based on bloom filter o v er encrypted cloud data, IEEE Access , v ol. 7, pp. 35826-35837, 2019, doi: 10.1109/A CCESS.2019.2904763. [23] B. W ang, W . Song, W . Lou, and Y . T . Hou, “In v erted i nde x based multi-k e yw ord public-k e y searchable encryption with strong pri v ac y guarantee, in 2015 IEEE Confer ence on Computer Communications (INFOCOM) , 2015, pp. 2092-2100, doi: 10.1109/INFOCOM.2015.7218594. [24] K. Pushpa Rani, L. Lakshmi, C. Sabitha, B. Dhana Lakshmi, and S. Sreeja, “T op-k search scheme on encrypted data in cloud, International J ournal of Advances in Applied Sciences (IJ AAS) , v ol. 9, no. 1, pp. 67–69, 2020, doi: 10.11591/ijaas.v9.i1.pp67-69. [25] G. S. Poh, J. Chin, W . Y au, K. R. Choo, and M. S. Mohamad, “Searchable symmetric encryption: Designs and challenges, A CM Comput. Surv . , v ol. 50, no. 3, pp. 1-37, 2017, doi: 10.1145/3064005. [26] M. Sanderson, “Christopher d. manning, prabhakar ragha v an, hinrich sch ¨ utze, Introducti on to Information Retrie v al, cambridge uni v ersity press 2008. ISBN-13 978-0-521-86571-5, xxi + 482 pages, Nat. Lang . Eng . , v ol. 16, no. 1, pp. 100–103, 2010, doi: 10.1017/S1351324909005129. [27] M. Charikar , “Similarity estimation techniques from rounding algorithms, in Pr oceedings on 34th Annual A CM Symposium on Theory of Computing , May 19-21, 2002, Montr ´ eal, Qu ´ ebec, Canada , J. H. Reif, Ed. A CM, 2002, pp. 380–388, doi: 10.1145/509907.509965. [28] M. Bell are, R. Canetti, and H. Kra wczyk, “K e ying hash functions for message authentication, in Advances in Cryptol- o gy - CR YPT O ’96 , N. K oblitz, Ed. Berlin, Heidelber g: Springer Berl in Heidelber g, 1996, pp. 1-15, doi: 10.1007/3- 540-68697-5 1. [29] D. D. Le wis, Y . Y ang, T . G. Rose, and F . Li, “Rcv1: A ne w benchmark collection for te xt cate gorization research, J . Mac h. Learn. Res. , v ol. 5, p. 361–397, Dec. 2004. [30] M. F . Porter , An algorithm for suf fix stripping, Pr o gr am: electr onic libr ary and information systems , v ol. 40, no. 3, pp. 211-218, 2006, doi: 10.1108/00330330610681286. Indonesian J Elec Eng & Comp Sci, V ol. 23, No. 1, July 2021 : 510 518 Evaluation Warning : The document was created with Spire.PDF for Python.