1 Introduction
Background and Motivation.
In cloud computing, searchable symmetric encryption (SSE) for the multiple-data-owner model (multi-owner model, MOD) draws much attention because it enables multiple data users (clients) to search encrypted cloud data outsourced by multiple data owners (authorities). Unfortunately, none of the previously known SSE schemes for MOD achieves secure and precise queries, efficient search, and flexible dynamic system maintenance at the same time [9]. This severely limits the practical value of SSE and decreases its chance of deployment in real-world cloud storage systems.
Related Work and Challenge.
SSE has been continuously developed since it was proposed by Song et al. [12], and multi-keyword ranked search over encrypted cloud data is recognized as an outstanding scheme [9]. Cao et al. [1] first proposed the privacy-preserving multi-keyword ranked search scheme (MRSE) and established strict privacy requirements. They first employed the asymmetric scalar-product-preserving encryption (ASPE) approach [15] to obtain the similarity scores of the query vector and the index vector, so that the cloud server can return the top-k documents. However, they did not provide the optimal balance between query precision and privacy protection strength. For better query precision and query speed, Sun et al. [13] proposed MTS with the TF-IDF keyword weight model, where the keyword weight depends on the frequency of the keyword in the document and the ratio of the documents containing this keyword to the total documents. This means that TF-IDF cannot handle the differences between data from different owners in MOD, since each owner’s data is different and there is no uniform standard to measure keyword weights. Based on MRSE, Li et al. [8] proposed a better solution (MKQE), where a new index construction algorithm and trapdoor generation algorithm are designed to realize the dynamic expansion of the keyword dictionary and improve the system performance. However, their scheme only achieved linear search efficiency. Xia et al. [16] provided EDMRS to support flexible dynamic operations by using a balanced index tree built following the bottom-up strategy and a “greedy” method, and they used parallel computing to improve search efficiency. However, when migrating to MOD, the ordinary balanced binary tree they employed is not optimal [6]. Unfortunately, the above solutions only support SSE for the single data owner model. Due to the diverse demands of application scenarios, such as emerging authorised searchable technology for multi-client (multi-authority) encrypted medical databases that focuses on privacy protection [18, 19], research on SSE for MOD is increasingly active. Guo et al. [6] proposed MKRS_MO for MOD and designed a heuristic weight generation algorithm based on the relationships among keywords, documents and owners (KDO). They considered the correlation among documents and the impact of document quality on search results, so that KDO is more suitable for MOD than TF-IDF. However, they ignored secure search in the known background model [1] (a threat model that measures the ability of the “honest but curious” cloud server [14, 20] to evaluate private data and the risk of revealing private information in an SSE system). Currently, SSE for MOD still faces these challenges: (1) comprehensively optimizing query precision and privacy protection is difficult; (2) a large amount of different data from multiple data owners makes the data features sparse, and the calculation of high-dimensional vectors can cause the “curse of dimensionality”; (3) frequent data updates challenge the scalability of dynamic system maintenance.
Our Contribution.
This paper proposes secure and efficient multi-keyword ranked search over encrypted cloud data for the multi-owner model based on searching adversarial networks (MRSM_SAN). Specifically, it includes the following three techniques: (1) optimal pseudo-keyword padding based on searching adversarial networks (SAN): Improving the privacy protection strength of SSE is a top priority. Padding random noise into the data [1, 8, 17] is a currently popular method designed to interfere with analysis and evaluation by the cloud server, which better protects the document content and keyword information. However, such an operation reduces the query precision [1]. In response to this, we use adversarial learning [4] to obtain the optimal probability distribution for controlling pseudo-keyword padding and the optimal game equilibrium between query precision and privacy protection strength. This makes query precision exceed 95% while ensuring adequate privacy protection, which is better than traditional SSE [1, 6, 8, 13, 16]; (2) efficient search based on the maximum likelihood search balanced tree (MLSBTree): The construction of the index tree is the biggest factor affecting search efficiency. If the leaf nodes of the index tree are sorted by maximum probability (the ranking of the index vectors from high to low depends on the probability of being searched), the computational complexity will be close to logarithmic [7]. Probabilistic learning is employed to obtain MLSBTree, which is ordered by maximum probability. The experimental evaluation shows that MLSBTree-based search is faster and more stable compared with related works [6, 16]; (3) flexible dynamic system maintenance based on the balanced index forest (BIF): We use unsupervised learning [3, 10] to design a fast index clustering algorithm that classifies all indexes into multiple index partitions, and a corresponding balanced index tree is constructed for each index partition, so that all index trees form the BIF. Because BIF is distributed, dynamic system maintenance only needs to maintain the corresponding index partition without touching all indexes, which improves the efficiency of index update operations and introduces low overhead on computation, communication and storage. In summary, MRSM_SAN increases the possibility of deploying dynamic SSE in real-world cloud storage systems.
Organization and Version Notes.
2 Secure and Efficient MRSM_SAN
2.1 System Model
The proposed system model consists of four parties, as depicted in Fig. 1. Data owners are responsible for constructing the searchable index, encrypting data, and sending them to the cloud server or the trusted proxy. Data users are consumers of cloud services; based on attribute-based encryption [5], once the attributes related to the retrieved data are authorized, data users can retrieve the corresponding data. The trusted proxy is responsible for index processing, query and trapdoor generation, and user authority authentication. The cloud server provides cloud services, including running authorized access controls, performing searches over encrypted cloud data based on query requests, and returning top-k documents to data users. The cloud server is considered “honest but curious” [14, 20], so it is necessary to provide a secure search scheme to protect privacy. Our goal is to protect index privacy, query privacy and keyword privacy in dynamic SSE.
2.2 MRSM_SAN Framework
 Setup:

Based on index clustering results ( index partitions) and privacy requirements in known background model [1], determines the size of subdictionary , the number of pseudokeyword, sets the parameter . Thus = {,…,}, = {,…,}, = {,…,}.
 KeyGen():

generates key = {,…,}, where = {, , }, and are two dimensional invertible matrices, is a random dimensional vector. Symmetric key = {, , }.
 ExtendedKeyGen():
 BuildIndex():

To realize secure search in known background model [1], pads pseudokeyword into weighted index (associated with document ) to obtain secure index , and uses and to generate BIF = {,…,} and encrypted BIF = {,…,}. Finally, sends to .
 Trapdoor():

send query requests (keywords and their weights) and attribute identification to . generates query = {,…,} and generates trapdoor = {,…,} using . Finally, sends to .
 Query:

sends query information to and specifies index partitions to be queried. performs searches and retrieves topk documents.
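The KeyGen, BuildIndex and Trapdoor steps rely on an ASPE-style key made of a split vector and two invertible matrices [15]. The following is a minimal sketch of that idea, not the scheme's actual algorithms: it follows the commonly described ASPE construction (split the vector according to a secret bit vector, multiply the index shares by the transposed matrices and the query shares by the inverses), and it omits the dictionary extension and pseudo-keyword padding. All function names, the toy dimension, and the use of exact fractions instead of floating-point matrices are assumptions made for illustration.

```python
import random
from fractions import Fraction


def invert(matrix):
    """Gauss-Jordan inverse over exact fractions; returns None if singular."""
    n = len(matrix)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        lead = aug[col][col]
        aug[col] = [v / lead for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def random_invertible(dim, rng):
    """Draw random integer matrices until one is invertible."""
    while True:
        m = [[rng.randint(-9, 9) for _ in range(dim)] for _ in range(dim)]
        inv = invert(m)
        if inv is not None:
            return m, inv


def keygen(dim, rng):
    """Secret key: a binary split vector and two invertible matrices."""
    split = [rng.randint(0, 1) for _ in range(dim)]
    return split, random_invertible(dim, rng), random_invertible(dim, rng)


def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def split_vector(vec, split, split_when, rng):
    """Split positions whose key bit equals split_when into two random shares."""
    a, b = [], []
    for value, bit in zip(vec, split):
        value = Fraction(value)
        if bit == split_when:
            share = Fraction(rng.randint(-100, 100), 10)
            a.append(share)
            b.append(value - share)
        else:
            a.append(value)
            b.append(value)
    return a, b


def encrypt_index(index_vec, key, rng):
    split, (m1, _), (m2, _) = key
    a, b = split_vector(index_vec, split, 1, rng)
    return mat_vec(transpose(m1), a), mat_vec(transpose(m2), b)


def gen_trapdoor(query_vec, key, rng):
    split, (_, m1_inv), (_, m2_inv) = key
    a, b = split_vector(query_vec, split, 0, rng)
    return mat_vec(m1_inv, a), mat_vec(m2_inv, b)


def score(enc_index, trapdoor):
    """Inner product of the encrypted index and the trapdoor."""
    return sum(x * y for part_i, part_t in zip(enc_index, trapdoor)
               for x, y in zip(part_i, part_t))


rng = random.Random(7)
key = keygen(4, rng)
index_vec = [1, 0, 1, 1]                 # toy binary index over 4 keywords
query_vec = [Fraction(1, 2), 0, 1, 0]    # toy weighted query
enc_index = encrypt_index(index_vec, key, rng)
trapdoor = gen_trapdoor(query_vec, key, rng)
plain_score = sum(Fraction(p) * Fraction(q) for p, q in zip(index_vec, query_vec))
```

In this sketch the server-side score equals the plaintext inner product exactly, which is the property that lets the cloud server rank documents without seeing the index or query vectors.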
2.3 Algorithms for Scheme

Binary Index Generation: uses algorithm 1 to generate the binary index (vector) for the document , and sends to .

Fast Index Clustering & Keyword Dictionary Segmentation: We employ algorithm 2 to address the “curse of dimensionality” issue in computing.

MLSBTree and BIF Generation: uses algorithm 4 to generate MLSBTree and BIF = .

Trapdoor Generation. Based on query request from , generates = {,…,} and = {,…,} using algorithm 6, and sends to .

Search Process. (1) Query Preparation: data users send query requests and attribute identifications to the trusted proxy. If the queries are valid, the trusted proxy generates trapdoors and initiates search queries to the cloud server. If access control passes, the cloud server performs searches and returns top-k documents to the data users; otherwise it refuses the query. (2) Calculate Matching Score for Query on MLSBTree :
(3) Search Algorithm for BIF: the greedy depth-first search (GDFS) algorithm for BIF is shown in algorithm 6.
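To illustrate the kind of greedy depth-first search referred to in step (3), the sketch below runs a top-k search over one plaintext balanced index tree. It is an illustration under stated assumptions rather than the paper's algorithm: each internal node is assumed to store the element-wise maximum of its children's vectors (as in tree-based schemes such as [16]), query weights are assumed non-negative so that a node's score bounds the scores in its subtree, and the names `Node`, `build_tree` and `gdfs` are hypothetical. In a BIF, the same search would be run on the tree of each index partition being queried.

```python
import heapq
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    vector: List[float]
    doc_id: Optional[str] = None      # set only on leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def build_tree(leaves: List[Node]) -> Node:
    """Bottom-up pairing; a parent keeps the element-wise max of its children."""
    level = list(leaves)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            if len(pair) == 1:
                next_level.append(pair[0])
            else:
                left, right = pair
                merged = [max(x, y) for x, y in zip(left.vector, right.vector)]
                next_level.append(Node(merged, left=left, right=right))
        level = next_level
    return level[0]


def gdfs(root: Node, query: List[float], k: int) -> List[Tuple[float, str]]:
    """Greedy depth-first top-k search with subtree pruning."""
    top: List[Tuple[float, str]] = []   # min-heap holding the current best k

    def visit(node: Node) -> None:
        node_score = dot(node.vector, query)
        # The node score is an upper bound for its subtree (non-negative query).
        if len(top) == k and node_score <= top[0][0]:
            return
        if node.doc_id is not None:
            if len(top) < k:
                heapq.heappush(top, (node_score, node.doc_id))
            else:
                heapq.heapreplace(top, (node_score, node.doc_id))
            return
        children = sorted((node.left, node.right),
                          key=lambda child: dot(child.vector, query),
                          reverse=True)
        for child in children:
            visit(child)

    visit(root)
    return sorted(top, reverse=True)


leaves = [
    Node([0.9, 0.0, 0.0], doc_id="d1"),
    Node([0.0, 0.5, 0.0], doc_id="d2"),
    Node([0.6, 0.7, 0.0], doc_id="d3"),
    Node([0.0, 0.0, 1.0], doc_id="d4"),
]
root = build_tree(leaves)
result = gdfs(root, [1.0, 1.0, 0.0], k=2)
top_ids = [doc_id for _, doc_id in result]
```

Visiting the higher-scoring child first fills the result heap quickly, so later subtrees are more likely to be pruned without being expanded.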
2.4 Security Improvement and Analysis
Adversarial Learning.
Padding random noise into the data [1, 8, 17] is a popular method to improve security. However, pseudo-keyword padding that follows different probability distributions will reduce query precision to varying degrees [1, 8]. Therefore, it is necessary to optimize the probability distribution that controls pseudo-keyword padding. To address this, adversarial learning [4] for optimal pseudo-keyword padding is proposed, as shown in Fig. 2. Searcher Network S(): The search result (SR) is generated by taking random noise (drawn from the object probability distribution) as input and performing a search; S then supplies the SR to the discriminator network D. Discriminator Network D(): The input is either an accurate search result (ASR) or an SR, and D attempts to predict which of the two it is. One of the inputs is obtained from the real search result set distribution; D then solves a binary classification problem and outputs a scalar ranging from 0 to 1. Finally, at the balance point, which is the best point of the minimax game, S generates SR and D (considered as the adversary) considers the probability that S produces an ASR to be 0.5, i.e. it is difficult to distinguish between padding and without padding, so effective security can be achieved [17].
Similar to GAN [4], to learn the searcher’s distribution $p_s$ over data $x$, we define a prior on input noise variables $p_z(z)$, then represent a mapping to data space as $S(z; \theta_s)$, where $S$ is a differentiable function represented by a multilayer perceptron with parameters $\theta_s$. We also define a second multilayer perceptron $D(x; \theta_d)$ that outputs a single scalar. $D(x)$ represents the probability that $x$ came from the data rather than $p_s$. We train $D$ to maximize the probability of assigning the correct label to both training examples and samples from $S$. We simultaneously train $S$ to minimize $\log(1 - D(S(z)))$. In other words, $D$ and $S$ play the following two-player minimax game with value function $V(D, S)$:
$$\min_S \max_D V(D, S) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(S(z)))]$$
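As a small numerical illustration of a GAN-style value function [4], the sketch below estimates the expected log-probability that the discriminator assigns to real samples plus the expected log-probability that it rejects generated ones. The function name and the sample values are assumptions chosen for illustration; the inputs are discriminator outputs, not search results from the actual system.

```python
import math
from typing import Sequence


def value_function(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(S(z)))].

    d_real: discriminator outputs on accurate search results.
    d_fake: discriminator outputs on searcher-generated results.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell the two apart outputs 0.5 everywhere.
confused = value_function([0.5, 0.5, 0.5], [0.5, 0.5])

# A discriminator that separates them well achieves a higher value.
confident = value_function([0.9, 0.95, 0.8], [0.1, 0.05])
```

When the discriminator outputs 0.5 for every input, the estimate equals -log 4, which is the value of the game at the GAN equilibrium described in [4].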
Security Analysis.
Index confidentiality and query confidentiality: The ASPE approach [15] is widely used to generate secure indexes and queries in privacy-preserving keyword search schemes [1, 6, 8, 13, 16], and its security has been proven. Since the index/query vector is randomly generated and search queries return only the secure inner product [1] computation results (non-zero) of the encrypted index and trapdoor, it is difficult for the cloud server to accurately evaluate the keywords included in the query and in the matching top-k documents. Moreover, confidentiality is further enhanced because the optimal pseudo-keyword padding is difficult to distinguish and the transformation matrices are harder to figure out [15].
Query unlinkability: By introducing random values (padding pseudo-keywords), the same search requests will generate different query vectors and receive different relevance score distributions [1, 16]. The optimal game equilibrium between precision and privacy is obtained by adversarial learning, which further improves query unlinkability. Meanwhile, SAN is designed to protect the access pattern [17], which makes it difficult for the cloud server to judge whether the retrieved ranked search results come from the same request.
Keyword privacy: According to the security analysis in [16], for th index partition, aiming to maximize the randomness of the relevance score distribution, it is necessary to obtain as many different as possible (where ; in [16], ). Assuming each index vector has at least different choices, the probability that two share the same value is less than . If we set each (Uniform distribution ), according to the central limit theorem, (Normal distribution), where , . Therefore, it can set and balance precision and privacy by adjusting the variance in real-world applications. In fact, when (floating point number), SAN can achieve stronger privacy protection.
3 Experimental Evaluation
We implemented the proposed scheme in Python on the Windows 10 operating system with an Intel Core i5 2.40 GHz processor and evaluated its performance on a real-world data set (academic conference publications provided by IEEE Xplore, https://ieeexplore.ieee.org/, including 20,000 papers and 80,000 different keywords; 400 academic conferences were randomly selected as data owners). All experimental results represent the average of 1000 trials.
Optimal Pseudokeyword Padding.
The parameters controlling the probability distribution (using SAN to find or approximate) are adjusted to find the optimal game equilibrium for query precision (denoted as ) and rank privacy protection (denoted as ) (where , , and are respectively the number of real topk documents and the rank number of document in the retrieved documents, and is document’s real rank number in the whole ranked results [1]). We choose 95% query precision and 80% rank privacy protection as benchmarks to get the game equilibrium score calculation formula: (objective function to be optimized). As shown in Fig. 3, we find the optimal game equilibrium () at , , . The corresponding query precision are: 98%, 97%, 93%. The corresponding rank privacy protection are: 78%,79%,84%. Therefore, we can choose the best value of to achieve optimal pseudokeyword padding to satisfy query precision requirement and maximize rank privacy protection.
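The two quantities traded off above can be computed from a retrieved list and the true ranking. The sketch below is one reading of the definitions cited from [1], stated as an assumption: query precision is taken as the fraction of the k retrieved documents that belong to the real top-k, and rank privacy as the total absolute difference between each retrieved document's position and its real rank, normalised by k squared. The function names and the toy document identifiers are hypothetical, and the game equilibrium score itself is not reproduced here.

```python
from typing import Sequence


def query_precision(retrieved: Sequence[str], true_ranking: Sequence[str], k: int) -> float:
    """Fraction of the k retrieved documents that are in the real top-k."""
    true_top_k = set(true_ranking[:k])
    hits = sum(1 for doc in retrieved[:k] if doc in true_top_k)
    return hits / k


def rank_privacy(retrieved: Sequence[str], true_ranking: Sequence[str], k: int) -> float:
    """Sum of |position in retrieved list - real rank|, normalised by k^2 (assumed)."""
    real_rank = {doc: i + 1 for i, doc in enumerate(true_ranking)}
    displacement = sum(abs(real_rank[doc] - (i + 1))
                       for i, doc in enumerate(retrieved[:k]))
    return displacement / (k * k)


true_ranking = ["a", "b", "c", "d", "e", "f"]   # ranking without padding
retrieved = ["b", "a", "e", "c"]                # top-4 returned with padding
precision = query_precision(retrieved, true_ranking, k=4)
privacy = rank_privacy(retrieved, true_ranking, k=4)
```

Stronger padding tends to push `privacy` up and `precision` down, which is the tension the equilibrium search is meant to resolve.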
Fig. 3: Results with different choices of standard deviation for the random variable. (a) Query precision (%) and rank privacy protection (%); (b) game equilibrium (score). Explanation: when the standard deviation is greater than 0.2, the weight of the pseudo-keyword may be greater than 1, which violates our weight setting (between 0 and 1), so we only need to find the best game equilibrium point when the standard deviation is at most 0.2.
Search Efficiency of MLSBTree.
Search efficiency is mainly described by query speed, and our experimental objects are index trees that are structured with different strategies: EDMRS [16] (ordinary balanced binary tree), MKRS_MO [6] (grouped balanced binary tree), MRSM_SAN (globally grouped balanced binary tree) and MRSM_SAN_MLSBTree. We first randomly generate 1000 query vectors, then perform a search on each index tree, and finally take the results of 20 repeated experiments for analysis. As shown in Fig. 4, the query speed and query stability based on MLSBTree are better than those of the other index trees. Compared with EDMRS and MKRS_MO, query speed increased by 21.72% and 17.69%, respectively. In terms of stability, MLSBTree is also better than the other index trees (variance of search time (s): 0.0515 [6], 0.0193 [16], 0.0061 [MLSBTree]).
Search Efficiency of BIF.
As shown in Fig. 4, the query speed of MRSM_SAN (with MLSBTree and BIF) is significantly higher than that of MRSM_SAN (with MLSBTree only): the search efficiency is improved by 5 times and the stability increases too. This is only the experimental result for a set of 500 documents with a 4000-dimensional keyword dictionary. After the index clustering operation, the keyword dictionary is divided into four sub-dictionaries with a dimension of approximately 1000 each. As the amount of data increases, the dimension of the keyword dictionary will become extremely large, and the advantages of BIF will become more apparent. In our analytical experiments, the theoretical efficiency ratio before and after segmentation is: , where denotes the number of index partitions after fast index clustering, and denotes the total number of documents. When the amount of data increases to 20,000 documents, the total keyword dictionary dimension is as high as 80,000. If the keyword sub-dictionary dimension is 1000, the number of index partitions after fast index clustering is 80, and the search efficiency will increase by more than 100 times (). This will bring large benefits to large information systems, since our solution can obtain large returns from minimal computing resources.
Comparison of Search Efficiency (Larger Data Set).
The efficiency of MRSM_SAN (without BIF) and related works [1, 6, 8, 16] is shown in Fig. 5, and the efficiency of MRSM_SAN (without BIF) and MRSM_SAN (with BIF) is also shown in Fig. 5. More notably, the maintenance cost of the scheme based on BIF is much lower than the cost of a scheme based only on a single balanced index tree. When adding a new document to , we need to insert a new index vector (a node of the tree) in the index tree accordingly. If the scheme is based only on one index tree, search complexity (searching for the location where the new index is inserted into the index tree) and update complexity (updating the parent node corresponding to the new index) are both at least [9], and the total cost is (where denotes the total number of documents). BIF is very different, because we group all index vectors into different index partitions and reduce their dimension. We assume that the number of index vectors in each index partition is equal, so that the same update operation is spent for each partition, which makes the cost only and enables flexible dynamic system maintenance. Moreover, the increase in efficiency is positively correlated with the increase in data volume and data sparsity. For communication, compared with traditional SSE schemes [6, 13, 16], when the old index tree in the cloud is overwritten by the new index tree uploaded by /, our scheme only needs to update the specified index tree instead of the entire index forest. For storage, tree-based overhead is times forest-based.
4 Discussion
This paper proposes the secure and efficient MRSM_SAN scheme and conducts an in-depth security analysis and experimental evaluation. We use adversarial learning to find the optimal game equilibrium between query precision and privacy protection strength, and we combine traditional SSE with uncertain control theory, which opens a door for intelligent SSE. In addition, we propose MLSBTree, which is generated by a sufficient amount of random searches and brings the computational complexity close to logarithmic. This means that using probabilistic learning to optimize the query result is effective in an uncertain system (the owner’s data and the user’s queries are uncertain). Last but not least, we implement flexible dynamic system maintenance with BIF, which not only reduces the overhead of dynamic maintenance and makes full use of distributed computing, but also improves the search efficiency and achieves fine-grained search. This helps improve the availability, flexibility and efficiency of dynamic SSE systems.
Acknowledgment
This work was supported by “the Fundamental Research Funds for the Central Universities” (No. 30918012204) and “the National Undergraduate Training Program for Innovation and Entrepreneurship” (Item number: 201810288061). NJUST graduate Scientific Research Training of ‘Hundred, Thousand and Ten Thousand’ Project “Research on Intelligent Searchable Encryption Technology”.
References
 [1] (2014) Privacy-preserving multi-keyword ranked search over encrypted cloud data. IEEE TPDS 25 (1), pp. 222–233.
 [2] (2019) Multi-client secure encrypted search using searching adversarial networks. IACR Cryptology ePrint Archive 2019, pp. 900.
 [3] (2019) Entropy-based fuzzy twin bounded support vector machine for binary classification. IEEE Access 7, pp. 86555–86569.
 [4] (2014) Generative adversarial networks. CoRR abs/1406.2661.
 [5] (2006) Attribute-based encryption for fine-grained access control of encrypted data. In ACM CCS 2006, pp. 89–98.
 [6] (2018) Secure multi-keyword ranked search over encrypted cloud data for multiple data owners. Journal of Systems and Software 137 (3), pp. 380–395.
 [7] (1998) The art of computer programming, volume III, 2nd edition. Addison-Wesley.
 [8] (2014) Efficient multi-keyword ranked query over encrypted data in cloud computing. FGCS 30, pp. 179–190.
 [9] (2017) Searchable symmetric encryption: designs and challenges. ACM Comput. Surv. 50 (3), pp. 40:1–40:37.
 [10] (2018) A fast and effective partitional clustering algorithm for large categorical datasets using a k-means based approach. Computers & Electrical Engineering 68, pp. 463–483.
 [11] (1975) A vector space model for automatic indexing. Commun. ACM 18 (11), pp. 613–620.
 [12] (2000) Practical techniques for searches on encrypted data. In IEEE S & P 2000, pp. 44–55.
 [13] (2014) Verifiable privacy-preserving multi-keyword text search in the cloud supporting similarity-based ranking. IEEE TPDS 25 (11), pp. 3025–3035.
 [14] (2010) Privacy-preserving public auditing for data storage security in cloud computing. In IEEE INFOCOM 2010, pp. 525–533.
 [15] (2009) Secure kNN computation on encrypted databases. In ACM SIGMOD 2009, pp. 139–152.
 [16] (2016) A secure and dynamic multi-keyword ranked search scheme over encrypted cloud data. IEEE TPDS 27 (2), pp. 340–352.
 [17] (2019) Hardening database padding for searchable encryption. In IEEE INFOCOM 2019, pp. 2503–2511.
 [18] (2019) Enabling authorized encrypted search for multi-authority medical databases. IEEE TETC.
 [19] (2019) Building a dynamic searchable encrypted medical database for multi-client. Inf. Sci.
 [20] (2010) Achieving secure, scalable, and fine-grained data access control in cloud computing. In IEEE INFOCOM 2010, pp. 534–542.