1 Introduction
1.1 Motivation and related works
Feature selection is one of the typical problems in machine learning. For example, the human genome consists of 3.1 billion base pairs, of which at most a few dozen pairs are said to affect a particular disease. Feature selection extracts, from such very sparse data, a set of features that matches a specific purpose, and the result is then used by various machine learning algorithms. We review the definition of the feature selection problem, its computational complexity, and the approximate solutions that have been proposed so far.
Feature selection is defined by a data set $D$, a feature set $F$, and a class set $C$. Here we consider features and classes to be binary, but it is easy to extend the problem to the multilabel case. Thus, the feature value and the class label of a datum $x \in D$ are denoted by $f(x) \in \{0,1\}$ for $f \in F$ and $c(x) \in \{0,1\}$, and when $|F| = n$, each datum is associated with a binary vector of length $n$. Given a triple $(D, F, C)$, the goal of an algorithm is to extract a consistent and minimal subset $F' \subseteq F$, where $F'$ is consistent if, for any $x, y \in D$, $f(x) = f(y)$ for all $f \in F'$ implies $c(x) = c(y)$, and a feature set $F'$ is minimal if any proper subset of $F'$ is no longer consistent.
Table 1: An example dataset; the bottom row shows the mutual information of each feature with the class.

f1     f2     f3     f4     f5     c
1      0      1      1      1      0
1      1      0      0      0      0
0      0      0      1      1      0
1      0      1      0      0      0
1      1      1      1      0      1
0      1      0      1      0      1
0      1      0      0      1      1
0      0      0      0      1      1
0.189  0.189  0.049  0.000  0.000
To our knowledge, the most common approach to finding features that characterize a class is to select the features that show higher relevance under some statistical measure. The relevance of individual features can be estimated using statistical measures such as mutual information and Bayesian risk. For example, the bottom row of Table 1 shows the mutual information score of each feature with respect to the class labels. We can see that $f_1$ and $f_2$ are more relevant than $f_3$, since $I(f_1; c) = I(f_2; c) = 0.189 > 0.049 = I(f_3; c)$. Based on the mutual information scores, $f_1$ and $f_2$ of Table 1 would be selected to explain $c$. However, looking more closely, we see that $f_1$ and $f_2$ cannot determine $c$ uniquely. In fact, Table 1 contains data $x$ and $y$ with $f_1(x) = f_1(y)$ and $f_2(x) = f_2(y)$ whose class labels are different. On the other hand, $f_4$ and $f_5$ uniquely determine $c$ by the formula $c = f_4 \oplus f_5$, while $I(f_4; c) = I(f_5; c) = 0$ holds. Therefore, the traditional method based on relevance scores of individual features misses the right answer. This problem is well known as the problem of interacting features, which has been intensively studied in machine learning research. The literature describes a class of feature selection algorithms that can solve this problem, referred to as consistency-based feature selection [14, 16, 17, 11, 2]. CWC (Combination of Weakest Components) [16] is the simplest of these consistency-based algorithms, and even though it uses the most rigorous measure, it shows one of the best performances in terms of accuracy as well as computational speed compared to other methods [15].
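The failure mode above can be reproduced in a few lines. The following sketch (with the dataset transcribed from Table 1; the helper `mutual_information` is ours, not from the paper) computes each feature's mutual information with the class and checks that the two zero-score features jointly determine it:

```python
import math

# Dataset of Table 1: five binary features f1..f5 and a class label c.
rows = [
    (1, 0, 1, 1, 1, 0),
    (1, 1, 0, 0, 0, 0),
    (0, 0, 0, 1, 1, 0),
    (1, 0, 1, 0, 0, 0),
    (1, 1, 1, 1, 0, 1),
    (0, 1, 0, 1, 0, 1),
    (0, 1, 0, 0, 1, 1),
    (0, 0, 0, 0, 1, 1),
]

def mutual_information(xs, cs):
    """I(X; C) in bits for two binary sequences of equal length."""
    n = len(xs)
    mi = 0.0
    for x in (0, 1):
        for c in (0, 1):
            pxc = sum(1 for a, b in zip(xs, cs) if (a, b) == (x, c)) / n
            if pxc > 0:
                mi += pxc * math.log2(pxc / ((xs.count(x) / n) * (cs.count(c) / n)))
    return mi

classes = [r[-1] for r in rows]
scores = [mutual_information([r[j] for r in rows], classes) for j in range(5)]
# scores ≈ [0.189, 0.189, 0.049, 0.000, 0.000], matching Table 1

# f4 and f5 are individually irrelevant by this measure, yet together
# they determine the class exactly: c = f4 XOR f5.
assert all(r[3] ^ r[4] == r[5] for r in rows)
```
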
1.2 Our contribution
Algorithm                     Time   Space

a naive CWC on plaintext [16]
secure CWC (baseline)
improved
We extend the feature selection problem to multiple users holding their own private datasets, and propose the first secure multiparty protocol that jointly computes feature selection over the entire data without revealing the users' private information.
Our proposed method is a two-party protocol based on fully homomorphic encryption. Here, we briefly explain related work on homomorphic cryptosystems. Given a public-key cryptosystem $E$, let $E(x)$ denote an integer $x$ encrypted with its public key. If $E(x+y)$ (the ciphertext of $x+y$) can be computed from $E(x)$ and $E(y)$ only with public information, in particular without decrypting $E(x)$ and $E(y)$, then $E$ is said to be additively homomorphic, and if $E(x \cdot y)$ can also be computed from $E(x)$ and $E(y)$, then $E$ is said to be fully homomorphic. Besides, when any plaintext can be encrypted into any element of a sufficiently large set of ciphertexts, and for each execution of encryption such a ciphertext is chosen probabilistically, $E$ is said to be probabilistic. Being probabilistic is required for a cryptosystem to satisfy the security notion of ciphertext indistinguishability: given $x$, $y$, and $E(z)$, where $z$ is secretly selected from $x$ and $y$ uniformly at random, it must be computationally infeasible to guess which one $z$ is.
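The homomorphic property can be illustrated concretely with textbook RSA, which is multiplicatively homomorphic: the product of two ciphertexts decrypts to the product of the plaintexts. This toy sketch uses tiny parameters of our own choosing; unpadded RSA is deterministic, hence not probabilistic and not secure in the sense above, and serves only to show an operation on ciphertexts mirroring one on plaintexts:

```python
# Textbook RSA with tiny parameters (illustration only, NOT secure).
p, q, e = 61, 53, 17
n = p * q
d = pow(e, -1, (p - 1) * (q - 1))   # private exponent (Python 3.8+)

enc = lambda m: pow(m, e, n)
dec = lambda c: pow(c, d, n)

a, b = 7, 12
# Multiplying ciphertexts multiplies the plaintexts: E(a)*E(b) = E(a*b mod n)
assert dec(enc(a) * enc(b) % n) == (a * b) % n
```
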
In the last two decades, various homomorphic encryption schemes satisfying these properties have been proposed. The first (probabilistic) additively homomorphic encryption was proposed by Paillier [12]. Somewhat homomorphic encryption schemes that allow a sufficient number of additions and a restricted number of multiplications have also been proposed [5, 6, 3], and with these cryptosystems we can compute more complex functions, such as the inner product of two vectors. The first fully homomorphic encryption, with an unlimited number of additions and multiplications, was proposed by Gentry [9], and since then useful libraries for fully homomorphic encryption have been developed, especially for bitwise and floating-point operations.
TFHE [7, 8] is known as one of the fastest fully homomorphic encryption schemes, specialized for bitwise operations. In this study, we use TFHE to design our algorithm for multiparty feature selection protocols. We assume that parties $A$ and $B$ have their own private data $D_A$ and $D_B$, and that $|D_A|$ and $|D_B|$ are known. Under the assumption that the parties can use their respective TFHE keys, the goal of the parties is to jointly compute the result of the CWC algorithm on the joint plain data $D_A \cup D_B$, without revealing any other information about $D_A$ and $D_B$.
We summarize the results of this work in Table 2. The baseline is a naive algorithm that simulates the original CWC [16] over ciphertexts using TFHE operations. Given the joint data, the essential task of CWC is to sort the features in increasing order of their relevance to the class. Using the sorted features, CWC decides whether or not each feature should be selected. The resulting features are the output of CWC.
It is well known that sorting, the main task in CWC, is one of the most difficult problems in secure computation. We therefore propose an improvement of the baseline algorithm that reduces the cost of sorting. We show the time and space complexities of both algorithms in Table 2. The improvement significantly reduces the time complexity while maintaining the space complexity. We also implemented the baseline algorithm and examined its running time on real data. As a result, we confirmed that most of the time is spent on sorting. The implementation of the improved algorithm is future work.
2 Preliminaries
2.1 CWC algorithm over plaintext
For the dataset $D$ associated with $F$ and $C$, we generally assume that $D$ contains no error, i.e., if $f(x) = f(y)$ for all $f \in F$, then $c(x) = c(y)$. When $D$ contains such errors, they are removed beforehand; as a result, $D$ contains at most one datum for any given combination of feature values.
We describe the original algorithm for finding a minimal consistent feature set in Algorithm 1. Given $D$, a datum $x$ with $c(x) = 1$ is called a positive datum, and one with $c(x) = 0$ a negative datum. Let $m$ be the number of positive data and $n$ the number of negative data. Let $x_i$ be the $i$-th positive datum and $y_j$ the $j$-th negative datum. Then, for each feature $f$, the bit string $b_f$ of length $mn$ is defined by: $b_f[i,j] = 1$ if $f(x_i) \neq f(y_j)$, and $b_f[i,j] = 0$ otherwise. $b_f[i,j] = 0$ means that $f$ alone is not consistent with the pair $(x_i, y_j)$, because $f(x_i) = f(y_j)$ despite $c(x_i) \neq c(y_j)$. Recall that a feature set is said to be consistent only if equal feature values imply equal class labels for any pair of data. Thus, the consistency measure $w(f)$ is defined to be the number of $1$s in $b_f$.
For a subset $F' \subseteq F$, $F'$ is said to be consistent if, for any $i$ and $j$, there exists $f$ such that $f \in F'$ and $b_f[i,j] = 1$ hold. Using this, CWC removes irrelevant features from $F$ to construct a minimal consistent feature set.¹ (¹ Finding a smallest consistent feature set is clearly NP-hard due to an obvious reduction from the minimum set cover problem.)
In Table 3, we show an example of $D$ and the corresponding bit strings $b_f$. Let us consider the behavior of CWC on this example. All the $b_f$ are computed as preprocessing. Then, the features are sorted by their consistency measures. In this order, CWC checks whether each feature can be removed from the current feature set. By the consistency measure, CWC removes the redundant features, and the resulting minimal consistent subset is the output. In fact, we can then predict the class of any datum by a logical operation over the selected features.
Table 3: An example of $D$ and the corresponding bit strings $b_f$.
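The plaintext behavior of CWC just described can be sketched as follows, under the notation assumed in this section (a bit string per feature with a 1 for every positive/negative pair the feature distinguishes), run on the dataset of Table 1. Tie-breaking among equal measures is arbitrary, so the returned minimal set may differ from the one a particular implementation reports:

```python
def bitstrings(pos, neg, nfeat):
    # b_f holds one bit per (positive, negative) pair:
    # 1 iff feature f takes different values on the two data.
    return [[int(x[f] != y[f]) for x in pos for y in neg]
            for f in range(nfeat)]

def consistent(bs, selected):
    # A set is consistent iff every pair is distinguished
    # by at least one selected feature.
    npairs = len(bs[0])
    return all(any(bs[f][p] for f in selected) for p in range(npairs))

def cwc(pos, neg, nfeat):
    bs = bitstrings(pos, neg, nfeat)
    # Process features in increasing order of the consistency measure
    # (weakest components first), dropping each removable feature.
    order = sorted(range(nfeat), key=lambda f: sum(bs[f]))
    selected = set(range(nfeat))
    for f in order:
        if consistent(bs, selected - {f}):
            selected.remove(f)
    return sorted(selected)

# Table 1 split into positive (c = 1) and negative (c = 0) data:
pos = [(1, 1, 1, 1, 0), (0, 1, 0, 1, 0), (0, 1, 0, 0, 1), (0, 0, 0, 0, 1)]
neg = [(1, 0, 1, 1, 1), (1, 1, 0, 0, 0), (0, 0, 0, 1, 1), (1, 0, 1, 0, 0)]
result = cwc(pos, neg, 5)   # a minimal consistent subset of {0, ..., 4}
```

A feature kept at its turn stays necessary for every later (smaller) candidate set, so a single pass suffices for minimality.
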
2.2 TFHE: a faster fully homomorphic encryption
For the privacy-preserving CWC, the data is entirely encrypted by a fully homomorphic encryption scheme. We review TFHE [8], one of the fastest libraries, which allows bitwise addition (XOR, '$\oplus$') and bitwise multiplication (AND, '$\wedge$'). In TFHE, any integer is encrypted bitwise: for an $\ell$-bit integer $x$, we denote its bitwise encryption by $[x]$ for short. The homomorphic operations on ciphertexts $[x]$ and $[y]$ are denoted by $[x] \oplus [y]$ and $[x] \wedge [y]$. An encrypted array is denoted in the same way: for example, when $x$ and $y$ are integers of length $\ell_1$ and $\ell_2$, respectively, $[x, y]$ abbreviates the bitwise encryption of the sequence of the two integers.
Using the elementary operations $\oplus$ and $\wedge$, TFHE allows all arithmetic and logical operations. Here, we describe how to construct the adder and the comparison operation. Let $x$ and $y$ be $\ell$-bit integers and let $x_i$ and $y_i$ be the $i$-th bits of $x$ and $y$, respectively. Let $c_i$ be the $i$-th carry-in bit and $s_i$ the $i$-th bit of the sum $x + y$. Then, we can compute $s_i = x_i \oplus y_i \oplus c_i$ and $c_{i+1} = (x_i \wedge y_i) \oplus (c_i \wedge (x_i \oplus y_i))$ by bitwise operations on the ciphertexts using $\oplus$ and $\wedge$. Based on the adder, we can construct other operations like subtraction, multiplication, and division. For example, $x - y$ is obtained by $x + \bar{y} + 1$, where $\bar{y}$ is the bit complement of $y$ obtained by $y_i \oplus 1$ for each $i$-th bit. We next review the comparison. We want to obtain the encrypted bit $[z]$ without decrypting $[x]$ and $[y]$, where $z = 1$ if $x < y$ and $z = 0$ otherwise. We can obtain $[z]$ as the most significant bit of $x - y$ computed over ciphertexts. Similarly, we can compute the encrypted bit for the equality test.
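The gate-level constructions just described can be checked on plaintext bits. In this sketch the Python operators `^` and `&` stand for the homomorphic XOR and AND gates; under TFHE the identical circuit would run on ciphertexts:

```python
def add_bits(x, y, nbits):
    """Ripple-carry adder: s_i = x_i ^ y_i ^ c_i and
    c_{i+1} = (x_i & y_i) ^ (c_i & (x_i ^ y_i)), truncated to nbits."""
    out, c = 0, 0
    for i in range(nbits):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        out |= (xi ^ yi ^ c) << i
        c = (xi & yi) ^ (c & (xi ^ yi))
    return out

def less_than(x, y, nbits):
    """[x < y] as the most significant bit of x - y = x + ~y + 1
    (two's complement); valid when x and y fit in nbits - 1 bits."""
    not_y = y ^ ((1 << nbits) - 1)          # bitwise complement of y
    diff = add_bits(add_bits(x, not_y, nbits), 1, nbits)
    return (diff >> (nbits - 1)) & 1
```
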
3 Algorithms
3.1 Baseline algorithm
We propose our baseline algorithm, which is a privacy-preserving version of CWC. In this subsection, we consider a two-party protocol in which a party $A$ has private data and outsources the computation of CWC to another party $B$; our baseline algorithm is easily extended to a multiparty protocol in which more than two parties cooperate with one another to select features over the joint data. During the computation, party $B$ should gain no information other than the number $m$ of positive data, the number $n$ of negative data, and the number $K$ of features. Note that party $A$ can conceal the actual numbers of data by inserting dummy data and telling the inflated numbers $m$ and $n$ to $B$. The algorithm can distinguish dummy data by adding an extra bit to each datum that is $1$ iff the datum is a dummy. For each class, the feature values and the dummy bit of each datum in the class are encrypted with the public key of $A$ and sent to $B$.
The algorithm consists of three steps: computing the encrypted bit strings $b_f$, sorting the $b_f$'s, and executing the feature selection on the sorted $b_f$'s.
Since all data in this subsection are encrypted with the public key of party $A$, we omit the encryption function from the notation to simplify the presentation.
3.1.1 Computing $b_f$
We can compute each bit of $b_f$ by $b_f[i,j] = (f(x_i) \oplus f(y_j)) \vee d_{x_i} \vee d_{y_j}$, where $d_x$ represents the dummy bit of datum $x$ and $\vee$ is built from the available gates as $a \vee b = a \oplus b \oplus (a \wedge b)$. $b_f[i,j]$ becomes $0$ iff $f$ is inconsistent for the pair of $x_i$ and $y_j$. The part "$\vee\, d_{x_i} \vee d_{y_j}$" is added to make the whole value $1$ when one of the data is a dummy, so that dummy pairs never constrain the selection. This step takes $O(Kmn)$ time and space in total.
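Step 1 reduces to a constant number of gates per pair. A plaintext sketch with the polarity assumed above (bit = 1 when the feature distinguishes the pair, forced to 1 for dummy pairs); `or_` builds OR from the XOR/AND gates TFHE provides:

```python
def or_(a, b):
    # a OR b expressed with the available gates: a ^ b ^ (a & b)
    return a ^ b ^ (a & b)

def pair_bit(fx, fy, dx, dy):
    """Bit of b_f for one (positive, negative) pair: fx, fy are the
    feature values, dx, dy the dummy bits of the two data."""
    differ = fx ^ fy                  # feature distinguishes the pair
    return or_(or_(differ, dx), dy)   # dummy pairs are always 1

def bitstring(fpos, fneg, dpos, dneg):
    # b_f over all m*n pairs of positive and negative data
    return [pair_bit(fx, fy, dx, dy)
            for fx, dx in zip(fpos, dpos)
            for fy, dy in zip(fneg, dneg)]
```
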
3.1.2 Sorting the $b_f$'s
We can compute the measure $w(f)$ in encrypted form by summing up the $mn$ bits of $b_f$ (noting that each operation on integers of $O(\log(mn))$ bits takes $O(\log(mn))$ time). Instead, we can set an upper bound $\beta$ on the number of bits used to store the consistency measure, reducing the time to $O(mn\beta)$ per feature.
Then sorting the $b_f$'s in increasing order of the consistency measures can be done using any sorting network, in which comparison and swap are conducted in encrypted form without leaking any information about the ordering of features. Note that, in this approach, the algorithm has to spend $O(mn)$ time to swap (or pretend to swap) two bit strings, together with their original feature indices, regardless of whether the two features are actually swapped or not. Since this is the heaviest part of our baseline algorithm, we will show how to improve it. Using the AKS sorting network [1] of $O(K \log K)$ comparators, the total time for sorting the $b_f$'s is $O(K \log K \cdot (mn + \beta))$.
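The AKS network is asymptotically optimal but impractical; the estimates in Section 4 instead use Batcher's odd-even mergesort (OEM sort) [4], whose compare-and-swap sequence is likewise data-independent. A plaintext sketch for power-of-two input sizes (in the secure version the comparison bit is encrypted and both payloads are always rewritten):

```python
def cswap(a, i, j):
    # Oblivious compare-and-swap on (key, payload) pairs: the payload
    # (bit string and feature index) always travels with its key.
    if a[i][0] > a[j][0]:
        a[i], a[j] = a[j], a[i]

def oe_merge_sort(a):
    """Batcher's odd-even mergesort; len(a) must be a power of two.
    The sequence of cswap positions does not depend on the data."""
    n = len(a)
    p = 1
    while p < n:
        k = p
        while k >= 1:
            for j in range(k % p, n - k, 2 * k):
                for i in range(min(k, n - j - k)):
                    if (i + j) // (2 * p) == (i + j + k) // (2 * p):
                        cswap(a, i + j, i + j + k)
            k //= 2
        p *= 2

items = [(5, 'a'), (1, 'b'), (4, 'c'), (2, 'd'),
         (8, 'e'), (7, 'f'), (3, 'g'), (6, 'h')]
oe_merge_sort(items)   # payloads travel with their keys
```
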
3.1.3 Selecting features
Let $f_1, \dots, f_K$ be the sorted list of features. We first compute a sequence of bit strings $r_1, \dots, r_K$ of length $mn$ each such that $r_t[p] = b_{f_{t+1}}[p] \vee \cdots \vee b_{f_K}[p]$ for any $t$ and any position $p$; namely, $r_t$ is the bit array storing, at each position, the cumulative OR over the features following $f_t$. The computation takes $O(Kmn)$ time and space.
For feature selection, we simulate Algorithm 1 on the encrypted $b_f$'s and $r_t$'s. In addition, we use two zero-initialized bit arrays: $S$ of length $K$ and $T$ of length $mn$. $S[t]$ is meant to store $1$ iff the $t$-th feature (in sorted order) is selected. $T$ is used to keep track of the cumulative OR of the bit strings of the currently selected features; namely, $T[p]$ is set to $1$ if some feature selected so far distinguishes the $p$-th pair.
Suppose that we are in the $t$-th iteration of the for loop of Algorithm 1. Note that the feature set without $f_t$ is consistent iff $T[p] \vee r_t[p] = 1$ holds for every position $p$. Since we keep the $t$-th feature iff that set is inconsistent, the algorithm sets $S[t] = \neg \bigwedge_p (T[p] \vee r_t[p])$. After computing $S[t]$, we can correctly update $T$ by $T[p] \leftarrow T[p] \vee (S[t] \wedge b_{f_t}[p])$ for every $p$ in $O(mn)$ time.
Since each feature is processed in $O(mn)$ time, the total computation time of this step is $O(Kmn)$.
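A plaintext sketch of this selection step, with `suf` in the role of the cumulative-OR strings $r_t$, `acc` in the role of $T$, and the returned list in the role of $S$ (1 = keep); in the protocol every Boolean below is evaluated gate by gate on ciphertexts, so the selection pattern itself stays encrypted:

```python
def select(bs):
    """bs[t][p] = 1 iff the t-th feature (in sorted order) distinguishes
    the p-th positive/negative pair. Returns the keep bit per feature."""
    K, L = len(bs), len(bs[0])
    # suf[t][p]: OR of bs[u][p] over all u > t
    suf = [[0] * L for _ in range(K)]
    for t in range(K - 2, -1, -1):
        suf[t] = [suf[t + 1][p] | bs[t + 1][p] for p in range(L)]
    acc = [0] * L                      # OR over the features kept so far
    keep = [0] * K
    for t in range(K):
        # the set without feature t is consistent iff acc OR suf[t]
        # covers every pair; keep t iff some pair would be uncovered
        covered = all(acc[p] | suf[t][p] for p in range(L))
        keep[t] = 0 if covered else 1
        if keep[t]:
            acc = [acc[p] | bs[t][p] for p in range(L)]
    return keep
```
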
3.1.4 Summing up analysis
The bottleneck of the computation time is the sorting step. Since CWC works with any consistency measure, we do not have to use $w(f)$ at full accuracy, and thus we assume that the bit length $\beta$ of the measure is set to a constant. Under this assumption, we obtain the following theorem.
Theorem 1
We can securely simulate CWC in the time and space shown in Table 2, without revealing the private data of the parties, under the assumption that TFHE is secure.

According to the discussion above, computing $b_f$ for all features takes $O(Kmn)$ time and space, sorting the features takes $O(K \log K \cdot mn)$ time, and selecting the features takes $O(Kmn)$ time. Finally, party $B$ computes an integer array storing the original indices of the selected features. Party $B$ randomly shuffles this array and sends it to party $A$ as the result of CWC. Therefore, we can securely simulate CWC in the claimed time and space.
3.2 Improvement of secure CWC
The task of sorting is the major bottleneck of the secure CWC presented above. The reason is that pointers cannot be followed over ciphertexts. For example, consider the case of secure integer sorting. Let the variables $u$ and $v$ contain integers $x$ and $y$, respectively. By performing the secure comparison, the result is obtained as the encrypted bit $z$ with $z = 1$ iff $x > y$. Using this encrypted bit $z$, we can swap the values of $u$ and $v$ so that $u \leq v$ holds, by the bitwise secure operations $u \leftarrow (z \wedge y) \oplus (\bar{z} \wedge x)$ and $v \leftarrow (z \wedge x) \oplus (\bar{z} \wedge y)$.
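This conditional swap is a bitwise multiplexer built from the XOR and AND gates alone; a plaintext sketch (with `b` standing for the encrypted comparison bit $[x > y]$):

```python
def mux_bit(b, t, f):
    # returns t if b == 1 else f, using only AND and XOR
    return (b & t) ^ ((b ^ 1) & f)

def cond_swap(b, x, y, nbits):
    """Blend x and y bit by bit: if b = 1 the outputs are (y, x),
    otherwise (x, y); both outputs are always recomputed in full,
    so the access pattern does not reveal whether a swap happened."""
    lo = hi = 0
    for i in range(nbits):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        lo |= mux_bit(b, yi, xi) << i
        hi |= mux_bit(b, xi, yi) << i
    return lo, hi
```
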
However, in the case of CWC, the measure of each feature is associated with the bit string $b_f$ of length $mn$. Since the comparison bit cannot be decrypted, we cannot simply exchange pointers. Therefore, the baseline algorithm swaps the bit strings themselves, and the computation time for sorting increases by a factor of $mn$. We improve this time complexity below.
Since the improved algorithm uses the mix network mechanism [13] as a subroutine, we first give a brief overview of the mix network.
The purpose of a mix network is, given an encrypted sequence, to obtain a random permutation of it in which every element is re-encrypted and the positions are shuffled. Recall that the encryption is probabilistic; thus, we cannot learn how the elements were shuffled by comparing the output with the original ciphertexts. Between two parties $A$ and $B$, the mix network can be realized using the public-key encryption of both parties. We show such a mix network in Algorithm 2. We can assume that neither party alone can learn any information about the permutation without decryption.
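The idea of the mix network can be modeled in a few lines. Here a ciphertext is a (payload, nonce) pair, and refreshing the nonce stands in for re-encryption under a probabilistic scheme; after refreshing and shuffling, ciphertext equality no longer links outputs to inputs. All names here are ours, for illustration only:

```python
import random

def re_encrypt(ct):
    # model of re-encryption: same payload, fresh randomness
    payload, _ = ct
    return (payload, random.getrandbits(64))

def mix(cts):
    # apply a secret random permutation and re-encrypt every element
    perm = list(range(len(cts)))
    random.shuffle(perm)
    return [re_encrypt(cts[i]) for i in perm]

cts = [(m, 0) for m in (10, 20, 30)]
mixed = mix(cts)
# the multiset of payloads is preserved, but positions are unlinkable
assert sorted(p for p, _ in mixed) == [10, 20, 30]
```
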
Using the mix network, we propose the improved secure CWC (Algorithm 3), which reduces the time complexity of the sorting step. An example run of Algorithm 3 is illustrated in Fig. 1. As shown in this example, the party can securely sort the randomized features and then move each associated bit string of length $mn$ only once. After this preprocessing, the parties obtain a minimal consistent feature set by decrypting the output of CWC. Finally, we obtain the following result.
Theorem 2
Algorithm 3 can securely simulate CWC in the improved time and space shown in Table 2, without revealing the private data of the parties, under the assumption that TFHE is secure.

The party $A$ shuffles the features by a permutation $\pi$. The parties communicate only in steps 5 and 6. $B$ can decrypt the values he receives, but due to the added randomness, he cannot learn anything about the original order. On the other hand, $A$ obtains plaintexts in shuffled order, but he cannot relate them to $B$'s view. Thus, neither party can recover the rank of an original feature from its own information alone. Therefore, the protocol of Algorithm 3 is as secure as TFHE. The time and space complexities are clear because the algorithm moves each bit string of length $mn$ only a constant number of times, so the cost of the sorting step is reduced accordingly.
4 Experiments
We implemented our baseline algorithm in C++ using the TFHE library for bitwise operations on fully homomorphic encryption. The experiments were conducted on a machine with an Intel Core i7-6567U (3.30 GHz) and 16 GB RAM. In the following, $m$ (resp. $n$) is the number of positive (resp. negative) data and $K$ is the number of features.
Table 4 shows the time for computing a single $b_f$ for three different lengths $mn$. As the theoretical time bound suggests, the time increases linearly with the length of $b_f$. We note that each $b_f$ can be computed independently of the other features, and thus they can be computed in parallel.
Table 5 shows the time for computing $w(f)$ while changing the length of $b_f$ and the upper bound $\beta$ of the number of bits used to store consistency measures. Since there are $mn$ additions to a $\beta$-bit integer, the time complexity is $O(mn\beta)$. We can observe that the time per addition increases linearly with $\beta$. Note that the computation of $w(f)$ for all features can be conducted in parallel.
$mn$   time [sec]

100    6.04
500    30.06
1000   60.14
$mn$   $\beta$   time [sec]   time per addition [sec]

100    7         27.624       0.28
500    9         175.826      0.35
1000   10        394.967      0.40
Table 6 shows the time for swapping a pair of features in the sorting procedure. Since the theoretical time complexity of one swap is $O(mn)$, the time is mostly dominated by the length $mn$ of the bit strings.
Since the whole sorting procedure takes a long time, we estimate it from the time for a single swap in Table 6. Table 7 shows the estimated total time for sorting the $b_f$'s with OEM sort [4]. Here we assume that all the swaps are conducted serially (without utilizing the parallelism of the sorting network).
$K$    $mn$   $\beta$   index bits   time [sec]

10     100    7         4            8.88
10     500    9         4            39.59
10     1000   10        4            78.05
50     100    7         6            9.05
50     500    9         6            39.73
50     1000   10        6            78.04
100    100    7         7            9.10
100    500    9         7            39.93
100    1000   10        7            77.94
$K$    $mn$   # swaps   time [sec]

10     100    63        559.57
10     500    63        2494.23
10     1000   63        4917.40
50     100    543       4911.44
50     500    543       21573.39
50     1000   543       42376.26
100    100    1471      13386.10
100    500    1471      58732.62
100    1000   1471      114646.80
Table 8 shows the time for selecting features from the sorted list of $b_f$'s. The results follow the theoretical time complexity $O(Kmn)$.
$K$    $mn$   time [sec]

10     100    111.96
10     500    558.17
10     1000   1114.24
50     100    589.35
50     500    2941.07
50     1000   5919.71
100    100    1179.06
100    500    5952.54
100    1000   11867.00
Table 9 summarizes the time for each step of our baseline algorithm under the assumption that parallelism is not used. The table shows that the sorting part is the bottleneck.
$K$    $mn$   Step 1 [sec]   Step 2 [sec]   Step 3 [sec]

10     100    60.37          835.81         111.96
10     500    300.59         4252.49        558.17
10     1000   601.41         8867.07        1114.24
50     100    301.85         6292.64        589.35
50     500    1502.95        30364.69       2941.07
50     1000   3007.05        62124.61       5919.71
100    100    603.70         16148.50       1179.06
100    500    3005.90        76315.22       5952.54
100    1000   6014.10        154143.50      11867.00
As we can see from the experimental results (e.g., Table 9), most of the computation time of the baseline algorithm is spent on sorting. Thus, the implementation of the improved secure CWC is important future work. Although we have implemented a two-party protocol, our algorithms, including the improved secure CWC, can easily be extended to general multiparty protocols.
References
 [1] (1983) An O(n log n) sorting network. In STOC, pp. 1–9.
 [2] (1994) Learning Boolean concepts in the presence of many irrelevant features. Artif. Intell. 69 (1–2), pp. 279–305.
 [3] (2018) Efficient two-level homomorphic encryption in prime-order bilinear groups and a fast implementation in WebAssembly. In ASIACCS, pp. 685–697.
 [4] (1968) Sorting networks and their applications. In AFIPS Spring Joint Computing Conference, pp. 307–314.
 [5] (2005) Evaluating 2-DNF formulas on ciphertexts. In TCC, pp. 325–341.
 [6] (2012) (Leveled) fully homomorphic encryption without bootstrapping. In ITCS, pp. 309–325.
 [7] (2020) TFHE: fast fully homomorphic encryption over the torus. Journal of Cryptology 33, pp. 34–91.
 [8] (August 2016) TFHE: fast fully homomorphic encryption library. https://tfhe.github.io/tfhe/
 [9] (2009) Fully homomorphic encryption using ideal lattices. In STOC, pp. 169–178.
 [10] (2014) Oblivious radix sort: an efficient sorting algorithm for practical secure multi-party computation. IACR Cryptol. ePrint Arch., pp. 121.
 [11] (1998) A monotonic measure for optimal feature selection. In ECML, pp. 101–106.
 [12] (1999) Public-key cryptosystems based on composite degree residuosity classes. In EUROCRYPT, pp. 223–238.
 [13] (2006) A survey on mix networks and their secure applications. Proceedings of the IEEE 94 (12), pp. 2142–2181.
 [14] (2011) Consistency measures for feature selection: a formal definition, relative sensitivity comparison, and a fast algorithm. In IJCAI, pp. 1491–1497.
 [15] (2017) SCWC/SLCC: highly scalable feature selection algorithms. Information 8 (4), pp. 159.
 [16] (2009) Consistency-based feature selection. In KES, pp. 28–30.
 [17] (2007) Searching for interacting features. In IJCAI, pp. 1156–1161.