
Privacy Against Inference Attacks in Vertical Federated Learning

by   Borzoo Rassouli, et al.
Imperial College London
University of Essex

Vertical federated learning is considered, where an active party, having access to true class labels, wishes to build a classification model by utilizing more features from a passive party, which has no access to the labels, to improve the model accuracy. In the prediction phase, with logistic regression as the classification model, several inference attack techniques are proposed that the adversary, i.e., the active party, can employ to reconstruct the passive party's features, regarded as sensitive information. These attacks, which are mainly based on a classical notion of the center of a set, i.e., the Chebyshev center, are shown to be superior to those proposed in the literature. Moreover, several theoretical performance guarantees are provided for the aforementioned attacks. Subsequently, we consider the minimum amount of information that the adversary needs to fully reconstruct the passive party's features. In particular, it is shown that when the passive party holds one feature, and the adversary is only aware of the signs of the parameters involved, it can perfectly reconstruct that feature when the number of predictions is large enough. Next, as a defense mechanism, a privacy-preserving scheme is proposed that worsens the adversary's reconstruction attacks, while preserving the full benefits that VFL brings to the active party. Finally, experimental results demonstrate the effectiveness of the proposed attacks and the privacy-preserving scheme.




I Introduction

To tackle the concerns in traditional centralized learning, i.e., privacy, storage, and computational complexity, Federated Learning (FL) has been proposed in [16], where machine learning (ML) models are jointly trained by multiple local data owners (i.e., parties), such as smart phones, data centres, etc., without revealing their private data to each other. This approach has gained interest in many real-life applications, such as health systems [14, 12], keyboard prediction [2, 10], and e-commerce [23, 21].

Thus far, based on how data is partitioned among participating parties, three variants of FL, namely horizontal, vertical and transfer FL, have been considered. Horizontal FL (HFL) refers to FL among data owners that hold different data records/samples with the same set of features [25], and vertical FL (VFL) is the FL in which parties share common data samples with disjoint sets of features [4].

Figure 1 illustrates a digital banking system as an example of a VFL setting, in which two parties are participating, namely, a bank and a FinTech company [15]. The bank wishes to build a binary classification model to approve/disapprove a user’s credit card application by utilizing more features from the FinTech company. In this context, only the bank has access to the class labels in the training and testing datasets, hence named the active party, and the FinTech company, which is unaware of the labels, is referred to as the passive party.

Fig. 1: Digital banking as an example of vertical federated learning [15].

Once the model is trained, it can be used to predict the decision (approve/disapprove) on a new credit card application in the prediction dataset. The model outputs, referred to as the prediction outputs or, more specifically, the confidence scores, are revealed to the active party, which is in charge of making the decision. The adversary may be aware or unaware of the passive party’s model parameters, which are, respectively, referred to as the white-box and black-box settings.

As stated in [15], upon the receipt of the prediction outputs, which generally depend on the passive party’s features, a curious active party can perform reconstruction attacks to infer the latter, which are regarded as the passive party’s sensitive information. This privacy leakage in the prediction phase of VFL is the main focus of this paper, and the following contributions are made.

  • In the white-box setting, several reconstruction attacks are proposed that outperform those given in [15],[11]. The attacks are motivated by the notion of the Chebyshev center of a convex polytope.

  • Theorems 1 and 2 provide theoretical bounds as rigorous guarantees for some of these attacks.

  • In the black-box setting, it is shown that when the passive party holds one feature, and the adversary is aware of the signs of the parameters involved, it can still fully reconstruct the passive party’s feature given that the number of predictions is large enough.

  • Two privacy-preserving schemes are proposed as defense techniques against reconstruction attacks, which have the advantage of not degrading the benefits that VFL brings to the active party.

The organization of the paper is as follows. In section II, an explanation of the system model under consideration is provided. In section III, the elementary steps that pave the way for the adversary’s attack are elaborated, and the measure by which the performance of the reconstruction attack is evaluated is provided. The analysis and derivations in this paper need some preliminaries from linear algebra and optimization. To make the text as self-contained as possible, these are provided in section IV. The main results of the paper are given in sections V to VIII. In section V, the white-box setting is considered and several attack methods are proposed and evaluated analytically. Section VI deals with the black-box setting and investigates the minimum knowledge the adversary needs to perform a successful attack. In section VII, two privacy-preserving schemes are provided that worsen the adversary’s attacks, while not altering the confidence scores revealed to it. Section VIII is devoted to the experimental evaluation of the results of this paper and comparison to those in the literature. Finally, section IX concludes the paper. To improve readability, the notation used in this paper is provided next.


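Matrices and vectors are denoted by bold capital (e.g. $\mathbf{A}$, $\mathbf{Q}$) and bold lower case letters (e.g. $\mathbf{b}$, $\mathbf{z}$), respectively. Random variables are denoted by capital letters (e.g. $X$), and their realizations by lower case letters (e.g. $x$). (Footnote 1: In order to prevent confusion between the notation of a matrix (bold capital) and a random vector (bold capital), in this paper, the letters $\mathbf{X},\mathbf{Y}$ are not used to denote a matrix; hence, $\mathbf{X},\mathbf{Y}$ are random vectors rather than matrices.) Sets are denoted by capital letters in calligraphic font (e.g. $\mathcal{X}$, $\mathcal{G}$) with the exception of the set of real numbers, i.e., $\mathbb{R}$. The cardinality of the finite set $\mathcal{X}$ is denoted by $|\mathcal{X}|$. For a matrix $\mathbf{A}\in\mathbb{R}^{m\times k}$, the null space, rank, and nullity are denoted by $\text{Null}(\mathbf{A})$, $\text{rank}(\mathbf{A})$, and $\text{nul}(\mathbf{A})$, respectively, with $\text{rank}(\mathbf{A})+\text{nul}(\mathbf{A})=k$, i.e., the number of columns. The transpose of $\mathbf{A}$ is denoted by $\mathbf{A}^T$, and when $m=k$, its trace and determinant are denoted by $\text{Tr}(\mathbf{A})$ and $\det(\mathbf{A})$, respectively. For an integer $n\ge 1$, the terms $\mathbf{I}_n$, $\mathbf{1}_n$, and $\mathbf{0}_n$ denote the $n$-by-$n$ identity matrix, the $n$-dimensional all-one, and all-zero column vectors, respectively, and whenever it is clear from the context, their subscripts are dropped. For two vectors $\mathbf{a},\mathbf{b}$, $\mathbf{a}\ge\mathbf{b}$ means that each element of $\mathbf{a}$ is greater than or equal to the corresponding element of $\mathbf{b}$. The notation $\mathbf{A}\succeq 0$ is used to show that $\mathbf{A}$ is positive semi-definite, and $\mathbf{A}\succeq\mathbf{B}$ is equivalent to $\mathbf{A}-\mathbf{B}\succeq 0$. For integers $m\le n$, we have the discrete interval $[m:n]\triangleq\{m,m+1,\ldots,n\}$, and the set $[1:n]$ is written in short as $[n]$.

$F_X(\cdot)$ denotes the cumulative distribution function (CDF) of random variable $X$, whose expectation is denoted by $\mathbb{E}[X]$. In this paper, all the (in)equalities that involve a random variable are in the almost surely (a.s.) sense, i.e., they happen with probability 1. For $\mathbf{x}\in\mathbb{R}^n$ and $p\in[1,\infty)$, the $L^p$-norm is defined as $\|\mathbf{x}\|_p\triangleq\big(\sum_{i=1}^{n}|x_i|^p\big)^{1/p}$, and $\|\mathbf{x}\|_\infty\triangleq\max_{i\in[n]}|x_i|$. Throughout the paper, $\|\cdot\|$ (i.e., without subscript) refers to the $L^2$-norm. Let $p,q$ be two arbitrary pmfs on $\mathcal{X}$. The Kullback–Leibler divergence from $q$ to $p$ is defined as $D(p\|q)\triangleq\sum_{x\in\mathcal{X}}p(x)\log\frac{p(x)}{q(x)}$, which is also shown as $D(\mathbf{p}\|\mathbf{q})$ with $\mathbf{p},\mathbf{q}$ being the corresponding probability vectors of $p,q$, respectively. (Footnote 2: We assume that $p$ is absolutely continuous with respect to $q$, i.e., $q(x)=0$ implies $p(x)=0$; otherwise, $D(p\|q)\triangleq\infty$.)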

II System model

II-A Machine learning (ML)

An ML model is a function $f_{\boldsymbol{\theta}}:\mathcal{X}\to\mathcal{Y}$ parameterized by the vector $\boldsymbol{\theta}$, where $\mathcal{X}$ and $\mathcal{Y}$ denote the input and output spaces, respectively. Supervised classification is considered in this paper, where a labeled training dataset is used to train the model.

Assume that a training dataset $\mathcal{D}_{\text{train}}\triangleq\{(\mathbf{x}_i,y_i)\mid i\in[n]\}$ is given, where each $\mathbf{x}_i$ is a $d_t$-dimensional example/sample and $y_i$ denotes its corresponding label. Learning refers to the process of obtaining the parameter vector $\boldsymbol{\theta}$ in the minimization of a loss function, i.e.,

$$\min_{\boldsymbol{\theta}}\ \frac{1}{n}\sum_{i=1}^{n} l\big(f_{\boldsymbol{\theta}}(\mathbf{x}_i),y_i\big), \tag{1}$$

where $l(\cdot,\cdot)$ measures the loss of predicting $f_{\boldsymbol{\theta}}(\mathbf{x}_i)$, while the true label is $y_i$. A regularization term can be added to the optimization to avoid overfitting.

Once the model is trained, i.e., $\boldsymbol{\theta}$ is obtained, it can be used for the prediction of any new sample. In practice, the prediction is (probability) vector-valued, i.e., it is a vector of confidence scores $\mathbf{c}=(c_1,\ldots,c_k)^T$ with $\sum_{i=1}^{k}c_i=1$, where $c_i$ denotes the probability that the sample belongs to class $i$, and $k$ denotes the number of classes. Classification can be done by choosing the class that has the highest confidence score.

In this paper, we focus on logistic regression (LR), which can be modelled as

$$\mathbf{c}=\sigma(\mathbf{W}\mathbf{x}+\mathbf{b}), \tag{2}$$

where $\mathbf{W}$ and $\mathbf{b}$ are the parameters collectively denoted as $\boldsymbol{\theta}$, and $\sigma(\cdot)$ is the sigmoid or softmax function in the case of binary or multi-class classification, respectively.
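As a minimal sketch of the multi-class LR prediction described above, the following computes a vector of confidence scores with a softmax and picks the class with the highest score. The names `softmax`, `lr_confidence`, the dimensions, and the randomly drawn parameters are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(z):
    # Subtract the maximum for numerical stability; the result is unchanged.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def lr_confidence(W, b, x):
    # Multi-class logistic regression: confidence scores c = softmax(W x + b).
    return softmax(W @ x + b)

# Hypothetical sizes and synthetic parameters, for illustration only.
rng = np.random.default_rng(0)
k, d_t = 3, 5                      # number of classes, number of features
W = rng.normal(size=(k, d_t))      # weight matrix
b = rng.normal(size=k)             # bias vector
x = rng.uniform(size=d_t)          # one sample with features in [0, 1]

c = lr_confidence(W, b, x)         # vector of confidence scores
predicted_class = int(np.argmax(c))
```

The scores are positive and sum to one, so choosing `argmax` of `c` is the same as choosing the largest entry of `W @ x + b`.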

II-B Vertical Federated Learning

VFL is a type of ML model training approach in which two or more parties are involved in the training process, such that they hold the same set of samples with disjoint sets of features. The main goal in VFL is to train a model in a privacy-preserving manner, i.e., to collaboratively train a model without each party having access to other parties’ features. Typically, the training involves a trusted third party known as the coordinator authority (CA), and it is commonly assumed that only one party has access to the label information in the training and testing datasets. This party is named active and the remaining parties are called passive. Throughout this paper, we assume that only two parties are involved; one is active and the other is passive. The active party is assumed to be honest but curious, i.e., it obeys the protocols exactly, but may try to infer the passive party’s features based on the information received. As a result, the active party is referred to as the adversary in this paper.

In the existing VFL frameworks, the CA’s main task is to coordinate the learning process once it has been initiated by the active party. During the training, the CA receives the intermediate model updates from each party, and after a set of computations, backpropagates each party’s gradient updates, separately and securely. To meet the privacy requirements of parties’ datasets, cryptographic techniques such as secure multi-party computation (SMC) [24] or homomorphic encryption (HE) [5] are used.

Once the global model is trained, upon the request of the active party for a new record prediction, each party computes the results of their model using their own features. CA aggregates these results from all the parties, obtains the prediction (confidence scores), and delivers that to the active party for further action.

As in [15], we assume that the active party has no information about the underlying distribution of the passive party’s features. However, it is assumed that knowledge of the names, types and ranges of the features is available to the active party, so that it can decide whether to participate in VFL or not.

III Problem statement

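Let $(\mathbf{Y}^T,\mathbf{X}^T)^T$ denote a random $d_t$-dimensional input sample for prediction, where the $(d_t-d)$-dimensional $\mathbf{Y}$ and the $d$-dimensional $\mathbf{X}$ correspond to the feature values held by the active and passive parties, respectively. The VFL model under consideration is LR, where the confidence score is given by $\mathbf{c}=\sigma(\mathbf{z})$ with $\mathbf{z}=\mathbf{W}_{\text{act}}\mathbf{Y}+\mathbf{W}_{\text{pas}}\mathbf{X}+\mathbf{b}$. Denoting the number of classes in the classification task by $k$, $\mathbf{W}_{\text{act}}$ (with dimension $k\times(d_t-d)$) and $\mathbf{W}_{\text{pas}}$ (with dimension $k\times d$) are the model parameters of the active and passive parties, respectively, and $\mathbf{b}$ is the $k$-dimensional bias vector. From the definition of $\sigma(\cdot)$, we have

$$\ln\frac{c_{m+1}}{c_m}=z_{m+1}-z_m,\quad m\in[k-1], \tag{3}$$

where $c_m,z_m$ denote the $m$-th element of $\mathbf{c},\mathbf{z}$, respectively. Denote the $m$-th row of $\mathbf{W}_{\text{act}}$ and $\mathbf{W}_{\text{pas}}$ by $\mathbf{w}^{\text{act}}_m$ and $\mathbf{w}^{\text{pas}}_m$, respectively, and construct matrices $\mathbf{J}$ and $\mathbf{A}$ whose $m$-th rows ($m\in[k-1]$) are $\mathbf{w}^{\text{act}}_{m+1}-\mathbf{w}^{\text{act}}_m$ and $\mathbf{w}^{\text{pas}}_{m+1}-\mathbf{w}^{\text{pas}}_m$, respectively. From the identity $\mathbf{z}=\mathbf{W}_{\text{act}}\mathbf{Y}+\mathbf{W}_{\text{pas}}\mathbf{X}+\mathbf{b}$, an equivalent representation of (3) is

$$\mathbf{A}\mathbf{X}=\mathbf{c}'-\mathbf{J}\mathbf{Y}, \tag{4}$$

where $\mathbf{c}'$ is a $(k-1)$-dimensional vector whose $m$-th element is $\ln\frac{c_{m+1}}{c_m}-(b_{m+1}-b_m)$. Denoting the RHS of (4) by $\mathbf{b}'$, (4) writes as $\mathbf{A}\mathbf{X}=\mathbf{b}'$, where $\mathbf{A}$ is $(k-1)\times d$ dimensional.

The white-box setting refers to the scenario where the adversary is aware of $(\mathbf{W}_{\text{act}},\mathbf{W}_{\text{pas}},\mathbf{b})$, and the black-box setting refers to the context in which the adversary is only aware of $\mathbf{W}_{\text{act}}$.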
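The step from confidence scores to a linear system in the passive party's features can be sketched as follows, assuming a softmax model and the white-box setting. Log-ratios of consecutive confidence scores equal differences of consecutive logits, so stacking row differences of the weight matrices gives one linear equation per pair of classes. All variable names and the synthetic data are assumptions made for illustration; `bias` plays the role of the LR bias vector.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

# Synthetic white-box scenario (all values are illustrative assumptions).
rng = np.random.default_rng(1)
k, d_act, d_pas = 4, 3, 5
W_act = rng.normal(size=(k, d_act))    # active party's parameters
W_pas = rng.normal(size=(k, d_pas))    # passive party's parameters
bias = rng.normal(size=k)
y = rng.uniform(size=d_act)            # active party's features (known to it)
x = rng.uniform(size=d_pas)            # passive party's features (the target)

# Confidence scores revealed to the active party.
c = softmax(W_act @ y + W_pas @ x + bias)

# Row differences: the m-th row is (row m+1) minus (row m).
J = np.diff(W_act, axis=0)
A = np.diff(W_pas, axis=0)

# ln(c[m+1] / c[m]) = z[m+1] - z[m], so the unknown x satisfies A x = rhs.
log_ratio = np.diff(np.log(c))
rhs = log_ratio - np.diff(bias) - J @ y

residual = np.linalg.norm(A @ x - rhs)  # close to zero up to rounding
```

With `k - 1` equations and `d_pas` unknowns, the system is under-determined whenever `d_pas > k - 1`, which is the regime the preliminaries focus on.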

Since the active party wishes to reconstruct the passive party’s features, one measure by which the attack performance can be evaluated is the mean square error per feature, i.e.,

$$\text{MSE}\triangleq\frac{1}{d}\,\mathbb{E}\big[\|\mathbf{X}-\hat{\mathbf{X}}\|^2\big], \tag{5}$$

where $\hat{\mathbf{X}}$ is the adversary’s estimate. Let $N$ denote the number of predictions. Assuming that these predictions are carried out in an i.i.d. manner, the Law of Large Numbers (LLN) allows us to approximate the MSE by its empirical value $\frac{1}{Nd}\sum_{i=1}^{N}\|\mathbf{x}_i-\hat{\mathbf{x}}_i\|^2$, since the latter converges almost surely to (5) as $N$ grows. (Footnote 3: It is important to note, however, that when the adversary’s estimates are not independent across the predictions (non-i.i.d. case), the empirical MSE is not necessarily equal to (5). In such cases, the empirical MSE is taken as the performance metric.) This observation is later used in the experimental results to evaluate the performance of different reconstruction attacks.
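A minimal sketch of the empirical MSE per feature is given below. The helper name `empirical_mse_per_feature` and the uniform feature distribution are assumptions chosen for illustration; the adversary in the paper does not know the distribution.

```python
import numpy as np

def empirical_mse_per_feature(X_true, X_hat):
    # Average squared error over all N predictions and all d features.
    X_true = np.asarray(X_true, dtype=float)
    X_hat = np.asarray(X_hat, dtype=float)
    n, d = X_true.shape
    return np.sum((X_true - X_hat) ** 2) / (n * d)

# Illustrative data: features drawn uniformly on [0, 1] (an assumption),
# and a blind estimate of 1/2 for every feature.
rng = np.random.default_rng(2)
N, d = 100_000, 4
X = rng.uniform(size=(N, d))
X_hat = np.full((N, d), 0.5)

mse = empirical_mse_per_feature(X, X_hat)
```

For this synthetic uniform data the value concentrates around 1/12, the variance of a uniform random variable on [0, 1].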

IV Preliminaries

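Throughout this paper, we are interested in solving a satisfiable system of linear equations, in which the unknowns (features of the passive party) are in the range $[0,1]$. (Footnote 4: Satisfiable means that at least one solution exists for this system, which is due to the context in which this problem arises.) This can be captured by solving for $\mathbf{x}$ in the equation $\mathbf{A}\mathbf{x}=\mathbf{b}$, where $\mathbf{A}\in\mathbb{R}^{m\times d}$, $\mathbf{b}\in\mathbb{R}^{m}$, and $\mathbf{x}\in[0,1]^{d}$ for some positive integers $m,d$. We are particularly interested in the case when the number of unknowns $d$ is greater than the number of equations $m$. This is a particular case of an indeterminate/under-determined system, where $\mathbf{A}$ does not have full column rank and an infinitude of solutions exists for this linear system. Since the system under consideration is satisfiable, any solution can be written as $\mathbf{A}^{+}\mathbf{b}+(\mathbf{I}-\mathbf{A}^{+}\mathbf{A})\mathbf{w}$ for some $\mathbf{w}\in\mathbb{R}^{d}$, where $\mathbf{A}^{+}$ denotes the pseudoinverse of $\mathbf{A}$ satisfying the Moore-Penrose conditions [18]. (Footnote 5: When $\mathbf{A}$ has linearly independent rows, we have $\mathbf{A}^{+}=\mathbf{A}^T(\mathbf{A}\mathbf{A}^T)^{-1}$.) One property of the pseudoinverse that is useful in the sequel is that if $\mathbf{A}=\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$ is a singular value decomposition (SVD) of $\mathbf{A}$, then $\mathbf{A}^{+}=\mathbf{V}\boldsymbol{\Sigma}^{+}\mathbf{U}^T$, in which $\boldsymbol{\Sigma}^{+}$ is obtained by taking the reciprocal of each non-zero element on the diagonal of $\boldsymbol{\Sigma}$, and then transposing the matrix.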
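The two facts above (the pseudoinverse obtained from an SVD, and the general form of a solution of a satisfiable system) can be checked numerically with the short sketch below. The function name `pinv_via_svd`, the tolerance, and the random test system are illustrative assumptions.

```python
import numpy as np

def pinv_via_svd(A, tol=1e-12):
    # A = U Sigma V^T  =>  A^+ = V Sigma^+ U^T, where Sigma^+ holds the
    # reciprocals of the non-zero singular values and has transposed shape.
    m, d = A.shape
    U, s, Vt = np.linalg.svd(A, full_matrices=True)
    Sigma_plus = np.zeros((d, m))
    for i, sv in enumerate(s):
        if sv > tol:
            Sigma_plus[i, i] = 1.0 / sv
    return Vt.T @ Sigma_plus @ U.T

# A satisfiable under-determined system (synthetic, for illustration).
rng = np.random.default_rng(3)
m, d = 2, 5
A = rng.normal(size=(m, d))
x_true = rng.uniform(size=d)
b = A @ x_true

A_plus = pinv_via_svd(A)

# Any vector of the form A^+ b + (I - A^+ A) w solves A x = b.
w = rng.normal(size=d)
x_other = A_plus @ b + (np.eye(d) - A_plus @ A) @ w
```

Here `x_other` generally differs from `x_true`, which illustrates why the adversary cannot recover the features from the linear system alone when there are more unknowns than equations.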

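For a given pair $(\mathbf{A},\mathbf{b})$, define

$$\mathcal{S}\triangleq\{\mathbf{x}\in\mathbb{R}^{d}\mid\mathbf{A}\mathbf{x}=\mathbf{b}\},\qquad\mathcal{S}_F\triangleq\{\mathbf{x}\in\mathcal{S}\mid\mathbf{0}\le\mathbf{x}\le\mathbf{1}\} \tag{6}$$

as the solution space and feasible solution space, respectively. Alternatively, by defining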


we have


We have that is a closed and bounded convex set defined as an intersection of half-spaces. Since is the image of under an affine transformation, it is a closed convex polytope in .

Fig. 2: An example with . The feasible solution space is the intersection of the solution space , which is denoted by the plane representing , with the hypercube . In this example, the minimum-norm point on the solution space, i.e., , does not belong to .
Fig. 3: An example of the Chebyshev centre of , which is the convex polytope in this paper.

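For a general (satisfiable or not) system of linear equations $\mathbf{A}\mathbf{x}=\mathbf{b}$, we have that $\mathbf{A}^{+}\mathbf{b}\in\arg\min_{\mathbf{x}}\|\mathbf{A}\mathbf{x}-\mathbf{b}\|$. Moreover, if the system is satisfiable, the quantity $\mathbf{A}^{+}\mathbf{b}$ is the minimum $L^2$-norm solution. Therefore, in our problem, we have $\|\mathbf{A}^{+}\mathbf{b}\|\le\|\mathbf{x}\|$ for all $\mathbf{x}$ in the solution space $\mathcal{S}$. Define $\hat{\mathbf{x}}_{\text{LS}}\triangleq\mathbf{A}^{+}\mathbf{b}$, where the subscript stands for Least Square. (Footnote 6: Note that this naming is with a slight abuse of convention, as the term least square points to a vector $\mathbf{x}$ that minimizes $\|\mathbf{A}\mathbf{x}-\mathbf{b}\|$ in general, rather than a minimum norm solution.) It is important to note that $\hat{\mathbf{x}}_{\text{LS}}$ may not necessarily belong to the feasible solution space $\mathcal{S}_F$, which is our region of interest. A geometrical representation for the case $d=3$ is provided in Figure 2. As a result, one can always consider the constrained optimization $\min_{\mathbf{x}}\|\mathbf{A}\mathbf{x}-\mathbf{b}\|$ with the constraint $\mathbf{x}\in[0,1]^{d}$ in order to find a feasible solution. We denote any solution obtained in this manner by $\hat{\mathbf{x}}_{\text{CLS}}$, where the subscript stands for Constrained Least Square. In an indeterminate system, in contrast to $\hat{\mathbf{x}}_{\text{LS}}$, $\hat{\mathbf{x}}_{\text{CLS}}$ is not unique, and any point in $\mathcal{S}_F$ can be a candidate for $\hat{\mathbf{x}}_{\text{CLS}}$ depending on the initial point of the solver.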
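The difference between the minimum-norm solution and a constrained solution can be illustrated with a small sketch. The feasibility routine below uses alternating projections between the solution space and the unit hypercube; this solver is an assumption chosen for simplicity and is not claimed to be the one used in the paper. The toy system and the names `project_affine` and `feasible_point` are hypothetical.

```python
import numpy as np

def project_affine(x, A, b, A_plus):
    # Euclidean projection onto the solution space {x : A x = b}.
    return x - A_plus @ (A @ x - b)

def feasible_point(A, b, x0, iters=5000):
    # Alternating projections between {x : A x = b} and the box [0, 1]^d.
    # Assumes the intersection is non-empty; the limit depends on x0.
    A_plus = np.linalg.pinv(A)
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = project_affine(x, A, b, A_plus)
        x = np.clip(x, 0.0, 1.0)
    return x

# Toy system with one equation and three unknowns (illustrative values).
A = np.array([[1.0, 0.1, 0.1]])
b = np.array([1.1])

x_ls = np.linalg.pinv(A) @ b                    # minimum-norm solution
x_cls_a = feasible_point(A, b, x0=np.zeros(3))  # one feasible solution
x_cls_b = feasible_point(A, b, x0=np.array([0.5, 1.0, 1.0]))  # another one
```

In this toy system the first entry of `x_ls` exceeds 1, so the minimum-norm solution is infeasible, while the two runs of `feasible_point` end at different feasible points, in line with the remark that the constrained solution depends on the initial point.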

Consider a simple example in which $x$ is an unknown quantity in the range $[0,1]$ to be estimated and the error in the estimation is measured by the square error, i.e., $|x-\hat{x}|^2$. Obviously, any point in $[0,1]$ can be proposed as an estimate for $x$. However, without any further knowledge about $x$, one can select the center of $[0,1]$, i.e., $\tfrac{1}{2}$, as an intuitive estimate. The rationale behind this selection is that the maximum error of the estimate $\hat{x}=\tfrac{1}{2}$, i.e., $\max_{x\in[0,1]}|x-\tfrac{1}{2}|^2=\tfrac{1}{4}$, is minimal among all other estimates. In other words, the center minimizes the worst possible estimation error, and hence, it is optimal in the best-worst sense. As mentioned earlier, any element of the feasible solution space $\mathcal{S}_F$ is a feasible solution of $\mathbf{A}\mathbf{x}=\mathbf{b}$. This calls for a proper definition of the "center" of $\mathcal{S}_F$ as the best-worst solution. This is called the Chebyshev center, which is introduced in a general topological context as follows.

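Definition 1.

(Chebyshev Center [17]) Let $\mathcal{Q}$ be a bounded subset of a metric space $(\mathcal{X},\rho)$, where $\rho$ denotes the distance. A Chebyshev center of $\mathcal{Q}$ is the center of the minimal closed ball containing $\mathcal{Q}$, i.e., it is an element $x_{\mathcal{Q}}\in\mathcal{X}$ such that $\sup_{u\in\mathcal{Q}}\rho(x_{\mathcal{Q}},u)=\inf_{x\in\mathcal{X}}\sup_{u\in\mathcal{Q}}\rho(x,u)$. The quantity $r(\mathcal{Q})\triangleq\inf_{x\in\mathcal{X}}\sup_{u\in\mathcal{Q}}\rho(x,u)$ is the Chebyshev radius of $\mathcal{Q}$.

In this paper, the metric space under consideration is $(\mathbb{R}^{n},\|\cdot\|)$ for some positive integer $n$, and we have

$$\mathbf{x}_{\mathcal{Q}}=\arg\min_{\hat{\mathbf{x}}\in\mathbb{R}^{n}}\ \max_{\mathbf{x}\in\mathcal{Q}}\|\mathbf{x}-\hat{\mathbf{x}}\|^2. \tag{9}$$

For example, the Chebyshev center of $[a,b]$ in $\mathbb{R}$ is $\frac{a+b}{2}$, and the Chebyshev center of the ball $\{\mathbf{x}\mid\|\mathbf{x}\|\le r\}$ in $\mathbb{R}^2$ is the origin. (Footnote 7: Note that the Chebyshev center of the circle $\{\mathbf{x}\mid\|\mathbf{x}\|=r\}$ in the same metric is still the origin, but obviously it does not belong to the circle, as the circle is not convex.)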

In this paper, the subset of interest, i.e., $\mathcal{Q}$ ($=\mathcal{S}_F$), is bounded, closed and convex. In this context, the Chebyshev center of $\mathcal{Q}$ is unique and belongs to $\mathcal{Q}$. Hence, in the argmin in (9), $\mathbb{R}^{n}$ can be replaced with $\mathcal{Q}$. An example is provided in Figure 3.

Except for simple cases, computing the Chebyshev center is a computationally complex problem due to the non-convex quadratic inner maximization in (9). When the subset of interest, i.e., $\mathcal{Q}$, can be written as the convex hull of a finite number of points, there are algorithms [22, 3] that can find the Chebyshev center. In this paper, $\mathcal{Q}$ ($=\mathcal{S}_F$) is a convex polytope with a finite number of extreme points (as shown in Figure 2), hence, one can apply these algorithms. However, it is important to note that these extreme points are not given a priori and they need to be found in the first place from the equation $\mathbf{A}\mathbf{x}=\mathbf{b}$. Since finding the extreme points of $\mathcal{S}_F$ has exponential complexity, it makes sense to seek approximations for the Chebyshev center that can be handled efficiently. Therefore, in this paper, instead of obtaining the exact Chebyshev center of $\mathcal{S}_F$, we rely on its approximations. One approximation worth mentioning is given in [8], which is in the context of signal processing and is explained in the sequel. This approximation is based on replacing the non-convex inner maximization in (9) by its semidefinite relaxation, and then solving the resulting convex-concave minimax problem. A clear explanation of this method, henceforth named Relaxed Chebyshev Center 1 (RCC1), is needed because it is used as one of the adversary’s attack methods in this paper. Later, in Proposition 3, a second relaxation is proposed, which is denoted as RCC2.

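The set $\mathcal{Q}$ in [8] is an intersection of $k$ ellipsoids, i.e.,

$$\mathcal{Q}\triangleq\{\mathbf{x}\in\mathbb{R}^{n}\mid f_i(\mathbf{x})\triangleq\mathbf{x}^T\mathbf{Q}_i\mathbf{x}+2\mathbf{g}_i^T\mathbf{x}+t_i\le 0,\ i\in[k]\}, \tag{10}$$

where $\mathbf{Q}_i\succeq 0$, $\mathbf{g}_i\in\mathbb{R}^{n}$, $t_i\in\mathbb{R}$, and the optimization problem is given in (9). Defining $\boldsymbol{\Delta}\triangleq\mathbf{x}\mathbf{x}^T$, the equivalence holds

$$\max_{\mathbf{x}\in\mathcal{Q}}\|\hat{\mathbf{x}}-\mathbf{x}\|^2=\max_{(\boldsymbol{\Delta},\mathbf{x})\in\mathcal{G}}\big\{\|\hat{\mathbf{x}}\|^2-2\hat{\mathbf{x}}^T\mathbf{x}+\text{Tr}(\boldsymbol{\Delta})\big\}, \tag{11}$$

where $\mathcal{G}\triangleq\{(\boldsymbol{\Delta},\mathbf{x})\mid\text{Tr}(\mathbf{Q}_i\boldsymbol{\Delta})+2\mathbf{g}_i^T\mathbf{x}+t_i\le 0,\ i\in[k],\ \boldsymbol{\Delta}=\mathbf{x}\mathbf{x}^T\}$. By focusing on the right hand side (RHS) of (11) instead of its left hand side (LHS), we are now dealing with the maximization of a concave (linear) function in $(\boldsymbol{\Delta},\mathbf{x})$. However, the downside is that $\mathcal{G}$ is not convex, in contrast to $\mathcal{Q}$. Here is where the relaxation is done in [8], and the optimization is carried out over a relaxed version of $\mathcal{G}$, i.e.,

$$\mathcal{T}\triangleq\{(\boldsymbol{\Delta},\mathbf{x})\mid\text{Tr}(\mathbf{Q}_i\boldsymbol{\Delta})+2\mathbf{g}_i^T\mathbf{x}+t_i\le 0,\ i\in[k],\ \boldsymbol{\Delta}\succeq\mathbf{x}\mathbf{x}^T\},$$

which is a convex set, and obviously $\mathcal{G}\subset\mathcal{T}$. As a result, RCC1 is the solution to the following minimax problem

$$\min_{\hat{\mathbf{x}}}\ \max_{(\boldsymbol{\Delta},\mathbf{x})\in\mathcal{T}}\big\{\|\hat{\mathbf{x}}\|^2-2\hat{\mathbf{x}}^T\mathbf{x}+\text{Tr}(\boldsymbol{\Delta})\big\}. \tag{12}$$

Since $\mathcal{T}$ is bounded, and the objective in (12) is convex in $\hat{\mathbf{x}}$ and concave (linear) in $(\boldsymbol{\Delta},\mathbf{x})$, the order of minimization and maximization can be changed. Knowing that the minimum (over $\hat{\mathbf{x}}$) of the objective function occurs at $\hat{\mathbf{x}}=\mathbf{x}$, (12) reduces to

$$\max_{(\boldsymbol{\Delta},\mathbf{x})\in\mathcal{T}}\big\{-\|\mathbf{x}\|^2+\text{Tr}(\boldsymbol{\Delta})\big\},$$

whose objective is concave and whose constraints are linear matrix inequalities, and RCC1 is the $\mathbf{x}$-part of the solution. Since $\mathcal{G}\subset\mathcal{T}$, the radius of the corresponding ball of RCC1 is an upper bound on the Chebyshev radius of $\mathcal{Q}$.

An explicit representation of $\hat{\mathbf{x}}_{\text{RCC1}}$ is given in [8, Theorem III.1], which is restated here.

$$\hat{\mathbf{x}}_{\text{RCC1}}=-\Big(\sum_{i=1}^{k}\alpha_i\mathbf{Q}_i\Big)^{-1}\Big(\sum_{i=1}^{k}\alpha_i\mathbf{g}_i\Big), \tag{13}$$

where $(\alpha_1,\ldots,\alpha_k)$ is an optimal solution of the following convex problem

$$\min_{\alpha_i\ge 0,\ i\in[k]}\ \Big(\sum_{i=1}^{k}\alpha_i\mathbf{g}_i\Big)^T\Big(\sum_{i=1}^{k}\alpha_i\mathbf{Q}_i\Big)^{-1}\Big(\sum_{i=1}^{k}\alpha_i\mathbf{g}_i\Big)-\sum_{i=1}^{k}\alpha_i t_i\quad\text{s.t.}\quad\sum_{i=1}^{k}\alpha_i\mathbf{Q}_i\succeq\mathbf{I}, \tag{14}$$

which can be cast as a semidefinite program (SDP) and solved by an SDP solver.

It is shown in [8] that, similarly to the exact Chebyshev center, $\hat{\mathbf{x}}_{\text{RCC1}}$ is also unique (due to strict convexity of the $L^2$-norm) and it belongs to $\mathcal{Q}$, where the latter follows from the fact that for any $(\boldsymbol{\Delta}',\mathbf{x}')\in\mathcal{T}$, we have $\mathbf{x}'\in\mathcal{Q}$, which is due to the positive semidefiniteness of the $\mathbf{Q}_i$’s.

Finally, suppose that one of the constraints defining the set $\mathcal{Q}$ is a double-sided linear inequality of the form $l\le\mathbf{a}^T\mathbf{x}\le u$. We can proceed and write this constraint as two constraints, i.e., $\mathbf{a}^T\mathbf{x}\le u$ and $-\mathbf{a}^T\mathbf{x}\le -l$. However, it is shown in [8] that it is better (in the sense of a smaller minimax estimate) to write it in the quadratic form, i.e., $(\mathbf{a}^T\mathbf{x}-l)(\mathbf{a}^T\mathbf{x}-u)\le 0$. Although the exact Chebyshev center of $\mathcal{Q}$ does not rely on its specific representation, the RCC1 does, as it is the result of a relaxation of $\mathcal{G}$. Hence, any constraint of the form $l\le\mathbf{a}^T\mathbf{x}\le u$ will be replaced by $\mathbf{x}^T\mathbf{Q}\mathbf{x}+2\mathbf{g}^T\mathbf{x}+t\le 0$, with $\mathbf{Q}=\mathbf{a}\mathbf{a}^T$, $\mathbf{g}=-\frac{l+u}{2}\mathbf{a}$, and $t=lu$.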

V White-box setting

Let $X\in[0,1]$ be a random variable distributed according to an unknown CDF $F_X$. The goal is to find an estimate $\hat{X}$. First, we need the following lemma, which states that when there is no side information available to the estimator, there is no loss of optimality in restricting to the set of deterministic estimates.

Lemma 1.

Any randomized guess is outperformed by its statistical mean, and the performance improvement is equal to the variance of the random guess.

Proof.
Let $\hat{X}$ be a random guess distributed according to a fixed CDF $F_{\hat{X}}$. Since no side information is available, $\hat{X}$ is independent of $X$, and we have

$$\mathbb{E}\big[(X-\hat{X})^2\big]=\mathbb{E}\big[(X-\mathbb{E}[\hat{X}])^2\big]+\text{Var}(\hat{X}).$$

Hence, any estimate $\hat{X}\sim F_{\hat{X}}$ is outperformed by the new deterministic estimate $\mathbb{E}[\hat{X}]$, whose performance improvement is $\text{Var}(\hat{X})$. ∎
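The decomposition in Lemma 1 can be verified exactly on a small discrete example. The two probability mass functions below are arbitrary illustrative choices, and the random guess is taken to be independent of the unknown quantity, as is the case when no side information is available.

```python
import numpy as np

# Arbitrary discrete distributions on [0, 1] (illustrative assumptions).
x_vals = np.array([0.0, 0.3, 1.0])   # support of the unknown quantity X
p_x = np.array([0.2, 0.5, 0.3])
g_vals = np.array([0.1, 0.6, 0.9])   # support of the random guess
p_g = np.array([0.4, 0.4, 0.2])

# Exact MSE of the random guess, using independence of X and the guess.
diff_sq = (x_vals[:, None] - g_vals[None, :]) ** 2
mse_random = float(p_x @ diff_sq @ p_g)

# Exact MSE of the deterministic guess equal to the mean of the random guess.
g_mean = float(p_g @ g_vals)
g_var = float(p_g @ (g_vals - g_mean) ** 2)
mse_mean = float(p_x @ (x_vals - g_mean) ** 2)

# Lemma 1: mse_random equals mse_mean + g_var.
gap = mse_random - (mse_mean + g_var)
```

The gap is zero up to rounding, so replacing the random guess by its mean lowers the error by exactly the variance of the guess.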

Since the underlying distribution of $X$ is unknown to the estimator, one conventional approach is to consider the best-worst estimator. In other words, the goal of the estimator is to minimize the maximum error, which can be cast as a minimax problem, i.e.,

$$\min_{\hat{x}}\ \max_{F_X}\ \mathbb{E}\big[(X-\hat{x})^2\big],$$

where lemma 1 is used in the minimization, i.e., instead of minimizing over $F_{\hat{X}}$, we are minimizing over the singleton $\hat{x}$. Since for any fixed $\hat{x}$, we have $\max_{F_X}\mathbb{E}[(X-\hat{x})^2]=\max_{x\in[0,1]}(x-\hat{x})^2$, the best-worst estimation is the solution to

$$\min_{\hat{x}}\ \max_{x\in[0,1]}(x-\hat{x})^2,$$

which is the Chebyshev center of the interval $[0,1]$ in the space $(\mathbb{R},|\cdot|)$ and it is equal to $\tfrac{1}{2}$. This implies that with the estimator being blind to the underlying distribution and any possible side information, the best-worst estimate is the Chebyshev center of the support of the random variable, here $[0,1]$.
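As a quick numerical illustration of the best-worst estimate on the unit interval, the sketch below scans candidate guesses on a grid and evaluates the worst-case squared error of each. The interval [0, 1] and the grid resolution are assumptions made for the example.

```python
import numpy as np

# Candidate deterministic guesses on a grid over [0, 1].
grid = np.linspace(0.0, 1.0, 1001)

# For a fixed guess g, the worst case of (x - g)^2 over x in [0, 1]
# is attained at one of the two endpoints.
worst_case = np.maximum(grid ** 2, (1.0 - grid) ** 2)

best_guess = grid[np.argmin(worst_case)]
best_value = worst_case.min()
```

The minimizing guess is the midpoint of the interval, with a worst-case squared error of one quarter.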

As one step further, consider that $\mathbf{X}$ is a $d$-dimensional random vector distributed according to an unknown CDF $F_{\mathbf{X}}$. Although the estimator is still unaware of $F_{\mathbf{X}}$, this time it has access to the matrix-vector pair $(\mathbf{A},\mathbf{b})$, and based on this side information, it gives an estimate $\hat{\mathbf{X}}$. This side information refines the prior belief of $\mathbf{X}\in[0,1]^{d}$ to $\mathbf{X}\in\mathcal{S}_F$. Similarly to the previous discussion, the best-worst estimator gives the Chebyshev center of $\mathcal{S}_F$. As mentioned before, obtaining the exact Chebyshev centre of $\mathcal{S}_F$ is computationally difficult, hence, we focus on its approximations. However, prior to the approximation, we start with simple heuristic estimates that bear an intuitive notion of centeredness.

The first scheme in estimating $\mathbf{X}$ is the naive estimate of $\tfrac{1}{2}\mathbf{1}_d$, which is the Chebyshev center of $[0,1]^{d}$. We name this estimate $\hat{\mathbf{x}}_{\text{half}}$. We already know that when the only information that we have about $\mathbf{X}$ is that it belongs to $[0,1]^{d}$, then $\hat{\mathbf{x}}_{\text{half}}$ is optimal in the best-worst sense.

The adversary can perform better when the side information $(\mathbf{A},\mathbf{b})$ is available. A second scheme can be built on top of the previous scheme as follows. The estimator finds a solution in the solution space, i.e., $\mathcal{S}$, that is closest to $\hat{\mathbf{x}}_{\text{half}}$, which is shown in Figure 2. In this scheme, the estimate, named $\hat{\mathbf{x}}_{\text{half}^*}$, is given by

$$\hat{\mathbf{x}}_{\text{half}^*}\triangleq\arg\min_{\mathbf{x}\in\mathcal{S}}\big\|\mathbf{x}-\tfrac{1}{2}\mathbf{1}_d\big\|^2,$$

whose explicit representation is provided in the following proposition. (Footnote 8: Note that $\hat{\mathbf{x}}_{\text{half}^*}$ may or may not belong to $\mathcal{S}_F$.)

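Proposition 1.

We have

$$\hat{\mathbf{x}}_{\text{half}^*}=\mathbf{A}^{+}\mathbf{b}+\tfrac{1}{2}(\mathbf{I}-\mathbf{A}^{+}\mathbf{A})\mathbf{1}_d.$$

Proof.
For any $\mathbf{x}\in\mathcal{S}$, we have $\mathbf{x}=\mathbf{A}^{+}\mathbf{b}+(\mathbf{I}-\mathbf{A}^{+}\mathbf{A})\mathbf{w}$ for some $\mathbf{w}\in\mathbb{R}^{d}$. Hence,

$$\min_{\mathbf{x}\in\mathcal{S}}\big\|\mathbf{x}-\tfrac{1}{2}\mathbf{1}_d\big\|^2=\min_{\mathbf{w}\in\mathbb{R}^{d}}\|\mathbf{B}\mathbf{w}-\mathbf{h}\|^2, \tag{20}$$

where $\mathbf{B}\triangleq\mathbf{I}-\mathbf{A}^{+}\mathbf{A}$ and $\mathbf{h}\triangleq\tfrac{1}{2}\mathbf{1}_d-\mathbf{A}^{+}\mathbf{b}$. It is already known that the minimizer in (20) is $\mathbf{w}^*=\mathbf{B}^{+}\mathbf{h}$, which results in

$$\begin{aligned}\hat{\mathbf{x}}_{\text{half}^*}&=\mathbf{A}^{+}\mathbf{b}+(\mathbf{I}-\mathbf{A}^{+}\mathbf{A})(\mathbf{I}-\mathbf{A}^{+}\mathbf{A})^{+}\big(\tfrac{1}{2}\mathbf{1}_d-\mathbf{A}^{+}\mathbf{b}\big)\\&=\mathbf{A}^{+}\mathbf{b}+(\mathbf{I}-\mathbf{A}^{+}\mathbf{A})^2\big(\tfrac{1}{2}\mathbf{1}_d-\mathbf{A}^{+}\mathbf{b}\big) &(21)\\&=\mathbf{A}^{+}\mathbf{b}+(\mathbf{I}-\mathbf{A}^{+}\mathbf{A})\big(\tfrac{1}{2}\mathbf{1}_d-\mathbf{A}^{+}\mathbf{b}\big) &(22)\\&=\mathbf{A}^{+}\mathbf{b}+\tfrac{1}{2}(\mathbf{I}-\mathbf{A}^{+}\mathbf{A})\mathbf{1}_d, &(23)\end{aligned}$$

where (21) to (23) are justified as follows. Let $\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$ be an SVD of $\mathbf{A}$. From $\mathbf{A}^{+}=\mathbf{V}\boldsymbol{\Sigma}^{+}\mathbf{U}^T$, we get $\mathbf{A}^{+}\mathbf{A}=\mathbf{V}\boldsymbol{\Sigma}^{+}\boldsymbol{\Sigma}\mathbf{V}^T$ and $\mathbf{I}-\mathbf{A}^{+}\mathbf{A}=\mathbf{V}(\mathbf{I}-\boldsymbol{\Sigma}^{+}\boldsymbol{\Sigma})\mathbf{V}^T$. Knowing that $\mathbf{I}-\boldsymbol{\Sigma}^{+}\boldsymbol{\Sigma}$ is a diagonal matrix with only 0 and 1 on its diagonal, we get $(\mathbf{I}-\boldsymbol{\Sigma}^{+}\boldsymbol{\Sigma})^{+}=\mathbf{I}-\boldsymbol{\Sigma}^{+}\boldsymbol{\Sigma}$, and therefore, $(\mathbf{I}-\mathbf{A}^{+}\mathbf{A})^{+}=\mathbf{I}-\mathbf{A}^{+}\mathbf{A}$, which results in (21). Noting that $\mathbf{I}-\mathbf{A}^{+}\mathbf{A}$ is a projector results in (22). Finally, by noting that $\mathbf{A}^{+}\mathbf{A}\mathbf{A}^{+}=\mathbf{A}^{+}$, we get $(\mathbf{I}-\mathbf{A}^{+}\mathbf{A})\mathbf{A}^{+}=\mathbf{0}$, which results in (23). ∎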
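The closed form for the point of the solution space closest to the all-half vector, namely the pseudoinverse solution plus one half of the null-space projection of the all-one vector, can be cross-checked numerically. The random system and the variable names below are illustrative assumptions; the check compares the closed form against many other randomly drawn solutions.

```python
import numpy as np

# Synthetic satisfiable system (illustrative values).
rng = np.random.default_rng(4)
m, d = 2, 6
A = rng.normal(size=(m, d))
b = A @ rng.uniform(size=d)

A_plus = np.linalg.pinv(A)
P_null = np.eye(d) - A_plus @ A          # projector onto the null space of A
half = 0.5 * np.ones(d)

# Closed form: A^+ b + (1/2) (I - A^+ A) 1.
x_half_star = A_plus @ b + 0.5 * P_null @ np.ones(d)

# Cross-check against other solutions A^+ b + (I - A^+ A) w for random w.
dist_star = np.linalg.norm(x_half_star - half)
others = [A_plus @ b + P_null @ rng.normal(size=d) for _ in range(1000)]
min_other = min(np.linalg.norm(x - half) for x in others)
```

None of the sampled solutions is closer to the all-half vector than the closed form, and the two matrix identities used in the proof (idempotence of the projector and its annihilation of the pseudoinverse) can be confirmed on `P_null` directly.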

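Thus far, we have considered two simple schemes, i.e., $\hat{\mathbf{x}}_{\text{half}}$ and $\hat{\mathbf{x}}_{\text{half}^*}$. In what follows, we investigate two approximations for the Chebyshev center of $\mathcal{S}_F$. The exact Chebyshev center of $\mathcal{S}_F$ is given by

$$\arg\min_{\hat{\mathbf{x}}}\ \max_{\mathbf{x}\in\mathcal{S}_F}\|\mathbf{x}-\hat{\mathbf{x}}\|^2. \tag{24}$$

Let $\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$ be an SVD of $\mathbf{A}$, where the singular values are arranged in a non-increasing order, i.e., $\sigma_1\ge\sigma_2\ge\ldots$. Let $r\triangleq\text{rank}(\mathbf{A})$. Hence, $\text{Null}(\mathbf{A})=\text{span}\{\mathbf{v}_{r+1},\ldots,\mathbf{v}_d\}$, which is the span of those right singular vectors that correspond to zero singular values. Define $\mathbf{V}_0\triangleq[\mathbf{v}_{r+1},\ldots,\mathbf{v}_d]$.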

The orthonormal columns of can be regarded as a basis for . Hence, any vector can be written as , where , and . With the definition of , we have

Noting that has orthonormal columns, we have . Therefore, in (8) can be written as




Denoting the -th row of and the -th element of by and , respectively, the following proposition provides an approximation for the exact Chebyshev center in (24).

Proposition 2.

A relaxed Chebyshev center of is given by


where ’s are obtained as in (14) with , , and . Furthermore, is unique and it belongs to the set of feasible solution, i.e., .


The linear constraints of are in the form for . By writing these constraints as their dual quadratic form , with , and , and following the approach in [8], which is explained in section IV, is obtained as in (27). Finally, the uniqueness and feasibility of follows from the arguments after (14). ∎

A second relaxation is provided in the following proposition.

Proposition 3.

A relaxed Chebyshev center of is given by


where is the solution of


Furthermore, $\hat{\mathbf{x}}_{\text{RCC2}}$ is unique and it belongs to the set of feasible solutions, i.e., $\hat{\mathbf{x}}_{\text{RCC2}}\in\mathcal{S}_F$.


The inner maximization in (24) is


which is a maximization of a convex objective function. As discussed before, one way of relaxing this problem was studied in [8], where the relaxation was over the search space. Here, we propose to directly relax the objective function by making use of the boundedness of $\mathcal{S}_F$. In other words, since for any $\mathbf{x}\in\mathcal{S}_F$, $\mathbf{0}\le\mathbf{x}\le\mathbf{1}$, we have $\|\mathbf{x}\|^2\le\mathbf{1}^T\mathbf{x}$. Hence, we can write


where (31) follows from i) the boundedness of , and ii) the concavity (linearity) and convexity of the objective in and , respectively. (32) follows from the fact that knowing , is the minimizer in (31). The RCC2 estimate is the solution of (32). (33) follows from the equivalence given in (25), and denoting the maximizer of (33) by , we have . In (34), we have used the fact that and

Finally, since the objective of (29) is strictly concave, we have that , and hence, are unique. Moreover, due to the constraint in (29), we have . ∎
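One way to read the relaxation in this proof: bounding the squared norm of a feasible point by the sum of its entries turns the minimax problem into maximizing the sum of entries minus the squared norm over the feasible set, which is the same as finding the feasible point closest to the all-half vector. Under that reading (stated here as an assumption drawn from the derivation, not as the authors' implementation), the RCC2 estimate can be sketched as a Euclidean projection. The code below uses Dykstra's alternating projection algorithm, which is a solver choice made for this example; the toy system and all names are hypothetical.

```python
import numpy as np

def project_onto_feasible_set(target, A, b, iters=5000):
    # Dykstra's algorithm for the Euclidean projection of `target` onto
    # {x : A x = b} intersected with [0, 1]^d (assumed non-empty).
    A_plus = np.linalg.pinv(A)
    x = np.asarray(target, dtype=float).copy()
    p = np.zeros_like(x)   # correction term for the affine set
    q = np.zeros_like(x)   # correction term for the box
    for _ in range(iters):
        y = (x + p) - A_plus @ (A @ (x + p) - b)   # project onto A x = b
        p = x + p - y
        x_new = np.clip(y + q, 0.0, 1.0)           # project onto the box
        q = y + q - x_new
        x = x_new
    return x

# Toy system in which the closest solution to the all-half vector
# leaves the unit cube (illustrative values).
A = np.array([[1.0, 0.1, 0.1]])
b = np.array([1.15])
d = A.shape[1]
A_plus = np.linalg.pinv(A)

x_half_star = A_plus @ b + 0.5 * (np.eye(d) - A_plus @ A) @ np.ones(d)
x_rcc2 = project_onto_feasible_set(0.5 * np.ones(d), A, b)

# Compare both estimates against one feasible ground-truth point.
x_true = np.array([1.0, 1.0, 0.5])
err_half_star = np.linalg.norm(x_true - x_half_star)
err_rcc2 = np.linalg.norm(x_true - x_rcc2)
```

In this toy case the unconstrained estimate has a first entry above 1, while the projected estimate stays feasible and is closer to the chosen ground-truth point, which is consistent with the projection argument used in the proof of Theorem 1.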

Denoting the MSE of a certain estimate by , the following theorem provides a relationship between some of the estimates introduced thus far.

Theorem 1.

The following inequalities hold.


In order to prove the first inequality, we proceed as follows. The derivative of the objective of (29) with respect to is


Since the objective in (29) is (strictly) concave in , by setting , we obtain as the maximizer. It is important to note that this is not the solution of (29), i.e., , in general, as it might not satisfy its constraints. Define . We have


where the equality follows from the definition of .

If satisfies the constraints of (29), then , and , otherwise, we have that is the point in that is closest to , and as a result, is the point in that is closest to . This is justified as follows.

which results in . Hence, we can write999This follows from the fact that if is a nonempty convex subset of and a convex and differentiable function, then we have if and only if . (39) can be obtained by replacing and with and , respectively, and noting that and .


which results in the following inequality for