I. Introduction
To tackle the concerns in traditional centralized learning, i.e., privacy, storage, and computational complexity, Federated Learning (FL) has been proposed in [16], where machine learning (ML) models are jointly trained by multiple local data owners (i.e., parties), such as smartphones, data centres, etc., without revealing their private data to each other. This approach has gained interest in many real-life applications, such as health systems [14, 12], keyboard prediction [2, 10], and e-commerce [23, 21]. Thus far, based on how data is partitioned among the participating parties, three variants of FL have been considered: horizontal, vertical, and transfer FL. Horizontal FL (HFL) refers to FL among data owners that hold different data records/samples sharing the same set of features [25], and vertical FL (VFL) is FL in which parties share common data samples with disjoint sets of features [4].
Figure 1 illustrates a digital banking system as an example of a VFL setting in which two parties participate, namely, a bank and a FinTech company [15]. The bank wishes to build a binary classification model to approve/disapprove a user's credit card application by utilizing additional features from the FinTech company. In this context, only the bank has access to the class labels in the training and testing datasets, and it is hence named the active party, while the FinTech company, which is unaware of the labels, is referred to as the passive party.
Once the model is trained, it can be used to predict the decision (approve/disapprove) on a new credit card application in the prediction dataset. The model outputs, referred to as the prediction outputs or, more specifically, the confidence scores, are revealed to the active party, which is in charge of making the decision. The adversary may be aware or unaware of the passive party's model parameters, which are referred to as the white-box and black-box settings, respectively.
As stated in [15], upon receipt of the prediction outputs, which generally depend on the passive party's features, a curious active party can perform reconstruction attacks to infer those features, which are regarded as the passive party's sensitive information. This privacy leakage in the prediction phase of VFL is the main focus of this paper, and the following contributions are made.

Theorems 1 and 2 provide theoretical bounds as rigorous guarantees for some of the proposed attacks.

In the black-box setting, it is shown that when the passive party holds one feature and the adversary is aware of the signs of the parameters involved, the adversary can still fully reconstruct the passive party's feature, given that the number of predictions is large enough.

Two privacy-preserving schemes are proposed as defense techniques against reconstruction attacks, which have the advantage of not degrading the benefits that VFL brings to the active party.
The organization of the paper is as follows. In Section II, an explanation of the system model under consideration is provided. In Section III, the elementary steps that pave the way for the adversary's attack are elaborated, and the measure by which the performance of the reconstruction attack is evaluated is provided. The analysis and derivations in this paper need some preliminaries from linear algebra and optimization; to make the text as self-contained as possible, these are provided in Section IV. The main results of the paper are given in Sections V to VIII. In Section V, the white-box setting is considered, and several attack methods are proposed and evaluated analytically. Section VI deals with the black-box setting and investigates the minimum knowledge the adversary needs to perform a successful attack. In Section VII, two privacy-preserving schemes are provided that worsen the adversary's attacks while not altering the confidence scores revealed to it. Section VIII is devoted to the experimental evaluation of the results of this paper and comparison with those in the literature. Finally, Section IX concludes the paper. To improve readability, the notation used in this paper is provided next.
Notation.
Matrices and vectors are denoted by bold capital letters (e.g., $\mathbf{A}$) and bold lower-case letters (e.g., $\mathbf{a}$), respectively. Random variables are denoted by capital letters (e.g., $X$), and their realizations by lower-case letters (e.g., $x$).$^{1}$ Sets are denoted by capital letters in calligraphic font (e.g., $\mathcal{A}$), with the exception of the set of real numbers, i.e., $\mathbb{R}$. The cardinality of a finite set $\mathcal{A}$ is denoted by $|\mathcal{A}|$. For a matrix $\mathbf{A}$, the null space, rank, and nullity are denoted by $\mathrm{Null}(\mathbf{A})$, $\mathrm{rank}(\mathbf{A})$, and $\mathrm{nul}(\mathbf{A})$, respectively, with $\mathrm{rank}(\mathbf{A}) + \mathrm{nul}(\mathbf{A})$ equal to the number of columns of $\mathbf{A}$. The transpose of $\mathbf{A}$ is denoted by $\mathbf{A}^T$, and when $\mathbf{A}$ is square, its trace and determinant are denoted by $\mathrm{Tr}(\mathbf{A})$ and $\det(\mathbf{A})$, respectively. For an integer $n \geq 1$, $\mathbf{I}_n$, $\mathbf{1}_n$, and $\mathbf{0}_n$ denote the $n \times n$ identity matrix and the $n$-dimensional all-one and all-zero column vectors, respectively; whenever it is clear from the context, their subscripts are dropped. For two vectors $\mathbf{u}$ and $\mathbf{v}$, $\mathbf{u} \succeq \mathbf{v}$ means that each element of $\mathbf{u}$ is greater than or equal to the corresponding element of $\mathbf{v}$. The notation $\mathbf{A} \succeq \mathbf{0}$ is used to show that $\mathbf{A}$ is positive semidefinite, and $\mathbf{A} \succeq \mathbf{B}$ is equivalent to $\mathbf{A} - \mathbf{B} \succeq \mathbf{0}$. For integers $m \leq n$, the discrete interval is defined as $[m:n] = \{m, m+1, \ldots, n\}$, and the set $[1:n]$ is written in short as $[n]$. $F_X$ denotes the cumulative distribution function (CDF) of a random variable $X$, whose expectation is denoted by $\mathbb{E}[X]$. In this paper, all the (in)equalities that involve a random variable are in the almost sure (a.s.) sense, i.e., they hold with probability 1. For $p \in [1, \infty)$ and $\mathbf{x} \in \mathbb{R}^n$, the $\ell_p$-norm is defined as $\|\mathbf{x}\|_p = (\sum_{i=1}^{n} |x_i|^p)^{1/p}$, and $\|\mathbf{x}\|_\infty = \max_i |x_i|$. Throughout the paper, $\|\cdot\|$ (i.e., without a subscript) refers to the $\ell_2$-norm. Let $P$ and $Q$ be two arbitrary pmfs on $[n]$. The Kullback–Leibler divergence from $P$ to $Q$ is defined as $D(P\|Q) = \sum_{i} P(i) \ln\frac{P(i)}{Q(i)}$,$^{2}$ which is also shown as $D(\mathbf{p}\|\mathbf{q})$ with $\mathbf{p}$ and $\mathbf{q}$ being the corresponding probability vectors of $P$ and $Q$, respectively.
$^{1}$ To prevent confusion between the notation of a matrix (bold capital) and that of a random vector (also bold capital), certain bold capital letters are reserved for random vectors rather than matrices.
$^{2}$ We assume that $P$ is absolutely continuous with respect to $Q$, i.e., $Q(i) = 0$ implies $P(i) = 0$; otherwise, $D(P\|Q) = \infty$.

II. System Model
II-A. Machine Learning (ML)
An ML model is a function $f_{\boldsymbol{\theta}}: \mathcal{X} \to \mathcal{Y}$ parameterized by the vector $\boldsymbol{\theta}$, where $\mathcal{X}$ and $\mathcal{Y}$ denote the input and output spaces, respectively. Supervised classification is considered in this paper, where a labeled training dataset is used to train the model.
Assume that a training dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ is given, where each $\mathbf{x}_i$ is a $d$-dimensional example/sample and $y_i$ denotes its corresponding label. Learning refers to the process of obtaining the parameter vector $\boldsymbol{\theta}^*$ in the minimization of a loss function, i.e.,

$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \frac{1}{n} \sum_{i=1}^{n} \ell\big(f_{\boldsymbol{\theta}}(\mathbf{x}_i), y_i\big), \qquad (1)$$

where $\ell(\cdot, \cdot)$ measures the loss of predicting $f_{\boldsymbol{\theta}}(\mathbf{x}_i)$ while the true label is $y_i$. A regularization term can be added to the optimization to avoid overfitting.
Once the model is trained, i.e., $\boldsymbol{\theta}^*$ is obtained, it can be used for the prediction of any new sample. In practice, the prediction is (probability) vector-valued, i.e., it is a vector of confidence scores $\mathbf{c} = [c_1, c_2, \ldots, c_K]^T$ with $\sum_{k=1}^{K} c_k = 1$, where $c_k$ denotes the probability that the sample belongs to class $k$, and $K$ denotes the number of classes. Classification can be done by choosing the class that has the highest confidence score.
In this paper, we focus on logistic regression (LR), which can be modelled as

$$f_{\boldsymbol{\theta}}(\mathbf{x}) = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b}), \qquad (2)$$

where $\mathbf{W}$ and $\mathbf{b}$ are the parameters collectively denoted as $\boldsymbol{\theta}$, and $\sigma(\cdot)$ is the sigmoid or softmax function in the case of binary or multi-class classification, respectively.
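For concreteness, the prediction step of a trained LR model as in (2) can be sketched as follows. This is a minimal NumPy illustration; the weights and sample below are arbitrary placeholders, not values from the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: maps logits to confidence scores."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def predict_confidence(W, b, x):
    """Confidence scores of a K-class LR model with weights W (K x d)
    and bias b (K,) for a single sample x (d,)."""
    return softmax(W @ x + b)

# Classification picks the class with the highest confidence score.
W = np.array([[1.0, -1.0], [0.5, 0.5]])
b = np.array([0.0, 0.1])
x = np.array([0.2, 0.7])
scores = predict_confidence(W, b, x)
label = int(np.argmax(scores))
```

The returned scores are nonnegative and sum to one, matching the definition of a confidence-score vector above.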
II-B. Vertical Federated Learning
VFL is an ML model training approach in which two or more parties are involved in the training process, such that they hold the same set of samples with disjoint sets of features. The main goal in VFL is to train a model in a privacy-preserving manner, i.e., to collaboratively train a model without each party having access to the other parties' features. Typically, the training involves a trusted third party known as the coordinator authority (CA), and it is commonly assumed that only one party has access to the label information in the training and testing datasets. This party is named active, and the remaining parties are called passive. Throughout this paper, we assume that only two parties are involved; one is active and the other is passive. The active party is assumed to be honest but curious, i.e., it obeys the protocols exactly, but may try to infer the passive party's features based on the information received. As a result, the active party is referred to as the adversary in this paper.
In the existing VFL frameworks, the CA's main task is to coordinate the learning process once it has been initiated by the active party. During the training, the CA receives the intermediate model updates from each party, and after a set of computations, backpropagates each party's gradient updates, separately and securely. To meet the privacy requirements of the parties' datasets, cryptographic techniques such as secure multiparty computation (SMC) [24] or homomorphic encryption (HE) [5] are used. Once the global model is trained, upon the request of the active party for a new record prediction, each party computes the results of its model using its own features. The CA aggregates these results from all the parties, obtains the prediction (confidence scores), and delivers it to the active party for further action.
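The aggregation step in the prediction phase can be sketched as follows. This is an illustrative plaintext sketch with assumed party names and shapes; a real deployment would protect the exchanged values with SMC or HE as noted above.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Each party scores only its own feature slice with its own parameters.
def partial_logits(W_party, x_party):
    return W_party @ x_party

def ca_aggregate(logits_active, logits_passive, bias):
    """The CA sums the parties' partial logits, adds the bias, applies
    softmax, and returns the confidence scores to the active party."""
    return softmax(logits_active + logits_passive + bias)

W_act = np.array([[0.3, -0.2], [0.1, 0.4]])  # active party's parameters
W_pas = np.array([[0.5], [-0.5]])            # passive party's parameters
bias = np.array([0.0, 0.0])
x_act, x_pas = np.array([1.0, 0.5]), np.array([0.8])

scores = ca_aggregate(partial_logits(W_act, x_act),
                      partial_logits(W_pas, x_pas), bias)
```

Because the logits are additive across feature slices, the aggregated scores coincide with those of the centralized model on the concatenated feature vector, which is what makes the split computation possible.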
As in [15], we assume that the active party has no information about the underlying distribution of the passive party's features. However, it is assumed that knowledge of the names, types, and ranges of the features is available to the active party, so that it can decide whether or not to participate in a VFL.
III. Problem Statement
Let $\mathbf{X} = [\mathbf{X}_{\mathrm{act}}^T, \mathbf{X}_{\mathrm{pas}}^T]^T$ denote a random $d$-dimensional input sample for prediction, where the $d_{\mathrm{act}}$-dimensional $\mathbf{X}_{\mathrm{act}}$ and the $d_{\mathrm{pas}}$-dimensional $\mathbf{X}_{\mathrm{pas}}$ correspond to the feature values held by the active and passive parties, respectively. The VFL model under consideration is LR, where the confidence score is given by $\mathbf{c} = \sigma(\mathbf{W}_{\mathrm{act}}\mathbf{X}_{\mathrm{act}} + \mathbf{W}_{\mathrm{pas}}\mathbf{X}_{\mathrm{pas}} + \mathbf{b})$ with $\mathbf{c} = [c_1, \ldots, c_K]^T$. Denoting the number of classes in the classification task by $K$, $\mathbf{W}_{\mathrm{act}}$ (with dimension $K \times d_{\mathrm{act}}$) and $\mathbf{W}_{\mathrm{pas}}$ (with dimension $K \times d_{\mathrm{pas}}$) are the model parameters of the active and passive parties, respectively, and $\mathbf{b}$ is the $K$-dimensional bias vector. From the definition of $\sigma(\cdot)$, we have

$$c_k = \frac{\exp(\mathbf{w}_{\mathrm{act},k}^T \mathbf{X}_{\mathrm{act}} + \mathbf{w}_{\mathrm{pas},k}^T \mathbf{X}_{\mathrm{pas}} + b_k)}{\sum_{j=1}^{K} \exp(\mathbf{w}_{\mathrm{act},j}^T \mathbf{X}_{\mathrm{act}} + \mathbf{w}_{\mathrm{pas},j}^T \mathbf{X}_{\mathrm{pas}} + b_j)}, \quad k \in [K], \qquad (3)$$

where $\mathbf{w}_{\mathrm{act},k}^T$ and $\mathbf{w}_{\mathrm{pas},k}^T$ denote the $k$th rows of $\mathbf{W}_{\mathrm{act}}$ and $\mathbf{W}_{\mathrm{pas}}$, and $c_k$ and $b_k$ denote the $k$th elements of $\mathbf{c}$ and $\mathbf{b}$, respectively. Construct the matrices $\bar{\mathbf{W}}_{\mathrm{act}}$ and $\bar{\mathbf{W}}_{\mathrm{pas}}$ whose $k$th rows ($k \in [K-1]$) are $\mathbf{w}_{\mathrm{act},k}^T - \mathbf{w}_{\mathrm{act},K}^T$ and $\mathbf{w}_{\mathrm{pas},k}^T - \mathbf{w}_{\mathrm{pas},K}^T$, respectively. From the identity $\sum_{k=1}^{K} c_k = 1$, an equivalent representation of (3) is

$$\bar{\mathbf{W}}_{\mathrm{pas}} \mathbf{X}_{\mathrm{pas}} = \mathbf{t} - \bar{\mathbf{W}}_{\mathrm{act}} \mathbf{X}_{\mathrm{act}} - \bar{\mathbf{b}}, \qquad (4)$$

where $\mathbf{t}$ is a $(K-1)$-dimensional vector whose $k$th element is $\ln\frac{c_k}{c_K}$, and $\bar{\mathbf{b}}$ is the $(K-1)$-dimensional vector whose $k$th element is $b_k - b_K$. Denoting the RHS of (4) by $\mathbf{z}$, (4) writes as $\bar{\mathbf{W}}_{\mathrm{pas}} \mathbf{X}_{\mathrm{pas}} = \mathbf{z}$, where $\mathbf{z}$ is $(K-1)$-dimensional.
The white-box setting refers to the scenario in which the adversary is aware of the passive party's model parameters, and the black-box setting refers to the context in which the adversary knows only its own parameters and the received confidence scores.
Since the active party wishes to reconstruct the passive party's features, one measure by which the attack performance can be evaluated is the mean square error per feature, i.e.,

$$\mathrm{MSE} = \frac{1}{d_{\mathrm{pas}}} \mathbb{E}\left[\|\mathbf{X}_{\mathrm{pas}} - \hat{\mathbf{X}}_{\mathrm{pas}}\|^2\right], \qquad (5)$$

where $\hat{\mathbf{X}}_{\mathrm{pas}}$ is the adversary's estimate. Let $N$ denote the number of predictions. Assuming that these predictions are carried out in an i.i.d. manner, the Law of Large Numbers (LLN) allows the MSE to be approximated by its empirical value, since the latter converges almost surely to (5) as $N$ grows.$^{3}$ This observation is later used in the experimental results to evaluate the performance of different reconstruction attacks.
$^{3}$ It is important to note, however, that when the adversary's estimates are not independent across the predictions (the non-i.i.d. case), the empirical MSE is not necessarily equal to (5). In such cases, the empirical MSE is taken as the performance metric.

IV. Preliminaries
Throughout this paper, we are interested in solving a satisfiable$^{4}$ system of linear equations in which the unknowns (features of the passive party) are in the range $[0,1]$. This can be captured by solving for $\mathbf{x}$ in the equation $\mathbf{A}\mathbf{x} = \mathbf{b}$, where $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{x} \in [0,1]^n$, and $\mathbf{b} \in \mathbb{R}^m$ for some positive integers $m$ and $n$. We are particularly interested in the case when the number of unknowns $n$ is greater than the number of equations $m$. This is a particular case of an indeterminate/underdetermined system, where $\mathbf{A}$ does not have full column rank and an infinitude of solutions exists for this linear system. Since the system under consideration is satisfiable, any solution can be written as $\mathbf{x} = \mathbf{A}^+\mathbf{b} + (\mathbf{I} - \mathbf{A}^+\mathbf{A})\mathbf{w}$ for some $\mathbf{w} \in \mathbb{R}^n$, where $\mathbf{A}^+$ denotes the pseudoinverse of $\mathbf{A}$ satisfying the Moore-Penrose conditions [18].$^{5}$ One property of the pseudoinverse that is useful in the sequel is that if $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$ is a singular value decomposition (SVD) of $\mathbf{A}$, then $\mathbf{A}^+ = \mathbf{V}\boldsymbol{\Sigma}^+\mathbf{U}^T$, in which $\boldsymbol{\Sigma}^+$ is obtained by taking the reciprocal of each nonzero element on the diagonal of $\boldsymbol{\Sigma}$, and then transposing the matrix. For a given pair $(\mathbf{A}, \mathbf{b})$, define

$$\mathcal{S}_{\mathbf{A},\mathbf{b}} = \{\mathbf{x} \in \mathbb{R}^n : \mathbf{A}\mathbf{x} = \mathbf{b}\}, \quad \mathcal{F}_{\mathbf{A},\mathbf{b}} = \mathcal{S}_{\mathbf{A},\mathbf{b}} \cap [0,1]^n \qquad (6)$$

as the solution space and feasible solution space, respectively. Alternatively, by defining

$$\mathcal{W} = \{\mathbf{w} \in \mathbb{R}^n : \mathbf{0} \preceq \mathbf{A}^+\mathbf{b} + (\mathbf{I} - \mathbf{A}^+\mathbf{A})\mathbf{w} \preceq \mathbf{1}\}, \qquad (7)$$

we have

$$\mathcal{F}_{\mathbf{A},\mathbf{b}} = \{\mathbf{A}^+\mathbf{b} + (\mathbf{I} - \mathbf{A}^+\mathbf{A})\mathbf{w} : \mathbf{w} \in \mathcal{W}\}. \qquad (8)$$

We have that $\mathcal{W}$ is a closed convex set defined as an intersection of half-spaces. Since $\mathcal{F}_{\mathbf{A},\mathbf{b}}$ is the image of $\mathcal{W}$ under an affine transformation, it is a closed convex polytope in $\mathbb{R}^n$.
$^{4}$ This means that at least one solution exists for this system, which is due to the context in which this problem arises.
$^{5}$ When $\mathbf{A}$ has linearly independent rows, we have $\mathbf{A}^+ = \mathbf{A}^T(\mathbf{A}\mathbf{A}^T)^{-1}$.
For a general (satisfiable or not) system of linear equations $\mathbf{A}\mathbf{x} = \mathbf{b}$, we have that $\mathbf{A}^+\mathbf{b}$ minimizes $\|\mathbf{A}\mathbf{x} - \mathbf{b}\|$. Moreover, if the system is satisfiable, $\mathbf{A}^+\mathbf{b}$ is the minimum-norm solution. Therefore, in our problem, we have $\|\mathbf{A}^+\mathbf{b}\| \leq \|\mathbf{x}\|$ for all $\mathbf{x} \in \mathcal{S}_{\mathbf{A},\mathbf{b}}$. Define $\mathbf{x}_{\mathrm{LS}} = \mathbf{A}^+\mathbf{b}$, where the subscript stands for Least Squares.$^{6}$ It is important to note that $\mathbf{x}_{\mathrm{LS}}$ may not necessarily belong to $[0,1]^n$, which is our region of interest. A geometrical representation is provided in Figure 2. As a result, one can always consider the constrained optimization of $\|\mathbf{A}\mathbf{x} - \mathbf{b}\|$ with the constraint $\mathbf{x} \in [0,1]^n$ in order to find a feasible solution. We denote any solution obtained in this manner by $\mathbf{x}_{\mathrm{CLS}}$, where the subscript stands for Constrained Least Squares. In an indeterminate system, in contrast to $\mathbf{x}_{\mathrm{LS}}$, $\mathbf{x}_{\mathrm{CLS}}$ is not unique, and any point in $\mathcal{F}_{\mathbf{A},\mathbf{b}}$ can be a candidate for $\mathbf{x}_{\mathrm{CLS}}$, depending on the initial point of the solver.
$^{6}$ Note that this naming is with a slight abuse of convention, as the term least squares points to a vector that minimizes $\|\mathbf{A}\mathbf{x} - \mathbf{b}\|$ in general, rather than a minimum-norm solution.
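These objects are easy to reproduce numerically. The sketch below, on an assumed toy system (not one from the paper), builds the pseudoinverse from the SVD as described above, computes the minimum-norm solution, and obtains a feasible constrained solution in [0, 1]^n with a bounded least-squares solver:

```python
import numpy as np
from scipy.optimize import lsq_linear

def pinv_via_svd(A, tol=1e-12):
    """Moore-Penrose pseudoinverse: invert the nonzero singular values
    of A = U S V^T and transpose the shape."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    cutoff = tol * max(s.max(), 1.0)
    s_inv = np.array([1.0 / v if v > cutoff else 0.0 for v in s])
    return Vt.T @ np.diag(s_inv) @ U.T

# Underdetermined satisfiable system: 1 equation, 3 unknowns in [0, 1].
A = np.array([[1.0, 2.0, -1.0]])
b = np.array([0.5])

x_ls = pinv_via_svd(A) @ b                     # minimum-norm solution
x_cls = lsq_linear(A, b, bounds=(0.0, 1.0)).x  # a feasible solution
```

In this toy instance the minimum-norm solution has a negative third coordinate, illustrating that the least-squares solution may fall outside the region of interest while the bounded solver returns a feasible point.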
Consider the simple example in which $x$ is an unknown quantity in the range $[0,1]$ to be estimated, and the error in the estimation is measured by the mean square, i.e., $(x - \hat{x})^2$. Obviously, any point in $[0,1]$ can be proposed as an estimate for $x$. However, without any further knowledge about $x$, one can select the center of $[0,1]$, i.e., $\hat{x} = \frac{1}{2}$, as an intuitive estimate. The rationale behind this selection is that the maximum error of the estimate $\frac{1}{2}$, i.e., $\max_{x \in [0,1]} (x - \frac{1}{2})^2 = \frac{1}{4}$, is minimal among all other estimates. In other words, the center minimizes the worst possible estimation error, and hence, it is optimal in the best-worst sense. As mentioned earlier, any element of $\mathcal{F}_{\mathbf{A},\mathbf{b}}$ is a feasible solution of $\mathbf{A}\mathbf{x} = \mathbf{b}$. This calls for a proper definition of the "center" of $\mathcal{F}_{\mathbf{A},\mathbf{b}}$ as the best-worst solution. This center is called the Chebyshev center, which is introduced in a general topological context as follows.
Definition 1.
(Chebyshev Center [17]) Let $\mathcal{Q}$ be a bounded subset of a metric space $(\mathcal{M}, d)$, where $d$ denotes the distance. A Chebyshev center of $\mathcal{Q}$ is the center of a minimal closed ball containing $\mathcal{Q}$, i.e., it is an element $c \in \mathcal{M}$ such that $\sup_{x \in \mathcal{Q}} d(x, c) = \inf_{y \in \mathcal{M}} \sup_{x \in \mathcal{Q}} d(x, y)$. The quantity $\inf_{y \in \mathcal{M}} \sup_{x \in \mathcal{Q}} d(x, y)$ is the Chebyshev radius of $\mathcal{Q}$.
In this paper, the metric space under consideration is $(\mathbb{R}^n, \|\cdot\|)$ for some positive integer $n$, and we have

$$\hat{\mathbf{x}}_{\mathrm{Cheb}} = \arg\min_{\hat{\mathbf{x}} \in \mathbb{R}^n} \max_{\mathbf{x} \in \mathcal{Q}} \|\mathbf{x} - \hat{\mathbf{x}}\|^2. \qquad (9)$$

For example, the Chebyshev center of the interval $[0,1]$ in $(\mathbb{R}, |\cdot|)$ is $\frac{1}{2}$, and the Chebyshev center of the unit ball in $(\mathbb{R}^n, \|\cdot\|)$ is the origin.$^{7}$
$^{7}$ Note that the Chebyshev center of the unit circle in the same metric is still the origin, but obviously it does not belong to the circle, as the circle is not convex.
In this paper, the subset of interest, i.e., $\mathcal{F}_{\mathbf{A},\mathbf{b}}$, is bounded, closed, and convex. In this context, the Chebyshev center of $\mathcal{F}_{\mathbf{A},\mathbf{b}}$ is unique and belongs to $\mathcal{F}_{\mathbf{A},\mathbf{b}}$. Hence, in the argmin in (9), $\mathbb{R}^n$ can be replaced with $\mathcal{F}_{\mathbf{A},\mathbf{b}}$. An example is provided in Figure 3.
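As a numerical sanity check, the Chebyshev center of a set given by finitely many points can be found by directly minimizing the maximum distance, which is a convex objective. The sketch below (a naive approach, not the relaxations used later in the paper) recovers the center of the unit square from its extreme points:

```python
import numpy as np
from scipy.optimize import minimize

def enclosing_ball_center(points):
    """Center and radius of the minimal enclosing ball of a finite
    point set, via direct minimization of the max-distance objective."""
    P = np.asarray(points, dtype=float)
    radius = lambda c: np.max(np.linalg.norm(P - c, axis=1))
    res = minimize(radius, P.mean(axis=0), method="Nelder-Mead",
                   options={"xatol": 1e-9, "fatol": 1e-9})
    return res.x, radius(res.x)

# The Chebyshev center of [0,1]^2, given by its extreme points, is (1/2, 1/2).
corners = [[0, 0], [0, 1], [1, 0], [1, 1]]
center, r = enclosing_ball_center(corners)
```

Nelder-Mead is used here only because the max-distance objective is nonsmooth; this brute-force approach does not scale, which is exactly why the paper turns to efficient approximations.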
Except for simple cases, computing the Chebyshev center is a computationally complex problem due to the non-convex quadratic inner maximization in (9). When the subset of interest can be written as the convex hull of a finite number of points, there are algorithms [22, 3] that can find the Chebyshev center. In this paper, the feasible solution space is a convex polytope with a finite number of extreme points (as shown in Figure 2); hence, one can apply these algorithms. However, it is important to note that these extreme points are not given a priori, and they need to be found in the first place from the equation $\mathbf{A}\mathbf{x} = \mathbf{b}$. Since the procedure of finding the extreme points is exponentially complex, it makes sense to seek approximations of the Chebyshev center that can be computed efficiently. Therefore, in this paper, instead of obtaining the exact Chebyshev center, we rely on its approximations. An approximation worth mentioning is given in [8] in the context of signal processing, and it is explained in the sequel. It is based on replacing the non-convex inner maximization in (9) by its semidefinite relaxation, and then solving the resulting convex-concave minimax problem. A clear explanation of this method, henceforth named Relaxed Chebyshev Center 1 (RCC1), is provided because it is used as one of the adversary's attack methods in this paper. Later, in Proposition 3, a second relaxation is proposed, which is denoted by RCC2.
The set in [8] is an intersection of ellipsoids, i.e.,
(10) 
where , and the optimization problem is given in (9). Defining , the equivalence holds
(11) 
where . By focusing on the right hand side (RHS) of (11) instead of its left hand side (LHS), we are now dealing with the maximization of a concave (linear) function in . However, the downside is that is not convex, in contrast to . Here is where the relaxation is done in [8], and the optimization is carried out over a relaxed version of , i.e.,
which is a convex set, and obviously contains the original one. As a result, RCC1 is the solution to the following minimax problem
(12) 
Since is bounded, and the objective in (12) is convex in and concave (linear) in , the order of minimization and maximization can be changed. Knowing that the minimum (over ) of the objective function occurs at , (12) reduces to
whose objective is concave and the constraints are linear matrix inequalities, and RCC1 is the corresponding part of the solution. Since the relaxed set contains the original one, the radius of the corresponding ball of RCC1 is an upper bound on the Chebyshev radius of the original set.
An explicit representation of is given in [8, Theorem III.1], which is restated here.
(13) 
where is an optimal solution of the following convex problem
(14) 
which can be cast as a semidefinite program (SDP) and solved by an SDP solver.
It is shown in [8] that similarly to the exact Chebyshev center, is also unique (due to strict convexity of the norm) and it belongs to , where the latter follows from the fact that for any , we have , which is due to the positive semidefiniteness of .
Finally, suppose that one of the constraints defining the set is a double-sided linear inequality of the form $l \leq \mathbf{a}^T\mathbf{x} \leq u$. We can proceed and write this constraint as two constraints, i.e., $\mathbf{a}^T\mathbf{x} \geq l$ and $\mathbf{a}^T\mathbf{x} \leq u$. However, it is shown in [8] that it is better (in the sense of a smaller minimax estimate) to write it in the quadratic form, i.e., $(\mathbf{a}^T\mathbf{x} - c)^2 \leq r^2$. Although the exact Chebyshev center of the set does not rely on its specific representation, RCC1 does, as it is the result of a relaxation of that representation. Hence, any constraint of the form $l \leq \mathbf{a}^T\mathbf{x} \leq u$ will be replaced by $(\mathbf{a}^T\mathbf{x} - c)^2 \leq r^2$, with $c = \frac{l+u}{2}$ and $r = \frac{u-l}{2}$.
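As a small illustration of this rewriting (with generic values; in the paper the double-sided constraints are the box constraints on the features), a constraint l <= a^T x <= u is equivalent to the single quadratic constraint (a^T x - c)^2 <= r^2 with c the midpoint and r the half-width of the interval:

```python
import numpy as np

def interval_to_quadratic(l, u):
    """Midpoint c and half-width r so that l <= t <= u iff (t - c)^2 <= r^2."""
    return (l + u) / 2.0, (u - l) / 2.0

a = np.array([1.0, -2.0])   # illustrative constraint direction
l, u = 0.0, 1.0
c, r = interval_to_quadratic(l, u)

def holds_linear(x):
    return l <= a @ x <= u

def holds_quadratic(x):
    return (a @ x - c) ** 2 <= r ** 2
```

The two forms accept exactly the same points; only the relaxation built on top of them differs.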
V. White-Box Setting
Let $X$ be a random variable distributed according to an unknown CDF $F_X$ with support $[0,1]$. The goal is to find an estimate $\hat{x}$ of $X$. First, we need the following lemma, which states that when there is no side information available to the estimator, there is no loss of optimality in restricting attention to the set of deterministic estimates.
Lemma 1.
Any randomized guess is outperformed by its statistical mean, and the performance improvement is equal to the variance of the random guess.
Proof.
Let $\hat{X}$ be a random guess distributed according to a fixed CDF $F_{\hat{X}}$, independently of $X$. We have

$$\mathbb{E}\big[(X - \hat{X})^2\big] = \mathbb{E}\big[(X - \mathbb{E}[\hat{X}] + \mathbb{E}[\hat{X}] - \hat{X})^2\big] = \mathbb{E}\big[(X - \mathbb{E}[\hat{X}])^2\big] + \mathrm{Var}(\hat{X}),$$

where the cross term vanishes since $\hat{X}$ is independent of $X$. Hence, any estimate $\hat{X}$ is outperformed by the new deterministic estimate $\mathbb{E}[\hat{X}]$, whose performance improvement is $\mathrm{Var}(\hat{X})$. ∎
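Lemma 1 is easy to verify numerically; for empirical means and variances the identity even holds exactly. A sketch with an arbitrary unknown x and a uniform random guess:

```python
import numpy as np

rng = np.random.default_rng(0)
x = 0.3                             # unknown quantity in [0, 1]
Z = rng.uniform(0.0, 1.0, 100_000)  # randomized guesses Z ~ U[0, 1]

mse_random = np.mean((x - Z) ** 2)  # error of the random guess
mse_mean = (x - Z.mean()) ** 2      # error of the deterministic guess E[Z]
improvement = mse_random - mse_mean # equals Var(Z), per Lemma 1
```

Replacing the random guess by its mean removes exactly the variance of the guess from the error.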
Since the underlying distribution of $X$ is unknown to the estimator, one conventional approach is to consider the best-worst estimator. In other words, the goal of the estimator is to minimize the maximum error, which can be cast as a minimax problem, i.e.,

$$\min_{\hat{x}} \max_{F_X} \mathbb{E}\big[(X - \hat{x})^2\big], \qquad (15)$$

where Lemma 1 is used in the minimization, i.e., instead of minimizing over randomized estimates, we are minimizing over deterministic ones. Since for any fixed $\hat{x}$, we have $\max_{F_X} \mathbb{E}[(X - \hat{x})^2] = \max_{x \in [0,1]} (x - \hat{x})^2$, the best-worst estimate is the solution to

$$\min_{\hat{x}} \max_{x \in [0,1]} (x - \hat{x})^2, \qquad (16)$$

which is the Chebyshev center of the interval $[0,1]$ in the space $(\mathbb{R}, |\cdot|)$, and it is equal to $\frac{1}{2}$. This implies that with the estimator being blind to the underlying distribution and any possible side information, the best-worst estimate is the Chebyshev center of the support of the random variable, here $[0,1]$.
As one step further, consider a $d$-dimensional random vector $\mathbf{X}$ distributed according to an unknown CDF. Although the estimator is still unaware of this CDF, it now has access to the matrix-vector pair $(\mathbf{A}, \mathbf{b})$ with $\mathbf{A}\mathbf{X} = \mathbf{b}$, and based on this side information, it gives an estimate $\hat{\mathbf{x}}$. This side information refines the prior belief of $[0,1]^d$ to $\mathcal{F}_{\mathbf{A},\mathbf{b}}$. Similarly to the previous discussion, the best-worst estimator gives the Chebyshev center of $\mathcal{F}_{\mathbf{A},\mathbf{b}}$. As mentioned before, obtaining the exact Chebyshev center of $\mathcal{F}_{\mathbf{A},\mathbf{b}}$ is computationally difficult; hence, we focus on its approximations. However, prior to the approximations, we start with simple heuristic estimates that bear an intuitive notion of centeredness.
The first scheme in estimating $\mathbf{X}$ is the naive estimate $\frac{1}{2}\mathbf{1}$, which is the Chebyshev center of $[0,1]^d$. We already know that when the only information we have about $\mathbf{X}$ is that it belongs to $[0,1]^d$, this estimate is optimal in the best-worst sense.
The adversary can perform better when the side information $(\mathbf{A}, \mathbf{b})$ is available. A second scheme can be built on top of the previous one as follows. The estimator finds a solution in the solution space $\mathcal{S}_{\mathbf{A},\mathbf{b}}$ that is closest to $\frac{1}{2}\mathbf{1}$, as shown in Figure 2. In this scheme, the estimate is given by

$$\hat{\mathbf{x}}_2 = \arg\min_{\mathbf{x} \in \mathcal{S}_{\mathbf{A},\mathbf{b}}} \left\|\mathbf{x} - \frac{1}{2}\mathbf{1}\right\|, \qquad (17)$$

whose explicit representation is provided in the following proposition.$^{8}$
$^{8}$ Note that $\hat{\mathbf{x}}_2$ may or may not belong to $[0,1]^n$.
Proposition 1.
We have

$$\hat{\mathbf{x}}_2 = \frac{1}{2}\mathbf{1} + \mathbf{A}^+\left(\mathbf{b} - \frac{1}{2}\mathbf{A}\mathbf{1}\right). \qquad (18)$$
Proof.
For any , we have for some . Hence,
(19)  
(20) 
where and . It is already known that the minimizer in (20) is , which results in
(21)  
(22)  
(23) 
where (21) to (23) are justified as follows. Let be an SVD of . From , we get and . Knowing that is a diagonal matrix with only 0 and 1 on its diagonal, we get , and therefore, , which results in (21). Noting that is a projector results in (22). Finally, by noting that , we get , which results in (23). ∎
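Proposition 1 can also be checked numerically: the closest point of the solution space to the naive center is obtained by correcting the center with the pseudoinverse, and the correction term is orthogonal to Null(A). The toy system below is an assumption for illustration:

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 0.8])
center = np.full(3, 0.5)  # Chebyshev center of [0, 1]^3

# Closest point of {x : Ax = b} to the center, as in Proposition 1.
x2 = center + np.linalg.pinv(A) @ (b - A @ center)
```

Moving from x2 in any null-space direction keeps the equation satisfied but only increases the distance to the center, which is exactly the optimality property being proved.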
Thus far, we have considered two simple schemes. In what follows, we investigate two approximations of the Chebyshev center of $\mathcal{F}_{\mathbf{A},\mathbf{b}}$. The exact Chebyshev center of $\mathcal{F}_{\mathbf{A},\mathbf{b}}$ is given by

$$\hat{\mathbf{x}}_{\mathrm{Cheb}} = \arg\min_{\hat{\mathbf{x}}} \max_{\mathbf{x} \in \mathcal{F}_{\mathbf{A},\mathbf{b}}} \|\mathbf{x} - \hat{\mathbf{x}}\|^2. \qquad (24)$$

Let $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$ be an SVD of $\mathbf{A}$, where the singular values are arranged in a non-increasing order. Let $r = \mathrm{rank}(\mathbf{A})$. Hence, $\mathrm{Null}(\mathbf{A})$ is the span of those right singular vectors that correspond to zero singular values, i.e., the last $n - r$ columns of $\mathbf{V}$. Define $\mathbf{V}_2 = [\mathbf{v}_{r+1}, \ldots, \mathbf{v}_n]$. The orthonormal columns of $\mathbf{V}_2$ can be regarded as a basis for $\mathrm{Null}(\mathbf{A})$. Hence, any vector $\mathbf{x} \in \mathcal{S}_{\mathbf{A},\mathbf{b}}$ can be written as $\mathbf{x} = \mathbf{x}_{\mathrm{LS}} + \mathbf{V}_2\mathbf{w}$, where $\mathbf{w} \in \mathbb{R}^{n-r}$. Noting that $\mathbf{V}_2$ has orthonormal columns, we have $\|\mathbf{x} - \mathbf{x}_{\mathrm{LS}}\| = \|\mathbf{w}\|$. Therefore, $\mathcal{F}_{\mathbf{A},\mathbf{b}}$ in (8) can be written as

$$\mathcal{F}_{\mathbf{A},\mathbf{b}} = \{\mathbf{x}_{\mathrm{LS}} + \mathbf{V}_2\mathbf{w} : \mathbf{w} \in \mathcal{W}_2\}, \quad \mathcal{W}_2 = \{\mathbf{w} \in \mathbb{R}^{n-r} : \mathbf{0} \preceq \mathbf{x}_{\mathrm{LS}} + \mathbf{V}_2\mathbf{w} \preceq \mathbf{1}\}. \qquad (25)$$

Therefore, the inner maximization in (24) can equivalently be carried out over $\mathbf{w}$:

$$\hat{\mathbf{x}}_{\mathrm{Cheb}} = \arg\min_{\hat{\mathbf{x}}} \max_{\mathbf{w} \in \mathcal{W}_2} \|\mathbf{x}_{\mathrm{LS}} + \mathbf{V}_2\mathbf{w} - \hat{\mathbf{x}}\|^2. \qquad (26)$$
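The parameterization above is straightforward to reproduce: the right singular vectors associated with zero singular values form V2, and every solution of Ax = b is x_LS plus a combination of the columns of V2 (toy A below is assumed):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0]])  # rank 1, so the nullity is 2
b = np.array([1.0])

U, s, Vt = np.linalg.svd(A)      # full SVD: Vt is 3 x 3
rank = int(np.sum(s > 1e-12))
V2 = Vt[rank:].T                 # orthonormal basis of Null(A)

x_ls = np.linalg.pinv(A) @ b     # minimum-norm solution
w = np.array([0.7, -0.4])        # any coordinates in R^{n-r}
x = x_ls + V2 @ w                # still solves Ax = b
```

Because V2 has orthonormal columns, the distance from x to x_LS equals the norm of w, which is what lets the search over the feasible set be recast as a search over w.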
Denoting the th row of and the th element of by and , respectively, the following proposition provides an approximation for the exact Chebyshev center in (24).
Proposition 2.
A relaxed Chebyshev center of is given by
(27) 
where ’s are obtained as in (14) with , , and . Furthermore, is unique and it belongs to the set of feasible solutions, i.e., .
Proof.
A second relaxation is provided in the following proposition.
Proposition 3.
A relaxed Chebyshev center of is given by
(28) 
where is the solution of
(29) 
Furthermore, is unique and it belongs to the set of feasible solutions, i.e., .
Proof.
The inner maximization in (24) is
(30) 
which is a maximization of a convex objective function. As discussed before, one way of relaxing this problem was studied in [8] where the relaxation was over the search space. Here, we propose to directly relax the objective function by making use of the boundedness of . In other words, since for any , , we have . Hence, we can write
(31)  
(32)  
(33)  
(34) 
where (31) follows from i) the boundedness of , and ii) the concavity (linearity) and convexity of the objective in and , respectively. (32) follows from the fact that knowing , is the minimizer in (31). The RCC2 estimate is the solution of (32). (33) follows from the equivalence given in (25), and denoting the maximizer of (33) by , we have . In (34), we have used the fact that and
Denoting the MSE of a certain estimate by , the following theorem provides a relationship between some of the estimates introduced thus far.
Theorem 1.
The following inequalities hold.
(35) 
Proof.
In order to prove the first inequality, we proceed as follows. The derivative of the objective of (29) with respect to is
(36) 
Since the objective in (29) is (strictly) concave in , by setting , we obtain as the maximizer. It is important to note that this is not the solution of (29), i.e., , in general, as it might not satisfy its constraints. Define . We have
(37)  
(38) 
where the equality follows from the definition of .
If satisfies the constraints of (29), then , and , otherwise, we have that is the point in that is closest to , and as a result, is the point in that is closest to . This is justified as follows.
which results in . Hence, we can write^{9}^{9}9This follows from the fact that if is a nonempty convex subset of and a convex and differentiable function, then we have if and only if . (39) can be obtained by replacing and with and , respectively, and noting that and .
(39) 
which results in the following inequality for