Linear regression and its extensions are among the most fundamental models in statistics and related fields. In the classical and most common setting, we are given $n$ samples with features $x_i \in \mathbb{R}^p$ and response $y_i \in \mathbb{R}$, where $i \in \{1, \ldots, n\}$ denotes the sample index. We assume that the features and responses are perfectly matched, i.e., $x_i$ and $y_i$ correspond to the same record or sample. However, in important applications (for example, due to errors in the data merging process), the correspondence between the responses and features may be broken [13, 14, 18]. This erroneous correspondence needs to be adjusted before performing downstream statistical analysis. Thus motivated, we consider a mismatched linear model with responses $y \in \mathbb{R}^n$ and covariates $X \in \mathbb{R}^{n \times p}$ satisfying
\[
y = P^* X \beta^* + \epsilon, \tag{1.1}
\]
where $\beta^* \in \mathbb{R}^p$ are the true regression coefficients, $\epsilon \in \mathbb{R}^n$ is the noise term, and $P^*$ is an unknown $n \times n$ permutation matrix. We consider the classical setting where $n \ge p$ and $X$ has full rank; and we seek to estimate both $\beta^*$ and $P^*$ based on the observations $(y, X)$. Note that the main computational difficulty in this task arises from the unknown permutation.
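To make the setup concrete, the following sketch simulates data from model (1.1). The dimensions, the number of mismatched indices, and the noise level are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 100, 5, 10                 # illustrative: n samples, p features, r mismatches

X = rng.standard_normal((n, p))      # covariate matrix with full column rank (n >= p)
beta_star = rng.standard_normal(p)   # true regression coefficients

# Build a permutation that moves only a small subset of indices:
# shuffle a random subset of size r, so at most r records are mismatched.
perm = np.arange(n)
idx = rng.choice(n, size=r, replace=False)
perm[idx] = rng.permutation(idx)

eps = 0.1 * rng.standard_normal(n)   # noise term
y = (X @ beta_star)[perm] + eps      # responses with broken correspondence
```

The estimation task is then to recover both `beta_star` and `perm` from `(y, X)` alone.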
Linear regression with mismatched/permuted data (model (1.1)) has a long history in statistics dating back to the 1960s [13, 5, 6]. In addition to the aforementioned application in record linkage, similar problems also appear in robotics, multitarget tracking, and signal processing, among others. Recently, this problem has garnered significant attention from the statistics and machine learning communities: a series of recent works [8, 22, 14, 15, 1, 2, 11, 9, 17, 24, 7, 21, 18] have studied the statistical and computational aspects of this model. To learn the coefficients $\beta^*$ and the matrix $P^*$, one can consider the following natural optimization problem:
\[
\min_{P \in \Pi_n,\; \beta \in \mathbb{R}^p} \;\; \| y - P X \beta \|_2^2, \tag{1.2}
\]
where $\Pi_n$ is the set of $n \times n$ permutation matrices. Solving problem (1.2) is difficult as there are exponentially many choices for $P$. Given $P$, however, it is easy to estimate $\beta$ via least squares. It has been shown that in the noiseless setting ($\epsilon = 0$), a solution of problem (1.2) equals $(P^*, \beta^*)$ with probability one, provided $n$ is sufficiently large relative to $p$ and the entries of $X$ are independent and identically distributed (iid) as per a distribution that is absolutely continuous with respect to the Lebesgue measure. [15, 11] study the estimation of $\beta^*$ under the noisy setting. It has also been shown that Problem (1.2) is NP-hard if $p \ge cn$ for some constant $c > 0$. A polynomial-time approximation algorithm is available for a fixed $p$; however, this algorithm does not appear to be practically efficient. Branch-and-bound methods have been proposed that can solve small problems within a reasonable time; one such method, based on a concave minimization formulation, is reported to take 40 minutes to solve instances of moderate size. Another approach uses tools from algebraic geometry and can handle small problems, but its cost increases exponentially with $p$; it is exact in the noiseless case but approximate in the noisy case. Several heuristics have been proposed for (1.2): examples include alternating minimization [9, 24, 2]. As far as we can tell, these methods are sensitive to initialization and have limited theoretical guarantees.
In many applications, only a small fraction of the samples are mismatched: if $r$ denotes the number of mismatches between $P^*$ and the identity matrix, then $r$ is much smaller than $n$. In this paper, we focus on such sparse permutation matrices. This motivates a constrained version of (1.2), given by
\[
\min_{P \in \Pi_n,\; \beta \in \mathbb{R}^p} \;\; \| y - P X \beta \|_2^2 \quad \text{s.t.} \quad \mathrm{dist}(P, I_n) \le R, \tag{1.3}
\]
where the constraint limits the number of mismatches between $P$ and the identity permutation to be at most $R$. Above, $R$ is taken such that $R \ge r$ (further details on the choice of $R$ can be found in Sections 3 and 5). Note that as long as $R \ge r$, the true parameters $(P^*, \beta^*)$ lead to a feasible solution of (1.3). In the special case $R = n$, the constraint is redundant, and Problem (1.3) is equivalent to problem (1.2). Interesting convex optimization approaches based on robust regression have been proposed to approximately solve (1.3); the authors focus on obtaining an estimate of $\beta^*$. Similar ideas have been extended to problems with multiple responses.
Problem (1.3) can be formulated as a mixed-integer program (MIP) with binary variables (to model the unknown permutation matrix). Solving this MIP with off-the-shelf solvers (e.g., Gurobi) becomes computationally expensive for even small values of $n$. To the best of our knowledge, computationally practical algorithms with theoretical guarantees that can optimally solve the original problem (1.3) (under suitable assumptions on the problem data) are not available. Addressing this gap is the main focus of this paper: we propose and study a novel greedy local search method, drawing inspiration from a local search method presented earlier in the context of a different problem, namely high-dimensional sparse regression. Loosely speaking, at every step our algorithm performs a greedy swap (or transposition) in an attempt to improve the cost function. Based on our numerical experiments, this algorithm is typically efficient in practice. We also propose an approximate version of the greedy swap procedure that scales to much larger problem instances. We establish theoretical guarantees on the convergence of the proposed method under suitable assumptions on the problem data. Under a suitable scaling of the number of mismatched pairs relative to the number of samples and features, and certain assumptions on the covariates and noise, our local search method converges to an objective value that is at most a constant multiple of the norm of the underlying noise term. From a statistical viewpoint, this is the best objective value one can hope to obtain (due to the noise in the problem). Interestingly, in the special case $\epsilon = 0$ (i.e., the noiseless setting), our algorithm converges to an optimal solution of (1.3) at a linear rate (the extended abstract, a shorter version of this manuscript, studies the noiseless setting).
Notation and preliminaries:
For a vector $a$, we let $\|a\|_2$ denote the Euclidean norm, $\|a\|_1$ the $\ell_1$-norm, and $\|a\|_0$ the $\ell_0$-pseudo-norm (i.e., the number of nonzeros) of $a$. We let $\|\cdot\|_{\mathrm{op}}$ denote the operator norm for matrices. Let $e_1, \ldots, e_n$ be the natural orthogonal basis of $\mathbb{R}^n$. For a finite set $S$, we let $|S|$ denote its cardinality. For any permutation matrix $P \in \Pi_n$, let $\pi_P$ be the corresponding permutation of $\{1, \ldots, n\}$; that is, $P_{ij} = 1$ if and only if $\pi_P(j) = i$, if and only if $P e_j = e_i$. We define the distance between two permutation matrices $P_1$ and $P_2$ as
\[
\mathrm{dist}(P_1, P_2) = \left| \{ i : \pi_{P_1}(i) \neq \pi_{P_2}(i) \} \right|.
\]
Recall that we assume $\mathrm{dist}(P^*, I_n) = r \le R$. For a given permutation matrix $\bar{P}$, define the $R$-neighbourhood of $\bar{P}$ as
\[
\mathcal{N}_R(\bar{P}) = \{ P \in \Pi_n : \mathrm{dist}(P, \bar{P}) \le R \}.
\]
It is easy to check that $\bar{P}$ belongs to its own $R$-neighbourhood, and for any $R \ge 2$, the $R$-neighbourhood contains more than one element (two distinct permutations differ in at least two positions). For any permutation matrix $P$, we define its support as $\mathrm{supp}(P) = \{ i : \pi_P(i) \neq i \}$. For a real symmetric matrix $A$, let $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ denote the largest and smallest eigenvalues of $A$, respectively.
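In code, these notions are straightforward. A small sketch, representing a permutation matrix $P$ by the permutation $\pi_P$ it induces:

```python
import numpy as np

def dist(pi1, pi2):
    """Number of positions where the two permutations disagree."""
    return int(np.sum(pi1 != pi2))

def support(pi):
    """Indices moved by the permutation, so dist(pi, identity) == len(support(pi))."""
    return {i for i in range(len(pi)) if pi[i] != i}

identity = np.arange(5)
pi = np.array([1, 0, 2, 3, 4])   # a single transposition
assert dist(pi, identity) == 2   # distinct permutations differ in >= 2 positions
assert support(pi) == {0, 1}
```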
For two positive scalar sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n \gtrsim b_n$, or equivalently $b_n \lesssim a_n$, if there exists a universal constant $c > 0$ such that $a_n \ge c\, b_n$. We write $a_n \lesssim b_n$, or equivalently $b_n \gtrsim a_n$, if there exists a universal constant $C > 0$ such that $a_n \le C\, b_n$. We write $a_n \asymp b_n$ if both $a_n \gtrsim b_n$ and $a_n \lesssim b_n$ hold.
2 A local search method
where $H = X (X^\top X)^{-1} X^\top$ is the projection matrix onto the column space of $X$. To simplify notation, denote $Q = I_n - H$; then (2.1) is equivalent to
\[
\min_{P \in \Pi_n : \, \mathrm{dist}(P, I_n) \le R} \;\; \| Q P^\top y \|_2^2. \tag{2.2}
\]
For each pair $(i, j)$ with $i \neq j$, the difference vector arising from the corresponding swap has at most two nonzero entries. Since the projection of $y$ is pre-computed, computing the first term in (2.4) costs $O(1)$ operations. As we retain a copy of the projection matrix in memory, computing the second term in (2.4) also costs $O(1)$ operations. Therefore, computing (2.3) requires $O(n^2)$ operations, as there are at most $O(n^2)$-many possible swaps to search over. The per-iteration cost is quite reasonable for medium-sized examples with $n$ being a few hundred to a few thousand, but might be expensive for larger examples. In Section 4, we propose a fast method to find an approximate solution of (2.3) that scales to instances with much larger $n$ in a few minutes (see Section 5 for numerical findings).
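The per-swap accounting can be made explicit. Writing $Q = I_n - H$ for the residual projection and $z = P^\top y$ (our own shorthand), a swap of positions $i$ and $j$ changes $z$ by $t(e_i - e_j)$ with $t = z_j - z_i$; since $Q$ is a symmetric projection ($Q^\top Q = Q$), the change in $\|Qz\|_2^2$ is $2t[(Qz)_i - (Qz)_j] + t^2 (Q_{ii} + Q_{jj} - 2Q_{ij})$, computable in $O(1)$ once $Q$ and $Qz$ are stored. A sketch:

```python
import numpy as np

def swap_delta(Q, Qz, z, i, j):
    """O(1) change in ||Q z||^2 when entries i and j of z are swapped.

    Q is the precomputed projection I - H onto the orthogonal complement of
    the column space of X, and Qz = Q @ z is retained in memory.  The formula
    uses the fact that Q is a symmetric projection, so Q^T Q = Q."""
    t = z[j] - z[i]
    return 2 * t * (Qz[i] - Qz[j]) + t * t * (Q[i, i] + Q[j, j] - 2 * Q[i, j])
```

Scanning all $O(n^2)$ candidate swaps with this formula then costs $O(n^2)$ per iteration, matching the accounting above.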
3 Theoretical guarantees for Algorithm 1
Here we present theoretical guarantees for Algorithm 1. The main assumptions and conclusions appear in Section 3.1. Section 3.2 presents the proofs of the main theorems. Unless otherwise mentioned, in Sections 3.1 and 3.2 we assume that the problem data (i.e., the covariates and the noise) is deterministic. Section 3.3 discusses conditions on the distribution of the features and the noise term under which the main assumptions hold true with high probability.
3.1 Main results
We state and prove the main theorems on the convergence of Algorithm 1. For any permutation matrix $P$, define
We first state the assumptions useful for our technical analysis.
(1). There exist constants such that
(2) Set for some constant .
(3) There is a constant such that , and
(4) There is a constant satisfying such that
Note that the lower bound in Assumption 3.1 (1) states that the $x$-value for a record that has been mismatched is not too close to its original value (before the mismatch). Assumption 3.1 (2) states that $R$ is set to a constant multiple of $r$. This constant can be large, which appears to be an artifact of our proof techniques; our numerical experience suggests that it can be much smaller in practice. Assumption 3.1 (3) is a restricted eigenvalue (RE)-type condition stating that multiplying any sufficiently sparse vector by the relevant matrix results in a vector with small norm. Section 3.3 discusses conditions on the distribution of the rows of $X$ under which Assumption 3.1 (3) holds true with high probability; note that the assumption implicitly requires the sample size to be sufficiently large relative to the sparsity level. Assumption 3.1 (4) limits the amount of noise in the problem. Section 3.3 presents conditions on the error distribution which ensure that Assumption 3.1 (4) holds true with high probability.
Assumption 3.1 (3) plays an important role in our analysis. In particular, it allows us to approximate the objective function in (2.2) with one that is easier to analyze. To provide some intuition, writing $Q = I_n - H$ for the projection onto the orthogonal complement of the column space of $X$, we have $Q P^\top y = Q P^\top P^* X \beta^* + Q P^\top \epsilon$; noting that $\| Q P^\top \epsilon \|_2 \le \| \epsilon \|_2$, and assuming that the noise is small, we have:
Intuitively, the term on the right-hand side is the approximate objective that we analyze in our theory. Lemma 3.2 presents a one-step decrease property on the approximate objective function.
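Writing $Q = I_n - H$ for the residual projection (our own shorthand; $H$ is the projection onto the column space of $X$), one way to spell out this approximation is the following sketch:

```latex
\| Q P^\top y \|_2^2
  = \| Q P^\top P^* X \beta^* + Q P^\top \epsilon \|_2^2
  \approx \| Q P^\top P^* X \beta^* \|_2^2,
\qquad \text{since } \| Q P^\top \epsilon \|_2 \le \| \epsilon \|_2
\text{ and } \epsilon \text{ is assumed small.}
```

The right-hand side depends on $P$ only through how far $P^\top P^*$ is from a permutation fixing the signal, which is what makes it amenable to the swap-by-swap analysis.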
(One-step decrease) Given any and , there exists a permutation matrix such that , and
If in addition for some , then
The main theorems (Theorems 3.3 and 3.5 below) make use of Lemma 3.2 and formalize the intuition conveyed in (3.3). We first present Theorem 3.3 that states some results pertaining to Algorithm 1 under Assumption 3.1.
Therefore, in the limit of many iterations we obtain the stated upper bound. However, for a finite iteration counter, it is not clear from this upper bound whether the bound can be improved; in addition, Theorem 3.3 does not provide a convergence rate for Algorithm 1. Our next result, Theorem 3.5, answers these questions under an additional assumption, stated below.
Let and be parameters appearing in Assumption 3.1.
(1). Suppose .
(2). There is a constant such that , and .
In light of the discussion following Assumption 3.1, Assumption 3.4 (1) places a stricter condition on the sample size, stronger than the condition needed in Assumption 3.1.
Assumption 3.4 (2) requires a lower bound on the signal strength; this can be equivalently viewed as an upper bound on the noise level, in addition to the upper bound appearing in Assumption 3.1 (4). Section 3.3 provides a sufficient condition for Assumption 3.4 (2) to hold. In particular, in the noiseless case ($\epsilon = 0$), Assumption 3.1 (4) and Assumption 3.4 (2) hold trivially. We state our main convergence result.
In the special (noiseless) setting $\epsilon = 0$, Theorem 3.5 proves that the sequence of objective values generated by Algorithm 1 converges to zero (the optimal objective value) at a linear rate. The parameter governing the linear rate of convergence depends upon the search width. Following the discussion after Assumption 3.4, the sample-size requirement is more stringent than that needed for the model to be identifiable in the noiseless setting; in particular, in certain regimes the number of mismatched pairs needs to be bounded by a constant. Numerical evidence presented in Section 5 (for the noiseless case) appears to suggest that the sample size needed to recover $P^*$ is smaller than what is suggested by our theory.
In the noisy case (i.e., $\epsilon \neq 0$), (3.7) provides an upper bound on the objective value consisting of two terms. The first term converges to zero at a linear rate, similar to the noiseless case. The second term is a constant multiple of the squared norm $\|\epsilon\|_2^2$ of the unavoidable noise term (recall that the objective value at $(P^*, \beta^*)$ is at most $\|\epsilon\|_2^2$). In other words, Algorithm 1 finds a solution whose objective value is at most a constant multiple of the objective value at the true permutation $P^*$.
The proof of Lemma 3.6 is presented in Section A.3. As mentioned earlier, our analysis makes use of the one-step decrease property in Lemma 3.2. Note, however, that if the permutation matrix at the current iteration is on the boundary of the search region, it is not clear whether the permutation found by Lemma 3.2 lies within the search region. Lemma 3.6 helps address this issue (see the proof of Theorem 3.5 below for details).
3.2.1 Proof of Theorem 3.3
Part 1) We show this result by contradiction. Suppose that there exists a such that . Let be the first iteration () such that , i.e.,
Let but , then by Assumption 3.1 (1), we have
By Lemma 3.6, we have for all . As a result,
Since we have
This is a contradiction, so such an iteration counter does not exist; and for all , we have .
Part 2) Let be the iteration upon termination, i.e., . Recalling the definition given in the notation section, we have
By Lemma 3.2, there exists a permutation matrix such that
We make use of the following claim whose proof is in Section A.6:
Let . Recall that , so we have
This is equivalent to
On the other hand, from (3.10) we have
Hence we have
Because has only two nonzero coordinates, by Assumption 3.1 (1) and (4) we have
By Assumption 3.1 (3), (4) and (1) we have
Using and , to bound the right hand term in the above display, we get:
The proof follows by a rearrangement of the above:
3.2.2 Proof of Theorem 3.5
Let . Because , we have ; and for any :
Hence, by Lemma 3.2, there exists a permutation matrix such that , and
As a result,
Recall that (1.1) leads to , so we have
Let and , then (3.18) leads to:
where, to arrive at the second inequality, we drop the term . We now make use of the following claim whose proof is in Section A.7:
On the other hand, by the Cauchy-Schwarz inequality,