# Linear regression with partially mismatched data: local search with theoretical guarantees

Linear regression is a fundamental modeling tool in statistics and related fields. In this paper, we study an important variant of linear regression in which the predictor-response pairs are partially mismatched. We use an optimization formulation to simultaneously learn the underlying regression coefficients and the permutation corresponding to the mismatches. The combinatorial structure of the problem leads to computational challenges. We propose and study a simple greedy local search algorithm for this optimization problem that enjoys strong theoretical guarantees and appealing computational performance. We prove that under a suitable scaling of the number of mismatched pairs compared to the number of samples and features, and certain assumptions on problem data; our local search algorithm converges to a nearly-optimal solution at a linear rate. In particular, in the noiseless case, our algorithm converges to the global optimal solution with a linear convergence rate. We also propose an approximate local search step that allows us to scale our approach to much larger instances. We conduct numerical experiments to gather further insights into our theoretical results and show promising performance gains compared to existing approaches.

## Authors

• 37 publications
• 3 publications
10/15/2012

### Local optima networks and the performance of iterated local search

Local Optima Networks (LONs) have been recently proposed as an alternati...
10/29/2020

### A Local Search Framework for Experimental Design

We present a local search framework to design and analyze both combinato...
06/29/2020

### Optimization Landscape of Tucker Decomposition

Tucker decomposition is a popular technique for many data analysis and m...
01/16/2014

### Efficient Multi-Start Strategies for Local Search Algorithms

Local search algorithms applied to optimization problems often suffer fr...
09/22/2017

### EB-GLS: An Improved Guided Local Search Based on the Big Valley Structure

Local search is a basic building block in memetic algorithms. Guided Loc...
12/07/2016

### Re-identification of Humans in Crowds using Personal, Social and Environmental Constraints

This paper addresses the problem of human re-identification across non-o...
01/10/2020

### Algorithms for Optimizing Fleet Staging of Air Ambulances

In a disaster situation, air ambulance rapid response will often be the ...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Linear regression and its extensions are among the most fundamental models in statistics and related fields. In the classical and most common setting, we are given samples with features and response , where denotes the sample indices. We assume that the features and responses are perfectly matched i.e., and correspond to the same record or sample. However, in important applications (for example, due to errors in the data merging process), the correspondence between the response and features may be broken [13, 14, 18]. This erroneous correspondence needs to be adjusted before performing downstream statistical analysis. Thus motivated, we consider a mismatched linear model with responses and covariates satisfying

 P∗y=Xβ∗+ϵ (1.1)

where are the true regression coefficients, is the noise term, and is an unknown permutation matrix. We consider the classical setting where and

has full rank; and seek to estimate both

and based on the observations . Note that the main computational difficulty in this task arises from the unknown permutation.

Linear regression with mismatched/permuted data (model (1.1)) has a long history in statistics dating back to 1960s [13, 5, 6]. In addition to the aforementioned application in record linkage, similar problems also appear in robotics [20], multitarget tracking [4] and signal processing [3]

, among others. Recently, this problem has garnered significant attention from the statistics and machine learning communities. A series of recent works

[8, 22, 14, 15, 1, 2, 11, 9, 17, 24, 7, 21, 18] have studied the statistical and computational aspects of this model. To learn the coefficients and the matrix , one can consider the following natural optimization problem:

 minβ,P ∥Py−Xβ∥2   s.t.   P∈Πn (1.2)

where is the set of permutation matrices. Solving problem (1.2) is difficult as there are exponentially many choices for . Given however, it is easy to estimate via least squares. [22] shows that in the noiseless setting (), a solution of problem (1.2) equals

with probability one if

and the entries of are independent and identically distributed (iid) as per a distribution that is absolutely continuous with respect to the Lebesgue measure. [15, 11] studies the estimation of under the noisy setting. It is shown in [15] that Problem (1.2) is NP-hard if for some constant . A polynomial-time approximation algorithm appears in [11] for a fixed . However, as noted in [11], this algorithm does not appear to be practically efficient. [8] propose a branch-and-bound method, that can solve small problems with (within a reasonable time). [16] propose a branch-and-bound method for a concave minimization formulation, which can solve problem (1.2) with and (the authors report a runtime of 40 minutes to solve instances with and ). [21] propose an approach using tools from algebraic geometry, which can handle problems with and —the cost of this method increases exponentially with . This approach is exact for the noiseless case but approximate for the noisy case (

). Several heuristics have been proposed for (

1.2): Examples include, alternating minimization [9, 24][2]—as far as we can tell, these methods are sensitive to initialization, and have limited theoretical guarantees.

As discussed in [19, 18], in many applications, a small fraction of the samples are mismatched — that is, the permutation is sparse. In other words, if denotes the number of mismatches between

and the identity matrix, then

is much smaller than . In this paper, we focus on such sparse permutation matrices. This motivates a constrained version of (1.2), given by

 minβ,P ∥Py−Xβ∥2   s.t.   P∈Πn, dist(P,In)≤R (1.3)

where, the constraint limits the number of mismatches between and the identity permutation to be at most . Above, is taken such that (Further details on the choice of can be found in Sections 3 and 5). Note that as long as , the true parameters lead to a feasible solution to (1.3). In the special case when , the constraint is redundant, and Problem (1.3) is equivalent to problem (1.2). Interesting convex optimization approaches based on robust regression have been proposed in [19] to approximately solve (1.3) when . The authors focus on obtaining an estimate of . Similar ideas have been extended to consider problems with multiple responses [18].

Problem (1.3) can be formulated as a mixed-integer program (MIP) with binary variables (to model the unknown permutation matrix). Solving this MIP with off-the-shelf MIP solvers (e.g., Gurobi) becomes computationally expensive for even a small value of (e.g. ). To the best of our knowledge, we are not aware of computationally practical algorithms with theoretical guarantees that can optimally solve the original problem (1.3) (under suitable assumptions on the problem data). Addressing this gap is the main focus of this paper: We propose and study a novel greedy local search method222We draw inspiration from the local search method presented in [10] in the context of a different problem: high dimensional sparse regression. for Problem (1.3). Loosely speaking, our algorithm at every step performs a greedy swap (or transposition) in an attempt to improve the cost function. This algorithm is typically efficient in practice based on our numerical experiments. We also propose an approximate version of the greedy swap procedure that scales to much larger problem instances. We establish theoretical guarantees on the convergence of the proposed method under suitable assumptions on the problem data. Under a suitable scaling of the number of mismatched pairs compared to the number of samples and features, and certain assumptions on the covariates and noise; our local search method converges to an objective value that is at most a constant multiple of the norm of the underlying noise term. From a statistical viewpoint, this is the best objective value that one can hope to obtain (due to the noise in the problem). Interestingly, in the special case of (i.e., the noiseless setting), our algorithm converges to an optimal solution of (1.3) with a linear rate333The extended abstract [12] which is a shorter version of this manuscript, studies the noiseless setting..

Notation and preliminaries:

For a vector

, we let denote the Euclidean norm, the -norm and the -pseudo-norm (i.e., number of nonzeros) of . We let denote the operator norm for matrices. Let be the natural orthogonal basis of . For a finite set , we let denote its cardinality. For any permutation matrix , let be the corresponding permutation of , that is, if and only if if and only if . We define the distance between two permutation matrices and as

 dist(P,Q)=#{i∈[n]:πP(i)≠πQ(i)}. (1.4)

Recall that we assume . For a given permutation matrix , define the -neighbourhood of as

 Nm(P):={Q∈Πn: dist(P,Q)≤m}. (1.5)

It is easy to check that , and for any , contains more than one element. For any permutation matrix , we define its support as: For a real symmetric matrix , let and

denote the largest and smallest eigenvalues of

, respectively.

For two positive scalar sequences , we write or equivalently, , if there exists a universal constant such that . We write or equivalently, , if there exists a universal constant such that . We write if both and hold.

## 2 A local search method

Here we present our local search method for (1.3). For any fixed , by minimizing the objective function in (1.3) with respect to , we have an equivalent formulation

 minP  ∥Py−HPy∥2  s.t.  P∈Πn, dist(P,In)≤R , (2.1)

where is the projection matrix onto the columns of . To simplify notation, denote , then (2.1) is equivalent to

 minP  ∥˜HPy∥2  s.t. P∈Πn, dist(P,In)≤R . (2.2)

Our local search approach for (2.2) is summarized in Algorithm 1.

At iteration , Algorithm 1 finds a swap (within a distance of from ) that leads to the smallest objective value. To see the computational cost of (2.3), note that:

 ∥˜HPy∥2=∥˜H(P−P(k))y+˜HP(k)y∥2 (2.4) = ∥˜H(P−P(k))y∥2+2⟨(P−P(k))y,˜HP(k)y⟩+∥˜HP(k)y∥2 .

For each , with , the vector has at most two nonzero entries. Since we pre-compute , computing the first term in (2.4) costs operations. As we retain a copy of in memory, computing the second term in (2.4) also costs operations. Therefore, computing (2.3) requires operations, as there are at most -many possible swaps to search over. The per-iteration cost is quite reasonable for medium-sized examples with being a few hundred to a few thousand, but might be expensive for larger examples. In Section 4, we propose a fast method to find an approximate solution of (2.3) that scales to instances with in a few minutes (see Section 5 for numerical findings).

## 3 Theoretical guarantees for Algorithm 1

Here we present theoretical guarantees for Algorithm 1. The main assumptions and conclusions appear in Section 3.1. Section 3.2 presents the proofs of the main theorems. Unless otherwise mentioned, in Sections 3.1 and 3.2, we assume that the problem data (i.e., ) is deterministic. Section 3.3 discusses conditions on the distribution of the features and the noise term, under which the main assumptions hold true with high probability.

### 3.1 Main results

We state and prove the main theorems on the convergence of Algorithm 1. For any , define

 Bm:={w∈Rn: ∥w∥0≤m}. (3.1)

We first state the assumptions useful for our technical analysis.

###### Assumption 3.1

(1). There exist constants such that

 maxi,j∈[n]|yi−yj|≤U,   and   |(P∗y)i−yi|≥L     ∀i∈supp(P∗) .

(2) Set for some constant .
(3) There is a constant such that , and

 ∥Hu∥2≤ρn∥u∥2  ∀u∈B4,  and  ∥Hu∥2≤Rρn∥u∥2  ∀u∈B2R . (3.2)

(4) There is a constant satisfying such that

 ∥ϵ∥∞≤¯σ,  ∥Hϵ∥≤√d¯σ .

Note that the lower bound in Assumption 3.1 (1) states that the -value for a record that has been mismatched is not too close to its original value (before mismatch). Assumption 3.1 (2) states that is set to a constant multiple of . This constant can be large (), and appears to be an artifact of our proof techniques. Our numerical experience suggests that this constant can be much smaller in practice. Assumption 3.1 (3) is a restricted eigenvalue (RE)-type condition [23] stating that: a multiplication of any -sparse vector by will result in a vector with small norm (in the case ). Section 3.3 discusses conditions on the distribution of the rows of under which Assumption 3.1 (3) holds true with high probability. Note that if , then for the assumption to hold true, we require . Assumption 3.1 (4) limits the amount of noise in the problem. Section 3.3 presents conditions on the error distribution which ensures Assumption 3.1 (4) holds true with high probability.

Assumption 3.1 (3) plays an important role in our analysis. In particular, this allows us to approximate the objective function in (2.2) with one that is easier to analyze. To provide some intuition, we write —noting that , and assuming that the noise is small, we have:

 ∥˜HP(k)y∥2≈∥˜H(P(k)y−P∗y)∥≈∥P(k)y−P∗y∥2. (3.3)

Intuitively, the term on the right-hand side is the approximate objective that we analyze in our theory. Lemma 3.2 presents a one-step decrease property on the approximate objective function.

###### Lemma 3.2

(One-step decrease) Given any and , there exists a permutation matrix such that , and

 ∥Py−P∗y∥2−∥˜Py−P∗y∥2≥(1/2)∥Py−P∗y∥2∞ . (3.4)

If in addition for some , then

 ∥˜Py−P∗y∥2≤(1−1/(2m))∥Py−P∗y∥2 . (3.5)

The main theorems (Theorems 3.3 and 3.5 below) make use of Lemma 3.2 and formalize the intuition conveyed in (3.3). We first present Theorem 3.3 that states some results pertaining to Algorithm 1 under Assumption 3.1.

###### Theorem 3.3

Suppose Assumption 3.1 holds. Let be the permutation matrices generated by Algorithm 1. Then
1) For all , it holds .
2) Let be the solution obtained from Algorithm 1 upon termination (i.e., a local search step from cannot further decrease the objective value). Then

 ∥^Py−P∗y∥2∞≤64U2R2ρ2n+16¯σU+8√2dρn¯σU . (3.6)

We make a few remarks regarding (3.6). In the special case , we can take , and the conclusion (3.6) reduces to

 ∥^Py−P∗y∥2∞/U2≤64R2ρ2n=O(r2d2log2(n)/n2) .

Therefore, if , then we have . However, for a finite , it is not clear from this upper bound if In addition, Theorem 3.3 does not provide a convergence rate of Algorithm 1. Our next result Theorem 3.5, answers the above questions, under an additional assumption, stated below.

###### Assumption 3.4

Let and be parameters appearing in Assumption 3.1.
(1). Suppose .
(2). There is a constant such that , and .

In light of the discussion following Assumption 3.1, Assumption 3.4 (1) places a stricter condition on the size of via the requirement . If , then we would need , which is stronger than the condition needed in Assumption 3.1.

Assumption 3.4 (2) requires a lower bound on – this can be equivalently viewed as an upper bound on , in addition to the upper bound appearing in Assumption 3.1 (4). Section 3.3 provides a sufficient condition for Assumption 3.4 (2) to hold. In particular, in the noiseless case (), Assumption 3.1 (4) and Assumption 3.4 (2) hold with . We state our main convergence result.

###### Theorem 3.5

Suppose Assumptions 3.1 and 3.4 hold with being an even number. Let be the permutation matrices generated by Algorithm 1. Then for any , we have

 (3.7)

In the special (noiseless) setting when , Theorem 3.5 proves that the sequence of objective values generated by Algorithm 1 converges to zero (the optimal objective value) at a linear rate. The parameter for the linear rate of convergence depends upon the search width . Following the discussion after Assumption 3.4, the sample-size requirement is more stringent than that needed in order for the model to be identifiable ([22] in the noiseless setting. In particular, when , the number of mismatched pairs needs to be bounded by a constant. Numerical evidence presented in Section 5 (for the noiseless case) appears to suggest that the sample size needed to recover is smaller than what is suggested by our theory.

In the noisy case (i.e. ), (3.7) provides an upper bound on the objective value consisting of two terms. The first term converges to with a linear rate similar to the noiseless case. The second term is a constant multiple of the squared norm of the unavoidable noise term444Recall that the objective value at is .: . In other words, Algorithm 1 finds a solution whose objective value is at most a constant multiple of the objective value at the true permutation .

### 3.2 Proofs of Theorem 3.3 and Theorem 3.5

In this section, we present the proofs of Theorems 3.3 and 3.5. We first present a technical result used in our proofs.

###### Lemma 3.6

Suppose Assumption 3.1 holds. Let be the permutation matrices generated by Algorithm 1. Suppose for some . Suppose at least one of the two conditions holds: (i) ; or (ii) , and for all . Then for all , we have

 ∥P(t+1)y−P∗y∥2−∥P(t)y−P∗y∥2≤−L2/5 . (3.8)

The proof of Lemma 3.6 is presented in Section A.3. As mentioned earlier, our analysis makes use of the one-step decrease condition in Lemma 3.2. Note however, if the permutation matrix at the current iteration, denoted by , is on the boundary, i.e. , it is not clear whether the permutation found by Lemma 3.2 is within the search region . Lemma 3.6 helps address this issue (See the proof of Theorem 3.5 below for details).

#### 3.2.1 Proof of Theorem 3.3

Part 1) We show this result by contradiction. Suppose that there exists a such that . Let be the first iteration () such that , i.e.,

 supp(P∗)⊈supp(P(T))   and   supp(P∗)⊆supp(P(k))  ∀ R/2≤k≤T−1 .

Let but , then by Assumption 3.1 (1), we have

 ∥P(T)y−P∗y∥∞≥|e⊤i(P(T)y−P∗y)|=|e⊤i(y−P∗y)|≥L.

By Lemma 3.6, we have for all . As a result,

 ∥P(T)y−P∗y∥2−∥P(0)y−P∗y∥2≤−TL2/5≤−RL2/10 .

Since we have

 ∥P(T)y−P∗y∥2≤rU2−RL2/10≤rU2−L21010C1rU2L2=(1−C1)rU2<0 .

This is a contradiction, so such an iteration counter does not exist; and for all , we have .

Part 2) Let be the iteration upon termination, i.e., . Recalling the definition of (in the notations section), we have

 ∥˜HP(K)y∥2≤∥˜HPy∥2   ∀ P∈N2(P(K))∩NR(In) . (3.9)

By Lemma 3.2, there exists a permutation matrix such that

 dist(˜P(K),P(K))≤2,  supp(˜P(K)(P∗)−1)⊆supp(P(K)(P∗)−1)

and

 ∥˜P(K)y−P∗y∥2−∥P(K)y−P∗y∥2≤−(1/2)∥P(K)y−P∗y∥2∞ . (3.10)

We make use of the following claim whose proof is in Section A.6:

 Claim.     ˜P(K)∈NR(In)∩N2(P(K)) . (3.11)

By Claim (3.11) and (3.9), we have

 ∥˜HP(K)y∥2≤∥˜H˜P(K)y∥2 .

Let . Recall that , so we have

 ∥˜H(P(K)y−P∗y)+˜Hϵ∥2≤∥˜H(P(K)y−P∗y)+˜Hϵ+˜Hz∥2 .

This is equivalent to

 −2⟨˜H(P(K)y−P∗y),˜Hz⟩−∥˜Hz∥2≤2⟨˜Hz,˜Hϵ⟩ . (3.12)

On the other hand, from (3.10) we have

 ∥P(K)y−P∗y+z∥2−∥P(K)y−P∗y∥2≤−(1/2)∥P(K)y−P∗y∥2∞

or equivalently,

 2⟨P(K)y−P∗y,z⟩+∥z∥2≤−(1/2)∥P(K)y−P∗y∥2∞ . (3.13)

Summing up (3.12) and (3.13) we have

 2⟨H(P(K)y−P∗y),Hz⟩+∥Hz∥2≤2⟨˜Hz,˜Hϵ⟩−(1/2)∥P(K)y−P∗y∥2∞ .

Hence we have

 ∥P(K)y−P∗y∥2∞ (3.14) ≤ 4⟨˜Hz,˜Hϵ⟩−4⟨H(P(K)y−P∗y),Hz⟩ ≤ 4|⟨z,ϵ⟩|+4|⟨Hz,Hϵ⟩|+4|⟨Hz,H(P(K)y−P∗y)⟩| .

Because has only two nonzero coordinates, by Assumption 3.1 (1) and (4) we have

 |⟨z,ϵ⟩|≤2¯σU . (3.15)

By Assumption 3.1 (3), (4) and (1) we have

 |⟨Hz,Hϵ⟩|≤∥Hz∥∥Hϵ∥≤√ρn∥z∥√d¯σ≤√2dρn¯σU (3.16)

and

 |⟨Hz,H(P(K)y−P∗y)⟩| ≤∥Hz∥∥H(P(K)y−P∗y)∥ ≤√Rρn∥z∥∥P(K)y−P∗y∥ .

Using and , to bound the right hand term in the above display, we get:

 |⟨Hz,H(P(K)y−P∗y)⟩|≤2RρnU∥P(K)y−P∗y∥∞ . (3.17)

Combining (3.14) – (3.17), we have

 ∥P(K)y−P∗y∥2∞ ≤ 8¯σU+4√2dρn¯σU+8RρnU∥P(K)y−P∗y∥∞ ≤ 8¯σU+4√2dρn¯σU+(1/2)∥P(K)y−P∗y∥2∞+(1/2)(8RρnU)2 .

The proof follows by a rearrangement of the above:

 ∥P(K)y−P∗y∥2∞≤16¯σU+8√2dρn¯σU+64R2ρ2nU2 .

#### 3.2.2 Proof of Theorem 3.5

Let . Because , we have ; and for any :

 ∥P(k)y−P∗y∥0≤dist(P(k),In)+dist(P∗,In)≤r+R≤aR/2.

Hence, by Lemma 3.2, there exists a permutation matrix such that , and

 ∥˜P(k)y−P∗y∥2≤(1−1/(aR))∥P(k)y−P∗y∥2 .

As a result,

 ∥˜H(˜P(k)y−P∗y)∥2≤∥˜P(k)y−P∗y∥2 ≤ (1−1/(aR))∥P(k)y−P∗y∥2 ≤ 1−1/(aR)1−Rρn∥˜H(P(k)y−P∗y)∥2 ,

where the last inequality is from Assumption 3.1 (3). Note that by Assumption 3.4 (1), we have , so . Because , we have

 ∥˜H(˜P(k)y−P∗y)∥2 ≤ 1−1/(aR)1−1/(2aR)∥˜H(P(k)y−P∗y)∥2 ≤ (1−1/(2aR))∥˜H(P(k)y−P∗y)∥2 .

Recall that (1.1) leads to , so we have

 ∥˜H˜P(k)y−˜Hϵ∥2≤(1−1/(2aR))∥˜HP(k)y−˜Hϵ∥2 . (3.18)

Let and , then (3.18) leads to:

 ∥˜H˜P(k)y∥2 (3.19) ≤ (1−η)∥˜HP(k)y∥2−η∥˜Hϵ∥2+2⟨˜H˜P(k)y,˜Hϵ⟩−2(1−η)⟨˜HP(k)y,˜Hϵ⟩ ≤ (1−η)∥˜HP(k)y∥2+2(1−η)⟨˜Hz,˜Hϵ⟩+2η⟨˜H˜P(k)y,˜Hϵ⟩ ≤ (1−η)∥˜HP(k)y∥2+2|⟨˜Hz,˜Hϵ⟩|+2η|⟨˜H˜P(k)y,˜Hϵ⟩|

where, to arrive at the second inequality, we drop the term . We now make use of the following claim whose proof is in Section A.7:

 Claim.    2|⟨˜Hz,˜Hϵ⟩|≤η4∥˜H˜P(k)y∥2+η4∥˜HP(k)y∥2+4η∥˜Hϵ∥2 . (3.20)

On the other hand, by Cauchy-Schwartz inequality,

 2η|⟨