1 Introduction
Linear regression and its extensions are among the most fundamental models in statistics and related fields. In the classical and most common setting, we are given $n$ samples with features $x_i \in \mathbb{R}^d$ and response $y_i \in \mathbb{R}$, where $i \in \{1,\dots,n\}$ denotes the sample index. We assume that the features and responses are perfectly matched, i.e., $x_i$ and $y_i$ correspond to the same record or sample. However, in important applications (for example, due to errors in the data-merging process), the correspondence between the responses and features may be broken [13, 14, 18]. This erroneous correspondence needs to be adjusted before performing downstream statistical analysis. Thus motivated, we consider a mismatched linear model with responses $y \in \mathbb{R}^n$ and covariates $X \in \mathbb{R}^{n \times d}$ satisfying

(1.1)   $y = P^* X \beta^* + \epsilon$,

where $\beta^* \in \mathbb{R}^d$ are the true regression coefficients, $\epsilon \in \mathbb{R}^n$ is the noise term, and $P^*$ is an unknown $n \times n$ permutation matrix. We consider the classical setting where $n \ge d$ and $X$ has full rank, and we seek to estimate both $\beta^*$ and $P^*$ based on the observations $(y, X)$. Note that the main computational difficulty in this task arises from the unknown permutation.

Linear regression with mismatched/permuted data (model (1.1)) has a long history in statistics dating back to the 1960s [13, 5, 6]. In addition to the aforementioned application in record linkage, similar problems also appear in robotics [20], multitarget tracking [4] and signal processing [3]
, among others. Recently, this problem has garnered significant attention from the statistics and machine learning communities. A series of recent works [8, 22, 14, 15, 1, 2, 11, 9, 17, 24, 7, 21, 18] have studied the statistical and computational aspects of this model. To learn the coefficients $\beta^*$ and the matrix $P^*$, one can consider the following natural optimization problem:

(1.2)   $\min_{P \in \Pi_n,\, \beta \in \mathbb{R}^d} \; \|y - P X \beta\|_2^2$,
where $\Pi_n$ is the set of $n \times n$ permutation matrices. Solving problem (1.2) is difficult as there are exponentially many choices for $P$. Given $P$, however, it is easy to estimate $\beta$ via least squares. [22] shows that in the noiseless setting ($\epsilon = 0$), a solution of problem (1.2) equals $(P^*, \beta^*)$ with probability one if $n \ge 2d$ and the entries of $X$ are independent and identically distributed (iid) as per a distribution that is absolutely continuous with respect to the Lebesgue measure. [15, 11] study the estimation of $\beta^*$ in the noisy setting. It is shown in [15] that Problem (1.2) is NP-hard if $d/n \ge c$ for some constant $c > 0$. A polynomial-time approximation algorithm appears in [11] for a fixed $d$. However, as noted in [11], this algorithm does not appear to be practically efficient. [8] propose a branch-and-bound method that can solve small problem instances (within a reasonable time). [16] propose a branch-and-bound method for a concave minimization formulation, which can solve problem (1.2) for small values of $n$ and $d$ (the authors report a runtime of 40 minutes to solve some such instances). [21] propose an approach using tools from algebraic geometry, which can handle problems with moderate values of $n$ and small values of $d$; the cost of this method increases exponentially with $d$. This approach is exact for the noiseless case but approximate for the noisy case ($\epsilon \neq 0$). Several heuristics have been proposed for (1.2):
Examples include alternating minimization [9, 24, 2]; as far as we can tell, these methods are sensitive to initialization and have limited theoretical guarantees.

As discussed in [19, 18], in many applications only a small fraction of the samples are mismatched, that is, the permutation is sparse. In other words, if $r^*$ denotes the number of mismatches between $P^*$ and the identity matrix, then $r^*$ is much smaller than $n$. In this paper, we focus on such sparse permutation matrices. This motivates a constrained version of (1.2), given by

(1.3)   $\min_{P \in \Pi_n,\, \beta \in \mathbb{R}^d} \; \|y - P X \beta\|_2^2 \quad \text{s.t.} \quad \mathrm{dist}(P, I_n) \le R$,
where the constraint $\mathrm{dist}(P, I_n) \le R$ limits the number of mismatches between $P$ and the identity permutation $I_n$ to be at most $R$ (the distance $\mathrm{dist}(\cdot,\cdot)$ is defined formally in (1.4) below). Above, $R$ is taken such that $R \ge r^*$ (further details on the choice of $R$ can be found in Sections 3 and 5). Note that as long as $R \ge r^*$, the true parameters $(P^*, \beta^*)$ lead to a feasible solution to (1.3). In the special case when $R = n$, the constraint is redundant, and Problem (1.3) is equivalent to problem (1.2). Interesting convex optimization approaches based on robust regression have been proposed in [19] to approximately solve (1.3) when the number of mismatches $r^*$ is small compared to $n$; the authors focus on obtaining an estimate of $\beta^*$. Similar ideas have been extended to consider problems with multiple responses [18].
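To make the setup concrete, the sketch below simulates data from model (1.1) with a sparse permutation and evaluates the objective of (1.3) at a candidate matching, profiling out $\beta$ by least squares. It is only an illustration: the Gaussian design, the noise level, the problem sizes, and the function names are our own assumptions and not part of the model.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, d, r_star, sigma = 200, 10, 15, 0.1      # illustrative sizes

# Simulate from model (1.1): y_i = <x_{pi[i]}, beta*> + eps_i, where the
# index map pi moves at most r_star indices (a sparse permutation).
X = rng.standard_normal((n, d))
beta_star = rng.standard_normal(d)
pi = np.arange(n)
moved = rng.choice(n, size=r_star, replace=False)
pi[moved] = rng.permutation(moved)          # shuffle a small subset of indices
eps = sigma * rng.standard_normal(n)
y = (X @ beta_star)[pi] + eps

def objective(p, X, y):
    """Objective of (1.3) at a candidate matching p (an index array),
    with beta profiled out by least squares."""
    Xp = X[p]
    beta_hat = np.linalg.lstsq(Xp, y, rcond=None)[0]
    return float(np.sum((y - Xp @ beta_hat) ** 2))

R = 2 * r_star                              # mismatch budget in (1.3)
assert np.sum(pi != np.arange(n)) <= R      # the true matching is feasible
print(objective(np.arange(n), X, y))        # identity matching: large residual
print(objective(pi, X, y))                  # true matching: on the order of ||eps||^2
\end{verbatim}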
Problem (1.3) can be formulated as a mixed-integer program (MIP) with binary variables (to model the unknown permutation matrix); one such formulation is sketched below. Solving this MIP with off-the-shelf MIP solvers (e.g., Gurobi) becomes computationally expensive even for small values of $n$. To the best of our knowledge, we are not aware of computationally practical algorithms with theoretical guarantees that can optimally solve the original problem (1.3) (under suitable assumptions on the problem data). Addressing this gap is the main focus of this paper: we propose and study a novel greedy local search method for Problem (1.3), drawing inspiration from the local search method presented in [10] in the context of a different problem (high-dimensional sparse regression). Loosely speaking, at every step our algorithm performs a greedy swap (or transposition) in an attempt to improve the cost function. This algorithm is typically efficient in practice based on our numerical experiments. We also propose an approximate version of the greedy swap procedure that scales to much larger problem instances. We establish theoretical guarantees on the convergence of the proposed method under suitable assumptions on the problem data. Under a suitable scaling of the number of mismatched pairs compared to the number of samples and features, and under certain assumptions on the covariates and noise, our local search method converges to an objective value that is at most a constant multiple of the norm of the underlying noise term. From a statistical viewpoint, this is the best objective value that one can hope to obtain (due to the noise in the problem). Interestingly, in the special case $\epsilon = 0$ (i.e., the noiseless setting), our algorithm converges to an optimal solution of (1.3) at a linear rate. (The extended abstract [12], which is a shorter version of this manuscript, studies the noiseless setting.)
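To make the MIP formulation mentioned above concrete, one possible encoding (a sketch; the variable names are ours, and a solver may prefer an equivalent reformulation) represents the permutation by a binary matrix $Z$ with unit row and column sums:

\begin{align*}
\min_{Z,\,\beta}\quad & \|y - Z X \beta\|_2^2\\
\text{s.t.}\quad & \sum_{j=1}^{n} Z_{ij} = 1,\ \ i = 1,\dots,n; \qquad \sum_{i=1}^{n} Z_{ij} = 1,\ \ j = 1,\dots,n;\\
& \sum_{i=1}^{n} (1 - Z_{ii}) \le R; \qquad Z_{ij} \in \{0,1\},\ \ \beta \in \mathbb{R}^{d},
\end{align*}

where the third constraint caps the number of mismatched indices at $R$. The objective couples $Z$ and $\beta$ through bilinear terms, which is one reason generic solvers struggle as $n$ grows.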
Notation and preliminaries:
For a vector $a$, we let $\|a\|_2$ denote the Euclidean norm, $\|a\|_1$ the $\ell_1$-norm, and $\|a\|_0$ the $\ell_0$-pseudo-norm (i.e., the number of nonzeros) of $a$. We let $\|\cdot\|_{\mathrm{op}}$ denote the operator norm for matrices. Let $e_1, \dots, e_n$ be the natural orthogonal basis of $\mathbb{R}^n$. For a finite set $S$, we let $|S|$ denote its cardinality. For any permutation matrix $P \in \Pi_n$, let $\pi$ be the corresponding permutation of $\{1,\dots,n\}$, that is, $\pi(j) = i$ if and only if $P_{ij} = 1$, if and only if $P e_j = e_i$. We define the distance between two permutation matrices $P_1$ and $P_2$ (with corresponding permutations $\pi_1$ and $\pi_2$) as

(1.4)   $\mathrm{dist}(P_1, P_2) = \big|\{ i \in \{1,\dots,n\} : \pi_1(i) \neq \pi_2(i) \}\big|$.
Recall that we assume $R \ge r^* = \mathrm{dist}(P^*, I_n)$. For a given permutation matrix $P$ and integer $m \ge 0$, define the $m$-neighbourhood of $P$ as

(1.5)   $\mathcal{N}_m(P) = \{ P' \in \Pi_n : \mathrm{dist}(P', P) \le m \}$.
It is easy to check that $P \in \mathcal{N}_m(P)$, and that for any $m \ge 2$, $\mathcal{N}_m(P)$ contains more than one element. For any permutation matrix $P$ with corresponding permutation $\pi$, we define its support as $\mathrm{supp}(P) = \{ i : \pi(i) \neq i \}$. For a real symmetric matrix $M$, let $\lambda_{\max}(M)$ and $\lambda_{\min}(M)$ denote the largest and smallest eigenvalues of $M$, respectively. For two positive scalar sequences $\{a_k\}$ and $\{b_k\}$, we write $a_k \lesssim b_k$, or equivalently $b_k \gtrsim a_k$, if there exists a universal constant $C > 0$ such that $a_k \le C\, b_k$. We write $a_k \gtrsim b_k$, or equivalently $b_k \lesssim a_k$, if there exists a universal constant $c > 0$ such that $a_k \ge c\, b_k$. We write $a_k \asymp b_k$ if both $a_k \lesssim b_k$ and $a_k \gtrsim b_k$ hold.
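As a quick illustration of this notation, the snippet below represents permutations by index arrays and computes the distance (1.4) and the support of a permutation; it is a sketch with function names of our own choosing.

\begin{verbatim}
import numpy as np

def perm_matrix(pi):
    """Permutation matrix P with P e_j = e_{pi[j]} (the convention used above)."""
    n = len(pi)
    P = np.zeros((n, n))
    P[np.asarray(pi), np.arange(n)] = 1.0
    return P

def dist(pi1, pi2):
    """Distance (1.4): number of indices on which the two permutations differ."""
    return int(np.sum(np.asarray(pi1) != np.asarray(pi2)))

def support(pi):
    """Indices that are not fixed points of the permutation."""
    pi = np.asarray(pi)
    return np.flatnonzero(pi != np.arange(len(pi)))

pi1 = np.array([0, 2, 1, 3])        # a single swap (transposition) of indices 1 and 2
print(dist(pi1, np.arange(4)))      # 2
print(support(pi1))                 # [1 2]
\end{verbatim}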
2 A local search method
Here we present our local search method for (1.3). For any fixed $P$, by minimizing the objective function in (1.3) with respect to $\beta$, we obtain the equivalent formulation

(2.1)   $\min_{P \in \Pi_n} \; \|(I_n - H) P^\top y\|_2^2 \quad \text{s.t.} \quad \mathrm{dist}(P, I_n) \le R$,
where $H = X (X^\top X)^{-1} X^\top$ is the projection matrix onto the columns of $X$. To simplify notation, denote $Q = I_n - H$; then (2.1) is equivalent to

(2.2)   $\min_{P \in \Pi_n} \; \|Q P^\top y\|_2^2 \quad \text{s.t.} \quad \mathrm{dist}(P, I_n) \le R$.
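As a sanity check, the equivalence between the profiled least-squares objective and (2.2) can be verified numerically. The snippet below is an illustrative sketch (data and names are ours); re-indexing $y$ plays the role of $P^\top y$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

H = X @ np.linalg.solve(X.T @ X, X.T)    # projection onto the column space of X
Q = np.eye(n) - H                        # Q = I_n - H

p = np.arange(n)
p[[3, 7]] = p[[7, 3]]                    # a candidate permutation (one swap); y[p] plays the role of P^T y

# min_beta ||P^T y - X beta||_2^2, computed directly by least squares ...
beta_hat = np.linalg.lstsq(X, y[p], rcond=None)[0]
lsq_val = np.sum((y[p] - X @ beta_hat) ** 2)

# ... coincides with the profiled objective ||Q P^T y||_2^2 of (2.2)
qpy_val = np.sum((Q @ y[p]) ** 2)
print(np.isclose(lsq_val, qpy_val))      # True
\end{verbatim}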
Our local search approach for (2.2) is summarized in Algorithm 1, whose key update is

(2.3)   $P^{(t+1)} \in \operatorname{argmin}_{P} \big\{ \|Q P^\top y\|_2^2 \;:\; P \in \mathcal{N}_2(P^{(t)}),\ \mathrm{dist}(P, I_n) \le R \big\}$.
At iteration $t$, Algorithm 1 finds a swap (within a distance of $2$ from $P^{(t)}$) that leads to the smallest objective value. To see the computational cost of (2.3), write $v = P^\top y - (P^{(t)})^\top y$ and note that:

(2.4)   $\|Q P^\top y\|_2^2 - \|Q (P^{(t)})^\top y\|_2^2 = 2\, \big\langle v, \, Q (P^{(t)})^\top y \big\rangle + v^\top Q\, v$.

For each $P \in \mathcal{N}_2(P^{(t)})$ with $P \neq P^{(t)}$, the vector $v$ has at most two nonzero entries. Since we pre-compute $Q (P^{(t)})^\top y$, computing the first term on the right-hand side of (2.4) costs $O(1)$ operations. As we retain a copy of $Q$ in memory, computing the second term in (2.4) also costs $O(1)$ operations. Therefore, computing (2.3) requires $O(n^2)$ operations, as there are at most $n(n-1)/2$-many possible swaps to search over. The per-iteration cost is quite reasonable for medium-sized examples with $n$ being a few hundred to a few thousand, but might be expensive for larger examples. In Section 4, we propose a fast method to find an approximate solution of (2.3) that scales to much larger instances within a few minutes (see Section 5 for numerical findings).
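Putting the pieces together, the sketch below gives a minimal implementation of the greedy swap search (our own illustrative code, not the authors' implementation). It pre-computes $Q$, keeps $Q (P^{(t)})^\top y$ in memory, scores each feasible swap in $O(1)$ time using (2.4), and applies the best improving swap until none exists.

\begin{verbatim}
import numpy as np

def local_search(X, y, R, max_iter=1000, tol=1e-12):
    """Greedy swap local search for (2.2)/(2.3); an illustrative sketch."""
    n, _ = X.shape
    Q = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)   # pre-compute Q = I_n - H
    p = np.arange(n)                                    # current permutation, index form
    z = y.copy()                                        # z = P^T y for the current P (z = y[p])

    for _ in range(max_iter):
        w = Q @ z                                       # retained for all swap evaluations
        mism = int(np.sum(p != np.arange(n)))           # dist(P, I_n)
        best_delta, best_pair = -tol, None
        for i in range(n):
            for j in range(i + 1, n):
                a = z[i] - z[j]
                if a == 0.0:
                    continue
                # O(1) change in ||Q z||^2 from swapping entries i and j of z; cf. (2.4)
                delta = 2.0 * a * (w[j] - w[i]) \
                        + a * a * (Q[i, i] + Q[j, j] - 2.0 * Q[i, j])
                # feasibility: number of mismatched indices after the swap must stay <= R
                new_mism = mism - (p[i] != i) - (p[j] != j) + (p[j] != i) + (p[i] != j)
                if new_mism <= R and delta < best_delta:
                    best_delta, best_pair = delta, (i, j)
        if best_pair is None:                           # no feasible improving swap: stop
            break
        i, j = best_pair
        p[[i, j]] = p[[j, i]]                           # apply the best swap to P ...
        z[[i, j]] = z[[j, i]]                           # ... and to z = P^T y
    return p, float(z @ (Q @ z))                        # permutation (index form) and objective
\end{verbatim}

Each pass over the at most $n(n-1)/2$ candidate swaps costs $O(n^2)$ operations in total, matching the per-iteration cost discussed above. Note that in this sketch the permutation is applied to $y$ (it plays the role of $P^\top y$), whereas the first snippet reordered the rows of $X$; the two index arrays are inverses of each other.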
3 Theoretical guarantees for Algorithm 1
Here we present theoretical guarantees for Algorithm 1. The main assumptions and conclusions appear in Section 3.1. Section 3.2 presents the proofs of the main theorems. Unless otherwise mentioned, in Sections 3.1 and 3.2 we assume that the problem data (i.e., $X$ and $y$) is deterministic. Section 3.3 discusses conditions on the distribution of the features and the noise term under which the main assumptions hold true with high probability.
3.1 Main results
We state and prove the main theorems on the convergence of Algorithm 1. For any , define
(3.1)
We first state the assumptions useful for our technical analysis.
Assumption 3.1
(1). There exist constants such that
(2) Set for some constant .
(3) There is a constant
such that , and
(3.2)
(4) There is a constant satisfying such that
Note that the lower bound in Assumption 3.1 (1) states that the response value for a record that has been mismatched is not too close to its original value (before the mismatch). Assumption 3.1 (2) states that $R$ is set to a constant multiple of $r^*$. This constant can be large, and appears to be an artifact of our proof techniques; our numerical experience suggests that this constant can be much smaller in practice. Assumption 3.1 (3) is a restricted eigenvalue (RE)-type condition [23], stating that multiplying any sufficiently sparse vector by the projection matrix $H$ results in a vector with small norm (in the case $n \gg d$). Section 3.3 discusses conditions on the distribution of the rows of $X$ under which Assumption 3.1 (3) holds true with high probability. Note that if , then for the assumption to hold true, we require . Assumption 3.1 (4) limits the amount of noise in the problem. Section 3.3 presents conditions on the error distribution which ensure that Assumption 3.1 (4) holds true with high probability.
Assumption 3.1 (3) plays an important role in our analysis. In particular, it allows us to approximate the objective function in (2.2) with one that is easier to analyze. To provide some intuition, we write $y = P^* X \beta^* + \epsilon$. Noting that $Q X = 0$, and assuming that the noise $\epsilon$ is small, we have:

(3.3)   $\|Q P^\top y\|_2^2 \;=\; \big\|Q (P^\top P^* - I_n) X \beta^* + Q P^\top \epsilon\big\|_2^2 \;\approx\; \big\|(P^\top P^* - I_n) X \beta^*\big\|_2^2$.
Intuitively, the term on the right-hand side is the approximate objective that we analyze in our theory. Lemma 3.2 presents a one-step decrease property on the approximate objective function.
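For completeness, the display (3.3) follows from a short calculation, recorded here as a sketch under the notation above (using only model (1.1) and the fact that $Q X = 0$):

\begin{align*}
Q P^\top y \;&=\; Q P^\top \big(P^* X \beta^* + \epsilon\big)
            \;=\; Q P^\top P^* X \beta^* + Q P^\top \epsilon\\
           \;&=\; Q \big(P^\top P^* - I_n\big) X \beta^* + Q X \beta^* + Q P^\top \epsilon
            \;=\; Q \big(P^\top P^* - I_n\big) X \beta^* + Q P^\top \epsilon,
\end{align*}

and the vector $(P^\top P^* - I_n) X \beta^*$ has at most $\mathrm{dist}(P, P^*)$ nonzero entries, so when $\epsilon$ is small and $Q$ acts nearly isometrically on such sparse vectors (the content of the RE-type condition), $\|Q P^\top y\|_2^2$ is close to $\|(P^\top P^* - I_n) X \beta^*\|_2^2$.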
Lemma 3.2
(One-step decrease) Given any and , there exists a permutation matrix such that , and
(3.4)
If in addition for some , then
(3.5)
The main theorems (Theorems 3.3 and 3.5 below) make use of Lemma 3.2 and formalize the intuition conveyed in (3.3). We first present Theorem 3.3 that states some results pertaining to Algorithm 1 under Assumption 3.1.
Theorem 3.3
We make a few remarks regarding (3.6). In the special case , we can take , and the conclusion (3.6) reduces to
Therefore, if , then we have . However, for a finite , it is not clear from this upper bound whether . In addition, Theorem 3.3 does not provide a convergence rate for Algorithm 1. Our next result, Theorem 3.5, answers these questions under an additional assumption, stated below.
Assumption 3.4
Let and be parameters appearing in Assumption 3.1.
(1). Suppose .
(2). There is a constant such that , and .
In light of the discussion following Assumption 3.1, Assumption 3.4 (1) places a stricter condition on the size of via the requirement . If , then we would need , which is stronger than the condition needed in Assumption 3.1.
Assumption 3.4 (2) requires a lower bound on – this can be equivalently viewed as an upper bound on , in addition to the upper bound appearing in Assumption 3.1 (4). Section 3.3 provides a sufficient condition for Assumption 3.4 (2) to hold. In particular, in the noiseless case (), Assumption 3.1 (4) and Assumption 3.4 (2) hold with . We state our main convergence result.
Theorem 3.5
In the special (noiseless) setting when $\epsilon = 0$, Theorem 3.5 proves that the sequence of objective values generated by Algorithm 1 converges to zero (the optimal objective value) at a linear rate. The parameter governing the linear rate of convergence depends upon the search width. Following the discussion after Assumption 3.4, the sample-size requirement is more stringent than the condition $n \ge 2d$ needed for the model to be identifiable [22] in the noiseless setting. In particular, when , the number of mismatched pairs needs to be bounded by a constant. Numerical evidence presented in Section 5 (for the noiseless case) appears to suggest that the sample size needed to recover the true parameters is smaller than what our theory suggests.
In the noisy case (i.e., $\epsilon \neq 0$), (3.7) provides an upper bound on the objective value consisting of two terms. The first term converges to zero at a linear rate, similar to the noiseless case. The second term is a constant multiple of the squared norm of the unavoidable noise term (recall that the objective value at $P^*$ equals $\|Q (P^*)^\top \epsilon\|_2^2$). In other words, Algorithm 1 finds a solution whose objective value is at most a constant multiple of the objective value at the true permutation $P^*$.
3.2 Proofs of Theorem 3.3 and Theorem 3.5
In this section, we present the proofs of Theorems 3.3 and 3.5. We first present a technical result used in our proofs.
Lemma 3.6
The proof of Lemma 3.6 is presented in Section A.3. As mentioned earlier, our analysis makes use of the one-step decrease property in Lemma 3.2. Note, however, that if the permutation matrix at the current iteration, denoted by $P^{(t)}$, is on the boundary, i.e., $\mathrm{dist}(P^{(t)}, I_n) = R$, it is not clear whether the permutation found by Lemma 3.2 lies within the search region. Lemma 3.6 helps address this issue (see the proof of Theorem 3.5 below for details).
3.2.1 Proof of Theorem 3.3
Part 1) We show this result by contradiction. Suppose that there exists a such that . Let be the first iteration () such that , i.e.,
Let but , then by Assumption 3.1 (1), we have
By Lemma 3.6, we have for all . As a result,
Since we have
This is a contradiction, so such an iteration counter does not exist; and for all , we have .
Part 2) Let be the iteration upon termination, i.e., . Recalling the definition of (in the notations section), we have
(3.9)
By Lemma 3.2, there exists a permutation matrix such that
and
(3.10)
We make use of the following claim whose proof is in Section A.6:
(3.11)
By Claim (3.11) and (3.9), we have
Let . Recall that , so we have
This is equivalent to
(3.12)
On the other hand, from (3.10) we have
or equivalently,
(3.13)
Summing up (3.12) and (3.13) we have
Hence we have
(3.14)
Because has only two nonzero coordinates, by Assumption 3.1 (1) and (4) we have
(3.15)
By Assumption 3.1 (3), (4) and (1) we have
(3.16)
and
Using and , to bound the right-hand term in the above display, we get:
(3.17)
Combining (3.14) – (3.17), we have
The proof follows by a rearrangement of the above:
3.2.2 Proof of Theorem 3.5
Let . Because , we have ; and for any :
Hence, by Lemma 3.2, there exists a permutation matrix such that , and
As a result,
where the last inequality is from Assumption 3.1 (3). Note that by Assumption 3.4 (1), we have , so . Because , we have
Recall that (1.1) leads to , so we have
(3.18)
Let and , then (3.18) leads to:
(3.19)
where, to arrive at the second inequality, we drop the term . We now make use of the following claim whose proof is in Section A.7:
(3.20)
On the other hand, by the Cauchy–Schwarz inequality,