An Algebraic-Geometric Approach to Shuffled Linear Regression

10/12/2018 ∙ by Manolis C. Tsakiris, et al. ∙ Università di Genova 0

Shuffled linear regression is the problem of performing a linear regression fit to a dataset for which the correspondences between the independent samples and the observations are unknown. Such a problem arises in diverse domains such as computer vision, communications and biology. In its simplest form, it is tantamount to solving a linear system of equations, for which the entries of the right hand side vector have been permuted. This type of data corruption renders the linear regression task considerably harder, even in the absence of other corruptions, such as noise, outliers or missing entries. Existing methods are either applicable only to noiseless data or they are very sensitive to initialization and work only for partially shuffled data. In this paper we address both of these issues via an algebraic geometric approach, which uses symmetric polynomials to extract permutation-invariant constraints that the parameters x∈R^n of the linear regression model must satisfy. This naturally leads to a polynomial system of n equations in n unknowns, which contains x in its root locus. Using the machinery of algebraic geometry we prove that as long as the independent samples are generic, this polynomial system is always consistent with at most n! complex roots, regardless of any type of corruption inflicted on the observations. The algorithmic implication of this fact is that one can always solve this polynomial system and use its most suitable root as initialization to the Expectation Maximization algorithm. To the best of our knowledge, the resulting method is the first working solution for small values of n able to handle thousands of fully shuffled noisy observations in milliseconds.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 11

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In the span of more than years since the work of Legendre [1] and Gauss [2], linear regression has grown to be a cornerstone of statistics, with applications in almost every branch of science and engineering that involves computing with data. In its simplest form, classical linear regression considers a data model whose output is a linear function of known functions of the input data . More precisely, given correspondences , one seeks to find real numbers , such that

(1)
(2)

where are known functions111A classical case is when is a polynomial function of , in which case the denote the coefficients of the polynomial.. Between the classic least-squares solution of Gauss and modern approaches designed to deal with highly corrupted data [3, 4, 5, 6], a literature too vast to enumerate has been devoted to progressively more complicated versions of the linear regression problem [7].

1.1 Shuffled Linear Regression

In this paper we are interested in a particular type of data corruption, i.e., lack of correspondences. In such a case, one is given the input samples, or more precisely functions of the input samples as in (2), i.e.,

(3)

and a shuffled version , of the observations , where the shuffling indices are unknown. This problem of shuffled linear regression, also known as linear regression without correspondences [8], linear regression with shuffled data [9], unlabeled sensing [10] or permuted linear model [11], can be stated, in the absence of any other data corruptions, as follows:

Problem 1.

Suppose we are given a matrix with , and a vector , where is some vector and an permutation matrix. We wish to efficiently compute when is unknown and . In other words, without knowing , we want to solve the linear system

(4)

Problem 1 naturally arises in many applications, such as in 1) computer vision, e.g., for simultaneous localization and mapping [12, 13, 14], multi-target tracking [15]

, and pose/correspondence estimation

[16, 17], in 2) biology, e.g., for cell tracking [18], genome-assembly [19], and identical tokens in signaling pathways [20], in 3) communication networks, e.g., for data de-anonymization [21, 22] and low-latency communications in Internet-Of-Things networks [11], in 4) signal processing, e.g., when dealing with signals sampled in the presence of time jitter [23], as well as in 5) archaeology, e.g., for the estimation of the chronological ordering of archaeological deposits [24].

1.2 Prior Art

Over the past years there has been a considerable amount of work on special instances of Problem 1 in diverse contexts from computer vision to biology, e.g., see the excellent literature reviews in [9] and [10]. Nevertheless, it has only been until very recently that the problem of shuffled linear regression has been considered in its full generality, with the main achievements so far concentrating on a theoretical understanding of the conditions that allow unique recovery of or ([9, 10, 25, 8, 26, 11, 27, 28]).

Letting

be drawn at random from any continuous probability distribution, the work of

[10] proves that any such can be uniquely recovered with probability as long as . If on the other hand , then is not unique with probability . Further considering additive random noise on , the authors in [8] establish lower bounds on the SNR, below which for any estimator there is a whose estimation error is large. With

drawn from the standard normal distribution (without loss of generality),

unknown but fixed and corrupted by additive random noise, [9] shows that, as long as the SNR exceeds a threshold, coincides with high probability with the Maximum Likelihood Estimator (MLE), which they define as (also considered in [26, 25])

(5)

where in (5) is constrained to be a permutation matrix. If on the other hand the SNR is not large enough, differs from with high probability, in agreement with the results of [8]. This is further complemented by [10], which shows that is locally stable under noise, in the sense that as the SNR tends to infinity tends to . However, according to [26], for SNR fixed, is asymptotically inconsistent.

On the algorithmic front of solving Problem 1, the case is well-understood and solved at a complexity of by sorting ([9], [26]). For the obvious approach is by brute force: for each permutation among the permutations of the entries of check whether the linear system is consistent, and if yes, solve it; this is an algorithm of complexity . In [10] they authors write, …although we showed that recovery of the unknown is possible from unlabeled measurements, we do not study the problem of designing an efficient algorithm to recover . Our solution is to consider all possible permutations of the unlabeled observations which might be prohibitively complex in large dimensional problems. An algorithm that is more efficient than the brute force one is that of [8], which is able to reduce the complexity as a function of from to a factor of at least . However, as the authors of [8] write, their algorithm strongly exploits the assumption of noiseless measurements and is also very brittle and very likely fails in the presence of noise; the same is true for the complexity algorithm of [29]. Finally, the authors in [8] write we are not aware of previous algorithms for the average-case problem in general dimension . In the same paper a approximation algorithm with theoretical guarantees and of complexity is proposed, which however, is not meant for practical deployment, but instead is intended to shed light on the computational difficulty of the least squares problem (5). Indeed, as per [9] this is an NP-hard problem for .

On the other hand, the approach that seems to be the predominant one in terms of practical deployment is that of solving (5) via alternating minimization [25]: given an estimate for one computes an estimate for via sorting; given an estimate for one computes an estimate for

via ordinary least-squares. Nevertheless, this approach is very sensitive to initialization and generally works only for partially shuffled data; see

[25] for a soft variation of this alternating scheme.

In conclusion, to the best of our knowledge, there does not yet exist an algorithm for solving Problem 1 that is theoretically justifiable, efficient and robust to even mild levels of noise.

1.3 Our contributions

In this work, we contribute to the study of Problem 1 on both theory (§2) and algorithms (§3). On the theoretical level, which is our main focus, we show that for generic noiseless data , there is a unique solution , as soon as . We show that is contained in the root locus of a system of polynomial equations in unknowns. Using advanced tools from algebraic geometry, we show that this polynomial system is always consistent with at most complex solutions, regardless of any noise that may further corrupt the observations .

The algorithmic implication of our theoretical development complements the current state-of-the-art: we solve the polynomial system and use a simple criterion to identify its most appropriate root to be used as an initialization for computing the MLE estimator (5) via alternating minimization. Even though solving the polynomial system entails in principle an exponential complexity in , its complexity in is linear for any . Furthermore, we use methods from automatic algebraic-geometric solver generation to obtain highly efficient solvers for . Overall, for , our approach is the first working solution to shuffled linear regression that remains stable under noise and has manageable complexity. As an example, for , , and additive noise, our method computes in milliseconds a solution that is within error from the ground truth.

2 Theoretical Contributions

The main contribution of this paper is to develop the theory to an algebraic geometric approach towards solving Problem 1. The key idea of this approach, described in detail in §2.2, uses symmetric power-sum polynomials to eliminate the unknown permutation, thus resulting in a polynomial system of equations in unknowns. These polynomials were considered implicitly in the statistical approach of [26], towards constructing a

self-moment

estimator. The authors of that paper wrote, …in fact there may not be a solution to the system, which led them to compute their estimator via gradient descent on a highly non-convex objective function, a procedure lacking any theoretical guarantees and very sensitive to initialization, thus requiring a large number of multi-starts. The geometric significance of the polynomial system was also recognized by the last two authors of the present paper in the short conference paper [11], but important questions such as

  1. “does have finitely many solutions?” or

  2. “does have any solutions in the presence of noise?”,

where left as open problems.

In this paper we answer these two questions in the affirmative in §2.3, via Theorems 2 and 3 respectively. The main message is that if the data are generic (to be made precise in §2.1), then using power-sum polynomials of degrees as constraints, defines an algebraic variety that consists of at most points, among which lies the solution of the shuffled linear regression Problem 1. In addition the same conclusion holds true in the case where the observation vector has undergone any type of corruption: the variety defined by the noisy polynomials is non-empty and consisting of at most complex points. This guarantees that the equations are almost always consistent even if the data are imperfect, which enables algorithmic development and applications.

The proofs of Theorems 2-3, given in §2.5, require a thorough understanding of the notion of dimension of polynomial systems of equations. We describe the necessary notions in §2.4 in an expository style for the benefit of the reader who is familiar with linear algebra but not with algebraic geometry.

2.1 Genericity and well-posedness

Before we are in a position to state our main results, i.e., Theorems 2-3 described in §2.3, we need to clarify what we mean when we refer to as being generic (§2.1.1), and also settle the well-posedness of Problem 12.1.2).

2.1.1 The notion of generic

We start with an example.

Example 1.

Consider the matrix

(6)

where the entries of are real numbers. Then is invertible if and only if . Now consider the polynomial ring in four variables, with each of them corresponding to an entry of . The equation

(7)

defines a hypersurface of , and if and only if is non-invertible. This hypersurface has measure zero, say, under the standard normal distribution of , and hence if one samples at random from this distribution, will lie outside of with probability . We express this by saying that “if is a generic matrix, then is invertible”.

As Example 1 suggests, we usually attach the attribute generic to an object (matrix in Example 1) with respect to some property of (invertibility of in Example 1). We say that “if is generic then is true”, and mean that the set of objects for which is not true forms a zero-measure set of the underlying space that parametrizes that object under some continuous probability distribution. Hence sampling at random from that probability distribution results in having the property with probability . Finally, if there are finitely many properties of interest with regard to the object , and if is generic with respect to each of the , then is generic with respect to all of them; this follows from the fact that the union of finitely many zero-measure sets is a zero-measure set. The connection between algebraic geometry and measure theory that the reader should keep in mind for our purpose here, is that algebraic varieties222Here we exclude the ambient space, which is also an algebraic variety. have zero measure and that the union of finitely many algebraic varieties is an algebraic variety.

2.1.2 Uniqueness of

Our first result, Theorem 1, guarantees that Problem 1 is well-posed for generic data, in which case it makes sense to talk about a unique permutation and a unique solution .

Theorem 1.

Suppose that . Then as long as is a generic matrix and the permutation of a generic vector in , and in Problem 1 are unique.

It is interesting to compare Theorem 1 with a simplified version of the main result of [10], stated next for convenience.

Theorem A (Theorem in [10]).

Suppose that . Then as long as is a generic matrix and the permutation of any vector in , in Problem 1 is unique.

Both Theorems 1 and A are concerned with a generic matrix and the permutation of some vector in . In Theorem 1 the vector is taken to be a generic vector in , and as it turns out it is enough that for to be uniquely defined from the data . On the other hand, in Theorem A the vector is allowed to be any vector in . This is a considerably more difficult setting, and the remarkable proof of [10] reveals that is still uniquely defined from the data, as long as now .

Finally, we note that in the setting of Theorem A unique recovery of the permutation is in principle not possible, as per Theorem in [10]. Instead, one has to either allow for the observed vector to be generic (i.e., the permutation of a generic vector of ) in which case is uniquely recoverable by Theorem 1, or consider unique recovery with high probability, which is indeed possible even when is corrupted by noise [9].

2.2 Eliminating via symmetric polynomials

To describe the main idea of the algebraic-geometric approach to solving Problem 1, let be the ring of polynomials with real coefficients over variables . A polynomial333We do not distinguish between and . is called symmetric if it is invariant to any permutation of the variables , i.e.,

(8)

where is a permutation on (i.e., a bijective function from to itself) and is an matrix representing this permutation, with th row the canonical vector of all zeros, except a at position .

Now let be as in Problem 1 and let be a solution. Let be a symmetric polynomial. Then the key observation is that the equality implies the equality , and since is symmetric, this in turn implies

(9)

That is, the symmetric polynomial serves in eliminating the unknown permutation and providing a constraint equation that depends only on the known data ,

(10)

and which the solution that we seek needs to satisfy.

Example 2.

Consider the data

(11)

It is simple to check that there is only one permutation

(12)

that results in a consistent linear system of equations

(13)

with solution ; every other permutation results in inconsistent equations, since it forces to have different values. Now consider the symmetric polynomial

(14)

which we may use as in (10) to generate the constraint

(15)
(16)
(17)

Indeed, we see that the solution satisfies (17).

The polynomial in (10) is an element of the polynomial ring in variables , and the set of its roots, denoted as and called an algebraic variety, in principle defines a hypersurface of . Since the solution to (4) is an element of the -dimensional space , and for any such , one expects that using sufficiently independent such polynomials, will yield a system of equations in unknowns,

(18)

that has a finite number of solutions. Geometrically, these solutions are the intersection points of the corresponding hypersurfaces , which contain all solutions to Problem (1), as well as potentially other irrelevant points.

Example 3.

Continuing with Example 2, suppose we further use the symmetric polynomial

(19)

in (10) to obtain the additional constraint

(20)

Solving (17) with respect to and substituting to (3), gives a quadratic equation in with solutions and . Solving (17) for gives (up to second decimal precision)

(21)

We see that contains the solution of the linear system (4) but also an additional irrelevant point.

We note here that one may use polynomials in order to remove the irrelevant points, e.g., as was done in [11]. However, such an approach is of theoretical interest only, since a system of (sufficiently independent) equations in unknowns is bound to be inconsistent even in the slightest presence of noise. Instead, here we study a system of equations in unknowns and later show (see §3) how one can filter its roots of interest.

2.3 Main Results: zero dimension theorems

We present our main results in §2.3.1 (Theorem 2) and §2.3.2 (Theorem 3) for exact and corrupted data, respectively.

2.3.1 Exact data

As Examples 2-3 suggest, a natural choice for our symmetric polynomials are the first power sums , defined as

(22)

The above discussion has already established that any solution of (4) must satisfy the polynomial constraints

(23)
(24)

and denotes the th row of . The next major result guarantees that there can only be a finite number of other irrelevant solutions.

Theorem 2.

If is generic and is some permutation of some vector in , then the algebraic variety contains all such that there exist permutations with , while it may contain at most other points of . If in addition is some permutation of a generic vector in , then .

Theorem 2 guarantees that the system of polynomial equations

(25)

always has a finite number of solutions in (at most ), among which lie all possible solutions of Problem 1. The importance of the solutions being finite in is computational: even if one is interested only in the real roots, knowing that the system has finitely many complex roots allows one to use much more efficient solvers. On the other hand, there exist many pathological cases where a system of polynomial equations has finitely many real roots but an infinity of complex roots, as the next example demonstrates; Theorem 2 guarantees that such a pathological case can not occur.

Example 4.

The polynomial equation has only one real root , while over the complex numbers it defines a union of two lines in .

2.3.2 Corrupted data

Our next result addresses the issue of corrupted data. As is common in practical applications, we consider the case where we are given data that are corrupted versions of . Even in the presence of the slightest corruptions the linear system

(26)

is expected to be inconsistent for every permutation matrix . On the other hand, if the level of corruption is sufficiently small, there will exist a permutation such that (26) is approximately consistent, in the sense that its least-squares solution has small residual error. In such a case, we still want to use the concepts described so far to get an approximate solution of (26). Towards that end, define the corrupted power-sum polynomials as

(27)

and consider the polynomial system given by

(28)

These are equations of degrees in unknowns. Since (26) has in principle no exact solution , it is entirely unclear whether has any solutions at all, even in .

Example 5.

Let as in (27)-(28) with , given by

(29)
(30)
(31)

Then has no solutions in because multiplying the first equation with and subtracting it from the third equation gives , which contradicts the second equation.

In principle, given polynomial equations in unknowns it is very complicated to determine whether a solution exists or not; a general criterion is given by Hilbert’s Nullstellensatz (Proposition 4), which can be checked algorithmically via the device of Groebner basis. Even if has a solution, it might be the case that it has infinitely many of them as the next example shows.

Example 6.

Let as in (27)-(28) with , given by

(32)
(33)
(34)

Notice that the third equation is equal to the product of the first equation with . Thus the third equation represents the union of two surfaces, one of them being the surface defined by the first equation. Hence, every point in the curve that arises as the intersection of the surfaces defined by the first two equations is a solution to , i.e., has infinitely many solutions in .

Theorem 3 below essentially states that the pathological situations of Examples 5-6 can only occur for taking values on a subset of of measure zero, regardless of what is.

Theorem 3.

If is generic and is any vector, then is non-empty containing at most points of .

Theorem 3 is important for at least two reasons. First, it guarantees that the system of polynomial equations (28) is almost always consistent, i.e., there exists at least one solution. In the absence of noise this property is immediate simply because is a root to any of the noiseless polynomials . However, for noisy data the consistency of (28) is far from obvious; for example, the authors of [26] write It is generally impossible to solve these equations analytically; in fact there may not be a solution to the system. Theorem 3 guarantees that such an issue is of no concern. Secondly, it guarantees that the system (28) has a finite number of solutions in : solving (28) with any standard algebraic geometry software will yield a finite number of points, among which lies an approximate solution to the shuffled linear system (26). The algorithmic implication of Theorem 3 is further pursued in §3.

2.4 The notion of dimension in algebraic geometry

2.4.1 Geometric characterization of dimension

Let be polynomials in and consider their common root locus , called an algebraic variety, defined as

(35)

What is the dimension of ? If , then we have a single equation and one intuitively expects to be a hypersurface of having dimension ; this is in analogy with linear algebra, where a single linear equation defines a linear subspace of dimension one less than the ambient dimension. However, as Example 4 shows, it may be the case that consists of a single point (or even no points at all), in which case should be zero (or if the variety is empty), in analogy with linear algebra where a linear subspace has zero dimension only if it contains a single point (the origin ).

To resolve the above issue and have a consistent definition of dimension that generalizes the linear algebraic one, it is necessary that we consider the common root locus of the polynomials in the algebraic closure of :

(36)

In that case, there is a well developed theory ([30, 31, 32]) that leads to consistent characterizations of the dimension of the geometric object and that of its algebraic counterpart . The next definition provides the geometric characterization of .

Definition 1.

Defining to be closed if it is of the form for some polynomials , and irreducible if it is not the union of two proper closed subsets, is defined to be the largest444The acute reader may notice that there is a-priori no guarantee that such a maximal integer exists. However, this is true because is a Noetherian topological space, a technical notion that is beyond the scope of this paper. non-negative integer such that there exists a chain of the form

(37)

where each is a closed irreducible subset of .

Definition 1 is a generalization of the notion of dimension in linear algebra: if is a linear subspace of , then is precisely equal to the maximal length of a descending chain of linear subspaces that starts with ; one can get such a chain by removing a single basis vector of at each step555For more information on the algebraic geometric structure of linear subspaces the reader is referred to Appendix C in [33]..

Example 7.

With the vector with zeros everywhere except a at position , and , admits a chain

(38)

A very important structural fact about algebraic varieties is the following decomposition theorem.

Proposition 1.

Let for some . Then can be written uniquely as , for some positive integer , where the are irreducible closed sets of (see Definition 1), and they are minimal, in the sense that if one removes one of the , the resulting union is a strictly smaller set than . The are called the irreducible components of .

Definition 1 together with Proposition 1 ensure that the only algebraic varieties that have dimension zero are the ones that consist of a finite number of points; these are precisely the varieties of interest in this paper.

Proposition 2.

Let . Then if and only if consists of a finite number of points of .

2.4.2 Algebraic characterization of dimension

Even though Definition 1 is perfectly acceptable from an intuitive point of view, it is not as convenient to use in practice, since one is usually given polynomials of interest and wants to determine whether has zero dimension, without having to solve the polynomial system. This is the case in this paper, where, e.g., to prove Theorem 2 we need to show that has zero dimension for any suitable : clearly, computing the common root locus of as a function of , is extremely challenging if not impossible. This is precisely where the algebraic characterization of comes in handy, since it allows its computation solely from the algebraic structure of .

To introduce this algebraic notion of dimension we first need the notion of an ideal of . Given polynomials , the ideal generated by these polynomials, denoted by , is the set of all linear combinations of the , but in contrast to linear algebra, the coefficients of the linear combination are allowed to be polynomials themselves:

(39)

Next we need the notion of a prime ideal. An ideal is called prime if it satisfies the following property: whenever the product of two polynomials is inside , then at least one of these polynomials must be inside . With that we have:

Proposition 3.

is the largest non-negative integer such that there exists a chain of the form

(40)

where each is a prime ideal of .

Example 8.

Continuing with Example 8, and

(41)

Since every ideal of the form is prime, we have the ascending chain of prime ideals of length :

(42)

We close this section by noting that the main tool behind the proof of Proposition 3 is the famous Hilbert’s Nullstellensatz, stated next, which holds over but not over . This is why we need to work over to get consistent geometric and algebraic characterizations of the dimension of an algebraic variety.

Proposition 4.

Let be polynomials of . Then

  • if and only if .

  • Suppose that and let be a polynomial such that . Then for some positive integer .

2.4.3 Dimension and homogenization

A monomial of degree is a polynomial of the form , where is a vector of non-negative integers such that . Every polynomial can be uniquely written as a linear combination of monomials

(43)

where is a finite set of multi-exponents and are the corresponding coefficients. Then is called homogeneous of degree if all its monomials have the same degree .

Let be a set of polynomials of . For rather subtle reasons beyond the scope of this paper, further characterizing the dimension of beyond Proposition 3 is simpler when all the are homogeneous (this will be discussed in the next section). When this is not the case, there is a simple procedure called homogenization, through which we can convert non-homogenous polynomials to homogeneous.

Suppose that the polynomial given in (43) is not homogeneous. Let be the maximal degree among all the monomials of , i.e., , where is the vector of all ones. Then the homogenization of is a polynomial in the extended polynomial ring with one additional variable , defined as

(44)
Example 9.

A non-homogeneous polynomial of degree and its homogenization :

(45)
(46)

Now, let be the ideal generated by some polynomials and consider the homogenization of this ideal

(47)

We note here the subtle fact that certainly contains , but in principle it is larger than the ideal generated by , as the next example illustrates.

Example 10.

Let be polynomials of . Then . Now, the polynomial is in the ideal , and it is already homogeneous so that . However is not inside the ideal , since the latter only contains elements of degree and higher.

Since the elements of are polynomials in variables, they define an algebraic variety666Let be an ideal. Then Hilbert’s Basis Theorem guarantees that always has a finite set of generators, i.e., there is a positive integer and polynomials such that . of . What is the relationship between and ? It is actually not hard to see that if is a point of , then is a point of , for any . Hence any non-zero point of gives rise to an entire line inside ; this line passes through the origin and

, and its intersection with the hyperplane

can be used to recover the original point . Hence is called the affine cone over with vertex . Moreover, the variety is embedded inside the affine cone through a mapping that takes points to lines. In addition, contains so-called points at infinity, which are obtained by setting . As it turns out, there is a tight topological relationship between and and the important fact for our analysis is the following dimension theorem; see [34] for a detailed discussion for the non-expert reader in the context of subspace clustering.

Proposition 5.

Let be an algebraic variety of and let be its affine cone. Then .

Example 11.

Let be an affine line of given by the equation . Then is a plane through the origin in given by the equation .

The next fact, known as Bezout’s Theorem, will be used in bounding the number of points of the zero-dimensional variety of Theorem 2.

Proposition 6.

Let be homogeneous polynomials of of degrees . If is a finite union of lines through the origin, then the number of these lines is at most .

2.4.4 Regular sequences

In §2.2 we argued that if the polynomials are sufficiently independent, then , i.e., the dimension of the algebraic variety drops precisely by the number of its defining equations. More generally, the precise notion of what sufficiently independent should mean for polynomials , so that , is easier to characterize when all the are homogeneous. The right notion is that of a regular sequence.

Definition 2.

Let be polynomials of . Then is a regular sequence if , and for every the following property is true: whenever there is a polynomial such that , then we must have .

The crucial fact for our analysis is the following.

Proposition 7.

Let , be non-constant homogeneous polynomials of . Then , if and only if is a regular sequence.

Given a regular sequence of polynomials in of length , it is of interest to be able to augment this sequence to a regular sequence of length . The simplest type of a homogeneous polynomial that one may consider is a linear form , which represents a hyperplane with normal vector . As it turns out, almost all such hyperplanes qualify, with the exception of those with normal vector that lies inside an algebraic variety of determined by .

Proposition 8.