# The Projected Power Method: An Efficient Algorithm for Joint Alignment from Pairwise Differences

Various applications involve assigning discrete label values to a collection of objects based on some pairwise noisy data. Due to the discrete---and hence nonconvex---structure of the problem, computing the optimal assignment (e.g. maximum likelihood assignment) becomes intractable at first sight. This paper makes progress towards efficient computation by focusing on a concrete joint alignment problem---that is, the problem of recovering n discrete variables x_i ∈{1,..., m}, 1≤ i≤ n given noisy observations of their modulo differences {x_i - x_j mod m}. We propose a low-complexity and model-free procedure, which operates in a lifted space by representing distinct label values in orthogonal directions, and which attempts to optimize quadratic functions over hypercubes. Starting with a first guess computed via a spectral method, the algorithm successively refines the iterates via projected power iterations. We prove that for a broad class of statistical models, the proposed projected power method makes no error---and hence converges to the maximum likelihood estimate---in a suitable regime. Numerical experiments have been carried out on both synthetic and real data to demonstrate the practicality of our algorithm. We expect this algorithmic framework to be effective for a broad range of discrete assignment problems.

## Authors

• 87 publications
• 11 publications
03/13/2020

### Joint Alignment From Pairwise Differences with a Noisy Oracle

In this work we consider the problem of recovering n discrete random var...
07/16/2017

### Projected Power Iteration for Network Alignment

The network alignment problem asks for the best correspondence between t...
09/21/2019

### Optimal Learning of Joint Alignments with a Faulty Oracle

We consider the following problem, which is useful in applications such ...
10/20/2010

### Maximum Likelihood Joint Tracking and Association in a Strong Clutter without Combinatorial Complexity

We have developed an efficient algorithm for the maximum likelihood join...
12/03/2021

### Fast Projected Newton-like Method for Precision Matrix Estimation with Nonnegative Partial Correlations

We study the problem of estimating precision matrices in multivariate Ga...
06/23/2022

### Efficient and Accurate Top-K Recovery from Choice Data

The intersection of learning to rank and choice modeling is an active ar...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

### 1.1 Nonconvex optimization

Nonconvex optimization permeates almost all fields of science and engineering applications. For instance, consider the structured recovery problem where one wishes to recover some structured inputs from noisy samples . The recovery procedure often involves solving some optimization problem (e.g. maximum likelihood estimation)

 maximizez∈Rnℓ(z;y)subject toz∈S, (1)

where the objective function measures how well a candidate fits the samples. Unfortunately, this program (1) may be highly nonconvex, depending on the choices of the goodness-of-fit measure as well as the feasible set . In contrast to convex optimization that has become the cornerstone of modern algorithm design, nonconvex problems are in general daunting to solve. Part of the challenges arises from the existence of (possibly exponentially many) local stationary points; in fact, oftentimes even checking local optimality for a feasible point proves NP-hard.

Despite the general intractability, recent years have seen progress on nonconvex procedures for several classes of problems, including low-rank matrix recovery [KMO10a, KMO10b, JNS13, SL15, CW15, ZL15, TBSR15, MWCC17, ZWL15, PKB16, YPCC16], phase retrieval [NJS13, CLS15, SBE14, CC17, CLM15, SQW16, CL16, ZCL16, MWCC17, WGE16, ZL16], dictionary learning [SQW15a, SQW15b], blind deconvolution [LLSW16, MWCC17], empirical risk minimization [MBM16], to name just a few. For example, we have learned that several problems of this kind provably enjoy benign geometric structure when the sample complexity is sufficiently large, in the sense that all local stationary points (except for the global optimum) become saddle points and are not difficult to escape [SQW15b, SQW15a, GLM16, BNS16, LSJR16]. For the problem of solving certain random systems of quadratic equations, this phenomenon arises as long as the number of equations or sample size exceeds the order of , with denoting the number of unknowns [SQW16].111This geometric property alone is not sufficient to ensure rapid convergence of an algorithm. We have also learned that it is possible to minimize certain non-convex random functionals—closely associated with the famous phase retrieval problem—even when there may be multiple local minima [CLS15, CC17]. In such problems, one can find a reasonably large basin of attraction around the global solution, in which a first-order method converges geometrically fast. More importantly, the existence of such a basin is often guaranteed even in the most challenging regime with minimal sample complexity. Take the phase retrieval problem as an example: this basin exists as soon as the sample size is about the order of [CC17]. This motivates the development of an efficient two-stage paradigm that consists of a carefully-designed initialization scheme to enter the basin, followed by an iterative refinement procedure that is expected to converge within a logarithmic number of iterations [CLS15, CC17]; see also [KMO10a, KMO10b] for related ideas in matrix completion.

In the present work, we extend the knowledge of nonconvex optimization by studying a class of assignment problems in which each is represented on a finite alphabet, as detailed in the next subsection. Unlike the aforementioned problems like phase retrieval which are inherently continuous in nature, in this work we are preoccupied with an input space that is discrete and already nonconvex to start with. We would like to contribute to understanding what is possible to solve in this setting.

### 1.2 A joint alignment problem

This paper primarily focuses on the following joint discrete alignment problem. Consider a collection of variables , where each variable can take different possible values, namely, . Imagine we obtain a set of pairwise difference samples over some symmetric222We say is symmetric if implies for any and . index set , where is a noisy measurement of the modulo difference of the incident variables

 yi,j ← xi−xj mod m,(i,j)∈Ω. (2)

For example, one might obtain a set of data where only 50% of them are consistent with the truth . The goal is to simultaneously recover all based on the measurements , up to some unrecoverable global offset.333Specifically, it is impossible to distinguish the sets of inputs , , , and even if we obtain perfect measurements of all pairwise differences .

To tackle this problem, one is often led to the following program

 maximize{zi} ∑(i,j)∈Ωℓ(zi,zj; yi,j) (3) subject to zi∈{1,⋯,m},i=1,⋯,n,

where is some function that evaluates how consistent the observed sample corresponds to the candidate solution . For instance, one possibility for may be

 ℓ(zi,zj; yi,j)={1,if zi−zj=yi,j mod m,0,else, (4)

under which the program (3) seeks a solution that maximizes the agreement between the paiwise observations and the recovery. Throughout the rest of the paper, we set whenever .

This joint alignment problem finds applications in multiple domains. To begin with, the binary case (i.e. ) deserves special attention, as it reduces to a graph partitioning problem. For instance, in a community detection scenario in which one wishes to partition all users into two clusters, the variables to recover indicate the cluster assignments for each user, while represents the friendship between two users and (e.g. [KN11, CO10, ABH16, CGT12, MNS14, JCSX11, OH11, CSX12, HWX16, CKST16, JMRT16, CRV15, GRSY15]). This allows to model, for example, the haplotype phasing problem arising in computational genomics [SVV14, CKST16]. Another example is the problem of water-fat separation in magnetic resonance imaging (more precisely, in Dixon imaging). A crucial step is to determine, at each image pixel , the phasor (associated with the field inhomogeneity) out of two possible candidates, represented by and , respectively. The task takes as input some pre-computed pairwise cost functions , which provides information about whether or not at pixels and ; see [ZCB17, HKHL10, BAJK11] for details.

Moving beyond the binary case, this problem is motivated by the need of jointly aligning multiple images/shapes/pictures that arises in various fields. Imagine a sequence of images of the same physical instance (e.g. a building, a molecule), where each represents the orientation of the camera when taking the

th image. A variety of computer vision tasks (e.g. 3D reconstruction from multiple scenes) or structural biology applications (e.g. cryo-electron microscopy) rely upon joint alignment of these images; or equivalently, joint recovery of the camera orientations associated with each image. Practically, it is often easier to estimate the relative camera orientation between a pair of images using raw features

[HSG13, WS13, BCSZ14]. The problem then boils down to this: how to jointly aggregate such pairwise information in order to improve the collection of camera pose estimates?

### 1.3 Our contributions

In this work, we propose to solve the problem (3) via a novel model-free nonconvex procedure. Informally, the procedure starts by lifting each variable to higher dimensions such that distinct values are represented in orthogonal directions, and then encodes the goodness-of-fit measure for each by an matrix. This way of representation allows to recast (3

) as a constrained quadratic program or, equivalently, a constrained principal component analysis (PCA) problem. We then attempt optimization by means of projected power iterations, following an initial guess obtained via suitable low-rank factorization. This procedure proves effective for a broad family of statistical models, and might be interesting for many other Boolean assignment problems beyond joint alignment.

## 2 Algorithm: projected power method

In this section, we present a nonconvex procedure to solve the nonconvex problem (3), which entails a series of projected power iterations over a higher-dimensional space. In what follows, this algorithm will be termed a projected power method (PPM).

### 2.1 Matrix representation

The formulation (3) admits an alternative matrix representation that is often more amenable to computation. To begin with, each state

can be represented by a binary-valued vector

such that

 zi=j⟺zi=ej∈Rm, (5)

where , , , are the canonical basis vectors. In addition, for each pair , one can introduce an input matrix to encode given all possible input combinations of ; that is,

 (Li,j)α,β:=ℓ(zi=α,zj=β; yi,j),1≤α,β≤m. (6)

Take the choice (4) of for example:

 (Li,j)α,β={1,if α−β=yi,j mod m,0,else,∀(i,j)∈Ω; (7)

in words,

is a cyclic permutation matrix obtained by circularly shifting the identity matrix

by positions. By convention, we take for all and for all .

The preceding notation enables the quadratic form representation

 ℓ(zi,zj;yi,j)=z⊤iLi,jzj.

For notational simplicity, we stack all the ’s and the ’s into a concatenated vector and matrix

 (8)

respectively, representing the states and log-likelihoods altogether. As a consequence, our problem can be succinctly recast as a constrained quadratic program:

 maximizez z⊤Lz (9) subject to zi∈{e1,⋯,em},i=1,⋯,n.

This representation is appealing due to the simplicity of the objective function regardless of the landscape of , which allows one to focus on quadratic optimization rather than optimizing the (possibly complicated) function directly.

There are other families of that also lead to the problem (3). We single out a simple, yet, important family obtained by enforcing global scaling and offset of . Specifically, the solution to (9) remains unchanged if each is replaced by444This is because given that .

 Li,j←aLi,j+b⋅11⊤,(i,j)∈Ω (10)

for some numerical values and . Another important instance in this family is the debiased version of —denoted by —defined as follows

 Ldebiasi,j=Li,j−1⊤Li,j1m2⋅11⊤,1≤i,j≤n, (11)

which essentially removes the empirical average of in each block.

### 2.2 Algorithm

One can interpret the quadratic program (9) as finding the principal component of subject to certain structural constraints. This motivates us to tackle this constrained PCA problem by means of a power method, with the assistance of appropriate regularization to enforce the structural constraints. More precisely, we consider the following procedure, which starts from a suitable initialization and follows the update rule

 z(t+1)=PΔn(μtLz(t)),∀t≥0 (12)

for some scaling parameter . Here, represents block-wise projection onto the standard simplex, namely, for any vector ,

 PΔn(z):=⎡⎢ ⎢⎣PΔ(z1)⋮PΔ(zn)⎤⎥ ⎥⎦, (13)

where is the projection of onto the standard simplex

 Δ:={s∈Rm∣1⊤s=1; s is non-negative}. (14)

In particular, when , reduces to a rounding procedure. Specifically, if the largest entry of is strictly larger than its second largest entry, then one has

 limμt→∞PΔ(μtzi)=ej (15)

with denoting the index of the largest entry of ; see Fact 3 for a justification.

The key advantage of the PPM is its computational efficiency: the most expensive step in each iteration lies in matrix multiplication, which can be completed in nearly linear time, i.e. in time .555Here and throughout, the standard notion or mean there exists a constant such that ; means ; means there exists a constant such that ; and means there exist constants such that . This arises from the fact that each block is circulant, so we can compute a matrix-vector product using at most two -point FFTs. The projection step can be performed in flops via a sorting-based algorithm (e.g. [DSSSC08, Figure 1]), and hence is much cheaper than the matrix-vector multiplication given that occurs in most applications.

One important step towards guaranteeing rapid convergence is to identify a decent initial guess . This is accomplished by low-rank factorization as follows

1. Compute the best rank- approximation of the input matrix , namely,

 ^L:=argminM: rank(M)≤m∥M−L∥F, (16)

where represents the Frobenius norm;

2. Pick a random column of and set the initial guess as .

###### Remark 1.

Alternatively, one can take to be the best rank- approximation of the debiased input matrix defined in (11), which can be computed in a slightly faster manner.

###### Remark 2.

A natural question arises as to whether the algorithm works with an arbitrary initial point. This question has been studied by [BBV16] for the more special stochastic block models, which shows that under some (suboptimal) conditions, all second-order critical points correspond to the truth and hence an arbitrary initialization works. However, the condition presented therein is much more stringent than the optimal threshold [ABH16, MNS14]. Moreover, it is unclear whether a local algorithm like the PPM can achieve optimal computation time without proper initialization. All of this would be interesting for future investigation.

The main motivation comes from the (approximate) low-rank structure of the input matrix . As we shall shortly see, in many scenarios the data matrix is approximately of rank if the samples are noise-free. Therefore, a low-rank approximation of serves as a denoised version of the data, which is expected to reveal much information about the truth.

The low-rank factorization step can be performed efficiently via the method of orthogonal iteration (also called block power method) [GVL12, Section 7.3.2]. Each power iteration consists of a matrix product of the form

as well as a QR decomposition of some matrix

, where . The matrix product can be computed in flops with the assistance of -point FFTs, whereas the QR decomposition takes time . In summary, each power iteration runs in time . Consequently, the matrix product constitutes the main computational cost when , while the QR decomposition becomes the bottleneck when .

It is noteworthy that both the initialization and the refinement we propose are model-free, which do not make any assumptions on the data model. The whole algorithm is summarized in Algorithm 1. There is of course the question of what sequence to use, which we defer to Section 3.

The proposed two-step algorithm, which is based on proper initialization followed by successive projection onto product of simplices, is a new paradigm for solving a class of discrete optimization problems. As we will detail in the next section, it is provably effective for a family of statistical models. On the other hand, we remark that there exist many algorithms of a similar flavor to tackle other generalized eigenproblems, including but not limited to sparse PCA [JNRS10, HB10, YZ13], water-fat separation [ZCB17], the hidden clique problem [DM15], phase synchronization [Bou16, LYS16], cone-constrained PCA [DMR14], and automatic network analysis [WLS12]. These algorithms are variants of the projected power method, which combine proper power iterations with additional procedures to promote sparsity or enforce other feasibility constraints. For instance, Deshpande et al. [DMR14] show that under some simple models, cone-constrained PCA can be efficiently computed using a generalized projected power method, provided that the cone constraint is convex. The current work adds a new instance to this growing family of nonconvex methods.

## 3 Statistical models and main results

This section explores the performance guarantees of the projected power method. We assume that is obtained via random sampling at an observation rate so that each is included in

independently with probability

, and is assumed to be independent of the measurement noise. In addition, we assume that the samples are independently generated. While the independence noise assumption may not hold in reality, it serves as a starting point for us to develop a quantitative theoretical understanding for the effectiveness of the projected power method. This is also a common assumption in the literature (e.g. [Sin11, WS13, HG13, CGH14, LYS16, Bou16, PKS13]).

With the above assumptions in mind, the MLE is exactly given by (3), with representing the log-likelihood (or some other equivalent function) of the candidate solution given the outcome . Our key finding is that the PPM is not only much more practical than computing the MLE directly666Finding the MLE here is an NP hard problem, and in general cannot be solved within polynomial time. Practically, one might attempt to compute it via convex relaxation (e.g. [HG13, BCSZ14, CGH14]), which is much more expensive than the PPM. , but also capable of achieving nearly identical statistical accuracy as the MLE in a variety of scenarios.

Before proceeding to our results, we find it convenient to introduce a block sparsity metric. Specifically, the block sparsity of a vector is defined and denoted by

 ∥h∥∗,0:=n∑i=1I{hi≠0},

where is the indicator function. Since one can only hope to recover up to some global offset, we define the misclassification rate as the normalized block sparsity of the estimation error modulo the global shift

 MCR(^x,x):=1nmin0≤l

Here, , where is obtained by circularly shifting the entries of by positions. Additionally, we let represent the natural logarithm throughout this paper.

### 3.1 Random corruption model

While our goal is to accommodate a general class of noise models, it is helpful to start with a concrete and simple example—termed a random corruption model—such that

 yi,j={xi−xj mod m,with % probability π0,Unif(m),else,(i,j)∈Ω, (19)

with

being the uniform distribution over

. We will term the parameter the non-corruption rate, since with probability the observation behaves like a random noise carrying no information whatsoever. Under this single-parameter model, one can write

 ℓ(zi,zj; yi,j)=⎧⎪⎨⎪⎩log(π0+1−π0m),if zi−zj=yi,j mod m,log(1−π0m),else,(i,j)∈Ω. (20)

Apart from its mathematical simplicity, the random corruption model somehow corresponds to the worst-case situation since the uniform noise enjoys the highest entropy among all distributions over a fixed range, thus forming a reasonable benchmark for practitioners.

Additionally, while Algorithm 1 can certainly be implemented using the formulation

 (Li,j)α,β={log(π0+1−π0m),if α−β=yi,j mod m,log(1−π0m),else,(i,j)∈Ω, (21)

we recommend taking (7) as the input matrix in this case. It is easy to verify that (21) and (7) are equivalent up to global scaling and offset, but (7) is parameter free and hence practically more appealing.

We show that the PPM is guaranteed to work even when the non-corruption rate is vanishingly small, which corresponds to the scenario where almost all acquired measurements behave like random noise. A formal statement is this:

###### Theorem 1.

Consider the random corruption model (19) and the input matrix given in (7). Fix , and suppose and for some sufficiently large constants . Then there exists some absolute constant such that with probability approaching one as scales, the iterates of Algorithm 1 obey

 MCR(z(t),x)≤0.49ρt,∀t≥0, (22)

provided that the non-corruption rate exceeds777Theorem 1 continues to hold if we replace 1.01 with any other constant in (1,).

 π0>2√1.01lognmnpobs. (23)
###### Remark 3.

Here and throughout, is the

th largest singular value of

. In fact, one can often replace with for other . But is usually not a good choice unless we employ the debiased version of instead, because typically corresponds to the “direct current” component of which could be excessively large. In addition, we note that have been computed during spectral initialization and, as a result, will not result in extra computational cost.

###### Remark 4.

As will be seen in Section 6, a stronger version of error contraction arises such that

 MCR(z(t+1),x)≤ρMCR(z(t),x),if MCR(z(t),x)≤0.49. (24)

This is a uniform result in the sense that (24) occurs simultaneously for all obeying , regardless of the preceding iterates and the statistical dependency between and . In particular, if , one has , and hence forms a sequence of feasible iterates with increasing accuracy. In this case, the iterates become accurate whenever .

###### Remark 5.

The contraction rate can actually be as small as , which is at most if is fixed and if the condition (23) holds.

According to Theorem 1, convergence to the ground truth can be expected in at most iterations. This together with the per-iteration cost (which is on the order of since is a cyclic permutation matrix) shows that the computational complexity of the iterative stage is at most . This is nearly optimal since even reading all the data and likelihood values take time about the order of . All of this happens as soon as the corruption rate does not exceed , uncovering the remarkable ability of the PPM to tolerate and correct dense input errors.

As we shall see later in Section 6, Theorem 1 holds as long as the algorithm starts with any initial guess obeying

 MCR(z(0),x)≤0.49, (25)

irrespective of whether is independent of the data or not. Therefore, it often suffices to run the power method for a constant number of iterations during the initialization stage, which can be completed in flops when is fixed. The broader implication is that Algorithm 1 remains successful if one adopts other initialization that can enter the basin of attraction.

Finally, our result is sharp: to be sure, the error-correction capability of the projected power method is statistically optimal, as revealed by the following converse result.

###### Theorem 2.

Consider the random corruption model (19) with any fixed , and suppose for some sufficiently large constant . If 888Theorem 2 continues to hold if we replace 0.99 with any other constant between 0 and 1.

 π0<2√0.99lognmnpobs, (26)

then the minimax probability of error

 inf^xmaxxi∈[m],1≤i≤nP(MCR(^x,x)>0∣x)→1,% as n→∞,

where the infimum is taken over all estimators and is the vector representation of as before.

As mentioned before, the binary case bears some similarity with the community detection problem in the presence of two communities. Arguably the most popular model for community detection is the stochastic block model (SBM), where any two vertices within the same cluster (resp. across different clusters) are connected by an edge with probability (resp. ). The asymptotic limits for both exact and partial recovery have been extensively studied [Mas14, ABH16, MNS14, HWX16, AS15, CRV15, BDG16, GV15]. We note, however, that the primary focus of community detection lies in the sparse regime or logarithmic sparse regime (i.e. ), which is in contrast to the joint alignment problem in which the measurements are often considerably denser. There are, however, a few theoretical results that cover the dense regime, e.g. [MNS14]. To facilitate comparison, consider the case where and for some , then the SBM reduces to the random corruption model with . One can easily verify that the limit we derive matches the recovery threshold given in999Note that the model studied in [MNS14] is an SBM with vertices with vertices belonging to each cluster. Therefore, the threshold chacterization [MNS14, Proposition 2.9] should read

when applied to our setting, with . [MNS14, Theorem 2.5 and Proposition 2.9].

### 3.2 More general noise models

The theoretical guarantees we develop for the random corruption model are special instances of a set of more general results. In this subsection, we cover a far more general class of noise models such that

 yi,jind.=xi−xj+ηi,j mod m,(i,j)∈Ω, (27)

) are i.i.d. random variables supported on

. In what follows, we define to be the distribution of , i.e.

 (28)

For instance, the random corruption model (19) is a special case of (27) with the noise distribution

 P0(y)={π0+1−π0m,if y=0;1−π0m,if y=1,⋯,m−1. (29)

To simplify notation, we set for all throughout the paper. Unless otherwise noted, we take for all , and restrict attention to the class of symmetric noise distributions obeying

 P0(y)=P0(m−y),y=1,⋯,m−1, (30)

which largely simplifies the exposition.

#### 3.2.1 Key metrics

The feasibility of accurate recovery necessarily depends on the noise distribution or, more precisely, the distinguishability of the output distributions given distinct inputs. In particular, there are distributions that we’d like to emphasize, where represents the distribution of conditional on . Alternatively, is also the -shifted distribution of the noise given by

 Pl(y):=P(yi,j=y∣xi−xj=l)=P(ηi,j=y−l). (31)

Here and below, we write and interchangeably whenever it is clear from the context, and adopt the cyclic notation () for any quantity taking the form .

We would like to quantify the distinguishability of these distributions via some distance metric. One candidate is the Kullback–Leibler (KL) divergence defined by

 KL(Pi∥Pl):=∑yPi(y)logPi(y)Pl(y),0≤i,l

which plays an important role in our main theory.

#### 3.2.2 Performance guarantees

We now proceed to the main findings. To simplify matters, we shall concern ourselves primarily with the kind of noise distributions obeying the following assumption.

###### Assumption 1.

is bounded away from 0.

###### Remark 6.

When , one can replace with in Assumption 1. However, if is allowed to scale with —which is the case in Section 3.4—then the prefactor cannot be dropped.

In words, Assumption 1 ensures that the noise density is not exceedingly lower than the average density at any point. The reason why we introduce this assumption is two-fold. To begin with, this enables us to preclude the case where the entries of —or equivalently, the log-likelihoods—are too wild. For instance, if for some , then , resulting in computational instability. The other reason is to simplify the analysis and exposition slightly, making it easier for the readers. We note, however, that this assumption is not crucial and can be dropped by means of a slight modification of the algorithm, which will be detailed later.

Another assumption that we would like to introduce is more subtle:

###### Assumption 2.

is bounded, where

 (33)

Roughly speaking, Assumption 2 states that the mutual distances of the possible output distributions lie within a reasonable dynamic range, so that one cannot find a pair of them that are considerably more separated than other pairs. Alternatively, it is understood that the variation of the log-likelihood ratio, as we will show later, is often governed by the KL divergence between the two corresponding distributions. From this point of view, Assumption 2 tells us that there is no submatrix of that is significantly more volatile than the remaining parts, which often leads to enhanced stability when computing the power iteration.

With these assumptions in place, we are positioned to state our main result. It is not hard to see that Theorem 1 is an immediate consequence of the following theorem.

###### Theorem 3.

Fix , and assume and for some sufficiently large constants . Under Assumptions 1-2, there exist some absolute constants such that with probability tending to one as scales, the iterates of Algorithm 1 with the input matrix (6) or (11) obey

 MCR(z(t),x)≤νρt,∀t≥0, (34)

provided that101010Theorem 3 remains valid if we replace 4.01 by any other constant in (4,).

 KLmin≥4.01lognnpobs. (35)
###### Remark 7.

Alternatively, Theorem 3 can be stated in terms of other divergence metrics like the squared Hellinger distance . Specifically, Theorem 3 holds if the minimum squared Hellinger distance obeys

 H2min:=min1≤l1.01lognnpobs, (36)

where . We will see later in Lemma 3 that , which justifies the equivalence between (35) and (36).

The recovery condition (35) is non-asymptotic, and takes the form of a minimum KL divergence criterion. This is consistent with the understanding that the hardness of exact recovery often arises in differentiating minimally separated output distributions. Within at most projected power iterations, the PPM returns an estimate with absolutely no error, as soon as the minimum KL divergence exceeds some threshold. This threshold can be remarkably small when is large or, equivalently, when we have many pairwise measurements available.

Theorem 3 accommodates a broad class of noise models. Here we highlight a few examples to illustrate its generality. To begin with, it is self-evident that the random corruption model belongs to this class with . Beyond this simple model, we list two important families which satisfy Assumption 2 and which receive broad practical interest. This list, however, is by no means exhaustive.

1. A class of distributions that obey

 (37)

This says that the output distributions are the closest when the two corresponding inputs are minimally separated.

2. A class of unimodal distributions that satisfy

 (38)

This says that the likelihood decays as the distance to the truth increases.

###### Lemma 1.

Fix , and suppose Assumption 1 holds. Then the noise distribution satisfying either (37) or (38) obeys Assumption 2.

See Appendix B.∎

#### 3.2.3 Why Algorithm 1 works?

We pause here to gain some insights about Algorithm 1 and, in particular, why the minimum KL divergence has emerged as a key metric. Without loss of generality, we assume to simplify the presentation.

Recall that Algorithm 1 attempts to find the constrained principal component of . To enable successful recovery, one would naturally hope the structure of the data matrix to reveal much information about the truth. In the limit of large samples, it is helpful to start by looking at the mean of , which is given by

 E[Li,j]α,β = pobsEy∼P0[logPα−β(y)] (39) = pobsEy∼P0[logPα−β(y)P0(y)]+pobsEy∼P0[logP0(y)] = pobs[−KLα−β−H(P0)]

for any and ; here and throughout, is the entropy functional, and

 KLl := KL(P0∥Pl),0≤l

We can thus write

 E[L]=pobs⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣0K⋯KK0⋱⋮⋮⋱0KK⋯K0⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦ (41)

with denoting a circulant matrix

 K:=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣−KL0−KLm−1⋯−KL1−KL1−KL0⋱−KL2⋮⋱⋱⋮−KLm−1⋯−KL1−KL0⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦:=K0−H(P0)1⋅1⊤. (42)

It is easy to see that the largest entries of lie on the main diagonal, due to the fact that

 −KL0=0and−KLl=−KL(P0∥Pl)<0(1≤l

Consequently, for any column of , knowledge of the largest entries in each block reveals the relative positions across all . Take the 2nd column of for example: all but the first blocks of this column attain the maximum values in their 2nd entries, telling us that .

Given the noisy nature of the acquired data, one would further need to ensure that the true structure stands out from the noise. This hinges upon understanding when can serve as a reasonably good proxy for in the (projected) power iterations. Since we are interested in identifying the largest entries, the signal contained in each block—which is essentially the mean separation between the largest and second largest entries—is of size

 pobsmin1≤l

The total signal strength is thus given by

. In addition, the variance in each measurement is bounded by

 (a)≲ max1≤l

where the inequality (a) will be demonstrated later in Lemma 3. From the semicircle law, the perturbation can be controlled by

 ∥L−E[L]∥=O(√npobsKLmax), (43)

where is the spectral norm. This cannot exceed the size of the signal, namely,

 npobsKLmin≳√npobsKLmax.

This condition reduces to

 KLmin≳1npobs

under Assumption 2, which is consistent with Theorem 3 up to some logarithmic factor.

#### 3.2.4 Optimality

The preceding performance guarantee turns out to be information theoretically optimal in the asymptotic regime. In fact, the KL divergence threshold given in Theorem 3 is arbitrarily close to the information limit, a level below which every procedure is bound to fail in a minimax sense. We formalize this finding as a converse result:

###### Theorem 4.

Fix . Let be a sequence of probability measures supported on a finite set , where is bounded away from 0. Suppose that there exists such that

 KL(P0,n∥Pl,n)=min1≤j

for all sufficiently large , and that for some sufficiently large constant . If 111111Theorem 4 continues to hold if we replace 3.99 with any other constant between 0 and 4.

 KLmin,n<3.99lognnpobs, (44)

then the minimax probability of error

 inf^xmaxxi∈[m],1≤i≤nP(MCR(^x,x)>0∣x)→1,% as n→∞, (45)

where the infimum is over all possible estimators and is the vector representation of as usual.

### 3.3 Extension: removing Assumption 1

We return to Assumption 1. As mentioned before, an exceedingly small might result in unstable log-likelihoods, which suggests we regularize the data before running the algorithm. To this end, one alternative is to introduce a little more entropy to the samples so as to regularize the noise density, namely, we add a small level of random noise to yield

 ~yi,jind.={yi,j,with % probability 1−ς,Unif(m),else, (46)

for some appropriate small constant . The distribution of the new data given is thus given by

 ~Pl←(1−ς)Pl+ςUnif(m), (47)

which effectively bumps up to . We then propose to run Algorithm 1 using the new data and , leading to the following performance guarantee.

###### Theorem 5.

Take to be some sufficiently small constant, and suppose that Algorithm 1 operates upon and . Then Theorem 3 holds without Assumption 1.

###### Proof.

See Appendix A. ∎

### 3.4 Extension: large-m case

So far our study has focused on the case where the alphabet size does not scale with . There are, however, no shortage of situations where is so large that it cannot be treated as a fixed constant. The encouraging news is that Algorithm 1 appears surprisingly competitive for the large- case as well. Once again, we begin with the random corruption model, and our analysis developed for fixed immediately applies here.

###### Theorem 6.

Suppose that , , and for some sufficiently large constant . Then Theorem 1 continues to hold with probability at least 0.99, as long as (23) is replaced by

 π0>c3√npobs (48)

for some universal constant . Here, 0.99 is arbitrary and can be replaced by any constant between 0 and 1.

The main message of Theorem 6 is that the error correction capability of the proposed method improves as the number of unknowns grows. The quantitative bound (48) implies successful recovery even when an overwhelming fraction of the measurements are corrupted. Notably, when is exceedingly large, Theorem 6 might shed light on the continuous joint alignment problem. In particular, there are two cases worth emphasizing:

• When (e.g. ), the random corruption model converges to the following continuous spike model as scales:

 xi∈[0,1), 1≤i≤nandyi,j={xi−xj mod 1,with prob. π0,Unif(0,1),else,(i,j)∈Ω. (49)

This coincides with the setting studied in [Sin11, WS13] over the orthogonal group , under the name of synchronization [LYS16, Bou16, BBV16]

. It has been shown that the leading eigenvector of a certain data matrix becomes positively correlated with the truth as long as

when [Sin11]. In addition, a generalized power method—which is equivalent to projected gradient descent—provably converges to the solution of the nonconvex least-squares estimation, as long as the size of the noise is below some threshold [LYS16, Bou16]. When it comes to exact recovery, Wang et al. prove that semidefinite relaxation succeeds as long as [WS13, Theorem 4.1], a constant threshold irrespective of . In contrast, the exact recovery performance of our approach—which operates over a lifted discrete space rather than —improves with , allowing to be arbitrarily small when is sufficiently large. On the other hand, the model (49) is reminiscent of the more general robust PCA problem [CLMW11, CSPW11], which consists in recovering a low-rank matrix when a fraction of observed entries are corrupted. We have learned from the literature [GWL10, CJSC13] that perfect reconstruction is feasible and tractable even though a dominant portion of the observed entries may suffer from random corruption, which is consistent with our finding in Theorem 6.

• In the preceding spike model, the probability density of each measurement experiences an impulse around the truth. In a variety of realistic scenarios, however, the noise density might be more smooth rather than being spiky. Such smoothness conditions can be modeled by enforcing for all , so as to rule out any sharp jump. To satisfy this condition, we can at most take in view of Theorem 6. In some sense, this uncovers the “resolution” of our estimator under the “smooth” noise model: if we constrain the input domain to be the unit interval by letting represent the grid points , respectively, then the PPM can recover each variable up to a resolution of

 1m≍1√npobs. (50)

Notably, the discrete random corruption model has been investigated in prior literature [HG13, CGH14], with the best theoretical support derived for convex programming. Specifically, it has been shown by [CGH14] that convex relaxation is guaranteed to work as soon as . In comparison, this is more stringent than the recovery condition (48) we develop for the PPM by some logarithmic factor. Furthermore, Theorem 6 is an immediate consequence of a more general result:

Assume that ,