Structured††Accepted for presentation at Conference on Learning Theory (COLT) 2018 matrices with entries in the range and unknown permutations acting on their rows and columns arise in multiple applications, including estimation from pairwise comparisons [BT52, SBGW17] and crowd-labeling [DS79, SBW16b]
. Traditional parametric models[BT52, Luc59, Thu27, DS79] assume that these matrices are obtained from rank-one matrices via a known link function. Aided by tools such as maximum likelihood estimation and spectral methods, researchers have made significant progress in studying both statistical and computational aspects of these parametric models [HOX14, RA14, SBB16, NOS16, ZCZJ16, GZ13, GLZ16, KOS11b, LPI12, DDKR13, GKM11] and their low-rank generalizations [RA16, NOTX17, KOS11a].
There has been evidence from empirical studies (e.g., [ML65, BW97]) that real-world data is not always well-captured by such parametric models. With the goal of increasing model flexibility, a recent line of work has studied the class of permutation-based models [Cha15, SBGW17, SBW16b]. Rather than imposing parametric conditions on the matrix entries, these models impose only shape constraints on the matrix, such as monotonicity, before unknown permutations act on the its rows and columns. This more flexible class reduces modeling bias compared to its parametric counterparts while, perhaps surprisingly, producing models that can be estimated at rates that differ only by logarithmic factors from parametric models. On the negative side, these advantages of permutation-based models are accompanied by significant computational challenges. The unknown permutations make the parameter space highly non-convex, so that efficient maximum likelihood estimation is unlikely. Moreover, spectral methods are often suboptimal in approximating shape-constrained sets of matrices [Cha15, SBGW17]. Consequently, results from many recent papers show a non-trivial statistical-computational gap in estimation rates for models with latent permutations [SBGW17, CM16, SBW16b, FMR16, PWC17].
While the main motivation of our work comes from nonparametric methods for aggregating pairwise comparisons, we begin by discussing a few other lines of related work. The current paper lies at the intersection of shape-constrained estimation and latent permutation learning. Shape-constrained estimation has long been a major topic in nonparametric statistics, and of particular relevance to our work is the estimation of a bivariate isotonic matrix without latent permutations [CGS18]. There, it was shown that the minimax rate of estimating an matrix from noisy observations of all its entries is . The upper bound is achieved by the least squares estimator, which is efficiently computable due to the convexity of the parameter space.
Shape-constrained matrices with permuted rows or columns also arise in applications such as seriation [FJBd13, FMR16] and feature matching [CD16]. In particular, the monotone subclass of the statistical seriation model [FMR16] contains matrices that have increasing columns, and an unknown row permutation. The authors established the minimax rate for estimating matrices in this class and proposed a computationally efficient algorithm with rate . For the subclass of such matrices where in addition, the rows are also monotone, the results of the current paper improve the two rates to and respectively.
Another related model is that of noisy sorting [BM08], which involves a latent permutation but no shape-constraint. In this prototype of a permutation-based ranking model, we have an unknown, matrix with constant upper and lower triangular portions whose rows and columns are acted upon by an unknown permutation. The hardness of recovering any such matrix in noise lies in estimating the unknown permutation. As it turns out, this class of matrices can be estimated efficiently at minimax optimal rate by multiple procedures: the original work by Braverman and Mossel [BM08] proposed an algorithm with time complexity for some unknown and large constant , and recently, an -time algorithm was proposed by Mao et al. [MWR17]. These algorithms, however, do not generalize beyond the noisy sorting class, which constitutes a small subclass of an interesting class of matrices that we describe next.
The most relevant body of work to the current paper is that on estimating matrices satisfying the strong stochastic transitivity condition, or SST for short. This class of matrices contains all
bivariate isotonic matrices with unknown permutations acting on their rows and columns, with an additional skew-symmetry constraint. The first theoretical study of these matrices was carried out by Chatterjee[Cha15], who showed that a spectral algorithm achieved the rate in the normalized Frobenius norm. Shah et al. [SBGW17] then showed that the minimax rate of estimation is given by , and also improved the analysis of the spectral estimator of Chatterjee [Cha15] to obtain the computationally efficient rate . In follow-up work [SBW16a], they also showed a second estimator based on the Borda count that achieved the same rate, but in near-linear time. In related work, Chatterjee and Mukherjee [CM16] analyzed a variant of the estimator, showing that for subclasses of SST matrices, it achieved rates that were faster than . In a complementary direction, a superset of the current authors [PMM17] analyzed the estimation problem under an observation model with structured missing data, and showed that for many observation patterns, a variant of the estimator was minimax optimal.
Shah et al. [SBW16a] also showed that conditioned on the planted clique conjecture, it is impossible to improve upon a certain notion of adaptivity of the estimator in polynomial time. Such results have prompted various authors [FMR16, SBW16a] to conjecture that a similar statistical-computational gap also exists when estimating SST matrices in the Frobenius norm.
Our main contribution in the current work is to tighten the aforementioned statistical-computational gap. More precisely, we study the problem of estimating a bivariate isotonic matrix with unknown permutations acting on its rows and columns, given noisy, partial observations of its entries; this matrix class strictly contains the SST model [Cha15, SBGW17] for ranking from pairwise comparisons. As a corollary of our results, we show that when the underlying matrix has dimension and noisy entries are observed, our polynomial-time, two-dimensional sorting algorithm provably achieves the rate of estimation in the normalized Frobenius norm; thus, this result breaks the previously mentioned barrier [SBGW17, CM16]. Although the rate still differs from the minimax optimal rate , our algorithm is, to the best of our knowledge, the first efficient procedure to obtain a rate faster than uniformly over the SST class. This guarantee, which is stated in slightly more technical terms below, can be significant in practice (see Figure 1).
Main theorem (informal)
There is an estimator computable in time such that for any SST matrix , given Bernoulli observations of its entries, we have
Our algorithm is novel in the sense that it is neither spectral in nature, nor simple variations of the Borda count estimator that was previously employed. Our algorithm takes advantage of the fine monotonicity structure of the underlying matrix along both dimensions, and this allows us to prove tighter bounds than before. In addition to making algorithmic contributions, we also briefly revisit the minimax rates of estimation.
In Section 2, we formally introduce our estimation problem. Section 3 contains statements and discussions of our main results, and in Section 4, we describe in detail how the estimation problem that we study is connected to applications in crowd-labeling and ranking from pairwise comparisons. We provide the proofs of our main results in Section 5.
For a positive integer , let . For a finite set , we use to denote its cardinality. For two sequences and , we write if there is a universal constant such that for all . The relation is defined analogously. We use to denote universal constants that may change from line to line. We use, the notation
to denote the binomial distribution withtrials and success probability , and the notation
to denote the Poisson distribution with parameter. Given a matrix , its -th row is denoted by
. For a vector, define its variation as . Let denote the set of all permutations . Let denote the identity permutation, where the dimension can be inferred from context.
2 Background and problem setup
In this section, we present the relevant technical background and notation on permutation-based models, and introduce the observation model of interest.
2.1 Matrix models
Our main focus is on designing efficient algorithms for estimating a bivariate isotonic matrix with unknown permutations acting on its rows and columns. Formally, we define to be the class of matrices in with nondecreasing rows and nondecreasing columns. For readability, we assume throughout that unless otherwise stated; our results can be straightforwardly extended to the other case. Given a matrix and permutations and , we define the matrix by specifying its entries as
Also define the class as the set of matrices that are bivariate isotonic when viewed along the row permutation and column permutation , respectively.
The class of matrices that we are interested in estimating contains bivariate isotonic matrices whose rows and columns are both acted upon by unknown permutations:
2.2 Observation model
In order to study estimation from noisy observations of a matrix in the class , we suppose that noisy entries are sampled independently and uniformly with replacement from all entries of . This sampling model is popular in the matrix completion literature, and is a special case of the trace regression model [NW12, KLT11]. It has also been used in the context of permutation models by Mao et al. [MWR17] to study the noisy sorting class.
More precisely, let denote the matrix with in the -th entry and elsewhere, and suppose that
is a random matrix sampled independently and uniformly from the set. We observe independent pairs from the model
where the observations are contaminated by independent, centered, sub-Gaussian noise
with variance parameter. Of particular interest is the noise model considered in applications such as crowd-labeling and ranking from pairwise comparisons. Here our samples take the form
and consequently, the sub-Gaussian parameter is bounded; for a discussion of other regimes of noise in a related matrix model, see Gao [Gao17].
For analytical convenience, we employ the standard trick of Poissonization, whereby we assume throughout the paper that random samples are drawn according to the trace regression model (1). Upper and lower bounds derived under this model carry over with loss of constant factors to the model with exactly samples; for a detailed discussion, see Appendix B.
For notational convenience, denote the probability that an entry of the matrix is observed under Poissonized sampling by . Since we assume throughout that , it can be verified that .
Now given observations , let us define the matrix of observations , with entry given by
In words, the rescaled entry is the average of all the noisy realizations of that we have observed, or zero if the entry goes unobserved. Note that so that . Moreover, we may write the model in the linearized form , where is a matrix of additive noise having independent, zero-mean, sub-Gaussian entries.
3 Main results
In this section, we present our main results—we begin by briefly revisiting the fundamental limits of estimation, and then introduce our algorithms in Section 3.2. We assume throughout this section that as per the setup, we have and .
3.1 Statistical limits of estimation
We begin by characterizing the fundamental limits of estimation under the trace regression observation model (1) with observations. We define the least squares estimator over the class of matrices as the projection
The projection is a non-convex problem, and is unlikely to be computable exactly in polynomial time. However, studying this estimator allows us to establish a baseline that characterizes the best achievable statistical rate. The following theorem characterizes its risk up to a logarithmic factor in the dimension; recall the shorthand .
For any matrix , we have
|with probability at least .|
Additionally, under the Bernoulli observation model (2), any estimator satisfies
The factor appears in the upper bound instead of the noise variance because even if the noise is zero, there are missing entries. The theorem characterizes the minimax rate of estimation for the class up to a logarithmic factor.
3.2 Efficient algorithms
Next, we propose polynomial-time algorithms for estimating the permutations and the matrix . Our main algorithm relies on two distinct steps: first, we estimate the unknown permutations, and then project onto the class of matrices that are bivariate isotonic when viewed along the estimated permutations. The formal meta-algorithm is described below.
Algorithm 1 (meta-algorithm)
Step 0: Split the observations into two disjoint parts, each containing observations, and construct the matrices and .
Step 1: Use to obtain the permutation estimates .
Step 2: Return the matrix estimate
Owing to the convexity of the set , the projection operation in Step 2 of the algorithm can be computed in near linear time [BDPR84, KRS15]. The following result, a slight variant of Proposition 4.2 of Chatterjee and Mukherjee [CM16], allows us to characterize the error rate of any such meta-algorithm as a function of the permutation estimates .
Suppose that where and are unknown permutations in and respectively. Then with probability at least , we have
The first term on the right hand side of the bound (5) corresponds to an estimation error, if the true permutations and were known a priori, and the latter two terms correspond to an approximation error that we incur as a result of having to estimate these permutations from data. Comparing the bound (5) to the minimax lower bound (4b), we see that up to a logarithmic factor, the first term of the bound (5) is unavoidable, and so we can restrict our attention to obtaining good permutation estimates . We now present our main permutation estimation procedure that can be plugged into Step 1 of this meta-algorithm.
3.2.1 Two-dimensional sorting
Since the underlying matrix of interest is individually monotonic along each dimension, the row and column sums provide noisy information about the respective unknown permutations. Consequently, variants of such procedures are popular in the literature [CM16, FMR16]. However, such a procedure does not take simultaneous advantage of the fact that the underlying matrix is monotonic in both dimensions. To improve upon simply sorting row (resp. column) sums, we propose an algorithm that first sorts the columns (resp. rows) of the matrix approximately, and then exploits this approximate ordering to sort the rows (resp. columns) of the matrix.
We need more notation to facilitate the description of the
algorithm. For a partition
of the set 111 is a partition of if and for , we group the columns of a matrix into blocks according to their indices in , and refer to as a partition or blocking of the columns of .
Given a data matrix , the following blocking subroutine returns a column partition . In the main algorithm, partial row sums are computed on indices contained in each block.
Subroutine 1 (blocking)
Step 1: Compute the column sums of the matrix as
Let be the permutation along which the sequence is nondecreasing.
Step 2: Set and . Partition the columns of into blocks by defining
Note that each block is contiguous when the columns are permuted by .
Step 3 (aggregation): Set . Call a block “large” if and “small” otherwise. Aggregate small blocks in while leaving the large blocks as they are, to obtain the final partition .
More precisely, consider the matrix having nondecreasing column sums and contiguous blocks. Call two small blocks “adjacent” if there is no other small block between them. Take unions of adjacent small blocks to ensure that the size of each resulting block is in the range . If the union of all small blocks is smaller than , aggregate them all.
Return the resulting partition .
The threshold is a chosen to be a high probability bound on the perturbation of any column sum. In particular, this ensures that we obtain blocks containing columns that are close when the matrix is ordered according to the correct permutation. Computing partial row sums within each block then provides more refined information about the underlying row permutation than simply computing full row sums, and this intuition underlies the two-dimensional sorting algorithm to follow. As a technical detail, it is important to note that Step 3 aggregates small blocks into large enough ones to reduce noise in these partial row sums. We are now in a position to describe the two-dimensional sorting algorithm.
Algorithm 2 (two-dimensional sorting)
Step 0: Split the observations into two independent subsamples of equal size, and form the corresponding matrices and according to equation (3).
Step 1: Apply Subroutine 1 to the matrix to obtain a partition of the columns. Let be the number of blocks in .
Step 2: Using the second sample , compute the row sums
and the partial row sums within each block
Create a directed graph with vertex set , where an edge is present if either
Step 3: Compute a topological sort of the graph ; if none exists, set .
Step 4: Repeat Steps 1–3 with replacing for , the roles of and switched, and the roles of and switched, to compute the permutation estimate .
Step 5: Return the permutation estimates .
Recall that a permutation is called a topological sort of if for every directed edge . The construction of the graph in Step 2 dominates the computational complexity, and takes time . We have the following guarantee for the two-dimensional sorting algorithm.
For any matrix , we have
with probability at least .
In particular, setting , we have proved that our efficient estimator enjoys the rate
which is the main theoretical guarantee established in this paper for permutation-based models.
We now discuss in detail how the matrix models studied in this paper arise in practice. The class was studied as a permutation-based model for crowd-labeling [SBW16b] in the case of binary questions, and was proposed as a strict generalization of the classical Dawid-Skene model [DS79, KOS11b, LPI12, DDKR13, GKM11]. Here there is a set of questions of a binary nature; the true answer to these questions can be represented by a vector , and our goal is to estimate this vector by asking these questions to workers on a crowdsourcing platform. A key to this problem is being able to model the probabilities with which workers answer questions correctly, and we do so by collecting these probabilities within a matrix . Assuming that workers have a strict ordering of their abilities, and that questions have a strict ordering of their difficulties, the matrix is bivariate isotonic when the rows are ordered in increasing order of worker ability, and columns are ordered in decreasing order of question difficulty. However, since worker abilities and question difficulties are unknown a priori, the matrix of probabilities obeys the inclusion .
In the calibration problem, we would like to ask questions whose answers we know a priori, so that we can estimate worker abilities and question difficulties, or more generally, the entries of the matrix . This corresponds to estimating matrices in the class from noisy observations of their entries, whose rate of estimation is our main result.
A subclass of specializes to the case , and also imposes an additional skew symmetry constraint. More precisely, define analogously to the class , except with matrices having columns that are nonincreasing instead of nondecreasing. Also define the class , and the strong stochastic transitivity class
The class is useful as a model for estimation from pairwise comparisons [Cha15, SBGW17], and was proposed as a strict generalization of parametric models for this problem [BT52, NOS16, RA14]. In particular, given items obeying some unknown underlying ranking , entry of a matrix represents the probability with which item beats item in a pairwise comparison between them. The shape constraint encodes the transitivity condition that for all triples obeying , we must have
For a more classical introduction to these models, see the papers [Fis73, ML65, BW97] and the references therein. Our task is to estimate the underlying ranking from results of passively chosen pairwise comparisons222Such a passive, simultaneous setting should be contrasted with the active case (e.g., [HSRW16, FOPS17, AAAK17]), where we may sequentially choose pairs of items to compare depending on the results of previous comparisons. between the items, or more generally, to estimate the underlying probabilities that govern these comparisons333Accurate, proper estimates of translate to accurate estimates of the ranking (see Shah et al. [SBGW17]).. All the results we obtain in this work clearly extend to the class with minimal modifications; for example, either of the two estimates or may be returned as an estimate of the permutation . Consequently, the informal theorem stated in the introduction is an immediate corollary of Theorem 2 once these modifications are made to the algorithm.
Throughout the proofs, we assume without loss of generality that . Because we are interested in rates of estimation up to universal constants, we assume that each independent subsample contains observations (instead of or ). We use the shorthand , throughout.
5.1 Some preliminary lemmas
Before turning to the proof of Theorems 1 and 2, we provide three lemmas that underlie many of our arguments. The first lemma can be readily distilled from the proof of Theorem 5 of Shah et al. [SBGW17] with slight modifications. It is worth mentioning that similar lemmas characterizing the estimation error of a bivariate isotonic matrix were also proved by [CGS18, CM16].
Lemma 1 ([Sbgw17]).
Let , and let . Assume that our observation model takes the form , where the noise matrix satisfies the properties
the entries are independent, centered,
-sub-Gaussian random variables;
the second moments are bounded asfor all .
Then the least squares estimator satisfies
Moreover, the same result holds if the class is replaced by the class .
For any nonempty subset , it holds that
where is a -sub-Gaussian noise matrix with independent entries. Consequently, we can express the noise on each entry as where are independent, zero-mean random variables given by
and are independent, zero-mean random variables such that
We control the two separately. First, we have and the variance of each is bounded by . Hence Bernstein’s inequality for bounded noise yields
Taking and recalling that , we obtain
In order to control the deviation of the sum of , we note that the -th moment of is bounded by Then another version of Bernstein’s inequality [BLM13] yields
and setting gives
Combining the above two deviation bounds completes the proof. ∎
The last lemma is a deterministic result.
Let be a nondecreasing sequence of real numbers. If is a permutation in such that whenever where , then for all .
Suppose that for some index . Since is a bijection, there must exist an index such that . However, we then have , which contradicts the assumption. A similar argument shows that also leads to a contradiction. Therefore, we obtain that for every . ∎
With these lemmas in hand, we are now ready to prove our main theorems.
5.2 Proof of Theorem 1
We split the proof into two parts by proving the upper and lower bounds separately.
5.2.1 Proof of upper bound
The upper bound follows from Lemma 1 once we check the conditions on the noise for our model. We have seen in the proof of Lemma 2 that the noise on each entry can be written as . Again, and are -sub-Gaussian and -sub-Gaussian respectively, and have variances bounded by and respectively. Hence the conditions on in Lemma 1 are satisfied. Then we can apply the lemma, recall the relation and normalize the bound by to complete the proof.
5.2.2 Proof of lower bound
The lower bound follows from an application of Fano’s lemma. The technique is standard, and we briefly review it here. Suppose we wish to estimate a parameter over an indexed class of distributions in the square of a (pseudo-)metric . We refer to a subset of parameters as a local -packing set if
Note that this set is a -packing in the metric with the average KL-divergence bounded by . The following result is a straightforward consequence of Fano’s inequality:
Lemma 4 (Local packing Fano lower bound).
For any -packing set of cardinality , we have
for some fixed tuple of constants . We use this guarantee to design a packing of matrices in the class . For each , fix some to be precisely set later, and define the matrix having identical columns, with entries given by
Clearly, each of these matrices is a member of the class , and each distinct pair of matrices satisfies the inequality .
denote the probability distribution of the observations in the model (2) with underlying matrix
. Our observations are independent across entries of the matrix, and so the KL divergence tensorizes to yield
Let us now examine one term of this sum. We observe samples of entry ; conditioned on the event , we have the distributions
Consequently, the KL divergence conditioned on is given by
where we have used to denote the KL divergence between the Bernoulli random variables and .
Note that for , we have