1 Introduction
2 The matrix-variate Gaussian process
The matrix-variate Gaussian process (MVGP) is a collection of random variables defined by their joint distribution for finite index sets. Let
be the set representing the rows (tasks) and be the set representing the columns (examples), with sizes and . Let , where denotes the MVGP with mean function and covariance function . As with the scalar GP, the MVGP is completely specified by its mean function and its covariance function. These are defined as: where is the expected value. For a finite index set , define the matrix such that , then is distributed as a multivariate Gaussian with mean and covariance matrix , i.e., , where , , and , .
The covariance function of the prior MVGP is assumed to have Kronecker product structure [11, 7]. The Kronecker product prior covariance captures the assumption that the prior covariance between matrix entries can be decomposed as the product of the row and column covariances. The Kronecker prior covariance assumption is a useful restriction as: (i) it improves computational tractability, enabling the model to scale to larger problems than may be possible with a full joint prior covariance, (ii) the regularity imposed by the separability assumption improves the reliability of parameter estimates even with significant data sparsity, e.g., when the observed data consists of a single matrix (sub)sample, and (iii) row-wise and column-wise prior covariance functions are often the only prior information available. A closely related concept is kernel MTL with separable kernels. This is a special case of vector-valued regularized function estimation where the joint kernel decomposes into the product of the row kernel and the column kernel [6, 7, 25]. Learning in these models is analogous to inference for the MVGP, with the prior row (resp. column) covariance matrix used as the row (resp. column) kernel.
Define the row covariance (kernel) function and the column covariance function . The joint covariance function of the MVGP with Kronecker covariance decomposes into product form as , or equivalently, . Hence, for the random variable and a finite index set , , where is the row covariance matrix and is the column covariance matrix.
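As an illustration, a Kronecker-structured joint covariance can be assembled from row and column kernel matrices. The sketch below uses a squared-exponential kernel and random features purely for demonstration (all names are illustrative); it checks the separability property that the covariance between two matrix entries is the product of the row and column covariances.

```python
import numpy as np

def rbf(X, Y, lengthscale=1.0):
    """Squared-exponential kernel; an illustrative choice of covariance."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
rows = rng.normal(size=(4, 2))   # row (task) features
cols = rng.normal(size=(5, 3))   # column (example) features

K_r = rbf(rows, rows)            # row covariance matrix (4 x 4)
K_c = rbf(cols, cols)            # column covariance matrix (5 x 5)
K = np.kron(K_r, K_c)            # joint covariance over vec(F), (20 x 20)

# Separability: Cov(F[i, j], F[k, l]) = K_r[i, k] * K_c[j, l]
i, j, k, l = 1, 2, 3, 4
n = cols.shape[0]
assert np.isclose(K[i * n + j, k * n + l], K_r[i, k] * K_c[j, l])
```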
This definition also extends to finite subsets that are not complete matrices. Given any finite subset , where , the vector is distributed as . The vector is arranged from the entries of the mean matrix corresponding to the set , and is the covariance matrix evaluated only on pairs .
Our goal is to estimate an unknown response matrix with rows and columns. We assume observed data consisting of a subset of the matrix entries collected into a vector. Note that ; hence, the data represents a partially observed matrix. Our generative assumption proceeds as follows (see Fig. 1):

Draw the function from a zero mean MVGP .

Given , draw each observed response independently: .
Hence, with entries may be interpreted as the latent noise-free matrix. The inference task is to estimate the posterior distribution , where . The posterior distribution is again a Gaussian process, i.e., , with mean and covariance functions:
(1)  
(2)  
where corresponds to the vector of responses for all training data indexes . The covariance function corresponds to the sampled covariance matrix between the index and all training data indexes , is the covariance matrix between all pairs , and is the identity matrix. The closed form follows directly from the definition of an MVGP as a scalar GP [9] with appropriately vectorized variables. The model complexity scales with the number of observed samples . Storing the kernel matrix requires memory, and the naïve inference implementation requires computation.
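A minimal numerical sketch of the closed-form posterior in (1) and (2) under this vectorized view, using random PSD matrices as stand-ins for kernel evaluations and an assumed noise variance (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 5

# Random PSD row/column covariances standing in for kernel matrices.
A = rng.normal(size=(m, m)); K_r = A @ A.T + m * np.eye(m)
B = rng.normal(size=(n, n)); K_c = B @ B.T + n * np.eye(n)
K = np.kron(K_r, K_c)            # prior covariance of vec(F)
sigma2 = 0.1                     # observation noise variance (assumed)

obs = rng.choice(m * n, size=10, replace=False)  # observed vec() indexes
y = rng.normal(size=obs.size)                    # observed responses

K_oo = K[np.ix_(obs, obs)]                       # covariance among observations
L = K_oo + sigma2 * np.eye(obs.size)
K_xo = K[:, obs]                 # covariance between all entries and observations

post_mean = K_xo @ np.linalg.solve(L, y)          # posterior mean, cf. (1)
post_cov = K - K_xo @ np.linalg.solve(L, K_xo.T)  # posterior covariance, cf. (2)
```

Note the cubic cost in the number of observations from the linear solve, matching the complexity discussion above.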
Although the matrix-variate Gaussian process approach results in closed form inference, it does not model low-rank matrix structure. The factor GP is a hierarchical Gaussian process model that attempts to address this deficiency in the MVGP [24, 23]. With a fixed model rank , the generative model for the factor GP is as follows (see Fig. 2):

For each , draw row functions: . Let with entries .

For each , draw column functions: . Let with .

Draw each matrix entry independently: .
where is the row of , and is the row of . The maximum a posteriori (MAP) estimates of and can be computed as the solution of the following optimization problem:
(3) 
where is the trace of the matrix . However, the joint posterior distribution of and the distribution of are not Gaussian, and the required expectations and posterior distributions are quite challenging to characterize. A Laplace approximation proposed by [23] and [24] utilized sampling techniques. Statistically, the factor GP may be interpreted as the sum of rank-one factor matrices. Hence, as the rank , the law of large numbers can be used to show that the distribution of converges to [23].

2.1 Spectral norms of compact operators
The mean function of the MVGP is an element of the Hilbert space defined by the kernels (covariances). We provide a brief overview of the background required to define this representation and the relevant spectral norms of compact operators. We will focus on the MVGP with Kronecker prior covariance. Our exposition is closely related to the approach outlined in [27]. Further details may be found in [28].
Let denote the Hilbert space of functions induced by the row kernel . Similarly, let denote the Hilbert space of functions induced by the column kernel . Let and define (possibly infinite-dimensional) feature vectors. The mean function of the MVGP is defined by a linear map , i.e., the bilinear form on given by:
Let denote the set of compact bilinear operators mapping . A compact linear operator admits a spectral decomposition [27] with singular values given by . The trace norm is given by the ℓ1 norm on the spectrum of :
(4) 
When the dimensions are finite, (4) is the trace norm of the matrix . This norm has been widely applied to several machine learning tasks, including multitask learning [29, 30] and recommender systems [31]. In addition to the trace norm, a common regularizer is the induced Hilbert norm given by the ℓ2 norm on the spectrum of :

(5)
(5) is equivalent to the matrix Frobenius norm for finite-dimensional , computed as:
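In finite dimensions both spectral norms are easy to evaluate from the singular values; a short sketch (with a random matrix purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(6, 4))
s = np.linalg.svd(M, compute_uv=False)   # spectrum of M

trace_norm = s.sum()                     # ell-1 norm of the spectrum, cf. (4)
hilbert_norm = np.sqrt((s**2).sum())     # ell-2 norm of the spectrum, cf. (5)

# These agree with the matrix nuclear and Frobenius norms respectively.
assert np.isclose(trace_norm, np.linalg.norm(M, 'nuc'))
assert np.isclose(hilbert_norm, np.linalg.norm(M, 'fro'))
```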
Let represent the loss function for a finite set of training data points and be a spectral regularizer. We define the regularized risk functional: where is the regularization constant. A representer theorem exists, i.e., the function that optimizes the regularized risk can be represented as a finite weighted sum of the kernel functions evaluated on training data [27]. Employing this representer theorem, the optimizing function can be computed as:
(6) 
where is a parameter matrix, is the kernel matrix evaluated between and , i.e., the row of , and is the kernel matrix evaluated between and all .
Computing the norms: The Hilbert norm can be computed as:
(7) 
The trace norm can be computed using a basis transformation approach [27] or by using the low-rank “variational” approximation [27, 30].
Basis transformation: With the index set fixed, define bases and such that and . One such basis is the square root of the kernel matrix and . When the feature space is finite dimensional, the feature matrices and are also an appropriate basis. The mean function can be reparametrized as , where . Now, the trace norm is given directly by the trace norm of the parameter matrix, i.e., .
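The kernel square root used as a basis above can be computed by eigendecomposition; a sketch, where the kernel matrix is a random PSD stand-in and all names are illustrative:

```python
import numpy as np

def psd_sqrt(K):
    """Symmetric square root of a PSD kernel matrix via eigendecomposition;
    usable as the basis B in the reparametrization of the mean function."""
    w, V = np.linalg.eigh(K)
    return V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T

rng = np.random.default_rng(3)
A = rng.normal(size=(5, 5))
K_r = A @ A.T + np.eye(5)   # random PSD stand-in for the row kernel matrix
B_r = psd_sqrt(K_r)

# B_r is a valid basis: B_r B_r^T recovers the kernel matrix.
assert np.allclose(B_r @ B_r, K_r)
```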
Low-rank “variational” approximation: The trace norm can also be computed using a low-rank factorization. This is sometimes known as the variational approximation of the trace norm [30].
(8) 
where . This approach is exact when is larger than the true rank of . Note that this is the same regularization that is required for MAP inference with the factor GP model (3). Hence, when is sufficiently large, the regularizer in the factor GP model is the trace norm. Unfortunately, it is difficult to select an appropriate rank a priori, and no such claims exist when is insufficiently large. With finite dimensions, the variational approximation of the trace norm is given by:
where and . This sum of factor norms has proven effective for the regularization of matrix factorization models and other latent factor problems [32].
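The variational characterization can be checked numerically: factoring a matrix through its SVD, with the singular values split evenly between the two factors, attains the bound (8) with equality. A sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.normal(size=(6, 4))

# Factor M = U V^T through the SVD, splitting sqrt(s) between the factors.
Us, s, Vt = np.linalg.svd(M, full_matrices=False)
U = Us * np.sqrt(s)          # broadcast sqrt(s) over columns
V = Vt.T * np.sqrt(s)

assert np.allclose(U @ V.T, M)

# This factorization attains the variational bound with equality:
half_sum = 0.5 * (np.linalg.norm(U, 'fro')**2 + np.linalg.norm(V, 'fro')**2)
assert np.isclose(half_sum, np.linalg.norm(M, 'nuc'))
```

For any other factorization of the same rank, the half-sum of squared Frobenius norms upper-bounds the trace norm, which is why the minimum in (8) is exact when the rank is sufficient.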
3 Trace norm constrained inference for the MVGP
A generative model for low-rank matrices has proven to be a challenging problem. We are unaware of any (non-hierarchical) distributions in the literature that generate sample matrices of low rank. Hierarchical models have been proposed, but such models introduce issues such as non-convexity and non-identifiability of parameter estimates. Instead of seeking a generative model for low-rank matrices, we propose a variational inference approach. We constrain the inference of the MVGP such that the expected value of the approximate posterior distribution has a constrained trace norm (and is generally of low rank). In contrast to standard variational inference, this constraint is enforced in order to extract structure.
The goal of inference is to estimate the posterior distribution . We propose approximate inference using the log likelihood lower bound [33]:
(9) 
Our approach is to restrict the search to the space of Gaussian processes subject to a trace norm constraint as defined in (4). Without loss of generality, we assume a set of rows and columns of interest so . Let be the matrix of hidden variables.
Given finite indexes, is a Gaussian distribution where , , and . Let be a permutation matrix such that is the covariance matrix of the subset of observed entries , and is the prior covariance of the corresponding subset of entries. Evaluating expectations, the lower bound (9) results in the following inference cost function (omitting terms independent of and ):
where is the determinant of matrix .
First, we compute gradients with respect to . After setting the gradients to zero, we compute:
(10) 
The second equality is a consequence of the matrix inversion lemma. Interestingly, (3) is identical to the posterior covariance of the unconstrained MVGP (2).
Next, we collect the terms involving the mean. This results in the optimization problem:
(11) 
This is a convex regularized least squares problem with a convex constraint set. Hence, (11) is convex, and is unique. Using the Kronecker identity, we can rewrite the cost function in parameter matrix form. We can also replace the trace constraint with the equivalent regularizer weighted by . Multiplying through by leads to the equivalent cost:
(12) 
Applying the representation (2.1), we recover the parametric form of the mean function as where . We may also solve for directly:
(13) 
where is the mean function corresponding to the parameter (see (2.1)). The representation of the mean function in functional form is useful for avoiding repeated optimization when testing a trained model with different evaluation sets.
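Objectives of this form, a least squares data term plus a trace norm penalty as in (12), are commonly minimized by proximal gradient descent with singular value thresholding. The sketch below solves a plain matrix completion instance; it omits the kernel weighting and the functional representation discussed above, and all names are illustrative:

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the proximal operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(5)
m, n, r = 8, 6, 2
F_true = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))  # low-rank target
mask = rng.random((m, n)) < 0.6                             # observed entries
Y = np.where(mask, F_true, 0.0)

lam, step = 0.5, 1.0   # regularization weight and step size (1/Lipschitz)
F = np.zeros((m, n))
for _ in range(500):
    grad = mask * (F - Y)                  # gradient of 0.5 * ||P(F) - y||^2
    F = svt(F - step * grad, step * lam)   # proximal step on the trace norm
```

Each iteration costs one SVD; the penalty drives small singular values to zero, so the iterate is generally low rank.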
The approximate posterior distribution is itself a finite index set representation of an underlying Gaussian process.
Theorem 1.
The posterior distribution is the finite index set representation of the Gaussian process where the mean function is given by (3) and the covariance function is given by (2). is the unique posterior distribution that maximizes the lower bound of the log likelihood (9) subject to the trace constraint .
Sketch of proof: Uniqueness of the solution follows from (9), which is jointly convex in . To show that the posterior distribution is a Gaussian process, we simply need to show that for a fixed training set , the posterior distribution of the superset has the same mean function and covariance function. These follow directly from the solution of (3) and from (2) (see (3)). In addition to showing uniqueness, Theorem 1 shows how the trained model can be extended to evaluate the posterior distribution of data points not in training.
In the case where a basis for and can be found, (11) may be solved using the matrix trace norm approach directly (see section 2.1):
(14) 
where is the basis for and is the basis for . is the estimated parameter matrix. The mean function is then given by .
Spectral elastic net regularization: The regularization that results from the constrained inference has an interesting interpretation as the spectral elastic net norm. As discussed in section 2.1, the mean function may be represented as for and . The spectral elastic net is given as a weighted sum of the ℓ2 and ℓ1 norms on the spectrum :
(15) 
where and are weighting constants. The naming is intentionally suggestive of the analogy to the elastic net regularizer [26]. The elastic net regularizer is a weighted sum of the ℓ2 and ℓ1 norms of the parameter vector in a linear model. The elastic net is a tradeoff between smoothness, encouraged by the ℓ2 norm, and sparsity, encouraged by the ℓ1 norm. The elastic net is particularly useful when learning with correlated features. The spectral elastic net has similar properties. The Hilbert norm encourages smoothness over the spectrum, while the trace norm encourages spectral sparsity, i.e., low rank. To the best of our knowledge, this combination of norms is novel, both in the matrix estimation literature and in the kernelized MTL literature. When the dimensions are finite, (15) is given by a weighted sum of the trace norm and the Frobenius norm of the parameter matrix.
We propose a parametrization of the mean function inference inspired by the elastic net [26]. Let and where and . The loss function (3) can be parametrized as:
(16) 
The same parametrization can also be applied to the equivalent representations given in (3) and (3). This spectral elastic net parametrization clarifies the tradeoff between the trace norm and the Hilbert norm. The trace norm is recovered when , and the Hilbert norm is recovered for . The spectral elastic net approach is also useful for speeding up the computation with warm-start, i.e., for a fixed , we may employ warm-start for decreasing values of . Computation of the spectral elastic net norm follows directly from the Hilbert and trace norms. From the variational approximation of the trace norm (8), it is clear that MAP inference for the factor GP (3) is equivalent to inference for the mean of the trace constrained MVGP (3) in the special case where (assuming that the nonconvex optimization (3) achieves the global maximum).
Nonzero mean prior: To simplify the explanation, we have assumed so far that the prior Gaussian process has a zero mean. The nonzero mean case is a straightforward extension [9]. We include a short discussion for completeness. Let represent the mean parameter of the Gaussian process prior, i.e., . The posterior covariance estimate remains the same, and the posterior mean computation requires the same optimization, but with the observation replaced by . The resulting posterior mean must then be shifted by the bias, i.e., . If desired, this parameter may be easily estimated. Suppose we choose to model a row-wise bias. Let , then solving the straightforward optimization, we find that the row bias estimate is given by:
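As a sketch, one natural form of this estimator averages the residuals over each row's observed entries; the exact expression depends on notation omitted above, so this form is an assumption (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 5, 7
Y = rng.normal(size=(m, n))      # responses (full matrix for simplicity)
mask = rng.random((m, n)) < 0.5  # which entries are actually observed
F = rng.normal(size=(m, n))      # current posterior mean estimate

# Row-wise bias: mean residual over each row's observed entries
# (assumed form; rows with no observations get a zero bias).
counts = mask.sum(axis=1)
bias = np.where(counts > 0,
                (mask * (Y - F)).sum(axis=1) / np.maximum(counts, 1),
                0.0)
```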
4 Bipartite ranking
Bipartite ranking is the task of learning an ordering for items drawn from two sets, known as the positive set and the negative set, such that the items in the positive set are ranked ahead of the items in the negative set [1, 2, 3, 4]. Many models for bipartite ranking attempt to optimize the pairwise misclassification cost, i.e., the model is penalized for each pair of data points where the positive labeled item is ranked lower than the negative labeled item. Although this approach has proven effective, the required computation is quadratic in the number of items. This quadratic computation cost limits the applicability of pairwise bipartite ranking to large scale problems.
More recently, researchers have shown that it may be sufficient to optimize a classification loss, such as the exponential loss or the logistic loss, directly to solve the bipartite ranking problem [3, 4]. This is also known as the pointwise approach in the ranking literature. In contrast to the pointwise and pairwise approaches, we propose a listwise bipartite ranking model. The listwise approach learns a ranking model for the entire set of items and has gained prominence in the learning to rank literature [34, 35] as it comes with strong theoretical guarantees and has been shown to have superior empirical performance.
Our approach is inspired by monotone retargeting (MR) [35], a recent method for adapting regression to ranking tasks. Although many ranking models are trained to predict the relevance scores, there is no need to fit scores exactly. Any scores that induce the correct ordering will suffice. MR jointly optimizes over monotonic transformations of the target scores and an underlying regression function (see Fig. 3). We will show that maximum likelihood parameter estimation in the proposed model is equivalent to learning the target scores in MR. Though we show this equivalence for the special case of bipartite ranking with squared loss, the relation holds for more general Bregman divergences and ranking tasks. This extension is beyond the scope of this paper. In addition to improving performance, MR has favorable statistical and optimization-theoretic properties, particularly when combined with a Bregman divergence such as squared loss. To the best of our knowledge, ours is the first generative model for listwise bipartite ranking.
4.1 Background
Let , and let be the set of binary isotonic vectors (binary vectors in sorted order), i.e., any satisfies and . Similarly, let be the set of real valued isotonic vectors, i.e., any satisfies and , then satisfies partial order. We state that satisfies total order or strict isotonicity when the ordering is a strict inequality, i.e., . We denote a vector in sorted order as .
Compatibility is a useful concept for capturing the match between the sorted order of two vectors.
Definition 2 (Compatibility [34]).
is compatible with the sorted order of (denoted as ) if for every pair of indexes , implies .
Compatibility is an asymmetric relationship, i.e., . It follows that sorted vectors always satisfy compatibility, i.e., if and are two sorted vectors, then by definition 2, . Compatibility is straightforward to check when the target vector is binary. Let and , and let be the number of ’s in the sorted vector , i.e., the threshold for transition between and . Then implies that:
(17) 
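For binary targets, the compatibility condition (17) reduces to comparing the scores of the positive and negative groups; a sketch, where the non-strictness of the inequality is an assumption since the omitted formula in (17) fixes the exact convention:

```python
import numpy as np

def is_compatible(s, y):
    """Check whether scores s are compatible with binary labels y, i.e. every
    positively labeled item scores at least as high as every negative one
    (inequality direction assumed; eq. (17) fixes the convention)."""
    pos, neg = s[y == 1], s[y == 0]
    if pos.size == 0 or neg.size == 0:
        return True
    return pos.min() >= neg.max()

# Positives score above all negatives: compatible.
print(is_compatible(np.array([3.0, 2.0, 1.0]), np.array([1, 1, 0])))
```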
There are several ways to permute a sorted binary vector while keeping all its values the same. These are permutations that separately reorder the s at the top of and the s at the bottom. Given , we represent the set of permutations that do not change the value of the sorted as . It follows that the set contains all permutations that satisfy . In other words, all that satisfy can be represented as for some .
We propose a representation for compatible vectors that reduces to permutations of isotonic vectors.
Proposition 3.
Let . Any that satisfies can be represented by where , the set and .
Sketch of proof: First, we note that by definition of compatibility for binary vectors (17), any permutation satisfies . Next, we note that the sorted order is a member of the permutation set. This representation is unique when satisfies strict ordering.
The set is a convex cone. To see this, note that a convex combination of two isotonic vectors and preserves isotonicity. Further, any scaling where preserves the ordering. Let be the set of probability distributions, i.e., , and . The set of probability distributions in sorted order is given by so for each , and .

Lemma 4 (Representation of [35]).
The set of all discrete probability distributions of dimension that are in descending order is the image where and is an upper triangular matrix generated from the vector such that
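Lemma 4 can be checked numerically with the standard construction in which column j of the generator is the uniform distribution over the first j coordinates; since the generator formula is omitted above, this exact form is an assumption:

```python
import numpy as np

def upper_triangular_U(d):
    """U[i, j] = 1/(j+1) for i <= j (0-indexed): column j is the uniform
    distribution over the first j+1 coordinates (assumed generator form)."""
    U = np.zeros((d, d))
    for j in range(d):
        U[: j + 1, j] = 1.0 / (j + 1)
    return U

rng = np.random.default_rng(7)
d = 6
w = rng.random(d)
w /= w.sum()                      # a point in the probability simplex
p = upper_triangular_U(d) @ w     # image point: a sorted distribution
```

Each column of the generator is itself a descending distribution, so any convex combination of the columns is again a descending distribution, which is the content of the lemma.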
4.2 Generative model
Let be the label for item in task and let be the set of items in task so . We define the negative set as and the positive set as . For the task, the negative set is defined as and the positive set as so that . The vector of labels for the task is given by .
We propose the following generative model for :
(18) 
where is the indicator function defined as:
For clarity, we have suppressed the dependence of on the sets .
It is instructive to compare the form of the generative model (18) to the area under the ROC curve (AUC), given by the fraction of correctly ordered pairs:

(19)
We note that is nonzero if and only if satisfies the ordering defined by . It follows that any vector for which is nonzero also maximizes the AUC.
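The AUC in (19) can be computed directly as the fraction of correctly ordered positive/negative pairs; a sketch using a common tie-handling convention (ties counted as half-correct):

```python
import numpy as np

def pairwise_auc(scores, labels):
    """AUC as the fraction of correctly ordered positive/negative pairs, cf. (19).
    Ties are counted as half-correct, a common convention (an assumption here)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diffs = pos[:, None] - neg[None, :]
    return ((diffs > 0).sum() + 0.5 * (diffs == 0).sum()) / diffs.size

s = np.array([0.9, 0.8, 0.3, 0.1])
y = np.array([1, 0, 1, 0])
# 4 positive/negative pairs, 3 correctly ordered -> AUC = 0.75
print(pairwise_auc(s, y))
```

Note the quadratic number of pairs, which is the computational bottleneck of pairwise bipartite ranking discussed in the previous section.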
We can now combine the bipartite ranking model with the latent regression model. The full generative model proceeds as follows (see Fig. 4):

Draw the latent variable from a zero mean MVGP as .

Given , draw each score vector independently as .

For each task , draw the observed response vector as given by (18).
4.3 Inference and parameter estimation
We utilize variational inference to train the underlying multitask regression model and maximum likelihood to estimate the parameters of the bipartite ranking model. This is equivalent to the variational approximation where . The variational lower bound of the log likelihood (9) is given by:
where . As outlined in section 3, we restrict our search to the space of Gaussian processes subject to a trace norm constraint . Evaluating expectations (and ignoring constant terms independent of ) results in the following optimization problem:
(20) 
Inference and parameter estimation follow an alternating optimization scheme. We alternately optimize each of the parameters until a local optimum is reached. Following section 3, it is straightforward to show that the optimal is given in closed form (3) and is independent of . Hence, model training requires alternating between optimizing and optimizing . We will show that (20) is convex, and the alternating optimization approach achieves the global optimum. The optimization for follows directly from the discussion in section 3. Hence, we will focus our efforts on the optimization of .
Collecting the terms of (20) that are dependent on results in the following loss function for :
The first term in the loss penalizes violations of order. In fact, the first term evaluates to infinity if any of the binary order constraints are violated. Hence, to maximize the log likelihood, the variables must satisfy the order constraints . This interpretation suggests a constrained optimization approach:
(21) 
Note that this loss decomposes task-wise. Hence, the proposed approach results in a listwise ranking model. We also note that the independence between tasks means that the optimization is embarrassingly parallel.
The constrained score vectors can be optimized efficiently using the inner representation outlined in Proposition 3. One issue that arises is that the cost function (20) is not invariant to scale. Hence, the loss can be reduced just by scaling its arguments down. To avoid this degeneracy, we must constrain the score vectors away from . We achieve this by constraining the score vectors to the ordered simplex , as it is a convex set and satisfies the requirement . Applying Lemma 4, the score is given by for .
Let be the score vector ordered to satisfy