1 Introduction
Our focus in this paper is on learning structured low-rank matrices, and we consider the following problem:
(1)   min_X  L(X; Y*) + λ‖X‖_*   subject to   X ∈ C,

where Y* is the given matrix, L is a convex loss function, ‖·‖_* denotes the nuclear norm regularizer, λ is the cost parameter, and C is the linear subspace corresponding to the structural constraints. It is well known that nuclear norm regularization promotes low-rank solutions, since ‖X‖_* is equal to the ℓ1 norm on the singular values of X (Fazel et al., 2001). The linear subspace C in problem (1) is represented as C = {X : A(X) = 0} or C = {X : A(X) ≥ 0}, where A is a linear map; i.e., the structural constraint is either an equality (=) or a greater-than-or-equal-to (≥) constraint. Low-rank matrices are commonly learned in several machine learning applications such as matrix completion
(Abernethy et al., 2009; Boumal and Absil, 2011), multi-task learning (Argyriou et al., 2008; Zhang and Yeung, 2010; Jawanpuria and Nath, 2012), and multivariate regression (Yuan et al., 2007; Journée et al., 2010), to name a few. In addition to the low-rank constraint, other structural constraints may exist, e.g., entry-wise nonnegative/bounded constraints (Kannan et al., 2012; Marecek et al., 2017; Fang et al., 2017). Several linear dynamical system models require learning a low-rank Hankel matrix (Fazel et al., 2013; Markovsky and Usevich, 2013). A Hankel matrix has the structural constraint that all its anti-diagonal entries are the same. In robust matrix completion and robust PCA problems (Wright et al., 2009), the matrix is learned as a superposition of a low-rank matrix and a sparse matrix. This sparse structure is modeled effectively by choosing the ℓ1 loss as the loss function (Cambier and Absil, 2016). Similarly, low-rank 1-bit matrix completion solvers employ the logistic loss function (Davenport et al., 2014; Bhaskar and Javanmard, 2015) to complete a matrix. We propose a generic framework for the structured low-rank matrix learning problem (1) that handles a variety of loss functions (including nonsmooth ones such as the ℓ1 loss) and structural constraints C, and that is scalable to large-scale problem instances. Using duality theory, we introduce a novel modeling of a structured low-rank matrix of rank r as a product of a rank-r positive semidefinite matrix and a term built from the dual variables of the loss and of the structural constraint. This factorization naturally decouples the low-rank and structural constraints on X: the low rank of X is enforced through the positive semidefinite factor, while the structural constraint and the loss-specific structure are modeled through the dual terms. The separation of low-rank and structural constraints onto separate factors makes the optimization conceptually simpler.
To the best of our knowledge, such a decoupling of constraints has not been studied in the existing structured low-rank matrix learning literature (Fazel et al., 2013; Markovsky and Usevich, 2013; Yu et al., 2014; Cambier and Absil, 2016; Fang et al., 2017).
Our approach leads to an optimization problem on the Riemannian spectrahedron manifold. We exploit the Riemannian framework to develop computationally efficient conjugate gradient (first-order) and trust-region (second-order) algorithms. The proposed algorithms outperform state-of-the-art methods in robust and nonnegative matrix completion problems as well as in the low-rank Hankel matrix learning application. Our algorithms readily scale to the Netflix data set, even with the nonsmooth ℓ1 loss and the SVR (ε-insensitive support vector regression) loss functions.
The main contributions of the paper are:

We propose a novel factorization for modeling structured low-rank matrices.

We present a unified framework to learn structured low-rank matrices for several applications with different constraints and loss functions.

We develop Riemannian conjugate gradient and trust-region algorithms for our framework. Our algorithms obtain state-of-the-art generalization performance across applications.
The outline of the paper is as follows. We introduce our structured low-rank matrix learning framework in Section 3. Section 4 presents our optimization approach. Section 5 discusses specialized formulations (for various applications) within our framework. The empirical results are presented in Section 6. A shorter version of this work has been published in ICML 2018 (Jawanpuria and Mishra, 2018). Our codes are available at https://pratikjawanpuria.com/ and https://bamdevmishra.com/. We begin by discussing related works in the next section.
2 Related work
Matrix completion: The existing low-rank matrix learning literature has primarily focused on problem (1) with the square loss and in the absence of the structural constraint C. Singular value thresholding (Cai et al., 2010), proximal gradient descent (Toh and Yun, 2010), and active subspace selection (Hsieh and Olsen, 2014) are some of the algorithms proposed to solve (1) without the structural constraint C. Alternatively, several works (Wen et al., 2012; Mishra and Sepulchre, 2014; Boumal and Absil, 2015) propose to learn a low-rank matrix by fixing the rank explicitly, i.e., X = UV^⊤, where U and V are thin factors and the rank is fixed a priori.
Robust matrix completion:
This is a matrix completion problem in which a few of the observed entries are perturbed or are outliers (Cambier and Absil, 2016). Candès et al. (2011) model the robust matrix completion problem as a convex program with separate low-rank and sparse constraints. He et al. (2012) propose to use the ℓ1 loss as the loss function and impose only the low-rank constraint. They propose an online algorithm that learns a low-dimensional subspace. Cambier and Absil (2016) build on this and employ the pseudo-Huber loss as a smooth proxy for the nonsmooth ℓ1 loss. They solve the following optimization problem:
(2)
where Ω is the set of observed entries, Ω^c is the complement of the set Ω, r is the given rank, and δ is a given smoothing parameter. For small values of δ, the pseudo-Huber loss is close to the ℓ1 loss. They develop a large-scale Riemannian conjugate gradient algorithm in the fixed-rank setting.
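To make the smoothing idea concrete, the following sketch checks numerically that the gap to the ℓ1 loss shrinks with the smoothing parameter. The particular smoothed loss ℓ_δ(x) = sqrt(x² + δ²) is one common pseudo-Huber variant, used here for illustration; it is an assumption, not necessarily the exact form in RMC.

```python
import numpy as np

def pseudo_huber(x, delta):
    # A common pseudo-Huber variant: smooth everywhere, and within
    # delta of |x| uniformly (illustrative; not the authors' exact loss).
    return np.sqrt(x**2 + delta**2)

x = np.linspace(-5.0, 5.0, 1001)
gaps = [np.max(pseudo_huber(x, d) - np.abs(x)) for d in (1.0, 0.1, 0.01)]

# The uniform gap to the l1 loss is exactly delta (attained at x = 0).
assert all(np.isclose(g, d) for g, d in zip(gaps, (1.0, 0.1, 0.01)))
```

The uniform approximation error equals δ, which is why annealing δ toward zero recovers ℓ1-like robustness while keeping a smooth objective.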
Nonnegative matrix completion: Certain recommender system and image completion applications desire matrix completion with nonnegative entries (Kannan et al., 2012; Sun and Mazumder, 2013; Tsagkatakis et al., 2016; Fang et al., 2017), i.e., X ≥ 0 entry-wise. Kannan et al. (2012) present a block coordinate descent algorithm that learns X as a product of two fixed-rank factors. Recently, Fang et al. (2017) proposed a large-scale alternating direction method of multipliers (ADMM) algorithm for (1) with the square loss in this setting. In each iteration, for a known matrix, they need to solve a key subproblem,
which they solve approximately in closed form for a given rank via soft-thresholding of the singular values.
Hankel matrix learning: The Hankel constraint is a linear equality constraint. Fazel et al. (2013) propose ADMM-based approaches to solve (1) with this constraint. On the other hand, Yu et al. (2014) learn a low-rank Hankel matrix by relaxing the structural constraint with a penalty term in the objective function. They solve the following optimization problem:
Hence, they enforce the structural constraint only approximately. They discuss a generalized gradient algorithm to solve the above problem. Markovsky and Usevich (2013) model the low-rank Hankel matrix learning problem as a nonlinear least-squares problem in the fixed-rank setting and propose a second-order algorithm.
Multi-task feature learning: The goal in multi-task feature learning (Argyriou et al., 2008) is to jointly learn a low-dimensional latent feature representation common across several classification/regression problems (tasks). The optimization approaches explored include alternating minimization (Argyriou et al., 2008), accelerated gradient descent (Ji and Ye, 2009), and mirror descent (Jawanpuria and Nath, 2011). Problems such as multi-class classification (Amit et al., 2007), multivariate regression (Yuan et al., 2007), and matrix completion with side information (Xu et al., 2013), among others, may be viewed as special cases of multi-task feature learning.
In the following section, we present a unified framework for solving the above problems.
3 Structured low-rank matrix learning
For notational simplicity, we consider only the equality structural constraint when presenting the main results. However, our framework also admits inequality constraints. We use the notation 𝒫 to denote the set of positive semidefinite matrices with unit trace.
Problem (1) is a convex problem with a linear constraint. In addition to the trace-norm regularizer, the loss function may also be nonsmooth (as in the case of the ℓ1 loss or the SVR loss). Dealing with (1) directly, or characterizing the nature of its optimal solution in this general setting, is nontrivial. To this end, we propose an equivalent partial dual of (1) in the following section. The use of a dual framework often leads to a better understanding of the primal problem (Jawanpuria et al., 2015). In our case, duality theory helps in discovering a novel factorization of the optimal solution, which is not evident directly from the primal problem (1). This subsequently helps in the development of computationally efficient algorithms.
3.1 Decoupling of constraints and duality
The following theorem presents a dual problem equivalent to the primal problem (1).
Let L* be the Fenchel conjugate function of the loss L. An equivalent partial dual problem of (1) with the equality constraint is
(3) 
where the function is defined as
(4) 
and A* is the adjoint of A. We first note the following variational characterization of the squared trace norm regularizer from Theorem 4.1 in (Argyriou et al., 2006):
(5)   ‖X‖_*² = min_{Θ ∈ 𝒫} ⟨Θ† X, X⟩,
where Θ† denotes the pseudo-inverse of Θ and 𝒫 is the set of positive semidefinite matrices with unit trace. For a given matrix X, the optimal Θ is (XX^⊤)^{1/2} / ‖X‖_*.
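As a sanity check, the characterization (5) and the stated optimal Θ can be verified numerically. The sketch below (notation as above; the matrix size is arbitrary) builds Θ = (XX^⊤)^{1/2}/‖X‖_* from the SVD of a random X and confirms that it lies in the spectrahedron and attains ‖X‖_*²:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))

U, s, _ = np.linalg.svd(X, full_matrices=False)
nuc = s.sum()                         # nuclear norm of X (sum of singular values)
Theta = (U * s) @ U.T / nuc           # stated optimal Theta = (X X^T)^{1/2} / ||X||_*

val = np.trace(X.T @ np.linalg.pinv(Theta, rcond=1e-10) @ X)

assert np.isclose(np.trace(Theta), 1.0)   # Theta is in the spectrahedron (PSD, unit trace)
assert np.isclose(val, nuc**2)            # <pinv(Theta) X, X> attains ||X||_*^2
```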
By employing the result in (5), problem (1) can be shown to be equivalent to the following problem:
(6)  
where the last term is the indicator function of the set 𝒫.
We now derive the dual of the following subproblem of (6):
(7)  
We now introduce an auxiliary variable for the argument of the loss. The dual variables correspond to this coupling constraint and the structural constraint, respectively. Hence, the Lagrangian is as follows:
(8) 
The dual function of (7) is defined as
(9) 
Using the definition of the conjugate function (Boyd and Vandenberghe, 2004), we get
(10) 
where L* is the Fenchel conjugate function of the loss L.
We next compute the minimizer of the Lagrangian with respect to the primal variable. From the definition of the adjoint operator, it follows that
The minimizer of the Lagrangian with respect to the primal variable satisfies the following conditions:
(11)  
(12) 
which implies,
(13) 
Thus, the expression of the minimizer of with respect to is
(14) 
Plugging the results (14) and (10) in the dual function (9), we obtain
Thus, the following partial dual problem is equivalent to the primal problem (1)
Our next result gives the expression of an optimal solution of the primal problem (1). (Representer theorem) Let Θ and the dual variables be an optimal solution of (3). Then, an optimal solution of (1) with the equality constraint is:
The above result is obtained from the proof of Theorem 3.1. In particular, see (14).
Remark 1: Θ in (3) is a positive semidefinite matrix with unit trace. Hence, an optimal Θ in (3) is a low-rank matrix.
Remark 2: Theorem 3.1 gives us an expression for an optimal solution of (1): it is a product of Θ and a term involving the dual variables. The low-rank constraint is enforced through Θ, the loss-specific structure (encoded in the conjugate function L*) is enforced through the dual variable of the loss, and the structural constraint is enforced through the dual variable of the constraint. Overall, such a decoupling of constraints onto separate variables facilitates the use of simpler optimization techniques, as compared to the case where all the constraints are enforced on a single variable.
As discussed earlier, an optimal Θ of (3) is a low-rank positive semidefinite matrix. However, an algorithm for (3) need not produce intermediate iterates that are low rank. For large-scale optimization, this observation, as well as other computational efficiency concerns, motivates a fixed-rank parameterization of Θ, as discussed in the following section.
3.2 A new fixed-rank factorization of Θ
We model Θ as a rank-r matrix in the following way: Θ = UU^⊤, where U ∈ R^{n×r} and ‖U‖_F = 1. The proposed modeling has several benefits in large-scale low-rank matrix learning problems, where r ≪ n is a common setting. First, the parameterization ensures that the constraint Θ ⪰ 0 is always satisfied. This saves the costly projection operations needed to ensure positive semidefiniteness, and enforcing the unit-trace constraint reduces to the unit Frobenius norm constraint on U. Second, the dimension of the search space of problem (3) with Θ = UU^⊤ is of order nr, which is much lower than the order-n² dimension of the space of symmetric matrices Θ. By restricting the search space for Θ, we gain computational efficiency. Third, increasing the parameter λ in (1) and (3) promotes low training error but a high rank of the solution, and vice versa. The proposed fixed-rank parameterization decouples this trade-off.
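A minimal numeric illustration of this parameterization (assuming, as above, Θ = UU^⊤ with ‖U‖_F = 1): any such U yields a point on the spectrahedron with no projection step, and all Θ-quantities can be manipulated through the thin factor U without ever forming the n × n matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 500, 5

U = rng.standard_normal((n, r))
U /= np.linalg.norm(U)                  # normalization: unit Frobenius norm

trace_Theta = np.sum(U * U)             # trace(U U^T) without forming Theta
spectrum = np.linalg.eigvalsh(U.T @ U)  # nonzero eigenvalues of Theta, via an r x r matrix

assert np.isclose(trace_Theta, 1.0)     # unit trace holds by construction
assert np.all(spectrum >= 0)            # Theta = U U^T is positive semidefinite
```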
Remark 3: With the proposed fixed-rank parameterization of Θ, the expression for the primal variable X in Theorem 3.1 is written in terms of U and the dual variables.
Instead of solving a minimax objective as in (3), we solve a minimization problem after incorporating the parameterization as follows:
(15) 
where the function is defined as
(16) 
(15) is the proposed generic structured low-rank matrix learning problem. The application-specific details are modeled within the subproblem (16). In Section 5, we present specialized versions of (16), tailored to applications such as Hankel matrix learning, nonnegative matrix completion, and robust matrix completion. We propose a unified optimization framework for solving (15) in Section 4.
Algorithm 1: proposed first- and second-order algorithms for (15).
Input: the given matrix, rank r, and regularization parameter.
Initialize .  
repeat  
1: Solve the convex subproblem (16). Section 5 discusses solvers for specific applications.
2: Compute the gradient as given in Lemma 4.
3 (CG variant): Riemannian CG step: compute a conjugate direction and a step size using Armijo line search; this makes use of the gradient from Step 2.
3 (TR variant): Riemannian TR step: compute a search direction that minimizes the trust-region subproblem; this makes use of the gradient and its directional derivative. The step size is fixed to 1.
4: Update U via the retraction step.
until convergence  
Output: U and the dual variables.
The fixed-rank parameterization, Θ = UU^⊤, results in nonconvexity of the overall optimization problem (15), though the subproblem (16) remains a convex optimization problem. We end this section by stating sufficient conditions for obtaining a globally optimal solution of (3) from a solution of (15). Let U be a feasible solution of (15), and let the dual variables be an optimal solution of the convex problem in (16) at U. Let σ1 be the maximum singular value of the matrix in question. A candidate solution for (3) is constructed from these quantities. The duality gap associated with this candidate solution is given by
Furthermore, if U is a rank-deficient local minimum of (15), then the candidate solution is a global minimum of (3), i.e., the duality gap is zero. The min-max problem (3) can be equivalently rewritten as
(17) 
where
(18) 
Given as described in the statement of the theorem, .
Using the min-max interchange (Sion, 1958), the max-min problem equivalent to (3) can be written as
(19) 
where
(20)  
(21) 
Note that problem (21) is a well-studied problem in duality theory. It is one of the definitions of the spectral norm (the maximum eigenvalue of a symmetric positive semidefinite matrix) as the dual of the trace norm (Boyd and Vandenberghe, 2004). Its optimal value is the spectral norm of the matrix concerned (Boyd and Vandenberghe, 2004). Let σ1 be the maximum singular value of this matrix. Then, the spectral norm of the symmetric matrix is equal to σ1.
The duality gap associated with the candidate solution is given by
The last part of the theorem is a special case of the general result proved by Journée et al. (2010, Theorem 7 and Corollary 8).
The value of the duality gap can be used to verify whether a candidate solution is a global optimum of (3). Computing the duality gap is cheap, as it requires only a few power iteration updates.
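The dominant cost here is estimating the largest singular value. A standard power iteration sketch (an illustration, not the authors' implementation) shows why only matrix-vector products are needed, with no full SVD:

```python
import numpy as np

def top_singular_value(A, iters=1000, seed=0):
    # Power iteration on A^T A: estimates the largest singular value
    # of A using only matrix-vector products.
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = A.T @ (A @ v)
        v /= np.linalg.norm(v)
    return np.linalg.norm(A @ v)

A = np.random.default_rng(2).standard_normal((200, 100))
sigma1 = np.linalg.svd(A, compute_uv=False)[0]
assert np.isclose(top_singular_value(A), sigma1, rtol=1e-3)
```

In practice only a few iterations are run, since a rough estimate of σ1 already certifies a small duality gap.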
4 Optimization on spectrahedron manifold
The matrix U lies in what is popularly known as the spectrahedron manifold. Specifically, the spectrahedron manifold has the structure of a compact Riemannian quotient manifold (Journée et al., 2010). The quotient structure takes into account the rotational invariance of the parameterization, i.e., UU^⊤ = (UO)(UO)^⊤ for any orthogonal O. The Riemannian optimization framework embeds the constraint into the search space, conceptually translating the constrained optimization problem (15) into an unconstrained optimization problem over the spectrahedron manifold. The Riemannian framework generalizes various classical first- and second-order Euclidean algorithms (e.g., the conjugate gradient and trust-region algorithms) to manifolds and provides concrete convergence guarantees (Edelman et al., 1998; Absil et al., 2008; Journée et al., 2010; Sato and Iwai, 2013; Sato et al., 2017). In particular, Absil et al. (2008) provide a systematic way of implementing Riemannian conjugate gradient (CG) and trust-region (TR) algorithms. The particular details are provided in Appendix A.
We implement Riemannian conjugate gradient (CG) and trust-region (TR) algorithms for (15). These require the notions of the Riemannian gradient (the first-order derivative of the objective function on the manifold), the Riemannian Hessian along a search direction (the covariant derivative of the Riemannian gradient along a tangential direction on the manifold), and the retraction operator (which ensures that we always stay on the manifold). The Riemannian gradient and Hessian notions require computing the standard (Euclidean) gradient and the directional derivative of this gradient along a given search direction, which are expressed in the following lemma. Let the dual variables be an optimal solution of the convex problem (16) at U. Then, the gradient of the objective at U is given by the following expression:
The directional derivative of the gradient along a search direction depends on the directional derivative, along that direction at U, of the optimal dual solution of (16). Its expression is as follows:
The gradient is computed by employing Danskin's theorem (Bertsekas, 1999; Bonnans and Shapiro, 2000). The directional derivative of the gradient follows directly from the chain rule. The required terms are computed from the first-order KKT conditions of the convex problem (16) at U. In particular, when the loss is differentiable (e.g., the square loss), these terms are obtained by solving a linear system. In various applications, such as traditional matrix completion or multivariate regression, both the gradient and its directional derivative can be computed via closed-form expressions.
Riemannian CG algorithm: It computes the Riemannian conjugate gradient direction by employing the firstorder information (Lemma 4). We perform Armijo line search on to compute a stepsize that sufficiently decreases on the manifold. We update along the conjugate direction with the stepsize by retraction.
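The Armijo rule used in this step is standard; a Euclidean sketch is below. On the manifold, the trial point x + αd would be replaced by a retraction R_x(αd), and the constants (c, τ, α0) are illustrative choices, not the authors' settings.

```python
import numpy as np

def armijo_step(f, grad_f, x, direction, alpha0=1.0, c=1e-4, tau=0.5, max_iter=50):
    # Backtracking Armijo rule: shrink the step until the sufficient-decrease
    # condition f(x + a*d) <= f(x) + c * a * <grad f(x), d> holds.
    fx = f(x)
    slope = np.dot(grad_f(x), direction)   # must be negative for a descent direction
    alpha = alpha0
    for _ in range(max_iter):
        if f(x + alpha * direction) <= fx + c * alpha * slope:
            return alpha
        alpha *= tau
    return alpha

# Toy usage: steepest descent direction on f(x) = 0.5 * ||x||^2.
f = lambda x: 0.5 * np.dot(x, x)
g = lambda x: x
x0 = np.array([3.0, -4.0])
alpha = armijo_step(f, g, x0, -g(x0))
assert f(x0 + alpha * (-g(x0))) < f(x0)   # sufficient decrease achieved
```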
Riemannian TR algorithm: It solves a Riemannian trustregion subproblem (in a neighborhood) at every iteration. Solving the trustregion subproblem leads to a search direction that minimizes a quadratic approximation of on the manifold. Solving this subproblem does not require inverting the full Hessian of the objective function. It makes use of and its directional derivative (Lemma 4).
Overall algorithm: Algorithm 1 summarizes the proposed first and secondorder algorithms for solving (15).
Convergence: Absil et al. (2008) and Sato and Iwai (2013) discuss the rates of convergence of manifold algorithms, and their results are directly applicable in the present setting. The global convergence results for the conjugate gradient algorithm are discussed in (Sato and Iwai, 2013). For trust regions, the global convergence to a first-order critical point is discussed in (Absil et al., 2008, Section 7.4.1), while the local convergence to local minima is discussed in (Absil et al., 2008, Section 7.4.2).
Computational complexity: The computational complexity of Algorithm 1 is the sum of the cost of the manifold-related operations and the cost of the application-specific ingredients. The spectrahedron manifold operations are linear in n for a fixed rank r. The following section discusses the application-specific computational costs.
Although we have focused on batch algorithms, our framework can be extended to stochastic settings, e.g., when the columns are streamed one by one (Bonnabel, 2013). In this case, when a new column is received, we perform a (stochastic) gradient update on U.
5 Specialized formulations for applications
The expression of the objective in (16) depends on the loss function and the structural constraint employed in the application at hand. Below, we discuss its form for popular applications.
5.1 Matrix completion
Given a partially observed matrix Y* at indices Ω, we learn the full matrix (Toh and Yun, 2010; Cai et al., 2010). Let Ω_i be the set of indices that are observed in y_i, the i-th column of Y*, and let the subscript Ω_i denote the restriction to the rows corresponding to the indices in Ω_i. Then, the function in (16), for the standard low-rank matrix completion problem with the square loss, is as follows:
(22) 
Problem (22) is a least-squares problem for each i and can be solved efficiently in closed form by employing the Woodbury matrix-inversion identity, at a cost linear in the number of observed entries of the column. Moreover, problem (22) can be solved independently for each i, making the computation amenable to parallelization.
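A sketch of the Woodbury trick for such per-column systems (the dimensions and the regularized system below are illustrative assumptions, not the exact subproblem): solving (λI_m + UU^⊤)x = b through an r × r system instead of an m × m one.

```python
import numpy as np

rng = np.random.default_rng(3)
m, r, lam = 1000, 5, 0.1
U = rng.standard_normal((m, r))
b = rng.standard_normal(m)

# Direct solve of (lam*I_m + U U^T) x = b costs O(m^3).
x_direct = np.linalg.solve(lam * np.eye(m) + U @ U.T, b)

# Woodbury identity: (lam*I_m + U U^T)^{-1}
#   = (1/lam) * (I - U (lam*I_r + U^T U)^{-1} U^T),
# so only an r x r system is solved, costing O(m r^2 + r^3).
small = lam * np.eye(r) + U.T @ U
x_wood = (b - U @ np.linalg.solve(small, U.T @ b)) / lam

assert np.allclose(x_direct, x_wood)
```

The same identity is what makes closed-form per-column solves cheap when r is small relative to the number of observed entries.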
5.2 Robust matrix completion
The setting of this problem is the same as the low-rank matrix completion problem. Following Cambier and Absil (2016), we employ a robust loss function (the ℓ1 loss) instead of the square loss. The expression for the subproblem, in our case, is as follows:
A coordinate descent (CD) algorithm is employed to efficiently solve the above problem. The computational cost depends on the number of iterations of the coordinate descent algorithm: the cost of computing the subproblem solution grows linearly with this number, and a small number of CD iterations suffices for a good approximation to the solution. Hence, for the robust matrix completion problem, the per-iteration computational cost of Algorithm 1 scales accordingly. It should be noted that the computation can be parallelized across columns. In addition to the ℓ1 loss, we also experiment with the SVR loss (Ho and Lin, 2012) for this problem.
5.3 Nonnegative matrix completion
Given a partially observed matrix at indices Ω, the aim here is to learn the full matrix with nonnegative entries only. For this problem, the function in (16) with the square loss is as follows:
(23) 
Since the dual variables have nonnegativity constraints (due to the constraint in the primal problem), we model (23) as a nonnegative least-squares (NNLS) problem. We employ the NNLS algorithm of Kim et al. (2013) to solve for the dual variable in (23). In each iteration of NNLS, the variable is solved in closed form, and the cost of computing the solution of (23) grows linearly with the number of NNLS iterations. In our experiments, we observe that a small number of NNLS iterations is sufficient to obtain a good approximation of (23), and the per-iteration computational cost of Algorithm 1 follows accordingly.
It should be noted that (23) is computationally challenging, as it has entry-wise nonnegativity constraints. In our initial experiments, we observed that the solution is highly sparse. For large-scale problems, we exploit this observation for an efficient implementation.
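The sparsity observed above is typical of NNLS solutions, whose inactive coordinates sit exactly at zero. A reference-level sketch using SciPy's NNLS solver (an illustration; not the Kim et al. (2013) algorithm used in our implementation):

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(4)
A = rng.standard_normal((50, 10))
b = rng.standard_normal(50)

# min_x ||A x - b||_2 subject to x >= 0. Coordinates whose unconstrained
# optimum would be negative are clamped exactly to zero (active set).
x, rnorm = nnls(A, b)

assert np.all(x >= 0)                 # feasibility
# The constrained residual can never beat the unconstrained least-squares one.
ls_res = np.linalg.norm(A @ np.linalg.lstsq(A, b, rcond=None)[0] - b)
assert rnorm >= ls_res - 1e-9
```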
5.4 Hankel matrix learning
Hankel matrices have the structural constraint that their anti-diagonal entries are equal. A Hankel matrix corresponding to a vector y = (y_1, ..., y_{2k-1}) is as follows: the (i, j)-th entry of H(y) is y_{i+j-1}, i.e., the first row is (y_1, ..., y_k), the second row is (y_2, ..., y_{k+1}), and so on until the last row (y_k, ..., y_{2k-1}).
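This construction can be sketched with scipy.linalg.hankel (the 4 × 4 size is an arbitrary illustration):

```python
import numpy as np
from scipy.linalg import hankel

y = np.arange(1, 8)            # a vector of length 2k - 1, with k = 4
H = hankel(y[:4], y[3:])       # first column y_1..y_4, last row y_4..y_7: H[i, j] = y[i + j]

assert H.shape == (4, 4)
# Every anti-diagonal of a Hankel matrix is constant.
for k in range(7):
    anti = [H[i, k - i] for i in range(4) if 0 <= k - i < 4]
    assert len(set(anti)) == 1
```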
Hankel matrices play an important role in determining the order or complexity of various linear time-invariant systems (Fazel et al., 2013; Markovsky and Usevich, 2013). The aim in such settings is to find the minimum order of the system that explains the observed data (i.e., achieves low error). A low-order system is usually desired, as it translates into cost benefits and ease of analysis. Hence, the problem can be modeled as problem (1), where we wish to learn a low-rank Hankel matrix that results in a low error.
The function in (16) specialized for the case of low-rank Hankel matrix learning with the square loss is as follows:
We solve the above problem via a conjugate gradient algorithm. The equality constraints are handled efficiently by using an affine projection operator. The computational cost of the inner solve grows linearly with the number of iterations of the inner conjugate gradient algorithm, and the per-iteration complexity of Algorithm 1 for Hankel matrix learning follows accordingly.
5.5 Multi-task feature learning
The paradigm of multi-task feature learning advocates learning several related tasks (regression/classification problems) jointly by sharing a low-dimensional latent feature representation across all the tasks (Argyriou et al., 2008). Since the tasks are related, learning them together, by sharing knowledge, is expected to yield better generalization than learning them independently.
The multi-task feature learning setup is as follows: we are given several tasks, and the aim is to learn the model parameter for each task such that the parameters share a low-dimensional latent feature space. Argyriou et al. (2008) employ the trace-norm regularizer on the model parameter matrix to enforce a low-dimensional representation. Each task has an input/output training data set, and the prediction function for each task is linear in its model parameter.
The proposed generic formulation for structured lowrank matrix learning (15) can easily be specialized to this setting, with the function being as follows:
The function can be computed in closed form, and the per-iteration complexity of Algorithm 1 for multi-task feature learning follows from this closed-form computation. The computation can be parallelized across the tasks. This setting can be further specialized to problems such as matrix completion with side information, inductive matrix completion, and multivariate regression.
6 Experiments
In this section, we evaluate the generalization performance as well as the computational efficiency of our approach against the state of the art in different applications. It should be emphasized that the state of the art in each application is different, and, to the best of our knowledge, there does not exist a unified framework for solving all such applications. All our algorithms are implemented using the Manopt toolbox (Boumal et al., 2014). We term our algorithm Riemannian Structured Low-rank Matrix learning (RSLM).
Table 1: Statistics of the data sets used in the experiments: MovieLens1m (ML1m), MovieLens10m (ML10m), MovieLens20m (ML20m), Netflix, and Synthetic.
6.1 Matrix Completion
Baseline techniques: Our first- and second-order matrix completion algorithms are denoted RSLM-cg and RSLM-tr, respectively. We compare against state-of-the-art fixed-rank and nuclear norm minimization based matrix completion solvers:

APGL: An accelerated proximal gradient approach for nuclear norm regularization with square loss function (Toh and Yun, 2010).

Active ALT: A state-of-the-art first-order nuclear norm solver based on active subspace selection (Hsieh and Olsen, 2014).

MMBS: A second-order fixed-rank nuclear norm minimization algorithm (Mishra et al., 2013). It employs an efficient factorization of the matrix which renders the trace norm regularizer differentiable in the primal formulation.

R3MC: A nonlinear conjugate gradient based approach for fixed-rank matrix completion (Mishra and Sepulchre, 2014). It employs a Riemannian preconditioning technique, customized for the square loss function.

LMaFit: A nonlinear successive over-relaxation based approach for low-rank matrix completion based on alternating least squares (Wen et al., 2012).

PRP: A recent proximal Riemannian pursuit algorithm (Tan et al., 2016).
Data sets and experimental setup: We compare the performance of the above algorithms on a synthetic data set and three movie recommendation data sets: Netflix (Recht and Ré, 2013), MovieLens10m (ML10m), and MovieLens20m (ML20m) (Harper and Konstan, 2015). The data set statistics are provided in Table 1. The regularization parameters for the respective algorithms are cross-validated to obtain their best generalization performance. The optimization strategies for the competing algorithms were set to those prescribed by their authors. For instance, line search, continuation, and truncation were kept on for APGL. The initialization for all the algorithms is based on the first few singular vectors of the given partially complete matrix (Boumal and Absil, 2015). All the fixed-rank algorithms (R3MC, LMaFit, MMBS, RTRMC, RSLM-cg, RSLM-tr) are provided the same rank for the real and synthetic data sets. In all variable-rank approaches (APGL, Active ALT, PRP), the maximum rank parameter is set to the corresponding value. We run all the methods on five random train/test splits and report the average root mean squared error on the test set (test RMSE). We report the minimum test RMSE achieved after the algorithms have converged or have reached the maximum number of iterations.
Synthetic data set results: We create a synthetic data set following the procedure detailed in (Boumal and Absil, 2011, 2015). The generalization performance of the different methods is shown in Figure 1(a). For the same run, we also plot the variation of the relative duality gap across iterations for our algorithms in Figure 1(b). It can be observed that RSLM-tr approaches the global optimum of the nuclear norm regularized problem (1) in very few iterations and obtains a low test RMSE. Our first-order algorithm, RSLM-cg, also achieves a low test RMSE and a low relative duality gap. It should be noted that our methods are able to exploit the rectangular shape of the matrices (one dimension much larger than the other), similar to RTRMC.
Real-world data set results: Figures 2(a) and (b) display the evolution of the test RMSE against the training time on the Netflix data set for the first- and second-order algorithms, respectively. RSLM-cg is among the most efficient first-order methods, and RSLM-tr is the best second-order method. Table 2 reports the test RMSE obtained by all the algorithms on the three data sets. We observe that both our algorithms obtain the best generalization performance.
Table 2: Test RMSE on Netflix, MovieLens10m, and MovieLens20m for RSLM-tr, RSLM-cg, R3MC, RTRMC, APGL, Active ALT, MMBS, LMaFit, and PRP.
Generalization performance of various algorithms on the matrix completion problem. The table reports the mean test RMSE along with the standard deviation over five random train/test splits. The proposed first-order (RSLM-cg) and second-order (RSLM-tr) algorithms achieve the lowest test RMSE.
6.2 Robust matrix completion
We develop RSLM with two different robust loss functions: the ℓ1 loss and the SVR loss. The nonsmooth nature of these losses makes them challenging to optimize in the large-scale low-rank setting.
Baseline techniques: For evaluation, we compare RSLM against RMC (Cambier and Absil, 2016), a large-scale state-of-the-art Riemannian optimization algorithm for robust matrix completion. RMC employs the pseudo-Huber loss function (2). While solving (2), RMC decreases the value of the smoothing parameter at regular intervals. Hence, at later iterations, when the smoothing parameter is small, the pseudo-Huber loss approximates the ℓ1 loss.
Data sets and experimental setup: We compare the performance of the three algorithms on the Netflix data set. We follow the same experimental setup as described for the case of matrix completion. The rank is fixed to the same value for all the algorithms.
Results: Figure 2(c) shows the results on the Netflix data set. We observe that RSLM scales effortlessly to the Netflix data set even with nonsmooth loss functions and obtains the best generalization performance with the SVR loss. The test RMSE values obtained at convergence are reported for RSLM with the SVR loss, RSLM with the ℓ1 loss, and RMC.
6.3 Nonnegative matrix completion
Baseline techniques: We include the following methods in our comparisons: BMC (Fang et al., 2017) and BMA (Kannan et al., 2012). BMC is a recently proposed ADMM-based algorithm with carefully designed update rules that ensure efficient computational and space complexity. BMA is based on a coordinate descent algorithm.
Data sets and experimental setup: We compare the performance of the above algorithms on three real-world data sets (Table 1): MovieLens1m (ML1m), MovieLens10m (ML10m), and MovieLens20m (ML20m). The experimental setup is the same as described for the case of matrix completion. The performance of all the algorithms is evaluated at three ranks.
Results: We observe in Table 3 that RSLM outperforms both BMC and BMA. The improvement obtained by RSLM over BMC is more pronounced with larger data sets. BMA is not able to run on MovieLens20m due to high memory and time complexity.
Figures 3(a)(e) compare the convergence behavior of RSLM and BMC for different values of parameters and on all three data sets. Both algorithms aim to minimize the same primal objective (1) for a given rank . Though the proposed approach solves the proposed fixedrank dual formulation (15), we compute the corresponding primal objective value of every iterate for the plots. We observe that our algorithm RSLM is significantly faster than BMC in converging to a lower objective value.
Figures 4(a)-(e) plot the evolution of test RMSE against training time for RSLM and BMC on different data sets with different ranks (and the corresponding best parameter). For a given data set and rank, both RSLM and BMC chose the same regularization parameter value via cross-validation. We observe that RSLM converges to a lower test RMSE at a much faster rate than BMC.
Data set        RSLM    BMC    BMA
MovieLens1m
MovieLens10m
MovieLens20m                    –
6.4 Hankel matrix learning
Baseline techniques: We compare our algorithm RSLM with three methods (discussed in Section 2): GCG (Yu et al., 2014), SLRA (Markovsky, 2014; Markovsky and Usevich, 2014), and DADM (Fazel et al., 2013). Since GCG and DADM employ the nuclear norm regularizer, we tune the regularization parameter to vary the rank of their solution.
Data sets and experimental setup: Given a vector, we obtain a Hankel matrix as discussed in Section 5.4. The true parameter vector is generated as the impulse response (skipping the first sample) of a discrete-time random linear time-invariant system of a given order (Markovsky, 2011; Fazel et al., 2013). The noisy estimates of these parameters are generated by adding measurement noise drawn from the standard Gaussian distribution to the true parameters. The noisy vector serves as the training data and the noise-free vector as the true data. We generate three data sets of varying size and order, where the order refers to the true order of the underlying system. We evaluate the algorithms on all three data sets and report their RMSE with respect to the true data (Liu and Vandenberghe, 2009) at different ranks in Table 4.
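As an illustration of the Hankel structure, a minimal sketch that maps a vector to a matrix with constant anti-diagonals (the exact construction of Section 5.4 may differ in indexing and shape; the chosen dimensions here are illustrative):

```python
import numpy as np

def hankel_from_vector(w, m):
    """Build an m x (len(w) - m + 1) Hankel matrix with H[i, j] = w[i + j],
    so every anti-diagonal of H is constant."""
    n = len(w) - m + 1
    return np.array([[w[i + j] for j in range(n)] for i in range(m)])

w = np.arange(1.0, 8.0)          # a toy "impulse response" vector of length 7
H = hankel_from_vector(w, m=3)   # 3 x 5 Hankel matrix
# each anti-diagonal is constant, e.g. H[0, 2] == H[1, 1] == H[2, 0] == 3
print(H)
```

Learning a low-rank matrix of this form then amounts to fitting the vector entries subject to the rank constraint, since the anti-diagonal ties reduce the free parameters of the matrix to the entries of the vector.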
Results: We observe from Table 4 that the proposed algorithm RSLM obtains the best result on all three data sets and across different ranks. In addition, RSLM usually obtains its lowest true RMSE on a data set at the rank equal to the true order of the underlying system. This implies that RSLM is able to identify the minimal order of the systems corresponding to the data sets.
Data set    RSLM    SLRA    DADM    GCG
6.5 Multitask feature learning
Experimental setup: We compare the generalization performance of our algorithm RSLM with the convex multitask feature learning algorithm MTFL (Argyriou et al., 2008). Optimal solutions for MTFL at different ranks are obtained by tracing the solution path with respect to the regularization parameter. For RSLM, we fix the value of the regularization parameter and vary the rank to obtain solutions of different ranks. The experiments are performed on two benchmark multitask regression data sets:

Parkinsons: We need to predict the Parkinson’s disease symptom score of 42 patients (Frank and Asuncion, 2010). Each patient is described using biomedical features. The data set has a total of 5,875 readings from all the patients.

School: The data consists of 15,362 students from 139 schools (Argyriou et al., 2008). The aim is to predict the performance of each student given their description and earlier record. Overall, each student is described by 28 features. Predicting the performance of the students belonging to one school is considered as one task.
Following (Argyriou et al., 2008), we report the normalized mean square error over the test set (test NMSE).
Results: Figure 5(a) presents the results on the Parkinsons data set. We observe from the figure that our method achieves better generalization performance at low ranks compared to MTFL. Figure 5(b) plots the variation of the duality gap with rank for our algorithm. We observe that as the rank is increased, we converge to the globally optimal solution of (3) and obtain a duality gap close to zero. Similar results are obtained on the School data set, as shown in Figures 5(c) and (d).
7 Conclusion
We have proposed a novel factorization for structured low-rank matrix learning problems, which stems from the application of duality theory and the rank-constrained parameterization of positive semidefinite matrices. This allows us to develop a conceptually simpler and unified optimization framework for various applications. The state-of-the-art performance of our algorithms on several applications shows the effectiveness of our approach.
Acknowledgement
We thank Léopold Cambier, Ivan Markovsky, and Konstantin Usevich for useful discussions. We also thank Rodolphe Jenatton, Syama Rangapuram, and Matthias Hein for giving feedback on an earlier version of this work. We are grateful to Adams Wei Yu and Mingkui Tan for making their codes available. Most of this work was done when PJ and BM were at Amazon.com, India.
Appendix A Optimization on Spectrahedron
We are interested in optimization problems of the form
$$\min_{X \in \mathcal{C}} f(X), \qquad (24)$$
where $\mathcal{C} = \{X \in \mathbb{R}^{n \times n} : X \succeq 0,\ \mathrm{tr}(X) = 1\}$ is the set of positive semidefinite matrices with unit trace and $f$ is a smooth function. A specific interest is when we seek matrices of rank $r$. Using the parameterization $X = YY^\top$ with $Y \in \mathbb{R}^{n \times r}$, the problem (24) is formulated as
$$\min_{Y \in \mathcal{M}} f(YY^\top), \qquad (25)$$
where $\mathcal{M} := \{Y \in \mathbb{R}^{n \times r} : \|Y\|_F = 1\}$ (since $\mathrm{tr}(YY^\top) = \|Y\|_F^2$), which is called the spectrahedron manifold (Journée et al., 2010). It should be emphasized that the objective function in (25) is invariant to post-multiplication of $Y$ with orthogonal matrices of size $r \times r$, i.e., $f((YO)(YO)^\top) = f(YY^\top)$ for all $O \in \mathcal{O}(r)$, where $\mathcal{O}(r)$ is the set of orthogonal matrices of size $r \times r$ such that $O^\top O = OO^\top = I_r$. An implication of this observation is that the minimizers of (25) are no longer isolated in the matrix space, but are isolated in the quotient space, which is the set of equivalence classes $[Y] := \{YO : O \in \mathcal{O}(r)\}$. Consequently, the search space is
$$\mathcal{M}/\mathcal{O}(r) := \{[Y] : Y \in \mathcal{M}\}. \qquad (26)$$
In other words, the optimization problem (25) has the structure of optimization on the quotient manifold, i.e.,
$$\min_{[Y] \in \mathcal{M}/\mathcal{O}(r)} f(YY^\top), \qquad (27)$$
but numerically, by necessity, algorithms are implemented in the matrix space $\mathcal{M}$, which is also called the total space.
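The parameterization and its invariance can be checked numerically. The following sketch, with illustrative dimensions, verifies that a unit-Frobenius-norm factor yields a positive semidefinite matrix with unit trace, and that post-multiplying the factor by an orthogonal matrix leaves the product (and hence the objective) unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 6, 2                               # illustrative dimensions

# A point on the spectrahedron manifold: tr(Y Y^T) = ||Y||_F^2 = 1.
Y = rng.standard_normal((n, r))
Y /= np.linalg.norm(Y)                    # unit Frobenius norm
X = Y @ Y.T
print(np.trace(X))                        # unit trace, as required

# The objective f(Y Y^T) is invariant under Y -> Y O for orthogonal O:
O, _ = np.linalg.qr(rng.standard_normal((r, r)))
print(np.allclose((Y @ O) @ (Y @ O).T, X))  # same X, hence same objective value
```

This is exactly why the minimizers of (25) come in equivalence classes and the natural search space is the quotient (26).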
Below, we briefly discuss the manifold ingredients and their matrix characterizations for (27). Specific details of the spectrahedron manifold are discussed in (Journée et al., 2010). A general introduction to manifold optimization and numerical algorithms on manifolds are discussed in (Absil et al., 2008).
a.1 Tangent vector representation as horizontal lifts
Since the quotient manifold $\mathcal{M}/\mathcal{O}(r)$, defined in (26), is an abstract space, the elements of its tangent space at $[Y]$ also call for a matrix representation that respects the equivalence relation $Y \sim YO$ for all $O \in \mathcal{O}(r)$. Equivalently, the matrix representation of a tangent vector should be restricted to the directions in the tangent space of the total space at $Y$ that do not induce a displacement along the equivalence class $[Y]$. In particular, we decompose the tangent space $T_Y\mathcal{M}$ into two complementary subspaces, the vertical subspace $\mathcal{V}_Y$ and the horizontal subspace $\mathcal{H}_Y$, such that $T_Y\mathcal{M} = \mathcal{V}_Y \oplus \mathcal{H}_Y$.
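A numerical sketch of this decomposition, following the standard construction for this quotient (Journée et al., 2010): vertical directions at $Y$ have the form $Y\Omega$ with $\Omega$ skew-symmetric, a tangent direction $Z$ is horizontal when $Y^\top Z = Z^\top Y$, and the skew matrix removing the vertical component of $Z$ solves a small Sylvester equation. The dimensions below are illustrative:

```python
import numpy as np
from scipy.linalg import solve_sylvester

rng = np.random.default_rng(1)
n, r = 6, 2                                  # illustrative dimensions
Y = rng.standard_normal((n, r))
Y /= np.linalg.norm(Y)                       # point on the unit Frobenius sphere

# An arbitrary tangent vector at Y: <Y, Z> = 0.
Z = rng.standard_normal((n, r))
Z -= np.sum(Y * Z) * Y

# Find skew-symmetric Omega with
#   (Y^T Y) Omega + Omega (Y^T Y) = Y^T Z - Z^T Y   (a Sylvester equation),
# so that Z_h = Z - Y Omega satisfies Y^T Z_h = Z_h^T Y (horizontality).
A = Y.T @ Y
Omega = solve_sylvester(A, A, Y.T @ Z - Z.T @ Y)
Z_h = Z - Y @ Omega

print(np.allclose(Omega, -Omega.T))          # Omega is skew-symmetric
print(np.allclose(Y.T @ Z_h, Z_h.T @ Y))     # Z_h lies in the horizontal space
```

The horizontal part $Z_h$ is the matrix representation (the horizontal lift) of the abstract tangent vector at $[Y]$, since the discarded component $Y\Omega$ only moves $Y$ along its own equivalence class.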
Matrix representation of an element:  $Y \in \mathcal{M}$
Total space:  $\mathcal{M} = \{Y \in \mathbb{R}^{n \times r} : \|Y\|_F = 1\}$
Group action:  $Y \mapsto YO$, where $O \in \mathcal{O}(r)$