# A Saddle Point Approach to Structured Low-rank Matrix Learning

We propose a novel optimization approach for learning a low-rank matrix which is also constrained to be in a given linear subspace. Low-rank constraints are regularly employed in applications such as recommender systems and multi-task learning. In addition, several system identification problems require a learning matrix with both low-rank and linear subspace constraints. We model the classical nuclear norm regularized formulation as an equivalent saddle point minimax problem. This decouples the low-rank and the linear subspace constraints onto separate factors. Motivated by large-scale problems, we reformulate the minimax problem via a rank constrained non-convex surrogate. This translates into an optimization problem on the Riemannian spectrahedron manifold. We exploit the Riemannian structure to propose efficient first- and second-order algorithms. The duality theory allows to compute the duality gap for a candidate solution and our approach easily accommodates popular non-smooth loss functions, e.g., the absolute-value loss. We effortlessly scale on the Netflix data set on both matrix completion and robust matrix completion problems, obtaining state-of-the-art generalization performance. Additionally, we demonstrate the efficacy of our approach in Hankel matrix learning and multi-task learning problems.

## Authors

• 17 publications
• 27 publications
• ### Robust Low-rank Matrix Completion via an Alternating Manifold Proximal Gradient Continuation Method

Robust low-rank matrix completion (RMC), or robust principal component a...
08/18/2020 ∙ by Minhui Huang, et al. ∙ 0

• ### A Dual Framework for Low-rank Tensor Completion

We propose a novel formulation of the low-rank tensor completion problem...
12/04/2017 ∙ by Madhav Nimishakavi, et al. ∙ 0

• ### A low-rank matrix equation method for solving PDE-constrained optimization problems

PDE-constrained optimization problems arise in a broad number of applica...
05/29/2020 ∙ by Alexandra Bünger, et al. ∙ 0

• ### On the regularity and conditioning of low rank semidefinite programs

Low rank matrix recovery problems appear widely in statistics, combinato...
02/25/2020 ∙ by Lijun Ding, et al. ∙ 0

• ### Manopt, a Matlab toolbox for optimization on manifolds

Optimization on manifolds is a rapidly developing branch of nonlinear op...
08/23/2013 ∙ by Nicolas Boumal, et al. ∙ 0

• ### Orthogonal Rank-One Matrix Pursuit for Low Rank Matrix Completion

In this paper, we propose an efficient and scalable low rank matrix comp...
04/04/2014 ∙ by Zheng Wang, et al. ∙ 0

• ### Distributed Low-rank Subspace Segmentation

Vision problems ranging from image clustering to motion segmentation to ...
04/20/2013 ∙ by Ameet Talwalkar, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Our focus in this paper is on learning structured low-rank matrices and we consider the following problem:

 minW∈Rd×TCL(Y,W)+∥W∥2∗,subject toW∈D, (1)

where is a given matrix, is a convex loss function, denotes the nuclear norm regularizer, is the cost parameter, and is the linear subspace corresponding to structural constraints. It is well known the nuclear norm regularization promotes low rank solutions since is equal to the

-norm on the singular values of

(Fazel et al., 2001). The linear subspace in problem (1) is represented as , where is a linear map and represents equality () or greater than equal to () constraint.

Low-rank matrices are commonly learned in several machine learning applications such as matrix completion

(Abernethy et al., 2009; Boumal and Absil, 2011), multi-task learning (Argyriou et al., 2008; Zhang and Yeung, 2010; Jawanpuria and Nath, 2012), multivariate regression (Yuan et al., 2007; Journée et al., 2010), to name a few. In addition to the low-rank constraint, other structural constraints may exist, e.g., entry-wise non-negative/bounded constraints (Kannan et al., 2012; Marecek et al., 2017; Fang et al., 2017). Several linear dynamical system models require learning a low-rank Hankel matrix (Fazel et al., 2013; Markovsky and Usevich, 2013). A Hankel matrix has the structural constraint that all its anti-diagonal entries are the same. In robust matrix completion and robust PCA problems (Wright et al., 2009), the matrix is learned as a superimposition of a low-rank matrix and a sparse matrix. This sparse structure is modeled effectively by choosing the loss function as the -loss (Cambier and Absil, 2016). Similarly, low-rank 1-bit matrix completion solvers employ the logistic loss function (Davenport et al., 2014; Bhaskar and Javanmard, 2015) to complete a matrix.

We propose a generic framework to the structured low-rank matrix learning problem (1) that is well suited for handling a variety of loss functions (including non-smooth ones such as the -loss), structural constraints , and is scalable for large-scale problem instances. Using the duality theory, we introduce a novel modeling of structured low-rank matrix of rank as , where and . Our factorization naturally decouples the low-rank and structural constraints on . The low-rank of is enforced with , the structural constraint is modeled by , and the loss function specific structure is modeled by . The separation of low-rank and structural constraints onto separate factors makes the optimization conceptually simpler. To the best of our knowledge, such a decoupling of constraints has not been studied in the existing structured low-rank matrix learning literature (Fazel et al., 2013; Markovsky and Usevich, 2013; Yu et al., 2014; Cambier and Absil, 2016; Fang et al., 2017).

Our approach leads to an optimization problem on the Riemannian spectrahedron manifold. We exploit the Riemannian framework to develop computationally efficient conjugate gradient (first-order) and trust-region (second-order) algorithms. The proposed algorithms outperform state-of-the-art in robust and non-negative matrix completion problems as well as low-rank Hankel matrix learning application. Our algorithms readily scale to the Netflix data set, even with the non-smooth -loss and -SVR (

-insensitive support vector regression) loss functions.

The main contributions of the paper are:

• we propose a novel factorization for modeling structured low-rank matrices.

• we present a unified framework to learn structured low-rank matrix for several applications with different constraints and loss functions.

• we develop Riemannian conjugate gradient and trust-region algorithms for our framework. Our algorithms obtain state-of-the-art generalization performance across applications.

The outline of the paper is as follows. We introduce our structured low-rank matrix learning framework in Section 3. Section 4 presents our optimization approach. Section 5 discusses specialized formulations (for various applications) within from our framework. The empirical results are presented in Section 6. A shorter version of this work has been published in ICML’18 (Jawanpuria and Mishra, 2018). Our codes are available at https://pratikjawanpuria.com/ and https://bamdevmishra.com/. We begin by discussing the related works in the next section.

## 2 Related work

Matrix completion: Existing low-rank matrix learning literature has been primarily focused on problem (1) with the square loss and in the absence of the structural constraint . Singular value thresholding (Cai et al., 2010), proximal gradient descent (Toh and Yun, 2010), active subspace selection (Hsieh and Olsen, 2014) are some of the algorithms proposed to solve (1) without the structural constraint . Alternatively, several works (Wen et al., 2012; Mishra and Sepulchre, 2014; Boumal and Absil, 2015) propose to learn a low-rank matrix by fixing the rank explicitly, i.e. , where , and the rank is fixed a priori.

Robust matrix completion:

A matrix completion problem where few of the observed entries are perturbed/outliers

(Cambier and Absil, 2016). Candès et al. (2011) model the robust matrix completion problem as a convex program with separate low-rank and sparse constraints. He et al. (2012) propose to use -loss as the loss function and impose only the low-rank constraint. They propose an online algorithm that learns a low-dimensional subspace. Cambier and Absil (2016) build on this and employ the pseudo-Huber loss as a proxy for the non-smooth -loss. They solve the following optimization problem:

 minW∈Rd×T,rank(W)=r∑(i,j)∈ΩC√δ2+(Yij−Wij)2+∑(i,j)∈ΩcW2ij, (2)

where is the set of observed entries, is the complement of set , is the given rank, and is a given parameter. For small value of , the pseudo-Huber loss is close to the -loss. They develop a large-scale Riemannian conjugate gradient algorithm in the fixed-rank setting.

Non-negative matrix completion: Certain recommender system and image completion based applications desire matrix completion with non-negative entries (Kannan et al., 2012; Sun and Mazumder, 2013; Tsagkatakis et al., 2016; Fang et al., 2017), i.e . Kannan et al. (2012) present a block coordinate descent algorithm that learns as for a given rank . Recently, Fang et al. (2017) propose a large-scale alternating direction method of multipliers (ADMM) algorithm for (1) with the square loss in this setting. In each iteration , for a known matrix , they need to solve the following key sub-problem

 minW∈Rd×TC∥W−Bi∥2F+12∥W∥2∗.

They approximately solve it in closed form for a given rank via soft-thresholding of the singular values of .

Hankel matrix learning: The Hankel constraint is a linear equality constraint (denoted by ). Fazel et al. (2013) propose the ADMM approaches to solve (1) with the above constraint. On the other hand, Yu et al. (2014) learn a low-rank Hankel matrix by relaxing the structural constraint with a penalty term in the objective function. They solve the following optimization problem:

 minW∈Rd×TC∥W−Y∥2F+λ∥H(W)∥2F+12∥W∥∗.

Hence, they approximately enforce the structural constraint. They discuss a generalized gradient algorithm to solve the above problem. Markovsky and Usevich (2013) model the low-rank Hankel matrix learning problem as a non-linear least square problem in the fixed rank setting and propose a second-order algorithm.

Multi-task feature learning: The goal in multi-task feature learning (Argyriou et al., 2008) is to jointly learn a low-dimensional latent feature representation common across several classification/regression problems (tasks). The optimization approaches explored include alternate minimization (Argyriou et al., 2008), accelerated gradient descent (Ji and Ye, 2009) and mirror descent (Jawanpuria and Nath, 2011). Problems such as multi-class classification (Amit et al., 2007), multi-variate regression (Yuan et al., 2007), matrix completion with side information (Xu et al., 2013), among others, may be viewed as special cases of multi-task feature learning.

In the following section, we present a unified framework for solving the above problems.

## 3 Structured low-rank matrix learning

For notational simplicity, we consider only equality structural constraint to present the main results. However, our framework also admits inequality constraints . We use the notation to denote the set of positive semi-definite matrices with unit trace.

Problem (1) is a convex problem with linear constraint. In addition to the trace-norm regularizer, the loss function may also be non-smooth (as in the case of -loss or -SVR loss). Dealing with (1) directly or characterizing the nature of its optimal solution in the general setting is non trivial. To this end, we propose an equivalent partial dual of (1) in the following section. The use of dual framework often leads to a better understanding of the primal problem (Jawanpuria et al., 2015). In our case, the duality theory helps in discovering a novel factorization of its optimal solution, which is not evident directly from the primal problem (1). This subsequently helps in the development of computationally efficient algorithms.

### 3.1 Decoupling of constraints and duality

The following theorem presents a dual problem equivalent to the primal problem (1).

Let be the Fenchel conjugate function of the loss: , . An equivalent partial dual problem of (1) with constraint is

 minΘ∈Pd  maxZ∈Rd×T,s∈Rn  f(Θ,Z,s), (3)

where the function is defined as

 f(Θ,Z,s)\coloneqq−CL∗(−ZC)−12⟨Θ(Z+A∗(s)),Z+A∗(s)⟩, (4)

and is the adjoint of . We first note the the following variational characterization of the squared trace norm regularizer from Theorem 4.1 in (Argyriou et al., 2006):

 (5)

where . For a given matrix, the optimal .

By employing the result in (5), problem (1) can be shown to be equivalent to the following problem:

 (6) subject to: A(W)=0,

where is the indicator function for set .

In the following, we derive the dual problem of the following sub-problem of (6):

 minW∈Rd×T CL(Y,W)+12⟨Θ†W,W⟩+irange(Θ)(W) (7) subject to: A(W)=0,.

We now introduce auxiliary variable such that . The dual variables with respect to the constraints and are and , respectively. Hence, the Lagrangian () is as follows:

 Q(W,U,Z,s)=CL(Y,U)+12⟨Θ†W,W⟩+irange(Θ)(W)+⟨Z,U−W⟩−⟨s,A(W)⟩. (8)

The dual function of (7) is defined as

 q(Z,s;Θ)\coloneqqminW∈Rd×T,U∈Rd×TQ(W,U,Z,s) (9)

Using the definition of the conjugate function (Boyd and Vandenberghe, 2004), we get

 minU∈Rd×TCL(Y,U)+⟨Z,U⟩=−CL∗(−ZC), (10)

where be the Fenchel conjugate function of the loss: , .

We next compute the minimizer of with respect to . From the definition of the adjoint operator, it follows that

The minimizer of with respect to satisfy the following conditions

 (11) W∈range(Θ) (12)

which implies,

 Θ†W=Z+A∗(s), subjecttoW∈range(Θ) (13)

Thus, the expression of the minimizer of with respect to is

 W=Θ(Z+A∗(s)). (14)

Plugging the results (14) and (10) in the dual function (9), we obtain

Thus, the following partial dual problem is equivalent to the primal problem (1)

 minΘ∈Pd maxZ∈Rd×T,s∈Rnq(Z,s;Θ).

Our next result gives the expression of an optimal solution of the primal problem (1). (Representer theorem) Let be an optimal solution of (3). Then, an optimal solution of (1) with constraint is:

 ¯W=¯Θ(¯Z+A∗(¯s)).

The above result is obtained from the proof of Theorem 3.1. In particular, see (14).

Remark 1: in (3) is a positive semi-definite with unit trace. Hence, an optimal in (3) is a low-rank matrix.

Remark 2: Theorem 3.1 gives us an expression for an optimal solution of (1): it is a product of and . The low-rank constraint is enforced through , the loss-specific structure (encoded in ) is enforced through , and the structural constraint is enforced through . Overall, such a decoupling of constraints onto separate variables facilitates the use of simpler optimization techniques as compared to the case where all the constraints are enforced on a single variable.

As discussed earlier, an optimal of (3) is a low-rank positive semi-definite matrix. However, an algorithm for (3) need not produce intermediate iterates that are low rank. For large-scale optimization, this observation as well as other computational efficiency concerns motivate a fixed-rank parameterization of as discussed in the following section.

### 3.2 A new fixed-rank factorization of W

We model as a rank matrix in the following way: , where and . The proposed modeling has several benefits in large-scale low-rank matrix learning problems, where is a common setting. First, the parameterization ensures that constraint is always satisfied. This saves the costly projection operations to ensure . Enforcing constraint costs . Second, the dimension of the search space of problem (3) with is , which is much lower than the dimension () of . By restricting the search space for , we gain computational efficiency. Third, increasing the parameter in (1) and (3) promotes low training error but high rank of the solution, and vice-versa. The proposed fixed-rank parameterization decouples this trade-off.

Remark 3: With the proposed fixed-rank parameterization of , the expression for primal variable becomes .

Instead of solving a minimax objective as in (3), we solve a minimization problem after incorporating the parameterization as follows:

 minU∈Rd×r,∥U∥F=1 g(U), (15)

where the function is defined as

 g(U)\coloneqqmaxZ∈Rd×T,s∈Rn −CL∗(−Z/C)−12∥∥U⊤(Z+A∗(s))∥∥2F. (16)

(15) is the proposed generic structured low-rank matrix learning problem. The application-specific details are modeled within the sub-problem (16). In Section 5, we present specialized versions of (16), tailored for applications such as Hankel matrix learning, non-negative matrix completion, and robust-matrix completion. We propose a unified optimization framework for solving (15) in Section 4.

The fixed-rank parameterization, , results in non-convexity of the overall optimization problem (15), though sub-problem (16) is a convex optimization problem. We end this section by stating sufficient conditions of obtaining a globally optimal solution of (3) from a solution of (15). Let be a feasible solution of (15) and be an optimal solution of the convex problem in (16) at . Let be the maximum singular of the matrix . A candidate solution for (3) is , where . The duality gap () associated with is given by

 Δ=12(σ21−∥∥^U⊤(^Z+A∗(^s))∥∥2F).

Furthermore, if is a rank deficient local minimum of (15), then is a global minimum of (3), i.e., . The min-max problem (3) can be equivalently re-written as

 minΘ∈PdG1(Θ), (17)

where

 G1(Θ)\coloneqqmaxs∈RnmaxZ∈Rd×T−CL∗(−ZC)−12⟨Θ(Z+A∗(s)),Z+A∗(s)⟩. (18)

Given as described in the statement of the theorem, .

Using the min-max interchange (Sion, 1958), the max-min problem equivalent to (3) can be written as

 maxs∈RnmaxZ∈Rd×TG2(Z,s), (19)

where

 G2(Z,s)\coloneqq−CL∗(−^ZC)−12B(Z,s), and (20) B(Z,s)=maxΘ∈Pd⟨Θ,(Z+A∗(s))(Z+A∗(s))⊤⟩. (21)

Note that problem (21

) is a well studied problem in the duality theory. It is one of the definitions of the spectral norm (maximum eigenvalue of a matrix) – as the dual of the trace norm

(Boyd and Vandenberghe, 2004). Its optimal value is the spectral norm of the matrix  (Boyd and Vandenberghe, 2004).

Let be the maximum singular of the matrix . Then, the spectral norm of a symmetric matrix is equal to .

The duality gap () associated with is given by

 Δ =G1(^Θ)−G2(^Z,^s) =12(σ21−∥∥^U⊤(^Z+A∗(^s))∥∥2F).

Last part of the theorem: it is a special case of the general result proved by Journée et al. (2010, Theorem 7 and Corollary 8).

The value of the duality gap can be used as to verify whether a candidate solution is a global optimum of (3). The cost of computing is computationally cheap as it requires only a few power iteration updates.

## 4 Optimization on spectrahedron manifold

The matrix lies in, what is popularly known as, the spectrahedron manifold . Specifically, the spectrahedron manifold has the structure of a compact Riemannian quotient manifold (Journée et al., 2010). The quotient structure takes the rotational invariance of the constraint into account. The Riemannian optimization framework embeds the constraint into the search space, conceptually translating the constrained optimization problem (15) into an unconstrained optimization over the spectrahedron manifold. The Riemannian optimization framework generalizes various classical first- and second-order Euclidean algorithms (e.g., the conjugate gradient and trust region algorithms) to manifolds and provide concrete convergence guarantees (Edelman et al., 1998; Absil et al., 2008; Journée et al., 2010; Sato and Iwai, 2013; Sato et al., 2017). In particular, Absil et al. (2008) provide a systematic way of implementing Riemannian conjugate gradient (CG) and trust region (TR) algorithms. The particular details are provided in Appendix A.

We implement the Riemannian conjugate gradient (CG) and trust-region (TR) algorithms for (15). These require the notions of the Riemannian gradient (first-order derivative of the objective function on the manifold), Riemannian Hessian along a search direction (the covariant derivative of the Riemannian gradient along a tangential direction on the manifold), and the retraction operator (that ensures that we always stay on the manifold). The Riemannian gradient and Hessian notions require computations of the standard (Euclidean) gradient and the directional derivative of this gradient along a given search direction, which are expressed in the following lemma. Let be an optimal solution of the convex problem (16) at . Then, the gradient of at is given by the following expression:

 ∇g(U)=−(^Z+A∗(^s))(^Z+A∗(^s))⊤U.

Let denote the directional derivative of the gradient along . Let denote the directional derivative of along at . Then,

 D∇g(U)[V]=(˙Z+A∗(˙s))(^Z+A∗(^s))⊤U+(^Z+A∗(^s))((˙Z+A∗(˙s))⊤U−(^Z+A∗(^s))⊤V).

The gradient is computed by employing the Danskin’s theorem (Bertsekas, 1999; Bonnans and Shapiro, 2000)

. The directional derivative of the gradient follows directly from the chain rule. The terms

are computed from the first-order KKT conditions of the convex problem (16) at . In particular, when is differentiable (e.g., the square loss), are obtained by solving the following linear system:

 D∇L∗(−Z/C)[˙Z]−(UV⊤+VU⊤)(^Z+A∗(^s))−UU⊤(˙Z+A∗(˙s))=0 A((UV⊤+VU⊤)(^Z+A∗(^s))+UU⊤(˙Z+A∗(˙s)))=0.

In various applications such as the traditional matrix completion or multi-variate regression, both the gradient and its directional derivation can be computed by closed-form expressions.

Riemannian CG algorithm: It computes the Riemannian conjugate gradient direction by employing the first-order information (Lemma 4). We perform Armijo line search on to compute a step-size that sufficiently decreases on the manifold. We update along the conjugate direction with the step-size by retraction.

Riemannian TR algorithm: It solves a Riemannian trust-region sub-problem (in a neighborhood) at every iteration. Solving the trust-region sub-problem leads to a search direction that minimizes a quadratic approximation of on the manifold. Solving this sub-problem does not require inverting the full Hessian of the objective function. It makes use of and its directional derivative (Lemma 4).

Overall algorithm: Algorithm 1 summarizes the proposed first- and second-order algorithms for solving (15).

Convergence: Absil et al. (2008); Sato and Iwai (2013) discuss the rate of convergence analysis of manifold algorithms, and their results are directly applicable in the present setting. The global convergence results for conjugate gradient algorithm is discussed in (Sato and Iwai, 2013). For trust regions, the global convergence to a first-order critical point is discussed in (Absil et al., 2008)[Section 7.4.1], while the local convergence to local minima is discussed in (Absil et al., 2008)[Section 7.4.2].

Computational complexity: The computational complexity of Algorithm 1 is the sum of the cost of manifold related operations and the cost of application specific ingredients. The spectrahedron manifold operations cost . The following section discusses the application specific computational costs.

Although we have focused on batch algorithms, our framework can be extended to stochastic settings, e.g., when the columns are streamed one by one (Bonnabel, 2013). In this case, when a new column is received, we perform a (stochastic) gradient update on .

## 5 Specialized formulations for applications

The expression of in (16) depends on the functions and employed in the application at hand. Below, we discuss for popular applications.

### 5.1 Matrix completion

Given a partially observed matrix at indices , we learn the full matrix (Toh and Yun, 2010; Cai et al., 2010). Let be the set of indices that are observed in , the column of . Let and represents the rows of and , respectively, that correspond to the indices in . Then, the function in (16), for the standard low-rank matrix completion problem with square loss, is as follows:

 g(U)=T∑t=1maxzt∈R|Ωt| ⟨ytΩt,zt⟩−14C∥zt∥2−12∥∥U⊤Ωtzt∥∥2. (22)

Problem (22) is a least-squares problem for each and can be solved efficiently in closed-form by employing the Woodbury matrix-inversion identity. The computational cost of solving problem (22) is . Overall, the per-iteration computational cost of Algorithm 1 is . Problem (22) can be solved in parallel for each , making it amenable to parallelization.

### 5.2 Robust matrix completion

The setting of this problem is same as the low-rank matrix completion problem. Following Cambier and Absil (2016), we employ a robust loss function (-loss) instead of the square loss. The expression for , in our case, is as follows:

 g(U)=T∑t=1 maxzt∈[−C,C]|Ωt| ⟨ytΩt,zt⟩−12∥∥U⊤Ωtzt∥∥2.

Coordinate descent algorithm (CD) is employed to efficiently solve the above problem. The computational cost depends on the number of iterations of the coordinate descent algorithm. In particular, if is the number of iterations, then the cost of computing is . A small value of suffices for a good approximation to the solution. Hence, for the case of robust matrix completion problem, the per-iteration computational cost of Algorithm 1 is . It should be noted that the computation of can be a parallelized across . In addition to the -loss, we also experimented with the -SVR loss (Ho and Lin, 2012) for this problem.

### 5.3 Non-negative matrix completion

Given a partially observed matrix at indices , the aim here is to learn the full matrix with non-negative entries only. For this problem, the function with the square loss function is as follows:

 g(U)=T∑t=1 maxst∈[0,∞)d (maxzt∈R|Ωt| ⟨ytΩt,zt⟩−14C∥zt∥2−12∥∥U⊤Ωtzt+U⊤st∥∥2). (23)

Since the dual variables have non-negative constraints (due to the constraint in primal problem), we model (23) as a non-negative least squares (NNLS) problem. We employ the NNLS algorithm of Kim et al. (2013) to solve for the variable in (23). In each iteration of NNLS, is solved in closed form. If is the number of iterations of NNLS, then the cost of computing is . Hence, for the case of non-negative matrix completion problem, the per-iteration computational cost of Algorithm 1 is . In our experiments, we observe that a small value of is sufficient to obtain a good approximation of (23).

It should be noted that (23) is computationally challenging as it has entry-wise non-negativity constraints. In our initial experiments, we observe that the solution is highly sparse. For large-scale problems, we exploit this observation for an efficient implementation.

### 5.4 Hankel matrix learning

Hankel matrices have the structural constraint that its anti-diagonal entries are the same. A Hankel matrix corresponding to a vector is as follows:

 ⎡⎢⎣y1y2y3y4y5y2y3y4y5y6y3y4y5y6y7⎤⎥⎦.

Hankel matrices play an important role in determining the order or complexity of various linear time-invariant systems (Fazel et al., 2013; Markovsky and Usevich, 2013). The aim in such settings is to find the minimum order of the system that explains the observed data (i.e. low error). A low-order system is usually desired as it translates into cost benefit and ease of analysis. Hence, the optimization problem can be modeled as problem (1) where we wish to learn a low-rank Hankel matrix that results in low error.

The function specialized for the case of low-rank Hankel matrix learning with square loss is as follows:

 g(U)=maxst∈Rd∀t,z∈Rd+T−1z⊤y−14C∥z∥2−12T∑t=1∥∥U⊤st∥∥2 subject to:∑(i,t):i+t=k,1≤i≤d,1≤t≤Tsti=zk ∀k=2,…,d+T.

We solve the above problem via a conjugate gradient algorithm. The equality constraints are handled efficiently by using an affine projection operator. The computational cost of computing is , where is the number of iterations of the inner conjugate gradient algorithm. Hence, for the case of Hankel matrix learning problem, the per-iteration complexity of Algorithm 1 is .

The paradigm of multi-task feature learning advocates learning several related tasks (regression/classification problems) jointly by sharing a low-dimensional latent feature representation across all the tasks (Argyriou et al., 2008). Since the tasks are related, learning them together, by sharing knowledge, is expected to obtain better generalization than learning them independently.

Multi-task feature learning setup is as follows: we are given tasks and the aim is to learn the model parameter for each task such that they share a low-dimensional latent feature space. Argyriou et al. (2008) employ the trace-norm regularizer on the model parameter matrix to enforce a low-dimensional representation of . Each task has an input/output training data set , where and . The prediction function for task is .

The proposed generic formulation for structured low-rank matrix learning (15) can easily be specialized to this setting, with the function being as follows:

 g(U)=T∑t=1maxzt∈Rnt ⟨yt,zt⟩−14C∥zt∥2−12∥∥U⊤X⊤tzt∥∥2.

The function can be computed in closed form and costs . Hence, the per-iteration complexity of Algorithm 1 for MTFL is . The computation of can be a parallelized across the tasks. This setting can be further specialized for problems such as matrix completion with side information, inductive matrix completion, and multi-variate regression.

## 6 Experiments

In this section, we evaluate the generalization performance as well as computational efficiency of our approach against state-of-the-art in different applications. It should be emphasized that state-of-the-art in each application are different and to the best of our knowledge there does not exist a unified framework for solving such applications. All our algorithms are implemented using the Manopt toolbox (Boumal et al., 2014). We term our algorithm as Riemannian Structured Low-rank Matrix learning (RSLM).

### 6.1 Matrix Completion

Baseline techniques: Our first- and second-order matrix completion algorithms are denoted by RSLM-cg and RSLM-tr, respectively. We compare against state-of-the-art fixed-rank and nuclear norm minimization based matrix completion solvers:

• APGL: An accelerated proximal gradient approach for nuclear norm regularization with square loss function (Toh and Yun, 2010).

• Active ALT: State-of-the-art first-order nuclear norm solver based on active subspace selection (Hsieh and Olsen, 2014).

• MMBS: A second-order fixed rank nuclear norm minimization algorithm (Mishra et al., 2013). It employs an efficient factorization of the matrix which renders the trace norm regularizer differentiable in the primal formulation.

• R3MC: A non-linear conjugate gradient based approach for fixed rank matrix completion (Mishra and Sepulchre, 2014). It employs a Riemannian preconditioning technique, customized for the square loss function.

• RTRMC: It models fixed rank matrix completion problems with square loss on the Grassmann manifold and solves it via a second order preconditioned Riemannian trust-region method (Boumal and Absil, 2011, 2015).

• LMaFit: A nonlinear successive over-relaxation based approach for low rank matrix completion based on alternate least squares (Wen et al., 2012).

• PRP: a recent proximal Riemannian pursuit algorithm Tan et al. (2016).

Data sets and experimental setup: We compare the performance of the above algorithms on a synthetic data set and three movie recommendation data sets: Netflix (Recht and Ré, 2013), MovieLens10m (ML10m), and MovieLens20m (ML20m) (Harper and Konstan, 2015). The data set statistics is provided in Table 1. The regularization parameters for respective algorithms are cross-validated in the set to obtain their best generalization performance. The optimization strategies for the competing algorithm were set to those prescribed by their authors. For instance, line-search, continuation and truncation were kept on for APGL. The initialization for all the algorithms is based on the first few singular vectors of the given partially complete matrix (Boumal and Absil, 2015). All the fixed algorithms (R3MC, LMaFit, MMBS, RTRMC, Proposed-cg-sq, Proposed-tr-sq) are provided the rank for real data sets and for synthetic data set. In all variable rank approaches (APGL, Active ALT, PRP), the maximum rank parameter is set to for real data sets and for synthetic data set. We run all the methods on five random / train/test splits and report the average root mean squared error on the test set (test RMSE). We report the minimum test RMSE achieved after the algorithms have converged or have reached maximum number of iterations.

Synthetic data set results. We choose , and to create a synthetic data set (with observed entries), following the procedure detailed in (Boumal and Absil, 2011, 2015). The number of observed entries for both training () and testing was . The generalization performance of different methods is shown in Figure 1(a). For the same run, we also plot the variation of the relative duality gap across iterations for our algorithms in figure 1(b). It can be observed that RSLM-tr approach the global optima for the nuclear norm regularized problem 1 in very few iterations and obtain test RMSE . Our first order algorithm, RSLM-cg, also achieves low test RMSE and low relative duality gap. It should be noted that our methods are able to exploit the condition that (rectangular matrices), similar to RTRMC.

Real-world data set results: Figures 2 (a)&(b) display the evolution of root mean squared error on the test set (test RMSE) against the training time on the Netflix data set for first- and second-order algorithms, respectively. RSLM-cg is among the most efficient first-order method and RSLM-tr is the best second-order method. Table 2 reports the test RMSE obtained by all the algorithms on the three data sets. We observe that both our algorithms obtain the best generalization performance.

### 6.2 Robust matrix completion

We develop RSLM with two different robust loss functions: -loss and -SVR loss. The non-smooth nature of -loss and -SVR loss makes them challenging to optimize in large-scale low-rank setting.

Baseline techniques: For evaluation, we compare RSLM against RMC (Cambier and Absil, 2016), a large-scale state-of-the-art Riemannian optimization algorithm for robust matrix completion. RMC employs the pseudo-Huber loss function (2). While solving (2), RMC decreases the value of at regular intervals. Hence, at later iterations, when the value of is low, the pseudo-Huber loss approximates the -loss.

Data sets and experimental setup: We compare the performance of all three algorithms on the Netflix data set. We follow the same experimental setup as described for the case of matrix completion. Rank is fixed for all the algorithms.

Results: Figure 2(c) shows the results on the Netflix data set. We observe that RSLM scales effortlessly on the Netflix data set even with non-smooth loss functions and obtains the best generalization performance with the -SVR loss. The test RMSE obtained at convergence are: (RSLM with -SVR loss), (RSLM with -loss), and (RMC).

### 6.3 Non-negative matrix completion

Baseline techniques: We include the following methods in our comparisons: BMC (Fang et al., 2017) and BMA (Kannan et al., 2012). BMC is a recently proposed ADMM based algorithm with carefully designed update rules to ensure an efficient computational and space complexity. BMA is based on co-ordinate descent algorithm.

Data sets and experimental setup: We compare the performance of the above algorithms on three real world data sets (Table 1): MovieLens1m (ML1m), MovieLens10m (ML10m), and MovieLens20m (ML20m). The experimental setup is the same as described for the case of matrix completion. The performance of all the algorithms are evaluated at three ranks: .

Results: We observe in Table 3 that RSLM outperforms both BMC and BMA. The improvement obtained by RSLM over BMC is more pronounced with larger data sets. BMA is not able to run on MovieLens20m due to high memory and time complexity.

Figures 3(a)-(e) compare the convergence behavior of RSLM and BMC for different values of parameters and on all three data sets. Both algorithms aim to minimize the same primal objective (1) for a given rank . Though the proposed approach solves the proposed fixed-rank dual formulation (15), we compute the corresponding primal objective value of every iterate for the plots. We observe that our algorithm RSLM is significantly faster than BMC in converging to a lower objective value.

Figures 4(a)-(e) plot the evolution of test RMSE against training time for algorithms RSLM and BMC on different data sets with different ranks (and the corresponding best parameter). For a given data set and rank, both RSLM and BMC chose the same value of via cross-validation. We observe that the RSLM outperforms BMC in converging to a lower test RMSE at a much faster rate.

### 6.4 Hankel Matrix Learning

Baseline techniques: We compare our algorithm RSLM with three methods (discussed in Section 2): GCG (Yu et al., 2014), SLRA (Markovsky, 2014; Markovsky and Usevich, 2014), and DADM (Fazel et al., 2013). Since GCG and DADM employ the nuclear norm regularizer, we tune the regularization parameter to vary the rank of their solution.

Data sets and experimental setup: Given a vector , we obtain a Hankel matrix as discussed in Section 5.4. The true parameter is generated as the impulse response (skipping the first sample) of a discrete-time random linear time-invariant system of order (Markovsky, 2011; Fazel et al., 2013)

. The noisy estimate of these parameters (

) are generated as , where is the measurement noise. We set and generate

from the standard Gaussian distribution

. It should be noted that is the train data and is the true data.

We generate three different data set of varying size and order. has , where is the true order of the underlying system. The other two data sets, and , has the configurations and , respectively. We evaluate the algorithms on all the three data sets and report their RMSE with respect to the true data (Liu and Vandenberghe, 2009) at different ranks () in Table 4.

Results: We observe from Table 4 that the proposed algorithm RSLM obtains the best result in all the three data sets and across different ranks. In addition, RSLM also usually obtains the lowest true RMSE for a data set at the rank equal to of the data set. This implies that RSLM is able to identify the minimal order of the systems corresponding to the data sets.

Experimental setup: We compare the generalization performance of our algorithm RSLM with the convex multi-task feature learning algorithm MTFL (Argyriou et al., 2008). Optimal solution for MTFL at different ranks is obtained by tracing the solution path with respect to the regularization parameter, whose value is varied as . For RSLM, we fix the value of the regularization parameter , and vary the rank to obtain different ranked solutions. The experiments are performed on two benchmark multi-task regression data sets:

• Parkinsons: We need to predict the Parkinson’s disease symptom score of 42 patients (Frank and Asuncion, 2010). Each patient is described using bio-medical features. The data set has a total of 5,875 readings from all the patients.

• School: The data consists of 15,362 students from 139 schools (Argyriou et al., 2008). The aim is to predict the performance of each student given their description and earlier record. Overall, each student data has 28 features. Predicting the performance of students belonging to one school is considered as one task.

Following (Argyriou et al., 2008), we report the normalized mean square error over the test set (test NMSE).

Results: Figure 5(a) present the results on the Parkinsons data set. We observe from the figure that our method achieves better generalization performance at low ranks compared to MTFL. Figure 5(b) plots the variation of duality gap with rank for our algorithm. We observe that as the rank is varied from low to high, we converge to the globally optimal solution of (3) and obtain the duality gap close to zero. Similar results are obtained on the School data set as observed in Figures 5 (c)&(d).

## 7 Conclusion

We have proposed a novel factorization for structured low-rank matrix learning problems, which stems from the application of duality theory and rank-constrained parameterization of positive semi-definite matrices. This allows to develop a conceptually simpler and unified optimization framework for various applications. State-of-the-art performance of our algorithms on several applications shows the effectiveness of our approach.

## Acknowledgement

We thank Léopold Cambier, Ivan Markovsky, and Konstantin Usevich for useful discussions. We also thank Rodolphe Jenatton, Syama Rangapuram, and Matthias Hein for giving feedback on an earlier version of this work. We are grateful to Adams Wei Yu and Mingkui Tan for making their codes available. Most of this work was done when PJ and BM were at Amazon.com, India.

## Appendix A Optimization on Spectrahedron

We are interested in the optimization problem of the form

 minΘ∈Pdf(Θ), (24)

where is the set of positive semi-definite matrices with unit trace and is a smooth function. A specific interest is when we seek matrices of rank . Using the parameterization , the problem (24) is formulated as

 minU∈Sdrf(UU⊤), (25)

where , which is called the spectrahedron manifold (Journée et al., 2010). It should be emphasized the objective function in (25) is invariant to the post multiplication of with orthogonal matrices of size , i.e., for all , which is the set of orthogonal matrices of size such that . An implication of the this observation is that the minimizers of (25) are no longer isolated in the matrix space, but are isolated in the quotient space, which is the set of equivalence classes . Consequently, the search space is

 M\coloneqqSdr/O(r). (26)

In other words, the optimization problem (25) has the structure of optimization on the quotient manifold, i.e.,

 min[U]∈Mf([U]), (27)

but numerically, by necessity, algorithms are implemented in the matrix space , which is also called the total space.

Below, we briefly discuss the manifold ingredients and their matrix characterizations for (27). Specific details of the spectrahedron manifold are discussed in (Journée et al., 2010). A general introduction to manifold optimization and numerical algorithms on manifolds are discussed in (Absil et al., 2008).

### a.1 Tangent vector representation as horizontal lifts

Since the manifold , defined in (26), is an abstract space, the elements of its tangent space at also call for a matrix representation in the tangent space that respects the equivalence relation for all . Equivalently, the matrix representation of should be restricted to the directions in the tangent space on the total space at that do not induce a displacement along the equivalence class . In particular, we decompose into complementary subspaces, the vertical and horizontal subspaces, such that .