 # Matrix Recovery with Implicitly Low-Rank Data

In this paper, we study the problem of matrix recovery, which aims to restore a target matrix of authentic samples from grossly corrupted observations. Most of the existing methods, such as the well-known Robust Principal Component Analysis (RPCA), assume that the target matrix we wish to recover is low-rank. However, the underlying data structure is often non-linear in practice, therefore the low-rankness assumption could be violated. To tackle this issue, we propose a novel method for matrix recovery in this paper, which could well handle the case where the target matrix is low-rank in an implicit feature space but high-rank or even full-rank in its original form. Namely, our method pursues the low-rank structure of the target matrix in an implicit feature space. By making use of the specifics of an accelerated proximal gradient based optimization algorithm, the proposed method could recover the target matrix with non-linear structures from its corrupted version. Comprehensive experiments on both synthetic and real datasets demonstrate the superiority of our method.


## 1 Introduction

Due to the unconstrained nature of today’s data acquisition procedures, observed data is often contaminated by gross errors, such as large corruptions and outliers. Gross errors, in general, can significantly reduce the representativeness of data samples and therefore seriously distort the analysis of data. Given this pressing situation, it is of considerable practical significance to study the problem of Matrix Recovery, which aims to correct the errors possibly existing in a data matrix of observations.

###### Problem 1.

(Matrix Recovery). Let X ∈ R^{d×n} be an observed data matrix which could be decomposed as

 X = A + E,

where A is the target matrix of interest, in which each column is a d-dimensional authentic sample, and E corresponds to the possible errors. Given X, the goal is to recover A.

In general, the above problem is ill-posed, and thus some restrictions need to be imposed on both A and E. Several methods have been proposed to solve the above problem under proper constraints. For example, provided that A is low-rank and E is sparse, Problem 1 can be well solved by a convex procedure termed Principal Component Pursuit (PCP), which is also known as Robust Principal Component Analysis (RPCA) candes2011robust ; zhang2015exact . Outlier Pursuit (OP) xu2010robust solves Problem 1 under the conditions that A is low-rank and E is column-wisely sparse. Under similar conditions, Low-Rank Representation (LRR) tpami_2013_lrr ; liu:tpami:2016 is guaranteed to recover the row space of A. In addition, LRR equipped with proper dictionaries can handle the cases where A is of high coherence liu2017blessing ; liu:tsp:2016 . Even though these related approaches are very powerful, they all rely on the assumption that A is low-rank, which, however, could be violated in practice.
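For concreteness, the low-rank plus sparse decomposition that PCP/RPCA performs can be sketched in a few lines of numpy. The helper names (`svt`, `pcp`) and the fixed penalty `mu` are illustrative choices of ours, not the authors' implementation; the scheme below is a basic ADMM, not a tuned solver.

```python
import numpy as np

def svt(M, tau):
    # Singular value thresholding: the proximal operator of tau * nuclear norm.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def pcp(X, lam=None, mu=None, n_iter=200):
    # Principal Component Pursuit: min ||A||_* + lam ||E||_1  s.t.  X = A + E,
    # solved with a basic fixed-penalty ADMM scheme.
    m, n = X.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = mu if mu is not None else 0.25 * m * n / (np.abs(X).sum() + 1e-12)
    A = np.zeros_like(X); E = np.zeros_like(X); Y = np.zeros_like(X)
    for _ in range(n_iter):
        A = svt(X - E + Y / mu, 1.0 / mu)              # low-rank step
        T = X - A + Y / mu
        E = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0.0)  # sparse step
        Y = Y + mu * (X - A - E)                       # dual update
    return A, E
```

With a rank-1 target and a few percent of gross entries, this sketch recovers the low-rank part closely, which is the behavior the exact-recovery theory of candes2011robust predicts.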

To cope with data of complex structures, it is more suitable to consider the cases where A is low-rank after feature mapping: namely, A is implicitly low-rank in some (unknown) feature space but could be high-rank or even full-rank by itself. There are only a few investigations in this direction, such as Kernel Principal Component Analysis (KPCA) nguyen2009robust . In general, KPCA can handle data matrices that are implicitly low-rank but originally high-rank. However, this method assumes that the data is contaminated by small Gaussian noise and is therefore brittle in the presence of gross errors. We also notice that many kernel methods have been established in the community of low-rankness modeling, e.g.,  xiao2016robust ; ji2017low ; xie2018implicit ; nguyen2015kernel . Nevertheless, these methods are designed for the specific purposes of classification or clustering, and thus they cannot be directly applied to Problem 1, which is essentially a data recovery problem.

In this work, we study Problem 1 in the context where A is implicitly low-rank and E contains gross errors. Following xu2010robust ; tpami_2013_lrr , we focus on the case where E is column-wisely sparse, i.e., the observed data matrix is contaminated by outliers. The basic idea of our method, pursuing the low-rank structure of A in an implicit feature space of higher but unknown (maybe infinite) dimension, is simple and traditional. Nevertheless, it is rather challenging to realize this idea:

• Firstly, the rank of a matrix with unknown ambient dimension cannot be calculated directly. To overcome this difficulty, we show that the nuclear norm of A after feature mapping is actually equal to the nuclear norm of the square root of the Gram matrix (equivalently, the kernel matrix). This makes it possible to obtain a computable formulation for imposing the low-rank constraint in an unknown-dimensional implicit space.

• Secondly, in the presence of outliers, it is inaccurate to estimate the Gram matrix based on X, because the outliers can seriously reduce the quality of the estimated Gram matrix, in which every column and row corresponding to an outlier is corrupted. Hence, we build our algorithm upon a kernel function that only defines the inner product of points in the feature space. Since the kernel function is independent of the data X, this strategy helps reduce the influence of outliers and preserve the geometric structure of the clean data.

• Finally, the combination of implicit feature mapping with kernel low-rankness pursuit leads to a challenging optimization problem, which is nonconvex and nonsmooth. To overcome this difficulty, we adopt the Accelerated Proximal Gradient (APG) method established by li2015accelerated , together with some linearization operators, to solve the resulting optimization problem. In particular, we provide theoretical analyses for the convergence of our optimization algorithm; namely, the solution produced by the proposed algorithm is analytically proved to be a stationary point.

We conduct experiments on both synthetic and real datasets, and we also compare with some state-of-the-art methods. The results show that, in terms of recovery accuracy, our method is distinctly better than all competing methods.

## 2 Related Work

### 2.1 Linear low-rank recovery

Recently, linear low-rank recovery has attracted great attention due to its pleasing efficacy in exploring the low-dimensional structures from given measurements. Formally, the linear low-rank recovery problem can be directly or indirectly written in the following form:

 min_A ∥A∥_* + λ∥A − X∥_ℓ, (1)

where X and A represent the given data and the desired structure, respectively, and A − X is the error residue. ∥·∥_ℓ is a certain robust norm measuring the residual between the observed and recovered signals, ∥A∥_* denotes the low-rank-promoting regularizer, and λ is a non-negative parameter that trades off the recovery fidelity against the low-rank regularization. The major difference among existing recovery methods lies in the choice of penalty on the residual. Candès et al. candes2011robust choose the ℓ1 norm to model sparse noise; they theoretically prove that their model can exactly recover the ground-truth data under the assumption of sparse outliers/noise. The works in xu2010robust ; zhang2015exact select the ℓ2,1 norm to penalize the column-sparse residual; their models can also recover the correct column space of the data. Linear low-rank recovery has been applied to many computer vision tasks, such as face recognition zheng2014fisher and image classification zhang2015image , where it performs very well. Besides, for low-rank matrix recovery, Liu et al. liu2013fast propose a fast tri-factorization method, and Cui et al. cui2018exact come up with a transformed affine matrix rank minimization method.
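The ℓ2,1 norm that OP-style methods use, and its proximal operator (the workhorse of the column-sparse update), are simple to state in code. This is a generic sketch with our own helper names, not any particular paper's implementation:

```python
import numpy as np

def l21_norm(E):
    # ||E||_{2,1}: the sum of the l2 norms of the columns of E.
    return np.linalg.norm(E, axis=0).sum()

def l21_prox(M, tau):
    # Column-wise shrinkage: the proximal operator of tau * ||.||_{2,1}.
    # Columns whose norm is <= tau are zeroed entirely, which is exactly
    # why the l2,1 penalty promotes column-sparse (outlier) residuals.
    norms = np.linalg.norm(M, axis=0)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return M * scale
```

A column with norm 5 shrinks to norm 4 under `tau = 1`, while a column with norm 0.3 is zeroed outright.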

### 2.2 Kernel low-rank method

KPCA, a widespread extension of traditional PCA, seeks a low-rank approximation of the affinity among the data points in the kernel space scholkopf1998nonlinear . Similar to PCA, it is sensitive to outliers even after mapping. Hence, some robust kernel low-rank methods have been proposed and investigated. In particular, the works in baghshah2011learning ; ji2017low ; xie2018implicit provide kernel low-rank methods for subspace clustering, which demonstrate that kernel low-rank approximation does benefit the clustering of non-linear data. Nguyen et al. nguyen2015kernel apply kernel low-rank representation to face recognition. The works in pan2011learning ; rakotomamonjy2014 investigate the influence of different kernels. Garg et al. garg2016non present a new way to pursue low-rankness in the kernel space, but the measurement of the other regularization is still in the original space, and thus their method cannot be directly utilized to solve Problem 1.

Though the existing methods have achieved great success on clustering or linear low-rank recovery tasks, none of them can robustly recover, in the original space, non-linear data whose intrinsic dimension may be very low. Comparatively, our model solves Problem 1 robustly when A is implicitly low-rank but could be high-rank or even full-rank by itself.

## 3 Kernel Low-Rank Recovery

### 3.1 Problem Formulation

The model for solving the linear low-rank recovery problem with column-wise noise can be represented as:

 min_A ∥A∥_* + λ∥A − X∥_{2,1}, (2)

where ∥·∥_* is the nuclear norm (sum of all singular values) and the ℓ2,1-norm is calculated as ∥M∥_{2,1} = Σ_i ∥m_i∥_2, i.e., the sum of the ℓ2 norms of the columns. To tackle the issue of implicitly low-rank data, it is worthwhile to kernelize the model in (2) so as to handle data sampled from some complex non-linear manifold. Moreover, in the scenario where the ambient dimension d is far greater than the data size n, the kernel method is more efficient.

Let ϕ: R^d → H be a mapping from the input space to the reproducing kernel Hilbert space H. Here we assume that ϕ(A) resides in a certain linear subspace of H; namely, the non-linear observations are considered to be linearly dependent in H. Let K ∈ R^{n×n} be a positive semidefinite kernel Gram matrix whose elements are computed as:

 K_ij = (ϕ(A)^T ϕ(A))_ij = ϕ(a_i)^T ϕ(a_j) = ker(a_i, a_j),

where ker(·,·) is the kernel function and

 ϕ(A) = [ϕ(a_1), ϕ(a_2), ⋯, ϕ(a_n)].
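In code, the Gram matrix K can be formed directly from the kernel function, without ever materializing ϕ(A). The sketch below is ours; the two example kernels anticipate the choices made later in the paper, with `c`, `d`, and `gamma` as free parameters:

```python
import numpy as np

def gram_matrix(A, ker):
    # K_ij = ker(a_i, a_j), where the a_i are the COLUMNS of the d x n matrix A.
    n = A.shape[1]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = ker(A[:, i], A[:, j])
    return K

# Example kernel functions (illustrative parameter values).
poly = lambda x, y, c=1.0, d=2: (x @ y + c) ** d                  # polynomial
rbf  = lambda x, y, gamma=0.5: np.exp(-gamma * np.sum((x - y) ** 2))  # Gaussian
```

The resulting K is symmetric positive semidefinite by construction, and for the Gaussian kernel its diagonal is identically one.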

With the above assumption, by kernelizing model (2), our model can be represented as:

 min_A ∥ϕ(A)∥_* + λ∥ϕ(A) − ϕ(X)∥_{2,1}. (3)

Note that, after mapping, the data matrix still contains column-wise noise or outliers. Hence, we also adopt the ℓ2,1-norm in (3) to measure the error residue in the kernel space.

### 3.2 Reformulation and Relaxation

It is hard to optimize (3) due to the explicit dependency on the mapping ϕ(·). Fortunately, as shown in garg2016non , a symmetric and positive semidefinite matrix can be factorized. We can easily derive the following proposition.

###### Proposition 1.

Assume K is a kernel Gram matrix computed as K = ϕ(A)^T ϕ(A); then we have

 ∥B∥_* = ∥ϕ(A)∥_*,  ∀ B : K = B^T B, (4)

where B ∈ R^{n×n}.
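Proposition 1 can be sanity-checked numerically: for any factor B with B^T B = K, the nuclear norm of B equals that of ϕ(A), and it is invariant under left-multiplication by a unitary matrix. A small numpy check, using a random matrix `Phi` as a stand-in for ϕ(A) (an assumption for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.standard_normal((40, 6))   # stand-in for phi(A): 40-dim features, n = 6
K = Phi.T @ Phi                      # Gram matrix K = phi(A)^T phi(A)

# One valid factor B with K = B^T B: the symmetric square root of K.
w, V = np.linalg.eigh(K)
B = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

# Nuclear norm = sum of singular values; nuc(B) and nuc(Phi) coincide.
nuc = lambda M: np.linalg.svd(M, compute_uv=False).sum()
```

The agreement holds because both nuclear norms equal the sum of the square roots of the eigenvalues of K.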

Substituting (4) into (3), we convert (3) into:

 min_{A,B} ∥B∥_* + λ∥ϕ(A) − ϕ(X)∥_{2,1},  s.t.  B^T B = ϕ(A)^T ϕ(A). (5)

We then relax the constrained problem to the following unconstrained one:

 min_{A,B} ∥B∥_* + λ∥ϕ(A) − ϕ(X)∥_{2,1} + (ρ/2)∥B^T B − ϕ(A)^T ϕ(A)∥²_F, (6)

where ρ is a parameter that balances the relaxed constraint against the original objective function. When ρ is sufficiently large, (6) and (5) become the same model. It is worth mentioning that (5) could be solved by the alternating direction method of multipliers (ADMM). However, the subproblem related to B is nonconvex and an auxiliary variable would have to be introduced, and ADMM fails to ensure convergence when the optimization involves more than three variables. Therefore, we choose an APG based method for our nonconvex and nonsmooth problem, whose convergence can be guaranteed li2015accelerated . Another advantage of the relaxation is that sometimes the rank of the ground-truth matrix is higher than that of the solution of (5), caused by an unsuitable mapping ϕ(·); the solution of (6) is closer to the ground-truth in this case, and thus (6) is more robust to the selection of mapping functions.

### 3.3 Optimization Algorithm

We show how to solve (6) in this subsection. We minimize the objective function alternately over A and B. The updating of A is performed by the monotone APG together with some linear approximation, while the subproblem involving B has a closed-form solution.
(1) Update B

B can be updated by solving the following subproblem:

 min_B ∥B∥_* + (ρ/2)∥B^T B − K_A∥²_F, (7)

where K_A = ϕ(A)^T ϕ(A). Denote the singular value decomposition (SVD) of K_A as VΣV^T (K_A is symmetric positive semidefinite, so the left and right singular vectors coincide); then this subproblem has a closed-form solution given by garg2016non :

 B* = Γ* V^T. (8)

Γ* is a diagonal matrix whose entries Γ*_ii depend on σ_i, where σ_i is the i-th singular value of K_A; each Γ*_ii can be obtained by solving a cubic equation. Note that B* is not unique, since one can multiply an arbitrary unitary matrix to the left of (8) without changing the objective value in (7). Fortunately, the non-uniqueness does not affect the optimization of A, since only B^T B is involved in the updating of A.
(2) Update A

To update A, the following subproblem should be solved:

 min_A ∥ϕ(A) − ϕ(X)∥_{2,1} + (α/2)∥B^T B − ϕ(A)^T ϕ(A)∥²_F, (9)

where α = ρ/λ. By dividing the matrix into columns, (9) can be rewritten as:

 min_{a_1,⋯,a_n} Σ_{i=1}^n { ∥ϕ(a_i) − ϕ(x_i)∥_2 + (α/2)∥m_i − ϕ(A)^T ϕ(a_i)∥²_2 },

where m_i is the i-th column of B^T B. The solution of this problem can be obtained by the block coordinate descent (BCD) method xu2013block , which minimizes the objective cyclically over each a_i while fixing the remaining blocks at their last updated values. Hence, we are required to address the following problem:

 min_{a_i} √(ϕ(a_i)^T ϕ(a_i) + ϕ(x_i)^T ϕ(x_i) − 2ϕ(x_i)^T ϕ(a_i)) + (α/2) Σ_{j=1}^n (m_ij − ϕ(a_i)^T ϕ(a_j))². (10)

To optimize this problem, the kernel function ker(·,·) needs to be specified. Here we choose two types of kernels (convex and non-convex) as examples; the optimization with other kernel functions can be handled in a similar way.

(i) Convex kernel: We select the most commonly used convex kernel, the Polynomial Kernel Function. The inner product in the kernel space can be represented as

 ϕ(a_i)^T ϕ(a_j) = (a_i^T a_j + c)^d,

where c is a free parameter trading off the influence of higher-order versus lower-order terms in the polynomial, and d is the order of the polynomial kernel. Then (10) can be rewritten as

 min_{a_i} √((a_i^T a_i + c)^d + (x_i^T x_i + c)^d − 2(a_i^T x_i + c)^d) + (α/2) Σ_{j=1}^n (m_ij − (a_i^T a_j + c)^d)². (11)

Note that the objective in (11) is a real-valued function which is differentiable at non-zero points. Thus we utilize its linear approximation at the point a_i^k to simplify and accelerate the optimization, which converts (11) into

 min_{a_i} μ_i Σ_{j=1}^n (m_ij − (a_i^T a_j + c)^d)² + (τ_{a_i} a_i − τ_{x_i} x_i)^T (a_i − x_i), (12)

where τ_{a_i}, τ_{x_i}, and the weight μ_i are defined below, and μ is the smooth parameter. Obviously, one local minimizer can be calculated in an alternating minimization way:

 τ_{a_i}^{k+1} = ((a_i^k)^T a_i^k + c)^{d−1},  τ_{x_i}^{k+1} = (x_i^T a_i^k + c)^{d−1}, (13)
 δ_i^{k+1} = 1/√(((a_i^k)^T a_i^k + c)^d + (x_i^T x_i + c)^d − 2((a_i^k)^T x_i + c)^d + μ²), (14)
 a_i^{k+1} = argmin_{a_i} μ_i^{k+1} Σ_{j=1}^n (m_ij − (a_i^T a_j + c)^d)² + (τ_{a_i}^{k+1} a_i − τ_{x_i}^{k+1} x_i)^T (a_i − x_i), (15)

where μ_i^{k+1} is a weight collecting the constants α, d, and δ_i^{k+1}.
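For intuition about why the polynomial kernel corresponds to an implicit feature map ϕ, the degree-2 case (d = 2) admits an explicit ϕ that can be checked numerically; the helper name `phi_poly2` is ours:

```python
import numpy as np

def phi_poly2(x, c=1.0):
    # Explicit feature map of the degree-2 polynomial kernel:
    # phi(x) . phi(y) = (x.y)^2 + 2c (x.y) + c^2 = (x.y + c)^2.
    return np.concatenate([np.outer(x, x).ravel(),   # all pairwise products
                           np.sqrt(2.0 * c) * x,     # linear part
                           [c]])                     # constant part
```

Higher orders d have analogous (much larger) explicit maps, and the Gaussian kernel's map is infinite-dimensional, which is exactly why the algorithm works with kernel evaluations instead of ϕ itself.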

(ii) Non-convex kernel: For the non-convex kernel, we choose the Gaussian Kernel Function for mapping the observations into an infinite-dimensional space. The inner product in the kernel space can be represented as ϕ(a_i)^T ϕ(a_j) = exp(−γ∥a_i − a_j∥²_2), where γ > 0 is the precision parameter of the Gaussian Kernel Function. Then (10) can be rewritten as:

 min_{a_i} √(2 − 2exp(−γ∥a_i − x_i∥²_2)) + (α/2) Σ_{j=1}^n (m_ij − exp(−γ∥a_i − a_j∥²_2))². (16)

Note that the objective in (16) is a real-valued function which is differentiable at non-zero points. Thus we utilize its linear approximation at the point a_i^k to simplify and accelerate the optimization. The problem in (16) is converted into:

 min_{a_i} (α/2) Σ_{j=1}^n (m_ij − exp(−γ∥a_i − a_j∥²_2))² + 2β_i p_i γ∥a_i − x_i∥²_2, (17)

where the weights β_i and p_i are defined below and μ is the smooth parameter. Obviously, one local minimizer can be calculated in an alternating minimization way:

 p_i^{k+1} = exp(−γ∥a_i^k − x_i∥²_2), (18)
 β_i^{k+1} = 1/√(2 − 2exp(−γ∥a_i^k − x_i∥²_2) + μ²), (19)
 a_i^{k+1} = argmin_{a_i} (α/2) Σ_{j=1}^n (m_ij − exp(−γ∥a_i − a_j∥²_2))² + 2β_i^{k+1} p_i^{k+1} γ∥a_i − x_i∥²_2, (20)

Note that, in most cases, the solution to a linear approximation problem is not exactly equivalent to that of the original problem. In contrast, here the updating steps (18) – (20) do solve the optimization in (16), which we will show in the next section.
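The reweighting steps (18) – (20) can be simulated on a 1-D toy instance of (16). In the sketch below, a grid search stands in for the inner solver of step (20), all constants (`gamma`, `alpha`, `mu`, the anchors) are arbitrary illustrative choices of ours, and the quadratic weight uses the factor the majorization argument requires (conventions for the constant differ across write-ups):

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, alpha, mu = 1.0, 1.0, 1e-3
x = 0.0                                        # observed (noisy) sample
anchors = rng.normal(0.0, 1.0, 5)              # the fixed columns a_j
m = np.exp(-gamma * (0.3 - anchors) ** 2)      # target inner products m_ij

def data_term(a):
    return 0.5 * alpha * np.sum((m - np.exp(-gamma * (a - anchors) ** 2)) ** 2)

def F(a):  # smoothed objective of (16)
    return np.sqrt(2.0 - 2.0 * np.exp(-gamma * (a - x) ** 2) + mu ** 2) + data_term(a)

a = 2.0
grid = np.linspace(-4.0, 4.0, 4001)
history = [F(a)]
for _ in range(20):
    p = np.exp(-gamma * (a - x) ** 2)               # step (18)
    beta = 1.0 / np.sqrt(2.0 - 2.0 * p + mu ** 2)   # step (19)
    w = beta * p * gamma                            # linearization weight
    # step (20): minimize the reweighted surrogate; keeping the current
    # iterate among the candidates makes the true objective non-increasing.
    cands = np.append(grid, a)
    vals = [data_term(c) + w * (c - x) ** 2 for c in cands]
    a = cands[int(np.argmin(vals))]
    history.append(F(a))
```

Because √(2 − 2exp(−γd) + μ²) is concave in d = ∥a − x∥², the reweighted quadratic is a majorizer, so each outer iteration can only decrease the smoothed objective.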
(3) Solve Nonconvex Programming

The optimization problem in (15) or (20) is a nonconvex program whose solution can be attained by the APG method. The updating steps for a_i include:

 y_i^k = a_i^k + (t^{k−1}/t^k)(z_i^k − a_i^k) + ((t^{k−1} − 1)/t^k)(a_i^k − a_i^{k−1}), (21)
 z_i^{k+1} = prox_{δg}(y_i^k − δ∇f(y_i^k)), (22)
 v_i^{k+1} = prox_{δg}(a_i^k − δ∇f(a_i^k)), (23)
 t^{k+1} = (√(4(t^k)² + 1) + 1)/2, (24)
 a_i^{k+1} = z_i^{k+1} if F(z_i^{k+1}) ≤ F(v_i^{k+1}), and v_i^{k+1} otherwise. (25)

where g denotes the linear term (τ_{a_i} a_i − τ_{x_i} x_i)^T(a_i − x_i) for the Polynomial Kernel or 2β_i p_i γ∥a_i − x_i∥²_2 for the Gaussian Kernel, ∇f is the gradient of f, and f represents the remaining smooth data term in (15) or (20), respectively. The proximal mapping is defined as prox_{δg}(x) = argmin_u g(u) + (1/(2δ))∥u − x∥²_2. δ is a fixed constant satisfying δ < 1/L, where L is the Lipschitz constant of ∇f, and F denotes f + g.
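Steps (21) – (25) amount to the following generic monotone-APG loop. This is a sketch after li2015accelerated with our own function names, not the paper's released code:

```python
import numpy as np

def monotone_apg(grad_f, F, prox_g, a0, delta, n_iter=200):
    # Monotone APG: both an extrapolated proximal step z and a plain
    # proximal step v are computed, and whichever has the smaller
    # objective F = f + g becomes the next iterate, so F never increases.
    a_prev, a, z = a0, a0, a0
    t_prev, t = 0.0, 1.0
    for _ in range(n_iter):
        y = a + (t_prev / t) * (z - a) + ((t_prev - 1.0) / t) * (a - a_prev)  # (21)
        z = prox_g(y - delta * grad_f(y))                                     # (22)
        v = prox_g(a - delta * grad_f(a))                                     # (23)
        t_prev, t = t, (np.sqrt(4.0 * t * t + 1.0) + 1.0) / 2.0               # (24)
        a_prev, a = a, (z if F(z) <= F(v) else v)                             # (25)
    return a
```

Because step (25) keeps the better of the extrapolated point and the plain proximal point, the objective is monotonically non-increasing even when f is nonconvex, e.g., on a double-well f(a) = (a² − 1)²/4 with an ℓ1 term.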

The algorithm to solve (6) with the APG and alternating minimization is outlined in Algorithm 1.

### 3.4 Computational Complexity

The updating of B consists of two parts: finding the roots of cubic equations and performing the SVD on K_A. The computational complexity for obtaining the roots is O(n), since closed-form expressions exist for the roots of cubic equations. The complexity of the SVD is O(rn²), where n is the size of the data and r is the rank of K_A. During the updating of a_i according to (21) – (23), matrix-vector multiplications need to be carried out; hence, the computational complexity for calculating all columns of A is O(n²). In summary, the total computational complexity of the whole algorithm in each iteration is O(rn²).

## 4 Theoretical Analysis

In this section, we first provide some useful theoretical results, including Lemma 3 illustrating the connection between (16) and (17), as well as Theorem 4 and Theorem 5 ensuring the convergence of the optimization.

Before stating Lemma 3, we first introduce a proposition that rewrites the non-linear mapping via its conjugate function. Based on the theory of convex conjugate functions rockafellar2015convex , we can derive the following proposition.

###### Proposition 2.

There exists a convex conjugate function φ of the exponential such that

 exp(−γ∥x∥²_2) = −min_p ( pγ∥x∥²_2 − φ(p) ), (26)

where p is a scalar variable. For a fixed x, the minimum is reached at p = exp(−γ∥x∥²_2)  he2011robust .

Based on the above proposition, we have the following connection between (16) and (17):

###### Lemma 3.

Cyclic iteration between steps (18) – (20) can solve the optimization in (16).

###### Proof.

We denote (α/2) Σ_{j=1}^n (m_ij − exp(−γ∥a_i − a_j∥²_2))² as m(a_i). In the same spirit as the iteratively reweighted least squares (IRLS) method fornasier2011low , we can solve (16) by iteratively optimizing the following problem with the weight β_i^k determined from the last iteration:

 a_i^{k+1} = argmin_{a_i} −2β_i^k exp(−γ∥a_i − x_i∥²_2) + m(a_i), (27)

where β_i^k = 1/√(2 − 2exp(−γ∥a_i^k − x_i∥²_2) + μ²) and μ is the smooth parameter. Substituting (26) into (27) gives:

 {a_i^{k+1}, p_i^{k+1}} = argmin_{a_i, p_i} 2β_i^k p_i γ∥a_i − x_i∥²_2 + m(a_i) − 2β_i^k φ(p_i).

Proposition 2 gives that p_i^{k+1} = exp(−γ∥a_i^k − x_i∥²_2). Hence, we get:

 a_i^{k+1} = argmin_{a_i} 2β_i^k p_i^{k+1} γ∥a_i − x_i∥²_2 + m(a_i). (28)

Due to (27) – (28), we find that steps (18) – (20) actually solve the problem in (16) by the iteratively reweighted strategy, and hence cyclic iteration between these steps can solve the optimization in (16). ∎

We denote the objective of (6) as F(A, B). Then the following theorem regarding the convergence of Algorithm 1 can be established.

###### Theorem 4.

The sequence {(A^k, B^k)} generated in Algorithm 1 satisfies the following properties:
(1) The objective is monotonically decreasing, i.e.,

 F(A^k, B^k) − F(A^{k+1}, B^{k+1}) ≥ (ρ/2)∥B^{k+1} − B^k∥²_F + (1/(2δ) − L/2)∥V^{k+1} − A^k∥²_F; (29)

(2) ∥B^{k+1} − B^k∥_F → 0 and ∥V^{k+1} − A^k∥_F → 0 as k → ∞;
(3) The sequences {A^k}, {B^k}, and {V^k} are bounded.

###### Proof.

First, from the updating rule of B in (8), we have

 B^{k+1} = argmin_B F(A^k, B).

Note that F(A^k, ·) is ρ-strongly convex. By Lemma B.5 in mairal2013optimization , we have

 F(A^k, B^k) − F(A^k, B^{k+1}) ≥ (ρ/2)∥B^{k+1} − B^k∥²_F. (30)

Second, we denote the objective in (17) as f(a_i, B). From Theorem 1 in li2015accelerated , for all i, we have

 f(a_i^k, B^{k+1}) − f(a_i^{k+1}, B^{k+1}) ≥ ζ∥v_i^{k+1} − a_i^k∥²_2, (31)

where ζ = 1/(2δ) − L/2. As aforementioned, (17) is the linear approximation of (16) at a_i^k, so the two objectives coincide at a_i^k; from the concavity of the square-root term, (17) upper-bounds (16), and hence the decrease carries over to F. Summing the inequality in (31) over all i, we get

 F(A^k, B^{k+1}) − F(A^{k+1}, B^{k+1}) ≥ ζ∥V^{k+1} − A^k∥²_F.

Thus, together with (30), we reach the conclusion in (29). Hence, F(A^k, B^k) is monotonically decreasing and thus bounded from above by F(A^0, B^0). Since ∥B^k∥_* ≤ F(A^k, B^k), this implies that {B^k} is bounded.

Now, summing (29) over k, we have

 Σ_{k=0}^∞ (ρ/2)∥B^{k+1} − B^k∥²_F + ζ∥V^{k+1} − A^k∥²_F ≤ F(A^0, B^0).

This implies ∥B^{k+1} − B^k∥_F → 0 and ∥V^{k+1} − A^k∥_F → 0. Then, similar to {B^k}, the sequences {A^k} and {V^k} are also bounded.
The proof is completed. ∎

###### Theorem 5.

The sequence {(A^k, B^k)} generated in Algorithm 1 has at least one accumulation point. Let (A*, B*) be any accumulation point of {(A^k, B^k)}; then 0 ∈ ∂F(A*, B*), i.e., (A*, B*) is a stationary point.

###### Proof.

From the boundedness of {(A^k, B^k)}, there exist a point (A*, B*) and a subsequence {(A^{k_j}, B^{k_j})} such that A^{k_j} → A*, B^{k_j} → B*. Then, by property (2) in Theorem 4, we have B^{k_j+1} → B*, V^{k_j+1} → A*. On the other hand, from the optimality of B^{k_j+1} to (7), of V^{k_j+1} to (23), and Theorem 1 in li2015accelerated , we have

 0 ∈ ∂_B F(A^{k_j}, B^{k_j+1}),   0 ∈ ∂_A F(V^{k_j+1}, B^{k_j+1}).

Letting j → ∞ above, we have

 0 ∈ ∂_B F(A*, B*),   0 ∈ ∂_A F(A*, B*).

Hence, (A*, B*) is a stationary point of (6). ∎

## 5 Experimental Verification

### 5.1 Experimental Settings

In this section, we conduct experiments on both synthetic and real datasets to show the advantages of our proposed method.

Data: The real datasets cover two computer vision tasks: 1) non-linear data recovery from the similarity; 2) non-linear data denoising over the MNIST salakhutdinov2008quantitative and COIL-20 nene1996columbia databases. The MNIST database consists of 8-bit grayscale images of the digits "0" to "9", each centered on a 28×28 grid. The COIL-20 database contains 1440 samples distributed over 20 objects.

Baselines: We assess the performance of the proposed model in comparison with several state-of-the-art methods, including Outlier Pursuit (OP) xu2010robust , KPCA nguyen2009robust , and GRPCA shahid2015robust , the codes of which are downloaded from the authors’ websites, except KPCA, which we implement according to the paper. All methods’ settings follow the authors’ suggestions or the given parameters.

Evaluation metrics: Two metrics are used to evaluate the performance of data recovery methods.
– Peak Signal-to-Noise Ratio (PSNR): Suppose the Mean Squared Error (MSE) is defined as MSE = (1/(mn)) Σ_{i,j} (I(i,j) − Î(i,j))², where I and Î are the original image and the recovered image, respectively; then the PSNR value is calculated as PSNR = 10 log₁₀(MAX²_I / MSE), with MAX_I the maximum possible pixel value.
– Signal-to-Noise Ratio (SNR): The SNR is calculated as SNR = 10 log₁₀(∥A∥²_F / ∥A − Â∥²_F), where A and Â are the original and recovered data.

### 5.2 Data Recovery with Graph Constraint

Figure 1: (a) is the original data. (b) is the corrupted observation with Gaussian or structure noise. (c) is the result provided by our non-linear recovery method. (d) is the recovery result of GRPCA.

In this experiment, we aim at recovering the data with a graph constraint. For our proposed model, we solve the subproblem (9) with B fixed. Note that, except for GRPCA, none of the comparative methods can cope with this similarity recovery task. We examine the effectiveness of our model on the MNIST database. Firstly, we randomly select images from two digit classes and rotate them by a random degree. Secondly, a fraction of the images are randomly chosen to be corrupted: for each chosen image, its observation is computed by adding Gaussian noise with zero mean, or by adding three blocks of structured occlusion. Finally, we convert these images to vectors. In order to construct the graph constraint for our proposed model and GRPCA, we adopt the same procedure as in shahid2015robust , and the input graph is calculated from the k-nearest neighbors. Note that we utilize the Gaussian Kernel Function on one digit and the Polynomial Kernel Function on the other.

Fig. 1 shows the results of our method and GRPCA on the rotated MNIST data. As we can see, the proposed method produces encouraging recovery results and outperforms the competing method. This confirms the superiority of our model in highly non-linear scenarios. It is worthwhile to mention that the method used to solve the problem in (9) can be directly applied to some other scenarios, such as multi-modal inference and multi-view learning for recovery from similarity.

### 5.3 Data Denoising

Figure 2: (a) The input data. (b) The recovery results of OP. (c) The recovery results of KPCA. (d) The result of our kernel low-rank recovery method.

Figure 3: Visual comparison on the task of Data Denoising.

We now evaluate the effectiveness of our method on the data denoising problem.

1) Two-Dimensional Case: Fig. 2 shows the results of the methods on synthetic data. We randomly select 100 data points from a circle embedded in a two-dimensional plane, resulting in a clean data matrix. We then select 10% of the data points as outliers. In this example, the data matrix is full-rank in its ambient space, and thus traditional low-rankness based methods (e.g., OP) cannot recover the data points correctly. In sharp contrast, as shown in Fig. 2, our method can still identify the outliers and replace them with points close to the ground-truth manifold.

2) High-Dimensional Case: We apply the proposed method to denoise data from the MNIST and COIL-20 databases. We compare all the recovery methods in two cases: (1) rotation with Gaussian noise, and (2) rotation with occlusion. For the MNIST database, we randomly select images from two digit classes and rotate them by a random degree. For the COIL-20 database, we randomly choose several subjects and their corresponding images, and rotate each image several times. In these two cases, a portion of the data is randomly chosen to be corrupted in the same way as in the previous experiment. Finally, we convert the images to vectors and normalize them to unit length.

Table 1 compares our model against all competing methods. We report the results of our proposed method with the Gaussian Kernel Function and the Polynomial Kernel Function in the last two rows. Since the rotated data has a highly non-linear structure, our method, which combines the data-independent kernel trick with kernel low-rank pursuit, consistently outperforms the other methods and obtains the highest PSNR and SNR. Fig. 3 visualizes the denoised results of all the methods. It can be seen that our model is more robust to gross corruption and achieves better recovery of details, owing to the absence of a rank restriction in the original space. We notice that the proposed method with the Gaussian Kernel obtains better recovery results than with the Polynomial Kernel. This is because, for a large order d, the first term in problem (17) dominates the optimization procedure, whereas with a small d, the proposed method cannot capture the underlying non-linear structure of the data. Contrary to the Polynomial Kernel, the Gaussian Kernel is an infinite-dimensional mapping with bounded values. Thus, when the data has a highly non-linear structure, the Gaussian Kernel performs better than the Polynomial Kernel. Compared with the other methods, our results, again, confirm the superiority of combining implicit low-rank pursuit with the data-independent kernel trick.

The CPU time of all competing methods on the MNIST dataset with Gaussian noise is presented in Table 2. All graph- or kernel-based methods have high computational complexity. By utilizing the APG strategy for optimization, our method is faster than the other two graph-based methods.

## 6 Conclusion and Future Work

This paper presents a method, more robust than other kernel methods, for solving the non-linear matrix recovery problem. To solve the nonconvex optimization problem, we propose an algorithm that leverages linearization and proximal gradient techniques. We also analyze the convergence and complexity of the algorithm, and theoretically prove that the obtained solution is a stationary point. Compared with state-of-the-art methods, our proposed method achieves much better results on both data recovery and denoising tasks.

For future work, we hope to reduce the computational complexity of the proposed algorithm. It is worth mentioning that the high cost of our method comes from the sequential updating of the columns of matrix A. In practice, when optimizing a_i, the other columns are fixed, and the full A is updated only after all columns have been calculated. Hence, a parallel strategy can be introduced: we can update the columns of A in parallel within one iteration, which would substantially reduce the computational cost of updating A.

## Acknowledgment

The authors would like to thank the anonymous reviewers for their helpful comments. The work of Guangcan Liu is supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61622305 and Grant 61502238, and in part by the Natural Science Foundation of Jiangsu Province of China (NSFJPC) under Grant BK20160040. The work of Jun Wang is supported in part by NSFC under Grant 61402224 and Grant 6177226, and by the Fundamental Research Funds for the Central Universities (NE2014402, NE2016004).

## References

•  Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM, 58(3):11, 2011.
•  Hongyang Zhang, Zhouchen Lin, Chao Zhang, and Edward Y Chang. Exact recoverability of robust pca via outlier pursuit with tight recovery bounds. In AAAI, pages 3143–3149, 2015.
•  Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust pca via outlier pursuit. In Advances in Neural Information Processing Systems, pages 2496–2504, 2010.
•  Guangcan Liu, Zhouchen Lin, Shuicheng Yan, Ju Sun, Yong Yu, and Yi Ma. Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):171–184, 2013.
•  Guangcan Liu, Huan Xu, Jinhui Tang, Qingshan Liu, and Shuicheng Yan. A deterministic analysis for LRR. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3):417–430, 2016.
•  Guangcan Liu, Qingshan Liu, and Ping Li. Blessing of dimensionality: Recovering mixture data via dictionary pursuit. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1):47–60, 2017.
•  Guangcan Liu and Ping Li. Low-rank matrix completion in the presence of high coherence. IEEE Transactions on Signal Processing, 64(21):5623–5633, 2016.
•  Minh H Nguyen and Fernando Torre. Robust kernel principal component analysis. In Advances in Neural Information Processing Systems, pages 1185–1192, 2009.
•  Shijie Xiao, Mingkui Tan, Dong Xu, and Zhao Yang Dong. Robust kernel low-rank representation. IEEE Transactions on Neural Networks and Learning Systems, 27(11):2268–2281, 2016.
•  Pan Ji, Ian Reid, Ravi Garg, Hongdong Li, and Mathieu Salzmann. Low-rank kernel subspace clustering. arXiv preprint arXiv:1707.04974, 2017.
•  Xingyu Xie, Xianglin Guo, Guangcan Liu, and Jun Wang. Implicit block diagonal low-rank representation. IEEE Transactions on Image Processing, 27(1):477–489, 2018.
•  Hoangvu Nguyen, Wankou Yang, Fumin Shen, and Changyin Sun. Kernel low-rank representation for face recognition. Neurocomputing, 155:32–42, 2015.
•  Huan Li and Zhouchen Lin. Accelerated proximal gradient methods for nonconvex programming. In Advances in Neural Information Processing Systems, pages 379–387, 2015.
•  Zhonglong Zheng, Mudan Yu, Jiong Jia, Huawen Liu, Daohong Xiang, Xiaoqiao Huang, and Jie Yang. Fisher discrimination based low rank matrix recovery for face recognition. Pattern recognition, 47(11):3502–3511, 2014.
•  Xu Zhang, Shijie Hao, Chenyang Xu, Xueming Qian, Meng Wang, and Jianguo Jiang. Image classification based on low-rank matrix recovery and naive bayes collaborative representation. Neurocomputing, 169:110–118, 2015.
•  Yuanyuan Liu, LC Jiao, and Fanhua Shang. A fast tri-factorization method for low-rank matrix recovery and completion. Pattern Recognition, 46(1):163–173, 2013.
•  Angang Cui, Jigen Peng, and Haiyang Li. Exact recovery low-rank matrix via transformed affine matrix rank minimization. Neurocomputing, 2018.
•  Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
•  Mahdieh Soleymani Baghshah and Saeed Bagheri Shouraki. Learning low-rank kernel matrices for constrained clustering. Neurocomputing, 74(12-13):2201–2211, 2011.
•  Binbin Pan, Jianhuang Lai, and Pong C Yuen. Learning low-rank mercer kernels with fast-decaying spectrum. Neurocomputing, 74(17):3028–3035, 2011.
•  Alain Rakotomamonjy and Sukalpa Chanda. lp-norm multiple kernel learning with low-rank kernels. Neurocomputing, 143:68–79, 2014.
•  Ravi Garg, Anders Eriksson, and Ian Reid. Non-linear dimensionality regularizer for solving inverse problems. arXiv preprint arXiv:1603.05015, 2016.
•  Yangyang Xu and Wotao Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758–1789, 2013.
•  Ralph Tyrell Rockafellar. Convex analysis. Princeton University Press, 2015.
•  Ran He, Bao-Gang Hu, Wei-Shi Zheng, and Xiang-Wei Kong. Robust principal component analysis based on maximum correntropy criterion. IEEE Transactions on Image Processing, 20(6):1485–1494, 2011.
•  Massimo Fornasier, Holger Rauhut, and Rachel Ward. Low-rank matrix recovery via iteratively reweighted least squares minimization. SIAM Journal on Optimization, 21(4):1614–1640, 2011.
•  Julien Mairal. Optimization with first-order surrogate functions. In Proceedings of the International Conference on Machine Learning, pages 783–791, 2013.
•  Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. In Proceedings of the International Conference on Machine Learning, pages 872–879, 2008.
•  Sameer A Nene, Shree K Nayar, Hiroshi Murase, et al. Columbia object image library (coil-20). 1996.
•  Nauman Shahid, Vassilis Kalofolias, Xavier Bresson, Michael Bronstein, and Pierre Vandergheynst. Robust principal component analysis on graphs. In Proceedings of the IEEE International Conference on Computer Vision, pages 2812–2820, 2015.