1 Introduction
Researchers from machine learning
Ye2005Generalized , computer vision lrslibrary2015 ; Udell2014Generalized and statistics Koltchinskii2011NUCLEAR have paid increasing attention to lowrank matrix factorization (LRMF) Lee1999Learning . Generally speaking, many realworld modeling tasks can be attributed as the problems of LRMF. The tasks include but are not limited to recommender systems WU201846 , community detection airoldi2008mixed ; gopalan2013efficient , link prediction WANG2017104 Wright2009Robust and image denoising Fei2017Denoising .The key idea of LRMF is to approximate a given matrix by the product of two lowrank matrices. Specifically, given an observed matrix , LRMF aims at solving the optimization problem
(1) 
where (usually, ) and denotes the Hadamard product, that is, the elementwise product. The indicator matrix implies whether some data are missing, where if is nonmissing and 0 otherwise. The symbol indicates a certain norm of a matrix, in which the most prevalent one is
norm. It is wellknown that singular value decomposition provides a closedform solution for
norm LRMF without missing entries. In addition, researchers have presented many fast algorithms to solve Eq. (1) when contains missing entries, as well. Thenorm LRMF greatly facilitates theoretical analysis, but it provides the best solution in sense of maximum likelihood principle only when noise is indeed sampled from a Gaussian distribution. If noise is from a heavytailed distribution or data are corrupted by outliers,
norm LRMF is likely to perform badly. Thereafter, norm LRMF begins to gain increasing interest of both theoretical researchers and practitioners due to its robustness Ke:2005:RLN:1068507.1068989 . In fact, norm LRMF hypothesizes that noise is from a Laplace distribution. As is often the case with norm LRMF, norm LRMF may provide unexpected results as well if its assumptions are violated.Because the noise in real data generally deviates far away from a Gaussian or Laplace distribution, analysts are no longer satisfied with  or norm LRMF. To improve the robustness of LRMF, researchers attempt to directly model unknown noise via a mixture of Gaussians (MoG) due to its good property to universally approximate any continuous distribution Meng2014Robust ; Maz1996On . Nevertheless, the technique cannot fit real noise precisely in some complex cases. For example, in theory, infinite Gaussian components are required to approximate a Laplace distribution. In practice, we only utilize finite Gaussian components due to the characteristics of MoG. On the other hand, Gaussian, Laplace and MoG distributions are all symmetric. In the conditions with real noise being skew, they may provide unsatisfactory results.
As a matter of fact, there are no strictly symmetric noise in real images. For instance, Figure 1 illustrates several examples in which the real noise is either skewed to the left (e.g., (a4) and (c4)) or the right (e.g., (b4)). In these situations, the symmetric distributions like Gaussian or Laplace are inadequate to approximate the noise. In statistics, to deal with an asymmetric noise distribution, a preliminary exploration called quantile regression has been made. Consider a simple case that there is only one covariate , the quantile regression coefficient can be obtained by
(2) 
where are observations and is a predefined asymmetry parameter. Moreover, the quantile loss is defined as
(3)  
with is the indicator function. Evidently, the quantile loss with corresponds to the norm loss. From the Bayesian viewpoint, the estimate obtained by minimizing the quantile loss in (2) coincides with the result by assuming noise coming from an asymmetric Laplace distribution (ALD) Kozumi2011Gibbs ; Keming2005A .
To overcome the shortcomings of existing LRMF methods that they assume the type of noise distribution, we present in this paper an adaptive quantile LRMF (AQLRMF) algorithm. The key idea of AQLRMF is to model noise via a mixture of asymmetric Laplace distributions (MoAL). The expectation maximization (EM) algorithm is employed to estimate parameters, under the maximum likelihood framework. The novelty of AQLRMF and our main contributions can be summarized as follows.

The Mstep of the EM algorithm corresponds to a weighted norm LRMF, where the weights encode the information about skewness and outliers.

The weights are automatically learned from data under the framework of EM algorithm.

Different from quantile regression, our method does not need to predefine the asymmetry parameter of quantile loss, because it is adaptively determined by data.

Our model can capture local structural information contained in some real images, although we do not encode it into our model.
The experiments show that our method can effectively approximate many different kinds of noise. If the noise has a strong tendency to take a particular sign, AQLRMF will produce better estimates than a method which assumes a symmetric noise distribution. In comparison with several stateoftheart methods, the superiority of our method is demonstrated in both synthetic and realdata experiments such as image inpainting, face modeling, hyperspectral image (HSI) construction and so on.
The rest of the paper is organized as follows. Section 2 presents related work of LRMF. In section 3, we propose the AQLRMF model and also provide an efficient learning algorithm for it. Section 4 includes experimental studies. At last, some conclusions are drawn in section 5.
2 Related work
The study of robust LRMF has a long history. Srebro and Jaakkola icml2003SrebroJ03 suggested to use a weighted loss to improve LRMF’s robustness. The problem can be solved by a simple but efficient EM algorithm. However, the choice of weights significantly affects its capability. Thereafter, Ke and Kanade Ke:2005:RLN:1068507.1068989 attempted to replace loss with loss and to solve the optimization by alternated linear or quadratic programming (ALP/AQP). In order to catalyze convergence, Eriksson and Hengel L1Wiberg developed the Wiberg algorithm. Kim et al. Kim2015Efficient used alternating rectified gradient method to solve a largescale norm LRMF. The simulated experiments showed that this method performs well in terms of both matrix reconstruction performance and computational complexity. Okutomi et al. RegL1ALM modified the objective function of Wiberg by adding the nuclear norm of V and the orthogonality constraints on U. This method has been shown to be effective in addressing structure from motion issue. Despite the nonconvexity and nonsmoothness of norm LRMF, Meng et al. CWM proposed a computationally efficient algorithm, cyclic weighted median (CWM) method, by solving a sequence of scalar minimization subproblems to obtain the optimal solution. Inspired by majorizationminimization technique, Lin et al. Lin2017Robust proposed LRMFMM to solve an LRMF optimization task with loss plus the norm penalty placing on and . In each step, they upper bound the original objective function by a strongly convex surrogate and then minimize the surrogate. Experiments on both simulated and real data sets testify the effectiveness of LRMFMM. Li et al. Li2017Efficient considered a similar problem, but they replace the norm penalty imposed on with . This model is solved by augmented Lagrange multiplier method. Furthermore, the authors of Li2017Efficient
designed a heuristic rank estimator for their model. As argued in introduction,
loss actually corresponds to the Laplacedistributed noise. When the real distribution of noise deviates too far from Laplace, the robustness of LRMF will be suspectable.Recently, the research community began to focus on probabilistic extensions of robust matrix factorizations. Generally speaking, it is assumed that , where is a noise matrix. Lakshminarayanan et al. Lakshminarayanan2011Robust replaced Gaussian noise with Gaussian scale mixture noise. Nevertheless, it may be ineffective when processing heavytailed (such as Laplacetype) noise. Wang et al. Wang2012 proposed a probabilistic
norm LRMF, but they did not employ a fully Bayesian inference process. Beyond Laplace noise, Meng and Torre
Meng2014Robust presented a robust LRMF with unknown noise modeled by an MoG. In essence, the method iteratively optimizes , where are the MoG parameters which are automatically updated during optimization, and is the weight function of . Due to the benefit to adaptively assign small weights to corrupted entries, MoGLRMF has been reported to be fairly effective. More recently, Cao et al. Cao2015Low presented a novel LRMF model by assuming noise as a mixture of exponential power (MoEP) distributions and also offered the corresponding learning algorithm.On the other hand, robust principle component analysis (robust PCA) Candes2011Robust considers an issue that is similar to LRMF, that is,
(4) 
The underlying assumption of robust PCA is that the original data can be decomposed into the sum of a lowrank matrix and a sparse outlier matrix (i.e., the number of nonzero elements in is small). Clearly, plays the same role as the product of and . Since Eq. (4) involves a nonconvex objective function, Candes2011Robust consider a tractable convex alternative, called principal component pursuit, to handle the corresponding problem, namely,
(5) 
where denotes the nuclear norm. It is worthwhile that principal component pursuit sometimes may fail to recover when the real observation is also corrupted by a dense inlier matrix. To overcome this shortcoming, Zhou et al. Zhou2010Stable proposed the stable principal component pursuit (SPCP) by solving
(6) 
Actually, the underlying assumption of SPCP is , where is lowrank component, is the sparse outliers and is the smallmagnitude noise that can be modeled by Gaussian. Both theory and experiments have shown that SPCP guarantees the stable recovery of .
3 Adaptive Quantile LRMF (AQLRMF)
3.1 Motivation
Generally speaking, researchers employ the or loss function when solving a lowrank matrix factorization problem. As argued in introduction, or loss implicitly hypothesizes that the noise distribution is symmetric. Nevertheless, the noise in real data is often asymmetric and Figure 1 illustrates several examples.
There are two face images and a hyperspectral image in Figure 1. Figure 1 (a) displays a face image captured with a poor light source. There are cast shadows in a large area, while there exists overexposure phenomenon in a small area. As a result, the noise is negative skew. By contrast, Figure 1 (b) illustrates a face image captured under a strong light source. Because of the camera range settings, there are saturated pixels, especially on the forehead. Under this circumstance, the noise is positive skew. Figure 1 (c) shows a hyperspectral image that is mainly corrupted by stripe and Gaussian noise. Its residual image indicates that the signs of the noise are unbalanced, i.e., more pixels are corrupted by noise with negative values. Actually, the skewness values of three residual (noise) images are , 0.69 and , respectively. Note that a symmetric distribution has skewness 0, the noise contained in these real data sets is thus asymmetric.
As a matter of fact, the noise in real data can hardly be governed by a strictly symmetric probability distribution. Therefore, it is natural to utilize an asymmetric distribution to model realistic noise. In statistics, researchers usually make use of a quantile loss function defined in (
3) to address this issue. It has been shown that quantile loss function corresponds to the situation that noise is from an asymmetric Laplace distribution Kozumi2011Gibbs ; Keming2005A . In order to improve the performance of lowrank matrix factorization, we attempt to use a mixture of asymmetric Laplacian distributions (MoAL) to approximate noise.3.2 Asymmetric Laplace distribution
In what follows, we use to denote an ALD with location, scale and asymmetric parameters , and , respectively. Its probability distribution function (PDF) is
(7)  
Obviously, the location parameter is exactly the mode of an ALD. In Figure 2, we demonstrate the PDF curves for several ALDs with different parameters. The asymmetry parameter controls the skewness of an ALD and . In general, an ALD is positive skew if , and is negative skew if . If , the ALD becomes a Laplace distribution. The smaller the scale parameter is, the more heavytailed ALD is.
It is worthwhile that skew Gaussian distributions azzalini1996the are also prevailing in both theory and applications. However, it is not ideal for the analysis of LRMF. On one hand, the PDF of a skew Gaussian distribution is complex. On the other hand, its skewness lies in which is only a subset of the range of . Due to this fact, the fitting capability of an ALD is greater than that of a skew Gaussian distribution.
3.3 AQLRMF model
To enhance the robustness of LRMF in situations with skew and heavytailed noise, we propose an adaptive quantile LRMF (AQLRMF) model by modeling unknown noise as an MoAL. In particular, we consider a generative model of the observed matrix . For its each entry , suppose that there is
(8) 
where is the th row of U, is the th row of V, and is the noise term. In AQLRMF, we assume that is distributed as an MoAL, namely,
(9) 
in which stands for an asymmetric distribution with parameters and . Meanwhile, indicates the mixing proportion with and , and means the number of mixture components.
To facilitate the estimation of unknown parameters, we further equip each noise
with an indicator vector
where and . Here, indicates that the noise is drawn from the th AL distribution. Evidently, follows a multinomial distribution, i.e., . Under these assumptions, we can have(10) 
Now, it is easy to obtain the probability of as
(11) 
where , and are unknown parameters. To estimate as well as , we employ the maximum likelihood principle. Consequently, the goal is to maximize the loglikelihood function of complete data shown below, namely,
(12) 
where denotes the index set of the nonmissing entries of data. Subsequently, we will discuss how to maximize to get our interested items.
3.4 Learning of AQLRMF
Since each associates with an indicator vector , the EM algorithm em is utilized to train the AQLRMF model. Particularly, the algorithm needs to iteratively implement the following two steps (i.e., Estep and Mstep) until it converges. For ease of exposition, we let and abbreviate as in the following discussions.
Estep: Compute the conditional expectation of the latent variable as
(13) 
In order to attain the updating rules of other parameters, we need to compute the function. According to the working mechanism of EM algorithm, the function can be obtained by taking expectation of the loglikelihood function shown in (12) with regard to the conditional distribution of the latent variables . Specifically, it can be derived as
(14)  
where
(15) 
Mstep: Maximize the function by iteratively updating its parameters as follows.

Update : To attain the update for , we need to solve the following constrained optimization problem
(16) via the Lagrangian multiplier method. By some derivations, we have
(17) in which stands for the cardinality of .

Update : Compute the gradient and let it be zero. Consequently, the update of can be obtained as
(18) 
Update : Compute the gradient and let it be zero, we can have
(19) where the coefficients . It is a twoorder equation with regard to . Obviously, Eq. (19) has a unique root satisfying , that is,
(20) 
Update : By omitting some constants, the objective function to optimize can be rewritten as
(21) where the th entry of W is
(22)
Hence, the optimization problem in Eq. (21) is equivalent to the weighted LRMF, which can be solved by a fast offtheshelf algorithm such as the cyclic weighted median filter (CWM) CWM .
On one hand, it is interesting that the Mstep in AQLRMF is the same as that of MoGLRMF Meng2014Robust , except that the latter one minimizes a weighted loss. Due to this feature, AQLRMF is more robust than MoGLRMF. On the other hand, each weight of MoGLRMF embodies the information about whether the corresponding entry is an outlier. For each weight of AQLRMF, it actually contains additional information about the sign of bias. In detail, is the scale parameter and the entries with smaller correspond to outliers. According to the definition of in Eq. (15), we know that is a function of the skewness parameter . If the residual , and otherwise. Hence, the weights assigned to two points still differ if two residuals with the same absolute value have different signs. In conclusion, AQLRMF has more capacity to process heavytailed skew data.
Based on the above analysis, we summarize the main steps to learn the parameters involved in AQLRMF in the following Algorithm 1.
3.5 Solution of weighted Lrmf
As stated in the last subsection, the learning of AQLRMF can be cast into a weighted LRMF problem. Now we will provide more details about how to solve it (i.e., how to update by Eq. (21)) with the CWM method CWM .
Essentially, CWM minimizes the objective via solving a series of scalar minimization subproblems. Let and be the th column of and , respectively. To update , we assume that the other parameters have been estimated. As a result, the original problem can be rewritten as the optimization problem regarding , i.e.,
(23) 
where , and and are th column of and , respectively. In Eq. (23), c denotes a constant term not depending on . In this way, the optimal , say , can be easily attained by the weighted median filter when minimizing Eq. (23) can be provided by weighted median filter. Specifically, if let and , we can reformulate Eq. (23) as
(24) 
From the format of Eq. (24), it can be seen that the optimal coincides with the weighted median of the sequence under weights . As for the update of , it can be handled in a similar procedure. In short, the optimal can be obtained by employing CWM to repeatedly update and until the algorithm converges.
3.6 Some details of learning AQLRMF
Tuning the number of components in MoAL: Too large violates Occam Razor’s principle, while small leads to poor performance. In consequence, as described in step 8 of Algorithm 1, we employ an effective method to tune . To begin with, we initialize to be a small number such as . After each iteration, we compute the cluster that belongs to, by . If there is no entry belonging to cluster , we remove the corresponding ALD component.
Initialization: In Algorithm 1, we initialized the th entry of as , where denotes a random number sampled from the standard Gaussian distribution . In addition, where is the median of all entries in . Due to the characteristics of and , we initialize similarly. Moreover, the parameters and is randomly sampled from .
Convergence condition: By following the common practice of EM algorithm, we stop the iteration if the change of is smaller than a predefined value or the maximum iteration number is reached.
4 Experimental Studies
We carried out experiments in this section to examine the performance of AQLRMF model. Several stateoftheart methods were considered, including four robust LRMF methods (namely, MoG Meng2014Robust ^{1}^{1}1http://www.gr.xjtu.edu.cn/c/document_library/get_file?folderId=1816179&name=DLFE32163.rar, CWM CWM , Damped Wiberg (DW) ^{2}^{2}2http://www.vision.is.tohoku.ac.jp/us/download/ Okatani2011EfficientAF , RegL1ALM ^{3}^{3}3https://sites.google.com/site/yinqiangzheng/ RegL1ALM ) and a robust PCA method (SPCP solved by quasi Newton method) ^{4}^{4}4https://github.com/stephenbeckr/fastRPCA EPFLCONF199542 . We wrote the programming code for CWM. For the other compared algorithms, the codes provided by the corresponding authors were availed. Since SPCP is not available in presence of missing entries, it was thus excluded from some experiments which involve missing data. Notice that DW is only considered in section 4.1 because it meets the “out of memory” problem for largescale datasets. In the meantime, we assigned the same rank to all the considered algorithms but SPCP since it can automatically determine the rank. To make the comparison more fair, all algorithms were initialized with the same values. Each algorithm was terminated when either 100 iterative steps are reached or the change of is less than . In order to simplify notations, our proposed method AQLRMF was denoted as AQ in later discussions. All the experiments were conducted with Matlab R2015b and run on a computer with Intel Core CPU 2.30 GHz, 4.00 GB RAM and Windows 7(64bit) system.
The remainder of this section has the following structure. Section 4.1 studies the performance of each algorithm on synthetic data in presence of various kinds of noise as well as missing values. Sections 4.2 and 4.3 employ some inpainted and multispectral images to investigate how the compared algorithms behave on real images which contain missing values and various kinds of noise, respectively. Finally, sections 4.4 and 4.5 examine the performance of all algorithms on face modelling and hyperspectral image processing tasks.
4.1 Synthetic experiments
First, we compared the behavior of each method with synthetic data containing different kinds of noise. For each case, we randomly generated 30 low rank matrices of size , where and were sampled from the standard Gaussian distribution . In particular, we set to 4 and 8. Subsequently, we stochastically set 20% entries of X as missing data and corrupted the nonmissing entries with the following three groups of noise. (i) The first group include 4 kinds of heavytailed noise, i.e., (Laplace noise with scale parameter and location parameter ), Gaussian noise with and Student’s
noise with degrees of freedom 1 and 2, respectively. (ii) The second group include two kinds of skew noise, i.e., asymmetric Laplace noise with
and skew normal with . (iii) The last group include two kinds of mixture noise. The one is and the other one is . It is worthwhile to mention that the two mixture noises simulate the noise contained in real data, where most entries are corrupted by standard Gaussian noise and the rest entries are corrupted by heavytailed or skew noise. To evaluate the performance of each method, we employed the average error that is, .In our experiments, all algorithms were implemented with true rank . Tables 1 and 2 summarize the metrics averaged over 30 randomly generated matrices. When , it is quite obvious that our method reaches the minimum error in all situations, while MoG and CWM almost take the second place. And the approaches RegL1ALM and DW can hardly deal with the heavytailed and skew noise well. From the results of , similar conclusions can be drawn. However, CWM evidently outperforms MoG when , which indicates that MoG may be instable when the real rank in observed data is high. In summary, our model performs very well to cope with different kinds of noise.
AQ  MoG  CWM  RegL1ALM  DW  
Laplace Noise (b=1.5)  1.23  1.27  1.34  1.93  10.44 [t] 
Gaussian Noise ()  2.95  3.46  3.16  4.97  4.56 
Student’s Noise ()  1.47  2.36  2.88  27.45  925.17 
Student’s Noise ()  0.97  1.37  1.21  2.85  7.16 
AL Noise ()  1.99  2.58  2.33  3.99  3.65 
SN Noise ()  1.90  2.00  2.13  3.03  2.11 
Mixture Noise 1  0.84  0.93  1.02  1.28  1.06 
Mixture Noise 2  0.99  1.61  1.27  3.88  10.50 [b] 
mean  1.54  1.95  1.92  6.17  120.58 [t] 
median  1.35  1.81  1.73  3.45  5.86 [b] 
AQ  MoG  CWM  RegL1ALM  DW  
Laplace Noise (b=1.5)  1.82  2.13  1.90  3.81  12.71 [t] 
Gaussian Noise ()  4.14  5.07  4.11  8.38  9.04 
Student’s Noise ()  2.54  3.91  3.60  22.38  2063.74 
Student’s Noise ()  1.63  2.32  1.81  4.20  40.82 
AL Noise ()  2.91  3.69  3.07  6.76  17.84 
SN Noise ()  2.61  3.14  2.68  4.97  4.26 
Mixture Noise 1  1.34  1.55  1.55  2.59  3.80 
Mixture Noise 2  1.68  2.32  1.91  5.29  22.19 [b] 
mean  2.33  3.02  2.58  7.30  271.80 [t] 
median  2.18  2.73  2.30  5.13  15.27 [b] 
To delve into the difference between AQ and MoG, we further compared the distributions of residuals with real noises. Specifically, two symmetric and two asymmetric cases are illustrated in Figure 3. Here, the shown distributions fitted by AQ and MoG correspond to those reach the maximum likelihood over 30 random experiments. It is obvious that AQ does a much better job to approximate the real noise than MoG. Particularly, AQ almost provides a duplicate of real noise. In contrast, MoG is able to fit the tails, while, at the same time, it results in bad approximation to peaks. Hence, AQ has more power in fitting complex noise than MoG.
4.2 Image inpainting experiments
Image inpainting is a typical image processing task. In real applications, some parts of an image may be deteriorated so that the corresponding information is lost. To facilitate the understanding of the image, some sophisticated technique need to be adopted to recover the corrupted parts of the image. This is exactly the objective of image inpainting. There is evidence that many images are lowrank matrices so that the single image inpainting can be done by matrix completion Hu2013Fast . In image inpainting, the corrupted pixels are viewed as missing values and then the image can be recovered by an LRMF algorithm. In this paper, three typical RGB images ^{5}^{5}5https://sites.google.com/site/zjuyaohu/ of size were employed. In our experiments, each image was reshaped to . By following the common practice in the research of image inpainting, we artificially corrupted the given images by putting some masks onto them. In doing so, it is convenient to examine how well each method performs to restore the original images. Here, three kinds of masks were considered, namely, random mask where 20% pixels were stochastically removed, text masks with big and small fonts, respectively. The rank was set to 80 for all algorithms.
Figure 4 visualizes the original, masked and reconstructed images, and Table 3 reports the average errors of each algorithm. It is obvious that removing a random mask is the easiest task. In this situation, there is no significantly visible difference among the reconstructed images. In terms of average error, MoG performs best and AQ can be ranked in second place. In contrast, the results shown in Figure 4 and Table 3 indicate that text mask removal is more difficult, especially when the images are corrupted with big fonts. The main reason lies in that the text mask is spatially correlated while it is difficult for any LRMF algorithm to effectively utilize this type of information. Under these circumstances, it can be observed in Figure 4 and Table 3 that AQ outperforms the other methods to remove the text masks with regard to both average error and visualization. RegL1ALM and MoG perform badly and the clear text can often be seen in their reconstructed images. Although CWM produces slightly better results, its average error is still higher than that of AQ. In a word, AQ possesses the superiority over the other algorithms in our investigated image inpainting tasks. In particular, AQ achieves the smallest average error in 6 cases and the second smallest one in 2 cases.
Image A  Image B  Image C  

AQ  MoG  CWM  RegL1ALM  AQ  MoG  CWM  RegL1ALM  AQ  MoG  CWM  RegL1ALM  
Small  2.32  2.59  2.33  2.91  5.60  7.43  6.90  8.08  5.13  5.83  6.30  6.97 
Large  5.59  9.25  7.84  8.77  6.88  20.16  8.78  19.11  6.84  7.15  7.67  17.70 
Random  2.81  2.18  3.06  2.45  5.91  5.75  6.65  6.80  5.61  6.74  6.77  5.07 
4.3 Multispectral image experiments
In this subsection, we study the behavior of all algorithms in image denoising tasks. The Columbia Multispectral Image database ^{6}^{6}6http://www1.cs.columbia.edu/CAVE/databases/multispectral was employed, where every scene contains 31 bands with size . To achieve our purpose, eight scenes out of them (i.e., Balloon, Paints, Flowers, Cloth, Feathers, Hairs, Pompoms and Clay) were utilized to test the effectiveness of our methods. The used images were resized by half and the pixels were rescaled to [0,1]. Analogous to the strategy used in image inpainting experiments, some noise was artificially added to the original images. Then, each LRMF algorithm was applied to remove the noise so that the corrupted images can be restored as accurate as possible. In the experiments, four different kinds of noise were considered, that is, Laplace noise with scale parameter , asymmetric Laplace noise with and mixture noise, i.e., . The rank was set to 4 for all algorithms.
Table 4 reports the average errors of each method. For Laplace noise, it is found that RegL1ALM performs best, while AQ, MoG and CWM are ranked in the second, third and fourth places, respectively. The success of RegL1ALM can be attributed to the special format of its objective function, namely, norm loss plus two penalties on and . On one hand, the norm loss is exactly compatible with Laplace noise. On the other hand, there is empirical evidence showing that its used penalties can lead to better performance on image datasets. For asymmetric Laplace and mixture noise, it is not surprising that AQ outperforms all the other methods. SPCP reaches the second lowest average error, while the other ones perform badly. The reason may be that SPCP does not rely on noise distribution, while the other approaches implicitly assume that the noise distribution is not skew.
Scene  Type of  AQ  MoG  CWM  RegL1ALM  SPCP 

Balloon  Laplace  0.0342  0.0348  0.0398  0.0343  0.0487 [t] 
Asymmetric Laplace  0.0804  0.1964  0.1501  0.1379  0.1280  
Mixture noise  0.1974  0.2214  0.2453  0.2422  0.2026 [b]  
Flowers  Laplace  0.0362  0.0395  0.0371  0.0343  0.0437 [t] 
Asymmetric Laplace  0.0761  0.1708  0.1450  0.1374  0.1289  
Mixture noise  0.1709  0.2339  0.2393  0.2437  0.2025 [b]  
Cloth  Laplace  0.0412  0.0399  0.0365  0.0323  0.0667 [t] 
Asymmetric Laplace  0.0975  0.1640  0.1424  0.1374  0.1314  
Mixture noise  0.1927  0.1898  0.2334  0.2512  0.2037 [b]  
Feathers  Laplace  0.0392  0.0390  0.0417  0.0373  0.0470 [t] 
Asymmetric Laplace  0.0911  0.1526  0.1487  0.1396  0.1303  
Mixture noise  0.1946  0.2375  0.2455  0.2425  0.2046 [b]  
Hairs  Laplace  0.0321  0.0380  0.0358  0.0292  0.0253 [t] 
Asymmetric Laplace  0.0681  0.1969  0.1373  0.1346  0.1201  
Mixture noise  0.1412  0.2172  0.2305  0.2387  0.1979 [b]  
Pompoms  Laplace  0.0493  0.0420  0.0414  0.0379  0.0824 [t] 
Asymmetric Laplace  0.1002  0.1937  0.1508  0.1422  0.1472  
Mixture noise  0.1630  0.2221  0.2489  0.2323  0.2153 [b]  
Paints  Laplace  0.0424  0.0370  0.0431  0.0354  0.0396 [t] 
Asymmetric Laplace  0.1009  0.1910  0.1432  0.1402  0.1259  
Mixture noise  0.2119  0.2311  0.2460  0.2399  0.2018 [b]  
Clay  Laplace  0.0335  0.0402  0.0362  0.0344  0.0596 [t] 
Asymmetric Laplace  0.0778  0.1787  0.1385  0.1342  0.1380  
Mixture noise  0.1569  0.2169  0.2403  0.2453  0.2097 [t]  
mean  Laplace  0.0385  0.0388  0.0390  0.0344  0.0516 [t] 
Asymmetric Laplace  0.0865  0.1805  0.1445  0.1380  0.1312  
Mixture noise  0.1786  0.2212  0.2411  0.2420  0.2048 [b] 
4.4 Face modeling experiments
Here, we applied the LRMF techniques to address the face modeling task. The Extended Yale B database ^{7}^{7}7http://cvc.yale.edu/projects/yalefaces/yalefaces.html consisting of 64 images with size of each subject was considered. Therefore, it leads to a matrix for each subject. Particularly, we used the face images of the third and fifth subjects. The first column of Figure 5 demonstrates some typical faces for illustration. We set the rank to 4 for all methods except for SPCP which determines the rank automatically. The second to sixth columns of Figure 5 display the faces reconstructed by the compared LRMF algorithms.
From Figure 5, we can observe that that all methods are able to remove the cast shadows, saturations and camera noise. However, the performance of SPCP seems to be worse in comparison with other algorithms. Evidently, AQ always outperforms the other methods due to its pretty reconstruction. As shown in Figure 1, there is an asymmetric distribution in the face with a large dark region. Because of this, the techniques MoG, CWM, RegL1ALM and SPCP which utilize the symmetric loss function lead to bad results, while AQ with the quantile loss function produces the best reconstructed images.
4.5 Hyperspectral image experiments
In this subsection, we employed two HSI datasets, Urban and Terrain ^{8}^{8}8http://www.erdc.usace.army.mil/Media/FactSheets/FactSheetArticleView/Article/610433/hypercube/, to investigate the behavior of all algorithms. There are 210 bands, each of which is of size for Urban and for Terrain. Thus, the data matrix is of size for Urban and for Terrain. Here, we utilized the same experimental settings as those used in subsection 4.4. DW was still unavailable in this experiment due to the computational problem. As show in the first column of Figure 6, some parts of bands are seriously polluted by the atmosphere and water absorption.
The reconstructed images of bands 106 and 207 in the Terrain data set and the band 104 in the Urban data set are shown in Figure 6 (a), (c) and (e), respectively. Their residual images (i.e., ) are also demonstrated below the reconstructed ones. Obviously, the band 106 in Terrain is seriously polluted. Nevertheless, our proposed AQ method still effectively reconstructs a clean and smooth one. Although MoG, CWM and RegL1ALM remove most parts of noise, they miss a part of local information, that is, the line from upper left corner to bottom right hand side (i.e., the parallelogram marked in the original image). As for SPCP, it only removes few parts of noise. The residual images also reveal that AQ behaves better to deal with the detailed information. Note that the band 207 in Terrain and the band 104 in Urban are mainly corrupted by the stripe and Guassianlike noise. Under these circumstances, AQ still outperforms the others because the latter fails to remove the stripe noise. In particular, for the interested areas that are marked by rectangles and amplified areas, the bands reconstructed by MoG, CWM, RegL1ALM and SPCP contain evident stripes. As far as the reconstructed images produced by AQ are concerned, however, this phenomenon does not exist.
We conjectured that the main reason for the different behavior of these algorithms lies in their used loss function. For CWM, RegL1ALM and SPCP, too simple loss function lead them to work not well when encountering complicated noise. In contrast, AQ and MoG perform better because they use multiple distribution components to model noise. It is very interesting to study the difference between AQ and MoG. For these two algorithms, we found that they both approximate the noise in our considered three bands with two components. For AQ (MoG), we denoted them as AQ1 and AQ2 (MoG1 and MoG2), respectively. In Figure 7, we presented denoised images and residual images produced by each component. Take the denoised image in the column AQ1 as an example, it corresponds to and the residual image shown below it corresponds to AQ1 (i.e., . The other images can be understood similarly. In doing so, we can further figure out the role that each component in AQ or MoG plays. When dealing with the band 106 in Terrain, the first AQ component is seen to denoise the center parts, while the second one targets at the left and right edges. For the band 207 in Terrain, two AQ components denoise the bottom and the rest parts, respectively. Regarding the band 104 in Urban, they focus on the right upper and center parts, respectively. By inspecting the results generated by MoG, however, we cannot discover some regular patterns for the role that two components play. Therefore, it can be concluded that AQ can capture the local structural information of real images, although we do not encode it into our model. The reason may be that the pixels with the same skewness in real images tend to cluster. In this aspect, AQ also possesses superiority over MoG.
5 Conclusions
Aiming at enhancing the performance of existing LRMF methods to cope with complicated noise in real applications, we propose in this work a new lowrank matrix factorization method AQLRMF to recover subspaces. The core idea of AQLRMF is to directly model unknown noise by a mixture of asymmetric Laplace distributions. We also present an efficient procedure based on the EM algorithm to estimate the parameters in AQLRMF. Actually, the objective function of AQLRMF corresponds to the adaptive quantile loss like those used in quantile regression. Compared with several stateoftheart counterparts, the novel AQLRMF model always outperforms them in synthetic and real data experiments. In addition, AQLRMF also has the superiority to capture local structural information in real images. Therefore, AQLRMF can be deemed as a competitive tool to cope with complex real problems.
Acknowledgment
This research was supported by the National Natural Science Foundation of China [grant number 11671317, 61572393].
References
References
 (1) J. Ye, Generalized low rank approximations of matrices, Machine Learning 61 (13) (2005) 167–191.
 (2) A. Sobral, T. Bouwmans, E.h. Zahzah, Lrslibrary: Lowrank and sparse tools for background modeling and subtraction in videos, in: Handbook of Robust LowRank and Sparse Matrix Decomposition: Applications in Image and Video Processing, CRC Press, Taylor and Francis Group.
 (3) M. Udell, C. Horn, R. Zadeh, S. Boyd, Generalized low rank models, Foundations and Trends in Machine Learning 9 (1) (2016) 1–118.
 (4) V. Koltchinskii, K. Lounici, A. B. Tsybakov, Nuclearnorm penalization and optimal rates for noisy lowrank matrix completion, Annals of Statistics 39 (5) (2011) 2302–2329.
 (5) D. D. Lee, H. S. Seung, Learning the parts of objects by nonnegative matrix factorization, Nature 401 (6755) (1999) 788–791.

(6)
H. Wu, Z. Zhang, K. Yue, B. Zhang, J. He, L. Sun, Dualregularized matrix factorization with deep neural networks for recommender systems, KnowledgeBased Systems 145 (2018) 46 – 58.
 (7) E. M. Airoldi, D. M. Blei, S. E. Fienberg, E. P. Xing, Mixed membership stochastic blockmodels, Journal of Machine Learning Research 9 (2008) 1981–2014.
 (8) P. Gopalan, D. M. Blei, Efficient discovery of overlapping communities in massive networks, Proceedings of the National Academy of Sciences of the United States of America 110 (36) (2013) 14534–14539.
 (9) W. Wang, Y. Feng, P. Jiao, W. Yu, Kernel framework based on nonnegative matrix factorization for networks reconstruction and link prediction, KnowledgeBased Systems 137 (2017) 104 – 114.

(10)
J. Wright, Y. Peng, Y. Ma, A. Ganesh, S. Rao, Robust principal component analysis: exact recovery of corrupted lowrank matrices by convex optimization, in: International Conference on Neural Information Processing Systems (NIPS), 2009, pp. 2080–2088.
 (11) X. Fei, Y. Chen, P. Chong, Y. Wang, X. Liu, G. He, Denoising of hyperspectral image using lowrank matrix factorization, IEEE Geoscience & Remote Sensing Letters 14 (7) (2017) 1141–1145.

(12)
Q. Ke, T. Kanade, Robust
norm factorization in the presence of outliers and missing data by alternative convex programming, in: 2005 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 739–746.
 (13) D. Meng, F. D. L. Torre, Robust matrix factorization with unknown noise, in: 2014 IEEE International Conference on Computer Vision (ICCV), 2014, pp. 1337–1344.
 (14) V. Maz’Ya, G. Schmidt, On approximate approximations using gaussian kernels, IMA Journal of Numerical Analysis 16 (1) (1996) 13–29.
 (15) H. Kozumi, G. Kobayashi, Gibbs sampling methods for bayesian quantile regression, Journal of Statistical Computation & Simulation 81 (11) (2011) 1565–1578.
 (16) K. Yu, J. Zhang, A threeparameter asymmetric laplace distribution and its extension, Communications in Statistics  Theory and Methods 34 (910) (2005) 1867–1879.
 (17) N. Srebro, T. Jaakkola, Weighted lowrank approximations, in: Proceedings of the 20th International Conference on Machine Learning (ICML), 2003, pp. 720–727.
 (18) A. Eriksson, A. van den Hengel, Efficient computation of robust lowrank matrix approximations in the presence of missing data using the norm, in: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 771–778.
 (19) E. Kim, M. Lee, C. H. Choi, N. Kwak, S. Oh, Efficient normbased lowrank matrix approximations for largescale problems using alternating rectified gradient method, IEEE Transactions on Neural Networks and Learning Systems 26 (2) (2015) 237–251.
 (20) M. Okutomi, S. Yan, S. Sugimoto, G. Liu, Y. Zheng, Practical lowrank matrix approximation under robust norm, IEEE Computer Society, Los Alamitos, CA, USA, 2012, pp. 1410–1417.

(21)
D. Meng, Z. Xu, L. Zhang, J. Zhao, A cyclic weighted median method for
lowrank matrix factorization with missing entries, in: Proceedings of the TwentySeventh AAAI Conference on Artificial Intelligence, 2013, pp. 704–710.
 (22) Z. Lin, C. Xu, H. Zha, Robust matrix factorization by majorization minimization, IEEE Transactions on Pattern Analysis & Machine Intelligence 40 (1) (2018) 208–220.
 (23) S. Li, J. Zhang, X. Guo, Efficient low rank matrix approximation via orthogonality pursuit and regularization, in: IEEE International Conference on Multimedia and Expo, 2017, pp. 871–876.
 (24) B. Lakshminarayanan, G. Bouchard, C. Archambeau, Robust bayesian matrix factorisation, Journal of Machine Learning Research 15 (2011) 425–433.
 (25) N. Wang, T. Yao, J. Wang, D.Y. Yeung, A Probabilistic Approach to Robust Matrix Factorization, Springer Berlin Heidelberg, Berlin, Heidelberg, 2012, pp. 126–139.
 (26) X. Cao, Y. Chen, Q. Zhao, D. Meng, Lowrank matrix factorization under general mixture noise distributions, in: IEEE Transactions on Image Processing, Vol. 25, 2016, pp. 4677–4690.
 (27) E. Candès, X. Li, Y. Ma, J. Wright, Robust principal component analysis?, Journal of the ACM 58 (3) (2011) 11.
 (28) Z. Zhou, X. Li, J. Wright, E. Candès, Stable principal component pursuit, in: IEEE International Symposium on Information Theory Proceedings, Austin, TX, USA, 2010, pp. 1518–1522.

(29)
A. Azzalini, A. D. Valle, The multivariate skewnormal distribution, Biometrika 83 (4) (1996) 715–726.
 (30) A. P. Dempster, N. M. Laird, D. B. Rubin, Maximum likelihood from incomplete data via the em algorithm, Journal of the Royal Statistical Society. Series B (Methodological) 39 (1) (1977) 1–38.
 (31) T. Okatani, T. Yoshida, K. Deguchi, Efficient algorithm for lowrank matrix factorization with missing components and performance comparison of latest algorithms, 2011, pp. 842–849.
 (32) A. Aravkin, S. Becker, V. Cevher, P. Olsen, A variational approach to stable principal component pursuit, in: 2014 Conference on Uncertainty in Artificial Intelligence (UAI), 2014.
 (33) Y. Hu, D. Zhang, J. Ye, X. Li, X. He, Fast and accurate matrix completion via truncated nuclear norm regularization, IEEE Transactions on Pattern Analysis & Machine Intelligence 35 (9) (2013) 2117–2130.
Comments
There are no comments yet.