Adaptive Quantile Low-Rank Matrix Factorization

01/01/2019 ∙ by Shuang Xu, et al. ∙ Xi'an Jiaotong University 30

Low-rank matrix factorization (LRMF) has gained much popularity owing to its successful applications in both computer vision and data mining. By assuming the noise term to come from a Gaussian, Laplace or mixture-of-Gaussians distribution, significant efforts have been made on optimizing the (weighted) $L_1$- or $L_2$-norm loss between an observed matrix and its bilinear factorization. However, the type of noise distribution is generally unknown in real applications, and inappropriate assumptions will inevitably deteriorate the behavior of LRMF. Moreover, real data are often corrupted by skew rather than symmetric noise. To tackle this problem, this paper presents a novel LRMF model called AQ-LRMF, which models the noise with a mixture of asymmetric Laplace distributions. An efficient algorithm based on the expectation-maximization (EM) algorithm is also offered to estimate the parameters involved in AQ-LRMF. The AQ-LRMF model possesses the advantage that it can approximate the noise well no matter whether the real noise is symmetric or skew. The core idea of AQ-LRMF lies in solving a weighted $L_1$ problem with weights learned from data. Experiments conducted on synthetic and real datasets show that AQ-LRMF outperforms several state-of-the-art techniques. Furthermore, AQ-LRMF is also superior to the other algorithms in that it can capture local structural information contained in real images.







1 Introduction

Researchers from machine learning Ye2005Generalized , computer vision lrslibrary2015 ; Udell2014Generalized and statistics Koltchinskii2011NUCLEAR have paid increasing attention to low-rank matrix factorization (LRMF) Lee1999Learning . Generally speaking, many real-world modeling tasks can be cast as LRMF problems, including but not limited to recommender systems WU201846 , community detection airoldi2008mixed ; gopalan2013efficient , link prediction WANG2017104 , face recognition Wright2009Robust and image denoising Fei2017Denoising .

The key idea of LRMF is to approximate a given matrix by the product of two low-rank matrices. Specifically, given an observed matrix $X \in \mathbb{R}^{m \times n}$, LRMF aims at solving the optimization problem

$$\min_{U, V} \; \left\| W \odot \left( X - U V^\top \right) \right\|, \qquad (1)$$

where $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$ (usually, $r \ll \min(m, n)$) and $\odot$ denotes the Hadamard product, that is, the element-wise product. The indicator matrix W implies whether some data are missing: $w_{ij} = 1$ if $x_{ij}$ is non-missing and 0 otherwise. The symbol $\|\cdot\|$ indicates a certain norm of a matrix, of which the most prevalent one is the $L_2$ norm. It is well known that singular value decomposition provides a closed-form solution for $L_2$-norm LRMF without missing entries. In addition, researchers have presented many fast algorithms to solve Eq. (1) when X contains missing entries as well. The $L_2$-norm LRMF greatly facilitates theoretical analysis, but it provides the best solution in the sense of the maximum likelihood principle only when the noise is indeed sampled from a Gaussian distribution. If the noise comes from a heavy-tailed distribution or the data are corrupted by outliers, $L_2$-norm LRMF is likely to perform badly. Thereafter, $L_1$-norm LRMF began to gain increasing interest from both theoretical researchers and practitioners due to its robustness Ke:2005:RLN:1068507.1068989 . In fact, $L_1$-norm LRMF hypothesizes that the noise comes from a Laplace distribution. As with $L_2$-norm LRMF, $L_1$-norm LRMF may also provide unexpected results if its assumptions are violated.

Because the noise in real data generally deviates far from a Gaussian or Laplace distribution, analysts are no longer satisfied with $L_2$- or $L_1$-norm LRMF. To improve the robustness of LRMF, researchers have attempted to directly model unknown noise via a mixture of Gaussians (MoG), owing to its ability to universally approximate any continuous distribution Meng2014Robust ; Maz1996On . Nevertheless, this technique cannot fit real noise precisely in some complex cases. For example, in theory, infinitely many Gaussian components are required to approximate a Laplace distribution, whereas in practice only finitely many components can be used. Moreover, the Gaussian, Laplace and MoG distributions are all symmetric, so they may provide unsatisfactory results when the real noise is skew.

Figure 1: (a) and (b) illustrate two face images corresponding to underexposure and overexposure cases, respectively. In particular, (a-1) and (b-1) are face images captured with improper light sources while (a-2) and (b-2) are face images obtained with proper light sources. (a-3) and (b-3) are residual images in which the yellow (blue) locations indicate positive (negative) values. (a-4) and (b-4) illustrate the histograms of the residual images as well as the PDF curves fitted by ALDs. The residual face in (a-3) is negatively skewed whilst the skewness for (b-3) is 0.69. (c) shows a hyperspectral image. Similar to cases (a) and (b), the images from (c-1) to (c-4) are the original, de-noised and noise images and the histogram of the residuals, respectively. The noise image (c-3) is negatively skewed, and (c-4) also shows the fitted ALD. Obviously, the distributions of noise shown here are all asymmetric.

As a matter of fact, there is no strictly symmetric noise in real images. For instance, Figure 1 illustrates several examples in which the real noise is either skewed to the left (e.g., (a-4) and (c-4)) or to the right (e.g., (b-4)). In these situations, symmetric distributions like the Gaussian or Laplace are inadequate to approximate the noise. In statistics, a preliminary exploration called quantile regression has been made to deal with an asymmetric noise distribution. Consider the simple case with a single covariate $x$; the quantile regression coefficient can be obtained by

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \rho_\tau\!\left(y_i - x_i \beta\right), \qquad (2)$$

where $\{(x_i, y_i)\}_{i=1}^{n}$ are observations and $\tau \in (0, 1)$ is a pre-defined asymmetry parameter. Moreover, the quantile loss is defined as

$$\rho_\tau(u) = u \left( \tau - I(u < 0) \right), \qquad (3)$$

where $I(\cdot)$ is the indicator function. Evidently, the quantile loss with $\tau = 0.5$ corresponds to the $L_1$-norm loss up to a constant factor. From the Bayesian viewpoint, the estimate obtained by minimizing the quantile loss in (2) coincides with the result obtained by assuming the noise comes from an asymmetric Laplace distribution (ALD) Kozumi2011Gibbs ; Keming2005A .
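As a quick sanity check on the check-loss definition, it can be written in a few lines; the behavior below follows directly from $\rho_\tau(u) = u(\tau - I(u < 0))$:

```python
def quantile_loss(u, tau):
    """Quantile (check) loss: rho_tau(u) = u * (tau - I(u < 0))."""
    return u * (tau - (u < 0))

# tau = 0.5 gives a symmetric, L1-type penalty: 0.5 * |u|
assert quantile_loss(2.0, 0.5) == 1.0
assert quantile_loss(-2.0, 0.5) == 1.0

# tau = 0.9 penalizes positive residuals more heavily than negative ones
assert quantile_loss(2.0, 0.9) > quantile_loss(-2.0, 0.9)
```

The last assertion is the asymmetry that makes the loss suitable for skew noise: residuals of equal magnitude but opposite sign are charged different costs.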

To overcome the shortcoming of existing LRMF methods that the type of noise distribution must be assumed in advance, we present in this paper an adaptive quantile LRMF (AQ-LRMF) algorithm. The key idea of AQ-LRMF is to model the noise via a mixture of asymmetric Laplace distributions (MoAL). The expectation-maximization (EM) algorithm is employed to estimate the parameters under the maximum likelihood framework. The novelty of AQ-LRMF and our main contributions can be summarized as follows.

  1. The M-step of the EM algorithm corresponds to a weighted $L_1$-norm LRMF problem, where the weights encode the information about skewness and outliers.

  2. The weights are automatically learned from data under the framework of EM algorithm.

  3. Different from quantile regression, our method does not need to pre-define the asymmetry parameter of quantile loss, because it is adaptively determined by data.

  4. Our model can capture local structural information contained in some real images, although we do not encode it into our model.

The experiments show that our method can effectively approximate many different kinds of noise. If the noise has a strong tendency to take a particular sign, AQ-LRMF produces better estimates than a method which assumes a symmetric noise distribution. In comparison with several state-of-the-art methods, the superiority of our method is demonstrated in both synthetic and real-data experiments such as image inpainting, face modeling and hyperspectral image (HSI) reconstruction.

The rest of the paper is organized as follows. Section 2 reviews related work on LRMF. In Section 3, we propose the AQ-LRMF model and provide an efficient learning algorithm for it. Section 4 presents the experimental studies. Finally, conclusions are drawn in Section 5.

2 Related work

The study of robust LRMF has a long history. Srebro and Jaakkola icml2003SrebroJ03 suggested using a weighted $L_2$ loss to improve LRMF's robustness; the problem can be solved by a simple but efficient EM algorithm, but the choice of weights significantly affects its capability. Thereafter, Ke and Kanade Ke:2005:RLN:1068507.1068989 attempted to replace the $L_2$ loss with the $L_1$ loss and to solve the optimization by alternated linear or quadratic programming (ALP/AQP). In order to accelerate convergence, Eriksson and van den Hengel L1Wiberg developed the $L_1$-Wiberg algorithm. Kim et al. Kim2015Efficient used an alternating rectified gradient method to solve large-scale $L_1$-norm LRMF; simulated experiments showed that this method performs well in terms of both matrix reconstruction performance and computational complexity. Okutomi et al. RegL1ALM modified the objective function of $L_1$-Wiberg by adding the nuclear norm of V and orthogonality constraints on U; this method has been shown to be effective in addressing the structure-from-motion problem. Despite the non-convexity and non-smoothness of $L_1$-norm LRMF, Meng et al. CWM proposed a computationally efficient algorithm, the cyclic weighted median (CWM) method, which solves a sequence of scalar minimization sub-problems to obtain the optimal solution. Inspired by the majorization-minimization technique, Lin et al. Lin2017Robust proposed LRMF-MM to solve an LRMF optimization task with an $L_1$ loss plus norm penalties placed on U and V. In each step, they upper bound the original objective function by a strongly convex surrogate and then minimize the surrogate; experiments on both simulated and real data sets testify to the effectiveness of LRMF-MM. Li et al. Li2017Efficient considered a similar problem with a different penalty imposed on V, which they solved by the augmented Lagrange multiplier method. Furthermore, the authors of Li2017Efficient designed a heuristic rank estimator for their model. As argued in the introduction, the $L_1$ loss actually corresponds to Laplace-distributed noise; when the real noise distribution deviates too far from Laplace, the robustness of LRMF becomes questionable.

Recently, the research community began to focus on probabilistic extensions of robust matrix factorization. Generally speaking, it is assumed that $X = UV^\top + E$, where $E$ is a noise matrix. Lakshminarayanan et al. Lakshminarayanan2011Robust replaced Gaussian noise with Gaussian scale mixture noise; nevertheless, this may be ineffective when processing heavy-tailed (such as Laplace-type) noise. Wang et al. Wang2012 proposed a probabilistic $L_1$-norm LRMF, but they did not employ a fully Bayesian inference process. Beyond Laplace noise, Meng and De la Torre Meng2014Robust presented a robust LRMF with unknown noise modeled by an MoG. In essence, the method iteratively optimizes a weighted $L_2$ objective whose weights are determined by the MoG parameters, which are automatically updated during optimization. Due to its ability to adaptively assign small weights to corrupted entries, MoG-LRMF has been reported to be fairly effective. More recently, Cao et al. Cao2015Low presented a novel LRMF model by modeling the noise as a mixture of exponential power (MoEP) distributions and also offered the corresponding learning algorithm.

On the other hand, robust principal component analysis (robust PCA) Candes2011Robust considers a problem similar to LRMF, that is,

$$\min_{L, S} \; \mathrm{rank}(L) + \lambda \|S\|_0, \quad \text{s.t.} \; X = L + S. \qquad (4)$$

The underlying assumption of robust PCA is that the original data can be decomposed into the sum of a low-rank matrix $L$ and a sparse outlier matrix $S$ (i.e., the number of non-zero elements in $S$ is small). Clearly, $L$ plays the same role as the product of $U$ and $V^\top$. Since Eq. (4) involves a non-convex objective function, Candes2011Robust considered a tractable convex alternative, called principal component pursuit, to handle the corresponding problem, namely,

$$\min_{L, S} \; \|L\|_* + \lambda \|S\|_1, \quad \text{s.t.} \; X = L + S, \qquad (5)$$

where $\|\cdot\|_*$ denotes the nuclear norm. It is worth noting that principal component pursuit may sometimes fail to recover $L$ when the real observation is also corrupted by dense small-magnitude noise. To overcome this shortcoming, Zhou et al. Zhou2010Stable proposed the stable principal component pursuit (SPCP) by solving

$$\min_{L, S} \; \|L\|_* + \lambda \|S\|_1, \quad \text{s.t.} \; \|X - L - S\|_F \le \delta. \qquad (6)$$

Actually, the underlying assumption of SPCP is $X = L + S + N$, where $L$ is the low-rank component, $S$ contains the sparse outliers and $N$ is the small-magnitude noise that can be modeled as Gaussian. Both theory and experiments have shown that SPCP guarantees stable recovery of $L$.
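Nuclear-norm problems such as (5) and (6) are typically solved with proximal algorithms whose key ingredient is the proximal operator of the nuclear norm, i.e., singular value thresholding. A minimal sketch (not the solvers used in the paper):

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the proximal operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# Thresholding shrinks the singular values and lowers the rank:
M = np.diag([3.0, 2.0, 1.0])
L = svt(M, 1.5)   # singular values become (1.5, 0.5, 0), so rank drops to 2
```

Iterating such a shrinkage step (alternated with a soft-threshold on S) is the standard route to solving principal component pursuit.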

3 Adaptive Quantile LRMF (AQ-LRMF)

3.1 Motivation

Generally speaking, researchers employ the $L_2$ or $L_1$ loss function when solving a low-rank matrix factorization problem. As argued in the introduction, the $L_2$ or $L_1$ loss implicitly hypothesizes that the noise distribution is symmetric. Nevertheless, the noise in real data is often asymmetric, and Figure 1 illustrates several examples.

There are two face images and a hyperspectral image in Figure 1. Figure 1 (a) displays a face image captured with a poor light source. There are cast shadows over a large area, while overexposure occurs in a small area. As a result, the noise is negatively skewed. By contrast, Figure 1 (b) illustrates a face image captured under a strong light source. Because of the camera range settings, there are saturated pixels, especially on the forehead. Under this circumstance, the noise is positively skewed. Figure 1 (c) shows a hyperspectral image that is mainly corrupted by stripe and Gaussian noise. Its residual image indicates that the signs of the noise are unbalanced, i.e., more pixels are corrupted by noise with negative values. Actually, the skewness values of the three residual (noise) images are all non-zero (e.g., 0.69 for the second one). Since a symmetric distribution has skewness 0, the noise contained in these real data sets is asymmetric.

As a matter of fact, the noise in real data can hardly be governed by a strictly symmetric probability distribution. Therefore, it is natural to utilize an asymmetric distribution to model realistic noise. In statistics, researchers usually make use of the quantile loss function defined in (3) to address this issue. It has been shown that the quantile loss function corresponds to the situation where the noise comes from an asymmetric Laplace distribution Kozumi2011Gibbs ; Keming2005A . In order to improve the performance of low-rank matrix factorization, we attempt to use a mixture of asymmetric Laplace distributions (MoAL) to approximate the noise.

3.2 Asymmetric Laplace distribution

In what follows, we use $\mathrm{ALD}(\mu, \lambda, \kappa)$ to denote an ALD with location, scale and asymmetry parameters $\mu$, $\lambda$ and $\kappa$, respectively. Its probability density function (PDF) is

$$f(x; \mu, \lambda, \kappa) = \frac{\lambda}{\kappa + 1/\kappa} \begin{cases} \exp\left(-\lambda \kappa (x - \mu)\right), & x \ge \mu, \\ \exp\left(\frac{\lambda}{\kappa}(x - \mu)\right), & x < \mu. \end{cases} \qquad (7)$$

Obviously, the location parameter $\mu$ is exactly the mode of an ALD. In Figure 2, we demonstrate the PDF curves of several ALDs with different parameters. The asymmetry parameter $\kappa > 0$ controls the skewness of an ALD. In general, an ALD is positively skewed if $\kappa < 1$ and negatively skewed if $\kappa > 1$; if $\kappa = 1$, the ALD reduces to a Laplace distribution. The smaller the scale parameter $\lambda$ is, the more heavy-tailed the ALD is.

Figure 2: The PDF curves of ALDs. The location parameter is . Left: ; right: .

It is worth noting that skew Gaussian distributions azzalini1996the are also prevalent in both theory and applications. However, they are not ideal for the analysis of LRMF. On one hand, the PDF of a skew Gaussian distribution is complex. On the other hand, its skewness lies in a bounded interval, which is only a subset of the range attainable by an ALD. Due to this fact, the fitting capability of an ALD is greater than that of a skew Gaussian distribution.
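For concreteness, here is a small sketch of a three-parameter ALD density; the Kotz-type parameterization used below is an assumption, chosen so that $\kappa = 1$ recovers the symmetric Laplace density:

```python
import math

def ald_pdf(x, mu=0.0, lam=1.0, kappa=1.0):
    """Asymmetric Laplace density (Kotz-type parameterization, an assumption)."""
    norm = lam / (kappa + 1.0 / kappa)
    if x >= mu:
        return norm * math.exp(-lam * kappa * (x - mu))
    return norm * math.exp((lam / kappa) * (x - mu))

# kappa = 1 recovers the symmetric Laplace density
assert abs(ald_pdf(1.0, kappa=1.0) - ald_pdf(-1.0, kappa=1.0)) < 1e-12

# kappa < 1: the right tail is heavier, i.e. the density is positively skewed
assert ald_pdf(2.0, kappa=0.5) > ald_pdf(-2.0, kappa=0.5)
```

The two assertions mirror the text: the symmetric case at $\kappa = 1$, and positive skew for $\kappa < 1$.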

3.3 AQ-LRMF model

To enhance the robustness of LRMF in situations with skew and heavy-tailed noise, we propose an adaptive quantile LRMF (AQ-LRMF) model by modeling the unknown noise as an MoAL. In particular, we consider a generative model of the observed matrix X. For each entry $x_{ij}$, suppose that

$$x_{ij} = \mathbf{u}_i \mathbf{v}_j^\top + e_{ij}, \qquad (8)$$

where $\mathbf{u}_i$ is the $i$th row of U, $\mathbf{v}_j$ is the $j$th row of V, and $e_{ij}$ is the noise term. In AQ-LRMF, we assume that $e_{ij}$ is distributed as an MoAL, namely,

$$p(e_{ij}) = \sum_{k=1}^{K} \pi_k \, \mathrm{AL}(e_{ij}; \lambda_k, \kappa_k), \qquad (9)$$

in which $\mathrm{AL}(\lambda_k, \kappa_k)$ stands for a zero-location asymmetric Laplace distribution with parameters $\lambda_k$ and $\kappa_k$. Meanwhile, $\pi_k$ indicates the mixing proportion with $\pi_k \ge 0$ and $\sum_{k=1}^{K} \pi_k = 1$, and $K$ denotes the number of mixture components.

To facilitate the estimation of the unknown parameters, we further equip each noise term $e_{ij}$ with an indicator vector $\mathbf{z}_{ij} = (z_{ij1}, \ldots, z_{ijK})^\top$, where $z_{ijk} \in \{0, 1\}$ and $\sum_{k=1}^{K} z_{ijk} = 1$. Here, $z_{ijk} = 1$ indicates that the noise $e_{ij}$ is drawn from the $k$th AL distribution. Evidently, $\mathbf{z}_{ij}$ follows a multinomial distribution, i.e., $\mathbf{z}_{ij} \sim \mathrm{Multinomial}(1; \pi_1, \ldots, \pi_K)$. Under these assumptions, we have

$$p(e_{ij} \mid \mathbf{z}_{ij}) = \prod_{k=1}^{K} \mathrm{AL}(e_{ij}; \lambda_k, \kappa_k)^{z_{ijk}}. \qquad (10)$$


Now, it is easy to obtain the probability of $x_{ij}$ as

$$p(x_{ij} \mid \mathbf{u}_i, \mathbf{v}_j, \Theta) = \sum_{k=1}^{K} \pi_k \, \mathrm{AL}\!\left(x_{ij} - \mathbf{u}_i \mathbf{v}_j^\top; \lambda_k, \kappa_k\right), \qquad (11)$$

where $\Theta = \{\pi_k, \lambda_k, \kappa_k\}_{k=1}^{K}$ collects the unknown parameters. To estimate $\Theta$ as well as U and V, we employ the maximum likelihood principle. Consequently, the goal is to maximize the log-likelihood function of the complete data shown below, namely,

$$\ell_c(U, V, \Theta) = \sum_{(i,j)\in\Omega} \sum_{k=1}^{K} z_{ijk} \left[ \log \pi_k + \log \mathrm{AL}\!\left(x_{ij} - \mathbf{u}_i \mathbf{v}_j^\top; \lambda_k, \kappa_k\right) \right], \qquad (12)$$

where $\Omega$ denotes the index set of the non-missing entries of the data. Subsequently, we discuss how to maximize it to obtain the items of interest.
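The observed-data log-likelihood is simply a sum, over the non-missing entries, of the log of the mixture density of the residuals. A sketch, again assuming a zero-location Kotz-type ALD density:

```python
import math

def ald_pdf(x, lam, kappa):
    """Zero-location asymmetric Laplace density (Kotz-type form, an assumption)."""
    norm = lam / (kappa + 1.0 / kappa)
    if x >= 0:
        return norm * math.exp(-lam * kappa * x)
    return norm * math.exp((lam / kappa) * x)

def moal_loglik(residuals, pi, lam, kappa):
    """Log-likelihood of residuals e_ij = x_ij - u_i v_j^T under a MoAL."""
    total = 0.0
    for e in residuals:  # only the non-missing entries are summed over
        total += math.log(sum(p * ald_pdf(e, l, k)
                              for p, l, k in zip(pi, lam, kappa)))
    return total

ll = moal_loglik([0.3, -1.2, 0.7], pi=[0.5, 0.5], lam=[1.0, 2.0], kappa=[1.0, 0.5])
```

In the full model, maximizing this quantity jointly over the factors and the mixture parameters is what the EM algorithm of the next subsection carries out.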

3.4 Learning of AQ-LRMF

Since each $e_{ij}$ is associated with an indicator vector $\mathbf{z}_{ij}$, the EM algorithm em is utilized to train the AQ-LRMF model. In particular, the algorithm iteratively implements the following two steps (i.e., the E-step and the M-step) until it converges. For ease of exposition, we let $\Theta = \{\pi_k, \lambda_k, \kappa_k\}_{k=1}^{K}$ and abbreviate the residual $x_{ij} - \mathbf{u}_i \mathbf{v}_j^\top$ as $e_{ij}$ in the following discussions.

E-step: Compute the conditional expectation of the latent variable $z_{ijk}$ as

$$\gamma_{ijk} \equiv \mathbb{E}\left[z_{ijk} \mid x_{ij}\right] = \frac{\pi_k \, \mathrm{AL}\!\left(x_{ij} - \mathbf{u}_i \mathbf{v}_j^\top; \lambda_k, \kappa_k\right)}{\sum_{l=1}^{K} \pi_l \, \mathrm{AL}\!\left(x_{ij} - \mathbf{u}_i \mathbf{v}_j^\top; \lambda_l, \kappa_l\right)}. \qquad (13)$$
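This conditional expectation is the standard Bayes-rule computation of mixture responsibilities. A sketch for a single residual, with the ALD parameterization assumed as before:

```python
import math

def ald_pdf(x, lam, kappa):
    """Zero-location asymmetric Laplace density (parameterization assumed)."""
    norm = lam / (kappa + 1.0 / kappa)
    return norm * math.exp(-lam * kappa * x) if x >= 0 \
        else norm * math.exp((lam / kappa) * x)

def responsibilities(e, pi, lam, kappa):
    """gamma_k = pi_k f_k(e) / sum_l pi_l f_l(e) for one residual e."""
    dens = [p * ald_pdf(e, l, k) for p, l, k in zip(pi, lam, kappa)]
    z = sum(dens)
    return [d / z for d in dens]

gam = responsibilities(0.4, pi=[0.7, 0.3], lam=[1.0, 3.0], kappa=[1.0, 0.8])
```

By construction the responsibilities are positive and sum to one for each entry.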


In order to attain the updating rules of the other parameters, we need to compute the $Q$-function. According to the working mechanism of the EM algorithm, the $Q$-function can be obtained by taking the expectation of the log-likelihood function shown in (12) with regard to the conditional distribution of the latent variables $\mathbf{z}_{ij}$. Specifically, it can be derived as

$$Q = \sum_{(i,j)\in\Omega} \sum_{k=1}^{K} \gamma_{ijk} \left[ \log \pi_k + \log \mathrm{AL}\!\left(x_{ij} - \mathbf{u}_i \mathbf{v}_j^\top; \lambda_k, \kappa_k\right) \right]. \qquad (14)$$




M-step: Maximize the -function by iteratively updating its parameters as follows.

  1. Update $\pi_k$: To attain the update for $\pi_k$, we need to solve the constrained optimization problem

$$\max_{\pi_1, \ldots, \pi_K} \; \sum_{(i,j)\in\Omega} \sum_{k=1}^{K} \gamma_{ijk} \log \pi_k, \quad \text{s.t.} \; \sum_{k=1}^{K} \pi_k = 1, \qquad (16)$$

    via the Lagrange multiplier method. By some derivations, we have

$$\pi_k = \frac{1}{|\Omega|} \sum_{(i,j)\in\Omega} \gamma_{ijk}, \qquad (17)$$

    in which $|\Omega|$ stands for the cardinality of $\Omega$.

  2. Update $\lambda_k$: Compute the gradient $\partial Q / \partial \lambda_k$ and set it to zero. Consequently, the update of $\lambda_k$ can be obtained as

$$\lambda_k = \frac{\sum_{(i,j)\in\Omega} \gamma_{ijk}}{\sum_{(i,j)\in\Omega} \gamma_{ijk} \, \eta_{\kappa_k}(e_{ij})}, \qquad (18)$$

    where $\eta_{\kappa}(e) = \kappa e$ if $e \ge 0$ and $\eta_{\kappa}(e) = -e/\kappa$ otherwise.

  3. Update $\kappa_k$: Compute the gradient $\partial Q / \partial \kappa_k$ and set it to zero; we then have

$$a_k \kappa_k^2 + b_k \kappa_k + c_k = 0, \qquad (19)$$

    where the coefficients $a_k$, $b_k$ and $c_k$ depend on the responsibilities $\gamma_{ijk}$ and the residuals $e_{ij}$. This is a quadratic equation with regard to $\kappa_k$. Obviously, Eq. (19) has a unique root satisfying $\kappa_k > 0$, that is,

$$\kappa_k = \frac{-b_k + \sqrt{b_k^2 - 4 a_k c_k}}{2 a_k}. \qquad (20)$$

  4. Update U and V: By omitting some constants, the objective function to optimize can be rewritten as

$$\min_{U, V} \; \sum_{(i,j)\in\Omega} w_{ij} \left| x_{ij} - \mathbf{u}_i \mathbf{v}_j^\top \right|, \qquad (21)$$

    where the $(i,j)$th entry of W is

$$w_{ij} = \sum_{k=1}^{K} \gamma_{ijk} \lambda_k \left[ \kappa_k I(e_{ij} \ge 0) + \frac{1}{\kappa_k} I(e_{ij} < 0) \right]. \qquad (22)$$

Hence, the optimization problem in Eq. (21) is equivalent to a weighted $L_1$-LRMF, which can be solved by a fast off-the-shelf algorithm such as the cyclic weighted median (CWM) method CWM .

On one hand, it is interesting that the M-step of AQ-LRMF is the same as that of MoG-LRMF Meng2014Robust , except that the latter minimizes a weighted $L_2$ loss. Due to this feature, AQ-LRMF is more robust than MoG-LRMF. On the other hand, each weight of MoG-LRMF only embodies information about whether the corresponding entry is an outlier, whereas each weight of AQ-LRMF additionally contains information about the sign of the bias. In detail, $\lambda_k$ is a scale parameter, and entries dominated by components with smaller $\lambda_k$ (i.e., heavier tails) receive smaller weights and thus correspond to outliers. Moreover, the weight $w_{ij}$ defined in Eq. (22) is a function of the skewness parameter $\kappa_k$: it involves $\kappa_k$ when the residual $e_{ij} \ge 0$ and $1/\kappa_k$ otherwise. Hence, the weights assigned to two entries still differ if their residuals have the same absolute value but different signs. In conclusion, AQ-LRMF has more capacity to handle heavy-tailed skew data.

Based on the above analysis, we summarize the main steps to learn the parameters involved in AQ-LRMF in the following Algorithm 1.

0:    Input: the observed matrix X of order $m \times n$; the index set $\Omega$ of the non-missing entries of X; the number $K$ of components in the MoAL.
0:    Output: U, V, $\Theta = \{\pi_k, \lambda_k, \kappa_k\}_{k=1}^{K}$.
1:  Initialize U, V and $\Theta$;
2:  (Initial E-step): Evaluate $\gamma_{ijk}$ by Eq. (13), $\forall (i,j) \in \Omega$.
3:  while the convergence criterion is not satisfied do
4:     (M-step 1): Update $\pi_k$, $\lambda_k$ and $\kappa_k$ ($k = 1, \ldots, K$) with Eqs. (17), (18) and (20), respectively.
5:     (E-step 1): Evaluate $\gamma_{ijk}$ by Eq. (13), $\forall (i,j) \in \Omega$.
6:     (M-step 2): Update U and V by Eq. (21).
7:     (E-step 2): Evaluate $\gamma_{ijk}$ by Eq. (13), $\forall (i,j) \in \Omega$.
8:     (Tune $K$): For each pair $(i,j) \in \Omega$, compute its noise component index $c_{ij} = \arg\max_k \gamma_{ijk}$ and collect these indices in C. Remove any ALD components which are not in C. Let $K$ be the current number of ALD components.
9:  end while
Algorithm 1 Learning algorithm of AQ-LRMF

3.5 Solution of weighted $L_1$-LRMF

As stated in the last subsection, the learning of AQ-LRMF can be cast into a weighted $L_1$-LRMF problem. We now provide more details about how to solve it (i.e., how to update U and V by Eq. (21)) with the CWM method CWM .

Essentially, CWM minimizes the objective $\sum_{(i,j)\in\Omega} w_{ij} |x_{ij} - \mathbf{u}_i \mathbf{v}_j^\top|$ via solving a series of scalar minimization subproblems. To update an entry $v_{jk}$ of V, we assume that all the other parameters have been estimated. As a result, the original problem can be rewritten as an optimization problem regarding $v_{jk}$, i.e.,

$$\min_{v_{jk}} \; \sum_{i:\,(i,j)\in\Omega} w_{ij} \left| e_{ij}^{(k)} - u_{ik} v_{jk} \right| + c, \qquad (23)$$

where $e_{ij}^{(k)} = x_{ij} - \sum_{l \ne k} u_{il} v_{jl}$ and c denotes a constant term not depending on $v_{jk}$. In this way, the optimal $v_{jk}$, say $v_{jk}^*$, can be easily attained by the weighted median filter. Specifically, if we let $\tilde{e}_{ij}^{(k)} = e_{ij}^{(k)} / u_{ik}$ and $\tilde{w}_{ij} = w_{ij} |u_{ik}|$, we can reformulate Eq. (23) as

$$\min_{v_{jk}} \; \sum_{i:\,(i,j)\in\Omega} \tilde{w}_{ij} \left| \tilde{e}_{ij}^{(k)} - v_{jk} \right| + c. \qquad (24)$$

From the form of Eq. (24), it can be seen that the optimal $v_{jk}^*$ coincides with the weighted median of the sequence $\{\tilde{e}_{ij}^{(k)}\}$ under the weights $\{\tilde{w}_{ij}\}$. The entries of U can be updated by a similar procedure. In short, the optimal U and V can be obtained by employing CWM to repeatedly update their entries until the algorithm converges.
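The weighted median at the heart of CWM has a simple $O(n \log n)$ implementation: sort the points and pick the first one at which the cumulative weight reaches half the total. A sketch:

```python
import numpy as np

def weighted_median(values, weights):
    """Weighted median: a minimizer of sum_i w_i * |x_i - m| over m."""
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    csum = np.cumsum(w)
    # first sorted point whose cumulative weight reaches half the total
    idx = np.searchsorted(csum, 0.5 * csum[-1])
    return v[idx]
```

In the notation above, the scalar update would be the weighted median of the rescaled residuals under the rescaled weights; note how a single point with a dominant weight pulls the answer toward itself.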

3.6 Some details of learning AQ-LRMF

Tuning the number of components in MoAL: Too large a $K$ violates the principle of Occam's razor, while too small a $K$ leads to poor performance. In consequence, as described in step 8 of Algorithm 1, we employ an effective method to tune $K$. To begin with, we initialize $K$ to a small number. After each iteration, we compute the cluster that each entry belongs to by $c_{ij} = \arg\max_k \gamma_{ijk}$. If there is no entry belonging to cluster $k$, we remove the corresponding ALD component.
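The pruning step can be sketched as follows in NumPy; renormalizing the mixing proportions after dropping components is our assumption:

```python
import numpy as np

def prune_components(gamma, pi, lam, kappa):
    """Drop ALD components that no entry is hard-assigned to.

    gamma: (n_entries, K) responsibilities from the E-step (Eq. (13));
    pi, lam, kappa: length-K arrays of mixture parameters.
    """
    assign = gamma.argmax(axis=1)          # c_ij = argmax_k gamma_ijk
    used = np.unique(assign)               # components owning >= 1 entry
    pi_new = pi[used] / pi[used].sum()     # renormalization is an assumption
    return gamma[:, used], pi_new, lam[used], kappa[used]

gamma = np.array([[0.9, 0.1, 0.0],
                  [0.8, 0.1, 0.1],
                  [0.2, 0.7, 0.1]])
pi = np.array([0.5, 0.3, 0.2])
lam = np.array([1.0, 2.0, 3.0])
kappa = np.array([1.0, 0.5, 2.0])

# No entry is hard-assigned to component 3, so K drops from 3 to 2
gamma2, pi2, lam2, kappa2 = prune_components(gamma, pi, lam, kappa)
```

This mirrors step 8 of Algorithm 1: components that own no entries simply disappear from the mixture.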

Initialization: In Algorithm 1, the entries of U and V are initialized with random numbers sampled from the standard Gaussian distribution $N(0, 1)$ and rescaled according to the median of all entries in X. Moreover, the parameters $\lambda_k$ and $\kappa_k$ are initialized with random draws.

Convergence condition: Following the common practice of the EM algorithm, we stop the iteration if the change of the log-likelihood is smaller than a pre-defined value or the maximum iteration number is reached.

4 Experimental Studies

We carried out experiments in this section to examine the performance of the AQ-LRMF model. Several state-of-the-art methods were considered, including four robust LRMF methods (namely, MoG Meng2014Robust , CWM CWM , Damped Wiberg (DW) Okatani2011EfficientAF and RegL1ALM RegL1ALM ) and a robust PCA method (SPCP solved by a quasi-Newton method) EPFL-CONF-199542 . We wrote the programming code for CWM; for the other compared algorithms, the codes provided by the corresponding authors were used. Since SPCP cannot handle missing entries, it was excluded from the experiments which involve missing data. Notice that DW is only considered in section 4.1 because it meets the "out of memory" problem on large-scale datasets. In the meantime, we assigned the same rank to all the considered algorithms except SPCP, since it can automatically determine the rank. To make the comparison fairer, all algorithms were initialized with the same values. Each algorithm was terminated when either 100 iterative steps were reached or the change of the objective was less than a pre-defined tolerance. To simplify notation, our proposed method AQ-LRMF is denoted as AQ in later discussions. All the experiments were conducted with Matlab R2015b and run on a computer with an Intel Core CPU at 2.30 GHz, 4.00 GB RAM and the Windows 7 (64-bit) system.

The remainder of this section has the following structure. Section 4.1 studies the performance of each algorithm on synthetic data in presence of various kinds of noise as well as missing values. Sections 4.2 and 4.3 employ some inpainted and multispectral images to investigate how the compared algorithms behave on real images which contain missing values and various kinds of noise, respectively. Finally, sections 4.4 and 4.5 examine the performance of all algorithms on face modelling and hyperspectral image processing tasks.

4.1 Synthetic experiments

First, we compared the behavior of each method on synthetic data containing different kinds of noise. For each case, we randomly generated 30 low-rank matrices $X = UV^\top$, where the entries of U and V were sampled from the standard Gaussian distribution $N(0, 1)$. In particular, we set the rank $r$ to 4 and 8. Subsequently, we stochastically set 20% of the entries of X as missing data and corrupted the non-missing entries with the following three groups of noise. (i) The first group includes 4 kinds of heavy-tailed noise, i.e., Laplace noise with scale parameter $b = 1.5$ and location parameter 0, Gaussian noise, and Student's t noise with 1 and 2 degrees of freedom, respectively. (ii) The second group includes two kinds of skew noise, i.e., asymmetric Laplace noise and skew normal noise. (iii) The last group includes two kinds of mixture noise. It is worthwhile to mention that the two mixture noises simulate the noise contained in real data, where most entries are corrupted by standard Gaussian noise and the remaining entries are corrupted by heavy-tailed or skew noise. To evaluate the performance of each method, we employed the average absolute error between the recovered matrix and the ground truth.
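A sketch of this data-generating protocol (the matrix size below is illustrative; the paper's exact size is not shown here):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 40, 20, 4                        # size and rank are assumptions

# Ground-truth low-rank matrix X0 = U V^T with standard Gaussian factors
U = rng.standard_normal((m, r))
V = rng.standard_normal((n, r))
X0 = U @ V.T

# 20% of the entries are marked as missing
mask = rng.random((m, n)) >= 0.2           # True = observed

# Corrupt the entries, e.g. with Laplace noise of scale b = 1.5
X = X0 + rng.laplace(loc=0.0, scale=1.5, size=(m, n))
X[~mask] = np.nan                          # missing entries

# Average absolute reconstruction error of an estimate Xhat
def avg_error(Xhat, X0):
    return np.abs(Xhat - X0).mean()
```

Swapping the `rng.laplace` call for Student's t, asymmetric Laplace or mixture draws reproduces the other noise settings described above.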

In our experiments, all algorithms were implemented with the true rank. Tables 1 and 2 summarize the metrics averaged over the 30 randomly generated matrices. When $r = 4$, it is quite obvious that our method reaches the minimum error in all situations, while MoG and CWM almost always take the second place, and RegL1ALM and DW can hardly deal with the heavy-tailed and skew noise. Similar conclusions can be drawn from the results for $r = 8$. However, CWM evidently outperforms MoG when $r = 8$, which indicates that MoG may be unstable when the true rank of the observed data is high. In summary, our model copes very well with different kinds of noise.

Noise type | AQ | MoG | CWM | RegL1ALM | DW
Laplace Noise (b = 1.5) | 1.23 | 1.27 | 1.34 | 1.93 | 10.44
Gaussian Noise | 2.95 | 3.46 | 3.16 | 4.97 | 4.56
Student's t Noise (df = 1) | 1.47 | 2.36 | 2.88 | 27.45 | 925.17
Student's t Noise (df = 2) | 0.97 | 1.37 | 1.21 | 2.85 | 7.16
AL Noise | 1.99 | 2.58 | 2.33 | 3.99 | 3.65
SN Noise | 1.90 | 2.00 | 2.13 | 3.03 | 2.11
Mixture Noise 1 | 0.84 | 0.93 | 1.02 | 1.28 | 1.06
Mixture Noise 2 | 0.99 | 1.61 | 1.27 | 3.88 | 10.50
mean | 1.54 | 1.95 | 1.92 | 6.17 | 120.58
median | 1.35 | 1.81 | 1.73 | 3.45 | 5.86
Table 1: The performance evaluation on synthetic data with rank 4. The best and second best results are highlighted in bold and italic typeface, respectively.
Noise type | AQ | MoG | CWM | RegL1ALM | DW
Laplace Noise (b = 1.5) | 1.82 | 2.13 | 1.90 | 3.81 | 12.71
Gaussian Noise | 4.14 | 5.07 | 4.11 | 8.38 | 9.04
Student's t Noise (df = 1) | 2.54 | 3.91 | 3.60 | 22.38 | 2063.74
Student's t Noise (df = 2) | 1.63 | 2.32 | 1.81 | 4.20 | 40.82
AL Noise | 2.91 | 3.69 | 3.07 | 6.76 | 17.84
SN Noise | 2.61 | 3.14 | 2.68 | 4.97 | 4.26
Mixture Noise 1 | 1.34 | 1.55 | 1.55 | 2.59 | 3.80
Mixture Noise 2 | 1.68 | 2.32 | 1.91 | 5.29 | 22.19
mean | 2.33 | 3.02 | 2.58 | 7.30 | 271.80
median | 2.18 | 2.73 | 2.30 | 5.13 | 15.27
Table 2: The performance evaluation on synthetic data with rank 8. The best and second best results are highlighted in bold and italic typeface, respectively.

To delve into the difference between AQ and MoG, we further compared the distributions of the fitted residuals with the real noise. Specifically, two symmetric and two asymmetric cases are illustrated in Figure 3. Here, the shown distributions fitted by AQ and MoG correspond to those that reach the maximum likelihood over the 30 random experiments. It is obvious that AQ does a much better job of approximating the real noise than MoG. In particular, AQ almost provides a duplicate of the real noise. In contrast, MoG is able to fit the tails but, at the same time, results in a bad approximation of the peaks. Hence, AQ has more power in fitting complex noise than MoG.

Figure 3: The visual comparison of the PDFs for real noise and the ones fitted by AQ and MoG in the synthetic experiments.

4.2 Image inpainting experiments

Image inpainting is a typical image processing task. In real applications, some parts of an image may be deteriorated so that the corresponding information is lost. To facilitate the understanding of the image, some sophisticated technique needs to be adopted to recover the corrupted parts of the image. This is exactly the objective of image inpainting. There is evidence that many images are approximately low-rank matrices, so single-image inpainting can be done by matrix completion Hu2013Fast . In image inpainting, the corrupted pixels are viewed as missing values and then the image can be recovered by an LRMF algorithm. In this paper, three typical RGB images were employed, and each image was reshaped into a matrix. By following the common practice in image inpainting research, we artificially corrupted the given images by putting masks onto them. In doing so, it is convenient to examine how well each method restores the original images. Here, three kinds of masks were considered, namely, a random mask where 20% of the pixels were stochastically removed, and text masks with big and small fonts, respectively. The rank was set to 80 for all algorithms.

Figure 4 visualizes the original, masked and reconstructed images, and Table 3 reports the average errors of each algorithm. It is obvious that removing a random mask is the easiest task. In this situation, there is no significantly visible difference among the reconstructed images; in terms of average error, MoG performs best and AQ ranks second. In contrast, the results shown in Figure 4 and Table 3 indicate that text mask removal is more difficult, especially when the images are corrupted with big fonts. The main reason is that the text mask is spatially correlated, while it is difficult for any LRMF algorithm to effectively utilize this type of information. Under these circumstances, it can be observed in Figure 4 and Table 3 that AQ outperforms the other methods at removing the text masks with regard to both average error and visual quality. RegL1ALM and MoG perform badly, and clear text can often be seen in their reconstructed images. Although CWM produces slightly better results, its average error is still higher than that of AQ. In a word, AQ is superior to the other algorithms in our investigated image inpainting tasks. In particular, AQ achieves the smallest average error in 6 cases and the second smallest in 2 cases.

Figure 4: The original, masked and inpainting images.
Mask | Image A (AQ / MoG / CWM / RegL1ALM) | Image B (AQ / MoG / CWM / RegL1ALM) | Image C (AQ / MoG / CWM / RegL1ALM)
Small | 2.32 / 2.59 / 2.33 / 2.91 | 5.60 / 7.43 / 6.90 / 8.08 | 5.13 / 5.83 / 6.30 / 6.97
Large | 5.59 / 9.25 / 7.84 / 8.77 | 6.88 / 20.16 / 8.78 / 19.11 | 6.84 / 7.15 / 7.67 / 17.70
Random | 2.81 / 2.18 / 3.06 / 2.45 | 5.91 / 5.75 / 6.65 / 6.80 | 5.61 / 6.74 / 6.77 / 5.07
Table 3: The average errors of each method on image inpainting experiments. The best and second best results are highlighted in bold and italic typeface, respectively.

4.3 Multispectral image experiments

In this subsection, we study the behavior of all algorithms in image denoising tasks. The Columbia Multispectral Image database was employed, where every scene contains 31 bands. Eight scenes (i.e., Balloon, Paints, Flowers, Cloth, Feathers, Hairs, Pompoms and Clay) were utilized to test the effectiveness of the methods. The used images were resized by half and the pixels were rescaled to [0, 1]. Analogous to the strategy used in the image inpainting experiments, some noise was artificially added to the original images; then each LRMF algorithm was applied to remove the noise so that the corrupted images could be restored as accurately as possible. In the experiments, three different kinds of noise were considered, that is, Laplace noise, asymmetric Laplace noise and mixture noise. The rank was set to 4 for all algorithms.

Table 4 reports the average errors of each method. For Laplace noise, RegL1ALM performs best, while AQ, MoG and CWM rank second, third and fourth, respectively. The success of RegL1ALM can be attributed to the form of its objective function, namely an L_1-norm loss plus penalties on the two factor matrices. On the one hand, the L_1-norm loss is exactly matched to Laplace noise; on the other hand, there is empirical evidence that these penalties lead to better performance on image datasets. For asymmetric Laplace and mixture noise, it is not surprising that AQ outperforms all the other methods. SPCP reaches the second lowest average error, while the remaining methods perform badly. The reason may be that SPCP does not rely on a noise distribution, whereas the other approaches implicitly assume that the noise distribution is not skewed.
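The asymmetry that distinguishes AQ from the L_1-based methods comes from the quantile (check) loss rho_tau(u) = u*(tau - 1{u<0}), which underlies the asymmetric Laplace likelihood; with tau = 0.5 it reduces to half the absolute loss. A minimal sketch:

```python
import numpy as np

def check_loss(u, tau):
    """Quantile (check) loss rho_tau(u) = u * (tau - 1{u < 0}).

    Positive residuals are weighted by tau and negative ones by (1 - tau),
    so tau != 0.5 penalizes over- and under-shooting asymmetrically;
    tau = 0.5 gives 0.5 * |u|, i.e. a scaled L1 loss.
    """
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0).astype(float))
```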

Scene     Noise type           AQ      MoG     CWM     RegL1ALM  SPCP
Balloon   Laplace              0.0342  0.0348  0.0398  0.0343    0.0487
          Asymmetric Laplace   0.0804  0.1964  0.1501  0.1379    0.1280
          Mixture noise        0.1974  0.2214  0.2453  0.2422    0.2026
Flowers   Laplace              0.0362  0.0395  0.0371  0.0343    0.0437
          Asymmetric Laplace   0.0761  0.1708  0.1450  0.1374    0.1289
          Mixture noise        0.1709  0.2339  0.2393  0.2437    0.2025
Cloth     Laplace              0.0412  0.0399  0.0365  0.0323    0.0667
          Asymmetric Laplace   0.0975  0.1640  0.1424  0.1374    0.1314
          Mixture noise        0.1927  0.1898  0.2334  0.2512    0.2037
Feathers  Laplace              0.0392  0.0390  0.0417  0.0373    0.0470
          Asymmetric Laplace   0.0911  0.1526  0.1487  0.1396    0.1303
          Mixture noise        0.1946  0.2375  0.2455  0.2425    0.2046
Hairs     Laplace              0.0321  0.0380  0.0358  0.0292    0.0253
          Asymmetric Laplace   0.0681  0.1969  0.1373  0.1346    0.1201
          Mixture noise        0.1412  0.2172  0.2305  0.2387    0.1979
Pompoms   Laplace              0.0493  0.0420  0.0414  0.0379    0.0824
          Asymmetric Laplace   0.1002  0.1937  0.1508  0.1422    0.1472
          Mixture noise        0.1630  0.2221  0.2489  0.2323    0.2153
Paints    Laplace              0.0424  0.0370  0.0431  0.0354    0.0396
          Asymmetric Laplace   0.1009  0.1910  0.1432  0.1402    0.1259
          Mixture noise        0.2119  0.2311  0.2460  0.2399    0.2018
Clay      Laplace              0.0335  0.0402  0.0362  0.0344    0.0596
          Asymmetric Laplace   0.0778  0.1787  0.1385  0.1342    0.1380
          Mixture noise        0.1569  0.2169  0.2403  0.2453    0.2097
Mean      Laplace              0.0385  0.0388  0.0390  0.0344    0.0516
          Asymmetric Laplace   0.0865  0.1805  0.1445  0.1380    0.1312
          Mixture noise        0.1786  0.2212  0.2411  0.2420    0.2048
Table 4: The average errors of each method on multispectral image experiments. The best and second best results are highlighted in bold and italic typeface, respectively.

4.4 Face modeling experiments

Here, we applied the LRMF techniques to the face modeling task. The Extended Yale B database, which contains 64 images of each subject, was considered; vectorizing and stacking the images leads to one data matrix per subject. In particular, we used the face images of the third and fifth subjects. The first column of Figure 5 shows some typical faces for illustration. We set the rank to 4 for all methods except SPCP, which determines the rank automatically. The second to sixth columns of Figure 5 display the faces reconstructed by the compared LRMF algorithms.
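The construction of the per-subject data matrix can be sketched as follows: each image is vectorized into one column, and a plain rank-4 SVD serves here only as a non-robust baseline that the LRMF methods replace with their own fitting procedures (function names and dimensions are illustrative):

```python
import numpy as np

def build_face_matrix(images):
    """Stack equally sized grayscale images into a data matrix.

    Each image becomes one column, so 64 images of h x w pixels give an
    (h*w) x 64 matrix -- the input expected by the LRMF algorithms.
    """
    return np.stack([img.ravel() for img in images], axis=1)

def rank_r_approx(X, r=4):
    """Plain rank-r SVD baseline (not robust to skew or heavy-tailed noise)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :r] * s[:r] @ Vt[:r]
```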

From Figure 5, we observe that all methods are able to remove the cast shadows, saturations and camera noise, although SPCP performs somewhat worse than the other algorithms. Evidently, AQ always outperforms the other methods, producing the cleanest reconstructions. As shown in Figure 1, a face with a large dark region exhibits an asymmetric noise distribution. Because of this, MoG, CWM, RegL1ALM and SPCP, which rely on symmetric loss functions, produce poor results, while AQ, with its quantile loss function, produces the best reconstructed images.

Figure 5: The original faces and the reconstructed ones.

4.5 Hyperspectral image experiments

In this subsection, we employed two HSI datasets, Urban and Terrain, to investigate the behavior of all algorithms. Each dataset contains 210 bands, which are flattened into one data matrix per dataset. We used the same experimental settings as in subsection 4.4. DW was again unavailable in this experiment due to computational limitations. As shown in the first column of Figure 6, some of the bands are seriously polluted by the atmosphere and water absorption.

The reconstructed images of bands 106 and 207 of the Terrain data set and band 104 of the Urban data set are shown in Figure 6 (a), (c) and (e), respectively, with the corresponding residual images displayed below them. Band 106 of Terrain is seriously polluted; nevertheless, the proposed AQ method still reconstructs a clean and smooth image. Although MoG, CWM and RegL1ALM remove most of the noise, they miss part of the local information, namely the line running from the upper left corner to the bottom right (the parallelogram marked in the original image). As for SPCP, it removes only a small part of the noise. The residual images also reveal that AQ handles detailed information better. Note that band 207 of Terrain and band 104 of Urban are mainly corrupted by stripe and Gaussian-like noise. Under these circumstances, AQ still outperforms the others, which fail to remove the stripe noise: in the areas marked by rectangles and shown enlarged, the bands reconstructed by MoG, CWM, RegL1ALM and SPCP contain evident stripes, whereas those produced by AQ do not.

Figure 6: The reconstructed and residual images.

We conjecture that the main reason for the different behavior of these algorithms lies in their loss functions. The loss functions of CWM, RegL1ALM and SPCP are too simple to handle complicated noise, whereas AQ and MoG perform better because they model noise with multiple distribution components. It is therefore interesting to study the difference between AQ and MoG. For the three bands considered here, both algorithms approximate the noise with two components, denoted AQ1 and AQ2 for AQ and MoG1 and MoG2 for MoG. Figure 7 presents the de-noised and residual images produced by each component. For example, the de-noised image in the column AQ1 is obtained by removing the first AQ component from the original image, and the residual image shown below it is the part of the noise attributed to AQ1; the other images can be understood similarly. In this way, we can see the role that each component of AQ or MoG plays. For band 106 of Terrain, the first AQ component de-noises the central parts, while the second targets the left and right edges. For band 207 of Terrain, the two AQ components de-noise the bottom and the remaining parts, respectively. For band 104 of Urban, they focus on the upper right and central parts, respectively. Inspecting the results generated by MoG, however, we cannot discover any regular pattern in the roles of its two components. It can therefore be concluded that AQ captures the local structural information of real images, even though this information is not encoded into the model; the reason may be that pixels with the same skewness tend to cluster in real images. In this respect, AQ also has an advantage over MoG.
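The per-component images in Figure 7 can be derived from the posterior responsibilities of the mixture: each pixel's residual is attributed to component k in proportion to its responsibility. A hedged sketch (names hypothetical; responsibilities come from the E-step of whichever mixture model is fitted):

```python
import numpy as np

def split_residual(X, X_hat, gammas):
    """Attribute the residual X - X_hat to mixture components.

    `gammas` is a list of responsibility matrices, each the same shape as X
    and summing to 1 pixelwise.  Component k's residual image is
    gamma_k * (X - X_hat), and the de-noised image for component k is X
    minus that residual (i.e., the original with component k removed).
    """
    E = X - X_hat
    residuals = [g * E for g in gammas]
    denoised = [X - r for r in residuals]
    return residuals, denoised
```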

Figure 7: The de-noised and residual images produced by the two components of AQ (i.e., columns marked with AQ1 and AQ2) and MoG (i.e., columns marked with MoG1 and MoG2). For example, the image lies in the first row and second column is the de-noised image which is obtained by removing the first AQ component from the original image.

5 Conclusions

Aiming at enhancing the ability of existing LRMF methods to cope with the complicated noise arising in real applications, this work proposes a new low-rank matrix factorization method, AQ-LRMF, for recovering subspaces. The core idea of AQ-LRMF is to model unknown noise directly with a mixture of asymmetric Laplace distributions. We also present an efficient procedure based on the EM algorithm to estimate the parameters of AQ-LRMF. The objective function of AQ-LRMF corresponds to an adaptive quantile loss like those used in quantile regression. Compared with several state-of-the-art counterparts, AQ-LRMF consistently outperforms them in both synthetic and real data experiments. In addition, AQ-LRMF is superior at capturing local structural information in real images. AQ-LRMF can therefore be regarded as a competitive tool for complex real problems.
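The EM-based estimation can be illustrated with a deliberately simplified one-dimensional sketch: the skewness parameters are held fixed and only the mixing weights and scales are updated, whereas the full AQ-LRMF algorithm also learns the skewness and the low-rank factors. All names are illustrative:

```python
import numpy as np

def rho(u, tau):
    # quantile (check) loss rho_tau(u) = u * (tau - 1{u < 0})
    return u * (tau - (u < 0))

def ald_pdf(e, sigma, tau):
    # asymmetric Laplace density tau*(1-tau)/sigma * exp(-rho_tau(e/sigma))
    return tau * (1 - tau) / sigma * np.exp(-rho(e / sigma, tau))

def em_ald_mixture(e, taus, n_iter=50):
    """Fit a mixture of asymmetric Laplace distributions to residuals `e`.

    Simplified sketch: skewness parameters `taus` are fixed; only the
    mixing weights pi_k and scales sigma_k are updated.  The sigma update
    uses the closed-form weighted MLE sigma_k = sum_i r_ik rho_tauk(e_i)
    / sum_i r_ik.
    """
    K = len(taus)
    taus = np.asarray(taus, dtype=float)
    pi = np.full(K, 1.0 / K)
    sigma = np.full(K, np.abs(e).mean() + 1e-8)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each residual
        dens = np.stack([pi[k] * ald_pdf(e, sigma[k], taus[k]) for k in range(K)])
        r = dens / (dens.sum(axis=0, keepdims=True) + 1e-300)
        # M-step: closed-form updates for weights and scales
        pi = r.mean(axis=1)
        sigma = (r * rho(e, taus[:, None])).sum(axis=1) / r.sum(axis=1)
    return pi, sigma
```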


This research was supported by the National Natural Science Foundation of China [grant numbers 11671317 and 61572393].



  • (1) J. Ye, Generalized low rank approximations of matrices, Machine Learning 61 (1-3) (2005) 167–191.
  • (2) A. Sobral, T. Bouwmans, E.-h. Zahzah, Lrslibrary: Low-rank and sparse tools for background modeling and subtraction in videos, in: Handbook of Robust Low-Rank and Sparse Matrix Decomposition: Applications in Image and Video Processing, CRC Press, Taylor and Francis Group.
  • (3) M. Udell, C. Horn, R. Zadeh, S. Boyd, Generalized low rank models, Foundations and Trends in Machine Learning 9 (1) (2016) 1–118.
  • (4) V. Koltchinskii, K. Lounici, A. B. Tsybakov, Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion, Annals of Statistics 39 (5) (2011) 2302–2329.
  • (5) D. D. Lee, H. S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (6755) (1999) 788–791.
  • (6) H. Wu, Z. Zhang, K. Yue, B. Zhang, J. He, L. Sun, Dual-regularized matrix factorization with deep neural networks for recommender systems, Knowledge-Based Systems 145 (2018) 46–58.
  • (7) E. M. Airoldi, D. M. Blei, S. E. Fienberg, E. P. Xing, Mixed membership stochastic blockmodels, Journal of Machine Learning Research 9 (2008) 1981–2014.
  • (8) P. Gopalan, D. M. Blei, Efficient discovery of overlapping communities in massive networks, Proceedings of the National Academy of Sciences of the United States of America 110 (36) (2013) 14534–14539.
  • (9) W. Wang, Y. Feng, P. Jiao, W. Yu, Kernel framework based on non-negative matrix factorization for networks reconstruction and link prediction, Knowledge-Based Systems 137 (2017) 104–114.
  • (10) J. Wright, Y. Peng, Y. Ma, A. Ganesh, S. Rao, Robust principal component analysis: exact recovery of corrupted low-rank matrices by convex optimization, in: International Conference on Neural Information Processing Systems (NIPS), 2009, pp. 2080–2088.
  • (11) X. Fei, Y. Chen, P. Chong, Y. Wang, X. Liu, G. He, Denoising of hyperspectral image using low-rank matrix factorization, IEEE Geoscience & Remote Sensing Letters 14 (7) (2017) 1141–1145.
  • (12) Q. Ke, T. Kanade, Robust L_1 norm factorization in the presence of outliers and missing data by alternative convex programming, in: 2005 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 739–746.
  • (13) D. Meng, F. D. L. Torre, Robust matrix factorization with unknown noise, in: 2014 IEEE International Conference on Computer Vision (ICCV), 2014, pp. 1337–1344.
  • (14) V. Maz'ya, G. Schmidt, On approximate approximations using Gaussian kernels, IMA Journal of Numerical Analysis 16 (1) (1996) 13–29.
  • (15) H. Kozumi, G. Kobayashi, Gibbs sampling methods for Bayesian quantile regression, Journal of Statistical Computation & Simulation 81 (11) (2011) 1565–1578.
  • (16) K. Yu, J. Zhang, A three-parameter asymmetric Laplace distribution and its extension, Communications in Statistics - Theory and Methods 34 (9-10) (2005) 1867–1879.
  • (17) N. Srebro, T. Jaakkola, Weighted low-rank approximations, in: Proceedings of the 20th International Conference on Machine Learning (ICML), 2003, pp. 720–727.
  • (18) A. Eriksson, A. van den Hengel, Efficient computation of robust low-rank matrix approximations in the presence of missing data using the L_1 norm, in: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 771–778.
  • (19) E. Kim, M. Lee, C. H. Choi, N. Kwak, S. Oh, Efficient L_1-norm-based low-rank matrix approximations for large-scale problems using alternating rectified gradient method, IEEE Transactions on Neural Networks and Learning Systems 26 (2) (2015) 237–251.
  • (20) M. Okutomi, S. Yan, S. Sugimoto, G. Liu, Y. Zheng, Practical low-rank matrix approximation under robust L_1-norm, IEEE Computer Society, Los Alamitos, CA, USA, 2012, pp. 1410–1417.
  • (21) D. Meng, Z. Xu, L. Zhang, J. Zhao, A cyclic weighted median method for L_1 low-rank matrix factorization with missing entries, in: Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, 2013, pp. 704–710.
  • (22) Z. Lin, C. Xu, H. Zha, Robust matrix factorization by majorization minimization, IEEE Transactions on Pattern Analysis & Machine Intelligence 40 (1) (2018) 208–220.
  • (23) S. Li, J. Zhang, X. Guo, Efficient low rank matrix approximation via orthogonality pursuit and regularization, in: IEEE International Conference on Multimedia and Expo, 2017, pp. 871–876.
  • (24) B. Lakshminarayanan, G. Bouchard, C. Archambeau, Robust Bayesian matrix factorisation, Journal of Machine Learning Research 15 (2011) 425–433.
  • (25) N. Wang, T. Yao, J. Wang, D.-Y. Yeung, A Probabilistic Approach to Robust Matrix Factorization, Springer Berlin Heidelberg, Berlin, Heidelberg, 2012, pp. 126–139.
  • (26) X. Cao, Y. Chen, Q. Zhao, D. Meng, Low-rank matrix factorization under general mixture noise distributions, in: IEEE Transactions on Image Processing, Vol. 25, 2016, pp. 4677–4690.
  • (27) E. Candès, X. Li, Y. Ma, J. Wright, Robust principal component analysis?, Journal of the ACM 58 (3) (2011) 11.
  • (28) Z. Zhou, X. Li, J. Wright, E. Candès, Stable principal component pursuit, in: IEEE International Symposium on Information Theory Proceedings, Austin, TX, USA, 2010, pp. 1518–1522.
  • (29) A. Azzalini, A. D. Valle, The multivariate skew-normal distribution, Biometrika 83 (4) (1996) 715–726.
  • (30) A. P. Dempster, N. M. Laird, D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society. Series B (Methodological) 39 (1) (1977) 1–38.
  • (31) T. Okatani, T. Yoshida, K. Deguchi, Efficient algorithm for low-rank matrix factorization with missing components and performance comparison of latest algorithms, 2011, pp. 842–849.
  • (32) A. Aravkin, S. Becker, V. Cevher, P. Olsen, A variational approach to stable principal component pursuit, in: 2014 Conference on Uncertainty in Artificial Intelligence (UAI), 2014.
  • (33) Y. Hu, D. Zhang, J. Ye, X. Li, X. He, Fast and accurate matrix completion via truncated nuclear norm regularization, IEEE Transactions on Pattern Analysis & Machine Intelligence 35 (9) (2013) 2117–2130.