1 Introduction
Supervised learning, and in particular deep learning [1, 2], has been very successful in computer vision. Applications include autoencoders [3] that map between noisy and clean images [4], convolutional networks for image/video analysis [5], and generative adversarial networks that synthesize realistic images [6]. In contrast, unsupervised learning still poses significant challenges. Broadly, unsupervised learning seeks to discover hidden structure in the data without using ground-truth labels, thereby revealing features of interest.
In this paper, we consider unsupervised representation learning methods which can be used along with centroid-based clustering to summarize the data distribution using a few characteristic samples.
We are interested in spectral clustering [7] and subspace clustering [8]; the proposed ideas can also be generalized to deep embedding-based clustering strategies [9]. Spectral clustering methods use neighborhood graphs to learn the underlying representation [7]; this approach is used for image segmentation [10, 11] and 3D mesh segmentation [12]. Subspace clustering methods model the dataset as a union of low-dimensional linear subspaces and use sparse and low-rank methods to obtain the representation; this model is used for facial clustering and recognition [8, 13].
Learning effective latent representations hinges on accurately modeling noise and outliers. Further, in practice, the data satisfy the structural assumptions (union of subspaces, low rank, etc.) only approximately. Adopting robust optimization strategies is a natural way to combat these challenges. For example, consider principal component analysis (PCA), a prototypical representation learning method based on matrix factorization. Given low-rank data contaminated by outliers, classical PCA fails to recover the low-rank structure. Consequently, the robust PCA (rPCA) method [14], which decomposes data into low-rank and sparse components, is preferred in practice, e.g. for background/foreground separation [14, 15]. Similarly, when data assumed to come from a union of subspaces is contaminated by outliers, allowing for sparse outliers during optimization leads to accurate recovery of the subspaces, e.g. in face classification [16]. Our goal is to develop effective robust formulations for unsupervised representation learning tasks in computer vision; we are interested in complex situations, where the data is corrupted by a combination of sparse outliers and dense noise.
Contributions. We first review the relationship between outlier models and statistically robust formulations. In particular, we show that the rPCA formulation is equivalent to solving a Huber regression problem for low-rank representation learning. Using this connection, we develop a new nonconvex penalty, dubbed the Tiber, designed to aggressively penalize mid-sized residuals. In Section 2, we show that this penalty is well suited for dynamic background separation, outperforming classic rPCA methods.
Our second contribution is to use the design philosophy behind robust low-rank representation learning to develop a new formulation for robust clustering. We formulate classic spectral analysis as an optimization problem, and then modify this problem to be robust to outliers. The advantages are shown using a synthetic clustering example. We then combine robust spectral clustering with robust subspace clustering to achieve superior performance on face recognition tasks, surpassing prior work without any data preprocessing; see Section 3, Table 1.
2 New Penalties for Learning Robust Representations
Many tasks in computer vision depend on unsupervised representation learning. A well-known example is background/foreground separation, often solved by robust principal component analysis (rPCA). rPCA learns low-rank representations by decomposing a data matrix into a sum of low-rank and sparse components. The low-rank component represents the background and the sparse component represents the foreground [14].
In this section, we show that rPCA is equivalent to a robust regression problem: solving a Huber-robust regression [17] for the background representation recovers the full rPCA solution. We use this equivalence to design a new robust penalty (dubbed the Tiber) based on statistical descriptions of the signals of interest. We illustrate the benefits of this new nonconvex penalty for separating foreground from a dynamic background, using real datasets.
2.1 Huber in rPCA
Background/foreground separation is widely used for detecting moving objects in videos from stationary cameras. A broad range of techniques have been developed to tackle this task, ranging from simple thresholding [18] to mixtures of Gaussian models [19, 20, 21]. In particular, rPCA has been widely adopted to solve this problem [22, 23].
Denote a given video stream by $Y \in \mathbb{R}^{m \times n}$, where each of the $n$ frames is reshaped to be a vector of size $m$. There are many variants of rPCA [24]. We use the stable principal component pursuit (SPCP) formulation:
$$\min_{L, S} \; \tfrac{1}{2}\|Y - L - S\|_F^2 + \kappa \|S\|_1 + \lambda \|L\|_* \tag{1}$$
where $L$ represents the background, and $S$ the foreground. The regularizers used by this formulation ensure that $L$ is chosen to be low rank, while $S$ is designed to be sparse; the quadratic penalty on the residual fits the data up to some error level.
We can minimize over the variables in any order. Minimizing the first two summands of (1) in $S$ gives a closed-form value function, the well-known Huber penalty [17]:
$$\rho(r; \kappa) = \begin{cases} \tfrac{1}{2} r^2 & |r| \le \kappa \\ \kappa |r| - \tfrac{1}{2}\kappa^2 & |r| > \kappa \end{cases} \tag{2}$$
We provide a simple statement of the following well-known result with a short self-contained proof.
Claim 1.
$$\min_s \; \tfrac{1}{2}(r - s)^2 + \kappa |s| = \rho(r; \kappa), \qquad \arg\min_s = \operatorname{sign}(r)\max(|r| - \kappa, 0). \tag{3}$$
Proof.
For $|r| \le \kappa$, the minimizer is $s = 0$, with value $\tfrac{1}{2}r^2$; for $|r| > \kappa$, the optimality condition $s - r + \kappa \operatorname{sign}(s) = 0$ gives $s = r - \kappa \operatorname{sign}(r)$, with value $\kappa|r| - \tfrac{1}{2}\kappa^2$. ∎
The optimization problem is separable, so the result immediately extends to the matrix case. Upon minimization over $S$, problem (1) then reduces to
$$\min_L \; \rho(Y - L; \kappa) + \lambda \|L\|_*, \tag{4}$$
where $\rho$ is applied elementwise and summed.
To simplify the problem further, we use a factorized representation $L = UV^T$ [26], choosing the rank to be $k$, to obtain the nonconvex formulation
$$\min_{U, V} \; \rho(Y - UV^T; \kappa) + \tfrac{\lambda}{2}\left(\|U\|_F^2 + \|V\|_F^2\right), \tag{5}$$
where $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$.
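Claim 1 is easy to check numerically. The sketch below (a minimal illustration, not the paper's code) evaluates the Huber penalty (2) both in closed form and as the value function of the partial minimization in Claim 1, using the soft-thresholding formula for the minimizer:

```python
import numpy as np

def huber(r, kappa):
    """Closed-form Huber penalty (2), applied elementwise."""
    a = np.abs(r)
    return np.where(a <= kappa, 0.5 * r**2, kappa * a - 0.5 * kappa**2)

def huber_partial_min(r, kappa):
    """Value function of Claim 1: min_s 0.5*(r - s)^2 + kappa*|s|,
    whose minimizer is the soft-thresholding of r."""
    s = np.sign(r) * np.maximum(np.abs(r) - kappa, 0.0)
    return 0.5 * (r - s)**2 + kappa * np.abs(s)

r = np.linspace(-3.0, 3.0, 101)
print(np.allclose(huber(r, 0.5), huber_partial_min(r, 0.5)))  # True
```

The two computations agree on both branches, which is exactly the content of (3).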
Comparing (5) to (1), we see two advantages:
1. The dimension of the decision variable has been reduced from $2mn$ to $k(m + n)$.
2. (5) is smooth, and does not require computing SVDs.
Once we have $U$ and $V$, we can easily recover $L$ and $S$: $L = UV^T$, and, by Claim 1, $S = \operatorname{sign}(Y - L)\max(|Y - L| - \kappa, 0)$ (elementwise soft-thresholding of the residual).
The approach is illustrated in the left panels of Figure 2. Although the residual $Y - L$ (shown in row 2) is noisy and not sparse, applying the soft-thresholding operator from Claim 1 gives the sparse component $S$ (row 3), just as we would obtain by solving the original formulation (1).
From a statistical perspective, the equivalence of rPCA and Huber means that the residual $Y - L$, which contains both $S$ and random noise, can be modeled by a heavy-tailed error distribution.
Claim 2.
Suppose $r_1, \dots, r_n$ are i.i.d. samples from a distribution with density $p(r) = c\, e^{-\rho(r)}$, where $c$ is the normalization constant. Then the maximum likelihood formulation for $r$ is equivalent to the minimization problem $\min \sum_{i=1}^n \rho(r_i)$.
The claim follows immediately by taking the negative log of the maximum likelihood. Claim 2 means that solving (5) is equivalent to assuming that the elements of the residual $Y - UV^T$ are i.i.d. samples from the density $p(r) \propto e^{-\rho(r;\kappa)}$, whose tails match those of the Laplace density.
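Concretely, taking the negative log of the likelihood in Claim 2 gives

```latex
-\log \prod_{i=1}^n c\, e^{-\rho(r_i)}
  \;=\; -n \log c \;+\; \sum_{i=1}^n \rho(r_i),
```

and since the normalization constant $c$ does not depend on the residuals, maximizing the likelihood is the same as minimizing $\sum_{i=1}^n \rho(r_i)$.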
The function $\rho$ has linear tails (see Figure 1), which means this distribution is much more likely to produce large samples compared to the Gaussian.
2.2 Weaknesses of the Huber
Although the Huber distribution can detect sparse outliers, it does not model small errors well. In many background/foreground separation problems, we must cope with a dynamic background (e.g. motion of tree leaves or water waves). These small dynamic background perturbations correspond to motion we do not care about; we are much more interested in detecting cars, people, and animals moving through the scene.
We want to move these dynamics into the low-rank background term. However, the Huber is quadratic near the origin (i.e. nearly flat), so small perturbations do not significantly affect the objective value, and solving (5) leaves these terms in the residual. Thresholding these terms is either too aggressive (removing features we care about) or too lenient (leaving the dynamics in the foreground; see the first two columns of Figure 2). A better penalty would rise steeply for small values of $r$, without significantly affecting tail behavior.
2.3 Tiber for rPCA
We propose a new penalty, which we call the Tiber. While the Huber is obtained by partially minimizing the sum of the 1-norm with a quadratic (2), the Tiber replaces the quadratic with a nonconvex function. The resulting penalty can match the tail behavior of the Huber, yet have different properties around the origin (see Figure 1). The Tiber is better suited for background/foreground separation problems with a dynamic background. We define the penalty as follows:
(6)  
The Tiber is parametrized by a thresholding parameter and a scale parameter. Like the Huber, it can be expressed as the value function of a minimization problem: we replace the quadratic penalty in Claim 1 by a smooth nonconvex penalty. For simplicity, we fix the scale parameter in the result below.
Claim 3.
(7) 
Proof.
Denote the objective function in (7) by . It is easy to check that is quasiconvex in when . We look to local optimality conditions to understand the structure of the minimizers.

Suppose . Then means
this requires .

Suppose . Then means
this requires .

otherwise .
Therefore . Plugging this into (7), we have
∎
In Figure 1, we see that the Tiber rises steeply near the origin. This behavior discourages dynamic terms (leaves, waves) in the sparse residual, forcing them to be fit by the low-rank background term. The new Tiber-robust rPCA problem replaces the Huber in (5) with the Tiber $\rho_T$ of (6), applied elementwise and summed:
$$\min_{U, V} \; \rho_T(Y - UV^T) + \tfrac{\lambda}{2}\left(\|U\|_F^2 + \|V\|_F^2\right) \tag{8}$$
which also has all of the advantages of (5). Moreover, because of the characterization from Claim 3, once we solve (8), we immediately recover $L$ and $S$:
2.4 Experiment: Foreground Separation
We use a publicly available dataset (downloaded from http://viswww.cs.umass.edu/~narayana/castanza/I2Rdataset/) with a dynamic background (moving trees). We sample 102 frames from this dataset, convert them to grayscale, and reshape them into the columns of the matrix $Y$. We compare formulations (5) and (8). The proximal alternating linearized minimization (PALM) algorithm [27] was used to solve all of the optimization problems.
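As an illustration of the alternating scheme, the following sketch applies PALM-style proximal-gradient steps to the smooth formulation (5); it is a minimal reimplementation under assumed parameter values, not the code used for the experiments. The gradient of the Huber is the clipped residual, and the step sizes come from simple Lipschitz bounds:

```python
import numpy as np

def huber_grad(r, kappa):
    """Gradient of the Huber penalty: the residual clipped to [-kappa, kappa]."""
    return np.clip(r, -kappa, kappa)

def palm_huber_rpca(Y, rank, kappa=0.5, lam=0.1, iters=500, seed=0):
    """PALM-style alternating gradient steps for
    min_{U,V} rho(Y - U V^T; kappa) + (lam/2)(||U||_F^2 + ||V||_F^2)."""
    rng = np.random.default_rng(seed)
    m, n = Y.shape
    U = 0.01 * rng.standard_normal((m, rank))
    V = 0.01 * rng.standard_normal((n, rank))
    for _ in range(iters):
        G = huber_grad(Y - U @ V.T, kappa)
        tU = 1.0 / (np.linalg.norm(V, 2) ** 2 + lam)   # Lipschitz-based step
        U = U - tU * (-G @ V + lam * U)
        G = huber_grad(Y - U @ V.T, kappa)
        tV = 1.0 / (np.linalg.norm(U, 2) ** 2 + lam)
        V = V - tV * (-G.T @ U + lam * V)
    L = U @ V.T
    # sparse component via the soft-thresholding formula of Claim 1
    S = np.sign(Y - L) * np.maximum(np.abs(Y - L) - kappa, 0.0)
    return L, S
```

Because the objective in $U$ (for fixed $V$, and vice versa) is smooth with the indicated Lipschitz constants, each step decreases the objective, matching the PALM template of [27].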
The rank of $UV^T$ was fixed to the same value for all experiments. We manually tuned parameters to achieve the best possible recovery in each formulation. For the Huber, we selected two nearby threshold values; for the Tiber, we selected scale parameters resulting in a comparable threshold parameter.
The results are shown in Figure 2. The task is identifying the van while avoiding interference from moving leaves. The Huber is unable to separate the van from the leaves for any threshold value. When the threshold is too conservative (left panel in Figure 2), we cut out too much information, giving an incomplete van in $S$. A less conservative choice (middle panel in Figure 2) leaves too much dynamic noise in $S$, which obscures the van.
The Tiber penalty obtains a cleaner picture of the moving vehicle (right panel in Figure 2). As expected, it forces more of the dynamic background to be fit by the low-rank term, leaving a fairly complete van in $S$ without too much contamination.
3 Robust Representation Learning for Clustering
Centroid-based clustering, e.g. k-means, is a standard tool to partition and summarize datasets. Given the high dimensionality and complexity of data in computer vision applications, it is necessary to learn latent representations, such as the underlying metric, prior to clustering. Clustering is then performed in the latent space.
We develop an approach for robust spectral clustering. We illustrate the advantages using a synthetic dataset, and then combine the approach with robust subspace clustering to achieve perfect performance on face recognition tasks.
3.1 Spectral Clustering
[Figure: synthetic data clustering. Top: result from eigenvalue decomposition; bottom: result from (10).]
Spectral clustering [7] is formulated as follows. Given data points $y_1, \dots, y_n$, we arrange them in a matrix $Y$. To partition the data into $k$ groups, spectral clustering uses the following steps:

1. Given a dataset of $n$ samples, construct the similarity matrix $A$ of the data points.
2. Compute the matrix $U$ whose columns are the $k$ leading eigenvectors of $A$.
3. Project each row of $U$ onto the unit ball, and apply distance-based clustering.
Finding a meaningful similarity matrix $A$ is crucial to the success of spectral clustering. Ideally, $A$ will be a block-diagonal matrix with $k$ blocks. This rarely happens in real applications; even when underlying structure in $A$ is present, it can be obscured by noise and a small number of points that don't follow the general pattern.
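A minimal numpy sketch of the steps above (the Gaussian similarity and the deterministic farthest-point k-means seeding are assumed choices, not specified in the text):

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Plain k-means with deterministic farthest-point seeding."""
    centers = [X[0]]
    for _ in range(1, k):  # pick the point farthest from the chosen centers
        d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def spectral_clustering(Y, k, sigma=1.0):
    """Similarity matrix, k leading eigenvectors, row-normalized k-means.
    Rows of Y are the data points."""
    sq = ((Y[:, None] - Y[None]) ** 2).sum(-1)
    A = np.exp(-sq / (2.0 * sigma**2))     # similarity matrix
    _, V = np.linalg.eigh(A)               # eigenvalues in ascending order
    U = V[:, -k:]                          # k leading eigenvectors
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return kmeans(U, k)
```

For well-separated groups, the rows of the normalized $U$ collapse to nearly one point per cluster, which is why the final distance-based clustering step is easy.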
To find a factorization of a noisy $A$, we need a robust method for eigenvalue decomposition. We first formulate eigenvalue decomposition as an optimization problem.
Claim 4.
Assume $A$ is a symmetric $n \times n$ matrix with eigenvalues less than or equal to 1. Then the solution to the problem
$$\min_U \; \tfrac{1}{2}\|A - UU^T\|_F^2 \tag{9}$$
$$\text{s.t.} \quad U^T U = I_k$$
is $U = [v_1, \dots, v_k]$, with $v_i$ the eigenvector corresponding to the $i$-th largest eigenvalue of $A$, and $I_k$ the $k$-by-$k$ identity matrix.
Proof.
Since $A$ is a symmetric matrix, it has an eigenvalue decomposition $A = V \Lambda V^T$, where $V$ is orthogonal and $\Lambda$ is diagonal, with $\lambda_1 \ge \dots \ge \lambda_n$. Similarly, we have $UU^T = \hat{V} \hat{\Lambda} \hat{V}^T$, where $\hat{V}$ is an orthogonal matrix whose first $k$ columns agree with those of $U$, and $\hat{\Lambda}$ is a diagonal matrix whose first $k$ diagonal elements are 1 and the rest are 0. From the Cauchy–Schwarz inequality, we have
$$\langle A, UU^T \rangle \le \sum_{i=1}^k \lambda_i,$$
where equality holds when $A$ and $UU^T$ share the same singular vectors, i.e., $U$ equals the first $k$ columns of $V$. Therefore
$$\|A - UU^T\|_F^2 = \|A\|_F^2 - 2\langle A, UU^T \rangle + k \ge \|A\|_F^2 - 2\sum_{i=1}^k \lambda_i + k,$$
with equality holding when the columns of $U$ are eigenvectors corresponding to the $k$ largest eigenvalues. ∎
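Claim 4 can be checked numerically: for a symmetric matrix scaled so its eigenvalues lie in $[-1, 1]$, no random orthonormal $U$ does better than the matrix of leading eigenvectors (a sanity-check sketch, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3
B = rng.standard_normal((n, n))
A = (B + B.T) / 2.0
A /= np.abs(np.linalg.eigvalsh(A)).max()   # eigenvalues now in [-1, 1]

w, V = np.linalg.eigh(A)                    # eigenvalues in ascending order
U_star = V[:, -k:]                          # k leading eigenvectors (Claim 4)
best = np.linalg.norm(A - U_star @ U_star.T, "fro")

for _ in range(200):                        # random orthonormal competitors
    Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
    assert np.linalg.norm(A - Q @ Q.T, "fro") >= best - 1e-9
```

The loop never fires the assertion: every feasible competitor attains at least the objective value of the eigenvector solution.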
We robustify (9) by replacing the Frobenius norm in the optimization formulation by the Huber function (or another robust penalty):
$$\min_U \; \rho(A - UU^T; \kappa) \quad \text{s.t.} \quad U^T U = I_k \tag{10}$$
This approach can be very effective. Consider the following clustering experiment. We generate five clusters (sampling from four 2D Gaussians and one rectangular uniform distribution) with 100 points per group. To make the problem challenging, we move the clusters so close together that telling them apart with the naked eye is hard (Figure 3, top). True clusters appear in Figure 3, bottom.
3.2 Subspace Clustering
Subspace clustering looks for a low-dimensional representation of high-dimensional data by grouping the points along low-dimensional subspaces. Given a data matrix $Y$ as in Section 3.1, the optimization for subspace clustering is given by [8]:
$$\min_C \; \|C\|_1 \quad \text{s.t.} \quad Y = YC, \; \operatorname{diag}(C) = 0 \tag{11}$$
This formulation looks for a sparse representation of the dataset by its own members: $Y = YC$. To avoid the trivial solution $C = I$, we require the diagonal of $C$ to be identically 0. After obtaining $C$, it is post-processed and a similarity matrix is constructed as $W = |C| + |C|^T$. $W$ will ideally be close to block-diagonal, where each block represents a subspace, and spectral clustering is performed on it to identify cluster memberships.
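The sketch below solves a lasso-style relaxation of (11) by proximal gradient descent (ISTA), with the zero-diagonal constraint enforced by projection; the relaxation and the parameter values are illustrative assumptions, since the equality-constrained problem (11) is usually handled by dedicated convex solvers:

```python
import numpy as np

def sparse_self_representation(Y, lam=0.1, iters=300):
    """Approximately solve min_C 0.5||Y - YC||_F^2 + lam||C||_1, diag(C) = 0,
    a noise-tolerant relaxation of (11). Columns of Y are the data points."""
    n = Y.shape[1]
    C = np.zeros((n, n))
    step = 1.0 / (np.linalg.norm(Y, 2) ** 2)          # 1 / Lipschitz constant
    for _ in range(iters):
        C = C - step * (Y.T @ (Y @ C - Y))            # gradient step
        C = np.sign(C) * np.maximum(np.abs(C) - step * lam, 0.0)  # soft-threshold
        np.fill_diagonal(C, 0.0)                      # avoid the trivial solution
    return C

def similarity(C):
    """Symmetric similarity matrix for spectral clustering."""
    return np.abs(C) + np.abs(C).T
```

For data drawn from a union of subspaces, the resulting similarity concentrates within groups: each point is represented mainly by points from its own subspace.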
3.3 Face Clustering
Given multiple face images taken under different conditions, the goal of face clustering [8] is to identify images that belong to the same person.
We use images from the publicly available Extended Yale B dataset [28] (downloaded from http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html). Each image is reshaped into a pixel vector, and there are 2414 images in the dataset. These images belong to 38 people, with approximately 64 pictures per person.
Under the Lambertian assumption, pictures obtained from one person under different illuminations should lie close to a 9-dimensional subspace [29]. In practice, these subspaces are hard to detect because of noise in the images, and a robust approach is required.
Robust subspace clustering for face images:
1. Obtain the sparse representation $C$ using (13).
2. Construct the similarity matrix from $C$:
(a) Normalize the columns of $C$ to have maximum absolute value no larger than 1.
(b) Form $W = |C| + |C|^T$.
(c) Normalize $W$: $A = D^{-1/2} W D^{-1/2}$, where $D$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$.
3. Apply spectral clustering using $A$:
(a) Apply robust symmetric factorization (10) to $A$ to obtain the latent representation $U$.
(b) Project each row of $U$ onto the unit 2-norm ball.
(c) Apply the k-means algorithm to the new rows of $U$.
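The similarity-matrix construction above can be sketched as follows; the $D^{-1/2} W D^{-1/2}$ scaling (the standard normalized-graph convention) is an assumption where the text leaves the normalization unspecified, and it conveniently guarantees the eigenvalue bound assumed in Claim 4:

```python
import numpy as np

def normalized_similarity(C):
    """Build the normalized similarity matrix A from the sparse representation C."""
    # scale each column of C to have maximum absolute value 1
    C = C / np.maximum(np.abs(C).max(axis=0, keepdims=True), 1e-12)
    W = np.abs(C) + np.abs(C).T                       # symmetrize
    d = np.maximum(W.sum(axis=1), 1e-12)              # degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ W @ D_inv_sqrt                # degree normalization
```

Because the result is symmetric with eigenvalues in $[-1, 1]$, it can be passed directly to the robust factorization (10).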

The results are shown in Table 1. We implement the approach for different numbers of subjects. We show the parameters in (13) used to achieve the accuracies given in Table 1. (In [8], the images used are of a different size; the comparison numbers are therefore indicative.)
clusters | in (13) | in (13) | error | error in [8]
 | 0.5 | 1 | 0.00% | 1.86%
 | 0.1 | 0.7 | 0.00% | 3.10%
 | 0.05 | 0.7 | 0.00% | 4.31%
 | 0.03 | 0.5 | 2.73% | 5.85%
To get better intuition for the method, we plot the similarity matrix corresponding to three subjects in Figure 6. We can clearly see three blocks along the diagonal that correspond to the three face clusters.
The resulting projected rows of $U$, obtained from the eigenvalue decomposition of the similarity matrix, are shown in Figure 7. The three clusters are clearly well separated.
The final algorithm has perfect accuracy in this example.
4 Discussion
Robust approaches are essential for unsupervised learning, and can be designed using optimization formulations. For example, in both rPCA and robust spectral learning, the SVD and eigenvalue decomposition are first characterized using optimization, then reformulated with robust losses.
Several tasks in this approach are difficult. First, there is a need to tune parameters in the optimization formulations; for example, the Tiber depends on two parameters. Automatic ways to tune these parameters would make robust unsupervised learning far more portable. Second, the optimization problems we have to solve are large-scale; the time required for robust subspace clustering scales nonlinearly with both the number and size of images. Designing nonsmooth stochastic algorithms that take the structure of these problems into account is essential.
References
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[2] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015.
[3] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.
[4] J. Xie, L. Xu, and E. Chen, “Image denoising and inpainting with deep neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 341–349.
[5] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
[6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[7] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” in Advances in Neural Information Processing Systems, 2002, pp. 849–856.
[8] E. Elhamifar and R. Vidal, “Sparse subspace clustering: Algorithm, theory, and applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2765–2781, 2013.
[9] J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embedding for clustering analysis,” in International Conference on Machine Learning, 2016, pp. 478–487.
[10] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.
[11] L. Zelnik-Manor and P. Perona, “Self-tuning spectral clustering,” in Advances in Neural Information Processing Systems, 2005, pp. 1601–1608.
[12] R. Liu and H. Zhang, “Segmentation of 3D meshes through spectral clustering,” in Proceedings of the 12th Pacific Conference on Computer Graphics and Applications. IEEE, 2004, pp. 298–305.
[13] G. Shakhnarovich and B. Moghaddam, “Face recognition in subspaces,” in Handbook of Face Recognition. Springer, 2011, pp. 19–49.
[14] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal of the ACM, vol. 58, no. 3, p. 11, 2011.
[15] A. Sobral, T. Bouwmans, and E.-h. Zahzah, “LRSLibrary: Low-rank and sparse tools for background modeling and subtraction in videos,” in Robust Low-Rank and Sparse Matrix Decomposition: Applications in Image and Video Processing. CRC Press, Taylor and Francis Group, 2016.
[16] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009.
[17] P. J. Huber, “Robust statistics,” in International Encyclopedia of Statistical Science. Springer, 2011, pp. 1248–1251.
[18] T. Veit, F. Cao, and P. Bouthemy, “A maximality principle applied to a contrario motion detection,” in IEEE International Conference on Image Processing, vol. 1. IEEE, 2005, pp. I-1061.
[19] C. Stauffer and W. E. L. Grimson, “Adaptive background mixture models for real-time tracking,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2. IEEE, 1999, pp. 246–252.
[20] R. H. Evangelio, M. Pätzold, and T. Sikora, “Splitting Gaussians in mixture models,” in IEEE Ninth International Conference on Advanced Video and Signal-Based Surveillance (AVSS). IEEE, 2012, pp. 300–305.
[21] T. S. Haines and T. Xiang, “Background subtraction with Dirichlet processes,” in European Conference on Computer Vision. Springer, 2012, pp. 99–113.
[22] C. Guyon, T. Bouwmans, and E.-h. Zahzah, “Robust principal component analysis for background subtraction: Systematic evaluation and comparative analysis,” in Principal Component Analysis. InTech, 2012.
[23] R. Otazo, E. Candès, and D. K. Sodickson, “Low-rank plus sparse matrix decomposition for accelerated dynamic MRI with separation of background and dynamic components,” Magnetic Resonance in Medicine, vol. 73, no. 3, pp. 1125–1136, 2015.
[24] A. Aravkin and S. Becker, “Dual smoothing and value function techniques for variational matrix decomposition,” in Handbook of Robust Low-Rank and Sparse Matrix Decomposition: Applications in Image and Video Processing, T. Bouwmans, N. S. Aybat, and E.-h. Zahzah, Eds. CRC Press, 2016, ch. 3.
[25] P. L. Combettes and J.-C. Pesquet, “Proximal splitting methods in signal processing,” in Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer, 2011, pp. 185–212.
[26] S. Burer and R. D. Monteiro, “A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization,” Mathematical Programming, vol. 95, no. 2, pp. 329–357, 2003.
[27] J. Bolte, S. Sabach, and M. Teboulle, “Proximal alternating linearized minimization for nonconvex and nonsmooth problems,” Mathematical Programming, vol. 146, no. 1–2, pp. 459–494, 2014.
[28] K.-C. Lee, J. Ho, and D. J. Kriegman, “Acquiring linear subspaces for face recognition under variable lighting,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 684–698, 2005.
[29] R. Basri and D. W. Jacobs, “Lambertian reflectance and linear subspaces,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 2, pp. 218–233, 2003.