1 Introduction
Sparse coding has been successfully applied to numerous computer vision tasks, including face recognition [1], scene categorization [2] and object detection [3]. The application of the sparse representation-based classifier (SRC) to face recognition [1] demonstrates a startling robustness to noise and occlusion: test subjects remain recognizable even when wearing sunglasses or a scarf. However, SRC has been found to be highly sensitive to misalignment of the image dataset: a small amount of image distortion due to translation, rotation, scaling or 3D pose variation can lead to a significant degradation in classification performance [4].

One straightforward way to solve the misalignment problem is to register the test image with the dictionary atoms before sparse recovery. Assuming the dictionary atoms are registered, Wagner et al. [4] parameterize the misalignment of the test image with an affine transformation. These parameters are optimized using generalized Gauss-Newton methods after linearizing the affine transformation constraints. By minimizing the sparse registration error iteratively and sequentially for each class, their framework is able to deal with a large range of variations in translation, scaling, rotation and even 3D pose. Due to the adoption of holistic features, sparse coding is more robust and less likely to overfit.
In the case of local feature-based sparse coding, a max pooling strategy [5] is often employed over the neighboring coefficients to achieve local translation invariance. Based on the spatial pyramid matching framework, Yang et al. [2] proposed a local sparse coding model with local SIFT features followed by multi-scale max pooling. The results on several large-variance datasets achieved a level of performance that can hardly be reached by simply applying holistic sparse coding. To improve the discriminability of the sparse codes, their dictionary was trained with supervised learning via back-propagation
[6]. The classification performance of local feature-based sparse coding has also been evaluated on several large datasets in [7], demonstrating state-of-the-art performance that is competitive with deep learning [8]. Another interesting approach is convolutional sparse coding [9], where the local features are reconstructed by convolving the local sparse codes with a local dictionary. Visualization of its dictionary shows that the dictionary atoms capture more complex features and therefore have more discriminative power.

In this paper, we present a novel sparse coding framework that is robust to image transformation. In the proposed model, each dictionary atom is constructed in the form of a tensor and is aligned with the test image using the large displacement optical flow concept [10]. We show experimentally that the proposed sparse coding framework outperforms most other sparsity-based methods. Specifically, our paper has the following novelties and contributions: (i) the proposed algorithm does not require the training dataset to be pre-aligned; (ii) adapting the dictionary to the input test image is highly efficient, requiring only $\mathcal{O}(wn)$ operations per dictionary atom, where $w$ is the number of pixels in a searching window and $n$ is the total number of sub-atoms to be aligned; (iii) a supervised dictionary learning algorithm is developed for the proposed sparse coding framework.
The remainder of the paper is organized as follows. We first introduce the proposed sparse coding framework for dealing with dataset misalignment in Section 2. Next, in Section 3, we show how to train the dictionary in a supervised manner by solving a bilevel optimization problem. Finally, in Section 4, experimental results demonstrate that the proposed framework achieves state-of-the-art performance, outperforming most existing sparsity-based methods.
2 Sparse Coding with Image Alignment via Large Displacement Optical Flow
In this section, we first introduce how to construct the dictionary atoms and input images in the form of tensors. We then illustrate how to eliminate the misalignment by dynamically adapting the tensor dictionary atoms to the input tensor image.
In the proposed sparse coding model, as shown in Fig. 1, both the dictionary atoms and the input image are represented by image tensors. Each pixel in the tensor image is a vectorized version of a local patch in the original image, referred to as a vector pixel. Denote the $j$-th tensor atom as $\mathbf{D}_j = [\mathbf{d}_j^{(1)}, \ldots, \mathbf{d}_j^{(n)}]$ and a given test tensor image as $\mathbf{Y} = [\mathbf{y}^{(1)}, \ldots, \mathbf{y}^{(n)}]$, where $\mathbf{d}_j^{(i)} \in \mathbb{R}^{m}$ is the $i$-th sub-atom of the tensor atom and $\mathbf{y}^{(i)} \in \mathbb{R}^{m}$ is the $i$-th vector pixel of the input image. Here, $m$ is the dimension of a vector pixel, $j$ is the dictionary atom index, and $n$ is the total number of sub-atoms in a tensor atom, which equals the number of vector pixels in the test tensor image. The dictionary with $K$ tensor atoms is denoted as $\mathbf{D} = \{\mathbf{D}_j\}_{j=1}^{K}$. Given such a dictionary, a typical sparse recovery problem [1] is formulated as:
$$\min_{\boldsymbol{\alpha}} \; \frac{1}{2}\Big\| \mathbf{Y} - \sum_{j=1}^{K} \alpha_j \mathbf{D}_j \Big\|_F^2 + \lambda \|\boldsymbol{\alpha}\|_1, \qquad (1)$$

where $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_K]^\top$ is the sparse coefficient vector and $\lambda$ is the regularization parameter. Problem (1) is a standard form of the sparse recovery problem that can be efficiently solved using the alternating direction method of multipliers (ADMM) [11].
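For reference, a minimal ADMM solver for the lasso-type subproblem is sketched below. This is a generic illustration, not the authors' implementation: the dictionary is treated as a flat matrix whose columns are vectorized atoms, and all names and parameter values are our own choices.

```python
import numpy as np

def lasso_admm(D, y, lam=0.1, rho=1.0, n_iter=200):
    """Solve min_a 0.5*||y - D a||_2^2 + lam*||a||_1 with ADMM.

    The variable is split as x = z; the x-update is a ridge solve, the
    z-update is soft-thresholding, and u is the scaled dual variable.
    """
    n = D.shape[1]
    DtD, Dty = D.T @ D, D.T @ y
    L = np.linalg.cholesky(DtD + rho * np.eye(n))  # factor once, reuse
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)
    for _ in range(n_iter):
        x = np.linalg.solve(L.T, np.linalg.solve(L, Dty + rho * (z - u)))
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)  # shrinkage
        u = u + x - z
    return z
```

With an orthonormal dictionary the solution reduces to coordinate-wise soft-thresholding of `D.T @ y`, which makes the solver easy to sanity-check.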
When images in both the training and test datasets are misaligned, the sparse coefficients recovered by solving problem (1) become unreliable, resulting in poor classification performance. To alleviate the misalignment problem, we propose to register each tensor atom with the input test image via large displacement optical flow [10]. The notion of an optical flow field is used here to describe the displacements of vector pixels within each tensor atom, and sparse recovery is then performed using only the best-matching sub-atoms selected from the tensor atoms. The proposed framework is illustrated in Fig. 1. Denote $\mathbf{W}_j^{(i)}$ as the matrix whose columns are the sub-atoms of $\mathbf{D}_j$ within the searching window centered at the $i$-th location of the tensor atom. The recovery of the optical flow and sparse codes can be formally described as follows:
$$\min_{\boldsymbol{\alpha},\, \{\mathbf{s}_j^{(i)}\}} \; \sum_{i=1}^{n} \Big\| \mathbf{y}^{(i)} - \sum_{j=1}^{K} \alpha_j\, \mathbf{W}_j^{(i)} \mathbf{s}_j^{(i)} \Big\|_2^2 + \lambda \|\boldsymbol{\alpha}\|_1 \quad \text{s.t.} \quad \mathbf{s}_j^{(i)} \in \{0,1\}^{w},\; \|\mathbf{s}_j^{(i)}\|_0 = 1, \qquad (2)$$

where $\|\mathbf{s}_j^{(i)}\|_0 = 1$ is the cardinality constraint, $\mathbf{W}_j^{(i)}$ collects the $w$ sub-atoms within the $i$-th searching window, and $\mathbf{s}_j^{(i)}$ is the sparse index vector used to characterize the optical flow field. The constraint in (2) requires $\mathbf{s}_j^{(i)}$ to be a binary index vector with only one nonzero element, meaning that it selects exactly one sub-atom within the searching window.
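The cardinality constraint has a simple reading: multiplying the window matrix by a one-hot index vector picks out a single sub-atom. A tiny numeric sketch (all values made up for illustration):

```python
import numpy as np

# Window matrix: each column is one candidate sub-atom inside the searching
# window (3 candidates, vector-pixel dimension 2).
W = np.array([[1.0, 4.0, 7.0],
              [2.0, 5.0, 8.0]])

# Binary index vector with exactly one nonzero entry (cardinality 1).
s = np.array([0.0, 1.0, 0.0])

# W @ s selects the second candidate, i.e. column 1 of W.
selected = W @ s
```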
The optimization problem in (2) is a mixed-integer problem and NP-hard [12]. Therefore, we propose a heuristic algorithm to find an informative $\boldsymbol{\alpha}$ and the sparse index vectors for all vector pixels. As shown in Fig. 1, the optical flow field for each vector pixel is found by searching for the best match between the neighboring sub-atoms and the corresponding input vector pixel. In practice, we found that searching for the best match without involving the sparse code is the key to achieving good performance in both classification accuracy and computational efficiency. Formally, we propose to find a local optimum of problem (2) by solving the following optimization problem:

$$\mathbf{s}_j^{(i)\ast} = \arg\min_{\mathbf{s} \in \{0,1\}^{w},\, \|\mathbf{s}\|_0 = 1} \big\| \mathbf{y}^{(i)} - \mathbf{W}_j^{(i)} \mathbf{s} \big\|_2^2, \qquad \min_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n} \Big\| \mathbf{y}^{(i)} - \sum_{j=1}^{K} \alpha_j\, \mathbf{W}_j^{(i)} \mathbf{s}_j^{(i)\ast} \Big\|_2^2 + \lambda \|\boldsymbol{\alpha}\|_1. \qquad (3)$$
In our approach, the sparse coding part of (3) is solved using the alternating direction method of multipliers (ADMM) [11]. One important advantage of the above model is that it is highly computationally efficient: only $\mathcal{O}(w)$ distance evaluations are needed to find the best match for each vector pixel.
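The best-match search can be sketched as follows. This is our own illustration of the heuristic, not the authors' implementation: a tensor atom and tensor image are stored as `(H, W, m)` arrays of vector pixels, and each vector pixel of the image keeps the closest sub-atom from its searching window, without involving the sparse code.

```python
import numpy as np

def align_atom(atom, image, radius=1):
    """Align one tensor atom to the input tensor image.

    atom, image: (H, W, m) arrays; entry (i, j) is an m-dimensional
    vector pixel (a vectorized local patch). For every location (i, j),
    keep the sub-atom inside the (2*radius+1)^2 searching window that is
    closest in squared Euclidean distance to the input vector pixel.
    """
    H, W, _ = image.shape
    aligned = np.empty_like(image)
    for i in range(H):
        for j in range(W):
            best, best_d = atom[i, j], np.inf
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    p, q = i + di, j + dj
                    if 0 <= p < H and 0 <= q < W:
                        d = np.sum((atom[p, q] - image[i, j]) ** 2)
                        if d < best_d:
                            best, best_d = atom[p, q], d
            aligned[i, j] = best
    return aligned
```

If the atom is a translated copy of the image and the translation stays inside the searching window, this step recovers the image exactly at every interior location, which matches the intuition that the tensor atom absorbs small misalignments before sparse recovery.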
3 Supervised Dictionary Learning
In order to improve the efficiency of sparse coding and the discriminability of the dictionary, we employ the supervised dictionary learning framework [6, 13, 14] to optimize the dictionary and the classifier parameters simultaneously. Formulated as a bilevel optimization problem, the dictionary is updated using back-propagation to minimize the classification error. Formally, the supervised dictionary learning problem can be formulated as follows:
$$\min_{\mathbf{D},\, \mathbf{w}} \; \sum_{(\mathbf{Y},\, y)} \ell\big(y, \mathbf{w}, \boldsymbol{\alpha}^{\ast}\big) + \frac{\nu}{2}\|\mathbf{w}\|_2^2 \quad \text{s.t.} \quad \boldsymbol{\alpha}^{\ast} = \boldsymbol{\alpha}^{\ast}(\mathbf{Y}, \mathbf{D}) \text{ solves } (3), \qquad (4)$$

where $\ell$ is a smooth and convex function that defines the classification error over training pairs $(\mathbf{Y}, y)$, $\mathbf{w}$ denotes the classifier parameters, and $\nu$ is the regularization parameter used to alleviate overfitting of the classifier. Due to the triviality of updating the classifier parameters, here we only state the update for the dictionary:
$$\mathbf{D}^{t+1} = \Pi\Big( \mathbf{D}^{t} - \eta_t\, \frac{\partial \ell}{\partial \mathbf{D}} \Big), \qquad (5)$$
where $\eta_t$ is the learning rate, $t$ is the iteration counter and $\Pi$ is the projection that rescales the Frobenius norm of every tensor atom to one. Similar to [6, 13, 14], (4) indicates that the updates of both the dictionary and the classifier are driven by reducing the classification error. A local optimum can be found using a descent method [15] based on error back-propagation. The sparse code $\boldsymbol{\alpha}^{\ast}$ is an implicit function of the dictionary and the input image. In addition, each optical flow field is an implicit function of the corresponding tensor atom and the input image. Therefore, given an input image and an optimal sparse code $\boldsymbol{\alpha}^{\ast}$, applying the chain rule of differentiation, the direction along which the upper-level cost decreases can be formulated as:
$$\frac{\partial \ell}{\partial \mathbf{D}_j} = \frac{\partial \boldsymbol{\alpha}^{\ast}}{\partial \mathbf{D}_j}\, \frac{\partial \ell}{\partial \boldsymbol{\alpha}^{\ast}} + \frac{\partial \mathbf{s}_j}{\partial \mathbf{D}_j}\, \frac{\partial \ell}{\partial \mathbf{s}_j}, \qquad (6)$$

where $\mathbf{s}_j = \mathbf{s}_j^{(1)} \oplus \cdots \oplus \mathbf{s}_j^{(n)}$ and $\oplus$ denotes the direct sum, each $\mathbf{s}_j^{(i)}$ being zero-padded from its searching window to the full set of sub-atom locations. Due to the binary constraints on $\mathbf{s}_j$, every element of the gradient $\partial \mathbf{s}_j / \partial \mathbf{D}_j$ equals zero, so the second term vanishes. The remaining derivative can be obtained by applying fixed-point differentiation [16]. Due to the page limitation of the paper, we only show the final result:

$$\frac{\partial \boldsymbol{\alpha}^{\ast}_{\Lambda}}{\partial \mathbf{y}} = \big( \mathbf{D}_{\Lambda}^{\top} \mathbf{D}_{\Lambda} \big)^{-1} \mathbf{D}_{\Lambda}^{\top}, \qquad (7)$$
where $\Lambda$ is the index set of active atoms of the sparse code $\boldsymbol{\alpha}^{\ast}$, $\mathbf{D}_{\Lambda}$ is the matrix obtained by collecting the active columns of $\mathbf{D}$, and $\mathbf{D}_{\Lambda}^{\top}\mathbf{D}_{\Lambda}$ is the submatrix obtained by selecting the active columns and rows of $\mathbf{D}^{\top}\mathbf{D}$. The matrix $\mathbf{D}_{\Lambda}^{\top}\mathbf{D}_{\Lambda}$ is always nonsingular since the total number of measurements is significantly larger than the number of active atoms. Combining (6) with (7) for each dictionary element, the gradient for updating the dictionary is obtained. For a large dataset, the dictionary and the classifier parameters are updated in an online manner.
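One supervised update step under our reading of the above can be sketched as follows. This is a simplified illustration, not the authors' code: the dictionary is a flat matrix of vectorized atoms, the loss gradient with respect to the sparse code is a placeholder input, and the active-set formula is the standard fixed-point differentiation result.

```python
import numpy as np

def backprop_sparse_code(D, alpha, dloss_dalpha):
    """Fixed-point differentiation through the sparse code: on the active
    set Lam, d(alpha_Lam)/dy = (D_Lam^T D_Lam)^{-1} D_Lam^T; inactive
    coefficients (and the binary flow indices) contribute zero gradient.
    Returns dloss/dy.
    """
    Lam = np.flatnonzero(alpha)               # active atoms
    D_a = D[:, Lam]
    # D_a^T D_a is nonsingular when #measurements >> #active atoms.
    J = np.linalg.solve(D_a.T @ D_a, D_a.T)   # d(alpha_Lam)/dy
    return J.T @ dloss_dalpha[Lam]

def dictionary_step(D, grad, lr=0.1):
    """Projected-gradient update D <- Pi(D - lr*grad): take a gradient
    step, then rescale every atom (column) to unit norm."""
    D_new = D - lr * grad
    norms = np.linalg.norm(D_new, axis=0, keepdims=True)
    return D_new / np.maximum(norms, 1e-12)   # avoid division by zero
```

The renormalization in `dictionary_step` plays the role of the projection $\Pi$: without it, the atoms could grow without bound and absorb the effect of the regularization on the sparse code.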
4 Experiments
In this section, we evaluate the proposed algorithm on handwritten digit datasets, namely MNIST and USPS. Sparse coding is performed with a single dictionary, and a linear SVM is used for classification. For a fair comparison, we only compare against results produced with the same SRC strategy, and the dictionary size in our experiments is set to be no larger than those used in other methods. Similar to [6], the remaining parameters (the batch size for updating the dictionary, the initial learning rate, and the regularization parameter) are chosen heuristically.
4.1 Evaluation on the MNIST Database
MNIST [17] consists of a total of 70,000 digit images, of which 60,000 form the training set and the remaining 10,000 form the test set. Each digit is centered and normalized in a 28×28 field. A fixed dictionary size is used for this database.
We first evaluate the performance of the proposed algorithm with varying numbers of training samples. We follow the same experimental setting as in [18], examining the classification accuracy for different training set sizes. The performance is shown in Fig. 2(a). The proposed method significantly outperforms the sparse-coding-based algorithm (L1-SC) [14].
We then demonstrate the robustness of the proposed method to various image deformations. Following a similar setting as in [4], we apply translation, rotation and scaling separately, only to the test samples. We report the classification accuracy with respect to various levels of deformation and compare the performance with L1-SC. The experimental results are shown in Fig. 2(b)-(d). The performance of our method and of L1-SC is illustrated by the red and blue lines, respectively. The shaded area at the bottom of each figure is the accuracy difference between the two methods. For all three deformations, the proposed method consistently outperforms L1-SC. In addition, the hump shape of the shaded area indicates that the proposed method is robust to a wide range of image deformations.
Finally, the error rates on MNIST are shown in Table 1. Our method achieves the lowest error rate among the compared approaches. On MNIST, even small differences in error rate are statistically significant [19]. Compared with the second-best algorithm, the proposed method further reduces the error rate, exhibiting better generality and dictionary compactness.
4.2 Evaluation on the USPS Database
The USPS dataset has 7,291 training and 2,007 test images, each of size 16×16. Compared to MNIST, the USPS dataset has a much larger variance and a smaller training set, which challenges the generality of the dictionary. For a fair comparison, the dictionary size is set to match that of the compared methods, and the local patch size and searching window size are kept fixed. The performance of various approaches on the USPS database is reported in Table 1. Our algorithm achieves the lowest error rate among the supervised-learning-based methods. The experimental results validate the efficacy of the proposed algorithm on a dataset with larger variance.
Table 1: Error rates (%) on the MNIST and USPS databases for CBN, ESC [20], Ramirez et al. [21], Deep Belief Network [8], MMDL [22], and the proposed method, together with the improvement of the proposed method over the best competing result.
5 Conclusion
In this paper, we presented a novel sparse coding algorithm that dynamically selects dictionary sub-atoms to adapt to misaligned image datasets. In the proposed method, both the dictionary atoms and the input test image are represented by tensors, and each vector pixel in a tensor image is a vectorized local patch. Each tensor atom is aligned with the input tensor image using large displacement optical flow, which is highly computationally efficient. Using fixed-point differentiation, a supervised dictionary learning algorithm is developed for the proposed sparse coding framework, which significantly reduces the required dictionary size.
References
 [1] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Trans. on Pattern Anal. and Mach. Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.
 [2] J. Yang, K. Yu, Y. Gong, and T. Huang, “Linear spatial pyramid matching using sparse coding for image classification,” in CVPR, pp. 1794–1801, Jun. 2009.
 [3] S. Agarwal and D. Roth, “Learning a sparse representation for object detection,” in ECCV, vol. 4, pp. 113–130, May 2002.
 [4] A. Wagner, J. Wright, A. Ganesh, Z. Zhou, H. Mobahi, and Y. Ma, “Toward a practical face recognition system: Robust alignment and illumination by sparse representation,” IEEE Trans. on Pattern Anal. and Mach. Intell., vol. 34, no. 2, pp. 372–386, Feb. 2012.

 [5] H. Lee, R. B. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in ICML, Jun. 2009.
 [6] J. Yang, K. Yu, and T. Huang, “Supervised translation-invariant sparse coding,” in CVPR, pp. 3517–3524, Jun. 2010.
 [7] A. Coates and A. Y. Ng, “The importance of encoding versus training with sparse coding and vector quantization,” in ICML, pp. 921–928, Jul. 2011.
 [8] G. E. Hinton and S. Osindero, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, Jul. 2006.
 [9] K. Kavukcuoglu, P. Sermanet, Y. Boureau, K. Gregor, M. Mathieu, and Y. Lecun, “Learning convolutional feature hierarchies for visual recognition,” in NIPS, pp. 1090–1098, Dec. 2010.

[10]
T. Brox and J. Malik,
“Large displacement optical flow: Descriptor matching in variational motion estimation,”
IEEE Trans. on Pattern Anal. and Mach. Intell., vol. 33, no. 3, pp. 500–513, Mar. 2011.  [11] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Journal FTML, vol. 3, no. 1, pp. 1–122, Jan. 2011.
 [12] D. Bertsimas and R. Weismantel, “Optimization over integers,” Athena Scientific, 2005.
 [13] J. Yang, Z. Wang, Z. Lin, X. Shu, and T. Huang, “Bilevel sparse coding for coupled feature spaces,” in CVPR, pp. 2360–2367, Jun. 2012.
 [14] J. Mairal, F. Bach, and J. Ponce, “Task-driven dictionary learning,” IEEE Trans. on Pattern Anal. and Mach. Intell., vol. 34, no. 4, pp. 791–804, Apr. 2012.
 [15] B. Colson, P. Marcotte, and G. Savard, “An overview of bilevel optimization,” Ann. of Operat. Res., vol. 153, no. 1, pp. 235–256, Apr. 2007.
 [16] D. M. Bradley and J. A. Bagnell, “Differentiable sparse coding,” in NIPS, Dec. 2008.
 [17] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
 [18] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid, “Convolutional Kernel Networks,” in NIPS, 2014.
 [19] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of deep networks,” in NIPS, pp. 153–160, Dec. 2007.
 [20] H. Lee, A. Battle, R. Raina, and A. Y. Ng, “Efficient sparse coding algorithms,” in NIPS, pp. 801–808, Dec. 2006.
 [21] I. Ramirez, P. Sprechmann, and G. Sapiro, “Classification and clustering via dictionary learning with structured incoherence and shared features,” in CVPR, pp. 3501–3508, Jun. 2010.
 [22] Z. Wang, J. Yang, N. M. Nasrabadi, and T. Huang, “A max-margin perspective on sparse representation-based classification,” in ICCV, pp. 1217–1224, Dec. 2013.