Sparse Coding with Fast Image Alignment via Large Displacement Optical Flow

12/21/2015 ∙ by Xiaoxia Sun, et al.

Sparse representation-based classifiers have shown outstanding accuracy and robustness in image classification tasks, even in the presence of intense noise and occlusion. However, their performance degrades significantly when either the test image is not aligned with the dictionary atoms or the dictionary atoms themselves are not aligned with each other; in both cases the sparse linear representation assumption fails. In this paper, assuming that both training and test images are misaligned, we introduce a novel sparse coding framework that efficiently adapts the dictionary atoms to the test image via large displacement optical flow. In the proposed algorithm, every dictionary atom is automatically aligned with the input image, and the sparse code is then recovered using the adapted dictionary atoms. A corresponding supervised dictionary learning algorithm is also developed for the proposed framework. Experimental results on digit recognition datasets verify the efficacy and robustness of the proposed algorithm.






1 Introduction

Sparse coding has been successfully applied to numerous computer vision tasks, including face recognition [1], scene categorization [2] and object detection [3]. The application of the sparse representation-based classifier (SRC) to face recognition [1] demonstrates striking robustness to noise and occlusion: test subjects remain recognizable even when they wear sunglasses or a scarf. However, SRC has been found to be highly sensitive to misalignment of the image dataset: a small amount of image distortion due to translation, rotation, scaling or 3-dimensional pose variation can significantly degrade the classification performance [4].

One straightforward way to solve the misalignment problem is to register the test image with the dictionary atoms before sparse recovery. Assuming that the dictionary atoms are registered, Wagner et al. [4] parameterize the misalignment of the test image with an affine transformation. These parameters are optimized using generalized Gauss-Newton methods after linearizing the affine transformation constraints. By minimizing the sparse registration error iteratively and sequentially for each class, their framework is able to deal with a large range of variations in translation, scaling, rotation and even 3D pose. Due to the adoption of holistic features, sparse coding is more robust and less likely to overfit.

In the case of local feature-based sparse coding, a max pooling strategy [5] is often employed over neighboring coefficients to obtain local translation invariance. Based on the spatial pyramid matching framework, Yang et al. [2] proposed a local sparse coding model with local SIFT features followed by multi-scale max pooling. On several datasets with large variance, the model achieved performance that holistic sparse coding alone can hardly reach. To improve the discriminability of the sparse codes, their dictionary was trained with supervised learning via backpropagation [6]. The classification performance of local feature-based sparse coding has also been evaluated on several large datasets in [7], demonstrating state-of-the-art performance that is competitive with deep learning [8]. Another interesting approach is convolutional sparse coding [9], where the local features are reconstructed by convolving the local sparse codes with a local dictionary. Visualization of its dictionary shows that the dictionary atoms contain more complex features and therefore have more discriminative power.
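The local translation invariance that max pooling provides can be illustrated with a small sketch. This is not the pooling code of [2] or [5]; the function name `max_pool_1d`, the one-dimensional layout of locations, and the toy values are assumptions made only for illustration.

```python
import numpy as np

def max_pool_1d(codes, pool_size):
    """Max-pool coefficient magnitudes over non-overlapping neighborhoods.

    codes: array of shape (num_locations, num_atoms), one sparse code per location.
    Returns an array of shape (num_locations // pool_size, num_atoms).
    """
    num_locations, num_atoms = codes.shape
    usable = (num_locations // pool_size) * pool_size  # drop any ragged tail
    blocks = np.abs(codes[:usable]).reshape(-1, pool_size, num_atoms)
    return blocks.max(axis=1)

# Toy input (hypothetical): atom 1 responds at location 1.
codes = np.zeros((8, 3))
codes[1, 1] = 0.9

# Shift the response by one location; it stays inside the same pooling region.
shifted = np.roll(codes, 1, axis=0)

pooled = max_pool_1d(codes, pool_size=4)
pooled_shifted = max_pool_1d(shifted, pool_size=4)
```

Because the shifted response remains within the same pooling neighborhood, the two pooled representations coincide. A shift that crosses a neighborhood boundary would change the pooled output, which is why the invariance is only local.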

Figure 1: Proposed sparse coding framework. The dictionary tensor atoms and the test tensor image are shown in the lower part of the figure. The searching window within each tensor atom is colored purple. Each group of neighboring subatoms is matched with the corresponding vector pixel of the test tensor image, resulting in an aligned subatom. After the matching process, the sparse code for the test tensor image is recovered using all the aligned subatoms. For illustration purposes, only five dictionary tensor atoms are shown, and the magnitudes of the sparse codes are displayed as varying intensities of red.

In this paper, we present a novel sparse coding framework that is robust to image transformation. In the proposed model, each dictionary atom is constructed in the form of a tensor and is aligned with the test image using the large displacement optical flow concept [10]. We show experimentally that the proposed framework outperforms most other sparsity-based methods. Specifically, our paper makes the following contributions: (i) The proposed algorithm does not require the training dataset to be pre-aligned. (ii) Adapting the dictionary to the input test image is highly efficient: the cost of adapting each dictionary atom depends only on the number of pixels in a searching window and the total number of subatoms to be aligned. (iii) A supervised dictionary learning algorithm is developed for the proposed sparse coding framework.

The remainder of the paper is organized as follows. We first introduce the proposed sparse coding framework for dealing with dataset misalignment in Section 2. Next, in Section 3, we show how to train the dictionary in a supervised manner by solving a bilevel optimization problem. Finally, in Section 4, experimental results demonstrate that the proposed framework achieves state-of-the-art performance and compares favorably with most existing sparsity-based methods.

2 Sparse Coding with Image Alignment via Large Displacement Optical Flow

In this section, we first describe how to construct the dictionary atoms and input images in the form of tensors. We then show how to eliminate the misalignment by dynamically adapting the tensor dictionary atoms to the input tensor image.

In the proposed sparse coding model, as shown in Fig. 1, both the dictionary atoms and the input image are represented by image tensors. Each pixel in a tensor image is a vectorized version of a local patch in the original image, referred to as a vector pixel. A tensor atom is composed of subatoms in the same way, and the total number of subatoms in a tensor atom equals the number of vector pixels in the test tensor image. Given a dictionary of tensor atoms, a typical sparse recovery problem [1] is formulated as:


The unknown in (1) is the sparse coefficient vector, and the sparsity penalty is weighted by a regularization parameter. Problem (1) is a standard ℓ1-sparse recovery problem that can be efficiently solved using the alternating direction method of multipliers (ADMM) [11].
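As a minimal sketch of how an ℓ1-sparse recovery problem of this kind can be solved with ADMM, the code below follows the standard lasso splitting described in [11]. It assumes the common objective 0.5 * ||y - D x||^2 + lam * ||x||_1 with the dictionary flattened into a matrix `D`; the names `lasso_admm`, `rho` and `num_iters` are illustrative choices, not the authors' settings.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Elementwise soft-thresholding, the proximal operator of kappa * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def lasso_admm(D, y, lam, rho=1.0, num_iters=500):
    """Approximately minimize 0.5 * ||y - D x||^2 + lam * ||x||_1 with ADMM.

    D: (m, p) dictionary matrix, y: (m,) signal.
    rho and num_iters are assumed defaults, not values from the paper.
    """
    p = D.shape[1]
    # Factor (D^T D + rho I) once; it is reused in every x-update.
    L = np.linalg.cholesky(D.T @ D + rho * np.eye(p))
    Dty = D.T @ y
    z = np.zeros(p)  # sparse copy of the variable
    u = np.zeros(p)  # scaled dual variable
    for _ in range(num_iters):
        rhs = Dty + rho * (z - u)
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))  # ridge-like x-update
        z = soft_threshold(x + u, lam / rho)               # sparsity-inducing z-update
        u = u + x - z                                      # dual ascent step
    return z  # z is exactly sparse, unlike x

# Sanity check with an identity dictionary, where the minimizer is
# simply the soft-thresholded signal.
y_demo = np.array([3.0, -0.5, 1.5, 0.2])
x_demo = lasso_admm(np.eye(4), y_demo, lam=1.0)
```

Returning `z` rather than `x` is a design choice: the two agree at convergence, but only `z` contains exact zeros, which makes the active set easy to read off.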

When images in both the training and test datasets are misaligned, the sparse coefficients recovered by solving problem (1) become unreliable, resulting in poor classification performance. To alleviate the misalignment problem, we propose to register each tensor atom with the input test image via large displacement optical flow [10]. The notion of an optical flow field is used here to describe the displacements of vector pixels within each tensor atom, and the sparse recovery is then performed using only the best matching subatoms selected from the tensor atoms. The proposed framework is illustrated in Fig. 1. For each location in a tensor atom, consider the subatoms within the searching window centered at that location. The recovery of the optical flow and sparse codes can be formally described as follows:


Problem (2) involves a cardinality constraint and, for each location, a sparse index vector that characterizes the optical flow field. The constraint in (2) requires each index vector to be binary with exactly one nonzero element, which means that it selects exactly one subatom within the searching window.
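The role of the binary index vector can be illustrated with a tiny example. The sizes `d` and `K`, the random window contents, and the helper `one_hot` are hypothetical and chosen only to make the selection mechanism concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 4, 9  # assumed vector-pixel dimension and number of subatoms in a window

# Columns are the candidate subatoms inside one searching window.
window = rng.standard_normal((d, K))

def one_hot(index, length):
    """Binary index vector with exactly one nonzero element."""
    u = np.zeros(length)
    u[index] = 1.0
    return u

u = one_hot(5, K)
selected = window @ u  # picks out the candidate at index 5 and nothing else
```

Multiplying the window matrix by such a vector returns a single column, so choosing the index vector is equivalent to choosing a displacement inside the window.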

The optimization problem in (2) is a mixed-integer problem and is NP-hard [12]. Therefore, we propose a heuristic algorithm to find an informative sparse code and the sparse index vectors for all vector pixels. As shown in Fig. 1, the optical flow field for each vector pixel is found by searching for the best match between neighboring subatoms and the corresponding input vector pixel. In practice, we found that searching for the best match without involving the sparse code is key to achieving good performance in both classification accuracy and computational efficiency. Formally, we propose to find a local optimum of problem (2) by solving the following optimization problem:


In our approach, the sparse coding part of (3) is solved using the alternating direction method of multipliers (ADMM) [11]. One important advantage of the above model is that it is highly computationally efficient, because the search for the best match for each vector pixel is confined to its searching window.
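The matching step described above can be sketched as follows. This is an illustrative approximation, not the authors' implementation: it assumes that the best match is the subatom with the smallest squared Euclidean distance to the vector pixel, and the names `to_tensor_image`, `align_atom`, `adapt_dictionary`, `patch` and `radius` are hypothetical.

```python
import numpy as np

def to_tensor_image(image, patch):
    """Turn an (H, W) image into a tensor whose pixels are vectorized patches."""
    H, W = image.shape
    Hp, Wp = H - patch + 1, W - patch + 1
    out = np.empty((Hp, Wp, patch * patch))
    for i in range(Hp):
        for j in range(Wp):
            out[i, j] = image[i:i + patch, j:j + patch].ravel()
    return out

def align_atom(atom, test, radius):
    """For each vector pixel of `test`, select the closest subatom of `atom`
    inside a (2*radius+1)^2 searching window centered at the same location.
    The squared-distance criterion is an assumption of this sketch."""
    Hp, Wp, d = test.shape
    aligned = np.empty_like(test)
    flow = np.zeros((Hp, Wp, 2), dtype=int)  # displacement chosen per location
    for i in range(Hp):
        for j in range(Wp):
            i0, i1 = max(i - radius, 0), min(i + radius + 1, Hp)
            j0, j1 = max(j - radius, 0), min(j + radius + 1, Wp)
            cand = atom[i0:i1, j0:j1].reshape(-1, d)  # candidates, row-major
            dist = np.sum((cand - test[i, j]) ** 2, axis=1)
            k = int(np.argmin(dist))
            bi, bj = divmod(k, j1 - j0)               # back to window coordinates
            aligned[i, j] = cand[k]
            flow[i, j] = (i0 + bi - i, j0 + bj - j)
    return aligned, flow

def adapt_dictionary(atoms, test, radius):
    """Stack the aligned, flattened tensor atoms as columns of a matrix."""
    cols = [align_atom(a, test, radius)[0].ravel() for a in atoms]
    return np.stack(cols, axis=1)

# Toy usage: the test image is the atom image shifted by (1, 2) pixels.
rng = np.random.default_rng(0)
base = rng.random((12, 12))
shifted = np.roll(base, shift=(1, 2), axis=(0, 1))
atom_t = to_tensor_image(base, patch=3)
test_t = to_tensor_image(shifted, patch=3)
aligned, flow = align_atom(atom_t, test_t, radius=2)
adapted = adapt_dictionary([atom_t, atom_t[::-1].copy()], test_t, radius=2)
```

In the toy usage, the recovered displacement away from the image border is the inverse of the applied shift, and the adapted matrix can be passed to any ℓ1 solver in place of the original dictionary. Each vector pixel inspects only the candidates inside its window, so the search cost does not grow with the image size beyond the number of vector pixels.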

3 Supervised Dictionary Learning

In order to improve the efficiency of sparse coding and the discriminability of the dictionary, we employ the supervised dictionary learning framework [6, 13, 14] to optimize the dictionary and the classifier parameters simultaneously. With the learning task formulated as a bilevel optimization problem, the dictionary is updated using backpropagation to minimize the classification error. Formally, the supervised dictionary learning problem can be formulated as follows:


In (4), the classification error is measured by a smooth and convex loss function, and a regularization parameter is used to alleviate overfitting of the classifier. Because updating the classifier parameters is straightforward, we only state the update for the dictionary:


The update (5) uses a learning rate, an iteration counter, and a projection that normalizes every tensor atom to unit Frobenius norm. Similar to [6, 13, 14], (4) suggests that the updates of both the dictionary and the classifier are driven by reducing the classification error. A local optimum can be found using a descent method [15] based on error backpropagation. Both the sparse code and each optical flow field are defined implicitly as solutions of the lower-level problem. Therefore, given an input image and an optimal sparse code, applying the chain rule of differentiation, the direction along which the upper-level cost decreases can be formulated as:


Expression (6) is written using the direct sum, together with a zero-padded vector whose nonzero elements are copied from the corresponding original vector. Due to the binary constraints on the index vectors, every element of the corresponding gradient equals zero. On the other hand, the first part of the derivative can be obtained by applying fixed point differentiation [16]. Due to the page limitation and the straightforward derivation of the remaining term, we only show the final result as follows:


Expression (7) is defined on the index set of active atoms of the sparse code. It uses the matrix obtained by collecting the active columns of the dictionary, and the submatrix obtained by selecting the corresponding active columns and rows. The resulting submatrix is always nonsingular since the total number of measurements is always significantly larger than the number of active atoms. Combining (6) with (7) for each dictionary element, the gradient for updating the dictionary can be obtained. For a large dataset, the dictionary and the classifier parameters are updated in an online manner.
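The dictionary update can be sketched under simplifying assumptions. The code below uses the fixed point differentiation result for the plain ℓ1 problem in the form given for task-driven dictionary learning [14], and treats the adapted dictionary as an ordinary matrix `D` whose columns are flattened tensor atoms. The optical flow selection is held fixed, consistent with the statement that its gradient is zero. The function names and the small guard constant are hypothetical.

```python
import numpy as np

def dictionary_gradient(D, y, x, grad_x):
    """Gradient of an upper-level loss with respect to D for a fixed active set.

    D: (m, p) dictionary, y: (m,) input, x: (p,) optimal sparse code,
    grad_x: (p,) gradient of the upper-level loss with respect to x.
    """
    active = np.flatnonzero(x)  # index set of active atoms
    beta = np.zeros_like(x)
    if active.size:
        D_act = D[:, active]     # active columns of D
        gram = D_act.T @ D_act   # active rows and columns of D^T D
        beta[active] = np.linalg.solve(gram, grad_x[active])
    residual = y - D @ x
    return -np.outer(D @ beta, x) + np.outer(residual, beta)

def project_unit_norm(D):
    """Rescale every column (flattened atom) to unit norm."""
    norms = np.linalg.norm(D, axis=0, keepdims=True)
    return D / np.maximum(norms, 1e-12)  # guard against all-zero atoms

def sgd_step(D, grad_D, learning_rate):
    """One projected gradient step on the dictionary."""
    return project_unit_norm(D - learning_rate * grad_D)
```

Columns of the gradient that correspond to inactive atoms are zero, so each step only modifies atoms that took part in the reconstruction. The linear solve requires the active Gram matrix to be nonsingular, which matches the condition stated above.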




Figure 2: The proposed method performs well on MNIST digit recognition with a small number of training samples. It also demonstrates robustness to various image deformations. The classification accuracy under different experimental settings is shown in the sub-figures: (a) Error rate for various training set sizes. (b) Translation along a single direction versus classification accuracy. (c) In-plane rotation only. (d) Scale variation only. In (b)-(d), red and blue lines are the results of the proposed method and L1SC, respectively. The gray shaded area at the bottom of each figure is the accuracy difference between the proposed method and L1SC.

4 Experiments

In this section, we evaluate the proposed algorithm on hand-written digits datasets including the MNIST and USPS. The sparse coding is performed with a single dictionary and linear SVM is used for classification. For a fair comparison, we only compare with the results that are produced with the same SRC strategy. The dictionary size in our paper is set to be no larger than those used in other methods. Similar to [6], parameters in our experiments are chosen heuristically. The batch size for updating the dictionary is . Initial learning rate is set to and .

4.1 Evaluation on the MNIST Database

MNIST [17] consists of a total number of images of digits, of which are training set and the rest are test set. Each digit is centered and normalized in a field. The dictionary size is set to be for this database.

We first evaluate the performance of the proposed algorithm under various number of training samples. We follow the same experimental setting as in [18], examining the classification accuracy given the training size . The performance is shown in Fig. 2 (a). The proposed method significantly outperforms the sparse coding-based algorithm (L1SC) [14].

We then demonstrate the robustness of the proposed method to various image deformations. Following a setting similar to [4], we apply translation along a single direction, rotation and scaling separately, and only to the test samples. We report the classification accuracy with respect to various levels of deformation and compare the performance with L1SC. The experimental results are shown in Fig. 2(b)-(d). The performance of our method and of L1SC is plotted in red and blue lines, respectively. The shaded area at the bottom of each figure is the accuracy difference between the two methods. For all three deformations, the proposed method consistently outperforms L1SC. In addition, the hump shape of the shaded area indicates that the proposed method is robust to numerous image deformations.

Finally, the error rate for the MNIST is shown in Table 1. Our method reaches the lowest error rate of . On MNIST, differences of more than are statistically significant [19]. Comparing with the second best algorithm, the proposed method reduces the error rate by , exhibiting better generality and dictionary compactness.

4.2 Evaluation on the USPS Database

The USPS dataset has training and test images, where each of them is of size . Being compared to MNIST, the USPS dataset has a much larger variance and a smaller training set, which challenges the dictionary generality. For a fair comparison, the dictionary size is set to be . Local patch size is ( = ). Searching window size is ( = ). The performance of various approaches on USPS database are depicted in Table 1. Our algorithm achieves the lowest error rate among other supervised learning-based methods. The experimental result validates the efficacy of our proposed algorithm on a dataset with a larger variance.

ESC [20]
Ramirez et al. [21]
Deep Belief Network [8] - - -
MMDL [22] --
Table 1: Error rate (%) on the MNIST and USPS datasets. The dictionary size is shown in parentheses. The improvement over the second-best algorithm is shown in the last line.

5 Conclusion

In this paper, we present a novel sparse coding algorithm that is able to dynamically select the dictionary subatoms to adapt to the misaligned image dataset. In the proposed method, both the dictionary atoms and the input test image are represented by tensors, and each vector pixel in the tensor image is a vectorized local patch. Each tensor atom is aligned with the input tensor image using large displacement optical flow, which is highly computationally efficient. Using the fixed point differentiation, a supervised dictionary learning algorithm is developed for the proposed sparse coding framework, which significantly reduces the required dictionary size.


  • [1] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Trans. on Pattern Anal. and Mach. Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.
  • [2] J. Yang, K. Yu, Y. Gong, and T. Huang, “Linear spatial pyramid matching using sparse coding for image classification,” in CVPR, pp. 1794–1801, Jun. 2009.
  • [3] S. Agarwal and D. Roth, “Learning a sparse representation for object detection,” in ECCV, vol. 4, pp. 113–130, May 2002.
  • [4] A. Wagner, J. Wright, A. Ganesh, Z. Zhou, H. Mobahi, and Y. Ma, “Toward a practical face recognition system: Robust alignment and illumination by sparse representation,” IEEE Trans. on Pattern Anal. and Mach. Intell., vol. 34, no. 2, pp. 372–386, Feb. 2012.
  • [5] H. Lee, R. B. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in ICML, Jun. 2009.
  • [6] J. Yang, K. Yu, and T. Huang, “Supervised translation-invariant sparse coding,” in CVPR, pp. 3517–3524, Jun. 2010.
  • [7] A. Coates and A. Y. Ng, “The importance of encoding versus training with sparse coding and vector quantization,” in ICML, pp. 921–928, Jul. 2011.
  • [8] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, Jul. 2006.
  • [9] K. Kavukcuoglu, P. Sermanet, Y. Boureau, K. Gregor, M. Mathieu, and Y. Lecun, “Learning convolutional feature hierarchies for visual recognition,” in NIPS, pp. 1090–1098, Dec. 2010.
  • [10] T. Brox and J. Malik, “Large displacement optical flow: Descriptor matching in variational motion estimation,” IEEE Trans. on Pattern Anal. and Mach. Intell., vol. 33, no. 3, pp. 500–513, Mar. 2011.
  • [11] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, Jan. 2011.
  • [12] D. Bertsimas and R. Weismantel, “Optimization over integers,” Athena Scientific, 2005.
  • [13] J. Yang, Z. Wang, Z. Lin, X. Shu, and T. Huang, “Bilevel sparse coding for coupled feature spaces,” in CVPR, pp. 2360–2367, Jun. 2012.
  • [14] J. Mairal, F. Bach, and J. Ponce, “Task-driven dictionary learning,” IEEE Trans. on Pattern Anal. and Mach. Intell., vol. 34, no. 4, pp. 791–804, Apr. 2012.
  • [15] B. Colson, P. Marcotte, and G. Savard, “An overview of bilevel optimization,” Ann. of Operat. Res., vol. 153, no. 1, pp. 235–256, Apr. 2007.
  • [16] D. M. Bradley and J. A. Bagnell, “Differentiable sparse coding,” in NIPS, Dec. 2008.
  • [17] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
  • [18] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid, “Convolutional Kernel Networks,” in NIPS, 2014.
  • [19] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of deep networks,” in NIPS, pp. 153–160, Dec. 2007.
  • [20] H. Lee, A. Battle, R. Raina, and A. Y. Ng, “Efficient sparse coding algorithms,” in NIPS, pp. 801–808, Dec. 2006.
  • [21] I. Ramirez, P. Sprechmann, and G. Sapiro, “Classification and clustering via dictionary learning with structured incoherence and shared features,” in CVPR, pp. 3501–3508, Jun. 2010.
  • [22] Z. Wang, J. Yang, N. M. Nasrabadi, and T. Huang, “A max-margin perspective on sparse representation-based classification,” in ICCV, pp. 1217–1224, Dec. 2013.