The Structure Transfer Machine: Theory and Applications

04/01/2018 · Baochang Zhang et al. · Beihang University

Representation learning is a fundamental but challenging problem, especially when the distribution of the data is unknown. We propose a new representation learning method, termed the Structure Transfer Machine (STM), which enables the feature learning process to converge to the representation expectation in a probabilistic way. We theoretically show that this expected value of the representation (the mean) is achievable if the manifold structure can be transferred from the data space to the feature space. The resulting structure regularization term, named the manifold loss, is incorporated into the loss function of a typical deep learning pipeline. The STM architecture is constructed to enforce the learned deep representation to satisfy the intrinsic manifold structure of the data, which yields robust features suited to various application scenarios, such as digit recognition, image classification and object tracking. Compared to state-of-the-art CNN architectures, we achieve better results on several commonly used benchmarks. (The source code is available at https://github.com/stmstmstm/stm.)


I Introduction

The human perception system abstracts the correct concept when the relationship or compactness of the intra-class data (small structure variations) is maintained after perception; otherwise, conceptual errors arise [8]. Analogously, data-driven learning approaches have become a trend that, by all means, tries to maintain the class-specific feature compactness (perception) of the input data [31, 25, 30]. For a learning algorithm, such compactness can be accomplished if the expected representation is achieved for an unbiased estimator (classifier) [15, 3].

Fig. 1: Basic idea of the Structure Transfer Machine (STM) method. By incorporating the manifold structure calculated in the input space into the CNN's loss function, termed the manifold loss, we can theoretically obtain the expected value of the representation (the mean) in a probabilistic way; as a result, the variations among local neighbors of the data are mitigated as they converge to the expected representation, which improves system robustness. $f$ is the mapping function implemented by the CNN (not necessarily the entire network).

Traditional hand-crafted features often require human expert knowledge, thereby making themselves domain specific. In contrast, deep learning based features can be learned automatically by composing multiple nonlinear transformations, which yields more abstract and useful representations. However, typically no distribution prior is embedded into the learning of deep features, making such schemes uncontrollable in certain circumstances. A center loss regularization term, in essence a Gaussian prior, has been successfully exploited in deep learning to improve face recognition performance [30]. However, such a system does not seem to work properly when the data has a complicated structure. Considering the fact that conventional deep learning features are able to distinguish the between-class variability well [14, 17], can we break through the restriction of the simple Gaussian prior [30] toward a better prior, so as to tolerate large intra-class variations? If so, even if the intra-class variations are large, the inter-class samples can still be well separated due to the highly discriminative capability of deep learning. As a result, the generalization ability of the learned features can be expected to improve significantly.

In this paper, we discover that a desired representation in deep learning can actually be achieved, and that the local neighborhood, with no constraint on the data structure required, is able to converge to its expectation during the learning process. More importantly, we notice that features describing the local structure are better suited to representing the data than global ones, since global features tend to be inaccurate when the data variation is large, which often occurs in real-world applications. The above observations inspire us to adopt a nonlinear manifold structure, rather than the center loss [30], in the objective function of deep feature learning, which assumes a more flexible structure of the data. By doing so, we can accommodate variations among local neighbors, such as rotations, rescalings, and translations, so as to gain system robustness. However, directly embedding such a data distribution into the deep learning framework is not an easy task, because formulating the underlying concept into appropriate training criteria is problematic. In this paper, we theoretically show that the expected representation can be achieved in a probabilistic way as long as the property of the manifold structure is reflected in the objective function of deep learning.

Building on this proof, we present a novel Structure Transfer Machine (STM) to learn structured deep features; the framework is illustrated in Fig. 1. Our STM starts with the incorporation of manifold structure into convolutional neural networks (CNNs) by calculating the manifold structure with existing algorithms (i.e., LLE or Laplacian) in the input space. This manifold structure is in turn transferred to the feature space and integrated into the loss function of the CNN objective. Afterwards, the new CNN models are used to extract convolutional feature maps in which the intrinsic manifold structure is preserved even as it moves from the data space to the feature space. Experimental results demonstrate that the learned features yield state-of-the-art performance in various computer vision tasks, e.g., digit recognition, natural image classification, ImageNet classification and object tracking, on commonly used benchmarks. Our main contributions include:

  • A theorem is developed to reveal that the expectation of the representation can be obtained in a probabilistic way if a structure regularization term is incorporated into the deep learning pipeline. With such structure regularization, we revise typical deep learning networks into a Structure Transfer Machine, which attains state-of-the-art performance on image classification and object tracking.

  • With the aid of the manifold, the structure of the input data is transferred into the feature space (output) with the intention of alleviating the unstructured problem in the higher-dimensional space; this eventually converts the data structure into constraints in CNNs and leads to the manifold loss. We also demonstrate that, in the deep feature space, the proposed manifold loss indeed improves performance compared to intra-class compactness methods such as the center loss.

Regarding related work, highly relevant works include structure-related regularization techniques [17] as well as techniques that embed prior knowledge, such as the 2D topological structure of input data [13], both of which reveal that regularizing the data structure is quite useful for image classification. In [14], adversarial examples cause performance degradation through small perturbations; manifold regularized networks (MRnet) utilize a new training objective function that aims to minimize the difference between original and adversarial examples. However, none of the existing works discusses the manifold constraint from a theoretical perspective. Note that in [17], a manifold deep learning method is developed for set classification. The difference between our work and theirs is clear: we provide a theoretical investigation into structure-based deep learning, whereas the work in [17] is more empirical. From an application perspective, we consider intra-class information and focus on single-image classification, while [17] is designed for set-based classification by considering inter-class information.

There are some recently published manifold-based learning methods [22, 21], such as region manifolds [1], graph manifolds [20], product manifolds [28], the Grassmannian manifold [7], and applications to zero-shot learning [24]. All of them differ from ours: we focus on structure transfer in CNNs and provide a new theoretical investigation into CNNs.

II Deep Structure Transfer Machine

It is reasonable to assume that the data lies on a manifold, whose intrinsic structure is expected to be embedded into the objective function of the deep model. This is achieved by developing a generic representation learning method without imposing any prior on the classification model. The proposed method is elaborated below, followed by the theoretical analysis.

II-A Problem formulation

Let $X = \{(x_i, y_i)\}_{i=1}^{N}$ be a training dataset, where $N$ is the total number of samples. We define $f(x_i)$ as an embedding from an image $x_i$ into a feature space, with the label $y_i$, by a convolutional neural network. The softmax loss function for the network is formulated as:

$$L_S = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{W_{y_i}^{\top} f(x_i)}}{\sum_{j=1}^{C} e^{W_j^{\top} f(x_i)}} + \frac{\lambda}{2}\|W\|_2^2, \quad (1)$$

where $W$ denotes the weight matrix in the last fully connected (FC) layer; for simplicity we only use one FC layer as an example, with $W_j$ as its $j$-th column; $C$ is the number of classes; the scalar $\lambda$ is a weight decay coefficient. We denote the feature from the $l$-th layer at the $t$-th iteration for $x_i$ as $f_l^t(x_i)$, and $W_l^t$ as the corresponding weight. The set of deep features from the $l$-th layer is denoted as $F_l = \{f_l(x_i)\}_{i=1}^{N}$, where the subscript $l$ may be omitted in what follows for ease of presentation.
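To make Eq. (1) concrete, the following is a minimal NumPy sketch of the softmax loss with weight decay; the function name `softmax_loss` and the array shapes are our own illustration, not the authors' released implementation.

```python
import numpy as np

def softmax_loss(F, y, W, lam):
    """Eq. (1): softmax cross-entropy over deep features, plus weight decay.

    F   : (N, d) deep features f(x_i)
    y   : (N,)   integer labels y_i in [0, C)
    W   : (d, C) last FC layer; W[:, j] is the column W_j
    lam : weight decay coefficient (lambda in Eq. 1)
    """
    logits = F @ W                                   # W_j^T f(x_i) for all j
    logits -= logits.max(axis=1, keepdims=True)      # numerical stabilization
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -log_prob[np.arange(len(y)), y].mean()     # cross-entropy term
    return nll + 0.5 * lam * np.sum(W ** 2)          # plus (lambda/2)||W||^2
```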

The conventional objective function for classification in Eq. (1) does not consider the property that the data usually lies on a specific manifold $\mathcal{M}$ [35], which reveals the nonlinear dependency of the data. Modeling this property can actually generate better solutions for many existing problems [17, 14]. In a deep learning approach with error propagation from the top layer, it is more favorable to impose the manifold constraint on the top-layer features. Our inspiration also comes from the idea of preserving manifold structure across different spaces, i.e., the high-dimensional and the low-dimensional spaces. Similarly, the manifold structure of $X$ is assumed to be preserved in the resulting deep features of our model, in order to reduce variation in the higher-dimensional feature space (Fig. 1). We resort to a new manifold constraint in deep learning, arriving at a new problem (P1):

$$\min_{W} \; L_S \quad \text{s.t.} \quad f(x_i) \in \mathcal{M}, \;\; i = 1, \dots, N, \quad (P1)$$

where $f(x_i)$ can be the deep feature of any layer, i.e., $f(x_i) = f_l(x_i)$. Note that the objective in problem (P1) is learnable if $\mathcal{M}$ is given, because $f(x_i)$ is directly related to the learned filters ($W_l$). How to handle the constraint is elaborated in the next section.

1:  Set $t = 0$
2:  Initialize $W^0$, $f^0$
3:  Initialize the manifold set or feature buffer shown in Fig. 2 with $K$ samples, and the learning rate $\eta$
4:  repeat
5:     $t \leftarrow t + 1$;
6:     Compute the loss $L$ based on Eq. 4 or Eq. 6.
7:     Update $W^t$ by $W^t = W^{t-1} - \eta \, \partial L / \partial W$
8:     Update $s_{ij}$ in LLE by minimizing Eq. 2, or update $s_{ij}$ in Laplacian by $s_{ij} = \exp(-\|x_i - x_j\|^2/\sigma)$
9:  until convergence
Algorithm 1 STM for the problem (P1) based on the manifold loss
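To illustrate how the pieces of Alg. 1 fit together, the following self-contained toy sketch optimizes the combined objective of Eq. (6) (defined in Sec. II-B below) with a linear "network" $f(x) = Wx$ on synthetic data. The data, shapes, values of $\gamma$, $\sigma$, $\eta$, and the use of finite-difference gradients are all our own simplifications for the demo, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_in, d_f, C = 60, 5, 3, 2
X = rng.normal(size=(N, d_in))                 # synthetic inputs x_i
y = (X[:, 0] > 0).astype(int)                  # synthetic labels y_i

W = rng.normal(scale=0.1, size=(d_f, d_in))    # "network" weights: f(x) = W x
V = rng.normal(scale=0.1, size=(C, d_f))       # last FC layer
gamma, sigma, eta = 0.1, 1.0, 0.5              # illustrative values only

# Laplacian affinities s_ij of Eq. (5), computed once in the input space.
S = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / sigma)

def loss(params):
    W_, V_ = params
    F = X @ W_.T                               # deep features f(x_i)
    logits = F @ V_.T
    logits -= logits.max(1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(1, keepdims=True)
    ce = -np.log(p[np.arange(N), y]).mean()    # softmax term of Eq. (1)
    # Manifold term of Eq. (6), scaled by 1/N^2 to keep the demo stable.
    man = (S * ((F[:, None] - F[None]) ** 2).sum(-1)).sum() / N ** 2
    return ce + gamma * man

def num_grad(params, i, eps=1e-5):             # finite differences (demo only)
    g = np.zeros_like(params[i])
    for idx in np.ndindex(*g.shape):
        params[i][idx] += eps
        up = loss(params)
        params[i][idx] -= 2 * eps
        dn = loss(params)
        params[i][idx] += eps
        g[idx] = (up - dn) / (2 * eps)
    return g

params = [W, V]
for t in range(100):                           # "repeat ... until convergence"
    for i in range(len(params)):
        params[i] -= eta * num_grad(params, i)
print("final loss:", loss(params))
```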

II-B Manifold loss

Locally linear embedding (LLE): Solving the above problem requires knowing the manifold $\mathcal{M}$. Here we hypothesize it to be any manifold, e.g., one given by LLE. According to LLE, each data point and its neighbors are assumed to lie on a locally linear patch of the manifold. Hence we compute the linear coefficients that reconstruct each data point from its neighbors by minimizing the reconstruction error:

$$\epsilon(S) = \sum_{i} \Big\| x_i - \sum_{j=1}^{K} s_{ij} x_j \Big\|^2, \quad (2)$$

which is a manifold loss. Here we define $S = [s_{ij}]$, with $s_{ij}$ being the corresponding weights of the neighborhood data $x_j$, $j = 1, \dots, K$, which is actually the feature buffer set (Fig. 2) for the $i$-th data point in the original data space. We enforce $s_{ij} = 0$ if $x_j$ does not belong to the neighborhood of $x_i$, such that each point is reconstructed only from its neighbors. The optimal weights can be found by solving a least-squares problem with the constraint $\sum_j s_{ij} = 1$. As assumed, a linear embedding process for neighborhood preservation in the feature space is given by:

$$f^t(x_i) \approx \sum_{j=1}^{K} s_{ij} f^t(x_j), \quad (3)$$

where $f^t(x_i)$ is the deep feature from the current layer at the $t$-th iteration for the $i$-th input sample, and the features of its neighbors, or feature buffer (Fig. 2), are denoted by $f^t(x_j)$. In this process, the feature of each sample is linearly reconstructed from its neighbors by the linear coefficients. The reconstruction weight matrix $S$, obtained by minimizing Eq. 2, characterizes intrinsic geometric properties of the data that are invariant to rotations, rescalings, and translations of a data point and its neighbors [23], and is thus related to the manifold $\mathcal{M}$. This is the key part of the proposed algorithm, where the manifold constraint arises. As assumed, replacing the constraint in (P1) amounts to incorporating Eq. 3 into our objective; this is the modularity alluded to previously. Based on the Lagrangian multiplier method, Eq. 3 is introduced to solve problem (P1) via a new objective:

$$L = L_S + \gamma \sum_{i} \Big\| f(x_i) - \sum_{j=1}^{K} s_{ij} f(x_j) \Big\|^2, \quad (4)$$

where the scalar $\gamma$ balances the two terms of the objective.
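As a concrete sketch of Eqs. (2)-(4), the following NumPy code computes the reconstruction weights for one sample and the resulting manifold penalty. The regularizer `reg` is our own addition for numerical stability (standard practice for LLE [23], but not discussed in the paper).

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """Reconstruction weights s_i minimizing Eq. (2) for one sample.

    x         : (d,)   input sample x_i
    neighbors : (K, d) its feature-buffer neighbors x_j
    Solves the constrained least-squares problem with sum_j s_ij = 1,
    as in Roweis & Saul [23].
    """
    Z = neighbors - x                        # shift neighbors to the origin
    C = Z @ Z.T                              # (K, K) local Gram matrix
    C += reg * np.trace(C) * np.eye(len(C))  # regularize ill-conditioned C
    w = np.linalg.solve(C, np.ones(len(C)))
    return w / w.sum()                       # enforce the sum-to-one constraint

def lle_manifold_loss(f_x, f_neighbors, w):
    """Penalty of Eq. (4): || f(x_i) - sum_j s_ij f(x_j) ||^2."""
    recon = w @ f_neighbors                  # (d,) linear reconstruction, Eq. (3)
    return np.sum((f_x - recon) ** 2)
```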

Laplacian: Similar to LLE, we can also apply the Laplacian manifold [2] to our problem. As shown in [2], we can deduce the following manifold loss:

$$L_M = \sum_{i,j} s_{ij} \, \| f(x_i) - f(x_j) \|^2, \quad (5)$$

where $s_{ij} = \exp(-\|x_i - x_j\|^2/\sigma)$ is defined to be the exponential distance between the $i$-th and $j$-th samples in the input space [2]; the neighborhood is again the feature buffer set shown in Fig. 2, which improves efficiency. Similarly, we obtain the following objective:

$$L = L_S + \gamma \sum_{i,j} s_{ij} \, \| f(x_i) - f(x_j) \|^2. \quad (6)$$
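A corresponding NumPy sketch for the Laplacian variant of Eqs. (5) and (6); the bandwidth value assigned to `sigma` is illustrative only.

```python
import numpy as np

def laplacian_weights(x, neighbors, sigma=1.0):
    """Exponential affinities s_ij of Eq. (5), computed in the input space.

    sigma is the bandwidth of the heat kernel from [2]; its default here
    is an arbitrary placeholder.
    """
    d2 = np.sum((neighbors - x) ** 2, axis=1)   # squared distances ||x_i - x_j||^2
    return np.exp(-d2 / sigma)

def laplacian_manifold_loss(f_x, f_neighbors, s):
    """Penalty of Eq. (5): sum_j s_ij ||f(x_i) - f(x_j)||^2."""
    diff2 = np.sum((f_neighbors - f_x) ** 2, axis=1)
    return np.sum(s * diff2)
```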

Finally, we arrive at the STM algorithm that solves our problem, summarized in Alg. 1. Regarding convergence, the proposed learning procedure never hurts the convergence of back-propagation, because the newly added variables related to the (convex) manifold loss are solved following a similar pipeline.

II-C Theoretical analysis

In this section, we theoretically show that the manifold loss can lead the learning process to converge to the expectation of the data representation, based on the assumption that the data lies on a manifold. More specifically, Theorem 2 provides a foundation for feature learning: the expected value can be achieved if the manifold structure is transferred from the input space to the feature space. Such a proof is very useful for guiding feature design in various practical applications. Notably, many machine learning tasks require features to be compact and stable during model learning. In other words, the variances among the data are mitigated in the learning process as they converge to a single expected value. In the following, we address how our theorem is involved in the learning stage.

Definition 1: For $v_1, \dots, v_N$, define:

$$E[v_i] = \mu, \quad |v_i - \mu| \le \epsilon_0, \quad (7)$$

where each $v_i$ is a 1D random variable and the $v_i$ are independent. For simplicity, we further model $v_i$ with $v_i \sim \mathcal{N}(\mu, \sigma)$, where $\mathcal{N}(\mu, \sigma)$ is a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$.

Theorem 1: If $v_1, \dots, v_N$ satisfy Definition 1, then:

$$P\big(|\bar{v} - \mu| \ge \delta\big) \le 2e^{-N\delta^2/(2\epsilon_0^2)}, \quad (8)$$

where $\bar{v} = \frac{1}{N}\sum_{i=1}^{N} v_i$ is the average of $\{v_i\}$.

Proof: the details are given in the Appendix.

Theorem 2: If the vectors $v_1, \dots, v_N \in \mathbb{R}^D$ satisfy Eq. 11 in every dimension, then:

$$P\big(\|\bar{v} - E[v]\| \ge \delta\big) \le 2D e^{-N\delta^2/(2D\epsilon_0^2)}, \quad (9)$$

where $\bar{v} = \frac{1}{N}\sum_{i=1}^{N} v_i$ is the average vector, with $v_i = g(x_i)$; $\epsilon_0$ is a constant.

Theorem 2 means that the expectation of $v$ is achieved in a probabilistic way. Before proving the theorem, we first introduce the following proposition and lemma.

Proposition: The most popular approaches, such as ISOMAP [27] and Locally Linear Embedding (LLE) [23], share the underlying idea that a high-dimensional vector representing the data can be mapped into a lower-dimensional space while preserving, as much as possible, the metric of the original space. The distances between all pairs of data points in the embedding space are bounded [4, 16]. Thus, it is claimed that:

$$|g_d(x_i) - g_d(x_j)| \le \epsilon_0, \quad \forall i, j, \quad (10)$$

where $g$ denotes the projection from the original sample $x_i$ to $g(x_i)$, and $g_d(x_i)$ denotes the $d$-th dimension of $g(x_i)$ in the manifold feature space, i.e., the deep feature space obtained based on the manifold loss in this work; $\epsilon_0$ is a constant. Since $g(x_i)$ is controlled by the input sample, it is reasonable to claim that:

$$|g_d(x_i) - E[g_d]| \le \epsilon_0. \quad (11)$$

Lemma: For any vector $u \in \mathbb{R}^D$, we have:

$$\|u\| \le \sqrt{D} \max_{d} |u_d|. \quad (12)$$

The lemma is obvious. Next, we prove Theorem 2.

According to Theorem 1 and Eq. 11, for each dimension $d$ we have:

$$P\Big(|\bar{v}_d - E[v_d]| \ge \frac{\delta}{\sqrt{D}}\Big) \le 2e^{-N\delta^2/(2D\epsilon_0^2)}, \quad (13)$$

where we set $v_d = g_d(x)$. Taking the union bound over the $D$ dimensions and applying Eq. 12, we have:

$$P\big(\|\bar{v} - E[v]\| \ge \delta\big) \le \sum_{d=1}^{D} P\Big(|\bar{v}_d - E[v_d]| \ge \frac{\delta}{\sqrt{D}}\Big) \le 2D e^{-N\delta^2/(2D\epsilon_0^2)}. \quad (14)$$

Thus, Theorem 2 is proved.
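Theorem 2 can also be sanity-checked numerically. The sketch below draws bounded feature vectors from a uniform distribution of our choosing (merely consistent with the boundedness of Eq. 11, not taken from the paper) and compares the empirical tail probability of the averaged representation against the bound of Eq. (9).

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, eps0, delta, trials = 2000, 8, 1.0, 0.3, 2000

# Each coordinate is uniform in [mu_d - eps0, mu_d + eps0], so the
# boundedness assumption |v_d - E[v_d]| <= eps0 of Eq. (11) holds.
mu = rng.uniform(-1.0, 1.0, size=D)
hits = 0
for _ in range(trials):
    v = mu + rng.uniform(-eps0, eps0, size=(N, D))
    hits += np.linalg.norm(v.mean(axis=0) - mu) >= delta

empirical = hits / trials
bound = 2 * D * np.exp(-N * delta ** 2 / (2 * D * eps0 ** 2))
print(f"empirical tail {empirical:.4f} <= bound {bound:.6f}")  # bound holds
```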

Fig. 2: Illustration of the STM architecture.

III Experiments

In this section, we first present the details of how to implement our method within a deep learning pipeline. We then use MNIST digit recognition experiments to show the superiority of our method. Finally, we validate the effectiveness of our method on large-scale visual tasks, including image classification and object tracking.

Fig. 3: Illustration of feature distributions for STM. a) the input manifold (each structure is created separately but shown in one figure); b) baseline CNN features; c) center loss features; d) STM features.
Fig. 4: Test error curves for the ImageNet experiment.

III-A Implementation details

Comparison: We validate our method on various CNN base models, including ResNet [9] and WideResNet [34], and compare the performance with state-of-the-art networks. Center loss [30] is also evaluated under equal settings for comparison. Because the training face database used in [30] is unavailable, we choose other testbeds, namely MNIST, CIFAR, ImageNet and the large-scale OTB-50 tracking database, for a fair comparison.

Manifold: We introduce the manifold loss, i.e., the structure-preserving constraint term, based on LLE or Laplacian. We build a feature buffer (Fig. 2) consisting of a sample's $K$ (e.g., 30) previous samples from the same class, where $K$ denotes the maximum number of nearest neighbors used to calculate the reconstruction weights ($s_{ij}$), exactly as in the LLE or Laplacian manifold. In each iteration we then obtain the feature of the current sample mapped by the network, together with the corresponding subset of features from its neighbor samples, and learn the model with the proposed loss function once the reconstruction weights ($s_{ij}$) are introduced. The above process is used for the MNIST, CIFAR and ImageNet datasets; in the tracking task, we instead divide each sequence into batch sets, which are then used to calculate the manifold for the subsequent learning process. A sketch of the buffer follows.
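A minimal sketch of what such a per-class feature buffer might look like; the class name and its methods are hypothetical, invented here for illustration.

```python
import collections
import numpy as np

class FeatureBuffer:
    """Per-class FIFO buffer holding the K most recent (input, feature) pairs.

    Mirrors the description above: the K (e.g., 30) previous samples of a
    class supply the neighbors used for the reconstruction weights s_ij.
    """
    def __init__(self, K=30):
        self.K = K
        self.store = collections.defaultdict(lambda: collections.deque(maxlen=K))

    def push(self, label, x, f):
        self.store[label].append((np.asarray(x), np.asarray(f)))

    def inputs(self, label):    # neighbors in the input space (for s_ij)
        return np.stack([x for x, _ in self.store[label]])

    def features(self, label):  # neighbors in the feature space (Eq. 3 / Eq. 5)
        return np.stack([f for _, f in self.store[label]])
```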

Settings in CNN: The proposed models are implemented on common libraries (i.e., Caffe and TensorFlow) with our modifications, and can still be trained end-to-end by SGD without introducing many parameters compared to their base models. The architecture of our CNN models is given in Fig. 2; further details can be found in our source code. We train our STMs via Algorithm 1. The manifold loss is added before the FC layer shown in Fig. 2. To further understand how the elements of the framework affect performance, we test our models when the manifold data structure is extracted by different techniques, i.e., LLE (STM-lle) or Laplacian (STM-lap). For fair comparison, most settings in our networks follow those of their base models, except for the learning rate and its policy, which are determined on a validation set for the modified objective function. More detailed settings for each experiment are described in the corresponding subsections.

Method     k    K    Error
STM-lle    14   30   0.36%
STM-lle    22   30   0.41%
STM-lap    –    30   0.36%
STM-lap    –    20   0.38%
TABLE I: Comparisons with different parameters on the MNIST dataset ($k$: number of LLE neighbors; $K$: feature buffer size).

Method         Ref.    Error rate (%)
Base-CNNs      –       0.73
ResNet         [9]     –
STN (affine)   [11]    0.61
centerloss     [30]    0.61
STM-lle        ours    0.36
STM-lap        ours    0.36
TABLE II: Comparisons with CNNs on the MNIST database in terms of error rates.

Method        Ref.                                   CIFAR-10   CIFAR-100
VGG           [26]                                   6.32       28.49
WideResNet    [34] (our TensorFlow implementation)   5.61       22.07
ResNet        [9]                                    6.43       25.16
centerloss    [30]                                   5.58       22.08
STM           ours                                   4.60       20.2
TABLE III: Comparisons with CNNs on the CIFAR-10/100 databases in terms of error rates (%).
Fig. 5: Success and precision plots according to the online tracking benchmark
Fig. 6: Precision plots for the attributes of the online tracking benchmark.

Tracker    Ref.    Precision   Success rate
FCNT       [29]    85.7        47.2
KCF        [10]    74.1        51.3
Cf+CNN     [19]    90.7        61.1
HCFT       [18]    89.1        60.5
MEEM       [36]    83.0        56.6
DSST       [5]     73.9        50.5
STM        ours    91.6        61.2
TABLE IV: Comparisons with state-of-the-art trackers on the OTB benchmark sequences.

III-B Digit recognition

The MNIST dataset of handwritten digits (http://yann.lecun.com/exdb/mnist/) contains a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.

We use a weight decay of 0.0001 and a momentum of 0.9 in our model, with a mini-batch size of 500. The learning rate starts from 0.1 and is divided by 10 at 32k and 48k iterations; the training procedure is terminated at 64k iterations. We do not use any data augmentation for training. The LeNet++ architecture is used as the base CNN for our STM and for center loss [30], for a fair comparison.
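As a minimal sketch, this schedule amounts to a simple step decay (the helper name is ours):

```python
def learning_rate(iteration, base_lr=0.1):
    """Step decay used above: divide by 10 at 32k and 48k iterations."""
    return base_lr * 0.1 ** sum(iteration >= m for m in (32000, 48000))
```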

1) Parameter evaluation and performance comparison: Several parameters affect the performance of the proposed method, i.e., $K$, the size of the feature buffer set, and $k$, the number of neighbors in LLE ($k$ determines the number of local neighbors used in LLE). The results in Table I show that STM-lle achieves its best performance with $k = 14$ and $K = 30$. The performance of the LLE- and Laplacian-based STMs is very similar, but STM-lle requires two parameters while STM-lap requires only one, $K$; we therefore set $K = 30$ in all the following experiments. The balance parameter $\gamma$ is also evaluated (see the GitHub repository), which shows that, within a certain range, this parameter has little effect on the final performance.

2) Illustration: We first conduct experiments to illustrate how the STM method influences the feature distribution (Fig. 3). Unless otherwise noted, STM uses the Laplacian to calculate $s_{ij}$ from the original data space for learning. Fig. 3 shows that the distributions of the STM deep features appear simpler (more parsimonious) than the original ones, because they approach the expectation. This is even more profound in the sense that the compactness does not conflict with structure preservation: STM retains a structure more similar to that of the original manifold than the center loss does. In addition, the center loss features appear more scattered, indicating that our structure preservation is a better strategy for achieving a good representation. In Table II, we report the error rates obtained by different approaches. Ours is far lower than those of the existing approaches, including the center loss, indicating that the manifold loss term indeed increases the discriminative power of the deeply learned features. In addition, STM-lle seems to perform slightly better than STM-lap, depending on the parameter selection. Moreover, the weight calculation for the Laplacian is much easier than that for LLE, so we shorten the notation STM-lap to STM and use it in the following experiments.

III-C Natural object recognition

The CIFAR dataset [12] is a well-known natural image classification benchmark consisting of 60,000 32×32 color images in 10 or 100 classes, with 6,000 images per class in CIFAR-10 and 600 in CIFAR-100. There are 50,000 training images and 10,000 test images. We follow the same protocol as [34]. Four CNNs, namely VGG [26] (the baseline CNN), ResNet [9], WideResNet [34], and centerloss [30], are used as baselines on these datasets.

We use a weight decay of 0.0001 and a momentum of 0.9. These models are trained on two GPUs (Titan XP) with a minibatch size of 128. The learning rate starts from 0.1 and is divided by 10 at 32k and 48k iterations; the training procedure is terminated at 64k iterations, as determined on a 45k/5k train/val split. We follow the same data augmentation as [9] for training: horizontal flipping is adopted, and a 32×32 crop is sampled randomly from the image padded by 4 pixels on each side. For testing, we only evaluate the single view of the original 32×32 image.

Our algorithm is also compared with state-of-the-art algorithms on the image classification task. To be fair, the settings for all algorithms follow WideResNet [34], which we reimplemented. The results in Table III again show that STM significantly improves the baselines (e.g., WideResNet) on both CIFAR-10 and CIFAR-100. We notice that the two most-improved classes in CIFAR-10 are dog (34% better than the WideResNet baseline) and horse (14%), in which significant image variations take place. This implies that considering the manifold structure in feature learning enhances the capability of handling image variations. In addition, the center loss method performs worse than STM due to the severe variations in the CIFAR datasets.

III-D Large-size image classification

The previous experiments were conducted on datasets with small images. To further show the effectiveness of the proposed STM method, we evaluate it on the ImageNet [6] dataset. Unlike MNIST and CIFAR, ImageNet consists of images with much higher resolution. In addition, the images usually contain more than one attribute per image, which may have a large impact on classification accuracy. Since this experiment is only meant to validate the effectiveness of STM on large images, we do not use the full ImageNet dataset, as training a deep model on such a large-scale set would take significant time. Instead, we choose a 100-class subset of ImageNet2012 [6], selecting the 100 classes from the full dataset at a step of 10; a similar subset is also used in [33]. For the ImageNet-100 experiment, we use the same model as the baseline ResNet-101, with the same settings as in the previous experiments. Both methods are trained for 120 epochs. The learning rate is initialized to 0.1 and decreased by a factor of 10 every 30 epochs. Top-1 and Top-5 errors are used as evaluation metrics. The test error curves are depicted in Fig. 4. Compared to the baseline, our STM achieves better classification performance (i.e., Top-5 error: 2.94% vs. 3.16%; Top-1 error: 10.67% vs. 11.94%) with almost the same number of parameters (44.54M). Considering the large variations in ImageNet, STM still achieves better performance than ResNet, and we believe the manifold loss is genuinely effective.

III-E Object tracking

In this section, we evaluate the performance of STM on the tracking problem using 50 sequences from the commonly used OTB tracking benchmark [32].

The OTB benchmark [32] is a large dataset with ground-truth object positions and extents for tracking, and it introduces sequence attributes for performance analysis. It integrates most publicly available trackers into one code library with uniform input and output formats to facilitate large-scale performance evaluation. The performance of most tracking algorithms is reported on 50 sequences with different initialization settings. In this benchmark [32], each sequence is manually tagged with attributes such as illumination variation, scale variation, occlusion, deformation, motion blur, abrupt motion, in-plane rotation, out-of-plane rotation, out-of-view, background clutter and low resolution, indicating what kinds of challenges exist in the video.

We implement our STM model based on VGG-19, with two outputs, for each sequence separately. We randomly collect 50 positive and 200 negative samples for each frame from VOT13, VOT14, and VOT15 (http://www.votchallenge.net/), where the positive and negative examples are selected according to their IoU overlap ratios with the ground-truth bounding boxes. Note that we remove the sequences overlapping with OTB from the training databases. During tracking, we use the same strategy as [18, 19], which learns a discriminative classifier and estimates the translation of target objects by searching for the maximum value of the correlation response map. Similar to KCF [10], the set of correlation response maps based on deep features can hierarchically infer the target translation at each layer, i.e., the location of the maximum value in the last layer is used as a regularization to search for the maximum value of an earlier layer. Our STM tracker is generated by simply replacing the deep model of [18, 19]. For comparison, our baseline algorithms mainly consist of correlation filter and deep learning based trackers, such as KCF, FCNT, and Cf+CNN [19]. The sampling rule is sketched below.
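The positive/negative sampling rule can be expressed with a standard IoU test; since the exact overlap thresholds did not survive extraction, they are left as caller-supplied parameters in this sketch.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def label_samples(boxes, gt, pos_thr, neg_thr):
    """Split an (M, 4) array of candidate boxes into positives/negatives
    by IoU with the ground-truth box; pos_thr/neg_thr are placeholders
    for the paper's (unrecovered) thresholds.
    """
    overlaps = np.array([iou(b, gt) for b in boxes])
    return boxes[overlaps >= pos_thr], boxes[overlaps <= neg_thr]
```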

In Fig. 5, we plot precision against location error, which measures the ratio of successfully tracked frames as the threshold of allowed location error is varied. Here, the location error (x-axis, in pixels) is the distance between the bounding box center and the ground truth. For ease of comparison, we also include the plots of several baseline trackers in the figure. As shown in Table IV, STM and KCF achieve average success rates of 61.2% and 51.3%, while the HCFT and MEEM trackers achieve 60.5% and 56.6%, respectively. In terms of precision, STM and KCF achieve 91.6% and 74.1%, respectively, when the threshold is set to 20 pixels. Moreover, STM and the baseline HCFT obtain 91.6% and 89.1%, respectively, which further confirms that the proposed deep model is effective for object tracking. We also compare with Cf+CNN, one of the latest variants of KCF; the results show that STM still improves precision. We believe the special strategy used in Cf+CNN could also be used to further improve STM. All the above observations clearly demonstrate that imposing the manifold prior constraint during feature learning helps generate more robust features for tracking, enabling superiority over state-of-the-art trackers. Fig. 6 shows the scale variation and illumination attributes, where STM again performs much better than the other trackers, demonstrating its strong capability of handling severe variations.

IV Conclusion

In this paper, we have presented a new concept for representation learning: data structure preservation can help the feature learning process converge to the representation expectation. By doing so, we open up a possible way to learn deep features that are robust to variations of the input data because of this theoretical convergence. The proposed STM method formulates data structure preservation as a constrained objective function optimization problem, which can be solved via the BP algorithm. Extensive experiments and comparisons on commonly used benchmarks show that the proposed method significantly improves the performance of CNNs and achieves better performance than state-of-the-art CNNs.

Appendix

Theorem 1: If $v_1, \dots, v_N$ satisfy Definition 1, then:

$$P\big(|\bar{v} - \mu| \ge \delta\big) \le 2e^{-N\delta^2/(2\epsilon_0^2)}, \quad (15)$$

where $\bar{v}$ is the average of $\{v_i\}$.

Proof: For a fixed $t$ ($t > 0$), the function $e^{tv}$ of the variable $v$ is convex in the interval $[\mu - \epsilon_0, \mu + \epsilon_0]$. We draw a line between the two endpoints $(\mu - \epsilon_0, e^{t(\mu - \epsilon_0)})$ and $(\mu + \epsilon_0, e^{t(\mu + \epsilon_0)})$. The curve of $e^{tv}$ lies entirely below this line. Thus,

$$e^{tv} \le \frac{(\mu + \epsilon_0) - v}{2\epsilon_0} e^{t(\mu - \epsilon_0)} + \frac{v - (\mu - \epsilon_0)}{2\epsilon_0} e^{t(\mu + \epsilon_0)}. \quad (16)$$

According to Eq. 16 and $E[v_i] = \mu$ (actually we may take $\mu = 0$ without loss of generality), we have:

$$E[e^{tv_i}] \le \frac{1}{2} e^{-t\epsilon_0} + \frac{1}{2} e^{t\epsilon_0} = \cosh(t\epsilon_0), \quad (17)$$

where $\mu = 0$. Based on Definition 1 (the $v_i$ are independent), we have:

$$E\big[e^{t\sum_{i=1}^{N} v_i}\big] = \prod_{i=1}^{N} E[e^{tv_i}].$$

Using the Taylor expansion, we have:

$$\cosh(t\epsilon_0) = \sum_{k=0}^{\infty} \frac{(t\epsilon_0)^{2k}}{(2k)!} \le \sum_{k=0}^{\infty} \frac{(t^2\epsilon_0^2/2)^k}{k!} = e^{t^2\epsilon_0^2/2}. \quad (18)$$

With the condition $E[v_i] = 0$, we have:

$$E[e^{tv_i}] \le e^{t^2\epsilon_0^2/2}. \quad (19)$$

Inductively, we have:

$$E[e^{tS_N}] \le e^{Nt^2\epsilon_0^2/2}, \quad (20)$$

where $S_N = \sum_{i=1}^{N} v_i$ is the sum of the input samples. According to Markov's inequality, we have:

$$P(S_N \ge N\delta) = P\big(e^{tS_N} \ge e^{tN\delta}\big) \le e^{-tN\delta} E[e^{tS_N}] \le e^{-tN\delta + Nt^2\epsilon_0^2/2}. \quad (21)$$

We choose $t = \delta/\epsilon_0^2$ (in order to minimize the above expression), and have:

$$P(S_N \ge N\delta) \le e^{-N\delta^2/(2\epsilon_0^2)}. \quad (22)$$

To derive a similar lower bound, we consider $-v_i$ instead of $v_i$ in the preceding proof. Then we obtain the following bound for the lower tail:

$$P(S_N \le -N\delta) \le e^{-N\delta^2/(2\epsilon_0^2)}. \quad (23)$$

So, we have:

$$P\big(|\bar{v} - \mu| \ge \delta\big) \le 2e^{-N\delta^2/(2\epsilon_0^2)}, \quad (24)$$

where $\bar{v} = S_N/N$ is the average. Thus, the theorem is proved.

References

  • [1] A. Iscen, G. Tolias, Y. Avrithis, T. Furon, and O. Chum. Efficient diffusion on region manifolds: Recovering small objects with compact CNN representations. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [2] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
  • [3] A. Birnbaum. A unified theory of estimation. Annals of Mathematical Statistics, 32(1):112–135, 1961.
  • [4] J. Bourgain. On lipschitz embedding of finite metric spaces in hilbert space. Israel Journal of Mathematics, 52:46–52, 1985.
  • [5] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg. Accurate scale estimation for robust visual tracking. In British Machine Vision Conference, 2014.
  • [6] J. Deng, W. Dong, R. Socher, and L. J. Li. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
  • [7] J. Gao, Q. Wang, and H. Li. Grassmannian manifold optimization assisted sparse spectral clustering. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [8] R. Gregory. Concepts and mechanisms of perception. London: Duckworth, 1974.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [10] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):583–596, 2015.
  • [11] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.
  • [12] A. Krizhevsky. Learning multiple layers of features from tiny images. 2009.
  • [13] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3361–3368, 2011.
  • [14] T. Lee, M. Choi, and S. Yoon. Manifold regularized deep neural networks using adversarial examples. arXiv preprint, 2015.
  • [15] E. L. Lehmann. A general concept of unbiasedness. Annals of Mathematical Statistics, 22(4):587–592, 1951.
  • [16] N. Linial, E. London, and Y. Rabinovich. The geometry of graphs and some of its algorithmic applications. In Symposium on Foundations of Computer Science, pages 577–591, 1994.
  • [17] J. Lu, G. Wang, W. Deng, P. Moulin, and J. Zhou. Multi-manifold deep metric learning for image set classification. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1137–1145, 2015.
  • [18] C. Ma, J. Huang, X. Yang, and M. Yang. Hierarchical convolutional features for visual tracking. In IEEE International Conference on Computer Vision, pages 3074–3082, 2015.
  • [19] C. Ma, Y. Xu, B. Ni, and X. Yang. When correlation filters meet convolutional neural networks for visual tracking. IEEE Signal Processing Letters, 23(10):1454–1458, 2016.
  • [20] F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [21] E. Patrick, S. Florent, and K. Renaud. Shape priors using manifold learning techniques. In IEEE Conference on Computer Vision, 2017.
  • [22] N. Ramanan, W. F. Andrew, and C. Roberto. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In IEEE Conference on Computer Vision, 2017.
  • [23] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
  • [24] S. Deutsch, S. Kolouri, K. Kim, Y. Owechko, and S. Soatto. Zero shot learning via multi-scale manifold regularization. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
  • [27] J. B. Tenenbaum, V. De Silva, and J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
  • [28] M. Vestner, R. Litman, E. Rodolà, A. Bronstein, and D. Cremers. Product manifold filter: Non-rigid shape correspondence via kernel density estimation in the product space. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [29] L. Wang, W. Ouyang, X. Wang, and H. Lu. Visual tracking with fully convolutional networks. In IEEE International Conference on Computer Vision, pages 3119–3127, 2015.
  • [30] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499–515, 2016.
  • [31] J. Wright, A. Y. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):210–227, 2009.
  • [32] Y. Wu, J. Lim, and M. Yang. Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1834–1848, 2015.
  • [33] L. Yao and J. Miller. Tiny imagenet classification with convolutional neural networks. CS 231N, 2015.
  • [34] S. Zagoruyko and N. Komodakis. Wide residual networks. In British Machine Vision Conference, 2016.
  • [35] B. Zhang, A. Perina, V. Murino, and A. D. Bue. Sparse representation classification with manifold constraints transfer. In Computer Vision and Pattern Recognition, pages 4557–4565, 2015.
  • [36] J. Zhang, S. Ma, and S. Sclaroff. MEEM: Robust tracking via multiple experts using entropy minimization. In European Conference on Computer Vision, pages 188–203, 2014.