research
Connected domain:
view repo
One of the biggest problems in deep learning is its difficulty to retain consistent robustness when transferring the model trained on one dataset to another dataset. To conquer the problem, deep transfer learning was implemented to execute various vision tasks by using a pre-trained deep model in a diverse dataset. However, the robustness was often far from state-of-the-art. We propose a collaborative weight-based classification method for deep transfer learning (DeepCWC). The method performs the L2-norm based collaborative representation on the original images, as well as the deep features extracted by pre-trained deep models. Two distance vectors will be obtained based on the two representation coefficients, and then fused together via the collaborative weight. The two feature sets show a complementary character, and the original images provide information compensating the missed part in the transferred deep model. A series of experiments conducted on both small and large vision datasets demonstrated the robustness of the proposed DeepCWC in both face recognition and object recognition tasks.
READ FULL TEXT VIEW PDFConnected domain:
Machine learning methods have been applied to deal with various multi-media and computer vision tasks. Traditionally, linear models such as sparse representation (SR)
[30] and collaborative representation (CR) [44] have drawn much attention and gained promising results in image classification. Lately, nonlinear deep learning [12] models, e.g., ResNet [8] and VGG [24], have produced state-of-the-art results in many image-based tasks, including face recognition, object detection, video tracking, etc. Linear sparse models can be utilized to improve deep neural networks
[34]. On the other hand, more and more conventional methods took deep features as input to gain more promising classification results [2, 42, 43]. However, recent studies showed that deep features from neural networks are usually designed for SVM-like classifiers
[13]. Using deep features as sole input in non-SVM classical models could be dubious. For this problem, we believe that the technique of linearly representing images can be applied to enhance nonlinear deep models.Deep learning shows a very strong capacity to learn discriminative image features. CNN features off-the-shelf [23] were demonstrated to be powerful for recognition tasks. Learning deep features can help to obtain state-of-the-art results for different tasks like face recognition [29]
[46], person re-identification [32] and general image classification [22, 26]. The good news is that many conventional machine learning methods can also learn credible features from images. In recent years, Sparse Representation (SR) [30] via regularization has shown huge potential in feature extraction and image classification. On the other hand, regularization-based Collaborative Representation (CR) [44] can also build a similarly robust linear model. The regularization inside CR helps to create an equally discriminative but faster sparse representation [37]. According to our observation, sparseness plays an important role in both linear and nonlinear models. It is likely for these two paradigms, deep and classic representations, to generate a new representation learning model when collaborating with each other.In this paper, we propose a Collaborative Weight-based Classification method that brings deep and classic non-deep representation together, to implement a more promising image classification. We name it DeepCWC for short. The contribution of this work includes: 1) proposing a new classifier to integrate features from linear and nonlinear models, 2) giving an analysis on how black-box deep features work in a sparse classification model, 3) conducting image classification experiments on different CNN models and convolutional layers inside them, to demonstrate the performance of DeepCWC in a consistent and comprehensive way. The proposed method produces promising results on face and object recognition. In particular, it ranks first in recognition (97.66%) on the Fashion-MNIST dataset.
The root inspiration comes from the popular deep residual network (ResNet) [8]. ResNet keeps an identity map learned from the last layer, and applies it to next layer of learning. Then, it constructs a new building block , as shown in Fig. 1
(a). This explicitly allows these layers to fit a residual mapping, so as to make it easier for the residual to be zero (sparse) than to fit an identity mapping by a stack of nonlinear layers. In this way, the discrimination learned in the previous layers will be propagated layer-by-layer. Also, there would be a linear transformation between some layers, if two connected blocks have a different dimension, as shown in Fig.
1(b). Denote the linear projection as , where the building blocks become .ResNet attracts great attention and progresses observably [9], while our main focus is the way how it utilizes the prior information. It is possible to include sparse learning, as prior information, in deep neural networks as well. For example, grouping multiple sparse regularizations for simultaneously optimizing deep neural networks [21]. Wen et al. proposed to learn a structured sparsity in deep neural networks to regularize the inside structures (i.e., filters, channels, filter shapes, and layer depth) [28]. Afterwards, a fixed linear sparse filter can be cascaded with a thresholding nonlinearity to maximize sparsity in deep neural networks [34]. It becomes an emerging trend to utilize linear sparse models to collaborate with nonlinear neural networks.
As shown in Fig. 1(c), our idea has a similar structure following the building block of ResNet. The key is fusing the identity map learned from -norm collaborative representation [44] to the result after deep residual learning. To simplify the implementation structure, the linear model is not injected into the building block of the neural network. Instead, it performs on the classifier. There are several reasons for this structure. Firstly, it helps to avoid overheads in the training process of the neural networks. Furthermore, it creates a more general structure that can be easily extended to other types of neural networks, which are not limited to ResNet. For example, we also implement this in Inception [25] and VGG [24], which will be demonstrated in Sec. 4. The idea behind pairing nonlinear deep learning with an additional linear representation is to make the network more capable in different classification tasks.
However, the usage of the prior learned information is different in our implementation. ResNet adds up the learned in model training, while the proposed DeepCWC will introduce a collaborative weight in classification, which is obtained by an element-wise multiplication instead of addition. The next section explains the detailed implementation.
The key idea in Deep Collaborative Weight-based Classification is straightforward: using the model of Collaborative Representation (CR) [44] to learn a classifier from the original images and deep features in pairs. CR is based on normalization and emphasizes the collaboration among all samples in the representation. However, more and more evidence points to the fact that the collaboration requires help from the sparseness in the representation to maintain a high level of performance [1]. We believe that using deep features is one of the possible solutions.
In CR based classification (CRC), the role of collaboration among classes is stressed, rather than sparsity in the representation, when representing a test sample. Let denote the training samples selected from all classes, while is the test sample. Both and will be normalized to have -norm. The representation of by can be denoted as an approximate linear problem , where is the representation coefficient to be solved.
First of all, a regularized least square method [43] is used to solve the problem and perform the collaborative representation of the original image sample as follows
(1) |
where is the regularization parameter, which introduces a certain number of “sparsity” to the solution. The solution of this linear problem by using regularized least square can be derived as
(2) |
Let , such that is a projection matrix that can be pre-solved and independent of the test sample . The projection makes CR much faster than the conventional SR. It is noted that this operation may not fit in the memory of a large-scale dataset [38]. In our implementation, we used an incremental strategy [33, 20] to deal with this problem, despite the fact that there are other potential solutions, e.g., dictionary learning [2, 35].
From this, we can obtain the coefficient vector related with the th class. Typically, in SRC the coefficient is utilized to solve the representation residual of a specific class by . Besides this, in the original CRC implementation, the -norm of the sparse coefficient was also added to obtain more discrimination when performing classification [44]. Finally, the residual is obtained by
(3) |
At the same time, a side-by-side CR is performed on deep features. The so-called deep features are specific to the layer in the deep model. Theoretically, any layer can be utilized to extract deep features. For example, the layer [8, 9, 25] in the ResNet and Inception models. Here, we denote the deep features from one specific layer () of the neural networks for all training samples as
(4) |
and therefore the query sample becomes
(5) |
The size of the feature set is determined by the design of the specific layer, where it is normally mismatched with the size of the original images. It is hard to simply perform integration on the feature level. Therefore, fusion of the feature pairs will be performed on the residuals, which have the size same as the class number.
Then, the representation coefficient obtained from the deep features by a similar CR process is
(6) |
After that, the residual between the query feature set and each class of training feature sets can be solved with the same method as Eq. (3)
(7) |
Our proposed method manages to retrieve this part of the missed information via a novel fusion operation. The fusion is performed on two residuals, since both have an equal dimension depending on the number of classes. Therefore, fusion on the residuals is not only straightforward, but also faster.
Let us denote the residuals solved from two groups of samples as
and , where is the number of classes in the dataset. Then, the fusion via the collaborative weight is performed on the residual vector via an element-wise multiplication,
(8) |
where the residual entry related with the th class is calculated by the collaborative weight. This weight means that each entry in the residual vector is assigned a weight solved by the collaborative representation of the original images. The information carried by this weight compensates the missing part of the abstract higher layers in a neural network. In this way, we obtain the final fusion residual. Although additional or weighted averages are a more common approach to perform fusion in many other methods [40, 42, 27], they require a set of fine-tuned factors to obtain a good result. What is more, we observed a descending accuracy when adding up two residuals.
Finally, we classify the test sample to a class with minimal residual as follows
(9) |
The idea of our collaborative weight is simplistic and intuitive. The collaborative weights are determined by the relative contribution of each class from the original samples. Each residual of the deep features is overlapped by a weight solved using the collaborative representation of the original samples, in order to integrate its contribution.
To answer this question, we first need to answer another question: What is the relationship between collaboration and sparsity? Many had tried to give an answer with some considering sparsity as being more important [30, 6]. On the other hand, others insist that collaboration matters more [44, 18]. However, as for the rest, they treat collaboration and sparsity as equal factors [41, 40, 1]. So far, the last viewpoint well explains our proposed collaboratively weighting deep and non-deep representation.
The subspace occupied by the columns of a sparse dictionary can be denoted as a set of . Fig. 2 shows the geometrical illustration of this subspace. A test sample can be approximately represented by the columns of , and the error is . In addition, vector can be decomposed to and , as depicted in Fig. 2 (a1). According to [44], the angles and together decide the robustness of the CR model. However, Fig. 2 (a2) shows that it is likely to have more than one right answer, which is depicted as the circle. The distances of and are the same. Therefore, CR by itself without considering sparsity may not be robust enough [1].
Fig. 2 (b1) and (b2) show how the sparsity can help in the CR model. In these two cases, where (b1) and (b2) , and are two paths to points and . The distances are also the same in these two instances, while in (b1) and in (b2). This means that the class-specific residuals are equal in both cases, but construction of vectors and may be different. The components consisting of the path depict the sparsity in the representation, where fewer steps of indicates a sparser collaborative representation coefficient . Using sparser features in CR can help to produce a more robust classification, which is the very reason why DeepCWC works.
The black box deep model provides an unpredictable sparsity in the deep features. Currently, we can only accept this fact according to the largely contracted dimension of the deep feature set, with feature learning being the most powerful characteristic of deep learning. As shown in Fig. 2 (c), the effort of CR on deep features can be painted as a random and unpredictable curve between and . This can be treated as potentially the most efficient path and is also the result observed in the experiments.
When the deep features are ready and fit well in the CR model, the next problem is how to consolidate both of them into an united set. This is where the collaborative weight works. Previous work showed that well constructing the residuals is helpful to generate a robust sparse model, i.e., the two-phase sparse representation model [36]. To illustrate the impact of the weight, we captured some runtime data from our experiments, which is plotted in Fig. 3.
The purpose of the collaborative weight is to expand the more promising residuals, while restricting the other ones. The target class is selected by a final minimization, hence, we look for the smallest values. For example, the residuals of classes 1-10 are shown on the model in Fig. 3 (a). The correct class label is 1, where the distances of CRC on images and deep features are both below 1 (0.72). However, the minimal values of them are 0.65 and 0.62, respectively. This could lead to a wrong classification result (Class 4 and 10). After the fusion, the resultant residual becomes even smaller, resulting in the correct class being chosen (Class 1). This ensures that the classification will not be affected by other nearby classes. The same phenomenon can be found on other models, which are annotated in Fig. 3 (b) - (f). On the other hand, the classes with a larger distance value, e.g., , the fusion will make it much larger (, if and ). This in turn helps to avoid negative results.
Another point to note in the fusion is that the parameters do not need to be tuned. This is not like other conventional fusion schemas, which usually introduces one or two fusion factors [40, 42, 27], where more parameters call for more tuning. The fusion result is determined by the residuals themselves. We barely need to look for an optimal parameter and maintains the effectiveness of DeepCWC.
This section describes the experiments to demonstrate the robustness and performance of the proposed method. First of all, six facial datasets, including FERET [19], MUCT [16], Yale B [7], Georgia Tech (GT) [4], AR [15], and ORL [3], are selected to evaluate the performance of face recognition. These experiments were conducted due to the fact that CRC is usually applied to face recognition. Secondly, another set of experiments had been performed on some object datasets, including a leaf dataset Flavia [26], and three object datasets CIFAR-10 [10], Fashion-MNIST [31] and COIL-100 [17], which are often utilized to evaluate deep learning methods.
Also, in order to evaluate the robustness after introducing the collaborative weight between the original images and the deep features, we extracted the deep features using multiple state-of-the-art deep CNN models, including ResNet_v1_101, ResNet_v2_101, Inception_v4, Inception_ResNet_v2, VGG_16 and VGG_19. All of these are trained previously on the ImageNet dataset
[5]in Google TensorFlow
^{2}^{2}2https://github.com/tensorflow/models/tree/master/research/slim#Pretrained. Our assumption is the proposed DeepCWC works on different deep CNN models. The feature extraction is performed on the TensorFlow-Slim library. Besides this, another goal is to investigate which layer of features in a CNN model are more suitable for collaborative weight. Based on the considerations, we conducted a set of relevant experiments and obtained the following results.We ran experiments on six popular benchmark facial datasets. These datasets are relatively small. The smaller datasets do not contain enough samples to train a robust model by CNN, but we can extract the deep features using pre-trained deep CNN models. Our goal in this group of experiments is to compare the classification result between our proposed method and state-of-the-art methods. The best results are shown in Fig. 4 (a).
It is clear that the proposed DeepCWC method outputs a higher recognition accuracy than normal CRC, CRC using deep features and other state-of-the-art methods, no matter which CNN model is used to extract the deep features. It is uncertain that using deep features would generate a higher recognition accuracy than using the original images. For example, when utilizing the ResNet_v1_101 model to extract features of the AR dataset, CRC performs better on original images than deep features, as shown in Fig. 4 (a). And this is also true in some other experimental cases, which shows one of the limitations of a typical deep learning method. This is the very reason why we proposed the DeepCWC.
No matter which dataset, performing fusion of two feature sets based on the collaborative weight generates a higher recognition accuracy. Even in cases that merely use deep features without collaborative weight, e.g., on FERET, MUCT, Yale B, GT, AR and ORL, the proposed DeepCWC helps the recognition by fusing features from the original images and the CNN models.
The next set of experiments were performed on some leaf and object datasets, including Flavia (leaf), COIL-100, CIFAR, and Fashion-MNIST. The results are consistent on all datasets, as shown in Fig. 4 (b). Incorporating the deep features learned by ResNet_v1_101, the recognition accuracy (yellow) is much higher than the result on the original images (blue), except the result on the Flavia. However, DeepCWC further pushes the recognition up to an even higher level, and the improvements are stable on all datasets.
The highest accuracy is obtained on the COIL-100 dataset when using the first 60 samples in each class (83%) as the training samples. Deep features are beneficial to classification on this dataset, where DeepCRC (up to 98.83%) outperformed CRC (only 69.0%) by over 30%. That being said, the proposed DeepCWC still produces the highest accuracy of 99.42%, which reached a state-of-the-art level in recognition. On the Flavia leaf dataset, the improvement generated by collaborative weight is remarkable, though the accuracy is relatively lower, as shown in Fig. 4 (b). The results on the CIFAR-10 dataset get an improvement as well. And the improvement (the column Impr) is calculated by the rate of the accuracy from the DeepCWC over the higher one between CRC on images and deep features, and the improvements on the Flavia and Fashion-MNIST are up to 21.01% and 12.41%, respectively.
Two versions of the ResNet pre-trained models, ResNet_v1_101 [8] and ResNet_v2_101 [9], are tested in this set of experiments. As described above, we borrowed a similar architecture idea from the deep residual network, as shown in Fig. 1. For this reason, we design the first implementation of DeepCWC based on ResNet. There are 101 layers in the network, and we evaluate the proposed method on the features obtained from three layers, which are , ^{3}^{3}3 or . and ^{4}^{4}4 or . from shallower to deeper. The layer is the layer before the last convolutional layer, while the layer is before . These two layers have the same feature set size (1000). However, the layer is before this layer and has a larger size of 2048. The classification results are demonstrated in Fig. 5 (a) and (b).
In both models, performing CRC on deep features obtained from each layer produces a higher accuracy than using the original images. DeepCWC generates an even higher accuracy than both of these. We observe exactly the same result () in both and
layers. For classification, the feature maps of the last convolutional layer are fed into fully connected layers followed by a softmax logistic regression layer
[11]. The global average pooling [14] is introduced to avoid overfitting in the fully connected layers. Using the features maps captured from the layer produces a slightly higher accuracy (). Every result reaches a state-of-the-art level, and is higher than all current implementations (See subsection 4.4). It is noted that the computation time increases due to a larger size (double) of the feature maps from layer . Therefore, the (or ) layer should be a better choice when considering the balance between accuracy and speed.To investigate the performance when using a different CNN model, two Inception models are evaluated in a similar way. They are Inception_v4 and Inception_ResNet_v2 models pre-trained on ImageNet. Besides the layer, the and layers are also utilized to extract deep features, before being fed into to the linear CR model. A set of similar results are observed in the experiments, as shown in Fig. 5 (c) and (d).
DeepCWC achieves an accuracy over on deep features obtained from three layers. The highest result () is the one with the largest size (2048) from the layer. In this case, the layer, with a smaller size (1001) than , produced an approximately equal accuracy of . The result from is close to this. In fact, all of the results in DeepCWC for the three cases are stable and close to each other.
The last set of models are of the VGG implementation. We chose VGG-16 and VGG-19 models, and utilized the feature maps from their , and layers. The size of both and is 4096, while the last layer has a smaller size of 1000. The largest feature set in this group of experiments produced the most promising classification results. What is more, the trend is consistent with before. As shown in Fig. 5 (e) and (f), the highest accuracy is up to 97.66%, using the shallower layer in VGG-16, which is also the most promising result we obtained using this dataset and in all cases. The results achieved by VGG-19 are slightly lower than this, but higher than the other cases. The larger feature size helps to produce a more accurate classification.
Deep Model () | Preprocessing | Highest Accuracy |
CapsNet | None | 90.6% |
VGG16 26M parameters | None | 93.5% |
GoogleNet with cross-entropy loss | None | 93.7% |
MobileNet | Yes | 95.0% |
DenseNet-BC 768K params | Yes | 95.4% |
Dual path wide resnet 28-10 | Yes | 95.7% |
WRN-28-10 + Random Erasing | Yes | 96.3% |
WRN40-4 8.9M params | Yes | 96.7% |
Ours | ||
DeepCWC with Inception_v4 | None (global pool) | 97.24% |
DeepCWC with Inception_RN_v2 | Yes (global pool) | 97.25% |
DeepCWC with ResNet_v2_101 | Yes (global pool) | 97.33% |
DeepCWC with ResNet_v1_101 | None (global pool) | 97.36% |
DeepCWC with VGG_19 | None (fc6) | 97.52% |
DeepCWC with VGG_16 | None (fc6) | 97.66% |
Data claimed in https://github.com/zalandoresearch/fashion-mnist | ||
CapsNet in https://github.com/naturomics/CapsNet-Tensorflow#results |
Model | Layer | Feature Size | Time (sec) | Speed (sec/image) |
---|---|---|---|---|
VGG-16 | /fc6 | 4096 | 16342.347 | 0.233 |
VGG-16 | /fc8 | 1000 | 6826.632 | 0.098 |
VGG-19 | /fc6 | 4096 | 16567.730 | 0.237 |
VGG-19 | /fc8 | 1000 | 6526.632 | 0.093 |
Inception_v4 | 1001 | 7177.612 | 0.103 | |
Inception_v4 | /global_pool | 1536 | 8626.181 | 0.123 |
Inception_RN_v2 | /Logits | 1001 | 6734.609 | 0.096 |
Inception_RN_v2 | /global_pool | 1536 | 7884.362 | 0.113 |
ResNet_v1_101 | /global_pool | 2048 | 9108.256 | 0.207 |
ResNet_v1_101 | /logits | 1000 | 7302.812 | 0.104 |
ResNet_v2_101 | /global_pool | 2048 | 8373.421 | 0.120 |
ResNet_v2_101 | /logits | 1000 | 7254.657 | 0.104 |
The results obtained on the pre-trained models (over 97%) are all state-of-the-art, as shown in in Tab. 1. According to the description of the current methods, all of them are tuned and trained on the Fashion-MNIST dataset locally. Also, most of them applied one or two preprocessing techniques, and used the same deep neutral network architecture, e.g., VGG, ResNet, etc. Previously, the most promising accuracy was obtained by the Wide Residual Networks (WRN) model [39], both of which applied the standard preprocessing (mean/std subtraction/division) and augmentation (random crops/horizontal flips). For example, one used the random erasing technique [45] and produced an accuracy of 96.3% ^{5}^{5}5https://github.com/zhunzhong07/Random-Erasing, while the other one with 96.7% accuracy had 8.9 M parameters and utilized freezing layers ^{6}^{6}6https://github.com/ajbrock/FreezeOut.
Compared to current state-of-the-art methods, the proposed DeepCWC produces a higher result using multiple CNN models. The classification accuracy ranges from 97.24% to 97.66%, which are all higher than previous methods. The highest accuracy is generated on VGG-16 from the layer with a size of 4096.
Our experimental machine was configured with the following hardware, including an Intel^{®} Core^{TM} i7-7820X CPU@3.60GHz x 16, 64 GB RAM, 1.3 TB SSD and one NVIDIA TITAN Xp GPU. The code was run on TensorFlow 1.6, MATLAB R2016 and Ubuntu 16.04 OS. The recorded time consumption of each experimental case is shown in Tab. 2.
This time includes the whole training and testing of both the original images and deep features in CR, but does not count the time for feature extraction by the pre-trained models. Therefore, the speed (seconds per sample) is calculated by dividing the total time by the size of dataset (70000). Considering that the running state of the machine may fluctuate, the speed is between 0.1 - 0.2 seconds per sample. Furthermore, the following can be discussed about the proposed method.
Linear representation such as CR can improve deep neural networks based representation learning. Even using a pre-trained model, the proposed DeepCWC achieved a state-of-the-art classification result on the Fashion-MNIST dataset. The accuracy and performance outperformed current popular methods as well. This gives us a clue that linear methods have a new way to cooperate with nonlinear models.
The collaborative weight of two diverse representations help produce an accurate classifier. Currently, more work is focusing on the neural network architecture and/or parameter tuning. However, the proposed DeepCWC neither pays much attention to deep learning model itself, nor tunes any parameters. Fusing multiple representations creates a robust classifier that works well on multiple deep learning models. The results are all at a state-of-the-art level.
The global average pool layer shows an effective capacity to extract discriminative features. Global average pooling was proposed to enforce the learning of the class level feature maps [14]. The experiments on layer analysis showed that using features extracted from the global average pool layer can produce a higher accuracy. This is true in the Inception and ResNet models, as shown in Fig. 4.3 (a) and (d), and Tab. 1. That being said, it needs more computation due to the relatively larger size of the feature set from this layer.
Multiple layers in a deep CNN model show an effective capacity to extract discriminative features, including the global average pooling layer [14] and the fully connected layer. This is confirmed in the experiments, as shown in Fig. 4.3, and Tab. 1. However, the size of feature map decides the time consumption.
The proposed DeepCWC takes the NO. 1 position in current benchmark rank of Fashion-MNIST. The most promising result is obtained on the VGG-16 model, which outperforms current leaders mainly using the WRN model.
We propose a deep collaborative weight-based classification (DeepCWC) method. It first performs the linear representation on original images and deep features, extracted from nonlinear neural networks. Then, both of them collaboratively weight each other to build a strong discriminative classifier. The method is extensively evaluated using multiple popular deep CNN models, like ResNet, Inception, and VGG. The experimental results are promising on more than one layers in these neural networks, with most of the results belonging to a state-of-the-art level.
The -norm based CR model is chosen as the linear constraint in this work to enhance the classification based on pre-trained CNN models. However, there are still some questions, for example, whether there are other linear models (like sparse representation, dictionary learning, etc.), more suitable for the same task, or whether it can bring one more step of break-through when applied on locally trained and tuned CNN models. We will keep working on these open topics in the future.
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
Inceptionv4, inceptionresnet and the impact of residual connections on learning.
In AAAI, pages 4278–4284, 2017.