Recently, deep neural networks (DNNs) have demonstrated impressive results in image classification [krizhevsky2012imagenet, he2016deep, 8794588], object detection [girshick2014rich, ren2015faster, ma2018mdcn, zhu2018visdrone], instance segmentation [he2017mask], depth estimation [he2018learning, he2018spindle], and face recognition [cen2019dictionary]. The success of DNNs has been made possible largely by large annotated datasets [deng2009imagenet], as well as by advances in computing resources and better learning algorithms [goyal2017accurate, Zhang_2018_CVPR]. Most of these works assume that the input images are of sufficiently high resolution.
The limitation of requiring large amounts of data to train DNNs has been alleviated by transfer learning. A common way to use transfer learning with DNNs is to start from a model pre-trained on a similar task or domain, and then fine-tune its parameters for the new task. For example, a model pre-trained on ImageNet for classification can be fine-tuned for object detection on Pascal VOC [girshick2014rich, ren2015faster].
In this paper, we focus on low resolution image classification: for privacy reasons, it is common to use low resolution images in real-world applications, such as face recognition in surveillance videos [zou2011very]. Without additional information, learning from low resolution images reduces to an ill-posed optimization problem and yields much degraded performance [pinheiro2015learning].
As shown in Fig. 1, the deep features of high resolution images extracted from a pre-trained convnet already form discriminative per-class representations, and are therefore well separated in the t-SNE visualization. In contrast, the extracted features of low resolution images are mixed together. A possible solution is to exploit transfer learning, carrying the discriminative feature representation over from high resolution images to low resolution images.
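The kind of visualization shown in Fig. 1 can be reproduced in spirit with scikit-learn; the features below are synthetic stand-ins for convnet features (cluster means, dimensions, and sample counts are all illustrative, not the paper's setup):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic stand-ins for 16-D convnet features: three well-separated
# per-class clusters, mimicking the high resolution case in Fig. 1.
feats = np.concatenate([rng.normal(loc=c, scale=0.3, size=(30, 16))
                        for c in (0.0, 4.0, 8.0)])
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(feats)
# emb has shape (90, 2): one 2-D point per feature, ready for a scatter plot
```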
In this paper, we propose a simple yet effective unsupervised deep feature transfer approach that boosts classification performance on low resolution images. We assume access to high resolution labeled images during training, but only low resolution images at test time. Most existing datasets are high resolution, and it is much easier to label subcategories in high resolution images, so we believe this is a reasonable assumption. We aim to transfer knowledge from such high resolution images to real-world scenarios that only have low resolution images. The basic intuition behind our approach is to use the high quality discriminative representations of the training domain to guide feature learning for the target low resolution domain.
The contributions of our work are three-fold.
No fine-tuning of convnet filters is required in our method. We use a pre-trained convnet to extract features for both high resolution and low resolution images, and then feed them into a two-layer feature transfer network for knowledge transfer. An SVM classifier is learned directly on the transferred low resolution features. Our network can be embedded into state-of-the-art DNNs as a plug-in feature enhancement module.
Our method preserves the data structure of the high resolution feature space, by transferring the discriminative features from a well-structured source domain (the high resolution feature space) to a less organized target domain (the low resolution feature space).
Our approach outperforms the baseline that classifies directly on features extracted from low resolution images.
2 Related Work
Our method is closely related to unsupervised learning of features and transfer learning.
Unsupervised learning of features: Clustering has been widely used for image classification [caron2018deep, yang2016joint, ji2018invariant]. Ji et al. [ji2018invariant] propose invariant information clustering, which relies on optimizing the mutual information between related pairs for unsupervised image classification and segmentation. Caron et al. [caron2018deep] present a clustering method that jointly learns the parameters of a neural network and the cluster assignments of the resulting features. Yang et al. [yang2016joint] jointly learn deep representations and image clusters by combining agglomerative clustering with CNNs, formulated as a recurrent process.
Transfer learning: It is commonly used when the training and testing data distributions differ. Saenko et al. [saenko2010adapting] learn a regularized non-linear transformation in the context of object recognition to minimize the effect of domain-induced changes in the feature distribution. Chen et al. [chen2015net2net] transfer the knowledge stored in a previously trained network into a new deeper or wider network to accelerate its training. Yosinski et al. [yosinski2014transferable] experimentally study the transferability of hierarchical features in deep neural networks. Azizpour et al. [azizpour2016factors] investigate the factors affecting the transferability of generic deep convolutional networks, such as the network architecture and the distribution of the training data. Tzeng et al. [tzeng2015simultaneous] learn a CNN architecture that optimizes domain invariance and transfers information between tasks. Long et al. [long2015learning] propose a deep adaptation network that matches the mean embeddings of different domain distributions in a reproducing kernel Hilbert space. Guo et al. [guo2019spottune] propose an adaptive fine-tuning approach that finds the optimal fine-tuning strategy per instance of the target data. Readers can refer to [pan2010survey] and the references therein for details about transfer learning.
3 Proposed Approach
This section describes the proposed unsupervised deep feature transfer approach.
With the recent success of deep learning in computer vision, deep convnets have become a popular choice for representation learning, mapping raw images to an embedding vector space of fixed dimensionality. In the supervised setting, they can surpass human performance on standard classification benchmarks [he2015delving, krizhevsky2012imagenet] when trained with large amounts of labeled data.
Let $f_\theta$ denote the convnet mapping function, where $\theta$ is the set of corresponding learnable parameters. We refer to the vector obtained by applying this mapping to an image as its feature. Given a training set $X = \{x_1, x_2, \ldots, x_N\}$ of $N$ images and the corresponding ground-truth labels $\{y_1, \ldots, y_N\}$, we want to find a parameter $\theta^*$ such that the mapping $f_{\theta^*}$ produces good general-purpose features. Each image $x_n$ is associated with a class label $y_n$. Let $g_W$ denote a classifier with parameters $W$, which predicts the label on top of the feature $f_\theta(x_n)$. The parameters $\theta$ of the mapping function and $W$ of the classifier are learned jointly by optimizing the following objective function:
$$\min_{\theta, W} \frac{1}{N} \sum_{n=1}^{N} \ell\big(g_W(f_\theta(x_n)), y_n\big), \qquad (1)$$
where $\ell$ is the multinomial logistic loss measuring the difference between the predicted labels and the ground-truth labels over the $N$ training samples.
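As a concrete illustration, the multinomial logistic (softmax cross-entropy) loss of a linear classifier on top of fixed features can be sketched in NumPy; all array shapes here are illustrative, not the paper's settings:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def multinomial_logistic_loss(W, b, feats, labels):
    """Average cross-entropy of a linear classifier g_W over N samples."""
    probs = softmax(feats @ W + b)
    n = feats.shape[0]
    return -np.log(probs[np.arange(n), labels] + 1e-12).mean()

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))      # N=8 features of dimension 16
labels = rng.integers(0, 3, size=8)   # 3 classes
W, b = np.zeros((16, 3)), np.zeros(3)
loss = multinomial_logistic_loss(W, b, feats, labels)
# with zero weights the predictions are uniform, so loss = log(3)
```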
3.2 Unsupervised Deep Feature Transfer
The idea of this work is to boost feature learning for low resolution images by unsupervised deep feature transfer from discriminative high resolution features. An overview of the proposed approach is shown in Fig. 2. It consists of three modules: feature extraction, unsupervised deep feature transfer, and classification, discussed below.
Feature extraction. We observe that the deep features extracted from a convnet form well separated clusters, as shown in Fig. 1. Therefore, we introduce transfer learning to boost low resolution feature learning with supervision from high resolution features. We extract the features (N-dimensional) of both high and low resolution images from a pre-trained deep convnet. More details are given in Sec. 4.2.
Unsupervised deep feature transfer.
We propose a feature transfer network to boost low resolution feature learning. Under our assumption, however, ground truth labels for low resolution images are absent, so we must make use of the information in the high resolution features. To do so, we cluster the high resolution features and use the resulting cluster assignments as "pseudo-labels" to guide the learning of the feature transfer network, which takes low resolution features as input. Without loss of generality, we use a standard clustering algorithm, k-means. It takes the high resolution features extracted from the convnet as input and clusters them into distinct groups based on a geometric criterion. The pseudo-label of each low resolution feature is then assigned by finding its nearest neighbor among the centroids of the high resolution features. Finally, the parameters of the feature transfer network are updated by optimizing Eq. (1) with mini-batch stochastic gradient descent.
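The pseudo-labeling step can be sketched with scikit-learn; the feature dimension, sample counts, and cluster count below are placeholders, not the paper's settings:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
hr_feats = rng.normal(size=(200, 32))   # high resolution convnet features
lr_feats = rng.normal(size=(200, 32))   # low resolution convnet features

# 1) cluster the high resolution features with k-means
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(hr_feats)

# 2) assign each low resolution feature the label of its nearest HR
#    centroid; KMeans.predict performs exactly this nearest-centroid lookup
pseudo_labels = km.predict(lr_feats)
# pseudo_labels now supervise the feature transfer network via Eq. (1)
```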
Classification. The final step is to train a commonly used classifier, such as a Support Vector Machine (SVM), on the transferred low resolution features. At test time, given only low resolution images, our algorithm first extracts their features, then feeds them to the learned feature transfer network to obtain the transferred low resolution features, and finally runs the SVM to obtain the classification results.
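This final stage can be sketched as follows; `LinearSVC` stands in for the SVM used in the paper, and the "transferred" 100-D features below are synthetic:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# synthetic transferred low resolution features for a binary task
X_train = np.vstack([rng.normal(-1, 0.5, size=(50, 100)),
                     rng.normal(+1, 0.5, size=(50, 100))])
y_train = np.array([0] * 50 + [1] * 50)

clf = LinearSVC(C=1.0).fit(X_train, y_train)  # train the SVM classifier

X_test = np.vstack([rng.normal(-1, 0.5, size=(10, 100)),
                    rng.normal(+1, 0.5, size=(10, 100))])
pred = clf.predict(X_test)  # class predictions for the test features
```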
We conduct low resolution classification on the PASCAL VOC2007 dataset [Everingham15] with 20 object classes, training on the VOC2007 trainval set and evaluating on the VOC2007 test set. However, the dataset contains high resolution images only, so we follow [lin2014microsoft] to generate the low resolution images. In this work, we generate the high resolution images by resizing the originals with bicubic interpolation, and the low resolution images by down-sampling the originals and then up-sampling them back to the input size.
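The down-then-up-sampling degradation can be sketched with Pillow; the concrete sizes below are placeholders, not the paper's resolutions:

```python
from PIL import Image

def make_low_resolution(img, small_size, out_size):
    """Down-sample then up-sample with bicubic interpolation, following
    the degradation protocol described in the text."""
    small = img.resize(small_size, Image.BICUBIC)
    return small.resize(out_size, Image.BICUBIC)

# a dummy high resolution image; sizes here are illustrative
hr = Image.new("RGB", (224, 224), color=(128, 64, 32))
lr = make_low_resolution(hr, small_size=(32, 32), out_size=(224, 224))
```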
4.2 Implementation Details
We conduct our experiments using Caffe [jia2014caffe]. We use ResNet-101 [he2016deep] pre-trained on ILSVRC [russakovsky2015imagenet] (we download the Caffe model from https://github.com/BVLC/caffe/wiki/Model-Zoo) as the backbone convnet to extract features from the high and low resolution images. We extract the features from the pool5 layer, which gives a feature vector of dimension 2048.
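At the shape level, the pool5 output is a global average pool over the last convolutional feature map; a minimal sketch (the spatial grid size is illustrative):

```python
import numpy as np

# Hypothetical last-stage ResNet-101 feature map for one image:
# 2048 channels over a small spatial grid.
fmap = np.random.default_rng(0).normal(size=(2048, 7, 7))
pool5 = fmap.mean(axis=(1, 2))   # global average pooling over the grid
# pool5 is the 2048-D feature vector used throughout the paper
```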
The feature transfer network is a two-layer fully connected network. We conduct a grid search to find the optimal network architecture; see Sec. 4.3. The network is initialized with MSRA initialization [he2015delving]. We train it with mini-batch stochastic gradient descent using momentum and weight decay, decreasing the initial learning rate in steps over the course of training.
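A minimal NumPy sketch of such a two-layer transfer network follows; the hidden width and the ReLU non-linearity are assumptions for illustration, while the 2048-D input and 100-D output match the feature dimensions reported in Sec. 4.4:

```python
import numpy as np

rng = np.random.default_rng(0)

def msra_init(fan_in, fan_out):
    # MSRA (He) initialization: zero-mean Gaussian, std = sqrt(2 / fan_in)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

d_in, n1, n2 = 2048, 512, 100      # n1 is a placeholder hidden width
W1, b1 = msra_init(d_in, n1), np.zeros(n1)
W2, b2 = msra_init(n1, n2), np.zeros(n2)

def transfer_net(x):
    h = np.maximum(x @ W1 + b1, 0.0)   # first FC layer + ReLU (assumed)
    return h @ W2 + b2                 # second FC layer

x = rng.normal(size=(4, d_in))         # a mini-batch of LR features
out = transfer_net(x)                  # transferred 100-D features
```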
4.3 Feature Transfer Network
The feature transfer network is shallow, with two fully connected layers. Let n1 and n2 denote the numbers of neurons in the first and second fully connected layers, respectively. We conduct a grid search for the optimal combination of n1 and n2, as shown in Table 2. The value of n2 is determined by the number of clusters used for the pseudo-labels in k-means.
As the table shows, when n2 is fixed, the mAP increases as n1 increases, because the capacity of the two-layer feature transfer network grows with n1. However, for a fixed n1, the mAP first increases and then decreases once n2 becomes large enough, suggesting a threshold value for our two-layer network. The combination of n1 and n2 that gives the best performance in Table 2 is used in all subsequent experiments.
4.4 Low Resolution Image Classification
We evaluate image classification performance as a set of binary classification tasks on the VOC2007 test set, using an SVM [chang2011libsvm] classifier in MATLAB. We compare our algorithm with two baselines, Baseline-HR and Baseline-LR. Baseline-HR trains the SVM on the extracted high resolution features (2048-D) of the VOC2007 trainval set and reports classification performance on the VOC2007 test set; Baseline-LR does the same with the extracted low resolution features (2048-D). Our method transfers the low resolution features from 2048-D to 100-D, so we train the SVM on the 100-D features for each class. The comparison is shown in Table 1.
Baseline-HR is the upper bound of our method, and Baseline-LR the lower bound. As Table 1 shows, the proposed unsupervised deep feature transfer boosts low resolution image classification by about 2% in mAP, and our method outperforms Baseline-LR on every class except "bottle" and "sheep". As shown in Fig. 3, the transferred low resolution features are separated much better than the extracted low resolution features. This indicates that the proposed algorithm does transfer more discriminative representations from the high resolution features, and therefore boosts the low resolution image classification task. The feature transfer network could also be embedded into state-of-the-art deep neural networks as a plug-in module to enhance the learned features.
In this paper, we propose an unsupervised deep feature transfer algorithm for low resolution image classification. The proposed two-layer feature transfer network boosts classification by 2% in mAP. It can be embedded into state-of-the-art deep neural networks as a plug-in feature enhancement module. While our current experiments focus on generic classification, we expect the feature enhancement module to also be useful in detection, retrieval, and category discovery settings.
Dr. Zhang was supported by MERL. Mr. Wu and Prof. Wang were supported in part by NSF NRI and USDA NIFA under the award no. 2019-67021-28996 and KU General Research Fund (GRF).