For a long time the vision community has pursued the quest of creating human-like intelligent systems. Recently the resurgence of neural networks DBLP:conf/ijcai/Hinton05 ; DBLP:journals/neco/HintonOT06 ; bengio2007greedy has first led to a revolution in computer vision, for example DBLP:journals/nn/CiresanMMS12 ; DBLP:conf/nips/KrizhevskySH12 ; DBLP:journals/corr/SimonyanZ14a ; DBLP:journals/corr/SzegedyLJSRAEVR14 ; razavian2014cnn , and then quickly spread to other areas including reinforcement learning DBLP:journals/corr/MnihKSGAWR13 and speech recognition DBLP:conf/icassp/GravesMH13 ; DBLP:conf/nips/MikolovSCCD13 . For the most part these neural network models are supervised ones which require large amounts of labeled training data and hence pose scalability challenges. This paper studies an alternative: training deep neural networks using massive amounts of unannotated Web images.
Convnets are well known for the excellent generalization of their learned representations and are widely acknowledged as the de facto representation learning method. A convnet is essentially an end-to-end feature mapping: starting from raw pixel intensities, a robust representation is learned through many hidden layers of different types. At the top of the network there is often a layer representing a loss function specific to each problem. Giryes et al. DBLP:journals/corr/GiryesSB15 proved that under random Gaussian weights deep neural networks are distance-preserving mappings with a special treatment for intra- and inter-class data.
What makes convnets special is that they learn distributed representations, i.e. one concept is represented by multiple neurons and each neuron participates in the representation of more than one concept. A distributed representation is indeed much more expressive than a local one due to its compactness in terms of fewer hidden units DBLP:conf/nips/DelalleauB11 and many more linear regions DBLP:conf/nips/MontufarPCB14 . Theoretical justifications also establish deep networks as a class of universal approximators DBLP:journals/mcss/Cybenko92 ; DBLP:journals/nn/HornikSW89 . A more recent work DBLP:conf/icml/AnBB15 proved that a two-layer rectifier network can make any disjoint data linearly separable.
While distributed representations are present in many deep networks, convolutional and pooling layers, which are exclusive to convnets, are known to provide shift invariance DBLP:journals/corr/abs-1301-3537 and local context preservation. In fact convolutional layers are crucial for convnets to obtain better representations than other deep networks such as stacked auto-encoders, i.e. compare the results reported in DBLP:conf/icml/LeRMDCCDN12 and DBLP:conf/nips/KrizhevskySH12 . Importantly, convnets go beyond the i.i.d. (independent and identically distributed) assumption: their inner representations are highly transferable to related tasks. For example the convnet model trained for image classification DBLP:conf/nips/KrizhevskySH12 can be used as a feature detector DBLP:journals/corr/RazavianASC14 for object detection DBLP:conf/cvpr/ErhanSTA14 , image segmentation DBLP:journals/corr/LongSD14 , and image retrieval DBLP:conf/eccv/BabenkoSCL14 ; DBLP:conf/cvpr/WangSLRWPCW14 .
Representation learning has long been pursued by unsupervised methods such as auto-encoders hinton2006reducing , deep belief nets DBLP:journals/neco/HintonOT06 , and sparse coding poultney2006efficient ; however, from our viewpoint representation learning should combine the advantages of both supervised and unsupervised regimes. Unsupervised learning alone lacks a strong data prior. Since label information is unavailable in an unsupervised setting, the objective function of an unsupervised network uses a reconstruction loss, for example hinton2006reducing . This loss focuses too much on redundant image details, i.e. it tries to reconstruct input images at pixel level as faithfully as possible, which makes it less capable of abstracting discriminative features from visual variations. Supervised learning, on the contrary, can access the labels of training data and thus is better guided. By minimizing a classification loss, supervised training helps prune unnecessary details and magnify discriminative features.
Our perspective is also shared in DBLP:conf/nips/DosovitskiySRB14 ; DBLP:journals/corr/Valpola14 ; DBLP:journals/corr/RasmusVHBR15 . Interestingly we find DBLP:conf/nips/DosovitskiySRB14 was the pioneer in training general feature detectors using supervised convnets combined with artificially generated training data whose labeling information is instance-based rather than class-based. The learned representation therefore is quite robust and outperforms other unsupervised representation learning methods. Although the method proposed in DBLP:conf/nips/DosovitskiySRB14 is limited to small images, it demonstrates the validity of training convnets for representation learning.
Inspired by DBLP:conf/nips/DosovitskiySRB14 we explore the approach of training large-scale convnets under a supervised regime using weakly labeled data. Notice that this perspective is different from a known way DBLP:journals/neco/HintonOT06 of combining unsupervised and supervised learning, where unsupervised pre-training initializes the weights and supervised fine-tuning follows. Prior to deep learning, there were works DBLP:journals/tmm/UlgesWB11 ; DBLP:journals/pami/WangHF12 that used images harvested from the Internet and photo sharing sites such as Flickr to train scalable image classifiers. However, a thorough empirical analysis of the effectiveness of using noisy Web images to train deep networks is still absent. This is where our work comes into context.
Working with Web images comes with both pros and cons: images are cheap and abundant but very noisy. The abundance of Web images easily satisfies the data-hungry nature of convnets; our only concern is the tolerance of convnets against noise. Lately we learned that convnets are surprisingly noise tolerant. sukhbaatar2014learning , soon followed by xiao2015learning , studied several solutions to train deep convolutional networks as classifiers under noisy conditions. In their works training data are assumed to contain mislabeled images, so probabilistic frameworks are proposed to estimate conditional mislabeling probabilities. Those probabilities are then integrated into extra label noise layers placed at the top of the convnet in order to improve posterior predictions. Different from sukhbaatar2014learning ; xiao2015learning , we are interested in building a robust representation for general purposes from noisy data. Our experiments show that even without any special treatment of noisy images, convnets already perform very well. We aim to improve this performance further, not just in a few specific cases but across a variety of domains.
Our contribution is twofold. First, we train convnets using noisy and unannotated Web images retrieved from the image search engine Bing and the photo sharing network Flickr. Experiments are scaled from a small collection of one hundred concepts and roughly 400K images to a larger collection of one thousand concepts and over 3 million images. At both scales the learned representations provide very general features that lead to promising accuracies on many classification datasets. Second, we employ image reranking techniques to remove noise from the training data and train convnets of deeper architectures. Results show that the proposed techniques help improve classification results significantly. Our best model outperforms CaffeNet and closes the gap with Vgg-16 DBLP:journals/corr/SimonyanZ14a .
2 Image Collections
Data acquisition plays an important role in our study. Data sources of Web images are so vast and diverse that it is better not to rely on a single source. However image crawling often comes at a price and raises user privacy concerns. Social network platforms like Facebook and Instagram either have strict privacy policies or simply do not provide an image search API. For research, Flickr comes first with an abundant data source free of charge. Flickr photos are so diverse that they are not biased toward any particular theme. Some Flickr photos are organized into groups and galleries, so image search on Flickr turns out to give quite relevant results. Indeed many public datasets have adopted Flickr as one of their principal sources, for example the VOC Pascal challenges Everingham10 and ImageNet ILSVRC15 .
Recently Flickr released the YFCC collection of 100 million images and videos for research purposes. While this collection is very useful for experimenting with unsupervised learning and data mining, we stick to the most general approach where images of any concept or class can be retrieved using search engines. Retrieved images therefore truly reflect the challenges caused by noisy images in practice.
Because we want to compare our approach with the standard supervised approach that uses clean images from ImageNet, our collections are based on Wordnet, the lexical database of English organized like a thesaurus in which words having similar meanings (synonyms) are grouped into synsets; a synset expresses a concept. Given a synset in Wordnet, its synonyms are used as keywords to retrieve images. Downloaded images are not subjected to manual screening except for duplicate removal. Data imbalance between synsets is avoided by setting an equal number of images per synset.
Besides Flickr we use Bing as a complementary source. Using more data sources also prevents our collections from being biased. Thanks to the Bing Azure API we can freely download up to 250,000 images per account per month. Notice that we use a text query without specifying any visual filter. Manual examination of downloaded images gives us a sense that Bing images are quite noisy, and to some extent much noisier than Flickr images.
Depicted in Fig. 1 are some examples from our Web image collections. At first glance, these examples expose both high intra-variance per category and inter-variance between data sources. While the former is unavoidable and has to be reduced by means of image reranking techniques, the effect of the latter is unknown. On closer look, Bing seems to have more documentary images and diagrams while Flickr has more personal photos with better aesthetic quality. This distinction is not difficult to explain. Bing Search is text-based, so images with rich accompanying text are well indexed and appear in top results. On the contrary, images on Flickr are uploaded, tagged, and organized by users; some images are very relevant to specific topics, but those topics may be irrelevant to the search queries. Flickr mostly contains natural photos, so it is unlikely to contain cartoons or sketches.
Multiple data sources bring both advantages and challenges. On the one hand they improve diversity; on the other hand they may reduce intra-class consistency. To settle this question, we trained two convnets that either use only Flickr images or mix Bing and Flickr images; classification results on third-party datasets show that using more than one data source leads to better generalization. As a result, experiments in the rest of this paper use both Flickr and Bing images.
Studying the effect of noisy data on representation learning should be done at different problem scales because results may change drastically as more noisy data are involved. We conduct experiments at two scales: the small collection of 100 synsets and the large collection of 1000 synsets. With the small collection we can quickly test to find good hyperparameter settings; doing this on a large-scale model is very time-consuming and expensive. Once a good setting has been found, it is applied to the large-scale problem. The two collections are described in the following.
2.1 Flickr-Bing 100 (FB-s)
This collection consists of 100 synsets randomly sampled from WordNet. The number of images per synset ranges from 3000 up to 5000. Out of the total 416,000 images, Flickr and Bing contribute 67% and 33% respectively. Using the same set of 100 synsets, we create the baseline collection (IN-s) whose images are sampled from ImageNet; each synset contains approximately 1000 images, so IN-s has 100,000 images in total. The baseline dataset IN-s is used as the training data for fully supervised convnets. Evaluating the relative performance of convnets trained from FB-s and IN-s will reveal how well the weakly supervised approach performs.
2.2 Flickr-Bing 1000 (FB-l)
This collection consists of the 1000 synsets officially used in the ILSVRC image classification challenges ILSVRC15 . Each synset has approximately 3000 images, so the total number of images in FB-l is about 3.12 million, of which Flickr and Bing contribute 70% and 30% respectively. In fact there is no special reason preventing us from using synsets other than those in the ILSVRC challenges. Adopting the ILSVRC synsets therefore does not reduce the generality of the approach, and furthermore we can easily compare our results with existing work. As a result we do not need to prepare another baseline dataset as done with FB-s.
Our method consists of two stages: i) partly remove noisy images and outliers from the collection, ii) train convnets with the refined collection. Image reranking is used in the first stage in order to rank relevant images (clean data) at the top while pushing irrelevant images (a.k.a. noise and outliers) out of the top list. Since reranking is just a preprocessing step, it is regarded as helpful if the learned representation produces better performance in a classification task.
Given an arbitrary synset, let L denote the set of labeled examples and U the set of unlabeled examples, which in our context are Web images; every instance is represented by a feature vector. We also assume that L is much smaller than U, to emphasize the necessity of semi-supervised reranking methods when labeled examples are scarce. A reranking algorithm aims to select a subset of U whose members are more relevant to at least one labeled example in L than the rejected ones are.
3.1 Cross-Validation (CV)
This technique splits the unlabeled set into equal disjoint subsets. Each subset is scored by a binary SVM classifier cortes1995support trained with the remaining subsets as positive samples and 10K other images as negative samples. The latter can be obtained with ease, e.g. by subsampling images of synsets that are not relevant to the synset of interest. Iterating over the subsets gives us exactly one prediction for every data point. Samples with negative scores are flagged as noise and rejected. The number of folds is manually chosen; increasing it causes fewer images to be classified as noise.
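As a concrete illustration, the cross-validation scoring loop can be sketched as below. This is a minimal numpy sketch: the function name is hypothetical, and a ridge-regression scorer replaces the liblinear SVM of the paper purely to keep the example dependency-free.

```python
import numpy as np

def cv_rerank(X, X_neg, k=5, lam=1.0, seed=0):
    """Score every image in X (one synset) by k-fold cross-validation.

    Each fold is scored by a linear classifier trained on the remaining
    folds (positives) plus X_neg (negatives subsampled from other synsets).
    A ridge-regression scorer stands in for the liblinear SVM of the paper.
    Returns a boolean mask: True = kept as clean, False = rejected as noise.
    """
    n = len(X)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % k          # random assignment to k folds
    scores = np.empty(n)
    for f in range(k):
        test = folds == f
        # positives: the other folds (+1); negatives: other synsets (-1)
        Xtr = np.vstack([X[~test], X_neg])
        ytr = np.concatenate([np.ones((~test).sum()), -np.ones(len(X_neg))])
        d = Xtr.shape[1]
        # ridge solution: w = (X^T X + lam*I)^{-1} X^T y
        w = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(d), Xtr.T @ ytr)
        scores[test] = X[test] @ w
    return scores > 0                       # negative score => noise
```

Each image is scored exactly once, by a classifier that never saw it during training, which is what makes the negative-score rejection rule meaningful.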
3.2 Kernel Mean Matching (KMM)
This is a semi-supervised technique DBLP:conf/nips/HuangSGBS06 that reweights the unlabeled data w.r.t. the labeled data such that the (weighted) arithmetic means of the two sets are approximately equal. Unlabeled points whose weights fall below a threshold are considered noise. The optimal weights are the solution of the following convex quadratic program
Notice that Eq. 1 operates directly on the input features without passing them through any nonlinear mapping. Linear KMM is therefore very fast. Eq. (1) is convex and can be expressed in the canonical quadratic form as follows:
where the first constraint bounds the discrepancy between the distributions of the labeled and unlabeled sets. The bigger the bound, the fewer points are highly re-weighted; as its value approaches zero, an unweighted solution is obtained. The second constraint ensures the weighting is close to a probability distribution (for further details see DBLP:conf/cvpr/ChuTC13 ).
3.3 Transductive Support Vector Machine (TSVM)
Proposed by DBLP:conf/sigir/SindhwaniK06 , TSVM uses both labeled and unlabeled data to infer a decision function, which in our context is the noise removal function. In the TSVM setting the labeled set can be much smaller than the unlabeled set, which perfectly fits our context. To find the decision function, the following quadratic program must be solved iteratively
where two hyperparameters control the influence of labeled and unlabeled data on the classifier; the loss penalizes predicted labels w.r.t. the temporary labels of unlabeled points and the groundtruth of labeled points respectively. Because the classifier and the temporary labels are coupled by the second loss term, Eq. 3 is non-convex and is solved by alternating minimization; in particular, the temporary labels of the unlabeled points are re-assigned by the current classifier at each iteration. The optimization terminates when either the temporary labels stop changing or the maximum number of iterations is reached. In the end an unlabeled point is classified as noise if its predicted label is negative. To avoid trivial solutions where all unlabeled points fall on either the positive or the negative side, the ratio of positive to negative labels is fixed.
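The alternating scheme can be sketched as below; a ridge-regression scorer is a hypothetical stand-in for the SVM solved by svmlin, and the stopping and balancing rules follow the description above.

```python
import numpy as np

def tsvm_rerank(X_lab, y_lab, X_web, r=0.5, lam=1.0, max_iter=20):
    """Transductive labeling of Web images by alternating minimization.

    Alternates between (a) fitting a linear scorer on labeled data plus the
    currently guessed Web labels and (b) re-guessing Web labels from scores,
    keeping a fixed fraction r of positives to avoid the trivial solution.
    A ridge scorer stands in for the SVM of svmlin.  Web images whose final
    label is negative are treated as noise.
    """
    n, d = X_web.shape
    n_pos = int(r * n)
    y_web = np.ones(n)                      # temporary labels, all positive
    for _ in range(max_iter):
        Xtr = np.vstack([X_lab, X_web])
        ytr = np.concatenate([y_lab, y_web])
        w = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(d), Xtr.T @ ytr)
        scores = X_web @ w
        new_y = -np.ones(n)
        new_y[np.argsort(-scores)[:n_pos]] = 1.0  # top-r fraction positive
        if np.array_equal(new_y, y_web):          # labels stopped changing
            break
        y_web = new_y
    return y_web > 0
```

Unlike KMM, the scorer here also sees the internal structure of the unlabeled set through the coupled loss, which matches the qualitative difference discussed in the visualization section.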
3.4 Convnet Architectures
With millions of learnable parameters, dozens of hyperparameters, and a network topology to choose, finding a neural net architecture appropriate for a task is more an art than a science. Fortunately convnet architectures are somewhat constrained by feedforward learning and the relative order of layer types. Currently a handful of convnets perform remarkably well, for example AlexNet DBLP:conf/nips/KrizhevskySH12 or its slight variation CaffeNet jia2014caffe , Vgg nets DBLP:journals/corr/SimonyanZ14a , PReLU nets he2015delving , and GoogleNet DBLP:journals/corr/SzegedyLJSRAEVR14 . In our experiments we choose AlexNet DBLP:conf/nips/KrizhevskySH12 as a starting point and then try to increase the network's depth with a modified structure.
In most cases increasing the depth of convnets leads to performance gains, for example DBLP:journals/corr/SimonyanZ14a ; he2015delving ; DBLP:journals/corr/SzegedyLJSRAEVR14 . However, DBLP:journals/corr/SimonyanZ14a reports the impossibility of training a convnet with more than 16 layers, which can be overcome by layer pre-initialization DBLP:journals/corr/SimonyanZ14a ; romero2014fitnets . We encounter this problem at a smaller number of layers; for instance we cannot train a 16-layer convnet using FB-l as training data. The major difference between our experiments and other works, for example DBLP:journals/corr/SimonyanZ14a , is the use of weakly labeled and noisy data. Rather than initializing intermediate layers with pre-trained weights, we find that some technical modifications can resolve this problem.
A foremost factor is the minibatch size of the stochastic gradient descent (SGD) algorithm used to train the network. While in practice convnets can be trained with a batch size as small as 16 images (with a lowered learning rate), this is no longer true in our case in which training data are heavily corrupted by noise. We found that with too small a batch size, a convnet like Vgg-16 could not decrease the training loss even after thousands of iterations. Our conjecture is that the fluctuation of gradient directions (due to noise) between subsequent mini-batches slows down the learning speed of convnets.
Another factor is the size of the convolutional kernels. Recent findings DBLP:journals/corr/SimonyanZ14a suggest that deeper convnets with small kernel sizes tend to improve generalization. Our results, on the contrary, show that medium kernel sizes tend to work better for Web images. While we have not established evidence of an association between image appearance and kernel size, a plausible explanation is the highly diverse appearance of Web images: such images come from a variety of contexts, and the objects contained in them may occur at any scale and in many styles such as cartoons, diagrams, and sketches.
Based on the observations above, we propose the 13-layer network architecture called FBNet, shown in Table 1. Our network is very much like the Vgg-16 net except that it has only 4 max-pooling layers and the kernel dimension of the first convolutional layer is double that of Vgg-16 and half that of CaffeNet. While we are interested in further increasing the depth of our convnet, limited time and computational resources cap us at 13 layers, which fits in the 12GB memory of a single GTX Titan-Z. Every 100 iterations take around 8 minutes with batch size 196; a full training from scratch takes about 3 weeks.
An alternative way to attain more depth is to use GoogleNet DBLP:journals/corr/SzegedyLJSRAEVR14 . This architecture is deeper than Vgg-16 but consumes slightly less memory and even runs faster. We use this convnet without modification and train it from scratch on Web data; on average each 100 iterations take 2.5 minutes with mini-batch size 128. A full training stops after 1.2 million iterations.
4 Experiment Setup
The representations learned from Web images are evaluated on public datasets with various themes: indoor scenes MIT67 DBLP:conf/cvpr/QuattoniT09 , a variety of outdoor and street scenes SUN397 DBLP:conf/cvpr/XiaoHEOT10 , human actions Action40 DBLP:conf/iccv/YaoJKLGF11 , object categories Caltech256 griffin2007caltech , objects in context VOC07 Everingham10 , three fine-grained datasets of flower species Oxford102 Nilsback08 , dog species StandfordDogs , and bird species CUB-200 WelinderEtal2010 , and one fine-grained dataset of car brands and models StanfordCars krause20133d . Mean accuracy is used for all datasets except VOC07, which uses mean average precision (mAP).
4.1 Image Reranking
The feature extractor runs as follows. Images are forwarded from the input layer and 4096-dimensional feature vectors are extracted at the fully connected layer fc7; these features are normalized before being fed into the reranking algorithms. Notice that among the three reranking algorithms, cvsvm is the only one that does not require labeled examples. We therefore manually annotate a tiny set of examples for the two semi-supervised methods kmm and tsvm. In particular ten labeled examples per synset are annotated, which means 1,000 and 10,000 examples of FB-s and FB-l respectively are given to the reranking algorithms. Other hyperparameter settings include:
cvsvm: liblinear is used to solve the SVM sub-problems with a linear kernel; increasing the number of folds tends to accept more noisy images as clean.
kmm: the quadratic program is solved using CVXOPT; the weight threshold balances the need for more training images (by decreasing it) against noise removal (by increasing it).
tsvm: we reuse the svmlin code released by its author DBLP:conf/sigir/SindhwaniK06 ; the positive ratio is set so that about 1000 images per synset are selected as clean; the loss weights emphasize the importance of labeled over unlabeled images.
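For illustration, the feature normalization step preceding all three methods might look like this; the exact scheme is not restated here, so l2 normalization, the common choice for convnet features, is assumed in this sketch.

```python
import numpy as np

def l2_normalize(feats, eps=1e-12):
    """Scale each fc7 feature vector to unit Euclidean length.

    The fc7 features are normalized before reranking and SVM training;
    l2 normalization is assumed here as a representative choice.
    """
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / np.maximum(norms, eps)   # eps guards against zero rows
```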
4.2 Visualizing Reranking Results
The effect of image reranking is not clear until we observe the final classification results of classifiers trained on the reranked data. We can, however, visualize the reranked data to get a sense of how the algorithms handle it. We extract fc7 features of images in the synset “salmon” of the two small collections FB-s and IN-s, run the dimensionality reduction method t-SNE van2008visualizing , and display the resulting 2D embedding in Fig. 3. Notice that the reranking algorithms operate only on Web images (pink and black dots) and labeled examples (red); ImageNet images (green) are shown for clarity only.
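A dependency-free sketch of producing such a 2D embedding is shown below; PCA is used here as a stand-in for t-SNE purely to keep the example minimal, so it only reflects the coarse cluster structure rather than t-SNE's neighborhood-preserving layout.

```python
import numpy as np

def pca_2d(feats):
    """Project fc7 features to 2-D for visual inspection.

    The paper uses t-SNE (van2008visualizing); plain PCA is used in this
    sketch as a stand-in that still separates coarse clusters of a synset.
    """
    X = feats - feats.mean(axis=0)          # center the features
    # top-2 right singular vectors give the principal directions
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T                     # (n, 2) embedding for plotting
```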
We observe that the distributions of Web images and ImageNet images do not quite overlap. This explains why cvsvm, an unsupervised reranking method that does not use any labeled examples, selects a large portion of Web images belonging to big clusters as clean data. On the contrary, the semi-supervised methods tsvm and kmm favor Web data points surrounding the labeled examples. Here the difference between tsvm and kmm is clear: tsvm takes into account both the internal structure of the Web images and the provided examples, while kmm disregards the structure of the unlabeled images and tries to match the empirical mean of the examples with that of the subset of clean images. In the end such differences lead to different convnet training results, as we discuss in the next sections.
4.3 Convnet Training
Training convnets is time consuming, hence we first try various reranking settings on the small collection FB-s to figure out working recipes and then apply them to the large collection FB-l. To measure the effect of reranking, we train several convnets with respect to different data configurations. First, a convnet is trained from scratch using Web images without any reranking. Second, three convnets are trained from the reranked images produced by the three methods respectively. As shown in Table 2, applying the reranking algorithms to the original Web collections significantly reduces the number of images. Because scarce training data tends to produce overfitting, the third configuration fine-tunes three convnets (with respect to the three reranking algorithms) from the pre-trained model of the first configuration.
For the FB-s collection we use only the CaffeNet architecture, while both CaffeNet and FBNet are used for the FB-l collection. Input image dimensions are fixed; the optimization algorithm is SGD with momentum; the learning rate drops by an order of magnitude after a fixed step size of iterations; the maximum number of iterations is 450,000. The learning rate for fine-tuning is ten times smaller and the corresponding step size is also shorter. A fine-tuning process is stopped after at most 150,000 iterations.
In this section we evaluate how well the representations learned from the Web image collections generalize to unseen data. We present classification results considering convnets as end-to-end classifiers, although this serves as a reference and is not the main purpose of our study. To test the feature transferability of each trained convnet, we extract image features at layer fc7 and train linear SVM classifiers for evaluation.
5.1 Results of FB-s
5.1.1 As End-to-End Classifiers
We train a CaffeNet for each variant of the training data produced by the individual reranking methods. Accuracy is computed by comparing the image groundtruth with the softmax output at the last layer fc8. Classification accuracies on the test set of IN-s are summarized in Table 3.
Table (a)a shows that training with reranked data turns out to be effective for the algorithm cvsvm but not for kmm or tsvm. We were curious about this and finally attribute it to overfitting: the semi-supervised reranking methods kmm and tsvm reject so many images as noise that the amount of training data becomes insufficient to train a system as complex as a convnet. See Table 2 to compare the number of images before and after reranking for each method.
To train convnets on less noisy data while avoiding overfitting, we apply two-step training. In the first step we train a CaffeNet model from scratch using the entire set of Flickr and Bing images without reranking. Once finished, its weights are used to initialize another CaffeNet model which is then trained (a.k.a. fine-tuned) on data reranked by either cvsvm, kmm, or tsvm. The second step requires considerably less training time, i.e. the maximum number of iterations varies from 50,000 to 100,000. We also test another two-step training where the first step starts with FB-s/cvsvm data and the second step uses the reranked data of either kmm or tsvm.
Shown in Table (b)b are the classification results obtained by two-step training, and we observe significant improvement. As expected, the results of the convnets trained on the reranked data FB-s/tsvm and FB-s/kmm are better than those with no reranking and with unsupervised reranking cvsvm. Noticeably, the overall result is slightly better if the first step is trained on the full FB-s collection (compare the first 3 columns versus the last 2 columns). Among the reranking methods, kmm outperforms the rest, probably because its objective matches the empirical mean of the examples (drawn from the distribution that generates the test set IN-s) with that of the Web images.
When comparing the best result of the convnet trained from FB-s to the one trained from IN-s (see Table (c)c), the latter clearly wins. This is unsurprising because the latter was trained and tested in the same data domain.
We test on six public datasets. Images of each dataset are forwarded through the net and 4096-dimensional features are obtained at layer fc7; we normalize these features and then feed them to the training stage of one-vs-rest linear SVM classifiers. Hyperparameters are tuned per dataset per net. The evaluation protocol of each dataset is strictly followed.
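The transfer-evaluation pipeline above can be sketched as follows; one-vs-rest ridge scorers act as a hypothetical dependency-free stand-in for the linear SVMs, and all names are ours.

```python
import numpy as np

def ovr_train(X, y, n_cls, lam=1.0):
    """One-vs-rest linear classifiers on (normalized) fc7 features.

    One ridge-regression scorer per class stands in for the one-vs-rest
    linear SVMs of the paper; W holds one weight column per class.
    """
    d = X.shape[1]
    Y = -np.ones((len(y), n_cls))
    Y[np.arange(len(y)), y] = 1.0           # +1 for own class, -1 for rest
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def ovr_predict(X, W):
    """Assign each image to the class with the highest score."""
    return np.argmax(X @ W, axis=1)
```

In practice the regularization strength (lam here, C for an SVM) would be tuned per dataset per net, as the text notes.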
The results are shown in Table 4. In Table (a)a and Table (b)b the relative performance between the different nets is similar to that observed earlier. In other words, the two-step training scheme consistently improves the generalization of the learned representations both within the domain and in new domains. When comparing our best net against the CaffeNet trained on IN-s, it is surprising that our net wins by a large margin on all six datasets. We consider this a promising signal for our approach, though the inferior performance of IN-s may be due to its relatively small training data compared to FB-s.
5.2 Results of FB-l
Extending the experiment above to larger data scales is necessary to verify the scalability of our approach. In this section we repeat the experiments with the large collection FB-l of 1000 synsets. Based on the previous results we select well-performing nets for this new experiment, so we avoid wasting training time on suboptimal configurations.
Notice that we use self-reranking to produce the reranked images. In other words, self-reranking first trains a CaffeNet on raw Web images and then uses this net as a feature extractor to compute the features required by the reranking algorithms.
5.2.1 As End-to-End Classifiers
Shown in Table (a)a are the classification accuracies of our convnets on the ILSVRC 2012 validation set. Compared with the previous results in Table 3, this time our convnets perform poorly: the best top-5 accuracy, obtained by the convnet trained on FB-l/cvsvm, is 23.8%, a huge drop compared to 80.4% of the convnet trained on IN-l. This means that as more data are involved, the number of noisy images grows and classification results degrade. The message of the results is also clear: to use a convnet as a good end-to-end classifier, lots of clean labeled training data is a must.
The remaining question, which is also our question of interest: does good generalization of a transferred representation require abundant clean training data too? In the previous experiment with the small collection FB-s, the answer was no. For the large collection FB-l, the results are shown in Table (b)b, and again the answer is no. Our best results, trained from 3.1 million Web images, are comparable to the reference model trained from 1.2 million ImageNet images. On the three datasets VOC07, Caltech256, and SUN397, our best performance, obtained by the convnet trained from FB-l/cvsvm, is mostly comparable to the reference model; it only performs worse than the reference on the two datasets MIT67 and Action40.
FB-l/cvsvm however slightly outperforms IN-l on Oxford102. This is interesting because, unlike the other datasets, Oxford102 has 1000 training images and 5000 test images; each flower category has just 10 training images, which is insufficient for convnet fine-tuning. The fact that the convnet trained on Web images generalizes better than the one trained on ImageNet on the Oxford102 dataset may imply a potential advantage of the proposed approach in domains with scarce data.
Overall, these results show that it is possible to obtain competitive results using only abundant amounts of unlabeled and noisy Web images. The results also show that the reranking methods, especially the semi-supervised ones, have little effect and in some cases decrease performance because noise removal reduces the amount of training data. To obtain better results with reranking methods, more training images should be collected so that the data remain abundant after rejecting the noisy ones.
5.3 Deeper Architectures
According to recent empirical evidence, the power of deep models depends on their depth. The 16-layer convnet Vgg-16 DBLP:journals/corr/SimonyanZ14a and the 22-layer convnet GoogleNet DBLP:journals/corr/SzegedyLJSRAEVR14 have approached and even surpassed human-level performance in image classification on ImageNet data. In our study we are also curious whether a deeper convnet learns a better representation from noisy data. We train a 13-layer convnet, denoted FBNet, on FB-l, and compare the results against CaffeNet and GoogleNet trained on the same dataset.
Training a very deep network is notoriously difficult: gradients back-propagated from the loss layer vanish before reaching the lower layers. Rectifier activations DBLP:journals/jmlr/GlorotBB11 partly alleviate this problem; in addition, training a very deep network requires either a careful weight initialization romero2014fitnets (using shallower networks to initialize deeper ones) or the elimination of internal covariate shift via batch normalization DBLP:conf/icml/IoffeS15 . The former is unnecessarily complex, while the latter requires more GPU memory since batch normalization cannot be done in-place. Due to constrained resources, we instead apply he2015delving , which sets the standard deviation of the zero-mean Gaussians used for weight initialization to $\sqrt{2/(k_l^2 c_l)}$, where $k_l$ and $c_l$ are the filter size and the number of input channels of the $l$-th convolutional layer.
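For concreteness, that initialization can be sketched as follows; the function names are ours, but the formula is the one from he2015delving, with k the filter size and c the number of input channels.

```python
import math
import random

def he_std(filter_size, in_channels):
    """Std. dev. of the zero-mean Gaussian in He initialization:
    sqrt(2 / (k^2 * c)) for a layer with k x k filters over c channels."""
    return math.sqrt(2.0 / (filter_size * filter_size * in_channels))

def init_conv_layer(n_filters, filter_size, in_channels, seed=0):
    """Draw every weight of a conv layer from N(0, he_std^2).
    Each filter is returned as a flat list of fan_in weights."""
    rng = random.Random(seed)
    std = he_std(filter_size, in_channels)
    fan_in = filter_size * filter_size * in_channels
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(n_filters)]
```

The point of scaling by the fan-in is to keep the variance of activations roughly constant across layers, so gradients neither vanish nor explode at the start of training.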
To facilitate comparison between architectures, we extract image features at the last fully connected layers (the fc7 layer for CaffeNet and FBNet; the loss1/fc, loss2/fc, and pool5/7x7_s1 layers for GoogleNet, concatenated into a single vector), normalize them, and then train one-vs-rest linear SVM classifiers w.r.t. the classes of the datasets. Parameter tuning for SVM training is done as in previous experiments.
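The feature pipeline just described can be sketched as below. L2 normalization is our assumption for the unspecified normalization step, and the vector lengths in the test are illustrative, not the actual layer dimensions.

```python
import math

def l2_normalize(vec):
    """Scale a feature vector to unit Euclidean length
    (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def googlenet_descriptor(loss1_fc, loss2_fc, pool5):
    """Concatenate GoogleNet's three last layers into one descriptor
    and normalize it, mirroring the evaluation protocol in the text."""
    return l2_normalize(list(loss1_fc) + list(loss2_fc) + list(pool5))
```

The normalized descriptors then feed standard one-vs-rest linear SVM training with any off-the-shelf linear solver.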
From the results presented in Table 6, increasing the depth of a convnet greatly improves accuracy, regardless of whether that convnet was trained on labeled data like ImageNet or unlabeled data like FB-l. Also, for the first time, our FBNet outperforms CaffeNet trained on ImageNet. FBNet still underperforms Vgg-16, which is deeper than ours; our future work includes examining FBNet with even deeper structures, which is expected to bridge the gap.
The results in Table (b) also reveal that GoogleNet is not as good as the conventional convnet architecture at learning a transferable representation. This suggests that good transferability favors more strongly distributed features, which can be learned well using fully connected layers. However, another reason may lie in the different dimensionalities of the features extracted from GoogleNet versus the other convnets. We leave this as future work.
6 Application to Fine-grained Category Classification
In this last section we examine the potential of applying convnets pre-trained on Web images to specific problems. In particular, three fine-grained category classification tasks are evaluated on convnets fine-tuned from the ones pre-trained on FB-l and IN-l. If both perform comparably, then our approach is an efficient recipe in production: first collect a lot of Web images and train a convnet, even though the training data are noisy; then, in a subsequent refinement stage, use a smaller but clean dataset to fine-tune that convnet. This approach saves significant annotation effort if successfully employed.
Back to our experiment, three datasets are chosen: Birds WahCUB_200_2011 (200 classes), Cars krause20133d (196 classes), and Flowers Nilsback08 (102 classes). The number of training images varies from 10 to 40 per category.
To measure the effectiveness of fine-tuning, classification accuracy is computed at two points: before and after the process. The former is computed by evaluating linear SVM classifiers trained on normalized fc7 features of a pre-trained net; the latter is computed directly from the output of the last layer of the tuned convnet. We adopt the CaffeNet architecture for this experiment. The first 7 layers are initialized with pre-trained weights, while the last fully connected layer is initialized with random weights; each group is given its own learning rate. We use SGD with momentum and the step learning-rate policy with a step size of 2,000 iterations; fine-tuning is stopped after 10,000 iterations.
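The optimization schedule above can be sketched as follows. The step size of 2,000 iterations comes from the text; the decay factor gamma=0.1 and momentum=0.9 are conventional Caffe-era defaults assumed here rather than stated in the paper, and the learning-rate values in the test are placeholders.

```python
def step_lr(base_lr, iteration, step_size=2000, gamma=0.1):
    """'Step' policy: multiply base_lr by gamma every step_size iterations."""
    return base_lr * (gamma ** (iteration // step_size))

def sgd_momentum_step(weights, velocity, grads, lr, momentum=0.9):
    """One SGD-with-momentum update on flat parameter lists:
    v <- momentum * v - lr * grad;  w <- w + v."""
    velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    weights = [w + v for w, v in zip(weights, velocity)]
    return weights, velocity
```

Using a smaller learning rate for the pre-trained layers than for the randomly initialized last layer is what keeps fine-tuning from destroying the transferred representation.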
Results are shown in Table 7. On all 3 datasets, the accuracies of the convnets tuned from FB and Ref are quite comparable, both before and after fine-tuning. In fact, our net performs slightly worse than the baseline on the StanfordCars dataset and slightly better on the Flowers and Birds datasets. The advantage on Flowers is likely due to its very small training size (10 images per category), as already explained in Section 5.2.2.
Based on the observations above, we believe that the proposed approach is an economical and effective solution to supervised classification problems, especially fine-grained category tasks. To further improve performance, the fine-tuning weights should be initialized from a convnet trained on a large number of Web images retrieved within the particular domain, for instance several thousand flower categories. In this way the learned representation will be more discriminative with respect to flora-related features.
7 Conclusion
We proposed a novel approach that uses convnets to learn a transferable image representation from a massive amount of (noisy) Web images. Throughout the paper we have explored this approach at several problem scales, with different image reranking techniques, compared several network architectures, and illustrated potential applications.
The significance of our study is threefold. First, our results show that convnets trained on Web images can achieve good generalization. Second, image reranking algorithms are useful for improving the generalization of convnets, especially at small and medium scales; to make image reranking useful at large scale, the training set should be considerably larger and cover the different visual variances of each concept. Third, deep convnet architectures can be trained on noisy images. Besides the open problems addressed in this paper, we plan to investigate how using far more unlabeled images can help reduce the need for labeled images in semi-supervised deep learning.
- (1) G. E. Hinton, What kind of graphical model is the brain?, in: IJCAI, 2005, p. 1765.
- (2) G. E. Hinton, S. Osindero, Y. W. Teh, A fast learning algorithm for deep belief nets, Neural Computation 18 (7) (2006) 1527–1554.
- (3) Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, et al., Greedy layer-wise training of deep networks, NIPS 19 (2007) 153.
- (4) D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber, Multi-column deep neural network for traffic sign classification, Neural Networks 32 (2012) 333–338.
- (5) A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: NIPS, 2012.
- (6) K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, CoRR.
- (7) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, CoRR.
- (8) A. Razavian, H. Azizpour, J. Sullivan, S. Carlsson, Cnn features off-the-shelf: an astounding baseline for recognition, in: CVPR Workshops, 2014, pp. 806–813.
- (9) V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. A. Riedmiller, Playing atari with deep reinforcement learning, CoRR abs/1312.5602.
- (10) A. Graves, A. Mohamed, G. E. Hinton, Speech recognition with deep recurrent neural networks, in: ICASSP, 2013, pp. 6645–6649.
- (11) T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: NIPS, 2013, pp. 3111–3119.
- (12) R. Giryes, G. Sapiro, A. M. Bronstein, Deep neural networks with random gaussian weights: A universal classification strategy?, CoRR abs/1504.08291.
- (13) O. Delalleau, Y. Bengio, Shallow vs. deep sum-product networks, in: NIPS, 2011, pp. 666–674.
- (14) G. F. Montúfar, R. Pascanu, K. Cho, Y. Bengio, On the number of linear regions of deep neural networks, in: NIPS, 2014, pp. 2924–2932.
- (15) G. Cybenko, Approximation by superpositions of a sigmoidal function, MCSS 5 (4) (1992) 455.
- (16) K. Hornik, M. B. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2 (5) (1989) 359–366.
- (17) S. An, F. Boussaïd, M. Bennamoun, How can deep rectifier networks achieve linear separability and preserve distances?, in: ICML, 2015, pp. 514–523.
- (18) J. Bruna, A. Szlam, Y. LeCun, Learning stable group invariant representations with convolutional networks, CoRR abs/1301.3537.
- (19) Q. V. Le, M. Ranzato, R. Monga, M. Devin, G. Corrado, K. Chen, J. Dean, A. Y. Ng, Building high-level features using large scale unsupervised learning, in: ICML, 2012.
- (20) A. S. Razavian, H. Azizpour, J. Sullivan, S. Carlsson, CNN features off-the-shelf: an astounding baseline for recognition, CoRR.
- (21) D. Erhan, C. Szegedy, A. Toshev, D. Anguelov, Scalable object detection using deep neural networks, in: CVPR, 2014.
- (22) J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, CoRR.
- (23) A. Babenko, A. Slesarev, A. Chigorin, V. S. Lempitsky, Neural codes for image retrieval, in: ECCV, 2014.
- (24) J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, Y. Wu, Learning fine-grained image similarity with deep ranking, in: CVPR, 2014.
- (25) G. E. Hinton, R. R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (5786) (2006) 504–507.
- (26) C. Poultney, S. Chopra, Y. L. Cun, et al., Efficient learning of sparse representations with an energy-based model, in: NIPS, 2006, pp. 1137–1144.
- (27) A. Dosovitskiy, J. T. Springenberg, M. A. Riedmiller, T. Brox, Discriminative unsupervised feature learning with convolutional neural networks, in: NIPS, 2014.
- (28) H. Valpola, From neural PCA to deep unsupervised learning, CoRR abs/1411.7783.
- (29) A. Rasmus, H. Valpola, M. Honkala, M. Berglund, T. Raiko, Semi-supervised learning with ladder network, CoRR abs/1507.02672.
- (30) A. Ulges, M. Worring, T. M. Breuel, Learning visual contexts for image annotation from flickr groups, IEEE Transactions on Multimedia.
- (31) G. Wang, D. Hoiem, D. A. Forsyth, Learning image similarity from flickr groups using fast kernel machines, IEEE TPAMI 34 (11) (2012) 2177–2188.
- (32) S. Sukhbaatar, R. Fergus, Learning from noisy labels with deep neural networks, arXiv.
- (33) T. Xiao, T. Xia, Y. Yang, C. Huang, X. Wang, Learning from massive noisy labeled data for image classification, in: CVPR, 2015, pp. 2691–2699.
- (34) M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, A. Zisserman, The pascal visual object classes (voc) challenge, IJCV 88 (2) (2010) 303–338.
- (35) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet Large Scale Visual Recognition Challenge, IJCV (2015) 1–42.
- (36) C. Cortes, V. Vapnik, Support-vector networks, Machine learning.
- (37) J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, B. Schölkopf, Correcting sample selection bias by unlabeled data, in: NIPS, 2006.
- (38) W. Chu, F. D. la Torre, J. F. Cohn, Selective transfer machine for personalized facial action unit detection, in: CVPR, 2013.
- (39) V. Sindhwani, S. S. Keerthi, Large scale semi-supervised linear svms, in: SIGIR, 2006.
- (40) Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast feature embedding, arXiv preprint arXiv:1408.5093.
- (41) K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, arXiv preprint arXiv:1502.01852.
- (42) A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, Y. Bengio, Fitnets: Hints for thin deep nets, arXiv preprint arXiv:1412.6550.
- (43) A. Quattoni, A. Torralba, Recognizing indoor scenes, in: CVPR, 2009, pp. 413–420.
- (44) J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, A. Torralba, SUN database: Large-scale scene recognition from abbey to zoo, in: CVPR, 2010, pp. 3485–3492.
- (45) B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. J. Guibas, F. Li, Human action recognition by learning bases of action attributes and parts, in: ICCV, 2011, pp. 1331–1338.
- (46) G. Griffin, A. Holub, P. Perona, Caltech-256 object category dataset.
- (47) M.-E. Nilsback, A. Zisserman, Automated flower classification over a large number of classes, in: Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, 2008.
- (48) P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, P. Perona, Caltech-UCSD Birds 200, Tech. Rep. CNS-TR-2010-001, California Institute of Technology (2010).
- (49) J. Krause, M. Stark, J. Deng, L. Fei-Fei, 3d object representations for fine-grained categorization, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2013, pp. 554–561.
- (50) L. Van der Maaten, G. Hinton, Visualizing data using t-sne, JMLR 9 (2579-2605) (2008) 85.
- (51) X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: AISTATS, 2011, pp. 315–323.
- (52) S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: ICML, 2015, pp. 448–456.
- (53) C. Wah, S. Branson, P. Welinder, P. Perona, S. Belongie, The Caltech-UCSD Birds-200-2011 Dataset, Tech. rep. (2011).