"You might also like this model": Data Driven Approach for Recommending Deep Learning Models for Unknown Image Datasets

11/26/2019 ∙ by Ameya Prabhu, et al.

For an unknown (new) classification dataset, choosing an appropriate deep learning architecture is often an iterative, time-consuming, and laborious process. In this research, we propose a novel technique to recommend a suitable architecture from a repository of known models. Further, we predict the accuracy of the recommended architecture on the given unknown dataset, without the need for training the model. We propose a model encoder approach to learn a fixed-length representation of a deep learning architecture along with its hyperparameters, in an unsupervised fashion. We manually curate a repository of image datasets with corresponding known deep learning models and show that the predicted accuracy is a good estimator of the actual accuracy. We discuss the implications of the proposed approach for three benchmark image datasets and also the challenges in using the approach for the text modality. To further increase the reproducibility of the proposed approach, the entire implementation is made publicly available along with the trained models.




1 Introduction

With the current unprecedented growth in deep learning, a primary and pressing challenge faced by the community is to find the most appropriate model for a given dataset. Consider the $1M Data Science Bowl challenge for detecting lung cancer, hosted by Kaggle (https://www.kaggle.com/c/data-science-bowl-2017). It introduces a dataset of lung scans, consisting of thousands of images, and aims to develop algorithms that accurately determine when lesions in the lungs are cancerous. To solve this in practice, a common approach is to treat lung cancer detection as a special case of object detection: take models pre-trained on large-scale image classification datasets and fine-tune them on the target lung cancer dataset. The process would begin with choosing a state-of-the-art deep learning architecture, say AlexNet Krizhevsky et al. (2012), with weights pre-trained on the ImageNet dataset. Suppose that, after fine-tuning, neither the AlexNet architecture nor the ImageNet dataset turns out to be suitable for lung cancer detection. The procedure then has to be repeated for multiple models, such as ResNet He et al. (2016a), VGG-16 Simonyan and Zisserman (2014), VGG-19 Simonyan and Zisserman (2014), and Network-in-Network Lin et al. (2013), until a suitable model is found. Similarly, it has to be repeated for different pre-training datasets, such as CIFAR-10, CIFAR-100, and TinyImageNet, until a suitable pre-training dataset is found. This hit-and-trial search is extremely expensive.

Models pre-trained on generic datasets have shown improved performance in different domains and diverse tasks such as music genre classification Raina et al. (2007), face recognition Sun et al. (2014), healthcare Esteva et al. (2017), and the food industry Wang et al. (2015). Currently, the choice of the generic dataset and the pre-trained model is based purely on human expertise and prior knowledge. Kornblith et al. (2018) considered thirteen different deep learning models trained on ImageNet and studied their fine-tuned performance on different target datasets. They found improvements in certain transfer scenarios and also showed that transferability is limited by the properties of the source and target datasets. This motivates a systematic approach to choosing the dataset and pre-trained network for a given unknown dataset or task (such as lung cancer detection). In this paper, we aim to address this problem by proposing an automated deep learning model recommendation system that selects from a repository of models for a given unknown dataset. We further predict the accuracy of the recommended deep learning model on the unknown dataset without the need for training or fine-tuning. This enables the user to make a well-informed decision on which popular deep learning model to adopt for the unknown dataset at hand, and to know what ballpark performance to expect.

The proposed research problem is defined as follows: for a given unknown dataset, select a repository dataset and a model that will provide the best accuracy on the unknown dataset when the model is pre-trained on the repository dataset and then fine-tuned, and also predict that accuracy without the need for training. Formally, our system assumes a repository of popular deep learning architectures trained independently on different existing datasets. Given an unknown dataset, we find the most similar dataset in the repository and predict the accuracy of every model trained on that dataset, without actually training the model on the unknown dataset. This allows us to quantitatively assess the promise of transferring models and also to recommend a suitable model for the unknown dataset. For example, for the given unknown lung cancer dataset, say the proposed approach predicts the STL-10 dataset Coates et al. (2011) as the most similar dataset. Then, we predict the accuracy on the lung cancer dataset of all the architectures available for STL-10 in the model repository and rank them. Thus, we obtain the best performing pre-training dataset as well as the architecture. The proposed approach advances the literature towards a deep neural network recommendation system that uses only limited resources and runs in near real time. To summarize, the primary contributions of this research are as follows:

1. A model recommendation system which selects the most suitable pre-trained model from a repository of models and predicts its accuracy on the unknown dataset.

2. A general purpose unsupervised model encoder which extracts a fixed-length, continuous vector representation for any given discrete, variable-length deep learning architecture, along with its hyperparameters.

3. A dataset similarity ranker which characterizes the similarity distribution between a given unknown dataset and the datasets in our repository using an ensemble of classifiers. We show that it is possible to get a good correlation between the dataset similarity predictions and the actual accuracy obtained on that dataset.

4. An accuracy regressor which efficiently estimates the accuracy of a deep learning model on an unknown query dataset, using the dataset similarity ranker and model encoder features.

5. To further increase the reproducibility of the proposed work, the entire working implementation is made publicly available along with the trained models: https://github.com/goodboyanush/dl-model-recommend

2 Existing Literature

We will discuss the existing literature in the area of Neural Architecture Search (NAS), accuracy prediction, as well as recommender systems.

Neural Architecture Search (NAS): The aim of NAS is to find the most suitable architecture for a given dataset from the set of all possible architectures Zoph and Le (2017). ENAS Pham et al. (2018) is the first work towards fast and inexpensive automatic model design. Baker et al. (2017) use a meta-modeling algorithm based on a Markov decision process, with an average search time for the best model of around 8-10 days. Liu et al. (2018a) and Real et al. (2018) use evolutionary algorithms instead of reinforcement learning. Liu et al. (2017) propose a sequential model-based optimization (SMBO) strategy, which they report to be more efficient than reinforcement learning based techniques. Liu et al. (2018b) is the first major work to pose architecture search as a differentiable problem, rather than as a reinforcement learning problem over a discrete and non-differentiable search space.

Accuracy Prediction: Baker et al. (2018) leverage standard frequentist regression models to predict final performance based on architecture, hyperparameters, and partial learning curves. Deng et al. (2017) predict the performance of a network before training, based on its architecture and hyperparameters. TAPAS Istrate et al. (2018) is another deep neural network accuracy predictor, parameterized on network topology as well as a measure of dataset difficulty. Scheidegger et al. (2018) introduced a class of neural networks called ProbeNets to measure the difficulty of an image classification dataset, without having to train expensive state-of-the-art neural networks on new datasets. In contrast to existing techniques that rely on reinforcement learning or evolutionary algorithms, Elsken et al. (2018) employ a method that combines hill climbing, network morphisms, and cosine annealing based optimization. In summary, works such as Peephole Deng et al. (2017) use only the model architecture, while TAPAS Istrate et al. (2018) uses the model architecture along with a characterization of the unknown query dataset. In the proposed work, we use the model architecture as well as the similarity between the unknown dataset and the known dataset on which the model was trained. Additionally, both the model representation and the dataset similarity are learnt from a large amount of training data.

Traditionally in the literature, deep learning methods are used to solve the personalized recommendation problem. In this research, however, we do the reverse: we use a recommendation system to decide which deep learning model should be used for a given dataset and task. Further, most of the existing NAS techniques for deep learning are still unusable in practical situations, requiring huge clusters of GPUs and consuming a lot of time (https://twitter.com/beenwrekt/status/961262527240921088). Moreover, in most of these applications, finding a novel architecture from scratch is not strictly required, and a minor variant of a popular deep learning model would suffice.

(a) An overview of the proposed system.
(b) An overview of the Unsupervised Model Encoder
Figure 1: Given a query dataset, we first calculate the dataset similarity vector. The obtained pairwise vector along with the model encoding is used to predict the accuracy. Then we rank the results and recommend a model from our repository.

3 Model Recommendation Approach

As illustrated in Figure 1(a), the proposed approach consists of three novel components:

  1. Unsupervised Model Encoder: obtains a fixed-length, continuous-space representation of a variable-length, discrete deep learning model architecture, along with its hyperparameters, using an unsupervised encoding technique.

  2. Dataset Similarity Ranker: predicts the most similar existing dataset for any given unknown dataset.

  3. Accuracy Regressor: learns the mapping from the above two unsupervised representations to the accuracy obtained by the model.

Thus, for a given unknown dataset, our system will retrieve a dataset and architecture from the repository using the dataset similarity ranker, encode a fixed length representation of the architecture using the unsupervised model encoder, and predict the accuracy of the architecture on the unknown dataset using the accuracy regressor. Although this is quite a challenging combination of tasks, we feel that it remains an important problem to solve, given its benefits in saving both resources and time compared to hit-and-trial approaches.

3.1 Unsupervised Model Encoder

Deep neural networks’ architecture can be considered as a directed acyclic graph (DAG) whose nodes represent certain transformations, such as convolution, recurrent cells, dropout, and pooling. In this component, we aim to develop a representation of such a graph (network architecture) in an unsupervised fashion. The first step is to define a representation of individual nodes, i.e., the layers and encode information about the layer sequence into fixed sized vectors. This is analogous to encoding individual words of a sentence using a word embedding model (such as, word2vec) and using the individual word embeddings to learn a language model at the sentence level.

Learning to generate valid models: We exploit the fact that only certain model structures are valid. Valid models are those which can be trained on a given dataset without any errors, whether they turn out to be accurate (optimal) or inaccurate (sub-optimal) for that dataset. Invalid models are those that are either structurally impossible, such as networks with an embedding layer between two LSTM layers, or that cannot be compiled for the given dataset, such as a CNN that reduces the image size to zero or below. Similarities can be drawn between this imposition of structure on deep networks and the imposition of a grammar on a language. This further motivates the use of a sequential language model to encode the possible structures of a network architecture. A manually defined grammar is used to generate a large number of valid models for a given dataset, and these valid models are stored in a custom JSON structure, which is very similar to the Keras JSON format or the Caffe protobuf format.
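To make the "cannot be compiled for the given dataset" case concrete, the following is a minimal sketch of how a stack of convolutions could be checked against an input size. It assumes Keras-style border modes ('Valid' applies no padding, 'Same' pads so that the output has ceil(size / stride) positions); the function names `conv_output_size` and `is_valid_stack` are hypothetical and are not taken from the authors' implementation.

```python
import math

def conv_output_size(size, kernel, stride, border_mode):
    """Spatial output size of one convolution along a single dimension."""
    if border_mode == "Same":
        # 'Same' padding keeps ceil(size / stride) output positions.
        return math.ceil(size / stride)
    if border_mode == "Valid":
        # No padding: the kernel must fit entirely inside the input.
        return (size - kernel) // stride + 1
    raise ValueError(f"unknown border mode: {border_mode}")

def is_valid_stack(input_size, layers):
    """Return True if every layer yields at least one output position.

    layers is a list of (kernel, stride, border_mode) tuples.
    """
    size = input_size
    for kernel, stride, border_mode in layers:
        size = conv_output_size(size, kernel, stride, border_mode)
        if size < 1:
            return False
    return True

# Two 5x5, stride-3 'Valid' convolutions fit a 32-pixel input; a third does not.
print(is_valid_stack(32, [(5, 3, "Valid")] * 2))
print(is_valid_stack(32, [(5, 3, "Valid")] * 3))
```

A grammar-based generator could call such a check before accepting a sampled model, so that only compilable architectures reach the training corpus.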

Construction: As illustrated in Figure 1(b), given an abstract JSON representation of a model architecture as input, we compute a fixed-length vector as the output. The major steps are as follows:

(1) Layer Encoding: A layer vocabulary is constructed which contains all unique layers with their hyperparameter combinations. For instance, a Convolution2D layer has the following hyperparameter set: {'number of filters': [512, 384, 256, 128, 64, 32], 'kernel row': [1, 2, 3, 4, 5], 'kernel column': [1, 2, 3, 4, 5], 'stride row': [1, 2, 3], 'stride column': [1, 2, 3], 'border mode': ['Same', 'Valid']}, contributing 6 × 5 × 5 × 3 × 3 × 2 = 2700 unique combinations to the layer vocabulary. To account for layers or hyperparameters that are not a part of our grammar, we added an Unknown layer, UNK, to our vocabulary so that any kind of deep learning architecture can be encoded. All unique layers in the grammar, together with UNK, form the token vocabulary. The encoding is performed similarly to the Unified Layer Code Deng et al. (2017).
(2) Generating Layer Representations: Each model architecture is represented as a sentence of tokens; for example, Convolution2D_512_3_3_1_1_Same is one token in that sentence. If a functional (non-sequential) model is provided, each path from source to sink is added as an independent sentence. Inspired by word embeddings, for each given layer we predict the surrounding context of layers, resulting in a vector representation for each layer independently. We train a word2vec representation with standard hyperparameters (gensim library) to obtain a fixed-dimensional layer representation.
(3) Generating Model Representations: We use the layer embeddings to initialize a three-layer LSTM model with tied weights and train it on a language modelling task to generate the 512-dimensional model representation. Sentence perplexity is used as the objective function to be optimized while learning the language model.

Thus, we develop an unsupervised subsystem to convert a variable length sequence of discrete network layers to a succinct, continuous, vector-space representation.
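As an illustration of the layer encoding step, the sketch below enumerates the Convolution2D tokens from the hyperparameter set listed above and maps an architecture to vocabulary tokens with an UNK fallback. The token format follows the Convolution2D_512_3_3_1_1_Same example; the helper names `layer_tokens` and `encode`, and the `Dense_10` token, are hypothetical and only serve the example.

```python
from itertools import product

# Hyperparameter set for Convolution2D as listed in the layer encoding step.
CONV2D_HYPERPARAMS = {
    "number of filters": [512, 384, 256, 128, 64, 32],
    "kernel row": [1, 2, 3, 4, 5],
    "kernel column": [1, 2, 3, 4, 5],
    "stride row": [1, 2, 3],
    "stride column": [1, 2, 3],
    "border mode": ["Same", "Valid"],
}

def layer_tokens(layer_name, hyperparams):
    """Enumerate one token per hyperparameter combination of a layer type."""
    return [
        "_".join([layer_name, *map(str, combo)])
        for combo in product(*hyperparams.values())
    ]

def encode(model_layers, vocabulary):
    """Map a sequence of layer tokens to the vocabulary, falling back to UNK."""
    return [tok if tok in vocabulary else "UNK" for tok in model_layers]

conv_tokens = layer_tokens("Convolution2D", CONV2D_HYPERPARAMS)
vocabulary = set(conv_tokens) | {"UNK"}

print(len(conv_tokens))  # 6 * 5 * 5 * 3 * 3 * 2 = 2700
print(encode(["Convolution2D_512_3_3_1_1_Same", "Dense_10"], vocabulary))
```

The resulting token sequences are the "sentences" on which a word embedding and a language model can then be trained.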

(a) An overview of the Dataset Similarity Ranker.
(b) An overview of the Accuracy Regressor
Figure 2: Given the dataset and model encoding representations, we can compute the predicted accuracies for that pre-trained pairing. In this manner, we predict the accuracies. They are further ranked and the best predicted accuracy is used to return the model and dataset.

3.2 Dataset Similarity Ranker

This component computes the similarity between the given dataset and all existing datasets in the dataset repository. The aim is to study the similarity between datasets and provide a guided approach for transferability between datasets. As illustrated in Figure 2(a), given a query dataset and a list of repository datasets, the classifiers used for calculating the dataset similarity are first trained on the repository datasets as follows:

1. A set of data samples is picked uniformly at random from each of the repository datasets.

2. For every sample in these sets, we extract features from the input data. These form the input vectors, and the output class is the index of the repository dataset from which the sample was drawn.

3. Several classifiers are trained on each of the sampled image features to predict which of the repository datasets a given feature vector belongs to. Torralba and Efros (2011) studied the presence of a unique signature for every dataset, which enables us to find similarity and dissimilarity between datasets.

Now, given an unknown query dataset,

1. A set of data samples is randomly picked from the query dataset. For each sample, we extract the full set of features.

2. The features are passed to the respective trained multi-class classifiers, one per feature type and classifier, which classify each sample individually into one of the repository datasets.

3. We collect all the predictions and perform majority voting fusion across the ensemble, obtaining an output vector that denotes the probability of similarity between the unknown dataset and each of the repository datasets. This dataset similarity vector (equation (1)) is formed by concatenating the similarity values across the repository datasets.

There are three feature extractors used for the image modality: (i) GIST Oliva and Torralba (2006), (ii) DAISY Tola et al. (2010), and (iii) Local Binary Pattern (LBP) Zhang et al. (2007). Five popular classifiers are used in the ensemble: (i) Naive Bayes (NB), (ii) Random Decision Forest (RDF), (iii) Boosted Gradient Trees (BGT), (iv) Multilayer Perceptron (MLP), and (v) Support Vector Machines (SVM).
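The query-time procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: each (feature, classifier) pair is treated as one ensemble member, each member's predictions are given as a list of repository-dataset indices, and the similarity score of a repository dataset is taken to be the fraction of query samples whose majority vote lands on it. Ties are broken in favour of the label that appears first among the members. The function name `similarity_vector` is hypothetical.

```python
from collections import Counter

def similarity_vector(predictions, num_datasets):
    """Fuse ensemble predictions into one score per repository dataset.

    predictions[c][s] is the repository-dataset index that ensemble member c
    assigns to query sample s.
    """
    num_samples = len(predictions[0])
    votes = Counter()
    for s in range(num_samples):
        # Majority vote across the ensemble for this sample.
        per_sample = Counter(member[s] for member in predictions)
        winner, _ = per_sample.most_common(1)[0]
        votes[winner] += 1
    # Fraction of query samples assigned to each repository dataset.
    return [votes[d] / num_samples for d in range(num_datasets)]

# Three ensemble members, four query samples, three repository datasets.
toy_predictions = [
    [0, 1, 1, 2],
    [0, 1, 2, 2],
    [1, 1, 2, 2],
]
print(similarity_vector(toy_predictions, num_datasets=3))
```

Because every sample contributes exactly one vote, the entries of the returned vector sum to one and can be read as a similarity distribution over the repository datasets.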

3.3 Accuracy Regressor

The accuracy regressor takes as input the dataset similarity vector between the unknown dataset and the repository datasets (obtained using equation (1)) and the 512-dimensional model representation vector, and predicts the accuracy of the model on the unknown dataset, as shown in Figure 2(b). This mapping is learnt using a supervised regression approach, which makes it possible to predict the accuracy of deep networks efficiently, without training them. The system predicts the expected accuracy of a model trained on a dataset with a given degree of similarity to the query dataset, as given by the dataset similarity vector.

Given a query dataset, the accuracy regressor component is learnt as follows:

1. We extract the dataset similarity vector for every pair of datasets (across all seven image datasets) using the dataset similarity ranker subsystem.

2. Using the model encoder subsystem, we encode the models available in our model repository to obtain a vector for each model.

3. We concatenate these two features into a single input vector and perform regression using an ensemble of regressors to learn a mapping between these high-dimensional input vectors and the accuracy of the model, pre-trained on one dataset of the pair and fine-tuned on the other.

We use eight different regressors: Support Vector Regressors with (i) RBF, (ii) linear, and (iii) polynomial kernels, (iv) Multi-Layer Perceptron, (v) Ridge Regression, (vi) Random Forest Regressor, (vii) Gradient Boosting Regressor, and (viii) AdaBoost Regressor.
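As a worked illustration of the regression step, the sketch below concatenates a similarity vector with a model vector and fits a ridge regressor in closed form, w = (XᵀX + λI)⁻¹ Xᵀy, using only the standard library. It stands in for one member of the ensemble and is not the authors' implementation; there is no bias term, and the names `make_input`, `ridge_fit`, and `ridge_predict` as well as the toy data are hypothetical.

```python
def make_input(similarity_vec, model_vec):
    """Concatenate the dataset similarity vector and the model representation."""
    return list(similarity_vec) + list(model_vec)

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: solve (X^T X + lam * I) w = X^T y."""
    n, d = len(X), len(X[0])
    # Augmented system [A | b] with A = X^T X + lam * I and b = X^T y.
    A = []
    for i in range(d):
        row = [sum(X[k][i] * X[k][j] for k in range(n)) for j in range(d)]
        row[i] += lam
        row.append(sum(X[k][i] * y[k] for k in range(n)))
        A.append(row)
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(d):
            if r != col and A[r][col] != 0.0:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][d] / A[i][i] for i in range(d)]

def ridge_predict(w, x):
    """Predicted accuracy for one concatenated input vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Toy data: two-dimensional inputs with targets that follow y = x[1].
X_toy = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y_toy = [1.0, 2.0, 3.0]
w_toy = ridge_fit(X_toy, y_toy, lam=0.0)
print(w_toy, ridge_predict(w_toy, [1.0, 4.0]))
```

With a positive `lam` the system matrix is positive definite, so the elimination never meets a zero pivot; `lam=0.0` is used in the toy call only because the toy design matrix already has full column rank.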

4 Experiments and Analysis

In this section, we demonstrate the performance of the three individual components and the overall approach. All the experiments were implemented using PyTorch (https://pytorch.org/), and the code is made publicly available along with the trained models: https://github.com/dl-model-recommend/cikm-2019

4.1 Model Repository

The image dataset repository contains seven diverse benchmark vision datasets: (i) MNIST, (ii) Fashion-MNIST, (iii) CIFAR-10, (iv) CIFAR-100, (v) SVHN, (vi) STL-10, and (vii) GTSRB. All of them are resized to a common image size. The choice of image-based deep learning architectures in the repository is constrained by this input image size: (i) VGG-16 Simonyan and Zisserman (2014), (ii) Network-in-Network (NIN) Lin et al. (2013), (iii) Strictly Convolutional Neural Network (All-CNN) Springenberg et al. (2014), (iv) ResNet-20 He et al. (2016a), (v) Wide-ResNet Zagoruyko and Komodakis (2016), (vi) Pre-ResNet He et al. (2016b), and (vii) LeNet LeCun et al. (1998).

4.2 Experimental Details

To learn the word embedding and the language model for unsupervised model encoding, we generated random valid models using the proposed grammar (a simulated dataset). For each model, we randomly replaced a layer with UNK with a fixed probability. This dropout makes the sampling more diverse and also enables us to encode models which cannot be defined by the grammar. To train and evaluate the accuracy regressor, we take a subset of the generated models and train them on the seven image datasets; the subset includes both inaccurate models, which perform poorly, and accurate models on the respective datasets. This constitutes a total of 1204 models along with the accuracy they obtain on the respective datasets. We randomly divide the models into an 80-20 train-test split and use this dataset to train and evaluate the accuracy regressor.
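The UNK replacement and the 80-20 split can be sketched as below. This is an illustration only: it assumes the replacement is applied independently to each layer token, the probability `unk_prob` is a free parameter whose value is not taken from the text, and the helper names are hypothetical.

```python
import random

def apply_unk_dropout(tokens, unk_prob, rng):
    """Independently replace each layer token with UNK with probability unk_prob."""
    return ["UNK" if rng.random() < unk_prob else tok for tok in tokens]

def train_test_split(items, test_fraction, rng):
    """Shuffle a copy of items and split it into (train, test)."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rng = random.Random(0)  # fixed seed so that the sketch is reproducible

sentence = ["Convolution2D_64_3_3_1_1_Same", "MaxPooling2D_2_2", "Dense_10"]
print(apply_unk_dropout(sentence, unk_prob=0.1, rng=rng))

# 1204 (model, accuracy) records, referred to here only by their index.
train_ids, test_ids = train_test_split(range(1204), test_fraction=0.2, rng=rng)
print(len(train_ids), len(test_ids))
```

Splitting by record index keeps the sketch independent of how the model encodings and accuracies are stored.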

(a) The tSNE plot of VGG variant and random DL models
(b) Correlation plot with coefficients for an unknown dataset
Figure 3: The performance of unsupervised feature encoder

4.3 Unsupervised Model Encoder

We evaluate the subsystem using the perplexity of the language model that generates the encoded representations, as shown in Figure 5(b). A lower perplexity score implies that the language model is better at generating valid models. To study the effectiveness of our learned model architecture representation, we take variations of the VGG model, obtained by varying the number of blocks and the hyperparameters, along with random deep learning models. The two-dimensional tSNE visualization of the model representations in Figure 3(a) shows that all the VGG-like models are clustered together and are clearly separated from the random deep learning models. This shows that similar-looking architectures have similar representations in the learnt feature space. Thus, the proposed unsupervised model encoder can be used as a general purpose deep learning architecture encoding technique and can be extended to multiple applications.

(a) Sankey plot with computed dataset similarity
(b) The effect of sample size
Figure 4: The performance of dataset similarity ranker
Pearson 0.981 0.844 0.564 0.796 0.572 0.217
Spearman 0.928 0.943 0.883 0.886 0.429 0.486
Kendall 0.828 0.867 0.788 0.733 0.200 0.333
Pearson 0.624 0.070 0.976 0.312 0.970 0.899
Spearman 0.828 0.029 0.551 0.232 0.599 0.714
Kendall 0.733 0.066 0.414 0.138 0.466 0.466
Pearson 0.480 0.492 0.983 0.085 0.978 0.934
Spearman 0.314 0.486 0.464 0.232 0.314 0.714
Kendall 0.200 0.333 0.276 0.138 0.200 0.466
Pearson 0.95 0.886 0.728 0.790 0.720 -0.185
Spearman 0.638 0.943 0.706 0.829 0.486 -0.029
Kendall 0.552 0.867 0.645 0.733 0.333 -0.067
Pearson 0.000 0.089 0.966 0.499 0.969 0.697
Spearman -0.085 0.257 0.522 0.232 0.486 0.609
Kendall -0.200 0.200 0.414 0.138 0.333 0.414
Pearson -0.370 0.721 0.962 0.497 0.965 0.945
Spearman -0.085 0.428 0.521 0.232 0.486 0.714
Kendall -0.200 0.200 0.414 0.138 0.333 0.466
Pearson 0.981 0.991 0.756 0.798 0.755 0.982
Spearman 0.928 0.943 0.530 0.886 0.600 1.00
Kendall 0.828 0.867 0.501 0.733 0.467 1.00
Table 1: The correlation coefficients obtained between the dataset similarity scores and the actual accuracy. This suggests that the dataset similarity score is a good indicator of the model's accuracy.

4.4 Dataset Similarity Ranker

We evaluate the performance of the dataset similarity ranker by performing an exhaustive leave-one-out test on the dataset repository. Each of the seven datasets is treated in turn as the unknown dataset, and our system predicts a ranking of the remaining repository datasets. To obtain the ground truth, we train all seven models, (i) VGG-16, (ii) NIN, (iii) All-CNN, (iv) ResNet-20, (v) Wide-ResNet, (vi) Pre-ResNet, and (vii) LeNet, on each of the 6 remaining datasets present in the catalog. Given the query dataset, we fine-tune these networks on it and record the resulting accuracy values. The ensemble of classifiers is trained using a sample of images taken from the training split of the respective datasets and tested using a sample of images taken from the test split. The covariate shift that exists between the training and test splits of the respective datasets could also influence the performance of the dataset similarity ranker. We obtain the correlation scores and show that the dataset ranking provided by our system is highly correlated with the ranking obtained by exhaustive fine-tuning. This indicates that models pre-trained on the dataset that we predicted to be the most similar provided the best accuracy after being fine-tuned on the unknown dataset. The results are reported in Table 1, and the correlation plot for CIFAR-100 as the unknown dataset and LeNet as the model is provided in Figure 3(b). It can be observed that the correlation coefficients are positive and high for all the datasets except SVHN. This implies that it is possible to find similar datasets that provide good pre-training for models. It also suggests that there are no datasets similar to SVHN in the repository, so none of the pre-trained models is likely to yield high accuracy on SVHN.
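The three correlation coefficients reported in Table 1 can be computed as in the following sketch, which uses only the standard library. The similarity scores and accuracies are made-up values for six repository datasets, chosen purely for illustration. The Kendall coefficient implemented here is tau-a, which assumes that there are no ties; whether the reported values use tau-a or a tie-corrected variant is not stated in the text.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

# Hypothetical similarity scores and fine-tuned accuracies for six datasets.
similarity = [0.50, 0.20, 0.10, 0.15, 0.03, 0.02]
accuracy = [0.90, 0.70, 0.60, 0.65, 0.50, 0.55]
print(pearson(similarity, accuracy))
print(spearman(similarity, accuracy))
print(kendall_tau_a(similarity, accuracy))
```

In this toy example only the two least similar datasets swap places in the accuracy ranking, so both rank correlations stay close to one.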

Also, based on general intuition, we expect CIFAR-10 and CIFAR-100, and likewise MNIST and Fashion-MNIST, to look visually similar. The proportion of samples from each unknown dataset classified into each repository dataset is shown in Figure 4(a), and it follows our intuition. Furthermore, we study the effect of sample size, which is one of the critical hyperparameters for computing the dataset similarity, by comparing the classification performance across different sample sizes. The result is shown in Figure 4(b): our subsystem gives reliable predictions irrespective of the sample size for the MNIST and CIFAR variations. However, for SVHN and STL-10, for which there are no related datasets, a smaller sample size tends to classify the input images towards SVHN.

(a) The MSE error of the various regressors
(b) The perplexity graph of training an LSTM for Unsupervised Model Encoder.
Figure 5: The performance of accuracy regressors and the reason for failure in text based DL models

4.5 Accuracy Regressor

We evaluate the regressor models using the Mean Squared Error (MSE) between the accuracy predicted by the regressor and the actual accuracy obtained after fine-tuning. The obtained results are shown in Figure 5(a). It can be observed that ridge regression achieves the lowest MSE. This shows that a simple regression could predict the approximate performance of a deep learning model on a given unknown dataset, without the need for sophisticated models. Thus, for a given unknown dataset, we sample images and find the most similar dataset using an ensemble of simple machine learning classifiers. For all the architectures available in the repository for the most similar dataset, we extract a fixed-length representation using the unsupervised model encoder. This is a simple forward pass through the word embedding layer and the LSTM-based language model. The dataset similarity vector and the model representation are fed into the accuracy regressor to predict the performance of the given models and find the best performing architecture. Hence, we show that accuracy prediction could be a practical, almost real-time solution that could be adopted in various challenging domains.
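The end-to-end recommendation step described above can be sketched as follows. The sketch assumes that the similarity scores are available as a mapping from repository dataset name to score, that the repository stores one encoding per model and dataset, and that `predict_accuracy` is any trained regressor exposed as a callable; all names, encodings, and numbers are hypothetical. An `mse` helper mirrors the evaluation metric.

```python
def mse(predicted, actual):
    """Mean squared error between predicted and actual accuracies."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def recommend(similarity, repository, predict_accuracy):
    """Pick the most similar dataset and rank its models by predicted accuracy.

    similarity: dict mapping repository dataset name -> similarity score.
    repository: dict mapping dataset name -> {model name: model encoding}.
    predict_accuracy: callable taking the concatenated feature vector.
    """
    best_dataset = max(similarity, key=similarity.get)
    # Fix an ordering of the datasets so the similarity vector is well defined.
    sim_vec = [similarity[name] for name in sorted(similarity)]
    ranked = sorted(
        (
            (predict_accuracy(sim_vec + list(encoding)), model_name)
            for model_name, encoding in repository[best_dataset].items()
        ),
        reverse=True,
    )
    return best_dataset, ranked

# Toy inputs: two-dimensional "encodings" and a stand-in regressor that
# simply returns the last feature.
toy_similarity = {"STL-10": 0.6, "CIFAR-10": 0.3, "MNIST": 0.1}
toy_repository = {
    "STL-10": {"VGG-16": [0.2, 0.9], "LeNet": [0.1, 0.3]},
    "CIFAR-10": {"VGG-16": [0.2, 0.8]},
    "MNIST": {"LeNet": [0.1, 0.7]},
}
dataset, ranking = recommend(toy_similarity, toy_repository, lambda f: f[-1])
print(dataset, ranking)
print(mse([0.9, 0.3], [0.8, 0.5]))
```

Only the classifiers' forward passes, one model-encoder pass per candidate, and one regressor call per candidate are needed at query time, which is what makes the approach close to real-time.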

5 Practical Use Case

A practitioner will usually prefer the most recent deep learning model, which might be unnecessarily complex for the task at hand. However, in principle the choice of model depends on the properties of the dataset and the task [4]. It is interesting to study the performance of the proposed model recommendation system with respect to human preferences. To show the effectiveness of the proposed deep learning model recommendation pipeline in a practical setting, we provide human baselines for three different datasets: (i) Caltech-UCSD Birds-200-2011 Welinder et al. (2010), (ii) Stanford Cars Krause et al. (2013), and (iii) ETHZ Food-101 Bossard et al. (2014). Accuracies of various deep learning models on these datasets are reported in the literature [4]. For Caltech-UCSD Birds-200-2011 and ETHZ Food-101, our approach retrieved ResNet as the recommended architecture. The ground truth training, as performed in the literature [4], confirms that ResNet achieves much higher accuracy than the LeNet and VGG models on these datasets. However, in the case of the Stanford Cars dataset, our approach recommended the VGG-16 architecture. This trend can be observed in the literature as well, where VGG-16 performs better than the ResNet variants and LeNet. Thus, although the accuracy prediction provides only a ballpark estimate of the actual accuracy, the rank order of the retrieved models suggests that the proposed approach does not always retrieve the most complex model but rather retrieves models based on the properties of the dataset, the task, and the architecture of the model.

6 Conclusion and Future Work

We proposed a novel system for recommending the most suitable pre-trained architecture for a given unknown dataset. The proposed system consists of three subsystems: a dataset similarity subsystem, which predicts the similarity between any two given datasets; an unsupervised model encoder, which extracts a fixed-length, continuous vector representation of a model; and an accuracy regressor, which estimates the accuracy of a deep learning model on an unknown query dataset. Combining these subsystems, we pursue the aim of recommending neural network models. Our system is one of the earliest approaches in this direction, and we hope that this research acts as a seed for future extensions.


  • B. Baker, O. Gupta, N. Naik, and R. Raskar (2017) Designing neural network architectures using reinforcement learning. External Links: Link Cited by: §2.
  • B. Baker, O. Gupta, R. Raskar, and N. Naik (2018) Accelerating neural architecture search using performance prediction. External Links: Link Cited by: §2.
  • L. Bossard, M. Guillaumin, and L. Van Gool (2014)

    Food-101 – mining discriminative components with random forests


    European Conference on Computer Vision

    Cited by: §5.
  • [4] CNN baseline for fine grained recognition. Note: http://guopei.github.io/2016/Benchmarking-Fine-Grained-Recognition/Accessed: 2019-02-03 Cited by: §5.
  • A. Coates, A. Ng, and H. Lee (2011) An analysis of single-layer networks in unsupervised feature learning. In

    Proceedings of the fourteenth international conference on artificial intelligence and statistics

    pp. 215–223. Cited by: §1.
  • A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes (2017) Supervised learning of universal sentence representations from natural language inference data. In

    Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

    pp. 670–680. Cited by: item 2.
  • B. Deng, J. Yan, and D. Lin (2017) Peephole predicting network performance before training. arXiv preprint arXiv:1712.03351. Cited by: §2, §3.1.
  • Elsken,Thomas, J. H. Metzen, and F. Hutter (2018) Simple and efficient architecture search for convolutional neural networks. External Links: Link Cited by: §2.
  • A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun (2017) Dermatologist-level classification of skin cancer with deep neural networks. Nature 542 (7639), pp. 115. Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016a) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §4.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016b) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Cited by: §4.1.
  • R. Istrate, F. Scheidegger, G. Mariani, D. Nikolopoulos, C. Bekas, and A. Malossi (2018) TAPAS: train-less accuracy predictor for architecture search. arXiv preprint arXiv:1806.00250. Cited by: §2.
  • R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler (2015) Skip-thought vectors. In Advances in neural information processing systems, pp. 3294–3302. Cited by: item 2.
  • S. Kornblith, J. Shlens, and Q. V. Le (2018) Do better imagenet models transfer better?. arXiv preprint arXiv:1805.08974. Cited by: §1.
  • J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3d object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 554–561. Cited by: §5.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §1.
  • Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.1.
  • M. Lin, Q. Chen, and S. Yan (2013) Network in network. arXiv preprint arXiv:1312.4400. Cited by: §1, §4.1.
  • C. Liu, B. Zoph, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy (2017) Progressive neural architecture search. arXiv preprint arXiv:1712.00559. Cited by: §2.
  • H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu (2018a) Hierarchical representations for efficient architecture search. In ICLR. Cited by: §2.
  • H. Liu, K. Simonyan, and Y. Yang (2018b) DARTS: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: §2.
  • A. Oliva and A. Torralba (2006) Building the gist of a scene: the role of global image features in recognition. Progress in brain research 155, pp. 23–36. Cited by: §3.2.
  • H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268. Cited by: §2.
  • R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng (2007) Self-taught learning: transfer learning from unlabeled data. In Proceedings of the 24th international conference on Machine learning, pp. 759–766. Cited by: §1.
  • E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2018) Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548. Cited by: §2.
  • F. Scheidegger, R. Istrate, G. Mariani, L. Benini, C. Bekas, and C. Malossi (2018) Efficient image dataset classification difficulty estimation for predicting deep-learning accuracy. arXiv preprint arXiv:1803.09588. Cited by: §2.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1, §4.1.
  • J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller (2014) Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806. Cited by: §4.1.
  • Y. Sun, X. Wang, and X. Tang (2014) Deep learning face representation from predicting 10,000 classes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1891–1898. Cited by: §1.
  • E. Tola, V. Lepetit, and P. Fua (2010) Daisy: an efficient dense descriptor applied to wide-baseline stereo. IEEE transactions on pattern analysis and machine intelligence 32 (5), pp. 815–830. Cited by: §3.2.
  • A. Torralba and A. A. Efros (2011) Unbiased look at dataset bias. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 1521–1528. Cited by: §3.2.
  • X. Wang, D. Kumar, N. Thome, M. Cord, and F. Precioso (2015) Recipe recognition with large multimodal food dataset. In Multimedia & Expo Workshops (ICMEW), 2015 IEEE International Conference on, pp. 1–6. Cited by: §1.
  • P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona (2010) Caltech-UCSD Birds 200. Technical report Technical Report CNS-TR-2010-001, California Institute of Technology. Cited by: §5.
  • S. Zagoruyko and N. Komodakis (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §4.1.
  • L. Zhang, R. Chu, S. Xiang, S. Liao, and S. Z. Li (2007) Face detection based on multi-block lbp representation. In International conference on biometrics, pp. 11–18. Cited by: §3.2.
  • B. Zoph and Q. V. Le (2017) Neural architecture search with reinforcement learning. In ICLR. Cited by: §2.

7 Additional Details: Dataset Similarity Ranker

The model catalog has seven image datasets. The properties and details of each of these datasets are provided in Table 2.

Dataset Image Size #Classes TrainSize TestSize
Table 2: Different image datasets, and their properties, which are used as a part of the proposed model repository.

Similarly, the model catalog has six text datasets. The properties and details of each of these datasets are provided in Table 3.

Dataset #Samples #Classes
Yahoo Answers
YELP Reviews
YELP Review Polarity
Table 3: Different text datasets, and their properties, which are used as a part of the proposed model repository.

The ensemble applied to the extracted features contains five popular classifiers: (i) Naive Bayes (NB), (ii) Random Decision Forest (RDF), (iii) Boosted Gradient Trees (BGT), (iv) Multilayer Perceptron (MLP), and (v) Support Vector Machines (SVM). Table 4 lists the hyperparameters used for each of these classifiers. While some of these hyperparameters are chosen using human expertise and popular defaults, most of the values are obtained through extensive search and experimentation.

Classifier Parameters
Multinomial Naive Bayes Nil
Random Decision Forest Depth: 25
Boosted Gradient Trees Depth: 25
Multi Layer Perceptron Solver: LBFGS, HiddenSizes: 256, 128
Support Vector Machines Kernel: Linear
Table 4: Different classifiers and their parameters used in constructing the ensemble of classifiers model.
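The classifier settings in the table above can be written down as a small configuration. The sketch below is only illustrative: the paper does not name the library it used, so the mapping onto scikit-learn classes, the dictionary keys, and the helper name `build_classifiers` are all assumptions, and every setting not listed in the table is left at the library default.

```python
# Hyperparameters copied from the classifier table; the key names are hypothetical.
CLASSIFIER_PARAMS = {
    "naive_bayes": {},  # "Nil" in the table: no hyperparameters set
    "random_forest": {"max_depth": 25},
    "gradient_boosting": {"max_depth": 25},
    "mlp": {"solver": "lbfgs", "hidden_layer_sizes": (256, 128)},
    "svm": {"kernel": "linear"},
}


def build_classifiers(params=CLASSIFIER_PARAMS):
    """Instantiate the five classifiers (assumes scikit-learn is installed)."""
    # Imported lazily so the configuration can be inspected without scikit-learn.
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC

    # Assumed mapping from the table's classifier names to scikit-learn classes.
    constructors = {
        "naive_bayes": MultinomialNB,
        "random_forest": RandomForestClassifier,
        "gradient_boosting": GradientBoostingClassifier,
        "mlp": MLPClassifier,
        "svm": SVC,
    }
    return {name: constructors[name](**kwargs) for name, kwargs in params.items()}
```

Note that scikit-learn's `MultinomialNB` expects non-negative feature values, so this mapping would only suit features such as counts or histograms.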

8 Additional Details: Unsupervised Model Encoder

To generate valid random text-based deep learning models for learning the language modeling part, the following grammar is used:

  1. RNN cell can be RNN/LSTM/GRU

  2. Pooling can be last/max/mean

  3. 300 dimensional embedding, GloVe pretrained

  4. 256 dimensional hidden size

  5. 2 layers, 0.5 dropout between layers

  6. bidirectional learning

  7. Adam optimizer, 0.001 learning rate

  8. gradient clipping at gradient norm of 5

  9. Weight decay of 1e-4

  10. 15 epochs, batch size of 128
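As a rough illustration of how such a grammar could drive model generation, the sketch below samples model descriptions in which only the RNN cell and the pooling type vary, while the remaining settings stay fixed. The dictionary keys, the function name `sample_text_model`, and the uniform sampling are assumptions made for this example; the paper's actual generator may differ.

```python
import random

# Choices that vary between generated models (items 1 and 2 of the list).
CELL_TYPES = ("RNN", "LSTM", "GRU")
POOLING_TYPES = ("last", "max", "mean")

# Settings that the list holds fixed (items 3 to 10); key names are hypothetical.
FIXED_SETTINGS = {
    "embedding_dim": 300,
    "embedding_init": "GloVe",
    "hidden_size": 256,
    "num_layers": 2,
    "dropout": 0.5,
    "bidirectional": True,
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "grad_clip_norm": 5.0,
    "weight_decay": 1e-4,
    "epochs": 15,
    "batch_size": 128,
}


def sample_text_model(rng):
    """Draw one model description that satisfies the grammar."""
    config = dict(FIXED_SETTINGS)
    # Uniform sampling is an assumption; the source does not state a distribution.
    config["cell"] = rng.choice(CELL_TYPES)
    config["pooling"] = rng.choice(POOLING_TYPES)
    return config


rng = random.Random(0)  # fixed seed so the sample is repeatable
models = [sample_text_model(rng) for _ in range(5)]
```

Under this reading, the grammar admits 3 × 3 = 9 distinct cell and pooling combinations.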

9 Additional Details: Accuracy Prediction

The hyperparameters for training the baseline models and fine-tuning models are as follows:

  1. Learning rate 0.1

  2. Weight decay 5e-4

  3. SGD with 0.9 momentum

  4. Batch size of 128, 100 epochs with early stopping

  5. Transformations / Augmentation: RandomCrop (CenterCrop during test), Random Horizontal Flip (except for digit datasets), Mean/Variance Normalization
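The settings above might be collected as in the following sketch. It assumes PyTorch for the optimizer, which the list does not state; the names `TrainConfig`, `augmentation_names`, and `make_optimizer` are hypothetical. The crop size and the early-stopping criterion are not given in the list, so they are left out rather than guessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values taken from items 1 to 4 of the list above."""
    learning_rate: float = 0.1
    weight_decay: float = 5e-4
    momentum: float = 0.9
    batch_size: int = 128
    max_epochs: int = 100  # upper bound; early stopping may end training sooner


def augmentation_names(train, digit_dataset):
    """List the transformations applied, in order, for one data split."""
    names = ["RandomCrop" if train else "CenterCrop"]
    # The list excludes flips for digit datasets; applying flips only during
    # training is an assumption made here.
    if train and not digit_dataset:
        names.append("RandomHorizontalFlip")
    names.append("MeanVarianceNormalization")
    return names


def make_optimizer(parameters, config=TrainConfig()):
    """Build SGD with momentum (assumes PyTorch is installed)."""
    import torch  # imported lazily; only needed when actually training

    return torch.optim.SGD(
        parameters,
        lr=config.learning_rate,
        momentum=config.momentum,
        weight_decay=config.weight_decay,
    )
```

For example, `augmentation_names(train=True, digit_dataset=True)` omits the horizontal flip, matching the exception noted for digit datasets.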

10 Implications on Text Classification Datasets and Models

Having observed the efficacy of our proposed pipeline in a practical application in fine-grained computer vision, we attempted the more ambitious goal of applying it to text classification. We constructed a repository of six datasets: (i) AG-News, (ii) DBPedia, (iii) Yahoo Answers, (iv) YELP Reviews, (v) YELP Review Polarity, and (vi) SST. For the model repository, we have (i) LSTM, (ii) GRU, and (iii) RNN cells, with bidirectional / unidirectional variants, as well as 1-layer / 2-layer variants, for a total of 12 variants. We observe that the proposed pipeline, as it exists currently, does not excel at recommending text classification models, and we derive the following insights from our initial experiments:

Figure 6: The tSNE representation of text classification model representations obtained using our unsupervised model encoder.
  1. Unsupervised Model Encoder: A set of unique layers and a vocabulary of tokens were used to simulate and generate valid text classification models of varying depth. The word2vec and the model architecture encoder (explained in section 3.1) were trained from scratch on the generated text classification architectures. However, for the obtained representation we did not observe good clusters in the corresponding tSNE space, as shown in Figure 6. This is potentially due to the lack of diversity in the generated RNN based architectures. Upon plotting the tSNE reduced representations of CNN and RNN architectures together, we clearly obtained two clusters, indicating that the unsupervised model encoder learns the representations to some extent.

  2. Dataset Similarity: To find the similarity between two datasets, we use a similar ensemble based technique (as explained in section 3.2). Four feature extractors are used for the text modality: (i) BoW-TF, (ii) BoW-TF-IDF, (iii) InferSent Conneau et al. (2017), and (iv) Skip-thoughts Kiros et al. (2015). Five popular classifiers are used in the ensemble: (i) Naive Bayes (NB), (ii) Random Decision Forest (RDF), (iii) Boosted Gradient Trees (BGT), (iv) Multilayer Perceptron (MLP), and (v) Support Vector Machines (SVM). However, we observed a negative correlation between the predicted performance and the actual performance (as compared to Table 1). One possible explanation is that the similarity between two text datasets is highly sensitive to the overlapping vocabulary space, unlike in images, where there exist abstract overlapping concepts such as edges, corners, and basic shapes.
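
One simple way to probe the vocabulary-overlap explanation in item 2 would be to measure how much of the vocabulary two text datasets share. The sketch below uses the Jaccard index over lowercased word tokens; this measure, the tokenizer, and the function names are assumptions for illustration and are not part of the proposed pipeline.

```python
import re


def vocabulary(texts):
    """Lowercased set of word tokens appearing in a collection of texts."""
    return {token for text in texts for token in re.findall(r"[a-z0-9']+", text.lower())}


def vocabulary_overlap(texts_a, texts_b):
    """Jaccard index between two vocabularies: |A ∩ B| / |A ∪ B|."""
    vocab_a, vocab_b = vocabulary(texts_a), vocabulary(texts_b)
    union = vocab_a | vocab_b
    # Two empty collections share nothing, so report zero overlap.
    return len(vocab_a & vocab_b) / len(union) if union else 0.0


# Toy sentences invented for illustration; they are not drawn from the datasets.
reviews = ["the food was great", "the service was slow"]
news = ["the market was slow today", "stocks fell sharply"]
overlap = vocabulary_overlap(reviews, news)  # 3 shared tokens out of 11 distinct
```

A low overlap between a query dataset and every dataset in the repository would be one sign that similarity scores built on bag-of-words features are unreliable for that query.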