Teacher Guided Architecture Search

08/04/2018 ∙ by Pouya Bashivan, et al. ∙ MIT, University of Amsterdam

Strong improvements in network performance in vision tasks have resulted from the search for alternative network architectures, and prior work has shown that this search process can be automated and guided by evaluating candidate network performance following limited training (Performance Guided Architecture Search or PGAS). However, because of the large architecture search spaces and the high computational cost associated with evaluating each candidate model, further gains in computational efficiency are needed. Here we present a method termed Teacher Guided Search for Architectures by Generation and Evaluation (TG-SAGE) that yields up to an order-of-magnitude gain in search efficiency over PGAS methods. Specifically, TG-SAGE guides each step of the architecture search by evaluating the similarity of internal representations of the candidate networks with those of the (fixed) teacher network. We show that this procedure leads to a significant reduction in required per-sample training and that this advantage holds for two different search spaces of architectures and two different search algorithms. We further show that in the space of convolutional cells for visual categorization, TG-SAGE finds a cell structure with similar performance as was previously found using other methods but at a total computational cost that is two orders of magnitude lower than Neural Architecture Search (NAS) and more than four times lower than progressive neural architecture search (PNAS). These results suggest that TG-SAGE can be used to accelerate network architecture search in cases where one has access to some or all of the internal representations of a teacher network of interest, such as the brain.







1 Introduction

The accuracy of deep convolutional neural networks (CNNs) for visual categorization has advanced substantially from 2012 levels (AlexNet [21]) to current state-of-the-art CNNs like ResNet [15], Inception [36], and DenseNet [17]. This progress is mostly due to the discovery of new network architectures. Yet even the space of feedforward neural network architectures is essentially infinite, and given this complexity, the design of better architectures remains a challenging and time-consuming task.

Several approaches have been proposed to automate the discovery of neural network architectures, including random search [28], reinforcement learning [39], evolution [31], and sequential model-based optimization (SMBO) [23, 7]. These methods operate by iteratively sampling from the hyperparameter space, training the corresponding architecture, evaluating it on a validation set, and using the search history of those scores to guide further architecture sampling. But even with recent improvements in search efficiency, the total cost of architecture search is still outside the reach of many groups and thus impedes research in this area (e.g. some of the recent work in this area has spent 40-557k GPU-hours per search experiment [30, 39]).

What drives the total computational cost of running a search? For current architecture search procedures (above), the parameters of each sampled architecture must be trained before its performance can be evaluated, and the amount of such training turns out to be a key driver of the total computational cost. Thus, to reduce that total cost, each architecture is typically only partially trained to a premature state, and its premature performance is used as a proxy for its mature performance (i.e. the performance it would have achieved if it were actually fully trained).

Because the search goal is high mature performance in a task of interest, the most natural choice of an architecture evaluation score is its premature performance. However, this may not be the best choice of evaluation score. For example, it has been observed that, as a network is trained, multiple sets of internal features begin to emerge over network layers, and the quality of these internal features determines the ultimate “behavioral” performance of the neural network as a whole. Based on these observations, we reasoned that, if we could evaluate the quality of a network’s internal features even in a very premature state, we might be able to more quickly determine if a given architecture is likely to obtain high levels of mature performance.

But without a reference set of high-quality internal features, how can we determine the quality of a network's internal features? The main idea proposed here is to use the features of a high-performing "teacher" network as a reference to identify promising sample architectures at a much earlier premature state. Our proposed method is inspired by prior work showing that the internal representations of a high-performing teacher network can be used to optimize the parameters of smaller, shallower, or thinner "student" networks [1, 16, 32]. It is also inspired by the fact that such internal representation measures can potentially be obtained from the primate brain, which could thus be used as an ultimate teacher. While our ability to simultaneously record from large populations of neurons is fast growing [35], these measurements have already been shown to have remarkable similarities to the internal activations of CNNs [38, 33].

One challenge in comparing representations across models, or between models and brains, is the lack of a one-to-one mapping between features (or neurons in the brain). Representational Similarity Analysis (RSA) is a tool that summarizes representational behavior in a matrix called the Representational Dissimilarity Matrix (RDM), which encodes the distances between activations in response to different inputs. In doing so, it abstracts away from individual features (i.e. activations) and therefore enables us to compare representations originating from different models, or even between models and biological organisms.

Based on the RDM metric, we propose a method for architecture search termed “Teacher Guided Search for Architectures by Generation and Evaluation” (TG-SAGE). Specifically, TG-SAGE guides each step of an architecture search by evaluating the similarity between representations in the candidate network and those in a fixed, high-performing teacher network with unknown architectural parameters but observable internal states. We found that when this evaluation is combined with the usual performance evaluation (above), we can predict the “mature” performance of sampled architectures with an order of magnitude less premature training and thus an order of magnitude less total computational cost. We then used this observation to execute multiple runs of TG-SAGE for different architecture search spaces to confirm that TG-SAGE can indeed discover network architectures of comparable mature performance to those discovered with performance-only search methods, but with far less total computational cost. More importantly, when considering the primate visual system as the teacher network with measurements of neural activity from only several hundred neural sites, TG-SAGE finds a network with an Imagenet top-1 error that was 5% lower than that achieved by performance-guided architecture search.

In section 2 we review previous studies of neural network architecture search and of the use of RSA to compare artificial and biological neural networks. In section 3 we describe the representational dissimilarity matrix and how TG-SAGE uses this metric to compare representations. In section 4 we show the effectiveness of TG-SAGE in comparison to performance-guided search methods, using two search methods in different architectural spaces of increasing size. We then show how, in the absence of a teacher model, measurements from the brain can serve as a teacher to guide the architecture search.

2 Previous Work

There have been several recent studies on using reinforcement learning to design high-performing neural network architectures [2, 39]. Of special relevance to this work is Neural Architecture Search (NAS) [39], in which a long short-term memory network (LSTM) trained using REINFORCE learned to design neural network architectures for object recognition and natural language processing tasks. A variation of this approach was later used to design convolutional cell structures, similar to those used in the Inception network, that could be transferred to larger datasets like Imagenet [40]. Real et al. [31, 30] used an evolutionary approach in which samples taken from a pool of networks were engaged in a pairwise competition game. This method searched for optimal architectures and weights jointly by reusing all or part of the weights from the parent network, in an effort to reduce the computational cost associated with training the candidate networks as well as the post-search retraining of the best-found networks. However, compared to alternative search methods that depend on (some) training of each candidate network from an initial point, it did not offer any significant improvement in efficiency.

While most of these works have focused on discovering higher-performing architectures, there have been a number of efforts emphasizing computational efficiency in hyperparameter search. In order to reduce the computational cost of architecture search, Brock et al. [10] proposed using a hypernetwork [14] to predict the layer weights for any arbitrary candidate architecture instead of retraining from random initial values. Hyperband [22] formulated hyperparameter search as a resource allocation problem and improved efficiency by controlling the amount of resources (e.g. training) allocated to each sample. Similarly, several other methods proposed to increase search efficiency by introducing early-stopping criteria during training [3] or by extrapolating the learning curve [12]. These approaches are closely related to our proposed method in that their main focus is to reduce the per-sample training cost.

Efficient NAS [27] and DARTS [24] proposed to share the trainable parameters across all candidate networks and to jointly optimize the hyperparameters and the network weights during the search. While these approaches led to significant reductions in total search cost, they can only be applied to spaces of network architectures in which the number of trainable weights does not change as a result of hyperparameter choices (e.g. when the number of filters in a CNN is fixed).

More recently progressive neural architecture search (PNAS) [23] proposed a sequential model based optimization (SMBO) approach that learned a predictive model of performance given the hyperparameters through a procedure which gradually increased the complexity of the space. This approach led to an impressive improvement in the computational cost of search compared to NAS.

On the other hand, it has been shown that networks with good generalization ability converge to similar internal representations, which are dissimilar from those emerging in networks with low or no generalization ability [25]. Representational similarity analysis has been used to compare representations between convolutional neural networks and measurements from the primate brain [11, 38]. These studies have noted remarkable similarities between the representations in CNNs trained for object recognition and those found in the primate ventral stream, which is known to be critical for invariant object recognition in the brain. Moreover, brain measurements have also been used to improve generalization in machine learning systems, with varying degrees of success [13, 9].

Figure 1: Left – Illustration of an exemplar RDM for a dataset with 8 object categories and 8 object instances per category. Right – Overview of the TG-SAGE method. The correlation between the RDMs of the candidate and teacher networks is combined with the candidate network's premature performance to form the P+TG score that guides the architecture search.

3 Methods

3.1 Representational Dissimilarity Matrix

Representational Dissimilarity Matrix (RDM) [20] is an embedding computed for a representation that quantifies the dissimilarity between activation patterns in that representational space in response to a set of inputs or input categories. For a given input $x_i$, the network activations at one layer can be represented as a vector $a_i \in \mathbb{R}^n$. Similarly, the collection of activations in response to a set of $m$ inputs can be represented in a matrix $A \in \mathbb{R}^{m \times n}$, whose rows contain the activations measured in response to the $m$ inputs.

For a given activation matrix $A$, we derive the RDM ($D$) by computing the pairwise distances between each pair of activation vectors (i.e. $a_i$ and $a_j$, which correspond to rows $i$ and $j$ of the activation matrix $A$) using a distance measure like the correlation distance:

    $D_{ij} = 1 - \mathrm{corr}(a_i, a_j)$

When calculating the RDM for categories (instead of individual inputs), we substitute the matrix $A$ with $\bar{A}$, in which each row $\bar{a}_c$ contains the average activation pattern across all inputs in category $c$.

The RDM constitutes an embedding of the representational space that abstracts away from individual activations. Because of this, it allows us to compare the representations in different models or even between models and biological organisms [38, 11]. Once RDMs are calculated for two representational spaces (e.g. for a layer in each of the student and teacher networks), we can evaluate the similarity of those representations by calculating the correlation coefficient (e.g. Pearson's $r$) between the values in the upper triangles of the two RDMs.
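As a concrete illustration, the RDM construction and comparison described above can be sketched in a few lines of NumPy. This is a minimal sketch: the function names and toy activation matrices are ours, and correlation distance stands in for whichever dissimilarity measure is chosen.

```python
import numpy as np

def compute_rdm(activations):
    # activations: (inputs x features) matrix. np.corrcoef treats rows
    # as variables, so this yields an (inputs x inputs) dissimilarity matrix.
    return 1.0 - np.corrcoef(activations)

def rdm_similarity(rdm_a, rdm_b):
    # Pearson correlation between the upper-triangle entries of two RDMs.
    iu = np.triu_indices_from(rdm_a, k=1)
    return np.corrcoef(rdm_a[iu], rdm_b[iu])[0, 1]

rng = np.random.default_rng(0)
acts = rng.normal(size=(16, 100))                       # 16 inputs, 100 features
student_rdm = compute_rdm(acts)
teacher_rdm = compute_rdm(acts + 0.1 * rng.normal(size=acts.shape))
sim = rdm_similarity(student_rdm, teacher_rdm)          # near 1 for similar representations
```

Because the comparison operates on the RDMs rather than on the raw features, the two representations need not have the same number of units, which is what makes the metric usable across models or between a model and neural recordings.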

3.2 Teacher Representational Similarity as Performance Surrogate

The largest portion of the cost associated with neural network architecture search comes from training the sampled networks, which is proportional to the number of training steps (SGD updates) performed on each network. Due to the high cost of fully training each sampled network, in most cases a surrogate score is used as a proxy for the mature performance. The correlation between the surrogate and the mature score may affect the architecture search performance, as poor proxy values could guide the search algorithm toward suboptimal regions of the space. Previous work on architecture search in the space of convolutional neural networks (CNNs) has converged on the empirical surrogate measure of premature performance after about 20 epochs of training. While 20 epochs is much lower than the usual number of epochs used to fully train a CNN (300-900 epochs), it still imposes a large cost on conducting architecture searches. We propose that evaluating the internal representations of a network is a more reliable measure of architecture quality during the early phase of training (e.g. after several hundred SGD iterations), when features are starting to form but the network is not yet performing reliably on the task.

An overview of the procedure is illustrated in Figure 1. We evaluate each sampled candidate model by measuring the similarity between its RDMs at different layers and those extracted from the teacher network. To this end, we compute the RDM for all layers in the candidate network and then compute the correlation between all pairs of student and teacher RDMs. To score a candidate network against a given layer $l$ in the teacher network, we take the highest RDM similarity to that teacher layer over all layers of the student network (i.e. $TG_l = \max_k \, r(D_k^{student}, D_l^{teacher})$).

We then construct an overall teacher similarity score, which we call "Teacher Guidance" (TG), by taking the mean of the per-layer RDM scores. Finally, we define the combined Performance and TG score (P+TG) as the weighted sum of the premature performance and the TG score, $P + \alpha \cdot TG$. The combined score guides the architecture search to maximize performance as well as representational similarity with the teacher. The parameter $\alpha$ tunes the relative weight assigned to the TG score compared to the performance score. We consider the teacher to be any high-performing network with unknown architecture but observable activations. We may have one or several measured endpoints from the teacher network, each of which could potentially be used to generate a similarity score.
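The max-over-student-layers, mean-over-teacher-layers composition described above can be sketched as follows. The helper names and the example α are illustrative, not from the paper's code; only the structure (best match per teacher layer, averaged, then added to premature performance with weight α) follows the text.

```python
import numpy as np

def rdm_similarity(rdm_a, rdm_b):
    # Pearson correlation between upper-triangle entries of two RDMs.
    iu = np.triu_indices_from(rdm_a, k=1)
    return np.corrcoef(rdm_a[iu], rdm_b[iu])[0, 1]

def tg_score(student_rdms, teacher_rdms):
    # Score each teacher layer by its best match over all student layers,
    # then average the per-teacher-layer scores.
    return float(np.mean([
        max(rdm_similarity(s, t) for s in student_rdms)
        for t in teacher_rdms
    ]))

def p_plus_tg(premature_perf, tg, alpha=1.0):
    # Combined search score: premature performance + alpha * TG.
    return premature_perf + alpha * tg

rng = np.random.default_rng(0)
student_rdms = [1.0 - np.corrcoef(rng.normal(size=(10, 50))) for _ in range(3)]
teacher_rdms = [student_rdms[0], student_rdms[2]]   # teacher layers matching two student layers
tg = tg_score(student_rdms, teacher_rdms)           # 1.0 when matches are exact
score = p_plus_tg(0.4, tg, alpha=0.5)
```

Taking the max over student layers means a candidate is not penalized for representing a teacher layer's features at a different depth than the teacher does.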

4 Experiments and Results

4.1 Performance Predictability from Teacher Representational Similarity

We first investigated if the teacher similarity evaluation measure (P+TG) of premature networks improves the prediction of mature performance (compared to evaluation of only premature performance, P). To do this, we made a pool of CNN architectures for which we computed the premature and mature performances as well as the premature RDMs (a measure of the internal feature representation, see 3.1) at every model layer. To select the CNN architectures in the pool we first ran several performance-guided architecture searches with 20 epoch/sample training (see section 4.2 and supplementary material) and then selected 116 architectures found at different stages of the search. These networks had a wide range of mature performance levels that also included the best network architectures found during each search.

In the experiments carried out in this paper, we used a variant of ResNet [15] with 54 convolutional layers ($n=9$) as the teacher network. This architecture was selected as the teacher because it is high performing (top-1 accuracy of 94.75% and 75.89% on the CIFAR10 and CIFAR100 datasets, respectively). Notably, the teacher architecture is not in our search spaces (see supp. material). The features after each of the three stacks of residual blocks (here named L1-L3) were chosen as the teacher's internal features, and an RDM was created from each using a random subsample of the features in that layer. We did not attempt to optimize this choice; these layers were chosen simply because they sample approximately evenly over the full depth of the teacher.

In order to find the optimal TG weight factor, we varied the parameter $\alpha$ and measured the change in correlation between the P+TG score and the mature performance (Figure 2). We observed that higher $\alpha$ led to larger gains in predicting the mature performance when models were trained for only a few epochs (2.5 epochs). However, with more training, larger $\alpha$ values reduced predictability. For networks trained for 2 epochs, we identified a near-optimal value of $\alpha$ (Figure 2). The combined P+TG score (see 3.2) is the best predictor of mature performance during most of the early training period (Figure 3, bottom). This observation is consistent with previous findings that learning in deep networks predominantly happens "bottom-up" [29].
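The α sweep described above amounts to picking the weight whose combined score correlates best with mature performance over a pool of already-evaluated architectures. A toy sketch, where the synthetic pool and the candidate α values are entirely ours:

```python
import numpy as np

def pearson(x, y):
    return np.corrcoef(x, y)[0, 1]

def best_alpha(premature, tg, mature, alphas):
    # Choose the alpha whose combined score P + alpha*TG best correlates
    # with mature performance across the pool of architectures.
    corrs = [pearson(premature + a * tg, mature) for a in alphas]
    best = int(np.argmax(corrs))
    return alphas[best], corrs[best]

# Toy pool: mature performance driven mostly by feature quality (tg),
# while premature performance is a weak, noisy signal.
rng = np.random.default_rng(1)
tg = rng.uniform(0.0, 1.0, 50)
premature = 0.2 * tg + 0.05 * rng.normal(size=50)
mature = tg + 0.05 * rng.normal(size=50)

alpha, corr = best_alpha(premature, tg, mature,
                         alphas=[0.0, 0.5, 1.0, 2.0, 4.0])
```

In this toy setting, where premature performance is only weakly informative, larger α wins; with more per-sample training the premature signal strengthens and smaller α would be selected, mirroring the trend in Figure 2.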

We further found that early in training (up to ~2 epochs), the earlier teacher layer (L1) is a better predictor of mature performance than the other layers; as training progresses (~3 epochs), the later layers (L2 and L3) become better predictors; and with more training (beyond ~3 epochs), the premature performance itself becomes the best single predictor of mature (i.e. fully trained) performance (Figure 3).

In addition to ResNet, we also analyzed a second teacher network, NASNet (see section 2 in supp. material), and confirmed our findings with this alternative teacher. We also found that the activations of NASNet (which performs better than ResNet: 82.12% vs. 75.9%) form a better predictor of mature performance in almost all training regimes (see supp. material).

Figure 2: Effect of TG weight on predicting the mature performance.
Figure 3: Comparison of performance and P+TG measures at premature state (epochs=2) as predictors of mature performance. (top-left) Scatter plot of premature and mature performance values. (top-right) Scatter plot of premature P+TG measure and mature performance. (bottom) Correlation between performance, single layer RDMs, and combined P+TG measures with mature performance at varying number of premature training epochs.

4.2 Teacher Guided Search in the Space of Convolutional Networks

As outlined in the Introduction, we expected that the (P+TG) evaluation score's improved predictivity (Figure 3) should enable a more efficient architecture search than performance evaluation alone (P). To test this directly, we used the (P+TG) evaluation score in full architecture search experiments under a range of configurations. For these experiments, we searched two spaces of convolutional neural networks similar to previous search experiments [39] (maximum network depth of either 10 or 20 layers). These architectural search spaces are important and interesting because they are large. In addition, because networks in these search spaces are relatively inexpensive to train to maturity, we could evaluate the true underlying search progress at a range of checkpoints (below). We ran searches in each space using four different search configurations: the (P+TG) evaluation score at 2 or 20 epochs of premature training, and the (P) evaluation score at 2 or 20 epochs of premature training. For these experiments, we used random search [28], reinforcement learning (RL) [39], and the TPE architecture selection algorithm [5] (see Methods), and we halted each search after 1000 or 2000 sampled architectures (for the 10- and 20-layer search spaces, respectively). We conducted our search experiments on CIFAR100 instead of CIFAR10 because the larger number of classes in the dataset provides a higher-dimensional RDM.
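All four configurations share the same outer loop; only the scoring function (P or P+TG) and the amount of premature training differ. A schematic sketch of that loop, with all names and the toy instantiation being illustrative stand-ins rather than the paper's implementation:

```python
import random

def architecture_search(sample_arch, partial_train, score_fn, n_samples, seed=0):
    # Generic performance-guided search loop: sample an architecture,
    # partially train it, score it with a plug-in surrogate (P or P+TG),
    # and keep a history of (architecture, score) pairs.
    rng = random.Random(seed)
    history = []
    for _ in range(n_samples):
        arch = sample_arch(rng)
        model = partial_train(arch)          # e.g. 2 epochs instead of 20
        history.append((arch, score_fn(model)))
    history.sort(key=lambda pair: pair[1], reverse=True)
    return history                           # best-scored candidates first

# Toy instantiation: "architectures" are just depths, and deeper scores higher.
top = architecture_search(
    sample_arch=lambda rng: rng.randint(1, 20),
    partial_train=lambda arch: arch,         # stand-in for actual training
    score_fn=lambda model: float(model),     # stand-in for P or P+TG scoring
    n_samples=100,
)
```

In a real run, `sample_arch` would be driven by the search algorithm's history (RL policy or TPE posterior) rather than drawn uniformly, but the place where the surrogate score plugs in is the same.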

Search Algorithm                RL                                        TPE
Search Space                    10 layer            20 layer              10 layer   20 layer
# Epoch/Sample                  2         20        2         20          2          2
Random - Best C100 Error (%)    45.4±2.5  41.3±1.5  41.2±1.8  38.3±4.8    45.4±2.5   41.2±1.8
P - Best C100 Error (%)         41.0±0.5  40.5±0.4  37.5±0.2  32.7±0.9    42.5±5.7   37.0±3.0
P+TG - Best C100 Error (%)      38.3±1.1  39.2±0.9  33.2±1.4  32.2±0.8    37.6±1.2   33.0±2.4
Performance Improvement (%)     2.7       1.3       4.3       0.5         4.9        4.0
Table 1: Comparison of premature performance and the representational similarity measure in architecture search using the RL and TPE algorithms. P: premature performance as validation score; P+TG: combined premature performance and RDM similarity as the validation score. Values are mean ± standard deviation across 3 search runs.

We found that, for all search configurations, the (P+TG)-driven search (i.e. TG-SAGE) consistently outperformed the performance-only driven search (P): at equal computational cost it always discovered higher-performing networks (Table 1). This gain was substantial in that TG-SAGE found network architectures with approximately the same performance as the (P) search but at one-tenth the computational cost (2 vs. 20 epochs; Table 1).

To assess and track the efficiency of these searches, we measured the maximum validation-set performance of the fully trained network architectures returned by each search at its current choice of the top-5 architectures. We repeated each search experiment three times to estimate the variance in these measures resulting from both search sampling and the sampling of networks' initial filter weights. Figure 4 shows that teacher-guided search (P+TG) finds network architectures on par with those from performance-guided search (P) throughout the search runs while being 10× more efficient.

4.3 Teacher Guided Search in the Space of Convolutional Cells

To find architectures that are transferable across datasets, we performed architecture search with the P+TG score in the space of convolutional cells similar to the one used in [23]. Here, after a cell structure is sampled, the full architecture is constructed by stacking the same cell multiple times in a predefined structure (see supplementary material). While both the RL and TPE search methods led to similar outcomes in our earlier experiments, average TPE results were slightly better in both. Hence, we conducted the search experiment in this section using the TPE algorithm with the same setup as before, on CIFAR100 with 1000 samples.

For each sampled architecture, we computed RDMs for each cell's output. Given the cell repetitions in each block during search, we ended up with 8 RDMs per sampled network, which were compared with 3 precomputed RDMs from the teacher network (24 comparisons over a validation set of 5000 images). Due to the imperfect correlation between the premature and mature performances, a small post-search reranking step increases the chance of finding slightly better cell structures. We chose the top 10 discovered cells, trained each for 300 epochs on the training set (45k samples), and evaluated them on the validation set (5k samples). The cell structure with the highest validation performance was then fully trained on the complete training set (50k samples) for 600 epochs using the procedure described in [40] and evaluated on the test set.
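The post-search reranking step described above can be sketched as follows. This is a hypothetical illustration: the candidate pool, scores, and helper names are invented, and the real `retrain_and_eval` would be a full 300-epoch training run.

```python
def rerank(candidates, retrain_and_eval, k=10):
    # Take the top-k candidates by (noisy) search score, retrain each
    # more fully, and return the best by the more reliable retrained
    # validation accuracy.
    shortlist = sorted(candidates, key=lambda c: c["search_score"],
                       reverse=True)[:k]
    for cand in shortlist:
        cand["val_acc"] = retrain_and_eval(cand)
    return max(shortlist, key=lambda c: c["val_acc"])

# Toy pool where the search score misranks the top candidates.
pool = [
    {"id": 0, "search_score": 0.74, "quality": 0.68},
    {"id": 1, "search_score": 0.72, "quality": 0.75},
    {"id": 2, "search_score": 0.70, "quality": 0.70},
    {"id": 3, "search_score": 0.30, "quality": 0.30},
]
best = rerank(pool, retrain_and_eval=lambda c: c["quality"], k=3)
```

In the toy pool, candidate 1 is not the top-scored architecture during search, but reranking recovers it because its retrained accuracy is highest; this is exactly why the step pays off when premature and mature performance correlate imperfectly.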

Figure 4: Effect of different surrogate measures on architecture search performance. (left) Average C100 performance of the best network architectures found during different stages of three runs of RL search in each case (see text). (right) Same as the left plot, but displayed with respect to the total computational cost invested.
Network       B  N  F   # Params  C10 Error    C100 Error   M1     E1     M2   E2     Cost
AmoebaNet-A   5  6  36  3.2M      3.34         -            20000  1.13M  100  27M    25.2B
NASNet-A      5  6  32  3.3M      3.41 (3.72)  17.88        20000  0.9M   250  13.5M  21.4-29.3B
PNASNet-5     5  3  48  3.2M      3.41 (4.06)  19.26        1160   0.9M   0    0      1.0B
ENAS          5  6  -   4.6M      3.54         -            310    50k    0    0      15.5M
SAGENet       5  6  32  6.0M      3.66         17.42        1000   90K    10   13.5M  225M
SAGENet-sep   -  -  -   2.7M      3.88         17.51        -      -      -    -      -
Table 2: Performance of discovered cells on the CIFAR10 and CIFAR100 datasets. Values in parentheses indicate error rates from retraining the network using the same training pipeline on 2 GPUs. B: number of operation blocks in each cell. N: number of cell repetitions in each network block. F: number of filters in the first cell. M1/E1: number of models sampled during search and training examples per model; M2/E2: number of top models retrained post-search and examples per retrained model; Cost: total examples processed.

We compared our best-found cell structure with those found using the NAS [40] and PNAS [23] methods on the CIFAR-10, CIFAR-100, and Imagenet datasets (Tables 2 and 3). To rule out any differences in performance that might have originated from differences in training procedure, we used the same training pipeline to train our proposed network (SAGENet) as well as the two baselines (NASNet and PNASNet). We found that on all datasets, SAGENet performed on par with the two baseline networks.

With regard to compactness, SAGENet had more parameters and FLOPS than NASNet and PNASNet, due mostly to its symmetric convolutions. But we had not considered any cost associated with the number of parameters or the number of FLOPS when conducting the search experiments. For this reason, we also considered another version of SAGENet in which we replaced the symmetric convolutions with separable convolutions (SAGENet-sep). SAGENet-sep had half the number of parameters and FLOPS of SAGENet and slightly higher error rates.

To compare the cost and efficiency of different search procedures, we adopted the same measures as in [23]. The total cost of search was computed as the total number of examples processed with SGD throughout the search procedure. This includes the M1 sampled cell structures that were each trained with E1 examples during the search and the M2 top cells each trained on E2 examples post-search to find the top-performing cell structure. The total cost was then calculated as M1·E1 + M2·E2. While SAGENet performed on par with both the NASNet and PNASNet top networks on C10, C100, and Imagenet, the cost of the search was about 100 and 4.5 times less than NASNet and PNASNet, respectively (Table 2). A unique feature of this cell is the large number of skip connections, both within blocks and across cells (see supp. material). Interestingly, at the mature state our top architecture performed better than the teacher network (ResNet) on the C10 and C100 datasets (96.34% and 82.58% on C10 and C100 for TG-SAGE, compared to 94.75% and 75.89% for our teacher ResNet).
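The cost accounting above multiplies models by examples in each phase. A sketch using the SAGENet numbers reported in Table 2 (1000 sampled cells at 90K examples each during search, then 10 top cells at 13.5M examples each post-search):

```python
def total_search_cost(m1, e1, m2, e2):
    # Total examples processed with SGD: m1 candidate models trained on
    # e1 examples each during the search, plus m2 top models trained on
    # e2 examples each in the post-search reranking phase.
    return m1 * e1 + m2 * e2

cost = total_search_cost(1000, 90_000, 10, 13_500_000)   # 225M examples
```

The same formula reproduces the other entries in the Cost column of Table 2, e.g. PNASNet's 1160 × 0.9M ≈ 1.0B examples.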

Network B N F Top-1 Err Top-5 Err # Params (M) FLOPS (B)
NASNet-A 5 4 44 31.07 11.41 5.3 1.16
PNASNet-5 5 3 56 29.92 10.63 5.4 1.30
SAGENet 5 4 48 31.81 11.79 9.7 2.15
SAGENet-sep 31.9 11.99 4.9 1.03
TPE-imagenet 5 4 40 34.4 13.5 5.5 1.26
SAGENet-neuro 5 3 40 32.54 12.26 5.6 1.35
Table 3: Performance of discovered cells on the Imagenet dataset in the mobile setting. *Error rates from training all networks using the same training pipeline on 2 GPUs.

4.4 Using Cortical Measurements as the Teacher Network

In the absence of an already high-performing teacher network, the utility of TG-SAGE may seem far-fetched. However, as discussed earlier, the teacher can be any network that is high performing and whose internal activations are partially observable. One such network is the primate brain, which is both high performing in object categorization tasks and partially observable through electrophysiological recording tools. To demonstrate the validity of this hypothesis, we conducted an additional experiment in which we used neural measurements from macaque visual cortex to guide the architecture search.

To facilitate the comparison of representations between the brain and CNNs, we needed a fixed set of inputs that could be shown to both CNNs and monkeys. For this purpose, we used a set of 5760 images that contained 3D-rendered objects placed on uncorrelated natural backgrounds and were designed to include large variations in the position, size, and pose of the objects (see supplementary material). We used previously published neural measurements from 296 neural sites in two macaque monkeys in response to these images [38]. These neural responses were measured from three anatomical regions along the ventral visual pathway (V4, posterior inferior temporal (p-IT), and anterior inferior temporal (a-IT) cortex) in each monkey, a series of cortical regions in the primate brain that support object recognition. To make the candidate networks more comparable to the brain measurements, we conducted the experiment on the Imagenet dataset and briefly trained each candidate network (for a fraction of an epoch) on reduced-size images. We used the same setup as in section 4.3, but this time with three RDMs generated from our neural measurements in each area (i.e. V4, p-IT, a-IT). We held out 50,000 images from the original Imagenet training set as the validation set used to evaluate the premature performance of the candidate networks. To further speed up the search, we removed the first 2 reduction cells from the architecture during the search. After running the architecture search for 1000 samples, we picked the top 10 networks, fully trained them on Imagenet for 40 epochs, and picked the network with the highest validation accuracy. We then trained this network on the full Imagenet training set and evaluated its performance on the test set.

As a baseline, we also performed a similar search but using the performance metric alone to guide the search. The best network discovered using the combined P+TG metric (SAGENet-neuro) reached a top-1 error of 32.54%, significantly lower than the 34.4% top-1 error achieved by the best network derived from performance-guided search (TPE-imagenet; see Table 3). Despite this, the best model found using the primate brain representations (SAGENet-neuro) did not perform as well as the model found by searching on the CIFAR-100 dataset with ResNet as the teacher. One critical factor that might have affected the quality of the best discovered model was the amount of per-sample training done during the search, which was restricted to 2000 steps (a fraction of an epoch) in our experiment. Naturally, allowing more training before evaluation would likely yield a more accurate prediction of mature performance and lead to the discovery of a higher-performing model. Another important factor was the sufficiency of the neural recordings for constructing a teacher RDM, which could be improved with a larger population of neural responses measured in response to more inputs. Nevertheless, the representational embedding constructed from only a few hundred neural sites was still informative enough to provide meaningful guidance for the architecture search.

5 Discussion and Future Directions

We demonstrate here that, when the internal "neural" representations of a powerful teacher neural network are partially observable (as with the brain's neural network), that knowledge can substantially accelerate the discovery of high-performing artificial networks. We propose a new method to accomplish this acceleration (TG-SAGE) and demonstrate its ability using a previous state-of-the-art network as the teacher. Essentially, TG-SAGE jointly maximizes a model's premature performance and the similarity of its representations to those of a partially observable teacher network. With the architecture space and search settings tested here, we report a computational efficiency gain of up to an order of magnitude in discovering CNNs for visual categorization. This gain in search efficiency (with maintained performance) was achieved without the additional constraints on the search space imposed by more efficient search methods such as ENAS [27] or DARTS [24]. We empirically demonstrated this by performing searches in several CNN architectural spaces. In addition, as a proof of concept, we showed how limited measurements from the brain (neural population response patterns to many images) can be formulated as teacher constraints to accelerate the search for higher-performing networks. It remains to be seen whether larger-scale neural measurements, which are obtainable in the near future, could achieve even greater acceleration.

An important aspect of teacher-guided architecture search is the metric used to evaluate the similarity of representational spaces. Here we used the representational dissimilarity matrix (RDM) for this purpose. However, we acknowledge that the RDM may not be the most accurate or fastest metric for this goal. Exploring other representational analysis metrics, such as Singular Vector Canonical Correlation Analysis (SVCCA) [29], is an important direction we would like to pursue in the future.
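For concreteness, the RDM comparison can be sketched in a few lines. Using 1 minus Pearson correlation as the pairwise dissimilarity and a rank (Spearman) correlation of RDM upper triangles follows common RSA practice [20], but the exact functions below are illustrative assumptions rather than a verbatim transcription of our implementation.

```python
import numpy as np

def rdm(responses):
    """Representational dissimilarity matrix: 1 - Pearson correlation
    between the response patterns to every pair of images.
    responses: (n_images, n_features) array."""
    return 1.0 - np.corrcoef(responses)

def rdm_similarity(rdm_a, rdm_b):
    """Compare two RDMs by rank-correlating their upper triangles
    (a Spearman correlation, common in the RSA literature)."""
    iu = np.triu_indices_from(rdm_a, k=1)
    a, b = rdm_a[iu], rdm_b[iu]
    # rank-transform both vectors, then take their Pearson correlation
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

# toy check: a representation compared against itself is maximally similar
rng = np.random.default_rng(0)
layer = rng.normal(size=(20, 50))     # 20 images, 50 units
assert abs(rdm_similarity(rdm(layer), rdm(layer)) - 1.0) < 1e-9
```

In the search itself, such a similarity score is computed between each candidate layer's RDM and the teacher RDM, and combined with premature performance into the P+TG score.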

Another interesting future direction would be to conduct the architecture search while iteratively substituting the teacher network with the best network discovered so far. This approach would make the procedure independent of the choice of teacher network and would enable efficient search even when good teacher architectures are not yet available.


  • [1] J. Ba and R. Caruana. Do deep nets really need to be deep? In Advances in neural information processing systems, pages 2654–2662, 2014.
  • [2] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.
  • [3] B. Baker, O. Gupta, R. Raskar, and N. Naik. Practical neural network performance prediction for early stopping. arXiv preprint arXiv:1705.10823, 2017.
  • [4] R. Bardenet and B. Kegl. Surrogating the surrogate: accelerating gaussian-process-based global optimization with a mixture cross-entropy algorithm. In 27th International Conference on Machine Learning (ICML 2010), pages 55–62. Omnipress, 2010.
  • [5] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kegl. Algorithms for Hyper-Parameter Optimization. pages 1–9, 2011.
  • [6] J. Bergstra, N. Pinto, and D. Cox. Machine learning for predictive auto-tuning with boosted regression trees. In Innovative Parallel Computing (InPar), 2012, pages 1–9. IEEE, 2012.
  • [7] J. Bergstra, D. Yamins, and D. D. Cox. Making a Science of Model Search. pages 1–11, 2012.
  • [8] J. Bergstra, D. Yamins, and D. D. Cox. Hyperopt: A python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in Science Conference, pages 13–20. Citeseer, 2013.
  • [9] N. Blanchard, J. Kinnison, B. RichardWebster, P. Bashivan, and W. J. Scheirer. A neurobiological cross-domain evaluation metric for predictive coding networks. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [10] A. Brock, T. Lim, J. M. Ritchie, and N. Weston. SMASH: One-Shot Model Architecture Search through HyperNetworks. 2017.
  • [11] C. F. Cadieu, H. Hong, D. L. K. Yamins, N. Pinto, D. Ardila, E. A. Solomon, N. J. Majaj, and J. J. DiCarlo. Deep Neural Networks Rival the Representation of Primate IT Cortex for Core Visual Object Recognition. PLoS Computational Biology, 10(12), 2014.
  • [12] T. Domhan, J. T. Springenberg, and F. Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. 15:3460–8, 2015.
  • [13] R. C. Fong, W. J. Scheirer, and D. D. Cox. Using human brain activity to guide machine learning. Scientific Reports, 8(1):1–10, 2018.
  • [14] D. Ha, A. Dai, and Q. V. Le. HyperNetworks. 2016.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. Arxiv.Org, 7(3):171–180, 2015.
  • [16] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [17] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely Connected Convolutional Networks. 2016.
  • [18] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pages 507–523. Springer, 2011.
  • [19] D. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations, pages 1–13, 2014.
  • [20] N. Kriegeskorte, M. Mur, and P. A. Bandettini. Representational similarity analysis - connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2(November), 2008.
  • [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances In Neural Information Processing Systems, pages 1–9, 2012.
  • [22] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research, 18(1):6765–6816, 2017.
  • [23] C. Liu, B. Zoph, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive Neural Architecture Search. 2017.
  • [24] H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable Architecture Search. 2018.
  • [25] A. S. Morcos, M. Raghu, and S. Bengio. Insights on representational similarity in neural networks with canonical correlation. Advances in neural information processing systems, (NeurIPS), 2018.
  • [26] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.
  • [27] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient Neural Architecture Search via Parameters Sharing. 2018.
  • [28] N. Pinto, D. Doukhan, J. J. DiCarlo, and D. D. Cox. A high-throughput screening approach to discovering good forms of biologically inspired visual representation. PLoS computational biology, 5(11):e1000579, 2009.
  • [29] M. Raghu, J. Gilmer, J. Yosinski, and J. Sohl-Dickstein. SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability. In Advances in Neural Information Processing Systems, 2017.
  • [30] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized Evolution for Image Classifier Architecture Search. 2018.
  • [31] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, Q. Le, and A. Kurakin. Large-Scale Evolution of Image Classifiers. 2016.
  • [32] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. FitNets: Hints for Thin Deep Nets. pages 1–13, 2014.
  • [33] M. Schrimpf, J. Kubilius, H. Hong, N. J. Majaj, R. Rajalingham, E. B. Issa, and K. Kar. Brain-Score: Which Artificial Neural Network for Object Recognition is Most Brain-Like? pages 1–9, 2018.
  • [34] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. pages 1–10, 2014.
  • [35] I. H. Stevenson and K. P. Kording. How advances in neural recording affect data analysis. Nature neuroscience, 14(2):139, 2011.
  • [36] C. Szegedy, V. Vanhoucke, J. Shlens, and Z. Wojna. Rethinking the Inception Architecture for Computer Vision. 2014.
  • [37] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
  • [38] D. L. K. Yamins, H. Hong, C. F. Cadieu, E. A. Solomon, D. Seibert, and J. J. DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23):8619–8624, 2014.
  • [39] B. Zoph and Q. V. Le. Neural architecture Search With reinforcement learning. ICLR, 2017.
  • [40] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning Transferable Architectures for Scalable Image Recognition. 10, 2017.

Supplementary Material

Hyperparameter Search with Reinforcement Learning (RL)

We follow the method proposed by [39] to learn a probability distribution over hyperparameter choices, P(a; θ), that maximizes the unknown but observable reward function R. A two-layer long short-term memory (LSTM) network is used as the controller, choosing each hyperparameter of the network at every unrolling step. The LSTM models the conditional probability distribution of optimal hyperparameter choices as a function of all previous choices, P(a_t | a_1, ..., a_{t-1}; θ), where θ is the set of all tunable parameters in the LSTM network. Since a differentiable loss function is not available for this problem, the usual maximum-likelihood methods cannot be used in this setting. Instead, the parameters are optimized with reinforcement-learning approaches (e.g. REINFORCE [37]) by increasing the likelihood of each hyperparameter choice according to the reward (score) computed for each sampled network (or a batch of sampled networks). Relative to [39], we made two modifications. First, since the order of dependencies between the different hyperparameters in each layer/block is arbitrary, we ran the LSTM controller for one step per layer (instead of one step per hyperparameter). This results in shorter choice sequences generated by the LSTM controller and therefore shorter sequence dependencies. Second, we chose a Boltzmann policy for action selection to allow the search to keep exploring throughout the experiment: hyperparameter values were selected according to the softmax probability distribution over all action choices. Compared with an ε-greedy method, following the softmax policy reduces the likelihood of sub-optimal actions throughout training.
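The Boltzmann action selection described above can be sketched as follows; the temperature parameter is an illustrative addition (a temperature of 1 reduces to a plain softmax over the controller's logits):

```python
import numpy as np

def boltzmann_sample(logits, temperature=1.0, rng=None):
    """Sample an action index from a softmax (Boltzmann) policy over
    the controller's logits, instead of taking the argmax or using
    an epsilon-greedy rule."""
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                       # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(p), p=p))

# higher-logit actions are sampled more often, but exploration continues
rng = np.random.default_rng(0)
draws = [boltzmann_sample([2.0, 0.0, 0.0], rng=rng) for _ in range(1000)]
assert draws.count(0) > draws.count(1)   # action 0 dominates, others still occur
```

Unlike ε-greedy, the sampling probability of each sub-optimal action decays smoothly as its logit falls, rather than being pinned to a fixed exploration rate.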

For each hyperparameter, the choice probability is computed by applying a linear transformation (e.g. W h_T, where h_T is the LSTM output at the last layer) followed by a softmax. To reduce the number of tunable parameters and to improve generalization across layers, we shared these parameters between layers.


The probability distribution over the possible number of layers is formulated as a function of the first output value of the LSTM (h_1). In addition to the layers' hyperparameters, we also search over the layers' connections. Similar to the approach taken in [39], we formulated the probability of a connection between layers i and j as a function of the state of the LSTM at each of these layers (h_i and h_j):

P(connection from layer i to layer j) = sigmoid(v^T tanh(W_prev h_i + W_curr h_j)),

where the left-hand side is the probability of a connection from layer i's output to layer j's input, and W_prev, W_curr, and v are tunable parameters that link the hidden states of the LSTM to the probability of a connection existing between the two layers.
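A minimal numerical sketch of this connection probability, with the hidden size matching the 32-unit controller used in our experiments (the random parameter values are placeholders, not learned weights):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def connection_prob(h_i, h_j, W_prev, W_curr, v):
    """P(layer i's output feeds layer j's input) =
    sigmoid(v^T tanh(W_prev @ h_i + W_curr @ h_j)), following [39]."""
    return float(sigmoid(v @ np.tanh(W_prev @ h_i + W_curr @ h_j)))

rng = np.random.default_rng(0)
d = 32  # LSTM hidden size used as the controller in our experiments
h_i, h_j = rng.normal(size=d), rng.normal(size=d)
W_prev = rng.normal(size=(d, d)) * 0.1
W_curr = rng.normal(size=(d, d)) * 0.1
v = rng.normal(size=d) * 0.1
p = connection_prob(h_i, h_j, W_prev, W_curr, v)
assert 0.0 < p < 1.0   # a valid Bernoulli parameter for sampling the skip connection
```

Each candidate connection is then sampled as an independent Bernoulli draw with this probability.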

Hyperparameter Search with Tree of Parzen Estimators (TPE)

Sequential Model-Based Optimization (SMBO) [18] refers to numerical methods for optimizing a given score function f(x). They are usually applied in settings where each function evaluation is costly, so it is important to minimize the number of evaluations needed to reach the optimum. Various SMBO approaches have been proposed [6, 4], and some have been used for hyperparameter optimization in neural networks [5, 7, 23]. Bayesian SMBO approaches model the posterior or conditional probability distribution of the scores, p(y|x), and use a criterion to iteratively suggest new samples while the distribution is updated to incorporate the history of previous sample tuples (x_i, y_i), where x_i is a sampled hyperparameter vector and y_i is the received score (or loss). Here we adopted the Tree of Parzen Estimators (TPE) because of its intuitiveness and its successful application in various domains with high-dimensional spaces. Unlike most other Bayesian SMBO methods, which directly model the posterior distribution of scores p(y|x), TPE models the conditional distribution p(x|y) with two non-parametric densities: p(x|y) = l(x) if y < y*, and g(x) if y ≥ y*.


We take y to be the loss value we are trying to minimize (e.g. the error rate of a network on a given task). For simplicity, the threshold y* can be taken as some quantile γ of the loss values observed so far (p(y < y*) = γ). At every iteration, TPE fits one kernel density estimator with Gaussian kernels to the subset of observed samples with the lowest loss values (l(x)) and another to those with the highest losses (g(x)). Ideally, we want to find the x that minimizes y. The Expected Improvement (EI) is the expected reduction in y relative to the threshold y* under the current model of p(y|x). Maximizing EI encourages the model to further explore parts of the space that lead to lower loss values, and can be used to suggest new hyperparameter samples.

Given that p(x|y) = l(x) for y < y* and p(x|y) = g(x) for y ≥ y*, it has been shown [5] that EI is proportional to (γ + (1 − γ) g(x)/l(x))⁻¹. Therefore the EI criterion can be maximized by taking samples with minimum probability under g(x) and maximum probability under l(x). In practice, at every iteration samples are drawn from l(x), and the hyperparameter choice with the lowest ratio g(x)/l(x) is suggested as the next sample.
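One TPE iteration for a single continuous hyperparameter can be sketched as follows. The fixed Gaussian bandwidth and the candidate-proposal scheme are simplifying assumptions for illustration (HyperOpt's implementation adapts bandwidths per sample), but the split-fit-rank structure is the algorithm described above:

```python
import numpy as np

def tpe_suggest(xs, ys, gamma=0.25, n_candidates=64, bw=0.1, rng=None):
    """One TPE step for a 1-D hyperparameter: split observations at the
    gamma-quantile of the loss, fit Parzen (Gaussian-kernel) densities
    l(x) to the good samples and g(x) to the bad ones, draw candidates
    from l, and return the one minimizing the ratio g(x)/l(x)."""
    rng = rng or np.random.default_rng()
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    y_star = np.quantile(ys, gamma)
    good, bad = xs[ys < y_star], xs[ys >= y_star]

    def kde(centers, x):
        # mixture of Gaussians centered on past samples (Parzen estimator)
        return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / bw) ** 2).mean(axis=1)

    cand = rng.choice(good, size=n_candidates) + rng.normal(0, bw, n_candidates)
    ratio = kde(bad, cand) / np.maximum(kde(good, cand), 1e-12)
    return float(cand[np.argmin(ratio)])

# toy quadratic loss minimized at x = 0.3: the suggestion lands nearby
rng = np.random.default_rng(0)
xs = rng.uniform(0, 1, 100)
ys = (xs - 0.3) ** 2
assert abs(tpe_suggest(xs, ys, rng=rng) - 0.3) < 0.2
```

Minimizing g(x)/l(x) here is equivalent to maximizing the EI expression (γ + (1 − γ) g(x)/l(x))⁻¹ from [5].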

Alternative Teacher Network - NASNet

We examined the effect of choosing an alternative teacher network, NASNet, and performed a set of analyses similar to those done with ResNet. We observed that, as with ResNet, early layers are better predictors of the mature performance during the early stages of training. With additional training, the premature performance becomes a better single predictor of the mature performance, but during most of training the combined P+TG score best predicts the mature performance (Figure S1-left). We also varied the "TG" weight factor and found that, compared to ResNet, higher values led to larger gains in predicting the mature performance. A TG weight of 5 was used to compute the P+TG scores shown in Figure S1.

Overall, we found that NASNet representations were significantly better predictors of mature performance for all evaluated time points during training when compared to ResNet (Figure S1-right).

Figure S1: (top) Comparison of single-layer and combined RDMs with premature performance as predictors of mature performance on NASNet. P+TG was computed using a TG weight of 5. (middle) Gain in predicting the mature performance with varying TG weight. (bottom) Comparison of combined RDM scores using the two alternative teacher models at various stages of training. TG weight values of 1 and 5 were used for ResNet and NASNet, respectively.

Datasets and Preprocessing

CIFAR: We followed the standard image preprocessing for the labeled CIFAR dataset, a 100-way object classification task [15]. Images were zero-padded to 40×40 pixels, a random 32×32 crop was selected and randomly flipped about the vertical axis, and each image was standardized over all pixel values to have zero mean and a standard deviation of 1. We split the original training set into a training set (45,000 images) and a validation set (5,000 images) by random selection.
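The CIFAR augmentation pipeline above can be sketched in NumPy; the 4-pixel padding is the standard choice from [15], and the per-image standardization matches the description, though a production pipeline would use a framework's image ops:

```python
import numpy as np

def preprocess_cifar(img, rng=None, pad=4, crop=32):
    """Standard CIFAR training augmentation sketch: zero-pad, take a
    random crop, randomly flip about the vertical axis, then
    standardize each image. img: (32, 32, 3) uint8 array."""
    rng = rng or np.random.default_rng()
    x = np.pad(img.astype(float), ((pad, pad), (pad, pad), (0, 0)))
    i, j = rng.integers(0, 2 * pad + 1, size=2)
    x = x[i:i + crop, j:j + crop]                 # random 32x32 crop
    if rng.random() < 0.5:
        x = x[:, ::-1]                            # mirror about the vertical axis
    return (x - x.mean()) / max(x.std(), 1e-8)    # zero mean, unit std per image

rng = np.random.default_rng(0)
out = preprocess_cifar(rng.integers(0, 256, (32, 32, 3), dtype=np.uint8), rng=rng)
assert out.shape == (32, 32, 3)
```

At evaluation time only the standardization step is applied, without padding, cropping, or flipping.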

ImageNet: We used standard VGG preprocessing [34] on images from the ImageNet training set. During training, images were resized so that the smaller side matched a random number between 256 and 512 pixels while preserving the aspect ratio. A random 224×224 crop was then cut from the image and randomly flipped about the central vertical axis. The central 224×224 crop was used for evaluation.
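The VGG-style training crop can be sketched as below. To stay dependency-free, resizing is simulated with nearest-neighbor indexing; a real pipeline would use bilinear interpolation, so this is a shape-level illustration rather than our exact implementation:

```python
import numpy as np

def vgg_train_crop(img, rng=None, crop=224, lo=256, hi=512):
    """VGG-style training preprocessing sketch [34]: resize so the
    shorter side is a random length in [lo, hi] (keeping aspect ratio),
    then take a random crop and a random horizontal flip."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    s = int(rng.integers(lo, hi + 1))             # random shorter-side length
    scale = s / min(h, w)
    nh, nw = int(round(h * scale)), int(round(w * scale))
    yi = (np.arange(nh) * h / nh).astype(int)     # nearest-neighbor resize
    xi = (np.arange(nw) * w / nw).astype(int)
    x = img[yi][:, xi]
    i = int(rng.integers(0, nh - crop + 1))
    j = int(rng.integers(0, nw - crop + 1))
    x = x[i:i + crop, j:j + crop]                 # random 224x224 crop
    if rng.random() < 0.5:
        x = x[:, ::-1]                            # random horizontal flip
    return x

rng = np.random.default_rng(0)
out = vgg_train_crop(rng.integers(0, 256, (300, 400, 3), dtype=np.uint8), rng=rng)
assert out.shape == (224, 224, 3)
```

Because the shorter side is always at least 256 pixels after resizing, the 224×224 crop is guaranteed to fit.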

Details of Search Algorithms

RL Search Algorithm: We used a two-layer LSTM with 32 hidden units in each layer as the controller. Parameters were trained using the Adam optimizer [19] with a batch size of 5. For all searches, the learning rate was 0.001 and the Adam first-momentum coefficient was set to zero (β₁ = 0). Gradients were clipped according to the global gradient norm with a clipping value of 1 [26].

TPE Search Algorithm: We used the Python implementation of TPE hyperparameter search from the HyperOpt package [8]. We employed linear sample forgetting as suggested in [7] and set a quantile threshold to split the set of observed samples into low- and high-loss subsets. Each search run started with 20 random samples and then continued with the TPE suggestion algorithm. At every iteration, draws were taken from l(x), and the hyperparameter choice with the lowest g(x)/l(x) ratio was used as the next sample (see Section 3.3 in the main text).

Experimental Details for Search in the Space of Convolutional Networks

Search Space: Similar to [39], we defined the hyperparameter space as a set of independent choices for each layer (such as filter size and number of filters). In addition, we searched over the number of layers and the possible connections between layers. In this space of CNNs, the input to every layer could originate from the input image or from the output of any of the previous layers. We considered two particular spaces in our experiments that differed in the maximum number of layers (10 or 20).

CIFAR Training:

Selected networks were trained on the CIFAR training set (45k samples) from random initial weights using SGD with Nesterov momentum of 0.9 for 300 epochs. The initial learning rate was 0.1 and was divided by 10 every 100 epochs. Mature performance was then evaluated on the validation set (above).

Experimental Details for Search in the Space of Convolutional Cells

Search Space: We used the same search space and network generation procedure as in [40, 23], with the exception that we added two extra hyperparameters that could force each of the cell inputs (from the previous cell or the one before that) to be directly concatenated into the output of the cell, even if they were already connected to some of the blocks in the cell. This extra hyperparameter choice was motivated by the open-source implementation of NASNet available at the time of our search experiments, which contained similar connections (https://github.com/tensorflow/models/blob/376dc8dd0999e6333514bcb8a6beef2b5b1bb8da/research/slim/nets/nasnet/nasnet_utils.py).

Each cell receives two inputs, which are the outputs of the previous two cells. In early layers, the missing inputs are substituted by the input image. Each cell consists of blocks with a prespecified structure. Each block receives two inputs; an operation is applied to each input independently, and the results are added together to form the output of the block. The search algorithm picks the operations and inputs for every block in the cell. Operations are selected from a pool of 8 possible choices: {identity, 3×3 average pooling, 3×3 max pooling, 3×3 dilated convolution, 1×7 followed by 7×1 convolution, 3×3 depthwise-separable convolution, 5×5 depthwise-separable convolution, 7×7 depthwise-separable convolution}.
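The cell-assembly rule above can be sketched at the shape level. The operations here are identity placeholders standing in for the named convolutions and poolings, and the concatenate-unused-outputs rule follows [40, 23]; everything else in this toy version is illustrative:

```python
import numpy as np

# placeholder operations: a real cell would apply the named op; these
# just pass tensors through unchanged so the wiring logic can be tested
OPS = {
    "identity": lambda x: x,
    "avg_pool_3x3": lambda x: x,
    "max_pool_3x3": lambda x: x,
    "sep_conv_3x3": lambda x: x,
}

def build_cell(inputs, blocks):
    """inputs: list of two tensors (outputs of the two previous cells).
    blocks: list of (in1, op1, in2, op2) tuples, where in1/in2 index
    into the growing list of hidden states and op1/op2 name operations.
    Each block adds its two transformed inputs; hidden states that never
    feed another block are concatenated to form the cell output."""
    hidden = list(inputs)
    used = set()
    for in1, op1, in2, op2 in blocks:
        used.update([in1, in2])
        hidden.append(OPS[op1](hidden[in1]) + OPS[op2](hidden[in2]))
    unused = [h for k, h in enumerate(hidden) if k not in used]
    return np.concatenate(unused, axis=-1)

x = [np.ones((8, 8, 16)), np.ones((8, 8, 16))]
cell = build_cell(x, [(0, "sep_conv_3x3", 1, "identity"),
                      (2, "avg_pool_3x3", 1, "max_pool_3x3")])
assert cell.shape == (8, 8, 16)   # only the second block's output is unused
```

The two extra hyperparameters mentioned above would simply force one or both cell inputs into the `unused` list even when they already feed a block.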

ImageNet Training: For our ImageNet training experiments, we used a batch size of 128 images of size 224×224 pixels. Each batch was divided between two GPUs, and the gradients computed on each half were averaged before updating the weights. We used an initial learning rate of 0.1 with a decay of 0.1 after every 15 epochs. Each network was trained for 40 epochs on the ImageNet training set and validated on the central crop of all images from the ImageNet validation set. No dropout or drop-path was used when training the networks. The RMSProp optimizer with a decay rate of 0.9 and a momentum rate of 0.9 was used during training, and gradients were normalized by their global norm when the norm exceeded a threshold of 10. An L2-norm regularizer was applied to all trainable weights.


CIFAR Training: The networks were trained on the CIFAR10/CIFAR100 training set, including all 50,000 samples, for 600 epochs with an initial learning rate of 0.025 and a single-period cosine decay [40]. We used SGD with a Nesterov momentum rate of 0.9 and L2 weight decay on all trainable weights. Gradient clipping similar to that used for ImageNet, with a threshold of 5, was applied.

Best Discovered Convolutional Cell: Figure S2 shows the structure of the best cell discovered by TG-SAGE on CIFAR100. Only four of its ten operations contain trainable weights, and there are several bypass connections in the cell.

Figure S2: SAGENet - Structure of the best cell discovered during the search with TG-SAGE.

Neural Measurements from Macaque Monkeys

We used a dataset of neural spiking activity from a population of 296 neural sites in two awake, behaving macaque monkeys in response to 5760 images [38]. Neural data were collected using parallel microelectrode arrays chronically implanted on the cortical surface in areas V4 and IT. Fixating animals were presented with images for 100 ms, and the neural response patterns were obtained by averaging the spike counts in the 70–170 ms window post stimulus onset. To enhance the signal-to-noise ratio, each image was presented to each monkey between 21 and 50 times, and the average response pattern across all presentations was used for each image. The 296 recorded sites were partitioned into three cortical regions (V4, posterior IT, and anterior IT), and an RDM was calculated for each region.

The image set consisted of a total of 5760 images. Each image contained a 3D rendered object placed on an uncorrelated natural background. The rendered objects were selected from a battery of 64 objects spanning 8 categories (animals, boats, cars, chairs, faces, fruits, planes, and tables), with 8 objects per category. The images were generated to include large variations in the position, size, and pose of the objects and were shown within the central 8 degrees of the monkeys' visual field. Example images are illustrated in Figure S3.

Figure S3: Example images from each of the eight object categories that were used to record neural responses.

Implementation Details

Because of the heavy computational load associated with training neural networks, particularly large-scale model training, we needed a scalable and efficient framework for the search procedure. We implemented our proposed framework as four main modules: (i) explorer, (ii) trainer, (iii) evaluator, and (iv) tracker. The explorer module contained the search algorithm. The trainer module optimized the parameters of a proposed architecture on an object recognition task using a large-scale image dataset. Once a training job was completed, the evaluator module extracted the network activations in response to a predetermined image set and assessed the similarity of the representations to the teacher benchmarks. The tracker module consisted of a database that tracked the details and status of every proposed architecture and acted as a bridge between the other three modules.

Figure S4: Implementation of a distributed framework for conducting architecture search.

During the search experiments, the explorer module proposes new candidate architectures and records their details in the database (tracker module). It also continuously monitors the database for newly evaluated networks and, upon receiving an adequate number of samples (i.e. when a new batch is complete), updates its parameters. Active workers periodically monitor the database for newly added untrained models and train each architecture on the prespecified dataset. After the training phase is completed, the evaluator module extracts the features from all layers in response to the validation set, computes the premature performance and RDM consistencies, and writes the results back to the database. The trainer and evaluator modules are then freed to process new candidate networks. This framework enabled us to run many worker programs on several clusters, speeding up the search procedure. An overview of the implemented framework is illustrated in Figure S4. The experiments reported in this paper were run on three server clusters with up to 40 GPUs in total.
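The propose/train/evaluate loop described above can be reduced to a toy sketch. Here the tracker database is replaced by an in-memory queue, and training plus evaluation is a stub function; only the module roles match the text, everything else is illustrative:

```python
import queue
import threading

tracker = queue.Queue()   # stands in for the tracker module's database
results = []              # evaluated (arch_id, score) tuples
lock = threading.Lock()

def explorer(n):
    """Propose n candidate architectures and record them in the tracker."""
    for i in range(n):
        tracker.put({"arch_id": i})

def worker():
    """Trainer+evaluator stub: poll the tracker for untrained candidates,
    'train and evaluate' each one, and write the score back."""
    while True:
        try:
            cand = tracker.get(timeout=0.1)
        except queue.Empty:
            return                       # no more pending candidates
        score = cand["arch_id"] % 3      # stub for premature perf + RDM score
        with lock:
            results.append((cand["arch_id"], score))
        tracker.task_done()

explorer(9)
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert len(results) == 9   # every proposed candidate was evaluated exactly once
```

In the real system the explorer would additionally read completed scores back from the tracker to update the search algorithm's parameters after each batch.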