I Introduction
Neural networks (NNs) have demonstrated great potential in a wide range of artificial intelligence tasks such as image classification, object detection or speech recognition [zoph2016neural, wang2020fcos, ding2020autospeech]. Nevertheless, designing a NN for a given task or dataset requires significant human expertise, which restricts their application in the real world [elsken2018neural]. Recently, neural architecture search (NAS), which automatically designs a NN for a given dataset and a target objective, has been demonstrated to be a promising solution to this issue [zoph2016neural]. Current NAS methods are already able to automatically find better neural architectures than handmade NNs [zoph2016neural, ding2020autospeech, wang2020fcos, real2019regularized].
Figure 1: (a) A cell, in which random variables represent the learnable relevance over the operations. The coloured lines represent different candidate operations and their thickness represents their likelihood. All outputs from the processed states are concatenated along the channel dimension into the cell output, symbolised by the dashed lines. Green rectangles (states) signify data. (b) Network skeleton comprising normal cells and two reduction cells. Cells of the same type share the same structure. The network also contains a stem comprised of a convolution, and at the end of the network there is average pooling followed by a linear classifier.

NAS itself is a challenging problem over a discrete search space, which can be simplified into reasoning about what operations should be present in the NN architecture and how they should be interconnected. Common operation types that are considered are, for example, different types of convolutions or pooling [zoph2016neural]. If the search is not approached with caution, the resultant NN might not be flexible enough to learn useful patterns. Additionally, the ability of the model to generalise is also directly dependent on the NN architecture [zoph2016neural, liu2018darts]. Therefore, there is an omnipresent need for finding architectures that are expressive enough and at the same time achieve good generalisation performance.
Based on the core algorithmic principle operating during the search, NAS can be divided into four categories: (i) reinforcement learning based on an actor-critic framework [zoph2016neural], (ii) evolutionary methods based on genetic algorithms [real2019regularized], (iii) Bayesian optimisation based on proxy models [cai2018path] and (iv) gradient-based methods [liu2018darts]. In particular, gradient-based NAS methods [liu2018darts] have recently been popularised for convolutional NN (CNN) architecture search due to their compute efficiency during the search. Nevertheless, gradient-based NAS is likely to collapse into a situation where it selects all operations to be the same [zela2019understanding], treats operations unfairly [chu2019fair] or is hard to adapt across different datasets and search spaces [li2019random].

To solve these issues in the existing gradient-based NAS methods, this paper proposes Variational Inference-based Neural Network Architecture Search (VINNAS). Under the same search space as other NAS methods [liu2018darts, chu2020noisy, zela2019understanding], our approach does not require any additional computation beyond the standard backpropagation algorithm. In VINNAS, we tackle NAS using Bayesian inference: by modelling the architecture search through additional random variables, which determine different operation types or connections between operations, our algorithm is able to conduct an effective NN architecture search. The importance of using particular operations is determined by using a variational dropout scheme [molchanov2017variational, kingma2015variational] with the automatic relevance determination (ARD) [mackay1995probable] prior. We specifically search for a network structure that is composed of cells containing a variety of operations. The operations are organised into two types of cells, normal and reduction, and similarly to cell-based NAS [liu2018darts], the cells are replicated and then used to construct the complete CNN. The model is shown in Figure 1. To encourage traversal through the NN architecture search space, we formulated an auto-regularising objective that promotes exploration, while ensuring high levels of certainty in the selection phase.

We performed experiments on searching CNNs for classification on image datasets, namely MNIST, FashionMNIST and CIFAR10. Our results demonstrate state-of-the-art performance, thanks to targeting sparse architectures that focus on learning efficient representations, which is enforced by strict regularisation. For example, on CIFAR10 we find an architecture that needs fewer parameters than the state-of-the-art, without any human intervention.
In summary, our main contributions are as follows:

A differentiable neural architecture search method adopting variational dropout, which is effective in searching for neural network architectures with state-of-the-art performance on multiple datasets.

An architecture search objective using scheduled regularisation to promote exploration, but at the same time motivate certainty in the operation selection.

An updated rule for selecting the most dominant operations based on their inferred uncertainty.
In the sequel, we describe our approach in detail. In Section II we review related work, in Section III we introduce variational learning and gradientbased NAS. In Section IV we introduce our search objective, search space and the proposed overall algorithm. Section V documents the performance of our search method on experiments and lastly, in Section VI we draw our conclusions. Our implementation can be found at: https://github.com/iimlucl/vinnas.
Table I: Notation used in this paper.
Architecture | Architecture search space (supergraph) | Data/state in architecture | Architecture variables | Dataset / dataset size
Operation candidates | Candidate operations | Total number of cells | N: normal cell | R: reduction cell
Prior density | Approximation density | Weights | Other parameters | Reparametrisation parameters
II Related Work
Differentiable Neural Architecture Search
Since Zoph et al. [zoph2016neural] popularised NAS for CNNs, the field has been growing thanks to intensive scientific [liu2018darts, zhou2019bayesnas] and industrial [zoph2016neural, real2019regularized] interest. NAS techniques automate the design of CNNs, mainly in terms of high-level operations, such as different types of convolutions or pooling, and their connections. The core of these techniques is the search space of potential architectures, their optimisation objective and the search algorithm. For further detail on NAS, we refer the reader to the review by Elsken et al. [elsken2018neural]. It is common practice to organise the search space for all potential architectures into finding cells that specify the operations and their connections [liu2018darts], which are then stacked on top of each other to construct the final NN, as previously shown in Figure 1. Modern NAS methods often apply a weight-sharing [pham2018efficient] approach, where they optimise the search over several architectures in parallel by sharing the weights of their operations to save memory resources. Among these approaches, gradient-based NAS has become one of the most popular [liu2018darts], mainly due to its compute feasibility. DARTS [liu2018darts] defines the search for an architecture as optimising continuous weights associated with operations in an overparametrised supergraph, while utilising weight-sharing. After the best combination of operations in the supergraph is identified, it is used to construct the final architecture for evaluation. However, Zela et al. [zela2019understanding] identified a wide range of search spaces for which DARTS yields degenerate architectures with very poor test performance. Chu et al. [chu2019fair] observed critical problems in two-stage weight-sharing NAS due to inherent unfairness in the supergraph training. Chu et al. [chu2020noisy] attempt to fix this problem by adding noise to the skip-connection operation during the search. Our approach is similar to [chu2020noisy]; however, we do not bias the search only towards skip-connections, but rather infer the properties of the noise distribution with respect to ARD.
Pruning
Gradient-based NAS can be regarded as a subset of pruning in NNs, and many approaches to pruning have been introduced, such as that of LeCun et al. [lecun1990optimal], who pruned networks by analysing second-order derivatives. Other approaches [scardapane2017group] consider removing groups of filters in convolutions. Kingma et al. [kingma2015variational] prune NNs at a node level by noticing connections between dropout [srivastava2014dropout] and approximate variational inference. Molchanov et al. [molchanov2017variational] show that the interpretation of Gaussian dropout as performing variational inference in a network with log-uniform priors over weights leads to high sparsity in weights. Blundell et al. [blundell2015weight] introduce a mixture-of-Gaussians prior on the weights, with one mixture component tightly concentrated around zero, thus approximating a spike-and-slab prior over weights. Ghosh et al. [ghosh2018structured] and Louizos et al. [louizos2017bayesian] simultaneously consider grouped Horseshoe priors [carvalho2009handling] for neural pruning. Zhou et al. [zhou2020posteriorguided] use variational dropout [kingma2015variational] to select filters for convolution. Our method differs from these approaches by not only inferring sparse weights for operations, but also attempting to infer weights over the operations' search space in order to search for NN architectures.
III Preliminaries
In this Section we introduce variational learning and cell-based differentiable neural architecture search, which serve as basic building blocks for developing VINNAS. The notation used in this paper is summarised in Table I.
III-A Variational Learning
We specify a CNN as a parametrisable function approximator with some architecture, learnt on data samples consisting of inputs and targets that together form a dataset. The architecture, composed of operations, might have certain parameters, for example weights, which are distributed according to some prior distribution. The architecture and the weights combined define the model and the likelihood. We seek to learn the posterior distribution over the parameters using Bayes' rule. However, that is analytically intractable due to the normalising factor, which cannot be computed exactly due to the high dimensionality of the parameters.
Therefore, we need to formulate an approximate parametrisable posterior distribution whose parameters can be learnt in order to approach the true posterior. Moving the approximation closer to the true posterior naturally raises an objective: to minimise their separation, which is expressed as the Kullback-Leibler (KL) divergence [kullback1951information]. This objective is approximated through the evidence lower bound (ELBO), shown in (1).
(1) 
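One form of (1) that is consistent with the description that follows can be written as below. The symbols are assumptions chosen here for illustration, not taken from the source: $w$ for the weights, $\theta$ for the parameters of the approximation $q_{\theta}(w)$, $\phi$ for the other pointwise parameters, $\mathcal{D}$ for the dataset and $\gamma$ for the coefficient scaling the regulariser.

```latex
\mathcal{L}(\theta, \phi) =
  -\,\mathbb{E}_{q_{\theta}(w)}\!\left[\log p(\mathcal{D} \mid w, \phi)\right]
  + \gamma\, \mathrm{KL}\!\left(q_{\theta}(w)\,\|\,p(w)\right)
```

With $\gamma = 1$ this is the negative ELBO, so minimising it is equivalent to maximising the lower bound.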
The first term is the negative log-likelihood of the data, which measures the data-fit, while the second term is a regulariser whose influence can be managed through a scaling coefficient. Other learnable pointwise parameters are assumed to have uniform priors, so they contribute a term that is independent of the parameters.
Kingma et al. introduced the local reparametrisation trick (LRT) [kingma2015variational], which allows us to solve the objective in (1) with respect to the parameters of the approximation through stochastic gradient descent (SGD) with low variance. We can backpropagate the gradients with respect to the distribution by sampling the weights through a deterministic transformation of the parameters and a parameter-free noise, e.g. Gaussian noise.

Moreover, using this trick, Molchanov et al. [molchanov2017variational] were able to search for an unbounded approximation for the weights as shown in (2), which corresponds to a Gaussian dropout model with learnable parameters [srivastava2014dropout]. (Footnote 1: the product in (2) is a Hadamard product.)
(2) 
After placing a factorised log-uniform prior on the weights, the authors observed an effect similar to ARD [molchanov2017variational], however, without the need to modify the prior. Throughout inference the learnt weights tend to a delta function centred at zero, leaving the model only with the important non-zero weights. The relevance determination is achieved by optimising both parameters of the approximation for each weight, and if they are both close to zero, the weight can be pruned.
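The reparametrised sampling of the weights described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the names `mu`, `log_var`, `sample_weights` and `noise_level` are hypothetical, and the noise level `alpha = sigma^2 / mu^2` follows the Gaussian dropout parametrisation of [molchanov2017variational].

```python
import numpy as np

def sample_weights(mu, log_var, rng):
    """Return w = mu + sigma * eps together with the noise eps ~ N(0, I)."""
    sigma = np.exp(0.5 * log_var)         # variances are stored as logarithms
    eps = rng.standard_normal(mu.shape)   # parameter-free noise
    return mu + sigma * eps, eps

def noise_level(mu, log_var, tiny=1e-8):
    """alpha = sigma^2 / mu^2, the per-weight noise level of Gaussian dropout."""
    return np.exp(log_var) / (mu ** 2 + tiny)

# Hypothetical means and variances for three weights.
rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 2.0])
log_var = np.log(np.array([0.01, 0.04, 0.25]))

w, eps = sample_weights(mu, log_var, rng)
alpha = noise_level(mu, log_var)
```

Because the sample is a deterministic function of `mu`, `log_var` and the noise, gradients can flow to both parameters in an automatic-differentiation framework.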
III-B Cell-based Differentiable Neural Architecture Search
As shown above, Bayesian inference can be used to induce sparsity in the weight space; however, we wish to find an architecture from some architecture space.
The authors of DARTS [liu2018darts] defined the search for an architecture as finding specific architecture variables associated with choosing operations in an overparametrised directed acyclic graph (DAG), where the learnt values of these variables are then used to specify the architecture at test time. Due to compute feasibility, the search space for all potential architectures is simplified into finding cells. The cell structure is defined with respect to the architecture variables, whose indices signify the potential connections and operations between information states inside the cell. Each information state is a 4-dimensional tensor indexed by samples, channels, height and width. A further index distinguishes the different types of cells, of which there are 2: normal (N) cells preserve the input dimensionality, while reduction (R) cells decrease the spatial dimensionality but increase the number of channels [liu2018darts]. The cells can be interleaved and repeated to give the total number of cells. The information for a state inside the cell is a weighted sum of the outputs generated from the different operations applied to the preceding states. Choosing one of the operations can be approximated by performing a softmax on the architecture variables, instead of an argmax, which provides the method with differentiable strengths of potential operations as shown in (3). The last state, which is the output of the cell, is then a concatenation of all the previous states, except the first two input states.
(3)
After the search, each state is connected with the outputs from the two operations whose strengths have the highest magnitude. The learnt weights and architecture variables are discarded and the resultant architecture is retrained from scratch.
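The softmax relaxation and the subsequent selection can be illustrated with a small sketch. It is a toy NumPy example under assumptions: the three operations in `OPS` are stand-ins for real convolution and pooling operations, and `select_top_two` keeps the strongest operation per input and then the two strongest inputs, which is one DARTS-style reading of the selection rule rather than a confirmed detail of the source.

```python
import numpy as np

# Toy stand-ins for candidate operations (hypothetical, not the real search space).
OPS = {
    "skip": lambda s: s,
    "double": lambda s: 2.0 * s,
    "negate": lambda s: -s,
}

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def mixed_state(prev_states, alphas):
    """Weighted sum of every operation applied to every preceding state."""
    out = np.zeros_like(prev_states[0])
    for i, state in enumerate(prev_states):
        strengths = softmax(alphas[i])    # differentiable operation strengths
        for k, op in enumerate(OPS.values()):
            out = out + strengths[k] * op(state)
    return out

def select_top_two(alphas):
    """Keep the strongest operation per input, then the two strongest inputs."""
    names = list(OPS)
    best = [(alphas[i].max(), i, names[int(alphas[i].argmax())])
            for i in range(len(alphas))]
    best.sort(key=lambda item: item[0], reverse=True)
    return [(i, name) for _, i, name in best[:2]]

prev_states = [np.ones(3), 2.0 * np.ones(3)]
uniform_out = mixed_state(prev_states, np.zeros((2, 3)))
chosen = select_top_two(np.array([[0.1, 2.0, -1.0],
                                  [0.5, 0.2, 0.3],
                                  [3.0, 0.0, 0.0]]))
```

With equal architecture variables every operation contributes with strength 1/3, so the mixed state is simply the average of all operation outputs summed over the inputs.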
DARTS has been heavily adopted by the NAS community due to its computational efficiency in comparison to other NAS methods. However, upon careful inspection it can be observed that it does not promote choosing a particular operation and often collapses to a mode, because the graph is overparametrised through a variety of parallel operations [chu2019fair]. The supergraph then focuses on improving the performance of the whole graph, without providing a dominant architecture. Additionally, others have observed [chu2019fair, chu2020noisy] that the method requires careful hyperparameter tuning, without which it might collapse into preferring only one operation type over the others.
IV VINNAS
In this Section, we first describe the search space assumptions for VINNAS in detail, followed by the objective that guides the exploration among the architectures. Finally, we present the VINNAS algorithm that couples everything together.
IV-A Search Space
Our method extends the idea behind gradient-based NAS, while using variational learning to solve the aforementioned defects in previous work. VINNAS builds its search space as an overparametrised directed acyclic supergraph such that it contains the sought architecture template. Similarly to DARTS, we aim to search for two cells, namely a normal and a reduction cell, that will be repeated as shown in Figure 1. Therefore, the supergraph contains several normal and reduction cells laid out in a sequence, with each containing the parallel operation options. However, the supergraph is downscaled in the number of cells and channels in comparison to the architecture considered during the evaluation, such that the supergraph can fit into GPU memory. Nevertheless, the pattern and the ratio of the number of normal and reduction cells in the supergraph are preserved in accordance with the model shown in Figure 1. To apply variational inference and subsequently ARD through variational dropout, we give the structural strengths for the normal cells and for the reduction cells a probabilistic interpretation. The graphical model of the supergraph that pairs together its weights and architecture strengths is shown in Figure 2.
For simplicity, we assume a fully factorisable log-uniform prior for the structural strengths. The prior biases the distributions of the operations' strengths towards zero, which avoids giving an advantage to certain operations over the others. We similarly model the weights of the supergraph as random variables, such that the joint prior distribution is the product of the priors over the weights and the structural strengths. It is not analytically possible to find the true posterior; therefore, we resort to formulating an approximation. We again set factorisable approximations for both the weights and the structural strengths, such that the joint distribution factorises with respect to the optimisable parameters of each. The prior and approximations are detailed in (4) and (5) respectively. The indices stand for the different states in the cells and for the available operations.
(4)
(5)  
The approximate posteriors were selected as Gaussians with diagonal covariance matrices. We used the formulation by Molchanov et al. [molchanov2017variational] for both the structural strengths, during the search phase, and the weights, during both the search and test phases. We aim to induce sparsity in the operations' space, which would result in most operations' strengths in the DAG being zero, while the most relevant operations are expected to be non-zero. At the same time, the method induces sparsity in the weight space and thus motivates the individual operations to be extremely efficient in their learnt patterns. We believe Gaussians are suitable approximations, since increasing the amount of training data implies that the posterior over these random variables will be similarly Gaussian. Also, the Gaussian noise used in our method effectively disrupts the previously observed unfairness in operation selection during NAS, as partially demonstrated by [chu2020noisy] for the skip-connection operation. Circling back to (3), the information in each cell during the search is now calculated with respect to a sample from the inferred distributions. The second-level parameters, such as the individual means and variances, are assumed to have non-informative uniform priors.
IV-B Search Objective
The goal of the search is to determine the right set of structural variables, or their corresponding parameters, such that they can later be used to construct the desired architecture. Therefore, the search objective is in fact a secondary objective to the primary objective of minimising (1) with respect to some unknown parameters implied by the chosen architecture, as shown in (6).
(6) 
The parameters in (6) refer to the reparametrisations for the supergraph. Therefore, it is necessary to optimise the objective simultaneously with respect to the structural parameters and the operations' weight parameters, which indicate their usefulness in the final architecture. Derived from the original ELBO in (1), optimising the supergraph with respect to the learnable parameters gives rise to the objective in (7) below.
(7) 
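A form of (7) that matches the three terms discussed next can be sketched as below. All symbols are assumptions made here for illustration: $w$ and $\alpha$ denote the weights and the structural strengths, $\theta_{w}$ and $\theta_{\alpha}$ the parameters of their approximations, $\mathcal{D}$ the dataset and $\gamma_{1}, \gamma_{2}$ the scaling constants.

```latex
\mathcal{L}(\theta_{w}, \theta_{\alpha}) =
  -\,\mathbb{E}_{q_{\theta_{w}}(w)\, q_{\theta_{\alpha}}(\alpha)}
      \!\left[\log p(\mathcal{D} \mid w, \alpha)\right]
  + \gamma_{1}\, \mathrm{KL}\!\left(q_{\theta_{w}}(w)\,\|\,p(w)\right)
  + \gamma_{2}\, \mathrm{KL}\!\left(q_{\theta_{\alpha}}(\alpha)\,\|\,p(\alpha)\right)
```

The two KL terms separate in this way only because both the priors and the approximations are assumed to factorise over the weights and the structural strengths.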
The first term again corresponds to the data-fitting term, which pushes the parameters towards maximising the expectation of the log-likelihood of the targets with respect to the variational distributions. The other two terms are regulariser terms, which, due to the factorisation of the joint distributions and priors, can be separated and scaled by arbitrary constants. As previously stated, these constants enable the trade-off between the data-fit and regularisation. Molchanov et al. [molchanov2017variational] provide an approximation of the KL divergence between the prior and the posterior. After the search or the training for the final evaluation, the variances are only used to determine which weights can be pruned; otherwise they are not considered during evaluation.
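The approximation of the KL term from [molchanov2017variational] can be sketched in a few lines. The constants are those reported in that paper; the function name `kl_log_uniform` and the choice to pass `log_alpha = log(sigma^2) - log(mu^2)` as input are assumptions of this sketch.

```python
import numpy as np

# Constants of the approximation reported in [molchanov2017variational].
K1, K2, K3 = 0.63576, 1.87320, 1.48695

def kl_log_uniform(log_alpha):
    """Approximate KL(q || p) per parameter under a log-uniform prior."""
    sigmoid = 1.0 / (1.0 + np.exp(-(K2 + K3 * log_alpha)))
    neg_kl = K1 * sigmoid - 0.5 * np.log1p(np.exp(-log_alpha)) - K1
    return -neg_kl

# The penalty shrinks as the noise level alpha grows, which is what
# pushes irrelevant parameters towards being pruned.
kl_low_noise = kl_log_uniform(np.log(1e-3))
kl_unit_noise = kl_log_uniform(0.0)
kl_high_noise = kl_log_uniform(np.log(1e3))
```

In an objective such as (7), this quantity would be summed over all parameters and multiplied by the corresponding scaling constant.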
Additionally, we were inspired by [chu2019fair], which promoted confidence in selecting connections in a graph by explicitly minimising their entropy in a similar NAS setup. In our case, we want to achieve certainty in the operations' selection, which is equivalent to minimising the joint entropy across the potential operations. Applying a regulated coefficient to the entropy term, the final search objective is formulated in (8).
(8) 
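The entropy term added in (8) can be illustrated with a short sketch. It assumes, as in (3), that the selection probabilities are a softmax over the structural strengths of each connection, and that the joint entropy is the sum of the per-connection entropies; the function names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def selection_entropy(alphas, tiny=1e-12):
    """Sum of the entropies of the operation choice on every connection."""
    p = softmax(alphas)                   # one distribution per row (connection)
    return float(-(p * np.log(p + tiny)).sum())

# Two connections with four candidate operations each.
undecided = selection_entropy(np.zeros((2, 4)))
decided = selection_entropy(np.array([[10.0, 0.0, 0.0, 0.0],
                                      [0.0, 10.0, 0.0, 0.0]]))
```

An undecided cell has the maximum entropy of `log(4)` per connection, whereas a cell that has committed to one operation per connection has an entropy close to zero, so penalising this term rewards certainty.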
IV-C Algorithm
Our algorithm, shown in Algorithm 1, is based on SGD and relies on complete differentiation of all the operations. VINNAS iterates between two stages: (1, lines 6-8) optimisation of the weight parameters and (2, lines 10-14) optimisation of the architecture parameters. This two-stage optimisation aims to avoid over-adaptation of parameters, as suggested in [liu2018darts].
After the initialisation of the parameters, the optimisation loops over stages (1) and (2) using two same-sized portions of the dataset. However, stage (2) is not started from the very beginning, but only after a certain number of weight epochs, which are meant as a warm-up for training the weights of the individual operations, to avoid oscillations and settling in local minima [liu2018darts]. The variance parameters are optimised as logarithms to guarantee computational stability. We linearly increase the values of the regularisation coefficients to force the cells to gradually choose the most relevant operations and weight patterns. To avoid getting stranded in a local minimum, we do not enforce the regularisation from the very start of the search, meaning that the coefficients are initialised as zero. After each iteration of (1) and (2), we compute the error on the sampled data and save the current architecture parameters if that error is lower than in previous iterations. The search is repeated until the search budget, which is defined as the number of epochs that the search is allowed to perform, is depleted. Note that the parameters of the weights are discarded after the search. The main outcome of the search algorithm is the set of architecture parameters, which are used to perform the architecture selection that leads to the final architecture.

Signal-to-noise ratio (SNR) is a commonly used measure in signal processing to distinguish between useful information and unwanted noise contained in a signal. In the context of NN architecture, the SNR can be used as an indicator of parameter importance: the higher the SNR, the more effective or important the parameter is to the model predictions for a given task. In this work we propose to look at the SNR when choosing the operations: the learnt variances can be used to compute the positive SNR. It can then be used as a measure based on which the right operations should be chosen, instead of just relying on the means as in previous work [liu2018darts].
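The difference between selecting by magnitude and selecting by SNR can be seen in a small sketch. It assumes the common definition of the positive SNR as the absolute mean divided by the standard deviation; the exact expression used by the authors is not recoverable from the text, and the names below are hypothetical.

```python
import numpy as np

def positive_snr(mu, log_var):
    """|mean| / standard deviation, with variances stored as logarithms."""
    return np.abs(mu) / np.exp(0.5 * log_var)

def rank_by_snr(mu, log_var):
    """Indices of the operations from the highest to the lowest SNR."""
    return np.argsort(-positive_snr(mu, log_var))

def rank_by_magnitude(mu):
    """Indices of the operations from the largest to the smallest |mean|."""
    return np.argsort(-np.abs(mu))

# Three candidate operations on one connection (hypothetical values).
mu = np.array([1.0, 0.5, -0.8])
log_var = np.log(np.array([4.0, 0.01, 0.04]))

snr_order = rank_by_snr(mu, log_var)
magnitude_order = rank_by_magnitude(mu)
```

Here the first operation has the largest mean but also a large variance, so it ranks first by magnitude and last by SNR, which is the kind of uncertain choice the SNR rule is meant to avoid.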
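Table II: Comparison of VINNAS and random search on the different datasets.
Dataset | Method | Test Accuracy (%), Positive SNR | Test Accuracy (%), Magnitude | # Params (M), Positive SNR | # Params (M), Magnitude | Search Cost (GPU days)
MNIST | VINNAS |  |  |  |  | 0.02
MNIST | Random |  |  |  |  | 0.0
FashionMNIST | VINNAS |  |  |  |  | 0.16
FashionMNIST | Random |  |  |  |  | 0.0
CIFAR10 | VINNAS |  |  |  |  | 1.7
CIFAR10 | Random |  |  |  |  | 0.0
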
Table III: Comparison with the state-of-the-art on CIFAR10.
Method | Search Method | Test Accuracy (%) | # Params (M) | Search Cost (GPU days)
Yamada et al. [yamada2016deep] | handmade | 97.33 | 26.2 |
Li & Talwalkar [li2019random] | random | 97.15 | 4.3 | 2.7
Liu et al. [liu2018darts] | gradient | 97.24 ± 0.09 | 3.4 | 1
Zoph et al. [zoph2016neural] | reinf. learning | 97.35 | 3.3 | 1800
Real et al. [real2019regularized] | genetic alg. | 97.45 ± 0.05 | 2.8 | 3150
Liu et al. [cai2018path] | Bayesian opt. | 96.59 ± 0.09 | 3.2 | 225
Zhou et al. [zhou2019bayesnas] | gradient | 97.39 ± 0.04 | 3.40 ± 0.62 | 0.2
Chu et al. [chu2020noisy] | gradient | 97.61 | 3.25 |
Chu et al. [chu2019fair] | gradient | 97.46 ± 0.05 | 3.32 ± 0.46 |
Zela et al. [zela2019understanding] | gradient | 97.05 |  |
VINNAS [Ours] | gradient | 97.66 | 1.18 | 1.7
V Experiments
To demonstrate the effectiveness of the proposed VINNAS method, we perform experiments on three different datasets, namely MNIST (M), FashionMNIST (F) and CIFAR10 (C).
V-A Experimental Settings
For each dataset, we search for a separate network structure composed of operations commonly used in CNNs, namely: separable convolutions and dilated separable convolutions with different kernel sizes, convolution, max pooling, average pooling, skip-connection and zero, meaning no connection. Note that we clip the strength of the zero operation to avoid scaling problems with respect to other operations. All operations are followed by BN and ReLU activation except zero and skip-connection.
Each cell accepts an input from the two previous cells. Each input is processed through a ReLU-convolution-BN block to match the input shape required by that particular cell. For M, we search for an architecture comprising a single reduction cell (R), and for F, we search for an architecture comprising 2 normal (N) and 2 reduction cells (NRRN). Both of these architectures have the same layout during evaluation; however, for F the number of channels is quadrupled during evaluation. For C, during the search phase we optimise a network consisting of 8 cells (NNRNNRNN) that is then scaled to 20 cells during evaluation (6NR6NR6N), along with the channel sizes, which are increased by a factor of 2.5. Each state always accepts 2 inputs processed through 2 operations. Each net also has a stem, which is a convolution followed by BN. At the end of the network, we perform average pooling followed by a linear classifier with the softmax activation.
The search space complexity for each net is given as which for M is , for F is and for C is . Weights from the search phase are not kept and we retrain the resultant architectures from scratch. We repeat each search and evaluation 3 times. We train the networks using a single sample from the approximate distributions together with LRT. Instead of cherry-picking among the found architectures through further evaluation and then selecting the resultant architectures by hand [liu2018darts], we report the results of the architectures found directly through VINNAS.
Search Settings
For optimising both the architecture parameters and the weight parameters, we use Adam [kingma2014adam] with different initial learning rates. We use cosine scheduling [loshchilov2016sgdr] for the learning rate of the weights' parameters and we keep the architecture's learning rate constant throughout the search. We initialise the regularisation coefficients to zero and then gradually and linearly increase them during the search process. We disable BN's learnable affine parameters and its tracking of running statistics. We initialise the operations' strengths through random sampling. We utilise label smoothing [muller2019does] to prevent the architecture parameters from committing hard to a certain pattern. To speed up the search, we not only search reduced architectures in terms of the number of channels and cells, but also search on 25%, 50% and 50% of the data for M, F and C respectively, while using 50% of that portion as the dataset for learning the architecture parameters. For M we use z-normalisation. For F and C we use random crops, flips and erasing [zhong2017random] together with input channel normalisation. We search for 20, 50 and 100 epochs for M, F and C respectively.
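The linear schedule of a regularisation coefficient can be sketched as a small helper. The name `linear_coefficient`, the start and end epochs and the maximum value are hypothetical, and holding the coefficient at its maximum after the ramp is an assumption of this sketch rather than a detail stated in the text.

```python
def linear_coefficient(epoch, start_epoch, end_epoch, max_value):
    """Zero before start_epoch, then a linear ramp up to max_value at end_epoch."""
    if epoch < start_epoch:
        return 0.0                        # regularisation is not enforced yet
    if epoch >= end_epoch:
        return max_value                  # assumed to stay constant afterwards
    return max_value * (epoch - start_epoch) / (end_epoch - start_epoch)

# Hypothetical 100-epoch search with a ramp between epochs 10 and 50.
schedule = [linear_coefficient(e, 10, 50, 1.0) for e in range(100)]
```

The coefficient returned for the current epoch would multiply the corresponding regulariser in the search objective before each optimisation step.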
Evaluation Settings
During evaluation we scale up the found architectures in terms of channels and cells as described previously. We again use the Adam optimiser with varying learning rates and cosine learning rate scheduling. We similarly initialise the regularisation coefficient on the weights to zero and linearly increase it from a given epoch. We do so, as in the search phase, to avoid over-regularisation and clamping the weights to zero too soon during the optimisation. We train on the full datasets for M, F and C for 100, 200 and 300 epochs respectively, and we preserve the data augmentation strategies during retraining.
For both the search and evaluation we initialise the weights' means with Xavier uniform initialisation [glorot2010understanding]. At the same time we initialise all the log-variances to the same constant value. We use a batch size of 256 for all experiments. We encourage the reader to inspect the individual hyperparameter values, random seeds and scheduling in our publicly available implementation at https://github.com/iimlucl/vinnas.
V-B Evaluation
The evaluation is condensed in Tables II and III. The numbers in bold represent the score for the best performing model with the given selection method: positive SNR/magnitude and the dataset. The found best performing architectures are shown in Figs. 3, 4 and 5. We first perform random search on our search spaces for M, F and C. Note that the search spaces are vast and we deem it impossible to evaluate all architectures in the search space, given our available compute resources, and thus we sample 3 separate architectures from each search space and we train them with the same hyperparameter settings as the found architectures to avoid any bias. The number of parameters is taken as the amount after pruning with respect to .
When comparing the found architectures for the different datasets in Table II, we noticed that in case of M or F, there are certain connections onto which an operation could potentially be completely omitted with the positive SNR being relatively small. We attribute this to the fact that these datasets are easy to generalise to, which can be also seen by the overall performance of the random search for these datasets. However, on CIFAR10, it can be seen that the inferred importance of all the operations and the structure is very high. The results also demonstrated that using the learnt uncertainty in the operation selection, in addition to the magnitude, benefits the operation selection. Compared with DARTS [liu2018darts] which only uses separable convolutions and max pooling everywhere, it can be observed that the found architectures are rich in the variety of operations that they employ and the search does not collapse into a mode where all the operations are the same. For future reference regarding deeper models such as for F and C, we observe that the found cells of the best performing architectures do contain at least one skipconnection to enable efficient propagation of gradients and better generalisation.
The main limiting factor of this work is the GPU search cost which is higher, in comparison to the other NAS methods, due to using LRT, which requires two forward passes during both search and evaluation.
Most importantly, all the found architectures demonstrate good generalisation performance in terms of the measured test accuracy. Specifically for CIFAR10, Table III shows that VINNAS found an architecture that is comparable to the state-of-the-art, however, with fewer parameters.
VI Conclusion
In summary, our work introduces a new direction that combines probabilistic modelling and neural architecture search. Specifically, we give the operations' strengths a probabilistic interpretation by viewing them as learnable random variables. Automatic relevance determination-like priors are imposed on these variables, along with their corresponding operation weights, which incentivises automatic detection of pertinent operations and zeroing-out of the others, without significant hyperparameter tuning. Additionally, we promote certainty in the operations' selection through a custom loss function, which allows us to determine the most relevant operations in the architecture. We demonstrated the effectiveness of VINNAS on three different datasets and search spaces. On CIFAR10 we achieve state-of-the-art accuracy with fewer parameters.

In future work, we aim to explore a hierarchical Bayesian model for the architecture parameters, which could lead to architectures composed of more diverse cell types, instead of just two. Additionally, all of the evaluated NNs shared the same evaluation hyperparameters, and in the future we want to investigate an approach which can automatically determine suitable hyperparameters for the found architecture.