In recent years, deep neural networks (DNNs) have produced remarkable results on a broad variety of tasks. These methods have usually involved careful network design, often relying on domain knowledge to devise a structure that can encapsulate the task at hand. Neural Architecture Search (NAS) has provided an alternative to hand-designed networks, allowing for the search and even direct optimisation of the network’s structure. However, the search space for architectures is often vast, with potentially limitless design choices. Furthermore, each configuration must undergo some training or fine-tuning for its efficacy to be determined. This has led to the development of methods which lump multiple design parameters together, reducing the search space in a principled manner (Tan and Le, 2019), and has created the need for sophisticated search algorithms (Liu et al., 2018; Wu et al., 2019), which can converge more quickly to an improved design. Both techniques reduce the number of search iterations and ultimately the number of required training or fine-tuning stages.
Architecture search has so far, to the best of our knowledge, avoided exploring grouped convolution design. However, grouped convolution presents itself as an ideal candidate for architecture search, since it is widely used in several prevalent networks. ResNeXt (Xie et al., 2017) used grouped convolution for improved accuracy over the analogous ResNets (He et al., 2016). On the other hand, MobileNet (Howard et al., 2017) and various others (Zhang et al., 2018; Sandler et al., 2018) have utilised grouped convolutions in the depthwise case within a ResNet-style design for extremely low-cost inference. With these architectures, grouped convolution has proven to be a valuable design tool for high-performance and low-cost design alike. Applying it to these contrasting performance profiles requires an intuition, which has so far remained relatively unexplored.
Exploring these design implications is costly, since decomposing a network is time-consuming. In addition, there is not necessarily a heuristic or intuition for how combinations of grouped convolutions with varying numbers of groups interact in a network. We tackle this in this work with the introduction of a Group-size Series (GroSS) decomposition. GroSS allows us to train the entire search space of architectures simultaneously. In doing so, we shift the expense of architecture search with respect to group-size away from decomposition and training, and towards cheaper test-time sampling. This allows for the exploration of possible configurations, while significantly reducing the need to impart bias on the selection of the group design hyperparameters.
The contributions of this paper can be summarised as follows:
- We present GroSS decomposition, a novel formulation of tensor decomposition as a series of rank approximations. This provides a mathematical basis for grouped convolution as a series of increasing rank terms.
- GroSS provides the apparatus for differentiably switching between grouped convolution ranks. To the best of our knowledge, it therefore enables the first simultaneous training of differing numbers of groups within a single layer, as well as of all possible configurations between layers, effectively training an entire architecture search space at once.
- We explore this concurrently trained architecture space with a small network, as well as VGG-16, using proof-of-concept exhaustive and breadth-first searches. This illustrates the efficacy of GroSS and takes a step towards removing the training burden from architecture search.
2 Related Work
Grouped convolution has had a wide impact on neural network architectures, particularly due to its efficiency. It was first introduced in AlexNet (Krizhevsky et al., 2012) as an aid for training a single network over multiple GPUs. ResNeXt (Xie et al., 2017) used grouped convolutions synonymously with the concept of cardinality, ultimately exploiting their efficiency for high-accuracy network design. The reduced complexity of grouped convolution allowed ResNeXt to incorporate deeper layers within the ResNet-analogous residual blocks (He et al., 2016). In all, this allowed higher accuracy at a similar inference cost to an equivalent ResNet. The efficiency of grouped convolution has also led to several low-cost network designs. MobileNet (Howard et al., 2017) utilised a ResNet-like bottleneck design with depthwise convolutions (a special case of grouped convolution in which the number of groups equals the number of input channels) for an extremely efficient network with mobile applications in mind. ShuffleNet (Zhang et al., 2018) was also based on a depthwise bottleneck; however, its pointwise layers were also made grouped convolutions.
Previous works (Jaderberg et al., 2014; Denton et al., 2014; Lebedev et al., 2014; Vanhoucke et al., 2011) have applied low-rank approximation of convolution for network compression and acceleration. Block Term Decomposition (BTD) (De Lathauwer, 2008) has recently been applied to the task of network factorisation (Chen et al., 2018), where it was shown that the BTD factorisation of a convolutional weight is equivalent to a grouped convolution within a bottleneck architecture. Wang et al. (2018) applied this equivalence for network acceleration. Since decomposition is costly, these methods have relied on heuristics and intuition to set hyperparameters such as the rank of successive layers within the decomposition. In this paper, we present a method for decomposition which allows for exploration of the decomposition hyperparameters and all of their combinations.
Existing architecture search methods have overwhelmingly favoured reinforcement learning. Examples of this include, but are not limited to, NASNet (Zoph et al., 2018), MNasNet (Tan et al., 2019) and ReLeq-Net (Elthakeb et al., 2018). In broad terms, these methods all set a baseline structure, which is manipulated by a separate controller. The controller optimises the structure through an objective based on network performance. There has also been work in differentiable architecture search (Wu et al., 2019; Liu et al., 2018), which makes the network architecture manipulations themselves differentiable. In addition, Tan and Le (2019) aim to limit the network scaling within a performance envelope to a single parameter.
These methods all have a commonality: the cost of re-training or fine-tuning at each stage motivates the recovery of the optimal architecture in as few training steps as possible, whether this is achieved through a trained controller, direct optimisation or significantly reducing the search space. In this work, however, we produce a method where the entire space is trained at once and therefore shift the burden of architecture search away from training.
In this section, we first introduce Block Term Decomposition (BTD) and detail how its factorisation can be applied to a convolutional layer. We then introduce GroSS decomposition, in which we unify a series of ranked decompositions so that they can be combined dynamically and differentiably. Finally, we detail the strategy for training the whole series at once and describe our response reconstruction formulation, which improves the approximation provided by the factorisation.
3.1 General Block Term Decomposition
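Block Term Decomposition (BTD) (De Lathauwer, 2008) aims to factorise a tensor into the sum of multiple low-rank Tucker decompositions (Tucker, 1966). That is, given an $N$-th order tensor $\mathcal{X} \in \mathbb{R}^{d_1 \times d_2 \times \dots \times d_N}$, BTD factorises $\mathcal{X}$ into the sum of $R$ terms with lower rank $(d'_1, d'_2, \dots, d'_N)$:

$$\mathcal{X} = \sum_{r=1}^{R} \mathcal{G}_r \times_1 \mathbf{A}_r^{(1)} \times_2 \mathbf{A}_r^{(2)} \times_3 \dots \times_N \mathbf{A}_r^{(N)}$$

In the above, $\mathcal{G}_r \in \mathbb{R}^{d'_1 \times d'_2 \times \dots \times d'_N}$ is known as the core tensor and we will refer to $\mathbf{A}_r^{(n)} \in \mathbb{R}^{d_n \times d'_n}$ as factor matrices. We use the usual notation $\times_n$ to represent the mode-n product (De Lathauwer, 2008).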
3.2 Converting a Single Convolution to a Bottleneck Using BTD
Here, we can restrict discussion from a general, N-mode, tensor to the 4-mode weights of a 2D convolution as follows: , where and represent the number of input and output channels, and and the spatial size of the filter kernel. Typically the spatial extent of each filter is small and thus we only factorise and . So, to eliminate superscripts, we define and . Therefore, the BTD for convolutional weights is expressed as follows:
It can be shown that this factorisation of the convolutional weights into groups forms a three-layer bottleneck-style structure (Yunpeng et al., 2017): a pointwise ($1 \times 1$) convolution , formed from factor ; followed by a grouped convolution , formed from core and with groups; and finally another pointwise convolution , formed from factor . With careful selection of the BTD parameters, the bottleneck approximation can be applied to any standard convolutional layer.
In Table 1, we detail how the dimensions of the bottleneck architecture are determined from its corresponding convolutional layer, and indicate how properties such as stride, padding and bias are applied within the bottleneck for equivalency with the original layer. It is worth noting that we often refer to the quantities or as the group-size; this quantity determines the number of channels present in each group and is equivalent to the rank of the decomposition.
3.3 Group-size Series Decomposition
Group-size Series (GroSS) decomposition unifies multiple ranks of BTD factorisations. This is achieved by defining each successive factorisation relative to the lower order ranks. Thus we ensure that higher rank decompositions only contain information that was missed by the lower order approximations. Therefore the approximation of is given as follows:
Here, , and represent the additional information captured between the and rank of approximation, and , and represent the total approximation from lower rank approximations in the form of cores and factors. However, both the core and factors must be recomputed so that the dimensions match the ranks required , which is not a trivial manipulation.
Instead, we introduce a function, , which allows the weights of a grouped convolution to be “expanded”. The expanded weight from a convolution with group-size can be used in a convolution with group-size , where , giving identical outputs:
where is the weight for a grouped convolution, refers to convolution with group-size , and is the feature map to which the convolution is applied. This expansion allows us to conveniently reformulate the GroSS decomposition in terms of the successive convolutional weights obtained from BTD, rather than within the cores and factors directly. More specifically, we define the bottleneck weights for the order GroSS decomposition with group-sizes, , as follows:
, and represent the weights obtained from the lowest rank decomposition present in the series. , and represent the additional information that the rank decomposition contributes to the bottleneck approximation:
This formulation involving only manipulation of the convolutional weights is exactly equivalent to forming the bottleneck components , and from , and , as in the general BTD to bottleneck case.
Further, the grouped convolution weight expansion, , enables us to dynamically, and differentiably, change the group-size of a convolution. In itself, this is not particularly useful: a convolution with a larger group-size requires more operations and more memory, while yielding identical outputs. However, it allows for direct interaction between differently ranked network decompositions and, therefore, the representation of one rank by the combination of lower ranks. Thus, GroSS treats the decomposition of the original convolution as the sum of successive order approximations, with each order contributing additional representational power.
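To make the expansion concrete, the following is a minimal sketch for the special case of a grouped 1×1 convolution evaluated at a single spatial location, with equal numbers of input and output channels. The function names `grouped_pointwise` and `expand_group_weight` are illustrative and are not taken from any released implementation. The zero-padding scheme is an assumption about how such an expansion can be realised, chosen so that the larger group-size gives identical outputs.

```python
def grouped_pointwise(x, w):
    """Grouped 1x1 convolution at one pixel.

    x: list of C input channel values.
    w: C rows (one per output channel), each holding g weights, where g is
       the group-size (number of channels per group).
    """
    g = len(w[0])
    out = []
    for o, row in enumerate(w):
        start = (o // g) * g  # first input channel of this output's group
        out.append(sum(wj * x[start + j] for j, wj in enumerate(row)))
    return out


def expand_group_weight(w, g_new):
    """Zero-pad weights of group-size g so they fit a conv of group-size g_new."""
    c, g = len(w), len(w[0])
    if g_new % g != 0 or c % g_new != 0:
        raise ValueError("g_new must be a multiple of g and divide the channel count")
    expanded = []
    for o, row in enumerate(w):
        # Position of the old (small) group inside the new (large) group.
        offset = (o // g) * g - (o // g_new) * g_new
        new_row = [0] * g_new
        new_row[offset:offset + g] = row
        expanded.append(new_row)
    return expanded


# Toy example: 8 channels, group-size 2, integer weights for exact comparison.
x = [1, 2, 3, 4, 5, 6, 7, 8]
w2 = [[2 * o + 1, 2 * o + 2] for o in range(8)]
y2 = grouped_pointwise(x, w2)
y4 = grouped_pointwise(x, expand_group_weight(w2, 4))
y8 = grouped_pointwise(x, expand_group_weight(w2, 8))
assert y2 == y4 == y8  # larger group-size, identical outputs
```

Because the expanded weights share a shape with weights of the larger group-size, the two can be summed element-wise, which is the property the series formulation relies on.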
3.3.1 Training GroSS Simultaneously
The expression of a group-size decomposition as the combination of lower rank decompositions is useful because it enables the group-size to be dynamically changed during training. The expansion and summation of convolutional weights is differentiable, so training at a high rank also optimises the lower rank approximations simultaneously. To the best of our knowledge, GroSS is the first method that allows for the simultaneous training of varying group-size convolutions.
We leverage the series form of the factorisation during training by randomly sampling a group-size for each decomposed layer at each iteration. The group-size for each decomposed layer is drawn from the probability distribution:
where refers to the smallest decomposed group-size, denotes the number of available group-sizes and is the sampling temperature. When , is a uniform distribution. By increasing the sampling temperature, we update the weights of lower group-sizes more aggressively, and implicitly encourage the lower-order approximations to carry more of the signal of the approximation. We set the default sampling temperature to and provide experimental evaluation to justify this as an appropriate choice.
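The per-iteration sampling can be sketched as follows. The exact form of the distribution is not reproduced here, so the weighting `exp(-temperature * i)` for the i-th smallest group-size is purely an assumption. It was chosen only because it is uniform at zero temperature and increasingly favours small group-sizes as the temperature grows. The function names are hypothetical.

```python
import math
import random


def group_size_probs(num_sizes, temperature):
    """Probabilities over group-sizes ordered from smallest to largest.

    ASSUMPTION: the weight of the i-th smallest group-size is
    exp(-temperature * i). This is an illustrative stand-in, not the
    distribution used in the paper.
    """
    weights = [math.exp(-temperature * i) for i in range(num_sizes)]
    total = sum(weights)
    return [wt / total for wt in weights]


def sample_configuration(group_sizes_per_layer, temperature, rng=random):
    """Draw one group-size per decomposed layer for a training iteration."""
    config = []
    for sizes in group_sizes_per_layer:
        ordered = sorted(sizes)
        probs = group_size_probs(len(ordered), temperature)
        config.append(rng.choices(ordered, weights=probs, k=1)[0])
    return config


# Twelve decomposed layers, each with the group-sizes used for VGG-16.
layers = [[1, 4, 16, 32]] * 12
config = sample_configuration(layers, temperature=1.0, rng=random.Random(0))
```

Each training iteration would draw a fresh configuration in this way, so smaller group-sizes receive more frequent updates whenever the temperature is positive.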
3.3.2 Response Reconstruction
The aim of factorising a convolutional layer is to ultimately mimic its performance on a particular task. However, the objective of the factorisation itself is to minimise the error between the original tensor and the approximation with respect to the Frobenius norm. While this goes some way to meeting the overall goal of similar performance, small errors in approximation can lead to drastic decreases in performance. Therefore, we encourage the decomposed layer to reconstruct the response of the original layer.
We follow the proposal of Jaderberg et al. (2014), in which the response approximation error is minimised through backpropagation. This is done by freezing all the standard (not decomposed) layers, and penalising the difference between the activation of each decomposed layer and the activation of the original layer using the loss:

$$\mathcal{L}_{\text{rr}} = \frac{\lVert \hat{\mathbf{Y}} - \mathbf{Y} \rVert_F}{\lVert \mathbf{Y} \rVert_F}$$

That is, the loss for response reconstruction is the Frobenius norm of the difference between the activation of the decomposed layer, written here as $\hat{\mathbf{Y}}$, and the target activation $\mathbf{Y}$, normalised by the Frobenius norm of the target activation.
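As a small numerical illustration of this loss, the sketch below computes the normalised Frobenius error on flattened activations in plain Python. The function names are illustrative. In practice the same quantity would be computed inside an automatic differentiation framework so that it can be backpropagated.

```python
import math


def frobenius_norm(values):
    """Frobenius norm of a tensor given as a flat list of its entries."""
    return math.sqrt(sum(v * v for v in values))


def response_reconstruction_loss(decomposed, target):
    """Norm of the activation difference, normalised by the target's norm."""
    if len(decomposed) != len(target):
        raise ValueError("activations must have the same number of entries")
    diff = [d - t for d, t in zip(decomposed, target)]
    return frobenius_norm(diff) / frobenius_norm(target)


target = [3.0, 4.0]  # target activation with Frobenius norm 5
perfect = response_reconstruction_loss([3.0, 4.0], target)  # no error
zeroed = response_reconstruction_loss([0.0, 0.0], target)   # all signal lost
```

The normalisation makes the loss scale-free: a perfect reconstruction gives 0, and an all-zero response gives 1 regardless of the magnitude of the target activation.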
4 Experimental Setup
In this section, we list the setup for our experimental evaluation. We first detail the dataset on which evaluation is conducted. Next, we describe the network architectures on which we perform GroSS decomposition. Finally, we list the procedure for the decomposition, response reconstruction and fine-tuning.
We perform our experimental evaluation on CIFAR-10 (Krizhevsky et al., 2014), a dataset consisting of 10 classes. The size of each image is $32 \times 32$. In total there are 60,000 images, which are split into 50,000 training images and 10,000 testing images. We further divide the training set into training and validation splits with 40,000 and 10,000 images, respectively.
We evaluate GroSS on a small 4-layer network, as well as VGG-16 (Simonyan and Zisserman, 2014). We detail the structure, initialisation, and training strategy of these networks in Appendix A.1. Here we provide the exact decomposition strategy for each of these networks.
In each case we decompose all convolutional layers in the network aside from the first. For the 4-layer network, group-sizes are set to all powers of 2 which do not exceed the number of input channels for that respective layer. This leads to a total of 252 configurations represented by our decomposition. In our VGG-16 network, we decompose each layer into 4 group-sizes: (1, 4, 16, 32). With 12 decomposed layers, this leads to a total of $4^{12}$ configurations represented by our decomposition of VGG-16.
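The size of each search space follows directly from the number of group-sizes available per decomposed layer. The short sketch below reproduces the count for the 4-layer network and computes the VGG-16 count under the reading that 12 of its 13 convolutional layers are decomposed. The helper name is illustrative.

```python
from math import prod


def power_of_two_group_sizes(in_channels):
    """All powers of 2 that do not exceed the number of input channels."""
    sizes, g = [], 1
    while g <= in_channels:
        sizes.append(g)
        g *= 2
    return sizes


# 4-layer network: convolutions output 32, 32, 64 and 64 channels and the
# first layer is not decomposed, so the decomposed layers have 32, 32 and
# 64 input channels.
small_net_in_channels = [32, 32, 64]
small_net_count = prod(
    len(power_of_two_group_sizes(c)) for c in small_net_in_channels
)

# VGG-16: 13 convolutional layers, all but the first decomposed, with the
# 4 group-sizes (1, 4, 16, 32) available in each.
vgg_count = 4 ** 12

print(small_net_count)  # 252
print(vgg_count)        # 16777216
```

The gap between 252 and roughly 16.8 million configurations is why the smaller network can be searched exhaustively while VGG-16 cannot.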
Our formulation of GroSS decomposition as a series of convolutional weight differences (expanded weights in the case of the grouped convolution), as detailed by Equation 6, means that we are able to use an off-the-shelf BTD framework (Kossaifi et al., 2019). For each group-size, we set the stopping criteria for BTD identically: when the decrease in approximation error between steps is below for the 4-layer network and for VGG-16, or steps have elapsed. We define approximation error as the Frobenius norm of the difference between the original tensor and the product of the BTD cores and factors, divided by the Frobenius norm of the original tensor. Again, we perform this decomposition 5 times.
4.3 Response Reconstruction
Once decomposed, we perform response reconstruction on the decomposed layers of the 4-layer network simultaneously. The response reconstruction training lasts 30 epochs, which we found to be sufficient for convergence. Again, the response reconstruction stage is optimised through SGD, with an initial learning rate of 0.0001 and momentum of 0.9. We decay the learning rate by a factor of 0.1 after 20 epochs. Inputs come from the CIFAR-10 train split. Target responses are generated by the full network. All parameters in the network aside from the decomposed pointwise layers and the grouped layers are frozen. The biases for the decomposed layers are also frozen. We do not perform response reconstruction on VGG-16, since we found it was not necessary.
After we have performed response reconstruction on the factorised network, we fine-tune on the classification task. For the 4-layer network, we tune for 150 epochs with an initial learning rate of 0.0001 and momentum of 0.9. We decay the learning rate by a factor of 0.1 after both 80 and 120 epochs. For VGG-16, we fine-tune with the same SGD parameters; however, we only train for 100 epochs, and decay the learning rate after 50 and 75 epochs. Data augmentation remains the same as when training the full network. Once more, all network parameters are frozen aside from the GroSS decomposition weights.
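The fine-tuning schedules above are plain step decays. As a hedged sketch, the helper below (a hypothetical name) returns the learning rate at a zero-indexed epoch, assuming that a decay "after 80 epochs" takes effect from epoch index 80 onwards.

```python
import math


def step_lr(initial_lr, milestones, epoch, gamma=0.1):
    """Learning rate at a zero-indexed epoch under step decay.

    ASSUMPTION: a milestone m means the decay applies from epoch index m.
    """
    decays = sum(1 for m in milestones if epoch >= m)
    return initial_lr * gamma ** decays


# GroSS fine-tuning schedules as described in the text.
four_layer = {"initial_lr": 1e-4, "milestones": (80, 120), "epochs": 150}
vgg16 = {"initial_lr": 1e-4, "milestones": (50, 75), "epochs": 100}

lr_start = step_lr(four_layer["initial_lr"], four_layer["milestones"], 0)
lr_end = step_lr(four_layer["initial_lr"], four_layer["milestones"], 149)
assert math.isclose(lr_start, 1e-4)
assert math.isclose(lr_end, 1e-6)
```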
4.5 Baseline Group-size Configurations
We also decompose the original network into 4 fixed configurations. These configurations are simple, with a single group-size selected for every decomposed layer: 1, 4, 16 or 32. They represent a baseline as a standard BTD network compression method, where these would be reasonable group-sizes with which to decompose the network. Importantly, they span almost the entirety of the possible performance envelopes available to our network: from the smallest depthwise compression, to nearly the largest. The accuracy and cost of these configurations are detailed in Table 2.
We perform response reconstruction, followed by fine-tuning, on each of these fixed configurations almost identically to the method outlined for our GroSS decomposition. However, the initial learning rates for response reconstruction and fine-tuning are set to 0.01 and 0.001, respectively. The fine-tuning for these fixed configurations in the 4-layer network lasts 100 epochs, with the learning rate scaled by a factor of 0.1 after 80 epochs. The schedule was shortened because the fixed configurations converged more quickly. With VGG-16, the schedule remains the same length as that used for the GroSS decomposition, but the initial learning rate is increased to 0.001.
5.1 Group-size Search
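Table 2: Accuracy and inference cost of the fixed baseline configurations of the 4-layer network.

| Configuration | Accuracy (%) | Mean accuracy (std) | MACs |
| --- | --- | --- | --- |
| 32 32 32 | 83.70 | 83.69 (0.14) | 4.28M |
| 16 16 16 | 82.84 | 82.83 (0.10) | 2.66M |
| 4 4 4 | 82.09 | 82.06 (0.07) | 1.44M |
| 1 1 1 | 81.76 | 81.74 (0.07) | 1.14M |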
| Configuration | MACs | GroSS | Ind. | ΔMACs | ΔAcc. |
| --- | --- | --- | --- | --- | --- |
| Baseline: 16 16 16 | 2.66M | 81.17 (0.16) | 82.83 (0.10) | - | - |
| 2 32 64 | 2.36M | 81.68 (0.10) | 83.84 (0.12) | 11.3% | 1.00 |
| 8 16 64 | 2.51M | 81.60 (0.24) | 83.83 (0.13) | 5.6% | 0.99 |
| Baseline: 4 4 4 | 1.44M | 80.39 (0.24) | 82.06 (0.07) | - | - |
| 2 4 8 | 1.33M | 80.86 (0.14) | 82.76 (0.11) | 7.6% | 0.70 |
| 2 8 8 | 1.41M | 80.82 (0.23) | 82.92 (0.09) | 2.1% | 0.86 |
| VBMF: 16 8 16 | 2.51M | 80.89 (0.19) | 83.33 (0.10) | - | - |
| 2 32 64 | 2.36M | 81.68 (0.10) | 83.84 (0.12) | 6.0% | 0.51 |
| 4 16 64 | 2.22M | 81.49 (0.26) | 83.93 (0.13) | 11.6% | 0.60 |

Table 3: Exhaustive search on our 4-layer network. Here we evaluate the top 2 configurations returned from the exhaustive search, with the baseline configuration setting the upper bound for inference cost. We list the mean accuracy and standard deviation from 5 runs of the configuration when trained with GroSS and when trained independently in the “GroSS” and the “Ind.” columns, respectively. After the new configurations have been fine-tuned, we compare their accuracy and inference cost to the base configuration of the search (ΔAcc. and ΔMACs). We provide the performance of the full uncompressed network for reference. The numbers in the configuration column correspond to the group-size of each decomposed layer within the network.
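Table 4: Breadth-first search on VGG-16.

| Configuration | MACs | GroSS | Ind. | ΔMACs | ΔAcc. |
| --- | --- | --- | --- | --- | --- |
| 1 32 16 32 16 16 32 32 32 1 4 16 | 26.19M | 91.30 | 91.56 | 9.8% | 0.12 |
| 1 4 1 32 1 1 16 1 4 16 32 4 | 8.84M | 90.90 | 91.28 | 5.9% | 0.28 |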
Since we have trained the entirety of the group-size configurations of the decomposed networks simultaneously, we have effectively removed the training burden from architecture search. Therefore, to determine a candidate configuration for the 4-layer network, we evaluate all 252 possible configurations. Specifically, we set the evaluation task of finding architectures which have higher accuracy, but lower inference cost, than their respective baseline configuration. We choose the 4 and 16 fixed configurations as baselines. To do so, we simply filter out any configuration with multiply-accumulates (MACs) above those of the respective target configuration. After filtering, we select the most accurate remaining configuration.
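The filter-then-select procedure can be sketched in a few lines. In the sketch below, `accuracy_fn` and `macs_fn` are hypothetical callables standing in for evaluating the GroSS-trained network at a configuration and for counting its multiply-accumulates. The toy versions used in the example are invented so that the code runs and are not measurements.

```python
from itertools import product


def exhaustive_search(group_sizes_per_layer, accuracy_fn, macs_fn, macs_budget):
    """Return the most accurate configuration whose MACs fit the budget."""
    best_config, best_acc = None, float("-inf")
    for config in product(*group_sizes_per_layer):
        if macs_fn(config) > macs_budget:
            continue  # filter out anything costlier than the target
        acc = accuracy_fn(config)
        if acc > best_acc:
            best_config, best_acc = config, acc
    return best_config, best_acc


# Search space of the 4-layer network: powers of 2 up to the input channels.
space = [
    [1, 2, 4, 8, 16, 32],
    [1, 2, 4, 8, 16, 32],
    [1, 2, 4, 8, 16, 32, 64],
]


def toy_macs(config):
    # Hypothetical cost model: cost grows with the group-size of every layer.
    return sum(config)


def toy_accuracy(config):
    # Hypothetical score in which later layers benefit more from large groups.
    return config[0] + 2 * config[1] + 3 * config[2]


budget = toy_macs((16, 16, 16))
best = exhaustive_search(space, toy_accuracy, toy_macs, budget)
print(best)  # ((8, 8, 32), 120)
```

With the real network, each call to `accuracy_fn` is a validation pass only, so no training is needed during the search itself.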
In the case of VGG-16, although we have also trained all configurations simultaneously, there are too many to evaluate exhaustively. We therefore implement a breadth-first search which returns a lower cost configuration with higher accuracy. The full details of how this search is performed are described in Appendix A.2.
Once a configuration has been selected, we decompose, response reconstruct and fine-tune exactly as described for the fixed configurations. This provides a fair comparison to the target configuration accuracy. The results of the exhaustive search on the 4-layer network are shown in Table 3, where the decomposition and fine-tuning are performed 5 times for each configuration and the mean and standard deviation are reported. Results for the breadth-first search on VGG-16 are shown in Table 4.
As can be seen in both Tables 3 and 4, the most accurate configurations for a particular performance bracket within the GroSS decomposition remain more accurate when decomposed and fine-tuned individually. In all cases tested, a significant increase in accuracy was found in a configuration with cheaper inference. Notably, a configuration found in Table 4 outperforms the original network before factorisation, while requiring an order of magnitude fewer operations.
For our 4-layer network, we also search against the rank estimation produced by Variational Bayesian Matrix Factorisation (VBMF) (Nakajima et al., 2013), which is used for one-shot rank selection by Kim et al. (2015). Again, we are able to find configurations with higher accuracy that require fewer operations. In fact, Kim et al. (2015) note that, although they achieve good network compression results with the ranks produced by VBMF, they had not investigated whether this method of rank selection was indeed optimal. With the results in Table 3, we demonstrate that GroSS is a valuable tool for this investigation, and that VBMF is not optimal in this case.
An interesting point to note is that ascending group sizes along layers seems to be preferable. This is not necessarily intuitive and further emphasises the need for search among grouped architectures.
5.2 Training Sampling Strategy
In this section, we evaluate the effect of the sampling distribution temperature on the performance profile of the model’s search space. The aim is to create a profile which most accurately recreates that produced by separate decomposition and fine-tuning of the many possible group-size configurations. To evaluate this, in Figure 1, we show the performance profiles produced by differing values of temperature within our sampling distribution. We train each sampling temperature 5 times and plot the mean accuracy at a particular configuration. The percentage change in accuracy visualised is computed as the difference between the fixed configuration accuracy and the accuracy obtained from GroSS when running at the same group-size configuration.
As intuition would suggest, higher temperatures are able to better recover accuracy for small group-size configurations, but lose significant accuracy at larger configurations. Conversely, low temperatures favour large group-sizes, but suffer with the smaller ones. Sampling with a temperature of provides the most balanced search space, as can be seen from its flat profile. We therefore use this temperature setting as the default in our fine-tuning stage.
Although this temperature setting may seem aggressive, when decomposing into multiple group sizes, each successive size in the series only aims to capture information not approximated by the previous order term. In Table 2, we show that even a depthwise factorisation of the network is able to recover almost all of the original accuracy (83.99 vs 81.74). Therefore, most of the energy of the approximation should be captured by the lowest rank term in the decomposition series. This provides intuition that the increasing rank terms in the series should be sampled with frequency that reflects the energy which they capture in the approximation, hence a high sampling temperature.
In this paper, we have presented GroSS, a series BTD factorisation which allows for the dynamic assignment and simultaneous training of differing numbers of groups within a layer. We have demonstrated how GroSS-decomposed layers can be combined to train an entire grouped convolution search space at once. We demonstrate the value of these configurations through an exhaustive search, which is made possible through the simultaneous training. In doing this, we take a step towards shifting the burden of architecture search away from decomposition and training.
We gratefully acknowledge the European Commission Project Multiple-actOrs Virtual Empathic CARegiver for the Elder (MoveCare) for financially supporting the authors for this work.
References
- Chen et al. (2018). Sharing residual units through collective tensor factorization to improve deep neural networks. In IJCAI, pp. 635–641.
- De Lathauwer (2008). Decompositions of a higher-order tensor in block terms, part II: definitions and uniqueness. SIAM Journal on Matrix Analysis and Applications 30 (3), pp. 1033–1066.
- Denton et al. (2014). Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pp. 1269–1277.
- Elthakeb et al. (2018). ReLeQ: a reinforcement learning approach for deep quantization of neural networks. arXiv preprint arXiv:1811.01704.
- He et al. (2015). Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034.
- He et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- Howard et al. (2017). MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
- Jaderberg et al. (2014). Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866.
- Kim et al. (2015). Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530.
- Kossaifi et al. (2019). TensorLy: tensor learning in Python. The Journal of Machine Learning Research 20 (1), pp. 925–930.
- Krizhevsky et al. (2014). The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/~kriz/cifar.html.
- Krizhevsky et al. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- Lebedev et al. (2014). Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553.
- Liu et al. (2018). DARTS: differentiable architecture search. arXiv preprint arXiv:1806.09055.
- Nakajima et al. (2013). Global analytic solution of fully-observed variational Bayesian matrix factorization. Journal of Machine Learning Research 14 (Jan), pp. 1–37.
- Sandler et al. (2018). MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
- Simonyan and Zisserman (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- Tan et al. (2019). MnasNet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828.
- Tan and Le (2019). EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946.
- Tucker (1966). Some mathematical notes on three-mode factor analysis. Psychometrika 31 (3), pp. 279–311.
- Vanhoucke et al. (2011). Improving the speed of neural networks on CPUs.
- Wang et al. (2018). DeepSearch: a fast image search framework for mobile devices. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14 (1), pp. 6.
- Wu et al. (2019). FBNet: hardware-aware efficient ConvNet design via differentiable neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10734–10742.
- Xie et al. (2017). Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500.
- Yunpeng et al. (2017). Sharing residual units through collective tensor factorization in deep neural networks. arXiv preprint arXiv:1703.02180.
- Zhang et al. (2018). ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856.
- Zoph et al. (2018). Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710.
Appendix A Appendix
A.1 Models: Definition, Initialisation and Training from Scratch
A.1.1 4-Layer Network
As the name might suggest, our 4-layer network has four convolutional layers, with channel dimensions of 32, 32, 64 and 64, followed by two fully-connected layers of size 256 and 10. The convolution layers all have kernel dimensions , a bias term, stride of 1 and padding of 1. Each convolution is followed by a ReLU layer and max-pooling. The first fully-connected layer has a ReLU applied to its output. Further, we use a dropout layer with dropout probability of 0.5 between the two fully-connected layers.
Convolutional weights in the network are initialised with the He initialisation (He et al., 2015) in the “fan out” mode with a ReLU non-linearity. The weights of the fully-connected layers are initialised with a zero-mean, 0.01-variance normal distribution. All bias terms in the network are initialised to 0.
The network is trained from scratch on the CIFAR-10 training split for 100 epochs using stochastic gradient descent (SGD). We adopt an initial learning rate of 0.1 and momentum of 0.9. The learning rate is decayed by a factor of 0.1 after 50 and 75 epochs. We apply the following data normalisation and augmentation strategy to the training images: images are padded with 2 pixels and a random crop is taken from the padded image; there is a probability of 0.5 that the image will be horizontally flipped; all images are normalised to a mean of with variance . We train the network 5 times and use the weights with median accuracy for further experiments.
A.1.2 VGG-16
We take the standard convolution structure of VGG-16 trained on ImageNet as in (Simonyan and Zisserman, 2014), but make some changes to the fully-connected structure for training and inference on CIFAR-10. Our implementation has 13 convolutional layers, identical to those in (Simonyan and Zisserman, 2014). These convolutional layers are then followed by a max-pooling layer and two fully-connected layers of size 512 and 10, respectively. A ReLU layer and dropout with probability of 0.5 are applied between the fully-connected layers.
Weights are initialised with the same strategy as for the 4-layer network. We train this full network on CIFAR-10 for a total of 200 epochs, again using stochastic gradient descent. The initial learning rate is set to 0.05 and momentum to 0.9. The learning rate is decayed by a factor of 0.1 after 100 and 150 epochs. We also apply the same data augmentation and normalisation strategy as in the training of our 4-layer network.
A.2 Breadth-first Search
We implement a rudimentary breadth-first search for searching on networks where an exhaustive search is not feasible due to the sheer number of possible configurations. Given a base configuration we aim to find an alternative configuration which offers cheaper inference, while still being more accurate.
We first randomly select a configuration which requires fewer operations for inference than the base configuration. We evaluate all neighbouring configurations of the currently selected configuration. We define neighbouring configurations as those which only require one layer to have its group-size incremented or decremented. For the case of our VGG-16 decomposition, this would be a single layer changing from a group-size of 16 to 32 or from 32 to 1, since we consider the 4 possible sizes as a cycle. Once all neighbours are evaluated, we select the neighbour with the highest accuracy that does not exceed the number of MACs of the base configuration for the next step. We repeat this step of evaluating and choosing a neighbour for a maximum of 25 steps, or until there are no more accurate neighbours not exceeding the cost of the base configuration.
This process is repeated 10 times, each time from a randomly selected initial configuration. The most accurate configuration from all of the 10 runs is considered the result of the search.
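The search described above can be sketched as follows. The callables `accuracy_fn` and `macs_fn` are hypothetical stand-ins for evaluating the GroSS-trained network and counting its operations, and the toy versions below are invented so that the example runs. Drawing the random starting configuration by rejection sampling is an assumption, since the text does not specify how it is chosen.

```python
import random


def neighbours(config, sizes):
    """Configurations differing in one layer by one step on the size cycle."""
    result = []
    for layer, g in enumerate(config):
        idx = sizes.index(g)
        for step in (-1, 1):
            new = list(config)
            new[layer] = sizes[(idx + step) % len(sizes)]  # cyclic wrap-around
            result.append(tuple(new))
    return result


def local_search(base, sizes, accuracy_fn, macs_fn, rng, max_steps=25, max_tries=10000):
    """One run: random cheaper start, then repeatedly move to the best neighbour."""
    budget = macs_fn(base)
    for _ in range(max_tries):  # ASSUMPTION: rejection sampling for the start
        current = tuple(rng.choice(sizes) for _ in base)
        if macs_fn(current) < budget:
            break
    else:
        raise RuntimeError("no cheaper starting configuration found")
    current_acc = accuracy_fn(current)
    for _ in range(max_steps):
        candidates = [c for c in neighbours(current, sizes) if macs_fn(c) <= budget]
        if not candidates:
            break
        best = max(candidates, key=accuracy_fn)
        best_acc = accuracy_fn(best)
        if best_acc <= current_acc:
            break  # no more accurate neighbour within the budget
        current, current_acc = best, best_acc
    return current, current_acc


def search(base, sizes, accuracy_fn, macs_fn, restarts=10, seed=0):
    """Repeat the local search and keep the most accurate result."""
    rng = random.Random(seed)
    runs = [local_search(base, sizes, accuracy_fn, macs_fn, rng) for _ in range(restarts)]
    return max(runs, key=lambda run: run[1])


# Toy stand-ins, invented for illustration only.
sizes = (1, 4, 16, 32)
base = (16, 16, 16, 16)
toy_macs = lambda config: sum(config)
toy_accuracy = lambda config: sum((i + 1) * g for i, g in enumerate(config))

found, found_acc = search(base, sizes, toy_accuracy, toy_macs)
```

Because every candidate is scored by evaluation alone, the cost of one run is bounded by 25 steps times twice the number of decomposed layers in validation passes.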