Large Scale Neural Architecture Search with Polyharmonic Splines

11/20/2020 ∙ by Ulrich Finkler, et al.

Neural Architecture Search (NAS) is a powerful tool to automatically design deep neural networks for many tasks, including image classification. Due to the significant computational burden of the search phase, most NAS methods have focused so far on small, balanced datasets. All attempts at conducting NAS at large scale have employed small proxy sets, and then transferred the learned architectures to larger datasets by replicating or stacking the searched cells. We propose a NAS method based on polyharmonic splines that can perform search directly on large scale, imbalanced target datasets. We demonstrate the effectiveness of our method on the ImageNet22K benchmark [16], which contains 14 million images distributed in a highly imbalanced manner over 21,841 categories. By exploring the search space of the ResNet [23] and Big-Little Net ResNeXt [11] architectures directly on ImageNet22K, our polyharmonic splines NAS method designed a model which achieved a top-1 accuracy of 40.03% on ImageNet22K, an absolute improvement of 3.13% over the previous state of the art obtained with a similar global batch size [15].


1 Introduction

Designing deep neural networks is a challenging task that requires a Subject Matter Expert (SME). One way of reducing the design burden while still obtaining a custom designed architecture for a given training problem is to use Neural Architecture Search (NAS) [19, 53]. Most NAS methods have been tried on small scale datasets like CIFAR10/100 [27], and the resulting architectures have subsequently been transferred to larger datasets like ImageNet1K [6, 32, 41]. This is mainly due to the heavy computational burden of the search phase, which, despite recent progress [31, 48], remains mostly intractable at large scale. No previous published work has explored NAS on large scale datasets like ImageNet22K directly, and very few have even attempted it on ImageNet1K itself.

Additionally, most NAS methods have been tested on datasets with a uniform distribution in terms of number of images per class. For example, CIFAR-100 has 100 classes containing 600 images each, and ImageNet1K has 1K classes with roughly 1K images each. While such uniform distributions facilitate learning, they are not reflective of real world conditions, where skew in terms of instances per label/class is very much evident.

In this paper we describe a novel method for NAS that can conduct search directly on large-scale skewed datasets using polyharmonic splines. First, we describe in detail how specific operations within a deep neural network (convolution, ReLU, average pooling, batch norm) are well suited to be approximated by a spline. Given a search space consisting of a set of d operations to optimize over, the search phase of our proposed NAS method based on polyharmonic splines requires only a small initial set of support points, followed by few additional points to complete the optimization. This means that the number of evaluations required in the search phase depends only on the number of operations under search, not on the number of possible values for each of those operations. This drastically reduces the computational cost of the search phase of our NAS approach, allowing it to be performed directly on large scale target datasets.

We demonstrate the effectiveness of our method on the ImageNet22K benchmark of 14 million images over 21,841 labels. The dataset has high skew in terms of images per label and distribution across semantic categories, as shown in Figure 1. By exploring the search space of the ResNet [23], Big-Little Net [11], and ResNeXt [45] architectures, the method designed a model which achieved a top-1 accuracy of 40.03% on ImageNet22K. This significantly improves the best published performance [15] of 36.9% on such a large-scale, imbalanced dataset.

The remainder of the paper is organized as follows. After Section 2, which covers related work, we describe in Section 3 the basic components of deep neural networks which are adopted as dimensions for our type of NAS and explain why the functions of such variables are well suited for interpolation. We describe the application of polyharmonic splines and the necessary conditions for the existence of a solution in Section 4, before discussing experimental results on ImageNet22K in Section 5. Finally we draw conclusions in Section 6.

2 Related Work

NAS. Neural Architecture Search (NAS), which aims at the automatic design of deep learning networks for various applications spanning from image classification [29, 38, 42, 44] to NLP [31, 35, 53], and from object detection [12, 22, 40] to semantic segmentation [28] and control tasks [21], has attracted intense attention in recent years. A number of NAS strategies have been proposed, including evolutionary methods [2, 30, 36, 52], reinforcement learning [29, 35, 51, 53], and gradient-based methods [10, 31, 34, 46, 47]. Efficiency on specific platforms has also been a very active area of research within the NAS umbrella, with the development of search strategies that optimize not only accuracy but also latency [8, 24, 38, 39, 42]. Methods based on micro-search of primary building cells [51, 54] and parameter sharing between child models [3, 7, 31, 35, 48] have also been recently proposed. Another widely used approach to address efficiency in NAS is to search for an architectural building block on a small dataset (e.g., CIFAR10) and then transfer the block to the larger target dataset by replicating and stacking it multiple times in order to increase network capacity according to the scale of the dataset.

Bayesian Methods. Our work is also related to Bayesian optimization, which often uses an acquisition function to select suitable candidates and has been widely used in NAS [18, 33, 26, 9]. The acquisition function measures the utility of a candidate by accounting for both the predicted response and the uncertainty in the prediction. While Bayesian optimization methods usually consume more computation to determine future points than other methods, this pays dividends when the evaluations are very expensive. Many different strategies have been studied, including classic Gaussian process-based optimization [37, 26], tree-based models [4], and random forests [20], to effectively optimize both neural network architectures and their hyperparameters. Several works also predict the validation accuracy [49], or the curve of validation accuracy with respect to training time [2].

3 Optimization Basis

Neural Architecture Search in its essence optimizes a function f(p), with p residing in a multidimensional space of network and training parameters and f(p) being a metric for the quality of the network, for example the top-1 accuracy on a validation data set. The parameters captured in p may be discrete. f might not be differentiable in terms of the parameters p. Importantly, f is generally very expensive to evaluate: for a given parameter set p, the network has to be trained and evaluated on the full training and validation sets.

The interpolation quality of methods that approximate scalar functions of multidimensional arguments from a limited set of support points depends on the properties of the interpolated function. Hence it is important to understand the properties of f and whether there is likely to exist a well behaved interpolating function through the potentially discrete support points.

This section discusses the algebraic structure of popular residual neural networks and establishes that the resulting f is with high probability well suited for interpolation with a polyharmonic spline for parameters that have a sufficiently large number of possible values. For parameters with only a few choices, using separate splines for the discrete cases can be more effective. Furthermore, the algebraic analysis provides insights into the sensitivity of global network parameters and hence into the choice of search spaces.

3.1 Neural Network Forward Pass as Piecewise Linear Function

The forward pass through a typical neural network consisting of convolutions, fully connected layers, ReLUs, batch norms and average pooling can be interpreted as a piecewise linear function that effectively transforms an input, for example an image as a set of values across three channels, into values in a feature space, which forms the input to a final linear classifier.

3.1.1 Fully Connected Layers

A single neuron with n inputs x = (x_1, ..., x_n) performs the operation y = σ(w · x + b) with an activation function σ. In the following we discuss the case σ(z) = max(0, z), i.e. ReLU.

The condition w · x + b = 0 defines a hyperplane in the n-dimensional input space of x, such that the ReLU output is positive for locations above the hyperplane and zero for locations on or below the hyperplane. The output of the neuron without activation is in fact the (scaled) distance of x from the hyperplane. Multiple neurons with inputs from the same input space define a set of hyperplanes that partition the input space into a set of different regions. In each region, the output is determined by a set of linear equations.

For example, in a two-dimensional input space, two neurons n1 and n2 with linearly independent weight vectors and ReLU activation create four regions.

A second layer of similar structure, a linear operation followed by ReLU activation, potentially partitions each of the segments of the input space again. In the following we discuss how other layers, namely convolution, batch norm and average pooling, are operations of analogous structure, and hence the entire set of operations leading to the final linear classifier is a piecewise linear function that maps hyper-volumes in the input space into the feature space. For a trained neural network, this piecewise linear function is optimized to create linearly separable clusters of mapped training points in the feature space.
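To make the piecewise linear picture concrete, the following small sketch (an illustration, not code from the paper; weights and names are arbitrary) builds a two-layer ReLU network in NumPy, groups sampled inputs by their ReLU activation pattern, and checks that within one pattern the network acts as a single affine map.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny two-layer ReLU network: R^2 -> R^3 -> R^2 (hypothetical weights).
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)

def forward(x):
    h = np.maximum(W1 @ x + b1, 0.0)        # ReLU partitions the input space
    return W2 @ h + b2

def pattern(x):
    # Which side of each hyperplane w_i . x + b_i = 0 the point falls on.
    return tuple((W1 @ x + b1 > 0).astype(int))

# Sample points and group them by activation pattern (= linear region).
xs = rng.uniform(-3, 3, size=(2000, 2))
regions = {}
for x in xs:
    regions.setdefault(pattern(x), []).append(x)
print("distinct linear regions found:", len(regions))

# Within one region the map is affine: f(a) - f(b) = J (a - b) with a fixed Jacobian.
pts = regions[max(regions, key=lambda k: len(regions[k]))]
a, b = pts[0], pts[1]
D = np.diag([float(d) for d in pattern(a)])     # active-ReLU mask for this region
J = W2 @ D @ W1                                 # region-wise Jacobian
assert np.allclose(forward(a) - forward(b), J @ (a - b), atol=1e-10)
print("affine within a region: verified")
```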

3.1.2 Convolution

Convolution in a neural network is

y_f(i, j) = Σ_c Σ_{m,n} w_{f,c,m,n} · x_{c, i+m, j+n} + b_f,

wherein c identifies the input channel, e.g. a color in an RGB image or a filter from a prior convolution. The tuple (i, j) is the location within the input at which a patch of the size of a filter is scanned. f identifies the filter, i.e. the output channel of the convolution. m and n iterate through the positions within the input patch and filter. Hence, each tensor w_f is of size k_w × k_h × C_in, the filter width, height and number of input channels.

We can apply an index transformation which flattens the index triple (c, m, n) into a single index k, turning the filter w_f into a vector and the input patch into a vector x. For a fixed position (i, j) of the convolution we can drop the indices i and j to obtain y_f = w_f · x + b_f. The vectors x are points in a vector space of dimension k_w · k_h · C_in. I.e., on one input patch a convolution has the same algebraic structure as a neuron. Average-pooling is a convolution with a filter whose elements are all identical.

Note that these operations can be expressed in the form of a fully connected layer, with a set of neurons that corresponds to the filter channels and is replicated multiple times with different subsets of the input weights set to zero to express the selection of patches in the input.
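The patch-level equivalence between a convolution and a fully connected neuron can be checked directly. The sketch below is illustrative (it uses PyTorch's unfold, which is not mentioned in the paper): it flattens every k_w · k_h · C_in patch into a vector and reproduces conv2d as a single matrix multiplication.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, C_in, H, W = 1, 3, 8, 8
C_out, k = 4, 3

x = torch.randn(N, C_in, H, W)
weight = torch.randn(C_out, C_in, k, k)
bias = torch.randn(C_out)

# Reference: ordinary convolution.
ref = F.conv2d(x, weight, bias, stride=1, padding=0)

# im2col view: every k*k*C_in patch becomes a column vector ...
patches = F.unfold(x, kernel_size=k)               # (N, C_in*k*k, L)
# ... and the filters become the rows of a weight matrix, exactly like neurons.
w_mat = weight.view(C_out, -1)                      # (C_out, C_in*k*k)
out = w_mat @ patches + bias.view(1, C_out, 1)      # (N, C_out, L)
out = out.view(N, C_out, H - k + 1, W - k + 1)

print(torch.allclose(ref, out, atol=1e-5))          # True: conv == matmul on patches
```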

3.1.3 Batch Norm

At first glance, a batch norm layer does not look like a linear function. But a closer look at the details reveals that the running means and variances are treated as constants in backpropagation; they do not have gradients. The batch statistics (mean and variance across the batch, width and height for a channel) are estimates for the constants that are used during inference on the trained model, which are in themselves estimates for the mean and variance across the entire training set. As the weights change during training, the estimates for running means and variances follow.

PyTorch uses Bessel's correction, an unbiased estimate of the population variance from a finite sample, for BatchNorm2d.

With x as an input tensor of form NCHW, n = N · H · W being the number of elements in a 'channel slice' and t as the iteration number, mean and unbiased variance for a 'channel slice' are

μ_t = (1/n) Σ_i x_i,   σ_t² = (1/(n − 1)) Σ_i (x_i − μ_t)².

With running mean and variance updated as

m_t = (1 − α) m_{t−1} + α μ_t,   v_t = (1 − α) v_{t−1} + α σ_t²,

batch norm in training mode (using μ_t, σ_t²) and eval mode (using m_t, v_t) results in

y = γ (x − μ) / √(σ² + ε) + β.

Hence, with μ and σ² treated as constants, batch norm amounts to a linear (affine) transformation.
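As a concrete check of this linearity argument, the following sketch (an illustration assuming PyTorch's default BatchNorm2d semantics, not code from the paper) folds the running statistics of a BatchNorm2d module in eval mode into a per-channel scale and offset and verifies that the layer acts as a plain affine map.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
C = 8
bn = nn.BatchNorm2d(C)

# Run a few batches in training mode so the running statistics are non-trivial.
bn.train()
for _ in range(5):
    bn(torch.randn(16, C, 7, 7) * 2.0 + 0.5)

# In eval mode the running mean/var are constants, so batch norm is per-channel affine:
#   y = gamma * (x - mu) / sqrt(var + eps) + beta  =  a * x + b
bn.eval()
a = bn.weight / torch.sqrt(bn.running_var + bn.eps)
b = bn.bias - bn.running_mean * a

x = torch.randn(4, C, 7, 7)
y_bn = bn(x)
y_affine = a.view(1, C, 1, 1) * x + b.view(1, C, 1, 1)
print(torch.allclose(y_bn, y_affine, atol=1e-6))    # True
```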

3.2 Base Transformations, Projections and Convolutional Networks

The rank factorization of an m × n matrix A of rank r expresses A as a product A = C F of a full column rank matrix C and a full row rank matrix F. For a square matrix A of full rank, the equation y = A x is a transformation of the vector x from one basis of the space to another.

Hence, conceptually an m × n matrix of rank r maps a vector from a space with n dimensions into a subspace of r dimensions and from there into a space with m dimensions. If r < m, then we can express all of the mapped m-dimensional vectors in terms of an r-dimensional basis. Thus, a trained network whose matrices are not close to full rank is inefficient.

Furthermore, we can interpret a convolution and subsequent activation on a single 'patch' as a transformation from a (k_w · k_h · C_in)-dimensional space to a C_out-dimensional space, k_w and k_h being the width and height of the filters and C_in and C_out being the numbers of input channels and output channels, respectively.

Assuming that the combined convolution matrix has close to full rank in an efficient network, the activations resulting from a convolution can be interpreted as the coefficients in an expression that approximates the input tensor in terms of a set of functions defined by the filter tensors, a form of compression.

The 'approximation' has to be of limited loss, since the correlation between classification results and inputs has to be preserved. In some sense, training optimizes the approximation such that it aids in transforming inputs into linearly separable points in the feature space, while at the same time allowing the inputs to be approximated with limited loss of information.

Based on the above observations, the essential part of the functionality of the convolutional cone leading to a linear classifier in a neural network is provided by mapping points from one space to another to create a piecewise linear function whose output in the feature space is highly linearly separable with respect to the hyperplanes in the final classification layer. What primarily matters is how many linear pieces the network provides relative to the number of hyperplanes, i.e. classes, in the final classifier.

A second aspect is trainability, i.e. how easily a network converges, which suggests focusing on residual networks with batch norm. While the specific structure of a network should have relatively little influence on final accuracy, it may have a significant impact on the network size that is required to generate a competitive piecewise linear approximation. This led us to focus architecture search on different families of residual networks, specifically ResNet [23], ResNeXt [43] and BLResNeXt [11]. Typical residual networks have 4 groups of residual blocks with different numbers of layers. For example, ResNet50 has groups of input width 64, 128, 256, 512 channels with an expansion factor of 4 and depths of 3, 4, 6, 3. ResNet18, on the other hand, has the same widths, but an expansion factor of 1 and depths of 2, 2, 2, 2.

The number of classes, i.e. the number of hyperplanes in the feature space, provides guidance on the optimal number of dimensions for the feature space. In an n-dimensional space, we can place n linearly independent hyperplanes such that all volumes bounded by the hyperplanes are infinite. If we place more hyperplanes, some volumes have to be limited to a finite size that depends on the placement of the hyperplanes. Hence, a dimension of the feature space that is larger than the number of classes should be beneficial. Note that for ResNet50 the dimension of the feature space is 2048 and for ResNet18 it is 512.

If the leading layers 'choke' the dimensions of the vector spaces leading to the feature space too much, the analysis suggests that this will negatively influence the accuracy of the network. For example, the initial 7×7 convolution in ResNet18 has an input space of 7 · 7 · 3 = 147 dimensions and maps this to 64 dimensions. A 3×3 convolution from 64 channels to 128 channels maps 576 dimensions to 128, i.e. creates a significant reduction. Furthermore, the number of activations drops by a factor of 4 through every group.

As an experimental test we designed a simple neural network r18U based on ResNet18 whose behavior should be closer to ResNet50 on ImageNet1K by adjusting only the block widths. We eliminated the max-pool layer and compensated for its reduction by increasing the stride of the initial convolution to 4, based on the hypothesis that the non-linear max-pool layer is not essential for the functionality of the neural network. Indeed, this did not negatively impact final accuracy but appeared to improve convergence. We increased the numbers of channels of the initial convolution and the subsequent layer groups from 64, 64, 128, 256, 512 to 128, 256, 512, 1024, 2048. Note that the dimension of the feature space matches that of ResNet50 and the output dimension of the initial convolution almost matches its input dimension. The network r18U achieved a validation accuracy of over 76.5%, compared to ResNet50's 75.1%, in our PyTorch setup. Network r18U is less efficient than ResNet50: it has significantly more parameters. Larger depth increases the potential number of pieces in the piecewise linear function for a similar number of neurons.
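For reference, a minimal sketch of the stem modification described above, assuming torchvision's stock resnet18 and the stride value of 4 inferred from the text; the widened channel configuration of r18U is not shown, since it requires rebuilding the layer groups, which torchvision does not expose as a constructor argument.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Start from a stock ResNet18 (a stand-in for the baseline used in the paper).
model = resnet18(num_classes=1000)

# Drop the max-pool and compensate by increasing the stride of the initial
# 7x7 convolution from 2 to 4, preserving the overall downsampling factor.
model.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3, bias=False)
model.maxpool = nn.Identity()

# Sanity check: the network still accepts standard 224x224 inputs.
x = torch.randn(1, 3, 224, 224)
print(model(x).shape)   # torch.Size([1, 1000])
```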

Our algebraic analysis and experimental results, as well as theory (the two-layer theorem), suggest that roughly the same final accuracy can be achieved by many network families; the distinguishing factor is the required model size. Imposing structure via convolutions, higher depth, residual blocks and their substructures increases the granularity of the piecewise linear function for a network with a given number of neurons.

4 Polyharmonic Spline Neural Architecture Search

Given the insights from the prior section, we developed a method to efficiently investigate a high dimensional parameter search space. We analyzed the search space of a single network family like ResNet or ResNeXt. Widths and depths of blocks already provide 8 dimensions within any deep network family search space. On top of those, parameters internal to the blocks, for example cardinality in ResNeXt, would increase the number of dimensions even further, but based on the algebraic interpretation they were likely to have less impact on final accuracy.

Three key issues make NAS challenging for larger classification problems such as ImageNet1k or ImageNet22k: search (and evaluation) time, memory consumption, and probability of getting stuck in local minima.

First, in order to obtain a measure for the final accuracy of a model, it has to be trained nearly to saturation, that is, over a sufficiently large number of epochs to ensure that all the variants tested reach close to their full potential. In our experiments, early stopping proved to be misleading, since smaller networks tend to initially converge faster and the crossover point was near the end of a complete learning rate schedule. Training ImageNet1K for 90 epochs and ImageNet22K for 60 epochs (values experimentally shown to provide meaningful accuracies) requires large amounts of computation. Hence, as in most NAS approaches, minimizing the number of evaluations is critical.

Second, the algebraic observations suggested that the better performing networks are large, such that GPU memory limitations become a factor. "Supernet" approaches such as FBNet [42] would not fit even two variants of the larger and better performing networks into a 16 GB GPU. In fact, the most accurate networks on ImageNet22K tend to use most of the memory of even a 32 GB GPU.

Third, the hypersurfaces defined by the parameter dimensions, with the achieved final accuracy as the evaluation axis, may have local minima, i.e. gradient descent based methods are suboptimal for finding minima in the accuracy hypersurface.

Importantly, the algebraic investigation suggests that small changes to parameters such as the number of filters or number of layers in a "convolution group" or cardinalities of elements of basic blocks within a "network family" like ResNeXt or BLResNeXt will result in small changes in the final accuracy, since they cause only small changes in the number of pieces in the piecewise linear function and in the equations that govern the pieces. Our experimental results support this hypothesis. Hence, it is appropriate to assume that there exists a continuous function over the parameter space that passes through a set of reasonably spaced support points over the discrete parameters and the resulting validation accuracy after training the network (close) to saturation.

4.1 Polyharmonic Splines

Polyharmonic splines are an interesting option to define a function that passes through such a set of support points. Given a set of N support points (c_i, f(c_i)) for a function f: R^d → R, a polyharmonic spline interpolation has the form

s(x) = Σ_{i=1}^{N} w_i φ(||x − c_i||) + p(x),

where ||·|| is the Euclidean norm in R^d, p(x) is a real valued polynomial in d variables [25], and φ is a radial basis function.

The polyharmonic radial basis functions are solutions to a polyharmonic equation, a partial differential equation of the form Δ^k φ = 0. Polyharmonic splines minimize the "curvature" of the hypersurface that passes through all the support points, hence they minimize oscillations while providing a smooth, continuous surface. This interpolation expression is differentiable. Thus, if we assume there exists a continuous and differentiable function f that correctly models the behavior of the system that generates the support points, then there exists a polyharmonic spline such that the integral of the difference between the spline and f in an interpolation d-box vanishes as the number of support points increases.

We chose a cubic radial basis function, φ(r) = r³. Solving the equation system to determine the coefficients for the polyharmonic spline proved to be sensitive to numerical instability. Hence we chose a pivoting Householder QR decomposition (the Eigen implementation), trading performance for numerical stability.
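A minimal NumPy sketch of this interpolation, assuming the cubic basis φ(r) = r³ with a linear polynomial tail (the standard polyharmonic spline formulation; function and variable names are illustrative):

```python
import numpy as np

def fit_polyharmonic_spline(X, f):
    """Fit s(x) = sum_i w_i * phi(||x - x_i||) + c0 + c . x with phi(r) = r^3."""
    n, d = X.shape
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    A = r ** 3                                   # RBF block
    P = np.hstack([np.ones((n, 1)), X])          # linear polynomial block
    # Saddle-point system [[A, P], [P^T, 0]] [w; c] = [f; 0]
    M = np.block([[A, P], [P.T, np.zeros((d + 1, d + 1))]])
    rhs = np.concatenate([f, np.zeros(d + 1)])
    # SVD-based least squares is used here for robustness; the paper uses
    # Eigen's pivoting Householder QR instead.
    sol, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return sol[:n], sol[n:]

def eval_spline(x, X, w, c):
    r = np.linalg.norm(X - x, axis=-1)
    return w @ (r ** 3) + c[0] + c[1:] @ x

# Toy usage: interpolate a smooth 2-D function from a handful of support points.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(9, 2))
f = np.sin(3 * X[:, 0]) + X[:, 1] ** 2
w, c = fit_polyharmonic_spline(X, f)
print(max(abs(eval_spline(x, X, w, c) - y) for x, y in zip(X, f)))  # ~0 at support points
```

At the support points the fitted spline reproduces the measured values exactly, up to the numerical error of the solver.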

In order for the linear equation system that determines the spline coefficients to have a solution, the matrix formed by the support points has to have full rank. Since interpolation accuracy is in general higher than extrapolation accuracy, a minimal set of support points is needed that spans a d-box in the d-dimensional space in which the spline interpolates the approximated function and that leads to a support point matrix of full rank. We chose the following set of points for d parameters:

  • p_max, the vector of the largest usable or legal parameter values in the d-box for all dimensions.

  • p_min, the vector of the smallest usable or legal parameter values.

  • p_c, a point near the center of the d-box.

  • p_min with the i-th component replaced by its largest legal value, for all i.

  • p_max with the i-th component replaced by its smallest legal value, for all i.

For two dimensions, these are the corners of a square and its center. For three dimensions, these are the corners of a cube and its center. With these support points, a total of 2d + 3 points is sufficient to span an initial cubic spline that interpolates within a d-box (the construction is sketched in code below). Additional support points can be added to improve the quality of the interpolation. To maintain numerical stability, new support points have to maintain a certain distance from previous support points. We eliminated splines that were numerically unstable by checking the solution against an error margin for floating point computation errors.
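The initial support point construction can be written down in a few lines; the sketch below is illustrative and assumes the 2d + 3 scheme described above.

```python
import numpy as np

def initial_support_points(p_min, p_max):
    """2d+3 initial points spanning a d-box: min, max, center, and the
    min/max corners with one coordinate flipped (duplicates removed)."""
    p_min, p_max = np.asarray(p_min, float), np.asarray(p_max, float)
    d = len(p_min)
    pts = [p_max.copy(), p_min.copy(), (p_min + p_max) / 2.0]
    for i in range(d):
        lo = p_min.copy(); lo[i] = p_max[i]      # smallest corner, one dim at max
        hi = p_max.copy(); hi[i] = p_min[i]      # largest corner, one dim at min
        pts += [lo, hi]
    return np.unique(np.array(pts), axis=0)

# Example: the 5-dimensional ResNet18 search box from Table 1.
pts = initial_support_points([32, 32, 32, 32, 32], [150, 300, 600, 1200, 2400])
print(len(pts))   # 13 support points for d = 5
```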

4.2 Minima Search

Since gradient descent is susceptible to local minima, we used a hierarchical Monte Carlo approach to search for a minimum in the interpolation d-box, sampling with a low-discrepancy sequence, specifically a Halton sequence. For a given d-box, reducing the average distance between sample points by a factor of two requires an exponential increase in the number of sample points, i.e. progress in terms of finding better minima slows dramatically as more sample points are added.

As the average distance between sampling points decreases relative to the distance between support points, the curvature minimizing property of the polyharmonic spline reduces the risk of missing minima. Hence a hierarchical approach, iteratively shrinking the d-box around the minimum found so far, significantly reduces the time to determine a good approximation for the minimum within the d-box. The search was stopped if no further improvement within floating point accuracy was achieved within a certain compute budget.
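A sketch of the hierarchical Halton search (illustrative; the radical-inverse generator, the number of levels and the shrink factor are assumptions, not values from the paper):

```python
import numpy as np

def halton(n, d):
    """First n points of a d-dimensional Halton sequence via radical inverse."""
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29][:d]
    out = np.empty((n, d))
    for j, b in enumerate(primes):
        for i in range(1, n + 1):
            f, r, k = 1.0, 0.0, i
            while k > 0:
                f /= b
                r += f * (k % b)
                k //= b
            out[i - 1, j] = r
    return out

def spline_minimum(predict, lo, hi, levels=4, samples=2048, shrink=0.5):
    """Hierarchical Monte Carlo search: sample a Halton sequence in the current
    d-box, keep the best point, then shrink the box around it and repeat."""
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    box_lo, box_hi = lo.copy(), hi.copy()
    best_x, best_y = None, np.inf
    for _ in range(levels):
        X = box_lo + halton(samples, len(lo)) * (box_hi - box_lo)
        ys = np.array([predict(x) for x in X])
        i = int(np.argmin(ys))
        if ys[i] < best_y:
            best_x, best_y = X[i], ys[i]
        half = shrink * (box_hi - box_lo) / 2.0
        box_lo = np.clip(best_x - half, lo, hi)   # shrink the d-box around the minimum
        box_hi = np.clip(best_x + half, lo, hi)
    return best_x, best_y

# Toy usage with an analytic surrogate in place of the fitted spline.
x_min, y_min = spline_minimum(lambda x: np.sum((x - 0.3) ** 2), [0, 0, 0], [1, 1, 1])
print(x_min.round(3), round(float(y_min), 6))
```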

Note that in our architecture search a potentially bad estimate for a minimum does not limit progress of the search, as described in Algorithm 1. It merely leads to a potentially suboptimal measurement point and hence to potentially requiring more measurement points to reach convergence between prediction and measurement.

Input:
(a) Initial set of support points P = {p_1, ..., p_k}, where d is the number of dimensions of each point p_i;
(b) Measured function values f(p_i) for the initial points;
(c) Minimum difference ε between prediction s(p) and measurement f(p);
begin
      Solve equation system for spline s over initial support points P;
      Compute p* = argmin s(p) via nested Monte Carlo sampling;
      Compute measurement f(p*) by training the network to saturation;
      Δ ← |s(p*) − f(p*)|;
      while Δ > ε do
            Add (p*, f(p*)) to the set of support points P;
            Solve equation system for spline s over all support points;
            Compute p* = argmin s(p) via nested MCS;
            Compute measurement f(p*) by training the network;
            if |s(p*) − f(p*)| < Δ then
                  Δ ← |s(p*) − f(p*)|;
            end if
      end while
end
Result: Optimal parameter configuration p*
Algorithm 1 Polyharmonic Splines NAS
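Putting the pieces together, the following compact sketch mirrors the outer loop of Algorithm 1. It assumes the illustrative helpers fit_polyharmonic_spline, eval_spline, initial_support_points and spline_minimum from the earlier sketches are in scope; train_and_measure is a hypothetical stand-in for training a network to saturation and returning its top-1 accuracy, and fitting the spline to the error (100 − accuracy) is a choice made here for the sketch.

```python
import numpy as np

def polyharmonic_spline_nas(train_and_measure, p_min, p_max, eps=0.5, max_iters=10):
    """Iterate: fit spline -> find predicted optimum -> measure -> add support point,
    until prediction and measurement agree to within eps (accuracy points)."""
    X = initial_support_points(p_min, p_max)
    # Fit the spline to the error (100 - top-1 accuracy), so the search is a minimization.
    y = np.array([100.0 - train_and_measure(p) for p in X])
    best = None
    for _ in range(max_iters):
        w, c = fit_polyharmonic_spline(X, y)
        p_star, pred = spline_minimum(lambda p: eval_spline(p, X, w, c), p_min, p_max)
        # In practice p_star would be rounded to a legal integer configuration here.
        meas = 100.0 - train_and_measure(p_star)
        best = (p_star, 100.0 - meas)
        if abs(pred - meas) < eps:               # prediction matches measurement
            break
        X = np.vstack([X, p_star])               # add the new support point
        y = np.append(y, meas)
    return best

# Usage sketch: a cheap synthetic "accuracy" stands in for real training runs.
fake_acc = lambda p: 40.0 - 1e-6 * np.sum((np.asarray(p) - [120, 250, 500, 1100, 2300]) ** 2)
print(polyharmonic_spline_nas(fake_acc, [32] * 5, [150, 300, 600, 1200, 2400]))
```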

5 Experiments

In our experiments we investigated a couple of micro (within basic block structure) and multiple macro (overall network dimensions) parameters for two network families, namely ResNet [23] and Big-Little ResNeXt [11]. Search was performed directly on the target dataset ImageNet22K. Training experiments were conducted on the Summit supercomputer at Oak Ridge National Laboratory, using 34 nodes each equipped with 6 Nvidia Volta V100 GPUs with 16 GB of memory each, for a total of 204 GPUs. All GPUs in a node have NVLink connections, and the nodes are connected by Mellanox EDR 100G Infiniband and have access to shared GPFS storage. Software used included PowerAI Vision 1.6, NCCL and PyTorch, using its distributed data parallel package. Batch size was set at 32 per GPU, for a total batch size of 6,528. The initial learning rate was set at 0.1 and followed a polynomial decay; the optimizer was SGD with momentum 0.9 and weight decay 0.0001.
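A hedged sketch of the per-process training configuration described above (DistributedDataParallel, 32 images per GPU, SGD with momentum 0.9 and weight decay 0.0001, initial learning rate 0.1 with polynomial decay); the decay power and data-loading details are assumptions, and the model/dataset arguments are hypothetical placeholders.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def setup_training(model, dataset, epochs=60, base_lr=0.1, power=2.0):
    dist.init_process_group("nccl")                       # one process per GPU
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = DDP(model.cuda(), device_ids=[local_rank])
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler,   # 32 per GPU
                        num_workers=8, pin_memory=True)

    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=1e-4)
    # Polynomial decay of the learning rate over the full schedule.
    total_steps = epochs * len(loader)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: (1.0 - step / total_steps) ** power)
    return model, loader, sampler, optimizer, scheduler
```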

5.1 Experimental Dataset: ImageNet22k

ImageNet22K contains 14 million images representing 21,841 categories organized in a hierarchy derived from WordNet and including top level concepts such as sport, garment, fungus, plant, animal, furniture, food, person, nature, music, fruit, fabric, tool, etc. Figure 1 shows the top level categories of the ImageNet22K hierarchy and their relative sizes in terms of number of images. The imbalance across top level semantic categories is quite evident. For example, animal is represented three times as much as person, and artifacts dominate the distribution with very little representation for activities. The skew is also significant in terms of number of examples per category: on average there are roughly 650 images per class, but the count ranges widely from the rarest to the most populated categories. The scale and imbalance of ImageNet22K make it particularly challenging even for human designed architectures, with a limited set of published results [13, 14, 15, 50], as opposed to the smaller and balanced ImageNet1K version. Recently, some works have used ImageNet22K as pre-training for ImageNet1K evaluation [1]. To the best of our knowledge, this is the first work to perform NAS directly at the scale of ImageNet22K, not by transfer from smaller proxy sets.

Following standard practice [5, 13, 17], we randomly partitioned the ImageNet22K dataset into 50% training and 50% validation, consisting of approximately 7 million images each. We split the data such that the number of images per label is approximately equal in both sets. In the cases where a label had an odd number of images, we put the extra image in the validation set.
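The per-label 50/50 split can be reproduced with a few lines; the sketch below is illustrative and operates on a hypothetical list of (image_path, label) pairs, sending the extra image of odd-sized labels to the validation set as described.

```python
import random
from collections import defaultdict

def split_per_label(samples, seed=0):
    """samples: list of (image_path, label). Returns (train, val) with a
    ~50/50 split per label; the extra image of odd-sized labels goes to val."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append(path)

    train, val = [], []
    for label, paths in by_label.items():
        rng.shuffle(paths)
        half = len(paths) // 2                      # floor: the extra image goes to val
        train += [(p, label) for p in paths[:half]]
        val += [(p, label) for p in paths[half:]]
    return train, val
```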

Figure 1: ImageNet22K taxonomy of higher level labels (left) and distribution of images per label (right).

5.2 Results

We applied our polyharmonic spline NAS method to the ResNet18 and BLResNext50 architecture search spaces. For each point in the spline, training and evaluation were conducted on half of the ImageNet22K dataset. Once the optimal configuration was determined by our search, that network was trained and evaluated on the full ImageNet22K dataset.

Point Type Dimensions Top-1 Accuracy %
conv c1 group g1 group g2 group g3 group g4 Measured Predicted
Initial 150 300 600 1200 2400 38.03 -
32 32 32 32 32 10.33 -
150 32 32 32 32 10.54 -
32 300 32 32 32 15.46 -
32 32 600 32 32 19.45 -
32 32 32 1200 32 23.15 -
32 32 32 32 2400 31.25 -
75 150 300 600 1200 35.46 -
32 300 600 1200 2400 38.03 -
150 32 600 1200 2400 37.15 -
150 300 32 1200 2400 36.76 -
150 300 600 32 2400 34.40 -
150 300 600 1200 32 29.04 -
Incremental 50 116 330 1200 2400 37.31 41.02
80 208 475 736 2400 - 38.92
Table 1: Simplified ResNet18 with 15 points over 5 dimensions of search including 13 points for initial spline and 2 incremental points

5.2.1 ResNet18 Search Space.

For the ResNet18 architecture we removed the max-pool layer and increased the stride of the first convolution to 4. We performed search over d = 5 dimensions: the number of filters in the first convolution (c1) and in the four groups of layers (g1, g2, g3, g4). The ranges for each dimension, as spanned by the support points in Table 1, are c1 from 32 to 150, g1 from 32 to 300, g2 from 32 to 600, g3 from 32 to 1200 and g4 from 32 to 2400. Therefore the possible combinations spanning the entire search space number roughly 3 trillion. Using our spline NAS approach allows us to start from only an initial set of 13 support points (2d + 3 for d = 5, as explained in Section 4.1) and then iterate from there with few additional points. This results in a tremendous computational gain for our NAS method, which allows search to be performed directly on large scale datasets, as opposed to traditional NAS approaches needing to resort to small proxy sets.

Table 1 shows the coordinates of the initial 13 support points used to span the first spline and of two incremental points. An incremental point is a measurement at the optimum estimated by the prior set of points. The first prediction, based on the minimal point set, suggested a reasonable point, but its estimate of 41.02% is clearly very optimistic compared to the measured 37.31%. Adding a measurement at this point produces a new incremental point whose prediction of 38.92% is close to the best support point.

Figure 2 shows projections of the polyharmonic spline derived from the measured points shown in Table 1. The interpolation suggests that the g4 parameter is the dominant limiting factor, since it shows the steepest slope at the edge of the "box". This matches the algebraic interpretation: 22K classes could benefit from a higher dimensional feature space. The earlier layers/groups show maxima within the box when the later layers are at their maximum values, indicating that once the degrees of freedom of a later part of the network are saturated, adding more capacity to earlier layers becomes counterproductive.

Hence, we measured a configuration that roughly projects out the ratios of the optimum from the spline, with some adjustments to fit into available GPU memory. With all other hyperparameters remaining equal, this improved the top-1 accuracy. Since this network was significantly larger and hence may benefit from a different learning rate schedule, we performed 2 epochs of fine tuning, which increased the accuracy further.

Being limited by GPU memory and compute resources, we performed one more experiment to increase the number of pieces in the piecewise linear function without increasing the model size significantly, by replacing the stride-2 convolution at the beginning with a basic block. A basic block consists of two convolutions, one of which has stride 2, which has a similar aperture and the same reduction of the activations. Indeed, this yielded the expected further improvement in top-1 accuracy with fine-tuning.

Figure 2: Top-1 accuracy on ImageNet22K (half) for the ResNet18 architecture, obtained by projecting the search space to two dimensions. For each projection, three parameters are fixed and performance is inspected over the range of the remaining pair of parameters.

5.2.2 BLResNext50 Search Space

Big-Little Net [11] is a mechanism that splits each block within a deep network into multiple paths through which different resolutions of an input are passed. In our search space we considered two paths. The first, called the big branch, through which a downsized (by half) version of the input is passed, contains the full complement of kernels and layers. The second, called the little branch, processes the input in its original resolution, but contains a fraction 1/α of the kernels and 1/β of the layers. The big-little version of ResNeXt [11] with a depth of 50 is deeper and offers more alternate paths through groups than the basic ResNet18, and hence theoretically allows for more pieces in the piecewise linear function relative to the number of network parameters. Thus, this family of networks promises to achieve higher accuracy within a given GPU memory capacity, which was clearly the limiting factor for the ResNet18 case.

We defined our search space as spanning only three variables, hence reducing the number of measurements needed for optimization. We chose as parameters the α and β parameters of the big-little structure and a multiplier w to the group width, which is applied uniformly across the network. The original group widths for BLResNext50 are 64, 128, 256, 512. With a bottleneck expansion factor of 4, this results in a 2048-dimensional feature space. The choice w = 2, for example, results in group widths 128, 256, 512, 1024 and a 4096-dimensional feature space. The total number of combinations in the search space is at least 539 (considering w only at discrete increments of 0.1), but the spline optimization needs only 9 supporting measurements followed by a couple of additional ones.

Point type Initial Incremental
α 8 2 2 8 4 2 8 2 8 2
β 8 2 8 2 4 2 2 8 8 8
w 1 1 1 1 1.5 2 2 2 2 3
Top-1 Accuracy % 38.17 38.83 38.75 38.18 39.90 40.96 40.53 40.99 40.48 41.64
Table 2: BLResNext50, 3 dimensions (α, β, and w), 9 points for initial spline, 1 incremental
Figure 3: Top-1 accuracy on ImageNet22K (half) for BLResNext50, shown as a two-dimensional projection of the spline over a pair of the search parameters.
Figure 4: BLResNext50 18x8d testing accuracy on the full ImageNet22K (34 nodes, 204 GPUs, batch size 6,528) on the Summit supercomputer.

Table 2 shows that α and β have only a small influence on accuracy compared to w. Figure 3 shows the corresponding projection of the spline. The dependencies within the "box" are nearly linear and the minimum is located in the corner, clearly indicating that an increase in width has the best probability to increase accuracy significantly. Hence, we measured the configuration with w = 3 (the last column of Table 2), which indeed yielded a top-1 accuracy of 41.64%. This was interesting also because the shape of the spline suggested an increment of the w variable beyond the initially designed range of the search space.

The total number of single Nvidia Volta GPU hours needed for the NAS spline search is the accumulation of conducting evaluation (training and validation) on half of the ImageNet22K dataset for all the configurations corresponding to the initial points plus 3 additional data points. For reference, the original reinforcement learning based NAS [53] method would require vastly more GPU hours at this scale.

The final recommended BL-Net architecture was trained and evaluated on the full ImageNet22K dataset using the Summit supercomputer over 34 nodes with 204 Volta GPUs and a global batch size of 6,528, using PyTorch distributed data parallel. Figure 4 shows how the top-1 accuracy climbed as the learning progressed; on the way to the final 40.03% accuracy it crossed the two previously published results as shown. Table 3 summarizes the comparison of our result against previous results as well as a baseline SME designed architecture based on BL-ResNext. The SME designed architecture was a BL-ResNext101 based model, in comparison to the BL-ResNext50 based spline-recommended model. As can be seen, the spline-recommended architecture resulted in an absolute increase of 3.13% in top-1 accuracy over the best previously published result. This is the first published result to cross 40% overall top-1 accuracy on ImageNet22K.

Model  Batch Size  GPUs  Top-1 (Top-5) Accuracy %  FLOPs (G)  Training Time (Hours)  Number of Epochs
ResNet-101 [14] 5,120 256 33.8 (-) - 7 -
WRN-50-4-2 [15] 6,400 200 36.9 (65.1) - - 24
BL-ResNext101 32x8d 6,528 204 39.7 (68.3) 11.25 16 60
BL-ResNext50 18x8d (ours) 6,528 204 40.03 (69.04) 17.88 15 60
Table 3: Comparison of ImageNet22K results. The last two rows denote the SME designed baseline and the Spline NAS recommended architecture. FLOPs are estimated with a network forward pass using an input image resolution of 256x256.

5.3 BLResNext on ImageNet1K

In order to investigate the influence of both group width and group depth in the Big-Little ResNeXt family, we picked 8 parameter dimensions and experimentally verified the convergence of the spline method using the smaller ImageNet1K dataset. The investigated width parameters w1–w4 are the numbers of filters in the first layer of each layer group, i.e. the number of filters in the output of a group is this width increased by the expansion factor. The depth parameters d1–d4 are the depths, i.e. the number of blocks instantiated by make_layer for each of the four block groups.

Point Type Dimensions Top-1 Accuracy %
w1 w2 w3 w4 d1 d2 d3 d4 Measured Predicted
Initial 32 32 32 32 2 2 2 2 52.95 -
128 256 512 768 10 10 18 5 78.82 -
80 160 320 480 5 5 5 3 77.61 -
128 32 32 32 2 2 2 2 60.60 -
32 256 32 32 2 2 2 2 65.53 -
32 32 512 32 2 2 2 2 70.19 -
32 32 32 768 2 2 2 2 69.84 -
32 32 32 32 10 2 2 2 58.37 -
32 32 32 32 2 10 2 2 58.83 -
32 32 32 32 2 2 18 2 59.17 -
32 32 32 32 2 2 2 5 55.16 -
32 256 512 768 10 10 18 5 77.68 -
128 32 512 768 10 10 18 5 78.18 -
128 256 32 768 10 10 18 5 78.26 -
128 256 512 32 10 10 18 5 77.78 -
128 256 512 768 2 10 18 5 78.55 -
128 256 512 768 10 2 18 5 78.40 -
128 256 512 768 10 10 2 5 78.44 -
128 256 512 768 10 10 18 2 79.27 -
Incremental 71 256 512 768 2 2 2 2 76.42 85.09
91 72 512 768 6 7 10 2 77.76 80.88
128 256 441 768 10 10 18 2 - 79.29
BL-ResNext50 Default 64 128 256 512 3 4 6 3 77.02 75.50
Table 4: BLResNext, 8 dimensions, 19 points for initial spline. Check against the known BLResNext50 configuration (last row) followed by iterative points; measured top-1 accuracy vs prediction.

Table 4 shows the 19 (2d + 3 for d = 8) measured points that span an initial spline, and two incremental points with a comparison of the prediction against a measurement. After the initial set of observations, we ran predictions of the spline on different points within the search space. As an example, we can look at the default BL-ResNext50 configuration (last row in the table). Since it is located close to the center of the d-box and hence close to a support point, the error between the top-1 accuracy predicted by the spline model and the measured performance is moderate, approximately 1.5% absolute (77.02% measured versus 75.50% predicted).

The first incremental point is the minimum found within the d-box. It is located relatively far from the support points, hence its prediction is much less accurate. Iteratively adding minima as support points tends to quickly improve the accuracy of predictions and thus leads to good network parameters. If the prediction quality doesn't improve, adding more support points in the region of interest, e.g. spanning a smaller d-box inside the first one, can improve the spline's predictive capabilities.

Adding this point to the base spline (not including the default BL-ResNext50 point) delivers a new prediction with a predicted top-1 accuracy of 80.88%. Adding the measurement for this second point to the spline delivers a new prediction with an accuracy of 79.29%, which closely matches the measured corner point of the d-box (79.27%). That suggests that this corner point is the optimum within this d-box.

The location of the minimum in a corner point suggests that better parameters may exist outside the interpolation d-box. Hence, we added a set of measurements to expand the d-box to a wider range of parameters for the widths of the convolution groups and started a new iteration. Table 5 shows the additional points and the predictions and measurements. Not surprisingly, the point at the maximum edge of the new d-box already shows an improved final top-1 accuracy of 79.38%. Additional iterative points close the gap between prediction and measurement. We stopped the iteration when it became evident that the best corner point was already very close to the settled maximum. The iteration also unveiled a smaller network with an almost identical final accuracy of 79.37% (see Table 5).

Point Type Dimensions Top-1 Accuracy %
w1 w2 w3 w4 d1 d2 d3 d4 Measured Predicted
Initial 256 512 768 1024 10 10 18 5 79.38 -
256 32 32 32 2 2 2 2 64.91 -
32 512 32 32 2 2 2 2 68.82 -
32 32 768 32 2 2 2 2 70.92 -
32 32 32 1024 2 2 2 2 70.51 -
32 512 768 1024 10 10 18 5 78.12 -
256 32 768 1024 10 10 18 5 78.71 -
256 512 32 1024 10 10 18 5 79.01 -
256 512 768 32 10 10 18 5 78.81 -
256 512 768 1024 2 10 18 5 79.10 -
256 512 768 1024 10 2 18 5 79.25 -
256 512 768 1024 10 10 2 5 79.11 -
256 512 768 1024 10 10 18 2 79.17 -
Incremental 210 357 433 500 3 5 13 2 78.83 81.78
235 355 408 872 10 10 18 3 79.37 80.13
130 289 417 489 3 4 9 3 78.47 79.93
158 314 381 761 5 10 18 3 79.04 79.86
240 400 620 532 10 8 18 2 79.16 79.76
256 385 433 1023 10 3 18 5 79.31 79.6
188 365 445 875 10 10 18 2 79.18 79.51
256 293 363 1024 10 10 18 5 79.24 79.58
180 284 569 614 10 10 18 3 - 79.51
BL-ResNext50 Default 64 128 256 512 3 4 6 3 77.02 75.50
Table 5: BLResNext, 8 dimensions, additional points for the widened spline. Includes the BLResNext50 configuration and the measured iterative points for the narrower d-box from Table 4; measured top-1 accuracy vs prediction.

6 Conclusions

We described a novel NAS method based on polyharmonic splines that can perform search directly on large scale, imbalanced target datasets. We demonstrated how the most common operations in deep neural networks can be included as variables in the search space of a spline modeling the accuracy of a given architecture. The number of evaluations required during the search phase of our NAS approach is proportional to the number of operations in the search space, not to the number of possible values each operation could take, making the approach tractable at large scale. We demonstrated the effectiveness of our method on the ImageNet22K benchmark [16], achieving a state of the art top-1 accuracy of 40.03%. This result paves the way to applying polyharmonic spline based NAS to other architectures and operations within networks, potentially also including hyperparameters in the search space.

7 Acknowledgement

This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. It also used resources of the IBM T.J. Watson Research Center Scaling Cluster (WSC).

References

  • [1] Anonymous (2021) An image is worth 16x16 words: transformers for image recognition at scale. In Submitted to International Conference on Learning Representations, Note: under review External Links: Link Cited by: §5.1.
  • [2] B. Baker, O. Gupta, R. Raskar, and N. Naik (2018) Accelerating neural architecture search using performance prediction. In ICLR Workshops, Cited by: §2, §2.
  • [3] G. Bender, P. Kindermans, B. Zoph, V. Vasudevan, and Q. Le (2018) Understanding and simplifying one-shot architecture search. In ICML, Cited by: §2.
  • [4] J. Bergstra, D. Yamins, and D. Cox (2013) Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In International Conference on Machine Learning, pp. 115–123. Cited by: §2.
  • [5] B. Bhattacharjee, M. Hill, H. Wu, P. Chandakkar, J. Smith, and M. Wegman (2017) Distributed learning of deep feature embeddings for visual recognition tasks. In IBM Journal of Research and Development, Cited by: §5.1.
  • [6] Z. Borsos, A. Khorlin, and A. Gesmundo (2019) Transfer nas: knowledge transfer between search spaces with transformer agents. arXiv preprint 1906.08102. Cited by: §1.
  • [7] A. Brock, T. Lim, J.M. Ritchie, and N. Weston (2018) SMASH: one-shot model architecture search through hypernetworks. In ICLR, Cited by: §2.
  • [8] H. Cai, L. Zhu, and S. Han (2019) ProxylessNAS: direct neural architecture search on target task and hardware. In ICLR, Cited by: §2.
  • [9] S. Cao, X. Wang, and K. M. Kitani (2019) Learnable embedding space for efficient neural architecture compression. arXiv preprint arXiv:1902.00383. Cited by: §2.
  • [10] J. Chang, X. Zhang, Y. Guo, G. Meng, S. Xiang, and C. Pan (2019) DATA: differentiable architecture approximation. In Advances in Neural Information Processing Systems, Cited by: §2.
  • [11] C. Chen, Q. Fan, N. Mallinar, T. Sercu, and R. Feris (2019) Big-little net: an efficient multi-scale feature representation for visual and speech recognition. In ICLR, Cited by: Large Scale Neural Architecture Search with Polyharmonic Splines, §1, §3.2, §5.2.2, §5.
  • [12] Y. Chen, T. Yang, X. Zhang, G. Meng, X. Xiao, and J. Sun (2019) DetNAS: backbone search for object detection. In Advances in Neural Information Processing Systems, Cited by: §2.
  • [13] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman (2014) Project adam: building an efficient and scalable deep learning training system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 571–582. Cited by: §5.1, §5.1.
  • [14] M. Cho, U. Finkler, S. Kumar, D. Kung, V. Saxena, and D. Sreedhar (2017) PowerAI ddl. In arXiv preprint 1708.02188, Cited by: §5.1, Table 3.
  • [15] V. Codreanu, D. Podareanu, and V. Saletore (2017) Achieving deep learning training in less than 40 minutes on imagenet-1k and best accuracy and training time on imagenet-22k and places-365. Note: https://bit.ly/2VdG5B7 Cited by: Large Scale Neural Architecture Search with Polyharmonic Splines, §1, §5.1, Table 3.
  • [16] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, Cited by: Large Scale Neural Architecture Search with Polyharmonic Splines, §6.
  • [17] J. Deng, A. Berg, K. Li, and F. Li (2010) What does classifying more than 10,000 image categories tell us?. In ECCV, Cited by: §5.1.
  • [18] T. Domhan, J. T. Springenberg, and F. Hutter (2015) Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Twenty-Fourth International Joint Conference on Artificial Intelligence, Cited by: §2.
  • [19] T. Elsken, J. H. Metzen, and F. Hutter (2019) Neural architecture search: a survey. JMLR 20 (55), pp. 1–21. Cited by: §1.
  • [20] S. Falkner, A. Klein, and F. Hutter (2018) BOHB: robust and efficient hyperparameter optimization at scale. arXiv preprint arXiv:1807.01774. Cited by: §2.
  • [21] A. Gaier and D. Ha (2019) Weight agnostic neural networks. In Advances in Neural Information Processing Systems, Cited by: §2.
  • [22] G. Ghiasi, T. Lin, and Q. V. Le (2019) NAS-fpn: learning scalable feature pyramid architecture for object detection. In CVPR, Cited by: §2.
  • [23] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: Large Scale Neural Architecture Search with Polyharmonic Splines, §1, §3.2, §5.
  • [24] A. Howard, M. Sandler, G. Chu, L. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, Q. V. Le, and H. Adam (2019) Searching for mobilenetv3. In ICCV, Cited by: §2.
  • [25] A. Iske and V. I. Arnold (2004) Multiresolution methods in scattered data modelling. Vol. 37, SpringerVerlag. Cited by: §4.1.
  • [26] K. Kandasamy, W. Neiswanger, J. Schneider, B. Poczos, and E. P. Xing (2018) Neural architecture search with bayesian optimisation and optimal transport. In Advances in neural information processing systems, pp. 2016–2025. Cited by: §2.
  • [27] A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report Cited by: §1.
  • [28] C. Liu, L. Chen, F. Schroff, H. Adam, W. Hua, A. Yuille, and L. Fei-Fei (2019) Auto-deeplab: hierarchical neural architecture search for semantic image segmentation. In CVPR, Cited by: §2.
  • [29] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy (2018) Progressive neural architecture search. In ECCV, Cited by: §2.
  • [30] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu (2018) Hierarchical representations for efficient architecture search. In ICLR, Cited by: §2.
  • [31] H. Liu, K. Simonyan, and Y. Yang (2018) DARTS: differentiable architecture search. arXiv preprint 1806.09055. Cited by: §1, §2.
  • [32] Z. Lu, G. Sreekumar, E. Goodman, W. Banzhaf, K. Deb, and V. N. Boddeti (2020) Neural architecture transfer. arXiv preprint 2005.05859. Cited by: §1.
  • [33] H. Mendoza, A. Klein, M. Feurer, J. T. Springenberg, and F. Hutter (2016) Towards automatically-tuned neural networks. In Workshop on Automatic Machine Learning, pp. 58–65. Cited by: §2.
  • [34] N. Nayman, A. Noy, T. Ridnik, I. Friedman, R. Jin, and L. Zelnik-Manor (2019) XNAS: neural architecture search with expert advice. In Advances in Neural Information Processing Systems, Cited by: §2.
  • [35] H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean (2018) Efficient neural architecture search via parameters sharing. In ICML, Cited by: §2.
  • [36] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In AAAI, Cited by: §2.
  • [37] K. Swersky, D. Duvenaud, J. Snoek, F. Hutter, and M. A. Osborne (2014) Raiders of the lost architecture: kernels for bayesian optimization in conditional parameter spaces. arXiv preprint arXiv:1409.4011. Cited by: §2.
  • [38] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019) MnasNet: platform-aware neural architecture search for mobile. In CVPR, Cited by: §2.
  • [39] M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In ICML, Cited by: §2.
  • [40] N. Wang, Y. Gao, H. Chen, P. Wang, Z. Tian, and C. Shen (2019) NAS-FCOS: fast neural architecture search for object detection. arXiv preprint 1906.04423. Cited by: §2.
  • [41] M. Wistuba (2019) XferNAS: transfer neural architecture search. arXiv preprint 1907.08307. Cited by: §1.
  • [42] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer (2019) FBNet: hardware-aware efficient convnet design via differentiable neural architecture search. In CVPR, Cited by: §2, §4.
  • [43] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In CVPR, Cited by: §3.2.
  • [44] S. Xie, A. Kirillov, R. Girshick, and K. He (2019) Exploring randomly wired neural networks for image recognition. In ICCV, Cited by: §2.
  • [45] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500. Cited by: §1.
  • [46] S. Xie, H. Zheng, C. Liu, and L. Lin (2019) SNAS: stochastic neural architecture search. In ICLR, Cited by: §2.
  • [47] A. Zela, T. Elsken, T. Saikia, Y. Marrakchi, T. Brox, and F. Hutter (2020) Understanding and robustifying differentiable architecture search. In ICLR, Cited by: §2.
  • [48] A. Zela, J. Siems, and F. Hutter (2020) NAS-bench-1shot1: benchmarking and dissecting one-shot neural architecture search. In ICLR, Cited by: §1, §2.
  • [49] C. Zhang, M. Ren, and R. Urtasun (2018) Graph hypernetworks for neural architecture search. arXiv preprint arXiv:1810.05749. Cited by: §2.
  • [50] H. Zhang, Z. Hu, J. Wei, P. Xie, G. Kim, Q. Ho, and E. Xing (2015) Poseidon: a system architecture for efficient gpu-based deep learning on multiple machines. In arXiv preprint 1512.06216, Cited by: §5.1.
  • [51] Z. Zhong, J. Yan, W. Wu, J. Shao, and C. Liu (2018) Practical block-wise neural network architecture generation. In CVPR, Cited by: §2.
  • [52] D. Zhou, X. Zhou, W. Zhang, C. C. Loy, S. Yi, X. Zhang, and W. Ouyang (2020) EcoNAS: finding proxies for economical neural architecture search. In CVPR, Cited by: §2.
  • [53] B. Zoph and Q. V. Le (2017) Neural architecture search with reinforcement learning. In ICLR, Cited by: §1, §2, §5.2.2.
  • [54] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In CVPR, Cited by: §2.