1 Introduction
Designing deep neural networks is a challenging task and requires a Subject Matter Expert (SME). One way of reducing the design burden while still obtaining a custom-designed architecture for a given training problem is to use Neural Architecture Search (NAS) [19, 53]. Most NAS methods have been tried on small-scale datasets like CIFAR-10/100 [27], with the resulting architectures subsequently transferred to larger datasets like ImageNet1K [6, 32, 41]. This is mainly due to the heavy computational burden of the search phase, which despite recent progress [31, 48] remains mostly intractable at large scale. No previously published work has explored NAS on large-scale datasets like ImageNet22K directly, and very few have even attempted it on ImageNet1K itself.
Additionally, most NAS methods have been tested on datasets with a uniform distribution in terms of number of images per class. For example, CIFAR-100 has 100 classes containing 600 images each, and ImageNet1K has 1,000 classes with roughly 1,000 images each. While such uniform distributions facilitate learning, they are not reflective of real-world conditions, where skew in terms of instances per label/class is very much evident.
In this paper we describe a novel method for NAS that can conduct search directly on large-scale, skewed datasets using polyharmonic splines. First, we describe in detail how specific operations within a deep neural network (convolution, ReLU, average pooling, batch norm) are well suited to be approximated by a spline. Given a search space consisting of a set of d operations to optimize over, the search phase of our proposed NAS method based on polyharmonic splines requires only an initial set of 2d+3 points, followed by a few additional points to complete the optimization. This means that the number of evaluations required in the search phase depends only on the number of operations under search, not on the number of possible values for each of those operations. This drastically reduces the computational cost of the search phase of our NAS approach, allowing it to be performed directly on large-scale target datasets.
We demonstrate the effectiveness of our method on the ImageNet22K benchmark of 14 million images over almost 22,000 labels. The dataset has high skew in terms of images per label and distribution across semantic categories, as shown in Figure 1. By exploring the search space of the ResNet [23], BigLittle Net [11], and ResNext [45] architectures, the method designed a model which achieved a top1 accuracy of 40.03% on ImageNet22K. This significantly improves on the best published performance [15] of 36.9% on such a large-scale, imbalanced dataset.
The remainder of the paper is organized as follows. After Section 2, which covers related work, we describe in Section 3 the basic components of deep neural networks which are adopted as dimensions for our type of NAS, and explain why the functions of such variables are well suited for interpolation. We describe the application of polyharmonic splines and the necessary conditions for the existence of a solution in Section 4, before discussing experimental results on ImageNet22K in Section 5. Finally, we draw conclusions in Section 6.
2 Related Work
NAS. Neural Architecture Search (NAS), which aims at the automatic design of deep learning networks for various applications spanning from image classification [29, 38, 42, 44] to NLP [31, 35, 53], and from object detection [12, 22, 40] to semantic segmentation [28] and control tasks [21], has attracted intense attention in recent years. A number of NAS strategies have been proposed, including evolutionary methods [2, 30, 36, 52], reinforcement learning based methods [29, 35, 51, 53], and gradient-based methods [10, 31, 34, 46, 47]. Efficiency on specific platforms has also been a very active area of research under the NAS umbrella, with the development of search strategies that optimize not only accuracy but also latency [8, 24, 38, 39, 42]. Methods based on micro-search of primary building cells [51, 54] and parameter sharing between child models [3, 7, 31, 35, 48] have also been recently proposed. Another widely used approach to address efficiency in NAS is to search for an architectural building block on a small dataset (e.g., CIFAR-10) and then transfer the block to the larger target dataset by replicating and stacking it multiple times in order to increase network capacity according to the scale of the dataset.
Bayesian Methods. Our work is also related to Bayesian optimization, which often uses an acquisition function to obtain suitable candidates and has been widely used in NAS [18, 33, 26, 9]. The acquisition function measures utility by accounting for both the predicted response and the uncertainty in the prediction. While Bayesian optimization methods usually consume more computation to determine future points than other methods, this pays dividends when the evaluations are very expensive. Many different strategies have been studied, including classic Gaussian process based optimization [37, 26] and tree-based models [4]
[20], to effectively optimize both neural network architectures and their hyperparameters. Several works also predict the validation accuracy [49], or the curve of validation accuracy with respect to training time [2].
3 Optimization Basis
Neural Architecture Search in its essence optimizes a function f(x), with x residing in a multidimensional space of network and training parameters and f(x) being a metric for the quality of the network, for example the top1 accuracy on a validation data set. The parameters captured in x may be discrete, and f might not be differentiable in terms of the parameters x. Importantly, f is generally very expensive to evaluate: for a given parameter set x, the network has to be trained and evaluated on the full training and validation sets.
The interpolation quality of methods that approximate scalar functions of multidimensional arguments based on a limited set of support points depends on the properties of the interpolated function. Hence it is important to understand the properties of f and whether a well-behaved interpolating function through the potentially discrete support points is likely to exist.
This section discusses the algebraic structure of popular residual neural networks and establishes that the resulting f is, with high probability, well suited for interpolation with a polyharmonic spline for parameters that have a sufficiently large number of possible values. For parameters with only a few choices, using separate splines for the discrete cases can be more effective. Furthermore, the algebraic analysis provides insights into the sensitivity of global network parameters and hence into search spaces.
3.1 Neural Network Forward Pass as Piecewise Linear Function
The forward pass through a typical neural network consisting of convolutions, fully connected layers, ReLUs, batch norms and average pooling can be interpreted as a piecewise linear function that effectively transforms an input, for example an image as a set of values across three channels, into values in a feature space, which forms the input to a final linear classifier.
3.1.1 Fully Connected Layers
A single neuron with n inputs performs the operation y = σ(w·x + b) with an activation function σ. In the following we discuss the case σ(z) = max(0, z), i.e. ReLU. The condition w·x + b = 0 defines a hyperplane in the n-dimensional input space of x such that y > 0 for locations x above the hyperplane and y = 0 for locations on or below the hyperplane. The output of the neuron without activation is in fact the signed distance of x from the hyperplane (scaled by the norm of w). Multiple neurons with inputs from the same input space define a set of hyperplanes that partition the input space into a set of different regions. In each region, the output is determined by a set of linear equations. For example, in a two-dimensional input space, 2 neurons with linearly independent weight vectors and ReLU activation create four regions.
A second layer of similar structure, a linear operation followed by ReLU activation, potentially partitions each of the segments of the input space again. In the following we discuss how other layers, namely convolution, batchnorm and average pooling, are operations of analogous structure and hence the entire set of operations leading to the final linear classifier is a piecewise linear function that maps hypervolumes in the input space into the feature space. For a trained neural network, this piecewise linear function is optimized to create linearly separable clusters of mapped training points in the feature space.
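As a minimal sketch of this partitioning (the weights below are chosen for illustration, not taken from the paper), two ReLU neurons with linearly independent weight vectors split a two-dimensional input space into four linear regions, identified by the sign pattern of the pre-activations:

```python
import numpy as np

# Two neurons with linearly independent weight vectors partition the
# 2-D input plane into four regions, one per sign pattern of the
# pre-activations. Weights here are illustrative.
W = np.array([[1.0, 0.0],   # neuron 1: hyperplane x1 = 0
              [0.0, 1.0]])  # neuron 2: hyperplane x2 = 0
b = np.zeros(2)

def region_id(x):
    """The sign pattern of the pre-activations identifies the linear piece."""
    pre = W @ x + b
    return tuple((pre > 0).astype(int))

# Sample one point per quadrant and collect the distinct regions.
points = [np.array(p, dtype=float) for p in [(1, 1), (-1, 1), (-1, -1), (1, -1)]]
regions = {region_id(p) for p in points}
print(len(regions))  # the four quadrants -> 4 distinct linear regions
```

Within each region the ReLU outputs are a fixed linear function of the input, which is exactly the piecewise linearity the analysis relies on.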
3.1.2 Convolution
Convolution in a neural network is

y_{f,(u,v)} = Σ_c Σ_{(i,j)} w_{f,c,i,j} · x_{c,u+i,v+j} + b_f,

wherein c identifies the input channel, e.g. color in an RGB image or filter from a prior convolution. The tuple (u, v) is the location within the input at which a patch of the size of a filter is scanned. f identifies the filter, i.e. the output channel of the convolution. i and j iterate through the positions within the input patch and filter. Hence, each filter tensor w_f is of size k_w × k_h × c_in, the filter width, height and number of input channels. We can apply an index transformation which transforms the index triple (c, i, j) into a single index. For a fixed position (u, v) of the convolution we can drop the indices u and v to obtain y_f = w_f · x + b_f. The vectors w_f and x are points in a vector space of dimension k_w · k_h · c_in. I.e., on one input patch a convolution has the same algebraic structure as a neuron. Average pooling is a convolution with a filter whose elements are all identical.
Note that these operations can be expressed in the form of a fully connected layer, with a set of neurons corresponding to the filter channels that is replicated multiple times with different subsets of the input weights set to zero to express the selection of patches in the input.
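The equivalence between a convolution on one patch and a neuron can be checked numerically; the following sketch (with illustrative random data) flattens the (channel, i, j) indices into a single vector index:

```python
import numpy as np

# A convolution applied to one input patch has the same algebraic
# structure as a neuron: flatten the (channel, i, j) filter indices
# into one vector index and the operation becomes a dot product.
rng = np.random.default_rng(0)
c_in, k = 3, 3                                # input channels, filter size
patch = rng.standard_normal((c_in, k, k))     # one k x k patch of the input
filt  = rng.standard_normal((c_in, k, k))     # one filter (= one output channel)

# Direct convolution output at this position (no bias).
direct = np.sum(patch * filt)

# Same result after the index transformation: vectors of dimension c_in*k*k.
as_neuron = filt.reshape(-1) @ patch.reshape(-1)

print(np.allclose(direct, as_neuron))  # True
```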
3.1.3 Batch Norm
At first glance, a batch norm layer does not look like a linear function. But a closer look at the details reveals that the running means and variances are treated as constants in backpropagation; they do not have gradients. The batch statistics (mean and variance across the batch, width and height for a channel) are estimates for the constants used during inference on the trained model, which are in themselves estimates for the mean and variance across the entire training set. As the weights change during training, the estimates for running means and variances follow.
PyTorch uses Bessel's correction, an unbiased estimate of the population variance from a finite sample, for BatchNorm2d. With x as an input tensor of form NCWH, m being the number of elements in a 'channel slice' and t as the iteration number, mean and unbiased variance for a 'channel slice' are

μ_t = (1/m) Σ_i x_i,    σ²_t = (1/(m−1)) Σ_i (x_i − μ_t)².

With running mean and variance

μ̂_t = (1 − α) μ̂_{t−1} + α μ_t,    σ̂²_t = (1 − α) σ̂²_{t−1} + α σ²_t,

batch norm in training (μ_t, σ²_t) and eval (μ̂_t, σ̂²_t) mode results in

y = γ (x − μ) / sqrt(σ² + ε) + β.

Hence, with μ and σ² as constants, batch norm is a linear (affine) transformation.
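A small numerical check of this observation (the constants below are illustrative): in eval mode, batch norm per channel collapses to an affine map y = a·x + c.

```python
import numpy as np

# In eval mode batch norm uses the stored running statistics as
# constants, so per channel it reduces to an affine map y = a*x + c.
gamma, beta = 1.5, 0.2              # learned scale and shift (illustrative)
run_mean, run_var, eps = 0.3, 2.0, 1e-5

def batchnorm_eval(x):
    return gamma * (x - run_mean) / np.sqrt(run_var + eps) + beta

# The equivalent affine transformation.
a = gamma / np.sqrt(run_var + eps)
c = beta - a * run_mean

x = np.linspace(-3, 3, 7)
print(np.allclose(batchnorm_eval(x), a * x + c))  # True
```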
3.2 Base Transformations, Projections and Convolutional Networks
The rank factorization of a matrix A of rank r expresses A as a product A = CF of a full column rank matrix C and a full row rank matrix F. For a square matrix A of full rank, the equation y = Ax is a transformation of the vector x from one base of the space to another.
Hence, conceptually a matrix of rank r maps a vector from a space with n dimensions into a subspace of r dimensions and from there into a space with m dimensions. If r < n, then we can express all of the n-dimensional input vectors in terms of an r-dimensional basis. Thus, a trained network whose matrices are not close to full rank is inefficient.
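A quick numerical illustration of rank factorization (built here via an SVD, one standard construction; the matrices are random and purely illustrative):

```python
import numpy as np

# Rank factorization: a rank-r matrix A (m x n) factors as A = C @ F with
# C of full column rank (m x r) and F of full row rank (r x n), i.e. A
# maps its input through an r-dimensional intermediate space.
rng = np.random.default_rng(1)
m, n, r = 6, 5, 2
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank 2 by construction

U, s, Vt = np.linalg.svd(A)
C = U[:, :r] * s[:r]        # m x r, full column rank (columns scaled by singular values)
F = Vt[:r, :]               # r x n, full row rank

print(np.linalg.matrix_rank(A), np.allclose(A, C @ F))  # 2 True
```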
Furthermore, we can interpret a convolution and subsequent activation on a single 'patch' as a transformation from a (w·h·c_in)-dimensional space to a c_out-dimensional space, w and h being the width and height of the filters and c_in and c_out being the numbers of input channels and output channels, respectively.
Assuming that the combined convolution matrix has close to full rank in an efficient network, the activations resulting from a convolution can be interpreted as the coefficients in an expression that approximates the input tensor in terms of a set of functions defined by the filter tensors, a form of compression.
The 'approximation' has to be of limited loss, since the correlation between classification results and inputs has to be preserved. In some sense, training optimizes the approximation such that it aids in transforming inputs into linearly separable points in the feature space while at the same time representing all inputs with limited loss of information.
Based on above observations, the essential part of the functionality of the convolutional cone leading to a linear classifier in a neural network is provided by mapping points from one space to another to create a piecewise linear function whose output in the feature space is highly linearly separable with respect to the number of hyperplanes in the final classification layer. What primarily matters is how many linear pieces the network provides relative to the number of hyperplanes, i.e. classes, in the final classifier.
A second aspect is trainability, i.e. how easily a network converges, which suggests focusing on residual networks with batch norm. While the specific structure of a network should have relatively little influence on final accuracy, it may have a significant impact on the network size that is required to generate a competitive piecewise linear approximation. This led us to focus the architecture search on different families of residual networks, specifically ResNet [23], ResNeXt [43] and BLResNeXt [11]. Typical residual networks have 4 groups of residual blocks with different numbers of layers. For example, ResNet50 has groups of input width 64, 128, 256, 512 channels with an expansion factor of 4 and depths of 3, 4, 6, 3. ResNet18, on the other hand, has the same widths but an expansion factor of 1 and depths of 2, 2, 2, 2.
The number of classes, i.e. the number of hyperplanes in the feature space, provides guidance on the optimal number of dimensions for the feature space. In an n-dimensional space, we can place up to n linearly independent hyperplanes such that all volumes bounded by the hyperplanes are infinite. If we place more hyperplanes, some volumes have to be bounded, with sizes that depend on the placement of the hyperplanes. Hence, a dimension of the feature space that is larger than the number of classes should be beneficial. Note that for ResNet50 the dimension of the feature space is 2048 and for ResNet18 it is 512.
If the leading layers 'choke' the dimensions of the vector spaces leading to the feature space too much, the analysis suggests that this will negatively influence the accuracy of the network. For example, the initial 7×7 convolution in ResNet18 has an input space of 147 (7·7·3) dimensions and maps this to 64 dimensions. A 3×3 convolution from 64 channels to 128 channels maps 576 dimensions to 128, i.e. creates a significant reduction. Furthermore, the number of activations drops by a factor of 4 through every group.
As an experimental test we designed a simple neural network, r18U, based on ResNet18, whose behavior should be closer to ResNet50 on ImageNet1k, by adjusting only the block widths. We eliminated the max-pool layer and compensated for its reduction by increasing the stride of the initial convolution to 4, based on the hypothesis that this nonlinear layer is not essential for the functionality of the neural network. Indeed, this did not negatively impact final accuracy but appeared to improve convergence. We increased the numbers of channels of the initial convolution and the subsequent layer groups from 64, 64, 128, 256, 512 to 128, 256, 512, 1024, 2048. Note that the dimension of the feature space matches that of ResNet50 and the output dimension of the initial convolution almost matches its input dimension. The network r18U achieved a validation accuracy of over 76.5%, compared to ResNet50's 75.1%, in our PyTorch setup. Network r18U is less efficient than ResNet50: it has significantly more parameters. Larger depth increases the potential number of pieces in the piecewise linear function for a similar number of neurons.
Our algebraic analysis and experimental results, as well as theory (two-layer theorem), suggest that roughly the same final accuracy can be achieved by many network families; the distinguishing factor is the required model size. Imposing structure via convolutions, higher depth, residual blocks and their substructures increases the granularity of the piecewise linear function for a network with a given number of neurons.
4 Polyharmonic Spline Neural Architecture Search
Given the insights from the previous section, we developed a method to efficiently investigate a high dimensional parameter search space. We analyzed the search space of a single network family like ResNet or ResNeXt. Widths and depths of the block groups already provide 8 dimensions within any deep network family search space. On top of those, parameters internal to the blocks, such as cardinality in ResNeXt, would increase the number of dimensions even further, but based on the algebraic interpretation they were likely to have less impact on final accuracy.
Three key issues make NAS challenging for larger classification problems such as ImageNet1k or ImageNet22k: search (and evaluation) time, memory consumption, and probability of getting stuck in local minima.
First, in order to obtain a measure for the final accuracy of a model, it has to be trained nearly to saturation, that is, over a sufficiently large number of epochs to ensure that all the variants tested reach close to their full potential. In our experiments, early stopping proved to be misleading, since smaller networks tend to initially converge faster and the crossover point was near the end of a complete learning rate schedule. Training ImageNet1k for 90 epochs and ImageNet22k for 60 epochs (values experimentally shown to provide meaningful accuracies) requires large amounts of computation. Hence, as in most NAS approaches, minimizing the number of evaluations is critical.
Second, the algebraic observations suggested that the better performing networks are large, such that GPU memory limitations become a factor. "Supernet" approaches such as FBNet [42] would not fit even two variants of the larger and better performing networks into a 16 GB GPU. In fact, the most accurate networks on ImageNet22K tend to use most of the memory of even a 32 GB GPU.
Third, the topologies of the hypersurfaces defined by the parameter dimensions, with the achieved final accuracy as the evaluation axis, may have local minima, i.e. gradient descent based methods are suboptimal for finding minima in the accuracy hypersurface.
Importantly, the algebraic investigation suggests that small changes to parameters such as the number of filters or number of layers in a "convolution group", or cardinalities of elements of basic blocks within a "network family" like ResNeXt or BLResNeXt, will result in small changes in the final accuracy, since they cause only small changes in the number of pieces in the piecewise linear function and in the equations that govern the pieces. Our experimental results support this hypothesis. Hence, it is appropriate to assume that there exists a set of continuous functions in the parameter space that pass through a set of reasonably spaced support points over the discrete parameters and the resulting validation accuracy after training the network (close) to saturation.
4.1 Polyharmonic Splines
Polyharmonic splines are an interesting option for defining a function that passes through such a set of support points. Given a set of support points (x_i, f_i), i = 1, ..., N, for a function f, a polyharmonic spline interpolation has the form

s(x) = Σ_{i=1}^{N} w_i φ(||x − x_i||) + p(x),

where ||·|| is the Euclidean norm in R^d, p is a real-valued polynomial in d variables [25], and φ is a radial basis function. The polyharmonic radial basis functions are solutions to a polyharmonic equation, a partial differential equation of the form Δ^k u = 0. Polyharmonic splines minimize the "curvature" of the hypersurface that passes through all the support points, hence they minimize oscillations while providing a smooth, continuous surface. This interpolation expression is differentiable. Thus, if we assume there exists a continuous and differentiable function that correctly models the behavior of the system generating the support points, then there exists a polyharmonic spline such that the integral of the difference between the spline and f in an interpolation d-box vanishes as the number of support points increases. We chose a cubic radial basis function, φ(r) = r³. Solving the equation system that determines the coefficients of the polyharmonic spline proved to be sensitive to numerical instability. Hence we chose a pivoting Householder QR decomposition (the EIGEN implementation), trading performance for numerical stability.
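The following sketch fits such a spline with a cubic basis and a linear polynomial term by solving the standard augmented linear system. It is a minimal illustration with made-up support data, and it uses a generic least-squares solver rather than the pivoting Householder QR named above:

```python
import numpy as np

def phi(r):
    return r ** 3  # cubic polyharmonic basis (an assumption in this sketch)

def fit_spline(X, f):
    """Fit s(x) = sum_i w_i * phi(||x - x_i||) + v0 + v . x through (X, f)."""
    n, d = X.shape
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    A = phi(r)                                   # n x n RBF block
    P = np.hstack([np.ones((n, 1)), X])          # linear polynomial block
    M = np.block([[A, P], [P.T, np.zeros((d + 1, d + 1))]])
    rhs = np.concatenate([f, np.zeros(d + 1)])
    coef = np.linalg.lstsq(M, rhs, rcond=None)[0]  # QR/SVD-based solve
    w, v = coef[:n], coef[n:]
    def s(x):
        return w @ phi(np.linalg.norm(X - x, axis=1)) + v[0] + v[1:] @ x
    return s

# Illustrative support points: corners of a square plus its center.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
f = np.array([1.0, 2.0, 0.5, 3.0, 1.5])
s = fit_spline(X, f)
print(all(np.isclose(s(x), y) for x, y in zip(X, f)))  # True: spline interpolates
```

The zero block enforces the orthogonality side conditions on the RBF weights, which is what makes the interpolation problem well posed for this basis.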
In order for the linear equation system that determines the spline coefficients to have a solution, the matrix formed by the support points has to have full rank. Since interpolation accuracy is in general higher than extrapolation accuracy, a minimal set of support points is needed that spans a d-box in the d-dimensional space in which the spline interpolates the approximated function, and which leads to a support point matrix of full rank. We chose the following set of points for d parameters:
- x_max, the vector of the largest usable or legal parameter values in the d-box for all dimensions.
- x_min, the vector of the smallest usable or legal parameter values.
- x_c, a point near the center of the d-box.
- x_min with coordinate i set to its maximum value, for all i.
- x_max with coordinate i set to its minimum value, for all i.
For two dimensions, these are the corners of a square and its center. For three dimensions, these are the corners of a cube and its center. With these support points, a total of 2d+3 points is sufficient to span an initial cubic spline that interpolates within a d-box. Additional support points can be added to improve the quality of the interpolation. To maintain numerical stability, new support points have to maintain a certain distance from previous support points. We eliminated splines that suffered from numerical instability by checking the solution with an error margin for floating point computation errors.
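The construction of the 2d+3 initial support points can be sketched as follows (taking the near-center point as the exact midpoint is a simplifying assumption of this sketch):

```python
import numpy as np

def initial_support_points(x_min, x_max):
    """2d+3 initial support points spanning a d-box: the min and max
    corners, a near-center point, and the corners obtained by flipping
    one coordinate of x_min or x_max at a time."""
    x_min = np.asarray(x_min, dtype=float)
    x_max = np.asarray(x_max, dtype=float)
    d = len(x_min)
    pts = [x_max.copy(), x_min.copy(), (x_min + x_max) / 2]
    for i in range(d):
        p = x_min.copy(); p[i] = x_max[i]; pts.append(p)  # x_min, coord i maxed
        p = x_max.copy(); p[i] = x_min[i]; pts.append(p)  # x_max, coord i minimized
    return np.array(pts)

# For a 5-dimensional search space this yields 2*5 + 3 = 13 support points
# (bounds here mirror the extremes of the ResNet18-style search space).
pts = initial_support_points([32] * 5, [150, 300, 600, 1200, 2400])
print(pts.shape)  # (13, 5)
```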
4.2 Minima Search
Since gradient descent is susceptible to local minima, we used a hierarchical Monte Carlo approach to search for a minimum in the interpolation d-box, using a low-discrepancy sequence, specifically a Halton sequence. For a given d-box, reducing the average distance between sample points by a factor of two requires an exponential increase in the number of sample points, i.e. progress in terms of finding better minima slows dramatically as more sample points are added.
As the average distance between sampling points decreases relative to the distance between support points, the curvature minimizing property of the polyharmonic spline reduces the risk of missing minima. Hence a hierarchical approach, iteratively shrinking the d-box around the minimum found so far, significantly reduces the time needed to determine a good approximation of the minimum within the d-box. The search was stopped if no further improvement within floating point accuracy was achieved within a certain compute budget.
Note that in our architecture search a potentially bad estimate for a minimum does not limit progress of the search, as described in Algorithm 1. It merely leads to a potentially suboptimal measurement point and hence to potentially requiring more measurement points to reach convergence between prediction and measurement.
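A minimal sketch of such a hierarchical Halton-based minima search on a toy surrogate function (the sample count, number of levels, and shrink factor below are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def halton(n, d):
    """First n points of a d-dimensional Halton low-discrepancy sequence."""
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29][:d]
    out = np.empty((n, d))
    for j, base in enumerate(primes):
        for i in range(n):
            f, r, k = 1.0, 0.0, i + 1
            while k > 0:          # radical-inverse expansion in the prime base
                f /= base
                r += f * (k % base)
                k //= base
            out[i, j] = r
    return out

def hierarchical_minimum(s, lo, hi, n=256, levels=6, shrink=0.5):
    """Search for a minimum of s in the box [lo, hi], iteratively
    shrinking the box around the best point found so far."""
    lo, hi = np.asarray(lo, dtype=float), np.asarray(hi, dtype=float)
    best_x, best_v = None, np.inf
    for _ in range(levels):
        pts = lo + halton(n, len(lo)) * (hi - lo)   # scale samples into the box
        vals = np.array([s(p) for p in pts])
        if vals.min() < best_v:
            best_v, best_x = vals.min(), pts[vals.argmin()]
        half = (hi - lo) * shrink / 2               # shrink box around the best point
        lo = np.maximum(lo, best_x - half)
        hi = np.minimum(hi, best_x + half)
    return best_x, best_v

# Toy surrogate with a known minimum at (0.3, 0.7).
f = lambda x: (x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2
x, v = hierarchical_minimum(f, [0, 0], [1, 1])
print(np.round(x, 2))  # close to [0.3, 0.7]
```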
5 Experiments
In our experiments we investigated a couple of micro (within basic block structure) and multiple macro (overall network dimensions) parameters for two network families, namely ResNet [23] and BigLittle ResNeXt [11]. Search was performed directly on the target dataset ImageNet22K. Training experiments were conducted on the Summit supercomputer at Oak Ridge National Laboratory, using 34 nodes, each equipped with 6 Nvidia Volta V100 GPUs with 16 GB of memory each, for a total of 204 GPUs. All GPUs in a node have NVLink connections, and the nodes are connected by Mellanox EDR 100G Infiniband and have access to shared GPFS storage. Software used included PowerAI Vision 1.6, NCCL and PyTorch, using its distributed data parallel package. Batch size was set at 32 per GPU, for a total batch size of 6,528. The initial learning rate was set at 0.1 and then followed a polynomial decay; the optimizer was SGD with momentum 0.9 and weight decay 0.0001.
5.1 Experimental Dataset: ImageNet22k
ImageNet22k contains 14 million images representing 21,841 categories organized in a hierarchy derived from WordNet and including top level concepts such as sport, garment, fungus, plant, animal, furniture, food, person, nature, music, fruit, fabric, tool etc. Figure 1 shows the top level categories of the ImageNet22k hierarchy and their relative sizes in terms of number of images. The imbalance across top level semantic categories is quite evident. For example, animal is represented three times as much as person, and artifacts dominate the distribution with very little representation for activities. The skew is also significant in terms of number of examples per category: on average there are roughly 650 images per class, with a wide spread between the least and most populated classes. The scale and imbalance of ImageNet22K make it particularly challenging even for human designed architectures, with a limited set of published results [13, 14, 15, 50], as opposed to the smaller and balanced ImageNet1K version. Recently, some works have used ImageNet22K as pretraining for ImageNet1K evaluation [1]. To the best of our knowledge, this is the first work to perform NAS directly at the scale of ImageNet22K, rather than by transfer from smaller proxy sets.
Following standard practice [5, 13, 17], we randomly partitioned the ImageNet22K dataset into 50% training and 50% validation, consisting of approximately 7 million images each. We split the data such that the number of images per label is approximately equal in both sets. In the cases where a label had an odd number of images, we put the extra image in the validation set.
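A sketch of such a per-label balanced split (the function name and data layout are hypothetical, not from the paper's code):

```python
import random
from collections import defaultdict

def balanced_split(samples, seed=0):
    """Randomly split (image, label) pairs so each label contributes an
    approximately equal number of images to train and validation; when a
    label has an odd number of images, the extra one goes to validation."""
    by_label = defaultdict(list)
    for image, label in samples:
        by_label[label].append(image)
    rng = random.Random(seed)
    train, val = [], []
    for label, images in by_label.items():
        rng.shuffle(images)
        half = len(images) // 2            # odd leftover falls to validation
        train += [(img, label) for img in images[:half]]
        val   += [(img, label) for img in images[half:]]
    return train, val

# 5 images of label 'a' -> 2 train / 3 validation; 4 of 'b' -> 2 / 2.
data = [(f"img{i}", "a") for i in range(5)] + [(f"jmg{i}", "b") for i in range(4)]
tr, va = balanced_split(data)
print(len(tr), len(va))  # 4 5
```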
5.2 Results
We applied our polyharmonic spline NAS method to the ResNet18 and BLResNext50 architecture search spaces. For each point in the spline, training and evaluation was conducted on half of the ImageNet22k dataset. Once the optimal configuration was determined by our search, that network was trained and evaluated on the full ImageNet22k dataset.
Point Type  Dimensions  Top1 Accuracy %  
conv c1  group g1  group g2  group g3  group g4  Measured  Predicted  
Initial  150  300  600  1200  2400  38.03   
32  32  32  32  32  10.33    
150  32  32  32  32  10.54    
32  300  32  32  32  15.46    
32  32  600  32  32  19.45    
32  32  32  1200  32  23.15    
32  32  32  32  2400  31.25    
75  150  300  600  1200  35.46    
32  300  600  1200  2400  38.03    
150  32  600  1200  2400  37.15    
150  300  32  1200  2400  36.76    
150  300  600  32  2400  34.40    
150  300  600  1200  32  29.04    
Incremental  50  116  330  1200  2400  37.31  41.02 
80  208  475  736  2400    38.92 
5.2.1 ResNet18 Search Space.
For the ResNet18 architecture we removed the max-pool layer and increased the stride of the first convolution to 4. We performed search over d = 5 dimensions: the number of filters in the first convolution (c1) and in the four groups of layers (g1 to g4). The minimum and maximum values for each dimension correspond to the extremes shown in Table 1, and the possible combinations spanning the entire search space number about 3 trillion. Using our spline NAS approach allows us to start from only an initial set of 13 support points (2d+3 with d = 5, as explained in Section 4.1) and then iterate from there with few additional points. This results in a tremendous computational gain for our NAS method, which allows search to be performed directly on large scale datasets, as opposed to traditional NAS approaches needing to resort to small proxy sets.
Table 1 shows the coordinates of the initial 13 support points used to span the first spline, together with two incremental points. An incremental point is a measurement at the optimum estimated from the prior set of points. The first prediction, based on the minimal point set, suggested a reasonable point, but its estimate of 41.02% is clearly very optimistic compared to the measured 37.31%. Adding a measurement at this point produces a new point with a prediction close to the best support point.
Figure 2 shows projections of the polyharmonic spline derived from the measured points shown in Table 1. The interpolation suggests that the g4 parameter is the dominant limiting factor, since it shows the steepest slope at the edge of the "box". This matches the algebraic interpretation: 22k classes could benefit from a higher dimensional feature space. The earlier layers/groups show maxima within the box for maximum values of the later layers, indicating that once the degrees of freedom of a later part of the network are saturated, adding more capacity to earlier layers becomes counterproductive.
Hence, we measured a configuration that roughly projects out the ratios of the optimum from the spline, with some adjustments to fit into available GPU memory. This achieved, all other hyperparameters remaining equal, a higher top1 accuracy. Since this network was significantly larger and hence may benefit from a different learning rate schedule, we performed 2 epochs of fine-tuning, which increased the accuracy further.
Being limited by GPU memory and compute resources, we performed one more experiment to increase the number of pieces in the piecewise linear function without increasing the model size significantly, by replacing the stride-2 convolution at the beginning with a basic block. A basic block consists of two convolutions, one of which has stride 2, which has a similar aperture and the same reduction of the activations. Indeed, this yielded the expected improvement in top1 accuracy with fine-tuning.
5.2.2 BLResNext50 Search Space
BigLittle Net [11] is a mechanism that splits each block within a deep network into multiple paths through which different resolutions of an input are passed. In our search space we considered two paths: the first, called the big branch, through which a downsized (by half) version of the input is passed; the second, called the little branch, which processes the input in its original resolution but with fewer kernels and layers, as controlled by the α and β parameters. The biglittle version of ResNext [11] with a depth of 50 is deeper and offers more alternate paths through groups than the basic ResNet18, and hence theoretically allows for more pieces in the piecewise linear function relative to the number of network parameters. Thus, this family of networks promises to achieve higher accuracy within a given GPU memory capacity, which was clearly the limiting factor in the ResNet18 case.
We defined our search space as spanning only three variables, hence reducing the number of measurements needed for optimization. We chose as parameters the α and β parameters of the biglittle structure and a multiplier w to the group width, which is uniformly applied across the network. The original group widths for BLResNext50 are 64, 128, 256, 512. With a bottleneck expansion factor of 4, this results in a 2048-dimensional feature space. The choice w = 2, for example, results in group widths 128, 256, 512, 1024 and a 4096-dimensional feature space. The total number of combinations in the search space is at least 539 (considering w only at discrete increments of 0.1), but the spline optimization needs only 2d+3 = 9 supporting measurements followed by a couple of additional ones.
Point type  Initial  Incremental
α  8  2  2  8  4  2  8  2  8  2
β  8  2  8  2  4  2  2  8  8  8
w (width multiplier)  1  1  1  1  1.5  2  2  2  2  3
Top1 Accuracy %  38.17  38.83  38.75  38.18  39.90  40.96  40.53  40.99  40.48  41.64
Table 2 shows that α and β have only a small influence on accuracy compared to w. Figure 3 shows the corresponding projection. The dependencies within the "box" are nearly linear and the minimum is located in a corner, clearly indicating that an increase in width has the best probability of increasing accuracy significantly. Hence, we measured w = 3, which indeed yielded a top1 accuracy of 41.64%. This was interesting also because the shape of the spline suggested incrementing the w variable beyond the initially designed range of the search space.
The total amount of single Nvidia Volta GPU hours needed for the NAS spline search is the accumulation of conducting evaluation (training and validation) on half of the ImageNet22K dataset for all the configurations corresponding to the initial support points plus 3 additional data points, with each individual point measurement taking a comparable number of GPU hours. For reference, the original reinforcement learning based NAS [53] method would require far more GPU hours.
The final recommended BL-Net architecture was trained and evaluated on the full ImageNet22K dataset using the Summit supercomputer, across multiple nodes with Volta GPUs and a large global batch size, using PyTorch distributed data parallel. Figure 4 shows how the top-1 accuracy climbed as learning progressed. On the way to its final accuracy it crossed two previously published results, as shown. Table 3 summarizes the comparison of our result against previous results as well as against a baseline SME-designed architecture based on BL-ResNext. The SME-designed architecture was a BL-ResNext-101 based model, in comparison to the BL-ResNext-50 based spline-recommended model. As can be seen, the spline-recommended architecture resulted in a significant jump in top-1 accuracy. This is the first published result to cross 40% overall top-1 accuracy on ImageNet22K.
Model                       Batch Size  GPUs  Top-1 (Top-5) Accuracy %  FLOPs (G)  Training Time (Hours)  Epochs
ResNet-101 [14]             5,120       256   33.8 (–)                  –          7                      –
WRN-50-4-2 [15]             6,400       200   36.9 (65.1)               –          –                      24
BL-ResNext-101 32x8d        6,528       204   39.7 (68.3)               11.25      16                     60
BL-ResNext-50 18x8d (ours)  6,528       204   40.03 (69.04)             17.88      15                     60
5.3 BL-ResNext on ImageNet1K
In order to investigate the influence of both group width and group depth in the Big-Little-ResNext family, we picked 8 parameter dimensions and experimentally verified the convergence of the spline method using the smaller ImageNet1K dataset. Four of the investigated parameters are the numbers of filters in the first layer of each of the four layer groups (the number of filters at the output of a group is this value increased by the expansion factor). The other four parameters are the group depths, i.e. the number of calls to make_layer within a basic block, for each of the four block groups.
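The eight search dimensions can be captured in a small configuration object. In the sketch below the names `SearchPoint` and `EXPANSION` are ours, not from the paper; it only illustrates how each group's output width follows from the searched base width via the expansion factor:

```python
from dataclasses import dataclass

EXPANSION = 4  # bottleneck expansion factor of ResNeXt-style blocks

@dataclass(frozen=True)
class SearchPoint:
    widths: tuple  # filters in the first layer of each of the four block groups
    depths: tuple  # number of make_layer calls per block group

    def group_output_channels(self):
        # each group's output width is its base width scaled by the expansion factor
        return tuple(w * EXPANSION for w in self.widths)

# Default BL-ResNext-50 configuration (last row of Table 4)
default = SearchPoint(widths=(64, 128, 256, 512), depths=(3, 4, 6, 3))
print(default.group_output_channels())  # (256, 512, 1024, 2048)
```

The 2048-dimensional final feature space of the default network falls out directly from the last group's width times the expansion factor.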
Point type   Widths w1–w4          Depths d1–d4    Top-1 Measured %  Top-1 Predicted %
Initial      32   32   32   32     2   2   2   2   52.95             –
             128  256  512  768    10  10  18  5   78.82             –
             80   160  320  480    5   5   5   3   77.61             –
             128  32   32   32     2   2   2   2   60.60             –
             32   256  32   32     2   2   2   2   65.53             –
             32   32   512  32     2   2   2   2   70.19             –
             32   32   32   768    2   2   2   2   69.84             –
             32   32   32   32     10  2   2   2   58.37             –
             32   32   32   32     2   10  2   2   58.83             –
             32   32   32   32     2   2   18  2   59.17             –
             32   32   32   32     2   2   2   5   55.16             –
             32   256  512  768    10  10  18  5   77.68             –
             128  32   512  768    10  10  18  5   78.18             –
             128  256  32   768    10  10  18  5   78.26             –
             128  256  512  32     10  10  18  5   77.78             –
             128  256  512  768    2   10  18  5   78.55             –
             128  256  512  768    10  2   18  5   78.40             –
             128  256  512  768    10  10  2   5   78.44             –
             128  256  512  768    10  10  18  2   79.27             –
Incremental  71   256  512  768    2   2   2   2   76.42             85.09
             91   72   512  768    6   7   10  2   77.76             80.88
             128  256  441  768    10  10  18  2   –                 79.29
BL-ResNext-50 default  64  128  256  512  3  4  6  3  77.02          75.50
Table 4 shows the 19 measured points that span the initial spline, plus two incremental points with a comparison of the prediction against a measurement. After the initial set of observations, we ran predictions of the spline on different points within the search space. As an example, consider the default BL-ResNext-50 configuration (last row of the table). Since it is located close to the center of the d-box, and hence close to a support point, the error between the top-1 accuracy predicted by the spline model and the measured performance is moderate, approximately 1.5 percentage points (77.02% measured versus 75.50% predicted).
The first incremental point is the minimum found within the d-box. It is located relatively far from the support points, hence its prediction is much less accurate. Iteratively adding such minima as support points tends to quickly improve the accuracy of predictions and thus leads to good network parameters. If the prediction quality does not improve, adding more support points in the region of interest, e.g. spanning a smaller d-box inside the first one, can improve the spline's predictive capabilities.
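The iterative refinement described above can be sketched as a loop that alternates fitting the spline, locating its predicted optimum inside the d-box, and measuring there. The helper name `spline_search`, the `measure` callback, and the optimizer choice (`differential_evolution`) are illustrative assumptions, not the authors' exact procedure:

```python
import numpy as np
from scipy.interpolate import RBFInterpolator
from scipy.optimize import differential_evolution

def spline_search(points, scores, bounds, measure, iters=3):
    """Iteratively refine a polyharmonic-spline surrogate of the accuracy.
    measure(x) trains/evaluates a configuration and returns its top-1 accuracy."""
    pts = [np.asarray(p, float) for p in points]
    acc = list(scores)
    for _ in range(iters):
        spline = RBFInterpolator(np.array(pts), np.array(acc),
                                 kernel='thin_plate_spline')
        # locate the predicted optimum (maximum accuracy) inside the d-box
        res = differential_evolution(lambda x: -spline(x[None])[0],
                                     bounds, seed=0, maxiter=50)
        pts.append(res.x)
        acc.append(measure(res.x))  # measure the suggested point, add as support
    return pts, acc
```

Each loop iteration costs exactly one real training run; all other evaluations are cheap spline queries.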
Adding this measurement to the base spline (which does not include the default BL-ResNext-50 point) delivers a new predicted minimum, the second incremental point. Adding the measurement for this second point to the spline delivers a new prediction whose accuracy closely matches the measured corner point of the d-box (79.29% predicted versus 79.27% measured). That suggests that this corner point is the optimum within this d-box.
The location of the minimum in a corner point suggests that better parameters may exist outside the interpolation d-box. Hence, we added a set of measurements expanding the d-box to a wider range of parameters for the widths of the convolution groups, and started a new iteration. Table 5 shows the additional points with their predictions and measurements. Not surprisingly, the point at the maximum edge of the new d-box already shows an improved final top-1 accuracy of 79.38%. Additional iterative points close the gap between prediction and measurement. We stopped the iteration when it became evident that the best corner point would remain very close to a settled maximum. The iteration also unveiled a smaller network with almost identical final accuracy (79.37%).
Point type   Widths w1–w4           Depths d1–d4    Top-1 Measured %  Top-1 Predicted %
Initial      256  512  768  1024    10  10  18  5   79.38             –
             256  32   32   32      2   2   2   2   64.91             –
             32   512  32   32      2   2   2   2   68.82             –
             32   32   768  32      2   2   2   2   70.92             –
             32   32   32   1024    2   2   2   2   70.51             –
             32   512  768  1024    10  10  18  5   78.12             –
             256  32   768  1024    10  10  18  5   78.71             –
             256  512  32   1024    10  10  18  5   79.01             –
             256  512  768  32      10  10  18  5   78.81             –
             256  512  768  1024    2   10  18  5   79.10             –
             256  512  768  1024    10  2   18  5   79.25             –
             256  512  768  1024    10  10  2   5   79.11             –
             256  512  768  1024    10  10  18  2   79.17             –
Incremental  210  357  433  500     3   5   13  2   78.83             81.78
             235  355  408  872     10  10  18  3   79.37             80.13
             130  289  417  489     3   4   9   3   78.47             79.93
             158  314  381  761     5   10  18  3   79.04             79.86
             240  400  620  532     10  8   18  2   79.16             79.76
             256  385  433  1023    10  3   18  5   79.31             79.60
             188  365  445  875     10  10  18  2   79.18             79.51
             256  293  363  1024    10  10  18  5   79.24             79.58
             180  284  569  614     10  10  18  3   –                 79.51
BL-ResNext-50 default  64  128  256  512  3  4  6  3  77.02           75.50
6 Conclusions
We described a novel NAS method based on polyharmonic splines that can perform search directly on large-scale, imbalanced target datasets. We demonstrated how most common operations in deep neural networks can be included as variables in the search space of a spline modeling the accuracy of a given architecture. The number of evaluations required during the search phase of our NAS approach is proportional to the number of operations in the search space, not to the number of possible values each operation could take, making the approach tractable at large scale. We demonstrated the effectiveness of our method on the ImageNet22K benchmark [16], achieving state-of-the-art top-1 accuracy. This result paves the way to applying polyharmonic-spline-based NAS to other architectures and operations within networks, potentially also including hyperparameters in the search space.
7 Acknowledgement
This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. It also used resources of the IBM T.J. Watson Research Center Scaling Cluster (WSC).
References
 [1] (2021) An image is worth 16x16 words: transformers for image recognition at scale. In Submitted to International Conference on Learning Representations, Note: under review. Cited by: §5.1.
 [2] (2018) Accelerating neural architecture search using performance prediction. In ICLR Workshops, Cited by: §2, §2.
 [3] (2018) Understanding and simplifying oneshot architecture search. In ICML, Cited by: §2.

 [4] (2013) Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In International Conference on Machine Learning, pp. 115–123. Cited by: §2.
 [5] (2017) Distributed learning of deep feature embeddings for visual recognition tasks. In IBM Journal of R and D, Cited by: §5.1.
 [6] (2019) Transfer nas: knowledge transfer between search spaces with transformer agents. arXiv preprint 1906.08102. Cited by: §1.
 [7] (2018) SMASH: oneshot model architecture search through hypernetworks. In ICLR, Cited by: §2.
 [8] (2019) ProxylessNAS: direct neural architecture search on target task and hardware. In ICLR, Cited by: §2.
 [9] (2019) Learnable embedding space for efficient neural architecture compression. arXiv preprint arXiv:1902.00383. Cited by: §2.
 [10] (2019) DATA: differentiable architecture approximation. In Advances in Neural Information Processing Systems, Cited by: §2.
 [11] (2019) Biglittle net: an efficient multiscale feature representation for visual and speech recognition. In ICLR, Cited by: Large Scale Neural Architecture Search with Polyharmonic Splines, §1, §3.2, §5.2.2, §5.
 [12] (2019) DetNAS: backbone search for object detection. In Advances in Neural Information Processing Systems, Cited by: §2.
 [13] (2014) Project adam: building an efficient and scalable deep learning training system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 571–582. Cited by: §5.1, §5.1.
 [14] (2017) PowerAI ddl. In arXiv preprint 1708.02188, Cited by: §5.1, Table 3.

 [15] (2017) Achieving deep learning training in less than 40 minutes on imagenet1k and best accuracy and training time on imagenet22k and places365. Note: https://bit.ly/2VdG5B7 Cited by: Large Scale Neural Architecture Search with Polyharmonic Splines, §1, §5.1, Table 3.
 [16] (2009) ImageNet: a large-scale hierarchical image database. In CVPR, Cited by: Large Scale Neural Architecture Search with Polyharmonic Splines, §6.
 [17] (2010) What does classifying more than 10,000 image categories tell us?. In ECCV, Cited by: §5.1.

 [18] (2015) Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Twenty-Fourth International Joint Conference on Artificial Intelligence, Cited by: §2.
 [19] (2019) Neural architecture search: a survey. JMLR 20 (55), pp. 1–21. Cited by: §1.
 [20] (2018) BOHB: robust and efficient hyperparameter optimization at scale. arXiv preprint arXiv:1807.01774. Cited by: §2.
 [21] (2019) Weight agnostic neural networks. In Advances in Neural Information Processing Systems, Cited by: §2.
 [22] (2019) NASfpn: learning scalable feature pyramid architecture for object detection. In CVPR, Cited by: §2.
 [23] (2016) Deep residual learning for image recognition. In CVPR, Cited by: Large Scale Neural Architecture Search with Polyharmonic Splines, §1, §3.2, §5.
 [24] (2019) Searching for mobilenetv3. In ICCV, Cited by: §2.
 [25] (2004) Multiresolution methods in scattered data modelling. Vol. 37, SpringerVerlag. Cited by: §4.1.
 [26] (2018) Neural architecture search with bayesian optimisation and optimal transport. In Advances in neural information processing systems, pp. 2016–2025. Cited by: §2.
 [27] (2009) Learning multiple layers of features from tiny images. Technical report Cited by: §1.
 [28] (2019) Autodeeplab: hierarchical neural architecture search for semantic image segmentation. In CVPR, Cited by: §2.
 [29] (2018) Progressive neural architecture search. In ECCV, Cited by: §2.
 [30] (2018) Hierarchical representations for efficient architecture search. In ICLR, Cited by: §2.
 [31] (2018) DARTS: differentiable architecture search. arXiv preprint 1806.09055. Cited by: §1, §2.
 [32] (2020) Neural architecture transfer. arXiv preprint 2005.05859. Cited by: §1.
 [33] (2016) Towards automaticallytuned neural networks. In Workshop on Automatic Machine Learning, pp. 58–65. Cited by: §2.
 [34] (2019) XNAS: neural architecture search with expert advice. In Advances in Neural Information Processing Systems, Cited by: §2.
 [35] (2018) Efficient neural architecture search via parameters sharing. In ICML, Cited by: §2.
 [36] (2019) Regularized evolution for image classifier architecture search. In AAAI, Cited by: §2.
 [37] (2014) Raiders of the lost architecture: kernels for bayesian optimization in conditional parameter spaces. arXiv preprint arXiv:1409.4011. Cited by: §2.
 [38] (2019) MnasNet: platformaware neural architecture search for mobile. In CVPR, Cited by: §2.

 [39] (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In ICML, Cited by: §2.
 [40] (2019) NAS-FCOS: fast neural architecture search for object detection. arXiv preprint 1906.04423. Cited by: §2.
 [41] (2019) XferNAS: transfer neural architecture search. arXiv preprint 1907.08307. Cited by: §1.
 [42] (2019) FBNet: hardwareaware efficient convnet design via differentiable neural architecture search. In CVPR, Cited by: §2, §4.
 [43] (2017) Aggregated residual transformations for deep neural networks. In CVPR, Cited by: §3.2.
 [44] (2019) Exploring randomly wired neural networks for image recognition. In ICCV, Cited by: §2.

 [45] (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500. Cited by: §1.
 [46] (2019) SNAS: stochastic neural architecture search. In ICLR, Cited by: §2.
 [47] (2020) Understanding and robustifying differentiable architecture search. In ICLR, Cited by: §2.
 [48] (2020) NASbench1shot1: benchmarking and dissecting oneshot neural architecture search. In ICLR, Cited by: §1, §2.
 [49] (2018) Graph hypernetworks for neural architecture search. arXiv preprint arXiv:1810.05749. Cited by: §2.
 [50] (2015) Poseidon: a system architecture for efficient gpubased deep learning on multiple machines. In arXiv preprint 1512.06216, Cited by: §5.1.
 [51] (2018) Practical blockwise neural network architecture generation. In CVPR, Cited by: §2.
 [52] (2020) EcoNAS: finding proxies for economical neural architecture search. In CVPR, Cited by: §2.
 [53] (2017) Neural architecture search with reinforcement learning. In ICLR, Cited by: §1, §2, §5.2.2.
 [54] (2018) Learning transferable architectures for scalable image recognition. In CVPR, Cited by: §2.