NeuNetS: An Automated Synthesis Engine for Neural Network Design

01/17/2019 ∙ by Atin Sood, et al. ∙ 0

Application of neural networks to a vast variety of practical applications is transforming the way AI is applied in practice. Pre-trained neural network models available through APIs or capability to custom train pre-built neural network architectures with customer data has made the consumption of AI by developers much simpler and resulted in broad adoption of these complex AI models. While prebuilt network models exist for certain scenarios, to try and meet the constraints that are unique to each application, AI teams need to think about developing custom neural network architectures that can meet the tradeoff between accuracy and memory footprint to achieve the tight constraints of their unique use-cases. However, only a small proportion of data science teams have the skills and experience needed to create a neural network from scratch, and the demand far exceeds the supply. In this paper, we present NeuNetS : An automated Neural Network Synthesis engine for custom neural network design that is available as part of IBM's AI OpenScale's product. NeuNetS is available for both Text and Image domains and can build neural networks for specific tasks in a fraction of the time it takes today with human effort, and with accuracy similar to that of human-designed AI models.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

AI is changing the way businesses work. However, it’s important to remember that every business has unique challenges to solve, and the range of use-cases for AI is constantly expanding. While pre-built AI models exist for certain scenarios, to try and meet the constraints that are unique to each application, AI teams will need to think about developing custom AI models of their own. Artificial neural networks are arguably the most powerful tool currently available to data scientists and businesses. However, only a small proportion of data scientists have the skills and experience needed to create a neural network from scratch, and the demand far exceeds the supply. As a result, getting a new neural network that is architecturally custom designed to meet the needs of that application, even to the proof-of-concept stage, requires a level of investment that most enterprises struggle to afford. Automation technologies that bridge this skills gap by automatically designing the architecture of neural networks for a given data are increasingly gaining importance. In this paper, we present NeuNetS : A Neural Network Synthesis engine for neural network design that is available as part of IBM’s AI OpenScale’s product. NeuNetS automatically configures itself to the needs of the user and the use case and helps reduce the complexity and skills required to build AI models, making data science teams more productive and enabling them to scale AI across their workflows. Overall, NeuNetS has two main stages: Coarse-grained synthesis and Fine-grained synthesis. Coarse-grained synthesis automatically optimizes and determines the overall architecture of the network: How many layers there should be, how they are connected, different architectural features like convolution layers and so on. The unique and novel step of fine-grained synthesis enables NeuNetS to take a deeper dive into each layer and optimizes the individual neurons and connection—for example, what kind of convolution filter should be applied, and which neurons and edges should be optimized. One of the critical breakthroughs that have enabled this capability is a very high-fidelity approach to performance estimation, which allows us to bypass real-time training and analysis and design neural networks automatically in a matter of hours—compared to the weeks or months that it might take a data scientist to train and optimize the AI model. NeuNetS is available for both Text and Image domains and can build neural networks for specific tasks in a fraction of the time it takes today, and with accuracy similar to that of human-designed AI models. The data science teams can then further fine tune the model, leading to greater productivity and cost-efficiency. NeuNetS is a novel tool for augmenting human expertise with powerful, AI-driven optimization capabilities.

The remainder of the paper is organized as follows. We first provide the underlying flexible architecture of NeuNetS in Sec. 3. Next, in Sec. 4.1, we give details behind two of the coarse grained neural architecture search engines that are key part of NeuNetS. In Sec. 4.2, we describe a unique set of fine-grained transformation to further optimize the Neural Network designs. Finally, in Sec. 5, we provide empirical results on several standard and real-world datasets.

2 Related Work

Evolutionary algorithms and reinforcement learning are currently the two state-of-the-art techniques used by neural network architectures search algorithms. With Neural Architecture Search [1], Zoph et al. demonstrated in an experiment over 28 days and with 800 GPUs that neural network architectures with performances close to state-of-the-art architectures can be found. In parallel or inspired by this work, others proposed to use reinforcement learning to detect sequential architectures [2], reduce the search space to repeating cells [3, 4] or apply function-preserving actions to accelerate the search [5].

Neuro-evolution dates back three decades. In the beginning it focused only on evolving weights [6] but it turned out to be effective to evolve the architecture as well [7]. Neuro-evolutionary algorithms gained new momentum due to the work by Real et al. [8]. In an extraordinary experiment that used 250 GPUs for almost 11 days, they showed that architectures can be found which provide similar good results as human-crafted image classification network architectures. Very recently, the idea of learning cells instead of the full network has also been adopted for evolutionary algorithms [9]. Miikkulainen et al. even propose to coevolve a set of cells and their wiring [10].

Other methods that try to optimize neural network architectures or their hyperparameters are based on model-based optimization

[11, 12, 13, 14] and Monte-Carlo Tree Search [15, 16, 17].

Various techniques exist which try to shorten the training time. One idea is based on the idea of terminating unpromising training runs early. The partially observed learning curve is used directly to decide to terminate a run early [18] or first extrapolated and then used [19, 20, 21]. Other methods are able to sample different architectures and then predict its likely performance. Peephole [22] predicts a network accuracy by only analyzing the network structure, however it works only on a fixed dataset test case. SMASH uses a hypernetwork to predict weights for an architecture without training and uses its validation performance as a proxy for its performance after training [23]. Others reduce the search time by sharing or reusing model weights [5, 24, 25, 26].

3 NeuNetS Architecture

3.1 Overview

The lifecycle of a NeuNetS project consists of a series of states or stages, as detailed in Fig. (1). During this lifecycle, the synthesis states are executed multiple times to explore/evolve, train, and evaluate different networks. Once stopping conditions, whether budgetary or algorithmic, are reached, the synthesis loop ends and final results are extracted to the user’s storage instance.

Figure 1: Execution Pipeline Operational States

The architectural implementation of NeuNetS consists of three main components: the service component, the core engine component, and the synthesizer component. The relations between these components and the required external services is illustrated in Fig. (2

). The service component includes the NeuNetS APIs and handles all incoming requests to the NeuNetS project. The core engine component maintains the state of the project and other relevant data. In each synthesis cycle, it obtains new architecture configurations from the synthesizer component and submits them to Watson Machine Learning

[27] for training. When the stopping conditions are reached, it stores the final models in the user’s cloud storage instance. The synthesizer component is a pluggable register of algorithms which use the state information passed from the engine to produce new architecture configurations. The rest of this section describes the functionality of these components in more detail.

Figure 2: NeuNetS Component Architecture

3.2 Service Component

The NeuNetS service component receives and manages all API requests and responses. These include initialization of a NeuNetS operation, obtaining the ongoing status, providing metrics, and prematurely stopping an operation. An incoming request to initialize and start an operation results in a preliminary validation of the request parameters. The service component performs a series of internal calculations to determine optimal operation parameters with which to initialize the NeuNetS synthesis. Once these preliminary checks are passed, the service component calls the core engine to initialize a synthesis pipeline.

One of the unique attributes of NeuNetS (not available in the beta feature) is an API providing advanced visualization techniques for synthesized models. As illustrated in Fig. (3), these tools allow users to interactively compare various metrics of the models that they train with NeuNetS.

Figure 3: NeuNetS visual interface to manually pick a best model. Left: list of models in training and trained models with measures of performance and architecture diagram. Right: Details to compare selected models in depth.

3.3 Core Engine Component

The NeuNetS core engine component is responsible for the overall lifecycle management of a NeuNetS operation. When it receives a request from the services component, it initializes a synthesis pipeline. The engine then calls the synthesizer component to obtain an initial set of architecture configurations. It processes these configurations and submits them to the Watson Machine Learning Service to be trained. This requires that the core engine define and compute all required resources (GPU/CPU, memory) for the training. Once the configuration has been trained for the requested amount time, the engine stores these intermediate configurations in an internal persistent storage instance. These configurations, along with their performance metrics from training, are then provided to the synthesizer component, which uses this information to produce a new set of configurations. This loop is performed multiple times to iteratively improve the network performance. At each step the engine checks the validity of the results from the last step, handles any error conditions, places relevant data into the persistent storage, and updates the state of the pipeline.

This process terminates either when the engine determines that specific budgetary objectives have been achieved, or the synthesis component receives a configuration whose performance meets its algorithmic requirements. A final completion operation releases all pipeline resources and facilitates the transfer of the final synthesized neural network model to a destination bucket inside of a valid IBM Cloud Object Storage [28] service instance.

The communication with persistent storage is a key operational aspect of the core engine. This decouples the overall state of the pipeline from the components and services performing the synthesis. This decoupling enables operational recovery in the event of service component failure, as new instances of the service component can immediately resume managing the lifecycle of the active pipeline, based on the stored pipeline state data.

3.4 Synthesizer Component

The NeuNetS synthesizer component provides a pluggable framework for multiple distinct model synthesis algorithms, each of which get registered with the NeuNetS core engine component. Each algorithm implements a common base interface that reflects the required interaction between itself and the core engine. An initialization contract in the interface provides the algorithm with overall synthesis parameters specific to a provided dataset, as well as runtime parameters related to an executing pipeline’s environmental budgetary considerations. The interface also defines the contracts for three important operations that govern a NeuNetS pipeline operation. The first operation encompasses the algorithm providing a configuration to the NeuNetS core engine that represents one or more data-inspired architectures of a deep layer neural network. The second operation is centered on providing the algorithm with all training results, along with any associated artifacts of consequence. The third operation is a validation operation, whereby the NeuNetS core engine queries the algorithm for an overall assessment of the pipeline progress based on the training results and additional state available to the algorithm.

4 Neural Network Synthesis Methods

4.1 Coarse grained synthesis

NeuNetS features three large scale architecture search algorithms: NCEvolve [26], TAPAS [29], and Hyperband++. These algorithms make a step forward with respect to the most advanced works in the literature, addressing fundamental problems such as dataset generality and performance scalability.

NeuNetS algorithms are designed to synthesize new models in a short and reasonable time, without using transfer-learning or pre-trained models. This allows us to explore a wide space of network architecture configurations, and fine-tune the model for the specific dataset provided by the user.

Being based on multiple optimization algorithms, NeuNetS can accommodate a wider range of model synthesis scenarios. In future releases, the user will not only be able to update data, but also to decide how much time and how many resources to allocate for the model synthesis, as well as optionally the maximum size of the model, and the target deployment platform. Based on these constraints, NeuNetS will select the best optimization strategy to serve back to the users the right models for their needs.

The portfolio of algorithms will be continuously extended including top works from the public community, as well as further advanced developments from IBM Research.

In the following paragraphs, we briefly recall the main technical features of our current portfolio of optimization algorithms.

4.1.1 NCEvolve

NCEvolve is a novel neuro-evolutionary technique to search for neural architectures without human interference. It assumes that a neural network architecture is a sequence of neuro-cells and keeps mutating them using function-preserving operations. This assumption has several advantages. First, it reduces the search space complexity. Second, these cells are possibly transferable and can be used in order to arbitrarily extend the complexity of the network. Mutations based on function-preserving operations guarantee better parameter initialization than random initialization such that less training time is required per network architecture.

Chen et al. [30] proposed a family of function-preserving network manipulations in order to transfer knowledge from one network to another. Suppose a teacher network is represented by a function where is the input of the network and are its parameters. Then an operation changing the network to a student network is called function-preserving if and only if the output for any given model remains unchanged:

(1)

Note that typically the number of parameters of and

are different. We will use this approach in order to initialize our mutated network architectures. Then, the network is trained for some additional epochs with gradient-based optimization techniques. Using this initialization, the network requires only few epochs before it provides decent predictions. We briefly explain the proposed manipulations and our novel contributions to it. Please note that a fully connected layer is a special case of a convolutional layer. For a more detailed description, we refer to

[26].

Convolutions in Deep Learning

Convolutional layers are a common layer type used in neural networks for visual tasks. We denote the convolution operation between the layer input with a layer with parameters by . Here, is the number of input channels, the input dimension, the kernel size and the number of output feature maps. Depthwise separable convolutions, or for short just separable convolutions, are a special kind of convolution factored into two operations. During the depthwise convolution a spatial convolution with parameters is applied for each channel separately. We denote this operation by using . This is in contrast to the typical convolution which is applied across all channels. In the next step the pointwise convolution, i.e. a convolution with a kernel, traverses the feature maps which result from the first operation with parameters . Comparing the normal convolution operation with the separable convolution , we immediately notice that in practice the former requires with more parameters than the latter which only needs .

Layer Widening

Assume the teacher network contains a convolutional layer with a kernel which is represented by a matrix where is the number of input feature maps and is the number of output feature maps or filters. Widening this layer means that we increase the number of filters to . Chen et al. [30] proposed to extend by replicating the parameters along the last axis at random. This means the widened layer of the student network uses the parameters

(2)

In order to achieve the function-preserving property, the replication of some filters needs to be taken into account for the next layer . This is achieved by dividing the parameters of by the number of times the -th filter has been replicated. If is the number of times the -th filter was replicated, the weights of the next layer for the student network are defined by

(3)

We extended this mechanism to depthwise separable convolutional layers. A depthwise separable convolutional layer at depth is widened as follows. The pointwise convolution for the student is estimated according to Equation 2. This results into replicated output feature maps. The depthwise convolution is identical to the one of the teacher network, i.e. the operations with parameters and . Independently of whether we used a depthwise separable or normal convolution in layer , widening it requires adaptations in a following depthwise separable convolutional layer. The parameters of the depthwise convolution are replicated according to the replication of parameters in the previous layer similar to Equation 2. Furthermore, the parameter of the pointwise convolution depend on the replications in the previous layers analogously to Equation 3.

Layer Deepening

Chen et al. [30] proposed a way to deepen a network by inserting an additional convolutional or fully connected layer. We complete this definition by extending it to depthwise separable convolutions.

A layer can be considered to be a function which gets as an input the output of the previous layer and provides the input for the next layer. A simple function-preserving operation is to set the weights of a new layer such that the input of the layer is equal to its output. If we assume

incoming channels and an odd kernel height and weight for the new convolutional layer, we achieve this by setting the weights of the layer with a

kernel to the identity matrix:

(4)

This operation is function-preserving and the number of filters is equal to the number of input channels. More filters can be added by layer widening, however, it is not possible to use less than

filters for the new layer. Another restriction is that this operation is only possible for activation functions

with

(5)

The ReLU activation function

fulfills this requirement.

We extend this operation to depthwise convolutions. The parameters of the pointwise convolution are initialized analogously to Equation 4 and the depthwise convolution is set to one:

(6)
(7)

This initialization ensures that both, the depthwise and pointwise convolution, just copy the input. New layers can be inserted at arbitrary positions with one exception. Under certain conditions an insertion right after the input layer is not function-preserving. For example if a ReLU activation is used, there exists no identity function for inputs with negative entries.

Kernel Widening

Increasing the kernel size in a convolutional layer is achieved by padding the tensor using zeros until it matches the desired size. The same idea can be applied to increase the kernel size of depthwise separable convolution by padding the depthwise convolution with zeros.

Insert Skip Connections

Many modern neural network architectures rely on skip connections [31]. The idea is to add the output of the current layer to the output of a previous. One simple example is

(8)

Therefore, we propose a function-preserving operation which allows inserting skip connection. We propose to add layer(s) and initialize them in a way such that the output is 0 independent on the input. This allows to add a skip because now adding the output of the previous layer to zero is an identity operation. A new operation is added setting its parameters to zero, , achieving a zero output. Now, adding this output to the input is an identity operation.

Branch Layers
Figure 4: Visualization of branching the colored layer and insert a convolution into the left branch. Same colored circles represent identical feature maps. Circles without filling can have any value and are not important for the visualization. Activation functions are omitted to avoid clutter.

We also propose to branch layers. Given a convolutional layer it can be reformulated as

(9)

where merge concatenates the resulting output. The student network’s parameters are defined as

This operation is not only function-preserving, it also does not add any further parameters and in fact is the very same operation. However, combining this operation with other function-preserving operations allows to extend networks by having parallel convolutional operations or add new convolutional layers with smaller filter sizes. In Figure 4 we demonstrate how to achieve this. The colored layer is first branched and then a new convolutional layer is added to the left branch. In contrast to only adding a new layer as described in Section 4.1.1, the new layer has only two output channels instead of three.

Multiple In- or Outputs

All the presented operations are still possible for networks where a layer might have inputs from different layers or provide output for multiple outputs. In that case only the affected weights need to be adapted according to the aforementioned equations.

Evolution of Neuro-Cells

The very basic idea of our proposed cell-based neuro-evolution is the following. Given is a very simple neural network architecture which contains multiple neuro-cells (see Figure 5). The cells itself share their structure and the task is to find a structure that improves the overall neural network architecture for a given data set and machine learning task. In the beginning, a cell is identical to a convolutional layer and is changed during the evolutionary optimization process. Our evolutionary algorithm is using tournament selection to select an individual from the population: randomly, a fraction of individuals is selected from the population. From this set the individual with highest fitness is selected for mutation. We define the fitness by the accuracy achieved by the individual on a hold-out data set. The mutation is selected at random which is applied to all neuro-cells such that they remain identical. The network is trained for some epochs on the training set and is then added to the population. Finally, the process starts all over again. After meeting some stopping criterion, the individual with highest fitness is returned.

Figure 5: Neural network template as used in our experiments.
Mutations

All mutations used are based on the function-preserving operations introduced in the last section. This means, a mutation does not change the fitness of an individual, however, it will increase its complexity. The advantage over creating the same network structure with randomly initialized weights is obviously that we start with a partially pretrained network. This enables us to train the network in less epochs. All mutations are applied only to the structure within a neuro-cell if not otherwise mentioned. Our neuro-evolutional algorithm considers the following mutations.

Insert Convolution A convolution is added at a random position. Its kernel size is , the number of filters is equal to its input dimension. It is randomly decided whether it is a separable convolution instead.

Branch and Insert Convolution A convolution is selected at random and branched according to Section 4.1.1. A new convolution is added according to the “Insert Convolution” mutation in one of the branches. For an example see Figure 4.

Insert Skip A convolution is selected at random. Its output is added to the output of a newly added convolution (see “Insert Convolution”) and is the input for the following layers.

Alter Number of Filters A convolution is selected at random and widened by a factor uniformly at random sampled from . This mutation might also be applied to convolutions outside of a neuro-cell.

Alter Number of Units Similar to the previous one but alters the number of units of fully connected layers. This mutation is only applied outside the neuro-cells.

Alter Kernel Size Selects a convolution at random and increases its kernel size by two along each axis.

The motivation of selecting this set of mutations is to enable the neuro-evolutionary algorithm to discover similar architectures as proposed by human experts. Adding convolutions allows to reach popular architectures such as VGG16 [32], combinations of adding skips and convolutions allow to discover residual networks [31]. Finally the combination of branching, change of kernel sizes and addition of (separable) convolutions allows to discover architectures similar to Inception [33], Xception [34] or FractalNet [35].

The optimization is started with only a single individual. Then always two individuals are selected with replacement based on the previously described tournament selection process and trained in parallel.

4.1.2 Tapas

Figure 6: Schematic TAPAS workflow. First row: the Dataset Characterization (DC) takes a new, unseen dataset and characterizes its difficulty by computing the Dataset Characterization Number (DCN). This number is then used to select a subset of experiments executed on similarly difficult datasets from the Lifelong Database of Experiments (LDE). Subsequently, the filtered experiments are used to train the Train-less Accuracy Predictor (TAP), an operation that takes up to a few minutes. Second row: the trained TAP takes the network architecture structure and the dataset DCN and predict the peak accuracy reachable after training. This phase scales very efficiently in a few seconds over a large number of networks.

TAPAS is a framework that runs large scale architecture searches of thousands of networks in a few minutes on CPU. We achieve this with a novel deep neural network accuracy predictor, that estimates in fractions of a second classification performance for unseen input datasets, without training. In contrast to previously proposed approaches, our prediction is not only calibrated on the topological network information, but also on the characterization of the dataset-difficulty which allows us to re-tune the prediction without any training. The TAPAS framework, depicted in Figure 6, is built on three main components:

  1. Dataset Characterization (DC): Receives an unseen dataset and computes a scalar score, namely the Dataset Characterization Number (DCN) [36], which is used to rank datasets;

  2. Lifelong Database of Experiments (LDE): Ingests training experiments of NNs on a variety of image classification datasets executed inside the TAPAS framework;

  3. Train-less Accuracy Predictor (TAP): Given an NN architecture and a DCN, it predicts the potentially reachable peak accuracy without training the network.

In the following we will detail each of the main components.

Dataset characterization (DC)

The same CNN can yield different results if trained on an easy dataset (e.g., MNIST [37]) or on a more challenging one (e.g., CIFAR-100 [38]), although the two datasets might share features such as number of classes, number of images, and resolution. Therefore, in order to reliably estimate a CNN performance on a dataset we argue that we must first analyze the dataset difficulty. We compute the DCN by training a probe net to obtain a dataset difficulty estimation [36]. We use the DCN for filtering datasets from the LDE and directly as input score in the TAP training and prediction phases as described in Section 4.1.2.

DCN computation Prob nets are modest-sized neural networks designed to characterize the difficulty of an image classification dataset [36]. We compute the DCN as peak accuracy, ranged in , obtained by training the Deep normalized ProbeNet on a specific dataset for ten epochs. The DCN calculation cost is low due the following reasons: (i) Deep norm ProbeNet is a modest-size network, (ii) the characterization step is performed only once at the entry of the dataset in the framework (the LDE stores the DCN afterwards), (iii) the DCN does not require an extremely accurate training, thus reducing the cost to a few epochs, and (iv) large datasets can be subsampled both in terms of number of images and of pixels.

Lifelong database of experiments (LDE)

LDE is a continuously growing DB, which ingests every new experiment effectuated inside the framework. An experiment includes the CNN architecture description, the training hyper-parameters, the employed dataset (with its DCN), as well as the achieved accuracy.

Figure 7: List of image classification datasets used for characterization. The datasets are sorted by the DCN value from the easiest (left) to the hardest (right).

LDE initialization At the very beginning, the LDE is empty. Thus we perform a massive initialization procedure to populate it with experiments. For each available dataset in Figure 7 we sample 800 networks from a slight variation of the space of MetaQNN [2]

. For convolution layers we use strides with values in

, receptive fields with values in , padding in

and whether is batch normalized or not. We also add two more layer types to the search space: residual blocks and skip connections. The hyperparameters of the residual blocks are the receptive field, stride and the repeat factor. The receptive field and the stride have the same bounds as in the convolution layer, while the repeat factor varies between 1 and 6 inclusively. The skip connection has only one hyperparameter, namely the previous layer to be connected to.

To speed up the process, we train the networks one layer at a time using the incremental method described in [39]. In this way we obtain the accuracies of all intermediary sub-networks at the same cost of the entire one. To facilitate the TAP, we train all networks with the same hyper-parameters, i.e., same optimizer, learning rate, batch size, and weights initiallizer. Although the fixed hyper-parameter setting seems a strong limitation and might limit peak accuracy by a few percent, it is enough to trim poorly performing networks and, in the case of an architecture search, to fairly rank competitive networks, the performance of which can later be optimized further. As data augmentation we use standard horizontal flips, when possible, and left/right shifts with four pixels. For all datasets we perform feature-wise standardization.

This paper LDE initialization takes 18 months on a single P100 GPU. This number can be scaled down embarrassingly with the number of GPUs. It must also be considered that, even though the time spent to generate the LDE is comparable to the time of manual engineering search of hyperparameters, the LDE can then be employed in architecture searches for multiple datasets at no additional cost. Moreover, in an industrial environments, pre-existing runs on technical propriertary-datasets can be used to heat-up the LDE quickly.

LDE selection Let us consider an LDE populated with experiments from  different datasets , with . Given a new input dataset  and its corresponding characterization , the LDE block returns all experiments performed with datasets that satisfy the following relation

(10)

where is a predefined threshold that, in our experiments, is set to 0.05.

Train-less accuracy predictor (TAP)

TAP is designed to perform fast and reliable CNN accuracy predictions. Compared to Peephole [22], TAP leverages knowledge accumulated through experiments of datasets of similar difficulty filtered from the LDE based on the DCN. Additionally, TAP does not first analyze the entire NN structure and then makes a prediction, but instead performs an iterative prediction as depicted in Figure 8. In other words, it aims to predict the accuracy of a sub-network , assuming the accuracy of the sub-network

is known. The main building elements of the predictor are: (i) a compact encoding vector that represents the main network characteristics, (ii) a quickly-trainable network of LSTMs, and (iii) a layer-by-layer prediction mechanism.

Figure 8: Encoding vector structure and its usage in the iterative prediction. a) The encoding vector contains two blocks: -layer information and from input to -layer sub-network information. b) The encoding vector is used by the TAP following an iterative scheme. Starting from Layer 1 (input) we encode and concatenate two layers at a time and feed them to the TAP. In the concatenated vector, the Accuracy field of is set to the predicted accuracy obtained from the previous TAP evaluation, whereas the one of corresponding to is always set to zero. For the input layer, we set to , where is the number of classes, assuming a random distribution. The final predicted accuracy is the accuracy of the complete network.

Neural network architecture encoding Similar to Peephole, TAP employs a layer-by-layer encoding vector as described in Figure 8. Unlike Peephole, we encode more complex information of the network architecture for a better prediction.

Let us consider a network with  layers, being the -th layer counting from the input, with . We define a CNN sub-network as with . Our encoding vector contains two types of information as depicted in Figure 8 a): (i) -th layer information and (ii)  sub-network information. For the current -th layer we make the following selection of parameters: Layer type

is a one-hot encoding that identifies either convolution, pooling, batch normalization, dropout, residual block, skip connection, or fully connected. In future we will include latest motifs present in literature such as DenseNets

[40] or AmoebaNets [41]. Note that for the shortcut connection of the residual block we use both the identity and the projection shortcuts [42]. The projection is employed only when the residual block decreases the number of filters as compared to the previous layer. Moreover, as compared to [22], our networks do not follow a fixed skeleton in the convolutional pipeline, allowing for more generality. We only force a fixed block at the end, by using a global pooling and a fully connected layer to prevent networks from overfitting [43].

The ratio between the output height and input height of each layer accounts for different strides or paddings, whereas the ratio between the output depth and input depth accounts for modifications of the number of kernels. The number of weights specifies the total of learnable parameters in . This value helps the TAP differentiate between layers that increase the learning power of the network (e.g., convolution, fully connected layers) and layers that reduce the dimensionality or avoid overfitting (e.g., pooling, dropout). In the second part of the encoding vector, we include: Total number of layers, counting from input to , Inference FLOPs and Inference memory that are an accurate estimate of the computational cost and memory requirements of the sub-network, and finally Accuracy, which is set either to 1/, for the first layer, where is the number of classes to predict, zero for prediction purposes, or a specific value

that is obtained from the previous layer prediction. Before training, we perform a feature-wise standardization of the data, meaning that for each feature of the encoding vector, we subtract the mean and divide by the standard deviation.

TAP architecture TAP is a neural network consisting of two stacked LSTMs of 50 and 100 hidden units, respectively, followed by a single-output fully connected layer with sigmoid activation. The TAP network has two inputs. The first input is a concatenation of two encoding vectors corresponding to layer and , respectively. This input is fed into the first LSTM. The second input is the DCN and is concatenated with the output of the second LSTM and then fed into the fully connected layer.

TAP training TAP requires a significant amount of training data to make reliable predictions. The LDE provides this data as described in Section 4.1.2. As mentioned above, all our generated networks are trained in an incremental fashion, as presented in [39], meaning that for each network of length we train all intermediary sub-networks with and save their performance . We encode each set of two consecutive layers and following the schema detailed in 4.1.2, setting the accuracy field in the encoding vector of to , which was obtained through training, and aiming to predict .

TAP is trained with RMSprop 

[44], using a learning rate of , a HeNormal weight initialization [45], and a batch size of 512. As the architecture of the TAP is very small, the training process is of the order of a few minutes on a single GPU device. Moreover, the trained TAP can be stored and reapplied to other datasets with similar DCN numbers without the need for retraining.

TAP prediction TAP employs a layer-by-layer prediction mechanism. The accuracy  of the sub-network  predicted by the previous TAP evaluation is subsequently fed as input into the next TAP evaluation, which returns the predicted accuracy  of the sub-network . This mechanism is described more in detail in Figure 8 b).

4.1.3 Hyperband++ Engine:

The original Hyperband algorithm:

Hyperband [18] proposed by Li et al., speeds up random search by using early stopping strategy to allocate resources adaptively. It is easy to use and of good performance, lots of work have based on it. [46] replaces the random selection of configurations at the beginning of each Hyperband iteration by using Tree Parzen Estimator (TPE) [47]. In order to adequately explore large hyper-parameters spaces, [48] considers the massive parallel hyper-parameters search, and scales linearly with the number of workers in distributed settings as well as converges to a high quality configuration. However, all these methods focus on hyper-parameters tuning. In our work, we extend the Hyperband to support joint neural network search and hyper-parameters search.

Model Representation:

Effective model representation is necessary to link Hyperband and NAS (Neural Architecture Search). [9] shows that with an effective model representation, even random search can achieve good performance. In our work, we support four model representations: plain chain structure, skip chain structure, multi-branch structure and hierarchy structure. Plain chain structure is shown in Fig. 9. The architecture includes one or more than one components sequentially connected, each component ends with a pooling layer. While if there is no pooling layer in the architecture, it is deemed as of one component. In every component, there will be several convoluational stacks, with attributes of kernel size, type, and output channel numbers. Using this representation, chain structure network and Hyperband search space can be an one to one mapping. As shown in Fig.10, Skip chain structure is similar to plane chain structure where only skip pattern is added. Here we make a constrain that skipping only occur within the component. Multi-branches structure in Fig. 11 is introduced in [24]. And it is widely used in the neural network search strategy [25] for cell based search. Hierarchy structure in Fig.12 is proposed by [9], which defines the three-level hierarchical architecture. See the bottom row, the level-1 primitive operations like convoluational layer, pooling layer etc. are assembled to form a level-2 motif; again various level-2 motifs are assembled to form a level-3 motif as shown in top row in figure. Our work extend the Hyperband to support these four kinds of structures to do the neural network search, we call it Hypterband++. Because Hyperband also have the intrinsic capability to search learning related hyper-parameters like learning rate, weight decay, momentum, our proposed Hyperband++ can do joint search for both neural network and hyper-parameters.

Figure 9: Example of chain structure representation supported by Hyperband++
Figure 10: Example of skip structure representation supported by Hyperband++
Figure 11: Example of multi-branches based structure representation [24] supported by Hyperband++
Figure 12: Example of hierarchy structure representation [9] supported by Hyperband++
Meta-learning for Hyperband:

As we known, most existing Neural network approaches take considerable long time for model searching. Thus there are lots of methods target high efficient NAS, such as weight sharing [24], one-shot model [9], multi-task Bayesian [49] and performance prediction[50]. In our work, we propose a method to reuse the neural network or part of the neural network to speed up the searching process. The method is tailored to the setting whereby the datasets come sequentially and one need to search model for the new arrival datasets efficiently. First, the meta-features of datasets can be extracting, then we implement the virtual dataset layer and group the datasets based on the meta-features at the first step. Accordingly, we follow the general idea of recycle and reuse, such that the pre-searched models i.e. architectures and hyper-parameters can be used for the new arrival datasets. It should be noted that the meta-learning technique can also be coupled with most NAS methods (not limited on Hyperband) in an out-of-box fashion.

Hyperband for Object Detection:

Unlike image classification, object detection also considers informative region selection and feature extraction besides classification. Traditional object detection methods are built on handcrafted features and shallow trainable architectures. However, their performance is limited as low-level features can not represent the high-level context from object detectors effectively. Thanks to the rapid development in deep learning, semantic, high-level features can be learned. There are mainly two types of frameworks in generic object detection: region proposal based methods, which include RCNN

[51], FRCNN [52], Faster R-CNN [53], FPN [54] and Mask R-CNN [55], etc.; Regression/Classification based methods, which contain YOLO [56], SSD [57], YOLOv2 [58], etc. Empirical results show the performance of object detection for different datasets is highly depended on the pre-trained network and lots of hyper-parameters. In this section, we describe the ways to apply Hyperband for object detection.

Hyper-parameters tuning for Object Detection: Comparing with image classification, there are more hyper-parameters in object detection applications. Besides learning process related hyper-parameters like learning rate, decay policy, momentum, etc., there are also object detection related hyper-parameters including anchor size, aspect, pre-trained model categories, loss weights, fraction of foreground, overlap for positive RPN, bbox threshold, FPN level, etc.. Hyperband can be used to deal with tuning these kinds of hyper-parameters efficiently.

Meta-learning for high efficient Hyper-parameters tuning: By using the methods described in 4.1.3, hyper-parameters tuning for object detection can be speeded up by grouping the similar datasets with the same set or subset of hyper-parameters. In object detection, there are also lots of works by adopting the concept of groups. [59]

uses a biasing sampling to match the statistics of the ground truth bonding boxes with K-means clustering, while

[60] proposes a subcategory-aware RPN. We leave more effective groups construction methods beyond hyper-parameter sharing for future work.

Architecture Search for Object Detection:

Considering the object detection pipeline, with newly-searched network architecture, it always needs pre-train on ImageNet to obtain the corresponding reward values, which makes it impossible to search neural network for object detection. An alternative way is to use transferable network to replace the existing pre-trained networks like Inceptions, resNet, VGG and ZF.

[9] shows that the auto-searched network for CIFAR10 can be used as pre-trained network of Faster-RCNN for object detection. In our work, we first use Hyperband to search the optimal architectures for different datasets for image classfication task, and then plug in these architectures pre-trained on ImageNet into Faster-RCNN or FPN pipeline. By using the meta-learning proposed in section 4.1.3, the proper pre-trained network are chosen based on the input dataset.

4.2 Fine grained synthesis

Broadly, we describe a supervised learning algorithm that serves as an optimization step towards automatically generating a neural network model. Specifically, given an input dataset with classification labels, we will describe an algorithm to incrementally add new connections and a small number of new trainable variables within a neural network to improve the prediction accuracy of the neural network.

There are a lot of techniques that use template layers such as convolution, fully-connected, max-pool, avg. pool, etc to automatically synthesize neural networks. These techniques work well for pre-processed and cleaned datasets but due to their parameter size have a tendency to over-fit to the training data. There is another set of techniques that explores more ”fine-grained” connections within neural networks. This class of techniques are more along the lines of the proposed technique. The technique proposed in [Filter-Shape] computes a co-variance metric to determine which features (inputs) that should be combined to create a filter.

The filter shaping approach is restricted by the computational complexity of computing co-variance matrix. Hence the size of the neighborhood that can be searched using this technique is smaller than the proposed technique. Secondly, the filter shaping paper looks for features with the most correlation. However, it is unclear whether such a metric is fundamentally necessary for good generalization.

Our technique uses an evolutionary algorithm for building a custom convolution filter. The algorithm has five phases - 0) Given a network that has been trained until a certain early stopping criteria is met, the 0th or initialization phase involves selecting a subset of neurons that are ”important” for the given dataset and classification task . If there is no network to start with then a single fully connected layer connecting the inputs to the outputs is used as a starting network. 1) The first phase involves growing the network by adding a layer of dense connections to this subset of neurons. The initial values of the new connections are chosen such that the output neuron values are not perturbed. Doing so ensures that the new connections do not cause the network to forget what has been learnt. The newly grown network is then re-trained for a few epochs. 2) The second phase involves iteratively pruning network connections in the newly grown layer that show the least change from their initialized values. During this pruning stage the network is trained for a few epochs after every pruning step. 3) The third phase involves merging connections in the newly grown layer into ”k” weight buckets. The degree of similarity of input weight distribution determines whether the connections are merged into a single bucket or not. The ”k” buckets then become the ”k” custom filters of the newly grown layer. 4) The fourth and final phase involves re-initializing the weights of the pruned and merged network so that the output neuron values are unperturbed from the values after phase (i). The new network is then retrained until the early stopping criteria is met.

The proposed algorithm learns the shape of the custom filters from the evolution of the weight values on the connections. Due to this it is able to leverage the acceleration in training offered by specialized hardware such as GPUs. Thus, in comparison with techniques such as [Filter-Shape] and [Grow-Prune] the proposed approach is much faster. Another advantage of this technique over [Filter-Shape] is that it offers higher accuracy improvement with the same increase in network size, which translates to overall better network quality.

The network shown in Phase 0 represents either: a) The final fully connected layer of an existing network that is either manually designed or auto-generated from another neural architecture search algorithm. or b) As a more general application of the proposed technique, the blue input neurons can represent a subset of ”important” neurons from an existing neural network. If an initial network is not available, the blue input neurons can also represent all or a subset of input features of the given dataset. In this case, a network is initialized with a fully connected layer connecting blue input neurons to the red output neurons.

Assuming that there are ’n’ input neurons and ’m’ output neurons, the operation performed in the fully connected layer can be expressed as:

y = Act(W*x + B)

where ’x’ is the ’n’-dimensional vector containing values of the input neurons, y is the ’m’-dimensional vector containing values of the output neurons, W is the ’mxn’-dimensional matrix containing weight variables, B is the ’m’-dimensional vector containing bias variables, and Act is an activation function for e.g., Relu, Tanh, Sigmoid, etc.

The initial values of W and B are obtained by training the initial network until the early stopping criteria is met.

In phase 1, the network is grown by adding a hidden layer containing ’l’ hidden neurons. The layer connecting input neurons to the hidden neurons can either be fully connected or selectively connected to combine input neurons selectively into a hidden neuron. The operation performed by this layer can be expressed as:

z = Act(W’*x + B’)

where z is an ’l’-dimensional vector containing values of the hidden neurons, W’ is an ’lxn’-dimensional weight matrix, B’ is an ’l’-dimensional bias vector, Act is the activation function. In the case where the layer is selectively connected, the missing connections can be represented as zeros in the weight matrix.

The layer connecting hidden neurons to the output neurons is a fully connected layer. The operation performed by this layer can be expressed as:

y = Act(W”*z + B”)

where y is an ’m’-dimensional vector containing values of the output neurons, W” is an ’mxl’-dimensional weight matrix, B” is an ’m’-dimensional bias vector, Act is the activation function. The initialization of the two layers in phase 1 has to be carefully determined to ensure continuity in training an evolving network in phase 0. To achieve this, the initial values for [W’,B’] and [W”, B”] are derived from the trained values of [W, B] at the end of phase 0. The initial values satisfying this constraint can be achieved in many different ways, for e.g., assume a blue input layer neuron gets connected to 3 neurons in the hidden layer. The weight variable for these neurons can be initialized to w1, w2, w3 such that sum(w1, w2, w3) = 1. The weight variable for connections from other neurons in the input layer to these 3 neurons can be initialized to 0. Such an initial value assignment ensures continuity in the training process.

In phase 2 of the optimization, connections in the layer between the input and hidden layer are pruned based on a certain pruning metric. Examples of metrics for pruning include value of weight variable, absolute value of weight variable, magnitude of change in weight value over several epochs. Pruning can happen either as a single step or in multiple steps applied iteratively. Between two pruning steps, the network is trained for a few epochs for the values to adjust for the pruned variables. At the end of pruning, N input connections are retained for each hidden neuron.

In phase 3 of the optimization the N input weight variables for the neurons in the hidden layer are merged into ’k’ buckets of N weight variables each. Merging of weight variables takes place based on the similarity of shape of the distribution of the weight variables. One way of measuring similarity is to use the L-2 distance of the normalized values of weight variables. For instance, [1.2, 0.6, 0.3] has a shape similar to [2, 1, 0.5] than to [1, 0.9, 0.8].

Finally, in phase 4, the merged and pruned network weights and bias values are re-initialized such that the values of output neurons are unperturbed from phase 0. The re-initialized network is then trained until the early stopping criteria is met.

5 Experimental Evaluation

We tested the NeuNetS Framework on various benchmark datasets for image and text classification.

All images are normalized by subtracting the mean and dividing by the standard deviation. For images with resolution higher than 64x64, they are scaled to 64x64 for NCEvolve. For TAPAS, images are always scaled to 32x32. The maximum GPU budget per dataset is divided into three categories: low, medium and high. A dataset is assigned to one category based on the number of examples. All datasets with at most 10K examples get a low GPU budget of 2 hours. Datasets with at most 75K examples get the medium budget of 5 hours. Finally, all other datasets get the high budget of at most 16 hours. In contrast to the literature in the domain of automated architecture search, this budget contains both the search and training time. State-of-the-art methods use double this budget only for the search followed by an expensive post-processing [24]. We evaluate NeuNetS on 12 image classification benchmarks and report the results in Table 1.

As we all know, machine learning models cannot accept text as input. They only work with integers or floats. In order to overcome this, a standard practice is to tokenize the training data and identify the most frequent words (excluding stop words) and map them to integers. To elaborate further, the most common word would be given an integer representation of 0, the second most common word a representation of 1, and so on. All the words outside of the top are replaced by an unknown token UNK. The model requires input of fixed dimensions. thus, the maximum number of tokens MAX in the input in predetermined. If the number of words in an instance is greater than the maximum imposed, the instance is truncated to MAX length. If the number of words is less than MAX, we pad the sentence with UNK tokens to meet the desired length. The weights of the first layer of the deep learning model, also known as the embedding layer, are initialized with the word embedding matrix of the top K words in the training data. The row of the embedding matrix represents the word embedding (obtained from GloVe [61] or Word2vec [62, 63, 64]) of the word whose integer mapping is .

Based on similar criteria, we also impose maximum GPU budget for the synthesis of text classifiers. Therefore, we assign 2 hours for a dataset comprising of at most 250K examples, 5 hours for those up to 2M examples and a maximum of 16 hours for the rest. We evaluate NeuNets on 9 text classification datasets and report the results in Table 

2.

Dataset Cls Examples Error Time Params
Caltech-256 [65] 257 31K 48.56 5.0 5.78M
CIFAR-10 [66] 10 60K 6.32 3.6 3.68M
CIFAR-100 [66] 100 60K 27.79 3.9 9.60M
Fashion [67] 10 70K 4.51 3.2 5.73M
Flowers-5 [68] 5 4K 16.41 2.0 4.57M
Flowers-102 [68] 102 2K 54.02 1.2 3.35M
Food-101 [69] 101 101K 38.12 11.9 7.38M
GTSRB [70] 43 52K 3.52 4.1 3.05M
MNIST [71] 10 70K 0.64 2.5 3.74M
Quick, Draw! [72] 345 380K 27.34 16.0 2.58M
STL-10 [73] 10 13K 25.14 1.8 6.68M
SVHN [74] 10 99K 3.37 12.6 4.83M
Table 1: Results on various image classification benchmarks with required training time in GPU hours.
Dataset Cls Examples Error Time
Cola [75] 2 9K 29.60 1.1
IMDB Sentiment [76] 2 22K 12.00 0.7
Rotten TMC [77] 5 140K 31.51 0.9
SMS Spam [78] 2 5K 0.54 0.2
Snips [79] 7 2K 0.00 0.2
Stanford Sentiment [80] 6 215K 31.17 1.0
TREC [81] 6 5K 11.62 0.3
Yelp [82] 5 52K 40.45 1.0
Youtube Spam [83] 2 2K 3.05 0.2
Table 2: Results on various text classification benchmarks with required training time in GPU hours.

References