Real-time Federated Evolutionary Neural Architecture Search

03/04/2020 ∙ by Hangyu Zhu, et al. ∙ 0

Federated learning is a distributed machine learning approach to privacy preservation, and two major technical challenges prevent a wider application of federated learning. One is that federated learning raises high demands on communication, since a large number of model parameters must be transmitted between the server and the clients. The other challenge is that training large machine learning models such as deep neural networks in federated learning requires a large amount of computational resources, which may be unrealistic for edge devices such as mobile phones. The problem becomes worse when deep neural architecture search is to be carried out in federated learning. To address the above challenges, we propose an evolutionary approach to real-time federated neural architecture search that not only optimizes the model performance but also reduces the local payload. During the search, a double-sampling technique is introduced, in which, for each individual, a randomly sampled sub-model of a master model is transmitted to a number of randomly sampled clients for training without reinitialization. This way, we effectively reduce the computational and communication costs required for evolutionary optimization and avoid large performance fluctuations of the local models, making the proposed framework well suited for real-time federated neural architecture search.




I Introduction

Standard centralized machine learning methods require collecting training data from distributed users and storing it on a single server, which incurs a high risk of leaking users' private information. Therefore, a distributed approach called federated learning [31] was proposed to preserve data privacy, enabling multiple local devices to collaboratively train a shared global model while the training data remain on the edge devices. Consequently, the central server has no access to the private raw data and client privacy is protected.

However, federated learning demands a large amount of communication resources in contrast to the conventional centralized learning paradigm, since updating the global model requires frequently downloading and uploading model parameters between the server and the edge clients. To mitigate this problem, a large body of research has been carried out to reduce the communication costs in federated learning. The most popular approaches include compression and sub-sampling of the client uploads [40, 24, 7] and quantization of the model weights [16]. Most recently, Chen et al. [8] suggested a layer-wise parameter update algorithm to reduce the number of parameters transmitted between the server and the clients. In addition, Zhu et al. [55] use a multi-objective evolutionary algorithm (MOEA) to simultaneously enhance the model performance and the communication efficiency.

Little work has been reported on offline optimization of the architecture of deep neural networks (DNNs) for federated learning, let alone real-time neural architecture search (NAS) suited for the federated environment. In the field of centralized machine learning, Zoph et al. [56] present some early work on NAS using reinforcement learning, which, however, consumes plenty of computational resources. To mitigate this problem, Pham et al. [34] introduce a directed acyclic graph (DAG) based neural architecture representation together with a weight-sharing technique, significantly accelerating the search without much degradation of the learning performance.

Recently, evolutionary approaches to NAS have received increasing attention [41, 35, 36, 29], and hybrid methods that combine evolutionary search with gradient methods [30] have also been reported [13]. To reduce the computational cost, surrogate ensembles have been introduced into evolutionary NAS and have shown promise in reducing the computational complexity without deteriorating the learning performance [42].

Most existing search strategies for NAS in a centralized learning environment are not well suited for federated NAS for the following reasons. First, most current search strategies for centralized learning focus on improving the model performance without paying much attention to the model size and computation cost. Client devices like mobile phones cannot afford computationally intensive model training, and bandwidth restrictions do not allow very large models to be transmitted frequently between the server and the clients. Second, many NAS techniques for accelerating model training [34, 35, 29, 13, 30, 46] adopt transfer learning techniques [32, 47] by searching over small cell-based models and transferring the discovered promising cell structures to large models. Such techniques are not directly applicable to federated learning, because cell transfer may cause model divergence in a distributed training scheme, and learning the transferred models from scratch consumes extra communication resources. Finally, modern deep neural networks may fail to work in federated learning because the learnable parameters in the batch normalization layers may degrade the global model performance after model aggregation, since the locally trained models may have very different means and variances.

Note that our previous work on multi-objective evolutionary federated optimization [55] is an offline evolutionary approach to NAS, like most conventional evolutionary optimization algorithms. In offline evolutionary NAS, the parameters of a newly generated offspring model are randomly reinitialized and trained from scratch before the model is evaluated on a validation dataset, requiring a large amount of computational resources. What is worse, the performance of the reinitialized models degrades dramatically, making the offline optimization approach infeasible for real-time applications of federated learning, such as online recommendation systems [1].

Therefore, offline federated evolutionary optimization of neural networks is not applicable to real-world applications, and it is highly desirable to develop a framework for real-time federated evolutionary neural architecture search. The main contributions of this work are summarized as follows:

  1. A double-sampling technique is proposed that randomly samples a sub-network of the global model, whose parameters are transmitted to a few randomly sampled local devices without replacement. The number of devices to be sampled depends on the ratio between the number of connected local clients and the number of individuals in the population. The double-sampling technique brings two advantages. First, only a sub-network needs to be trained on the local devices, significantly reducing the number of parameters to be uploaded from the local devices to the server. Second, sampling the clients without replacement ensures that each local device trains only one sub-network, and only once, at each generation. These two features together make the proposed real-time evolutionary method substantially different from offline evolutionary NAS, where the whole neural network must be trained on all local devices for fitness evaluations and each device needs to train all networks in the population. To the best of our knowledge, this is the first time that a real-time evolutionary NAS algorithm has been developed for the federated learning framework.

  2. An aggregation strategy is developed that updates the global model based on the sub-networks sampled and trained at each generation. Thus, in the proposed real-time evolutionary NAS, each generation is equivalent to a training round in traditional federated learning. In addition, the weights of the sampled sub-networks are inherited from the master model before they are trained on the local devices, accelerating convergence and avoiding the dramatic performance deterioration caused by random reinitialization.

Extensive comparative studies are performed to verify the performance of the proposed real-time federated evolutionary NAS, comparing the learning performance and computational complexity of the models it obtains with those of the standard ResNet18 [17] on both IID and non-IID data. We also show that the proposed real-time evolutionary NAS is at least five times faster than the offline evolutionary NAS method.

II Background

In this section, a review of federated learning is given first, followed by an introduction to NAS for deep neural networks. Then, the basics of multi-objective evolutionary algorithms are introduced. Finally, a brief discussion is given to clarify the differences between the offline and real-time evolutionary optimization frameworks.

II-A Federated Learning

As mentioned before, federated learning is an emerging decentralized privacy-preserving model training technology that enables local users to train a global model without uploading their private local data to a central server. A conventional federated learning algorithm called federated averaging (FedAvg) is shown in Algorithm 1. In the following, we briefly introduce this algorithm.

Server Update:
      Initialize the global model parameters θ_0
      for each communication round t = 1, 2, ... do
            Select K_t = max(⌈C · K⌉, 1) clients
            Download θ_t to each selected client k
            for each selected client k in parallel do
                  Wait for Client Update(k, θ_t) to return θ_{t+1}^k for synchronization
            end for
            θ_{t+1} ← Σ_k (n_k / n) · θ_{t+1}^k
      end for

Client Update(k, θ):
      for each local iteration from 1 to E do
            for each mini-batch b of size B do
                  θ ← θ − η ∇ℓ(θ; b)
            end for
      end for
      return θ to the server
Algorithm 1 FederatedAveraging. K indicates the total number of clients; B is the size of the mini-batches; E is the number of local training iterations; η is the learning rate; n is the total number of data pairs on all distributed clients; and n_k is the number of data pairs on client k.

II-A1 Server Side

The model parameters θ are initialized once at the beginning of the FedAvg algorithm and then sent to the selected clients, where K is the total number of clients and C is the fraction of participating clients, between 0 and 1. After all selected clients update and send the updated local model parameters back to the central server, the parameters of the global server model are replaced by the weighted average of the clients' model parameters.

II-A2 Client Side

The local model parameters are first replaced by the downloaded global model parameters θ_t. Then the local model parameters are updated by the mini-batch stochastic gradient descent (SGD) algorithm [5], where B is the local learning batch size. After local training, the learned model parameters are sent back to the central server for global model aggregation.

An alternative approach to local execution is to compute and upload only the local model gradients. These are then aggregated on the server by θ_{t+1} = θ_t − η Σ_k (n_k / n) · g_k, where g_k represents the local gradients of client k. This method helps reduce local computation while producing the same result as the previous one.
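As a sketch, the server-side weighted averaging described above can be written as follows; the function name and the flat parameter-list representation are illustrative, not taken from the paper.

```python
def fedavg_aggregate(client_params, client_sizes):
    """Average client parameter vectors, weighting client k by n_k / n."""
    n = sum(client_sizes)  # total number of data pairs across clients
    num_params = len(client_params[0])
    global_params = [0.0] * num_params
    for params, n_k in zip(client_params, client_sizes):
        for i, p in enumerate(params):
            global_params[i] += (n_k / n) * p
    return global_params
```

Clients with more local data thus contribute proportionally more to the global model, matching the n_k / n weights in Algorithm 1.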

II-B Neural Architecture Search

Deep learning [27] has been extremely successful in the fields of image recognition and speech recognition. However, most deep neural network (DNN) models are manually designed by human experts, and only very recently has automatically searching for good model architectures using NAS methods attracted increasing attention. The search space in NAS depends on the neural network model in use; in this work, we limit our discussion to convolutional neural networks (CNNs) [28, 26].

Roughly speaking, search spaces for NAS can be categorized into macro and micro search spaces [34]. The macro search space covers the entire CNN model, as shown in Fig. 1. The whole model consists of sequential layers, where the dashed lines are shortcut connections, similar to ResNet [17]. The macro search space thus aims to represent a well-structured model in terms of the number of hidden layers, the operation types (e.g., convolution), the model hyperparameters (e.g., convolutional kernel size), and the links for shortcut connections. Different from the macro search space, a micro search space covers only repeated motifs, or cells [45, 19], in the whole model structure. These cells are built from complex multi-branch operations [43, 44], as shown in Fig. 2, where the given structure has two inputs, coming from the two previous layers, and only one concatenated output.

Fig. 1: An illustrative example of CNN represented in a macro search space.
Fig. 2: A typical structure of the normal cell represented in a micro search space, where each block receives the outputs from the previous cell and the cell before the previous cell as its inputs, which are then connected to two operations, denoted by ’op’ in the figure. Finally all the branches are concatenated at the output of this cell.

Much work has been done that uses reinforcement learning (RL) for NAS [56, 57, 34, 3, 54]. RL-based search methods typically adopt a recurrent neural network (RNN) [38] as a controller to sample a new candidate model to be trained, and then use the model performance as the reward score. This score is then used to update the controller so that it samples better child models in subsequent iterations.

Apart from RL, evolutionary algorithm (EA) based approaches are often used to deal with multi-objective optimization problems in NAS. Different from earlier neuro-evolution techniques [2, 49] that aim to optimize both the weights and the architecture of neural networks, EA-based NAS optimizes only the model architecture, while the model parameters are trained using conventional gradient descent methods [36, 41, 29, 35]. Evolutionary NAS starts by randomly generating a population of parent models with different structures and training them in parallel. Once all the models have been trained for some pre-defined number of epochs, fitness values are calculated by evaluating the models on the validation dataset. After that, genetic operators such as crossover and mutation are applied to the parents to generate an offspring population consisting of new models. A selection operation then picks the better models to become the parents of the next generation, which is often known as survival of the fittest. This reproduction and selection process repeats for a certain number of generations until certain conditions are satisfied. Note that conventional evolutionary NAS must evaluate a population of neural networks at each generation, and all newly generated models are trained from scratch, which is not suited for online optimization tasks.

Gradient-based methods have become increasingly popular recently, mainly because their search speed is much faster than that of RL-based and evolutionary NAS methods. In [30, 13], relaxation tricks [48] are used to make a weighted sum of candidate operations differentiable, so that gradient descent can be directly employed on these weights [6]. No reinitialization of the model parameters is needed in this approach.

However, gradient-based techniques require much more memory than other approaches, since the overall network needs to be jointly optimized. To address this issue, sampling strategies were proposed in [14, 4], in which only one or two paths (a path is a sub-network connecting the inputs and outputs) are sampled from a complex neural network for training. In addition, a weight-sharing technique [34] has been suggested to avoid reinitializing newly sampled models, which saves plenty of training time. Other techniques such as Bayesian optimization [39] are also good approaches to reducing the computational complexity of NAS [23].

II-C Multi-Objective Evolutionary Optimization

Federated NAS is naturally a multi-objective optimization problem. For instance, the model performance should be maximized, while the payload transferred between the server and the clients should be minimized. In the machine learning community, multiple objectives are usually aggregated into a scalar objective function using hyperparameters. By contrast, the Pareto approach to solving multi-objective optimization problems has been very popular and successful in evolutionary computation [11], and has also been extended to machine learning [22]. The main difference between the Pareto approach and the conventional aggregated approach in machine learning is that in the Pareto approach, no hyperparameters need to be defined, and a set of models representing trade-offs between the objectives is obtained. Finally, one or multiple models can be chosen from the trade-off solutions based on the user's preferences.

The elitist non-dominated sorting genetic algorithm, NSGA-II for short [12], is a very popular multi-objective evolutionary algorithm (MOEA) based on the dominance relationships between individuals. The overall framework of NSGA-II is summarized in Algorithm 2.

Initialize the parent population P_0 with population size N
Calculate the objective values of P_0
while the maximum number of generations G is not reached do
      Generate offspring Q_t by applying genetic operators to P_t
      Calculate the objective values of Q_t
      Combine the parent and offspring populations: R_t = P_t ∪ Q_t
      Perform fast non-dominated sorting on R_t and calculate the crowding distance of all individuals in R_t
      Sort the combined population according to the dominance relationship and the crowding distance
      Select the best N individuals from R_t and store them in the parent population P_{t+1}
end while
Algorithm 2 NSGA-II. N is the population size; G is the number of generations.

The main idea of non-dominated sorting is to generate a set of ordered fronts based on the Pareto dominance relationships between the objective values of the solutions, such that solutions located in the same front do not dominate each other. Solutions in the first non-dominated fronts have a higher priority of being selected. This sorting algorithm has a computational complexity of O(MN²), where M is the number of objectives and N is the population size. To promote solution diversity, a crowding distance, which measures the distance between the two neighboring solutions of an individual in the same front, is calculated, and solutions with a larger crowding distance have a higher priority of being selected. The computational complexity of the crowding distance calculation is O(MN log N) in the worst case, when all solutions are located in a single non-dominated front. Readers are referred to [12] for more details.
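A minimal sketch of non-dominated sorting and the crowding distance, assuming minimization of all objectives. This naive sorting variant is simpler (and asymptotically slower) than the bookkeeping used in NSGA-II's fast non-dominated sort, but it computes the same fronts.

```python
def dominates(a, b):
    """a Pareto-dominates b (minimization of all objectives)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_sort(objs):
    """Peel off successive non-dominated fronts; returns lists of indices."""
    fronts, remaining = [], list(range(len(objs)))
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts

def crowding_distance(objs, front):
    """Per-objective normalized gap between each solution's two neighbors."""
    dist = {i: 0.0 for i in front}
    for m in range(len(objs[0])):
        ordered = sorted(front, key=lambda i: objs[i][m])
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")  # boundary solutions
        span = objs[ordered[-1]][m] - objs[ordered[0]][m] or 1.0
        for k in range(1, len(ordered) - 1):
            dist[ordered[k]] += (objs[ordered[k + 1]][m]
                                 - objs[ordered[k - 1]][m]) / span
    return dist
```

Boundary solutions get infinite distance so that the extremes of each front are always retained, preserving the spread of the trade-off set.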

II-D Online and Offline Evolutionary NAS

As previously discussed, existing evolutionary NAS methods, as well as most RL- and gradient-based NAS algorithms, are meant for offline model optimization. In other words, they are not suited for scenarios in which neural architecture search must be performed while the network is already in use, such as in recommender systems or visual surveillance systems.

Conventional offline evolutionary NAS is not suited for real-time applications for the following reasons. First, a neural network model cannot be randomly initialized when it is already in use, because random initialization seriously deteriorates its performance. Second, the computational resources available on client devices are limited, and NAS should therefore not require substantially more computational resources. This is challenging because EAs are population-based search methods: at each generation (i.e., time instant), a set of models (depending on the population size) must be assessed, which considerably increases the computational cost on the clients. Moreover, NAS should not significantly increase the communication costs. This is again challenging for evolutionary NAS, where the parameters of multiple models (depending on the population size) must be transferred between the server and the clients. Finally, the models in offline evolutionary NAS are usually not fully trained, to reduce computation time; thus, they must be trained again at the end of the evolutionary search, which incurs additional communication costs.

III Proposed Algorithm

In this section, we first introduce the encoding method and model structure used in federated NAS. Then the two objectives to be optimized and the double-sampling method for evaluating them are explained. Finally, the overall framework of the proposed evolutionary federated NAS is presented.

III-A Structure Encoding for Federated NAS

For real-time NAS, we adopt a lightweight CNN as the master model, since communication cost is always a primary concern in federated learning. In addition, the search space should not be too large, and the total number of layers should be limited to make the search appropriate for real-time federated optimization.

The master model used in the proposed online federated NAS is shown in Fig. 3, where a convolutional block, a number of choice blocks (each branch of a choice block contains two convolutional or more advanced depthwise convolutional layers, except for the identity block), and a fully connected layer are linked to build a DNN with a fixed maximum number of hidden layers. Specifically, the convolutional block consists of three sequentially connected layers: a convolutional layer, a batch normalization layer [20], and a rectified linear unit (ReLU) layer. Each choice block is composed of four pre-defined branches of candidate blocks, namely the identity block, the residual block, the inverted residual block, and the depthwise separable block, as shown in Fig. 4. Thus, with N choice blocks there are 4^N possible one-path sub-networks in total. In addition, these four candidate blocks are categorized into two groups: normal blocks, whose input and output share the same channel dimension, and reduction blocks, whose output channel dimension is doubled and whose spatial dimension is quartered compared to the input.

Fig. 3: An example structure of the master model, in which each choice block consists of four branches.

Identity block links its input directly to its output (Fig. 4(a)), which can be seen as a 'layer removal' operation. The reduction variant first applies two branches of point-wise convolution with a stride of 2 and then concatenates the two outputs along the channel dimension. As a result, the spatial dimensions of the input are quartered and the filter channels are doubled by this identity reduction block.

Residual block contains two sequentially connected convolution blocks, as shown in Fig. 4(b), where the normal block has the same structure as the residual block used in ResNet [17]. Note that the reduction residual block does not contain a shortcut connection, while the normal block does.

Inverted residual block (Fig. 4(c)) has an 'inverted' bottleneck structure relative to the residual block, first proposed in MobileNetV2 [37]. This block contains three convolution layers: 1) an expanding point-wise convolution layer, followed by a batch normalization layer and a ReLU activation function; 2) a depthwise convolution layer [9, 18], followed by a batch normalization layer and a ReLU activation function; 3) a point-wise convolution layer followed by a batch normalization layer. The intuition for expanding in the first layer, instead of in the last layer as done in the bottleneck layer, is that a nonlinear activation function such as ReLU may cause a loss of layer information [15], and applying the nonlinear projection in a high-dimensional space mitigates this issue. After the tensor is mapped back to a low-dimensional space through the last point-wise layer, no ReLU is applied, so as to prevent information loss.

Depthwise separable block consists of two depthwise separable convolution operations [9] (Fig. 4(d)), which incur a lower computation cost than conventional convolution operations. It has been shown in [18] that depthwise separable convolution consumes about one-eighth to one-ninth of the computation of standard convolution, at the expense of a small deterioration in performance.
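The one-eighth to one-ninth figure can be checked by counting multiply-accumulate operations; the helper functions below are illustrative, not taken from the paper.

```python
def conv_flops(h, w, c_in, c_out, k):
    """Multiply-accumulate count of a standard k x k convolution."""
    return h * w * c_in * c_out * k * k

def depthwise_separable_flops(h, w, c_in, c_out, k):
    """Depthwise k x k convolution followed by a 1 x 1 point-wise convolution."""
    return h * w * c_in * k * k + h * w * c_in * c_out

# The ratio simplifies to 1/c_out + 1/k^2, so for a 3 x 3 kernel and a large
# channel count it approaches 1/9, consistent with the estimate in [18].
ratio = (depthwise_separable_flops(32, 32, 256, 256, 3)
         / conv_flops(32, 32, 256, 256, 3))
```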

(a) Identity Block (b) Residual Block (c) Inverted Residual Block (d) Depthwise Separable Block
Fig. 4: Four candidate blocks used in the proposed federated NAS, where the left part of each subfigure shows the normal block and the right part the reduction block. The symbol in (a) represents the concatenation operation. Only normal blocks contain shortcut connections.

In each communication round, only one branch of each choice block is sampled from the master model and downloaded to a client device, reducing the communication cost and the computational resources required on the local device. The sampled sub-model can be encoded as a binary string in which every two bits represent one specific branch of a choice block: 00 represents branch 1, the identity block; 01 represents branch 2, the residual block; 10 represents branch 3, the inverted residual block; and 11 represents branch 4, the depthwise separable block. The binary string (also called the choice key) can thus be decoded into a sub-model, as shown in Fig. 5.
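A sketch of decoding a choice key, assuming the natural two-bit encoding (00 for branch 1 through 11 for branch 4) implied by the branch numbering; the names are illustrative.

```python
# Candidate branches of a choice block, in the order of their two-bit codes.
BRANCHES = ["identity", "residual", "inverted_residual", "depthwise_separable"]

def decode_choice_key(bits):
    """Map each pair of bits in a choice key to a branch of a choice block."""
    assert len(bits) % 2 == 0, "a choice key has two bits per choice block"
    return [BRANCHES[int(bits[i:i + 2], 2)] for i in range(0, len(bits), 2)]
```

For example, decode_choice_key("0001") selects the identity block for the first choice block and the residual block for the second.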

Fig. 5: A sub-model represented by the choice key .

III-B Double Sampling for Objective Evaluations

Offline evolutionary optimization is intrinsically unsuited for federated learning. Although one or more lightweight models with high performance can be found by an offline evolutionary algorithm in the last generation [55], a large amount of extra computation and communication resources is required. For instance, with a population size of N, each client in an offline evolutionary NAS algorithm must evaluate the fitness of N individuals at each generation, N times more than the gradient method. In addition, the models are repeatedly reinitialized at random and trained from scratch on the clients, so they may not be good enough for use during the optimization.

To address the above issues, a double-sampling technique is proposed here to enable real-time evolutionary federated NAS: at each generation, the global model of each individual is sub-sampled from a common master model, and the clients for training each individual's global model are sub-sampled from the participating clients. Specifically, a choice key is generated for each individual to randomly sample a sub-network of the master model as that individual's global model. The global model of the individual (a sampled sub-model) is then downloaded to a randomly sampled subset of the participating clients. The number of clients chosen for training one individual's global model is determined by the ratio between the number of participating clients and the number of individuals, where the number of participating clients equals the total number of clients multiplied by the participation ratio at the current round. Here, we assume that the number of participating clients is equal to or larger than the population size.
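The double sampling of one generation can be sketched as follows; the choice-key format and the even partition of the clients into equally sized groups (any leftover clients simply sit out the round) are simplifying assumptions for illustration.

```python
import random

def double_sample(num_choice_blocks, individuals, client_ids):
    """One generation of double sampling: a random choice key per individual,
    and a disjoint random client group per individual (without replacement)."""
    # Model sampling: one branch index (0-3) per choice block per individual.
    keys = {ind: [random.randrange(4) for _ in range(num_choice_blocks)]
            for ind in individuals}
    # Client sampling: shuffle once, then slice into disjoint groups so that
    # each client trains at most one sub-network per generation.
    shuffled = random.sample(client_ids, len(client_ids))
    per_ind = len(client_ids) // len(individuals)
    groups = {ind: shuffled[i * per_ind:(i + 1) * per_ind]
              for i, ind in enumerate(individuals)}
    return keys, groups
```

Because the groups are disjoint slices of one shuffle, no client appears in two groups, which is the "without replacement" property used above.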

In this work, two objectives are to be optimized: one is the test error of each global model and the other is the number of floating point operations (FLOPs) of the model. Recall that the global model of each individual is a sub-model of the master model sampled using a choice key. Population initialization of the proposed method consists of the following four main steps.

  • Initialize the master model. Generate the initial parent population, each individual representing a sub-model sampled from the master model using a choice key. Sample clients for each individual without replacement; that is, each client is sampled at most once.

  • Download the sub-model of each parent individual to the selected clients and train it using the data on the clients. Once the training is completed, upload the local sub-models to the server for aggregation to update the master model.

  • Generate offspring individuals using crossover and mutation. Similarly, generate a choice key for each offspring individual to sample a sub-model from the master model. Download the sub-model of each offspring individual to randomly sampled participating clients and train it on these clients. Note that the weights of the sub-models are inherited directly from the master model and will not be reinitialized in training. Upload the trained local sub-models and aggregate them to update the master model.

  • Finally, download the master model together with the choice keys of all parents and offspring to all clients to evaluate the test errors and FLOPs. Upload the test errors and FLOPs to the server and calculate the weighted average of the test errors for each individual.

Once the objective values of all parent and offspring individuals are calculated, environmental selection can be carried out to generate the parent individuals of the next generation based on elitist non-dominated sorting and the crowding distance, as discussed in Algorithm 2 in Section II-C.

In the subsequent generations, similar steps are carried out, except that the master model is updated only once per generation, after the global model (also a sub-model randomly sampled from the master model) of every offspring individual has been trained on the sampled clients and the resulting local sub-models have been aggregated. It should be emphasized that, at each generation, the master model shared by all individuals needs to be downloaded to all participating clients only once for fitness evaluations.

Note that the model aggregation here differs from that in conventional federated learning. The reason is that different clients may be sampled for different individuals, and different individuals may have different model structures, which cannot be directly aggregated. Fig. 6 shows an illustrative example, where the master model has two choice blocks. There are two individuals, whose choice keys are [0, 0, 0, 1] and [1, 0, 1, 1], respectively, each key selecting one branch per choice block. We further assume that client 1 is chosen to train the first individual's sub-model and client 2 the second's. Each client then updates its model parameters on its local dataset and uploads the trained sub-model to the server, denoted by the shaded squares. Two master models are then reconstructed by filling in the sub-models that were not updated, denoted by the white squares. Since the reconstructed master models have the same structure, they can easily be aggregated using weighted averaging. The pseudo code for model aggregation is presented in Algorithm 3.

The advantage of the above filling-and-aggregation method is that it prevents abrupt changes in sub-models and improves convergence in federated learning. In addition, this aggregation method does not require extra communication resources, since the operation is performed only on the server.
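The fill-and-aggregate step can be sketched as follows; representing each branch's parameters as a single number keyed by (layer, branch) is a simplification of the real tensor-valued parameters.

```python
def fill_and_aggregate(master, uploads, sizes):
    """Reconstruct one full master model per upload, then weight-average them.

    master  : {(layer, branch): weight} -- current master model parameters
    uploads : list of dicts holding only the branches each client trained
    sizes   : local dataset size n_k of each uploading client
    """
    n = sum(sizes)
    # Fill branches that a client did not train with the current master values,
    # so every reconstructed model has the same (full) structure.
    filled = [{key: up.get(key, master[key]) for key in master}
              for up in uploads]
    # Standard FedAvg-style weighted average over the reconstructed models.
    return {key: sum((n_k / n) * f[key] for f, n_k in zip(filled, sizes))
            for key in master}
```

Because untrained branches are filled with the previous master values, a branch that no sampled client touched this round keeps its old parameters, which is what prevents abrupt changes.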

From the above description, we can see that the double-sampling strategy fits perfectly well with population-based evolutionary search, so that the objective values of all individuals in one generation can be evaluated within one communication round, seamlessly embedding one generation of architecture search into one round of training in federated learning.

θ_t denotes the parameters of the master model in the last communication round, and θ_t^{l,b} the master model parameters of the b-th branch in the l-th hidden layer.
Receive the client model parameters θ_k and choice keys c_k, where c_k^l is the branch choice for the l-th hidden layer of client k's sub-model.
Server Aggregation:
for each client upload k = 1, 2, ..., K' do
      for each hidden layer l = 1, 2, ..., L do
            if layer l is not a choice block then
                  θ̂_k^l ← θ_k^l
            else
                  for each branch b = 1, 2, ..., B do
                        if b = c_k^l then
                              θ̂_k^{l,b} ← θ_k^{l,b}
                        else
                              θ̂_k^{l,b} ← θ_t^{l,b}
                        end if
                  end for
            end if
      end for
end for
θ_{t+1} ← Σ_k (n_k / n) · θ̂_k
Algorithm 3 Model Aggregation. K' is the total number of client uploads; k is the client index; L is the total number of hidden layers of the master model; l is the hidden layer index; B is the number of branches in one choice block; b is the branch index; n is the number of data samples on all clients; n_k is the number of data samples on client k; and t is the communication round.
Fig. 6: An illustrative example of model aggregation. The master model contains two choice blocks, each consisting of four different branches. The two sampled sub-models are downloaded to client 1 and client 2, respectively, for training. After the updated sub-models are uploaded, they are filled with the remaining sub-models (those not updated in this round for this individual) to reconstruct the master model before all reconstructed master models are aggregated.

Iii-C Overall Framework

We use NSGA-II to optimize both model FLOPs and performance in the online evolutionary federated NAS framework. The framework is illustrated in Fig. 7 and the pseudo code is listed in Algorithm 4.

1:// Double sampling method used here
6:// Server master model sampling (model sampling)
7:Randomly sample parent choice keys with the population size
8:for each communication round  do
9:      // For online optimization, the generation is equal to the communication round
10:      Convert all parent choice keys into binary codes
11:      Generate offspring codes of the same population size by binary genetic operators
12:      Convert the offspring codes into offspring choice keys
14:      Select the participating clients
15:      if this is the first communication round then
16:            Generate the parent sub-models
17:            // Client sampling (clients sampling)
18:            Randomly partition the selected clients into groups, one group per sub-model
19:            for each group do
20:                 Download the corresponding parent sub-model to all clients in the group
21:            end for
22:            for each client  do
23:                 Wait for Client Update for synchronization
24:                 Do server aggregation in Algorithm 3
25:            end for
26:      end if
27:      // No need to reinitialize the model parameters of offspring models
28:      Generate the offspring sub-models
29:      // Client sampling (clients sampling)
30:      Randomly partition the selected clients into groups, one group per sub-model
31:      for each group do
32:            if this is not the first communication round then
33:                 Download the choice key only to all clients in the group
34:            else
35:                 Download the sub-model parameters and the choice key to all clients in the group
36:            end if
37:      end for
38:      for each client  do
39:            Wait for Client Update for synchronization
40:            Do server aggregation in Algorithm 3
41:      end for
42:      // Do NSGA-II optimization
43:      Calculate the FLOPs of all sub-models in the combined population
44:      for each client  do
45:            Download the master model and all choice keys to the client
46:            Calculate the test errors of all sub-models in the combined population
47:            Upload them to the server
48:      end for
49:      Do weighted averaging on the test errors of all uploads based on the local data sizes to obtain the final test errors of all sub-models
50:      Do fast non-dominated sorting
51:      Do crowding distance sorting
52:      // New solutions are generated within each communication round
53:      Generate new parent choice keys from the sorted combined population
55:end for
57:Client Update:
58:if one choice key is received then
59:      Sub-sample the local copy of the master model based on the choice key
62:end if
63:for each local epoch do
64:      for each batch  do
65:            Update the local sub-model parameters by stochastic gradient descent
66:      end for
67:end for
68:return the updated sub-model parameters to the server
Algorithm 4 Online Federated NAS by NSGA-II. Notation: the population size, the number of generations, the total number of clients, the mini-batch size, the number of local training iterations, the learning rate, the total number of data points on all clients, and the number of data points on each client.
Fig. 7: The overall framework for multi-objective online evolutionary federated NAS. In each communication round, the server maintains the parent population and its choice keys, the offspring population and its choice keys, the combined population and its choice keys, and all trainable parameters of the master model.

The fitness evaluations are performed for both the parent and offspring populations at every generation, which is equivalent to a communication round in the proposed real-time evolutionary NAS. For fitness evaluations, both the master model and all choice keys are downloaded to all participating clients. Thus, no model parameters need to be downloaded for training the sub-models in the next round; it is sufficient to download the choice keys only (refer to Lines 32-33 in Algorithm 4). After that, each client uploads its updated local sub-model to the server for model aggregation. As a result, the proposed model sampling method reduces both the local computational complexity and the communication cost for uploading the models.

It should be noted that the parent sub-models are trained only at the first generation; in the subsequent evolutionary optimization, only the offspring sub-models need to be trained. However, all sub-models in the combined population must be evaluated to calculate the test errors for fitness evaluations at each generation, because training the offspring sub-models also affects the parameters of the parent sub-models, as parent and offspring sub-models always share the weights of the master model. In addition, the model parameters of the sampled offspring sub-models do not need to be reinitialized before training starts. Due to the client sampling strategy, the population size does not affect the communication cost for fitness evaluations, since the entire master model is downloaded from the server and sampling of the master model can be done on the clients.
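The weight-sharing point above can be made concrete with a minimal sketch. Here the master model is reduced to a list of choice blocks, each a list of alternative branch parameter sets; `sample_submodel` is an illustrative name of ours, not the authors' API.

```python
def sample_submodel(master, choice_key):
    """Sample a sub-model from the shared master model by a choice key.

    master:     list of choice blocks, each a list of branch parameter sets.
    choice_key: one branch index per choice block.

    The sub-model merely *references* the chosen branches, so weights trained
    through any individual persist in the master model, and offspring
    sub-models never need reinitialization.
    """
    return [branches[c] for branches, c in zip(master, choice_key)]
```

Because the returned sub-model shares storage with the master model, updating a parent sub-model's parameters also changes any offspring that selects the same branch, which is why all sub-models must be re-evaluated each generation.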

NSGA-II is a Pareto-based multi-objective optimization algorithm that finds a set of optimal models trading off accuracy against computational complexity (FLOPs). Therefore, for online applications, the user needs to articulate preferences to select one of the Pareto optimal solutions from the parent population in each round. In practice, the Pareto solutions with the highest test accuracies or those near the knee points [51, 50, 53] are preferred unless there are other strong user-specified preferences.
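One common way to locate a knee solution on a two-objective front is to pick the point farthest from the line joining the two extreme solutions. The sketch below uses that heuristic purely as an illustration; the cited works [51, 50, 53] describe more refined knee identification methods.

```python
import numpy as np

def knee_point(front):
    """Index of the knee solution on a 2-D Pareto front of
    (test error, FLOPs) pairs, using the maximum-distance-to-extreme-line
    heuristic (illustrative, not the method of the cited references)."""
    f = np.asarray(front, dtype=float)
    # Normalize both objectives to [0, 1] so they are comparable.
    f = (f - f.min(axis=0)) / (f.max(axis=0) - f.min(axis=0) + 1e-12)
    a = f[f[:, 0].argmin()]          # one extreme solution
    b = f[f[:, 0].argmax()]          # the other extreme solution
    d = b - a
    d /= np.linalg.norm(d) + 1e-12   # unit direction of the extreme line
    # Perpendicular distance of every point to the extreme line.
    proj = a + np.outer((f - a) @ d, d)
    dist = np.linalg.norm(f - proj, axis=1)
    return int(dist.argmax())
```

The point that bulges farthest from the extreme line is the one where improving either objective costs the most in the other, which matches the intuitive notion of a knee.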

Iv Experimental Results

Iv-a NSGA-II Settings

The settings for NSGA-II are listed in Table I. Here, we use binary one-point crossover and binary bit-flip mutation as the genetic operators. Since each choice block of the master network only contains four branches, a two-bit binary string is used to represent each choice.
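The two genetic operators named above can be sketched as follows, operating on the binary-coded choice keys (two bits per choice block). This is a generic textbook implementation under the stated settings, not the authors' code.

```python
import random

def one_point_crossover(p1, p2):
    """Binary one-point crossover: cut both parent bit strings at a random
    position and swap the tails."""
    cut = random.randint(1, len(p1) - 1)
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def bit_flip_mutation(bits, p_m=0.1):
    """Flip each bit independently with mutation probability p_m
    (0.1 in Table I)."""
    return [b ^ 1 if random.random() < p_m else b for b in bits]
```

After crossover and mutation, each two-bit group of the resulting string is decoded back into a branch index (0-3) to form an offspring choice key.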

Parameter Value
Generations 500
Population 10

Crossover Probability

Mutation Probability 0.1
Bit Length 2
TABLE I: Parameters of NSGA-II

Iv-B Federated Learning Settings

The hyper parameters for federated learning are presented in Table II.

Parameter Value
Total Clients 10, 20, 50
Local Epochs 1
Train Batch Size 50
Test Batch Size 100
Initial Learning rate 0.1
Momentum 0.5
Learning Decay 0.995
TABLE II: Hyper parameters for federated learning

Here, learning decay means the decay of the learning rate over each communication round. Apart from this, the number of total communication rounds is not set here, because it is equal to the number of generations in real-time evolutionary optimization.
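With the values from Table II, the per-round exponential decay of the learning rate can be written as a one-line helper (an illustrative sketch; the function name is ours):

```python
def lr_at_round(t, lr0=0.1, decay=0.995):
    """Learning rate after t communication rounds, given the initial
    learning rate of 0.1 and the per-round decay of 0.995 from Table II."""
    return lr0 * decay ** t
```

Over the 500 communication rounds used in the experiments, this schedule shrinks the learning rate gradually rather than in discrete steps.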

Iv-C Model and Dataset Settings

The master model used in this work is a deep neural network with multiple branches, containing a total of 28 layers (12 choice blocks, each containing two convolution layers). The numbers of channels of the block layers are [64, 64, 64, 128, 128, 128, 256, 256, 256, 512, 512, 512]. The overall structure of the master model is shown in Fig. 3. Apart from this, the trainable parameters in the batch normalization layers may slow down convergence in federated learning, since they perform poorly for learning with small batch sizes [21] and for weight-sharing training paradigms [52, 14, 10], especially in non-IID scenarios. We therefore disable both the trainable variables and the exponential moving average variables in the batch normalization layers, because we found that they may cause divergence of the master model.

We adopt CIFAR-10 [25] as the dataset, which contains 50,000 training and 10,000 test 32x32 RGB images of 10 different object classes. For IID federated simulations, all training images are evenly and randomly distributed to the local clients without overlap. For experiments on non-IID data, each client holds images of only a subset of the object classes. We do not consider very extreme cases where each client has data from only very few classes, since this is not realistic in real-world environments. For instance, it is not beneficial to collaboratively train a global model with clients having completely different datasets.
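The IID split described above can be sketched as a simple random, non-overlapping partition of the training indices. The helper name is ours; the paper's exact non-IID partitioning (each client restricted to a subset of classes) is omitted here.

```python
import random

def iid_partition(num_samples, num_clients, seed=0):
    """Evenly and randomly distribute training-sample indices to clients
    without overlap (the IID setting of the simulations)."""
    rng = random.Random(seed)
    idx = list(range(num_samples))
    rng.shuffle(idx)
    shard = num_samples // num_clients
    return [idx[i * shard:(i + 1) * shard] for i in range(num_clients)]
```

For example, with 50,000 CIFAR-10 training images and 10 clients, each client receives a disjoint random shard of 5,000 indices.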

Note that we do not apply any data augmentation [33] in our simulation, since the server cannot do any operations on the client data in federated learning.

In all experiments, we use one GTX 1080Ti GPU.

Iv-D Experiment Results

Iv-D1 Baseline Model Used in Federated Learning

We use ResNet18 as the baseline model and its parameters are provided in Table III. As previously mentioned, all trainable parameters in the batch normalization layers are removed, resulting in a total of 0.5587G FLOPs (MAC) on the CIFAR-10 dataset. The settings for federated learning follow those given in Section III.

Layer Name Output Size 18 Layers
Conv1 32x32 3x3, 64
Conv2_x 32x32
Conv3_x 16x16
Conv4_x 8x8
Conv5_x 4x4
Average Pool 1x1
TABLE III: Architecture of ResNet for CIFAR-10

Iv-E Federated Evolutionary NAS Results

We adopt three different numbers of clients (10, 20 and 50) for real-time multi-objective evolutionary federated NAS and the obtained Pareto optimal solutions on both IID and non-IID data after 500 rounds (generations) of optimization are shown in Fig. 8. From these results we can see that the real-time evolutionary federated NAS algorithm is able to achieve a set of evenly distributed Pareto optimal solutions.

(a) 10 clients, IID data (b) 10 clients, non-IID data (c) 20 clients, IID data (d) 20 clients, non-IID data (e) 50 clients, IID data (f) 50 clients, non-IID data
Fig. 8: Pareto optimal solutions obtained on the IID and non-IID data for , and clients.

The following two observations can be made. First, the classification accuracies of the optimized models on the IID data are better than those on the non-IID data. Second, the smaller the number of clients, the better the classification performance. Both phenomena are reasonable: learning on IID data is much easier than learning on non-IID data in federated learning, and the larger the number of clients, the less data each client holds, since the total amount of data is fixed.

To take a closer look at the obtained models, we present in Table IV the test accuracies and FLOPs of the model having the highest test accuracy (called High) and of the knee solution (called Knee), as indicated in Fig. 8, together with those of ResNet.

Model clients IID Test Accuracy FLOPs (MAC)
ResNet 10 Yes 81.14% 0.5587G
High 10 Yes 86.68% 0.2796G
Knee 10 Yes 83.36% 0.031G
ResNet 10 No 80.29% 0.5587G
High 10 No 81.27% 0.2272G
Knee 10 No 78.94% 0.0302G
ResNet 20 Yes 82.5% 0.5587G
High 20 Yes 85.16% 0.2012G
Knee 20 Yes 83.65% 0.0398G
ResNet 20 No 79.91% 0.5587G
High 20 No 83.01% 0.3936G
Knee 20 No 81.58% 0.1098G
ResNet 50 Yes 80.78% 0.5587G
High 50 Yes 77.1% 0.3360G
Knee 50 Yes 71.46% 0.0547G
ResNet 50 No 77.63% 0.5587G
High 50 No 75.1% 0.1719G
Knee 50 No 69.22% 0.0635G
TABLE IV: Comparison between ResNet and two Pareto optimal solutions evolved by NSGA-II

From these results, we can see that in federated learning with a total of 10 and 20 clients, the model having the highest accuracy found by the proposed algorithm is more accurate than the original ResNet and has a much lower computational complexity. For 10 clients, the knee solutions found by the proposed algorithm have much lower FLOPs than the original ResNet. For 20 clients, the FLOPs of the knee solutions are again much lower than those of ResNet, and the performance on both IID and non-IID data is better. However, for 50 clients, the test accuracies of the models found by the proposed algorithm are worse than those of ResNet, although the FLOPs are lower. This indicates that it becomes harder to find an optimal global model as the amount of data on each client decreases.

Iv-F Real-time Performance

Since the proposed method is intended for real-time use, we examine here the performance of two models during the optimization: the model having the highest test accuracy and the knee solution. For simplicity, we only investigate the real-time performance for a fixed number of participating clients, which is presented in Fig. 9. For comparison, the performance of ResNet18 is also plotted. We can see clearly that ResNet18 performs better than both models found by the evolutionary search at the early stage; however, the two solutions outperform ResNet after a number of communication rounds. We also find that the performance of the best model is very stable during the evolutionary search, although the knee solution experiences some minor fluctuations in performance. Both models perform much more stably than those in the conventional offline evolutionary NAS in [55].

The FLOPs of the two solutions, shown in Fig. 9(c)(d), are smaller than those of the original ResNet18.

(a) clients, IID data (b) clients, non-IID data (c) clients, IID data (d) clients, non-IID data
Fig. 9: Test accuracies and the model FLOPs of the best model and the knee solution in each round (generation) of the evolutionary search.

From the above results, we can see that the proposed real-time evolutionary NAS algorithm is not only able to find lightweight models, but also to ensure stable and competitive performance during the optimization.

Iv-G Comparison with Offline Federated NAS

Here, we also compare the proposed algorithm with offline evolutionary federated NAS. In the conventional offline setting, the sampled model is downloaded to all clients with the same structure, and offspring models must be reinitialized and trained from scratch, similar to the settings in [55].

For a fair comparison, models are collaboratively trained for one communication round for fitness evaluations. The number of clients is fixed, and the Pareto optimal solutions obtained on the non-IID data after 50 generations are plotted in Fig. 10; only four solutions are found in the last generation. Finally, the model having the highest accuracy and the knee solution, as indicated in Fig. 10, are selected to be trained from scratch, and their learning curves are shown in Fig. 11. From these results, we can see that the accuracy of the best model found by the offline evolutionary NAS is lower than 75% in this case, which we attribute to the very fast convergence in the early stage of the evolutionary search. These results also imply that, in offline evolutionary NAS, it is non-trivial to choose the number of training rounds per generation.

Fig. 10: Pareto optimal solutions found by the offline evolutionary federated NAS.
Fig. 11: Test accuracies of the two selected solutions found by the offline evolutionary NAS.

Running the real-time evolutionary NAS for 500 generations (rounds) takes considerably less GPU time than running the offline evolutionary NAS for 50 generations, even without counting the re-training time; on average, the real-time evolutionary NAS is approximately five times faster than the offline evolutionary NAS algorithm for each round of search.

V Conclusions and Future Work

This paper proposes a real-time multi-objective evolutionary method for federated NAS, which can effectively avoid extra communication costs and computational resources and maintain a stable performance of the models during the optimization. This is achieved by a double-sampling approach that samples a master model shared by all individuals in the same population and samples the participating clients for training the global model. This way, one generation of evolutionary optimization can be embedded and completed within one single communication round, thereby reducing both communication and computation costs.

The experimental results demonstrate that the proposed evolutionary federated NAS framework is able to find a set of evenly distributed Pareto optimal solutions for both IID and non-IID datasets. Among these Pareto optimal solutions, we can obtain models having different architectures that present a trade-off between classification performance and computational complexity. In addition, we show that these models are computationally much simpler than the standard model while the performance is still highly competitive. The high computational efficiency of the proposed real-time evolutionary NAS is also confirmed by the comparative results with the conventional offline evolutionary method.

The present work is a first valuable step towards the application of neural architecture search to the federated learning framework. In the future, we are going to verify and extend the proposed algorithm for real-time NAS in large-scale federated learning systems. In addition, new techniques remain to be developed to deal with data that are vertically partitioned and distributed on the clients.


  • [1] M. Ammad-ud-din, E. Ivannikova, S. A. Khan, W. Oyomno, Q. Fu, K. E. Tan, and A. Flanagan (2019) Federated collaborative filtering for privacy-preserving personalized recommendation system. Cited by: §I.
  • [2] P. J. Angeline, G. M. Saunders, and J. B. Pollack (1994) An evolutionary algorithm that constructs recurrent neural networks. 5 (1), pp. 54–65. Cited by: §II-B.
  • [3] B. Baker, O. Gupta, N. Naik, and R. Raskar (2016) Designing neural network architectures using reinforcement learning. Cited by: §II-B.
  • [4] G. Bender (2018) Understanding and simplifying one-shot architecture search. Cited by: §II-B.
  • [5] L. Bottou (1991) Stochastic gradient learning in neural networks. 91 (8), pp. 12. Cited by: §II-A2.
  • [6] L. Bottou (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pp. 177–186. Cited by: §II-B.
  • [7] S. Caldas, J. Konečny, H. B. McMahan, and A. Talwalkar (2018) Expanding the reach of federated learning by reducing client resource requirements. Cited by: §I.
  • [8] Y. Chen, X. Sun, and Y. Jin (2019) Communication-efficient federated deep learning with asynchronous model update and temporally weighted aggregation. Cited by: §I.
  • [9] F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251–1258. Cited by: §III-A, §III-A.
  • [10] X. Chu, B. Zhang, R. Xu, and J. Li (2019) Fairnas: rethinking evaluation fairness of weight sharing neural architecture search. Cited by: §IV-C.
  • [11] K. Deb (2005) Multi-objective optimization using evolutionary algorithms. Wiley. Cited by: §II-C.
  • [12] K. Deb, S. Agrawal, A. Pratap, and T. Meyarivan (2000) A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: nsga-ii. In International conference on parallel problem solving from nature, pp. 849–858. Cited by: §II-C, §II-C.
  • [13] X. Dong and Y. Yang (2019) Searching for a robust neural architecture in four gpu hours. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1761–1770. Cited by: §I, §I, §II-B.
  • [14] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun (2019) Single path one-shot neural architecture search with uniform sampling. Cited by: §II-B, §IV-C.
  • [15] D. Han, J. Kim, and J. Kim (2017) Deep pyramidal residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5927–5935. Cited by: §III-A.
  • [16] S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. Cited by: §I.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §I, §II-B, §III-A.
  • [18] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. Cited by: §III-A, §III-A.
  • [19] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §II-B.
  • [20] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. Cited by: §I, §III-A.
  • [21] S. Ioffe (2017) Batch renormalization: towards reducing minibatch dependence in batch-normalized models. In Advances in neural information processing systems, pp. 1945–1953. Cited by: §IV-C.
  • [22] Y. Jin (Ed.) (2006) Multi-objective machine learning. Springer. Cited by: §II-C.
  • [23] K. Kandasamy, W. Neiswanger, J. Schneider, B. Poczos, and E. P. Xing (2018) Neural architecture search with bayesian optimisation and optimal transport. In Advances in Neural Information Processing Systems, pp. 2016–2025. Cited by: §II-B.
  • [24] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon (2016) Federated learning: strategies for improving communication efficiency. Cited by: §I.
  • [25] A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-10 (Canadian Institute for Advanced Research). External Links: Link Cited by: §IV-C.
  • [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §II-B.
  • [27] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. 521 (7553), pp. 436–444. Cited by: §II-B.
  • [28] Y. LeCun, Y. Bengio, et al. (1995) Convolutional networks for images, speech, and time series. 3361 (10), pp. 1995. Cited by: §II-B.
  • [29] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu (2017) Hierarchical representations for efficient architecture search. Cited by: §I, §I, §II-B.
  • [30] H. Liu, K. Simonyan, and Y. Yang (2018) Darts: differentiable architecture search. Cited by: §I, §I, §II-B.
  • [31] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, et al. (2016) Communication-efficient learning of deep networks from decentralized data. Cited by: §I.
  • [32] S. J. Pan and Q. Yang (2009) A survey on transfer learning. 22 (10), pp. 1345–1359. Cited by: §I.
  • [33] L. Perez and J. Wang (2017) The effectiveness of data augmentation in image classification using deep learning. Cited by: §IV-C.
  • [34] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. Cited by: §I, §I, §II-B, §II-B, §II-B.
  • [35] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789. Cited by: §I, §I, §II-B.
  • [36] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin (2017) Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2902–2911. Cited by: §I, §II-B.
  • [37] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §III-A.
  • [38] M. Schuster and K. K. Paliwal (1997) Bidirectional recurrent neural networks. 45 (11), pp. 2673–2681. Cited by: §II-B.
  • [39] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas (2015) Taking the human out of the loop: a review of bayesian optimization. 104 (1), pp. 148–175. Cited by: §II-B.
  • [40] R. Shokri and V. Shmatikov (2015) Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, pp. 1310–1321. Cited by: §I.
  • [41] M. Suganuma, S. Shirakawa, and T. Nagao (2017) A genetic programming approach to designing convolutional neural network architectures. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 497–504. Cited by: §I, §II-B.
  • [42] Y. Sun, H. Wang, B. Xue, Y. Jin, G. G. Yen, and M. Zhang (2019) Surrogate-assisted evolutionary deep learning using an end-to-end random forest-based performance predictor. Note: DOI: 10.1109/TEVC.2019.2924461 Cited by: §I.
  • [43] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi (2017) Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §II-B.
  • [44] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §II-B.
  • [45] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §II-B.
  • [46] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019) Mnasnet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828. Cited by: §I.
  • [47] L. Torrey and J. Shavlik (2010) Transfer learning. In Handbook of research on machine learning applications and trends: algorithms, methods, and techniques, pp. 242–264. Cited by: §I.
  • [48] M. A. Trick (1992) A linear relaxation heuristic for the generalized assignment problem. 39 (2), pp. 137–151. Cited by: §II-B.
  • [49] X. Yao (1999) Evolving artificial neural networks. 87 (9), pp. 1423–1447. Cited by: §II-B.
  • [50] G. Yu, Y. Jin, and M. Olhofer (2018-07) A method for a posteriori identification of knee points based on solution density. In 2018 IEEE Congress on Evolutionary Computation (CEC), Vol. , pp. 1–8. External Links: Document, ISSN null Cited by: §III-C.
  • [51] G. Yu, Y. Jin, and M. Olhofer (2019) Benchmark problems and performance indicators for search of knee points in multiobjective optimization. (), pp. 1–14. External Links: Document, ISSN 2168-2275 Cited by: §III-C.
  • [52] J. Yu, L. Yang, N. Xu, J. Yang, and T. Huang (2018) Slimmable neural networks. Cited by: §IV-C.
  • [53] X. Zhang, Y. Tian, and Y. Jin (2015-12) A knee point-driven evolutionary algorithm for many-objective optimization. 19 (6), pp. 761–776. External Links: Document, ISSN 1941-0026 Cited by: §III-C.
  • [54] Z. Zhong, J. Yan, W. Wu, J. Shao, and C. Liu (2018) Practical block-wise neural network architecture generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2423–2432. Cited by: §II-B.
  • [55] H. Zhu and Y. Jin (2019) Multi-objective evolutionary federated learning. Cited by: §I, §I, §III-B, §IV-F, §IV-G.
  • [56] B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. Cited by: §I, §II-B.
  • [57] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710. Cited by: §II-B.