1 Introduction
The development of deep neural networks has successfully pushed the limits of visual recognition with remarkable improvements in state-of-the-art performance. For example, the top-5 error of an ensemble of residual nets (He et al., 2016) decreases to 3.57% on the ImageNet dataset (Russakovsky et al., 2015), and the first-rank entry achieves 10.99% in terms of the average of top-1 and top-5 errors in the trimmed video recognition task of the ActivityNet Challenge 2018 (Ghanem et al., 2018). These achievements rest on the impressive design of newly-minted 2D or 3D Convolutional Neural Networks (CNN). Nevertheless, developing a powerful and generic network structure often requires significant engineering by human experts. To reduce the effort spent on human-invented architectures and speed up the exploration of neural networks, particularly on the dataset of interest, several techniques have been proposed for automating architecture design.
There are two general directions in the exploitation of automatic network architecture search: discovering evolution over a discrete search space (Liu et al., 2017, 2018b; Real et al., 2018; Zoph et al., 2018) and relaxing the search space to be continuous (Liu et al., 2019; Xie et al., 2019). The former often capitalizes on a controller to sample networks with different structures in a discrete search space, trains the generated networks, and in turn updates the weights of the controller in the next round until convergence. Such methods demand expensive computation since a large number of evaluations are required. Instead of optimizing over a discrete set of structures, the search space can be relaxed to be continuous, and architecture search is then performed by efficient gradient descent optimization. Note also that no controller is involved in this case, making the search framework more flexible. We follow this elegant recipe and employ the continuous relaxation of architecture search in our work, which is better suited to the heavy computations on visual data.
One typical method to relax the search space is Differentiable Architecture Search (DAS), which represents the architecture (cell) as a directed graph and executes continuous relaxation by mixing all the operations on each edge with learnt weights. The search for the optimal architecture is then converted to learning the weight of each operation by gradient descent. Once the learning completes, the architecture is fixed by choosing only the operation with the highest weight on each edge whilst abruptly ignoring all the other operations. Figure 1(a) conceptually illustrates the process of architecture discretization in DAS. As such, the learnt architecture may suffer from a stability problem. Therefore, we propose to alleviate the problem by progressively inducing the operations on the edges during training with a schedule, in order to gradually teach the model to produce a stable architecture, as shown in Figure 1(b). More importantly, considering that a video is a temporal sequence containing both spatial and temporal dimensions, we devise several unique operations, such as convolutions, channel-wise scaling and channel-wise bias, all in spatio-temporal 3D mode.
By consolidating the idea of scheduled arrangement of operations in automatic architecture search, we present a new Scheduled Differentiable Architecture Search (SDAS) for visual recognition. Specifically, we depict an architecture or a cell as a directed graph, in which each directed edge is associated with a set of candidate operations. The evolution of the architecture is then equivalent to jointly optimizing the weights or variables of operations on each edge and network parameters through gradient descent. In the training, we employ a schedule, whereby we relax the choice of a particular operation on each edge as a softmax over all the operations at the beginning and gradually discretize the operation on each edge in turn as learning proceeds. Once all the cells have been fixed, we retrain the whole network with the searched cells to fully optimize the network parameters for visual recognition.
The main contribution of this work is the proposal of SDAS to automate the network design for visual recognition. The solution also leads to the elegant view of how to devise unique operations particularly for modeling spatiotemporal dynamics in videos, and how to scheme an effective and efficient strategy for architecture search, which are problems not yet fully understood in the literature.
2 Related Work
We briefly group the related works into three categories: image recognition, network architecture search and video recognition.
Image Recognition
has received intensive research attention in the area of computer vision, and the rise of Convolutional Neural Networks has achieved remarkable performance on several benchmarks. A lot of recent effort has been devoted to designing high-performing neural network architectures by human experts, such as AlexNet
(Krizhevsky et al., 2012), Inception (Szegedy et al., 2015), VGG (Simonyan and Zisserman, 2015), BN-Inception (Ioffe and Szegedy, 2015), ResNet (He et al., 2016), ResNeXt (Xie et al., 2017), Xception (Chollet, 2017), MobileNet (Howard et al., 2017), ShuffleNet (Zhang et al., 2018) and SENet (Hu et al., 2018). More recently, to automatically design network architectures with less manual intervention, researchers have proposed various Network Architecture Search approaches for image recognition. One successful direction relies on reinforcement learning
(Zoph and Le, 2017), which devises a controller network to generate the network architecture. Zoph et al. (Zoph et al., 2018) further improve the search space in (Zoph and Le, 2017) and obtain state-of-the-art performance on the tasks of image classification and natural language processing. Despite the remarkable performance, this computationally expensive approach takes 1800 GPU days. Several approaches for speeding up the process have been proposed, such as imposing a particular structure on the search space
(Liu et al., 2017, 2018b), weight prediction for each individual architecture (Baker et al., 2018; Brock et al., 2017) and weight sharing across multiple architectures (Cai et al., 2018; Pham et al., 2018). Different from the approaches that discover network evolution over a discrete search space, differentiable architecture search (Liu et al., 2019) relaxes the search space to be continuous, optimizes the architecture by gradient descent, and achieves competitive performance using fewer computational resources. Similarly, SNAS (Xie et al., 2019) reformulates architecture search as the optimization of a joint distribution over the search space and devises a generic differentiable loss for architecture search.
The early deep models for Video Recognition mostly extend 2D CNN for image recognition to video frames. Karpathy et al. (Karpathy et al., 2014) adapt frame-based 2D CNN for a clip input with fixed temporal window size. The two-stream architecture (Simonyan and Zisserman, 2014) is devised by utilizing two networks separately on visual frames and stacked optical flow images. This two-stream scheme is further enhanced by exploiting convolutional fusion (Feichtenhofer et al., 2016), key-volume mining (Zhu et al., 2016), temporal segment networks (Wang et al., 2016, 2018a) and temporal linear encoding (Diba et al., 2017). The aforementioned approaches often treat a video as a sequence of frames or optical flow images for video recognition. Nevertheless, the pixel-level temporal evolution across consecutive frames is seldom explored. To alleviate this issue, the 3D CNN in (Ji et al., 2013) is devised to directly learn spatio-temporal representation from a short video clip by 3D convolution. Later, in (Tran et al., 2015), Tran et al. design a widely adopted 3D CNN, namely C3D, consisting of 3D convolutions and 3D poolings optimized on the large-scale Sports-1M (Karpathy et al., 2014) dataset. More advanced techniques have been proposed recently for 3D CNN, including inflating 2D convolutional kernels (Carreira and Zisserman, 2017) and decomposing 3D convolutional kernels (Qiu et al., 2017; Tran et al., 2018). In this work, we apply neural architecture search to automate 3D CNN backbone design specifically for video recognition.
In short, our work in this paper mainly focuses on gradient-based architecture search for visual recognition tasks. Different from the previous method DARTS (Liu et al., 2019), which chooses all the operations at once after optimization, we propose a scheduled scheme that progressively induces the optimal operation on each edge during training. Moreover, we enlarge the search space particularly for video recognition by devising several unique spatio-temporal operations.
3 Scheduled Differentiable Architecture Search (SDAS)
The idea of Scheduled Differentiable Architecture Search (SDAS) is to improve the automatic search of an architecture or a cell in a scheduled manner. In this way, SDAS gradually approaches the optimal architectures or cells, making the structure of the whole network more stable. The search procedure can be efficiently implemented by using gradient descent. We begin this section by presenting the continuous relaxation of the search space, which enables a joint optimization of architecture and network parameters, followed by the scheduled discretization scheme to progressively determine the structure of the architecture.
3.1 Continuous Search Space
Following the standard practice in the works (Liu et al., 2017, 2018b, 2019; Real et al., 2018; Zoph et al., 2018) on automatic architecture search, we search for the basic cells and manually predetermine the way of connections between these cells. Derived from the idea of DAS (Liu et al., 2019), we formulate a computation cell as a directed acyclic graph which consists of an ordered sequence of nodes. Each node $x_i$ is a feature map and each directed edge $(i,j)$ is associated with an operation $o^{(i,j)}$. We assume the cell to have two input nodes which come from two prior cells, and the other nodes in the cell are all intermediate ones inferred from their predecessor nodes. Formally, for each intermediate node $x_j$, the feature map can be calculated as
$x_j = \sum_{i<j} o^{(i,j)}(x_i)$,  (1)
where $o^{(i,j)} \in \mathcal{O}$ is the assigned operation on edge $(i,j)$ and $\mathcal{O}$ is the set of candidate operations. The output of each cell is achieved by depth-wise concatenating all the intermediate nodes. As a result, the problem of architecture search is equivalent to learning the operation on each edge of the directed graph over a discrete search space, which demands expensive computation.
To allow efficient search of the architecture, the choice of a particular operation on each edge is relaxed as a softmax over all the operations. As such, the search space becomes continuous and the architecture can be optimized with respect to the validation performance by gradient descent. Specifically, the computation of node $x_j$ with the mixing operations on its incoming edges is formulated as
$x_j = \sum_{i<j} \beta^{(i,j)} \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{(i,j)})}\, o(x_i)$,  (2)
where $\alpha_o^{(i,j)}$ denotes the learnable weight of operation $o$ on edge $(i,j)$, and $\beta^{(i,j)}$ is the effect from node $i$ to node $j$. In this case, the continuous variables $\alpha$ and $\beta$ can be utilized to measure the operation selected on each edge and the topological structure of the directed acyclic graph, respectively. Specifically, the predecessor nodes with the top-$k$ strongest effect $\beta^{(i,j)}$ will be connected to intermediate node $x_j$, and the final choice of $o^{(i,j)}$ is the operation with the highest weight, i.e., $o^{(i,j)} = \arg\max_{o \in \mathcal{O}} \alpha_o^{(i,j)}$.
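To make the relaxed computation concrete, here is a minimal NumPy sketch of one mixed edge and one intermediate node. The function names and the toy candidate operations are ours, not from the paper; a real implementation would operate on feature-map tensors inside a deep learning framework.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mixed_edge(x, ops, alpha):
    """Continuous relaxation on one edge: a softmax-weighted sum over
    all candidate operation outputs instead of a single hard choice."""
    w = softmax(alpha)                       # one weight per candidate op
    return sum(wi * op(x) for wi, op in zip(w, ops))

def node_output(preds, ops_per_edge, alphas, betas):
    """An intermediate node aggregates all incoming edges, each scaled
    by its edge weight beta (the effect of predecessor i on node j)."""
    return sum(b * mixed_edge(x, ops, a)
               for x, ops, a, b in zip(preds, ops_per_edge, alphas, betas))
```

With uniform weights the edge output is the plain average of the candidate outputs; as one weight dominates, the mixture converges to that single operation, which is the intuition behind the later discretization step.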
By assuming that the weights of operations/edges $\{\alpha, \beta\}$ and the parameters $w$ in the operations (e.g., convolutional kernels) are independent of each other as in (Liu et al., 2019), the two are jointly optimized by
$\min_{\{\alpha,\beta\}}\ \mathcal{L}_{val}(w, \alpha, \beta)$,  $\min_{w}\ \mathcal{L}_{train}(w, \alpha, \beta)$,  (3)
in which $\mathcal{L}_{train}$ measures the loss on the training set for learning the parameters $w$ in the operations, and $\mathcal{L}_{val}$ estimates the loss on the validation set for updating the weights $\{\alpha, \beta\}$ of operations/edges in the architecture.
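The first-order alternation between the two losses can be sketched on a toy problem. The quadratic losses below are hypothetical stand-ins for $\mathcal{L}_{train}$ and $\mathcal{L}_{val}$, chosen only so the alternating updates have closed-form gradients; they are not the paper's objectives.

```python
def grad_train(w, a):
    """d/dw of a toy training loss L_train(w, a) = (w - a)^2."""
    return 2.0 * (w - a)

def grad_val(w, a):
    """d/da of a toy validation loss L_val(w, a) = (w + a - 2)^2."""
    return 2.0 * (w + a - 2.0)

def alternate(w=0.0, a=0.0, lr=0.1, steps=200):
    """First-order alternating scheme: one SGD step on the training loss
    for the operation parameters w, then one step on the validation loss
    for the architecture weight a, repeated until convergence."""
    for _ in range(steps):
        w -= lr * grad_train(w, a)   # update parameters on L_train
        a -= lr * grad_val(w, a)     # update architecture on L_val
    return w, a
```

On this toy pair the iterates converge to the joint fixed point $w = a = 1$, illustrating how the two coupled minimizations can be run without solving either to completion.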
3.2 Scheduled Discretization
When performing continuous relaxation of the search space and optimizing the architecture by gradient descent, a valid question is how to obtain an ultimate architecture, which must be discrete. The natural way is to directly discretize the architecture by choosing only the operation with the highest weight on each edge. The rationale behind this is the assumption that the validation loss of the architecture in relaxed mode reflects the performance of the architecture in discrete mode. However, in practice, such a discretization scheme ignores all the other operations at once, changing the network dramatically. Therefore, this kind of scheme may result in skewed optimization of the architecture.
Towards a natural and smooth transition from the relaxed to the discrete architecture, we devise a scheduled scheme that progressively induces the optimal operation during training. Specifically, we decompose the discretization of the whole architecture into several one-step discretizations performed on edges and nodes separately. For edge discretization, the mixing operation on an edge is replaced by the operation with the highest weight, as shown in Figure 2. Formally, when performing discretization on edge $(i,j)$, the mixing operation in Equ. 2 is switched to the discrete operation
$\bar{o}^{(i,j)}(x_i) = \beta^{(i,j)}\, \hat{o}^{(i,j)}(x_i)$,  $\hat{o}^{(i,j)} = \arg\max_{o \in \mathcal{O}} \alpha_o^{(i,j)}$,  (4)
where $\hat{o}^{(i,j)}$ is the optimal operation for edge $(i,j)$. After edge discretization, since the operation on this edge is resolved, we remove the weights $\alpha^{(i,j)}$ for operations but still optimize the weight $\beta^{(i,j)}$ to measure the importance of edge $(i,j)$. Besides the operation on each edge, we also gradually determine the topological structure by node discretization, as shown in Figure 3. Specifically, for node $x_j$, when the types of operations on all incident edges have been resolved, we determine the connection between node $x_j$ and the other nodes by only retaining the predecessors with the top-$k$ strongest effect (e.g., the case shown in Figure 3). After node discretization, the computation of node $x_j$ is changed from the continuous function in Equ. 2 to the discrete function in Equ. 1, and the final architecture is obtained when all the intermediate nodes are discretized.
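A minimal sketch of the two one-step discretizations, assuming the operation weights and edge effects are stored as plain arrays; the function and variable names are ours, for illustration only.

```python
import numpy as np

def discretize_edge(alpha, op_names):
    """Edge discretization: keep only the operation with the highest
    weight; the operation weights alpha are then discarded while the
    edge weight beta keeps being optimized."""
    best = int(np.argmax(alpha))
    return op_names[best]

def discretize_node(betas, pred_ids, k=2):
    """Node discretization: once all incident edges are resolved, retain
    only the top-k predecessors with the strongest effect beta."""
    order = np.argsort(betas)[::-1][:k]
    return [pred_ids[i] for i in sorted(order)]
```

Each call makes one decision final, so the remaining search only ranges over the still-unresolved edges and nodes.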
For each architecture discretization, we need $E+N$ one-step discretizations in total, where $E$ and $N$ are the number of edges and nodes, respectively. To control the priority and frequency of discretization, we define a scheduled variable $s_t$ as the number of one-step discretizations that have been performed by the $t$-th mini-batch of training. When $s_t = 0$, a mixture of candidate operations is initially placed on all the edges and all the predecessors are connected to all the nodes. When $s_t$ equals the total one-step discretization number $E+N$, the choice of operations on all the edges is fixed as $\hat{o}$ and the topological structure is also fixed. During the training procedure, we gradually discretize the edges and nodes in turn, and $s_t$ increases from $0$ to $E+N$ as learning proceeds. With the increase of $s_t$, we select either an unresolved edge or an unresolved node with the highest priority to be discretized. For each unresolved edge, we measure the priority of edge discretization by the weight difference between the topmost operation and the second one. A higher weight difference indicates more confidence in the choice of operation. Analogously, for node discretization, the priority is defined as the weight difference between the rank-$k$ and rank-$(k{+}1)$ predecessor nodes.
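The priority rules can be sketched as margin computations over the learnt weights; `edge_priority`, `node_priority` and `next_to_discretize` are our illustrative names, not the paper's.

```python
import numpy as np

def edge_priority(alpha):
    """Margin between the top and runner-up operation weights: a larger
    margin means the operation choice on this edge is more certain."""
    top2 = np.sort(alpha)[::-1][:2]
    return float(top2[0] - top2[1])

def node_priority(betas, k):
    """Margin between the rank-k and rank-(k+1) predecessor effects."""
    s = np.sort(betas)[::-1]
    return float(s[k - 1] - s[k])

def next_to_discretize(unresolved):
    """Pick the unresolved edge/node with the highest priority.
    `unresolved` is a list of (name, priority) pairs."""
    return max(unresolved, key=lambda item: item[1])[0]
```

In other words, SDAS always commits first to the decision it is already most confident about, and postpones ambiguous ones to later iterations.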
Formally, we propose to use a schedule to increase $s_t$ as a function of the iteration $t$ and capitalize on such a schedule to progressively perform discretization on edges and nodes. The selection of the schedule also impacts the convergence speed and the time cost of architecture search. Three example schedules are given in Figure 4, as follows:
(1) Schedule-A: linearly increases the number of discretization decisions, where $T$ denotes the maximum iteration.
(2) Schedule-B: increases the number in a polynomial manner (with quartic as an example), which grows more slowly than the linear function and discretizes the edges and nodes in the late iterations.
(3) Schedule-C: utilizes a negative polynomial function. The discretization is performed in the early iterations and the architecture converges faster than with Schedule-A.
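The three schedules can be sketched as a single function of the iteration. The exact polynomial forms below (linear, quartic, and negative quartic) are our reading of the descriptions above, not formulas quoted from the paper.

```python
import math

def schedule(t, T, M, kind="A"):
    """Number of one-step discretizations s_t performed by iteration t,
    out of M = (#edges + #nodes) decisions in total."""
    r = t / T
    if kind == "A":        # linear
        frac = r
    elif kind == "B":      # quartic: most discretization happens late
        frac = r ** 4
    else:                  # "C", negative quartic: discretize early
        frac = 1.0 - (1.0 - r) ** 4
    return min(M, math.floor(M * frac))
```

All three variants start at 0 and reach $M$ at $t = T$; they differ only in when along training the decisions are spent, with B the most conservative and C the most aggressive.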
The comparisons between different schedules in SDAS will be elaborated in the experiments. Algorithm 1 further details the optimization steps.
3.3 The View of Search Space Reduction
In this section, we introduce the principle of our proposed SDAS from the view of search space reduction. As discussed in Section 3.2, the gap between the architecture in relaxed mode and the architecture in discrete mode leads to skewed optimization in DAS. One vital reason behind this gap is the enormous search space. For example, the search space for image recognition in DARTS (Liu et al., 2019) consists of a huge number of valid architectures. A single continuously relaxed architecture cannot well reflect the performance of this large number of discrete architectures. In our proposed SDAS, however, the search space is reduced gradually during training by discretizing a single edge or node at a time. One example is shown in Table 1, in which we run SDAS with the same settings as in DARTS and report how the search space changes during training. As the variable $s_t$ increases, the search space shrinks dramatically since part of the decisions have already been made by the one-step discretizations. The iterations after that only consider the subspace of valid architectures conditioned on the determined decisions. Finally, after optimization, the only valid architecture left is the optimal architecture produced by SDAS.
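A simplified count illustrates the reduction (topology decisions ignored for brevity): every unresolved edge can still take any of the candidate operations, so each edge discretization shrinks the space by a factor of the operation-set size. The numbers 8 and 14 in the test are illustrative, not the paper's exact configuration.

```python
def relaxed_space(num_ops, unresolved_edges):
    """Number of discrete architectures still consistent with the current
    partially-discretized cell: every unresolved edge can take any of the
    candidate operations, so the count is num_ops ** unresolved_edges."""
    return num_ops ** unresolved_edges
```

When every edge is resolved, the count collapses to a single architecture, matching the observation that the final iterations of SDAS optimize within a space containing only the produced architecture.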
In order to quantitatively measure the gap between the architecture during search and the produced architecture, Xie et al. (Xie et al., 2019) propose to evaluate the performance of these two architectures by using the parameters learnt during search. We follow the settings in (Xie et al., 2019) and evaluate our SDAS as shown in Table 2. Unsurprisingly, the discrete architecture of DARTS achieves much lower accuracy than the relaxed one, since the discretization scheme ignores all the other operations at once, changing the network dramatically. By replacing the traditional softmax relaxation with the Gumbel softmax, the discrete architecture of SNAS can even attain slightly higher performance than the relaxed mode. For our SDAS, since the one-step discretizations have been performed during search according to the schedule function, the relaxed architecture converges to the discrete mode, which achieves the highest accuracy.
Table 1: The change of search space size during training, comparing DARTS (Liu et al., 2019) with SDAS (Schedule-A).
4 Network Structure for Visual Recognition
4.1 Scalable Network Structure
Figure 5 shows an overview of the scalable network structures for (a) image recognition with low-resolution input; (b) image recognition with high-resolution input; (c) video recognition. Each network mainly consists of two types of layers, i.e., stem layers and automatically designed layers. The stem layers are manually determined to learn the low-level representations from the input data. The main purpose of the stem layers is to reduce the spatial or temporal dimensions, so as to decrease computational cost and GPU memory demand. The automatically designed layers are composed of a stack of computation architectures or cells. Each cell is automatically designed by our SDAS; all cells share the same structure but have different weights. For image recognition, we define two kinds of computation cells, each with a different property. The Normal Cell preserves the resolution of the inputs. The Reduction Cell produces an output feature map with the spatial resolution reduced by a factor of two. For video recognition, we further divide the Reduction Cell into the S-Reduction Cell and the ST-Reduction Cell for reducing only the spatial dimension and reducing both the spatial and temporal dimensions, respectively. For all the Reduction Cells, the initial operations applied on the input nodes use the corresponding stride to change the resolution or duration. The architecture parameters $\{\alpha, \beta\}$ are the combination of the parameters of each cell, which are optimized together during the search process. The placement of the different cells is detailed in Figure 5, in which $N$ is the repeat number of the Normal Cell, and $C_s$ and $C_1$ denote the numbers of output channels in the stem layers and the first cell, respectively. The three numbers are considered as free parameters to tailor the network structure to the scale of the visual recognition problem.

4.2 Operation Set
Inspired by the recent advances in CNN, we start from an operation set $\mathcal{O}_{2d}$, which consists of several 2D operations. All of them are prevalent in image recognition (Liu et al., 2019; Xie et al., 2019):
identity; 3×3 average pooling; 3×3 max pooling; 3×3 separable conv; 5×5 separable conv; 3×3 dilated separable conv; 5×5 dilated separable conv
The set $\mathcal{O}_{2d}$ includes three types of operations, i.e., identity shortcut, 2D pooling and 2D convolution. For each pooling or convolution, we denote the local window size as $s{\times}s$, where $s$ is the spatial size. The separable convolution (Chollet, 2017) factorizes the standard convolution into a depthwise convolution and a pointwise convolution for a good trade-off between computational cost and performance, and is always applied twice in an operation (Liu et al., 2019; Zoph et al., 2018). The dilated convolution (Yu and Koltun, 2016) further enlarges the receptive field of each convolution with an atrous rate of 2. For each convolutional operation, we exploit the ReLU-Conv-BN order.
When applying the operation set to video recognition, we extend the local window size of each operation to $t{\times}s{\times}s$, where $t$ is the temporal duration of the local window. Here, we regard the operations with $t=1$ as 2D operations since they are performed on each frame independently. Therefore, the set $\mathcal{O}_{2d}$ for video recognition is given as
identity; 1×3×3 average pooling; 1×3×3 max pooling; 1×3×3 separable conv; 1×5×5 separable conv; 1×3×3 dilated separable conv; 1×5×5 dilated separable conv
To fully explore the temporal evolution across consecutive frames, we build the following set $\mathcal{O}_{3d}$ by extending the operations in $\mathcal{O}_{2d}$ to the 3D manner:
identity; 1×3×3 max pooling; 1×3×3 separable conv; 1×5×5 separable conv; 1×3×3 dilated separable conv; 3×3×3 max pooling; 3×3×3 separable-3d conv; 3×5×5 separable-3d conv; 3×3×3 dilated separable-3d conv
Here, we remove the most infrequent operations, i.e., the average pooling and the 1×5×5 dilated convolution, to control the size of the operation set. One unique design is the separable-3d convolution, in which we remould the idea in (Qiu et al., 2017) of decomposing 3D learning into a 2D convolution in the spatial space and a 1D operation in the temporal dimension. The structure is shown in Figure 6(a). The separable-3d convolution simulates each $t{\times}s{\times}s$ convolution with one $1{\times}s{\times}s$ depthwise convolution plus one $t{\times}1{\times}1$ depthwise convolution. The outputs from the two are accumulated and fed into a $1{\times}1{\times}1$ pointwise convolution to obtain the final output. Such a convolution is conceptually similar to the P3D-B block presented in (Qiu et al., 2017), which places the spatial convolution and the temporal convolution in parallel.
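A quick parameter count shows why this decomposition is economical. The formulas follow the structure described above (a depthwise spatial conv and a depthwise temporal conv in parallel, then a pointwise conv), with biases omitted; the channel numbers in the test are illustrative.

```python
def full_conv3d_params(c_in, c_out, t, s):
    """Parameters of a standard t x s x s 3D convolution."""
    return c_in * c_out * t * s * s

def separable3d_params(c_in, c_out, t, s):
    """Separable-3d convolution: a depthwise 1 x s x s spatial conv and a
    depthwise t x 1 x 1 temporal conv applied in parallel, their outputs
    summed, followed by a 1 x 1 x 1 pointwise conv."""
    depthwise_spatial = c_in * s * s       # one s x s kernel per channel
    depthwise_temporal = c_in * t          # one length-t kernel per channel
    pointwise = c_in * c_out               # 1 x 1 x 1 channel mixing
    return depthwise_spatial + depthwise_temporal + pointwise
```

For example, with 64 input and output channels and a 3×3×3 window, the separable form needs 4,864 parameters against 110,592 for the full 3D convolution, a reduction of more than 20×.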
In addition to the operation set $\mathcal{O}_{3d}$, we also construct an advanced operation set $\mathcal{O}_{adv}$ which adds two new channel-wise operations to $\mathcal{O}_{3d}$. The new operations are derived from the idea of Squeeze-and-Excitation Networks (SENet) (Hu et al., 2018). The goal of SENet is to improve the representational power of a network by explicitly modeling the interdependencies between the channels of a feature map. Accordingly, we distill the key component of SENet into two standalone operations, namely the Channel-wise Scale operation and the Channel-wise Bias operation, which recalibrate channel-wise feature responses by an adaptive scale and bias, respectively. Figure 6(b) details the Channel-wise Scale operation, in which the channel-wise weights are computed from the global-average-pooled representation and then utilized to rescale the input feature map. For the Channel-wise Bias operation in Figure 6(c), an adaptive bias for each channel is learnt instead and broadcast to each position of the input feature map. With these two operations, the advanced operation set $\mathcal{O}_{adv}$ is defined as
identity; 1×3×3 max pooling; 1×3×3 separable conv; 1×5×5 separable conv; 3×3×3 max pooling; 3×3×3 separable-3d conv; 3×5×5 separable-3d conv; channel-wise scale; channel-wise bias
Similarly, the infrequent dilated convolutions are removed for a more compact operation set. The comparison between the different operation sets will be discussed in the experiments.
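A minimal NumPy sketch of the two channel-wise operations, assuming for simplicity a single (hypothetical) fully-connected layer for the excitation gate rather than SENet's two-layer bottleneck:

```python
import numpy as np

def channelwise_scale(x, w_fc):
    """Channel-wise Scale: squeeze by global average pooling over the
    spatio-temporal dimensions, map through a fc layer + sigmoid, and
    rescale every channel of the input. x: (C, T, H, W); w_fc: (C, C)."""
    squeezed = x.mean(axis=(1, 2, 3))                  # (C,) descriptor
    scale = 1.0 / (1.0 + np.exp(-(w_fc @ squeezed)))   # sigmoid gate
    return x * scale[:, None, None, None]

def channelwise_bias(x, bias):
    """Channel-wise Bias: broadcast a learnt per-channel bias to every
    spatio-temporal position of the input. bias: (C,)."""
    return x + bias[:, None, None, None]
```

Both operations are cheap (no spatial kernels at all), which is one reason they can be offered to the search as lightweight alternatives to convolutions.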
5 Implementation
5.1 Datasets
We empirically evaluate our SDAS on the CIFAR-10 and ImageNet datasets for image recognition. The CIFAR-10 (Krizhevsky et al., 2009) dataset consists of 60K low-resolution (32×32) images in 10 classes, and ImageNet (Russakovsky et al., 2015) consists of around 1.2M images in 1K categories. We use the official training/test split provided by the dataset organizers. Following the settings in (Liu et al., 2019), we search for the optimal architecture on the CIFAR-10 dataset, and then utilize the produced architecture on both the CIFAR-10 and ImageNet datasets.
For video recognition, we conduct experiments on three public benchmarks, i.e., UCF101, HMDB51 and Kinetics-400. UCF101 (Soomro et al., 2012) and HMDB51 (Kuehne et al., 2011) are two of the most popular video action recognition benchmarks. UCF101 consists of 13K videos from 101 action categories, and HMDB51 consists of 7K videos from 51 action categories. Each split in UCF101 includes about 9.5K training and 3.7K test videos, while an HMDB51 split contains 3.5K training and 1.5K test videos. The Kinetics-400 (Carreira and Zisserman, 2017) dataset is one of the large-scale action recognition benchmarks. It consists of around 300K videos from 400 action categories, divided into 240K, 20K and 40K videos for the training, validation and test sets, respectively. Each video in this dataset is a 10-second short clip cropped from a raw YouTube video. Note that the labels of the test set are not publicly available, and the performances on the Kinetics-400 dataset are all reported on the validation set. In addition, we construct a subset of the Kinetics dataset, called Kinetics-10, which consists of 10 categories obtained by merging similar fine-grained categories. The details of the categories in the Kinetics-10 dataset are given in Table 3. The subset contains 40K training videos and 3K validation videos. Similar to image recognition, we search for the optimal architecture on the UCF101, HMDB51 and Kinetics-10 datasets, and then evaluate on the large-scale Kinetics-400 dataset.
Category  Subcategories 

basketball  dribbling basketball, dunking basketball, playing basketball, shooting basketball 
biking  biking through snow, falling off bike, jumping bicycle, riding a bike 
bodybuilding  clean and jerk, deadlifting, pull ups, situp, snatch weight lifting, squat, stretching arm, stretching leg 
cooking  chopping vegetables, cooking egg, cooking on campfire, cooking sausages (not on barbeque), cooking scallops, flipping pancake, frying vegetables, making a sandwich, scrambling eggs 
dancing  belly dancing, breakdancing, country line dancing, dancing ballet, dancing charleston, dancing gangnam style, dancing macarena, jumping into pool, mosh pit dancing, salsa dancing, square dancing, swing dancing, tango dancing, tap dancing 
eating  eating burger, eating cake, eating carrots, eating chips, eating doughnuts, eating hotdog, eating ice cream, eating spaghetti, eating watermelon 
golfing  golf chipping, golf driving, golf putting 
skiing  ski jumping, skiing crosscountry, skiing mono, skiing slalom, snowboarding, tobogganing 
soccer  juggling soccer ball, kicking soccer ball, passing soccer ball, shooting goal (soccer) 
swimming  ice swimming, swimming backstroke, swimming breast stroke, swimming butterfly stroke, swimming front crawl 
5.2 Data Preprocessing
During training on CIFAR-10 and ImageNet, the dimension of the input image is set as 32×32 and 224×224, randomly cropped from the padded/resized image with short edge 40/256, respectively. For video recognition, the input video clip is a 16×112×112 volume cropped from a non-overlapping 16-frame clip with short edge 128. Each input is randomly flipped along the horizontal direction for data augmentation. At inference time, we resize the input data according to the shorter side and perform inference on the center crop. Thus, the network produces one score for each image/clip, and the video-level prediction score is calculated by averaging the scores from 20 uniformly sampled clips.

5.3 Architecture Search
For architecture search, we randomly split the original training set into two equal parts as the training and validation sets to optimize the network architecture. Note that the original validation set is never utilized in the optimization of architecture search. Our proposed SDAS is implemented on the Caffe (Jia et al., 2014) platform, and the mini-batch Stochastic Gradient Descent (SGD) algorithm is exploited to optimize the model. We set each mini-batch as 64 images/clips, implemented on multiple NVIDIA Titan Xp GPUs in parallel. For the network parameters $w$, we use momentum SGD with an initial learning rate annealed down to zero following a cosine decay, while the architecture weights $\{\alpha, \beta\}$ are optimized by the Adam algorithm with a fixed learning rate. The search completes after 50/320/320/96 epochs for CIFAR-10, UCF101, HMDB51 and Kinetics-10, respectively.
6 Experiments on Image Recognition
6.1 Evaluations on CIFAR-10
In the architecture search on CIFAR-10, we utilize the operation set $\mathcal{O}_{2d}$, and each cell in our experiments consists of 6 nodes, which include 2 input nodes plus 4 intermediate nodes. Please note that the search space in this setting is the same as in (Liu et al., 2019; Xie et al., 2019). During search, we set the repeat number of the Normal Cell as 2 with small channel numbers. Once the architecture of each cell is optimized, we increase the learning capacity of the network by enlarging the repeat number and the channel numbers in the architecture evaluation. We retrain the searched architectures on the original CIFAR-10 training set for 600 epochs with batch size 96. Additional enhancements in (Liu et al., 2019), including cutout, path dropout and auxiliary towers, are all exploited. As such, DAS in this paper has exactly the same setting as DARTS (first order), which is the most comparable baseline to our SDAS.
Architecture  Test Error (%)  Params (M)  Search Cost (GPU days)  Search Method
DenseNet-BC (Huang et al., 2017)  3.46  25.6  –  manual
NASNet-A + cutout (Zoph et al., 2018)  2.65  3.3  1800  RL
AmoebaNet-A + cutout (Real et al., 2018)  3.34 ± 0.06  3.2  3150  evolution
AmoebaNet-B + cutout (Real et al., 2018)  2.55 ± 0.05  2.8  3150  evolution
Hierarchical Evo (Liu et al., 2018b)  3.75 ± 0.12  15.7  300  evolution
PNAS (Liu et al., 2018a)  3.41 ± 0.09  3.2  225  SMBO
ENAS + cutout (Pham et al., 2018)  2.89  4.6  0.5  RL
DARTS (first order) + cutout (Liu et al., 2019)  2.94  2.9  1.5†  gradient-based
DARTS (second order) + cutout (Liu et al., 2019)  2.83 ± 0.06  3.4  4†  gradient-based
SNAS + mild + cutout (Xie et al., 2019)  2.98  2.9  1.5  gradient-based
SNAS + moderate + cutout (Xie et al., 2019)  2.85 ± 0.02  2.8  1.5  gradient-based
SNAS + aggressive + cutout (Xie et al., 2019)  3.10 ± 0.04  2.3  1.5  gradient-based
Random + cutout  3.49  3.1  –  –
DAS + cutout  2.93 ± 0.08  2.7  0.52  gradient-based
SDAS-A + cutout  2.81 ± 0.07  2.9  0.31  gradient-based
SDAS-B + cutout  2.74 ± 0.11  3.2  0.37  gradient-based
SDAS-C + cutout  2.84 ± 0.12  3.9  0.25  gradient-based
Table 4: Comparisons with state-of-the-art image classifiers on CIFAR-10. † denotes a search cost reported in the original DARTS paper (Liu et al., 2019) but measured on a different GPU from ours. In our implementations, the time cost of DARTS (1st order) and DARTS (2nd order) is 0.5 and 1.9 GPU days, respectively.

Figure 7 illustrates the architecture evolution of a Normal Cell during SDAS optimization with Schedule-A. Let us look at how the one-step discretization is performed on the Normal Cell during SDAS optimization, which gradually reaches the optimal discrete architecture starting from the relaxed mode. As shown in Figure 7(b) and Figure 7(c), at the beginning of optimization, a one-step discretization is applied on the most confident edge, replacing the mixed operation with the optimal operation on this edge. In Figure 7(d), since all the incident edges of node 2 have been discretized, SDAS considers discretizing this node. Thus, the connection between node 1 and node 2 is removed by node discretization on node 2. When reaching the maximum epochs in Figure 7(f), all the edges and nodes are discretized and the search space contains only the optimal architecture.
Architecture  Top-1 Err (%)  Top-5 Err (%)  Params (M)  Mult-Adds (M)  Search Cost (GPU days)  Search Method
Inception-v1 (Szegedy et al., 2015)  30.2  10.1  6.6  1448  –  manual
MobileNet (Howard et al., 2017)  29.4  10.5  4.2  569  –  manual
ShuffleNet 2× (v1) (Zhang et al., 2018)  29.1  10.2  5  524  –  manual
ShuffleNet 2× (v2) (Zhang et al., 2018)  26.3  –  5  524  –  manual
NASNet-A (Zoph et al., 2018)  26.0  8.4  5.3  564  1800  RL
NASNet-B (Zoph et al., 2018)  27.2  8.7  5.3  488  1800  RL
NASNet-C (Zoph et al., 2018)  27.5  9.0  4.9  558  1800  RL
AmoebaNet-A (Real et al., 2018)  25.5  8.0  5.1  555  3150  evolution
AmoebaNet-B (Real et al., 2018)  26.0  8.5  5.3  555  3150  evolution
AmoebaNet-C (Real et al., 2018)  24.3  7.6  6.4  570  3150  evolution
PNAS (Liu et al., 2018a)  25.8  8.1  5.1  588  225  SMBO
DARTS (Liu et al., 2019)  26.9  9.0  4.9  595  4  gradient-based
SNAS + mild (Xie et al., 2019)  27.3  9.2  4.3  522  1.5  gradient-based
SDAS-A  27.0  9.1  4.6  531  0.31  gradient-based
SDAS-B  26.6  8.7  4.4  510  0.37  gradient-based
SDAS-C  26.7  8.8  4.5  511  0.25  gradient-based
Table 4 summarizes the comparisons with state-of-the-art network architectures on the CIFAR-10 dataset. SDAS-A, SDAS-B and SDAS-C denote our SDAS with the three different schedule functions introduced in Section 3.2. For each DAS or SDAS setting, we run 4 times and report the average test error. Though DAS and SDAS both involve architecture discretization, they differ in that DAS determines all the operations in one step, while SDAS selects the operations during training according to a schedule. As indicated by our results, utilizing a schedule in SDAS consistently leads to a lower error rate than the one-step decision in DAS. In particular, SDAS-B achieves the lowest error among the SDAS variants with different schedule functions, demonstrating a possible speedup with a 6.5% relative drop in average test error over DAS. The optimal architectures learnt by SDAS with different schedule functions are shown in Figure 8.
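The exact schedule functions are defined in Section 3.2; purely as a rough illustration (the three shapes below are plausible choices, not necessarily the paper's A/B/C), a schedule can be written as a function mapping training progress t ∈ [0, 1] to the fraction of architecture decisions that should be discretized by then:

```python
import math

def schedule_linear(t):
    # discretize at a constant rate over training
    return t

def schedule_convex(t):
    # keep the search space relaxed longer, discretize quickly near the end
    return t * t

def schedule_concave(t):
    # commit to decisions aggressively early in training
    return math.sqrt(t)

def decisions_due(schedule, t, total_decisions):
    # number of edges/nodes that should be discretized by progress t
    return min(total_decisions, int(schedule(t) * total_decisions))
```

During search, whenever the number of already-fixed decisions falls behind `decisions_due`, another discretization step is triggered.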
6.2 Transferability of Learnt Architectures
For the transferability of learnt architectures, following (Liu et al., 2019), we apply the optimal architectures searched with each schedule to the ImageNet dataset. Here, we only consider the mobile setting, where the number of multiply-add operations for the input is restricted to be less than 600M. In order to reach a network scale close to this limit, the repeat number of Normal Cells is set as 4, and the output channels are fixed as and for the ImageNet evaluation. The network is trained for 250 epochs with batch size 128, weight decay and initial learning rate 0.1, which is decayed by 0.97 after each epoch. The comparisons with state-of-the-art image classifiers on ImageNet in the mobile setting are summarized in Table 5. Similarly to the conclusion on CIFAR-10, SDAS-B achieves the lowest top-5 test error across the three SDAS settings. When compared with DARTS and SNAS, which utilize the same search space as ours, SDAS-B searches faster with 3.3% and 5.4% relative drops in top-5 error, respectively. For the baselines NASNet, AmoebaNet and PNAS, which require much more computational resources during search, SDAS-B also achieves comparable performance with even fewer parameters. This result basically indicates the transferability of the architectures learnt by SDAS for image recognition.
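The ImageNet training schedule above (initial learning rate 0.1, multiplied by 0.97 after every epoch) corresponds to a simple exponential decay:

```python
def lr_at_epoch(epoch, base_lr=0.1, decay=0.97):
    # learning rate after `epoch` full epochs of per-epoch exponential decay
    return base_lr * decay ** epoch
```

Over the full 250 epochs this drives the learning rate down by roughly three orders of magnitude.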
7 Experiments on Video Recognition
Method  Ops  Search Cost (GPU days)  Params (M)  Error (%)
DAS    6.8  1.41  11.67 ± 0.21
DAS    9.9  1.39  10.88 ± 0.11
DAS    9.2  1.46  11.49 ± 0.30
SDAS-A    3.8  0.95  11.02 ± 0.08
SDAS-A    5.7  1.06  10.68 ± 0.11
SDAS-A    4.6  1.21  10.30 ± 0.19
SDAS-B    5.2  0.94  11.09 ± 0.11
SDAS-B    7.5  1.12  10.62 ± 0.15
SDAS-B    5.9  1.10  10.24 ± 0.08
SDAS-C    2.3  0.82  10.93 ± 0.27
SDAS-C    3.5  0.92  10.58 ± 0.12
SDAS-C    3.3  0.93  10.22 ± 0.10
7.1 Evaluations on Kinetics-10
Evaluation on Operation Set. For video recognition, we first examine how the performance of automatic network design is affected by different operation sets. Here, we search for the optimal architecture on Kinetics-10 by DAS and SDAS on the three operation sets , and , respectively. During search, the repeat number of Normal Cells is set as , and the output channels are fixed as and . After architecture search, we enlarge the learning capacity of the network by using , and unless otherwise stated. The optimal architecture is trained for 96 epochs with batch size 64 on the Kinetics-10 training set.
Table 6 summarizes the search time, the number of parameters and the error rate on the Kinetics-10 validation set by DAS and SDAS on the three operation sets. For each architecture, the mean and standard deviation across 4 independent runs are given. Overall, the results across the three operation sets consistently indicate that the networks with architectures learnt by SDAS achieve both a performance boost and a search speedup over those by DAS. Furthermore, searching the cells on the operation set also exhibits better performance than architecture optimization on the set by both DAS and SDAS. That basically verifies the merit of 3D operations in encoding spatiotemporal context in videos. One observation is that DAS cannot handle the advanced operations well and achieves worse performance when switching to the advanced set . For our SDAS, in contrast, involving the advanced operations consistently boosts performance, which demonstrates the ability of SDAS to deal with different kinds of operations.

Figure 9 depicts the best performing Normal Cell, ST-Reduction Cell and S-Reduction Cell on Kinetics-10 by SDAS-C; the advanced operation set is exploited in this experiment. An interesting observation is that the operations of 3D max pooling and 2D max pooling are often chosen in the ST-Reduction Cell and the S-Reduction Cell, respectively. This reasonably meets our expectation and is in conformity with the manually designed networks.
Method  Ops  Params (M)  Error (%)
SDAS-C  2  16  64  0.24  12.44 ± 0.19
  2  24  96  0.53  11.11 ± 0.17
  2  32  128  0.93  10.22 ± 0.10
  3  32  128  1.31  10.08 ± 0.04
  3  32  192  2.89  9.66 ± 0.06
The Effect of Free Parameters. A common practice in architecture design tailored to network capacity is to treat the repeat number of Normal Cells and the output channels and in the stem layers and the first cell as free parameters. In the previous experiments, the values were set as , and to evaluate the network structure. Here, we further conduct experiments to examine the effect of these parameters. Table 7 compares network structures by SDAS-C on Kinetics-10 with different parameter values. In general, increasing , and results in more parameters, but meanwhile the error rate decreases when using more cells and larger output channels. For example, the error drops by 2.78% from the smallest to the largest setting.
Method  Params (M)  Mult-Adds (G)  Error (%)
BN-Inception-2D  0.62  0.92  12.67 ± 0.20
Xception-2D  1.28  1.15  11.31 ± 0.04
Res50-2D  1.41  1.13  12.07 ± 0.05
Res101-2D  2.55  2.00  11.78 ± 0.22
ResNeXt50-2D  1.63  1.29  11.30 ± 0.18
ResNeXt101-2D  3.01  2.34  11.30 ± 0.20
SENet-2D  3.30  2.34  10.84 ± 0.10
BN-Inception-3D  0.76  1.00  12.32 ± 0.15
Xception-3D  1.30  1.17  10.61 ± 0.13
Res50-3D  1.63  1.30  11.74 ± 0.27
Res101-3D  2.97  2.32  11.33 ± 0.10
ResNeXt50-3D  1.74  1.38  10.96 ± 0.21
ResNeXt101-3D  3.22  2.50  10.40 ± 0.05
SENet-3D  3.51  2.50  10.03 ± 0.09
SDAS-C  0.24  0.36  12.44 ± 0.19
SDAS-C  0.53  0.79  11.11 ± 0.17
SDAS-C  0.93  1.38  10.22 ± 0.10
SDAS-C  1.31  1.52  10.08 ± 0.04
SDAS-C  2.89  2.20  9.66 ± 0.06
Comparison with Hand-Crafted Structures. We compare with the following hand-crafted network structures for architecture evaluation: BN-Inception (Ioffe and Szegedy, 2015), ResNet (He et al., 2016), ResNeXt (Xie et al., 2017), Xception (Chollet, 2017) and SENet (Hu et al., 2018). To support the short-clip input in these 2D CNN structures, as shown in Figure 10, we change the 2D convolutions to spatial convolutions and insert two temporal max poolings to reduce the temporal dimension progressively. These structures can also be extended to 3D networks by attaching a parallel temporal convolution to each spatial convolution. To make the scale of each structure comparable with our SDAS on Kinetics-10, we reduce the output channels of the layers in these networks by a factor of 4.
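The 3D extension described above (a temporal convolution attached in parallel to each spatial convolution, with the two responses summed) can be sketched structurally. The sketch below treats the two convolutions as opaque callables rather than reproducing the exact layers:

```python
def parallel_spatiotemporal(spatial_conv, temporal_conv):
    """Extend a 2D block to 3D: run a spatial branch and a parallel
    temporal branch on the same input and sum their responses."""
    def forward(x):
        s = spatial_conv(x)
        t = temporal_conv(x)
        return [a + b for a, b in zip(s, t)]
    return forward
```

In a real network the two branches would be a 1×k×k spatial convolution and a k×1×1 temporal convolution over the same feature map.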
Table 8 compares the architectures by SDAS-C with the hand-crafted network structures. The results are also shown in Figure 11 in the form of accuracy-vs-#params and accuracy-vs-#operations curves for a better view. The network structures with the cells learnt by our SDAS-C consistently achieve superior accuracy across different scales to the human-invented architectures with a comparable number of parameters or multiply-add operations. Though both our operation set and SENet involve the channel-wise operation, the network structure by SDAS-C benefits from automatic design and leads to better results.
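The channel-wise operation referred to above is in the spirit of squeeze-and-excitation. A heavily simplified sketch (the learned bottleneck MLP of SENet is deliberately omitted) squeezes each channel to its global average, passes it through a sigmoid gate, and rescales the channel:

```python
import math

def channel_gate(features):
    """features: list of channels, each a list of activations.
    Simplified channel-wise gating: per-channel global average,
    sigmoid gate, channel rescaling. The excitation MLP that
    SENet learns between squeeze and gate is omitted for brevity."""
    out = []
    for channel in features:
        squeeze = sum(channel) / len(channel)     # global average pooling
        gate = 1.0 / (1.0 + math.exp(-squeeze))   # sigmoid gate
        out.append([gate * v for v in channel])   # rescale the channel
    return out
```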
Method  Ops  Search Cost (GPU days)  Params (M)  Error (%)
UCF101
DAS    7.3  0.73  37.49 ± 0.91
SDAS-C    2.6  0.77  37.19 ± 0.92
HMDB51
DAS    2.7  0.91  74.08 ± 1.18
SDAS-C    0.9  1.15  73.87 ± 0.25
7.2 Evaluations on UCF101 and HMDB51
In this section, we examine how the proposed SDAS behaves when searching for the optimal architecture on different datasets. Here, we utilize two famous action recognition datasets, i.e., UCF101 and HMDB51. For these two datasets, all the settings are the same as those for Kinetics-10 except the number of iterations. Table 9 compares the architectures by DAS and SDAS-C on UCF101 and HMDB51. The results on both datasets indicate that the networks with the cells learnt by SDAS-C outperform those by DAS with a search speedup. However, it should also be noticed that the standard deviation for each architecture on these two datasets is quite large, sometimes close to the difference between architectures. As such, the experimental results are somewhat unstable when searching on UCF101 and HMDB51. Figure 12 and Figure 13 show the optimal architectures by SDAS-C on UCF101 and HMDB51, respectively.
Method  Search Dataset  Search Cost (GPU days)  Params (M)  Top-1 Acc (%)
DAS  Kinetics-10  9.2  37.68  73.3
  UCF101  7.3  19.71  71.2
  HMDB51  2.7  24.92  72.3
SDAS-C  Kinetics-10  3.3  26.01  74.2
  UCF101  2.6  20.82  71.5
  HMDB51  0.9  31.07  73.0
7.3 Transferability of Learnt Architectures
To validate the transferability of learnt architectures, we perform a series of experiments on Kinetics-400 with the best architectures searched from Kinetics-10, UCF101 and HMDB51. Note that we merely transfer the architectures and train the weights of all models on Kinetics-400 from scratch. The free parameters are set as , and . The network is trained for 128 epochs with batch size 256. Table 10 summarizes the top-1 accuracy of different architectures on the Kinetics-400 validation set. As expected, SDAS-C using the cells learnt on Kinetics-10 (a subset of Kinetics-400) outperforms the architectures searched on the other datasets or searched by DAS. For example, the architecture searched by SDAS-C on Kinetics-10 leads to a 1.2% relative improvement over that by DAS, which has more parameters.
Then, we further extend the best performing architecture by SDAS-C to more than 16 input frames to model long-term temporal information, as summarized in Table 11. For clips longer than 16 frames, we first train the networks with 16-frame clips and then fine-tune on the target number of frames to speed up the optimization. In addition to RGB input, we also execute the architecture search with optical flow input to model the change across consecutive frames. Specifically, the network capitalizes on the two-direction optical flow images extracted by the TV-L1 algorithm (Zach et al., 2007) by changing the input channels of the first convolution to 2. As indicated by our results, increasing the number of frames/optical flow images generally leads to performance improvements. On the input of RGB frames/optical flow images, the top-1 accuracy is boosted from 74.2%/64.8% to 76.5%/69.4% when the number of frames changes from 16 to 128. The accuracy of late fusion of the two streams reaches 78.7%.
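The two-stream late fusion reported above amounts to combining per-class scores from the RGB and flow networks. Equal weighting is an assumption here, as the fusion weights are not specified in the text:

```python
def late_fusion(rgb_scores, flow_scores, rgb_weight=0.5):
    # weighted average of per-class prediction scores from the two streams
    return [rgb_weight * r + (1.0 - rgb_weight) * f
            for r, f in zip(rgb_scores, flow_scores)]

def predict(scores):
    # index of the highest-scoring class
    return max(range(len(scores)), key=scores.__getitem__)
```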
Method  Input  Frames  Params (M)  Top-1 Acc (%)
SDAS-C  RGB  16  26.01  74.2
  32    75.1
  64    76.0
  128    76.5
SDAS-C  Flow  16  26.01  64.8
  32    67.4
  64    68.6
  128    69.4
SDAS-C Two-stream    128  52.02  78.7
7.4 Performance Comparisons on Kinetics-400
We compare with the following state-of-the-art hand-crafted networks on the Kinetics-400 dataset. All the baselines are either trained on Kinetics-400 from scratch or pre-trained on the ImageNet or Sports-1M dataset.
(1) Inflated 3D ConvNet (I3D) (Carreira and Zisserman, 2017) expands the 2D convolutions and 2D poolings in the Inception network to 3D.
(2) Pseudo-3D ResNet (P3D) (Qiu et al., 2017) proposes a family of P3D blocks to decompose 3D learning into 2D convolutions in the spatial space and 1D operations in the temporal dimension.
(3) ResNet-200, ResNeXt-101 and DenseNet-201 (Hara et al., 2018) constitute a series of 3D networks built by replacing the 2D filters with 3D filters in different CNN architectures. Here, we list three architectures derived from ResNet (He et al., 2016), ResNeXt (Xie et al., 2017) and DenseNet (Huang et al., 2017), respectively.
(4) R(2+1)D (Tran et al., 2018) builds a spatiotemporal residual network with 2D spatial convolutions and 1D temporal convolutions. The R(2+1)D model is either trained from scratch or pre-trained on the large-scale Sports-1M (Karpathy et al., 2014) dataset.
(5) S3D-G (Xie et al., 2018) redesigns the Inception block used in I3D by replacing the inflated 3D kernel with one 2D convolution and one 1D convolution.
(6) NL I3D (Wang et al., 2018b) inserts the non-local operation between the residual blocks of ResNet-101 to leverage the relations across different positions in the feature map.
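For baseline (4), the (2+1)D decomposition chooses the number of intermediate channels so that the 1×d×d spatial convolution plus the t×1×1 temporal convolution cost roughly as many parameters as the full t×d×d 3D convolution. A sketch of that parameter-matching rule, following the formula given in Tran et al. (2018), is:

```python
def midplanes(t, d, c_in, c_out):
    # intermediate channel count that (approximately) matches the
    # parameter count of a full t x d x d 3D convolution
    return (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)

def decomposed_params(t, d, c_in, c_out):
    m = midplanes(t, d, c_in, c_out)
    spatial = d * d * c_in * m   # parameters of the 1 x d x d convolution
    temporal = t * m * c_out     # parameters of the t x 1 x 1 convolution
    return spatial + temporal
```

For a 3×3×3 convolution with 64 input and output channels, the rule yields 144 mid-planes and the decomposed block matches the full convolution's 110,592 parameters.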
Method  Pre-training  Top-1 (%)  Top-5 (%)
I3D  none  67.5  87.2
ResNet-200  none  63.1  84.4
ResNeXt-101  none  65.1  85.7
DenseNet-201  none  61.3  83.3
R(2+1)D  none  72.0  90.0
SDAS-C  none  76.5  93.1
SDAS-C Two-stream  none  78.7  94.2
I3D  ImageNet  72.1  90.3
I3D Two-stream  ImageNet  75.7  92.0
P3D  ImageNet  72.6  90.7
R(2+1)D  Sports-1M  74.3  91.4
R(2+1)D Two-stream  Sports-1M  75.4  91.9
S3D-G  ImageNet  74.7  93.4
S3D-G Two-stream  ImageNet  77.2  93.0
NL I3D  ImageNet  77.7  93.3
SDAS-C  ImageNet  78.2  93.8
SDAS-C Two-stream  ImageNet  80.1  94.7
Table 12 details the top-1 and top-5 accuracy on the Kinetics-400 validation set. Overall, the randomly initialized SDAS-C outperforms all the networks trained on Kinetics-400 from scratch. In particular, the top-1 accuracy of SDAS-C reaches 76.5%, an absolute improvement of 4.5% over the best competitor R(2+1)D. Our SDAS-C is even superior to S3D-G pre-trained on ImageNet in terms of top-1 accuracy and obtains comparable top-5 accuracy. In addition, we can also pre-train our SDAS-C architecture on ImageNet as proposed in (Qiu et al., 2017), which pre-learns the 2D spatial kernels on ImageNet and then fine-tunes the whole network on video data. As such, the top-1 accuracy is boosted from 76.5% to 78.2%, higher than the best competitor NL I3D pre-trained on ImageNet. Similar performance trends are observed when extending the networks to the two-stream structure.
8 Conclusion
We have presented the Scheduled Differentiable Architecture Search (SDAS) method, which aims to automate architecture design for visual recognition. Particularly, we study the problem of formulating an architecture or a cell as a directed graph and inducing the optimal computational architecture in a scheduled manner. To verify our claim, we have proposed a scheduled scheme which progressively fixes the optimal operation on each edge and changes the topological connection on each node during training, and integrated this scheme into an efficient architecture search framework based on the continuous relaxation of the search space. To encode spatiotemporal information in videos, we further enlarge the search space for video recognition by devising several unique operations. Extensive experiments conducted on CIFAR-10, Kinetics-10, UCF101 and HMDB51 validate our proposal and analysis. SDAS leads to clear improvements over DAS with a speedup. More remarkably, applying the architectures learnt on CIFAR-10/Kinetics-10 to ImageNet/Kinetics-400 successfully outperforms the advanced hand-crafted structures and demonstrates good transferability on both image and video recognition tasks.
References
 Baker et al. (2018) Baker B, Gupta O, Raskar R, Naik N (2018) Accelerating neural architecture search using performance prediction. In: ICLR Workshop
 Brock et al. (2017) Brock A, Lim T, Ritchie JM, Weston N (2017) SMASH: one-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344
 Cai et al. (2018) Cai H, Chen T, Zhang W, Yu Y, Wang J (2018) Efficient architecture search by network transformation. In: AAAI
 Carreira and Zisserman (2017) Carreira J, Zisserman A (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In: CVPR
 Chollet (2017) Chollet F (2017) Xception: Deep learning with depthwise separable convolutions. In: CVPR
 Diba et al. (2017) Diba A, Sharma V, Van Gool L (2017) Deep temporal linear encoding networks. In: CVPR
 Feichtenhofer et al. (2016) Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: CVPR
 Ghanem et al. (2018) Ghanem B, Niebles JC, Snoek C, Heilbron FC, Alwassel H, Escorcia V, Khrisna R, Buch S, Dao CD (2018) The activitynet large-scale activity recognition challenge 2018 summary. arXiv preprint arXiv:1808.03766
 Hara et al. (2018) Hara K, Kataoka H, Satoh Y (2018) Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In: CVPR
 He et al. (2016) He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: CVPR
 Howard et al. (2017) Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H (2017) Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861
 Hu et al. (2018) Hu J, Shen L, Sun G (2018) Squeezeandexcitation networks. In: CVPR
 Huang et al. (2017) Huang G, Liu Z, Van Der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: CVPR
 Ioffe and Szegedy (2015) Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167
 Ji et al. (2013) Ji S, Xu W, Yang M, Yu K (2013) 3d convolutional neural networks for human action recognition. IEEE Trans on PAMI 35(1):221–231
 Jia et al. (2014) Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: Convolutional architecture for fast feature embedding. In: ACM MM
 Karpathy et al. (2014) Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: CVPR
 Krizhevsky et al. (2009) Krizhevsky A, Hinton G, et al. (2009) Learning multiple layers of features from tiny images. Tech. rep., Citeseer
 Krizhevsky et al. (2012) Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: NIPS
 Kuehne et al. (2011) Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: a large video database for human motion recognition. In: ICCV
 Liu et al. (2017) Liu C, Zoph B, Shlens J, Hua W, Li LJ, Fei-Fei L, Yuille A, Huang J, Murphy K (2017) Progressive neural architecture search. arXiv preprint arXiv:1712.00559
 Liu et al. (2018a) Liu C, Zoph B, Neumann M, Shlens J, Hua W, Li LJ, Fei-Fei L, Yuille A, Huang J, Murphy K (2018a) Progressive neural architecture search. In: ECCV
 Liu et al. (2018b) Liu H, Simonyan K, Vinyals O, Fernando C, Kavukcuoglu K (2018b) Hierarchical representations for efficient architecture search. In: ICLR
 Liu et al. (2019) Liu H, Simonyan K, Yang Y (2019) DARTS: Differentiable architecture search. In: ICLR
 Pham et al. (2018) Pham H, Guan MY, Zoph B, Le QV, Dean J (2018) Efficient neural architecture search via parameter sharing. In: ICML
 Qiu et al. (2017) Qiu Z, Yao T, Mei T (2017) Learning spatiotemporal representation with pseudo3d residual networks. In: ICCV
 Real et al. (2018) Real E, Aggarwal A, Huang Y, Le QV (2018) Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548
 Russakovsky et al. (2015) Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al. (2015) Imagenet large scale visual recognition challenge. IJCV 115(3):211–252
 Simonyan and Zisserman (2014) Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: NIPS
 Simonyan and Zisserman (2015) Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: ICLR
 Soomro et al. (2012) Soomro K, Zamir AR, Shah M (2012) UCF101: A dataset of 101 human action classes from videos in the wild. CRCV-TR-12-01
 Szegedy et al. (2015) Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: CVPR
 Tran et al. (2015) Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3d convolutional networks. In: ICCV
 Tran et al. (2018) Tran D, Wang H, Torresani L, Ray J, LeCun Y, Paluri M (2018) A closer look at spatiotemporal convolutions for action recognition. In: CVPR
 Wang et al. (2016) Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, Van Gool L (2016) Temporal segment networks: Towards good practices for deep action recognition. In: ECCV
 Wang et al. (2018a) Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, Van Gool L (2018a) Temporal segment networks for action recognition in videos. IEEE Trans on PAMI
 Wang et al. (2018b) Wang X, Girshick R, Gupta A, He K (2018b) Nonlocal neural networks. In: CVPR
 Xie et al. (2017) Xie S, Girshick R, Dollár P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: CVPR
 Xie et al. (2018) Xie S, Sun C, Huang J, Tu Z, Murphy K (2018) Rethinking spatiotemporal feature learning: Speedaccuracy tradeoffs in video classification. In: ECCV
 Xie et al. (2019) Xie S, Zheng H, Liu C, Lin L (2019) SNAS: stochastic neural architecture search. In: ICLR
 Yu and Koltun (2016) Yu F, Koltun V (2016) Multiscale context aggregation by dilated convolutions. In: ICLR
 Zach et al. (2007) Zach C, Pock T, Bischof H (2007) A duality based approach for realtime TV-L1 optical flow. In: Pattern Recognition
 Zhang et al. (2018) Zhang X, Zhou X, Lin M, Sun J (2018) Shufflenet: An extremely efficient convolutional neural network for mobile devices. In: CVPR
 Zhu et al. (2016) Zhu W, Hu J, Sun G, Cao X, Qiao Y (2016) A key volume mining deep framework for action recognition. In: CVPR
 Zoph and Le (2017) Zoph B, Le QV (2017) Neural architecture search with reinforcement learning. In: ICML
 Zoph et al. (2018) Zoph B, Vasudevan V, Shlens J, Le QV (2018) Learning transferable architectures for scalable image recognition. In: CVPR