Graph-guided Architecture Search for Real-time Semantic Segmentation

09/15/2019 ∙ by Peiwen Lin, et al. ∙ SenseTime Corporation, Zhejiang University

Designing a lightweight semantic segmentation network often requires researchers to find a trade-off between performance and speed, which is always empirical due to the limited interpretability of neural networks. To release researchers from these tedious mechanical trials, we propose a Graph-guided Architecture Search (GAS) pipeline to automatically search real-time semantic segmentation networks. Unlike previous works that use a simplified search space and stack a repeatable cell to form a network, we introduce a novel search mechanism with a new search space, in which a lightweight model can be effectively explored through cell-level diversity and a latency-oriented constraint. Specifically, to produce cell-level diversity, the cell-sharing constraint is removed by searching each cell independently. A graph convolution network (GCN) is then integrated as a communication mechanism between cells. Finally, a latency-oriented constraint is added to the search process to balance speed and performance. Extensive experiments on the Cityscapes and CamVid datasets demonstrate that GAS achieves a new state-of-the-art trade-off between accuracy and speed. In particular, on the Cityscapes dataset, GAS achieves the new best performance of 73.3% mIoU.




1 Introduction

As a fundamental topic in computer vision, semantic image segmentation [24, 44, 7, 8] aims at predicting pixel-level labels for an image. Leveraging the strong ability of CNNs, many works have achieved state-of-the-art performance on popular semantic segmentation benchmarks [13, 15, 4]. To achieve higher accuracy, state-of-the-art models have become increasingly large and deep, requiring high computational resources and large memory overhead, which makes them difficult to deploy on resource-constrained platforms such as mobile devices, robots, and self-driving cars.

Recently, much research has focused on designing and improving CNN models with low computation cost and high segmentation accuracy. For example, [1, 30] reduce the computation cost via pruning algorithms, and [43] uses an image cascade network to incorporate multi-resolution inputs. BiSeNet [41] and DFANet [21] use a lightweight backbone for speed and add a well-designed feature fusion or aggregation module to remedy the accuracy drop. Normally, researchers acquire expertise in architecture design through extensive trial and error to carefully balance accuracy and resource efficiency.

To design more effective segmentation networks for embedded devices, some researchers have explored automatic neural architecture search (NAS) methods [23, 46, 27, 20, 32, 5, 39] and achieved excellent results. For example, Auto-Deeplab [22] searches the cell structure and the downsampling strategy together in the same round. CAS [42] searches an architecture under customized resource constraints, together with a multi-scale module of the kind widely used in semantic segmentation [7, 44].

In particular, CAS has achieved state-of-the-art segmentation performance among lightweight methods [43, 21, 41]. Like general NAS methods such as ENAS [32], DARTS [23] and SNAS [39], CAS searches for a few types of cells (i.e. a normal cell and a reduction cell) and then repeatedly stacks the same cells throughout the network. This simplifies the search process, but the limited cell diversity also makes it harder to find a good trade-off between performance and speed. For example, without any resource constraint the cell is prone to learn a complicated structure in pursuit of high performance. As shown in Fig. 2(a), a whole network stacked from complicated cells results in high latency. When a low-computation constraint is applied, the cell structure tends to be over-simplified, as shown in Fig. 2(b), and may not achieve satisfactory performance.

Unlike traditional search algorithms with a simplified search space, in this paper we propose a novel search mechanism with a new search space, in which a lightweight model with high performance can be fully explored through well-designed cell-level diversity and a latency-oriented constraint. On the one hand, to encourage cell-level diversity, we make each cell structure independent, so that cells with different computation costs can be flexibly stacked to form a lightweight network, as shown in Fig. 2(c). For example, simple cells can be applied to the stages with high computation cost to achieve low latency, while complicated cells can be chosen in deep layers with low computation cost for high accuracy. On the other hand, we apply a real-world latency-oriented constraint to the search process, through which the searched model can achieve a better trade-off between performance and latency.

However, simply allowing each cell to explore its own structure enlarges the search space and makes the optimization more difficult, which causes accuracy degradation as shown in Table 3. To address this issue, we incorporate a Graph Convolution Network (GCN) [19] as the communication mechanism between cells. We name the method Graph-guided Architecture Search (GAS). Our idea is inspired by [26]: different cells can be treated as multiple agents, whose achievement of social welfare may require communication between them. Specifically, in the forward process, starting from the first cell, the information of each cell is propagated to the next adjacent cell with a GCN. Our ablation study shows that this communication mechanism tends to guide cells to select operations with fewer parameters, thus achieving a balance between accuracy and latency.

Figure 2: (a) A network stacked from complicated cells results in high latency and high performance. (b) A network stacked from simple cells leads to low latency and low performance. (c) The cell diversity strategy, i.e., each cell has its own independent structure, can flexibly construct a high-accuracy lightweight network. Best viewed in color.

We conduct extensive experiments on the standard Cityscapes [13] and CamVid [4] benchmarks. Compared to other state-of-the-art methods, the proposed method achieves the new best performance while maintaining competitive latency. In particular, our method lies in the top-right area of Fig. 1, achieving the state-of-the-art trade-off between speed and performance.

The main contributions can be summarized as follows:

  • We propose a novel search framework for the real-time semantic segmentation task, with a new search space in which a lightweight model with high performance can be effectively explored.

  • We integrate the graph convolution network seamlessly into neural architecture search as a communication mechanism between cells.

  • The lightweight segmentation network searched with GAS is customizable in real applications. Notably, GAS achieves 73.3% mIoU on the Cityscapes test set at 102 FPS on an NVIDIA Titan Xp with a 769x1537 input image.

Figure 3: Illustration of our Graph-Guided Network Architecture Search. In reduction cells, all the operations adjacent to the input nodes are of stride two. (a) The backbone network, which is stacked from a series of independent cells. (b) The GCN-Guided Module (GGM), which propagates information between adjacent cells. $\alpha_{k-1}$ and $\alpha_k$ represent the architecture parameters for cell $k-1$ and cell $k$ respectively, and $\alpha'_k$ is the architecture parameters for cell $k$ updated by the GGM. Best viewed in color.

2 Related Work

Efficient Semantic Segmentation Methods

The fully convolutional network [24] is the pioneering work in semantic segmentation. Some remarkable networks have achieved state-of-the-art performance by introducing heavy backbones (VGGNet [34], ResNet [17], DenseNet [18], Xception [12]), and other outstanding works introduce effective modules to capture multi-scale context information [44, 8, 9]. Efficient segmentation methods follow two main streams. One is to employ a relatively lighter backbone (e.g. ENet [30]) or to introduce efficient operations (e.g. depth-wise dilated convolution). DFANet [21] uses a lightweight backbone for speed and adds a cross-level feature aggregation module to remedy the accuracy drop. The other is the multi-branch approach, which consists of more than one path. For example, ICNet [43] uses a multi-scale image cascade to speed up inference, and BiSeNet [41] decouples the extraction of spatial and context information using two paths.

Neural Architecture Search

Neural Architecture Search (NAS) aims at automatically searching network architectures. Most existing architecture search papers are based on either reinforcement learning [45, 16] or evolutionary algorithms [33, 11]. Though they can achieve satisfactory performance, they need thousands of GPU hours. To address this cost, one-shot methods [2, 3] train a parent network from which each sub-network can inherit its weights. They can be roughly divided into cell-based and layer-based methods according to the type of search space. Among cell-based methods, ENAS [32] proposes a parameter sharing strategy among sub-networks, and DARTS [23] relaxes the discrete architecture distribution into continuous deterministic weights so that they can be optimized with gradient descent. SNAS [39] proposes novel search gradients that train neural operation parameters and architecture distribution parameters in the same round of back-propagation. In addition, some excellent works [10, 29] reduce the difficulty of optimization by gradually decreasing the size of the search space. Among layer-based methods, FBNet [37], MnasNet [35] and ProxylessNAS [5] use a multi-objective search approach that optimizes both accuracy and real-world latency. In the field of semantic segmentation, [6] is the pioneering work, introducing meta-learning techniques into the network search problem. Auto-Deeplab [22] searches the cell structure and the downsampling strategy together in the same round. More recently, CAS [42] searches an architecture under customized resource constraints, together with a multi-scale module of the kind widely used in semantic segmentation. [28] over-parameterises the architecture during training via a set of auxiliary cells using reinforcement learning.

Graph Convolution Network

Convolutional neural networks on graph-structured data are an emerging topic in deep learning research. Kipf and Welling [19] present a scalable approach for graph-structured data based on an efficient variant of convolutional neural networks that operates directly on graphs, for better information transfer. Since then, Graph Convolution Networks (GCNs) [19] have been widely used in many domains, such as video classification [36] and action recognition [40]. In this paper, we apply GCNs [19] to model the relationship between adjacent cells in network architecture search. As far as we know, this is the first mechanism that applies graph-based neural networks to the network architecture search task.

3 Methods

As shown in Fig. 3, GAS uses the GCN-Guided Module (GGM) to search for an optimal network constructed from a series of independent cells. In the search process, we take latency into consideration to obtain a computationally efficient network. This search problem can be formulated as:

$$ \min_{a \in \mathcal{A}} \; \mathcal{L}_{val} + \beta \cdot \mathcal{L}_{lat} \tag{1} $$

where $\mathcal{A}$ denotes the search space, $\mathcal{L}_{val}$ and $\mathcal{L}_{lat}$ are the validation loss and the latency loss, respectively, and $\beta$ is the weight that balances them (Section 3.3). Our goal is to search for an optimal architecture $a \in \mathcal{A}$ that achieves the best trade-off between performance and speed.

In this section, we will describe three main components in GAS: 1) Network Architecture Search; 2) GCN-Guided Module; 3) Latency-Oriented Optimization.

3.1 Network Architecture Search

As shown in Fig. 3 (a), the whole backbone takes one image as input which is first filtered with three convolutional layers followed by a series of independent cells. The ASPP [7] module is subsequently used to extract the multi-scale context for the final prediction.

A cell is a directed acyclic graph (DAG), as shown in Fig. 4. Each cell consists of $N$ ordered intermediate nodes, denoted by $\mathcal{N} = \{x_1, \ldots, x_N\}$, and each node represents a latent representation (e.g. a feature map) in the network. Each directed edge in this DAG represents an operation transformation (e.g. conv, pooling). Each cell has two input nodes, represented as $I_1$ and $I_2$, and outputs the concatenation of all intermediate nodes $\{x_1, \ldots, x_N\}$. In our work, we set $N = 2$. So intermediate node $x_1$ has two inputs, $I_1$ and $I_2$, and intermediate node $x_2$ has three inputs, $I_1$, $I_2$ and $x_1$. The intermediate nodes can be calculated by:

$$ x_j = \sum_{i<j} \tilde{o}^{(i,j)}(x_i) \tag{2} $$

where the sum runs over all predecessors $i$ of node $j$ (including the two input nodes) and $\tilde{o}^{(i,j)}$ is the final operation at edge $(i, j)$.

Figure 4: The structure of a cell in our GAS. Each colored edge represents one candidate operation.
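The cell topology described above can be enumerated with a few lines of Python. This is only an illustrative sketch: the function name `cell_edges` and the integer node numbering (inputs first, then intermediate nodes) are assumptions made here, not notation from the paper.

```python
def cell_edges(num_intermediate=2, num_inputs=2):
    """List the (source, target) edges of a cell DAG.

    Nodes 0 .. num_inputs-1 are the cell inputs; the remaining nodes are
    intermediate nodes. Every intermediate node receives one edge from
    each node that precedes it (inputs and earlier intermediate nodes).
    """
    edges = []
    for j in range(num_inputs, num_inputs + num_intermediate):
        for i in range(j):
            edges.append((i, j))
    return edges


edges = cell_edges()
print(edges)       # [(0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]
print(len(edges))  # 5
```

With two inputs and two intermediate nodes, the first intermediate node has two incoming edges and the second has three, so each cell carries five searchable edges.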

To search the final operation $\tilde{o}^{(i,j)}$, we use the method described in SNAS [39], where the search space is represented with a set of one-hot random variables from a fully factorizable joint distribution $p(Z)$. Concretely, each edge $(i, j)$ is associated with a one-hot random variable $Z_{i,j}$, which is multiplied as a mask with all possible operations $O_{i,j} = (o^1_{i,j}, o^2_{i,j}, \ldots, o^M_{i,j})$ on this edge. We denote the one-hot random variable as $Z_{i,j} = (Z^1_{i,j}, Z^2_{i,j}, \ldots, Z^M_{i,j})$, where $M$ is the number of candidate operations. The intermediate nodes during the search process are then:

$$ x_j = \sum_{i<j} \tilde{o}^{(i,j)}(x_i) = \sum_{i<j} \sum_{k=1}^{M} Z^k_{i,j}\, o^k_{i,j}(x_i) \tag{3} $$
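To make $p(Z)$ differentiable, the reparameterization trick [25] is used to relax the discrete architecture distribution to be continuous:

$$ Z^k_{i,j} = f_{\alpha_{i,j}}(G^k_{i,j}) = \frac{\exp\left((\log \alpha^k_{i,j} + G^k_{i,j}) / \lambda\right)}{\sum_{l=1}^{M} \exp\left((\log \alpha^l_{i,j} + G^l_{i,j}) / \lambda\right)} \tag{4} $$

where $\alpha_{i,j}$ is the architecture parameters at the edge $(i, j)$, $G_{i,j} = -\log(-\log(U_{i,j}))$ is a vector of Gumbel random variables, $U_{i,j}$ is a uniform random variable, and $\lambda$ is used to control the temperature of the softmax.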
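The relaxation described above is the standard Gumbel-softmax sample used by SNAS. A minimal standard-library sketch follows; the function name `gumbel_softmax`, the clamping constant, and the choice of seven equally weighted operations are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def gumbel_softmax(log_alpha, temperature, rng=random):
    """Draw one relaxed one-hot sample over the candidate operations of an edge.

    log_alpha: log architecture parameters, one entry per candidate operation.
    temperature: softmax temperature; smaller values give sharper samples.
    """
    gumbels = []
    for _ in log_alpha:
        # Clamp the uniform sample away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbels.append(-math.log(-math.log(u)))  # Gumbel(0, 1) noise
    logits = [(a + g) / temperature for a, g in zip(log_alpha, gumbels)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
log_alpha = [0.0] * 7  # seven candidate operations, equal prior weight
soft = gumbel_softmax(log_alpha, temperature=1.0, rng=rng)
sharp = gumbel_softmax(log_alpha, temperature=0.03, rng=rng)
print([round(v, 3) for v in soft])
print([round(v, 3) for v in sharp])
```

Each sample sums to one. Lowering the temperature pushes the sample toward a one-hot vector, which is why the temperature is annealed during search.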

For the set of candidate operations $O$, we only use the following 7 kinds of operations to better balance speed and performance: 3×3 separable conv, 3×3 max pooling, 3×3 conv, skip connection, zero operation, 3×3 dilated separable conv (dilation=2), and 3×3 dilated separable conv (dilation=4).

3.2 GCN-Guided Module

With the cells independent of each other, the inter-cell relationship becomes very important for searching efficiently. We propose a novel GCN-Guided Module (GGM) to naturally bridge the operation information between adjacent cells. The overall architecture of our GGM is shown in Fig. 3(b). Inspired by [36], the GGM represents the communication between adjacent cells as a graph and performs reasoning on the graph for information delivery. Specifically, we use the similarity relations between edges in adjacent cells to construct the graph, where each node represents one edge in a cell. In this way, the state changes of the previous cell can be delivered to the current cell by reasoning on this graph.

Let $\alpha_k$ represent the architecture parameter matrix for cell $k$, with dimension $p \times M$, where $p$ and $M$ are the number of edges and the number of candidate operations, respectively. Likewise, the architecture parameters $\alpha_{k-1}$ for cell $k-1$ form a $p \times M$ matrix. Given an edge in cell $k$, we calculate the similarity between this edge and all edges in cell $k-1$. The adjacency matrix $Adj$ of the graph between two adjacent cells $k$ and $k-1$ can therefore be established by

$$ Adj = \mathrm{Softmax}\left(\phi_1(\alpha_k)\, \phi_2(\alpha_{k-1})^{T}\right) \tag{5} $$

where $\phi_1(\alpha_k) = \alpha_k w_1$ and $\phi_2(\alpha_{k-1}) = \alpha_{k-1} w_2$ are two different transformations of the original matrices, and the parameters $w_1$ and $w_2$ are both $M \times M$ weights which can be learned via back-propagation. The result $Adj$ is a $p \times p$ matrix.

Based on this adjacency matrix $Adj$, we use Graph Convolution Networks (GCNs) [19] to perform reasoning on the graph, efficiently propagating information from cell $k-1$ to cell $k$. The reasoning process includes the following three steps.

Firstly, to get the graph node feature representation, we apply a convolutional operation to the architecture parameters $\alpha_{k-1}$ and obtain its $p \times d$ node feature representation matrix in the embedding space:

$$ \phi_3(\alpha_{k-1}) = \alpha_{k-1} w_3 \tag{6} $$

where $w_3$ represents the weight of the convolutional operation.

Secondly, with the graph node representations $\phi_3(\alpha_{k-1})$ and the adjacency matrix $Adj$, we use the GCNs [19] to perform information propagation on the graph as shown in Equation 7. A residual connection is added to each layer of GCN. The GCNs allow us to compute the response of a node based on its neighbors defined by the graph relations, so performing graph convolution is equivalent to performing message propagation on the graph.

$$ G(\alpha_{k-1}, Adj) = Adj\, \phi_3(\alpha_{k-1})\, W^{g} \tag{7} $$

where $W^{g}$ denotes the GCN weight with dimension $d \times d$, which can be learned via back-propagation.

Finally, the output of each GCN is still of dimension $p \times d$, so we use another convolutional operation to map the representation from the embedding space back to the source space, denoted $\phi_4$ in Equation 8. We then add $\phi_4(G(\alpha_{k-1}, Adj))$ and the original $\alpha_k$ element-wise to obtain the updated $\alpha'_k$ in Equation 9, where $\gamma$ controls the weight between $\phi_4(G(\alpha_{k-1}, Adj))$ and $\alpha_k$. The current cell $k$ has thus fused the parameter information of the previous cell $k-1$, and we use $\alpha'_k$ as the new architecture parameter of cell $k$. $w_4$ denotes the weight of $\phi_4$.

$$ \phi_4(G(\alpha_{k-1}, Adj)) = G(\alpha_{k-1}, Adj)\, w_4 \tag{8} $$

$$ \alpha'_k = \alpha_k + \gamma\, \phi_4(G(\alpha_{k-1}, Adj)) \tag{9} $$

The proposed GGM thus integrates the graph convolution network into neural architecture search and bridges the operation information between adjacent cells.
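The three reasoning steps of the GGM can be sketched end to end in plain Python. This is a hedged illustration rather than the authors' implementation. The convolutional mappings are modeled as plain matrix products. The residual connection is placed around the graph convolution. All names (`ggm_update`, `w1` to `w4`, `w_g`, `gamma`) and the random initial values are assumptions chosen for the example.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax_rows(a):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def ggm_update(alpha_prev, alpha_cur, w1, w2, w3, w_g, w4, gamma):
    """Fuse the previous cell's architecture parameters into the current cell."""
    # Edge-to-edge similarity between the current and the previous cell (p x p).
    prev_t = [list(col) for col in zip(*matmul(alpha_prev, w2))]
    adj = softmax_rows(matmul(matmul(alpha_cur, w1), prev_t))
    # Node features of the previous cell in the embedding space (p x d).
    feats = matmul(alpha_prev, w3)
    # One graph convolution layer, with an assumed residual connection.
    conv = matmul(matmul(adj, feats), w_g)
    conv = [[c + f for c, f in zip(cr, fr)] for cr, fr in zip(conv, feats)]
    # Map back to the operation space (p x M) and blend into the current cell.
    delta = matmul(conv, w4)
    return [[a + gamma * d for a, d in zip(ar, dr)] for ar, dr in zip(alpha_cur, delta)]


rng = random.Random(0)
p, M, d = 5, 7, 64  # edges per cell, candidate operations, embedding width
alpha_prev, alpha_cur = rand_matrix(p, M, rng), rand_matrix(p, M, rng)
w1, w2 = rand_matrix(M, M, rng), rand_matrix(M, M, rng)
w3, w_g, w4 = rand_matrix(M, d, rng), rand_matrix(d, d, rng), rand_matrix(d, M, rng)
alpha_new = ggm_update(alpha_prev, alpha_cur, w1, w2, w3, w_g, w4, gamma=0.5)
print(len(alpha_new), len(alpha_new[0]))  # 5 7
```

The updated matrix keeps the shape of the current cell's parameters, so it can replace them directly before the Gumbel-softmax sampling step. Setting `gamma` to zero recovers the fully independent cells.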

3.3 Latency-Oriented Optimization

Similar to many excellent NAS works [5, 37, 42, 35], we take the real-world latency of a network into consideration during the search process, which steers the search toward an optimal lightweight model. Specifically, we create a GPU-latency lookup table which records the inference latency of each candidate operation. During the search process, each candidate operation $o^k_{i,j}$ at edge $(i, j)$ is assigned a cost $lat(o^k_{i,j})$ given by the lookup table. In this way, the total latency of cell $c$ is accumulated as:

$$ LAT(cell_c) = \sum_{(i,j)} \sum_{k=1}^{M} Z^k_{i,j}\, lat(o^k_{i,j}) \tag{10} $$

where $Z^k_{i,j}$ is the relaxed architecture variable for operation $k$ at edge $(i, j)$ and $M$ is the number of candidate operations. Given an architecture $a$, the total latency cost is estimated as:

$$ LAT(a) = \sum_{c=1}^{L} LAT(cell_c) \tag{11} $$

where $L$ refers to the number of cells in architecture $a$. The latency of each operation is a constant, and thus the total latency loss is differentiable with respect to the architecture parameters $\alpha$.
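The lookup-table estimate can be illustrated with a short sketch. The operation latencies below are made-up placeholder values, not measurements from the paper, and the function names `cell_latency` and `network_latency` are hypothetical.

```python
def cell_latency(op_weights, op_latency):
    """Expected latency of one cell.

    op_weights: one list per edge, holding the weight of each candidate op.
    op_latency: lookup table with the measured latency of each candidate op.
    """
    return sum(
        weight * latency
        for edge in op_weights
        for weight, latency in zip(edge, op_latency)
    )


def network_latency(cells, op_latency):
    """Total latency: the sum of the per-cell estimates."""
    return sum(cell_latency(cell, op_latency) for cell in cells)


# Placeholder lookup table for three candidate ops (zero, skip, 3x3 conv).
lookup = [0.0, 0.1, 0.3]
# Two cells with two edges each; one-hot rows select a single op per edge.
cells = [
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
]
print(round(network_latency(cells, lookup), 3))
```

Because the table entries are constants, the estimate is a linear function of the operation weights, which is what keeps the latency term differentiable during search. In practice the table would likely need one entry per operation and input shape, since the same operation costs more at higher resolution.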

Unlike the latency loss with an exponent coefficient [37], we simply define the total loss function as follows:

$$ \mathcal{L}(a, w) = CE(a, w_a) + \beta \log(LAT(a)) \tag{12} $$

where $CE(a, w_a)$ denotes the cross-entropy loss of architecture $a$ with parameter $w_a$, $LAT(a)$ denotes the overall latency of architecture $a$, measured in microseconds, and the coefficient $\beta$ controls the balance between accuracy and latency. During the search phase, we directly optimize the architecture parameter $\alpha$ and the weight $w$ in the same round of back-propagation rather than using iterative optimization [37].

4 Experiments

In this section, we conduct extensive experiments to verify the effectiveness of GAS. Firstly, we compare the network searched by our method with state-of-the-art works on two standard benchmarks. Secondly, we perform an ablation study of the GCN-Guided Module and the latency optimization settings, and close with an insight into the GCN-Guided Module.

4.1 Benchmark and Evaluation Metrics

Datasets. To verify the effectiveness and robustness of our method, we evaluate it on the Cityscapes [13] and CamVid [4] datasets. Cityscapes [13] is a publicly released dataset for semantic urban scene understanding. It contains 5,000 high-quality images with fine pixel-level annotations (2,975, 500, and 1,525 for the training, validation, and test sets respectively) of size 1024 x 2048, collected from 50 cities. The dense annotation contains 30 common classes, 19 of which are used in training and testing following [13]. CamVid [4] is another publicly released dataset with object class semantic labels. It contains 701 images in total, of which 367 are for training, 101 for validation and 233 for testing. The images have a resolution of 960 x 720 and 11 semantic categories.

Evaluation Metrics. For evaluation, we use three metrics: mean of class-wise intersection over union (mIoU), network forward time (Latency), and frames per second (FPS).
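For readers less familiar with these metrics, the sketch below computes mIoU from a confusion matrix and converts a per-image latency into FPS. The helper names and the tiny two-class confusion matrix are illustrative assumptions. Skipping classes that never occur is one common convention, not necessarily the benchmark's own.

```python
def mean_iou(confusion):
    """Mean class-wise intersection over union.

    confusion[t][p] counts pixels of true class t predicted as class p.
    """
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        true_pos = confusion[c][c]
        false_neg = sum(confusion[c]) - true_pos
        false_pos = sum(confusion[r][c] for r in range(num_classes)) - true_pos
        denom = true_pos + false_pos + false_neg
        if denom > 0:  # skip classes absent from both labels and predictions
            ious.append(true_pos / denom)
    return sum(ious) / len(ious) if ious else 0.0


def fps_from_latency(latency_ms):
    """Frames per second for a given per-image forward time in milliseconds."""
    return 1000.0 / latency_ms


print(mean_iou([[3, 1], [1, 3]]))        # 0.6
print(round(fps_from_latency(9.80), 1))  # 102.0
```

The second call reproduces the relation between the latency and FPS columns reported for GAS on Cityscapes (9.80 ms per image corresponds to about 102 FPS).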

4.2 Implementation Details

We conduct all experiments based on PyTorch 0.4 [31]. All experiments run on a workstation with Titan Xp GPU cards under CUDA 9.0, and the inference time in all experiments is also reported on an Nvidia Titan Xp GPU.

We first conduct architecture search using GAS on the segmentation dataset and then obtain the target lightweight network architecture according to the optimized architecture parameters $\alpha$. We then pretrain the searched network from scratch on the ImageNet [14] dataset. We finally finetune the network on the specific segmentation dataset for 200 epochs.

In the search process, the architecture contains 16 cells and each cell has $N = 2$ intermediate nodes. For speed, the initial channel number of the network is 8. For the training hyper-parameters, the mini-batch size is set to 16. The architecture distribution parameters are optimized by Adam, with initial learning rate 0.001, $(\beta_1, \beta_2) = (0.5, 0.999)$ and weight decay 0.0001. The network parameters are optimized using SGD with momentum 0.9, weight decay 0.001, and a cosine learning rate scheduler that decays the learning rate from 0.025 to 0.001. For the Gumbel softmax, we set the initial temperature $\lambda$ in Equation 4 to 1.0 and gradually decrease it to the minimum value of 0.03.
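The cosine schedule used for the network weights during search can be written as a one-line formula. The sketch below uses the standard cosine annealing form with the endpoints stated above. The function name `cosine_lr` and the step count of 100 are assumptions, since the number of search iterations is not given here.

```python
import math


def cosine_lr(step, total_steps, lr_max=0.025, lr_min=0.001):
    """Cosine-annealed learning rate, from lr_max at step 0 to lr_min at the end."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


total = 100  # hypothetical number of scheduler steps
for step in (0, 50, 100):
    print(step, round(cosine_lr(step, total), 4))
```

The rate starts at 0.025, passes through the midpoint 0.013 halfway, and ends at 0.001.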

For finetuning, we train the network with mini-batch size 8 and an SGD optimizer with a poly learning rate scheduler that decays the learning rate from 0.01 to zero. Following [38], the online bootstrapping strategy is applied during training. For data augmentation, we use random flip and random resize with a scale between 0.5 and 2.0. Finally, we randomly crop the image to a fixed size for training.
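A poly learning rate schedule of this kind is usually defined as the base rate times a power of the remaining training fraction. The sketch below assumes the commonly used exponent 0.9, which is not stated in the text, and a hypothetical iteration count.

```python
def poly_lr(step, total_steps, base_lr=0.01, power=0.9):
    """Poly schedule: base_lr * (1 - step / total_steps) ** power.

    The exponent 0.9 is a common default and an assumption here.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return base_lr * (1.0 - progress) ** power


total = 1000  # hypothetical number of finetuning iterations
for step in (0, 500, 1000):
    print(step, round(poly_lr(step, total), 5))
```

The schedule starts at 0.01 and reaches exactly zero on the final iteration, matching the endpoints described above.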

For the GCN-Guided Module, we use one Graph Convolution Network (GCN) [19] between every pair of adjacent cells, and each GCN contains one graph convolution layer. The kernel size of the parameters in the graph convolution operation is 64x64.

Method          Input Size   mIoU (%)   Latency (ms)   FPS
FCN-8S          512x1024     65.3       227.23         4.4
PSPNet          713x713      81.2       1288.0         0.78
DeepLabV3       769x769      81.3       769.23         1.3
SegNet          640x320      57.0       30.3           33
ENet            640x320      58.3       12.7           78.4
SQ              1024x2048    59.8       46.0           21.7
ICNet           1024x2048    69.5       26.5           37.7
BiSeNet         768x1536     68.4       9.52           105.8
DFANet A †      1024x1024    71.3       10.0           100.0
DFANet A ‡ ¹    1024x1024    71.3       11.48          87.1
CAS             768x1536     70.5       9.25           108.0
CAS *           768x1536     72.3       9.25           108.0
GAS             769x1537     71.6       9.80           102.0
GAS A           769x1537     70.4       8.68           115.1
GAS *           769x1537     73.3       9.80           102.0
Table 1: Comparison results on the Cityscapes test set. Methods trained using both fine and coarse data are marked with *. The mark † indicates the speed on Titan X, and the mark ‡ indicates that the speed is remeasured on Titan Xp.

4.3 Real-time Semantic Segmentation Results

In this part, we compare the model searched by GAS with other state-of-the-art real-time segmentation models on semantic segmentation datasets. The inference time is measured on one Nvidia Titan Xp GPU, and the speeds of other methods as reported in [42] are used for comparison. Moreover, the speed is measured again on the Titan Xp if the original paper reports the speed on a different GPU.

Results on Cityscapes. We evaluate the network searched by GAS on the Cityscapes test set. The validation set is added to the training data before submitting to the Cityscapes server. Following [41, 42], GAS takes as input an image of size 769x1537, resized from the original image size of 1024x2048. Overall, GAS achieves the best performance among the real-time methods while maintaining a comparable speed of 102 FPS. With only fine data and without any evaluation tricks, GAS yields 71.6% mIoU, which is the state-of-the-art trade-off for lightweight semantic segmentation. The performance reaches 73.3% when coarse data is added to the training set. The full comparison results are shown in Table 1. Compared to BiSeNet and CAS, which have a slight speed advantage, GAS outperforms them by 3.2% and 1.1% mIoU respectively. Compared to other methods such as SegNet, ENet, SQ and ICNet, our method is significantly faster while improving mIoU by about 14.6%, 13.3%, 11.8% and 2.1% respectively. GAS A is searched with the latency constraint weight set to 0.01.

Footnote 1: After merging the BN layers for DFANet, there is still a speed gap between the original paper and our measurement. We suspect that it is caused by a difference in implementation platform, in which DFANet has optimized the depth-wise convolution (DW-Conv). GAS also has many candidate operations using DW-Conv, so the speed of GAS would still be capable of beating it if DW-Conv were optimized as in DFANet or BiSeNet.

Results on CamVid. We also run the whole GAS pipeline on the CamVid dataset to further verify our method. Table 2 shows the comparison with other methods. With input size 720x960, we achieve 71.9% mIoU at 142 FPS, which is also the state-of-the-art trade-off between accuracy and speed.

Method      mIoU (%)   Latency (ms)   FPS
SegNet      55.6       34.01          29.4
ENet        51.3       16.33          61.2
ICNet       67.1       28.98          34.5
BiSeNet     65.6       -              -
DFANet A    64.7       8.33           120
CAS         71.2       5.92           169
GAS         71.9       7.04           142.0
Table 2: Results on the CamVid test set with resolution 960x720. "-" indicates that the corresponding result is not provided by the method.

4.4 Ablation Study

To verify the effect of each component in our framework in detail, we perform ablation experiments on the GCN-Guided Module and the latency loss. Furthermore, we give an insight into the role the GCN-Guided Module plays in the search process.

Methods                      mIoU (%)
a) Cell shared               63.0
b) Cell independent          60.7
c) Cell independent + FC     63.6
d) Cell independent + GCN    66.3
Table 3: Ablation study for the effectiveness of the GCN-Guided Module on the Cityscapes dataset.

4.4.1 Effectiveness of the GCN-Guided Module

We propose the GCN-Guided Module (GGM) to build the connection between cells. To verify the advantage of the GGM, we conducted a series of experiments with different strategies: a) a network stacked from a shared cell; b) a network stacked from independent cells; c) based on b, using a fully connected layer to infer the relationship between cells; d) based on b, using the GCN-Guided Module to infer the relationship between cells. Experimental results are shown in Table 3. The performance reported here is the average mIoU over five repeated experiments on the Cityscapes validation set during the search phase, without the latency loss. Overall, with only independent cells, the performance degrades because the enlarged search space makes optimization more difficult. This degradation is mitigated by adding a communication mechanism between cells. In particular, our GCN-Guided Module brings about a 2.7-point improvement over the fully connected mechanism of setting (c).

We illustrate the network structure searched by GAS in Fig. 6. An interesting observation is that the operations selected by GAS with GGM have fewer parameters and lower computational complexity than those selected by GAS without GGM, with more dilated or separable convolution kernels preferred. This shows the emergence of burden sharing in a group of cells when they know how much the others are willing to contribute. It also explains why GAS with GGM is less prone to overfitting.

Figure 5: The validation accuracy on Cityscapes dataset for different latency constraint. Best viewed in color.

4.4.2 Effectiveness of the Latency Constraint

As mentioned earlier, GAS can flexibly achieve a good trade-off between performance and speed through the latency-oriented optimization. We conduct a series of experiments with different loss weights $\beta$ in Equation 12. Fig. 5 shows the variation of mIoU and latency as $\beta$ changes. With a smaller $\beta$, we obtain a model with higher accuracy, and vice versa. When $\beta$ increases from 0.0005 to 0.005, the latency decreases rapidly while the performance falls slowly. But when $\beta$ increases from 0.005 to 0.05, the performance drops quickly and the decline in latency is fairly limited. So in our experiments, we set $\beta$ to 0.005. We can clearly see that the latency-oriented optimization is effective for balancing accuracy and latency.

4.4.3 Analysis of the GCN-Guided Module

One concern is what role the GCN plays in the search process. We suspect that its effectiveness derives from the following two aspects. 1) To learn a lightweight network, we do not share cell structures, which encourages structure diversity. Learning the cells independently evidently makes the search more difficult and does not guarantee better performance, so the GCN-Guided Module can be regarded as a regularization term on the search process. 2) As discussed in Section 3.1, $p(Z)$ is a fully factorizable joint distribution. As shown in Equation 4, $p(Z_k)$ for the current cell $k$ becomes a conditional probability if the architecture parameter $\alpha_k$ depends on the probability $p(Z_{k-1})$ for the previous cell. In this case, the GCN-Guided Module models the condition in the probability distribution $p(Z_k \mid Z_{k-1})$.

5 Conclusion & Discussion

In this paper, a novel Graph-guided Architecture Search (GAS) framework is proposed to tackle the real-time semantic segmentation task. Unlike existing NAS approaches that stack the same searched cell into a whole network, GAS searches different cell architectures and adopts a graph convolution network to bridge the information between cells. In addition, a latency-oriented constraint is added to the search process to balance accuracy and speed. Extensive experiments demonstrate that GAS achieves a better accuracy-speed trade-off than state-of-the-art real-time segmentation approaches.

In the future, we will extend GAS in the following directions: 1) We will search networks directly for the segmentation and detection tasks without retraining. 2) We will investigate more deeply how to effectively combine NAS and graph convolution networks. 3) We will explore other approaches to applying the latency constraint.


  • [1] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE trans. PAMI 39 (12), pp. 2481–2495. Cited by: §1.
  • [2] G. Bender, P. Kindermans, B. Zoph, V. Vasudevan, and Q. Le (2018) Understanding and simplifying one-shot architecture search. In ICML, pp. 549–558. Cited by: §2.
  • [3] A. Brock, T. Lim, J. M. Ritchie, and N. Weston (2017) SMASH: one-shot model architecture search through hypernetworks. arXiv:1708.05344. Cited by: §2.
  • [4] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla (2008) Segmentation and recognition using structure from motion point clouds. In ECCV (1), pp. 44–57. Cited by: §1, §1, §4.1.
  • [5] H. Cai, L. Zhu, and S. Han (2018) Proxylessnas: direct neural architecture search on target task and hardware. arXiv:1812.00332. Cited by: §1, §2, §3.3.
  • [6] L. Chen, M. D. Collins, Y. Zhu, G. Papandreou, B. Zoph, F. Schroff, H. Adam, and J. Shlens (2018) Searching for efficient multi-scale architectures for dense image prediction. In NeurIPS, pp. 8713–8724. Cited by: §2.
  • [7] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2018) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE trans. PAMI 40 (4), pp. 834–848. Cited by: §1, §1, §3.1.
  • [8] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. CoRR abs/1706.05587. Cited by: §1, §2.
  • [9] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pp. 833–851. Cited by: §2.
  • [10] X. Chen, L. Xie, J. Wu, and Q. Tian (2019) Progressive differentiable architecture search: bridging the depth gap between search and evaluation. arXiv:1904.12760. Cited by: §2.
  • [11] Y. Chen, Q. Zhang, C. Huang, L. Mu, G. Meng, and X. Wang (2018) Reinforced evolutionary neural architecture search. CoRR abs/1808.00193. Cited by: §2.
  • [12] F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In CVPR, pp. 1800–1807. Cited by: §2.
  • [13] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR, pp. 3213–3223. Cited by: §1, §1, §4.1.
  • [14] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §4.2.
  • [15] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2015) The pascal visual object classes challenge: a retrospective. IJCV 111 (1), pp. 98–136. Cited by: §1.
  • [16] M. Guo, Z. Zhong, W. Wu, D. Lin, and J. Yan (2018) IRLAS: inverse reinforcement learning for architecture search. CoRR abs/1812.05285. Cited by: §2.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §2.
  • [18] G. Huang, Z. Liu, and K. Q. Weinberger (2016) Densely connected convolutional networks. In CVPR, pp. 1–9. Cited by: §2.
  • [19] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv:1609.02907. Cited by: §1, §2, §3.2, §3.2, §4.2.
  • [20] B. Krause, E. Kahembwe, I. Murray, and S. Renals (2017) Dynamic evaluation of neural sequence models. arXiv:1709.07432. Cited by: §1.
  • [21] H. Li, P. Xiong, H. Fan, and J. Sun (2019) Dfanet: deep feature aggregation for real-time semantic segmentation. In CVPR, pp. 9522–9531. Cited by: §1, §1, §2.
  • [22] C. Liu, L. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille, and L. Fei-Fei (2019) Auto-deeplab: hierarchical neural architecture search for semantic image segmentation. CoRR abs/1901.02985. Cited by: §1, §2.
  • [23] H. Liu, K. Simonyan, and Y. Yang (2019) DARTS: differentiable architecture search. In ICLR. Cited by: §1, §1, §2.
  • [24] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In CVPR, pp. 3431–3440. Cited by: §1, §2.
  • [25] C. J. Maddison, A. Mnih, and Y. W. Teh (2016) The concrete distribution: a continuous relaxation of discrete random variables. arXiv:1611.00712. Cited by: §3.1.
  • [26] M. Minsky (1988) The society of mind. Simon & Schuster. Cited by: §1.
  • [27] R. Negrinho and G. Gordon (2017) Deeparchitect: automatically designing and training deep architectures. arXiv:1704.08792. Cited by: §1.
  • [28] V. Nekrasov, H. Chen, C. Shen, and I. Reid (2019) Fast neural architecture search of compact semantic segmentation models via auxiliary cells. In CVPR, pp. 9126–9135. Cited by: §2.
  • [29] A. Noy, N. Nayman, T. Ridnik, N. Zamir, S. Doveh, I. Friedman, R. Giryes, and L. Zelnik-Manor (2019) ASAP: architecture search, anneal and prune. arXiv:1904.04123. Cited by: §2.
  • [30] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello (2016) Enet: a deep neural network architecture for real-time semantic segmentation. arXiv:1606.02147. Cited by: §1, §2.
  • [31] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.2.
  • [32] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. In ICML, pp. 4092–4101. Cited by: §1, §1, §2.
  • [33] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2018) Regularized evolution for image classifier architecture search. CoRR abs/1802.01548. Cited by: §2.
  • [34] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556. Cited by: §2.
  • [35] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019) Mnasnet: platform-aware neural architecture search for mobile. In CVPR, pp. 2820–2828. Cited by: §2, §3.3.
  • [36] X. Wang and A. Gupta (2018) Videos as space-time region graphs. In ECCV, pp. 399–417. Cited by: §2, §3.2.
  • [37] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer (2019) Fbnet: hardware-aware efficient convnet design via differentiable neural architecture search. In CVPR, pp. 10734–10742. Cited by: §2, §3.3, §3.3.
  • [38] Z. Wu, C. Shen, and A. van den Hengel (2016) High-performance semantic segmentation using very deep fully convolutional networks. CoRR abs/1604.04339. Cited by: §4.2.
  • [39] S. Xie, H. Zheng, C. Liu, and L. Lin (2019) SNAS: stochastic neural architecture search. In ICLR. Cited by: §1, §1, §2, §3.1.
  • [40] S. Yan, Y. Xiong, and D. Lin (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI. Cited by: §2.
  • [41] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang (2018) BiSeNet: bilateral segmentation network for real-time semantic segmentation. In ECCV, pp. 334–349. Cited by: §1, §1, §2, §4.3.
  • [42] Y. Zhang, Z. Qiu, J. Liu, T. Yao, D. Liu, and T. Mei (2019) Customizable architecture search for semantic segmentation. In CVPR, pp. 11641–11650. Cited by: §1, §2, §3.3, §4.3, §4.3.
  • [43] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia (2018) Icnet for real-time semantic segmentation on high-resolution images. In ECCV, pp. 405–420. Cited by: §1, §1, §2.
  • [44] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In CVPR, pp. 2881–2890. Cited by: §1, §1, §2.
  • [45] B. Zoph and Q. V. Le (2017) Neural architecture search with reinforcement learning. In ICLR. Cited by: §2.
  • [46] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In CVPR, pp. 8697–8710. Cited by: §1.