Code for Fast Neural Architecture Search of Compact Semantic Segmentation Models via Auxiliary Cells, CVPR '19
Automated design of architectures tailored for a specific task at hand is an extremely promising, albeit inherently difficult, avenue to explore. While most results in this domain have been achieved on image classification and language modelling problems, here we concentrate on dense per-pixel tasks, in particular, semantic image segmentation using fully convolutional networks. In contrast to the aforementioned areas, the design choice of a fully convolutional network requires several changes, ranging from the sort of operations that need to be used - e.g., dilated convolutions - to solving a more difficult optimisation problem. In this work, we are particularly interested in searching for high-performance compact segmentation architectures, able to run in real-time using limited resources. To achieve that, we intentionally over-parameterise the architecture during training via a set of auxiliary cells that provide an intermediate supervisory signal and can be omitted during the evaluation phase. The design of the auxiliary cell is emitted by a controller, a neural architecture with a fixed structure trained using reinforcement learning. More crucially, we demonstrate how to efficiently search for these architectures within limited time and computational budgets. In particular, we rely on a progressive strategy that stops non-promising architectures from being trained further, and on Polyak averaging coupled with knowledge distillation to speed up convergence. Quantitatively, in 8 GPU-days our approach discovers a set of architectures performing on par with state-of-the-art among compact models.
For years, the design of neural network architectures was thought to be solely the duty of a human expert: it was her responsibility to specify which type of architecture to use, how many layers there should be, how many channels the convolutional layers should have, and so on. This is no longer the case, as automated neural architecture search - a way of predicting the neural network structure via a non-human expert (an algorithm) - is growing fast. Potentially, this may well mean that instead of manually adapting a single state-of-the-art architecture for a new task at hand, the algorithm would discover a set of best-suited and high-performing architectures on given data.
A few decades ago, such an algorithm was based on evolutionary programming strategies, where the best architectures seen so far underwent mutations and their most promising offspring were bound to continue evolving. Now, we have reached the stage where a secondary neural network, oftentimes called the controller, replaces the human in the loop by iteratively searching among possible architecture candidates and maximising the expected score on the held-out set. While there is a lack of theoretical work behind this latter approach, several promising empirical breakthroughs have already been achieved [3, 49].
At this point, it is important to emphasise that such accomplishments required an excessive amount of computational resources - more than GPU-days for the work of Zoph and Le and for Zoph et al. Although a few works have reduced those to single-digit numbers on image classification and language processing tasks [29, 22], we consider more challenging dense per-pixel tasks that produce an output for each pixel in the input image and for which no efficient training regimes have been previously presented. Although here we concentrate only on semantic image segmentation, our proposed methodology can immediately be applied to other per-pixel prediction tasks, such as depth estimation and pose estimation. In our experiments, we demonstrate the transferability of the discovered segmentation architecture to the latter problems. Notably, all of them play an important role in computer vision and robotic applications, and so far they have relied on manually designed accurate low-latency models for real-world scenarios.
The focus of our work is to automatically discover compact high-performing fully convolutional architectures, able to run in real-time on a low computational budget, for example, on the Jetson platform. To this end, we are explicitly looking for structures that not only improve the performance on the held-out set, but also facilitate the optimisation during the training stage. Concretely, we consider the encoder-decoder type of a fully convolutional network, where the encoder is represented by a pre-trained image classifier, and the decoder structure is emitted by the controller network. The controller generates the connectivity structure between encoder and decoder, as well as the sequence of operations (that form the so-called cell) to be applied on each connected path. The same cell structure is used to form an auxiliary classifier, the goal of which is to provide intermediate supervision and to implicitly over-parameterise the model. Over-parameterisation is believed to be the primary reason behind the successes of deep learning models, and a few theoretical works have already addressed it in simplified cases [38, 8]. Along with empirical results, this is the primary motivation behind the described approach.
Last, but not least, we devise a search strategy that makes it possible to find high-performing architectures within a small number of days using only a few GPUs. Concretely, we pursue two goals here:
To prevent ‘bad’ architectures from being trained for long; and
To achieve a solid performance estimate as soon as possible.
To tackle the first goal, we divide the training process during the search into two stages. During the first stage, we fix the encoder’s weights and pre-compute its outputs, while only training the decoder part. For the second stage, we train the whole model end-to-end. We validate the performance after the first stage and terminate the training of non-promising architectures. For the second goal, we employ Polyak averaging and knowledge distillation to speed up convergence.
To summarise, our contribution in this work is an efficient neural architecture search strategy for dense per-pixel tasks that (i.) allows sampling compact high-performing architectures, which (ii.) can be run in real-time on low-compute platforms, such as JetsonTX2. In particular, the above points are made possible by:
Devising a progressive strategy able to eliminate poor candidates early in the training;
Developing a training schedule for semantic segmentation able to provide solid results quickly via the means of knowledge distillation and Polyak averaging;
Searching for an over-parameterised auxiliary cell that provides better training and is obsolete during inference.
Traditionally, architecture search methods have relied upon evolutionary strategies [2, 41, 40], where a population of networks (oftentimes together with their weights) is continuously mutated, and less promising networks are discarded. Modern neuro-evolutionary approaches [31, 23] rely on the same principles and benefit from available computational resources, which allow them to achieve impressive results. Bayesian optimisation methods that estimate the probability density of the objective function have long been used for hyper-parameter search [37, 4]. Scaling up Bayesian methods for architecture search is ongoing work, and a few kernel-based approaches have already shown solid performance [42, 14].
Most recently, neural architecture search (NAS) strategies based on reinforcement learning (RL) have attained state-of-the-art results on the tasks of image classification and natural language processing [3, 48, 49]. Relying on enormous computational resources, these algorithms comprise a separate neural network, the so-called ‘controller’, that emits an architecture design and receives a scalar reward after the emitted architecture is trained on the task of interest. Notably, thousands of iterations and GPU-days are needed for convergence. Rather than searching for the whole network structure from scratch, these methods tend to look for cells - repeatable motifs that can be stacked multiple times in a feedforward fashion.
Several solutions for making NAS methods more efficient have been recently proposed. In particular, Pham et al. unroll the computational graph of all possible architectures and allow sharing the weights among different architectures. This dramatically reduces the amount of resources needed for convergence. In a similar vein of research, Liu et al. exploit a progressive strategy where the network complexity is gradually increased, while a ranking network is trained in parallel to predict the performance of a new architecture. A few methods have been built around continuous relaxation of the search problem. In particular, Luo et al. use an encoder to embed the architecture description into a latent space, and an estimator to predict the performance of an architecture given its embedding. While these methods make the search process more efficient, they do so by sacrificing the expressiveness of the search space, and hence may arrive at a sub-optimal solution.
In semantic segmentation [19, 18, 20], up to now all the architectures have been manually designed, closely following the winning entries of image classification challenges. Two prominent directions have emerged over the last few years: the encoder-decoder type [24, 28, 18], where better features are learned at the expense of having a spatially coarse output mask, and another popular approach that discards several down-sampling layers and relies on dilated convolutions to keep the receptive field size intact [6, 47, 45]. Chen et al. have also shown that the combination of those two paradigms leads to even better results across different benchmarks. In terms of NAS in semantic segmentation, independently of us and in parallel to our work, a straightforward adaptation of image classification NAS methods was proposed by Chen et al. In it, they randomly search for a single segmentation cell design and achieve impressive results by using almost GPUs over the range of days. In contrast to that, our method first and foremost is able to find compact segmentation models in only a fraction of that time. Secondly, it differs significantly in terms of the search design and search methodology.
For the purposes of a clearer presentation of our ideas, we briefly review knowledge distillation, an approach proposed by Hinton et al. to successfully train a compact model using the outputs of a single (or an ensemble of) large network(s) pre-trained on the current task. In it, the logits of the pre-trained network are used as an additional regulariser for the small network. In other words, the latter has to mimic the outputs of the former. Such a method was shown to provide a better learning signal for the small network. As a result, it has already found its way across multiple domains: computer vision, reinforcement learning and continuous learning, to name a few.
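To make the idea of mimicking a teacher concrete, the following is a minimal sketch of a soft-target distillation term in the spirit of Hinton et al., where temperature-softened teacher probabilities act as targets for the student. The function names, the temperature value and the toy logits are assumptions chosen purely for illustration; this is not necessarily the loss used in our experiments.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


# Toy logits (hypothetical values): one student agrees with the teacher,
# the other ranks the classes in the opposite order.
teacher = [4.0, 1.0, 0.5]
loss_close = distillation_loss([3.5, 1.2, 0.4], teacher)
loss_far = distillation_loss([0.5, 1.0, 4.0], teacher)
```

A student whose logits follow the teacher's receives a lower loss than one that disagrees, which is the regularising effect described above.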
We start with the problem formulation, proceed with the definitions of an auxiliary cell and knowledge distillation loss, and conclude with the overall search strategy.
We primarily focus on two research questions: (i.) how to acquire a reliable estimate of the segmentation model performance as quickly as possible; and (ii.) how to improve the training process of the segmentation architecture through over-parameterisation, obsolete during inference.
We consider a dense prediction task $\mathcal{T}$, for which we have multiple training tuples $\{(X, y)\}$, where both $X$ and $y$ are three-dimensional tensors with equal spatial and arbitrary third dimensions. In this work, $X$ is a 3-channel RGB image, while $y$ is a $C$-channel one-hot segmentation mask with $C$ being equal to the number of classes, which corresponds to semantic image segmentation. Furthermore, we rely on a mapping $f: X \rightarrow y$ with parameters $\theta$, that is represented by a fully convolutional neural network. We assume that the network $f$ can further be decomposed into two parts: $e$, representing the encoder, and $d$, the decoder. We initialise the encoder with weights from a pre-trained classification network consisting of multiple down-sampling operations that reduce the spatial dimensions of the input. The decoder part, on the other hand, has access to several outputs of the encoder with varying spatial and channel dimensions. The search goal is to choose which feature maps to use and what operations to apply on them. We next describe the decoder search space in full detail.
We restrict our attention to the decoder part, as it is currently infeasible to perform a full segmentation network search from scratch.
As mentioned above, the decoder has access to multiple layers from the pre-trained encoder with varying dimensions. To keep sampled architectures compact and of approximately equal size, each encoder output undergoes a single convolution with the same number of output channels. We rely on a recurrent neural network, the controller, to sequentially produce pairs of indices of which layers to use, and what operations to apply on them. In particular, this sequence of operations is combined to form a cell (see example in Fig. 2). The same cell but with different weights is applied to each layer inside the sampled pair, and the outputs of two cells are summed up. The resultant layer is added to the sampling pool. The number of times pairs of layers are sampled is controlled by a hyper-parameter, which we set to in our experiments, allowing the controller to recover such encoder-decoder architectures as FCN or RefineNet. All non-sampled summation outputs are concatenated, before being fed into a single convolution to reduce the number of channels followed by the final classification layer.
Each cell takes a single input with the controller first deciding which operation to use on that input. The controller then proceeds by sampling with replacement two locations out of two, i.e., of input and the result of the first operation, and two corresponding operations. The outputs of each operation are summed up, and all three layers (from each operation and the result of their summation) together with the initial two can be sampled on the next step. The number of times the locations are sampled inside the cell is controlled by another hyper-parameter, which we also set to in our experiments in order to keep the number of all possible architectures to a feasible amount (see footnote 1). All existing non-sampled summation outputs inside the cell are summed up, and used as the cell output. In this case, we resort to sum as concatenation may lead to variable-sized outputs between different architectures.
Footnote 1: Taking into account symmetrical - thus, identical - architectures, we estimate the number of unique connections in the decoder part to be , and the number of unique cells , leading to , which is on par with concurrent works.
Based on existing research in semantic segmentation, we consider operations:
separable conv ,
separable conv ,
global average pooling followed by upsampling and conv ,
conv with dilation rate ,
conv with dilation rate ,
separable conv with dilation rate ,
separable conv with dilation rate ,
zero-operation that effectively nullifies the path.
An example of the search layout with decoder blocks and cell branches is depicted in Fig. 2.
We divide the training set into two disjoint sets: meta-train and meta-val. The meta-train subset is used to train the sampled architecture on the given task (i.e., semantic segmentation), whereas meta-val is used to evaluate the trained architecture and provide the controller with a scalar, oftentimes called reward in the reinforcement learning literature. Given the sampled sequence, its logarithmic probabilities and the reward signal, the controller is optimised via proximal policy optimisation (PPO). Hence, there are two training processes present: inner - optimisation of the sampled architecture on the given task, and outer - optimisation of the controller. We next concentrate on the inner loop.
We divide the inner training process into two stages. During the first stage, the encoder weights are fixed and its outputs are pre-computed, while only the decoder is being trained. This leads to a quick adaptation of the decoder weights and a reasonable estimate of the performance of the sampled architecture. We exploit a simple heuristic to decide whether or not to continue training the sampled architecture for the second stage. Concretely, the current reward value is compared with the running mean of rewards seen so far, and if it is higher, we continue training. Otherwise, with some probability we terminate the training process. The probability is annealed throughout our search (starting from ).
The motivation behind this is straightforward: the results of the first stage, while noisy, can still provide a reasonable estimate of the potential of the sampled architecture. At the very least, they would present a reliable signal that the sampled architecture is non-promising, while spending only a few seconds on it. Such a simple approach encourages exploration during early stages of search akin to the $\epsilon$-greedy strategy often used in the multi-armed bandit problem.
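The stopping heuristic described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than our exact implementation: the class name `EarlyStopper`, the parameter `explore_prob` (the chance of continuing despite a below-average reward), its multiplicative `decay` and the concrete values in the usage note are all hypothetical, and the direction and form of the annealing are assumed.

```python
import random


class EarlyStopper:
    """Decides after stage one whether an architecture proceeds to stage two."""

    def __init__(self, explore_prob, decay, rng=None):
        self.explore_prob = explore_prob  # hypothetical: chance to continue anyway
        self.decay = decay                # hypothetical multiplicative annealing
        self.mean_reward = 0.0
        self.count = 0
        self.rng = rng or random.Random(0)

    def should_continue(self, reward):
        # Compare against the running mean of rewards seen so far.
        above_mean = self.count == 0 or reward > self.mean_reward
        # Incrementally update the running mean with the new reward.
        self.count += 1
        self.mean_reward += (reward - self.mean_reward) / self.count
        if above_mean:
            return True
        # Below the mean: continue only with the current exploration
        # probability, then anneal that probability.
        keep = self.rng.random() < self.explore_prob
        self.explore_prob *= self.decay
        return keep
```

For instance, `EarlyStopper(explore_prob=0.5, decay=0.99)` would let roughly half of the below-average candidates through early in the search and progressively fewer later on.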
Semantic segmentation models are notable for requiring many iterations to converge. Partially, this is addressed by initialising the encoder part from a pre-trained classification network. Unfortunately, no such thing exists for the decoder.
Fortunately, though, we can explore several alternatives that provide faster convergence. Besides tailoring our optimisation hyper-parameters, we rely on two more tricks: firstly, we keep track of the running average of the parameters during each stage and apply them before the final validation. Secondly, we append an additional loss term between the logits of the current architecture and those of a pre-trained teacher network. We can either pre-compute the teacher’s outputs beforehand, or acquire them on-the-fly in case the teacher’s computations are negligible.
The combination of both of these approaches allows us to receive a very reliable estimate of the performance of the semantic segmentation model as quickly as possible without a significant overhead.
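As an illustration of the first trick, a running average of the parameters can be maintained as an exponential moving average. The sketch below uses plain Python floats in place of weight tensors; the function name `polyak_update`, the decay value and the toy parameter trajectory are assumptions for illustration only.

```python
def polyak_update(avg_params, params, decay):
    """In-place exponential moving average of parameters (name -> float)."""
    for name, value in params.items():
        avg_params[name] = decay * avg_params[name] + (1.0 - decay) * value
    return avg_params


# Toy run: "w" changes at every step, "b" stays constant.
params = {"w": 0.0, "b": 1.0}
avg = dict(params)  # the average starts from the initial parameters
for step in range(1, 4):
    params = {"w": float(step), "b": 1.0}  # stand-in for an optimiser step
    polyak_update(avg, params, decay=0.5)
```

Before the final validation, the averaged values in `avg` would be copied into the model in place of the raw parameters.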
We further look for ways of easing optimisation during fast search, as well as during a longer training of semantic segmentation models. Thus, still aligning with the goal of having a compact but accurate model, we explicitly aim to find ways of performing steps that are beneficial during training and obsolete during evaluation.
One approach that we consider here is to append an auxiliary cell after each summation between pairs of main cells - the auxiliary cell is identical to the main cell and can either be conditioned to output ground truth directly, or to mimic the teacher’s network predictions (or the combination of the above two). At the same time, it does not influence the output of the main classifier either during the training or testing and merely provides better gradients for the rest of the network. In the end, the reward per the sampled architecture will still be decided by the output of the main classifier. For simplicity, we only apply the segmentation loss on all auxiliary outputs.
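The role of the auxiliary outputs can be summarised in a few lines: they contribute to the training objective but never to the prediction. The sketch below assumes that the auxiliary losses are summed and scaled by a single coefficient; the function names and this particular weighting are illustrative assumptions.

```python
def total_training_loss(main_loss, aux_losses, aux_coeff):
    """Main segmentation loss plus weighted auxiliary segmentation losses."""
    return main_loss + aux_coeff * sum(aux_losses)


def inference_output(main_output, aux_outputs):
    """At test time the auxiliary outputs are simply discarded."""
    return main_output


loss = total_training_loss(main_loss=1.0, aux_losses=[0.25, 0.75], aux_coeff=0.5)
```

Because `inference_output` ignores the auxiliary branch entirely, the corresponding layers can be removed from the deployed model at no cost in accuracy.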
The notion of intermediate supervision is not novel in neural networks, but to the best of our knowledge, prior works have merely been relying on a simple auxiliary classifier, and we are the first to tie up the design of decoder with the design of the auxiliary cell. We demonstrate the quantitative benefits of doing so in our ablation studies (Sect. 4.2).
Furthermore, our motivation behind searching for cells that may also serve as intermediate supervisors stems from ever-growing empirical (and theoretical under certain assumptions) evidence that deep networks benefit from over-parameterisation during training [38, 8]. While auxiliary cells provide an implicit notion of over-parameterisation, we could have explicitly increased the number of channels and then resorted to pruning. Nonetheless, pruning methods tend to result in unstructured networks often carrying no tangible benefits in terms of the runtime speed, whereas our solution simply permits omitting unused layers during inference.
We conduct extensive experiments on PASCAL VOC, an established semantic segmentation benchmark that comprises semantic classes and provides training images. For the search process, we extend those to more than by exploiting annotations from BSD. As is commonly done, during search we keep of those images for validation of the sampled architectures, which provides the controller with the reward signal. For the first stage, we pre-compute the encoder outputs on images and store them for faster processing.
The controller is a two-layer recurrent LSTM neural network with hidden units. All the units are randomly initialised from a uniform distribution. We use PPO for optimisation with the learning rate of .
The encoder part of our network is MobileNet-v2, pretrained on MS COCO for semantic segmentation using the Light-Weight RefineNet decoder. We omit the last layers and consider four outputs from layers as inputs to the decoder; convolutional layers used for adaptation of the encoder outputs have output channels during search and during training. Decoder weights are randomly initialised using the Xavier scheme. To perform knowledge distillation, we use Light-Weight RefineNet-152, and apply loss with the coefficient of . The knowledge distillation outputs are pre-computed for the first stage and omitted during the second one in the interests of time. Polyak averaging is applied with the decay rates of and , correspondingly. Batch normalisation statistics are updated during both stages.
All our search experiments are conducted on two Ti GPU cards, with the search process being terminated after days. All runtime measurements are carried out on a single Ti card, or on JetsonTX2, if mentioned otherwise. In particular, we perform the forward pass times and report the mean result together with standard deviation.
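A runtime measurement of this kind can be sketched with the Python standard library alone. The helper name `benchmark`, the warm-up pass and the millisecond units are assumptions for illustration; on a GPU, the device would additionally need to be synchronised before each clock reading.

```python
import statistics
import time


def benchmark(forward, n_runs, n_warmup=1):
    """Time a callable n_runs times; return (mean, std) in milliseconds."""
    for _ in range(n_warmup):
        forward()  # warm-up runs are not timed
    timings_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        forward()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    # statistics.stdev needs at least two timings.
    return statistics.mean(timings_ms), statistics.stdev(timings_ms)


# Stand-in workload in place of a network forward pass.
mean_ms, std_ms = benchmark(lambda: sum(range(1000)), n_runs=5)
```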
For the inner training of the sampled architectures, we devise a fast and stable training strategy: we exploit the Adam learning rule for the decoder part of the network, and SGD with momentum for the encoder. In particular, we use learning rates of - and -, respectively. We pre-train each sampled architecture for epochs on the first stage, and for on the second (in case the stopping criterion is not triggered). As the reward signal, we consider the geometric mean of three quantities, namely:
i.) mean intersection-over-union (IoU), or Jaccard Index, primarily used across semantic segmentation benchmarks;
ii.) frequency-weighted IoU, that scales each class IoU by the number of pixels present in that class, and
iii.) mean-pixel accuracy, that averages the number of correct pixels per each class. When computing, we do not include background class as it tends to skew the results due to a large number of pixels belonging to background. As mentioned above, we keep the running mean of rewards after the first stage to decide whether to continue training a sampled architecture.
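The reward can be computed from a confusion matrix as sketched below. The function name, the nested-list representation and the exact handling of the background class (dropped from all three averages, while its pixels still count as errors for the other classes) are assumptions made for illustration.

```python
def segmentation_reward(conf, background=0):
    """conf[i][j] = number of pixels of true class i predicted as class j."""
    n = len(conf)
    ious, accs, freqs = [], [], []
    for c in range(n):
        if c == background:
            continue  # background is excluded from all three averages
        tp = conf[c][c]
        gt = sum(conf[c])                         # pixels whose true class is c
        pred = sum(conf[r][c] for r in range(n))  # pixels predicted as c
        union = gt + pred - tp
        ious.append(tp / union if union else 0.0)
        accs.append(tp / gt if gt else 0.0)
        freqs.append(gt)
    mean_iou = sum(ious) / len(ious)
    total = sum(freqs)
    fw_iou = sum(f * i for f, i in zip(freqs, ious)) / total if total else 0.0
    mean_acc = sum(accs) / len(accs)
    # Geometric mean of the three quantities.
    reward = (mean_iou * fw_iou * mean_acc) ** (1.0 / 3.0)
    return reward, mean_iou, fw_iou, mean_acc


# Toy 3-class confusion matrix (class 0 is background).
toy = [[4, 1, 0],
       [0, 2, 2],
       [0, 0, 4]]
reward, mean_iou, fw_iou, mean_acc = segmentation_reward(toy)
```

Using a geometric rather than an arithmetic mean means that an architecture scoring poorly on any single quantity is penalised heavily.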
We visualise the reward progress during both stages in Figure 4. As evident from it, the quality of the emitted architectures grows with time - it is even possible that more iterations would lead to better results, although we do not explore that in the interest of time. On the other hand, while random search has the potential of occasionally sampling decent architectures, it finds only a fraction of them in comparison to the RL-based controller.
Moreover, we evaluate the impact of the inclusion of Polyak averaging, auxiliary cells and knowledge distillation on each training stage. To this end, we randomly sample and train architectures. We visualise the distributions of rewards in Fig. 5. All the tested settings significantly outperform the baseline on both stages, and the highest rewards on the second stage are attained when using all of the components above.
After the search process is finished, we select architectures discovered by the RL controller with highest rewards and proceed by carrying out additional ablation studies aimed to estimate the benefit of the proposed auxiliary scheme in case the architectures are allowed to train for longer.
In particular, we train each architecture for epochs on BSD together with PASCAL VOC and epochs on PASCAL VOC only. For simplicity, we omit Polyak averaging and knowledge distillation. Three distinct setups are tested: concretely, we estimate whether intermediate supervision helps at all, and whether the auxiliary cell is superior to a plain auxiliary classifier.
The results of these ablation studies are given in Fig. 6. Auxiliary supervised architectures achieve significantly higher mean IoU, and, in particular, architectures with auxiliary cells attain best results in out of cases, reaching absolute best values across all the setups and architectures.
We further measure the correlation between rewards acquired during the search process with the RL-based controller and the mean IoU attained by the same architectures trained for longer.
To this end, we randomly sample architectures out of those explored by the controller: for fair comparison, we sample architectures with poor search performance (with rewards being less than ), with medium rewards (between and ), and with high rewards (). We train each architecture on BSD+VOC and VOC as in Sect. 4.2, rank each according to its rewards, and mean IoU, and measure the Spearman’s rank correlation coefficient. As visible in Fig. 9, there is a strong correlation between rewards after each stage, as well as between the final reward and mean IoU. This signals that our search process is able to reliably differentiate between poor-performing and well-performing architectures.
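For reference, Spearman's rank correlation is the Pearson correlation computed on ranks. A self-contained sketch is given below; the helper names and the toy reward and IoU values are hypothetical and serve only to show the computation.

```python
def ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Hypothetical search rewards and mean IoU values of four architectures.
rho = spearman([0.10, 0.40, 0.55, 0.70], [50.0, 62.0, 61.0, 70.0])
```

A value close to 1 indicates that ranking architectures by their search reward closely matches ranking them by their final mean IoU.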
Finally, we choose the best performing architectures from Sect. 4.2 and train each on the full training set, augmented with annotations from MS COCO. The training setup is analogous to the aforementioned one with the first stage being trained for epochs (on COCO+BSD+VOC), the second stage - for (BSD+VOC), and the last one - for (VOC only). After each stage, the learning rates are halved. Additionally, halfway through the last stage we freeze the batch norm statistics and divide the learning rate in half. We exploit intermediate supervision via auxiliary cells with coefficients of across the stages.
Quantitative results are given in Table 1 and a few qualitative examples are in Fig. 37. The architectures discovered by our method achieve competitive performance in comparison to state-of-the-art compact models and even do so with a significantly lower number of floating point operations for the same output resolution. At the same time, the found architectures can be run in real-time both on a generic GPU card and JetsonTX2.
We visualise the structure of the highest performing architecture (arch0) in Fig. 11. Having multiple branches encoding information of different scales, it resembles several prominent blocks in semantic segmentation, notably the ASPP module. Importantly, the cell found by our method differs in the way the receptive field size is controlled. Whereas ASPP solely relies on various dilation rates, here convolutions with different kernel sizes arranged in a cascaded manner allow much more flexibility. Furthermore, this design is more computationally efficient and has better expressiveness as intermediate features can be easily re-used.
We further apply the found architectures on the task of pose estimation. In particular, the MPII and MS COCO Keypoint datasets are used as our benchmark. MPII includes K images containing K people with annotated body joints. The evaluation measure is PCKh with thresholds of and . The COCO dataset comprises K images of K people with body joints. Based on object keypoint similarity (OKS; see http://cocodataset.org/#keypoints-eval), we report average precision (AP) and average recall (AR) over different OKS thresholds.
Finally, we train the architectures on NYUDv2 for the depth estimation task. Following previous work, we only use K training images with depth annotations collected with the Kinect sensor. We report validation results on images in Table 3. Among other compact real-time networks, we achieve significantly better results across all the metrics without any additional tricks. Note also that the work in trained the depth model jointly with semantic segmentation, thus using extra information.
There is little doubt that manual design of neural architectures is a tedious and difficult task to handle. It is even more complicated to come up with a design of compact and high-performing architecture on challenging dense prediction problems, such as semantic segmentation. In this work, we showcased a simple and reliable approach of searching for fully convolutional architectures within a reasonable amount of time and computational resources. Our method is based around over-parameterisation of small networks that allows them to converge to better solutions. We achieved competitive performance to manually designed state-of-the-art compact architectures on PASCAL VOC, while searching only for days on GPU cards. Furthermore, best found segmentation architectures also attained excellent results on other dense per-pixel tasks, namely, pose estimation and depth prediction.
Our future goals include exploration of alternative ways of over-parameterisation and search space description.
The participation of VN, CS and IR in this work was in part supported by the ARC Centre of Excellence for Robotic Vision. CS was also supported by the GeoVision CRC Project. Correspondence should be addressed to CS.
An evolutionary algorithm that constructs recurrent neural networks. IEEE Trans. Neural Networks, 1994.
Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proc. Int. Conf. Mach. Learn., 2013.
Practical Bayesian optimization of machine learning algorithms. In Proc. Advances in Neural Inf. Process. Syst., 2012.
Taskonomy: Disentangling task transfer learning. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2018.
Our fully convolutional networks follow the encoder-decoder design paradigm. In particular, in place of the encoder we rely on an existing image classifier - here, MobileNet-v2. The decoder has access to layers from the encoder with varying dimensions. To form connections inside the decoder part, we i.) first sample a pair of indices out of possible choices with replacement, ii.) apply the same set of operations (cell) on each sampled index, iii.) sum up the outputs (Fig. 39), and iv.) add the resultant layer into the sampling pool. In total, we repeat this process times. Finally, all non-sampled summation outputs are concatenated, before being fed into a single convolution to reduce the number of channels followed by the final classification layer.
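Steps i.) to iv.) can be sketched as a small sampling routine. Here uniform random sampling stands in for the learned controller, and the function name, argument names and the numbers in the usage note are illustrative assumptions.

```python
import random


def sample_decoder_connections(num_encoder_outputs, num_blocks, rng):
    """Sample index pairs; return them with the never-sampled summation outputs."""
    pool_size = num_encoder_outputs  # the pool initially holds encoder outputs
    pairs = []
    used = set()
    for _ in range(num_blocks):
        i = rng.randrange(pool_size)
        j = rng.randrange(pool_size)  # with replacement: i may equal j
        pairs.append((i, j))
        used.update((i, j))
        pool_size += 1  # the summed output of this block joins the pool
    # Summation outputs occupy the indices right after the encoder outputs.
    block_ids = range(num_encoder_outputs, num_encoder_outputs + num_blocks)
    unused_blocks = [b for b in block_ids if b not in used]
    return pairs, unused_blocks
```

For example, `sample_decoder_connections(4, 3, random.Random(0))` returns three index pairs together with the summation outputs that would be concatenated before the final classifier; the last block always appears among them, since nothing can sample it afterwards.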
The cell structure is similarly generated via sampling a set of operations and corresponding indices. Nevertheless, there are several notable differences:
The operation at each position can vary;
A single operation is applied to the input without any aggregation operator;
After that, two indices and two operations are being sampled with replacement, with the corresponding outputs being summed up (this is repeated times);
The outputs of each operation along with their summation layer are added into the sampling pool.
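The cell sampling procedure listed above can be sketched in the same way. Uniform random sampling again replaces the controller, and the function name, the pool indexing scheme and the numbers in the usage note are assumptions for illustration.

```python
import random


def sample_cell(num_ops, num_branches, rng):
    """Sample a cell: one initial operation followed by paired branches."""
    first_op = rng.randrange(num_ops)
    # Pool indices: 0 is the cell input, 1 is the output of the first operation.
    pool_size = 2
    branches = []
    sum_ids = []
    used = set()
    for _ in range(num_branches):
        idx_a, idx_b = rng.randrange(pool_size), rng.randrange(pool_size)
        op_a, op_b = rng.randrange(num_ops), rng.randrange(num_ops)
        branches.append((idx_a, op_a, idx_b, op_b))
        used.update((idx_a, idx_b))
        # Three layers join the pool: both operation outputs and their sum.
        sum_ids.append(pool_size + 2)
        pool_size += 3
    # The cell output is the sum of all summation layers never sampled again.
    output_ids = [s for s in sum_ids if s not in used]
    return first_op, branches, output_ids
```

With, say, `sample_cell(num_ops=8, num_branches=3, rng=random.Random(0))`, the pool grows from two to eleven entries, and the final summation layer is always part of the cell output.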
An example of the cell structure with its complete search space is illustrated in Fig. 41.
We use a list of integers to encode the architecture found by the controller, corresponding to the output sequence of the RNN. Specifically, the list describes the connectivity structure and the cell configuration. For example, the following connectivity structure contains three pairs of digits, indicating the input index of a corresponding layer in the sampling pool. The cell configuration, , comprises the first operation followed by three cell branches with the operation applied on the index .
We provide the description of operations in Table 4, and visualise the discovered structures in Fig. 43 (arch0), Fig. 45 (arch1), and Fig. 47 (arch2). Note that inside the cell only the final summation operator is displayed as intermediate summations would lead to identical structures.
| Index | Abbreviation | Description |
|---|---|---|
| 4 | gap | global average pooling followed by upsampling and conv |
| 5 | conv3x3 rate 3 | 3x3 conv with dilation rate 3 |
| 6 | conv3x3 rate 12 | 3x3 conv with dilation rate 12 |
| 7 | sep3x3 rate 3 | 3x3 separable conv with dilation rate 3 |
| 8 | sep5x5 rate 6 | 5x5 separable conv with dilation rate 6 |
| 10 | zero | zero-operation that effectively nullifies the path |
We start training with the learning rates of - and - - for the encoder and the decoder, respectively. The encoder weights are updated using SGD with the momentum value of , whereas for the decoder part we rely on Adam with default parameters of , and . We exploit the batch size of , evenly divided over two Ti GPU cards. Each image in the batch is randomly scaled in the range of , randomly mirrored, before being randomly cropped and padded to the size of . During training, in order to calculate the loss term, we upsample the logits to the size of the target mask.
In addition to the results presented in the main text, we provide per-class intersection-over-union values across the models in Table 5.
For pose estimation, we crop the human instance with fixed aspect ratios, for MPII and for COCO. Following Xiao et al., the bounding box is further resized such that the longer side is equal to . For MPII, scale, degree rotation and random flip are used for data augmentation. The scale and rotation factors for COCO are and degrees, respectively. We generate keypoint heatmaps of output stride with Gaussian distribution with . The MobileNet-v2 encoder is initialised from ImageNet. We use the Adam optimiser with the base learning rate of , and reduce it by after epochs and . The training terminates at the epoch . We use the batch size of evenly split between two Ti GPU cards.
For depth estimation, we start training with the learning rates of - and - - for the encoder and the decoder, respectively. For both we use SGD with the momentum value of , and anneal the learning rates via the ‘Poly’ schedule: . The training is stopped after epochs. We exploit the batch size of , evenly divided over two Ti GPU cards. Each image in the batch is randomly scaled in the range of , randomly mirrored, before being randomly cropped and padded to the size of . We upsample the logits to the size of the target mask and use the inverse Huber loss  for optimisation, ignoring pixels with missing depth measurements.
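A minimal sketch of the inverse Huber (berHu) loss with masking of missing measurements is given below. It operates on flat Python lists for clarity; representing missing depth as `None` and setting the threshold to a fixed fraction of the largest residual (a common choice in the depth estimation literature) are assumptions, as are the function and parameter names.

```python
def berhu_loss(pred, target, threshold_ratio=0.2):
    """Inverse Huber loss; targets equal to None are treated as missing depth."""
    residuals = [abs(p - t) for p, t in zip(pred, target) if t is not None]
    if not residuals:
        return 0.0  # no valid pixels to supervise
    c = threshold_ratio * max(residuals)
    total = 0.0
    for r in residuals:
        if r <= c:
            total += r  # L1 regime for small residuals
        else:
            total += (r * r + c * c) / (2.0 * c)  # quadratic regime
    return total / len(residuals)


# Toy example: the second pixel has no depth measurement and is ignored.
loss = berhu_loss([1.0, 2.0, 5.0], [1.0, None, 3.0])
```

Small residuals are penalised linearly and large ones quadratically, which is the reverse of the standard Huber loss.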
We visualise qualitative results on the validation set in Fig. 79.
During our experiments we observed a significant difference between models’ runtime on JetsonTX2 and Ti. To better understand it, we additionally measured the runtime of each discovered architecture together with Light-Weight RefineNet while varying the input resolution.
As evident from Fig. 82, the models with a larger number of floating point operations (i.e., Arch0 and RF-LW) do not scale well with the input resolution. The effect is even more pronounced on JetsonTX2, as has been independently confirmed by an NVIDIA employee in a private conversation.