Human beings are well-equipped by evolution to quickly observe changes in dynamic environments. After merely a few seconds of studying an unknown scene, we are able to coherently map out its main constituents. In contrast, static semantic segmentation networks perform poorly in such conditions and may well produce contradictory predictions across frames. The question therefore arises of how to make static models suitable for segmenting continuously evolving scenes.
One well-known approach would be to use the optical flow that describes the motion in the scene between adjacent frames [9, 32]. The optical flow calculation tends to be expensive and also comes with several notable disadvantages, among them its inability to deal with occlusions and newly appeared objects. Nevertheless, as shown by Gadde et al., a relatively poor estimate of the optical flow may still carry significant benefits, not the least of which lies in computational savings.
Alternatively, one may choose to model which information must be propagated across the frames, with the help of a recurrent neural network with memory units. Even more biologically plausible are the models that compute different features at various time-scales, in a vein similar to neural spikes. Naturally, this comes with its own set of disadvantages, most notably the difficulty of choosing an appropriate scheduling regime for updating individual parts of the network.
Yet another complementary line of work focuses on approximating an expensive per-frame forward pass with cheaper alternatives: Li et al. predicted local filters to be applied on the segmentation prediction from the previous frame, while Jain et al. used a larger network for key frames and directly employed a smaller one for consecutive frames. Such savings may allow the re-use of more expensive optical flow methods without a significant slowdown, but the choice of key frames can be crucial and is not readily justifiable.
Looking closely at the aforementioned approaches for video semantic segmentation, one may notice an easily discernible pattern: a typical video segmentation network predicts a labelling of the current frame based on the information propagated from the previous one and hidden representations of the current one (Fig. 1). While seemingly obvious, this pattern admits certain variations depending on whether efficiency or real-time performance is the goal. Importantly, what we would like to emphasise here is that, while technically sound, all the current approaches have been manually designed and have not considered any interplay between different building blocks.
Starting from that general pattern, we instead propose to leverage the neural architecture search (NAS) methodology to find contextual blocks that enhance a per-frame segmentation network with dynamic components. This choice is motivated by recent results achieved using NAS on such tasks as image classification [34, 16], language modelling and static semantic segmentation [5, 18], where the discovered architectures oftentimes outperform manually designed networks. We build upon those results and adapt current approaches in a way suitable for handling the dynamic nature of dense per-pixel classification. To the best of our knowledge, we are the first to consider the application of NAS to the task of video semantic segmentation.
Our automated approach comes with certain benefits, concretely:
it considers a larger span of initial building blocks than any previous work,
it empirically evaluates different design structures and finds the most promising ones, and
it requires only a few GPU-days to find a set of high-performing structures.
Furthermore, although we do not consider it in this work, the proposed methodology can be extended to take into account different specific objectives (even non-differentiable ones), such as runtime.
2 Related Work
2.1 Static semantic segmentation
Most recent approaches in static semantic segmentation exploit fully convolutional neural networks. Typical methods are based either on the encoder-decoder structure with skip-connections [17, 15], dilated convolutional layers [28, 30, 6], or a combination of the above. Per-frame instantiations of these networks are usually computationally expensive; hence, several works have considered building light-weight segmentation architectures [29, 19]. Nevertheless, due to the lack of information propagation between frames, these networks perform poorly on videos and are unable to provide consistent results.
2.2 Dynamic semantic segmentation
One of the first lines of work in video segmentation builds upon the optical flow, with features extracted from the previous frame being propagated to the current one via warping. This usually results in a slight computational overhead, although, as noted by Gadde et al., an easily attainable noisy estimate of the optical flow still carries significant benefits. Nevertheless, the optical flow does not fare well when scenes undergo substantial changes, with novel objects constantly appearing and multiple occlusions being present. Thus, Jain et al. have proposed to combine the optical flow estimate with a relatively cheap approximation of the current frame computed by a smaller network. Xu et al. have chosen to assign different image regions to two different networks: the first one, deep and slow, works on regions that have significantly changed, while the second one, shallow, predicts new features based on the optical flow information. In a similar vein, Nilsson and Sminchisescu have propagated labels from the previous frame at only those pixels where the optical flow estimate is reliable.
A seemingly different approach, proposed by Li et al., instead predicts local convolutional kernels based on the low-level representation of the current frame; these kernels are then applied on the prediction from the previous frame. Importantly, while the current estimate is being used for the next frame, a more accurate one is being computed in parallel for future re-use.
2.3 Neural Architecture Search
NAS methods aim to find high-performing architectures in an automated way. Here, we consider the reinforcement learning-based (RL) approach, where a separate recurrent neural network (the controller) outputs a sequence of tokens describing an architecture that should achieve the highest score on the held-out validation set.
While there is no prior work on NAS for video segmentation, two results in static segmentation are worth mentioning: Chen et al. used a random search to find a single set of operations (a so-called ‘cell’) on top of the DeepLab architecture, while Nekrasov et al. exploited RL to find a cell together with the topological structure of the encoder-decoder type of architecture. We borrow one of the architectures found by Nekrasov et al. as our static baseline, and extend their NAS approach to video segmentation. Since we are only searching for the dynamic component that connects different instantiations of the already pre-trained static segmentation network, we are able to train and evaluate each candidate in a short amount of time, a trait that is extremely important for all NAS methods.
As noted in the introduction and depicted in Fig. 1, we attempt to generalise previous solutions for video semantic segmentation in such a way that NAS methods become readily applicable. To this end, we look for a single cell that connects representations from the previous frame and enhances current predictions without a significant overhead. What follows is the description of the input space (Sect. 3.1), the search space (Sect. 3.2), and the search approach (Sect. 3.3).
3.1 Input space
We consider the arch2 network from the work of Nekrasov et al. It is an encoder-decoder type of segmentation network, with the encoder being a light-weight classifier (MobileNet-v2) and the decoder being an automatically discovered structure presented in Fig. 2. This architecture strikes a fine balance between accuracy and runtime, both being important characteristics for semantic video segmentation. Here it should be noted that the application of our methodology is not tied to a concrete architecture and can be easily adapted to work with other networks.
In the proposed setup, the static network is applied end-to-end on the first frame and three outputs are being recorded: an intermediate representation - in this case, the encoder’s output with the resolution of of the input image (), the decoder’s output before () and after the final classifier () - both with resolutions of of the input image and with and numbers of channels, correspondingly, where is the number of output classes. For the second frame, we record three outputs from the encoder only - two intermediate ones with the resolutions of () and (), respectively, and the final one with the resolution of ().
We rely on the dynamic cell, the layout of which will be described below, to predict the semantic labelling of the current frame given inputs: and from the previous frame, and from the current one. This way, we do not have to execute the decoder part of the static segmentation network on the current frame (thus decreasing latency), at the same time re-using information from the previous frame. The output of the dynamic cell serves as the input for the next frame.
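The per-frame procedure described above can be summarised in a minimal sketch. The callables `encoder`, `decoder` and `dynamic_cell` are hypothetical stand-ins for the corresponding network parts, and the exact tensors passed between frames are abstracted into a single opaque `state`; this is an illustration of the control flow only, not the authors' implementation.

```python
from typing import Any, Callable, List, Sequence


def segment_video(
    frames: Sequence[Any],
    encoder: Callable[[Any], Any],
    decoder: Callable[[Any], Any],
    dynamic_cell: Callable[[Any, Any], Any],
) -> List[Any]:
    """Static decoder on the first frame, dynamic cell on all later frames."""
    outputs: List[Any] = []
    state = None
    for index, frame in enumerate(frames):
        enc_out = encoder(frame)  # the encoder runs on every frame
        if index == 0:
            # First frame: full static network (encoder followed by decoder).
            state = decoder(enc_out)
        else:
            # Later frames: the decoder is skipped; the cell combines the
            # previous-frame state with the current encoder outputs.
            state = dynamic_cell(state, enc_out)
        outputs.append(state)  # the cell output also feeds the next frame
    return outputs


# Toy stand-ins (assumptions) that only record which part handled a frame.
calls = {"decoder": 0, "cell": 0}


def toy_encoder(frame):
    return ("enc", frame)


def toy_decoder(enc_out):
    calls["decoder"] += 1
    return ("static", enc_out[1])


def toy_cell(prev_state, enc_out):
    calls["cell"] += 1
    return ("dynamic", enc_out[1])


results = segment_video(["f0", "f1", "f2"], toy_encoder, toy_decoder, toy_cell)
```

In this toy run the decoder is invoked once and the cell twice, which mirrors the latency argument made in the text: the decoder cost is paid only on the first frame of a sequence.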
3.2 Search space
We rely on an LSTM-based controller to predict a sequence of operations together with locations where they should be applied in order to form a dynamic cell. Concretely, we first choose two layers out of the provided five (with replacement), two corresponding operations that need to be applied on each of them, and an aggregation operation that combines two inputs into a single output. At the next step, we repeat this process, but now sample two layers out of six, with the previously aggregated result having been added to the sampling pool. This process can be repeated multiple times, with the final output being formed by the concatenation of all non-sampled aggregated results.
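The structure of a cell description can be illustrated with a short sketch. Here a uniform random sampler stands in for the LSTM controller, and `NUM_OPS` is a hypothetical size of the operation set; only the five initial inputs, the six aggregation operations and the five-token layer layout are taken from the text.

```python
import random
from typing import List, Tuple

NUM_INPUTS = 5    # layers available before any aggregation takes place
NUM_OPS = 4       # hypothetical number of per-input operations
NUM_AGG_OPS = 6   # aggregation operations, indexed 0..5

Layer = Tuple[int, int, int, int, int]  # (input_a, input_b, op_a, op_b, agg)


def sample_cell(num_layers: int, rng: random.Random) -> List[Layer]:
    """Sample a cell description; uniform sampling replaces the controller."""
    cell: List[Layer] = []
    for step in range(num_layers):
        pool_size = NUM_INPUTS + step  # every aggregated result joins the pool
        input_a = rng.randrange(pool_size)
        input_b = rng.randrange(pool_size)  # with replacement
        op_a = rng.randrange(NUM_OPS)
        op_b = rng.randrange(NUM_OPS)
        agg = rng.randrange(NUM_AGG_OPS)
        cell.append((input_a, input_b, op_a, op_b, agg))
    return cell


def unused_outputs(cell: List[Layer]) -> List[int]:
    """Pool indices of aggregated results that no later layer sampled."""
    used = {index for layer in cell for index in layer[:2]}
    produced = [NUM_INPUTS + step for step in range(len(cell))]
    return [index for index in produced if index not in used]


example_cell = sample_cell(3, random.Random(0))
to_concatenate = unused_outputs(example_cell)
```

The indices in `to_concatenate` are the aggregated results that would be concatenated into the final output; the result of the last layer is always among them, since no later layer can sample it.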
We rely on a set of operations similar to that used for static segmentation (Table 1), and, in order to enable the dynamic cell to apply convolutional filters on irregular grids, we also include deformable convolution.
Operations:
|ID|Description|
|---|---|
|1|global average pooling followed by upsampling and conv|
|2|separable conv with dilation rate|
|3|separable conv with dilation rate|

Aggregation operations:
|ID|Description|
|---|---|
|0|summation with per-channel learnable weights per each input|
|1|channel-wise concatenation of two inputs followed by conv to reduce the number of channels to the original size|
|2|(weight) predictive operation, where the first input becomes a set of spatial convolutional filters (weights) applied on the second one|
|3|bilinear sampling of the first input, where an affine grid is predicted based on the values of the second input|
|4|3D-convolution where two inputs are stacked together forming a new dimension with convolution applied on top|
|5|dense attention: element-wise multiplication between the first input and the sigmoid-activated second one|
Based on previous work, we conjecture that this set of operations will be sufficient for the task of video segmentation, and we provide experimental results to support this claim.
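Two of the simpler aggregation operations listed above can be written out directly. The sketch below is an illustration in plain Python, assuming that a feature map is stored as a list of channels, each channel being a flat list of values; the function names are hypothetical and the learnable weights are passed in as ordinary lists.

```python
import math
from typing import List

FeatureMap = List[List[float]]  # [channel][pixel], an assumed toy layout


def sigmoid(value: float) -> float:
    return 1.0 / (1.0 + math.exp(-value))


def weighted_sum(
    first: FeatureMap,
    second: FeatureMap,
    weights_first: List[float],
    weights_second: List[float],
) -> FeatureMap:
    """Aggregation 0: summation with one weight per channel and per input."""
    return [
        [
            weights_first[c] * x + weights_second[c] * y
            for x, y in zip(first[c], second[c])
        ]
        for c in range(len(first))
    ]


def dense_attention(first: FeatureMap, second: FeatureMap) -> FeatureMap:
    """Aggregation 5: first input times the sigmoid-activated second input."""
    return [
        [x * sigmoid(y) for x, y in zip(channel_a, channel_b)]
        for channel_a, channel_b in zip(first, second)
    ]


summed = weighted_sum([[1.0, 2.0]], [[3.0, 4.0]], [0.5], [2.0])
gated = dense_attention([[2.0]], [[0.0]])
```

Note that neither operation is symmetric in its weights or roles: in dense attention the second input acts purely as a gate, so the order in which the controller samples the two layers matters.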
3.3 Finding optimal architectures
We assume that there exists a video dataset that comes with segmentation annotations for at least a subset of consecutive frames. From it, we build pairs (or triplets) of frames such that in each sequence all the frames following the first one are always annotated. As commonly done, we further divide this set into two disjoint parts: meta-train and meta-val. We further assume the existence of a static segmentation network pre-trained on this dataset (please refer to Sect. 6 for the details on pre-training of static segmentation networks), in particular arch2 from the work of Nekrasov et al. As mentioned above, we chose this particular architecture due to its compactness and low latency.
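The construction of frame sequences can be sketched as follows, assuming a single video whose frames are indexed consecutively and a list of flags saying which frames carry annotations. The function name and the input layout are hypothetical.

```python
from typing import List, Sequence, Tuple


def build_sequences(annotated: Sequence[bool], length: int) -> List[Tuple[int, ...]]:
    """Return index tuples of consecutive frames in which every frame
    after the first one is annotated (the first may be unannotated)."""
    sequences: List[Tuple[int, ...]] = []
    for start in range(len(annotated) - length + 1):
        window = range(start, start + length)
        if all(annotated[i] for i in window[1:]):
            sequences.append(tuple(window))
    return sequences


flags = [False, True, True, False, True]
pairs = build_sequences(flags, 2)
triplets = build_sequences(flags, 3)
```

The first frame of each sequence only needs to provide inputs for the static network, which is why its annotation is not required.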
The controller samples a structure of the dynamic cell, which we train on the meta-train set and evaluate on meta-val. Following prior work, we consider the geometric mean of three metrics as the validation score: mean intersection-over-union (mIoU), frequency-weighted IoU (fwIoU) and mean-pixel accuracy (mAcc). This score is used by the controller to update its weights, and the process is repeated multiple times. After that, one can either sample several cells from the trained controller, or simply choose the cells that achieved the highest scores during the search process.
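As an illustration, the validation score can be computed from a confusion matrix as in the sketch below. It assumes that mAcc denotes the mean of per-class pixel accuracies and that classes absent from the ground truth are skipped; both choices, as well as the function names, are assumptions rather than details confirmed by the text.

```python
from typing import List, Tuple


def segmentation_metrics(confusion: List[List[int]]) -> Tuple[float, float, float]:
    """confusion[i][j] counts pixels of true class i predicted as class j.
    Returns (mIoU, fwIoU, mAcc)."""
    num_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    ious: List[float] = []
    accuracies: List[float] = []
    fw_iou = 0.0
    for c in range(num_classes):
        true_positive = confusion[c][c]
        ground_truth = sum(confusion[c])
        predicted = sum(confusion[r][c] for r in range(num_classes))
        if ground_truth == 0:
            continue  # assumption: skip classes absent from the ground truth
        iou = true_positive / (ground_truth + predicted - true_positive)
        ious.append(iou)
        accuracies.append(true_positive / ground_truth)
        fw_iou += (ground_truth / total) * iou
    return sum(ious) / len(ious), fw_iou, sum(accuracies) / len(accuracies)


def validation_score(confusion: List[List[int]]) -> float:
    """Geometric mean of mIoU, fwIoU and mAcc."""
    m_iou, fw_iou, m_acc = segmentation_metrics(confusion)
    return (m_iou * fw_iou * m_acc) ** (1.0 / 3.0)


toy_confusion = [[3, 1], [1, 3]]
toy_score = validation_score(toy_confusion)
```

For the toy matrix, both classes have an IoU of 0.6 and an accuracy of 0.75, so the score is the cube root of 0.6 × 0.6 × 0.75 = 0.27. The geometric mean penalises a candidate that does well on one metric but poorly on another more strongly than an arithmetic mean would.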
The first one, CamVid, comprises RGB images of resolution densely annotated into categories. Following previous work , we use the dataset splits of images for training, - for validation and - for testing. We train generated architectures with batches of examples each comprising consecutive frames.
The CityScapes dataset contains high-resolution images densely labelled with semantic classes - for training, for validation and for testing, respectively. In addition to that, raw unannotated frames extracted from videos are also provided. For each annotated example, we add an image frame that precedes it and train architectures with batches of sequences of length , in which the second frame is always annotated.
In each case, we initialise the decoder’s output dec on the first frame using the pre-trained static segmentation network, and rely on the dynamic cell at all following frames in the sequence as described in Sect. 3.1. To update the dynamic cell weights, we sum up cross-entropy loss terms at each frame after the first one and back-propagate the gradients.
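The training objective described above can be sketched in plain Python. The example assumes that each frame is given as a list of per-pixel class scores with matching integer labels, and that the per-frame term is the mean cross-entropy over pixels; the averaging within a frame and the function names are assumptions.

```python
import math
from typing import List, Optional, Sequence


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Cross-entropy of one pixel computed from unnormalised class scores."""
    peak = max(logits)  # subtracted for numerical stability
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[label]


def sequence_loss(
    frame_logits: Sequence[Optional[List[List[float]]]],
    frame_labels: Sequence[Optional[List[int]]],
) -> float:
    """Sum of per-frame loss terms over all frames after the first one."""
    total = 0.0
    for logits, labels in zip(frame_logits[1:], frame_labels[1:]):
        per_pixel = [cross_entropy(l, y) for l, y in zip(logits, labels)]
        total += sum(per_pixel) / len(per_pixel)
    return total


# Three-frame toy sequence: the first frame contributes no loss term.
toy_logits = [None, [[0.0, 0.0]], [[0.0, 0.0]]]
toy_labels = [None, [0], [1]]
toy_loss = sequence_loss(toy_logits, toy_labels)
```

With uniform scores over two classes every pixel contributes log 2, so the toy loss equals 2 log 2; the first frame is excluded because its prediction comes from the fixed static network rather than from the dynamic cell.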
For both, search and training, we exploit a single V GPU with GB of memory.
For searching we only employ the training splits of each dataset. We further divide each randomly in non-overlapping sets - meta-train () and meta-val (). We pre-compute all required outputs from the pre-trained static network and store them in memory. The static network is kept unchanged during the whole search process. Each generated architecture is trained on the meta-train split and evaluated on meta-val. We keep track of average performance and apply early stopping halfway through the training if the generated architecture is un-promising as done in .
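The data split and the early-stopping check can be sketched as follows. The `val_fraction` argument is a hypothetical parameter, and the stopping rule shown here (stop when the halfway score falls below the running mean of earlier halfway scores) is only one plausible reading of "un-promising", stated as an assumption.

```python
import random
from typing import List, Sequence, Tuple


def split_meta(
    indices: Sequence[int], val_fraction: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Randomly split example indices into disjoint meta-train / meta-val."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    num_val = int(round(len(shuffled) * val_fraction))
    return shuffled[num_val:], shuffled[:num_val]


def is_unpromising(halfway_score: float, past_scores: Sequence[float]) -> bool:
    """Assumed rule: stop if the halfway score is below the running mean."""
    if not past_scores:
        return False  # nothing to compare against yet
    return halfway_score < sum(past_scores) / len(past_scores)


meta_train, meta_val = split_meta(range(10), val_fraction=0.2)
```

Stopping weak candidates halfway frees the search budget for more promising cells, which matters because every sampled architecture has to be trained before it can be scored.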
Our controller is a two-layer LSTM with hidden units randomly initialised from a uniform distribution. The controller is trained with PPO with the learning rate of . To reduce the size of generated cells, we set the number of emitted layers (each layer is a string of five tokens as described in Sect. 3.2) to on CamVid and to on CityScapes.
with the shorter side being mean-padded to. No transformations are applied to the validation sequences.
For CityScapes, we train for epochs with sequences each cropped to with the longer side being resized to .
We visualise the progress of rewards on each dataset in Fig. 3. Although the rewards are not directly comparable between the datasets, the growth dynamics on both datasets signal that the controller is able to discover better architectures throughout the search process.
We further look at the distributions of sampled operations, aggregation operations and input layers plotted in Fig. 10. On both datasets, global average pooling and separable convolution with dilation rate are sampled less frequently than other operations, potentially indicating that these layers could be omitted from the search process. On average, the controller trained on CityScapes prefers sampling deformable convolution (Fig. 10a), while the CamVid one prefers separable convolution (Fig. 10d).
In terms of aggregation operations, the dynamics of the two controllers vary significantly: the CamVid-based controller tends to rely on dense attention, while omitting the predictive operation (Fig. 10e). In contrast, the CityScapes controller is more likely to apply bilinear sampling on an affine grid, and to ignore the predictive operation together with dense attention (Fig. 10b).
When sampling the input layers, the controllers behave similarly: in particular, both tend to skip from the previous and current frames. The CityScapes controller extensively uses information from the previous layer (Fig. 10c), while the CamVid one - from of the current frame (Fig. 10f). This may well imply that on CityScapes the final predictions on the current frame change only slightly with respect to the previous frame.
Importantly, these observations indicate that two controllers trained on two different datasets exhibit different patterns, potentially capturing dataset-specific attributes in order to discover better performing architectures.
4.2 End-to-end Training
We further select top- performing dynamic cells on each dataset to train end-to-end on full training sets for longer.
In particular, for CamVid, we pre-train the dynamic cell only with Adam and the learning rate of for epochs with the batch size of sequences. Then we decrease the cell’s learning rate in half, and fine-tune the whole architecture (i.e., with the per-frame segmentation network) end-to-end for epochs - the static network weights are updated using SGD with momentum of and the learning rate of . Each sample in the batch is cropped to with the shorter side being padded to .
On CityScapes we pre-train for epochs with the batch size of sequences and fine-tune end-to-end for epochs. Each example in the batch is cropped to .
We provide quantitative results on CamVid in Table 3. The inclusion of dynamic cells in both cases leads to an improvement over the baseline by more than . Importantly, with the exception of the first frame in the sequence, we do not rely on expensive computations involving the static decoder.
Both our models perform comparably to other state-of-the-art video segmentation networks even though the backbone that we rely on, MobileNet-v2, is much smaller than ResNet-101 exploited by Chandra et al., or DilatedNet exploited by Gadde et al. and GRFP. Furthermore, we did not make any use of higher-resolution images of to further improve our scores.
We further visualise a few qualitative examples in Fig. 36. The dynamic cell enables the network to effectively propagate information about thin structures, such as poles, which makes the resultant segmentation masks consistent in contrast to the per-frame baseline (rows ). Furthermore, the multi-frame segmentation network is able to track objects across neighbouring frames (rows ).
We include the validation results of two discovered cells on CityScapes in Table 4. Once again, both dynamic cells are able to outperform the per-frame baseline by . Furthermore, our models achieve favourable results in comparison to other video segmentation methods, all of which employ significantly larger backbones and, with the exception of Li et al., all rely on the optical flow computation. Note also that Gadde et al. improved over their respective static baseline by , too, while introducing a non-negligible overhead of ms; and Li et al. compromised more than of the baseline score in order to reduce the latency. In contrast, we outperformed our static baseline and decreased the runtime (Table 5).
A few inference examples are visualised in Fig. 62. As can be seen, the dynamic cells enhance the per-frame baseline results and identify partially occluded vehicles more accurately (rows , ), while also avoiding misclassification of traffic signs at pixels with similar texture patterns (rows ).
4.3 Details of Discovered Architectures
We include characteristics of our networks together with numbers reported by others in Table 5. As evident, our dynamic segmentation approach is superior to others in terms of its latency and compactness. Concretely, all our architectures contain at most M parameters while having an average per-frame runtime of ms on high-resolution images. This is possible due to both the network design and the exclusion of the optical flow computation.
All the trained cells are visualised in Fig. 68. Notably, layers with deformable convolution are present in all architectures. To propagate information from the previous frame, each cell exploits the output instead of . All the cells prefer aggregating outputs via channel-wise concatenation, with cell0 also relying on dense attention, and cell3 on affine transformation with bilinear sampling. In addition, cell1 and cell2 employ 3D convolution in order to capture information between various inputs.
5 Discussion & Conclusions
It remains an open question what the optimal way of propagating and extracting information across video frames is. While a straightforward solution involving the optical flow achieves solid results, it possesses several disadvantages that stem from the limitations of the optical flow itself and ultimately limit the ability of the network to adapt to novel frames. Furthermore, computations involving the optical flow cause a significant overhead, prohibiting the final system from being deployed in real-time.
In this work, instead of manually enhancing static segmentation networks with dynamic components, we proposed an automatic approach based on neural architecture search methods. Such automation has multiple benefits, as it explores a large pool of networks and finds the best-performing ones on the given dataset. In a broader sense, starting from a static per-frame segmentation network, we showcased a way of generalising existing solutions without any reliance on the optical flow. In particular, we extended the static baseline with a dynamic cell, the design of which is automatically discovered with the help of reinforcement learning. The best discovered cells improve the baseline by more than at the same time leading to significant memory and latency savings. Concretely, two discovered cells on CityScapes reach mean IoU and require only ms on average to process a frame. While the proposed methodology relies on the static baseline, we expect that omitting that requirement and searching for a video segmentation network end-to-end would be an interesting problem to consider in future work.
The participation of VN, CS and IR in this work was in part supported by the ARC Centre of Excellence for Robotic Vision. CS was also supported by the GeoVision CRC Project.
6 Training Details of Static Baseline
On CityScapes, we train for epochs with mini-batches of examples each randomly scaled with the scale factor in range of and randomly cropped to with each side zero-padded accordingly. On CamVid, we train for epochs with mini-batches of examples each randomly scaled with the scale factor in range of and randomly cropped to with each side zero-padded accordingly.
References
- [1] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 39(12), 2017.
- [2] G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 30(2):88–97, 2009.
- [3] S. Chandra, C. Couprie, and I. Kokkinos. Deep spatio-temporal random fields for efficient video segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2018.
- [4] S. Chandra and I. Kokkinos. Fast, exact and multi-scale inference for semantic image segmentation with deep gaussian crfs. In Proc. Eur. Conf. Comp. Vis., 2016.
- [5] L. Chen, M. D. Collins, Y. Zhu, G. Papandreou, B. Zoph, F. Schroff, H. Adam, and J. Shlens. Searching for efficient multi-scale architectures for dense image prediction. In Proc. Advances in Neural Inf. Process. Syst., 2018.
- [6] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell., 2018.
- [7] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proc. Eur. Conf. Comp. Vis., 2018.
- [8] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
- [9] R. Gadde, V. Jampani, and P. V. Gehler. Semantic video cnns through representation warping. In Proc. IEEE Int. Conf. Comp. Vis., 2017.
- [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
- [11] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In Proc. Advances in Neural Inf. Process. Syst., 2015.
- [12] S. Jain, X. Wang, and J. Gonzalez. Accel: A corrective fusion network for efficient semantic segmentation on video. arXiv: Comp. Res. Repository, abs/1807.06667, 2018.
- [13] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv: Comp. Res. Repository, abs/1412.6980, 2014.
- [14] Y. Li, J. Shi, and D. Lin. Low-latency video semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2018.
- [15] G. Lin, A. Milan, C. Shen, and I. D. Reid. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
- [16] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. L. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. In Proc. Eur. Conf. Comp. Vis., 2018.
- [17] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
- [18] V. Nekrasov, H. Chen, C. Shen, and I. D. Reid. Fast neural architecture search of compact semantic segmentation models via auxiliary cells. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2019.
- [19] V. Nekrasov, C. Shen, and I. D. Reid. Light-weight refinenet for real-time semantic segmentation. In Proc. British Machine Vis. Conf., 2018.
- [20] D. Nilsson and C. Sminchisescu. Semantic video segmentation by gated recurrent flow propagation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2018.
- [21] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. In Proc. Int. Conf. Mach. Learn., 2018.
- [22] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2018.
- [23] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv: Comp. Res. Repository, abs/1707.06347, 2017.
- [24] E. Shelhamer, K. Rakelly, J. Hoffman, and T. Darrell. Clockwork convnets for video semantic segmentation. In Proc. Eur. Conf. Comp. Vis., 2016.
- [25] S. Xie, H. Zheng, C. Liu, and L. Lin. SNAS: stochastic neural architecture search. arXiv: Comp. Res. Repository, abs/1812.09926, 2018.
- [26] Y. Xu, T. Fu, H. Yang, and C. Lee. Dynamic video segmentation network. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2018.
- [27] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In Proc. Int. Conf. Learn. Representations, 2016.
- [28] F. Yu, V. Koltun, and T. A. Funkhouser. Dilated residual networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
- [29] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia. Icnet for real-time semantic segmentation on high-resolution images. In Proc. Eur. Conf. Comp. Vis., 2018.
- [30] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
- [31] X. Zhu, H. Hu, S. Lin, and J. Dai. Deformable convnets v2: More deformable, better results. arXiv: Comp. Res. Repository, abs/1811.11168, 2018.
- [32] X. Zhu, Y. Xiong, J. Dai, L. Yuan, and Y. Wei. Deep feature flow for video recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
- [33] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In Proc. Int. Conf. Learn. Representations, 2017.
- [34] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2018.