TensorFlow implementation of ENet, trained on the Cityscapes dataset.
The ability to perform pixel-wise semantic segmentation in real-time is of paramount importance in mobile applications. Recent deep neural networks aimed at this task have the disadvantage of requiring a large number of floating point operations and have long run-times that hinder their usability. In this paper, we propose a novel deep neural network architecture named ENet (efficient neural network), created specifically for tasks requiring low latency operation. ENet is up to 18× faster, requires 75× fewer FLOPs, has 79× fewer parameters, and provides accuracy similar to or better than existing models. We have tested it on CamVid, Cityscapes and SUN datasets and report on comparisons with existing state-of-the-art methods, and the trade-offs between accuracy and processing time of a network. We present performance measurements of the proposed architecture on embedded systems and suggest possible software improvements that could make ENet even faster.
Recent interest in augmented reality wearables, home-automation devices, and self-driving vehicles has created a strong need for semantic-segmentation (or visual scene-understanding) algorithms that can operate in real-time on low-power mobile devices. These algorithms label each and every pixel in the image with one of the object classes. In recent years, the availability of larger datasets and computationally-powerful machines has helped deep convolutional neural networks (CNNs) lecun1998cnn ; alex2012 ; karen14 ; christian15 surpass the performance of many conventional computer vision algorithms jamie09 ; perr2010 ; vande2011 . Even though CNNs are increasingly successful at classification and categorization tasks, they provide coarse spatial results when applied to pixel-wise labeling of images. Therefore, they are often cascaded with other algorithms to refine the results, such as color based segmentation clement13 or conditional random fields liang14 , to name a few.
In order to both spatially classify and finely segment images, several neural network architectures have been proposed, such as SegNet badrinarayanan15basic ; badrinarayanan15 or fully convolutional networks long15 . All these works are based on the VGG16 simonyan14 architecture, which is a very large model designed for multi-class classification. These references propose networks with huge numbers of parameters and long inference times. Under these conditions, they become unusable for many mobile or battery-powered applications, which require processing images at rates higher than 10 fps.
In this paper, we propose a new neural network architecture optimized for fast inference and high accuracy. Examples of images segmented using ENet are shown in Figure 1. In our work, we chose not to use any post-processing steps, which can of course be combined with our method, but would worsen the performance of an end-to-end CNN approach.
In Section 3 we propose a fast and compact encoder-decoder architecture named ENet. It has been designed according to rules and ideas that have appeared in the literature recently, all of which we discuss in Section 4. The proposed network has been evaluated on Cityscapes cityscape2016 and CamVid camvid08 for driving scenarios, whereas the SUN dataset sun2015 has been used for testing our network in an indoor setting. We benchmark it on the NVIDIA Jetson TX1 Embedded Systems Module as well as on an NVIDIA Titan X GPU. The results can be found in Section 5.
Semantic segmentation is important in understanding the content of images and finding target objects. This technique is of utmost importance in applications such as driving aids and augmented reality. Moreover, real-time operation is a must for them, and therefore, designing CNNs carefully is vital. Contemporary computer vision applications extensively use deep neural networks, which are now one of the most widely used techniques for many different tasks, including semantic segmentation. This work presents a new neural network architecture, and therefore we aim to compare to other literature that performs the large majority of inference in the same way.
State-of-the-art scene-parsing CNNs use two separate neural network architectures combined together: an encoder and a decoder. Inspired by probabilistic auto-encoders ranzato07 ; ngiam11 , the encoder-decoder network architecture was introduced in SegNet-basic badrinarayanan15basic , and further improved in SegNet badrinarayanan15 . The encoder is a vanilla CNN (such as VGG16 simonyan14 ) which is trained to classify the input, while the decoder is used to upsample the output of the encoder long15 ; noh2015learning ; zheng2015conditional ; eigen2015predicting ; hong2015decoupled . However, these networks are slow during inference due to their large architectures and numerous parameters. Unlike in fully convolutional networks (FCN) long15 , the fully connected layers of VGG16 were discarded in the latest incarnation of SegNet, in order to reduce the number of floating point operations and the memory footprint, making it the smallest of these networks. Still, none of them can operate in real-time.
These techniques use onerous post-processing steps and often fail to label classes that occupy few pixels in a frame. CNNs can also be combined with recurrent neural networks zheng2015conditional to improve accuracy, but then they suffer from speed degradation. Also, one has to keep in mind that an RNN, used as a post-processing step, can be used in conjunction with any other technique, including the one presented in this work.
The architecture of our network is presented in Table 1. It is divided into several stages, as highlighted by horizontal lines in the table and the first digit after each block name. Output sizes are reported for an example input image resolution of . We adopt a view of ResNets he2015resnet that describes them as having a single main branch and extensions with convolutional filters that separate from it, and then merge back with an element-wise addition, as shown in Figure 1(b). Each block consists of three convolutional layers: a projection that reduces the dimensionality, a main convolutional layer (conv in Figure 1(b)), and an expansion. We place Batch Normalization ioffe2015batchnorm and PReLU he2015 between all convolutions. Just as in the original paper, we refer to these as bottleneck modules. If the bottleneck is downsampling, a max pooling layer is added to the main branch.
|Repeat section 2, without bottleneck2.0|
Also, the first projection is replaced with a convolution with stride in both dimensions. We zero pad the activations, to match the number of feature maps. conv is either a regular, dilated or full convolution (also known as deconvolution or fractionally strided convolution) with filters. Sometimes we replace it with asymmetric convolution, i.e. a sequence of and convolutions. For the regularizer, we use Spatial Dropout tompson15 , with before bottleneck2.0, and afterwards.
The initial stage contains a single block, which is presented in Figure 1(a). Stage 1 consists of bottleneck blocks, while stages 2 and 3 have the same structure, with the exception that stage 3 does not downsample the input at the beginning (we omit the th bottleneck). These first three stages are the encoder. Stages 4 and 5 belong to the decoder.
We did not use bias terms in any of the projections, in order to reduce the number of kernel calls and overall memory operations, as cuDNN chetlur2014cudnn uses separate kernels for convolution and bias addition. This choice did not have any impact on the accuracy. Between each convolutional layer and the following non-linearity we use Batch Normalization ioffe2015batchnorm . In the decoder, max pooling is replaced with max unpooling, and padding is replaced with spatial convolution without bias. We did not use pooling indices in the last upsampling module, because the initial block operated on the channels of the input frame, while the final output has feature maps (the number of object classes). Also, for performance reasons, we decided to place only a bare full convolution as the last module of the network, which alone takes up a sizeable portion of the decoder processing time.
In this section we discuss the most important experimental results and intuitions that have shaped the final architecture of ENet.
Downsampling images during semantic segmentation has two main drawbacks. Firstly, reducing feature map resolution implies loss of spatial information like exact edge shape. Secondly, full pixel segmentation requires that the output has the same resolution as the input. This implies that strong downsampling will require equally strong upsampling, which increases model size and computational cost. The first issue has been addressed in FCN long15 by adding the feature maps produced by the encoder, and in SegNet badrinarayanan15basic by saving the indices of elements chosen in max pooling layers, and using them to produce sparse upsampled maps in the decoder. We followed the SegNet approach, because it reduces memory requirements. Still, we have found that strong downsampling hurts the accuracy, and tried to limit it as much as possible.
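The index-saving idea can be illustrated with a small pure-Python sketch. It is not taken from the ENet or SegNet code; the function names, the 2×2 window with stride 2, and the toy feature map are assumptions made for illustration. Pooling records where each maximum came from, and unpooling writes the pooled values back to those positions, leaving zeros elsewhere.

```python
def max_pool_2x2_with_indices(fmap):
    """2x2 max pooling with stride 2 on a 2D list.

    Returns the pooled map and, for each pooled value, the (row, col)
    position it was taken from in the input.
    """
    h, w = len(fmap), len(fmap[0])
    pooled, indices = [], []
    for i in range(0, h - 1, 2):
        row_vals, row_idx = [], []
        for j in range(0, w - 1, 2):
            window = [(fmap[i + di][j + dj], (i + di, j + dj))
                      for di in (0, 1) for dj in (0, 1)]
            value, pos = max(window, key=lambda t: t[0])
            row_vals.append(value)
            row_idx.append(pos)
        pooled.append(row_vals)
        indices.append(row_idx)
    return pooled, indices


def max_unpool_2x2(pooled, indices, out_h, out_w):
    """Place each pooled value at its saved position; all other entries are 0."""
    out = [[0.0] * out_w for _ in range(out_h)]
    for vals, idxs in zip(pooled, indices):
        for value, (i, j) in zip(vals, idxs):
            out[i][j] = value
    return out


fmap = [
    [1.0, 3.0, 2.0, 0.0],
    [4.0, 2.0, 1.0, 5.0],
    [0.0, 1.0, 7.0, 2.0],
    [6.0, 2.0, 3.0, 1.0],
]
pooled, indices = max_pool_2x2_with_indices(fmap)
sparse = max_unpool_2x2(pooled, indices, 4, 4)
```

Only one position per pooled value has to be stored, which is why this approach needs less memory than keeping the full encoder feature maps for later addition.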
However, downsampling has one big advantage. Filters operating on downsampled images have a bigger receptive field, which allows them to gather more context. This is especially important when trying to differentiate between classes like, for example, rider and pedestrian in a road scene. It is not enough that the network learns how people look; the context in which they appear is equally important. In the end, we have found that it is better to use dilated convolutions for this purpose yu2015dilated .
One crucial intuition for achieving good performance and real-time operation is realizing that processing large input frames is very expensive. This might sound obvious; however, many popular architectures do not pay much attention to optimizing the early stages of the network, which are often the most expensive by far.
The first two ENet blocks heavily reduce the input size, and use only a small set of feature maps. The idea behind this is that visual information is highly spatially redundant, and thus can be compressed into a more efficient representation. Also, our intuition is that the initial network layers should not directly contribute to classification. Instead, they should act as good feature extractors and only preprocess the input for later portions of the network. This insight worked well in our experiments; increasing the number of feature maps from to did not improve accuracy on the Cityscapes cityscape2016 dataset.
In this work we would like to provide a different view on encoder-decoder architectures than the one presented in badrinarayanan15 . SegNet is a very symmetric architecture, as the decoder is an exact mirror of the encoder. Instead, our architecture consists of a large encoder and a small decoder. This is motivated by the idea that the encoder should be able to work in a similar fashion to the original classification architectures, i.e. to operate on smaller resolution data and provide for information processing and filtering. The role of the decoder, in contrast, is to upsample the output of the encoder, only fine-tuning the details.
A recent paper he2016identity reports that it is beneficial to use ReLU and Batch Normalization layers before convolutions. We tried applying these ideas to ENet, but this had a detrimental effect on accuracy. Instead, we have found that removing most ReLUs in the initial layers of the network improved the results. It was quite a surprising finding, so we decided to investigate its cause.
We replaced all ReLUs in the network with PReLUs he2015 , which use an additional parameter per feature map, with the goal of learning the negative slope of non-linearities. We expected that in layers where identity is a preferable transfer function, PReLU weights will have values close to 1, and conversely, values around 0 if ReLU is preferable. Results of this experiment can be seen in Figure 3.
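For reference, the PReLU transfer function and its two limiting cases can be written down directly. This is a minimal scalar sketch rather than the layer used in training: the function name is ours, and in a real layer the slope `a` is a learned parameter shared across one feature map.

```python
def prelu(x, a):
    """Parametric ReLU: identity for positive inputs, slope `a` for the rest."""
    return x if x > 0 else a * x


inputs = [-2.0, -0.5, 0.0, 1.5]

as_relu = [prelu(x, 0.0) for x in inputs]       # slope 0 reproduces ReLU
as_identity = [prelu(x, 1.0) for x in inputs]   # slope 1 reproduces identity
inverting = [prelu(x, -0.5) for x in inputs]    # negative slope flips and shrinks negatives
```

A learned slope near 0 therefore indicates that a layer prefers ReLU-like behavior, a slope near 1 indicates near-identity behavior, and a negative slope maps negative inputs to scaled positive outputs.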
The weights of the initial layers exhibit a large variance and are slightly biased towards positive values, while in the later portions of the encoder they settle to a recurring pattern. All layers in the main branch behave nearly exactly like regular ReLUs, while the weights inside bottleneck modules are negative, i.e. the function inverts and scales down negative values. We hypothesize that identity did not work well in our architecture because of its limited depth. The reason why such lossy functions are learned might be that the original ResNets he2016identity are networks that can be hundreds of layers deep, while our network uses only a couple of layers, and it needs to quickly filter out information. It is notable that the decoder weights become much more positive and learn functions closer to identity. This confirms our intuition that the decoder is used only to fine-tune the upsampled output.
As stated earlier, it is necessary to downsample the input early, but aggressive dimensionality reduction can also hinder the information flow. A very good approach to this problem has been presented in szegedy2015rethinking . It has been argued that the method used by the VGG architectures, i.e. performing a pooling followed by a convolution expanding the dimensionality, although relatively cheap, introduces a representational bottleneck (or forces one to use a greater number of filters, which lowers computational efficiency). On the other hand, pooling after a convolution that increases feature map depth is computationally expensive. Therefore, as proposed in szegedy2015rethinking , we chose to perform the pooling operation in parallel with a convolution of stride 2, and concatenate the resulting feature maps. This technique allowed us to speed up inference time of the initial block times.
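The trade-off between the three downsampling strategies can be made concrete by counting multiply-accumulate operations. The sketch below is a rough cost model, not a measurement: the input resolution, channel counts and 3×3 kernel are illustrative assumptions, padding is assumed to keep output sizes at exactly half the input, and pooling itself is treated as free.

```python
def conv_macs(out_h, out_w, kernel, c_in, c_out):
    """Multiply-accumulates of a kernel x kernel convolution producing out_h x out_w outputs."""
    return out_h * out_w * kernel * kernel * c_in * c_out


H, W = 512, 512        # illustrative input resolution (assumption)
C_IN, C_OUT = 3, 16    # illustrative channel counts (assumption)
K = 3                  # illustrative kernel size (assumption)

# Pool first, then expand channels: cheap, but all information passes
# through a C_IN-channel map at half resolution.
pool_then_conv = conv_macs(H // 2, W // 2, K, C_IN, C_OUT)

# Expand channels at full resolution, then pool: no bottleneck, but 4x the work.
conv_then_pool = conv_macs(H, W, K, C_IN, C_OUT)

# Parallel: a stride-2 convolution yields C_OUT - C_IN maps, and the pooled
# input supplies the remaining C_IN maps after concatenation.
parallel = conv_macs(H // 2, W // 2, K, C_IN, C_OUT - C_IN)
parallel_channels = (C_OUT - C_IN) + C_IN
```

Under these assumptions the parallel variant produces the same number of output channels as the other two while needing the fewest convolution operations.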
Additionally, we have found one problem in the original ResNet architecture. When downsampling, the first projection of the convolutional branch is performed with a stride of in both dimensions, which effectively discards of the input. Increasing the filter size to allows the full input to be taken into consideration, and thus improves the information flow and accuracy. Of course, it makes these layers more computationally expensive; however, there are so few of them in ENet that the overhead is unnoticeable.
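How much of the input a strided projection actually reads can be checked by enumerating the positions its kernel touches. The sketch below assumes, for illustration only, a stride of 2 with a 1×1 kernel versus a 2×2 kernel on an 8×8 map without padding; these values and the function name are assumptions, not figures quoted from the text.

```python
def input_coverage(size, kernel, stride):
    """Fraction of a size x size input read by a kernel x kernel filter at the given stride."""
    touched = set()
    for top in range(0, size - kernel + 1, stride):
        for left in range(0, size - kernel + 1, stride):
            for di in range(kernel):
                for dj in range(kernel):
                    touched.add((top + di, left + dj))
    return len(touched) / (size * size)


pointwise_stride2 = input_coverage(size=8, kernel=1, stride=2)
two_by_two_stride2 = input_coverage(size=8, kernel=2, stride=2)
```

With a 1×1 kernel at stride 2, only one position in every 2×2 neighborhood is read, so three quarters of the activations never influence the output. A 2×2 kernel at the same stride tiles the input without gaps.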
It has been shown that convolutional weights have a fair amount of redundancy, and each convolution can be decomposed into two smaller ones following each other: one with a filter and the other with a filter jin2014flattened . This idea has also been presented in szegedy2015rethinking , and from now on we adopt their naming convention and refer to these as asymmetric convolutions. We have used asymmetric convolutions with in our network, so the cost of these two operations is similar to a single convolution. This allowed us to increase the variety of functions learned by blocks and to increase the receptive field.
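The decomposition can be verified numerically for the case where it is exact, namely a kernel that is the outer product of a column filter and a row filter. The sketch below uses a plain "valid" cross-correlation written in pure Python; the filter values, the 3-tap size and the helper names are illustrative assumptions.

```python
def conv2d_valid(image, kernel):
    """Plain 2D cross-correlation without padding, on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


col = [[1.0], [2.0], [1.0]]     # n x 1 filter
row = [[1.0, 0.0, -1.0]]        # 1 x n filter
full = [[c[0] * r for r in row[0]] for c in col]   # equivalent n x n kernel

image = [[float((3 * i + 5 * j) % 7) for j in range(6)] for i in range(6)]

two_step = conv2d_valid(conv2d_valid(image, col), row)
one_step = conv2d_valid(image, full)

n = 3
asymmetric_weights = 2 * n    # n x 1 followed by 1 x n
full_weights = n * n          # single n x n kernel
```

The two results agree exactly here because the full kernel has rank one. A general n×n kernel is not separable in this way, so an asymmetric pair is a constrained (cheaper) family of filters whose weight count grows linearly rather than quadratically in n.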
What is more, the sequence of operations used in the bottleneck module (projection, convolution, projection) can be seen as decomposing one large convolutional layer into a series of smaller and simpler operations that are its low-rank approximation. Such factorization allows for large speedups, and greatly reduces the number of parameters, making them less redundant jin2014flattened . Additionally, it makes the functions they compute richer, thanks to the non-linear operations that are inserted between layers.
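A back-of-the-envelope parameter count shows how large the saving from this factorization can be. The sketch assumes 1×1 projections, a 3×3 main convolution, a channel reduction factor of 4 and 128 feature maps; all of these are illustrative assumptions, and bias terms are omitted.

```python
def conv_params(k_h, k_w, c_in, c_out):
    """Weight count of a convolution without bias."""
    return k_h * k_w * c_in * c_out


def bottleneck_params(channels, reduction, kernel=3):
    """Reduce with 1x1, convolve at reduced width, expand with 1x1."""
    inner = channels // reduction
    return (conv_params(1, 1, channels, inner)
            + conv_params(kernel, kernel, inner, inner)
            + conv_params(1, 1, inner, channels))


channels = 128    # illustrative width (assumption)
reduction = 4     # illustrative reduction factor (assumption)

factorized = bottleneck_params(channels, reduction)
single_layer = conv_params(3, 3, channels, channels)
saving = single_layer / factorized
```

Under these assumptions the bottleneck uses roughly one eighth of the weights of a single full-width 3×3 layer, while adding two extra non-linearities between its three convolutions.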
As argued above, it is very important for the network to have a wide receptive field, so it can perform classification by taking a wider context into account. We wanted to avoid overly downsampling the feature maps, and decided to use dilated convolutions yu2015dilated to improve our model. They replaced the main convolutional layers inside several bottleneck modules in the stages that operate on the smallest resolutions. These gave a significant accuracy boost, by raising IoU on Cityscapes by around percentage points, with no additional cost. We obtained the best accuracy when we interleaved them with other bottleneck modules (both regular and asymmetric), instead of arranging them in sequence, as has been done in yu2015dilated .
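The effect of dilation on context can be quantified with the standard receptive-field formula for a stack of stride-1 convolutions: each layer with kernel size k and dilation d widens the field by (k - 1)·d. The dilation rates used below are illustrative assumptions, not values quoted from the text.

```python
def receptive_field(dilations, kernel=3):
    """Receptive field (in pixels, one dimension) of stacked stride-1 convolutions."""
    field = 1
    for d in dilations:
        field += (kernel - 1) * d
    return field


regular = receptive_field([1, 1, 1, 1])     # four ordinary 3x3 layers
dilated = receptive_field([2, 4, 8, 16])    # four dilated 3x3 layers (assumed rates)
```

Both stacks have the same number of weights and operations per output pixel, yet the dilated stack sees a much wider context, which is why the accuracy gain comes at no additional cost.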
Most pixel-wise segmentation datasets are relatively small (on the order of images), so models as expressive as neural networks quickly begin to overfit them. In initial experiments, we used L2 weight decay with little success. Then, inspired by huang2016stochastic , we tried stochastic depth, which increased accuracy. However, it became apparent that dropping whole branches (i.e. setting their output to ) is in fact a special case of applying Spatial Dropout tompson15 , where either all of the channels or none of them are ignored, instead of selecting a random subset. We placed Spatial Dropout at the end of convolutional branches, right before the addition, and it turned out to work much better than stochastic depth.
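The relationship between the two regularizers is easy to see in code. The sketch below is a simplified training-time illustration on nested lists: the function names are ours, and rescaling the kept channels by 1/(1 - p) is an assumed convention rather than something stated in the text.

```python
import random


def spatial_dropout(feature_maps, p, rng):
    """Zero each channel independently with probability p (training mode)."""
    keep_scale = 1.0 / (1.0 - p) if p < 1.0 else 0.0
    out = []
    for fmap in feature_maps:
        if rng.random() < p:
            out.append([[0.0 for _ in row] for row in fmap])
        else:
            out.append([[v * keep_scale for v in row] for row in fmap])
    return out


def stochastic_depth(feature_maps, p, rng):
    """Zero the whole branch, i.e. all channels together, with probability p."""
    if rng.random() < p:
        return [[[0.0 for _ in row] for row in fmap] for fmap in feature_maps]
    return [[list(row) for row in fmap] for fmap in feature_maps]


rng = random.Random(0)
branch = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(8)]   # 8 channels of 2x2
dropped = spatial_dropout(branch, p=0.5, rng=rng)
```

Stochastic depth makes a single random decision for the whole branch, whereas Spatial Dropout makes one decision per channel, so the branch output is usually only partially removed.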
We benchmarked the performance of ENet on three different datasets to demonstrate that it is fast and accurate enough for practical real-time applications. We tested on the CamVid and Cityscapes datasets of road scenes, and the SUN RGB-D dataset of indoor scenes. We set SegNet badrinarayanan15 as a baseline since it is one of the fastest segmentation models, and it also has far fewer parameters and requires less memory to operate than FCN. All our models, training, testing and performance evaluation scripts used the Torch7 machine-learning library, with the cuDNN backend. To compare results, we use class average accuracy and intersection-over-union (IoU) metrics.
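Both metrics can be computed from a confusion matrix over pixel labels. The following sketch uses the common definitions (per-class recall averaged over classes, and TP / (TP + FP + FN) averaged over classes); the function names and the toy labels are assumptions for illustration, and official benchmark scripts may differ in details such as how ignored pixels are handled.

```python
def confusion_matrix(truth, pred, num_classes):
    """cm[t][p] counts pixels with true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, pred):
        cm[t][p] += 1
    return cm


def class_average_accuracy(cm):
    """Mean over classes of correctly labeled pixels / pixels of that class."""
    accs = [cm[c][c] / sum(cm[c]) for c in range(len(cm)) if sum(cm[c]) > 0]
    return sum(accs) / len(accs)


def mean_iou(cm):
    """Mean over classes of TP / (TP + FP + FN)."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(n)) - tp
        if tp + fp + fn > 0:
            ious.append(tp / (tp + fp + fn))
    return sum(ious) / len(ious)


truth = [0, 0, 0, 0, 1, 1, 2, 2]    # flattened ground-truth labels
pred = [0, 0, 0, 1, 1, 1, 2, 0]     # flattened predictions
cm = confusion_matrix(truth, pred, num_classes=3)
accuracy = class_average_accuracy(cm)
iou = mean_iou(cm)
```

Unlike accuracy, IoU also penalizes false positives, so a class that is over-predicted scores lower even when all of its true pixels are found.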
We report results on inference speed on the widely used NVIDIA Titan X GPU as well as on the NVIDIA TX1 embedded system module. ENet was designed to achieve more than fps on the NVIDIA TX1 board with an input image size , which is adequate for practical road scene parsing applications. For inference we merge batch normalization and dropout layers into the convolutional filters, to speed up all networks.
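Merging batch normalization into the preceding convolution is possible because, at inference time, batch normalization is a fixed per-channel affine transform. The sketch below shows the standard folding for a single output channel, with the convolution reduced to a dot product; the function names and numeric values are illustrative assumptions, and the handling of dropout is not shown.

```python
import math


def conv_then_bn(x, weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Reference path: linear filter followed by inference-mode batch norm."""
    pre = sum(w * v for w, v in zip(weights, x)) + bias
    return gamma * (pre - mean) / math.sqrt(var + eps) + beta


def fold_batch_norm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Return filter weights and bias that absorb the batch norm transform."""
    scale = gamma / math.sqrt(var + eps)
    return [w * scale for w in weights], (bias - mean) * scale + beta


x = [0.5, -1.0, 2.0]
weights, bias = [0.2, -0.4, 0.1], 0.0
gamma, beta, mean, var = 1.5, 0.3, 0.1, 0.25

folded_w, folded_b = fold_batch_norm(weights, bias, gamma, beta, mean, var)
reference = conv_then_bn(x, weights, bias, gamma, beta, mean, var)
fused = sum(w * v for w, v in zip(folded_w, x)) + folded_b
```

After folding, the normalization layer disappears from the inference graph, at the price of a bias term on the merged filter.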
|Model||NVIDIA TX1||NVIDIA Titan X|
Table 2 compares inference time for a single input frame of varying resolution. We also report the number of frames per second that can be processed. Dashes indicate that we could not obtain a measurement, due to lack of memory. ENet is significantly faster than SegNet, providing high frame rates for real-time applications and allowing for practical use of very deep neural network models with encoder-decoder architecture.
|GFLOPs||Parameters||Model size (fp16)|
Hardware requirements. FLOPs are estimated for an input of.
Table 3 reports a comparison of the number of floating point operations and parameters used by different models. ENet's efficiency is evident, as its requirements are about two orders of magnitude smaller. Please note that we report the storage required to save model parameters in half precision floating point format. ENet has so few parameters that the required space is only 0.7MB, which makes it possible to fit the whole network in extremely fast on-chip memory in embedded processors. Also, this alleviates the need for model compression song15 , making it possible to use general purpose neural network libraries. However, if one needs to operate under incredibly strict memory constraints, these techniques can still be applied to ENet as well.
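The storage figure follows directly from the parameter count, since a half precision value occupies two bytes. In the sketch below the parameter count is a hypothetical value chosen only to be consistent with the 0.7MB quoted above; it is not taken from Table 3.

```python
BYTES_PER_FP16 = 2


def model_size_mb(num_parameters, bytes_per_value=BYTES_PER_FP16):
    """Storage for the weights alone, in megabytes (10**6 bytes)."""
    return num_parameters * bytes_per_value / 1e6


assumed_parameters = 370_000    # hypothetical count, for illustration only
size_fp16 = model_size_mb(assumed_parameters)
size_fp32 = model_size_mb(assumed_parameters, bytes_per_value=4)
```

Storing the same weights in single precision would double the footprint, which matters when the goal is to keep the whole model in on-chip memory.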
One of the most important techniques that has allowed us to reach these levels of performance is convolutional layer factorization. However, we have found one surprising drawback. Although applying this method allowed us to greatly reduce the number of floating point operations and parameters, it also increased the number of individual kernel calls, making each of them smaller.
We have found that some of these operations can become so cheap that the cost of a GPU kernel launch starts to outweigh the cost of the actual computation. Also, because kernels do not have access to values that have been kept in registers by previous ones, they have to load all data from global memory at launch, and save it when their work is finished. This means that using a higher number of kernels increases the number of memory transactions, because feature maps have to be constantly saved and reloaded. This becomes especially apparent in the case of non-linear operations. In ENet, PReLUs consume more than a quarter of inference time. Since they are only simple point-wise operations and are very easy to parallelize, we hypothesize that this is caused by the aforementioned data movement.
These are serious limitations; however, they could be resolved by performing kernel fusion in existing software, i.e. creating kernels that apply non-linearities to the results of convolutions directly, or that perform a number of smaller convolutions in one call. This improvement in GPU libraries, such as cuDNN, could increase the speed and efficiency of our network even further.
We have used the Adam optimization algorithm diederik14 to train the network. It allowed ENet to converge very quickly, and on every dataset we have used, training took only 3-6 hours using four Titan X GPUs. It was performed in two stages: first we trained only the encoder to categorize downsampled regions of the input image, then we appended the decoder and trained the network to perform upsampling and pixel-wise classification. Learning rate of and L2 weight decay of , along with batch size of consistently provided the best results. We have used a custom class weighting scheme defined as
. In contrast to inverse class probability weighting, the weights are bounded as the probability approaches 0. is an additional hyper-parameter, which we set to (i.e. we restrict the class weights to be in the interval of ).
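Since the defining equation is not reproduced here, the following sketch assumes one form that matches the description, w = 1 / ln(c + p), where p is the class probability and c is the extra hyper-parameter. Both this formula and the value c = 1.02 are assumptions made for illustration; they should be checked against the original paper before use.

```python
import math


def bounded_class_weight(p, c=1.02):
    """Assumed weighting 1 / ln(c + p); stays finite as p approaches 0."""
    return 1.0 / math.log(c + p)


def inverse_probability_weight(p):
    """Plain inverse class probability weighting; unbounded as p approaches 0."""
    return 1.0 / p


probabilities = [1e-6, 0.01, 0.1, 0.5, 1.0]
bounded = [bounded_class_weight(p) for p in probabilities]
inverse = [inverse_probability_weight(p) for p in probabilities]

upper_bound = bounded_class_weight(0.0)    # weight of a vanishingly rare class
lower_bound = bounded_class_weight(1.0)    # weight of a class covering every pixel
```

With this assumed form, rare classes still receive larger weights than frequent ones, but no class weight can exceed the value reached at p = 0, which keeps very rare classes from dominating the loss.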
|Model||Class IoU||Class iIoU||Category IoU||Category iIoU|
This dataset consists of 5000 fine-annotated images, out of which 2975 are available for training, 500 for validation, and the remaining 1525 have been selected as the test set cityscape2016 . Cityscapes was the most important benchmark for us, because of its outstanding quality and highly varying road scenarios, often featuring many pedestrians and cyclists. We trained on the 19 classes that have been selected in the official evaluation scripts cityscape2016 . The benchmark makes use of an additional metric called instance-level intersection over union (iIoU), which is IoU weighted by the average object size. As reported in Table 4, ENet outperforms SegNet in class IoU and iIoU, as well as in category IoU. ENet is currently the fastest model in the Cityscapes benchmark. Example predictions for images from the validation set are presented in Figure 4.
Another automotive dataset on which we have tested ENet was CamVid. It contains 367 training and 233 testing images camvid08 . There are eleven different classes such as building, tree, sky, car, road, etc., while the twelfth class contains unlabeled data, which we ignore while training. The original frame resolution for this dataset is 960×720 but we downsampled the images to 480×360 before training. In Table 5 we compare the performance of ENet with existing state-of-the-art algorithms. ENet outperforms other models in six classes, which are difficult to learn because they correspond to smaller objects. ENet output for example images from the test set can be found in Figure 5.
|Model||Global avg.||Class avg.||Mean IoU|
The SUN dataset consists of 5285 training images and 5050 testing images with 37 indoor object classes. We did not make any use of depth information in this work and trained the network only on RGB data. In Table 6 we compare the performance of ENet with SegNet badrinarayanan15 , which is the only neural network model that reports accuracy on this dataset. Our results, though inferior in global average accuracy and IoU, are comparable in class average accuracy. Since global average accuracy and IoU are metrics that favor correct classification of classes occupying large image patches, researchers generally emphasize the importance of other metrics in the case of semantic segmentation. One notable example is the introduction of the iIoU metric cityscape2016 . The comparable result in class average accuracy indicates that our network is capable of differentiating smaller objects nearly as well as SegNet. Moreover, the difference in accuracy should not overshadow the huge performance gap between these two networks. ENet can process the images in real-time, and is nearly faster than SegNet on embedded platforms. Example predictions from the SUN test set are shown in Figure 6.
We have proposed a novel neural network architecture designed from the ground up specifically for semantic segmentation. Our main aim is to make efficient use of the scarce resources available on embedded platforms, compared to fully fledged deep learning workstations. Our work provides large gains in this task, while matching and at times exceeding existing baseline models that have an order of magnitude larger computational and memory requirements. The application of ENet on the NVIDIA TX1 hardware exemplifies real-time portable embedded solutions.
Even though the main goal was to run the network on mobile devices, we have found that it is also very efficient on high end GPUs like the NVIDIA Titan X. This may prove useful in data-center applications, where there is a need to process large numbers of high resolution images. ENet allows large-scale computations to be performed in a much faster and more efficient manner, which might lead to significant savings.
This work is partly supported by the Office of Naval Research (ONR) grants N00014-12-1-0167, N00014-15-1-2791 and MURI N00014-10-1-0278. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the TX1, Titan X, K40 GPUs used for this research.
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, 2012, pp. 1097–1105.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
M. A. Ranzato, F. J. Huang, Y.-L. Boureau, and Y. LeCun, “Unsupervised learning of invariant feature hierarchies with applications to object recognition,” in Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, 2007, pp. 1–8.