ESPNetv2
A lightweight, power efficient, and general purpose convolutional neural network
We introduce a light-weight, power efficient, and general purpose convolutional neural network, ESPNetv2, for modeling visual and sequential data. Our network uses group point-wise and depth-wise dilated separable convolutions to learn representations from a large effective receptive field with fewer FLOPs and parameters. The performance of our network is evaluated on three different tasks: (1) object classification, (2) semantic segmentation, and (3) language modeling. Experiments on these tasks, including image classification on ImageNet and language modeling on the Penn Treebank dataset, demonstrate the superior performance of our method over the state-of-the-art methods. Our network has better generalization properties than ShuffleNetv2 when tested on the MS-COCO multi-object classification task and the Cityscapes urban scene semantic segmentation task. Our experiments show that ESPNetv2 is much more power efficient than existing state-of-the-art efficient methods including ShuffleNets and MobileNets. Our code is open-source and available at <https://github.com/sacmehta/ESPNetv2>
The increasing programmability and computational power of GPUs have accelerated the growth of deep convolutional neural networks (CNNs) for modeling visual data [19, 12, 30]. CNNs are being used in real-world visual recognition applications such as visual scene understanding [52] and bio-medical image analysis [36]. Many of these real-world applications, such as self-driving cars and robots, run on resource-constrained edge devices and demand online processing of data with low latency.

Existing CNN-based visual recognition systems require large amounts of computational resources, including memory and power. While they achieve high performance on high-end GPU-based machines (e.g. with an NVIDIA TitanX), they are often too expensive for resource-constrained edge devices such as cell phones and embedded compute platforms. As an example, ResNet-50 [12], one of the most well-known CNN architectures for image classification, has 25.56 million parameters (98 MB of memory) and performs 2.8 billion high-precision operations to classify an image. These numbers are even higher for deeper CNNs, e.g. ResNet-101. These models quickly overtax the limited resources, including compute capabilities, memory, and battery, available on edge devices. Therefore, CNNs for real-world applications running on edge devices should be light-weight and efficient while delivering high accuracy.
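The memory figure quoted above follows directly from the parameter count, assuming 32-bit floats. The snippet below is an illustrative back-of-the-envelope check (our own, not from the paper):

```python
# Back-of-the-envelope memory estimate for ResNet-50's weights.
# 25.56 million parameters, each stored as a 32-bit (4-byte) float.
params = 25.56e6
bytes_per_param = 4  # float32

size_mib = params * bytes_per_param / 2**20  # mebibytes
print(f"{size_mib:.1f} MB")  # roughly the 98 MB quoted above
```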
Recent efforts for building light-weight networks can be broadly classified as: (1) Network compression-based methods, which remove redundancies in a pre-trained model in order to be more efficient. These models are usually implemented using different parameter pruning techniques [47, 21]. (2) Low-bit representation-based methods, which represent learned weights using few bits instead of high-precision floating points [40, 34, 16]. These models usually do not change the structure of the network, and the convolutional operations can be implemented using logical gates to enable fast processing on CPUs. (3) Light-weight CNNs, which improve the efficiency of a network by factoring the computationally expensive convolution operation [13, 38, 51, 25, 14, 28]. These models are computationally efficient by design, i.e. the underlying model structure learns fewer parameters and has fewer floating point operations (FLOPs).
In this paper, we introduce a light-weight architecture, ESPNetv2, that can be easily deployed on edge devices. Our model extends ESPNet [28], a light-weight semantic segmentation network, by using group point-wise and depth-wise "dilated" separable convolutions instead of computationally expensive point-wise and dilated convolutions. This reduces network parameters and complexity while maintaining high accuracy. The core building block of our network, the EESP unit, is general and can be used across a wide range of visual and sequence modeling tasks. Our approach is orthogonal to the current state-of-the-art efficient models [51, 25], yet it reaches higher performance without any channel shuffle or channel split, which have been shown to be very effective for improving the accuracy of light-weight models.
To show the generalizability of our model, we evaluate our network across three different tasks: (1) object classification, (2) semantic segmentation, and (3) language modeling. On the ImageNet classification task [37], our model outperforms all of the previous efficient model designs in terms of efficiency and accuracy, especially under small computational budgets. For example, our model outperforms MobileNetv2 [38] by 2% at a computational budget of 28 million FLOPs. Our most efficient model learns 3.5 million parameters and has 284 million FLOPs while delivering a top-1 classification accuracy of 72.1% on the ImageNet classification task. Our network has better generalization properties than ShuffleNetv2 [25] when tested on the MS-COCO multi-object classification task [22] and the Cityscapes urban scene semantic segmentation task [6]. Furthermore, we show that the introduced EESP unit can be used as a drop-in replacement for recurrent neural networks and delivers state-of-the-art performance while learning fewer parameters. Our experiments also show that our network is much more power efficient than existing state-of-the-art efficient methods including ShuffleNets [25, 51] and MobileNets [13, 38].

We also propose to use SGD with a cyclic learning rate schedule and warm restarts. In each cycle, the learning rate is initialized to its maximum value and is then scheduled to decrease linearly to its minimum value. Our experimental results on the ImageNet dataset suggest that such a learning policy helps the network avoid saddle points and reach higher accuracy in comparison to widely used step-wise learning policies. Our code is open-source and available at https://github.com/sacmehta/ESPNetv2.
This section briefly reviews different methods for building efficient networks.
Most state-of-the-art efficient networks [13, 38, 25] use depth-wise separable convolutions [13] that factor a convolution into two steps to reduce computational complexity: (1) a depth-wise convolution that performs light-weight filtering by applying a single convolutional kernel per input channel, and (2) a point-wise convolution that usually expands the feature map along channels by learning linear combinations of the input channels. Another efficient form of convolution that has been used in efficient networks [51, 14] is group convolution [19], wherein input channels and convolutional kernels are factored into groups and each group is convolved independently. The ESPNetv2 network extends the ESPNet network [28] using these efficient forms of convolutions. To learn representations from a large effective receptive field, ESPNetv2 uses depth-wise "dilated" separable convolutions instead of depth-wise separable convolutions.
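To make the factorization concrete, here is a minimal PyTorch sketch (our own illustration, not code from the paper) of a depth-wise separable convolution built from the two steps described above; `groups=in_channels` gives the per-channel depth-wise filtering, and the 1×1 convolution is the point-wise step:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depth-wise convolution followed by a 1x1 point-wise convolution."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Step 1: one 3x3 kernel per input channel (groups=in_channels).
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels, bias=False)
        # Step 2: 1x1 convolution mixes channels (linear combinations).
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

# Parameter comparison against a standard 3x3 convolution.
standard = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)
separable = DepthwiseSeparableConv(64, 128)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))   # 3*3*64*128 = 73728
print(count(separable))  # 3*3*64 + 64*128 = 8768
```

The separable version uses roughly 8× fewer parameters here while producing an output of the same shape.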
These approaches improve the inference of a pre-trained network by pruning network connections or channels [9, 10, 47, 21, 45]. They are effective because CNNs have a substantial number of redundant weights. However, the efficiency gain in most of these approaches is due to the sparsity of parameters, which is difficult to exploit efficiently on CPUs due to the cost of look-up and data migration operations. These approaches are complementary to our network.
Another approach to improve the inference of a pre-trained network is low-bit representation of network weights using quantization [40, 34, 48, 7, 53, 16, 1]. These approaches use fewer bits to represent the weights of a pre-trained network instead of 32-bit high-precision floating points. Similar to network compression-based methods, these approaches are complementary to our work.
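As a toy illustration of the low-bit idea (our own sketch, not a reproduction of any specific method cited above), uniform affine quantization maps 32-bit weights onto a small integer grid and back:

```python
# Toy uniform affine quantization of a weight list to 2**bits levels.
# Illustrative sketch only; real quantization schemes differ in detail.
def quantize(weights, bits=8):
    lo, hi = min(weights), max(weights)
    levels = 2**bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [round((w - lo) / scale) for w in weights]   # integers in [0, levels]
    dequant = [lo + qi * scale for qi in q]          # reconstructed floats
    return q, dequant

w = [-0.51, -0.2, 0.0, 0.33, 0.49]
q, w_hat = quantize(w, bits=8)
# Reconstruction error is bounded by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```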
This section describes the ESPNetv2 architecture in detail. We first describe depth-wise dilated separable convolutions, which enable our network to learn representations from a large effective receptive field efficiently. We then describe the core unit of the ESPNetv2 network, the EESP unit, which is built using group point-wise convolutions and depth-wise dilated separable convolutions.
Convolution factorization is the key principle used by many efficient architectures [13, 25, 51, 38]. The basic idea is to replace the full convolutional operation with a factorized version, such as depth-wise separable convolution [13] or group convolution [19]. In this section, we describe depth-wise dilated separable convolutions and compare them with other similar efficient forms of convolution.
A standard convolution convolves an input $\mathbf{X} \in \mathbb{R}^{W \times H \times c}$ with a convolutional kernel $\mathbf{K} \in \mathbb{R}^{n \times n \times c \times \hat{c}}$ to produce an output $\mathbf{Y} \in \mathbb{R}^{W \times H \times \hat{c}}$ by learning $n^2 c \hat{c}$ parameters from an effective receptive field of $n \times n$. In contrast to standard convolution, depth-wise dilated separable convolutions apply a light-weight filtering by factoring a standard convolution into two layers: 1) a depth-wise dilated convolution per input channel with a dilation rate of $r$, enabling the convolution to learn representations from an effective receptive field of $n_r \times n_r$, where $n_r = (n-1)r + 1$, and 2) a point-wise convolution to learn linear combinations of the input. This factorization reduces the computational cost by a factor of $\frac{n^2 c \hat{c}}{n^2 c + c \hat{c}}$. A comparison between different types of convolutions is provided in Table 1. Depth-wise dilated separable convolutions are efficient and can learn representations from large effective receptive fields.
Convolution type | Parameters | Eff. receptive field
Standard | $n^2 c \hat{c}$ | $n \times n$
Group | $\frac{n^2 c \hat{c}}{g}$ | $n \times n$
Depth-wise separable | $n^2 c + c \hat{c}$ | $n \times n$
Depth-wise dilated separable | $n^2 c + c \hat{c}$ | $n_r \times n_r$
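The costs of these convolution types can be checked numerically. The snippet below is our own worked example with illustrative values for the kernel size n, input channels c, output channels c_hat, groups g, and dilation rate r:

```python
# Parameter counts and effective receptive fields for each convolution type.
# Illustrative values: 3x3 kernel, 64 input / 128 output channels,
# 4 groups, dilation rate 3.
n, c, c_hat, g, r = 3, 64, 128, 4, 3

standard = n**2 * c * c_hat            # full convolution
group = n**2 * c * c_hat // g          # standard cost divided by g
dw_separable = n**2 * c + c * c_hat    # depth-wise + point-wise
dw_dilated = n**2 * c + c * c_hat      # same params, larger receptive field
n_r = (n - 1) * r + 1                  # effective kernel size under dilation

print(standard, group, dw_separable, n_r)
# Reduction factor of the separable factorization:
print(round(standard / dw_separable, 1))
```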
Layer | Output size | Kernel size / stride | Repeat | Output channels for different ESPNetv2 models
Convolution | 112 × 112 | 3 × 3 / 2 | 1 | 16 | 32 | 32 | 32 | 32 | 32
Strided EESP (Fig. 2) | 56 × 56 | | 1 | 32 | 64 | 80 | 96 | 112 | 128
Strided EESP (Fig. 2) | 28 × 28 | | 1 | 64 | 128 | 160 | 192 | 224 | 256
EESP (Fig. 1(c)) | 28 × 28 | | 3 | 64 | 128 | 160 | 192 | 224 | 256
Strided EESP (Fig. 2) | 14 × 14 | | 1 | 128 | 256 | 320 | 384 | 448 | 512
EESP (Fig. 1(c)) | 14 × 14 | | 7 | 128 | 256 | 320 | 384 | 448 | 512
Strided EESP (Fig. 2) | 7 × 7 | | 1 | 256 | 512 | 640 | 768 | 896 | 1024
EESP (Fig. 1(c)) | 7 × 7 | | 3 | 256 | 512 | 640 | 768 | 896 | 1024
Depth-wise convolution | 7 × 7 | 3 × 3 | | 256 | 512 | 640 | 768 | 896 | 1024
Group convolution | 7 × 7 | 1 × 1 | | 1024 | 1024 | 1024 | 1024 | 1280 | 1280
Global avg. pool | 1 × 1 | 7 × 7 | | | | | | |
Fully connected | | | | 1000 | 1000 | 1000 | 1000 | 1000 | 1000
Complexity | | | | 28 M | 86 M | 123 M | 169 M | 224 M | 284 M
Parameters | | | | 1.24 M | 1.67 M | 1.97 M | 2.31 M | 3.03 M | 3.49 M
Taking advantage of depth-wise dilated separable and group point-wise convolutions, we introduce a new unit, EESP (Extremely Efficient Spatial Pyramid of Depth-wise Dilated Separable Convolutions), which is specifically designed for edge devices. The design of our network is motivated by the ESPNet architecture [28], a state-of-the-art efficient segmentation network. The basic building block of the ESPNet architecture is the ESP module, shown in Figure 1(a). It is based on a reduce-split-transform-merge strategy. The ESP unit first projects the high-dimensional input feature maps into a low-dimensional space using point-wise convolutions and then learns the representations in parallel using dilated convolutions with different dilation rates. The different dilation rates in each branch allow the ESP unit to learn representations from a large effective receptive field. This factorization, especially learning the representations in a low-dimensional space, allows the ESP unit to be efficient.
To make the ESP module even more computationally efficient, we first replace the point-wise convolutions with group point-wise convolutions. We then replace the computationally expensive dilated convolutions with their economical counterparts, i.e. depth-wise dilated separable convolutions. To remove the gridding artifacts caused by dilated convolutions, we fuse the feature maps using the computationally efficient hierarchical feature fusion (HFF) method [28]. This method additively fuses the feature maps learned using dilated convolutions in a hierarchical fashion; feature maps from the branch with the lowest receptive field are combined with the feature maps from the branch with the next highest receptive field at each level of the hierarchy.^{1} The resultant unit is shown in Figure 1(b). With group point-wise and depth-wise dilated separable convolutions, the total complexity of the ESP block is reduced by a factor of $\frac{Md + n^2 d^2 K}{\frac{Md}{g} + (n^2 + d) d K}$, where $K$ is the number of parallel branches and $g$ is the number of groups in the group point-wise convolution. For example, the EESP unit learns $7\times$ fewer parameters than the ESP unit when $M = 240$, $g = K = 4$, and $d = \frac{M}{K} = 60$.

^{1}Other existing works [50, 46] add more convolutional layers with small dilation rates to remove gridding artifacts. This increases the computational complexity of the unit or network.
We note that computing $K$ point-wise (or $1 \times 1$) convolutions in Figure 1(b) independently is equivalent to a single group point-wise convolution with $K$ groups in terms of complexity; however, the group point-wise convolution is more efficient in terms of implementation, because it launches a single convolutional kernel rather than $K$ point-wise convolutional kernels. Therefore, we replace these $K$ point-wise convolutions with a group point-wise convolution, as shown in Figure 1(c). We will refer to this unit as EESP.
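The unit described above can be sketched in a few lines of PyTorch. This is our own simplified rendering, not the paper's released code (which differs in details such as normalization, activations, and the strided variant): a group point-wise reduction, K parallel depth-wise dilated branches, hierarchical additive fusion (HFF), and a group point-wise expansion with a residual connection:

```python
import torch
import torch.nn as nn

class EESP(nn.Module):
    """Simplified EESP unit: reduce -> split -> transform -> HFF merge."""
    def __init__(self, channels: int, K: int = 4):
        super().__init__()
        d = channels // K  # low-dimensional branch width
        # Reduce: group point-wise convolution projects to d channels.
        self.reduce = nn.Conv2d(channels, d, 1, groups=K, bias=False)
        # Transform: K depth-wise dilated 3x3 convolutions, dilation 1..K.
        self.branches = nn.ModuleList(
            nn.Conv2d(d, d, 3, padding=r, dilation=r, groups=d, bias=False)
            for r in range(1, K + 1))
        # Expand: group point-wise convolution back to `channels`.
        self.expand = nn.Conv2d(K * d, channels, 1, groups=K, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.reduce(x)
        outs = [branch(y) for branch in self.branches]
        # Hierarchical feature fusion: add each branch's output to the
        # running sum of lower-dilation branches before concatenation.
        for k in range(1, len(outs)):
            outs[k] = outs[k] + outs[k - 1]
        return self.expand(torch.cat(outs, dim=1)) + x  # residual connection

unit = EESP(channels=64, K=4)
out = unit(torch.randn(2, 64, 28, 28))  # spatial size and channels preserved
```

Because padding equals the dilation rate for each 3×3 branch, every branch preserves the spatial size, so the branch outputs can be fused additively and concatenated.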
To learn representations efficiently at multiple scales, we make the following changes to the EESP block in Figure 1(c): 1) the depth-wise dilated convolutions are replaced with their strided counterparts, 2) an average pooling operation is added instead of an identity connection, and 3) the element-wise addition operation is replaced with a concatenation operation, which helps expand the dimensions of the feature maps efficiently [51].
Spatial information is lost during downsampling and convolution (filtering) operations. To better encode spatial relationships and learn representations efficiently, we add an efficient long-range shortcut connection between the input image and the current downsampling unit. This connection first downsamples the image to the same size as that of the feature map and then learns the representations using a stack of two convolutions. The first convolution is a standard convolution that learns the spatial representations, while the second convolution is a point-wise convolution that learns linear combinations of the input and projects them to a high-dimensional space. The resultant EESP unit with the long-range shortcut connection to the input is shown in Figure 2.

The ESPNetv2 network is built using EESP units. At each spatial level, ESPNetv2 repeats the EESP unit several times to increase the depth of the network. In the EESP unit (Figure 1(c)), we use batch normalization [17] and PReLU [11] after every convolutional layer, with the exception of the last group-wise convolutional layer, where PReLU is applied after the element-wise sum operation. To maintain the same computational complexity at each spatial level, the number of feature maps is doubled after every downsampling operation [39, 12].

In our experiments, we set the dilation rate proportional to the number of branches $K$ in the EESP unit. The effective receptive field of the EESP unit grows with $K$. Some of the kernels, especially at low spatial levels such as $7 \times 7$, might have a larger effective receptive field than the size of the feature map; such kernels might not contribute to learning. In order to have meaningful kernels, we limit the effective receptive field at each spatial level so that it does not exceed the spatial dimensions of the feature map, with the smallest effective receptive field assigned to the lowest spatial level (i.e. $7 \times 7$). Following [28], we set $K = 4$ in our experiments. Furthermore, in order to have a homogeneous architecture, we set the number of groups in the group point-wise convolutions equal to the number of parallel branches ($g = K$). The overall ESPNetv2 architectures at different computational complexities are shown in Table 2.
To showcase the power of the ESPNetv2 network, we evaluate and compare its performance with state-of-the-art methods on three different tasks: (1) object classification, (2) semantic segmentation, and (3) language modeling.
We evaluate the performance of ESPNetv2 on the ImageNet 2012 dataset [37], which contains 1.28 million images for training and 50,000 images for validation. The task is to classify an image into 1,000 categories. We evaluate the performance of our network using the single-crop top-1 classification accuracy, i.e. we compute the accuracy on the center-cropped view of size 224 × 224.
The ESPNetv2 networks are trained using the PyTorch deep learning framework [33] with CUDA 9.0 and cuDNN as the back-ends. For optimization, we use SGD [43] with warm restarts. At each epoch $t$, we compute the learning rate as:

$$\eta_t = \eta_{max} - (t \bmod T) \cdot \frac{\eta_{max} - \eta_{min}}{T} \qquad (1)$$

where $\eta_{min}$ and $\eta_{max}$ are the range limits for the learning rate and $T$ is the cycle length after which the learning rate restarts. Figure 4 visualizes the learning rate policy for three cycles. This learning rate scheme can be seen as a variant of the cosine learning policy [24], wherein the learning rate is decayed as a function of cosine before a warm restart. In our experiment, we set $\eta_{min} = 0.1$, $\eta_{max} = 0.5$, and $T = 5$. We train our networks with a batch size of 512 for 300 epochs by optimizing the cross-entropy loss. For faster convergence, we decay the learning rate by a factor of two at the following epoch intervals: {50, 100, 130, 160, 190, 220, 250, 280}. We use a standard data augmentation strategy [44, 12], with the exception of color-based normalization. This is in contrast to recent efficient architectures that use less scale augmentation to prevent under-fitting [25, 51]. The weights of our networks are initialized using the method described in [11].
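The warm-restart schedule in Eq. (1) can be implemented in a few lines. This is a sketch of our reconstruction of the policy; the exact form in the released code may differ slightly:

```python
# Cyclic learning rate with warm restarts, per Eq. (1):
# eta_t = eta_max - (t mod T) * (eta_max - eta_min) / T
def cyclic_lr(t: int, eta_min: float = 0.1, eta_max: float = 0.5,
              T: int = 5) -> float:
    return eta_max - (t % T) * (eta_max - eta_min) / T

# The rate restarts at its maximum every T epochs and decreases
# linearly toward the minimum within each cycle.
schedule = [round(cyclic_lr(t), 2) for t in range(11)]
print(schedule)
```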
Figure 3 provides a performance comparison between ESPNetv2 and state-of-the-art efficient networks. We observe that:

- ESPNetv2 outperforms ShuffleNetv1 [51] with or without channel shuffle, suggesting that our architecture enables learning of efficient representations.
- ShuffleNetv2 [25] extends ShuffleNetv1 [51] by adding a channel split functionality, which enables it to deliver better performance than ShuffleNetv1. ESPNetv2 delivers comparable accuracy to ShuffleNetv2 without any channel split or shuffle. We believe that such functionalities are orthogonal to our network and can further improve its efficiency and accuracy.
- Compared to other efficient networks at a computational budget of approximately 300 million FLOPs, ESPNetv2 delivers better performance (e.g. 1.1% more accurate than CondenseNet [14]).
To evaluate the generalizability for transfer learning, we evaluate our model on the MS-COCO multi-object classification task [22]. The dataset consists of 82,783 images, which are categorized into 80 classes with 2.9 object labels per image. Following [54], we evaluated our method on the validation set (40,504 images) using class-wise and overall F1 scores. We fine-tune ESPNetv2 (284 million FLOPs) and ShuffleNetv2 [25] (299 million FLOPs) for 100 epochs using the same data augmentation and training settings as for the ImageNet dataset, except for smaller values of $\eta_{min}$ and $\eta_{max}$, and the learning rate is decayed by two at the 50th and 80th epochs. We use the binary cross-entropy loss for optimization. Results are shown in Figure 6. ESPNetv2 outperforms ShuffleNetv2 by a large margin, especially when tested at a higher image resolution, suggesting that the large effective receptive fields of the EESP unit help ESPNetv2 learn better and more generalizable representations.

Edge devices have limited computational resources and restrictive energy overhead. An efficient network for such devices should consume less power and have low latency with high accuracy. We measure the efficiency of our network, ESPNetv2, along with other state-of-the-art networks (MobileNets [13, 38] and ShuffleNets [51, 25]) on two different devices: 1) a high-end graphics card (NVIDIA GTX 1080 Ti) and 2) an embedded device (NVIDIA Jetson TX2). For a fair comparison, we use PyTorch as the deep learning framework on both. Figure 5 compares inference time and power consumption, while network complexity and accuracy are compared in Figure 3. The inference speed of ESPNetv2 is slightly lower than that of the fastest network (ShuffleNetv2 [25]) on both devices; however, it is much more power efficient while delivering similar accuracy on the ImageNet dataset. This suggests that the ESPNetv2 network has a good trade-off between accuracy, power consumption, and latency; a much desirable property for any network running on edge devices.
We evaluate the performance of ESPNetv2 on an urban scene semantic segmentation dataset, Cityscapes [6]. The dataset was collected across 50 cities under different environmental conditions, such as weather and season. It consists of 5,000 finely annotated images (training/validation/test: 2,975/500/1,525). The task is to segment an image into 19 classes that belong to 7 categories.
We train our network for 300 epochs using ADAM [18] with an initial learning rate of 0.0005 and polynomial rate decay with a power of 0.9. Standard data augmentation strategies, such as scaling, cropping, and flipping, are used while training the networks. For training, we subsample the images by a factor of 2 (i.e. an image size of 1024 × 512). We evaluate the accuracy in terms of mean Intersection over Union (mIOU) on the private test set using the online evaluation server. For evaluation, we upsample the segmented masks to the same size as the input image (i.e. 2048 × 1024) using nearest-neighbour interpolation.
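Nearest-neighbour upsampling of a label mask simply replicates each source label into the target grid. A dependency-free sketch of our own (in practice this is a one-liner in PyTorch or OpenCV):

```python
# Nearest-neighbour upsampling of a 2D label mask (lists of ints).
# Each target pixel copies the label of its nearest source pixel.
def upsample_nearest(mask, out_h, out_w):
    in_h, in_w = len(mask), len(mask[0])
    return [[mask[y * in_h // out_h][x * in_w // out_w]
             for x in range(out_w)]
            for y in range(out_h)]

small = [[1, 2],
         [3, 4]]
big = upsample_nearest(small, 4, 4)
# Each label is replicated into a 2x2 block; no new label values are
# invented, which is why nearest-neighbour is appropriate for masks.
```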


The performance of ESPNetv2 is compared with ESPNet [28] in Figure 6(a). Clearly, ESPNetv2 is much more efficient and accurate than ESPNet. When the base segmentation network, ESPNetv2, is replaced with ShuffleNetv2 (at the same computational complexity), the performance of the segmentation network drops by about 2%, suggesting that ESPNetv2 has better generalization properties (see Figure 6(b)). Furthermore, Figure 6(c) provides a comparison between the ESPNetv2 network and state-of-the-art networks. Under the same computational constraints, ESPNetv2 is 4% and 2% more accurate than ENet [32] and ESPNet [28], respectively.
We extend LSTM-based language models by replacing the linear transforms for processing the input vector with the EESP unit inside the LSTM cell.^{2} We call this model ERU (Efficient Recurrent Unit). Our model uses three layers of ERU with an embedding size of 400. We use standard dropout [41] with a probability of 0.5 after the embedding layer, between ERU layers, and after the final ERU layer. We train the network using the same learning policy as [30]. We evaluate the performance in terms of perplexity; a lower value of perplexity is desirable.

^{2}We replace 2D convolutions with 1D convolutions in the EESP unit.

Language modeling results are provided in Table 3. ERUs achieve similar or better performance than state-of-the-art methods while learning fewer parameters. With similar hyper-parameter settings such as dropout, ERUs deliver similar (only 1 point lower than PRU [27]) or better performance than state-of-the-art recurrent networks while learning fewer parameters, suggesting that the introduced EESP unit (Figure 1(c)) is efficient and powerful, and can be applied across different sequence modeling tasks such as question answering and machine translation. We note that our smallest language model, with 7 million parameters, outperforms most state-of-the-art language models (e.g. [8, 49, 3]). We believe that the performance of ERU can be further improved by rigorous hyper-parameter search [29] and advanced dropouts [30, 8].
Language Model | # Params | Perplexity
Variational LSTM [8] | 20 M | 78.6
SRU [20] | 24 M | 60.3
Quantized LSTM [49] | – | 89.8
QRNN [3] | 18 M | 78.3
Skip-connection LSTM [29] | 24 M | 58.3
AWD-LSTM [30] | 24 M | 57.3
PRU [27] (with standard dropout [41]) | 19 M | 62.42
AWD-PRU [27] (with weight dropout [30]) | 19 M | 56.56
ERU - Ours (with standard dropout [41]) | 7 M | 73.63
ERU - Ours (with standard dropout [41]) | 15 M | 63.47
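Perplexity is the exponential of the average per-token cross-entropy (in nats). A small worked example of our own, not from the paper:

```python
import math

# Perplexity = exp(mean negative log-likelihood of the target tokens).
def perplexity(token_probs):
    nll = [-math.log(p) for p in token_probs]  # per-token loss in nats
    return math.exp(sum(nll) / len(nll))

# A model assigning uniform probability over a 58-word vocabulary scores
# perplexity 58: it is, on average, as uncertain as choosing among 58
# equally likely words. Lower perplexity means a sharper model.
print(round(perplexity([1 / 58] * 10), 1))
```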
This section elaborates on the various design choices that helped make ESPNetv2 efficient and accurate.
In [28], HFF is introduced to remove gridding artifacts caused by dilated convolutions. Here, we study its influence on object classification. The performance of the ESPNetv2 network with and without HFF is shown in Table 4 (see R1 and R2). HFF improves classification performance by about 1.5% while having no impact on the network's complexity. This suggests that the role of HFF is dual-purpose. First, it removes gridding artifacts caused by dilated convolutions (as noted by [28]). Second, it enables sharing of information between the different branches of the EESP unit (see Figure 1(c)), which allows it to learn rich and strong representations.
To see the influence of the shortcut connection with the input image, we train the ESPNetv2 network with and without this connection. Results are shown in Table 4 (see R2 and R3). Clearly, these connections are effective and efficient, improving performance by about 1% with little (or negligible) impact on the network's complexity.
| Network properties | Learning schedule | Performance |
| HFF | LRSC | Fixed | Cyclic | # Params | FLOPs | Top-1 |
R1 | ✗ | ✗ | ✓ | ✗ | 1.66 M | 84 M | 58.94
R2 | ✓ | ✗ | ✓ | ✗ | 1.66 M | 84 M | 60.07
R3 | ✓ | ✓ | ✓ | ✗ | 1.67 M | 86 M | 61.20
R4 | ✓ | ✓ | ✗ | ✓ | 1.67 M | 86 M | 62.17
R5 | ✓ | ✓ | ✗ | ✓ | 1.67 M | 86 M | 66.10
A comparison between the fixed and cyclic learning schedules is shown in Figure 7(a) and Table 4 (R3 and R4). With the cyclic learning schedule, the ESPNetv2 network achieves about 1% higher top-1 validation accuracy on the ImageNet dataset, suggesting that the cyclic learning schedule helps the network find a better local minimum than the fixed learning schedule. Further, when we trained the ESPNetv2 network for longer (300 epochs) using the learning schedule outlined in Section 4.1, performance improved by about 4% (see R4 and R5 in Table 4 and Figure 7(b)). We observed similar gains when we trained ShuffleNetv1 [51]; top-1 accuracy improved by 3.2% to 65.86 over the fixed learning schedule.^{3}

^{3}We note that this top-1 accuracy is lower than the reported accuracy of 67.2, likely due to different scale augmentation.
We replace the group point-wise convolutions in Figure 1(c) (or Figure 1(b)) with point-wise convolutions for dimensionality reduction. Results are shown in Table 5. With group point-wise convolutions, the ESPNetv2 network is able to achieve similar performance as with point-wise convolutions, but more efficiently.
| Group point-wise | Point-wise |
# Params | FLOPs | Top-1 | # Params | FLOPs | Top-1
1.24 M | 28 M | 57.7 | 1.30 M | 39 M | 57.14
1.67 M | 86 M | 66.1 | 1.92 M | 127 M | 67.14
We introduce a light-weight and power efficient network, ESPNetv2, which better encodes the spatial information in images by learning representations from a large effective receptive field. Our network is a general purpose network with good generalization abilities and can be used across a wide range of tasks, including sequence modeling. Our network delivers state-of-the-art performance across different tasks, such as object classification, semantic segmentation, and language modeling, while being much more power efficient.
Acknowledgement: This research was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Interior/Interior Business Center (DOI/IBC) contract number D17PC00343, NSF III (1703166), an Allen Distinguished Investigator Award, a Samsung GRO award, and gifts from Google, Amazon, and Bloomberg. We also thank Rik Koncel-Kedziorski, David Wadden, Beibin Li, and Anat Caspi for their helpful comments. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing endorsements, either expressed or implied, of IARPA, DOI/IBC, or the U.S. Government.
SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017.
Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In NIPS, 2014.