ReXNet: Diminishing Representational Bottleneck on Convolutional Neural Network

07/02/2020 ∙ by Dongyoon Han, et al.

This paper addresses the representational bottleneck in a network and proposes a set of design principles that significantly improve model performance. We argue that a representational bottleneck may occur in a network designed in a conventional way and degrade model performance. To investigate the representational bottleneck, we study the matrix rank of the features generated by ten thousand random networks. We further study the channel configuration of the entire set of layers towards designing more accurate network architectures. Based on the investigation, we propose simple yet effective design principles to mitigate the representational bottleneck. Slight changes to baseline networks that follow the principles lead to remarkable performance improvements on ImageNet classification. Additionally, COCO object detection results and transfer learning results on several datasets provide further support for the link between diminishing the representational bottleneck of a network and improving performance. Code and pretrained models are available at




Code Repositories


Official Pytorch implementation of ReXNet (Rank eXpansion Network) with pretrained models


1 Introduction

Modeling efficient, so-called lightweight, networks is one of the most important issues in computer vision for both researchers and practitioners. Previously proposed efficient models mobilenetv1 ; mobilenetv2 ; mobilenetv3 ; efficientnet have tried to find a cheap network design (e.g., shrinking channel dimension) by focusing on computational efficiency, showing promising trade-offs between the computational cost and accuracy.

In this paper, we aim to find out what the network design principles followed by the above methods are missing: the representational bottleneck. As a pioneer, Inceptionv3 conceptually introduced the representational bottleneck caused by extreme compression of the channel dimension. The authors regard a feed-forward network as an acyclic graph in which the information flow from the input to the output can be hampered by architectural designs such as extreme compression. In language modeling, the milestone work yang2017breaking first revealed the existence of a representational bottleneck at the softmax layer, called the Softmax bottleneck. The authors show that the bounded matrix rank causes the representational bottleneck and handle this by expanding the rank with additional nonlinearity on the linear softmax. The successors kanai2018sigsoftmax ; ganea2019breaking also observed that the softmax layer’s low rank can cause a representational bottleneck that degrades the overall performance of the model.

Taking a further step from the above pioneering works, we investigate the representational bottleneck across all layers of a network. We first show that there exist layers with limited capability to encode discriminative features, which we regard as the representational bottleneck. We provide a simple theoretical backup using matrix rank analysis of intermediate features. We also conduct empirical studies that investigate the representational bottleneck through randomly generated networks and verify that the matrix rank of weights is directly linked to the model’s performance. Based on this evidence, we propose a set of new design principles to boost the actual performance of the model: 1) enlarge the input channel size (dimension) of a layer; 2) equip the layer with a proper nonlinearity; 3) design a network with many expand layers. We further train a network designed according to these principles on top of an existing network on the ImageNet dataset ImageNet and compute the matrix rank of the layers to provide a practical backup.

Finally, we propose our new models, Rank eXpansion Networks (ReXNets), following the design principles. It turns out that a simple modification of the baseline models yields a remarkable improvement in performance on ImageNet classification. Our models even outperform the state-of-the-art networks whose architectures were found by neural architecture search (NAS), which requires huge computational resources. Thus, this work will encourage researchers in the NAS field to adopt our simple yet effective design principles into their search spaces for further performance boosts. The performance improvement on ImageNet classification transfers well to object detection on the COCO dataset coco2017 and to various fine-grained classification tasks, showing the effectiveness of our model as a strong feature extractor.

Our contributions are: an investigation of the representational bottleneck problem that occurs in a network through mathematical and empirical studies (§2); new design principles with improved network architectures (§3); state-of-the-art results on the ImageNet dataset ImageNet and prominent transfer learning results on COCO detection coco2017 and four different fine-grained classification tasks (§4).

2 On Representational Bottleneck

2.1 Preliminary: Feature encoding

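Given an $L$-depth network, features are encoded from a $d_0$-dimensional input $X_0 \in \mathbb{R}^{d_0 \times N}$. Features are represented as $X_L = \sigma(W_L f_{L-1}(\cdots f_1(W_1 X_0)))$ with the weight matrix $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$. We call the layer with $d_i > d_{i-1}$ an expand layer, and the layer with $d_i < d_{i-1}$ a condense layer. Each $f_i(\cdot)$ denotes the $i$-th point-wise nonlinearity, such as a ReLU with a Batch Normalization (BN) layer BN . $\sigma(\cdot)$ denotes the Softmax function. When training the network, every single forward step encodes an input $X_0$ to the output $X_L$ to minimize the gap between $X_L$ and the label matrix $T \in \mathbb{R}^{d_L \times N}$. Therefore, how effectively the features are encoded towards the label is related to how well the gap is likely to be reduced. The formulation for a CNN is slightly changed to $X_L = \sigma(W_L * f_{L-1}(\cdots f_1(W_1 * X_0)))$, where $*$ and $W_i$ denote the convolution operation and the $i$-th convolutional layer’s weight with kernel size $k_i$, respectively. We rewrite each convolution with the conventional reordering cudnn by $\hat{W}_i \hat{X}_{i-1}$, where $\hat{W}_i \in \mathbb{R}^{d_i \times k_i^2 d_{i-1}}$ and the reordered feature $\hat{X}_{i-1} \in \mathbb{R}^{k_i^2 d_{i-1} \times whN}$. We write the $i$-th feature as

$$X_i = \begin{cases} f_i(\hat{W}_i \hat{X}_{i-1}) & 1 \le i < L, \\ \sigma(\hat{W}_L \hat{X}_{L-1}) & i = L. \end{cases} \tag{1}$$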


2.2 Representational bottleneck and matrix rank

Revisiting Softmax bottleneck. To formalize the representational bottleneck, we revisit the Softmax bottleneck yang2017breaking , a kind of representational bottleneck that occurs at the softmax layer. From eq.(1), the output of the cross-entropy loss is $\log \sigma(W_L X_{L-1})$, whose rank is bounded by the rank of $W_L X_{L-1}$, which is at most $\min(d_L, d_{L-1})$. As the input dimension $d_{L-1}$ is smaller than the output dimension $d_L$, the encoded features cannot fully represent all the categories due to the rank deficiency. This shows an instance of the representational bottleneck at a softmax layer. To resolve the issue, the related works yang2017breaking ; kanai2018sigsoftmax ; ganea2019breaking have shown large performance improvements by mitigating the rank deficiency of the softmax layer through an added nonlinear function. Furthermore, if we increase $d_{L-1}$ closer to $d_L$, does this become another way to diminish the representational bottleneck? We will take a look at this in later sections.
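The rank bound at the classifier can be checked numerically. The following is a minimal sketch, not the authors' code: the sizes (16 penultimate channels, 100 classes, 512 samples) and the Gaussian random weights and features are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: penultimate feature dimension, number of classes, number of samples.
d_in, d_out, n = 16, 100, 512
X = rng.standard_normal((d_in, n))      # penultimate features, one column per sample
W = rng.standard_normal((d_out, d_in))  # classifier weight

logits = W @ X
logit_rank = int(np.linalg.matrix_rank(logits))

# The logit matrix has d_out = 100 rows, yet its rank cannot exceed min(d_out, d_in).
print(logit_rank, min(d_out, d_in))
```

However many samples are stacked, the 100-row logit matrix stays confined to a 16-dimensional subspace, which is the rank deficiency described above.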

Diminishing representational bottleneck by layer-wise rank expansion. Let us consider some popular networks VGG ; resnet ; mobilenetv1 ; mobilenetv2 designed for the ImageNet classification task ImageNet . The networks are designed to have the output channels (before the classifier) up to 1,000 using downsampling blocks by doubling the input channel size, while leaving the other layers with the same output and input channel sizes. We conjecture that the layers that expand the channel size (i.e., expand layers), such as downsampling blocks, would have a rank deficiency and may have the representational bottleneck.

Our goal is to mitigate the representational bottleneck problem in the intermediate layers by expanding the rank of the weight matrix $W_i$. Given the $i$-th feature generated by a layer, $X_i = f_i(W_i X_{i-1}) \in \mathbb{R}^{d_i \times N}$, $\mathrm{rank}(W_i X_{i-1})$ is bounded by $\min(d_i, d_{i-1})$ (we assume $N \gg d_i$). We represent $f_i(X) = X \circ g_i(X)$, where $\circ$ denotes the pointwise multiplication with another pointwise function $g_i$. Following the inequality $\mathrm{rank}(f_i(X)) \le \mathrm{rank}(X) \cdot \mathrm{rank}(g_i(X))$ million2007hadamard , the rank of the feature is bounded as

$$\mathrm{rank}(X_i) \le \mathrm{rank}(W_i X_{i-1}) \cdot \mathrm{rank}(g_i(W_i X_{i-1})). \tag{2}$$

Therefore, we conclude the rank bound can be expanded by increasing $\mathrm{rank}(W_i X_{i-1})$ and replacing $f_i$ with a proper function whose $g_i$ has a larger rank, such as Swish-1 swish or ELU elu , which is similarly done in the work yang2017breaking . When $d_i$ is fixed, if we adjust the feature dimension $d_{i-1}$ close to $d_i$, eq.(2) provides the possibility of the unbounded rank up to the feature dimension. For a bottleneck block resnet ; mobilenetv2 consisting of consecutive 1×1, 3×3, and 1×1 convolutions, we identically expand the rank bound by eq.(2) by considering the input and output channel sizes of a bottleneck block (footnote 1).

Footnote 1: Consider the feature generated by a bottleneck block, which is represented with two weights applied to the reordered input feature. ResNet resnet and MobileNetV2 mobilenetv2 set the inner channel size through an expansion ratio of 0.25 and 6, respectively, which gives a corresponding rank bound for the bottleneck block resnet and for the inverted bottleneck mobilenetv2 . In either case, we can expand the rank bound by increasing the input channel size close to the output channel size.
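The Hadamard-product rank bound and the effect of a nonlinearity can be illustrated with a small numerical sketch. The sizes (8 input channels, 32 output channels, 1,000 samples) and the Gaussian inputs are assumptions for illustration; Swish-1 is written as z · sigmoid(z), so the multiplicative factor g is the sigmoid.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: input channels, output channels, number of samples.
d_in, d_out, n = 8, 32, 1000
W = rng.standard_normal((d_out, d_in))
X = rng.standard_normal((d_in, n))

Z = W @ X          # linear pre-activation; rank is at most min(d_out, d_in)
G = sigmoid(Z)     # g(Z) for Swish-1
F = Z * G          # Swish-1 feature f(Z) = Z * g(Z), a pointwise product

rank_z = int(np.linalg.matrix_rank(Z))
rank_g = int(np.linalg.matrix_rank(G))
rank_f = int(np.linalg.matrix_rank(F))

# rank(F) may exceed rank(Z), but it stays within rank(Z) * rank(G).
print(rank_z, rank_g, rank_f)
```

In this sketch the linear map alone is stuck at the input dimension, while the nonlinear feature can reach a higher rank, up to the product bound and the output channel size.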

Figure 1: Normalized rank of networks. Panels: (a) a single layer; (b) a bottleneck; (c) 5-layer MLPs; (d) 5 bottlenecks. The normalized rank (i.e., rank/output channel size) vs. normalized channel size (i.e., input channel size/output channel size) is computed from the feature of (a) single layer networks and (b) networks with a bottleneck block, respectively, and averaged over 10,000 randomly generated networks. Furthermore, we study the channel size configuration of the entire layers of the networks with (c) 5-layer MLPs and (d) 5-bottleneck blocks. We average the normalized rank of 10,000 randomly generated networks with respect to the number of expand layers.

2.3 Empirical study

In this section, we conduct two empirical studies: a layer-level analysis and a study of the entire layer’s channel configuration using the matrix rank. First, we empirically investigate how the matrix rank of a layer is actually expanded. This study aims to reveal how the input channel size and the following nonlinearity affect the matrix rank, as we have discussed in §2.2. To this end, we design experiments for a single layer and a bottleneck using a large number of networks (>10,000 networks) whose building components (e.g., channel size or nonlinear function) are randomly sampled, and we measure their rank. Second, based on the layer-level study, we investigate the whole channel configuration of a network by measuring the matrix rank and the real performance of the network to find a better network architecture. Using fixed-depth random networks again, we make a connection between the measured ranks and real model performances. This leads us to provide design principles for a network with expanded rank, eventually improving actual performance.

# of exp. layers Channel Conf. (%) Top-1 acc. (%) Norm. Rank Params. (M)
1 32-100-100-100-100 61.90 0.87 0.14
2 32-64-120-120-120 62.08 0.93 0.16
3 32-64-112-112-128 62.10 0.95 0.15
4 32-90-100-110-120 62.15 0.96 0.15
Table 1: Accuracy and the number of expand layers. We train four networks sampled from each configuration of different number of expand layers on CIFAR100 dataset cifar . We average the results over 5 networks due to the random initialization. Norm. Rank denotes the averaged normalized rank of randomly generated networks.

Layer-level rank analysis. To do layer-level rank analysis, we generate a set of random networks consisting of a single layer: $f(WX)$ with $W \in \mathbb{R}^{d_{out} \times d_{in}}$ and $X \in \mathbb{R}^{d_{in} \times N}$, where $d_{out}$ is randomly sampled, and $d_{in}$ is proportionally adjusted. We measure the normalized rank ($\mathrm{rank}(f(WX))/d_{out}$) from the features ($f(WX)$) produced by each network. To investigate $f$, widely-used nonlinear functions (ReLU relu , ReLU6 mobilenetv2 , ELU elu , SoftPlus softplus , LeakyReLU leakyrelu , and Swish-1 swish ) are considered. We repeat the experiment for 10,000 networks for each normalized channel size ($d_{in}/d_{out}$) and for each nonlinearity. A bottleneck block resnet ; mobilenetv2 is similarly studied by generating three consecutive random layers (i.e., by decomposing $W$ into three matrices with arbitrary sizes). The inner expansion ratio of each bottleneck block is randomly set as well for generality. We report the normalized ranks in Figures 1(a) and 1(b), which are averaged over 10,000 networks for a single layer and a bottleneck block, respectively.
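A scaled-down version of the single-layer experiment can be sketched as follows. This is an illustrative approximation rather than the authors' protocol: the Gaussian weights and inputs, the fixed output size of 64, the four channel ratios, and the five trials per setting (instead of 10,000) are all assumptions.

```python
import numpy as np

def identity(z):
    return z

def relu(z):
    return np.maximum(z, 0.0)

def elu(z):
    # ELU with alpha = 1; the minimum avoids overflow in expm1 for large positive z.
    return np.where(z > 0, z, np.expm1(np.minimum(z, 0.0)))

def swish1(z):
    return z / (1.0 + np.exp(-z))

def normalized_rank(f, d_in, d_out, n, rng):
    """rank(f(W X)) / d_out for one random single-layer network."""
    W = rng.standard_normal((d_out, d_in))
    X = rng.standard_normal((d_in, n))
    return np.linalg.matrix_rank(f(W @ X)) / d_out

def sweep(f, ratios, d_out=64, n=256, trials=5, seed=0):
    """Average normalized rank for each normalized channel size d_in / d_out."""
    rng = np.random.default_rng(seed)
    averages = []
    for ratio in ratios:
        d_in = max(1, int(round(ratio * d_out)))
        ranks = [normalized_rank(f, d_in, d_out, n, rng) for _ in range(trials)]
        averages.append(float(np.mean(ranks)))
    return averages

ratios = [0.1, 0.25, 0.5, 1.0]
functions = {"linear": identity, "relu": relu, "elu": elu, "swish1": swish1}
results = {name: sweep(f, ratios) for name, f in functions.items()}
for name, values in results.items():
    print(name, values)
```

In the linear case the normalized rank simply tracks the channel ratio; comparing the other rows against it shows how much each nonlinearity lifts the rank in this toy setting.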

Channel configuration study. We now consider how to design a network by assigning the channel sizes of all its layers. We randomly generate $L$-depth networks with expand layers (i.e., $d_i > d_{i-1}$) and layers with $d_i = d_{i-1}$, following the design trend of using few condense layers, because a condense layer directly reduces the model capacity Inceptionv3 . We vary the number of expand layers and generate networks randomly. For example, in a network with the fewest expand layers, all the layers have the same channel size (except for the channel size of the stem layer). We repeat the experiments with 10,000 randomly generated networks for each setting and average the normalized rank. The results are shown in Figures 1(c) and 1(d). Additionally, we report the actual performance of the sampled networks that have 5 bottlenecks with the stem channel size of 32 for each configuration with a different number of expand layers. We train the networks on the CIFAR100 dataset cifar and report the accuracy averaged over 5 networks (due to the random initialization of weights) in Table 1.
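One way to draw random channel configurations with a prescribed number of expand layers is sketched below. The sampler is hypothetical: the function name, the stem size of 32, the cap of 128 channels, and the uniform sampling of widths are assumptions, and the paper does not specify how its random configurations were drawn.

```python
import random

def sample_channel_config(depth, num_expand, stem=32, max_channels=128, seed=None):
    """Return `depth` channel sizes starting at `stem` with exactly `num_expand` increases.

    Layers that do not expand keep the previous channel size, so no condense
    layers are produced.
    """
    if not 0 <= num_expand <= depth - 1:
        raise ValueError("num_expand must be between 0 and depth - 1")
    rng = random.Random(seed)
    # Positions (after the stem) at which the channel size grows.
    expand_at = set(rng.sample(range(1, depth), num_expand))
    # Distinct, increasing widths to assign at those positions.
    widths = iter(sorted(rng.sample(range(stem + 1, max_channels + 1), num_expand)))
    config = [stem]
    for i in range(1, depth):
        config.append(next(widths) if i in expand_at else config[-1])
    return config

def count_expand_layers(config):
    return sum(1 for prev, cur in zip(config, config[1:]) if cur > prev)

for k in range(5):
    cfg = sample_channel_config(depth=5, num_expand=k, seed=k)
    print(k, cfg, count_expand_layers(cfg))
```

Each sampled configuration could then be fed to a rank measurement or a training run to compare settings with different numbers of expand layers.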

Observations. First, from Figures 1(a) and 1(b), we observe that properly selected nonlinear functions can largely expand the rank compared to the linear case. Second, the normalized input channel size ($d_{in}/d_{out}$) is closely related to the rank of the feature for both the single layer (Figure 1(a)) and bottleneck block (Figure 1(b)) cases. For the entire layer’s channel configuration, Figures 1(c) and 1(d) show that the rank can be expanded using more expand layers when the network depth is fixed. Furthermore, this rank trend matches the actual performance well, as shown in Table 1. The observations give the design principles that expand the rank of a given network: 1) expand the input channel size at a layer; 2) find a proper nonlinearity; 3) design the network with many expand layers.

3 Improved Network Architecture

3.1 Where does representational bottleneck occur?

We now consider at which layers the representational bottleneck may occur in a network. All popular deep networks have a similar architecture with many expand layers that expand channels from the input image channels to the output prediction channels. First, downsampling blocks resnet ; mobilenetv2 or layers VGG act like expand layers. Second, the first layer in a bottleneck module resnet ; preresnet ; resnext and in inverted bottleneck blocks mobilenetv2 ; mobilenetv3 ; mnasnet is an expand layer as well. Finally, there exists the penultimate layer, which largely expands the output channel size. We claim that the representational bottleneck would happen at these expand layers and the penultimate layer.

3.2 Network Redesign

Intermediate convolutional layers. We first consider MobileNetV1 mobilenetv1 . We sequentially apply the same modifications to the convolutions, starting from those closer to the penultimate layer. We refine each layer by 1) expanding the input channel size of the convolution layer and 2) replacing the ReLU6s. Second, we renovate MobileNetV2 mobilenetv2 in the same way as MobileNetV1. All the inverted bottlenecks, from the last to the first, are sequentially modified by the same principle. How much to expand the input channel size is an open question and could be handled by a NAS method, but for simplicity, we provide instance models that follow our design principles in the supplementary material. Note that we can also renovate popular networks such as ResNet resnet or VGG VGG . In ResNet and its variants resnet ; preresnet ; resnext , there is no nonlinearity after the third convolutional layer in each bottleneck block, so expanding the input channel size is the only remedy. We show how expanding the input channel size on ResNet and further on VGG can improve the performance in §5.2.

The penultimate layer. The network architectures resnet ; preresnet ; densenet ; mobilenetv1 ; mobilenetv2 ; mobilenetv3 ; mnasnet have the convolutional layer with a relatively large output channel size at the penultimate layer. This was to prevent the representational bottleneck at the final classifier, but the penultimate layer still suffers from the problem. We expand the input channel size of the penultimate layer and replace the ReLU6.

ReXNets. We now introduce our models, called Rank eXpansion Networks (ReXNets), which follow the design principles inspired by our investigation. We call them ReXNet-plain and ReXNet, which are renovated upon MobileNetV1 mobilenetv1 and MobileNetV2 mobilenetv2 , respectively. Note that our models are instances that show how diminishing the representational bottleneck affects the overall performance, which will be shown in the experiment section. Our channel configuration was chosen roughly to match the overall parameters and FLOPs of the baselines for a fair comparison, so a better network architecture could be found by proper parameter searching methods such as NAS methods. The detailed model information is available in the supplementary material.

Network Top-1 (%) Top-5 (%) FLOPs Params.
MobileNetV1 mobilenetv1 70.6 89.5 0.575B 4.2M
MobileNetV2 mobilenetv2 72.0 91.0 0.300B 3.5M
CondenseNet condensenet 73.8 91.7 0.529B 4.8M
ShuffleNetV1 (x2) shufflenetv1 73.7 - 0.524B -
ShuffleNetV2 (x2) shufflenetv2 75.4 - 0.597B -
Pelee pelee 72.6 90.6 0.508B 2.8M
NASNet-A nasneta 74.0 91.7 0.564B 5.3M
AmoebaNet-A amoebanet 75.5 92.0 0.555B 5.1M
PNASNet pnasnet 74.2 91.9 0.588B 5.1M
DARTS darts 73.1 91.0 0.595B 4.9M
FBNet-C fbnet 74.9 - 0.375B 5.5M
ProxylessNas proxylessnas 74.6 93.3 0.320B 4.1M
RandWire-WS randomwire 74.7 92.2 0.583B 5.6M
MnasNet-A3 mnasnet 76.7 93.3 0.403B 5.2M
MobileNetV3-Large mobilenetv3 75.2 - 0.217B 5.4M
FBNetV2-L1 fbnetv2 77.2 - 0.325B -
EfficientNet-B0 efficientnet 77.3 93.5 0.399B 5.3M
ReXNet-1.0x 77.9 93.9 0.398B 4.8M
(a) Comparison of ImageNet top-1 accuracy. All the accuracies are borrowed from the original papers.
Network Top-1 (%) Top-5 (%) FLOPs Params.
ReXNet-plain 74.8 91.93 0.564B 3.41M
ReXNet-0.9x 77.2 93.5 0.347B 4.1M
ReXNet-1.0x 77.9 93.9 0.398B 4.8M
EfficientNetB0 efficientnet 77.3 93.5 0.40B 5.3M
ReXNet-1.1x 78.6 94.1 0.480B 5.6M
ReXNet-1.2x 79.0 94.3 0.567B 6.6M
ReXNet-1.3x 79.5 94.7 0.662B 7.6M
EfficientNetB1 efficientnet 79.2 94.5 0.70B 7.8M
ReXNet-1.4x 79.8 94.9 0.762B 8.6M
ReXNet-1.5x 80.3 95.2 0.875B 9.7M
EfficientNetB2 efficientnet 80.3 95.0 1.0B 9.2M
ReXNet-2.0x 81.6 95.7 1.53B 16M
ReXNet-2.2x 81.7 95.8 1.84B 19M
EfficientNetB3 efficientnet 81.7 95.6 1.8B 12M
(b) ReXNets and EfficientNets. Our models are compared with EfficientNets efficientnet on ImageNet.
Table 2: Model comparison with ReXNets on ImageNet. (a) We compare the classification results of ReXNet (1.0x) with popular lightweight models, including the models searched by NAS methods, on the ImageNet dataset ImageNet . (b) We report ReXNets’ performances with different width multipliers and compare them with those of EfficientNets efficientnet . Note that ReXNets are trained and evaluated with the fixed image size 224×224.

4 Experiment

4.1 ImageNet Classification

Training setup. We train our model on the ImageNet dataset ImageNet with the fixed image size 224×224. We use the standard data augmentation GoogleNet with the random-crop rate from 0.08 to 1.0. Our models are trained using stochastic gradient descent (SGD) with Nesterov momentum of 0.9 and a mini-batch size of 512 on 4 GPUs. The learning rate is initially set to 0.5 and is linearly warmed up in the first 5 epochs following the method of goyal2017accurate , then decayed by cosine learning rate scheduling. Weight decay is set to 1e-5. We verify the correctness of our training setup by training MobileNetV1 and MobileNetV2. We achieve 72.5% and 73.1%, which outperform the reported scores of 70.6% and 72.0%, respectively. See supplementary material for the detailed training setup.
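The warmup-then-cosine schedule described in the training setup can be sketched as a function of the epoch index. This is a sketch under assumptions: the total number of epochs is not stated in this section, so `total_epochs` is a hypothetical parameter, and the per-epoch granularity and the decay to zero are choices made here for simplicity.

```python
import math

def learning_rate(epoch, total_epochs, base_lr=0.5, warmup_epochs=5):
    """Linear warmup to `base_lr` over `warmup_epochs`, then cosine decay.

    `epoch` is zero-based. `total_epochs` is a hypothetical parameter; the
    values 0.5 and 5 follow the training setup in the text.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

schedule = [learning_rate(e, total_epochs=100) for e in range(100)]
print(schedule[:6], schedule[-1])
```

In practice the warmup is often applied per iteration rather than per epoch; the per-epoch form is used here only to keep the sketch short.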

Our trained models. Here, we show our models: ReXNet-plain and ReXNet. We first train our models following the training setup on ImageNet from scratch using fixed image size. Furthermore, to show our models’ scalability, the simple width multiplier concept in the previous works mobilenetv1 ; mobilenetv2 ; shufflenetv1 ; shufflenetv2 ; mnasnet ; mobilenetv3 is adopted to adjust the model size. As shown in Table 2(b), our models are well scaled up to 2.2x from 0.9x with remarkable performances just using width multiplier.

Performance comparison. Table 2(a) shows the performance comparison with popular lightweight models. Note that all the models are trained and evaluated with 224×224 resolution images. Our models show significant performance improvements over the baselines, so they can be compared with the models searched by NAS methods. Interestingly, our models outperform EfficientNet-B0 and B1 efficientnet , which are searched by NAS, with comparable model size and FLOPs.

4.2 COCO Object Detection

Architecture details. We choose SSDLite mobilenetv2 , which is a lightweight detector that is suitable for viewing the feature extractor’s capability. We attach the first head of SSDLite to the last feature extractor layer that has an output stride of 16 and the second head to the last feature extractor layer that has an output stride of 32, following mobilenetv2 ; mobilenetv3 ; mnasnet . This is to use the same size of the extracted features for fairness, because the detection performance is sensitive to the features’ resolution.

Performance comparison. We compare our models with popular lightweight detectors including SSD ssd , SSDLites mobilenetv1 ; mobilenetv2 ; mobilenetv3 ; mnasnet , YOLOv2 yolov2 , YOLOv3 yolov3 , Pelee pelee , and Tiny-DSOD tdsod . Additionally, we report the detection performances of EfficientNet-B0, B1, and B2 as backbones in SSDLite. As shown in Table 3, ours largely outperform the other detectors with similar model sizes and FLOPs. Interestingly, compared with the other models using SSDLite, ours achieve much better performance. It is worth noting that ours outperform EfficientNet-B1 and B2 with SSDLite, whose backbones are pretrained with larger image sizes (>224). We believe this reflects that diminishing the representational bottleneck helps a finetuning task as well.

Training setup. Our models are trained using stochastic gradient descent (SGD) with 1 GPU. We use the same setting as the previous works mobilenetv2 ; mobilenetv3 ; mnasnet , including the input size of 320×320 and data augmentations. The learning rate is initially set to 0.05, and weight decay is set to 4e-5. Following the standard setting ssd ; mobilenetv1 ; mobilenetv2 ; mobilenetv3 ; mnasnet , we train the models on train 2017 and evaluate on the test-dev 2017 set at the COCO test server. All the models except for Tiny-DSOD are finetuned using their own pretrained backbone.

Model Input Size AP@0.5:0.95 (%) AP@0.5 (%) AP@0.75 (%) Params. FLOPs
Pelee pelee 304x304 22.4 38.3 22.9 6.0M 1.29B
Tiny-DSOD tdsod 300×300 23.2 40.4 22.8 1.2M 1.12B
MobileNetV1 mobilenetv1 + SSDLite 320x320 22.2 - - 5.1M 1.31B
MobileNetV2 mobilenetv2 + SSDLite 320x320 22.1 - - 4.3M 0.79B
MobileNetV3 mobilenetv3 + SSDLite 320x320 22.0 - - 5.0M 0.62B
MnasNet-A1 mnasnet + SSDLite 320x320 23.0 - - 4.9M 0.84B
EfficientNetB0 efficientnet + SSDLite 320x320 23.5 39.9 23.5 6.2M 0.97B
ReXNet-0.9x + SSDLite 320x320 24.4 41.1 24.7 5.0M 0.88B
ReXNet-1.0x + SSDLite 320x320 24.8 41.8 25.0 5.7M 1.01B
YOLOv3-tiny yolov3 416x416 - 33.1 - 12.3M 5.56B
SSD ssd 300×300 23.2 41.2 23.4 36.1M 35.2B
SSD ssd 512x512 26.8 46.5 27.8 36.1M 99.5B
YOLOv2 yolov2 416x416 21.6 44.0 19.2 50.7M 17.5B
EfficientNetB1 efficientnet + SSDLite 320x320 25.7 43.0 26.1 8.7M 1.35B
EfficientNetB2 efficientnet + SSDLite 320x320 26.0 43.2 26.4 10.0M 1.55B
ReXNet-1.3x + SSDLite 320x320 26.5 44.0 26.9 8.4M 1.60B
Table 3: Object detection results on COCO test-dev 2017. We report ReXNets in SSDLite to compare with both lightweight (FLOPs ≤ 1.0B) and heavier models (FLOPs > 1.0B). We choose ReXNet-0.9x, 1.0x, and 1.3x as the feature extractors to compare with the lightweight detectors. Some of the reported models are trained by ourselves.

4.3 Transfer learning with ReXNets

Training setup and performance comparison. We finetune our models on several datasets including Food-101 food101 , Stanford Cars stanford_cars , FGVC Aircraft fgvc_aircraft , and Oxford Flowers-102 flower102 . We compare our models with the best performing models ResNet50 resnet and EfficientNet-B0 efficientnet . For a fair comparison, we exhaustively search the hyper-parameters, including learning rate and weight decay, for the best results for all the models, as in kornblith2019do_imagenet . We do not use additional techniques beyond training all the layers using SGD with Nesterov momentum. For all the datasets, training and evaluation are done with 224×224 image size, and for evaluation we use center-cropped images of the same size after resizing images to a shorter side of 256. Note that we do not use larger image sizes as in the work efficientnet . As shown in Table 4, ours outperform EfficientNet-B0 on all the datasets by large margins. Ours beat ResNet50, which has more than five times as many parameters, on all the datasets except for the Stanford Cars dataset. This indicates that our models have fewer parameters yet serve as strong feature extractors for transfer learning compared with other models.
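The evaluation preprocessing (resize the shorter side to 256, then center-crop) amounts to a small piece of geometry, sketched below without any image library. The function name is hypothetical, and the default crop size of 224 is an assumption based on the standard ImageNet protocol.

```python
def resize_then_center_crop_box(width, height, shorter_side=256, crop=224):
    """Return the resized (width, height) and the center-crop box in resized coordinates.

    The box is (left, top, right, bottom). `crop=224` is an assumed default.
    """
    scale = shorter_side / min(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    left = (new_w - crop) // 2
    top = (new_h - crop) // 2
    return (new_w, new_h), (left, top, left + crop, top + crop)

size, box = resize_then_center_crop_box(500, 375)
print(size, box)
```

For a 500×375 input, the shorter side (375) is scaled to 256, and the crop window is centered in the resized image.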

Dataset Network Top-1 acc. (%) FLOPs Params.
Food-101 food101 ResNet50 resnet 87.03 4.1B 25.6M
EfficientNet-B0 efficientnet 87.47 0.4B 5.3M
ReXNet-1.0x 88.41 0.4B 4.8M
Stanford Cars stanford_cars ResNet50 resnet 92.58 4.1B 25.6M
EfficientNet-B0 efficientnet 90.66 0.4B 5.3M
ReXNet-1.0x 91.45 0.4B 4.8M
FGVC Aircraft fgvc_aircraft ResNet50 resnet 89.42 4.1B 25.6M
EfficientNet-B0 efficientnet 87.06 0.4B 5.3M
ReXNet-1.0x 89.52 0.4B 4.8M
Oxford Flowers-102 flower102 ResNet50 resnet 97.72 4.1B 25.6M
EfficientNet-B0 efficientnet 97.33 0.4B 5.3M
ReXNet-1.0x 97.82 0.4B 4.8M
Table 4: Transfer learning results on four datasets. We report transfer learning results on four fine-grained datasets: Food-101, Stanford Cars, FGVC Aircraft, and Oxford Flowers-102. All the Top-1 scores of the models are obtained by training and testing with 224×224 image size.

5 Ablation Study and Discussion

5.1 Ablation studies

Impact of replacing nonlinearities. We study the impact of replacing 1) the first nonlinearity in each inverted bottleneck, and 2) the last nonlinearity at the penultimate layer. Since both of them come after expand layers, we expected the performance to improve when they are replaced. As shown in Table 5(a), the first nonlinearity affects the performance more than the second one does. Table 5(c) shows that both of the nonlinearities affect the performance when they are replaced. In MobileNetV1, Table 5(d) shows a similar trend, but the second nonlinearity also has a small effect. We hypothesize this is because MobileNetV1 needs additional model capacity.

Impact of expanding channel size. We study how expanding the channel size of the input feature works together with replacing the nonlinearity. As shown in Table 5(c), it works well together with replacing the nonlinearities. The MobileNetV1 results in Table 5(d) show a similar trend as well.

1st act. 2nd act. Top-1 Top-5
- - 73.08 91.28
- 73.34 91.33
73.67 91.56
- 73.59 91.59
(a) On nonlinearities in MobileNetV2
1st act. Pen. act. Top-1 Top-5
- - 73.08 91.28
- 73.59 91.59
- 73.49 91.41
73.95 91.46
(b) On nonlinearities in MobileNetV2
Exp. 1st act. Pen. act. Top-1 Top-5
- - - 73.08 91.28
- - 75.45 92.60
- 75.65 92.83
75.86 92.86
(c) On rank expansion and nonlinearities in MobileNetV2
Exp. 1st act. 2nd act. Top-1 Top-5
- - - 72.56 90.67
- - 73.64 91.41
- 74.11 91.73
74.21 91.76
(d) On rank expansion and nonlinearities in MobileNetV1
Table 5: Ablation study of rank expansion and nonlinearity. From (a) to (c), “1st act.” and “2nd act.” denote the first and the second ReLU6 in each bottleneck block, respectively, and “Pen. act.” denotes the ReLU6 that follows the penultimate layer in MobileNetV2. For (d), “1st act.” and “2nd act.” denote the activation after convolution and depthwise convolution in MobileNetV1. “Exp.” denotes that the model consists of expand layers obtained by increasing the input channel size of each layer.

5.2 Discussions

ReX-ResNet and ReX-VGG. We apply our principles to ResNet and VGG. We choose ResNet50 resnet and VGG16-BN VGG . We observe accuracy improvements on ResNet50 (77.1% (ours) vs. 76.3%) and on VGG16-BN (71.8% (ours) vs. 71.6%) on ImageNet, with similar computational costs.

Verifying representational bottleneck in pretrained models. We now make a final backup by measuring the matrix rank of the output of each layer to reveal the representational bottleneck. Specifically, we use two ImageNet-trained models (MobileNetV2 and a renovated MobileNetV2 that follows our design principles) to visualize the cumulative distribution of the singular values computed with each feature set. Using 2,000 randomly sampled images from the ImageNet validation set, we compute the singular values from the extracted features 1) after the nonlinearity in every inverted bottleneck and 2) after the nonlinearity at the penultimate layer. We first normalize all singular values to [0, 1] to handle the different scales of singular values from different layers, and then plot the cumulative percentage of normalized singular values for each layer. As shown in Figure 2, many singular values from the layers are extremely low for MobileNetV2 compared with those of ours. This indicates that our model has successfully overcome the representational bottleneck at these layers.
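The singular-value analysis can be sketched on synthetic features. This is not the authors' evaluation code: the feature matrices below are random stand-ins for extracted activations, the 32-channel width is arbitrary, and normalizing by the largest singular value is one assumed way to map the spectrum to [0, 1].

```python
import numpy as np

def normalized_singular_values(features):
    """Singular values of a (channels x samples) matrix, scaled so the largest is 1."""
    s = np.linalg.svd(features, compute_uv=False)
    return s / s.max()

def cumulative_percentage(norm_sv, thresholds):
    """Percentage of normalized singular values that are <= each threshold."""
    return np.array([100.0 * np.mean(norm_sv <= t) for t in thresholds])

rng = np.random.default_rng(0)
channels, samples = 32, 2000   # 2,000 samples as in the text; 32 channels is arbitrary

# Stand-ins for layer outputs: one with a well-spread spectrum, one of rank 4.
full = rng.standard_normal((channels, samples))
low = rng.standard_normal((channels, 4)) @ rng.standard_normal((4, samples))

thresholds = [1e-6, 0.5, 1.0]
cdf_full = cumulative_percentage(normalized_singular_values(full), thresholds)
cdf_low = cumulative_percentage(normalized_singular_values(low), thresholds)
print(cdf_full, cdf_low)
```

A layer whose cumulative curve rises sharply near zero, like the rank-4 stand-in, has most of its singular values close to zero, which is the signature the text attributes to a representational bottleneck.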

     (a) MobileNetV2          (b) Renovated MobileNetV2 (ours)
Figure 2: Visualization of singular values. We compute the cumulative sum of the singular values for all the expand layers in MobileNetV2 and ours trained on ImageNet.
Backbone Detector ImageNet Top-1 Acc. (%) COCO AP Params. FLOPs
MobileNetV1 mobilenetv1 SSDLite 70.6 22.2 5.1M 1.31B
MobileNetV2 mobilenetv2 SSDLite 72.0 22.1 4.3M 0.79B
MobileNetV3-Large mobilenetv3 SSDLite 75.2 22.0 5.0M 0.62B
MnasNet-A1 mnasnet SSDLite 75.2 23.0 4.9M 0.84B
EfficientNetB0 efficientnet SSDLite 77.3 23.5 6.2M 0.97B
ReXNet-0.9x SSDLite 77.2 24.4 5.0M 0.88B
Table 6: Correlation between backbone and finetuning performance. We study the correlation between the top-1 classification accuracy of backbones on ImageNet (ImageNet Top-1 Acc.) and the corresponding average precision on COCO (COCO AP). We observe a better backbone in respect to the ImageNet performance does not always link to the detection performance. However, ReXNet’s detection performance has improved to match the performance improvement of the backbone without excessive computational costs.

Representational bottleneck and finetuning. We argue that increasing the classification accuracy may not translate into a finetuning performance improvement. As shown in Table 6, MnasNet and MobileNetV3-Large are the first instance: they have similar ImageNet accuracy, but their COCO APs differ. Second, when comparing MobileNetV1 and MobileNetV2 with MobileNetV3-Large, there is a large gap in ImageNet accuracy, but not much in COCO AP. Also, EfficientNetB0 shows higher classification accuracy than ReXNet-0.9x by about 0.1%, but an inferior COCO AP by about 1.0%. From this result, we believe that a backbone with a diminished representational bottleneck is likely to have better encoding capacity, leading to better performance on a finetuning task.
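One simple way to summarize the relationship in Table 6 is a Pearson correlation between ImageNet top-1 accuracy and COCO AP. The sketch below copies the six rows from Table 6; the choice of Pearson correlation is an assumption of this example rather than an analysis from the paper, and with so few backbones it is descriptive only.

```python
import math

# (ImageNet top-1 %, COCO AP %) pairs copied from Table 6.
table6 = {
    "MobileNetV1": (70.6, 22.2),
    "MobileNetV2": (72.0, 22.1),
    "MobileNetV3-Large": (75.2, 22.0),
    "MnasNet-A1": (75.2, 23.0),
    "EfficientNetB0": (77.3, 23.5),
    "ReXNet-0.9x": (77.2, 24.4),
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation(rows):
    top1 = [v[0] for v in rows.values()]
    ap = [v[1] for v in rows.values()]
    return pearson(top1, ap)

r_all = correlation(table6)
r_baselines = correlation({k: v for k, v in table6.items() if not k.startswith("ReXNet")})
print(round(r_all, 3), round(r_baselines, 3))
```

A coefficient well below 1 is consistent with the observation that ImageNet accuracy alone does not determine detection performance, though six data points cannot support a stronger claim.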

ImageNet accuracy with different nonlinear functions. We further train ReXNet-1.0x with ELU elu , SoftPlus softplus , LeakyReLU leakyrelu , and ReLU6 mobilenetv2 to compare with the model with Swish-1 swish . This is to see the actual quality of the nonlinearities alongside the study in Figure 1. The resulting top-1 accuracies rank in the order of Swish-1 (77.90%), ELU (77.64%), SoftPlus (77.60%), LeakyReLU (77.44%), and ReLU6 (77.26%) (see supplementary material).

6 Conclusion

In this work, we have addressed the representational bottleneck in CNN layers. Motivated by the representational bottleneck in language modeling, we hypothesized a similar representational bottleneck in the layers of a CNN. We further argued that the matrix rank is closely related to the bottleneck problem, and that model performance improves when the bottleneck is diminished. Our experimental study showed that expand layers are likely to suffer from the representational bottleneck, so we proposed a set of design principles to handle the problem. The models redesigned by following these principles outperformed recent competitive models, including NAS-based models, on the ImageNet dataset. Furthermore, our models showed remarkable finetuning performance on COCO object detection and on several fine-grained datasets for transfer learning. Consequently, we believe that our work highlights a new perspective on designing networks for many tasks.


We would like to thank Clova AI Research team members including Junsuk Choe, Seong Joon Oh, Sanghyuk Chun for fruitful discussions and internal reviews. In particular, we would like to thank Jung-Woo Ha who suggested the name of our network architecture. Naver Smart Machine Learning (NSML) platform nsml has been used in the experiments.



Appendix A Overview

This document presents further details and the additional experimental results of our proposed Rank eXpansion Networks (ReXNets). First, we show the validity of our training setup on ImageNet classification. It turns out that our training setup even shows better performance of MobileNetV1 [13] and MobileNetV2 [41] than those reported in the original papers (§B). Second, we provide the specifications of ReXNets, which are simple instance models according to our proposed design principles, yet they show prominent performances over diverse tasks as shown in the main paper (§C). Third, we provide extra experimental results including 1) model capacity comparison with EfficientNets [48] by training models from scratch on COCO dataset [25], 2) ReXNet with different nonlinear functions to justify choosing Swish-1 [36], and 3) model comparison with popular heavy models to show our models’ scalability (§D).

Appendix B ImageNet Classification Training Details

In this section, we first verify our training setup on ImageNet dataset [40] by comparing the scores between the officially reported ones and ours. Then, we provide further training details for ReXNets.

B.1 Training setup verification

We first verify the training setup used in the paper by training MobileNetV2 [41] on ImageNet. Because MobileNets [13, 41, 12] are challenging to reproduce with a few GPUs, it is crucial to show that our training setup can reach the reported performance under a different environment. (The original papers used 16 GPUs [41] or 4x4 TPU pods [12] for ImageNet training; we train all the models using 4 GPUs, either V100 or P40.) We train the network architectures officially released by the authors and report the accuracies. As shown in Table 7, our models seem to be trained well and even outperform the scores reported in the original papers [13, 41].

Network Top-1 (%) Top-5 (%) MAdds Params.
MobileNetV1 [13] (paper) 70.6 89.5 0.575B 4.2M
MobileNetV1 (ours) 72.5 90.7 0.575B 4.2M
MobileNetV2 [41] (paper) 72.0 91.0 0.300B 3.5M
MobileNetV2 (ours) 73.1 91.3 0.300B 3.5M
Table 7: Training results of MobileNets. MobileNets (ours) denote models trained with our training setup on the ImageNet dataset [40]; their architectures are identical to the original ones [13, 41].

B.2 Further training details

Our models are trained using label smoothing [46] with an alpha of 0.1 and a dropout [43] rate of 0.2 on the last layer. As done in training MobileNetV3 [12] and EfficientNet [48], we train our models with a stochastic depth [17] rate of 0.2, RandAugment [5] with a magnitude of 9, and random erasing [11] with a probability of 0.2.

Note that we do not use FixResNet [49]-like techniques that need an additional finetuning procedure after training. We also do not use the exponential moving average (EMA) used in training MobileNetV3 and EfficientNet. Training with these techniques may further improve the accuracy, so we leave training our models with them as future work.
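The training recipe above can be collected into a small configuration, together with a worked example of label smoothing. This is a minimal sketch: the dictionary keys and the function `smooth_labels` are hypothetical names, and the smoothing rule (mixing the one-hot target with a uniform distribution over classes) is the common formulation, assumed here rather than confirmed by the text.

```python
# Values are taken from the text; the key names are illustrative only.
TRAIN_CONFIG = {
    "label_smoothing_alpha": 0.1,
    "dropout_rate": 0.2,           # applied on the last layer
    "stochastic_depth_rate": 0.2,
    "randaugment_magnitude": 9,
    "random_erasing_prob": 0.2,
    "use_ema": False,              # EMA is explicitly not used
}


def smooth_labels(target, num_classes, alpha):
    """Mix a one-hot target with the uniform distribution over classes.

    Assumed formulation: (1 - alpha) * one_hot + alpha / num_classes.
    """
    uniform = alpha / num_classes
    return [
        (1.0 - alpha) + uniform if k == target else uniform
        for k in range(num_classes)
    ]


# ImageNet has 1000 classes; class index 3 is an arbitrary example target.
soft = smooth_labels(
    target=3,
    num_classes=1000,
    alpha=TRAIN_CONFIG["label_smoothing_alpha"],
)
print(soft[3], soft[0], sum(soft))
```

With alpha set to 0.1 and 1000 classes, the true class keeps roughly 0.9 of the probability mass and the rest is spread evenly over all classes.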

Input Operator # of channels SE Nonlinearity Stride
conv 32 - SW 2
bottleneck1 16 - SW/RE6 1
bottleneck6 27 - SW/RE6 2
bottleneck6 38 - SW/RE6 1
bottleneck6 50 ✓ SW/RE6 2
bottleneck6 61 ✓ SW/RE6 1
bottleneck6 72 ✓ SW/RE6 2
bottleneck6 84 ✓ SW/RE6 1
bottleneck6 95 ✓ SW/RE6 1
bottleneck6 106 ✓ SW/RE6 1
bottleneck6 117 ✓ SW/RE6 1
bottleneck6 128 ✓ SW/RE6 1
bottleneck6 140 ✓ SW/RE6 2
bottleneck6 151 ✓ SW/RE6 1
bottleneck6 162 ✓ SW/RE6 1
bottleneck6 174 ✓ SW/RE6 1
bottleneck6 185 ✓ SW/RE6 1
conv , pool 1280 - SW 1
fc 1000 - - 1
Table 8: Specification of ReXNet-1.0x. Bottleneck1 and bottleneck6 denote the inverted bottleneck with the expansion ratio of 1 and 6, respectively. In each block, SE denotes whether a Squeeze-and-Excitation module (SE-module) [14] is used. SW denotes that Swish-1 [36] is used after the convolution, and SW/RE6 denotes that Swish-1 and ReLU6 are used after the first convolution and the depthwise convolution [13], respectively.
Input Operator # of channels Nonlinearity Stride
conv 32 SW 2
dwconv / conv 96 RE/SW 2
dwconv / conv 144 RE/SW 1
dwconv / conv 192 RE/SW 2
dwconv / conv 240 RE/SW 1
dwconv / conv 288 RE/SW 2
dwconv / conv 336 RE/SW 1
dwconv / conv 384 RE/SW 1
dwconv / conv 432 RE/SW 1
dwconv / conv 480 RE/SW 1
dwconv / conv 528 RE/SW 1
dwconv / conv 576 RE/SW 2
dwconv / conv 624 RE/SW 1
dwconv / conv 1024 RE/SW 1
pool 1024 - 1
fc 1000 - 1
Table 9: Specification of ReXNet_plain. SW denotes Swish-1 is used after the convolution, and RE/SW denotes ReLU and Swish are used after the first depthwise convolution and the following convolution, respectively.

Appendix C Model Specifications of ReXNets

In this section, the detailed descriptions of ReXNets are presented. These models are simple instances of following our design principles of 1) expanding the input channel size, 2) replacing the nonlinearity of the expand layers, and 3) increasing the number of expand layers.

C.1 ReXNet

We make only a few changes to the layer configuration of MobileNetV2 [41]. Specifically, we do not change the channel sizes of the stem (i.e., the first convolution) and the penultimate layer (i.e., the last convolution). We also keep the original expansion ratio setting (each inverted bottleneck block has a ratio of 6, except for the first inverted bottleneck block, which has a ratio of 1).

In MobileNetV2 [41], the channel sizes of the stem and the inverted bottleneck blocks are 32, 16, 24, 24, 32, 32, 32, 64, 64, 64, 64, 96, 96, 96, 160, 160, and 320, respectively. With the identical channel sizes of the stem (32) and the penultimate layer (1280), ReXNet has the following channel configuration: 32, 17, 27, 38, 50, 61, 72, 84, 95, 106, 117, 128, 140, 151, 162, 174, and 185, obtained by expanding the input channel sizes and increasing the number of expand layers. We replace the ReLU6 at the expand layers in each inverted bottleneck block and the ReLU6 after the penultimate layer with Swish-1 [36]. Because of latency concerns, we discard SE-modules [14, 12] in the inverted bottleneck blocks from the first block up to the blocks at stride 4. A width multiplier is applied to all the channel sizes to scale the model. The specification of ReXNet is shown in Table 8.
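The bottleneck channel column of Table 8 grows almost linearly, which can be sketched as a simple generator. This is an illustrative sketch, not the authors' implementation: the function name `rexnet_channels` and the constants `first=16` and `step=11.25` are assumptions fitted to the table rather than values stated in the text.

```python
def rexnet_channels(first=16, step=11.25, num_blocks=16, width_mult=1.0):
    """Linearly growing output channels for the inverted bottleneck blocks.

    `first` and `step` are assumed constants chosen to match Table 8.
    Python's built-in round() rounds exact halves to the nearest even
    integer (e.g. 38.5 -> 38, 83.5 -> 84), which the match relies on.
    """
    return [round((first + step * i) * width_mult) for i in range(num_blocks)]


# Bottleneck output channels listed in Table 8 (ReXNet-1.0x).
TABLE_8 = [16, 27, 38, 50, 61, 72, 84, 95, 106, 117, 128, 140, 151, 162, 174, 185]

channels = rexnet_channels()
print(channels == TABLE_8)

# Every block widens its input, so each one contains an expand layer.
strictly_increasing = all(b > a for a, b in zip(channels, channels[1:]))
print(strictly_increasing)
```

Passing a different `width_mult` scales every channel size, in the spirit of the width multiplier described above; the exact rounding used for other multipliers is not specified in the text.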

C.2 ReXNet_plain

A plain network such as MobileNetV1 [13] can also be redesigned by following our design principles. Without changing the depth of MobileNetV1, we only redesign the channel size and the nonlinearity of each convolution. MobileNetV1 has convolution channel sizes of 32, 64, 128, 128, 256, 256, 512, 512, 512, 512, 512, 512, and 1024, respectively. We slightly modify this configuration to create many expand layers with expanded input channel sizes: 32, 96, 144, 192, 240, 288, 336, 384, 432, 480, 528, 576, and 624, respectively. All the other channel sizes, including the stem and the output classifier, are not changed. We replace the ReLUs [33] with Swish-1 only after the convolutions that expand the channel size. We call this model ReXNet_plain. The specification of ReXNet_plain is shown in Table 9.
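The effect of the redesign on the number of expand layers can be counted from the two channel lists quoted above. The sketch below treats a layer as an expand layer when its output channel size exceeds its input channel size; the helper name `count_expand_layers` is hypothetical, and the transition from the RGB input to the stem is left out.

```python
# Channel sizes exactly as listed in the paragraph above.
MOBILENET_V1 = [32, 64, 128, 128, 256, 256, 512, 512, 512, 512, 512, 512, 1024]
REXNET_PLAIN = [32, 96, 144, 192, 240, 288, 336, 384, 432, 480, 528, 576, 624]


def count_expand_layers(channels):
    """Count consecutive layer pairs where the channel size increases."""
    return sum(1 for c_in, c_out in zip(channels, channels[1:]) if c_out > c_in)


v1_expand = count_expand_layers(MOBILENET_V1)
plain_expand = count_expand_layers(REXNET_PLAIN)
print(v1_expand, plain_expand)

# After the stem, the redesigned sizes rise in equal steps of 48 channels.
steps = {b - a for a, b in zip(REXNET_PLAIN[1:], REXNET_PLAIN[2:])}
print(steps)
```

Under this definition the MobileNetV1 list contains 5 expanding transitions out of 12, while every transition in the redesigned list expands the channel size.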

Appendix D Additional Experimental Results

D.1 Model capacity of ReXNets and EfficientNets

We further estimate the model capacity by training the models from scratch on the COCO dataset [25]. This provides further experimental support for the superior model capacity of ReXNets over EfficientNets, not only on ImageNet classification but also on COCO detection. We train ReXNets with width multipliers from 0.9x to 1.3x and EfficientNet-B0, B1, and B2, each with SSDLite. As shown in Table 10, ReXNets produce better AP scores than EfficientNets, which is consistent with the trend in Table 3 in the main paper. Therefore, we conclude that ReXNets have larger capacities for both finetuning and training from scratch.

Model Input Size Avg. Precision at IOU (%) Params. FLOPs
EfficientNetB0 [48] + SSDLite 320x320 24.2 40.5 24.5 6.2M 0.97B
ReXNet-0.9x + SSDLite 320x320 24.9 41.4 25.4 5.0M 0.88B
ReXNet-1.0x + SSDLite 320x320 25.5 42.4 26.0 5.7M 1.01B
EfficientNetB1 [48] + SSDLite 320x320 25.9 42.7 26.3 8.7M 1.35B
ReXNet-1.1x + SSDLite 320x320 26.0 43.0 26.6 6.5M 1.19B
ReXNet-1.2x + SSDLite 320x320 26.3 43.5 26.9 7.4M 1.39B
EfficientNetB2 [48] + SSDLite 320x320 26.6 43.7 27.3 10.0M 1.55B
ReXNet-1.3x + SSDLite 320x320 26.8 44.1 27.4 8.4M 1.60B
Table 10: Object detection results on COCO test-dev 2017. We report the results of training from scratch on COCO train 2017 with ReXNets and EfficientNets in SSDLite.
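One way to read Table 10 is to ask, for each EfficientNet, which ReXNet reaches at least the same AP with the fewest parameters. The sketch below does this using the first AP column only, which is assumed here to be the primary COCO AP; the helper name `smallest_match` is hypothetical.

```python
# (model, AP from the first AP column, params in M, FLOPs in B), from Table 10.
RESULTS = [
    ("EfficientNetB0", 24.2, 6.2, 0.97),
    ("ReXNet-0.9x", 24.9, 5.0, 0.88),
    ("ReXNet-1.0x", 25.5, 5.7, 1.01),
    ("EfficientNetB1", 25.9, 8.7, 1.35),
    ("ReXNet-1.1x", 26.0, 6.5, 1.19),
    ("ReXNet-1.2x", 26.3, 7.4, 1.39),
    ("EfficientNetB2", 26.6, 10.0, 1.55),
    ("ReXNet-1.3x", 26.8, 8.4, 1.60),
]


def smallest_match(baseline, candidates):
    """Pick the candidate with the fewest params whose AP is >= the baseline's."""
    good_enough = [c for c in candidates if c[1] >= baseline[1]]
    return min(good_enough, key=lambda c: c[2]) if good_enough else None


rexnets = [r for r in RESULTS if r[0].startswith("ReXNet")]
efficientnets = [r for r in RESULTS if r[0].startswith("EfficientNet")]

matches = {}
for base in efficientnets:
    best = smallest_match(base, rexnets)
    matches[base[0]] = best[0] if best else None
    if best:
        print(f"{base[0]} ({base[2]}M) -> {best[0]} ({best[2]}M)")
```

In each of the three cases the selected ReXNet has fewer parameters than the EfficientNet it is matched against, although ReXNet-1.3x uses slightly more FLOPs than EfficientNetB2.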

D.2 ImageNet accuracy with different nonlinear functions

We studied how nonlinearity can affect the matrix rank of layers and model performance in the main paper. Here, we further study the actual impact of different nonlinear functions on model performance. We train ReXNet-1.0x with ELU [4], SoftPlus [6], LeakyReLU [30], and ReLU6 [41] on ImageNet to compare with the model with Swish-1 [36]. The results indicate the actual quality of the different nonlinearities. As shown in Table 11, the top-1 accuracies rank in the order of Swish-1 (77.90%), ELU (77.64%), SoftPlus (77.60%), Leaky ReLU (77.44%), and ReLU6 (77.26%).

Network Top-1 (%) Top-5 (%) FLOPs Params.
ReXNet-1.0x with Swish-1 [36] 77.90 93.87 0.398B 4.80M
ReXNet-1.0x with ELU [4] 77.64 93.69 0.398B 4.80M
ReXNet-1.0x with Softplus [6] 77.60 93.75 0.398B 4.80M
ReXNet-1.0x with Leaky ReLU [30] 77.44 93.56 0.398B 4.80M
ReXNet-1.0x with ReLU6 [41] 77.26 93.49 0.398B 4.80M
Table 11: Trained ReXNet-1.0x with different nonlinear functions. We verify the choice of nonlinearity in ReXNets by training the models with different nonlinear functions including ELU, Softplus, Leaky ReLU, and ReLU6 on ImageNet.
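For reference, the five nonlinearities compared in Table 11 can be written as scalar functions. This is a minimal sketch using their standard definitions; the ELU scale `alpha=1.0` and the Leaky ReLU slope `0.01` are common defaults assumed here, since the text does not state the values used in training.

```python
import math


def swish1(x):
    """Swish-1: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def elu(x, alpha=1.0):
    """ELU; alpha=1.0 is an assumed default."""
    return x if x > 0 else alpha * math.expm1(x)


def softplus(x):
    """SoftPlus: log(1 + exp(x))."""
    return math.log1p(math.exp(x))


def leaky_relu(x, slope=0.01):
    """Leaky ReLU; slope=0.01 is an assumed default."""
    return x if x > 0 else slope * x


def relu6(x):
    """ReLU6: ReLU clipped at 6."""
    return min(max(x, 0.0), 6.0)


# Evaluate each function on a few sample inputs.
for f in (swish1, elu, softplus, leaky_relu, relu6):
    print(f.__name__, [round(f(x), 4) for x in (-2.0, 0.0, 2.0, 8.0)])
```

Among these, only ReLU6 saturates for large positive inputs and outputs exactly zero for all negative inputs; the others either pass through a scaled or smooth negative response or stay strictly positive.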

D.3 Comparison with heavy models

We report the performance of ReXNet-2.0x and ReXNet-2.2x alongside other popular heavy models trained on ImageNet in Table 12. ReXNets outperform the reported heavy models at much lower computational cost.

Network Top-1 (%) Top-5 (%) FLOPs Params
VGG16BN [42] 71.5 89.8 15.5B 138.4M
VGG19BN [42] 74.2 91.8 19.7B 143.7M
ResNet18 [9] 69.8 89.1 1.9B 11.7M
ResNet50 [9] 76.1 92.9 4.1B 25.6M
ResNet101 [9] 77.4 93.6 7.9B 44.5M
ResNet152 [9] 78.3 94.1 11.6B 60.2M
InceptionV3 [46] 77.4 93.6 2.9B 27.2M
InceptionV4 [44] 80.0 95.0 13B 48M
Inception-ResNetV2 [44] 80.1 95.1 13B 56M
DenseNet169 [16] 76.2 93.1 3.4B 14.2M
DenseNet201 [16] 77.2 93.6 4.4B 20.0M
ResNeXt101_32x4d [53] 78.8 94.4 8.0B 44.2M
ResNeXt101_64x4d [53] 80.9 95.6 31.5B 83.6M
PolyNet [56] 81.3 95.8 34.7B 92.0M
RandWire-WS (C=109) [54] 79.0 94.4 4.0B 31.9M
EfficientNetB3 [48] 81.1 95.5 1.8B 12.2M
ReXNet-2.0x 81.6 95.7 1.5B 16.4M
ReXNet-2.2x 81.7 95.8 1.8B 19.4M
Table 12: Heavy model comparison on ImageNet. Our models are compared with popular models. Note that ReXNets are trained and evaluated with the fixed image size .