Deep neural networks are increasingly used in a number of computer vision tasks. One great disadvantage of training a very deep network is the “vanishing gradient” problem, which delays convergence. This is alleviated to some extent by careful initialization techniques. K. He et al. observed that accuracy stagnates and subsequently degrades as the network becomes deeper. They argue that this degradation is not caused by over-fitting, since adding more layers to an already “deep” model increases both the train and test errors. To make training of deep networks possible, they introduced “Resnets”. Resnet addresses this problem by emphasizing the learning of a residual mapping rather than directly fitting input to output. This is achieved by introducing skip connections, which ensure that larger gradients flow back during back-propagation.
Subsequent to Resnet, a plethora of variants like ResNeXt, Densenet, Resnet with stochastic depth and preactivated Resnet have been proposed that make training very deep networks possible. All these variants have focused on pre-activation, the split-transform-merge paradigm, dense skip-connections, or dropping layers at random. But none has investigated the bridge-connections in Resnet that connect two blocks with differing numbers of feature maps. In this work, we investigate the effect of bridge-connections in Resnet and subsequently propose a network architecture called “Res-SE-Net” that performs better than the baseline Resnet and SE-Resnet.
2 Related Work
Training deep networks had been a concern until Resnets were introduced. Resnet emphasizes learning residual mappings rather than directly fitting input to output. Subsequent to Resnet, many of its variants have been proposed. Fully preactivated Resnet performs activation before the addition of the identity to the residue, to facilitate unhindered gradient flow through the shortcut connections to earlier layers. This makes training a 1001-layer deep network possible. ResNeXt follows a split-transform-merge model similar to Inception net: the input to a Resnet block is split into multiple transformation paths, which are subsequently merged before the identity addition. The number of paths is a new hyperparameter that characterizes model capacity. With higher capacity, the authors demonstrated improved performance without going much wider or deeper. Densenet further exploits the effect of skip-connections by densely connecting the output of every earlier layer to every following layer through skip-connections. The connections are made using depth concatenation rather than addition. The authors argue that such connections help feature reuse and thereby enable unhindered information flow. In Resnet with stochastic depth, the weight layers in the Resnet blocks are randomly dropped, keeping only the skip-connections active in those layers. This gives rise to an ensemble of Resnets, similar to dropout. Whether a weight layer is dropped depends on a “survival probability”. This idea outperforms the baseline Resnet. Another important architecture, which won the 2017 ILSVRC competition (http://image-net.org/challenges/LSVRC/), is SE-Resnet. Its base is a Resnet, with an SE block introduced between the layers of Resnet. This block quantifies the importance of feature maps instead of treating all of them as equally important. This has resulted in a significant improvement in the performance of Resnet.
Though Resnet has been studied in detail, to the best of our knowledge there has not been any work focusing on bridge-connections in Resnet. In this work, we investigate the effectiveness of bridge-connections and further propose a new architecture namely “Res-SE-Net”. This architecture consists of an SE block in the bridge-connection to weight the importance of feature maps. Using the proposed architecture we demonstrate a superior performance on CIFAR-10 and CIFAR-100 benchmark datasets over baseline Resnet and SE-Resnet.
The idea behind Resnets is to make a shallow architecture deeper by adding an identity mapping from a previous layer to the current layer and then applying a suitable non-linear activation. The addition of skip-connections facilitates larger gradient flow to earlier layers, thereby addressing the degradation problem mentioned above. The building block of a Resnet is depicted in Fig. 1. Here x is the identity and F(x) is the residual mapping.
Resnet comprises a stack of these blocks. A part of the 34-layer Resnet is shown in Fig. 2. The skip-connections that carry activations within a block are referred to as identity skip-connections, and those that carry activations from one block to another are called bridge-connections. The dotted connection is an example of a bridge-connection. It involves a 1×1 convolution to increase the number of feature maps from 64 to 128, and also a downsampling operation to reduce their spatial dimension. In its absence, x and F(x) would have incompatible dimensions for addition.
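To make the two kinds of skip-connections concrete, the following is a minimal PyTorch sketch of a Resnet basic block (an illustrative reading, not the authors' released code). When the block changes the number of feature maps or the spatial size, the identity path becomes a bridge-connection, here implemented as a strided 1×1 convolution:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Resnet basic block. When the number of feature maps or the spatial
    size changes between blocks, the skip path becomes a bridge-connection:
    a 1x1 convolution with stride 2 that downsamples and widens the identity."""

    def __init__(self, in_planes, planes, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_planes, planes, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.bridge = None
        if stride != 1 or in_planes != planes:
            # Bridge-connection: match spatial size and channel count.
            self.bridge = nn.Sequential(
                nn.Conv2d(in_planes, planes, 1, stride=stride, bias=False),
                nn.BatchNorm2d(planes),
            )

    def forward(self, x):
        identity = x if self.bridge is None else self.bridge(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)
```

Within a block (`stride=1`, same width) the skip path is the plain identity; across blocks the strided 1×1 convolution plays the role of the dotted bridge-connection in Fig. 2.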
3.2 Squeeze-and-Excitation Block
Filters in a convolutional layer capture local spatial relationships in the form of feature maps. These feature maps are then used as they are, without any importance being attached to them; in other words, each feature map is treated as independent and equally important. This may allow insignificant features that are not globally relevant to propagate through the network, thereby affecting the accuracy. Hence, to model the relationships between feature maps, the SE block was introduced. It enhances the quality of the representations produced by a convolutional neural network. The SE block performs a recalibration of features so that global information is used to weight the feature maps that are more “informative” than the rest.
The SE block has two operations, viz. squeeze and excitation. Features are first passed to the “squeeze” operation, which produces a descriptor for each feature map by aggregating along its spatial dimensions. The descriptor is an embedding of the global distribution of channel-wise feature responses, allowing information from the global receptive field of the network to be used by all its layers. The squeeze operation is followed by an “excitation” operation, wherein the embedding is used to obtain a modulation weight for every feature map. These weights are applied to the feature maps to generate weighted feature maps, as shown in Fig. 3. In Fig. 3, the input is fed to a global pooling function, which outputs a vector of dimension C (one descriptor per channel); this constitutes the squeeze operation. The dimension of this vector is reduced by a factor r (the reduction ratio) using a fully-connected layer followed by ReLU activation, and then projected back to dimension C using another fully-connected layer followed by sigmoid activation, which gives a weight for each channel; this constitutes the excitation operation. The input is then rescaled by the output of the excitation to obtain the weighted feature map, as shown in Fig. 3.
SE blocks add negligible extra computation and can be included in any part of the network. An SE-Resnet module is a Resnet module in which each residual mapping passes through an SE block before being added to the identity connection. SE-Resnet is a stack of SE-Resnet modules.
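The squeeze-and-excitation computation above can be sketched in PyTorch as follows; the reduction ratio of 16 is the default suggested in the SE-Net paper, and the variable names are ours:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global-average-pool each feature map to a
    C-dim descriptor (squeeze), pass it through a bottleneck of two
    fully-connected layers ending in a sigmoid (excitation), and rescale
    the input channel-wise with the resulting weights."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # reduce by factor r
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # project back to C
            nn.Sigmoid(),                                # per-channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # squeeze: (B, C) descriptor
        w = self.fc(w).view(b, c, 1, 1)  # excitation: modulation weights
        return x * w                     # recalibrated feature maps
```

The two linear layers add only O(C²/r) parameters per block, which is why the overhead is negligible compared to the convolutions.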
4 Proposed Model
Prior to the elucidation of the proposed model, we present the motivation. As shown in Fig. 2, bridge-connections (represented as dotted lines) connect two blocks of Resnet that have a different number of feature maps and different spatial dimensions. We now investigate the effectiveness of bridge-connections.
4.1 Effect of Bridge-connections in Resnet
Tables 1 and 2 compare the performance of various Resnet architectures with and without bridge-connections, respectively. The performance without bridge-connections drops drastically, particularly for Resnet-56 and Resnet-110. This comparison underscores the importance of bridge-connections.
However, in the original Resnet, all feature maps in the bridge-connections are weighted equally. It is to be noted that SE-Resnet weights the feature maps along the non-skip connections based on their importance. The importance is learnt using a simple feed-forward network that adds negligible computation. This idea motivated us to quantify the importance of the feature maps that arise in bridge-connections.
4.2 Res-SE-Net - Our architecture
We incorporate an SE block in every bridge-connection in Resnet. Fig. 4 shows an illustration of a modified bridge-connection. The proposed model, Res-SE-Net, has an architecture similar to the baseline Resnet. Specifically, our architecture is as follows. The network begins with an initial convolutional layer, followed by a stack of Resnet modules. A group of Resnet modules within the stack that have the same number of feature maps constitutes a block. Average-pooling follows the stack of Resnet modules. The final layer is a fully-connected layer followed by softmax activation, which predicts the probability of an input belonging to a particular class. The sub-sampling of feature maps is done in the first convolutional layer of every block, by performing the convolution with a stride of 2. From one block to the next, the size of the feature maps decreases and their number increases. So, to carry activations from one block to another, the bridge-connection downsamples the feature maps and increases their number by using 1×1 convolutions with stride 2.
We add an SE block on the bridge-connection, just after downsampling. This ensures that when the feature maps are carried from one block to another, they are weighted according to the content they carry; features that are more relevant are given higher importance. Since it is the downsampled feature maps that are sent to the next block, the weighting must be applied after downsampling, so that the importance is assigned to the feature maps the next block actually receives; weighting before downsampling would dilute the significance of the weighted features. This is the primary reason for adding the SE layer after downsampling rather than before. Empirically, we also found that adding the SE layer before downsampling gives lower accuracy than adding it after.
We used the CIFAR-10 and CIFAR-100 datasets for all our experiments. The CIFAR-10 dataset consists of 50000 training images and 10000 test images in 10 classes, with 5000 training images and 1000 test images per class. The CIFAR-100 dataset consists of 50000 training images and 10000 test images in 100 classes, with 500 training images and 100 test images per class; the 100 classes are grouped into 20 superclasses. The images in both datasets are 32×32 RGB images.
5.2 Experimental setup
We conducted our experiments on Resnets of depths 20, 32, 44, 56 and 110 layers. Our implementations are coded in PyTorch. The code for the baseline Resnet (adapted from https://github.com/bearpaw/pytorch-classification) and SE-Resnet (adapted from https://github.com/moskomule/senet.pytorch) has been modified from existing implementations. The following data augmentation techniques, as used in the original Resnet work, are applied during training:
Padding with 4 pixels on each side.
Random cropping to a size of 32×32 from the padded image.
Random horizontal flip.
At test time, we only normalize the images. The input to the network is of size 32×32.
The architecture of the network used for both the CIFAR-10 and CIFAR-100 datasets is described in Section 4. Training starts with an initial learning rate of 0.1, which is subsequently divided by 10 at 32000 and 48000 iterations; training runs for a maximum of 64000 iterations. Stochastic Gradient Descent (SGD) is used for updating the weights. The weights in the model are initialized by the method of He et al., and batch normalization is adopted. Dropout was not used for this model. The hyperparameters used are listed in Table 3.
Table 3 (excerpt): initial learning rate = 0.1
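The training schedule can be sketched as follows; the momentum and weight-decay values are our assumptions (typical for Resnet training on CIFAR), while the learning-rate schedule is the one stated above:

```python
import torch

# Stand-in model; the real model is the Res-SE-Net described in Section 4.
model = torch.nn.Linear(10, 10)

# momentum=0.9 and weight_decay=1e-4 are assumed values, not from the paper.
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

# Divide the learning rate by 10 at 32000 and 48000 iterations.
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[32000, 48000], gamma=0.1)

for iteration in range(64000):
    # forward pass, loss.backward() and opt.step() would go here
    sched.step()  # lr: 0.1 -> 0.01 at 32k -> 0.001 at 48k
```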
Tables 1, 4 and 5 report the accuracies obtained by the baseline Resnets, the baseline SE-Resnets and Res-SE-Nets, respectively. As evident from Table 1, the best performing Resnet is Resnet-110, with Top-1 accuracies of 93.66% and 73.33% on CIFAR-10 and CIFAR-100 respectively. Similarly, from Table 4, the best performing SE-Resnet is SE-Resnet-110, with Top-1 accuracies of 93.79% and 72.99% on CIFAR-10 and CIFAR-100 respectively. It is clear from Table 5 that our model, Res-SE-Net-110, with Top-1 accuracies of 94.53% and 74.93% on CIFAR-10 and CIFAR-100 respectively, significantly outperforms the baseline Resnets and SE-Resnets. It can further be observed from Table 5 that Res-SE-Net-44 performs exceedingly well compared to the baseline Resnets and SE-Resnets. In fact, Res-SE-Net-44 outperforms Resnet-110 and SE-Resnet-110 by significant margins of 0.42% and 0.29% respectively on the CIFAR-10 dataset. On the CIFAR-100 dataset, Res-SE-Net-44 leads Resnet-110 and SE-Resnet-110 by margins of 0.5% and 0.84%. It is to be noted that Res-SE-Net-44 has 61.75% and 62.06% fewer parameters than Resnet-110 and SE-Resnet-110 respectively. Res-SE-Net-56, too, exhibits outstanding performance on CIFAR-100 compared to the baseline Resnets and SE-Resnets, and near on-par performance with them on CIFAR-10. This strongly underscores the value of the proposed idea of weighting the feature maps in bridge-connections by their importance: it enables a reasonably deep network with fewer parameters to outperform very deep networks.
The improvements in accuracy of Res-SE-Nets over the baseline Resnets and SE-Resnets, for both datasets, are tabulated in Tables 6 and 7 respectively. On average, Res-SE-Net outperforms the baseline Resnet by 0.566% (about 56 images) on CIFAR-10 and by 0.704% (about 70 images) on CIFAR-100. Res-SE-Net-110 achieved the maximum improvement over Resnet-110, with gains of 0.87% on CIFAR-10 and 1.6% on CIFAR-100. Similarly, Res-SE-Net outperforms SE-Resnet by 0.124% on CIFAR-10 and by 1.392% on CIFAR-100 on average. Res-SE-Net-110 achieved the maximum improvement over SE-Resnet-110, with gains of 0.74% on CIFAR-10 and 1.94% on CIFAR-100.
Given the improvement that adding an SE block provides, one might want to add SE blocks to all the skip-connections to improve performance further. However, we have empirically found that adding an SE block to every identity skip-connection degrades the performance on CIFAR-10 and CIFAR-100 as the depth increases. Also, for the reasons mentioned in Section 4, adding the SE block before downsampling does not give better results either.
We now analyze the training phase of Res-SE-Net by plotting the training losses for all the aforesaid depths on both datasets. From Figs. 5, 6 and 7, we can conclude that the training of Res-SE-Net proceeds smoothly: there is no abrupt increase in the training loss of any Res-SE-Net model. This shows that gradient flow is not hindered by the introduction of an SE block in the bridge-connections, keeping intact the principle of Resnet (the base of our Res-SE-Net) that skip-connections facilitate smooth training of deep networks.
In this work, we proposed a new architecture named “Res-SE-Net” that makes the bridge-connections in Resnets more influential. This is achieved by incorporating an SE block in every bridge-connection. Res-SE-Net surpassed the performance of the baseline Resnet and SE-Resnets by significant margins on the CIFAR-10 and CIFAR-100 datasets. Further, we demonstrated that reasonably sized deep networks with positively contributing bridge-connections can outperform very deep networks. We also illustrated that the addition of an SE block does not affect training. In the future, we would like to explore other ways of making bridge-connections in Resnets influential towards further enhancement in performance.
The authors wish to dedicate this work to the founder chancellor of Sri Sathya Sai Institute of Higher Learning, Bhagawan Sri Sathya Sai Baba. The authors also wish to extend their gratitude to Dr. Vineeth N Balasubramanian, Associate Professor in the Department of Computer Science and Engineering, Indian Institute of Technology - Hyderabad.
-  K. He, X. Zhang, S. Ren and J. Sun, “Deep Residual Learning for Image Recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  J. Hu, L. Shen and G. Sun, “Squeeze-and-Excitation Networks,” IEEE Conference on Computer Vision and Pattern Recognition, 2018.
-  X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in AISTATS, 2010.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in International Conference on Computer Vision, 2015.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” in Neural Information Processing Systems Workshop, 2017.
-  A. Krizhevsky, “Learning Multiple Layers of Features from Tiny Images”, Technical Report, 2009.
-  S. Xie, R. Girshick, P. Dollar, Z. Tu and K. He. “Aggregated Residual Transformations for Deep Neural Networks,” in arXiv preprint arXiv:1611.05431v1, 2016.
-  G. Huang, Z. Liu, L. van der Maaten and K. Q. Weinberger, “Densely Connected Convolutional Networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  G. Huang, Y. Sun, Z. Liu, D. Sedra and K. Q. Weinberger, “Deep Networks with Stochastic Depth,” in European Conference on Computer Vision, 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Identity Mappings in Deep Residual Networks,” in arXiv preprint arXiv:1603.05027v3, 2016.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich, “Going deeper with convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” in Journal of Machine Learning Research, 2014.