Based on an analysis that reveals the equivalence of modern networks, we find that both ResNet and DenseNet are essentially derived from the same "dense topology", and that they differ only in the form of connection: addition (dubbed "inner link") vs. concatenation (dubbed "outer link"). However, both forms of connection have their strengths and weaknesses. To combine their advantages and avoid certain limitations on representation learning, we present a highly efficient and modularized Mixed Link Network (MixNet), which is equipped with flexible inner link and outer link modules. Consequently, ResNet, DenseNet and Dual Path Network (DPN) can each be regarded as a special case of MixNet. Furthermore, we demonstrate that MixNets achieve superior parameter efficiency over the state-of-the-art architectures on many competitive datasets such as CIFAR-10/100, SVHN and ImageNet.
For the two recent architectures ResNet and DenseNet, despite their apparently large difference in path topology (skip connection vs. densely connected path), we discover and prove that both of them are essentially derived from the same “dense topology” (Fig. 1 (a)), where their only difference lies in the form of connection (“$+$” in Fig. 1 (b) vs. “$\|$” in Fig. 1 (c)). Here, “dense topology” is defined as a path topology in which each layer is connected with all the previous layers using a connection function $C(\cdot)$. The effectiveness of “dense topology” has been demonstrated by the significant success of both ResNet and DenseNet, yet the form of connection in ResNet and DenseNet still has room for improvement. For example, too many additions on the same feature space may impede the information flow in ResNet [Huang et al.2017], and the same type of raw features may appear in different layers, which leads to a certain redundancy in DenseNet [Chen et al.2017]. Therefore, the question “does there exist a more efficient form of connection in the dense topology” remains to be further explored.
To address this problem, we propose a novel Mixed Link Network (MixNet) with an efficient form of connection (Fig. 1 (d)) in the “dense topology”. That is, we mix the connections in ResNet and DenseNet, in order to combine the advantages of both and avoid their possible limitations. In particular, the proposed MixNets are equipped with both inner link modules and outer link modules, where an inner link module refers to additive feature vectors (similar to the connection in ResNet), while an outer link module stands for concatenated ones (similar to the connection in DenseNet). More importantly, in the architectures of MixNets, these two types of link modules are flexible in their positions and sizes. As a result, ResNet, DenseNet and the recently proposed Dual Path Network (DPN) [Chen et al.2017] can each be regarded as a special case of MixNet (see the details in Fig. 5 and Table 1).
To show the efficiency and effectiveness of the proposed MixNets, we conduct extensive experiments on four competitive benchmark datasets, namely, CIFAR-10, CIFAR-100, SVHN and ImageNet. The proposed MixNets require fewer parameters than the existing state-of-the-art architectures whilst achieving better or at least comparable results. Notably, on the CIFAR-10 and CIFAR-100 datasets, MixNet-250 surpasses ResNeXt-29 (16×64d) with 57% fewer parameters. On the ImageNet dataset, the results of MixNet-141 are comparable to those of DPN-98 with 50% fewer parameters.
The main contributions of this paper are as follows:
ResNet and DenseNet are proved to share essentially the same path topology, the “dense topology”, whilst their only difference lies in the form of connection.
A highly modularized Mixed Link Network (MixNet) is proposed, which has a more efficient form of connection: a blend of flexible inner link modules and outer link modules.
The relation between MixNet and modern networks (ResNet, DenseNet and DPN) is discussed, and these modern networks are shown to be specific instances of MixNets.
MixNet demonstrates superior parameter efficiency over the state-of-the-art architectures on many competitive benchmarks.
Designing effective path topologies has always pushed the frontier of advanced neural network architectures. Following the initial layer-wise feed-forward pipeline [LeCun et al.1998], AlexNet [Krizhevsky et al.2012] and VGG [Simonyan and Zisserman2015] showed that building deeper networks with tiny convolutional kernels is a promising way to increase the learning capacity of neural networks. GoogLeNet [Szegedy et al.2015] demonstrated that a multi-path topology (codenamed Inception) could easily outperform previous feed-forward baselines by blending various information flows. The effectiveness of multi-path topology was further validated in FractalNet [Larsson et al.2016], Highway Networks [Srivastava et al.2015], DFN [Wang et al.2016], DFN-MR [Zhao et al.2016], and IGC [Zhang et al.2017]. A recurrent connection topology [Liang and Hu2015] was proposed to integrate context information. Perhaps the most revolutionary topology, the skip connection, was successfully adopted by ResNet [He et al.2016a, He et al.2016b], where micro-blocks were built sequentially and the skip connection bridged each micro-block’s input features with its output features via identity mappings. Since then, different works based on ResNet have arisen, aiming to find a more efficient transformation for that micro-block, such as WRN [Zagoruyko and Komodakis2016], Multi-ResNet [Abdi and Nahavandi2016] and ResNeXt [Xie et al.2017]. Furthermore, DenseNet [Huang et al.2017] achieved accuracy comparable to deep ResNets by proposing the densely connected topology, which connects each layer to its previous layers by concatenation. Recently, DPN [Chen et al.2017] directly combined the two paths, the ResNet path and the DenseNet path, through a shared feature embedding in order to enjoy a mutual improvement.
In this section, we first introduce and formulate the “dense topology”. We then prove that both ResNet and DenseNet are intrinsically derived from the same “dense topology”, and they only differ in the specific form of connection (addition vs. concatenation). Furthermore, we present analysis on strengths and weaknesses of these two network architectures, which motivates us to develop Mixed Link Networks.
Definitions of “dense topology”. Let us consider a network that comprises $L$ layers, each of which implements a non-linear transformation $H_\ell(\cdot)$, where $\ell$ indexes the layer. $H_\ell(\cdot)$ could be a composite function of several operations such as linear transformation, convolution, activation function, pooling [LeCun et al.1998] and batch normalization [Ioffe and Szegedy2015]. As illustrated in Fig. 2 (a), $X_\ell$ refers to the immediate output of the transformation $H_\ell(\cdot)$ and $S_\ell$ is the result of the connection function $C(\cdot)$ whose inputs come from all the previous feature-maps (i.e., $X_0, X_1, \ldots, X_\ell$). Initially, $S_0$ equals $X_0$. As mentioned in Section 1, “dense topology” is defined as a path topology where each layer is connected with all the previous layers. Therefore, we can formulate the general form of “dense topology” simply as:

$$X_\ell = H_\ell\big(C(X_0, X_1, \ldots, X_{\ell-1})\big) \qquad (1)$$
DenseNet is clearly derived from “dense topology”. For DenseNet [Huang et al.2017], the input of layer $\ell$ is the concatenation of the outputs from all the preceding layers. Therefore, we can write DenseNet as:

$$X_\ell = H_\ell\big(X_0 \,\|\, X_1 \,\|\, \cdots \,\|\, X_{\ell-1}\big) \qquad (2)$$

where “$\|$” refers to concatenation. As shown in Eqn. (1) and Eqn. (2), DenseNet directly follows the formulation of “dense topology”, with pure concatenation as its connection function (Fig. 1 (c)).
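ResNet is also derived from “dense topology”. We now explain that ResNet also follows the “dense topology”, with a connection accomplished by addition. Given the standard definition from [He et al.2016b], ResNet introduces a skip connection that bypasses the non-linear transformation with an identity mapping:

$$R_\ell = H_\ell(R_{\ell-1}) + R_{\ell-1} \qquad (3)$$

where $R_\ell$ refers to the feature-maps directly after the skip connection (Fig. 2 (b)). Initially, $R_0$ equals $X_0$. Now we concentrate on $X_\ell$, which is the output of $H_\ell(\cdot)$ as well:

$$X_\ell = H_\ell(R_{\ell-1}) \qquad (4)$$

Substituting Eqn. (3) into Eqn. (4) recursively gives:

$$X_\ell = H_\ell(R_{\ell-1}) = H_\ell\big(H_{\ell-1}(R_{\ell-2}) + R_{\ell-2}\big) = H_\ell(X_{\ell-1} + R_{\ell-2}) = \cdots = H_\ell\Big(\sum_{i=1}^{\ell-1} X_i + R_0\Big) = H_\ell\Big(\sum_{i=0}^{\ell-1} X_i\Big) = H_\ell(X_0 + X_1 + \cdots + X_{\ell-1}) \qquad (5)$$

As Eqn. (5) shows, the connection function $C(\cdot)$ in ResNet is deduced to be the element-wise addition of the outputs of all the previous layers, $X_0, X_1, \ldots, X_{\ell-1}$. This proves that ResNet is actually identical to a form of “dense topology” in which the connection function is specified to be addition (Fig. 1 (b)).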
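The equivalence between the recursive residual form and the additive dense form can be checked numerically. The following is a minimal sketch, assuming NumPy; the transformation `H` (a random linear map followed by ReLU) and all sizes are hypothetical stand-ins chosen for illustration, not the layers used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_layers = 8, 5

# Hypothetical stand-in for each layer transformation: linear map + ReLU.
weights = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(num_layers + 1)]

def H(l, x):
    return np.maximum(weights[l] @ x, 0.0)

x0 = rng.standard_normal(dim)

# Recursive ResNet form: R_l = H_l(R_{l-1}) + R_{l-1}, with R_0 = X_0.
R = x0
outputs_recursive = [x0]
for l in range(1, num_layers + 1):
    x_l = H(l, R)            # X_l = H_l(R_{l-1})
    outputs_recursive.append(x_l)
    R = x_l + R              # skip connection

# Dense-topology form: X_l = H_l(X_0 + X_1 + ... + X_{l-1}).
outputs_dense = [x0]
for l in range(1, num_layers + 1):
    outputs_dense.append(H(l, np.sum(outputs_dense, axis=0)))

# Largest absolute difference between the two forms over all layers.
max_gap = max(np.max(np.abs(a - b))
              for a, b in zip(outputs_recursive, outputs_dense))
```

Up to floating-point rounding, `max_gap` should be zero, because both loops feed the same running sum of earlier outputs into each layer.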
The above analyses reveal that ResNet and DenseNet share the same “dense topology” in essence. The “dense topology” is therefore a fundamental path topology that works well in practice, as shown by the effectiveness of both ResNet and DenseNet in recent progress. Meanwhile, from Eqn. (2) and Eqn. (5), the only difference between ResNet and DenseNet is the connection function (“$+$” vs. “$\|$”).
Analysis of ResNet. The connection in ResNet takes only the additive form (“$+$”), which operates on the entire feature map. It combines the features from previous layers by element-wise addition, which makes the features more expressive and at the same time eases the gradient flow for optimization. However, too many additions on the same feature space may impede the information flow in the network [Huang et al.2017]. This motivates us to develop “shifted additions”, which alleviate the problem by dislocating/shifting the additive positions in subsequent feature spaces along multiple layers (e.g., the black arrow in Fig. 5 (e)).
Analysis of DenseNet. The connection in DenseNet takes only the concatenative form (“$\|$”), which gradually increases the feature dimension with depth. It concatenates the raw features from previous layers to form the input of the new layer. Concatenation allows the new layer to receive the raw features directly from previous layers, and it also improves the flow of information between layers. However, the same type of features may appear in different layers, which leads to a certain redundancy [Chen et al.2017]. This limitation also inspires us to apply the “shifted additions” (e.g., the black arrow in Fig. 5 (e)) to these raw features, as a modification that avoids this redundancy to some extent.
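The contrast between the two connection forms can be made concrete with a small sketch of the dense topology with a pluggable connection function. It assumes NumPy; `dense_topology`, `make_transform` and the chosen widths are hypothetical names and values for illustration only. The point is that addition keeps the input width of every layer constant, while concatenation makes it grow with depth.

```python
import numpy as np

def dense_topology(x0, transforms, connect):
    """Each layer output is a transform of the connected earlier outputs."""
    outputs = [x0]
    for h in transforms:
        outputs.append(h(connect(outputs)))
    return outputs

def add_connection(features):
    # ResNet-style connection: element-wise addition.
    return np.sum(features, axis=0)

def concat_connection(features):
    # DenseNet-style connection: channel concatenation.
    return np.concatenate(features)

def make_transform(rng, out_dim):
    """Hypothetical layer: random projection to out_dim channels, then tanh."""
    def h(s):
        w = rng.standard_normal((out_dim, s.shape[0])) * 0.1
        return np.tanh(w @ s)
    return h

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
layers = [make_transform(rng, 4) for _ in range(3)]

res_like = dense_topology(x0, layers, add_connection)
dense_like = dense_topology(x0, layers, concat_connection)

# Width of the input seen by layers 1, 2 and 3 under each connection.
add_widths = [add_connection(res_like[:l]).shape[0] for l in range(1, 4)]
concat_widths = [concat_connection(dense_like[:l]).shape[0] for l in range(1, 4)]
```

Here `add_widths` stays at 4 for every layer, whereas `concat_widths` grows by 4 per layer, which is the dimension growth described for DenseNet.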
In this section, we first introduce and formulate the inner/outer link modules. Next, we present the generalized mixed link architecture with flexible inner/outer link modules and propose the Mixed Link Network (MixNet), a representative form of the generalized mixed link architecture. Finally, we describe the implementation details of MixNets.
The inner link modules are based on additive connections. Following the above preliminaries, we denote the output that contains the inner link part as $S^{\mathrm{in}}_\ell$:

$$S^{\mathrm{in}}_\ell = \sum_{i=0}^{\ell} X_i = S^{\mathrm{in}}_{\ell-1} + X_\ell = S^{\mathrm{in}}_{\ell-1} + H^{\mathrm{in}}_\ell\big(S^{\mathrm{in}}_{\ell-1}\big) \qquad (6)$$

where $H^{\mathrm{in}}_\ell(\cdot)$ refers to the function that produces feature-maps for inner linking, i.e., element-wise adding new features inside the original ones (Fig. 3 (a)). Note that the positional parameters that align/place the inner link parts are omitted here for simplicity; they are discussed in the following subsection (Fig. 5 and Table 1).
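The outer link modules are based on concatenated connections. Similarly, we have $S^{\mathrm{out}}_\ell$ as:

$$S^{\mathrm{out}}_\ell = X_0 \,\|\, X_1 \,\|\, \cdots \,\|\, X_\ell = S^{\mathrm{out}}_{\ell-1} \,\|\, X_\ell = S^{\mathrm{out}}_{\ell-1} \,\|\, H^{\mathrm{out}}_\ell\big(S^{\mathrm{out}}_{\ell-1}\big) \qquad (7)$$

where $H^{\mathrm{out}}_\ell(\cdot)$ refers to the function that produces feature-maps for outer linking, i.e., appending new features outside the original ones (Fig. 3 (b)).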
Based on the analyses in Section 3, we introduce the mixed link architecture, which embraces both inner link modules and outer link modules (Fig. 4). The mixed link architecture can be formulated as Eqn. (8), a flexible combination of Eqn. (6) and Eqn. (7), to get a blended feature output $S_\ell$:

$$S_\ell = \big(S_{\ell-1} + H^{\mathrm{in}}_\ell(S_{\ell-1})\big) \,\|\, H^{\mathrm{out}}_\ell(S_{\ell-1}) \qquad (8)$$
Definitions of parameters ($k_1$, $k_2$, fixed/unfixed) for mixed link architecture. Here we denote the channel numbers of the feature-maps produced by $H^{\mathrm{in}}_\ell(\cdot)$ and $H^{\mathrm{out}}_\ell(\cdot)$ as $k_1$ and $k_2$, respectively. That is, $k_1$ is the inner link size for inner link modules, and $k_2$ controls the outer link size for outer link modules. As for the positional control of inner link modules, we simplify it into two choices: “fixed” or “unfixed”. The “fixed” option is easy to understand: all the features are merged together by addition over the same fixed space, as in ResNet. The “unfixed” option needs more explanation: there are exponentially many ways to place the inner link modules along multiple layers, and learning a variable position is currently unavailable since the arrangement is not directly differentiable. Therefore, we make a compromise and choose one simple series of unfixed positions, the “shifted addition” (Fig. 5 (e)) mentioned in our motivations in Section 3. Specifically, the position of the inner link part aligns exactly with the growing boundary of the entire feature embedding (see the black arrow in Arch-4) as the outer link parts increase the overall feature dimension. We take this Arch-4 (Fig. 5 (e)) as our proposed model, the Mixed Link Network (MixNet). In summary, we have defined two simple options for controlling the positions of inner link modules: “fixed” and “unfixed”.
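One mixed link layer, with either positional option, can be sketched as follows. This is an illustrative reading under stated assumptions, not the authors' implementation: it assumes NumPy, drops the spatial dimensions so that a feature embedding is a 1-D channel vector, places the "unfixed" inner link on the last `k1` channels (the growing boundary) and the "fixed" one on the first `k1` channels, and uses a hypothetical `make_transform` (random projection plus ReLU) in place of the real inner and outer link functions.

```python
import numpy as np

def make_transform(rng, out_channels):
    """Hypothetical stand-in for a link function: random projection + ReLU."""
    def transform(s):
        w = rng.standard_normal((out_channels, s.shape[0])) * 0.1
        return np.maximum(w @ s, 0.0)
    return transform

def mixed_link_step(s, h_in, h_out, k1, k2, unfixed=True):
    """One mixed link layer on a 1-D channel vector s (spatial dims omitted)."""
    inner = h_in(s)     # k1 channels, merged by addition (inner link)
    outer = h_out(s)    # k2 channels, merged by concatenation (outer link)
    merged = s.copy()
    if k1 > 0:
        # unfixed: add at the growing boundary (last k1 channels);
        # fixed: always add over the same first k1 channels.
        start = merged.shape[0] - k1 if unfixed else 0
        merged[start:start + k1] += inner
    return np.concatenate([merged, outer])

rng = np.random.default_rng(0)
k0, k1, k2, num_layers = 16, 8, 4, 3   # illustrative sizes only
s = rng.standard_normal(k0)
for _ in range(num_layers):
    s = mixed_link_step(s, make_transform(rng, k1), make_transform(rng, k2), k1, k2)

final_channels = s.shape[0]   # grows by k2 per layer: k0 + num_layers * k2
```

Setting `k2 = 0` with `unfixed=False` keeps the width constant, as in a purely additive network, while setting `k1 = 0` reduces the step to pure concatenation.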
Modern networks are special cases of MixNets. It can be seen from Fig. 5 that the mixed link architecture (Fig. 5 (a)) with different parametric configurations can yield four representative architectures (Fig. 5 (b)(c)(d)(e)). The configurations of these architectures are listed in Table 1. From the perspective of the mixed link architecture, MixNet is a more general form than other existing modern networks. Therefore, ResNet, DenseNet and DPN can each be treated as a specific instance of MixNets.
|Arch-1 (ResNet)||$k_1 > 0$, $k_2 = 0$, fixed|
|Arch-3 (DPN)||$k_1 > 0$, $k_2 > 0$, fixed|
|Arch-4 (MixNet)||$k_1 > 0$, $k_2 > 0$, unfixed|
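The relation between the link parameters and the four architectures can be written as a small lookup. This is a sketch under assumptions: `classify_architecture` is a hypothetical helper, and the mapping is inferred from the text (ResNet uses only addition, DenseNet only concatenation, DPN and MixNet use both and differ in the position of the inner link) rather than copied from Table 1.

```python
def classify_architecture(k1: int, k2: int, unfixed: bool) -> str:
    """Map (inner link size, outer link size, position mode) to an architecture."""
    if k1 < 0 or k2 < 0 or (k1 == 0 and k2 == 0):
        raise ValueError("sizes must be non-negative with at least one link enabled")
    if k2 == 0:
        # No concatenation: the feature width never grows, so the additive
        # position cannot shift and the result is ResNet-like.
        return "Arch-1 (ResNet)"
    if k1 == 0:
        # No addition: pure concatenation, DenseNet-like.
        return "Arch-2 (DenseNet)"
    # Both links active: the position of the inner link decides.
    return "Arch-4 (MixNet)" if unfixed else "Arch-3 (DPN)"

# Illustrative sizes only; they are not the configurations used in the paper.
examples = {
    (64, 0, False): classify_architecture(64, 0, False),
    (0, 12, False): classify_architecture(0, 12, False),
    (12, 12, False): classify_architecture(12, 12, False),
    (12, 12, True): classify_architecture(12, 12, True),
}
```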
|Convolution||conv, stride 2|
|Pooling||max pool, stride 2|
|Pooling||average pool, stride 2|
|Pooling||average pool, stride 2|
|Pooling||average pool, stride 2|
|Classification Layer||global average pool|
|1000||1000D fully-connected, softmax|
The proposed network consists of multiple mixed link blocks. Each mixed link block has several layers, whose structure follows Arch-4 (Fig. 5 (e)). Motivated by common practice [Szegedy et al.2016, He et al.2016a], we introduce bottleneck layers as unitary elements in MixNets. That is, we implement both $H^{\mathrm{in}}_\ell(\cdot)$ and $H^{\mathrm{out}}_\ell(\cdot)$ with such bottleneck layers.
On the CIFAR-10, CIFAR-100 and SVHN datasets, the MixNets used in our experiments have three mixed link blocks with the same number of layers. Before entering the first mixed link block, a convolution is performed on the input images. For convolutional layers with kernel size $3\times3$, each side of the inputs is zero-padded by one pixel to keep the feature-map size fixed. We use a $1\times1$ convolution followed by $2\times2$ average pooling as the transition layer between two contiguous blocks. At the end of the last block, global average pooling is performed and a softmax classifier is attached. The feature-map sizes in the three blocks are $32\times32$, $16\times16$ and $8\times8$, respectively. In practice, we evaluate three configurations of the depth $L$ and the link parameters $k_1$ and $k_2$.
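The feature-map sizes across the three blocks follow from the standard output-size formula for convolution and pooling. The sketch below assumes 32×32 inputs, 3×3 convolutions with one pixel of zero-padding inside the blocks, and a transition made of a 1×1 convolution followed by 2×2 average pooling with stride 2 (the usual DenseNet-style choices); `conv_output_size` is a hypothetical helper name.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32            # assumed input resolution
block_sizes = []
for block in range(3):
    # 3x3 convolution, stride 1, zero-padded by one pixel: size is preserved.
    size = conv_output_size(size, kernel=3, stride=1, padding=1)
    block_sizes.append(size)
    if block < 2:
        # Transition layer: 1x1 convolution, then 2x2 average pooling, stride 2.
        size = conv_output_size(size, kernel=1)
        size = conv_output_size(size, kernel=2, stride=2)
```

With these assumptions the resolution is halved at each transition and left unchanged inside a block.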
In our experiments on the ImageNet dataset, we follow Arch-4 and use a network structure with four mixed link blocks on $224\times224$ input images. The initial convolution layer uses stride 2. The sizes of the feature-maps in the following layers are consequently determined by the settings of the inner link parameter $k_1$ and the outer link parameter $k_2$ (Table 2).
In this section, we empirically demonstrate the effectiveness and parameter efficiency of MixNet relative to the state-of-the-art architectures on many competitive benchmarks.
CIFAR. The two CIFAR datasets [Krizhevsky and Hinton2009] consist of colored natural images with $32\times32$ pixels. CIFAR-10 consists of images drawn from 10 classes and CIFAR-100 from 100 classes. The training and test sets contain 50K and 10K images, respectively. We follow the standard data augmentation scheme that is widely used for these two datasets [He et al.2016a, Huang et al.2016, Lee et al.2015, Romero et al.2015, Srivastava et al.2015, Springenberg et al.2014]. For preprocessing, we normalize the data using the channel means and standard deviations. For the final run we use all 50K training images and report the final test error at the end of training.
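The per-channel normalization can be sketched as below. This is a minimal illustration assuming NumPy and images stored as a float array of shape `(N, H, W, C)`; `normalize_channels` is a hypothetical helper, the random arrays merely stand in for real data, and reusing the training statistics on the test set is an assumption about common practice rather than something stated in the text.

```python
import numpy as np

def normalize_channels(images, mean=None, std=None):
    """Subtract the per-channel mean and divide by the per-channel std."""
    if mean is None:
        mean = images.mean(axis=(0, 1, 2))
    if std is None:
        std = images.std(axis=(0, 1, 2))
    return (images - mean) / std, mean, std

rng = np.random.default_rng(0)
train = rng.uniform(0, 255, size=(100, 32, 32, 3))  # placeholder for training images
test = rng.uniform(0, 255, size=(20, 32, 32, 3))    # placeholder for test images

# Statistics are computed on the training set and reused for the test set.
train_norm, mean, std = normalize_channels(train)
test_norm, _, _ = normalize_channels(test, mean, std)
```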
SVHN. The Street View House Numbers (SVHN) dataset [Netzer et al.2011] contains $32\times32$ colored digit images. There are 73,257 images in the training set, 26,032 images in the test set, and 531,131 images for extra training data. Following common practice [Goodfellow et al.2013, Huang et al.2016, Lin et al.2014, Sermanet et al.2012], we use all the training data (training set and extra training data) without any data augmentation, and a validation set with 6,000 images is split from the training set. In addition, the pixel values in the dataset are divided by 255 so that they lie in the [0, 1] range. We select the model with the lowest validation error during training and report its test error.
ImageNet. The ILSVRC 2012 classification dataset [Deng et al.2009] contains 1.2 million images for training and 50K for validation, from 1K classes. We adopt the same data augmentation scheme for training images as in [He et al.2016a, He et al.2016b], and apply a single crop of size $224\times224$ at test time. Following [He et al.2016a, He et al.2016b], we report classification errors on the validation set.
All the networks are trained using stochastic gradient descent (SGD). On CIFAR and SVHN we train using batch size 64 for 300 epochs. The initial learning rate is set to 0.1, and is divided by 10 at 50% and 75% of the total number of training epochs. On ImageNet, we train models with a mini-batch size of 150 (MixNet-121) and 100 (MixNet-141) due to GPU memory constraints. To compensate for the smaller batch size, the models are trained for 100 epochs, and the learning rate is divided by 10 at epochs 30, 60 and 90. Following [He et al.2016a], we use a weight decay of $10^{-4}$ and a Nesterov momentum [Sutskever et al.2013] of 0.9 without dampening. We adopt the weight initialization introduced by [He et al.2015]. For the dataset without data augmentation (i.e., SVHN), we follow the DenseNet setting [Huang et al.2017] and add a dropout layer [Srivastava et al.2014] after each convolutional layer (except the first one).
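The step schedule described above can be sketched as a plain function. It assumes 0-indexed epochs and that each decay takes effect at the start of its milestone epoch; both are assumptions, and `step_learning_rate` is a hypothetical helper name.

```python
def step_learning_rate(epoch, milestones, base_lr=0.1, factor=0.1):
    """Learning rate at a given epoch: base_lr scaled by factor per passed milestone."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** passed

# CIFAR / SVHN: 300 epochs, decay at 50% and 75% of training.
cifar_milestones = (150, 225)
# ImageNet: 100 epochs, decay at epochs 30, 60 and 90.
imagenet_milestones = (30, 60, 90)

cifar_schedule = [step_learning_rate(e, cifar_milestones) for e in range(300)]
imagenet_schedule = [step_learning_rate(e, imagenet_milestones) for e in range(100)]
```

In a framework such as PyTorch, the same behaviour is typically obtained with a multi-step scheduler configured with these milestone epochs.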
|Method||Depth||#params||CIFAR-10||CIFAR-100||SVHN|
|RCNN-160 [Liang and Hu2015]||-||1.87M||7.09||31.75||1.80|
|DFN [Wang et al.2016]||50||3.9M||6.24||27.52||-|
|DFN-MR [Zhao et al.2016]||50||24.8M||3.57||19.00||1.55|
|FractalNet [Larsson et al.2016]||21||38.6M||4.60||23.73||1.87|
|ResNet with Stochastic Depth [Huang et al.2016]||110||1.7M||5.25||24.98||1.75|
|ResNet-164 (pre-activation) [He et al.2016b]||164||1.7M||4.80||22.11||-|
|ResNet-1001 (pre-activation) [He et al.2016b]||1001||10.2M||4.92||22.71||-|
|WRN-28-10 [Zagoruyko and Komodakis2016]||28||36.5M||4.00||19.25||-|
|ResNeXt-29 (8×64d) [Xie et al.2017]||29||34.4M||3.65||17.77||-|
|ResNeXt-29 (16×64d) [Xie et al.2017]||29||68.1M||3.58||17.31||-|
|DenseNet-100 ($k=24$) [Huang et al.2017]||100||27.2M||3.74||19.25||1.59|
|DenseNet-BC-190 ($k=40$) [Huang et al.2017]||190||25.6M||3.46||17.18||-|
|DPN-28-10 [Chen et al.2017]||28||47.8M||3.65||20.23||-|
|IGC- [Zhang et al.2017]||20||24.1M||3.31||18.75||1.56|
Efficiency comparisons among the four architectures. We first evaluate the efficiency of the four representative architectures derived from the mixed link architecture. The comparisons are made over a range of parameter counts (#params). Specifically, we increase the complexities of the four architectures in parallel and evaluate them on the CIFAR-100 dataset. The experimental results are reported in Fig. 6 (a), from which we can see that, at comparable parameter counts, Arch-4 outperforms the other three architectures by a margin. This demonstrates the superior parameter efficiency of Arch-4, which is exactly the architecture used in our proposed MixNets.
Fixed vs. unfixed for the inner link modules. Next we investigate which setting is more effective for the inner link modules: fixed or unfixed. To ensure a fair comparison, we hold the outer link parameter $k_2$ constant and train MixNets with different values of the inner link parameter $k_1$. The models are also evaluated on the CIFAR-100 dataset. Fig. 6 (c) shows the experimental results, from which we can see that the test error rate keeps dropping as $k_1$ grows. Furthermore, with the same inner link parameter $k_1$, the models with unfixed inner link modules (red curve) have much lower test errors than the models with fixed ones (green curve), which suggests the superiority of the unfixed inner link module.
Outer link size. We then study the effect of the outer link size by varying $k_2$, using the more effective unfixed inner link modules, on the CIFAR-100 dataset. Fig. 6 (d) illustrates that increasing $k_2$ reduces the test error rate consistently. However, the performance gain becomes tiny when $k_2$ is relatively large.
We train MixNets with different depths $L$, inner link parameters $k_1$ and outer link parameters $k_2$. The main results on CIFAR and SVHN are shown in Table 3. Except for DPN-28-10, which is taken from https://github.com/Queequeg92/DualPathNet, all other reported results are quoted directly from their original papers.
As can be seen from the bottom rows of Table 3, MixNet-190 consistently outperforms many state-of-the-art architectures on the CIFAR datasets. Its error rates on CIFAR-10 and CIFAR-100 are significantly lower than those achieved by DPN-28-10. Our results on SVHN are even more encouraging. MixNet-100 achieves test errors comparable to those of DFN-MR (24.8M) and IGC- (24.1M) whilst using only 1.5M parameters.
|Method||#params||Top-1 error (%)||Top-5 error (%)|
|ResNet-101 [He et al.2016a]||44.55M||22.6||6.4|
|ResNet-152 [He et al.2016a]||60.19M||21.7||6.0|
|DenseNet-169 [Huang et al.2017]||14.15M||23.8||6.9|
|DenseNet-264 [Huang et al.2017]||33.34M||22.2||6.1|
|ResNeXt-50 (32×4d) [Xie et al.2017]||25M||22.2||-|
|ResNeXt-101 (32×4d) [Xie et al.2017]||44M||21.2||5.6|
|DPN-68 (32×4d) [Chen et al.2017]||12.61M||23.7||7.0|
|DPN-92 (32×3d) [Chen et al.2017]||37.67M||20.7||5.4|
|DPN-98 (32×4d) [Chen et al.2017]||61.57M||20.2||5.2|
We evaluate MixNets with different depths and inner/outer link parameters on the ImageNet classification task, and compare them with representative state-of-the-art architectures. We report the single-crop and 10-crop validation errors of MixNets on ImageNet in Table 4. The single-crop top-1 validation errors of MixNets and different state-of-the-art architectures as a function of the number of parameters are shown in Fig. 6 (b). The results reveal that MixNets perform on par with the state-of-the-art architectures whilst requiring significantly fewer parameters. For example, MixNet-105 outperforms DenseNet-169 and DPN-68 with only 11.16M parameters. MixNet-121 (21.86M) yields a better validation error than ResNeXt-50 (25M) and DenseNet-264 (33.34M). Furthermore, the results of MixNet-141 are very close to those of DPN-98 with 50% fewer parameters.
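The parameter savings quoted above can be turned into relative reductions with simple arithmetic. The sketch below uses only the parameter counts (in millions) stated in this section and in Table 4; `percent_fewer` is a hypothetical helper name.

```python
def percent_fewer(params_ours, params_baseline):
    """Relative parameter reduction of our model versus a baseline, in percent."""
    return 100.0 * (1.0 - params_ours / params_baseline)

# Parameter counts in millions, as quoted in the text and Table 4.
reductions = {
    "MixNet-105 vs DenseNet-169": percent_fewer(11.16, 14.15),
    "MixNet-105 vs DPN-68": percent_fewer(11.16, 12.61),
    "MixNet-121 vs ResNeXt-50": percent_fewer(21.86, 25.0),
    "MixNet-121 vs DenseNet-264": percent_fewer(21.86, 33.34),
}

for name, value in reductions.items():
    print(f"{name}: {value:.1f}% fewer parameters")
```

Under these figures the reductions range from roughly 11% to roughly 34%, depending on the baseline.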
In this paper, we first prove that ResNet and DenseNet are essentially derived from the same fundamental “dense topology”, whilst their only difference lies in the specific form of connection. Next, based on an analysis of the strengths and weaknesses of their distinct connections, we propose a highly efficient form of connection, the Mixed Link Network (MixNet), which combines flexible inner link modules and outer link modules. Further, MixNet is a generalized structure of which ResNet, DenseNet and DPN can be regarded as special cases. Extensive experimental results demonstrate that the proposed MixNet is parameter-efficient.
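[Srivastava et al.2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[Sutskever et al.2013] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
[Szegedy et al.2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.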