Mixed Link Networks

02/06/2018 ∙ by Wenhai Wang, et al. ∙ Nanjing University ∙ NetEase, Inc.

By analyzing and revealing the equivalence of modern networks, we find that both ResNet and DenseNet are essentially derived from the same "dense topology", and that they differ only in the form of connection: addition (dubbed "inner link") vs. concatenation (dubbed "outer link"). However, both forms of connection have their own strengths and weaknesses. To combine their advantages and avoid certain limitations on representation learning, we present a highly efficient and modularized Mixed Link Network (MixNet), which is equipped with flexible inner link and outer link modules. Consequently, ResNet, DenseNet and the Dual Path Network (DPN) can each be regarded as a special case of MixNet. Furthermore, we demonstrate that MixNets achieve superior parameter efficiency over state-of-the-art architectures on many competitive datasets such as CIFAR-10/100, SVHN and ImageNet.

1 Introduction

The exploration of connectivity patterns in deep neural networks has attracted extensive attention in the literature of Convolutional Neural Networks (CNNs). LeNet [LeCun et al.1998] originally demonstrated the layer-wise feed-forward pipeline, and later GoogLeNet [Szegedy et al.2015] introduced a more effective multi-path topology. Recently, ResNet [He et al.2016a, He et al.2016b] successfully adopted skip connections, which transfer early information through identity mappings by element-wisely adding input features to the block outputs. DenseNet [Huang et al.2017] further proposed a seemingly “different” topology, using densely connected paths to concatenate all the previous raw input features with the output ones.

Figure 1: The topological relations of different types of neural networks. The symbols “+” and “[·]” denote element-wise addition and concatenation, respectively. (a) shows the general form of “dense topology”; G refers to the connection function. (b) shows ResNet from the perspective of “dense topology”. (c) shows the path topology of DenseNet. (d) shows the path topology of MixNet.

For the two recent networks, ResNet and DenseNet, despite their outwardly large difference in path topology (skip connection vs. densely connected path), we discover and prove that both of them are essentially derived from the same “dense topology” (Fig. 1 (a)), where their only difference lies in the form of connection (addition “+” in Fig. 1 (b) vs. concatenation “[·]” in Fig. 1 (c)). Here, “dense topology” is defined as a path topology in which each layer is connected with all the previous layers via a connection function G. The great effectiveness of “dense topology” has been proved by the significant success of both ResNet and DenseNet, yet the form of connection in ResNet and DenseNet still has room for improvement. For example, too many additions on the same feature space may impede the information flow in ResNet [Huang et al.2017], while raw features of the same type may be produced by different layers, leading to a certain redundancy in DenseNet [Chen et al.2017]. Therefore, the question “does there exist a more efficient form of connection in the dense topology?” remains to be explored.

To address this problem, in this paper we propose a novel Mixed Link Network (MixNet) with an efficient form of connection (Fig. 1 (d)) in the “dense topology”. That is, we mix the connections in ResNet and DenseNet in order to combine their advantages and avoid their possible limitations. In particular, the proposed MixNets are equipped with both inner link modules and outer link modules, where an inner link module merges new features by element-wise addition (a connection similar to ResNet), while an outer link module merges new features by concatenation (a connection similar to DenseNet). More importantly, in the architectures of MixNets, these two types of link modules are flexible in their positions and sizes. As a result, ResNet, DenseNet and the recently proposed Dual Path Network (DPN) [Chen et al.2017] can each be regarded as a special case of MixNet (see the details in Fig. 5 and Table 1).

To show the efficiency and effectiveness of the proposed MixNets, we conduct extensive experiments on four competitive benchmark datasets, namely CIFAR-10, CIFAR-100, SVHN and ImageNet. The proposed MixNets require fewer parameters than the existing state-of-the-art architectures whilst achieving better or at least comparable results. Notably, on the CIFAR-10 and CIFAR-100 datasets, MixNet-250 surpasses ResNeXt-29 (16×64d) with 57% fewer parameters. On the ImageNet dataset, the results of MixNet-141 are comparable to those of DPN-98 with 50% fewer parameters.

The main contributions of this paper are as follows:

  • ResNet and DenseNet are proved to share the same essential path topology – “dense topology” – whilst their only difference lies in the form of connection.

  • A highly modularized Mixed Link Network (MixNet) is proposed, which has a more efficient connection – the blending of flexible inner link modules and outer link modules.

  • The relation between MixNet and modern networks (ResNet, DenseNet and DPN) is discussed, and these modern networks are shown to be specific instances of MixNets.

  • MixNet demonstrates superior parameter efficiency over the state-of-the-art architectures on many competitive benchmarks.

2 Related Work

Designing effective path topologies has always pushed the frontier of advanced neural network architectures. Following the initial layer-wise feed-forward pipeline [LeCun et al.1998], AlexNet [Krizhevsky et al.2012] and VGG [Simonyan and Zisserman2015] showed that building deeper networks with tiny convolutional kernels is a promising way to increase the learning capacity of a neural network. GoogLeNet [Szegedy et al.2015] demonstrated that a multi-path topology (codenamed Inception) could easily outperform previous feed-forward baselines by blending various information flows. The effectiveness of multi-path topology was further validated in FractalNet [Larsson et al.2016], Highway Networks [Srivastava et al.2015], DFN [Wang et al.2016], DFN-MR [Zhao et al.2016], and IGC [Zhang et al.2017]. A recurrent connection topology [Liang and Hu2015] was proposed to integrate context information. Perhaps the most revolutionary topology – the skip connection – was successfully adopted by ResNet [He et al.2016a, He et al.2016b], where micro-blocks are built sequentially and the skip connection bridges each micro-block’s input features with its output ones via identity mappings. Since then, different works based on ResNet have arisen, aiming to find a more efficient transformation within the micro-block, such as WRN [Zagoruyko and Komodakis2016], Multi-ResNet [Abdi and Nahavandi2016] and ResNeXt [Xie et al.2017]. Furthermore, DenseNet [Huang et al.2017] achieved accuracy comparable to deep ResNets by proposing the densely connected topology, which connects each layer to its previous layers by concatenation. Recently, DPN [Chen et al.2017] directly combined the two paths – the ResNet path and the DenseNet path – through a shared feature embedding in order to enjoy mutual improvement.

3 Dense Topology

In this section, we first introduce and formulate the “dense topology”. We then prove that both ResNet and DenseNet are intrinsically derived from the same “dense topology” and that they only differ in the specific form of connection (addition vs. concatenation). Furthermore, we present an analysis of the strengths and weaknesses of these two network architectures, which motivates us to develop Mixed Link Networks.

Definitions of “dense topology”. Let us consider a network that comprises L layers, each of which implements a non-linear transformation H_ℓ(·), where ℓ indexes the layer. H_ℓ(·) could be a composite function of several operations such as linear transformation, convolution, activation function, pooling [LeCun et al.1998], or batch normalization [Ioffe and Szegedy2015]. As illustrated in Fig. 2 (a), x_ℓ refers to the immediate output of the transformation H_ℓ, and X_ℓ is the result of the connection function G whose inputs come from all the previous feature-maps (i.e., x_0, x_1, …, x_ℓ). Initially, X_0 equals x_0. As mentioned in Section 1, “dense topology” is defined as a path topology where each layer is connected with all the previous layers. Therefore, we can formulate the general form of “dense topology” simply as:

x_ℓ = H_ℓ(X_{ℓ-1}) = H_ℓ(G(x_0, x_1, …, x_{ℓ-1}))   (1)

Figure 2: The key annotations for x_ℓ, X_ℓ and H_ℓ(·).

DenseNet is obviously derived from “dense topology”. For DenseNet [Huang et al.2017], the input of the ℓ-th layer is the concatenation of the outputs from all the preceding layers. Therefore, we can write DenseNet as:

x_ℓ = H_ℓ([x_0, x_1, …, x_{ℓ-1}])   (2)

where “[·]” refers to concatenation. As shown in Eqn. (1) and Eqn. (2), DenseNet directly follows the formulation of “dense topology”, whose connection function G is the pure concatenation (Fig. 1 (c)).

ResNet is also derived from “dense topology”. We then explain that ResNet also follows the “dense topology”, with its connection accomplished by addition. Given the standard definition from [He et al.2016b], ResNet poses a skip connection that bypasses the non-linear transformation H_ℓ with an identity mapping:

X_ℓ = H_ℓ(X_{ℓ-1}) + X_{ℓ-1}   (3)

where X_ℓ refers to the feature-maps directly after the skip connection (Fig. 2 (b)). Initially, X_0 equals x_0. Now we concentrate on x_ℓ, which is the output of H_ℓ as well:

x_ℓ = H_ℓ(X_{ℓ-1})   (4)

By substituting Eqn. (3) into Eqn. (4) recursively, we can rewrite Eqn. (4) as:

x_ℓ = H_ℓ(X_{ℓ-1}) = H_ℓ(x_{ℓ-1} + X_{ℓ-2}) = … = H_ℓ(x_{ℓ-1} + x_{ℓ-2} + … + x_1 + x_0)   (5)

As shown clearly in Eqn. (5), x_ℓ in ResNet is deduced to be H_ℓ applied to the element-wise addition of all the previous layers’ outputs – x_0 + x_1 + … + x_{ℓ-1}. This proves that ResNet is actually identical to a form of “dense topology” whose connection function G is specified to be addition (Fig. 1 (b)).
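
To make the unrolling in Eqn. (5) concrete, the following minimal sketch (an illustration only, using small PyTorch Linear layers as stand-ins for the transformations H_ℓ) checks numerically that iterating the residual update X_ℓ = H_ℓ(X_{ℓ-1}) + X_{ℓ-1} feeds each layer exactly the sum of all previous layer outputs:

```python
import torch

torch.manual_seed(0)

# Stand-in transformations H_l; any per-layer function would do for this check.
H = [torch.nn.Linear(8, 8) for _ in range(4)]

x0 = torch.randn(1, 8)

# ResNet-style recursion: x_l = H_l(X_{l-1}),  X_l = x_l + X_{l-1}
X, outputs = x0, []
for h in H:
    x_l = h(X)
    outputs.append(x_l)
    X = x_l + X

# Eqn. (5): the last layer's input equals x_0 + x_1 + ... + x_{l-1}.
summed = x0 + sum(outputs[:-1])
assert torch.allclose(outputs[-1], H[-1](summed), atol=1e-5)
print("Residual recursion matches the additive 'dense topology' form.")
```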

The above analyses reveal that ResNet and DenseNet share the same “dense topology” in essence. Therefore, the “dense topology” is confirmed to be a fundamental and significant path topology that works in practice, given the extraordinary effectiveness of both ResNet and DenseNet in recent progress. Meanwhile, from Eqn. (2) and Eqn. (5), the only difference between ResNet and DenseNet is obviously the connection function (addition “+” vs. concatenation “[·]”).

Analysis of ResNet. The connection in ResNet takes only the additive form (“+”) that operates on the entire feature space. It combines the features from previous layers by element-wise addition, which makes the features more expressive and eases the gradient flow for optimization. However, too many additions on the same feature space may impede the information flow in the network [Huang et al.2017], which motivates us to develop “shifted additions” – dislocating/shifting the additive positions into subsequent feature spaces along multiple layers (e.g., the black arrow in Fig. 5 (e)) – to alleviate this problem.

Analysis of DenseNet. The connection in DenseNet is only the concatenative one (“[·]”), which increases the feature dimension gradually along the depth. It concatenates the raw features from previous layers to form the input of the new layer. Concatenation allows the new layer to receive the raw features directly from previous layers, and it also improves the flow of information between layers. However, features of the same type may be produced by different layers, which leads to a certain redundancy [Chen et al.2017]. This limitation also inspires us to introduce the “shifted additions” (e.g., the black arrow in Fig. 5 (e)) on these raw features as a modification that avoids such redundancy to some extent.

Figure 3: Examples of the inner/outer link module. The symbols “+” and “[·]” denote addition and concatenation, respectively. The green arrows refer to the duplication operation. (a) and (b) show examples of the inner link module and the outer link module, respectively.

4 Mixed Link Networks

In this section, we first introduce and formulate the inner/outer link modules. Next, we present the generalized mixed link architecture with flexible inner/outer link modules and propose the Mixed Link Network (MixNet), a representative form of this generalized architecture. Finally, we describe the implementation details of MixNets.

4.1 Inner/Outer Link Module

The inner link modules are based on the additive connection. Following the above preliminaries, we denote the output X_ℓ, which contains the inner link part, as (we omit the possible positional parameters that align/place the inner link parts for simplicity; they are discussed in the following subsection, Fig. 5 and Table 1):

X_ℓ = X_{ℓ-1} + H_ℓ^in(X_{ℓ-1})   (6)

where H_ℓ^in refers to the function producing feature-maps for inner linking – element-wisely adding new features inside the original ones (Fig. 3 (a)).

The outer link modules are based on the concatenative connection. Similarly, we have X_ℓ as:

X_ℓ = [X_{ℓ-1}, H_ℓ^out(X_{ℓ-1})]   (7)

where H_ℓ^out refers to the function producing feature-maps for outer linking – appending new features outside the original ones (Fig. 3 (b)).
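
As a concrete illustration of Eqns. (6) and (7), the hedged PyTorch sketch below uses plain 3×3 convolutions as stand-ins for H_ℓ^in and H_ℓ^out (the equations above omit the positional parameters, so placing the inner-link addition on the last k1 channels here is just one illustrative choice):

```python
import torch
import torch.nn as nn

C, k1, k2 = 64, 16, 16                                # existing channels, inner/outer link sizes
h_in = nn.Conv2d(C, k1, kernel_size=3, padding=1)     # stand-in for H_l^in
h_out = nn.Conv2d(C, k2, kernel_size=3, padding=1)    # stand-in for H_l^out

X = torch.randn(2, C, 8, 8)                           # X_{l-1}: N x C x H x W

# Inner link (Eqn. 6): element-wisely add k1 new channels inside the original features.
inner = h_in(X)
X_inner = torch.cat([X[:, :C - k1], X[:, C - k1:] + inner], dim=1)   # still C channels

# Outer link (Eqn. 7): append k2 new channels outside the original features.
outer = h_out(X)
X_outer = torch.cat([X, outer], dim=1)                # C + k2 channels

print(X_inner.shape, X_outer.shape)                   # (2, 64, 8, 8) and (2, 80, 8, 8)
```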

Figure 4: An example of the mixed link architecture. The symbols “+” and “[·]” represent addition and concatenation, respectively. The green arrows denote the duplication operation.

4.2 Mixed Link Architecture

Based on the analyses in Section 3, we introduce the mixed link architecture, which embraces both inner link modules and outer link modules (Fig. 4). The mixed link architecture can be formulated as Eqn. (8), a flexible combination of Eqn. (6) and Eqn. (7), to get a blended feature output X_ℓ:

X_ℓ = [X_{ℓ-1} + H_ℓ^in(X_{ℓ-1}), H_ℓ^out(X_{ℓ-1})]   (8)

Definitions of parameters (k1, k2, fixed/unfixed) for the mixed link architecture. Here we denote the number of channels of the feature-maps produced by H_ℓ^in and H_ℓ^out as k1 and k2, respectively. That is, k1 is the inner link size for inner link modules, and k2 controls the outer link size for outer link modules. As for the positional control of the inner link modules, we simplify it into two choices – “fixed” or “unfixed”. The “fixed” option is easy to understand: all the features are merged together by addition over the same fixed space, as in ResNet. The “unfixed” option needs more explanation: there are exponentially many ways to place the inner link modules’ positions along multiple layers, and learning the position is currently infeasible since the arrangement is not directly differentiable. Therefore, we make a compromise and choose one simple series of the unfixed-position versions – the “shifted addition” (Fig. 5 (e)) mentioned in our motivations in Section 3. Specifically, the position of the inner link part exactly aligns with the growing boundary of the entire feature embedding (see the black arrow in Arch-4) as the outer link parts increase the overall feature dimension. We take this Arch-4 (Fig. 5 (e)) to be exactly our proposed model – the Mixed Link Network (MixNet). In summary, we have defined two simple options, “fixed” and “unfixed”, for controlling the positions of the inner link modules.
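
The following sketch shows one mixed link step in the “unfixed” (Arch-4) configuration of Eqn. (8). It is an illustration rather than the authors’ implementation: the single 3×3 convolutions are simplified stand-ins for H_ℓ^in and H_ℓ^out, and the “shifted addition” is realized by merging the k1 inner-link channels onto the last k1 channels of the current embedding (the growing boundary) while the k2 outer-link channels are concatenated:

```python
import torch
import torch.nn as nn

class MixedLinkStep(nn.Module):
    """One layer of the mixed link architecture (Eqn. 8), 'unfixed' variant."""
    def __init__(self, in_channels, k1, k2):
        super().__init__()
        self.k1 = k1
        self.h_in = nn.Conv2d(in_channels, k1, 3, padding=1)   # stand-in for H_l^in
        self.h_out = nn.Conv2d(in_channels, k2, 3, padding=1)  # stand-in for H_l^out

    def forward(self, X):
        inner = self.h_in(X)                    # k1 channels merged by addition
        outer = self.h_out(X)                   # k2 channels merged by concatenation
        # Shifted addition: add onto the last k1 channels (the growing boundary).
        head, tail = X[:, :-self.k1], X[:, -self.k1:]
        return torch.cat([head, tail + inner, outer], dim=1)   # grows by k2 channels

# Stacking a few steps: the embedding grows by k2 channels per layer,
# so the addition position keeps shifting with the growing boundary.
k1, k2, channels = 16, 16, 64
X = torch.randn(2, channels, 8, 8)
for _ in range(3):
    X = MixedLinkStep(channels, k1, k2)(X)
    channels += k2
print(X.shape)   # torch.Size([2, 112, 8, 8])
```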

Modern networks are special cases of MixNets. It can be seen from Fig. 5 that the mixed link architecture (Fig. 5 (a)) with different parametric configurations yields four representative architectures (Fig. 5 (b)(c)(d)(e)). The configurations of these corresponding architectures are listed in Table 1. We show that MixNet is a more generalized form than other existing modern networks under the perspective of the mixed link architecture. Therefore, ResNet, DenseNet and DPN can each be treated as a specific instance of MixNet.

Figure 5: Four architectures derived from the mixed link architecture. The view is placed on the channels at one location of the feature-maps in convolutional neural networks. The orange arrows denote the function H_ℓ^in for the inner link module. The yellow arrows denote the function H_ℓ^out for the outer link module. The green arrows refer to the duplication operation. The vertically aligned features are merged by element-wise addition, and the horizontally aligned features are merged by concatenation. (a) shows the generalized mixed link architecture. (b), (c), (d) and (e) are the four derivative architectures with various representative settings of (a).
Architecture | Inner Link Module Setting | Outer Link Module Setting
Arch-1 (ResNet) | k1 > 0, fixed | k2 = 0
Arch-2 (DenseNet) | k1 = 0 | k2 > 0
Arch-3 (DPN) | k1 > 0, fixed | k2 > 0
Arch-4 (MixNet) | k1 > 0, unfixed | k2 > 0
Table 1: The configurations of the four representative architectures.

4.3 Implementation Details of MixNets

Table 2: MixNet architectures for ImageNet (MixNet-105, MixNet-121 and MixNet-141). Each network begins with an initial convolution (stride 2) and a max pooling layer (stride 2), followed by four mixed link blocks separated by transition layers (a convolution followed by average pooling with stride 2), and ends with a classification layer (global average pooling and a 1000D fully-connected layer with softmax). k1 and k2 denote the parameters for the inner and outer link modules, respectively.

The proposed network consists of multiple mixed link blocks. Each mixed link block has several layers whose structure follows Arch-4 (Fig. 5 (e)). Motivated by common practice [Szegedy et al.2016, He et al.2016a], we introduce bottleneck layers as the unitary elements in MixNets. That is, we implement both H_ℓ^in and H_ℓ^out with such a bottleneck layer – BN-ReLU-Conv(1, 1)-BN-ReLU-Conv(3, 3). Here BN, ReLU and Conv refer to batch normalization, rectified linear units and convolution, respectively.
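
The bottleneck unit above maps directly onto a small PyTorch module; the sketch below is a hedged rendering of BN-ReLU-Conv(1, 1)-BN-ReLU-Conv(3, 3), where the intermediate width of the 1×1 convolution is a hypothetical choice since its exact value is not restated here:

```python
import torch.nn as nn

def bottleneck(in_channels, out_channels, width=128):
    """BN-ReLU-Conv(1x1)-BN-ReLU-Conv(3x3); usable for both H^in and H^out.
    `width` (output channels of the 1x1 convolution) is an illustrative value."""
    return nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, width, kernel_size=1, bias=False),
        nn.BatchNorm2d(width),
        nn.ReLU(inplace=True),
        nn.Conv2d(width, out_channels, kernel_size=3, padding=1, bias=False),
    )
```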

On the CIFAR-10, CIFAR-100 and SVHN datasets, the MixNets used in our experiments have three mixed link blocks with the same number of layers. Before entering the first mixed link block, an initial convolution is performed on the input images. For convolutional layers with kernel size 3×3, each side of the inputs is zero-padded by one pixel to keep the feature-map size fixed. We use a convolution followed by average pooling as the transition layer between two contiguous blocks. At the end of the last block, a global average pooling is performed and then a softmax classifier is attached. The feature-map sizes in the three blocks are 32×32, 16×16 and 8×8, respectively. We evaluate the network structure with three configurations of depth and link parameters (k1, k2) in practice.
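
For illustration, a transition layer between two mixed link blocks could be written as below. This is a sketch under assumptions: only “a convolution followed by average pooling” is specified above, so the 1×1 kernel, the BN-ReLU pre-activation and the stride-2 pooling are assumed details (stride-2 pooling matches the halving of the feature-map sizes from 32×32 to 16×16 to 8×8):

```python
import torch.nn as nn

def transition(in_channels, out_channels):
    # Transition between contiguous mixed link blocks: convolution + average pooling.
    # Kernel sizes and the BN-ReLU pre-activation are illustrative assumptions.
    return nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
        nn.AvgPool2d(kernel_size=2, stride=2),   # halves the spatial resolution
    )
```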

In our experiments on the ImageNet dataset, we follow Arch-4 and use a network structure with four mixed link blocks on 224×224 input images. The initial convolution layer uses stride 2. The sizes of the feature-maps in the subsequent layers are consequently determined by the settings of the inner link parameter k1 and the outer link parameter k2 (Table 2).

5 Experiment

In this section, we empirically demonstrate MixNet’s effectiveness and parameter efficiency compared with state-of-the-art architectures on several competitive benchmarks.

5.1 Datasets

CIFAR. The two CIFAR datasets [Krizhevsky and Hinton2009] consist of colored natural images of 32×32 pixels. CIFAR-10 consists of images drawn from 10 classes and CIFAR-100 from 100 classes. The training and test sets contain 50K and 10K images, respectively. We follow the standard data augmentation scheme that is widely used for these two datasets [He et al.2016a, Huang et al.2016, Lee et al.2015, Romero et al.2015, Srivastava et al.2015, Springenberg et al.2014]. For preprocessing, we normalize the data using the channel means and standard deviations. For the final run we use all 50K training images and report the final test error at the end of training.
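
For reference, the standard CIFAR scheme cited above (random shifting and mirroring plus per-channel normalization) is commonly implemented with torchvision as in the sketch below; the normalization statistics shown are commonly quoted CIFAR-10 values, not numbers taken from this paper:

```python
import torchvision.transforms as T

# Commonly quoted CIFAR-10 channel statistics; in practice they are computed from the training set.
mean, std = (0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),      # random shifting via 4-pixel zero padding
    T.RandomHorizontalFlip(),         # random mirroring
    T.ToTensor(),
    T.Normalize(mean, std),           # normalize with channel means / standard deviations
])
test_transform = T.Compose([T.ToTensor(), T.Normalize(mean, std)])
```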

SVHN. The Street View House Numbers (SVHN) dataset [Netzer et al.2011] contains 32×32 colored digit images. There are 73,257 images in the training set, 26,032 images in the test set, and 531,131 images as extra training data. Following common practice [Goodfellow et al.2013, Huang et al.2016, Lin et al.2014, Sermanet et al.2012], we use all the training data (training set and extra training data) without any data augmentation, and a validation set with 6,000 images is split from the training set. In addition, the pixel values in the dataset are divided by 255 so that they lie in the [0, 1] range. We select the model with the lowest validation error during training and report the test error.

ImageNet. The ILSVRC 2012 classification dataset [Deng et al.2009] contains 1.2 million images for training and 50K for validation, drawn from 1K classes. We adopt the same data augmentation scheme for training images as in [He et al.2016a, He et al.2016b], and apply a single 224×224 crop at test time. Following [He et al.2016a, He et al.2016b], we report classification errors on the validation set.

5.2 Training

All the networks are trained using stochastic gradient descent (SGD). On CIFAR and SVHN we train with batch size 64 for 300 epochs. The initial learning rate is set to 0.1 and is divided by 10 at 50% and 75% of the total number of training epochs. On ImageNet, we train models with a mini-batch size of 150 (MixNet-121) and 100 (MixNet-141) due to GPU memory constraints. To compensate for the smaller batch size, the models are trained for 100 epochs, and the learning rate is lowered by a factor of 10 at epochs 30, 60 and 90. Following [He et al.2016a], we use a weight decay of 10^-4 and a Nesterov momentum [Sutskever et al.2013] of 0.9 without dampening. We adopt the weight initialization introduced by [He et al.2015]. For the dataset without data augmentation (i.e., SVHN), we follow the DenseNet setting [Huang et al.2017] and add a dropout layer [Srivastava et al.2014] after each convolutional layer (except the first one).
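
The optimization recipe above maps onto standard PyTorch utilities; the following is a hedged sketch for the CIFAR/SVHN setting (the `model` argument is assumed to be an already constructed network):

```python
import torch

def make_optimizer_and_scheduler(model, epochs=300, base_lr=0.1):
    # SGD with Nesterov momentum 0.9 (no dampening) and weight decay 1e-4.
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, nesterov=True, weight_decay=1e-4)
    # Divide the learning rate by 10 at 50% and 75% of the total training epochs.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[epochs // 2, epochs * 3 // 4], gamma=0.1)
    return optimizer, scheduler
```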

Method Depth #params CIFAR-10 CIFAR-100 SVHN
RCNN-160 [Liang and Hu2015] - 1.87M 7.09 31.75 1.80
DFN [Wang et al.2016] 50 3.9M 6.24 27.52 -
DFN-MR [Zhao et al.2016] 50 24.8M 3.57 19.00 1.55
FractalNet [Larsson et al.2016] 21 38.6M 4.60 23.73 1.87
ResNet with Stochastic Depth [Huang et al.2016] 110 1.7M 5.25 24.98 1.75
ResNet-164 (pre-activation) [He et al.2016b] 164 1.7M 4.80 22.11 -
ResNet-1001 (pre-activation) [He et al.2016b] 1001 10.2M 4.92 22.71 -
WRN-28-10 [Zagoruyko and Komodakis2016] 28 36.5M 4.00 19.25 -
ResNeXt-29 (8×64d) [Xie et al.2017] 29 34.4M 3.65 17.77 -
ResNeXt-29 (16×64d) [Xie et al.2017] 29 68.1M 3.58 17.31 -
DenseNet-100 (k=24) [Huang et al.2017] 100 27.2M 3.74 19.25 1.59
DenseNet-BC-190 (k=40) [Huang et al.2017] 190 25.6M 3.46 17.18 -
DPN-28-10 [Chen et al.2017] 28 47.8M 3.65 20.23 -
IGC- [Zhang et al.2017] 20 24.1M 3.31 18.75 1.56
MixNet-100 () 100 1.5M 4.19 21.12 1.57
MixNet-250 () 250 29.0M 3.32 17.06 1.51
MixNet-190 () 190 48.5M 3.13 16.96 -
Footnote 1: Results from https://github.com/Queequeg92/DualPathNet
Table 3: Error rates (%) on the CIFAR and SVHN datasets. k1 and k2 denote the parameters for the inner and outer link modules, respectively. The best, second-best, and third-best results are highlighted in red, blue, and green.

Figure 6: Illustrations of the experimental results. (a) shows the parameter efficiency comparison among the four architectures. (b) compares the top-1 errors (single-crop) of MixNets and state-of-the-art architectures on the ImageNet validation set as a function of the number of model parameters. (c) shows error rates of models whose inner link modules are fixed or unfixed. (d) shows error rates of models with different outer link parameters k2.

5.3 Ablation Study for Mixed Link Architecture

Efficiency comparisons among the four architectures. We first evaluate the efficiency of the four representative architectures derived from the mixed link architecture. The comparisons are made across various amounts of parameters (#params). Specifically, we increase the complexities of the four architectures in parallel and evaluate them on the CIFAR-100 dataset. The experimental results are reported in Fig. 6 (a), from which we find that, at comparable parameter budgets, Arch-4 outperforms the other three architectures by a margin. This demonstrates the superior parameter efficiency of Arch-4, which is exactly the architecture used in our proposed MixNets.

Fixed vs. unfixed for the inner link modules. Next we investigate which setting is more effective for the inner link modules – fixed or unfixed. To ensure a fair comparison, we hold the outer link parameter k2 constant and train MixNets with different inner link parameters k1, letting k1 increase gradually. The models are also evaluated on the CIFAR-100 dataset. Fig. 6 (c) shows the experimental results, from which we find that as k1 grows, the test error rate keeps dropping. Furthermore, with the same inner link parameter k1, the models with unfixed inner link modules (red curve) have much lower test errors than the models with fixed ones (green curve), which suggests the superiority of the unfixed inner link module.

Outer link size. We then study the effect of the outer link size k2 under configurations with the effective unfixed inner link modules on the CIFAR-100 dataset. Fig. 6 (d) illustrates that increasing k2 reduces the test error rate consistently. However, the performance gain becomes tiny when k2 is relatively large.

5.4 Experiments on CIFAR and SVHN

We train MixNets with different depths, inner link parameters k1 and outer link parameters k2. The main results on CIFAR and SVHN are shown in Table 3. Except for DPN-28-10, whose result is taken from https://github.com/Queequeg92/DualPathNet, all reported results are quoted directly from the original papers.

As can be seen from the bottom rows of Table 3, MixNet-190 consistently outperforms many state-of-the-art architectures on the CIFAR datasets. Its error rates, 3.13% on CIFAR-10 and 16.96% on CIFAR-100, are significantly lower than those achieved by DPN-28-10. Our results on SVHN are even more encouraging: MixNet-100 achieves test errors comparable to those of DFN-MR (24.8M) and IGC (24.1M) whilst costing only 1.5M parameters.

5.5 Experiments on ImageNet

Method #params top-1 top-5
ResNet-101 [He et al.2016a] 44.55M 22.6 6.4
ResNet-152 [He et al.2016a] 60.19M 21.7 6.0
DenseNet-169 [Huang et al.2017] 14.15M 23.8 6.9
DenseNet-264 [Huang et al.2017] 33.34M 22.2 6.1
ResNeXt-50 (32×4d) [Xie et al.2017] 25M 22.2 -
ResNeXt-101 (32×4d) [Xie et al.2017] 44M 21.2 5.6
DPN-68 (32×4d) [Chen et al.2017] 12.61M 23.7 7.0
DPN-92 (32×3d) [Chen et al.2017] 37.67M 20.7 5.4
DPN-98 (32×4d) [Chen et al.2017] 61.57M 20.2 5.2
MixNet-105 () 11.16M 23.3 6.7
MixNet-121 () 21.86M 21.9 5.9
MixNet-141 () 41.07M 20.4 5.3
Table 4: The top-1 and top-5 error rates on the ImageNet validation set, with single-crop / 10-crop testing.

We evaluate MixNets with different depths and inner/outer link parameters on the ImageNet classification task, and compare them with representative state-of-the-art architectures. We report the single-crop and 10-crop validation errors of MixNets on ImageNet in Table 4. The single-crop top-1 validation errors of MixNets and different state-of-the-art architectures, as a function of the number of parameters, are shown in Fig. 6 (b). The results reveal that MixNets achieve better or at least comparable performance whilst requiring significantly fewer parameters. For example, MixNet-105 outperforms DenseNet-169 and DPN-68 with only 11.16M parameters. MixNet-121 (21.86M) yields a better validation error than ResNeXt-50 (25M) and DenseNet-264 (33.34M). Furthermore, the results of MixNet-141 are very close to those of DPN-98 with 50% fewer parameters.

6 Conclusion

In this paper, we first prove that ResNet and DenseNet are essentially derived from the same fundamental “dense topology”, whilst their only difference lies in the specific form of connection. Next, based on an analysis of the strengths and weaknesses of their distinct connections, we propose a highly efficient form of connection and the resulting Mixed Link Networks (MixNets), whose connections combine flexible inner link modules and outer link modules. Furthermore, MixNet is a generalized structure of which ResNet, DenseNet and DPN can be regarded as special cases. Extensive experimental results demonstrate that our proposed MixNets are parameter-efficient.

References

  • [Abdi and Nahavandi2016] Masoud Abdi and Saeid Nahavandi. Multi-residual networks. arXiv preprint arXiv:1609.05672, 2016.
  • [Chen et al.2017] Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. Dual path networks. In NIPS, 2017.
  • [Deng et al.2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • [Goodfellow et al.2013] Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In ICML, 2013.
  • [He et al.2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
  • [He et al.2016a] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [He et al.2016b] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016.
  • [Huang et al.2016] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
  • [Huang et al.2017] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In CVPR, 2017.
  • [Ioffe and Szegedy2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [Krizhevsky and Hinton2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Tech Report, 2009.
  • [Krizhevsky et al.2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • [Larsson et al.2016] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Fractalnet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648, 2016.
  • [LeCun et al.1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [Lee et al.2015] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In AISTATS, 2015.
  • [Liang and Hu2015] Ming Liang and Xiaolin Hu. Recurrent convolutional neural network for object recognition. In CVPR, 2015.
  • [Lin et al.2014] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. In ICLR, 2014.
  • [Netzer et al.2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop, 2011.
  • [Romero et al.2015] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015.
  • [Sermanet et al.2012] Pierre Sermanet, Soumith Chintala, and Yann LeCun. Convolutional neural networks applied to house numbers digit classification. In ICPR, 2012.
  • [Simonyan and Zisserman2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [Springenberg et al.2014] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
  • [Srivastava et al.2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • [Srivastava et al.2015] Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In NIPS, 2015.
  • [Sutskever et al.2013] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
  • [Szegedy et al.2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [Szegedy et al.2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
  • [Wang et al.2016] Jingdong Wang, Zhen Wei, Ting Zhang, and Wenjun Zeng. Deeply-fused nets. arXiv preprint arXiv:1605.07716, 2016.
  • [Xie et al.2017] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
  • [Zagoruyko and Komodakis2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
  • [Zhang et al.2017] Ting Zhang, Guo-Jun Qi, Bin Xiao, and Jingdong Wang. Interleaved group convolutions. In CVPR, 2017.
  • [Zhao et al.2016] Liming Zhao, Jingdong Wang, Xi Li, Zhuowen Tu, and Wenjun Zeng. On the connection of deep fusion to ensembling. arXiv preprint arXiv:1611.07718, 2016.