Multi-scale Convolution Aggregation and Stochastic Feature Reuse for DenseNets

10/02/2018 ∙ by Mingjie Wang, et al.

Recently, Convolutional Neural Networks (CNNs) have achieved great success in numerous vision tasks. In particular, DenseNets have demonstrated that feature reuse via dense skip connections can effectively alleviate the difficulty of training very deep networks and that reusing features generated by the initial layers in all subsequent layers has a strong impact on performance. To feed even richer information into the network, a novel adaptive Multi-scale Convolution Aggregation module is presented in this paper. Composed of layers for multi-scale convolutions, trainable cross-scale aggregation, maxout, and concatenation, this module is highly non-linear and can boost the accuracy of DenseNet while using far fewer parameters. In addition, due to its high model complexity, a network with extremely dense feature reuse is prone to overfitting. To address this problem, a regularization method named Stochastic Feature Reuse is also presented. By randomly dropping a set of feature maps to be reused for each mini-batch during the training phase, this regularization method reduces training costs and prevents co-adaptation. Experimental results on the CIFAR-10, CIFAR-100 and SVHN benchmarks demonstrate the effectiveness of the proposed methods.




1 Introduction

Recently, deep learning has become a dominant field of machine learning for various vision tasks, such as recognition and classification. In particular, Convolutional Neural Networks (CNNs) have achieved unprecedented success through AlexNet [14], which has spurred a new line of research concentrating on constructing better-performing CNNs [28]. Increasingly deep architectures are being created and trained based on the observation that the deeper the network is, the higher-level the features it is able to extract. AlexNet has 5 convolutional layers [14], VGG Nets [23] have 16 or 19, GoogLeNets [27] have 22, and ResNets [6] feature over 1000 layers employing residual connections.

As networks became very deep, two common issues emerged: exploding and vanishing gradients. To deal with these problems, several creative architectures, such as Highway networks [24], Deeply-Supervised Nets [16] and ResNets [6], have been designed. The key ideas are passing information from one layer to another via shortcuts or adding “companion” objective functions at each hidden layer, respectively. Stochastic depth [10] trains an ensemble of ResNets with different depths by randomly dropping a set of layers during the training phase. FractalNets [15] repeatedly apply a simple expansion rule to generate an ultra-deep network containing interacting subpaths of different lengths. Based on the above work, DenseNet [9] was introduced, which connects each layer to every subsequent layer. As a result, a given layer in DenseNet takes all feature maps extracted by preceding layers as input. This new connection pattern allows DenseNets to obtain significant improvements over the state of the art on several object recognition benchmark tasks.

On another front, the Inception series [12, 23, 26, 28] has been shown to achieve remarkable performance at very low memory cost. The Inception module is composed of convolutions with different kernel sizes (1×1, 3×3, 5×5) and a 3×3 max pooling, and it concatenates the results of the convolutions and pooling. This design strengthens the regularization and scale invariance of extracted features. Recently, feature pyramid networks (FPN) [19] and deep layer aggregation [34] have been proposed, which aim at exploiting the inherent multi-scale, pyramidal hierarchy of CNNs. Features at different scale levels are merged together to achieve higher accuracy with fewer parameters.

Inspired by the benefits of multi-scale convolutions [19, 34] and feature fusion for training deep networks, we design a novel module, referred to as Multi-scale Convolution Aggregation (MCA), to work with DenseNets. As shown in Fig. 1, the MCA module consists of layers for multi-scale convolutions, cross-scale aggregation, maxout, and concatenation. We observe that DenseNets using the MCA module have substantially fewer parameters and a lower classification error than those using other multi-scale designs. The reduction in parameters results from the new design of fusing pyramidal convolutions instead of simply concatenating them. The increase in accuracy is attributed to the following factors: 1) strengthened scale invariance, because the multi-scale convolutions use four kernels with different receptive field sizes; 2) given a specific task, the network automatically chooses the most suitable scales via four trainable gating units to adaptively make use of multi-scale information; 3) the use of two maxout activations stimulates competition among neural units of different receptive fields and enhances the learning ability of the network; 4) higher non-linearity; and 5) compared with traditional concatenation in GoogLeNets, our module dramatically reduces the number of parameters while preserving sufficient multi-scale information through the aggregation and maxout functions.

Figure 1: DenseNets with the Multi-scale Convolution Aggregation (MCA) module. Given the raw image on the left, the first layer generates four groups of feature maps using different kernel sizes. These results pass through aggregation and maxout gates to produce two branches of compressed channels, which capture fine-scale and coarse-scale features, respectively. The two channels are concatenated into a layer of feature maps, which is fed into the DenseNet, represented on the right with 3 composite layers.

In addition to various methods of architecture design, difficulties in training deep networks motivated research on optimization and initialization techniques. These include dropout [8], maxout activation [4], batch normalization [11, 12], group normalization [31], Xavier initialization [2], He initialization [5], etc., which have been applied in a wide range of networks as essential components.

To reduce the possibility of overfitting in DenseNets and to further boost the generalization of networks, we also develop a regularization method named Stochastic Feature Reuse (SFR). Similar to stochastic depth [10], SFR contains gates for dropping selected feature maps delivered from preceding layers; see Fig. 4. During training, each layer randomly reuses a different subset of preceding feature maps for each mini-batch, so that each mini-batch is trained under a sub-network with a unique connection scheme. This approach effectively addresses the overfitting problem of DenseNets by substantially reducing the number of parameters while improving the performance of DenseNets.

We evaluate the impact of both the MCA module and SFR on three widely used benchmark datasets: CIFAR-10 [13], CIFAR-100 [13] and Street View House Numbers (SVHN) [21]. The comparisons show that our model can achieve comparable test accuracy at relatively lower computational cost and can outperform the state-of-the-art results of DenseNets.

2 Related Work

Deeper feed-forward neural networks tend to yield larger gains in performance on various vision tasks. This has led to the recent resurgence of exploration of sophisticated CNN architectures with greatly increased classification accuracy on ImageNet [1], e.g. from AlexNet [14] to GoogLeNets [27], and from ResNets [6] to DenseNets [9].

Comparisons of layerwise performance, analysis [20, 32] and visualization of feature maps [33, 37] show that networks with deeper layers are able to extract more semantic and higher-level representations. On the other hand, very deep networks make training more difficult, especially when using a first-order optimizer with purely random initialization and traditional activation functions (tanh, sigmoid, etc.), which often cause vanishing gradients and internal covariate shift. To overcome these problems, a lot of research has been carried out.


In the search for high-performance architectures, a series of independent methods has been explored. One of the more prominent is to increase the network width. GoogLeNets [12, 23, 26, 28] use the inception module to build deep networks; this component concatenates feature maps produced by a set of filters with different receptive field sizes. Other well-known structures, such as Resnet in Resnet [29] and Wide residual networks [35], also demonstrate that simply increasing the number of filters in each layer can dramatically improve test accuracy. More recently, FractalNets [15] obtained excellent results using a wider block structure. In addition to increasing the depth and width of networks, a growing number of research works focus on aggregation or fusion. Deep Layer Aggregation [34] provides a novel approach to fuse features vertically across layers, which substantially improves recognition accuracy at lower computational cost.

Inspired by these findings, we design a novel MCA module, which first broadens the width of the initial convolution layer of DenseNets through multi-scale convolutions, then fuses the filters using cross-scale aggregation parameterised by trainable weights. The idea of multi-scale convolutions also follows a neuroscience model [22] suggesting that the raw image should be processed at different scales and then joined together for next layers, so that the deeper layers can become robust to scale shift [27].

Another breakthrough in deep learning is the introduction of skip connections, which addresses the challenges of training deep networks. Highway Networks [24] efficiently train deep networks by introducing bypassing paths, which are the primary factor that eases training. ResNets [6] further enhance this connection pattern by substituting bypassing paths with residual connections, and achieve record-breaking performance on ImageNet [1]. Recently, DenseNets [9] densely connect all preceding layers with each layer to reuse all preceding feature maps and outperform the state-of-the-art results on several competitive benchmarks. Moreover, stochastic depth [10] was proposed as a successful approach to train a ResNet with over 1000 layers by randomly dropping a few layers during training. Analogous to dropout [8], this method demonstrates that stochastic dropping is an extremely powerful technique for regularizing networks. Our SFR regularizer was motivated by the observations on Dropout, Stochastic Depth and DenseNets. However, instead of dropping layers as in Stochastic Depth, our regularizer drops features by randomly blocking a set of bypassing paths.

3 Methodology

DenseNets. Both the MCA module and the SFR regularizer proposed in this paper are based on DenseNets [9]. Assume that a single input image is represented by $x_0$ and is passed through a DenseNet that has $L$ layers. Each layer $\ell$ comprises a composite function $H_\ell(\cdot)$ that includes one Batch Normalization layer [12], one ReLU layer [3], and one convolution layer. DenseNets introduce a new connectivity scheme: the output of each layer is directly connected to all subsequent layers. Consequently, the $\ell$-th layer receives the outputs of all preceding layers. That is:

$$x_\ell = H_\ell([x_0, x_1, \ldots, x_{\ell-1}]) \qquad (1)$$

where $x_\ell$ is the output of layer $\ell$, and $[x_0, x_1, \ldots, x_{\ell-1}]$ is the concatenation of feature maps produced by layers 0, 1, 2, …, $\ell-1$. The total number of channels in an $L$-layer DenseNet can be approximately computed as:


where $k_0$ represents the number of input channels into the first dense block and $k$ is the growth rate of the DenseNet.
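To make the dense connectivity concrete, the following minimal Python sketch counts the channels that each layer of a single dense block receives. It relies only on the standard DenseNet property [9] that every layer adds a fixed number of new feature maps (the growth rate), so the l-th layer sees the block input plus the maps of the l − 1 layers before it. The function name and the example values (16 input channels, growth rate 12) are illustrative assumptions, and the sketch is not a reconstruction of the approximation referred to as Equ. (2).

```python
def dense_block_input_channels(k0, k, num_layers):
    """Input channel count of each layer in one dense block.

    k0: channels entering the block (illustrative value below).
    k:  growth rate, i.e. feature maps added by every layer.
    """
    # Layer l (1-indexed) receives the block input plus the k maps
    # produced by each of the l - 1 layers before it.
    return [k0 + k * (l - 1) for l in range(1, num_layers + 1)]


channels = dense_block_input_channels(k0=16, k=12, num_layers=4)
print(channels)  # [16, 28, 40, 52]
```

The linear growth of the input width with depth is what makes wide DenseNets expensive, which is the cost that both the MCA module and SFR aim to control.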

3.1 Multi-scale Convolution Aggregation Module

By concatenating different groups of convolutions, the Inception [23] module and its variant [17] have shown that multi-scale convolution filters can boost the performance of deep networks. Inspired by their findings, we design a novel MCA module to enhance the representational and learning capacity of DenseNets. The new module consists of layers for multi-scale convolutions, cross-scale aggregation, maxout, and concatenation. It is placed in front of the DenseNet as the initial layers so that abundant features extracted at different scales can be fed into the network.

Multi-scale Convolutions. Given the input image , the multi-scale convolutions layer computes the following:


where are the results of convolutions with , , , and kernels respectively. represents parameters of different kernels and denotes the concatenation operator.

Feeding the concatenation of the four groups of convolutions, , into DenseNets directly helps to improve the performance of the network since the network bandwidth is increased. When evaluated on the CIFAR-10 dataset, a standard DenseNet with depth and growth rate achieves , whereas the DenseNet with as input achieves a test accuracy of 94.31%. However, since and the number of initial channels for DenseNet, in Equ. (2), equals the length of , the number of parameters increases from 4.2 million to 5.7 million.

Figure 2: Illustration on fine-scale and coarse-scale aggregations.

Cross-scale Aggregation. In order to reduce the complexity of the model while maintaining a high test accuracy and maximizing the effective information flow of the network, an adaptive aggregation function is applied. Here we aggregate the convolution results under four kernels into two branches that represent fine and coarse scales, respectively; see Fig. 2. Since trainable gating weights are introduced, the unit is similar to a small-scale voting system. For each mini-batch, proper weights are automatically assigned to different scales via this voting mechanism. This helps to preserve the most informative multi-scale information and to suppress flows of lower importance. Specifically, the cross-scale aggregation layer performs the following operation:


where represents pixelwise summation aggregation. are learnable gating weights for convolution results at different scales. Their values indicate the importance of the respective scales. In practice, we also found that trainable aggregation works much better than equal-weighted fusion, since the voting system in the former approach makes the module more adaptable to a variety of datasets. As shown in Fig. 6, the converged weights on different datasets vary widely, which indicates that different datasets favor contributions from different scales. Fig. 3 visually compares the results obtained through the two versions of the aggregation.
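The difference between gated and equal-weighted pixelwise summation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: feature maps are flattened to equal-length lists, the gating weights are fixed numbers rather than parameters updated by SGD, and the function name `gated_sum`, the choice of two scales per call, and all values are hypothetical rather than taken from the paper's equations.

```python
def gated_sum(feature_maps, weights):
    """Pixelwise weighted sum of equally sized (flattened) feature maps."""
    if len(feature_maps) != len(weights):
        raise ValueError("one gating weight is needed per feature map")
    size = len(feature_maps[0])
    if any(len(fm) != size for fm in feature_maps):
        raise ValueError("feature maps must have the same size")
    # Each output pixel is the sum of the gated responses at that pixel.
    return [
        sum(w * fm[i] for w, fm in zip(weights, feature_maps))
        for i in range(size)
    ]


# Toy responses of two convolutions with different kernel sizes (assumed).
small_kernel = [1.0, 2.0, 3.0]
large_kernel = [4.0, 0.0, 2.0]

equal = gated_sum([small_kernel, large_kernel], [0.5, 0.5])
gated = gated_sum([small_kernel, large_kernel], [1.0, 0.0])
print(equal)  # [2.5, 1.0, 2.5]
print(gated)  # [1.0, 2.0, 3.0]
```

With equal weights both scales always contribute; with gating weights, a scale whose weight is driven towards zero is effectively suppressed, which is the behaviour the voting analogy describes.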

Figure 3: Visualization on results obtained using the proposed adaptive aggregation (top row) and simple equal-weighted aggregation (bottom row): (a) results of convolutions; (b) results of convolutions; (c) aggregation of and convolutions; and (d) zoomed in view of dashed-box areas in (c), which shows that trainable aggregation can extract more abundant and detailed texture features.

Compared to the Inception module, which simply concatenates different groups of convolutions, the aggregation layer we use significantly reduces the number of parameters. On the CIFAR-10 dataset, the number of parameters is reduced from 5.7 million to 4.2 million in a DenseNet with depth = 40 and growth rate 24.

Maxout. Previous work has shown that: 1) maxout exploits the model averaging behavior as the approximation is more accurate; 2) the gradient flow through maxout can avoid pitfalls such as failing to use a large set of filters [4]; and 3) grouping is important in deep networks [31]. Hence, to better regularize our fusion results, two maxout operations are independently performed after the cross-scale aggregation layer, one for the two fine-scale channels and the other for the two coarse-scale channels. That is, we have the final output of the MCA module :


With the maxout layer introduced, the whole MCA module can be viewed as a highly non-linear transformation between the original input and the first dense block of DenseNets. It includes four gating units, parametrized by trainable weights, that control the flow of multi-scale information.
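The maxout and concatenation steps can be illustrated with a small Python sketch. It assumes, for illustration only, that each channel is flattened to a list and that concatenation along the channel axis can be modelled as list concatenation; the names `maxout`, `fine` and `coarse` and the numeric values are hypothetical.

```python
def maxout(*feature_maps):
    """Elementwise maximum across equally sized (flattened) feature maps."""
    size = len(feature_maps[0])
    if any(len(fm) != size for fm in feature_maps):
        raise ValueError("feature maps must have the same size")
    # zip(*...) groups the values that share a pixel position.
    return [max(values) for values in zip(*feature_maps)]


# One maxout per branch: two fine-scale and two coarse-scale channels.
fine = maxout([0.2, 0.9, 0.1], [0.5, 0.3, 0.4])
coarse = maxout([1.0, 0.0, 0.7], [0.6, 0.8, 0.2])

# Concatenating both branches gives the input of the first dense block.
module_output = fine + coarse
print(module_output)  # [0.5, 0.9, 0.4, 1.0, 0.8, 0.7]
```

At every pixel only the strongest response within a branch survives, which is the competition among units that the text refers to, while the fine and coarse branches stay separate until concatenation.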

Backward Propagation. Gradients are back-propagated through the module in the same way as in traditional back-propagation. Here, we present the derivation formula in terms of the weights of the multi-scale convolutions; see Equ. (6). represent the outputs of multi-scale convolutions and are kernel weights. We define the maxout function as and denotes its first-order derivative. The input image is and are bias vectors.

is the loss function of the whole network and are the weight and bias of the first layer in the first dense block. is the sensitivity of layer.

3.2 Stochastic Feature Reuse

Dropout [8], Drop-connect [30] and Maxout [4] provide excellent regularization methods through modifying interactions among neural units or connections between different layers in order to break co-adaptation. These techniques have been supported by subsequent research and applied in a wide range of network architectures, such as ResNets [6] and FractalNets [15]. Recent stochastic depth [10] and drop-path [15] successfully extend dropout and make impressive progress in vision tasks.

Figure 4: At a given layer of a dense block, the original DenseNets concatenate all feature maps produced by preceding layers as input. The presented Stochastic Feature Reuse uses a mask to drop features from some of these layers but guarantees that at least one set of feature maps from previous layers will be reused. The mask is randomly generated and changes for different mini-batches during training.

Motivated by these structures, we propose “Stochastic Feature Reuse” (SFR) as an effective regularizer in DenseNets to promote the generalization of networks and to overcome overfitting, especially when the growth rate is high. Fig. 4 illustrates the model of SFR. For each mini-batch, a new mask tensor obeying a Bernoulli distribution is randomly generated for each layer, and the input of that layer is modified as follows:


During training, when a set of skip connections is blocked, there is no need to perform forward and backward computations through them. Hence, these dropped features are not reused by the current layer. Since a large amount of computation is saved, SFR can speed up the convergence of the network. When testing, all features are reused in order to make use of the full-width network [10].
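The masking behaviour described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the mask is sampled per layer with an independent Bernoulli gate for each preceding layer, and the guarantee from Fig. 4 (at least one set of feature maps is reused) is enforced by re-opening one randomly chosen connection, which is an assumed repair rule. The function names, the drop probability of 0.5 and the placeholder strings standing in for feature maps are all hypothetical.

```python
import random


def sample_reuse_mask(num_preceding, drop_prob, rng):
    """Sample a 0/1 gate per preceding layer, keeping at least one open."""
    mask = [0 if rng.random() < drop_prob else 1 for _ in range(num_preceding)]
    if num_preceding > 0 and sum(mask) == 0:
        # Assumed repair rule: re-open one randomly chosen skip connection.
        mask[rng.randrange(num_preceding)] = 1
    return mask


def select_inputs(preceding_outputs, mask, training):
    """Return the feature maps a layer actually receives."""
    if not training:
        # At test time every preceding feature map is reused.
        return list(preceding_outputs)
    return [x for x, keep in zip(preceding_outputs, mask) if keep]


rng = random.Random(0)
outputs = ["x0", "x1", "x2", "x3"]  # placeholders for feature maps
mask = sample_reuse_mask(len(outputs), drop_prob=0.5, rng=rng)
print(mask, select_inputs(outputs, mask, training=True))
print(select_inputs(outputs, mask, training=False))
```

Because dropped feature maps are filtered out before the layer is evaluated, no forward or backward computation is spent on them, which is where the training-time saving comes from. A new mask would be drawn for every mini-batch.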

As a regularizer, SFR can enhance the performance of DenseNets and deal with the overfitting issue [9] through discouraging co-adaptation. In addition, SFR also implicitly trains an ensemble of DenseNets, which helps to improve the performance. For a -layer DenseNet, there are possible combinations and the final network used at the testing stage can be viewed as the average of these sub-networks.

4 Experiments

The presented MCA module and SFR regularizer are evaluated using three widely adopted benchmarks: CIFAR-10 [13], CIFAR-100 [13] and SVHN [21]. The results show that the performance of DenseNets with MCA modules is superior to the original DenseNets and that the SFR regularizer can effectively prevent overfitting.

4.1 Implementation and Training Details

In our experiments, we report the test error from the epoch with the lowest validation error, and we use the same construction and training scheme as introduced in DenseNet [9]. When evaluating the MCA module, the DenseNet part has three dense blocks, all of which have equal numbers of layers and the same growth rate. When evaluating the SFR regularizer, an additional dense block with SFR is added so that the performance of the original DenseNet is not affected. Each composite function of a dense block uses a 3×3 convolution layer with zero-padding to keep the feature map size fixed. Between two dense blocks, there are bottleneck layers with a compression factor. In this paper, we set the compression factor to 1.0 in the standard DenseNet and to 0.5 in the DenseNet with bottleneck and compression (DenseNet-BC). At the end of the last dense block, a global average pooling layer, followed by a softmax layer, is attached. The sizes of the feature maps in the three dense blocks are 32×32, 16×16, and 8×8, respectively.

Similar to the standard DenseNet [9], the DenseNets in our experiments are optimized with the first-order SGD optimizer. We train for 350 epochs on CIFAR and 40 epochs on SVHN. The initial learning rate is 0.1 and is divided by 10 at epochs 150, 225 and 300 for CIFAR and at epochs 20 and 30 for SVHN. We also add a weight decay term (0.0001) to our loss function and use Nesterov momentum [25] of 0.9 for optimization. Hinton’s Dropout [8] layer with drop probability , Batch Normalization [12] layer and He Initialization of weights [5] are applied as well.
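The step schedule for the learning rate can be written as a short Python function. The initial rate, the division by 10 and the milestone epochs come from the text; whether the drop takes effect at the start of the milestone epoch or after it, and the 0-based epoch indexing, are assumptions of this sketch, as are the function and constant names.

```python
def learning_rate(epoch, milestones, base_lr=0.1, factor=0.1):
    """Step schedule: scale base_lr by `factor` for every milestone reached."""
    # Assumption: the drop applies from the milestone epoch onwards.
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** passed


CIFAR_MILESTONES = (150, 225, 300)  # 350 training epochs in total
SVHN_MILESTONES = (20, 30)          # 40 training epochs in total

for epoch in (0, 149, 150, 225, 300):
    print(epoch, f"{learning_rate(epoch, CIFAR_MILESTONES):.4f}")
```

Under these assumptions the CIFAR runs spend 150 epochs at 0.1, 75 at 0.01, 75 at 0.001 and the final 50 at 0.0001.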

| Model | Depth | Params. | C10 (%) | C100 (%) | SVHN (%) |
|---|---|---|---|---|---|
| Stochastic Pooling [36] | - | - | 15.13 | 42.51 | 2.80 |
| Maxout Networks [4] | - | - | 11.68 | 38.57 | 2.47 |
| Network in Network [18] | - | - | 10.41 | 35.68 | 2.35 |
| Deeply Supervised Net [16] | - | - | 9.69 | | 1.92 |
| Competitive Multi-scale [17] | - | 4.48M | 6.87 | 27.56 | 1.76 |
| Highway Network [24] | - | - | - | | |
| Fractal Network [15] | 21 | 38.6M | 10.18 | 35.34 | 2.01 |
| FractalNet with Drop-path [15] | 21 | 38.6M | 7.33 | 28.20 | 1.87 |
| ResNet [6] | 110 | 1.7M | - | - | |
| Stochastic Depth [10] | 110 | 1.7M | 11.66 | 37.80 | 1.75 |
| ResNet (pre-activation) [7] | 164 | 1.7M | 11.26 | 35.58 | - |
| ResNet (pre-activation) [7] | 1001 | 10.2M | 10.56 | 33.47 | - |
| DenseNet (k=12) [9] | 40 | 1.0M | 7.00 | 27.55 | 1.79 |
| DenseNet (k=24) [9] | 100 | 27.2M | 5.83 | 23.42 | 1.59 |
| DenseNet (k=24) [9] | 53 | 7.8M | 6.45 | 24.32 | 1.78 |
| DenseNet with SFR (k=24) | 53 | 7.8M | 6.08 | 23.82 | 1.66 |
| DenseNet-BC (k=12) [9] | 100 | 0.8M | 5.92 | 24.15 | 1.76 |
| DenseNet-BC with MCA (k=12) | 100 | 0.8M | 5.41 | 24.07 | - |
| DenseNet with MCA (k=12) | 40 | 1.0M | 6.44 | 27.44 | 1.77 |
| DenseNet with MCA (k=24) | 40 | 4.2M | 5.38 | 23.78 | 1.66 |
| DenseNet with MCA (k=40) | 40 | 11.6M | 5.76 | 22.65 | 1.61 |

Table 1: Test error (%) on CIFAR and SVHN datasets; k denotes the growth rate. Contents in boldface are our competitive results. indicates that the error rate is based on datasets with data augmentations. DenseNet with MCA achieves better performance than the original under the same configuration. In particular, when the growth rate and depth are set to 24 and 40, the network obtains an excellent result (5.38%) on CIFAR-10, which is better than the original DenseNet with 100 and 53 layers. On CIFAR-100 and SVHN, our model with a growth rate of 40 achieves even better results (22.65% and 1.61%). In the DenseNet-BC structure, our MCA also has a positive impact on performance.

4.2 Datasets

The CIFAR-10 dataset [13] consists of 60,000 (50,000 for training + 10,000 for testing) natural color images of 32×32 resolution. Objects from ten classes (e.g. airplanes, automobiles, birds) have equal numbers of training and test images and are centered in these images. The CIFAR-100 dataset extends the number of classes in CIFAR-10 to 100, but each class consists of only 600 images. Due to the larger number of classes and the smaller number of samples per class, classification on CIFAR-100 is considered more challenging. The Street View House Numbers (SVHN) dataset is also a well-known benchmark in computer vision, which consists of color images of digits 0 to 9 at 32×32 resolution. There are 73,257 training, 26,032 testing, and 531,131 additional training images, respectively.

In our experiments, we apply the same normalization methods to input images as the original DenseNet. For the CIFAR datasets, we subtract the mean values and divide by the standard deviations, whereas for SVHN images, the pixel values are divided by 255. We do not use any data augmentation in the experiments, and only focus on comparing our approaches with other network models on the original datasets.
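The two normalization schemes amount to the following operations, shown here as a minimal Python sketch on a flat list of pixel intensities from one color channel. In practice the mean and standard deviation would be computed per channel over the training set; computing them from the list itself and using the population standard deviation are simplifying assumptions, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """CIFAR-style normalization: subtract the mean, divide by the std."""
    mu = mean(values)
    sigma = pstdev(values)  # population std (assumed choice)
    if sigma == 0:
        raise ValueError("cannot standardize a constant channel")
    return [(v - mu) / sigma for v in values]


def scale_to_unit(values):
    """SVHN-style normalization: map 8-bit intensities into [0, 1]."""
    return [v / 255 for v in values]


print(standardize([0, 255]))
print(scale_to_unit([0, 51, 255]))
```

Standardization centers each channel around zero with unit variance, whereas division by 255 only rescales the range and keeps all values non-negative.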

4.3 Results and Discussion

We train our networks with different depths (40, 53, 100) and growth rates (12, 24, 40) and compare our approach with other well-known models on CIFAR-10, CIFAR-100 and SVHN; see Table 1.

Figure 5: Test accuracy on CIFAR-10. All structures consist of the MCA module. Left: comparison between DenseNets with and without SFR regularizer. Our dropout has a constant drop rate of . Right: comparison of three aggregation patterns, which shows that adaptive fusion is more powerful and representative.

Multi-scale Convolution Aggregation.

To better evaluate our novel module, we train different patterns of aggregation on CIFAR-10 and test the best model on CIFAR-10, CIFAR-100 and SVHN. The performance of our structure with different settings on the three benchmarks is shown at the bottom of Table 1. With relatively few parameters (4.2M), it obtains the lowest classification error rate on CIFAR-10 (5.38%) and CIFAR-100 (23.78%), and the second-best result on SVHN (1.66%). With a growth rate of 40 and depth = 40, our model achieves impressive results on CIFAR-100 (22.65%) and SVHN (1.61%). This demonstrates that our MCA module has a much higher representational capacity and is able to preserve abundant information from multi-scale convolutions. This is crucial for preventing overfitting and promoting generalization ability.

Fig. 5 (right) compares different aggregation patterns for fusing multi-scale convolution information (, depth = 40). The aggregation parameterized by gating weights achieves the best performance with only four parameters added. Its success may be attributed to the following factors:

Factor 1: Aggregating different scales with trainable weights is more flexible and representative than aggregating with hand-crafted weights. During SGD, the weights of different kernel sizes are treated independently and adaptively. Since pixels at different distances from the central point should have different importance, this strategy can preserve richer multi-scale information (texture, edges, corners, etc.) while using far fewer parameters than simply concatenating them.

Figure 6: The variation of aggregation weights during training on different datasets. The optimal weights for different tasks are different. Under Stochastic Gradient Descent (SGD), the model adaptively controls the flow of multi-scale information so that the scales with high discriminative power are preserved whereas the redundant ones are suppressed.

More importantly, for vision tasks with different complexity, the weights of the gating units may follow similar trends during training, but often converge to different final values; see Figure 6. This suggests that the optimal scale for convolutions can differ between datasets. For instance, in CIFAR tasks, the module assigns high weights ( and ) to fine-scale features, whereas less coarse-scale information is delivered to the subsequent DenseNet. On the other hand, for the SVHN dataset, the weight has a much higher relative value than for the other two datasets, whereas the weight is almost 0. This observation suggests that, for simple digit classification tasks, coarse-scale features extracted by convolution are more important than in other, more complicated tasks. To further demonstrate this point, we also run our module on another simple dataset, MNIST, and obtain a similar observation ( for MNIST vs. for CIFAR-10).

Factor 2: The combination of three dominant joining methods (summation, maxout and concatenation) makes our model highly non-linear and capable of effectively aggregating multi-scale representations. Each joining method has its own advantages. The combination of different approaches is also studied in [26], which reports better performance. With two maxout units, the units of the aggregation layer compete strongly, which is beneficial for training and optimizing deep networks. The two branches of the fine-scale and coarse-scale aggregations enhance scale invariance.

Stochastic Feature Reuse.

We evaluate SFR on the same three datasets and compare it with the original DenseNet with depth 53 and growth rate 24. The additional dense block with SFR is placed at the front or at the end of the original DenseNet; see Table 2 for details. The comparison shows that placing the additional dense block with SFR at the end of the DenseNet yields lower error rates on all three datasets. We attribute the accuracy improvement to the fact that SFR randomly generates a new sub-network with a different propagation path for each mini-batch and implicitly trains an ensemble of different networks. This kind of dropout can break up the co-adaptation among reused features and prevent overfitting. On the other hand, adding the additional dense block with SFR to the front of the DenseNet actually hurts performance, since it makes the shallow layers too narrow to pass sufficient information. In addition, we observe that SFR should be used together with Hinton’s Dropout, without which the accuracy also degrades.

| Model | Block index | Error (%) | Dataset |
|---|---|---|---|
| DenseNet [9] | None | 6.45 | CIFAR-10 |
| SFR | | 6.08 | CIFAR-10 |
| SFR | | 8.99 | CIFAR-10 |
| SFR (No Dropout) | | 10.00 | CIFAR-10 |
| DenseNet [9] | None | 24.32 | CIFAR-100 |
| SFR | | 23.82 | CIFAR-100 |
| SFR | | 26.54 | CIFAR-100 |
| SFR (No Dropout) | | 27.15 | CIFAR-100 |
| DenseNet [9] | None | 1.78 | SVHN |
| SFR (No Dropout) | | 3.02 | SVHN |

Table 2: Test error of DenseNets trained with stochastic feature reuse on different datasets without data augmentation. SFR is more powerful when placed at the end of DenseNets.

Another observation is that our method is more effective on wider DenseNets, as narrow networks in general do not have serious co-adaptation issues or long training times. The main ways of widening a DenseNet are using a larger growth rate or increasing the number of channels of the initial layer. Hence, to illustrate the impact of different growth rates on the performance of SFR, we first evaluate on CIFAR-10 with three growth rates: 12, 24 and 40. Moreover, the case of a wider initial layer is also considered. Here we expand the initial convolution layer four times via four-scale convolutions; the training results are shown in Fig. 5 (left). Table 3 shows the results under the different cases. The SFR test error increases to 6.32% when the growth rate rises to 40, as a very large bandwidth causes slight overfitting. Using SFR with a high drop probability addresses this issue.

| Model | Width | w.o. SFR | w. SFR | Improve |
|---|---|---|---|---|
| SFR (k=12) | 17196 | 6.93 | 6.80 | 0.13 |
| SFR (k=24) | 34392 | 6.45 | 6.08 | 0.37 |
| SFR (WIL) | 34536 | 6.09 | 5.76 | 0.33 |
| SFR (k=40) | 57320 | 6.53 | 6.32 | 0.21 |

Table 3: Test error (%) with or without SFR under different growth rates (k) and a wider initial layer on CIFAR-10. WIL means “Wider Initial Layer”. Width is the maximal number of channels of all layers in DenseNet. SFR yields the largest improvement on CIFAR-10 when the growth rate is set to 24.

5 Conclusions

A novel network module, referred to as Multi-scale Convolution Aggregation, is presented in this paper. It consists of 4 groups of multi-scale convolutions, cross-scale aggregation parametrized by 4 trainable weights, and 2 maxout units that produce 2 branches of feature maps representing smaller and larger receptive fields, respectively. In our experiments, DenseNets with our new module obtain excellent performance while requiring substantially fewer parameters than when using the traditional inception module. Instead of simple equal-weighted aggregation, our aggregation employs a self-adaptive strategy to control the information flow of the convolution filters. It automatically optimizes these weights according to the vision task at hand. Trainable aggregation makes effective use of multi-scale convolutions and is the key to reducing parameters, whereas maxout strengthens the competition among units in the fine-scale and coarse-scale branches. The combination of three joining methods (concatenation, summation and maxout) makes the networks highly non-linear.

In addition, a Stochastic Feature Reuse strategy is presented for training deep DenseNets effectively and efficiently. This regularizer samples a new subnet of the basic DenseNet for each mini-batch during training but reuses all feature maps produced by preceding layers at test time. Our method enhances the performance of DenseNets by breaking the co-adaptation among reused features and by implicitly training an ensemble of subnets with different widths. Being a simple and easy-to-apply approach, SFR is more useful for wider DenseNets with a larger growth rate and can effectively alleviate the difficulties of training wide networks.

For future work, we would like to explore applications of the MCA module in other prominent deep architectures, as we believe MCA can be beneficial by introducing scale-invariance information without adding feature redundancy. In addition, when evaluating SFR, we empirically use a constant drop probability. It would be interesting and meaningful to explore other configurations of the drop probability in future experiments.