In this paper, we present a novel deep learning approach, deeply-fused nets. The central idea of our approach is deep fusion, i.e., combining the intermediate representations of base networks, where the fused output serves as the input of the remaining part of each base network, and performing such combinations deeply over several intermediate representations. The resulting deeply fused net enjoys several benefits. First, it is able to learn multi-scale representations, as it enjoys the benefits of more base networks than the initial group: other groups of base networks could form the same fused network. Second, in our suggested fused net formed by one deep and one shallow base network, the flows of information from the earlier intermediate layer of the deep base network to the output and from the input to the later intermediate layer of the deep base network are both improved. Last, the deep and shallow base networks are jointly learnt and can benefit from each other. More interestingly, the essential depth of a fused net composed from a deep base network and a shallow base network is reduced because the fused net could be composed from a less deep base network, and thus training the fused net is less difficult than training the initial deep base network. Empirical results demonstrate that our approach achieves superior performance over two closely related methods, ResNet and Highway, and competitive performance compared to the state of the art.
Deep neural networks have become popular again since the breakthrough performance in ImageNet classification. In the past few years (from ), the top- classification accuracy on the -class ImageNet dataset has increased from  to . Besides, deep neural networks have been shown to perform very well in other vision applications, such as object detection [4, 5], image segmentation, edge detection, and so on.
Nevertheless, the fundamental problem, learning a deep hierarchical structure effectively and efficiently, remains a challenge and has been attracting considerable research effort. Dropout and other regularization techniques, such as weight decay and path regularization, have been developed to prevent neural networks from over-fitting. Normalized variance-preserving weight initialization, such as [10, 11, 12, 13], has been shown to improve both the training speed and the recognition performance. Skip-layer connections between layers (including the output layer) and other network structure modifications, such as deeply-supervised nets and its variant, Highway, ResNet, and inception-v, are able to improve the flow of information and accordingly help train a very deep network. The teacher-student framework shows that learning a deep network can benefit from an already-trained network that is relatively easy to learn, e.g., FitNets and Net2Net.
In this paper, we introduce a deep fusion approach and present a deeply-fused neural net formed by combining a group of base networks. The main idea is to perform fusion over the intermediate representations of the base networks, rather than only over the final representations or the final classification scores; the fused output serves as the input of the remaining part of each base network, and such fusions are performed several times at different intermediate layers. The fused net has a block-exchangeable property (a block is the subnetwork between two successive fusions in a base network): switching blocks from one base network to another within one fusion results in two different base networks with possibly different depths from the originals (e.g., the deep network becomes less deep and the shallow network less shallow), but the fused net is not changed. In other words, a fused net can be formed by different groups of base networks. Thus, the deeply-fused net is able to learn multi-scale representations from many more base networks, and even same-scale representations can be different and learnt from different base networks.
There is one more benefit from deep fusion: the flow of information is improved. Consider the case where one base network is very deep but the other is not, which is the choice we suggest. An earlier intermediate layer in the deeper base network might have a shorter path through the other base network to the output, which implies that the supervision can be quickly propagated to the earlier intermediate layer. On the other hand, a later intermediate layer might also have a shorter path from the input, which indicates that the input can quickly flow to the later intermediate layer. As a result, training the fused net composed from a very deep base network is less difficult than training the very deep base network itself. Furthermore, the deep and shallow base networks are jointly learnt and can benefit from each other. We also show that the recently developed networks, Highway and ResNet, can be viewed as specific examples of deep fusion. Empirical results demonstrate that our approach achieves superior performance over the plain network, the naive network fusion method, ResNet and Highway, and competitive performance compared to the state of the art.
The past few years have witnessed the rapid and great progress of deep neural networks in various aspects, from optimization techniques as well as initialization, regularization, activation and pooling functions, network structure design, to applications. In this section, we mainly discuss two closely-related lines: network structure design and network optimization with the aid of another already-trained network.
Averaging over a set of network predictors, which we call decision fusion, is able to improve the generalization accuracy and has been widely used, e.g., to boost the ImageNet recognition performance [1, 19, 20, 16]. Multi-column deep neural networks present an empirical study of decision fusion, later extended to an adaptive version: weighted averaging with the weights depending on the input. The averaging approach learns each network separately, which is equivalent to jointly learning the networks with a loss that averages their loss functions. Our approach, in contrast, performs the feature fusion deeply over several intermediate layers and simultaneously learns the representations of the (base) networks.
The inception module in GoogLeNet can be viewed as a fusion stage: it concatenates the outputs of several subnetworks with different lengths. This differs from our approach, which uses summation for fusion. The GoogLeNet architecture, consisting of a sequence of inception modules, is also a kind of deep fusion, i.e., deep concatenation fusion, but it is not as direct as our deep summation fusion. The output of each subnetwork in an inception module is narrower than the input of the subsequent inception module. Hence, to form the fused network, it is necessary either to append many all-zero channels to the output so that its size matches the input of the subsequent inception module, or to add more convolution operations. Skip-layer connections, such as deeply-supervised nets and its variant, Highway, and ResNet, resemble our approach and, as we will show, can be regarded as special examples of it.
The teacher-student framework suggests that learning a hard-to-train network can benefit from an easily trained network. For instance, FitNets uses the intermediate representation of a wider and shallower (but still deep) teacher net, which is relatively easy to train, as the target of the intermediate representation of a thinner and deeper student net. Net2Net also uses a teacher net to help train a (wider or deeper) student net, through a function-preserving transform that initializes the parameters of the student net according to the parameters of the teacher net. Our approach, in our suggested choice of one deep base network and one shallow (but possibly still deep) base network, also uses the shallow network to help train the deep base network; meanwhile, the deep base network also helps train the shallow network, i.e., they benefit from each other and are trained simultaneously.
A feedforward network typically consists of $L$ representation extraction layers, $\{F_l\}_{l=1}^{L}$, followed by a classification layer, e.g., a fully-connected layer and a linear classifier. The $l$th layer applies a nonlinear transformation (parameterized by $\theta_l$), e.g., a linear convolution followed by a nonlinear activation function:
$$x_l = F_l(x_{l-1}; \theta_l),$$
where $x_{l-1}$ is the input of the $l$th layer and also the output of the $(l-1)$th layer, or is the input $x_0$ of the whole network if $l = 1$. The representation extraction part of the network can be written in a function form, $x_L = H(x_0)$, where $H = F_L \circ F_{L-1} \circ \dots \circ F_1$.
Network fusion is a process of combining multiple base networks, e.g., $K$ base networks $\{H^k\}_{k=1}^{K}$ (for simplicity, we only use the representation extraction part to describe a full network). Conventional fusion in general includes two approaches: feature fusion, which fuses the representations extracted from the networks, followed by a classification layer; and decision fusion (a.k.a. model ensemble), which fuses the classification scores computed from the networks. This paper focuses on feature fusion, which, for a network input $x_0$, can be formulated in the function form $H(x_0) = \Phi(H^1(x_0), \dots, H^K(x_0))$, where the fusion function $\Phi$, in this paper, is the sum of the representations,
$$\Phi(H^1(x_0), \dots, H^K(x_0)) = \sum_{k=1}^{K} H^k(x_0).$$
The fusion function could be in other forms, e.g., concatenation or maximization, and we will discuss them later.
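The distinction between the two conventional fusion approaches can be illustrated with a small NumPy sketch. It is only an illustration: the function names and the toy vectors are assumptions made here, and the feature vectors stand in for the final representations of two base networks.

```python
import numpy as np

def feature_fusion_sum(features):
    """Feature fusion: sum same-shaped representations from the base networks."""
    return np.sum(np.stack(features, axis=0), axis=0)

def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()

def decision_fusion_average(score_vectors):
    """Decision fusion: average the class probabilities of the base networks."""
    return np.mean([softmax(s) for s in score_vectors], axis=0)

# Toy final representations of two base networks (hypothetical values).
feat_a = np.array([1.0, 2.0, 3.0])
feat_b = np.array([0.5, -1.0, 4.0])
fused_feature = feature_fusion_sum([feat_a, feat_b])

# Toy classification scores of the same two networks (hypothetical values).
scores_a = np.array([2.0, 0.0, -1.0])
scores_b = np.array([1.0, 1.5, 0.0])
fused_decision = decision_fusion_average([scores_a, scores_b])
```

In feature fusion a single classifier would then be applied to `fused_feature`, whereas decision fusion combines the outputs of separate classifiers.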
Deep fusion performs feature fusion not only over the final feature representation but also over the intermediate feature representations. The forward process and backward propagation are presented as follows.
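Forward process. A network with $L$ feature extraction layers is divided into $B$ blocks, where a block $G_b$ is a sequence of several nonlinear transformations,
$$G_b = F_{l_b} \circ F_{l_b - 1} \circ \dots \circ F_{l_{b-1}+1},$$
or simply an identity connection. Here $F_l$ denotes the $l$th layer and $l_b$ the index of the last layer in the $b$th block, with $l_0 = 0$. Deep fusion is a process of fusing $K$ networks (assuming each consists of $B$ blocks) with $B$ summation fusion operations:
$$x_b = \sum_{k=1}^{K} G_b^k(x_{b-1}), \tag{4}$$
where $b = 1, \dots, B$, $G_b^k$ is the $b$th block of the $k$th base network, $x_0$ is the network input, $x_b$ is the output of the $b$th fusion, and it is assumed that the output sizes of the blocks are the same. Figure 1 illustrates the difference between shallow and deep fusions.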
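The forward process can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments: the blocks are assumed to be stacks of random linear maps followed by ReLU acting on vectors, the block depths are arbitrary, and fusion is applied after the ReLU for simplicity.

```python
import numpy as np

def make_block(rng, dim, depth):
    """A block: `depth` layers of (linear map + ReLU) with random placeholder weights."""
    weights = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(depth)]

    def block(x):
        for w in weights:
            x = np.maximum(w @ x, 0.0)
        return x

    return block

def deep_fusion_forward(base_networks, x0):
    """base_networks[k][b] is block b of base network k.

    Every fusion sums the block outputs, and the fused result is the
    shared input of the next block in every base network.
    """
    num_blocks = len(base_networks[0])
    x = x0
    fused_outputs = []
    for b in range(num_blocks):
        x = sum(net[b](x) for net in base_networks)
        fused_outputs.append(x)
    return fused_outputs

rng = np.random.default_rng(0)
dim = 8
deep_net = [make_block(rng, dim, depth=4) for _ in range(3)]     # deeper base network
shallow_net = [make_block(rng, dim, depth=1) for _ in range(3)]  # shallower base network
x0 = rng.standard_normal(dim)
outputs = deep_fusion_forward([deep_net, shallow_net], x0)
```

Shallow feature fusion would instead run each base network to the end on `x0` and sum only the two final outputs.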
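Backward propagation. Gradient back-propagation is the same as the conventional back-propagation. Here, we present the back-propagation form with respect to the input and the output of each fusion, i.e., over the blocks. According to the definition, the gradient of the fusion output $x_b$ with respect to its input $x_{b-1}$, called the fused block gradient, can be computed as follows,
$$\frac{\partial x_b}{\partial x_{b-1}} = \sum_{k=1}^{K} \frac{\partial G_b^k(x_{b-1})}{\partial x_{b-1}},$$
which intuitively means that the gradient is the summation of the block gradients $\frac{\partial G_b^k(x_{b-1})}{\partial x_{b-1}}$, where a block gradient is the gradient of the $b$th block $G_b^k$ of the $k$th base network. Suppose the loss function is $\ell(\hat{y}, y)$, with $\hat{y}$ and $y$ being the estimated and ground-truth labels. By the chain rule, the gradient with respect to the hidden response $x_b$ using back-propagation can be computed in the following form,
$$\frac{\partial \ell}{\partial x_b} = \frac{\partial \ell}{\partial x_B} \prod_{i=b+1}^{B} \frac{\partial x_i}{\partial x_{i-1}} = \frac{\partial \ell}{\partial x_B} \prod_{i=b+1}^{B} \sum_{k=1}^{K} \frac{\partial G_i^k(x_{i-1})}{\partial x_{i-1}},$$
where $B$ is the number of fusions and the factors of the product are ordered from $i = B$ down to $i = b+1$.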
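The statement that the fused block gradient is the sum of the block gradients can be checked numerically. The sketch below is an illustration under assumptions made here: each block is a single `tanh` layer with random weights, and the analytic Jacobian of the fused output is compared against a central finite-difference estimate.

```python
import numpy as np

def tanh_block(w):
    """Return a block x -> tanh(w @ x) together with its analytic Jacobian."""
    def forward(x):
        return np.tanh(w @ x)

    def jacobian(x):
        # d tanh(w x) / dx = diag(1 - tanh(w x)^2) @ w
        return (1.0 - np.tanh(w @ x) ** 2)[:, None] * w

    return forward, jacobian

def numerical_jacobian(fn, x, eps=1e-6):
    """Central finite differences; column i holds d fn / d x[i]."""
    columns = []
    for i in range(x.size):
        step = np.zeros(x.size)
        step[i] = eps
        columns.append((fn(x + step) - fn(x - step)) / (2.0 * eps))
    return np.stack(columns, axis=1)

rng = np.random.default_rng(1)
dim = 5
blocks = [tanh_block(rng.standard_normal((dim, dim))) for _ in range(3)]
x = rng.standard_normal(dim)

def fused(v):
    """Summation fusion of the three blocks applied to the same input."""
    return sum(forward(v) for forward, _ in blocks)

analytic = sum(jacobian(x) for _, jacobian in blocks)  # sum of block gradients
numeric = numerical_jacobian(fused, x)                 # gradient of the fused output
max_abs_diff = float(np.max(np.abs(analytic - numeric)))
```

The two Jacobians agree up to finite-difference error, as expected from the linearity of differentiation.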
Base network selection. From the above descriptions, we can see that the computational complexity of both the forward and backward processes is almost equal to the total complexity of all the base networks, since the element-wise addition cost is negligible. Therefore, deep fusion introduces no additional parameters and does not increase the computational complexity.
Deep fusion typically chooses a very deep network and a shallower (but possibly also deep) network in which each block contains only a convolution layer, with a few blocks/fusions (for example, each block corresponds to one scale). Consequently, the computational complexity is approximately equal to that of the very deep network. This nice property makes deep fusion comparable in cost to the plain network (i.e., the deep one used for deep fusion), Highway, and ResNet, which includes some non-identity connections whose extra cost is similar to ours. In addition, training the fused net formed from a very deep base network and a shallow base network is less difficult than training the initial very deep base network: as will be shown later, the fused net could be composed from a less deep base network and a less shallow base network, so the essential depth of the fused net is reduced.
Such a choice resembles the teacher-student framework (e.g., FitNets and Net2Net), where the student network is learnt from the already-trained teacher network. But in our approach, the teacher (shallow) and student (deep) networks are jointly learnt and benefit from each other. The vanishing gradient problem, if it is serious for the deep base network, is alleviated, according to the gradient back-propagation shown in Equation (8). Besides, we will show that such a choice is advantageous for the flow of information.
High capability of combining multi-scale representations. A deeply-fused net composed from a group of base networks can also be composed from another group of different base networks, as shown below. Considering the mathematical formulation of deep fusion, we can rewrite Equation (4) into an equivalent form in which the blocks of two base networks are exchanged within one fusion, which does not change the fused net. This property is called block exchangeability. It can be regarded as changing the first two base networks into two other base networks, each containing the block taken from the other. Similarly, we can obtain more base networks that result in the same fused net. The number of unique combinations of base networks has a large upper bound (in practice the number will be smaller than this bound, but it is still very large). Figure 2 shows an example to illustrate this block-exchangeable property.
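Block exchangeability can be demonstrated with a short NumPy sketch. The setup is assumed for illustration only: two base networks with three single-layer blocks each, random weights, and summation fusion after every block. Swapping the middle blocks yields two different base networks but the same fused output.

```python
import numpy as np

def linear_relu(w):
    """A single-layer block: linear map followed by ReLU."""
    return lambda x: np.maximum(w @ x, 0.0)

def fused_forward(networks, x):
    """Summation fusion after every block; networks[k][b] is block b of network k."""
    for b in range(len(networks[0])):
        x = sum(net[b](x) for net in networks)
    return x

rng = np.random.default_rng(2)
dim, num_blocks = 6, 3
net_a = [linear_relu(rng.standard_normal((dim, dim))) for _ in range(num_blocks)]
net_b = [linear_relu(rng.standard_normal((dim, dim))) for _ in range(num_blocks)]

# Exchange the middle block between the two base networks.
swapped_a = [net_a[0], net_b[1], net_a[2]]
swapped_b = [net_b[0], net_a[1], net_b[2]]

x0 = rng.standard_normal(dim)
original = fused_forward([net_a, net_b], x0)
exchanged = fused_forward([swapped_a, swapped_b], x0)
same_fused_net = bool(np.allclose(original, exchanged))
```

If the exchanged blocks had different depths, `swapped_a` and `swapped_b` would also differ in depth from `net_a` and `net_b`, which is how the same fused net covers base networks of many depths.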
Each possible base network, in our implementation, outputs a feature map in which each element corresponds to a receptive field whose size (scale) depends on the base network. The fused net can be formed from many base networks, and thus there exist many receptive fields of various sizes. In addition, two base networks with the same receptive field size may have different extraction processes, and thus their extracted representations differ and are able to capture different characteristics.
Improvement of the information flow. We show that an earlier intermediate layer might have a shorter path to the output layer. Consider an early intermediate representation. The shortest path from it to the final feature representation is obtained by choosing, for each subsequent fused block, the smallest block among the base networks. This implies that the path from the intermediate layer in the deeper base network to the output becomes shorter, and thus the supervision can quickly flow to an early intermediate layer.
Similarly, a later intermediate layer may have a shorter path from the input layer, indicating that the input information can be quickly fed into a later intermediate layer instead of through a long path. This benefit to the network learning is in some sense related to relay back-propagation , which explicitly fixes the earlier layers (something like directly connecting the input to the later layers) when updating the later layers. In summary, the flow of information from the input to the intermediate layers and from the intermediate layers to the output are both improved, which is beneficial to training a deep network.
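The shortening of paths can be made concrete by counting layers. The sketch below is illustrative only: the block depths are hypothetical, and path length is measured as the number of layers traversed when the smallest block is taken at every fusion.

```python
def shortest_path_lengths(block_depths, fusion_index):
    """Shortest paths (in layers) around the output of a given fusion.

    block_depths[k][i] is the number of layers in block i of base network k.
    fusion_index counts fusions from 1; the returned pair is
    (input -> fusion output, fusion output -> final representation).
    """
    num_blocks = len(block_depths[0])
    smallest = [min(net[i] for net in block_depths) for i in range(num_blocks)]
    return sum(smallest[:fusion_index]), sum(smallest[fusion_index:])

deep_depths = [6, 6, 6]      # hypothetical deep base network, 18 layers
shallow_depths = [1, 1, 1]   # hypothetical shallow base network, 3 layers

fused_from_input, fused_to_output = shortest_path_lengths(
    [deep_depths, shallow_depths], fusion_index=1)
plain_from_input, plain_to_output = shortest_path_lengths(
    [deep_depths], fusion_index=1)
```

With these assumed depths, the representation after the first fusion is 2 layers from the output in the fused net but 12 layers from it in the plain deep network.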
Concatenation, Maximization, and Summation. With summation fusion in the intermediate layers, there is almost no change for each base network: the network structures are not changed. The only effect is that the output is changed, with some signals added from other networks. Maximization fusion, which performs an element-wise maximization over the block outputs, likewise keeps the output size of each block. Concatenation fusion, e.g., an inception module in GoogLeNet, is different. The base networks have to be changed and more parameters are needed: the input size of the sub-network immediately after the fusion in each base network is increased, as the fusion output becomes larger (or, in the original base network, many all-zero channels are appended to the output of a block so that the total output matches the size of the input of the subsequent inception module). The combination of concatenation fusions and summation fusions has also been studied and shows better ImageNet performance.
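The effect of the three fusion operations on the output size can be seen in a short NumPy sketch. The channel-first layout `(channels, height, width)` and the toy feature maps are assumptions made for illustration.

```python
import numpy as np

def fuse(block_outputs, mode):
    """Fuse same-shaped feature maps of shape (channels, height, width)."""
    if mode == "sum":
        return np.sum(np.stack(block_outputs, axis=0), axis=0)
    if mode == "max":
        return np.max(np.stack(block_outputs, axis=0), axis=0)
    if mode == "concat":
        # Concatenate along the channel axis: the channel count grows.
        return np.concatenate(block_outputs, axis=0)
    raise ValueError(f"unknown fusion mode: {mode}")

out_a = np.ones((16, 8, 8))        # toy output of a block in one base network
out_b = 2.0 * np.ones((16, 8, 8))  # toy output of the matching block in another

shapes = {mode: fuse([out_a, out_b], mode).shape for mode in ("sum", "max", "concat")}
```

Summation and maximization leave the shape at `(16, 8, 8)`, so the next block is unchanged, while concatenation produces `(32, 8, 8)` and the next block must accept twice as many input channels.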
Relation to Deeply-Supervised Nets. The deeply supervised net estimates the network parameters through optimizing multiple losses, some of which come from the intermediate layers. The formulation can be written as a sum of loss terms, one defined on the classifier of the full network and the others on classifiers attached to subnetworks, where each subnetwork consists of the part of the full network from the input to an intermediate layer. A similar network used for edge detection is shown to be able to combine multi-scale information. This formulation can be interpreted as a shallow decision/loss fusion process: it combines networks whose parameters are shared across the networks, as illustrated in Figure 3(b).
In addition, we show that the unidirectional version of the deeply fused network is closely related to the deeply-supervised net with weight sharing for the classification layers. The unidirectional deeply fused network combines two networks such that the signal flows from one network to the other but not in the reverse direction, and a classification layer is defined over the resulting representation.
Figure 3(c) shows an example of unidirectional deep fusion. Compared with the deeply-supervised net in Figure 3(a), we can observe that the unidirectional deep fusion uses progressive feature fusion, while deep supervision in deeply-supervised nets uses loss fusion.
Relation with Highway and ResNet. Skip-layer connection means that a layer can take not only the layer at the previous level as input but also some of the lower layers. It resembles deep fusion and in some sense they can be equivalently transformed to each other. Here we consider two recently-developed network examples: Highway  and ResNet .
Highway and ResNet can be viewed as combining two networks: one is deep, called a plain network, and the other is shallow, consisting of a sequence of virtual layers (identity connections) and possible extra layers, which are just down-sampling pooling layers (the same as in the plain network) in Highway, and projection layers in ResNet (e.g., the blue box denotes a linear projection in Figure 4). Figure 4 illustrates that ResNet is an example of a deeply-fused net when the number of base networks is two. Similarly, Highway can be transformed to deep fusion; the difference is that the fusion in Highway is a weighted sum whose weights are data-dependent, computed by the transform gate.
The blocks in the deep/plain network in Highway and ResNet are typically small, consisting of - layers, and thus there are many blocks/fusions, while in our approach a smaller number of fusions is suggested. We use non-identity connections to form the blocks in the shallow network: usually one block per scale (possibly more in a very deep network), and no extra layer except the pooling layer across scales; ResNet, in contrast, uses non-identity connections to match the size, which changes across scales in the network. Thus the numbers of parameters in ResNet and in our approach are similar, and both are smaller than that in Highway.
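The correspondence with ResNet and Highway can be illustrated on a single unit. The sketch below is an assumption-laden toy: the plain block is one random linear layer with ReLU, the shallow network contributes an identity block, and the Highway transform gate is replaced by a constant gate of 0.5 so that the weighted sum can be compared directly with the unweighted one.

```python
import numpy as np

def fused_unit(x, blocks):
    """One summation fusion over the blocks of several base networks."""
    return sum(block(x) for block in blocks)

def residual_unit(x, plain_block):
    """ResNet-style unit: plain block output plus the identity shortcut."""
    return plain_block(x) + x

def highway_unit(x, plain_block, gate):
    """Highway-style unit: gated (weighted) sum of the plain block and the input."""
    t = gate(x)
    return t * plain_block(x) + (1.0 - t) * x

rng = np.random.default_rng(3)
dim = 4
w = rng.standard_normal((dim, dim))

def plain_block(v):
    return np.maximum(w @ v, 0.0)  # block of the deep (plain) base network

def identity_block(v):
    return v                       # block of the shallow base network

x = rng.standard_normal(dim)
fusion_out = fused_unit(x, [plain_block, identity_block])
resnet_out = residual_unit(x, plain_block)
highway_out = highway_unit(x, plain_block, gate=lambda v: np.full(v.shape, 0.5))
```

Here `resnet_out` equals `fusion_out`, and the constant-gate Highway unit equals half of it; a learnt gate would make the weights depend on the input.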
|Layer name||Output size||Parameters||Repeat times|
|max pool, stride|
|max pool, stride|
|avg pool, stride|
We evaluate our approach on the CIFAR-10 and CIFAR-100 datasets. Both are subsets drawn from the 80-million tiny image database. The CIFAR-10 dataset consists of 60,000 32×32 colour images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. The CIFAR-100 dataset is like CIFAR-10, except that it has 100 classes, each containing 600 images. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs).
The architecture of the base network is built upon basic units: the convolution layer, the nonlinear ReLU activation function, the max-pooling layer, the fully-connected layer, and the softmax layer for training, which are intentionally chosen to directly show the benefits of deep fusion. We also use a batch normalization layer right after each convolution layer. The details of the network architectures used in our experiments are presented in Table 1 (base networks) and Table 2 (block division).
We train the networks using the SGD algorithm, with weight decay regularization, the momentum set to , the mini-batch size set to , and the maximum number of epochs set to . An exponentially decaying learning rate is used: the learning rate is reduced by a factor of  after the th, th, and th epochs. The results of our approach, the baseline algorithms, and the plain network are reported with the weight decay coefficient and the initial learning rate tuned in the ranges:  and  for  layers, and  and  for  layers. The weights are initialized using the scheme . Experiments are conducted using Caffe. The datasets are preprocessed using a common setting, as described in , including global contrast normalization, zero padding of four pixels at all borders with random crops, and random horizontal flipping.
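The padding, cropping, and flipping steps of this common augmentation setting can be sketched as follows. This is an illustration, not the Caffe pipeline used in the experiments: the function name is hypothetical, the crop size of 32 is an assumption taken from the CIFAR image size, the flip probability of 0.5 is an assumption, and global contrast normalization is omitted.

```python
import numpy as np

def pad_crop_flip(image, rng, pad=4, crop=32, flip_prob=0.5):
    """Zero-pad an (H, W, C) image, take a random crop, and randomly flip it."""
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    # Upper bounds are exclusive, so add 1 to allow the last valid offset.
    top = rng.integers(0, padded.shape[0] - crop + 1)
    left = rng.integers(0, padded.shape[1] - crop + 1)
    patch = padded[top:top + crop, left:left + crop]
    if rng.random() < flip_prob:
        patch = patch[:, ::-1]  # reverse the width axis: horizontal flip
    return patch

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # stand-in for one CIFAR image
augmented = pad_crop_flip(image, rng)
```

The output has the same shape as the input, so the augmented images can be fed to the network without any change to the architecture.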
Convergence curve. Figure 5 shows the training error and the testing error of a fused net, N13N33, composed of two base networks, N13 and N33 (see Table 2), and of a plain network, N13, the deeper base network in the fused net. It can be observed that both the training error and the testing error of deep fusion are lower than those of the plain network in the later iterations, and that deep fusion reaches the same error as the plain network using far fewer training steps.
|Deep concatenation fusion|
|Deep max fusion|
|Decision fusion (separate train)|
|Decision fusion (joint train)|
|Deep summation (fusion before ReLU)|
|Deep summation (fusion after ReLU)|
|Decision fusion (separate train)|
Fusion. We present the results from our deep summation fusion, deep max fusion, deep concatenation fusion (the channel size of the convolution layer right before the pooling layer is reduced by half in order to match the size of the subsequent network), shallow feature fusion (the feature fusion is only performed at the final output of C3), and decision fusion with the networks jointly trained and with each network separately trained, where the base networks are N13 and N33 for  layers, and N26 and N46 for  layers. The comparison is given in Table 3. It can be seen that our approach achieves superior results. There is a slight performance difference between conducting the fusion before and after ReLU, and our other experiments are reported with fusion before ReLU. Interestingly, the performance of decision fusion with separate training for  layers on CIFAR-100 is also good, but its performance on CIFAR-100 for  layers deteriorates dramatically. In contrast, our approach performs similarly for  and  layers, which shows that our approach can indeed help train a very deep network.
Performance with different #blocks. We empirically study the performance with different numbers of blocks. We consider three fused nets, N13N33, N16N46, and N18N58, in which the base networks have , , and  blocks respectively, and the deep base network has  layers (see Table 2 for more details). The comparison given in Table 4 shows that the performances with different #blocks are close, and that on the challenging CIFAR-100 dataset, more blocks result in worse performance. Thus, we will use  blocks to compare with the baseline networks, as the computational complexity is almost the same as that of the plain network.
We compare our approach with three baseline networks: the plain network, Highway, and ResNet. The three baseline networks use the same plain network N1 with  layers, and our approach uses the plain network as the deeper base network together with another base network, N33, to form the deeply fused net, N13N33. The number of parameters as well as the computation cost of our approach and ResNet are almost the same, as (1) ResNet includes  residual connections for the -layer network, among them two non-identity connections to match the dimensions, and (2) there are three blocks in the shallow network in our approach, of which the first is small and thus computationally negligible, and the other two are almost the same as the two non-identity connections in ResNet. Highway includes more parameters, as it introduces transform gates, and hence is computationally more expensive. In addition to the results from our implementation of the plain network, Highway (using batch normalization without dropout), and ResNet (batch normalization used before ReLU), we also report the results of Highway and ResNet from the original papers.
|Highway (our implementation with dropout)|
|Highway (our implementation without dropout)|
|ResNet (our implementation)|
The results on CIFAR-10 and CIFAR-100 are shown in Table 5 and Table 6. Compared with Highway, which uses the gate to select only a part of the plain network for prediction, our approach uses a small network to help train the whole plain network, resulting in better performance. ResNet uses identity-connection blocks over the layers with the same dimension and non-identity blocks over the layers with different dimensions (which appear only when the scale changes in our -layer network), leading to too many blocks. In comparison, our approach uses non-identity connections to form blocks that do not cross scales, and has fewer blocks/fusions, which might be the reason for its superior performance. Compared to the second best method, ResNet, our approach achieves a larger gain on CIFAR-100 than on CIFAR-10.
In addition, we also report the performance with a much deeper base network, a -layer network, in Table 5 and Table 6. We have several observations. On the one hand, the performance of the plain network with  layers is lower than that with  layers (e.g., decreased to  from  on CIFAR-100), while the performances of our approach with  layers and  layers ( and ) are only slightly different. On the other hand, on CIFAR-10, our approach with  layers performs better than ResNet with similar depth ( and  layers), although ResNet has fewer parameters, showing that our approach is helpful for training a deep and complex network. Considering that the -layer network has many more parameters than the -layer network, deep fusion indeed helps train a very deep network even with more parameters.
The results of the top-performing algorithms on CIFAR-10 and CIFAR-100 are shown in Table 7. Overall, our approach performs very well under the common data augmentation. There are two competitive algorithms (both published in ICLR 2016), LSUV, which focuses on initialization, and ELU, which focuses on a new nonlinear activation layer, that achieve performance similar to that of our approach. We also report the results of our approach with more base networks (N13N33N43, N13N33N43N63N73): both show better performance on CIFAR-100, while on CIFAR-10 N13N33N43N63N73 does not bring an improvement.
Compared to LSUV , our approach (N13N33 and N16N46) performs better on CIFAR-100, but worse on CIFAR-10. On CIFAR-10, our approach performs the second best: slightly lower than LSUV  using maxout, but greater than LSUV  (19 layers) without using the maxout layer. Our approach uses simple basic layers to show the benefit of deep fusion, and we believe that potentially our approach can benefit from other advanced layers, e.g., maxout.
Compared to ELU , our approach performs better on CIFAR-10, but worse on CIFAR-100. On CIFAR-100, our approach performs the second best: lower than ELU  whose result is very high with common data augmentation. ELU  introduces an exponential linear unit, which is complementary to our approach and can be combined with our approach.
It is worth noting that our deeply-fused net outperforms even much larger networks with extreme data augmentation like fractional max-pooling .
|Common data augmentation|
|HighWay  (19 layers) (2015)|
|HighWay  (2015)|
|ResNet  (110 layers) (2015)|
|CMSC  (2015)|
|ALL-CNN  (2014)|
|LSUV  (19 layers) (2015)|
|LSUV  (maxout) (2015)|
|GPF  (2015)|
|DSN  (2015)|
|NiN  (2013)|
|Maxout  (2013)|
|MIN  (2015)|
|DNGO  (2015)|
|ELU  (2015)|
|Extreme data augmentation|
|Large ALL-CNN  (2014)|
|Fractional MP  ( test) (2014)|
|Fractional MP  ( tests) (2014)|
|SSCNN  (2014)|
Deep fusion is an approach that fuses not only the final representation but also the intermediate representations of the base networks. It is advantageous in that (1) multi-scale representations can be learnt; (2) the information flow is improved, and training a fused net composed from a very deep base network and a shallow network is less difficult than training the deep base network itself; and (3) the deep and shallow networks benefit from each other during learning. Experimental results show that our approach achieves superior performance over ResNet and Highway, and competitive performance compared to the state of the art.
ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States. (2012) 1106–1114
Journal of Machine Learning Research 15(1) (2014) 1929–1958
In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010. (2010) 249–256
80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 30(11) (2008) 1958–1970