1 Introduction
Deep neural networks have become the state-of-the-art systems for image recognition (He et al., 2016a; Huang et al., 2017b; Krizhevsky et al., 2012; Qiao et al., 2017a; Simonyan & Zisserman, 2014; Szegedy et al., 2015; Wang et al., 2017; Zeiler & Fergus, 2013) as well as for other vision tasks (Chen et al., 2015; Girshick et al., 2014; Long et al., 2015; Qiao et al., 2017b; Ren et al., 2015; Shen et al., 2015; Xie & Tu, 2015). The architectures keep going deeper, e.g., from five convolutional layers (Krizhevsky et al., 2012) to networks with over a thousand layers (He et al., 2016b). The benefit of deep architectures is their strong learning capacity, because each new layer can potentially introduce more non-linearities and typically uses larger receptive fields (Simonyan & Zisserman, 2014). In addition, adding certain types of layers (e.g., the identity-mapping-based blocks of He et al. (2016b)) will not harm the performance in theory, since they can simply learn the identity mapping. This makes stacking up layers even more appealing in network design.
Although deeper architectures usually lead to stronger learning capacities, cascading convolutional layers (e.g., VGG (Simonyan & Zisserman, 2014)) or blocks (e.g., ResNet (He et al., 2016a)) is not the only way to achieve this goal. In this paper, we present a new way to increase the depth of networks as an alternative to stacking up convolutional layers or blocks. Figure 2 illustrates the comparison between our proposed convolutional network, which gradually updates its feature representations, and the traditional convolutional network, which computes its output simultaneously. By only adding an ordering to the channels, without any additional computation, the later-computed channels become deeper than the corresponding ones in the traditional convolutional network. We refer to neural networks with the proposed computation orderings on the channels as Gradually Updated Neural Networks (GUNN). Figure 1 provides two examples of architecture designs based on cascading building blocks and on GUNN. Without repeating the building blocks, GUNN increases the depths of the networks as well as their learning capacities.
It is clear that converting plain networks to GUNN increases the depths of the networks without any additional computation. What is less obvious is that GUNN in fact eliminates the overlap singularities inherent in the loss landscapes of cascading-based convolutional networks, which have been shown to adversely affect the training of deep neural networks as well as their performance (Wei et al., 2008; Orhan & Pitkow, 2018). An overlap singularity occurs when internal neurons collapse into each other, i.e., they become unidentifiable by their activations. This happens in real networks, increases the training difficulty, and degrades the performance (Orhan & Pitkow, 2018). However, if a plain network is converted to GUNN, the added computation orderings break the symmetry between the neurons. We prove that the internal neurons in GUNN cannot collapse into each other. As a result, the effective dimensionality is kept during training and the model is free from the degeneracy caused by collapsed neurons. Reflected in the training dynamics and the performance, this means that converting to GUNN makes plain networks easier to train and better performing. Figure 3 compares the training dynamics of a 15-layer plain network on the CIFAR-10 dataset (Krizhevsky & Hinton, 2009) before and after being converted to GUNN.

In this paper, we test our proposed GUNN on highly competitive benchmark datasets, i.e., CIFAR (Krizhevsky & Hinton, 2009) and ImageNet (Russakovsky et al., 2015). Experimental results demonstrate that our proposed GUNN-based networks achieve state-of-the-art performance compared with the previous cascading-based architectures.
2 Related Work
The research focus of image recognition has moved from feature design (Dalal & Triggs, 2005; Lowe, 2004) to architecture design (He et al., 2016a; Huang et al., 2017b; Krizhevsky et al., 2012; Sermanet et al., 2014; Simonyan & Zisserman, 2014; Szegedy et al., 2015; Xie et al., 2017; Zeiler & Fergus, 2013) due to the recent success of deep neural networks. Highway Networks (Srivastava et al., 2015) proposed architectures that can be trained end-to-end even with hundreds of layers. The main idea of Highway Networks is to use bypassing paths. This idea was further investigated in ResNet (He et al., 2016a), which simplifies the bypassing paths by using only identity mappings. As learning ultra-deep networks became possible, the depths of the models have increased tremendously. ResNet with pre-activation (He et al., 2016b) and ResNet with stochastic depth (Huang et al., 2016) even managed to train neural networks with more than a thousand layers. FractalNet (Larsson et al., 2016) argued that, in addition to summation, concatenation also helps train deep architectures. More recently, ResNeXt (Xie et al., 2017) used group convolutions in ResNet and outperformed the original ResNet. DenseNet (Huang et al., 2017b) proposed an architecture with dense connections implemented by feature concatenation. Dual Path Net (Chen et al., 2017) finds a middle point between ResNet and DenseNet by combining them in two paths. Unlike the above cascading-based methods, GUNN eliminates the overlap singularities caused by the architecture symmetry. The detailed analyses can be found in Section 4.3.
As an alternative to increasing the depth of neural networks, another trend is to increase their width. GoogLeNet (Szegedy et al., 2015; 2016) proposed an Inception module to concatenate feature maps produced by different filters. Following ResNet (He et al., 2016a), WideResNet (Zagoruyko & Komodakis, 2016) argued that, compared with increasing the depth, increasing the width of the networks can be more effective in improving the performance. Besides varying the width and the depth, there are also other design strategies for deep neural networks (Hariharan et al., 2015; Kontschieder et al., 2015; Pezeshki et al., 2016; Rasmus et al., 2015; Yang & Ramanan, 2015). Deeply-Supervised Nets (Lee et al., 2014) used auxiliary classifiers to provide direct supervision for the internal layers. Network in Network (Lin et al., 2013) added micro perceptrons to the convolutional layers.
3 Model
3.1 Feature Update
We consider a feature transformation F, where c ∈ {1, …, C} indexes the channels of the features and (i, j) denotes a location on the 2D feature map. For example, F can be a convolutional layer with C channels for both the input and the output. Let x be the input and y be the output; we have

y = F(x).    (1)
Suppose that F can be decomposed into channel-wise transformations that are independent of each other; then, for any location (i, j) and channel c, we have

y^c_{i,j} = f^c(x_{R(i,j)}),    (2)

where R(i, j) denotes the receptive field of the location (i, j) and f^c denotes the transformation for channel c.
Let U_S denote a feature update on a channel set S ⊆ {1, …, C}, i.e.,

y^c = f^c(x) if c ∈ S,  and  y^c = x^c if c ∉ S.    (3)

Then, U_S = F when S = {1, …, C}.
3.2 Gradually Updated Neural Networks
By defining the feature update on channel sets, the commonly used one-layer CNN becomes the special case of feature updates in which every channel is updated simultaneously. However, we can also update the channels gradually. For example, the proposed GUNN can be formulated as

x^{(0)} = x,   x^{(k)} = U_{S_k}(x^{(k-1)}) for k = 1, …, K,   y = x^{(K)},    (4)

where the channel sets S_1, …, S_K form a partition of {1, …, C}. When K = 1, GUNN is equivalent to F.
Note that the number of parameters and the amount of computation of GUNN are the same as those of the corresponding F for any partition of {1, …, C}. However, by decomposing F into channel-wise transformations and applying them sequentially, the later-computed channels become deeper than the earlier ones. As a result, the depth of the network is increased, and so is its learning capacity.
3.3 Channelwise Update by Residual Learning
We adopt the residual learning proposed by ResNet (He et al., 2016a) in our model. Specifically, we take the channel-wise transformation f^c to be

f^c(x) = x^c + g^c(x),    (5)

where g^c is computed by a convolutional neural network. The motivation for expressing f^c in a residual-learning manner is to reduce overlap singularities (Orhan & Pitkow, 2018), which will be discussed in Section 4.
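The residual feature updates above (Eqs. 3–5) can be sketched in a few lines of numpy. Everything concrete below, the tanh-based channel-wise map standing in for g^c, the channel count, and the two-set partition, is an assumption made for illustration, not the paper's actual units:

```python
import numpy as np

rng = np.random.default_rng(0)
C = 8                               # number of channels (toy setting; spatial dims omitted)
x = rng.normal(size=C)              # input feature vector
W = rng.normal(size=(C, C)) * 0.1   # parameters of the toy channel-wise maps

def g(features, c):
    # assumed stand-in for g^c: a linear map over all channels plus a tanh
    return np.tanh(W[c] @ features)

def update(features, S):
    # feature update U_S (Eq. 3): recompute channels in S, keep the rest
    out = features.copy()
    for c in S:
        out[c] = features[c] + g(features, c)   # residual form f^c (Eq. 5)
    return out

# SUNN: every channel updated simultaneously (K = 1)
sunn = update(x, range(C))

# GUNN: channels updated gradually over a partition S_1, ..., S_K (Eq. 4)
partition = [range(0, 4), range(4, 8)]          # K = 2 sets
gunn = x
for S in partition:
    gunn = update(gunn, S)

print(np.allclose(sunn[:4], gunn[:4]))   # first set: identical to SUNN
print(np.allclose(sunn[4:], gunn[4:]))   # second set: differs
```

The first set matches the simultaneous network because it still reads the original input; the second set reads the already-updated channels, which is exactly the extra depth GUNN gains at no additional cost.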
3.4 Learning GUNN by Backpropagation
Here we show the backpropagation algorithm for learning the parameters of GUNN, which uses the same amount of computation and memory as that of F. In Eq. 4, let the feature update U_{S_k} be parameterized by its own set of parameters. Let BP denote the backpropagation algorithm for a differentiable function with its loss and parameters. Algorithm 1 presents the backpropagation algorithm for GUNN. Since each update has the residual structure (He et al., 2016a), the last two steps can be merged into

(6)

which further simplifies the implementation. It is easy to see that converting networks to GUNN does not increase the memory usage of feed-forwarding. Given Algorithm 1, converting networks to GUNN will not affect the memory usage in either training or evaluation.
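The feed-forward part of the memory claim can be checked with a small sketch. Here the channel sets are singletons (i.e., K = C in Eq. 4) and a toy tanh map stands in for the update unit; both choices are my assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
C = 6
x = rng.normal(size=C)
W = rng.normal(size=(C, C)) * 0.1   # parameters of a toy channel-wise map

def f_c(features, c):
    # residual channel-wise update (Eq. 5) with an assumed tanh map as g^c
    return features[c] + np.tanh(W[c] @ features)

def forward_functional(x):
    # allocates a fresh full-size buffer for every update
    feats = x.copy()
    for c in range(C):              # singleton channel sets, i.e. K = C in Eq. 4
        new = feats.copy()
        new[c] = f_c(feats, c)
        feats = new
    return feats

def forward_inplace(x):
    # a single buffer suffices: each update only overwrites its own channel
    feats = x.copy()
    for c in range(C):
        feats[c] = f_c(feats, c)
    return feats

print(np.allclose(forward_functional(x), forward_inplace(x)))   # the two agree
```

Because each update reads the current buffer and writes only the channels it owns, the whole forward pass can reuse one feature map, which is the same footprint as the unconverted network.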
4 GUNN Eliminates Overlap Singularities
Overlap singularities are inherent in the loss landscapes of some network architectures; they are caused by the non-identifiability of subsets of the neurons. They were identified and discussed in previous work (Wei et al., 2008; Anandkumar & Ge, 2016; Orhan & Pitkow, 2018) and are shown to be harmful to the performance of deep networks. Intuitively, overlap singularities exist in architectures where internal neurons can collapse into each other. As a result, the models become degenerate and the effective dimensionality is reduced. Orhan & Pitkow (2018) demonstrated through experiments that residual learning (see Eq. 5) helps to reduce the overlap singularities in deep networks, which partly explains the exceptional performance of ResNet (He et al., 2016a) compared with plain networks. In the following, we first use linear transformations as an example to demonstrate how GUNN-based networks break the overlap singularities. Then, we generalize the results to ReLU DNNs. Finally, we compare GUNN with the previous state-of-the-art network architectures from the perspective of singularity elimination.
4.1 Overlap Singularities in Linear Transformations
Consider a linear function y = Wx with x, y ∈ ℝ^C, i.e.,

y_p = Σ_j w_{pj} x_j.    (7)

Suppose that there exists a pair of collapsed neurons y_p and y_q (p ≠ q). Then, for any input x, y_p = y_q, and the equality holds after any number of gradient descent steps.

Eq. 7 describes a plain network. The condition for the existence of collapsed y_p and y_q is w_{pj} = w_{qj}, ∀j. This is the case most discussed in previous work; it happens in real networks and degrades the performance.
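The persistence of such a collapse can be verified numerically with a hypothetical two-layer linear network (the dimensions, learning rate, squared loss, and single training example are illustrative assumptions): once two hidden neurons share both incoming and outgoing weights, gradient descent keeps them collapsed.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_hid, d_out = 5, 4, 3
V = rng.normal(size=(d_hid, d_in)) * 0.5   # input -> hidden weights
U = rng.normal(size=(d_out, d_hid)) * 0.5  # hidden -> output weights
V[1] = V[0]                                # hidden units 0 and 1 collapse:
U[:, 1] = U[:, 0]                          # identical incoming and outgoing weights

x = rng.normal(size=d_in)                  # a fixed training example
t = rng.normal(size=d_out)                 # and its regression target
lr = 0.05
for _ in range(100):
    h = V @ x                              # plain linear hidden layer (cf. Eq. 7)
    y = U @ h
    delta = y - t                          # gradient of 0.5 * ||y - t||^2 w.r.t. y
    U -= lr * np.outer(delta, h)           # dL/dU; columns 0 and 1 get equal updates
    V -= lr * np.outer(U.T @ delta, x)     # dL/dV; rows 0 and 1 get equal updates

# the gradient is symmetric in the two units, so the collapse persists
print(np.allclose(V[0], V[1]) and np.allclose(U[:, 0], U[:, 1]))
```

Both units receive identical gradients at every step, so their effective dimensionality is lost for the rest of training.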
When we add residual learning, Eq. 7 becomes

y_p = x_p + Σ_j w_{pj} x_j.    (8)

Collapsed neurons then require that w_{pj} = w_{qj} for j ∉ {p, q}, w_{qp} = w_{pp} + 1, and w_{pq} = w_{qq} + 1. This makes the collapse of y_p and y_q very unlikely when W is initialized from a normal distribution as in ResNet, but still possible. Next, we convert Eq. 8 to GUNN, i.e.,

y_p = x_p + Σ_{j<p} w_{pj} y_j + Σ_{j≥p} w_{pj} x_j,    (9)

where the channels are updated gradually in the order of their indices.
Suppose that y_p and y_q (p < q) collapse. Consider the difference between the values of y_p and y_q after one step of gradient descent on the loss with a given input and learning rate:

(10)

Since the collapse must hold for every input, Eq. 10 imposes a set of equality constraints on the weights. However, these constraints are broken by the next step of the gradient-descent optimization; hence, y_p and y_q cannot collapse into each other. The complete proof can be found in the appendix.
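The symmetry-breaking effect of the computation ordering can also be checked numerically. This is a minimal sketch, and the sizes, seed, and the non-residual variant used as the "plain" baseline are my assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
C = 6
W = rng.normal(size=(C, C)) * 0.5
p, q = 1, 4
W[q] = W[p]                       # give neurons p and q identical parameters
x = rng.normal(size=C)

# simultaneous (plain) linear layer, Eq. 7: identical rows do collapse
y_plain = W @ x
print(bool(np.isclose(y_plain[p], y_plain[q])))   # collapsed outputs

# GUNN-style gradual update: each channel reads the current buffer,
# so the later channel q sees the already-updated channel p (cf. Eq. 9)
y = x.copy()
for c in range(C):
    y[c] = y[c] + W[c] @ y
print(bool(np.isclose(y[p], y[q])))               # the ordering breaks the tie
```

Even with identical weight rows, the two neurons read different feature buffers, so their activations are distinguishable and the singularity disappears.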
4.2 Overlap Singularities in ReLU DNN
In practice, architectures are usually composed of several linear layers and non-linearity layers. Analyzing all possible architectures is beyond our scope. Here, we discuss the commonly used ReLU DNN, in which only linear transformations and ReLUs are composed by simple layer cascading.
Following the notation in §3, we use f^c(x) = x^c + g^c(x), in which g^c is a ReLU DNN. Note that g^c is a continuous piecewise linear (PWL) function (Arora et al., 2018), which means that there exists a finite set of polyhedra whose union is the input space, and g^c is affine linear over each polyhedron.
Suppose that we convert the network to GUNN and that there exists a pair of collapsed neurons y_p and y_q (p ≠ q). Then, the set of polyhedra for y_p is the same as that for y_q. Let P be a polyhedron for y_p and y_q as defined above. Then, ∀x ∈ P,

y_p = x_p + Σ_{j<p} w^P_{pj} y_j + Σ_{j≥p} w^P_{pj} x_j + b^P_p,    (11)

where w^P and b^P denote the parameters of the affine map on polyhedron P. Note that on each P, y is a function of x in the form of Eq. 9; hence, y_p and y_q cannot collapse into each other. Since the union of all the polyhedra is the whole input space, we conclude that GUNN eliminates the overlap singularities in ReLU DNNs.
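The piecewise-linearity that this argument relies on can be verified directly on a small random ReLU network (one hidden layer; all sizes and the probing procedure below are illustrative assumptions). Within one activation pattern, i.e., one polyhedron, finite differences of the network are exactly those of an affine map:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
W1 = rng.normal(size=(8, d))
b1 = rng.normal(size=8)
W2 = rng.normal(size=(1, 8))

def relu_net(x):
    return (W2 @ np.maximum(W1 @ x + b1, 0.0)).item()

def pattern(x):
    # the ReLU activation pattern identifies the polyhedron containing x
    return tuple((W1 @ x + b1) > 0)

x0 = rng.normal(size=d)
v = rng.normal(size=d)
t = 1e-3
# shrink the step until all three collinear points share one polyhedron
while not (pattern(x0) == pattern(x0 + t * v) == pattern(x0 + 2 * t * v)):
    t *= 0.5

# within a polyhedron the network is affine, so finite differences agree
d1 = relu_net(x0 + t * v) - relu_net(x0)
d2 = relu_net(x0 + 2 * t * v) - relu_net(x0 + t * v)
print(bool(np.isclose(d1, d2)))
```

For a single hidden layer each activation pattern carves out a convex region, so three collinear points with the same pattern are guaranteed to lie in one polyhedron, making the check rigorous.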
4.3 Discussions and Comparisons
The previous two subsections consider the GUNN conversion in which each channel forms its own set (see Eq. 4). However, this slows down the computation on GPUs due to the data dependency. Without specialized hardware or library support, we instead increase the sizes of the channel sets. The resulting models run at a speed between those of ResNeXt (Xie et al., 2017) and DenseNet (Huang et al., 2017b). This change, however, introduces singularities into the channels within the same set. Residual learning then helps GUNN reduce the singularities within each set, since we initialize the parameters from a normal distribution. We will compare the results of GUNN with and without residual learning in the experiments.
We compare GUNN with the state-of-the-art architectures from the perspective of overlap singularities. ResNet (He et al., 2016a) and its variants use residual learning, which reduces but cannot eliminate the singularities. ResNeXt (Xie et al., 2017) uses group convolutions to break the symmetry between groups, which further helps to avoid neuron collapse. DenseNet (Huang et al., 2017b) concatenates the outputs of layers as the input to the next layer. DenseNet and GUNN both create dense connections; DenseNet reuses the outputs by concatenating them, while GUNN adds them back to the inputs. However, the channels within the same layer of DenseNet can still collapse into each other since they are symmetric. In contrast, adding the outputs back makes residual learning possible in GUNN, and this makes residual learning indispensable in GUNN-based networks.
5 Network Architectures
In this section, we present the details of our architectures for the CIFAR (Krizhevsky & Hinton, 2009) and ImageNet (Russakovsky et al., 2015) datasets.
5.1 Simultaneously Updated Neural Networks and Gradually Updated Neural Networks
Since the proposed GUNN is a method for increasing the depths of convolutional networks, specifying the architectures to be converted is equivalent to specifying the GUNN-based architectures. The architectures before conversion, the Simultaneously Updated Neural Networks (SUNN), become natural baselines for our proposed GUNN networks. We first study which baseline architectures can be converted.
There are two assumptions about the feature transformation F (see Eq. 1): (1) the input and output sizes are the same, and (2) F is channel-wise decomposable. To satisfy the first assumption, we first use a convolutional layer with Batch Normalization (Ioffe & Szegedy, 2015) and ReLU (Nair & Hinton, 2010) to transform the feature space into a new space with the desired number of channels. To satisfy the second assumption, instead of directly specifying the transform F, we focus on designing the update units U_S, where S is a subset of the channels (see Eq. 4). To be consistent with the term update used in GUNN and SUNN, we refer to U_S as the update unit for the channels in S.

Bottleneck Update Units
In the architectures proposed in this paper, we adopt bottleneck networks, as shown in Figure 4, as the update units for both SUNN and GUNN. Suppose that an update unit maps input features of channel size C to output features of channel size |S|. Each unit contains three convolutional layers. The first convolutional layer transforms the input features to an intermediate width, determined by the expansion rate, using a 1×1 convolution. The second convolutional layer has kernel size 3×3, stride 1, and padding 1, outputting features of the same intermediate width. The third layer computes the features of size |S| using a 1×1 convolution. The output is then added back to the input, following the residual architecture proposed in ResNet (He et al., 2016a). We add a batch normalization layer (Ioffe & Szegedy, 2015) and a ReLU layer (Nair & Hinton, 2010) after the first and the second convolutional layers, while adding only a batch normalization layer after the third layer. Stacking up update units also generates a new update unit. In total, we have two hyperparameters for designing an update unit: the expansion rate and the number of stacked update units.

One Resolution, One Representation
Our architectures have only one representation at each resolution, besides the pooling layers and the convolutional layers that initialize the needed numbers of channels. Take the architecture in Table 1 as an example. There are two processes at each resolution. The first is the transition process, which computes the initial features with the dimensions of the next resolution and then downsamples them using average pooling. A convolutional operation is needed here because F is assumed to have the same input and output sizes. The next process uses GUNN to update this feature space gradually. Each channel is updated exactly once, and all channels have been updated after this process. Unlike in most previous networks, after these two processes the feature transformations at this resolution are complete. No more convolutional layers or blocks follow this feature representation, i.e., one resolution, one representation. Then, the network computes the initial features for the next resolution, or computes the final vector representation of the entire image by global average pooling. By designing networks in this way, SUNN networks usually have relatively few layers before being converted to GUNN-based networks.

Stage  Output  WideResNet  GUNN

Conv1  
Conv2  
Trans1  –  
Conv3  
Trans2  –  
Conv4  
Trans3  
GPU Memory  GB@  GB@  
# Params  M  M  
Error (C10/C100)  /  / 
Channel Partitions
With clearly defined update units, we can easily build SUNN and GUNN layers by using the units to update the representations following Eq. 4. The hyperparameters for a SUNN/GUNN layer are the number of channels and the partition over those channels. In our proposed architectures, we evenly partition the channels into K segments, so the number of channels and K together represent the configuration of a layer. Combined with the hyperparameters of the update units, we have four hyperparameters to tune for one SUNN/GUNN layer: the number of channels, the number of segments K, the expansion rate, and the number of stacked update units.
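The even partitioning described above can be sketched as a small helper. The function name and the choice of contiguous segments are assumptions; the text only states that the channels are evenly partitioned:

```python
def even_partition(num_channels, num_segments):
    """Evenly split channels {0, ..., num_channels - 1} into contiguous segments."""
    base, extra = divmod(num_channels, num_segments)
    segments, start = [], 0
    for k in range(num_segments):
        size = base + (1 if k < extra else 0)   # spread the remainder over the first sets
        segments.append(list(range(start, start + size)))
        start += size
    return segments

print(even_partition(10, 3))   # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Each returned segment plays the role of one channel set S_k in Eq. 4, updated by its own bottleneck unit.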
5.2 Architectures for CIFAR
We have implemented two neural networks based on GUNN, GUNN-15 and GUNN-24, to compete with the previous state-of-the-art methods on the CIFAR datasets. Table 1 shows the big picture of GUNN-15. GUNN-15 has three GUNN layers, Conv2, Conv3 and Conv4, each with its own configuration of channels, segments, expansion rate and number of stacked units. GUNN-24 is derived from GUNN-15 by changing the numbers of output channels of the convolutional and transition layers together with the hyperparameters of the GUNN layers. The numbers of parameters of both models differ slightly between CIFAR-10 and CIFAR-100 and are listed in the result tables. GUNN-15 is aimed at competing with the methods published at an early stage by using a much smaller model, while GUNN-24 is targeted at comparing with ResNeXt (Xie et al., 2017) and DenseNet (Huang et al., 2017b) to achieve state-of-the-art performance.
5.3 Architectures for ImageNet
Stage  Output  ResNet  GUNN 

Conv1  
Conv2  
Trans1  –  
Conv3  
Trans2  –  
Conv4  
Trans3  –  
Conv5  
Trans4  
# Params  M  M  
Error (Top-1/5)  /  / 
We implement a GUNN-based neural network, GUNN-18, to compete with the state-of-the-art neural networks on ImageNet that have similar numbers of parameters. Table 2 shows the big picture of the architecture of GUNN-18. Its GUNN layers are Conv2, Conv3, Conv4 and Conv5, each with its own hyperparameter configuration. GUNN-18 is targeted at competing with the previous state-of-the-art methods that have similar numbers of parameters, e.g., ResNet (He et al., 2016a), ResNeXt (Xie et al., 2017) and DenseNet (Huang et al., 2017b).
We also implement a wider GUNN-based neural network, Wide-GUNN-18, for better capacity; it differs from GUNN-18 in the widths of Conv2 through Conv5. Wide-GUNN-18 is targeted at competing with ResNet, ResNeXt, DPN (Chen et al., 2017) and SENet (Hu et al., 2017).
Method  C10  C100  

Network in Network (Lin et al., 2013)  –  
AllCNN (Springenberg et al., 2014)  
Deeply Supervised Network (Lee et al., 2014)  
Highway Network (Srivastava et al., 2015)  
# layers  # params  
ResNet (He et al., 2016a; Huang et al., 2016)  M  
FractalNet (Larsson et al., 2016)  M  
Stochastic Depth (Huang et al., 2016)  M  
ResNet with preact (He et al., 2016b)  M  
WideResNet (Zagoruyko & Komodakis, 2016)  M  
WideResNet (Zagoruyko & Komodakis, 2016)  M  
ResNeXt (Xie et al., 2017)  M  
DenseNet (Huang et al., 2017b)  M  
Snapshot Ensemble (Huang et al., 2017a)  –  M  
GUNN15  M  
GUNN24  M  
GUNN24 Ensemble  M 
6 Experiments
In this section, we demonstrate the effectiveness of the proposed GUNN on several benchmark datasets.
6.1 Benchmark Datasets
CIFAR
CIFAR (Krizhevsky & Hinton, 2009) comprises two color-image datasets: CIFAR-10 (C10) and CIFAR-100 (C100). Both datasets consist of natural images of 32×32 pixels. The CIFAR-10 dataset has 10 categories, while the CIFAR-100 dataset has 100 categories. For both datasets, the training and test sets contain 50,000 and 10,000 images, respectively. To fairly compare our method with the state-of-the-art (He et al., 2016a; Huang et al., 2017b, 2016; Larsson et al., 2016; Lee et al., 2014; Lin et al., 2013; Romero et al., 2014; Springenberg et al., 2014; Srivastava et al., 2015; Xie et al., 2017), we use the same training and testing strategies, as well as the same data-processing methods. Specifically, we adopt a commonly used data-augmentation scheme, i.e., mirroring and shifting, for these two datasets. We normalize the images using the channel means and standard deviations for data preprocessing.
ImageNet
The ImageNet dataset (Russakovsky et al., 2015) contains about 1.28 million color images for training and 50,000 for validation, spanning 1,000 categories. We adopt the same data-augmentation methods as the state-of-the-art architectures (He et al., 2016a, b; Huang et al., 2017b; Xie et al., 2017) for training. For testing, we use single-crop evaluation at the size of 224×224. Following the state-of-the-art (He et al., 2016a, b; Huang et al., 2017b; Xie et al., 2017), we report the validation error rates.
6.2 Training Details
We train all of our networks using stochastic gradient descent. On CIFAR-10/100 (Krizhevsky & Hinton, 2009), the initial learning rate, the weight decay, and the momentum (without dampening) follow common practice; the learning rate is divided by a constant factor twice during training according to a fixed epoch schedule, and the batch size follows (Huang et al., 2017b). All the results reported for CIFAR, regardless of the detailed configurations, were trained using NVIDIA Titan X GPUs with data parallelism. On ImageNet (Russakovsky et al., 2015), the learning rate starts at the same initial value and decreases following the schedule in DenseNet (Huang et al., 2017b). The network parameters are initialized following (He et al., 2016a). We use Tesla V100 GPUs with data parallelism to obtain the reported results. Our results are directly comparable with those of ResNet, WideResNet, ResNeXt and DenseNet.

6.3 Results on CIFAR
We train two models, GUNN-15 and GUNN-24, on the CIFAR-10/100 datasets. Table 3 shows the comparisons between our method and the previous state-of-the-art methods. Our method achieves the best results in both the single-model and the ensemble tests. For the ensemble, we use Snapshot Ensemble (Huang et al., 2017a).
Method  # layers  # params  top  top 

VGG (Simonyan & Zisserman, 2014)  M  
ResNet (He et al., 2016a)  M  
ResNeXt (Xie et al., 2017)  M  
DenseNet (Huang et al., 2017b)  M  
SUNN  M  
GUNN  M  
ResNet (He et al., 2016a)  M  
ResNeXt (Xie et al., 2017)  M  
DPN (Chen et al., 2017)  M  
SEResNeXt (Hu et al., 2017)  M  
Wide GUNN  M 
Baseline Methods
Here we present the details of the baseline methods in Table 3. The performance of ResNet (He et al., 2016a) is that reported in Stochastic Depth (Huang et al., 2016) for both C10 and C100. The WideResNet (Zagoruyko & Komodakis, 2016) result is that reported in their official code repository on GitHub. The ResNeXt in the third group is the configuration with the best result reported in the paper (Xie et al., 2017). The DenseNet is the DenseNet-BC configuration, which achieves the best performance on CIFAR-10/100. The Snapshot Ensemble (Huang et al., 2017a) uses DenseNet for ensembling during inference. For fairness of comparison, we do not compare with methods that use more data augmentation (e.g., Zhang et al. (2017)) or stronger regularization (e.g., Gastaldi (2017)).
Ablation Study
For the ablation study, we compare GUNN with SUNN, i.e., the networks before the conversion. Table 5 shows the comparison results, which demonstrate the effectiveness of GUNN. We also compare the performance of GUNN with and without residual learning.
Method  # layers  # params  C10  C100 

GUNN15NoRes  M  
GUNN15  M  
SUNN15  M  
GUNN15  M  
SUNN24  M  
GUNN24  M 
6.4 Results on ImageNet
We evaluate GUNN on the ImageNet classification task and compare our performance with the state-of-the-art methods, including VGGNet (Simonyan & Zisserman, 2014), ResNet (He et al., 2016a), ResNeXt (Xie et al., 2017), DenseNet (Huang et al., 2017b), DPN (Chen et al., 2017) and SENet (Hu et al., 2017). The comparisons are shown in Table 4. The results of ours, ResNeXt, and DenseNet are directly comparable, as these methods use the same framework for training and testing networks. Table 4 groups the methods by their numbers of parameters, except for VGGNet, which has far more parameters.
The results presented in Table 4 demonstrate that, with similar numbers of parameters, GUNN achieves performance comparable to the previous state-of-the-art methods. For GUNN, we also conduct an ablation experiment by comparing the corresponding SUNN with GUNN under the same configuration. Consistent with the experimental results on the CIFAR-10/100 datasets, the proposed GUNN improves the accuracy on the ImageNet dataset.
7 Conclusions
In this paper, we propose the Gradually Updated Neural Network (GUNN), a novel, simple yet effective method for increasing the depths of neural networks as an alternative to cascading layers. GUNN is based on convolutional neural networks (CNNs) but differs from CNNs in the way it computes its outputs: the outputs of GUNN are computed gradually rather than simultaneously in order to increase the depth. Essentially, GUNN assumes the input and the output are of the same size and adds a computation ordering to the channels. The added ordering increases the receptive fields and non-linearities of the later-computed channels. Moreover, it eliminates the overlap singularities inherent in traditional convolutional networks. We test GUNN on the task of image recognition. The evaluations are done on three highly competitive benchmarks: CIFAR-10, CIFAR-100 and ImageNet. The experimental results demonstrate the effectiveness of the proposed GUNN on image recognition. In the future, since GUNN can replace CNNs in other neural networks, we will study its applications in other visual tasks, such as object detection and semantic segmentation.
References
 Anandkumar & Ge (2016) Anandkumar, Animashree and Ge, Rong. Efficient approaches for escaping higher order saddle points in nonconvex optimization. In Proceedings of the Conference on Learning Theory, 2016.

 Arora et al. (2018) Arora, Raman, Basu, Amitabh, Mianjy, Poorya, and Mukherjee, Anirbit. Understanding deep neural networks with rectified linear units. International Conference on Learning Representations, 2018.
 Chen et al. (2015) Chen, Liang-Chieh, Papandreou, George, Kokkinos, Iasonas, Murphy, Kevin, and Yuille, Alan L. Semantic image segmentation with deep convolutional nets and fully connected crfs. In International Conference on Learning Representations, 2015.
 Chen et al. (2017) Chen, Yunpeng, Li, Jianan, Xiao, Huaxin, Jin, Xiaojie, Yan, Shuicheng, and Feng, Jiashi. Dual path networks. CoRR, abs/1707.01629, 2017.

 Dalal & Triggs (2005) Dalal, N. and Triggs, B. Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition, June 2005.
 Gastaldi (2017) Gastaldi, Xavier. Shake-shake regularization. CoRR, abs/1705.07485, 2017.
 Girshick et al. (2014) Girshick, Ross B., Donahue, Jeff, Darrell, Trevor, and Malik, Jitendra. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
 Hariharan et al. (2015) Hariharan, Bharath, Arbeláez, Pablo Andrés, Girshick, Ross B., and Malik, Jitendra. Hypercolumns for object segmentation and fine-grained localization. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 447–456, 2015.
 He et al. (2016a) He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, 2016a.
 He et al. (2016b) He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Identity mappings in deep residual networks. ECCV, 2016b.
 Hu et al. (2017) Hu, Jie, Shen, Li, and Sun, Gang. Squeeze-and-excitation networks. CoRR, abs/1709.01507, 2017.
 Huang et al. (2016) Huang, Gao, Sun, Yu, Liu, Zhuang, Sedra, Daniel, and Weinberger, Kilian Q. Deep networks with stochastic depth. CoRR, abs/1603.09382, 2016.
 Huang et al. (2017a) Huang, Gao, Li, Yixuan, Pleiss, Geoff, Liu, Zhuang, Hopcroft, John E., and Weinberger, Kilian Q. Snapshot ensembles: Train 1, get M for free. CoRR, abs/1704.00109, 2017a.
 Huang et al. (2017b) Huang, Gao, Liu, Zhuang, and Weinberger, Kilian Q. Densely connected convolutional networks. IEEE Conference on Computer Vision and Pattern Recognition, 2017b.
 Ioffe & Szegedy (2015) Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
 Kontschieder et al. (2015) Kontschieder, Peter, Fiterau, Madalina, Criminisi, Antonio, and Bulò, Samuel Rota. Deep neural decision forests. In IEEE International Conference on Computer Vision, pp. 1467–1475, 2015.
 Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto, 2009.
 Krizhevsky et al. (2012) Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems, pp. 1097–1105. 2012.
 Larsson et al. (2016) Larsson, Gustav, Maire, Michael, and Shakhnarovich, Gregory. FractalNet: Ultra-deep neural networks without residuals. CoRR, abs/1605.07648, 2016.
 Lee et al. (2014) Lee, Chen-Yu, Xie, Saining, Gallagher, Patrick W., Zhang, Zhengyou, and Tu, Zhuowen. Deeply-supervised nets. CoRR, abs/1409.5185, 2014.
 Lin et al. (2013) Lin, Min, Chen, Qiang, and Yan, Shuicheng. Network in network. CoRR, abs/1312.4400, 2013.
 Long et al. (2015) Long, Jonathan, Shelhamer, Evan, and Darrell, Trevor. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
 Lowe (2004) Lowe, David G. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, Nov 2004. ISSN 1573-1405.

 Nair & Hinton (2010) Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning, 2010.
 Orhan & Pitkow (2018) Orhan, A. Emin and Pitkow, Xaq. Skip connections eliminate singularities. International Conference on Learning Representations, 2018.
 Pezeshki et al. (2016) Pezeshki, Mohammad, Fan, Linxi, Brakel, Philemon, Courville, Aaron C., and Bengio, Yoshua. Deconstructing the ladder network architecture. In International Conference on Machine Learning, pp. 2368–2376, 2016.
 Qiao et al. (2017a) Qiao, Siyuan, Liu, Chenxi, Shen, Wei, and Yuille, Alan L. Few-shot image recognition by predicting parameters from activations. CoRR, abs/1706.03466, 2017a.
 Qiao et al. (2017b) Qiao, Siyuan, Shen, Wei, Qiu, Weichao, Liu, Chenxi, and Yuille, Alan L. ScaleNet: Guiding object proposal generation in supermarkets and beyond. In IEEE International Conference on Computer Vision, 2017b.
 Rasmus et al. (2015) Rasmus, Antti, Berglund, Mathias, Honkala, Mikko, Valpola, Harri, and Raiko, Tapani. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pp. 3546–3554, 2015.
 Ren et al. (2015) Ren, Shaoqing, He, Kaiming, Girshick, Ross B., and Sun, Jian. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28, 2015.
 Romero et al. (2014) Romero, Adriana, Ballas, Nicolas, Kahou, Samira Ebrahimi, Chassang, Antoine, Gatta, Carlo, and Bengio, Yoshua. FitNets: Hints for thin deep nets. CoRR, abs/1412.6550, 2014.
 Russakovsky et al. (2015) Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and FeiFei, Li. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
 Sermanet et al. (2014) Sermanet, Pierre, Eigen, David, Zhang, Xiang, Mathieu, Michaël, Fergus, Robert, and Lecun, Yann. OverFeat: Integrated recognition, localization and detection using convolutional networks. International Conference on Learning Representations, 2014.
 Shen et al. (2015) Shen, Wei, Wang, Xinggang, Wang, Yan, Bai, Xiang, and Zhang, Zhijiang. DeepContour: A deep convolutional feature learned by positive-sharing loss for contour detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
 Simonyan & Zisserman (2014) Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.