Gradually Updated Neural Networks for Large-Scale Image Recognition

11/25/2017 ∙ by Siyuan Qiao, et al.

We present a simple yet effective neural network architecture for image recognition. Unlike the previous state-of-the-art neural networks which usually have very deep architectures, we build networks that are shallower but can achieve better performances on three competitive benchmark datasets, i.e., CIFAR-10/100 and ImageNet. Our architectures are built using Gradually Updated Neural Network (GUNN) layers, which differ from the standard Convolutional Neural Network (CNN) layers in the way their output channels are computed: the CNN layers compute the output channels simultaneously while GUNN layers compute the channels gradually. By adding the computation ordering to the channels of CNNs, our networks are able to achieve better accuracies while using fewer layers and less memory. The architecture design of GUNN is guided by theoretical results and verified by empirical experiments. We set new records on the CIFAR-10 and CIFAR-100 datasets and achieve better accuracy on ImageNet under similar complexity with the previous state-of-the-art methods.




1 Introduction

Deep neural networks have become the state-of-the-art systems for image recognition (He et al., 2016a; Huang et al., 2017b; Krizhevsky et al., 2012; Qiao et al., 2017a; Simonyan & Zisserman, 2014; Szegedy et al., 2015; Wang et al., 2017; Zeiler & Fergus, 2013) as well as other vision tasks (Chen et al., 2015; Girshick et al., 2014; Long et al., 2015; Qiao et al., 2017b; Ren et al., 2015; Shen et al., 2015; Xie & Tu, 2015). The architectures keep going deeper, e.g., from five convolutional layers (Krizhevsky et al., 2012) to 1001 layers (He et al., 2016b). The benefit of deep architectures is their strong learning capacity, because each new layer can potentially introduce more non-linearities and typically uses larger receptive fields (Simonyan & Zisserman, 2014). In addition, adding certain types of layers (e.g., He et al., 2016b) will not harm the performance theoretically, since they can simply learn an identity mapping. This makes stacking up layers more appealing in network design.

Although deeper architectures usually lead to stronger learning capacities, cascading convolutional layers (e.g. VGG (Simonyan & Zisserman, 2014)) or blocks (e.g. ResNet (He et al., 2016a)) is not necessarily the only method to achieve this goal. In this paper, we present a new way to increase the depth of the networks as an alternative to stacking up convolutional layers or blocks. Figure 2 provides an illustration that compares our proposed convolutional network that gradually updates the feature representations against the traditional convolutional network that computes its output simultaneously. By only adding an ordering to the channels without any additional computation, the later computed channels become deeper than the corresponding ones in the traditional convolutional network. We refer to the neural networks with the proposed computation orderings on the channels as Gradually Updated Neural Networks (GUNN). Figure 1 provides two examples of architecture designs based on cascading building blocks and GUNN. Without repeating the building blocks, GUNN increases the depths of the networks as well as their learning capacities.

Figure 1: Comparing architecture designs based on cascading convolutional building blocks (left) and GUNN (right). Cascading-based architectures increase the depth by repeating the blocks. GUNN-based networks increase the depth by adding computation orderings to the channels of the building blocks.
Figure 2: Comparing Simultaneously Updated Convolutional Network and Gradually Updated Convolutional Network. Left is a traditional convolutional network with three channels in both the input and the output. Right is our proposed convolutional network which decomposes the original computation into three sequential channel-wise convolutional operations. In our proposed GUNN-based architectures, the updates are done by residual learning (He et al., 2016a), which we do not show in this figure.

It is clear that converting plain networks to GUNN increases the depths of the networks without any additional computation. What is less obvious is that GUNN in fact eliminates the overlap singularities inherent in the loss landscapes of the cascading-based convolutional networks, which have been shown to adversely affect the training of deep neural networks as well as their performance (Wei et al., 2008; Orhan & Pitkow, 2018). An overlap singularity occurs when internal neurons collapse into each other, i.e., they are unidentifiable by their activations. It happens in practical networks, increases the training difficulty, and degrades the performance (Orhan & Pitkow, 2018). However, if a plain network is converted to GUNN, the added computation orderings break the symmetry between the neurons. We prove that the internal neurons in GUNN cannot collapse into each other. As a result, the effective dimensionality is preserved during training and the model is free from the degeneracy caused by collapsed neurons. In terms of training dynamics and performance, this means that converting plain networks to GUNN makes them easier to train and perform better. Figure 3 compares the training dynamics of a 15-layer plain network on the CIFAR-10 dataset (Krizhevsky & Hinton, 2009) before and after conversion to GUNN.

Figure 3: Training dynamics on CIFAR-10 dataset.

In this paper, we test our proposed GUNN on highly competitive benchmark datasets, i.e. CIFAR (Krizhevsky & Hinton, 2009) and ImageNet (Russakovsky et al., 2015). Experimental results demonstrate that our proposed GUNN-based networks achieve the state-of-the-art performances compared with the previous cascading-based architectures.

2 Related Work

The research focus of image recognition has moved from feature designs (Dalal & Triggs, 2005; Lowe, 2004) to architecture designs (He et al., 2016a; Huang et al., 2017b; Krizhevsky et al., 2012; Sermanet et al., 2014; Simonyan & Zisserman, 2014; Szegedy et al., 2015; Xie et al., 2017; Zeiler & Fergus, 2013) due to the recent success of deep neural networks. Highway Networks (Srivastava et al., 2015) proposed architectures that can be trained end-to-end with more than 100 layers. The main idea of Highway Networks is to use bypassing paths. This idea was further investigated in ResNet (He et al., 2016a), which simplifies the bypassing paths by using only identity mappings. As learning ultra-deep networks became possible, the depths of the models have increased tremendously. ResNet with pre-activation (He et al., 2016b) and ResNet with stochastic depth (Huang et al., 2016) even managed to train neural networks with more than 1000 layers. FractalNet (Larsson et al., 2016) argued that in addition to summation, concatenation also helps train a deep architecture. More recently, ResNeXt (Xie et al., 2017) used group convolutions in ResNet and outperformed the original ResNet. DenseNet (Huang et al., 2017b) proposed an architecture with dense connections by feature concatenation. Dual Path Net (Chen et al., 2017) finds a middle point between ResNet and DenseNet by concatenating them in two paths. Unlike the above cascading-based methods, GUNN eliminates the overlap singularities caused by the architecture symmetry. The detailed analyses can be found in Section 4.3.

As an alternative to increasing the depth of neural networks, another trend is to increase their width. GoogLeNet (Szegedy et al., 2015, 2016) proposed an Inception module to concatenate feature maps produced by different filters. Following ResNet (He et al., 2016a), WideResNet (Zagoruyko & Komodakis, 2016) argued that, compared with increasing the depth, increasing the width of the networks can be more effective in improving performance. Besides varying the width and the depth, there are also other design strategies for deep neural networks (Hariharan et al., 2015; Kontschieder et al., 2015; Pezeshki et al., 2016; Rasmus et al., 2015; Yang & Ramanan, 2015). Deeply-Supervised Nets (Lee et al., 2014) used auxiliary classifiers to provide direct supervision for the internal layers. Network in Network (Lin et al., 2013) adds micro perceptrons to the convolutional layers.

3 Model

3.1 Feature Update

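We consider a feature transformation $F: \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times n}$, where $n$ denotes the channels of the features and $m$ denotes the feature locations on the 2-D feature map. For example, $F$ can be a convolutional layer with $n$ channels for both the input and the output. Let $x \in \mathbb{R}^{m \times n}$ be the input and $y \in \mathbb{R}^{m \times n}$ be the output; we have

$$y = F(x). \tag{1}$$

Suppose that $F$ can be decomposed into channel-wise transformations $F^c$ that are independent of each other. Then for any location $k$ and channel $c$ we have

$$y_k^c = F^c(x_{r(k)}), \tag{2}$$

where $x_{r(k)}$ denotes the receptive field of the location $k$ and $F^c$ denotes the transformation on channel $c$.

Let $U_C$ denote a feature update on channel set $C$, i.e.,

$$U_C(x):\quad y_k^c = F^c(x_{r(k)}) \ \ \forall c \in C, \qquad y_k^c = x_k^c \ \ \forall c \notin C, \qquad \forall k. \tag{3}$$

Then, $U_C = F$ when $C = \{1, \dots, n\}$.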

3.2 Gradually Updated Neural Networks

With the feature update $U_C$ defined on a channel set $C$, the commonly used one-layer CNN is a special case of feature updates in which every channel is updated simultaneously. However, we can also update the channels gradually. For example, the proposed GUNN can be formulated as

$$\mathrm{GUNN}(x) = (U_{C_l} \circ U_{C_{l-1}} \circ \cdots \circ U_{C_1})(x), \quad \text{where } \bigcup_{i=1}^{l} C_i = \{1, \dots, n\} \text{ and } C_i \cap C_j = \emptyset \ \ \forall i \neq j. \tag{4}$$

When $l = 1$, GUNN is equivalent to $F$.

Note that the number of parameters and the computation of GUNN are the same as those of the corresponding $F$ for any partition $C_1, \dots, C_l$ of $\{1, \dots, n\}$. However, by decomposing $F$ into channel-wise transformations and applying them sequentially, the later computed channels are deeper than the earlier ones. As a result, the depth of the network can be increased, as well as the network's learning capacity.
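The difference between a simultaneous and a gradual update can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: the channel-wise transformation is a hypothetical linear map (a 1x1 convolution stored as one weight row per output channel), and the names `feature_update` and `gradual_update` are ours, not the authors'.

```python
import numpy as np


def feature_update(x, weights, channels):
    """Update only the listed channels and copy all other channels unchanged.

    x: array of shape (m, n), with m locations and n channels.
    weights: array of shape (n, n); row c is a hypothetical linear
        channel-wise transformation (a 1x1 convolution) for channel c.
    """
    y = x.copy()
    y[:, channels] = x @ weights[channels].T
    return y


def gradual_update(x, weights, partition):
    """Apply one feature update per channel set, in the given order."""
    for channels in partition:
        x = feature_update(x, weights, channels)
    return x


rng = np.random.default_rng(0)
m, n = 4, 6
x = rng.normal(size=(m, n))
weights = rng.normal(size=(n, n))

# One channel set: every channel is computed from the original input.
simultaneous = gradual_update(x, weights, [[0, 1, 2, 3, 4, 5]])
# Three channel sets: later sets read channels that were already updated.
gradual = gradual_update(x, weights, [[0, 1], [2, 3], [4, 5]])

# Same weights either way; only the order of computation differs.
print(np.allclose(simultaneous[:, :2], gradual[:, :2]))
print(np.allclose(simultaneous[:, 2:], gradual[:, 2:]))
```

Both calls use the same weight matrix, so the parameter count and the number of multiply-adds are identical. Only the channels in the later sets see a deeper computation path.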

3.3 Channel-wise Update by Residual Learning

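Algorithm 1: Back-propagation for GUNN. Inputs: the composed feature update, its input, its output, the gradients with respect to the output, and the parameters of the feature updates. The algorithm consists of a single for loop over the feature updates.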

We adopt the residual learning proposed by ResNet (He et al., 2016a) in our model. Specifically, we take the channel-wise transformation $F^c$ to be

$$F^c(x) = G^c(x) + x^c, \tag{5}$$

where $G^c$ is a convolutional neural network. The motivation for expressing $F^c$ in a residual learning manner is to reduce overlap singularities (Orhan & Pitkow, 2018), which will be discussed in Section 4.

3.4 Learning GUNN by Backpropagation

Here we show the backpropagation algorithm for learning the parameters in GUNN, which uses the same amount of computation and memory as in $F$. In Eq. 4, let each feature update be parameterized by its own set of parameters, and let a standard back-propagation routine be available for any differentiable function, given the loss and the parameters. Algorithm 1 presents the back-propagation algorithm for GUNN. Since the feature updates have residual structures (He et al., 2016a), the last two steps of the loop can be merged into a single step, which further simplifies the implementation. It is easy to see that converting networks to GUNN does not increase the memory usage of the forward pass. Given Algorithm 1, converting networks to GUNN will not affect the memory usage in either training or evaluation.
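One way to see why no extra memory is needed is that, because each channel is updated exactly once, the input to every update step can be rebuilt from the layer input and the layer output alone. The sketch below illustrates only that observation; it is not the authors' Algorithm 1. The `tanh` map is a hypothetical stand-in for the convolutional network in the residual update, and all function names are ours.

```python
import numpy as np


def residual_update(x, weights, channels):
    """Residual update of the listed channels; tanh is a stand-in for the CNN."""
    y = x.copy()
    y[:, channels] = x[:, channels] + np.tanh(x @ weights[channels].T)
    return y


def forward_states(x, weights, partition):
    """Return every intermediate state; states[i] is the input to step i."""
    states = [x]
    for channels in partition:
        states.append(residual_update(states[-1], weights, channels))
    return states


def recover_step_inputs(x, y, partition):
    """Rebuild the input of every step from x and y only, last step first."""
    current = y.copy()
    inputs = []
    for channels in reversed(partition):
        current[:, channels] = x[:, channels]  # undo this step's overwrite
        inputs.append(current.copy())
    return inputs[::-1]


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))
weights = rng.normal(size=(6, 6))
partition = [[0, 1], [2, 3], [4, 5]]

states = forward_states(x, weights, partition)
recovered = recover_step_inputs(x, states[-1], partition)
```

Here `forward_states` stores every intermediate tensor only so that the recovery can be checked. A memory-conscious implementation would keep just `x` and the final output and walk the channel sets in reverse.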

4 GUNN Eliminates Overlap Singularities

Overlap singularities are inherent in the loss landscapes of some network architectures and are caused by the non-identifiability of subsets of the neurons. They were identified and discussed in previous work (Wei et al., 2008; Anandkumar & Ge, 2016; Orhan & Pitkow, 2018), and have been shown to harm the performance of deep networks. Intuitively, overlap singularities exist in architectures where the internal neurons collapse into each other. As a result, the models are degenerate and the effective dimensionality is reduced. Orhan & Pitkow (2018) demonstrated through experiments that residual learning (see Eq. 5) helps to reduce the overlap singularities in deep networks, which partly explains the exceptional performance of ResNet (He et al., 2016a) compared with plain networks. In the following, we first use a linear transformation as an example to demonstrate how GUNN-based networks break the overlap singularities. Then, we generalize the results to ReLU DNNs. Finally, we compare GUNN with the previous state-of-the-art network architectures from the perspective of singularity elimination.

4.1 Overlap Singularities in Linear Transformations

Consider a linear function $y = Wx$ such that

$$y^i = \sum_{j} w_{i,j}\, x^j. \tag{7}$$

Suppose that there exists a pair of collapsed neurons $y^p$ and $y^q$ ($p < q$). Then, for every input $x$, $y^p = y^q$, and the equality holds after any number of gradient descent steps, i.e., $\Delta y^p = \Delta y^q$.

Eq. 7 describes a plain network. The condition for $y^p$ and $y^q$ to be collapsed is that $w_{p,j} = w_{q,j}$ for all $j$. This is the case most discussed previously; it happens in practical networks and degrades performance.

When we add residual learning, Eq. 7 becomes

$$y^i = x^i + \sum_{j} w_{i,j}\, x^j. \tag{8}$$

Collapsed neurons require that $w_{p,p} + 1 = w_{q,p}$ and $w_{q,q} + 1 = w_{p,q}$. This makes the collapse of $y^p$ and $y^q$ very hard when $W$ is initialized from a normal distribution as in ResNet, but still possible.
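The collapse conditions for the residual case can be checked with a short derivation. It assumes the linear residual form $y^i = x^i + \sum_j w_{i,j} x^j$, which is our reading of Eq. 8; the indices $p$ and $q$ follow the text.

```latex
\begin{align*}
y^p - y^q
  &= x^p - x^q + \sum_{j} (w_{p,j} - w_{q,j})\, x^j \\
  &= (1 + w_{p,p} - w_{q,p})\, x^p
   + (w_{p,q} - w_{q,q} - 1)\, x^q
   + \sum_{j \notin \{p, q\}} (w_{p,j} - w_{q,j})\, x^j .
\end{align*}
```

For this difference to vanish for every input, each coefficient must be zero: $w_{q,p} = w_{p,p} + 1$, $w_{p,q} = w_{q,q} + 1$, and $w_{p,j} = w_{q,j}$ for all other $j$. Two of these conditions require a pair of weights to differ by exactly one, which is unlikely for weights drawn independently from a continuous distribution.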

Next, we convert Eq. 8 to GUNN, i.e.,

$$y^i = x^i + \sum_{j < i} w_{i,j}\, y^j + \sum_{j \ge i} w_{i,j}\, x^j. \tag{9}$$

Suppose that $y^p$ and $y^q$ ($p < q$) collapse. Consider $\Delta y^i$, the change in the value of $y^i$ after one step of gradient descent on $W$ for a given input and a small learning rate. Collapse requires $\Delta y^p = \Delta y^q$, which imposes a condition on the weights. But this condition will be broken in the next update, which in turn forces further conditions. These will also be broken in the next step of gradient descent optimization. Hence, $y^p$ and $y^q$ cannot collapse into each other. The complete proof can be found in the appendix.
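The symmetry-breaking effect of the ordering can be seen in a two-neuron toy example. The sketch below is an illustration, not the proof: it omits the residual term for brevity, uses hand-picked weights, and the function names are ours.

```python
import numpy as np


def plain_linear(x, w):
    """All outputs are computed from the original input at once."""
    return w @ x


def gradual_linear(x, w):
    """Compute outputs one channel at a time.

    Channel i reads the already updated channels j < i and the
    original channels j >= i.
    """
    state = x.astype(float).copy()
    for i in range(len(x)):
        state[i] = w[i] @ state
    return state


w = np.array([[1.0, 2.0], [1.0, 2.0]])  # two neurons with identical weights
x = np.array([1.0, 1.0])

plain = plain_linear(x, w)      # identical weights give identical outputs
gradual = gradual_linear(x, w)  # the second neuron sees the updated first one
```

With identical weight rows, the plain layer produces two equal outputs, while the gradual version produces different values because the second neuron's input already contains the first neuron's output.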

4.2 Overlap Singularities in ReLU DNN

In practice, architectures are usually composed of several linear layers and non-linearity layers. Analyzing all the possible architectures is beyond our scope. Here, we discuss the commonly used ReLU DNN, in which only linear transformations and ReLUs are used by simple layer cascading.

Following the notations in §3, we use $y = G(x) + x$, in which $G$ is a ReLU DNN. Note that $G$ is a continuous piecewise linear (PWL) function (Arora et al., 2018), which means that there exists a finite set of polyhedra whose union is the input space, and $G$ is affine linear over each polyhedron.

Suppose that we convert this network to GUNN and there exists a pair of collapsed neurons $y^p$ and $y^q$ ($p < q$). Then, the set of polyhedra for $y^p$ is the same as for $y^q$. Let $P$ be a polyhedron for $y^p$ and $y^q$ as defined above. Then, for every input in $P$, the outputs are given by an affine expression whose parameters are specific to the polyhedron $P$. Note that on each $P$, $y$ is a function of $x$ in the form of Eq. 9; hence, $y^p$ and $y^q$ cannot collapse into each other. Since the union of all polyhedra is the input space, we conclude that GUNN eliminates the overlap singularities in ReLU DNNs.

4.3 Discussions and Comparisons

The previous two subsections consider the GUNN conversion in which each channel set contains a single channel (see Eq. 4). But this slows down the computation on GPUs because of the data dependency between consecutive updates. Without specialized hardware or library support, we instead use larger channel sets. The resulting models run at a speed between that of ResNeXt (Xie et al., 2017) and DenseNet (Huang et al., 2017b). However, this change introduces singularities among the channels within the same set. Residual learning then helps GUNN reduce the singularities within the same set, since we initialize the parameters from a normal distribution. We will compare the results of GUNN with and without residual learning in the experiments.

We compare GUNN with the state-of-the-art architectures from the perspective of overlap singularities. ResNet (He et al., 2016a) and its variants use residual learning, which reduces but cannot eliminate the singularities. ResNeXt (Xie et al., 2017) uses group convolutions to break the symmetry between groups, which further helps to avoid neuron collapses. DenseNet (Huang et al., 2017b) concatenates the outputs of layers as the input to the next layer. DenseNet and GUNN both create dense connections, but DenseNet reuses the outputs by concatenating them, while GUNN adds them back to the inputs. The channels within the same layer of DenseNet can still collapse into each other since they are symmetric. In contrast, adding back makes residual learning possible in GUNN. This makes residual learning indispensable in GUNN-based networks.

5 Network Architectures

In this section, we will present the details of our architectures for the CIFAR (Krizhevsky & Hinton, 2009) and ImageNet (Russakovsky et al., 2015) datasets.

5.1 Simultaneously Updated Neural Networks and Gradually Updated Neural Networks

Since the proposed GUNN is a method for increasing the depths of the convolutional networks, specifying the architectures to be converted is equivalent to specifying the GUNN-based architectures. The architectures before conversion, the Simultaneously Updated Neural Networks (SUNN), become natural baselines for our proposed GUNN networks. We first study what baseline architectures can be converted.

There are two assumptions about the feature transformation $F$ (see Eq. 1): (1) the input and the output sizes are the same, and (2) $F$ is channel-wise decomposable. To satisfy the first assumption, we first use a convolutional layer with Batch Normalization (Ioffe & Szegedy, 2015) and ReLU (Nair & Hinton, 2010) to transform the feature space to a new space with the desired number of channels. To satisfy the second assumption, instead of directly specifying the transformation $F$, we focus on designing $U_C$, where $C$ is a subset of the channels (see Eq. 4). To be consistent with the term update used in GUNN and SUNN, we refer to $U_C$ as the update unit for channels $C$.

Bottleneck Update Units

Figure 4: Bottleneck Update Units for both SUNN and GUNN.

In the architectures proposed in this paper, we adopt bottleneck neural networks, as shown in Figure 4, for the update units of both SUNN and GUNN. Suppose that the update unit maps input features of one channel size to output features of another. Each unit contains three convolutional layers. The first convolutional layer transforms the input features to an intermediate size. The second convolutional layer has a fixed kernel size, stride, and padding and outputs the intermediate features. The third convolutional layer computes the output features. The output is then added back to the input, following the residual architecture proposed in ResNet (He et al., 2016a). We add a batch normalization layer (Ioffe & Szegedy, 2015) and a ReLU layer (Nair & Hinton, 2010) after the first and the second convolutional layers, and only a batch normalization layer after the third. Stacking several update units also yields a valid update unit. In total, we have two hyperparameters for designing an update unit: the expansion rate and the number of stacked three-layer update units.
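To get a feel for how the expansion rate drives the size of an update unit, the sketch below counts convolution weights for a three-layer bottleneck. Everything specific here is an assumption for illustration: the kernel sizes follow the usual ResNet-style 1x1, 3x3, 1x1 bottleneck, the two inner layers are assumed to have `expansion * n_out` channels, and the function name is ours.

```python
def bottleneck_conv_weights(n_in, n_out, expansion):
    """Count convolution weights in a 1x1 -> 3x3 -> 1x1 bottleneck (no biases).

    Assumes the two inner layers both have expansion * n_out channels.
    Batch normalization parameters are not counted.
    """
    mid = expansion * n_out
    first = n_in * mid * 1 * 1    # 1x1 convolution into the bottleneck
    second = mid * mid * 3 * 3    # 3x3 convolution inside the bottleneck
    third = mid * n_out * 1 * 1   # 1x1 convolution to the output channels
    return first + second + third


# Hypothetical sizes, chosen only to keep the arithmetic easy to follow.
total = bottleneck_conv_weights(n_in=8, n_out=2, expansion=2)
print(total)
```

Under these assumptions the middle 3x3 layer grows quadratically with the expansion rate, so it dominates the count once the expansion rate is large.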

One Resolution, One Representation

Our architectures have only one representation at each resolution, apart from the pooling layers and the convolutional layers that initialize the needed numbers of channels. Take the architecture in Table 1 as an example. There are two processes for each resolution. The first is the transition process, which computes the initial features with the dimensions of the next resolution and then downsamples them using average pooling. A convolutional operation is needed here because $F$ is assumed to have the same input and output sizes. The next process uses GUNN to update this feature space gradually. Each channel is updated only once, and all channels have been updated when this process ends. Unlike most previous networks, after these two processes the feature transformations at this resolution are complete. No more convolutional layers or blocks follow this feature representation, i.e., one resolution, one representation. Then, the network computes the initial features for the next resolution, or computes the final vector representation of the entire image by global average pooling. Designed in this way, SUNN networks usually have relatively few layers before being converted to GUNN-based networks.

Stage Output WideResNet-28-10 GUNN-15
GPU Memory GB@ GB@
# Params M M
Error (C10/C100) / /
Table 1: Architecture comparison between WideResNet-28-10 (Zagoruyko & Komodakis, 2016) and GUNN-15 for CIFAR. (Left) WideResNet-28-10. (Right) GUNN-15. GUNN achieves comparable accuracies on CIFAR10/100 while using a smaller number of parameters and consuming less GPU memory during training. In GUNN-15, the convolution stages with stars are computed using GUNN while others are not.

Channel Partitions

With the update units clearly defined, we can easily build SUNN and GUNN layers by using the units to update the representations following Eq. 4. The hyperparameters for a SUNN/GUNN layer are the number of channels and the partition over those channels. In our proposed architectures, we evenly partition the channels into segments. We can then use the number of channels and the number of segments to represent the configuration of a layer. Together with the two hyperparameters of the update units, we have four hyperparameters to tune for one SUNN/GUNN layer: the number of channels, the number of segments, the expansion rate, and the number of stacked update units.
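An even partition of the channel indices into segments can be sketched as follows. The helper name `even_partition` is ours, and requiring the channel count to be divisible by the number of segments is an assumption made to keep the segments equal in size.

```python
import numpy as np


def even_partition(num_channels, num_segments):
    """Split channel indices 0..num_channels-1 into equal, ordered segments."""
    if num_channels % num_segments != 0:
        raise ValueError("num_channels must be divisible by num_segments")
    indices = np.arange(num_channels)
    return [segment.tolist() for segment in np.split(indices, num_segments)]


segments = even_partition(6, 3)
print(segments)
```

Each returned list is one channel set, and the order of the lists gives the order in which the sets would be updated.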

5.2 Architectures for CIFAR

We have implemented two neural networks based on GUNN to compete with the previous state-of-the-art methods on the CIFAR datasets, i.e., GUNN-15 and GUNN-24. Table 1 shows the big picture of GUNN-15. Here, we present the details of the hyperparameter settings for GUNN-15 and GUNN-24. For GUNN-15, we have three GUNN layers, Conv, Conv and Conv. The configuration for Conv is , the configuration for Conv is , and the configuration for Conv is . For GUNN-24, based on GUNN-15, we change the number of output channels of Conv to , Trans to , Trans to , and Trans to . The hyperparameters are for Conv, for Conv, and for Conv. The number of parameters of GUNN-15 is for CIFAR-10 and for CIFAR-100. The number of parameters of GUNN-24 is for CIFAR-10 and for CIFAR-100. GUNN-15 is intended to compete with earlier methods by using a much smaller model, while GUNN-24 is targeted at comparing with ResNeXt (Xie et al., 2017) and DenseNet (Huang et al., 2017b) to reach state-of-the-art performance.

5.3 Architectures for ImageNet

Stage Output ResNet-152 GUNN-18
# Params M M
Error (Top-1/5) / /
Table 2: Architecture comparison between ResNet-152 (He et al., 2016a) and GUNN-18 for ImageNet. (Left) ResNet-152. (Right) GUNN-18. GUNN achieves better accuracies on ImageNet while using a smaller number of parameters.

We implement a neural network, GUNN-18, to compete with the state-of-the-art neural networks on ImageNet with a similar number of parameters. Table 2 shows the big picture of the architecture of GUNN-18. Here, we present the detailed hyperparameters for the GUNN layers in GUNN-18. The GUNN layers include Conv2, Conv3, Conv4 and Conv5. The hyperparameters are for Conv2, for Conv3, for Conv4 and for Conv5. The number of parameters is . GUNN-18 is targeted at competing with the previous state-of-the-art methods that have similar numbers of parameters, e.g., ResNet- (Xie et al., 2017), ResNeXt- (Xie et al., 2017) and DenseNet- (Huang et al., 2017b).

We also implement a wider GUNN-based neural network, Wide-GUNN-18, for better capacity. The hyperparameters are for Conv2, for Conv3, for Conv4 and for Conv5. The number of parameters is . Wide-GUNN-18 is targeted at competing with ResNet-, ResNeXt-, DPN (Chen et al., 2017) and SENet (Hu et al., 2017).

Method C10 C100
Network in Network (Lin et al., 2013)
All-CNN (Springenberg et al., 2014)
Deeply Supervised Network (Lee et al., 2014)
Highway Network (Srivastava et al., 2015)
# layers # params
ResNet (He et al., 2016a; Huang et al., 2016) M
FractalNet (Larsson et al., 2016) M
Stochastic Depth (Huang et al., 2016) M
ResNet with pre-act (He et al., 2016b) M
WideResNet-- (Zagoruyko & Komodakis, 2016) M
WideResNet-- (Zagoruyko & Komodakis, 2016) M
ResNeXt (Xie et al., 2017) M
DenseNet (Huang et al., 2017b) M
Snapshot Ensemble (Huang et al., 2017a) M
GUNN-24 Ensemble M
Table 3: Classification errors (%) on the CIFAR-10/100 test set. All methods use data augmentation. The third group shows the most recent state-of-the-art methods. The performance of GUNN is presented in the fourth group. A very small model, GUNN-15, outperforms all the methods in the second group except WideResNet--. A relatively bigger model, GUNN-24, surpasses all the competing methods. GUNN-24 becomes more powerful with ensembling (Huang et al., 2017a).

6 Experiments

In this section, we demonstrate the effectiveness of the proposed GUNN on several benchmark datasets.

6.1 Benchmark Datasets


CIFAR (Krizhevsky & Hinton, 2009) has two color image datasets: CIFAR-10 (C10) and CIFAR-100 (C100). Both datasets consist of natural images with the size of 32×32 pixels. The CIFAR-10 dataset has 10 categories, while the CIFAR-100 dataset has 100 categories. For both datasets, the training and test sets contain 50,000 and 10,000 images, respectively. To fairly compare our method with the state of the art (He et al., 2016a; Huang et al., 2017b, 2016; Larsson et al., 2016; Lee et al., 2014; Lin et al., 2013; Romero et al., 2014; Springenberg et al., 2014; Srivastava et al., 2015; Xie et al., 2017), we use the same training and testing strategies, as well as the same data processing methods. Specifically, we adopt a commonly used data augmentation scheme, i.e., mirroring and shifting, for these two datasets. We use the channel means and standard deviations to normalize the images for data pre-processing.
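The pre-processing and augmentation described above can be sketched with NumPy as follows. This is a minimal illustration rather than the authors' pipeline: the maximum shift of 4 pixels, the zero padding, the flip probability of 0.5, and the function names are all assumptions, and random data stands in for the real images.

```python
import numpy as np


def normalize(images, mean, std):
    """Normalize (N, H, W, C) images with per-channel mean and std of shape (C,)."""
    return (images - mean) / std


def mirror_and_shift(image, rng, max_shift=4):
    """Randomly mirror an (H, W, C) image, then shift it with zero padding."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]  # horizontal mirror
    height, width, _ = image.shape
    pad = ((max_shift, max_shift), (max_shift, max_shift), (0, 0))
    padded = np.pad(image, pad)  # zero padding on the spatial axes only
    top = rng.integers(0, 2 * max_shift + 1)
    left = rng.integers(0, 2 * max_shift + 1)
    return padded[top:top + height, left:left + width, :]


rng = np.random.default_rng(0)
images = rng.random((8, 32, 32, 3))  # random stand-in for CIFAR images
mean = images.mean(axis=(0, 1, 2))
std = images.std(axis=(0, 1, 2))

normalized = normalize(images, mean, std)
augmented = mirror_and_shift(normalized[0], rng)
```

Cropping a randomly placed window out of the padded image is equivalent to shifting the image by up to `max_shift` pixels in each direction while keeping its size.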


The ImageNet dataset (Russakovsky et al., 2015) contains about 1.28 million color images for training and 50,000 for validation. The dataset has 1000 categories. We adopt the same data augmentation methods as in the state-of-the-art architectures (He et al., 2016a, b; Huang et al., 2017b; Xie et al., 2017) for training. For testing, we use a single crop of size 224×224. Following the state of the art (He et al., 2016a, b; Huang et al., 2017b; Xie et al., 2017), we report the validation error rates.

6.2 Training Details

We train all of our networks using stochastic gradient descent. On CIFAR-10/100 (Krizhevsky & Hinton, 2009), the initial learning rate is set to , the weight decay is set to , and the momentum is set to without dampening. We train the models for epochs. The learning rate is divided by at th epoch and th epoch. We set the batch size to , following (Huang et al., 2017b). All the results reported for CIFAR, regardless of the detailed configurations, were trained using NVIDIA Titan X GPUs with data parallelism. On ImageNet (Russakovsky et al., 2015), the learning rate is also set to initially, and decreases following the schedule in DenseNet (Huang et al., 2017b). The batch size is set to . The network parameters are also initialized following (He et al., 2016a). We use Tesla V100 GPUs with data parallelism to get the reported results. Our results are directly comparable with ResNet, WideResNet, ResNeXt and DenseNet.
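A step schedule of the kind described for CIFAR, where the learning rate is divided by a fixed factor at given epochs, can be written as a small function. The concrete numbers in the usage lines are hypothetical placeholders chosen for illustration; they are not taken from the text.

```python
def step_learning_rate(epoch, base_lr, milestones, factor):
    """Return base_lr divided by `factor` once for every milestone reached."""
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr / (factor ** drops)


# Hypothetical settings, used only to show how the schedule behaves.
example_milestones = (150, 225)
early = step_learning_rate(0, base_lr=0.1, milestones=example_milestones, factor=10)
middle = step_learning_rate(150, base_lr=0.1, milestones=example_milestones, factor=10)
late = step_learning_rate(299, base_lr=0.1, milestones=example_milestones, factor=10)
```

Keeping the schedule as a pure function of the epoch makes it easy to reuse the same code with whatever base rate, milestones, and factor a given experiment calls for.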

6.3 Results on CIFAR

We train two models, GUNN-15 and GUNN-24, on the CIFAR-10/100 datasets. Table 3 shows the comparisons between our method and the previous state-of-the-art methods. GUNN achieves the best results in both the single-model test and the ensemble test. Here, we use Snapshot Ensemble (Huang et al., 2017a).

Method # layers # params top- top-
VGG- (Simonyan & Zisserman, 2014) M
ResNet- (He et al., 2016a) M
ResNeXt- (Xie et al., 2017) M
DenseNet- (Huang et al., 2017b) M
ResNet- (He et al., 2016a) M
ResNeXt- (Xie et al., 2017) M
DPN- (Chen et al., 2017) M
SE-ResNeXt- (Hu et al., 2017) M
Wide GUNN- M
Table 4: Single-crop classification errors (%) on the ImageNet validation set. The test size of all the methods is . Ours: .

Baseline Methods

Here we present the details of baseline methods in Table 3. The performances of ResNet (He et al., 2016a) are reported in Stochastic Depth (Huang et al., 2016) for both C10 and C100. The WideResNet (Zagoruyko & Komodakis, 2016) WRN-- is reported in their official code repository on GitHub. The ResNeXt in the third group is of configuration d, which has the best result reported in the paper (Xie et al., 2017). The DenseNet is of configuration DenseNet-BC (), which achieves the best performances on CIFAR-10/100. The Snapshot Ensemble (Huang et al., 2017a) uses DenseNet- to ensemble during inference. We do not compare with methods that use more data augmentation (e.g. (Zhang et al., 2017)) or stronger regularizations (e.g. (Gastaldi, 2017)) for the fairness of comparison.

Ablation Study

For ablation study, we compare GUNN with SUNN, i.e., the networks before the conversion. Table 5 shows the comparison results, which demonstrate the effectiveness of GUNN. We also compare the performances of GUNN with and without residual learning.

Method # layers # params C10 C100
GUNN-15-NoRes M
Table 5: Ablation study on residual learning and SUNN.

6.4 Results on ImageNet

We evaluate the GUNN on the ImageNet classification task, and compare our performances with the state-of-the-art methods. These methods include VGGNet (Simonyan & Zisserman, 2014), ResNet (He et al., 2016a), ResNeXt (Xie et al., 2017), DenseNet (Huang et al., 2017b), DPN (Chen et al., 2017) and SENet (Hu et al., 2017). The comparisons are shown in Table 4. The results of ours, ResNeXt, and DenseNet are directly comparable as these methods use the same framework for training and testing networks. Table 4 groups the methods by their numbers of parameters, except VGGNet which has parameters.

The results presented in Table 4 demonstrate that, with a similar number of parameters, GUNN can achieve performance comparable to the previous state-of-the-art methods. For GUNN-18, we also conduct an ablation experiment by comparing the corresponding SUNN with the GUNN of the same configuration. Consistent with the experimental results on the CIFAR-10/100 datasets, the proposed GUNN improves the accuracy on the ImageNet dataset.

7 Conclusions

In this paper, we propose Gradually Updated Neural Network (GUNN), a novel, simple yet effective method to increase the depths of neural networks as an alternative to cascading layers. GUNN is based on Convolutional Neural Networks (CNNs), but differs from CNNs in the way of computing outputs. The outputs of GUNN are computed gradually rather than simultaneously as in CNNs in order to increase the depth. Essentially, GUNN assumes the input and the output are of the same size and adds a computation ordering to the channels. The added ordering increases the receptive fields and non-linearities of the later computed channels. Moreover, it eliminates the overlap singularities inherent in the traditional convolutional networks. We test GUNN on the task of image recognition. The evaluations are done in three highly competitive benchmarks, CIFAR-10, CIFAR-100 and ImageNet. The experimental results demonstrate the effectiveness of the proposed GUNN on image recognition. In the future, since the proposed GUNN can be used to replace CNNs in other neural networks, we will study the applications of GUNN in other visual tasks, such as object detection and semantic segmentation.