Convolutional Neural Networks with Alternately Updated Clique (to appear in CVPR 2018)
Improving information flow in deep networks helps to ease the training difficulties and utilize parameters more efficiently. Here we propose a new convolutional neural network architecture with alternately updated clique (CliqueNet). In contrast to prior networks, there are both forward and backward connections between any two layers in the same block. The layers are constructed as a loop and are updated alternately. The CliqueNet has some unique properties. Each layer is both the input and output of any other layer in the same block, so that the information flow among layers is maximized. During propagation, the newly updated layers are concatenated to re-update the previously updated layers, and parameters are reused multiple times. This recurrent feedback structure is able to bring higher-level visual information back to refine low-level filters and achieve spatial attention. We analyze the features generated at different stages and observe that using refined features leads to a better result. We adopt a multi-scale feature strategy that effectively avoids the progressive growth of parameters. Experiments on image recognition datasets including CIFAR-10, CIFAR-100, SVHN and ImageNet show that our proposed models achieve state-of-the-art performance with fewer parameters.
In recent years, the structure and topology of deep neural networks have attracted significant research interest, since convolutional neural network (CNN) based models have achieved huge success in a wide range of computer vision tasks. A notable trend in these CNN architectures is that the networks are getting deeper, from AlexNet with 5 convolutional layers, through the VGG network and GoogLeNet with 19 and 22 layers, respectively [32, 36], to recent ResNets, whose deepest model has more than one thousand layers. However, inappropriately designed deep networks make it hard for later layers to access the gradient information from previous layers, which may cause gradient vanishing and parameter redundancy problems [17, 18].
Successfully adopted in ResNet and Highway Network, the skip connection is an efficient way to make information from bottom layers accessible to top layers and, by relieving the gradient vanishing problem, to ease network training at the same time. The residual block structure in ResNet also inspired a series of ResNet variations, including ResNeXt, WRN, PolyNet, etc. To further activate the gradient and information flow in networks, DenseNet was recently proposed, in which any layer in a block is the output of all preceding layers and the input of all subsequent layers. Recent studies show that the skip connection mechanism can be interpreted as a recurrent neural network (RNN) or LSTM when weights are shared among different layers [27, 5, 21]. In this way, the deep residual network is treated as a long sequence and hidden units are linked by skip connections. While this recurrent structure benefits feature re-usage and iterative learning, the residual information is restricted to neighboring layers and cannot be considered across multiple layers, because the recurrence happens only once at each layer.
When people look at a picture or a scene, information about the target is captured better if they look at it again or reconsider it with additional attention. In cognition theory, the activity of a neuron in the visual cortex is influenced by the responses of other cortical areas, transferred through feedback connections [19, 15]. This motivates the introduction of feedback to deep networks [35, 42]. The feedback connections that bring back higher-level semantic information in a top-down manner are able to re-weight the focus and suppress the non-relevant neuron activations of background and noise.
Inspired by the recurrent structure and attention mechanism, in this study we propose a new convolutional neural network architecture with alternately updated clique (CliqueNet). In contrast to prior network structures, there are both forward and feedback connections between any two layers in the same block. As illustrated in Figure 1, the layers in a Clique Block are constructed as a clique and are updated alternately. Concretely, several previous layers are concatenated to update the next layer, after which the newly updated layer is concatenated to re-update the previous layers, so that information flow and the feedback mechanism are maximized. Each layer in a block is both the input and output of any other one, which means the layers are more densely connected than in DenseNets. We adopt a multi-scale feature strategy to compose the final representation from the block features at different map sizes.
The CliqueNet architecture has some unique properties. Intuition would suggest that our proposal is parameter-demanding, because given a block with $n$ layers, DenseNet needs $C_n^2$ groups of parameters, while ours needs $A_n^2$ ($C$ and $A$ denote the combination operator and the permutation operator, respectively). However, the number of filters in DenseNet increases linearly as the depth rises, which may lead to rapid growth of parameters. In our architecture, only the Stage-II feature in each block is fed into the next block. It turns out that this is a more parameter-efficient way. In addition, traditional neural networks add a new layer together with its corresponding parameters. In CliqueNet, the weights among layers in a block keep being recycled during propagation. The layers can be updated alternately multiple times, so that a deeper representation space is attained with a fixed number of parameters.
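The comparison of parameter groups above can be checked with a small calculation. The following is a minimal sketch, assuming that one group of parameters corresponds to one connection between two of the n layers in a block (unordered pairs for forward-only connections, ordered pairs for bidirectional ones) and that connections from the block's input layer are not counted. The function names are illustrative and do not come from the paper.

```python
from math import comb, perm


def forward_only_groups(n: int) -> int:
    """Weight groups when each pair of layers is connected in one direction."""
    return comb(n, 2)  # n * (n - 1) / 2 unordered pairs


def bidirectional_groups(n: int) -> int:
    """Weight groups when each pair of layers is connected in both directions."""
    return perm(n, 2)  # n * (n - 1) ordered pairs


for n in (4, 5):
    print(n, forward_only_groups(n), bidirectional_groups(n))
```

Within a block, the bidirectional count is always exactly twice the forward-only count, so the extra cost is a constant factor rather than a change in growth rate.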
CliqueNet also shows a strong ability for representation learning due to the combination of recurrent structure and feedback mechanism. In each Clique Block, both forward and feedback are densely connected. The information flow is maximized and feature maps are repeatedly refined by attention. We show that our network architecture can suppress the activations of background and noises, and achieve competitive results without resorting to data augmentation.
The contributions in this study are listed as follows:
We propose a new convolutional neural network architecture called CliqueNet, which incorporates both forward and backward connections between any two layers in the same block. The layers, constructed as a loop, are updated alternately. CliqueNet, which combines both recurrent structure and attention mechanism, is able to maximize information flow and achieve feature refinement. We show that the refined features are more discriminative and lead to better performance.
We adopt a multi-scale feature strategy that effectively circumvents the progressive increment of parameters, despite the extra feedback connections.
We conduct experiments on four benchmark datasets including CIFAR-10, CIFAR-100, SVHN and ImageNet to demonstrate the superiority of our models.
A number of deep networks with large model capacity have been proposed. For widening the network, the Inception modules in GoogLeNet fuse features at different map sizes to construct a multi-scale representation. Multi-column nets and Deeply-Fused Nets also use a fusion strategy and have a wide network structure. Wide residual networks increase the width and decrease the depth to improve the performance, while FractalNet deepens and widens at the same time. However, simply widening the network tends to consume more runtime and memory. For deepening the networks, skip connections or shortcut paths are widely adopted strategies to ease the network training [13, 34]. It has been shown that some of the layers in ResNets are dispensable and cause parameter redundancy, so randomly dropping a subset of layers eases the training and achieves better performance. To further increase information flow, DenseNets replace the identity mapping in the residual block by a concatenation operation, so that new feature learning can be reinforced while old features are still re-used. In line with this view, dual path networks (DPN) are proposed to combine the advantages of both the residual path and the densely connected path.
Both the residual path and the densely connected path correspond to a recurrent propagation, and their success has been attributed to the recurrent structure and iterative refinement [27, 11, 21]. Studies incorporating recurrent connections into CNNs also show superiority in object recognition, scene parsing, and some other tasks. CliqueNet differs from these structures in that the iterative mechanism exists in each step of the propagation, instead of just between neighboring layers or from the top layer to the bottom layer; all layers in a block participate in the recurrent loop, so that the filters communicate sufficiently and the blocks play the roles of both information carrier and refiner.
Recent studies have embraced the attention mechanism as an effective technique to strengthen the neurons that respond to the target, and to improve the performance as a result. It has proved fruitful in many applications, including image recognition [37, 8], image captioning, image-text matching, and saliency detection. In general, visual attention can be achieved by formulating an optimization problem, by weighting the activations spatially or channel-wise [3, 16], or by introducing feedback connections [39, 35, 42]. In one such model, consecutive decisions are made via feedback connections for a more accurate prediction: the input of the next decision is based on the output of the last decision. Experiments show that the top-down propagation is capable of refining lower-level features and improving classification performance, especially on datasets with noise and occlusion [39, 28]. But how to design a proper attention mechanism and boost the supervision between layers remains to be explored.
There are also some studies that design attention mechanisms tied to recurrent neural networks [28, 24, 8]. A recent report tries to propose a loopy net, but it just repeats the skip connections and does not make the layers communicate. The loopy inference adopted in [4, 45] shares a similar motivation with our work. However, they do not incorporate feedback connections, which are important for feature refinement. CliqueNet enables true cycling because of the alternate propagation. Although alternate updating has been an important method in optimization theory, it has not been introduced into deep learning. To the best of our knowledge, we are the first to use updated layers to re-update previous layers alternately, with these layers forming a loop that is cycled multiple times.
The CliqueNet architecture has two main ingredients, the block with alternately updated clique (Clique Block) to enable feature refinement, and the multi-scale feature strategy that facilitates parameter efficiency.
Table 1: Propagation of a Clique Block with 5 layers (columns: Bottom Layers, Weights, Top Layer, Feature).
In order to maximize the information flow among layers, we design the Clique Block. Any two layers in the same block are connected bidirectionally, except for the input node. Compared with the Dense Block, where each layer is the output of all previous layers and the input of all subsequent layers, the Clique Block makes each layer both the input and output of any other layer. The propagation of a Clique Block with 5 layers is illustrated in Table 1. At the first stage, the input layer ($X_0$) initializes all layers in this block by single directional connections. Each updated layer is concatenated to update the next layer. From the second stage, the layers begin updating alternately. All layers except the top layer to be updated are concatenated as the bottom layer, and their corresponding parameters are also concatenated. Accordingly, the $i$-th ($i \ge 1$) layer in the $k$-th ($k \ge 2$) loop can be formulated as:
$$X_i^{(k)} = g\Big(\sum_{l<i} W_{li} * X_l^{(k)} + \sum_{m>i} W_{mi} * X_m^{(k-1)}\Big),$$
where $*$ denotes the convolution operation with parameters $W$, and $g$ is the non-linear activation function. $W_{ij}$ is re-used in different stages. Each layer always receives feedback information from the layers that were updated more recently. The block thereby achieves a spatial attention mechanism, due to the top-down refinement brought by each propagation. This recurrent feedback structure ensures that the communication among all layers in the block is maximized.
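The alternate update described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: each layer's feature is reduced to a channel vector at a single spatial location, so a convolution becomes a matrix-vector product (as a 1×1 kernel would), ReLU stands in for the activation function, batch normalization is omitted, and the names `make_weights` and `clique_block` are hypothetical. Summing the per-pair products is equivalent to concatenating the bottom layers and their weights.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(w, x):
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def add(a, b):
    return [p + q for p, q in zip(a, b)]


def make_weights(n_layers, in_ch, k, seed=0):
    """One matrix W[(i, j)] per directed connection i -> j; i = 0 is the input layer."""
    rng = random.Random(seed)
    w = {}
    for j in range(1, n_layers + 1):
        for i in range(0, n_layers + 1):
            if i == j:
                continue
            cols = in_ch if i == 0 else k
            w[(i, j)] = [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                         for _ in range(k)]
    return w


def clique_block(x0, w, n_layers, k, stages=2):
    """Return one {layer index: feature} dict per stage."""
    x = {}
    history = []
    # Stage I: the input layer and the layers already initialized feed layer j.
    for j in range(1, n_layers + 1):
        total = matvec(w[(0, j)], x0)
        for i in range(1, j):
            total = add(total, matvec(w[(i, j)], x[i]))
        x[j] = relu(total)
    history.append(dict(x))
    # Later stages: all other layers, in their latest version, re-update layer j.
    # Layers i < j were already refreshed in this stage; layers i > j still hold
    # the previous stage's values. The same weights are reused in every stage.
    for _ in range(stages - 1):
        for j in range(1, n_layers + 1):
            total = [0.0] * k
            for i in range(1, n_layers + 1):
                if i != j:
                    total = add(total, matvec(w[(i, j)], x[i]))
            x[j] = relu(total)
        history.append(dict(x))
    return history


weights = make_weights(n_layers=5, in_ch=3, k=4)
stage1, stage2 = clique_block([1.0, -2.0, 0.5], weights, n_layers=5, k=4)
```

Note that the input layer only appears in the first stage, matching the statement that it is connected by single directional connections.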
We analyze the features produced at different stages, and adopt a multi-scale feature strategy to avoid the rapid increment of parameters.
The first stage is used to initialize all layers in the block, and the layers are refined repeatedly from the second stage on. Given that the Stage-II feature is refined with attention and assimilates more high-level visual information, we concatenate the Stage-II feature with the input layer of each block to form the block feature, which is passed to the loss function after global pooling. Only the Stage-II feature is fed into the next block as its input layer; see Figure 2. In this way, the final representation is characterized by multi-scale feature maps, and the dimensionality in each block does not increase progressively. Because higher-stage propagation comes with more computational cost and amplifies the model complexity, we only consider the first two stages.
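To make the dimensionality argument concrete, the channel bookkeeping of this strategy can be sketched as follows. The sketch assumes that the Stage-II feature of a block is the concatenation of its layers (layers per block times filters per layer) and that the transition between blocks keeps the channel count unchanged; both are illustrative assumptions, and the function names are hypothetical.

```python
def block_channels(in_ch, layers_per_block, k):
    """Channels of the Stage-II feature and of the block feature."""
    stage2 = layers_per_block * k    # only this part is passed to the next block
    block_feature = in_ch + stage2   # Stage-II feature concatenated with the input layer
    return stage2, block_feature


def final_feature_dims(first_in_ch, n_blocks, layers_per_block, k):
    """Per-block feature sizes after global pooling, and their total."""
    dims = []
    in_ch = first_in_ch
    for _ in range(n_blocks):
        stage2, feature = block_channels(in_ch, layers_per_block, k)
        dims.append(feature)
        in_ch = stage2  # assumption: the transition does not change the channel count
    return dims, sum(dims)


dims, total = final_feature_dims(first_in_ch=64, n_blocks=3, layers_per_block=5, k=36)
print(dims, total)
```

Because each block receives only the previous Stage-II feature, the per-block input width stays bounded instead of accumulating from block to block.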
|Model||Block feature||Transit||Error (%)|
|CliqueNet (I+I)||$X_0$, Stage-I||Stage-I||6.64|
|CliqueNet (I+II)||$X_0$, Stage-I||Stage-II||6.1|
|CliqueNet (II+II)||$X_0$, Stage-II||Stage-II||5.76|
To analyze the features generated in different stages, we conduct experiments on the CIFAR-10 dataset (with no data augmentation) using different versions of CliqueNets. As Table 2 shows, CliqueNet (I+I) only considers the Stage-I feature. CliqueNet (I+II) uses the Stage-I feature and the input layer as the block feature passed to the loss function, but passes the Stage-II feature into the next block. CliqueNet (II+II) adopts our aforementioned strategy. They all have 3 blocks with 5 layers in each block. Each layer contains 36 filters. The experimental settings follow prior work. The main results are shown in Figure 3. We find that the introduction of the Stage-II feature indeed leads to a better result by a significant margin. We adopt the CliqueNet (II+II) structure for the following experiments.
In addition to the structures mentioned above, we consider some techniques to help strengthen the model and improve the state of the art. In the experimental section, we conduct experiments with and without these additional techniques to show the effectiveness of our model.
Attentional transition. The CliqueNet includes feedback connections to refine lower-level activations using higher-level visual information. The attention mechanism weights the feature maps spatially to weaken noise and background. Channel-wise attention, adopted in [3, 37, 16], also benefits recognition because it recalibrates different filters to prevent overfitting and to encourage the learning of new features. In CliqueNet, we incorporate a channel-wise attention mechanism in the transition layers, following a previously proposed method. As depicted in Figure 4, the filters are globally averaged after the convolution in the transition. They are followed by two fully connected (FC) layers. The first FC layer has half the number of filters and is activated by the ReLU function. The second FC layer has the same number of filters and is activated by the sigmoid function, so that the activation is scaled into $[0, 1]$ and acts on the input layer by filter-wise multiplication. Unlike that method, which places this module at each residual layer, we only add it to the transition layers in order to adjust the filters passed into the next block.
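The channel-wise gating described in this paragraph can be sketched in plain Python as follows. It is a minimal illustration, assuming feature maps are stored as a list of channels (each a list of rows) and that the two fully connected layers are given as explicit weight matrices and bias vectors; the function names and the data layout are hypothetical, not taken from the paper.

```python
import math


def global_average(fmap):
    """fmap: list of channels, each a list of rows; returns one mean per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def fc(w, b, x):
    """Fully connected layer: one output per weight row."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def channel_attention(fmap, w1, b1, w2, b2):
    """Global average -> FC (half width, ReLU) -> FC (full width, sigmoid) -> rescale."""
    squeezed = global_average(fmap)
    hidden = [max(0.0, v) for v in fc(w1, b1, squeezed)]
    gate = [1.0 / (1.0 + math.exp(-v)) for v in fc(w2, b2, hidden)]
    scaled = [[[gate[c] * v for v in row] for row in ch]
              for c, ch in enumerate(fmap)]
    return scaled, gate
```

For C channels, `w1` would have C/2 rows of length C and `w2` would have C rows of length C/2, so each channel receives one gate value between 0 and 1.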
Bottleneck and compression. Bottleneck is an effective way to decrease the number of parameters and provide further potential to enlarge model capacity. It is conjectured that the bottleneck architecture is suitable for deeper networks and large datasets like ImageNet, and recent studies have embraced bottleneck for better performance [13, 17, 37, 5]. So we introduce bottleneck to our large models. The $3 \times 3$ convolution kernels in each block are replaced by $1 \times 1$ kernels that produce a middle layer, after which a $3 \times 3$ convolution layer follows to produce the top layer. The middle layer and top layer contain the same number of feature maps. Compression is another tool adopted in prior work to make the model more compact. Instead of compressing the number of filters in transition layers as is done there, we only compress the features that are passed to the loss function, i.e. the Stage-II feature concatenated with its input layer. The models with compression have an extra convolutional layer with $1 \times 1$ kernel size before global pooling. It generates half the number of filters to enhance model compactness and keep the dimensionality of the final feature in a proper range.
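A rough parameter count shows why the bottleneck saves parameters. The sketch below assumes a 3×3 kernel for the plain layer, a 1×1 bottleneck convolution followed by a 3×3 convolution, equal widths for the middle and top layers, and no bias terms; the kernel sizes and the example widths (144 input channels, 36 filters) are assumptions chosen for illustration.

```python
def conv_params(kernel, in_ch, out_ch):
    """Weights of a kernel x kernel convolution, ignoring biases."""
    return kernel * kernel * in_ch * out_ch


def plain_layer_params(in_ch, k, kernel=3):
    """A single kernel x kernel convolution from in_ch bottom channels to k filters."""
    return conv_params(kernel, in_ch, k)


def bottleneck_layer_params(in_ch, k, kernel=3):
    """A 1x1 convolution to a k-channel middle layer, then a kernel x kernel convolution."""
    return conv_params(1, in_ch, k) + conv_params(kernel, k, k)


print(plain_layer_params(144, 36), bottleneck_layer_params(144, 36))
```

The saving grows with the number of bottom channels, since the expensive spatial kernel is applied only to the narrow middle layer.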
In our experiments, we test our models on benchmark datasets without the aforementioned extra techniques to show the effectiveness of CliqueNet, and further improve the state-of-the-art performance with them. There are two structure parameters, the sum of layers in all blocks, $T$, and the number of filters per layer, $k$. For our models without bottleneck, convolution layers in each block have $3 \times 3$ kernels and are padded by one pixel to keep the feature maps the same size. Blocks are linked by transition layers, where a convolution layer with $1 \times 1$ kernel size is followed by $2 \times 2$ average pooling. All convolutions are performed in a unit composed of three consecutive operations: batch normalization, ReLU, and the convolution. The Stage-II features, together with their input layers, from all blocks are concatenated after global pooling, and are followed by a fully connected layer with softmax.
Table 3: CliqueNet structures on ImageNet (surviving rows: initial convolution with 64 filters and stride 2; max pooling with stride 2; three transition layers, each a convolution followed by average pooling).
For experiments on CIFAR and SVHN, there are three blocks in total, in which the feature map sizes are $32 \times 32$, $16 \times 16$, and $8 \times 8$, respectively. Before entering the first block, the input images pass through a $3 \times 3$ convolution with 64 output channels, which serves as the input layer ($X_0$) of the first block. As for ImageNet, we use four blocks with bottleneck and compression, and compare our results with and without attentional transition. The initial transition applies a $7 \times 7$ convolution with stride 2 and $3 \times 3$ max pooling with stride 2 to the input images. Our four network structures on ImageNet are shown in Table 3.
|Model||Attentional transition||Bottleneck||Compression||FLOPs||Params||CIFAR-10||CIFAR-100||SVHN|
|Recurrent CNN||-||-||-||-||1.86M||8.69||31.75||1.80|
|Stochastic Depth ResNet||-||-||-||-||1.7M||11.66||37.8||1.75|
|DenseNet (12-36)||-||-||-||0.53G||1.0M||7.00||27.55||1.79|
|DenseNet ()||-||-||-||3.54G||7.0M||5.77||23.79||1.67|
|DenseNet (24-96)||-||-||-||13.78G||27.2M||5.83||23.42||1.59|
|DenseNet ()||-||✓||✓||0.58G||0.8M||5.92||24.15||1.76|
|DenseNet (24-246)||-||✓||✓||10.84G||15.3M||5.19||19.64||1.74|
We evaluate the CliqueNet on benchmark classification datasets, including CIFAR-10, CIFAR-100, SVHN and ImageNet, and compare our results with the state of the art.
CIFAR. The CIFAR-10 and CIFAR-100 datasets both consist of $32 \times 32$ colored images. The CIFAR-10 dataset consists of 60,000 images in 10 classes, with 6,000 images in each class. There are 50,000 images for training and 10,000 images for testing. The CIFAR-100 dataset is similar to CIFAR-10 but has 100 classes, each of which contains 600 images. For data normalization, we preprocess the dataset by subtracting the mean and dividing by the standard deviation.
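The normalization step mentioned above amounts to standardizing pixel values. The following is a minimal sketch on a flat list of values, assuming the statistics are computed separately for each color channel over the training set and that the population standard deviation is used; the text does not specify either detail, and the function name is illustrative.

```python
from statistics import fmean, pstdev


def normalize_channel(values):
    """Subtract the mean and divide by the standard deviation of one channel."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation; must be non-zero
    return [(v - mean) / std for v in values]


pixels = [0.0, 64.0, 128.0, 255.0]  # hypothetical pixel intensities
normalized = normalize_channel(pixels)
```

After this step the channel has zero mean and unit standard deviation, which keeps input scales comparable across channels.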
SVHN. The Street View House Numbers (SVHN) dataset contains colored images of house numbers cropped from Google Street View. There are 73,257 images in the training set, 26,032 in the testing set, and 531,131 digits for additional training. Following common practice [41, 18, 25, 17], we use all training samples without augmentation and divide images by 255 for normalization. We report the lowest error rate on the testing set.
ImageNet. We also conduct experiments on the ILSVRC 2012 dataset, which contains 1.2 million training images, 50,000 validation images, and 100,000 test images with 1,000 classes. Following [13, 17], we adopt the standard data augmentation for the training set. A $224 \times 224$ crop is randomly sampled from an image or its horizontal flip. The images are normalized using mean values and standard deviations. We report the single-crop error rate on the validation set.
Our experimental results on CIFAR and SVHN are shown in Table 4. The first part in the table includes some methods before DenseNets and some other studies that also incorporate feedback connections or attention mechanism. The second and third parts compare the CliqueNets with DenseNets when they both have no extra technique. The last two parts show the situation with extra techniques. The best result and the second best result are marked by red bold and bold, respectively.
Without extra techniques. The first three parts show that, when extra techniques are not considered, CliqueNets outperform most previous methods on CIFAR-10, CIFAR-100, and SVHN with significantly fewer parameters. Because the layers in CliqueNet can be re-updated and contribute features in each cycle, CliqueNet is much shallower than other models. In our smallest model, CliqueNet (36-12) (where X-Y denotes $k = X$ and $T = Y$), each block contains 4 layers. It has the same number of filters per block, 144, as DenseNet (12-36), but reduces the error rate from 7% to 5.93% on CIFAR-10 with slightly fewer parameters. Although the ResNet with stochastic depth achieved slightly better performance on SVHN with 1.7M parameters than CliqueNet (36-12), our model reduces the error rate on CIFAR-10 and CIFAR-100 by a large margin. As the model capacity grows, the performance of CliqueNets keeps improving without overfitting. Our model CliqueNet (80-15) already achieves the state of the art on the three datasets, and even outperforms the DenseNets that use extra techniques on CIFAR-10 and SVHN. It has only 6.94M parameters, about a quarter of the 27.2M parameters of DenseNet (24-96) and about half of the 15.3M parameters of DenseNet (24-246), which uses bottleneck and compression.
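The filter counts quoted in this comparison follow from simple arithmetic. The sketch below reads a model name (X-Y) as X filters per layer and Y layers in total, spread evenly over three blocks; this reading is an interpretation that is consistent with the 4 layers per block and 144 filters stated above, and the function name is hypothetical.

```python
def filters_per_block(k, total_layers, n_blocks=3):
    """Filters contributed by one block: layers per block times filters per layer."""
    layers_per_block, remainder = divmod(total_layers, n_blocks)
    if remainder:
        raise ValueError("layers must divide evenly among blocks")
    return layers_per_block * k


print(filters_per_block(36, 12))  # CliqueNet (36-12): 4 layers x 36 filters
print(filters_per_block(12, 36))  # DenseNet (12-36): 12 layers x 12 filters
```

Both models end up with 144 filters per block, which is what makes them comparable counterparts despite their different depths.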
With extra techniques. CliqueNets realize a spatial attention mechanism through their recurrent feedback propagation. When armed with channel-wise attention, they achieve improved performance. This is demonstrated by CliqueNet (36-12) with attentional transition, which has a better result on CIFAR-10 and CIFAR-100 with slightly more parameters. Compression has the same effect by making the model more compact. The results show that the attentional transition is compatible with compression: CliqueNet (36-12) with both attentional transition and compression leads to a better result than its original version and than its versions with only attentional transition or only compression. Compared with its counterpart DenseNet (12-36), it reduces the error rate by 1.39% on CIFAR-10, 2% on CIFAR-100, and 0.1% on SVHN, with just 0.08M more parameters. CliqueNet (80-15) with attentional transition and compression also improves on its original version, and advances the state of the art on SVHN to an error rate of 1.53% with 8M parameters, while the previously best result of 1.59% on SVHN, achieved by DenseNet (24-96), requires three times more parameters. The bottleneck architecture is effective in saving parameters, and our largest model, CliqueNet (150-15) with bottleneck, further improves the performance on CIFAR-10 and CIFAR-100, while increasing parameter and computation cost only moderately.
Because we have limited computational resources and can only spread a batch among 4 GPUs, we use a batch size of 160 on ImageNet, instead of the 256 used in most studies. Although a smaller batch size would impair the performance when training for the same number of epochs, the CliqueNets achieve results on ImageNet comparable with ResNets and DenseNets; see Table 5. This indicates that our proposed models can also be applied to large datasets.
CliqueNet-S0 and CliqueNet-S1 outperform ResNet-18 and ResNet-34 with only half of their parameters. Larger models also perform on par with the state of the art achieved by ResNets and DenseNets. When the attentional transition is considered, the CliqueNet contains both spatial attention and channel-wise attention, and performs better accordingly. CliqueNet-S2 and CliqueNet-S3 with attentional transition both reduce the top-1 error rate by about 1% compared with their original versions, which do not have attentional transition.
In order to better analyze the recurrent feedback mechanism and the multi-scale feature strategy in CliqueNet, we visualize feature maps and parameters based on pre-trained models and provide a further understanding.
Parameter efficiency. Although CliqueNet has bidirectional connections between any two layers in the same block, which brings more parameters within the block, we find that CliqueNet achieves the state of the art on the CIFAR and SVHN datasets with considerably fewer parameters than DenseNets. On ImageNet, CliqueNet, trained with a smaller batch size, is also parameter-efficient compared with ResNets. This is mainly due to the multi-scale feature strategy, which passes only the Stage-II feature into the next block, instead of stacking feature maps towards deeper layers, which may cause progressive growth of parameters. In Figure 5, we visualize the weights among layers within a block of pre-trained CliqueNet and DenseNet. The colored pixels of the Clique Block cover the whole heat map because of our feedback connections. In contrast, the heat dots in a Dense Block are concentrated along the diagonal. A similar result has also been reported in prior work. The observation reveals that only neighboring layers have strong dependency in DenseNet, while its forward stacking pattern is actually parameter-demanding. This helps to explain the parameter and FLOP efficiency of CliqueNet, where information flow is distributed more evenly in each block.
Feature refinement. In CliqueNet, the layers are updated alternately so that they are supervised by each other. Moreover, in the second stage, feature maps always receive higher-level information from the filters that were updated more recently. This spatial attention mechanism refines the layers repeatedly, and is able to repress the noise or background of images and focus more activations on the region that characterizes the target object. To test these effects, we visualize the feature maps following previously proposed methods. As shown in Figure 6, we choose three input images with complex backgrounds from the ImageNet validation set, and visualize their feature maps with the highest average activation magnitude in Stage-I and Stage-II, respectively. We observe that, compared with Stage-I, the feature maps in Stage-II diminish the activations of surrounding objects and focus more attention on the target region. This is in line with the conclusion from Table 2 that the Stage-II feature is more discriminative and leads to better performance.
In this study, we introduce a new convolutional neural network architecture where the layers in a block are constructed as a clique and are updated alternately in a loop. Any layer is both the input and output of any other one in the same block, so that the information flow is maximized. The parameters are circulated in the course of propagation and are able to produce features at multiple stages. We analyze the features in different stages and observe that the introduction of the Stage-II feature helps to suppress noise and leads to better performance. The multi-scale feature strategy effectively circumvents the progressive growth of parameters. Experiments show that our proposed architectures are able to achieve the state of the art with fewer parameters, especially on CIFAR and SVHN without resorting to data augmentation.
Different from prior networks, the CliqueNet utilizes a fixed number of parameters to attain a deeper representation space and incorporates the recurrent feedback to achieve attention mechanism. This topology provides the potential of developing models for other computer vision tasks in future work, such as semantic segmentation, salient object detection, image captioning, etc.
Zhouchen Lin was supported by National Basic Research Program of China (973 Program) (grant no. 2015CB352502), National Natural Science Foundation (NSF) of China (grant nos. 61625301 and 61731018), Qualcomm, and Microsoft Research Asia.
Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
Highway and residual networks learn unrolled iterative estimation. In ICLR, 2017.
Journal of machine learning research, 15(1):1929–1958, 2014.
Attentional neural network: Feature selection using cognitive feedback. In NIPS, pages 2033–2041, 2014.