I Introduction
Over the last few years, deep convolutional neural networks [1]
have brought a revolution in computer vision society by learning powerful representations based on largescale datasets. Till now, CNNs have shown their success in but not limited to the following areas: image classification
[2, 3, 4, 5, 6, 7], detection [8, 9, 10, 11][12, 13], etc.The key idea of CNNs is utilizing convolutional and pooling layers to progressively extract more and more abstract patterns. The convolutional layers convolve multiple local filters with input images (or outputs of previous layers), and aim to produce translation invariant local features. Afterwards, pooling layers are applied to summarize the feature responses of the convolutional layers over multiple regions of images, and compress the size of the response maps. Both convolution and pooling are locally performed. For example, the representation of the top left image region will not influence the representation of the bottom right region. However, contextual information is very important for object/scene recognition. For example, in an image with label “beach”, if “sand” regions are represented with the reference of “sea” regions, then it is much easier to distinguish them from “road” or “desert sand”. In CNNs, spatial and scale dependencies among different image regions are not explicitly modeled.
In this manuscript, we aim to encode contextual dependencies in image representation. To learn the dependencies efficiently and effectively, we propose a new class of hierarchical recurrent neural networks (HRNNs), and utilize the HRNNs to learn such contextual information.
Recurrent neural networks (RNNs) have achieved great success in natural language processing (NLP)
[14, 15, 16, 17, 18, 19]. RNNs [20, 21]are neural networks developed for modeling dependencies in sequences by using feedback connections among themselves. Thus, they can retain all the processed states in the sequence, and learn patterns from sequential context. Furthermore, because of the reuse of hidden layers, only a limited number of neurons need to be kept in the model. Two most popular RNN models are the simple recurrent neural network (SRN) and longshort term memory recurrent network (LSTM). Based on which, we will introduce hierarchical SRN (HSRN) and hierarchical LSTM (HLSTM). Both HSRN and HLSTM target on modeling the spatial and scale dependencies among different local image regions. However, they also have different characteristics: HSRN is simple and fast, while HLSTM is more complex but it is able to maintain the longterm dependencies among local image regions far away from each other, and lead to better performance than HSRN.
Our proposed hierarchical recurrent neural networks (HRNNs) model two types of contextual dependencies: spatial dependencies and scale dependencies.
Firstly, we consider the spatial dependencies among image regions from the same scale but at different locations. Since there are no offtheshelf sequences in images, inspired by the multidimensional RNN [22], we generate two dimensional spatial region sequences for images, and represent each region as a function of its neighboring regions. Details will be described in Section IIIC1.
Secondly, we build multiple scale RNNs, and consider scale dependencies among image regions from different scales but at the same locations. Information captured from different scales are complementary to each other. Connecting multiple scales can help to learn more robust representation. For example, in an image with label “car”, regions at a lower level scale mostly contain patterns such as “tire” and “window”, while regions at a higher level scale include global patterns such as “car”. Knowing the existence of “car” can help the system to increase the representation preference of “tire” in the corresponding local regions. Details will be described in Section IIIC2.
However, HRNN layers are processed based on image regions, while in image classification, no intermediate labels for any of these regions are provided. The only supervision is the imagelevel labels. To make use of it, fully connected layers are introduced to collect the outputs of HRNN layers, merge them through the global hidden layer, and finally connect to imagelevel labels with a softmax layer.
Integrating CNNs with our HRNNs, we propose endtoend networks called convolutional hierarchical recurrent neural networks (CHRNNs). As shown in Figure 1, CHRNN not only maintain the discriminative representation power of CNNs, but also efficiently encode the spatial and scale contextual dependencies with HRNNs. Testing on four most challenging largescale image classification benchmarks, CHRNNs achieve the stateofthearts on Places 205, SUN 397, MIT indoor, and promising results on ILSVRC 2012.
Ii Related Works
In recent few years, deep neural networks have made great break through in computer vision area. Till now, lots of successful deep neural nets with different structures have been proposed, such as: convolutional neural networks [1, 2, 4, 5, 6, 7, 8, 9, 10, 23, 24, 12], deep belief nets [25, 26, 27], and autoencoder [28, 29, 30, 31], etc. Among all these frameworks, CNNs are the most developed networks for solving image classification problems. The core idea of CNNs is progressively learning more abstract (higher visual level) and more complex patterns: the first few layers focus on learning “garbor like” low level local features (e.g. edges and lines); based on which, the middle layers target on learning parts of objects (e.g. “tires” and “windows” in the images with label “car”); the higher layers connect to the final imagelevel labels, and aim to learn representations of the whole image.
In contrast, RNNs have achieved great success in natural language processing (NLP) [14, 15, 16, 17, 18, 19, 32, 33]. Different from the CNNs, which are purely combined with “feedforward” network layers, RNNs [20, 21] are “feedback” neural networks designed for modeling contextual dependencies. Because of the connections from the previous states to the current ones, RNNs are networks with “memory”. Through such “feedback” connections, RNNs are able to retain information of the past inputs, and it is able to discover correlations among the input data that might be far away from each other in the sequence.
Although very popular in NLP, RNNs have rarely been applied to computer vision area. In the recent decade, there are mainly five branches of works which involve the recurrent idea.
In the first branch of works, recurrent layers are mainly used as “tied” layers in the “feedforward” networks, which means different layers share the same parameters. Different from our recurrent networks, these “tied” layers iteratively encode the input data from the same locations with the same network parameters, and these layers focus on reducing the number of parameters, rather than modeling the contextual dependencies among input data from different locations. In [34], shared CNNs are applied to learn pixel label consistency among multiscale image patches. In DrSAE [35]
, autoencoder with rectified linear units are employed to iteratively encode the global digitalnumber image. In
[36], the “tied” CNN (called as recurrent convolutional network) is employed to assess the contributions of the number of layers, response maps, and parameters. Different from these works, the “recurrent” in our CHRNNs means learning spatial and scale dependencies among different image regions, and expanding receptive fields of local regions by encoding contextual information.In the second branch of works, RNNs are used to predict/generate the motion curve of objects/parts in the current/next moment, and applied to visual attention tasks. In
[37, 38], RNN is used to build a sequential variational autoencoder to iteratively analyze/generate image parts (at each iteration, RNNs are used to selectively attends to parts of the image while ignoring the others). Differently, we aim to build endtoend networks for largescale image classification.In the third branch of works, RNNs [39, 40] are used to combine the video information over an ordered sequence of video frames for video recognition and description. Differently, our CHRNNs model the contextual dependencies within single image rather than the sequential appearance/motion dependencies among consecutive frames.
In the fourth branch of works, RNNs are combined with CNNs for image/video description [41, 42, 43, 44]. In these works, CNNs are utilized to generate image/video features, while RNNs are used to connect the image/video feature domain to the text feature domain, and RNNs mainly focus on modeling the text contextual dependencies in the sentences/paragraphs. Different from these works, our CHRNNs models the contextual information in image appearance domain.
The last branch of works is RNN pyramid [45, 46]. In these works, multiple layers of local recurrent connectivities are stacked as a pyramid to get different levels of visual abstractions. In contrast, CHRNNs model the scale dependencies among image regions at the same level of visual abstraction, but different pooling scale. Moreover, CHRNNs integrate the discriminative power of CNNs and contextual modeling ability of RNNs, and work efficiently and effectively for largescale image classification.
Iii Convolutional Hierarchical Recurrent Neural Networks
As shown in Figure 1, our proposed convolutional hierarchical recurrent neural networks (CHRNNs) consist of three types of layers: 1) five convolutional (and pooling) layers for extracting middle level image region features; 2) hierarchical recurrent layers for encoding spatial and scale dependencies among different image regions; 3) two fully connected layers for generating global image representation. Finally, an Nway (N indicates the number of categories) softmax loss layer is added on the top for classification.
Iiia Convolutional Layers
As shown in the left part of Figure 1, given input raw pixel images, firstly, five convolutional layers are processed to progressively extract more and more complex and abstract patterns. According to the analysis in [47], outputs of the fifth convolutional layer are able to capture patterns representing parts and objects. Furthermore, size of the fifth layer response maps is orders of magnitudes smaller than size of the original raw pixel images. Thus, based on such CNN features, our proposed HRNNs can model the contextual dependencies among middlelevel regions with semantic meanings, and HRNNs can be processed very efficiently. Furthermore, with back propagation, RNNs can help the CNNs to increase the quality of middlelevel and lowlevel features.
Note that our HRNNs can be easily constructed based on any network other than CNN (e.g. deep restricted Boltzmann machine
[18], autoencoder [35]), hand crafted features (e.g. SIFT [48], HOG [49]), or even from scratch. In this work, we choose CNNs because of their excellent performance on representing midlevel patterns, which is the guarantee of good performance of the following HRNNs.IiiB Review of General RNNs
RNNs [20, 21] are originally developed for modeling dependencies in time sequential data. In RNNs, two of the most typical models are the simple recurrent neural network (SRN), and the longshort term memory recurrent neural network (LSTM). In the following two subsections, SRN and LSTM will be introduced to represent each state of a given sequence of length . , , and are the input, hidden and output representations of the th state respectively.
IiiB1 Simple Recurrent Neural Nets
As shown in Figure 1(a), the th state in SRN can be represented as:
(1)  
(2) 
where , and are the shared transformation matrices from input to hidden states, previous hidden to current hidden states, and hidden to output states. and are bias terms, and
are nonlinear activation functions. Since the expression of each state is based on hidden representation of the previous states, SRN can keep “memory” of the whole sequence, and learn patterns based on such sequential context.
Although simple and effective, SRN has the unpleasant “short term memory” problem [50]: during the backpropagation procedure in SRN, the gradients will be multiplied times by the . Consequently, when is relatively large, there will be gradient vanishing/exploding problems.
IiiB2 LongShort Term Memory Recurrent Neural Nets
To overcome the above “short term memory” issue, LSTM [50] introduce the “memory block” (combined with multiplication gates and memory cell) to keep long term flow of sequential information. As shown in Figure 1(b), the th state in LSTM can be represented as:
(3)  
(4)  
(5)  
(6)  
(7)  
(8)  
(9) 
in addition to the hidden state , LSTM introduced a memory cell , and four multiplication gates: , , , and , which are the input, forget, output, and input modulation gate respectively.
is a logistic sigmoid function (thus,
, , range from ). is the hyperbolic tangent nonlinearity, and represents elementwise multiplication. Specifically, the self recurrent memory cell keeps the longterm memory. The input gate controls the flow of incoming signal to alter the state of . The forget gate helps the to selectively maintain and forget the previous state status . While the output gate controls the amount of memory that transmits to .The “memory block” structure enables LSTM to selectively forget its previous memory states, and learn longterm dynamics which general SRN can hardly handle. However, LSTM has more intermediate neurons than SRN, thus LSTM consumes much more computational resources.
In this manuscript, rather than modeling contextual correlations among different states in time sequences, we modify RNNs to model contextual dependencies among image region “2D sequences”. Details of our proposed networks will be introduced in the following section.
IiiC Hierarchical Recurrent Layers
In CNNs, convolution and pooling are locally performed on image regions. While the spatial dependencies among different regions from the same scale are ignored, let alone the scale dependencies among image regions from different scales. On the other hand, general RNNs (Section IIIB) are designed for modeling dependencies in sequences, however, they cannot be directly applied on images. Thus, as shown in the middle part of Figure 1, we propose hierarchical recurrent layers to model spatial and scale contextual dependencies.
IiiC1 Modeling Spatial Contextual Dependencies
Spatial context is an important clue for recognizing images. For example, in an image with label “computer room”, knowing the existence of “computer” can help the system to increase the preference of representing “desk” in the surrounding image regions. In this subsection, we will introduce the spatial RNNs to model spatial contextual dependencies within single scale image feature maps.
There are no existing sequences in images, hence we need to generate region sequences in image domain. Take Alexnet [2] as an example, as described in Section IIIA, we utilize the fifth layer CNN feature maps (, corresponding to number of channels height width) as the input of the recurrent layers. It can be considered as a
2D data array, each element in the array is represented as a 256 dimensional vector. Then how to convert such an 2D array into sequences? The most straightforward way is to scan in a row by row or column by column manner. However, images are 2dimensional data. For each element, contextual information from all the directions should be taken into consideration. Thus, inspired by
[22], we generate “2D sequences” for images, and each element simultaneously receives spatial contextual references from its 2D neighborhood elements.As shown in (e) of Figure 3, spatial contextual information comes from all directions (left, right, top, bottom). If we directly connect all the surrounding elements to the target, each node would simultaneously be the “previous” and “next” element of its neighbors. Then the connections would form a cyclic graph. The resulting network is difficult to be optimized. Thus, four directional “2D sequences” are generated for each scale: topleft to bottomright, bottomright to topleft, bottomleft to topright, and topright to bottomleft. Each of them focuses on transferring information from an independent direction through an acyclic path. Take the topleft to bottomright sequence (as shown in (a) of Figure 3) as an example, each element receives references from its nearest neighbor elements in the previous row and the previous column. All the elements will be visited once, and each element can be unrolled into a function of all the previously visited elements. Similarly, contextual information from the other three directions can be encoded by “2D sequences” as shown in (bd) of Figure 3.
For each of the four directional “2D sequences”, the transformation matrices are shared through the whole sequence. To model the spatial correlations among different image regions, general SRN and LSTM (Section IIIB) are modified to model our spatial sequences, and they are called spatial SRN and spatial LSTM in the rest of this section.
Spatial SRN Firstly, the spatial SRN is introduced. The hidden representation of each image region in the “2D sequences” is:
(10)  
(11)  
(12)  
(13)  
(14) 
where is the position of the element. is the input, which is an image region represented by a 256dimensional fifth CNN layer feature vector, , , , and denote the hidden representations of in the four “2D sequences” respectively (corresponding to topleft to bottomright, bottomright to topleft, bottomleft to topright, and topright to bottomleft directions). is the combination of the four directional hidden representations, which is the output of spatial SRN. For each direction, and are row based and column based hidden to hidden states transformation matrices. is the input to hidden states transformation matrix, is the bias term, and
is a nonlinear activation function (ReLU is used here).
Spatial LSTM Similar to the Equation 14 in spatial SRN, each hidden unit of spatial LSTM is also a combination of four directional hidden representations. To make the expression concise, only functions corresponding to the direction will be expanded here (corresponding to Equation 10):
(15)  
(16)  
(17)  
(18)  
(19)  
(20) 
where is the current state position, represents the current input data. , , , , correspond to the input, forget, output, and input modulation gates. denotes the memory cell unit, and finally is the hidden representations of in topleft to bottomright direction. Similarly, the other three directions can be achieved.
For each gate function, , , and () are hiddengate (row), hiddengate (column), inputgate transformation matrices and bias terms respectively. and are nonlinear activation functions, in this manuscript, sigmoid is used as , and tangent is assigned as .
IiiC2 Modeling Scale Contextual Dependencies
Besides spatial contextual dependencies, there also exist scale contextual dependencies among image regions from the same locations but at different scales, which is another important clue for image recognition. For example, again in an image with label “computer room”, knowing the global pattern “computer room” can help the system to increase the preference of representing patterns correspond to “computer” and “desk” in local level scales. In this subsection, we will focus on modeling scale dependencies.
The final goal of image classification is to achieve good imagelevel representation, which is based on wellperformed local image region representations. When describing a local image region, the traditional way is to only encode its own information. In contrast, if information from higher level scale regions is given, then the global information would be encoded in local features, and lead to better local descriptions. Thus, we build connections across regions from different scales.
For each element at each scale, its receptive field covers a number of elements at the lower level scales. More intuitively, as shown in the middle part of Figure 1, areas highlighted with yellow at the scale and are covered by the receptive field of the yellow element at the scale . Thus, global information from the higher level scale would be transferred to the corresponding areas at the lower level scales and . Thus, for element at position on scale , the scale dependencies from higher level scales can be encoded as:
(21) 
where , and is the number of scales. is the position at the higher level scale . is scale contextual element (already combined the four directional spatial dependencies, refer to Equation 14) from the higher level scale. is the scale to scale transformation matrix.
IiiC3 HRNNs with Spatial & Scale Dependencies
By inserting Equation 21 into Equation 1013 (spatial SRN) or Equation 1520 (spatial LSTM), scale and spatial dependencies can be modeled together in our hierarchical RNNs.
HSRN Take the topleft to rightbottom directional HSRN as an example (refer to Equation 10), the hidden representation of each element is:
(22)  
HLSTM Similarly, for the HLSTM model (refer to Equation 1518), the hidden representation of each gate functions is:
(23)  
in which, is sigmoid function when , and is tangent when .
For both Equation 22 and 23, the scale index of each variable is removed for the convenience of expression. Similarly, expressions of the other three directions can be obtained. Afterwards, refer to Equation 14, by combining the revised four directional hidden representations, the complete HRNNs (for both HSRN and HLSTM) hidden element expression is:
(24) 
According to the RNNs optimization notes in [14]
, RNNs can be simply and effectively optimized by back propagation through time (BPTT). In BPTT, the recurrent nets would be unfolded into feedforward deep networks, then normal backpropagation can be applied. Utilizing the “weight sharing” setting in Caffe
[51], BPTT can be performed with shared RNN weights.IiiD Fully Connected Layers
Different from applications like image labeling, where the label of each pixel or patch level image region is given, in image classification, there is no intermediate labels except the overall imagelevel label. Thus, Equation 2 (corresponds to HSRN) or Equation 9 (corresponds to HLSTM) cannot be directly applied. To connect the hierarchical recurrent layers (Equation 24) to the image labels, fully connected layers are introduced to merge the information learned by different scales of HRNNs:
(25)  
(26)  
where is the fully connected transformation matrix to transform the HRNNs output to the global hidden layer . is the concatenation of HRNN layer outputs () at different scales. For each scale, is the concatenation of all its hidden element expressions (, , and are the number of rows and columns at the scale respectively). is learned to connect with the class label , and are the bias terms. is a nonlinear activation function (ReLU is used in this manuscript), and is the softmax function for classification.
Iv Experiments
In this section, detailed network settings of our endtoend CHRNNs are firstly introduced. Next, CHRNNs are compared with other popular methods on four challenging object/scene image classification benchmarks: ILSVRC 2012 [52], Places 205 [3], SUN 397 [53], and MIT indoor [54]. Afterwards, effectiveness of different modules of CHRNNs is analyzed, CHSRN and CHLSTM are compared in detail.
Iva Experimental Settings
Models  conv1  conv2  conv3  conv4  conv5  hrnn6  fc7  fc8  






 









 



CHRNNs 








Following the default data prepossessing settings in Caffe [51], all images are resized to pixels and subtracted by the pixel mean. For training images, 10 subcrops of size (1 center, 4 corners, and horizontal flips) are extracted. In the remaining part of this section, if not specified, the results are the Top 1 accuracy (or error rates) tested with center crop by using a single model.
As shown in Table I, detailed layer structures of the baseline deep nets (Alexnet [2], SPPnet [8]) and CHRNNs are given.
Comparing with SPPnet, our CHRNNs has the same first five convolutional layers: , , , , and
respectively. Strides of the first two layers are 2, and the rest are 1. Following each of the first, second and fifth convolutional layers, there is a max pooling layer with kernel size of
, and stride of 2. Finally, size of output feature maps of the fifth CNN layer is (number of channels height width). Similar to SPPnet, we pool the feature maps into four scales, and achieve response maps with size of .Different from all of these baseline networks, our CHRNNs introduce hierarchical recurrent layers (hrnn6 as shown in Table I). For the hierarchical recurrent layers, we process three scale spatial RNN layers with size of and one global pooling layer, and build cross scale connections among all these four scales. The corresponding numbers of image regions of the four scales are 36, 9, 4, and 1 respectively, and each region is represented as a 256dimensional feature vector (number of channels in the fifth layer CNN). For each RNN layer, sizes of the transformation matrices (HSRN: , , , and , refer to Equation 22; HLSTM: , , , and , refer to Equation 23) in the four directional “2D sequences” are .
To show the performance gain of introducing spatial and scale dependencies separately, we introduce an intermediate network called convolutional multiscale recurrent neural networks (CMRNNs), which only considers spatial dependencies in multiple scales, and ignores the scale dependencies. Specifically, convolutional multiscale simple recurrent neural network (CMSRN) and convolutional multiscale longshort term memory neural network (CMLSTM) are tested in the experiments. Furthermore, when all the hiddenhidden weights in CMSRN are set to 0, and the inputhidden weights are identity matrices, CMSRN degenerates to SPPnet.
For the fully connected layers, the number of output units of both two layers is 4096, and each of them is applied dropout at the rate of 0.5.
The training batch size is 256, learning rate starts from 0.01 and it is divided by 10 when the accuracy stops increasing, and the weight of momentum is 0.9. All the experiments are run on Caffe [51] with a single NVIDIA Tesla K40 GPU.
IvB Experimental Results
IvB1 Experimental Results on ILSVRC 2012
Methods  test scales  test views  Top 1 val  Top 5 val 
MOPCNN [4] (max pooling)  3  101  44.12%   
MOPCNN [4] (VLAD pooling)  3  101  42.07%   
SPPnet [8]  1  1  38.21%   
CMSRN  1  1  36.90%   
CHSRN  1  1  36.38%   
CMLSTM  1  1  36.01%   
CHLSTM  1  1  35.85%   
Alexnet [2]  1  10  40.7%  18.2% 
ZFnet [47]  1  10  38.4%  16.5% 
Overfeat[10]  1  10  35.6%  14.7% 
SPPnet [8]  1  10  36.2%  14.9% 
CMSRN  1  10  35.2%  14.0% 
CHSRN  1  10  34.8%  13.7% 
CMLSTM  1  10  34.5%  13.5% 
CHLSTM  1  10  34.3%  13.4% 
ImageNet LargeScale Visual Recognition Challenge (ILSVRC) dataset [52] is one of the most challenging and popular largescale object image classification datasets. ILSVRC 2012 contains 1.2 million training images and 50,000 validation images (50 per class), and they belong to 1000 object categories.
In the upper part of Table II, we compare CHRNNs with SPPnet[8], which encodes the spatial information by using spatial pyramid pooling. Based on their released model^{1}^{1}1https://github.com/ShaoqingRen/SPP_net, we can only achieve 41.47% top1 error rate with one testing view. Therefore, we further tune the model with the settings in Section IVA, and finally achieve 38.21% (reported 38.01%) for SPPnet. The performance gap might be caused by the different training settings (when preprocess images, SPPnet keeps the original image aspect ratio, while the standard Caffe [51] does not). CHRNNs and SPPnet use the same convolutional and fully connected layer settings, except SPPnet directly applied spatial pyramid pooling after the fifth convolutional layer, while our CHRNNs model spatial dependencies with RNN for each scale, and models scale dependencies across different scales.
For SRN models, comparing with SPPnet, CMSRN brings 1.31% Top 1 error rate decrease, which indicates the benefit of modeling spatial dependencies. After integrating the scale dependencies, CHSRN is 1.83% better than SPPnet. Thus, both encoding spatial and scale dependencies can help to generate better image representations.
In LSTM models, performance improvement introduced by modeling spatial and scale dependencies can also be observed. Different from SRN models, LSTM models are able to keep longerterm memory of image region “2D sequences”. Comparing with SPPnet, CHLSTM is 2.36% better. CMLSTM gets 36.01%, which is better than CMSRN, and CHLSTM (35.85%) also works better than CHSRN. But the performance gap between CHLSTM and CHSRN (0.53%) is less than the one between CMLSTM and CMSRN (0.89%). The reason should be that the introduced scale dependencies from higher scales indirectly extend the longterm ability of CMSRN, and indent the gap between SRN and LSTM models.
We also compare with another spatial statistics based CNN method MOPCNN [4], which directly uses the Caffe CNN [51] to densely extract features from threescale image patches, and use VLAD pooling to generate global representations. The performance gap indicates that our way of encoding spatial and scale information is more effective.
In the lower part of Table II, CHRNNs are compared with other deep neural networks with the most general settings: 10 testing views, comparing Top 1 and Top 5 error rates. Outstanding performances of our CHRNNs indicate that besides going deeper and wider, RNN is another promising way to increase the image representation power of neural networks.
IvB2 Experimental Results on Places 205
Methods  Top 1 val  Top 5 val 

Alexnet [2, 55]  50.06%  80.51% 
SPPnet [8]  51.57%  81.88% 
CMSRN  52.70%  82.75% 
CHSRN  53.16%  83.07% 
CMLSTM  53.75%  83.36% 
CHLSTM  53.91%  83.48% 
Places 205 dataset [55] is currently the largest scene categorization dataset, which has just been released at the end of 2014. Different from ILSVRC 2012, it focuses on scene images rather than object centric ones. It has 2.5 million training images from 205 scene categories, which is twice the size of ILSVRC 2012, and much more challenging. There are 20,500 images (100 per category) in the validation set.
As shown in Table III, CHLSTM update the stateoftheart on Places 205 (previous best result was 50.06% achieved by Alexnet) with the accuracy of 53.91%. When only introducing spatial dependencies, CMSRN and CMLSTM outperform SPPnet by 1.13% and 2.18% respectively. Further integrating scale dependences, CHSRN and CHLSTM bring 1.59% and 2.34% improvements respectively.
Methods  Accuracy 

MOPCNN [4] (max pooling)  48.50% 
MOPCNN [4] (VLAD pooling)  51.98% 
Alexnet [2] (ILSVRC ft)  44.42% 
Alexnet [2] (Places ft)  54.55% 
SPPnet [8] (ILSVRC ft)  49.02% 
CMSRN (ILSVRC ft)  51.76% 
CHSRN (ILSVRC ft)  52.59% 
CMLSTM (ILSVRC ft)  52.67% 
CHLSTM (ILSVRC ft)  52.78% 
SPPnet [8] (Places ft)  57.23% 
CMSRN (Places ft)  59.32% 
CHSRN (Places ft)  59.90% 
CMLSTM (Places ft)  60.08% 
CHLSTM (Places ft)  60.34% 
Xiao et al.[53]  38.00% 
IFV [56]  47.20% 
MTLSDCA [57]  49.50% 
IvB3 Experimental Results on SUN 397
SUN 397 [53] is another popular largescale scene image recognition benchmark. There are 100,000 images from 397 scene classes in total. The general splittings in [53] are used here, in which, there are 50 images per class for training, and 50 images per class for testing. Since the number of training images is too small (20,000), we introduce the models pretrained on ILSVRC 2012 and Places 205, and use the training images from SUN 397 to finetune the network. We also increase the learning rates of the HRNN layers (10 times higher than the other layers), and aim to focus more on spatial dependencies specifically exist in SUN 397.
As shown in the upper part of Table IV, our CHRNNs performs better than existing CNNs. After finetuning on SUN 397, CHRNNs are able to learn more data adaptive spatial dependencies and significantly outperform SPPnet: 1) Based on models pretrained on ILSVRC, the performance gains of CHSRN and CHLSTM are 3.57% and 3.76% respectively; 2) Based on models pretrained on Places, the accuracy improvements of CHSRN and CHLSTM are 2.67% and 3.11% correspondingly.
When applying finetuning based on Places 205, CHLSTM achieves the stateoftheart on the SUN 397 with the accuracy of 60.34%, which outperforms the previous best result (MOPCNN 51.98%) by 8.36%. Another observation is that the performances of the finetuned models based on Places 205 consistently perform better than the ones based on ILSVRC 2012. The reason should be that both Places 205 and SUN 397 are scene datasets, their domain gap is smaller than the gap between ILSVRC 2012 (object dataset) and SUN 397.
The lower part of Table IV
shows the traditional stateoftheart shallow methods. Most of these works heavily depend on combining multiple densely extracted handcrafted features, and the image level representations are usually very highdimensional. Another drawback of these methods is that the testing procedures are generally very time consuming, since the feature extraction steps are slow. Comparing with them, our CHRNNs perform much better with much less computational cost in testing, and much lowerdimensional features.
Methods  Accuracy 

MOPCNN [4] (max pooling)  64.85% 
MOPCNN [4] (VLAD pooling)  68.88% 
Alexnet [2] (ILSVRC ft)  61.57% 
Alexnet [2] (Places ft)  68.24% 
SPPnet [8] (ILSVRC ft)  66.32% 
CMSRN (ILSVRC ft)  68.28% 
CHSRN (ILSVRC ft)  68.88% 
CMLSTM (ILSVRC ft)  69.18% 
CHLSTM (ILSVRC ft)  69.25% 
SPPnet [8] (Places ft) 
72.09% 
CMSRN (Places ft)  74.18% 
CHSRN (Places ft)  74.85% 
CMLSTM (Places ft)  75.30% 
CHLSTM (Places ft)  75.67% 
Object Bank [58]  37.60% 
Visual Concepts [59]  46.40% 
MMDL [60]  50.15% 
IFV [61]  60.77% 
MLrep + IFV [62]  66.87% 
ISPR + IFV [63]  68.50% 
IvB4 Experimental Results on MIT Indoor
MIT indoor [54] is very challenging scene image classification benchmarks. This dataset focuses on indoor scene scenarios, which usually contains lots of objects, and has larger variations. There are 67 different scene scenarios in MIT indoor in total, and the widely used splitting provided by [54] are applied in our experiments. In each class, around 80 training images, and around 20 testing images are selected. Because of the limitation of dataset size, we also utilize the pretrain models on ILSVRC 2012 and Places 205, and do finetuning. For the other baseline deep neural networks, such finetuning is also applied.
As shown in the upper part of Table V, CHRNNs are able to outperform the other deep neural nets with obvious gaps. Comparing with the stateofthe art MOPCNN, our CHLSTM achieves the accuracy of 75.67%, which is 6.79% better. Comparing with SPPnet: 1) Based on models pretrained on ILSVRC, the performance gains of CHSRN and CHLSTM are 2.56% and 2.93% respectively; 2) Based on models pretrained on Places, the accuracy improvements of CHSRN and CHLSTM are 2.76% and 3.58% respectively.
In the lower part of Table V, results of the stateofart traditional shallow methods are given. Although very powerful on MIT Indoor, these methods cost much more computation power to perform middlelevel patch searching and clustering, the feature dimensions are relatively high, and most of them can hardly be applied on largescale benchmarks. In contrast, our CHRNNs are endtoend feature learning frameworks with 4096dimensional output features, and CHRNNs can easily handle largescale data.
IvC Analysis of CHRNNs
In this subsection, we will analyze the effectiveness of our CHRNNs from different perspectives.
IvC1 CHRNNs Visualization
Firstly, patterns learned by the hrnn6 layer (refer to Table I) of CHLSTM are visualized in Figure 4. On the left part of Figure 4, six testing image region on the {3x3} scale (refer to Section IVA) are given, and the receptive field of each region is highlighted with blue box in the original image. On the right part of Figure 4, the top 8 nearest neighbors of each testing image region are shown. These nearest neighbors are searched from all the local region features extracted from training images, and measured by distance. For every two rows, the top row is the nearest neighbors searched by utilizing CHLSTM hrnn6 layer features, and the bottom row is the results of using SPPnet conv5 layer features.
Comparing the visualization results of our CHLSTM with the SPPnet, we can observe obvious better local image region representations. Take the first testing image region as an example, it is the right bottom area in the “radiator grille” image. By using our CHLSTM, this region is more likely to be represented as the “radiator grille” from the same or different car models, and less likely to be mismatched to similar patterns from other unrelated classes. Take the last testing image “partridge” as another example. The testing region contains “body of partridge” and the background “gravel”. For our CHLSTM, contextual information has been taken into consideration, thus, this region is represented as the “body of partridge”. In contrast, the SPPnet wrongly focused on “gravel”, and missed the target object. Similarly, better local region visualization results of CHLSTM can be observed in other classes, such as manmade buildings like “steel arch bridge”, creatures like “sea urchin” etc.
IvC2 CHRNNs vs Modified CNNs
Since CHRNNs have more parameters than the original CNNs, we aim to quantitatively show whether the performance gain is from encoding contextual dependencies, or simply from increasing the number of parameters.
For each HRNN scale, there are four directional “2D sequences”. In HSRN, each direction has three transformation matrices , , and ; While in HLSTM, each direction has four gate functions, and in each gate, there are , , and . Thus, for each scale, there are 12 transformation matrices in HSRN, and 48 matrices in HLSTM. Furthermore, in both HSRN and HLSTM, there are 6 cross scale transformation matrices . Each of these matrices has the size of , which has the same number of parameters as one convolution layer with 256 kernels of size . Thus, in the modified CNNs, each transformation matrix in HRNN is replaced with a convolution layer.
Testing on ILSVRC 2012, HSRN gets 36.38% in error rate, while the modified CNN gets 37.73%. Similarly, HLSTM gets 35.85% in error rate, while the modified CNN gets 37.49%. These obvious gaps indicate that RNNs are able to learn contextual dependencies which cannot be captured by CNNs.
IvC3 Effect of Number of Spatial Context Directions
In CHRNNs, four directional “2D sequences” are employed, can they really learn complementary information to each other?
In Table VI, performance of CHRNNs with different directions of spatial context are given. The first four rows show the performances of using single directional “2D sequence”, and different direction performs similarly to each other. On the last three rows of Table VI, results of combination of two directions, and the complete four directions are given. Comparing the results of using two directions and single direction, improvements can be observed. When combining all four directions, the best performance can be achieved.
Methods  Error  Methods  Error 

CHSRN  36.96%  CHLSTM  36.39% 
CHSRN  37.03%  CHLSTM  36.46% 
CHSRN  36.95%  CHLSTM  36.45% 
CHSRN  36.89%  CHLSTM  36.50% 
CHSRN  36.85%  CHLSTM  35.97% 
CHSRN  36.59%  CHLSTM  36.03% 
CHSRN  36.38%  CHLSTM  35.85% 
IvC4 HRNNs Complexity
Although very powerful, our HRNNs do not bring much extra computational burden or memory usage.
There are three scale HRNN layers with spatial dependencies encoded: , , and , each of them has 12 transformation matrices in HSRN and 48 transformation matrices in HLSTM, and there are 6 hierarchical connections (3 from , 2 from , and 1 from ). Thus, there are 42 transformation matrices in HSRN and 150 matrices in HLSTM in total, each one has size of . Thus, the HSRN layers have parameters, and the HLSTM layers have parameters. The number of parameters in HLSTM is almost four times HSRN, which make the HLSTM models be able to learn more complex patterns with the price of more computation resources. In contrast, in CNN, the fully connected layers have most of the network parameters, e.g. the second fully connected layer needs to learn a weight matrix, which has parameters, comparing with which, our HRNN layers have much fewer parameters.
In terms of memory consumption, HRNN layers do not cost much extra memory except some intermediate hidden layer output, i.e. and gate units (only exist in HLSTM) for each image region, which are 256dimensional vectors. While in CNN, the most memory consuming part is the first convolutional layer. In our setting, output of the first CNN layer has 1,161,600 dimensions, comparing with which, the HRNNs cost negligible memory to save intermediate data.
IvC5 CHRNNs Success & Failure Cases
In Figure 5, the final classification results of using CHLSTM and SPPnet [8] on SUN 397 (finedtuned based on ILSVRC 2012 models) are visually compared. We show the images on which CHLSTM leads to the highest accuracy improvement (the two rows above the dashed line), and the images on which CHLSTM leads to the highest accuracy drop (the row below the dashed line).
From the first two rows of Figure 5, we can clearly observe that the SPPnet focuses on predicting image regions, while ignoring the contextual information. For example, the label of the first image in the first row is “landing deck”, our CHLSTM can correctly recognize it, while the SPPnet wrongly recognizes it as the “windmill” with a very high confidence score. It’s because SPPnet wrongly recognized the rotor blades of the helicopter as the windmill blades. In contrast, our CHLSTM takes the context such as the body of helicopter and deck into consideration. Thus, our CHLSTM works better when the local image regions are confusing, but contextual information can help to make better decisions.
In the third row of Figure 5, we observe some interesting results. For example, the first image of the third row is “rope bridge”, which is relatively small in the image, while the forest is more obvious. Thus, our CHLSTM wrongly recognize it as “rainforest”. For the third image of the third row, the first word that comes to mind is cliff, while the ground truth label “light house” just represents a small region on the “cliff”. Thus, our CHLSTM makes mistakes when class labels are based on local regions rather than the global image.
fountain (1.00) ruin (0.50) CHLSTM vs SPPnet, result on SUN 397. The two rows above the dashed line: images misclassified by SPPnet, but correctly classified by CHLSTM. The row below the dashed line: images correctly classified by SPPnet, but misclassified by CHLSTM. Under each image, the first row shows the predicted label of using CHLSTM, and the second row shows the predicted label of using SPPnet, the prediction confidence scores are shown in the bracket, and correct labels are in bold. fountain (1.00) ruin (0.50) 
V Conclusions
In this manuscript, we propose an endtoend deep learning framework to encode spatial and scale contextual dependencies in image representation, which is called CHRNNs. In CHRNNs, CNN layers are firstly utilize to extract middlelevel representations for local image regions. Based on the CNN layer outputs, our proposed hierarchical recurrent layers are then applied to model the spatial dependencies among different image regions from the same scale, and the scale dependencies among image regions from different scales but at the same locations. In our proposed hierarchical recurrent neural networks, HSRN and HLSTM are introduced as two specific instances, which correspond to a fast recurrent model, and a sophisticated but more effective recurrent model respectively. By integrating CNN and HRNNs, our CHRNNs show outstanding performances on image classification.
References
 [1] B. B. Le Cun, J. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Handwritten digit recognition with a backpropagation network,” in NIPS, 1990.
 [2] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.

[3]
B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in
NIPS, 2014.  [4] Y. Gong, L. Wang, R. Guo, and S. Lazebnik, “Multiscale orderless pooling of deep convolutional activation features,” in ECCV, 2014.
 [5] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” in BMVC, 2014.
 [6] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” arXiv preprint arXiv:1409.4842, 2014.
 [7] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [8] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in ECCV, 2014.
 [9] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, 2014.
 [10] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” arXiv preprint arXiv:1312.6229, 2013.
 [11] W. Ouyang, P. Luo, X. Zeng, S. Qiu, Y. Tian, H. Li, S. Yang, Z. Wang, Y. Xiong, C. Qian et al., “Deepidnet: multistage and deformable deep convolutional neural networks for object detection,” arXiv preprint arXiv:1409.3505, 2014.
 [12] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to humanlevel performance in face verification,” in CVPR, 2014.
 [13] Y. Sun, X. Wang, and X. Tang, “Deep learning face representation from predicting 10,000 classes,” in CVPR, 2014.
 [14] T. Mikolov, “Statistical language models based on neural networks,” Ph.D. dissertation, Brno University of Technology, 2012.
 [15] I. Sutskever, “Training recurrent neural networks,” Ph.D. dissertation, University of Toronto, 2013.
 [16] J. Koutník, K. Greff, F. Gomez, and J. Schmidhuber, “A clockwork rnn,” in ICML, 2014.
 [17] A. Graves and N. Jaitly, “Towards endtoend speech recognition with recurrent neural networks,” in ICML, 2014.
 [18] I. Sutskever, G. E. Hinton, and G. W. Taylor, “The recurrent temporal restricted boltzmann machine,” in NIPS, 2009.
 [19] N. BoulangerLewandowski, Y. Bengio, and P. Vincent, “Modeling temporal dependencies in highdimensional sequences: Application to polyphonic music generation and transcription,” in ICML, 2012.
 [20] J. L. Elman, “Finding structure in time,” Cognitive science, vol. 14, no. 2, pp. 179–211, 1990.
 [21] H. Jaeger, Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the “echo state network” approach. GMDForschungszentrum Informationstechnik, 2002.
 [22] A. Graves and J. Schmidhuber, “Offline handwriting recognition with multidimensional recurrent neural networks,” in NIPS, 2009.
 [23] M. Oquab, L. Bottou, I. Laptev, J. Sivic et al., “Learning and transferring midlevel image representations using convolutional neural networks,” in CVPR, 2014.
 [24] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “Decaf: A deep convolutional activation feature for generic visual recognition,” in ICML, 2014.
 [25] G. E. Hinton, S. Osindero, and Y. W. Teh, “A fast learning algorithm for deep belief nets,” Neural computation, 2006.

[26]
H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in
ICML, 2009.  [27] V. Nair and G. E. Hinton, “3d object recognition with deep belief nets,” in NIPS, 2009.
 [28] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
 [29] X. Yan, H. Chang, S. Shan, and X. Chen, “Modeling video dynamics with deep dynencoder,” in ECCV, 2014.
 [30] J. Zhang, S. Shan, M. Kan, and X. Chen, “Coarsetofine autoencoder networks (cfan) for realtime face alignment,” in ECCV, 2014.
 [31] Y. Bengio, L. Yao, G. Alain, and P. Vincent, “Generalized denoising autoencoders as generative models,” in NIPS, 2013.
 [32] I. Sutskever, J. Martens, and G. E. Hinton, “Generating text with recurrent neural networks,” in ICML, 2011.
 [33] A. Graves, “Generating sequences with recurrent neural networks,” arXiv preprint arXiv:1308.0850, 2013.
 [34] P. Pinheiro and R. Collobert, “Recurrent convolutional neural networks for scene labeling,” in ICML, 2014.
 [35] J. T. Rolfe and Y. LeCun, “Discriminative recurrent sparse autoencoders,” arXiv preprint arXiv:1301.3775, 2013.
 [36] D. Eigen, J. Rolfe, R. Fergus, and Y. LeCun, “Understanding deep architectures using a recursive convolutional network,” arXiv preprint arXiv:1312.1847, 2013.
 [37] K. Gregor, I. Danihelka, A. Graves, and D. Wierstra, “Draw: A recurrent neural network for image generation,” arXiv preprint arXiv:1502.04623, 2015.
 [38] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, “Recurrent models of visual attention,” in NIPS, 2014, pp. 2204–2212.
 [39] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Longterm recurrent convolutional networks for visual recognition and description,” arXiv preprint arXiv:1411.4389, 2014.
 [40] J. Y.H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond short snippets: Deep networks for video classification,” arXiv preprint arXiv:1503.08909, 2015.
 [41] X. Chen and C. L. Zitnick, “Learning a recurrent visual representation for image caption generation,” arXiv preprint arXiv:1411.5654, 2014.
 [42] A. Karpathy and L. FeiFei, “Deep visualsemantic alignments for generating image descriptions,” arXiv preprint arXiv:1412.2306, 2014.
 [43] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko, “Translating videos to natural language using deep recurrent neural networks,” arXiv preprint arXiv:1412.4729, 2014.
 [44] J. Mao, W. Xu, Y. Yang, J. Wang, and A. Yuille, “Deep captioning with multimodal recurrent neural networks (mrnn),” arXiv preprint arXiv:1412.6632, 2014.
 [45] S. Behnke, Hierarchical neural networks for image interpretation. Springer Science & Business Media, 2003, vol. 2766.
 [46] F. Visin, K. Kastner, K. Cho, M. Matteucci, A. Courville, and Y. Bengio, “Renet: A recurrent neural network based alternative to convolutional networks,” arXiv preprint arXiv:1505.00393, 2015.
 [47] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional neural networks,” arXiv preprint arXiv:1311.2901, 2013.
 [48] D. G. Lowe, “Distinctive image features from scaleinvariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
 [49] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.
 [50] S. Hochreiter and J. Schmidhuber, “Long shortterm memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 [51] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
 [52] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. FeiFei, “Imagenet: A largescale hierarchical image database,” in CVPR, 2009.
 [53] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun database: Largescale scene recognition from abbey to zoo,” in CVPR, 2010.
 [54] A. Quattoni and A. Torralba, “Recognizing indoor scenes,” in CVPR, 2009.
 [55] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in NIPS, 2014.
 [56] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, “Image classification with the fisher vector: Theory and practice,” International journal of computer vision, vol. 105, no. 3, pp. 222–245, 2013.
 [57] M. Lapin, B. Schiele, and M. Hein, “Scalable multitask representation learning for scene classification,” in CVPR, 2014.
 [58] L.J. Li, H. Su, L. FeiFei, and E. P. Xing, “Object bank: A highlevel image representation for scene classification & semantic feature sparsification,” in NIPS, 2010.
 [59] Q. Li, J. Wu, and Z. Tu, “Harvesting midlevel visual concepts from largescale internet images,” in CVPR, 2013.
 [60] X. Wang, B. Wang, X. Bai, W. Liu, and Z. Tu, “Maxmargin multipleinstance dictionary learning,” in ICML, 2013.
 [61] M. Juneja, A. Vedaldi, C. Jawahar, and A. Zisserman, “Blocks that shout: Distinctive parts for scene classification,” in CVPR, 2013.
 [62] C. Doersch, A. Gupta, and A. A. Efros, “Midlevel visual element discovery as discriminative mode seeking,” in NIPS, 2013.
 [63] D. Lin, C. Lu, R. Liao, and J. Jia, “Learning important spatial pooling regions for scene classification,” in CVPR, 2014.