Enriched Deep Recurrent Visual Attention Model for Multiple Object Recognition

06/12/2017 ∙ by Artsiom Ablavatski, et al. ∙ Nanyang Technological University ∙ Agency for Science, Technology and Research

We design an Enriched Deep Recurrent Visual Attention Model (EDRAM), an improved attention-based architecture for multiple object recognition. The proposed model is a fully differentiable unit that can be optimized end-to-end using Stochastic Gradient Descent (SGD). The Spatial Transformer (ST) is employed as the visual attention mechanism, which allows the model to learn the geometric transformation of objects within images. By combining the Spatial Transformer with a powerful recurrent architecture, the proposed EDRAM can localize and recognize objects simultaneously. EDRAM has been evaluated on two publicly available datasets: MNIST Cluttered (with 70K cluttered digits) and SVHN (with up to 250k real-world images of house numbers). Experiments show that it obtains superior performance compared with state-of-the-art models.




1 Introduction

Published as a conference paper at the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV)

Recurrent models of visual attention have demonstrated superior performance on a variety of recognition and classification tasks [16, 2, 19, 22] in recent years. A recurrent model of visual attention is a task-driven agent interacting with a visual environment, which it observes via a bandwidth-limited sensor at each time step. Such models consist of two crucial components: an attention mechanism and a recurrent network. The first component, in the simple form introduced by Mnih et al. [16], has demonstrated great success in recognizing digits within the Street View House Number (SVHN) dataset [6]. However, this mechanism is restricted by its simplicity: it extracts a fixed number of patches at predefined scales on each iteration. More flexible and sophisticated visual attention mechanisms introduced recently [7, 12] achieved state-of-the-art results on the MNIST Cluttered dataset [19]. These mechanisms improve patch extraction so that it can cover any 2D affine transformation of objects in the image. Hence they are robust to most geometric transformations and have demonstrated performance comparable to that of humans. The second component, the recurrent network, also plays an essential role in recurrent models of visual attention. The more powerful recurrent architecture employed in the Deep Recurrent Visual Attention Model [2] significantly outperformed the original Recurrent Attention Model proposed by Mnih et al. [16] while leaving the rest of the system identical.

Inspired by the deep recurrent visual attention model and the power of the visual attention mechanism [12], we propose an Enriched Deep Recurrent Visual Attention Model (EDRAM) that consists of a flexible and powerful attention mechanism along with a smart and light-weight recurrent neural network. The proposed model is fully differentiable and is trained end-to-end using Stochastic Gradient Descent (SGD). It obtains superior performance on two publicly available datasets: the multi-digit SVHN dataset [6], as illustrated in Fig. 1a, and the MNIST Cluttered dataset [16], as illustrated in Fig. 1b.

Figure 1: Examples of recognized images from (a) the Street View House Numbers dataset [6] and (b) the MNIST Cluttered dataset [16]. The green boxes show the localization and the digits at the top-left corner of each box show the recognition results.

2 Related work

Recurrent models of visual attention have been attracting increasing interest in recent years. The Recurrent Attention Model (RAM) proposed by Mnih et al. [16] employs a recurrent neural network (RNN) to integrate visual information (image patches, or glimpses) over time. By optimizing the network with REINFORCE [21], they achieved a large reduction in computational cost as well as state-of-the-art performance on the MNIST Cluttered dataset. Ba et al. [2] extended the glimpse network for multiple object recognition with visual attention. They introduced the Deep Recurrent Visual Attention Model (DRAM), which integrates a simple visual attention mechanism with a neural network based on the Long Short-Term Memory (LSTM) gated recurrent unit. The REINFORCE learning rule, employed in [16] to train the attention model, was used to teach the network “where” and “what”. Though the DRAM achieved superior results in a number of tasks such as MNIST pair classification and SVHN recognition, its attention mechanism is simple and extracts patches of fixed scales only, so the potential of the visual attention mechanism is far from fully exploited.

In contrast to the simple attention mechanism, Gregor et al. [7] introduced the Selective Attention Model, a differentiable, end-to-end trainable system that greatly improves the recognition accuracy on the MNIST Cluttered dataset. The idea of this attention mechanism is to position a set of Gaussian filters forming a grid centered at particular spatial coordinates, with which rectangular patches of different scales can be extracted. More recently, Jaderberg et al. [12] proposed the Spatial Transformer Network, which can deal not only with scale but also with any 2D affine transformation of objects in images. Using the Spatial Transformer in combination with a convolutional neural network, Jaderberg et al. [11] achieved a state-of-the-art result on the SVHN dataset, a 3.6% error rate.

Though different deep attention models have been designed, all existing models have constraints. For example, the Differentiable RAM [7] is fully differentiable and can be trained using SGD, but the model has a weak network architecture and does not scale well to real-world tasks. The DRAM [2] has a powerful architecture, but its sampling strategy makes the whole network non-differentiable. Building on the long line of previous attention-based visual processing methods [2, 16, 12, 19, 7], the proposed EDRAM expands the idea of using recurrent connections inside the attention mechanism [19] and improves the Deep Recurrent Visual Attention Model (DRAM) [2] in several respects:

  • It makes the previously non-differentiable architecture fully differentiable by using the Spatial Transformer, which allows the network parameters to be optimized end-to-end using SGD within the backpropagation framework.

  • A flexible loss function based on Cross Entropy and Mean Squared Error is designed, with which the EDRAM can be trained efficiently.

  • It obtains superior performance on the MNIST Cluttered and SVHN datasets compared with state-of-the-art methods.

3 Enriched Deep Recurrent Visual Attention Model

The underlying idea of the EDRAM is to combine a powerful and computationally simple recurrent neural network with a flexible and adjustable attention mechanism (ST), while making the whole network fully differentiable and trainable through SGD.

3.1 Network Architecture

Inspired by the model proposed by Ba et al. [2], our network architecture is designed to meet the complexity requirements while retaining the capability to learn a highly nonlinear classification function. It can be decomposed into several sub-components: an attention mechanism, a context network, a glimpse network, a recurrent network, a classification network and an emission network, as illustrated in Fig. 2. Each sub-component is referred to as a “network” because it is typically a multi-layered neural network.

Figure 2: The architecture of the EDRAM.

The context network receives a down-sampled low-resolution image as input and processes it through a three-layered convolutional neural network. It produces a feature vector that serves as the initialization of the hidden state of the second LSTM unit in the recurrent network.

The attention mechanism reads an image patch by using the transformation parameters (more details to be described in Section 3.2) that have been predicted on the previous iteration. Parameters for the first iteration are defined in a way that the attention mechanism reads the whole image without transformation. An algorithm for the read operation and transformation parameters will be defined in the next subsection.

Unlike the DRAM, where the glimpse network is responsible only for producing discriminative features for the classification network, the glimpse network in EDRAM integrates the localization network of the Spatial Transformer. The glimpse network in the EDRAM therefore contains a number of convolutional layers followed by a max pooling layer. The number of convolutions varies with the difficulty of the recognition task. The output of the convolutional layers and the transformation parameters are fed into separate, isolated fully-connected layers. The outputs of these isolated fully-connected layers are combined by element-wise multiplication to form the final glimpse feature vector. This way of combining “what” and “where” was proposed by Larochelle and Hinton [14].
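The element-wise combination of “what” and “where” described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function and weight names, the single fully-connected layer per branch, the random weights and the 192-dimensional convolutional feature are all assumptions made for the example. The ReLU activation and the 1024 hidden units follow the settings reported in Section 4.1.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def glimpse_feature(conv_features, transform_params, w_what, w_where):
    """Combine "what" and "where" branches by element-wise multiplication."""
    what = relu(conv_features @ w_what)        # isolated FC layer on the CNN output
    where = relu(transform_params @ w_where)   # isolated FC layer on the transformation parameters
    return what * where                        # final glimpse feature vector

# Hypothetical sizes: a 192-d flattened CNN output and 6 affine parameters.
conv_features = rng.normal(size=192)
transform_params = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])  # identity transform
w_what = rng.normal(scale=0.01, size=(192, 1024))
w_where = rng.normal(scale=0.01, size=(6, 1024))

g = glimpse_feature(conv_features, transform_params, w_what, w_where)
```

Because the two branches are multiplied, a feature is active only when both the appearance branch and the location branch respond, which is one way to read the gating role of the “where” pathway.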

The recurrent network contains two LSTM units stacked one above the other, each with its own hidden state. The first LSTM receives the glimpse feature vector and uses its hidden state to produce a feature vector for the classification network. Based on the hidden state of the first LSTM and its own hidden state, the second LSTM unit produces a feature vector for the emission network. The hidden state of the first LSTM is independent of the hidden state of the second LSTM and of the glimpse parameters. This means that a prediction of the classification network depends only on the extracted patch and is independent of the location and the transformation parameters.

The classification and emission networks map feature vectors from different layers of the recurrent network (using fully-connected layers) to the predicted labels and to the transformation parameters for the next iteration, respectively. The classification network has two fully-connected layers followed by a softmax output layer. The emission network employs a fully-connected layer to predict the transformation parameters.

The EDRAM processes the image sequentially over a fixed number of steps. At each time step, the model receives the transformation parameters and uses the attention mechanism to extract a patch at the location defined by those parameters, which form the transformation matrix (described in Section 3.2). The model uses this observation, processed by the glimpse network, together with the parameters to update its internal states and produce the parameters for the next step. In addition, the model makes a prediction based on the internal state of the first LSTM. The attention mechanism controls the number of pixels in the patch, which is usually much smaller than the number of pixels in the original image.
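The sequential processing described above can be sketched as a short loop. This is only an illustration of the data flow under strong simplifying assumptions: plain tanh recurrent cells stand in for the LSTM units, every sub-network is collapsed to a single random linear layer, the glimpse network concatenates its inputs instead of multiplying two branches, and the read operation uses nearest-neighbour sampling rather than bilinear interpolation. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 32            # hypothetical hidden size (the paper reports 512 LSTM units)
P = 8             # hypothetical patch side length
N_CLASSES = 10

def read_patch(image, theta):
    """Stand-in for the attention read: nearest-neighbour affine sampling."""
    lin = np.linspace(-1.0, 1.0, P)
    xt, yt = np.meshgrid(lin, lin)
    grid = np.stack([xt.ravel(), yt.ravel(), np.ones(P * P)])  # 3 x P*P target grid
    xs, ys = theta.reshape(2, 3) @ grid                        # source coords in [-1, 1]
    h, w = image.shape
    cols = np.clip(np.rint((xs + 1) * (w - 1) / 2), 0, w - 1).astype(int)
    rows = np.clip(np.rint((ys + 1) * (h - 1) / 2), 0, h - 1).astype(int)
    return image[rows, cols]                                   # flat patch of P*P values

def dense(n_in, n_out):
    return rng.normal(scale=0.1, size=(n_in, n_out))

W_g = dense(P * P + 6, H)               # glimpse network (collapsed to one layer)
W_r1, U_r1 = dense(H, H), dense(H, H)   # first recurrent unit
W_r2, U_r2 = dense(H, H), dense(H, H)   # second recurrent unit
W_cls = dense(H, N_CLASSES)             # classification network
W_em = dense(H, 6)                      # emission network

def run(image, context_state, n_steps):
    theta = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])  # first read covers the whole image
    r1 = np.zeros(H)
    r2 = context_state                  # second unit starts from the context vector
    outputs = []
    for _ in range(n_steps):
        patch = read_patch(image, theta)
        g = np.tanh(np.concatenate([patch, theta]) @ W_g)
        r1 = np.tanh(g @ W_r1 + r1 @ U_r1)    # sees glimpse features and its own state only
        r2 = np.tanh(r1 @ W_r2 + r2 @ U_r2)   # sees the first unit's state and its own state
        logits = r1 @ W_cls                   # class scores come from the first unit
        theta = r2 @ W_em                     # parameters for the next step
        outputs.append((logits, theta))
    return outputs

image = rng.random((28, 28))
outputs = run(image, context_state=np.zeros(H), n_steps=6)
```

The point of the sketch is the wiring: class scores are read from the first recurrent unit, the next transformation parameters from the second, and the context vector only initializes the second unit.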

3.2 Spatial Transformer attention mechanism

Instead of using a non-differentiable attention mechanism that simply reads patches at a given location, our Spatial Transformer attention mechanism is inspired by the differentiable Spatial Transformer Networks proposed by Jaderberg et al. [12]. The original Spatial Transformer Network contains only a localization network, a grid generator and a sampler. All parts are fully differentiable, which enables the optimization of the whole model end-to-end using gradient descent within a standard backpropagation framework.

Figure 3: The localization network of the EDRAM with shared layers: CNN - convolutional layers followed by a max pooling layer; LSTM - Long Short-Term Memory; FC - fully-connected layer.

The major constraint of the original Spatial Transformer Network is that the transformation parameters for the grid generator are obtained from a standalone localization network, which introduces a large number of additional parameters and increases the computational cost. Another drawback concerns the prediction when the Spatial Transformer is used in an iterative manner: it was introduced only for feed-forward networks and predicts the transformation parameters based only on the current input, without using information from previous steps. More accurate transformation parameters could be obtained if the Spatial Transformer incorporated information from the previous steps.

To overcome the downsides of the original Spatial Transformer Network, we propose a solution that incorporates information from previous iterations and at the same time reduces the number of parameters needed for the localization network. Our Spatial Transformer attention mechanism integrates the localization network into the glimpse network, which is responsible for both the classification and the transformation information flow. Technically, this means that backpropagating the classification error through the glimpse network affects the localization error of the emission network, and backpropagating the localization error affects the prediction accuracy of the classification network. A flexible loss function therefore needs to be designed for the EDRAM to control the backpropagation of the errors and to learn both accurate transformation parameters and correct classification labels. In addition, sharing network parameters between the prediction and transformation information flows reduces the computational cost. The overall scheme of the localization network of the EDRAM is illustrated in Fig. 3.

The Spatial Transformer is responsible for an affine transformation (zoom, rotation and skew) of the mesh grid points according to the six parameters of the transformation matrix: two parameters determine the zoom, two determine the skewness in the x and y directions respectively, and two determine the center position of the mesh grid.
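As a sketch of this parameterization, the mapping can be written in the standard Spatial Transformer form of [12]. The symbol names below are assumptions chosen for illustration and are not necessarily the paper's notation: $s_x, s_y$ for zoom, $k_x, k_y$ for skew, $t_x, t_y$ for the center position, and the superscripts $t$ and $s$ for target grid and source image coordinates.

```latex
\begin{pmatrix} x_i^{s} \\ y_i^{s} \end{pmatrix}
= A_\theta \begin{pmatrix} x_i^{t} \\ y_i^{t} \\ 1 \end{pmatrix},
\qquad
A_\theta = \begin{bmatrix} s_x & k_x & t_x \\ k_y & s_y & t_y \end{bmatrix}
```

With $s_x = s_y = 1$ and all other entries equal to zero, the matrix is the identity transform, which corresponds to reading the whole image without transformation as in the first iteration.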

Since a transformed mesh grid point (Grid generator in Fig. 3) generally does not coincide exactly with one particular pixel of the input image, bilinear interpolation (Sampler in Fig. 3) is used to output the fixed-scale patch from the input image for further processing.
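The grid generator and sampler described above can be sketched in NumPy as follows. This is a minimal single-channel illustration, not the authors' Theano implementation. The function names are hypothetical, and the normalization of coordinates to [-1, 1] and the zero value assumed outside the image are choices made for this example.

```python
import numpy as np

def affine_grid(theta, out_h, out_w):
    """Map a regular target grid in [-1, 1]^2 through the 2x3 affine matrix theta."""
    ys, xs = np.meshgrid(np.linspace(-1.0, 1.0, out_h),
                         np.linspace(-1.0, 1.0, out_w), indexing="ij")
    target = np.stack([xs.ravel(), ys.ravel(), np.ones(out_h * out_w)])  # 3 x N
    src_x, src_y = theta @ target                                         # 2 x N
    return src_x.reshape(out_h, out_w), src_y.reshape(out_h, out_w)

def bilinear_sample(image, src_x, src_y):
    """Sample a 2D image at normalized source coordinates with bilinear weights."""
    h, w = image.shape
    px = (src_x + 1.0) * (w - 1) / 2.0   # normalized -> pixel column
    py = (src_y + 1.0) * (h - 1) / 2.0   # normalized -> pixel row
    x0 = np.floor(px).astype(int)
    y0 = np.floor(py).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = px - x0, py - y0            # fractional offsets inside the pixel cell

    def pix(r, c):
        # Pixels outside the image contribute zero (an assumption of this sketch).
        inside = (r >= 0) & (r < h) & (c >= 0) & (c < w)
        return np.where(inside, image[np.clip(r, 0, h - 1), np.clip(c, 0, w - 1)], 0.0)

    return ((1 - wy) * (1 - wx) * pix(y0, x0) + (1 - wy) * wx * pix(y0, x1)
            + wy * (1 - wx) * pix(y1, x0) + wy * wx * pix(y1, x1))

# Usage: a 2x zoom on the image center, read out as a fixed 2x2 patch.
image = np.arange(16.0).reshape(4, 4)
zoom_theta = np.array([[0.5, 0.0, 0.0],
                       [0.0, 0.5, 0.0]])
grid_x, grid_y = affine_grid(zoom_theta, 2, 2)
zoomed_patch = bilinear_sample(image, grid_x, grid_y)
```

The output size is fixed by the grid, not by the transformation, so the same 2x3 matrix can zoom, shift or skew the read window while the patch handed to the glimpse network keeps a constant shape.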


Then the partial derivatives for the bilinear sampling (3) w.r.t the sampling grid coordinates can be formulated as follows:
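For reference, the standard Spatial Transformer formulation of Jaderberg et al. [12] writes bilinear sampling and its partial derivative as shown below. The symbols are those of that formulation and are assumptions here rather than this paper's confirmed notation: $U$ is the input image of height $H$ and width $W$, $V_i$ is the $i$-th output pixel, and $(x_i^{s}, y_i^{s})$ are the sampling grid coordinates in pixel units.

```latex
V_i = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}\,
      \max\bigl(0,\, 1 - |x_i^{s} - m|\bigr)\,
      \max\bigl(0,\, 1 - |y_i^{s} - n|\bigr)

\frac{\partial V_i}{\partial x_i^{s}} =
\sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}\,
\max\bigl(0,\, 1 - |y_i^{s} - n|\bigr)
\begin{cases}
0  & \text{if } |m - x_i^{s}| \ge 1 \\
1  & \text{if } m \ge x_i^{s} \\
-1 & \text{if } m < x_i^{s}
\end{cases}
```

The derivative with respect to $y_i^{s}$ has the same form with the roles of $x$ and $y$ (and of $m$ and $n$) exchanged.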


The sub-differentiable sampling allows backpropagation of the loss to the sampling grid coordinates, which lets the gradients flow back to the transformation parameters and to the Emission Network in Fig. 3. In addition, to encourage the Spatial Transformer attention mechanism to learn more accurate transformation parameters, we allow backpropagation of the loss in the opposite direction through supervision of the parameters obtained in the previous iteration (see Learning “Where” in the next section). This contributes to precise localization of the extracted patches by the attention mechanism after only a few epochs of training.

The Spatial Transformer attention mechanism is fully differentiable and satisfies both requirements of flexibility and adjustability. Hence, it allows us to train the attention mechanism with standard backpropagation.

3.3 Learning “Where” and “What”

In the context of multiple object recognition, the network should locate the relevant objects in an image (“Where”) and successfully recognize them (“What”) in order to achieve the desired performance. Hence, the objective function should penalize any false positive predictions as well as incorrect recognitions at true positive locations.

The loss function is designed to force the EDRAM to recognize the relevant objects in a finite number of steps. For each object in the image we allow the network to make a fixed number of predictions. The network produces the final class prediction for a given object by averaging these predictions. For an image with a given number of targets, the loss is therefore calculated only over the steps assigned to those targets, that is, the number of targets multiplied by the number of predictions per object.


The loss function for each iteration is a weighted sum of the Cross Entropy loss for the given glimpse and a weighted Mean Squared Error of the transformation parameters. The Cross Entropy term can be interpreted as “what to look at” and the Mean Squared Error term as “where to look”. The two hyperparameters that weight these terms give a good trade-off between classification and transformation loss, forcing the model to simultaneously optimize for better patch extraction and for better recognition.

The Cross Entropy term uses the predicted class probability for the ground-truth label, and the Mean Squared Error term compares the elements of the transformation matrix with their ground-truth values for the given iteration. The per-parameter weights inside the Mean Squared Error term force the network to pay more attention to critical parameters such as the width, height and coordinates of the mesh grid and to ignore unimportant parameters such as the skewness.
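A per-iteration loss of this shape can be sketched as follows. This is a minimal illustration under stated assumptions: the names `lambda_what`, `lambda_where` and `beta`, the parameter ordering (zoom x, skew x, center x, skew y, zoom y, center y) and the use of a mean over the weighted squared errors are all choices made for this example, not the paper's confirmed formulation.

```python
import numpy as np

def step_loss(class_probs, true_label, pred_params, true_params,
              lambda_what=1.0, lambda_where=1.0, beta=None):
    """Weighted sum of cross entropy ("what") and weighted MSE ("where")."""
    pred_params = np.asarray(pred_params, dtype=float)
    true_params = np.asarray(true_params, dtype=float)
    if beta is None:
        beta = np.ones_like(pred_params)   # equal weight for every parameter
    cross_entropy = -np.log(class_probs[true_label])
    weighted_mse = np.mean(beta * (pred_params - true_params) ** 2)
    return lambda_what * cross_entropy + lambda_where * weighted_mse

# Usage: ignore the two skew entries by giving them zero weight.
probs = np.array([0.1, 0.9])
target_params = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
predicted_params = [0.8, 0.3, 0.1, -0.2, 0.9, 0.0]
beta = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 1.0])
loss = step_loss(probs, 1, predicted_params, target_params, beta=beta)
```

Setting an entry of `beta` to zero removes the corresponding parameter from the supervision, which matches the idea that skewness can be left for the network to learn on its own.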

4 Experiments

EDRAM has been evaluated over the MNIST Cluttered dataset [16, 2] as well as a real-world object recognition task by using the Street View House Numbers (SVHN) dataset [17].

Since the attention mechanism is fully differentiable, the proposed network is trained with standard backpropagation and SGD using the Adam optimization algorithm [13]. Gradient clipping [15, 18] is applied to prevent exploding gradients during learning over the recurrent structures, with the norm threshold set to 10. Following Cooijmans et al. [4], the Batch Normalization proposed by Ioffe and Szegedy is used on the MNIST Cluttered dataset, with statistics estimated independently for each iteration. Theano [3], Blocks and Fuel [20] are used to implement and conduct the experiments with the MNIST Cluttered and SVHN datasets.
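The gradient clipping step mentioned above can be sketched as follows. This is a minimal NumPy illustration that assumes the threshold of 10 is applied to the global norm over all gradient arrays; the function name is hypothetical, and the experiments themselves were run with Theano and Blocks rather than this code.

```python
import numpy as np

def clip_by_global_norm(grads, threshold=10.0):
    """Rescale a list of gradient arrays so their joint L2 norm is at most threshold."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm <= threshold:
        return list(grads), total_norm
    scale = threshold / total_norm
    return [g * scale for g in grads], total_norm

# Usage: a gradient with norm 60 is scaled down to norm 10.
grads = [np.full(4, 30.0)]
clipped, norm_before = clip_by_global_norm(grads)
```

Rescaling keeps the direction of the update and only limits its length, which is why it is a common remedy for exploding gradients in recurrent networks.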

For comparison, we take the results from recent works: Dynamic Capacity Networks (DCN) by Almahairi et al. [1], ST-CNN by Jaderberg et al. [12], DRAM by Ba et al. [2], and the first result obtained on the SVHN dataset, the 11 layer CNN by Goodfellow et al. [6]. DCN is an attention-based ensemble of convolutional networks of different capacities. The family of DRAM models includes the original Deep Recurrent Visual Attention Model (Single DRAM), DRAM with a Monte Carlo sampling policy (Single DRAM MC avg.) and an ensemble of two models with different recognition orders (forward-backward DRAM MC avg.). The 11 layer CNN and ST-CNN are feed-forward networks containing several convolutional layers followed by fully-connected layers. In addition, the ST-CNN includes one (ST-CNN Single) or several (ST-CNN Multi) Spatial Transformers between convolutional layers, with separate localization networks.

4.1 MNIST Cluttered

Figure 4: MNIST Cluttered classification: Each row shows a sequence of glimpses taken by the network while recognizing a digit from the MNIST Cluttered dataset. The red rectangle illustrates the location and size of the image patch extracted by the attention mechanism.

We first evaluate EDRAM on the MNIST Cluttered dataset [16], where each image contains a randomly located hand-written digit surrounded by digit-like fragments. The dataset has 60,000 images for training and 10,000 for testing.

At each time step, a fixed-size glimpse from the input image is fed to the network, and the model predicts the extraction parameters for the next iteration as well as a class label. The model produces the final classification result after a fixed number of glimpses (6 in our case). In this experiment, we use ReLU activations for all layers except the recurrent network, where the standard LSTM activations are employed. The context network takes a down-sampled image and projects it into a vector using 3 convolutional layers (without any activations).

We use 6 convolutional layers in the glimpse network. The filter size in each convolution of the glimpse network is 3, and the numbers of filters in the 6 convolutions are 64, 64, 128, 128, 160 and 192. Max pooling is applied after the second and fourth convolutional layers. Zero-padding of half of the filter size is used in the first, third, fourth and fifth convolutions of the glimpse network. There are 512 LSTM units and 1024 hidden units in each fully-connected layer of the model.

The optimal values of the two trade-off hyperparameters of the loss were found to be 1 by random search. To encourage the network to learn the precise location of the target, two of the per-parameter weights are set to 1 whereas two others are set to 0.5. The learning rate is reduced by a factor of 10 whenever the training loss plateaus. Recurrent and convolutional units are initialized from a uniform distribution and fully-connected layers from a Gaussian distribution. Gradient directions are estimated on mini-batches.

Model                                 Test Error
Convolutional, 2 layers               14.35%
RAM [16], 8 glimpses, 4 scales        8.11%
Differentiable RAM [7], 8 glimpses    3.36%
ST-CNN Single [12]                    1.7%
DCN [1], 8 glimpses                   1.39%
Ours (6 glimpses)                     0.6%
Table 1: Recognition results on MNIST Cluttered dataset.

The results in Table 1 show that the test error is more than 2 times lower than that of the state-of-the-art models on the MNIST Cluttered dataset (0.6% versus 1.39% for DCN). With the help of the proposed Spatial Transformer attention mechanism and the designed objective function, the network is able to learn where to find a digit in the cluttered background and how to recognize it accurately. Moreover, making the network fully differentiable allows it to be trained end-to-end by standard backpropagation and to use batch normalization, which leads to faster convergence. Fig. 4 illustrates the attention process: each row shows how the Spatial Transformer attention mechanism accurately locates a digit in the image, iteration by iteration.

Figure 5: The dependency of the number of processed glimpses vs accuracy on the MNIST Cluttered test dataset.

We show in Fig. 5 how the test error on the MNIST Cluttered dataset decreases as the number of glimpses increases. The test error decreases almost linearly with the number of glimpses; after 6 glimpses it saturates and performance does not improve significantly. Thus, 6 glimpses is a good trade-off between the accuracy and the computational cost of the model.

4.2 SVHN

We also evaluate EDRAM on the multi-digit SVHN dataset and compare it with the state-of-the-art models. The SVHN dataset contains up to 250k real-world images of digits taken from pictures of house fronts. There are between 1 and 5 digits in each image, with large variability in scale and spatial arrangement. Following the experimental setup in [6, 2], the test set is held out for evaluation and the rest of the data (train and extra sets) is used to train the networks.

The data is preprocessed by generating tightly cropped images with the multi-digit number at the center, and similar data augmentation is used to create jittered images during training. As in [2], RGB images are converted to grayscale. The model is trained to classify all the digits in an image sequentially with the objective function defined in Eq. (6). A fixed number of patches is given to classify each digit in the image. As digit sequences in the SVHN dataset contain at most 5 digits, the total number of iterations equals 5 times the number of patches per digit, plus the patches for a terminal label.

The heuristic of learning two separate models with different reading orders (forward and backward), as proposed in [2], is adopted. As the localization network is integrated into the glimpse network and the transformation parameters differ between the models, the weights are not shared between the forward and backward models. Fig. 1a illustrates how the proposed EDRAM accurately extracts glimpses around digits.

In the experiments, a square extraction window is used. The network is initialized identically to the MNIST Cluttered experiment. Gradient directions are estimated on mini-batches. The ReLU activation function is used for all units except the LSTM blocks; the stacked LSTM blocks use the standard LSTM activations. As the SVHN dataset provides only the size and location of each digit bounding box, only 4 parameters of the transformation matrix are used to compute the localization part of the loss function. The two skewness parameters are not supervised and are learned by the system itself for better prediction accuracy. The loss function hyperparameters are identical to those in the MNIST Cluttered experiment. Training took 5 days on a single modern GPU.

Model                                 Test Error
11 layer CNN [6]                      3.96%
Single DRAM [2]                       5.1%
Single DRAM MC avg. [2]               4.4%
forward-backward DRAM MC avg. [2]     3.9%
ST-CNN Single [12]                    3.7%
ST-CNN Multi [12]                     3.6%
Ours (Single model)                   4.36%
Ours (forward-backward ensemble)      3.6%
Table 2: Sequence recognition error rates on SVHN dataset.

The proposed approach obtains state-of-the-art performance in recognizing multiple objects in real-world images, as shown in Table 2, while having about 1.7 times fewer parameters (22 million) than the previous state-of-the-art model ST-CNN Multi (see Table 3). This demonstrates the effectiveness of the developed objective function, which forces the network to locate the desired objects in images and recognize them one by one. The attention mechanism allows the network to ignore redundant information in images and extract the patches that are necessary for predicting the correct class labels.

Model                                 Parameters (millions)
10 layer CNN                          51
Single DRAM [2]                       14
Single DRAM MC avg. [2]               14
forward-backward DRAM MC avg. [2]     28
ST-CNN Single [12]                    33
ST-CNN Multi [12]                     37
Ours (Single model)                   11
Ours (forward-backward ensemble)      22
Table 3: Computation cost of different Deep Convolutional Networks.

The computational cost of a neural network (NN) depends on its number of parameters: with more parameters, the model needs more space to be stored and more floating-point operations (FLOPs) to produce the final output. This makes it difficult to apply NNs on embedded platforms with limited memory and processing units, such as mobile phones [8]. Besides that, significant redundancy has been reported in many state-of-the-art neural network models [5]. Integrating the separate localization network into the glimpse network yields a significant reduction in computational cost in comparison with other state-of-the-art models. Table 3 shows the number of parameters of the proposed model in comparison with other deep convolutional neural networks. Though we only matched the state-of-the-art performance on the SVHN dataset, our network contains about 1.7 times fewer parameters than the state-of-the-art ST-CNN Multi.
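The parameter comparison can be checked directly from the values in Table 3. The short calculation below uses only those reported numbers (in millions); the variable names are illustrative.

```python
# Parameter counts in millions, copied from Table 3.
params_millions = {
    "ST-CNN Multi [12]": 37,
    "Ours (Single model)": 11,
    "Ours (forward-backward ensemble)": 22,
}

baseline = params_millions["ST-CNN Multi [12]"]

# How many times fewer parameters each EDRAM variant has than ST-CNN Multi.
ratio_ensemble = baseline / params_millions["Ours (forward-backward ensemble)"]
ratio_single = baseline / params_millions["Ours (Single model)"]

print(f"ensemble: {ratio_ensemble:.2f}x fewer, single: {ratio_single:.2f}x fewer")
```

The factor of roughly 1.7 quoted in the text corresponds to the forward-backward ensemble, which is the variant that matches the 3.6% error of ST-CNN Multi in Table 2.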

5 Conclusions

This paper presents an Enriched Deep Recurrent Visual Attention Model that is fully differentiable and trainable end-to-end using SGD. The EDRAM outperforms the state-of-the-art result on the MNIST Cluttered dataset and matches the state-of-the-art models on a multi-digit house number recognition task. It requires fewer parameters and less computational resources, indicating that the attention mechanism has a large impact on the accuracy and efficiency of the model.