Efficient CNN-LSTM based Image Captioning using Neural Network Compression

12/17/2020 ∙ by Harshit Rampal, et al. ∙ Carnegie Mellon University

Modern neural networks achieve state-of-the-art performance on tasks in Computer Vision, Natural Language Processing and related verticals. However, they are notorious for their voracious memory and compute appetite, which further obstructs their deployment on resource-limited edge devices. To enable edge deployment, researchers have developed pruning and quantization algorithms that compress such networks without compromising their efficacy. Such compression algorithms are broadly experimented on standalone CNN and RNN architectures, whereas in this work we present an unconventional end-to-end compression pipeline for a CNN-LSTM based image captioning model. The model is trained using VGG16 or ResNet50 as an encoder and an LSTM decoder on the Flickr8k dataset. We then examine the effects of different compression architectures on the model and design a compression architecture that achieves a 73.1% reduction in model size, a 71.3% reduction in inference time and a 7.7% increase in BLEU score relative to its uncompressed counterpart.




1 Introduction

In recent years, deep neural networks have gained massive popularity for achieving state-of-the-art results on tasks like classification, recognition and prediction. However, such complex networks have a large computational footprint that impedes their portability to low-power mobile devices. Modern mobile devices have light and sleek form factors that further constrain their power and thermal capacity. For instance, a 32-bit floating point addition consumes 0.9pJ under 45nm CMOS technology. As summarized by Han et al. [2015a], an on-chip SRAM cache access takes 5pJ, while an off-chip DRAM access takes 640pJ. Clearly, complex neural networks that require off-chip DRAM accesses are far more costly. From an energy perspective, a typical neural network with more than 1 billion connections, running at 30fps, will consume more than 19W just for accessing DRAM memory, which is well beyond the power limit of a typical mobile device.
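The ~19W figure follows directly from the cited energy numbers; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope energy estimate for DRAM-bound inference,
# using the per-access figure cited from Han et al. [2015a].
DRAM_ACCESS_PJ = 640          # energy per off-chip DRAM access (pJ)
CONNECTIONS = 1_000_000_000   # network with 1 billion connections
FPS = 30                      # frames per second

# Assume one DRAM access per connection per frame (worst case).
accesses_per_second = CONNECTIONS * FPS
power_watts = accesses_per_second * DRAM_ACCESS_PJ * 1e-12  # pJ -> J/s

print(f"{power_watts:.1f} W")  # 19.2 W, matching the ~19 W in the text
```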

Hence, there is a need to compress deep networks to enable their real-time applications on resource-limited devices. Recently, advanced pruning and quantization algorithms have gained momentum in compressing such networks without compromising their performance. Pruning helps in reducing the parameters that are less sensitive to a change in network’s performance. On the other hand, quantization carries out the computations during a network’s work cycle in a lower bit precision. A synergy of these two methods enables faster inference times and efficient storage of large and dense neural networks.

Conventionally, researchers test their compression pipelines on standalone CNNs ranging from AlexNet Krizhevsky et al. [2017] to MobileNet Howard et al. [2017], and even on RNNs and LSTMs. In this project, we followed the unconventional approach of developing and experimenting with a compression pipeline on a novel use case, i.e., a CNN-LSTM based image captioning model. As the name describes, an image captioning model generates captions or text related to the contents of an image. The model is a sequence of an encoder (CNN) and a decoder (RNN). The encoder extracts visual features from the image and feeds them to the decoder to generate captions. We use state-of-the-art CNNs like VGG16 and ResNet50 as encoders and build our decoder from scratch using an LSTM. Such encoder-decoder models combine a CNN (e.g., VGG16 has 138M parameters) with an LSTM network (2.62M parameters), resulting in a massive number of network parameters. Deploying such huge networks on mobile devices is infeasible due to power, space and thermal constraints. Hence, an end-to-end compression of a captioning model is crucial to leverage its real-time applications on mobile devices. Interestingly, a compressed version of such an image captioning model could be deployed on wearable electronics to assist visually impaired people.

We trained, validated and tested our compressed captioning model on the Flickr8k dataset (https://www.kaggle.com/adityajn105/flickr8k). It consists of 8094 images, each with 5 captions. The evaluation metric was the BLEU score Papineni et al. [2002], a commonly used method for comparing a generated sequence against reference sequences.
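As a rough sketch of what BLEU-1 measures, the following computes clipped unigram precision; real evaluations use a full implementation such as NLTK's, which also applies a brevity penalty:

```python
from collections import Counter

def bleu1(candidate, references):
    """Simplified BLEU-1: clipped unigram precision of a candidate
    caption against a set of reference captions (no brevity penalty)."""
    cand_counts = Counter(candidate)
    # A candidate word is credited at most as many times as it
    # appears in the most favourable single reference.
    max_ref = Counter()
    for ref in references:
        for word, n in Counter(ref).items():
            max_ref[word] = max(max_ref[word], n)
    clipped = sum(min(n, max_ref[w]) for w, n in cand_counts.items())
    return clipped / max(len(candidate), 1)

gen = "dog is running through the green grass".split()
refs = [
    "a dog runs through the grass".split(),
    "the dog is running in the grass".split(),
]
print(round(bleu1(gen, refs), 3))  # 0.857: 6 of 7 words are matched
```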

We employed magnitude-based pruning to sparsify both the encoder and decoder parts of the network. We further implemented and experimented with two quantization schemes, namely post-training quantization and quantization aware training. Post-training quantization was implemented on the encoder, while both quantization schemes were experimented with on the decoder. Section-4 presents our findings on implementing different compression architectures on the captioning model; interestingly, some of them even outperformed the full-scale uncompressed model. On the basis of the reported results, we strongly advocate a particular compression architecture for the captioning model. The compressed model achieves impressive storage efficiency coupled with a respectable reduction in inference time.

2 Related Work

Our literature review focuses on the different use cases of neural network compression and the different compression techniques. While there is an extensive repertoire of work on the different compression techniques, the use cases of such algorithms have not yet been widely explored.

Recently, Google launched Live Caption [1], an on-device captioning feature on its mobile devices that uses an RNN based model to caption audio sequences. To deploy it on mobile devices, Google used neural network pruning to reduce power consumption to 50% while still maintaining the efficacy of the network Shangguan et al. [2019]. Kim et al. [2017] designed an object recognition system to recognize vehicles on the road using Faster R-CNN; they were able to deploy the system on embedded devices by compressing the network with pruning and quantization. Tan et al. [2019] proposed pruning techniques on RNNs for an image captioning model. While they were able to reach a 95% sparsity level without significant loss in performance, the proposed pruning techniques applied only to the RNN layers and not to the convolutional layers. The above works implemented and validated compression techniques on standalone CNN, RNN or LSTM architectures, whereas we focus on the effect of applying compression techniques to an end-to-end CNN-LSTM based image captioning pipeline.

There has been an immense advancement in research on neural network compression, especially on pruning and quantization in the last decade owing to deploying neural networks on resource-constrained environments. Among this large corpus of work, we review a select few which align best with our project.

Han et al. [2015b] proposed a three step process to prune and fine tune a network to reduce a significant number of parameters without incurring massive performance loss. Along similar lines, Zhu and Gupta [2017] validated the efficacy of magnitude based pruning by comparing pruned networks (large-sparse) and dense networks (small-dense) with similar memory footprints. Li et al. [2016] proposed a filter pruning method to remove filters having small effect on output accuracy and thereby reducing the convolution computation costs. The above methods follow an iterative approach to prune the networks after the network has been trained. On the other hand, Lee et al. [2018] proposed a pruning algorithm to prune the network once at initialization prior to training.

Han et al. [2015a] proposed a compression pipeline of pruning, quantizing and encoding a neural network to significantly reduce its footprint with no loss in accuracy. After pruning the network, they reduce the number of bits required to represent the network's learned weights and limit the number of effective weights to store by having multiple connections share the same weight. They show that with pruning and quantization together, a model can be compressed to 3% of its original size with no loss of accuracy, whereas pruning alone compresses the network to 8% of its original size. Jacob et al. [2018] proposed a quantization scheme that relies only on integer arithmetic to approximate the floating-point computations in a neural network. They achieved a 4x reduction in model size and an improvement in inference efficiency in ARM NEON-based implementations.

3 Method

An image captioning model consists of an encoder and a decoder. In our work, the encoder and the decoder have separate training procedures: we first train the encoder, obtain the extracted features and then use those features to train the decoder. When deploying the network, we merge the two models. Figure-1 demonstrates the complete training pipeline for the quantized-encoder quantized-decoder model.

Figure 1: Training pipeline for the model with best compression architecture i.e., quantized encoder quantized decoder image captioning model.

3.1 Encoder

The role of the encoder is to extract meaningful features from an input image. We leveraged the power of established architectures, namely, VGG16 and ResNet50 for this task.

Running our encoder-decoder model jointly is a laborious task that requires a lot of RAM. To make the best use of our available computing resources, we adopted progressive storing: the features extracted by the pruned and quantized pre-trained CNN are stored separately and then fed to the decoder model for captioning. Loading the CNN features separately in this way is more efficient than the computationally demanding joint processing of the encoder-decoder model.
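As a rough illustration (the file format and helper names here are hypothetical, not from the paper's code), progressive storing amounts to caching the encoder's output once and reloading it for decoder training:

```python
import os
import pickle
import tempfile

def save_features(features, path):
    """Persist encoder features (dict: image id -> feature vector),
    so the CNN never needs to be in memory during decoder training."""
    with open(path, "wb") as f:
        pickle.dump(features, f)

def load_features(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Usage: in the paper's pipeline these vectors would come from the
# pruned/quantized encoder; here they are dummy values.
path = os.path.join(tempfile.mkdtemp(), "features.pkl")
feats = {"img_001": [0.1, 0.7, 0.3], "img_002": [0.9, 0.2, 0.4]}
save_features(feats, path)
assert load_features(path) == feats
```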

3.1.1 Pruning

Pruning is the practice of zero-masking weights that hold less importance, where "importance" refers to a weight's gradient with respect to the loss function, which reflects its role in affecting the training accuracy of the network. There are various levels of pruning, generally ranging from 50% to 95%. State-of-the-art pruning algorithms are able to retain original levels of accuracy even after pruning 90% of the weights. However, the upper bound of pruning sparsity is heavily dependent on the model architecture: mainstream architectures ranging from AlexNet to MobileNet can be safely pruned in the range of 85% to 95%, while pruning more than 95% of the network parameters generally results in a significant dip in accuracy.

The dense layers in the stated architectures were augmented with binary mask variables of the same corresponding shapes. The mask determines which weights are set to zero while the rest retain their values: the masks of weights whose magnitude falls below a threshold after t_0 training steps are set to zero, where t_0 is a hyper-parameter. Additionally, pruning is defined over a sparsity range; the lower bound s_i is generally 0, while the upper bound s_f generally varies from 50% to 95%. The binary masks are updated every Δt steps as the network is gradually trained to reach the final sparsity level. Furthermore, zero-masked weights are not updated during backpropagation, and this procedure of zero masking is followed for n pruning steps until the selected layer attains the final sparsity. Zhu and Gupta [2017] contrived the following relation between s_i, s_f, n, t_0 and Δt:

s_t = s_f + (s_i − s_f) (1 − (t − t_0) / (n Δt))^3,  for t ∈ {t_0, t_0 + Δt, ..., t_0 + n Δt}

The above formulation conveys that the network undergoes extensive pruning during the first few steps after t_0; thereafter, the pruning rate gradually decreases. Pruning is highly correlated with the learning rate: a small learning rate makes it difficult to recover from the accuracy loss induced by pruning, while a large one prunes weights that hold significant importance to the network's performance. Lastly, when s_f is achieved for the selected layer, the binary masks are no longer updated. To prune the convolutional layers of these networks, they must be iteratively trained and retrained on ImageNet, which was beyond the limit of our computing resources. So, we experimented with pruning and quantizing the last dense layers of VGG16 and ResNet50.
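The cubic sparsity schedule of Zhu and Gupta [2017] can be implemented directly; a sketch assuming step-based updates:

```python
def sparsity_at_step(t, s_i, s_f, t0, n, dt):
    """Target sparsity at training step t under the cubic schedule of
    Zhu and Gupta [2017]: pruning starts at step t0 with initial
    sparsity s_i and ramps to final sparsity s_f over n pruning steps
    spaced dt steps apart."""
    if t < t0:
        return s_i
    t = min(t, t0 + n * dt)            # schedule saturates at s_f
    frac = (t - t0) / (n * dt)
    return s_f + (s_i - s_f) * (1.0 - frac) ** 3

# Sparsity ramps quickly early on, then flattens out near s_f.
print(sparsity_at_step(0,    0.0, 0.9, 0, 100, 10))  # 0.0
print(sparsity_at_step(500,  0.0, 0.9, 0, 100, 10))  # 0.7875
print(sparsity_at_step(1000, 0.0, 0.9, 0, 100, 10))  # 0.9
```

Note how half-way through the schedule (step 500 of 1000) the layer is already at 78.75% of the 90% target, matching the observation above that most pruning happens early.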

3.1.2 Quantization

It is also desirable to quantize the network in conjunction with pruning for efficient use of storage. Quantization reduces the bit precision of a network without compromising its performance and is recognized as one of the most effective approaches to satisfying the low memory requirements of resource-constrained devices. It stores the weights, and carries out the matrix computations of the forward, backward and update passes, in compact formats such as float-16, int-8 or even lower. Such efficient use of storage enables fitting large models into an on-chip SRAM cache rather than off-chip DRAM memory.

Post-training quantization converts the weights from floating point representation to an integer representation with 8 bits of precision. At inference time, activations are also converted to int8 format and further computations are carried out on the integer weights. The following equation represents 8-bit post-training quantization Jacob et al. [2018]:

real_value = (int8_value − zero_point) × scale

Per-axis weights are represented by int8 two's complement values in the range [-127, 127], with the zero point equal to 0. Per-tensor activations and inputs are represented by int8 two's complement values in the range [-128, 127], with a zero point in the range [-128, 127]. The zero point is the int8 representation of zero in floating point precision, which ensures that 0 in float format is exactly representable by a quantized int8 value. The scale is a positive real number in floating point precision.
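The scheme above can be sketched as follows (a simplified per-tensor affine quantizer; the range and parameter choices are illustrative, not the exact TensorFlow Lite implementation):

```python
def quantize_params(xmin, xmax, qmin=-128, qmax=127):
    """Derive (scale, zero_point) for an 8-bit affine quantizer from the
    observed float range. The range is widened to contain 0 so that
    float zero is exactly representable, as described above."""
    xmin, xmax = min(xmin, 0.0), max(xmax, 0.0)
    scale = (xmax - xmin) / (qmax - qmin)
    zero_point = round(qmin - xmin / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))      # clamp to the int8 range

def dequantize(q, scale, zero_point):
    # real_value = (int8_value - zero_point) * scale
    return (q - zero_point) * scale

scale, zp = quantize_params(-1.0, 1.0)
q = quantize(0.5, scale, zp)
print(dequantize(q, scale, zp))  # ~0.502: 0.5 up to quantization error
```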

3.2 Decoder

The decoder produces captions for an image based on the features extracted from it by the encoder. To train the decoder, we use the extracted features of the training images from the encoder model along with the processed captions of those training images. We process the text using standard pre-processing procedures such as converting characters to lower case and removing punctuation and digits. We then create the vocabulary, followed by its tokenization.

We design a multi-input decoder model to process the extracted features and the text to produce captions. The feature extractor part has a dense layer with 256 neurons and a dropout layer; the text extractor part pre-processes the training captions and is followed by an embedding layer and an LSTM layer of 256 neurons. The decoder layer merges the outputs from the feature and text extractor layers, followed by two dense layers, one with 256 neurons and the other with as many neurons as the vocabulary size. Training the decoder model as-is requires a lot of memory. Due to our resource constraints, we train the decoder with progressive loading, wherein a data generator function provides a single sample of training data from the whole training set at a time. We train the decoder sample by sample, which saves a lot of memory.
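A minimal sketch of such a progressive-loading generator, under assumed data layouts (all names here are illustrative, not the paper's actual code):

```python
# Progressive loading: yield one (features, caption prefix) -> next word
# training sample at a time, so the full training matrix is never
# materialised in memory.
def data_generator(features, captions, word_to_id):
    while True:  # Keras-style generators loop over the data forever
        for img_id, caption in captions.items():
            seq = [word_to_id[w] for w in caption.split()]
            # Each prefix of the caption predicts its next word.
            for i in range(1, len(seq)):
                yield (features[img_id], seq[:i]), seq[i]

feats = {"img1": [0.1, 0.2]}
caps = {"img1": "startseq dog runs endseq"}
vocab = {w: i for i, w in enumerate("startseq dog runs endseq".split())}
gen = data_generator(feats, caps, vocab)
(x_feat, x_seq), y = next(gen)
print(x_seq, y)  # [0] 1 : the prefix "startseq" predicts "dog"
```

In the paper's setting, a generator like this would be passed to Keras' `fit` routine so that samples are produced on demand rather than held in RAM.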

3.2.1 Pruning

We applied both pruning and quantization algorithms to compress the decoder model. We use the same pruning algorithm as used in the encoder. Each layer of the decoder model was pruned to achieve 50% sparsity level. While pruning reduced the model size and produced BLEU scores comparable to the baseline model, the generated captions were not comprehensible and the model size reduction was not that significant. Hence, our best model doesn’t incorporate decoder pruning.

3.2.2 Quantization

For quantization, we employed both post-training quantization and quantization aware training to convert the weights of the decoder model to 8-bit precision. Post-training quantization, identical to the scheme employed for the encoder, did not perform as expected and produced very low BLEU scores. Quantization aware training, on the other hand, produced BLEU scores better than the baseline model. Quantization aware training has the same objective as post-training quantization, i.e., reducing the precision to 8 bits, but it is applied during the training phase of the network Jacob et al. [2018]. Before performing quantization aware training, we initialized the model with the baseline model weights to achieve better test accuracy. Our best decoder model is based on quantization aware training.
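Quantization aware training inserts "fake quantization" into the forward pass: weights are rounded to 8-bit values and immediately dequantized, so the network learns weights that survive the precision loss. A minimal sketch of the fake-quantize operation (symmetric per-tensor quantization assumed; the actual TensorFlow implementation differs in detail):

```python
def fake_quantize(x, scale, qmin=-127, qmax=127):
    """Round x to its nearest 8-bit representable value, then return
    the dequantized float, as used in the QAT forward pass."""
    q = max(qmin, min(qmax, round(x / scale)))  # quantize + clamp
    return q * scale                            # immediately dequantize

scale = 1.0 / 127  # assumes weights lie in [-1, 1]
print(fake_quantize(0.4312, scale))  # ~0.4331, i.e. 55/127
```

Because the rounding happens inside the forward pass, the loss already reflects the 8-bit precision during training, which is why QAT recovered the BLEU scores that post-training quantization lost.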

4 Results

As stated before, we test our compression pipeline on an image captioning model consisting of a pre-trained VGG16/ResNet50 encoder and an LSTM decoder trained from scratch. The image captioning model is trained on the Flickr8k dataset using the TensorFlow framework and Keras APIs. For compression, we make use of the TensorFlow Model Optimization library. For evaluation, we use BLEU scores, model size and inference time on the test set.

Table-1 and Table-2 demonstrate the effect of different compression architectures on the image captioning model with VGG16 and ResNet50 as encoders, respectively. In these tables, we report the BLEU-1 score, the combined size of the encoder and decoder models (in MB) and the inference time of the model on the test set (in mins), along with the % change in each metric relative to the baseline model.

Out of the many different compression architectures that we implemented and evaluated, we report the metrics for the following 8 configurations:

  • baseline encoder and baseline decoder

  • baseline encoder and 50% pruned decoder

  • baseline encoder and quantized decoder

  • 50% pruned encoder and baseline decoder

  • quantized encoder and baseline decoder

  • 50% pruned encoder and 50% pruned decoder

  • quantized encoder and quantized decoder

  • 50% pruned and quantized encoder and quantized decoder

Effect of Pruning: A pruned model should achieve a smaller model size than its baseline counterpart without hampering the model's performance. From Table-1 and Table-2, comparison of Model 1 to 2 and Model 4 to 6 demonstrates the effect of decoder pruning. The BLEU score of the pruned model is at least on par with its corresponding baseline model, and the pruned model achieves a considerable reduction in model size. Comparison of Model 1 to 4 shows that while encoder pruning achieves a better reduction in model size than decoder pruning, it comes at the cost of a reduced BLEU score.

Effect of Quantization: Similarly, comparison of Model 1 to 3 and Model 5 to 7 demonstrates the effect of decoder quantization. In one case the quantized model achieves a comparable BLEU score, while in the other it performs significantly better. While the quantized models achieve model size reductions as good as their pruned counterparts, they also achieve a significant reduction in inference time. Comparison of Model 1 to 5 and Model 3 to 7 shows that encoder-quantized models perform on par with their corresponding baseline models. This shows that both encoder and decoder quantization are highly effective.

We tried an interesting case of pruning and quantizing the encoder and only quantizing the decoder. From the tables, it can be observed that while this configuration achieves the lowest model size, the model performance is not comparable to the baseline. In Figure-2, we report the generated captions of four images taken from the internet [dog, soccer, surf, bike]. Our implementation can be found at: https://github.com/amanmohanty/idl-nncompress.

5 Evaluation

Our attempt at pruning the decoder produced unanticipated results. A plausible reason for its under-performance is the relatively simple architecture of the decoder: pruning the sole LSTM cell to 50% sparsity could have aggressively pruned its memory gate, affecting its efficacy in producing suitable captions.

On the encoder side, pruning 50% of the dense layers in VGG16 and ResNet50 led to appreciable results, while experiments with a higher sparsity level degraded the consistency of the captions. The rationale for the dip in accuracy lies in the limited representation of the terminal set of dense layers of both VGG16 and ResNet50. To implement pruning, one must iteratively train and re-train the network up to the desired sparsity level. Since training and re-training on the ImageNet dataset was beyond the limit of our resources, we switched to CIFAR-100 for pruning the terminal set of dense layers. CIFAR-100 is less variegated than ImageNet, so the terminal layers must have learnt a limited set of features. Therefore, excessive pruning (above 50%) of these layers dropped their efficacy in extracting good features.

The above reasoning provides an insight that the relative simplicity of the encoder and decoder makes the captioning model sensitive to a high level of pruning. Therefore, one may opine that the model performs best when pruning is avoided in both the encoder and decoder. Table-1 and Table-2 validate this opinion.

We experimented with quantization aware training alone for the decoder, rather than coupling it with pruning. The primary reason is that the effect of pruning is diminished when quantization aware training is performed during the training phase.

Sl No | Models (Encoder-Decoder) | BLEU-1 score | % change | Model size (MB) | % reduction | Inference time (mins) | % change
1 | Baseline-Baseline | 0.527 | - | 578.4 | - | 5.68 | -
2 | Baseline-Prune | 0.534 | +1.3 | 534.1 | 7.6 | 5.58 | -1.7
3 | Baseline-Quantized | 0.521 | -1.1 | 533.09 | 7.8 | 1.72 | -69.7
4 | Prune-Baseline | 0.418 | -20.6 | 74 | 87.2 | 5.79 | +1.9
5 | Quantized-Baseline | 0.527 | 0 | 200.7 | 65.3 | 5.33 | -6.2
6 | Prune-Prune | 0.418 | -20.6 | 29.7 | 94.8 | 6.21 | +9.3
7 | Quantized-Quantized | 0.568 | +7.7 | 155.39 | 73.1 | 1.63 | -71.3
8 | PruneQuant-Quantized | 0.459 | -12.9 | 22.99 | 96 | 1.74 | -69.4
Table 1: Evaluation of different compression architectures on the image captioning model with VGG16 as encoder trained on flickr8k dataset. Model sizes are reported in MB and inference times are reported in minutes taken per 2000 samples.
Sl No | Models (Encoder-Decoder) | BLEU-1 score | % change | Model size (MB) | % reduction | Inference time (mins) | % change
1 | Baseline-Baseline | 0.51 | - | 150.1 | - | 5.91 | -
2 | Baseline-Prune | 0.514 | +0.8 | 110 | 26.72 | 5.7 | -3.5
3 | Baseline-Quantized | 0.503 | -1.3 | 109.1 | 27.31 | 1.65 | -72.1
4 | Prune-Baseline | 0.418 | -18.0 | 145.1 | 3.33 | 6.26 | +5.9
5 | Quantized-Baseline | 0.546 | +7.1 | 83.7 | 44.23 | 5.87 | -0.7
6 | Prune-Prune | 0.418 | -18.0 | 105 | 30.04 | 4.79 | -18.9
7 | Quantized-Quantized | 0.48 | -5.9 | 42.7 | 71.55 | 1.79 | -69.7
8 | PruneQuant-Quantized | 0.442 | -13.3 | 39.1 | 73.95 | 1.33 | -77.5
Table 2: Evaluation of different compression architectures on the image captioning model with ResNet50 as encoder trained on flickr8k dataset. Model sizes are reported in MB and inference times are reported in minutes taken per 2000 samples.
Figure 2: Generated captions by the baseline and the best model, QVGG-QLSTM.

  • dog — Baseline: "two dogs are playing together in the grass"; Best: "dog is running through the grass"

  • soccer — Baseline: "two men are playing soccer on the grass"; Best: "two men are playing soccer"

  • surf — Baseline: "man in red shirt is walking through the water"; Best: "the man is in the water"

  • bike — Baseline: "man in red shirt is riding bike on dirt path"; Best: "man in red shirt is riding bike on dirt path"

6 Conclusion

Among the corpus of different compression architectures that we implemented and evaluated on the image captioning model with VGG16 and ResNet50 as encoders, we select the VGG16-LSTM quantized encoder and quantized decoder (QVGG-QLSTM) as our best compressed model. The QVGG-QLSTM model performs better than the baseline model across all reported metrics. It achieves a 7.7% increase in BLEU score, 73.1% reduction in model size and 71.3% reduction in inference time per 2000 samples. In Figure-2, we report the generated captions of four images taken from the internet.

The captions reported are generated by the baseline model and our best model, QVGG-QLSTM. Clearly, QVGG-QLSTM generates a better caption for the 'dog' image than the baseline model. For the 'surf' and 'soccer' images, the two models generate different but correct captions. For the 'bike' image, both models generate the same, somewhat incorrect caption. This shows that the compressed model performs on par with or better than the baseline model.

7 Future Work

A couple of directions can be pursued to improve the present model. A run-of-the-mill method to boost the performance of deep neural networks is to increase the size of the dataset; in this context, MS-COCO Lin et al. [2014] is a promising choice. The dataset consists of 123,387 images, each with 5 captions. Another upgrade would be a more advanced pruning method: state-of-the-art methods Li et al. [2016] prune not only the dense layers but the convolutional filters as well, which helps achieve a higher level of sparsity without a significant dip in accuracy. Similarly, advanced quantization algorithms Banner et al. [2019] can be experimented with to push the envelope of the present compression architecture.


  • [1] (2019) On-device captioning with Live Caption. https://ai.googleblog.com/2019/10/on-device-captioning-with-live-caption.html. Cited by: §2.
  • R. Banner, Y. Nahshan, and D. Soudry (2019) Post training 4 bit quantization of convolutional neural networks for rapid deployment. In Advances in Neural Information Processing Systems. Cited by: §7.
  • S. Han, H. Mao, and W. J. Dally (2015a) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §1, §2.
  • S. Han, J. Pool, J. Tran, and W. Dally (2015b) Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143. Cited by: §2.
  • A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1.
  • B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713. Cited by: §2, §3.1.2, §3.2.2.
  • B. Kim, Y. Jeon, H. Park, D. Han, and Y. Baek (2017) Design and implementation of the vehicular camera system using deep neural network compression. In Proceedings of the 1st International Workshop on Deep Learning for Mobile Systems and Applications, EMDL '17, New York, NY, USA. External Links: ISBN 9781450349628, Link, Document. Cited by: §2.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2017) Imagenet classification with deep convolutional neural networks. Communications of the ACM 60 (6), pp. 84–90. Cited by: §1.
  • N. Lee, T. Ajanthan, and P. Torr (2018) SNIP: single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, Cited by: §2.
  • H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf (2016) Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710. Cited by: §2, §7.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §7.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318. Cited by: §1.
  • Y. Shangguan, J. Li, L. Qiao, R. Alvarez, and I. McGraw (2019) Optimizing speech recognition for the edge. arXiv preprint arXiv:1909.12408. Cited by: §2.
  • J. H. Tan, C. S. Chan, and J. H. Chuah (2019) Image captioning with sparse recurrent neural network. arXiv preprint arXiv:1908.10797. Cited by: §2.
  • M. Zhu and S. Gupta (2017) To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878. Cited by: §2, §3.1.1.