Automatic caption generation is a fundamental task that incorporates vision and language. The task can be tackled in two stages: first, visual information is extracted from the image, and then a linguistic description is generated. Most models couple visual and linguistic information via a Convolutional Neural Network (CNN) that encodes the input image and a Long Short-Term Memory network (LSTM) that generates the description Oriol:15; anderson2018bottom. Recently, self-attention has been used to learn these relations via Transformers huang2019attention; Marcella:20 or Transformer-based models such as Vision-and-Language BERT lu202012. These systems show promising results on benchmark datasets such as Flickr young2014image and COCO Tsung-Yi:14. However, the lexical diversity of the generated captions remains a relatively unexplored research problem. Lexical diversity here refers to how accurately and specifically the generated description characterizes a given image: an accurate caption should provide details regarding specific and relevant aspects of the image luo2018discriminability. Caption lexical diversity can be divided into three levels: word level (different words), syntactic level (word order), and semantic level (relevant concepts) wang2019describing. In this work, we approach word-level diversity by learning the semantic correlation between the caption and its visual context, as shown in Figure 1, where visual information from the image is used to learn the semantic relations in the caption at both word and sentence level.
Modern image captioning systems focus heavily on visual grounding to capture real-world scenarios. Early work fang2015captions built a visual detector to guide and re-rank image captions using a global similarity. wang2018object investigates the informativeness of object information (e.g. object frequency) in end-to-end caption generation. cornia2019show proposes grounding controlled caption language in visual regions of the image. Inspired by these works, we propose an object-based re-ranker that selects the most closely related caption using both static and contextualized semantic similarity.
[Table excerpt: Show and Tell Oriol:15 | Transformer-based caption generator Marcella:20]
Our main contributions in this paper are: (1) we propose a post-processing method for any caption generation system via visual semantic relatedness measures; (2) as an addendum to the main analysis of this work, we note that the visual re-ranker is less effective when the beam search lacks diversity.
2 Beam search caption extraction
We employ the three most common architectures for caption generation to extract the top beam-search candidates. The first baseline is the standard shallow CNN-LSTM model Oriol:15. The second, VilBERT lu202012, is fine-tuned on 12 different vision-and-language datasets for tasks such as caption-based image retrieval. Finally, the third baseline is a specialized Transformer-based caption generator Marcella:20.
3 Visual Beam Re-ranking for Image Captioning
3.1 Problem Formulation
Beam search is the dominant method for approximate decoding in structured prediction tasks such as machine translation, speech recognition, and image captioning. A larger beam size allows the model to explore the search space more thoroughly than greedy decoding. Our goal is to leverage the visual context information of the image to re-rank the candidate sequences obtained through beam search, thereby moving the most visually relevant candidates up the list while moving incorrect candidates down.
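As a minimal sketch of this re-ranking idea (the candidate captions and scores below are purely illustrative assumptions, not values from the paper), each beam candidate's decoder probability can be fused with a visual relatedness score and the list re-sorted:

```python
# Hypothetical sketch of visual re-ranking of beam-search candidates;
# captions and scores are illustrative, not taken from the paper.

def rerank(candidates, visual_scores):
    """Fuse the decoder probability of each candidate with its
    visual-semantic relatedness score and sort by the fused score."""
    fused = [(caption, p * v) for (caption, p), v in zip(candidates, visual_scores)]
    return sorted(fused, key=lambda x: x[1], reverse=True)

beams = [("a dog on a couch", 0.42), ("a cat on a sofa", 0.40)]
visual = [0.3, 0.9]  # e.g. similarity to the detected object "cat"
best = rerank(beams, visual)[0][0]  # the visually grounded candidate moves up
```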
3.2 Beam Search Visual Re-ranking
We introduce a word and sentence level semantic relation with the visual context in the image. Inspired by peinelt2020tbert, who propose a joint BERT Jacob:19 with topic modelling for semantic similarity, we propose a joint BERT with GloVe for visual semantic similarity.
Word-level similarity. To learn the semantic relation between a caption and its visual context at the word level, we first employ a bidirectional LSTM-based CopyRNN keyphrase extractor meng2017deep to extract keyphrases from the sentence as context. The model is trained on two combined pre-processed datasets: (1) a wikidump corpus (i.e. keyword, short sentence pairs) and (2) SemEval 2017 Task 10 (keyphrases from scientific publications) augenstein2017semeval. Secondly, GloVe is used to compute the cosine similarity between the visual context and the extracted caption context. For example, from a woman in a red dress and a black skirt walks down a sidewalk, the model extracts dress and walks, which are the highlight keywords of the caption.
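A minimal sketch of the word-level similarity step, using tiny 3-dimensional toy vectors in place of real 300-dimensional GloVe embeddings (all vector values and the choice of words are illustrative assumptions):

```python
import numpy as np

# Toy 3-d vectors stand in for 300-d GloVe embeddings (illustrative values).
emb = {
    "dress": np.array([0.9, 0.1, 0.0]),
    "walks": np.array([0.1, 0.8, 0.2]),
    "sidewalk": np.array([0.2, 0.7, 0.3]),
}

def cosine(a, b):
    # cosine similarity between two vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def word_level_sim(visual_word, keywords):
    # score a visual context word against the caption's extracted keywords
    return max(cosine(emb[visual_word], emb[k]) for k in keywords)

score = word_level_sim("sidewalk", ["dress", "walks"])
```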
Sentence-level similarity. We fine-tune the BERT base model to learn the visual context information. The model learns a dictionary-like word-to-sentence relation, using the visual data (i.e. the detected object) as context for the sentence (i.e. the caption), compared via cosine distance.
BERT. BERT achieves remarkable results on many sentence-level tasks, especially the textual semantic similarity task (STS-B) cer2017semeval. We therefore fine-tuned BERT on the training dataset (460k captions: 373k for training and 87k for validation), formatted as (visual, caption, label [semantically related or not related]) triples, with a binary classification cross-entropy loss [0,1] where the target indicates whether the visual context and the candidate caption are semantically related, using a batch size of 16 for 2 epochs.
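A hedged sketch of how such (visual, caption, label) triples could be assembled for this binary classification setup; the field names and example values below are our own assumptions, not the released data format:

```python
# Hypothetical formatting of (visual, caption, label) triples for the
# binary "semantically related" classifier; field names and examples
# are illustrative assumptions.

def make_example(visual, caption, related):
    return {"text_a": visual, "text_b": caption, "label": 1 if related else 0}

ex = make_example("monitor", "a computer monitor on a desk", related=True)
neg = make_example("zebra", "a computer monitor on a desk", related=False)
```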
RoBERTa. RoBERTa is an improved version of BERT, and since RoBERTa is more robust, we rely on the pre-trained Sentence-RoBERTa-sts model reimers2019sentence (https://www.sbert.net), as it yields a better cosine score.
Fusion Similarity Expert. Inspired by the Product of Experts (PoE) hinton1999products, we combine the two experts (word and sentence level) in a late fusion layer, as shown in Figure 1. The PoE is computed as follows:
p(\mathbf{d} \mid \theta_1, \dots, \theta_n) = \frac{\prod_m p_m(\mathbf{d} \mid \theta_m)}{\sum_{\mathbf{c}} \prod_m p_m(\mathbf{c} \mid \theta_m)}
where \theta_m are the parameters of each model m, p_m(\mathbf{d} \mid \theta_m) is the probability of \mathbf{d} under model m, and \mathbf{c} indexes all possible vectors in the data space. Since this approach is interested in retrieving the most closely related caption with the highest probability after re-ranking, the normalization step is not needed:
\hat{p}(\mathbf{d}) = \prod_m p_m(\mathbf{d} \mid \theta_m)
where the p_m(\mathbf{d} \mid \theta_m) are the probabilities assigned by each expert to the candidate caption \mathbf{d}.
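The unnormalized fusion amounts to multiplying the per-expert probabilities and taking the arg-max, which can be sketched as follows (the expert probabilities are illustrative, not from the paper):

```python
from math import prod

# Minimal sketch of the unnormalized Product-of-Experts fusion: the fused
# score of a candidate caption is the product of the probabilities assigned
# by the word-level (GloVe) and sentence-level (BERT) experts.
# Expert probabilities below are illustrative assumptions.

def poe_score(expert_probs):
    return prod(expert_probs)

candidates = {
    "a cat eating an apple": [0.8, 0.7],  # [word-level sim, sentence-level sim]
    "a cat on a table": [0.6, 0.5],
}
best = max(candidates, key=lambda c: poe_score(candidates[c]))
```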
4.1 Datasets and Evaluation Metric
We evaluate the proposed approach on two differently sized datasets. The idea is to evaluate our approach both with a shallow CNN-LSTM model (i.e. a low-data scenario) and with a system trained on a large amount of data (i.e. a Transformer).
Flickr 8K rashtchian2010collecting: the dataset contains 8K images, each with five human-annotated captions. We use this data to train the shallow model (6270 train / 1730 test).
COCO Tsung-Yi:14: it contains around 120K images, each annotated with five different human-written captions. We use the most common split, as provided by karpathy2015deep, where 5K images are used for testing, 5K for validation, and the rest for training the Transformer baseline.
Visual Context Dataset: although there are many public captioning datasets, they contain no textual visual information, such as the objects in the image. We therefore enrich the two datasets mentioned above with textual visual context information. In particular, to automate visual context generation and dispense with the need for human labeling, we use ResNet Kaiming:16 to extract the top-k = 3 visual context labels for each image in the caption dataset.
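The top-k = 3 label extraction can be sketched as follows, with a toy 5-way classifier output standing in for ResNet's 1000-way ImageNet logits (labels and values are illustrative assumptions):

```python
import numpy as np

# Sketch of extracting the top-k = 3 visual context labels from a
# classifier's output; the 5 labels and logits below stand in for
# ResNet's 1000-way ImageNet output (illustrative values).
labels = ["cat", "dog", "apple", "table", "car"]
logits = np.array([3.1, 0.2, 2.5, 1.9, -0.5])

probs = np.exp(logits) / np.exp(logits).sum()  # softmax over class scores
top3 = [labels[i] for i in np.argsort(probs)[::-1][:3]]
```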
[Table excerpt: Show and Tell Oriol:15]
Evaluation Metric. We use the official COCO offline evaluation suite, which produces several widely used caption quality metrics: BLEU papineni2002bleu, METEOR banerjee2005meteor, ROUGE lin2004rouge, CIDEr vedantam2015cider, and BERTscore (B-S) bert-score.
5.1 Results and Analysis
We use visual semantic information to re-rank the candidate captions produced by out-of-the-box state-of-the-art caption generators. We extract the top-20 beam search candidate captions from three different architectures: (1) the standard CNN-LSTM model Oriol:15, (2) VilBERT lu202012, a pre-trained vision-and-language model fine-tuned on 12 different vision-and-language datasets such as caption-based image retrieval, and (3) a specialized caption Transformer Marcella:20.
Experiments applying different re-rankers to each base system are shown in Table 1. The tested re-rankers are: (1) VR with BERT and GloVe, which uses the similarity between the candidate caption and the visual context (top visual contexts V1 and V2 during inference) to obtain the re-ranked score; and (2) VR with SRoBERTa, which carries out the same procedure using the similarity produced by SRoBERTa.
Our re-ranker produces mixed results, as the model struggles when the beam search is less diverse. In such cases it is not able to select the caption most closely related to its environmental context, as shown in Figure 2, which visualizes the final visual beam re-ranking.
Evaluation of Lexical Diversity. As shown in Table 2, we evaluate the model from a lexical diversity perspective. We conclude that re-ranking yields (1) a larger vocabulary and (2) more unique words per caption, even though the Type-Token Ratio (TTR) brown2005encyclopedia is lower. (TTR is the number of unique words, or types, divided by the total number of tokens in a text fragment.)
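These diversity statistics (vocabulary size, unique words per caption, and TTR) can be sketched as follows, on two toy captions:

```python
# Sketch of the lexical-diversity statistics: vocabulary size, unique words
# per caption, and Type-Token Ratio (TTR); the two captions are toy examples.

def diversity_stats(captions):
    tokens = [t for c in captions for t in c.split()]
    vocab = set(tokens)
    uniq_per_cap = sum(len(set(c.split())) for c in captions) / len(captions)
    ttr = len(vocab) / len(tokens)  # types / tokens
    return len(vocab), uniq_per_cap, ttr

vocab_size, uniq, ttr = diversity_stats(["a cat on a mat", "a dog runs fast"])
```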
[Table excerpt: Transformer-based caption generator Marcella:20]
Ablation Study. We performed an ablation study to investigate the effectiveness of each model. In the proposed architecture, each expert learns a different representation, at the word or sentence level. In this experiment, we trained each model separately, as shown in Table 3. GloVe performed better as a stand-alone model than in the combined model (i.e. the combination can hurt accuracy). To investigate this further, we visualized each expert before the fusion layer, as shown in Figure 3. Limitation. In contrast to the CNN-LSTM case (Figure 3, top), where each expert contributes to the final decision, we observed that a shorter caption (with less context) can negatively influence the BERT similarity score. In that case the word-level expert, i.e. GloVe, dominates as the main expert.
In this work, we have introduced an approach that overcomes a limitation of beam search and avoids re-training for better accuracy. We proposed a combined word-level and sentence-level visual beam search re-ranker. However, we discovered that word and sentence similarity disagree with each other when the beam search is less diverse.
Appendix A Hyperparameters and Setting
All training and beam search are implemented in fairseq ott2019fairseq and run with PyTorch 1.7.1 paszke2019pytorch.
Visual Re-ranker. The only model we tuned is the BERT model. We fine-tuned it on the training dataset using the original BERT implementation, TensorFlow 1.15 with CUDA 8 abadi2016tensorflow. The textual dataset contains around 460k captions (373k for training and 87k for validation), formatted as (visual, caption, label [semantically related or not related]) triples. We use a batch size of 16 for three epochs with a learning rate, and we kept the rest of the hyperparameter settings as in the original implementation. Note that we keep GloVe as a static model, as it is trained on 840 billion tokens.
Show-and-Tell Oriol:15. We train this shallow model from scratch on the Flickr 8K dataset (6270 train / 1730 test) (hardware: GTX 1070 Ti GPU, 32 GB RAM, 8-core i7 CPU).
Caption Transformer Marcella:20 (https://github.com/aimagelab/meshed-memory-transformer). We train the Transformer from scratch with the Bottom-Up features anderson2018bottom. However, unlike the original implementation by the authors, we use a full 12-layer Transformer. We follow the same hyperparameters as the original implementation. The model is trained with PyTorch 1.7.1 on a single K80 GPU.
VilBERT lu202012. Since VilBERT is trained on 12 datasets, we use it as an out-of-the-box model.
Appendix B Examples of Re-ranked Captions
Best Beam. In Figure 4 we show examples of the proposed re-ranker and compare against the best baseline beam search candidate (BL). The model struggles to unify the information from different modalities, and therefore the word-level expert has a stronger influence on the final score. In addition, the visual classifier also faces difficulties with complex background images. This could be resolved in future work by employing multiple classifiers (each with multiple labels) and then using a voting technique to filter out the most probable object in the image.
Greedy. We also experiment with the greedy output (BL), as shown in Figure 5; our model suffers from the same limitation.
| Visual: monitor | BL: a computer monitor on a desk with a keyboard | VR: a desk with a computer monitor and a keyboard ✗ | Human: a computer that is on a wooden desk |
| Visual: ant ✗ | BL: a group of birds walking in the water ✓ | VR: a group of birds walking in the water ✓ | Human: a group of small birds walking on top of a beach |
| Visual: necklace | BL: a woman wearing a white dress holding a pair of scissors | VR: a woman with a pair of | Human: a silver colored necklace with a pair of mini scissors on it |
| Visual: food | BL: a plate of food on a table | VR: a plate of food and a drink on a table | Human: a white plate with some food on it |
| Visual: apple | BL: a cat is eating an apple | VR: a close up of a cat eating an apple | Human: a gray cat eating a treat from a humans hand |
| Visual: chainlink fence ✗ | Vil: a black and white photo of train tracks | VR: a black and white photo of a train on the tracks | Human: a long train sitting on a railroad track |
| Visual: cardigan ✗ | BL: a cat sitting on the floor next to a closet | VR: a cat and a dog in a room | Human: a cat and a dog on the floor in a room |
| Visual: bassinet | BL: a baby sitting in front of a cake | VR: a baby sitting in front of a birthday cake | Human: a woman standing over a sheet cake sitting on top of table |
| Visual: cowboy hat ✗ | BL: a cat is eating a dish on the floor | VR: a black and white cat sitting in a bowl ✗ | Human: a cat on a wooden surface is looking at a wooden |
| Visual: pizza | BL: a pizza with cheese on a plate | VR: a pizza sitting on top of a white plate | Human: a small pizza being served on a white plate |
| Visual: dishwasher | BL: a man standing in a kitchen with a laptop | VR: a man standing in a kitchen preparing food | Human: a man with some drink in hand stands in front of counter |
| Visual: lab coat ✗ | BL: a man standing in a kitchen holding a glass of wine | VR: a man standing in a kitchen holding a wine glass | Human: a man standing in a kitchen holding a glass full of alcohol |
| Visual: indian elephant | BL: a group of elephants under a shelter in a field | VR: a group of elephants under a hut | Human: a young man riding a skateboard down a yellow hand rail |
| Visual: chain ✗ | Vil: a group of women sitting on a bench eating | VR: a group of women eating hot dogs | Human: three people are pictured while they are eating |
| Visual: trolleybus | BL: a green bus parked in front of a building | VR: a green double decker bus parked in front of a building ✗ | Human: a passenger bus that is parked in front of a library |
| Visual: racket | BL: a woman hitting a tennis ball on a tennis court | VR: a woman holding a tennis ball on a tennis court ✗ | Human: a large crowd of people are watching a lady play tennis |