
Word to Sentence Visual Semantic Similarity for Caption Generation: Lessons Learned

This paper focuses on enhancing the captions generated by image captioning systems. We propose an approach that improves caption generation by choosing the output most closely related to the image rather than the most likely output produced by the model. Our model revises the beam-search output of the language generator from a visual-context perspective. We employ a visual semantic measure at both word and sentence level to match the proper caption to the related information in the image. The proposed approach can be applied to any captioning system as a post-processing method.



1 Introduction

Automatic caption generation is a fundamental task that incorporates vision and language. The task can be tackled in two stages: first, image-visual information extraction, and then linguistic description generation. Most models couple the relations between visual and linguistic information via a Convolutional Neural Network (CNN) to encode the input image and a Long Short-Term Memory (LSTM) network for language generation Oriol:15; anderson2018bottom. Recently, self-attention has been used to learn these relations via Transformers huang2019attention; Marcella:20 or Transformer-based models such as Vision-and-Language BERT lu202012. These systems show promising results on benchmark datasets such as Flickr young2014image and COCO Tsung-Yi:14. However, the lexical diversity of the generated caption remains a relatively unexplored research problem. Lexical diversity refers to how accurate the generated description is for a given image: an accurate caption should provide details regarding specific and relevant aspects of the image luo2018discriminability. Caption lexical diversity can be divided into three levels: word level (different words), syntactic level (word order), and semantic level (relevant concepts) wang2019describing. In this work, we approach word-level diversity by learning the semantic correlation between the caption and its visual context, as shown in Figure 1, where the visual information from the image is used to learn the semantic relation to the caption at both word and sentence level.

Modern image captioning systems focus heavily on visual grounding to capture real-world scenarios. Early work fang2015captions built a visual detector to guide and re-rank image captions with a global similarity. wang2018object investigates the informativeness of object information (e.g. object frequency) in end-to-end caption generation. cornia2019show proposes controlling caption generation through grounding on visual regions of the image. Inspired by these works, we propose an object-based re-ranker that selects the most closely related caption using both static and contextualized semantic similarity.

Figure 1: An overview of our visual semantic re-ranker. We employ the visual context of the image at word and sentence level to re-rank the caption most closely related to that visual context. An example from the caption Transformer Marcella:20 shows how the visual re-ranker (Visual Beam) uses the semantic relation to re-rank the most descriptive caption.
Model B-1 B-2 B-3 B-4 M R C BERTscore
Show and Tell Oriol:15
Tell 0.331 0.159 0.071 0.035 0.093 0.270 0.035 0.8871
Tell+VR_V1 0.330 0.158 0.069 0.035 0.095 0.273 0.036 0.8855
Tell+VR_V2 0.320 0.154 0.073 0.037 0.099 0.277 0.041 0.8850
Tell+VR_V1 (sts) 0.313 0.153 0.072 0.037 0.101 0.273 0.036 0.8839
Tell+VR_V2 (sts) 0.330 0.158 0.069 0.035 0.095 0.273 0.036 0.8869
VilBERT lu202012
Vil 0.739 0.577 0.440 0.336 0.271 0.543 1.027 0.9363
Vil+VR_V1 0.739 0.576 0.438 0.334 0.273 0.544 1.034 0.9365
Vil+VR_V2 0.740 0.578 0.439 0.334 0.273 0.545 1.034 0.9365
Vil+VR_V1 (sts) 0.738 0.576 0.440 0.335 0.273 0.544 1.036 0.9365
Vil+VR_V2 (sts) 0.740 0.579 0.442 0.338 0.272 0.545 1.040 0.9366
Transformer based caption generator Marcella:20
Trans 0.780 0.631 0.491 0.374 0.278 0.569 1.153 0.9399
Trans+VR_V1 0.780 0.629 0.487 0.371 0.278 0.567 1.149 0.9398
Trans+VR_V2 0.780 0.630 0.488 0.371 0.278 0.568 1.150 0.9399
Trans+VR_V1 (sts) 0.779 0.629 0.487 0.370 0.277 0.567 1.145 0.9395
Trans+VR_V2 (sts) 0.779 0.629 0.487 0.370 0.277 0.567 1.145 0.9395
Table 1: Performance of the compared baselines on the Karpathy test split (for the Transformer baselines) and Flickr (for the Show-and-Tell CNN-LSTM baseline) with and without visual semantic re-ranking. B-n: BLEU-n, M: METEOR, R: ROUGE, C: CIDEr. At inference, we use only the top-2 object visual contexts (Visual 1 or Visual 2), one at a time.

Our main contributions in this paper are: (1) we propose a post-processing method applicable to any caption generation system via visual semantic relatedness measures; (2) as an addendum to the main analysis of this work, we show that the visual re-ranker does not help when the beam search output is insufficiently diverse.

2 Beam search caption extraction

We employ the three most common architectures for caption generation to extract the top beam-search candidates. The first baseline is the standard shallow CNN-LSTM model Oriol:15. The second, VilBERT lu202012, is fine-tuned on 12 different vision-and-language datasets covering tasks such as caption-based image retrieval. Finally, the third baseline is a specialized Transformer-based caption generator Marcella:20.


3 Visual Beam Re-ranking for Image Captioning

3.1 Problem Formulation

Beam search is the dominant method for approximate decoding in structured prediction tasks such as machine translation, speech recognition, and image captioning. A larger beam size allows the model to perform a better exploration of the search space than greedy decoding. Our goal is to leverage the visual context information of the image to re-rank the candidate sequences obtained through beam search, moving the most visually relevant candidate up the list while moving incorrect candidates down.
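The idea can be sketched with the following toy re-ranker, which fuses each beam candidate's decoder log-probability with a visual-similarity score. The names `rerank_beam` and `visual_sims` are illustrative, and the log-linear fusion here is a simplification: the paper's actual combination is the Product of Experts described in Section 3.2.

```python
import math

def rerank_beam(candidates, visual_sims):
    # candidates: list of (caption, decoder log-probability) pairs
    # visual_sims: caption -> visual-similarity score in (0, 1]
    rescored = [
        (caption, log_prob + math.log(max(visual_sims[caption], 1e-12)))
        for caption, log_prob in candidates
    ]
    # Highest fused score first: the most visually relevant candidate rises.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# The decoder prefers "dog", but the detected object supports "cat".
beam = [("a dog on a couch", -1.2), ("a cat on a couch", -1.5)]
sims = {"a dog on a couch": 0.1, "a cat on a couch": 0.9}
best_caption = rerank_beam(beam, sims)[0][0]
```

Here the visually grounded candidate overtakes the decoder's original top hypothesis, which is exactly the re-ranking behaviour the method aims for.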

Figure 2: Visualization of the top-15 beam search candidates after visual re-ranking. The colors represent the degree of change in probability after visual re-ranking. We can also observe that a less diverse beam negatively impacts the score, as in the case of the Transformer and Show-and-Tell baselines.

3.2 Beam Search Visual Re-ranking

We introduce a word and sentence level semantic relation with the visual context in the image. Inspired by peinelt2020tbert, who propose a joint BERT Jacob:19 with topic modelling for semantic similarity, we propose a joint BERT with GloVe for visual semantic similarity.

Word level similarity. To learn the semantic relation between a caption and its visual context at word level, we first employ a bidirectional-LSTM-based CopyRNN keyphrase extractor meng2017deep to extract keyphrases from the sentence as context. The model is trained on two combined pre-processed datasets: (1) a wikidump corpus (i.e. keyword, short sentence) and (2) SemEval 2017 Task 10 (keyphrases from scientific publications) augenstein2017semeval. Secondly, GloVe is used to compute the cosine similarity between the visual context and the extracted caption context. For example, from a woman in a red dress and a black skirt walks down a sidewalk, the model extracts dress and walks, the highlight keywords of the caption.
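A minimal sketch of this word-level score, assuming keyphrases have already been extracted: the toy `EMB` table stands in for the pre-trained GloVe vectors, and `word_level_similarity` is an illustrative name, not the paper's implementation.

```python
import numpy as np

# Toy 3-d embedding table standing in for pre-trained GloVe vectors.
EMB = {
    "dress": np.array([0.9, 0.1, 0.0]),
    "walks": np.array([0.2, 0.8, 0.1]),
    "sidewalk": np.array([0.3, 0.7, 0.2]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_level_similarity(visual_object, keyphrases):
    # Average cosine similarity between the detected object's vector
    # and the vectors of the keyphrases extracted from the caption.
    sims = [cosine(EMB[visual_object], EMB[k]) for k in keyphrases if k in EMB]
    return sum(sims) / len(sims) if sims else 0.0

score = word_level_similarity("sidewalk", ["dress", "walks"])
```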

Sentence level similarity. We fine-tune the BERT base model to learn the visual context information. The model learns a dictionary-like word-to-sentence relation: the visual information (i.e. the object label) serves as context for the sentence (i.e. the caption), scored via cosine distance.

  • BERT. BERT achieves remarkable results on many sentence-level tasks, and especially on the textual semantic similarity task (STS-B) cer2017semeval. We therefore fine-tuned BERT on the training dataset of 460k captions (373k for training and 87k for validation), each a (visual, caption, label [semantically related or not related]) example, with a binary cross-entropy classification loss [0,1] where the target is the semantic similarity between the visual context and the candidate caption, using a batch size of 16 for 2 epochs.

  • RoBERTa. RoBERTa is an improved, more robustly trained version of BERT; we therefore rely on the pre-trained Sentence-RoBERTa (sts) model reimers2019sentence, as it yields a better cosine score.
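The sentence-level score reduces to a cosine between two sentence embeddings. The sketch below uses a deliberately toy character-frequency encoder in place of the fine-tuned BERT / SRoBERTa encoder, purely to make the shape of the computation concrete; `sentence_similarity` and `toy_encode` are hypothetical names.

```python
import numpy as np

def sentence_similarity(encode, visual_object, caption):
    # Cosine similarity between the embedding of the object label
    # and the embedding of the candidate caption.
    u, v = encode(visual_object), encode(caption)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def toy_encode(text):
    # Stand-in encoder: a 26-bin letter-frequency vector. A real system
    # would use a fine-tuned BERT or Sentence-RoBERTa model here.
    vec = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1
    return vec

sim = sentence_similarity(toy_encode, "pizza", "a pizza on a white plate")
```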

Fusion Similarity Expert. Inspired by the Product of Experts (PoE) hinton1999products, we combine the two experts at word and sentence level in a late fusion layer, as shown in Figure 1. The PoE is computed as follows:

p(c \mid \theta_1, \dots, \theta_n) = \frac{\prod_m p_m(c \mid \theta_m)}{\sum_i \prod_m p_m(c_i \mid \theta_m)}

where \theta_m are the parameters of each model m, p_m(c \mid \theta_m) is the probability of candidate caption c under model m, and i indexes all possible candidates in the data space. Since this approach is only interested in retrieving the most closely related caption, i.e. the candidate with the highest probability after re-ranking, the normalization step is not needed:

\hat{c} = \arg\max_c \prod_m p_m(c \mid \theta_m)

where p_m(c \mid \theta_m) are the probabilities assigned by each expert to the candidate caption c.
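The unnormalized Product-of-Experts selection can be sketched as follows; the expert dictionaries and caption strings are hypothetical stand-ins for the word-level (GloVe) and sentence-level (BERT) similarity scores.

```python
def product_of_experts(candidates, experts):
    # Multiply each expert's probability for a candidate and keep the
    # argmax; normalization is skipped because only the top-ranked
    # candidate is retrieved.
    def poe_score(caption):
        score = 1.0
        for expert in experts:
            score *= expert(caption)
        return score
    return max(candidates, key=poe_score)

# Hypothetical expert outputs for two candidate captions.
word_expert = {"a cat eats an apple": 0.8, "a cat eats a dish": 0.3}.get
sent_expert = {"a cat eats an apple": 0.7, "a cat eats a dish": 0.6}.get

best = product_of_experts(
    ["a cat eats an apple", "a cat eats a dish"],
    [word_expert, sent_expert],
)
```

Because the experts' scores multiply, a candidate must be plausible to both the word-level and the sentence-level expert to win, which is the intended "veto" behaviour of a Product of Experts.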

4 Experiments

4.1 Datasets and Evaluation Metric

We evaluate the proposed approach on two datasets of different sizes. The idea is to evaluate our approach on (1) a shallow CNN-LSTM model (i.e. a low-data scenario) and (2) a system trained on a huge amount of data (i.e. the Transformer).

Flickr 8K rashtchian2010collecting: the dataset contains 8K images, each with five human-annotated captions. We use this data to train the shallow model (6,270 train / 1,730 test).

COCO Tsung-Yi:14: it contains around 120K images, each annotated with five different human-written captions. For the Transformer baseline we use the most common split, as provided by karpathy2015deep: 5K images for testing, 5K for validation, and the rest for training.

Visual Context Dataset: although there are many public caption datasets, they contain no textual visual-context information such as the objects in the image. We therefore enrich the two datasets mentioned above with textual visual context. In particular, to automate visual-context generation and dispense with the need for human labeling, we use ResNet Kaiming:16 to extract the top-k (k=3) visual context labels for each image in the caption dataset.
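The top-k selection step amounts to keeping the k most probable class labels from the classifier's output distribution. A minimal sketch, assuming the ResNet softmax outputs have already been mapped to a hypothetical label-to-probability dictionary:

```python
def top_k_visual_context(class_probs, k=3):
    # Keep the k most probable object labels as textual visual context.
    # class_probs: label -> probability, e.g. from a ResNet classifier.
    ranked = sorted(class_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [label for label, _ in ranked[:k]]

probs = {"monitor": 0.62, "keyboard": 0.21, "desk": 0.09, "mouse": 0.05}
context = top_k_visual_context(probs)  # ['monitor', 'keyboard', 'desk']
```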

Model Voc TTR Uniq WPC
Show and tell Oriol:15
Tell 304 0.79 10.4 12.7
Tell+VR 310 0.82 9.42 13.5
VilBERT lu202012
Vil 894 0.87 8.05 10.5
Vil+VR 953 0.85 8.86 10.8
Transformer Marcella:20
Trans 935 0.86 7.44 9.62
Trans+VR 936 0.86 7.48 8.68
Table 2: Lexical diversity of captions before and after re-ranking. The Uniq and WPC columns indicate the average number of unique words and of total words per caption, respectively. The Show-and-Tell results refer to the Flickr 1,730-image test set; the VilBERT and Transformer results refer to the COCO Karpathy 5K test set.

5 Results

Evaluation Metrics. We use the official COCO offline evaluation suite, producing several widely used caption-quality metrics: BLEU papineni2002bleu, METEOR banerjee2005meteor, ROUGE lin2004rouge, CIDEr vedantam2015cider, and BERTscore (B-S) bert-score.

5.1 Results and Analysis

We use visual semantic information to re-rank the candidate captions produced by out-of-the-box state-of-the-art caption generators. We extract the top-20 beam-search candidate captions from three different architectures: (1) the standard CNN+LSTM model Oriol:15, (2) the pre-trained vision-and-language model VilBERT lu202012, fine-tuned on 12 different vision-and-language datasets such as caption-based image retrieval, and (3) the specialized caption Transformer Marcella:20.

Experiments applying the different re-rankers to each base system are shown in Table 1. The tested re-rankers are: (1) VR_V1 and VR_V2, which use BERT and GloVe similarity between the candidate caption and the visual context (the top-2 objects V1 and V2, one at a time during inference) to obtain the re-ranked score; and (2) VR_V1 (sts) and VR_V2 (sts), which carry out the same procedure using similarity produced by SRoBERTa.

Our re-ranker produced mixed results, as the model struggles when the beam search is less diverse: in that case it cannot reliably select the caption most closely related to its visual context, as shown in Figure 2, which visualizes the final visual beam re-ranking.

Evaluation of Lexical Diversity. As shown in Table 2, we also evaluate the model from a lexical diversity perspective. We can conclude that re-ranking yields (1) a larger vocabulary and (2) an improvement in unique words per caption, even with a lower Type-Token Ratio (TTR) brown2005encyclopedia. TTR is the number of unique words (types) divided by the total number of tokens in a text fragment.
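The diversity statistics in Table 2 can be computed directly from the caption set. A small sketch mirroring the Voc, TTR, and WPC columns (`lexical_stats` is an illustrative helper, not the paper's evaluation code):

```python
def lexical_stats(captions):
    # Voc: vocabulary size (distinct tokens across all captions)
    # TTR: type-token ratio = distinct tokens / total tokens
    # WPC: average number of words per caption
    tokens = [word for caption in captions for word in caption.lower().split()]
    types = set(tokens)
    return {
        "Voc": len(types),
        "TTR": len(types) / len(tokens),
        "WPC": len(tokens) / len(captions),
    }

stats = lexical_stats(["a dog runs", "a dog sleeps on a mat"])
```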

Figure 3: (Top) 1K random samples from the Flickr test set with the Show-and-Tell model. Each expert contributes a different probability confidence, so the model learns the semantic relation at both word and sentence level. (Bottom) 5K random samples from COCO captions with the Transformer-based caption model. The GloVe score dominates the distribution and becomes the main expert.
Model B-4 M R C B-S
Transformer based caption generator Marcella:20
Trans 0.374 0.278 0.569 1.153 0.9399
+VR 0.370 0.277 0.567 1.145 0.9395
+VR 0.371 0.278 0.567 1.149 0.9398
+VR 0.369 0.278 0.567 1.144 0.9395
+VR_V1 0.371 0.278 0.568 1.148 0.9398
+VR_V2 0.371 0.278 0.568 1.149 0.9398
Table 3: Ablation study comparing each model against the GloVe-only visual re-ranker on the Transformer baseline. Figure 3 (bottom) shows that BERT does not contribute to the final score as much as GloVe, for two reasons: (1) short captions and (2) a less diverse beam.

Ablation Study. We performed an ablation study to investigate the effectiveness of each model. In the proposed architecture, each expert learns a different representation, at word and sentence level respectively. In this experiment, we trained each model separately, as shown in Table 3. GloVe performed better as a stand-alone model than in combination (thus, combining the experts degrades accuracy). To investigate this further, we visualized each expert before the fusion layer, as shown in Figure 3. Limitation. In contrast to the CNN-LSTM case (Figure 3, top), where each expert contributes to the final decision, we observed that a shorter caption (with less context) can influence the BERT similarity score negatively. The word-level expert, i.e. GloVe, therefore dominates as the main expert.


6 Conclusion

In this work, we have introduced an approach that overcomes a limitation of beam search and avoids re-training for better accuracy. We proposed a combined word- and sentence-level visual re-ranker for beam search. However, we discovered that word and sentence similarity disagree with each other when the beam search is less diverse.


Appendix A Hyperparameters and Setting

All training and beam search are implemented in fairseq ott2019fairseq and trained with PyTorch 1.7.1.


Visual Re-ranker. The only model we tuned is BERT. We fine-tuned it on the training dataset using the original BERT implementation, TensorFlow 1.15 with CUDA 8 abadi2016tensorflow. The textual dataset contains around 460k captions, 373k for training and 87k for validation, each a (visual, caption, label [semantically related or not related]) example. We use a batch size of 16 for three epochs, and we kept the rest of the hyperparameter settings as in the original implementation. Note that we keep GloVe as a static model, as it is trained on 840 billion tokens.

Show-and-Tell Oriol:15. We train this shallow model from scratch on the Flickr 8K dataset (6,270 train / 1,730 test) (hardware: GTX 1070 Ti GPU, 32 GB RAM, 8-core i7 CPU).

Caption Transformer Marcella:20. We train the Transformer from scratch with Bottom-Up features anderson2018bottom. However, unlike the original implementation by the authors, we use a full 12-layer Transformer. We follow the same hyperparameters as the original implementation. The model is trained with PyTorch 1.7.1 on a single K80 GPU.

VilBERT lu202012. Since VilBERT is trained on 12 datasets, we use it as an out-of-the-box model.

Appendix B Examples of Re-ranked Captions

Best Beam. In Figure 4 we show examples of the proposed re-ranker alongside the best baseline beam-search candidate (BL). The model struggles to unify information from different modalities, and therefore the word-level expert has a stronger influence on the final score. In addition, the visual classifier faces difficulties with complex background images. This could be addressed in future work by employing multiple classifiers (each with multiple labels) and then using a voting technique to filter out the most probable object in the image.

Greedy. We also experiment with the greedy output (BL), as shown in Figure 5; our model suffers from the same limitation.

Visual: monitor
BL: a computer monitor on a desk with a keyboard
VR: a desk with a computer monitor and a keyboard
Human: a computer that is on a wooden desk

Visual: ant
BL: a group of birds walking in the water
VR: a group of birds walking in the water
Human: a group of small birds walking on top of a beach

Visual: necklace
BL: a woman wearing a white dress holding a pair of scissors
VR: a woman with a pair of scissors on
Human: a silver colored necklace with a pair of mini scissors on it

Visual: food
BL: a plate of food on a table
VR: a plate of food and a drink on a table
Human: a white plate with some food on it

Visual: apple
BL: a cat is eating an apple
VR: a close up of a cat eating an apple
Human: a gray cat eating a treat from a humans hand

Visual: chainlink fence
Vil: a black and white photo of train tracks
VR: a black and white photo of a train on the tracks
Human: a long train sitting on a railroad track

Visual: cardigan
BL: a cat sitting on the floor next to a closet
VR: a cat and a dog in a room
Human: a cat and a dog on the floor in a room

Visual: bassinet
BL: a baby sitting in front of a cake
VR: a baby sitting in front of a birthday cake
Human: a woman standing over a sheet cake sitting on top of table
Figure 4: Examples of the re-ranked captions by our visual re-ranker (VR) and the original caption (Beam Search) by the baseline (BL).
Visual: cowboy hat
BL: a cat is eating a dish on the floor
VR: a black and white cat sitting in a bowl
Human: a cat on a wooden surface is looking at a wooden

Visual: pizza
BL: a pizza with cheese on a plate
VR: a pizza sitting on top of a white plate
Human: a small pizza being served on a white plate

Visual: dishwasher
BL: a man standing in a kitchen with a laptop
VR: a man standing in a kitchen preparing food
Human: a man with some drink in hand stands in front of counter

Visual: lab coat
BL: a man standing in a kitchen holding a glass of wine
VR: a man standing in a kitchen holding a wine glass
Human: a man standing in a kitchen holding a glass full of alcohol

Visual: indian elephant
BL: a group of elephants under a shelter in a field
VR: a group of elephants under a hut
Human: a young man riding a skateboard down a yellow hand rail

Visual: chain
Vil: a group of women sitting on a bench eating
VR: a group of women eating hot dogs
Human: three people are pictured while they are eating

Visual: trolleybus
BL: a green bus parked in front of a building
VR: a green double decker bus parked in front of a building
Human: a passenger bus that is parked in front of a library

Visual: racket
BL: a woman hitting a tennis ball on a tennis court
VR: a woman holding a tennis ball on a tennis court
Human: a large crowd of people are watching a lady play tennis
Figure 5: Examples of the re-ranked captions by our visual re-ranker (VR) and the original caption (greedy) by the baseline (BL).