Controlling Length in Image Captioning

05/29/2020 ∙ by Ruotian Luo, et al. ∙ Toyota Technological Institute at Chicago

We develop and evaluate captioning models that allow control of caption length. Our models can leverage this control to generate captions of different style and descriptiveness.

1 Introduction

Most existing captioning models learn an autoregressive model, either an LSTM or a transformer, in which explicit control of the generation process is difficult. In particular, the length of the caption is determined only after the End Of Sentence (eos) token is generated, so it is hard to know and control the length beforehand. However, length can be an important property of a caption. A model that allows control of output length would provide more options for the end users of the captioning model. By controlling the length, we can influence the style and descriptiveness of the caption: short, simple captions vs. longer, more complex and detailed descriptions for the same image.

Previous work includes captioning models that allow control of other aspects. [5] controls the caption by inputting a different set of image regions. [7] can generate a caption controlled by assigning POS tags. Length control has been studied in abstractive summarization [11, 8, 17], but to our knowledge not in the context of image captioning.

To control the length of the generated caption, we build the model by borrowing existing ideas from summarization work and injecting length information into the model. To generate captions without an explicit length specification, we add a length prediction module that can predict the optimal length for the input image at hand. We show that the length models can successfully generate captions ranging from 7 up to 28 words long. We also show that the length models perform better than the non-controlled model (with some special decoding methods) when asked to generate long captions.

2 Models

We consider repurposing existing methods in summarization for captioning. In general, the length is treated as an intermediate variable: $P(c \mid I) = P(c \mid I, l)\, P(l \mid I)$, where $c$, $I$ and $l$ are the caption, image and length, respectively. We introduce how we build $P(c \mid I, l)$ and $P(l \mid I)$ as follows. Note that the following methods can be used in conjunction with any standard captioning model.

2.1 LenEmb[11]

We take LenEmb from [11] and make some small changes according to [17]. Given the desired length $l$ and the current time step $t$, we embed the remaining length $l - t$ into a vector that is the same size as the word embedding. Then the word embedding of the previous word $w_{t-1}$, added to the length embedding (rather than concatenated, as in [11]), is fed as the input to the rest of the LSTM model.
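The input construction described above can be sketched in plain Python. This is only an illustration under stated assumptions: the table sizes, the random values standing in for learned embeddings, the clamping of the remaining length, and the name `lenemb_input` are all hypothetical and not taken from the authors' code.

```python
import random

random.seed(0)

EMB_DIM = 4        # hypothetical embedding size
VOCAB_SIZE = 10    # hypothetical vocabulary size
MAX_LEN = 30       # hypothetical largest remaining length with its own embedding


def make_table(rows, dim):
    """Random lookup table standing in for a learned embedding matrix."""
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]


word_emb = make_table(VOCAB_SIZE, EMB_DIM)
len_emb = make_table(MAX_LEN + 1, EMB_DIM)  # one row per remaining length 0..MAX_LEN


def lenemb_input(prev_word_id, desired_len, t):
    """Decoder input at step t: word embedding plus remaining-length embedding."""
    remaining = desired_len - t
    remaining = max(0, min(remaining, MAX_LEN))  # clamping is an assumption
    w = word_emb[prev_word_id]
    r = len_emb[remaining]
    return [a + b for a, b in zip(w, r)]  # element-wise sum, not concatenation


x = lenemb_input(prev_word_id=3, desired_len=10, t=2)
```

Because the two vectors are summed, the decoder input keeps the word-embedding size, so the rest of the captioning model needs no change in its input dimension.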

Learning to predict length We add a length prediction module that predicts the length from the image features (in this case the averaged region features) when no desired length is provided. We treat it as a classification task and train it with the reference caption length.
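A minimal sketch of such a classifier follows, assuming a single linear layer over mean-pooled region features. The dimensions, the random weights and all function names are hypothetical; the source only states that length prediction is a classification task trained on the reference caption length.

```python
import math
import random

random.seed(0)

FEAT_DIM = 3       # hypothetical region-feature size
NUM_LENGTHS = 5    # hypothetical number of length classes (lengths 0..4)

# Random weights standing in for a trained linear classifier.
W = [[random.uniform(-1, 1) for _ in range(FEAT_DIM)] for _ in range(NUM_LENGTHS)]
b = [0.0] * NUM_LENGTHS


def average_regions(regions):
    """Mean-pool a list of region feature vectors into one image vector."""
    n = len(regions)
    return [sum(r[d] for r in regions) / n for d in range(len(regions[0]))]


def length_probs(regions):
    """Softmax distribution over length classes for one image."""
    feat = average_regions(regions)
    logits = [sum(w * f for w, f in zip(row, feat)) + bias for row, bias in zip(W, b)]
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def length_loss(regions, ref_len):
    """Cross-entropy against the reference caption length."""
    return -math.log(length_probs(regions)[ref_len])


regions = [[0.2, 0.1, 0.5], [0.4, 0.3, 0.1]]
probs = length_probs(regions)
predicted_len = max(range(NUM_LENGTHS), key=lambda k: probs[k])
```

At test time, `predicted_len` would take the place of the desired length when the user supplies none.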

2.2 Marker[8]

We also implement the Marker model from [8]. The desired length is fed as a special token at the beginning of generation, as the “first word”. At training time, the model learns to predict the length at the first step in the same way as other words (no extra length predictor is needed). At test time, the length token is sampled in the same way as other words if no desired length is specified.
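The data preparation this implies can be sketched as follows. The token format `<len_N>`, the helper names, and the choice to count length in words without the eos token are assumptions made for illustration only.

```python
def length_token(n):
    """Hypothetical naming scheme for the special length tokens."""
    return f"<len_{n}>"


def make_training_sequence(caption_words):
    """Prepend the length token so the model learns to predict it as the first word."""
    return [length_token(len(caption_words))] + caption_words + ["<eos>"]


def make_decoding_prefix(desired_len=None):
    """Forced prefix at test time; empty when the model should pick the length itself."""
    return [] if desired_len is None else [length_token(desired_len)]


seq = make_training_sequence("a motorcycle parked on a dirt road".split())
```

With an empty prefix, the first decoding step chooses among the length tokens like any other word, which is how the model generates without a user-specified length.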

3 Experiments

We use COCO [13] to evaluate our models. For the train/val/test split we follow [10]. The base captioning model is Att2in [16]. The image features are bottom-up features [4]. For evaluation, we use BLEU [15], METEOR [6], ROUGE [12], CIDEr [18], SPICE [3] and bad ending rate [9, 1]. We train the models with cross-entropy loss. Unless specified otherwise, decoding is beam search with beam size 5, and evaluation is on the Karpathy test set.

3.1 Generation with predicted lengths

For a fair comparison on the general image captioning task, the length models first predict the length and then generate the caption conditioned on the predicted length. Results in Table 1 show that the length models are comparable to the base model.

Model    B4    R     M     C      S     BER  LenMSE
Att2in   35.9  56.1  27.1  110.6  20.0  0.1  N/A
LenEmb   34.9  56.2  27.0  110.0  20.0  0.0  0
Marker   35.2  56.2  26.9  109.8  19.9  0.0  0
Table 1: Performance on the COCO Karpathy test set. B4=BLEU4, R=ROUGE, M=METEOR, C=CIDEr, S=SPICE, BER=bad ending rate, LenMSE=length mean square error. All numbers are percentages (%) except LenMSE.

Length distribution (Fig. 1) While the scores are close, the length distributions are quite different. Length models tend to generate longer captions than the normal autoregressive model. However, neither is close to the real caption length distribution (“test” in the figure).

Figure 1: Length distribution of captions generated by different models. For length models, the length is obtained within the beam search process as a special token.

3.2 Generation with controlled lengths

For the baseline model, we use the fixLen method from [11], where probability manipulation is used to avoid generating the eos token until the desired length is reached.
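One way to picture this kind of probability manipulation is the sketch below, which edits the log-probabilities of a single decoding step. The source only says that eos is avoided until the desired length; forcing eos once that length is reached, the eos index, and the function name are assumptions, and the scores are not renormalized.

```python
import math

EOS = 0  # hypothetical index of the eos token


def fixlen_logprobs(logprobs, num_generated, desired_len):
    """Adjust one step's log-probabilities so eos appears exactly at desired_len."""
    out = list(logprobs)
    if num_generated < desired_len:
        out[EOS] = -math.inf            # forbid ending early
    else:
        out = [-math.inf] * len(out)    # force ending at the target (assumption)
        out[EOS] = 0.0
    return out


step = [-0.1, -2.5, -3.0]               # eos is the most likely token here
masked = fixlen_logprobs(step, num_generated=3, desired_len=7)
best = max(range(len(masked)), key=lambda i: masked[i])
```

In the example the unconstrained model would stop after three words, and masking forces it to continue with its next-best token, which suggests why forced long captions can become repetitive or end badly.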

The original CIDEr-D promotes short and generic captions because it computes the average similarity between the generated caption and the references. We report a modified CIDEr (mCIDEr), which 1) removes the length penalty term of CIDEr-D and 2) combines the n-gram counts from all the reference captions to compute similarity [2].
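The second modification, pooling n-gram counts across references, can be illustrated with a simplified sketch. This is not the mCIDEr metric: it omits the TF-IDF weighting and the averaging over n-gram orders that CIDEr uses, and it assumes that "combining" means summing counts. All names are hypothetical.

```python
import math
from collections import Counter


def ngrams(words, n):
    """Count the n-grams of one tokenized caption."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def pooled_reference_counts(references, n):
    """Sum n-gram counts over all references (summing is an assumption)."""
    pooled = Counter()
    for ref in references:
        pooled.update(ngrams(ref, n))
    return pooled


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


refs = ["a motorcycle parked on a dirt road".split(),
        "a motorcycle next to a fence".split()]
cand = "a motorcycle parked near a fence".split()
pooled = pooled_reference_counts(refs, 1)
score = cosine(ngrams(cand, 1), pooled)
```

Comparing against the pooled counts rewards a candidate for covering content from any reference, rather than for resembling each reference on average.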

Fluency The high bad ending rate for Att2in indicates that it cannot generate fluent sentences under fixLen decoding; increasing the beam size lowers the bad ending rate. Among the length models, Marker performs well when the length is less than 20 but collapses beyond that, while LenEmb performs consistently well.

Accuracy The length models perform better than the base model when the length is less than 10. The base model performs better for lengths between 10 and 16, which are the most common lengths in the dataset. For larger lengths, LenEmb performs best on both mCIDEr and SPICE, indicating that it covers more of the information in the reference captions.

Controllability We use the mean square error between the desired length and the actual length (LenMSE) to evaluate controllability. When using the predicted length, the length models achieve the predicted length perfectly (LenMSE of 0 in Table 1). When the desired length is fed, Fig. 2 shows that LenEmb obeys the length perfectly, while Marker fails for long captions, probably due to poor long-term dependency modeling.
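LenMSE as described can be computed in a few lines. The sketch assumes that length is counted in whitespace-separated words; the function name and the example targets are illustrative.

```python
def len_mse(desired_lengths, captions):
    """Mean squared error between desired and actual caption lengths, in words."""
    errors = [(d - len(c.split())) ** 2 for d, c in zip(desired_lengths, captions)]
    return sum(errors) / len(errors)


captions = ["a motorcycle parked on a dirt road",
            "a motorcycle parked on a dirt road near a fence"]
exact = len_mse([7, 10], captions)   # both captions hit their targets
off = len_mse([7, 12], captions)     # second caption is two words short
```

A value of 0 means every caption has exactly the requested number of words.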

Qualitative results (Fig. 3) show that the LenEmb model, when generating longer captions, changes the caption structure and covers more detail, while the base model tends to reuse the same prefix across lengths and to repeat itself. More results can be browsed online [14].

Figure 2: The performance of models with different desired length. Att2in+BSx is Att2in+fixLen with beam size x.
7 a motorcycle parked on a dirt road
10 a motorcycle is parked on the side of a road
16 a motorcycle parked on the side of a dirt road with a fence in the background
22 a motorcycle parked on the side of a dirt road in front of a fence with a group of sheep behind it
28 a motorcycle is parked in a dirt field with a lot of sheep on the side of the road in front of a fence on a sunny day
7 a motorcycle parked on a dirt road
10 a motorcycle parked on a dirt road near a fence
16 a motorcycle parked on a dirt road in front of a group of people behind it
22 a motorcycle parked on a dirt road in front of a group of people on a dirt road next to a fence
28 a motorcycle parked on a dirt road in front of a group of people on a dirt road in front of a group of people in the background
7 an airplane is parked at an airport
10 an airplane is parked on the tarmac at an airport
16 an airplane is parked on a runway with a man standing on the side of it
22 an airplane is parked on a runway with a man standing on the side of it and a person in the background
28 an airplane is parked on the tarmac at an airport with a man standing on the side of the stairs and a man standing next to the plane
7 a plane is sitting on the tarmac
10 a plane is sitting on the tarmac at an airport
16 a plane that is sitting on the tarmac at an airport with people in the background
22 a plane is sitting on the tarmac at an airport with people in the background and a man standing in the background
28 a plane is sitting on the tarmac at an airport with people in the background and a man standing on the side of the road in the background
Figure 3: Generated captions of different lengths. Top: LenEmb; Bottom: Att2in+BS10

3.3 Failure on CIDEr optimization

We apply SCST [16] training to the length models. However, SCST does not work well: while the CIDEr scores improve, the generated captions tend to be less fluent, with bad endings (ending with 'with a') or repeated words (like 'a a').

4 Conclusions

We present two captioning models that can control caption length and show their effectiveness in generating good captions of different lengths. The code will be released at https://github.com/ruotianluo/self-critical.pytorch/tree/length_goal.

References