Factual image captioning is one of the fundamental tasks in deep learning. The issue with factual captions is that the language generated is often stilted and not necessarily representative of human communication. While classic image captioning approaches show deep understanding of image composition and language construction, they often lack elements that make communication distinctly human. To address this issue, some researchers have tried to add personality to image captioning in order to generate stylish captions. In general, stylish captioning systems are divided into two categories based on how they are trained: single-style [gan2017stylenet, zhang2018style] and multi-style [shuster2019engaging, guo2019mscap, zhao2020memcap]. Single-style training involves training one model for each personality, whereas multi-style techniques learn to generate captions in many different styles using one model.
Shuster et al. built a multi-style module by converting each personality to a multi-dimensional vector. Their generative model struggled to generate captions that accurately captured the given image context, likely because multi-style captioners require greater knowledge about the input image than single-style captioners. To address the inherent limitations of past multi-style captioning approaches, we propose the use of multi-modality image features to improve the quality of multi-style image captioning. We believe that multi-modality features, specifically image features combined with features derived from text describing said image, will help the model better ground image features into text.
To effectively generate stylish captions, a model needs to incorporate elements of the local context of image regions and the global context of the image itself. To capture local context, our model will make use of region-based caption features generated by the DenseCap network [densecap].
To complement the dense caption features, we will also use ResNeXt features describing the global input image. To combine these features, we introduce a Multi-UPDOWN model in which each UPDOWN structure selects the best features from its own modality; these selections are then fused to generate the caption.
To evaluate the performance of our multi-style captioning model, we examine its performance on different stylish image captioning datasets. We evaluate its performance using various NLP metrics and compare against several state-of-the-art baselines. We perform an ablation study in which we examine how each part of our model contributes to the overall expressiveness and diversity of our generated captions. We also perform a qualitative evaluation in which we examine how well the captions generated by our model capture image and style context.
Captions in FlickrStyle10K are created to have either a humorous or romantic linguistic style [gan2017stylenet], while captions in PERSONALITY-CAPTIONS are created to be engaging and have a conversational style [shuster2019engaging]. With FlickrStyle10K, researchers have built single-style captioners [gan2017stylenet, chen2018factual] that make use of both factual captions and stylized captions for training. Later, researchers explored training multi-style networks [guo2019mscap, zhao2020memcap] that can generate multiple types of stylish output using a single model.
Shuster et al. released the PERSONALITY-CAPTIONS dataset, containing 215 personalities, in 2019 for building engaging caption generation models. In their work, Shuster et al. built an image caption retrieval model and also explored multi-style generative caption models along with various image encoding strategies [he2016deep, xie2017aggregated], using several state-of-the-art image captioning models [xu2015show, anderson2018bottom] trained with self-critical sequence training [rennie2017self] using the CIDEr score [vedantam2015cider] as the reward. We extend the best performing supervised model presented in Shuster et al.'s work to build a multi-style model which incorporates multi-modality image features.
The primary contribution of this paper is an architecture that utilizes multi-modality fusion for performing multi-style image captioning. This architecture specifically utilizes the soft fusion of two parallel encoder-decoder blocks, with each block containing an UPDOWN-like attention module. Our overall architecture for one step of generation can be seen in Figure 1. As seen in the figure, our multi-UPDOWN fusion blocks synthesize the information from multi-modality image features, multi-style components (previous word, personality), and previous hidden states to predict the current word and hidden states at each time step.
In this architecture, we utilize two features from pre-trained networks: ResNeXt visual features [xie2017aggregated] and text features describing the image itself [densecap]. These features allow the learner to better ground the image features into natural language.
As shown in Figure 1, the desired style of the output caption is given as an input to our system using a one-hot vector. We then use an embedding matrix and a linear layer to encode each style into a fixed-size vector, which we call the style vector. For each word in our target stylized caption, we use another embedding matrix to embed the word. We use the same embedding matrix to embed the dense captions, which enables us to better connect image features to natural language. To better enable our network to generate words according to the given style, we concatenate each embedded word vector with the style vector to create a stylized word vector.
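As a rough sketch of this encoding step (not the paper's implementation), the style lookup and concatenation can be written as follows. The dimensions, initialization, and function names here are illustrative assumptions; only the vocabulary size (10,453) and personality count (215) come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions: 215 personalities (from the paper), 512-d embeddings,
# 128-d style vector. The real model learns these matrices during training.
NUM_STYLES, EMBED_DIM, STYLE_DIM, VOCAB = 215, 512, 128, 10453

style_embedding = rng.standard_normal((NUM_STYLES, EMBED_DIM)) * 0.01  # style embedding matrix
W_style = rng.standard_normal((EMBED_DIM, STYLE_DIM)) * 0.01           # linear layer
word_embedding = rng.standard_normal((VOCAB, EMBED_DIM)) * 0.01        # word embedding matrix

def encode_style(style_id: int) -> np.ndarray:
    """One-hot style -> embedding lookup -> linear projection to a fixed-size style vector."""
    one_hot = np.zeros(NUM_STYLES)
    one_hot[style_id] = 1.0
    return one_hot @ style_embedding @ W_style  # shape: (STYLE_DIM,)

def stylized_word_vector(word_id: int, style_vec: np.ndarray) -> np.ndarray:
    """Concatenate an embedded word with the style vector to form the stylized word vector."""
    return np.concatenate([word_embedding[word_id], style_vec])

s = encode_style(42)
w = stylized_word_vector(7, s)
assert w.shape == (EMBED_DIM + STYLE_DIM,)
```

The same `word_embedding` matrix would also embed the dense caption words, so caption text and target text share one representation space.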
Multi-modality Image Features
Our architecture relies on two sets of bottom-up features extracted using pre-trained networks: ResNeXt features and dense caption features. Specifically, we extract mean-pooled image features and spatial features from the ResNeXt network [shuster2019engaging], and 5 dense captions from each image with a dense caption network [densecap]. Each word in the dense captions is embedded using the same word embedding matrix described above. By collecting both visual and text features, we provide our architecture with a more complete understanding of the full context of the image.
Multi-UPDOWN fusion Model
Our fusion model is composed of two individual encoders, the ResNeXt feature encoder and the dense caption encoder. Our model also uses a fused Top-down decoder, which is used to decode captions from the encoded image features.
ResNeXt Feature Encoder and Dense Caption Encoder
We encode the ResNeXt mean-pooled image features and spatial features using a linear layer, a dropout layer, and an activation layer, obtaining a mean-pooled feature vector and a set of spatial feature vectors. These are used as input features for the decoding process shown in the right branch of Figure 2. Then, we encode each embedded dense caption using the Dense Caption Encoder, an LSTM network [hochreiter1997long] that consumes one embedded word vector at each time step t.
We concatenate all 5 final hidden states into one vector, which we call the caption vector. To apply attention to specific words during the decoding procedure, we also keep all word states produced by the LSTM across the 5 captions.
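A minimal sketch of this encoding step, with a simple tanh RNN standing in for the LSTM and toy dimensions chosen for illustration: each of the 5 dense captions is run through the recurrent encoder, the per-word states are kept for attention, and the 5 final states are concatenated into the caption vector.

```python
import numpy as np

rng = np.random.default_rng(1)
EMBED_DIM, HIDDEN_DIM = 64, 32  # toy sizes, not the paper's

# A plain tanh RNN stands in for the LSTM encoder here for brevity.
W_xh = rng.standard_normal((EMBED_DIM, HIDDEN_DIM)) * 0.1
W_hh = rng.standard_normal((HIDDEN_DIM, HIDDEN_DIM)) * 0.1

def encode_caption(word_vectors: np.ndarray):
    """Run the recurrent encoder over one dense caption;
    return (all per-word states, final state)."""
    h = np.zeros(HIDDEN_DIM)
    states = []
    for x in word_vectors:                     # one embedded word per time step
        h = np.tanh(x @ W_xh + h @ W_hh)
        states.append(h)
    return np.stack(states), h

# Five dense captions of varying length, each a sequence of embedded words.
captions = [rng.standard_normal((n, EMBED_DIM)) for n in (4, 6, 5, 3, 7)]
all_word_states, finals = [], []
for cap in captions:
    states, final = encode_caption(cap)
    all_word_states.append(states)
    finals.append(final)

caption_vector = np.concatenate(finals)        # concatenated summary of all 5 captions
word_states = np.concatenate(all_word_states)  # kept for attention during decoding
assert caption_vector.shape == (5 * HIDDEN_DIM,)
assert word_states.shape == (4 + 6 + 5 + 3 + 7, HIDDEN_DIM)
```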
[Table 1: Comparison of caption models by training method and input features (Text Features, ResNeXt), reporting BLEU1, BLEU4, ROUGE-L, CIDEr, and SPICE.]
[Table 2: Ablation over personality, text, and ResNeXt features, reporting BLEU1, BLEU4, ROUGE-L, CIDEr, SPICE, and the number of unique words.]
Top-down Decoder Fusion
As shown in Figure 2, we apply the Top-down decoder to the encoded visual and text features. The left branch is the Top-down decoder for the text features generated by the dense caption network, and the right branch is the Top-down decoder for the ResNeXt features. At each time step, the Top-down decoder for text features generates a caption attention vector by taking in the previous attention hidden states along with the concatenation of the previous language model hidden states, the caption vector, and the previous stylized word vector.
To calculate the attended caption feature vector, we use a process inspired by [anderson2018bottom]: each kept word state vector is scored against the caption attention hidden states by a single-layer additive attention network with learned parameters, the scores are normalized with a softmax, and the attended caption feature vector is formed as the attention-weighted sum of the word state vectors. This attended vector is used as the input to the language LSTM layer, whose initial state is the previous hidden state from the language model. The language LSTM then outputs the current language model hidden states for our text features.
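The additive attention step can be sketched as follows. This is an illustrative NumPy version of the standard top-down attention of Anderson et al., with assumed toy dimensions; the learned matrices here stand in for the paper's learned parameters.

```python
import numpy as np

rng = np.random.default_rng(2)
H, D, A = 32, 32, 16  # attention-LSTM hidden size, word-state size, attention size (toy)

# Learned parameters of the single-layer additive attention network (random here).
W_c = rng.standard_normal((D, A)) * 0.1
W_h = rng.standard_normal((H, A)) * 0.1
w_a = rng.standard_normal(A) * 0.1

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(word_states: np.ndarray, h_attn: np.ndarray) -> np.ndarray:
    """Score each dense-caption word state against the attention hidden state,
    normalize with softmax, and return the attention-weighted sum."""
    scores = np.tanh(word_states @ W_c + h_attn @ W_h) @ w_a  # one score per word state
    alpha = softmax(scores)                                   # attention weights
    return alpha @ word_states                                # attended caption feature

word_states = rng.standard_normal((25, D))  # e.g., 25 word states across 5 dense captions
h_attn = rng.standard_normal(H)
c_hat = attend(word_states, h_attn)
assert c_hat.shape == (D,)
```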
We calculate the ResNeXt attention vector and the current language model hidden states from the ResNeXt features using a similar process with a separate network (shown in the right branch of Figure 2). We then generate the final language hidden states of the current step by fusing the language hidden states of the two branches, and the final attention hidden states by fusing the two branches' attention hidden states. The final language output is computed from the fused language hidden states, and a linear layer projects it into the vocabulary space to predict the current word.
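The fusion and projection steps can be sketched as below. The exact fusion operator is an assumption on our part: this sketch uses a learned sigmoid gate that softly mixes the two branches' hidden states, which is one common realization of soft fusion; dimensions are toy values.

```python
import numpy as np

rng = np.random.default_rng(3)
H, V = 32, 100  # hidden size and toy vocabulary size

# Assumed fusion parameters: a learned gate over the concatenated branch states.
W_g = rng.standard_normal((2 * H, H)) * 0.1
W_out = rng.standard_normal((H, V)) * 0.1  # linear projection into the vocabulary space

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def fuse(h_text: np.ndarray, h_visual: np.ndarray) -> np.ndarray:
    """Soft fusion: a gate in [0, 1] decides, per dimension,
    how much of each branch's hidden state to keep."""
    g = sigmoid(np.concatenate([h_text, h_visual]) @ W_g)
    return g * h_text + (1.0 - g) * h_visual

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

h_lang = fuse(rng.standard_normal(H), rng.standard_normal(H))  # fused language states
word_probs = softmax(h_lang @ W_out)  # distribution over the vocabulary for this step
assert abs(word_probs.sum() - 1.0) < 1e-9
```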
To demonstrate the effectiveness of our model on stylish image captioning, we use the PERSONALITY-CAPTIONS dataset, which contains 215 distinct personalities. To show that our model extends to linguistically stylized captions, we also train it on the FlickrStyle10K dataset [gan2017stylenet], which contains humorous and romantic styles. We compare our results with the state-of-the-art work on the same datasets based on their automatic evaluation metrics. Ablation studies are also done to justify the contributions of each component of our method. We also perform a qualitative examination of the outputs of our model.
The ground truth captions in PERSONALITY-CAPTIONS [shuster2019engaging] are created to be engaging and have a human-like style. Each data entry in this dataset is represented as a triple containing an image, a personality trait, and a caption. In total, 241,858 captions are included in this dataset. In this work, we do not use the full PERSONALITY-CAPTIONS dataset because some examples are no longer accessible. In total, our reduced dataset contains 186,698 examples in the training set, 4,993 examples in the validation set, and 9,981 examples in the test set. The total vocabulary size of PERSONALITY-CAPTIONS after replacing infrequent tokens with 'UNK' is 10,453.
The FlickrStyle10K dataset captions focus on linguistic style. Since only 7,000 images are publicly available, we evaluate using a process similar to the one outlined in [guo2019mscap, zhao2020memcap]. First, we randomly select 6,000 images as training data and use the remaining 1,000 images as test data. We further hold out 10% of the training data as validation data. The total vocabulary size of FlickrStyle10K is 8,889.
Training and Inference
In training, we use cross-entropy as the loss function and the Adam optimizer with an initial learning rate of 5e-4. The learning rate decays every 5 epochs. In total, we train for 30 epochs when using the PERSONALITY-CAPTIONS dataset [shuster2019engaging] with a batch size of 128 and evaluate the model every 3,000 iterations. We train for 100 epochs when using the FlickrStyle10K dataset [gan2017stylenet] with a batch size of 128 and evaluate the model every 100 iterations.
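As a small illustration, the step-decay schedule described above can be written as follows. The decay factor is an assumption (the text states only that the rate decays every 5 epochs); 0.8 is used purely for illustration.

```python
# Learning-rate schedule sketch: initial 5e-4, decayed every 5 epochs.
# DECAY = 0.8 is an assumed value, not taken from the paper.
INITIAL_LR, DECAY, DECAY_EVERY = 5e-4, 0.8, 5

def learning_rate(epoch: int) -> float:
    """Step decay: multiply the rate by DECAY once every DECAY_EVERY epochs."""
    return INITIAL_LR * DECAY ** (epoch // DECAY_EVERY)

assert learning_rate(0) == 5e-4
assert learning_rate(4) == 5e-4                   # still in the first window
assert abs(learning_rate(5) - 4e-4) < 1e-9        # first decay step
```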
During inference, we generate captions using beam search with a beam size of 5. During this process, we impose penalties to discourage the network from repeating words, from ending on function words such as "an", "the", and "at", and from generating special tokens such as 'UNK'.
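One way these penalties can be applied is to adjust each beam's next-token log-probabilities before expansion, as sketched below. The penalty values and token ids are illustrative assumptions, not the paper's settings, and the full beam-search loop is omitted.

```python
import numpy as np

# Hypothetical token ids and penalty weights for illustration only.
UNK = 0
STOPWORDS = {1, 2, 3}      # stand-ins for ids of "an", "the", "at", ...
REPEAT_PENALTY = 2.0
BAD_END_PENALTY = 5.0

def penalized_logprobs(logprobs: np.ndarray, generated: list, ending: bool) -> np.ndarray:
    """Adjust next-token log-probabilities for one beam: forbid special tokens,
    discourage repeating earlier words, and (when the beam would terminate)
    penalize ending on a stopword."""
    adjusted = logprobs.copy()
    adjusted[UNK] = -np.inf                    # never emit special tokens like 'UNK'
    for tok in set(generated):
        adjusted[tok] -= REPEAT_PENALTY        # discourage repeats
    if ending:
        for tok in STOPWORDS:
            adjusted[tok] -= BAD_END_PENALTY   # discourage ending on "an", "the", ...
    return adjusted

logprobs = np.log(np.full(10, 0.1))            # uniform toy distribution over 10 tokens
out = penalized_logprobs(logprobs, generated=[4, 5], ending=True)
assert out[UNK] == -np.inf
assert out[4] < logprobs[4] and out[1] < logprobs[1]
```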
Our quantitative analysis is meant to show that our 3M model can outperform several state-of-the-art baselines in terms of a set of automated NLP metrics. In addition, we run an ablation study to validate the need for each part of the 3M model.
Baselines and Evaluation Metrics
To test if our 3M model can be used to generate human-like captions, we train it using the above settings on the PERSONALITY-CAPTIONS dataset. We compare against the model introduced previously by Shuster et al. [shuster2019engaging]. Since we use a subset of the original PERSONALITY-CAPTIONS dataset, we retrain the method outlined by Shuster et al. using similar settings. We compare the performance of our 3M model against their model using BLEU [papineni2002bleu], ROUGE-L [lin-2004-rouge], CIDEr [vedantam2015cider], and SPICE [anderson2016spice]. The comparison results are listed in Table 1.
To evaluate the extensibility of our model, we also applied our method on the FlickrStyle10K dataset. This is meant to evaluate how well our method can generate captions that capture linguistic style. We compare against the following state-of-the-art baselines:
StyleNet [gan2017stylenet], a single-style model trained with paired factual sentences and unpaired stylized captions.
SF-LSTM [chen2018factual], a single-style model trained with paired stylized captions and paired factual captions.
MsCap [guo2019mscap], a multi-style model trained with paired factual sentences and unpaired stylized captions.
MemCap [zhao2020memcap], a multi-style model trained with paired factual sentences and unpaired stylized captions.
Following [zhao2020memcap], we use SRILM [stolcke2002srilm] to measure perplexity. We report BLEU, Meteor [banerjee-lavie-2005-meteor], CIDEr, style classification accuracy (cls), and average perplexity (ppl) for comparison; results are shown in Table 3.
Additionally, to evaluate the benefits of each component of our model, we perform an ablation study using the PERSONALITY-CAPTIONS dataset. We compare the full 3M model against the following variations: no personality features, no text features, and no ResNeXt features. BLEU, ROUGE-L, CIDEr, and SPICE are reported in Table 2 to evaluate the relevance between images and generations. We also report the number of unique words used across all generated captions per model in Table 2 to show the expressiveness of each generative model.
Specifically, we seek to show that our model is capable of generating captions that match both the given style and the image context. We first list the given image, the five given dense captions, and sample generations along with the personality in parentheses, in Figure 3 as R1-R3. We discuss whether the generated captions match the context in three respects: 1. whether the multi-style component connects the generated captions with the given personality; 2. whether valid text features help generations match the image; 3. whether ResNeXt features help produce reasonable generations when the given text features fail to connect with the image. To give a more complete view of the text that our model can generate, we also list imperfect sample generations, underlined in Figure 3 as W1-W2.
Results and Discussion
In this section, we will outline the results of our experiments and illustrate them in both quantitative and qualitative ways.
Comparison with baselines
As seen in Table 1, our 3M model outperforms UPDOWN models trained under the same training method across all the NLP metrics we used for evaluation. We also achieve better results on ROUGE-L and CIDEr compared with Shuster et al.'s model trained under reinforcement learning. This provides evidence that our approach is effective at multi-style caption generation.
We also show that our 3M model does well on linguistic style captioning even though it was not designed for that task. As Table 3 shows, our 3M model significantly outperforms two other multi-style models, MsCap [guo2019mscap] and MemCap [zhao2020memcap] on BLEU, CIDEr, Meteor, and ppl on the FlickrStyle10K dataset. Note that our 3M model also achieved high cls values, which show how well our captions capture the given style.
We also achieve comparable performance to the SF-LSTM model across the automated metrics we examined. Given that the SF-LSTM model is designed for a single-style generation task, whereas our 3M model was designed for multi-style generation, we feel that this shows how robust our model is.
From Table 2, we can see that if our model is trained without the multi-style component, performance drops across all NLP metrics, demonstrating how critical this component is. Comparing a model using only text features against a model with access only to ResNeXt features shows that using only text features limits the overall expressiveness of generated captions, as indicated by the low number of unique words generated.
Our full model achieves the highest ROUGE-L, CIDEr, and SPICE scores; it improves expressiveness compared with the model using only text features and improves relevancy compared with the model using only ResNeXt features.
For our qualitative analysis, we discuss the quality of the trained 3M models across the two datasets, assessing whether our model can generate captions that match the given style and image context, and whether it can help identify the causes of imperfect captions.
From the generations in Figure 3, we can see that our 3M model is able to generate captions matching the given personality, which certifies that the multi-style component helps direct generations toward the desired personality tone. From R2-R3, we can see that when valid text features are available, the 3M model makes use of them. The generation in R1 is expressed in a more conservative and global way because the text features cannot provide correct information, which necessitates the use of ResNeXt features.
One of the advantages of the 3M model is that it can easily generate multiple captions with different styles. This can enable us to better contextualize incomplete or erroneous captions. In W1 of Figure 3, for example, the generation appears incomplete for the "Anxious" personality. Looking at captions for other personalities, we see that our model correctly identifies the image context. This leads us to believe that we simply set the caption length too low for the "Anxious" example. In W2, our model generates the incorrect phrase "a bike on a bike." By examining the text features used for generation, we can see that this was likely caused by our input text, not by the model itself.
In this paper we introduce the 3M model, a multi-style image captioner that integrates multi-modal features within a multi-UPDOWN encoder-decoder model. We demonstrate the effectiveness of our 3M model by comparing against state-of-the-art work using automatic evaluation methods. Ablation studies have also been done to evaluate the contribution of each component of our 3M model, and they certify that our 3M model generates more expressive and diverse captions without losing the connection with context. The qualitative study helps us understand how well our 3M model performs and shows how our model can also help explain imperfections in its generations.