Vision and language are important unstructured data sources in knowledge discovery research. In recent years, intelligent multimodal content understanding models have been extensively proposed and studied to meet the increasing demand for multimedia knowledge management. With the advancement of deep neural networks, deep content understanding models have sprung up in various fields, including image captioning, video captioning, and visual question answering (VQA). These well-trained models can be further embedded in a larger multimedia database for downstream tasks such as efficient retrieval and knowledge discovery.
In particular, this paper focuses on image captioning, which engages vision and language in a concise way. Given an image, the vision encoder extracts visual features from the RGB picture, then a language model decodes the visual representations into a sentence-level description, as illustrated in Fig. 1. Typically, for the vision encoder, a CNN extracts visual feature maps to represent the visual content, followed by an optional region detector (e.g. Faster-RCNN) for recognising object-wise regions and outputting refined regional features. For the language decoder, the visual features initialise the language model as the first "visual" word embedding, from which a Long Short-Term Memory (LSTM) [lstm] generates the caption word-by-word. Other techniques have been introduced to further improve caption quality, such as incorporating visual attention, attributes, scene graphs, and reinforcement learning.
Most of the work is built on such an encoder-decoder framework, based on the assumption that the visual features can be perfectly recognised by the language decoder. The vision and language modalities are assumed to be interchangeable by inserting the global visual features as the first visual "word" for the language model. However, this assumption overlooks the fact that the visual features and word embeddings are still trained in different ways, despite the efforts made during fine-tuning. The visual feature encoder is generally pre-trained to classify an image into 1,000 categories (e.g. ImageNet), whilst the language model is trained to predict the next word from a vocabulary of approximately 10,000 words. Therefore, an intrinsic modality gap between vision and language remains, and previous methods have made limited efforts to address it. Consequently, challenges also arise in the language decoding phase. Previous methods directly use the visual features as the initial visual "word" to generate the entire sentence, without considering the global semantics of the caption. This may mislead the language decoder into generating sequences with generic content, lacking details and highlights. For example, "sunny day with blue sky" or "red tennis court" could be described as merely "sky" or "tennis court".
To address these challenges, we propose a novel modality transition image captioning framework to explicitly bridge the vision-language modality gap, as shown in Fig. 1. The framework is equipped with the novel Modality Transition Module (MTM), which is trained with a modality loss and supported by our pre-trained text auto-encoder. In particular, the image is first forwarded into the CNN to obtain convolutional feature maps before passing to the region detector (e.g. Faster-RCNN) for object-level features. Meanwhile, different from the previous baseline methods shown in Fig. 1, the caption of the image is additionally encoded via the text auto-encoder, so that a global representation of the caption can be extracted. After both the image and caption representations are prepared, they are fed to the Modality Transition Module. In the MTM, the regional features are first average-pooled, then passed through neural network layers to be transformed into a preliminary textual vector. The generated preliminary vector is compared, via the modality loss, with the global caption representation encoded by the text auto-encoder; the loss measures the difference between the preliminary textual vector and the target global caption vector. It is important that the model learns to generate the global textual features itself, because during testing the caption and its global representation will not be available. The estimated global textual features are input to the language model to generate the sentence word-by-word. Optionally, during word generation, visual attention can also be embedded to enhance prediction performance by attentively shifting the sentence focus onto different parts of the image.
The key contributions in this paper are three-fold:
To the best of our knowledge, it is the first work to propose a dedicated modality transition framework to explicitly bridge the gap between the visual and textual modalities in the image captioning model.
The proposed Modality Transition Module (MTM), modality loss, and the text auto-encoder are agnostic to the existing encoder-decoder framework, therefore, it is straightforward to embed this module into other models.
Extensive quantitative and qualitative experiments are conducted to evaluate the proposed framework against state-of-the-art methods, and different modality losses and base model structures are also studied to show the effectiveness of the proposed model.
The rest of this paper is organised as follows: Sec. 2 briefly reviews the literature in image captioning and multimodal fusion, Sec. 3 demonstrates the proposed method, Sec. 4 gives the experiments, and Sec. 5 concludes the paper.
2 Related Work
Visual captioning is an emerging knowledge discovery topic at the intersection of computer vision and natural language processing. The majority of existing models follow the encoder-decoder framework, in which the encoder is a visual feature extractor and the decoder is a language model. Captioning research has its origin in early work on image captioning utilising a CNN as the image encoder and an RNN as the language generation model [Karpathy_2017_tpami, Vinyals_2015_CVPR]. Similar frameworks have also been proposed to accommodate paragraph generation [krause2016paragraphs, para_gan] and video captioning [video_caption_donahue, yang2018caption].
Furthermore, the visual attention mechanism is embedded into the visual captioning spectrum. During the sentence generation, the visual attention module is able to adaptively locate the most salient areas in the image given the current hidden states for more accurate prediction. Different structures are proposed such as soft attention [xu2015show], adaptive attention [Lu_2017_CVPR], depth attention [lookdeeper_mm18], top-down and bottom-up attention [vqa_updown], etc.
Another line of work exploits reinforcement learning to optimise the model based on the evaluation metrics [rl_rennie, rl_Liu_2017_ICCV, curiosity_rl_luo]. These models take the action of predicting the next word, given the states of the visual, textual, and context vectors, to obtain a reward measured by caption evaluation scores. Other innovative methods focus on domain transfer [Chen_2017_ICCV, dual_cross_domain, multitask_cross_domain], object hallucination and gender bias in generated captions [womenSnowboard, objectHallucination], topic modelling [topic_WangPYTM19], model efficiency [nips_transformer_objects, paic_adc2020, ijcai_cons_caption], etc.
Most of the existing work focuses on improving the vision or language feature extraction, bridging the two by directly inserting the visual modality into the language decoder. The modality gap is only weakly addressed within the encoder-decoder framework, and improvements can still be made to explicitly model the transition between the vision and language modalities.
3 The Proposed Method
In this section, we present the proposed enhanced modality transition framework (Fig. 1) for image captioning. The framework consists of the text auto-encoder, the visual region detector, the Modality Transition Module (MTM), the language decoder, and the visual attention. The details of each component are explained in this section.
3.1 Problem Formulation
The objective of image captioning is formulated in this section. The input RGB image is denoted as $I$. The output caption $S = \{w_1, w_2, \dots, w_T\}$ consists of a sequence of words with length $T$. The final objective is to generate $S$ given $I$.
3.2 Text Auto-Encoder
In the training phase, the conventional image captioning model encodes the image $I$, but shelves the caption $S$, using it only to calculate the supervised loss so as to mimic the inference scenario where $S$ is not available. However, both the image and the caption are available during training. Our motivation is to exploit the informative caption during training to the fullest extent, beyond its role as the ground-truth label. The text auto-encoder is proposed to encode the information-intensive caption into a global text representation, which guides the MTM to transfer the visual features into "textual-like" representations.
The text auto-encoder consists of an encoder and a decoder for automatically generating a compressed vector representation of the caption, as illustrated in Fig. 2. The LSTM encoder-decoder adopts a sequence-to-sequence structure similar to neural machine translation [s2s_cho]. In the text auto-encoder, the encoder module reads the input caption $S = \{w_1, \dots, w_T\}$ and encodes it into a fixed-length vector $z$. The LSTM is commonly used for this purpose, such that:
$$h_t, c_t = f_{\text{LSTM}}(w_t, h_{t-1}, c_{t-1}), \qquad z = h_T,$$
where $h_t \in \mathbb{R}^{d}$ is the hidden state at time step $t$, $c_t$ is the memory cell state at time step $t$, $d$ is the dimension of the encodings, $f_{\text{LSTM}}$ is the LSTM, and $z$ is the generated global vector taken from the hidden state of the last time step $T$.
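The recurrence above can be sketched with a toy scalar LSTM cell. This is a minimal stdlib-Python illustration with made-up weights, not the paper's implementation; real models use vector-valued gates and learned weight matrices:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(w_t, h_prev, c_prev):
    """One scalar LSTM step; the gate coefficients below are arbitrary toy values."""
    i = sigmoid(0.5 * w_t + 0.5 * h_prev)    # input gate
    f = sigmoid(0.5 * w_t - 0.5 * h_prev)    # forget gate
    o = sigmoid(0.3 * w_t + 0.2 * h_prev)    # output gate
    g = math.tanh(0.8 * w_t + 0.1 * h_prev)  # candidate cell update
    c_t = f * c_prev + i * g
    h_t = o * math.tanh(c_t)
    return h_t, c_t

def encode(caption_embeddings):
    """Run the cell over the caption; the last hidden state is the code z = h_T."""
    h, c = 0.0, 0.0
    for w in caption_embeddings:
        h, c = lstm_step(w, h, c)
    return h

z = encode([0.2, -0.4, 0.9])  # toy scalar "word embeddings"
```

Because the hidden state is a gated tanh output, the resulting code is bounded, which is one reason the last hidden state is convenient as a fixed-length summary.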
Given the generated compressed caption code $z$ and all previously predicted words $w_1, \dots, w_{t-1}$, the decoder LSTM is trained to predict the next word $w_t$ in the caption. The probability over the entire caption is denoted as follows:
$$p(S \mid z) = \prod_{t=1}^{T} p(w_t \mid z, w_1, \dots, w_{t-1}),$$
where the dependency on the model weights is dropped for convenience. We optimise the sum of log probabilities in Eq. 3 over all training samples using gradient descent. Cross-entropy is utilised as the reconstruction loss to measure the difference between the regenerated and the original caption. By optimising all the parameters of both the encoder and the decoder, the auto-encoder learns to generate the compressed code $z$ from which the original caption can be reconstructed.
3.3 Modality Transition Module and Loss
3.3.1 Modality Transition
The modality transition module projects the visual features into a textual global vector for the language decoder, as illustrated in Fig. 2. With the modality transition module, the decoder is able to generate captions in an accurate and concise way.
Before being forwarded to the MTM, the image is encoded via the region detector shown in Fig. 1, outputting visual features $V = \{v_1, \dots, v_k\}$ with $k$ visual regions, where $v_i \in \mathbb{R}^{d_v}$ and $d_v$ is the dimension of the visual features. The MTM is formulated as:
$$\bar{v} = \frac{1}{k} \sum_{i=1}^{k} v_i, \qquad \hat{z} = W_2\,\phi(W_1 \bar{v} + b_1) + b_2,$$
where $\bar{v}$ is the global (mean-pooled) visual feature, $\hat{z}$ is the predicted global textual representation, $\phi$ is a non-linear activation, $W_1$, $W_2$ denote the linear projection parameters, and $b_1$, $b_2$ are the biases to be learned.
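The projection can be sketched at shape level in stdlib Python. Toy dimensions are used instead of the 2048/1024 dimensions of the actual model, the weights are made up, and the ReLU non-linearity between the two layers is our assumption:

```python
def matvec(W, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def mtm(regions, W1, b1, W2, b2):
    """Mean-pool regional features, then apply two affine layers (ReLU assumed)."""
    k = len(regions)
    v_bar = [sum(col) / k for col in zip(*regions)]                   # mean pool
    h = [max(0.0, a + b) for a, b in zip(matvec(W1, v_bar), b1)]      # layer 1 + ReLU
    return [a + b for a, b in zip(matvec(W2, h), b2)]                 # preliminary textual vector

# toy setup: 3 regions with 4-d features projected to a 2-d "textual" vector
regions = [[1.0, 0.0, 2.0, 1.0], [3.0, 2.0, 0.0, 1.0], [2.0, 1.0, 1.0, 1.0]]
W1 = [[0.1, 0.0, 0.0, 0.0], [0.0, 0.1, 0.0, 0.0]]   # 2x4
b1 = [0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0]]                        # 2x2 identity
b2 = [0.0, 0.0]
z_hat = mtm(regions, W1, b1, W2, b2)
```

The mean pool collapses the variable number of regions into one vector, so the same module works regardless of how many regions the detector returns.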
3.3.2 Modality Loss
In the training phase, the predicted preliminary textual vector representation $\hat{z}$ is compared with the compressed code $z$ encoded by the text auto-encoder. The modality loss measures the difference between $\hat{z}$ and $z$; we empirically choose the mean squared error (MSE) to implement the loss. Other distance measurements are also implemented and tested in Sec. 4.4.1. The modality loss is written such that:
$$\mathcal{L}_{\text{mod}} = \frac{1}{d} \lVert \hat{z} - z \rVert_2^2.$$
We minimise the modality loss to guide the MTM to learn the correct projection from visual to textual vectors. This is necessary because no caption is available during inference, so the real global textual vector $z$ cannot be obtained.
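With MSE, the modality loss reduces to the mean of squared element-wise differences between the two vectors; a minimal sketch with toy values:

```python
def modality_loss(z_hat, z):
    """MSE between the predicted (z_hat) and auto-encoded (z) textual vectors."""
    assert len(z_hat) == len(z)
    return sum((a - b) ** 2 for a, b in zip(z_hat, z)) / len(z)

loss = modality_loss([1.0, 2.0], [1.0, 4.0])  # -> 2.0
```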
3.4 Modality Transition-enhanced Captioner
Before training the image captioning model, the dedicated text auto-encoder is trained by reconstructing the caption itself to learn a compressed representation. During the image captioning model training, the total loss is the combination of the cross-entropy loss and the proposed modality loss. The total loss function is denoted as follows:
$$\mathcal{L} = \mathcal{L}_{\text{XE}} + \mathcal{L}_{\text{mod}}.$$
The cross-entropy loss $\mathcal{L}_{\text{XE}}$ measures how well the predicted probability distribution over the vocabulary reflects the ground-truth label.
Summary: To train the MTM, the outputs of the MTM are compared with the auto-encoded reference global textual representations via the modality loss. During inference, as illustrated in Fig. 1, the visual features are first extracted by the region detector, such as Faster-RCNN. Then, before passing the visual features to the language decoder, the Modality Transition Module (MTM) projects the averaged visual features into the global textual representation. Finally, the language decoder decodes the representation into a caption, optionally with a visual attention module such as TopDown attention [vqa_updown].
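The inference pipeline above can be sketched end-to-end with stand-in components. Every function here is a hypothetical stub that only fixes the order of operations (detect, pool, project, decode); none of it reflects the real detector, MTM weights, or decoder:

```python
def detect_regions(image):
    """Stand-in for a Faster-RCNN detector: returns regional feature vectors."""
    return [[1.0, 0.0], [0.0, 1.0]]

def mtm_project(v_bar):
    """Stand-in for the trained MTM: identity projection for illustration."""
    return v_bar

def decode(z_hat, max_len=5):
    """Stand-in greedy decoder; a real decoder conditions on z_hat, this stub
    just walks a toy vocabulary until it hits the end-of-sentence token."""
    vocab = ["a", "dog", "<eos>"]
    caption, t = [], 0
    while t < max_len:
        word = vocab[min(t, len(vocab) - 1)]  # dummy word choice
        if word == "<eos>":
            break
        caption.append(word)
        t += 1
    return caption

regions = detect_regions(image=None)
v_bar = [sum(col) / len(regions) for col in zip(*regions)]  # mean pool
caption = decode(mtm_project(v_bar))
```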
4 Experiments
4.1 Experimental Settings
4.1.1 Dataset
The experiments are conducted on the MS-COCO [lin2014microsoft] captioning dataset to evaluate the proposed MTM model. The train, validation, and test partitions follow the commonly adopted "Karpathy" split [Karpathy_2017_tpami], which has 113,287 training, 5,000 validation, and 5,000 test images.
4.1.2 Implementation Details
The proposed framework consists of the visual region detector, the MTM module, the text auto-encoder, the visual attention, and the language decoder. For the visual region detector, we extract region features using Faster-RCNN with a ResNet-101 backbone, following [vqa_updown]. The MTM module has two fully-connected neural network layers with 2048-1024 and 1024-1024 dimensions, i.e., $W_1 \in \mathbb{R}^{1024 \times 2048}$ and $W_2 \in \mathbb{R}^{1024 \times 1024}$. The text auto-encoder consists of an encoder and a decoder, both implemented as 2-layer LSTMs with 1024 dimensions. The captioning model is optimised with the ADAM optimiser [adam_KingmaB14], and the learning rate is set empirically with decay. All the models in this paper are trained on a server with a GeForce GTX 1080Ti GPU.
4.1.3 Evaluation Metrics
For comparison, we report performance measured by the automatic language evaluation metrics BLEU-$n$ [papineni2002bleu], METEOR [lavie2005meteor], ROUGE-L [lin-2004-rouge], and CIDEr [vedantam2015cider]. For example, BLEU-$n$ measures the $n$-gram precision of the candidate caption.
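The n-gram precision at the core of BLEU-$n$ can be sketched as clipped n-gram matching. This is a simplified single-reference illustration; the full metric also applies a brevity penalty and aggregates over multiple references:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate caption against one reference:
    each candidate n-gram is credited at most as often as it appears in the reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

p1 = ngram_precision("a man riding a horse".split(),
                     "a man is riding a horse".split(), n=1)  # -> 1.0
```

The clipping prevents a degenerate candidate from earning credit by repeating a single high-frequency word.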
4.1.4 Compared Methods
We compare the proposed method with a number of encoder-decoder image captioning models. ShowTell [Vinyals_2015_CVPR]: the neural image captioner consisting of a CNN image encoder and an LSTM language decoder. Adaptive [Lu_2017_CVPR]: the CNN-LSTM captioner with adaptive visual attention. Att2in and Att2all: variants of attention-based models trained with self-critical sequence training (SCST) from [rl_rennie]. UpDown: the state-of-the-art captioning model with carefully engineered bottom-up and top-down attention [vqa_updown]. In addition, in the following comparison tables and figures, the suffix MT indicates that the model has the Modality Transition Module embedded, and the suffix RL indicates that the model is fine-tuned with the reinforcement learning objective from SCST [rl_rennie].
4.2 Quantitative Analysis
We embed the proposed MTM into the simplest baseline ShowTell and the state-of-the-art model UpDown, labelled ShowTell_MT and UpDown_MT, respectively, in Table 1. The reinforcement learning objective is also implemented to achieve the best performance. From Table 1, by exploiting the MTM module, we can clearly observe that the best model, UpDown_MT_RL, improves on the previous state-of-the-art UpDown_RL by a relative 3.4% in CIDEr. Consistently, for the baseline ShowTell model, when the MTM is deployed, the CIDEr score increases from 96.29 to 100.42. Moreover, when self-critical reinforcement learning is adopted to optimise ShowTell_MT, BLEU-4 is boosted by 10.5%, and the CIDEr result increases from 100.42 to 113.37. Importantly, ShowTell_MT_RL is not equipped with any form of attention mechanism, yet its performance surpasses attention-based models such as Adaptive and Att2in_RL and is comparable to Att2all_RL. Compared with previous work, in which the visual features are assumed to be modality-invariant global representations for language decoding without compensation, the MTM significantly improves language generation quality by allowing a smooth transition from vision to language.
4.3 Qualitative Analysis
In the qualitative study, we present a number of showcases to demonstrate the performance of the MTM intuitively. Compared with the results from the UpDown attention model [vqa_updown] and the ShowTell CNN-LSTM model [Vinyals_2015_CVPR] shown in Fig. 3, the proposed MTM is able to generate sentences in an accurate, detailed, and grammatically correct way. Specifically, after the modality transition, the language model can precisely identify object interactions in the scene. For example, the first image in the top row illustrates that the MTM can detect that a man is "standing next to" rather than "riding" the elephant. The ability to differentiate human-object interactions is also shown in the 4th image of the top row: although only one man is holding the frisbee in that picture, from the actions and poses we can conclude that a group of people are playing, rather than a single person. Furthermore, details are better preserved by exploiting the global caption representations in the MTM. The pictures in the 2nd column give accurate colours of the objects, while the 3rd image on the top and the 1st on the bottom reveal details such as "bottles of wine" and "two computer monitors", rather than the generic expressions produced by the other models.
4.4 Ablation Study
4.4.1 Modality Loss
In this section, we compare different variants of the modality loss. In Table 5, ShowTell indicates the performance of the base model without modality transition; all other models in the table are implemented with the MTM. For comparison, we choose different measurements between the predicted and the target textual encodings; the loss functions include the mean absolute error (MAE), mean squared error (MSE), cosine distance (COS), Kullback-Leibler divergence (KLD), and Maximum Mean Discrepancy (MMD) [mmd]. From the comparison results, we can see that the standard MSE effectively measures the loss between the preliminary textual vectors and the well-trained global caption representations, giving the best performance. However, measurements such as KLD and COS result in poor performance. The KLD measures the relative entropy between two distributions, rather than directly computing the distance between the input and target vectors, while the COS measures only the angle between two vectors, although the embedding vectors are sensitive to magnitude. The experiments show that they are not suitable as modality similarity measures.
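For reference, the compared measures over two vectors can be sketched in stdlib Python. KLD is shown over softmax-normalised vectors, which is one common way of applying it to embeddings; that normalisation choice is our assumption:

```python
import math

def mae(a, b):
    """Mean absolute error between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cosine_distance(a, b):
    """1 - cosine similarity: depends only on the angle, not the magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def kld(a, b):
    """KL divergence between softmax-normalised vectors (assumed normalisation)."""
    p, q = softmax(a), softmax(b)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

d_cos = cosine_distance([1.0, 0.0], [2.0, 0.0])  # 0.0: blind to scale
d_mae = mae([1.0, 0.0], [2.0, 0.0])              # 0.5: sensitive to scale
```

The two example values at the bottom illustrate the point made above: a scaled copy of a vector is indistinguishable under the cosine distance but not under element-wise measures such as MAE or MSE.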
4.4.2 Modality Transition Module Training
We implement the Modality Transition Module on the existing models mentioned in Sec. 4.1.4, annotated with the suffix MT: ShowTell_MT, Adaptive_MT, Att2in_MT, Att2all_MT, and TopDown_MT. The detailed training curves are illustrated in Fig. 5; the steadily increasing red line on top is the proposed MTM built on TopDown attention [vqa_updown]. The MTM also performs well on the other encoder-decoder models, demonstrating that it can be embedded in conventional image captioning models with good adaptability.
As an auxiliary study, the validation performance and the training modality loss during model learning are visualised in Fig. 6. The two methods included are the MTM embedded on ShowTell and on TopDown. As shown in the figure, as the CIDEr scores increase in the left sub-figure, the corresponding modality losses decrease at a similar pace in the right sub-figure. We can also observe that TopDown_MT performs better than ShowTell_MT in terms of CIDEr scores, and TopDown_MT's loss is lower, as expected. This indicates that the proposed modality loss is strongly correlated with model performance.
5 Conclusion
In this work, we propose a Modality Transition Module (MTM) to enhance the vision-to-language modality transition for image captioning. The proposed model explicitly projects the image visual features into a global textual representation vector, giving the language decoder preferable textual cues for caption generation. The experiments demonstrate the effectiveness of the proposed model.
Acknowledgements
This work is partially supported by ARC DP190102353 and ARC DP170103954.