Using Semantic Compositional Networks for Video Captioning
A Semantic Compositional Network (SCN) is developed for image captioning, in which semantic concepts (i.e., tags) are detected from the image, and the probability of each tag is used to compose the parameters in a long short-term memory (LSTM) network. The SCN extends each weight matrix of the LSTM to an ensemble of tag-dependent weight matrices. The degree to which each member of the ensemble is used to generate an image caption is tied to the image-dependent probability of the corresponding tag. In addition to captioning images, we also extend the SCN to generate captions for video clips. We qualitatively analyze semantic composition in SCNs, and quantitatively evaluate the algorithm on three benchmark datasets: COCO, Flickr30k, and Youtube2Text. Experimental results show that the proposed method significantly outperforms prior state-of-the-art approaches, across multiple evaluation metrics.
The Theano code for the CVPR 2017 paper "Semantic Compositional Networks for Visual Captioning"
There has been a recent surge of interest in developing models that can generate captions for images or videos, termed visual captioning. Most of these approaches learn a probabilistic model of the caption, conditioned on an image or a video [29, 47, 13, 20, 48, 52, 10, 46, 32, 56]. Inspired by the successful use of the encoder-decoder framework employed in machine translation [2, 8, 40], most recent work on visual captioning employs a convolutional neural network (CNN) as an encoder, obtaining a fixed-length vector representation of a given image or video. A recurrent neural network (RNN), typically implemented with long short-term memory (LSTM) units, is then employed as a decoder to generate a caption.
Recent work shows that adding explicit high-level semantic concepts (i.e., tags) of the input image/video can further improve visual captioning. As shown in [49, 54], detecting explicit semantic concepts encoded in an image, and adding this high-level semantic information into the CNN-LSTM framework, has improved performance significantly. Specifically, [49] feeds the semantic concepts as an initialization step into the LSTM decoder. In [54], a model of semantic attention is proposed which selectively attends to semantic concepts through a soft attention mechanism. On the other hand, although significant performance improvements were achieved, integration of semantic concepts into the LSTM-based caption generation process is constrained in these methods; e.g., only through soft attention or initialization of the first step of the LSTM.
In this paper, we propose the Semantic Compositional Network (SCN) to more effectively assemble the meanings of individual tags to generate the caption that describes the overall meaning of the image, as illustrated in Figure 1(a). Similar to the conventional CNN-LSTM-based image captioning framework, a CNN is used to extract the visual feature vector, which is then fed into an LSTM for generating the image caption (for simplicity, in this discussion we refer to images, but the method is also applicable to video). However, unlike the conventional LSTM, the SCN extends each weight matrix of the conventional LSTM to an ensemble of tag-dependent weight matrices, subject to the probabilities that the tags are present in the image. These tag-dependent weight matrices form a weight tensor with a large number of parameters. In order to make learning feasible, we factorize that tensor to be a three-way matrix product, which dramatically reduces the number of free parameters to be learned, while also yielding excellent performance.
The main contributions of this paper are as follows: (i) We propose the SCN to effectively compose individual semantic concepts for image captioning. (ii) We perform comprehensive evaluations on two image captioning benchmarks, demonstrating that the proposed method outperforms previous state-of-the-art approaches by a substantial margin. For example, as reported by the COCO official test server, we achieve a BLEU-4 of 33.1, an improvement of 1.5 points over the current published state of the art. (iii) We extend the proposed framework from image captioning to video captioning, demonstrating the versatility of the proposed model. (iv) We also perform a detailed analysis to study the SCN, showing that the model can adjust the caption smoothly by modifying the tags.
We focus on recent neural-network-based literature for caption generation, as these are most relevant to our work. Such models typically extract a visual feature vector via a CNN, and then send that vector to a language model for caption generation. Representative works include [7, 9, 10, 20, 23, 24, 29, 48] for image captioning and [10, 46, 47, 56, 3, 35, 11] for video captioning. The differences among the various methods mainly lie in the types of CNN architectures and language models. For example, the vanilla RNN was used in [29, 20], while the LSTM was used in [48, 46, 47]. The visual feature vector was only fed into the RNN once at the first time step in [48, 20], while other work used it at each time step of the RNN.
Most recently, an attention-based mechanism was utilized to learn where to focus in the image during caption generation. This work was followed by a review module that improves the attention mechanism, and by a method that improves the correctness of visual attention. Moreover, a variational autoencoder was developed for image captioning. Other related work includes further approaches to video captioning and to composing sentences that describe novel objects.
Another class of models uses semantic information for caption generation. Specifically, one approach applied retrieved sentences as additional semantic information to guide the LSTM when generating captions, while [13, 49, 54] applied a semantic-concept-detection process before generating sentences. In addition, a deep multimodal similarity model has been proposed to project visual features and captions into a joint embedding space. This line of methods represents the current state of the art for image captioning. Our proposed model also lies in this category; however, distinct from the aforementioned approaches, our model uses weight tensors in LSTM units. This allows learning an ensemble of semantic-concept-dependent weight matrices for generating the caption.
Related to but distinct from the hierarchical composition in a recursive neural network, our model carries out implicit composition of concepts, and there is no hierarchical relationship among these concepts. Figure 1(b) illustrates the semantic composition manifested in the SCN model. Specifically, a set of semantic concepts, such as “baby, holding, toothbrush, mouth”, are detected with high probabilities. If only one semantic concept is turned on, the model will generate a description covering only part of the input image, as shown in sentences 1-5 of Figure 1(b); however, by assembling all these semantic concepts, the SCN is able to generate a comprehensive description “a baby holding a toothbrush in its mouth”. More interestingly, as shown in sentences 6-8 of Figure 1(b), the SCN also has great flexibility to adjust the generation of the caption by changing certain semantic concepts.
In [10, 19, 23], the authors also briefly discussed using the tensor factorization method for image captioning. Specifically, visual features extracted from CNNs are utilized in [10, 23], and an inferred scene vector is used in [19] for tensor factorization. In contrast to these works, we use the semantic-concept vector that is formed by the probabilities of all tags to weight the basis LSTM weight matrices in the ensemble. Our semantic-concept vector is more powerful than the visual-feature vector [10, 23] and the scene vector [19] in terms of providing explicit semantic information of an image, hence leading to significantly better performance, as shown in our quantitative evaluation. In addition, the usage of semantic concepts also makes the proposed SCN more interpretable than [10, 19, 23], as shown in our qualitative analysis, since each unit in the semantic-concept vector corresponds to an explicit tag.
Consider an image $\mathbf{I}$, with associated caption $\mathbf{X}$. We first extract feature vector $\mathbf{v}(\mathbf{I})$, which is often the top-layer features of a pretrained CNN. Henceforth, for simplicity, we omit the explicit dependence on $\mathbf{I}$, and represent the visual feature vector as $\mathbf{v}$. The length-$T$ caption is represented as $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_T)$, with $\mathbf{x}_t$ a 1-of-$V$ (“one hot”) encoding vector, with $V$ the size of the vocabulary. The length $T$ typically varies among different captions.
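The $t$-th word in a caption, $\mathbf{x}_t$, is linearly embedded into an $n_x$-dimensional real-valued vector $\mathbf{w}_t = \mathbf{W}_e \mathbf{x}_t$, where $\mathbf{W}_e \in \mathbb{R}^{n_x \times V}$ is a word embedding matrix (learned), i.e., $\mathbf{w}_t$ is a column of $\mathbf{W}_e$ chosen by the one-hot $\mathbf{x}_t$. The probability of caption $\mathbf{X}$ given image feature vector $\mathbf{v}$ is defined as
$$p(\mathbf{X} \mid \mathbf{I}) = \prod_{t=1}^{T} p(\mathbf{x}_t \mid \mathbf{x}_0, \ldots, \mathbf{x}_{t-1}, \mathbf{v}), \tag{1}$$
where $\mathbf{x}_0$ is defined as a special start-of-the-sentence token. All the words in the caption are sequentially generated using an RNN, until the end-of-the-sentence symbol is generated. Specifically, each conditional $p(\mathbf{x}_t \mid \mathbf{x}_{<t}, \mathbf{v})$ is specified as $\mathrm{softmax}(\mathbf{V}\mathbf{h}_t)$, where $\mathbf{h}_t$ is recursively updated through $\mathbf{h}_t = \mathcal{H}(\mathbf{w}_{t-1}, \mathbf{h}_{t-1}, \mathbf{v})$, and $\mathbf{h}_0$ is defined as a zero vector ($\mathbf{h}_0$ is not updated during training). $\mathbf{V}$ is the weight matrix connecting the RNN’s hidden state, used for computing a distribution over words. Bias terms are omitted for simplicity throughout the paper.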
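Without loss of generality, we begin by discussing an RNN with a simple transition function $\mathcal{H}(\cdot)$; this is generalized in Section 3.4 to the LSTM. Specifically, $\mathcal{H}(\cdot)$ is defined as
$$\mathbf{h}_t = \sigma\big(\mathbf{W}\mathbf{x}_{t-1} + \mathbf{U}\mathbf{h}_{t-1} + \mathbb{1}(t=1)\cdot\mathbf{C}\mathbf{v}\big), \tag{2}$$
where $\sigma(\cdot)$ is a logistic sigmoid function, and $\mathbb{1}(\cdot)$ represents an indicator function. Feature vector $\mathbf{v}$ is fed into the RNN at the beginning, i.e., at $t=1$. $\mathbf{W}$ is defined as the input matrix, and $\mathbf{U}$ is termed the recurrent matrix; with a slight abuse of notation, $\mathbf{x}_{t-1}$ in (2) denotes the embedded previous word. The model in (2) is illustrated in Figure 2(a).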
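As a rough illustration of the simple transition described above, the following NumPy sketch applies one recurrent step and injects the image feature only at the first time step. The function names, toy dimensions, and random weights are assumptions made for this example and are not taken from the released code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rnn_step(x_prev, h_prev, W, U, C, v, t):
    """One step: h_t = sigmoid(W x_{t-1} + U h_{t-1} + 1(t=1) * C v).

    x_prev is the embedded previous word (an n_x-dimensional vector).
    """
    z = C @ v if t == 1 else np.zeros_like(h_prev)
    return sigmoid(W @ x_prev + U @ h_prev + z)

rng = np.random.default_rng(0)
n_x, n_h, n_v = 4, 3, 5               # toy sizes (assumed)
W = rng.normal(size=(n_h, n_x))       # input matrix
U = rng.normal(size=(n_h, n_h))       # recurrent matrix
C = rng.normal(size=(n_h, n_v))       # maps the image feature to the hidden space
v = rng.normal(size=n_v)              # stand-in for a CNN feature vector
h0 = np.zeros(n_h)                    # h_0 is a zero vector
x0 = rng.normal(size=n_x)             # stand-in for the embedded start token

h1 = rnn_step(x0, h0, W, U, C, v, t=1)   # image feature is used here
h2 = rnn_step(x0, h1, W, U, C, v, t=2)   # and ignored from t=2 onward
```

A word distribution at each step would then be obtained by applying a softmax to a linear map of the hidden state.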
The SCN developed below is based on the detection of semantic concepts, i.e., tags, in the image under test. In order to detect such tags from an image, we first select a set of tags from the caption text in the training set. Following prior work, we use the $K$ most common words in the training captions to determine the vocabulary of tags, which includes the most frequent nouns, verbs, or adjectives.
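The tag-selection step can be sketched as a frequency count over the training captions. This is a simplified illustration: the helper names are hypothetical, whitespace tokenization is assumed, and a plain stopword set stands in for the restriction to nouns, verbs, and adjectives, which would normally need a part-of-speech tagger.

```python
from collections import Counter

def build_tag_vocab(captions, K, stopwords=frozenset()):
    """Return the K most frequent caption words that are not stopwords."""
    counts = Counter(
        word
        for caption in captions
        for word in caption.lower().split()
        if word not in stopwords
    )
    return [word for word, _ in counts.most_common(K)]

def label_vector(image_captions, tags):
    """Binary vector: entry k is 1 if tag k occurs in any caption of the image."""
    words = {w for caption in image_captions for w in caption.lower().split()}
    return [1 if tag in words else 0 for tag in tags]

train_captions = ["a dog on grass", "a dog runs", "a cat on a bed"]
tags = build_tag_vocab(train_captions, K=2, stopwords={"a", "on"})
y = label_vector(["a dog on grass"], ["dog", "bed"])
```

The binary vectors produced by `label_vector` play the role of per-image targets for a multi-label tag classifier.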
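In order to predict semantic concepts given a test image, motivated by [49, 44], we treat this problem as a multi-label classification task. Suppose there are $N$ training examples, and $\mathbf{y}_i = [y_{i1}, \ldots, y_{iK}] \in \{0,1\}^K$ is the label vector of the $i$-th image, where $y_{ik} = 1$ if the image is annotated with tag $k$, and $y_{ik} = 0$ otherwise. Let $\mathbf{v}_i$ and $\mathbf{s}_i$ represent the image feature vector and the semantic feature vector for the $i$-th image. The cost function to be minimized is
$$-\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K}\Big(y_{ik}\log s_{ik} + (1-y_{ik})\log(1-s_{ik})\Big), \tag{3}$$
where $\mathbf{s}_i = \sigma(f(\mathbf{v}_i))$ is a $K$-dimensional vector with $\mathbf{s}_i = [s_{i1}, \ldots, s_{iK}]$, $\sigma(\cdot)$ is the logistic sigmoid function and $f(\cdot)$ is implemented as a multilayer perceptron (MLP).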
In testing, for each input image, we compute a semantic-concept vector $\mathbf{s}$, formed by the probabilities of all tags, computed by the semantic-concept detection model.
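A minimal sketch of the tag detector follows, assuming a two-layer MLP with a tanh hidden layer (the architecture details are not given in the text, so this is an illustrative choice). The cost is the mean over images of the binary cross-entropy summed over tags; `tag_probabilities` and `multilabel_cost` are hypothetical names.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tag_probabilities(V_feats, W1, W2):
    """s = sigmoid(f(v)) with f a small MLP; rows of V_feats are image features."""
    hidden = np.tanh(V_feats @ W1)
    return sigmoid(hidden @ W2)

def multilabel_cost(Y, S, eps=1e-12):
    """Negative log-likelihood of binary tag labels, averaged over images."""
    S = np.clip(S, eps, 1.0 - eps)
    log_lik = Y * np.log(S) + (1.0 - Y) * np.log(1.0 - S)
    return -log_lik.sum(axis=1).mean()

rng = np.random.default_rng(0)
N, n_v, n_hidden, K = 8, 6, 5, 4      # toy sizes (assumed)
V_feats = rng.normal(size=(N, n_v))
W1 = rng.normal(size=(n_v, n_hidden))
W2 = rng.normal(size=(n_hidden, K))
Y = rng.integers(0, 2, size=(N, K)).astype(float)

S = tag_probabilities(V_feats, W1, W2)   # one semantic-concept vector per image
cost = multilabel_cost(Y, S)
```

At test time only `tag_probabilities` is needed; its output row for an image is the semantic-concept vector used by the caption model.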
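The SCN extends each weight matrix of the conventional RNN to be an ensemble of a set of tag-dependent weight matrices, subject to the probabilities that the tags are present in the image. Specifically, the SCN-RNN computes the hidden states as follows:
$$\mathbf{h}_t = \sigma\big(\mathbf{W}(\mathbf{s})\mathbf{x}_{t-1} + \mathbf{U}(\mathbf{s})\mathbf{h}_{t-1} + \mathbf{z}\big), \tag{4}$$
where $\mathbf{z} = \mathbb{1}(t=1)\cdot\mathbf{C}\mathbf{v}$, and $\mathbf{W}(\mathbf{s})$ and $\mathbf{U}(\mathbf{s})$ are ensembles of tag-dependent weight matrices, weighted by the probabilities that the tags are present in the image, according to the semantic-concept vector $\mathbf{s}$.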
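Given $\mathbf{s} \in \mathbb{R}^{K}$, we define two weight tensors $\mathbf{W}_T \in \mathbb{R}^{n_h \times n_x \times K}$ and $\mathbf{U}_T \in \mathbb{R}^{n_h \times n_h \times K}$, where $n_h$ is the number of hidden units and $n_x$ is the dimension of word embedding. $\mathbf{W}(\mathbf{s}) \in \mathbb{R}^{n_h \times n_x}$ and $\mathbf{U}(\mathbf{s}) \in \mathbb{R}^{n_h \times n_h}$ can be specified as
$$\mathbf{W}(\mathbf{s}) = \sum_{k=1}^{K} s_k \mathbf{W}_T[k], \qquad \mathbf{U}(\mathbf{s}) = \sum_{k=1}^{K} s_k \mathbf{U}_T[k], \tag{5}$$
where $s_k$ is the $k$-th element in $\mathbf{s}$; $\mathbf{W}_T[k]$ and $\mathbf{U}_T[k]$ denote the $k$-th 2D “slice” of $\mathbf{W}_T$ and $\mathbf{U}_T$, respectively. The probability of the $k$-th semantic concept, $s_k$, is associated with a pair of RNN weight matrices $\mathbf{W}_T[k]$ and $\mathbf{U}_T[k]$, implicitly specifying $K$ RNNs in total. Consequently, training such a model as defined in (4) and (5) can be interpreted as jointly training an ensemble of $K$ RNNs.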
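The tag-weighted ensemble of weight matrices described above amounts to a weighted sum over tensor slices. The sketch below stores the tensor with the tag index first (a layout chosen here for convenience) and uses toy sizes and random values, all of which are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_h, n_x = 5, 3, 4                      # toy sizes (assumed)
W_T = rng.normal(size=(K, n_h, n_x))       # one n_h x n_x slice per tag
s = rng.uniform(size=K)                    # tag probabilities for one image

# Composed input matrix: sum over tags of s_k times the k-th slice.
W_s = np.einsum("k,kij->ij", s, W_T)

# Switching on a single tag recovers that tag's own weight matrix.
only_tag_2 = np.eye(K)[2]
W_single = np.einsum("k,kij->ij", only_tag_2, W_T)
```

The same construction applies to the recurrent matrix; the cost is that the tensor holds K full weight matrices, which motivates a factorized parameterization.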
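Though appealing, the number of parameters is proportional to $K$, which is prohibitive for large $K$ (e.g., $K = 1000$ for COCO). In order to remedy this problem, we adopt ideas from prior work on tensor factorization to factorize $\mathbf{W}(\mathbf{s})$ and $\mathbf{U}(\mathbf{s})$ defined in (5) as
$$\mathbf{W}(\mathbf{s}) = \mathbf{W}_a \cdot \mathrm{diag}(\mathbf{W}_b \mathbf{s}) \cdot \mathbf{W}_c, \tag{6}$$
$$\mathbf{U}(\mathbf{s}) = \mathbf{U}_a \cdot \mathrm{diag}(\mathbf{U}_b \mathbf{s}) \cdot \mathbf{U}_c, \tag{7}$$
where $\mathbf{W}_a \in \mathbb{R}^{n_h \times n_f}$, $\mathbf{W}_b \in \mathbb{R}^{n_f \times K}$ and $\mathbf{W}_c \in \mathbb{R}^{n_f \times n_x}$; similarly, $\mathbf{U}_a \in \mathbb{R}^{n_h \times n_f}$, $\mathbf{U}_b \in \mathbb{R}^{n_f \times K}$ and $\mathbf{U}_c \in \mathbb{R}^{n_f \times n_h}$, with $n_f$ the number of factors. Substituting (6) and (7) into (4), we obtain
$$\tilde{\mathbf{x}}_{t-1} = \mathbf{W}_b \mathbf{s} \odot \mathbf{W}_c \mathbf{x}_{t-1}, \tag{8}$$
$$\tilde{\mathbf{h}}_{t-1} = \mathbf{U}_b \mathbf{s} \odot \mathbf{U}_c \mathbf{h}_{t-1}, \tag{9}$$
$$\mathbf{z} = \mathbb{1}(t=1)\cdot\mathbf{C}\mathbf{v}, \tag{10}$$
$$\mathbf{h}_t = \sigma\big(\mathbf{W}_a \tilde{\mathbf{x}}_{t-1} + \mathbf{U}_a \tilde{\mathbf{h}}_{t-1} + \mathbf{z}\big), \tag{11}$$
where $\odot$ denotes the element-wise multiply (Hadamard) operator.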
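$\mathbf{W}_a$ and $\mathbf{W}_c$ are shared among all the captions, effectively capturing common linguistic patterns; while the diagonal term, $\mathrm{diag}(\mathbf{W}_b \mathbf{s})$, accounts for semantic aspects of the image under test, captured by $\mathbf{s}$. The same analysis also holds true for $\mathbf{U}_a$, $\mathbf{U}_b$ and $\mathbf{U}_c$. In this factorized model, the RNN weight matrices that correspond to each semantic concept share “structure.” This factorized model (termed SCN-RNN) is illustrated in Figure 2(b).
To relate the factorized form to the tensor in (5), let $\mathbf{w}_{bk}$ denote the $k$-th column of $\mathbf{W}_b$; then
$$\mathbf{W}(\mathbf{s}) = \sum_{k=1}^{K} s_k \big[\mathbf{W}_a \cdot \mathrm{diag}(\mathbf{w}_{bk}) \cdot \mathbf{W}_c\big]. \tag{12}$$
A similar decomposition is manifested for $\mathbf{U}(\mathbf{s})$. The matrix $\mathbf{W}_a \cdot \mathrm{diag}(\mathbf{w}_{bk}) \cdot \mathbf{W}_c$ may be interpreted as the $k$-th “slice” of a weight tensor, with each slice corresponding to one of the $K$ semantic concepts ($K$ total tensor “slices,” each of size $n_h \times n_x$). Hence, via the decomposition in (6) and (7), we effectively learn an ensemble of $K$ sets of RNN parameters, one for each semantic concept. This is efficiently done by sharing $\mathbf{W}_a$ and $\mathbf{W}_c$ when composing each member of the ensemble. The weight with which the $k$-th slice of this tensor contributes to the RNN parameters for a given image is dependent on the respective probability $s_k$ with which the $k$-th semantic concept is inferred to be associated with image $\mathbf{I}$.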
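The three-way factorization can be checked numerically. The sketch below, with assumed toy sizes and hypothetical variable names, verifies two things: that the composed matrix equals a tag-weighted sum of per-tag slices, and that the matrix never has to be formed explicitly, because applying it to a vector reduces to two small projections and an element-wise product.

```python
import numpy as np

def compose(Wa, Wb, Wc, s):
    """Three-way product Wa . diag(Wb s) . Wc for a tag-probability vector s."""
    return Wa @ np.diag(Wb @ s) @ Wc

rng = np.random.default_rng(0)
n_h, n_x, n_f, K = 3, 4, 6, 5              # toy sizes (assumed)
Wa = rng.normal(size=(n_h, n_f))           # shared across captions
Wb = rng.normal(size=(n_f, K))             # maps tag probabilities to factors
Wc = rng.normal(size=(n_f, n_x))           # shared across captions
s = rng.uniform(size=K)
x = rng.normal(size=n_x)                   # embedded previous word

W_s = compose(Wa, Wb, Wc, s)

# Ensemble view: slice k uses the k-th column of Wb on the diagonal.
slices = [Wa @ np.diag(Wb[:, k]) @ Wc for k in range(K)]
W_from_slices = sum(s[k] * slices[k] for k in range(K))

# Efficient view: Hadamard product of two projections, then Wa.
fast = Wa @ ((Wb @ s) * (Wc @ x))
```

Both views give the same result as `W_s @ x`, which is why the factorized model can be trained without storing K separate matrices.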
The number of parameters in the basic RNN model (see Figure 2(a)) is $n_h \cdot (n_x + n_h)$, while the number of parameters in the SCN-RNN model (see Figure 2(b)) is $n_f \cdot (n_x + 2K + 3n_h)$. In experiments, we set $n_f = n_h$. Therefore, the additional number of parameters is $2 \cdot n_h \cdot (n_f + K)$. This increased model complexity also indicates increased training/testing time.
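The parameter counts can be worked through by counting matrix entries: the basic RNN has an input and a recurrent matrix, the unfactorized tensor model has K copies of each, and the factorized model has six factor matrices. In the sketch below, 512 hidden units and factors and 1000 tags follow the experimental settings reported in this paper, while the embedding size of 300 is an assumption made only for the arithmetic.

```python
def rnn_params(n_x, n_h):
    """Entries of W (n_h x n_x) plus U (n_h x n_h)."""
    return n_h * (n_x + n_h)

def tensor_params(n_x, n_h, K):
    """K full copies of W and U, one pair per tag."""
    return K * n_h * (n_x + n_h)

def scn_rnn_params(n_x, n_h, n_f, K):
    """Entries of Wa, Wb, Wc, Ua, Ub, Uc in the factorized model."""
    return n_f * (n_x + 2 * K + 3 * n_h)

n_x, n_h, n_f, K = 300, 512, 512, 1000     # n_x = 300 is assumed
base = rnn_params(n_x, n_h)
full = tensor_params(n_x, n_h, K)
factored = scn_rnn_params(n_x, n_h, n_f, K)
extra = factored - base                    # equals 2 * n_h * (n_f + K) when n_f == n_h
```

Under these settings the factorized model is several times larger than the basic RNN but orders of magnitude smaller than the unfactorized tensor.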
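| Model | COCO B-1 | COCO B-2 | COCO B-3 | COCO B-4 | COCO METEOR | COCO CIDEr-D | Flickr30k B-1 | Flickr30k B-2 | Flickr30k B-3 | Flickr30k B-4 | Flickr30k METEOR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SCN-LSTM Ensemble of 5 | 0.741 | 0.578 | 0.444 | 0.341 | 0.261 | 1.041 | 0.747 | 0.552 | 0.403 | 0.288 | 0.223 |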
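RNNs with LSTM units have emerged as a popular architecture, due to their representational power and effectiveness at capturing long-term dependencies. We generalize the SCN-RNN model by using LSTM units. Specifically, we define $\mathcal{H}(\cdot)$ as
$$\begin{aligned}
\mathbf{i}_t &= \sigma\big(\mathbf{W}_{ia}\tilde{\mathbf{x}}_{i,t-1} + \mathbf{U}_{ia}\tilde{\mathbf{h}}_{i,t-1} + \mathbf{z}\big), \\
\mathbf{f}_t &= \sigma\big(\mathbf{W}_{fa}\tilde{\mathbf{x}}_{f,t-1} + \mathbf{U}_{fa}\tilde{\mathbf{h}}_{f,t-1} + \mathbf{z}\big), \\
\mathbf{o}_t &= \sigma\big(\mathbf{W}_{oa}\tilde{\mathbf{x}}_{o,t-1} + \mathbf{U}_{oa}\tilde{\mathbf{h}}_{o,t-1} + \mathbf{z}\big), \\
\tilde{\mathbf{c}}_t &= \sigma\big(\mathbf{W}_{ca}\tilde{\mathbf{x}}_{c,t-1} + \mathbf{U}_{ca}\tilde{\mathbf{h}}_{c,t-1} + \mathbf{z}\big), \\
\mathbf{c}_t &= \mathbf{i}_t \odot \tilde{\mathbf{c}}_t + \mathbf{f}_t \odot \mathbf{c}_{t-1}, \\
\mathbf{h}_t &= \mathbf{o}_t \odot \sigma(\mathbf{c}_t),
\end{aligned}$$
where $\mathbf{z} = \mathbb{1}(t=1)\cdot\mathbf{C}\mathbf{v}$. For $\star = i, f, o, c$, we define
$$\tilde{\mathbf{x}}_{\star,t-1} = \mathbf{W}_{\star b}\mathbf{s} \odot \mathbf{W}_{\star c}\mathbf{x}_{t-1}, \qquad \tilde{\mathbf{h}}_{\star,t-1} = \mathbf{U}_{\star b}\mathbf{s} \odot \mathbf{U}_{\star c}\mathbf{h}_{t-1}.$$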
Since we implement the SCN with LSTM units, we name this model SCN-LSTM. In experiments, since the LSTM is more powerful than the classical RNN, we only report results using SCN-LSTM.
In summary, distinct from previous image-captioning methods, our model has a unique way to utilize and combine the visual feature $\mathbf{v}$ and semantic-concept vector $\mathbf{s}$ extracted from an image $\mathbf{I}$. The feature $\mathbf{v}$ is fed into the LSTM to initialize the first step, which is expected to provide the LSTM an overview of the image content. While the LSTM state is initialized with the overall visual context $\mathbf{v}$, an ensemble of $K$ sets of LSTM parameters is utilized when decoding, weighted by the semantic-concept vector $\mathbf{s}$, to generate the caption.
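One decoding step of a tag-modulated LSTM might look like the following sketch. Each gate gets its own set of six factor matrices, the tag vector scales the projected input and hidden state element-wise, and the image feature enters only at the first step. The helper names, toy sizes, and parameter layout are hypothetical; the squashing function applied to the candidate cell and the cell output is left as a parameter (`squash`, defaulting to `tanh` as in common LSTM implementations, which is an assumption here).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def scn_lstm_step(x_prev, h_prev, c_prev, s, z, params, squash=np.tanh):
    """One step; params[g] = (Wa, Wb, Wc, Ua, Ub, Uc) for each gate g in 'ifoc'."""
    pre = {}
    for g in "ifoc":
        Wa, Wb, Wc, Ua, Ub, Uc = params[g]
        x_tilde = (Wb @ s) * (Wc @ x_prev)     # tag-modulated input
        h_tilde = (Ub @ s) * (Uc @ h_prev)     # tag-modulated hidden state
        pre[g] = Wa @ x_tilde + Ua @ h_tilde + z
    i, f, o = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
    c = i * squash(pre["c"]) + f * c_prev
    h = o * squash(c)
    return h, c

def init_params(rng, n_x, n_h, n_f, K, scale=0.01):
    """Uniform initialization in [-scale, scale] for every factor matrix."""
    shapes = [(n_h, n_f), (n_f, K), (n_f, n_x), (n_h, n_f), (n_f, K), (n_f, n_h)]
    return {g: tuple(rng.uniform(-scale, scale, size=sh) for sh in shapes)
            for g in "ifoc"}

rng = np.random.default_rng(0)
n_x, n_h, n_f, K, n_v = 4, 3, 3, 5, 6          # toy sizes (assumed)
params = init_params(rng, n_x, n_h, n_f, K)
C = rng.uniform(-0.01, 0.01, size=(n_h, n_v))
v, s = rng.normal(size=n_v), rng.uniform(size=K)
x = rng.normal(size=n_x)                        # stand-in for an embedded word

h, c = np.zeros(n_h), np.zeros(n_h)
for t in (1, 2, 3):
    z = C @ v if t == 1 else np.zeros(n_h)      # image feature only at t = 1
    h, c = scn_lstm_step(x, h, c, s, z, params)
```

In a full decoder the input at each step would be the embedding of the previously generated word rather than a fixed vector.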
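Given the image $\mathbf{I}$ and associated caption $\mathbf{X}$, the objective function is the sum of the log-likelihood of the caption conditioned on the image representation:
$$\log p(\mathbf{X} \mid \mathbf{I}) = \sum_{t=1}^{T} \log p(\mathbf{x}_t \mid \mathbf{x}_0, \ldots, \mathbf{x}_{t-1}, \mathbf{v}, \mathbf{s}).$$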
The above objective corresponds to a single image-caption pair. In training, we average over all training pairs.
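For a single caption, the objective can be computed from the decoder's unnormalized word scores at each step. The sketch below assumes those scores are already available as a `T x V` array (`logits`, a hypothetical name) and sums the log-probabilities of the reference words using a numerically stable log-softmax.

```python
import numpy as np

def caption_log_likelihood(logits, word_ids):
    """Sum over time steps of log softmax(logits[t])[word_ids[t]]."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(log_probs[np.arange(len(word_ids)), word_ids].sum())

# Toy check: uniform scores over a 4-word vocabulary, caption of length 3.
logits = np.zeros((3, 4))
word_ids = [1, 0, 3]
ll = caption_log_likelihood(logits, word_ids)
```

Averaging this quantity over all image-caption pairs gives the training objective to maximize.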
The above framework can be readily extended to the task of video captioning [10, 46, 47, 56, 3, 51]. In order to effectively represent the spatiotemporal visual content of a video, we use a two-dimensional (2D) and a three-dimensional (3D) CNN to extract visual features of video frames/clips. We then perform a mean pooling process over all 2D CNN features and 3D CNN features, to generate two feature vectors (one from 2D CNN features and the other from 3D CNN features). The representation of each video, $\mathbf{v}$, is produced by concatenating these two features. Similarly, we also obtain the semantic-concept vector $\mathbf{s}$ by running the semantic-concept detector based on the video representation $\mathbf{v}$. After $\mathbf{v}$ and $\mathbf{s}$ are obtained, we employ the same model proposed above directly for video-caption generation, as described in Figure 2(b).
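The video representation described above can be sketched as two mean-pooling operations followed by a concatenation. The feature arrays here are random stand-ins for CNN outputs; the 2048 and 4096 dimensions follow the feature sizes reported in the experimental setup, and the function name is hypothetical.

```python
import numpy as np

def video_representation(frame_feats_2d, clip_feats_3d):
    """Mean-pool per-frame and per-clip features, then concatenate them."""
    pooled_2d = frame_feats_2d.mean(axis=0)
    pooled_3d = clip_feats_3d.mean(axis=0)
    return np.concatenate([pooled_2d, pooled_3d])

rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(30, 2048))   # 30 frames of 2D CNN features
clip_feats = rng.normal(size=(6, 4096))     # 6 clips of 3D CNN features
v = video_representation(frame_feats, clip_feats)
```

The resulting vector has a fixed length regardless of how many frames or clips the video contains.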
We present results on three benchmark datasets: COCO, Flickr30k and Youtube2Text. COCO and Flickr30k are for image captioning, containing 123287 and 31783 images, respectively. Each image is annotated with at least 5 captions. We use the same pre-defined splits as in prior work for all the datasets: on Flickr30k, 1000 images for validation, 1000 for test, and the rest for training; and for COCO, 5000 images are used for both validation and testing. We further tested our model on the official COCO test set consisting of 40775 images (human-generated captions for this split are not publicly available), and evaluated our model on the COCO evaluation server. We also follow publicly available code to preprocess the captions, yielding vocabulary sizes of 8791 and 7414 for COCO and Flickr30k, respectively.
Youtube2Text is used for video captioning. It contains 1970 Youtube clips, and each video is annotated with around 40 sentences. We use the same splits as in prior work, with 1200 videos for training, 100 videos for validation, and 670 videos for testing. We convert all captions to lower case and remove the punctuation, yielding a vocabulary size of 12594 for Youtube2Text.
For image representation, we take the output of the 2048-way layer from ResNet-152, pretrained on the ImageNet dataset. For video representation, in addition to using the 2D ResNet-152 to extract features on each video frame, we also utilize a 3D CNN (C3D) to extract features on each video. The C3D is pretrained on the Sports-1M video dataset, and we take the output of the 4096-way layer from C3D as the video representation. We consider the RGB frames of videos as input, with 2 frames per second. Each video frame is resized to $112 \times 112$ and $224 \times 224$ for the C3D and ResNet-152 feature extractors, respectively. The C3D feature extractor is applied on video clips of length 16 frames with an overlap of 8 frames.
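Clips of 16 frames with an overlap of 8 frames correspond to a sliding window with a stride of 8. The sketch below lists the start index of each clip; the helper name is hypothetical, and dropping trailing frames that do not fill a whole clip is an assumption, since the text does not say how they are handled.

```python
def clip_starts(num_frames, clip_len=16, overlap=8):
    """Start indices of sliding-window clips; incomplete trailing clips are dropped."""
    stride = clip_len - overlap
    return list(range(0, num_frames - clip_len + 1, stride))

starts = clip_starts(40)          # a 20-second video sampled at 2 frames per second
clips = [(start, start + 16) for start in starts]
```

Each `(start, end)` pair would index the frames passed to the 3D CNN, and the per-clip outputs are then mean-pooled.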
We use the procedure described in Section 3.2 for semantic concept detection. The semantic-concept vocabulary size is determined to reflect the complexity of the dataset, which is set to 1000, 200 and 300 for COCO, Flickr30k and Youtube2Text, respectively. Since Youtube2Text is a relatively small dataset, we found that it is very difficult to train a reliable semantic-concept detector using the Youtube2Text dataset alone, due to its limited amount of data. In experiments, we utilize additional training data from COCO.
For model training, all the parameters in the SCN-LSTM are initialized from a uniform distribution in [-0.01, 0.01]. All bias terms are initialized to zero. Word embedding vectors are initialized with the publicly available word2vec vectors. The embedding vectors of words not present in the pretrained set are initialized randomly. The number of hidden units and the number of factors in SCN-LSTM are both set to 512 and we use mini-batches of size 64. The maximum number of epochs we run for all three datasets is 20. Gradients are clipped if the norm of the parameter vector exceeds 5. We do not perform any dataset-specific tuning and regularization other than dropout and early stopping on validation sets. The Adam algorithm with learning rate $2 \times 10^{-4}$ is utilized for optimization. All experiments are implemented in Theano. Code is publicly available at https://github.com/zhegan27/Semantic_Compositional_Nets.
In testing, we use beam search for caption generation, which selects the top-$k$ best sentences at each time step and considers them as the candidates to generate new top-$k$ best sentences at the next time step. We set the beam size to $k = 5$ in experiments.
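A compact beam search can be sketched independently of the caption model by passing in a function that returns next-word log-probabilities for a given prefix. Everything below is illustrative: `step_log_probs`, the toy three-word vocabulary, and the choice to rank hypotheses by unnormalized total log-probability are assumptions of this example, not details of the released code.

```python
import numpy as np

def beam_search(step_log_probs, start_id, end_id, beam_size, max_len=20):
    """Return the (token list, log-probability) pair with the highest score."""
    beams = [([start_id], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_p = step_log_probs(prefix)
            for w in np.argsort(log_p)[-beam_size:]:
                candidates.append((prefix + [int(w)], score + float(log_p[w])))
        candidates.sort(key=lambda cand: cand[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            # Hypotheses that emit the end token leave the beam.
            (finished if prefix[-1] == end_id else beams).append((prefix, score))
        if not beams:
            break
    finished.extend(beams)        # keep unfinished hypotheses if max_len is reached
    return max(finished, key=lambda cand: cand[1])

# Toy model over the vocabulary {0: start, 1: "a", 2: end}.
TABLE = {0: [0.1, 0.6, 0.3], 1: [0.1, 0.1, 0.8], 2: [1 / 3, 1 / 3, 1 / 3]}

def toy_step(prefix):
    return np.log(np.array(TABLE[prefix[-1]]))

best_tokens, best_score = beam_search(toy_step, start_id=0, end_id=2, beam_size=2)
```

In the toy model the best hypothesis emits "a" and then the end token, with probability 0.6 times 0.8. Practical systems often add length normalization, which this sketch omits.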
| Model | B-4 | METEOR | CIDEr-D |
|---|---|---|---|
| SCN-LSTM Ensemble of 5 | 0.511 | 0.335 | 0.777 |
The widely used BLEU, METEOR, ROUGE-L, and CIDEr-D metrics are reported in our quantitative evaluation of the performance of the proposed model and baselines in the literature. All the metrics are computed by using the code released by the COCO evaluation server. For COCO and Flickr30k datasets, besides comparing to results reported in previous work, we also re-implemented strong baselines for comparison. The results of image captioning are presented in Table 1. The models we implemented are as follows.
LSTM-R / LSTM-T / LSTM-RT: R, T, and RT denote the use of different features. Specifically, R denotes the ResNet visual feature vector, T denotes tags (i.e., the semantic-concept vector), and RT denotes the concatenation of R and T. The features are fed into a standard LSTM decoder only at the initial time step. In particular, LSTM-T is the model proposed in [49].
LSTM-RT2: The ResNet feature vector is sent to a standard LSTM decoder at the first time step, while the tag vector is sent to the LSTM decoder at every time step in addition to the input word. This model is similar to [54] without using semantic attention. This is the model closest to ours, which provides a direct comparison to our proposed model.
SCN-LSTM: This is the model presented in Section 3.4.
For video captioning experiments, we use the same notation. For example, LSTM-C means we leverage the C3D feature for caption generation.
We first present results on the task of image captioning, summarized in Table 1. The use of tags (LSTM-T) provides better performance than leveraging visual features alone (LSTM-R). Combining both tags and visual features further enhances performance, as expected. Compared with only feeding the tags into the LSTM at the initial time step (LSTM-RT), LSTM-RT2 yields better results, since it takes as input the tag feature at each time step. Further, the direct comparison between LSTM-RT2 and SCN-LSTM demonstrates the advantage of our proposed model, indicating that our approach is a better method to fuse semantic concepts into the LSTM.
We also report results averaging an ensemble of 5 identical SCN-LSTM models trained with different initializations, which is a widely adopted strategy (note that now we employ ensembles in two ways: an ensemble of LSTM parameters linked to tags, and an overarching ensemble atop the entire model). We obtain state-of-the-art results on both COCO and Flickr30k datasets. Remarkably, we improve the state-of-the-art BLEU-4 score by 3.1 points on COCO.
We also evaluate the proposed SCN-LSTM model by uploading results to the online COCO test server. Table 2 shows the comparison to the published state-of-the-art image captioning models on the blind test set as reported by the COCO test server. We include the models that have been published and rank in the top 3 in the table. Compared to these methods, our proposed SCN-LSTM model achieves the best performance across all the evaluation metrics on both c5 and c40 testing sets. (Please check https://competitions.codalab.org/competitions/3221#results for the most recent results.)
Results on video captioning are presented in Table 3. The SCN-LSTM achieves significantly better results than all competing methods in all metrics, especially in CIDEr-D. For self-comparison, it is also worth noting that our model improves over LSTM-CRT2 by a substantial margin. Again, using an overarching ensemble further enhances performance.
Figure 3 shows three examples to illustrate the semantic composition on caption generation. Our model properly describes the image content by using the correctly detected tags. By manually replacing specific tags, our model can adjust the caption smoothly. For example, in the left image, by replacing the tag “grass” with “bed”, our model imagines “a dog laying on top of a bed”. Our model is also able to generate novel captions that are highly unlikely to occur in real life. For instance, in the middle image, by replacing the tags “road” and “street” with “ocean”, our model imagines “a bus driving in the ocean”; in the right image, by replacing the tag “field” with “snow”, our model dreams “a group of zebras standing in the snow”.
The SCN not only picks up the tags well (and imagines the corresponding scenes), but also selects the right function words for different concepts to form a syntactically correct caption. As illustrated in sentence 6 of Figure 1(b), by replacing the tag “baby” with “girl”, the generated caption not only changes “a baby” to “a little girl”, but more importantly, changes “in its mouth” to “in her mouth”. In addition, the SCN also infers the underlying semantic relatedness between different tags. As illustrated in sentence 4 of Figure 1(b), when only switching on the tag “mouth”, the generated caption becomes “a man with a toothbrush”, indicating the semantic closeness between “mouth”, “man” and “toothbrush”. By further switching on “baby”, we generate a more detailed description “a baby brushing its teeth”.
The above analysis shows the importance of tags in generating captions. However, the SCN generates captions using both semantic concepts and the global visual feature vector. The language model learns to assemble semantic concepts (weighted by their likelihood), in consideration of the global visual information, into a coherent meaningful sentence that captures the overall meaning of the image. In order to demonstrate the importance of visual feature vectors, we train another SCN-LSTM-T model, which is an SCN-LSTM model without the visual feature inputs, i.e., with only tag inputs. As shown in the first example of Figure 4, the image tagger detects “dog” with high probability. Using only tag inputs, SCN-LSTM-T can only generate the wrong caption “a dog laying on top of a stuffed animal”. With additional visual feature inputs, our SCN-LSTM model correctly replaces “dog” with “teddy bear”.
We further present examples of generated captions on COCO with various other methods in Figure 5, along with the detected tags. As can be seen, our model often generates more reasonable captions than LSTM-R, due to the use of high-level semantic concepts. For example, in the first image, LSTM-R outputs a caption irrelevant to the image, while the detection of “table” and “library” helps our model to generate a more sensible caption. Further, although both our model and LSTM-RT utilize detected tags for caption generation, our model often depicts the image content more comprehensively; LSTM-RT has a larger potential to miss important details in the image. For instance, in the 3rd image, the tag “red” appears in the caption generated by our model, which is missed by LSTM-RT. This observation might be due to the fact that the SCN provides a better approach to fuse tag information into the process of caption generation. Similar observations can also be found in the video captioning experiments, as demonstrated in Figure 6.
We have presented the Semantic Compositional Network (SCN), a new framework to effectively compose the individual semantic meaning of tags for visual captioning. The SCN extends each weight matrix of the conventional LSTM to be a three-way matrix product, with one of these matrices dependent on the inferred tags. Consequently, the SCN can be viewed as an ensemble of tag-dependent LSTM bases, with the contribution of each LSTM basis unit proportional to the likelihood that the tag is present in the image. Experiments conducted on three visual captioning datasets validate the superiority of the proposed approach.
Most of this work was done when the first author was an intern at Microsoft Research. This work was also supported in part by ARO, DARPA, DOE, NGA, ONR and NSF.
Variational autoencoder for deep learning of images, labels and captions. In NIPS, 2016.
Factored conditional restricted Boltzmann machines for modeling motion style. In ICML, 2009.