Recent works dealing with the generation of text from data structures such as images (e.g., Karpathy and Li, 2015; Vinyals et al., 2015), videos (e.g., Venugopalan et al., 2015) or audio (e.g., Graves et al., 2013) have shown that supervised learning algorithms are capable of aligning semantic concepts across different modalities. In this work, we focus on automatic image captioning, a widely studied task at the intersection of vision and language research. Most approaches to image captioning operate by conditioning a decoder model on an abstracted representation of the input image instead of explicitly taking detected objects and visual relationships into account (e.g., Karpathy and Li, 2015; Xu et al., 2015). However, natural language descriptions in general and captions in particular are dominated by discrete objects standing in discrete relations. By forcing the generation process to go through a scene graph consisting of objects and relations, we impose an appropriate structural bias that is lacking in direct pixel-to-caption generation. We therefore approach the task of supervised image caption generation by developing an architecture that makes explicit use of detected visual objects and their semantic relationships in a given input image. More specifically, our method consists of a two-step approach that first extracts a scene graph (i.e., objects and their visual relationships) from an input image and then utilizes this representation to generate an image description in natural language. In doing so, we incorporate an existing method for supervised scene graph generation, MotifNet (Zellers et al., 2018), to extract visual semantic concepts from images and represent them in the form of scene graphs.
Scene graphs have been utilized in a variety of tasks such as image retrieval (e.g., Johnson et al., 2015) and image generation (Johnson et al., 2018) and are of particular interest for tasks dealing with the alignment of visual and textual concepts, since the representations utilize words to describe phenomena that are present in visual scenarios. While numerous approaches for image-to-graph generation and visual relationship detection have been proposed in recent years (e.g., Lu et al., 2016; Newell and Deng, 2017; Li et al., 2017; Yang et al., 2018a; Zellers et al., 2018; Zhang et al., 2019), little attention has thus far been paid to the problem of graph-to-text generation. We hence propose a variety of methods utilizing recurrent neural network mechanisms operating on scene graphs for the generation of natural language and show that the presence of visual objects and their relationships is beneficial for the automatic description of images.
Our work thus presents the following main contributions:
We propose a two-step supervised learning approach that generates scene graphs from raw input pixels and utilizes these graph representations to generate image descriptions in natural language.
We show that such a simple two-step approach outperforms conventional CNN-LSTM image captioning architectures.
2 Related work
The problem of end-to-end image caption generation has been studied widely in the context of deep learning in recent years. Pioneering approaches to this problem utilize a combination of convolutional and recurrent neural networks processing the visual and textual data representations, respectively. Multiple encoder-decoder approaches have been proposed that employ a CNN transforming a raw input image to a dense vector representation which is then used to condition a neural language model generating a descriptive sequence in natural language (e.g., Chen and Zitnick, 2015; Donahue et al., 2015; Vinyals et al., 2015; Karpathy and Li, 2015; Wang et al., 2016).
Building upon this idea, Xu et al. (2015) propose the first approach to incorporate an additional attention mechanism into the model’s decoder, enabling it to refer back to the abstracted image representation at each time step during the generation of an image caption. Subsequent approaches extend the incorporation of attention mechanisms for image captioning (e.g., Yang et al., 2016; Lu et al., 2017; Khademi and Schulte, 2018). For instance, Lu et al. (2017) extend the idea of incorporating visual attention to the image caption generation task by introducing an adaptive attention mechanism allowing the model to decide to what extent it should rely on the visual and linguistic features when generating an image caption.
2.1 Generating image captions from visual relationships
Although comparatively little attention has been paid to the generation of image captions via visual relationships, there exists a variety of works employing these characteristics to generate image captions.
Yao et al. (2018) propose an architecture that utilizes region-based visual relationships to generate an image caption for a given image. Specifically, their method uses the Faster R-CNN object detector (Ren et al., 2015) to identify a set of objects present in an input image. Afterwards, a classification method is applied to pairs of detected objects to identify their most probable semantic relationship. The resulting graph representation is then forwarded to two Graph Convolutional Neural Networks (GCN) that generate relation-aware region features for all the detected regions based on their predicted visual relationships. Finally, a two-layer LSTM is conditioned on the region-level features generated by the GCN module, and generates the image caption based on this representation. Yao et al. (2018) additionally install an attention mechanism in the LSTM decoder that operates over the region features at each time step when generating the output predictions.
Kim et al. (2019) propose a dense captioning mechanism that produces multiple individual captions per image. Their approach initially uses a bounding box object detector that identifies object regions present in an input image. Afterwards, a recurrent neural network is trained to generate a caption for each relational pair of identified objects.
Two recent works published by Li and Jiang (2019) and Hou et al. (2019) present approaches that are similar to our work. Li and Jiang (2019) use scene graphs for image captioning in conjunction with a hierarchical attention network. Their approach first uses a Region Proposal Network (Girshick, 2015) to compute object proposals for an input image. These proposals are then used to generate both a visual feature representation and semantic relationship features, which are forwarded to an LSTM decoder with a hierarchical attention module generating the image caption. Hou et al. (2019) provide a different method for incorporating scene graphs into the image captioning pipeline by utilizing scene graphs sourced from the Visual Genome dataset as external prior knowledge graphs.
The proposed approach for generating image captions via visual relationships is divided into two parts. Our model tackles the image-to-text generation task by first generating an intermediate scene graph representation of the input image and then decoding an image caption from this representation. Hence, our method conducts image-to-graph-to-text generation by approaching the subtasks of image-to-graph and graph-to-text in an isolated fashion. To achieve this, we use two neural network architectures that focus on each task independently, and stack both architectures together once they have been trained.
3.1 Scene graph generation
We initially aim to solve the problem of image-to-graph generation, i.e., generating a scene graph consisting of objects and visual relationships present in a given input image. Formally, our scene graph generator crafts a scene graph $G = (V, E)$ for an input image $I$ that consists of a set of nodes $V$ and corresponding directed edges $E$. Each node $v_i \in V$ is associated with a label $l_i$, representing an object in an image (e.g., car, person, building). Likewise, each edge $e_{ij} \in E$ is assigned a label $r_{ij}$ denoting a relationship between the two objects $v_i$ and $v_j$ (e.g., above, on).
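The scene graph definition above can be made concrete with a small data structure. The following is a minimal sketch, not the authors' implementation; the class and method names (`SceneGraph`, `Edge`, `add_node`, `add_edge`, `triples`) are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    """Directed, labelled edge between two node indices."""
    source: int
    target: int
    label: str


@dataclass
class SceneGraph:
    """Labelled directed graph: nodes are objects, edges are relationships."""
    node_labels: list = field(default_factory=list)
    edges: list = field(default_factory=list)

    def add_node(self, label):
        """Append an object node and return its index."""
        self.node_labels.append(label)
        return len(self.node_labels) - 1

    def add_edge(self, source, target, label):
        """Add a directed relationship between two existing nodes."""
        n = len(self.node_labels)
        if not (0 <= source < n and 0 <= target < n):
            raise IndexError("edge endpoints must be existing nodes")
        self.edges.append(Edge(source, target, label))

    def triples(self):
        """Return (subject label, relationship label, object label) triples."""
        return [
            (self.node_labels[e.source], e.label, self.node_labels[e.target])
            for e in self.edges
        ]


# Toy graph using the example labels from the text.
graph = SceneGraph()
person = graph.add_node("person")
car = graph.add_node("car")
building = graph.add_node("building")
graph.add_edge(person, car, "on")
graph.add_edge(building, car, "above")
```

Storing edges as index pairs rather than label pairs keeps two objects with the same label (e.g., two persons) distinguishable.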
In order to generate a graph from raw input pixels $I$, we make use of an existing scene graph generation model called MotifNet (Zellers et al., 2018). This method represents a scene graph as a triplet $G = (B, O, R)$, with a set of bounding boxes $B = \{b_1, \dots, b_n\}$, a set of objects $O = \{o_1, \dots, o_n\}$ where each $o_i$ corresponds to a bounding box $b_i$, and a set of relationships $R$ where each relationship is a triplet $(o_i, o_j, p_{ij})$. Here, $o_i$ and $o_j$ represent the start and end node of the relationship and $p_{ij} \in \mathcal{R}$ denotes the relationship between both nodes from all possible relationships $\mathcal{R}$. Based on this scene graph representation, MotifNet computes the probability of observing graph $G$ given image $I$ by decomposing it into three parts:
$$\Pr(G \mid I) = \Pr(B \mid I)\,\Pr(O \mid B, I)\,\Pr(R \mid B, O, I).$$
The bounding box model $\Pr(B \mid I)$ is realized by a Faster R-CNN object detector (Ren et al., 2015), and the object model $\Pr(O \mid B, I)$ uses an LSTM (Hochreiter and Schmidhuber, 1997) to estimate the bounding box labels. Subsequently, the authors employ a bidirectional LSTM to compute the relationships between objects identified by the object detector as denoted by $\Pr(R \mid B, O, I)$. To do so, all possible pairs of detected objects are taken into account and the LSTM computes a probability distribution over all potential relationships in $\mathcal{R}$ for each pair of objects.
3.2 Graph-to-text generation
Once we have generated a graph representation $G = (V, E)$ for an input image $I$, we utilize an LSTM decoder with an additional attention mechanism over the graph to generate an output sequence in natural language. Our architecture receives a set of graph nodes $V$ and maps each node $v_i \in V$ to an embedding representation $\mathbf{x}_i$ corresponding to its node label $l_i$. Hence, in order to represent visual relationships in this setup, we first transform our graph $G$ to a new representation $G' = (V', E')$ that differs from $G$ in that each edge label is now assigned an individual node in the graph, i.e., for each $e_{ij} \in E$ we create a new node $v_{ij} \in V'$ such that its label is the edge label $r_{ij}$, and add the edges $(v_i, v_{ij})$ and $(v_{ij}, v_j)$ to $E'$ with $v_i, v_j \in V$.
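The transformation described above, in which every labelled edge becomes a node of its own, can be sketched as follows. This is an illustrative reading of the text rather than the authors' code; the function name `edges_to_nodes` and the tuple-based graph encoding are assumptions.

```python
def edges_to_nodes(node_labels, edges):
    """Turn every labelled edge into its own node.

    node_labels: list of object labels, indexed by node id.
    edges: list of (source, target, label) tuples.
    Returns (new_labels, new_edges), where new_edges are unlabelled
    (source, target) pairs routed through the new relationship nodes.
    """
    new_labels = list(node_labels)
    new_edges = []
    for source, target, label in edges:
        relation_node = len(new_labels)   # id of the freshly created node
        new_labels.append(label)          # the node carries the edge label
        new_edges.append((source, relation_node))
        new_edges.append((relation_node, target))
    return new_labels, new_edges


labels, unlabelled_edges = edges_to_nodes(["person", "car"], [(0, 1, "on")])
```

After the transformation, object and relationship labels can share a single embedding table, since both are now plain node labels.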
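Our method then applies an LSTM to the matrix of each node’s embedding representation. To do so, we follow Xu et al. (2015) and first initialize the LSTM’s hidden and cell states as
$$\mathbf{h}_0 = f_{\text{init},h}\Big(\frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i\Big), \qquad \mathbf{c}_0 = f_{\text{init},c}\Big(\frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i\Big),$$
where $\mathbf{x}_1, \dots, \mathbf{x}_n$ are the node embeddings and $f_{\text{init},h}$ and $f_{\text{init},c}$ are two independent multilayer perceptrons. Based on this initial conditioning, we then decode the image caption by sampling from
$$p(\mathbf{y}_t \mid \mathbf{y}_{t-1}, \mathbf{h}_t) \propto \exp\big(L_o(W_e\,\mathbf{y}_{t-1} + L_h \mathbf{h}_t)\big)$$
at each time step $t$, thereby also following Xu et al. (2015). Here, $W_e \in \mathbb{R}^{m \times K}$ represents our word embedding matrix ($K$ is the vocabulary size), $\mathbf{y}_{t-1}$ is a one-hot representation of the model’s prediction at time step $t-1$ (or a special start token at the first time step), $\mathbf{h}_t$ is the LSTM’s hidden state at time step $t$ and $L_o$ and $L_h$ are trainable parameter matrices. In the remainder of this work, we refer to the combination of our graph encoder and this type of decoder as G-LSTM.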
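The two ingredients of the decoder described above, state initialization from the node embeddings and the output distribution at a single time step, can be sketched in plain Python. This is a sketch under stated assumptions, not the authors' implementation: the node embeddings are mean-pooled before initialization (as Xu et al. (2015) do for image features), a single tanh layer stands in for each multilayer perceptron, and all function and parameter names are hypothetical. The LSTM recurrence itself is omitted.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def mean_embedding(node_embeddings):
    """Average all node embeddings into one vector."""
    n = len(node_embeddings)
    return [sum(column) / n for column in zip(*node_embeddings)]


def init_state(node_embeddings, weights, bias):
    """One-layer stand-in for an init MLP: tanh(W * mean(x_i) + b)."""
    pooled = mean_embedding(node_embeddings)
    return [math.tanh(z + b) for z, b in zip(matvec(weights, pooled), bias)]


def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_word_distribution(W_e, L_o, L_h, prev_word_id, hidden):
    """softmax(L_o (W_e y_prev + L_h h)) for a one-hot y_prev."""
    prev_embedding = [row[prev_word_id] for row in W_e]  # column of W_e
    projected = matvec(L_h, hidden)
    combined = [a + b for a, b in zip(prev_embedding, projected)]
    return softmax(matvec(L_o, combined))


# Toy dimensions: 2 nodes, embedding size 2, vocabulary size 3.
nodes = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
h0 = init_state(nodes, identity, [0.0, 0.0])
W_e = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # m x K
L_o = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # K x m
probs = next_word_distribution(W_e, L_o, identity, 0, h0)
```

Multiplying the embedding matrix with a one-hot vector is just a column lookup, which is why `next_word_distribution` indexes `W_e` instead of performing a full matrix product.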
Our second model variant incorporates an additional attention mechanism operating over the latent graph representation at each time step of the LSTM. We adapt Xu et al. (2015)’s approach for image captioning with visual attention and replace the latent image representation with our graph nodes, thus enabling our model to refer back to the graph representation and identify the most salient nodes at each time step during the generation of the output sequence. We call this extended approach G-LSTM+att.
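Attention over graph nodes can be illustrated with a small soft-attention step: each node embedding is scored against the current decoder state, the scores are normalized with a softmax, and the context vector is the weighted sum of the node embeddings. The additive scoring function below is one common choice and is an assumption here, as are all names (`attend`, `W_x`, `W_h`, `v`); the source only states that the approach of Xu et al. (2015) is adapted.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(node_embeddings, hidden, W_x, W_h, v):
    """Soft attention: score_i = v^T tanh(W_x x_i + W_h h)."""
    projected_hidden = matvec(W_h, hidden)
    scores = []
    for x in node_embeddings:
        projected_node = matvec(W_x, x)
        scores.append(sum(
            v_k * math.tanh(a + b)
            for v_k, a, b in zip(v, projected_node, projected_hidden)
        ))
    weights = softmax(scores)
    dim = len(node_embeddings[0])
    # Context vector: attention-weighted sum of the node embeddings.
    context = [
        sum(w * x[d] for w, x in zip(weights, node_embeddings))
        for d in range(dim)
    ]
    return weights, context


identity = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attend(
    [[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0], identity, identity, [1.0, 0.0]
)
```

Because the weights are attached to individual graph nodes, they can be inspected per object or relationship label at every decoding step.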
3.3 Encoding visual relationships
The aforementioned G-LSTM+att does not explicitly incorporate the visual relationships between objects as represented in the scene graph, but instead only processes all object and relationship nodes to generate an image caption. We thus experiment with the incorporation of an additional graph encoder that maps the initial graph representation $X = (\mathbf{x}_1, \dots, \mathbf{x}_n)$ to an output representation $X' = (\mathbf{x}'_1, \dots, \mathbf{x}'_n)$. The task of this graph encoder is to encode relational information for each graph node into its corresponding graph embedding to provide the decoder with semantic dependencies between individual nodes in the graph. Additionally, the encoder has the ability to process indirect connections between entities in order to contextualize global relationships between entities that are indirectly connected through multiple edges. Graph Attention Networks (GAT; Veličković et al., 2018) represent a gradient-based approach that transforms an input graph by individually attending over each node’s neighborhood to encode relational information into the resulting node representations. For a given input graph $G' = (V', E')$, we then define a graph representation $(X, \tilde{E})$, where $X$ contains the node embeddings and $\tilde{E}$ represents the set of undirected edges in $G'$ corresponding to $E'$.
GAT layers transform the node representations by computing attention over their neighborhoods. Formally, let $\mathcal{N}_i$ denote the neighborhood of a node embedding $\mathbf{x}_i$. A GAT layer then transforms each $\mathbf{x}_i$ to $\mathbf{x}'_i$ by computing
$$\mathbf{x}'_i = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W \mathbf{x}_j\Big),$$
where $\sigma$ represents the sigmoid function and $\alpha_{ij}$ is an attention coefficient with respect to the nodes $i$ and $j$. We follow Veličković et al. (2018) and set
$$\alpha_{ij} = \frac{\exp\big(\mathrm{LR}(\mathbf{a}^\top [W \mathbf{x}_i \,\|\, W \mathbf{x}_j])\big)}{\sum_{k \in \mathcal{N}_i} \exp\big(\mathrm{LR}(\mathbf{a}^\top [W \mathbf{x}_i \,\|\, W \mathbf{x}_k])\big)},$$
where $W$ and $\mathbf{a}$ are trainable weights, $\|$ represents vector concatenation and LR denotes the LeakyReLU activation function. In our experiments, we define $i \in \mathcal{N}_i$ to ensure a direct connection between an input node and its transformation in each GAT layer.
Our final graph encoder then consists of multiple GAT layers that are executed sequentially to transform the node embedding representations with respect to their relationships in the graph. Once our encoder has processed the initial graph embedding representation, we then feed our G-LSTM models with this representation and train the entire model in an end-to-end fashion. We denote both model variants with G-LSTM+enc and G-LSTM+enc+att.
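A single-head GAT layer with self-connections, and the sequential stacking of such layers, can be sketched in plain Python as below. This follows the general formulation of Veličković et al. (2018) with a sigmoid output nonlinearity as stated in the text; it is an illustrative sketch, not the authors' code. The function names, the LeakyReLU slope of 0.2 and the toy weights are assumptions, and dropout and multi-head attention are left out.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def leaky_relu(value, slope=0.2):
    return value if value > 0 else slope * value


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def gat_layer(embeddings, undirected_edges, W, a):
    """One single-head GAT layer over an undirected graph with self-loops."""
    n = len(embeddings)
    neighbors = [{i} for i in range(n)]  # every node attends to itself
    for i, j in undirected_edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    projected = [matvec(W, x) for x in embeddings]
    output = []
    for i in range(n):
        order = sorted(neighbors[i])
        # Unnormalized scores: LeakyReLU(a^T [W x_i || W x_j]).
        scores = [
            leaky_relu(sum(a_k * z for a_k, z in zip(a, projected[i] + projected[j])))
            for j in order
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        aggregated = [
            sum(alpha * projected[j][d] for alpha, j in zip(alphas, order))
            for d in range(len(projected[i]))
        ]
        output.append([sigmoid(z) for z in aggregated])
    return output


def gat_encoder(embeddings, undirected_edges, layers):
    """Apply several GAT layers sequentially; layers is a list of (W, a)."""
    for W, a in layers:
        embeddings = gat_layer(embeddings, undirected_edges, W, a)
    return embeddings


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
uniform = [0.0, 0.0, 0.0, 0.0]  # zero attention vector gives uniform weights
out = gat_layer(x, [(0, 1)], identity, uniform)
stacked = gat_encoder(x, [(0, 1)], [(identity, uniform), (identity, uniform)])
```

With two stacked layers, a node can receive information from nodes two edges away, which is how indirect connections enter the node representations.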
3.4 Conventional image captioning baselines
To provide a comparison between our approach and conventional CNN-LSTM image captioning, we adapt Xu et al. (2015)’s method. We preprocess each input image using the VGG19 network (Simonyan and Zisserman, 2015) pre-trained on ImageNet, and condition our LSTM language model on the feature representation emitted by the fifth layer of VGG19 before applying max-pooling. Analogously to the graph-to-text models, we furthermore experiment with an additional visual attention mechanism operating over the input image (see Xu et al. (2015)). We denote both approaches with Pixel2Caption and Pixel2Caption+att.
Figure 1 provides an overview and comparison of both the G-LSTM+att and Pixel2Caption+att models. Both follow a similar technique of first encoding an input image by transforming it to a latent representation. This latent representation is then used to decode the corresponding image caption using an attention mechanism. However, a major difference between Pixel2Caption+att and G-LSTM+att is that the latent representation of the latter (i.e., the scene graph) allows humans to explicitly observe which visual and contextual information has been extracted from the image. This property is not given for the Pixel2Caption+att approach, since the latent representation emitted by the CNN is highly abstracted and hence less interpretable.
We conduct a series of experiments on a subset of the Visual Genome (Krishna et al., 2017) and MS COCO (Lin et al., 2014) datasets consisting of images accompanied by bounding boxes, scene graphs and individual image captions.
We use the BLEU (Papineni et al., 2002) and METEOR (Denkowski and Lavie, 2014) evaluation metrics to measure the performance of our proposed approaches and to be able to compare them to existing methods for image caption generation. Both metrics have been used in a variety of studies related to image caption generation (e.g., Xu et al., 2015; Vinyals et al., 2015; Lu et al., 2017).
Our dataset consists of a subset of all 51,498 images at the intersection of the Visual Genome and MS COCO datasets. First, we split the 51,498 images into a test set of 5,000 images, a validation set of 1,000 images and a training set of 45,498 samples. Operating on the intersection of VG and MS COCO allows us to craft triplet samples consisting of an image, a corresponding scene graph and a list of captions describing the image. In order to be as consistent as possible with the existing literature on scene graph generation, we then match all dataset samples with a modified Visual Genome dataset as explained in Xu et al. (2017), considering only the 150 most common object categories and 50 most common relationships. As each image in the training set is on average accompanied by 5.002 captions sourced from MS COCO, the graph-to-text generation module can be trained with a total amount of 221,792 (scene graph, caption) pairs.
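The split sizes stated above can be checked with a few lines of Python. The sketch below assumes a seeded random shuffle over integer image ids; the source does not say how the split was drawn, so the shuffling strategy, the seed and the function name `split_ids` are assumptions.

```python
import random


def split_ids(image_ids, n_test, n_val, seed=0):
    """Shuffle ids and cut them into train, validation and test partitions."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test


# 51,498 images at the intersection of Visual Genome and MS COCO.
train, val, test = split_ids(range(51498), n_test=5000, n_val=1000)
```

Taking the training partition as the remainder guarantees that the three partitions are disjoint and together cover all 51,498 images.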
An example for a single element from our generated dataset (image, scene graph and captions) can be found in Figure 2.
During validation and testing, we evaluated our model’s predictions using all available captions for a given image.
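Evaluating against all available captions matters because BLEU clips each candidate n-gram count by the maximum count observed in any single reference. The sketch below implements only this modified n-gram precision, following Papineni et al. (2002); it omits the brevity penalty and the geometric mean over n-gram orders, so it is not the full BLEU score. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against several references."""
    candidate_counts = ngrams(candidate, n)
    if not candidate_counts:
        return 0.0
    # For every n-gram, the highest count seen in any single reference.
    max_reference = Counter()
    for reference in references:
        for gram, count in ngrams(reference, n).items():
            max_reference[gram] = max(max_reference[gram], count)
    clipped = sum(
        min(count, max_reference[gram])
        for gram, count in candidate_counts.items()
    )
    return clipped / sum(candidate_counts.values())


candidate = "the the the the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
precision = modified_precision(candidate, references, 1)
```

In this example from Papineni et al. (2002), "the" occurs at most twice in a single reference, so only two of the seven candidate unigrams are credited.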
4.2 Implementation details and training
We trained the individual submodules responsible for the image-to-graph and graph-to-text generation independently on the aforementioned datasets.
The scene graph generator was trained by strictly following Zellers et al. (2018)’s approach to train their proposed model (we followed the authors’ instructions at https://github.com/rowanz/neural-motifs).
This approach consists of three phases. First, a Faster R-CNN object detector with a VGG backbone is pre-trained in isolation to learn the extraction of objects and corresponding bounding boxes from images. We adhered to the architecture and parameter setup as explained in their work, and trained the detector for 50 epochs. After training the object detector, we trained the MotifNet module for 26 epochs without modifying the authors’ implementation setup (this includes the adaptation to scene graph detection as explained in Zellers et al. (2018), Section 5.2).
For the graph-to-text models, we tokenized all sequences used during training using the NLTK tokenize package (Loper and Bird, 2002). We did not exclude infrequent vocabulary tokens during our analysis. All reported models were trained using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of . In terms of model regularization, we used dropout (Srivastava et al., 2014) in both the encoder and the decoder during training. In the encoder, we added a dropout mechanism with a rate of 0.25 at each GAT layer directly before computing the weighted sum of the transformed graph inputs. In the decoder, we adhered to the use of dropout as realized by Xu et al. (2015) and used a dropout rate of 0.5. Moreover, we used batch normalization (Ioffe and Szegedy, 2015) in the LSTM decoder by normalizing the encoder outputs before transforming them to the LSTM’s initial hidden and cell states. Our graph encoder consists of two consecutive GAT layers that are operating on a dimension of . We set the dimension of the trainable graph and word embeddings to the same size and utilized a single-layer LSTM with 1024 hidden units as decoder.
We trained our two conventional image captioning baselines Pixel2Caption and Pixel2Caption+att with the same hyperparameter settings.
4.3 Tuning MotifNet on the validation set
For a given input image, the trained MotifNet generates both a list of detected bounding boxes along with their predicted labels as well as a list of relationship predictions between the identified objects. In detail, it outputs a probability distribution over all 50 possible relationship predicates for each pair of predicted objects. However, the Faster R-CNN object detector predicts certain bounding box labels with low confidence values, which might result in scene graph representations with high model uncertainty. To account for this problem, and to limit the size (i.e., number of nodes) of the generated scene graphs, we experimented with various confidence thresholds, i.e., lower bounds on the detector confidence that a detection must reach to be considered a valid object of an image. Specifically, we considered the confidence thresholds 0.2, 0.4, 0.6, and 0.8 for our trained models. For each of the four G-LSTM model variants, we evaluated to what extent these thresholds affected the overall model performance (in terms of METEOR) by measuring how each variant performs on the validation set under each threshold. Our results suggest that the G-LSTM, G-LSTM+enc+att and G-LSTM+enc models exhibit their best performance with a confidence threshold of 0.4, while the G-LSTM+att variant performs best with a confidence threshold of 0.2.
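The threshold tuning described above amounts to filtering detections by confidence and keeping the threshold with the best validation score. The sketch below is an assumption-laden illustration: detections are modelled as `(label, confidence)` pairs, the lower bound is treated as inclusive, and the scoring callable in the usage example is a synthetic stand-in, not a METEOR computation or a result from the paper.

```python
def filter_detections(detections, threshold):
    """Keep detections whose confidence reaches the lower bound."""
    return [d for d in detections if d[1] >= threshold]


def select_threshold(thresholds, validation_score):
    """Return the threshold that maximizes a validation scoring function."""
    return max(thresholds, key=validation_score)


detections = [("person", 0.9), ("car", 0.3), ("tree", 0.1)]
kept = filter_detections(detections, 0.2)

# Synthetic scoring function for illustration only; in practice this would
# caption the validation set at the given threshold and return METEOR.
best = select_threshold([0.2, 0.4, 0.6, 0.8], lambda t: -abs(t - 0.4))
```

Passing the scoring function as a callable keeps the selection logic independent of the metric and of the model variant being tuned.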
Once we had identified all valid predicted objects present in the image, we selected the graph’s relationships by considering all relationships between valid objects suggested by MotifNet and assigned the predicate with the highest probability as the relationship label.
Finally, we removed all duplicate nodes and identical relationships from the crafted scene graph. If a generated scene graph exceeded the maximum size (i.e., number of nodes) of the graphs used during training, we limited the graph’s size to this maximum by removing the object nodes exhibiting the lowest prediction confidences. Moreover, if a scene graph consisted of fewer than two predicted object nodes, we ignored the sample during testing.
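The post-processing steps above can be sketched as one function. Several details are assumptions made for illustration: duplicate nodes are taken to be detections with the same label (the most confident one is kept), relationship triples count towards the graph size because each becomes a node in the graph-to-text input, and all names and the input encoding are hypothetical.

```python
def build_scene_graph(objects, relation_probs, predicates, threshold, max_nodes):
    """Post-process detector output into a scene graph.

    objects: list of (label, confidence) detections.
    relation_probs: dict mapping (i, j) index pairs to a list of
        probabilities, one per entry of `predicates`.
    Returns (node_labels, triples), or None if fewer than two objects remain.
    """
    # Keep confident detections; merge duplicates by label.
    kept = {}
    for index, (label, confidence) in enumerate(objects):
        if confidence < threshold:
            continue
        if label not in kept or confidence > objects[kept[label]][1]:
            kept[label] = index
    valid = sorted(kept.values(), key=lambda i: objects[i][1], reverse=True)

    def triples_for(indices):
        chosen = set(indices)
        found = set()  # a set removes identical relationships
        for (i, j), probs in relation_probs.items():
            if i in chosen and j in chosen and i != j:
                best = max(range(len(probs)), key=probs.__getitem__)
                found.add((objects[i][0], predicates[best], objects[j][0]))
        return found

    # Drop the least confident objects until the graph fits.
    triples = triples_for(valid)
    while valid and len(valid) + len(triples) > max_nodes:
        valid.pop()
        triples = triples_for(valid)

    if len(valid) < 2:
        return None
    return [objects[i][0] for i in valid], sorted(triples)


objects = [("person", 0.9), ("horse", 0.8), ("person", 0.5), ("hat", 0.1)]
predicates = ["on", "wearing"]
relation_probs = {
    (0, 1): [0.7, 0.3],
    (1, 0): [0.6, 0.4],
    (0, 3): [0.1, 0.9],
    (2, 1): [0.9, 0.1],
}
result = build_scene_graph(objects, relation_probs, predicates, 0.4, 10)
```

Recomputing the triples after every removal ensures that no relationship refers to an object that has been dropped.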
Quantitative results of all our models can be found in Table 1. The results of all graph-based models are based on the scene graphs generated by MotifNet, which we trained before on our new dataset. The results in Table 1 show that both the G-LSTM+att and the G-LSTM outperform both the Pixel2Caption and Pixel2Caption+att in every metric, indicating that our proposed models represent a suitable alternative to the conventional image captioning approaches. Figure 3 shows qualitative results of the Pixel2Caption+att and G-LSTM+att approaches in comparison, showing that our model is able to produce accurate captions even in the presence of imperfect auto-generated scene graphs. Furthermore, it is interesting to observe that the additional graph encoder operating over the input scene graph leads to performance decreases of our G-LSTM model. In addition to that, for both the conventional and the captioning model based on scene graphs, the attention mechanism operating on the decoding LSTM only slightly improves the overall model performance across our evaluation metrics.
To further assess the performance of our models when operating on generated scene graphs, we provide metrics for all model variants when evaluated on the ground-truth gold scene graphs as provided in the Visual Genome dataset in Table 2. Our models exhibit even higher performances when evaluated on the gold scene graphs, indicating that our method has the potential to benefit from future progress in the field of scene graph generation.
The presented method has a number of limitations that we address in the following. First, our image-to-graph-to-text model utilizes a scene graph generation model that is restricted to predicting only 150 different object labels and 50 different edge labels. This represents a notable limitation of the model, since it is explicitly trained to predict diverse English sentences from only a small subset of semantic concepts. Nevertheless, the fact that our proposed methods outperform conventional image captioning approaches (which do not have this additional constraint) suggests that the model still learns to predict semantic concepts outside of the 200 given ones in context, and manages to generate other concepts that are likely to occur in the context of certain objects and relationships as represented by the scene graph.
Moreover, it is worth mentioning that our proposed approach arguably requires a larger amount of computational resources to be trained properly when compared to conventional image captioning methods. In addition to that, the current study does not investigate the potential of our proposed architecture when trained in an end-to-end fashion, i.e., by developing a single pipeline that processes an input image, generates a scene graph representation and then uses this representation to create a corresponding image caption. At this point we would like to encourage other researchers focusing on image captioning to further explore the potential of explicitly incorporating visual objects and relationships with respect to this problem.
In this work, we proposed a supervised learning approach to generate image captions by explicitly leveraging detected objects and visual relationships. Our suggested model consists of a simple two-step procedure that first generates a scene graph representation from a given image and then uses this representation to generate an image description in natural language. Empirical results on a newly-generated dataset consisting of samples from the intersection of Visual Genome and MS COCO demonstrate the superiority of our model when compared to conventional image captioning approaches, indicating that our method provides a fruitful ground to further advance the task of image captioning.
We gratefully acknowledge a Ph.D. scholarship awarded to the second author by the German Academic Scholarship Foundation (Studienstiftung des deutschen Volkes). This work was supported by the BMBF as part of the project MLWin (Grant No. 01IS18050) as well as the Munich Center for Machine Learning (Grant No. 01IS18036B).
- Chen and Zitnick (2015). Mind’s eye: a recurrent visual representation for image caption generation. In Proceedings of the 28th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2422–2431.
- Denkowski and Lavie (2014). Meteor universal: language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.
- Donahue et al. (2015). Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the 28th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2625–2634.
- Girshick (2015). Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448.
- Graves et al. (2013). Speech recognition with deep recurrent neural networks. In The 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 6645–6649.
- Hochreiter and Schmidhuber (1997). Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- Hou et al. (2019). Relational reasoning using prior knowledge for visual captioning. arXiv preprint arXiv:1906.01290.
- Ioffe and Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 448–456.
- Johnson et al. (2015). Image retrieval using scene graphs. In Proceedings of the 28th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 28, pp. 3668–3678.
- Johnson et al. (2018). Image generation from scene graphs. In Proceedings of the 31st IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1219–1228.
- Karpathy and Li (2015). Deep visual-semantic alignments for generating image descriptions. In CVPR, pp. 3128–3137.
- Khademi and Schulte (2018). Image caption generation with hierarchical contextual visual spatial attention. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
- Kim et al. (2019). Dense relational captioning: triple-stream networks for relationship-based captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6271–6280.
- Kingma and Ba (2014). Adam: a method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).
- Krishna et al. (2017). Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (1), pp. 32–73.
- Li and Jiang (2019). Know more say less: image captioning based on scene graphs. IEEE Transactions on Multimedia.
- Li et al. (2017). Scene graph generation from objects, phrases and region captions. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1270–1279.
- Lin et al. (2014). Microsoft COCO: common objects in context. In Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Eds.), Cham, pp. 740–755.
- Loper and Bird (2002). NLTK: the natural language toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics. Philadelphia: Association for Computational Linguistics.
- Lu et al. (2016). Visual relationship detection with language priors. In Proceedings of the European Conference on Computer Vision (ECCV).
- Lu et al. (2017). Knowing when to look: adaptive attention via a visual sentinel for image captioning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3242–3250.
- Newell and Deng (2017). Pixels to graphs by associative embedding. In Advances in Neural Information Processing Systems 30, pp. 2171–2180.
- Papineni et al. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318.
- Ren et al. (2015). Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 91–99.
- Simonyan and Zisserman (2015). Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR).
- Srivastava et al. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958.
- Veličković et al. (2018). Graph Attention Networks. In Proceedings of the 2018 International Conference on Learning Representations (ICLR).
- Venugopalan et al. (2015). Sequence to sequence – video to text. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
- Vinyals et al. (2015). Show and tell: a neural image caption generator. In Proceedings of the 2015 Conference on Computer Vision and Pattern Recognition (CVPR).
- Wang et al. (2016). Image captioning with deep bidirectional LSTMs. In ACM Multimedia.
- Xu et al. (2017). Scene graph generation by iterative message passing. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3097–3106.
- Xu et al. (2015). Show, attend and tell: neural image caption generation with visual attention. In Proceedings of the 2015 International Conference on Machine Learning (ICML), pp. 2048–2057.
- Yang et al. (2018a). Graph R-CNN for scene graph generation. In Proceedings of the European Conference on Computer Vision (ECCV).
- Yang et al. (2018b). Auto-encoding scene graphs for image captioning. arXiv preprint arXiv:1812.02378.
- Yang et al. (2016). Review networks for caption generation. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 2361–2369.
- Yao et al. (2018). Exploring visual relationship for image captioning. In ECCV.
- Zellers et al. (2018). Neural motifs: scene graph parsing with global context. In Conference on Computer Vision and Pattern Recognition.
- Zhang et al. (2019). Large-scale visual relationship understanding. In AAAI.