Seq-SG2SL: Inferring Semantic Layout from Scene Graph Through Sequence to Sequence Learning

08/19/2019, by Boren Li et al.

Generating a semantic layout from a scene graph is a crucial intermediate task connecting text to image. We present a conceptually simple, flexible and general framework using sequence-to-sequence (seq-to-seq) learning for this task. The framework, called Seq-SG2SL, derives sequence proxies for the two modalities, and a Transformer-based seq-to-seq model learns to transduce one into the other. A scene graph is decomposed into a sequence of semantic fragments (SF), one for each relationship. A semantic layout is represented as the consequence of a series of brick-action code segments (BACS), dictating the position and scale of each object bounding box in the layout. Viewing the two building blocks, SF and BACS, as corresponding terms in two different vocabularies, a seq-to-seq model is fittingly used to translate between them. A new metric, semantic layout evaluation understudy (SLEU), is devised to evaluate the task of semantic layout prediction, inspired by BLEU. SLEU defines relationships within a layout as unigrams and looks at the spatial distribution for n-grams. Unlike the binary precision of BLEU, SLEU allows for some spatial tolerance through thresholding the Jaccard Index and is consequently better adapted to the task. Experimental results on the challenging Visual Genome dataset show improvement over a non-sequential approach based on graph convolution.


1 Introduction

Learning the relation from a semantic description to its visual incarnation leads to important applications, such as text-to-image synthesis [24] and semantic image retrieval [11]. It remains a challenging and fundamental problem in computer vision [10]. Recent research has gradually formalized structured representations for the two modalities: the scene graph [11][13] for semantic description and the semantic layout [9][26] for the image. Our goal in this work is therefore to solve the underlying task of inferring a semantic layout from a scene graph, connecting text to image.

Figure 1: The Seq-SG2SL framework for inferring semantic layout from scene graph.

Most existing works infer semantic layout from text [7][9][20]. However, leading methods still struggle with complex text inputs depicting multiple objects, owing to the unstructured nature of text. Hence, Johnson et al. [10] pioneered inferring semantic layout from scene graph as a separate task, isolated from semantic parsing [18]. The scene graph is adopted because it is a powerful structured representation that efficiently conveys scene contents in text [13]. As a notable step forward, they observed that a semantic layout is largely constrained by the objects within relationships. As such, they developed a graph convolution network that embeds a scene graph containing only objects within relationships into respective object feature vectors, which further dictate the semantic layout through an object layout network. However, they inferred all object feature embeddings simultaneously from a scene graph that comprises an exponential variability of object and relationship combinations. It is extremely challenging for a model to express such prohibitive diversity.

We view this task from a novel perspective to avoid the combinatorial explosion that largely restricted model expressiveness in the past. Inferring a semantic layout from a scene graph can be compared to constructing a building from its blueprint. It is unwise to offer a corpus of blueprint-to-building correspondences to directly train a learner how to construct a building from its blueprint. Instead, teaching the basic actions that stack the building blocks based on their counterparts in the blueprint is much more feasible. What determines the building blocks? It is the relationship.

We propose a conceptually simple, flexible and general framework using sequence-to-sequence (seq-to-seq) learning to infer semantic layout from scene graph (Figure 1). The framework, called Seq-SG2SL, derives sequence proxies for the two modalities, and a Transformer-based seq-to-seq model learns to transduce one into the other. A scene graph is decomposed into a sequence of semantic fragments (SF), one for each relationship. A semantic layout is the consequence of a series of brick-action code segments (BACS), dictating the position and scale of each object bounding box in the layout. Viewing the two building blocks, SF and BACS, as corresponding terms in two different vocabularies, a seq-to-seq model is fittingly used to translate between them. Seq-SG2SL is an intuitive framework that learns BACS to drag-and-drop and scale-adjust the two bounding boxes of the subject and object in a relationship onto the layout, supervised by its SF counterpart.

Direct and automated evaluation for semantic layout prediction is another challenging problem unto itself. A new metric, semantic layout evaluation understudy (SLEU), is devised for this purpose, inspired by BLEU [15]. SLEU defines relationships within a layout as unigrams and looks at the spatial distribution for n-grams. Unlike the binary precision of BLEU, SLEU allows for some spatial tolerance through thresholding the Jaccard Index and is consequently better adapted to the task. Mean-SLEU over a large corpus is a proper metric for evaluation.

We experiment on the challenging Visual Genome (VG) dataset [13], which provides human-annotated pairs of scene graph and semantic layout for each image. We first show qualitative results from Seq-SG2SL and rationalize SLEU intuitively. Further quantitative comparison shows the advantages of Seq-SG2SL over a non-sequential approach based on graph convolution [10], especially in the aspect of model expressiveness. We show further that this advantage originates from our sequential formulation, not merely from the Transformer model. Various aspects of Seq-SG2SL are studied extensively through additional ablation experiments.

The key contributions are:


  • Seq-SG2SL is the first framework to infer semantic layout from scene graph through seq-to-seq learning and outperforms the non-sequential state-of-the-art model by a significant margin.

  • SLEU is the first automatic metric to directly evaluate the performance of semantic layout prediction, allowing results to be reproduced.

2 Related Works

Scene Graph: A scene graph is a directed graph representing a scene, where nodes are objects and edges give relationships between objects. Johnson et al. [11] first introduced the notion of scene graph as query input for semantic image retrieval. They predicted the most likely semantic layout from a query scene graph as an intermediate result for the end retrieval task through a conditional random field (CRF) model. Meanwhile, Schuster et al. [18] complemented their work by introducing an automatic approach to create a scene graph from unstructured natural language scene descriptions. Following these pioneering attempts, Krishna et al. [13] constructed the VG dataset, which aims to bridge language and vision using dense image annotations. Scene graph and semantic layout were adopted as intermediate representations for the two modalities. With the advent of VG, the scene graph has further shown its value in successive research, such as predicting grounded scene graphs for images [14][23], evaluating image captioning [1], and image generation [10].

Semantic Description to Semantic Layout: Semantic layout was first formally defined as the spatial distribution of clip arts in the abstract scenes proposed by Zitnick and Parikh [28]. This representation initially aimed at inferring high-level semantics directly from images. By contrast, Zitnick et al. [27] formulated the opposite problem of predicting an abstract scene from its textual description and proposed a solution using a CRF. As clip arts in an abstract scene generalize readily to object bounding boxes in a semantic layout, this concept extends to real images [20].

Predicting a semantic layout from text is usually posed as an intermediate step for complex image generation [7][9]. A complex image refers to one containing multiple interactive objects. Unlike the family of methods [16][24] that can give stunning results on limited domains, such as fine-grained descriptions of birds or flowers, a semantic layout is usually necessary for complex image generation to dictate the multi-object spatial distribution depicted in a text. Johnson et al. [10] pioneered structuring text into a scene graph for further complex image generation. This work is the closest to ours. We adopt the same idea but focus only on its subtask of semantic layout prediction from scene graph. The rest of the task, image generation from layout, can be solved separately as in [26]. In contrast to this closest work, which proposed a non-sequential approach based on graph convolution, Seq-SG2SL views the task from a novel perspective and formulates the problem in a seq-to-seq manner.

In quantitative studies, no existing work performs direct evaluation of semantic layout prediction. Instead, they apply indirect metrics, such as the inception score or the image captioning score derived from the generated image. Though human evaluations are incorporated for further assessment, these results are very expensive to reproduce, which blocks the way for fruitful research ideas in this field. The circumstance is strikingly similar to that before BLEU [15] was first introduced to the field of machine translation.

Sequence to Sequence Learning: RNNs, LSTM [8] and GRU [5] are firmly established for sequence modelling and transduction problems, such as language modelling and machine translation [4][19]. Attention mechanisms have further become integral parts of compelling sequence modelling and transduction models, allowing dependencies to be modelled regardless of their distance in the input or output sentences [2]. Vaswani et al. [21] introduced the Transformer architecture, which boosts machine translation performance by a significant margin. Larger-scale architecture exploration specifically for machine translation was conducted using the Transformer to further converge towards optimal settings [3]. A similar conclusion was also drawn by [12]. Seq-to-seq learning is still evolving rapidly, and this experience can be borrowed for our purpose.

3 Seq-SG2SL

Seq-SG2SL is conceptually simple: it predicts a series of actions that form a consequent layout from a corresponding sequence of symbolic triplets derived from relationships in a scene graph. Next, we introduce the design of sequence proxies for the two modalities, which is the key design in our work.

3.1 Sequence Proxies

A scene graph encodes objects, attributes and relationships, while its resulting layout is constrained merely by the objects within relationships. Therefore, a scene graph is first preprocessed to drop all attributes and independent objects not within any relationship.

A relationship in a scene graph is represented by a symbolic triplet: subject, predicate, object. The sequence proxy for a triplet is a tuple, named an SF; it is a successive concatenation of the three elements. The preprocessed scene graph can then be decomposed into a sequence of SFs, one for each relationship. To fully preserve the information in the scene graph, a node sequence, consisting of the corresponding object node IDs, is additionally maintained. Note that the Seq-SG2SL framework is flexible enough to transfer object attributes from a scene graph to a layout through this sequence.
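To make the decomposition concrete, here is a minimal sketch (not the authors' released code); the dictionary format assumed for the input scene graph and the function name are illustrative.

```python
# Hypothetical sketch: decompose a preprocessed scene graph into an SF sequence
# plus the subsidiary node-ID sequence. The input format here is an assumption.

def scene_graph_to_sf(scene_graph):
    """scene_graph: {'objects': {node_id: class_name},
                     'relationships': [(subj_id, predicate, obj_id), ...]}"""
    sf_sequence = []    # flat token sequence, three SF words per relationship
    node_sequence = []  # (subject_node_id, object_node_id) per relationship
    for subj_id, predicate, obj_id in scene_graph['relationships']:
        sf_sequence += [scene_graph['objects'][subj_id],
                        predicate,
                        scene_graph['objects'][obj_id]]
        node_sequence.append((subj_id, obj_id))
    return sf_sequence, node_sequence

# Example: "man riding horse", "man wearing hat"
graph = {'objects': {0: 'man', 1: 'horse', 2: 'hat'},
         'relationships': [(0, 'riding', 1), (0, 'wearing', 2)]}
print(scene_graph_to_sf(graph))
# (['man', 'riding', 'horse', 'man', 'wearing', 'hat'], [(0, 1), (0, 2)])
```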

The visual incarnation of an SF in a semantic layout contains a pair of object bounding boxes, one each for the subject and the object. They are referred to as the visual subject and visual object. Their sequence proxy is a BACS. The design requirements are threefold: first, the series of BACS must uniquely determine a layout; second, a BACS should correspond to an SF such that the direction of causality is clear; third, words in the BACS vocabulary must be reusable so that they can concisely represent any layout.

A semantic layout, space-quantized into square grids, is called a quantized layout, in which all BACS are defined. Five types of actions are required to form an object bounding box in the layout: four to specify the location and scale, and one to set its class index. The bounding-box location of the subject is represented in absolute coordinates, whereas that of the object uses its position relative to the subject. This relativity aims to encode the visual predicate in a relationship. We show by experiment that this relative position encoding is key for good performance.

Type Functionality
c set class index of bbox
xp set xmin of subject bbox
yp set ymin of subject bbox
ixp increase xmin of object bbox from subject
ixn decrease xmin of object bbox from subject
iyp increase ymin of object bbox from subject
iyn decrease ymin of object bbox from subject
w set width of bbox
h set height of bbox
imgar set aspect ratio index of semantic layout
Table 1: The BACS types and functionalities. The minimum values for x and y are respectively denoted as xmin and ymin. The bounding box is abbreviated as bbox.

The BACS types and functionalities are formalized in Table 1. A BACS is composed of ten consecutive words and corresponds to a three-word SF. The word types are, sequentially: c, xp, yp, w, h, c, ixp(n), iyp(n), w, h. The first five and last five words in a BACS form the visual subject and object, respectively. A BACS sequence is the direct concatenation of individual BACS in the same relationship order as its corresponding SF sequence. Optionally, imgar is added to the front of a BACS sequence in case the aspect ratio of the semantic layout is of interest.
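The sketch below illustrates how one relationship's pair of boxes could be turned into a ten-word BACS following Table 1; the token spellings and the box format (x_min, y_min, width, height, already quantized to grid cells) are assumptions for illustration, not the paper's exact vocabulary.

```python
# Hypothetical sketch of BACS encoding for one relationship. Boxes are assumed
# to be already quantized to grid cells as (xmin, ymin, width, height); the
# token spellings below are illustrative, not the paper's exact vocabulary.

def relationship_to_bacs(subj_class, subj_box, obj_class, obj_box):
    sx, sy, sw, sh = subj_box
    ox, oy, ow, oh = obj_box
    dx, dy = ox - sx, oy - sy  # object position relative to the subject
    return [
        f"c_{subj_class}",                          # class index of subject bbox
        f"xp_{sx}", f"yp_{sy}",                     # absolute subject location
        f"w_{sw}", f"h_{sh}",                       # subject scale
        f"c_{obj_class}",                           # class index of object bbox
        f"ixp_{dx}" if dx >= 0 else f"ixn_{-dx}",   # relative x offset
        f"iyp_{dy}" if dy >= 0 else f"iyn_{-dy}",   # relative y offset
        f"w_{ow}", f"h_{oh}",                       # object scale
    ]

# "man riding horse": subject box placed relative to the object box
print(relationship_to_bacs('man', (10, 4, 6, 12), 'horse', (8, 10, 14, 16)))
```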

3.2 Sequence-to-Sequence Model

Given an SF sequence and its corresponding BACS sequence, a seq-to-seq model is fittingly used to translate between them. We adopt the Transformer model with stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, identical to the one in [21]. This model is used because of its superior performance in machine translation, whose formulation is identical to ours. We show by experiment that the advantage in model expressiveness of Seq-SG2SL originates from our sequential formulation, not merely from the Transformer.

3.3 Semantic Layout Restoration

Having predicted a BACS sequence from its input SF sequence, the alignment is first verified by checking the brick-action type of each word sequentially. If aligned, the BACS sequence corresponds to both the input SF and node sequences. The predicted brick actions are then executed successively to form the restored layout.

Note that with the subsidiary node sequence, it is straightforward to determine which bounding boxes in the restored layout should be merged into one, as given in the scene graph. If the bounding boxes to be merged have identical predicted class indices, the merged bounding box is computed simply as their mean. Otherwise, the box with the median bounding-box area is picked. In practice, the predicted class indices for these bounding boxes are rarely distinct. A more careful merging strategy is left for future investigation.
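A minimal sketch of this merging rule, assuming boxes are (x, y, w, h) tuples; the function name and input format are hypothetical.

```python
# Hypothetical sketch of the node-merging step during layout restoration:
# boxes decoded for the same scene-graph node are averaged when their predicted
# classes agree; otherwise the box with the median area is kept.
from statistics import median_low

def merge_node_boxes(predictions):
    """predictions: list of (class_index, (x, y, w, h)) decoded for one node."""
    classes = {c for c, _ in predictions}
    boxes = [b for _, b in predictions]
    if len(classes) == 1:
        merged = tuple(sum(v) / len(boxes) for v in zip(*boxes))  # mean box
        return classes.pop(), merged
    # distinct classes are rare; fall back to the median-area box
    areas = [w * h for _, _, w, h in boxes]
    return predictions[areas.index(median_low(areas))]

print(merge_node_boxes([(3, (10, 4, 6, 12)), (3, (12, 4, 6, 10))]))
# (3, (11.0, 4.0, 6.0, 11.0))
```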

3.4 Implementation Details

Semantic Layout Encoding: The maximum side length of a quantized layout is fixed. Larger values lead to a larger BACS vocabulary, making the seq-to-seq model harder to generalize; smaller values, however, result in imprecise bounding-box localization and scaling. The chosen value is a trade-off. The aspect ratio of the semantic layout is also uniformly quantized, with a fixed quantization interval and minimum value.

Data Augmentation: If the layout corresponding to a scene graph is valid, its subgraph counterparts, containing subsets of the relationships, are still acceptable. We augment a scene graph into more pairs of sequences in the two modalities by applying this property. The concatenation order of relationships in the pairs of sequences is arbitrary, offering another degree of freedom for augmentation. To balance the training data, each scene graph is augmented to a capped number of correspondences. To limit the maximum length of both sequences, we preserve only a capped number of relationships per scene graph.
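A possible sketch of this augmentation, with the subgraph sampling scheme and the caps left as illustrative parameters rather than the paper's exact settings.

```python
# Hypothetical sketch of the augmentation described above: sample relationship
# subsets (subgraphs) and shuffle their concatenation order. The sampling
# scheme and the caps are illustrative; the paper's exact values are not given here.
import random

def augment(relationships, max_pairs, max_rels):
    relationships = relationships[:max_rels]   # cap the sequence length
    samples = []
    for _ in range(max_pairs):
        k = random.randint(1, len(relationships))   # subgraph size
        subset = random.sample(relationships, k)    # keep a relationship subset
        random.shuffle(subset)                      # arbitrary concatenation order
        samples.append(subset)
    return samples

rels = [(0, 'riding', 1), (0, 'wearing', 2), (1, 'on', 3)]
for s in augment(rels, max_pairs=3, max_rels=10):
    print(s)
```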

Training: We set hyper-parameters exactly as in the Transformer work [21], except for the number of warm-up steps. We use the Transformer implementation from OpenNMT [12]. We train on a single Tesla P100 GPU for one million iterations with a fixed batch size.

Inference: We use beam search with a length penalty [22]. The inference time is on the order of milliseconds on a single Tesla P100 GPU.

4 SLEU Metric

Our goal is to design an automatic metric that directly quantifies the success of semantic layout prediction from scene graph. The premise behind automatic evaluation is that the closer a prediction is to human-prepared references, the better it is. Thus, the problem becomes how to design a metric that measures the similarity of a predicted layout against a set of references.

4.1 Notation

Let $L=\{r_i\}_{i=1}^{N}$ denote a layout with $N$ relationships, where $r_i=(s_i, o_i)$ represents a visual relationship. $s_i$ and $o_i$ respectively denote the visual subject and object, where $s_i=(c_i^s, b_i^s)$ and $o_i=(c_i^o, b_i^o)$. $c$ denotes the class index and $b$ dictates the bounding box.

Our objective is to evaluate how close a predicted layout $\hat{L}$ is to a collection of reference layouts $\{L_j\}$. The hat accent denotes predicted values hereafter.

4.2 Metric Design

SLEU is devised inspired by BLEU [15] in machine translation. The cornerstone of BLEU is the n-gram, a contiguous sequence of items from a text; there, the item is the word. Evaluating only n-grams for machine translation is justified under the Markov assumption that the appearance probability of the current word is determined merely by its previous $n-1$ words, independent of earlier ones. The n-gram concept is generalized for SLEU, where the item is the relationship instead of the word. By analogy to BLEU, evaluating n-grams in a semantic layout assumes the placement of a visual relationship depends only on at most $n-1$ other relationships.

SLEU evaluates a semantic layout from two perspectives: intra-relationship adequacy as unigram accuracy, and inter-relationship fidelity as n-gram accuracy. These accuracies are finally combined into a single-number metric.

4.2.1 Unigram Accuracy

Input: A predicted layout $\hat{L}$ and a reference $L$
Output: Unigram accuracy $p_1$

1:function UnigramAccuracy($\hat{L}$, $L$):
2:     $m \leftarrow 0$ ▷ count of matches
3:     for each $(\hat{r}_i, r_i)$ in $(\hat{L}, L)$ do ▷ $N$ iterations
4:         if $\hat{c}_i^s = c_i^s$ and $\hat{c}_i^o = c_i^o$ then
5:              compute the shift vector $v_i$ aligning the center of $\hat{b}_i^s$ to $b_i^s$
6:              $\hat{b}_i^s \leftarrow \mathrm{shift}(\hat{b}_i^s, v_i)$
7:              $\hat{b}_i^o \leftarrow \mathrm{shift}(\hat{b}_i^o, v_i)$
8:              compute $J_i^s = \mathrm{Jaccard}(\hat{b}_i^s, b_i^s)$
9:              compute $J_i^o = \mathrm{Jaccard}(\hat{b}_i^o, b_i^o)$
10:              if $J_i^s \ge t$ and $J_i^o \ge t$ then
11:                  $m \leftarrow m + 1$
12:              end if
13:         end if
14:     end for
15:     return $p_1 = m / N$
16:end function
Algorithm 1 Unigram accuracy against a single reference

The unigram accuracy $p_1$ quantifies the matching of individual relationships in a predicted layout to a reference, as shown in Algorithm 1. It simply compares the $N$ pairs of visual relationships and counts the number of matches; $p_1$ is then the count of matched pairs divided by the total number of relationships.

To compare a pair of visual relationships, one first computes a shift vector $v_i$ that aligns the center of $\hat{b}_i^s$ to $b_i^s$. Then $\hat{b}_i^s$ and $\hat{b}_i^o$ are center-shifted by this vector, an operation denoted by $\mathrm{shift}(\cdot, v_i)$. The two Jaccard Indices, one each for the visual subject and object, are then computed for the two pairs of bounding boxes. They give rise to the binary matching decision through thresholding by $t$, which allows some spatial tolerance.
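For concreteness, the following is a minimal Python sketch of Algorithm 1, assuming boxes are (x, y, w, h) tuples and that the predicted and reference relationship lists are index-aligned; the helper names are ours, not the paper's.

```python
# Minimal sketch of Algorithm 1 (unigram accuracy against a single reference).
# Boxes are (x, y, w, h); relationship/box formats and helper names are assumed.

def jaccard(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def center(box):
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def shift(box, v):
    x, y, w, h = box
    return (x + v[0], y + v[1], w, h)

def unigram_accuracy(pred, ref, t):
    """pred, ref: aligned lists of ((subj_cls, subj_box), (obj_cls, obj_box))."""
    matches = 0
    for (ps, po), (rs, ro) in zip(pred, ref):
        if ps[0] != rs[0] or po[0] != ro[0]:
            continue  # class mismatch
        # shift vector aligning the predicted subject center to the reference's
        cx, cy = center(rs[1]); px, py = center(ps[1])
        v = (cx - px, cy - py)
        if (jaccard(shift(ps[1], v), rs[1]) >= t and
                jaccard(shift(po[1], v), ro[1]) >= t):
            matches += 1
    return matches / len(ref)
```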

4.2.2 n-gram Accuracy

Input: A predicted layout $\hat{L}$ and a reference $L$
Output: n-gram accuracy $p_n$

1:function NGramAccuracy($\hat{L}$, $L$, $n$):
2:     $m \leftarrow 0$ ▷ count of matches
3:     compute $G_n(\hat{L})$ and $G_n(L)$ ▷ all $n$-relationship subsets
4:     for each $(\hat{g}, g)$ in $(G_n(\hat{L}), G_n(L))$ do ▷ $|G_n(L)|$ iterations
5:         compute the shift vector $v$ aligning the centroid of the subject boxes in $\hat{g}$ to that in $g$
6:         $f \leftarrow$ true ▷ flag of match
7:         for each $(\hat{s}_k, s_k)$ in $(\hat{g}, g)$ do ▷ $n$ iterations
8:              $\hat{b}_k^s \leftarrow \mathrm{shift}(\hat{b}_k^s, v)$
9:              compute $J_k = \mathrm{Jaccard}(\hat{b}_k^s, b_k^s)$
10:              if $\hat{c}_k^s \ne c_k^s$ or $J_k < t$ then
11:                  $f \leftarrow$ false
12:                  break
13:              end if
14:         end for
15:         if $f$ then
16:              $m \leftarrow m + 1$
17:         end if
18:     end for
19:     return $p_n = m / |G_n(L)|$
20:end function
Algorithm 2 n-gram accuracy against a single reference

The n-gram accuracy $p_n$ quantifies the similarity of spatial distributions of $n$ visual relationships between the predicted and reference layouts, as elaborated in Algorithm 2. $G_n(L)$ denotes the collection of all $n$-relationship subsets of $L$, and $|\cdot|$ represents the cardinality of a set. To compute $p_n$, each pair of corresponding $n$-relationship subsets is compared and the number of matches is counted; $p_n$ is then the matched proportion of these pairs of $n$-relationship subsets.

To measure the similarity between a pair of $n$-relationship subsets, we use only their visual subjects, because the relative distribution from a visual object to its subject is already encoded in the unigram accuracy. To compare two spatial distributions, each with $n$ visual subjects, a shift vector $v$ is first computed to align the centroids of the subject boxes in $\hat{g}$ and $g$. All subject boxes in $\hat{g}$ are then center-shifted by $v$. If all the shifted bounding boxes pass the Jaccard-Index thresholding and the class-alignment check, this pair of $n$-relationship subsets is considered a match.
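A corresponding sketch of Algorithm 2, reusing the jaccard, center and shift helpers from the previous snippet; the enumeration of corresponding subsets via index combinations is our reading of $G_n$, under the same index-alignment assumption.

```python
# Sketch of Algorithm 2 (n-gram accuracy), reusing jaccard/center/shift from the
# previous snippet. Corresponding n-relationship subsets are compared through
# their visual subjects only, after aligning the subset centroids.
from itertools import combinations

def ngram_accuracy(pred, ref, n, t):
    idx_subsets = list(combinations(range(len(ref)), n))
    matches = 0
    for idx in idx_subsets:
        p_subj = [pred[i][0] for i in idx]   # predicted (class, box) of subjects
        r_subj = [ref[i][0] for i in idx]    # reference (class, box) of subjects
        # shift vector aligning the centroid of predicted subjects to the reference's
        pc = [center(b) for _, b in p_subj]
        rc = [center(b) for _, b in r_subj]
        v = (sum(x for x, _ in rc) / n - sum(x for x, _ in pc) / n,
             sum(y for _, y in rc) / n - sum(y for _, y in pc) / n)
        ok = all(p_cls == r_cls and jaccard(shift(p_box, v), r_box) >= t
                 for (p_cls, p_box), (r_cls, r_box) in zip(p_subj, r_subj))
        matches += ok
    return matches / len(idx_subsets)
```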

4.2.3 SLEU Score

SLEU combines the unigram and n-gram accuracies into a single-number indicator. For a predicted layout and a reference, the unigram and n-gram accuracies are computed separately. Similar to the observation in BLEU, n-gram accuracy decays roughly exponentially with $n$. Hence, SLEU adopts the same averaging scheme: a weighted average of logarithmic accuracies.

SLEU can also measure the similarity between a prediction and multiple references. One first compares the prediction with each reference, obtaining one combined accuracy per reference. Then the highest value, corresponding to the reference closest to the prediction, is designated as SLEU.

SLEU is formally defined as

\mathrm{SLEU} = \exp\left( \sum_{n=1}^{N} w_n \log p_n \right),   (1)

where $w_n = 1/N$ are uniform weights. $N$ is chosen experimentally to make SLEU more distinguishable, since a larger $N$ leads to negligibly small $n$-gram accuracies.
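A sketch of how the combined score in Eq. (1) and the multi-reference maximum could be computed, reusing the accuracy functions sketched above; the default N below is only a placeholder, since the paper chooses N experimentally.

```python
# Sketch of the SLEU combination in Eq. (1): a uniform weighted average of log
# n-gram accuracies, maximized over the available references. Reuses
# unigram_accuracy / ngram_accuracy from the sketches above. N=2 is a placeholder.
import math

def sleu(pred, references, t, N=2):
    best = 0.0
    for ref in references:
        p = [unigram_accuracy(pred, ref, t)]
        p += [ngram_accuracy(pred, ref, n, t) for n in range(2, N + 1)]
        if min(p) == 0.0:
            score = 0.0  # a zero accuracy nullifies the weighted geometric mean
        else:
            score = math.exp(sum(math.log(pn) for pn in p) / N)  # w_n = 1/N
        best = max(best, score)
    return best
```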

SLEU ranges from 0 to 1. Few predictions attain a score of 1 unless they are very similar to one of the references. Involving more reasonable reference layouts for a scene graph leads to a higher SLEU value. Thus, one must be cautious about the number of references when comparing evaluations. The quality of SLEU can be enhanced by increasing the number of references.

4.2.4 Mean-SLEU Metric

One may still wonder about SLEU's capability to measure prediction performance, especially when only one layout is available in the reference corpus. The design of SLEU strictly follows the idea of the well-known BLEU; we merely extend the concept from 1D to 2D, which requires a binary decision of visual-relationship matching and allows some spatial tolerance. Therefore, SLEU is largely justified by its similarity to BLEU.

From the work on BLEU [15], using a single-reference test corpus for evaluation is valid if the corpus size is large and the reference translations are not all from the same translator. In our case, the corpus size is at least several thousand and all the samples are crowd-sourced. Hence, the mean-SLEU over a large set is justifiable for evaluation, though studying its correlation with human judgement remains a desirable future investigation.

Figure 2: Examples (a)-(f) of predicted layouts from Seq-SG2SL on the test set of Visual Genome: the first row is the input scene graph; the second row is the predicted layout; the third row is a reference layout overlaid on its corresponding image.

5 Experiments

5.1 Dataset

We experiment on the VG dataset [13]. VG comprises images that each contain a scene graph and a corresponding layout. To facilitate later comparison with [10], the dataset is organized exactly following its public implementation, and we use the same division into training, validation, and test sets. We use the object and relationship classes occurring most frequently in the training set, discard tiny bounding boxes, and preserve images with a bounded number of objects and at least one relationship. This leaves training, validation, and test sets with an average of ten objects and five relationships per image. Different from [10], we limit the maximum number of relationships in a scene graph to adapt for our Seq-SG2SL setting: for a scene graph with more relationships, we simply keep the first ones and discard the rest.

5.2 Qualitative Results

This experiment aims to visualize results from Seq-SG2SL and rationalize SLEU intuitively. Figure 2 demonstrates results from the test set together with their SLEU scores. Qualitatively, Seq-SG2SL is capable of predicting a semantic layout from a scene graph containing multiple interactive objects.

We analyze examples from Figure 2 to intuitively understand how SLEU distinguishes a good prediction from a bad one. A high SLEU score, such as in (a), definitely indicates a good prediction, since the predicted layout is sufficiently close to one of the references. For those with medium SLEU scores, as shown in (b)-(d), most predictions are still comparable to a reference: objects with smaller bounding boxes may appear at apparently irregular positions, such as the chair in (d), whereas larger ones are more likely to be reasonable. For cases with low SLEU scores, as shown in (e) and (f), the predictions are essentially different from the reference. A low SLEU score suggests either a bad prediction or a reasonable prediction that deviates far from any of the given references. Provided more references, SLEU becomes more representative, following the same rationale as BLEU.

5.3 Quantitative Comparison

Figure 3: Comparison of mean n-gram accuracies under different thresholds between our Seq-SG2SL and the baseline approach on the two sets. The top and bottom figures respectively show results on the training and test sets.

This experiment compares Seq-SG2SL with a baseline approach [10]. The pretrained model of the baseline is applied to generate layouts for benchmarking. Our Seq-SG2SL model is trained on exactly the same set as the baseline. Mean-SLEU is employed for evaluation on both the training and test set. Each sample in the two sets consists of a single reference layout corresponding to a scene graph.

IoU IoU IoU IoU
Baseline-train
Ours-train
Baseline-test
Ours-test
Table 2: Comparison of mean-SLEU scores under different thresholds between the baseline and our Seq-SG2SL approach on the training and test set.

Table 2 shows the comparison of mean-SLEU scores under various thresholds between the two approaches on the training and test sets; the larger value, indicating the better performance, is highlighted. As demonstrated, Seq-SG2SL outperforms the baseline by a significant margin on the training set. Note that evaluations on the training set are not trivial but offer insight into model expressiveness. By analogy with machine translation, we cannot expect a trained model to produce exactly the same output as a given reference, even on the training set, since the same input inherently admits several different reasonable outputs; but for training, the closer, the better. Therefore, Seq-SG2SL has shown its advantage in model expressiveness over the baseline. This advantage originates from our sequential formulation, which avoids combinatorial explosion. On the test set, Seq-SG2SL still scores higher than the baseline, suggesting better generalization capability, except that the class-only score (the loosest threshold) is slightly worse. That score considers only the matching of object classes, with no evaluation of the spatial distribution of bounding boxes in the layout. The corresponding score from Seq-SG2SL is also very close to the maximum, showing its capability in predicting the correct object classes. In rare cases, the predicted BACS sequence cannot be perfectly aligned with the input SF sequence, resulting in this negligible gap.

Figure 3 demonstrates the mean n-gram accuracies for further analysis, where (a) and (b) respectively show comparisons on the training and test sets. As shown, Seq-SG2SL outperforms the baseline in every metric. However, the gap between training and test performance for Seq-SG2SL is substantial, which can be explained from two perspectives. First, only a single reference is given for each scene graph during both training and testing. During training, the model is guided to memorize these particular references and learns to balance performance over all samples; more supervision implies a higher chance that the prediction is close to its reference, which explains why n-grams with larger n have better accuracy on the training set. During testing, however, the given single reference may differ significantly from what was seen during training, resulting in an underestimated score. Thus, adding more references for each test sample is one way to mitigate this training-inference gap. Second, the discrepancy also originates from exposure bias in seq-to-seq problems; we refer readers to [17] for more detailed explanations. Recent work [25] may be incorporated in the future to bridge this training-inference gap.

5.4 Ablation Experiments

5.4.1 Relative vs. Absolute Position Encoding

IoU IoU IoU IoU
Abs-train
Ours-train
Abs-test
Ours-test
Table 3: Comparison of mean-SLEU scores under different thresholds between encoding methods using absolute and relative position on the training and test set.

Seq-SG2SL applies relative position encoding, whose relativity aims to represent the visual predicate in a relationship. This experiment clarifies the significance of using relative position encoding. We compare against absolute position encoding, where the locations of the visual subject and object are both represented in absolute coordinates without considering the visual incarnation of the predicate. The two models are trained in exactly the same way except for the position encoding method. Table 3 shows the results of this comparison. As demonstrated, relative position encoding is much better, suggesting that learning the visual predicate in a relationship is key for good performance.

5.4.2 Network Architecture

IoU IoU IoU IoU
LSTM-train
Ours-train
LSTM-test
Ours-test
Table 4: Comparison of mean-SLEU scores under different thresholds between our Transformer-based model and an LSTM-based model on the training and test set.

This experiment compares our Transformer-based model with an LSTM-based model equipped with an attention mechanism, as shown in Table 4. The two models are trained on exactly the same set. As expected, the LSTM-based model performs slightly worse than our Transformer-based model in every metric. However, its performance on the training set is still far better than the non-sequential baseline (see Table 2). Therefore, the advantage in model expressiveness mainly comes from our Seq-SG2SL framework avoiding combinatorial explosion, not merely from the superiority of the Transformer architecture.

5.4.3 Greedy vs. Beam Search

This experiment compares results from greedy search and beam search. Table 5 shows the top-1 to top-4 results from beam search in addition to the greedy decoding baseline. As shown, on the training set the best mean-SLEU score coincides with the top-1 choice from beam search, which justifies using the top-1 prediction. The top-2 to top-4 scores are almost the same but distinctly lower than the top-1 score; these results may enrich the diversity of semantic layout predictions, though they are more likely to be unreasonable. The top-1 score from beam search is higher than that from greedy search, as expected, showing the advantage of using beam search. On the test set, however, all scores are generally low and resemble each other; the single-reference test corpus may not be sufficient to draw conclusions from these minor discrepancies.

Greedy Top-1 Top-2 Top-3 Top-4
train
test
Table 5: Comparison between greedy search and beam search using the mean-SLEU score from Seq-SG2SL on the training and test sets.

6 Conclusions and Future Work

This paper has presented Seq-SG2SL, a conceptually simple, flexible and general framework for inferring semantic layout from scene graph through seq-to-seq learning. Seq-SG2SL outperforms a non-sequential state-of-the-art approach by a significant margin, especially in the aspect of model expressiveness. This advantage mainly originates from our sequential formulation, which avoids combinatorial explosion. Relative position encoding, representing the visual predicate in a relationship, is also key for good performance. Beam search during inference further enhances prediction quality compared with greedy search, and the top-1 result from beam search is the best layout prediction. SLEU, an automatic metric to directly evaluate semantic layout prediction from scene graph, has been devised in a similar fashion to BLEU. The rationale of SLEU and the justification for using the mean-SLEU metric over a large set for evaluation have been discussed.

Direct future works are twofold. First, as for SLEU, its quality can be further enhanced by adding more references, and human evaluations of SLEU may also be of interest; more evaluation metrics for similar tasks, such as text-to-layout synthesis, may be derived by adopting our n-gram analogy that regards a relationship as a unigram. Second, as for Seq-SG2SL, alternative BACS designs may be investigated, such as altering the parameterization of the bounding box similar to CenterNet [6]; additive layout generation may also be investigated owing to the sequential nature of our framework.

References

  • [1] P. Anderson, B. Fernando, M. Johnson, and S. Gould (2016) SPICE: semantic propositional image caption evaluation. In ECCV, pp. 382–398. External Links: ISBN 978-3-319-46454-1 Cited by: §2.
  • [2] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. CoRR arXiv:1409.0473. Cited by: §2.
  • [3] D. Britz, A. Goldie, M. Luong, and Q. V. Le (2017) Massive exploration of neural machine translation architectures. CoRR arXiv:1703.03906. Cited by: §2.
  • [4] K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR arXiv:1406.1078. Cited by: §2.
  • [5] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR arXiv:1412.3555. Cited by: §2.
  • [6] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian (2019) CenterNet: keypoint triplets for object detection. CoRR arXiv:1904.08189. Cited by: §6.
  • [7] T. Hinz, S. Heinrich, and S. Wermter (2019) Generating multiple objects at spatially distinct locations. In ICLR, Cited by: §1, §2.
  • [8] S. Hochreiter and J. Schmidhuber (1997-11) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §2.
  • [9] S. Hong, D. Yang, J. Choi, and H. Lee (2018-06) Inferring semantic layout for hierarchical text-to-image synthesis. In CVPR, pp. 7986–7994. Cited by: §1, §1, §2.
  • [10] J. Johnson, A. Gupta, and L. Fei-Fei (2018-06) Image generation from scene graphs. In CVPR, pp. 1219–1228. Cited by: §1, §1, §1, §2, §2, §5.1, §5.3.
  • [11] J. Johnson, R. Krishna, M. Stark, L. Li, D. Shamma, M. Bernstein, and L. Fei-Fei (2015-06) Image retrieval using scene graphs. In CVPR, pp. 3668–3678. Cited by: §1, §2.
  • [12] G. Klein, Y. Kim, Y. Deng, J. M. Crego, J. Senellart, and A. M. Rush (2017) OpenNMT: open-source toolkit for neural machine translation. CoRR arXiv:1709.03815. Cited by: §2, §3.4.
  • [13] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, M. S. Bernstein, and L. Fei-Fei (2017-02) Visual genome: connecting language and vision using crowdsourced dense image annotations. IJCV 123 (1), pp. 32–73. Cited by: §1, §1, §1, §2, §5.1.
  • [14] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei (2016) Visual relationship detection with language priors. In ECCV, pp. 852–869. External Links: ISBN 978-3-319-46448-0 Cited by: §2.
  • [15] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In ACL, pp. 311–318. Cited by: §1, §2, §4.2.4, §4.2.
  • [16] T. Qiao, J. Zhang, D. Xu, and D. Tao (2019-06) MirrorGAN: learning text-to-image generation by redescription. In CVPR, Cited by: §2.
  • [17] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba (2015) Sequence level training with recurrent neural networks. CoRR arXiv:1511.06732. Cited by: §5.3.
  • [18] S. Schuster, R. Krishna, A. Chang, L. Fei-fei, and C. D. Manning (2015) Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In EMNLP, Cited by: §1, §2.
  • [19] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In NIPS, pp. 3104–3112. External Links: Link Cited by: §2.
  • [20] F. Tan, S. Feng, and V. Ordonez (2019-06) Text2Scene: generating compositional scenes from textual descriptions. In CVPR, pp. 6710–6719. Cited by: §1, §2.
  • [21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, pp. 5998–6008. External Links: Link Cited by: §2, §3.2, §3.4.
  • [22] Y. Wu, M. Schuster, Z. Chen, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. CoRR arXiv:1609.08144. Cited by: §3.4.
  • [23] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei (2017-07) Scene graph generation by iterative message passing. In CVPR, pp. 5410–5419. Cited by: §2.
  • [24] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas (2017-10) StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, Cited by: §1, §2.
  • [25] W. Zhang, Y. Feng, F. Meng, D. You, and Q. Liu (2019-07) Bridging the gap between training and inference for neural machine translation. In ACL, Cited by: §5.3.
  • [26] B. Zhao, L. Meng, W. Yin, and L. Sigal (2019-06) Image generation from layout. In CVPR, Cited by: §1, §2.
  • [27] C. L. Zitnick, D. Parikh, and L. Vanderwende (2013-12) Learning the visual interpretation of sentences. In ICCV, pp. 1681–1688. Cited by: §2.
  • [28] C. L. Zitnick and D. Parikh (2013-06) Bringing semantics into focus using visual abstraction. In CVPR, pp. 3009–3016. Cited by: §2.