What and Where: A Context-based Recommendation System for Object Insertion

11/24/2018 ∙ by Song-Hai Zhang, et al.

In this work, we propose a novel topic consisting of two dual tasks: 1) given a scene, recommend objects to insert; 2) given an object category, retrieve suitable background scenes. A bounding box for the inserted object is predicted in both tasks, which helps downstream applications such as semi-automated advertising and video composition. The major challenge lies in the fact that the target object is neither present nor localized at test time, whereas available datasets only provide scenes with existing objects. To tackle this problem, we build an unsupervised algorithm based on object-level contexts, which explicitly models the joint probability distribution of object categories and bounding boxes with a Gaussian mixture model. Experiments on our newly annotated test set demonstrate that our system outperforms existing baselines on all subtasks, and does so under a unified framework. Our contribution promises future extensions and applications.






I Introduction

Our goal is to build a bidirectional recommendation system [1, 2] that performs two tasks under a unified framework:

  1. Object Recommendation: For a given scene, recommend a sorted list of categories and bounding boxes for insertable objects;

  2. Scene Retrieval: For a given object category, retrieve a sorted list of suitable background scenes and corresponding bounding boxes for insertion.

The motivation for the two tasks stems from the bilateral collaboration between media owners and advertisers in the advertising industry. Some media owners make profits by offering paid promotion [3], while many advertisers pay media owners for product placement [4]. This collaboration pattern reflects the mutual requirement, from which we distill the novel research topic of dual recommendation for object insertion.

Consider a typical collaborative workflow between a media owner and an advertising artist consisting of three phases:

  1. Matching: The media owner determines what kind of products are insertable, while the advertiser determines what kind of background scenes are suitable. Both of them, in this process, also consider where an insertion might potentially happen;

  2. Negotiation: They contact each other and confirm what and where after negotiation;

  3. Insertion: Post-process the media to perform the actual insertion.

In this work, both of the above tasks aim to underpin phases 1 and 2, but neither includes a fully automatic solution for phase 3. The key idea is to automatically make recommendations rather than make decisions for the user. We do not perform automatic segment selection or insertion, because in practice the inserted object will be brand-specific and the final decision depends upon the personal opinions of the advertiser [5]. Nonetheless, for illustration purposes only, we use manually selected, yet automatically pasted, object segments for the cases presented in this paper, which demonstrates our system’s ability to make reasonable recommendations on categories and bounding boxes.

The advantage of our system is three-fold. First, we provide constructive ideas for designers: the object recommendation task can be especially useful for sponsored media platforms, which may profit by making recommendations to media owners. Second, the scene retrieval task provides a specialized search engine that retrieves images for a given object, going beyond previous content-based image retrieval systems [6, 7, 8]. Future applications include advertiser-oriented search engines and matching services for designer websites. Third, the bounding boxes predicted for both tasks make the recommendation concrete and visualizable. As we will show in our experiments, this not only enables applications such as automatic preview over a gallery of target segments, but may also assist designers by providing a heatmap as a hint.

Specifically, our contributions are:

  1. We are the first, to the best of our knowledge, to propose dual recommendation for object insertion as a new research topic;

  2. We develop an unsupervised algorithm (Sect. III) based on object-level context [9], which explicitly models the joint probability distribution of object category and bounding box;

  3. We establish a newly annotated test set (Sect. IV), and introduce task-specific metrics for automatic quantitative evaluation (Sect. V);

  4. We outperform existing baselines on all subtasks under a unified framework, as demonstrated by both quantitative and qualitative results (Sect. V).

II Related Work

Although no prior work directly addresses exactly the same topic, we can still borrow ideas from prior art on related tasks.

Object Recognition. The family of recognition tasks includes image classification [10, 11, 12], object detection [13, 14, 15], weakly supervised object detection [16, 17, 18] and semantic segmentation [19, 20, 21].

Generally, the appearance of the target object is given, and the expected output is either the category (image classification), or the location (weakly supervised object detection), or both (object detection, semantic segmentation). Our object recommendation task produces a similar output of both category and location, but there are two key differences: (i) the appearance of the target object is unknown in our task, because the object is not even present in the scene; (ii) the expected outputs for both category and location are not unique, because there may be multiple objects suitable for the same scene with multiple reasonable placements.

In this work, we build our system upon the recently proposed state-of-the-art object detector, Faster R-CNN [13]. The basic idea is to seek evidence from other existing objects in the scene, which requires object detection as a basic building block. We also extend the expected output from a single category and a single location to lists of each, to allow multiple acceptable results in an information retrieval (IR) fashion.

Image Retrieval. Image retrieval tasks aim to retrieve a list of relevant images based on keywords [6], example images [8], or even other abstract concepts such as sketches or layouts [22].

Generally, some attributes (topic, features, color, layout, etc.) are known about the target image, and the expected output is a list of images that satisfy these conditions. Our scene retrieval task is distinct from this family of tasks because our query object is not generally present in the scene, nor is it an attribute possessed by the target image. Nonetheless, we share ideas with these retrieval systems in two aspects. First, we adopt a ranked list as the expected output and employ normalized discounted cumulative gain (nDCG) as the metric, as is common in previous retrieval tasks. Second, similar to content-based image retrieval systems [6, 7, 8], we utilize the known information of the image, typically the categories and locations of the existing objects.

Image Composition. Our work aims to provide inspiration for object insertion, which is closely related to image composition. Some works focus on interactive editing; for instance, [23] builds an interactive library-based editing tool that enables users to draw rough sketches, leading to plausible composite images that incorporate retrieved patches. Other works focus on automatic completion, with image in-painting as one of the most notable research topics [24, 25]. These works aim to restore the removed region of an image, typically with neural networks that exploit the context. Our system is unique in two aspects: 1) we neither take the user’s sketches as input nor require a masked region as a location hint; 2) we do not take “plausible” as our final goal, because our motivation is to make recommendations rather than decisions, as explained in Sect. I.

Closest to our work is the automatic person composition task proposed by [26], which establishes a fully automatic pipeline for incorporating a person into a scene. This pipeline consists of two stages: 1) location prediction; 2) segment retrieval. Our system differs from this work in that we do not perform segment retrieval, while that pipeline cannot make recommendations on categories or scenes. We compare our system’s performance on bounding box prediction with the first stage of this work, and report both quantitative and qualitative results.

III Method

In this section, first, we decompose the two tasks into three subtasks with probabilistic formulations, which we derive from the same joint probability distribution. Furthermore, we present an algorithm that models object-level context with a Gaussian mixture model (GMM), which leads to an approximation for the joint distribution. Finally, we report implementation details and per-image runtime.

III-A Problem Formulation

Given a set of candidate object categories $\mathcal{C}$, a set of scene images $\mathcal{I}$, and a set of candidate bounding boxes $\mathcal{B}_I$ for each specific image $I \in \mathcal{I}$, we further break the two tasks introduced in Sect. I into the following three subtasks:

  1. Object Recommendation: for a given image $I$, rank all candidate categories $c \in \mathcal{C}$ by $P(c \mid I)$;

  2. Scene Retrieval: for a given object category $c$, rank all candidate images $I \in \mathcal{I}$ by $P(I \mid c)$;

  3. Bounding Box Prediction: for a given image $I$ and an object category $c$, rank all candidate bounding boxes $b \in \mathcal{B}_I$ by $P(b \mid c, I)$.

We show that all three subtasks can be solved from the same joint probability distribution $P(b, c \mid I)$. The basic intuition is that the object category and bounding box should be interrelated when judging whether an insertion is appropriate. By adopting Bayes’ theorem, we arrive at:

  1. Object Recommendation:

    $$P(c \mid I) = \sum_{b \in \mathcal{B}_I} P(b, c \mid I) \tag{1}$$

  2. Scene Retrieval:

    $$P(I \mid c) = \frac{P(c \mid I)\, P(I)}{P(c)} \propto P(c \mid I) \tag{2}$$

    where we perform the maximum a posteriori (MAP) estimation and assume a uniform prior for $P(I)$;

  3. Bounding Box Prediction:

    $$P(b \mid c, I) = \frac{P(b, c \mid I)}{P(c \mid I)} \propto P(b, c \mid I) \tag{3}$$

    where we rank all bounding boxes for each given pair of $(c, I)$.

In summary, to achieve our goal (which breaks down into three subtasks), we need an algorithm to estimate $P(b, c \mid I)$, which is discussed in the next subsection.
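To make the relationship between the three subtasks concrete, the following minimal sketch ranks categories, scenes, and boxes from a single table of joint scores. The array `joint` and the function `rank_subtasks` are hypothetical names introduced for illustration, and the numbers are made up; the sketch only assumes that a score for every (image, category, box) triple is available.

```python
import numpy as np

def rank_subtasks(joint):
    """joint[i, c, b] holds a (hypothetical) joint score for box b and
    category c in image i. Returns index rankings, best first."""
    # Marginalize over boxes to score each (image, category) pair.
    p_c_given_i = joint.sum(axis=2)                  # shape: (images, categories)
    # Object recommendation: for each image, sort categories by score.
    object_rec = np.argsort(-p_c_given_i, axis=1)
    # Scene retrieval: for each category, sort images by the same score
    # (a uniform prior over images is assumed).
    scene_ret = np.argsort(-p_c_given_i, axis=0).T   # shape: (categories, images)
    # Box prediction: for each (image, category), sort boxes by joint score.
    box_pred = np.argsort(-joint, axis=2)
    return object_rec, scene_ret, box_pred

# Toy data: 2 images, 2 categories, 2 candidate boxes per image.
joint = np.array([
    [[0.10, 0.30], [0.05, 0.05]],   # image 0
    [[0.02, 0.08], [0.40, 0.20]],   # image 1
])
object_rec, scene_ret, box_pred = rank_subtasks(joint)
```

In this toy case, image 0 favors category 0 and image 1 favors category 1, and the same marginal table drives both the per-image and the per-category rankings.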

III-B Modeling the Joint Probability Distribution

III-B1 Model Formulation

For each image $I$, we obtain a set of bounding boxes $\mathcal{O}_I$ for existing objects, which is typically the output of a region proposal network (RPN) [13].

Note that the candidate bounding box $b$ and category $c$ are conditionally independent of $I$ given $\mathcal{O}_I$, because $\mathcal{O}_I$ is derivable from $I$. We then model the joint probability distribution as follows:


We represent each context object $o$ with a probability distribution over all possible categories. Denoting the set of all categories considered in the context as $\mathcal{C}'$, we have

$$P(b, c \mid o) = \sum_{c' \in \mathcal{C}'} P(c \mid c', o)\, P(b \mid c, c', o)\, P(c' \mid o) \tag{5}$$

where the last term on the right-hand side is the output distribution obtained from an object detector [13, 27]. The first term is decided by the co-occurrence frequency of the inserted object $c$ with a localized existing object $(c', o)$. For simplicity, we drop $o$ and approximate this term with $P(c \mid c')$. The basic intuition is that $o$ does not contribute significantly to the ranking between categories. For instance, compared to a mouse, a cake is more likely to co-occur with a plate, no matter where the plate is. The second term is an object-level context [9] term that will be modeled with a Gaussian mixture model (GMM), as described next.

III-B2 Context Modeling with GMM

We now focus on the context term $P(b \mid c, c', o)$ in Eq. 5 that remains unsolved, where $c$ is the inserted category, $c'$ is the category of a context object, and $o$ is its bounding box. Consider the case when $c$ is clock and $c'$ is wall. The term answers the question “Having observed a wall in a certain place, where should we insert a clock?”. Given such a question, a human agent would first identify that a clock is likely to be mounted on the wall, then conclude that the clock is likely to appear in the upper region of the wall, and that its size should be much smaller than the wall. Our GMM simulates the above process to judge each candidate bounding box $b$.

Based on this intuition, we further exploit inter-object relationships as proposed by previous works on scene graphs [28, 6]. Denoting the set of all considered relations as , we get:


Following [6], we extract pairwise bounding box feature, which encodes the relative position and scale of the inserted object and a context object:


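where $(x, y)$ and $(x', y')$ are the bottom-left corners of the two boxes, and $(w, h)$ and $(w', h')$ are their widths and heights, respectively. Denoting the resulting pairwise feature as $f$, we then train a Gaussian mixture model (GMM) for each annotated triple $(c, r, c')$ from the Visual Genome [29] dataset:

$$G_{(c, r, c')}(f) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(f;\, \mu_k, \Sigma_k)$$

where $G_{(c, r, c')}$ denotes the GMM corresponding to triple $(c, r, c')$. $K$ is the number of components, the same for each GMM, which we empirically set to 4 in our experiments. $\mathcal{N}$ is the normal distribution. $\pi_k$, $\mu_k$ and $\Sigma_k$ are the prior, mean, and covariance for the $k$-th component of $G_{(c, r, c')}$, which we learn using the EM algorithm implemented by Scikit-learn [30].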
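As an illustration of this step, the sketch below fits a 4-component GMM with Scikit-learn's `GaussianMixture` on synthetic pairwise features for a single triple. The exact feature definition is an assumption here (offsets and sizes normalized by the context box, in the spirit of [6]), `pair_feature` is a hypothetical helper, and the training data is randomly generated rather than taken from Visual Genome.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def pair_feature(box, ctx):
    """Relative position and scale of `box` w.r.t. context box `ctx`.
    Boxes are (x, y, w, h); the normalization is an assumed form."""
    x, y, w, h = box
    cx, cy, cw, ch = ctx
    return np.array([(x - cx) / cw, (y - cy) / ch, w / cw, h / ch])

# Synthetic stand-in for features of one triple, e.g. (clock, on, wall):
# small boxes located in the upper part of the context box.
rng = np.random.default_rng(0)
features = rng.normal(loc=[0.5, 0.8, 0.2, 0.2], scale=0.05, size=(200, 4))

# One GMM per triple, with 4 components as stated in the text; EM is run by fit().
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(features)

# score_samples returns the log-density of each feature vector under the GMM.
candidates = np.array([[0.5, 0.8, 0.2, 0.2],    # typical placement
                       [3.0, 3.0, 3.0, 3.0]])   # implausible placement
log_density = gmm.score_samples(candidates)
```

A candidate box whose feature resembles the training pairs receives a much higher log-density than one far from them, which is what lets the context term discriminate between placements.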

III-B3 Final Model

Putting everything together, we have:


III-C Implementation Details

We adopt the pretrained Faster R-CNN released by [27] as the object detector. We use 10 object categories for insertion (detailed in Sect. IV) and keep the top 20 object categories and top 10 relations from the Visual Genome [29] dataset, sorted by the co-occurrence count with the 10 insertable categories. For context modeling, we consider only a limited number of existing objects per image, with a detection threshold of 0.4. For each image, we sample the candidate bounding boxes in a sliding-window fashion with a fixed window size and stride, which generates around 800 candidate boxes per image. We further refine the size of the best-ranked box by searching over a size interval equally discretized into 32 values. A complete, single-threaded, pure Python implementation on an Intel i7-5930K 3.50GHz CPU and a single Titan X Pascal GPU takes around 4 seconds per image.
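The candidate sampling described above can be sketched as follows. The window size, stride, image size, and size interval used here are placeholder values chosen for illustration, not the settings of the paper, and the function names are hypothetical.

```python
def sliding_window_boxes(img_w, img_h, win, stride):
    """Return (x, y, w, h) square windows that lie fully inside the image."""
    return [(x, y, win, win)
            for y in range(0, img_h - win + 1, stride)
            for x in range(0, img_w - win + 1, stride)]

def size_grid(lo, hi, steps=32):
    """Equally spaced candidate sizes in [lo, hi] for refining the best box."""
    return [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]

# Placeholder settings (assumptions, not the paper's values).
boxes = sliding_window_boxes(640, 480, win=64, stride=16)
sizes = size_grid(0.5, 2.0)
```

The number of candidates grows with the image area divided by the squared stride, so the stride is the main knob for trading ranking resolution against runtime.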

IV Dataset

IV-A Scenes and Objects

We establish a test set that consists of fifty scenes from the Visual Genome [29] dataset. The test scenes come from four indoor scene types: living room, dining room, kitchen, and office. The number of scenes of each type is shown in Table I.

| living room | dining room | kitchen | office |
| 15 | 13 | 10 | 12 |
TABLE I: Statistics on different scene types

There are ten insertable objects considered in this experiment, as shown in Table II. The same illustrations and specifications are emphasized to the annotators as a standard to ensure consistency for the same category. We choose these insertable objects based on the following principles:

  1. Environment: Mostly appears indoors;

  2. Frequency: Is among the 150 most frequent categories [28] in Visual Genome;

  3. Flexibility: Is not generally embedded (e.g. sink) or large and clumsy (e.g. bed), so that it can be flexibly inserted into a scene;

  4. Diversity: Does not have a significant context overlap with other object categories (e.g. bottle is not included because we already have cup).

| Category | Illustration | Specification |
| cup | (image) | A cup for drinking water that is medium in size. |
| spoon | (image) | No specification. |
| apple | (image) | No specification. |
| cake | (image) | A small dessert cake (not a big birthday cake). |
| laptop | (image) | An open laptop. |
| mouse | (image) | No specification. |
| clock | (image) | A normal clock at home (not a watch / alarm clock / bracket clock). |
| book | (image) | A closed book that is roughly of B5 size and 200-300 pages. |
| pillow | (image) | A rectangular pillow that is commonly placed on sofas, chairs, etc. |
TABLE II: Insertable object categories considered in this experiment

IV-B Annotation Guideline

On average, there are 11 human annotators for each scene. For each scene, the annotator is asked to generate the following annotations:

IV-B1 Insertable Categories

For each scene, the annotator is encouraged to annotate as many insertable object categories as possible, but no more than 5 (chosen from the categories in Table II).

IV-B2 User Preference

For each annotated object category, the annotator should assign a preference score of 1 or 2, as defined below. The annotators are shown a wide range of example scenes in advance to ensure that they apply consistent criteria for this preference.

  • Score 2 (very suitable): Indicates “this category is very suitable to be inserted into the scene”;

  • Score 1 (generally suitable): Indicates “this category can be inserted into this scene, yet not very suitable”.

IV-B3 Bounding Box Size

For each annotated object category, the annotator should draw a rectangular bounding box whose longer side equals the longer side of an appropriate bounding box for the object. We need only one degree of freedom for size evaluation because the aspect ratio of the inserted object is typically fixed.

IV-B4 Insertable Region

For each annotated object category, the annotator should draw a region as follows: imagine you are holding the object for insertion and dragging it over all the places where it can be inserted. The region covered by the object during this process is defined as the insertable region, which should be drawn using a brush tool (Fig. 2).

Fig. 2: Method for drawing the insertable region: (a) original image, (b) imagination, (c) insertable region. This is the same illustration that we presented to the annotators. Note that in this case, because the cup has a non-zero height, some pixels above the table can also be covered.

Note that different annotators may have different opinions about this region. For instance, for the scene in Fig. 2, some annotators may not include the bottom-left corner of the table when drawing the insertable region. This subjectivity is explicitly allowed within the range of quality control.

V Experiments

Tables IV and V show qualitative results for object recommendation and scene retrieval, both enhanced by bounding box prediction. We further quantitatively evaluate our method against existing baselines on our new test set. Task-specific metrics are designed for comprehensive evaluation.

We design experiments for the 3 subtasks systematically. First, for both the object recommendation and scene retrieval subtasks, we compare our system against a statistical baseline, bag-of-categories (BOC), which is based on category co-occurrence. Second, we separately evaluate the size and location for the bounding box prediction subtask, and compare our results against a recently proposed neural model for person composition [26]. Finally, we report comparisons on both quantitative and qualitative results, which help interpret what our algorithm has learned.

| model | nDCG@1 | nDCG@3 | nDCG@5 |
| BOC | 43.28% | 47.81% | 55.72% |
| ours | 59.06% | 55.30% | 61.78% |
TABLE III: Quantitative evaluation for object recommendation
TABLE IV: Object recommendation results. Top 3 recommended categories for the three demos: 1) clock, 2) mouse, 3) cup; 1) clock, 2) tv, 3) cup; 1) cup, 2) clock, 3) book. The first row for each demo is the recommended bounding box (V-C2) and indicative heatmap (V-C3) automatically generated for the top 3 recommended object categories (V-A). The second row is generated using manually selected yet automatically pasted object segments (for illustration purposes only).
TABLE V: Scene retrieval results: top 5 retrieved scenes (V-B) for (a) clock, (b) apple, (c) cup. The first row for each category shows the original images, and the second row shows the scenes overlaid with automatically generated bounding boxes (V-C2) and heatmaps (V-C3).

V-A Object Recommendation

We adopt the normalized discounted cumulative gain (nDCG) [31], an indicator of ranking quality widely used in information retrieval (IR), as the metric for object recommendation. Because the desired output for this subtask is a ranked list, and each item is annotated with a gain reflecting user preference, nDCG is a natural choice for evaluation.

V-A1 Metric Formulation

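For $N$ images and $M_i$ annotators for the $i$-th image, the averaged nDCG@K is defined as

$$\overline{\mathrm{nDCG@K}} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{M_i} \sum_{j=1}^{M_i} \mathrm{nDCG@K}_{i,j}$$

where $\mathrm{nDCG@K}_{i,j}$ measures the ranking quality for the top-K recommended object categories of the $i$-th image, with regard to the ground truth user preference scores provided by the $j$-th annotator.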
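For readers unfamiliar with nDCG, the following sketch computes nDCG@K for one ranked list against one annotator's preference scores. It assumes the common formulation with a linear gain and a logarithmic discount, which may differ in detail from the variant used in [31]; the function names and the toy annotation are illustrative only.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain of the first k gains (rank 1 is undiscounted)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_items, gain_of, k):
    """nDCG@K of `ranked_items` given one annotator's gains (missing items score 0)."""
    gains = [gain_of.get(item, 0) for item in ranked_items]
    ideal = sorted(gain_of.values(), reverse=True)
    best = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / best if best > 0 else 0.0

# Toy example: the system recommends clock, mouse, cup; the annotator marked
# clock as very suitable (2) and cup as generally suitable (1).
recommended = ["clock", "mouse", "cup"]
annotation = {"clock": 2, "cup": 1}
top1 = ndcg_at_k(recommended, annotation, 1)
top3 = ndcg_at_k(recommended, annotation, 3)
```

Averaging such per-annotator values over annotators and images gives a single number per cutoff K.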

V-A2 Quantitative Results

The baseline method, bag-of-categories (BOC), regards each image as a bag of existing objects, and ranks all candidate objects by the sum of co-occurrences with the existing objects. BOC borrows its idea from the simple yet effective bag-of-words (BOW) model [32] in natural language processing, which ignores structural information and keeps only statistical counts.

The quantitative comparison between our system and BOC is shown in Table III. We evaluate nDCG at the top-1, top-3, top-5 results respectively, because there are at most 5 annotations per image. As demonstrated by the results, our method achieves consistent improvements as compared to BOC.
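The BOC baseline as described above can be sketched in a few lines. The co-occurrence table `cooc` and its counts are invented for illustration, and `boc_rank` is a hypothetical name; the sketch only assumes that pairwise co-occurrence counts between candidate and context categories are available.

```python
def boc_rank(existing, candidates, cooc):
    """Rank candidate categories by their summed co-occurrence counts with
    the detected objects; spatial layout is ignored entirely."""
    score = {cand: sum(cooc.get((cand, obj), 0) for obj in existing)
             for cand in candidates}
    return sorted(candidates, key=lambda cand: -score[cand])

# Made-up co-occurrence counts between (candidate, context) categories.
cooc = {("clock", "wall"): 5, ("cup", "table"): 7, ("laptop", "table"): 9}
# Two detected walls and one table: clock scores 10, laptop 9, cup 7.
ranking = boc_rank(["wall", "wall", "table"], ["cup", "clock", "laptop"], cooc)
```

Because every detection adds its full count regardless of position or size, repeated detections of the same context category can dominate the ranking.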

V-A3 Qualitative Analysis

Fig. 3: Qualitative comparison on the top recommended object. (d) ours: mouse, BOC: clock; (e) ours: clock, BOC: laptop; (f) ours: cup, BOC: book.

The largest gain of our method over the baseline is reflected by nDCG@1, i.e. the top result. Fig. 3 shows a qualitative comparison against BOC on the top 1 recommendation. In Fig. 3(d), the baseline wrongly recommends a clock because there are 2 detected walls, whereas our system recognizes that most candidate boxes for a clock lead to unreasonable relative positions with respect to the walls. In Fig. 3(e), the baseline recommends a laptop due to the high co-occurrence of the pair (laptop, table). However, the table in this scene is too small to accommodate any noticeable bounding box for insertion. In Fig. 3(f), the baseline recommends a book because of a mis-detection of a small shelf at the edge of the background (the blue box, which is actually a counter); this detection is almost ignored by our system for the same reason as in Fig. 3(e).

In summary, the key advantage of our algorithm over baseline is that we not only consider the co-occurrence frequency, but also take into account the relative locations and relationships between the inserted object and context objects. This enables our system to bypass candidate categories with high co-occurrence counts yet unreasonable placements; and to also be more robust when faced with detection failures.

V-B Scene Retrieval

Fig. 4: Qualitative comparison on top 10 recommended scenes: (a) cup, ours; (b) cup, BOC; (c) clock, ours; (d) clock, BOC.

Similarly, for the scene retrieval subtask, we also adopt nDCG as a metric for ranked image list.

V-B1 Metric Formulation

For insertable categories and candidate images for each category, the averaged nDCG@K is defined as


where measures the ranking quality of the top-K retrieved scene images for the th category, with regard to the ground truth user preference scores provided by the th annotator.

V-B2 Quantitative Results

The quantitative comparison between our system and BOC is shown in Table VI. We evaluate nDCG at $K = 1, 10, 20$, respectively, considering that there are 50 candidate images in total. Again, we outperform the baseline at every cutoff.

| model | nDCG@1 | nDCG@10 | nDCG@20 |
| BOC | 60.00% | 50.70% | 56.03% |
| ours | 65.45% | 54.87% | 58.25% |
TABLE VI: Quantitative evaluation for scene retrieval

V-B3 Qualitative Analysis

Fig. 4 shows a qualitative comparison against BOC on the top 10 retrieved scenes. Intuitively, our system prefers scenes whose supporting objects are visually large, continuous, or close to the viewer, while the baseline is typically biased towards scenes with more relevant objects. This is because only boxes that lead to reasonable relationships contribute significantly to our ranking score, while BOC is agnostic to the spatial structure of the context objects.

V-C Bounding Box Prediction

We evaluate the size and location of the predicted bounding box separately. The baseline for this subtask is the neural approach proposed by [26], which builds an automatic two-stage pipeline for inserting a person’s segment into an image. It first determines the best bounding box using dilated convolution networks [33], then retrieves a context-compatible person segment from a database.

Here, we compare our system’s performance on bounding box prediction, against the first stage of [26]. We adopt the same object detector [27] with the same confidence threshold as in our experiments, and the same training settings for [26] as reported in its supplementary material.

For size prediction, we design a single metric to measure the similarity of 2 lengths. For location prediction, however, we design 2 different metrics for automatic use cases and manual use cases, respectively. The automatic use case would require an API that returns the best ranked bounding box, while the manual use case would prefer a heatmap as an intuitive hint. We will discuss these 3 metrics and different use cases in detail.

V-C1 Metric Formulation — Size

For a bounding box with height and width , we define its box size . We then define a metric that evaluates how close the predicted box is to the ground truth box in terms of box size. Note that we preserve only three degrees of freedom for a box, because the aspect ratio of the inserted object segment should be predetermined.

Given images, and annotators for the th image, for a specific category , we define the average intersection over union (IoU) score for box size as:


where, , is the ground truth box size provided by annotator in image for category , is the predicted box size in image for category . has an upper bound of 1.0 (when ), and a lower bound of 0.0 (when and are drastically different).
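One simple way to realize an IoU between two lengths, consistent with the stated bounds (1.0 when the sizes are equal, approaching 0.0 when they are drastically different), is the ratio of the smaller to the larger size. This form is an assumption made for illustration, and `size_iou` is a hypothetical name.

```python
def size_iou(gt_size, pred_size):
    """IoU of two positive lengths, viewed as intervals [0, gt] and [0, pred]."""
    if gt_size <= 0 or pred_size <= 0:
        raise ValueError("sizes must be positive")
    return min(gt_size, pred_size) / max(gt_size, pred_size)

def mean_size_iou(gt_sizes, pred_size):
    """Average over the sizes given by several annotators for one image."""
    return sum(size_iou(g, pred_size) for g in gt_sizes) / len(gt_sizes)

same = size_iou(40.0, 40.0)
half = size_iou(20.0, 40.0)
avg = mean_size_iou([20.0, 40.0, 80.0], 40.0)
```

The ratio is symmetric in its two arguments, so over- and under-estimating the size by the same factor are penalized equally.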

V-C2 Metric Formulation — Location, Best Box

The best recommended box would be crucial to an automatic application, such as an automatic preview software. Hence, this experiment evaluates whether the best recommended box is in a reasonable location. We consider the location of a bounding box as reasonable, if it is contained within the insertable region annotated by the user.

Fig. 5: Metric for best box: A box is regarded as reasonable if it is contained within the insertable region annotated by the user (Fig. 5(a)). A box that slightly exceeds this region (Fig. 5(b)) is not good enough, yet still visually better than a box that is an outlier (Fig. 5(c)). For best box evaluation, the difference between accuracy and strict accuracy is that for cases like Fig. 5(b), the former counts the fraction of area that is included in the insertable region, whereas the latter only counts valid boxes as in Fig. 5(a).

Note that this criterion can be biased towards smaller boxes. We address this drawback in 2 aspects: First, larger boxes that slightly exceed the insertable region may still have a non-zero contribution to this metric; Second, unreasonably small boxes will pull down the size prediction score accordingly.

Given $N$ images and $M_i$ annotators for the $i$-th image, for a specific category $c$, we define the average accuracy for the location of the best recommended box as

$$\mathrm{Acc}_c = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{M_i} \sum_{j=1}^{M_i} \frac{|b_{i,c} \cap R_{i,j,c}|}{|b_{i,c}|} \tag{14}$$

where $N$ and $M_i$ share the same meaning as before, $R_{i,j,c}$ is the ground truth insertable region drawn by annotator $j$ in image $i$ for category $c$, $b_{i,c}$ is the best recommended box in image $i$ for category $c$, and $|\cdot|$ denotes pixel area. The fraction has an upper bound of 1.0 (when $b_{i,c}$ is entirely contained within $R_{i,j,c}$), and a lower bound of 0.0 (when $b_{i,c}$ is entirely outside $R_{i,j,c}$).

Furthermore, if we only regard bounding boxes that are fully contained by the insertable region as reasonable, we can define a stricter metric by substituting the fraction in Eq. 14 with a binary indicator function $\mathbb{1}[b_{i,c} \subseteq R_{i,j,c}]$, which is set to 1 if and only if $b_{i,c}$ is fully contained by $R_{i,j,c}$. We denote this metric as the “strict accuracy”. This metric excludes boxes that are partially contained by the insertable region, and only counts valid boxes that are entirely covered.
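The two location metrics for the best box can be illustrated on a binary region mask. In this sketch the insertable region is a boolean NumPy array, boxes are `(x, y, w, h)` in array (column, row) coordinates and are assumed to lie inside the image; `box_accuracy` is a hypothetical helper, not the evaluation code of the paper.

```python
import numpy as np

def box_accuracy(region_mask, box):
    """Return (accuracy, strict accuracy) of one box against one region mask.
    accuracy: fraction of the box area that falls inside the region;
    strict accuracy: 1.0 only if the whole box is inside the region."""
    x, y, w, h = box
    inside = region_mask[y:y + h, x:x + w].sum()
    frac = float(inside) / float(w * h)
    return frac, float(inside == w * h)

# Toy 10x10 image whose insertable region is the central 6x6 square.
mask = np.zeros((10, 10), dtype=bool)
mask[2:8, 2:8] = True

contained = box_accuracy(mask, (2, 2, 4, 4))   # fully inside the region
partial = box_accuracy(mask, (6, 6, 4, 4))     # only a 2x2 corner overlaps
outside = box_accuracy(mask, (0, 0, 2, 2))     # no overlap
```

The partially overlapping box still earns a quarter of the credit under the soft metric but nothing under the strict one, which mirrors the distinction drawn in the text.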

V-C3 Metric Formulation — Location, Heatmap

This metric evaluates the score distribution of all sampled boxes, which we further convert into an intuitive pixel-level representation. We denote this representation as a heatmap.

Specifically, we generate a heatmap by adding the score of each sampled box to all its contained pixels. The heat value at each pixel hence approximates the probability that it is contained within at least one insertable box. (Footnote 1: For each pixel in image $I$ contained within candidate boxes $b_1, \dots, b_n$, for a specific category $c$, with $p_k$ denoting the score of box $b_k$, we have $1 - \prod_{k=1}^{n} (1 - p_k) \approx \sum_{k=1}^{n} p_k$ when every $p_k \ll 1$. Typically, there are around 800 candidate boxes per image, therefore the numerical value of each $p_k$ is reasonably small to make this approximation.) This representation is compatible, and hence directly comparable, with the insertable region provided by the user.

Fig. 6: Metric for heatmap: The heat value at each pixel represents the probability that it is contained within at least 1 insertable box. We take the average insertable region over all users as the ground truth heatmap, and test the consistency of this pixel-level probability distribution between ground truth and prediction. This metric measures the system’s ability to approximate the hint provided by a human.

Note that the heatmap is not intended for programmatic use; it only aims to provide a clear hint to the user. We do not adopt the distribution of the bottom-left corner or the stand position [26], because not all the insertable categories are supported from the bottom (e.g. TV). Hence, a heatmap that spreads the probability of each box over its inner pixels is more cognitively consistent across different categories.

Given images, and annotators for the th image, for a specific category , we define the average IoU for the heatmap as (illustrated in Fig. 6)


In Eq. 16, shares the same meaning as before. is the ground truth insertable region drawn by annotator in image for category , is the predicted heatmap for category in image .

In Eq. 17, denotes the averaged ground truth insertable region, and denotes the predicted heatmap. iterates through all pixels of and . We normalize and such that they each sums to 1.0. This definition for IoU has a maximum value 1.0 when and are exactly the same, and a minimum value 0.0 when they are absolutely disjoint.
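The heatmap construction and a distribution-level IoU can be sketched as below. The IoU here is taken as the sum of pixel-wise minima over the sum of pixel-wise maxima of the two normalized maps, which is an assumed form that matches the stated extremes (1.0 for identical maps, 0.0 for disjoint ones); the function names and toy values are illustrative.

```python
import numpy as np

def make_heatmap(shape, boxes, scores):
    """Add each box's score to every pixel it covers; boxes are (x, y, w, h)."""
    heat = np.zeros(shape, dtype=float)
    for (x, y, w, h), s in zip(boxes, scores):
        heat[y:y + h, x:x + w] += s
    return heat

def heatmap_iou(gt, pred):
    """IoU of two non-negative maps after normalizing each to sum to 1."""
    g = gt / gt.sum()
    p = pred / pred.sum()
    return float(np.minimum(g, p).sum() / np.maximum(g, p).sum())

# Two overlapping boxes on a 4x4 image.
heat = make_heatmap((4, 4), [(0, 0, 2, 2), (1, 1, 2, 2)], [0.5, 0.25])

# Ground truth as the average of binary insertable regions from two annotators.
r1 = np.zeros((4, 4)); r1[0:2, 0:2] = 1.0
r2 = np.zeros((4, 4)); r2[2:4, 2:4] = 1.0
gt_mean = (r1 + r2) / 2.0

iou_same = heatmap_iou(heat, heat)
iou_disjoint = heatmap_iou(r1, r2)
iou_pred = heatmap_iou(gt_mean, heat)
```

Normalizing both maps first makes the score insensitive to the absolute scale of the box scores, so only the spatial distribution of heat is compared.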

V-C4 Quantitative Results

We report the average IoU for size, the average accuracy for best recommended location, and the average IoU for heatmap, over all insertable categories. We refine the location heatmap generated by [26] by adding the heat value at each stand position to pixels in the corresponding box to match our heatmap definition. As shown in Table VII, we achieve consistent improvement over the baseline in all metrics designed for bounding box prediction.

| model | IoU (size) | accuracy (location, best box) | strict accuracy (location, best box) | IoU (location, heatmap) |
| [26] | 59.79% | 12.50% | 2.87% | 9.14% |
| ours | 64.90% | 33.65% | 9.46% | 18.23% |
TABLE VII: Quantitative evaluation for bounding box prediction

V-C5 Qualitative Analysis

Table VIII shows the qualitative comparison against [26] on bounding box prediction. We outperform the baseline significantly, especially on location prediction. Possible reasons include: 1) The baseline employs an inpainting model to generate fake background images that do not contain the target object, which leads to error propagation throughout the downstream training process; 2) The Visual Genome [29] dataset is relatively small, and images containing non-human objects (i.e. the insertable categories considered in this paper) are even fewer. We do not use larger datasets such as MS-COCO [34], because many important context object categories (e.g. desk, counter, wall, etc.) are not annotated. The data-driven nature of neural networks hence limits the performance of [26].

TABLE VIII: Qualitative comparison against [26] on bounding box prediction (inserted objects: cup, spoon, apple, cake, laptop, mouse, tv, clock, book, pillow). The first row is the original images, the second row is our results, and the third row is the results of [26]. The inserted object for each image is labeled at the bottom of each column.

VI Conclusion

We propose a novel research topic, dual recommendation for object insertion, and build an unsupervised algorithm that exploits object-level context. We establish a new test dataset and design task-specific metrics for automatic quantitative evaluation. We outperform existing baselines on all subtasks under a unified framework, as evidenced by both quantitative and qualitative results. Future work includes the incorporation of high-dimensional image features, and larger datasets that are able to fully drive the training of neural networks.


This work was supported by the National Key R&D Program (No. 2017YFB1002604), the National Natural Science Foundation of China (No. 61772298 and No. 61521002), a Research Grant of Beijing Higher Institution Engineering Research Center, and the Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology.