We introduce the first dataset for sequential vision-to-language, and explore how this data may be used for the task of visual storytelling. The first release of this dataset, SIND v.1, includes 81,743 unique photos in 20,211 sequences, aligned to both descriptive (caption) and story language. We establish several strong baselines for the storytelling task, and motivate an automatic metric to benchmark progress. Modelling concrete description as well as figurative and social language, as provided in this dataset and the storytelling task, has the potential to move artificial intelligence from basic understandings of typical visual scenes towards more and more human-like understanding of grounded event structure and subjective expression.
Beyond understanding simple objects and concrete scenes lies interpreting causal structure; making sense of visual input to tie disparate moments together as they give rise to a cohesive narrative of events through time. This requires moving from reasoning about single images – static moments, devoid of context – to sequences of images that depict events as they occur and change. On the vision side, progressing from single images to images in context allows us to begin to create an artificial intelligence (AI) that can reason about a visual moment given what it has already seen. On the language side, progressing from literal description to narrative helps to learn more evaluative, conversational, and abstract language. This is the difference between, for example, “sitting next to each other” versus “having a good time”, or “sun is setting” versus “sky illuminated with a brilliance…” (see Figure 1). The first descriptions capture image content that is literal and concrete; the second requires further inference about what a good time may look like, or what is special and worth sharing about a particular sunset.
We introduce the first dataset of sequential images with corresponding descriptions, which captures some of these subtle but important differences, and advance the task of visual storytelling. We release the data in three tiers of language for the same images: (1) Descriptions of images-in-isolation (DII); (2) Descriptions of images-in-sequence (DIS); and (3) Stories for images-in-sequence (SIS). This tiered approach reveals the effect of temporal context and the effect of narrative language. As all the tiers are aligned to the same images, the dataset facilitates directly modeling the relationship between literal and more abstract visual concepts, including the relationship between visual imagery and typical event patterns. We additionally propose an automatic evaluation metric which is best correlated with human judgments, and establish several strong baselines for the visual storytelling task.
Work in vision to language has exploded, with researchers examining image captioning [Lin et al., 2014; Karpathy and Fei-Fei, 2015; Vinyals et al., 2015; Xu et al., 2015; Chen et al., 2015; Young et al., 2014; Elliott and Keller, 2013], question answering [Antol et al., 2015; Ren et al., 2015; Gao et al., 2015; Malinowski and Fritz, 2014], visual phrases [Sadeghi and Farhadi, 2011], video understanding [Ramanathan et al., 2013], and visual concepts [Krishna et al., 2016; Fang et al., 2015].
Such work focuses on direct, literal description of image content. While this is an encouraging first step in connecting vision and language, it is far from the capabilities needed by intelligent agents for naturalistic interactions. There is a significant difference, yet unexplored, between remarking that a visual scene shows “sitting in a room” – typical of most image captioning work – and that the same visual scene shows “bonding”. The latter description is grounded in the visual signal, yet it brings to bear information about social relations and emotions that can be additionally inferred in context (Figure 1). Visually-grounded stories facilitate more evaluative and figurative language than has previously been seen in vision-to-language research: If a system can recognize that colleagues look bored, it can remark and act on this information directly.
Storytelling itself is one of the oldest known human activities [Wiessner, 2014], providing a way to educate, preserve culture, instill morals, and share advice; focusing AI research towards this task therefore has the potential to bring about more human-like intelligence and understanding.
Table 1: Query terms with the most albums returned (number of albums in parentheses).
|beach (684)||breaking up (350)||easter (259)|
|amusement park (525)||carnival (331)||church (243)|
|building a house (415)||visit (321)||graduation ceremony (236)|
|party (411)||market (311)||office (226)|
|birthday (399)||outdoor activity (267)||father’s day (221)|
We begin by generating a list of “storyable” event types. We leverage the idea that “storyable” events tend to involve some form of possession, e.g., “John’s birthday party,” or “Shabnam’s visit.” Using the Flickr data release [Thomee et al., 2015], we aggregate 5-grams of photo titles and descriptions, using Stanford CoreNLP [Manning et al., 2014] to extract possessive dependency patterns. We keep the heads of possessive phrases if they can be classified as an event in WordNet 3.0, relying on manual winnowing to target our collection efforts. (We simultaneously supplemented this data-driven effort with a small hand-constructed gazetteer.) These terms are then used to collect albums using the Flickr API (https://www.flickr.com/services/api/). We only include albums with 10 to 50 photos where all album photos are taken within a 48-hour span and CC-licensed. See Table 1 for the query terms with the most albums returned.
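The album-level filters above (10 to 50 photos, a 48-hour span, CC licensing) can be sketched as a simple predicate. This is a minimal illustration, not the collection code: the `Album` record, its field names, and the single per-album `cc_licensed` flag are assumptions made for the example and do not reflect the Flickr API schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Album:
    """Hypothetical album record; field names are illustrative only."""
    album_id: str
    photo_times: List[datetime]  # time at which each photo was taken
    cc_licensed: bool            # assumed: True if the album's photos are CC-licensed


def is_eligible(album: Album,
                min_photos: int = 10,
                max_photos: int = 50,
                max_span: timedelta = timedelta(hours=48)) -> bool:
    """Apply the three album filters described in the text."""
    n = len(album.photo_times)
    if not (min_photos <= n <= max_photos):
        return False
    if not album.cc_licensed:
        return False
    # Span between the earliest and latest photo in the album.
    span = max(album.photo_times) - min(album.photo_times)
    return span <= max_span


start = datetime(2015, 7, 4, 9, 0)
# 12 photos taken over 5.5 hours: passes all filters.
picnic = Album("a1", [start + timedelta(minutes=30 * i) for i in range(12)], True)
# 12 photos taken over 66 hours: exceeds the 48-hour span.
long_trip = Album("a2", [start + timedelta(hours=6 * i) for i in range(12)], True)
eligible = [a.album_id for a in (picnic, long_trip) if is_eligible(a)]
```

Here `eligible` keeps only the first album, since the second spans more than 48 hours.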
The photos returned from this stage are then presented to crowd workers using Amazon’s Mechanical Turk to collect the corresponding stories and descriptions. The crowdsourcing workflow of developing the complete dataset is shown in Figure 2.
We develop a 2-stage crowdsourcing workflow to collect naturalistic stories with text aligned to images. The first stage is storytelling, where the crowd worker selects a subset of photos from a given album to form a photo sequence and writes a story about it (see Figure 3). The second stage is re-telling, in which the worker writes a story based on one photo sequence generated by workers in the first stage.
In both stages, all album photos are displayed in the order of the time that the photos were taken, with a “storyboard” underneath. In storytelling, by clicking a photo in the album, a “story card” of the photo appears on the storyboard. The worker is instructed to pick at least five photos, arrange the order of selected photos, and then write a sentence or a phrase on each card to form a story; this appears as a full story underneath the text aligned to each image. Additionally, this interface captures the alignments between text and photos. Workers may skip an album if it does not seem storyable (e.g., a collection of coins). Albums skipped by two workers are discarded. The interface of re-telling is similar, but it displays the two photo sequences already created in the first stage, which the worker chooses from to write the story. For each album, 2 workers perform storytelling (at $0.3/HIT), and 3 workers perform re-telling (at $0.25/HIT), yielding a total of 1,907 workers. All HITs use quality controls to ensure varied text at least 15 words long.
We also use crowdsourcing to collect descriptions of images-in-isolation (DII) and descriptions of images-in-sequence (DIS), for the photo sequences with stories from a majority of workers in the first task (as in Figure 2). In both DII and DIS tasks, workers are asked to follow the instructions for image captioning proposed in MS COCO [Lin et al., 2014], such as “describe all the important parts”. In DII, we use the MS COCO image captioning interface (https://github.com/tylin/coco-ui). In DIS, we use the storyboard and story cards of our storytelling interface to display a photo sequence, with MS COCO instructions adapted for sequences. We recruit 3 workers for DII (at $0.05/HIT) and 3 workers for DIS (at $0.07/HIT).
We tokenize all storylets and descriptions with the CoreNLP tokenizer, and replace all person names with generic male/female tokens and all identified named entities with their entity type (e.g., location). (For names, we use those occurring at least 10,000 times in https://ssa.gov/oact/babynames/names.zip.) The data is released as training, validation, and test following an 80%/10%/10% split on the stories-in-sequence albums. Example language from each tier is shown in Figure 4.
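The name replacement and the album-level 80%/10%/10% split can be sketched as follows. This is an illustrative sketch under stated assumptions: the name sets are tiny hypothetical stand-ins for the frequent-name lists, the placeholder strings `[male]` and `[female]` are invented for the example, the shuffle seed is arbitrary, and named-entity replacement is not shown.

```python
import random
from typing import Dict, List, Sequence

# Hypothetical, tiny name lists standing in for the frequent-name lists.
MALE_NAMES = {"john", "david"}
FEMALE_NAMES = {"mary", "linda"}


def anonymize(tokens: Sequence[str]) -> List[str]:
    """Replace known first names with generic gendered placeholder tokens."""
    out = []
    for tok in tokens:
        key = tok.lower()
        if key in MALE_NAMES:
            out.append("[male]")    # placeholder format is an assumption
        elif key in FEMALE_NAMES:
            out.append("[female]")
        else:
            out.append(tok)
    return out


def split_albums(album_ids: Sequence[str], seed: int = 0) -> Dict[str, List[str]]:
    """Shuffle album ids and split them 80/10/10 so no album spans two splits."""
    ids = sorted(album_ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 8 // 10
    n_val = len(ids) // 10
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }


tokens = anonymize(["John", "visited", "Mary", "today", "."])
splits = split_albums([f"album{i:03d}" for i in range(100)])
```

Splitting by album rather than by story keeps all stories and descriptions for the same photos in a single partition.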
Our dataset includes 10,117 Flickr albums with 210,819 unique photos. Each album on average has 20.8 photos (σ = 9.0). The average time span of each album is 7.9 hours (σ = 11.4). Further details of each tier of the dataset are shown in Table 2. (We exclude words seen only once.)
We use normalized pointwise mutual information to identify the words most closely associated with each tier (Table 3). Top words for descriptions-in-isolation reflect an impoverished disambiguating context: References to people often lack social specificity, as people are referred to as simply “man” or “woman”. Single images often do not convey much information about underlying events or actions, which leads to the abundant use of posture verbs (“standing”, “sitting”, etc.). As we turn to descriptions-in-sequence, these relatively uninformative words are much less represented. Finally, top story-in-sequence words include more storytelling elements, such as names (male), temporal references (today) and words that are more dynamic and abstract (went, decided).
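As a rough sketch of the word-tier association measure, normalized PMI can be computed as NPMI(w, t) = PMI(w, t) / (−log p(w, t)), with PMI(w, t) = log [p(w, t) / (p(w) p(t))]. The event definition used here (each token occurrence paired with the tier it came from) and the toy corpus are assumptions for illustration; the text does not specify the exact estimation procedure.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def npmi_by_tier(corpus: Dict[str, List[str]]) -> Dict[Tuple[str, str], float]:
    """Compute NPMI for every (word, tier) pair observed in the corpus.

    `corpus` maps a tier name to a flat list of tokens from that tier.
    """
    joint: Counter = Counter()
    for tier, tokens in corpus.items():
        for tok in tokens:
            joint[(tok, tier)] += 1
    total = sum(joint.values())

    word_counts: Counter = Counter()
    tier_counts: Counter = Counter()
    for (tok, tier), c in joint.items():
        word_counts[tok] += c
        tier_counts[tier] += c

    scores = {}
    for (tok, tier), c in joint.items():
        p_joint = c / total
        p_word = word_counts[tok] / total
        p_tier = tier_counts[tier] / total
        pmi = math.log(p_joint / (p_word * p_tier))
        # NPMI is undefined when p_joint == 1; report 0.0 in that degenerate case.
        scores[(tok, tier)] = pmi / -math.log(p_joint) if p_joint < 1.0 else 0.0
    return scores


# Toy corpus; the tokens are illustrative, not taken from the dataset.
toy = {
    "DII": ["man", "standing", "man", "sitting"],
    "SIS": ["today", "went", "man", "decided"],
}
scores = npmi_by_tier(toy)
```

NPMI ranges from −1 to 1, so sorting the words of a tier by score in descending order gives the terms most closely associated with that tier.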
Given the nature of the complex storytelling task, the best and most reliable evaluation for assessing the quality of generated stories is human judgment. However, automatic evaluation metrics are useful to quickly benchmark progress. To better understand which metric could serve as a proxy for human evaluation, we compute pairwise correlation coefficients between automatic metrics and human judgments on 3,000 stories sampled from the SIS training set.
For the human judgments, we again use crowdsourcing on MTurk, asking five judges per story to rate how strongly they agreed with the statement “If these were my photos, I would like using a story like this to share my experience with my friends”. (The scale presented ranged from “Strongly disagree” to “Strongly agree”, which we convert to a scale of 1 to 5.) We take the average of the five judgments as the final score for the story. For the automatic metrics, we use METEOR (version 1.5 with hter weights), smoothed-BLEU [Lin and Och, 2004], and Skip-Thoughts [Kiros et al., 2015] to compute similarity between each story for a given sequence. Skip-thoughts provide a Sentence2Vec embedding which models the semantic space of novels.
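Table 4: Correlations between automatic metric scores and human judgments, one correlation coefficient per row (p-values in parentheses).
|0.22 (2.8e-28)||0.08 (1.0e-06)||0.18 (5.0e-27)|
|0.20 (3.0e-31)||0.08 (8.9e-06)||0.16 (6.4e-22)|
|0.14 (1.0e-33)||0.06 (8.7e-08)||0.11 (7.7e-24)|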
As Table 4 shows, METEOR correlates best with human judgment according to all the correlation coefficients. This signals that a metric such as METEOR, which incorporates paraphrasing, correlates best with human judgment on this task. A more detailed study of automatic evaluation of stories is an area of interest for future work.
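The comparison between a metric and human ratings can be sketched with two common correlation coefficients, Pearson and Spearman. Their use here is an assumption for illustration, since the text does not name the coefficients. The ratings and metric scores below are invented toy values, and p-values are not computed.

```python
from statistics import mean
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: five 1-5 ratings per story, averaged into one human score each.
ratings = [[4, 5, 4, 3, 4], [2, 1, 2, 2, 3], [3, 3, 4, 3, 3]]
human = [mean(r) for r in ratings]
metric = [0.31, 0.12, 0.20]  # hypothetical automatic metric scores

r = pearson(human, metric)
rho = spearman(human, metric)
```

In this toy example the metric ranks the three stories in the same order as the judges, so the Spearman coefficient is 1 while the Pearson coefficient is slightly lower.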
Table 5: Example output from each system.
|+Viterbi||This is a picture of a family. This is a picture of a cake. This is a picture of a dog. This is a picture of a beach. This is a picture of a beach.|
|+Greedy||The family gathered together for a meal. The food was delicious. The dog was excited to be there. The dog was enjoying the water. The dog was happy to be in the water.|
|-Dups||The family gathered together for a meal. The food was delicious. The dog was excited to be there. The kids were playing in the water. The boat was a little too much to drink.|
|+Grounded||The family got together for a cookout. They had a lot of delicious food. The dog was happy to be there. They had a great time on the beach. They even had a swim in the water.|
We report baseline experiments on the storytelling task in Table 7, training on the SIS tier and testing on half the SIS validation set (valtest). Example output from each system is presented in Table 5. To highlight some differences between story and caption generation, we also train on the DII tier in isolation, and produce captions per-image, rather than in sequence. These results are shown in Table 7.
To train the story generation model, we use a sequence-to-sequence recurrent neural net (RNN) approach, which naturally extends the single-image captioning technique of Devlin et al. (2015) and Vinyals et al. (2015) to multiple images. Here, we encode an image sequence by running an RNN over the fc7 vectors of each image, in reverse order. This is used as the initial hidden state to the story decoder model, which learns to produce the story one word at a time using softmax loss over the training data vocabulary. We use Gated Recurrent Units (GRUs) [Cho et al., 2014] for both the image encoder and story decoder.
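The encoder side of this architecture can be illustrated with a small, untrained GRU written in plain Python. This is a sketch under several assumptions: the weights are random rather than learned, biases are omitted, the toy feature vectors have 4 dimensions instead of real fc7 features, and the update rule follows one common GRU convention. The class and function names are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class GRUCell:
    """Minimal GRU cell with random (untrained) weights and no biases."""

    def __init__(self, input_size: int, hidden_size: int, seed: int = 0) -> None:
        rng = random.Random(seed)

        def mat(rows: int, cols: int) -> Matrix:
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.hidden_size = hidden_size
        self.Wz, self.Wr, self.Wh = (mat(hidden_size, input_size) for _ in range(3))
        self.Uz, self.Ur, self.Uh = (mat(hidden_size, hidden_size) for _ in range(3))

    def step(self, x: Vector, h: Vector) -> Vector:
        # Update gate z and reset gate r.
        z = [sigmoid(a + b) for a, b in zip(matvec(self.Wz, x), matvec(self.Uz, h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.Wr, x), matvec(self.Ur, h))]
        # Candidate state computed from the input and the reset-gated previous state.
        rh = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b) for a, b in zip(matvec(self.Wh, x), matvec(self.Uh, rh))]
        # Interpolate between the previous state and the candidate state.
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def encode_sequence(cell: GRUCell, image_features: List[Vector]) -> Vector:
    """Run the GRU over per-image feature vectors in reverse order."""
    h = [0.0] * cell.hidden_size
    for x in reversed(image_features):
        h = cell.step(x, h)
    return h  # would serve as the decoder's initial hidden state


cell = GRUCell(input_size=4, hidden_size=3)
features = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
state = encode_sequence(cell, features)
```

Because the images are read in reverse, the first image of the sequence is the last one the encoder sees before the decoder starts producing words.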
In the baseline system, we generate the story using a simple beam search (size=10), which has been successful in image captioning previously [Devlin et al., 2015]. However, for story generation, the results of this model subjectively appear to be very poor – the system produces generic, repetitive, high-level descriptions (e.g., “This is a picture of a dog”). This is a predictable result given the label bias problem inherent in maximum likelihood training; recent work has looked at ways to address this issue directly [Li et al., 2016].
To establish a stronger baseline, we explore several decode-time heuristics to improve the quality of the generated story. The first heuristic is to lower the decoder beam size substantially. We find that using a beam size of 1 (greedy search) significantly increases the story quality, resulting in a 4.6 gain in METEOR score. However, the same effect is not seen for caption generation, with the greedy caption model obtaining worse quality than the beam search model. This highlights a key difference in generating stories versus generating captions.
Although the stories produced using a greedy search result in significant gains, they include many repeated words and phrases, e.g., “The kids had a great time. And the kids had a great time.” We introduce a very simple heuristic to avoid this, where the same content word cannot be produced more than once within a given story. This improves METEOR by another 2.3 points.
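Greedy decoding with the no-repeat heuristic can be sketched on a toy next-word model. Everything below is illustrative: the bigram table stands in for the trained decoder, and the small function-word list used to decide what counts as a content word is an assumption, since the text does not define content words precisely.

```python
from typing import Callable, Dict, List, Optional, Set

END = "</s>"
# Hypothetical function-word list; anything else is treated as a content word.
FUNCTION_WORDS = {"the", "a", "and", "was", ".", END}

# Toy bigram "model": previous word -> distribution over the next word.
TOY_BIGRAMS: Dict[Optional[str], Dict[str, float]] = {
    None: {"the": 1.0},
    "the": {"kids": 0.6, "dog": 0.4},
    "kids": {"played": 0.7, "swam": 0.3},
    "dog": {"swam": 0.8, "played": 0.2},
    "played": {".": 1.0},
    "swam": {".": 1.0},
    ".": {"the": 0.9, END: 0.1},
}


def toy_model(prefix: List[str]) -> Dict[str, float]:
    return TOY_BIGRAMS[prefix[-1] if prefix else None]


def greedy_decode(next_word_probs: Callable[[List[str]], Dict[str, float]],
                  max_len: int = 8,
                  block_repeats: bool = True) -> List[str]:
    """Beam size 1: always emit the most probable allowed next word."""
    story: List[str] = []
    used_content: Set[str] = set()
    while len(story) < max_len:
        probs = next_word_probs(story)
        if block_repeats:
            # Heuristic: a content word may appear at most once per story.
            probs = {w: p for w, p in probs.items() if w not in used_content}
        if not probs:
            break
        word = max(probs, key=probs.get)
        if word == END:
            break
        story.append(word)
        if word not in FUNCTION_WORDS:
            used_content.add(word)
    return story


plain = greedy_decode(toy_model, block_repeats=False)
deduped = greedy_decode(toy_model, block_repeats=True)
```

On this toy model, plain greedy search repeats "the kids played ." twice, while the constrained decoder is forced onto "the dog swam ." for its second sentence.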
An advantage of comparing captioning to storytelling side-by-side is that the captioning output may be used to help inform the storytelling output. To this end, we include an additional baseline where “visually grounded” words may only be produced if they are licensed by the caption model. We define the set of visually grounded words to be those which occurred at higher frequency in the caption training than the story training, i.e., words w for which P(w | T_caption) / P(w | T_story) > 1, where T_caption and T_story denote the caption and story training data.
We train a separate model using the caption annotations, and produce an n-best list of captions for each image in the valtest set. Words seen in at least 10 sentences in the 100-best list are marked as ‘licensed’ by the caption model. Greedy decoding without duplication proceeds with the additional constraint that if a word is visually grounded, it can only be generated by the story model if it is licensed by the caption model for the same photo set. This results in a further 1.3 METEOR improvement.
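The grounding constraint can be sketched in three steps: derive the visually grounded vocabulary by comparing relative word frequencies in the caption and story training data, derive the licensed vocabulary from an n-best caption list, and check each candidate word against both. The function names and toy data are hypothetical, and treating caption words never seen in story training as grounded is an assumption of this sketch.

```python
from collections import Counter
from typing import Dict, List, Sequence, Set

Sentence = Sequence[str]


def relative_freqs(sentences: Sequence[Sentence]) -> Dict[str, float]:
    counts = Counter(w for sent in sentences for w in sent)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def visually_grounded(caption_train: Sequence[Sentence],
                      story_train: Sequence[Sentence]) -> Set[str]:
    """Words relatively more frequent in caption training than in story training."""
    p_cap = relative_freqs(caption_train)
    p_story = relative_freqs(story_train)
    return {w for w, p in p_cap.items() if p > p_story.get(w, 0.0)}


def licensed_words(nbest: Sequence[Sentence], min_sentences: int = 10) -> Set[str]:
    """Words appearing in at least `min_sentences` sentences of the n-best list."""
    sentence_freq: Counter = Counter()
    for sent in nbest:
        sentence_freq.update(set(sent))  # count each word once per sentence
    return {w for w, c in sentence_freq.items() if c >= min_sentences}


def is_allowed(word: str, grounded: Set[str], licensed: Set[str]) -> bool:
    """A grounded word may be generated only if the caption model licenses it."""
    return word not in grounded or word in licensed


# Toy data, invented for illustration.
captions = [["dog", "on", "beach"], ["cake", "on", "table"]]
stories = [["we", "had", "fun", "on", "beach"], ["dog", "was", "happy"]]
grounded = visually_grounded(captions, stories)

# Stand-in for a caption model's n-best output for one photo set.
nbest: List[List[str]] = [["beach"]] * 12 + [["cake"]] * 3
licensed = licensed_words(nbest)
```

With these toy inputs, "beach" is grounded and licensed, "cake" is grounded but appears in too few n-best sentences to be licensed, and "happy" is not grounded and so is never restricted.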
It is interesting to note what a strong effect relatively simple heuristics have on the generated stories. We do not intend to suggest that these heuristics are the right way to approach story generation. Instead, the main purpose is to provide clear baselines that demonstrate that story generation has fundamentally different challenges from caption generation; and the space is wide open to explore for training and decoding methods to generate fluent stories.
We have introduced the first dataset for sequential vision-to-language, which incrementally moves from images-in-isolation to stories-in-sequence. We argue that modelling the more figurative and social language captured in this dataset is essential for evolving AI towards more human-like understanding. We have established several strong baselines for the task of visual storytelling, and have motivated METEOR as an automatic metric to evaluate progress on this task moving forward.