Multimodal Storytelling via Generative Adversarial Imitation Learning

12/05/2017, by Zhiqian Chen, et al.

Deriving event storylines is an effective summarization method that succinctly organizes extensive information and can significantly alleviate information overload. The critical challenge is the lack of a widely recognized definition of a storyline metric. Prior studies have developed various approaches based on different assumptions about users' interests. These works can extract interesting patterns, but their assumptions do not guarantee that the derived patterns will match users' preferences. Moreover, their exclusive reliance on a single-modality source misses cross-modality information. This paper proposes a method, multimodal imitation learning via generative adversarial networks (MIL-GAN), to directly model users' interests as reflected by various data. In particular, the proposed model addresses the critical challenge by imitating users' demonstrated storylines. Our model learns the reward patterns underlying user-provided storylines and then applies the learned policy to unseen data. The proposed approach is shown to capture users' implicit intent and to outperform competing methods by a substantial margin in a user study.


1 Introduction

Figure 1: Imitation Storytelling

As the Internet becomes more pervasive, information overload becomes increasingly severe. Even with the help of search engines such as Google and Yahoo, people cannot easily understand a series of coherent news events. For example, a person who wants to learn about the 2016 Presidential Election needs to search iteratively with several keywords and review numerous news documents before he or she can form a cohesive picture, e.g., a knowledge graph.

Storytelling is an efficient way to address this problem of information overload. By inferring connections between entity nodes, the original documents can be represented as a knowledge graph consisting of a set of storylines. Current works [Kumar et al.2008, Hossain et al.2012, Shahaf et al.2012a, Shahaf et al.2012b, Lin et al.2012] mainly focus on this task but employ strict assumptions that may not capture meaningful stories. For instance, [shahaf2010connecting] argues that maximizing the weakest link makes good storylines, which reduces to density-based clustering. However, modeling users' preferred stories requires understanding their evolution patterns, not merely maintaining strong coherence.

Specifically, existing research in this area suffers from several shortcomings: (1) Strong assumptions can lead to poor storylines. Most related works manually design coherence metrics that are simply assumed to correspond to good storylines. However, high coherence does not guarantee story quality, since a good story may have other properties such as interestingness, novelty, or a user-preferred style; coherence-based metrics are therefore not sufficient for modeling storylines. (2) Lack of multimodal learning. Existing works focus only on unimodal data, such as Textual Storytelling or Visual Storytelling, and overlook cross-modality information. Multimodal learning can find entity linkages that a single modality may miss. (3) Absence of a benchmark dataset. Few prior works provide a publicly available dataset for imitation storytelling.

This paper focuses on directly imitating user-provided storylines rather than designing any indirect measures. The basic idea is to learn the connectivity features and structure of storylines so that the agent can reveal similar stories in other domains. This approach is illustrated in Figure 1. The September 11 attack contains a storyline showing the key entities related to the cause, the perpetrators, the victims, and the aftermath. Similarly, the Charlie Hebdo Attack has a story of the same type. We argue that two such similar storylines share the same structure in a certain embedding space; therefore, we can reveal similar stories in another event domain (the Mexico Iguala mass kidnapping at the bottom of Figure 1). The inherent benefit of utilizing multimodal data is that humans often make inferences between images and texts so as to resolve ambiguities.

Deep reinforcement learning can learn multi-step decisions. However, a critical difficulty in reinforcement learning is designing a reward function for optimizing the agent. Unlike game applications, where a responsive environment exists, designing a reward function for storytelling is extremely difficult because such responses are unavailable. Instead of hand-crafting a reward function, we adopt imitation learning, a typical form of Inverse Reinforcement Learning (IRL), to learn the latent policy. Imitating from demonstrations is a strategy in which an agent learns a hidden policy for a dynamic environment by observing demonstrations delivered by an expert agent. However, IRL often suffers from two issues: instability and an implicit policy. To solve these problems, the Generative Adversarial Networks (GAN) [Goodfellow et al.2014] mechanism is employed: adversarial training addresses the instability issue, and GAN's internal generator can output the policy explicitly after training, which addresses the second issue. It is therefore promising to integrate IRL with GAN to learn a policy from users' demonstrations.

Different from previous work, our study treats storytelling as an imitation learning problem. Specifically, a policy is acquired from one event domain and then transferred to learn a storyline in another event domain. Furthermore, our work takes full advantage of multimodal data to improve the imitation performance. In this paper, we define the Multimodal Imitation Storytelling Task (MIST) and propose a multimodal generative adversarial method that derives the latent policy behind users' demonstrations without explicitly designing a reward function. The main contributions are as follows:

  • Proposing an imitation learning method for storytelling: To avoid the difficulty of designing a reward function for storytelling, we apply a generative adversarial model to imitation learning. Using this learning strategy, the model robustly captures latent connectivity patterns.

  • Designing a multimodal model integrated with GAN-based imitation learning: Inspired by humans' ability to link multiple entities through visual similarity, we propose a multimodal method across the textual and visual modalities with imitation learning. Our model learns reward functions from these two modalities and their correlation.

  • Creating a benchmark dataset for multimodal imitation storytelling: A new multimodal storytelling dataset is collected from multiple attack and civil-unrest events. Under several selected topics, storylines are manually extracted and validated. Both texts and images are included in our dataset.

The rest of this paper is organized as follows. Section 2 reviews the related works. A detailed description of the proposed method is given in Section 3. Experiments and a case study are presented in Section 4. We conclude the paper in Section 5.

2 Related Works

Storytelling: The storyline generation problem was first studied by Kumar et al. [Kumar et al.2008] as a generic redescription mining technique, which discovers a series of redescriptions between given disjoint, dissimilar object sets and their corresponding subsets. Storytelling is an efficient way to address information overload: by extracting critical and connected entities, the original documents are structurally summarized. Current works fall into two categories: Textual Storytelling [Kumar et al.2008, Hossain et al.2012, Fang et al.2011, Voskarides et al.2015, Lee et al.2012, Shahaf et al.2012a, Shahaf et al.2012b, Lin et al.2012] and Visual Storytelling [Kim et al.2014, Park and Kim2015, Wang et al.2012]. Few works extract storylines based on both text and images. Current methods often assume a link between good storylines and explicit metrics, such as the average or weakest similarity among neighboring nodes. However, these assumptions limit the generation of meaningful stories, since each user may have a unique notion of what makes a good storyline. A few researchers employ Latent Dirichlet Allocation (LDA) [Zhou et al.2015, Huang and Huang2013] to extract stories in an unsupervised fashion, but it is difficult for LDA to accurately model sequential data.

Reinforcement Learning: Starting with AlphaGo [Silver et al.2016] and Atari [Mnih et al.2015], numerous game applications [Lample and Chaplot2016, He et al.2016, Oh et al.2015, Narasimhan et al.2016] have exploited the ability of deep reinforcement learning to imitate sequential patterns. However, it is often difficult to design a reward function, especially for a real-world problem like ours. Other possible solutions include behavior cloning [Pomerleau1991] and Inverse Reinforcement Learning (IRL) [Russell1998, Ng et al.2000], which derives the underlying cost function under which the expert data is uniquely optimal. Unfortunately, behavior cloning treats policy learning as supervised learning over state-action pairs from expert trajectories and requires large amounts of data because of the compounding error caused by covariate shift, while IRL has stability issues and does not explicitly tell us how to act. Several works [Yu et al.2017, Ho and Ermon2016] employ Generative Adversarial Networks (GAN) to address these IRL issues. Hence we leverage GAN and IRL to effectively imitate user-provided storylines.

Multimodal Learning derives patterns from associated cross-modality information. For example, readers often estimate the relationship among news articles from their lead images: if the images contain the same entity, the articles are probably about the same event. However, current works [Kiros et al.2015, Srivastava and Salakhutdinov2012, Ngiam et al.2011] focus on joint representation learning and pay little attention to sequence problems.

Different from these previous studies, our work treats storytelling as a combination of imitation learning and multimodal learning. By employing GAN-based imitation learning, our proposed model can learn and expose the hidden policy. Moreover, this work takes full advantage of joint constraints on cross-modality data to improve the imitation performance.

3 Multimodal Imitation Storytelling

This section formally defines the task of imitation storytelling and then describes our proposed MIL-GAN model. In particular, the GAN-based imitation learning is elaborated in Section 3.2, and the multimodal method we apply is introduced in Section 3.3.

3.1 Problem Setup

Based on a set of event document-storyline pairs, our goal is to reveal the user-generated policy. Each pair consists of a document collection and a storyline composed of several entity nodes. Three types of modalities appear in the data: textual data, images, and the multimodal relationship between them. In the proposed method, a generator controlled by multimodal parameters is used to approximate the users' policy.

Following the REINFORCE algorithm [williams1992simple], the generator is optimized by maximizing the rewards collected along the sequence starting from the initial state. Under the GAN framework, a discriminator yields these rewards as state scores for the generator's output. The value function is therefore influenced by both the generator and the discriminator. The learning process iteratively updates the objective function via its gradient w.r.t. the parameters until convergence.

3.2 Imitation Learning via GAN

The intermediate rewards are set to zero for the storytelling task because it is difficult to evaluate a generated storyline until it is complete. Following [Sutton et al.], the objective of the policy is to generate a sequence, starting from the initial state, that maximizes the expected reward:

(1)
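For reference, a standard form of such an objective in SeqGAN-style sequence GANs, which the surrounding description mirrors (zero intermediate rewards, terminal reward supplied by the discriminator), can be written as follows; the notation here is assumed rather than taken from the paper:

```latex
% Hedged reconstruction: expected terminal reward of a storyline generated by the policy G_\theta,
% with the action value Q supplied by the discriminator D_\phi.
J(\theta) \;=\; \mathbb{E}\left[ R_T \mid s_0, \theta \right]
          \;=\; \sum_{y_1 \in \mathcal{Y}} G_\theta(y_1 \mid s_0)\, Q^{G_\theta}_{D_\phi}(s_0, y_1)
```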

Because the multimodal constraints involve several separate relationships embedded in the text, the images, and their bi-directional coherence, the objective function is formulated as:

(2)
Figure 2: Illustration of Multimodal GAN: Large arrows show the data flow in GAN training, while small arrows indicate the process inside policy gradient

where the action-value function estimates the expected accumulative reward from the initial state. Its value comes from the discriminator's rating of the sequence produced by the generator: when the discriminator considers the generated sequences real, the value increases. The basic idea is thus to treat the value as the probability, assigned by the discriminator, that the sequence is real. Due to a model limitation, only sequences of a fixed length can be evaluated by the discriminator; an incomplete sequence nevertheless provides partial information for rating. To estimate the value of such an incomplete sequence, the action-value function is assigned the expected rewards obtained from LSTM sampling (roll-outs) controlled by the generator. Therefore, the value is estimated as:

(3)
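To make the roll-out estimate concrete, here is a minimal PyTorch-style sketch (not the authors' released code; `generator.sample_continuation`, `discriminator`, and the tensor shapes are illustrative assumptions): the partial storyline is completed several times by sampling from the generator's LSTM, each completed sequence is scored by the discriminator, and the scores are averaged.

```python
import torch

def estimate_q(partial_seq, generator, discriminator, total_len, n_rollouts=16):
    """Monte Carlo roll-out estimate of the action value Q for a partial storyline.

    partial_seq:   tensor of shape (t, dim) -- entity embeddings generated so far.
    generator:     policy network with an assumed .sample_continuation(seq, length) method.
    discriminator: network mapping a complete (total_len, dim) sequence to a realness score.
    """
    if partial_seq.size(0) == total_len:
        # Complete sequence: the reward is the discriminator's score directly.
        return discriminator(partial_seq)

    scores = []
    for _ in range(n_rollouts):
        # Roll out the remaining steps with the current generator policy.
        completion = generator.sample_continuation(partial_seq,
                                                   total_len - partial_seq.size(0))
        full_seq = torch.cat([partial_seq, completion], dim=0)
        scores.append(discriminator(full_seq))
    # The expected reward over roll-outs approximates Q for the incomplete prefix.
    return torch.stack(scores).mean()
```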

Using such values, the generator updates the objective function w.r.t. its parameters. Once the generator finishes updating, sampled sequences mixed with real data are fed into the discriminator. Following the typical GAN rules, together with the improvement of [Arjovsky et al.2017] in which the discriminator's original log operation is removed, the discriminator then improves the function:

(4)
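With the log removed as described, a plausible form of this discriminator objective, in the spirit of [Arjovsky et al.2017], is to maximize the score gap between real and generated storylines; the notation below is an assumption for illustration:

```latex
% Hedged reconstruction of the discriminator objective with the log operation removed.
\max_{\phi}\;\; \mathbb{E}_{Y \sim p_{\mathrm{data}}}\left[ D_\phi(Y) \right]
          \;-\; \mathbb{E}_{Y \sim G_\theta}\left[ D_\phi(Y) \right]
```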

As illustrated in Figure 2, the GAN iteratively updates the multimodal generators using Eq. (3) and improves the multimodal discriminators using Eq. (4). Following the policy gradient theorem (episodic case) [Sutton et al.], the gradient of the objective function w.r.t. the parameters can be expressed as:

Theorem 3.1 (Multimodal Policy Gradient Theorem).

The gradient of the objective function w.r.t. the parameter via policy gradient is:

Proof of Theorem 3.1.

Let the state value function be defined in the usual way. To keep the notation simple, we leave it implicit that all quantities are controlled by the generator. For each single modality, because the intermediate rewards are set to zero and the state-transition probability is one-hot deterministic, we have:

(5)
(6)

Summing Eq. (6) over modalities gives exactly Eq. (2). Substituting Eq. (6) and Eq. (5) into each other iteratively:

We can rewrite the accumulative policy output as a probability:

(7)

After repeated unrolling and applying Eq. (7), it is then immediate that:

In storytelling, each node can only be assigned one entity, so the policy function in the last term is determined once a specific state is reached. Accordingly, the above equation becomes:

Using the weighted likelihood trick [Sutton et al.], the expectation inside the formula of Theorem 3.1 can be decomposed into an unbiased estimate. To keep the notation simple, we again leave the functional dependencies on the state and action implicit. Thus, the gradient can be estimated as:

The gradient is then applied to the previous parameters with a learning rate. Algorithm 1 presents the full details of the proposed method. First, we pre-train on the input set. Then the generator and discriminator are trained alternately and periodically. When training the discriminator, positive examples come from the given dataset, while negative examples are sampled from the generator.

Input: initialize generator policy and discriminator with random weights; multimodal storyline dataset ; event documents as knowledge base
Output: Generator and discriminator ; Newly-generated storyline sequences by on unseen test dataset.
1 // Pre-processing
2 Derive embeddings on and map entity in ;
3 Compute representations of based on using multimodal learning, and map each of to ;
4 Generate the sequence including , and ;
5 // Pre-training
6 Train using LSTM on ;
7 Generate negative samples using for training ;
8 Feed with negative samples and real data;
9 Train via minimizing the cross entropy;
10 // GAN training
11 repeat
12    for generator training do
13        Sample a sequence ;
14        for t in 1:T do
15            Derive using Eq. (3)
16        end for
17       Update generator:
18    end for
19   for discriminator training do
20        Generate negative examples by sampling ;
21        Use given positive examples ;
22        Train discriminator by Eq. (4)
23    end for
24   
25until GAN converges;
Apply the trained generator on the unseen dataset, starting from a randomly selected entity
Algorithm 1 Multimodal Imitation Storytelling
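As a companion to Algorithm 1, the sketch below (a simplified, assumption-laden PyTorch version, not the released implementation) shows one pass of the alternating update: a REINFORCE-style generator step weighted by the roll-out values of Eq. (3), followed by a discriminator step on real versus sampled storylines.

```python
import torch

def gan_step(generator, discriminator, real_batch, g_opt, d_opt, total_len):
    """One alternating G/D update, mirroring the inner loops of Algorithm 1.

    Assumes generator.sample(total_len) returns (sequence, log_probs) and that
    estimate_q() is the roll-out estimator sketched above; both are illustrative.
    """
    # --- Generator update (policy gradient) ---
    seq, log_probs = generator.sample(total_len)            # (T, dim), (T,)
    q_values = torch.stack([estimate_q(seq[: t + 1], generator, discriminator, total_len)
                            for t in range(total_len)])     # one Q per prefix
    g_loss = -(log_probs * q_values.detach()).sum()         # REINFORCE-style objective
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

    # --- Discriminator update (real vs. sampled storylines) ---
    fake = generator.sample(total_len)[0].detach()
    d_loss = -(discriminator(real_batch).mean() - discriminator(fake).mean())
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
```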

3.3 Multimodal Learning

The task involves three types of modalities: text, images, and their coherence. Entity words are encoded using a word embedding algorithm such as Word2Vec, whereas our model expresses images in a semantic sentential space instead of the word space, since images carry much richer information.

An SVD-based semantic embedding model is employed to derive vectors for images conditioned on contextual words. Each image vector lives in a multimodal vector space and is associated with a sentential description consisting of words; representing the image amounts to conditioning on the embedding vector of its description. We aim to model the conditional distribution of the following word given the contextual words and the image vector. The vocabulary embedding matrix together with the associated image vectors is represented as a tensor, a tensor product in which every word in a description is associated with its related images. Given this tensor, the model predicts the representation of the following words as a function of the image vector and the contextual words using tensor decomposition:

(8)

where the dimensionality of the decomposition is tunable, and the middle factor is a function that retains all of its arguments on the diagonal after multiplication; this idea resembles SVD, in which the middle matrix is diagonal. The context is then represented as an intermediate variable subject to the expectation of a weight distribution over word pairs. Combining it with the image factor, we obtain another intermediate variable that encodes context and image via the Hadamard product. Applying a softmax on top, the conditional probability can be calculated. The true following words are used as labels for a backpropagation method such as SGD, and the network iteratively updates its parameters, which include the embedding matrices and the weight distribution. In this way, the model encodes images in the sentential vector space. Finally, the third modality is the sequential difference between the word embeddings and the image vectors.
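A minimal numpy sketch of this image-conditioned prediction step follows; the dimensions, variable names, and exact factorization are assumptions for illustration, not the paper's implementation. The image vector is mapped to a diagonal (SVD-like) gating factor that modulates the word embeddings, the gated context embeddings are pooled with a weight distribution, and a softmax over the vocabulary yields the next-word distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, word_dim, img_dim = 1000, 64, 64     # illustrative sizes

E = rng.normal(size=(vocab, word_dim))       # vocabulary embedding matrix
U = rng.normal(size=(img_dim, word_dim))     # maps the image into the diagonal gating factor

def next_word_probs(context_ids, image_vec, weights=None):
    """Next-word distribution conditioned on context words and an image vector."""
    gate = image_vec @ U                                   # (word_dim,) diagonal factor
    gated_context = E[context_ids] * gate                  # Hadamard product per context word
    if weights is None:
        weights = np.full(len(context_ids), 1.0 / len(context_ids))
    r = weights @ gated_context                            # expected (weighted) context representation
    logits = E @ r                                         # score every vocabulary word
    logits -= logits.max()                                 # numerically stable softmax
    p = np.exp(logits)
    return p / p.sum()

# Example: distribution over the next word given two context words and an image feature.
probs = next_word_probs(np.array([3, 17]), rng.normal(size=img_dim))
```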

4 Benchmark and Experiment Setting

The proposed method is evaluated on a newly-proposed storytelling dataset (https://gist.github.com/aquastar/03dadfd751f5862ea0b44bb66996b490). To guide the model toward desirable stories, manually labeled storylines are compiled for GAN training. A generator obtained on one event dataset is then tested on another event corpus; this experiment shows whether the generator is capable of deriving transferable storylines. Note that different event datasets share no entities.

4.1 Benchmark Description and Metrics

The training set contains events from two major categories: Homicide and Protest. Homicide contains 1,310 storylines, while Protest contains 934 storylines. Two short examples are shown in Figure 1. The events are among the most famous in recent history, such as the 9/11 Attack, the Orlando Nightclub Attack, Occupy Wall Street, and the protests led by Martin Luther King. Each event contains several hundred documents, all taken from Google News and Wikipedia. Generators are trained on the two categories separately. The test set includes two events, the 2014 Iguala mass kidnapping (https://en.wikipedia.org/wiki/2014_Iguala_mass_kidnapping) and Malaysia Airlines Flight 370 (https://en.wikipedia.org/wiki/Malaysia_Airlines_Flight_370), since these two events involve both homicide and protest sub-events. For data augmentation, a sliding window is employed to divide the raw data into minimal sequences.

Metrics: First, the proposed method and several baselines are compared on convergence performance. Second, we conduct a user study via the Amazon Mechanical Turk (AMT) crowdsourcing service, since the ultimate goal is to validate the imitation behavior.

4.2 Training Setting

Words are embedded with Word2Vec within each independent event corpus, while images are initially represented with VGG19 features and then transferred to the sentential space using multimodal learning. To normalize their shape, word vectors along each storyline are reduced by the vector of the first word in the storyline; likewise, image vectors are reduced by the first image in each storyline, and the same operation is applied to the multimodal sequence. In the GAN model, the generator is implemented as an LSTM that accepts continuous values. TextCNN [Zhang and LeCun2015] is used as the discriminator since such a CNN is effective for both text and images. Baselines include Random (Ran.), Scheduled Sampling (SS) [bengio2015scheduled], policy gradient (PG), and LSTM.
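The per-storyline normalization described above can be sketched as follows (a trivial numpy illustration with assumed shapes): every vector in a storyline is shifted by that storyline's first vector, so every storyline starts from the origin of its own embedding space.

```python
import numpy as np

def normalize_storyline(vectors):
    """Express a (length, dim) storyline relative to its first node."""
    vectors = np.asarray(vectors, dtype=float)
    return vectors - vectors[0]   # subtract the first word/image vector from every step

# Applied independently to the word-vector, image-vector, and multimodal sequences.
```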

4.3 Initial Analysis

We evaluate our result against several established alternatives: random, scheduled sampling [bengio2015scheduled], and policy gradient with similarity. Unfortunately, the baselines do not share the same metric or objective function. Instead, they are compared using an accumulative normalized similarity on the training set, computed between the sequence matrix of the user-provided data (the sample) and the output sequence matrix of each model. The results are shown in Table 1.

Ran. SS PG LSTM
-3860.02 33714.67 34009.27 338876.34
T. I. T.I. MIL-GAN
34263.59 12483.20 36143.34 36697.33
Table 1: Similarity performance (T./I. denotes Text/Image respectively, T.I. means the combination of T. and I.)
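One plausible reading of the accumulative normalized similarity reported in Table 1, sketched below with assumed details rather than the paper's exact definition, is the per-node cosine similarity between corresponding nodes of the user-provided and generated sequences, summed over nodes and sequences.

```python
import numpy as np

def accumulative_similarity(samples, outputs, eps=1e-8):
    """Sum of per-node cosine similarities between user-provided and generated sequences.

    samples, outputs: lists of (length, dim) arrays, aligned pairwise.
    Illustrative reading of the metric, not the paper's exact formula.
    """
    total = 0.0
    for s, o in zip(samples, outputs):
        num = np.sum(s * o, axis=1)
        den = np.linalg.norm(s, axis=1) * np.linalg.norm(o, axis=1) + eps
        total += np.sum(num / den)
    return total
```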

Using the multimodal GAN, we improve on the text-only model. One interesting point is that the image-only modality does not perform very well on its own, but performance increases when it is combined with the other modalities. This makes sense because an image may contain so much information that it leads to confusion. For instance, an image of the original World Trade Center being attacked could represent either a victim (the building) or an attacker (the attacking flight). With a textual label such as victim, the image is confined to a smaller and more accurate semantic space.

The balance parameters are all initialized to 1. After fine-tuning, good performance often appears when more weight is assigned to the text part; one good setting is [0.6, 0.3, 0.1] for the text, image, and cross-modality terms, respectively.

However, results on the training set only show the potential for success; they are not the ultimate evaluation. In the next subsection, we conduct a user study via the Amazon Mechanical Turk (AMT) service to directly assess users' satisfaction.

4.4 Evaluation via User Study

Evaluating storytelling is a difficult task, because there is no established gold standard and even ground truth is hard to elicit. The ultimate goal of imitation storytelling is to help users discover customized storylines given users' examples. Therefore, a third-party user study via AMT is conducted, since it allows us to obtain accurate statistical information from a large sample of users. The study tests whether the generated storylines match users' interests. We evaluate our result against the baselines mentioned in the last subsection.

In the study, workers were asked which generated storyline best fits the given storyline. Because of our dynamic reward per assignment in AMT, every minimal unit of cost is treated as one evaluation, so 1,000 ratings were collected. For each test event, several starting entities were randomly chosen, and storylines were generated from these starting entities using both our proposed method and the baselines. For each unit task in AMT, workers were asked to choose the best among the generated candidates, given event background knowledge such as Wikipedia articles or key news articles:

  • 1st Step: Workers are given E0, an event background, and S0, a user-provided storyline whose topic is based on E0.

  • 2nd Step: After the workers study the relationship between the storyline and the event, they are given another event E1 and several machine-generated storylines S1/S2/S3/S4, which are also related to E1.

  • 3rd Step: Based on their understanding of the relationship S0-E0, workers choose the storyline among (S1, S2, S3, S4) whose relationship E1-S[?] is most similar.

The final result, shown in Table 2, demonstrates that our proposed MIL-GAN is significantly preferred by users. T. slightly improves preference over Scheduled Sampling, while Policy Gradient fails to capture the users' behavior.

PG SS T. MIL-GAN
17.3% 23.2% 28.1% 31.4%
Table 2: Preference statistics from AMT

4.5 Case Analysis

In this subsection, several typical examples are presented and analyzed in detail.

4.5.1 Case 1: Mexico Murderer

  • SS: Abarca → Flipe → Flipe → Guerrero

  • PG: Abarca → Pineda → student → student

  • T.: Abarca → gang → student → Guerrero

  • MIL-GAN: Abarca → Flipe → Ayotzinapa → Guerrero

4.5.2 Case 2: Mexico Drug Cartel

  • SS: Cartel → Flipe → Flipe → Guerrero

  • PG: Cartel → gang → gang → Sinaloa

  • T.: Cartel → Drug → Pineda → gang

  • MIL-GAN: Cartel → Pineda → Abarca → Flipe

4.5.3 Case 3: MH370

  • SS: Zaharie → wife → wife → cockpit

  • PG: Zaharie → Flight → Fariq → training

  • T.: Zaharie → Flight → Aisa → Search

  • MIL-GAN: Zaharie → MH370 → India → debris

We found that (1) the sequences generated by the baselines SS and PG often contain vague words such as student and gang; these two baselines tend to overfit and often yield repeated nodes, as in Cases 1 and 2, suggesting that models without GAN are likely to overfit. (2) With GAN, the proposed models appear to avoid the overfitting issue. Compared with SS and PG, T. and MIL-GAN tend to generate specific words rather than ambiguous ones; for example, in Cases 1, 2, and 3, T. and MIL-GAN generate specific words such as Aisa and debris. Compared with T., MIL-GAN generates even more specific entity names such as Ayotzinapa, Flipe, and India. This implies that the multimodal constraints do improve on the unimodal performance.

5 Conclusion

In this paper, we proposed a multimodal imitation learning approach for generating storylines on unseen events. To avoid designing a reward function, GAN-based imitation learning is introduced to learn the latent policy from users' demonstrations. To bridge the information gap between text and images, our model integrates generative adversarial nets and multimodal learning via deterministic policy gradient; the different modalities learn from each other and can resolve confusion arising from any single modality. In our experiments, we used user-provided demonstrations to illustrate the advantage of the proposed method over the baselines. With the multimodal perspective, our model succeeds in capturing the latent patterns across modalities and therefore reveals storylines that better match users' interests.

References

  • [Arjovsky et al.2017] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. ArXiv e-prints, jan 2017.
  • [Fang et al.2011] Lujun Fang, Anish Das Sarma, Cong Yu, and Philip Bohannon. Rex: explaining relationships between entity pairs. VLDB, 5(3):241–252, 2011.
  • [Goodfellow et al.2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Yoshua Bengio, et al. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
  • [He et al.2016] Ji He, Mari Ostendorf, et al. Deep reinforcement learning with a combinatorial action space for predicting popular reddit threads. EMNLP, 2016.
  • [Ho and Ermon2016] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In NIPS, pages 4565–4573, 2016.
  • [Hossain et al.2012] M Shahriar Hossain, Patrick Butler, Arnold P Boedihardjo, and Naren Ramakrishnan. Storytelling in entity networks to support intelligence analysts. In SIGKDD, pages 1375–1383. ACM, 2012.
  • [Huang and Huang2013] Lifu Huang and Lian’en Huang. Optimized event storyline generation based on mixture-event-aspect model. In EMNLP, pages 726–735, 2013.
  • [Kim et al.2014] Gunhee Kim, Leonid Sigal, and Eric Xing. Joint summarization of large-scale collections of web images and videos for storyline reconstruction. In CVPR, pages 4225–4232, 2014.
  • [Kiros et al.2015] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Unifying visual-semantic embeddings with multimodal neural language models. TACL, 2015.
  • [Kumar et al.2008] Deept Kumar, Naren Ramakrishnan, Richard F Helm, and Malcolm Potts. Algorithms for storytelling. KDD, 20(6):736–751, 2008.
  • [Lample and Chaplot2016] Guillaume Lample and Devendra Singh Chaplot. Playing fps games with deep reinforcement learning. AAAI, 2016.
  • [Lee et al.2012] Heeyoung Lee, Marta Recasens, et al. Joint entity and event coreference resolution across documents. In EMNLP-CONLL, pages 489–500. Association for Computational Linguistics, 2012.
  • [Lin et al.2012] Chen Lin, Chun Lin, et al. Generating event storylines from microblogs. In CIKM, pages 175–184. ACM, 2012.
  • [Mnih et al.2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [Narasimhan et al.2016] Karthik Narasimhan, Adam Yala, et al. Improving information extraction by acquiring external evidence with reinforcement learning. EMNLP, 2016.
  • [Ng et al.2000] Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In ICML, pages 663–670, 2000.
  • [Ngiam et al.2011] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. Multimodal deep learning. In ICML, pages 689–696, 2011.
  • [Oh et al.2015] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, et al. Action-conditional video prediction using deep networks in atari games. In NIPS, pages 2863–2871, 2015.
  • [Park and Kim2015] Cesc C Park and Gunhee Kim. Expressing an image stream with a sequence of natural sentences. In NIPS, pages 73–81, 2015.
  • [Pomerleau1991] Dean A Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991.
  • [Russell1998] Stuart Russell. Learning agents for uncertain environments. In Proceedings of the eleventh annual conference on Computational learning theory, pages 101–103. ACM, 1998.
  • [Shahaf et al.2012a] Dafna Shahaf, Carlos Guestrin, and Eric Horvitz. Metro maps of science. In SIGKDD, pages 1122–1130. ACM, 2012.
  • [Shahaf et al.2012b] Dafna Shahaf, Carlos Guestrin, and Eric Horvitz. Trains of thought: Generating information maps. In WWW, pages 899–908. ACM, 2012.
  • [Silver et al.2016] David Silver, Aja Huang, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • [Snavely et al.2010] Noah Snavely, Ian Simon, Michael Goesele, et al. Scene reconstruction and visualization from community photo collections. Proceedings of the IEEE, 98(8):1370–1390, 2010.
  • [Srivastava and Salakhutdinov2012] Nitish Srivastava and Ruslan R Salakhutdinov. Multimodal learning with deep boltzmann machines. In NIPS, pages 2222–2230, 2012.
  • [Sutton et al.] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, 2000.
  • [Voskarides et al.2015] Nikos Voskarides, Edgar Meij, Manos Tsagkias, Maarten de Rijke, and Wouter Weerkamp. Learning to explain entity relationships in knowledge graphs. 2015.
  • [Wang et al.2012] Dingding Wang, Tao Li, and Mitsunori Ogihara. Generating pictorial storylines via minimum-weight connected dominating set approximation in multi-view graphs. In AAAI. Citeseer, 2012.
  • [Yu et al.2017] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, 2017.

  • [Zhang and LeCun2015] Xiang Zhang and Yann LeCun. Text understanding from scratch. arXiv preprint arXiv:1502.01710, 2015.
  • [Zhou et al.2015] Deyu Zhou, Liangyu Chen, and Yulan He. An unsupervised framework of exploring events on twitter: Filtering, extraction and categorization. In AAAI, 2015.