
Interactive Storytelling over Document Collections

Storytelling algorithms aim to 'connect the dots' between disparate documents by linking starting and ending documents through a series of intermediate documents. Existing storytelling algorithms are based on notions of coherence and connectivity, and thus the primary way by which users can steer the story construction is via design of suitable similarity functions. We present an alternative approach to storytelling wherein the user can interactively and iteratively provide 'must use' constraints to preferentially support the construction of some stories over others. The three innovations in our approach are distance measures based on (inferred) topic distributions, the use of constraints to define sets of linear inequalities over paths, and the introduction of slack and surplus variables to condition the topic distribution to preferentially emphasize desired terms over others. We describe experimental results to illustrate the effectiveness of our interactive storytelling approach over multiple text datasets.





1 Introduction

Faced with a constant deluge of unstructured (text) data and an ever-increasing sophistication of our information needs, a significant research front has opened up in the space of what has been referred to as information cartography [31]. The basic objective of this space is to pictorially help users make sense of information through inference of visual constructs such as stories [13, 17, 27, 28], threads [9, 10, 23], and maps [30, 29]. By supporting interactions over such constructs, information cartography systems aim to go beyond traditional information retrieval systems in supporting users’ information exploration needs.

Arguably the key concept underlying such cartography is the notion of storytelling, which aims to ‘connect the dots’ between disparate documents by linking starting and ending documents through a series of intermediate documents. There are two broad classes of storytelling algorithms, motivated by their different lineages. Algorithms focused on news articles [27, 28] aim for coherence of stories wherein every document in the story shares an underlying common theme. Algorithms focused on domains such as intelligence analysis [12] and bioinformatics [14] must often work with sparse information wherein a common theme is typically absent or at best tenuous. Such algorithms must leverage weak links to bridge diverse clusters of documents, and thus emphasize the construction and traversal of similarity networks. Irrespective of the motivations behind storytelling, all such algorithms provide limited abilities for the user to steer the story construction process. There is typically no mechanism to interactively steer the story construction toward desired story lines and avoid specific aspects that are not of interest.

In this paper, we present an alternative approach to storytelling wherein the user can interactively provide ‘must use’ constraints to preferentially support the construction of some stories over others. At each stage of our approach, the user can inspect the given story and the overall document collection, and express preferences to adjust the storyline, either in part or in whole. Such feedback is then incorporated into the story construction iteratively.

Our key contributions are:

  1. Our interactive storytelling approach can be viewed as a form of ‘visual to parametric interaction’ (V2PI [18]) wherein users’ natural interactions with documents in a workspace are translated into parameter-level interactions in terms of the underlying machine learning models (here, topic models). In particular, we demonstrate how high-level user feedback at the level of paths is translated down to redefine topic distributions.

  2. The underlying mathematical framework for interactive storytelling is a novel combination of hitherto uncombined components: distance measures based on (inferred) topic distributions, the use of constraints to define sets of linear inequalities over paths, and the introduction of slack and surplus variables to condition the topic distribution to preferentially emphasize desired terms over others. The proposed framework thus brings together notions from heuristic search, linear systems of inequalities, and topic models.

  3. We illustrate how just a modicum of user feedback can be fruitfully employed to redefine topic distributions and at the same time severely curtail the search process in navigating large document collections. Through experimental studies, we demonstrate the effectiveness of our interactive storytelling approach over multiple text datasets.

2 Motivating Example

We present an illustrative example of how a storytelling algorithm can be steered toward desired lines of analysis based on user input. For our purposes, assume a vanilla storytelling algorithm (akin to [17, 15]) based on heuristic search to prioritize the exploration of adjacent documents in order to reach a desired destination document. Adjacency here can be assessed in many ways. One approach is to use local representations such as a tf-idf representation and define similarity measures (e.g., Jaccard coefficient) over such local representations. A second approach is to utilize the normalized topic distribution generated using, e.g., LDA [5], to induce a distance between every pair of documents.
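Both adjacency notions mentioned above can be sketched in a few lines. The following is an illustrative sketch (function names are ours, not the paper's implementation): Jaccard similarity over term sets as a local representation, and Manhattan distance over (normalized) topic distributions.

```python
import numpy as np

def jaccard_similarity(terms_a, terms_b):
    """Jaccard coefficient over two documents' term sets (local representation)."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b)

def topic_distance(theta_a, theta_b):
    """Manhattan (L1) distance between two normalized topic distributions."""
    return float(np.abs(np.asarray(theta_a) - np.asarray(theta_b)).sum())

# Toy example: two documents sharing two terms, and two topic vectors.
sim = jaccard_similarity(["nation", "terror", "avert", "orange"],
                         ["nation", "terror", "bank", "red"])   # 2 shared / 6 total
dist = topic_distance([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])
```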

Let us construct a toy corpus of documents wherein the terms are drawn from predefined themes and some random noise terms. Each theme is assumed to be represented by a collection of terms. An example of a theme is:
Theme 1: nation, terror, avert, orange
Each document is generated by a single theme or by mixing two themes. In addition to the terms sampled from the themes, each document is assumed to also contain noise terms. (The noise terms are document-specific, meaning no two documents share the same noise terms.) Thus, we obtain terms for each of the themes and noise terms for each of the documents, so that the total number of terms is . A pair of documents has an edge between them if they have at least one common term. (Since noise terms are not common between documents, they are not responsible for edge formation.) We use the notation to denote a document. Here denotes the document index and are the two themes represented by the document. For example is the first document in the corpus and contains terms from themes and .

Figure 1: An illustration of the interactive storytelling algorithm.

Now consider the storytelling scenario from Fig. 1. The user desires to make a story from document to document . describes a bank robbery and mentions a possible chemical attack. The story constructed using heuristic search is as follows: (Fig. 1 (a)). The first two documents are connected using (Theme 7), involving the terms bank, red, truck, aspen. As can be seen, this story is not desirable since the algorithm has conflated a bank robbery in Aspen using a red truck with the bankruptcy of the Red Trucking company (due to insufficient orange production in Aspen). Thus, although the connections between the documents are established by the same set of terms, the contexts are very different.

In this case the user realizes that the story does not make very good sense, and thus uses her domain knowledge to steer the story in the right direction. She aims to incorporate a story segment into the construction. Here, reports the closing of a chemical factory and mentions a sweet odor emanating from an abandoned chemical factory (see Fig. 1 (b)). The user believes that these two documents could play an important role in the final story. Incorporating this feedback, a story from to could potentially be (i.e., the shortest path from to via and ). Note that there could be other documents that need to be included in the path but are not explicitly provided in the user’s feedback.

Incorporating this feedback, the algorithm introduced in this paper will infer new topic definitions over the dictionary of terms, and subsequently new topic distributions for each document. In this case, a new story is generated: . In this story (see Fig. 1 (c)), the first two documents are connected by the terms ski, tourist, destination, winter (Theme 5); the second and the third are linked via the terms chemical, factory, recently, hiring (Theme 8) and the last two documents are connected by nation, terror, avert, orange (Theme 1). This story thus suggests an alternative hypothesis for the user’s scenario.

Figure 2: Probability of weights of terms before (green) and after (blue) feedback. The inferred topic distributions are shifted to induce proximity between documents so that the story is consistent with user feedback.

After incorporating the user’s feedback using our proposed algorithm, we see that ski, tourist, destination, winter has some mass for document so that it is inferred closer to document (see Fig. 2). Similarly, the algorithm estimates positive probabilities for the terms chemical, factory, recently, hiring in document which brings it closer to document .

3 Framework

A summary of the notation used in this paper is given in Table 1. We use the terms nodes and documents interchangeably. As described earlier, we impute the notion of distance between documents based on vector representations inferred from probabilistic topic models (here, LDA). Specifically, we use the topic distributions and for documents and (resp.) to calculate the distance or edge cost between and . We posit an edge between two documents if they share any terms and the edge cost is lower than a fixed cost . While a number of probabilistic measures of distance can be utilized, in this paper we adopt the Manhattan distance metric. The heuristic distance for a node is given by the straight-line distance to the ending (target) document . Since the Manhattan distance obeys the triangle inequality, it is well known that it is an admissible heuristic for A* search. As is customary, we define a node evaluation function as the sum of and .
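A minimal sketch of the A* search just described, using the Manhattan distance both as edge cost and as admissible heuristic (our own simplified encoding; the shared-term test for edge formation is omitted and edges are posited purely from the threshold):

```python
import heapq
import numpy as np

def a_star_story(docs, start, goal, tau):
    """A* search over a document graph.

    docs: {doc_id: normalized topic distribution (np.array)}.  An edge exists
    between two documents when their Manhattan distance is at most tau.  The
    heuristic h(n) = L1 distance to the goal obeys the triangle inequality,
    so it is admissible and A* returns a shortest path.
    """
    def d(a, b):
        return float(np.abs(docs[a] - docs[b]).sum())

    open_heap = [(d(start, goal), 0.0, start, [start])]  # (f, g, node, path)
    closed = set()
    while open_heap:
        f, g, node, path = heapq.heappop(open_heap)
        if node == goal:
            return path, g
        if node in closed:
            continue
        closed.add(node)
        for nb in docs:
            if nb in closed or nb == node:
                continue
            cost = d(node, nb)
            if cost <= tau:  # edge exists below the distance threshold
                heapq.heappush(open_heap,
                               (g + cost + d(nb, goal), g + cost, nb, path + [nb]))
    return None, float("inf")
```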

Notation Explanation
document in the corpus
total number of topics
starting document
ending/goal document
distance threshold
dimensional vector of normalized topic distribution of document
edge between and if they have any term in common
cost between and , , where
path from to with edges, is the document after
shortest path from to
cost of the shortest path from to
cost of the shortest path from to using search
the heuristic distance (Manhattan distance) between the node and the goal node
minimum cost any is bounded by such that is the shortest path from to
maximum cost any is bounded by such that is the shortest path from to
cost of the shortest path from to with
cost of an arbitrary path with
cost of the shortest path from to including an edge
Table 1: Notation overview.

3.1 Obtaining User Feedback

After an initial story is generated by heuristic search, the user provides a sequence of documents that ought to be included in the story (i.e., between the documents and ). Suppose this sequence is . The order of the documents is important, since the sequence is a reflection of the desired story progression. We define the path as a concatenation of the shortest path between and , followed by the nodes in , and finally the shortest path between and . This process is done in the original LDA-inferred topic space. We then undertake a constrained search incorporating the user feedback.
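The concatenation just described can be sketched as follows (a sketch with hypothetical names; `shortest_path(a, b)` stands for any shortest-path routine in the original topic space):

```python
def feedback_path(shortest_path, s, t, feedback):
    """Concatenate shortest segments s -> f1 -> ... -> fk -> t.

    shortest_path(a, b) returns a list of documents [a, ..., b] in the
    original topic space; `feedback` is the ordered user-provided sequence.
    """
    waypoints = [s] + list(feedback) + [t]
    path = [s]
    for a, b in zip(waypoints, waypoints[1:]):
        path += shortest_path(a, b)[1:]   # drop the repeated segment endpoint
    return path
```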

Figure 3: (a) shows the heuristic distance from original search. (b) depicts based on constrained search. (c) depicts when feedback nodes are where ancestry of is given by . Dashed line shows the shortest path from to .

3.2 Constrained Search

Now we discuss the incorporation of the user’s feedback into the story. Consider the case where the user insists that a document (not in the initial story) should be included in the story. This case can be easily extended to a sequence of documents . Suppose the adjacent nodes of a document are denoted by . There are five adjacent nodes to in Fig. 3. The heuristic distance between a neighbor (say, ) and the ending document is given by in the original search. Our redefined heuristic distance for constrained search is given by . If the feedback is a sequence of documents then . However, while ensures that the of a document depends on the path via the sequence of feedback nodes , it must also consider the subset of that already belongs to the shortest path from to in order to estimate the heuristic distance . We define a property named Ancestry that keeps track of the subset of the feedback nodes that already exist in the shortest path from the start to the said node. The Ancestry of an arbitrary neighbor of is defined as if is not a feedback node. If is the feedback node immediately after the subsequence in then . The starting node has an empty ancestry. A node having a longer subsequence of in its ancestry compared to another is said to have a richer ancestry. A node with richer ancestry is always preferred. If ancestries are comparable, for an open node the predecessor with smaller is chosen, while for a closed node the predecessor with smaller is chosen.
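The redefined heuristic, which routes the estimate through the remaining feedback nodes in order, can be sketched as follows (our own illustrative names; `h(a, b)` is the original admissible heuristic, e.g. Manhattan distance):

```python
def constrained_h(h, node, remaining_feedback, goal):
    """Redefined heuristic for constrained search (a sketch).

    With feedback nodes f1..fk still to be visited (those not yet in the
    node's ancestry), the estimate from `node` is
        h'(node) = h(node, f1) + h(f1, f2) + ... + h(fk, goal),
    which reduces to h(node, goal) once all feedback nodes are covered.
    """
    stops = list(remaining_feedback) + [goal]
    total = h(node, stops[0])
    for a, b in zip(stops, stops[1:]):
        total += h(a, b)
    return total
```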

Figure 4: (left) The path with green nodes is the initial story generated by the storytelling algorithm and hence the shortest path from to before incorporating feedback. The gray paths (dashed and solid) are alternate stories abandoned by the search. (right) Story after user feedback where the user-preferred story is shown in blue. This is not the shortest path in the current topic space. The documents that the user desires to be in the story are shown in large circles. We intend to estimate the topic space where the blue path () is shorter than all the other alternate paths from to .

3.3 Alternate/Candidate Stories

The nodes explored by A* search in the initial topic space (the set of open and closed nodes) induce an acyclic graph . The orange nodes in Fig. 4 are open nodes in such a graph. Denote the set of open nodes by . Any path from to via is a candidate story generated by the search. Let us denote the path via by .

Now assume has open nodes. To enforce the user feedback that be the shortest path over all paths from to we define the following system of inequalities:


If we break each inequality in terms of topics then we obtain:


In addition to this set of inequalities, we also add another set of inequalities imposing that the cost of an edge in the new topic space, is at least as much as the cost of the edge in the initial topic space .


This constraint is imposed so that the proximity of the document does not change drastically, as otherwise this might disorient users.
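For concreteness, the constraint system above may be written as follows. This is a reconstruction consistent with the surrounding text; the symbols are illustrative ( for the user-preferred path, for the candidate path through open node , for the new topic proportions, for the number of topics):

```latex
% The user-preferred path must be no longer than any candidate path:
\operatorname{cost}(P_U) \;\le\; \operatorname{cost}(P_{o_i}), \qquad i = 1,\dots,m

% Expanding each path cost as a sum of Manhattan edge costs over topics:
\sum_{(u,v) \in P_U} \sum_{k=1}^{K} \lvert \theta'_{uk} - \theta'_{vk} \rvert
  \;\le\;
\sum_{(u,v) \in P_{o_i}} \sum_{k=1}^{K} \lvert \theta'_{uk} - \theta'_{vk} \rvert

% Edge costs may not shrink relative to the initial topic space \theta:
\sum_{k=1}^{K} \lvert \theta'_{uk} - \theta'_{vk} \rvert
  \;\ge\;
\sum_{k=1}^{K} \lvert \theta_{uk} - \theta_{vk} \rvert
```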

3.4 Deriving Systems of Inequalities

A* is a heuristic algorithm to find the shortest path between two nodes. Given the shortest path, finding the edge costs, or upper and lower limits thereof, is thus an inverse shortest path problem. Our goal is to find a normalized topic distribution so that is actually the shortest path in the new topic space.

In our approach, we obtain the inequalities in Eqn. 2 by using the following observation: if the cost of an edge crosses the upper threshold , or the cost of an edge falls below the lower threshold , all other edge costs being fixed, then is no longer the shortest path from to . Therefore the condition for being the shortest path is


Upper and lower shortest path tolerances are presented in [25] as:


Therefore the inequalities for the edges become:


Note that for the first equation in Eqn. 3.4, is the difference of two path costs: the cost of the shortest path from to that avoids (imposing an infinite cost for ) and the minimum cost of with in the path (imposing a zero cost for ), so that . Notice also that if , then . For the second equation, if the shortest path from to does not change even with , i.e. , then the lower tolerance for is zero. However, if the constraint favors a different path through (meaning not ), the lower tolerance for is given by the drop in path cost which this alternate path allows over .
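The description above corresponds to tolerances of the following form. This is a hedged reconstruction in our own notation: denotes the shortest-path cost with edge deleted, and the cost with made free:

```latex
% Upper tolerance of an edge e on the shortest path P^*:
\alpha(e) \;=\; d_{c(e)=\infty}(s, t) \;-\; d_{c(e)=0}(s, t)

% Lower tolerance of an edge e not on P^*:
\beta(e) \;=\; d(s, t) \;-\; d_{c(e)=0}(s, t)
```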

We use the fact that our choice of is an admissible heuristic in A* search to simplify our formulation of inequalities. Due to admissibility, , and consequently . Replacing with the lower heuristic estimate of in Eqn. 7, we obtain a stricter inequality:

Figure 5: Dashed line shows the subtree and the solid line shows the subtree . The candidate open nodes in for Eqn. 6 are shown in green. Red nodes are open nodes in and do not contribute in Eqn. 6. The shortest path from to avoiding is the shortest path from to via any of the green nodes.

The cost of the shortest path avoiding is given by . In Fig. 5, suppose the red edge is one such . Let the subtree induced by the search following be (shown with dashed lines) and the remainder of the tree be (shown with solid lines). Based on the search process, we would expect the shortest path from to via any edge in to have in it. Therefore should be based on paths via edges in . Since we have path costs estimated by the heuristic search (), we can use these for the open nodes in . These open nodes are shown in green in Fig. 5. Hence, in this setting, the inequality is replaced by the following set of inequalities:


Due to the admissibility of , also underestimates the true distance, so we are using a stricter inequality in Eqn. 9. If this process is repeated for all , our set of inequalities consists of the user-defined path being compared against all paths defined by the open nodes in the original A* search, as given in Eqn. 2.

3.5 Modeling Relationships by Auxiliary Variables

In the previous section we formulated the user feedback as a set of relationships, where each relationship is an inequality in terms of path lengths. Since the distance metric is based on the normalized topic distribution , we explicitly show the dependence of an individual relationship on . For an inequality in Eqn. 2, we introduce a slack random variable (i.e. for some ) as an auxiliary variable with expectation . Similarly, for a relationship in Eqn. 3 we define a surplus random variable , where is positive with expectation given by . Therefore . Suppose the distribution of the auxiliary variable is given by . The random variable measures the difference in path lengths between the user-defined path and an alternate . If is zero, enforcing the relationship is as costly as the alternate path . The more negative the value of its mean , the larger we expect to be compared to . This ensures that the topic space satisfies the relationship . Now, conditional on a known , the joint distribution of the auxiliary variables (both slack and surplus) and the observed feedback is given below:


Here, is an indicator variable which is one if condition holds and zero otherwise. Our goal is to find a set of surplus and slack variables that maximizes the probability in Eqn. 10. Now let be normally distributed with mean and variance 1. By marginalizing over the auxiliary variables , our formulation is the same as modeling the probability of satisfying a relationship using the cumulative normal distribution.


Here, for a standard normal variable , . This approach is very similar to the usage of auxiliary variables in probit regression [1]. In probit regression, the mean of the auxiliary variable is modeled by a linear predictor to maximize the discrimination between the successes and failures in the data. In our case, satisfying a user-defined relationship is a success, and the probability of satisfying the relationship is modeled by the mean of the auxiliary variable. The mean of the auxiliary variable is a function of the topic space on which the distances are defined. Our goal is to search for a topic space which explains the term distribution of the documents and satisfies as many of the relationships in as possible. Truncating a slack variable to a negative region specified by allows us to search for a that shrinks the mean to a negative value. The complete hierarchical model using the term-document data and the relationship data is presented below:


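As a concrete illustration of the probit-style construction above, a minimal sketch (assuming, as in the text, unit-variance normal auxiliary variables that satisfy a relationship when negative; function names are ours):

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF Phi(x), via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def relationship_prob(mu):
    """Probability that a relationship is satisfied, probit-style.

    The auxiliary variable is N(mu, 1) and the relationship holds when it is
    negative, so P(satisfied) = Phi(-mu).  A more negative mean mu (the user
    path much shorter than the alternative) gives probability near 1.
    """
    return normal_cdf(-mu)
```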
3.6 Inference

We use Gibbs sampling to compute the posterior distributions for and . The conditional posterior distribution for is given below:


The sampling of topics for terms is the same as in vanilla LDA [11].


The full conditional distribution for is given below:


The full conditional distribution for the topic distribution of document is given below:


since . denotes the number of terms in document assigned to topic based on . If does not belong to , then is sampled from . Otherwise, we sample from by a Metropolis-Hastings step. We use a proposal strategy based on the stick-breaking process to allow better mixing. The stick-breaking construction constrains each topic proportion of a document to lie between zero and one and the proportions to sum to one. We first sample random variables truncated between zero and one and centered at using a proposal distribution :


This is followed by the mappings, ,


The inverse mappings are given by:


The Metropolis-Hastings acceptance probability for such a proposed move is given by


where = . Samples of and are drawn iteratively to generate the joint posterior distribution of all unknown parameters via Gibbs sampling.
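The stick-breaking mappings underlying the proposal (and their inverses) can be sketched as follows. This shows only the deterministic transforms, which guarantee that the proposed topic proportions lie in (0, 1) and sum to one; the truncated-normal sampling around the current values is omitted:

```python
import numpy as np

def stick_break(v):
    """Map K-1 stick-breaking fractions v in (0,1) to a point on the simplex."""
    v = np.asarray(v, dtype=float)
    K = v.size + 1
    theta = np.empty(K)
    remaining = 1.0                      # length of stick left to break
    for k in range(K - 1):
        theta[k] = v[k] * remaining
        remaining -= theta[k]
    theta[K - 1] = remaining             # final piece takes the remainder
    return theta

def stick_break_inv(theta):
    """Inverse mapping: recover the fractions v from a simplex point theta."""
    theta = np.asarray(theta, dtype=float)
    v = np.empty(theta.size - 1)
    remaining = 1.0
    for k in range(v.size):
        v[k] = theta[k] / remaining
        remaining -= theta[k]
    return v
```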

This procedure completes the interactivity loop in the storytelling algorithm. The newly inferred topic distributions will induce a new similarity network over which we can again conduct a search, followed by (potentially) additional user feedback.

4 Experimental Results

We evaluate our interactive storytelling approach over a range of text datasets from intelligence analysis, such as Atlantic Storm, Crescent, Manpad, and the VAST11 dataset from the IEEE Visual Analytics Science & Technology Conference. Please see [35] for details of these datasets. The questions we seek to answer are:

  1. Can we effectively visualize the operations of the interactive storytelling as user feedback is incorporated? (Section 4.1)

  2. Does the interactive storytelling framework provide better alternatives for stories than a vanilla topic model? (Section 4.2)

  3. Are topic reorganizations obtained from interactive storytelling significantly different from a vanilla topic model? (Section 4.3)

  4. Does our method scale to large datasets? (Section 4.4)

  5. How effectively does the interactive storytelling approach improve over uninformed search (e.g., uniform cost search or breadth-first search)? (Section 4.5)

In the experiments below, unless otherwise stated, we fix the number of topics to be and set and . We also use the Gini index to remove the top 10% of the terms as a pre-processing step for our text collections.

4.1 Visualizing interactive storytelling

We apply multidimensional scaling (MDS) over the normalized topic space as an aid to visualize the operations of the storytelling algorithm. For instance, the Manpad dataset is visualized as shown in Fig. 6. Consider a story from document - to document -. Here - reports that a member of an infamous terrorist organization has a meeting with a notorious arms dealer. - reports that a team of suicide bombers plans to set off bombs in trains carrying tens of thousands of commuters under the Hudson River. The storytelling algorithm generates a story as: ---. Here, - identifies a person belonging to a terrorist organization. The user is not satisfied with this story and provides a constraint that the story should involve documents - and -. Here, - reports that libraries in Georgia and Colorado have some connections to a web site. - reports that a code number is found in the website linked to a charitable organization. Using this feedback a new story is generated: -----. In addition to being consistent with the user’s feedback, note that the algorithm has introduced a new document (-) which contains a report of police seizing documents involving specific names and dates.
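An MDS projection of this kind can be sketched as follows. This is a classical (Torgerson) MDS over the same Manhattan distances used by the search; function names are ours and this is not necessarily the paper's exact implementation:

```python
import numpy as np

def manhattan_matrix(theta):
    """Pairwise Manhattan distances between normalized topic distributions."""
    theta = np.asarray(theta, dtype=float)
    return np.abs(theta[:, None, :] - theta[None, :, :]).sum(axis=-1)

def classical_mds(D, dims=2):
    """Classical (Torgerson) MDS: embed points so that Euclidean distances
    in the embedding approximate the given distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:dims]       # top `dims` eigenpairs
    L = np.sqrt(np.maximum(eigvals[idx], 0.0))   # clip tiny negatives
    return eigvecs[:, idx] * L
```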

4.2 Evaluating story options

In this experiment, we seek to generate multiple stories using our interactive storytelling approach as well as a vanilla topic model, with a view to comparative evaluation. In this experiment, run over the Atlantic Storm dataset, the user specifies CIA06 as the starting document and NSA16 as the ending document. The default story is: . The user’s feedback specifies and to be included in the final story. The results of incorporating this feedback yields: . We next use Yen’s k-shortest path algorithm [20] to generate a set of top-k (alternative) stories. As shown in Table 2, the top-ranked path in the interactive setting is indeed the shortest path in the new topic space that satisfies the given constraints.
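Yen's algorithm is what makes this ranking scale; for illustration, the same ranking of alternative stories can be reproduced on a toy graph by exhaustive enumeration of loopless paths (a sketch; the graph encoding and function names are ours):

```python
def k_shortest_paths(graph, s, t, k):
    """Enumerate the k shortest loopless paths from s to t by brute force.

    graph: {u: {v: edge_cost}}.  Yen's algorithm scales far better; exhaustive
    enumeration suffices to illustrate ranking stories on a small graph.
    """
    def extend(path, cost):
        u = path[-1]
        if u == t:
            yield (cost, path)
            return
        for v, c in graph.get(u, {}).items():
            if v not in path:                    # keep paths loopless
                yield from extend(path + [v], cost + c)
    return sorted(extend([s], 0.0))[:k]
```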

Top stories generated using vanilla topic model | Path Length | Top stories generated using interactive storytelling | Path Length
CIA06, CIA37, NSA19, NSA16 | 2.84 | CIA06, CIA08, DIA01, NSA09, NSA16 | 1.39
CIA06, CIA20, CIA21, NSA16 | 3.16 | CIA06, CIA12, NSA09, NSA16 | 1.93
CIA06, CIA22, CIA21, NSA16 | 3.16 | CIA06, CIA33, DIA01, NSA09, NSA16 | 2.13
CIA06, CIA20, CIA22, CIA21, NSA16 | 3.16 | CIA06, CIA22, NSA09, NSA16 | 2.13
CIA06, CIA22, CIA20, CIA21, NSA16 | 3.16 | CIA06, CIA08, DIA01, FBI07, NSA16 | 2.20
CIA06, CIA08, NSA21, NSA16 | 3.23 | CIA06, CIA33, FBI07, NSA16 | 2.22
CIA06, CIA08, NSA21, NSA12, NSA16 | 3.23 | CIA06, CIA33, CIA08, DIA01, NSA09, NSA16 | 2.31
CIA06, CIA08, NSA21, NSA13, NSA16 | 3.23 | CIA06, CIA11, FBI13, DIA01, NSA09, NSA16 | 2.33
CIA06, CIA08, NSA21, NSA12, NSA13, NSA16 | 3.23 | CIA06, DIA02, DIA01, NSA09, NSA16 | 2.33
CIA06, CIA08, NSA21, NSA18, NSA16 | 3.23 | CIA06, CIA08, CIA23, NSA16 | 2.34
Table 2: Top stories (shortest paths) generated from CIA06 to NSA16 using both a vanilla topic model and the interactive storytelling algorithm (using the Atlantic Storm dataset). The user’s feedback requires that both and be included in the story. The interactive storytelling algorithm updates the topic model wherein the shortest path indeed contains these documents.
Figure 6: Visualizing documents using multidimensional scaling (Manpad dataset) before and after user feedback. Many documents are omitted for better visualization. The starting and the ending documents are shown in green. The documents in the initial story are shown in blue (and the story by solid lines). The story generated by the interactive storytelling algorithm is shown in the dotted line through the grey documents. Each document is represented by its top five terms having the highest posterior probability.

4.3 Proximity between topics

We investigate topic proximity in terms of Manhattan distance in Fig. 7. Here, rows denote topics from a vanilla topic model, and the columns correspond to topics inferred by the interactive storytelling algorithm. As shown in Fig. 7, the diagonally dominant nature of the matrix is destroyed by the introduction of user feedback, illustrating that the distributions of words underlying the topics are quite dissimilar.

Atlantic Storm Crescent Manpad
Figure 7: Manhattan distance between topic distributions before and after user feedback. Blue color denotes topics closest to each other. As can be seen, the incorporation of feedback destroys the diagonal dominance of the matrix.

4.4 Scalability to large corpora

With large datasets, such as the VAST11 dataset, we can fruitfully combine clustering with our framework to navigate the document collection (see Fig. 8). Given a document collection, an initial clustering (e.g., k-means) can be utilized to identify broad groups of documents that can be discarded during the initial story construction. Here, assume that the user specifies and as the starting and ending documents, respectively. The storytelling algorithm generates as the initial story (solid line). Note that this story ignores documents from the cluster displayed in red. Assume that the user now requires that documents from the red cluster also participate in the story. Based on an initial exploratory analysis, the user specifies that documents and should participate in the story. Based on this feedback, the interactive storytelling algorithm generates: (note the introduction of into the story).

Figure 8: Scaling the storytelling methodology by integrating clustering. The initial story (solid line) from to avoided documents in the red cluster. After incorporating user feedback, the new story (dotted line) navigates through the red cluster.

4.5 Comparing interactive storytelling vs uniform cost search

We now assess the performance of the constrained search process underlying interactive storytelling against that of an uninformed search (e.g., uniform cost search). The comparison is shown in Fig. 9. We vary the distance threshold and measure the effective branching factor, path length, and execution time.

We show in Fig. 9(a, b, c) that the average effective branching factor increases with . Since a higher means a node will have more neighbors, the branching factor increases in this case. However, in the case of interactive storytelling, path finding is more guided, so the average effective branching factor does not vary much. We can also see that using a heuristic decreases the average effective branching factor. The average path length, however, decreases with increasing (Fig. 9(d, e, f)). Increasing results in a larger neighborhood for each node; therefore the chance of reaching the goal becomes higher, resulting in a shorter average path length. For interactive storytelling the average path length is higher because the search has to visit the nodes specified by the user. The execution times for heuristic search and uninformed search are almost the same (Fig. 9(g, h, i)); however, for interactive storytelling it is much longer. Since it has to visit the nodes provided by the user, it travels the search space in more depth and therefore takes more time on average to finish the search.
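The effective branching factor reported above can be computed numerically from each run. The sketch below follows the standard definition, solving N = b* + b*^2 + ... + b*^d for b*, under the assumption that N counts the nodes generated by the search and d is the solution depth:

```python
def effective_branching_factor(n_generated, depth, tol=1e-6):
    """Solve N = b + b^2 + ... + b^depth for the effective branching
    factor b by bisection (n_generated = N must exceed depth)."""
    def total(b):
        return sum(b ** i for i in range(1, depth + 1))
    lo, hi = 1.0 + 1e-12, float(n_generated)   # total() is increasing in b
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if total(mid) < n_generated:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```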

(a) (b) (c)
(d) (e) (f)
(g) (h) (i)
Figure 9: Comparison of interactive storytelling, heuristic search and uniform cost search in terms of average effective branching factor (top), average path length (middle) and execution time (bottom). (Left) Atlantic Storm. (middle) Crescent. (right) Manpad.

5 Related Work

Related work pertaining to storytelling has been covered in the introduction. We survey topic modeling related work here. To the best of our knowledge, no existing work supports the incorporation of path-based constraints to refine topic models, as done here.

Expressive topic models

The author-topic model [26] is one of the popular extensions of topic models that aims to model how multiple authors contributed to a document collection. Works such as [7, 6] extend basic topic modeling to include specific words or semantic concepts by incorporating notions of proximity between documents. In [32], the authors move beyond bag-of-words assumptions and accommodate the ordering of words in topic modeling. Domain knowledge is incorporated in [2] in the form of Dirichlet forest priors. Finally, in [3], correlated topic models are introduced to model correlations between topics.

Incorporating external information

Supervised topic models are introduced in [21]. Lu and Zhai [19] propose a semi-supervised topic model to incorporate expert opinions into modeling. In [24], the authors incorporate user tags accorded to documents to place constraints on topic inference. The timestamps of documents are used in [4, 33] to model the evolution of topics in a large corpus.

Visualizing topics

Wei et al. [34] propose TIARA, a visual exploratory text analytics system to observe the evolution of topics over time in a corpus. Crossno et al. [8] develop a framework to visually compare document contents based on different topic modeling approaches. In [22], the authors present documents in topic space and depict inter-document connectivity as a network in a visual interface, simultaneously displaying community clustering.

Interactive topic modeling

User feedback is incorporated in [16], wherein users can provide constraints about specific words that must appear in topics. An active learning framework to incorporate user feedback and improve topic quality is introduced in [36].

6 Discussion

We have demonstrated interactive storytelling, a combination of interactive topic modeling and constrained search wherein documents are connected obeying user constraints on paths. User feedback is pushed deep into the computational pipeline and used to refine the topic model. Through experiments we have demonstrated the ability of our approach to provide meaningful alternative stories while satisfying user constraints. In future work, we aim to generalize our framework to a multimodal network representation where entities of various kinds are linked through a document corpus, so that constraints can be more expressively communicated.
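One simple baseline for the constrained search discussed above, when the user's 'must use' documents are visited in a fixed order, is to chain shortest-path computations between consecutive waypoints. The sketch below is illustrative only: the graph representation and function names are our own, and the paper's actual formulation is based on linear inequalities over paths rather than this greedy decomposition.

```python
import heapq


def dijkstra(graph, src, dst):
    """Shortest path in a weighted graph {node: {neighbor: weight}}.

    Returns (cost, path) from src to dst, or (inf, []) if unreachable.
    """
    dist = {src: 0.0}
    prev = {}
    pq = [(0.0, src)]
    visited = set()
    while pq:
        d, u = heapq.heappop(pq)
        if u in visited:
            continue
        visited.add(u)
        if u == dst:
            path = [u]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return d, path[::-1]
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(pq, (nd, v))
    return float("inf"), []


def story_path(graph, start, end, must_use):
    """Chain shortest paths through the user's 'must use' documents in order."""
    waypoints = [start] + list(must_use) + [end]
    total, full = 0.0, [start]
    for a, b in zip(waypoints, waypoints[1:]):
        cost, seg = dijkstra(graph, a, b)
        if not seg:
            return float("inf"), []  # some constraint is unreachable
        total += cost
        full.extend(seg[1:])
    return total, full
```

On a toy document-similarity graph, forcing the story through an intermediate document lengthens the path relative to the unconstrained shortest path, mirroring the behavior observed in the experiments.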


  • [1] J. Albert and S. Chib. Bayesian analysis of binary and polychotomous response data. J. Amer. Statist. Assoc, 88(422):669–679, 1993.
  • [2] D. Andrzejewski, X. Zhu, and M. Craven. Incorporating domain knowledge into topic modeling via Dirichlet forest priors. In ICML, pages 25–32, 2009.
  • [3] D. Blei and J. Lafferty. Correlated topic models. In NIPS, 2006.
  • [4] D. Blei and J. Lafferty. Dynamic topic models. In ICML, pages 113–120, 2006.
  • [5] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. JMLR, 3:993–1022, 2003.
  • [6] C. Chemudugunta, A. Holloway, P. Smyth, and M. Steyvers. Modeling documents by combining semantic concepts with unsupervised statistical learning. In ISWC, pages 229–244, 2008.
  • [7] C. Chemudugunta, P. Smyth, and M. Steyvers. Modeling general and specific aspects of documents with a probabilistic topic model. In NIPS, pages 241–248, 2006.
  • [8] P. Crossno, A. Wilson, T. Shead, and D. Dunlavy. Topicview: Visually comparing topic models of text collections. In International Conference on Tools with Artificial Intelligence, pages 936–943, 2011.
  • [9] A. Feng and J. Allan. Incident threading for news passages. In CIKM, pages 1307–1316, 2009.
  • [10] S. Gad, W. Javed, S. Ghani, N. Elmqvist, E. Ewing, K. Hampton, and N. Ramakrishnan. Themedelta: Dynamic segmentations over temporal topic models. IEEE Trans. Vis. Comput. Graph., 21(5):672–685, 2015.
  • [11] T. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 101(suppl 1):5228–5235, 2004.
  • [12] M. Hossain, C. Andrews, N. Ramakrishnan, and C. North. Helping intelligence analysts make connections. In Scalable Integration of Analytics and Visualization, volume WS-11-17, 2011.
  • [13] M. Hossain, P. Butler, A. Boedihardjo, and N. Ramakrishnan. Storytelling in entity networks to support intelligence analysts. In KDD, pages 1375–1383, 2012.
  • [14] M. Hossain, J. Gresock, Y. Edmonds, R. Helm, M. Potts, and N. Ramakrishnan. Connecting the dots between PubMed abstracts. PLoS ONE, 7:e29509, 01 2012.
  • [15] M. Hossain, M. Narayan, and N. Ramakrishnan. Efficiently discovering hammock paths from induced similarity networks. CoRR, abs/1002.3195, 2010.
  • [16] Y. Hu, J. Boyd-Graber, and B. Satinoff. Interactive topic modeling. In ACL, pages 248–257, 2011.
  • [17] D. Kumar, N. Ramakrishnan, R. Helm, and M. Potts. Algorithms for storytelling. In KDD, pages 604–610, 2006.
  • [18] S. Leman, L. House, D. Maiti, A. Endert, and C. North. Visual to parametric interaction (v2pi). PLoS ONE, 8(3), 2013.
  • [19] Y. Lu and C. Zhai. Opinion integration through semi-supervised topic modeling. In WWW, pages 121–130, 2008.
  • [20] E. Martins and M. Pascoal. A new implementation of Yen’s ranking loopless paths algorithm. 4OR, 1:121–133, 2003.
  • [21] J. Mcauliffe and D. Blei. Supervised topic models. In NIPS, pages 121–128, 2008.
  • [22] Q. Mei, D. Cai, D. Zhang, and C. Zhai. Topic modeling with network regularization. In WWW, pages 101–110, 2008.
  • [23] R. Nallapati, A. Feng, F. Peng, and J. Allan. Event threading within news topics. In CIKM, pages 446–453, 2004.
  • [24] D. Ramage, D. Hall, R. Nallapati, and C. Manning. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In EMNLP, pages 248–256, 2009.
  • [25] R. Ramaswamy, J. B. Orlin, and N. Chakravarti. Sensitivity analysis for shortest path problems and maximum capacity path problems in undirected graphs. Math. Program, 102(2):355–369, 2005.
  • [26] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The Author-Topic model for authors and documents. In UAI, pages 487–494, 2004.
  • [27] D. Shahaf and C. Guestrin. Connecting the dots between news articles. In KDD, pages 623–632, 2010.
  • [28] D. Shahaf and C. Guestrin. Connecting two (or less) dots: Discovering structure in news articles. ACM Trans. Knowl. Discov. Data, 5(4):24:1–24:31, 2012.
  • [29] D. Shahaf, C. Guestrin, and E. Horvitz. Metro maps of science. In KDD, pages 1122–1130, 2012.
  • [30] D. Shahaf, C. Guestrin, and E. Horvitz. Metro maps of information. SIGWEB Newsletter, 2013.
  • [31] D. Shahaf, J. Yang, C. Suen, J. Jacobs, H. Wang, and J. Leskovec. Information cartography: Creating zoomable, large-scale maps of information. In KDD, pages 1097–1105, 2013.
  • [32] H. Wallach. Topic modeling: Beyond bag-of-words. In ICML, pages 977–984, 2006.
  • [33] X. Wang and A. McCallum. Topics over time: A non-markov continuous-time model of topical trends. In KDD, pages 424–433, 2006.
  • [34] F. Wei, S. Liu, Y. Song, S. Pan, M. Zhou, W. Qian, L. Shi, L. Tan, and Q. Zhang. Tiara: A visual exploratory text analytic system. In KDD, pages 153–162, 2010.
  • [35] H. Wu, M. Mampaey, N. Tatti, J. Vreeken, M. Hossain, and N. Ramakrishnan. Where do I start?: Algorithmic strategies to guide intelligence analysts. In KDD Workshop on Intelligence and Security Informatics, pages 3:1–3:8, 2012.
  • [36] Y. Yang, S. Pan, D. Downey, and K. Zhang. Active learning with constrained topic model. 2014.