There is a vast amount of data, and understanding this data presents a cognitive barrier for people. Studies show that in many scenarios, people prefer a text description of data over numerical, tabular, or graphical representations of it. As an example, medical staff made better treatment decisions when presented with a text description of patient status compared to graphs (Law et al., 2005). In this paper, we present an approach to generate textual summaries using a probabilistic model that represents the complex patterns of human summarization.
Diverse data-to-text systems have been proposed for generating summaries (Gkatzia, 2016). Unfortunately, most efforts to automatically generate text descriptions of data fail to consider which aspects of the data are most important to a human end user. In this paper, we introduce our summary generation system for numerical time series. The goal of this system is to learn how humans describe time series data and create a descriptive natural language text similar to the human summary.
Human summaries can capture many different features from data, including relationships to background knowledge and comparisons to other, unseen, data. In this paper, we focus on simulating parts of the text that are directly or indirectly describing a salient pattern in the data. To do so, we try to learn common numerical patterns used by people, the various textual statements that are used in describing them, and how they are aligned with each other. This process leads to finding interpretable patterns in data which we call trends and their textual descriptions which we use as templates. For example, in the sentence “TSLA stock has plummeted 15 percent in the past three months” the verb “plummeted” signals a sharp decreasing trend in the time series.
We detect these trends from numerical time series data, propose a utility estimation model to detect the subset of commonly used trends that are present in a time series, and learn when and how these trends are used by humans. The core idea in our model is a set of policies, which are latent variables parameterized by data features that dictate when a trend is included in a summary. To model the complex interactions of summarization policies, we use a Bayesian network. The output of the system consists of a set of templates, each of which is associated with a high-utility pattern in the data.
In this section, we formalize our summary generation model for numerical time series data. Our model is based on identifying prominent trends in data and creating textual descriptions for them. A trend is a pattern in data that is interpreted by a human and can be qualitatively described in text. As an example, the figure (fig:sub-second) shows a dataset of Greenland mass variation, with a cyclic pattern corresponding to a trend that has been described in the sentence “These oscillations are waves of mass variation which occurred on an annual or biennial basis”.
A trend can be a value of a point or set of points in the data (such as a maximum), a relationship between points or sets of points (such as an increasing trend), or an aggregate measure on a subset of data points (such as a mean value). Although there are many possible patterns in the data, we observed that certain categories of patterns appeared more frequently in human-generated summaries of data, which are our main focus in this paper. These trend categories are linear, statistical properties, discontinuous transitions, cycles, and anomalous points. Figure 1 contains examples of these trends with their descriptions.
We now provide a formal definition of time series trends, their features, and the utility of a trend. Suppose $X$ is an arbitrary time series. We represent each trend observed in $X$ with an $m$-element feature vector. The feature vector for each trend contains the parameters needed to describe the trend in a textual summary and to approximately reconstruct the underlying data pattern. For example, the feature vector for a linear trend contains the slope, the intercept, the spanning interval, the number of data points, etc. These features can be general or specific to a trend type (e.g., the slope feature is defined for linear trends, whereas the spanning interval is defined for all trends).
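As an illustration of what such a feature vector could look like for a linear trend, the following minimal sketch fits a least-squares line to a segment and collects the features named above. The class name, field names, and the choice of ordinary least squares are assumptions made for this example, not details taken from the system described here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LinearTrendFeatures:
    """Hypothetical feature vector for a linear trend (field names are assumptions)."""
    slope: float
    intercept: float
    start: float   # first time stamp covered by the trend
    end: float     # last time stamp covered by the trend
    n_points: int  # number of data points in the spanning interval


def fit_linear_trend(times: Sequence[float], values: Sequence[float]) -> LinearTrendFeatures:
    """Fit an ordinary least-squares line through the points (times, values)."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two aligned points")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        raise ValueError("time stamps must not all be equal")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return LinearTrendFeatures(slope, intercept, min(times), max(times), n)


def reconstruct(features: LinearTrendFeatures, t: float) -> float:
    """Approximately reconstruct the underlying data pattern at time t."""
    return features.intercept + features.slope * t
```

The same features serve both purposes mentioned in the text: the slope and interval can drive a textual description, while `reconstruct` recovers an approximation of the data from the features alone.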
Let be the set of all possible trends and be the associated features for these trends, s.t. . We introduce a utility model which captures the preference of trends. The general utility model tries to find a utility function which maps a set of trends and their associated features to a utility value in the interval for the summary consisting of that set of trends. The utility function imposes a weak ranking over all possible subsets of trends, creating a preference function. Note that in practical applications, utility functions may be personalized to specific populations or subject domains.
The general form of this utility model leads to a potentially intractable problem. Datasets contain many trends, and the utility function must be computed over the powerset of all such trends. In this paper, we therefore focus on a simpler problem: we introduce a simplified version of the utility model that assigns utilities to individual trends.
In this section, we formally describe our utility model, our approach of introducing latent summarization policies to model human behavior, and define the associated tasks for learning our model.
We first describe our utility model. Let $X$ be an arbitrary time series and $t_1, \dots, t_n$ be the trends in $X$ with feature vectors $f_1, \dots, f_n$. Let $f_{-i}$ be the feature vectors of all trends in $X$ except $t_i$. Let $s_i$ be a binary random variable indicating whether $t_i$ appears in the text or not. We identify the utility of a trend $t_i$ as the probability that we observe $t_i$ in the text, that is, $P(s_i = 1 \mid f_i, f_{-i})$. Our goal is to learn the probability distribution of $s_i$ based on the observed data, which is $f_i, f_{-i}$. We propose a graphical model to estimate this distribution. In our model, the observed variables are $f_i, f_{-i}$, and during training we are given $s_i$. The hidden variables in this model are latent policies, described in the next section.
A policy is a binary random variable whose distribution depends on the trend feature vector. The value of the policy indicates whether a trend is selected for the summary based on its feature vector and the other trends in that series. For example, consider the policy that prefers more recent trends. This policy considers the attribute of trends that refers to their spanning interval, and its value is more likely to be 1 for the most recent trend and 0 for initial trends. In this paper, we define a compositional model of utility that expresses the utility of complex preferences as a combination of simpler models, which we refer to as policies. The primary difference between policies and the utility model is that policies are explanatory latent variables for a trend, while the utility function estimates the empirical probability of a trend aggregated over a dataset.
We divide policies into leaf policies and complex policies. A leaf policy is an atomic policy that cannot be decomposed into a set of policies combined with binary operations. It considers only one aspect of the trend or trends, and its value for each trend is independent of other policies.
For example, the policy introduced in the previous paragraph, which prefers more recent trends, is a leaf policy. Complex policies can be created by combining leaf policies using binary logical operations, i.e., conjunction, disjunction, exclusive or, etc. For example, consider a second leaf policy that is 1 when the linear trend is increasing. The policy that prefers the most recent increase in a time series can be viewed as the conjunction of these two leaf policies.
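The combination of leaf policies by logical operations can be sketched as follows. This is an illustrative example only: the dictionary representation of a trend, the feature names, and the fixed year cutoff used for recency (standing in for a comparison with the other trends in the series) are all assumptions.

```python
from typing import Callable, Dict

Trend = Dict[str, float]          # hypothetical representation: feature name -> value
Policy = Callable[[Trend], bool]  # a policy maps a trend to a binary value


def conjunction(p: Policy, q: Policy) -> Policy:
    """Complex policy that is true only when both constituent policies are true."""
    return lambda trend: p(trend) and q(trend)


def disjunction(p: Policy, q: Policy) -> Policy:
    """Complex policy that is true when at least one constituent policy is true."""
    return lambda trend: p(trend) or q(trend)


def exclusive_or(p: Policy, q: Policy) -> Policy:
    """Complex policy that is true when exactly one constituent policy is true."""
    return lambda trend: p(trend) != q(trend)


def is_increasing(trend: Trend) -> bool:
    """Leaf policy: the linear trend has a positive slope."""
    return trend["slope"] > 0


def is_recent(trend: Trend) -> bool:
    """Leaf policy: the trend ends at or after an assumed cutoff year."""
    return trend["end"] >= 2012


# Complex policy from the text: the trend is both recent and increasing.
recent_increase = conjunction(is_increasing, is_recent)
```

Representing policies as plain functions keeps the combinators independent of any particular leaf policy, so new complex policies are built by composition.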
The criteria used in leaf policies may vary from simple to complex. The criteria may use a limited set of trend features or might consider dependencies among different features in different trends. We gathered a set of criteria that humans used in their preference models and, based on them, classified leaf policies into the following categories.
Single Feature: In this policy category, the value of the policy depends on the value of a single feature. We assume that in this case the value of the policy is derived from a simple function of that feature. For example, a threshold function, which measures when a feature value exceeds a threshold, can be used to define a policy that selects linear increasing trends by setting a threshold of 0 on the slope feature.
Multiple Features: In this policy category, the value of a policy depends on multiple features of a single trend. For example, this policy can be used to define cases when a linear trend has a slope greater than the intercept value.
Single Feature in Multiple Trends: In this policy category, the policy value for a trend depends on a single feature of that trend as well as the same feature in other trends in the time series. This policy type can be used to compare trends. For example, the policy that prefers the most recent trend is in this category, since it requires comparing the “interval” feature of all trends in the same time series.
Multiple Features in Multiple Trends: In this policy category, the policy value for a trend depends on multiple features of the trend as well as of other trends. For example, a policy that prefers jump points that do not exceed 50% of the maximum value of a time series falls into this category.
Feature Independent: In this type of policy, the value of the policy is not determined by the feature vectors of trends. In this case, a series of hidden factors affects the utility of trends, e.g., a hidden factor might be the context of the time series. Note that we do not consider this leaf policy category in our model.
The leaf policies can be combined using different logical structures to create various complex policies. Complex policies may have different and conflicting values for trends. For example, one complex policy might have a high value for a specific trend, whereas another policy might have a low value for the same trend.
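To make the first four categories concrete, the sketch below gives one hypothetical leaf policy per category. The feature names (`slope`, `intercept`, `end`, `jump`, `max_value`) and the dictionary representation are assumptions chosen for illustration; the feature-independent category is omitted because it is not part of the model.

```python
from typing import Dict, List

Trend = Dict[str, float]  # hypothetical representation: feature name -> value


def increasing(trend: Trend, others: List[Trend]) -> bool:
    """Single feature: threshold of 0 on the slope feature."""
    return trend["slope"] > 0


def slope_exceeds_intercept(trend: Trend, others: List[Trend]) -> bool:
    """Multiple features of a single trend: slope greater than intercept."""
    return trend["slope"] > trend["intercept"]


def most_recent(trend: Trend, others: List[Trend]) -> bool:
    """Single feature in multiple trends: compare interval ends across trends."""
    return all(trend["end"] >= other["end"] for other in others)


def moderate_jump(trend: Trend, others: List[Trend]) -> bool:
    """Multiple features in multiple trends: the jump does not exceed 50% of
    the maximum value observed across all trends of the series."""
    series_max = max(t["max_value"] for t in [trend] + others)
    return trend["jump"] <= 0.5 * series_max
```

All four functions share one signature, taking the target trend and the remaining trends of the series, so that policies from different categories can be combined uniformly.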
Although there exist many complex policies, when and how these policies are activated depends on the specific summarization context, and some policies are not considered in assessing the utility of some trends.
Therefore, the utility of each trend depends on a specific subset of these policies, each of which might have a different degree of importance. For example, suppose a trend is a linear increasing trend in a stock indicator spanning from 2009 to 2013. A policy identifying long-running trends may be triggered by this trend, while a policy that identifies recent data may ignore this trend.
A utility model that learns that the second policy is a more reliable indicator of human behavior than the first policy may then omit this trend from a summary.
Our goal is to find the utility function that estimates the utility of trends by assigning high utility values to the trends that humans prefer. The problem of finding the utility function can be formulated as three subproblems: finding the leaf policies; finding the complex policies, which requires determining the structure of the dependencies between complex policies and leaf policies; and finding the joint distribution of complex policies for different trends.
Using Policies for Utility Estimation
The architecture of our utility model is a Bayesian network, shown in Figure 2. In this paper, the simplified task is restricted to predicting the utility of a single trend. Simple policies are defined with respect to the feature vectors of the target trend as well as of other trends. These policies are combined using logical formulae to create complex policies. During utility estimation, the output variable indicating whether a trend is included in the summary is defined using a probability distribution over the complex policies.
We define leaf policies to be binary random variables. Let $l_1, \dots, l_k$ be the leaf policies in the model. Complex policies are created by applying different structures and logical operations to leaf policies. For example, let $l_1$ and $l_2$ be leaf policies; they can be combined using xor to create a new policy $l_1 \oplus l_2$. If $r$ such structures are used in the model, then $r$ complex policies are present in the model, which we denote by $c_1, \dots, c_r$. Complex policies are also binary random variables and depend on the leaf policies. The value of each complex policy is independent of the other complex policies, and the inclusion of a trend in the summary depends on all complex policies. Therefore we can compute the final utility of a trend in our model as:
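One factorization consistent with this description marginalizes the inclusion variable over the hidden policies. The notation is assumed for this sketch: $s_i$ is the inclusion variable of trend $t_i$, $f_i$ and $f_{-i}$ are the feature vectors of $t_i$ and of the other trends, $l_1, \dots, l_k$ are the leaf policies, $c_1, \dots, c_r$ are the complex policies, and $\mathrm{pa}(c_j)$ denotes the leaf policies that $c_j$ depends on. It follows the generic Bayesian network structure of features, leaf policies, complex policies, and inclusion, and may differ in detail from the exact form used in the model.

```latex
\begin{equation*}
P(s_i = 1 \mid f_i, f_{-i})
  = \sum_{c_1, \dots, c_r} P(s_i = 1 \mid c_1, \dots, c_r)
    \sum_{l_1, \dots, l_k}
    \prod_{j=1}^{r} P\bigl(c_j \mid \mathrm{pa}(c_j)\bigr)
    \prod_{q=1}^{k} P(l_q \mid f_i, f_{-i})
\end{equation*}
```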
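As the model shows, the utility of each trend depends on the leaf policies, the complex policies, and the conditional distribution of the inclusion variable given the complex policies. The conditional probability of inclusion given a complex policy can be interpreted as the weight of that complex policy. Our goal is to find the structure and parameters of the model that maximize the probability of the observed data, that is, the likelihood of the observed inclusion decisions given the trend features.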
Implementing this utility model requires addressing several probabilistic modeling tasks:
Latent variable learning: determining the types and parameters of the simple policies
Structure learning: identifying the dependencies necessary to identify complex policies
Parameter learning: finding the conditional probability distribution for the trend’s inclusion in the summary, given the complex policies
Inference: determining whether a given trend will appear in a summary
In the following subsection we describe how we can learn this model. We define prerequisite learning sub-tasks similar to Weston et al. (2015) for learning the complete utility model. Once the underlying graphical model is learned, we can use it to assign utilities to the new trends.
Inference and Learning for Utility Models
In this section, we describe how we address several of the core learning tasks for our utility estimation model.
Parameter estimation for leaf policies
As mentioned in the previous section, the utility of a trend depends on the combination of leaf policies. Therefore, the first step in learning the utility model is to capture the various leaf policies. In this paper, we restrict ourselves to several predefined types of leaf policies and focus on learning their parameters.
Leaf policies are binary random variables that indicate whether the features of a trend satisfy a specific relation or not. Some of these policies also consider the dependency of a trend on other trends in their relation. The probability distribution of a leaf policy whose value for a trend $t_i$ is determined by its feature vector $f_i$ is characterized by an indicator function with parameters $w \in \mathbb{R}^m$, $b \in \mathbb{R}$.
The value of the policy for $t_i$ is true if $w^\top f_i + b > 0$ and false otherwise. In other words, it is parametrized by a linear separator in $\mathbb{R}^m$ that gives a high value to the points above the line.
As an example, consider the policy whose value is true for linear increasing trends and false for non-increasing trends. Let $t_i$ be a linear trend and let the $k$-th element of its feature vector denote its slope. This policy can be represented with a one-hot weight vector $w$ whose $k$-th element is 1, together with the offset $b = 0$.
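A leaf policy of this form is simply an indicator of which side of a hyperplane the feature vector falls on. The sketch below evaluates such a policy for the increasing-trend example; the feature layout `[slope, intercept, start, end]` and the index constant are assumptions made for illustration.

```python
from typing import List, Sequence


def leaf_policy(w: Sequence[float], b: float, f: Sequence[float]) -> bool:
    """Indicator of the linear separator: true when w . f + b > 0."""
    if len(w) != len(f):
        raise ValueError("weight and feature vectors must have the same length")
    return sum(wi * fi for wi, fi in zip(w, f)) + b > 0


def one_hot(index: int, size: int) -> List[float]:
    """Vector of the given size with a 1 at `index` and 0 elsewhere."""
    return [1.0 if k == index else 0.0 for k in range(size)]


# Assumed feature layout for this example: [slope, intercept, start, end].
SLOPE = 0
W_INCREASING = one_hot(SLOPE, 4)
B_INCREASING = 0.0


def is_increasing(f: Sequence[float]) -> bool:
    """Increasing-trend policy: one-hot weight on the slope, zero offset."""
    return leaf_policy(W_INCREASING, B_INCREASING, f)
```

Because only the slope entry of the weight vector is nonzero, the remaining features have no influence on the policy value.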
In another group of leaf policies, the value of the policy for a trend is based on its feature vector and its relation to other trends. We focus on pairwise dependencies among trends and later show that dependencies involving more trends can be captured by pairwise dependencies in this problem, though this might not be the most efficient solution. Consider a leaf policy whose value for a trend $t_i$ is determined by its feature vector $f_i$ and the feature vector $f_j$ of another trend $t_j$. The distribution of this policy is characterized by an indicator function with parameters $w_1, w_2 \in \mathbb{R}^m$, $b \in \mathbb{R}$.
The policy considers the dependency between $t_i$ and $t_j$. It has a true value for $t_i$ if $w_1^\top f_i + w_2^\top f_j + b > 0$. It can be interpreted as a linear separator in $\mathbb{R}^{2m}$. Note that policies that only consider a single trend are a special case where $w_2 = 0$, but for simplicity we separate their representations.
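As an example, consider the policy that has a higher value for more recent trends, and let the $k$-th element of the feature vector indicate the time span of a trend. This policy can be represented with two (signed) one-hot vectors $w_1, w_2$, where the $k$-th element of $w_1$ is 1, and the $k$-th element of $w_2$ is $-1$.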
Policies like this one are building blocks of more complex policies that consider dependencies among multiple trends. For example, the policy that prefers the most recent trend in a time series can be expressed as the conjunction of the pairwise recency policy with itself,
where each repetition contributes the dependency of the target trend on one of the other trends in the series. The resulting policy therefore assigns the highest utility to the most recent trend.
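The construction of a "most recent" policy from repeated pairwise comparisons can be sketched as follows. The feature layout `[slope, intercept, start, end]`, the use of the interval end as the recency feature, and the zero offset are assumptions for this example.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def pairwise_policy(w1: Sequence[float], w2: Sequence[float], b: float,
                    f_i: Sequence[float], f_j: Sequence[float]) -> bool:
    """Indicator of a linear separator over the concatenated pair (f_i, f_j)."""
    return dot(w1, f_i) + dot(w2, f_j) + b > 0


# Assumed feature layout: [slope, intercept, start, end]; recency uses `end`.
END = 3
W1 = [1.0 if k == END else 0.0 for k in range(4)]   # +1 on the end of trend i
W2 = [-1.0 if k == END else 0.0 for k in range(4)]  # -1 on the end of trend j


def more_recent(f_i: Sequence[float], f_j: Sequence[float]) -> bool:
    """Pairwise leaf policy: trend i ends strictly later than trend j."""
    return pairwise_policy(W1, W2, 0.0, f_i, f_j)


def most_recent(f_i: Sequence[float], others: List[Sequence[float]]) -> bool:
    """Conjunction of the pairwise policy over every other trend in the series."""
    return all(more_recent(f_i, f_j) for f_j in others)
```

The conjunction needs one pairwise evaluation per other trend, which illustrates why capturing multi-trend dependencies through pairwise policies works but may not be the most efficient formulation.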
As mentioned above, we characterize leaf policies with linear separators. Therefore, our goal is to find the parameters of these linear separators
such that the probability of the observed data is maximized. We expect probabilistic linear classifiers such as logistic regression to detect the parameters of these separators perfectly. It is also possible to use maximum likelihood to find the parameters of each leaf policy when the graphical structure of the model is known.
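As a rough illustration of how a probabilistic linear classifier can learn a leaf policy from labeled trends, the sketch below fits logistic regression by plain batch gradient ascent on the log-likelihood. It is a pure-Python stand-in for a library implementation; the learning rate, epoch count, and the two-feature layout `[slope, intercept]` are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X: Sequence[Sequence[float]], y: Sequence[int],
                 lr: float = 0.5, epochs: int = 2000) -> Tuple[List[float], float]:
    """Maximize the mean log-likelihood of labels y with batch gradient ascent."""
    n, m = len(X), len(X[0])
    w, b = [0.0] * m, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * m, 0.0
        for x, label in zip(X, y):
            err = label - sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for k in range(m):
                grad_w[k] += err * x[k]
            grad_b += err
        w = [wi + lr * g / n for wi, g in zip(w, grad_w)]
        b += lr * grad_b / n
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> bool:
    """Learned policy value: true when the point lies above the separator."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0
```

For example, with feature vectors `[slope, intercept]` labeled by whether the slope is positive, `fit_logistic` returns a separator that reproduces the labels on the training trends. The learned weights need not equal the one-hot vector exactly, since any separator consistent with the data achieves the same predictions.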
In the proposed utility model, subsets of leaf policies are combined via different structures to create complex policies. Therefore, each complex policy depends on a subset of the leaf policies. We assume complex policies can be modeled as the product of their constituent leaf policies. The structure of the dependencies among leaf and complex policies is unknown. The problem of finding conditional dependencies between variables, which represent edges in our graphical model, has been well studied (Drton and Maathuis, 2016). Our utility model is a Bayesian network; hence, we use available structure learning methods for tree-structured Bayesian networks as baselines. We use greedy search and Chow-Liu (Chow and Liu, 1968) for learning the utility model structure.
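The Chow-Liu procedure builds a maximum-weight spanning tree over the variables, with edge weights given by empirical pairwise mutual information. The following is a minimal sketch for discrete samples; the use of Prim's algorithm for the spanning tree and the representation of samples as tuples of policy values are choices made for this example.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def mutual_information(samples: Sequence[Sequence[int]], i: int, j: int) -> float:
    """Empirical mutual information (in nats) between variables i and j."""
    n = len(samples)
    joint = Counter((s[i], s[j]) for s in samples)
    marg_i = Counter(s[i] for s in samples)
    marg_j = Counter(s[j] for s in samples)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((marg_i[a] / n) * (marg_j[b] / n)))
    return mi


def chow_liu_edges(samples: Sequence[Sequence[int]]) -> List[Tuple[int, int]]:
    """Edges of a maximum mutual-information spanning tree (Prim's algorithm)."""
    d = len(samples[0])
    weight: Dict[Tuple[int, int], float] = {
        (i, j): mutual_information(samples, i, j)
        for i, j in combinations(range(d), 2)
    }
    in_tree = {0}
    edges: List[Tuple[int, int]] = []
    while len(in_tree) < d:
        # Pick the heaviest edge that connects the tree to a new variable.
        best = max(
            (e for e in weight if (e[0] in in_tree) != (e[1] in in_tree)),
            key=lambda e: weight[e],
        )
        edges.append(best)
        in_tree.update(best)
    return edges
```

The returned edges are undirected; orienting them away from a chosen root yields a tree-structured Bayesian network.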
The final step of learning the utility model is to find the probability distribution of the inclusion variable given the complex policies. The table containing this conditional probability in the Bayesian network has a number of entries that grows exponentially with the number of complex policies. Therefore, it is impractical to compute all values in the table. A possible approach to learning this table is to train a probabilistic classifier for the inclusion variable given the complex policies.
Once the structure and parameters of the model are learned, we can use them to infer the utility of new trends. The utility of a trend is computed from its feature vector by summing the probability of inclusion over all possible values of the complex policies.
Computing this probability for all possible values of the complex policies, which are the hidden variables, is computationally expensive. In addition, the conditional probability table of the inclusion variable is unknown. We therefore train a classifier that estimates the utility from the complex policies and use it instead of the conditional probability table. A possible approach to computing the utility in this scenario is to find the assignment of the complex policies with the highest probability and then compute the probability of inclusion for that specific assignment.
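The difference between exact marginalization and the highest-probability-assignment shortcut can be sketched as follows. The example assumes that the complex policies are independent given the features, each with a known probability of being 1, and that the trained classifier is a logistic model over the policy values; both the weights and the function names are hypothetical.

```python
import math
from itertools import product
from typing import Sequence


def inclusion_probability(c: Sequence[int], weights: Sequence[float], bias: float) -> float:
    """Hypothetical logistic classifier giving P(inclusion | complex policies c)."""
    z = bias + sum(w * cj for w, cj in zip(weights, c))
    return 1.0 / (1.0 + math.exp(-z))


def exact_utility(policy_probs: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Sum over all 2^r assignments of the complex policies (exponential cost)."""
    total = 0.0
    for c in product((0, 1), repeat=len(policy_probs)):
        p_c = 1.0
        for cj, pj in zip(c, policy_probs):
            p_c *= pj if cj == 1 else 1.0 - pj
        total += p_c * inclusion_probability(c, weights, bias)
    return total


def map_utility(policy_probs: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Evaluate the classifier only at the most probable policy assignment.

    Under the independence assumption, the most probable joint assignment
    sets each policy to its individually more likely value.
    """
    c_map = tuple(1 if pj >= 0.5 else 0 for pj in policy_probs)
    return inclusion_probability(c_map, weights, bias)
```

The two estimates coincide when every policy value is certain and can diverge as the policy probabilities approach 0.5, which is the price paid for avoiding the exponential sum.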
In this section, we describe the experiments we conducted to evaluate our proposed utility model. We use synthetic data in our experiments to check the applicability of our model, since real data is noisy and using it adds more complexity to the problem at this stage. We created a synthetic dataset consisting of 2000 numerical time series. To generate each time series, we randomly segmented the time span. We then inserted a random linear trend into each segment by adding points along that linear trend with normal noise. In each experiment scenario, we created a training set that contains the feature vectors of the trends detected in the training time series. We then learned each part of the model separately and evaluated the overall performance of the system for the baseline methods.
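A generator in the spirit of this procedure might look like the sketch below. The series length, the ranges from which slopes and intercepts are drawn, and the noise level are assumptions; the text does not specify them.

```python
import random
from typing import List, Tuple

Segment = Tuple[int, int, float, float]  # (start, end_exclusive, slope, intercept)


def generate_series(length: int, n_segments: int, noise_std: float,
                    rng: random.Random) -> Tuple[List[float], List[Segment]]:
    """Randomly segment [0, length) and fill each segment with a noisy line."""
    if not 1 <= n_segments <= length:
        raise ValueError("n_segments must be between 1 and length")
    # Distinct random cut points split the time span into contiguous segments.
    cuts = sorted(rng.sample(range(1, length), n_segments - 1))
    bounds = [0] + cuts + [length]
    values: List[float] = []
    segments: List[Segment] = []
    for start, end in zip(bounds, bounds[1:]):
        slope = rng.uniform(-1.0, 1.0)        # assumed slope range
        intercept = rng.uniform(-10.0, 10.0)  # assumed intercept range
        for t in range(start, end):
            values.append(intercept + slope * (t - start) + rng.gauss(0.0, noise_std))
        segments.append((start, end, slope, intercept))
    return values, segments
```

Passing an explicit `random.Random` instance with a fixed seed makes a generated dataset reproducible, and the returned segments provide ground-truth trends against which detected trends can be compared.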
The real value of the utility of each trend is not available. As mentioned in previous sections, the utility of trends is used both to select the subset of trends that appears in the text and to determine a ranking among them. Therefore, we evaluate our system using two different metrics, precision/recall and Kendall's tau, each of which evaluates one aspect of utility. We also evaluate the subtasks separately.
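The two metrics can be computed as in the sketch below: precision and recall compare the set of selected trends with the set mentioned in the reference summary, and Kendall's tau compares two rankings of the same trends. The tau-a variant (no correction for ties) is an assumption here; the text does not state which variant is used.

```python
from itertools import combinations
from typing import Sequence, Set, Tuple


def precision_recall(predicted: Set[int], reference: Set[int]) -> Tuple[float, float]:
    """Precision and recall of the predicted trend set against the reference set."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


def kendall_tau(a: Sequence[float], b: Sequence[float]) -> float:
    """Kendall's tau-a between two score lists over the same items."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long lists with at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs
```

Precision and recall assess the selection aspect of utility, while Kendall's tau assesses whether the estimated utilities order the trends the way the reference does.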
In this section we describe our experiments. In each scenario, we assumed structure and parameter learning are done in isolation and we evaluated the inferred utilities of the learned utility model.
Learning Leaf-policy parameters
In this set of experiments, we assumed that the model consists of a single leaf policy and repeated the experiment for example leaf policies from the different leaf policy types introduced earlier. We tried to learn the parameters of that single leaf policy and infer the utility of trends. Since there is only one complex policy and one leaf policy in this case, no structure learning is required. In this experiment, the baselines are probabilistic linear separators, e.g., logistic regression. We also evaluated the performance of non-probabilistic classifiers in this case, including decision trees and SVMs. We describe the leaf policies used in this experiment in Table 1.
|Policy Id||Policy Preference Description|
|increasing linear trend|
|slope of linear trend greater than a threshold|
|specific trend type|
|more recent trends|
|greater spanning interval|
|more extreme jumps|
|different trend types|
The results of the baselines in this experiment are shown in Table 2. As we expected, the probabilistic classifiers model the leaf policies almost perfectly, and a near-perfect F1 score is achieved for all leaf policies. We reuse the logistic regression classifiers trained for the leaf policies in this experiment in the second experiment.
Utility Estimation Experiment
In this experiment, we assumed that there are multiple leaf policies. We also assumed that there are multiple complex policies and that their dependency structure on the leaf policies is known. Our goal was to estimate utility based on the complex policies. We did not keep the conditional probability table of the inclusion variable given the complex policies. Instead, we trained a probabilistic linear classifier to estimate the utility given the complex policies. Note that we used the classifiers trained in the previous experiment to find the values of the complex policies. In the first two scenarios, the leaf policies have the same type, while in the remaining scenarios the leaf policies have different types. The complex policies, along with their descriptions and dependent leaf policies, are shown in Table 3.
|linear and increasing|
|jump point and downward|
|extreme point and has high value|
|linear and highest spanning interval|
|linear and sharpest increase|
|jump point and sharpest uptrend|
|jump point and most recent|
|jump point and unique|
|linear and most recent|
The results of the second experiment are shown in Table 4.
As shown in Table 4, when the complex policies and their correct values for a given trend are known, estimating utilities with a probabilistic linear classifier can achieve a high score in many settings. However, we should note that the performance of the utility estimator depends heavily on the values of the complex policies.
Data-to-text systems have long been an area of active research. Various data-to-text systems have focused on creating textual summaries for different kinds of data. Al-Zaidy et al. (2016) and Demir et al. (2012) provide examples of data-to-text systems that focus on generating textual descriptions for graphical or chart data.
A data-to-text system represents the given data or knowledge in a text format so that people can understand and interpret the information better. The workflow in a data-to-text system consists of modules that are responsible for analyzing the input data and extracting patterns and trends, detecting the relation between trends, selecting the content and generating the output (Gkatzia, 2016).
In this paper, we limit the domain of the system to numerical time series data. Sripada et al. (2003, 2004) focus on creating descriptions for time series data in different domains. Their systems depend on expert knowledge in the content selection phase. Lloyd et al. (2014) create descriptions for time series data by discovering statistical models in the data and mapping them to natural language text to create a good explanation of the data. However, the provided explanation consists of descriptions of complicated statistical patterns, such as “This component is a smooth function with a typical length scale of 8.1 months”, which are not appealing for nontechnical reports and are not similar to human descriptions.
Our approach follows the pattern of data-to-text systems: we create a data-to-text system to generate a qualitative summary for a given time series. Our goal is to provide enough information in the summary so that a user can reason about a series based solely on this summary, without requiring quantitative analysis.
Analyzing time series data and extracting trends and patterns from it, which is the first component of our system, has been studied by Lloyd et al. (2014), Streibel et al. (2013), and Hwang et al. (2015).
The content selection component of data-to-text systems resembles the extractive document summarization problem. In extractive document summarization, the goal is to select a subset of the documents or patterns to represent the whole document. More precisely, these methods select the most important sentences in documents by greedy search or by optimizing an objective function (Allahyari et al., 2017). Their objective function is usually a combination of a submodular and a non-submodular function that adjusts the redundancy and informativeness of the created summary (Dasgupta et al., 2013; Lin and Bilmes, 2011). In this work, by estimating a utility value for each trend, we provide a means of selecting the top trends in a time series. Estimating the utility of a trend also enables us to define submodular objective functions for selecting a subset of trends.
In this paper, we introduced a model for simulating human-like descriptions of time series data. Our initial work is focused on identifying the subtasks needed to learn such a model. In our evaluation, we reported the results of baselines on each subtask and showed that, although learning each subtask is straightforward, learning their combination is a complex task. In our ongoing work, we are working on learning the subtasks simultaneously.
- Al-Zaidy et al. (2016). Automatic summary generation for scientific data charts. In AAAI Workshop: Scholarly Big Data.
- Allahyari et al. (2017). Text summarization techniques: a brief survey. International Journal of Advanced Computer Science and Applications 8 (10).
- Chow and Liu (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory 14 (3), pp. 462–467.
- Dasgupta et al. (2013). Summarization through submodularity and dispersion. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Sofia, Bulgaria, pp. 1014–1022.
- Demir et al. (2012). Summarizing information graphics textually. Computational Linguistics 38, pp. 527–574.
- Drton and Maathuis (2016). Structure learning in graphical modeling.
- Gkatzia (2016). Content selection in data-to-text systems: a survey.
- Hwang et al. (2015). The Automatic Statistician: a relational perspective. arXiv:1511.08343 [cs.LG], http://arxiv.org/abs/1511.08343.
- Law et al. (2005). A comparison of graphical and textual presentations of time series data to support medical decision making in the neonatal intensive care unit. Journal of Clinical Monitoring and Computing 19, pp. 183–194.
- Lin and Bilmes (2011). A class of submodular functions for document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT ’11, Stroudsburg, PA, USA, pp. 510–520.
- Lloyd et al. (2014). Automatic construction and natural-language description of nonparametric regression models.
- Sripada et al. (2003). Summarizing neonatal time series data. In 10th Conference of the European Chapter of the Association for Computational Linguistics, Budapest, Hungary.
- Sripada et al. (2004). SumTime-mousam: configurable marine weather forecast generator. Expert Update 6.
- Streibel et al. (2013). Trend template: mining trends with a semi-formal trend model. Vol. 1088.
- Weston et al. (2015). Towards AI-complete question answering: a set of prerequisite toy tasks.