NSTM: Real-Time Query-Driven News Overview Composition at Bloomberg

06/01/2020 ∙ by Joshua Bambrick, et al. ∙ Bloomberg

Millions of news articles from hundreds of thousands of sources around the globe appear in news aggregators every day. Consuming such a volume of news presents an almost insurmountable challenge. For example, a reader searching on Bloomberg's system for news about the U.K. would find 10,000 articles on a typical day. Apple Inc., the world's most journalistically covered company, garners around 1,800 news articles a day. We realized that a new kind of summarization engine was needed, one that would condense large volumes of news into short, easy to absorb points. The system would filter out noise and duplicates to identify and summarize key news about companies, countries or markets. When given a user query, Bloomberg's solution, Key News Themes (or NSTM), leverages state-of-the-art semantic clustering techniques and novel summarization methods to produce comprehensive, yet concise, digests to dramatically simplify the news consumption process. NSTM is available to hundreds of thousands of readers around the world and serves thousands of requests daily with sub-second latency. At ACL 2020, we will present a demo of NSTM.




1 Introduction

In many domains, finding contextually-important news as fast as possible is a key goal. With millions of articles published around the globe each day, quickly finding relevant and actionable news can mean the difference between success and failure. When provided with a search query, a traditional system returns links to articles sorted by relevance. However, users typically encounter (near) duplicate or overlapping articles, making it hard to quickly identify key events and easy to miss less-reported stories. Moreover, news headlines are frequently sensational, opaque, or verbose, forcing readers to open and read individual articles. For illustration, imagine an analyst sees the price of Amazon.com stock drop and wants to know why. With a traditional system, they would search for news on the company and wade through many stories (the corresponding overview for this case can be found in Appendix C), often with duplicate information or unhelpful headlines, to slowly build up a full picture of what the key events were. By contrast, using NSTM (Key News Themes), this same analyst can search for ‘Amazon.com’, over a given time horizon, and promptly receive a concise and comprehensive overview of the news, as shown in Fig. 1. We tackle the challenges involved with consuming vast quantities of news by leveraging modern techniques to semantically cluster stories, as well as innovative summarization methods to extract succinct, informational summaries for each cluster. A handful of key stories are then selected from each cluster. We define a (story cluster, summary, key stories) triple as one theme and an ordered list of themes as an overview. NSTM works at web scale but responds to arbitrary user queries with sub-second latency. It is deployed to hundreds of thousands of users around the globe and serves thousands of requests per day.

Figure 1: A query-based UI for NSTM showing two themes. The un-cropped screenshot is in Appendix C.

2 Design Goals

We focus on the scenario where a news search query can render many matching news articles, from tens up to hundreds of thousands. The task is to create a succinct overview of the results to help our users to easily grasp the gist of them without combing through the individual articles. Since the matching articles often cover various aspects and events, NSTM must first cluster related stories to form a clear separation among them. Furthermore, the system must extract a concise (up to 50 characters, or roughly 6 tokens) summary for each cluster. It needs to be short enough to be understandable to humans with a single glance, but also rich enough to retain critical details from a minimal ‘who-does-what’ stub, so the most popular noun phrase or entity alone will not suffice. Such conciseness also helps when screen space is limited (for context-driven applications or mobile devices). From each cluster, NSTM must surface a few key stories to provide a sample of its contents. The clusters themselves should also be ranked to highlight the most important few in limited screen space. Finally, the system must be fast. It may only take up to a few seconds for the slowest queries.

Main technical challenges:

1) There is no public dataset corresponding to this overview composition problem with all the requirements set above, so we were required to either define new (sub-)tasks and collect new annotations, or select techniques by intuition, implement them, and iterate on feedback; 2) Generating summaries which are simultaneously accurate, informational, fluent, and highly concise necessitates careful and innovative choices of summarization techniques; 3) Supporting arbitrary user searches in real-time places significant performance requirements on the system whilst also setting a high bar for its robustness.

3 Related Work

A comparable system is Google News’ ‘Full Coverage’ feature (https://www.blog.google/products/news/new-google-news-ai-meets-human-intelligence/), which groups stories from different sources, akin to our clustering approach. However, it doesn’t offer summarization and its clustered view is unavailable for arbitrary search queries. SUMMA (Liepins et al., 2017) is another comparable system which integrates a variety of NLP components and provides support for numerous media and languages, to simultaneously monitor several media broadcasts. SUMMA applies the online clustering algorithm by Aggarwal and Yu (2006) and the extractive summarization algorithm by Almeida and Martins (2013). In contrast to NSTM, SUMMA focuses on scenarios with continuous multimedia and multilingual data streams and produces much longer summaries.

4 Approach

4.1 Architecture

Figure 2: The architecture of NSTM. The digits indicate the order of execution whenever a new request is made.

The functionality of NSTM can be formulated as: given a search query, generate a ranked list (overview) of the key themes, or (news cluster, summary, key stories) triples, that concisely represent the most important matching news events. Fig. 2 depicts the system’s architecture. The story ingestion service processes millions of published news stories each day, stores them in a search index, and applies online clustering to them. When a search query is submitted via a user interface (step 1 in the diagram), the overview composition service retrieves matching stories and their associated online cluster IDs from the search index (step 2). The system then further clusters the retrieved online clusters into the final clusters, each corresponding to one theme (step 3). For each such cluster, the system extracts a concise summary and a handful of key stories to reflect the cluster’s contents (step 4). This creates a set of themes, which NSTM ranks to create the final overview. Lastly, the system caches the overview for a limited time to support future reuse (step 5) before returning it to the UI (step 6).
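To make the theme and overview definitions concrete, the following is a minimal Python sketch of the data flow described above. The names `Story`, `Theme`, and `compose_overview`, and the callables passed in for summarization, key story selection, and scoring, are hypothetical placeholders introduced for illustration; they are not NSTM's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Story:
    story_id: str
    headline: str


@dataclass
class Theme:
    """One (story cluster, summary, key stories) triple."""
    stories: List[Story]
    summary: str
    key_stories: List[Story]
    importance: float = 0.0


def compose_overview(
    clusters: Sequence[Sequence[Story]],
    summarize: Callable[[Sequence[Story]], str],
    select_key_stories: Callable[[Sequence[Story]], List[Story]],
    score: Callable[[Theme], float],
    max_themes: int = 5,
) -> List[Theme]:
    """Build one theme per final cluster, rank them, and keep the top few."""
    themes = []
    for cluster in clusters:
        theme = Theme(
            stories=list(cluster),
            summary=summarize(cluster),
            key_stories=select_key_stories(cluster),
        )
        theme.importance = score(theme)
        themes.append(theme)
    # An overview is an ordered list of themes, most important first.
    themes.sort(key=lambda t: t.importance, reverse=True)
    return themes[:max_themes]
```

Passing the individual stages in as callables keeps the sketch independent of any particular clustering, summarization, or ranking implementation.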

4.2 News Search

The first step in the NSTM pipeline is to retrieve relevant news stories (step 1 in Fig. 2), for which we leverage a customized in-house news search engine based on Apache Solr (http://lucene.apache.org/solr/). This supports searches based on keywords, metadata (such as news source and time of ingestion), and tags generated during ingestion (such as topics, regions, securities, and people). For example, TOPIC:ECOM AND NOT COMPANY:AMZN will retrieve all news about ‘E-commerce’ but exclude Amazon.com (this is Bloomberg’s internal news search query syntax, which maps closely to the final query submitted to Solr). NSTM uses Solr’s facet functionality to surface the largest online clusters (detailed in Sec. 4.3.2) in the search results, before returning stories from each. This tiered approach offers better coverage and scalability than direct story retrieval.

4.3 Clustering

4.3.1 News Embedding and Similarity

At the core of any clustering system is a similarity metric. In NSTM, we define the similarity between two articles $d_1$ and $d_2$ as the cosine similarity between their embeddings as computed by NVDM (Miao et al., 2016), i.e., $\mathrm{sim}(d_1, d_2) = \cos(\mathbf{z}_1, \mathbf{z}_2)$, where $\mathbf{z}_i$ denotes the NVDM embedding of $d_i$. Our choice is motivated by two observations: 1) The generative model of NVDM is based on bag-of-words (BoW) and $p(w \mid \mathbf{z}) = \sigma(\mathbf{W}^{\top}\mathbf{z})_w$, where $\sigma$ is the softmax function, $\mathbf{W} \in \mathbb{R}^{n \times m}$ is the word embedding matrix in the decoder, $n$ is the latent dimension, and $m$ is the size of the vocabulary. This resembles the latent topic structure popularized by LDA (Blei et al., 2003) which has proven effective in capturing textual semantics. Additionally, the use of cosine similarities is naturally motivated by the fact that the generative model is directly defined by the dot-product between the story embedding ($\mathbf{z}$) and a shared vocabulary embedding ($\mathbf{W}$). 2) NVDM’s Variational Autoencoder (VAE) (Kingma and Welling, 2014; Rezende et al., 2014) framework makes the inference procedure much simpler than LDA and it also supports decoder customizations. For example, it allows us to easily integrate the idea of introducing a learnable common background word distribution into the generative model (Arora et al., 2017). We trained the model on an internal corpus of M news articles, using a vocabulary of size about k and a latent dimension of .
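As a small illustration of the similarity metric, the sketch below computes cosine similarity between two embedding vectors in plain Python. The vectors are toy values and the function name is illustrative; real NVDM embeddings would come from the trained encoder, which is not shown here.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)
```

Because cosine similarity ignores vector length, two stories with the same topical direction in embedding space score highly regardless of how long either article is.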

4.3.2 Clustering Stages

We divide clustering into two stages in the pipeline: 1) online incremental clustering at story ingestion time, and 2) hierarchical agglomerative clustering (HAC) at query time (step 3 in Fig. 2). The former is used to produce query-agnostic online clusters at a relatively low cost to handle the daily influx of millions of news stories. These clusters reduce the computational cost at query time. However, due to its online nature, over-fragmentation, among other quality issues, occurs in the resulting clusters. This necessitates further refinement at query time when an offline HAC step is performed on top of the retrieved online clusters. A similar, but more complicated, design was adopted in Vadrevu et al. (2011) for clustering real-time news search results. At both stages, we compute the cluster embedding as the mean of all the story embeddings therein, and evaluate similarities between clusters (individual stories are taken as singleton clusters) using the metric defined in Sec. 4.3.1. For online clustering, we apply an in-house implementation which uses a distributed pool of workers to reduce latency and increase throughput. It merges each incoming story with the closest cluster if the similarity is within a parameterized threshold and otherwise creates a new singleton cluster. For HAC, we apply fastcluster (Müllner, 2013; https://www.jstatsoft.org/article/view/v053i09) to construct the dendrogram. We use complete linkage to encourage more congruent clusters and then form flat clusters by cutting the dendrogram at the same (height) threshold. To further reduce fragmentation where similar clusters are left un-clustered, we apply HAC twice recursively. To find a reasonable similarity threshold, we manually annotated just over 1k pairs of news articles. Each annotator indicated whether they would expect to see the articles grouped together or not in an overview.
We then selected the threshold which achieved the highest F1 score on this binary classification task, which was .
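The online stage can be sketched in a few lines of single-process Python. This is only an illustration of the merge-or-create rule and the mean cluster embedding described above: the class name `OnlineClusterer` and the threshold value used in any example are assumptions, and the production implementation is distributed and in-house. The query-time HAC stage is not shown.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


class OnlineClusterer:
    """Assigns each incoming story embedding to the closest cluster, or opens a new one."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self._sums: List[List[float]] = []  # running sum of embeddings per cluster
        self._counts: List[int] = []        # number of stories per cluster

    def add(self, embedding: Sequence[float]) -> int:
        """Return the ID of the cluster that the story was assigned to."""
        best_id, best_sim = -1, -math.inf
        for cid, (total, n) in enumerate(zip(self._sums, self._counts)):
            centroid = [x / n for x in total]  # cluster embedding = mean of its stories
            sim = cosine(centroid, embedding)
            if sim > best_sim:
                best_id, best_sim = cid, sim
        if best_id >= 0 and best_sim >= self.threshold:
            self._sums[best_id] = [a + b for a, b in zip(self._sums[best_id], embedding)]
            self._counts[best_id] += 1
            return best_id
        # No sufficiently similar cluster: create a new singleton cluster.
        self._sums.append(list(embedding))
        self._counts.append(1)
        return len(self._sums) - 1
```

Since each story is compared only against existing cluster means and never revisited, the outcome depends on arrival order, which is one way the over-fragmentation mentioned above can arise.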

4.4 Summary Extraction

Clustering search results (Vadrevu et al., 2011) is a meaningful step towards creating a useful overview. With NSTM, we push this one step further by additionally generating a concise, yet still human-readable, summary for each cluster (step 4 in Fig. 2).

Figure 3: Illustrations of the symbolic OpenIE (left) and neural sentence compression (right) candidate extraction pipelines. We apply both, to render a diverse pool of candidate summaries, and use a ranker to select the best.

Due to the unique style of the summary explained in Sec. 2, the scarcity of training data makes it hard to train an end-to-end seq2seq (Sutskever et al., 2014) model, as is typical for abstractive summarization. Also, this technique would only offer limited control over the output. Hence, we opt for an extractive method, leveraging OpenIE (Banko et al., 2007) and a BERT-based (Devlin et al., 2019) sentence compressor (both illustrated in Fig. 3) to surface a pool of sub-sentence-level candidate summaries from the headline and the body, which are then scored by a ranker.

4.4.1 OpenIE-based Tuple Extraction

Open Domain Information Extraction (OpenIE) presents an unsupervised approach to extract summary candidates from an input sentence. First, we construct a dependency parse tree of the sentence, using a model based on Kiperwasser and Goldberg (2016) (step 1 in Fig. 3). From this tree, we extract predicate-argument $n$-tuples using an adapted reimplementation of PredPatt (White et al., 2016) (step 2). The tuples represent nested proto-semantic parses of the sentence, and typically correspond to well-formed phrases. This method applies rules cast over Universal Dependencies (Nivre et al., 2016) so syntactic patterns are unlexicalized and language-neutral. We then prune these tuples (step 3), applying rules which reduce the arguments to their syntactic heads, while heuristics keep named entities and multi-word expressions intact. We recursively intersect the resulting tuples to create more tuples. Finally, to render summary candidates, we create a titlecased surface form of each tuple (step 4).


4.4.2 BERT-based Sentence Compression

In addition to the rule-based OpenIE system, we apply a Transfer Learning-based solution, using a novel in-house dataset specific to our sub-task. In particular, we model candidate summary extraction as a ‘sentence compression’ task (Filippova et al., 2015), where each story is split into sentences and tokens are classified as keep or delete to make each sentence shorter, while retaining the key message. We oversaw the manual annotation of a dataset which maps sentences to compressed equivalents that correspond to summaries. When presented with a news story, annotators selected one sentence and deleted words to create a high quality summary. This rendered k annotations which we randomly partitioned into train () and test () sets. The task is formulated as sequence tagging, whereby each sub-token (step 1 in Fig. 3), defined using the BERT vocabulary, is classified as keep or delete (step 2). We implement this using a feedforward layer on top of a Bloomberg-internal pre-trained neural network, akin to the uncased English BERT-Base model, applying an adapted implementation. To create a compression, we stitch sub-tokens labelled keep together (step 3). Lastly, we use postprocessing rules to improve formatting (step 4), such as titlecasing and fixing partial-entity deletion (where only some sub-tokens of a token/entity are deleted).
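The stitching and postprocessing steps can be illustrated with a short sketch. It assumes WordPiece-style sub-tokens in which continuation pieces carry a `##` prefix, and it repairs partial deletion by keeping a whole word whenever any of its sub-tokens is labelled keep. Both the repair policy and the simple per-word capitalization are assumptions made for illustration; the text does not specify the actual rules.

```python
from typing import List, Sequence, Tuple


def compress(
    sub_tokens: Sequence[str],
    labels: Sequence[str],
    repair_partial: bool = True,
) -> str:
    """Stitch sub-tokens labelled 'keep' into a titlecased compression."""
    # Group sub-tokens back into words: a '##' prefix marks a continuation piece.
    words: List[Tuple[List[str], List[str]]] = []
    for token, label in zip(sub_tokens, labels):
        if token.startswith("##") and words:
            words[-1][0].append(token[2:])
            words[-1][1].append(label)
        else:
            words.append(([token], [label]))

    kept_words = []
    for pieces, piece_labels in words:
        kept_pieces = [p for p, l in zip(pieces, piece_labels) if l == "keep"]
        if not kept_pieces:
            continue  # every sub-token of this word was deleted
        # Assumed repair rule: a partially deleted word is restored in full.
        kept_words.append("".join(pieces) if repair_partial else "".join(kept_pieces))

    # Simple titlecasing: upper-case the first character of each kept word.
    return " ".join(w[:1].upper() + w[1:] for w in kept_words)
```

With the repair rule disabled, a word whose trailing pieces were deleted would surface as a fragment, which is the kind of partial deletion the postprocessing is meant to fix.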

4.4.3 Summary Candidate Ranking

Tuple generation and sentence compression provide a pool of summary candidates for individual news stories. These are further aggregated across stories within a cluster to form the final pool. To identify the best summary for the cluster, we trained a sequence-pair model to score each candidate $c$ given an article $a$. Such article-level scores $s(c, a)$ for a candidate are computed against all the stories in a cluster and then aggregated (e.g., averaged) to produce the final cluster-level scores, which we use for ranking. For this purpose, we collected an in-house annotated dataset. We sampled a few thousand news articles and generated k summary candidates from them using OpenIE (at this time, we hadn’t considered sentence compression). Then we asked internal annotators to label each as Great, Acceptable or Terrible were it to be used as a summary for the article, considering both readability and informativeness. From this dataset, we constructed about k pairwise samples $(c^{+}, c^{-}, a)$ where $c^{+}$ is labelled more favorably than $c^{-}$ for a given common article $a$, and the model was then trained to match such preferences using pairwise margin loss, i.e., $\max(0, \gamma - s(c^{+}, a) + s(c^{-}, a))$ for a margin $\gamma > 0$. We considered a few models, including a parameter-free baseline which scores candidate-article pairs as the dot-product of their NVDM (Sec. 4.3.1) embeddings, i.e., $s(c, a) = \mathbf{z}_c^{\top}\mathbf{z}_a$. We also considered this model’s bilinear extension $s(c, a) = \mathbf{z}_c^{\top}\mathbf{M}\mathbf{z}_a$, where $\mathbf{M}$ is the learnable weight matrix. Lastly, we tried neural network models, such as DecAtt (Parikh et al., 2016). We evaluated these models on a held-out test set with metrics such as pairwise ranking accuracy and NDCG. We opted to productionize the baseline model, since it was the simplest and performed on par with the others. (E.g., with NDCG@5, the (untrained) NVDM dot-product yields , while the bilinear model and DecAtt yield .) Because NVDM uses a bag-of-words model, this ranker ignores syntax entirely. We believe that its empirical success owes to both the well-formedness of the majority of the candidates and the averaging effect that amplifies the ‘signal-noise ratio’ when the scores are averaged over the cluster. Empirically, this approach tends to surface ‘informational’ summaries, in contrast to headlines which are often ‘sensational’. We posit that this is because high-ranked summaries must also be representative of story bodies, not just headlines.
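The parameter-free baseline and the training objective used for the learned alternatives can be sketched as follows. The embeddings are toy vectors standing in for NVDM embeddings, the function names are illustrative, and the default margin of 1.0 is an assumption rather than a value taken from the text.

```python
from typing import Dict, List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def cluster_score(candidate_vec: Sequence[float], story_vecs: Sequence[Sequence[float]]) -> float:
    """Average the article-level dot-product scores over every story in the cluster."""
    return sum(dot(candidate_vec, s) for s in story_vecs) / len(story_vecs)


def rank_candidates(
    candidate_vecs: Dict[str, Sequence[float]],
    story_vecs: Sequence[Sequence[float]],
) -> List[str]:
    """Return candidate summaries ordered from best to worst cluster-level score."""
    return sorted(
        candidate_vecs,
        key=lambda c: cluster_score(candidate_vecs[c], story_vecs),
        reverse=True,
    )


def pairwise_margin_loss(score_better: float, score_worse: float, margin: float = 1.0) -> float:
    """Hinge-style loss that is zero once the preferred candidate wins by at least the margin."""
    return max(0.0, margin - score_better + score_worse)
```

Averaging over the cluster means that a candidate must agree with many story bodies to rank highly, which matches the observation above about informational summaries.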

4.4.4 Combining Summary Candidates

OpenIE and sentence compression offer distinct ways to extract candidates, and we experimented with each as the sole source of summary candidates in our pipeline. On the basis of ROUGE scores (Lin and Hovy, 2003; Lin, 2004) (details in Appendix B), the latter provides superior results. However, in a production system which informs business decisions, we must consider factors which aren’t readily captured by metrics which compare generated and ‘gold’ outputs. For example, changing a single word can reverse the meaning of a summary, with only a small change in such scores. Hence, we consider a range of pros and cons. The sentence compression method is supervised and is trained to produce summaries which can take advantage of news-specific grammatical styles. However, the OpenIE system is much faster and offers greater interpretability and controllability. Since the neural and symbolic systems provide different advantages, we apply both. This renders a diverse pool of candidate summaries from which the ranker’s task is to select the best. At the pooling stage we also impose a length constraint of 50 characters and exclude any longer candidates.
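The pooling stage described above can be sketched as a simple filter. The 50-character limit comes from the text; whitespace normalization and case-insensitive de-duplication are assumptions added here to keep the example tidy, and the function name is illustrative.

```python
from typing import Iterable, List

MAX_SUMMARY_CHARS = 50  # length constraint stated in the text


def pool_candidates(
    openie_candidates: Iterable[str],
    compression_candidates: Iterable[str],
    max_chars: int = MAX_SUMMARY_CHARS,
) -> List[str]:
    """Merge candidates from both extractors, dropping over-long and repeated ones."""
    pool: List[str] = []
    seen = set()
    for candidate in list(openie_candidates) + list(compression_candidates):
        text = " ".join(candidate.split())  # collapse runs of whitespace
        key = text.lower()
        if not text or len(text) > max_chars or key in seen:
            continue
        seen.add(key)
        pool.append(text)
    return pool
```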

4.5 Key Story Selection

As a sample from the full story cluster, NSTM selects an ordered list of key stories which are deemed to be representative. We select these using a heuristic based on intuition and client feedback. Our approach is to re-cluster all stories in the cluster using HAC (see Sec. 4.3.2), to create a parameterized number of sub-clusters. For each sub-cluster, we select the story that has maximum average similarity (as per Sec. 4.3.1) to the other sub-cluster stories. This strategy is intended to select stories which represent each cluster’s diversity. We sort the key stories by sub-cluster size and time of ingestion, in that order of precedence.
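Given sub-clusters that have already been formed, the selection heuristic above can be sketched as follows. The `Story` fields and function names are hypothetical, and the sort directions (larger sub-clusters first, then more recently ingested stories first) are assumptions, since the text gives only the order of precedence.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Story:
    story_id: str
    embedding: Sequence[float]
    ingested_at: float  # e.g. a Unix timestamp


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def representative(sub_cluster: Sequence[Story]) -> Story:
    """Pick the story with maximum average similarity to the other stories."""
    if len(sub_cluster) == 1:
        return sub_cluster[0]

    def average_similarity(story: Story) -> float:
        others = [o for o in sub_cluster if o is not story]
        return sum(cosine(story.embedding, o.embedding) for o in others) / len(others)

    return max(sub_cluster, key=average_similarity)


def select_key_stories(sub_clusters: Sequence[Sequence[Story]]) -> List[Story]:
    """One representative per sub-cluster, ordered by sub-cluster size, then ingestion time."""
    picks = [(len(sc), representative(sc)) for sc in sub_clusters if sc]
    picks.sort(key=lambda p: (-p[0], -p[1].ingested_at))
    return [story for _, story in picks]
```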

4.6 Theme Ranking

We have described how (story cluster, summary, key stories) triples, or themes, are created. However, some themes are considered to be more important than others since they are more useful to readers. It is tricky to define this concept concretely but we apply proxy metrics in order to estimate an importance score for each theme. We rank themes by this score and, in order to save screen space, return only the top few (‘key’) themes as an overview. The main factor considered in the importance score is the size of the story cluster – the larger the cluster, the larger the score. This heuristic corresponds to the observation that more important themes tend to be reported on more frequently. Additionally, we consider the entropy of the news sources in the cluster, which corresponds to the observation that more important themes are reported on by a larger number of publishers and reduces the impact of a source publishing duplicate stories.
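The two signals named above, cluster size and source entropy, can be combined in many ways; the text does not give the actual formula. The sketch below uses one hypothetical combination (log of cluster size plus a weighted Shannon entropy of the source distribution) purely to illustrate how both factors could enter a single score.

```python
import math
from collections import Counter
from typing import Sequence


def source_entropy(sources: Sequence[str]) -> float:
    """Shannon entropy (in nats) of the distribution of news sources in a cluster."""
    total = len(sources)
    if total == 0:
        return 0.0
    counts = Counter(sources)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def importance(sources: Sequence[str], entropy_weight: float = 1.0) -> float:
    """Hypothetical importance score: grows with cluster size and with source diversity.

    `sources` holds one entry per story in the cluster, naming its publisher.
    """
    return math.log1p(len(sources)) + entropy_weight * source_entropy(sources)
```

Under this sketch, four stories from four publishers outrank four stories from a single publisher, which reflects the stated aim of reducing the impact of one source publishing duplicates.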

4.7 Caching

Since many user requests are the same or use similar data, caching is useful to minimize response times. When NSTM receives a request, it checks whether there is a corresponding overview in the cache, and immediately returns it if so. of requests hit the cache and of requests are handled within (computed for all requests over a 90-day period). In the event of a cache miss, NSTM responds in a median time of (computed for the top 50 searches over a 7-day period). We apply two mechanisms to ensure cache freshness. Firstly, we preemptively invoke NSTM using requests that are likely to be queried by users (e.g., most read topics) and re-compose them from scratch at fixed intervals (e.g., every 30 min). Once computed, they are cached. The second mechanism is user-driven: every time a user requests an overview which is not cached, it will be created and added to the cache. The system will subsequently preemptively invoke NSTM using this request for a fixed period of time (e.g., 24 hours).
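The two freshness mechanisms can be illustrated with a small in-memory sketch. The class `OverviewCache`, its method names, and the explicit expiry time are hypothetical; the 30-minute and 24-hour defaults mirror the examples in the text, and a real deployment would call `refresh` from a scheduler rather than by hand.

```python
import math
import time
from typing import Any, Callable, Dict, Tuple


class OverviewCache:
    def __init__(
        self,
        compose: Callable[[str], Any],
        ttl_seconds: float = 30 * 60,
        keep_warm_seconds: float = 24 * 60 * 60,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._compose = compose
        self._ttl = ttl_seconds
        self._keep_warm = keep_warm_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # query -> (computed_at, overview)
        self._warm_until: Dict[str, float] = {}           # query -> refresh deadline

    def pin(self, query: str) -> None:
        """Mechanism 1: always refresh requests that users are likely to make."""
        self._warm_until[query] = math.inf

    def get(self, query: str) -> Any:
        now = self._clock()
        entry = self._entries.get(query)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # cache hit
        # Mechanism 2: a miss composes the overview and keeps it warm for a while.
        overview = self._compose(query)
        self._entries[query] = (now, overview)
        self._warm_until[query] = max(self._warm_until.get(query, 0.0), now + self._keep_warm)
        return overview

    def refresh(self) -> None:
        """Re-compose every warm query from scratch; drop those past their deadline."""
        now = self._clock()
        for query, deadline in list(self._warm_until.items()):
            if deadline < now:
                del self._warm_until[query]
                self._entries.pop(query, None)
            else:
                self._entries[query] = (now, self._compose(query))
```

Injecting the clock keeps the sketch testable without waiting for real time to pass.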

5 Demonstration

NSTM was deployed to our clients in 2019. Using the UI depicted in Fig. 1, users can find overviews for customized queries to help support their work. From this screen, the user can enter a search query using any combination of Boolean logic with tag- or keyword-based terms. They may also alter the period that the overview is calculated over (this UI offers 1 hour, 8 hour, 1 day, and 2 day options). This interface also allows users to provide feedback via the ‘thumb’ icons or plain-text comments. Of several hundred per-overview feedback submissions, over three quarters have been positive.

| Rank | Summary | Size |
|---|---|---|
| 1 | Facebook to Settle Recognition Privacy Lawsuit | 90 |
| 2 | Facebook Warns Revenue Growth Slowing | 79 |
| 3 | Facebook Stock Drops 7% Despite Earnings Beat | 70 |
| 4 | Facebook to Remove Coronavirus Misinformation | 49 |
| 5 | Mark Zuckerberg to Launch WhatsApp Payments | 19 |

Table 1: Ranked theme summaries and cluster sizes for ‘Facebook’ (1,176 matching stories) from 31 Jan. 2020.

| Rank | Summary | Size |
|---|---|---|
| 1 | Britain to Leave the EU | 459 |
| 2 | Bank of England Would Keep Interest Rate Unchanged | 141 |
| 3 | Sturgeon Demands Scottish Independence Vote | 71 |
| 4 | Pompeo in UK for Trade Talks | 45 |
| 5 | Boris Johnson Hails ‘Beginning’ on Brexit Day | 63 |

Table 2: Ranked theme summaries and cluster sizes for ‘U.K.’ (13,858 matching stories) from 31 Jan. 2020.

Tables 1 and 2 show example theme summaries generated for the queries ‘Facebook’ and ‘U.K.’. Note that the summaries are quite different from what has previously been studied by the NLP community (in terms of brevity and grammatical style) and that they accurately represent distinct events. In addition to user-driven settings, NSTM can be used to supplement context-driven applications. One example, demonstrated in Appendix D, uses themes provided by NSTM to help explain why companies or topics are ‘trending’.

6 Conclusion

We presented NSTM, a novel and production-ready system that composes concise and human-readable news overviews given arbitrary user search queries. NSTM is the first of its kind; it is query-driven, it offers unique news overviews which leverage clustering and succinct summarization, and it has been released to hundreds of thousands of users. We also demonstrated effective adoption of modern NLP techniques and advances in the design and implementation of the system, which we believe will be of interest to the community. There are many open questions which we intend to research, such as whether autoregressivity in neural sentence compression can be exploited and how to compose themes over longer time periods.


  • C. C. Aggarwal and P. S. Yu (2006) A framework for clustering massive text and categorical data streams. In Proceedings of the 2006 SIAM International Conference on Data Mining, pp. 479–483.
  • M. Almeida and A. Martins (2013) Fast and robust compressive summarization with dual decomposition and multi-task learning. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Sofia, Bulgaria, pp. 196–206.
  • S. Arora, Y. Liang, and T. Ma (2017) A simple but tough-to-beat baseline for sentence embeddings. In Proceedings of the 5th International Conference on Learning Representations, ICLR’17.
  • M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni (2007) Open information extraction from the web. In Proceedings of the 20th International Joint Conference on Artifical Intelligence, IJCAI’07, San Francisco, CA, USA, pp. 2670–2676.
  • D. M. Blei, A. Y. Ng, and M. I. Jordan (2003) Latent dirichlet allocation. J. Mach. Learn. Res. 3, pp. 993–1022. ISSN 1532-4435.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
  • K. Filippova, E. Alfonseca, C. A. Colmenares, L. Kaiser, and O. Vinyals (2015) Sentence compression by deletion with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 360–368.
  • D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings.
  • E. Kiperwasser and Y. Goldberg (2016) Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics 4, pp. 313–327.
  • R. Liepins, U. Germann, G. Barzdins, A. Birch, S. Renals, S. Weber, P. van der Kreeft, H. Bourlard, J. Prieto, O. Klejch, P. Bell, A. Lazaridis, A. Mendes, S. Riedel, M. S. C. Almeida, P. Balage, S. B. Cohen, T. Dwojak, P. N. Garner, A. Giefer, M. Junczys-Dowmunt, H. Imran, D. Nogueira, A. Ali, S. Miranda, A. Popescu-Belis, L. Miculicich Werlen, N. Papasarantopoulos, A. Obamuyide, C. Jones, F. Dalvi, A. Vlachos, Y. Wang, S. Tong, R. Sennrich, N. Pappas, S. Narayan, M. Damonte, N. Durrani, S. Khurana, A. Abdelali, H. Sajjad, S. Vogel, D. Sheppey, C. Hernon, and J. Mitchell (2017) The SUMMA platform prototype. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, pp. 116–119.
  • C. Lin and E. Hovy (2003) Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 150–157.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain, pp. 74–81.
  • Y. Miao, L. Yu, and P. Blunsom (2016) Neural variational inference for text processing. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pp. 1727–1736.
  • D. Müllner (2013) Fastcluster: fast hierarchical, agglomerative clustering routines for R and Python. Journal of Statistical Software, Articles 53 (9), pp. 1–18. ISSN 1548-7660.
  • J. Nivre, M. De Marneffe, F. Ginter, Y. Goldberg, J. Hajic, C. D. Manning, R. McDonald, S. Petrov, S. Pyysalo, N. Silveira, et al. (2016) Universal dependencies v1: a multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pp. 1659–1666.
  • A. Parikh, O. Täckström, D. Das, and J. Uszkoreit (2016) A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2249–2255.
  • D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pp. 1278–1286.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 3104–3112.
  • S. Vadrevu, C. H. Teo, S. Rajan, K. Punera, B. Dom, A. J. Smola, Y. Chang, and Z. Zheng (2011) Scalable clustering of news search results. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM’11, New York, NY, USA, pp. 675–684. ISBN 978-1-4503-0493-1.
  • A. S. White, D. Reisinger, K. Sakaguchi, T. Vieira, S. Zhang, R. Rudinger, K. Rawlins, and B. Van Durme (2016) Universal Decompositional Semantics on Universal Dependencies. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 1713–1723.

Appendix A Acknowledgements

This has been a multi-year project, involving contributions from many people at different stages. In particular, we thank Miles Osborne, Marco Ponza, Amanda Stent, Mohamed Yahya, Christoph Teichmann, Prabhanjan Kambadur, Umut Topkara, Ted Merz, Sam Brody, and Adrian Benton for reviewing and commenting on the manuscript. We further thank Adela Quinones, Shaun Waters, Mark Dimont, Ted Merz and other colleagues from the News Product group for helping to shape the vision of the system. We also thank José Abarca and his team for developing the user interface. We thank Hady Elsahar for helping to improve summary ranking during his internship. Finally, we thank all colleagues (especially those in the Global Data department) who helped to produce high quality in-house annotations and all others who contributed valuable thoughts and time to this work.

Appendix B End-To-End Evaluation

We evaluate the end-to-end NSTM system when using the OpenIE (Sec. 4.4.1) and the BERT-based sentence compression (Sec. 4.4.2) algorithms as the sole source of candidate summaries. We also conducted one experiment where both were used to create a shared pool of candidates (as per Sec. 4.4.4). We test the system end-to-end using the manually-annotated Single Document Summarization (SDS) test set described in Sec. 4.4.2. To implement SDS, our experimental setup assumes that only one story was returned by a search request (as per Sec. 4.2). We evaluate the output from each system with ROUGE (Lin and Hovy, 2003; Lin, 2004; https://github.com/google/seq2seq/blob/master/seq2seq/metrics/rouge.py). The results are presented in Table 3.

| Metric | OpenIE | BSC | Both |
|---|---|---|---|
| ROUGE-1 F1 | 0.831 | 0.863 | 0.851 |
| ROUGE-2 F1 | 0.609 | 0.701 | 0.667 |
| ROUGE-3 F1 | 0.530 | 0.640 | 0.599 |
| ROUGE-4 F1 | 0.492 | 0.603 | 0.562 |
| ROUGE-L F1 | 0.621 | 0.706 | 0.670 |

Table 3: ROUGE scores for the Single-Document Summarization task in the end-to-end system, when using OpenIE, BERT-based sentence compression (BSC) and both to construct the pool of candidate summaries.
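For readers unfamiliar with the metric, the sketch below shows how a ROUGE-N F1 score can be computed from clipped n-gram overlap. It is a simplified illustration with lower-cased whitespace tokenization, not the implementation used to produce Table 3, and it does not cover ROUGE-L.

```python
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Multiset of the n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate: str, reference: str, n: int = 1) -> float:
    """F1 of clipped n-gram overlap between a candidate and a reference summary."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # '&' keeps the minimum count per n-gram
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Because the score depends only on n-gram overlap, replacing a single word such as "rises" with "falls" changes it very little, which is the limitation noted in Sec. 4.4.4.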

Appendix C Screenshots of A Query-Driven User Interface

Figure 4: Screenshot (taken on 29 January 2020) of a query-driven interface for NSTM showing the overview for the company ‘Amazon.com’.
Figure 5: Screenshot (taken on 29 January 2020) of a query-driven interface for NSTM showing the overview for the topic ‘Electric Vehicles’.
Figure 6: Screenshot (taken on 29 January 2020) of a query-driven interface for NSTM showing the overview for the region ‘Canada’.
Figure 7: Screenshot (taken on 29 January 2020) of a query-driven interface for NSTM showing the overview for a complex query, including a keyword.

Appendix D Screenshots of A Context-Driven User Interface

Figure 8: Screenshot (taken on 29 January 2020) of a context-driven application of NSTM. In the ‘Security’ column are the companies that have seen the largest increase in news readership over the last day. Each entry in the ‘News Summary’ column is the summary of the top theme provided by NSTM for the adjacent company.
Figure 9: Screenshot (taken on 29 January 2020) of a context-driven application of NSTM. In the ‘News Topic’ column are the topics that have seen the largest volume of news readership over the past 8 hours. Each entry in the ‘News Summary’ column is the summary of the top theme provided by NSTM for the adjacent topic.