Traditional study of extractive sentence compression seeks to create short, readable, single-sentence summaries which retain the most “important” information from source sentences. But search user interfaces often require compressions which must include a user’s query terms and must not exceed some maximum length permitted by screen space. Figure 1 shows an example.
This study examines the English-language compression problem with such length and lexical requirements. In our constrained compression setting, a source sentence S is shortened to a compression C which (1) must include all tokens in a set of query terms Q and (2) must be no longer than a maximum budgeted character length b. Formally, constrained compression maps (S, Q, b) → C, such that C respects Q and b. We describe this task as query-focused compression because Q places a hard requirement on words from S which must be included in C.
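Concretely, the two hard requirements amount to a small validity check. The sketch below is illustrative only (the function and variable names are ours, not the paper's), and measures length over whitespace-joined tokens:

```python
def respects_constraints(compression_tokens, query_terms, budget):
    """Check that a candidate compression contains every query term
    and fits within the character budget (including separating spaces)."""
    text = " ".join(compression_tokens)
    has_query = all(q in compression_tokens for q in query_terms)
    return has_query and len(text) <= budget

# A compression must keep "NHS" and fit in 40 characters.
assert respects_constraints(["NHS", "funding", "cut"], {"NHS"}, 40)
assert not respects_constraints(["funding", "cut"], {"NHS"}, 40)
```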
Existing techniques are poorly suited to constrained compression. While methods based on integer linear programming (ILP) can trivially accommodate such length and lexical restrictions Clarke and Lapata (2008); Filippova and Altun (2013); Wang et al. (2017), these approaches rely on slow third-party solvers to optimize an NP-hard objective, causing interface lag. An alternative LSTM tagging approach Filippova et al. (2015) does not allow practitioners to specify length or lexical constraints, and requires an expensive graphics processing unit (GPU) to achieve low runtime latency (a barrier in fields like social science and journalism). These deficits prevent application of existing compression techniques in search user interfaces Marchionini (2006); Hearst (2009), where length, lexical and latency requirements are paramount. We thus present a new stateful method for query-focused compression.
Our approach is theoretically and empirically faster than ILP-based techniques, and more accurately reconstructs gold standard compressions.
2 Related work
Extractive compression shortens a sentence by removing tokens, typically for summarization Knight and Marcu (2000); Clarke and Lapata (2008); Filippova et al. (2015); Wang et al. (2017). (Some methods instead compress via generation rather than deletion Rush et al. (2015); Mallinson et al. (2018).) Our extractive method is intended for practical, interpretable and trustworthy search systems Chuang et al. (2012). Users might not trust abstractive summaries Zhang and Cranshaw (2018), particularly in cases with semantic error. To our knowledge, this work is the first to consider extractive compression under hard length and lexical constraints.
We compare our vertex addition approach to ILP-based compression methods Clarke and Lapata (2008); Filippova and Altun (2013); Wang et al. (2017), which shorten sentences by optimizing an integer linear programming objective. ILP methods can easily accommodate lexical and budget restrictions via additional optimization constraints, but require worst-case exponential computation. (ILPs are exponential in the number of tokens when selecting tokens Clarke and Lapata (2008), and exponential in the number of edges when selecting edges Filippova et al. (2015).)
Finally, compression methods based on LSTM taggers Filippova et al. (2015) cannot currently enforce lexical or length requirements. Future work might address this limitation by applying and modifying constrained generation techniques Kikuchi et al. (2016); Post and Vilar (2018); Gehrmann et al. (2018).
3 Compression via vertex addition
We present a new transition-based method for shortening sentences under lexical and length constraints, inspired by similar approaches in transition-based parsing Nivre (2003). We describe our technique as vertex addition because it constructs a shortening by growing a (possibly disconnected) subgraph in the dependency parse of a sentence, one vertex at a time. This approach can construct constrained compressions with a linear algorithm, leading to 4x lower latency than ILP techniques (§4). To our knowledge, our method is also the first to construct compressions by adding vertexes rather than pruning subtrees in a parse Knight and Marcu (2000); Berg-Kirkpatrick et al. (2011); Almeida and Martins (2013); Filippova and Alfonseca (2015). We assume a boolean relevance model: the compression C must contain the query Q. We leave more sophisticated relevance models for future work.
3.1 Formal description
Vertex addition builds a compression by maintaining a state (C, P, t), where C is the set of candidates added so far, P is a priority queue of vertexes, and t indexes a timestep during compression. Figure 2 shows a step-by-step example.
During initialization, we set C = Q and fill P with the remaining vertexes of S. Then, at each timestep, we pop some candidate v from the head of P and evaluate v for inclusion in C. (Neighbors of C in the parse get higher priority than non-neighbors; we break ties in left-to-right order, by sentence position.) If we accept v, then C ← C ∪ {v}. We discuss acceptance decisions in detail in §4.3. We continue adding vertexes to C until either P is empty or C is b characters long. (We linearize C by left-to-right vertex position in S, common for compression in English Filippova and Altun (2013).) The appendix includes a formal algorithm.
Vertex addition is linear in the token length of S because we pop and evaluate some vertex from P at each timestep following initialization. Additionally, because (1) we never accept v if the resulting length of C would exceed b, and (2) we set C = Q at initialization, our method respects Q and b.
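The transition loop described above can be sketched as follows. This is an illustrative, dependency-free rendering with hypothetical names: for readability it re-sorts the candidate list at each step (the actual method maintains a priority queue to stay linear), and the `accept` callable stands in for the learned models of §4.3.

```python
def vertex_addition(sentence, query, budget, accept, neighbors):
    """Sketch of constrained compression by vertex addition.

    sentence: list of (position, token) vertexes, left to right.
    query: set of positions that must be included (C starts as Q).
    budget: maximum character length b of the linearized compression.
    accept: callable (v, C) -> bool modeling inclusion decisions.
    neighbors: callable returning parse neighbors of a position.
    """
    C = set(query)

    def length(vertex_set):
        return len(" ".join(tok for pos, tok in sentence if pos in vertex_set))

    def priority(pos):
        # Neighbors of C get higher priority; ties broken left to right.
        is_neighbor = any(n in C for n in neighbors(pos))
        return (0 if is_neighbor else 1, pos)

    P = [pos for pos, tok in sentence if pos not in C]
    while P:
        P.sort(key=priority)          # re-rank as C grows (sketch only)
        v = P.pop(0)                  # pop highest-priority candidate
        if length(C | {v}) <= budget and accept(v, C):
            C.add(v)
        if length(C) >= budget:
            break
    # Linearize by left-to-right vertex position in the sentence.
    return " ".join(tok for pos, tok in sentence if pos in C)
```

For example, with an always-accept model, a four-token sentence and an 11-character budget, the loop adds parse neighbors of the query until the budget is exhausted.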
We observe the latency, readability and token-level F1 score of vertex addition, using a standard dataset (§4.1). We compare our method to an ILP baseline (§2) because ILP methods are the only known technique for constrained compression. All methods have similar compression ratios (shown in appendix), a well-known evaluation requirement Napoles et al. (2011). We evaluate the significance of differences between vertex addition and the ILP with bootstrap sampling Berg-Kirkpatrick et al. (2012). All differences are statistically significant.
4.1 Constrained compression experiment
In order to evaluate different approaches to constrained compression, we require a dataset of sentences, constraints and known-good shortenings which respect the constraints. This means we need tuples (S, Q, b, C), where C is a known-good compression of S which respects Q and b (§1).
To support large-scale automatic evaluation, we reinterpret a standard compression corpus Filippova and Altun (2013) as a collection of input triples and constrained compressions. The original dataset contains pairs of sentences S and compressions C, generated using news headlines. For our experiment, we set b equal to the character length of the gold compression C. We then sample some query set Q from the nouns in C so that the distribution of cardinalities of queries across our dataset simulates the observed distribution of cardinalities (i.e. number of query tokens) in real-world search Jansen et al. (2000). Sampled queries are short sets of nouns, such as “police, Syracuse”, “NHS” and “Hughes, manager, QPR,” approximating real-world behavior Barr et al. (2008). (See appendix for detailed discussion of query sampling.)
By sampling queries and defining budgets in this manner, we create 199,152 training tuples and 9,969 test tuples, each of the form (S, Q, b, C). Filippova and Altun (2013) define the train/test split. We re-tokenize, parse and tag with CoreNLP v3.8.0 Manning et al. (2014). We reserve 24,999 training tuples as a validation set.
4.2 Model: ilp
We compare our system to a baseline ILP method, presented in Filippova and Altun (2013), which learns edge weights from pairs of sentences and compressions. (Another ILP Wang et al. (2017) sets weights using an LSTM, achieving similar in-domain performance; that method requires a multi-stage computational process (i.e. run LSTM, then ILP) which is poorly suited to our query-focused setting, where low latency is crucial.) Learned weights are used to compute a global compression objective, subject to structural constraints which ensure C is a valid tree. This baseline can easily perform constrained compression: at test time, we add optimization constraints specifying that C must include Q, and not exceed length b.
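To illustrate how the lexical and budget constraints enter a token-selection objective, here is a dependency-free, brute-force stand-in for the ILP (all names are hypothetical; the real baseline also enforces tree-structure constraints and uses a solver such as Gurobi rather than enumeration, which is exponential, as in the ILP's worst case):

```python
from itertools import combinations

def constrained_best(tokens, weights, query_idx, budget):
    """Brute-force stand-in for a constrained token-selection ILP:
    enumerate token subsets, keep those containing the query indexes
    and fitting the character budget, and return the highest-scoring
    subset's tokens in sentence order."""
    best, best_score = None, float("-inf")
    idxs = range(len(tokens))
    for r in range(len(tokens) + 1):
        for subset in combinations(idxs, r):
            if not set(query_idx) <= set(subset):
                continue                      # lexical constraint
            text = " ".join(tokens[i] for i in subset)
            if len(text) > budget:
                continue                      # budget constraint
            score = sum(weights[i] for i in subset)
            if score > best_score:
                best, best_score = list(subset), score
    return [tokens[i] for i in (best or [])]
```

With a real solver, the two `continue` checks become linear constraints over binary indicator variables, added at test time exactly as described above.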
To our knowledge, a public implementation of this method does not exist. We reimplement it from scratch using Gurobi Optimization (2018), achieving a test-time, token-level F1 score of 0.76 on the unconstrained compression task, lower than the result (F1 = 84.3) reported by the original authors. There are some important differences between our reimplementation and the original approach (described in detail in the appendix). Since vertex addition requires Q and b, we can only compare it to the ILP on the constrained (rather than traditional, unconstrained) compression task.
4.3 Models: vertex addition
Vertex addition accepts or rejects some candidate vertex v at each timestep t. We learn such decisions using a corpus of (S, Q, b, C) tuples (§4.1). Given such a tuple, we can always execute an oracle path shortening S to C by first initializing vertex addition and then, at each timestep: (1) popping some candidate v from P and (2) adding v to the compression iff v appears in the gold compression C. We label a decision y = 1 if v is in C, and y = 0 otherwise. We use decisions from oracle paths to train two models of inclusion decisions, described below.
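The oracle labeling can be sketched in a few lines (names hypothetical): replay a pop order and label each candidate by membership in the gold compression, collecting (candidate, label) training pairs.

```python
def oracle_decisions(gold_positions, query_positions, pop_order):
    """Replay the oracle path: initialize the compression to the query,
    then accept a popped candidate iff it appears in the gold
    compression, recording (candidate, label) training examples."""
    C = set(query_positions)
    examples = []
    for v in pop_order:              # order produced by the priority queue
        label = 1 if v in gold_positions else 0
        examples.append((v, label))
        if label:
            C.add(v)
    return examples
```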
The neural model of vertex addition broadly follows neural approaches to transition-based parsing (e.g. Chen and Manning (2014)): we predict inclusion decisions with an LSTM-based classifier Conneau et al. (2017), implemented with a common neural framework Gardner et al. (2017). Our classifier maintains four vocabulary embeddings matrixes, corresponding to four disjoint subsets of vertexes defined by the compression state. Each LSTM input vector comes from the appropriate embedding for v, depending on the state of the compression system at timestep t. The appendix details hyperparameter tuning and network optimization.
The linear model of vertex addition uses binary logistic regression, implemented in Python 3 using scikit-learn Pedregosa et al. (2011); we tune the inverse regularization constant via grid search over powers of ten, to optimize validation set F1. Its features fall into 3 classes.
Edge features describe the properties of the dependency edge governing v. We use the edge-based feature function from Filippova and Altun (2013), described in detail in the appendix. This allows us to compare the performance of a vertex addition method based on local decisions with an ILP method that optimizes a global objective (§4.5), using the same feature set.
Stateful features represent the relationship between v and the compression at timestep t. Stateful features include information such as the position of v in the sentence, relative to the right-most and left-most vertex in C, as well as history-based information such as the fraction of the character budget used so far. Such features allow the model to reason about which sort of vertex should be added, given C, Q and b.
Interaction features are formed by crossing all stateful features with the type of the dependency edge governing v, as well as with indicators identifying if v governs some vertex in C, if some vertex in C governs v, or if there is no such edge in the parse.
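As an illustration only (the exact feature set is described in the paper and appendix), a handful of such stateful features might be computed as follows, with all names and feature choices hypothetical:

```python
def stateful_features(v, C, sentence_len, chars_used, budget):
    """Hypothetical stateful features for a candidate position v,
    given the current compression C (a set of token positions)."""
    in_C = sorted(C)
    return {
        # Is v left/right of the current compression span?
        "left_of_compression": float(v < in_C[0]) if in_C else 0.0,
        "right_of_compression": float(v > in_C[-1]) if in_C else 0.0,
        # Where does v sit in the sentence?
        "relative_position": v / sentence_len,
        # History: fraction of the character budget consumed so far.
        "budget_used": chars_used / budget,
    }
```

Crossing each of these with the dependency-edge type (interaction features) would then yield one indicator per (feature, edge-type) pair.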
4.4 Metrics: F1, Latency and SLOR
We measure the token-level F1 score of each compression method against gold compressions in the test set. F1 is the standard automatic evaluation metric for extractive compression Filippova et al. (2015); Klerke et al. (2016); Wang et al. (2017).
In addition to measuring F1, researchers often evaluate compression systems with human importance and readability judgements Knight and Marcu (2000); Filippova et al. (2015). In our setting, Q determines the “important” information from S, so importance evaluations are inappropriate. To check readability, we use the automated SLOR metric Lau et al. (2015), which correlates with human readability judgements Kann et al. (2018).
Finally, we measure the latency of each compression method, in milliseconds per sentence. We observe that theoretical gains from vertex addition (Table 1) translate to real speedups: vertex addition is on average 4x faster than the ILP (Table 2). (Edge feature extraction code is shared across both methods.) We measure latency of vertex addition using a CPU, to test if the method is practical in settings without GPUs. The appendix describes additional details related to our measurement of latency and SLOR.
4.5 Analysis: ablated & random
For comparison, we implement an ablated vertex addition method, which makes inclusion decisions using a model trained on only edge features from Filippova and Altun (2013). This method achieves a lower F1 than the ILP, which integrates the same edge-level information to optimize a global objective. However, adding stateful and interaction features in full vertex addition raises the F1 score. The relatively strong performance of ablated hints that edge-level information alone can largely guide acceptance decisions (e.g. should a verb governing an already-included object be added to C?).
We also evaluate a random baseline, which accepts each candidate v at random, in proportion to the rate of oracle acceptances observed across training data. random achieves reasonable F1 because (1) the compression is initialized to contain Q, and (2) F1 correlates with compression rate Napoles et al. (2011), and b is set to the length of the gold compression.
Model | F1 | SLOR | Latency (ms)
random (lower bound) | 0.653 | 0.371 | 0.5
ablated (edge only) | 0.820 | 0.665 | 3.5
Both novel and traditional search user interfaces would benefit from low-latency, query-focused sentence compression. We introduce a new vertex addition technique for such settings. Our method is 4x faster than an ILP baseline while better reconstructing known-good compressions, as measured by F1 score.
In search applications, the latency gains from vertex addition are non-trivial: real users are measurably hindered by interface lags Nielsen (1993); Liu and Heer (2014). Fast, query-focused compression better enables search systems to create snippets at the “pace of human thought” Heer and Shneiderman (2012).
Appendix A
a.1 Vertex addition algorithm
We formally present the vertex addition compression algorithm, using notation defined in the paper. The linearize function linearizes a vertex set based on left-to-right position in S. |P| indicates the number of tokens in the priority queue.
a.2 Neural network tuning and optimization
We learn network weights for vertex addition by minimizing cross-entropy loss with AdaGrad Duchi et al. (2011). The hyperparameters of the network are: the learning rate, the dimensionality of embeddings, the weight decay parameter, and the hidden state size of the LSTM. We tune hyperparameters via random search Bergstra and Bengio (2012), selecting parameters which achieve the highest accuracy in predicting oracle decisions for the validation set. We train for 10 epochs, and use parameters from the best-performing epoch (by validation accuracy) at test time.
a.3 Reimplementation of Filippova and Altun (2013)
In this work, we reimplement the method of Filippova and Altun (2013), who in turn implement a method partially described in Filippova and Strube (2008). There are inevitable discrepancies between our implementation and the methods described in these two prior papers.
Where the original authors train on only 100,000 sentences, we learn weights with the full training set to compare fairly with vertex addition (each model trains on the full training set).
In Table 1 of their original paper, Filippova and Altun (2013) provide an overview of the syntactic, structural, semantic and lexical features in their model. We implement every feature described in the table. We do not implement features which are not described in the paper.
Filippova and Altun (2013) preprocess dependency parses by adding an edge between the root node and all verbs in a sentence. (This step ensures that subclauses can be removed from parse trees, and then merged together to create a compression from different clauses of a sentence.) We found that replicating this transform literally (i.e. only adding edges from the original root to all tokens tagged as verbs) made it impossible for the ILP to recreate some gold compressions. (We suspect that this is due to differences in output from part-of-speech taggers.) We thus add an edge between the root node and all tokens in a sentence during preprocessing, allowing the ILP to always return the gold compression.
We assess convergence of the ILP by examining validation F1 score on the traditional sentence compression task. We terminate training after six epochs, when the F1 score stabilizes (i.e. changes between epochs fall below a small threshold).
a.4 Implementation of SLOR
We use the SLOR function to measure the readability of the shortened sentences produced by each compression system. SLOR normalizes the probability of a token sequence assigned from a language model by adjusting for both the probability of the individual unigrams in the sentence and for the sentence length. (Longer sentences are always less probable than shorter sentences; rarer words make a sequence less probable.)
Following Lau et al. (2015), we define the function as

SLOR(s) = (ln p_LM(s) − ln p_u(s)) / |s|

where s is a sequence of words, p_u(s) is the unigram probability of this sequence of words, p_LM(s) is the probability of the sequence assigned by a language model, and |s| is the length (in tokens) of the sentence.
We use a 3-gram language model trained on the training set of the Filippova and Altun (2013) corpus. We implement the language model with KenLM Heafield (2011). Because compression often results in shortenings where the first token is not capitalized (e.g. a compression which begins with the third token in S), we ignore case when calculating language model probabilities.
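Given per-token unigram log probabilities and a total language-model log probability (e.g. from KenLM), SLOR reduces to a one-line computation. The helper below is a sketch with hypothetical names:

```python
def slor(log_p_lm, unigram_log_probs):
    """SLOR of a token sequence, following Lau et al. (2015):
    (log p_LM(s) - log p_u(s)) / |s|.

    log_p_lm: total log probability of the sequence under the LM.
    unigram_log_probs: per-token unigram log probabilities, whose sum
        is log p_u(s); their count is the sentence length |s|.
    """
    n = len(unigram_log_probs)
    log_p_uni = sum(unigram_log_probs)
    return (log_p_lm - log_p_uni) / n
```

A sequence whose LM probability exceeds its unigram probability (i.e. the LM finds it more fluent than its words alone suggest) receives a positive score.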
a.5 Latency evaluation
To measure latency, for each technique, we sample 100,000 sentences with replacement from the test set. We observe the mean time per sentence using Python’s built-in timeit module. We measure on an Intel Xeon processor with a clock rate of 2.80GHz.
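A minimal sketch of this measurement protocol using `timeit` (function names hypothetical; the actual harness may differ):

```python
import random
import timeit

def mean_latency_ms(compress, test_sentences, n_samples=1000, seed=0):
    """Estimate mean per-sentence latency in milliseconds by timing a
    compression function on sentences sampled with replacement."""
    rng = random.Random(seed)
    sample = [rng.choice(test_sentences) for _ in range(n_samples)]
    total = timeit.timeit(lambda: [compress(s) for s in sample], number=1)
    return 1000.0 * total / n_samples
```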
a.6 Query sampling
We sample queries for our synthetic constrained compression experiment to mimic real-world searches: the distribution of query token lengths and the distribution of query part-of-speech tags employed in our experiment closely match empirical distributions observed in real search Jansen et al. (2000); Barr et al. (2008). To create queries, for each sentence in our corpus we: (1) sample a query token length in proportion to the real-world distribution over query token lengths, (2) sample a proper or common noun from the gold compression in proportion to the distribution over proper and common nouns in real-world queries, (3) add the sampled noun to the query set, and (4) repeat steps 2 and 3 until the cardinality of the query set reaches the token length specified in step 1. We exclude sentences in cases where the gold compression does not contain enough nouns to fill the query to the desired token length.
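The sampling loop can be sketched as below. This is a simplification with hypothetical names and invented length probabilities: it draws nouns uniformly, rather than weighting proper versus common nouns by their empirical distribution.

```python
import random

def sample_query(nouns, length_dist, rng):
    """Sample a query set from the nouns of a gold compression.

    nouns: candidate (proper or common) nouns in the gold compression.
    length_dist: {query_length: probability}, e.g. estimated from
        real-world search logs (hypothetical values below).
    Returns None when the compression lacks enough nouns, in which
    case the sentence is excluded.
    """
    lengths = list(length_dist)
    weights = [length_dist[k] for k in lengths]
    k = rng.choices(lengths, weights=weights)[0]
    if len(nouns) < k:
        return None                      # exclude this sentence
    return set(rng.sample(nouns, k))

rng = random.Random(0)
q = sample_query(["police", "Syracuse", "mayor"], {1: 0.3, 2: 0.5, 3: 0.2}, rng)
```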
a.7 Compression ratios
When comparing sentence compression systems, it is important to ensure that all approaches use the same rate of compression Napoles et al. (2011). Following Filippova et al. (2015), we define the compression ratio as the character length of the compression divided by the character length of the sentence. We present test set compression ratios for all methods in Table 4. Because ratios are similar, our comparison is appropriate.
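The ratio itself is a one-liner; the example strings below are invented for illustration:

```python
def compression_ratio(compression, sentence):
    """Character length of the compression divided by the character
    length of the source sentence (Filippova et al., 2015)."""
    return len(compression) / len(sentence)

# e.g. a 23-character shortening of a 46-character sentence
assert compression_ratio("police arrested suspect",
                         "the police arrested a suspect on Tuesday night") == 0.5
```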
- Almeida and Martins (2013) Miguel Almeida and Andre Martins. 2013. Fast and robust compressive summarization with dual decomposition and multi-task learning. In ACL.
- Barr et al. (2008) Cory Barr, Rosie Jones, and Moira Regelson. 2008. The linguistic structure of english web-search queries. In EMNLP.
- Berg-Kirkpatrick et al. (2012) Taylor Berg-Kirkpatrick, David Burkett, and Dan Klein. 2012. An empirical investigation of statistical significance in NLP. In EMNLP.
- Berg-Kirkpatrick et al. (2011) Taylor Berg-Kirkpatrick, Dan Gillick, and Dan Klein. 2011. Jointly learning to extract and compress. In ACL.
- Bergstra and Bengio (2012) James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305.
- Briscoe et al. (2006) Ted Briscoe, John Carroll, and Rebecca Watson. 2006. The second release of the RASP system. In Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions.
- Chen and Manning (2014) Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP.
- Chuang et al. (2012) Jason Chuang, Daniel Ramage, Christopher D. Manning, and Jeffrey Heer. 2012. Interpretation and trust: designing model-driven visualizations for text analysis. In CHI.
- Clarke and Lapata (2008) James Clarke and Mirella Lapata. 2008. Global inference for sentence compression: An integer linear programming approach. Journal of Artificial Intelligence Research, 31:399–429.
- Conneau et al. (2017) Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In EMNLP.
- Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
- Filippova and Alfonseca (2015) Katja Filippova and Enrique Alfonseca. 2015. Fast k-best sentence compression. CoRR, abs/1510.08418.
- Filippova et al. (2015) Katja Filippova, Enrique Alfonseca, Carlos A Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In EMNLP.
- Filippova and Altun (2013) Katja Filippova and Yasemin Altun. 2013. Overcoming the lack of parallel data in sentence compression. In EMNLP. https://github.com/google-research-datasets/sentence-compression.
- Filippova and Strube (2008) Katja Filippova and Michael Strube. 2008. Dependency tree based sentence compression. In Proceedings of the Fifth International Natural Language Generation Conference.
- Gardner et al. (2017) Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform.
- Gehrmann et al. (2018) Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In EMNLP.
- Gurobi Optimization (2018) LLC Gurobi Optimization. 2018. Gurobi optimizer reference manual (v8).
- Heafield (2011) Kenneth Heafield. 2011. KenLM: faster and smaller language model queries. In EMNLP: Sixth Workshop on Statistical Machine Translation.
- Hearst (2009) Marti Hearst. 2009. Search user interfaces. Cambridge University Press, Cambridge New York.
- Heer and Shneiderman (2012) Jeffrey Heer and Ben Shneiderman. 2012. Interactive dynamics for visual analysis. Queue, 10(2).
- Jansen et al. (2000) Bernard J. Jansen, Amanda Spink, and Tefko Saracevic. 2000. Real life, real users, and real needs: a study and analysis of user queries on the web. Information Processing and Management, 36:207–227.
- Kann et al. (2018) Katharina Kann, Sascha Rothe, and Katja Filippova. 2018. Sentence-level fluency evaluation: References help, but can be spared! In CoNLL.
- Kikuchi et al. (2016) Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2016. Controlling output length in neural encoder-decoders. In EMNLP.
- Klerke et al. (2016) Sigrid Klerke, Yoav Goldberg, and Anders Søgaard. 2016. Improving sentence compression by learning to predict gaze. In NAACL.
- Knight and Marcu (2000) Kevin Knight and Daniel Marcu. 2000. Statistics-based summarization - step one: Sentence compression. In AAAI/IAAI.
- Lau et al. (2015) Jey Han Lau, Alexander Clark, and Shalom Lappin. 2015. Unsupervised prediction of acceptability judgements. In ACL.
- Liu and Heer (2014) Zhicheng Liu and Jeffrey Heer. 2014. The effects of interactive latency on exploratory visual analysis. IEEE Transactions on Visualization and Computer Graphics, 20:2122–2131.
- Mallinson et al. (2018) Jonathan Mallinson, Rico Sennrich, and Mirella Lapata. 2018. Sentence Compression for Arbitrary Languages via Multilingual Pivoting. In EMNLP.
- Manning et al. (2014) Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL System Demonstrations.
- Marchionini (2006) Gary Marchionini. 2006. Exploratory search: From finding to understanding. Commun. ACM, 49(4).
- de Marneffe et al. (2006) Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In LREC.
- Napoles et al. (2011) Courtney Napoles, Benjamin Van Durme, and Chris Callison-Burch. 2011. Evaluating sentence compression: Pitfalls and suggested remedies. In Proceedings of the Workshop on Monolingual Text-To-Text Generation.
- Nielsen (1993) Jakob Nielsen. 1993. Usability Engineering. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
- Nivre (2003) Joakim Nivre. 2003. An efficient algorithm for projective dependency parsing. In International Conference on Parsing Technologies.
- Nivre et al. (2016) Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan T. McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In LREC.
- Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in python. J. Mach. Learn. Res., 12:2825–2830.
- Post and Vilar (2018) Matt Post and David Vilar. 2018. Fast lexically constrained decoding with dynamic beam allocation for neural machine translation. In NAACL.
- Rush et al. (2015) Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In EMNLP.
- Schuster and Manning (2016) Sebastian Schuster and Christopher D. Manning. 2016. Enhanced english universal dependencies: An improved representation for natural language understanding tasks. In LREC.
- Wang et al. (2017) Liangguo Wang, Jing Jiang, Hai Leong Chieu, Chen Hui Ong, Dandan Song, and Lejian Liao. 2017. Can syntax help? Improving an LSTM-based sentence compression model for new domains. In ACL.
- Zhang and Cranshaw (2018) Amy X. Zhang and Justin Cranshaw. 2018. Making sense of group chat through collaborative tagging and summarization. Proc. ACM Hum.-Comput. Interact., 2(CSCW).