Catching Attention with Automatic Pull Quote Selection

05/27/2020
by   Tanner Bohn, et al.
Western University

Pull quotes are an effective component of a captivating news article. These spans of text are selected from an article and provided with more salient presentation, with the aim of attracting readers with intriguing phrases and making the article more visually interesting. In this paper, we introduce the novel task of automatic pull quote selection, construct a dataset, and benchmark the performance of a number of approaches ranging from hand-crafted features to state-of-the-art sentence embeddings to cross-task models. We show that pre-trained Sentence-BERT embeddings outperform all other approaches, though the benefit over n-gram models is marginal. By closely examining the results of simple models, we also uncover many unexpected properties of pull quotes that should serve as inspiration for future approaches. We believe the benefits of exploring this problem further are clear: pull quotes have been found to increase enjoyment and readability, shape reader perceptions, and facilitate learning.


1 Introduction

In this paper, we introduce the novel problem of automated pull quote (PQ) selection and analyze several approaches. Ideally thought provoking and succinct, PQs are graphical elements of articles with spans of text pulled from an article by a writer or copy editor to be presented on the page in a more salient manner [17].

Following the 15-year period between 1965 and 1980 during which many newspapers experimented with their design (having previously been graphically similar) [58], some newspapers adopted a more modern design. Supported by reader preference, aspects of this newer design include a more horizontal or modular layout, the six-column format, additional whitespace around heads, fewer stories, larger photographs, more colour, and more pull quotes [53, 60, 10].

PQs have been found to serve many purposes, including temptation (with unusual or intriguing phrases, they make strong entrypoints for a browsing reader), emphasis (by reinforcing particular aspects of the article), and improving overall visual balance and excitement [54, 21]. PQ frequency in reading material has also been shown to be significantly related to information recall and to student ratings of enjoyment, readability, and attractiveness [60, 61].

The problem studied in this work, automatically selecting PQs, is distinct from, but related to, the previously studied problems of headline success prediction [45, 30], clickbait identification [46, 7, 59], key phrase extraction [19], and document summarization [42]. In the context of convincing a reader to engage with a text, the title tells the reader what the article is about and sets the tone, clickbait makes (often unwarranted) lofty promises about what the article contains, and key phrases and summaries indicate whether the topic or constituent components are of interest to the user. In contrast, PQs can provide specific intriguing entrypoints for the reader and maintain interest once reading has begun by providing glimpses of interesting things to come.

In this work we interpret PQ selection as a sentence classification task and create a dataset of news articles and their human-selected PQs from a variety of news sources. We consider a wide variety of approaches to solve this task: (1) handcrafted features, (2) n-gram encodings, (3) pre-trained sentence embeddings (specifically Sentence-BERT [48] and predicted position distributions [3]), and (4) cross-task models.

We find that on this dataset, pre-trained Sentence-BERT embeddings work best, with n-grams models close behind. Among the hand-crafted features, we found that reading difficulty, preposition density, and concreteness were the most informative. Motivated by the observation that PQs are not uniformly spread throughout news articles, we also demonstrate that using the predicted position distributions as sentence embeddings performs surprisingly well. Finally, as suggested by cross-task performance, we find that PQ selection is most similar to the task of clickbait identification, with models for headline popularity and summarization performing more poorly.

In summary, the main contributions of this work are as follows:

  1. We describe several motivated approaches for PQ selection (Sec. 3).

  2. We construct a dataset for training and evaluation of automated PQ selection (Sec. 4).

  3. We thoroughly examine the performance of our approaches to gain a deeper understanding of PQs and their relation to other tasks (Sec. 5).

2 Related Work

In this section, we look at three areas of work related to PQ selection: (1) headline quality prediction, (2) clickbait identification, and (3) summarization and keyphrase extraction. These topics also motivate the cross-task models whose performance on PQ selection is reported in Section 5.4.

2.1 Headline Quality Prediction

When a reader comes across a news article, the headline is often the first thing given a chance to catch their attention. Once they decide to check out the article, it is up to the content (including PQs) to maintain their engagement. A wide variety of research exists on attracting user attention, one of the core purposes of a pull quote. Understanding and predicting what we find interesting, attention-grabbing, and appealing has been studied for domains such as music [31], images and video [14, 47], web-page aesthetics [49], as well as online news article content [29, 12].

Predicting the success of headlines is a strongly motivated and well studied task, and the features found to be useful have been relatively consistent. In [45], the authors experimented with two sets of features: journalism-inspired features (which aim to measure how newsworthy the topic itself is) and linguistic style features (reflecting properties such as length, readability, and parts-of-speech). They found that overall the simpler style features work better than the more complex journalism-inspired features at predicting the social media popularity of news articles. The success of simple features is also reflected in [30], which proposed multi-task training of a recurrent neural network to not only predict headline popularity given pre-trained word embeddings, but also predict its topic and parts-of-speech tags. The authors found that while multi-task learning helped, it performed only as well as a logistic regression model using character n-grams. A unique approach to predicting headline performance is taken in [24], where the authors propose a method to model the "click-value" of individual words given current news trend information.

2.2 Clickbait Identification

The detection of a certain type of headline, clickbait, has recently been a popular task of study. Clickbait is a particularly catchy kind of headline used by news outlets that lures potential readers but usually fails to meet expectations and leaves readers disappointed [46]. We suspect that the task of distinguishing between clickbait and non-clickbait headlines is related to pull quote extraction because both tasks may rely on identifying the catchiness of a span of text. In [59], the authors found that measures of topic novelty (estimated using LDA) and surprise (based on word bigram frequency) were strong features for detecting clickbait. A set of 215 features was considered in [46], including sentiment, length statistics, and many features based on specialized dictionary-based word occurrences, but the authors found that the most successful features were character and word n-grams. The strength of n-gram features at this task is also supported by [7]. In our work we also demonstrate that n-gram features work well (Sec. 5.2), performing nearly as well as state-of-the-art deep pre-trained sentence embeddings.

2.3 Summarization and Keyphrase Extraction

Summarization and keyphrase extraction are two well-studied tasks in natural language processing with the goals of capturing and conveying the main topics and key information discussed in a body of text [57, 42]. Keyphrase extraction is concerned with doing this at the level of individual phrases, while extractive document summarization (which is just one type of summarization [41]) aims to do this at the sentence level. We believe (and provide some evidence in Sec. 5) that the difference between these tasks and PQ selection comes down to their interpretations of importance: while summarization and keyphrase extraction define importance as the ability to convey representative information, PQs define importance as the ability to intrigue the reader. With this view, PQs and summaries may overlap where the central information of a document is itself intriguing.

Along with their overlapping purposes come related applications. Where keyphrases can be applied to facilitate skimming through highlighting [57], helping readers find specific information of interest, PQs may facilitate skimming to help find more generally interesting reading material. Where summarization has found applications in education through automated evaluation of summary quality [55], PQs have found application there by improving student reading comprehension and recall [60, 61].

Approaches to summarization have roughly evolved from unsupervised, extractive, heuristic-based methods [33, 36, 15, 43, 18] to supervised and often abstractive deep-learning approaches [40, 39, 38, 63]. Approaches to keyphrase extraction fall into similar groups, with unsupervised approaches including [56, 36, 32] and supervised approaches including [57, 35, 50].

3 Models

We consider four types of approaches for the task of PQ selection: (1) hand-crafted features either motivated by the literature or otherwise interesting to study (Sec. 3.1), (2) n-gram features (Sec. 3.2), (3) pre-trained sentence embeddings (Sec. 3.3), and (4) cross-task models (Sec. 3.4). The aim of these approaches, as discussed further in Section 4.2, is to determine the probability that a given article sentence is part of the source text for a pull quote.

3.1 Handcrafted Features

Our handcrafted features can be loosely grouped into three categories: surface, parts-of-speech, and affective. For the classifier we will use AdaBoost [20] with a decision tree base estimator.

3.1.1 Surface Features

  • Length: Including sentence length is motivated by writers' preference for concise PQs. To measure length, we use the total character count, as this reflects the space used by the text more accurately than the number of words.

  • Sentence position: We consider the location of the sentence in the document (from 0 to 1). This is motivated by the finding in summarization that summary-suitable sentences tend to occur near the beginning [4] – perhaps a similar trend exists for PQs.

  • Readability: Motivated by the assumption that a writer will not purposefully choose PQs which are difficult to read, we consider a few readability metric features:

    • Flesch Reading Ease: This measure defines reading ease in terms of the number of words per sentence and the number of syllables per word [16]:

      Flesch Reading Ease = 206.835 − 1.015 × (total words / total sentences) − 84.6 × (total syllables / total words)   (1)
    • Coleman-Liau Index: This measure, introduced in [11], is designed to be easy to calculate automatically by not requiring syllable counting, and it is scaled to provide the approximate U.S. grade level necessary to comprehend the text. It considers the average number of letters and the average number of sentences per 100 words:

      CLI = 0.0588 × L − 0.296 × S − 15.8,   (2)

      where L is the average number of letters per 100 words and S is the average number of sentences per 100 words.
    • Difficult words: This measure, R_difficult, simply computes the percentage of unique words which are considered difficult, where "difficult" is defined as being at least six characters long and not appearing in a list of 3,000 easy-to-understand words. The source of the easy words list is given in Section 4.3.

    • Average word length: While average word length, in number of characters, is correlated with reading difficulty, it is conceivable that PQs could have a higher average word length than typical sentences: authors can use relatively longer words (e.g. superlatives) to emphasize a passage. A minimal sketch of these surface features follows this list.
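A minimal sketch of these surface features, assuming the Textstat package referenced in Section 4.3 for the readability metrics; the function name is illustrative, and dividing Textstat's difficult-word count by the number of unique words is only an approximation of the exact feature:

```python
import textstat

def surface_features(sentences):
    """Length, position, and readability features for the sentences of one article."""
    n = len(sentences)
    feats = []
    for i, sent in enumerate(sentences):
        words = sent.split()
        unique_words = {w.lower() for w in words}
        feats.append({
            "length": len(sent),                              # total character length
            "position": i / max(n - 1, 1),                    # 0 (first sentence) to 1 (last)
            "flesch_reading_ease": textstat.flesch_reading_ease(sent),   # Eqn. 1
            "coleman_liau_index": textstat.coleman_liau_index(sent),     # Eqn. 2
            # approximate share of unique words considered "difficult"
            "difficult_words": textstat.difficult_words(sent) / max(len(unique_words), 1),
            "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        })
    return feats
```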

3.1.2 Part-of-Speech Features

Part-of-speech (POS) features have appeared in many previous related works on headline popularity and clickbait [45, 30, 24, 7]. Here, we include the word density of a given POS tag in a sentence as a feature. As suggested by [45] with respect to guidelines for writing good headlines, we suspect that verbs and adverbs will perform well.

POS tag  Name              Examples
CD       cardinal digit    5, 3rd, four
JJ       adjective         big, fast
MD       modal verb        could, will
NN       singular noun     chair, arm
NNP      proper noun       Laura, Robert
PRP      personal pronoun  I, he, she
RB       adverb            silently, quickly
VB       verb              take, eat, run
Table 1: We report results on the densities of these various part-of-speech tags.

We consider the densities of an extensive set of POS tags (the full list of considered POS tags is available at https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/), and report results on the interesting set described in Table 1.
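A minimal sketch of the POS-density features, assuming the NLTK perceptron tagger referenced in Section 4.3; the helper name is illustrative:

```python
# Requires: nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")
import nltk

TAGS = ["CD", "JJ", "MD", "NN", "NNP", "PRP", "RB", "VB"]  # the tags reported in Table 1

def pos_densities(sentence):
    """Fraction of tokens in the sentence carrying each POS tag of interest."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    n = max(len(tagged), 1)
    return {tag: sum(1 for _, t in tagged if t == tag) / n for tag in TAGS}
```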

3.1.3 Affective Features

Events or images that are shocking, filled with emotion, or otherwise exciting will attract attention [52]. However, this does not necessarily mean that text describing these things will catch a reader's interest as reliably [1]. Like any other text, a reader must go through the process of decoding its meaning before becoming aware of its interesting qualities. Where individual words are involved, this may be a very fast, involuntary process [34], perhaps behind the success of good headlines and clickbait.

To answer the question of how predictive a sentence's affective properties are of being part of a PQ, we include the following features:

  • Positive sentiment and negative sentiment.

  • Compound sentiment, which combines the positive and negative sentiments to represent overall sentiment between -1 and 1.

  • Valence and arousal: Valence refers to the pleasantness of a stimulus and arousal refers to the intensity of emotion provoked by a stimulus [62]. In [1], the authors specifically note that it is the arousal level of words, and not their valence, which is predictive of their effect on attention (measured via reaction time). Measuring early cortical responses and recall, [27] observed that words of greater valence were both more salient and memorable. To measure the valence and arousal of a passage, we use the average score for each word using a database of word ratings [62]. Stop words are removed, and when a word rating cannot be found, a value of 5 is used for valence and 4 for arousal (the mean word ratings).

  • Concreteness: this is "the degree to which the concept denoted by a word refers to a perceptible entity" [5]. As demonstrated by [51], concrete texts are better recalled than abstract ones, and concreteness is a strong predictor of text comprehensibility, interest, and recall. A concreteness score is computed similarly to valence and arousal, with a mean concreteness word rating of 5 used when no value for a word is available. A sketch of these affective features follows this list.
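A minimal sketch of these affective features, assuming VADER is accessed through NLTK (Sec. 4.3) and that the valence/arousal [62] and concreteness [5] lexicons have already been loaded into dictionaries mapping words to ratings; the fallback values follow the text above, and the function names are illustrative:

```python
# Requires: nltk.download("vader_lexicon") and nltk.download("stopwords")
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords

sia = SentimentIntensityAnalyzer()
STOPS = set(stopwords.words("english"))

def lexicon_average(words, ratings, fallback):
    """Average word rating, using the fallback value for out-of-lexicon words."""
    content = [w for w in words if w not in STOPS]
    return sum(ratings.get(w, fallback) for w in content) / max(len(content), 1)

def affective_features(sentence, valence, arousal, concreteness):
    scores = sia.polarity_scores(sentence)      # positive, negative, and compound sentiment
    words = sentence.lower().split()
    return {
        "pos_sentiment": scores["pos"],
        "neg_sentiment": scores["neg"],
        "compound_sentiment": scores["compound"],
        "valence": lexicon_average(words, valence, 5.0),       # fallback values from Sec. 3.1.3
        "arousal": lexicon_average(words, arousal, 4.0),
        "concreteness": lexicon_average(words, concreteness, 5.0),
    }
```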

3.2 N-Gram Features

We consider three types of n-gram text representations: character-level, word-level, and POS-tag level. A passage of text is then represented by a vector of the counts of the individual n-grams it contains.
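A minimal sketch of these representations, assuming scikit-learn's CountVectorizer; the vocabulary size of 1000 and lower-casing follow the settings reported with Table 4, and the example sentences are only illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    '"Every accident is a combination of events," he said.',
    "Indonesian authorities released the findings Wednesday.",
]

char_bigrams = CountVectorizer(analyzer="char", ngram_range=(2, 2),
                               lowercase=True, max_features=1000)
word_unigrams = CountVectorizer(analyzer="word", ngram_range=(1, 1),
                                lowercase=True, max_features=1000)

X_char = char_bigrams.fit_transform(sentences)    # sparse matrix of n-gram counts
X_word = word_unigrams.fit_transform(sentences)

# POS-tag n-grams can reuse the same machinery by first replacing each sentence
# with its space-joined tag sequence (e.g. "PRP VBD DT NN") and treating tags as words.
```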

3.3 Pre-trained Sentence Embeddings

Distributed word and sentence representations have proven their value at many NLP tasks in recent years [26, 23, 6, 13, 48]. We evaluate two interesting sentence embedding techniques in this work:

  • Sentence-BERT: Based on the BERT (Bidirectional Encoder Representations from Transformers) language representation model [13], Sentence-BERT is a modification designed to more efficiently produce directly semantically meaningful sentence embeddings [48]. Following [48], we combine these embeddings with a logistic regression classifier for predicting PQ probability (a minimal sketch of this pipeline follows the list).

  • Predicted position distributions (PPDs): Described by [3], PPDs are a self-supervised sentence embedding technique where the embedding represents a discrete distribution over quantiles of a document:

    PPD(s) = (p_1(s), p_2(s), …, p_Q(s)),   (3)

    where p_q(s) is the predicted probability that sentence s is located in quantile q of its source document. Our use of this self-supervised embedding technique is motivated by the observation that PQ source sentences do not occur uniformly throughout articles (discussed in Sec. 5.1). As demonstrated by [3] in the context of extractive summarization, simply using the predicted probability that a sentence occurs at the beginning of a document significantly outperforms other unsupervised summarization algorithms. We use the entire predicted position distribution as a sentence encoding, combined with a logistic regression classifier.
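A minimal sketch of the Sentence-BERT pipeline, assuming the sentence-transformers package and the bert-base-nli-mean-tokens model named in Section 4.3; the toy sentences and labels are illustrative placeholders rather than the actual training data:

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("bert-base-nli-mean-tokens")

train_sentences = ['"Every accident is a combination of events," he said.',
                   "Indonesian authorities released the findings Wednesday."]
train_labels = [1, 0]                      # 1 = PQ source sentence (toy labels)
test_sentences = ["The new 737 MAX 8 plunged into the Java Sea on Oct. 29."]

clf = LogisticRegression(class_weight="balanced", max_iter=1000)  # balanced weighting, Sec. 4.3
clf.fit(encoder.encode(train_sentences), train_labels)

pq_probability = clf.predict_proba(encoder.encode(test_sentences))[:, 1]
```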

3.4 Cross-Task Models

In order to test the similarity of the PQ selection task with the related tasks of headline popularity prediction, clickbait identification, and summarization, we use the following models:

  • Headline popularity: Using Sentence-BERT embeddings and linear regression, we train a model to predict the popularity of a headline. We then apply this model to PQ selection by predicting the popularity of each sentence, scaling the predictions for each article to lie in [0, 1], and interpreting these values as PQ probability.

  • Clickbait identification: Using Sentence-BERT embeddings and logistic regression, we train a model to discriminate between clickbait and non-clickbait headlines. Clickbait probability is then used as a proxy for PQ probability.

  • Summarization: Using multiple extractive summarization algorithms, we score each sentence in an article, scale the values to lie in [0, 1], and interpret these values as PQ probability (the per-article scaling is sketched below).
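The per-article rescaling shared by these cross-task models can be sketched as follows; `score_fn` is an illustrative stand-in for the headline-popularity regressor, the clickbait classifier, or a summarizer's sentence scorer:

```python
import numpy as np

def cross_task_pq_scores(article_sentences, score_fn):
    """Min-max scale one article's cross-task scores to [0, 1] and read them as PQ probabilities."""
    scores = np.array([score_fn(s) for s in article_sentences], dtype=float)
    lo, hi = scores.min(), scores.max()
    if hi == lo:                           # constant scores carry no ranking information
        return np.full_like(scores, 0.5)
    return (scores - lo) / (hi - lo)
```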

4 Experimental Setup

4.1 Dataset Construction

To conduct our experiments, we created a pull quote dataset using articles from several online news outlets: National Post, The Intercept, Ottawa Citizen, and Cosmopolitan. For each news outlet, we obtain a list of articles and identify those containing at least one pull quote. From these articles, we extract the following pieces of information:

  • The body: the full list of sentences composing the body of the article.

  • The edited PQs: the pulled texts as they appear after being augmented by the editor to serve as pull quotes. This can include replacing pronouns such as "she", "they", or "it" with more precise nouns or proper nouns, shortening sentences by removing individual words or clauses, or even replacing words with ones of similar meaning but different length in order to achieve a clean text rag.

  • The PQ source sentences: the article sentences from which the edited pull quotes came. In this work, we aim to determine whether a given article sentence belongs to this group or not (see the labeling sketch after this list).
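A minimal sketch of how sentence labels could be derived from these fields; matching on exact sentence text is an assumption of the sketch, and the actual dataset construction may align sentences differently:

```python
def label_sentences(body_sentences, pq_source_sentences):
    """1 if a body sentence is one of the PQ source sentences, else 0."""
    sources = set(pq_source_sentences)
    return [1 if s in sources else 0 for s in body_sentences]
```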

Statistics of the dataset are provided in Table 2. Notably, the total number of articles in the dataset is nearly 15,000, with the majority coming from National Post. It is also interesting to note that the number of PQs per article varies widely across news outlets, ranging from 1.02 for Ottawa Citizen to 2.26 for The Intercept. Overall, our dataset contains roughly 26,500 positive samples (sentences used in PQs) and 680,000 negative samples (all non-PQ sentences), for a positive-to-negative ratio of about 1:25. For all experiments, we use the same training/validation/test split of the articles (70/10/20).

                     nationalpost  theintercept  ottawacitizen  cosmopolitan  train   val    test    all
# articles           11080         1183          1062           1272          10217   1459   2921    14597
# PQ                 16211         2670          1083           2374          15611   2207   4520    22338
# PQ/article         1.46          2.26          1.02           1.87          1.53    1.51   1.55    1.53
# sentences/PQ       1.17          1.23          1.32           1.24          1.19    1.18   1.2     1.19
# sentences/article  40.48         97.93         38.34          79.01         48.38   47.65  48.54   48.34
# pos samples        18879         3277          1431           2925          18504   2595   5413    26512
# neg samples        429681        112572        39285          97582         475829  66927  136364  679120
Table 2: Statistics of our PQ dataset, composed of articles from four different news outlets. Only articles with at least one PQ are included in the dataset.

4.2 Evaluation

To evaluate PQ selection models, we will use the AUC averaged across articles, as described in Equation 4, where A is the set of articles, y^(a) is the binary vector indicating whether each sentence of article a is used for a PQ, and p̂^(a) contains the corresponding predicted probabilities:

    AUC_PQ = (1 / |A|) × Σ_{a ∈ A} AUC(y^(a), p̂^(a))   (4)

This has the following intuitive interpretation: it is the probability that a random true positive sample is ranked by the model above a random true negative sample. By averaging scores across sentences from individual articles instead of calculating AUC over all sentences at the same time, the evaluation method accounts for the observation that some articles may be more "pull-quotable" than others. If this observation is not taken into account and sentences from all articles are combined when computing AUC, the average sentence from an interesting article may be ranked higher than the best sentence from a less interesting article.
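A minimal sketch of this evaluation, assuming scikit-learn's roc_auc_score; skipping articles whose sentences are all positive or all negative (where AUC is undefined) is an assumption of the sketch:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pq_auc(articles):
    """articles: list of (y, p) pairs, one per article, where y is the binary
    PQ-source indicator vector and p holds the predicted probabilities."""
    per_article = [roc_auc_score(y, p) for y, p in articles if 0 < sum(y) < len(y)]
    return float(np.mean(per_article))
```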

4.3 Implementation Details

Here we outline the various tools, datasets, and other implementation details important to our experiments:

  • To perform part-of-speech tagging for feature extraction, we use the NLTK 3.4.5 perceptron tagger [2].

  • To compute sentiment, the VADER sentiment analysis tool [22] is used, accessed through the NLTK library.

  • Implementations of the readability metrics are provided by the Textstat 0.6.0 Python package (available at https://github.com/shivam5992/textstat). The corpus of easy words used for the difficult-words feature is also made available by this package.

  • Valence and arousal word ratings are obtained from the dataset described in [62] (available at http://crr.ugent.be/archives/1003).

  • Concreteness word ratings are obtained from the dataset described in [5] (available at http://crr.ugent.be/archives/1330).

  • The Sentence-BERT [48] implementation and pre-trained models are used for text embedding (available at https://github.com/UKPLab/sentence-transformers); we use the bert-base-nli-mean-tokens pre-trained model.

  • For PPD models, we use Sentence-BERT for the initial embeddings and train a neural network with two hidden layers to predict the position distributions. We test a range of quantile counts: 5, 10, 15, 20, 25, and 30. For each, we use the validation performance averaged across 5 trials to select layer sizes (chosen from (128, 64), (256, 128), (512, 256)) and layer dropouts (chosen from (0.5, 0.5), (0.5, 0.25), (0.25, 0.1), (0, 0)). The Keras library [9] is used to implement the networks with the Adam optimizer [25] (with default Keras settings) and categorical cross-entropy loss. We use selu activations [28], a batch size of 128, and up to 30 epochs with early stopping based on the validation set (a sketch of this network follows the list).

  • The clickbait identification dataset introduced by [7] is used, which contains 16,000 clickbait and 16,000 non-clickbait headlines (available at https://github.com/bhargaviparanjape/clickbait/tree/master/dataset).

  • The headline popularity dataset introduced by [37] is used, which includes feedback metrics for about 100,000 news articles from various social media platforms (available at https://archive.ics.uci.edu/ml/machine-learning-databases/00432/Data/). For pre-processing, we remove those articles for which no popularity feedback data is available, and compute popularity by averaging the article's popularity percentiles across platforms; for example, an article whose Facebook and LinkedIn percentiles average to 0.85 is given a popularity score of 0.85.

  • We use the following summarizers: TextRank [36], SumBasic [43], LexRank [15], and KLSum [18] (implementations provided by the Sumy library, available at https://pypi.python.org/pypi/sumy).

  • We used the Scikit-learn [44] implementations of AdaBoost, decision trees, and logistic regression. To accommodate the imbalanced training data, balanced class weighting was used for the decision trees in AdaBoost and for logistic regression.
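A minimal sketch of the PPD prediction network described in the list above, written against tensorflow.keras (the paper reports using Keras [9]); the layer sizes and dropout rates correspond to the best configuration reported in Section 5.3, the 768-dimensional input assumes a BERT-base encoder, and the early-stopping patience is an assumption:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_ppd_network(n_quantiles=20, input_dim=768):
    """Two hidden selu layers with dropout and a softmax over document quantiles."""
    model = keras.Sequential([
        keras.Input(shape=(input_dim,)),
        layers.Dense(128, activation="selu"),
        layers.Dropout(0.25),
        layers.Dense(64, activation="selu"),
        layers.Dropout(0.1),
        layers.Dense(n_quantiles, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model

# Training roughly as described above (batch size 128, up to 30 epochs, early stopping):
# model.fit(X_train, Y_train_onehot, batch_size=128, epochs=30,
#           validation_data=(X_val, Y_val_onehot),
#           callbacks=[keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)])
```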

5 Experimental Results

We present our experimental results for the four types of approaches: handcrafted features (Sec. 5.1), n-gram features (Sec. 5.2), pre-trained sentence embeddings (Sec. 5.3), and cross-task models (Sec. 5.4). Additionally, in Table 3, we include the top-rated PQ sentences selected by several of the discussed models for two test articles.

Article URL https://nationalpost.com/news/arizona-man-dies-after-taking-chloroquine-for-coronavirus
True PQ Source “It’s all about finding the best spot to go to make the operation as safe as possible.”
Model Highest rated sentence
R_difficult They are alone and isolated from October to March; two dogs, once part of a sled patrol, keep them company.
POS_PRP “Given the weather situation I’m actually pretty happy,” he said.
A_concreteness But in those cases the area that could be covered was limited.
Char-2 “For this kind of flying, the key is good forecast data,” he said.
Word-1 “For this kind of flying, the key is good forecast data,” he said.
POS-2 “In order to get representative measurements of sea ice, you need to cover large distances,” Krumpen said.
PPD_20 “Given the weather situation I’m actually pretty happy,” he said.
Sent-BERT “In order to get representative measurements of sea ice, you need to cover large distances,” Krumpen said.
headline popularity To fill the data gap, some governments and other groups have conducted summer measurement campaigns from aircraft.
clickbait So the Wegener team - which this summer included two pilots, an engineer, a mechanic and another scientist in addition to Krumpen - spends a lot of time discussing the weather.
TextRank Operating from Station Nord, a small Danish military and scientific outpost in Greenland, about 575 miles from the geographic North Pole, the researchers measured ice thickness in the Arctic Ocean and in the Fram Strait, which separates Greenland from the Norwegian archipelago of Svalbard.
Article URL https://nationalpost.com/news/pilots-struggled-to-control-plane-that-crashed-in-indonesia
True PQ Source “Every accident is a combination of events, so there is disappointment all around here,” he said.
Model Highest rated sentence
R_difficult The new 737 MAX 8 plunged into the Java Sea on Oct. 29, killing all 189 people on board.
POS_PRP “Had they fixed the airplane, we would not have had the accident,” he said.
A_concreteness We will analyze any additional information as it becomes available,” the company said in a statement.
Char-2 “Had they fixed the airplane, we would not have had the accident,” he said.
Word-1 “Every accident is a combination of events, so there is disappointment all around here,” he said.
POS-2 “Every accident is a combination of events, so there is disappointment all around here,” he said.
PPD_20 Indonesian authorities released the findings Wednesday but were not expected to draw conclusions from the data they presented.
Sent-BERT “Every accident is a combination of events, so there is disappointment all around here,” he said.
headline popularity “Had they fixed the airplane, we would not have had the accident,” he said.
clickbait “Had they fixed the airplane, we would not have had the accident,” he said.
TextRank JAKARTA, Indonesia - Black box data collected from their crashed Boeing 737 MAX 8 show Lion Air pilots struggled to maintain control as the aircraft's automatic safety system repeatedly pushed the plane's nose down, according to a preliminary investigation into last month's disaster.
Table 3: The top sentence chosen for each of several models for two different test articles.

5.1 Handcrafted Features

The performance of each of our handcrafted features is provided in Figure 1. There are several interesting observations, including some that support and some that contradict the hypotheses made in Section 3.1:

  • Simply using the sentence location works better than random guessing. When we inspect the distribution of this feature value for PQ and non-PQ sentences in Figure 2a, we see that PQ sentences are not uniformly distributed throughout articles, but rather tend to occur slightly more often around a quarter of the way through the article.

  • The proportion of difficult words is the second-best handcrafted feature, outperforming other reading difficulty metrics. As we suggested in Section 3.1.1 and reflected in Figure 2b, PQ sentences are indeed easier to read than non-PQ sentences.

  • Of the POS tag densities, personal pronoun (PRP) and verb (VB) density are the most informative. Inspecting the feature distributions, we see that PQs tend to have slightly higher PRP density (Fig. 2c) as well as VB density – suggesting that sentences about people doing things are good candidates for PQs. In the following subsection we attempt to further investigate the importance of different POS tags.

  • Affective features tended to perform more poorly than expected, contradicting our (non-expert) intuition that more exciting or emotional sentences would be chosen for pull quotes. The exception is concreteness, which is indeed an informative feature, as we can also see in Figure 2d. The improved memorability that comes with more concrete texts [51] may help explain the beneficial effects of PQs on learning outcomes [60, 61].

Figure 1: Performance results (AUC) of individual handcrafted features. The reading difficulty, POS tag, and affective feature groups each produced at least one informative feature, with the single best feature being PRP density.
(a) Sentence position
(b) Proportion of difficult words
(c) Personal pronoun density
(d) Concreteness
Figure 2: The value distributions for several of the top performing hand-crafted features for both PQ sentences (solid blue lines) and non-PQ sentences (dashed orange lines).

5.2 N-Gram Features

The results for our n-gram models are provided in Table 4. Impressively, all n-gram models performed better than all handcrafted features, with the best model, character bi-grams, achieving an AUC of 76.0. When we inspect the learned logistic regression weights for the best variant of each model type (summarized in Table 5; a short sketch of this inspection follows the list), we make a few interesting observations:

  • The highest weighted character bi-grams exclusively aim to identify the beginnings of quotations, suggesting that the presence of a quote is highly informative. Curiously, although not shown in Table 5 due to space limitations, end-of-quotation indicators occur among the lowest weighted features. Additionally, a quotation mark that appears within the sentence but does not start it is a strong negative indicator.

  • Among the lowest weighted character bi-grams are also indicators of numbers, URLs, and possibly Twitter handles (e.g. bi-grams containing "@").

  • Although the highest weighted words are difficult to interpret together, among the lowest weighted words are those which indicate past tense: "called", "declined", "described", "included", "suggested". This suggests that a promising approach for PQ selection would include identifying the tense of each sentence.

  • The perspective offered by POS tags appears even more difficult to interpret. The single highest weighted POS tag sequence, "PDT-RB" (predeterminer-adverb), reflects such phrases as "…both happily…" or "…all grimly…". When inspecting the feature weights of the less successful POS-based model with n = 1, we see that the EX (existential there) tag is weighted highest, used in such phrases as "There is a place" or "The man said that there are no jobs."
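A short sketch of how the weights in Table 5 can be read off a fitted n-gram model, assuming a CountVectorizer and logistic regression as in the earlier sketch; get_feature_names_out requires a recent scikit-learn (older versions expose get_feature_names):

```python
import numpy as np

def extreme_ngrams(vectorizer, clf, k=10):
    """Return the k highest and k lowest weighted n-grams of a fitted model."""
    names = np.array(vectorizer.get_feature_names_out())
    weights = clf.coef_[0]                       # one logistic regression coefficient per n-gram
    order = np.argsort(weights)
    return names[order[-k:][::-1]], names[order[:k]]
```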

token   n=1    n=2    n=3
char    71.9   76.0   74.6
word    74.4   73.1   66.2
POS     69.2   72.2   71.3
Table 4: Performance results (AUC) of the n-gram models tested. Overall, the character-level n-grams worked best (especially with n = 2), followed by the word and POS-tag n-grams. A vocabulary size of 1000 was used for all models, and lower-casing was applied for the character and word models.
Model   Highest weighted
2-char  "h  "j  "f  "t  "o  "u  "k  "e  "c  "s
1-word  )  nothing  weve  "  seem  politics  entire  …  isnt  never
2-POS   PDT-RB  WP-EX  PDT-PRP  EX-PRP$  WRB-UH  WP-JJR  RB-WP$  SYM-VBN  RBR-WDT  WP$-JJS
Model   Lowest weighted
2-char  .c  m@  51  m/  62  _"  (@  .a  :3  _@
1-word  30  called  thursday  declined  described  m  similar  (  included  suggested
2-POS   NNPS-PRP$  TO-JJS  NNS-UH  SYM-DT  RB-POS  MD-VBP  JJR-FW  RP-RBR  VBP-UH  VBG-JJS
Table 5: The ten highest and ten lowest weighted n-grams for the best character (2-char), word (1-word), and POS-tag (2-POS) models. Weights are the coefficients learned by the corresponding logistic regression model.

5.3 Pre-trained Sentence Embeddings

The results of the two deep pre-trained sentence embedding techniques we evaluate are included in Figure 3. We note the following interesting observations:

  • The optimal number of PPD quantiles is around 20, with an AUC of 69.7: below the simpler n-gram features, but still impressive given the low dimensionality and the fact that the vectors only represent predicted position. Even with only 5 quantiles, the method outperforms all handcrafted features. The layer sizes for the best PPD model were found to be (128, 64), with dropout rates of (0.25, 0.1).

  • If we instead use the true position distributions with the same number of quantiles (i.e. one-hot vectors), the performance is indeed much worse than using the PPD.

  • The pre-trained Sentence-BERT embeddings perform the best out of all approaches tested. However, at an AUC of 77.6, they are only marginally better than the best n-gram approach, which achieves an AUC of 76.0.

Figure 3: Performance results of the PPD models, Sentence-BERT, and the true sentence quantile distributions. We see that using the predicted position distributions indeed works much better than using the true position distribution. However, the Sentence-BERT embeddings still considerably outperform the PPDs.

5.4 Cross-Task Models

Model                AUC
headline_popularity  57.1
clickbait            63.9
LexRank              52.1
SumBasic             45.8
KLSum                56.4
TextRank             56.4
Table 6: Performance results of the cross-task models.

The final set of models we consider provide insight into the cross-task performance of models built for the related problems of headline popularity prediction, clickbait identification, and summarization. The results for these models are shown in Table 6.

Considered holistically, the results suggest that PQs are not designed to inform the reader about what they are reading (the shared purpose of headlines and summaries) so much as they are designed to attract attention and motivate further engagement (the sole purpose of clickbait). However, the considerable performance gap between the clickbait model and PQ-specific models (such as character bi-grams and Sentence-BERT embeddings) suggests that this is only one aspect of choosing good pull quotes.

Another interesting observation is the variability in the performance of summarizers at PQ selection, with results ranging from worse than random to slightly better than random. Comparing against the summarization performance of these models as reported together in [8], we see that PQ selection performance is not strongly correlated with summarization performance.

6 Conclusion

In this paper we introduced the interesting task of automated pull quote selection, which has applications in education, writing assistance, and engagement maximization. To approach this problem, we created a PQ dataset with articles coming from a variety of online news outlets. We additionally describe and benchmark four groups of approaches: hand-crafted features inspired by related works and results from psychology and neuroscience, n-grams, deep pre-trained sentence embeddings, and cross-task models.

By closely examining the model results, we report many intriguing findings to inspire further research on PQ selection and related tasks. We find that, contrary to our intuition, sentiment is not as informative a feature as reading difficulty, personal pronoun density, or concreteness. By examining the highest and lowest weighted n-grams, we also uncover specific non-trivial linguistic patterns found in PQs. We also find that PQ source sentences are not uniformly spread throughout articles, and that using the predicted position distribution of a sentence leverages this to achieve good performance with very low dimensional sentence embeddings, though not quite as high as n-gram models or Sentence-BERT embeddings. Finally, by comparing to cross-task models, we provide evidence suggesting that PQs are chosen to catch interest, similar to clickbait, rather than to let the reader know more generally what they are reading about.

There are many interesting avenues for future research with regard to pull quotes. While we assume in this work that all true PQs in our dataset are of equal quality, it would be valuable to know the quality of individual PQs. Additionally, the creation of a pull quote by a copy editor can be a complex and nuanced process: beyond simply selecting the source sentences, the selected text is often abbreviated, paraphrased, or has words replaced. Thus, instead of PQ selection alone, future work could consider an editing process that turns the selected sentences into a final edited PQ.


References

  • [1] J. M. Aquino and K. M. Arnell (2007) Attention and the processing of emotional words: dissociating effects of arousal. Psychonomic bulletin & review 14 (3), pp. 430–435. Cited by: 3rd item, §3.1.3.
  • [2] S. Bird, E. Klein, and E. Loper (2009) Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc. Cited by: 1st item.
  • [3] T. Bohn, Y. Hu, J. Zhang, and C. X. Ling (2019) Learning sentence embeddings for coherence modelling and beyond. In RANLP, pp. 151–160. Cited by: §1, 2nd item.
  • [4] R. Braddock (1974) The frequency and placement of topic sentences in expository prose. Research in the Teaching of English 8 (3), pp. 287–302. Cited by: 2nd item.
  • [5] M. Brysbaert, A. B. Warriner, and V. Kuperman (2014) Concreteness ratings for 40 thousand generally known english word lemmas. Behavior research methods 46 (3), pp. 904–911. Cited by: 4th item, 5th item.
  • [6] D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, et al. (2018) Universal sentence encoder. arXiv preprint arXiv:1803.11175. Cited by: §3.3.
  • [7] A. Chakraborty, B. Paranjape, S. Kakarla, and N. Ganguly (2016) Stop clickbait: detecting and preventing clickbaits in online news media. In Advances in Social Networks Analysis and Mining (ASONAM), 2016 IEEE/ACM International Conference on, pp. 9–16. Cited by: §1, §2.2, §3.1.2, 8th item.
  • [8] Q. Chen, X. Zhu, Z. Ling, S. Wei, and H. Jiang (2016) Distraction-based neural networks for document summarization. arXiv preprint arXiv:1610.08462. Cited by: §5.4.
  • [9] F. Chollet et al. (2015) Keras. Note: https://keras.io Cited by: 7th item.
  • [10] J. Click and G. H. Stempel (1974) Reader response to modern and traditional front page make-up. American Newspaper Publishers Association. Cited by: §1.
  • [11] M. Coleman and T. L. Liau (1975) A computer readability formula designed for machine scoring.. Journal of Applied Psychology 60 (2), pp. 283. Cited by: 2nd item.
  • [12] H. Davoudi, A. An, and G. Edall (2019) Content-based dwell time engagement prediction model for news articles. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers), pp. 226–233. Cited by: §2.1.
  • [13] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: 1st item, §3.3.
  • [14] S. Dhar, V. Ordonez, and T. L. Berg (2011) High level describable attributes for predicting aesthetics and interestingness. In CVPR 2011, pp. 1657–1664. Cited by: §2.1.
  • [15] G. Erkan and D. R. Radev (2004) Lexrank: graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research 22, pp. 457–479. Cited by: §2.3, 10th item.
  • [16] R. Flesch (1979) How to write plain english: a book for lawyers and consumers. Harper & Row New York, NY. Cited by: 1st item.
  • [17] N. French (2018) InDesign type: professional typography with adobe indesign. Adobe Press. Cited by: §1.
  • [18] A. Haghighi and L. Vanderwende (2009) Exploring content models for multi-document summarization. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 362–370. Cited by: §2.3, 10th item.
  • [19] K. S. Hasan and V. Ng (2014-06) Automatic keyphrase extraction: a survey of the state of the art. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, Maryland, pp. 1262–1273. External Links: Link, Document Cited by: §1.
  • [20] T. Hastie, S. Rosset, J. Zhu, and H. Zou (2009) Multi-class adaboost. Statistics and its Interface 2 (3), pp. 349–360. Cited by: §3.1.
  • [21] T. Holmes (2015) Subediting and production for journalists: print, digital & social. Routledge. Cited by: §1.
  • [22] C. J. Hutto and E. Gilbert (2014) Vader: a parsimonious rule-based model for sentiment analysis of social media text. In Eighth international AAAI conference on weblogs and social media, Cited by: 2nd item.
  • [23] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov (2016) Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759. Cited by: §3.3.
  • [24] J. H. Kim, A. Mantrach, A. Jaimes, and A. Oh (2016) How to compete online for news audience: modeling words that attract clicks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1645–1654. Cited by: §2.1, §3.1.2.
  • [25] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: 7th item.
  • [26] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler (2015) Skip-thought vectors. In Advances in neural information processing systems, pp. 3294–3302. Cited by: §3.3.
  • [27] J. Kissler, C. Herbert, P. Peyk, and M. Junghofer (2007) Buzzwords: early cortical responses to emotional words during reading. Psychological Science 18 (6), pp. 475–480. Cited by: 3rd item.
  • [28] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter (2017) Self-normalizing neural networks. In Advances in neural information processing systems, pp. 971–980. Cited by: 7th item.
  • [29] D. Lagun and M. Lalmas (2016) Understanding and measuring user engagement and attention in online news reading. In Proceedings of the ACM International Conference on Web Search and Data Mining, pp. 113–122. Cited by: §2.1.
  • [30] S. Lamprinidis, D. Hardt, and D. Hovy (2018) Predicting news headline popularity with syntactic and semantic knowledge using multi-task learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 659–664. Cited by: §1, §2.1, §3.1.2.
  • [31] J. Lee and J. Lee (2018) Music popularity: metrics, characteristics, and audio-based prediction. IEEE Transactions on Multimedia 20 (11), pp. 3173–3182. Cited by: §2.1.
  • [32] Z. Liu, P. Li, Y. Zheng, and M. Sun (2009) Clustering to find exemplar terms for keyphrase extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1, pp. 257–266. Cited by: §2.3.
  • [33] H. P. Luhn (1958) The automatic creation of literature abstracts. IBM Journal of research and development 2 (2), pp. 159–165. Cited by: §2.3.
  • [34] B. D. McCandliss, L. Cohen, and S. Dehaene (2003) The visual word form area: expertise for reading in the fusiform gyrus. Trends in cognitive sciences 7 (7), pp. 293–299. Cited by: §3.1.3.
  • [35] O. Medelyan, E. Frank, and I. H. Witten (2009) Human-competitive tagging using automatic keyphrase extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3-Volume 3, pp. 1318–1327. Cited by: §2.3.
  • [36] R. Mihalcea and P. Tarau (2004) Textrank: bringing order into text. In Proceedings of the 2004 conference on empirical methods in natural language processing, pp. 404–411. Cited by: §2.3, 10th item.
  • [37] N. Moniz and L. Torgo (2018) Multi-source social feedback of online news feeds. arXiv preprint arXiv:1801.07055. Cited by: 9th item.
  • [38] R. Nallapati, F. Zhai, and B. Zhou (2017) Summarunner: a recurrent neural network based sequence model for extractive summarization of documents. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §2.3.
  • [39] R. Nallapati, B. Zhou, C. Gulcehre, B. Xiang, et al. (2016) Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023. Cited by: §2.3.
  • [40] R. Nallapati, B. Zhou, and M. Ma (2016) Classify or select: neural architectures for extractive document summarization. arXiv preprint arXiv:1611.04244. Cited by: §2.3.
  • [41] A. Nenkova, K. McKeown, et al. (2011) Automatic summarization. Foundations and Trends® in Information Retrieval 5 (2–3), pp. 103–233. Cited by: §2.3.
  • [42] A. Nenkova and K. McKeown (2012) A survey of text summarization techniques. In Mining text data, pp. 43–76. Cited by: §1, §2.3.
  • [43] A. Nenkova and L. Vanderwende (2005) The impact of frequency on summarization. Microsoft Research, Redmond, Washington, Tech. Rep. MSR-TR-2005 101. Cited by: §2.3, 10th item.
  • [44] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12 (Oct), pp. 2825–2830. Cited by: 11th item.
  • [45] A. Piotrkowicz, V. Dimitrova, J. Otterbacher, and K. Markert (2017) Headlines matter: using headlines to predict the popularity of news articles on twitter and facebook. In Eleventh International AAAI Conference on Web and Social Media, Cited by: §1, §2.1, §3.1.2.
  • [46] M. Potthast, S. Köpsel, B. Stein, and M. Hagen (2016) Clickbait detection. In European Conference on Information Retrieval, pp. 810–817. Cited by: §1, §2.2.
  • [47] S. Rayatdoost and M. Soleymani (2016) Ranking images and videos on visual interestingness by visual sentiment features.. In MediaEval, Cited by: §2.1.
  • [48] N. Reimers and I. Gurevych (2019) Sentence-bert: sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084. Cited by: §1, 1st item, §3.3, 6th item.
  • [49] K. Reinecke, T. Yeh, L. Miratrix, R. Mardiko, Y. Zhao, J. Liu, and K. Z. Gajos (2013) Predicting users’ first impressions of website aesthetics with a quantification of perceived visual complexity and colorfulness. In Proceedings of the SIGCHI conference on human factors in computing systems, pp. 2049–2058. Cited by: §2.1.
  • [50] P. L. L. Romary (2010) Automatic key term extraction from scientific articles in grobid. In SemEval 2010 Workshop, pp. 4. Cited by: §2.3.
  • [51] M. Sadoski, E. T. Goetz, and M. Rodriguez (2000) Engaging texts: effects of concreteness on comprehensibility, interest, and recall in four text types.. Journal of Educational Psychology 92 (1), pp. 85. Cited by: 4th item, 4th item.
  • [52] H. T. Schupp, J. Stockburger, M. Codispoti, M. Junghöfer, A. I. Weike, and A. O. Hamm (2007) Selective visual attention to emotion. Journal of neuroscience 27 (5), pp. 1082–1089. Cited by: §3.1.3.
  • [53] G. C. Stone (1987) Examining newspapers: what research reveals about america’s newspapers. Vol. 20, Sage Publications, Inc. Cited by: §1.
  • [54] J. G. Stovall (1997) Infographics: a journalist’s guide. Allyn & Bacon. Cited by: §1.
  • [55] Y. Sung, C. Liao, T. Chang, C. Chen, and K. Chang (2016) The effect of online summary assessment and feedback system on the summary writing on 6th graders: the lsa-based technique. Computers & Education 95, pp. 1–18. Cited by: §2.3.
  • [56] T. Tomokiyo and M. Hurst (2003) A language model approach to keyphrase extraction. In Proceedings of the ACL 2003 workshop on Multiword expressions: analysis, acquisition and treatment, pp. 33–40. Cited by: §2.3.
  • [57] P. Turney (1999) Learning to extract key phrases from text. NRC Technical Report ERB-1057, National Research Council, Canada. Cited by: §2.3, §2.3, §2.3.
  • [58] S. H. Utt and S. Pasternack (1985) Use of graphic devices in a competitive situation: a case study of 10 cities. Newspaper Research Journal 7 (1), pp. 7–16. Cited by: §1.
  • [59] L. Venneti and A. Alam (2018) How curiosity can be modeled for a clickbait detector. arXiv preprint arXiv:1806.04212. Cited by: §1, §2.2.
  • [60] W. Wanta and D. Gao (1994) Young readers and the newspaper: information recall and perceived enjoyment, readability, and attractiveness. Journalism Quarterly 71 (4), pp. 926–936. Cited by: §1, §1, §2.3, 4th item.
  • [61] W. Wanta and J. Remy (1994) Information recall of four newspaper elements among young readers. Cited by: §1, §2.3, 4th item.
  • [62] A. B. Warriner, V. Kuperman, and M. Brysbaert (2013) Norms of valence, arousal, and dominance for 13,915 english lemmas. Behavior research methods 45 (4), pp. 1191–1207. Cited by: 3rd item, 4th item.
  • [63] J. Zhang, Y. Zhao, M. Saleh, and P. J. Liu (2019) PEGASUS: pre-training with extracted gap-sentences for abstractive summarization. arXiv preprint arXiv:1912.08777. Cited by: §2.3.