Factuality Checking in News Headlines with Eye Tracking

06/17/2020 ∙ by Christian Hansen, et al. ∙ Københavns Universitet ∙ Aalborg University

We study whether it is possible to infer if a news headline is true or false using only the movement of the human eyes when reading news headlines. Our study with 55 participants who are eye-tracked when reading 108 news headlines (72 true, 36 false) shows that false headlines receive statistically significantly less visual attention than true headlines. We further build an ensemble learner that predicts news headline factuality using only eye-tracking measurements. Our model yields a mean AUC of 0.688 and is better at detecting false than true headlines. Through a model analysis, we find that eye-tracking 25 users when reading 3-6 headlines is sufficient for our ensemble learner.







1. Introduction and Prior Work

Factuality detection in headlines is important because headlines are often solely responsible for the user’s first impression (especially in mobile environments); but it is also challenging because, unlike full text, news headlines convey information succinctly and without reasoned argumentation or background.

We measure the overt attention of 55 participants who are eye-tracked when reading 108 news headlines. We find statistically significantly longer eye gazing and fixation durations when reading headlines of true, rather than false news, regardless of participant gender. We also train an ensemble learner, solely on eye-tracking data, to infer factuality in headlines. Our model yields a mean AUC of 0.688 and is better at detecting false headlines than true headlines. Further analysis shows that eye-tracking 25 users when reading 3-6 headlines is sufficient for our ensemble learner.

Eye tracking has long been used in IR to infer relevance (LobodaBB11; BuscherDE08; Buscher:2012:ADE:2070719.2070722; Hardoon:article; AjankiHKPS09; PuolamakiAK08) and to improve our understanding of users: for instance, adding information to search engine snippets significantly improves performance for informational tasks but degrades it for navigational tasks (CutrellG07); users with a higher change in knowledge differ significantly in the number and duration of fixations from users with a lower change in knowledge (BhattacharyaG18); and relevant documents tend to be read continuously, while irrelevant documents tend to be scanned (Gwizdka14). In most cases, cognitive effort inferred from eye-tracking data is highest for (at least) partially relevant documents and lowest for irrelevant documents.

Our findings complement prior findings that news posts from credible sources receive more gaze attention (SulflowSW19) and that false news tends to be read more quickly than accurate news (Gwizdka14). However, none of the above studies was done on headlines, and, to our knowledge, we present the first factuality inference model to be trained exclusively on eye-tracking data.

2. Experiment design

55 participants with normal or corrected-to-normal vision were recruited (24 female, 31 male; 19-33 years of age, median age 24), and each took part in a single eye-tracking session in a laboratory. At the start of each session, we logged the age and gender of the participant and then introduced the task and apparatus. The eye tracker was calibrated and the task commenced. On completion of the task, participants were debriefed and comments were solicited. At no time were participants informed about how well they were doing.

Each participant was shown a screen (white background) with three headlines (each on a separate line, in black font, size 36), without any further information. The headlines were centered on the screen, with 70 mm of space between them and 20 mm of space to the left border of the screen. Participants were asked to choose the most recent headline. This task was chosen deliberately to keep participants engaged in reading under circumstances where they were not directly checking for factuality. When participants had made their choice, the next screen (showing three new headlines) appeared. At no time did participants know that two of the headlines were true and one was false.

In total, 36 screens, each with three different headlines, were shown (108 unique headlines). To address order effects, we fully counterbalanced the position (top, middle, bottom) of the headlines, so that each position contained a factually false headline exactly 12 times. Participants could not move on to the next screen before answering, had no possibility of giving a "don't know" answer, and could not revisit a previous screen. All participants saw the same 36 screens, with the order of screens randomized across participants. No time limit was set for completing the task.
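As an illustration, the counterbalancing and screen-assembly step above can be sketched as follows. This is a minimal sketch with placeholder headline strings, not the authors' actual assignment script:

```python
import random

def build_screens(true_headlines, false_headlines, n_screens=36):
    """Assign 2 true + 1 false headline per screen, with the false
    headline appearing in each position (0=top, 1=middle, 2=bottom)
    exactly n_screens // 3 times."""
    assert len(true_headlines) == 2 * n_screens
    assert len(false_headlines) == n_screens
    # Fully counterbalanced position sequence: 12x top, 12x middle, 12x bottom.
    false_positions = [0, 1, 2] * (n_screens // 3)
    random.shuffle(false_positions)
    screens = []
    for i, pos in enumerate(false_positions):
        screen = list(true_headlines[2 * i:2 * i + 2])
        screen.insert(pos, false_headlines[i])  # place the false headline
        screens.append(screen)
    return screens, false_positions

screens, positions = build_screens(
    [f"true-{i}" for i in range(72)], [f"false-{i}" for i in range(36)])
```

The per-participant screen order would then be an independent shuffle of `screens`.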

To calibrate the experimental design, we did a pre-study on 11 participants with a subset of 24 screens. The pre-study did not lead to any changes in the design or protocol, except that the number of screens was increased to 36 because participants were faster than initially expected. In our analysis we combine the data from the pre-study with the remaining data to form the complete dataset.

Each participant performed the task individually, and was given the same oral instructions by the research assistant (https://github.com/Varyn/Factuality_Checking_News_Headlines_EyeTracking). Participants could at all times elect to stop the experiment (none did). The study was approved by the ethics board of our university, and all data was anonymized prior to storage and analysis.

The headlines shown to participants were crawled from the website of a reputable local newspaper (https://www.thelocal.dk/) and consisted of the full title of an article concerning local and national news. From the pool of crawled headlines, we selected 108 headlines that: (a) covered news that should be generally known to the public, (b) were formulated in approximately the same tone (i.e., no clickbait titles, no emphatics, no puns), and (c) were unlikely to provoke strong feelings. All headlines were selected manually by one of the authors of this paper (see Table 1 for their statistics).

All crawled headlines were factually true. We created factually false headlines by semantically reversing parts of some headlines. For example, among most expensive cities to relocate to became among least expensive cities to relocate to. All the semantic transformations we used to falsify headlines are shown in Table 2. When falsifying headlines, we made sure that they still appeared semantically plausible and sounded natural. To make sure that there was no bias stemming from the linguistic formulation of true versus false headlines, we POS-tagged all headlines (using the Stanford parser) and found that the proportion of content words (which are known to be fixated on by the human eye much more than function words (Rayner98)) was approximately the same in both true and false headlines (see Table 1). We make all 108 headlines freely available in the repository linked above.
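For illustration, a crude version of this content-word check can be sketched as below. The paper used Stanford-parser POS tags; this sketch only approximates them with a small, hand-picked function-word list, which is an assumption for illustration only:

```python
# Rough sketch: estimate the content-word proportion of a headline.
# A small closed-class (function word) list stands in for real POS tags.
FUNCTION_WORDS = {
    "a", "an", "the", "in", "on", "at", "of", "to", "for", "with",
    "and", "or", "but", "is", "are", "was", "were", "be", "as", "by",
    "that", "this", "it", "no", "not", "from", "among",
}

def content_word_ratio(headline):
    words = headline.lower().split()
    content = [w for w in words if w not in FUNCTION_WORDS]
    return len(content) / len(words)

# The true and falsified variants of the example headline get the same ratio,
# since the transformation swaps one content word for another:
r_true = content_word_ratio("among most expensive cities to relocate to")
r_false = content_word_ratio("among least expensive cities to relocate to")
```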

                                    True    False   Total
# Headlines                         72      36      108
Mean # words per headline           8.56    8.42    8.51
Mean # content words per headline   4.79    4.53    4.70
Mean # function words per headline  3.88    4.08    3.95
Table 1. Dataset statistics.

original text                              transformed text
more, most, best, top, highest, good       fewer, least, worst, bottom, lowest, bad
denies, fear, pick up award, react to      admits, love, stripped of award, praise
two … in top 50, remain, helping out       no … in top 50, exit, refuses to help
criticised, leads in, drops down           praised, last in, tops
cannot get enough of, calls for end        do not like, tolerates
looks to as inspiration                    uses as example to avoid
Table 2. All transformations that falsified news headlines.


We used an Eyetribe ET1000 desk-mounted, stream-based eye-tracker bar, paired with a 24-inch screen (resolution of 1920x1200 and 170 DPI). The eye tracker sampled the position of the eyes at a rate of 30 Hz and had a spatial resolution of 0.1 degree. We used iMotions (https://imotions.com/) to calibrate the eye tracker and collect the data. Participants were seated 60 cm away from the screen, and the room had soft standard artificial light. No head stabilisation was used (head movements were unconstrained, so the intrusiveness of the eye-movement measurement was minimal). We calibrated the eye tracker using a standard 9-point calibration prior to each recording.

Participants indicated which of the three headlines per screen was the most recent by typing 1, 2, or 3 on the keyboard (for the top, middle, and bottom headline, respectively). Typing was chosen over using the cursor because the cursor could interfere considerably with eye tracking.

Eye-tracking measures

A fixation is a stable eye-in-head position within a dispersion threshold (typically 2 degrees), above a duration threshold (typically 100-200 milliseconds; we set the fixation threshold at 100 milliseconds), and with velocity below a threshold (typically 15-100 degrees per second). Gaze duration is the cumulative duration of a sequence of consecutive fixations within an area of interest (AOI). We defined a separate AOI around each headline and analysed these 5 measures: the total time spent fixating inside an AOI (total fixation duration); the total number of fixations inside an AOI (total fixation count); the total time spent gazing inside an AOI (total gaze duration; gaze duration consists of the duration of fixations and other captured gaze activity, such as time between fixations, inside an AOI); the average fixation duration inside an AOI (total fixation duration divided by total fixation count); and the duration of the first fixation inside an AOI (first fixation duration).
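The five measures can be sketched from a list of fixation events as follows. This is a minimal sketch: it assumes fixations are already detected and assigned to the AOI, and it approximates gaze duration as the span from the first fixation onset to the last fixation offset, which is one possible reading of the definition above:

```python
def aoi_measures(fixations):
    """Compute the five eye-tracking measures for one AOI.
    `fixations` is a time-ordered list of (start_ms, end_ms) fixation
    events already assigned to this AOI (a simplifying assumption)."""
    if not fixations:
        return None
    durations = [end - start for start, end in fixations]
    total_fix = sum(durations)
    count = len(fixations)
    return {
        "total_fixation_duration": total_fix,
        "total_fixation_count": count,
        # Gaze duration also includes time between consecutive fixations
        # inside the AOI; approximated here as first onset -> last offset.
        "total_gaze_duration": fixations[-1][1] - fixations[0][0],
        "average_fixation_duration": total_fix / count,
        "first_fixation_duration": durations[0],
    }

m = aoi_measures([(0, 150), (200, 420), (500, 620)])
```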

3. Findings

We now study the statistical effect headline factuality has on the eye-tracking measures. Let M denote any of the above 5 eye-tracking measures. To establish whether factuality affects each M in a statistically significant way, we consider both fixed effects (gender, headline length, position of headline on screen) and random effects. These fixed and random effects are potentially non-negligible, meaning that conventional methods for inferential data analysis, such as ANOVA and general linear regression, are not applicable (LobodaBB11). We therefore fit a mixed model (mixedcomplex) that uses M as a response and the fixed effects as explanatory variables. Because each participant is drawn from some larger population, the participant is included as a random intercept. The mixed model for each M is:

M = b_0 + b_true * 1_true + b_middle * 1_middle + b_bottom * 1_bottom + b_male * 1_male + b_length * length + u_participant

where b_f is the coefficient for factor f and 1_f is the indicator function for factor f, e.g. 1_male = 1 if the participant is male and 1_male = 0 otherwise. For the categorical variables of position (middle, bottom), gender (male), and factuality (true), there is one fewer factor than the number of categories (the reference category is absorbed into the intercept). length is the normalised length of the headline with zero mean and unit variance, u_participant is the random effect for the participant, and b_0 is the intercept. The model is fitted using the values of M collected; these values are normalised so that the scale of the coefficients is comparable across measures, which otherwise have different scales.

The coefficient b_true shows the relation between the measure M and the factuality of the headline. We formulate the null hypothesis H_0 for M as the assumption that factuality does not affect M, that is b_true = 0. To test this hypothesis, we compute p-values and confidence intervals for each coefficient by performing Wald tests. We have 5 different eye-tracking measures, so we perform 5 hypothesis tests with Bonferroni correction, requiring that p < 0.01 to reject each H_0. All statistical analysis is done using StatsModels (https://www.statsmodels.org/stable/index.html, version 0.9), and the models are fitted using Maximum Likelihood.
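A minimal StatsModels sketch of such a mixed model with a per-participant random intercept is shown below, on synthetic stand-in data. All column names and values are illustrative, not the study's data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 55 * 108  # participants x headlines (synthetic stand-in data)
df = pd.DataFrame({
    "measure": rng.normal(size=n),          # one normalised eye-tracking measure
    "is_true": rng.integers(0, 2, size=n),  # factuality indicator
    "position": rng.choice(["top", "middle", "bottom"], size=n),
    "male": rng.integers(0, 2, size=n),
    "length": rng.normal(size=n),           # normalised headline length
    "participant": rng.integers(0, 55, size=n),
})

# Fixed effects as in the paper's model; random intercept per participant.
model = smf.mixedlm(
    "measure ~ is_true + C(position, Treatment('top')) + male + length",
    df, groups=df["participant"])
fit = model.fit()  # Maximum Likelihood fit

p_true = fit.pvalues["is_true"]  # Wald-test p-value for the factuality coefficient
# Bonferroni over the 5 measures: reject H_0 only if p_true < 0.01
```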

Table 3 shows the resulting coefficients. We see that for total gaze duration, total fixation duration, and total fixation count, p < 0.01; thus we have sufficient evidence to reject the null hypothesis. These three eye-tracking measures change significantly when reading true versus false headlines. However, for average fixation duration and first fixation duration, we cannot reject the null hypothesis, and thus we cannot conclude that the time spent on each individual fixation changes between factually true and false headlines. We also observe that a factually true headline causes the total gaze duration, total fixation duration, and total fixation count to increase, as seen from the positive value of b_true; this means that false headlines in general have shorter fixation and gaze durations than true headlines. The fact that factuality is not significant for average fixation duration means that the increased total fixation duration for true headlines is caused by an increase in total fixation count for factually true headlines.

We now briefly discuss the coefficients other than b_true. From b_middle and b_bottom, we see that the position of the headline is not significant for total gaze duration, while the bottom position is significant for all measures of fixation. The negative value of b_bottom shows that all measures of fixation decrease when the headline is placed at the bottom. The length of the headline is significant for all eye-tracking measures (p < 0.001), with longer headlines having higher measures. Lastly, we observe no significant difference in any measure between the genders.

              Coef.   Std.Err.   z        P>|z|    [0.025   0.975]
Total gaze duration
  b_true      0.154   0.023      6.697    <0.001    0.109    0.199
  b_middle   -0.026   0.027     -0.959     0.338   -0.078    0.027
  b_bottom   -0.054   0.027     -2.020     0.043   -0.106   -0.002
  b_male     -0.149   0.154     -0.969     0.333   -0.451    0.153
  b_length    0.174   0.011     15.844    <0.001    0.153    0.196
Total fixation duration
  b_true      0.109   0.021      5.301    <0.001    0.069    0.149
  b_middle   -0.083   0.024     -3.474    <0.001   -0.129   -0.036
  b_bottom   -0.239   0.024    -10.059    <0.001   -0.285   -0.192
  b_male     -0.202   0.182     -1.109     0.267   -0.558    0.155
  b_length    0.100   0.010     10.154    <0.001    0.081    0.119
Total fixation count
  b_true      0.115   0.020      5.609    <0.001    0.075    0.155
  b_middle   -0.037   0.024     -1.536     0.124   -0.083    0.010
  b_bottom   -0.199   0.024     -8.420    <0.001   -0.246   -0.153
  b_male     -0.164   0.184     -0.894     0.371   -0.524    0.196
  b_length    0.118   0.010     12.011    <0.001    0.099    0.137
Average fixation duration
  b_true      0.025   0.022      1.106     0.269   -0.019    0.068
  b_middle   -0.003   0.026     -0.125     0.900   -0.054    0.047
  b_bottom   -0.130   0.026     -5.061    <0.001   -0.181   -0.080
  b_male     -0.006   0.171     -0.038     0.970   -0.342    0.329
  b_length    0.059   0.011      5.509    <0.001    0.038    0.079
First fixation duration
  b_true      0.034   0.024      1.411     0.158   -0.013    0.081
  b_middle    0.014   0.028      0.484     0.628   -0.041    0.068
  b_bottom   -0.120   0.028     -4.321    <0.001   -0.175   -0.066
  b_male     -0.016   0.148     -0.106     0.915   -0.305    0.274
  b_length    0.056   0.011      4.906    <0.001    0.034    0.079
Table 3. The fixed effects for the five eye-tracking measures. p-values below 0.01 are significant after Bonferroni correction. (See Section 3 for notation.)

Learning to infer factuality from eye tracking

Having established that total gaze duration, total fixation duration, and total fixation count are all significantly different depending on headline factuality, we next investigate whether these measures provide sufficient signal for training a headline factuality classifier. As these measures are highly dependent on the length and position of the headlines, length and position are also included in the model. We observe that total fixation duration is highly correlated with total fixation count; thus, to keep the model as simple as possible, we only use total gaze duration and total fixation duration.

In Table 3, we see that the coefficient of factuality (b_true) is, for many measures, less influential than the position and length of the headline. Thus, we expect the eye-tracking measures of only a single participant to be noisy. Due to this, we use an ensembling approach, where the predicted factuality of a headline is computed as an average over a set of ensembling participants P:

ŷ_h = (1/|P|) · Σ_{p∈P} ŷ_{h,p}

where ŷ_h is the factuality prediction for headline h, and ŷ_{h,p} is the factuality prediction for headline h for participant p. Due to the relatively small size of our dataset, we propose to use the average of two simple second-order logistic models for estimating ŷ_{h,p}:

ŷ_{h,p} = (1/2) · [ σ(a_0 + a_1·g_{p,h} + a_2·z_h + a_3·g_{p,h}·z_h) + σ(c_0 + c_1·f_{p,h} + c_2·z'_h + c_3·f_{p,h}·z'_h) ]

where σ is the logistic function, a and c are the learned coefficients of the logistic models, g_{p,h} is the total gaze duration for participant p on headline h, f_{p,h} is the total fixation duration for p on h, and z_h and z'_h are either the length or the position of headline h. Both logistic models have one eye-tracking measure interacting with either the length or position of the headline, where the interaction is chosen based on the pair with the lowest correlation. We choose to use two simple logistic models, instead of a single combined model, to increase the variance of the predicted factuality, as high variance is beneficial for ensembling. We standardize (zero mean and unit variance) the eye-tracking measures from each participant across all headlines. Lastly, the two logistic models are trained using Maximum Likelihood on a set of training participants.
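This ensembling scheme can be sketched as follows. The pairings here (gaze duration with length, fixation duration with position) are an assumption for illustration, as is the use of scikit-learn's (regularized-by-default) logistic regression in place of the authors' exact fitting procedure:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def second_order_features(x, z):
    """Design matrix [x, z, x*z]: one measure interacting with one covariate."""
    return np.column_stack([x, z, x * z])

def fit_models(gaze, fix, length, position, y):
    """Fit the two logistic models on training participants' data.
    The pairings (gaze x length, fixation x position) are hypothetical."""
    m1 = LogisticRegression().fit(second_order_features(gaze, length), y)
    m2 = LogisticRegression().fit(second_order_features(fix, position), y)
    return m1, m2

def predict_participant(m1, m2, gaze, fix, length, position):
    """y_hat_{h,p}: average of the two logistic models for one participant."""
    p1 = m1.predict_proba(second_order_features(gaze, length))[:, 1]
    p2 = m2.predict_proba(second_order_features(fix, position))[:, 1]
    return (p1 + p2) / 2

def ensemble_predict(per_participant_preds):
    """y_hat_h: mean over the ensembling participants' predictions."""
    return np.mean(per_participant_preds, axis=0)
```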


We evaluate the model by inferring factuality on unseen headlines using Monte Carlo cross-validation over 100,000 iterations. In each iteration, the participants are split for training and ensembling (27 and 28 participants, respectively), and three headlines are chosen for evaluation (2 true and 1 false), while the remaining headlines are used for training. We report the mean AUC and mean accuracy, across all iterations.
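One Monte Carlo iteration's split can be sketched as below; participant and headline identifiers are placeholders:

```python
import random

def mc_split(participants, true_headlines, false_headlines, seed=None):
    """One Monte Carlo cross-validation split: 27 training participants,
    28 ensembling participants, and a 3-headline eval set (2 true, 1 false)."""
    rng = random.Random(seed)
    people = list(participants)
    rng.shuffle(people)
    train_p, ensemble_p = people[:27], people[27:]
    eval_h = rng.sample(true_headlines, 2) + rng.sample(false_headlines, 1)
    train_h = [h for h in true_headlines + false_headlines if h not in eval_h]
    return train_p, ensemble_p, eval_h, train_h

train_p, ensemble_p, eval_h, train_h = mc_split(
    range(55), [f"t{i}" for i in range(72)],
    [f"f{i}" for i in range(36)], seed=0)
```

Repeating this 100,000 times and averaging AUC/accuracy over iterations gives the reported scores.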

As reported in Table 4, we find that our ensemble model predicts the factuality of unseen headlines with a mean AUC of 0.688 and an accuracy of 0.634, which is higher on false headlines (0.662) than on true ones (0.619). There is no prior work on automatically detecting factuality in news headlines alone, but related work on inferring factuality in text (not headlines, which are harder) using textual features alone (not eye-tracking features) reports accuracies ranging from 0.39 (Wang17) to 0.76 (Perez-RosasKLM18), and even up to 0.86 (poliak-EtAl:2018:S18-2) when using BiLSTMs and a multilayer perceptron classifier with refined linguistic features such as entailment and contradiction. Comparatively, we have a simple learning model, which uses weaker input features (eye-tracking measures are less discriminative than textual ones) and solves a more difficult problem (factuality checking in headlines instead of longer texts).

Mean AUC Mean Acc. Mean Acc. (True) Mean Acc. (False)
0.688 0.634 0.619 0.662
Table 4. Factuality performance scores from our eye-tracking ensemble model.


In the above, we standardize the eye-tracking measures for each participant on all headlines. We now ask: how important is this standardization, and would standardization on fewer headlines suffice? We answer this by sampling fewer headlines to base the standardization on, while still preserving the ratio of 2 true headlines for each false one. We refer to three headlines following this ratio as a “screen”.
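Per-participant standardization on a chosen subset of screens can be sketched as below (a minimal illustration; the use of the population standard deviation is an assumption):

```python
import numpy as np

def standardize_on_screens(measures, screen_ids, use_screens):
    """Standardize one participant's measure vector using only headlines
    from `use_screens` to estimate the mean and standard deviation."""
    measures = np.asarray(measures, dtype=float)
    screen_ids = np.asarray(screen_ids)
    mask = np.isin(screen_ids, list(use_screens))
    mu, sigma = measures[mask].mean(), measures[mask].std()
    return (measures - mu) / sigma

# Standardize all 6 values using only the 3 headlines on screen 0:
vals = standardize_on_screens([1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                              [0, 0, 0, 1, 1, 1], use_screens={0})
```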

Figure 1. Performance analysis when varying (left) the number of screens used for participant standardization in our model and (right) the number of participants used for ensembling.

Figure 1 (left) shows the mean accuracy and AUC when varying the number of screens used for standardization. When standardizing only on the screen we predict on (screens = 1), the mean AUC is at its minimum; it increases drastically up to 6 screens and then stabilizes for the remaining numbers of screens. As the number of screens increases, the accuracy for true headlines decreases slightly while the accuracy for false headlines increases, but beyond 6-18 screens the effect of including more screens is minimal. This suggests that the performance of our ensemble model does not depend on having a large set of headlines for standardization. In a live deployment, a few headlines for standardization could suffice to reach the accuracy and AUC levels reported in this study.

The results reported above correspond to splitting participants approximately 50/50 for training and ensembling, and this split can of course be varied; Figure 1 (right) plots mean accuracy and AUC (y axis) across a varying number of participants used for ensembling out of the 55 participants in total. We see that the choice of a 50/50 split is close to optimal. The fact that performance drops rapidly when 15 or fewer participants are used for ensembling indicates that aggregating over a large set of participants is at least as important as training a model on more data, in this setup. This happens because our dataset is small (we have few participants), so the optimal performance is a trade-off between training a better model (requiring more participants for training) and aggregating over more participants (requiring more participants in the ensemble).

4. Conclusions

We studied whether the human eye moves differently when reading factually true versus factually false news headlines, and if we can infer factuality in news headlines using only eye-tracking signals. In an experiment with 55 users reading 108 news headlines, we found that false headlines receive statistically significantly less visual attention than true ones. We used this to build an ensemble learner that predicts news headline factuality using only eye-tracking measurements, which obtained a mean AUC score of 0.688 and a mean accuracy of 0.634.

Future work includes investigating eye tracking as a boosting mechanism to potentially improve factuality detection based on text processing, and examining how eye movements in factuality checking relate to those in more typical IR tasks such as search. A different promising direction for future work is to repeat our study "in the wild", outside the usual laboratory setting, using eye-tracking methods of lower fidelity, such as typical cameras mounted on laptops and smartphone cameras.