1. Introduction and Prior Work
Factuality detection in headlines is important because headlines are often solely responsible for the user’s first impression (especially in mobile environments); but it is also challenging because, unlike full text, news headlines convey information succinctly and without reasoned argumentation or background.
We measure the overt attention of 55 participants who are eye-tracked when reading 108 news headlines. We find statistically significantly longer eye gazing and fixation durations when reading headlines of true, rather than false news, regardless of participant gender. We also train an ensemble learner, solely on eye-tracking data, to infer factuality in headlines. Our model yields a mean AUC of 0.688 and is better at detecting false headlines than true headlines. Further analysis shows that eye-tracking 25 users when reading 3-6 headlines is sufficient for our ensemble learner.
Eye tracking has long been used in IR to infer relevance (LobodaBB11; BuscherDE08; Buscher:2012:ADE:2070719.2070722; Hardoon:article; AjankiHKPS09; PuolamakiAK08) and to improve user understanding, for instance that adding information to search engine snippets significantly improves performance for informational tasks but degrades performance for navigational tasks (CutrellG07); that users with higher change in knowledge differ significantly in terms of the number and duration of fixations compared to users with lower knowledge-change (BhattacharyaG18); and that relevant documents tend to be continuously read, while irrelevant documents tend to be scanned (Gwizdka14). In most cases, cognitive effort inferred from eye-tracking data is highest for (at least) partially relevant documents and lowest for irrelevant documents.
Our findings complement prior findings that news posts from credible sources receive more gaze attention (SulflowSW19) and that false news tends to be read more quickly than accurate news (Gwizdka14). However, none of the above studies was conducted on headlines, and, to our knowledge, we present the first factuality inference model to be trained exclusively on eye-tracking data.
2. Experiment design
55 participants with normal or corrected-to-normal vision were recruited (24 females, 31 males; 19-33 years of age, median age 24), and each participated in a single eye tracking session in a laboratory. At the start of each session, we logged the age and gender of each participant and then introduced the task and apparatus. The eye tracker was calibrated and the task commenced. On completion of the task, participants were debriefed and comments were solicited. At no time were participants informed about how well they were doing. Each participant was shown a screen (white background) with three headlines (each on a separate line, in black font, size=36), without any further information. The headlines were centered on the screen, with 70mm of space between them and 20mm of space to the left border of the screen. Participants were asked to choose the most recent headline. This task was chosen on purpose to keep participants engaged in reading under circumstances where they were not directly checking for factuality. When participants had made their choice, the next screen (showing three new headlines) appeared. Participants did not know that two of the headlines were true and one was false, at any time. In total, 36 screens, each with three different headlines, were shown (108 unique headlines). To address order effects, we fully counterbalanced the position (top, middle, bottom) of the headlines, so that each position contained a factually false headline exactly 12 times. Participants could not move on to the next screen before answering, with no possibility of giving a “don’t know”-answer, and could not revisit a previous screen. All participants saw the same 36 screens with the order of screens randomized across participants. No time limit was set for completing the task.
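The counterbalancing constraint described above can be illustrated with a small sketch. The source does not describe how positions were actually assigned, so the function names `false_positions` and `screen_order` and the shuffling procedure are hypothetical; the sketch only shows one way to satisfy the stated constraint (each position holds the false headline exactly 12 times over 36 screens, and screen order is randomized per participant).

```python
import random

POSITIONS = ("top", "middle", "bottom")

def false_positions(n_screens=36, seed=0):
    """Assign the false headline's position per screen, using each position equally often."""
    if n_screens % len(POSITIONS) != 0:
        raise ValueError("n_screens must be divisible by the number of positions")
    per_position = n_screens // len(POSITIONS)
    # Build a balanced pool (12 x top, 12 x middle, 12 x bottom for 36 screens).
    slots = [pos for pos in POSITIONS for _ in range(per_position)]
    random.Random(seed).shuffle(slots)
    return slots

def screen_order(n_screens=36, seed=None):
    """Randomized presentation order of the screens for one participant (assumed procedure)."""
    order = list(range(n_screens))
    random.Random(seed).shuffle(order)
    return order
```

Fixing the position assignment once and shuffling only the screen order per participant keeps the screens identical across participants, as the text requires.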
To calibrate the experimental design, we did a pre-study on 11 participants with a subset of 24 screens. The pre-study did not lead to any changes in the design or protocol, except that the number of screens was increased to 36 because participants were faster than initially expected. In our analysis we combine the data from the pre-study with the remaining data to form the complete dataset.
Each participant performed the task individually, and was given the same oral instructions by the research assistant (footnote 1: https://github.com/Varyn/Factuality_Checking_News_Headlines_EyeTracking). Participants could at all times elect to stop the experiment (none did). The study was approved by the ethics board of our university, and all data was anonymized prior to storage and analysis.
The headlines shown to participants were crawled from the website of a reputable local newspaper (https://www.thelocal.dk/) and consisted of the full title of an article concerning local and national news. From the pool of crawled headlines, we selected 108 headlines that: (a) covered news that should be generally known to the public, (b) were formulated in approximately the same tone (i.e., no clickbait titles, no emphatics, no puns), and (c) were unlikely to provoke strong feelings. All headlines were selected manually by one of the authors of this paper (see Table 2 for their statistics).
All crawled headlines were factually true. We created factually false headlines by semantically reversing parts of some headlines. For example, "among most expensive cities to relocate to" became "among least expensive cities to relocate to". All the semantic transformations we used to falsify headlines are shown in Table 2. When falsifying headlines, we made sure that they still appeared semantically plausible and sounded natural. To make sure that there is no bias stemming from the linguistic formulation of true versus false headlines, we POS-tagged all headlines (using the Stanford parser) and found that the proportion of content words (which are known to be fixated on by the human eye much more than function words (Rayner98)) was approximately the same in both true and false headlines (see Table 2). We make all 108 headlines freely available (https://github.com/Varyn/Factuality_Checking_News_Headlines_EyeTracking).
|Statistic||True||False||All|
|Mean # words per headline||8.56||8.42||8.51|
|Mean # content words per headline||4.79||4.53||4.70|
|Mean # function words per headline||3.88||4.08||3.95|
|original text||transformed text|
|more, most, best, top, highest, good||fewer, least, worst, bottom, lowest, bad|
|denies, fear, pick up award, react to||admits, love, stripped of award, praise|
|two … in top 50, remain, helping out||no … in top 50, exit, refuses to help|
|criticised, leads in, drops down||praised, last in, tops|
|cannot get enough of, calls for end||do not like, tolerates|
|looks to as inspiration||uses as example to avoid|
We used an Eyetribe ET1000 desk-mounted stream-based eye tracker bar, paired with a 24-inch screen (resolution of 1920x1200 and 170 DPI). The eye tracker sampled the position of the eyes at a rate of 30 Hz and had a spatial resolution of 0.1 degree. We used iMotions (https://imotions.com/) to calibrate the eye tracker and collect the data. Participants were placed 60cm away from the screen, and the room had soft standard artificial light. No head stabilisation was used (head movements were unconstrained, so that the eye-tracking measurement was minimally intrusive). We calibrated the eye tracker using a standard 9-point calibration prior to each recording.
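To relate the angular quantities above to on-screen distances, the following sketch converts between visual angle and physical extent at the stated 60cm viewing distance. It assumes a flat screen viewed head-on with the extent centered on the line of sight; the helper names are hypothetical and not part of the study's tooling.

```python
import math

VIEWING_DISTANCE_MM = 600.0  # 60 cm, as stated in the text

def visual_angle_to_mm(angle_deg, distance_mm=VIEWING_DISTANCE_MM):
    """On-screen extent (mm) subtended by a visual angle at the given viewing distance."""
    return 2.0 * distance_mm * math.tan(math.radians(angle_deg) / 2.0)

def mm_to_visual_angle(extent_mm, distance_mm=VIEWING_DISTANCE_MM):
    """Visual angle (degrees) subtended by an on-screen extent at the given viewing distance."""
    return math.degrees(2.0 * math.atan(extent_mm / (2.0 * distance_mm)))

# Extent covered by the tracker's 0.1 degree spatial resolution.
resolution_mm = visual_angle_to_mm(0.1)
# Visual angle spanned by the 70 mm gap between headlines.
headline_gap_deg = mm_to_visual_angle(70.0)
```

Under these assumptions, the 0.1 degree resolution corresponds to roughly 1 mm on screen, which is small relative to the 70 mm spacing between headlines.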
Participants indicated which of the three headlines per screen was the most recent by typing 1, 2, or 3 on the keyboard (for the top, middle, and bottom headline, respectively). Typing was chosen over using the cursor because the cursor could interfere considerably with eye tracking.
A fixation is a stable eye-in-head position within a dispersion threshold (typically 2 degrees), above a duration threshold (typically 100-200 milliseconds; we set the fixation threshold at 100 milliseconds), and with velocity below a threshold (typically 15-100 degrees per second). Gaze duration is the cumulative duration of a sequence of consecutive fixations within an area of interest (AOI). We defined a separate AOI around each headline and analysed these 5 measures: the total time spent fixating inside an AOI (total fixation duration); the total number of fixations inside an AOI (total fixation count); the total time spent gazing inside an AOI (total gaze duration; gaze duration consists of the duration of fixations and other captured gaze activity, such as time between fixations, inside an AOI); the average fixation duration inside an AOI (total fixation duration divided by total fixation count); and the duration of the first fixation inside an AOI (first fixation duration).
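The five measures can be made concrete with a minimal sketch that computes them from a time-ordered list of fixations already labelled with their AOI. The `Fixation` record and `aoi_measures` function are hypothetical names, and treating gaze duration as the span of each run of consecutive in-AOI fixations is an assumption based on the definition above, not a description of how iMotions computes it.

```python
from dataclasses import dataclass

@dataclass
class Fixation:
    start_ms: float
    end_ms: float
    aoi: str  # label of the AOI the fixation falls in, e.g. "top"

def aoi_measures(fixations, aoi):
    """Compute the five per-AOI measures from a time-ordered list of fixations."""
    inside = [f for f in fixations if f.aoi == aoi]
    if not inside:
        return None
    durations = [f.end_ms - f.start_ms for f in inside]

    # Gaze duration: span of each run of consecutive fixations in the AOI,
    # which also covers the time between those fixations (assumed reading).
    total_gaze = 0.0
    run_start = run_end = None
    for f in fixations:
        if f.aoi == aoi:
            if run_start is None:
                run_start = f.start_ms
            run_end = f.end_ms
        elif run_start is not None:
            total_gaze += run_end - run_start
            run_start = None
    if run_start is not None:
        total_gaze += run_end - run_start

    total_fix = sum(durations)
    return {
        "total_fixation_duration": total_fix,
        "total_fixation_count": len(inside),
        "total_gaze_duration": total_gaze,
        "average_fixation_duration": total_fix / len(inside),
        "first_fixation_duration": durations[0],
    }
```

Because the gaze span includes the gaps between fixations, total gaze duration is never smaller than total fixation duration for the same AOI.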
We now study the statistical effect that headline factuality has on the eye-tracking measures. Let $m$ denote any of the above 5 eye-tracking measures. To establish whether factuality affects each of these $m$s in a statistically significant way, we consider both fixed effects (gender, headline length, position of headline on screen) and random effects. These fixed and random effects are potentially non-negligible, meaning that conventional methods for inferential data analysis, such as ANOVA and general linear regression, are not applicable (LobodaBB11). We therefore fit a mixed model (mixedcomplex) that uses the above $m$s as a response and the fixed effects as explanatory variables. Because each participant is drawn from some larger population, the participant is included as a random intercept. The mixed model for each of the above $m$s is:

$$m = \beta_0 + \beta_{\text{true}} I_{\text{true}} + \beta_{\text{male}} I_{\text{male}} + \beta_{\text{middle}} I_{\text{middle}} + \beta_{\text{bottom}} I_{\text{bottom}} + \beta_{\text{length}}\, l + b_p$$

where $\beta_f$ is the coefficient for the factor $f$ and $I_f$ is the indicator function for the factor $f$, e.g. $I_{\text{male}} = 1$ if the participant is male and $I_{\text{male}} = 0$ otherwise. For the categorical variables of position (middle, bottom), gender (male), and factuality (true), there is one fewer factor than the number of categories ($k-1$ factors for $k$ categories). $l$ is the normalised length of the headline with zero mean and unit variance, $b_p$ is the random effect for the participant $p$, and $\beta_0$ is the intercept. The model is fitted using the $m$s collected; these $m$s are normalised so that the scale of the coefficients is comparable across measures, which otherwise have different scales.

The coefficient $\beta_{\text{true}}$ shows the relation between the measure $m$ and the factuality of the headline. We formulate the null hypothesis $H_0$ for $m$ as the assumption that factuality does not affect $m$, that is $\beta_{\text{true}} = 0$. To test this hypothesis, we compute $p$-values and confidence intervals for each coefficient by performing Wald tests. We have 5 different eye-tracking measures, so we perform 5 hypothesis tests with Bonferroni correction, requiring that $p < \alpha/5$ (for significance level $\alpha$) to reject each $H_0$. All statistical analysis is done using StatsModels (https://www.statsmodels.org/stable/index.html), version 0.9, and the models are fitted using Maximum Likelihood.
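As a rough illustration of this analysis, the sketch below fits a random-intercept mixed model with StatsModels on synthetic data and applies a Bonferroni-corrected threshold to the factuality coefficient. The data, the effect sizes, the column names (`is_true`, `is_male`, `is_middle`, `is_bottom`, `length`) and the significance level of 0.05 are all assumptions made for the example; they are not the study's data or settings.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_participants, n_headlines = 30, 36  # synthetic sizes, not the study's

rows = []
for p in range(n_participants):
    is_male = p % 2
    b_p = rng.normal(0.0, 0.5)            # participant random intercept
    for h in range(n_headlines):
        is_true = int(h % 3 != 0)         # two true headlines per false one
        position = rng.integers(0, 3)     # 0=top, 1=middle, 2=bottom
        length = rng.normal()             # normalised headline length
        # Synthetic response with made-up effect sizes plus noise.
        m = 0.5 * is_true - 0.3 * (position == 2) + 0.6 * length + b_p + rng.normal()
        rows.append(dict(participant=p, is_true=is_true, is_male=is_male,
                         is_middle=int(position == 1), is_bottom=int(position == 2),
                         length=length, m=m))
data = pd.DataFrame(rows)

# Fixed effects via indicator columns; participant as random intercept.
model = smf.mixedlm("m ~ is_true + is_male + is_middle + is_bottom + length",
                    data, groups=data["participant"])
result = model.fit(reml=False)            # maximum likelihood rather than REML

alpha, n_tests = 0.05, 5                  # alpha is an assumed significance level
bonferroni_threshold = alpha / n_tests
reject_null = result.pvalues["is_true"] < bonferroni_threshold
```

`result.summary()` prints the coefficient table with Wald-based standard errors, and `result.conf_int()` returns the corresponding confidence intervals.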
Table 3 shows the resulting coefficients. We see that for total gaze duration, total fixation duration, and total fixation count, the $p$-value of the factuality coefficient is below the Bonferroni-corrected threshold, thus we have sufficient evidence to reject the null hypothesis. These three eye-tracking measures change significantly when reading true versus false headlines. However, for average fixation duration and first fixation duration, we cannot reject the null hypothesis, and thus we cannot conclude that the time spent on each individual fixation changes between factually true and false headlines. We also observe that a factually true headline causes the total gaze duration, total fixation duration, and total fixation count to increase, as seen by the positive value of the factuality coefficient; this means that false headlines in general have shorter fixation and gaze durations than true headlines. The fact that factuality is not significant for average fixation duration means that the increased total fixation duration for true headlines is caused by an increase in total fixation count for factually true headlines.
We now briefly discuss the coefficients other than that of factuality. Using the same significance threshold, we see that the position of the headline is not significant for the total gaze duration, while it is significant for all measures of fixation if the headline is placed on the bottom. The negative value of the bottom-position coefficient shows that all measures of fixation decrease when the headline is placed on the bottom. The length of the headline is significant for all eye-tracking measures ($p < 0.001$), with longer headlines having higher measures. Lastly, we observe no significant difference in any measure between the genders.
Table 3: Mixed-model coefficients for total gaze duration, total fixation duration, total fixation count, average fixation duration, and first fixation duration.
Learning to infer factuality from eye tracking
Having established that total gaze duration, total fixation duration, and total fixation count are all significantly different depending on the headline factuality, we next investigate whether these measures provide sufficient signal for training a headline factuality classifier. As these measures are highly dependent on the length and position of the headlines, length and position are also included in the model. We observe that total fixation duration is highly correlated with total fixation count; thus, to keep the model as simple as possible, we only use total gaze duration and total fixation duration.
In Table 3, we see that the coefficient of factuality is, for many measures, less influential than those of the position and length of the headline. Thus, we expect the eye-tracking measures of a single participant to be noisy. We therefore use an ensembling approach, where the predicted factuality of a headline is computed as an average over a set of participants ($P$): $\hat{y}_h = \frac{1}{|P|} \sum_{p \in P} \hat{y}_{h,p}$, where $\hat{y}_h$ is the factuality prediction for headline $h$, and $\hat{y}_{h,p}$ is the factuality prediction for headline $h$ for participant $p$. Due to the relatively small size of our dataset, we propose to use the average of two simple second-order logistic models for estimating $\hat{y}_{h,p}$:

$$\hat{y}_{h,p} = \tfrac{1}{2}\Big[\sigma\big(z^{(1)}_{h,p}\big) + \sigma\big(z^{(2)}_{h,p}\big)\Big], \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z^{(1)}_{h,p}$ and $z^{(2)}_{h,p}$ are second-order polynomials, with learned coefficients, in the total gaze duration $g_{h,p}$ for participant $p$ on headline $h$, the total fixation duration $f_{h,p}$ for $p$ on $h$, the length $l_h$ of the headline, and the headline position. Both logistic models have one eye-tracking measure interacting with either the length or position of the headline, where the interaction is chosen based on the pair with the lowest correlation. We choose to use two simple logistic models, instead of a single combined model, to increase the variance of the predicted factuality, as high variance is beneficial for ensembling. We standardize (zero mean and unit variance) the eye-tracking measures from each participant across all headlines. Lastly, the two logistic models are trained using Maximum Likelihood on a set of training participants.
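The two-level averaging described above (two logistic models per participant, then a mean over participants) can be sketched as follows. The exact terms of the paper's two models are not reproduced here: as a purely illustrative assumption, each model below uses one eye-tracking measure, the headline length, and their interaction. The function names and coefficient layout are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def participant_prediction(gaze, fixation, length, coef_a, coef_b):
    """Average of two logistic models for one participant and one headline.

    Assumed (illustrative) structure: each model has an intercept, one
    eye-tracking measure, the headline length, and their interaction.
    """
    z_a = coef_a[0] + coef_a[1] * gaze + coef_a[2] * length + coef_a[3] * gaze * length
    z_b = coef_b[0] + coef_b[1] * fixation + coef_b[2] * length + coef_b[3] * fixation * length
    return 0.5 * (sigmoid(z_a) + sigmoid(z_b))

def ensemble_prediction(observations, coef_a, coef_b):
    """Mean prediction over the ensembling participants for one headline.

    `observations` holds one (gaze, fixation, length) tuple per participant,
    with the eye-tracking measures already standardized per participant.
    """
    preds = [participant_prediction(g, f, length, coef_a, coef_b)
             for g, f, length in observations]
    return sum(preds) / len(preds)
```

Averaging over participants reduces the noise of any single participant's prediction, which is the motivation the text gives for ensembling.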
We evaluate the model by inferring factuality on unseen headlines using Monte Carlo cross-validation over 100,000 iterations. In each iteration, the participants are split for training and ensembling (27 and 28 participants, respectively), and three headlines are chosen for evaluation (2 true and 1 false), while the remaining headlines are used for training. We report the mean AUC and mean accuracy, across all iterations.
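One iteration of this evaluation protocol can be sketched as below. The function names are hypothetical, and drawing the two true and one false evaluation headlines independently at random is an assumption, since the text does not say whether they must come from the same screen. The `auc` helper uses the standard pairwise definition (the probability that a true headline is scored above a false one, with ties counted as one half).

```python
import random

def monte_carlo_split(participants, true_ids, false_ids, rng, n_train=27):
    """One Monte Carlo cross-validation split of participants and headlines."""
    shuffled = list(participants)
    rng.shuffle(shuffled)
    train_p, ensemble_p = shuffled[:n_train], shuffled[n_train:]
    # Hold out 2 true and 1 false headline for evaluation (assumed independent draws).
    eval_h = rng.sample(true_ids, 2) + rng.sample(false_ids, 1)
    train_h = [h for h in true_ids + false_ids if h not in eval_h]
    return train_p, ensemble_p, train_h, eval_h

def auc(scores, labels):
    """Pairwise AUC: fraction of (true, false) pairs ranked correctly, ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With only three evaluation headlines per iteration, the per-iteration AUC can take just a few distinct values, which is why the mean over many iterations is the quantity reported.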
As reported in Table 4, we find that our ensemble model predicts the factuality of unseen headlines with a mean AUC of 0.688 and an accuracy of 0.634 (which is higher on false headlines (0.662) than on true ones (0.619)). There is no prior work on automatically detecting factuality in news headlines only, but related work on inferring factuality in text (but not headlines, which is harder) using textual features alone (not eye-tracking features) shows that accuracy ranges from 0.39 (Wang17) to 0.76 (Perez-RosasKLM18), and even up to 0.86 (poliak-EtAl:2018:S18-2) when using BiLSTMs and a multilayer perceptron classifier with refined linguistic features such as entailment and contradiction. By comparison, we have a simple learning model, which uses weaker input features (eye-tracking measures are less discriminative than textual ones), and which solves a more difficult problem (factuality checking in headlines instead of longer texts).
|Mean AUC||Mean Acc.||Mean Acc. (True)||Mean Acc. (False)|
|0.688||0.634||0.619||0.662|
In the above, we standardize the eye-tracking measures for each participant on all headlines. We now ask: how important is this standardization, and would standardization on fewer headlines suffice? We answer this by sampling fewer headlines to base the standardization on, while still preserving the ratio of 2 true headlines for each false one. We refer to three headlines following this ratio as a “screen”.
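The standardization variant described here can be sketched as follows: the mean and standard deviation for one participant's measure are estimated from the screen being predicted plus a number of other sampled screens, and then applied to the target screen. The function name, the data layout (a dict from screen id to the three values on that screen) and the choice to always include the target screen are assumptions for illustration.

```python
import random
import statistics

def standardize_target(values_by_screen, target, k, rng):
    """Standardize the target screen's values using statistics from k screens.

    The k screens are the target screen plus k-1 others sampled at random
    (assumed procedure); k=1 standardizes on the target screen alone.
    """
    others = [s for s in values_by_screen if s != target]
    chosen = [target] + rng.sample(others, k - 1)
    sample = [v for s in chosen for v in values_by_screen[s]]
    mean = statistics.fmean(sample)
    std = statistics.pstdev(sample)
    if std == 0.0:
        raise ValueError("measure is constant over the chosen screens")
    return [(v - mean) / std for v in values_by_screen[target]]
```

Since each screen holds two true headlines and one false one, sampling whole screens preserves the 2:1 ratio in the data used for standardization.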
Figure 1 (left) shows the mean accuracy and AUC when varying the number of screens used for standardization. When standardizing only on the screen we predict on (screen=1), mean AUC is at its minimum; it increases drastically at 6 screens, and then stabilizes for the remaining numbers of screens. When increasing the number of screens, the accuracy for the true headlines decreases slightly, while the accuracy for the false headlines increases, but after 6-18 screens the effect of including more screens is minimal. This suggests that the performance of our ensemble model does not depend heavily on having a large set of headlines for standardization. Deployed in a live setup, a few headlines for standardization could suffice to reach the accuracy and AUC levels reported in this study.
The results reported above correspond to splitting participants approximately 50/50 for training and ensembling, and this split can of course be varied; Figure 1 (right) plots mean accuracy and AUC (y axis) across a varying number of participants used for ensembling out of the 55 participants in total. We see that the choice of a 50/50 split is close to optimal. The fact that performance drops rapidly when 15 or fewer participants are used for ensembling indicates that aggregating over a large set of participants is at least as important as training a model on more data, in this setup. This happens because our dataset is small (we have few participants), so the optimal performance is a trade-off between training a better model (requiring more participants for training) and aggregating over more participants (requiring more participants in the ensemble).
We studied whether the human eye moves differently when reading factually true versus factually false news headlines, and if we can infer factuality in news headlines using only eye-tracking signals. In an experiment with 55 users reading 108 news headlines, we found that false headlines receive statistically significantly less visual attention than true ones. We used this to build an ensemble learner that predicts news headline factuality using only eye-tracking measurements, which obtained a mean AUC score of 0.688 and a mean accuracy of 0.634.
Future work includes investigating eye tracking as a boosting mechanism to potentially improve factuality detection based on text processing, and refining the relationship between eye movements and factuality in more typical IR tasks such as search. A different promising direction for future work is to repeat our study “in the wild”, outside usual laboratory settings, including with lower-fidelity eye-tracking methods such as typical laptop-mounted cameras and smartphone cameras.