Scrutinizer: A Mixed-Initiative Approach to Large-Scale, Data-Driven Claim Verification

03/14/2020 ∙ by Georgios Karagiannis, et al. ∙ Cornell University

Organizations such as the International Energy Agency (IEA) spend significant amounts of time and money to manually fact-check text documents summarizing data. The goal of the Scrutinizer system is to reduce verification overheads by supporting human fact checkers in translating text claims into SQL queries on an associated database. Scrutinizer coordinates teams of human fact checkers. It reduces verification time by proposing queries or query fragments to the users. Those proposals are based on claim text classifiers that gradually improve during the verification of a large document. In addition, Scrutinizer uses tentative execution of query candidates to narrow down the set of alternatives. The verification process is controlled by a cost-based optimizer. It optimizes the interaction with users and prioritizes claim verifications. For the latter, it considers expected verification overheads as well as the expected claim utility as training samples for the classifiers. We evaluate the Scrutinizer system using simulations and a user study, based on actual claims and data and using professional fact checkers employed by IEA. Our experiments consistently demonstrate significant savings in verification time, without reducing result accuracy.




1 Introduction

Data is often disseminated in the form of text reports, summarizing the most important statistics. For authors of such documents, it is time-consuming and tedious to ensure the correctness of every single claim. Nevertheless, erroneous claims about data are not acceptable in many scenarios as each mistake can have dire consequences. Those consequences range from embarrassing retractions (in case of scientific papers [hosseini2018doing]) to legal or financial implications (in case of business or health reports [ash2004some]). We present Scrutinizer, a system that helps teams of fact checkers to verify consistency of text and data faster.

Our work is inspired and motivated by the real use case provided by the International Energy Agency (IEA). Every year the agency produces a report of more than 600 pages about the energy consumption and production in the world, covering historical facts and predictions both for individual countries and at the world level. We have been given access to the 2018 edition, which contains 7901 sentences with 1539 manually checked statistical claims. Every claim has been checked by three domain experts and their annotations have been collected in a spreadsheet. This process takes months of work of a team of domain experts. Consider the following example from our corpus of statistical claims.

Example 1

The institute has hundreds of relational tables with information about energy, pollution, and climate. A fragment of a table is reported in Figure 1. Consider the claim “In 2017, global electricity demand grew by 3%, more than any other fuel besides solar thermal, reaching 22 200 TWh.”. An expert validates the claim that demand grew by 3% by identifying the relevant table(s) and by writing a query over such tables to collect the relevant information. In the example:

SELECT POWER(a.2017/b.2016,1/(2017-2016)) -1 FROM GED a, GED b WHERE a.Index = 'PGElecDemand' AND b.Index = 'PGElecDemand'

Finally, the expert compares the output of the query with the claim and either validates or updates the claim.

Index          2017     2018     ..   2030     2040
PGElecDemand   22 209   22 793   ..   29 349   35 526
PGINCoal       2 390    2 412    ..   2 341    2 353
TFCelec        21 465   22 040   ..   28 566   34 790

Figure 1: Global Energy Demand history and estimates (GED); the full table has 22 rows and 70 attributes.

Gathering data for the claim at hand and composing the right query for the validation takes expertise over the domain and data skills, taking several minutes for a single claim. We argue for the need of a system that takes as input the document and a corpus of related datasets to automatically identify the declarative queries that explain why every claim is validated or not by the data.

1.1 Challenges

Given a document with statistical claims and related datasets, our goal is to come up with the SQL queries that assist the users in the validation, suggesting alternative values for updating a claim in case of an incorrect statement. Our real-world use case clearly shows three issues that make such data verification hard to automate.

Text analysis. Converting a textual claim to a structured query is difficult because claims are expressed in natural language, do not use a fixed vocabulary, and come from multiple authors with different wording and style.

Query complexity. Our analysis of the checks done in the past by the validation team reveals that the class of queries used for checking claims is very wide, going from simple selections to complex mathematical operations involving groups of values, aggregations, and functions, with more than 100 different combinations of operations.

Large corpus of datasets. Given a corpus of datasets, it is not clear which one(s) should be used to verify a new statistical claim. In reality, datasets do not come with rich metadata beyond table and attribute names and are heterogeneous in format, schema, and granularity of the data.

An exhaustive search of all possible queries is infeasible, but pruning of the search space must be done carefully. In particular, the testing of false claims is immediately affected, as it is not clear how to judge the inability to create a matching query: is it because of a factual error or because of the pruning in the query generation? With such a difficult problem, we found inspiration in an important resource in our use case. We notice that, by processing the annotations of the checkers, we can collect the data and the operations that have been used to verify every claim. This significant human effort can be used to train models that reduce the search space and identify the queries that verify the claims.

In this direction, we tackle the above challenges with a novel system that builds on three main modules: machine learning (ML) and natural language processing (NLP) to process the text, human-in-the-loop by involving the domain experts in bootstrapping and validating the candidate queries, and query generation with a large library of functions. The involvement of the users immediately raises more challenges: how to divide the work among a crowd of domain experts? What are the right questions to ask them? How to schedule such questions? How to bootstrap and improve the quality of the models when the training data from previous checks is not available?

1.2 Contributions

Translating the claim to a structured query requires recognizing the semantics of the query and the correct data to run it on. As the translation is a challenging process and we aim at supporting a large variety of use cases, our system steers the query generation and data matching by generating and scheduling questions to domain experts.

  • We introduce a novel framework for statistical claims verification that minimizes the human effort (Section 2). Scrutinizer makes use of classifiers and simple questions to a crowd of domain experts to generate interpretable SQL queries that either validate or contradict the claim (Section 3).

  • We build queries by extracting their main features from the textual claim and its context, such as surrounding paragraphs. The classifiers identify the dataset, the attributes, the rows and the mathematical operation that are required to verify the claim. A query generation algorithm combines the provided information into the interpretable queries that are exposed to the user to assess a claim (Section 4).

  • We introduce a cost model and scheduling algorithms for planning the sequence of claims to verify and the questions to ask the crowd of domain experts for a single claim. We give algorithms that minimize the verification cost for the users with quality bounds. The algorithms model the trade-off between the constraints given by the users and the necessity to bootstrap and fine-tune the classifiers with labels (Section 5).

  • We experimentally verify with a user study and real data that our system is effective in supporting users in checking claims, enabling the verification of more than 1 claim per minute on average with a reduction in time of 50% compared to the original verification process (Section 6). We corroborate those results via simulations, studying performance of different baselines when verifying larger reports.

2 Problem Model

We assume a scenario with a crowd of domain experts, a textual document to be verified, and a set of relational tables. The textual document is divided into sentences and each sentence can contain one or more claims, that is, word sequences that describe the output of a query over these tables. More precisely:

Definition 1

A general claim describes a comparison between the value of a query and a parameter, when the query is executed on the relations.
A claim is correct if this comparison is true.

A special instance of our definition of general claim is a common class of statements, where the comparison is the equality and the parameter is a value reported in the claim itself. For equality, we consider a tolerance threshold (admissible error rate) that can be defined by the users.

Definition 2

An explicit claim describes a query that, when executed on the relations, returns a value close to the parameter stated in the claim. An explicit claim is correct if the relative difference between the query result and the parameter is lower than the admissible error rate.
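The correctness test for explicit claims can be sketched in a few lines of Python. This is only an illustration: the function name, the choice of the claimed parameter as the denominator of the relative difference, and the sample values (9.2, 9.0, a 5% error rate) are assumptions, not details given in the text.

```python
def is_explicit_claim_correct(query_value, parameter, error_rate):
    """Check an explicit claim against a query result.

    Assumption: the relative difference is measured with respect to the
    parameter stated in the claim.
    """
    if parameter == 0:
        # Relative difference is undefined for a zero parameter;
        # fall back to the absolute difference (assumption).
        return abs(query_value) < error_rate
    relative_difference = abs(query_value - parameter) / abs(parameter)
    return relative_difference < error_rate


# Hypothetical values: the query returns 9.2, the claim states "nine-fold".
print(is_explicit_claim_correct(9.2, 9.0, 0.05))
# Hypothetical values: the query returns 3.0 (%), the claim states 2.5 (%).
print(is_explicit_claim_correct(3.0, 2.5, 0.05))
```

With a 5% admissible error rate, the first claim is accepted (relative difference of about 2.2%) and the second is rejected (relative difference of 20%).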

Example 2

Consider the following two claims:

The market for new wind power projects increased nine-fold from 2000 to 2017, while the solar PV market expanded aggressively.

The first claim (about wind power) is explicit and “nine-fold” is the parameter. The query should identify this ratio in the relevant data for the wind market (VM) and check an equality, i.e., (VM in 2017 / VM in 2000) = 9. The second claim (about solar PV) is general, with “expanded” being an operation over solar market (SM) yearly values, i.e., (SM in 2017 / SM in 2000) > 1, and “aggressively” a parameter, i.e., (SM in 2017 / SM in 2000) > 100.

General claims are more challenging than explicit ones because of the ambiguity in the language. The problem is domain specific, as an aggressive growth in the energy market may not be the same parameter in the financial or in the automotive market.

Assuming a system can identify the comparison and the parameter in the claim, it still has to come up with the correct query. We consider a fragment of SQL focused on statistical checks based on a library of functions that includes aggregate and mathematical SQL functions, possibly combined with arithmetic operators.

Definition 3

A statistical check SQL query has the form:
SELECT f(a.A1, b.A2, …)
FROM T1 a, T2 b, …
WHERE a.key1 = v1 AND (b.key2 = v2 OR b.key2 = v3) AND …

The where clause is a conjunction and disjunction of unary equality predicates defined over the key attributes of one or more relations in the corpus. The select clause is a (possibly nested) combination of functions defined over attribute values and constants.

In our use case, we observe that the number of possible function combinations (e.g., power(a.2017/b.2016,1/(2017-2016)) in Example 1) is in the hundreds. As with parameters, we do not assume that the library of functions is fixed in general, as different combinations are used in different domains.

Example 3

Consider again the explicit claim in the previous example about the wind market. The claim is validated if there is a query that translates the textual content and returns a value equal to 9. In this case, the query is

SELECT (a.2017 / b.2000) FROM GED a, GED b WHERE a.Index = ‘CapAddTotal_Wind’ and b.Index = ‘CapAddTotal_Wind’;
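To make the check concrete, the following sketch runs a query of this shape with Python's built-in sqlite3 module. The table contents (5.0 and 45.0) are made-up values chosen so that the ratio is exactly 9, and the 5% tolerance is an assumption; only the table name, the key value, and the shape of the query come from the example. Column names such as "2017" and "Index" are quoted because they are not plain identifiers in SQLite.

```python
import sqlite3

# In-memory toy database; the numeric values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE GED ("Index" TEXT PRIMARY KEY, "2000" REAL, "2017" REAL)')
conn.execute("INSERT INTO GED VALUES (?, ?, ?)", ("CapAddTotal_Wind", 5.0, 45.0))

# Statistical check query: self-join on the same key, ratio of two attributes.
query = (
    'SELECT a."2017" / b."2000" FROM GED a, GED b '
    "WHERE a.\"Index\" = 'CapAddTotal_Wind' AND b.\"Index\" = 'CapAddTotal_Wind'"
)
ratio = conn.execute(query).fetchone()[0]

# Compare the query result with the parameter stated in the claim ("nine-fold").
claimed = 9.0
tolerance = 0.05  # assumed admissible error rate
is_valid = abs(ratio - claimed) / claimed <= tolerance
print(ratio, is_valid)
```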

Finally, we aim at minimizing the effort taken by a group of experts of the domain in verifying the claims in the document. A natural metric to measure the effort is the total time to verify all claims and update incorrect ones.

We are now ready to formally define our problem.

Definition 4

Given a set of relations and a document containing a set of generic textual claims, we want to minimize the human effort in identifying, for every claim, either (i) a query and the relations such that the claim is labelled as correct, or (ii) that there is no query and dataset such that the claim is labelled as correct.

In the latter case, we also want to report the queries over the given relations that make the claim correct.

Example 4

Consider again the table fragment in Figure 1 and the (false) claim “In 2017, global electricity demand grew by 2.5%”. Our system recognizes that there is no query that returns 2.5% for global electricity growth in 2017, but there is a query on the same subject returning 3%. We suggest the value as a possible update to the claim.

Figure 2: Architecture of Scrutinizer.

3 System Overview

Figure 2 shows an overview of Scrutinizer. The input consists of a text document, containing general claims, and a set of relations. Inspired by our use case, if a database of previously checked claims is available, our system uses it for bootstrapping. In case such a database is not available, we introduce an active learning algorithm to steer the crowd in its creation. The output of the system is a verification report, mapping verified claims to queries while pointing out mistakes and potential updates to the text.

The system encompasses two primary components. The automated translation component leverages machine learning to identify the elements that define every claim, i.e., candidates for datasets, attributes, rows, and comparison operations. The question planning component interacts with human domain experts to verify such elements and the checking results, optimizing verification tasks for maximal benefit.

// Verify claims C in text using models M
// and return verification results.
function Verify(C, M)
    // Initialize verification result
    R ← ∅
    // While unverified claims left
    while C ≠ ∅ do
        // Select next claims to verify
        B ← OptBatch(C, M)
        // Select optimal question sequence
        Q ← OptQuestions(B, M)
        // Get answers from fact checkers
        A ← GetAnswers(B, Q)
        // Generate queries and validate claims
        V ← Validate(A, B)
        R ← R ∪ V
        // Remove answered claims
        C ← C \ Unanimous(V)
        // Retrain text classifiers
        M ← Retrain(M, A)
    end while
    // Return verification results
    return R
end function
Algorithm 1 Main verification algorithm.

Figure 3: Example of the generated query (bottom) for the general claim in red in the sentence (top). Below the sentence are reported elements of the claim that have already been validated, such as the database, the key value and the attributes.

Algorithm 1 describes the main steps in our workflow. Given the claims in a document and the ML models, the claims are verified in batches by a team of experts. In each step, the algorithm selects an optimal batch of claims for verification. Claim batches are selected based on multiple criteria, including expected verification overheads as well as their estimated utility for improving accuracy of the classifiers. For each selected claim in the current batch, we determine an optimal sequence of questions for the human checkers, minimizing expected verification time. Claims are validated or marked as erroneous, based on replies from crowd workers and query evaluations. We remove the claims for which a verification result (i.e., either a verifying query or a decision that the claim is erroneous) can be calculated with sufficiently high confidence. Finally, the classifiers are retrained, based on the newly obtained classification results.
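The control flow just described can be sketched as a small Python loop. This is a minimal illustration under stated assumptions, not the system's implementation: the five steps are passed in as callables with hypothetical signatures, and the demo replaces them with trivial stubs that resolve every claim in the batch.

```python
def verify(claims, models, opt_batch, opt_questions, get_answers, validate, retrain):
    """Batch-wise verification loop; all callables are hypothetical stand-ins."""
    results = {}
    pending = set(claims)
    while pending:
        batch = opt_batch(pending, models)           # select next claims to verify
        questions = opt_questions(batch, models)     # plan question sequences
        answers = get_answers(batch, questions)      # ask the fact checkers
        # validate returns {claim: result} for claims resolved with confidence.
        # Note: if it never resolves anything, this sketch would loop forever.
        verified = validate(batch, answers)
        results.update(verified)
        pending -= set(verified)                     # remove answered claims
        models = retrain(models, answers)            # retrain text classifiers
    return results


def demo():
    # Stubs: batches of two claims, every claim in a batch gets verified.
    return verify(
        ["c1", "c2", "c3"],
        models={"rounds": 0},
        opt_batch=lambda pending, models: sorted(pending)[:2],
        opt_questions=lambda batch, models: {c: ["relation?", "row?", "attribute?"] for c in batch},
        get_answers=lambda batch, questions: {c: "answer" for c in batch},
        validate=lambda batch, answers: {c: "verified" for c in batch},
        retrain=lambda models, answers: {"rounds": models["rounds"] + 1},
    )


print(demo())
```

Passing the steps as parameters keeps the loop independent of how batches are chosen or how classifiers are retrained, which are the parts the paper optimizes separately.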

We detail the two main components in the following.

3.1 Text to Query Translation

The system starts by executing four classifiers over the textual claim. We assume the text relevant for the statistical claim has already been identified with one of the existing tools for this task [JoTY0YLM19]. Given the textual claim, the classifiers identify four elements that are key for the query generation process and claim verification. The first three are basic elements of every query: relevant relations, primary key values (rows), and attribute names. The fourth classifier is in charge of identifying a generic formula with variables in the place of keys and attribute values. This formula gets instantiated on the dataset at hand and becomes the combination of functions in the SELECT clause. While for explicit claims we always identify the parameter and the comparison, for general claims these two elements can also be predicted within the formula. It is also possible that they cannot be predicted, in which case the user inputs them by answering a question.

Example 5

Consider again the (false) claim “In 2017, global electricity demand grew by 2.5%”. Ideally, the first classifier identifies that global relations can be used to verify it; the second classifier recognizes that rows reporting values for electricity demand should be used; the third classifier returns 2016, 2017 as the attributes of interest, and, finally, the fourth classifier returns the formula power, with explicit parameter (2.5%) in the claim (the explicit parameter implies the equality comparison).

To get good accuracy results in the prediction, we resort to active learning. This is in line with our use case, where the previously checked claims are immediately used to derive training data for the classifiers, but it also enables the use of our system for cases where previous checks are not available. Previous checks are also important for generalizing check functions into formulas with variables. This step enables us to (i) reuse formulas on unseen claims and (ii) keep the number of classes (for the prediction) as small as possible.

As we cannot assume that the first prediction is always the correct one in practice, we validate the relations, rows, and attributes predictions with the crowd of domain experts. Once we have this “context” information, we predict the top formulas with the last classifier and generate all the possible queries that combine context and formulas. The complexity raised by this combination is in the assignment of the elements of the query to the variables in the formula. Consider two attributes identified for a certain row and a formula that combines two variables in an order-sensitive way: the system does not know which attribute is assigned to which variable.

Example 6

Given the predictions for relations (g1, g2), rows (PGElecDemand), attributes (2016, 2017) and the formula POWER(a.A1/b.A2, 1/(A1-A2))-1, the query generator module produces all the possible bindings for the variables over global relations, for electricity demand rows, and with attributes 2016 and 2017. In one assignment, variable a is bound to a row in relation g1 with Index value PGElecDemand and attribute 2016, while in a second assignment it is bound to g2 and 2016, or to g1 and 2017, and so on. One of these queries returns the 3% parameter in the original claim, thus validating it.

The assignment operation is done in a brute force fashion, but, thanks to the pruning power of the context, it is usually achieved in less than a second. We describe these components in more detail in Section 4.

3.2 Question Planning

Obtaining feedback from crowd workers is expensive. Hence, the question planning component uses cost-based optimization to determine most effective question sequences. Question planning consists of two sub-tasks. First, for a fixed claim, we choose a sequence of questions allowing us to verify that claim with minimal expected overhead. Each question either solicits crowd workers to verify automatically generated query fragments, or to propose suitable query fragments themselves. Second, we need to decide the order in which claims are verified. When selecting claims to verify next, we take into account expected verification overheads as well as their value as training samples for our classifiers (used for automated claim verification). We describe this component in more detail in Section 5.

Example 7

Figure 3 shows at the bottom the query generated after a group of relations, a key value and two attributes have been validated for the general claim at the top. The domain expert can examine the formula that has been predicted (left), its assignment over all the relations that contain the key value and attributes (right), and the resulting value (0.012 in the example) for verifying the claim.

Notice that in the example above the parameter is not predicted by the formula; the user has to assess whether 0.012 is correctly described by “scarcely”.

We remark that our system is designed for a setting with many claims that need to be verified by a team of checkers. If this is not the case, the system does not add extra cost, but the benefit of the effort spent in training the classifiers would not be visible.

4 Claim Translation

We first describe how we preprocess the claims to extract the features to be used with the classifiers. We then describe the query generation step.

4.1 Claim Preprocessing

Given a text, we start by processing it to identify check-worthy claims with existing tools [HassanALT17, JaradatGBMN18]. Given a claim (sequence of words), it is necessary to identify the correct relations from the corpus. In such relations, we need to identify the primary key values and the attributes that identify the data values to be used in the check operation. For these three tasks, we rely on pre-trained GloVe embeddings to convert the text to a distributed representation which maps each word to a real-valued vector [pennington2014glove]. To get the embedding of a sentence, we average the embeddings of the words in that sentence. For each claim in a sentence, we concatenate the sentence embedding with the TF-IDF scores of the unigrams and bigrams in the claim, followed by the TF-IDF scores of its character trigrams.
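The feature construction can be illustrated with a small, dependency-free Python sketch. Everything concrete here is an assumption made for illustration: the two-dimensional toy embeddings stand in for pre-trained GloVe vectors, the vocabularies and document frequencies are hypothetical, and the smoothed TF-IDF variant is one common choice rather than the one necessarily used by the system.

```python
import math
from collections import Counter


def sentence_embedding(sentence, embeddings, dim):
    """Average the embeddings of the known words in the sentence."""
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def word_ngrams(text):
    """Unigrams followed by bigrams of the claim."""
    words = text.lower().split()
    return words + [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]


def char_trigrams(text):
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def tfidf_vector(tokens, vocabulary, doc_freq, n_docs):
    """Term frequency times a smoothed inverse document frequency (assumed variant)."""
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [
        (counts[t] / total) * math.log((1 + n_docs) / (1 + doc_freq.get(t, 0)))
        for t in vocabulary
    ]


def claim_features(sentence, claim, embeddings, dim,
                   word_vocab, char_vocab, word_df, char_df, n_docs):
    """Concatenate sentence embedding, word n-gram TF-IDF, and char trigram TF-IDF."""
    return (
        sentence_embedding(sentence, embeddings, dim)
        + tfidf_vector(word_ngrams(claim), word_vocab, word_df, n_docs)
        + tfidf_vector(char_trigrams(claim), char_vocab, char_df, n_docs)
    )


# Toy inputs (hypothetical): 2-d embeddings and tiny vocabularies.
toy_embeddings = {"demand": [1.0, 0.0], "grew": [0.0, 1.0]}
features = claim_features(
    "Demand grew by 3%", "grew by 3%", toy_embeddings, 2,
    ["grew", "grew by"], ["gre", "rew"], {"grew": 1}, {"gre": 1}, 2,
)
print(len(features), features[:2])
```

The resulting vector has one block per feature family, so each classifier receives the same fixed-length input regardless of claim length.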

Figure 4: Preprocessing of the claims.

As depicted in Figure 4, embeddings for the sentence and the claim are fed as multi-dimensional vectors to the four classifiers responsible for predicting a fragment of the final query. One classifier is used to predict a list of possible relations that are used to verify the claim. Another classifier predicts a list of primary key values that are relevant to the claim. A third classifier predicts a list of possible attribute labels. The final classifier predicts a list of possible operations. If the claim is explicit, we identify the parameter directly from the sentence with a syntactical parsing.

4.2 From Claims to Formulas

Given the large variety of possible statistical checks, we do not rely on a pre-defined library of operations and their possible combinations, but learn them from the previously checked claims. Given a previously checked claim, we describe how we turn it into a generic formula with variables, with the goal of reusing it with unseen claims. A classifier is trained with (claim, formula) pairs and returns a ranked list of formulas for a given textual claim.

Example 8

A query with
SELECT POWER(a.2017/b.2016,1/(2017-2016))-1
identifies a formula POWER(a.A1/b.A2,1/(A1-A2))-1

Example 8 shows the translation of the Select clause of a query into a formula. The formula contains variables for the relations and for the attributes, but preserves function names, operations, and constants. The variables make the check reusable for a new claim, assuming that the classifier returns the correct formula for it. In a formula, one of the operations and one of the constants can play the role of comparison and parameter, respectively, for claims that are not explicit.
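A minimal sketch of this generalization step, assuming the Select clause is available as a string, is shown below. The function name and the variable naming scheme (A1, A2, ...) are hypothetical. The sketch treats every label that appears after a relation alias as an attribute and replaces all of its occurrences, including those where the label is used as a value; it does not handle labels that collide with constants.

```python
import re


def generalize(select_expr):
    """Replace attribute labels in a Select expression with variables A1, A2, ..."""
    # Collect labels in order of first appearance, e.g. "2017" from "a.2017".
    labels = []
    for match in re.finditer(r"\b[A-Za-z_]\w*\.(\w+)", select_expr):
        if match.group(1) not in labels:
            labels.append(match.group(1))
    # Replace every whole-token occurrence of each label, also where the
    # label is used as a plain value (e.g. in "2017-2016").
    formula = select_expr
    for i, label in enumerate(labels, start=1):
        formula = re.sub(r"\b" + re.escape(label) + r"\b", f"A{i}", formula)
    return formula, labels


formula, labels = generalize("POWER(a.2017/b.2016,1/(2017-2016))-1")
print(formula)
print(labels)
```

Function names, operators, and the constants 1 and -1 are left untouched, so the same formula can be instantiated with other attribute labels for an unseen claim.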

Given annotations with SQL queries, the process to obtain a formula is straightforward. Unfortunately, going from previous checks to formulas is challenging because we do not assume SQL queries in the annotations. In fact, in our use case, checkers used spreadsheets and notes in natural language to annotate their verification process. The lack of rigorous guidelines raises three problems.

Reconstruction. We call look-up the function that retrieves data values from the relations (Select and Project in SQL). Given a relation, each data value is identified by its primary key value (e.g., “PGElecDemand”) and its attribute name (e.g., “2017” or “Total”). Data values in a claim can be collected from different relations. Check operations range from simple look-ups in the relations to compositions of SQL functions and operations, possibly with constants. Moreover, values can be results of other intermediate operations. Any value involved in an operation may be obtained from operations such as subtraction, multiplication, or even compound annual growth rate that involve looking up other values.

Any value might be the result of several operations, thus formulas contain the entire sequence of operations. We achieve this by recursively replacing each value by its corresponding function in the annotations until we reach a look-up. As attribute labels are present in some formulas as values, we also replace them with attribute variables.

Ambiguity. A complication from the lack of guidelines is that checkers verify the same claims with different operations. Even for simple explicit claims, one checker may write a Boolean query and see if the output is empty, while another may collect a value from the data and compare it visually with the parameter. The problem gets harder with general claims, as in the following example.

Example 9

Consider a general claim stating that the consumption of some resource in a certain year has been “very high”. One checker may verify it with a Boolean query that compares the value against a threshold:

SELECT d.<attribute> > <threshold> FROM rel d WHERE d.key = <key>
but a second checker may verify the claim with the query:

SELECT d.<attribute> FROM rel d WHERE d.key = <key>
and mark the claim as correct based on some parameter that is neither in the claim nor in the query.

Incomplete information. The second check in Example 9 shows a case of incomplete annotation for a general claim. This is quite common in our experience, as also shown in Example 1 and in the formula reported in Figure 3: only a human can conclude that 0.012 is an appropriate value for this domain and claim to validate “scarcely”. While for explicit claims this is not an issue, as the comparison and the parameter are in the claim, incomplete annotations make it impossible to even replicate a past check on the same general claim. This problem clearly motivates our human-in-the-loop solution, detailed in Section 5.

Due to the first two problems above, it is hard to generate formulas in practice, as it is reflected by our experimental results with the prediction of formulas from the text. Fortunately, we use information from the claim data to better identify the correct formula, as we discuss next.

1: // Given relations R, key values K, attributes A, formulas F,
2: // parameter p (for explicit claim), returns queries
3: function GenerateQueries(R, K, A, F, p)
4:     V ← ∅; S ← ∅; O ← ∅
5:     // Collect data value assignments
6:     for r ∈ R, k ∈ K, a ∈ A do
7:         V ← V ∪ GetValue(r, k, a)
8:     end for
9:     for f ∈ F do
10:         // Get # of non-attribute variables in formula
11:         n ← GetVars(f)
12:         P ← GetPerm(V, n)
13:         for π ∈ P do
14:             // Test approx. value match for explicit claim
15:             if p ≠ null and f(π) ≈ p then
16:                 S ← S ∪ {(f, π)}
17:             else if S = ∅ then
18:                 O ← O ∪ {(f, π)}
19:             end if
20:         end for
21:     end for
22:     // Rewrite variables and assignments as queries
23:     if S ≠ ∅ then
24:         Q ← Rewrite(S)
25:         Return(Q)
26:     else
27:         Q ← Rewrite(O)
28:         Return(Q)
29:     end if
30: end function
Algorithm 2 Query generation algorithm.

4.3 Query Generation

We describe the query generation process in Algorithm 2. The input of the algorithm is the output of the classifiers and the parameter if the claim is explicit. To generate the candidate queries, we first get all the possible values from the combinations of the classifier outputs for relations, primary key values, and attribute labels (line 7). Then, we loop through the list of formulas (line 9), and for each formula, we get the possible permutations (line 12) of the possible values. We then try each permutation (line 13) to see if it leads to a match for the explicit claim and, if so, store it as a solution (line 16). If we did not find a valid solution or the claim was not explicit, then we store the solution in a different list (line 18). After looping through the formulas, if a solution was found for the explicit claim, we produce the queries associated with these solutions (line 24). In all other cases, we produce queries for all solutions (line 27). In the rewriting, we fill in a query template with the relations, key values, attribute labels and the instantiated formula. The query template is an SQL string with placeholders, as described in Definition 3. Note that we generate (SELECT-PROJECT-AGGREGATE) queries that can span multiple relations. Finally, we return all queries (lines 25, 28).

The algorithm assumes that the input information for relations, key values and attributes are correct as these come from the crowd validation, as described in the next section. Formulas are not validated by the crowd as returned by the classifier, but only after they have been filtered in the instantiation loop in the algorithm.
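The brute-force instantiation can be sketched in Python with itertools.permutations. This is a simplified illustration, not the system's code: formulas are modelled as plain Python callables with a known number of value arguments, the rewriting into SQL strings is omitted, and the data values (100.0 and 103.0), the 5% tolerance, and the one-year growth formula are hypothetical.

```python
from itertools import permutations


def generate_candidates(values, formulas, parameter=None, error_rate=0.05):
    """Enumerate bindings of data values to formula variables.

    values:   dict mapping (relation, key, attribute) to a number
    formulas: dict mapping a formula name to (arity, callable)
    Returns the bindings matching the parameter if any exist,
    otherwise all other evaluated bindings.
    """
    matches, others = [], []
    for name, (arity, fn) in formulas.items():
        for binding in permutations(values.items(), arity):
            cells = [cell for cell, _ in binding]
            numbers = [number for _, number in binding]
            try:
                result = fn(*numbers)
            except ZeroDivisionError:
                continue  # skip bindings the formula cannot evaluate
            candidate = (name, cells, result)
            if parameter is not None and abs(result - parameter) <= error_rate * abs(parameter):
                matches.append(candidate)
            elif not matches:
                others.append(candidate)
    return matches if matches else others


# Hypothetical validated context and a one-year growth formula.
values = {
    ("GED", "PGElecDemand", "2016"): 100.0,
    ("GED", "PGElecDemand", "2017"): 103.0,
}
formulas = {"growth": (2, lambda new, old: new / old - 1)}

matches = generate_candidates(values, formulas, parameter=0.03)
print(matches)
```

Of the two possible orderings, only the one binding the 2017 value to the numerator reproduces the claimed 3% growth, so it is the single candidate returned for the explicit claim. Without a parameter, both orderings are returned for the user to inspect.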

Example 10

Consider the following input to the algorithm:

Relations: GED; Keys: PGElecDemand; Attributes: 2016, 2017; Formulas: POWER, …

After instantiating the first formula and replacing the query template, we obtain:

SELECT POWER(a.2017/b.2016,1/(2017-2016)) -1 FROM GED a, GED b WHERE a.Index = 'PGElecDemand' AND b.Index = 'PGElecDemand'

In the last step of the workflow, the queries are executed and the results displayed to the user to draw conclusions on the claim, as depicted in Figure 3.

5 Question Planning

Question planning consists of two tasks: determining optimal questions to verify single claims, and determining an optimal verification order between claims. We discuss the first problem in Section 5.1 and the second one in Section 5.2.

5.1 Single Claim Verification

We verify claims by asking a series of questions to human fact checkers. Our goal is to minimize overheads for the fact checkers. To do so, we leverage the results of our claim to query translation components. In the ideal case, we have identified a query that translates the current claim with high confidence. In that case, crowd workers only need to verify the proposed translation. This is typically faster than verifying the claim manually.

In practice, we are not always able to find a high-confidence translation for a claim. Instead, we may still be able to narrow down the range of possibilities to a small set of alternatives. If this is not possible for the query as a whole, we may still be able to do so for specific query properties (e.g., we identify specific columns that appear in the query with high confidence). In those cases, we can ask crowd workers to verify assumptions about specific query properties, or to select answers from a small set of options. Of course, answering questions on query properties or selecting answers causes overheads as well. Our goal is to select the sequence of questions that minimize expected verification cost.

For each claim, we generate a series of screens. Each screen contains questions that are answered by a crowd worker. Each screen is associated with one specific query property (e.g., the presence of specific columns or tables). On the upper part of each screen, crowd workers are shown a set of answer options with regards to the current property. Those answer options are obtained from our classifiers. On the lower part of each screen, crowd workers have the option to suggest new options, if the correct answer is not on display. The final screen for each claim asks directly for the query translating the current claim. Answers to prior questions may have allowed us to narrow down the range of possible queries. If so, the chances for confronting workers with the correct query increase.

In this scenario, our search space for question planning is the following. First, we need to decide how many screens to show. Second, we need to determine what query properties our questions should focus on. Third, we need to decide how many answer options to display on each screen. Fourth, we need to pick those answer options.

We make those decisions based on a simple cost model, representing time overhead for crowd workers for verifying the current claim. We assume that workers read screen content from top to bottom. For each answer option, a worker needs to determine whether it is correct or not. We count a per-option verification cost in our model, distinguishing cost of verifying answers about query properties, $v_p$, from the cost of verifying the full query (on the final screen), $v_f$. We choose constants such that $v_f \gg v_p$ to account for the fact that full queries are significantly longer than their fragments (which increases reading time and therefore verification cost). If none of the given options applies, crowd workers must suggest an answer themselves. We denote by $s_p$ and $s_f$ the cost of suggesting answers for properties and queries (again, $s_f \gg s_p$).

First, we discuss how to choose the number of screens and answer options. We denote the number of screens by $n_{sc}$ and the number of options by $n_{op}$. Predicting the precise verification cost for specific choices of those parameters is not possible. Doing so would require knowing the right solution to each question (as it determines how many options workers will read). However, we can upper-bound verification cost in relation to the cost of verifying claims without Scrutinizer.

Theorem 1

Compared to the baseline, relative verification overhead of Scrutinizer is at most $(n_{sc} \cdot n_{op} \cdot v_p + n_{op} \cdot v_f) / s_f$.

Reading through answer options on the final screen adds cost overheads of $n_{op} \cdot v_f$ in the worst case. We have overheads of $n_{sc} \cdot n_{op} \cdot v_p$ for all previous screens. Verifying the claim without help means suggesting a query for the current claim. This has cost $s_f$ in our model.

Corollary 1

Setting $n_{op} = s_f / v_f$ and $n_{sc} = v_f / v_p$ limits verification overheads to factor three.

This follows immediately by substituting the proposed formulas in the equations from Theorem 1: both overhead terms then equal $s_f$, so the worst-case total cost, including the cost $s_f$ of suggesting the query, is $3 \cdot s_f$.
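To make the bound concrete, the following minimal sketch evaluates the worst-case relative overhead for given cost constants. The function names and the sample constants are illustrative assumptions and not part of the system. The formula assumes that the worst case charges the per-option cost for properties on every option of every property screen, and the per-option cost for full queries on every option of the final screen, relative to the baseline cost of suggesting a query.

```python
def relative_overhead(n_sc, n_op, v_p, v_f, s_f):
    """Worst-case added reading cost, relative to the baseline cost s_f."""
    property_screens = n_sc * n_op * v_p  # all options on all property screens
    final_screen = n_op * v_f             # all query options on the final screen
    return (property_screens + final_screen) / s_f


def corollary_setting(v_p, v_f, s_f):
    """Screen and option counts that make each overhead term equal to s_f."""
    n_op = s_f / v_f
    n_sc = v_f / v_p
    return n_sc, n_op


# Hypothetical constants (in seconds): reading one property option, reading
# one full query, and writing a full query from scratch.
v_p, v_f, s_f = 2.0, 10.0, 50.0
n_sc, n_op = corollary_setting(v_p, v_f, s_f)
overhead = relative_overhead(n_sc, n_op, v_p, v_f, s_f)
# In the worst case the worker still has to write the query (cost s_f).
total_factor = 1.0 + overhead
```

With these constants, both settings evaluate to five, the added overhead is twice the baseline cost, and the worst-case total is three times the baseline.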

We will use the aforementioned setting for most of our experiments. Having determined the number of screens and options, we still need to pick specific screens and answers. First, we discuss the selection of answer options. Note that the worst-case verification cost of a property depends only on the number of options shown (but not on the options themselves). Hence, to pick options, we consider expected verification cost instead.

We calculate expected verification cost based on our classifiers, which assign a probability to each answer option. For a fixed property, denote by $A$ the set of all relevant answer options. Also, denote by $\Pr(a)$ the probability that an answer $a \in A$ is correct. We calculate expected verification cost when presenting users with an (ordered) list of answer options $\langle a_1, \ldots, a_k \rangle$ where $a_i \in A$.

Theorem 2

The expected verification cost for answer options $\langle a_1, \ldots, a_k \rangle$ is $v_p \cdot \sum_{i=0}^{k-1} \left(1 - \sum_{j=1}^{i} \Pr(a_j)\right)$, where $v_p$ is the cost of verifying one answer option for a property.

We consider the case that at most one answer option is accurate (this case is typical). The cost of verifying one answer option is $v_p$ (assuming properties). The probability that workers need to read beyond the $i$-th option is the probability that none of the first $i$ options is correct: $1 - \sum_{j=1}^{i} \Pr(a_j)$. The expected cost is the cost of each verification, weighted by the probability that it is necessary: $v_p \cdot \sum_{i=0}^{k-1} \left(1 - \sum_{j=1}^{i} \Pr(a_j)\right)$.

Corollary 2

Selecting answer options in decreasing order of probability minimizes expected verification cost.

Each term in the cost formula, proven in Theorem 2, decreases if the sum of probabilities of the first $i$ options increases. Hence, starting with higher probability choices decreases cost.
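The effect of the ordering can be checked numerically. The sketch below computes the expected reading cost of an ordered option list under the assumption that at most one option is correct, and ranks options by decreasing probability. The function names, the option labels, and the probabilities are hypothetical.

```python
def expected_verification_cost(probs, v_p=1.0):
    """Expected reading cost for an ordered list of option probabilities.

    An option is read only if none of the earlier options was correct,
    assuming that at most one option is correct.
    """
    cost = 0.0
    prob_earlier_correct = 0.0
    for p in probs:
        cost += v_p * (1.0 - prob_earlier_correct)
        prob_earlier_correct += p
    return cost


def pick_options(option_probs, k):
    """Return the k most probable options, most probable first."""
    ranked = sorted(option_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical classifier output for one query property.
option_probs = {"Supply": 0.25, "TotalDemand": 0.5, "Imports": 0.125}
best = pick_options(option_probs, 3)
cost_desc = expected_verification_cost([0.5, 0.25, 0.125])
cost_asc = expected_verification_cost([0.125, 0.25, 0.5])
```

In this example, the decreasing order yields an expected cost of 1.75 option reads, compared to 2.5 for the increasing order.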

Finally, we discuss the selection of query properties. Our goal is to select the best properties to verify by creating corresponding screens.

We define the quality of a property as follows. At any point (before verification), we consider a set of possible query translations for a claim. A large set of possible query translations is problematic for two reasons. First, it leads to higher computational overheads when executing them to obtain tentative results. Second, we increase overheads for fact checkers who may be presented with a large number of alternatives. A good property has high pruning power with regard to the current set of candidates. This means that it allows us to discard as many incorrect candidates as possible.

How many query candidates we can prune depends on the actual property value. Depending on the answer we obtain from the fact checkers, more or fewer queries can be pruned. Of course, we do not know the correct answers when selecting questions. Hence, we define the expected pruning power of a set of properties as follows.

Definition 5

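Given a set $Q$ of query candidates, a set $P$ of query properties to verify, and trained models predicting a-priori probabilities for possible answers, we define the pruning power $\mathrm{PP}(Q, P)$ as the expected number of queries that are excluded by obtaining answers for $P$.

Next, we provide a formula for pruning power, based on simplifying assumptions. For that, we denote by $a_i^p$ the $i$-th answer option for property $p$ and by $Q(a_i^p) \subseteq Q$ the queries that are excluded if answer option $a_i^p$ turns out to be correct.

Theorem 3

The pruning power is given by $\sum_{q \in Q} \left(1 - \prod_{p \in P} \left(1 - \sum_{i:\, q \in Q(a_i^p)} \Pr(a_i^p)\right)\right)$.

The pruning power is given as the expected number of pruned queries: $\sum_{q \in Q} \Pr(q \text{ pruned})$. Clearly, it is $\Pr(q \text{ pruned}) = 1 - \Pr(q \text{ not pruned})$. We simplify by assuming independence between properties and obtain $\Pr(q \text{ not pruned}) = \prod_{p \in P} \Pr(q \text{ not pruned by } p)$. Furthermore, we assume that different answer options for the same property are mutually exclusive. Then, we obtain $\Pr(q \text{ not pruned by } p) = 1 - \sum_{i:\, q \in Q(a_i^p)} \Pr(a_i^p)$. Substitution yields the postulated formula.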
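The expected number of pruned queries can be computed directly under the two simplifying assumptions (independent properties, mutually exclusive answers per property). The following is a minimal sketch; the data layout (dictionaries keyed by property and answer) and all sample values are assumptions made for illustration.

```python
def prob_not_pruned(query, prop, answers, excluded):
    """Probability that `query` survives the question about `prop`."""
    pruned = sum(p for ans, p in answers[prop].items()
                 if query in excluded.get((prop, ans), set()))
    return 1.0 - pruned


def pruning_power(queries, props, answers, excluded):
    """Expected number of queries excluded by answers for `props`."""
    total = 0.0
    for q in queries:
        survive = 1.0
        for prop in props:  # independence between properties
            survive *= prob_not_pruned(q, prop, answers, excluded)
        total += 1.0 - survive
    return total


# Hypothetical candidates: answers[prop][answer] is the predicted probability,
# excluded[(prop, answer)] is the set of queries ruled out by that answer.
queries = ["q1", "q2", "q3", "q4"]
answers = {"relation": {"A": 0.5, "B": 0.5},
           "formula": {"F": 0.75, "G": 0.25}}
excluded = {("relation", "A"): {"q3", "q4"},
            ("relation", "B"): {"q1", "q2"},
            ("formula", "F"): {"q2", "q4"},
            ("formula", "G"): {"q1", "q3"}}

power_one = pruning_power(queries, ["relation"], answers, excluded)
power_two = pruning_power(queries, ["relation", "formula"], answers, excluded)
```

Here, asking about the relation alone is expected to prune two of the four candidates, while asking about both properties is expected to prune three.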

Next, we discuss the question of how to find property sets maximizing above formula. Iterating over all possible property sets is possible but expensive (exponential complexity in the number of properties). Instead, we select properties according to a simple, greedy approach. At each step, we add whichever property maximizes pruning power to the set of selected properties (when comparing properties to add, we calculate pruning power for the union between the new and previously selected properties). We stop once the number of selected properties has reached the threshold determined before. While this algorithm may seem simple, it offers surprisingly strong formal guarantees. Those guarantees are derived from the fact that pruning power is a sub-modular function [Nemhauser1978]. We define sub-modularity below.

Definition 6

A set function $f: 2^S \to \mathbb{R}$ is sub-modular if, using $\Delta_f(A, x) = f(A \cup \{x\}) - f(A)$, it is $\Delta_f(A, x) \ge \Delta_f(B, x)$ for any $A \subseteq B \subseteq S$ and $x \in S \setminus B$.

Intuitively, sub-modularity captures a “diminishing returns” behavior: as elements are added to a set, the utility of each new element decreases. The pruning power function is sub-modular, according to the following theorem.

Theorem 4

Pruning power is sub-modular.

Consider the probability that one specific query $q$ is not pruned via questions relating to any property, given as the product $\prod_{p} \Pr(q \text{ not pruned by } p)$ over all selected properties $p$ (see proof of Theorem 3). From the perspective of each query, adding one more property corresponds to multiplying its probability of not being pruned by a factor between zero and one. For a factor $\alpha \in [0, 1]$, it is generally $x - \alpha \cdot x \le y - \alpha \cdot y$ if $0 \le x \le y$. As the probability of not being pruned does not increase when adding questions, the impact of adding a new question on pruning probability decreases for each query. This means the probability of one query being pruned is sub-modular in the question set. The same applies to pruning power itself (as a sum over sub-modular functions with positive weights is sub-modular).

Next, we show that the simple greedy algorithm produces a near-optimal set of questions.

Theorem 5

Using the greedy algorithm, we select a set of questions that achieve pruning power within factor $1 - 1/e$ of the optimum.

The greedy algorithm is equivalent to the greedy algorithm by Nemhauser [Nemhauser1978]. The pruning power function is sub-modular (see Theorem 4), it is non-negative (as we sum over probabilities) and non-decreasing (as pruning probability can only increase when adding more questions). Hence, it satisfies the conditions under which those bounds have been proven for Nemhauser’s algorithm [Nemhauser1978].
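The greedy selection can be sketched in a few lines. The code below repeatedly adds the property that maximizes the pruning power of the selected set, until a given number of properties is reached. It is an illustrative sketch rather than the system's implementation: the data layout, the function names, and the sample candidates are assumptions, and pruning power is recomputed from scratch in each step for simplicity.

```python
def pruning_power(queries, props, answers, excluded):
    """Expected number of pruned queries (independent properties assumed)."""
    total = 0.0
    for q in queries:
        survive = 1.0
        for prop in props:
            pruned = sum(p for ans, p in answers[prop].items()
                         if q in excluded.get((prop, ans), set()))
            survive *= 1.0 - pruned
        total += 1.0 - survive
    return total


def greedy_properties(queries, candidates, answers, excluded, budget):
    """Pick up to `budget` properties, one greedy step at a time."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < budget:
        # Compare candidates by the pruning power of the union with the
        # previously selected properties.
        best = max(remaining, key=lambda prop: pruning_power(
            queries, selected + [prop], answers, excluded))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical query candidates and classifier probabilities.
queries = ["q1", "q2", "q3", "q4"]
answers = {"relation": {"R1": 0.5, "R2": 0.5},
           "key": {"K1": 0.75, "K2": 0.25},
           "formula": {"F1": 0.5, "F2": 0.5}}
excluded = {("relation", "R1"): {"q3", "q4"},
            ("relation", "R2"): {"q1", "q2"},
            ("key", "K1"): {"q2", "q3", "q4"},
            ("key", "K2"): {"q1"},
            ("formula", "F1"): {"q4"}}

chosen = greedy_properties(queries, ["relation", "key", "formula"],
                           answers, excluded, budget=2)
chosen_power = pruning_power(queries, chosen, answers, excluded)
```

In this example, the first step picks the key property (expected 2.5 pruned queries), and the second step adds the relation property, which raises the expected number of pruned queries to 3.25.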

Finally, we analyze time complexity (denoting by $n_{sc}$ the number of screens, by $n_{pr}$ the number of properties, and by $n_{qu}$ the number of query candidates).

Theorem 6

Finding optimal question sequences for verifying single claims is in $O(n_{sc} \cdot n_{pr} \cdot n_{qu})$.

The greedy algorithm performs $O(n_{sc})$ steps and considers $O(n_{pr})$ options in each step. Evaluating the pruning power function requires $O(n_{qu})$ steps if using a naive approach (we can reduce complexity if query candidates are represented by a Cartesian product between query properties).

The complexity of selecting optimal question sequences for claims is therefore polynomial in all problem dimensions. This is important, as we need to re-run this step for each claim in the document, whenever classifiers are retrained. This is due to the fact that expected verification cost, based on the optimal question sequence, forms the input to the algorithm discussed next.

5.2 Claim Ordering

Next, we discuss the problem of determining a claim order for verification. At first, it may not be clear why verification order matters. If modeling verification cost per claim as a constant, total verification cost is simply the cost sum over all claims. In that model, verification order does not matter indeed.

However, verification cost per claim is not static. As time progresses, the quality of automated claim translation increases (as claims verified by crowd workers serve as training samples). This decreases expected claim verification cost at the same time (as crowd workers merely need to assert proposed claim translations). Hence, verifying claims in different order may indeed influence overall verification cost.

We consider two criteria when selecting the next claims to verify. First, we consider the benefit of claim labels for training our classifiers (for automated claim to query translation). Second, we consider the expected verification cost.

The first point relates to prior work on active learning. Here, the goal is generally to select optimal training samples to increase the quality of a learned model. In our case, verified claims correspond to training samples for classifiers that translate claims to queries. Picking training samples with maximal uncertainty (according to the current model) is a popular heuristic in the context of active learning. We follow this approach as well and define the training utility as follows.

Definition 7

Let $M$ be a model predicting specific properties of the query associated with a text claim $c$. We assume that $M$ maps each claim to a probability distribution $M(c)$ over property values. Denote by $H(M(c))$ the entropy of that probability distribution. We define the training utility of $c$, $u(c)$, by averaging over the set $\mathcal{M}$ of all models (associated with different query properties): $u(c) = \frac{1}{|\mathcal{M}|} \sum_{M \in \mathcal{M}} H(M(c))$.
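As an illustration of this definition, the sketch below averages the entropy of the predicted distributions over all property models. The function names and the sample distributions are hypothetical, and the logarithm base (two, giving entropy in bits) is an assumption, as the definition does not fix it.

```python
import math


def entropy(dist):
    """Shannon entropy (in bits) of a distribution given as {value: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def training_utility(distributions):
    """Average entropy over the per-property predictions for one claim."""
    return sum(entropy(d) for d in distributions) / len(distributions)


# Hypothetical predictions for one claim: the relation model is undecided
# between two relations, the formula model is certain.
relation_dist = {"Demand": 0.5, "Supply": 0.5}
formula_dist = {"a / b": 1.0}
utility = training_utility([relation_dist, formula_dist])
```

A claim for which all models are certain has utility zero, so labeling it adds little for training under this heuristic.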

The second point (verification cost) relates to the cost model discussed in the previous subsection. However, this cost model is incomplete. It neglects the cost of understanding the context in which a certain claim is placed. Intuitively, verifying multiple claims in the same section is faster than verifying claims that are far apart in the input document. Our extended cost model takes this into account. In contrast to the model from the previous subsection, it calculates verification cost for claim batches (instead of single claims).

Definition 8

Denote by $B$ a batch of claims for verification. For each claim $c \in B$, denote by $s(c)$ the section in which this claim is located (instead of sections, a different granularity such as paragraphs can be chosen as well). Denote by $v(c)$ the pure claim verification cost for $c$ defined in the last subsection. Further, denote by $r(s)$ the cost of reading (respectively skimming) section $s$. We define the total (combined verification and skimming) cost $t(B)$ for claim batch $B$ as the sum of both verification cost over all claims and reading cost over all associated sections: $t(B) = \sum_{c \in B} v(c) + \sum_{s \in \{s(c) \mid c \in B\}} r(s)$.
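The batch cost can be sketched as follows. Each section is charged once, no matter how many claims of the batch it contains. The function name, the dictionary-based inputs, and the cost values are illustrative assumptions.

```python
def batch_cost(batch, section_of, verify_cost, read_cost):
    """Verification cost of all claims plus reading cost of their sections."""
    verification = sum(verify_cost[c] for c in batch)
    sections = {section_of[c] for c in batch}  # each section counted once
    reading = sum(read_cost[s] for s in sections)
    return verification + reading


# Hypothetical document: claims c1 and c2 share a section, c3 is elsewhere.
section_of = {"c1": "s1", "c2": "s1", "c3": "s2"}
verify_cost = {"c1": 10, "c2": 12, "c3": 8}
read_cost = {"s1": 30, "s2": 20}

same_section = batch_cost(["c1", "c2"], section_of, verify_cost, read_cost)
two_sections = batch_cost(["c1", "c3"], section_of, verify_cost, read_cost)
```

In this example, the batch within one section costs 52, while the batch spanning two sections costs 68 although its pure verification cost is lower.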

This cost model has the desired property: it captures the fact that verifying claims in the same section is faster. Our approach to claim ordering is based on this model. It is not useful to determine a global claim order before verification starts. We cannot predict how the quality of classifiers (and therefore claim verification cost) will change over time. Instead, we repeatedly select claim batches that are presented to the checkers. Those claim batches are selected based on training utility and the aforementioned cost model.

Note that we prefer selecting claim batches as opposed to single claims. First, presenting fact checkers with claim batches allows them to better plan their verification strategy. For instance, claims can be clustered in a first pass to treat claims that are semantically related together during verification. Second, integrating new training samples and optimally selecting claim batches is computationally expensive. As discussed next, selecting claim batches is a hard optimization problem. Also, retraining all classifiers (the operation that motivates re-running claim selection) is a relatively expensive operation on our test platform. By re-training on claim batches, rather than single claims, we reduce computational overheads.

To select claim batches, we solve the following optimization problem.

Definition 9

Given a set of unverified claims $C$, the goal of claim selection is to select a claim batch $B \subseteq C$ such that the total cost $t(B)$ of $B$ remains below a threshold $t_m$: $t(B) \le t_m$. Additionally, the minimal and maximal batch size is restricted by parameters $b_{\min}$ and $b_{\max}$: $b_{\min} \le |B| \le b_{\max}$. Under those constraints, the goal is to maximize accumulated training utility $u(B) = \sum_{c \in B} u(c)$. Alternatively, as a variant, we minimize the cost formula $t(B) - \lambda \cdot u(B)$ where $\lambda$ is a weight representing the relative importance of selecting claims with high uncertainty for classifier training.

This problem is hard, as shown by the following theorem.

Theorem 7

Claim selection is NP-hard.

We prove NP-hardness by a reduction from the knapsack problem. Let $I$ be a set of items with associated weights $w_i$ and benefits $g_i$. The goal is to maximize accumulated benefit for an item set $S \subseteq I$ whose accumulated weight remains below a threshold $W$: $\sum_{i \in S} w_i \le W$. We construct an equivalent instance of claim selection as follows. We introduce an unverified claim $c_i$ for each item $i$. We assume that each claim is located in a separate section ($s(c_i) = s_i$ for claim $c_i$). We set combined verification and reading cost for each claim and associated section to be proportional to item weight: $v(c_i) + r(s_i) \propto w_i$, scaling the cost threshold $t_m$ in the same way relative to $W$. Training utility is proportional to benefit ($u(c_i) \propto g_i$). We choose cardinality bounds that do not influence the solution ($b_{\min} = 0$ and $b_{\max} = |I|$). Now, an optimal solution to claim selection yields an optimal solution to the original knapsack instance (via a polynomial time transformation).

The fact that claim selection is NP-hard justifies the use of sophisticated solver tools. We reduce the problem to integer linear programming. This allows us to apply mature solvers for this standard problem. Next, we discuss how we transform claim selection into integer linear programming.

An integer linear program (ILP) is generally characterized by a set of integer variables, a set of linear constraints, and a (linear) objective function. The goal is to find an assignment from variables to values that minimizes the objective function, while satisfying all constraints.

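We introduce binary decision variables of the form $x_i$, indicating whether the $i$-th claim $c_i$ was selected ($x_i = 1$) or not ($x_i = 0$). Also, we introduce binary variables of the form $y_j$ to indicate whether section number $j$ needs to be skimmed or not (to verify the selected claims). Next, we express the constraints of our scenario on those variables. First, we limit the number of selected claims to the range $[b_{\min}, b_{\max}]$ by introducing the linear constraints $b_{\min} \le \sum_i x_i \le b_{\max}$. Next, we represent the constraint that sections of selected claims must be read. We introduce constraints of the form $y_j \ge x_i$ if claim $i$ is located within section $j$. Furthermore, we limit accumulated verification cost of the selected claims by the constraint $\sum_i v(c_i) \cdot x_i + \sum_j r(s_j) \cdot y_j \le t_m$, where $v(c_i)$ is the verification cost of claim $i$, $r(s_j)$ the reading cost of section $j$, and $t_m$ the cost threshold. Finally, we set $-\sum_i u(c_i) \cdot x_i$, the negated accumulated training utility, as objective function to minimize.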
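For very small inputs, the optimization problem encoded by these variables and constraints can be checked by exhaustive enumeration. The sketch below is such a brute-force reference and not the ILP-based approach described here: it enumerates all batches within the size bounds, discards those exceeding the cost threshold, and keeps the one with the highest accumulated training utility. All names and sample values are hypothetical.

```python
from itertools import combinations


def select_batch(claims, section_of, verify_cost, read_cost, utility,
                 max_cost, min_size, max_size):
    """Exhaustively search for the feasible batch with maximal utility."""
    best_batch, best_utility = None, float("-inf")
    for size in range(min_size, max_size + 1):
        for batch in combinations(claims, size):
            sections = {section_of[c] for c in batch}
            cost = (sum(verify_cost[c] for c in batch)
                    + sum(read_cost[s] for s in sections))
            if cost > max_cost:  # violates the cost threshold
                continue
            gain = sum(utility[c] for c in batch)
            if gain > best_utility:
                best_batch, best_utility = batch, gain
    return best_batch, best_utility


# Hypothetical instance with three claims in two sections.
claims = ["c1", "c2", "c3"]
section_of = {"c1": "s1", "c2": "s1", "c3": "s2"}
verify_cost = {"c1": 10, "c2": 12, "c3": 8}
read_cost = {"s1": 30, "s2": 20}
utility = {"c1": 0.5, "c2": 0.25, "c3": 1.0}

loose = select_batch(claims, section_of, verify_cost, read_cost, utility,
                     max_cost=70, min_size=1, max_size=2)
tight = select_batch(claims, section_of, verify_cost, read_cost, utility,
                     max_cost=60, min_size=1, max_size=2)
```

The enumeration is exponential in the number of claims, which is why it only serves as a correctness check on toy instances; a solver-based formulation is needed for realistic document sizes.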

The time complexity for solving a linear program generally depends on the solver (and the algorithm it selects to solve a specific instance). However, the number of variables and constraints often correlates with solution time. We analyze both in the following.

Theorem 8

The size of the ILP problem is in $O(n + m)$ where $n$ is the claim count and $m$ the section count.

The number of variables is in $O(n + m)$ while the number of constraints (specifically: constraints connecting claims to sections read) is in $O(n)$.

The ILP size grows relatively slowly in the number of claims and sections. While claim selection remains NP-hard, we show in our experiments that we can solve corresponding problem instances sufficiently fast in practice.

6 Experiments

We evaluated Scrutinizer using real data along two dimensions: (i) the effectiveness of the tool in real verification tasks with domain experts, (ii) the effectiveness and efficiency of question scheduling. The code of the system is available online.

Dataset. We obtained a document of 661 pages, containing 7901 sentences, and the corresponding corpus of manually checked claims, with check annotations for every claim from three domain experts. The annotations cover 1539 numerical claims, of which about half are explicit. Checking claims requires massive effort because the document authors write the report based on early estimates, so the data underlying the book change over time. In the first pass of the draft, up to 40% of the claims are updated.

Property       10%   25%   50%   95%   99%
Relation         2     4    10   199   532
Primary Key      2     2     4    39   107
Attribute        1     2     7   127  1400
Formula          1     1     1     8    55
Table 1: Percentiles of property value frequencies.

After processing the claims, we identify 1791 relations, 830 key values, 87 attribute labels, and 413 formulas. Table 1 shows some percentiles of the frequency distribution of each property. We see that 50% of the values for all properties appear at most 10 times in the corpus, with the top 5% most frequent formulas appearing at least 8 times.

6.1 User Study

In this experiment we involved seven domain experts from the institution to measure the benefit of our system compared to the traditional manual workflow for verification. We trained Scrutinizer with all the annotated statistical claims and randomly selected 43 claims among the ones with the 10 formulas that cover the majority of the claims. As we only have access to the correct version of the claim, we randomly selected 25% of them to inject errors.

Three experts were randomly assigned to the Manual process and the remaining four to the System-assisted process. We gave them instructions to execute the test without interruptions and without collaboration. Three claims (two correct, one incorrect) were used for training on the new process and the remaining 40 for the study. The task given to the experts was to verify as many claims as possible in 20 minutes, given access to their traditional tools in the manual process (spreadsheets and databases) and to our system only in the second case. The order of the claims was fixed to allow comparison among experts, and the time for checking every claim was recorded.