Polish Natural Language Inference and Factivity – an Expert-based Dataset and Benchmarks

Despite recent breakthroughs in Machine Learning for Natural Language Processing, Natural Language Inference (NLI) problems still constitute a challenge. To this end we contribute a new dataset that focuses exclusively on the factivity phenomenon; our task, however, remains the same as in other NLI tasks, i.e. prediction of entailment, contradiction or neutral (ECN). The dataset contains entirely natural language utterances in Polish, gathering 2,432 verb-complement pairs and 309 unique verbs. The dataset is based on the National Corpus of Polish (NKJP) and is a representative sample with regard to the frequency of main verbs and other linguistic features (e.g. occurrence of internal negation). We found that transformer BERT-based models working on sentences obtained relatively good results (≈89% F1 score). Even better results (≈91% F1 score) were achieved using linguistic features, but this model requires more human labour (humans in the loop) because the features were prepared manually by expert linguists. BERT-based models consuming only the input sentences thus capture most of the complexity of NLI/factivity. Complex cases of the phenomenon, e.g. cases with entailment (E) and non-factive verbs, remain an open issue for further research.




1 Introduction

Semantics is still one of the biggest problems of Natural Language Processing (NLP) (for relatively new and important work related to this topic, see Richardson et al., 2020; Zhang et al., 2020). It should not come as a surprise; semantic problems are also the most challenging in the field of linguistics itself (see Speaks, 2021). The topics of presupposition and of relations such as entailment, contradiction and neutrality are at the core of semantic-pragmatic research (Huang, 2011). For this reason, we deal with the issue of factivity, which is one of the types of presupposition.

The subject of this study includes three phenomena occurring in the Polish language. The first of them is factivity (P Kiparsky, 1971; Karttunen, 1971b). The next phenomenon is the three relations: entailment, contradiction and neutrality (ECN), often studied in Natural Language Inference (NLI) tasks. The third and last phenomenon is utterances with the following syntactic pattern: ’[verb][że][complement clause]’. The segment że corresponds to the English segments that and to.

This study aims to answer the following research questions (RQs):

  1. How relevant is the opposition factivity / non-factivity for the prediction of ECN relations?

  2. How well do machine learning (ML) models recognize the entailment, contradiction and neutrality in utterances with the structure ’[verb][that][complement clause]’?

To answer the first question, we collected a dataset based on the National Corpus of Polish (NKJP). The NKJP corpus is the largest Polish corpus; it is genre-diverse, morphosyntactically annotated and representative of contemporary Polish (Przepiórkowski et al., 2012). Our goal was to prepare a dataset that is representative and adequately reflects the problems of NLI in the Polish language. Additionally, besides the utterances themselves, we prepared multiple linguistic features in order to study and compare their influence on our main task – prediction of ECN relations.

The answer to the second question required building ML models. We trained models based on the prepared linguistic features and on text embeddings from BERT variants. Note that if a presupposition trigger in the form of a factive verb is used in an utterance, then – under certain conditions – an entailment relation occurs between the whole sentence and its complement. If a non-factive verb is used, any of the three relations is possible: entailment, contradiction, or neutrality. We investigated whether modern machine learning models handle this linguistic phenomenon.
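The observation above already suggests a trivial feature-based baseline. The following sketch is our own illustration, not one of the paper's benchmark models, and the feature names are assumptions: a factive main verb predicts entailment regardless of internal negation (the presupposition survives negation), and everything else falls back to the majority class.

```python
def rule_baseline(verb_type: str, internal_negation: bool) -> str:
    """Predict the T -> H relation from two manually annotated features."""
    if verb_type == "factive":
        # A factive presupposition survives internal negation,
        # so the negation flag does not change the prediction.
        return "entailment"
    # Non-factive verbs admit all three relations; fall back to the
    # majority class among non-factive examples (neutral).
    return "neutral"
```

Any learned model must at least beat such a rule to demonstrate that it captures more than the factive/non-factive lexical split.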

We adopted the following research perspective: first, we chose a linguistic issue, namely the opposition of factivity vs non-factivity. Factivity has posed theoretical problems since the phenomenon was first noticed, and it has not so far been a subject of research for the Polish language. We then created a dataset that reflects this opposition in communication conducted in Polish, and with this dataset we addressed the research questions outlined above. It is worth mentioning that we adopted a different procedure than those commonly used in ML papers. Usually, researchers gather datasets with a chosen ML task in mind (e.g., classification, sentiment analysis, predicting ECN), which is sometimes not even related to linguistic problems. In this standard ML approach, the datasets are usually as big as possible and are sometimes synthetically generated or specially prepared by linguists. Our procedure differs here as well: we chose only natural utterances that appeared in authentic communication (from the NKJP corpus).

Thus, in this paper, our contributions are as follows:

  • gathering a new dataset, LingFeatured NLI, based on fully natural utterances from the NKJP. The dataset consists of 2,432 verb-complement pairs. It was enriched with various linguistic features to support inference of the utterance relation types, i.e. entailment, contradiction, neutral (ECN) (see Section 4). To the best of our knowledge, it is the first such dataset in the Polish language. Additionally, all the utterances constituting the dataset were translated into English.

  • building ML benchmark models (linguistic-feature-based and embedding-based BERT) that predict the utterance relation type ECN (see Section 5). (The dataset, its translation, the source code, and our ML models are attached as supplementary material and will be made publicly available after acceptance.)

In the following, Section 2 and Section 3 describe the theoretical background, related work and datasets. Then, Section 4 introduces our new dataset LingFeatured NLI with a discussion of its language background, annotation process and features. Section 5 presents our machine learning modelling approach and experiments. Further, in Section 6, we analyze the results and formulate findings. Finally, we summarize our work and indicate its main findings and limitations in Section 7.

2 Linguistic Background

2.1 Linguistic Problems

In the linguistic and philosophical literature, the topic of factivity is one of the most disputed. The work of Kiparsky and Kiparsky (P Kiparsky, 1971) began the never-ending debate about presupposition and factivity in linguistics. This topic was of interest to linguists especially in the 1970s (see, e.g., (Karttunen, 1971b; Givón, 1973; Elliott, 1974; Hooper, 1975; Delacruz, 1976; Stalnaker et al., 1977)). Since the entry of the term factivity into the linguistic vocabulary in the 1970s, there have been many, often mutually exclusive, theoretical studies of this phenomenon.

Karttunen’s (Karttunen, 2016) article with the significant title Presupposition: What went wrong? emphasized this fact. He mentioned that factivity is the research area that raised the most controversy among linguists. It should be clearly stated that no single dominant concept explaining the phenomenon of factivity has emerged so far. Nowadays, new research on this topic appears constantly, e.g., (Giannakidou, 2006; Egré, 2008; Beaver, 2010; Tsohatzidis, 2012; Anand and Hacquard, 2014; Kastner, 2015; Djärv, 2019). The clearest line of conflict concerns the question in which area to situate the topic of factivity: semantics or pragmatics. There is also a dispute about discursive concepts (see, e.g., (Abrusán, 2016; Tonhauser, 2016; Simons et al., 2017)). An interesting example from our research point of view is the work of (Djärv and Bacovcin, 2020). These authors argue against the claim that factivity depends on the prosodic features of the utterance, pointing out that it is a lexical rather than a discursive phenomenon.

In addition to the disputes in the field of linguistics, there is also work (see, e.g., (Hazlett, 2010, 2012)) of a more philosophical orientation which strikes at beliefs that are mostly considered accurate in linguistics, e.g., that know that is a factive verb. These works, however, have been met with a distinctly polemical response (see, for example, (Turri, 2011) (Tsohatzidis, 2012)).

In summary, the first problem to note is the clear differences in theoretical discussions of the phenomenon of factivity-based presupposition. These take place not only in the older literature, but also in the contemporary one.

Theoretical differences are related to another issue, namely the widespread disagreement about which verbs are factive and which are not. A good example is the verb regret that, which, depending on the author, is factive, is not factive, or exhibits a type of factivity different from that of know that, the paradigmatic member of the class of factive expressions (see, on the one hand, Karttunen, 1971b; Öztürk, 2017; Dietz, 2018, and on the other hand, Egré, 2008; Dahlman, 2016; Grigore et al., 2016).

The explosion of work on presupposition in the 1970s and the multiplicity of theoretical concepts resulted in uncontrolled growth of terminological proposals and changes in the meaning of terms already in use. The term presupposition has been ambiguous since the 1970s, and this state of affairs persists today. Terms such as factivity, presupposition, modality and implicature are indicated as typical examples of ambiguous expressions. Terminology is thus the third problem to be highlighted in work on factivity. It is important to note the worrying phenomenon of these terminological issues being transferred into NLI research; studies analogous to ours should bring attention to this difficulty.

A final point to note is the lack of linguistic papers that provide a fairly exhaustive list of factive, non-factive, veridical, etc. expressions. There is also a lack of comparative work between ethnic languages in general. This kind of research is only available for individual, selective expressions (see e.g. (Özyıldız, 2017; Hanink and Bochnak, 2017; Jarrah, 2019; Dahlman and van de Weijer, 2019)).

2.2 Key Terms

We, therefore, chose to put more emphasis on conceptual issues. The concepts most important to this study will now be presented, primarily factive presupposition.

2.2.1 Entailment, Contradiction, Neutral

Let’s start by introducing an understanding of the basic semantic relations:

  • Entailment: H must be true if T is true;

  • Contradiction: H must be false if T is true;

  • Neutral: H may be false or true if T is true
     (Chierchia and McConnell-Ginet, 2000).

where (T) is an utterance and (H) is an item of information.
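These definitions can be stated compactly in terms of the set of truth values (H) can take in situations where (T) is true; the following is a minimal sketch of our own (not code from the paper):

```python
def relation(possible_h_values: frozenset) -> str:
    """Map the truth values H can take (given that T is true) to a label."""
    if possible_h_values == frozenset({True}):
        return "entailment"      # H must be true if T is true
    if possible_h_values == frozenset({False}):
        return "contradiction"   # H must be false if T is true
    return "neutral"             # H may be true or false
```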

2.2.2 Information

The information we are interested in is that transmitted by means of spoken sentences, which are utterances. The mechanisms involved in this transmission may be either of a purely semantic (lexical consequences) or pragmatic nature (conversational/scalar implicatures) (Grice, 1975; Sauerland, 2004). Three examples of information are shown below.

<T, H> Example 1. (Entailment)
T: Only then did they realize that they had been cut off from their only way out.
H: They had been cut off from their only way out.

<T, H> Example 2. (Contradiction)
T: If it wasn’t for the special smell of medicines in the air, you’d think it was just a normal room.
H: It was just a normal room.

<T, H> Example 3. (Neutral)
T: Statists believe that people can be demanding towards the state.
H: People can be demanding towards the state.

In Example 1, the entailment is lexical in nature because it is founded on the factive verb realize that. The nature of the contradiction in Example 2 is not obvious; certainly (H) is not a factive presupposition, since the verb think that is not a factive unit. Regardless of the kind of contradiction we are dealing with here, this relation nonetheless holds between (T) and (H). In Example 3, on the other hand, we have neutrality: there is nothing in utterance (T) that guarantees either a lexical entailment or a contradiction, and there are no pragmatic mechanisms that would ground either of these two relations.

2.2.3 Negation

We take into account in our dataset the occurrence of negation, specifically internal negation. We distinguish it from the so-called external negation, which is not relevant to the phenomenon of factivity. The examples of these two types of negation can be found below.

<T, H> Example 4. (Entailment)
T: He didn’t manage to open the door.
H: He tried to open the door.

<T, H> Example 5. (Neutral)
T: It is not the case that he managed to open the door.
H: He tried to open the door.

In Example 4, internal negation was used, and in Example 5, external negation. The utterance in Example 4 implies (H), whereas the one in Example 5 does not. The source of this difference is the different type of negation used. In the case of the implicative verb manage to (Karttunen, 1971a), some of its meaning components are not within the scope of internal negation, e.g., the information that someone tried to do something. Thus, this information is actualized in the utterance of Example 4 in the form of (H). The external negation applied in Example 5 brings all the meaning components of the verb within its scope, so (H) does not follow.

2.2.4 Presupposition

We understand the term presupposition as follows: if the utterance (T) entails (H) and (H) is insensitive to internal negation, whether currently occurring or potential, then (H) is a presupposition of the utterance (T). We call all such information presuppositions, regardless of its detailed nature. Thus, presuppositions can have both semantic and non-semantic grounds. In the literature, one can find lists of expressions and constructions that are classically called presupposition triggers (Levinson, 1983). Below is an example illustrating a presupposition based on a factive verb.

<T, H> Example 6.
T: [PL] Ona wie, że należy podać hasło.
T: [ENG] She knows that a password must be provided.
>> H [PL]: Należy podać hasło.
>> H [ENG]: A password must be provided.

Information (H) in Example 6 is a presupposition of the utterance because this information is insensitive to internal negation. Presupposition (H) is guaranteed by the factivity property of the verb wiedzieć, że / know that.

The relation between a semantic entailment and a semantic presupposition is shown in Example 7.

<T, H> Example 7.
T: The driver managed to open the door before the car sank.
(Ha) The driver managed to open the door before the car sank.
(Hb) The driver tried to open the door before the car sank.
(Hc) The driver opened the door before the car sank.
(Hd) The car sank.

The utterance in Example 7 semantically entails (Ha)-(Hd) because the definition of this type of entailment is fulfilled: if T is true, then H must be true. Apart from that, the pieces of information (Hb) and (Hd) are also presuppositions of the utterance. They meet the defining conditions of a presupposition: they are insensitive to – in this case, potential – internal negation. In other words, we treat presupposition as a certain subtype of entailment.
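The subtype relationship can be made explicit with a small sketch of Example 7 (our own illustration): all of (Ha)-(Hd) are semantic entailments of (T), but only those insensitive to internal negation, (Hb) and (Hd), qualify as presuppositions.

```python
# For each entailed item of Example 7: does it survive internal negation
# ("The driver didn't manage to open the door before the car sank")?
survives_internal_negation = {
    "Ha": False,  # "managed to open" is negated away
    "Hb": True,   # "tried to open" survives: not managing still implies trying
    "Hc": False,  # "opened the door" is negated away
    "Hd": True,   # "the car sank" survives
}
entailments = set(survives_internal_negation)
presuppositions = {h for h, ok in survives_internal_negation.items() if ok}
# Presupposition is a subtype of entailment:
assert presuppositions <= entailments and presuppositions == {"Hb", "Hd"}
```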

To analyze the difference between semantic and non-semantic presupposition, consider the utterances in Examples 8 and 9. Both entail information (H). Moreover, in both cases, (H) is not within the scope of internal negation, so (H) is their presupposition. However, the foundations of these presuppositions are radically different. In Example 8, (H) is guaranteed by the appropriate prosodic structure of the utterance, whereas in Example 9, the presupposition (H) has a semantic grounding: it is guaranteed by the factive verb know that. In other words, the entailment in Example 8 is not lexical, and in Example 9, it is.

<T, H> Example 8.
T: She was not told that he was already married.
H: He is already married.

<T, H> Example 9.
T: She didn’t know that he was already married.
H: He is already married.

2.2.5 Factivity

It is worth noting the occurrence of the following four terms in the NLI literature: factivity, event factuality, veridicality, speaker commitment. These terms, unfortunately, are sometimes understood differently depending on the work in question. In the presented study, we use only the term factivity, which is understood as an element of the meaning of particular lexical units. Such phenomena as speaker "degrees of certainty" are entirely outside the scope of the research presented here. We assume that presupposition founded on factivity takes place independently of communicative intentions; it may or may not occur: there are no states in between. For comparison, let’s look at how the term "event factuality" is understood by the authors of the FactBank dataset:

"(…) we define event factuality as the level of information expressing the commitment of relevant sources towards the factual nature of events mentioned in discourse. Events are couched in terms of a veridicality axis that ranges from truly factual to counterfactual, passing through a spectrum of degrees of certainty (Saurí and Pustejovsky, 2009, p.231)."

In another paper, the same pair of authors provide the following definition:

"Event factuality (or factivity) is understood here as the level of information expressing the factual nature of eventualities mentioned in text. That is, expressing whether they correspond to a fact in the world (…) (Saurí and Pustejovsky, 2012, p.263)."

It seems that the above two explanations of the term event factuality are significantly different. They are also composed of other terms that require a separate explanation, e.g. discourse, veridicality, spectrum of degrees of certainty, level of information.

Note that the second quoted fragment also defines factivity; the authors apparently equate "event factuality" with "factivity". Reading the specifications of the FactBank corpus and the instructions for the annotators leads, in turn, to the conclusion that Saurí and Pustejovsky understand factivity (a) as a subset of factuality, and (b) in the "classical way", as a property of certain lexical units (Saurí and Pustejovsky, 2009).

In the presented approach, factivity is understood precisely as a semantic feature of specific lexical units. In other words, it is an element of their meaning. Following the terminology used in the literature, we say that factive verbs are presupposition triggers. Using the category of semantic signature, it can be said that factive verbs belong to the category (+/+) (see Section 3 for an explanation of the term signature). Examples 10 and 11 illustrate presuppositions based on the meaning of the factive verb be aware that.

<T, H> Example 10.
T: She was not aware that she had made a fatal mistake.
H: She made a fatal mistake.

<T, H> Example 11.
T: She was aware she had made a fatal mistake.
H: She made a fatal mistake.

Information (H) follows semantically both from Example 10 and from its modification in Example 11. This information is beyond the scope of internal negation and is, therefore, a presupposition of each utterance. The foundation of this presupposition is the presupposition trigger in the form of the verb be aware that. Neither the common knowledge of the speakers nor the prosodic properties of the utterances are relevant to the fact that (H) in the above examples is a presupposition.

In summary, presuppositions can be either lexical or pragmatic in nature. What they have in common is that they are insensitive to internal negation. We treat presuppositions founded on factive verbs as lexical presuppositions. If information (H) is a lexical presupposition of an utterance T, then T entails (H). These relations are independent of the speaker’s communicative intention; it means that the speaker may presuppose something unconsciously or against his own communicative intention.

3 Related Datasets

The historical background of this paper draws on (Cooper et al., 1996; Dagan et al., 2005, 2013). These works established a certain pattern of construction of linguistic material, consisting of pairs of sentences: thesis and hypothesis (<T, H>). In this work, the source of the (T) utterances is the NKJP, and the (H) are complement clauses of (T).

<T, H> Example 12.
T: [PL] Myśleli, że zwierzęta gryzą się ze sobą. [NKJP]
H: Zwierzęta gryzą się ze sobą.
T: [ENG] They thought the animals were biting each other.
H: The animals were biting each other.

We will now review some of the most recent and similar works. The first of these is (Ross and Pavlick, 2019). The central term of this work is veridicality, which is understood as follows: "A context is veridical when the propositions it contains are taken to be true, even if not explicitly asserted." (Ross and Pavlick, 2019, p. 2230). As can be seen, the quoted definition also includes situations in which the entailment is guaranteed by factive verbs. The authors pose the following research question: whether neural models of natural language learn to make inferences about veridicality consistent with those made by humans. This is a very different question from the one posed in this paper. Ross and Pavlick used Likert scales to conduct annotations employing unqualified annotators (Ross and Pavlick, 2019). They then checked the extent to which the models’ predictions coincide with the human annotations obtained. Unlike these authors, we do not use any scales in annotation, and the objective of the models we train is to predict real semantic relations, not those that occur as judged by humans.

Let us also note that this is an example of work that uses semantic signatures. The authors distinguish eight pairs of semantic signatures, e.g., (+/+) (realize that), (+/-) (manage to), (-/+) (forget to). A similar approach is taken in (Rudinger et al., 2018), i.e., factive verbs are one of several types of expressions of interest. In contrast to these works, we distinguish only two classes of verbs: factive (+/+) and non-factive (everything else). Thus, the non-factive class includes all verbs that do not belong to the group (+/+), i.e. verb classes such as (+/-), (-/-), (-/+), etc. Since we operate with the concepts of factive/non-factive verbs, we do not use the notion of semantic signatures in this paper. We are aware that in similar papers the number of verb distinctions is sometimes significantly higher. The decision to use only a binary distinction (factive vs. non-factive) is dictated by several interrelated considerations. First, there are no lists of Polish verbs with signatures assigned by linguists. Second, the preparation of such a list of even several dozen verbs is a highly specialized task; there are still disputes in the literature about the status of some high-frequency verbs, e.g. regret that. Third, we are interested in the real features of lexical units, and not in 'textual' ones, i.e. those developed by non-specialist annotators using the committee method. The development of implication signatures by unqualified annotators would be pointless with regard to the research questions posed. The type of linguistics used in this work is formal linguistics, which investigates the real features of a language, unlike survey linguistics, which collects the intuitions of speakers of an ethnic language. (See (Ipeirotis et al., 2010) and (Hsueh et al., 2009) on the problems of low-quality annotation with Amazon Mechanical Turk and how to solve them.)
The last reason is that the factivity/non-factivity split, given the frequency of occurrence of these relations, is most important for the entailment and neutral classes.

Another important paper is (Jiang and de Marneffe, 2019), in which the authors take a closer look at the CommitmentBank dataset. The annotation process for this dataset also used a Likert scale. The paper concludes that BERT systematically commits specific patterns of errors; it does not handle inferences described as pragmatic, e.g.:
T: Those people… Not a one of them realized I was not human. They looked at me and they pretended I’m someone called Franz Kafka. Maybe they really thought I was Franz Kafka.
H: he was Franz Kafka

Other new work worth paying attention to is (Parrish et al., 2021). It presents the Naturally-Occurring Presuppositions in English (NOPE) Corpus, which considers a diverse set of presupposition triggers (as many as 10 types). It is worth noting that this set does not include factivity. The authors argue that there is no clear distinction between factive and non-factive verbs. They also state that even verbs such as know, which "are commonly regarded as factive," do not always guarantee that the complement is true. We strongly disagree with the claim that there is no clear distinction between factive and non-factive verbs: this distinction is a central assumption of our work. We do, of course, agree that in certain contexts factive verbs do not guarantee the truthfulness of sentence complements; however, we treat this as a research challenge, and at the current stage of work we estimated the scale of this phenomenon in communication conducted in Polish.

It is also worth noting completely new work on the topic of interest, namely (Jiang and de Marneffe, 2021), (Tarunesh et al., 2021b), (Tarunesh et al., 2021a) and (Yanaka et al., 2021). These works, like the earlier ones, point to the limited ability of models to recognize phenomena such as factivity.

4 Language Material & Our Dataset

Our LingFeatured NLI dataset focuses on a specific linguistic phenomenon: the opposition of factivity vs. non-factivity and the relation of these categories to the semantic relations of entailment, contradiction and neutrality. We believe that such focused datasets allow for better specialization of ML models by narrowing the scope of features over which they must generalize (see (Poliak, 2020)). The three most important features of our dataset are as follows:

  • it does not contain any prepared utterances, only authentic examples from the national language corpus,

  • it is not balanced, i.e., some features are represented more frequently than others; it reflects authentic communication in Polish,

  • each pair <T, H> is assigned a number of linguistic features, e.g. the main verb, its grammatical tense, the presence of internal negation, the type of utterance, etc. In this way, it allows us to compare different models – embedding-based or feature-based.
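A single record of the dataset can be pictured as follows. This is our own illustration: the field names and the label assigned to Example 12 are assumptions for exposition, not the released schema.

```python
from dataclasses import dataclass

@dataclass
class LingFeaturedPair:
    t: str                    # full utterance (thesis)
    h: str                    # complement clause (hypothesis)
    label: str                # "entailment" | "contradiction" | "neutral"
    verb: str                 # main verb introducing the complement
    verb_type: str            # "factive" | "non-factive"
    verb_semantic_class: str  # epistemic | speech | emotive | perceptual | other
    verb_tense: str           # grammatical tense of the main verb
    complement_tense: str     # grammatical tense of the complement clause
    utterance_type: str       # indicative, performative, interrogative, ...
    internal_negation: bool   # does T contain internal negation?

# Example 12, encoded: "think that" is non-factive, so the relation is neutral.
ex12 = LingFeaturedPair(
    t="Myśleli, że zwierzęta gryzą się ze sobą.",
    h="Zwierzęta gryzą się ze sobą.",
    label="neutral", verb="myśleć, że", verb_type="non-factive",
    verb_semantic_class="epistemic", verb_tense="past",
    complement_tense="present", utterance_type="indicative",
    internal_negation=False,
)
```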

4.1 Input Material Sources & Extraction

The material basis of our dataset is the National Corpus of Polish (NKJP) (Przepiórkowski et al., 2012). We used a subset of the NKJP in the form of the Polish Coreference Corpus (PKK) (Ogrodniczuk et al., 2014), which contains randomly selected fragments of the NKJP and constitutes its representative sample. We did not add any prepared utterances – our dataset consists only of original utterances found in the PKK. Moreover, the selected utterances have not been modified in any way – we did not correct typos, syntactic errors, or other difficulties and issues. Thus, the language content remained entirely natural, not artificially improved.

Using Discann (http://zil.ipipan.waw.pl/Discann), we automatically annotated all occurrences of the phrase że (that | to), as in Example 13.

<T, H> Example 13.
T: [PL] Przez lornetkę obserwuję, że zalane zostały żerowiska bobrów.
T: [ENG] I can see through binoculars that the beaver feeding grounds have been flooded.

From more than 3,500 utterances, we kept only those that satisfied the following pattern: ’[verb] [że] [complement clause]’. This required a manual review of the entire dataset. Thus, we obtained 2,320 utterances that constitute the empirical basis of our dataset.
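The first, automatic filtering step can be approximated with a simple pattern match. This is only a rough sketch of our own; the actual pipeline used Discann annotation followed by manual review.

```python
import re

# Match the complementizer "że" as a standalone word
# (default Unicode word boundaries handle the Polish "ż").
ZE_PATTERN = re.compile(r"\bże\b")

def has_ze_clause(utterance: str) -> bool:
    """Rough pre-filter: does the utterance contain the word 'że'?"""
    return ZE_PATTERN.search(utterance) is not None

# The distinct word "ze" (as in "ze sobą") is not matched:
assert has_ze_clause("Myśleli, że zwierzęta gryzą się ze sobą.")
assert not has_ze_clause("Zwierzęta gryzą się ze sobą.")
```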

4.2 Dataset Content

Finally, the dataset consists of 2,320 real utterances from which 2,432 <T, H> pairs were formed. Each of these pairs was assigned one of three relations: entailment, contradiction, or neutral. In addition, each utterance (T) was assigned several linguistic features. The occurrence of features is not balanced; e.g., the entailment class constitutes 33.88% of the dataset, contradiction 4.40%, and neutral 61.72%. Thus, in this shape, the dataset constitutes a representative sample of communication in Polish. As can be seen, e.g., utterances with negation in the studied syntactic construction appear relatively rarely (less than 5%). Table 1 shows the detailed distribution of features in the set.

Feature                           Distribution
target / GOLD – logic relations   entailment 33.88%; contradiction 4.40%; neutral 61.72%
Verb type (factivity)             factive 24.96%; non-factive 75.04%
Grammatical verb tense            past 36.18%; present 52.22%; future 3.08%; none 8.51%
Utterance type                    indicative 90.17%; performative 2.43%; rule 2.22%; interrogative 1.97%; imperative 1.93%; counterfactual 0.66%; conditional 0.62%
Verb semantic class               epistemic 51.85%; speech 38.03%; perceptual 1.81%; emotive 1.40%; other 6.91%
Occurrence of internal negation   occurs in 4.93%; does not occur in 95.07%
Tense of complement clause        past 23.23%; present 49.01%; future 14.80%; other 12.95%
Table 1: Distribution of features in the LingFeatured NLI dataset (Polish version)
                    Factive   Non-factive
C – contradiction         0           107
E – entailment          593           231
N – neutral              14          1487
Total                   607          1825
Table 2: Contingency table with the frequency distribution of the two variables (verb type × ECN relation).
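As a consistency check (our own arithmetic, based only on the two tables above), the marginals of Table 2 reproduce the percentages reported in Table 1 over the 2,432 pairs:

```python
# Cell counts of Table 2: (relation, verb type) -> frequency.
counts = {
    ("contradiction", "factive"): 0,   ("contradiction", "non-factive"): 107,
    ("entailment", "factive"): 593,    ("entailment", "non-factive"): 231,
    ("neutral", "factive"): 14,        ("neutral", "non-factive"): 1487,
}
total = sum(counts.values())
assert total == 2432  # number of <T, H> pairs

def pct(pred):
    """Percentage of pairs whose (relation, verb_type) key satisfies pred."""
    return round(100 * sum(v for k, v in counts.items() if pred(k)) / total, 2)

# Class distribution matches Table 1's "target / GOLD" row:
assert pct(lambda k: k[0] == "entailment") == 33.88
assert pct(lambda k: k[0] == "contradiction") == 4.4
assert pct(lambda k: k[0] == "neutral") == 61.72
# Verb-type distribution matches Table 1's "Verb type (factivity)" row:
assert pct(lambda k: k[1] == "factive") == 24.96
assert pct(lambda k: k[1] == "non-factive") == 75.04
```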

4.3 Linguistic Features

4.3.1 Utterances & Information – <T, H> Pairs

From the 2,320 utterances, we created 2,432 <T, H> pairs (with 309 unique main verbs). In some utterances, the verb introduced more than one complement; in each such situation, we created a separate <T, H> pair. For each sentence, we extracted the complement clause manually. This manual work included, e.g., removing fragments that were not within the scope of the verb – see Example 14.

<T, H> Example 14.
T: He said I am – I guess – beautiful.
H: I am - I guess - beautiful.

The <T, H> pairs we created are the core of our dataset. We assigned specific properties (i.e., linguistic features) to each pair. In the following, we present these features with their brief characteristics.

4.3.2 Entailment, Contradiction and Neutral (ECN)

The process of annotating (ECN) relations was performed by an expert linguist experienced in natural language semantics and then checked by another expert in formal semantics.

4.3.3 Verb

In each utterance, the experienced linguist manually identified the verb that introduced the H sentence.

Despite appearances, this was not a trivial task. Identifying a given lexical unit often required deep thought and verification of specific delimitation hypotheses. Among other things, in order to avoid problems of polysemy, we assumed that only one meaning can be assigned to a given verb, e.g., we distinguish between czuć, że / feel that, which is epistemic, and czuć, że / feel that, which is purely perceptual (see Examples 15 and 16).

<T, H> Example 15. (Epistemic)
T: He felt that he would never see it again.

<T, H> Example 16. (Perceptual)
T: He felt that he was walking in his arms.

We identified a total of 309 verbs. (We treated aspectual pairs as two separate verbs; Polish, in a nutshell, has only two aspects: perfective and imperfective.)

4.3.4 Verb Type

We assigned one of two values – factive or non-factive – to every verb. From the linguistic side, this was the most difficult part. The task was done by a linguist writing his Ph.D. thesis on the factivity phenomenon. The list was checked with the thesis supervisor; the two agreed in most, but not all, cases. Finally, 81 verbs were marked as factive and 230 as non-factive.

4.3.5 Internal Negation

For each utterance, we marked whether it contains an internal negation. About 95% of the utterances did not contain explicit negation words, and almost 5% did.
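This feature was annotated manually. As a rough illustration only (`has_internal_negation` is a hypothetical helper, not the annotation procedure), one could flag the Polish negation particle nie when it occurs in the main clause, i.e. before że:

```python
import re

# Rough heuristic sketch: mark internal negation when "nie" occurs as a
# standalone word in the main clause (the part before the complementizer).
# The real annotation was done manually by linguists.
def has_internal_negation(t: str) -> bool:
    main_clause = t.split("że")[0]
    return re.search(r"\bnie\b", main_clause, flags=re.IGNORECASE) is not None

print(has_internal_negation("Ernest i Agnieszka nie planowali, że będą mieli rodzinę."))  # True
print(has_internal_negation("Powiedział, że jestem piękna."))  # False
```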

4.3.6 Verb Semantic Class

We distinguished four semantic classes of verbs: epistemic (myśleć, że / think that), speech (dodać, że / add that), emotive (żałować, że / regret that), and perceptual (dostrzegać, że / perceive that). Most verbs were hybrids, e.g., epistemic-speech; the class label reflects the dominant semantic component. If a verb did not fit any of the above classes, it was given the value other.

4.3.7 Grammatical Tense

In each utterance, we marked the grammatical tense of the verb and the complement H.

4.3.8 Utterance Type

We labeled the type of utterance as: indicative, performative, rule, interrogative, imperative, counterfactual, or conditional.

All T utterances have been translated into English, see Appendix B.

4.4 Annotation

Among the linguistic features assigned to <T, H> pairs, the most difficult and essential to identify were factivity/non-factivity and the ECN logic relations. Whether a verb is factive was determined by two linguists professionally involved in natural language semantics. They achieved more than 90% agreement, with most doubts arising for verbs of speaking, e.g., wytłumaczyć, że / explain that. The final decisions on which verbs are factive were made by a linguist writing a Ph.D. thesis on factivity in contemporary Polish and were checked by his supervisor – a professor of formal linguistics.

The ECN semantic relations were established in two stages. In the first stage, two linguists annotated the data, including one who academically works on verb semantics; they achieved 70% agreement. Discrepancies were concentrated on the contradiction relation rather than entailment. The debatable pairs were then discussed with a third linguist, a professor specializing in natural language semantics. The result of these consultations was the final annotation – the GOLD standard.

We checked how the GOLD standard created in this way would differ from the annotations of non-professional linguists – a group of annotators who are not professionally involved in formal linguistics but have high linguistic competence. The criteria for selecting annotators were the level and type of education and a pre-test. The four selected annotators were: (1) a cognitive science student (3rd year), (2) and (3) holders of master's degrees in Polish Studies, and (4) a linguistics Ph.D.

Each annotator was given the same subset of <T, H> pairs (20% of the whole dataset). The task was to label the relation between T and H, choosing from four labels: entailment, contradiction, neutral, and '?'. (Utterances annotated as '?' in the GOLD standard were not used for training and testing the ML benchmarks.) The annotation instructions included simplified definitions of the key labels, as presented in Section 2.2.

Annotators were asked to choose '?' if: (1) they could not indicate what the relation was, (2) they thought the sentence was meaningless, or (3) they encountered another problem that made it impossible to choose any of the other three labels. Situation (1) is especially important from our point of view: the '?' label was reserved for <T, H> pairs whose semantic relation depends on prosodic features (like accent, which determines focus and topic (Partee, 2014)). Consider Example 17:

<T, H> Example 17.
T: Let’s not say [that] these projects are supposed to end in a constitutional change.
H: These projects are supposed to end in a constitutional change.

There are two possible readings of Example 17: (a) the sender wants to hide the information H (label: entailment), and (b) the sender does not want to say H because, e.g., he first wants to make sure that it is true (label: neutral).

Inter-annotator agreement with the dataset gold standard was in the range of 61%–65%, excluding the worst annotator, whose Kappa was below 52% with all other annotators. (A few annotation examples are given in our supplement A.) Table 3 summarizes the inter-annotator agreement among the four non-expert linguists and one of the experts preparing the dataset. Conclusions from this annotation exercise are provided in Section 6.2.

Ex A1 A2 A3 A4
Ex 1.00 0.65 0.60 0.38 0.61
A1 0.65 1.00 0.59 0.29 0.51
A2 0.60 0.59 1.00 0.47 0.70
A3 0.38 0.29 0.47 1.00 0.52
A4 0.61 0.51 0.70 0.52 1.00
Table 3: Inter-annotator agreement given by Cohen’s Kappa (alpha=0.05). Note: Ex – an expert who made the gold standard, A1-A4 – non-expert linguists.
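The pairwise Cohen's Kappa reported in Table 3 can be computed as follows. This is a minimal stdlib sketch; the label vectors below are illustrative, not the actual annotator data:

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_e = sum(ca[l] * cb[l] for l in labels) / n ** 2    # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ECN annotations (illustrative, not the real annotator data).
expert = ["E", "N", "N", "C", "E", "N", "N", "E"]
a1     = ["E", "N", "N", "N", "E", "N", "E", "E"]
print(round(cohen_kappa(expert, a1), 2))  # -> 0.56
```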

5 Machine Learning Modelling – Experiments and Results

The models we built aim to simulate human cognitive abilities. The models trained on our dataset were expected to reflect high competence – comparable to that of human experts – in recognizing the relations of entailment, contradiction, and neutral (ECN) between an utterance (T) and its complement (H). We trained five kinds of models:

  1. Random Forest with an input of the prepared linguistic features,

  2. fine-tuned HerBERT-based models for only main verbs in sentences as inputs,

  3. model (2) with input extended with linguistic features,

  4. fine-tuned HerBERT-based model for the whole input utterance (T),

  5. model (4) with input extended with linguistic features.

We employed HerBERT (Rybak et al., 2020) models instead of BERT because they are trained specifically for Polish and achieved better results than Polish RoBERTa (Rybak et al., 2020). The Python code and the LingFeatured NLI dataset can be found in our GitHub repository: https://github.com/grant-TraDA/factivity-classification.

Each model was trained using 10-fold cross-validation in order to avoid selection bias. Table 4 shows the models' results on held-out data (data unseen by the model). The values in the table are the mean and standard deviation of each metric. In the binary setting, the F1 score is the harmonic mean of precision and recall; in the multiclass setting, it is calculated per class and the overall metric is the average of the per-class scores. Here, F1 was calculated as weighted F1, due to the large imbalance between classes.
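The weighted F1 described above can be computed with scikit-learn. The label vectors below are illustrative, not the paper's actual predictions:

```python
from sklearn.metrics import f1_score

# Weighted F1 as used in Table 4: per-class F1 averaged with weights
# proportional to class support, which matters under the strong
# E/C/N class imbalance.
y_true = ["N", "N", "N", "N", "E", "E", "C", "N"]
y_pred = ["N", "N", "N", "E", "E", "E", "N", "N"]

print(round(f1_score(y_true, y_pred, average="weighted"), 3))  # -> 0.7
```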

Model Input F1 score [%] Accuracy [%]
All C E N
Random Forest Linguistic features
HerBERT Verb embedding
HerBERT Verb embedding + linguistic features
HerBERT Sentence embedding
HerBERT Sentence embedding + linguistic features
Table 4: Classification results for entailment (E), contradiction (C), and neutral (N). Linguistic features comprise: verb, grammatical tense of verb, occurrence of internal negation, grammatical tense of complement clause, utterance type, verb semantic class, and verb type (factive/non-factive). F1 denotes weighted F1 score.

The parameters of the models and their training process are gathered in Table 5. The precise Random Forest results for different feature sets and the feature importance plots are given in Table 6 and Figure 1. Table 7 summarizes the results of our models on the most characteristic sub-classes in our dataset: entailment and factive verbs, neutral and non-factive verbs, and the other cases.

Model | Parameters
1 Random Forest | sklearn implementation with 100 trees (n_estimators=100, max_depth=20, random_state=123, class_weight={'C': 2, 'E': 1, 'N': 1}, other parameters default)
2 HerBERT (verb embedding) | batch size 32, 10 epochs, learning rate 1e-5 (PyTorch implementation of Adam)
3 HerBERT + linguistic features (verb embedding) | predictions of model (2) combined with one-hot-encoded linguistic features, sklearn implementation of a Multi-layer Perceptron
4 HerBERT (sentence embedding) | batch size 32, 13 epochs, learning rate 1e-5 (PyTorch implementation of Adam)
5 HerBERT + linguistic features (sentence embedding) | predictions of model (4) combined with one-hot-encoded linguistic features, sklearn implementation of a Multi-layer Perceptron
Table 5: Model and training parameters.
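A minimal sketch of the feature-based model (1), along the lines of Table 5: one-hot-encoded linguistic features fed to a Random Forest. The toy rows below are illustrative, not the actual dataset; the hyperparameters follow the table:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline

# Toy examples of (verb type, verb tense, negation, semantic class) features.
# Illustrative only; the real dataset has 2,432 rows and more features.
X = [
    ["factive",     "past",    "no_neg", "epistemic"],
    ["non-factive", "present", "no_neg", "speech"],
    ["factive",     "present", "neg",    "epistemic"],
    ["non-factive", "past",    "no_neg", "epistemic"],
    ["non-factive", "past",    "neg",    "speech"],
]
y = ["E", "N", "E", "N", "C"]

# Hyperparameters taken from Table 5.
clf = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    RandomForestClassifier(
        n_estimators=100, max_depth=20, random_state=123,
        class_weight={"C": 2, "E": 1, "N": 1},
    ),
)
clf.fit(X, y)
print(clf.predict([["factive", "past", "no_neg", "epistemic"]]))
```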
Features | Accuracy [%] | Weighted F1 [%]
verb – factive/non-factive
verb, tense of verb, occurrence of negation, tense of complement clause, type of sentence
verb, tense of verb, occurrence of negation, tense of complement clause, type of sentence, semantic class of verb
verb, tense of verb, occurrence of negation, tense of complement clause, type of sentence, semantic class of verb, verb – factive/non-factive
Table 6: Features in classification of entailment, contradiction, and neutral. Random Forest results with different input feature sets.
Subset | Random Forest accuracy [%] | HerBERT (sentence embedding) accuracy [%] | HerBERT (sentence embedding + linguistic features) accuracy [%] | HerBERT (verb embedding) accuracy [%] | HerBERT (verb embedding + linguistic features) accuracy [%]
Entailment and factive verbs
Neutral and non-factive verbs
Other
Table 7: Results on the most characteristic subsets of our test dataset: entailment and factive, neutral and non-factive, and the other cases.
Figure 1: Impurity-based feature importance of feature-based Random Forest. The chart shows the English equivalents of Polish verbs: know/wiedzieć; pretend/udawać; think/myśleć; turn out/okazać się; admit/przyznać; it is known/wiadomo; remember/pamiętać.

6 Results Analysis & Discussion

6.1 Issues in Dataset Preparation

We gathered the LingFeatured NLI dataset, which is representative with regard to the particular syntactic pattern '[verb][że (eng: that/to)][complement clause]' and to the factivity/non-factivity characteristics of the main-clause verb. The dataset is derived from NKJP, the Polish national corpus, which is itself representative of contemporary Polish utterances. Thus, based on this material – our dataset – we can answer our first research question, RQ1, which, recall, was as follows:

  1. How relevant is the opposition factivity / non-factivity for the prediction of ECN relations?

Firstly, the distribution of features in the dataset indicates that, in the vast majority of cases, factive verbs go with the entailment relation (24.4% of our dataset), and non-factive verbs with the neutral relation (61.1%) – see Table 2. The other utterances – <T, H> pairs in which, for example, a factive verb co-occurs with neutral, or a non-factive verb with entailment – constitute a narrow subset of the dataset (14.5%, 352 utterances). Table 9 contains examples of such pairs. These <T, H> pairs pose the biggest problem for both humans and models – the best model accuracy is 62.87% (see Table 7). To recap: in 85.5% of the pairs in the whole dataset, entailment co-occurs with a factive verb or neutrality co-occurs with a non-factive verb.

Second, if the verb was factive, the entailment relation occurred in 97.70% of cases; if the verb was non-factive, neutrality occurred in 81.50%. The feature pairs <factivity, entailment> and <non-factivity, neutrality> thus very often co-occur, especially the first. This means that phenomena such as cancellation and suspension of presuppositions (see, for example, Abrusán (2016), which discusses the phenomena behind these terms) are marginal in our dataset.
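The conditional percentages above follow directly from the counts in Table 2 (they come out as 97.69% and 81.48% before rounding):

```python
# Reproducing the conditional percentages of Section 6.1 from the
# contingency counts in Table 2.
table = {
    "factive":     {"C": 0,   "E": 593, "N": 14},
    "non-factive": {"C": 107, "E": 231, "N": 1487},
}

for verb_type, counts in table.items():
    total = sum(counts.values())
    for label, n in counts.items():
        print(f"P({label} | {verb_type}) = {100 * n / total:.2f}%")
```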

Thirdly, in our dataset, the 10 most frequent factive verbs account for 60% of all occurrences of factive verbs, and the 10 most frequent non-factive verbs account for nearly 45% of all occurrences of non-factive verbs (see Figure 2).

Figure 2: Relationship between the number of the most frequent verbs and the coverage of dataset. Left: The analysis of factive subsample. Right: The analysis of non-factive subsample.
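The coverage analysis behind Figure 2 amounts to a cumulative-frequency computation. A sketch on toy occurrence counts (illustrative only, not the NKJP-derived data):

```python
from collections import Counter

# What fraction of all verb occurrences is covered by the k most
# frequent verbs? (The computation underlying Figure 2.)
def coverage_of_top_k(verb_occurrences, k):
    counts = Counter(verb_occurrences)
    top_k = sum(n for _, n in counts.most_common(k))
    return top_k / len(verb_occurrences)

# Toy occurrence list (illustrative, not the real frequencies).
occurrences = ["wiedzieć"] * 5 + ["pamiętać"] * 3 + ["przyznać"] + ["zrozumieć"]
print(coverage_of_top_k(occurrences, 2))  # -> 0.8
```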

In view of the above, it can be concluded that the opposition factivity / non-factivity is relevant in a fundamental way for the prediction of ECN relations. In other words, in the syntactic pattern under analysis, factive and non-factive verbs are the most important factors in tasks that predict ECN relations between the whole utterance (T) and its complement clause (H).

It is also worth noting that, in the task of predicting ECN relations in the "Vthat|top" structure, we do not need large lists of verbs with their implicature signatures to identify these relations reasonably efficiently. Given the problem of translating utterances from one language to another, it is therefore sensible to create a multilingual list of verbs with the highest frequency of occurrence. We realize that the frequency of certain Polish lexical units may sometimes differ significantly from that of their equivalents in other languages. However, there are reasons to believe that these differences are not significant (compare, for example, the verbs wiedzieć and powiedzieć with their English counterparts (Davies and Gardner, 2010), and many other such verbs realizing the "Vthat|top" structure). A bigger issue than frequency may be a situation where a factive verb in language X is non-factive in language Y, and vice versa. Table 8 lists the factive and non-factive verbs with the highest frequency in our dataset. We leave it to native speakers of English to judge whether the given English verbs are factive or non-factive.

Factive | Non-factive
wiedzieć; know | mówić; say
pamiętać; remember | myśleć; think
wiadomo [komuś]; it is known | powiedzieć; tell
przyznać; admit/acknowledge | uważać; believe
widzieć; see [epistemic] | okazać się; turn out
cieszyć się; be glad | mieć nadzieję; hope
przypomnieć [komuś]; remind [someone] | twierdzić; assert
dowiedzieć się; find out/learn | wydaje się [komuś]; it appears [to someone]
zrozumieć; understand | stwierdzić; state
przyznawać; admit/acknowledge | wynikać; imply/follow
Table 8: Top 10 verbs broken down into factive / non-factive subgroups.

At this point it is worth asking the following question: do the results obtained on the Polish material have any bearing on communication in other ethnic languages? We think it is quite possible. Firstly, the way of life and, consequently, the communication situations of the speakers of Polish do not differ from the communication situations of the speakers of English, Spanish, German or French. Secondly, we see no evidence in favor of a negative answer. It is clear, however, that the answer to this question requires research analogous to ours in other languages.

T: [ENG] How do you know he bought, Gosia?     [PL] Skąd wiesz, że kupił, Gosia?
H: [ENG] He bought.     [PL] Kupił.
GOLD – Neutral, verb – Factive
T: [ENG] I read that Gabrysia was crying when she discovered that her daughter Tygrysek had lied, and this is such a moral compass for me.     [PL] Czytam, że Gabrysia płakała, kiedy odkryła, że jej córka Tygrysek skłamała, i jest to dla mnie taki moralny azymut.
H: [ENG] Her daughter Tygrysek had lied     [PL] Jej córka Tygrysek skłamała
GOLD – Neutral, verb – Factive
T: [ENG] Ernest and Agnieszka didn’t plan that they would have a big female family.     [PL] Ernest i Agnieszka nie planowali, że będą mieli wielką, babską rodzinę.
H: [ENG] Ernest and Agnieszka have a big female family.     [PL] Ernest i Agnieszka mają wielką, babską rodzinę.
GOLD – Entailment, verb – Non-factive
Table 9: Hard utterances in our dataset.

6.2 Annotation Task

Inter-annotator agreement between the non-expert annotators and the gold standard prepared by the linguists (Kappa of 61%–65%) indicates that the task is highly specialized. We did not find patterns in the errors made by the annotators. If the goal of human annotation is to identify the real relationships between two fragments, then such annotation requires specialized knowledge and sufficient time to perform the task.

Note that Jeretic et al. (2020), as part of their verification of the annotation in the MultiNLI corpus (Williams et al., 2018), randomly selected 200 utterances from this corpus and presented them for evaluation to three expert annotators with several years of experience in formal semantics and pragmatics. The agreement among these experts was low. This indicated to the authors that MultiNLI contains few "paradigmatic" examples of implicatures and presuppositions (see Levinson (2001) for typical examples of presuppositions and implicatures).

Notice that the low agreement among annotators may also result from differences in their theoretical commitments and research specializations. In our opinion, the human annotation process in a task such as detecting relations of entailment, contradiction, and neutral deserves a separate study.

6.3 ML Results Analysis

Our ML experiments help answer our second research question, RQ2, about assessing ML models in our task – recognition of ECN relations. The dataset shows that the classes Entailment and Neutral are the most common in language and hence the most important for models to handle. The overall results indeed show very high model performance in these classes: Entailment – 88% to 92% accuracy, and Neutral – 91% to 93.5% accuracy (see Table 4). More precisely, the models achieved very high results (93% up to 100% accuracy) for sentence pairs containing lexical relations – the subsets entailment and factive, neutral and non-factive in Table 7 – but very low ones (47% up to 62.9%) on pairs with a pragmatic foundation, which are drastically more difficult – the subset "Other" in Table 7.

Additionally, the overall ML modelling results show that HerBERT sentence-embedding-based models perform at a much higher level than non-expert linguists. Nevertheless, they did not reach the results of the professional linguists (who annotated our GOLD standard). Feature-based models achieve slightly better results (mean accuracy across folds of 91.32%), although not for the contradiction relation (mean accuracy of 39.89%). The weak result for this relation is due to its small representation in the dataset (only 4.4% of cases, see Tables 1 and 2). Regardless, according to the data obtained, contradiction occurs very rarely in communication conducted in Polish. Moreover, the variance of the ML results across cross-validation folds ranges from 0.08% to 2.5% for the overall results and the easier classes (E, N). For the C class (contradiction), however, the variance is very high – from 11.93% up to 31.97%. Once again, this is because the phenomenon appears very rarely: the dataset contains only 107 such cases.

Note that models trained on this training set achieve significantly higher results than those obtained by the annotators in the test annotation. This means that, under certain conditions, the trained models make human annotation work almost completely redundant.

Further, models with verb embeddings are better than those using representations of the entire sentence. However, they require manual extraction of the main verb from the utterance, because the verb is sometimes not apparent. Examples of such difficult extractions follow.

<T, H> Example 18. (Neutral)

T: [PL] Czuł, że inni zbliżali się do niego, ale nie był tego pewien.
T: [ENG] He felt others getting closer to him, but he wasn’t sure.

<T, H> Example 19. (Entailment)

T: [ENG] He felt that the typist wasn’t taking her eyes off him…
T: [PL] Czuł, że maszynistka nie spuszcza zeń oczu.

Considering the difference between the main verbs in the above examples requires attention to suprasegmental differences. In Example 18, the non-factive verb czuć, że is used. In contrast, Example 19 uses a different verb, namely the factive CZUĆ, że, which necessarily takes the main sentence stress (see Danielewiczowa (2002)). Note that from the two verbs above we should also distinguish the factive perceptual verb czuć, że.

<T, H> Example 20. (Neutral)

T: [PL] Lawirował na granicy prawdy, lecz przez cały czas czułem, że kłamie.
T: [ENG] He skirted the edge of the truth, but the whole time I felt that he was lying.

<T, H> Example 21. (Entailment)

T: [PL] - Dziękuję - odpowiedział, czując, że policzek zaczyna go boleć.
T: [ENG] - ’Thank you’, he replied, feeling his cheek begin to hurt.

In Example 20, the epistemic verb is used, while in Example 21 the main predicate is the perceptual verb. The former is non-factive and the latter is factive.

Further findings are that models whose inputs comprise both text embeddings and linguistic features achieve slightly better results than those with only embedding inputs. Moreover, in our base feature model (1), some features make the most significant contribution, in particular whether the verb is factive or non-factive (see Table 6 and Figure 1). However, the indication of verb tense (see Figure 1) as relatively important for ECN classification appears misleading in the light of linguistic data and requires further analysis. We seem to be dealing with spurious correlations rather than with lexical rules of language connecting verb tense with ECN relations. Deeper linguistic analysis would nevertheless be advisable, because the relation between the grammatical tense of the verb and the ECN relations may result from pragmatic rules and prosodic properties of the utterances. We hypothesize that these are spurious correlations in our dataset because present or past tense simply co-occur more often with a particular ECN class in our dataset. (Other names for this issue: annotation artifacts (Gururangan et al., 2018), dataset bias (He et al., 2019), group shift (Oren et al., 2019). For the problem of spurious correlations in the context of NLI see, e.g., Dasgupta et al. (2018); McCoy et al. (2019); Tu et al. (2020).)
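One simple way to quantify such a suspected tense/label association (not the analysis performed in the paper) is Cramér's V over a tense × ECN contingency table. The counts below are illustrative only:

```python
from collections import Counter
from math import sqrt

# Cramér's V: a chi-square-based association measure in [0, 1] between two
# categorical variables, here (tense, ECN label) pairs. High values for a
# feature that should be semantically irrelevant hint at a spurious
# correlation / annotation artifact.
def cramers_v(pairs):
    n = len(pairs)
    rows = Counter(r for r, _ in pairs)
    cols = Counter(c for _, c in pairs)
    cells = Counter(pairs)
    chi2 = 0.0
    for r in rows:
        for c in cols:
            expected = rows[r] * cols[c] / n
            chi2 += (cells[(r, c)] - expected) ** 2 / expected
    k = min(len(rows), len(cols)) - 1
    return sqrt(chi2 / (n * k))

# Toy counts (illustrative, not the dataset's actual tense/label table).
pairs = [("past", "E")] * 30 + [("past", "N")] * 10 + \
        [("present", "E")] * 10 + [("present", "N")] * 30
print(round(cramers_v(pairs), 2))  # -> 0.5
```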

7 Conclusions

Machine learning solutions often act as black boxes that feed on biases in their training datasets. It is therefore not enough to use larger machines and larger corpora: methodological reflection on the data-gathering process, a precise definition of the targeted tasks, and data quality evaluation are also essential.

In this study, we reviewed the opposition factivity – non-factivity in the context of predicting ECN relations. We gathered and analyzed a dataset representing this phenomenon, and then trained ML models on it.

Thus, we presented benchmarks with BERT-based models and with models utilizing prepared linguistic features; their performance even exceeds that of the test annotators. However, a few issues remain unresolved in this task, i.e., utterances with a pragmatic foundation. Other issues to examine are potential spurious correlations (e.g., the influence of verb tense on model results), which call for further, deeper analysis and interpretation of the models. Our results indicate the need for a dataset that focuses on these kinds of cases.


We want to thank Przemysław Biecek and Szymon Maksymiuk for their work on another NLI dataset and for valuable remarks on conducting experiments and on interpretability approaches. We also want to thank Karol Saputa, who implemented the preliminary source code for the machine learning models, which we reused and redesigned in our experiments. Also, we are grateful to the many students from the Faculty of Mathematics and Information Science at Warsaw University of Technology working under Anna Wróblewska's guidance in the Natural Language Processing course. They performed experiments on similar datasets and thus influenced our further research.


  • M. Abrusán (2016) Presupposition cancellation: explaining the ‘soft–hard’ trigger distinction. Natural Language Semantics 24 (2), pp. 165–202. Cited by: §2.1, footnote 12.
  • P. Anand and V. Hacquard (2014) Factivity, belief and discourse. The art and craft of semantics: A festschrift for Irene Heim 1, pp. 69–90. Cited by: §2.1.
  • D. Beaver (2010) 3: have you noticed that your belly button lint color is related to the color of your clothing?. In Presuppositions and discourse: Essays offered to Hans Kamp, pp. 65–100. Cited by: §2.1.
  • G. Chierchia and S. McConnell-Ginet (2000) Meaning and grammar: an introduction to semantics. MIT press. Cited by: 3rd item.
  • R. Cooper, D. Crouch, J. van Eijck, C. Fox, J. van Genabith, J. Jaspars, H. Kamp, D. Milward, M. Pinkal, M. Poesio, and S. Pulman (1996) The fracas consortium. Cited by: §3.
  • I. Dagan, O. Glickman, and B. Magnini (2005) The pascal recognising textual entailment challenge. In Machine Learning Challenges Workshop, pp. 177–190. Cited by: §3.
  • I. Dagan, D. Roth, M. Sammons, and F. M. Zanzotto (2013) Recognizing textual entailment: models and applications. Synthesis Lectures on Human Language Technologies 6 (4), pp. 1–220. Cited by: §3.
  • R. C. Dahlman and J. van de Weijer (2019) Testing factivity in italian. experimental evidence for the hypothesis that italian sapere is ambiguous. Language Sciences 72, pp. 93–103. Cited by: §2.1.
  • R. C. Dahlman (2016) Did people in the middle ages know that the earth was flat?. Acta Analytica 31 (2), pp. 139–152. Cited by: footnote 3.
  • M. Danielewiczowa (2002) Wiedza i niewiedza. Studium polskich czasowników epistemicznych. Cited by: §6.3.
  • I. Dasgupta, D. Guo, A. Stuhlmüller, S. J. Gershman, and N. D. Goodman (2018) Evaluating compositionality in sentence embeddings. arXiv preprint arXiv:1802.04302. Cited by: footnote 15.
  • M. Davies and D. Gardner (2010) Word frequency list of american english. a a 10343885, pp. 0–97. Cited by: footnote 13.
  • E. B. Delacruz (1976) Factives and proposition level constructions in montague grammar. In Montague grammar, pp. 177–199. Cited by: §2.1.
  • C. H. Dietz (2018) Reasons and factive emotions. Philosophical Studies 175 (7), pp. 1681–1691. Cited by: footnote 3.
  • K. Djärv and H. A. Bacovcin (2020) Prosodic effects on factive presupposition projection. Journal of Pragmatics 169, pp. 61–85. Cited by: §2.1.
  • K. Djärv (2019) Factive and assertive attitude reports. Cited by: §2.1.
  • P. Egré (2008) Question-embedding and factivity. Grazer philosophische studien 77 (1), pp. 85–125. Cited by: §2.1, footnote 3.
  • D. E. Elliott (1974) Toward a grammar of exclamations. Foundations of language 11 (2), pp. 231–246. Cited by: §2.1.
  • A. Giannakidou (2006) Only, emotive factive verbs, and the dual nature of polarity dependency. Language, pp. 575–603. Cited by: §2.1.
  • T. Givón (1973) The time-axis phenomenon. Language, pp. 890–925. Cited by: §2.1.
  • H. P. Grice (1975) Logic and conversation. In Speech acts, pp. 41–58. Cited by: §2.2.2.
  • N. Grigore et al. (2016) Factive verbs and presuppositions for’regret’and’know’. Revista Română de Filosofie Analitică 10 (2), pp. 19–34. Cited by: footnote 3.
  • S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. R. Bowman, and N. A. Smith (2018) Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324. Cited by: footnote 15.
  • E. Hanink and M. R. Bochnak (2017) Factivity and two types of embedded clauses in washo. In North-east linguistic society (nels), Vol. 47, pp. 65–78. Cited by: §2.1.
  • A. Hazlett (2010) The myth of factive verbs. Philosophy and phenomenological research 80 (3), pp. 497–522. Cited by: §2.1.
  • A. Hazlett (2012) Factive presupposition and the truth condition on knowledge. Acta Analytica 27 (4), pp. 461–478. Cited by: §2.1.
  • H. He, S. Zha, and H. Wang (2019) Unlearn dataset bias in natural language inference by fitting the residual. arXiv preprint arXiv:1908.10763. Cited by: footnote 15.
  • J. B. Hooper (1975) On assertive predicates. In Syntax and Semantics volume 4, pp. 91–124. Cited by: §2.1.
  • P. Hsueh, P. Melville, and V. Sindhwani (2009) Data quality from crowdsourcing: a study of annotation selection criteria. In Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing, pp. 27–35. Cited by: footnote 6.
  • Y. Huang (2011) 14. types of inference: entailment, presupposition, and implicature. In Foundations of pragmatics, pp. 397–422. Cited by: §1.
  • P. G. Ipeirotis, F. Provost, and J. Wang (2010) Quality management on amazon mechanical turk. In Proceedings of the ACM SIGKDD workshop on human computation, pp. 64–67. Cited by: footnote 6.
  • M. Jarrah (2019) Factivity and subject extraction in jordanian arabic. Lingua 219, pp. 106–126. Cited by: §2.1.
  • P. Jeretic, A. Warstadt, S. Bhooshan, and A. Williams (2020) Are natural language inference models IMPPRESsive? Learning IMPlicature and PRESupposition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 8690–8705. External Links: Link, Document Cited by: §6.2.
  • N. Jiang and M. de Marneffe (2019) Evaluating BERT for natural language inference: a case study on the CommitmentBank. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 6085–6090. External Links: Link, Document Cited by: §3.
  • N. Jiang and M. de Marneffe (2021) He thinks he knows better than the doctors: bert for event factuality fails on pragmatics. Transactions of the Association for Computational Linguistics 9, pp. 1081–1097. Cited by: §3.
  • L. Karttunen (1971a) Implicative verbs. Language, pp. 340–358. Cited by: §2.2.3.
  • L. Karttunen (1971b) Some observations on factivity. Research on Language & Social Interaction 4 (1), pp. 55–69. Cited by: §1, §2.1, footnote 3.
  • L. Karttunen (2016) Presupposition: what went wrong?. In Semantics and Linguistic Theory, Vol. 26, pp. 705–731. Cited by: §2.1.
  • I. Kastner (2015) Factivity mirrors interpretation: the selectional requirements of presuppositional verbs. Lingua 164, pp. 156–188. Cited by: §2.1.
  • S. C. Levinson (1983) Pragmatics. Cited by: §2.2.4.
  • S. C. Levinson (2001) Pragmatics. In International Encyclopedia of Social and Behavioral Sciences: Vol. 17, pp. 11948–11954. Cited by: footnote 14.
  • R. T. McCoy, E. Pavlick, and T. Linzen (2019) Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference. arXiv preprint arXiv:1902.01007. Cited by: footnote 15.
  • M. Ogrodniczuk, K. Glowinska, M. Kopec, A. Savary, and M. Zawislawska (2014) Coreference: annotation, resolution and evaluation in polish. Walter de Gruyter GmbH & Co KG. Cited by: §4.1.
  • Y. Oren, S. Sagawa, T. B. Hashimoto, and P. Liang (2019) Distributionally robust language modeling. arXiv preprint arXiv:1909.02060. Cited by: footnote 15.
  • E. Ö. Öztürk (2017) A corpus-based study on ‘regret’as a factive verb and its complements. European Journal of Foreign Language Teaching. Cited by: footnote 3.
  • D. Özyıldız (2017) Factivity and prosody in turkish attitude reports. UMass generals paper. Cited by: §2.1.
  • C. K. P Kiparsky (1971) Fact’in semantics. Semantics 1 (971), pp. 345–69. Cited by: §1, §2.1.
  • A. Parrish, S. Schuster, A. Warstadt, O. Agha, S. Lee, Z. Zhao, S. R. Bowman, and T. Linzen (2021) NOPE: a corpus of naturally-occurring presuppositions in english. arXiv preprint arXiv:2109.06987. Cited by: §3.
  • B. Partee (2014) Topic, focus and quantification. Semantics and Linguistic Theory, pp. 159–188. Cited by: §4.4.
  • A. Poliak (2020) A survey on recognizing textual entailment as an NLP evaluation. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, Online, pp. 92–109. External Links: Link, Document Cited by: §4.
  • A. Przepiórkowski, M. Bańko, R. L. Górski, and B. Lewandowska-Tomaszczyk (Eds.) (2012) Narodowy korpus języka polskiego (national corpus of polish language). Wydawnictwo Naukowe PWN, Warsaw, Poland. Cited by: §1, §4.1.
  • K. Richardson, H. Hu, L. Moss, and A. Sabharwal (2020) Probing natural language inference models through semantic fragments. In

    Proceedings of the AAAI Conference on Artificial Intelligence

    Vol. 34, pp. 8713–8721. External Links: Document Cited by: footnote 1.
  • A. Ross and E. Pavlick (2019) How well do NLI models capture verb veridicality?. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2230–2240. External Links: Link, Document Cited by: §3.
  • R. Rudinger, A. S. White, and B. Van Durme (2018) Neural models of factuality. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 731–744. External Links: Link, Document Cited by: §3.
  • P. Rybak, R. Mroczkowski, J. Tracz, and I. Gawlik (2020) KLEJ: comprehensive benchmark for polish language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 1191–1201. External Links: Link Cited by: §5.
  • P. Rybak, R. Mroczkowski, J. Tracz, and I. Gawlik (2020) KLEJ: comprehensive benchmark for polish language understanding. arXiv preprint arXiv:2005.00630. Cited by: §5.
  • U. Sauerland (2004) Scalar implicatures in complex sentences. Linguistics and philosophy 27 (3), pp. 367–391. Cited by: §2.2.2.
  • R. Saurí and J. Pustejovsky (2009) FactBank: a corpus annotated with event factuality. Language resources and evaluation 43 (3), pp. 227–268. Cited by: §2.2.5, §2.2.5.
  • R. Saurí and J. Pustejovsky (2012) Are you sure that this happened? assessing the factuality degree of events in text. Computational linguistics 38 (2), pp. 261–299. Cited by: §2.2.5.
  • M. Simons, D. Beaver, C. Roberts, and J. Tonhauser (2017) The best question: explaining the projection behavior of factives. Discourse processes 54 (3), pp. 187–206. Cited by: §2.1.
  • J. Speaks (2021) Theories of Meaning. In The Stanford Encyclopedia of Philosophy, E. N. Zalta (Ed.), Note: https://plato.stanford.edu/archives/spr2021/entries/meaning/ Cited by: §1.
  • R. Stalnaker, M. K. Munitz, and P. Unger (1977) Pragmatic presuppositions. In Proceedings of the Texas conference on per~ formatives, presuppositions, and implicatures. Arlington, VA: Center for Applied Linguistics, pp. 135–148. Cited by: §2.1.
  • I. Tarunesh, S. Aditya, and M. Choudhury (2021a) LoNLI: an extensible framework for testing diverse logical reasoning capabilities for nli. arXiv preprint arXiv:2112.02333. Cited by: §3.
  • I. Tarunesh, S. Aditya, and M. Choudhury (2021b) Trusting roberta over bert: insights from checklisting the natural language inference task. arXiv preprint arXiv:2107.07229. Cited by: §3.
  • J. Tonhauser (2016) Prosodic cues to presupposition projection. In Semantics and Linguistic Theory, Vol. 26, pp. 934–960. Cited by: §2.1.
  • S. L. Tsohatzidis (2012) How to forget that “know” is factive. Acta Analytica 27 (4), pp. 449–459. Cited by: §2.1, §2.1.
  • L. Tu, G. Lalwani, S. Gella, and H. He (2020)

    An Empirical Study on Robustness to Spurious Correlations using Pre-trained Language Models

    Transactions of the Association for Computational Linguistics 8, pp. 621–633. External Links: ISSN 2307-387X, Document, Link, https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00335/1923506/tacl_a_00335.pdf Cited by: footnote 15.
  • J. Turri (2011) Mythology of the factive. Logos & Episteme 2 (1), pp. 141–150. Cited by: §2.1.
  • A. Williams, N. Nangia, and S. Bowman (2018) A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1112–1122. External Links: Link, Document Cited by: §6.2.
  • H. Yanaka, K. Mineshima, and K. Inui (2021) Exploring transitivity in neural nli models through veridicality. arXiv preprint arXiv:2101.10713. Cited by: §3.
  • Z. Zhang, Y. Wu, H. Zhao, Z. Li, S. Zhang, X. Zhou, and X. Zhou (2020) Semantics-aware bert for language understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 9628–9635. Cited by: footnote 1.

Appendix A Examples From Our Dataset

Below we present several examples from the LingFeatured NLI dataset, together with their descriptions.

<T, H> Example 22.

T: [ENG] If a priest refuses Mass for a deceased person or a funeral because he received too little money, knowing that the payer is very poor, it is of course not right, but it is another matter.
H: The payer is very poor.

T: [PL] Jeżeli ksiądz odmówi mszy za zmarłego, lub pogrzebu z powodu zbyt niskiej zapłaty wiedząc, że proszący jest bardzo biedny to rzeczywiście nie jest w porządku, ale to inna sprawa.
H: Proszący jest bardzo biedny.
Sentence type – conditional
Verb labels: wiedzieć, że / know that, present, epistemic, factive, negation does not occur
Complement tense – present
GOLD – Neutral

Description: Despite the factive verb wiedzieć, że / know that, the GOLD label is neutral. This is because the whole utterance is conditional. An additional feature, not shown in the example, is that the sentence does not refer to specific objects but is general.

<T, H> Example 23.

T: [ENG] You never expected to hear that from me, did you?
H: You heard it from me.

T: [PL] Nie spodziewałeś się, że kiedykolwiek to ode mnie usłyszysz, co?
H: [PL] Kiedykolwiek to ode mnie usłyszysz.
Sentence type – interrogative
Verb labels: spodziewać się, że / expect that, past, epistemic, non-factive, negation occurs
Complement tense – future
GOLD – Entailment

Description: Despite the non-factive verb spodziewać się, że / expect that, the GOLD label is entailment. In this pair, non-lexical mechanisms underlie the entailment relation; proper judgment of this example requires taking the prosodic structure of T into account.

It is worth noting that sentence H is incorrectly formed; strictly speaking, it should be H': You have heard it from me. / Usłyszałeś to ode mnie. This is because H sentences were extracted semi-automatically, and we did not want to change the linguistic features of the complement. The annotators were instructed that in such situations they should consider not the H sentence as given but its proper form, in this case H'. From the perspective of the bilingualism of the dataset, it is also important that in English the information conveyed by "never" is part of the main clause, whereas in Polish the corresponding expression "kiedykolwiek" is part of the complement clause.

<T, H> Example 24.

T: [ENG] maybe he was afraid that I would spill the beans…
H: [ENG] I would spill the beans.

T: [PL] może się bał że się wygadam…
H: Się wygadam.
Sentence type – indicative
Verb labels: bać się, że / afraid that, past, emotive, non-factive, negation does not occur
GOLD – ?

Description: An example that the linguists decided to label "?". (Such utterances were removed from the dataset used for training and testing our benchmarks.) Whether the state of affairs described by the complement clause was realized belongs to the common knowledge of the interlocutors; without context, we cannot say whether the speaker spilled the beans or not. Note also that the English translation contains a modal verb that is absent from the Polish complement clause, and that the lack of context makes it impossible to fully determine the H sentence.

<T, H> Example 25.

T: [ENG] And that’s why I made no effort to remind anyone of myself, I thought nobody here would remember me.
H: [ENG] Nobody here would remember me.

T: [PL] I dlatego nie starałem się przypomnieć, myślałem, że nikt tu o mnie nie pamięta.
H: nikt tu o mnie nie pamięta
Sentence type – indicative
Verb labels: myśleć, że / think that, past, epistemic, non-factive, negation does not occur
GOLD – Contradiction

Description: The main verb is non-factive, and the relation between the whole sentence and its complement is a contradiction. The grounding of this relation is pragmatic in nature.
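The examples above share a fixed structure: a T/H pair plus expert-assigned linguistic features and an ECN GOLD label. As a minimal sketch of how a single dataset entry could be represented in code (the class and field names are ours, not part of the released dataset), consider:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Gold(Enum):
    """ECN labels; '?' marks undecidable cases removed before training/testing."""
    ENTAILMENT = "E"
    CONTRADICTION = "C"
    NEUTRAL = "N"
    UNDECIDED = "?"

@dataclass
class LingFeaturedExample:
    t_pl: str              # Polish premise (T)
    h_pl: str              # Polish hypothesis (H): the complement clause
    sentence_type: str     # "indicative" | "conditional" | "interrogative"
    main_verb: str         # e.g. "wiedzieć, że / know that"
    verb_tense: str        # tense of the main verb
    verb_class: str        # e.g. "epistemic", "emotive"
    factive: bool          # verb signature: factive vs non-factive
    negation: bool         # whether internal negation occurs
    gold: Gold
    complement_tense: Optional[str] = None  # absent in some entries (cf. Example 24)

# Example 22 from above: a factive verb inside a conditional, hence Neutral.
example_22 = LingFeaturedExample(
    t_pl="Jeżeli ksiądz odmówi mszy za zmarłego [...] wiedząc, "
         "że proszący jest bardzo biedny [...]",
    h_pl="Proszący jest bardzo biedny.",
    sentence_type="conditional",
    main_verb="wiedzieć, że / know that",
    verb_tense="present",
    verb_class="epistemic",
    factive=True,
    negation=False,
    gold=Gold.NEUTRAL,
    complement_tense="present",
)
```

The sketch makes explicit why the GOLD label is a separate field: as Examples 22 and 23 show, it cannot be read off the verb signature alone.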

Appendix B Polish-English Translation

Our dataset is bilingual. We translated its original Polish version into English. In the following, we describe the methodological challenges and doubts related to creating the second language version, and the solutions we adopted.

We translated the whole dataset into English. First, we used the DeepL translator (https://www.deepl.com/translator); then a professional translator corrected the automatic translation. The human translator followed these guidelines: (a) not to change the structure ’[verb] "to"|"that" [complement clause]’, provided the English sentence remained correct and natural; (b) to preserve various types of linguistic errors in translation.

We believe that the decision whether to inform the translator about the intended use of the dataset is methodologically important. We therefore decided to tell the translator that it was crucial for us that each translated sentence retain its set of logical consequences, especially the relation between the whole sentence and its complement clause. However, we did not provide the translator with the GOLD column (the expert linguists' annotations). The translator was thus aware that this aspect was essential and, while working on each sentence, had to assess the relation in Polish and try to preserve it in translation.

The English version differs from the Polish one in several important respects. Every Polish sentence contains the complementizer że/that. English exhibits more complementizers, especially that and to, but also others, e.g., about, for, into; there are also sentences without any complementizer. In Polish, unlike English, the complementizer cannot be elided (e.g., Nie planowali, że będą mieli wielką, babską rodzinę. / They didn’t plan they will have a big, girl family). In some English sentences, an adjective, a noun, or a verb phrase appears instead of a verb, e.g., The English will appear in a weakened line-up for these meetings. (in Polish: okazuje się, że Anglicy…)

It happens that, depending on the sentence, a Polish verb has more than one English equivalent, e.g., cieszyć się - glad that or happy to; realize that - zdawać sobie sprawę or zrozumieć (in dictionaries, zrozumieć is closest to understand). For this reason, verb frequencies differ between the two language versions. The two language versions also pose problems related to verb signatures. First, the signatures we developed are for Polish verbs; therefore, we do not know how many pairs <V(pl); V(eng)> there are in which both verbs have identical signatures (factive or non-factive). Second, a verb in language L1 may have no equivalent in language L2, and vice versa.

Appendix C Test Annotations by Non-Experts

Table 10 shows examples of annotations made by non-experts in our study.

Label | Value
T [ENG] I know the cold degrades the mind and makes it sluggish. [PL] Wiem, że zimno degraduje umysł i wiedzie go do ospałości.
H [ENG] The cold degrades the mind and makes it sluggish. [PL] Zimno degraduje umysł i wiedzie go do ospałości
Task GOLD – E, Annot. – E E E E
T [ENG] Statists believe that people can be demanding towards the state. [PL] Etatyści wierzą, że ludzie mogą wymagać od państwa.
H [ENG] People can be demanding towards the state. [PL] Ludzie mogą wymagać od państwa.
Task GOLD – N, Annot. – N N N N
T [ENG] I thought you were a bachelor. [PL] Myślałam, że jesteś kawalerem.
H [ENG] You were a bachelor [PL] Jesteś kawalerem.
Task GOLD – N, Annot. – C C N N
T [ENG] I imagined that if there was guilt, then there was punishment. [PL] Wyobrażałem sobie, że jak jest wina, to jest i kara.
H [ENG] If there was guilt, then there was punishment. [PL] Jak jest wina, to jest i kara.
Task GOLD – C, Annot. – N C N C
T [ENG] She may have ordered hastily, but she really hoped that the stadium would remain a place of trade, not sport. [PL] Być może zamawiała pochopnie, ale naprawdę liczyła, że stadion pozostanie miejscem handlu, nie sportu.
H [ENG] The stadium would remain a place of trade, not sport. [PL] Stadion pozostanie miejscem handlu.
Task GOLD – C, Annot. – N C N N
T [ENG] I was wondering when you could talk about a woman’s life making sense… [PL] Zastanawiałam się, kiedy można mówić o tym, że życie kobiety miało sens…
H [ENG] A woman’s life making sense… [PL] Życie kobiety miało sens…
Task GOLD – N, Annot. – N ? N ?
T [ENG] I mean, Mrs. W. didn’t say she was kicked out of the house. [PL] Przecież pani W. nie powiedziała, że została wyrzucona z domu.
H [ENG] She was kicked out of the house. [PL] Została wyrzucona z domu.
Task GOLD – E, Annot. – N N C C
T [ENG] Szerucki wiped his wet cheeks with a frayed sleeve, walked in, closed the door behind him and felt that something was wrong. [PL] Szerucki przetarł wystrzępionym rękawem zroszone policzki, wszedł, zamknął za sobą drzwi i poczuł, że jest niedobrze.
H [ENG] Something was wrong. [PL] Jest niedobrze.
Task GOLD – E, Annot. – N N E E
Table 10: Annotation examples for non-experts. Note: "Annot." indicates the non-expert annotations.
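To illustrate how the non-expert annotations in Table 10 can be compared against GOLD, here is a minimal sketch. The helper `majority_label` and the transcribed rows are ours (not part of the paper's code), and a tie among the four annotators is treated as no majority:

```python
from collections import Counter

def majority_label(annotations):
    """Return the most frequent label among the annotations,
    or None when the top count is tied."""
    counts = Counter(annotations).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

# Rows transcribed from Table 10: (GOLD, non-expert annotations).
rows = [
    ("E", ["E", "E", "E", "E"]),
    ("N", ["N", "N", "N", "N"]),
    ("N", ["C", "C", "N", "N"]),
    ("C", ["N", "C", "N", "C"]),
    ("C", ["N", "C", "N", "N"]),
    ("N", ["N", "?", "N", "?"]),
    ("E", ["N", "N", "C", "C"]),
    ("E", ["N", "N", "E", "E"]),
]

agree = sum(1 for gold, ann in rows if majority_label(ann) == gold)
print(f"{agree}/{len(rows)} rows where a non-expert majority matches GOLD")
```

On these eight illustrative rows, only the first two yield a majority that matches GOLD; in most of the remaining rows the annotators split evenly, which is consistent with the difficulty of the pragmatic cases discussed above.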