Czech News Dataset for Semantic Textual Similarity

by Jakub Sido, et al.
University of West Bohemia

This paper describes a novel dataset consisting of sentences with semantic similarity annotations. The data originate from the journalistic domain in the Czech language. We describe the process of collecting and annotating the data in detail. The dataset contains 138,556 human annotations divided into train and test sets. In total, 485 journalism students participated in the creation process. To increase the reliability of the test set, we compute each annotation as an average of 9 individual annotations. We evaluate the quality of the dataset by measuring inter- and intra-annotator agreement. Besides agreement numbers, we provide detailed statistics of the collected dataset. We conclude our paper with a baseline experiment of building a system for predicting the semantic similarity of sentences. Due to the massive number of training annotations (116,956), the model can perform significantly better than an average annotator (0.92 versus 0.86 in Pearson's correlation coefficient).




1 Introduction

In this paper, we describe a novel dataset consisting of sentences with semantic similarity annotations. The dataset comprises pairs of sentences in the Czech language, where each pair is associated with a similarity score. The purpose of the dataset is to train and evaluate systems for predicting the semantic similarity of sentences.

Currently, the NLP field relies ever more on unsupervised or self-supervised models. Nevertheless, well-annotated datasets are still required for either model adaptation or model testing. We pay particular attention to the testing part of the dataset. For this part, every sentence pair is annotated independently by nine annotators. We remove the unreliable annotations and compute the average of all the independent annotations. We store the resulting average as the final annotation for the test part of the dataset. For the training part, we prefer to cover as much diverse data as possible. Therefore, we stick with one annotation per sentence pair.

We decided to cooperate with journalism students to produce better annotations, since we believe they are generally skilled at processing language. To improve the annotations further, professional journalists supervised the student annotators. We believe that the contribution of skilled annotators increases the quality of the dataset.

The final dataset contains 138,556 human annotations. Its creation required a considerable amount of human annotation work: 485 annotators spent around 1,017 man-hours in total.

2 Related Work

Regarding English, there are many datasets for semantic textual similarity. The most commonly used datasets come from the SemEval competition. Starting in 2012, there were 6 competitions on STS [agirre2012semeval; agirre2013sem; agirre2014semeval; agirre2015semeval; agirre2016semeval; cer2017semeval]. These datasets include pairs of sentences taken from news articles, forum discussions, headlines, and image and video descriptions, labeled with a similarity score between 0 and 5. The goal is to evaluate how the cosine distance between two sentences correlates with a human-labeled similarity score through Pearson and Spearman correlations. The datasets from all SemEval STS competitions are gathered in the SentEval corpus [conneau2018senteval].

As for Czech, an STS dataset exists that was created by translating the English STS data from SemEval; the translated sentence pairs were annotated once again after translation to Czech [svoboda2018czech]. The main drawback of this dataset is that it is very small (1425 sentence pairs) in comparison to the dataset presented in this paper.

3 Source Dataset

Czech News Agency (CNA) delivers a complete service for Czech journalists, including images, quick short news, reports, and long-term monitoring of incidents in every domain. Thanks to our collaboration with CNA, we had access to news produced by their staff following a standardised process. This encompasses the evolution of each followed incident (an event interesting for CNA) and all related news (hereinafter reports), including a human-made summary at the end. We decided to use summaries and reports grouped by incidents for our purposes.

Our main goal was to create a dataset to support our future mission: a tool that makes journalists' everyday work easier. We therefore analysed our input corpora and designed a convenient way of annotating a part of them.

Because the original database of summaries and reports is not public, we had to create the public dataset in a way that prevents reconstruction of the original data while still providing a valuable set of annotations.

4 Process of Collecting Human Annotations

Our research at CNA showed that the journalists would appreciate a better systematic analysis of their database of related news when creating a summary. We also considered current research questions in natural language processing and adjusted the whole process to support modern challenges.

To obtain useful information for training a model to create summaries from the original reports (or at least to help the journalists with their creation), we built a web application to collect human annotations of inter-sentence similarity. The first sentence belongs to the summary, and the other one belongs to the original reports (see Fig. 3). The annotators were asked to give us two elementary pieces of information:

  1. context-free semantic textual similarity,

  2. context-dependent semantic textual similarity.

The sentences are taken from reports and their summaries; therefore, it is guaranteed that related and semantically similar sentences are present.

Human Resources

In this work, we cooperated with two groups of students, Group 1 and Group 2 (272 and 229 students, respectively). Group 1 annotated the data in the first round, picking sentence pairs and annotating their similarities; these data are used for the training dataset. Group 2 worked in the second round, first to produce statistics about the annotation process (exploratory phase) and then to create the test dataset (validation phase). We discuss the process and our goals in more detail in the following sections.

4.1 First Round

We decided to annotate pairs across summaries and reports because of their relation to our future work. Given the growing capabilities of modern models [beltagy2020longformer] and sufficient human resources, we decided to create annotations for sentence pairs both with and without context.

In the first round, we asked the annotators to choose three sentences from reports for each sentence in the summary. We instructed them to select the most similar sentence, the least similar sentence, and something in the middle, in order to create a more balanced dataset. For better annotator orientation, we highlighted the words from summary sentences in all reports (Figure 1). We asked the users to use a slide bar to annotate the degree of similarity on a scale of 0-6.

Figure 1: R1 – The first round of annotation. S = summary sentence; A,B,C = report sentences. Pairs SA, SB and SC were selected and annotated by Group 1. One annotator has processed 6 full incidents (no overlap). The STS scores for each pair unused in future rounds (R2, R3) are – together with the corresponding sentences – labeled as the training dataset.

4.2 Second Round

After the first round, we calculated some basic statistics about the whole process, like annotation time, user scale calibration, compliance of time pause between the context-free and context-dependent phases, etc. Each R1 annotator had three associated R2 validators, who received one of the three pairs (S-A, S-B or S-C) of the first 30 (S, A, B, C) tuples selected by the R1 annotator (producing three distinct sequences) for STS annotation. The sequences were generated randomly but not independently of one another, as they were specifically crafted in order for their union to yield the full original (S, A, B, C) tuple set whilst having an empty intersection (See fig. 2). The individual sequences in R2 were processed in the same order they were generated during R1.
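The assignment of pairs to the three R2 validators described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the function name `split_among_validators` and the pair labels are hypothetical, and a seeded `random.Random` is used only to make the example reproducible.

```python
import random

PAIR_LABELS = ("S-A", "S-B", "S-C")  # the three pairs of one (S, A, B, C) tuple


def split_among_validators(num_tuples, rng):
    """Assign each tuple's three pairs to three validators, one pair each.

    Returns three sequences (one per validator) of (tuple_index, pair_label),
    kept in the original tuple order.
    """
    sequences = [[], [], []]
    for tuple_index in range(num_tuples):
        labels = list(PAIR_LABELS)
        rng.shuffle(labels)  # random, but the three choices depend on each other
        for validator, label in enumerate(labels):
            sequences[validator].append((tuple_index, label))
    return sequences


sequences = split_among_validators(30, random.Random(0))
all_pairs = set().union(*map(set, sequences))
print(len(all_pairs))  # 90: the union covers every pair of the 30 tuples
```

Shuffling the three labels per tuple is what makes the sequences random yet dependent: every pair is assigned to exactly one validator, so the union is the full tuple set and the intersection is empty.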

Figure 2: R2 – Exploratory phase – SN= summary sentence from R1; R1-A,B,C = report sentences from R1. The distinct sequences for validators 1, 2, and 3 are shown as green, purple, and brown respectively.
(a) The screen of the first round – R1.
(b) The screen of context free annotation phase – R2,R3
(c) The screen of context dependent annotation phase – R2,R3
Figure 3: Annotation application window screens.

4.2.1 Exploratory phase – R2


(a) The dependence of the root mean squared error (RMSE) between the R1 score and the R2 score on the estimated validator (R2) thinking time. All annotations were grouped into buckets (bucket size = 1 second) and the RMSE deviation from the original R1 score was calculated for each group. The initial spike corresponds to users who have not given the answer any significant thought.
(b) The dependence of the Pearson correlation between STS scores of R1 and R2 on the number of annotations already made by individual R2 validators (= calibration curve). Note that the value shown is not an average correlation between each R1 annotator and the corresponding R2 validator, but the correlation between all R1 annotators and all R2 validators at once.
(c) The cumulative distribution function of the estimated waiting times at the 30th time-step (box shown in the figure labelled fig:r2_pause), i.e., how many R2 validators waited between the context-free and context-dependent phases of annotation for less than x hours. The recommended waiting time was 72 hours, which was disregarded by the majority of participants.
Figure 4: Statistics of the exploratory phase (= R2). As the true time between submitting annotations consists of thinking time and waiting time (precisely, their sum, which was the only information known to us), the thinking and waiting times shown here were each estimated by ignoring the other part. We assume that, where waiting times matter (notably the 30th time-step), the thinking times were very small in comparison, and vice versa.

In the short exploratory phase (R2), we decided to measure statistical data, which should help us set up the right amount of work and get a better understanding of our annotators’ behaviour. Each annotator from R2 had validated N pairs from the first round R1.

We conceptually divide the work into two blocks:

  1. intra-annotator block used for measuring intra-annotator agreement,

  2. overlapping block used to create useful statistics which help us design the next phase.

We asked the annotators to work carefully and to take a pause between the context-free and context-dependent annotations. The statistics show that some annotators did not take our recommendations seriously (see Fig. 4(c) and 4(a)). Therefore, we added strict rules, enforced by the user interface, to the validation phase (R3).

4.2.2 Validation phase – R3

Based on our experience in R2, we decided to add a calibration block before the main annotation process. We also added a strict rule to the annotation user interface, which enforces a 72-hour pause between context-free and context-dependent annotations. At the end of both the context-free and the context-dependent parts, we added special blocks to measure intra-annotator agreement between R2 and R3 (see Fig. 5).

Figure 5: The full R3 round on a temporal axis. The target sentence pairs in the context-free and context-dependent subphases of each phase were identical and ordered in exactly the same way. The 72h break between the context-free and context-dependent halves was enforced.
Calibration Block (R3_C)

After estimating the number of leading pairs required for successful user calibration to be 6 (see Figure 4(b)), we cherry-picked appropriate sentence pairs from R1 and let every R3 validator annotate each of those pairs before the main annotation phase of R3. The ordered pairs were manually chosen by their R1 STS values so that the user would become familiar with the full scale. We also personally checked that the R1 STS values were appropriate for those six sentence pairs.

Main block

As in R2, each R1 annotator had associated validators in R3. In R2, these formed a single group of three; in R3, they were split into 9 groups of three, yielding 27 validators in total associated with a single annotator from R1 (see Figure 6). Within each group, the annotation process was equivalent to the exploratory phase (R2): each group produced one STS score for each S-A, S-B and S-C pair of the first N (S, A, B, C) tuples selected by the R1 annotator. However, as there were nine distinct groups, the scores overlapped as shown in Figure 6, resulting in 9 different validators (= a stack) choosing an STS score for a single R3 pair.

The main block’s size was chosen to be 50 to match the expected spent time in man-hours. Consequently, every single individual validator annotated precisely 50 STS pairs, first without context and then including context.

The final score in the test dataset is produced by averaging the scores of each stack. The validator stacks were intentionally chosen randomly (without replacement) from the full set of 27 validators. This was done to eliminate any potential STS bias resulting from repeating a stack of validators with similar semantic intuition. After R3 was concluded, amongst eight annotators and 150 sentence pairs each within the main block (= 1200 opportunities in total), only a single duplicate stack was randomly generated. This means that only two sentence pairs in the entire validation phase (R3) have their STS assigned by the same stack of validators. Note that the assignments of validators for the context-free and context-dependent phases are identical.

Figure 6: R3_main – Visualisation of the main block in validation phase (R3). Notice that the stack size shown is 3, but the actual size is 9.
Intra-annotator block (R3_IA)

After the main block, each annotator was asked to annotate 5 extra sentence pairs which they had already annotated in the exploratory phase (see Fig. 7); the group of annotators in the exploratory and validation phases was the same. These data were used to measure the intra-annotator agreement.

Figure 7: R3_IA – Intra-annotator block – Five already annotated pairs are randomly sampled and presented to the same annotator to obtain the intra-annotator agreement for both versions – context-free and context-dependent.

5 Post-processing

To make the dataset as clean as possible, we decided to remove some data from the train dataset in post-processing. Thanks to the exploratory phase and the carefully designed validation phase with overlapping annotations and averaged final scores, we assume that the test part does not contain significant errors.

Figure 8: Evaluation of contextual semantic shift – The black line is a plot of the fraction of sentence pairs whose p-value falls below a given significance level (the CDF of the p-values). The dashed lines mark where exactly – in the plot – the most common levels of significance lie. For the hard dataset, we cherry-picked samples showing a change on significance level 0.05.

5.1 Evaluation of the Semantic Shift

We designed the process to collect context-free and context-dependent aligned data to enable future researchers to examine the role of context in semantics. We have not done any special pre-filtering of the sentences presented to the annotators, so the natural (unbiased) distribution of contextual semantic shift should appear in the collected data.

We have decided to quantify the significance of context by performing a series of t-tests, specifically to test the significant difference of means between the context-free and context-dependent main blocks element-wise (= one test for each sentence pair). We have assumed that the means of STS of size-9 samples for a single sentence pair and context presence are approximately normally distributed.

To recapitulate – for each sentence pair (of the main blocks), we possessed 9 STS scores for a context-free and 9 STS scores for a contextual version of the sentence pair. Then, we performed a two-sided t-test for the equivalence of the corresponding means. The null hypothesis of this test was that the STS score means would be equal; in other words, the added context would be insignificant in the domain of STS. Each such test yielded a certain p-value, that is, the largest possible significance level under which the null hypothesis would not be rejected. The function plotted in Fig. 8 is defined as the fraction of sentence pairs for which we would reject the null hypothesis of context-insignificance if our level of significance were $\alpha$, i.e., the CDF of the distribution of p-values.
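The per-pair testing procedure can be illustrated with a short sketch. The text does not say which variant of the t-test was used, so this example assumes a two-sample Student's t-test with pooled variance (16 degrees of freedom for two samples of 9); a paired or Welch test would be equally plausible. All function names and the toy scores are hypothetical, and the t-distribution tail is integrated numerically so that the example needs only the standard library.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value, integrating the density on [0, |t|] with Simpson's rule."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * total * h / 3  # P(-|t| < T < |t|)
    return max(0.0, 1.0 - central_mass)


def pooled_t_test(x, y):
    """Assumed test variant: two-sample t-test with pooled variance."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    if pooled == 0:
        return 1.0 if mean(x) == mean(y) else 0.0
    t = (mean(x) - mean(y)) / math.sqrt(pooled * (1 / nx + 1 / ny))
    return two_sided_p(t, df)


def rejection_fraction(p_values, alpha):
    """Fraction of pairs whose null hypothesis is rejected at level alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


# Toy stacks of 9 scores (context-free, context-dependent) for two pairs.
pairs = [
    ([1, 1, 2, 1, 0, 1, 2, 1, 1], [3, 4, 3, 2, 3, 4, 3, 3, 2]),  # clear shift
    ([2, 3, 2, 3, 2, 3, 2, 3, 2], [2, 3, 2, 3, 2, 3, 2, 3, 2]),  # no shift
]
p_values = [pooled_t_test(free, dep) for free, dep in pairs]
print(rejection_fraction(p_values, 0.05))
```

Evaluating `rejection_fraction` over a grid of significance levels gives the empirical CDF of the p-values.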

Figure 9: Motivation for diagonal filtering : Analysis of the frequency of final semantic similarity numbers grouped by suggested classes in the first round (R1).

5.2 Diagonal Filtering of Training Dataset

After the annotations, we computed additional statistics of the collected data and found some borderline cases in the first phase. We made a frequency analysis of each rating (0-6) for the three suggested classes (green, orange, red) (Fig. 9). Given the expected outcome, we were surprised by samples annotated against this scheme. There was a small but non-zero number of samples picked as green (should be similar) but rated with low similarity (contra-green) and, vice versa, red ones (should not be similar) rated with high similarity (contra-red).

We suppose there are several possible reasons for the contra-red samples. If we omit human mistakes, the most probable one is that the data displayed to the annotator could not be annotated differently. There could be only a minor difference between reports and summaries, so the user could not find any strongly dissimilar sentence and was forced to pick a similar one even for the red class. However, we believe that such non-trivial examples were presented to annotators only rarely.

There are fewer possible reasons for data being biased in the opposite direction. Again, if we omit systematic human mistakes, which are unlikely, the only explanation for such a systematic bias (observable also in the orange class) is that the annotators had no other choice: there was no similar sentence, which is odd if we consider the data source (reports and a summary of the same reports).

The only reasonable source of such bias is the presence of entirely new information in the summary, or of sentences not used in the original reports. We verified this possibility with the journalists. They call such content background, and they use it to place the summary into some (typically historical) context. And, of course, they often rephrase original sentences and reformulate original pieces of information, because they perform abstractive summarisation.

Collecting the data in the first round could potentially bring a systematic bias into the training data: humans who mark a pair as green tend to select higher similarity scores. We decided to investigate this hypothesis by filtering the possible systematic bias out of the training dataset, keeping only samples whose score lies within a neighbourhood of 1 or 2 of the intended colour class (N1, N2), where the number is the maximal difference of the score from the expected value (green – 6, orange – 3, red – 0). For statistics and results of basic experiments, see Tables 3 and 4 and Figure 10.
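The filtering rule can be written down in a few lines. The sketch below is an illustration under assumed data structures: the record fields `cls` and `score` and the function name `diagonal_filter` are hypothetical and do not reflect the format of the released files.

```python
# Expected score for each class suggested in R1 (taken from the text).
EXPECTED_SCORE = {"green": 6, "orange": 3, "red": 0}


def diagonal_filter(samples, max_distance):
    """Keep samples whose score is within max_distance of the class's expected score.

    max_distance=1 corresponds to N1 and max_distance=2 to N2.
    """
    return [
        s for s in samples
        if abs(s["score"] - EXPECTED_SCORE[s["cls"]]) <= max_distance
    ]


# Hypothetical R1 annotations: one regular and one contradictory sample per class.
samples = [
    {"cls": "green", "score": 6},
    {"cls": "green", "score": 1},   # contra-green
    {"cls": "orange", "score": 5},  # kept by N2 only
    {"cls": "orange", "score": 3},
    {"cls": "red", "score": 0},
    {"cls": "red", "score": 5},     # contra-red
]
print(len(diagonal_filter(samples, 1)), len(diagonal_filter(samples, 2)))  # 3 4
```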

6 Dataset Statistics

The final dataset is divided into the train and test parts. The train part contains 116 956 samples. In the test dataset, we decided to annotate 1200 pairs for both the context-free and context-dependent variants. We designed the process to end up with nine annotations for each pair, which succeeded in the context-free part. However, despite our effort, the context-dependent dataset is about 6% smaller. Still, only 21 out of 1200 sentence pairs have fewer than seven annotations.

We estimate the time spent creating the training dataset (R1) around 876 man-hours (269 annotators, 3.26 hours each on average). The test set was created in around 141 man-hours (216 annotators, 0.66 hours each on average).

Annotator Agreement

We made basic statistics of collected data to discover the limits of human agreement. The intra-annotator agreement can be expressed as the variance on pairs introduced twice to the same people after some time. More specifically, we stored five annotations made by users in R2 (the exploratory phase), which we let the users annotate again in R3.

For inter-annotator agreement, we claim that the variance of human annotations on the shared sentence pairs is a reasonable metric. We also computed the average STS for a specific validator stack and compared it with every STS score contributing to it (= the entire stack).

For both agreements, we calculate the Pearson and Spearman correlation coefficients, presenting the results in Table 1.

The Pearson correlation coefficient for the inter-annotator agreement as an upper bound is calculated similarly as in agirre2014semeval, but instead of averaging, we calculate the correlation through the whole dataset, which is comparable to the method of assessing correlation between the dataset and the returned values of a trained model.
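As an illustration of the inter-annotator measurement described above, the following sketch pairs every individual score with the mean of its stack and computes the correlations and RMSE over the whole set at once. It is a simplified reading of the text, not the authors' code: the helper names are hypothetical, the toy stacks have three scores instead of nine, and ties in the Spearman ranking receive average ranks.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences (undefined if one is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def rmse(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def individual_vs_stack_mean(stacks):
    """Pair each individual score with the mean of the stack it belongs to."""
    individual, averaged = [], []
    for scores in stacks:
        stack_mean = sum(scores) / len(scores)
        for score in scores:
            individual.append(score)
            averaged.append(stack_mean)
    return individual, averaged


stacks = [[0, 1, 1], [5, 6, 6], [3, 3, 2]]  # toy stacks, one per sentence pair
individual, averaged = individual_vs_stack_mean(stacks)
print(pearson(individual, averaged), spearman(individual, averaged), rmse(individual, averaged))
```

The intra-annotator figures can be obtained with the same helpers by passing the R2 scores and the repeated R3 scores of the same annotators instead of the individual scores and stack means.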

Agreement pcorr spcorr RMSE MSE
inter 0.861 0.840 1.017 1.035
intra 0.746 0.719 1.396 1.948
Table 1: Intra-annotator and inter-annotator agreement measurement results. pcorr and spcorr stand for pearson and spearman correlations respectively.
(a) STS distribution in the train dataset before and after performing the individual diagonal filters.
(b) STS distribution in the test dataset. Due to the significantly smaller granularity of test STS (means of 9), the data has been rounded down to units of size 0.2 on the STS scale for clarity.
Figure 10: Train/test dataset STS distributions.
Correlation between original STS scores and test datasets

For completeness, we present the correlations of STS scores between the scores captured in R1 and the corresponding test free/dep datasets (See Tab. 3). We can observe that both context-free and context-dependent test datasets correlate strongly with the original R1 scores (= intersection between R1 and R3 excluded from the train datasets). Notably, the correlation between context-free and context-dependent STS scores collected in R3 (and compiled into the test datasets) is very high, even after controlling the influence of the original R1 score. This suggests that the presence of context had a negligible impact on validators’ decisions during STS annotation. The results of our analysis of equivalence of context-free and context-dependent STS scores (visualised in Figure 8) suggest that only 7.5% of the test dataset (90 pairs) are significantly different from their contextual counterpart.

Table 3 also shows how big an impact our R1 dataset filtering methods had on the correlation of STS between the corresponding R1 and test datasets. We can see a slight improvement in correlation after filtering using N2, and another small improvement after filtering using N1. To check that the correlation improvement is not caused by shrinking the intersection size, we randomly sampled a cropped test dataset of the same size, calculated the correlation between it and the corresponding part of the full R1 dataset, and averaged the results. The notable difference is between N1 and N1-rand, which shows that simply removing 287 random elements from the test dataset is not sufficient to improve the correlation.
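The random-cropping control can be sketched as below. The number of repetitions is not stated in the text, so `repeats` is an assumption, as are the function names and the toy data; the sketch only shows the idea of averaging the correlation over random subsets of a fixed size.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def random_subset_correlation(r1_scores, test_scores, subset_size, repeats, rng):
    """Average Pearson correlation over random subsets of the aligned score lists."""
    indices = range(len(r1_scores))
    total = 0.0
    for _ in range(repeats):
        chosen = rng.sample(indices, subset_size)  # sampling without replacement
        total += pearson([r1_scores[i] for i in chosen],
                         [test_scores[i] for i in chosen])
    return total / repeats


# Toy aligned scores; in the setting of the text the lists would hold 1200 pairs
# and subset_size would be 1101 (N2-rand) or 913 (N1-rand).
r1_scores = [0, 1, 2, 3, 4, 5, 6, 2, 4, 6]
test_scores = [0.4, 1.2, 2.1, 2.8, 4.3, 4.9, 5.6, 2.6, 3.7, 5.8]
baseline = random_subset_correlation(r1_scores, test_scores, 7, 100, random.Random(0))
print(round(baseline, 3))
```

If the filtered subset correlates better than this random baseline of the same size, the gain can be attributed to which samples were removed rather than to how many.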

dataset mean variance / MSE size
train-raw 2.46 2.12 116 956
train-N2 2.56 2.15 101 413
train-N1 2.53 2.26 85 374
test free 2.66 1.78 1200
test dep 2.92 1.65 1200
test-sig free 1.41 2.02 90
test-sig dep 2.57 1.85 90
Table 2: Statistical indicators of the created datasets.

R1 data      pw R1/cfree   pw R1/cdep   pw cfree/cdep   par cfree/cdep (R1)   intersection size
R1-RAW       0.8787        0.8733       0.9362          0.8276                1200 (100%)
R1-N2        0.8831        0.8792       0.9356          0.8184                1101 (92%)
R1-N2-rand   0.8782        0.8736       0.9366          0.8275                1101 (92%)
R1-N1        0.8932        0.8918       0.9405          0.8212                913 (76%)
R1-N1-rand   0.8759        0.8714       0.9362          0.8300                913 (76%)

Table 3: Correlations between STS scores of different R1 filtering methods and the resulting test dataset. pw stands for pairwise Pearson correlation, par stands for partial Pearson correlation controlling for R1, and cfree and cdep stand for the context-free and context-dependent test datasets, respectively. The filtering method indirectly removes sentence pairs from the test dataset as well when computing the correlation; the size of the remainder, which was used to calculate the correlation, is shown in the rightmost column. The sets labelled -rand were created not by our filtering method but by removing random elements so as to match the intersection size of the corresponding filter; the filtering method itself is not applied. R1-RAW is simply the whole of R1 without filtering.
(a) Original (= from R1) vs R2 context-free STS.
(b) Original vs R3 context-free STS (rounded).
(c) Original vs R3 context-free STS (boxplots).
(d) Original (= from R1) vs R2 context-dependent STS.
(e) Original vs R3 context-dependent STS (rounded).
(f) Original vs R3 context-dependent STS (boxplots).
Figure 11: Overview of the relation between scores from R1 and scores from R2/R3, including the corresponding correlations. Rounding – in this context – means applying if , otherwise .
(a) R3 context-free vs. context-dependent scores (rounded).
(b) R3 context-free vs. context-dependent scores (scatter).
Figure 12: Overview of the relation between context-free and context-dependent scores of R3, including the corresponding correlations. Rounding – in this context – means applying if , otherwise .
(a) R2 calibration.
(b) R3 calibration.
Figure 13: STS-scale user calibration over time. The improvement caused by introducing the calibration phase in R3 is easily noticeable. Mind that the calibration phase has preceded both main phases during both context-free and context-dependent tasks, which helped users improve their correlation faster than during R2.
Influence of context on STS
(a) STS increase histogram (after context is introduced) of the whole test dataset.
(b) STS increase histogram of the significantly different (p < 0.05) pairs only.
(c) STS increase quantile function.
Figure 14: STS increase after introducing context to validators in R3.

Figure 14 shows how much the STS of a pair of sentences increases or decreases after context is introduced to validators in the validation phase (R3). In the histogram in Figure 14, we can see that the mean is greater than zero; its value is equal to the difference between the means of test free and test dep in Table 2. The original distribution does not pass any normality test (because of its leptokurticity); however, if we treat the STS gains around 1.5 as outliers, it is safe to consider the distribution of gains to be approximately normal. The apparent steps in the quantile function in Figure 14 are caused by the non-continuous distribution of means of 9 whole numbers between 0 and 6.

Theoretical lower bound for MSE

Within the context of a single sentence pair, assuming that the STS scores supplied by validators are approximately normally distributed, it follows that the random variable

$$T = \frac{\bar{X} - \mu}{S / \sqrt{n}}$$

(where $\bar{X}$ is the STS sample mean, $\mu$ is the true STS mean, $S$ is the STS sample standard deviation and $n$ is the number of averaged STS results) has the Student's t distribution with 8 degrees of freedom. Scaling this random variable by a factor of $S / \sqrt{n}$, we obtain the distribution of $\bar{X} - \mu$, whose variance is the MSE between an oracle for STS (which always returns the true mean $\mu$) and the corresponding STS mean $\bar{X}$, which is in our dataset. Since $n = 9$ and the variance of Student's distribution with 8 degrees of freedom is $8/6 = 4/3$, we estimate the lower bound for MSE as the average of $\frac{4}{3} \cdot \frac{S^2}{9}$ for all sentence pairs, which is approximately $0.1731$.
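The estimate can be computed directly from the per-pair stacks of scores. The sketch below assumes the reading given by the surrounding definitions: for a stack of n scores with sample variance s², the per-pair term is the variance of Student's t distribution with n − 1 degrees of freedom, (n − 1)/(n − 3), multiplied by s²/n. The function name and the toy stack are hypothetical.

```python
from statistics import mean, variance


def mse_lower_bound(stacks):
    """Average over sentence pairs of Var(t_{n-1}) * s^2 / n."""
    terms = []
    for scores in stacks:
        n = len(scores)
        df = n - 1
        t_variance = df / (df - 2)  # variance of Student's t, defined for df > 2
        terms.append(t_variance * variance(scores) / n)  # variance() is the sample variance
    return mean(terms)


# One toy stack of nine scores with sample variance 3.5.
toy_stacks = [[0, 1, 2, 3, 4, 5, 6, 3, 3]]
print(mse_lower_bound(toy_stacks))  # (4/3) * 3.5 / 9 = 14/27, about 0.5185
```

For stacks of nine scores the factor is 8/6 = 4/3, so each pair contributes 4s²/27 to the average.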

7 Initial Experiments

To set up a baseline for the new dataset, we use the well-known and robust word2vec [mikolov2013efficient] baseline and modern models for text processing. We chose published BERT-like models pre-trained on the Czech language, Czert [sido2021czert] and SlavicBert [arkhipov2019tuning-SlavicBert], and used them in two ways: as a cross-attention encoder on both sentences at once, and as a two-tower (Siamese) encoder with a similarity measure on top. Results are shown in Table 5.

Two Tower Models

We used an unweighted average of word2vec embeddings, and pooler outputs (PO) from BERT-like models, to encode each sentence independently, with a cosine similarity measure on top tuned on the training dataset.
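A minimal sketch of the word2vec variant of the two-tower setup follows. The embedding table, its three-dimensional vectors and the whitespace tokenisation are made up for illustration; real word2vec vectors would be loaded from a trained model, and the tuning of the similarity on the training data is not covered here.

```python
import math


def average_embedding(tokens, embeddings):
    """Unweighted average of the vectors of all tokens found in the table."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no known tokens in the sentence")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embedding table (not real word2vec vectors).
embeddings = {
    "vláda": [0.9, 0.1, 0.0],
    "schválila": [0.7, 0.3, 0.1],
    "rozpočet": [0.8, 0.2, 0.2],
    "zápas": [0.0, 0.1, 0.9],
}

first = average_embedding("vláda schválila rozpočet".split(), embeddings)
second = average_embedding("vláda rozpočet".split(), embeddings)
third = average_embedding("zápas".split(), embeddings)
print(cosine_similarity(first, second) > cosine_similarity(first, third))  # True
```

Each sentence is encoded without seeing the other one, which is what distinguishes this setup from the cross-attention encoder.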

Cross-Attention Model

For the cross-attention encoder, we used the pooler output with a projection layer of size 200 with ReLU activation on top, followed by a single neuron with linear activation to obtain the similarity measure (CA).
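The regression head described above can be illustrated with a framework-free forward pass. This sketch covers only the head: the BERT-like encoder that produces the pooler output is not included, the weights are random rather than trained, and the input dimension of 8 is chosen only to keep the example small. The names `init_head` and `predict_similarity` are hypothetical.

```python
import random


def relu(x):
    return x if x > 0 else 0.0


def init_head(input_dim, hidden_dim=200, seed=0):
    """Random (untrained) weights for a hidden_dim projection and one output neuron."""
    rng = random.Random(seed)
    return {
        "w1": [[rng.gauss(0, 0.02) for _ in range(input_dim)] for _ in range(hidden_dim)],
        "b1": [0.0] * hidden_dim,
        "w2": [rng.gauss(0, 0.02) for _ in range(hidden_dim)],
        "b2": 0.0,
    }


def predict_similarity(pooled, head):
    """Projection with ReLU followed by a single neuron with linear activation."""
    hidden = [
        relu(sum(w * x for w, x in zip(row, pooled)) + bias)
        for row, bias in zip(head["w1"], head["b1"])
    ]
    return sum(w * h for w, h in zip(head["w2"], hidden)) + head["b2"]


head = init_head(input_dim=8)
pooled_output = [0.1, -0.3, 0.5, 0.0, 0.2, -0.1, 0.4, 0.3]  # stand-in for a pooler output
print(predict_similarity(pooled_output, head))
```

Because the output neuron is linear, the head can be trained as a regressor directly on the STS scores.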

model       metric     RAW                N2                 N1
size                   116956 (100%)      101413 (86.71%)    85374 (73.00%)
Czert-CA    Pearson    91.887 ± 0.1193    91.525 ± 0.2343    91.25 ± 0.1812
            Spearman   89.282 ± 0.1755    89.346 ± 0.1906    89.10 ± 0.09493
Pavlov-CA   Pearson    91.383 ± 0.2914    91.14 ± 0.2638     91.039 ± 0.3166
            Spearman   88.966 ± 0.0892    89.056 ± 0.1036    89.034 ± 0.1087

Table 4: Filtering experiments. The size is given as an absolute number, with the relative size in % in brackets. We report Pearson (first line) and Spearman (second line) correlations multiplied by a factor of 100.

                  W2V                Czert Cross Attention   Czert Two Tower     Slavic Bert Cross Attention   Slavic Bert Two Tower
free   Pearson    82.8300 ± 0.0702   91.887 ± 0.1193         88.177 ± 0.02407    91.383 ± 0.2914               86.158 ± 0.1573
       Spearman   73.8225 ± 0.0783   89.291 ± 0.1675         85.568 ± 0.06162    88.966 ± 0.0892               83.634 ± 0.1500

Table 5: We report Pearson (first line) and Spearman (second line) correlations multiplied by a factor of 100.

8 Data Format

For the reasons given in Sec. 3, we can release the dataset only in a text-exported form with a limited surrounding context. All initial experiments were done only with these files, so future researchers can use the same data as we did. Unfortunately, we cannot release the original database with the raw data collected during the annotation process. We did our best to give the reader the best possible insight into the whole process, the character of the original data and the quality of the final annotated corpus. We decided to make public all versions we made, both with and without filtering.

We present the collected data in textual files available on our website or on GitHub. The averaged annotation scores for each pair and the enumeration of all individual annotations are presented in the files containing the test part.

We also present the test dataset filtered by the significance of change between context-free and context-dependent annotations, labelled as test-sig.tsv. The data samples in this file are filtered at a 0.95 confidence level of being significantly different between the context-free and context-dependent annotations.

Context-free and context-dependent test parts are naturally aligned. However, the context-free part is about 6% bigger than the context-dependent one.

The train part of the dataset consists of two sentences with the user’s annotation made in R1.

The test part contains two sentences followed by the averaged STS value from R3, all original annotations collected in R3, and the value from R1.

The key sentence is surrounded with <sent></sent> marks in the context-dependent variants of the dataset.
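A small helper can recover the key sentence from a context-dependent field. Only the <sent></sent> marks are taken from the description above; the column layout of the released files is not specified here, so the sketch works on a single text field, and the function name and example text are hypothetical.

```python
import re

# Matches the first span enclosed in <sent>...</sent>, across line breaks.
SENT_PATTERN = re.compile(r"<sent>(.*?)</sent>", re.DOTALL)


def extract_key_sentence(context):
    """Return the key sentence marked inside a context-dependent text field."""
    match = SENT_PATTERN.search(context)
    if match is None:
        raise ValueError("no <sent></sent> marks found")
    return match.group(1).strip()


example = "Previous sentence. <sent>The key sentence is here.</sent> Following sentence."
print(extract_key_sentence(example))  # The key sentence is here.
```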

9 Discussion

The Role of the Context

The main goal was to create a new Czech dataset for semantic textual similarity with aligned context-free and context-dependent annotations and to evaluate the importance of context in usual texts. As shown, the context has a significant impact on a subset of the collected data. In the narrow domain of news and their summaries, we found 90 samples out of 1200 (7.5%) to be significantly shifted. We gather these samples into separate datasets so future researchers can utilize modern context-aware models to show their benefits.

The Role of Diagonal Filtering

Our initial motivation to filter the training data came from contradictory annotations collected in the first round. Measuring correlations between R1 and R3 annotations validated our suspicion: the correlation increases when contradictory samples are discarded. Such filtering could help some methods; however, the higher evaluation metrics on the unfiltered version confirm a generally known paradigm: deep models can benefit more from noisy but larger datasets than from cleaner and smaller variants. Nonetheless, we decided to publicly share both filtered datasets (N2, N1).

Means and variances

We evaluated the distribution of scores in the dataset splits and their different versions. The statistics are summarized in Table 2. From the table, we can see that the mean of the context-dependent test dataset is significantly higher than the mean of the context-free dataset. This makes sense because with a larger context, more information is present in the text and there is a higher chance of some thematic overlap. Next, the test dataset has a higher mean and lower variance than the train dataset. This is caused by averaging nine annotators' scores in the test dataset: according to the central limit theorem, the variance of the average is lower and the mean is closer to the centre of the interval.

10 Conclusion

We conclude our paper with a summary of the distinct features of the introduced dataset. The large size of the dataset (138,556 human annotations) allows robust training and evaluation of semantic models. The dataset belongs among the most extensive non-English training resources for learning the semantics of a language.

The testing part of the dataset contains annotations based upon a consensus of nine annotators. Moreover, we performed a detailed analysis of the resulting annotations and filtered out the unreliable ones. We compute the theoretical lower bound of MSE to be approximately 0.1731. This number is considerably lower (better) than the performance of a random human annotator. Therefore, the testing part enables the evaluation of well-performing models.

We make our dataset publicly available for research purposes.


This work has been partly supported by Grant No. SGS-2019-018 Processing of heterogeneous data and its specialized applications. Computational resources were supplied by the project "e-Infrastruktura CZ" (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.