Czech News Dataset for Semantic Textual Similarity

08/19/2021
by Jakub Sido, et al.

This paper describes a novel dataset consisting of sentences with semantic similarity annotations. The data originate from the journalistic domain in the Czech language. We describe the process of collecting and annotating the data in detail. The dataset contains 138,556 human annotations divided into train and test sets. In total, 485 journalism students participated in the creation process. To increase the reliability of the test set, we compute each test annotation as the average of 9 individual annotations. We evaluate the quality of the dataset by measuring inter-annotator and intra-annotator agreement. Besides the agreement numbers, we provide detailed statistics of the collected dataset. We conclude the paper with a baseline experiment in building a system for predicting the semantic similarity of sentences. Owing to the massive number of training annotations (116,956), the model performs significantly better than an average annotator (Pearson's correlation coefficient of 0.92 versus 0.86).
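The two quantities the abstract relies on, an averaged gold score per sentence pair and Pearson's correlation coefficient, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `gold_scores` and `pearson`, the toy annotations, and the toy predictions are all made up for illustration, and the toy data use 3 annotations per pair rather than the 9 used for the test set.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def gold_scores(annotations):
    """Average the individual annotations collected for each sentence pair."""
    return [mean(scores) for scores in annotations]


# Hypothetical toy data: 4 sentence pairs, 3 annotations each,
# on an arbitrary numeric similarity scale (an assumption, not the paper's data).
toy_annotations = [
    [0, 1, 0],
    [2, 3, 2],
    [4, 5, 5],
    [6, 6, 5],
]
# Hypothetical model predictions for the same 4 pairs.
toy_predictions = [0.5, 2.0, 4.5, 6.0]

gold = gold_scores(toy_annotations)
r = pearson(gold, toy_predictions)
print(round(r, 3))
```

The same `pearson` function could be applied to one annotator's scores against the averaged scores, which is one plausible way to obtain a per-annotator figure comparable to the model's.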

research
03/27/2021

Supersense and Sensibility: Proxy Tasks for Semantic Annotation of Prepositions

Prepositional supersense annotation is time-consuming and requires exper...
research
12/31/2022

Approaching Peak Ground Truth

Machine learning models are typically evaluated by computing similarity ...
research
04/05/2017

CompiLIG at SemEval-2017 Task 1: Cross-Language Plagiarism Detection Methods for Semantic Textual Similarity

We present our submitted systems for Semantic Textual Similarity (STS) T...
research
09/11/2018

Evaluating Multimodal Representations on Sentence Similarity: vSTS, Visual Semantic Textual Similarity Dataset

In this paper we introduce vSTS, a new dataset for measuring textual sim...
research
11/11/2016

Improving Reliability of Word Similarity Evaluation by Redesigning Annotation Task and Performance Measure

We suggest a new method for creating and using gold-standard datasets fo...
research
02/05/2015

Use of Modality and Negation in Semantically-Informed Syntactic MT

This paper describes the resource- and system-building efforts of an eig...
research
06/02/2023

LyricSIM: A novel Dataset and Benchmark for Similarity Detection in Spanish Song LyricS

In this paper, we present a new dataset and benchmark tailored to the ta...