Detecting Fine-Grained Cross-Lingual Semantic Divergences without Supervision by Learning to Rank

10/07/2020
by Eleftheria Briakou, et al.

Detecting fine-grained differences in content conveyed in different languages matters for cross-lingual NLP and multilingual corpora analysis, but it is a challenging machine learning problem since annotation is expensive and hard to scale. This work improves the prediction and annotation of fine-grained semantic divergences. We introduce a training strategy for multilingual BERT models by learning to rank synthetic divergent examples of varying granularity. We evaluate our models on Rationalized English-French Semantic Divergences (REFreSD), a new dataset released with this work, consisting of English-French sentence pairs annotated with semantic divergence classes and token-level rationales. Learning to rank helps detect fine-grained sentence-level divergences more accurately than a strong sentence-level similarity model, while token-level predictions have the potential of further distinguishing between coarse and fine-grained divergences.
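To make the training strategy concrete, the sketch below shows one way to implement a learning-to-rank objective over cross-lingual sentence pairs with multilingual BERT: a margin ranking loss pushes the score of an equivalent pair above the score of a more divergent, synthetically perturbed version of it. The checkpoint name, the linear scoring head, the margin value, the optimizer settings, and the toy sentence pair are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the learning-to-rank idea: score cross-lingual sentence
# pairs with multilingual BERT and train with a margin ranking loss so that a
# less divergent (synthetic) pair scores higher than a more divergent one.
# Model checkpoint, scoring head, margin, and hyperparameters are assumptions.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")
scorer = nn.Linear(encoder.config.hidden_size, 1)  # maps [CLS] to a scalar score

def score_pair(src: str, tgt: str) -> torch.Tensor:
    """Encode an English-French pair jointly and score it with the linear head."""
    enc = tokenizer(src, tgt, return_tensors="pt", truncation=True, padding=True)
    cls = encoder(**enc).last_hidden_state[:, 0]  # [CLS] representation
    return scorer(cls).squeeze(-1)

margin_loss = nn.MarginRankingLoss(margin=1.0)
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(scorer.parameters()), lr=2e-5
)

# One illustrative training triple: an equivalent pair and a synthetically
# perturbed, more divergent target for the same source sentence.
src = "The cat sat on the mat."
tgt_equivalent = "Le chat était assis sur le tapis."
tgt_divergent = "Le chien dormait dans le jardin."

optimizer.zero_grad()
s_pos = score_pair(src, tgt_equivalent)
s_neg = score_pair(src, tgt_divergent)
# target = 1 asks the loss to rank s_pos above s_neg by at least the margin.
loss = margin_loss(s_pos, s_neg, torch.ones_like(s_pos))
loss.backward()
optimizer.step()
```

As the title and abstract indicate, the divergent side of each training pair is synthetic, so a ranking objective of this kind can be trained without any manually labeled divergences.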

Related research

05/07/2020 · LIIR at SemEval-2020 Task 12: A Cross-Lingual Augmentation Approach for Multilingual Offensive Language Identification
This paper presents our system entitled `LIIR' for SemEval-2020 Task 12 ...

10/06/2022 · Measuring Fine-Grained Semantic Equivalence with Abstract Meaning Representation
Identifying semantically equivalent sentences is important for many cros...

04/18/2021 · Keyphrase Generation with Fine-Grained Evaluation-Guided Reinforcement Learning
Aiming to generate a set of keyphrases, Keyphrase Generation (KG) is a c...

04/18/2021 · A Token-level Reference-free Hallucination Detection Benchmark for Free-form Text Generation
Large pretrained generative models like GPT-3 often suffer from hallucin...

12/20/2022 · Pay Attention to Your Tone: Introducing a New Dataset for Polite Language Rewrite
We introduce PoliteRewrite – a dataset for polite language rewrite which...

11/16/2021 · Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity
We present Aspire, a new scientific document similarity model based on m...

03/23/2017 · TokTrack: A Complete Token Provenance and Change Tracking Dataset for the English Wikipedia
We present a dataset that contains every instance of all tokens (words...
