LEA: Improving Sentence Similarity Robustness to Typos Using Lexical Attention Bias

07/06/2023
by Mario Almagro, et al.

Textual noise, such as typos or abbreviations, is a well-known issue that degrades vanilla Transformers on most downstream tasks. We show that this is also the case for sentence similarity, a fundamental task in multiple domains, e.g., matching, retrieval, or paraphrasing. Sentence similarity can be approached with cross-encoders, where the two sentences are concatenated in the input, allowing the model to exploit their inter-relations. Previous works addressing the noise issue rely mainly on data augmentation strategies, showing improved robustness on corrupted samples similar to those seen during training. However, all these methods still suffer from the token distribution shift induced by typos. In this work, we propose to tackle textual noise by equipping cross-encoders with a novel LExical-aware Attention module (LEA) that incorporates lexical similarities between words in both sentences. By using raw text similarities, our approach avoids the tokenization shift problem and obtains improved robustness. We demonstrate that the attention bias introduced by LEA helps cross-encoders tackle complex scenarios with textual noise, especially in domains with short-text descriptions and limited context. Experiments with three popular Transformer encoders on five e-commerce datasets for product matching show that LEA consistently boosts performance in the presence of noise, while remaining competitive on the original (clean) splits. We also evaluate our approach on two datasets for textual entailment and paraphrasing, showing that LEA is robust to typos in domains with longer sentences and more natural context. Additionally, we thoroughly analyze several design choices in our approach, providing insights about the impact of our decisions and fostering future research on cross-encoders dealing with typos.
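To make the core idea concrete, below is a minimal sketch of a lexical attention bias for a cross-encoder. It assumes character-trigram Jaccard overlap as the lexical similarity measure and adds the resulting cross-sentence similarity matrix to the attention logits; the paper's exact similarity function, scaling, and per-head integration may differ, and names like `lexical_bias` are illustrative, not taken from the authors' code.

```python
# Sketch of a lexical attention bias (illustrative; assumes character-trigram
# Jaccard similarity, which is one plausible raw-text measure, not necessarily
# the one used in the paper).
import torch

def char_ngrams(word, n=3):
    """Set of character n-grams, with boundary padding so short words work."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(max(1, len(padded) - n + 1))}

def lexical_similarity(w1, w2, n=3):
    """Jaccard overlap of character n-grams; stays high under small typos."""
    a, b = char_ngrams(w1, n), char_ngrams(w2, n)
    return len(a & b) / len(a | b) if a | b else 0.0

def lexical_bias(sent_a, sent_b, scale=1.0):
    """Bias matrix over the concatenated input: nonzero only between
    tokens of sentence A and tokens of sentence B (cross-sentence pairs)."""
    la, lb = len(sent_a), len(sent_b)
    bias = torch.zeros(la + lb, la + lb)
    for i, wa in enumerate(sent_a):
        for j, wb in enumerate(sent_b):
            s = scale * lexical_similarity(wa, wb)
            bias[i, la + j] = s   # tokens of A attending to B
            bias[la + j, i] = s   # tokens of B attending to A
    return bias

def biased_attention(q, k, v, bias):
    """Scaled dot-product attention with an additive lexical bias."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    weights = torch.softmax(scores + bias, dim=-1)
    return weights @ v

if __name__ == "__main__":
    sent_a = ["wireless", "mouse"]
    sent_b = ["wirelss", "mouse", "pad"]   # "wirelss" is a typo
    bias = lexical_bias(sent_a, sent_b)
    seq = len(sent_a) + len(sent_b)
    q = k = v = torch.randn(seq, 8)
    out = biased_attention(q, k, v, bias)
    print(bias)       # high bias between "wireless" and "wirelss"
    print(out.shape)  # torch.Size([5, 8])
```

The point of computing the bias on raw text rather than on token IDs is visible in the example: a subword tokenizer would split "wireless" and "wirelss" into different token sequences, whereas the character-level similarity between the two surface forms remains high, so the bias still pulls the corresponding positions together in attention.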


research · 02/23/2016
Sentence Similarity Learning by Lexical Decomposition and Composition
Most conventional sentence similarity methods only focus on similar part...

research · 04/22/2022
MCSE: Multimodal Contrastive Learning of Sentence Embeddings
Learning semantically meaningful sentence embeddings is an open problem ...

research · 03/10/2023
Logic Against Bias: Textual Entailment Mitigates Stereotypical Sentence Reasoning
Due to their similarity-based learning objectives, pretrained sentence e...

research · 04/29/2022
KERMIT – A Transformer-Based Approach for Knowledge Graph Matching
One of the strongest signals for automated matching of knowledge graphs ...

research · 12/19/2022
MANTIS at TSAR-2022 Shared Task: Improved Unsupervised Lexical Simplification with Pretrained Encoders
In this paper we present our contribution to the TSAR-2022 Shared Task o...

research · 01/23/2023
Injecting the BM25 Score as Text Improves BERT-Based Re-rankers
In this paper we propose a novel approach for combining first-stage lexi...

research · 06/08/2023
Revealing the Blind Spot of Sentence Encoder Evaluation by HEROS
Existing sentence textual similarity benchmark datasets only use a singl...
