Contrastive String Representation Learning using Synthetic Data

10/08/2021
by Urchade Zaratiana, et al.

String Representation Learning (SRL) is an important task in Natural Language Processing, but it remains under-explored. The goal of SRL is to learn dense, low-dimensional vectors (or embeddings) that encode character sequences. The learned representations can be used in many downstream tasks, such as string similarity matching or lexical normalization. In this paper, we propose a new method for training an SRL model using only synthetic data. Our approach uses Contrastive Learning to maximize similarity between related strings while minimizing it for unrelated strings. We demonstrate the effectiveness of our approach by evaluating the learned representations on the task of string similarity matching. Code, data, and pretrained models will be made publicly available.
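
The contrastive objective described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss, the encoder, or how the synthetic pairs are generated, so everything below is an assumption made for illustration: `perturb` is a hypothetical augmentation that applies one random character edit to produce a "related" string, and `info_nce_loss` is a standard InfoNCE-style loss with in-batch negatives operating on embeddings from any encoder. It is not the authors' implementation.

```python
import random
import string

import numpy as np


def perturb(s, rng):
    """Hypothetical synthetic-pair generator: apply one random character edit.

    Assumption for illustration only; the abstract does not describe
    how related strings are synthesized.
    """
    c = rng.choice(string.ascii_lowercase)
    if not s:
        return c
    i = rng.randrange(len(s))
    op = rng.choice(["delete", "insert", "substitute"])
    if op == "delete":
        return s[:i] + s[i + 1:]
    if op == "insert":
        return s[:i] + c + s[i:]
    return s[:i] + c + s[i + 1:]


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss with in-batch negatives (an assumed choice of loss).

    Row i of `anchors` and row i of `positives` embed a related string pair;
    every other row in the batch is treated as an unrelated string.
    """
    a = l2_normalize(np.asarray(anchors, dtype=float))
    p = l2_normalize(np.asarray(positives, dtype=float))
    logits = a @ p.T / temperature                       # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy where the correct "class" for row i is column i.
    return float(-np.mean(np.diag(log_probs)))
```

Minimizing this loss pulls each string's embedding toward that of its perturbed counterpart and pushes it away from the other strings in the batch. In practice the embeddings would come from a trainable character-level encoder, and the loss would be computed in an automatic-differentiation framework rather than NumPy.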


research
03/31/2020

A Clustering Framework for Lexical Normalization of Roman Urdu

Roman Urdu is an informal form of the Urdu language written in Roman scr...
research
07/23/2019

Optimal Transport-based Alignment of Learned Character Representations for String Similarity

String similarity models are vital for record linkage, entity resolution...
research
09/24/2020

Novel Keyword Extraction and Language Detection Approaches

Fuzzy string matching and language classification are important tools in...
research
04/07/2021

Accurate and Efficient Suffix Tree Based Privacy-Preserving String Matching

The task of calculating similarities between strings held by different o...
research
04/27/2023

string2string: A Modern Python Library for String-to-String Algorithms

We introduce string2string, an open-source library that offers a compreh...
research
06/07/2023

Large-Scale Cell Representation Learning via Divide-and-Conquer Contrastive Learning

Single-cell RNA sequencing (scRNA-seq) data is a potent tool for compreh...
research
07/03/2019

Encoding high-cardinality string categorical variables

Statistical analysis usually requires a vector representation of categor...
