Contrastive String Representation Learning using Synthetic Data

by   Urchade Zaratiana, et al.

String Representation Learning (SRL) is an important task in the field of Natural Language Processing, but it remains under-explored. The goal of SRL is to learn dense, low-dimensional vectors (or embeddings) that encode character sequences. The learned representations can be used in many downstream tasks such as string similarity matching or lexical normalization. In this paper, we propose a new method for training an SRL model using only synthetic data. Our approach makes use of Contrastive Learning in order to maximize similarity between related strings while minimizing it for unrelated strings. We demonstrate the effectiveness of our approach by evaluating the learned representation on the task of string similarity matching. Code, data and pretrained models will be made publicly available.


1 Introduction

Recently, there has been a lot of research in representation learning for textual data. This work has been mainly motivated by the growing popularity of pretrained transformer models such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) or XLNet (Yang et al., 2020). However, most of it has focused on representations at the sentence or paragraph level, optimized to capture the semantic meaning of the sentence or paragraph. These learned representations are then applied to different downstream tasks like Semantic Textual Similarity (STS), Text Retrieval or Paraphrase Detection. As a result, these models are not well suited to tasks that require reasoning at the character level, such as string matching.

Due to its recent success in the computer vision field (Chen et al., 2020; He et al., 2020; van den Oord et al., 2019), Contrastive Learning has become very popular and is now a standard approach to self-supervised representation learning. The key idea of contrastive learning is to learn representations by maximizing agreement between two differently augmented views of the same data sample via a contrastive loss in latent space. Giorgi et al. (2020) propose DeCLUTR, a model for textual representation using Contrastive Learning where the goal is to minimize the distance between textual segments sampled from nearby positions in the same document, without resorting to data augmentation. Wu et al. (2020) take a similar approach, but they train an encoder to minimize the distance between the embeddings of different augmentations of the same sentence, obtained by word deletion, span deletion, sentence reordering and synonym substitution. However, to our knowledge, no existing work has used contrastive learning for learning string representations.

Synthetic data generation (SDG) has always been popular in the machine learning community. Training Machine Learning models requires data, but data is not always available. Collecting and labelling data can cost a lot of money and take a lot of time, so it is critical to find a way to build data with little effort. SDG is an approach where data are generated using an algorithm or by employing generative models such as Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) or Variational Auto-encoders (VAE) (Kingma and Welling, 2014). To the best of our knowledge, this is the first work that uses synthetic data for training string representation models.

Figure 1: Overview of the model architecture. For each synthetically generated character sequence (anchor sample), we create an augmented sample by adding a random perturbation to the anchor sample. The two resulting representations, produced by the same model (character encoder and pooling layer), are then optimized in a contrastive way.

2 Model

2.1 Architecture

An overview of our model is depicted in Figure 1. The model is composed of a character encoder layer and a pooling layer, and is optimized by minimizing a contrastive loss, which amounts to maximizing the similarity between positive (related) samples.

Character encoder

The first component of our architecture is the character encoder. The goal of the character encoder is to return a set of embedding vectors given a sequence of characters. In our experiments, we explore both contextualized and non-contextualized character representations. The first method, which we use as a reference, encodes each input character with a simple dictionary lookup. Since these representations are not contextualized, a character has the same representation regardless of its context. Our second model uses a bidirectional LSTM (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005) to produce a contextualized representation of each input character. The representation of each character is the concatenation of its forward and backward LSTM representations.

Pooling Layer

The purpose of the pooling layer is to produce a single vector that encodes the sequence of character embeddings produced by the character encoder. For this part, we experimented with either element-wise pooling or element-wise max-pooling. We found that there is no substantial difference between the two approaches, but max-pooling makes the model training converge faster.
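As a rough illustration of the pooling step, the sketch below reduces a list of per-character vectors to one fixed-size vector. The function names are hypothetical, and treating the non-max variant as an element-wise mean is an assumption made only for this example.

```python
from typing import List

Vector = List[float]


def max_pool(char_vectors: List[Vector]) -> Vector:
    """Element-wise maximum over the character axis."""
    return [max(column) for column in zip(*char_vectors)]


def mean_pool(char_vectors: List[Vector]) -> Vector:
    """Element-wise mean over the character axis (assumed variant)."""
    n = len(char_vectors)
    return [sum(column) / n for column in zip(*char_vectors)]


# Three characters, each encoded as a 2-dimensional vector.
encoded = [[1.0, -2.0], [3.0, 0.0], [2.0, 5.0]]
pooled_max = max_pool(encoded)    # [3.0, 5.0]
pooled_mean = mean_pool(encoded)  # [2.0, 1.0]
```

Either function returns a vector whose size does not depend on the string length, which is what allows strings of different lengths to be compared in the same space.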

Contrastive Loss Function

To learn a meaningful representation of strings, our model is optimized by minimizing the disagreement between a generated sequence and its augmented version. After the pooling layer, we add a non-linear projection head that maps the pooled representations to the space where the contrastive loss is applied. Adding a non-linear projection head has been shown empirically to improve the representation quality of the layer before it (Chen et al., 2020). Contrastive learning is then defined as follows: given a set of training samples $\{x_k\}$ that includes a positive pair of examples $x_i$ and $x_j$, the goal is to identify $x_j$ in $\{x_k\}_{k \neq i}$ for a given $x_i$. During training, we sample a set of $N$ pairs of synthetic data and their augmented versions, which results in $2N$ data points.

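Then, the loss function for a positive pair of examples $(i, j)$ is defined as:

$$\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)}$$

where $z_i$ denotes the projected representation of sample $i$, $\mathrm{sim}(u, v)$ is the cosine similarity between the two vectors $u$ and $v$, and $\tau$ is a temperature hyperparameter. This loss function is called the normalized temperature-scaled cross-entropy loss (NT-Xent for short) and has recently gained a lot of attention in self-supervised learning (Chen et al., 2020; He et al., 2020; van den Oord et al., 2019).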
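To make the NT-Xent loss concrete, the following plain-Python sketch computes it for a small batch of projected vectors. It is an illustration only: the function names, the convention that `z[2k]` and `z[2k + 1]` form a positive pair, and the temperature value of 0.5 are assumptions, not details taken from the paper.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_pair(z: List[Vector], i: int, j: int, tau: float) -> float:
    """Loss for the positive pair (i, j) among all vectors in z."""
    numerator = math.exp(cosine(z[i], z[j]) / tau)
    # Sum over every other sample k != i, including the positive j.
    denominator = sum(
        math.exp(cosine(z[i], z[k]) / tau) for k in range(len(z)) if k != i
    )
    return -math.log(numerator / denominator)


def nt_xent(z: List[Vector], tau: float = 0.5) -> float:
    """Mean loss over both orderings of each pair (assumed layout: 2k, 2k+1)."""
    n_pairs = len(z) // 2
    total = 0.0
    for k in range(n_pairs):
        total += nt_xent_pair(z, 2 * k, 2 * k + 1, tau)
        total += nt_xent_pair(z, 2 * k + 1, 2 * k, tau)
    return total / (2 * n_pairs)


# Positives point in nearly the same direction: low loss.
z_aligned = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
# Positives are nearly orthogonal: higher loss.
z_mixed = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]]
loss_aligned = nt_xent(z_aligned)
loss_mixed = nt_xent(z_mixed)
```

The loss is lower when each anchor is closer to its own augmented version than to the other samples in the batch, which is the behaviour the training objective rewards.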

2.2 Data Generation

Here, we present our technique for generating data in order to train a string representation model. In our work, we focus on English, but our technique can be easily adapted to any other written language. The idea behind our approach is to generate synthetic character strings by mimicking some summary statistics of the true word distribution, such as the word lengths and the frequency of the characters.

Word length

The first step of our synthetic data generation is to sample a word length $l$ from a word length distribution. We model this distribution with the discrete normal distribution (Roy, 2003), with mean $\mu$ and standard deviation $\sigma$ set to the empirical mean and empirical standard deviation of the true word length distribution, respectively. Formally, we sample a real number $x$ from the normal distribution with mean $\mu$ and standard deviation $\sigma$. Then, we choose the word length $l$ as the integer part of the sampled real number $x$. Moreover, we truncate the distribution to avoid a length of 0 or negative numbers. We also prevent overly long sequences by setting a maximum length of 25.
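The length-sampling step can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, the mean and standard deviation used in the usage line are placeholder values rather than the paper's statistics, and truncation is implemented here by rejection sampling, which is only one possible choice.

```python
import random
from typing import Optional

MAX_LENGTH = 25  # maximum length stated in the text


def sample_word_length(
    mu: float, sigma: float, rng: Optional[random.Random] = None
) -> int:
    """Draw a word length from a truncated, discretized normal distribution."""
    if rng is None:
        rng = random.Random()
    while True:
        # Integer part of a real-valued normal sample.
        length = int(rng.gauss(mu, sigma))
        # Reject zero, negative and overly long lengths (assumed truncation).
        if 1 <= length <= MAX_LENGTH:
            return length


rng = random.Random(0)
# Placeholder statistics, not the values measured in the paper.
lengths = [sample_word_length(9.4, 2.9, rng) for _ in range(1000)]
```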


Character sampling

After getting a word length $l$, we randomly sample $l$ characters from the character vocabulary according to their frequencies in the true word distribution. Note that the characters are sampled independently, which means that the actual structure of the word distribution is not preserved. More realistic sequences could be generated by training a character language model, but we have found that our simple technique already works well and is much less expensive in terms of computation. To train our contrastive learning model, we create a positive sample for each generated character sequence, which we call the anchor sample. To create the positive samples, we apply various perturbations to the anchor samples, such as random deletion of characters, random insertion of characters, and random reordering of characters.
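The sketch below illustrates independent character sampling and the creation of a positive sample by perturbation. The frequency table is a small made-up example, the function names are hypothetical, and applying exactly one perturbation per anchor (with reordering implemented as an adjacent swap) is an assumption; the paper does not specify these details.

```python
import random
from typing import Dict

# Made-up character frequencies; real ones would be counted from a word list.
CHAR_FREQ: Dict[str, float] = {
    "e": 0.12, "t": 0.09, "a": 0.08, "o": 0.075, "n": 0.07, "s": 0.063,
}


def sample_anchor(length: int, rng: random.Random) -> str:
    """Sample characters independently, weighted by their frequency."""
    chars = list(CHAR_FREQ)
    weights = list(CHAR_FREQ.values())
    return "".join(rng.choices(chars, weights=weights, k=length))


def perturb(anchor: str, rng: random.Random) -> str:
    """Apply one random perturbation: deletion, insertion or reordering."""
    chars = list(anchor)
    op = rng.choice(["delete", "insert", "reorder"])
    if op == "insert":
        position = rng.randrange(len(chars) + 1)
        chars.insert(position, rng.choice(list(CHAR_FREQ)))
    elif len(chars) > 1:
        k = rng.randrange(len(chars) - 1)
        if op == "delete":
            del chars[k]
        else:
            # Reorder by swapping two adjacent characters.
            chars[k], chars[k + 1] = chars[k + 1], chars[k]
    return "".join(chars)


rng = random.Random(42)
anchor = sample_anchor(8, rng)
positive = perturb(anchor, rng)
```

Each `(anchor, positive)` pair produced this way plays the role of a positive pair in the contrastive loss, while the other strings in the batch act as negatives.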

3 Experiments

3.1 Setup

Training data and data loading

Since the data are synthetic and their generation is fast, we generate them online during training instead of generating them offline, because the latter is less memory efficient. We trained all the models on 50 million synthetic samples for a single epoch. During data loading, we generate batches of synthetic character sequences and their augmented versions. We dynamically set the maximum string length to the length of the longest character sequence in the batch, and we zero-pad the remaining strings to this maximum length.
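Dynamic padding to the longest sequence in a batch can be sketched as below. The vocabulary construction and function names are hypothetical, and reserving index 0 for padding is an assumption consistent with the zero-padding described above.

```python
from typing import Dict, List

PAD_ID = 0  # index reserved for padding (assumed)


def build_vocab(alphabet: str) -> Dict[str, int]:
    """Map each character to a positive integer id; 0 is kept for padding."""
    return {ch: idx for idx, ch in enumerate(alphabet, start=1)}


def pad_batch(strings: List[str], vocab: Dict[str, int]) -> List[List[int]]:
    """Encode strings as ids and zero-pad them to the batch's longest length."""
    max_len = max(len(s) for s in strings)
    return [
        [vocab[ch] for ch in s] + [PAD_ID] * (max_len - len(s))
        for s in strings
    ]


vocab = build_vocab("abcdefghijklmnopqrstuvwxyz")
batch = pad_batch(["cat", "horse", "ox"], vocab)
# batch == [[3, 1, 20, 0, 0], [8, 15, 18, 19, 5], [15, 24, 0, 0, 0]]
```

Padding per batch rather than to a global maximum keeps short batches small, which reduces wasted computation when most strings are far shorter than the length cap.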

Implementation details

We implemented all our models with PyTorch (Paszke et al., 2019), and we employed the normalized temperature-scaled cross-entropy loss (NT-Xent) implemented by the PyTorch Metric Learning library (Musgrave et al., 2020), following Giorgi et al. (2020). To get the word statistics, we used NLTK's word dictionary, which contains 235892 unique words. For all models, we used a hidden size of 300 and a batch size of 256, with a maximum sequence length of 25. All the models were optimized with the Adam optimizer with a learning rate , and . The models were trained on a single Tesla V100 GPU with 32GB of memory.

Task, dataset and metrics

We evaluate the learned string embeddings on a character-based retrieval task. This task consists of retrieving the most relevant string in a dataset given a query string. To this end, we construct a string retrieval dataset where the task is to retrieve a real word given its augmented version, obtained by injecting random noise into the real word. We sampled 19970 words from the NLTK library's (Feng et al., 2020) word dictionary. To evaluate the performance of the models, we use Precision@1 and the evaluation time, i.e. the time it takes to compute Precision@1 on the test set.
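The evaluation metric can be sketched as follows: for every noisy query, the candidate with the highest similarity is retrieved and counted as a hit if it is the original word. The function names and the toy character-set similarity are assumptions used only to make the example runnable; in the paper's setting the similarity would come from the learned embeddings or from a baseline.

```python
from typing import Callable, List


def precision_at_1(
    queries: List[str],
    targets: List[str],
    candidates: List[str],
    similarity: Callable[[str, str], float],
) -> float:
    """Fraction of queries whose top-ranked candidate is the expected target."""
    hits = 0
    for query, target in zip(queries, targets):
        best = max(candidates, key=lambda c: similarity(query, c))
        hits += int(best == target)
    return hits / len(queries)


def charset_jaccard(a: str, b: str) -> float:
    """Toy similarity: Jaccard overlap of the two character sets."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)


words = ["apple", "banana", "cherry"]
noisy = ["aple", "bananna", "chery"]  # hand-made noisy versions
score = precision_at_1(noisy, words, words, charset_jaccard)
```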

3.2 Baselines

In addition to the two neural methods we implemented (the Bi-LSTM representation model and the non-contextualized bag-of-characters model), we also implemented two non-neural baselines: the Levenshtein distance and a TF-IDF weighted character n-gram similarity. They are described in the following two paragraphs.

Levenshtein distance

The Levenshtein distance measures the dissimilarity between two given strings. It is calculated by counting the minimum number of operations (insertions, deletions or substitutions) required to transform one string into the other. This distance is widely used in the context of fuzzy string matching and provides a solid baseline for more recent models.
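For reference, the Levenshtein distance can be computed with the classic dynamic-programming recurrence. The sketch below is a generic textbook implementation, not the code used for the baseline in the paper; it keeps only two rows of the table at a time.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions from a to b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


distance = levenshtein("kitten", "sitting")  # 3
```

The two nested loops visit every pair of positions in the two strings, so the cost of one comparison grows with the product of the string lengths.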

Character n-gram

For this method, we represent each string by its character bigrams and trigrams, weighted by TF-IDF. The cosine similarity is then used to calculate the pairwise similarity scores between the strings. In this way, the strings that share the most bigrams or trigrams get a high similarity score, but this method can be very sensitive to noise, especially for strings that contain few characters.
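A minimal standard-library sketch of this baseline is given below. The function names are hypothetical, and the smoothed IDF formula is an assumption chosen for the example; the paper does not state which TF-IDF variant was used.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def char_ngrams(s: str, sizes: Sequence[int] = (2, 3)) -> List[str]:
    """All character bigrams and trigrams of a string."""
    return [s[i:i + n] for n in sizes for i in range(len(s) - n + 1)]


def tfidf_vectors(strings: List[str]) -> List[Dict[str, float]]:
    """Sparse TF-IDF vectors over character n-grams (smoothed IDF, assumed)."""
    docs = [Counter(char_ngrams(s)) for s in strings]
    doc_freq = Counter(gram for doc in docs for gram in doc)
    n_docs = len(strings)
    return [
        {
            gram: tf * (math.log((1 + n_docs) / (1 + doc_freq[gram])) + 1)
            for gram, tf in doc.items()
        }
        for doc in docs
    ]


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(gram, 0.0) for gram, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vectors = tfidf_vectors(["hello", "helo", "world"])
sim_close = cosine(vectors[0], vectors[1])  # shares "he", "el", "lo", "hel"
sim_far = cosine(vectors[0], vectors[2])    # no shared n-gram
```

The example also hints at the noise sensitivity mentioned above: dropping a single character from "hello" removes three of its seven n-grams.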

Methods                              Precision@1    Evaluation time
Levenshtein distance                 0.877          2 min 45 sec
Tf-Idf character N-gram              0.690           sec
Our models
Non-contextualized representation    0.672           sec
Bi-LSTM representation               0.904           sec

Table 1: Performance of models on the string similarity task. The evaluation measures are the precision@1 and the evaluation time, the time needed to compute the precision@1 on the test set.

4 Results

Table 1 reports the performance of the models on the task of string similarity matching. The Bi-LSTM model performs best, with a Precision@1 of 0.904 on the retrieval task. This result shows the effectiveness of our approach and the importance of contextual information for string representation.

The Levenshtein distance retrieval model achieves a good result, with a Precision@1 of 0.877. However, it does not scale well to many samples and long sequences. This is explained by its time complexity of $O(mn)$, where $m$ and $n$ are the lengths of the two strings whose distance we want to calculate. Calculating this distance pairwise for many samples is very expensive, even more so when the sequences are long.

The non-contextualized representation model obtained the lowest score, with a Precision@1 of 0.672. This result shows the importance of contextualized representations for the string similarity matching task. Because it processes each character independently of the others, the representation becomes very sensitive to noise and typos. The character n-gram model performs slightly better (0.690) but is less efficient at inference time.

5 Conclusion

In this paper, we introduce a new method for character-level string representation using contrastive learning and synthetic data generation. We experimented with a simple bag-of-characters encoder and a Bi-LSTM based encoder, but the approach could easily be extended to other types of encoders such as a 1-D CNN or a transformer encoder layer. We demonstrated the effectiveness of our approach on a string retrieval task. In future work, it would be interesting to see whether the representations learned from synthetic data could improve discriminative tasks like text classification or Named Entity Recognition.


  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv:2002.05709.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.
  • F. Feng, Y. Yang, D. Cer, N. Arivazhagan, and W. Wang (2020) Language-agnostic BERT sentence embedding. arXiv:2007.01852.
  • J. M. Giorgi, O. Nitski, G. D. Bader, and B. Wang (2020) DeCLUTR: deep contrastive learning for unsupervised textual representations. arXiv:2006.03659.
  • I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial networks. arXiv:1406.2661.
  • A. Graves and J. Schmidhuber (2005) Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18 (5), pp. 602–610. Note: IJCNN 2005. ISSN 0893-6080.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. arXiv:1911.05722.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780. ISSN 0899-7667.
  • D. P. Kingma and M. Welling (2014) Auto-encoding variational Bayes. arXiv:1312.6114.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv:1907.11692.
  • K. Musgrave, S. Belongie, and S. Lim (2020) PyTorch metric learning. arXiv:2008.09164.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. arXiv:1912.01703.
  • D. Roy (2003) The discrete normal distribution. Communications in Statistics - Theory and Methods 32 (10), pp. 1871–1883.
  • A. van den Oord, Y. Li, and O. Vinyals (2019) Representation learning with contrastive predictive coding. arXiv:1807.03748.
  • Z. Wu, S. Wang, J. Gu, M. Khabsa, F. Sun, and H. Ma (2020) CLEAR: contrastive learning for sentence representation. arXiv:2012.15466.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2020) XLNet: generalized autoregressive pretraining for language understanding. arXiv:1906.08237.