1 Introduction
Recently, there has been a lot of research on representation learning for textual data. These works have been mainly motivated by the growing popularity of pretrained transformer models such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) or XLNet (Yang et al., 2020). However, most of these works have focused on representations at the sentence or paragraph level, and the representations are optimized to capture the semantic meaning of the sentence or paragraph. These learned representations are then applied to different downstream tasks such as Semantic Textual Similarity (STS), Text Retrieval or Paraphrase Detection. As a result, these models are not well suited to tasks that require reasoning at the character level, such as string matching.
Due to its recent success in the computer vision field
(Chen et al., 2020; He et al., 2020; van den Oord et al., 2019), Contrastive Learning has become very popular and is now a standard approach to self-supervised representation learning. The key idea of contrastive learning is to learn representations by maximizing the agreement between two differently augmented views of the same sample via a contrastive loss in latent space. Giorgi et al. (2020) propose DeCLUTR, a model for textual representation learning with contrastive learning, whose goal is to minimize the distance between textual segments sampled from nearby positions in the same document, without resorting to data augmentation. Wu et al. (2020) take a similar approach, but they train an encoder to minimize the distance between the embeddings of different augmentations of the same sentence obtained by word deletion, span deletion, sentence reordering and synonym substitution. However, to our knowledge, no existing work has used contrastive learning for learning string representations.

Synthetic data generation (SDG) has always been popular in the machine learning community. Training machine learning models requires data, but data is not always available. Collecting and labelling data can cost a lot of money and take a lot of time, so it is critical to find ways to build data with little effort. SDG is an approach where data are generated using an algorithm or by employing generative models such as Generative Adversarial Networks (GAN)
(Goodfellow et al., 2014) or Variational Auto-encoders (VAE) (Kingma and Welling, 2014). To the best of our knowledge, this is the first work that uses synthetic data for training string representation models.

2 Model
2.1 Architecture
An overview of our model is depicted in Figure 1. The model is composed of a character encoder layer and a pooling layer, and it is optimised by minimizing a contrastive loss, which maximizes the similarity between positive (related) samples.
Character encoder
The first component of our architecture is the character encoder. The goal of the character encoder is to return a sequence of embedding vectors given a sequence of input characters. In our experiments, we explore both contextualized and non-contextualized character representations. The first method, which we use as a reference, encodes each of the input characters with a simple dictionary lookup; since the representations are not contextualized, a character has the same representation regardless of its context. Our second model uses a bidirectional LSTM (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005) to produce a contextualized representation of each input character. The representation of each character is the concatenation of its forward and backward LSTM representations.
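To make the encoder concrete, here is a minimal PyTorch sketch of the Bi-LSTM variant; the vocabulary size and dimensions are illustrative placeholders, not the exact values used in our experiments.

```python
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    """Contextualized character encoder: embedding lookup + bidirectional LSTM.

    The hyperparameters below (vocab size, dimensions) are placeholders.
    """
    def __init__(self, vocab_size=128, embed_dim=64, hidden_dim=150):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) integer character indices, 0 = padding
        x = self.embed(char_ids)      # (batch, seq_len, embed_dim)
        out, _ = self.lstm(x)         # (batch, seq_len, 2 * hidden_dim)
        return out                    # forward/backward states concatenated per character
```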
Pooling Layer
The purpose of the pooling layer is to produce a single vector that encodes the sequence of character embeddings produced by the character encoder. For this part, we experimented with both element-wise mean-pooling and element-wise max-pooling. We found no substantial difference between the two approaches, but max-pooling makes the model training converge faster.
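A masked element-wise max-pooling over the zero-padded character positions could be sketched as follows (the boolean padding mask is an assumed input derived from the padded batch):

```python
import torch

def max_pool(char_states, mask):
    """Element-wise max-pooling over the character dimension.

    char_states: (batch, seq_len, dim) output of the character encoder.
    mask: (batch, seq_len) boolean tensor, True for real characters, False for padding.
    """
    # Set padded positions to the smallest representable value so they never win the max.
    neg_inf = torch.finfo(char_states.dtype).min
    masked = char_states.masked_fill(~mask.unsqueeze(-1), neg_inf)
    return masked.max(dim=1).values   # (batch, dim)
```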
Contrastive Loss Function
To learn a meaningful representation of strings, our model is optimized by minimizing the disagreement between a generated sequence and its augmented version. After the pooling layer, we add a non-linear projection head that maps the pooled representations to the space where the contrastive loss is applied. The addition of a non-linear projection head has been empirically shown to improve the representation quality of the layer before it (Chen et al., 2020). After the projection, we apply contrastive learning, which is defined as follows: given a set of training samples $\{x_k\}$ including a positive pair of examples $(x_i, x_j)$, the goal is to identify $x_j$ among $\{x_k\}_{k \neq i}$ for a given $x_i$. During training, we sample a set of $N$ pairs of synthetic data and their augmented versions, which results in $2N$ data points.
Then, the loss function for a positive pair of examples $(i, j)$ is defined as:

$$\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)} \qquad (1)$$

where $\mathrm{sim}(u, v) = \frac{u^{\top} v}{\lVert u \rVert \, \lVert v \rVert}$ is the cosine similarity between the two vectors $u$ and $v$, and $\tau$ is a temperature hyperparameter. This loss function is called the normalized temperature-scaled cross-entropy loss (NT-Xent for short) and has recently gained a lot of attention in self-supervised learning
(Chen et al., 2020; He et al., 2020; van den Oord et al., 2019).
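For illustration, a minimal PyTorch sketch of the NT-Xent loss in Eq. (1) is given below; the temperature value is a placeholder, and in practice we rely on the implementation from the PyTorch Metric Learning library (see Section 3.1).

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z_anchor, z_positive, temperature=0.1):
    """NT-Xent loss (Eq. 1) for a batch of N positive pairs.

    z_anchor, z_positive: (N, d) projections of the generated strings and
    their augmented versions. The temperature value is a placeholder.
    """
    n = z_anchor.size(0)
    z = F.normalize(torch.cat([z_anchor, z_positive], dim=0), dim=1)  # (2N, d), unit norm
    sim = z @ z.t() / temperature            # pairwise cosine similarities scaled by tau
    sim.fill_diagonal_(float("-inf"))        # exclude k == i from the denominator
    # The positive of example k is its counterpart at k + N (or k - N).
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```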
2.2 Data Generation

Here, we present our technique for generating data to train a string representation model. In this work, we focus on English, but our technique can easily be adapted to any other written language. The idea behind our approach is to generate synthetic character strings that mimic some summary statistics of the true word distribution, such as the word lengths and the character frequencies.
Word length
The first step of our synthetic data generation is to sample a word length $\ell$ from a word length distribution. We model this distribution with the discrete normal distribution (Roy, 2003) with parameters $\mu$ and $\sigma$, where $\mu$ and $\sigma$ are respectively the empirical mean and the empirical standard deviation of the true word length distribution. Formally, we sample a real number $x$ from the normal distribution with mean $\mu$ and standard deviation $\sigma$:

$$x \sim \mathcal{N}(\mu, \sigma^{2}) \qquad (2)$$

Then, we take the word length $\ell$ as the integer part of the sampled real number $x$. Moreover, we truncate the distribution to avoid lengths of 0 or negative values, and we prevent overly long sequences by setting a maximum length of 25.
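A minimal sketch of this truncated, discretized sampling step, assuming the empirical mean and standard deviation have already been estimated from a word list:

```python
import numpy as np

def sample_word_length(mu, sigma, max_len=25, rng=None):
    """Sample a word length from a truncated, discretized normal distribution."""
    rng = rng or np.random.default_rng()
    while True:
        x = rng.normal(mu, sigma)     # real-valued draw from N(mu, sigma^2), Eq. (2)
        length = int(x)               # keep the integer part
        if 1 <= length <= max_len:    # reject lengths <= 0 or above the cap
            return length
```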
Character sampling
After getting a word length $\ell$, we randomly sample $\ell$ characters from the character vocabulary according to their frequencies in the true word distribution. Note that the characters are sampled independently, which means that the actual structure of the word distribution is not preserved. One could instead train a character language model to generate more realistic sequences, but we found that our simple technique already works well and is much less expensive in terms of computation. To train our contrastive learning model, we create a positive sample for each generated character sequence, which we call the anchor sample. To create the positive samples, we apply various perturbations to the anchor samples, such as random deletion of characters, random insertion of characters, and random reordering of characters.
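The following sketch illustrates how an anchor string and its perturbed positive could be generated; the character list, frequencies and perturbation probability are assumed inputs, and the reordering is simplified to a local swap.

```python
import random

def generate_anchor(length, chars, freqs):
    """Sample `length` characters i.i.d. according to their empirical frequencies."""
    return "".join(random.choices(chars, weights=freqs, k=length))

def perturb(anchor, chars, p=0.1):
    """Build a positive sample via random deletion, insertion and a local reordering."""
    out = [c for c in anchor if random.random() > p]            # random deletions
    for c in random.choices(chars, k=max(1, int(p * len(anchor)))):
        out.insert(random.randrange(len(out) + 1), c)           # random insertions
    if len(out) > 1 and random.random() < p:                    # occasional local swap
        i = random.randrange(len(out) - 1)
        out[i], out[i + 1] = out[i + 1], out[i]
    return "".join(out)
```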
3 Experiments
3.1 Setup
Training data and data loading
Since the data are synthetic and their generation is fast, we generate them online during training instead of generating them offline, as the latter is less memory efficient. We trained all the models on 50 million synthetic samples for a single epoch. During data loading, we generate batches of synthetic character sequences and their augmented versions. We dynamically set the maximum string length to the length of the longest sequence in the batch and zero-pad the remaining strings to this maximum length.
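A sketch of this on-the-fly batching with dynamic padding (the character-to-index mapping and the pair generator are assumed to exist, with index 0 reserved for padding):

```python
import torch

def make_batch(batch_size, char2idx, gen_pair):
    """Generate a batch of (anchor, positive) strings and zero-pad them.

    gen_pair() is assumed to return one (anchor, positive) pair of strings.
    """
    pairs = [gen_pair() for _ in range(batch_size)]
    strings = [s for pair in pairs for s in pair]                # 2 * batch_size strings
    max_len = max(len(s) for s in strings)                       # longest string in batch
    ids = torch.zeros(len(strings), max_len, dtype=torch.long)   # 0 = padding index
    for row, s in enumerate(strings):
        ids[row, :len(s)] = torch.tensor([char2idx[c] for c in s])
    return ids
```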
Implementation details
We implemented all our models with PyTorch
(Paszke et al., 2019) and employed the normalized temperature-scaled cross-entropy loss (NT-Xent) implemented by the PyTorch Metric Learning library (Musgrave et al., 2020), following Giorgi et al. (2020). To compute the word statistics, we used NLTK's word dictionary, which contains 235,892 unique words. For all models, we used a hidden size of 300 and a batch size of 256 with a maximum sequence length of 25. All the models were optimized with the Adam optimizer. The models were trained on a single Tesla V100 GPU with 32 GB of memory.

Task, dataset and metrics
We evaluate the learned string embeddings on a character-based retrieval task. This task consists of retrieving the most relevant string in a dataset given a query string. For that, we construct a string retrieval dataset where the task is to retrieve the real word in the dataset given an augmented version of it, obtained by injecting random noise into the real word. We sampled 19,970 words from the NLTK library's (Feng et al., 2020) word dictionary. To evaluate the performance of the models, we use Precision@1 and the evaluation time, i.e., the time it takes to compute Precision@1 on the test set.
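The Precision@1 computation can be sketched as follows: each noisy query is embedded, compared to all candidate word embeddings by cosine similarity, and counted as correct if its nearest neighbour is the original word.

```python
import torch
import torch.nn.functional as F

def precision_at_1(query_emb, corpus_emb, gold_idx):
    """query_emb: (Q, d), corpus_emb: (C, d), gold_idx: (Q,) LongTensor of true word indices."""
    q = F.normalize(query_emb, dim=1)
    c = F.normalize(corpus_emb, dim=1)
    nearest = (q @ c.t()).argmax(dim=1)           # index of the most similar word per query
    return (nearest == gold_idx).float().mean().item()
```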
3.2 Baselines
In addition to the two neural methods we implemented (the Bi-LSTM representation model and the non-contextualised bag-of-characters model), we also implemented two non-neural baselines: the Levenshtein distance and a TF-IDF-weighted character n-gram similarity. They are described in the two following paragraphs.
Levenshtein distance
The Levenshtein distance measures the similarity between two given strings. The distance is calculated by counting the minimum number of operations (insertions, deletions or substitutions) required to transform one string into the other. This distance is widely used in the context of fuzzy string matching and provides a solid baseline for more recent models.
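For illustration, a straightforward dynamic-programming implementation of this distance looks like the following (optimized libraries are typically used in practice):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))                 # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                                 # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution / match
        prev = curr
    return prev[-1]
```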
Character n-gram
For this method, we represent each string by its character bi-grams and tri-grams weighted by TF-IDF. The cosine similarity is then used to compute the pairwise similarity scores between strings. In this way, strings that share many bi-grams or tri-grams obtain a high similarity score, but this method can be very sensitive to noise, especially for strings that contain few characters.
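This baseline can be reproduced, for example, with scikit-learn; the sketch below uses character bi/tri-grams and otherwise default parameters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def ngram_similarity(queries, corpus):
    """TF-IDF weighted character bi/tri-gram cosine similarity between strings."""
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))
    corpus_vecs = vectorizer.fit_transform(corpus)       # (C, n_features) sparse matrix
    query_vecs = vectorizer.transform(queries)           # (Q, n_features)
    return cosine_similarity(query_vecs, corpus_vecs)    # (Q, C) similarity matrix
```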
Methods | Precision@1 | Evaluation time
Baselines
Levenshtein distance | 0.877 | 2 min 45 sec
Tf-Idf character N-gram | 0.690 | sec
Our models
Non-contextualized representation | 0.672 | sec
Bi-LSTM representation | 0.904 | sec
4 Results
Table 1 reports the performance of the models on the string similarity matching task. The Bi-LSTM model performs best, with a Precision@1 of 0.904 on the retrieval task. This result shows the effectiveness of our approach and the importance of contextual information for string representation.
The Levenshtein distance retrieval model achieves a good result with a Precision@1 of 0.877. However, it does not scale well to many samples and long sequences. This is explained by its time complexity of $O(mn)$, where $m$ and $n$ are the lengths of the two strings whose distance we want to compute. Computing this distance pairwise over many samples is very expensive, even more so when the sequences are long.
The non-contextualised representation model obtains a much lower score of 0.672 Precision@1. This result shows the importance of contextualised representations for the string similarity matching task: because the model processes each character independently of the others, its representations are very sensitive to noise and typos. The character n-gram model works a little better but is less efficient at inference time.
5 Conclusion
In this paper, we introduce a new method for character-level string representation using contrastive learning and synthetic data generation. Our model encoders consist of a simple bag-of-characters encoder and a Bi-LSTM-based encoder, but the approach could easily be extended to other types of encoders, such as a 1-D CNN or a transformer encoder layer. We demonstrated the effectiveness of our approach on the string retrieval task. In future work, it would be interesting to see whether the representations learned from synthetic data could improve discriminative tasks such as text classification or Named Entity Recognition.
References
- Chen et al. (2020). A simple framework for contrastive learning of visual representations. arXiv:2002.05709.
- Devlin et al. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.
- Feng et al. (2020). Language-agnostic BERT sentence embedding. arXiv:2007.01852.
- Giorgi et al. (2020). DeCLUTR: Deep contrastive learning for unsupervised textual representations. arXiv:2006.03659.
- Goodfellow et al. (2014). Generative adversarial networks. arXiv:1406.2661.
- Graves and Schmidhuber (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18(5), pp. 602–610.
- He et al. (2020). Momentum contrast for unsupervised visual representation learning. arXiv:1911.05722.
- Hochreiter and Schmidhuber (1997). Long short-term memory. Neural Computation 9(8), pp. 1735–1780.
- Kingma and Welling (2014). Auto-encoding variational Bayes. arXiv:1312.6114.
- Liu et al. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.
- Musgrave et al. (2020). PyTorch Metric Learning. arXiv:2008.09164.
- Paszke et al. (2019). PyTorch: An imperative style, high-performance deep learning library. arXiv:1912.01703.
- Roy (2003). The discrete normal distribution. Communications in Statistics - Theory and Methods 32(10), pp. 1871–1883.
- van den Oord et al. (2019). Representation learning with contrastive predictive coding. arXiv:1807.03748.
- Wu et al. (2020). CLEAR: Contrastive learning for sentence representation. arXiv:2012.15466.
- Yang et al. (2020). XLNet: Generalized autoregressive pretraining for language understanding. arXiv:1906.08237.