Convolutional Embedding for Edit Distance

01/31/2020
by Xinyan Dai, et al.

Edit-distance-based string similarity search has many applications, such as spell correction, data de-duplication, and sequence alignment. However, computing edit distance is known to have high complexity, which makes string similarity search challenging for large datasets. In this paper, we propose a deep learning pipeline (called CNN-ED) that embeds edit distance into Euclidean distance for fast approximate similarity search. A convolutional neural network (CNN) generates fixed-length vector embeddings for a dataset of strings, and the loss function combines the triplet loss with the approximation error. To justify our choice of a CNN over other structures (e.g., RNN) as the model, we conduct a theoretical analysis showing that some basic operations in our CNN model preserve edit distance. Experimental results show that CNN-ED outperforms the data-independent CGK embedding and the RNN-based GRU embedding by a large margin in both accuracy and efficiency. We also show that string similarity search can be significantly accelerated using CNN-based embeddings, sometimes by orders of magnitude.
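
As a rough illustration of the two ingredients named in the abstract, the sketch below computes the exact edit distance with the classic dynamic program and evaluates one possible "triplet loss plus approximation error" objective on embeddings that are supplied by hand. The abstract does not give the exact form of the loss, so the function `combined_loss`, its `margin` and `weight` parameters, and the use of an absolute-difference approximation term are all assumptions made for this example, not the authors' definition. No CNN is trained here; the embedding vectors are placeholders.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard O(len(a) * len(b)) dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def euclidean(u, v) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def combined_loss(emb_anchor, emb_pos, emb_neg, ed_pos, ed_neg,
                  margin=1.0, weight=1.0) -> float:
    """Hypothetical objective: triplet term plus approximation-error term.

    ed_pos and ed_neg are the true edit distances from the anchor string to
    the positive (closer) and negative (farther) strings. The exact form and
    the margin/weight values are assumptions for illustration only.
    """
    d_pos = euclidean(emb_anchor, emb_pos)
    d_neg = euclidean(emb_anchor, emb_neg)
    # Triplet term: the positive should be closer than the negative by `margin`.
    triplet = max(0.0, d_pos - d_neg + margin)
    # Approximation term: embedding distances should match true edit distances.
    approx = abs(d_pos - ed_pos) + abs(d_neg - ed_neg)
    return triplet + weight * approx


# Hand-picked placeholder embeddings whose distances are 1 and 5.
anchor, near, far = [0.0, 0.0], [1.0, 0.0], [3.0, 4.0]
loss_good = combined_loss(anchor, near, far, ed_pos=1, ed_neg=5)
loss_bad = combined_loss(anchor, far, near, ed_pos=1, ed_neg=5)
```

With these placeholder vectors, `loss_good` is zero because the embedding distances equal the stated edit distances and respect the ordering, while `loss_bad` is positive because the roles of the near and far vectors are swapped. Once strings are embedded this way, a similarity query only needs Euclidean distance computations rather than a quadratic-time dynamic program per candidate.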

research
01/31/2020

Edit Distance Embedding using Convolutional Neural Networks

Edit-distance-based string similarity search has many applications such ...
research
08/18/2022

Algorithm to derive shortest edit script using Levenshtein distance algorithm

String similarity, longest common subsequence and shortest edit scripts ...
research
04/16/2021

Neural String Edit Distance

We propose the neural string edit distance model for string-pair classif...
research
11/30/2020

Combinatorial Learning of Graph Edit Distance via Dynamic Embedding

Graph Edit Distance (GED) is a popular similarity measurement for pairwi...
research
09/10/2018

Convolutional Neural Networks for Fast Approximation of Graph Edit Distance

Graph Edit Distance (GED) computation is a core operation of many widely...
research
03/12/2022

TEN: Twin Embedding Networks for the Jigsaw Puzzle Problem with Eroded Boundaries

The jigsaw puzzle problem (JPP) is a well-known research problem, which ...
