Edit Distance Embedding using Convolutional Neural Networks

01/31/2020
by Xinyan Dai, et al.

Edit-distance-based string similarity search has many applications, such as spell correction, data de-duplication, and sequence alignment. However, computing edit distance is expensive (the standard dynamic program takes time quadratic in the string length), which makes string similarity search challenging for large datasets. In this paper, we propose a deep learning pipeline (called CNN-ED) that embeds edit distance into Euclidean distance for fast approximate similarity search. A convolutional neural network (CNN) generates fixed-length vector embeddings for a dataset of strings, and the loss function combines the triplet loss with the approximation error. To justify choosing a CNN over other architectures (e.g., an RNN), we give a theoretical analysis showing that some basic operations in our CNN model preserve edit distance. Experimental results show that CNN-ED outperforms data-independent CGK embedding and RNN-based GRU embedding by a large margin in both accuracy and efficiency. We also show that string similarity search can be significantly accelerated using CNN-based embeddings, sometimes by orders of magnitude.
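
To make the two ingredients of the abstract concrete, here is a minimal Python sketch of (1) the exact edit distance that the embedding is meant to approximate and (2) one plausible way to combine a triplet term with an approximation-error term. The abstract does not give the exact form of the loss, so everything in `combined_loss` is an assumption made for illustration: the function name, the margin being the gap between the two edit distances, the absolute-error form of the approximation term, and the weight `alpha`. The embedding vectors are passed in directly; no CNN is modelled here.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard O(len(a) * len(b)) dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def euclidean(u, v) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def combined_loss(anchor, pos, neg, d_ap, d_an, alpha=1.0) -> float:
    """Hypothetical loss for one (anchor, positive, negative) triple.

    anchor, pos, neg: embedding vectors (assumed to come from some model).
    d_ap, d_an: true edit distances anchor-positive and anchor-negative.
    alpha: assumed weight of the approximation-error term.
    """
    e_ap = euclidean(anchor, pos)
    e_an = euclidean(anchor, neg)
    # Assumption: the margin is the gap between the two true edit distances.
    margin = d_an - d_ap
    # Triplet term: the positive should be closer than the negative by the margin.
    triplet = max(0.0, e_ap - e_an + margin)
    # Approximation term: embedded distances should match edit distances.
    approx = abs(e_ap - d_ap) + abs(e_an - d_an)
    return triplet + alpha * approx
```

For example, `edit_distance("kitten", "sitting")` is 3. Under these assumptions, embeddings whose Euclidean distances equal the edit distances exactly incur zero loss, while embeddings that swap the order of the positive and the negative are penalized by both terms.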
