Efficient Parallel Output-Sensitive Edit Distance

06/30/2023
by Xiangyun Ding, et al.

Given two strings A[1..n] and B[1..m] and a set of allowed edit operations, the edit distance between A and B is the minimum number of operations required to transform A into B. Sequentially, the standard dynamic programming (DP) algorithm computes edit distance at Θ(nm) cost. In many real-world applications, however, the strings being compared are similar and their edit distance is small. To obtain highly practical implementations, we focus on output-sensitive parallel edit-distance algorithms, i.e., algorithms whose cost bounds are asymptotically better than the standard Θ(nm) bound when the edit distance is small. We study four algorithms in this paper: three based on Breadth-First Search (BFS) and one based on Divide-and-Conquer (DaC). Our BFS-based solutions build on the Landau-Vishkin algorithm. We implement three different data structures for the longest common prefix (LCP) queries that the algorithm needs: the classic solution using a parallel suffix array, and two hash-based solutions proposed in this paper. Our DaC-based solution is inspired by the output-insensitive solution of Apostolico et al., which we adapt in a non-trivial way to make it output-sensitive. All our algorithms have good theoretical guarantees, and they achieve different tradeoffs between work (total number of operations), span (longest dependence chain in the computation), and space. We test and compare our algorithms on both synthetic and real-world data. Our BFS-based algorithms outperform the existing parallel edit-distance implementation in ParlayLib in all test cases. Comparing our algorithms against each other also clarifies which algorithm to choose for which input pattern. We believe that this paper is the first systematic study of both the theory and the practice of parallel edit distance.
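
To make the output-sensitive idea concrete, here is a minimal sequential sketch of a Landau-Vishkin-style computation in Python. It is not the paper's parallel implementation: the function names `lcp_naive` and `edit_distance_lv` are hypothetical, the edit operations are assumed to be unit-cost insertion, deletion, and substitution, and the LCP query is answered by a plain character scan rather than by a suffix array or a hash-based structure.

```python
def lcp_naive(a, b, i, j):
    """Length of the longest common prefix of a[i:] and b[j:] (plain scan)."""
    k = 0
    while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
        k += 1
    return k


def edit_distance_lv(a, b, lcp=lcp_naive):
    """Edit distance via a Landau-Vishkin-style search over diagonals.

    Diagonal d holds the cells (i, j) with j - i == d. For each number of
    edits k, cur[d] is the furthest row i reachable on diagonal d using at
    most k edits. Only diagonals -k..k are touched in round k, so the number
    of LCP queries grows with the answer rather than with len(a) * len(b).
    """
    n, m = len(a), len(b)
    target = m - n  # diagonal containing the final cell (n, m)

    # Round 0: slide along the main diagonal for free.
    cur = {0: lcp(a, b, 0, 0)}
    k = 0
    while cur.get(target, -1) < n:
        k += 1
        prev, cur = cur, {}
        for d in range(-k, k + 1):
            best = -1
            if d in prev:          # substitution: advance i and j
                best = max(best, prev[d] + 1)
            if d - 1 in prev:      # insertion: advance j only
                best = max(best, prev[d - 1])
            if d + 1 in prev:      # deletion: advance i only
                best = max(best, prev[d + 1] + 1)
            if best < 0:
                continue
            # Stay inside the grid: i <= n and j = i + d <= m.
            best = min(best, n, m - d)
            if best < 0 or best + d < 0:
                continue
            # Extend along the diagonal through matching characters.
            cur[d] = best + lcp(a, b, best, best + d)
    return k


print(edit_distance_lv("kitten", "sitting"))
```

The `lcp` parameter marks the place where a faster query structure, such as the suffix-array or hash-based ones mentioned in the abstract, could be substituted; with the naive scan used here, only the number of rounds is output-sensitive, not the cost of each query.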

research
02/08/2023

Weighted Edit Distance Computation: Strings, Trees and Dyck

Given two strings of length n over alphabet Σ, and an upper bound k on t...
research
12/06/2013

Towards Normalizing the Edit Distance Using a Genetic Algorithms Based Scheme

The normalized edit distance is one of the distances derived from the ed...
research
08/21/2022

A Work-Efficient Parallel Algorithm for Longest Increasing Subsequence

This paper studies parallel algorithms for the longest increasing subseq...
research
10/20/2018

MinJoin: Efficient Edit Similarity Joins via Local Hash Minimums

In this paper we study edit similarity joins, in which we are given a se...
research
06/07/2022

Locality-sensitive bucketing functions for the edit distance

Many bioinformatics applications involve bucketing a set of sequences wh...
research
05/07/2013

Parallel Chen-Han (PCH) Algorithm for Discrete Geodesics

In many graphics applications, the computation of exact geodesic distanc...
research
11/25/2019

Faster Privacy-Preserving Computation of Edit Distance with Moves

We consider an efficient two-party protocol for securely computing the s...
