Engineering faster double-array Aho-Corasick automata

07/28/2022
by Shunsuke Kanda, et al.

Multiple pattern matching in strings is a fundamental problem in text processing applications such as regular expressions or tokenization. This paper studies efficient implementations of double-array Aho–Corasick automata (DAACs), data structures that perform multiple pattern matching quickly. The practical performance of a DAAC depends on careful design of the data structure, and many implementation techniques have been proposed thus far. However, these techniques have not been collected in one place. Because comprehensive descriptions and experimental analyses are unavailable, engineers face difficulties in implementing an efficient DAAC. In this paper, we review implementation techniques for DAACs and provide a comprehensive description of them. We also propose several new techniques for further improvement. We conduct exhaustive experiments on real-world datasets and reveal the best combination of techniques for achieving higher performance in DAACs. This combination differs from those used in the most popular DAAC libraries, which demonstrates that their performance can be further enhanced. On the basis of our experimental analysis, we developed a new Rust library for fast multiple pattern matching using DAACs, named Daachorse, and released it as open-source software at <https://github.com/daac-tools/daachorse>. Experiments demonstrate that Daachorse outperforms other AC-automaton implementations, indicating its suitability as a fast alternative for multiple pattern matching in many applications.
