Replace or Retrieve Keywords In Documents at Scale

10/31/2017
by Vikash Singh, et al.

In this paper we introduce the FlashText algorithm for replacing or finding keywords in a given text. FlashText can search for or replace keywords in a single pass over a document. The time complexity of the algorithm does not depend on the number of terms being searched or replaced: for a document of N characters and a dictionary of M keywords, the time complexity is O(N). This makes it much faster than regex, whose time complexity is O(M × N). It also differs from the Aho-Corasick algorithm in that it does not match substrings. FlashText is designed to match only complete words (words with boundary characters on both sides). For example, given the input dictionary {Apple}, the algorithm will not match it against 'I like Pineapple'. The algorithm is also designed to prefer the longest match. Given the input dictionary {Machine, Learning, Machine learning} and the string 'I like Machine learning', it will consider only the longest match, which is 'Machine learning'. We have made a Python implementation of this algorithm available as open source on GitHub, released under the permissive MIT License.
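
The two matching rules described above (complete words only, longest match first) can be illustrated with a small character trie. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `build_trie`, `extract_keywords`, `WORD_CHARS` and `_END` are hypothetical, word characters are assumed to be ASCII letters, digits and underscore, and matching is assumed to be case-insensitive. Its running time does not grow with the number of keywords, but unlike the O(N) bound claimed in the abstract, this simplified version may rescan some characters after a failed partial match on a multi-word keyword.

```python
import string

# Assumption for this sketch: these characters belong to a word;
# every other character counts as a word boundary.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")
_END = "_end_"  # hypothetical sentinel key marking the end of a keyword


def build_trie(keywords):
    """Build a character trie from the keywords (lower-cased for matching)."""
    root = {}
    for keyword in keywords:
        node = root
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[_END] = keyword  # store the original spelling at the terminal node
    return root


def extract_keywords(text, trie):
    """Return whole-word keyword matches, preferring the longest match."""
    found = []
    i, n = 0, len(text)
    while i < n:
        # A match may only start at the first character of a word.
        if text[i] not in WORD_CHARS or (i > 0 and text[i - 1] in WORD_CHARS):
            i += 1
            continue
        node, j = trie, i
        best, best_end = None, i
        # Walk the trie as far as the text allows.
        while j < n and text[j].lower() in node:
            node = node[text[j].lower()]
            j += 1
            # Record a candidate only if it ends at a word boundary;
            # later (longer) candidates overwrite earlier ones.
            if _END in node and (j == n or text[j] not in WORD_CHARS):
                best, best_end = node[_END], j
        if best is not None:
            found.append(best)
            i = best_end
        else:
            # No keyword starts here: skip the rest of the current word.
            while i < n and text[i] in WORD_CHARS:
                i += 1
    return found
```

With this sketch, `extract_keywords("I like Pineapple", build_trie(["Apple"]))` returns an empty list, and searching 'I like Machine learning' with the dictionary Machine, Learning, Machine learning returns only the longest match.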


research
01/29/2019

Structuring an unordered text document

Segmenting an unordered text document into different sections is a very ...
research
05/01/2019

Semi-automatic System for Title Construction

In this paper, we propose a semi-automatic system for title construction...
research
04/20/2021

Robustness Tests of NLP Machine Learning Models: Search and Semantically Replace

This paper proposes a strategy to assess the robustness of different mac...
research
06/30/2022

Computing the Parameterized Burrows–Wheeler Transform Online

Parameterized strings are a generalization of strings in that their char...
research
01/26/2019

A Linear-complexity Multi-biometric Forensic Document Analysis System, by Fusing the Stylome and Signature Modalities

Forensic Document Analysis (FDA) addresses the problem of finding the au...
research
06/01/2002

Neural Net Model for Featured Word Extraction

Search engines perform the task of retrieving information related to the...
research
03/22/2015

Construction of FuzzyFind Dictionary using Golay Coding Transformation for Searching Applications

Searching through a large volume of data is very critical for companies,...
