
Replace or Retrieve Keywords In Documents at Scale

10/31/2017
by Vikash Singh, et al.
Belong Technologies India Pvt. Ltd.

In this paper we introduce the FlashText algorithm for finding or replacing keywords in a given text. FlashText can search for or replace keywords in a single pass over a document. The time complexity of the algorithm does not depend on the number of terms being searched or replaced: for a document of N characters and a dictionary of M keywords, the time complexity is O(N). This is much faster than regex, whose time complexity is O(M × N). It also differs from the Aho-Corasick algorithm in that it does not match substrings. FlashText is designed to match only complete words (words with boundary characters on both sides). For an input dictionary {Apple}, the algorithm will not match 'Apple' in 'I like Pineapple'. The algorithm is also designed to try the longest match first. For an input dictionary {Machine, Learning, Machine learning} and the string 'I like Machine learning', it will only consider the longest match, which is 'Machine learning'. We have made a Python implementation of this algorithm available as open source on GitHub, released under the permissive MIT License.
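
The two behaviours described in the abstract, whole-word matching and longest-match-first, can be illustrated with a small character trie. The following is a minimal sketch written for this summary, not the authors' implementation: the names `build_trie`, `extract_keywords`, `WORD_CHARS` and `_END` are hypothetical, and the choice of ASCII letters, digits and underscore as word characters is an assumption.

```python
import string

# Assumption for this sketch: these characters form words;
# every other character counts as a word boundary.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")
_END = "_end_"  # hypothetical marker key: "a keyword ends at this node"


def build_trie(keywords):
    """Build a character trie from the keywords (lower-cased for matching)."""
    root = {}
    for kw in keywords:
        node = root
        for ch in kw.lower():
            node = node.setdefault(ch, {})
        node[_END] = kw  # remember the keyword as originally written
    return root


def extract_keywords(text, trie):
    """Return keywords found as whole words, preferring the longest match."""
    found = []
    i, n = 0, len(text)
    while i < n:
        # A match may only start right after a word boundary.
        if i > 0 and text[i - 1] in WORD_CHARS:
            i += 1
            continue
        node, j, best, best_end = trie, i, None, i
        while j < n and text[j].lower() in node:
            node = node[text[j].lower()]
            j += 1
            # Accept a keyword only if it also ends at a word boundary;
            # later (longer) matches overwrite earlier ones.
            if _END in node and (j == n or text[j] not in WORD_CHARS):
                best, best_end = node[_END], j
        if best is not None:
            found.append(best)
            i = best_end  # resume after the matched keyword
        else:
            i += 1
    return found


trie = build_trie(["Apple", "Machine", "Learning", "Machine learning"])
print(extract_keywords("I like Pineapple", trie))         # []
print(extract_keywords("I like Machine learning", trie))  # ['Machine learning']
```

In this sketch the work done at each starting position is bounded by the length of the longest keyword rather than by the number of keywords M, which is the property the abstract emphasises. Replacement could follow the same walk, emitting a substitute string instead of recording the match.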



Code Repositories

flashtext

Extract Keywords from sentence or Replace keywords in sentences.



flashtextgo

Implementation of the flashtext (https://arxiv.org/abs/1711.00046) algorithm in Go.



phpflashtext

Extract Keywords from sentence or Replace keywords in sentences. @ https://github.com/vi3k6i5/flashtext

