Replace or Retrieve Keywords In Documents at Scale

10/31/2017 ∙ by Vikash Singh, et al. ∙ Belong Technologies India Pvt. Ltd.

In this paper we introduce the FlashText algorithm for replacing or finding keywords in a given text. FlashText can search or replace keywords in one pass over a document. The time complexity of this algorithm does not depend on the number of terms being searched or replaced. For a document of size N (characters) and a dictionary of M keywords, the time complexity is O(N). This algorithm is much faster than Regex, because Regex time complexity is O(M x N). It is also different from the Aho-Corasick algorithm, as it doesn't match substrings. FlashText is designed to match only complete words (words with boundary characters on both sides). For an input dictionary of {Apple}, this algorithm won't match it to 'I like Pineapple'. This algorithm is also designed to go for the longest match first. For an input dictionary {Machine, Learning, Machine learning} and the string 'I like Machine learning', it will only consider the longest match, which is Machine learning. We have made a Python implementation of this algorithm available as open source on GitHub, released under the permissive MIT License.


1 Introduction

In the field of Information Retrieval, keyword search and replace is a standard problem. Often we want to either find specific keywords in text, or replace keywords with standardized names.

For example:

  1. Keyword Search: Say we have a resume (D) of a software engineer, and we have a list of 20K programming skills, corpus = {java, python, javascript, machine learning, …}. We want to find which of the 20K terms are present in the resume (D).

  2. Keyword Replace: Another common use case is when we have a corpus of synonyms (different spellings meaning the same thing) like corpus = { javascript: [‘javascript’, ‘javascripting’, ‘java script’], …} and we want to replace the synonyms with standardized names in the candidate resume.

To solve such problems, regular expressions (Regex) are most commonly used. Regex works well when the number of terms is in the hundreds, but it is considerably slower for thousands of terms. When the number of terms is in the tens of thousands and the number of documents is in the millions, the run time reaches a few days. As shown in Figure 1, Regex takes almost 0.165 seconds to find 15K keywords in one document of 10K terms, whereas FlashText takes 0.002 seconds. Thus FlashText is about 82x faster than Regex for 15K terms.

As the number of terms increases, the time taken by Regex grows almost linearly, whereas the time taken by FlashText is almost constant. In this paper we discuss the Regex-based approach for both keyword search and replace and compare it with FlashText. We also go through the FlashText algorithm in detail and share the code that we used to benchmark FlashText against Regex (which was used to generate the figures in this paper).

1.1 Regex for keyword search

Regex as a tool is very versatile and useful for pattern matching. We can search for patterns like '\d{4}' (which will match any 4-digit number), or keywords like '2017' in a text document. Sample Python code (Code 1) to search for '2017' or any 4-digit number in a given string:

import re
compiled_regex = re.compile(r'\b2017\b|\b\d{4}\b')
compiled_regex.findall('In 2017 2311 is my birthday.')
# ['2017', '2311']
Code 1: Sample python code to search 2017 or any 4 digit number in a given string using Regex.
Figure 1: Comparing Time taken (y-axis) by Regex and FlashText to find number of terms (x-axis).

Here '\b' denotes a word boundary, and is used so that 23114 won't return 2311 as a match. A word boundary in Regex ('\b') matches the position between a word character and a non-word character such as a space, a period, or a newline: {' ', '.', '\n'}.
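
As a small illustration of the boundary behaviour described above, the sketch below runs the same 4-digit pattern with and without '\b'. The sample string and the variable names are assumptions chosen for this example only.

```python
import re

text = "23114 and 2311"

# Without word boundaries, the first four digits of 23114 are matched too.
without_boundary = re.findall(r"\d{4}", text)

# With word boundaries, only the standalone 4-digit number is matched.
with_boundary = re.findall(r"\b\d{4}\b", text)

print(without_boundary)  # ['2311', '2311']
print(with_boundary)     # ['2311']
```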

1.2 Regex for keyword replacement

We can also use Regex to replace the matched term with a standardised term. Sample Python code (Code 2) to replace java script with javascript:

import re
re.sub(r"\bjava script\b", 'javascript', 'java script is awesome.')
# 'javascript is awesome.'
Code 2: Sample python code to replace java script with javascript using Regex.
Figure 2: Comparing time taken (y-axis) by Regex and FlashText to replace number of terms (x-axis).
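
When many synonyms have to be standardised with Regex, one common approach is to run one substitution per variant, which is also why the cost grows with the number of terms. The sketch below assumes a small made-up synonym table; the `synonyms` dictionary and the `standardise` helper are illustrative names and are not taken from the paper.

```python
import re

# Hypothetical synonym table: standardized name -> list of variants.
synonyms = {
    "javascript": ["java script", "javascripting"],
    "python": ["python3", "py thon"],
}

def standardise(text):
    # One re.sub call per variant, so the work grows with the number of terms.
    for clean_name, variants in synonyms.items():
        for variant in variants:
            pattern = r"\b" + re.escape(variant) + r"\b"
            text = re.sub(pattern, clean_name, text)
    return text

result = standardise("I use java script and python3 daily.")
print(result)  # I use javascript and python daily.
```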

2 FlashText

FlashText is an algorithm based on the trie dictionary data structure and inspired by the Aho-Corasick algorithm. It first takes all relevant keywords as input and builds a trie dictionary from them (as shown in Figure 3).

Figure 3: Trie dictionary with 2 keywords, j2ee and java both mapped to standardised term java.

Start and eot are both special symbols that define word boundary, as defined in Regex. This trie dictionary is used for searching keywords in string as well as replacing keywords in string.

2.1 Search with FlashText

For an input string (document), we iterate over it character by character. When a sequence of characters in the document, \bword\b, matches a path in the trie dictionary from start through word to eot (start and eot both stand for a word boundary), we consider it a complete match. We add the standardized term corresponding to the matched term to a list of keywords found.

Figure 4: For input string matched character sequence is shown in Green and unmatched in Red.
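
The search idea can be sketched compactly as follows. This is a simplified illustration, not the paper's listing: it assumes word characters are ASCII letters, digits and underscore, always lowercases the input, and uses the hypothetical helper names `build_trie` and `find_keywords`.

```python
import string

WORD_CHARS = set(string.ascii_letters + string.digits + "_")
EOT = "_keyword_"  # end-of-term key that stores the standardized name

def build_trie(mapping):
    # mapping: keyword -> standardized name
    root = {}
    for keyword, clean_name in mapping.items():
        node = root
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[EOT] = clean_name
    return root

def find_keywords(text, root):
    text = text.lower()
    found = []
    i, n = 0, len(text)
    while i < n:
        # Only start matching at the first character of a word.
        if text[i] not in WORD_CHARS or (i > 0 and text[i - 1] in WORD_CHARS):
            i += 1
            continue
        node, j, best = root, i, None
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            # A match counts only if it ends at a word boundary;
            # later (longer) matches overwrite earlier ones.
            if EOT in node and (j == n or text[j] not in WORD_CHARS):
                best = (node[EOT], j)
        if best:
            found.append(best[0])
            i = best[1]
        else:
            i += 1
    return found

trie = build_trie({
    "machine": "Machine",
    "learning": "Learning",
    "machine learning": "Machine learning",
    "apple": "Apple",
})
print(find_keywords("I like Machine learning", trie))  # ['Machine learning']
print(find_keywords("I like Pineapple", trie))         # []
```

The first call returns only the longest match, and the second returns nothing because 'apple' inside 'Pineapple' does not start at a word boundary.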

2.2 Replace with FlashText

For an input string (document), we iterate over it character by character. We create an empty return string and when a sequence of characters in the document \bword\b doesn’t match in the trie dictionary, we copy the original word as it is into the return string. When we do have a match, we add the standardised term instead. Thus the return string is a copy of input string, with only matched terms replaced.

Figure 5: For input string matched character sequence is replaced with standardised name.
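
A compact sketch of the replacement idea is given below. It is a simplified illustration under stated assumptions rather than the paper's listing: word characters are taken to be ASCII letters, digits and underscore, matching is always case-insensitive, the input is assumed to be ASCII so that lowercasing preserves string length, and `build_trie` and `replace_keywords` are hypothetical helper names.

```python
import string

WORD_CHARS = set(string.ascii_letters + string.digits + "_")
EOT = "_keyword_"  # end-of-term key that stores the standardized name

def build_trie(mapping):
    # mapping: keyword -> standardized name
    root = {}
    for keyword, clean_name in mapping.items():
        node = root
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[EOT] = clean_name
    return root

def replace_keywords(text, root):
    lowered = text.lower()  # assumes ASCII input, so indices line up with text
    out = []
    i, n = 0, len(text)
    while i < n:
        at_word_start = lowered[i] in WORD_CHARS and (
            i == 0 or lowered[i - 1] not in WORD_CHARS
        )
        best = None
        if at_word_start:
            node, j = root, i
            while j < n and lowered[j] in node:
                node = node[lowered[j]]
                j += 1
                # Keep the longest keyword that ends at a word boundary.
                if EOT in node and (j == n or lowered[j] not in WORD_CHARS):
                    best = (node[EOT], j)
        if best:
            out.append(best[0])   # emit the standardized name
            i = best[1]           # and skip past the matched keyword
        else:
            out.append(text[i])   # copy the original character unchanged
            i += 1
    return "".join(out)

trie = build_trie({"java script": "javascript", "j2ee": "java"})
result = replace_keywords("Java Script and J2EE are listed.", trie)
print(result)  # javascript and java are listed.
```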

2.3 FlashText algorithm

The FlashText algorithm has 3 major parts. We will go over each part separately.

  1. Building the trie dictionary

  2. Searching keywords

  3. Replacing keywords

2.3.1 Building the trie dictionary

To build the trie dictionary, we start with the root node, which points to an empty dictionary (an associative array: https://en.wikipedia.org/wiki/Associative_array). This node is used as the start point for all words. We insert a word by adding its first character as a key in the root dictionary, pointing to an empty dictionary. The next character of the word goes in as a key in that dictionary, and again points to an empty dictionary. This process is repeated until we reach the last character of the word. If a character is already present in the dictionary, we move to its child dictionary and on to the next character of the word. When we reach the end of the word, we insert a special key, _keyword_, to signify end of term (eot), and the standardized name is stored against this key.

Input

Keyword $w = c_1 c_2 c_3 \dots c_n$, where each $c_i$ is an input character and $w$ is the input keyword. Standardized name $s$ for keyword $w$.

Method
import string

class FlashText(object):
    def __init__(self, case_sensitive=False):
        self._keyword = '_keyword_'  # end of term (eot) and key to store standardized name
        self._white_space_chars = set(['.', '\t', '\n', '\a', ' ', ','])
        # characters that are part of a word; any other character is a word boundary
        # (this attribute is used by extract_keywords and replace_keywords)
        self.non_word_boundaries = set(string.digits + string.ascii_letters + '_')
        self.keyword_trie_dict = dict()
        self.case_sensitive = case_sensitive

    def add_keyword(self, keyword, clean_name=None):
        if not clean_name and keyword:
            clean_name = keyword
        if keyword and clean_name:
            # if both keyword and clean_name are not empty.
            if not self.case_sensitive:
                # if not case_sensitive then lowercase the keyword
                keyword = keyword.lower()
            current_dict = self.keyword_trie_dict
            for letter in keyword:
                current_dict = current_dict.setdefault(letter, {})
            current_dict[self._keyword] = clean_name
Code 3: Python code for FlashText Initialization and adding keywords to dictionary.
Output

A dictionary will be created which will look like Figure 3.
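
For concreteness, the sketch below builds the trie for the two keywords of Figure 3 with the same `setdefault` pattern and prints the resulting nested dictionary. The standalone `add_keyword` function and the `EOT` constant are illustrative stand-ins for the class method and its `_keyword` attribute.

```python
EOT = "_keyword_"  # end-of-term key, as in the class above

def add_keyword(trie, keyword, clean_name):
    # Walk (and create as needed) one nested dictionary per character.
    node = trie
    for letter in keyword:
        node = node.setdefault(letter, {})
    # Store the standardized name at the end of the keyword.
    node[EOT] = clean_name

trie = {}
add_keyword(trie, "j2ee", "java")
add_keyword(trie, "java", "java")

print(trie)
# {'j': {'2': {'e': {'e': {'_keyword_': 'java'}}}, 'a': {'v': {'a': {'_keyword_': 'java'}}}}}
```

Both keywords share the root entry for 'j' and then branch, which is what lets a single pass over the input test all keywords at once.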

2.3.2 Searching for keywords

Once all keywords are added to the trie dictionary, we can find keywords present in an input string.

Input

String $x = a_1 a_2 a_3 \dots a_n$, where each $a_i$ is an input character and $x$ is the input string.

Method
def extract_keywords(self, sentence):
    keywords_extracted = []
    if not self.case_sensitive:
        # if not case_sensitive then lowercase the sentence
        sentence = sentence.lower()
    current_dict = self.keyword_trie_dict
    sequence_end_pos = 0
    idx = 0
    sentence_len = len(sentence)
    while idx < sentence_len:
        char = sentence[idx]
        # when we reach a character that might denote word end
        if char not in self.non_word_boundaries:
            # if eot is present in current_dict
            if self._keyword in current_dict or char in current_dict:
                # update longest sequence found
                sequence_found = None
                longest_sequence_found = None
                is_longer_seq_found = False
                if self._keyword in current_dict:
                    sequence_found = current_dict[self._keyword]
                    longest_sequence_found = current_dict[self._keyword]
                    sequence_end_pos = idx
                # re look for longest_sequence from this position
                if char in current_dict:
                    current_dict_continued = current_dict[char]
                    idy = idx + 1
                    while idy < sentence_len:
                        inner_char = sentence[idy]
                        if inner_char not in self.non_word_boundaries and \
                            self._keyword in current_dict_continued:
                            # update longest sequence found
                            longest_sequence_found = current_dict_continued[self._keyword]
                            sequence_end_pos = idy
                            is_longer_seq_found = True
                        if inner_char in current_dict_continued:
                            current_dict_continued = current_dict_continued[inner_char]
                        else:
                            break
                        idy += 1
                    else:
                        # end of sentence reached.
                        if self._keyword in current_dict_continued:
                            # update longest sequence found
                            longest_sequence_found = current_dict_continued[self._keyword]
                            sequence_end_pos = idy
                            is_longer_seq_found = True
                    if is_longer_seq_found:
                        idx = sequence_end_pos
                current_dict = self.keyword_trie_dict
                if longest_sequence_found:
                    keywords_extracted.append(longest_sequence_found)
            else:
                # we reset current_dict
                current_dict = self.keyword_trie_dict
        elif char in current_dict:
            # char is present in current dictionary position
            current_dict = current_dict[char]
        else:
            # we reset current_dict
            current_dict = self.keyword_trie_dict
            # skip to end of word
            idy = idx + 1
            while idy < sentence_len:
                char = sentence[idy]
                if char not in self.non_word_boundaries:
                    break
                idy += 1
            idx = idy
        # if we are end of sentence and have a sequence discovered
        if idx + 1 >= sentence_len:
            if self._keyword in current_dict:
                sequence_found = current_dict[self._keyword]
                keywords_extracted.append(sequence_found)
        idx += 1
    return keywords_extracted
Code 4: Python code to get keywords in input string which are present in dictionary.
Output

A list of standardized names found in the string x, as shown in Figure 4.

2.3.3 Replacing keywords

We can use the same trie dictionary to replace keywords present in an input string with standardized names.

Input

String $x = a_1 a_2 a_3 \dots a_n$, where each $a_i$ is an input character and $x$ is the input string.

Method
def replace_keywords(self, sentence):
    new_sentence = ''
    orig_sentence = sentence
    if not self.case_sensitive:
        sentence = sentence.lower()
    current_word = ''
    current_dict = self.keyword_trie_dict
    current_white_space = ''
    sequence_end_pos = 0
    idx = 0
    sentence_len = len(sentence)
    while idx < sentence_len:
        char = sentence[idx]
        current_word += orig_sentence[idx]
        # when we reach whitespace
        if char not in self.non_word_boundaries:
            current_white_space = char
            # if end is present in current_dict
            if self._keyword in current_dict or char in current_dict:
                # update longest sequence found
                sequence_found = None
                longest_sequence_found = None
                is_longer_seq_found = False
                if self._keyword in current_dict:
                    sequence_found = current_dict[self._keyword]
                    longest_sequence_found = current_dict[self._keyword]
                    sequence_end_pos = idx
                # re look for longest_sequence from this position
                if char in current_dict:
                    current_dict_continued = current_dict[char]
                    current_word_continued = current_word
                    idy = idx + 1
                    while idy < sentence_len:
                        inner_char = sentence[idy]
                        current_word_continued += orig_sentence[idy]
                        if inner_char not in self.non_word_boundaries and \
                            self._keyword in current_dict_continued:
                            # update longest sequence found
                            current_white_space = inner_char
                            longest_sequence_found = current_dict_continued[self._keyword]
                            sequence_end_pos = idy
                            is_longer_seq_found = True
                        if inner_char in current_dict_continued:
                            current_dict_continued = current_dict_continued[inner_char]
                        else:
                            break
                        idy += 1
                    else:
                        # end of sentence reached.
                        if self._keyword in current_dict_continued:
                            # update longest sequence found
                            current_white_space = ''
                            longest_sequence_found = current_dict_continued[self._keyword]
                            sequence_end_pos = idy
                            is_longer_seq_found = True
                    if is_longer_seq_found:
                        idx = sequence_end_pos
                        current_word = current_word_continued
                current_dict = self.keyword_trie_dict
                if longest_sequence_found:
                    new_sentence += longest_sequence_found + current_white_space
                    current_word = ''
                    current_white_space = ''
                else:
                    new_sentence += current_word
                    current_word = ''
                    current_white_space = ''
            else:
                # we reset current_dict
                current_dict = self.keyword_trie_dict
                new_sentence += current_word
                current_word = ''
                current_white_space = ''
        elif char in current_dict:
            # we can continue from this char
            current_dict = current_dict[char]
        else:
            # we reset current_dict
            current_dict = self.keyword_trie_dict
            # skip to end of word
            idy = idx + 1
            while idy < sentence_len:
                char = sentence[idy]
                current_word += orig_sentence[idy]
                if char not in self.non_word_boundaries:
                    break
                idy += 1
            idx = idy
            new_sentence += current_word
            current_word = ''
            current_white_space = ''
        # if we are end of sentence and have a sequence discovered
        if idx + 1 >= sentence_len:
            if self._keyword in current_dict:
                sequence_found = current_dict[self._keyword]
                new_sentence += sequence_found
            else:
                # flush a trailing partial match that is not a complete keyword
                new_sentence += current_word
        idx += 1
    return new_sentence
Code 5: Python code for replacing keywords with standardized names  from dictionary in input string.
Output

A new string with replaced standardized names found in the string x, as shown in Figure 5.

3 Benchmarking FlashText And Regex

As shown in Figures 1 and 2, FlashText is much faster than Regex. We now describe the benchmarks used to compare FlashText and Regex.

3.1 Searching keywords

Python code is used to benchmark the keyword search feature. First we generate a corpus of 100K random words of randomly chosen lengths. Then we choose 1K terms from the list of words and join them to create a document.

We choose k terms from the corpus, where k ∈ {0, 1000, 2000, …, 20000}. We search for this list of keywords in the document using both Regex and FlashText and time them.

from flashtext import KeywordProcessor  # the published flashtext package exposes the algorithm as KeywordProcessor
import random
import string
import re
import time

def get_word_of_length(str_length):
    # generate a random word of given length
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(str_length))

# generate a list of 100K words of randomly chosen size
all_words = [get_word_of_length(random.choice([3, 4, 5, 6, 7, 8])) for i in range(100000)]
print('Count  | FlashText | Regex    ')
print('------------------------------')
for keywords_length in [0, 1000, 5000, 10000, 15000]:
    # choose 1000 terms and create a string to search in.
    all_words_chosen = random.sample(all_words, 1000)
    story = ' '.join(all_words_chosen)
    # get unique keywords from the list of words generated.
    unique_keywords_sublist = list(set(random.sample(all_words, keywords_length)))
    # compile Regex
    compiled_re = re.compile('|'.join([r'\b' + keyword + r'\b' for keyword in unique_keywords_sublist]))
    # add keywords to FlashText
    keyword_processor = KeywordProcessor()
    keyword_processor.add_keywords_from_list(unique_keywords_sublist)
    # time the modules
    start = time.time()
    _ = keyword_processor.extract_keywords(story)
    mid = time.time()
    _ = compiled_re.findall(story)
    end = time.time()
    # print output
    print(str(keywords_length).ljust(6), '|',
          "{0:.5f}".format(mid - start).ljust(9), '|',
          "{0:.5f}".format(end - mid).ljust(9), '|',)
# output: Data for Figure 1
Code 6: Python code to benchmark FlashText and Regex keyword search.

The keyword search benchmark code is available as a GitHub Gist: https://gist.github.com/vi3k6i5/604eefd92866d081cfa19f862224e4a0
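
Single wall-clock measurements of this kind can be noisy. One hedged way to steady them is to repeat each call and keep the fastest run. The `best_of` helper and the toy workload below are hypothetical additions for illustration and are not part of the benchmark gist.

```python
import re
import time

def best_of(func, repeats=5):
    # Run func several times and return the fastest wall-clock time in seconds.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Toy workload: a small alternation pattern searched in a repeated sentence.
keywords = ["java", "python", "golang"]
pattern = re.compile("|".join(r"\b" + re.escape(k) + r"\b" for k in keywords))
document = "python and java are popular " * 1000

elapsed = best_of(lambda: pattern.findall(document))
print("{0:.5f}".format(elapsed))
```

Taking the minimum rather than the mean reduces the influence of unrelated system activity on short measurements.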

3.2 Replace keywords

The following code benchmarks the keyword replace feature.

from flashtext import KeywordProcessor  # the published flashtext package exposes the algorithm as KeywordProcessor
import random
import string
import re
import time

def get_word_of_length(str_length):
    # generate a random word of given length
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(str_length))

# generate a list of 100K words of randomly chosen size
all_words = [get_word_of_length(random.choice([3, 4, 5, 6, 7, 8])) for i in range(100000)]
print('Count  | FlashText | Regex    ')
print('------------------------------')
for keywords_length in [0, 1000, 5000, 10000, 15000]:
    # choose 1000 terms and create a string to search in.
    all_words_chosen = random.sample(all_words, 1000)
    story = ' '.join(all_words_chosen)
    # get unique keywords from the list of words generated.
    unique_keywords_sublist = list(set(random.sample(all_words, keywords_length)))
    # add keywords to FlashText
    keyword_processor = KeywordProcessor()
    for keyword in unique_keywords_sublist:
        keyword_processor.add_keyword(keyword, '_keyword_')
    # time the modules
    start = time.time()
    _ = keyword_processor.replace_keywords(story)
    mid = time.time()
    for keyword in unique_keywords_sublist:
        story = re.sub(r'\b' + keyword + r'\b', '_keyword_', story)
    end = time.time()
    # print output
    print(str(keywords_length).ljust(6), '|',
          "{0:.5f}".format(mid - start).ljust(9), '|',
          "{0:.5f}".format(end - mid).ljust(9), '|',)
# output: Data for Figure 2
Code 7: Python code to benchmark FlashText and Regex keyword replace.

The keyword replace benchmark code is available as a GitHub Gist: https://gist.github.com/vi3k6i5/dc3335ee46ab9f650b19885e8ade6c7a

4 Conclusion

As we saw, FlashText is fast and well suited for keyword search and replace. It is much faster than Regex when the keywords are complete words. The complexity of the algorithm is linear in the length of the searched text. It is especially useful when the number of keywords is large, since all keywords can be matched simultaneously in one pass over the input string.