Merging Sorted Lists of Similar Strings
Merging T sorted, non-redundant lists containing M elements into a single sorted, non-redundant result of size N ≥ M/T is a classic problem typically solved practically in O(M log T) time with a priority-queue data structure the most basic of which is the simple *heap*. We revisit this problem in the situation where the list elements are *strings* and the lists contain many *identical or nearly identical elements*. By keeping simple auxiliary information with each heap node, we devise an O(M log T+S) worst-case method that performs no more character comparisons than the sum of the lengths of all the strings S, and another O(M log (T/ e̅)+S) method that becomes progressively more efficient as a function of the fraction of equal elements e̅ = M/N between input lists, reaching linear time when the lists are all identical. The methods perform favorably in practice versus an alternate formulation based on a trie.
READ FULL TEXT