Artificial synthesis of DNA molecules is an essential part of the study of biological mechanisms. The design of a synthetic DNA molecule usually involves many objectives. One of the important objectives is to eliminate short sequence patterns that correspond to binding sites of restriction enzymes or transcription factors. While many design tools address this problem, no adequate formal solution exists for the pattern elimination problem. In this work, we present a formal description of the elimination problem and suggest efficient algorithms that eliminate unwanted patterns and allow optimization of other objectives with minimal interference to the desired DNA functionality. Our approach is flexible, efficient, and straightforward, and therefore can be easily incorporated in existing DNA design tools, making them considerably more powerful.
Synthetic biology is an emerging domain that uses engineering principles to study biological mechanisms by examining perturbations of these mechanisms. This field has seen rapid growth in research and innovation in recent years. Many applications of synthetic biology involve artificial synthesis of DNA molecules based on some specification. An example of such an application is the pilot project announced by an initiative called the Human Genome Project-write (HGP-write) to create a virus-resistant cell by removing DNA sequences from the human genome that viruses use to hijack it and replicate. Another application is to conduct experiments that test theories, such as the experiment that confirmed that CRISPR (clustered regularly interspaced short palindromic repeats) is used by bacteria to recognize viruses and handle future attacks. This finding later led to using CRISPR to alter the DNA of human cells like an exact and easy-to-use pair of scissors. These examples demonstrate that, with the rapid progress in relevant technologies, synthetic biology is expected to help resolve many key open questions in molecular biology.
In many applications, like the ones presented above, the synthesized DNA molecule is a molecule that was artificially designed to meet some requirements. The design of protein-coding sequences usually involves meeting objectives such as optimizing codon usage, restriction site incorporation, and motif avoidance. Whereas meeting only one objective can be relatively simple, meeting multiple objectives at once is a much more complicated task, and therefore, many tools heavily rely on heuristics based on random sampling. One particularly challenging task in DNA sequence design is avoiding certain short sequence patterns that correspond to potential binding sites of proteins such as restriction enzymes or transcription factors. Cleaning the synthesized sequence from potential binding sites is essential when one wishes to control the function of that sequence in a cellular environment. Compared to other design objectives that try to optimize some properties, this problem involves a strict restriction: we must remove all unwanted patterns because even one occurrence of a binding site can affect the DNA function. This strict restriction, along with positive specification that one wishes to optimize, introduces a significant computational challenge.
In this work, we examine the problem of eliminating unwanted sequences from a given target sequence with minimal disturbances. We start by examining the simple question of cleaning a single unwanted pattern from a target DNA sequence. We show that various versions of this problem can be solved by reduction to the well-known hitting set problem. Later, we present a dynamic programming scheme that solves a more general version of this problem that, among other things, cleans multiple unwanted patterns. All of the algorithms we present in this work are linear in the size of the input. We also provide related software tools in a public repository.
2 Related works
2.1 Design tools
Modern DNA design tools aim to meet multiple design preferences and objectives, as reviewed in . Table 1 summarizes the objectives that the different tools claim to achieve. All tools consider codon usage, meaning that they attempt to choose a codon for each amino acid of the protein based on usage statistics in the organism whose cells are used in the experiment. Considering codon usage is clearly central in experiments that involve synthetic DNA. Computationally, it is relatively simple to address using the organism's codon usage distribution. Other than codon usage, tools differ in the set of objectives they claim to address. Most tools claim to address some version of pattern elimination, either through a user-defined set of patterns or by eliminating a pre-defined set of patterns (hidden stop codons, binding sites of certain restriction enzymes, etc.).
Gould and colleagues in  set out to examine how well different tools deal with the pattern elimination objective together with other competing objectives. They took a target sequence and specified two restriction sites to be removed. They also restricted the codons that can be used such that no valid sequence of codons would eliminate the restriction sites. Thus, the design requirements cannot be met in this case. The purpose of this experiment was to see how tools behave when posed with a pattern elimination objective that conflicts with another design requirement. Four tools (Gene Designer 2.0 , Jcat , Eugene , and D-Tailor ) were not tested because they do not have the option to configure this specific design objective. One tool became unresponsive (Synthetic gene designer ), possibly because there is no feasible solution. Two tools (DNAWorks  and Visual gene developer ) left the restriction sites in place. It is unclear whether these tools indicated that they could not remove the restriction sites. The remaining three tools (GeneDesign , OPTIMIZER , COOL ) removed the restriction sites using restricted codons for two amino acids.
It seems that the tools do not expect a set of constraints that cannot be met. One of the reasons for the difficulty that existing tools have in addressing complex, and possibly conflicting, constraints is likely the general technique they all use. As far as we can tell, all programs eliminate unwanted patterns by scanning the DNA sequence, and each time they encounter an unwanted pattern, they choose a random substitution (as done in [24; 7]). This strategy is simple and can be effective in many cases, but it ignores the possible complexities of the pattern elimination problem. One potential problem that this approach ignores is that removing one unwanted pattern can create a new unwanted pattern. Therefore, random sampling cannot guarantee a feasible and optimal solution and might be ineffective. This becomes more problematic as the number of patterns to eliminate grows. Another clear problem with how these tools address the pattern elimination problem is that they do not clearly specify the algorithm or heuristic protocol they use. Consider, for example, two of the tools that removed the restriction sites in the test described above. The article that published OPTIMIZER () does not mention the algorithm used at all, and the article that published GeneDesign () only mentions that it uses a random selection of codons.
2.2 Theoretical analysis of related problems
The pattern elimination problem first requires finding all pattern matches. There are two ways to address this problem. One is inspired by the Knuth-Morris-Pratt (KMP)  algorithm, and the other uses a suffix tree. The KMP algorithm finds all matches of a single pattern in a given sequence using a protocol it constructs based on the given pattern. The KMP protocol can be described using a simple finite state machine (FSM) that traces any given sequence and keeps in every state the longest prefix of the pattern that is also a suffix of the sequence traced thus far. When the FSM reaches the state corresponding to the complete pattern, this indicates that a match has been found. In , Aho and Corasick describe an efficient method for creating an FSM that is inspired by the KMP FSM and matches multiple patterns in a given sequence. The FSM they describe keeps in every state the longest prefix of one of the patterns that is also a suffix of the sequence traced thus far. Finding all pattern occurrences using this FSM is linear in the sequence length, and it does not depend on the length or the number of patterns. Building this FSM requires a pre-processing time that is linear in the sum of lengths of all patterns. Another approach for solving the pattern matching problem is using a suffix tree , which is a data structure whose nodes correspond to substrings of a given sequence and whose leaves hold indices in it. Each path in the tree from the root to a leaf corresponds to a suffix of the sequence: the leaf holds the starting position of the suffix, and the concatenation of all the nodes' substrings in the path gives the sequence of the suffix. After building the suffix tree of the sequence, all pattern matches can be found in time that is linear in the sum of lengths of all patterns by simply searching for a pattern starting at the root, as each substring is a prefix of a suffix of the sequence.
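As a concrete illustration of the Aho-Corasick construction described above, the following sketch builds the matching FSM for a set of patterns and reports all occurrences in a single scan of the text. The function names and dictionary-based encoding are ours, not taken from the paper's repository.

```python
from collections import deque

def build_aho_corasick(patterns):
    """Build the Aho-Corasick automaton.  Each state corresponds to the
    longest prefix of some pattern that is a suffix of the text read so
    far; out[s] holds the patterns whose match ends at state s."""
    goto = [{}]              # goto[state][char] -> next state
    fail = [0]               # failure (suffix) links
    out = [set()]            # patterns recognized at each state
    for p in patterns:
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(p)
    queue = deque(goto[0].values())   # depth-1 states keep fail = 0
    while queue:
        s = queue.popleft()
        for c, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(c, 0)
            out[t] |= out[fail[t]]    # inherit matches of the suffix state
    return goto, fail, out

def find_matches(text, patterns):
    """Return (start, pattern) pairs for every occurrence, scanning the
    text once; total time is linear in len(text) plus the output size."""
    goto, fail, out = build_aho_corasick(patterns)
    s, hits = 0, []
    for i, c in enumerate(text):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits
```

For instance, `find_matches("AAGAATTCAA", ["GAATTC"])` locates the EcoRI site starting at position 2, and overlapping matches are reported as well.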
There have been several studies that examine theoretical and algorithmic aspects of the pattern elimination problem. Some problems have been studied and were shown to be NP-complete. For example, in  Skiena addressed the problem of minimizing the number of restriction sites while keeping the set of given genes unchanged (codon substitution is permitted only if the resulting amino acid is the same). He suggests a dynamic programming algorithm that is exponential in the length of the longest restriction site and proves that the problem is NP-complete for non-fixed restriction site lengths. Another related problem is the Unique Restriction Site Placement Problem (URSPP) presented in . The objective in this problem is to allow only one restriction site for any given restriction enzyme, keep the translated sequence of amino acids unchanged, and minimize the maximum gap between adjacent restriction sites. They show that this problem is NP-complete and then suggest a heuristic algorithm that starts with eliminating all but one binding site for each restriction enzyme. They do not provide a detailed description of their algorithm and specifically how they avoid creating new restriction sites. Both  and  give higher priority to avoiding changes in the translated amino acid sequence over the number or placement of restriction sites.
A recent study  addressed the problem of eliminating a single unwanted pattern in the context of images (and multi-dimensional arrays). The results of  focus on the problem of deciding if a multi-dimensional array is clean of an unwanted pattern and measuring its distance from being clean. One of their results suggested a simple and efficient algorithm for eliminating a single pattern from a sequence over a binary alphabet. Our work uses the results of  in the one-dimensional case as a starting point for dealing with the pattern elimination problem. In Section 4 we extend a lemma that was proved by  (Lemma ) to establish the connection between the pattern elimination problem and the hitting set problem over the DNA alphabet.
3 Definition of objectives and notations
We consider a long target sequence S of length n over an alphabet Σ. The sequence S represents the optimal version of the synthesized sequence without considering the possible existence of unwanted patterns. If we wish to synthesize multiple sequences, we concatenate them into one long target sequence S, using a unique character to separate individual sequences. Our main objective is to clean the target sequence S from occurrences of short patterns specified in a set 𝒫. Typically, the sequences in 𝒫 are much shorter than the target sequence S.
We use a 1-based indexing scheme and denote by S[i] the i-th character in S, and by S[i..j] the substring of S that begins in index i and ends in index j. Our objective is defined by the following concepts:
Given a sequence S and a short pattern P of length k, a P-match in S is a substring of S that is identical to P: S[i..i+k-1] = P.
Given a collection of short sequence patterns 𝒫, a sequence S is said to be 𝒫-clean iff S does not contain a P-match for any P ∈ 𝒫.
Given a target sequence S and a collection of short sequences 𝒫, an eliminating set for 𝒫 in S is a set of pairs E = {(i_1, σ_1), …, (i_m, σ_m)} such that substituting S[i_j] with character σ_j for all pairs (i_j, σ_j) ∈ E results in a sequence S′ which is 𝒫-clean.
In the following sections, we describe a series of algorithms that find an optimal eliminating set under different scenarios. In Section 4, we start with the simple scenario where 𝒫 contains a single pattern P, and we wish to find the smallest eliminating set. In Section 5, we expand the optimization criterion to consider positional preferences for substitutions. In both sections, we consider elimination of a single pattern and thus equate the set 𝒫 with the single pattern P it contains. Finally, in Section 6, we expand the discussion to the multi-pattern case and to more general optimization criteria.
4 The connection between eliminating sets and hitting sets
We start by considering the simple problem of finding the smallest eliminating set for a given target sequence S and a single pattern P. Clearly, the set of positions of any eliminating set has to cover all P-matches. However, a set that covers all of the P-matches is not necessarily an eliminating set, because the substitutions may create new P-matches. Consider the following example over the binary alphabet:
There are three P-matches in this sequence, starting in positions , and . If the bit in position is flipped, then the first P-match is eliminated, but a new one is created (starting in position ). On the other hand, flipping each of the bits in positions eliminates this P-match without creating a new P-match. The second P-match can be eliminated by flipping each of the bits in positions , but flipping the bit in position also eliminates the third P-match, so it is clearly preferable. This example demonstrates that some substitutions may eliminate an existing P-match but may also create a new one. The example also demonstrates that we should aim to utilize overlaps between P-matches in order to minimize the number of substitutions. Optimal utilization of overlaps can be achieved by finding a minimal hitting set for the set of P-matches.
Let I be a set of intervals of a sequence S, and let H be a subset of positions in S. H is a hitting set of I if each interval in I contains at least one position in H.
The minimal hitting set problem is a specific instance of the more general set cover problem, which is known to be NP-hard. However, when the sets correspond to contiguous intervals of natural numbers, this problem has a simple linear-time algorithm, which we describe in Section 4.1. The following lemma provides a key observation for our analysis, establishing an important connection between hitting sets and eliminating sets.
If a position i in S belongs to a P-match, then substituting S[i] with any character can create at most one new P-match.
A version of this lemma restricted to binary sequences was proven in  (Lemma ). For completeness, we provide a detailed proof of Lemma 5 in Section 4.2. One important implication of this lemma is that for non-binary alphabets, the eliminating set problem reduces to the hitting set problem: any hitting set can be extended to an eliminating set using the same positions.
If the alphabet has more than two characters, then the elimination problem of a single pattern reduces to the hitting set problem.
Let Σ be the alphabet, where |Σ| ≥ 3, and let H be a hitting set of all P-matches in S. Consider an arbitrary position i in the hitting set H. Any substitution of S[i] with a different character eliminates all P-matches that contain index i, and Lemma 5 implies that at most one of these substitutions can create a new P-match. Therefore, there are at least |Σ| − 2 substitutions of the character S[i] that eliminate all P-matches that include i and create no new P-matches. Thus, once a set of positions that covers all matches is identified, an eliminating set can be constructed by finding, for each position in the hitting set, a substitute character that does not create a new P-match. The argument above implies that there are at least |Σ| − 2 substitute characters that guarantee this for every position in the hitting set. ∎
Claim 6 implies a simple algorithm for computing a minimal eliminating set in the non-binary alphabet case. The outline of such an algorithm is:
Step 1: Find all P-matches in the target sequence S.
Step 2: Find a minimal hitting set H of the P-match intervals.
Step 3: For each position in H, choose a substitute character that does not create a new P-match.
Step 1 is implemented either by the KMP algorithm or by a suffix tree, and is achieved in O(n + k) time (see brief review in Section 2.2). Step 2 is implemented by a simple greedy algorithm, described in Section 4.1 below, in O(n) time. Lastly, step 3 is implemented by considering an arbitrary substitute character for every position i in the hitting set and checking the interval S[i−k+1..i+k−1] for a new P-match. If no P-match is found, then this character is chosen, and if a P-match was found, then a different (arbitrary) substitute character is chosen (Claim 6 guarantees that at most one character can create a new P-match). Therefore, the time complexity of step 3 is O(n). Finally, the total time complexity of Algorithm 1 is O(n + k).
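The three steps above can be sketched as follows. This is an illustrative sketch rather than the paper's implementation: a naive scan stands in for KMP, and the function name and DNA alphabet default are our own assumptions.

```python
def eliminate_pattern(s, p, alphabet="ACGT"):
    """Sketch of the three steps for |alphabet| > 2: (1) locate all
    p-matches (a naive scan stands in for KMP), (2) greedily hit the
    match intervals, (3) substitute each hit position with a character
    that creates no new p-match (Lemma 5 guarantees one exists)."""
    k = len(p)
    s = list(s)
    # Step 1: all match starting positions, in increasing order.
    matches = [i for i in range(len(s) - k + 1)
               if "".join(s[i:i + k]) == p]
    # Step 2: greedy hitting set -- the rightmost position of each
    # match interval that is not already covered.
    hits, covered = [], -1
    for i in matches:
        if i > covered:
            hits.append(i + k - 1)
            covered = i + k - 1
    # Step 3: try substitute characters; a window of length 2k-1 around
    # the substituted position suffices to detect any new p-match.
    for j in hits:
        for c in alphabet:
            if c == s[j]:
                continue
            s[j] = c
            if p not in "".join(s[max(0, j - k + 1): j + k]):
                break      # this character creates no new p-match
    return "".join(s)
```

For example, `eliminate_pattern("AAAAA", "AAA")` contains three overlapping matches, yet a single substitution in the middle position cleans all of them.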
Note that this algorithm has at least |Σ| − 2 degrees of freedom for choosing a substitute character for each position in the hitting set. However, in the binary case we are not guaranteed that every hitting set can be used to generate a valid eliminating set. We address this issue in detail in Appendix A.
4.1 Efficient algorithm for finding a hitting set
The minimal hitting set problem we defined is a special case of the set cover problem, which is a very well known NP-complete problem (), but in the special case of interval sets it has a simple linear algorithm (see ), which we present below in Algorithm 2 for completeness. The algorithm goes through the intervals, in order, and when it encounters an interval that is not covered, it adds its right-most position to the hitting set. Assuming the intervals are already sorted, the algorithm runs in linear time and constant extra space. The correctness of the algorithm is established by Claim 7 below.
The set H returned by Algorithm 2 is a minimal hitting set of the input set of intervals I.
The algorithm removes an interval from I only if H covers it, implying that H is a hitting set for I. We are left to argue the minimality of H. We do this by proving that for an arbitrary hitting set H′ of I, we have |H′| ≥ |H|. Consider the positions in H in ascending order: h_1 < h_2 < … < h_m. We will prove by induction on i that |H′ ∩ [1, h_i]| ≥ i.
Base: Position h_1 is the rightmost position in the first-ending interval in I. Any hitting set has to cover this interval using a position no larger than h_1, therefore: |H′ ∩ [1, h_1]| ≥ 1.
Step: Assume correctness of the claim for all i′ < i and prove it for i. Let [l_i, r_i] denote the interval for which the algorithm decided to add position h_i to H (in the corresponding step of the algorithm). The algorithm decided to add position h_i because this interval was not covered by positions h_1, …, h_{i−1}, implying that l_i > h_{i−1} and r_i = h_i. H′ is a hitting set of I, so it has to cover the interval [l_i, r_i]. We get that H′ contains a position in (h_{i−1}, h_i].
Since the induction hypothesis implies that |H′ ∩ [1, h_{i−1}]| ≥ i − 1, we get that |H′ ∩ [1, h_i]| ≥ i, as required.
Applying this inductive claim to i = m, we get that any arbitrary hitting set H′ of I satisfies |H′| ≥ m = |H|. ∎
4.2 Proof of Lemma 5
Recall that Lemma 5 states that if a position i in S belongs to a P-match, then substituting S[i] with any character can create at most one new P-match. The following proof follows lines of argument similar to those in the proof of the corresponding lemma in .
Assume, by way of contradiction, that substituting S[i] creates two new P-matches. This may happen either by a single substitution or by two different substitutions. Let t denote the starting position of the original P-match, and let t_1 and t_2 denote the starting positions of the two new P-matches.
Denote by d_1 and d_2 the offsets (in P) of the substituted position i w.r.t. the newly created P-matches.
The fact that three -long substrings starting in positions and are nearly identical implies the following basic observation: for every and every offset we have and for we have . This is because of the one exact -match starting in position and the two near exact matches starting in position and . We use the series of equations in this basic observation to define the following undirected graph :
The basic observation we stated above implies that if positions and are connected in then we have . We will reach a contradiction by showing there is a path in from to either or . Denote by and the distance between the starting positions of the original -match and the two newly created -matches: . Now, distinguish between the following two cases:
Case 1: The original P-match is on the same side of the two newly created P-matches: or . Assume, w.l.o.g., that . (If , then we can reverse the sequence and the pattern and obtain the desired configuration with the reversed sequences.)
In this case, iff . We will reach a contradiction by showing a path of length in connecting positions and . Consider the following series of positions: . Notice that the first, third, and fourth positions in this walk belong to the range : . The second and the third inequalities follow from the assumption that and that both are positive. The first and fourth inequalities follow from (position belongs to the k-long substrings starting in positions ). This implies that the three steps in this walk correspond to edges in :
because (established above), and
because (established above), and
because (established above), and , and
Case 2: The original P-match is between the two newly created P-matches: . We will reach a contradiction by showing that there is a path in the graph connecting positions and , but the length of this path will depend on the specific values of and . In this case, iff . Consider a walk through positions that starts in position and proceeds according to the following protocol:
Informally, the series takes backward- steps as long as the position is greater than , and when it cannot, it takes a forward- step.
We will show that this walk reaches position , and each step in this walk from to corresponds to an undirected edge in . First, note that the walk is confined to the range . The lower bound directly follows from the definition of the backward step, and the upper bound follows from the fact that forward steps are taken from positions no larger than (otherwise a backward step is taken). Now, because the size of this range is exactly , no position in the range can be approached from more than one position. Because the walk range is finite, this implies that the walk will eventually close a cycle and return to position with a backward- step from position .
We are left to show that all steps in this walk from to correspond to edges in . By design, for every , and . Then, the steps in the path correspond to edges in if the range of the walk, , is in . Position belongs to the near exact -match starting in position , therefore it holds that . Similarly, position belongs to the near exact -match starting in position , therefore it holds that . ∎
5 Introducing position-specific restrictions
When specifying a sequence for synthesis, we will often be restricted to changing the sequence only in a given set of positions. For example, if the sequence contains a coding sequence for a given gene, then we would typically wish to avoid substitutions that change the resulting sequence of amino acids. Non-coding positions may also be restricted if they fall in regulatory sequences (promoters, enhancers, etc.). There are two different ways to specify such restrictions:
Position-specific hard restrictions: the user provides a set of indices that are not allowed to be changed. The objective is then to clean S using a minimal number of changes in the set of allowed positions.
Position-specific soft restrictions: the user specifies a penalty for a letter change in each position along the sequence. The objective here is to clean S at a minimum cost. Note that hard restrictions can be implemented in this framework by associating positions that are not allowed to be changed with a very high (practically infinite) cost. In this section, we consider cost schemes where the cost of substituting a given position does not depend on the base we substitute it with. Later, in Section 6, we consider a more general cost scheme where the cost associated with a substitution in a given position may depend on the base we substitute it with.
5.1 Position-specific hard restrictions
Given a set of positions that are not allowed to be modified, we find a minimal eliminating set by modifying step 2 of Algorithm 1 to compute a minimal hitting set among hitting sets that do not intersect the restricted set. This is achieved by modifying the selection step in Algorithm 2 to select the right-most allowed position in each uncovered interval to add to the hitting set. Note that this modification does not influence the complexity of the algorithm, so a minimal eliminating set is still computed in linear time even under hard restrictions. We now prove that this modification yields the required outcome.
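The modified greedy step can be sketched as follows; the function name and the interval encoding (inclusive `(start, end)` pairs) are ours, and the sketch also reports infeasibility when an interval consists entirely of restricted positions.

```python
def restricted_hitting_set(intervals, restricted):
    """Greedy hitting set under hard restrictions: scan intervals in
    order of their right endpoints, and for every interval that is not
    yet hit, add its rightmost position that is not restricted.
    Returns None when some interval is entirely restricted."""
    hits = []
    for start, end in sorted(intervals, key=lambda iv: iv[1]):
        if any(start <= h <= end for h in hits):
            continue                     # interval already covered
        pos = next((j for j in range(end, start - 1, -1)
                    if j not in restricted), None)
        if pos is None:
            return None                  # no allowed position exists
        hits.append(pos)
    return hits
```

For instance, with intervals (0, 5) and (3, 8) and positions 4 and 5 restricted, the single hit at position 3 covers both intervals.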
The set H returned by the modified version of Algorithm 2 is a minimal hitting set of the input set of intervals I, among hitting sets that do not intersect the set of restricted positions.
The proof is similar in spirit to the proof of Claim 7. H is a hitting set of I, because the algorithm makes sure to cover all intervals. Moreover, H does not intersect the set of restricted positions, because the positions added to H in the modified step are never restricted. We are left to argue that every other hitting set H′ that does not intersect the restricted positions is not smaller than H. Consider the positions in H in ascending order: h_1 < h_2 < … < h_m. We will prove by induction on i that |H′ ∩ [1, h_i]| ≥ i.
Base: Position h_1 is the rightmost allowed position in the first-ending interval in I. Any valid hitting set has to cover this interval using an allowed position, which cannot be larger than h_1, therefore: |H′ ∩ [1, h_1]| ≥ 1.
Step: Assume correctness of the claim for all i′ < i and prove it for i. Let [l_i, r_i] denote the interval for which the algorithm decided to add position h_i to H (in the corresponding step of the modified version above). The algorithm decided to add position h_i because this interval was not covered by positions h_1, …, h_{i−1}, implying that l_i > h_{i−1}. H′ has to cover the interval [l_i, r_i] using at least one position from [l_i, h_i], because h_i is the rightmost position in [l_i, r_i] that is allowed to be changed. We get that H′ contains a position in (h_{i−1}, h_i].
Since the induction hypothesis implies that |H′ ∩ [1, h_{i−1}]| ≥ i − 1, we get that |H′ ∩ [1, h_i]| ≥ i, as required.
Applying this inductive claim to i = m, we get that any hitting set H′ of I that does not intersect the set of restricted positions satisfies |H′| ≥ m = |H|. ∎
5.2 Position-specific soft restrictions
We implement position-specific soft restrictions by introducing a cost function on sequence positions. The cost function c(i) specifies the cost incurred by substituting position i, such that all possible substitutions of S[i] have the same cost. Our objective is to find a minimum-cost eliminating set of a pattern P. As in the case of hard restrictions, we do this by modifying step 2 of Algorithm 1 to compute a minimum-cost hitting set. This is done by applying a relatively straightforward dynamic programming algorithm that computes two 1D tables, H and C. Entry H[i] holds a minimum-cost hitting set for the set of all intervals in I that are contained in the prefix S[1..i], and entry C[i] holds its cost. The tables H and C are calculated using Algorithm 3, described below. The time complexity of the algorithm is O(nk) because for each examined position i that ends an interval we scan the k preceding indices. The extra space complexity is dominated by the dynamic programming table H, since its entries hold sets. In order to reduce the extra space used, we can save only a pointer to the last position in H[i] and use these pointers to reconstruct the final hitting set by back-tracing. Notice that this modification increases the time complexity of step 2 in Algorithm 1, but the total time complexity of Algorithm 1 remains the same, O(nk). The correctness of the algorithm is established by the following claim:
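The recurrence behind Algorithm 3 can be sketched for the single-pattern case, where all match intervals share the same length k: when an interval ends at position i, a hit placed at some j within it also covers every interval ending between j and i, so the cheapest solution is min over j of c(j) + C[j]. The function name and 0-based indexing are our own conventions.

```python
def min_cost_hitting_set(match_starts, k, c, n):
    """Algorithm 3 sketched for a single pattern of length k: C[i] holds
    the minimum cost of hitting every match interval contained in the
    prefix of length i, and H[i] holds one such set of positions
    (0-based).  c[j] is the cost of substituting position j."""
    INF = float("inf")
    is_end = [False] * n
    for m in match_starts:
        is_end[m + k - 1] = True        # interval [m, m+k-1] ends here
    C = [0.0] * (n + 1)
    H = [[] for _ in range(n + 1)]
    for i in range(1, n + 1):
        if not is_end[i - 1]:
            C[i], H[i] = C[i - 1], H[i - 1]
            continue
        # An interval ends at i-1: hit it at some j in [i-k, i-1]; that
        # hit also covers every interval ending in (j, i-1].
        best, best_j = INF, None
        for j in range(i - k, i):
            if c[j] + C[j] < best:
                best, best_j = c[j] + C[j], j
        C[i], H[i] = best, H[best_j] + [best_j]
    return C[n], H[n]
```

With uniform costs this recovers the greedy solution (one hit covering all overlapping matches), while a high cost at the shared position forces two cheaper hits instead.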
H[i] holds a minimum-cost hitting set of the set of all intervals in I that are contained in the prefix S[1..i], and C[i] holds its cost.
By induction on i.
Base: The empty prefix has an empty hitting set with cost 0.
Step: Assume correctness of the claim for all i′ < i and prove it for i. H[i] is a hitting set for the given set of intervals because the algorithm makes sure to cover all intervals contained in the prefix S[1..i]. We are left to argue the minimality of H[i], and we establish it by proving that for an arbitrary hitting set H′ of the same set of intervals we have cost(H′) ≥ C[i].
If there is no interval ending in position i, then H[i] = H[i−1], and the induction hypothesis implies that cost(H′) ≥ C[i−1] = C[i]. Otherwise, there is an interval ending in position i. Let j and j′ be the rightmost positions of H[i] and H′, respectively, that cover that interval. The induction hypothesis implies that cost(H′ ∩ [1, j′−1]) ≥ C[j′−1]. According to how index j is set by the algorithm, c(j) + C[j−1] ≤ c(j′) + C[j′−1]. By combining the inequalities above with the definitions of C and H we get: cost(H′) ≥ c(j′) + cost(H′ ∩ [1, j′−1]) ≥ c(j′) + C[j′−1] ≥ c(j) + C[j−1] = C[i]. ∎
6 Dynamic programming algorithms for a generalized elimination problem
In this section, we generalize the elimination problem in two directions. First, we allow the specification of multiple unwanted patterns, since usually there is more than one pattern to eliminate (e.g., multiple binding sites of different transcription factors and/or restriction enzymes). Second, we allow a more general cost scheme than the one considered in Section 5.2, where the cost of substituting a given position may depend on the target base. Assuming an additive cost function c(i, σ), this scheme implies a cost on any sequence S′ that has the same length (n) as the target sequence: c(S′) = Σ_i c(i, S′[i]). This generalized cost scheme allows the user to define a preference toward certain types of substitutions (e.g., transitions versus transversions), and to allow a wider range of synonymous substitutions (that do not change the encoded amino acids in a gene). Using this scheme, we redefine our objective as finding a minimum-cost sequence of length n that does not contain any unwanted pattern. Note that in this redefined objective the target sequence S is not explicitly specified, but it can be thought of as being the minimum-cost sequence of length n (with possible instances of unwanted patterns).
This objective cannot be solved by slight modifications to the previous algorithms because we can no longer separate the two decisions that we are making: the set of positions to substitute and the target bases we substitute to. For example, consider the following scenario, where we wish to eliminate pattern from the target sequence using the following cost function:
There is a P-match starting in position that should be eliminated. The minimum-cost sequence without a P-match is of cost . Note that in this case it is beneficial to substitute two positions, where one of the substitutions creates a new P-match and the other eliminates the newly created P-match. The previous approach, which restricts itself to substitutions that do not create new P-matches, would substitute only one position (for example, position ) and would result in a higher cost of . Thus, a solution to this generalized elimination problem requires an algorithm that jointly considers the substituted positions and the bases we choose to substitute to.
To solve this problem, we suggest a simple dynamic programming algorithm based on a finite state machine (FSM) that generates all (and only) sequences without unwanted patterns. Given such an FSM, Algorithm 4 below finds the minimum-cost sequence of a given length that the FSM generates. This implies that the elimination problem reduces to finding such an FSM, which is what we do in Sections 6.1 and 6.2.
An FSM that generates sequences is defined by a tuple (Σ, Q, q0, δ), where:
Σ is the alphabet of the generated sequences.
Q is the state space, which includes a single initial state q0.
δ : Q × Σ → Q is a partial transition function (i.e., not defined for all state-character pairs).
A sequence s of length ℓ is said to be generated by a given FSM if there is a path q0, q_1, …, q_ℓ through states of the FSM such that δ(q_{i−1}, s[i]) = q_i for every 1 ≤ i ≤ ℓ. Note that because the transition function is partial, not all sequences have a generating path. Furthermore, because the FSM is deterministic and has a single initial state, the generating path is unique, and we denote by last(s) the final state (q_ℓ) in that path.
We can find the minimum-cost sequence of a given length generated by the FSM by a rather straightforward calculation of a dynamic programming table A, s.t. A[i][q] holds the minimum cost of a sequence s of length i that is generated by the FSM and satisfies last(s) = q. Note that this algorithm does not involve an initial step of finding all pattern matches in the target sequence. This is because it considers all clean sequences in parallel and does not start from a specific target sequence, as the algorithms in Sections 4 and 5 did.
A[i][q] holds the minimum cost of a sequence S of length i that is generated by the FSM s.t. f(S) = q.
By induction on i:
Base case (i = 0): the only sequence of length 0 is the empty sequence ε, with f(ε) = q0, and indeed A[0][q] = 0 iff q = q0 (and A[0][q] = ∞ otherwise).
Assume correctness of the claim for i−1 and all states, and prove it for i and an arbitrary q ∈ Q.
We first prove that A[i][q] ≤ cost(S) for any sequence S of length i that is generated by the FSM s.t. f(S) = q. Let S be such a sequence and let σ be its last character, then S = S′·σ, where S′ is of length i−1, and let q′ be the state such that f(S′) = q′ and δ(q′, σ) = q. Thus, the induction hypothesis implies that A[i−1][q′] ≤ cost(S′), and using the update step definition we get that A[i][q] ≤ A[i−1][q′] + cost_i(σ) ≤ cost(S′) + cost_i(σ) = cost(S).
We are left to show that there is a sequence S of length i that is generated by the FSM s.t. f(S) = q and cost(S) = A[i][q]. Let (q′, σ) be the pair that minimizes the update step, meaning that δ(q′, σ) = q and A[i][q] = A[i−1][q′] + cost_i(σ). The induction hypothesis implies that there is a sequence S′ of length i−1 that is generated by the FSM s.t. f(S′) = q′ and cost(S′) = A[i−1][q′]. Then, S = S′·σ is of length i, is generated by the FSM with f(S) = q, and this gives us cost(S) = cost(S′) + cost_i(σ) = A[i−1][q′] + cost_i(σ) = A[i][q]. ∎
Complexity: The space complexity of storing the dynamic programming table A and the accompanying traceback table is O(n·|Q|). Adding the space complexity required for holding the transition function δ of the FSM, we get that the total extra space complexity is O(n·|Q| + |Q|·|Σ|). Note that |Q|·|Σ| is an upper bound for the number of transitions. The time complexity of the update of cell A[i][q] is linear in the size of the source set for state q: src(q) = {(q′, σ) ∈ Q × Σ : δ(q′, σ) = q}. Assuming that the source sets of all states are specified in the input given to the algorithm, the total time complexity for updating all cells in the i-th row of the table is the sum of the sizes of all source sets. The source sets of the states in Q form a disjoint partition of a subset of the Cartesian product Q × Σ, and therefore the total time complexity for updating every row of the matrix is at most |Q|·|Σ| (which is also an upper bound on the size of the FSM). In conclusion, the total time complexity is O(n·|Q|·|Σ|).
In the following two subsections, we present two FSMs that generate all (and only) sequences without unwanted patterns, and show how to compute the source sets for each of them.
6.1 A naive FSM based on the de Bruijn graph
The first FSM we suggest for this purpose is based on the de Bruijn graph. Let P be a collection of unwanted patterns and let ℓ be an upper bound on their length. The de Bruijn-inspired FSM for generating clean sequences is denoted by M_dB(P) and defined as follows: the state space Q corresponds to the set of all ℓ-long P-clean sequences, and the transition function δ(s, σ) is defined by computing the ℓ-long suffix of s·σ (adding σ to s and removing its first character). Importantly, δ(s, σ) is defined only if this ℓ-long suffix corresponds to a state in Q. Furthermore, in this FSM, we deviate from the requirement of having a single initial state by allowing every state to be an initial state, and letting the first state define the first ℓ characters of the generated sequence. Note that despite having more than one initial state, a sequence S = s1…sn that is generated by M_dB(P) has only one path through the states: qℓ, …, qn such that qℓ = s1…sℓ and qi = s(i−ℓ+1)…si for every i > ℓ. Therefore, f(S), the final state generating a given sequence S in this FSM, is well defined.
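A direct construction of this FSM can be sketched as follows (a naive enumeration over all ℓ-long strings; the function name and the DNA alphabet default are our own):

```python
from itertools import product

def de_bruijn_fsm(patterns, alphabet="ACGT"):
    """Sketch of the de Bruijn-style FSM: every l-long clean string is a
    state (and an initial state), and delta(s, a) is the l-long suffix of
    s+a, defined only when that suffix is itself a state.  Any new match
    created by appending `a` has length at most l and ends at the last
    position, so it lies inside this l-long suffix; hence the undefined
    transitions are exactly the match-creating ones."""
    l = max(len(p) for p in patterns)
    clean = lambda s: not any(p in s for p in patterns)
    states = {"".join(w) for w in product(alphabet, repeat=l)
              if clean("".join(w))}
    delta = {(s, a): (s + a)[1:] for s in states for a in alphabet
             if (s + a)[1:] in states}
    return states, delta

states, delta = de_bruijn_fsm(["AA", "CG"])
print(len(states))            # 16 two-letter strings minus AA and CG
print(("CA", "A") in delta)   # appending A to CA would create "AA"
```

Already for this toy pattern set the state space contains 14 of the 16 two-letter strings, illustrating the exponential growth in ℓ discussed next.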
M_dB(P) generates all and only sequences (of length at least ℓ) without unwanted patterns from P.
By induction on n, the length of the sequence:
Base case (n = ℓ): following the definition of M_dB(P), all (and only) ℓ-long P-clean sequences are initial states of M_dB(P), and each initial state generates itself.
Assume correctness of the claim for sequences of length n−1 and prove it for length n. Let S = s1…sn be a P-clean sequence of length n > ℓ, then S = S′·sn such that S′ is a P-clean sequence of length n−1. Using the induction hypothesis, S′ is generated by M_dB(P). Let q = f(S′). The ℓ-long suffix of S is also P-clean, therefore δ(q, sn) is defined, meaning that S is generated by M_dB(P) and it holds that f(S) = δ(q, sn).
We are left to show that M_dB(P) generates only sequences without unwanted patterns. Let S = S′·sn be a sequence of length n generated by M_dB(P). Using the induction hypothesis, S′ is of length n−1, is generated by M_dB(P), and therefore does not contain an unwanted pattern. Adding sn at the end of S′ does not introduce a P-match: any new match must end at the last position and has length at most ℓ, so it would be contained in the ℓ-long suffix of S, which corresponds to a state in Q and is therefore P-clean. ∎
The size of the state space of this FSM is very large (up to |Σ|^ℓ), and it dominates the complexity of using this FSM in the context of Algorithm 4. We therefore turn to look for a significantly smaller FSM that serves the same purpose.
6.2 A smaller KMP-based FSM
To produce a smaller FSM for this problem, we utilize the KMP-inspired automaton suggested by Aho and Corasick (see brief review in Section 2.2). Recall that this automaton finds all matches of a set of patterns by keeping track of the longest suffix of the traced sequence that is also a prefix of one of the patterns. We extend this FSM to avoid complete matches, which lets us generate all (and only) sequences without unwanted patterns.
We denote the KMP-inspired FSM for a given collection of unwanted patterns P by M_KMP(P) and define it as follows: we first define S̃ as the set of all prefixes of patterns in P (including the empty string ε). Then Q = {w ∈ S̃ : w does not have a suffix in P}. In other words, there is a state for every prefix of a pattern in P that does not end with an unwanted pattern. We designate the state corresponding to the empty string, ε, as the initial state q0. The transition function is defined as follows: if there is a suffix of w·σ that is an unwanted pattern, then δ(w, σ) is not defined. Otherwise, δ(w, σ) is the longest suffix of w·σ that is in Q.
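This definition can be turned into code directly; the following naive sketch (quadratic in the total pattern length, unlike the efficient construction of Section 6.2.1) makes the roles of S̃ and Q explicit. Names are ours.

```python
def kmp_fsm_naive(patterns, alphabet="ACGT"):
    """The KMP-style FSM built directly from its definition.  States are
    pattern prefixes with no suffix in `patterns`; delta(w, a) is the
    longest suffix of w+a that is a state, and is undefined (absent from
    the dict) when w+a has a suffix in `patterns`."""
    prefixes = {p[:i] for p in patterns for i in range(len(p) + 1)}
    bad = lambda w: any(w.endswith(p) for p in patterns)
    states = {w for w in prefixes if not bad(w)}
    delta = {}
    for w in states:
        for a in alphabet:
            wa = w + a
            if bad(wa):
                continue  # invalid transition: wa ends with a pattern
            # the longest suffix of wa that is a state ("" always is)
            delta[(w, a)] = next(wa[i:] for i in range(len(wa) + 1)
                                 if wa[i:] in states)
    return states, delta

states, delta = kmp_fsm_naive(["AA", "CG"])
print(sorted(states))        # ['', 'A', 'C']
print(("A", "A") in delta)   # False: it would complete the pattern AA
```

Note that the state space here has only three states, against the fourteen of the de Bruijn-style FSM for the same pattern set.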
M_KMP(P) generates all and only sequences without unwanted patterns from P.
We first prove that any P-clean sequence S can be generated by the FSM, by induction on the length of S. For length 0, the only P-clean sequence is ε, which is generated by M_KMP(P) and f(ε) = q0. For longer S, there is a sequence S′ and a character σ such that S = S′·σ. The induction hypothesis implies that S′ is generated by M_KMP(P). Let q = f(S′), then the sequence q·σ is P-clean because it is a suffix of S (as q is a suffix of S′), implying that δ(q, σ) is defined and f(S) is equal to δ(q, σ). Thus, S is generated using the path that generates S′ appended by the state δ(q, σ).
For the opposite direction, we need to strengthen the induction hypothesis and show that every generated sequence S is P-clean and that the state f(S) corresponds to the longest word in Q that is also a suffix of S. For length 0, the only generated sequence is ε, which is P-clean, and f(ε) = ε, which is the longest word in Q that is also a suffix of ε. For longer S, there is a sequence S′ and a character σ such that S = S′·σ. The induction hypothesis implies that S′ is P-clean and f(S′) corresponds to the longest word in Q that is also a suffix of S′. The definition of the transition function implies that f(S) = δ(f(S′), σ) is the longest suffix of f(S′)·σ that is in Q. Because f(S′)·σ is a suffix of S, then so is f(S). We are left to prove that any longer suffix u of S is not in Q. If u is shorter than f(S′)·σ, then it is not in Q, because of the way the transition function is defined (δ(f(S′), σ) is the longest suffix of f(S′)·σ in Q). If, on the other hand, u is longer than f(S′)·σ, then u = u′·σ, and u′ is a suffix of S′. The induction hypothesis implies that f(S′) is the longest suffix of S′ that is in Q, and u′ is longer than f(S′), so it cannot be in Q; hence u cannot be in Q either (if it were, u′ would be a pattern prefix that is a suffix of the P-clean S′, and thus in Q). In conclusion, f(S) is the longest word in Q that is a suffix of S, and since δ(f(S′), σ) is defined, S does not have a suffix that is a P-match. Since its prefix S′ is P-clean, S itself is also P-clean. ∎
The size of the state space of this FSM is at most ‖P‖ + 1, where ‖P‖ denotes the total length of the patterns in P, which is significantly smaller than the size of the state space of the naive FSM described in Section 6.1 (up to |Σ|^ℓ). Thus, by using M_KMP(P), Algorithm 4 finds a minimum-cost P-clean sequence of length n in time O(n·‖P‖·|Σ|), which is linear in the length of the designed sequence. However, this requires an additional preprocessing step for computing M_KMP(P). We describe the calculation of M_KMP(P) in Section 6.2.1 below and show that the preprocessing time and space complexity is O(‖P‖·|Σ|).
6.2.1 An efficient algorithm for computing M_KMP(P)
In this section, we describe an efficient procedure for calculating the FSM M_KMP(P) (Aho and Corasick describe an efficient procedure for calculating a similar FSM, which, unlike M_KMP(P), does not forbid pattern matches). Throughout the discussion below, we assume that the empty word ε is not an unwanted pattern in P. If ε ∈ P, then Q is empty by definition and there is no sequence that does not contain unwanted patterns. To compute this FSM, we need to:
Compute its state space Q.
Compute the (partial) transition function δ(w, σ) for every w ∈ Q and σ ∈ Σ. Recall that if w·σ has a suffix in P, then δ(w, σ) is not defined. Otherwise, δ(w, σ) is the longest suffix of w·σ that is in Q.
Computing Q and δ requires scanning words in S̃ for suffixes in P. Thus, a naive implementation would take at least quadratic time. In order to achieve this in linear time, we employ a technique originally suggested by Knuth, Morris, and Pratt for the construction of the KMP automaton for matching a single pattern. Our algorithm extends this technique to multiple patterns and also uses it to identify invalid transitions (which was not needed in the original pattern matching problem). The technique makes use of the auxiliary function fail(·) defined below:
Given a collection of unwanted patterns P and a nonempty word w ∈ Q, we define fail(w) as the longest proper suffix of w that is in Q. A proper suffix in this context is any suffix that is not equal to the entire word w.
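For illustration, the definition can be evaluated naively (our own helper, not the efficient recursion developed next):

```python
def fail_naive(w, patterns):
    """The auxiliary function by definition: the longest proper suffix of
    the nonempty word w that is a state of the KMP-style FSM, i.e. a
    pattern prefix with no suffix in `patterns`.  Undefined for w == ""."""
    prefixes = {p[:i] for p in patterns for i in range(len(p) + 1)}
    is_state = lambda u: u in prefixes and not any(u.endswith(p)
                                                   for p in patterns)
    # proper suffixes of w, longest first; "" is always a state
    return next(w[i:] for i in range(1, len(w) + 1) if is_state(w[i:]))

print(fail_naive("AA", ["AAC"]))  # 'A'
```

For the single pattern AAC, fail("AA") = "A": the second A can still begin a fresh occurrence of the pattern, which is exactly the information the KMP construction reuses.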
The relationship between the auxiliary function fail and the transition function δ is established by the following claim:
Claim 15: Consider a nonempty w ∈ Q and σ ∈ Σ s.t. w·σ does not have a suffix in P. The following two relationships hold:
(1) If w·σ ∉ S̃, then δ(w, σ) = δ(fail(w), σ).
(2) If w·σ ∈ S̃, then δ(w, σ) = w·σ and fail(w·σ) = δ(fail(w), σ).
First, note that under the conditions of the claim, the transitions δ(w, σ) and δ(fail(w), σ) are defined (w·σ and fail(w)·σ do not have a suffix in P). If w·σ ∉ S̃, then δ(w, σ) ≠ w·σ, implying that δ(w, σ) is a proper suffix of w·σ. Hence, both δ(w, σ) and δ(fail(w), σ) are equal to the longest proper suffix of w·σ that is in Q, establishing (1) above.
To prove (2), we need to show that δ(fail(w), σ) is the longest proper suffix of w·σ that is in Q (when w·σ ∈ S̃, the definition of δ directly gives δ(w, σ) = w·σ). The definition of δ implies that δ(fail(w), σ) is in Q. Furthermore, δ(fail(w), σ) is a proper suffix of w·σ because it is a suffix of fail(w)·σ and fail(w) is a proper suffix of w. We are left to show that for any proper suffix u of w·σ that is in Q it holds that |u| ≤ |δ(fail(w), σ)|. If u = ε, then this holds trivially. Otherwise, u = u′·σ, where u′ is a proper suffix of w. Since u is in Q, then so is u′. So, the definition of fail implies that u′ is also a suffix of fail(w), which implies in turn that u is a suffix of fail(w)·σ. Finally, since δ(fail(w), σ) is the longest suffix of fail(w)·σ that is in Q, we get |u| ≤ |δ(fail(w), σ)|, as required. ∎
The two equations in Claim 15 imply a recursive procedure for jointly computing the functions δ and fail. The validity of the recursion is guaranteed by the fact that fail(w) is strictly shorter than w. The recursion halts either when w·σ ∈ S̃ (and then δ(w, σ) = w·σ), or when w = ε (and then δ(ε, σ) = ε, since the only proper suffix of σ is ε). A similar recursive procedure can also be used to compute the state space Q by applying the following claim:
Claim 16: For a nonempty w ∈ Q and σ ∈ Σ, w·σ has a suffix in P iff w·σ ∈ P or fail(w)·σ has a suffix in P.
If w·σ ∈ P, then clearly w·σ has a suffix in P. Furthermore, since fail(w) is a suffix of w, then fail(w)·σ is a suffix of w·σ, and so if fail(w)·σ has a suffix in P, then so does w·σ. This establishes one direction of the claim. To establish the other direction, we consider w·σ ∉ P s.t. w·σ has a suffix p ∈ P, and we show that p is also a suffix of fail(w)·σ. We know that p ≠ w·σ (because w·σ ∉ P), and that p ≠ ε (because ε ∉ P). So, p is a proper suffix of w·σ of the form u·σ, where u is a proper suffix of w. Since u ∈ Q (it is a prefix of the pattern p and, as a suffix of w ∈ Q, has no suffix in P) and fail(w) is the longest proper suffix of w in Q, then |u| ≤ |fail(w)|. This implies that u is a suffix of fail(w), because they are both suffixes of w, and so p = u·σ is a suffix of fail(w)·σ that belongs to P. ∎
Algorithm 5, described below, implements the two recursive procedures for computing δ and fail using forward recursion (establishing the base cases first). The algorithm keeps track of undefined transitions (when w·σ has a suffix in P) by setting their values to NULL. The first phase of the algorithm (lines 1–6) computes all the transitions associated with elongations of pattern prefixes (where w·σ ∈ S̃), and identifies elongations that result in complete patterns (where w·σ ∈ P) as invalid transitions. Note that some prefix elongations may later be identified as invalid transitions, when w·σ has a proper suffix in P.
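Under the notation above, the whole construction can be sketched as follows. This is our own rendering of the logic of Algorithm 5 (the function name and the None encoding of invalid transitions are ours), processing pattern prefixes shortest-first so that the values for shorter words are ready when longer words are handled:

```python
def build_kmp_fsm(patterns, alphabet="ACGT"):
    """Jointly compute the transition function delta and the auxiliary
    function fail over the pattern prefixes.  Invalid transitions (those
    that would complete an unwanted pattern) are stored as None; a word
    is a state iff it has an entry in `fail`."""
    pattern_set = set(patterns)
    prefixes = {p[:i] for p in patterns for i in range(len(p) + 1)}
    delta, fail = {}, {"": ""}       # the entry for "" marks it a state
    # base cases: transitions out of the empty word
    for a in alphabet:
        if a in pattern_set:
            delta[("", a)] = None    # the single character a is a pattern
        elif a in prefixes:
            delta[("", a)] = a       # elongation of the empty prefix
            fail[a] = ""
        else:
            delta[("", a)] = ""      # fall back to the empty state
    # process longer prefixes in order of increasing length
    for w in sorted(prefixes - {""}, key=len):
        if w not in fail:            # w ends with a pattern: not a state
            continue
        for a in alphabet:
            wa = w + a
            # wa has a suffix among the patterns iff wa is itself a
            # pattern or the fail(w)+a transition is already invalid
            if wa in pattern_set or delta[(fail[w], a)] is None:
                delta[(w, a)] = None
            elif wa in prefixes:     # elongation: wa becomes a new state
                delta[(w, a)] = wa
                fail[wa] = delta[(fail[w], a)]
            else:                    # follow the failure link
                delta[(w, a)] = delta[(fail[w], a)]
    return delta, fail

delta, fail = build_kmp_fsm(["AA", "CG"])
print(delta[("A", "C")], repr(fail["A"]))
```

The invalid-transition propagation matters for overlapping patterns: for P = {TAA, AC}, the elongation of the state TA by C is invalid even though TAC is not itself a pattern, because its suffix AC is.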