In graph theory, the objective in the k-centre problem is to find a set of $k$ vertices for which the largest distance between any vertex of the graph and its closest vertex in this set is minimised. The numerous applications of the problem in various areas of computer science lead to different definitions of connectivity and distance between vertices, depending on the application at hand. The k-centre problem on graphs is known to be NP-hard. The best performance ratio for a polynomial-time approximation is 2 unless P = NP, and the problem is unlikely to be fixed-parameter tractable (FPT) with respect to the most natural parameter $k$, the number of centres [Algorithmica2020].
A different form of the k-centre problem appears in stringology, where it has been linked with important applications in computational biology, for example finding approximate gene clusters for a set of words over the DNA alphabet [JACM2002]. This problem is also NP-hard; there are fixed-parameter algorithms for it, as well as heuristic algorithms without any performance guarantee. The closest string problem aims to find a new string within a distance $d$ of each input string such that $d$ is minimised. The natural generalisation of the closest string problem is that of finding $k$ centre strings of a given length minimising the distance from every string to its closest centre [SODA99, CPM2004]. This problem has mainly been studied for the most popular distance measure, the Hamming distance. The major application of this distance is in coding theory, but it has also been used intensively in many biological applications that aim to discover a region of similarity or to design probes or primers [IC2003].
In this paper we define and study a new variant of the k-centre problem on a new class of objects, necklaces, using a different distance function, the overlap coefficient, to define the closeness of strings or necklaces. Necklaces are classical structures in combinatorics, where a necklace is an equivalence class of strings of a fixed length over a finite alphabet under the cyclic shift operation. Research on necklaces in combinatorics has mainly focused on the characterisation of these objects, their efficient generation and comparison, their ordering, and the design of efficient ranking and unranking procedures [Kopparty2016, Sawada2017]. In this paper, we link several problems and interconnect research ideas on these fascinating objects, and design approximation algorithms for the k-centre problem on necklaces, which has applications in combinatorial crystal structure prediction [SOFSEM2020, CSP2020].
In particular, we are motivated to study the k-centre problem for a class of necklaces by the Extended Module Materials Assembly (EMMA) method used for in silico prediction of novel materials [Collins17, Dyer13]. The idea of this method is to consider materials assembled from well-chosen layers, where the arrangements with low enough energy constitute potentially stable materials. Since the space of all potential stackings is typically too large for exhaustive search, one looks for a diverse and representative sample of this space to be used with further optimisation strategies. To approach this problem, we model layers as letters and materials (periodic crystals) as necklaces, due to their invariance under the cyclic shift operation. The set of necklaces during the optimisation is often further constrained: a fixed chemical composition corresponds to necklaces with a fixed Parikh vector, and constraints on the relative position of layers lead to necklaces with forbidden subwords. These considerations lead us to the general problem of sampling from four languages: the set of all words of a fixed length; the set of all words up to a maximum length; the set of all words with a given Parikh vector; and the set of all words up to a maximum length that do not contain words from a finite set as factors. Apart from the Hamming distance, there are several well-known methods for comparing words, each with its own advantages and disadvantages [cohen2003comparison, piskorski2007comparison, recchia2013comparison]. To define the closeness between different necklaces we use one such method, based on computing the overlap coefficient between each pair of necklaces. In particular, in the context of materials science, two patterns of layers may have closer properties if they have more fragments in common.
In general, the k-centre problem on necklaces can be seen as the k-centre problem on the complete weighted graph where every word is represented by a node and each edge has a weight given by the overlap distance between the two words. As in the graph case, the goal is to choose $k$ words such that the maximum distance between any word in the language and its nearest centre is minimised. However, in the case of the k-centre problem for languages, the associated graph can be exponential in the size of the input.
The main result of this paper is the design of approximation algorithms for several finite languages of necklaces , and with a logarithmic approximation factor in the context of and and a log-linear factor for . The first algorithm is based on building a prefix tree of necklaces, utilising previously designed ranking and unranking procedures. We then extend the approximation algorithm to a language of necklaces with forbidden words by designing new procedures for ranking and unranking necklaces under forbidden word and Parikh map constraints, where both limitations are motivated by natural materials science constraints. Finally, we propose a different technique based on building centres for via de Bruijn sequences, which can find a solution in linear time with a constant approximation factor.
A formal language consists of words whose letters are taken from an alphabet and are formed according to a specific set of rules. We focus on cyclic languages, that is, languages consisting of cyclic words only. A cyclic word, also known as a necklace, is the equivalence class of words under the cyclic shift operation. A cyclic shift of size $s$ moves the suffix of length $s$ to the front of the word, while maintaining the relative order within the suffix. More formally, the cyclic shift of length $s$ on the word $w = w_1 w_2 \ldots w_n$ will transform it to $w_{n-s+1} \ldots w_n w_1 \ldots w_{n-s}$. Any word that is a member of this equivalence class is a representation of the cyclic word. In general, cyclic words are represented by the lexicographically smallest word in this equivalence class, known as the canonical representation. Lyndon words form the set of aperiodic necklaces, that is, necklaces whose canonical form differs from each of its non-trivial cyclic shifts.
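The definitions above can be illustrated with a minimal Python sketch. The function names (`cyclic_shift`, `canonical`, `is_necklace`, `is_lyndon`) are hypothetical and chosen only for this illustration, and the canonical form is found by brute force over all rotations rather than by any of the efficient methods from the literature.

```python
def cyclic_shift(word: str, s: int) -> str:
    """Move the suffix of length s to the front, keeping its internal order."""
    if not word:
        return word
    s %= len(word)
    return word[len(word) - s:] + word[:len(word) - s]


def canonical(word: str) -> str:
    """Lexicographically smallest rotation (brute force over all shifts)."""
    return min(cyclic_shift(word, s) for s in range(len(word)))


def is_necklace(word: str) -> bool:
    """True if word is the canonical representation of its cyclic word."""
    return word == canonical(word)


def is_lyndon(word: str) -> bool:
    """True if word is strictly smaller than every non-trivial rotation,
    i.e. it is a canonical representation and it is aperiodic."""
    return all(word < cyclic_shift(word, s) for s in range(1, len(word)))


if __name__ == "__main__":
    print(cyclic_shift("abcd", 1))                 # suffix "d" moves to the front
    print(canonical("baab"))                       # smallest rotation of "baab"
    print(is_necklace("abab"), is_lyndon("abab"))  # periodic necklace
```

Comparing against every rotation takes quadratic time per word, which is adequate for small examples only.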
The class of fixed length cyclic languages consists of all cyclic words from an alphabet , made of any combination of characters of with length . This is equivalent to the set of necklaces of length over the alphabet . This language will be denoted . We focus on two restricted cases of this language.
The first is the fixed length cyclic language with forbidden subwords. This is a fixed length cyclic language where any word does not contain subwords from a set of forbidden words. Given the set of forbidden words , the language of all words of length without these will be denoted . Formally a word will be in if there is no representation of it of the form for any word .
The second language is the fixed content cyclic language. Here, in any word the number of occurrences of each letter of is fixed. The number of occurrences of each character will be given as a vector , where denotes the number of occurrences of the character. This language will be denoted .
A generalisation of and is to maximum length languages. These contain every word of length less than or equal to in the corresponding fixed length language. For a given , the maximum length generalisation of will be denoted and will be denoted . Formally, this can be written as and . As a cyclic word of length can be seen as a word of infinite length with a period of at most , when comparing two cyclic words it makes sense to look at two representatives of these words with the same length, which will be called the same length representatives of the words.
The Overlap Coefficient. The overlap coefficient of the sets $A$ and $B$ is defined as the size of the intersection of the two sets, normalised by the size of the smaller set, i.e. $\frac{|A \cap B|}{\min(|A|, |B|)}$. The overlap coefficient measures the closeness of the two sets $A$ and $B$, where a value of 1 means that the two sets are identical, and a value of 0 means there are no shared elements.
The overlap coefficient for two cyclic words is defined as the overlap coefficient between the multisets of all subwords of the same length representatives of the two words. Given a same length representative $w$ of length $n$, the multiset of subwords of length $\ell$ contains the subword starting at every position from 1 to $n$, each labelled by the number of occurrences of this subword up to this point. Note that as this word is cyclic, subwords of length $\ell$ may begin in the last $\ell - 1$ positions of the word, giving a total of $n$ subwords of length $\ell$ for any $\ell \leq n$. For example, given the word , the multiset of subwords of length 2 is . The multiset of all subwords is simply the union of the multisets of the subwords for every length from 1 to $n$, having a total size of $n^2$. An example of this is given explicitly between and in Figure 1.
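As an illustration, the following Python sketch computes the overlap coefficient of two cyclic words by brute force. Two points are assumptions rather than part of the text: the same length representatives are obtained by repeating each word up to the least common multiple of the two lengths, and the labelled multiset is modelled with `collections.Counter`, whose `&` operator keeps the minimum multiplicity of each subword. The function names are hypothetical, and the words are assumed to be non-empty.

```python
from collections import Counter
from fractions import Fraction
from math import gcd


def same_length_representatives(u: str, v: str):
    """Assumption: repeat both words up to the least common multiple of their lengths."""
    n = len(u) * len(v) // gcd(len(u), len(v))
    return u * (n // len(u)), v * (n // len(v))


def subword_multiset(word: str) -> Counter:
    """All cyclic subwords of every length from 1 to n, with multiplicity."""
    n = len(word)
    doubled = word + word  # lets subwords wrap around the end of the word
    counts = Counter()
    for length in range(1, n + 1):
        for start in range(n):
            counts[doubled[start:start + length]] += 1
    return counts


def overlap_coefficient(u: str, v: str) -> Fraction:
    a, b = same_length_representatives(u, v)
    ma, mb = subword_multiset(a), subword_multiset(b)
    shared = sum((ma & mb).values())  # multiset intersection
    return Fraction(shared, min(sum(ma.values()), sum(mb.values())))


if __name__ == "__main__":
    print(sum(subword_multiset("aabb").values()))  # n^2 subwords for n = 4
    print(overlap_coefficient("aab", "abb"))
    print(overlap_coefficient("aab", "aba"))       # rotations of the same cyclic word
```

Because every subword is read cyclically, the result does not depend on which representation of a cyclic word is passed in.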
To use this as a distance, the measure will be inverted so that a value of 1 implies the strings share no similarity and a value of 0 implies they represent the same word. From this, the overlap distance between two cyclic words is given by one minus their overlap coefficient.
The k-Centre Problem on Necklaces. With this distance, the k-centre problem for languages can be defined. This can be thought of as the k-centre problem on the complete weighted graph where every word is represented by a node and each edge has a weight given by the overlap distance between the two words. As in the graph case, the goal here is to choose $k$ words such that the maximum distance between any word in the language and its nearest centre is minimised. However, in the case of the k-centre problem for languages, the size of the associated graph may be exponential in the description of the language, i.e., the input size.
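To make the graph view concrete, the sketch below enumerates all necklaces of a small fixed length and solves the k-centre problem exactly by exhaustive search. This is an illustration only, not one of the algorithms of this paper: its running time is exponential, which is exactly the obstacle noted above. All function names are hypothetical, and all words are assumed to have the same length $n$, so each subword multiset has size $n^2$.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations, product


def subword_multiset(word):
    """All cyclic subwords of every length from 1 to n, with multiplicity."""
    n, doubled = len(word), word + word
    return Counter(doubled[i:i + length]
                   for length in range(1, n + 1) for i in range(n))


def overlap_distance(u, v):
    """One minus the overlap coefficient; assumes len(u) == len(v)."""
    shared = sum((subword_multiset(u) & subword_multiset(v)).values())
    return 1 - Fraction(shared, len(u) ** 2)


def necklaces(alphabet, n):
    """Canonical representations of all necklaces of length n (brute force)."""
    words = ("".join(t) for t in product(alphabet, repeat=n))
    return sorted({min(w[i:] + w[:i] for i in range(n)) for w in words})


def objective(language, centres):
    """Largest distance from any word to its nearest centre."""
    return max(min(overlap_distance(w, c) for c in centres) for w in language)


def brute_force_k_centre(language, k):
    """Exact solution by trying every k-subset of the language."""
    return min(combinations(language, k), key=lambda s: objective(language, s))


if __name__ == "__main__":
    language = necklaces("ab", 3)
    best = brute_force_k_centre(language, 2)
    print(language, best, objective(language, best))
```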
[Figure 1: table listing, for each of the two words, the word with its representative.]
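k-Centre problem for languages: Given a finite language $L$ and an integer $k$, select $k$ words from $L$ forming a sample $S$ of $k$ centres, minimising the maximum overlap distance between every word in $L$ and the nearest member of $S$:
$$\min_{S \subseteq L,\ |S| = k}\ \max_{w \in L}\ \min_{s \in S}\ D(w, s),$$
where $D(w, s)$ denotes the overlap distance between $w$ and $s$.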
Our goal is to maximise the length $\ell$ of the longest subword such that every word in the language shares a subword of length $\ell$ with at least one member of the sample. The idea behind this approach is that if a word shares a subword of length $\ell$ with the sample, it will also share 2 subwords of length $\ell - 1$, 3 of length $\ell - 2$, and so on, for a total of $\frac{\ell(\ell+1)}{2}$ common subwords. Therefore, as the length of the shared subword increases, the size of the intersection in the overlap coefficient grows quadratically. For words of length $n$, this bounds the maximum distance between every word in the language and the centres by $1 - \frac{\ell(\ell+1)}{2n^2}$. So our algorithms will be focused on maximising $\ell$ for each of the languages under consideration.
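For the language of necklaces of length $n$, given $\ell$ as the longest length such that every word in the language shares a subword of length $\ell$ with at least one centre, the distance between every word and its nearest centre is at most $1 - \frac{\ell(\ell+1)}{2n^2}$.
Let us assume that every word shares at least one subword of length $\ell$ with at least one centre. Thus, the intersection will contain all subwords of this shared word, giving a total of $\frac{\ell(\ell+1)}{2}$ shared subwords. This gives an intersection with the closest member of the sample set of size at least $\frac{\ell(\ell+1)}{2}$. As the size of the multiset of subwords (in cyclic words of length $n$) will be $n^2$, the total distance will be at most $1 - \frac{\ell(\ell+1)}{2n^2}$. ∎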
For the language and a sample of -centres the optimal overlap distance between every word in and the nearest member of is no less than .
Let be the longest subword such that every word in shares a subword of length with at least one of the centres. To get an upper bound on the size of the , observe that every word in is a necklace, therefore every word must have at least one subword of length that is a prefix of a necklace. Therefore the longest value for is the largest value such that there are fewer than necklace prefixes. A simple lower bound on the number of necklace prefixes of length for an alphabet of size is . This can be rewritten as an inequality in terms of and as . Observing that gives , giving as a bound on , .
Assume that the furthest word share two subwords of length with the nearest centre. For this to be the case, every centre must contain every subword of length , with at least one occurring twice, requiring the string to be of length , which is clearly greater than under the assumption that . Using this as an upper bound, the intersection may be of size no more than , giving a distance of . Using the upper bound on gives a lower bound on the distance of . ∎
Given an algorithm that can approximate the solution to Problem 1 for within a factor , the same sample will be an approximation of of the optimal solution to Problem 1 for .
Let be the length of the longest subword such that every word in shares a subword of length with at least one centre in the sample. Observe that for every length , every word in will occur as a subword of at least words in . As the samples must include every subword of length as a subword, any word of length will also share a common subword of length with each centre, which when converted to the same length representatives of length will lead to an overlap coefficient of . For words of length , observe that the same length representative with a word of length will also be of length at least , ensuring that it shares a common subword of length at least . As in the case , the overlap coefficient between some word with length will be
3 Sampling via Prefix Trees
In this section we will look at a generic framework for sampling necklaces under various constraints. This will give a logarithmic approximation factor relative to the number of samples in the general case. In Section 3.1 we will present the algorithm and show how it performs on the languages and . In Section 3.2 we will extend the ranking function for necklaces to the language .
3.1 The general algorithm
The underlying idea behind the first algorithm is of building samples based on covering different possible prefixes of necklaces. The reasoning behind this is twofold: first the set of prefixes for necklaces is much more limited than it is for unconstrained words, and secondly by covering all prefixes up to a given length , all words are guaranteed to share a subword of length with some sample.
In order to do this effectively, it will be important to compute how many necklaces have a given prefix.
This may be done by ranking the necklace.
The rank of a necklace is the number of necklaces for which the canonical representation is lexicographically smaller than it.
The first algorithm to rank necklaces was given by Kopparty et al. [Kopparty2016] without a tight bound, followed by Sawada and Williams [Sawada2017], who provided an $O(n^2)$ time algorithm.
Sawada and Williams show how this may be used to find the number of necklaces with a given prefix, by computing the difference between the ranks of smallest and largest necklace with the given prefix, both of which may be done in quadratic time.
This ranking function has been further extended to the fixed content case by Hartman and Sawada [Hartman2019].
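The relationship between ranking and counting necklaces with a given prefix can be sketched in Python as follows. This sketch replaces the quadratic-time ranking algorithms cited above with brute-force enumeration, so it only illustrates the counting identity on tiny instances. The names `rank` and `count_with_prefix` are hypothetical, and the explicit correction for the largest word is an assumption that follows from using a strict rank.

```python
from itertools import product


def canonical(word):
    """Lexicographically smallest rotation of a non-empty word."""
    return min(word[i:] + word[:i] for i in range(len(word)))


def all_necklaces(alphabet, n):
    """Sorted canonical representations of all necklaces of length n."""
    return sorted({canonical("".join(t)) for t in product(alphabet, repeat=n)})


def rank(word, alphabet):
    """Number of necklaces of length len(word) whose canonical
    representation is strictly smaller than word (brute force)."""
    return sum(1 for c in all_necklaces(alphabet, len(word)) if c < word)


def count_with_prefix(prefix, n, alphabet):
    """Necklaces of length n whose canonical representation starts with prefix."""
    pad = n - len(prefix)
    lo = prefix + min(alphabet) * pad  # smallest word starting with prefix
    hi = prefix + max(alphabet) * pad  # largest word starting with prefix
    count = rank(hi, alphabet) - rank(lo, alphabet)
    if canonical(hi) == hi:            # strict rank excludes hi itself
        count += 1
    return count


if __name__ == "__main__":
    for p in ("aa", "ab", "ba", "b"):
        print(p, count_with_prefix(p, 4, "ab"))
```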
The k-centres selection based on a tree of necklace prefixes: The algorithm recursively builds the tree of possible necklace prefixes, starting with the empty string, in a breadth first manner, continuing until there are such prefixes. Once these prefixes have been generated, the centres can be built as necklaces containing these prefixes.
This is achieved as follows. At each step there is the set of prefixes of a length such that the number of prefixes is less than . Observe that every prefix in the set of prefixes of length , , will consist of a prefix from followed by a character. Note also that every must be the prefix for at least one member of . Therefore to generate , each prefix in must be considered. Given and , will be in if and only if it is the prefix of a necklace. To determine this property, the rank of the smallest and largest non-cyclic words starting with amongst the set of necklaces can be used. The rank of a word amongst the set of necklaces will be denoted Let denote the smallest word starting with , i.e. the word consisting of followed by copies of the smallest character, and denote the largest word starting with . The number of necklaces sharing the prefix will be given by . If there are no such necklaces, then will be discarded, otherwise it will be added to . The set will be generated by repeating this process for every , . Once the size of is greater than , the algorithm will terminate using the prefixes in as a basis. For each , a centre will be generated by appending an arbitrary subword following the prefix.
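The procedure above can be sketched in Python under simplifying assumptions. Instead of the ranking function, the sketch tests whether a word is a necklace prefix by scanning a brute-force list of all necklaces, and each centre is completed by taking the smallest necklace with the given prefix rather than an arbitrary extension. The function names are hypothetical and $k \geq 1$ is assumed.

```python
from itertools import product


def all_necklaces(alphabet, n):
    """Sorted canonical representations of all necklaces of length n."""
    words = ("".join(t) for t in product(alphabet, repeat=n))
    return sorted({min(w[i:] + w[:i] for i in range(n)) for w in words})


def necklace_prefix_centres(alphabet, n, k):
    """Grow the set of necklace prefixes level by level and stop before
    the number of prefixes would exceed k."""
    necks = all_necklaces(alphabet, n)
    prefixes = [""]
    for _ in range(n):
        extended = [p + x
                    for p in prefixes
                    for x in sorted(alphabet)
                    if any(c.startswith(p + x) for c in necks)]
        if len(extended) > k:
            break
        prefixes = extended
    # Assumption: complete each prefix with the smallest necklace extending it.
    centres = [next(c for c in necks if c.startswith(p)) for p in prefixes]
    return prefixes, centres


if __name__ == "__main__":
    print(necklace_prefix_centres("ab", 4, 3))
```

For the binary alphabet with length 4 and $k = 3$, the search stops at prefixes of length 2, since there are five necklace prefixes of length 3.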
There exists a polynomial-time algorithm to construct k centres of such that every word in shares a common substring of length at least with the nearest centre, and therefore will be at a distance of no more than from the nearest centre.
Let be the length of the prefixes at the termination of the algorithm. To bound the length of , observe that each sample corresponds to a prefix of length . Therefore, this becomes the problem of determining the largest value of such that the size of the set is less than . An upper bound on the size of the set of necklace prefixes of length can be taken as the sum of the upper bound on the number of Lyndon words of length 1 to . This gives the . Ignoring the divisor by allows the upper bound to be rewritten into the inequality . Note that , but for any , so if we are considering only integer values then . Lemma 1 gives a lower bound on the distance between every word in and the nearest centre in the sample of , which is bounded by .
To show that the method will terminate in polynomial time, we note that the time to compute the number of necklaces with a given prefix will be . For every , at most samples will be checked. To determine the longest , let there be prefixes of length . Observe that there will be at least words of length . Therefore the number of prefixes of length will be at least . Therefore the longest length will be . Thus the maximum number prefixes that need to be checked will be and the total complexity will be . ∎
Problem 1 can be solved by the polynomial time approximation algorithm for a language with an approximation factor and a language with .
Recall, from Lemma 2 the lower bound on the value is . Dividing the bound from Lemma 4 by the lower bound on the distance from Lemma 2 gives a performance ratio . Then in the big O notation it can be simplified to = and then to by changing the base to . Therefore Problem 1 can be solved in polynomial time within a factor of for and by Lemma 3 for . ∎
There exists a polynomial time algorithm for the -centre problem on that can ensure that no word in shares a common substring of length at least with the nearest centre. Furthermore, every word in will be no more than from the nearest centre.
This follows from the same methods given for . Using the ranking function for fixed content necklaces given by a straight forward adaptation of the ranking function given by Hartman and Sawada [Hartman2019], the set of prefixes may be generated in the same way as before. The primary difference in these settings is that the number of necklaces and prefixes thereof are considerably smaller. Once a set of prefixes is generated, the centres can be generated in the same way as before, with the added constraint that the word satisfies the Parikh vector.
Let be the length of the longest substring shared by the word in that is furthest from one of the centres. By the same arguments as in Theorem 4, a lower bound of can be given as . However, unlike in the general case, all necklaces start with the same first character, increasing the lower bound on to . Furthermore, as every word in shares the same number of occurrences of every character, therefore the size of the intersection in the overlap coefficient will be . Therefore the lower bound on the distance will be no more than and then bounded by . ∎
Problem 1 can be solved for a language with an approximation factor of .
First we must establish a lower bound on the distance for this setting. Note that, unlike for , every word contains the same set of characters. Therefore the size of the intersection in the overlap coefficient will be at least for every word. Moreover, if any character occurs more than times, there will be a subword of length 2 shared by every word.
Let be the length of the longest subword such that every word in share a subword of length at least with at least one of the centres. An upper bound on the length of comes from the number of prefixes to fixed content necklaces. For any , there will be at least possible subwords of length . As there are possible subwords of length , this requires . assuming , may be approximated as , giving bounding as .
For any combination of subwords of length less than or equal to guaranteeing that every word in has an intersection of size requires , however for any , this contradicts the assumption that is the largest value such that . Therefore the size of the intersection must be less than or equal to , which by substituting the upper bound of gives . This gives an lower bound on the distance as .
To get the approximation ratio, the lower bound in the distance given in Lemma 5 by this upper bound gives . Assuming that , this can be simplified for big O notation to . ∎
3.2 Sampling with forbidden subwords
In order to generalise the algorithm described in Theorem 4, the ranking and unranking functions must be generalised to account for forbidden words. This is a much more challenging problem compared to the general case, primarily due to the cyclic nature of the words. Unlike with a non-cyclic word, when counting the number of cyclic words without a given forbidden word, it must be ensured that it does not occur for any shift, as opposed to just one. This is further complicated when considering multiple forbidden words, where it must be checked that no forbidden word occurs for any rotation. Ruskey and Sawada [Ruskey2000] computed the size of as where is a function for counting the number of cyclic words of length containing no subword in . This can be computed in polynomial time for a constant size of . The number of Lyndon words of length not containing any subword in , denoted , is given relative to the number of necklaces, using the function:
For the remainder of this section, let and denote the sets of necklaces and Lyndon words respectively.
Before introducing the function for ranking and unranking of forbidden words, some theoretical results must be established. Let be the set of words such that the canonical representation of every word is smaller than , and no forbidden word in occurs as a subword. Let denote the canonical rotation of some word , and let denote that is not a subword of . Using this notation we get
Three further sets are needed for the purpose of ranking. The first of these is the set of aperiodic words such that the smallest rotation is less than some given word , denoted . Next is the set of words on length where the smallest rotation is less than , denoted . For two words and of lengths and respectively, if and only if . The final set is that of aperiodic words of a given length where the canonical representation is less than, denoted .
For every that is a factor of , the size of is equal to .
Observe that every word in the set will either be aperiodic, in which case it will belong also to the set , or it will be periodic. If it is periodic, the period must be some value that is a factor of . Given some word with a period , if it is smaller than , then it will occur in the set . By definition, any word greater than will not occur in any set for any such that . As each set consists only of aperiodic words, there can be no word that occurs in both and for . Therefore the size of can be computed as . ∎
By application of the Möbius inversion formula to , an equation for the size of can be derived as:
This can be used to rank some word amongst the set of Lyndon words without any forbidden subword.
The number of Lyndon word smaller than some word without any forbidden subword in will be given by .
As every Lyndon word is aperiodic, it has unique rotations. Therefore, for any given word , each Lyndon word will occur times within the set of aperiodic words with some rotation smaller than , if and only if the canonical representation of the Lyndon word is smaller than . ∎
In the next lemma we compute the number of necklaces smaller than using .
The number of necklaces smaller than without any forbidden subword in is equal to .
It follows from Lemma 6 that all necklaces smaller than some word will either be aperiodic, or periodic with a period that is a factor of the length of the necklace. From Lemma 7, the necklaces that are smaller than and are aperiodic are . Similarly, the necklaces with a period of some factor of are . ∎
The problem now becomes computing . To do this, the set will be partitioned to the set such that for every word the following hold.
is the smallest cyclic shift such that shifting by , denoted , makes the resulting word smaller than , i.e. .
Under the shift by , is the length of the longest prefix of that is also a prefix of .
The size of can be computed in time.
This can be done by considering two possible cases for the set. First is the case . In this case, every word will be of the form where:
is some word of length with no forbidden subword such that every suffix is greater than ;
is the prefix of of length ;
is a character smaller than ;
is a word with no restrictions other than having no forbidden substrings.
To compute the number of possible words satisfying and , we define the function . A full definition of this function is given in Appendix A. At a high level, this function works by recursively checking how many possible ways of extending the string based on the sets and . represents the set of suffixes of that are prefixes of . initially represents the set of prefixes of that are suffixes of . As this function can be computed recursively, the computational complexity will be equal to the number of potential calls to . There are possible value for and , and for both and . This gives a total time complexity of This will take time to compute in the worst case. Alongside this, two auxiliary functions and are needed. These, respectively, compute the sets of prefixes and suffixes of forbidden words of , for some character . Using the above, we get:
In the second case, every word will be of the form . Let be the length of the longest prefix of that is a suffix . If , then the shift by would be smaller than . Therefore must be greater than or equal to . From this, the size of can be computed as:
As can be computed in and stored for all values of and , the time to compute the size of will be . As this is dominated by the time to compute , the total time will be . ∎
The rank of a word amongst all necklaces without any forbidden subword may be computed in .
It follows from Lemma 9 that by separately computing the values of , the time to compute the size of will be . Following Lemma 8, the number of Necklaces may be computed by summing the size of for every factor of . Note that there are at most factors of . The size of can be computed using Lemma Equation 3. In the worst case there will be sets of , each of which taking at most time to compute. Putting this together, the total time complexity will be ∎
For a constant number of forbidden words, Problem 1 for can be solved in polynomial time with an approximation factor of .
The same approach given in Section 3.1 may be used to build centres. In this case the ranking function described in this section may be used in place of the ones used there. Clearly the same bounds on length of the prefix will hold in the worst case, giving an upper bound on the distance of . A lower bound follows from the same observation that the length of the longest common subword between the furthest word in the language and the nearest centre will be bound from above by the number of Lyndon words without any forbidden subwords. A bound on this is , which will be of order for a constant size . Thus the approximation factor between the bound given by this algorithm will be the same as in the unconstrained case, giving a factor of . ∎
4 Sampling via de Bruijn sequences
The primary issue with the prenecklace based algorithm is that it does not take advantage of any additional space left in the samples. As such, a different approach is needed to build the samples. Following the same motivation of maximising the length of the subword shared between every word in the language and the nearest centre, observe that this requires every subword of that length to occur at some point in the sample.
For a given length $\ell$, there are $q^\ell$ subwords over an alphabet of size $q$. A de Bruijn sequence of order $\ell$ is a cyclic word of length $q^\ell$ in which every word of length $\ell$ occurs as a subword. This makes it a natural candidate for use as the basis for the centres.
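For illustration, a de Bruijn sequence can be generated with the standard recursive construction that concatenates, in lexicographic order, the Lyndon words whose lengths divide the order. The Python sketch below uses this well-known method; it is not necessarily the generation algorithm cited in this paper, and the function name `de_bruijn` is hypothetical.

```python
def de_bruijn(alphabet: str, order: int) -> str:
    """Return a de Bruijn sequence of the given order over alphabet,
    built by concatenating Lyndon words whose length divides order."""
    q = len(alphabet)
    a = [0] * (order + 1)  # a[1..order] holds the current prenecklace
    sequence = []

    def db(t: int, p: int) -> None:
        if t > order:
            if order % p == 0:           # a[1..p] is a Lyndon word
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, q):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in sequence)


if __name__ == "__main__":
    s = de_bruijn("ab", 3)
    print(s, len(s))
    # Every word of length 3 occurs exactly once when s is read cyclically.
    windows = {(s + s)[i:i + 3] for i in range(len(s))}
    print(len(windows))
```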
There exists an algorithm with a worst case running time of for the -centre problem on ensuring that every word in shares a common substring of length at least . Further this will ensure that no word in is a distance of more than from the nearest centre.
The main idea of this algorithm is to take the de Bruijn sequence of order and divide it between samples while ensuring that all subwords of length occur at some point as a subword of a sample. Note that the length of the de Bruijn sequence of order will be . The de Bruijn sequence may be efficiently generated in time linear to the length of the sequence [Ruskey2000], which must be less than .
Naively splitting the sequence between the centres may lead to subwords being lost. In order to account for this, the sequence may be split into samples of size . The first centre can be generated by taking the first characters of the de Bruijn sequence. To ensure that every subword of length occurs, the first characters of the second centre will be the same as the last characters of the first centre. Repeating this, the centre will be the subword of length starting at position in the de Bruijn sequence. An example of this is given in Figure 3.
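The splitting step can be sketched in Python as follows. Several choices here are assumptions made for the illustration and are not taken from the text: the sequence is cut into at most $k$ blocks of equal size, each block is extended by the next `order - 1` characters of the cyclic sequence so that no subword of length `order` is lost at a cut, and centres shorter than $n$ are padded with a fixed character. The function name `split_de_bruijn` and its parameters are hypothetical.

```python
from math import ceil


def split_de_bruijn(sequence: str, order: int, k: int, n: int, pad: str):
    """Split a cyclic de Bruijn sequence into at most k centres of length n
    so that every cyclic subword of length order occurs in some centre."""
    m = len(sequence)
    block = ceil(m / k)            # assumed block size: equal blocks
    window = block + order - 1     # block plus overlap into the next block
    if window > n:
        raise ValueError("centres of length n are too short for this order")
    doubled = sequence + sequence  # handles the cyclic wrap-around
    centres = []
    for start in range(0, m, block):
        piece = doubled[start:start + window]
        centres.append(piece + pad * (n - len(piece)))
    return centres


if __name__ == "__main__":
    # "aaababbb" is a de Bruijn sequence of order 3 over {a, b}.
    print(split_de_bruijn("aaababbb", 3, 2, 6, "a"))
```

With two centres of length 6, each block of 4 characters is extended by 2 characters of overlap, and all eight binary words of length 3 appear in one of the two centres.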
To determine the length of relative to and , note that the size of the corresponding de Bruijn sequence must be small enough such that every subword may occur. Formally, for to be feasible, . This may be rearranged in terms of , giving as an upper bound .
Using these, the centres may be formed by taking each of these samples and appending the first of the next sample. For a given centres of length at most , will be the largest value such that , which may be rewritten as <. An upper bound on the distance may be gained using Lemma 1, giving , which may be bounded by .
In terms of complexity, there are known algorithms to output the de Bruijn sequence in time, which by the definition of will be time in the worst case. From the sequence, the process of dividing into samples will take no more than time linear to the size of the sequence, giving a total complexity of . It is worth noting that any algorithm that outputs the centres will have a complexity of at least . ∎
The algorithm described in Lemma 10 will approximate the optimal solution within a factor of .
Recall from Lemma 1 that the minimum distance the word that is furthest from the sample can be is .
The upper bound on distance given by Lemma 10 is .
Dividing this upper bound by the lower bound gives .
Rewriting this in terms of base gives
= . If then this algorithm will approximate the optimal solution within a factor of . As will be no more than 1, this will hold. ∎
Appendix A Ranking functions
The function can be thought of as counting the size of the set of words where, for every in the set:
has length .
For every , and , no forbidden subword occurs in the word .
The first characters of .
Every suffix of the subword is greater than .
To compute , two auxiliary functions will be introduced to compute the and after the next character is introduced. will take as argument the set of prefixes and some character , and returning the set of prefixes of where either is the first character of a forbidden word or there exists some where - i.e. the prefixes that may be continued by adding this character. will return the members of that are suffixes of some forbidden word, and the words for where is a subword of a forbidden word. To compute , the set may be further partitioned by the next character . If , then there is no lower bound on the value of , otherwise must be greater than or equal to . For each the number of words for which the character equals will be as follows:
If there exists some such that for some forbidden word then there will be no subsequent words.
Otherwise, if then if there exists some and some where or then there will also be no words in this set, otherwise there will be only 1 word.
If and then the size of the set will be equal to