The problem of finding uniformly spaced points (centres) in a metric space is well known as the $k$-centre selection problem. Finding centres is important in many contexts: facility location and distribution, representative samples for state space exploration, or identification of cluster centres. So far, the problem has been intensively studied for finite and explicitly given inputs, such as the $k$-centre problem for graphs [Algorithmica2020], grids [Zhang11] or strings [CPM2004, JACM2002].
In graph theory, the objective of the $k$-centre problem is to find a set of $k$ vertices in a given undirected (weighted) graph for which the maximal distance from any vertex to its nearest centre in the set is minimised.
In the area of stringology, finding $k$-centres within a set of words can be seen as a problem in a complete weighted graph: vertices are words, and the distance between words depends on their closeness under a measure such as the Hamming distance or the overlap/Jaccard coefficients for contextual similarity. However, the configuration space of many algebraic and combinatorial structures cannot be given explicitly, because it grows exponentially and listing or enumerating it is infeasible. The solutions to the centre-selection problem on graphs, or on explicitly given finite sets of strings, are therefore impractical to apply directly to these objects.
In this paper we introduce a fundamental class of combinatorial objects, multidimensional necklaces, generalising the classical combinatorial necklaces, and we study the $k$-centre selection problem for these objects. Multidimensional words are known in the automata theory literature as picture languages, and they are a well-studied generalisation of one-dimensional languages to two dimensions [Annexstein1997, Bozapalidis2006, Giammarresi1997, Latteux1997, Matz1997, Stromoney1972]. Such objects become even harder to handle if we consider natural classes of words which are equivalent under translation symmetry, known as necklaces [Codes_automata, Graham1994].
One-dimensional necklaces are known as cyclic words, i.e. strings over a finite alphabet which are equivalent under the cyclic shift operation. One-dimensional necklaces are also closely related to Lyndon words, i.e. aperiodic necklaces. For both one-dimensional necklaces and Lyndon words, efficient algorithms for generation, ranking and unranking have been discovered only recently [Kociumaka2014, Kopparty2016, Sawada2017]. Two-dimensional necklaces correspond to toroidal codes, which have recently attracted attention in the combinatorics on words community in the context of bioinformatic applications [Anselmo2019].
Periodic motifs in crystals are another example illustrating the application of necklaces up to dimension three; see, e.g., the representation in Figure 1. Methods for the effective exploration of the configuration space of crystal structures and the search for potentially stable materials, such as EMMA [Collins17, Dyer13], FUSE [FUSE2018] and AIRSS [AIRSS], require procedures for selecting equally spaced seeding configurations, in contrast to purely random initial positions. The solution to the $k$-centre problem on combinatorial necklaces can be used to build a representative sample in the discrete configuration space of crystalline materials [SOFSEM2020, CSP2020] and to speed up in silico predictions of novel materials, known to be one of the major scientific challenges of our time. The substantial gap in knowledge on solving the $k$-centre problem for implicitly represented objects, together with applications in computational chemistry, motivates the study of the $k$-centre selection problem for multidimensional necklaces.
Main Contributions: Our contribution is twofold. Firstly, we introduce the $k$-centre problem for necklaces and develop approximation algorithms for it. Secondly, we derive efficient procedures for foundational operations on multidimensional necklaces. These operations are used by our algorithms, and are of independent interest.
For the $k$-centre problem, we introduce the overlap distance for necklaces. Using this, we provide both negative and positive results for the problem. On the negative side we prove that it is NP-hard to evaluate the quality of a solution for the problem even for 1-dimensional necklaces. Although this does not resolve the complexity of the $k$-centre problem, it is a strong indication that the problem is hard. On the positive side, we design two polynomial-time algorithms that achieve approximation. In the 1-dimensional case this is , and for the multidimensional case it is where is the size of a necklace and the size of the alphabet. Our analysis for both algorithms relies on the technical Lemma 2, which provides an upper bound on the distance between necklaces and the nearest sample based on the number of subwords that we cover with at least one sample. Our first algorithm works when the input includes a de Bruijn sequence of logarithmic size relative to the number of necklaces; such sequences can be efficiently computed for one-dimensional necklaces, while no algorithm is known for higher dimensions. Our second algorithm bypasses this limitation using new operations on multidimensional necklaces, such as counting and ranking, that we establish in Section 5.
Our second set of results contains the generalisation of several fundamental results from one-dimensional necklaces to the multidimensional setting. This theme focuses on -dimensional necklaces defined over a given set of dimensions and an alphabet . Throughout the paper we use to denote . Our results include the first formal definition of multidimensional necklaces, along with the algorithms to:
Count the number of necklaces of dimensions over the alphabet in polynomial time.
Generate the set of all necklaces of dimensions over the alphabet in at most time per necklace.
Rank a necklace in the set of all necklaces of dimensions over the alphabet in time.
Unrank the necklace of dimensions over the alphabet in time.
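As a point of reference for the counting operation listed above, the following is a minimal sketch of the textbook Burnside-lemma count, restricted to two dimensions. It is not claimed to be the procedure developed in this paper; the function names `count_necklaces_2d` and `brute_force_count_2d` are hypothetical. Since the translation group acts freely on positions, a translation of order $t$ fixes exactly $q^{N/t}$ words, where $N$ is the number of positions.

```python
from itertools import product
from math import gcd


def lcm(a, b):
    return a * b // gcd(a, b)


def count_necklaces_2d(n1, n2, q):
    """Count n1 x n2 necklaces over q symbols with Burnside's lemma."""
    size = n1 * n2
    total = 0
    for a in range(n1):
        for b in range(n2):
            # Order of the translation (a, b) in Z_n1 x Z_n2.
            order = lcm(n1 // gcd(a, n1), n2 // gcd(b, n2))
            # Every cycle of positions has length `order`.
            total += q ** (size // order)
    return total // size


def brute_force_count_2d(n1, n2, q):
    """Exponential-time check: enumerate all words, count classes."""
    seen = set()
    count = 0
    for cells in product(range(q), repeat=n1 * n2):
        if cells in seen:
            continue
        count += 1
        grid = [cells[r * n2:(r + 1) * n2] for r in range(n1)]
        for a in range(n1):
            for b in range(n2):
                seen.add(tuple(grid[(r + a) % n1][(c + b) % n2]
                               for r in range(n1) for c in range(n2)))
    return count
```

The closed form runs in time polynomial in the number of positions, whereas the brute-force check is only feasible for very small dimensions.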
The remainder of this paper is organised as follows. Section 2 provides definitions and notation used throughout the paper. Section 3 gives the general results about the $k$-centre problem on necklaces, including both hardness results and bounds on the optimal solutions. Section 4 provides two approximation algorithms for this problem, for the one-dimensional and multidimensional cases. Finally, Section 5 is devoted to the foundational results for multidimensional necklaces.
We denote by the set of integers from to inclusive and by the set of integers from to inclusive. Let be a linearly ordered finite alphabet such that . We denote by the set of all words over and by the set of all words in of length . The notation is used to clearly denote that the variable is a word. The length of a word is denoted by . We use to denote the symbol of , where . The concatenation of words and , denoted , returns the word . We extend the ordering from to in the usual lexicographic manner. Formally, let be a pair of words, where . We say if and only if there exists an where and . For a given set of words the rank of with respect to is the number of words in that are smaller than .
The translation of a word by , denoted , returns the word . A word is equivalent to under translation if for some . The power of a word , denoted , is equal to concatenations of . A word is periodic if there is some word and an integer such that . The smallest such is called the period of . A word is aperiodic if it is not periodic.
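The one-dimensional operations above can be sketched as follows. This is an illustrative sketch only: words are Python strings, the translation is assumed to be a left cyclic shift, and `period` returns the shortest word whose power gives the input, so for an aperiodic word it returns the word itself. All function names are hypothetical.

```python
def translate(word, r):
    """Cyclically shift `word` by r positions (assumed: to the left)."""
    r %= len(word)
    return word[r:] + word[:r]


def period(word):
    """Return the shortest u such that u repeated t times equals word."""
    n = len(word)
    for p in range(1, n + 1):
        if n % p == 0 and word[:p] * (n // p) == word:
            return word[:p]


def is_periodic(word):
    """A word is periodic if it is a power u^t with t >= 2."""
    return len(period(word)) < len(word)
```

For instance, `period("abab")` yields `"ab"`, while `"aab"` is aperiodic under this check.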
A necklace, also called a cyclic word, is the equivalence class of words under the translation operation. For notation, a word is written as when treated as a necklace. Given a necklace , the canonical representative is the lexicographically smallest element of the set of words in the equivalence class . A cyclic subword of the word , denoted , is the word such that for all . Here and in the future we tacitly assume that is equivalent to . Since we consider only cyclic subwords in the paper, we omit “cyclic” in the future. If , then is a prefix and is a suffix. A prefix or suffix of is proper if its length is smaller than .
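A minimal sketch of the canonical representative and of cyclic subwords follows, assuming 0-based positions and words given as Python strings. The quadratic scan over all rotations is chosen for clarity; it is not the linear-time method of [Booth1980]. The names `canonical` and `cyclic_subword` are hypothetical.

```python
def canonical(word):
    """Lexicographically smallest rotation of `word`."""
    return min(word[i:] + word[:i] for i in range(len(word)))


def cyclic_subword(word, start, length):
    """Subword of the given length starting at `start`, wrapping around."""
    n = len(word)
    return "".join(word[(start + j) % n] for j in range(length))


def same_necklace(u, v):
    """Two words represent the same necklace iff their canonical forms agree."""
    return len(u) == len(v) and canonical(u) == canonical(v)
```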
Multidimensional Words and Necklaces
A $d$-dimensional word over is an array of -dimensions given by a vector of elements from . For notation, given a vector where every , is used to denote the set . Similarly is used to denote the set . Let be the dimensions of . Given a vector of dimensions , is used to denote the set of all words of dimensions over . Let for a dimension vector . For a $d$-dimensional word , the notation is used to refer to the symbol at position in the array. Given two $d$-dimensional words such that and , the concatenation is performed along the last coordinate, returning the word of dimensions such that if and if .
A multidimensional cyclic subword of of dimensions is denoted . As in the one-dimensional case, a subword is defined by a starting position in the original word and a set of dimensions defining the size of the subword. The subword starting at position with dimensions is the word such that for all of the form , where . Such a subword we denote by . One important class of subwords is what we call slices, an example of which is given in Figure 2. The slice of , denoted by , is the subword of dimensions starting at position of . In the 2D case, the slice corresponds to the row of a word. We use to denote . A prefix of length for a multidimensional word is the first slices of in order. A suffix of length for a multidimensional word is the last slices of in order. In the two-dimensional case, the prefix and suffix of length correspond to the first and last rows respectively.
A -dimensional translation is defined by a vector . The translation of the word of dimensions by , denoted returns the word such that and for all of the form . We can assume that , so the set of translations is equivalent to the direct product of the cyclic groups . For notation let . Given two translations and in , is used to denote the translation .
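The translation operation and the composition of translations can be sketched as below. This is a sketch under stated assumptions: a word is stored as a dictionary from 0-based position tuples to symbols, the translated word takes at position $p$ the symbol found at $p + r$ (componentwise, modulo the dimensions), and all names are hypothetical.

```python
from itertools import product


def positions(dims):
    """All position tuples of an array with the given dimensions."""
    return product(*(range(n) for n in dims))


def from_rows(rows):
    """Build a 2D word from a list of equal-length strings."""
    return {(i, j): s for i, row in enumerate(rows) for j, s in enumerate(row)}


def translate(word, dims, r):
    """Translate `word` by the vector r, wrapping around in every dimension."""
    return {p: word[tuple((pi + ri) % ni for pi, ri, ni in zip(p, r, dims))]
            for p in positions(dims)}


def compose(r, t, dims):
    """Componentwise sum of two translations, modulo the dimensions."""
    return tuple((ri + ti) % ni for ri, ti, ni in zip(r, t, dims))
```

Applying two translations in sequence agrees with applying their composition once, which reflects the direct-product structure of the translation group.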
A multidimensional necklace (multidimensional cyclic word) is an equivalence class of all multidimensional words under the translation operation.
Informally, given a necklace containing the word , contains every word where there exists some translation such that . Let denote the set of necklaces of dimensions over an alphabet of size . As in the 1D case, a canonical representation of a multidimensional necklace is defined as the smallest element in the equivalence class, denoted . Similarly, given a word , denotes the canonical representation of the necklace , i.e. . To determine the smallest element in the equivalence class, an ordering needs to be defined. First, we introduce an ordering over translations.
Let be the direct product of the cyclic groups , i.e. the set of all translations of words of dimensions . The translation is indexed by the injective function
The translation is smaller than if . Note that is the smallest translation and is the largest. Using this definition an ordering on multidimensional words is defined recursively. The key idea is to compare each slice based on the canonical representations. For notation, given two words , let return the smallest translation where . Note that can be computed in time by simply checking each translation in . In one dimension, the smallest such translation can be found in time [Booth1980].
Let and let be the smallest integer such that . Then if either , or and . Further, given necklaces and , we have if and only if .
An example of the ordering is given in Figure 4. In what follows, is assumed to be ordered as in Definition 3. The rank of a necklace is defined as the number of necklaces smaller than in . In the other direction, the necklace in is the necklace with the rank , i.e. the necklace for which there are smaller necklaces.
In order to answer some of the key questions regarding multidimensional necklaces, there are two further concepts that need to be defined for multidimensional necklaces. The first is the period of a word. Informally the period of of dimensions can be thought of as the smallest subword that can tile -dimensional space equivalently to . In order to define the period of a word, it is easiest to first define the concept of aperiodicity.
A word of dimensions is aperiodic if there exists no subword of dimensions such that for every , and where for every position in .
The period of a word of dimensions , denoted , is the length of the aperiodic subword of dimensions such that for every position and .
See Figure 3 as an example. By Definition 5 every word, including aperiodic ones, has a unique period [GAMARD201758]. In the case of an aperiodic word , the period is equal to the dimensions of . It is easy to see that a multidimensional necklace is aperiodic if every word is aperiodic. Further, note that if some word in is aperiodic, then every word is. An aperiodic necklace is called a Lyndon word. A related but distinct concept is an atranslational word. A word is atranslational if there exists no translation such that .
A necklace of dimensions is atranslational if there exists no pair of translations where and .
In one dimension every aperiodic necklace is atranslational, while in any higher dimension every atranslational word is aperiodic, although not every aperiodic word is atranslational. For example is aperiodic but not atranslational, as there are only two unique representations of the cyclic word. On the other hand is both atranslational and aperiodic. For notation, is used to denote the index of the smallest translation where . Similarly is used to denote the index of the smallest translation where , i.e. the smallest rotation of the canonical representative to get .
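The atranslational property can be checked directly in two dimensions by counting distinct translations, as in the sketch below. The representation (a list of equal-length strings, one per row) and the function names are assumptions made for illustration, and the test words are hypothetical examples rather than the ones used in the text. Aperiodicity is not checked here.

```python
from itertools import product


def translations_2d(rows):
    """Yield every translation of a 2D word given as a list of strings."""
    n1, n2 = len(rows), len(rows[0])
    for a, b in product(range(n1), range(n2)):
        yield tuple("".join(rows[(i + a) % n1][(j + b) % n2]
                            for j in range(n2))
                    for i in range(n1))


def is_atranslational(rows):
    """True iff all n1 * n2 translations of the word are pairwise distinct."""
    n1, n2 = len(rows), len(rows[0])
    return len(set(translations_2d(rows))) == n1 * n2
```

For example, the 2×2 checkerboard with rows `ab` and `ba` has only two distinct translations, so it is not atranslational, whereas the word with rows `ab` and `cd` is.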
3 The Overlap Distance and the $k$-Centre Problem
In this section we formally define the $k$-centre problem for necklaces. The input to our problem is an alphabet of size , a vector of dimensions that defines the size of the multidimensional words, and a positive integer $k$. The goal is to choose a set of $k$ necklaces from the set such that the maximum “distance” between any necklace and the set is minimised. Since there is no standard notion of distance between necklaces, our first task is to define one. We introduce the overlap distance, which aims to capture similarity between crystalline materials as an extension of the overlap metric between words. This can be seen as a natural distance based on the “bag-of-words” techniques used in machine learning [gartner2003survey].
The Overlap Coefficient for Necklaces. Our definition of the overlap distance depends on the well-studied overlap coefficient, defined for a pair of sets $A$ and $B$ as $\frac{|A \cap B|}{\min(|A|, |B|)}$. For notation let return the overlap coefficient between two sets $A$ and $B$. Observe that the coefficient is a rational value between $0$ and $1$, with $0$ indicating no common elements and $1$ indicating that either $A \subseteq B$ or $B \subseteq A$. In the context of necklaces the overlap coefficient is defined as the overlap coefficient between the multisets of all subwords of and . For some necklace of dimensions , the multiset of subwords of dimensions contains all . For each subword appearing times in , copies of are added to the multiset. This gives a total of subwords of dimensions for any , where . For example, given the necklace represented by , the multiset of subwords of length 2 is . The multiset of all subwords is the union of the multisets of the subwords for every set of dimensions, having a total size of ; see Figures 5 and 6.
Overlap Distance for Necklaces. To use the overlap coefficient as a distance between and , the overlap coefficient is inverted so that a value of means and share no common subwords while a value of means . The overlap distance (see example in Figure 6) between two necklaces and is . Proposition 1 shows that this distance is a metric distance.
The overlap distance for necklaces is a metric distance.
Let , for some arbitrary vector and . In order for the overlap distance to satisfy the metric property, must be less than or equal to . Rewriting this gives which can be rewritten in turn as . Observe that if then , meaning that . This implies that and share at least subwords. Therefore must be at least . Hence . ∎
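The overlap distance defined above can be sketched for the one-dimensional case as follows. The sketch assumes that both words have the same length $n$ and that the multiset of subwords contains, for every length from $1$ to $n$, the $n$ cyclic subwords starting at each position. Function names are hypothetical, and exact rationals are used to avoid rounding.

```python
from collections import Counter
from fractions import Fraction


def subword_multiset(word):
    """Multiset of all cyclic subwords of lengths 1..n (n of each length)."""
    n = len(word)
    doubled = word + word  # lets a slice wrap around the end of the word
    return Counter(doubled[i:i + length]
                   for length in range(1, n + 1)
                   for i in range(n))


def overlap_coefficient(u, v):
    """Size of the multiset intersection over the smaller multiset size."""
    a, b = subword_multiset(u), subword_multiset(v)
    shared = sum((a & b).values())
    return Fraction(shared, min(sum(a.values()), sum(b.values())))


def overlap_distance(u, v):
    """One minus the overlap coefficient of the two subword multisets."""
    return 1 - overlap_coefficient(u, v)
```

Because the multiset is built from cyclic subwords, the result does not depend on which representative of a necklace is passed in: `overlap_distance("aab", "aba")` is `0`, while `overlap_distance("aab", "abb")` is `5/9`.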
The $k$-Centre Problem. The goal of the $k$-centre problem for necklaces is to select a set of $k$ necklaces of dimensions over an alphabet of size that are “central” within the set of necklaces . Formally the goal is to choose a set of $k$ necklaces such that the maximum distance between any necklace and the nearest member of is minimised. Given a set of necklaces , we use to denote the maximum overlap distance between any necklace in and its closest necklace in . Formally:
$k$-Centre problem for necklaces: Given a set of dimensions , alphabet of size , and an integer $k$, what is the set of size $k$ minimising ?
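To make the objective concrete, the following sketch evaluates it by exhaustive search in the one-dimensional case. It is feasible only for very small instances, since the number of necklaces grows exponentially with the length, and it is not one of the approximation algorithms of this paper. It assumes the same one-dimensional subword multiset as described above (all cyclic subwords of lengths $1$ to $n$); all function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations, product


def canonical(word):
    """Lexicographically smallest rotation of `word`."""
    return min(word[i:] + word[:i] for i in range(len(word)))


def all_necklaces(n, alphabet):
    """Canonical representatives of all necklaces of length n."""
    return sorted({canonical("".join(w)) for w in product(alphabet, repeat=n)})


def overlap_distance(u, v):
    """Overlap distance between two equal-length one-dimensional necklaces."""
    n = len(u)
    a = Counter((u + u)[i:i + l] for l in range(1, n + 1) for i in range(n))
    b = Counter((v + v)[i:i + l] for l in range(1, n + 1) for i in range(n))
    return 1 - Fraction(sum((a & b).values()), n * n)


def objective(centres, necklaces):
    """Largest distance from any necklace to its nearest centre."""
    return max(min(overlap_distance(c, w) for c in centres) for w in necklaces)


def brute_force_k_centre(n, alphabet, k):
    """Try every set of k necklaces and keep one minimising the objective."""
    necklaces = all_necklaces(n, alphabet)
    return min(combinations(necklaces, k),
               key=lambda centres: objective(centres, necklaces))
```

For length 3 over a binary alphabet there are four necklaces, and a single centre can achieve an objective value of at most `8/9`.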
There are two major challenges we have to overcome in order to solve Problem 1: the exponential size of , and the lack of structural, algorithmic, and combinatorial results for multidimensional necklaces. We show that the conceptually simpler problem of verifying whether a set of necklaces is a solution for Problem 2 is NP-hard for any dimension .
Given a set of necklaces of dimensions over the alphabet and a distance , does there exist some necklace such that for every ?
Problem 2 is NP-hard for any dimension .
We prove the claim via a reduction from the Hamiltonian cycle problem on bipartite graphs to Problem 2 in one dimension. Note that if the problem is hard in the 1D case, then it is also hard in any dimension by using the same reduction for necklaces of dimensions . Let be a bipartite graph containing an even number of vertices. The alphabet is constructed with size such that there is a one to one correspondence between each vertex in and symbol in . Using a set of necklaces is constructed as follows. For every pair of vertices where , the necklace corresponding to the word is added to the set of centres . Further the word , for every , is added to the set .
For the set , we ask if there exists any necklace in that is further than a distance of . For the sake of contradiction, assume that there is no Hamiltonian cycle in , and further that there exists a necklace such that the distance between and every necklace is greater than . If shares a subword of length with any necklace in then would be at a distance of no less than from . Therefore, as every subword of length in corresponds to an edge that is not a member of , every subword of length 2 in must correspond to a valid edge.
As cannot correspond to a Hamiltonian cycle, there must be at least one vertex for which the corresponding symbol appears at least 2 times in . As is bipartite, if any cycle represented by has length greater than , there must exist at least one vertex such that . Therefore, the necklace is at a distance of no more than from . Alternatively, if every cycle represented by has length , there must be some vertex that is represented at least times in . Hence in this case is at a distance of no more than from the word . Therefore, there exists a necklace at a distance of greater than if and only if there exists a Hamiltonian cycle in the graph . Therefore, it is NP-hard to verify if there exists any necklace at a distance greater than for some set . ∎
The combination of this negative result with the exponential size of makes finding an optimal solution for Problem 1 in polynomial time relative to the values of and exceedingly unlikely. As such, the remainder of our work on the $k$-centre problem for necklaces focuses on approximation algorithms. Lemma 1 provides a lower bound on the optimal distance.
Let be an optimal set of centres minimising then .
We first prove the lemma for the one-dimensional case, then extend the proof to the multidimensional setting. Recall that the distance between any pair of necklaces and is determined by the overlap coefficient and, by extension, the number of shared subwords between and . Hence the distance between the furthest necklace and the optimal set is bounded from below by determining an upper bound on the number of shared subwords between and the words in . For the remainder of this proof let be the necklace furthest from the optimal set . Further, for the sake of determining an upper bound, the set is treated as a single necklace of length . This may be thought of as the necklace corresponding to the concatenation of each necklace in . Note that the length of is . As the distance between and is no more than the distance between and any , the distance between and provides a lower bound on the distance between and .
In order to determine the number of subwords shared by and , consider first the subwords of length . In order to guarantee that shares at least one subword of length , must contain each symbol in , requiring the length of to be at least . Similarly, in order to ensure that shares two subwords of length with , must contain copies of every symbol in , requiring the length of to be at least . More generally for to share subwords of length with , must contain copies of each symbol in , requiring the length of to be at least