Graphs and strings are two different types of data objects, each with its own particularities. Graphs seem to be more popular in fields like classical and parameterised algorithms and complexity (due to the fact that many natural graph problems are intractable), while fields like formal languages, pattern matching, verification or compression are more concerned with strings. Moreover, graph algorithms and string algorithms are both well-established fields that provide rich toolboxes of algorithmic techniques, but they differ in that the former is tailored to computationally hard problems (e. g., the approach of treewidth and related parameters), while the latter focuses on providing efficient data structures for near-linear-time algorithms. Nevertheless, it is sometimes possible to bridge this divide, i. e., by “flattening” a graph into a sequential form, or by “inflating” a string into a graph, to make use of algorithmic techniques that would otherwise not be applicable. This paradigm shift may provide the necessary leverage for new algorithmic approaches.
In this paper, we are concerned with certain structural parameters (and the problems of computing them) for graphs and strings: the cutwidth of a graph (i. e., the maximum number of “stacked” edges if the vertices of a graph are drawn on a straight line), the pathwidth of a graph (i. e., the minimum width of a tree decomposition whose tree structure is a path), and the locality number of a string (explained in more detail in the next paragraph). By Cutwidth, Pathwidth and Loc, we denote the corresponding decision problems and with the prefix Min, we refer to the minimisation variants. The two graph parameters are classical. Pathwidth is a simple (yet still hard to compute) subvariant of treewidth, which measures how much a graph resembles a path. The problems Pathwidth and MinPathwidth have been intensively studied (in terms of exact, parameterised and approximation algorithms) and have numerous applications (see the surveys and textbook [10, 35, 8]). Cutwidth is the best known example of a whole class of so-called graph layout problems (see the surveys [17, 40] for detailed information), which have been studied since the 1970s and were originally motivated by questions of circuit layout.
The locality number is rather new and we shall discuss it in more detail. A word is $k$-local if there exists an order of its symbols such that, if we mark the symbols in the respective order (which is called a marking sequence), at each stage there are at most $k$ contiguous blocks of marked symbols in the word. The maximum number of marked blocks over all stages is called the marking number of that marking sequence. The locality number of a word is the smallest $k$ for which that word is $k$-local, or, in other words, the minimum marking number over all marking sequences. For example, the marking sequence marks as follows (marked blocks are illustrated by overlines): , , , ; thus, the marking number of is . In fact, all marking sequences for have a marking number of , except , for which it is : , , . Thus, the locality number of , denoted by , is .
The locality number has applications in pattern matching with variables. A pattern is a word that consists of terminal symbols (e. g., ), treated as constants, and variables (e. g., ). A pattern is mapped to a word by substituting the variables by strings of terminals. For example, can be mapped to by the substitution . Deciding whether a given pattern matches (i. e., can be mapped to) a given word is one of the most important problems that arise in the study of patterns with variables (note that the concept of patterns with variables arises in several different domains like combinatorics on words (word equations, unavoidable patterns), pattern matching, language theory, learning theory [2, 19, 39, 43, 32, 22], database theory, as well as in practice, e.g., extended regular expressions with backreferences [26, 27, 45, 28], used in programming languages like Perl, Java, Python, etc.). Unfortunately, the matching problem is NP-complete in general (it is also NP-complete for strongly restricted variants [23, 21] and also intractable in the parameterised setting).
As demonstrated in , for the matching problem a paradigm shift as sketched in the first paragraph above yields a very promising algorithmic approach. More precisely, any class of patterns with bounded treewidth (for suitable graph representations) can be matched in polynomial time. However, computing (and therefore algorithmically exploiting) the treewidth of a pattern is difficult (see the discussion in [21, 44]), which motivates more direct string parameters that bound the treewidth and are simple to compute (virtually all known structural parameters that lead to tractability [14, 21, 44, 46] are of this kind; the efficiently matchable classes investigated in are one of the rare exceptions). This also establishes an interesting connection between ad-hoc string parameters and the more general (and much better studied) graph parameter treewidth. The locality number is a simple parameter directly defined on strings; it bounds the treewidth, and the corresponding marking sequences can be seen as instructions for a dynamic programming algorithm. Compared to other “tractability-parameters”, it seems to cover the treewidth of a string best; however, whether it can be efficiently computed is unclear.
In this paper, we investigate the problem of computing the locality number and, by doing so, we establish an interesting connection to the graph parameters cutwidth and pathwidth with algorithmic implications for approximating cutwidth. In the following, we first discuss related results in more detail and then outline our respective contributions.
Known Results and Open Questions: For Loc, only exact exponential-time algorithms are known, and the questions of whether it can be solved in polynomial time, or whether it is at least fixed-parameter tractable, are mentioned as open problems in . Approximation algorithms have not yet been considered. Addressing these questions is the main purpose of this paper.
Pathwidth and Cutwidth are NP-complete, but fixed-parameter tractable with respect to the parameter $\mathsf{pw}(G)$ or $\mathsf{cw}(G)$, respectively (even with “linear” fpt-time [9, 11, 48]).
With respect to approximation, their minimisation variants have received a lot of attention, mainly because they yield (like many other graph parameters) general algorithmic approaches for numerous graph problems, i. e., a good linear arrangement or path-decomposition can often be used for a dynamic programming (or even divide and conquer) algorithm. More generally speaking, pathwidth and cutwidth are related to the more fundamental concepts of small balanced vertex or edge separators for graphs (i. e., a small set of vertices (or edges, respectively) that, if removed, divides the graph into two parts of roughly the same size). More precisely, and are upper bounds for the smallest balanced vertex separator of and the smallest balanced edge separator of , respectively (see for further details and explanations of the algorithmic relevance of balanced separators). The best known approximation algorithms for MinPathwidth and MinCutwidth (with approximation ratios of and , respectively) follow from approximations of vertex separators (see ) and edge separators (see ), respectively.
Our Contributions: There are two natural approaches to represent a word over alphabet as a graph : (1) and the edges are somehow used to represent the actual symbols, or (2) and the edges are somehow used to represent the positions of . We present a reduction of type (2) such that and , and a reduction of type (1) such that and . Since these reductions are parameterised reductions and also allow approximation results to be transferred, we conclude that Loc is fixed-parameter tractable if parameterised by or by the locality number (answering the respective open problem from ), and also that there is a polynomial-time -approximation algorithm for MinLoc.
In addition, we also show a way to represent an arbitrary multi-graph by a word over alphabet , of length and with . This describes a Turing-reduction from Cutwidth to Loc which also allows approximation results to be transferred between the minimisation variants. As a result, we can conclude that Loc is NP-complete (which solves the other open problem from ). Finally, by combining the reductions from MinCutwidth to MinLoc and from MinLoc to MinPathwidth, we obtain a reduction which transfers approximation results from MinPathwidth to MinCutwidth, which yields an -approximation algorithm for MinCutwidth. This improves, to our knowledge for the first time since 1999, the best approximation for Cutwidth from .
To our knowledge, this connection between cutwidth and pathwidth has not yet been reported in the literature. This is rather surprising, since Cutwidth and Pathwidth have been jointly investigated in the context of exact and approximation algorithms, especially in terms of balanced vertex and edge separators. More precisely, the approximation of pathwidth and cutwidth follows from the approximation of vertex and edge separators, respectively, and the approximation of vertex separators usually relies on edge separators: the edge separator approximation from can be used as a black-box for vertex separator approximation, and the best vertex separator algorithm from uses a technique for computing edge separators from as a component. Our improvement, on the other hand, is achieved by going in the opposite direction: we use pathwidth approximation (following from ) in order to improve the currently best cutwidth approximation (from ). This might be why the reduction from cutwidth to pathwidth has been overlooked in the literature. Another reason might be that this relation is less obvious on the graph level and becomes more apparent if linked via the string parameter of locality, as in our considerations. Nevertheless, since pathwidth and cutwidth are such crucial parameters for graph algorithms, we also translate our locality based reduction into one from graphs to graphs directly.
Basic Definitions: The set of strings (or words) over an alphabet is denoted by , by we denote the length of a word , is the smallest alphabet with . A string is called a factor of if ; if or , where is the empty string, is a prefix or a suffix, respectively. For a position , , we refer to the symbol at position of by the expression , and , . For a word and , let be the set of all positions where occurs in . For a word , let and for .
Let be a word and let . A marking sequence is an enumeration, or ordering on the letters, and hence may be represented either as an ordered list of the letters or, equivalently, as a bijection . Given a word and a marking sequence , the marking number (of with respect to ) is the maximum number of marked blocks obtained while marking according to . We say that is -local if and only if, for some marking sequence , we have , and the smallest such that is -local is the locality number of , denoted by . A marking sequence with is optimal (for ). For a marking sequence and a word , by stage of we denote the word in which exactly the positions are marked. For a word , the condensed form of , denoted by , is obtained by replacing every maximal factor with by . For example, . A word is condensed if .
For a word , we have . Hence, by computing in time , algorithms for computing the locality number (and the respective marking sequences) for condensed words extend to algorithms for general words.
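The definitions above can be made concrete with a small Python sketch. It computes the marking number of a given marking sequence, the locality number by exhaustive search over all marking sequences, and the condensed form of a word. The function names are hypothetical, and the exhaustive search is exponential in the alphabet size, so this is only meant for experimenting with short words.

```python
from itertools import groupby, permutations


def count_blocks(word, marked):
    """Number of maximal contiguous runs of marked positions in word."""
    blocks, inside = 0, False
    for symbol in word:
        if symbol in marked and not inside:
            blocks += 1
        inside = symbol in marked
    return blocks


def marking_number(word, sequence):
    """Maximum number of marked blocks over all stages of the sequence."""
    marked, worst = set(), 0
    for symbol in sequence:
        marked.add(symbol)
        worst = max(worst, count_blocks(word, marked))
    return worst


def locality(word):
    """Locality number by brute force over all marking sequences."""
    alphabet = sorted(set(word))
    return min(marking_number(word, seq) for seq in permutations(alphabet))


def condense(word):
    """Condensed form: every maximal run of one symbol becomes one symbol."""
    return "".join(symbol for symbol, _ in groupby(word))
```

For instance, `condense("aabbbab")` gives `"abab"`, and both words have the same locality number under this brute-force computation, in line with the observation on condensed forms.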
Examples and Word Combinatorial Considerations: The structure of -local and -local words is characterised in . The simplest -local words are repetitions for some . Furthermore, if is -local, then is -local, where . Marking sequences for -local words can be obtained by going from the “inner-most” letters to the “outer-most” ones. The English words radar, refer, blender, or rotator are all 1-local.
Generally, in order to have a high locality number, a word needs to contain many alternating occurrences of (at least) two letters. For instance, is -local. In general (see Appendix A), one can show that if , then .
The well-known Zimin words  also have high locality numbers compared to their lengths. These words are important in the domain of avoidability, as it was shown that a terminal-free pattern is unavoidable (i.e., it occurs in every infinite word over a large enough finite alphabet) if and only if it occurs in a Zimin word. The Zimin words , for , are inductively defined by and . Clearly, for all . Regarding the locality of , note that marking leads to marked blocks; further, marking first and then the remaining symbols in an arbitrary order only extends or joins marked blocks. Thus, we obtain a sequence with marking number . In fact (see Appendix A), we have for . Notice that both Zimin words and -local words have an obvious palindromic structure. However, in the Zimin words the letters occur multiple times, but not in large blocks, while in -local words there are at most blocks of each letter. One can show (see Appendix A) that if is a palindrome, with or , and , then ( denotes the reversal of ).
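The block counts mentioned for Zimin words can be checked on small instances. The following sketch assumes the standard inductive definition $Z_1 = x_1$ and $Z_{i+1} = Z_i x_{i+1} Z_i$, encodes the letter $x_j$ as the integer $j$, and uses hypothetical function names; it only counts marked blocks and does not compute the locality number.

```python
def zimin(i):
    """Zimin word Z_i as a list of integers (letter x_j is encoded as j)."""
    word = [1]
    for letter in range(2, i + 1):
        word = word + [letter] + word
    return word


def count_blocks(word, marked):
    """Number of maximal contiguous runs of marked positions in word."""
    blocks, inside = 0, False
    for symbol in word:
        if symbol in marked and not inside:
            blocks += 1
        inside = symbol in marked
    return blocks


def marking_number(word, sequence):
    """Maximum number of marked blocks over all stages of the sequence."""
    marked, worst = set(), 0
    for symbol in sequence:
        marked.add(symbol)
        worst = max(worst, count_blocks(word, marked))
    return worst
```

Under these assumptions, `zimin(4)` has length 15, marking $x_1$ first creates 8 isolated blocks, and the sequence $(x_2, x_1, x_3, x_4)$ never exceeds 4 marked blocks.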
The number of occurrences of a letter alone is not always a good indicator of the locality of a word. The German word Einzelelement (basic component of a construction) has 5 occurrences of e, but is only $3$-local, as witnessed by the marking sequence (l,m,e,i,n,z,t). Nevertheless, a repetitive structure often leads to high locality. The Finnish word tutustuttu (perfect passive of tutustua, to meet) is nearly a repetition and $4$-local, while pneumonoultramicroscopicsilicovolcanoconiosis is an (English) -local word, and lentokonesuihkuturbiinimoottoriapumekaanikkoaliupseerioppilas is a -local (Finnish) word.
Complexity and Approximation: We briefly summarise the fundamentals of parameterised complexity [25, 18] and approximation . A parameterised problem is a decision problem with instances , where is the actual input and is the parameter. A parameterised problem is fixed-parameter tractable if there is an -algorithm for it, i.e., one that solves on input in time for a recursive function and a polynomial . We use the notation which hides multiplicative factors polynomial in .
A minimisation problem is a triple with being the set of instances, being a function that maps instances to the set of feasible solutions for , and being the objective function that maps pairs with and to a positive rational number. For every , we denote . For and , the value is the performance ratio of with respect to . An algorithm is an approximation algorithm for with ratio (or an -approximation algorithm, for short) if, for every , , and . We also let be of the form when the ratio depends on and ; in this case, we write . We further assume that the function is monotonically non-decreasing. Unless stated otherwise, all approximation algorithms run in polynomial time with respect to .
Pathwidth, Cutwidth and Problem Definitions: Let be a (multi)graph with the vertices . A cut of is a partition of into two disjoint subsets , ; the (multi)set of edges is called the cut-set or the (multi)set of edges crossing the cut, while and are called the sides of the cut. The size of this cut is the number of crossing edges, i.e., . A linear arrangement of the (multi)graph is a sequence , where is a permutation of . For a linear arrangement , let . For every , , we consider the cut of , and denote the cut-set (for technical reasons, we also set ). We define the cutwidth of by . Finally, the cutwidth of is the minimum over all cutwidths of linear arrangements of , i.e., .
A path decomposition (see ) of a connected graph is a tree decomposition whose underlying tree is a path, i.e., a sequence (of bags) with , , satisfying the following two properties:
Cover property: for every , there is an index , , with .
Connectivity property: for every , there exist indices and , , such that . In other words, the bags that contain occur on consecutive positions in .
The width of a path decomposition is , and the pathwidth of a graph is . A path decomposition is nice if and, for every , , either or , for some .
It is convenient to treat a path decomposition as a scheme marking the vertices of the graph based on the order in which the bags occur in the bag sequence. More precisely, all vertices are initially marked as . Then we process the bags one by one, as they occur in . When we process the first bag that contains a vertex , then becomes . When we process the last bag that contains , it becomes . The connectivity property enforces that vertices that are cannot be marked as again, while the cover property enforces that adjacent vertices must be both at some point. The width is the maximum number of vertices which are marked at the same time minus one. If the path decomposition is nice, then whenever a bag is processed as described above, we change the marking of exactly one vertex.
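The two properties of a path decomposition can be checked mechanically. The sketch below uses hypothetical helper names, represents bags as Python sets, and reads the cover property in its standard form (both endpoints of every edge occur together in some bag). It returns the width of a bag sequence, or `None` if the sequence is not a valid path decomposition.

```python
def path_decomposition_width(vertices, edges, bags):
    """Width of the bag sequence, or None if it is not a valid
    path decomposition of the graph (vertices, edges)."""
    if not bags:
        return None
    # Every vertex must occur in at least one bag.
    if any(all(v not in bag for bag in bags) for v in vertices):
        return None
    # Cover property: both endpoints of every edge share some bag.
    for u, v in edges:
        if not any(u in bag and v in bag for bag in bags):
            return None
    # Connectivity property: the bags containing v are consecutive.
    for v in vertices:
        indices = [i for i, bag in enumerate(bags) if v in bag]
        if indices[-1] - indices[0] + 1 != len(indices):
            return None
    # The width is the largest bag size minus one.
    return max(len(bag) for bag in bags) - 1
```

In the marking view, the connectivity check corresponds to the requirement that a vertex which has left the current bag never re-enters it.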
We next formally define the computational problems of computing the parameters defined above. By Loc, Cutwidth and Pathwidth, we denote the problems to check for a given word or graph and integer , whether , , and , respectively. Note that since we can assume that and , whether is given in binary or unary has no impact on the complexity. With the prefix Min, we refer to the minimisation variants. More precisely, , where is the set of words, is the set of all marking sequences for and (note that ); , where are all multigraphs, is the set of linear arrangements of , and (note that ); finally, , where are all graphs, is the set of path decompositions of , and (note that ).
3 Locality and Cutwidth
In this section, we introduce polynomial-time reductions from Loc to Cutwidth and vice versa. The established close relationship between these two problems lets us derive several complexity-theoretic and algorithmic results for Loc. We also discuss approximation-preserving properties of our reductions.
First, we show a reduction from Loc to Cutwidth. For a word and an integer , we build a multigraph whose set of nodes consists of symbols occurring in and two additional characters . The multiset of edges contains an edge between nodes for each occurrence of the factors and in , as well as edges between and , one edge between and the first letter of , and one edge between and the last letter of . An example is given in Figure 1.
The graph satisfies if and only if .
Suppose firstly that is -local, and let be an optimal marking sequence of . Consider the linear arrangement . Clearly,
Now consider a cut for . Every edge is of the form with , or of the form or . Consequently, every edge corresponds to a unique factor or of with and, after exactly the symbols are marked, is marked and is not, or to a unique factor or and, after exactly the symbols are marked, or is marked. Since there can be at most marked blocks in after marking the symbols , there are at most such factors, which means that . Thus . Note that any linear arrangement must at some point separate the nodes and , meaning , so we get that .
Now suppose that the cutwidth of is and let be an optimal linear arrangement witnessing this fact. Firstly, we note that must either start with followed by (i.e., have the form ) or end with preceded by (i.e., have the form ). Otherwise, since is connected, every cut separating and would be of size strictly greater than . Because a linear ordering and its mirror image have the same cutwidth, we may assume that the optimal linear arrangement has the form for some permutation of . Let be the marking sequence of induced by . Suppose, for contradiction, that for some , with , after marking , we have marked blocks. Furthermore, let and . For every marked block that is not a prefix or a suffix of , we have and and therefore . Moreover, for a marked prefix , we have and and therefore . Analogously, the existence of a marked suffix leads to . Consequently, for each marked block we have two unique edges in , which implies . This contradicts the assumption that is a witness that has cutwidth . Thus, must be -local as stipulated. ∎
In the following, we briefly discuss the complexity of this reduction. Suppose we are given a word and an integer . It is usual in string algorithmics to assume that is over an integer alphabet, i.e., . In this framework, the multigraph can be constructed in time (e.g., represented as a list of vertices and a list of edges).
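The reduction can be tried out on small words with the following sketch. Since the concrete symbols are not recoverable here, the node names `"$"` and `"#"` and the multiplicity of $2k-1$ parallel edges between them are assumptions, inferred from the proof (any cut separating the two extra nodes must have size at least $2k$, and an optimal arrangement keeps them adjacent at one end). All function names are hypothetical, the word must be non-empty and must not contain the two extra characters, and both the locality number and the cutwidth are computed by brute force.

```python
from itertools import permutations


def count_blocks(word, marked):
    """Number of maximal contiguous runs of marked positions in word."""
    blocks, inside = 0, False
    for symbol in word:
        if symbol in marked and not inside:
            blocks += 1
        inside = symbol in marked
    return blocks


def locality(word):
    """Locality number by brute force over all marking sequences."""
    best = len(word)
    for seq in permutations(sorted(set(word))):
        marked, worst = set(), 0
        for symbol in seq:
            marked.add(symbol)
            worst = max(worst, count_blocks(word, marked))
        best = min(best, worst)
    return best


def build_multigraph(word, k):
    """Nodes and edge list of the multigraph built from word and k.
    Assumption: 2k - 1 parallel edges join the extra nodes "$" and "#"."""
    # One edge per occurrence of a factor xy with x != y.
    edges = [(x, y) for x, y in zip(word, word[1:]) if x != y]
    edges += [("$", "#")] * (2 * k - 1)
    edges += [("$", word[0]), ("#", word[-1])]
    nodes = sorted(set(word)) + ["$", "#"]
    return nodes, edges


def cutwidth(nodes, edges):
    """Cutwidth by brute force over all linear arrangements."""
    best = len(edges)
    for arrangement in permutations(nodes):
        pos = {v: i for i, v in enumerate(arrangement)}
        width = 0
        for i in range(len(nodes) - 1):
            # Edges with one endpoint at position <= i and one beyond.
            crossing = sum(
                1 for u, v in edges
                if min(pos[u], pos[v]) <= i < max(pos[u], pos[v])
            )
            width = max(width, crossing)
        best = min(best, width)
    return best
```

For the word `abab`, which has locality number 2, the multigraph built with `k = 2` has cutwidth 4 under these assumptions, while the one built with `k = 1` has cutwidth larger than 2.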
If there is an -approximation algorithm for MinCutwidth running in time for an input multigraph with edges, then there is an -approximation algorithm for MinLoc running in time on an input word .
As already indicated in the proof of Lemma 2, for , every linear arrangement for naturally translates to a marking sequence for . However, in an approximate linear arrangement, the vertices and do not have to be at the first (or last) position. Still, the marking sequence corresponding to the linear arrangement cannot have more than blocks, since only a suffix and a prefix can be marked blocks that correspond to only one instead of two edges in a cut in . This observation remains valid if we do not include the extra vertices and in in the reduction. Let be the graph obtained from (for some ) by removing the extra vertices and (observe that this also removes the dependence on ). Removing vertices only decreases the cutwidth, so Lemma 2 implies that . Let be an instance of MinLoc and an -approximation for MinCutwidth on multigraphs.
The approximation algorithm run on returns a linear arrangement with . Let be the marking sequence corresponding to , then . The performance ratio is at most , where is the number of edges in . For the optimum value , the cutwidth of is at least and has performance ratio at most (measured with respect to the optimum value for MinLoc).
The overall approximation procedure builds the graph in , runs on in and translates the linear arrangement into a marking sequence in . This gives an -approximation for MinLoc with running time in . ∎
For a reduction from Cutwidth to Loc, let be a connected multigraph, where is the set of nodes and the multiset of edges (for technical reasons, we assume ). Let be the multigraph obtained by duplicating every edge in . As such, each node in has even degree, so there exists an Eulerian cycle (i.e., a cycle visiting each edge exactly once) in , and, moreover, . For each edge , let be the word over that corresponds to an arbitrary traversal of the Eulerian path obtained from by deleting ; see Figure 2 for an example.
For any edge in , the word satisfies . Moreover, there is a vertex such that for every edge incident to .
Let . Note that there is a natural bijection between the linear arrangements of and the marking sequences of the word , since they both are essentially permutations of , i. e., for a permutation of , we can interpret both as the linear arrangement for and as a marking sequence of induced by . In the following, let be a permutation of , let , and , and let (note that since every edge has been duplicated, we can guarantee that the size of every cut of is even).
Now consider after marking the letters . For every marked block that is not a prefix or a suffix of , we have and and therefore . Moreover, for a marked prefix , we have and and therefore . Analogously, the existence of a marked suffix leads to .
Conversely, for every edge in , with the exception of (if is in at all), there is a unique factor of such that either is marked and is unmarked, or vice-versa. Thus, if all marked blocks are internal, i.e., no marked block is a prefix or a suffix, then there are exactly marked blocks. Also, if both a prefix and a suffix occur as marked blocks, then we have marked blocks. Finally, if a prefix occurs as a marked block, but no suffix, or vice-versa, then there are only marked blocks; note that in this case we must have . Since we consider all permutations, the arguments above are sufficient to conclude that, in our setting, each has locality number either or .
Furthermore, consider a linear ordering of which is optimal, i.e., . Note that if either the first or last letter of is the last letter to be marked according to the marking sequence induced by the linear ordering , the case that both a suffix and prefix of are marked cannot be reached until and the entire word is marked. Consequently, this would imply that has locality number . For any permutation of the linear ordering , this holds for where is an edge adjacent to the node , since the path obtained by removing such an edge from must start or end with . ∎
The resulting Turing reduction from Cutwidth to Loc is performed in time, where is the number of vertices and is the number of edges of the input multigraph: First, the graph and its Eulerian cycle are constructed in time. Then, for each vertex, we select an arbitrary incident edge and build the word of length .
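The construction can be sketched as follows. The code duplicates every edge, computes an Eulerian cycle with Hierholzer's algorithm, and produces one word per deleted edge of the cycle. It then allows a brute-force comparison between the cutwidth of the input graph and the locality numbers of these words. All function names are hypothetical, the input is assumed to be a connected multigraph without self-loops, and the brute-force routines are only suitable for very small inputs.

```python
from itertools import permutations


def eulerian_cycle(edges, start):
    """Hierholzer's algorithm; returns a closed walk as a vertex list."""
    adj = {}
    for idx, (u, v) in enumerate(edges):
        adj.setdefault(u, []).append((v, idx))
        adj.setdefault(v, []).append((u, idx))
    used = [False] * len(edges)
    stack, cycle = [start], []
    while stack:
        v = stack[-1]
        # Discard edges already traversed from the other endpoint.
        while adj[v] and used[adj[v][-1][1]]:
            adj[v].pop()
        if adj[v]:
            w, idx = adj[v].pop()
            used[idx] = True
            stack.append(w)
        else:
            cycle.append(stack.pop())
    return cycle


def euler_path_words(edges, start):
    """One word per edge of the Eulerian cycle of the doubled graph:
    the vertex sequence of the path left after deleting that edge."""
    cycle = eulerian_cycle(edges + edges, start)[:-1]
    return [cycle[j + 1:] + cycle[:j + 1] for j in range(len(cycle))]


def count_blocks(word, marked):
    """Number of maximal contiguous runs of marked positions in word."""
    blocks, inside = 0, False
    for symbol in word:
        if symbol in marked and not inside:
            blocks += 1
        inside = symbol in marked
    return blocks


def locality(word):
    """Locality number by brute force over all marking sequences."""
    best = len(word)
    for seq in permutations(sorted(set(word))):
        marked, worst = set(), 0
        for symbol in seq:
            marked.add(symbol)
            worst = max(worst, count_blocks(word, marked))
        best = min(best, worst)
    return best


def cutwidth(nodes, edges):
    """Cutwidth by brute force over all linear arrangements."""
    best = len(edges)
    for arrangement in permutations(nodes):
        pos = {v: i for i, v in enumerate(arrangement)}
        width = max(
            sum(1 for u, v in edges
                if min(pos[u], pos[v]) <= i < max(pos[u], pos[v]))
            for i in range(len(nodes) - 1)
        )
        best = min(best, width)
    return best
```

On a triangle, for example, every word produced by `euler_path_words` has length 6, and the smallest locality number among these words coincides with the cutwidth of the triangle.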
If there is an -approximation algorithm for MinLoc running in time on a word , then there is an -approximation algorithm for MinCutwidth running in time on a multigraph with vertices and edges.
Let be an instance of MinCutwidth and an -approximation for MinLoc. By Lemma 4, there exists a vertex such that holds for any edge adjacent to . The approximation algorithm hence returns on input a marking sequence with .
In the proof of Lemma 4 it is further shown that any marking sequence for translates to a linear arrangement for with . The performance ratio of this linear arrangement is .
The procedure which, for each vertex , constructs for some adjacent to in , runs in , checks the resulting linear arrangement in , and returns the best linear arrangement among all , yields an -approximation for MinCutwidth on multigraphs in . ∎
Consequences: In the following, we overview a series of complexity-theoretic and algorithmic consequences of the reductions provided above. We first discuss negative results and note that we can close one of the main problems left open in .
The Loc problem is NP-complete.
Lemma 4 shows a polynomial-time Turing reduction from Cutwidth to Loc. Indeed, given a (multi)graph we construct in linear time the multigraph by duplicating its edges. has an Eulerian cycle, so, using Hierholzer’s algorithm, we can compute such a cycle in linear time. Let be the computed Eulerian cycle. For each edge of , construct, in linear time, the word as described before Lemma 4. By Lemma 4 we get that . This completes the reduction, and, thus, as Cutwidth is NP-hard (see, e. g., ), we get that Loc is also NP-hard. As Loc clearly belongs to NP, the result follows. ∎
Theorem 6 follows from the Turing reduction from Cutwidth to Loc, but it can also be proved using a polynomial-time one-to-many reduction from the well known NP-complete problem Clique. This alternative approach (given in Appendix B) is more technically involved, but has the merit of emphasising how the combinatorial properties of the locality number can be used to construct computationally hard instances of Loc. Moreover, by the word-combinatorial observations about locality made in Section 2, it is clear that Loc is NP-complete also for words with special structure, e.g., palindromes and repetitions.
With respect to approximation, it is known that, assuming the Small Set Expansion Conjecture (denoted SSE; see ), there exists no constant-ratio approximation for MinCutwidth (see ). Consequently, approximating MinLoc within any constant factor is also SSE-hard. In particular, we point out that stronger inapproximability results for MinCutwidth are not known. Positive approximation results for MinLoc will be discussed in Section 4.
On certain graph classes, the SSE conjecture is equivalent to the Unique Games Conjecture (see [41, 42]), which, in turn, was used to show that many approximation algorithms are tight and is considered a major conjecture in inapproximability. However, some works seem to provide evidence that could lead to a refutation of SSE; see [3, 6, 29]. In this context, we show in Section 4 a series of unconditional results which state that multiple natural greedy strategies do not provide low-ratio approximation of MinLoc.
As formally stated next, Lemma 2 extends algorithmic results for computing cutwidth to determining the locality number (we formulate this result so that it also covers fpt-algorithms with respect to the standard parameters and ). Note that the maximum degree in a multigraph is bounded from above by , so the number of nodes and the number of edges satisfy . Hence, we state the complexity in terms of and rather than with respect to , which is the actual input size.
If there is an algorithm solving MinCutwidth (resp., Cutwidth) in time for a multigraph with vertices, then there is an algorithm solving MinLoc (resp., Loc) in time for a word over an alphabet .
We only show the claim for MinCutwidth; the case of Cutwidth follows immediately from Lemma 2. Our goal is to compute for the word , i.e., the minimum such that is -local. By Lemma 2, we get for and for . Consider a multigraph obtained by removing the vertices and from (the result does not depend on ), and observe that . Indeed, if , we add the two missing nodes and (in this order) as a prefix to an optimal linear arrangement for and get a linear arrangement of of width , a contradiction.
Hence, in order to determine , we proceed as follows: Compute and iterate over integers , , in increasing order, checking if . The first value for which this equality holds equals , and the marking sequence induced by the respective linear arrangement of is an optimal one for (as proved in Lemma 2). ∎
In particular, we can draw the following corollaries using Lemma 7 and known results from the literature. Due to the algorithms of , which also work for multigraphs (these algorithms actually support weighted graphs without any major modification and with the same complexity; in this setting, parallel edges connecting two vertices are replaced by a single “super-edge” whose weight is the number of parallel edges), MinLoc can be solved in time and space, or in time and polynomial space. In particular, this also implies that Loc is fixed-parameter tractable with respect to the alphabet size. Moreover, the fpt-algorithm from directly implies that MinLoc is fixed-parameter tractable for parameter with linear fpt-running-time . Since Cutwidth is NP-complete already for graphs with maximum degree (see ), we also derive a stronger statement compared to Theorem 6: Loc is NP-complete even if every symbol has at most occurrences; if every symbol has at most occurrences, the complexity of Loc is open, while the case where every symbol has only one occurrence is trivial. If, on the other hand, the symbols have many occurrences in comparison to , i.e., , then Loc can be solved in polynomial time, e.g., using the -time algorithm mentioned above.
4 Locality and Pathwidth
In this section, we consider the approximability of the minimisation problem MinLoc. Since a marking sequence is just a linear arrangement of the symbols of the input word, this problem seems to be well tailored to greedy algorithms: until all symbols are marked, we choose an unmarked symbol according to some greedy strategy and mark it. There are two aspects that motivate the investigation of such approaches. Firstly, ruling out simple strategies is a natural initial step in the search for approximation algorithms for a new problem. Secondly, due to the results of Section 3, the obvious greedy approaches for computing the locality number may also provide a new angle to approximating the cutwidth of a graph, i.e., some greedy strategies may only become apparent in the locality number point of view and hard to see in the graph formulation of the problem. Given the fact that, as formally stated later as Theorem 14, approximating the cutwidth via approximation of the locality number does in fact improve the best currently known cutwidth approximation ratio, this seems to be a rather important aspect.
Unfortunately, we can formally show that many natural candidates for greedy strategies fail to yield promising approximation algorithms (and are therefore also not helpful for cutwidth approximation). We only briefly mention these negative results; all details are provided in Appendix C. The four basic strategies considered are the following: (1) prefer symbols with few occurrences, (2) prefer symbols with many occurrences, (3) prefer symbols leading to fewer blocks after marking, (4) prefer symbols with an earlier leftmost occurrence. All these strategies fail in the sense that there are arbitrarily long (condensed) words with constant locality numbers for which these strategies yield marking sequences with marking numbers .
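To make the flavour of such greedy approaches concrete, here is a minimal sketch of strategy (3): at every step it marks an unmarked symbol that leads to the fewest marked blocks. The function names are hypothetical, and breaking ties by the sorted order of the symbols is an assumption of this sketch, not something prescribed by the strategy. It only illustrates the strategy and does not reproduce the counterexample words from Appendix C.

```python
def count_blocks(word, marked):
    """Number of maximal contiguous runs of marked positions in word."""
    blocks, inside = 0, False
    for symbol in word:
        if symbol in marked and not inside:
            blocks += 1
        inside = symbol in marked
    return blocks


def greedy_fewest_blocks(word):
    """Marking sequence that always marks a symbol leading to the fewest
    marked blocks (ties broken by sorted symbol order, an assumption)."""
    unmarked = sorted(set(word))
    marked, sequence = set(), []
    while unmarked:
        # min() returns the first minimiser, i.e. the smallest symbol on ties.
        choice = min(unmarked, key=lambda s: count_blocks(word, marked | {s}))
        unmarked.remove(choice)
        marked.add(choice)
        sequence.append(choice)
    return sequence


def marking_number(word, sequence):
    """Maximum number of marked blocks over all stages of the sequence."""
    marked, worst = set(), 0
    for symbol in sequence:
        marked.add(symbol)
        worst = max(worst, count_blocks(word, marked))
    return worst
```

On the word `radar`, this sketch marks `d`, then `a`, then `r`, which happens to be optimal; the negative results stated above concern specially constructed words on which such choices go wrong.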
A more promising approach is to choose among symbols that extend at least one already marked block (except when marking the first symbol). We denote this strategy by and marking sequences that can be obtained by it are called -marking sequences. Intuitively, marking a symbol that has only isolated occurrences, and therefore will increase the current number of marked blocks by the number of its occurrences, seems a bad choice. This raises a general question whether every word has a -marking sequence that is also optimal for this word. We answer this question negatively: all -marking sequences for words like achieve a marking number of at least , while first marking in this order (which all have only isolated occurrences), then , and then the rest of the symbols in some order, yields at most marked blocks. However, this only shows a lower bound of roughly for the approximation ratio of algorithms based on , so might still be a promising candidate. However, in order to devise a -based approximation algorithm, we still face the problem of deciding which of the extending symbols should be chosen; trying out all of them is obviously too costly. Unfortunately, if we handle this decision by one of the basic strategies (1)–(4) from above, e.g., choosing among all extending symbols one that leads to fewer new blocks, we again end up with poor approximation ratios. More precisely, we can again find arbitrarily long words with constant locality numbers for which these algorithms yield marking numbers . Moreover, this is also true if we choose among all extending symbols one that has a maximum number of extending occurrences or one that maximises the ratio .
While we obviously have not investigated all reasonable greedy strategies, we consider our negative results as sufficient evidence that a worthwhile approximation algorithm for computing the locality number most likely does not follow from such simple greedy strategies.
In the following, we adopt a more sophisticated approach of approximating the locality number: we devise a reduction to the problem of computing the pathwidth of a graph. To this end, we first have to describe how a (condensed) word can be represented as a graph: For a condensed word