Minimizing LR(1) State Machines is NP-Hard

10/02/2021
by   Wuu Yang, et al.

LR(1) parsing has been a focus of extensive research over the past 50 years. Though most fundamental mysteries have been resolved, a few remain hidden in the dark corners. The one we bumped into is the minimization of LR(1) state machines, which we prove is NP-hard. We reduce the node-coloring problem to the minimization problem. The reduction makes use of two techniques: indirect reduction and incremental construction. Indirect reduction means the graph to be colored is not reduced to an LR(1) state machine directly. Instead, it is reduced to a context-free grammar from which an LR(1) state machine is derived. Furthermore, by considering the nodes in the graph to be colored one at a time, the context-free grammar is incrementally extended from a template context-free grammar for a two-node graph. The extension is done by adding new grammar symbols and rules. A minimized LR(1) machine can be used to recover a minimum coloring of the original graph.


1. Introduction

Parsing is a basic step in every compiler and interpreter. LR parsers are powerful enough to handle almost all practical programming languages [12]. The downside of LR parsers is the huge table size. This caused the development of several variants, such as LALR parsers, which require significantly smaller tables at the expense of reduced capability.

Figure 1. The LR(1) machine of a grammar.

The core of an LR(1) parser is a deterministic finite state machine. The LALR(1) state machine may be obtained by merging every pair of similar states in the LR(1) machine [9]. (Two states are similar if and only if they become identical when the look-ahead sets in their items are ignored.) In case conflicts occur due to merging, the parser is forced to revert to the larger, original LR(1) machine. (Only reduce-reduce conflicts may occur due to merging similar states.) A practical advantage of LALR(1) grammars is that their state machines are much smaller than the original LR(1) machines.

If every pair of similar states in an LR(1) state machine are merged together (ignoring conflicts), the resulting LR(1) state machine would be isomorphic to the LR(0) machine [9][2]. Due to the significant size difference between the two state machines, we know there are many pairs of similar states in an LR(1) machine. If any pair of similar states may cause conflicts, the parser will be forced to use the much larger LR(1) machine. It would be more reasonable to merge some, but not all, pairs of similar states [17]. The result, called an extended LALR(1) state machine, would be a state machine that is smaller than the LR(1) machine but larger than the LALR(1) machine.

For example, there are five pairs of similar states in the LR(1) machine in Figure 1. Only three of these pairs can be merged. One pair cannot be merged due to a (reduce-reduce) conflict. The last pair cannot be merged because the states of another pair are not merged; merging it anyway would make the resulting machine nondeterministic. Figure 2 is the corresponding (minimum) LR(1) machine.

Figure 2. The corresponding minimum LR(1) machine for Figure 1.

In general, two states in an LR(1) machine can be merged if and only if the following two conditions are satisfied:

  1. The two states must be similar;

  2. Corresponding successor states of the two states must have already been merged.

A further question is whether there is an efficient algorithm that can merge the largest number of similar states, thus producing a minimum LR(1) state machine. That is, we wish to minimize the LR(1) state machine. Since the number of similar states is finite, a naïve approach is to try all possibilities.

Our study shows that minimizing the LR(1) state machine is an NP-hard problem. We reduce the node-coloring problem to this minimization problem. Starting from an (undirected) graph to be node-colored, we construct a context-free grammar. Then the LR(1) machine of the context-free grammar is derived. We can use an algorithm to calculate the corresponding minimum LR(1) machine.

In order to recover a minimum coloring from this minimum LR(1) machine, we can perform one more easy step. In the LR(1) machine, every state that is not similar to any other state is removed, leaving only similar states. Then an edge between two similar states is added if the two similar states may cause conflicts. The resulting machine is called a conflict graph. Merging similar states in the LR(1) machine is essentially identical to merging states in the conflict graph. (Due to the construction of the grammar, all states in the resulting graph are similar to one another in the LR(1) machine.) Furthermore, the conflict graph is actually isomorphic to the original color graph. From the minimum LR(1) machine, it is straightforward to recover a minimum coloring.

The following theorem seems obvious but we wish to bring it to the reader’s attention when reading this paper:

Theorem. Let $s_1$, $s_2$, and $s_3$ be three states in an LR(1) machine. If no two of the three states conflict, then merging all three states will not create any conflicts.

Due to the above theorem, we need to consider only pairs, not triples, quadruples, etc., of similar states. This greatly simplifies our discussion.
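The reasoning behind the theorem can be sketched as follows: a conflict in a merged state means that some terminal lies in the merged lookahead sets of two different items, and each of the two occurrences comes from one of the original states, so the same conflict already shows up in a single state or in the merge of just two states. The Python sketch below checks this exhaustively on a tiny universe. The item strings, the three terminals, and all function names are illustrative assumptions, not notation from the paper.

```python
from itertools import combinations, product

TERMINALS = ("a", "b", "c")                 # hypothetical terminals
ITEMS = ("X ::= @ .", "Y ::= @ .")          # two completed items, dot at the end


def conflicts(state):
    """True if two different completed items share a lookahead terminal."""
    return any(state[p] & state[q] for p, q in combinations(state, 2))


def merge_all(states):
    """Merge similar states by uniting lookahead sets item by item."""
    return {item: frozenset().union(*(s[item] for s in states)) for item in ITEMS}


def all_states():
    """Every conflict-free state over ITEMS with non-empty lookahead sets."""
    subsets = [frozenset(c) for r in range(1, len(TERMINALS) + 1)
               for c in combinations(TERMINALS, r)]
    for lookaheads in product(subsets, repeat=len(ITEMS)):
        state = dict(zip(ITEMS, lookaheads))
        if not conflicts(state):
            yield state


def theorem_holds():
    """Check: pairwise conflict-free triples never conflict when merged together."""
    states = list(all_states())
    for triple in product(states, repeat=3):
        pairwise_ok = all(not conflicts(merge_all(pair))
                          for pair in combinations(triple, 2))
        if pairwise_ok and conflicts(merge_all(triple)):
            return False
    return True


print(theorem_holds())
```

This is a finite sanity check on one small universe, not a proof; the argument in the lead-in is what generalizes.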

Note also that there might be more than one minimum LR(1) machine for a given LR(1) machine.

LR parsers were first introduced by Knuth [12]. Since LR parsers are considered the most powerful and efficient practical parsers, much effort has been devoted to related research and implementation [1],[3],[7],[11],[13],[15].

The canonical LR(1) parsers make use of big state machines. For some LR(1) state machines, every pair of similar states may be merged. These are called LALR(1) state machines (or LALR(1) parsers) [6]. One way to generate LALR(1) machines is to merge all pairs of similar states. On the other hand, since an LALR(1) machine is isomorphic to the corresponding LR(0) machine, many algorithms are proposed to add the look-ahead information to the states in the LR(0) machine in order to obtain the LALR(1) machine [3],[8],[15]. None of these attempted to parse LR(1)-but-non-LALR(1) grammars.

It is known that every language that admits an LR($k$) parser also admits an LALR(1) parser [13]. To parse an LR(1)-but-non-LALR(1) grammar, there have been four approaches: (1) use the much larger LR(1) parser; (2) add ad hoc rules to the LALR(1) parser to resolve conflicts, similar to what yacc [11] does; (3) merge some, but not all, pairs of similar states [17]; and (4) transform the grammar into LALR(1) and then generate a parser. The transformation approach may exponentially increase the number of production rules [13] and the transformed grammar is usually difficult to understand. This paper shows that, although we wish to merge as many pairs of similar states as possible, this optimization problem is NP-hard.

Pager proposed two methods: “practical general method” (PGM) [15] and “lane-tracing method” [14] [16]. The PGM method conceptually starts from the complete LR(1) machine and attempts to merge similar states. In constructing the LR(1) machine, when a new state is generated, PGM attempts to merge the new state with an existing state if a (strong or weak) compatibility test is passed. The compatibility test essentially determines if further derivation of the LR(1) machine will cause conflicts. On the other hand, the lane-tracing method starts from LR(0) machine. When a conflict is detected, the relevant states are split in order to eliminate the conflicts. Splits continue until all conflicts are resolved. Chen [4] actually implements Pager’s two methods as well as other improvements, such as unit-production elimination. Even though we proved the minimization problem is NP-hard, it is still important to build practical parser generators in the LR family. Practically, Pager and Chen’s work is one of the best existing LR parser generators.

The IELR method [5] includes additional capability to eliminate conflicts even if the grammar is not LR. For instance, IELR can handle certain ambiguous grammars, similar to yacc.

Both [15] and [5] attempt to find a minimal machine. However, minimal simply means “very small” or “locally minimum” rather than “globally minimum”[5]. This is different from our study of minimization.

The remainder of this paper is organized as follows. Section 2 will introduce the terminology and background. Section 3 introduces a reduction algorithm that translates an undirected graph into a context-free grammar and discusses the reduction of the coloring problem to the minimization problem. The last section concludes this paper.

Figure 3. Two cases for a color graph with two nodes and four cases for a color graph with three nodes.

2. Terminology and Background

A grammar consists of a non-empty set of nonterminals $N$, a non-empty set of terminals $T$, a non-empty set of production rules, and a special nonterminal, which is called the start symbol. We assume that $N \cap T = \emptyset$. A production rule has the form

A ::= $\gamma$

where A is a nonterminal and $\gamma$ is a (possibly empty) string of nonterminals and terminals. We use the production rules to derive a string of terminals from the start symbol.

Figure 4. The LR(1) machine for the graph in Figure 3 (b). Note that there is a single conflict. The empty boxes are not part of this machine. They are used for comparison with later machines.

LR(1) parsing is based on a deterministic finite state machine, called the LR(1) machine. A state in the LR(1) machine is a non-empty set of items. An item has the form (A ::= $\gamma_1$ • $\gamma_2$, $\ell$), where A ::= $\gamma_1 \gamma_2$ is one of the production rules, • indicates a position in the string $\gamma_1 \gamma_2$, and $\ell$ (the lookahead set) is a set of terminals that could follow the nonterminal A in later derivation steps. The algorithm for constructing the LR(1) state machine for a grammar is explained in most compiler textbooks, for example, [2][9]. An example state machine is shown in Figure 1.

Two states in the LR(1) machine are similar if they have the same number of items and the corresponding items differ only in the lookahead sets. For example, Figure 1 shows two similar states, each of which contains three items. Similarity among states is an equivalence relation. On the other hand, the conflict relation between states is not transitive.
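Because similarity is an equivalence relation, similar states can be found by grouping states on their items with the lookahead sets ignored. The following is a minimal Python sketch under an assumed encoding: a state is a dictionary from an item string (with "." marking the position) to its lookahead set. The state names p, q, r and the item strings are hypothetical.

```python
from collections import defaultdict


def core(state):
    """The items of a state with the lookahead sets ignored."""
    return frozenset(state)


def similar(s, t):
    """Two states are similar when their items differ only in lookahead sets."""
    return core(s) == core(t)


def similarity_classes(states):
    """Group state names by core; each group is one equivalence class."""
    classes = defaultdict(list)
    for name, state in states.items():
        classes[core(state)].append(name)
    return sorted(sorted(group) for group in classes.values())


# Hypothetical states: p and q share the same items, r does not.
states = {
    "p": {"X ::= @ .": frozenset({")"}), "Y ::= @ .": frozenset({"b"})},
    "q": {"X ::= @ .": frozenset({"c"}), "Y ::= @ .": frozenset({")"})},
    "r": {"S ::= 1 X . )": frozenset({"$"})},
}

print(similarity_classes(states))
```

Grouping by core takes one pass over the states, which is why finding similar states is cheap; the hard part studied in this paper is deciding which of them to merge.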

Figure 5. The LR(1) machine for the graph in Figure 3 (a). There is no conflict in the machine. The empty boxes are not part of this machine. They are used for comparison with later machines.

LR(1) state machines are closely related to LR(0) state machines. However, an LR(1) machine is much larger than the corresponding LR(0) machine because many similar states are introduced. In order to reduce the size of the LR(1) state machine, some or all pairs of similar states may be merged as long as no conflicts occur. To merge two similar states, we use the same items as in the original states, except that the lookahead set of an item is the union of the lookahead sets of the two corresponding items in the original states. For example, LALR(1) machines are obtained from LR(1) machines by merging every pair of similar states.
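The merge operation described above can be sketched in a few lines of Python. The dictionary encoding of a state (item string to lookahead set) and the function name merge are assumptions made for illustration; the two example states mirror the pair of similar states produced by the grammar with no edge in Table 1.

```python
def merge(s, t):
    """Merge two similar states: same items, lookahead sets united item by item."""
    if s.keys() != t.keys():
        raise ValueError("only similar states can be merged")
    return {item: s[item] | t[item] for item in s}


# Lookaheads come from the rules S ::= 1Xa, 1Yb and S ::= 2Xc, 2Yd.
s1 = {"X ::= @ .": {"a"}, "Y ::= @ .": {"b"}}
s2 = {"X ::= @ .": {"c"}, "Y ::= @ .": {"d"}}

merged = merge(s1, s2)
print(merged)
```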

(no edge)      (one edge)
P ::= S$       P ::= S$
S ::= 1Xa      S ::= 1X)
S ::= 1Yb      S ::= 1Yb
S ::= 2Xc      S ::= 2Xc
S ::= 2Yd      S ::= 2Y)
X ::= @        X ::= @
Y ::= @        Y ::= @

Table 1. The grammar on the left corresponds to the color graph in Figure 3 (a). The grammar on the right corresponds to the color graph in Figure 3 (b).

Sometimes merging two similar states may create a (parsing) conflict. For example, when the two similar states in Figure 4 are merged, the resulting state will contain two items: (X ::= @ •, { “)”, c }) and (Y ::= @ •, { b, “)” }). Because the terminal “)” appears in the lookahead sets of two different items in which • appears at the end of the right-hand side, this is a parsing conflict.
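The conflict test in this example can be written down directly. The sketch below is a hedged illustration: the state encoding (item string to lookahead set) and the function names are assumptions, and only reduce-reduce conflicts between completed items are checked, which is the only kind that merging similar states can introduce.

```python
from itertools import combinations


def reduce_reduce_conflicts(state):
    """Terminals shared by the lookahead sets of two different completed items."""
    completed = [la for item, la in state.items() if item.endswith(".")]
    shared = set()
    for la1, la2 in combinations(completed, 2):
        shared |= la1 & la2
    return shared


def merge_conflicts(s, t):
    """Conflicting terminals that appear when similar states s and t are merged."""
    merged = {item: s[item] | t[item] for item in s}
    return reduce_reduce_conflicts(merged)


# The two similar states of the grammar with one edge (right column of Table 1).
one_edge = ({"X ::= @ .": {")"}, "Y ::= @ .": {"b"}},
            {"X ::= @ .": {"c"}, "Y ::= @ .": {")"}})
# The two similar states of the grammar with no edge (left column of Table 1).
no_edge = ({"X ::= @ .": {"a"}, "Y ::= @ .": {"b"}},
           {"X ::= @ .": {"c"}, "Y ::= @ .": {"d"}})

print(merge_conflicts(*one_edge))
print(merge_conflicts(*no_edge))
```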

The aim of minimizing an LR(1) machine is to merge as many pairs of similar states as possible without causing conflicts. Our study shows that this minimization problem is NP-hard.

3. Reduction

We prove that minimizing LR(1) machines is an NP-hard problem by reducing the node-coloring problem to this minimization problem. Specifically, from a graph to be colored, we construct a context-free grammar. Then the LR(1) state machine is derived from the grammar. An algorithm is used to calculate the minimum state machine, from which a minimum coloring can be recovered.

In order to recover a minimum coloring, the LR(1) machine can be simplified by removing every state that is not similar to any other state, resulting in a conflict graph. Merging similar states in the conflict graph is essentially identical to finding a minimum coloring of the original graph.

We define a node-coloring of a graph as a partition of the set of nodes in the color graph satisfying the requirement that nodes connected by an edge cannot be in the same partition block. A minimum coloring is a partition with the fewest blocks. Similarly, a merge scheme of an LR(1) state machine is a partition of the states satisfying the requirement that states in the same partition block are similar to one another and do not conflict with one another. A minimum merge scheme is a partition with the fewest blocks. Minimizing an LR(1) machine means finding a minimum merge scheme of the machine.
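Both definitions ask for a smallest partition in which certain pairs may not share a block: edges for a coloring, conflicting pairs of similar states for a merge scheme. The naive approach of trying all possibilities can be sketched as below. This is an exponential-time illustration only; the function name and the 0-based item numbering are assumptions.

```python
from itertools import product


def minimum_partition(n, bad_pairs):
    """Smallest partition of items 0..n-1 in which no bad pair shares a block.

    Exhaustive search over block labels, trying 1 block, then 2, and so on.
    """
    for k in range(1, n + 1):
        for labels in product(range(k), repeat=n):
            if all(labels[a] != labels[b] for a, b in bad_pairs):
                blocks = {}
                for item, label in enumerate(labels):
                    blocks.setdefault(label, []).append(item)
                return list(blocks.values())
    return []


# A path on three nodes needs two blocks; a triangle needs three.
print(minimum_partition(3, [(0, 1), (0, 2)]))
print(minimum_partition(3, [(0, 1), (0, 2), (1, 2)]))
```

Checking only pairs is enough here because, by the theorem in the introduction, states that are pairwise conflict-free can all be merged together.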

(a)            (b)            (c)            (d)
P ::= S$       P ::= S$       P ::= S$       P ::= S$
S ::= 1Xa      S ::= 1X)      S ::= 1X)      S ::= 1X)
S ::= 1Yb      S ::= 1Yb      S ::= 1Yb      S ::= 1Yb
S ::= 1Ze      S ::= 1Ze      S ::= 1Z=      S ::= 1Z=
S ::= 1Vf      S ::= 1Vf      S ::= 1Vf      S ::= 1Vf
S ::= 2Xc      S ::= 2Xc      S ::= 2Xc      S ::= 2Xc
S ::= 2Yd      S ::= 2Y)      S ::= 2Y)      S ::= 2Y)
S ::= 2Zg      S ::= 2Zg      S ::= 2Zg      S ::= 2Z=
S ::= 2Vh      S ::= 2Vh      S ::= 2Vh      S ::= 2Vh
S ::= 3Xi      S ::= 3Xi      S ::= 3Xi      S ::= 3Xi
S ::= 3Yj      S ::= 3Yj      S ::= 3Yj      S ::= 3Yj
S ::= 3Zk      S ::= 3Zk      S ::= 3Zk      S ::= 3Zk
S ::= 3Vm      S ::= 3Vm      S ::= 3V=      S ::= 3V=
X ::= @        X ::= @        X ::= @        X ::= @
Y ::= @        Y ::= @        Y ::= @        Y ::= @
Z ::= @        Z ::= @        Z ::= @        Z ::= @
V ::= @        V ::= @        V ::= @        V ::= @

Table 2. The four grammars constructed from graphs with 3 nodes by our algorithm. (a) is for graphs with no edges (Figure 3 (c)); (b) is for graphs with one edge (Figure 3 (d)); (c) is for graphs with two edges (Figure 3 (e)); and (d) is for graphs with three edges (Figure 3 (f)). The production rules are classified into five categories. In particular, the boxed production rules in (c) are new rules added to the grammar in the right column of Table 1.

We build a context-free grammar for a given color graph inductively. The LR(1) machine is then derived from the grammar. We did not construct the LR(1) machines directly because context-free grammars are easier to generate.

Assume the color graph has $n$ nodes. Then the constructed LR(1) machine has $n$ states that are similar to one another; the remaining states are distinct and can be ignored in the discussion of merging similar states. There is a 1-1 correspondence between the nodes in the color graph and the similar states in the machine. We claim that the machine satisfies the following property:

The color graph may be colored with $k$ colors if and only if $n-k$ pairs of similar states in the machine may be merged (so that only $k$ similar states remain).

If there is any algorithm that can calculate the minimum LR(1) machine by merging certain pairs of similar states, we can use that algorithm to solve the node-coloring problem: for all states that are merged into a single state in the minimum machine, their corresponding nodes in the color graph have the same color.

Due to the above property, we have successfully reduced the node-coloring problem to the minimization problem. Because the node-coloring problem is NP-hard [10], the minimization problem is also NP-hard.

To construct a context-free grammar from the color graph, we first choose two arbitrary nodes $v_1$ and $v_2$. There are two cases, shown in Figure 3 (a) and (b): there is either no edge or one edge between $v_1$ and $v_2$. Then one of the grammars in Table 1 is selected.

Assume that there is an edge between the two chosen nodes. Then the grammar on the right in Table 1 is selected. The corresponding LR(1) machine is shown in Figure 4, in which there are two similar states. Merging the two similar states will cause a conflict due to the terminal symbol “)”. The grammar is carefully constructed so that the conflict corresponds to the edge between the two nodes. (In our constructed grammars, the numbers, such as 1, 2, 14, 27, etc., are terminals; each indicates a similar state in the resulting LR(1) machine and the order in which the corresponding nodes are chosen. The upper-case English letters, such as A, B, etc., denote nonterminals. The lower-case English letters, such as a, b, etc., denote terminals that are used only once in the grammar. These lower-case letters will not cause conflicts. The punctuation marks, such as “)” and “=”, are terminals that will cause conflicts.)

On the other hand, if the two chosen nodes are not connected, the grammar on the left in Table 1 will be selected. Figure 5 is the LR(1) machine for that grammar. There are two similar states in that machine. The two similar states can be merged without conflicts. This corresponds to the fact that the two nodes in Figure 3 (a) can have the same color since there is no edge connecting them.

Note that the notion of “two (similar) states can be merged” in the LR(1) machine is closely related to the notion of “two nodes can have the same color” due to our construction.

The remaining nodes in the color graph are chosen one by one in an arbitrary order. By adding one node at a time, we gradually construct a sequence of grammars, one for each added node.

1.    Nodes in the color graph are listed as v_1, v_2, ..., v_n;
2.    if there is an edge (v_1, v_2) then G := the right grammar in Table 1;
3.    else G := the left grammar in Table 1;
4.    NewNonTerm := { X, Y };
5.    for i := 3 to n do
6.       generate two new nonterminals, called Z_i and V_i;
7.       generate two new terminals, called a_i and b_i;
8.       generate four production rules:
9.          (“S ::= i Z_i a_i”) and (“S ::= i V_i b_i”) and
10.         (“Z_i ::= @”) and (“V_i ::= @”);
11.      for each nonterminal A ∈ NewNonTerm do
12.         generate a new terminal, called c;
13.         generate a production rule: (“S ::= i A c”);
14.      end;
15.      for j := 1 to i − 1 do
16.         generate two new terminals, called d and e;
17.         generate a production rule: (“S ::= j V_i d”);
18.         if there is an edge (v_j, v_i) then
19.            /* The following rule will cause a conflict due to b_i. */
20.            generate a production rule: (“S ::= j Z_i b_i”);
21.         else /* The following rule will NOT cause a conflict. */
22.            generate a production rule: (“S ::= j Z_i e”);
23.      end;
24.      NewNonTerm := NewNonTerm ∪ { Z_i, V_i };
25.   end;

Figure 6. Algorithm for generating a context-free grammar from a graph. The four rules generated at lines 9, 17, and 20 will cause a conflict in the LR(1) machine due to the terminal b_i.
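The following Python sketch is one reading of the algorithm in Figure 6, checked against the rule shapes in Tables 1 and 2. Everything it names is an assumption made for illustration: the function build_grammar, the nonterminal names Z3, V3, ..., the once-only terminals t1, t2, ..., and the potential conflict terminals b3, b4, ... (the tables use letters and punctuation marks instead).

```python
from itertools import count


def build_grammar(n, edges):
    """Return production rules (lhs, rhs tuple) for a graph on nodes 1..n, n >= 2."""
    edge_set = {frozenset(e) for e in edges}
    fresh = (f"t{k}" for k in count(1))           # terminals used exactly once
    rules = [("P", ("S", "$")), ("X", ("@",)), ("Y", ("@",))]
    # Base grammar of Table 1: the shared terminal ")" encodes the edge (1, 2).
    if frozenset({1, 2}) in edge_set:
        rules += [("S", ("1", "X", ")")), ("S", ("1", "Y", next(fresh))),
                  ("S", ("2", "X", next(fresh))), ("S", ("2", "Y", ")"))]
    else:
        rules += [("S", (g, nt, next(fresh))) for g in "12" for nt in "XY"]
    new_nonterms = ["X", "Y"]
    for i in range(3, n + 1):
        z, v = f"Z{i}", f"V{i}"
        a, b = next(fresh), f"b{i}"               # b is the potential conflict terminal
        rules += [("S", (str(i), z, a)), ("S", (str(i), v, b)),
                  (z, ("@",)), (v, ("@",))]
        for nt in new_nonterms:                   # older nonterminals in the new group
            rules.append(("S", (str(i), nt, next(fresh))))
        for j in range(1, i):                     # new nonterminals in the older groups
            rules.append(("S", (str(j), v, next(fresh))))
            last = b if frozenset({j, i}) in edge_set else next(fresh)
            rules.append(("S", (str(j), z, last)))
        new_nonterms += [z, v]
    return rules


# The graph of Figure 3 (e): edges (1, 2) and (1, 3).
for lhs, rhs in build_grammar(3, [(1, 2), (1, 3)]):
    print(lhs, "::=", " ".join(rhs))
```

For three nodes this yields 17 rules, the same number as each column of Table 2, with the conflict terminal shared between group 1 and group 3 only.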

3.1. Extending Grammar $G_{i-1}$ to Grammar $G_i$

In this subsection, we explain the steps for extending grammar $G_{i-1}$ to grammar $G_i$. The complete algorithm for generating a context-free grammar from a graph is shown in Figure 6.

Figure 7. The LR(1) machine for the graph in Figure 3 (c).
Figure 8. The LR(1) machine for the graph in Figure 3 (d).

We may list the nodes in a color graph as $v_1, v_2, \ldots, v_n$. We first consider the two nodes $v_1$ and $v_2$. One of the grammars in Table 1 is selected according to whether there is an edge $(v_1, v_2)$. Call the resulting grammar $G_2$.

The remaining nodes are added one at a time, resulting in grammars $G_3, G_4, \ldots, G_n$. (This is done in the for-loop between lines 5 and 25 in Figure 6.) Each $G_i$ is constructed for the subgraph consisting of nodes $v_1, \ldots, v_i$ and all edges among these nodes. The last grammar $G_n$ is the intended grammar.

Grammar $G_i$ is obtained from the previous grammar $G_{i-1}$ by adding two new nonterminals $Z_i$ and $V_i$, new terminals, and new production rules.

Consider node $v_i$. Firstly, four new production rules (S ::= $i$ $Z_i$ $a_i$), (S ::= $i$ $V_i$ $b_i$), ($Z_i$ ::= @), and ($V_i$ ::= @) are generated, where S is the nonterminal taken from the initial grammar $G_2$, $i$ is the index of the current iteration that is treated as an integer terminal, $Z_i$ and $V_i$ are two newly created nonterminals, @ is the terminal taken from the initial grammar $G_2$, and $a_i$ and $b_i$ are two of the new terminals. (We will use the numbers 1, 2, 3, etc., to denote the new terminals in the order of creation, as shown in the grammars in Table 2. Each number also corresponds to a new similar state in the final LR(1) machine.)

The variable NewNonTerm is the set of nonterminals that have been created. At the beginning of the $i$-th iteration of the for-loop, there are $2(i-2)$ nonterminals in the variable NewNonTerm. During each iteration, we create one new production rule (S ::= $i$ $A$ $c$) for every nonterminal $A \in$ NewNonTerm, where $i$ is the loop index and $c$ is a new terminal used only in this production rule.

Figure 9. The LR(1) machine for the graph in Figure 3 (e).

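Consider node $v_i$ again. $v_i$ might be connected to some of $v_1, \ldots, v_{i-1}$. Let $E_i = \{\, j < i \mid (v_j, v_i) \text{ is an edge} \,\}$ and $\bar{E}_i = \{1, \ldots, i-1\} \setminus E_i$. For each $j \in E_i$, we generate two new production rules: (S ::= $j$ $V_i$ $d$) and (S ::= $j$ $Z_i$ $b_i$), where $d$ is a new terminal and $b_i$ is a terminal generated at the beginning of the current iteration. For each $j \in \bar{E}_i$, we generate two new production rules: (S ::= $j$ $V_i$ $d$) and (S ::= $j$ $Z_i$ $e$), where $d$ and $e$ are two new terminals. (This is done in the for loop at line 15.)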

Figure 10. The LR(1) machine for the graph in Figure 3 (f).

Note that a production rule of the form (S ::= $i$ $A$ $t$) will not cause a conflict (in the LR(1) machine for this grammar) as long as the terminal $t$ is not used in any other production rule. (An example is the production rule (S ::= 2 X c) in the two grammars in Table 1.) Therefore, only two production rules that end with the same terminal, namely the rule generated at line 9 for the new node and the rule generated at line 20 for a connected earlier node, will cause a conflict (together with other production rules).

Note also that different terminals are generated during different iterations of the for loop at line 5. Therefore, production rules generated in different iterations will not conflict with each other.

Example. Suppose the color graph in Figure 3 (b) is extended to that in Figure 3 (e). The corresponding grammars are shown in the right column of Table 1 and in the third column of Table 2, respectively. Note that the latter grammar is extended from the former grammar. The boxed production rules in the latter grammar are added by the algorithm in Figure 6.

Example. Figures 7, 8, 9, and 10 are the constructed state machines for the graphs in Figure 3 (c), (d), (e), and (f), respectively.

The grammar on the third column in Table 2 is a typical grammar generated by our algorithm. The production rules are classified into five categories:

  1. one starting production rule (i.e., P ::= S $)

  2. four production rules of the form ($A$ ::= @), where $A$ is a nonterminal other than P and S

  3. four production rules whose right-hand sides begin with the terminal 1

  4. four production rules whose right-hand sides begin with the terminal 2

  5. four production rules whose right-hand sides begin with the terminal 3

Now consider the constructed LR(1) machine in Figure 9. Note that all items derived from rules of categories 1, 3, 4, and 5 appear only once in the whole LR(1) machine. Any state containing any of these items will not be similar to any other state and hence can be ignored. We can focus on states consisting solely of items derived from production rules of category 2. There are only 3 such states, which are indeed similar to one another. Each such state has all items of the form ($A$ ::= @ •, $\ell$), where $A$ is a nonterminal other than P and S and $\ell$ is a lookahead set.

Another characteristic of the constructed LR(1) machine in Figure 9 is that there are no cycles. The longest path contains 3 steps.

In fact, all grammars generated by the algorithm in Figure 6 share the above characteristics. They help us to infer properties of the minimized LR(1) machines.

In the corresponding state machine in Figure 9, consider the four states that come immediately after the initial state. Items derived from rules in categories (3), (4) and (5) are cleanly separated because of the first symbols (which are integer terminals) on the right-hand sides of the rules. Hence, from these four states onward, items whose first symbols on the right-hand sides are different will never be mixed in the same state. Items derived from rules in category (2) are quite similar: all items of the form ($A$ ::= • @, $\ell$), where $A$ is a nonterminal, appear in every state that is reached from the initial state on an integer terminal (we will ignore the starting production rule in this discussion). Furthermore, all items of the form ($A$ ::= @ •, $\ell$) appear in states that come two steps after the initial state. It is these states (which contain all items of the form ($A$ ::= @ •, $\ell$) and no other items) that are similar to one another. All other states are not similar to any other states and hence can be ignored when we discuss the merging of similar states.

Therefore, we can create or avoid conflicts among similar states by carefully adjusting the last terminal in rules of the form (S ::= $i$ $A$ $t$), where $i$ is an integer terminal, $A$ is a nonterminal, and $t$ is a terminal. In the four grammars in Table 2, when $t$ is a lower-case English letter (e.g., “b” or “i”), that rule will not cause any conflict because the lower-case letter appears only once in the whole grammar. On the other hand, when $t$ is a punctuation mark (e.g., “)” or “=”), a conflict is intentionally added to the grammar because that punctuation mark is used in two different production rules. The above discussion relates to lines 18-22 in the algorithm in Figure 6.

3.2. The Constructed Grammar and LR(1) Machine

In general there are $2n$ nonterminals, $2n^2 - n + 2 - e$ terminals, and $2n^2 - 1$ production rules in the final constructed grammar, where $n$ is the number of nodes and $e$ is the number of edges in the original color graph. All nonterminals except P and S are called NewNonTerms (new nonterminals). There are $2n - 2$ NewNonTerms. There are one starting rule and $2n - 2$ rules of the form ($A$ ::= @), one for each NewNonTerm $A$.

The remaining rules have the form (S ::= $i$ $A$ $t$), where S is the nonterminal taken from a grammar in Table 1, $i$ is an integer terminal, $1 \le i \le n$, $A$ is a nonterminal, and $t$ is a terminal. These production rules are divided into $n$ groups according to the first (integer terminal) symbols on the right-hand sides.

Each group consists of exactly $2n - 2$ rules. In each group, each of the $2n - 2$ nonterminals, except P and S (which are taken from a grammar in Table 1), appears as the second (nonterminal) symbol on the right-hand side of a production rule.
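As a sanity check on the size of the constructed grammar, the rules can be counted by category. The count below assumes that a graph with $n$ nodes yields $n$ groups of $2n-2$ rules each, which is what Tables 1 and 2 show for $n = 2$ and $n = 3$; the symbols $A$, $i$, and $t$ are placeholders for a new nonterminal, a group number, and a terminal.

```latex
\[
\underbrace{1}_{\text{starting rule}}
\;+\; \underbrace{(2n-2)}_{\text{rules } A ::= @}
\;+\; \underbrace{n\,(2n-2)}_{\text{rules } S ::= i\,A\,t}
\;=\; 2n^2 - 1 .
\]
```

For $n = 2$ this gives 7 rules, as in each grammar of Table 1, and for $n = 3$ it gives 17 rules, as in each grammar of Table 2, so the grammar has quadratic size in the number of nodes.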

There are $O(n^2)$ states in the LR(1) machine derived from the constructed grammar. One is the starting state. There are $n + 1$ successor states immediately following the starting state. One successor state contains the item for the starting rule, P ::= S • $. Each of the remaining $n$ successor states consists of $2n - 2$ items (which belong to the same group of production rules defined above) of the form (S ::= $i$ • $A$ $t$, “$”) and $2n - 2$ items of the form ($A$ ::= • @, $t$). Then each such successor state is, in turn, followed by another state, denoted as $s_i$, of $2n - 2$ items of the form ($A$ ::= @ •, $t$).

In the whole LR(1) machine, there are $n$ such states (i.e., $s_i$, where $1 \le i \le n$) because there are $n$ groups of production rules. These states are similar to one another. State $s_i$ corresponds to node $v_i$ in the original color graph, where $1 \le i \le n$. All other states in the derived LR(1) machine are dissimilar to one another. They can be ignored when we attempt to merge similar states.

After the last grammar is constructed, the corresponding LR(1) state machine is derived from the grammar. Note that the nonterminal P never appears on the right-hand side of a production rule. The nonterminal S appears on the right-hand side of only one production rule. Thus, P and S are never involved in a conflict. A conflict in the LR(1) machine, if it ever occurs, must occur due to two items of the form ($A$ ::= @ •, $t$) and ($B$ ::= @ •, $t$), where $A$ and $B$ are two different NewNonTerms and $t$ is a terminal.

Note that $t$ is a terminal that can follow $A$ and $B$ in some production rules. However, since $A$ and $B$ are NewNonTerms, they can appear only in production rules of the form (S ::= $i$ $A$ $t$) and (S ::= $j$ $B$ $t$). The two production rules are carefully planted in the grammar (see lines 9 and 20 in the algorithm in Figure 6).

We will use an algorithm to calculate the minimum state machine from this LR(1) machine.

3.3. Recovering a Minimum Node-Coloring from a Minimum LR(1) Machine

In this subsection, we will describe how to recover a minimum coloring of the original color graph from the minimum state machine.

For the purpose of merging similar states, we may ignore all states that are not similar to any other states. To make conflicts among states explicit, we add a conflict edge between two states if a conflict will occur when the two states are merged. The state machine in Figure 9 becomes Figure 11, which is called a conflict graph. Due to our construction, all states in the conflict graph (Figure 11) are similar to one another. In Figure 11, there are three (similar) states and two conflict edges, $(s_1, s_2)$ and $(s_1, s_3)$.

Finding a minimum merge scheme for the LR(1) machine is identical to finding a minimum merge scheme for the conflict graph. So we will focus on the conflict graph instead.

The reader may find that the conflict graph (Figure 11) is isomorphic to the original color graph (Figure 3 (e)). This is due to the construction of the context-free grammar.
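The isomorphism can be checked mechanically for the running example. The sketch below transcribes the S-rules of grammar (c) in Table 2, builds the three similar states under the assumption that the lookahead of the item for a nonterminal in group i is the terminal that follows that nonterminal in its group-i rule, and computes the conflict edges. The function names and the tuple encoding are hypothetical.

```python
from itertools import combinations

# S-rules of grammar (c) in Table 2, written as (group, nonterminal, last terminal).
S_RULES = [
    (1, "X", ")"), (1, "Y", "b"), (1, "Z", "="), (1, "V", "f"),
    (2, "X", "c"), (2, "Y", ")"), (2, "Z", "g"), (2, "V", "h"),
    (3, "X", "i"), (3, "Y", "j"), (3, "Z", "k"), (3, "V", "="),
]


def similar_states(s_rules):
    """Map each group i to its similar state: nonterminal -> lookahead set."""
    states = {}
    for group, nonterminal, terminal in s_rules:
        states.setdefault(group, {}).setdefault(nonterminal, set()).add(terminal)
    return states


def conflict_graph(states):
    """Edges (i, j) between similar states whose merge has a reduce-reduce conflict."""
    edges = set()
    for i, j in combinations(sorted(states), 2):
        merged = {nt: states[i][nt] | states[j][nt] for nt in states[i]}
        if any(merged[a] & merged[b] for a, b in combinations(merged, 2)):
            edges.add((i, j))
    return edges


print(sorted(conflict_graph(similar_states(S_RULES))))
```

The computed conflict edges join state 1 with state 2 and state 1 with state 3, which are exactly the edges of the two-edge color graph that grammar (c) was built from.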

For the former (the conflict graph), we wish to merge as many states as possible under the constraint that two states connected by an edge, such as $(s_1, s_2)$, should not be merged. For the latter (the color graph), we wish to assign colors to nodes under the constraint that two nodes connected by an edge, such as $(v_1, v_2)$, should not be assigned the same color. Since the conflict graph and the color graph are isomorphic, it is easy to see that every coloring of the color graph with $k$ colors corresponds to a merge scheme of the conflict graph with $k$ partition blocks, and vice versa. A detailed argument is in the next paragraph.

Suppose in a coloring of the color graph, nodes $v_i$ and $v_j$ are assigned the same color. Due to the rule of coloring, $v_i$ and $v_j$ are not connected by an edge. The corresponding states $s_i$ and $s_j$ in the conflict graph are not connected by an edge either, due to the construction of the grammar. Therefore, states $s_i$ and $s_j$ can be merged together. Hence, a coloring of the color graph corresponds to a merge scheme of the conflict graph. By the same argument, a merge scheme of the conflict graph corresponds to a coloring of the color graph.
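The correspondence runs in both directions, and the direction used by the reduction (from a merge scheme back to a coloring) is a simple relabeling. A minimal sketch, assuming a merge scheme is given as a list of blocks of state indices and that state i corresponds to node i; the function names are hypothetical.

```python
def coloring_from_merge_scheme(blocks):
    """Give every node whose state lies in the same block the same color index."""
    return {state: color for color, block in enumerate(blocks) for state in block}


def is_proper_coloring(coloring, edges):
    """True if no edge joins two nodes of the same color."""
    return all(coloring[u] != coloring[v] for u, v in edges)


# A minimum merge scheme for the conflict graph with conflict edges (1, 2) and (1, 3).
blocks = [[1], [2, 3]]
edges = [(1, 2), (1, 3)]

coloring = coloring_from_merge_scheme(blocks)
print(coloring, is_proper_coloring(coloring, edges))
```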

Figure 11. The conflict graph. After removing the states that are not similar to any other states, only three states are left. We add one conflict edge $(s_1, s_2)$ and another conflict edge $(s_1, s_3)$.

3.4. Time Analysis of the Reduction

Constructing a context-free grammar from a graph (the algorithm in Figure 6) takes polynomial time. Note that the constructed grammar does not contain recursive rules, that is, there is no derivation of the form $A \Rightarrow^{+} \alpha A \beta$. Deriving the LR(1) state machine from such a grammar also takes polynomial time because there are $O(n^2)$ states in the derived LR(1) machine. Constructing the conflict graph from the LR(1) machine also takes polynomial time. Therefore the reduction takes polynomial time.

4. Conclusion

We have reduced the node-coloring problem to the minimization problem of the LR(1) state machines. Therefore, the minimization problem is NP-hard.

There are efficient algorithms for the minimization of finite state machines. LR(0) state machines are minimum by their construction. We show that LR(1) state machines cannot be easily minimized in general.

Note that minimizing an LR(1) machine is quite different from minimizing a general finite state machine. For one thing, we need to examine the items in the states of an LR(1) machine. On the other hand, minimizing a general finite state machine does not consider the “contents” of the states.



References

  • [1] A.V. Aho and S.C. Johnson. “LR Parsing,” ACM Computing Surveys, vol. 6, no. 2, June 1974. pp. 99-124.
  • [2] A.V. Aho, M.S. Lam, R. Sethi, and J.D. Ullman. Compilers: Principles, Techniques, and Tools. (2nd Edition) Prentice Hall, New York, 2006.
  • [3] T. Anderson, J. Eve, and J. Horning. “Efficient LR(1) parsers,” Acta Informatica, vol. 2, 1973. pp. 2-39.
  • [4] X. Chen and D. Pager. “Full LR(1) Parser Generator Hyacc And Study On The Performance of LR(1) Algorithms.” In Proc. 4th International C* Conf. Computer Science and Software Engineering (C3S2E ’11), May 16-18, Montreal, CANADA, 2011. pp. 83-92.
  • [5] J.E. Denny and B.A. Malloy. “IELR(1) algorithm for generating minimal LR(1) parser tables for non-LR(1) grammars with conflict resolution,” Science of Computer Programming, vol. 75, no. 11, November 2010. pp. 943-979. doi:10.1016/j.scico.2009.08.001.
  • [6] F.L. DeRemer. Practical translators for LR(k) languages. Project MAC Tech. Rep. TR-65, MIT, Cambridge, Mass., 1969.
  • [7] F.L. DeRemer. “Simple LR(k) Grammars.” Comm. ACM, vol. 14, no. 7, July 1971. pp. 453-460.
  • [8] F.L. DeRemer and T. Pennello. “Efficient Computation of LALR(1) LookAhead Sets.” ACM Trans. Programming Languages and Systems, vol. 4, no. 4, October 1982. pp. 615-649.
  • [9] C.N. Fischer, R.K. Cytron, and R.J. LeBlanc, Jr.. Crafting A Compiler. Pearson, New York, 2010.
  • [10] M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979. ISBN 0-7167-1045-5.
  • [11] S.C. Johnson. Yacc: Yet Another Compiler-Compiler. Bell Laboratories, Murray Hill, NJ, 1978.
  • [12] D. E. Knuth. “On the translation of languages from left to right.” Information and Control, vol. 8, no. 6, July 1965. pp. 607-639. doi:10.1016/S0019-9958(65)90426-2.
  • [13] M.D. Mickunas, R.L. Lancaster, V.B. Schneider. “Transforming LR(k) Grammars to LR(1), SLR(1), and (1,1) Bounded Right-Context Grammars.” Journal of the ACM, vol. 23, no. 3, July 1976. pp. 511-533. doi:10.1145/321958.321972.
  • [14] D. Pager. “The lane tracing algorithm for constructing LR(k) parsers.” In Proc. 5th Annual ACM Symp. Theory of Computing, Austin, Texas, United States, 1973. pp. 172–181.
  • [15] D. Pager. “A practical general method for constructing LR(k) parsers.” Acta Informatica, vol. 7, no. 3, 1977. pp. 249-268. doi:10.1007/BF00290336.
  • [16] D. Pager. “The lane-tracing algorithm for constructing LR(k) parsers and ways of enhancing its efficiency.” Information Sciences, vol. 12, 1977. pp. 19–42.
  • [17] W. Yang, “Extended LALR(1) Parsing.” In Proc. 13th International Conf. Autonomic and Autonomous Systems (ICAS 2017), May 21-25, 2017. Barcelona, Spain.