Recurrent Neural Networks as Weighted Language Recognizers

11/15/2017 ∙ by Yining Chen et al. ∙ USC Information Sciences Institute ∙ Dartmouth College

We investigate the computational complexity of several basic problems for simple recurrent neural networks (RNNs) viewed as formal models for recognizing weighted languages. We focus on single-layer, ReLU-activation, rational-weight RNNs with softmax output, which are commonly used in natural language processing applications. We show that most problems for such RNNs are undecidable, including consistency, equivalence, minimization, and finding the highest-weighted string. However, for consistent RNNs the last problem becomes decidable, although the length of the solution can surpass all computable bounds. If additionally the string is limited to polynomial length, the problem becomes NP-complete and APX-hard. In summary, this shows that approximations and heuristic algorithms are necessary in practical applications of these RNNs. We also consider RNNs as unweighted language recognizers and situate them between Turing machines and random-access machines with respect to their real-time recognition powers.


1 Introduction

Recurrent neural networks (RNNs) are an attractive apparatus for probabilistic language modeling Mikolov and Zweig (2012). Recent experiments show that RNNs significantly outperform other methods in assigning high probability to held-out English text Jozefowicz et al. (2016).

Roughly speaking, an RNN works as follows. At each time step, it consumes one input token, updates its hidden state vector, and predicts the next token by generating a probability distribution over all permissible tokens. The probability of an input string is simply the product of the predictions of the tokens constituting the string, followed by a terminating token. In this manner, each RNN defines a weighted language, i.e., a total function from strings to weights. Siegelmann and Sontag (1995) showed that single-layer rational-weight RNNs with saturated linear activation can compute any computable function. In particular, a specific architecture with 886 hidden units can simulate any Turing machine in real time (i.e., each Turing machine step is simulated in a single time step). However, their RNN encodes the whole input in its internal state, performs the actual computation of the Turing machine when reading the terminating token, and then encodes the output (provided an output is produced) in a particular hidden unit. In this way, their RNN allows "thinking" time (equivalent to the computation time of the Turing machine) after the input has been encoded.

We consider a different variant of RNNs that is commonly used in natural language processing applications. It uses ReLU activations, consumes an input token at each time step, and produces softmax predictions for the next token. It thus halts immediately after reading the last input token, and the weight assigned to the input is simply the product of the per-step next-token predictions.
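To make this concrete, the following is a minimal NumPy sketch of how such an RNN assigns a weight to a string. The parameter names (h0, W, b, E, e), the dictionary layout, and the end-of-string marker are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def softmax(v):
    z = np.exp(v - v.max())                  # subtract max for numerical stability
    return z / z.sum()

def rnn_weight(tokens, params, vocab, eos="<eos>"):
    """Weight of `tokens`: the product of the per-step softmax probabilities,
    including the probability of the end-of-string marker."""
    h = params["h0"].copy()                              # initial activation
    weight = 1.0
    for tok in list(tokens) + [eos]:
        p = softmax(params["E"] @ h + params["e"])       # next-token distribution
        weight *= p[vocab[tok]]                          # probability of the observed token
        if tok != eos:                                   # halt right after the end marker
            h = np.maximum(0.0, params["W"] @ h + params["b"][tok])  # ReLU update
    return weight
```

With suitably shaped parameters, `rnn_weight(["a", "b"], params, vocab)` returns the product of three probabilities: those of predicting "a", then "b", then the end marker.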

Other formal models that are currently used to implement probabilistic language models, such as finite-state automata and context-free grammars, are by now well understood. A fair share of their utility directly derives from their nice algorithmic properties. For example, the weighted languages computed by weighted finite-state automata are closed under intersection (pointwise product) and union (pointwise sum), and the corresponding unweighted languages are closed under intersection, union, difference, and complementation Droste et al. (2013). Moreover, toolkits like OpenFST Allauzen et al. (2007) and Carmel (https://www.isi.edu/licensed-sw/carmel/) implement efficient algorithms on automata, such as minimization, intersection, and finding the highest-weighted path or the highest-weighted string.

RNN practitioners naturally face many of these same problems. For example, an RNN-based machine translation system should extract the highest-weighted output string (i.e., the most likely translation) generated by an RNN Sutskever et al. (2014); Bahdanau et al. (2014). Currently this task is solved by approximation techniques like greedy and beam search. To facilitate the deployment of large RNNs onto limited-memory devices (like mobile phones), minimization techniques would be beneficial. Again, currently only heuristic approaches like knowledge distillation Kim and Rush (2016) are available. Meanwhile, it is unclear whether we can even determine if the computed weighted language is consistent, i.e., if it is a probability distribution on the set of all strings. Without a determination of the overall probability mass assigned to all finite strings, a fair comparison of language models with regard to perplexity is simply impossible.

The goal of this paper is to study the above problems for the mentioned ReLU-variant of RNNs. More specifically, we ask and answer the following questions:

  • Consistency: Do RNNs compute consistent weighted languages? Is the consistency of the computed weighted language decidable?

  • Highest-weighted string: Can we (efficiently) determine the highest-weighted string in a computed weighted language?

  • Equivalence: Can we decide whether two given RNNs compute the same weighted language?

  • Minimization: Can we minimize the number of neurons for a given RNN?

2 Definitions and notations

Before we introduce our RNN model formally, we recall some basic notions and notation. An alphabet $\Sigma$ is a finite set of symbols, and we write $|\Sigma|$ for the number of symbols in $\Sigma$. A string $w$ over the alphabet $\Sigma$ is a finite sequence of zero or more symbols drawn from $\Sigma$, and we write $\Sigma^*$ for the set of all strings over $\Sigma$, of which $\varepsilon$ is the empty string. The length of the string $w$ is denoted $|w|$ and coincides with the number of symbols constituting the string. As usual, we write $B^A$ for the set of functions from $A$ to $B$. A weighted language is a total function $L\colon \Sigma^* \to \mathbb{R}$ from strings to real-valued weights; for example, the function that assigns the weight $2^{-(|w|+1)}$ to every string $w$ over a one-letter alphabet is such a weighted language.
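The geometric example just mentioned can be spelled out directly; the following small Python snippet (an illustration, not code from the paper) also shows that its weights sum to 1, a property that will matter for consistency later.

```python
from fractions import Fraction

def weight(w: str) -> Fraction:
    """Weighted language over the one-letter alphabet {'a'}:
    every string w receives the weight 2^-(|w| + 1)."""
    assert set(w) <= {"a"}
    return Fraction(1, 2 ** (len(w) + 1))

# Partial sums of the weights approach 1, so this weighted
# language is a probability distribution over all strings.
partial = sum(weight("a" * n) for n in range(20))   # equals 1 - 2^-20
```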

We restrict the weights in our RNNs to the rational numbers $\mathbb{Q}$. In addition, we reserve the use of a special delimiter symbol to mark the start and end of an input string. To this end, we assume that this delimiter does not occur in any considered alphabet, and we work with the extended alphabet consisting of the alphabet together with the delimiter.

Definition 1.

A single-layer RNN  is a -tuple , in which

  •  is an input alphabet,

  •  is a finite set of neurons,

  • is an initial activation vector,

  •  is a transition matrix,

  • is a -indexed family of bias vectors ,

  • is a prediction matrix, and

  • is a prediction bias vector.

Next, let us define how such an RNN works. We first prepare our input encoding and the effect of our activation function. An input string $w = w_1 \cdots w_n$ with $w_i \in \Sigma$ is presented to the RNN one symbol per time step, enclosed by the reserved start and end delimiter. Our RNNs use ReLUs (Rectified Linear Units), so for every vector $v$ we let $\mathrm{relu}(v)$ (the ReLU activation) be the vector with $\mathrm{relu}(v)_i = \max(0, v_i)$ for every component $i$. In other words, the ReLUs act like identities on nonnegative inputs, but clip negative inputs to $0$. We use softmax predictions, so for every vector $v$ we let $\mathrm{softmax}(v)_i = e^{v_i} / \sum_j e^{v_j}$.

RNNs act in discrete time steps, reading a single letter at each step. We now define the semantics of our RNNs.

Definition 2.

Let be an RNN,  an input string of length  and a time step. We define

  • the hidden state vector  given by

    where and we use standard matrix product and point-wise vector addition,

  • the next-token prediction vector 

  • the next-token distribution 

Finally, the RNN computes a weighted language, which assigns to every input the product of the probabilities of the individual next-token predictions. In other words, each component of the hidden state vector is the ReLU activation applied to a linear combination of all the components of the previous hidden state vector together with a summand that depends on the current input letter. Thus, we often specify the hidden state update as a linear combination instead of specifying the transition matrix and the bias vectors. The semantics is then obtained by predicting the letters of the input and the final terminator and multiplying the probabilities of the individual predictions.
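Spelled out as formulas, the semantics reads as follows; this is a sketch only, and the symbol names ($h_t$ for the hidden state, $W$ for the transition matrix, $b_\sigma$ for the letter-dependent bias, $E$ and $e$ for the prediction parameters, and $\$ $ for the end delimiter) are chosen here for illustration rather than taken from the original notation.

\[
  h_t \;=\; \mathrm{relu}\bigl(W\,h_{t-1} + b_{w_t}\bigr),
  \qquad
  p_t \;=\; \mathrm{softmax}\bigl(E\,h_t + e\bigr),
\]
\[
  R(w) \;=\; \prod_{t=0}^{n} p_t(w_{t+1})
  \qquad\text{for } w = w_1 \cdots w_n \text{ with } w_{n+1} = \$,
\]

where $h_0$ is the initial activation vector and $p_t(\sigma)$ denotes the probability that the distribution $p_t$ assigns to the symbol $\sigma$.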

Let us illustrate these notions on an example. We consider the RNN  with and

  • and ,

  • and

  • and and

  • and .

In this case, we obtain the linear combinations

computing the next hidden state components. Given the initial activation, we thus obtain . Using this information, we obtain

Consequently, we assign weight  to input , weight  to , and, more generally, weight  to .

Clearly, the weight assigned by an RNN is always strictly between $0$ and $1$, which enables a probabilistic view. Similar to weighted finite-state automata or weighted context-free grammars, each RNN is a compact, finite representation of a weighted language. The softmax operation ensures that the weight $0$ is never assigned, so every input string remains possible in principle. In practical language modeling, smoothing methods are used to change distributions such that impossibility (probability $0$) is removed. Our RNNs avoid impossibility outright, so this can be considered a feature instead of a disadvantage.

The hidden state of an RNN can be used as scratch space for computation. For example, a single neuron can count the input symbols read so far: its letter-dependent summand is universally $1$, so its value increases by one at every step. Similarly, we can use the method of Siegelmann and Sontag (1995) to encode the complete input string in a single hidden state value as a number in a suitable base, with a bijection mapping letters to digits. In principle, we can thus store the entire input string (of unbounded length) in the hidden state, but our RNN model outputs weights at each step and terminates immediately once the final delimiter is read. It must therefore assign a probability to a string incrementally, using the chain rule decomposition
\[
  P(w_1 \cdots w_n) \;=\; \prod_{t=1}^{n+1} P(w_t \mid w_1 \cdots w_{t-1}),
\]
where $w_{n+1}$ denotes the terminating delimiter.
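As an illustration of this scratch-space idea, here is a small Python sketch of the two linear-update tricks just described: a counting neuron whose summand is always 1, and a Siegelmann–Sontag-style update that pushes each symbol as a digit of a number in a fixed base. The particular base and digit assignment below are illustrative assumptions, not the exact construction from the paper.

```python
def count_update(h, _symbol):
    """One neuron that counts input symbols: letter-independent summand of 1."""
    return max(0.0, h + 1.0)                   # ReLU(1*h + 1)

def encode_update(h, symbol, alphabet):
    """Push `symbol` as the most significant fractional digit of a number
    in base len(alphabet) + 1; digits 1..|alphabet| identify the letters."""
    base = len(alphabet) + 1
    digit = alphabet.index(symbol) + 1
    return max(0.0, h / base + digit / base)   # ReLU of a linear update

h_count, h_code = 0.0, 0.0
for sym in "abba":
    h_count = count_update(h_count, sym)
    h_code = encode_update(h_code, sym, ["a", "b"])
# h_count == 4.0, and h_code uniquely determines the string "abba",
# because the digit 0 is never used in the encoding.
```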

Figure 1: Sample RNNs over single-letter alphabets, and the weighted languages they recognize. The constant appearing in the specifications is a positive rational number that depends on the desired error margin; for the second and third languages it is chosen large enough that the stated approximation bounds hold in the respective columns.

Let us illustrate our notion of RNNs on some additional examples. They all use the alphabet  and are illustrated and formally specified in Figure 1. The first column shows an RNN  that assigns . The next-token prediction matrix ensures equal values for  and  at every time step. The second column shows the RNN , which we already discussed. In the beginning, it heavily biases the next symbol prediction towards , but counters it starting at . The third RNN  uses another counting mechanism with . The first two components are ReLU-thresholded to zero until , at which point they overwhelm the bias towards  turning all future predictions to .

3 Consistency

We first investigate the consistency problem for an RNN, which asks whether the recognized weighted language is indeed a probability distribution. Consequently, an RNN is consistent if the weights it assigns to all finite strings sum to $1$. We first show that there is an inconsistent RNN, which together with our examples shows that consistency is a nontrivial property of RNNs.²

² For comparison, all probabilistic finite-state automata are consistent, provided no transitions exit final states. Not all probabilistic context-free grammars are consistent; necessary and sufficient conditions for consistency are given by Booth and Thompson (1973). However, probabilistic context-free grammars obtained by training on a finite corpus using popular methods (such as expectation-maximization) are guaranteed to be consistent Nederhof and Satta (2006).

We immediately use a slightly more complex example, which we will later reuse.

Example 3.

Let us consider an arbitrary RNN with a single-letter alphabet, the neurons given below, a fixed initial activation for all of them, and the following linear combinations:

Now we distinguish two cases.

Case 1: Suppose the relevant condition holds for all time steps. Then the termination probability (i.e., the likelihood of predicting the end delimiter) shrinks rapidly towards $0$, so the RNN assigns less than 15% of the probability mass to the terminating sequences (i.e., the finite strings); hence the RNN is inconsistent (see Lemma 15 in the appendix).

Case 2: Suppose instead that there exists a time point from which on the condition fails. Then the probability of predicting the end delimiter increases over time and eventually far outweighs the probability of predicting the alphabet letter. Consequently, in this case the RNN is consistent (see Lemma 16 in the appendix).

We have seen in the previous example that consistency is not trivial for RNNs, which takes us to the consistency problem for RNNs:

Consistency:

Given an RNN , return “yes” if  is consistent and “no” otherwise.

We recall the following theorem, which, combined with our example, will prove that consistency is unfortunately undecidable for RNNs.

Theorem 4 (Theorem 2 of Siegelmann and Sontag (1995)).

Let  be an arbitrary deterministic Turing machine. There exists an RNN with saturated linear activation, input alphabet , and a designated neuron such that:

  • if  does not halt on , and

  • if  does halt on empty input after  steps, then

In other words, such RNNs with saturated linear activation can semi-decide halting of an arbitrary Turing machine in the sense that a particular neuron achieves value $1$ at some point during the evolution if and only if the Turing machine halts on empty input. An RNN with saturated linear activation is an RNN following our definition with the only difference that the ReLU activation is replaced by the saturated linear activation, which for every vector $v$ and component $i$ yields $\min(1, \max(0, v_i))$.

Since $\min(1, \max(0, x)) = \mathrm{relu}(x) - \mathrm{relu}(x - 1)$ for all $x$, and the right-hand side is a linear combination of ReLU activations, we can easily simulate saturated linear activation in our RNNs. To this end, each neuron of the original RNN is replaced by two neurons in the new RNN whose ReLU-activated values are $\mathrm{relu}(x)$ and $\mathrm{relu}(x-1)$ for the original neuron's pre-activation $x$, so that their difference reproduces the saturated value. The corresponding transition matrix and bias function are given explicitly in Lemma 17 in the appendix.

Corollary 5.

Let  be an arbitrary deterministic Turing machine. There exists an RNN with input alphabet  and two designated neurons such that:

  • if  does not halt on , and

  • if  does halt on empty input after  steps, then

We can now use this corollary together with the RNN of Example 3 to show that the consistency problem is undecidable. To this end, we simulate a given Turing machine and identify the two designated neurons of Corollary 5 with the corresponding neurons in Example 3. It follows that the Turing machine halts if and only if the resulting RNN is consistent. Hence we have reduced the undecidable halting problem to the consistency problem, which establishes its undecidability.

Theorem 6.

The consistency problem for RNNs is undecidable.

As mentioned in Footnote 2, probabilistic context-free grammars obtained after training on a finite corpus using the most popular methods are guaranteed to be consistent. At least for 2-layer RNNs this does not hold.

Theorem 7.

A two-layer RNN trained to a local optimum using back-propagation through time (BPTT) on a finite corpus is not necessarily consistent.

Proof.

The first layer of the RNN  with a single alphabet symbol  uses one neuron  and has the following behavior:

The second layer uses neuron  and takes  as input at time :

Let the training data be a finite corpus as given. Then the objective we wish to maximize is simply the likelihood of this corpus under the RNN. The derivative of this objective with respect to each parameter is $0$, so applying gradient descent updates does not change any of the parameters, and we have converged to an inconsistent RNN. ∎

It remains an open question whether there is a single-layer RNN that also exhibits this behavior.

4 Highest-weighted string

Given a function  we are often interested in the highest-weighted string. This corresponds to the most likely sentence in a language model or the most likely translation for a decoder RNN in machine translation.

For deterministic probabilistic finite-state automata or context-free grammars, only one path or derivation exists for any given string, so identifying the highest-weighted string is the same task as identifying the most probable path or derivation. However, for nondeterministic devices, the highest-weighted string is often harder to identify, since the weight of a string is the sum of the probabilities of all possible paths or derivations for that string. A comparison of the difficulty of identifying the most probable derivation and the highest-weighted string for various models is summarized in Table 1, in which our results are marked in bold face.

Table 1: Comparison of the difficulty of identifying the most probable derivation (Best-path) and the highest-weighted string (Best-string) for various models.

  • General RNN: Best-string undecidable
  • Consistent RNN: Best-string NP-complete (restricted to solutions of polynomial length)
  • Det. PFSA/PCFG: Best-path and Best-string in P (Dijkstra shortest path / Knuth 1977)
  • Nondet. PFSA/PCFG: Best-string NP-complete (Casacuberta and de la Higuera 2000 / Sima'an 1996)

We present various results concerning the difficulty of identifying the highest-weighted string in a weighted language computed by an RNN. We also summarize some available algorithms. We start with the formal presentation of the three studied problems.

  1. Best string: Given an RNN  and , does there exist with ?

  2. Consistent best string: Given a consistent RNN  and , does there exist with ?

  3. Consistent best string of polynomial length: Given a consistent RNN , polynomial  with for , and , does there exist with and ?

As usual, the corresponding optimization problems are not significantly simpler than these decision problems. Unfortunately, the general problem is also undecidable, which can easily be shown using our example.

Theorem 8.

The best string problem for RNNs is undecidable.

Proof.

Let the given Turing machine be arbitrary and again consider the RNN of Example 3 with its neurons identified with the designated neurons of Corollary 5. If the Turing machine does not halt, then no string ever reaches the required weight. On the other hand, if it halts after finitely many steps, then some string does, using Lemma 14 in the appendix. Consequently, a string with weight above the given bound exists if and only if the Turing machine halts, so the best string problem is also undecidable. ∎

If we restrict the RNNs to be consistent, then we can easily decide the best string problem by simple enumeration.

Theorem 9.

The consistent best string problem for RNNs is decidable.

Proof.

Let the RNN over its alphabet and the bound be given. Since the set of all strings is countable, we can enumerate its elements one by one. The algorithm computes the weights of the strings in this enumeration order. If we encounter a weight that exceeds the bound, then we stop with answer "yes." Otherwise we continue until the total weight of the examined strings exceeds one minus the bound, at which point no remaining string can exceed the bound and we stop with answer "no."

Since the RNN is consistent, the weights of all strings sum to $1$, so this algorithm is guaranteed to terminate, and it obviously decides the problem. ∎
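A direct transcription of this enumeration argument is sketched below; the helper `weight` (the RNN's string weight), the length-ordered enumeration, and the strict-inequality convention are assumptions made for illustration.

```python
from itertools import count, product

def best_string_decision(weight, alphabet, bound):
    """Decide whether some string has weight > bound, assuming the weighted
    language is consistent (i.e., all string weights sum to 1).
    `weight` maps a tuple of symbols to its probability under the RNN."""
    accumulated = 0.0
    for length in count(0):                          # enumerate strings by length
        for tokens in product(alphabet, repeat=length):
            w = weight(tokens)
            if w > bound:
                return True                          # witness string found
            accumulated += w
            if accumulated > 1.0 - bound:            # remaining mass is below the bound,
                return False                         # so no unseen string can exceed it
```

Termination is guaranteed precisely because of consistency: the accumulated weight converges to 1, so it eventually exceeds 1 minus the bound.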

Next, we investigate the length of the shortest string of maximal weight in the weighted language generated by a consistent RNN, in terms of the RNN's (binary storage) size. As already mentioned by Siegelmann and Sontag (1995) and evidenced here, only small-precision rational numbers are needed in our constructions, so we assume that the size of an RNN is bounded by a (reasonably small) constant times the number of its neurons. We show that no computable bound on the length of the best string can exist, so its length can surpass all reasonable bounds.

Theorem 10.

Consider the function that assigns to each size bound the maximal length of a shortest highest-weighted string over all consistent RNNs of at most that size. There exists no computable function that bounds it from above.

Proof.

In the previous section (before Theorem 6) we presented an RNN that simulates an arbitrary (single-track) Turing machine with a given number of states. By Siegelmann and Sontag (1995), the size of this RNN is bounded in terms of the number of states. Moreover, we observed that this RNN is consistent if and only if the Turing machine halts on empty input. In the proof of Theorem 8 we have additionally seen that the length of its best string exceeds the number of steps required to halt.

For every number of states, consider the corresponding "Busy Beaver" number Radó (1962), i.e., the maximal number of steps that any halting Turing machine with that many states performs on empty input. It is well known that the Busy Beaver numbers cannot be bounded by any computable function. However, the construction above turns every halting Turing machine into a consistent RNN whose size is bounded in terms of its number of states and whose best string is longer than the machine's running time, so the length of the best string clearly cannot be bounded by any computable function of the RNN size. ∎

Finally, we investigate the difficulty of the best string problem for consistent RNNs restricted to solutions of polynomial length.

Theorem 11.

Identifying the best string of polynomial length in a consistent RNN is NP-complete and APX-hard.

Proof sketch.

Clearly, we can guess an input string of polynomial length, run the RNN, and verify whether its weight exceeds the given bound in polynomial time. Therefore the problem is trivially in NP. For NP-hardness, we reduce from the 0-1 Integer Linear Programming Feasibility Problem:

0-1 Integer Linear Programming Feasibility:

Given: variables that can take only the values 0 and 1, and a finite set of linear constraints over these variables. Return: "yes" iff there is an assignment of values to the variables that satisfies all constraints.

Suppose we are given an instance of the above problem. We construct an instance of the consistent best string of polynomial length problem (the full construction is given in the appendix). Our construction ensures that there is only one length at which a string can have weight greater than the given bound. Thus, if there is any string whose weight is greater than the bound, the given instance of 0-1 Integer Linear Programming Feasibility is feasible; otherwise it is not.

Our reduction is a Polynomial-Time Approximation Scheme (PTAS) reduction and preserves approximability. Since 0-1 Integer Linear Programming Feasibility is NP-complete and the corresponding maximization problem is APX-complete, consistent best string of polynomial length is NP-complete and APX-hard, meaning that there is no PTAS to find the best string of polynomially bounded length (i.e., the best we can hope for in polynomial time is a constant-factor approximation algorithm) unless P = NP.

The full proof is given in the appendix.

If we assume that the solution length is bounded by some finite number, we can adapt algorithms from de la Higuera and Oncina (2013) for computing the most probable string in PFSAs for use with RNNs. Such algorithms would be similar to beam search Lowerre (1976), which is the approach most widely used in practice.
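For reference, the following is a minimal sketch of the kind of beam search used in practice to approximate the highest-weighted string. The helper `next_token_distribution`, the beam width, and the end-of-string handling are illustrative assumptions rather than an algorithm from the paper.

```python
import heapq
import math

def beam_search(next_token_distribution, eos, beam_width=5, max_len=50):
    """Approximate the highest-weighted string by keeping the `beam_width`
    best partial strings (by log-probability) at every length."""
    beam = [(0.0, [])]                               # (log-weight, prefix) pairs
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, prefix in beam:
            dist = next_token_distribution(prefix)   # dict: token -> probability
            for tok, prob in dist.items():
                score = logp + math.log(prob)
                if tok == eos:
                    finished.append((score, prefix)) # complete string
                else:
                    candidates.append((score, prefix + [tok]))
        if not candidates:
            break
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    best = max(finished, default=(float("-inf"), []), key=lambda c: c[0])
    return best[1], math.exp(best[0])                # tokens and approximate weight
```

As the hardness results above show, no such heuristic can be an exact polynomial-time algorithm (unless P = NP), which is precisely why approximations of this kind are used.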

5 Equivalence

We prove that equivalence of two RNNs is undecidable. For comparison, equivalence of two deterministic WFSAs can be tested efficiently, in time depending on the numbers of states of the two WFSAs and the size of the alphabet Cortes et al. (2007); equivalence of nondeterministic WFSAs is undecidable Griffiths (1968). The decidability of language equivalence for deterministic probabilistic pushdown automata (PPDA) is still open Forejt et al. (2014), although equivalence for deterministic unweighted pushdown automata (PDA) is decidable Sénizergues (1997).

The equivalence problem is formulated as follows:

Equivalence:

Given two RNNs and , return “yes” if for all , and “no” otherwise.

Theorem 12.

The equivalence problem for RNNs is undecidable.

Proof.

We prove this by contradiction. Suppose a Turing machine decides the equivalence problem. Given any deterministic Turing machine, construct the RNN that simulates it as described in Corollary 5. If the simulated machine does not halt, the constructed RNN assigns one fixed pattern of weights to all strings; if it halts after finitely many steps, the weights eventually deviate from that pattern. Let a second, trivial RNN compute exactly the fixed pattern. We run the assumed decider on the two RNNs. If it returns "no", the simulated machine halts; otherwise it does not halt. Therefore the halting problem would be decidable if equivalence were decidable, and hence equivalence is undecidable. ∎

6 Minimization

We look next at minimization of RNNs. For comparison, state-minimization of a deterministic PFSA can be carried out efficiently, in time bounded in terms of the number of transitions and the number of states Aho et al. (1974). Minimization of a non-deterministic PFSA is PSPACE-complete Jiang and Ravikumar (1993).

We focus on minimizing the number of hidden neurons in RNNs:

Minimization:

Given an RNN and a non-negative integer, return "yes" if there exists an RNN with at most that many hidden units that computes the same weighted language for all strings, and "no" otherwise.

Theorem 13.

Minimization of RNNs is undecidable.

Proof.

We reduce from the halting problem. Suppose a Turing machine decides the minimization problem. For any Turing machine, construct the same RNN as in Theorem 12. Note that an RNN with no hidden units can only output one constant next-token distribution for all prefixes. Therefore the number of hidden units of the constructed RNN can be minimized to zero if and only if it always outputs that constant distribution, which is the case if and only if the simulated machine never halts. We run the assumed decider on this instance: if it returns "yes", the simulated machine does not halt; otherwise it halts. Therefore minimization is undecidable. ∎

7 Conclusion

We proved the following hardness results regarding RNNs as recognizers of weighted languages:

  1. Consistency:

    1. Inconsistent RNNs exist.

    2. Consistency of RNNs is undecidable.

  2. Highest-weighted string:

    1. Finding the highest-weighted string for an arbitrary RNN is undecidable.

    2. Finding the highest-weighted string for a consistent RNN is decidable, but the solution length can surpass all computable bounds.

    3. Restricting to solutions of polynomial length, finding the highest-weighted string is NP-complete and APX-hard.

  3. Testing equivalence of RNNs and minimizing the number of neurons in an RNN are both undecidable.

Although our undecidability results are upshots of the Turing-completeness of RNNs Siegelmann and Sontag (1995), our NP-completeness and APX-hardness results are original and surprising, since the analogous hardness results for PFSAs rely on the fact that there are multiple derivations for a single string Casacuberta and de la Higuera (2000). The fact that these results hold for the relatively simple RNNs used in this paper suggests that the case would be the same for more complicated models used in NLP, such as long short-term memory networks (LSTMs; Hochreiter and Schmidhuber 1997).

Our results show the non-existence of (efficient) algorithms for interesting problems that researchers using RNNs in natural language processing tasks may have hoped to find. On the other hand, the non-existence of such efficient or exact algorithms gives evidence for the necessity of approximation, greedy, or heuristic algorithms to solve those problems in practice. In particular, since finding the highest-weighted string in an RNN is the same as finding the most likely translation with a sequence-to-sequence RNN decoder, our NP-completeness and APX-hardness results provide some justification for employing greedy and beam search algorithms in practice.

Acknowledgments

This work was supported by DARPA (W911NF-15-1-0543 and HR0011-15-C-0115). Andreas Maletti was financially supported by DFG Graduiertenkolleg 1763 (QuantLA).

References

Appendix

Theorem 11.

Identifying the best string of polynomial length in a consistent RNN is NP-complete and APX-hard.

Proof.

Clearly, we can guess an input string of polynomial length, run the RNN, and verify whether its weight exceeds the given bound in polynomial time. Therefore the problem is trivially in NP. For NP-hardness we now reduce from the 0-1 Integer Linear Programming Feasibility Problem to our problem:

0-1 Integer Linear Programming Feasibility:

Given: variables that can take only the values 0 and 1, and a finite set of linear constraints over these variables. Return: "yes" iff there is an assignment of values to the variables that satisfies all constraints.

Suppose we are given an instance of the above problem. Construct an instance of the consistent best string with polynomial length problem with input , where:

  1. is an RNN as follows:

    Let , . We pick a big enough positive rational number so that if we define ,

    (1)

    When , set

    Therefore

    When , one can verify that we can set

    where

    so that

    since the range of is a finite set of values .

From equation (1) we can pick the required constant such that its binary representation has length logarithmic in the numbers of variables and constraints, so the weights in the matrices that produce it are polynomial in those quantities. The same is true for the remaining weights, whose binary representations also have length polynomial in the numbers of variables and constraints. So our construction is polynomial.

We now prove that if we can solve the constructed instance of consistent best string of polynomial length in polynomial time, we can also solve the given instance of 0-1 Integer Linear Programming Feasibility in polynomial time.

By our design, during the initial phase the RNN reads a binary string into its neurons while predicting roughly equal probability for either 0 or 1 and only an infinitesimal probability for termination. Therefore no string of length less than the designated length has weight greater than the bound.

At the designated time step, each of these neurons acts as an indicator for whether the corresponding constraint is satisfied, so their sum is the total number of clauses satisfied by the given assignment of the variables. The termination probability at this step depends on that sum: if all clauses are satisfied, the corresponding string has termination probability, and therefore weight, greater than the bound; if fewer clauses are satisfied, its weight is at most the bound.

Beyond the designated length, the RNN continues to assign roughly equal probability to 0 and 1 and only an infinitesimal probability to termination. Therefore any string of greater length has weight smaller than the bound. From that point on the output vector is constant, so the RNN is consistent. Notice that the weights of strings decrease monotonically with length except at the designated length.

Therefore our construction ensures that there is only one length at which a string can have weight greater than the bound. Thus, if there is any string whose weight is greater than the bound, the given instance of 0-1 Integer Linear Programming Feasibility is feasible; otherwise it is not.

Define the maximum number of clauses satisfiable over all assignments of the variables:

By our construction, when , the highest-weighted string will occur at length , and has weight which is proportional to <