One Model for the Learning of Language

11/16/2017 · by Yuan Yang, et al.

A major target of linguistics and cognitive science has been to understand what class of learning systems can acquire the key structures of natural language. Until recently, the computational requirements of language have been used to argue that learning is impossible without a highly constrained hypothesis space. Here, we describe a learning system that is maximally unconstrained, operating over the space of all computations, and is able to acquire several of the key structures present in natural language from positive evidence alone. The model successfully acquires regular (e.g. (ab)^n), context-free (e.g. a^n b^n, x x^R), and context-sensitive (e.g. a^n b^n c^n, a^n b^m c^n d^m, xx) formal languages. Our approach develops the concept of factorized programs in Bayesian program induction in order to help manage the complexity of representation. We show that, in learning, the model predicts several phenomena empirically observed in human grammar acquisition experiments.


1 Introduction

One of the central debates in language acquisition is whether the syntactic structures of natural language are genetically specified or learned through experience. A key tool in this debate has been the use of formal mathematical analysis to determine what learners could or could not logically conclude from the type of data they observe. Perhaps the most famous result is Gold's Theorem Gold (1967), which holds that observers of positive examples of even a regular language Hopcroft (1979) could not identify the set of allowable ("grammatical") strings with certainty. In linguistics and cognitive science, Gold's proof was taken to mean that innate constraints must be present for learning to succeed. The theorem gave rise to rich formal theories of what could be learned under similar assumptions and correspondingly what must be innate (e.g. Wexler & Culicover, 1983), although its relevance for cognitive science is not universally accepted Johnson (2004). Indeed, more recent work has undermined the view that learners must come with substantial built-in knowledge about the specific structures of language. Chater and Vitányi (2007) describe a theoretical learning framework without tight constraints that could still identify the right language from only positive data. Their analysis makes two key departures from Gold-style analyses: first, they assume that sentences are sampled from the parent's generative model, rather than adopting a worst-case analysis. Second, the model considers all possible computations as hypotheses, showing that the space need not be constrained innately for an idealized learner. The core underlying idea of this work is that learners attempt to find simple descriptions of the data they observe, much like work in Minimum Description Length Grünwald (2007) and artificial intelligence more generally Hutter (2005). This perspective has been further developed to correctly predict the difficulty of constructions Hsu & Chater (2010); Hsu et al. (2011).

The present paper applies this perspective to learning the kinds of structures supporting natural language that were first presented in Chomsky's "Three models for the description of language" Chomsky (1956); Lees & Chomsky (1957); Chomsky (1959). Chomsky noted that many dependencies in natural language could be captured abstractly with fundamentally different kinds of computational devices. Some devices require a finite amount of memory (e.g. finite-state machines), some require a stack (e.g. phrase structure rules), and some require even more powerful computational systems (e.g. context-sensitive grammars). These relationships hold true for even very simple versions of natural language structures, including sets of strings—called formal languages—over simple alphabets. For example, the simple recursion/iteration allowed by English adjectives ("The beautiful well-dressed young lady") could be captured by the formal language a^n, indicating that any number of adjectives (a's) could be put together into a valid (sub)string. The language (ab)^n could be seen in, for instance, embedding in lists of determiner-noun pairs ("Bring me two cars, three accordions, and six pickles.") or in subject-verb pairs of sentences in discourse ("John laughed. Sally cried. Mindy marveled."). The formal language a^n b^n might characterize the key dependencies in English if-then relationships, where every "if" (an "a") must be followed by a "then" (a "b"), as in "If If Mary cried then John was sad then John cares about Mary."

In this study, we consider a learning model that, like that of Chater and Vitányi (2007), builds in a space of all computations. In contrast to Chomsky's Universal Grammar (UG), specifying such a space requires building in only a few operations: a minimal Turing-complete set. Results from computer science, complexity theory, and logic have shown that it takes very little formal mechanism to allow Turing-completeness. By building in only such a minimal mechanism—here, lambda calculus Church (1936); Hindley & Seldin (1986)—we are able to construct a theory whose innate assumptions are explicit and extremely minimal. Out of a large, implicitly defined hypothesis space, learners are able to construct new computations which correspond to different types of formal language systems. Our model shows that, given a single such system, learners could construct or genuinely discover cognitive representations corresponding to finite, regular, context-free, and even context-sensitive languages and beyond.
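To illustrate how little formal mechanism Turing-completeness demands, the following sketch (our own illustration in Python, not part of the model) obtains recursion from anonymous functions alone via a strict fixed-point (Z) combinator; the string-length function is just a stand-in example.

    # Recursion without naming: the strict fixed-point ("Z") combinator.
    Z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

    # A recursive definition that never refers to itself by name:
    length = Z(lambda rec: lambda s: 0 if s == "" else 1 + rec(s[1:]))

    print(length("aaab"))  # prints 4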

A previous study by Perfors, Tenenbaum, and Regier (2011) follows a similar paradigm: they present a learner who does Bayesian inference over different kinds of grammatical systems, including finite, regular, and context-free grammars. They show that with very little positive evidence in the form of part-of-speech sequences from CHILDES MacWhinney (2000), learners could infer that natural language requires a context-free grammar. The key difference between the present work and Perfors et al. (2011) is that we do not build in just a handful of possible alternatives; instead, we define an implicit, unconstrained space of hypotheses. Counterintuitively, such a larger space can be more concise or parsimonious to build in, because the description length of what must be assumed by the model is very short, whereas the former approach has often been criticized for having "built in" more than alternatives such as UG.

In the next section, we describe a computational model that can acquire these kinds of fundamentally different computational structures. We then show that it can learn a variety of formal languages from positive evidence only, and that in doing so it is unproblematic to acquire languages with infinite cardinality, even given finite data. We then show that the model can combine several different computational sub-structures and acquire a grammar of simplified English, including non-context-free language structures described in prior literature. In the last section, we show that the model is able to capture several experimental findings from artificial language learning.

2 A computational model

The starting assumption for our model is Fodor's (1975) Language of Thought (LOT), the hypothesis that mental operations are structured and compositional, or language-like. Following a growing body of work in Bayesian LOT models Goodman et al. (2008); Piantadosi et al. (2012); Goodman & Tenenbaum (2014); Piantadosi (2011); Piantadosi et al. (2016); Piantadosi & Jacobs (2016); Yildirim & Jacobs (2013), we define a set of operations that compose, and have learners consider all compositions of these operations as possible hypotheses. These hypotheses can be interpreted as short computer programs which "compute" strings of part-of-speech sequences, although the input and output could be different in other domains. Importantly, the specification of these programs is domain-general, meaning that essentially the same setup can be used to explain acquisition in other domains like counting, kinship, cross-modal transfer, and concept learning. In all cases, the task of the learner is to determine which compositions of primitives are likely to be correct, given the observed data.

2.1 Hypothesis space

Functions on strings
    pair(x, y): concatenate lists x and y
    first(x): return the first element of list x
    rest(x): return list x without its first element
Logical functions
    flip(p): randomly return true (probability p) or false (probability 1 - p)
    empty(x): return true if string x is empty, otherwise false
    if(b, x, y): return x if b is true, else return y
Function calls
    call(F, x): call the expression F with argument x
Table 1: Primitive functions used in the LOT.

The primitives that we assume are composed are motivated by minimalist, functional programming languages like Scheme Sussman et al. (1983), which also forms the basis of other LOT work (e.g. Goodman & Tenenbaum, 2014). There are 3 classes of primitive functions, shown in Table 1. The first 3 are list operators: pair, first and rest, which respectively join two elements together (somewhat like Chomsky's (1995) merge), take the first element of a pair, or take the second element. The second class of operators includes flip and empty, which return boolean values. flip is notable in that it provides a stochastic element to the grammar, allowing nondeterminism in the generation of structures; if allows for conditional expressions. Finally, we allow a function to call another function, potentially itself, via call. We note the similarity between this grammar and others used in conceptual modeling with a LOT in entirely different domains Piantadosi et al. (2012). While these operations look very simple, they actually permit a surprising range of computations to be expressed.
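To make these primitives concrete, here is a rough Python rendering of the Table 1 operations (our own illustrative sketch, not the paper's LOTlib implementation); strings stand in for lists of symbols, and the empty string plays the role of the empty list.

    import random

    def pair(x, y):
        """Concatenate two (possibly empty) strings."""
        return x + y

    def first(x):
        """Return the first symbol of x (empty string if x is empty)."""
        return x[:1]

    def rest(x):
        """Return x without its first symbol."""
        return x[1:]

    def flip(p=0.5):
        """Return True with probability p, otherwise False."""
        return random.random() < p

    def empty(x):
        """Return True if x is the empty string."""
        return x == ""

    # Conditionals ('if') and function calls ('call') are provided by Python itself.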

Table 2: The grammar used to specify valid compositions of primitives.

Compositions of these primitives, licensed by the grammar (Table 2), define a set of functions which act as hypotheses for learners. In an analogy to Borges' Library of Babel (developed further in the discussion), these primitives are the characters in the books, and a book (a composition of primitives) is a single hypothesis. The goal of the learner is to walk through the library (hypothesis space) and find the best book (hypothesis) that fits the observed data. For instance, a possible expression built out of these primitives is,

F1(x) = pair(a, if(flip(0.7), ε, F1(ε)))

Here we have named this function "F1" and have allowed it to recursively call itself in its own definition via call. When this expression evaluates, it concatenates (pairs) an "a" with either the empty string ε or the outcome of calling itself with the empty string as an argument, call(F1, ε). In terms of a set of strings, this function represents the formal language a^n (n ≥ 1). However, this function also gives a probability distribution over strings, since some (e.g. "a") are much more likely to be generated than others (e.g. "aaaaa") according to the probability of each flip. In particular, the probability distribution this assigns is geometric, P(a^n) = 0.7 · 0.3^(n-1). Note, however, that these operations permit a vast array of different types of languages and computations to be specified.
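As a quick check on this distribution, the sketch below (ours, written directly in Python rather than the model's representation) samples from an F1-like generator and tallies string lengths; the empirical proportions approximate the geometric distribution P(a^n) = 0.7 · 0.3^(n-1).

    import random
    from collections import Counter

    def F1():
        """Sample from the example hypothesis: emit "a"s until a 0.7-probability flip succeeds."""
        return "a" + ("" if random.random() < 0.7 else F1())

    samples = Counter(len(F1()) for _ in range(100000))
    for n in range(1, 6):
        print("P(a^%d): empirical %.3f, predicted %.3f"
              % (n, samples[n] / 100000.0, 0.7 * 0.3 ** (n - 1)))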

2.2 Factorized hypothesis

One important feature of nearly all working computational systems is that they have mechanisms to handle complexity, in large part through re-use of subcomputations (for work on learning principled subcomputations to re-use, see O'Donnell (2015)). One simple way to manage the complexity of hypotheses and efficiently structure search is to allow learners to incrementally build on hypotheses, extending partially working theories in order to cover new data. Here, we introduce a factorized hypothesis consisting of a sequence of functions, each of which may be defined in terms of prior functions. This is best illustrated with an example:

F1(x) =
    pair(a, if(flip(0.7), ε, F1(ε)))
F2(x) =
    if(empty(x), ε, pair(first(x), pair(F2(rest(x)), b)))
F3(x) =
    F2(F1(ε))

The first function, F1, is simply the a^n language shown above. The second function, F2, is more interesting: it is a function that takes an argument x, and recursively concatenates the first element of x with a recursive call to itself on the rest of x, followed by a 'b'. Because this is recursive on a shortening list (rest(x)), one 'b' will be added for each element of x. For instance, if we called F2 on the string 'xyz', we would be left with the string 'xyzbbb'. Finally, F3 puts F1 and F2 together, calling F2 with F1 as an argument. Thus, the strings generated by F1 will be passed to F2, which will attach a single 'b' for each 'a', giving rise to strings of the form a^n b^n. (Note this is just an example, as there are other, simpler ways to generate a^n b^n.) In this way, a complex computation can be broken down into pieces that are more manageable, each of which might be more easily learned on its own. In the domain of language or grammar learning, such a division of labor is especially elegant since many valid strings have valid substrings, so a model that captures part of the data (the substrings) in F1 can be extended to a model that captures the longer strings as well in F2. Note that, in permitting arguments to be passed to functions, this formalism permits important types of variable abstraction.
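To see the factorization in action, here is a rough Python transcription of F1, F2, and F3 (our sketch of the example, not the model's internal representation); repeated calls to F3 produce strings of the form a^n b^n.

    import random

    def F1():
        """a^n: append "a"s until a 0.7-probability flip succeeds."""
        return "a" + ("" if random.random() < 0.7 else F1())

    def F2(x):
        """Append one 'b' for every symbol in x, e.g. 'xyz' -> 'xyzbbb'."""
        return "" if x == "" else x[0] + F2(x[1:]) + "b"

    def F3():
        """Compose the pieces: strings of the form a^n b^n."""
        return F2(F1())

    print(sorted({F3() for _ in range(20)}, key=len))  # e.g. ['ab', 'aabb', 'aaabbb']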

In the implementation, we allow up to 10 different functions to be defined in this way. Note that each function can only call itself and prior functions (e.g. F3 cannot call F4, but F4 can call F3 and F4). Also, it is assumed by convention that the output of the hypothesis is its final function called with the empty string as input.

2.3 The probabilistic model

Next, we turn to the specification of a probabilistic model over this space that supports tractable learning and inference. Following prior work using the Bayesian LOT, we convert the grammar of operations to a Probabilistic Context-Free Grammar (PCFG) Manning & Schütze (1999) that specifies the prior probability of any hypothesis. The use of a PCFG automatically penalizes complex hypotheses, fitting the idea that simplicity is a core inductive pressure Chater & Vitányi (2003); Feldman (2000).

The goal of a learner is to observe some strings D and infer a posterior on hypotheses, P(h | D). Following Bayes' rule, P(h | D) ∝ P(h) P(D | h), where the prior P(h) is given by the PCFG times a penalty on having factorized components (F1, F2, ..., FK), and the likelihood P(D | h) is a multinomial specifying how likely the observed strings are to be generated by h. This likelihood includes a fixed outlier log-probability penalty for strings in the data that are assigned zero probability by h. To manage the uncomputability inherent in this space, hypotheses that do not "halt" after a set number of recursive calls (e.g. those that are likely to be in an infinite loop) are assigned a zero prior.
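Putting prior and likelihood together, the score of a hypothesis can be sketched as below. This is a simplified illustration under our own assumptions, not the LOTlib implementation: the constants and the hypothesis methods (pcfg_log_prob, num_factors, prob_of_string) are hypothetical names standing in for the corresponding quantities.

    import math

    OUTLIER_LOG_PENALTY = -25.0   # hypothetical value; the paper's constant is not reported here
    FACTOR_LOG_PENALTY = -1.0     # hypothetical per-function penalty on factorized components

    def log_posterior(hypothesis, data):
        """Unnormalized log posterior: log prior + log likelihood (sketch)."""
        # Prior: PCFG log probability of the program plus a penalty per factor F1..FK.
        log_prior = hypothesis.pcfg_log_prob() + FACTOR_LOG_PENALTY * hypothesis.num_factors()

        # Likelihood: each observed string is scored by the probability the hypothesis
        # assigns to it; strings it cannot generate incur the fixed outlier penalty.
        log_likelihood = 0.0
        for s in data:
            p = hypothesis.prob_of_string(s)
            log_likelihood += math.log(p) if p > 0 else OUTLIER_LOG_PENALTY

        return log_prior + log_likelihood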

In inference, most of the posterior probability mass will be held by just a few concise and well-fitting hypotheses. In our implementation (code available online in LOTlib; Piantadosi, 2014), we run 12 Monte-Carlo chains for 50000 steps of the tree-regeneration MCMC method presented in Goodman et al. (2008) for varying amounts of data. This algorithm essentially takes a random walk around hypotheses that is biased towards high-probability regions. Over time, this will converge on hypotheses with high posterior probability, and indeed can be shown to correctly sample from the posterior distribution. We collect the 100 highest-posterior expressions found in each chain. These samples are taken as a fixed, finite hypothesis space for the purposes of computing summary statistics and plots. In each model run, we add to the grammar the required terminal symbols specific to each data set.
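Schematically, the inference loop is a Metropolis-Hastings random walk over programs, as sketched below using the log_posterior sketch above; the propose method (which in Goodman et al.'s (2008) algorithm regenerates a random subtree from the grammar) is an assumed interface, not LOTlib's actual API.

    import math, random

    def mcmc(initial, data, steps=50000, keep=100):
        """Schematic Metropolis-Hastings walk over hypotheses (a sketch, not LOTlib)."""
        current, current_score = initial, log_posterior(initial, data)
        samples = []
        for _ in range(steps):
            proposal, log_q_ratio = current.propose()   # assumed: regenerate a random subtree
            proposal_score = log_posterior(proposal, data)
            # Accept with probability min(1, posterior ratio x proposal correction).
            if math.log(random.random()) < proposal_score - current_score + log_q_ratio:
                current, current_score = proposal, proposal_score
            samples.append((current_score, current))
        # Return the highest-posterior hypotheses found along the way.
        return [h for _, h in sorted(samples, key=lambda t: t[0], reverse=True)[:keep]]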

3 Learning simple formal languages

We first study the model's ability to learn several simple formal languages that are motivated by the structures present in natural language. Since Chomsky (1956), numerous authors have attempted to pinpoint what type of computational system supports natural language, with much of the debate focusing on which constructions evidence computational requirements beyond context-free languages. This has motivated attempts to map natural language structures onto very simple formal languages whose computational properties can be fully characterized.

Specifically, we examine the languages a^n, (ab)^n, a^n b^n, x x^R, a^n b^n c^n, and the Dyck language (valid sequences of parentheses). Here, a^n and (ab)^n are regular languages and can be accepted by finite state automata; a^n b^n, x x^R, and the Dyck language are context-free languages, and are recognized by pushdown automata; a^n b^n c^n is a context-sensitive language that requires more than a context-free grammar. Critically, many of these languages have analogs in natural language syntax, and the Dyck language specifies the valid ways in which a sentence could be parsed.

3.1 Results

Figure 1: Weighted F-score against the amount of observed data for each of the six formal languages, shown in panels (a) through (f). The F-score curve for the Maximum a Posteriori (MAP) hypothesis is also shown.

Fig. 1 shows the posterior-weighted precision, recall, and F-scores of the hypotheses given different sizes of data from these formal languages. Here, the data consist of observations of positive example strings (e.g. from a^n b^n we might observe strings like ab, aabb, aaabbb) and are provided to the same learning model. The model is able to rapidly acquire the correct kind of computational system for each, often taking only 5-10 sampled strings before it accurately learns the right system. The implication of this result is that an ideal learner with only domain-general constraints is able to learn the unobserved grammatical structures and computations that underlie a target set of strings rapidly, and from very little data. To illustrate, Table 3 shows example hypotheses for each type of language.
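Precision, recall, and F-score for a single hypothesis can be approximated by comparing the strings it generates against (short) members of the target language, as in the sketch below (our own Monte Carlo approximation, not the paper's exact scoring procedure); the reported curves additionally weight each hypothesis's score by its posterior probability.

    def f_score(hypothesis_sampler, target_strings, n_samples=10000):
        """Approximate precision/recall/F-score of a stochastic hypothesis (sketch)."""
        generated = [hypothesis_sampler() for _ in range(n_samples)]
        target = set(target_strings)

        # Precision: fraction of generated strings that fall in the (short) target list;
        # valid strings longer than the listed targets count as errors in this rough estimate.
        precision = sum(s in target for s in generated) / float(len(generated))

        # Recall: fraction of (short) target strings the hypothesis ever produces.
        produced = set(generated)
        recall = sum(s in produced for s in target) / float(len(target))

        return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

    # Example: score the factorized a^n b^n hypothesis F3 from Section 2.2.
    # target = ["a" * n + "b" * n for n in range(1, 11)]
    # print(f_score(F3, target))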

Table 3: Example hypotheses for the six classes of formal languages.

3.2 Learning that a language is infinite

One of the most striking properties of natural language is that there are infinitely many valid, grammatical sentences of English (though see Pullum & Scholz, 2010). This fact might be considered a core aspect of our innate linguistic endowment, perhaps a consequence of the generative capacity that forms UG. On the other hand, once we consider learners who are computationally sophisticated, it may be possible to learn that the generative system supporting language is infinite. After all, there are aspects of language which are not infinite (e.g. the set of function words), and there have even been argued to be languages with finite cardinality Futrell et al. (2016). Interestingly, from a learning point of view, it is not entirely obvious what finite data might convince learners that their language permits infinitely many strings.

Figure 2: The sum of posterior probabilities of infinite hypotheses given finite and infinite target data.

To investigate this, we generated data from two target languages: (a) the infinite language a^n (n ≥ 1), or (b) the finite language a^n with n ≤ 3. In both cases, positive example strings are sampled from the target language. Fig. 2 shows, under each type of data, the posterior probability assigned to hypotheses that generate an infinite language. The type of hypothesis the model learns in the infinite case looks like the F1 example above for a^n; in the finite case, the model could learn a hypothesis like

F1(x) = pair(a, pair(if(flip(0.5), ε, a), if(flip(0.5), ε, a)))

What should be clear is that as finite languages grow in cardinality, the hypotheses needed to represent them become more complex. In the absence of overwhelming data, learners will prefer a more concise hypothesis, and this will tend to bias them towards hypotheses that generate infinite sets. It takes additional data to convince a simplicity-driven model that longer strings (e.g. aaaa) are not permitted. This shows that languages with infinite cardinality may result from core inferential properties of a computationally sophisticated learner (see also Piantadosi & Fedorenko, 2017).
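This trade-off can be made concrete with a toy calculation (entirely our own construction; the prior values are illustrative, not the paper's): a short recursive hypothesis for a^n starts out ahead on the prior, and only an accumulation of bounded-length data lets the memorizing hypothesis overtake it.

    import math

    # Finite hypothesis: memorizes {a, aa, aaa} with probabilities 0.25, 0.5, 0.25
    # (matching the flip(0.5) hypothesis above). Infinite hypothesis: geometric over a^n.
    finite_prob = {"a": 0.25, "aa": 0.5, "aaa": 0.25}
    infinite_prob = lambda s: 0.5 ** len(s) if s and set(s) == {"a"} else 0.0

    LOG_PRIOR_FINITE = -15.0    # longer, enumerating program: lower (illustrative) prior
    LOG_PRIOR_INFINITE = -10.0  # shorter, recursive program: higher (illustrative) prior

    def log_post(log_prior, prob, data):
        return log_prior + sum(math.log(prob(s)) for s in data)

    for n in (4, 12, 40):
        data = ["a", "aa", "aa", "aaa"] * (n // 4)            # bounded-length observations
        lp_fin = log_post(LOG_PRIOR_FINITE, lambda s: finite_prob.get(s, 0.0), data)
        lp_inf = log_post(LOG_PRIOR_INFINITE, infinite_prob, data)
        print(n, "finite wins" if lp_fin > lp_inf else "infinite wins")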

4 Learning a simplified English phrase structure grammar

Of course, the problem faced in natural syntax is not just a series of unrelated formal languages, but data that combine many forms of computation together. We next test the learning model on a grammar of a simple subset of English: a probabilistic phrase structure grammar over part-of-speech symbols.

In this grammar, the number over each arrow gives the unnormalized probability of each expansion (so AP is three times more likely to expand to "a" than to "a AP"). This target grammar includes many interesting syntactic structures of English, including multiple expansions of a nonterminal type (e.g. NP), tail recursion (in AP), transitive and intransitive verbs (in VP), and sentential embedding (in VP).

Importantly, although we have written this as a phrase structure grammar, as above, the learning model is not told that the strings it observes come from this grammar, or indeed from any kind of phrase structure at all. The strings that are provided have no parsing information or structure: for instance, one string the model might observe is 'if d a n v then n v that d n v d n', which corresponds to the part-of-speech sequence for a sentence like "If an angry man smiles then Jane believes that the instrument is an accordion." The challenge for the model is to take these kinds of part-of-speech sequences and infer the generating computational process from among all compositions of primitives.

Figure 3: (a) Weighted F-score of hypotheses for the simplified English case, given different amounts of data; (b) an example hypothesis for simplified English.

The learning curves are shown in Figure 3(a), which gives the posterior probability (y-axis) of various types of grammars as a function of the amount of data (x-axis). The model eventually settles on a hypothesis like the one shown in Figure 3(b). Importantly, the components of this grammar (recursion, tail recursion, random choice of expansions, etc.) are not "built in" as necessary mechanisms for learners. Instead, the parts of this grammar which mimic a PCFG are constructed by composing function calls with the assumed primitives.

5 Connections to human learning

This computational model is interesting not only as a theoretical demonstration of the power of statistical learning, but also because its underlying principles can be directly connected to human behavior. In this section, we consider prior experiments on human artificial grammar learning (AGL) and show that the model exhibits patterns similar to those of humans.

5.1 Structure of the input

A wealth of research has sought to understand human learning of center-embedded recursive structure Fitch & Hauser (2004); Lai & Poletiek (2011, 2013). This structure, in our context, can be characterized as an a^n b^n language; an example in natural language is "the car that I drove yesterday broke". It is of particular interest because it is sufficiently complex and has been argued to be a crucial feature of the human language faculty Hauser et al. (2002). Lai and Poletiek studied whether an unequal distribution of training exemplars can influence the process of learning a center-embedded grammar. They argued that a biased distribution, where short exemplars are more likely to be presented to learners than longer ones, can facilitate acquisition. In their experiments, participants were asked to learn an artificial language generated by a center-embedded recursive grammar which resembles the a^n b^n language. They were exposed to 12 blocks of training data and completed a test at the end of each block. Participants were divided into two groups. Training data for the first group was randomly sampled from the grammar (Random Case), with short and long exemplars occurring in equal numbers; training data for the other group had a biased distribution over exemplars, in which the number of exemplars rises exponentially as they get shorter (e.g., the shortest exemplars are twice as frequent as the second shortest, four times as frequent as the third shortest, and so on). These settings are illustrated in Fig. 4(a) and Fig. 4(c). Their results showed an improvement in learning, as measured by the rate of correctly recognizing the target language in the test phase: when given data with the biased distribution, participants' performance increased by 30%.

Figure 4: The composition of data blocks in the experimental conditions from Lai and Poletiek (2011, 2013): (a) skewed frequency, (b) staged input, and (c) random.

Lai and Poletiek (2011) studied another factor that can facilitate learning: staged input. They argued that AGL can be facilitated by explicitly controlling exemplar difficulty across several learning stages. The experimental setup was similar to the previous one, except that learners in the test group now experienced 3 stages of training, each containing 4 blocks. For each stage, the maximum length (or complexity) of exemplars was limited, and this limit increased as participants proceeded to higher stages. This setting is illustrated in Fig. 4(b). Experimental results showed a 40% improvement in performance when learners were exposed to staged training.

We constructed three datasets of the same size (144 items) and language (a^n b^n) as those used in these experiments, matching the input data composition to these studies. The only difference was the surface form of the exemplars, which for simplicity we mapped to "a" and "b" rather than syllables.

Figure 5: The average F-score of the MAP hypothesis given different types of data (skewed frequency, staged input, and uniform input) as a function of the total number of MCMC steps taken.

Fig. 5 shows the learning curves for each of these kinds of input. These results show that models with skewed frequency and staged input perform better than the random case (reaching a higher F-score faster), agreeing qualitatively with the human results. The computational explanation for this is also intuitive: data with a uniform distribution require the model to produce strings at every level of complexity with equal probability, which is hard to realize in our model. The main device for producing complex strings of a given grammatical type is stochastic recursion, and under this mechanism the chance of generating complex strings decreases exponentially, almost as a built-in property.
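This intuition can be checked directly with a stochastic-recursion generator for a^n b^n (our own sketch): the lengths it produces decay exponentially, so a skewed training distribution matches the model's natural output statistics far better than a uniform one.

    import random
    from collections import Counter

    def anbn(p_stop=0.5):
        """Sample from a^n b^n by stochastic recursion: with probability p_stop emit 'ab',
        otherwise wrap a recursive call in one more 'a'...'b' pair."""
        return "ab" if random.random() < p_stop else "a" + anbn(p_stop) + "b"

    lengths = Counter(len(anbn()) // 2 for _ in range(100000))
    for n in range(1, 6):
        print("n=%d: proportion %.3f" % (n, lengths[n] / 100000.0))   # roughly 0.5^n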

5.2 Nonadjacent dependency

The third case we consider is that of nonadjacent dependencies, one of the most characteristic features of human languages. Examples can be found in the dependencies between auxiliaries and inflectional morphemes (e.g. "is running, has eaten"), as well as dependencies involving number agreement ("The rocks on the bluff are jagged"); nested nonadjacent dependencies can be found in, for instance, the if-then constructions above Lees & Chomsky (1957).

In the language learning literature, such dependencies have been studied in very simple constructions without recursive embedding. Gómez (2002) had participants learn an aXb-like language, meaning that the first word (a) and last word (b) were fixed and X was free to vary. She found that it was easier for people to discover the fixed dependency between a and b if the X was more variable. In her experiment, X was chosen from a fixed pool of strings; as the cardinality of the pool increased, learners performed better. The result is intuitive: when there are more possible Xs, each is harder to predict, which may free resources for discovering the association between a and b.

Figure 6: (a) Weighted F-scores of adjacent and nonadjacent dependencies, and their difference, as a function of pool size; (b) example hypotheses learned with small and large pool sizes (only one dependency pair is shown for simplicity).

We adopt the same paradigm in our computational test. There are 3 pairs of nonadjacent dependencies, of the form a_1 X b_1, a_2 X b_2, and a_3 X b_3, where X is a string of length 3 drawn from a small set of characters. We construct the pool from which X is drawn by enumerating all possible three-character strings over this set. When generating the strings, we draw X uniformly from the pool.

Gómez (2002) obtained these results by showing learners strings with both trained and untrained dependencies (ones that do not appear in the training strings), and then asking them to judge whether each string is valid. The rates at which learners accepted trained and untrained exemplars were computed, and the difference between these two values was taken as the final measure of performance. In our computational test, we can track how much attention our model pays to the nonadjacent dependencies by looking at the posterior probability of those hypotheses that generate such structures. The idea linking these two measures is straightforward: if a learner assigns more posterior weight to correct hypotheses, then the probability of generating untrained strings under the posterior is smaller, and hence there is less chance of accepting them as valid.

To get this measure, we first evaluate every hypothesis in our finite hypothesis set for 2048 runs, and check whether all generated strings contain valid pairings, such as a_1...b_1 and a_2...b_2. Candidates for which all generated strings are valid are taken to be hypotheses that encode the nonadjacent dependencies. We then compute the posterior of those hypotheses given pool sizes ranging from 2 to 25. The resulting posterior probability curve is shown in Fig. 6.
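The validity check described here can be sketched as follows (our own illustration; the symbols 'a1', 'b1', etc. are hypothetical placeholders for the experiment's actual words, while the 2048-run evaluation follows the setting above).

    import re

    # Hypothetical surface symbols for the three dependency pairs (placeholders,
    # not the actual stimuli).
    PAIRS = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]

    def respects_dependencies(string, x_len=3):
        """True if the string has the form a_i X b_i for some trained pair, with |X| = x_len."""
        return any(
            re.fullmatch(re.escape(a) + "." * x_len + re.escape(b), string)
            for a, b in PAIRS
        )

    def encodes_nonadjacent_dependency(sample_fn, runs=2048):
        """A hypothesis counts as encoding the dependency if every sampled string is valid."""
        return all(respects_dependencies(sample_fn()) for _ in range(runs))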

This result reveals a learning preference similar to the one we observed in the finite vs. infinite language case. When the target language has a small cardinality (i.e. a small pool size), the model tends to memorize all observed strings (examples are shown in Fig. 6(b)): at this point, the cost of memorization remains low, lower than that of representing the underlying generative model of the target language. But as the pool size grows, memorizing the observed data increases hypothesis complexity, so the model instead favors rules for generating the data, leading to an increase in the posterior probability of hypotheses containing nonadjacent dependencies.

6 General discussion & Conclusion

Statistical language learning has often been criticized by formal linguists as failing to do justice to the structure of language. Indeed, the earliest statistical learning paradigms Saffran et al. (1996); Aslin et al. (1998); Saffran et al. (1999) focused on domains with little structure, such as word segmentation. But this was for good reason: the computational system behind syntax and structure is complex, and a considerable amount is learned about the general capacity of human learners (what general class of theories science should explore) by demonstrating statistical acquisition in any domain. Indeed, these early statistical learning experiments motivated the development of richer computational models (e.g. Redington et al., 1998), as well as Perfors et al.'s (2011) work that cast learners as performing model comparison. Perhaps the most interesting critique of Perfors et al. has argued that it "builds in" more than alternatives such as Chomsky's Universal Grammar (UG), because in Perfors et al.'s work, learners consider seven different grammars. Surely it is a more parsimonious theory of human nature that learners come equipped with one single grammar rather than seven. The argument also applies to Chater et al.: they allow learners to consider an infinite number of grammars. How could "building in" an infinite number of grammars be a more plausible or parsimonious theory than building in just one?

It turns out that this thinking is wrongheaded—but for a very interesting reason. The argument can best be understood by considering Jorge Luis Borges’ short story, The Library of Babel. Borges imagines an infinite library full of every possible book—every possible sequence of characters. The curious fact about the library which contains every possible book is that it actually contains essentially no information at all. Its contents can be completely specified with just a few words (“all possible books”), or compressed into an extremely short generating computer program, yielding a negligible complexity or description length. Certainly, it is much easier to describe the entire library of Babel than to describe one of its typical books. In this way, it is often more concise or parsimonious to build in a larger, unconstrained space of hypotheses because a large space can easily have a more concise description (e.g. “all computations”) compared to a constrained hypothesis space.

The model described in this paper shows in principle how learners can build in very little in terms of representation: just the specific operations in Table 1. This model is the first to show that statistical learning can construct representations of provably different computational power, including regular, context-free, and context-sensitive hypotheses, and in doing so it requires surprisingly little data. (There is an interesting sense in which models such as these challenge the very notion of innateness. We could ask: are the infinitely many hypotheses considered here "innate"? Or is only the capacity for generating hypotheses innate? We tend toward the latter sense because the amount that is "built in" is very small. When we download a C compiler, it would be very strange to think that all of the programs that could be written are "innately" included. Instead, the compiler has a generative capacity to evaluate many such programs, and it is left up to the programmer to construct the specific program that should be run.)

The overarching view, then, is not one of comparing a few different built-in grammars, but of applying what is likely the same remarkable capacity for discovering algorithms that is apparent in other cognitive domains. As the recent innovations in the Bayesian LOT and inductive computation Chater & Vitányi (2007); Hsu & Chater (2010); Hsu et al. (2011) make clear, such statistical models are just getting started. However, it is likely that ongoing research will push towards more detailed linguistic facts and correspondingly sophisticated learning theories. (Recently, for example, Mitchener and Becker (2010) studied Berwick et al.'s (2011) example "*The child seems sleeping" and showed that a hierarchical Bayesian model, motivated by Perfors et al. and other models, showed promise in acquiring the verb's syntax.) The work on the simplified English grammar has shown that the same model can begin to scale to more complex types of computations and syntactic processes.

7 Conclusion

We have shown that a statistical learner who operates over computational processes can discover many of the key features of syntactic structures, including regular, context-free, and context-sensitive languages, using positive evidence only and a simple, unitary learning mechanism. This work opens the door to more sophisticated statistical learning theories that incorporate other areas of language, like semantics and pragmatics, into the generating process, and unifies theories of syntactic learning with state-of-the-art machine learning models used in other areas of cognition.

References

  • Aslin, R. N., Saffran, J. R., & Newport, E. L. (1998). Computation of conditional probability statistics by 8-month-old infants. Psychological Science, 9(4), 321–324.
  • Berwick, R. C., Pietroski, P., Yankama, B., & Chomsky, N. (2011). Poverty of the stimulus revisited. Cognitive Science, 35(7), 1207–1242.
  • Chater, N., & Vitányi, P. (2003). Simplicity: a unifying principle in cognitive science? Trends in Cognitive Sciences, 7(1), 19–22.
  • Chater, N., & Vitányi, P. (2007). 'Ideal learning' of natural language: Positive results about learning from positive evidence. Journal of Mathematical Psychology, 51(3), 135–163.
  • Chomsky, N. (1956). Three models for the description of language. IRE Transactions on Information Theory, 2(3), 113–124.
  • Chomsky, N. (1959). On certain formal properties of grammars. Information and Control, 2(2), 137–167.
  • Chomsky, N. (1995). The Minimalist Program. MIT Press.
  • Church, A. (1936). An unsolvable problem of elementary number theory. American Journal of Mathematics, 58(2), 345–363.
  • Feldman, J. (2000). Minimization of Boolean complexity in human concept learning. Nature, 407(6804), 630–633.
  • Fitch, W. T., & Hauser, M. D. (2004). Computational constraints on syntactic processing in a nonhuman primate. Science, 303(5656), 377–380.
  • Fodor, J. A. (1975). The Language of Thought. Harvard University Press.
  • Futrell, R., Stearns, L., Everett, D. L., Piantadosi, S. T., & Gibson, E. (2016). A corpus investigation of syntactic embedding in Pirahã. PLoS One.
  • Gold, E. M. (1967). Language identification in the limit. Information and Control, 10(5), 447–474.
  • Gómez, R. L. (2002). Variability and detection of invariant structure. Psychological Science, 13(5), 431–436.
  • Goodman, N. D., & Tenenbaum, J. B. (2014). Probabilistic Models of Cognition.
  • Goodman, N. D., Tenenbaum, J. B., Feldman, J., & Griffiths, T. L. (2008). A rational analysis of rule-based concept learning. Cognitive Science, 32(1), 108–154.
  • Grünwald, P. D. (2007). The Minimum Description Length Principle. MIT Press.
  • Hauser, M. D., Chomsky, N., & Fitch, W. T. (2002). The faculty of language: What is it, who has it, and how did it evolve? Science, 298(5598), 1569–1579.
  • Hindley, J. R., & Seldin, J. P. (1986). Introduction to Combinators and (Lambda) Calculus. Cambridge University Press.
  • Hopcroft, J. E. (1979). Introduction to Automata Theory, Languages, and Computation. Pearson Education India.
  • Hsu, A. S., & Chater, N. (2010). The logical problem of language acquisition: A probabilistic perspective. Cognitive Science, 34(6), 972–1016.
  • Hsu, A. S., Chater, N., & Vitányi, P. M. (2011). The probabilistic analysis of language acquisition: Theoretical, computational, and experimental analysis. Cognition, 120(3), 380–390.
  • Hutter, M. (2005). Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Berlin: Springer.
  • Johnson, K. (2004). Gold's Theorem and cognitive science. Philosophy of Science, 71(4), 571–592.
  • Lai, J., & Poletiek, F. H. (2011). The impact of adjacent-dependencies and staged-input on the learnability of center-embedded hierarchical structures. Cognition, 118(2), 265–273.
  • Lai, J., & Poletiek, F. H. (2013). How "small" is "starting small" for learning hierarchical centre-embedded structures? Journal of Cognitive Psychology, 25(4), 423–435.
  • Lees, R. B., & Chomsky, N. (1957). Syntactic Structures. Language, 33(3, Part 1), 375–408.
  • MacWhinney, B. (2000). The CHILDES Project: The Database (Vol. 2). Psychology Press.
  • Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.
  • Mitchener, W. G., & Becker, M. (2010). Computational models of learning the raising-control distinction. Research on Language and Computation, 8(2–3), 169–207.
  • O'Donnell, T. J. (2015). Productivity and Reuse in Language: A Theory of Linguistic Computation and Storage. MIT Press.
  • Perfors, A., Tenenbaum, J. B., & Regier, T. (2011). The learnability of abstract syntactic principles. Cognition, 118(3), 306–338.
  • Piantadosi, S. T. (2011). Learning and the language of thought (Doctoral dissertation). Massachusetts Institute of Technology.
  • Piantadosi, S. T. (2014). LOTlib: Learning and Inference in the Language of Thought. Available from https://github.com/piantado/LOTlib.
  • Piantadosi, S. T., & Fedorenko, E. (2017). Infinitely productive language can arise from chance under communicative pressure. Journal of Language Evolution, lzw013.
  • Piantadosi, S. T., & Jacobs, R. A. (2016). Four problems solved by the probabilistic Language of Thought. Current Directions in Psychological Science, 25(1), 54–59.
  • Piantadosi, S. T., Tenenbaum, J. B., & Goodman, N. D. (2012). Bootstrapping in a language of thought: A formal model of numerical concept learning. Cognition, 123(2), 199–217.
  • Piantadosi, S. T., Tenenbaum, J. B., & Goodman, N. D. (2016). The logical primitives of thought: Empirical foundations for compositional cognitive models. Psychological Review, 123(4), 392.
  • Pullum, G. K., & Scholz, B. C. (2010). Recursion and the infinitude claim. Recursion in Human Language, 104, 113–138.
  • Redington, M., Chater, N., & Finch, S. (1998). Distributional information: A powerful cue for acquiring syntactic categories. Cognitive Science, 22(4), 425–469.
  • Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 1926–1928.
  • Saffran, J. R., Johnson, E. K., Aslin, R. N., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70(1), 27–52.
  • Sussman, G., Abelson, H., & Sussman, J. (1983). Structure and Interpretation of Computer Programs. Cambridge, MA: MIT Press.
  • Wexler, K., & Culicover, P. (1983). Formal Principles of Language Acquisition.
  • Yildirim, I., & Jacobs, R. A. (2013). Transfer of object category knowledge across visual and haptic modalities: Experimental and computational studies. Cognition, 126(2), 135–148.