Unsupervised grammar induction, i.e. learning the grammar rules of a language from a corpus of text or speech without any labeled examples (e.g. sentences annotated with human-created syntax parses), remains in essence an unsolved problem. Although the problem has been studied for decades, useful applications for restricted domains have been presented, and state-of-the-art performance is improving, the resulting grammars for natural language are still not able to properly capture its structure.
Bypassing explicit representations of the grammar rules, recent transformer neural network models have shown powerful abilities at language prediction and generation, indicating that at some level they internally “understand” those rules. However, such rules don’t seem to be found in the neural connections in these networks in any straightforward manner [3, 8], and are not easily extractable without supervision. Supervised extraction of grammatical knowledge from the BERT network reveals that, to map the state of a transformer network when parsing a sentence into the sentence’s parse, complex and tangled matrix transformations are needed.
Here we explore an alternative approach: rather than trying to extract the grammar from the transformer network directly, use the transformer’s language model as a sequence probability oracle, a tool for estimating the probabilities of word sequences; then use these sequence probability estimates to guide the behavior of symbolic learning algorithms performing grammar induction. Our proposal is agnostic as to the mechanism used to find rules, and could synergize well with related efforts [12, 6]; what we introduce is a novel and powerful way to guide the induction. This is work in progress, but preliminary results have been obtained and look quite promising.
Full human-level AI language processing will clearly involve additional aspects not considered here, most critically the grounding of linguistic constructs in non-linguistic data. However, the synergy between symbolic and sub-symbolic aspects of language modeling is a key aspect of generally intelligent language understanding and generation which has not been adequately captured so far, and we feel the work presented here makes significant progress in this direction.
Transformer network models like BERT, GPT-2, and their relatives provide probabilistic language models which can be used to assess the probability of a given sentence. The probability of a sentence $S$ according to such a language model tells you the odds that, if you sampled a random sentence from the model (used in a generative way), the output would be $S$. If $S$ is not grammatical according to the grammar rules of the language modelled by the network, its probability will be very low. If $S$ is grammatical but senseless, we assume, from experimentation with these models, that its probability should also be quite low.
Having a sentence (or more generally word sequence) probability oracle of this nature for a language provides a way to assess the degree to which a given grammar $G$ models that language. What one wants is that: the high-probability sentences according to the oracle tend to be grammatical according to $G$, the low-probability sentences according to the oracle are less likely to be grammatical according to $G$, and $G$ is as concise as possible. The grammars that best satisfy these three criteria jointly are the best grammatical models of the language in question.
This concept could be used to cast grammar induction as a probabilistic programming problem, in a relatively straightforward but computationally exorbitant way. Just sample random grammars from some reasonable distribution on grammar space, and evaluate their quality by the above factors.
What we propose here is conceptually similar but more feasible: Begin with a symbolic grammar learning algorithm which is capable of incrementally building up a complex grammar, then use sentence probability estimates from a neural language model to guide the grammar learning. One could view this as an instance of the probabilistic programming approach, using a linguistic-theory-based heuristic method of sampling grammar space.
Our prior work on symbolic grammar induction uses two main steps to build a dependency grammar from an unlabeled corpus. First, separate the vocabulary of interest into word categories (functionally equivalent to parts of speech, with a certain level of granularity). An implicit sub-step here is the disambiguation of polysemous words in the vocabulary, so that a single word can be assigned to more than one category. Then, perform rule induction to find how words in these categories are connected to form grammatical sentences.
Our proposed approach, which enhances the aforementioned steps with the use of transformer language models, is depicted in Figure 1 and summarized as:
1. Infer word-senses and parts of speech from vectors built using a neural language model as a sentence probability oracle.
2. Infer grammatical rules from symbolic pattern-analysis of the corpus tagged with these senses and parts of speech.
3. Assemble a grammar incrementally from inferred rules. To evaluate whether a given rule should be included in the grammar:
   (a) Using a tree transformer network, generate a set of sentences consistent with the given rule, and others that follow mutations of the rule.
   (b) Use a neural model as a sentence probability oracle to estimate whether the inferred rule leads to better generated sentences than its mutation(s).
For our early experiments, we have chosen BERT as the transformer to use, but the idea could easily make use of similar unsupervised pre-trained networks.
2.1 Assessing sentence probability
To explain details of our approach, we begin with the computation of sentence probability according to a neural language model (illustrated in Fig. 2).
Given a sentence $S$, composed of $N$ words $w_1, \dots, w_N$, we want to calculate its probability $P(S)$. One way to decompose that probability into conditional probabilities is:

$$P_f(S) = P(w_1)\, P(w_2 \mid w_1) \cdots P(w_N \mid w_1, \dots, w_{N-1}) = \prod_{i=1}^{N} P(w_i \mid w_1, \dots, w_{i-1}),$$

which we call forward sentence probability.
A conditional probability $P(w_i \mid w_1, \dots, w_{i-1})$ can be obtained from BERT’s masked word prediction model by taking the whole sentence, masking all the words which are not conditioned on in the term (including $w_i$), and obtaining BERT’s estimate of the probability of $w_i$.
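To exemplify the idea, we summarize how to calculate the forward probability of the sentence “She answered quickly”. The probability is given by

$$P_f(\text{She answered quickly}) = P(\text{She})\, P(\text{answered} \mid \text{She})\, P(\text{quickly} \mid \text{She answered}).$$

Each factor translates to a BERT Masked Language Model (MLM) prediction for a sentence with masked tokens. For example,

$$P(\text{answered} \mid \text{She}) \;\rightarrow\; \text{She [MASK] [MASK]},$$

and we get the probability that “answered” is predicted as the second token in the BERT MLM.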
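Now, to take advantage of BERT’s bi-directional capabilities, we can estimate the sentence’s backward probability in a similar fashion, conditioning each word on the words that follow it:

$$P_b(S) = P(w_N)\, P(w_{N-1} \mid w_N) \cdots P(w_1 \mid w_2, \dots, w_N).$$

We finally approximate the sentence probability as the geometric mean of the two directional ones:

$$P(S) \approx \sqrt{P_f(S)\, P_b(S)},$$

where $P_f(S)$ is the forward sentence probability and $P_b(S)$ is the backward one.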
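The forward, backward, and geometric-mean computations described in this section can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `MaskedPredictor` interface, the function names, and the uniform toy oracle are assumptions introduced here so that the example runs without a neural model. In practice the predictor would wrap a masked language model such as BERT.

```python
import math
from typing import Callable, Dict, List, Sequence

MASK = "[MASK]"

# Hypothetical oracle interface (an assumption of this sketch): given a token
# list containing MASK tokens and a position, return a probability for each
# candidate word at that position.
MaskedPredictor = Callable[[Sequence[str], int], Dict[str, float]]


def forward_probability(words: Sequence[str], predict: MaskedPredictor) -> float:
    """Left to right: word i is predicted with itself and all later words masked."""
    prob = 1.0
    for i, word in enumerate(words):
        masked = list(words[:i]) + [MASK] * (len(words) - i)
        prob *= predict(masked, i).get(word, 0.0)
    return prob


def backward_probability(words: Sequence[str], predict: MaskedPredictor) -> float:
    """Right to left: word i is predicted with itself and all earlier words masked."""
    prob = 1.0
    for i in range(len(words) - 1, -1, -1):
        masked = [MASK] * (i + 1) + list(words[i + 1:])
        prob *= predict(masked, i).get(words[i], 0.0)
    return prob


def sentence_probability(words: Sequence[str], predict: MaskedPredictor) -> float:
    """Geometric mean of the forward and backward sentence probabilities."""
    return math.sqrt(forward_probability(words, predict)
                     * backward_probability(words, predict))


def uniform_predictor(vocab: List[str]) -> MaskedPredictor:
    """Toy stand-in for a neural model: every word is equally likely everywhere."""
    def predict(tokens: Sequence[str], position: int) -> Dict[str, float]:
        return {w: 1.0 / len(vocab) for w in vocab}
    return predict


toy = uniform_predictor(["she", "answered", "quickly", "frog"])
sentence = ["she", "answered", "quickly"]
p = sentence_probability(sentence, toy)
print(p)
```

With the uniform toy oracle every factor equals 1/4, so the two directional probabilities coincide; a real masked language model would generally give different forward and backward values, which is what the geometric mean is meant to balance.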
2.2 Word Category Formation
Following our prior work on symbolic grammar induction, and a number of previous works, we propose to generate embeddings for the words in the vocabulary and cluster them using a proximity metric in the embedding space. Each final cluster can be considered a different word category, whose connection rules to other clusters will be defined in the induced grammar. Unlike prior work, we use sentence probabilities as the embedding features.
We expand each sentence in the corpus into $N$ sentences, each with a “blank” token in a different position, where $N$ is that sentence’s length. Each of those sentences with a blank is a feature for the word vectors we will build. Hence, we can think of a word-sentence matrix $M$, where rows are unique sentences with blanks in them, and columns are the words in the vocabulary (see Fig. 3).
We fill each cell in the matrix with the probability of the corresponding sentence-with-a-blank (row) when the blank is substituted by the corresponding word (column). That is, if $s_i$ is the sentence-with-a-blank in row $i$ and $w_j$ is the word in column $j$, then the cell $M_{ij} = P(s_i[w_j])$, where $s_i[w_j]$ denotes $s_i$ with its blank replaced by $w_j$.
Once the matrix is filled, word categories are obtained by clustering the resulting word vectors (columns of the matrix) or, if one has performed word sense disambiguation (which can be done based on different computations from this same matrix, as will be described below), by clustering the vectors corresponding to word senses.
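The construction of the word-sentence matrix can be sketched as below. This is an illustrative sketch under stated assumptions: the helper names (`blanked_sentences`, `build_matrix`, `word_vector`), the `"_"` blank marker, and the toy probability function are all hypothetical, and the toy oracle simply returns a high value for sentences seen in the corpus instead of querying a neural language model.

```python
from typing import Callable, List, Sequence, Tuple

BLANK = "_"  # hypothetical blank marker; assumes "_" never occurs as a real word

Sentence = Tuple[str, ...]


def blanked_sentences(corpus: Sequence[Sentence]) -> List[Sentence]:
    """Expand each N-word sentence into N copies, each with one position
    blanked, keeping only unique results (these are the matrix rows)."""
    rows: List[Sentence] = []
    seen = set()
    for sentence in corpus:
        for i in range(len(sentence)):
            row = tuple(sentence[:i]) + (BLANK,) + tuple(sentence[i + 1:])
            if row not in seen:
                seen.add(row)
                rows.append(row)
    return rows


def build_matrix(rows: Sequence[Sentence], vocab: Sequence[str],
                 sentence_prob: Callable[[Sentence], float]) -> List[List[float]]:
    """Cell (i, j) is the probability of row i with its blank filled by word j."""
    matrix = []
    for row in rows:
        at = row.index(BLANK)
        matrix.append([sentence_prob(row[:at] + (word,) + row[at + 1:])
                       for word in vocab])
    return matrix


def word_vector(matrix: Sequence[Sequence[float]], column: int) -> List[float]:
    """The embedding of a word is its column of the matrix."""
    return [row[column] for row in matrix]


corpus = [("the", "kids", "play"), ("the", "dogs", "play")]
vocab = ["the", "kids", "dogs", "play"]
known = set(corpus)


def toy_prob(sentence: Sentence) -> float:
    # Made-up oracle: corpus sentences are likely, everything else is not.
    return 0.9 if sentence in known else 0.01


rows = blanked_sentences(corpus)
matrix = build_matrix(rows, vocab, toy_prob)
print(word_vector(matrix, vocab.index("kids")))
print(word_vector(matrix, vocab.index("dogs")))
```

In this toy setting “kids” and “dogs” receive identical column vectors because they are interchangeable in every context, which is the kind of regularity the clustering step is meant to pick up.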
2.3 Word Sense Disambiguation
Word embeddings obtained from transformer networks by supervised learning have been used to disentangle word senses; here we attempt this task in an unsupervised manner. From an unlabeled training corpus, we obtain a transformer embedding for each instance of each word in its given context. Then, for each word in the vocabulary, we gather all of its embeddings and cluster them; we consider the resulting clusters as different word senses.
Specifically, a word-instance can be represented by a vector whose components are given by the probability that the neural language model assigns to the sentences (and discourse contexts) obtained by replacing such word instance with each word in the vocabulary.
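Consider the word-instance “test” in “The test was a success”. If the corpus vocabulary contains, among others, the words “frog” and “which”, then we can represent this instance’s intension (contextual properties) with a vector $I$ that has one component per vocabulary word, for example:

$$I_{\text{frog}}(\text{test}, \text{The \_\_ was a success}) = P(\text{The frog was a success})$$

$$I_{\text{which}}(\text{test}, \text{The \_\_ was a success}) = P(\text{The which was a success})$$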
Notably, the matrix obtained this way is the same one used for word-category formation; only, instead of performing clustering over the word vectors (columns), we need to independently cluster the rows that belong to instances of the same word to find their different senses.
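As a rough illustration of clustering the instance vectors (rows) of a single word into senses, the sketch below implements a very small spherical k-means in pure Python. It is not the Spherecluster implementation used in the experiments; the deterministic farthest-first initialization, the function names, and the numeric vectors are assumptions made up for this example.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def spherical_kmeans(vectors: Sequence[Sequence[float]], k: int,
                     iterations: int = 20) -> List[int]:
    """Cluster unit-normalized vectors by cosine similarity; returns one label per vector."""
    points = [normalize(v) for v in vectors]
    # Farthest-first initialization: start from the first point, then repeatedly
    # add the point that is least similar to its closest current centroid.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(min(points,
                             key=lambda p: max(cosine(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its most similar centroid.
        labels = [max(range(k), key=lambda j: cosine(p, centroids[j]))
                  for p in points]
        # Update step: each centroid becomes the normalized sum of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = normalize([sum(col) for col in zip(*members)])
    return labels


# Made-up instance vectors for four occurrences of one ambiguous word, each
# holding substitution probabilities over a hypothetical four-word vocabulary.
instances = [
    [0.60, 0.30, 0.05, 0.05],
    [0.50, 0.40, 0.05, 0.05],
    [0.05, 0.05, 0.50, 0.40],
    [0.05, 0.10, 0.30, 0.55],
]
labels = spherical_kmeans(instances, k=2)
print(labels)
```

As in the experiments reported later, the number of clusters `k` has to be supplied by the caller, which is the main limitation of this family of methods.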
2.3.1 Word Category Formation in Depth.
Once polysemy is taken care of, we can perform word-categorization over word-senses, allowing the same word to be assigned to different parts of speech (PoS) (e.g. “test” as a noun and as a verb). We need, however, to re-structure the sentence probability matrix to express word-senses as columns before grouping them into PoS. This is done by reassigning the previously-calculated probabilities to the correct word-sense.
Starting from the original matrix $M$, we zero-initialize a disambiguated matrix $M'$ with the same number of rows, and as many columns as word-senses. For a given entry $M_{ij}$ in the original matrix, corresponding to sentence $s_i$ and vocabulary word $w_j$, we need to decide which of the senses of $w_j$ to assign it to. If $w_j$ has only one sense, the decision is trivial; otherwise, we take the embedding for sentence $s_i$ (that is, the entire row $i$ of $M$, as in the WSD process), and measure its distance from the centroids of the different senses of $w_j$ obtained in the WSD step. The closest sense is assigned the value $M_{ij}$, and the rest keep a zero. This way, we build word-sense embeddings by using the columns of $M'$; clustering these embeddings creates PoS categories and finer-grained syntactico-semantic categories. Figure 3 illustrates the disambiguated probability matrix.
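The reassignment of probabilities to word-sense columns can be sketched as follows. The function name, the data layout (a dictionary from each word to its list of sense centroids), and the use of cosine similarity as the closeness measure are assumptions of this sketch; the text above only states that the closest centroid is chosen.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def disambiguate_matrix(
    matrix: Sequence[Sequence[float]],
    vocab: Sequence[str],
    sense_centroids: Dict[str, List[List[float]]],
) -> Tuple[List[Tuple[str, int]], List[List[float]]]:
    """Build the disambiguated matrix: one column per (word, sense) pair.

    Each original entry (row i, word j) is copied into the column of the sense
    of word j whose centroid is closest to row i; other senses keep zero.
    """
    columns = [(word, s) for word in vocab
               for s in range(len(sense_centroids[word]))]
    col_index = {col: j for j, col in enumerate(columns)}
    result = [[0.0] * len(columns) for _ in matrix]
    for i, row in enumerate(matrix):
        for j, word in enumerate(vocab):
            centroids = sense_centroids[word]
            if len(centroids) == 1:
                sense = 0  # unambiguous word: trivial decision
            else:
                # Assumption: "closest" means highest cosine similarity.
                sense = max(range(len(centroids)),
                            key=lambda s: cosine(row, centroids[s]))
            result[i][col_index[(word, sense)]] = row[j]
    return columns, result


# Made-up example: "test" has two senses, "frog" has one.
vocab = ["test", "frog"]
matrix = [[0.8, 0.1],
          [0.2, 0.7]]
sense_centroids = {"test": [[1.0, 0.0], [0.0, 1.0]], "frog": [[1.0, 1.0]]}
columns, disambiguated = disambiguate_matrix(matrix, vocab, sense_centroids)
print(columns)
print(disambiguated)
```

The columns of the returned matrix are then the word-sense embeddings that would be clustered into PoS categories.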
2.4 Grammar Induction
After word categories are formed, grammar induction can take place by figuring out which groups of words are allowed to link with others in grammatical parses. A grammar can be accumulated by starting with one rule and adding more incrementally, using the neural language model to evaluate the desirability of each proposed addition. The choice of candidate rules is made by a symbolic rule induction algorithm; so far we have used the Grammar Learner process.
For a grammar rule proposed as an addition to the partial grammar already learned, we generate sentences that use that rule within the given grammar and obtain their sentence probabilities $P(S)$. Then we corrupt the rule in some manner, adjust the grammar accordingly, generate sentences from this modified grammar starting with the mutated rule, and evaluate their $P(S)$. If the sentences from the modified grammar decrease significantly in quality (where the threshold is a parameter), then the original rule is taken as valid. The rationale is that correct grammar rules will produce better sentences than their distortions.
In the case of the link grammar formalism, which we have used in our work so far, a grammar rule consists of a set of disjuncts of conjunctions of typed “connectors” pointing forward or backward in a sentence. A mutation of this type of rule can be the swapping of the direction of each connector in the rule, which also implies a word-order change.
For example, if we have a rule that connects the word “kids” with the word “the” on the left and the word “small” also on the left, in that order:
kids: small- & the-,
which allows the string “the small kids”, then the mutated rule would be
kids: small+ & the+,
which accepts the string “kids small the”. (Notice that connectors in the rules for “small” and “the” also have to be modified to accommodate this mutation, i.e. they need to swap kids+ to kids-.)
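The direction-swapping mutation in this example can be sketched with a small helper. The function name `mutate_rule` is hypothetical, and the rule strings use the simplified word-named connector notation of the example above, not the full Link Grammar dictionary syntax.

```python
import re

FLIP = {"+": "-", "-": "+"}


def mutate_rule(rule: str) -> str:
    """Swap the direction of every connector in a rule body.

    A connector is taken to be a word followed by '+' (links to the right)
    or '-' (links to the left), as in "small- & the-".
    """
    return re.sub(r"(\w+)([+-])",
                  lambda m: m.group(1) + FLIP[m.group(2)],
                  rule)


original = "small- & the-"
mutated = mutate_rule(original)
print(mutated)  # small+ & the+
```

Applying the mutation twice returns the original rule, so the same helper can also be used to adjust the matching connectors in the rules of the linked words.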
This methodology requires a way to generate sentences from proposed grammars. One approach is to use a given grammar to guide the attention within a Tree Transformer. The standard Tree Transformer approach guides attention based on word-sequence segmentation that is driven by mutual information values between pairs of adjacent words. One can replace these probabilities with mutual information values between pairs of words that are linked in partial parses that agree with a provided grammar.
Currently we are using a simpler stochastic sentence generation model in our proof-of-concept experiments, and planning to shift to a Tree Transformer approach for the next phase of work.
So, the original rule (kids: small- & the-) guides the generation of sentences like “The small kids play football” (see its Link-parse in Fig. 4), while the mutated rule (kids: small+ & the+) guides the generation of sentences like “Kids small the play football”. The language model assigns a higher probability to the former sentence than to the latter, thus arguing in favor of adding the original rule to one’s grammar (and then continuing the incremental learning process).
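The accept-or-reject decision for a candidate rule can be sketched as below. The text does not specify how the quality of the two sentence sets is compared, so the use of a difference in mean log-probability, the function names, the default threshold, and the probability values in the toy oracle are all assumptions made for illustration.

```python
import math
from typing import Callable, Sequence


def mean_log_prob(sentences: Sequence[str],
                  sentence_prob: Callable[[str], float]) -> float:
    """Average log-probability of a set of sentences under the oracle."""
    # A tiny floor avoids log(0) if the oracle returns exactly zero.
    return sum(math.log(max(sentence_prob(s), 1e-300))
               for s in sentences) / len(sentences)


def accept_rule(rule_sentences: Sequence[str],
                mutated_sentences: Sequence[str],
                sentence_prob: Callable[[str], float],
                threshold: float = 1.0) -> bool:
    """Accept the rule if mutating it lowers mean log-probability by more than `threshold`."""
    drop = (mean_log_prob(rule_sentences, sentence_prob)
            - mean_log_prob(mutated_sentences, sentence_prob))
    return drop > threshold


# Made-up oracle values (not actual BERT outputs).
toy_probs = {
    "the small kids play football": 1e-6,
    "kids small the play football": 1e-12,
}

accepted = accept_rule(["the small kids play football"],
                       ["kids small the play football"],
                       toy_probs.get)
print(accepted)
```

Working in log space keeps the comparison numerically stable for long sentences, whose raw probabilities are products of many small factors.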
Alternatively, instead of producing mutated rules, one could also compare the probabilities of sentences generated with the rule under evaluation against those of a set of reference sentences of the same length, like those in the corpus used to derive the grammar, or the word categories obtained previously.
3 Proof of concept (POC)
Scalable implementation and testing of the ideas described above is work in progress; here we describe some basic examples we have explored so far, which validate the basic concepts (but do not yet provide a thorough demonstration). We chose to perform our initial experiments using BERT (in particular, Huggingface’s implementation of BERT, contained in their “transformers” package, https://huggingface.co/transformers), due to its popularity in several downstream tasks (e.g. word sense disambiguation).
Following the workflow of the grammar induction process, we first show an example of word sense disambiguation, then one for word category formation, and finally grammar rule evaluation.
3.1 Word sense disambiguation
For an initial simple experiment, we created a small corpus of 16 sentences containing 146 words, out of which 8 are clearly ambiguous (to an English speaker). Both syntactic and semantic ambiguities were included. We generated embeddings for each word instance in the corpus, as described in section 2.3. Clustering was performed with spherical clustering methods from Spherecluster (https://github.com/jasonlaska/spherecluster), as well as out-of-the-box DBSCAN and OPTICS models in Python’s scikit-learn library with the cosine-distance metric.
We found that SphericalKMeans clustering did the best job at separating word senses in our test corpus. Setting the number of clusters to two, the algorithm achieved an F1-score of 0.91. As examples, the disambiguation for the word “fat”, which was perfect, looks as follows:
The clustering for “time”, on the other hand, placed one instance in the wrong category, and looks like this:
The disadvantage of using this straightforward implementation of SphericalKMeans is that one has to specify the number of clusters to use. When requesting more clusters than there are senses for a word, the algorithm spreads instances with similar meanings across different clusters. This is especially the case with words that we wouldn’t consider ambiguous, like function words (we have sought to filter these by explicitly not disambiguating the top 10% most frequent words in the corpus). However, this may not be a serious problem in our use case, as the word category formation algorithm will simply create more word-sense vectors per word, which it could then cluster together in the same word category. Future experiments will involve alternatives that automatically estimate the number of clusters to use.
3.2 Word category formation
Here, working with the same corpus as for WSD, we used the disambiguation results described above to build word vectors, thus allowing for words to be catalogued in more than one group. Rather than SphericalKMeans, we found that OPTICS, a method that doesn’t require a parameter for the number of clusters and can leave vectors uncategorized (shown as Cluster #-1), offers remarkable quality in most formed clusters (#0-14), with a good level of granularity.
An evident problem with this result is that most of the words remain uncategorized (in Cluster #-1). Although we would expect the full iterative grammar learning algorithm we propose to be able to live with that and cluster some of the remaining words in the next pass, we will first try to fine-tune the procedure to alleviate this situation, as well as explore some other clustering algorithms. At the same time, we predict that the results will improve when we use a larger number of features (instead of only 16 sentences for a total of 146 different features). A very simple expansion of the vocabulary to cluster (not shown) already showed a similar number of more populated clusters.
3.3 Grammar Rule Evaluation
We show a simple use case for grammar rule evaluation, using the basic rule modification strategy proposed in the methodology: swapping the direction of the connectors that make up a rule, and comparing the sentences generated with and without the mutation.
For this experiment, we created a proof-of-concept grammar with 6 words divided into 6 categories: determiner, subject, verb, direct object, adjective, adverb. Then, we assigned relationships among the classes. Using a semi-random sentence generator, this grammar produces sentences like “the small kids eat the small candy quickly.” (that being the longest possible sentence in this grammar).
We then introduced some extra spurious rules to the grammar by hand. From a total of 21 rules (15 correct ones vs. 6 spurious ones), the grammar can generate sentences like “kids eat the the small candy kids eat candy the small quickly quickly.”, which clearly shows that the grammar is not correct anymore (this grammar has loops, so this is not even the longest sentence permitted by these simple modifications).
Finally, we ran a first version of the grammar rule evaluator and found that all of the spurious rules were rejected, as well as three of the “correct” rules.
We notice that among the “correct” rules that were discarded, at least one:
generates sentences with no direct object, like “the kids eat.” This sentence, although valid, might not be very common for the BERT model, and thus obtains a low probability.
Similarly, the reverse of this rule, as modified by the evaluation algorithm:
generates sentences like “eat the kids.”, which is also grammatically valid, and maybe as common as the previous case. This is a sensible explanation for the rule’s rejection.
4 Conclusion and Future Work
Our proof-of-concept experiments give an intuitively strong indication of the viability of the methodology proposed for synergizing symbolic and sub-symbolic language modeling to achieve unsupervised grammar induction. The next step is to create a scalable implementation of the approach, apply it to a large corpus, and assess the quality of the results. If successful, this will constitute significant progress both toward unsupervised grammar induction, and toward understanding how different types of intelligent subsystems can come together to more closely achieve human-like language understanding and generation.
References
1. Banerjee, A., Dhillon, I., Ghosh, J., Sra, S.: Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research 6 (2005)
2. Charniak, E.: Statistical language learning. MIT Press (1996)
3. Clark, K., Khandelwal, U., Levy, O., Manning, C.D.: What Does BERT Look At? An Analysis of BERT’s Attention. arXiv:1906.04341 [cs] (Jun 2019)
4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs] (2019)
5. Glushchenko, A., Suarez, A., Kolonin, A., Goertzel, B., Baskov, O.: Programmatic Link Grammar Induction for Unsupervised Language Learning. In: Artificial General Intelligence, vol. 11654, pp. 111–120. Springer International Publishing (2019)
6. Grave, E., Elhadad, N.: A convex and feature-rich discriminative approach to dependency grammar induction. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics. pp. 1375–1384 (2015)
7. Hewitt, J., Manning, C.D.: A structural probe for finding syntax in word representations. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics. pp. 4129–4138 (2019)
8. Htut, P.M., Phang, J., Bordia, S., Bowman, S.R.: Do Attention Heads in BERT Track Syntactic Dependencies? arXiv:1911.12246 [cs] (Nov 2019)
9. de La Higuera, C., Oates, T., van Zaanen, M.: Introduction: Special issue on applications of grammatical inference. Applied Artificial Intelligence 22(1-2), 1–3 (2008)
10. Li, B., Cheng, J., Liu, Y., Keller, F.: Dependency Grammar Induction with a Neural Variational Transition-based Parser. arXiv:1811.05889 [cs] (Nov 2018)
11. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)
12. Schmid, U., Kitzelmann, E.: Inductive rule learning on the knowledge level. Cognitive Systems Research 12(3-4), 237–248 (2011)
13. Sleator, D.D., Temperley, D.: Parsing English with a link grammar. arXiv:cmp-lg/9508004 (1995)
14. Tomasello, M.: Constructing a Language: A Usage-Based Theory of Language Acquisition. Harvard University Press (2003)
15. Wang, Y.S., Lee, H.y., Chen, Y.N.: Tree transformer: Integrating tree structures into self-attention. In: EMNLP/IJCNLP (2019)
16. Wiedemann, G., Remus, S., Chawla, A., Biemann, C.: Does BERT Make Any Sense? Interpretable Word Sense Disambiguation with Contextualized Embeddings. arXiv:1909.10430 [cs] (Oct 2019)
17. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Brew, J.: Huggingface’s transformers: State-of-the-art natural language processing. arXiv:1910.03771 (2019)