The fundamental principle of regularization is at the heart of many machine learning algorithms and models. Informally speaking, regularization refers to the idea of adding a penalty term to the loss function optimized by a learning model in order to encourage learning simple functions. In particular, in regularized empirical risk minimization, the objective is to find the hypothesis $h$ minimizing $\hat{R}_S(h) + \lambda \Omega(h)$, where $\hat{R}_S(h)$ denotes the empirical risk on a dataset $S$, $\Omega$ is a penalty term that penalizes complex functions, and $\lambda$ is a hyper-parameter controlling the tradeoff between fitting the data and being a “simple” hypothesis. Examples of regularization functions include the $\ell_1$ or $\ell_2$ norm of the weight vector of a linear model, the degree of a polynomial model, the rank or the trace norm of the user-item matrix in a collaborative filtering task, etc.
In this work, we propose a novel regularization technique for sequential models. While there are many natural notions of simplicity for functions defined over vector spaces (e.g., sparsity, smoothness, etc.), defining a notion of simplicity suited for functions defined over sequences can be more tedious due to the discrete and sequential nature of the data arising in tasks such as language modelling. One such notion of simplicity naturally arises from the so-called Chomsky hierarchy, which categorizes functions over sequences into four different levels of complexity, the simplest of which is regular grammars (1056813). To encourage a sequential model such as an RNN (Recurrent Neural Network) to learn simple functions, i.e., functions that appear lower in the Chomsky hierarchy, we introduce a novel inductive bias through spectral regularization.
In order to encourage learning such simple functions, we leverage a fundamental result relating the rank of the Hankel matrix of a function $f$ to the minimal number of states of a weighted finite automaton computing $f$ (fliess1974matrices; carlyle1971realizations). Functions computed by weighted finite automata correspond to regular weighted languages, i.e., functions that are low in the Chomsky hierarchy. The idea behind spectral regularization is to encourage learning models whose Hankel matrix is approximately low rank. To do so, the spectral regularizer is defined as the trace norm of the Hankel matrix. Using the trace norm instead of the rank of the Hankel matrix offers two advantages: (i) the trace norm is the tightest convex relaxation of the rank and is differentiable almost everywhere, allowing one to rely on automatic differentiation to apply the spectral regularization when training black box neural network sequence models, and (ii) the trace norm can be seen as a “soft” version of the rank, allowing learned models to be only approximately low rank, whereas a hard rank constraint would be too strong and would force the learned functions to be regular. The spectral regularization can thus incorporate a natural inductive bias towards regular functions in the training of any black box differentiable model.
A key technical challenge in implementing the proposed spectral regularization resides in the fact that the Hankel matrix is a bi-infinite matrix whose trace norm cannot be explicitly computed. To address this issue, we use a Russian Roulette estimator to design a stochastic unbiased estimator of the Hankel matrix, whose trace norm is lower bounded (in expectation) by the trace norm of the Hankel matrix itself. We thus plug the realizations of the Russian Roulette estimator into the minimization objective at each mini-batch in place of the actual trace norm of the Hankel matrix.
We provide a simple experimental study on Tomita grammars (tomita1982dynamic) to illustrate the potential benefits of the spectral regularization.
Let $\Sigma$ be a finite nonempty set, also known as an alphabet. We denote by $\Sigma^*$ the free monoid over $\Sigma$, where string concatenation is the binary operation and the empty string $\varepsilon$ in the singleton set $\Sigma^0 = \{\varepsilon\}$ serves as the unique unit element. Intuitively, $\Sigma^*$ refers to the set of all finite sequences (or words) generated by $\Sigma$: $\Sigma^* = \bigcup_{n \geq 0} \Sigma^n$.
For two sequences $u, v \in \Sigma^*$, we use $uv$ to denote the concatenation of $u$ and $v$. The length of a sequence $u$ is denoted as $|u|$. Finally, a grammar, or language, over $\Sigma$ is a subset of $\Sigma^*$.
One of the simplest classes of languages is the set of regular languages, which are languages that can be recognized by deterministic finite automata. Regular languages form the simplest class of languages in the so-called Chomsky hierarchy. In this work, we are interested in real-valued functions over $\Sigma^*$, sometimes called weighted languages. Such functions are of crucial interest for machine learning applications on sequence data such as language modeling. The Chomsky hierarchy easily extends to weighted languages using the weighted counterparts of the finite state machines used in the classical hierarchy. In particular, the simplest class of such functions is the set of regular functions (sometimes called rational, or recognizable), which are functions that can be computed by weighted automata.
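[Weighted Finite Automaton] A weighted finite automaton (WFA) with $n$ states is a tuple $A = \langle \alpha, \{A^\sigma\}_{\sigma \in \Sigma}, \omega \rangle$, where $\alpha \in \mathbb{R}^n$ is the initial weight vector, $\omega \in \mathbb{R}^n$ the final weight vector, and $A^\sigma \in \mathbb{R}^{n \times n}$ is the transition matrix for each symbol $\sigma \in \Sigma$. A WFA computes a function $f_A : \Sigma^* \to \mathbb{R}$ that maps any sequence $x = x_1 x_2 \cdots x_k \in \Sigma^*$ to $f_A(x) = \alpha^\top A^x \omega$, where $A^x = A^{x_1} A^{x_2} \cdots A^{x_k}$.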
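As a minimal sketch of this definition, the following Python snippet evaluates a WFA on a word by multiplying the transition matrices in reading order. The function name `wfa_value`, the variable names, and the two-state example automaton (which counts the occurrences of the symbol 1) are illustrative assumptions and are not taken from the original work.

```python
import numpy as np

def wfa_value(alpha, transitions, omega, word):
    """Return alpha^T A^{x_1} ... A^{x_k} omega for word = x_1 ... x_k."""
    state = np.asarray(alpha, dtype=float)
    for symbol in word:
        # Propagate the row vector through the matrix of the current symbol.
        state = state @ transitions[symbol]
    return float(state @ omega)

# Hypothetical 2-state WFA over {0, 1} computing f(x) = number of 1's in x.
alpha = np.array([1.0, 0.0])
omega = np.array([0.0, 1.0])
transitions = {
    "0": np.eye(2),                           # reading 0 leaves the state unchanged
    "1": np.array([[1.0, 1.0], [0.0, 1.0]]),  # reading 1 increments the counter
}

value = wfa_value(alpha, transitions, omega, "0110")
```

The empty word is mapped to $\alpha^\top \omega$, since no transition matrix is applied.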
It is worth briefly mentioning that any regular language is the support of a rational function (however, surprisingly, the converse is not true, see, e.g., Chapter 4, Section 6 in droste2009handbook).
In this work, we will design a regularization scheme for sequential models that will favour functions that are close to the class of rational functions (i.e., low on Chomsky’s hierarchy). In order to do so, we need a quantitative measure of the “rationality” of a function. We will see that the spectrum of the so-called Hankel matrix is a good candidate for this purpose.
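[Hankel Matrix] For a given function $f : \Sigma^* \to \mathbb{R}$, its Hankel matrix $H_f \in \mathbb{R}^{\Sigma^* \times \Sigma^*}$ is the infinite matrix with entries $(H_f)_{u,v} = f(uv)$ for $u, v \in \Sigma^*$.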
The following classical theorem shows the fundamental (and striking) relation between the Hankel matrix of a function and its “rationality”. [fliess1974matrices; carlyle1971realizations] For any function $f : \Sigma^* \to \mathbb{R}$, $\mathrm{rank}(H_f)$ is equal to the minimal number of states of a WFA computing $f$. In particular, a function is regular if and only if its Hankel matrix has finite rank.
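The theorem can be illustrated numerically on a finite sub-block of the Hankel matrix. The sketch below is an assumption-laden example rather than part of the original work: the helper names `words_up_to` and `hankel_block` are hypothetical, and the example function counts the number of 1's in a word, which a 2-state WFA can compute. Note that a finite sub-block only gives a lower bound on the rank of the full Hankel matrix in general.

```python
from itertools import product

import numpy as np

def words_up_to(alphabet, max_len):
    """All words over `alphabet` of length 0..max_len, shortest first."""
    return ["".join(p) for n in range(max_len + 1)
            for p in product(alphabet, repeat=n)]

def hankel_block(f, alphabet, max_len):
    """Sub-block of H_f indexed by prefixes and suffixes of length <= max_len."""
    basis = words_up_to(alphabet, max_len)
    return np.array([[f(u + v) for v in basis] for u in basis], dtype=float)

def count_ones(word):
    """Example regular function: the number of 1's in the word."""
    return float(word.count("1"))

block = hankel_block(count_ones, "01", max_len=3)
rank = int(np.linalg.matrix_rank(block))
```

Here $f(uv) = f(u) + f(v)$, so every row of the block is a combination of the all-ones vector and the vector of suffix counts, which is why the numerical rank does not grow with the block size.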
We will see in the next section how this result can be leveraged to design a regularization technique that favours simpler models during learning.
3 Spectral Regularization
In this section, we propose the trace norm of the Hankel matrix as a natural spectral regularizer for black box sequential models and show how to efficiently compute a stochastic approximation of this regularization term for training through back-propagation.
3.1 Motivation and Definition
A naive idea to leverage Theorem 2 for regularization would be to enforce the Hankel matrix of the learned model to be low rank. However, this approach has two drawbacks. First, optimization under low rank constraints is known to be computationally hard. Second, such a constraint would be too strong: we want to incorporate an inductive bias towards simple functions in the learning process, but we do not want to actually enforce the learned function to be regular. In some sense, we want a softer version of the rank of the Hankel matrix which would also consider functions that can be well approximated by regular functions (i.e., functions whose Hankel matrix is approximately low rank) as simple.
Enforcing the trace norm (or nuclear norm) of the Hankel matrix to be small, instead of directly enforcing the rank to be small, will solve (to some extent) both of these issues. Indeed, the trace norm (which is the sum of the singular values) is the tightest convex relaxation of the matrix rank (fazel2001rank), and it naturally represents a soft version of the notion of rank. The trace norm of the Hankel matrix has actually been previously leveraged for this purpose in the context of learning (balle2012local). Using the trace norm of the Hankel matrix as a way to regularize models for sequence tagging was also previously explored in (quattoni2014spectral).
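The difference between the rank and its soft counterpart can be seen on a small numerical example. The following sketch (the matrices, the noise level, and the helper name `trace_norm` are arbitrary assumptions chosen for illustration) perturbs a rank-one matrix slightly: the rank jumps to its maximal value, whereas the trace norm barely changes.

```python
import numpy as np

def trace_norm(matrix):
    """Trace (nuclear) norm: the sum of the singular values."""
    return float(np.linalg.svd(matrix, compute_uv=False).sum())

rng = np.random.default_rng(0)
low_rank = np.outer(rng.normal(size=6), rng.normal(size=6))  # rank one
noise = 1e-3 * rng.normal(size=(6, 6))
perturbed = low_rank + noise                                  # generically full rank

rank_low = int(np.linalg.matrix_rank(low_rank))
rank_perturbed = int(np.linalg.matrix_rank(perturbed))
# By the triangle inequality, this gap is at most trace_norm(noise).
gap = abs(trace_norm(perturbed) - trace_norm(low_rank))
```

This is the sense in which a matrix can be "approximately low rank": a hard rank constraint would reject `perturbed`, while a trace norm penalty treats it almost like `low_rank`.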
We formally introduce this regularization technique in the following definition.
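[Spectral Regularization] Let $f_\theta : \Sigma^* \to \mathbb{R}$ be the function computed by a model with parameters $\theta$ and let $\mathcal{L}(\theta)$ be the loss function associated with this model. Spectral regularization corresponds to the following minimization problem:
$$\min_{\theta} \; \mathcal{L}(\theta) + \lambda \| H_{f_\theta} \|_* \tag{1}$$
where $\lambda$ is the regularization coefficient, and the trace norm $\| H_{f_\theta} \|_*$ is the spectral regularizer (or spectral loss).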
Note that the previous definition does not make any assumptions on the class of models considered. One particular class of interest is that of functions computed by recurrent neural networks, for which we would ideally want to benefit from their remarkable expressiveness while still steering the learning process towards functions that are, in some sense, low on the Chomsky hierarchy. In particular, when an RNN is used for sequential probabilistic modeling (i.e., trained to predict the probabilities of the next symbol given a sequence), $f_\theta$ would denote the underlying probability distribution over $\Sigma^*$, i.e., $f_\theta(x)$ is the probability that the model assigns to the sequence $x$.
3.2 Russian Roulette Estimator
It is clear that the optimization problem in Eq. (1) cannot be solved easily. To start with, the Hankel matrix is infinite! In order to tackle this optimization problem, we will make use of the so-called Russian Roulette estimator, which allows one to stochastically approximate an infinite series with random realizations of partial sums.
[Russian Roulette Estimator; Kahn] Given a convergent series $S = \sum_{n=0}^{\infty} a_n$, a Russian Roulette estimator of $S$ is given by $\hat{S} = \sum_{n=0}^{N} \frac{a_n}{P(N \geq n)}$, where $N$ is a random variable with support over all nonnegative integers.
Note that in this definition we do not require the $a_n$’s to be scalars. Instead, they could stand for vectors, matrices, tensors, or some abstract objects with well-defined component-wise addition.
[NEURIPS2019_5d0d5594; Lemma 3; lyne2015russian] If $P(N \geq n) > 0$ for all $n \geq 0$ and the series is absolutely convergent, then $\hat{S}$ given in Definition 3.2 is an unbiased estimator of $S$, i.e., $\mathbb{E}[\hat{S}] = S$.
Although the Russian Roulette estimator is unbiased under mild assumptions, its variance might be large or even unbounded with an ill-chosen random variable (DBLP:journals/mcma/McLeish11; DBLP:conf/icml/BeatsonA19).
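For a scalar series, the estimator can be sketched in a few lines of Python. The example below assumes that the truncation level follows a geometric distribution with a fixed stopping probability, which is one convenient choice and not necessarily the one used in the cited works; the function name `russian_roulette` and the test series $\sum_{n \geq 0} 2^{-n} = 2$ are likewise illustrative.

```python
import random

def russian_roulette(term, stop_prob, rng):
    """One draw of sum_{n=0}^{N} term(n) / P(N >= n) with a geometric N.

    N counts the failures before the first success of probability
    `stop_prob`, so P(N = k) = stop_prob * (1 - stop_prob) ** k and
    P(N >= n) = (1 - stop_prob) ** n, which is positive for every n.
    """
    estimate, survival, n = 0.0, 1.0, 0
    while True:
        estimate += term(n) / survival   # survival equals P(N >= n)
        if rng.random() < stop_prob:     # stop here: N = n
            return estimate
        survival *= 1.0 - stop_prob      # update to P(N >= n + 1)
        n += 1

rng = random.Random(0)
draws = [russian_roulette(lambda n: 0.5 ** n, 0.2, rng) for _ in range(50_000)]
mean_estimate = sum(draws) / len(draws)  # empirical mean, to compare with 2
```

With these particular choices the reweighted terms $(0.5 / 0.8)^n$ are still summable, so every draw is bounded and the variance is finite. A stopping probability above $0.5$ would make the reweighted terms grow, which is the kind of ill-chosen distribution the text warns about.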
3.3 Stochastic Estimator for the Trace Norm of the Hankel Matrix
In order to leverage the Russian Roulette estimator for the trace norm of the Hankel matrix, we need to express the Hankel matrix as an infinite sum. We propose one convenient way to do this in the following theorem.
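Let $f : \Sigma^* \to \mathbb{R}$. For any $n \geq 0$, let $H_f^{(n)} \in \mathbb{R}^{\Sigma^* \times \Sigma^*}$ be defined by
$$(H_f^{(n)})_{u,v} = \begin{cases} f(uv) & \text{if } |uv| = n, \\ 0 & \text{otherwise,} \end{cases}$$
for all $u, v \in \Sigma^*$. Then $H_f = \sum_{n=0}^{\infty} H_f^{(n)}$.
Although the $H_f^{(n)}$’s defined above are infinite matrices, each of them only contains a finite number of nonzero elements. We can thus construct the Russian Roulette estimator of $H_f$ as
$$\hat{H}_f = \sum_{n=0}^{N} \frac{H_f^{(n)}}{P(N \geq n)} \tag{2}$$
where $N$ is a random variable taking its values in $\mathbb{N}$ such that $P(N \geq n) > 0$ for all $n \in \mathbb{N}$.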
As mentioned previously, even though the Russian Roulette estimator $\hat{H}_f$ is still an infinite matrix, it has only a finite number of non-zero entries for any realization of the integer $N$. Thus, informally, the trace norm of the infinite matrix $\hat{H}_f$ is equal to the trace norm of its smallest sub-block containing no columns or rows entirely filled with $0$’s, which is a finite sub-block whose trace norm can be computed in time polynomial in the size of that sub-block. We now formalize this intuition. We start by showing that the Russian Roulette estimator of the Hankel matrix is unbiased.
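Let $f : \Sigma^* \to \mathbb{R}$. The estimator $\hat{H}_f$ defined in Eq. (2) is an unbiased estimator of $H_f$, i.e., $\mathbb{E}[\hat{H}_f] = H_f$.
For any $u, v \in \Sigma^*$, we notice from Eq. (2) that
$$(\hat{H}_f)_{u,v} = \frac{\mathbb{1}[N \geq n]}{P(N \geq n)} \, (H_f^{(n)})_{u,v}$$
for some $n \in \mathbb{N}$, since the RHS of Eq. (2) contributes at most one term for an entry in the LHS. Then
$$\mathbb{E}\big[(\hat{H}_f)_{u,v}\big] = \frac{P(N \geq n)}{P(N \geq n)} \, (H_f^{(n)})_{u,v} = (H_f)_{u,v}.$$
Therefore, each entry of $\hat{H}_f$ is an unbiased estimator of the corresponding entry in $H_f$.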
We showed that the infinite Hankel matrix of a function can be expressed as an infinite sum of matrices with a finite number of non-zero entries, allowing us to construct a Russian Roulette estimator of the Hankel matrix which can be computed efficiently. But we are interested in the trace norm of the Hankel matrix in the objective we wish to minimize in Eq. (1). It remains to show that the trace norm of the Russian Roulette estimator of the Hankel matrix is a good stochastic estimator of the trace norm of the Hankel matrix itself.
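For any $f : \Sigma^* \to \mathbb{R}$, we have that
$$\mathbb{E}\big[\| \hat{H}_f \|_*\big] \geq \| H_f \|_*,$$
where $\hat{H}_f$ is the Russian Roulette estimator defined in Eq. (2).
It suffices to show that $\| \cdot \|_*$ is a convex operator, and then the claim follows by Jensen’s inequality. The convexity of $\| \cdot \|_*$ follows naturally from the triangle inequality and the absolute homogeneity of the trace norm: $\| tA + (1 - t)B \|_* \leq t \| A \|_* + (1 - t) \| B \|_*$ for any matrices $A, B$ and any $t \in [0, 1]$. Combining Jensen’s inequality with Theorem 3.3 we have that $\mathbb{E}[\| \hat{H}_f \|_*] \geq \| \mathbb{E}[\hat{H}_f] \|_* = \| H_f \|_*$.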
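The inequality can be observed on a finite restriction of the matrices, for which both sides are computable. The sketch below is only an illustration under stated assumptions: the entries are grouped by total length $|uv|$, the truncation level is geometric with stopping probability 0.2, the index set `basis` is a small hypothetical choice, and the target function (indicator of an even number of 1's) is arbitrary.

```python
import numpy as np

basis = ["", "0", "1", "00", "01", "10", "11"]
stop_prob = 0.2

def target(word):
    """Hypothetical target: 1.0 if the word has an even number of 1's, else 0.0."""
    return float(word.count("1") % 2 == 0)

values = np.array([[target(u + v) for v in basis] for u in basis])
lengths = np.array([[len(u) + len(v) for v in basis] for u in basis])
# Entries reweighted by 1 / P(N >= |uv|) with P(N >= n) = (1 - stop_prob) ** n.
weights = values / (1.0 - stop_prob) ** lengths

def nuc(matrix):
    """Trace (nuclear) norm of a finite matrix."""
    return float(np.linalg.norm(matrix, ord="nuc"))

# E[ ||estimator||_* ], summing over N = k with P(N = k) = p (1 - p)^k.
# The sum is truncated at 200, where the remaining probability mass is negligible.
probs = stop_prob * (1.0 - stop_prob) ** np.arange(200)
expected_norm = sum(p * nuc(weights * (lengths <= k)) for k, p in enumerate(probs))
norm_of_mean = nuc(values)  # trace norm of the restricted Hankel block
```

Since the expected trace norm of the estimator sits above the trace norm of the Hankel block, minimizing the stochastic term amounts to minimizing an upper bound on the quantity of interest.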
The previous theorem shows that we can efficiently compute a stochastic approximation of the trace norm of the Hankel matrix, through which we can use the back-propagation algorithm to train any differentiable black-box model. In the next section, we implement this regularization technique to train RNNs on a synthetic language modeling task.
We conduct experiments to validate that spectral regularization imposes an inductive bias for sequence modeling. In particular, we focus on synthetic data generated according to Tomita grammars #3 to #6 defined in Table 1, which are a standard benchmark for grammatical inference (tomita1982dynamic; bengio1994approach). As shown in the table, all these grammars are subsets of $\Sigma^*$, where the binary alphabet is $\Sigma = \{0, 1\}$.
|#3||not containing as a substring|
|#4||not containing $000$ as a substring|
|#5||containing an even number of $0$’s and an even number of $1$’s|
|#6||(number of $0$’s $-$ number of $1$’s) is a multiple of 3|
The training dataset for each grammar includes synthetic sequences up to length 12 in that grammar. 20% of the training set is split out as the validation set. We also use a test dataset consisting of sequences of length exactly 12 and disjoint from the training dataset.
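A candidate pool of sequences for such datasets can be enumerated directly. The sketch below is a hedged illustration: the membership predicates follow commonly cited statements of Tomita grammars #4 to #6 (no `000` substring; even number of 0's and of 1's; difference of counts divisible by 3) and are assumptions here, grammar #3 is omitted, and the names `TOMITA` and `language_up_to` are hypothetical. How the original datasets were sampled from such a pool is not specified in the text.

```python
from itertools import product

# Assumed membership predicates for Tomita grammars #4, #5 and #6.
TOMITA = {
    4: lambda w: "000" not in w,
    5: lambda w: w.count("0") % 2 == 0 and w.count("1") % 2 == 0,
    6: lambda w: (w.count("0") - w.count("1")) % 3 == 0,
}

def language_up_to(member, max_len, alphabet="01"):
    """All words of length 0..max_len accepted by the predicate `member`."""
    return [w
            for n in range(max_len + 1)
            for w in map("".join, product(alphabet, repeat=n))
            if member(w)]

pool = {k: language_up_to(member, 12) for k, member in TOMITA.items()}
# Sequences of length exactly 12, e.g. as candidates for a held-out test set.
length_12 = {k: [w for w in words if len(w) == 12] for k, words in pool.items()}
```

A train/validation/test split would then be drawn from these pools, keeping the length-12 test sequences disjoint from the training data as described above.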
We consider an RNN with one embedding layer of $|\Sigma| + 2$ neurons (we use two additional symbols to mark the start and end of sequences) and one hidden layer of 50 neurons, which is just expressive enough for our training data. NLL (Negative Log-Likelihood) loss is used both for training and for reporting the performance of grammatical inference on the three test sets. The loss minimization is based on the Adam optimizer (diederik2014adam) with an initial learning rate of 0.01 and a batch size of 32. Moreover, we adopt early stopping and a simple scheduler that reduces the learning rate on detected plateaus of the validation loss.
In our experiments, we compare the test NLL obtained when training without spectral regularization against that obtained with spectral regularization, for different sizes of training data sampled from the given training set. The latter chooses the hyperparameter $\lambda$ according to validation NLL. In each mini-batch, we randomly draw $N$ (the stopping probability is 0.2) to construct the Russian Roulette estimator of the Hankel matrix. To check the significance of the unbiased Russian Roulette estimator in spectral regularization, we also implement for comparison a naïve biased estimator of the Hankel matrix, which is a fixed-sized sub-block of the Hankel matrix.
Results on Tomita grammars #3 to #6 are presented in Figure 1, where we see that spectral regularization marginally improves generalization for Tomita grammars #4, #5, and #6 on small training data sizes. In particular, on Tomita grammar #5, the unbiased Russian Roulette estimator performs modestly better than the naïve biased one. However, there is no clear winner between the two estimators, biased or not, on the other Tomita grammars. We hypothesize that this phenomenon might be due to the bias-variance tradeoff, i.e., the high variance of the unbiased Russian Roulette estimator makes the loss computation much coarser (DBLP:conf/icml/BeatsonA19); this will be further investigated in future work. Whether more convincing results can be obtained on other tasks such as classification, or on other datasets, will also be explored in upcoming studies.
This paper proposes spectral regularization according to an intuitive notion of simplicity arising from the Chomsky hierarchy, which serves as an extra inductive bias for any sequence modeling task and is formulated as an additional regularization term to be added to any loss function. Results on synthetic data of Tomita grammars show that spectral regularization indeed marginally helps encourage the model to learn approximately low-rank functions. Forthcoming research will also examine the effect of spectral regularization in other tasks and other datasets.
To estimate the trace norm of the bi-infinite Hankel matrix in the spectral regularizer, we construct an unbiased stochastic estimator to relax the loss minimization problem. However, the unbiased estimator does not exhibit significant advantages compared to a naïve biased one, which will be explored in further research.
We would like to acknowledge the support of the 2021 Globalink Research Internship Mitacs program (Project ID 24986).