Backpropagation has long been the de facto credit assignment technique underlying the successful training of popular neural network architectures such as convolutional neural networks and multilayer perceptrons. It is well known that backpropagation enables neural networks to learn highly relevant task-specific non-linear features. Whilst its empirical efficacy is not in question, backpropagation has been criticized on many fronts: its lack of biological plausibility, the existence of multiple local optima due to performing gradient descent on a non-convex loss function, its susceptibility to catastrophic forgetting, its tendency to learn exploitable decision boundaries, and the notorious difficulty humans have in interpreting the resulting models. This has motivated many researchers to consider alternative credit assignment mechanisms for neurally inspired architectures.
This paper introduces an alternative family of neural models, Gated Linear Networks (GLNs), which have a distributed and local credit assignment mechanism based on optimizing a convex objective. This technique is a generalization of a successful approach used within the state-of-the-art PAQ family of online hard-gated neural compression models, which are known for their excellent sample efficiency. We provide an interpretation of these systems as a sequence of data dependent linear networks coupled with a choice of gating function. From this viewpoint we provide a general analysis of how the choice of gating function gives rise to further representational power, and characterize under what conditions such systems are universal in the limit. Using these insights, we introduce a new type of hard-gating mechanism that opens up their usage to more standard machine learning settings. Empirically, we show that GLNs offer competitive performance to existing batch machine learning techniques, in just a single online pass through the data, on a variety of benchmarks.
For many years, neural networks were considered too computationally demanding for data compression applications. Schmidhuber and Heil (1996) were the first to concretely show their potential, but lingering questions about their computational costs and data efficiency persisted. Mahoney (2000) introduced a number of key algorithmic and architectural changes tailored towards computationally efficient online density estimation, which greatly impacted future data compression research. Rather than use a standard fully connected MLP, neurons within a layer were partitioned into disjoint sets; given a single data point, hashing was used to select only a single neuron from each set in each layer, implementing a kind of hard gating mechanism. Since only a small subset of the total weights was used for any given data point, a speed-up of multiple orders of magnitude was achieved. Also, rather than using backpropagation, the weights in each neuron were adjusted using a local gradient descent based learning rule, which intriguingly was empirically found to work better online. Many variants along this theme were developed, including improvements to the network architecture and weight update mechanism, with a full history given in the book by Mahoney (2013). These network architectures typically take as input a variety of simple history-based model predictions (e.g. $n$-grams, skip-grams, match models); this approach has been called context mixing in the literature, and at the time of writing has achieved the best known compression ratios on widely used benchmarks such as the Calgary Corpus (Bell et al., 1990), Canterbury Corpus (Bell and Arnold, 1997), and Wikipedia (Hutter, 2017). The current state of the art, cmix (Knoll, 2017), uses both a context-mixing network and an LSTM (Hochreiter and Schmidhuber, 1997) to achieve its impressive results.
The excellent empirical performance of these methods motivated further investigation into why these techniques perform so well and whether they have applicability beyond data compression. Knoll and de Freitas (2012) investigated the PAQ8 variant from a machine learning perspective, providing a useful mixture of experts (Jacobs et al., 1991) interpretation, as well as an up-to-date summary of the algorithmic developments since the tech report of Mahoney (2005). They also explored the performance of PAQ8 in a variety of standard machine learning settings including text categorization, simple shape recognition and classification, showing that competitive performance was possible with appropriate input pre-processing. One of the key enhancements in PAQ7 was the introduction of geometric mixing, which applied a logit-based non-linearity to each neuron's input; Mattern (2012, 2013, 2016) justified this particular form of input combination in the single neuron case via a KL-divergence minimization argument, generalized this technique to non-binary alphabets, and provided a regret analysis exploiting the convexity with respect to the log-loss, but a theoretical understanding of multiple interacting hard-gated geometric mixing neurons remained open.
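Let $\Delta_d$ be the $d$ dimensional probability simplex embedded in $\mathbb{R}^{d+1}$ and $\mathcal{B} = \{0, 1\}$ the set of binary elements. The indicator function for set $A$ is $\mathbb{1}_A$ and satisfies $\mathbb{1}_A(x) = 1$ if $x \in A$ and $\mathbb{1}_A(x) = 0$ otherwise. The scalar element located at position $(i, j)$ of a matrix $A$ is $A_{ij}$, with the $i$th row and $j$th column denoted by $A_{i*}$ and $A_{*j}$ respectively. For functions $f : \mathbb{R} \to \mathbb{R}$ and vectors $x \in \mathbb{R}^d$ we adopt the convention of writing $f(x) \in \mathbb{R}^d$ for the coordinate-wise image of $x$ under $f$ so that $f(x) = (f(x_1), \ldots, f(x_d))$. Let $\mathcal{X}$ be a finite, non-empty set of symbols. A string $x_{1:n} = x_1 x_2 \ldots x_n \in \mathcal{X}^n$ of length $n$ over $\mathcal{X}$ is a finite sequence with $x_t \in \mathcal{X}$ for all $1 \le t \le n$. For $t \le n$ we introduce the shorthands $x_{<t} = x_{1:t-1}$ and $x_{\le t} = x_{1:t}$. The string of length zero is $\epsilon$ and the set of all finite strings is $\mathcal{X}^* = \{\epsilon\} \cup \bigcup_{i=1}^{\infty} \mathcal{X}^i$. A sequential, probabilistic model $\rho$ is a probability mass function $\rho : \mathcal{X}^* \to [0, 1]$, satisfying the constraint that $\rho(x_{1:n}) = \sum_{y \in \mathcal{X}} \rho(x_{1:n} y)$ for all $n \in \mathbb{N}$, $x_{1:n} \in \mathcal{X}^n$, with $\rho(\epsilon) = 1$. Under this definition, the conditional probability of a symbol $x_n$ given previous data $x_{<n}$ is defined as $\rho(x_n \mid x_{<n}) = \rho(x_{1:n}) / \rho(x_{<n})$ provided $\rho(x_{<n}) > 0$, with the familiar chain rules $\rho(x_{1:n}) = \prod_{i=1}^{n} \rho(x_i \mid x_{<i})$ and $\rho(x_{i:j} \mid x_{<i}) = \prod_{k=i}^{j} \rho(x_k \mid x_{<k})$ applying as usual.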
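Given $m$ sequential, probabilistic, binary models $\rho_1, \ldots, \rho_m$, Geometric Mixing provides a principled way of combining the $m$ associated conditional probability distributions into a single conditional probability distribution. Let $x_t \in \{0, 1\}$ denote the Boolean target at time $t$ and let $p_t$ denote the vector $(\rho_1(x_t = 1 \mid x_{<t}), \ldots, \rho_m(x_t = 1 \mid x_{<t}))$. Given a convex set $\mathcal{W} \subset \mathbb{R}^m$ and parameter vector $w \in \mathcal{W}$, a Geometric Mixture is defined by

$$\text{GEO}_w(x_t = 1; p_t) = \frac{\prod_{i=1}^{m} p_{t,i}^{w_i}}{\prod_{i=1}^{m} p_{t,i}^{w_i} + \prod_{i=1}^{m} (1 - p_{t,i})^{w_i}} = \sigma(w \cdot \operatorname{logit}(p_t)), \tag{1}$$

with $\text{GEO}_w(x_t = 0; p_t) = 1 - \text{GEO}_w(x_t = 1; p_t)$. The RHS of Equation 1 can be obtained via simple algebraic manipulation, with $\sigma(x) = 1 / (1 + e^{-x})$ denoting the sigmoid function, and $\operatorname{logit}(x) = \log(x / (1 - x))$ its inverse. A few properties of this formulation are worth discussing: setting $w_i = 1/m$ for $i \in [1, m]$ is equivalent to taking the geometric mean of the $m$ input probabilities; if $w = 0$ then $\text{GEO}_w(x_t = 1; p_t) = 1/2$; due to the product formulation, every model also has “the right of veto” (Hinton, 2002), in the sense that a single $p_{t,i}$ close to 0 coupled with a $w_i > 0$ drives $\text{GEO}_w(x_t = 1; p_t)$ close to zero.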
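The equivalence of the product form and the sigmoid/logit form, together with the properties just listed, can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `geo_mix_product` and `geo_mix` and the example numbers are illustrative choices, not part of the original work.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def geo_mix_product(w, p):
    """Product form: prod p_i^w_i / (prod p_i^w_i + prod (1 - p_i)^w_i)."""
    num = math.prod(pi ** wi for wi, pi in zip(w, p))
    den = num + math.prod((1.0 - pi) ** wi for wi, pi in zip(w, p))
    return num / den

def geo_mix(w, p):
    """Equivalent form: sigmoid(w . logit(p))."""
    return sigmoid(sum(wi * logit(pi) for wi, pi in zip(w, p)))

p = [0.9, 0.6, 0.7]          # illustrative input probabilities
w = [0.5, 1.5, -0.2]         # illustrative weights

# Both forms agree up to floating point error.
gap = abs(geo_mix_product(w, p) - geo_mix(w, p))

# Uniform weights 1/m: normalized geometric means of p and 1 - p.
gm_one = math.prod(p) ** (1.0 / 3.0)
gm_zero = math.prod(1.0 - pi for pi in p) ** (1.0 / 3.0)
uniform = geo_mix([1.0 / 3.0] * 3, p)

# All-zero weights give the uninformative prediction 1/2.
neutral = geo_mix([0.0, 0.0, 0.0], p)

# "Right of veto": one input near 0 with positive weight dominates.
veto = geo_mix([1.0, 1.0, 1.0], [1e-9, 0.9, 0.9])
```

The sigmoid/logit form is the one a practical implementation would likely use, since it replaces products of powers with a single dot product in the logit domain.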
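We assume a standard online learning framework for the logarithmic loss, where at each round $t \in \mathbb{N}$ a predictor outputs a binary distribution $q_t : \{0, 1\} \to [0, 1]$, with the environment responding with an observation $x_t \in \{0, 1\}$, causing the predictor to suffer a loss $\ell_t(q_t, x_t) = -\log q_t(x_t)$ before moving onto round $t + 1$. In the case of geometric mixing, which depends on both the $m$ dimensional input predictions $p_t$ and the parameter vector $w \in \mathcal{W}$, we abbreviate the loss by defining

$$\ell^{\text{GEO}}_t(w) = \ell_t(\text{GEO}_w(\cdot\,; p_t), x_t). \tag{2}$$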
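Some properties of this formulation, whose proofs can be found in Appendix A of (Veness et al., 2017), follow immediately.

Proposition 2. For all $t \in \mathbb{N}$, $x_t \in \{0, 1\}$, $p_t \in (0, 1)^m$ and $w \in \mathcal{W}$ we have:

1. $\nabla \ell^{\text{GEO}}_t(w) = (\text{GEO}_w(1; p_t) - x_t) \operatorname{logit}(p_t)$.

2. $\|\nabla \ell^{\text{GEO}}_t(w)\|_2 \le \|\operatorname{logit}(p_t)\|_2$.

3. $\ell^{\text{GEO}}_t(w)$ is a convex function of $w$.

4. If $p_t \in [\epsilon, 1 - \epsilon]^m$ for some $\epsilon \in (0, 1/2)$, then:

(a) $\ell^{\text{GEO}}_t$ is $\alpha$-exp-concave, with the constant $\alpha$ given in (Veness et al., 2017);

(b) $\|\nabla \ell^{\text{GEO}}_t(w)\|_2 \le \sqrt{m} \log(1/\epsilon)$.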
Note also that Part 4.(a) of Proposition 2 implies that .
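For readers who want to verify the gradient and convexity claims directly, the following short derivation is a sketch under the notation assumed here: $s = w \cdot \operatorname{logit}(p_t)$, so that the geometric mixture assigns probability $\sigma(s)$ to $x_t = 1$. It relies only on the identity $\sigma'(s) = \sigma(s)(1 - \sigma(s))$.

```latex
\begin{aligned}
\ell^{\text{GEO}}_t(w)
  &= -x_t \log \sigma(s) - (1 - x_t) \log\bigl(1 - \sigma(s)\bigr),
  \qquad s = w \cdot \operatorname{logit}(p_t), \\
\frac{\partial \ell^{\text{GEO}}_t}{\partial s}
  &= -x_t \bigl(1 - \sigma(s)\bigr) + (1 - x_t)\,\sigma(s)
   = \sigma(s) - x_t, \\
\nabla_w \ell^{\text{GEO}}_t(w)
  &= \bigl(\sigma(s) - x_t\bigr) \operatorname{logit}(p_t), \\
\nabla^2_w \ell^{\text{GEO}}_t(w)
  &= \sigma(s)\bigl(1 - \sigma(s)\bigr)\,
     \operatorname{logit}(p_t) \operatorname{logit}(p_t)^{\top} \succeq 0 .
\end{aligned}
```

Since $|\sigma(s) - x_t| \le 1$, the gradient norm is at most $\|\operatorname{logit}(p_t)\|_2$, and the positive semi-definite Hessian gives convexity in $w$.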
The above properties of the sequence of loss functions make it straightforward to apply one of the many different online convex programming techniques to adapt $w$ at the end of each round. In this paper we restrict our attention to Online Gradient Descent (Zinkevich, 2003), with $\mathcal{W}$ equal to some choice of hypercube; further justification of this choice is given in (Veness et al., 2017). This gives an $O(\sqrt{T})$ regret bound with respect to the best $w \in \mathcal{W}$ chosen in hindsight, provided an appropriate schedule of decaying learning rates is used, where $T$ denotes the total number of rounds. Using a fixed learning rate, various regret bounds for piecewise sources have been shown by Mattern (2013, 2016).
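As an illustration of Online Gradient Descent applied to a single geometric mixer, the sketch below runs projected gradient steps on a toy stream. Everything specific in it is an assumption made for the example: the hypercube bound `b = 2.0`, the learning rate schedule `0.5 / sqrt(t)`, and the synthetic data in which the first input is informative and the second is a constant, uninformative prediction.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def predict(w, p):
    """Geometric mixture probability that the target equals 1."""
    return sigmoid(sum(wi * logit(pi) for wi, pi in zip(w, p)))

def log_loss(w, p, x):
    q = predict(w, p)
    return -math.log(q if x == 1 else 1.0 - q)

def ogd_step(w, p, x, lr, b):
    """One gradient step followed by projection onto [-b, b]^m (clipping)."""
    err = predict(w, p) - x                      # scalar part of the gradient
    return [min(b, max(-b, wi - lr * err * logit(pi))) for wi, pi in zip(w, p)]

def example(t):
    """Toy stream (assumed): input 0 tracks the target, input 1 is constant."""
    x = t % 2
    return [0.8 if x == 1 else 0.2, 0.7], x

def run(rounds=1000, b=2.0):
    w = [0.5, 0.5]                               # uniform initial weights 1/m
    for t in range(1, rounds + 1):
        p, x = example(t)
        w = ogd_step(w, p, x, lr=0.5 / math.sqrt(t), b=b)
    return w

w_init = [0.5, 0.5]
w_final = run()
loss_init = sum(log_loss(w_init, *example(t)) for t in (1, 2))
loss_final = sum(log_loss(w_final, *example(t)) for t in (1, 2))
```

On this stream the weight on the informative input grows until it hits the clipping bound, while the weight on the constant input settles near zero, so the loss on both example types falls well below its initial value.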
3 Gated Linear Networks
We now introduce gated linear networks (GLNs), which are feed-forward networks composed of many layers of gated geometric mixing neurons. Each neuron in a given layer outputs a gated geometric mixture of the predictions from the previous layer, with the final layer consisting of just a single neuron that determines the output of the entire network. There are two types of input to each neuron: the first is the side information $z \in \mathcal{Z}$, which can be thought of as the input features in a standard supervised learning setup; the second is the input to the neuron, which will be the predictions output by the previous layer, or in the case of layer 0, some (optionally) provided base predictions. Each neuron will also take in a constant prediction, which allows a bias weight to be learnt; this is essential for our subsequent analysis, since there is no guarantee that the base predictions are informative in general. Upon receiving an input, all the gates in the network fire, which corresponds to selecting a single weight vector local to each neuron from the provided side information for subsequent use with geometric mixing. The distinguishing properties of this architecture are that the gating functions are fixed in advance, each neuron attempts to predict the target with an associated per-neuron loss, and that all learning will take place locally within each neuron. These properties are depicted graphically in Figure 1 and described in more detail below.
Gated Geometric Mixing.
We now formally define our notion of a single neuron, a gated geometric mixer, which we obtain by adding a contextual gating procedure to geometric mixing. Here, contextual gating has the intuitive meaning of mapping particular examples to particular sets of weights. Associated with each neuron is a context function $c : \mathcal{Z} \to \mathcal{C}$, where $\mathcal{Z}$ is the set of possible side information and $\mathcal{C} = \{0, \ldots, k - 1\}$ for some $k \in \mathbb{N}$ is the context space. Given a convex set $\mathcal{W} \subset \mathbb{R}^m$, each neuron is parametrized by a matrix $W = [w_0; \ldots; w_{k-1}]$ with each row vector $w_i \in \mathcal{W}$ for $0 \le i < k$. The context function $c$ is responsible for mapping a given piece of side information $z_t \in \mathcal{Z}$ to a particular row $w_{c(z_t)}$ of $W$, which we then use with standard geometric mixing, that is

$$\text{GEO}^c_W(x_t = 1; p_t, z_t) = \text{GEO}_{w_{c(z_t)}}(x_t = 1; p_t),$$

with $\text{GEO}^c_W(x_t = 0; p_t, z_t) = 1 - \text{GEO}^c_W(x_t = 1; p_t, z_t)$. Once again we have the following equivalent form: $\text{GEO}^c_W(x_t = 1; p_t, z_t) = \sigma(w_{c(z_t)} \cdot \operatorname{logit}(p_t))$. The key idea is that each neuron can now specialize its weighting of the input predictions based on some property of the side information $z_t$. Although many gating functions are possible, we limit our attention here to two types of general purpose context function which have proven themselves useful empirically and/or theoretically:

(Half-space contexts) This choice of context function is useful for real-valued side information. Given a normal $v \in \mathbb{R}^d$ and offset $b \in \mathbb{R}$, consider the associated affine hyperplane $\{x \in \mathbb{R}^d : x \cdot v = b\}$. This divides $\mathbb{R}^d$ in two, giving rise to two half-spaces, one of which we denote $H_{v,b} = \{x \in \mathbb{R}^d : x \cdot v \ge b\}$. The associated half-space context is then given by $c(z) = \mathbb{1}_{H_{v,b}}(z)$.

(Skip-gram contexts) The following type of context function is useful when we have multi-dimensional binary side information and can expect single components of $z$ to be informative. If $\mathcal{Z} = \{0, 1\}^d$, given an index $i \in [1, d]$, a skip-gram context is given by the function $c(z) = z_i$ where $z = (z_1, \ldots, z_d)$. One can also naturally extend this notion to categorical multi-dimensional side information or real valued side information by thresholding.
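Richer notions of context can be created by composition. In particular, any finite set of $n$ context functions $\{c_i : \mathcal{Z} \to \mathcal{C}_i\}_{i=1}^{n}$ with associated context spaces $\mathcal{C}_1, \ldots, \mathcal{C}_n$ can be composed into a single higher order context function $c : \mathcal{Z} \to \mathcal{C}$, where $\mathcal{C} = \{0, \ldots, -1 + \prod_{i=1}^{n} |\mathcal{C}_i|\}$, by defining $c(z) = \sum_{i=1}^{n} c_i(z) \prod_{j=i+1}^{n} |\mathcal{C}_j|$. For example, we could combine four different skip-gram context functions into a single context function with a context space containing $2^4 = 16$ elements.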
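The two context types and their composition can be sketched in a few lines. The code below is an illustration only: the helper names are hypothetical, indices are 0-based (whereas the text counts components from 1), and the composition uses a mixed-radix encoding, which is one natural way to map a tuple of context values to a single index.

```python
from itertools import product

def half_space_context(v, b):
    """Context c(z) = 1 if z . v >= b, else 0."""
    def c(z):
        return 1 if sum(zi * vi for zi, vi in zip(z, v)) >= b else 0
    return c

def skip_gram_context(i):
    """Context c(z) = z[i] for binary side information z (0-based index)."""
    def c(z):
        return z[i]
    return c

def compose(contexts, sizes):
    """Combine context functions into one index via mixed-radix encoding.

    sizes[i] is the number of values contexts[i] can take; the composed
    context space has prod(sizes) elements.
    """
    def c(z):
        index = 0
        for ci, size in zip(contexts, sizes):
            index = index * size + ci(z)
        return index
    return c

# A half-space context on two-dimensional real side information.
h = half_space_context((1.0, -1.0), 0.0)

# Four skip-gram contexts composed into a single 16-valued context.
four = compose([skip_gram_context(i) for i in range(4)], [2, 2, 2, 2])
all_indices = sorted(four(z) for z in product((0, 1), repeat=4))
```

Each distinct combination of the four observed bits selects its own weight vector, which is why the composed context space has 16 elements.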
Gated Linear Networks.
A GLN is a network of sequential, probabilistic models organized in $L + 1$ layers indexed by $i \in \{0, \ldots, L\}$, with $K_i$ models in each layer. Models are indexed by their position in the network when laid out on a grid; for example, $\rho_{ik}$ will refer to the $k$th model in the $i$th layer. The zeroth layer of the network is called the base layer and is constructed from $K_0$ sequential probabilistic models $\{\rho_{0k}\}_{k=0}^{K_0 - 1}$. Since each of their predictions is assumed to be a function of the given side information and all previously seen examples, these base models can essentially be arbitrary. The nonzero layers are composed of gated geometric mixing neurons. Associated to each of these will be a fixed context function $c_{ik} : \mathcal{Z} \to \mathcal{C}$ that determines the behavior of the gating. In addition to the context function, for each context $c \in \mathcal{C}$ and each neuron $(i, k)$ there is an associated weight vector $w_{ikc} \in \mathbb{R}^{K_{i-1}}$ which is used to geometrically mix the inputs. We also enforce the constraint of having a non-adaptive bias model on every layer, which will be denoted by $\rho_{i0}$ for each layer $i$. Each of these bias models will correspond to a Bernoulli Process with parameter $\beta$; the weights learnt for these bias inputs play a similar role to the bias weights in MLPs. Given a $z \in \mathcal{Z}$, a weight vector for each neuron is determined by evaluating its associated context function. The output of each neuron can now be described inductively in terms of the outputs of the previous layer. To simplify the notation, we assume an implicit dependence on $x_{<t}$ and let $p_{ik}(z) = \rho_{ik}(x_t = 1 \mid x_{<t}; z)$ denote the output of the $k$th neuron in the $i$th layer, and $p_i(z) = (p_{i0}(z), \ldots, p_{i K_i - 1}(z))$ the output of the $i$th layer. The bias output for each layer is defined to be $p_{i0}(z) = \beta$ for all $z \in \mathcal{Z}$, for all $0 \le i \le L$. Here we make the nonessential algebraically convenient choice to set $\beta = e / (e + 1)$ so that $\operatorname{logit}(\beta) = 1$. For layers $i \ge 1$, the $k$th node in the $i$th layer receives as input the vector $p_{i-1}(z)$ of dimension $K_{i-1}$ of predictions of the preceding layer. The output of a single neuron is the geometric mixture of the inputs with respect to a set of weights that depend on its context, namely

$$p_{ik}(z) = \sigma\bigl(w_{ik c_{ik}(z)} \cdot \operatorname{logit}(p_{i-1}(z))\bigr).$$
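The output of layer $i$ can be re-written in matrix form as

$$p_i(z) = \sigma\bigl(W_i(z) \operatorname{logit}(p_{i-1}(z))\bigr), \tag{4}$$

where $W_i(z) \in \mathbb{R}^{K_i \times K_{i-1}}$ is the matrix with $k$th row equal to $w_{ik}(z) = w_{ik c_{ik}(z)}$. Iterating Equation 4 once gives

$$p_i(z) = \sigma\Bigl(W_i(z) \operatorname{logit}\bigl(\sigma\bigl(W_{i-1}(z) \operatorname{logit}(p_{i-2}(z))\bigr)\bigr)\Bigr).$$

Since logit is the inverse of $\sigma$, the $i$th iteration of Equation 4 simplifies to

$$p_i(z) = \sigma\bigl(W_i(z) W_{i-1}(z) \cdots W_1(z) \operatorname{logit}(p_0(z))\bigr),$$

which shows the network behaves like a linear network (Baldi and Hornik, 1989; Saxe et al., 2013), but with weight matrices that are data-dependent. Without the data dependent gating, the product of matrices would collapse to a single linear mapping, giving the network no additional modeling power over a single neuron (Minsky and Papert, 1969). Generating a prediction requires computing the contexts from the given side information for each neuron, and then performing $L$ matrix-vector products. Under the assumption that multiplying an $m \times n$ by an $n \times 1$ pair of matrices takes $O(mn)$ work, the total time complexity to generate a single prediction is $O\bigl(\sum_{i=1}^{L} K_i K_{i-1}\bigr)$ for the matrix-vector products, which in typical cases will dominate the overall runtime.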
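The collapse of the layer-by-layer forward pass into a single product of gated weight matrices can be checked on a toy network. The sketch below makes several simplifying assumptions that are not from the original work: bias neurons are omitted, each neuron is stored as a dictionary with hypothetical keys `context` and `weights`, and the weights and contexts are arbitrary illustrative values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def matvec(W, v):
    return [sum(a * b for a, b in zip(row, v)) for row in W]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def gated_weights(layer, z):
    """Matrix whose k-th row is the weight vector selected by neuron k's context."""
    return [neuron["weights"][neuron["context"](z)] for neuron in layer]

def forward(layers, p0, z):
    """Layer-by-layer pass: p_i = sigmoid(W_i(z) logit(p_{i-1}))."""
    p = p0
    for layer in layers:
        p = [sigmoid(s) for s in matvec(gated_weights(layer, z), [logit(q) for q in p])]
    return p

def forward_linear(layers, p0, z):
    """Same output via one matrix product applied to logit(p_0)."""
    M = gated_weights(layers[0], z)
    for layer in layers[1:]:
        M = matmul(gated_weights(layer, z), M)
    return [sigmoid(s) for s in matvec(M, [logit(q) for q in p0])]

# Toy 2-2-1 network with half-space style contexts on 2-d side information.
layers = [
    [{"context": lambda z: int(z[0] >= 0.0), "weights": [[0.5, 0.5], [1.0, -0.3]]},
     {"context": lambda z: int(z[1] >= 0.0), "weights": [[0.2, 0.8], [-0.5, 1.2]]}],
    [{"context": lambda z: int(z[0] + z[1] >= 1.0), "weights": [[0.5, 0.5], [0.9, 0.1]]}],
]
p0 = [0.7, 0.4]
z = (0.3, -1.2)
out_layered = forward(layers, p0, z)
out_linear = forward_linear(layers, p0, z)
```

Changing `z` so that a different context fires swaps in a different row of the corresponding matrix, which is the only source of non-linearity in the side information.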
Learning in Gated Linear Networks.
We now describe how the weights are learnt in a Gated Linear Network using Online Gradient Descent (Zinkevich, 2003). While a GLN is architecturally superficially similar to the well-known multilayer perceptron (MLP), its learning dynamics are completely different. The main difference is that every neuron in a GLN probabilistically predicts the target, and has a loss function defined in terms of just the parameters of the neuron itself; thus, unlike backpropagation, learning is local. Furthermore, this loss function is convex, which will allow us to avoid many of the difficulties associated with training typical deep architectures. For example, we can use a deterministic weight initialization, which aids reproducibility of empirical results, while convexity reduces the impact of learning online from correlated examples. As the weight update is local, there is no issue of vanishing gradients, nor is there any need for the relatively expensive backward pass in models trained via backpropagation; a forward pass of a GLN gives sufficient information to determine the weight update for each neuron analytically. One should think of each layer as being responsible for trying to directly improve the predictions of the previous layer, rather than a form of implicit non-linear feature/filter construction as is the case with MLPs trained offline with backpropagation (Rumelhart et al., 1988).
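The weights for every contextual weight vector will lie within some (scaled) hypercube, that is $w_{ikc} \in [-b, b]^{K_{i-1}}$ for some constant $b > 0$. We will use $w^{(t)}_{ikc}$ to denote the weight vector at time $t$, and $p^{(t)}_i$ to denote the predictions of the $i$th layer at time $t$. Every component of the weights will be initialized to $1 / K_{i-1}$ for all $i, k, c$, which causes geometric mixing to initially compute a geometric average of its input. One could also use small random weights, as is typically done in MLPs, but we recommend against this choice because it makes little practical difference in our setting and has a negative impact on reproducibility. If we let $\ell^{(t)}_{ik}(w)$ denote the loss of the $k$th neuron in layer $i$, from Equation (2) we have $\ell^{(t)}_{ik}(w) = \ell_t(\text{GEO}_w(\cdot\,; p^{(t)}_{i-1}), x_t)$. Now, for all layers $1 \le i \le L$, all non-bias neurons $k$ in layer $i$, and for the active context $c = c_{ik}(z_t)$, using Proposition 2, the local Online Gradient Descent update is

$$w^{(t+1)}_{ikc} = \Pi_i\Bigl[w^{(t)}_{ikc} - \eta_t \bigl(p^{(t)}_{ik} - x_t\bigr) \operatorname{logit}\bigl(p^{(t)}_{i-1}\bigr)\Bigr],$$

where $\eta_t$ is the learning rate at time $t$ and $\Pi_i$ is the projection operation onto the hypercube $[-b, b]^{K_{i-1}}$. This projection is equivalent to simply clipping every component of its argument to $[-b, b]$.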
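The local update touches only the weight vector selected by the active context. The sketch below illustrates this with a single gated neuron whose only input is a constant bias prediction. The class `GatedNeuron`, the bias value `e / (e + 1)` (chosen here so that its logit equals 1), the clipping bound `b = 5.0`, the constant learning rate, and the toy target are all assumptions of this example rather than details taken from the original work.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

class GatedNeuron:
    """A gated geometric mixer with one weight row per context (sketch)."""

    def __init__(self, context, num_contexts, num_inputs, b=5.0):
        self.context = context
        self.b = b
        # Deterministic initialization: every weight equals 1 / num_inputs.
        self.W = [[1.0 / num_inputs] * num_inputs for _ in range(num_contexts)]

    def predict(self, p, z):
        w = self.W[self.context(z)]
        return sigmoid(sum(wi * logit(pi) for wi, pi in zip(w, p)))

    def update(self, p, z, x, lr):
        """Local gradient step on the active row, then clip to [-b, b]."""
        c = self.context(z)
        err = self.predict(p, z) - x
        self.W[c] = [min(self.b, max(-self.b, wi - lr * err * logit(pi)))
                     for wi, pi in zip(self.W[c], p)]

BETA = math.e / (math.e + 1.0)        # assumed bias prediction, logit(BETA) = 1

def sign_context(z):
    return 1 if z >= 0.0 else 0       # half-space context on scalar side information

# A single update changes only the row of the active context.
probe = GatedNeuron(sign_context, num_contexts=2, num_inputs=1)
probe.update([BETA], 1.0, 1, lr=0.1)

# Toy task (assumed): the target is 1 exactly when z >= 0.
neuron = GatedNeuron(sign_context, num_contexts=2, num_inputs=1)
for t in range(1000):
    z = 1.0 if t % 2 == 0 else -1.0
    x = 1 if z >= 0.0 else 0
    neuron.update([BETA], z, x, lr=0.1)
```

Even though the input prediction carries no information about the target, the neuron learns a separate bias weight per context and so predicts correctly on both sides of the half-space boundary.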
We now state a regret guarantee for a single (non-bias) neuron. It follows from a simple generalization of the standard analysis technique of Zinkevich (2003) to our gated case, instantiated appropriately using Proposition 2; the proof is provided in Section 3.2, “Performance Guarantees for individual neurons”, of (Veness et al., 2017). Given $c \in \mathcal{C}$ we define the index set $N_{ik}(c, n) = \{t \le n : c_{ik}(z_t) = c\}$, that is, the set of rounds when context $c$ is observed by neuron $(i, k)$ within $n$ rounds. The regret experienced by neuron $(i, k)$ is defined as the difference between the losses suffered by the neuron and the losses it would have suffered using the best choice of weights in hindsight.
For each neuron , provided each input probability is bounded between , the weights reside in the hypercube , and then
Proposition 3 shows that the average per round regret grows as for each neuron, which implies that the learning dynamics are no-regret.
Effective Capacity of Gated Linear Networks.
Neural networks have long been known to be capable of approximating arbitrary continuous functions with almost any reasonable activation function (Hornik, 1991, and others). We will show that, provided the context functions are sufficiently expressive, GLNs also have the capacity to approximate large classes of functions. More than this, the capacity is effective in the sense that gradient descent will eventually always find the best feasible approximation. In contrast, similar results for neural networks show the existence of a choice of weights for which the neural network will approximate some function, but do not show that gradient descent (or any other single algorithm) will converge to these weights. The following general result establishes the convergence properties for GLNs with respect to any family of context functions.
Let be a measure on with the Lebesgue -algebra and let
be a sequence of independent random variables sampled from. Furthermore, let be a sequence of independent Bernoulli random variables with for some -measurable function . Consider a GLN, and let be the set of context functions in layer . Assume there exists a such that for each non-bias neuron the weight-space is compact and . Provided the weights are learned using any no-regret algorithm, then the following hold with probability one:
For each non-bias neuron there exists a non-random function with
The average of converges to the average of on the context-induced partitions of :
There exists a non-random -measurable function such that
The first part of the theorem says the output of each neuron converges almost surely in Cesaro average to some nonrandom function on the support of . The third part shows that as the number of layers tends to infinity, the output of neurons in later layers converges to some single function . The second part shows that is approximately equal to the average of on all sets , with equality holding as . Note that all results are about what happens in the limit of infinite data. Theorem 3 can be clarified by inspecting Figure 2. Here a half-space GLN (with no base models) is trained to convergence, with the training data consisting of samples drawn uniformly over the range
and associated labels drawn from a Bernoulli distribution parametrised by. On the first layer, one can directly see how each neuron learns to predict the average of on the partitions of the input space. In the subsequent layers, the neurons’ outputs gain in complexity, by combining information not only from the neuron’s own contexts, but also from the estimates of the neurons below, culminating in a reasonable approximation at the topmost neuron.
Our next result gives a sufficient condition on a common set of context functions (taking ) for to predict like -almost-everywhere.
Using the notation and definitions from Theorem 3, if
is a norm on the space of bounded -measurable functions then .
The following result together with Theorem 3 implies the universality of half-space gated GLNs.
Proposition (Half-space Gating is Universal)
If is absolutely continuous with respect to the Lebesgue measure and is the space of context functions that are indicators on the countable set of half-spaces , then is a norm on the spaces of bounded -measurable functions.
Proofs of all results are provided in (Veness et al., 2017); for Theorem 1, see Appendix B, for Theorem 2, see “Applications of Theorem 1” in Section 4, and for Proposition 3, see Appendix D and Lemma 11. Proposition 3 follows directly from the Radon Transform and its properties. The Radon Transform is used extensively (Deans, 1993) in digital tomography to recover approximations of objects from a finite number of cross sectional scans. One can view half-space gated GLNs as the machine learning analogue of such reconstruction techniques.
Relationship to PAQ family of compressors.
Our theoretical results are relevant in that they hold with respect to the mixing networks used in the PAQ family of compression programs. At the time of writing, the best performing PAQ implementation in terms of general purpose compression performance is cmix (Knoll, 2017). cmix uses a mixing network composed of many gated geometric mixers, each of which uses skip-gram gating. While many of the core building blocks have been analyzed previously by Mattern (2016), the reason for the empirical success of such locally trained mixing networks has hitherto remained somewhat of a mystery. Our work shows that such architectures are special cases of Gated Linear Networks, that their local learning rule is well-founded, and that future improvements should result from exploring richer classes of gating functions, as skip-gram gating lacks universality (except in the degenerate case when the skip-gram depends on all the side information, which is rarely practical). In principle, the universality of GLNs suggests that the performance of PAQ-like approaches will continue to scale with more investigation.
4 Experimental Results
Online MNIST Classification.
First we explore the use of GLNs for classification on the MNIST dataset (Lecun et al., 1998)
. We used an ensemble of 10 GLNs to construct a one-vs-all classifier. Each member of the ensemble was a 4 layer network consisting of 2000-1000-500-1 neurons, each of which used 4 half-space context functions (using context composition) as the gating procedure. Mean subtraction was used as a pre-processing step. The half-space contexts for each neuron were randomly determined by sampling adimensional normal vector whose components were distributed according to , and a bias weight of . The learning rate for an example at time was set to . Running the method purely online across a single pass of the data gives an accuracy on the test set of
It is worth noting that these results are permutation invariant. A reference implementation of random half-space gated GLNs is provided in the supplementary material. Accuracy can be improved to 98.6% using a de-skewing operation (Ghosh and Wan, 2017). Whether translation invariance can be successfully incorporated into a practical gating procedure is an open question, and likely essential for GLNs to be competitive with image specific architectures such as convolutional neural networks on real world image datasets.
Online UCI Dataset Classification.
We next compare randomly sampled half-space gated GLNs to a variety of general purpose batch learning techniques (SVMs, Gradient Boosting for Classification, MLPs) in small data regimes on a selection of standard UCI datasets. A 1000-500-1 neuron GLN (using a context function composed from 8 random half-space context functions) was trained with a single pass
over 80% of instances and evaluated with frozen weights on the remainder. The comparison MLP used ReLU activations and was equivalently shaped, and trained for 100 epochs using the Adam optimizer with learning rate
and batch size 32. The SVM classifier used a radial basis function kernelwith , where is the input dimension. The GBC classifier was an ensemble of 100 trees of maximum depth 3 with a learning rate of . The mean and stderr over 100 random train/test splits are shown in the leftmost graph of Figure 3. Here we see that the single-pass GLN is competitive with the best of the batch learning results on each domain.
Next we explored the ability of GLNs to fit randomly assigned labels, to demonstrate that they have large capacity in practice. Half-space gated GLNs of ----1 neurons (using a context function composed from 8 random half-space context functions) were trained for varying layer-width across multiple epochs on the full set of MNIST images with shuffled labels. The rightmost graph of Figure 3 shows that after 8 epochs a GLN of sufficient size can perfectly fit random labels, which complements our theoretical capacity results, showing that realistically sized GLNs have non-trivial capacity.
Online MNIST Density Modelling.
Our final result explores the potential of GLNs for autoregressive image density modeling on the binarized MNIST dataset (Larochelle and Murray, 2011); the full details are given in (Veness et al., 2017). An autoregressive density model over the 784 dimensional binary space was constructed by using a single GLN per pixel to model the 784 different conditional distributions. Running the method online across a single pass of the data (we concatenated the training, validation and test sets) gave an average loss of nats per image across the test data, and 80.74 nats per image if we held the parameters fixed upon reaching the test set. Our online result is close to the state of the art (Van Den Oord et al., 2016) among batch trained density models that output exact probabilities.
We have introduced and analyzed a family of general purpose neural architectures that have impressive single-pass performance and strong theoretical guarantees due to their distributed and local credit assignment mechanism. This work sheds light on the empirical success of the PAQ family of compressors, and our introduction of a universal gating mechanism motivates further investigation for their use in wider machine learning applications where sample efficiency is important.
- Baldi and Hornik (1989) P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, January 1989.
- Bell et al. (1990) T. C. Bell, J. G. Cleary, and I. H. Witten. Text Compression. Prentice Hall, Englewood Cliffs, NJ, 1990.
- Bell and Arnold (1997) Tim Bell and Ross Arnold. A corpus for the evaluation of lossless compression algorithms. Data Compression Conference, 00:201, 1997.
- Deans (1993) Stanley R. Deans. The Radon Transform and Some of Its Applications. John Wiley and Sons, 1993.
- Ghosh and Wan (2017) Dibya Ghosh and Alvin Wan, 2017. URL https://fsix.github.io/mnist/Deskewing.html.
- Hinton (2002) Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, August 2002.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- Hornik (1991) Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257, 1991.
- Hutter (2017) Marcus Hutter. Hutter prize, 2017. URL http://prize.hutter1.net/.
- Jacobs et al. (1991) Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Comput., 3(1):79–87, 1991.
- Knoll and de Freitas (2012) B. Knoll and N. de Freitas. A machine learning perspective on predictive coding with paq8. In Data Compression Conference (DCC), pages 377–386, April 2012.
- Knoll (2017) Byron Knoll, 2017. URL http://www.byronknoll.com/cmix.html.
- Larochelle and Murray (2011) Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 29–37, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR.
- Lecun et al. (1998) Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
- Mahoney (2000) Matthew Mahoney. Fast text compression with neural networks. AAAI, 2000.
- Mahoney (2005) Matthew Mahoney. Adaptive weighing of context models for lossless data compression. Technical Report, Florida Institute of Technology CS, 2005.
- Mahoney (2013) Matthew Mahoney. Data Compression Explained. Dell, Inc, 2013.
- Mattern (2012) Christopher Mattern. Mixing strategies in data compression. In 2012 Data Compression Conference, Snowbird, UT, USA, April 10-12, pages 337–346, 2012.
- Mattern (2013) Christopher Mattern. Linear and geometric mixtures - analysis. In 2013 Data Compression Conference, DCC 2013, Snowbird, UT, USA, March 20-22, 2013, pages 301–310, 2013.
- Mattern (2016) Christopher Mattern. On Statistical Data Compression. PhD thesis, Technische Universität Ilmenau, Germany, 2016.
- Minsky and Papert (1969) Marvin Minsky and Seymour Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA, USA, 1969.
- Rumelhart et al. (1988) David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. In James A. Anderson and Edward Rosenfeld, editors, Neurocomputing: Foundations of Research, pages 696–699. MIT Press, Cambridge, MA, USA, 1988.
- Saxe et al. (2013) Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, abs/1312.6120, 2013.
- Schmidhuber and Heil (1996) J. Schmidhuber and S. Heil. Sequential neural text compression. IEEE Transactions on Neural Networks, 7(1):142–146, Jan 1996.
- Van Den Oord et al. (2016) Aäron Van Den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML’16, pages 1747–1756. JMLR.org, 2016.
- Veness et al. (2017) Joel Veness, Tor Lattimore, Avishkar Bhoopchand, Agnieszka Grabska-Barwinska, Christopher Mattern, and Peter Toth. Online learning with gated linear networks. CoRR, abs/1712.01897, 2017. URL http://arxiv.org/abs/1712.01897.
- Zinkevich (2003) Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, pages 928–936, 2003.