Gaussian Gated Linear Networks

06/10/2020
by   David Budden, et al.
Google

We propose the Gaussian Gated Linear Network (G-GLN), an extension to the recently proposed GLN family of deep neural networks. Instead of using backpropagation to learn features, GLNs have a distributed and local credit assignment mechanism based on optimizing a convex objective. This gives rise to many desirable properties including universality, data-efficient online learning, trivial interpretability and robustness to catastrophic forgetting. We extend the GLN framework from classification to multiple regression and density modelling by generalizing geometric mixing to a product of Gaussian densities. The G-GLN achieves competitive or state-of-the-art performance on several univariate and multivariate regression benchmarks, and we demonstrate its applicability to practical tasks including online contextual bandits and density estimation via denoising.




1 Introduction

Recent studies have demonstrated that backpropagation-free deep learning, particularly the Gated Linear Network (GLN) family sezener2020online ; veness2017online ; veness2019gated , can yield surprisingly powerful models for solving classification tasks. This is particularly true in the online regime where data efficiency is paramount. In this paper we extend GLNs to model real-valued and multi-dimensional data, and demonstrate that their theoretical and empirical advantages apply to far broader domains than previously anticipated.

The distinguishing feature of a GLN is distributed and local credit assignment. A GLN associates a separate convex loss to each neuron such that all neurons (1) predict the target distribution directly, and (2) are optimized locally using online gradient descent. A half-space “context function” is applied per neuron to select which weights to apply as a function of the input features, allowing the GLN to learn highly nonlinear functions. This architecture gives rise to many desirable properties previously shown in a classification setting: (1) trivial interpretability given its piecewise linear structure, (2) exceptional robustness to catastrophic forgetting, and (3) provably universal learning; a sufficiently large GLN can model any well-behaved, compactly supported density function to any accuracy, and any no-regret convex optimization method will converge to the correct solution given enough data.

Related Work.

We extend the previous Bernoulli GLN (B-GLN) formulation to model multivariate, real-valued data by reformulating the GLN neuron as a gated product of Gaussians. This Gaussian Gated Linear Network (G-GLN) formulation exploits the fact that exponential family densities are closed under multiplication welling2005exponential , a property that has seen much use in the Gaussian Process and related literature williams2002products ; peng2019mcp ; cao2014generalized ; marblestone2020product ; deisenroth2015distributed . Similar to the B-GLN, every neuron in our G-GLN directly predicts the target distribution. This idea is shared with work in supervised learning where targets are predicted from intermediate layers. The motivations for local, layer-specific training include improving gradient propagation and representation learning lee2015deeply ; rasmus2015semi ; lowe2019putting ; lee2018gradient , decoding for representation analysis alain2016understanding , and making neural networks more biologically plausible sussillo2009generating ; nokland2019training ; mostafa2018deep by avoiding backpropagation. The use of context-dependent weight selection (gating) in the GLN algorithm family resembles proposals to improve the continual and multi-task learning properties of deep networks schmidhuber1992learning ; ha2016hypernetworks ; cheung2019superposition ; von2019continual ; perez2018film by using a conditioning network to gate a principal network solving the task.

Paper Outline.

We begin by reviewing some background on weighted products of Gaussian densities, and describe how the relevant weights can be adapted using well-known online convex programming techniques Hazan16 . We next show how to augment this adaptive form with a gating mechanism, inspired by earlier work on classification with GLNs veness2017online ; veness2019gated , which gives rise to the notion of a neuron in G-GLNs. We then introduce G-GLNs: feed-forward networks of locally trained neurons, each computing a weighted product of Gaussians with input-dependent, gated weights. We conclude by providing a comprehensive set of experimental results demonstrating the strong performance of the G-GLN algorithm across a diverse set of regression benchmarks and practical applications, including contextual bandits and image denoising.

2 Background

The Gaussian distribution has a number of well-known properties that make it well suited for machine learning applications. Here we briefly review two of these important properties: closure under multiplication and convexity with respect to its parameters under the logarithmic loss, which we will later exploit to define our notion of a G-GLN neuron.

2.1 Weighted Products of Gaussian Densities

A weighted product of Gaussians is closed in the sense that it yields another Gaussian. More formally, let $\mathbb{R}_{\geq 0}$ denote the set of non-negative real numbers. For notational simplicity, we first construct the univariate case. Let $\mathcal{N}(\mu, \sigma)$ denote the univariate Gaussian PDF with mean $\mu \in \mathbb{R}$ and standard deviation $\sigma > 0$. Now, given $m$ univariate Gaussian experts $\mathcal{N}(\mu_1, \sigma_1), \ldots, \mathcal{N}(\mu_m, \sigma_m)$ with associated PDFs

f_i(x) = \frac{1}{\sigma_i \sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu_i)^2}{2 \sigma_i^2} \right)    (1)

and an $m$-dimensional vector of weights $w = (w_1, \ldots, w_m) \in \mathbb{R}_{\geq 0}^m$, we define a weighted Product of Gaussians (PoG) as

\mathrm{PoG}_w(x) := \frac{1}{Z(w)} \prod_{i=1}^{m} f_i(x)^{w_i}, \qquad Z(w) := \int \prod_{i=1}^{m} f_i(x)^{w_i} \, dx.    (2)

It is straightforward to show that this formulation gives rise to a Gaussian distribution whose mean and variance jointly depend on $w$; see Appendix A for a short derivation. In particular we can exactly interpret the weighted product of experts as another Gaussian expert $\mathcal{N}(\mu_{\mathrm{PoG}}(w), \sigma_{\mathrm{PoG}}(w))$ where

\sigma^2_{\mathrm{PoG}}(w) = \left( \sum_{i=1}^{m} \frac{w_i}{\sigma_i^2} \right)^{-1}, \qquad \mu_{\mathrm{PoG}}(w) = \sigma^2_{\mathrm{PoG}}(w) \sum_{i=1}^{m} \frac{w_i \mu_i}{\sigma_i^2}.    (3)

The same closure property holds for the multivariate case (e.g. see bromiley2018 ). Let $\mathcal{N}(\mu, \Sigma)$ denote the $d$-dimensional multivariate Gaussian PDF with mean $\mu \in \mathbb{R}^d$ and covariance matrix $\Sigma$, and let $I_d$ denote the $d$-dimensional identity matrix. In the general case, given $m$ multivariate $d$-dimensional Gaussian experts $\mathcal{N}(\mu_1, \Sigma_1), \ldots, \mathcal{N}(\mu_m, \Sigma_m)$, we have

\Sigma_{\mathrm{PoG}}(w) = \left( \sum_{i=1}^{m} w_i \Sigma_i^{-1} \right)^{-1}, \qquad \mu_{\mathrm{PoG}}(w) = \Sigma_{\mathrm{PoG}}(w) \sum_{i=1}^{m} w_i \Sigma_i^{-1} \mu_i.    (4)

Note that $\mu_{\mathrm{PoG}}(w)$ is a convex combination of the means of its inputs, which implies that it must lie within the convex hull formed from all the $\mu_i$. In the isotropic case with $\Sigma_i = \tau_i^{-1} I_d$ for precision $\tau_i > 0$, Equation 4 simplifies to

\sigma^2_{\mathrm{PoG}}(w) = \left( \sum_{i=1}^{m} w_i \tau_i \right)^{-1}, \qquad \mu_{\mathrm{PoG}}(w) = \sigma^2_{\mathrm{PoG}}(w) \sum_{i=1}^{m} w_i \tau_i \mu_i.    (5)

Note that if all the initial experts are isotropic, the product of Gaussians must also be isotropic. Although less general, the isotropic form has considerable computational advantages for high-dimensional multivariate regression (the matrix inverses reduce to scalar reciprocals), and will be used in our larger scale multivariate regression experiments.
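To make Equations 3 and 5 concrete, the following minimal NumPy sketch computes the sufficient statistics of a weighted product of Gaussian experts in both the univariate and isotropic multivariate cases. Function and variable names are our own illustrative choices, not taken from the paper.

import numpy as np

def pog_univariate(mu, sigma, w):
    """Weighted product of univariate Gaussian experts (Equation 3).

    mu, sigma, w: arrays of shape (m,) holding expert means, standard
    deviations and non-negative weights. Returns (mu_pog, sigma2_pog).
    """
    precision = w / sigma**2            # w_i / sigma_i^2
    sigma2_pog = 1.0 / precision.sum()  # (sum_i w_i / sigma_i^2)^-1
    mu_pog = sigma2_pog * (precision * mu).sum()
    return mu_pog, sigma2_pog

def pog_isotropic(mu, tau, w):
    """Weighted product of m isotropic d-dimensional experts (Equation 5).

    mu: (m, d) expert means; tau: (m,) expert precisions; w: (m,) weights.
    Returns (mu_pog of shape (d,), scalar isotropic variance).
    """
    wt = w * tau                        # weighted precisions
    sigma2_pog = 1.0 / wt.sum()
    mu_pog = sigma2_pog * (wt[:, None] * mu).sum(axis=0)
    return mu_pog, sigma2_pog

# Example: two experts combined with equal weights.
mu_pog, var_pog = pog_univariate(np.array([0.0, 2.0]),
                                 np.array([1.0, 0.5]),
                                 np.array([1.0, 1.0]))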

2.2 Online Convex Programming Formulation

We now show how to adapt the weights in Equation 2 using online convex programming. Assuming a standard online learning setup under the logarithmic loss, we define the instantaneous loss given a target $x_t$ with respect to a fixed weight vector $w$ as

\ell_t(w) := -\log \mathrm{PoG}_w(x_t) \;\equiv\; \log \sigma_{\mathrm{PoG}}(w) + \frac{(x_t - \mu_{\mathrm{PoG}}(w))^2}{2 \sigma^2_{\mathrm{PoG}}(w)},    (6)

with equivalence following by dropping non-essential constant terms. It is straightforward to show that $\ell_t$ is convex in $w$, either directly (as in Appendix B), or by appealing to known properties of the log-partition function for exponential family members wainwright2008graphical .

As we are interested in large scale applications, we derive an Online Gradient Descent (OGD) Zinkevich03 learning scheme to exploit the convexity of the loss in a principled fashion. To apply OGD in our setting, we need to restrict the weights to a choice of compact convex set $\mathcal{W}$. For simplicity of exposition, we focus our presentation on the case where the weight space is defined as

\mathcal{W} := [0, b]^m \cap \Big\{ w \in \mathbb{R}^m : \sum_{i=1}^{m} w_i \geq \epsilon \Big\},    (7)

where $b > 0$ and $\epsilon \in (0, b]$. As $\mathcal{W}$ is formed from the intersection of a scaled hypercube and a half-space, it is a convex set with finite diameter, and is clearly compact and non-empty. OGD works by performing two operations: a gradient step, and a projection of the modified weights back into $\mathcal{W}$ if the gradient update pushed them outside of $\mathcal{W}$. This projection is essential, as it is responsible both for ensuring that the weighted product of Gaussians is well-defined (e.g. positive variance) and for providing no-regret guarantees comparable to what was previously achieved for B-GLNs veness2017online .
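As an illustration of this online convex programming view, the sketch below performs one OGD step on the loss of Equation 6 followed by a projection back onto a hypercube-plus-half-space set of the form in Equation 7. The closed-form gradient is our own derivation from Equation 6, and the constants b and eps are illustrative placeholders rather than values from the paper.

import numpy as np

def pog_stats(mu, sigma, w):
    """Mean and variance of the weighted product of Gaussians (Equation 3)."""
    prec = w / sigma**2
    var = 1.0 / prec.sum()
    return var * (prec * mu).sum(), var

def loss_grad(x, mu, sigma, w):
    """Gradient of the log loss of Equation 6 with respect to w (derived by hand)."""
    m_pog, v_pog = pog_stats(mu, sigma, w)
    r = x - m_pog
    return (r * (x + m_pog - 2.0 * mu) - v_pog) / (2.0 * sigma**2)

def project(w, b=10.0, eps=1.0):
    """Project onto [0, b]^m intersected with {w : sum_i w_i >= eps} (illustrative set)."""
    w = np.clip(w, 0.0, b)
    if w.sum() < eps:                    # Euclidean projection onto the half-space
        w = w + (eps - w.sum()) / len(w)
        w = np.clip(w, 0.0, b)
    return w

def ogd_step(x, mu, sigma, w, lr=0.01):
    """One step of Online Gradient Descent followed by projection."""
    return project(w - lr * loss_grad(x, mu, sigma, w))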

3 G-GLN Neurons

We now introduce a new type of neuron which will constitute the basic learning primitive within a G-GLN. The key idea is that further representational power can be added to a weighted product of Gaussians via a contextual gating procedure. We achieve this by extending the weighted product of Gaussians model with an additional type of input, which we call side information. The side information is used by a neuron to select which weight vector to apply for a given example from a table of weight vectors. In typical applications to regression, the side information is simply the (normalized) feature vector of the input example, i.e. $z = x$.

More formally, associated with each neuron is a context function $c : \mathcal{Z} \to \mathcal{C}$, where $\mathcal{Z}$ is the set of possible side information and $\mathcal{C} = \{0, \ldots, k-1\}$ for some $k \in \mathbb{N}$ is the context space. Each neuron is now parameterized by a weight matrix $W$ with one row vector $w_j \in \mathcal{W}$ for each $j \in \mathcal{C}$. The context function is responsible for mapping the side information $z$ to a particular row of $W$, which we then use to weight the Product of Gaussians.

In other words, a G-GLN neuron can be defined in terms of Equation 2 by

f(x ; z) := \mathrm{PoG}_{w_{c(z)}}(x),    (8)

with the associated loss function $\ell_t(W) := -\log f(x_t ; z_t)$ inheriting all the properties needed to apply Online Convex Programming directly from Equation 6.

Half-space Gating.

We restrict our attention to the class of half-space context functions, as in veness2017online . Given a normal vector $v$ and offset $b \in \mathbb{R}$, consider the associated affine hyperplane $\{ z : v \cdot z = b \}$. This divides $\mathcal{Z}$ in two, giving rise to two half-spaces, one of which we denote $H_{v,b} := \{ z : v \cdot z \geq b \}$. The associated half-space context function is then given by $c(z) = 1$ if $z \in H_{v,b}$ and $0$ otherwise. Richer notions of context can be created by composition. In particular, any finite set of $d$ context functions $c_1, \ldots, c_d$ with associated context spaces $\{0, 1\}$ can be composed into a single higher-order context function $c(z) := \sum_{i=1}^{d} 2^{i-1} c_i(z)$, whose context space is $\{0, \ldots, 2^d - 1\}$. We will refer to the choice of $d$ as the context dimension.
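The following sketch implements a single half-space context function and the bit-wise composition of several of them into a higher-order context, following the description above; variable names are our own.

import numpy as np

def halfspace_context(z, v, b):
    """Return 1 if z lies in the half-space {z : v . z >= b}, else 0."""
    return int(np.dot(v, z) >= b)

def composed_context(z, V, b):
    """Compose d half-space contexts into one integer in {0, ..., 2^d - 1}.

    V: (d, dim(z)) matrix of hyperplane normals; b: (d,) offsets.
    The d binary outputs are interpreted as the bits of the context index.
    """
    bits = (V @ z >= b).astype(int)
    return int(np.dot(bits, 2 ** np.arange(len(bits))))

# A neuron with context dimension d keeps a table of 2^d weight vectors and
# selects row composed_context(z, V, b) for the current side information z.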

Bias Models.

G-GLN neurons transform an input set of Gaussians to an output Gaussian. Recall that the mean of a product of Gaussian PDFs must lie within the convex hull defined by the means of the individual input Gaussian PDFs (Section 2.1). To ensure that a G-GLN neuron can represent any mean in the standardized target range in $\mathbb{R}^d$, where $d$ is the target dimension, we therefore concatenate a number of bias inputs, i.e. constant Gaussian PDFs, to the input of each neuron. In the univariate case, we concatenate two Gaussian PDFs whose means are fixed constants placed symmetrically below and above the standardized target range (the target is standardized). This generalizes to the multivariate case by multiplying the two scalars against each $d$-dimensional standard basis vector, allowing the convex hull of the bias inputs to span the target hypercube.

4 G-GLN Architecture

Figure 1: (A) Illustration of half-space gating for a 2D context. Color represents how many half-spaces intersect at each data point. Within each region of constant color (each polytope), the gated weights of a G-GLN are constant. (B) G-GLN feed-forward architecture. Each neuron uses its active weights to predict the target density as a function of the preceding layer's outputs. (C) Illustration of the sufficient statistics (mean and standard deviation) predicted by two neurons at different G-GLN layers, visualized both for a single input (red line) and across all inputs within a fixed range (blue). Deeper neurons more accurately reconstruct the true density (orange).

We now describe how the neurons defined in the previous section are assembled to form a G-GLN (Figure 1, B). Similar to its B-GLN predecessor veness2017online ; veness2019gated , a G-GLN is a feed-forward network of data-dependent distributions. Each neuron calculates the sufficient statistics (mean and variance) of its associated PDF using its active weights, given the statistics emitted by neurons in the preceding layer.

Inputs and Side Information.

There are two types of input to neurons in the network. The first is the side information, which can be thought of as the input features, and is used to determine the weights used by each neuron via half-space gating. The second is the input to the neuron proper, which will be the PDFs output by the previous layer, or in the case of layer 0, some provided base models. To apply a G-GLN in a supervised learning setting, we need to map the sequence of input-label pairs $(x_t, y_t)$ for $t = 1, 2, \ldots$ onto a sequence of (side information, base Gaussian PDFs, label) triplets. The side information will be set to the (potentially normalized) input features $x_t$. The Gaussian PDFs for layer 0 will generally include the necessary base Gaussian PDFs to span the target range, and optionally some base prediction PDFs that capture domain-specific knowledge.

Model Description.

More formally, a G-GLN consists of $L$ layers indexed by $i \in \{1, \ldots, L\}$, with $K_i$ neurons in layer $i$. The weight space for a neuron in layer $i$ will be denoted by $\mathcal{W}_i$; the subscript is needed since the dimension of the weight space depends on the width of the preceding layer. Each neuron/distribution will be indexed by its position in the network when laid out on a grid; for example, $f_{ik}$ will refer to the family of PDFs defined by the $k$th neuron in the $i$th layer. Similarly, $c_{ik}$ will refer to the context function associated with each such neuron, with $\mu_{ik}$ and $\sigma_{ik}$ (or $\Sigma_{ik}$ in the multivariate case) referring to the sufficient statistics of each Gaussian PDF.

Heteroskedastic Regression Example.

We show an illustrative example on a popular heteroskedastic benchmark function silverman85 ; kersting2007most , whose mean and logarithm of the standard deviation both vary non-linearly with the input. Intermediate layer outputs in the G-GLN are illustrated in Figure 1(C). For each training input (red line), with target given by the intersection of the dashed red line and the yellow curve, and for each neuron: (1) a set of active weights is selected by applying the context function to the broadcast side information (in this case simply the scalar input), (2) the active weights are used to predict the target distribution as a function of the preceding predictions, and (3) the active weights are updated with respect to the loss function defined in Equation (6). Figure 1(C) also compares the predictions (blue) across all input values for two individual neurons. It is clearly evident from inspection that neurons in higher layers produce more accurate predictions of the sufficient statistics given only the preceding predictions as input.

Generating Context Functions.

We sample our context functions randomly according to the scheme first introduced in veness2017online ; veness2019gated , which is inspired by the SimHash method Charikar2002 for locality sensitive hashing. Recall that a half-space context is defined by a normal vector $v$ and an offset $b$. To sample $v$, we first generate an i.i.d. random vector of the same dimension as the side information, with each component distributed according to the unit normal $\mathcal{N}(0, 1)$, and then divide it by its $\ell_2$-norm. This scheme uniformly samples points from the surface of the unit sphere. The offset $b$ is sampled directly from a standard normal distribution.
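A minimal sketch of this sampling scheme: draw each hyperplane normal from a unit Gaussian and normalize it to the unit sphere, and draw the offset from a (possibly rescaled) standard normal. The offset_scale parameter is our own knob, matching the "additive bias" scales reported in Appendix H.

import numpy as np

def sample_halfspace(dim, rng, offset_scale=1.0):
    """Sample one half-space context (v, b) as described in Section 4.

    v is uniform on the unit sphere (a normalized Gaussian vector);
    b is drawn from a normal distribution scaled by offset_scale.
    """
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)              # uniform direction on the unit sphere
    b = offset_scale * rng.standard_normal()
    return v, b

rng = np.random.default_rng(0)
V, b = zip(*(sample_halfspace(4, rng, offset_scale=0.05) for _ in range(3)))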

To gain intuition for this procedure, consider Figure 1(A). There is a one-to-one mapping between any convex polytope formed from the intersection of the half-spaces and the collective firing pattern of all context functions in the network. Choices of side information that are close in terms of cosine similarity will map to similar sets of weights. A (local) update of the weights corresponding to a particular convex region will therefore affect neighbouring regions, but with decreasing impact in proportion to the number of overlapping half-spaces.

5 G-GLN Algorithm

1: Input: base model PDFs $(\mu_{0k}, \sigma_{0k})$ for $k = 1, \ldots, K_0$ (derived from the features)
2: Input: side information $z$, target $y$
3: Input: G-GLN weights $\{W_{ik}\}$, learning rate $\eta$
4: Output: Gaussian PDF $\mathcal{N}(\mu, \sigma^2)$
5: for $i = 1$ to $L$ do
6:     for $k = 1$ to $K_i$ do
7:         $w \leftarrow$ row $c_{ik}(z)$ of $W_{ik}$   // select active weights via gating
8:         $\sigma^2_{ik} \leftarrow \big( \sum_j w_j / \sigma^2_{i-1,j} \big)^{-1}$   // Equation 3
9:         $\mu_{ik} \leftarrow \sigma^2_{ik} \sum_j w_j \mu_{i-1,j} / \sigma^2_{i-1,j}$   // Equation 3
10:        $w \leftarrow w - \eta \nabla_w \ell(y; w)$   // gradient step (if learning)
11:        row $c_{ik}(z)$ of $W_{ik} \leftarrow \mathrm{Proj}_{\mathcal{W}}(w)$   // projection (if learning)
12:    end for
13: end for
14: return $\mu_{L,1}$, $\sigma_{L,1}$
Algorithm 1 G-GLN: inference with optional update

We now describe how inference is performed in a G-GLN. For layer 0, we assume all the base models $f_{0k}$ are given. For layers $i \geq 1$, we then have

f_{ik}(x ; z) := \mathrm{PoG}_{w}(x) \quad \text{with } w \text{ given by row } c_{ik}(z) \text{ of } W_{ik},    (9)

i.e. a weighted product (Equation 2) of the Gaussian PDFs emitted by the preceding layer, with the weights selected by the neuron's context function. Equation 9 makes it explicit that, conceptually, a G-GLN is a network of Gaussian PDFs, each of which depends on the side information via gating. Computationally, this involves a forward pass of the network to compute the relevant sufficient statistics for each neuron (using Equations 3-5). By re-expressing Equation 9 as

f_{ik}(x ; z) \;\propto\; \exp\!\Big( \sum_j w_j \log f_{i-1,j}(x ; z) \Big),

one can view each neuron as having an exponential output non-linearity and a logarithmic input non-linearity. Since these non-linearities are inverses of each other, stacking layers causes the non-linearities to cancel, so the density output by a G-GLN collapses to a linear function of the gated weights (i.e. a Gated Linear Network). The same cancellation argument applies to B-GLNs veness2017online , where the output and input non-linearities are the sigmoid and logit functions.

A distinguishing feature of a G-GLN is that every neuron directly attempts to predict the target, by locally boosting the accuracy of its input distributions. Because of this, every neuron has its own loss function defined only in terms of its own weights. Given a (potentially vector-valued) target $y_t$, and side information $z_t$ (which will typically be identified with the input features), each neuron-specific loss function will be

\ell_t^{ik}(W_{ik}) := -\log f_{ik}(y_t ; z_t).    (10)

This loss can be optimized using online gradient descent Zinkevich03 , which involves performing a step of gradient descent and projecting the weights back onto $\mathcal{W}$, via the update rule

(W_{ik})_{c_{ik}(z_t)} \leftarrow \mathrm{Proj}_{\mathcal{W}}\!\Big[ (W_{ik})_{c_{ik}(z_t)} - \eta \, \nabla \ell_t^{ik}(W_{ik}) \Big],    (11)

where $(W)_j$ refers to the $j$th row of the neuron's weight matrix $W$, $\eta$ is the learning rate and $\mathrm{Proj}_{\mathcal{W}}$ is the projection operator with respect to the Euclidean norm.

Algorithm 1 provides pseudocode for both inference and (optionally) weight adaptation for a univariate G-GLN on a given input, with the top-most neuron taken as the final Gaussian PDF. The multivariate case can be obtained by replacing lines 8-9 with Equation 4 or 5. The total time complexity of inference is the sum of the cost of the gating operations (one inner product with the side information per context function, i.e. linear in the input dimensionality per neuron) and the cost of propagating the sufficient statistics through the network (linear in the total number of active weights).
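Putting the pieces together, the sketch below implements the univariate case of Algorithm 1: gated weight selection, the product-of-Gaussians forward pass of Equation 3, and a local gradient step per neuron. Hyperparameters and the simple clipping used in place of the full projection are illustrative choices, not the paper's; bias and base experts are assumed to be included among the layer-0 inputs.

import numpy as np

class GGLNNeuron:
    def __init__(self, fan_in, side_dim, context_dim, rng):
        self.V = rng.standard_normal((context_dim, side_dim))
        self.V /= np.linalg.norm(self.V, axis=1, keepdims=True)
        self.b = rng.standard_normal(context_dim)
        # One weight row per context cell; uniform initialization (illustrative).
        self.W = np.full((2 ** context_dim, fan_in), 1.0 / fan_in)

    def context(self, z):
        bits = (self.V @ z >= self.b).astype(int)
        return int(np.dot(bits, 2 ** np.arange(len(bits))))

    def forward(self, mu, sigma2, z, y=None, lr=1e-2):
        c = self.context(z)
        w = self.W[c]
        prec = w / sigma2                       # lines 7-9 of Algorithm 1
        var = 1.0 / prec.sum()
        mean = var * (prec * mu).sum()
        if y is not None:                       # lines 10-11 (if learning)
            r = y - mean
            grad = (r * (y + mean - 2.0 * mu) - var) / (2.0 * sigma2)
            self.W[c] = np.clip(w - lr * grad, 1e-3, 10.0)  # crude stand-in for Proj_W
        return mean, var

def ggln_forward(layers, mu0, sigma2_0, z, y=None):
    """Propagate base Gaussians through a list of layers (lists of neurons)."""
    mu, sigma2 = mu0, sigma2_0
    for layer in layers:
        out = [n.forward(mu, sigma2, z, y) for n in layer]
        mu = np.array([m for m, _ in out])
        sigma2 = np.array([v for _, v in out])
    return mu[-1], sigma2[-1]                   # top-most neuron's prediction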

6 Experimental Results

Dataset                          N        D    G-GLN          VI graves2011practical   PBP hernandez2015probabilistic   DO gal2015dropout
Boston Housing                   506      13   2.84 ± 0.03    4.32 ± 0.29    3.01 ± 0.18    2.97 ± 0.19
Concrete Compression Strength    1030     8    5.84 ± 0.03    7.13 ± 0.12    5.67 ± 0.09    5.23 ± 0.12
Energy Efficiency                768      8    1.31 ± 0.01    2.65 ± 0.08    1.80 ± 0.05    1.66 ± 0.04
Kin8nm                           8192     8    0.09 ± 0.00    0.10 ± 0.00    0.10 ± 0.00    0.10 ± 0.00
Naval Propulsion                 11,934   16   0.00 ± 0.00    0.01 ± 0.00    0.01 ± 0.00    0.01 ± 0.00
Combined Cycle Power Plant       9568     4    3.90 ± 0.01    4.33 ± 0.04    4.12 ± 0.03    4.02 ± 0.04
Protein Structure                45,730   9    3.77 ± 0.01    4.84 ± 0.03    4.73 ± 0.01    4.36 ± 0.01
Wine Quality Red                 1599     11   0.57 ± 0.00    0.65 ± 0.01    0.64 ± 0.01    0.62 ± 0.01
Yacht Hydrodynamics              308      6    3.76 ± 0.04    6.89 ± 0.67    1.01 ± 0.05    1.11 ± 0.09
Table 1: Test RMSE and standard errors for G-GLN versus three previously published methods on a standard suite of UCI regression benchmarks, each comprising N instances of D features. Models are trained for 40 epochs and results are summarized over 20 random seeds (5 for Protein).

We applied G-GLNs to univariate regression, multivariate regression, contextual bandits with real valued rewards, denoising and image infilling. Model, experimental and implementation details common across our test domains are discussed below.

Training Setup.

Weights for all neurons in a given layer are initialized to a uniform constant determined by the number of neurons in the previous layer. Note that due to the convexity of the loss, the choice of initial weights plays a less prominent role in overall performance than in typical deep learning applications. The only source of non-determinism in the model is the choice of context functions; to address this, all of our results are reported by averaging over multiple random seeds. For regression experiments, multiple epochs of training are used. Training data is randomly shuffled at the beginning of each epoch, and each example is seen exactly once within an epoch.

Bias and Base Predictions.

Constant bias inputs for each neuron were set to span the target range, as described in Section 3. Given $d$-dimensional input of the form $x = (x_1, \ldots, x_d)$, we adopted the convention of adding $d$ Gaussian PDF base predictions to layer 0. The mean and variance of the $i$th expert was calculated online either by centering the expert at the single input feature $x_i$ with a fixed width, or from an analytic formula that applies Bayesian Linear Regression (BLR) to learn a mapping of $x_i$ to an approximation of the target distribution (see Appendix C).

Output Aggregation and Weight Projection.

As each neuron in a G-GLN models the target distribution, any choice of neuron to be the output provides an estimate of the target density; we either take the output of the top-most neuron or use the switching aggregation method introduced in veness2017online for B-GLNs which uses Bayesian tracking VenessNHB12 to estimate the best performing neuron on recent data. See Appendix D for details of switching aggregation.

We explored multiple methods for implementing weight projection efficiently, and obtained the best performance on our regression benchmarks with an approximate solution based on the log-barrier method Boyd04 . This method essentially amounts to adding an extra regularization term to the loss, which has a negligible effect on the cost of inference; see Appendix E for implementation details.

6.1 UCI Regression

We begin by evaluating the ability of a G-GLN to solve a benchmark suite of univariate UCI regression tasks. We adopt the same datasets and training setup described in gal2015dropout , and compare G-GLN performance to the previously published results for 3 MLP-based probabilistic methods: variational inference (VI) graves2011practical , probabilistic backpropagation (PBP) hernandez2015probabilistic and the interpretation of dropout (DO) as Bayesian approximation described in gal2015dropout . Our results are presented in Table 1. It is evident that G-GLN achieves competitive performance, outperforming VI, PBP and DO on 7 out of 9 regression tasks. See Appendix H.1 for full details.

(Left) Test MSE on SARCOS
Algorithm                  MSE
G-GLN                      0.19
Random forest              2.39
MLP                        2.13
Stochastic decision tree   2.11
Gradient boosted tree      1.44
TabNet-S                   1.25
Adaptive neural tree       1.23
TabNet-M                   0.28
TabNet-L                   0.14

(Right) Contextual bandits: per-task ranks
Algorithm       financial   jester   wheel   mean rank
G-GLN           3           1        2       2
BBAlphaDiv      10          9        10      9.67
constSGD        9           8        6       7.67
ParamNoise      7           10       4       7
BBB             8           5        6       6.33
NeuralGreedy    5           4        9       6
BootRMS         4           2        8       4.67
Dropout         6           3        5       4.67
NeuralLinear    2           7        3       4
LinFullPost     1           6        1       2.67
Table 2: (Left) Test MSE for G-GLN versus previously published methods on the SARCOS inverse dynamics dataset vijayakumar2000locally . G-GLNs are trained for 1200 epochs using the same test procedure as arik2019tabnet . (Right) Performance of a G-GLN based GLCB algorithm for the continuous contextual bandits tasks and competitors described in sezener2020online ; riquelme2018deep . Ranks are computed by running each algorithm on 500 randomly sampled environments. Raw scores are provided in Table 3 of the Appendix.

6.2 Inverse Dynamics

Next we demonstrate G-GLNs on regression tasks where both the inputs and targets are multi-dimensional. We consider the SARCOS dataset for a 7 degree-of-freedom robotic arm vijayakumar2000locally : using a 21-dimensional feature vector (7 joint positions, velocities and accelerations) to predict the 7 joint torques. We compare our performance to the state-of-the-art TabNet model arik2019tabnet and the same suite of standard regression algorithms considered by the TabNet authors. See Appendix H.2 for details.

Table 2 (left) shows that G-GLN outperforms contemporary methods, as well as small-to-medium sized TabNets. TabNet is a complex system of neural networks optimized for tabular data, exploiting residual transformer blocks for sequential attention. It is likely that a similar system could exploit G-GLNs as components for improved performance, but doing so is beyond the scope of this paper.

6.3 Online Contextual Bandits

The authors of sezener2020online proposed an algorithm, Gated Linear Context Bandits (GLCB), by which B-GLNs can be applied to contextual bandit tasks with binary rewards. GLCB provides a UCB-like auer2002ucb rule that exploits GLN half-space activations as a "pseudo-count", shown to be effective for exploration (full details in Appendix G.1). Our G-GLN provides a natural way to extend GLCB to continuous rewards. Table 2 (right) compares a G-GLN based GLCB algorithm (see Appendix H.3 for details) against 9 popular Bayesian deep learning methods riquelme2018deep on three bandit tasks derived from UCI regression datasets, a standard benchmark in the previous literature. G-GLN obtains the best mean rank across these tasks. Similar to sezener2020online , our results are obtained in a fully online regime: each data point is considered once without storage, whereas all other methods were able to i.i.d. resample from prior experience to learn an effective representation.

6.4 Application to Denoising Density Estimation

Figure 2: Denoising multi-dimensional data with G-GLNs. (A) G-GLNs (top row) and MLPs (bottom two rows) are trained on 1-step denoising of a Swiss Roll density under additive Gaussian noise (BS = batch size, LR = learning rate). Starting with a grid, the original Swiss Roll data manifold is reconstructed with multi-step denoising. Larger version in Appendix G.2. (B) Sampling via HMC using the gradient field inferred by denoising. Shown are samples from the G-GLN inferred gradient (green), the MLP inferred gradient (orange) and the original data manifold (blue). (C) Infilling of MNIST train images (left) or unseen test images (right) is shown for binary occlusion masks, after training a G-GLN for only one epoch over the dataset with batch size 1 to remove additive Gaussian noise from each train image. Orig: original image. Mask: masked image. Fill: filled image. More examples in Appendix G.3.

One application of high-dimensional regression is to the problem of density estimation via denoising vincent2011connection ; bigdeli2020learning ; sohl2015deep ; saremi2019neural , which gives the ability to sample any conditional distribution from a learnt gradient of the log-joint data distribution. We use G-GLNs to approximate this score function hyvarinen2005estimation by using a G-GLN multivariate regression model as a denoising autoencoder vincent2011connection ; bigdeli2020learning ; sohl2015deep ; saremi2019neural . We train the G-GLN by adding isotropic Gaussian noise with covariance $\sigma^2 I$ to each data point and regressing to the un-noised point. At convergence, the difference between the denoised prediction and its input, scaled by $1/\sigma^2$, approximates the score function alain2014regularized , which we can feed into Hamiltonian Monte Carlo (HMC) neal2011mcmc to approximately sample from the distribution implied by the score field. See Appendix F for details.

From Figure 2(A) it is evident that G-GLNs can learn reasonable approximate gradient fields for 2D distributions from just a single online pass over 500-5000 samples. Starting from a grid, multi-step denoising can then be applied to reconstruct the original data manifold. MLPs trained with the same data required a larger batch size and many more samples to accurately approximate the data density. This is evident in Figure 2(B), which shows the result of HMC sampling neal2011mcmc using the G-GLN versus MLP estimated gradient fields. Figure 2(C) demonstrates that the same process can be extended to much higher-dimensional problems, e.g. MNIST density modelling: iterative G-GLN denoising can be leveraged to fill in occluded regions of MNIST train or unseen test images after a single online pass through the train set, in which the model is trained to remove small additive Gaussian noise patterns from each image. This suggests an exciting avenue for future work applying G-GLNs as data-efficient pattern-completion memories.

7 Conclusion

We have introduced a new backpropagation-free deep learning algorithm for multivariate regression that leverages local convex optimization and data-dependent gating to model highly non-linear and heteroskedastic functions. We demonstrate competitive or state-of-the-art performance on a comprehensive suite of established benchmarks. The simplicity and data efficiency of the G-GLN approach, coupled with its strong performance in high-dimensional multivariate settings, makes us optimistic about future extensions to a broad range of applications.

8 Acknowledgements

We thank Agnieszka Grabska-Barwinska, Chris Mattern, Jianan Wang, Pedro Ortega and Marcus Hutter for helpful discussions.

References

Appendix A Weighted Products of Gaussians

A well-known result is that a product of Gaussian PDFs collapses to a scaled Gaussian PDF (e.g. [24]). In particular, given two Gaussian experts $\mathcal{N}(\mu_1, \sigma_1)$ and $\mathcal{N}(\mu_2, \sigma_2)$, if we define

\sigma^2 := \Big( \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} \Big)^{-1}, \qquad \mu := \sigma^2 \Big( \frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2} \Big),    (12)

and let $f$ denote the PDF associated with $\mathcal{N}(\mu, \sigma)$, then we have that $f_1(x) f_2(x) \propto f(x)$. In the case of unit weights, this implies that $\mathrm{PoG}_w(x) = f(x)$, as the constant of proportionality (not a function of $x$) is cancelled out by the division by $Z$ in Equation 2, and we are left with the integral of a PDF, equal to one, in the denominator. Now consider a Gaussian PDF raised to a power $w_i \in \mathbb{R}_{\geq 0}$, i.e.

f_i(x)^{w_i} \;\propto\; \exp\!\Big( -\frac{w_i (x - \mu_i)^2}{2 \sigma_i^2} \Big),

which corresponds to an unnormalized Gaussian PDF with mean $\mu_i$ and variance $\sigma_i^2 / w_i$. Thus we can replace each term $f_i(x)^{w_i}$ in Equation 2 with the PDF associated with $\mathcal{N}(\mu_i, \sigma_i / \sqrt{w_i})$. Combining the above techniques for products and powers allows us to exactly interpret the weighted product of experts as another Gaussian expert $\mathcal{N}(\mu_{\mathrm{PoG}}(w), \sigma_{\mathrm{PoG}}(w))$ where

\sigma^2_{\mathrm{PoG}}(w) = \Big( \sum_{i=1}^{m} \frac{w_i}{\sigma_i^2} \Big)^{-1}, \qquad \mu_{\mathrm{PoG}}(w) = \sigma^2_{\mathrm{PoG}}(w) \sum_{i=1}^{m} \frac{w_i \mu_i}{\sigma_i^2}.    (13)

Appendix B Properties of the G-GLN Loss

Gradient.

First define $A(w) := \sum_{i=1}^m w_i / \sigma_i^2$ and $B(w) := \sum_{i=1}^m w_i \mu_i / \sigma_i^2$, which due to the non-negativity of $w$ implies $A(w) \geq 0$. Hence $\sigma^2_{\mathrm{PoG}}(w) = 1/A(w)$ and $\mu_{\mathrm{PoG}}(w) = B(w)/A(w)$. Using this notation, we can reformulate Equation 6 as

\ell(x; w) \;\equiv\; -\tfrac{1}{2} \log A(w) + \tfrac{1}{2} A(w) \Big( x - \frac{B(w)}{A(w)} \Big)^2.    (14)

The partial derivatives can be obtained by direct calculation; writing $\mu := \mu_{\mathrm{PoG}}(w)$ and $\sigma^2 := \sigma^2_{\mathrm{PoG}}(w)$, we have

\frac{\partial \ell}{\partial w_i} = \frac{1}{2 \sigma_i^2} \Big[ (x - \mu)^2 - 2 (x - \mu)(\mu_i - \mu) - \sigma^2 \Big].

Stacking these partial derivatives gives the gradient $\nabla_w \ell(x; w)$.

Convexity.

Here we prove that $\ell$ is a convex function of $w$ by showing that the Hessian of Equation 14 is positive semi-definite (PSD). Differentiating the expression for $\partial \ell / \partial w_i$ once more, and letting $\tau := (1/\sigma_1^2, \ldots, 1/\sigma_m^2)$ and $v := \big( \tau_1 (\mu_1 - \mu), \ldots, \tau_m (\mu_m - \mu) \big)$, the Hessian of Equation 14 can be written as

\nabla^2_w \ell(x; w) = \frac{1}{2 A(w)^2} \, \tau \tau^{\top} + \frac{1}{A(w)} \, v v^{\top}.    (15)

The first additive term is PSD, since it is a non-negatively scaled outer product: for any vector $u$ we have $u^{\top} \tau \tau^{\top} u = (\tau^{\top} u)^2 \geq 0$. The second term is PSD for the same reason, as $A(w) \geq 0$. Hence, since the Hessian is the sum of two PSD matrices, it is PSD, which implies that $\ell$ is a convex function of $w$.

Appendix C Learning the Base Model

Every neuron in a G-GLN takes one or more Gaussian PDFs as input and produces a Gaussian PDF as output. This raises the question of what input to provide to neurons in the first layer, i.e. the base prediction. We consider three solutions. (1) None. The input sufficient statistics to each neuron are already concatenated with so-called "bias" Gaussians to ensure that the target mean falls within the convex hull defined by the input means (described in Section 3). (2) A Gaussian PDF for each component of the input vector, with constant mean and standard deviation. It is perhaps surprising that the neuron inputs are not required to be a function of the input features themselves, but this is permissible because the feature vector is z-score normalized and broadcast to every neuron as side information.

We present a third option (3), whereby the base prediction is provided by a probabilistic base model trained to directly predict the target using only a single feature dimension. The formulation of this Bayesian Linear-Gaussian Regression (BLR) model is described below. Empirically we find that it leads to improved data efficiency in the first epoch of training (see examples in Figure 3), at only a constant additional time and space cost per feature dimension.

Consider a dataset of zero-centered univariate features $x_1, \ldots, x_n$ and corresponding targets $y_1, \ldots, y_n$. We assume a Normal-linear relationship between a feature $x$ and target $y$,

y \sim \mathcal{N}(a x + b, \, \beta^{-1}),

where $a$ and $b$ are some coefficients, and $\beta$ is the precision (inverse variance). We assume $\beta$ is known, but it can also be optimized via (type II) maximum likelihood estimation. We also assume an isotropic Normal prior over $a$ and $b$, i.e. $a \sim \mathcal{N}(0, \alpha^{-1})$ and $b \sim \mathcal{N}(0, \alpha^{-1})$, where $\alpha$ is the prior precision.

By adapting widely known equations (e.g. Equations 3.53-3.54 in [46]) we can obtain the posterior for $a$ as

a \mid x_{1:n}, y_{1:n} \;\sim\; \mathcal{N}\!\Big( \frac{\beta \sum_i x_i y_i}{\alpha + \beta \sum_i x_i^2}, \; \big( \alpha + \beta \textstyle\sum_i x_i^2 \big)^{-1} \Big).

Similarly, we obtain the posterior for $b$ as

b \mid y_{1:n} \;\sim\; \mathcal{N}\!\Big( \frac{\beta \sum_i y_i}{\alpha + \beta n}, \; (\alpha + \beta n)^{-1} \Big).

Putting these two together, we can obtain the posterior predictive distribution for a new feature value $x_*$,

y_* \mid x_*, x_{1:n}, y_{1:n} \;\sim\; \mathcal{N}\big( m_a x_* + m_b, \; \beta^{-1} + s_a^2 x_*^2 + s_b^2 \big),

where $(m_a, s_a^2)$ and $(m_b, s_b^2)$ denote the posterior means and variances of $a$ and $b$ above.

It is apparent that updates and inference can be performed incrementally in constant time and space by storing and updating the sufficient statistics $\sum_i x_i^2$, $\sum_i x_i y_i$, $\sum_i y_i$ and $n$.

We can use this BLR formulation to convert the input features into probability densities. Specifically, for each feature, we independently maintain posterior/sufficient statistics and use the posterior predictive distributions as inputs to the base layer of the G-GLN.
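A minimal sketch of this per-feature base model under the assumptions above (zero-centered feature, known noise precision beta, isotropic prior precision alpha); the incremental sufficient statistics and the Gaussian posterior predictive follow the standard conjugate formulas referenced in the text. The class name and defaults are ours.

import math

class BLRBaseModel:
    """Incremental 1D Bayesian linear-Gaussian regression: y ~ N(a*x + b, 1/beta)."""

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha, self.beta = alpha, beta      # prior and noise precisions
        self.n = 0
        self.sum_xx = 0.0                        # sum of x_i^2
        self.sum_xy = 0.0                        # sum of x_i * y_i
        self.sum_y = 0.0                         # sum of y_i

    def update(self, x, y):
        self.n += 1
        self.sum_xx += x * x
        self.sum_xy += x * y
        self.sum_y += y

    def predict(self, x):
        """Posterior predictive mean and standard deviation at feature value x."""
        prec_a = self.alpha + self.beta * self.sum_xx
        prec_b = self.alpha + self.beta * self.n
        m_a = self.beta * self.sum_xy / prec_a   # posterior mean of slope a
        m_b = self.beta * self.sum_y / prec_b    # posterior mean of intercept b
        var = 1.0 / self.beta + x * x / prec_a + 1.0 / prec_b
        return m_a * x + m_b, math.sqrt(var)

Each feature dimension gets its own BLRBaseModel; its predictive mean and standard deviation form one of the layer-0 Gaussian inputs to the network.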

Figure 3: Effect of using Bayesian linear regression (BLR) versus a constant base model on predictive RMSE for four UCI regression tasks. Results are shown for the first epoch of training.

Appendix D Switching Aggregation

Because every neuron in a G-GLN directly models the target distribution, there is no one natural definition of the network output. One convention is simply to have a final layer consisting of a single neuron, and take the output of that neuron as the network output. An alternative method of switching aggregation was used in [2, 47], whereby an incremental online update rule was used to weight the contributions of individual neurons in the network to an overall estimate of the target density.

We extend the switching aggregation procedure from the Bernoulli to Gaussian case by replacing a Bernoulli target probability value with a Gaussian probability density value evaluated at the target. The switching algorithm of [47] was originally presented in terms of log-marginal probabilities, which can cause numerical difficulties at implementation time. Instead we use an equivalent formulation derived from [2] that incrementally maintains a weight vector that is used to compute a convex combination of model predictions, i.e. the densities given by each neuron in the network, at each time step.

Using notation similar to [2], let $K$ denote the number of neurons, and $w_k(t)$ denote the weight associated with model $k$ at time $t$. The density output by the $k$th neuron at time $t$, evaluated on the target $y_t$, will be denoted by $\rho_k(t)$. At each time step $t$, switching aggregation outputs the density

\rho(t) := \sum_{k=1}^{K} w_k(t) \, \rho_k(t),

with the weights initialized uniformly, $w_k(1) = 1/K$, and updated after each step using the incremental switching recursion of [2], which mixes each weight's likelihood-proportional (Bayesian) update with a small probability of switching to a different neuron. This can be implemented in linear time with respect to the number of neurons. Notice that mathematically the weights satisfy the invariant $\sum_k w_k(t) = 1$ for all times $t$, which should be explicitly enforced after each update to avoid numerical issues in any practical implementation.
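The exact recursion is given in [2, 47]; as a rough illustration only, the sketch below implements a generic fixed-share style update in which each weight is scaled by its neuron's density on the latest target and a small switching mass is redistributed uniformly. The switch-rate schedule here is an assumption, not the paper's.

import numpy as np

def switching_update(w, rho, t):
    """Fixed-share style aggregation weight update (illustrative only).

    w:   current weights over K neurons, summing to 1.
    rho: density each neuron assigned to the latest target.
    t:   1-based time step, used for a decaying switch rate (assumed schedule).
    """
    alpha = 1.0 / (t + 1)                 # assumed switch-rate schedule
    posterior = w * rho
    posterior /= posterior.sum()          # likelihood-proportional (Bayesian) step
    K = len(w)
    w_new = (1.0 - alpha) * posterior + alpha / K
    return w_new / w_new.sum()            # enforce the sum-to-one invariant

# Aggregated prediction at each step: density = sum_k w[k] * rho_k(y).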

Appendix E Weight Projection

Weight projection after an update (Line 11 in Algorithm 1) enforces three sets of constraints: each weight must lie in a bounded non-negative interval, each mixed mean must lie in a bounded interval, and each mixed variance must lie in a bounded positive interval. These constraints ensure that the online convex optimization is well-behaved by forming a convex feasible set, and also prevent numerical issues that arise from likelihoods being rounded to zero. We outline two ways in which these constraints can be implemented below.

The constraints can be represented in terms of linear inequalities $A w \leq c$, where $w$ is the weight vector of a neuron given side information $z$. Assume $w$ violates some of the constraints; we would therefore like to project $w$ onto our feasible set. Let $\tilde{A}$ and $\tilde{c}$ be the matrix/vector composed of the rows/elements of $A$ and $c$ respectively for which the inequality is violated, thus $\tilde{A} w > \tilde{c}$. Then we can write down the projection problem as $\min_{w'} \| w' - w \|^2$ s.t. $\tilde{A} w' = \tilde{c}$, the solution of which is $w' = w - \tilde{A}^{+} (\tilde{A} w - \tilde{c})$, where $\tilde{A}^{+}$ is the pseudo-inverse of $\tilde{A}$. This pseudo-inverse can be computed efficiently, because all but (at most) two rows of $\tilde{A}$ are "one-hot".

The exact projection approach relies on dynamically shaped $\tilde{A}$ and $\tilde{c}$, support for which is limited in contemporary differentiable programming libraries such as TensorFlow [48] and JAX [49]. Therefore, we take an alternative approach and enforce the inequalities using logarithmic barrier functions (log-barriers) that augment the original loss function by penalizing weights that approach the constraints. Let $a_i$ and $c_i$ be the $i$th row and element of $A$ and $c$ respectively. For the constraint $a_i \cdot w < c_i$, we can define a barrier function

\psi_i(w) := -\log(c_i - a_i \cdot w).

Note that we are now dealing with strict inequalities rather than non-strict ones for convenience. We can then augment the loss function from Equation 6, incorporating the barriers,

\tilde{\ell}(x; w) := \ell(x; w) + \lambda \sum_i \psi_i(w),    (16)

where the sum ranges over all constraints and $\lambda > 0$ is the barrier constant. Note that $\tilde{\ell}$ is convex in $w$ as each $\psi_i$ is convex.

The weight updates can then be carried out via $w \leftarrow w - \eta \nabla_w \tilde{\ell}(x; w)$. For a sufficiently small learning rate and a sufficiently large barrier constant, we will not need the projection step in Line 11 of Algorithm 1, as the constraints are incorporated into the loss function. However, in practice, we need backstops in case weights pass through the barriers due to large gradient steps. We implement the backstops by first hard-clipping each weight to its permitted interval and then enforcing the remaining mixed-variance constraint, which corresponds to performing a single linear projection if the inequality is violated.
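A minimal sketch of this log-barrier approach for generic linear constraints a·w < c, with the backstop clipping and single linear projection applied after the gradient step; the barrier constant, learning rate and bounds are placeholders rather than the paper's settings.

import numpy as np

def barrier(w, a, c):
    """Log-barrier term -log(c - a.w) for the strict constraint a.w < c."""
    slack = c - np.dot(a, w)
    return np.inf if slack <= 0 else -np.log(slack)

def barrier_grad(w, a, c):
    """Gradient of the barrier term with respect to w."""
    return a / (c - np.dot(a, w))

def barrier_step(w, loss_grad, constraints, lr=0.01, lam=0.1, w_max=1000.0):
    """One gradient step on the barrier-augmented loss (Equation 16) plus backstops."""
    g = loss_grad(w) + lam * sum(barrier_grad(w, a, c) for a, c in constraints)
    w = w - lr * g
    w = np.clip(w, 0.0, w_max)            # backstop: hard-clip each weight
    for a, c in constraints:              # backstop: single linear projection if violated
        viol = np.dot(a, w) - c
        if viol > 0:
            w = w - viol * a / np.dot(a, a)
    return w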

Appendix F Denoising Density Estimation

With $q(\cdot \mid \tilde{x})$ denoting a Gaussian likelihood function (as parameterized by a G-GLN) and $p(x)$ an unknown data-generating distribution, suppose we add isotropic Gaussian noise of variance $\sigma^2$ to sampled data points and then denoise them back to the original samples. The expected loss is

\mathbb{E}_{x \sim p} \, \mathbb{E}_{\tilde{x} \sim \mathcal{N}(x, \sigma^2 I)} \big[ -\log q(x \mid \tilde{x}) \big].

Taking the variational derivative of this expected loss with respect to our G-GLN demonstrates the relationship between the optimal predicted mean $g^*(\tilde{x})$ and the gradient of the log data density:

g^*(\tilde{x}) = \tilde{x} + \sigma^2 \nabla_{\tilde{x}} \log p(\tilde{x}) + o(\sigma^2)    (17)

in the limit $\sigma \to 0$. Therefore, we can approximate the gradient field as $\nabla_x \log p(x) \approx (g(x) - x)/\sigma^2$, which we use in the main text. Hamiltonian Monte Carlo sampling then takes this gradient estimate as input for $\nabla_x \log p(x)$. Denoising iteratively applies the G-GLN, trained on denoising, to an arbitrary starting point: $x_1 = g(x_0)$, $x_2 = g(x_1)$, and so on.
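The relationship in Equation 17 suggests the following sketch: given any trained denoiser g (an arbitrary callable standing in for the G-GLN's predicted mean), approximate the score as (g(x) - x)/sigma^2 and use it either for iterative denoising or inside a leapfrog integrator. We show unadjusted leapfrog steps as a simplified stand-in for the full HMC procedure; step sizes and counts are illustrative.

import numpy as np

def score(x, denoiser, sigma2):
    """Approximate grad log p(x) via Equation 17: (g(x) - x) / sigma^2."""
    return (denoiser(x) - x) / sigma2

def iterative_denoise(x0, denoiser, steps=24):
    """Repeatedly apply the denoiser, as in the Swiss Roll reconstructions."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = denoiser(x)
    return x

def leapfrog_sample(x0, denoiser, sigma2, n_steps=1000, n_sub=150,
                    step=1e-3, mass=1.0, rng=None):
    """Unadjusted HMC-style sampling driven by the estimated score field."""
    rng = rng or np.random.default_rng()
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        p = rng.standard_normal(x.shape) * np.sqrt(mass)   # resample momentum
        p = p + 0.5 * step * score(x, denoiser, sigma2)    # initial half-step
        for _ in range(n_sub):
            x = x + step * p / mass
            p = p + step * score(x, denoiser, sigma2)
        p = p - 0.5 * step * score(x, denoiser, sigma2)    # correct final half-step
    return x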

Appendix G Additional Results

G.1 Contextual Bandits

                 Binary targets                                       Continuous targets
Algorithm        adult      census      covertype   statlog          financial   jester     wheel
G-GLN            -          -           -           -                3018 ± 3    3301 ± 4   4386 ± 11
B-GLN*           678 ± 5    2718 ± 3    2715 ± 12   4863 ± 1         3038 ± 3    3298 ± 3   4432 ± 11
BBAlphaDiv       18 ± 2     932 ± 12    1838 ± 9    2731 ± 15        1860 ± 1    3112 ± 4   1776 ± 11
BBB              399 ± 8    2258 ± 12   2983 ± 11   4576 ± 10        2172 ± 18   3199 ± 4   2265 ± 44
BootRMS          676 ± 3    2693 ± 3    3002 ± 7    4583 ± 11        2898 ± 4    3269 ± 4   1933 ± 44
Dropout          652 ± 5    2644 ± 8    2899 ± 7    4403 ± 15        2769 ± 4    3268 ± 4   2383 ± 48
LinFullPost      463 ± 2    1898 ± 2    2821 ± 6    4457 ± 2         3122 ± 1    3193 ± 4   4491 ± 15
NeuralGreedy     598 ± 5    2604 ± 14   2923 ± 8    4392 ± 17        2857 ± 5    3266 ± 8   1863 ± 44
NeuralLinear     391 ± 2    2418 ± 2    2791 ± 6    4762 ± 2         3059 ± 2    3169 ± 4   4285 ± 18
ParamNoise       273 ± 3    2284 ± 5    2493 ± 5    4098 ± 10        2224 ± 2    3084 ± 4   3443 ± 20
constSGD         107 ± 3    1399 ± 22   1991 ± 9    3896 ± 18        1862 ± 1    3136 ± 4   2265 ± 31
Table 3: Performance of the GLN-based GLCB algorithms for the contextual bandits tasks and competitors described in [1, 37]. G-GLCB uses a single G-GLN instead of a CTree of 7 equivalent-sized B-GLNs (marked * above), the method described in [1], to model continuous-valued rewards. Results are mean and standard error of cumulative reward over 500 random environment seeds.

In [1] the authors present a B-GLN based algorithm, GLCB, that achieves state-of-the-art results across a suite of contextual bandits tasks with both binary and real-valued rewards. The former uses the B-GLN formulation directly. For the latter, the authors present an algorithm called CTree for tree-based discretization, i.e. using B-GLNs arranged within a binary tree structure to model the target distribution over bins. In both cases, GLCB leveraged properties of GLN half-space gating to derive a UCB-like [38] rule based on "pseudo-counts" (inspired by [50]) to help guide exploration. At each timestep, the GLCB policy [1] greedily maximizes a linear combination of the expected action reward as predicted by a GLN and an exploration bonus that decreases with a pseudo-count capturing how similar the current context-action pair is to previously seen data. This term is computed at no additional cost by utilizing the gating functions of GLN neurons.
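A rough sketch of the resulting action-selection rule follows; the exact exploration bonus and pseudo-count aggregation are defined in [1], and the 1/sqrt(count) bonus form used here is a generic UCB-style assumption for illustration.

import numpy as np

def glcb_select(context, actions, predict_mean, pseudocount, bonus_scale=1.0):
    """Greedy GLCB-style choice: predicted reward plus a pseudo-count bonus.

    predict_mean(context, a): G-GLN predicted mean reward for action a.
    pseudocount(context, a):  similarity-based pseudo-count from gating activity.
    """
    scores = [predict_mean(context, a)
              + bonus_scale / np.sqrt(max(pseudocount(context, a), 1e-8))
              for a in actions]
    return actions[int(np.argmax(scores))]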

Table 3 expands on the results in Section 6.3 to demonstrate the performance of GLNs for both binary and continuous-valued rewards. It is evident that GLNs achieve state-of-the-art performance in both regimes. Moreover, the natural G-GLN formulation described in this paper is able to match the previous performance of a CTree of B-GLNs with just a single equivalent-sized network (an order-of-magnitude reduction in memory and computation cost).

G.2 2D Denoising

Figure 4 shows 24 steps of denoising starting from a grid for the Swiss Roll gradient fields. At larger batch sizes and lower learning rates, and with more denoising steps (lower right panel), the MLP control begins to approximate the Swiss Roll data manifold.

Figure 4: G-GLNs (top set of rows) and MLPs (bottom two sets of rows) are trained on 1-step denoising of added Gaussian noise using data points sampled from a Swiss Roll. Subsequently, iterative multi-step denoising starting from a grid reconstructs an approximation of the original Swiss Roll data manifold. BS denotes batch size, LR denotes learning rate. The initial grid followed by 24 steps of denoising is shown left to right and top to bottom.

G.3 MNIST Infilling

Figure 5 shows the result of 3000 steps of denoising of MNIST train and test digits, after training for 1 epoch at batch size 1. This shows that the network, which has been trained on denoising small additive Gaussian noise perturbations to train set digits, is able to denoise unseen binary mask perturbations on unseen test set digits. This occurs over many iterative steps of denoising, much as the grid in Figure 4 is iteratively denoised to the Swiss Roll data manifold.

Figure 5: Further MNIST infilling examples. The G-GLN was trained for 1 epoch at batch size 1 by denoising a small additive Gaussian noise pattern from each train image. Subsequently, it can remove unseen binary occlusion masks from either train images (left) or unseen test images (right). Orig: original image. Mask: masked image. Fill: filled image. Examples were chosen randomly.

Appendix H Experimental Details

H.1 UCI regression details

Each G-GLN was trained with batch size 1 for 40 epochs of a randomly selected 90% split of the dataset (except DO which was trained for 400). The predictive RMSE is evaluated for the remaining 10%, with the mean and standard error reported across 20 different splits (5 for Protein Structure). Similarly to [31], we normalize the input features and targets to have zero mean and unit variance during training. Target normalization is removed for evaluation.

For each UCI dataset we train a G-GLN with 12 layers of 256 neurons. Context functions are sampled as described in Section 4 with an additive bias of 0.05. The switching aggregation scheme was used to generate the output distribution. In [31] the authors specify that 30 configurations of learning rate, momentum and weight decay parameters are tuned for each task for VI, BP and PBP. We likewise search 12 configurations of learning rate and context dimension for each task and present the best result.

H.2 SARCOS details

The G-GLN was trained for 1200 epochs using the SARCOS test and train splits defined in [35]. Inputs were normalized to have zero mean and unit variance during training, with the targets component-wise linearly rescaled to a fixed range. Fixed bias Gaussians with variance 5 were placed with means spanning each of the 7 output coordinate axes. The network base model uses Gaussians with standard deviation 1 centered on each component of the input vector.

The G-GLN was trained with 4 layers of 50 neurons and context dimension 14. Context functions are sampled as described in Section 4 with an additive bias of 0.05. The switching aggregation scheme was used to generate the output distribution. We implement weight projection with log-barriers as outlined in Appendix E. We place barriers enforcing the weights and the mixed variances to lie within bounded intervals. The log-barrier term is multiplied by a constant of 0.1 and added to the log-loss. A higher learning rate of 100 was necessary due to the log-barriers.

H.3 Contextual bandits details

We adopt the experimental configuration described in [1], including input and target scaling and the method of hyperparameter selection. Performance was evaluated across 500 seeds per dataset. The G-GLN was trained with context dimension 1 and a learning rate of 0.003. A single output layer with a single neuron was used to generate the output distribution. Context functions are sampled as described in Section 4 with an additive bias of 0.05. For the GLCB algorithm a UCB exploration bonus of 1 was chosen with mean-based pseudo-count aggregation.

H.4 Denoising details

The MLP control for Swiss Roll denoising was a ReLU network with hidden layer sizes 64 and 32 and output size 3 (a 2D mean and a 1D scale parameter). Both models were trained with a Gaussian log-likelihood loss. The MLP was evaluated at two different learning rates for comparison (see Figure 2). For Hamiltonian Monte Carlo (HMC) sampling, 15000 HMC steps were performed, with each step consisting of 150 sub-steps. No acceptance criterion was used. Particle mass was 1.

For the MNIST image denoising, the G-GLN was trained with 6 layers of 50 neurons, a context dimension of 10 and a learning rate of 0.05. The network base model uses Gaussians with variance 0.3 centered on each component of the input vector. A single output layer with a single neuron was used to generate the output distribution.

For MNIST denoising, context functions are sampled as described in Section 4 with a normally distributed additive bias of scale 0.05, while for Swiss Roll denoising in 2D, the additive bias scale was 0.5 to ensure proper tiling of the low-dimensional input space with hyperplane regions.

The G-GLN was trained in a single pass through all train points with batch size 1, with data represented as flat 784-dimensional vectors. The model was trained to remove a single additive Gaussian noise pattern from each train image, and was then tested on MNIST infilling using an independent test set of images occluded by unseen, randomly positioned binary masks. To estimate a gradient direction for infilling, a single step of the trained denoising procedure was performed on each successive image, then a step of length 0.002 was taken interpolating between the image and the denoised prediction, after which pixels outside the masked region were projected back to their original values. This was repeated iteratively up to 3000 times.
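That procedure corresponds to the following sketch, where the denoiser call stands in for the trained G-GLN and the step length and iteration count are taken from the description above.

import numpy as np

def infill(image, mask, denoiser, step=0.002, n_iters=3000):
    """Iterative infilling of masked pixels via repeated denoising steps.

    image:    flat observed image, with arbitrary values inside the masked region.
    mask:     boolean array, True where pixels are occluded (to be filled in).
    denoiser: trained denoising model mapping images to images.
    """
    x = np.array(image, dtype=float)
    observed = np.array(image, dtype=float)
    for _ in range(n_iters):
        denoised = denoiser(x)
        x = x + step * (denoised - x)          # small step toward the denoised prediction
        x[~mask] = observed[~mask]             # project unmasked pixels back to their values
    return x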

For both Swiss Roll and MNIST denoising, target data was component-wise linearly scaled to a fixed range. For MNIST, we first added Gaussian noise of standard deviation 75 to the first 10k train points to define an appropriate range for the linear scaler. All weights were kept positive by clipping to a maximum of 1000. A minimum value was enforced by clipping during inference but not when updating. Log-barriers were not used.