Learning to Adapt by Minimizing Discrepancy

11/30/2017 ∙ by Alexander G. Ororbia II, et al.

We explore whether useful temporal neural generative models can be learned from sequential data without back-propagation through time. We investigate the viability of a more neurocognitively-grounded approach in the context of unsupervised generative modeling of sequences. Specifically, we build on the concept of predictive coding, which has gained influence in cognitive science, in a neural framework. To do so we develop a novel architecture, the Temporal Neural Coding Network, and its learning algorithm, Discrepancy Reduction. The underlying directed generative model is fully recurrent, meaning that it employs structural feedback connections and temporal feedback connections, yielding information propagation cycles that create local learning signals. This facilitates a unified bottom-up and top-down approach for information transfer inside the architecture. Our proposed algorithm shows promise on the bouncing balls generative modeling problem. Further experiments could be conducted to explore the strengths and weaknesses of our approach.


1 Introduction

For many problems, back-propagation of errors, or the application of reverse-mode differentiation to computation graphs, has been the primary algorithm of choice for conducting credit assignment in neural architectures. However, when neural architectures are made deeper, and thus more complex, the error gradients must pass backward through many layers. As a result of the additional multiplications, these gradients tend to either explode or vanish (Pascanu et al., 2013). In order to keep the values of the gradients within reasonable magnitudes, we often require the layers to behave sufficiently linearly to prevent saturation of neuronal post-activities, which would yield zero gradient. It has been shown that this required linearity can lead to undesirable extrapolation effects, creating the well-known problem of adversarial samples (Szegedy et al., 2013; Ororbia II et al., 2017a). Furthermore, this linearity also hinders usage of other important, non-linear mechanisms, such as lateral competition and saturation.
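The multiplication effect described above can be illustrated with a toy scalar chain. This is only a sketch under assumed settings (one unit per layer, a shared weight, and the hypothetical helper name `chain_gradient`); it is not taken from the paper.

```python
import math

def chain_gradient(weight, depth, x=0.5, saturating=True):
    """Derivative of a depth-layer scalar chain's output with respect to its input."""
    h, grad = x, 1.0
    for _ in range(depth):
        pre = weight * h
        if saturating:
            h = math.tanh(pre)
            grad *= weight * (1.0 - h * h)  # chain rule through tanh
        else:
            h = pre
            grad *= weight                  # chain rule through a linear layer
    return grad

# Each tanh layer contributes a factor of at most 0.5, so the product shrinks.
vanishing = chain_gradient(0.5, depth=30)
# A linear chain with weight 1.5 multiplies the gradient by 1.5 at every layer.
exploding = chain_gradient(1.5, depth=30, saturating=False)
print(vanishing, exploding)
```

The same product-of-factors structure appears in real networks, with Jacobian matrices in place of the scalar factors.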

From a biological perspective, back-propagation has received a great deal of criticism due to the implausibility of its implementation in the brain. Some of the key problems include: 1) the “weight transport problem”, where the feedback weights must be the same as (the transpose of) the feedforward weights, 2) the forward pass and the backward pass require different computations, and 3) the error gradients must be stored separately from the activations. To remedy the last two conditions, one could use a symmetrical network that is solely used for propagating errors (an error-propagation network). However, beyond the fact that two information pathways have been created, there is no known biological mechanism that allows the error network to know the weights of the feedforward network it is serving. The earlier described requirement of linearity also violates what we know about biological neurons, which interleave linear and non-linear operations. As argued by Bengio et al. (2015), if the brain were to use feedback paths as implemented by back-propagation, it would require precise knowledge of the derivatives of the non-linear activation functions, which is not possible since not all neurons are the same. Furthermore, discrete-valued or stochastic activations (such as sampling a Bernoulli or Categorical distribution) cannot be used, even though we know that real neurons communicate with spikes (binary values) and not by continuous values.

More critically, back-propagation requires a global feedback path to carry error derivatives across the system. This is due to the nature of supervised learning systems: an objective is grounded in input/output space and the global pathway is used to relate how internal processing elements affect the target. One problem with this, especially when used to generatively model data, is that most of the learner’s time is spent on surface-level properties of the data and not on extracting latent structure necessary for generalization. A good example is in speech processing, where the log likelihood cost used leads the model to focus mostly on acoustic details rather than higher-level linguistic features. [Footnote 2: Yoshua Bengio, presentation at ReWork: Deep Learning Summit Montreal, 2017.]

This global feedback path stands in contrast to many theories of the brain (Grossberg, 1982; Rao and Ballard, 1999; Huang and Rao, 2011; Clark, 2013), which posit that local computations occur at multiple levels of a (somewhat) hierarchical structure. However, if we were to violate the idea of a global feedback path, where would the targets then come from in order to create learning signals for the hidden processing elements? One thing is likely: we will no longer be able to rely on a loss function that operates primarily in the input space, which is at the core of supervised learning. This means that the learning approach we seek will require higher-level objectives, or objectives that operate at various levels of latent space. More importantly, in designing higher-level objectives to create local targets, we can better encourage a neural system to find hidden/abstract structure in data. This type of objective directly connects to representation learning, better embodying one of the key assumptions behind unsupervised learning: by observing a stream of data points, it is possible to derive the predictable systemic relationships between variables [Footnote 3: These variables can be pixels in images of a video or the characters of a word in a sentence.] as well as relationships between these relationships, i.e., more complicated, abstract patterns. Higher-level patterns are what a representation learning system seeks to uncover: latent variables, or intermediate concepts and features, that capture useful statistical regularities of the world that the intelligent system is embedded in. To this end, in this paper, the intuition behind our learning algorithm is to measure and reduce the discrepancy, or mismatch, between what representations a neural model can currently generate and representations that better describe the input.

Attempting to build models and algorithms that resolve some of the above criticisms might open the door to learning approaches that generalize better. However, while many variations of back-propagation and alternatives have been proposed, most work has only shown their usefulness on static problems, typically on classification. Yet we know that in the human brain, many active processes, including those related to vision and speech, take in sequences of input stimuli and attempt to build a dynamic mental model of the world (Rao and Ballard, 1997). Given this dynamic view of the brain, which constructs an implicit, abstract knowledge base that is representative of the structure of the observed environment (Reber, 1989; Destrebecqz and Cleeremans, 2001), we design our approach with an eye toward stateful problems. From a machine learning perspective, this is important given the success of recurrent neural models on sequential problems such as language modeling (Mikolov et al., 2010; Ororbia II et al., 2017b). Critical to the success of recurrent models is the action of unrolling, a mechanism that is clearly neurobiologically implausible. In order to implement back-propagation through time, one needs to unroll the underlying computation graph over the length of a given input sequence, creating a longer global feedback path for error information to traverse. [Footnote 4: Or unfold a recursively defined operation into an explicit chain of events.] The brain, in contrast, is an incremental, adaptive process. As such, we investigate the viability of our model, which learns from sequences without any unrolling, and offer some promising evidence that our learning algorithm can match or outperform some powerful neural models that rely on back-propagation through time or advanced variations, such as neural variational inference (Mnih and Gregor, 2014). Notably, our proposed approach allows the learning of a directed generative model, which is important given the causal structure of the universe, without the need to correct for the imperfections of an approximate inference model.

The contributions of this article are the following:

  • We propose the Temporal Neural Coding Network (TNCN), a temporal neural model, and its learning algorithm, Discrepancy Reduction, for learning dynamically from sequential data. Our model incorporates some basic notions from predictive coding (Rao and Ballard, 1999) theories of the brain, notably lateral competition among neural variables.

  • To create our model’s unsupervised learning algorithm, we draw inspiration from random feedback alignment (Lillicrap et al., 2014) and difference target propagation (Lee et al., 2015). To create local targets for the model’s higher-level objectives, we show that simple fixed projection functions can be used to create special error units that can generate local targets.

  • To evaluate our model and learning algorithm, we experiment with a video modeling problem and discover promising results with our learning approach.

This work can also be viewed as another contribution towards the long-term goal of finding more biologically plausible machine learning approaches to the credit assignment problem (Bengio et al., 2015). Specifically, we offer a rather simple approach to implementing and training sequential predictive neural models without back-propagation through time.

2 Motivation & Related Work

There has been a great deal of research in finding more biologically-plausible alternatives to back-propagation of errors. Classically, the online alternative to back-propagation of errors was real-time recurrent learning (RTRL; Williams and Zipser, 1989), which employs forward-mode differentiation to compute gradients. However, this algorithm scales poorly, i.e., quadratically in the number of parameters. Some algorithms have been proposed to reduce the complexity of RTRL, including NoBackTrack (Ollivier et al., 2015) and Unbiased Online Recurrent Learning (Tallec and Ollivier, 2017), but these are noisy and slow down the learning procedure in trying to approximate the way back-propagation through time works in an online fashion.

The contrastive divergence recipe (Hinton, 2006), well-known for being the primary learning algorithm of restricted Boltzmann machines, and the Wake-Sleep algorithm (Hinton et al., 1995; Bornschein and Bengio, 2014) for training deeper Boltzmann-based architectures, were largely inspired by the role of sleep in human learning. However, these approaches to learning, which rely on Markov Chain Monte Carlo methods, suffer from a variety of problems including slow convergence to steady-state distributions due to poor mixing of modes and the constraint that the weights of the model must be symmetric. With some success, Boltzmann-based architectures have been applied to stateful problems (Taylor et al., 2007; Sutskever et al., 2009; Boulanger-Lewandowski et al., 2012; Mittelman et al., 2014), but require hybridizing the contrastive divergence approach with back-propagation through time, incurring the limitations and criticisms of both algorithms. Other approaches inspired by Boltzmann-based learning (and energy-based learning in general) include the variational walkback algorithm (Goyal et al., 2016) and equilibrium propagation (Scellier and Bengio, 2017). However, these algorithms have only been investigated on static modeling problems and it is not clear how one might extend them to stateful problems.

Learning deep Boltzmann machines can be quite difficult, requiring all sorts of tricks to make the process work well and efficiently (Salakhutdinov and Larochelle, 2010; Ororbia II et al., 2015b). In response, one line of work has taken on a variational inference scheme (Kingma and Welling, 2013; Mnih and Gregor, 2014; Serban et al., 2017), where we train an approximate inference machine parametrized by a neural network. While efficient in learning probabilistic models of data, the success of the generative model under the variational inference framework depends largely on how good the inference model is. Specifically, the inference model constrains the generative model, and any deficiency in the inference model must then be made up by the generative model. Instead we would like to reverse this scenario: the generative model adjusts itself, where generation can be used to prime the feedback mechanisms that will guide learning and adaptation. The proposed TNCN embodies this idea in the attempt to circumvent the need for approximate inference machinery.

Our algorithm is inspired by three different strands of research focused on finding viable alternatives to the biologically-implausible back-propagation of errors.

2.1 Random Feedback Alignment

Feedback alignment (Lillicrap et al., 2014) and its variants (Nøkland, 2016; Baldi et al., 2016) have shown that random feedback weights can deliver useful teaching signals. This random form of back-propagation has also been used to develop an event-driven variation of the learning rule suitable for neuromorphic implementations of neural networks (Neftci et al., 2017). More importantly, feedback alignment algorithms resolve the weight-transport problem, which has long been one criticism of back-propagation (Grossberg, 1987; Liao et al., 2016), by showing that coherent learning is possible with asymmetric forward and backward pathways. That is, the error back-projection pathways need not be the transpose of the weights used to carry out forward propagation, and the learning process can instead be viewed as the alignment of feedforward weights with feedback weights.
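As a rough illustration of the idea, the sketch below trains a small two-layer network on a toy regression task, sending the output error backward through a fixed random matrix `B` rather than the transpose of the forward weights. The task, layer sizes, learning rate, and variable names are all assumptions chosen for illustration; this is not the setup of Lillicrap et al. (2014).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_hid, d_out = 256, 10, 32, 3
X = rng.standard_normal((n, d_in))
T = X @ rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)  # toy linear teacher

W1 = 0.1 * rng.standard_normal((d_in, d_hid))   # forward weights, layer 1
W2 = 0.1 * rng.standard_normal((d_hid, d_out))  # forward weights, layer 2
B = 0.1 * rng.standard_normal((d_out, d_hid))   # fixed random feedback weights

def loss_fn(W1, W2):
    return 0.5 * np.mean(np.sum((np.tanh(X @ W1) @ W2 - T) ** 2, axis=1))

initial_loss = loss_fn(W1, W2)
lr = 0.05
for _ in range(500):
    H = np.tanh(X @ W1)
    E = H @ W2 - T                    # output error
    dH = (E @ B) * (1.0 - H ** 2)     # B stands in for W2.T on the backward path
    W2 -= lr * H.T @ E / n            # exact gradient step for the top layer
    W1 -= lr * X.T @ dH / n           # feedback-alignment step for the lower layer
final_loss = loss_fn(W1, W2)
print(initial_loss, final_loss)
```

Note that `B` is never updated; only the forward weights change, and the error signal still travels along a global path from the output, which is the property the TNCN departs from.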

Feedback alignment, however, suffers from two problems: 1) during the alignment phase, a given layer cannot learn before the upper layers have approximately “aligned”, 2) the procedure operates much like (supervised) greedy layerwise learning where each layer only learns something that is useful for a linear classifier but does not globally optimize or offer any coordination among the layers. [Footnote 5: Personal communication, Yoshua Bengio.] The TNCN’s learning algorithm deviates from feedback alignment in that it uses error feedback weights to create potentially better target representations instead of replacing the derivatives normally computed by back-propagation. Furthermore, the TNCN does not strive to learn by an algorithm that works approximately like back-propagation (as approaches like feedback alignment and contrastive Hebbian learning (Xie and Seung, 2003) do), which requires a global feedback path to conduct credit assignment.

2.2 Recirculation & Target Propagation

Recirculation (Hinton and McClelland, 1988; O’Reilly, 1996), the predecessor to target propagation and originally proposed for a single hidden layer auto-encoder, uses the datum as the target value for reconstruction (which affects the decoder) and the initial encoded representation of the datum as the target for the encoder, which is computed after a second forward pass. One key requirement for recirculation is that the weights of the encoder and decoder are symmetric; however, the learning process encourages these weights to self-align and approximate the symmetry.

Difference target propagation (Lee et al., 2015) brings forth the idea that a learning signal might be created by calculating targets rather than loss gradients at each layer. This allows for the development of local learning rules, removing the need for a global error pathway to carry error derivatives across (and thus side-stepping the vanishing gradient problem). Furthermore, some connections can be made between target propagation and Spike-Time-Dependent-Plasticity (Andrew, 2003). The general approach in target propagation is that the feedback weights are trained to learn the inverse of the feedforward mapping. This has also been roughly applied to training recurrent network models (Wiseman et al., 2017) but still requires unrolling the computation graph along the length of the sequence. Target propagation also requires a few things to work, notably that every layer in the network model must be an autoencoder and that a linear correction term is used to account for the imperfectness of auto-encoders (which can obstruct learning).
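The linear correction mentioned above can be sketched as follows. The helper name `difference_target`, the random feedback matrix `V`, and the tanh inverse mapping are illustrative assumptions, following the commonly stated form of the difference target propagation rule rather than anything spelled out in this text.

```python
import numpy as np

def difference_target(h_lower, h_upper, target_upper, inverse_fn):
    """Target for the lower layer: its own activity plus the change that the
    (approximate) inverse mapping assigns to moving h_upper to target_upper."""
    return h_lower + inverse_fn(target_upper) - inverse_fn(h_upper)

rng = np.random.default_rng(0)
V = 0.5 * rng.standard_normal((4, 6))        # hypothetical feedback (inverse) weights
inverse_fn = lambda h: np.tanh(h @ V)        # approximate inverse: upper (4) -> lower (6)

h_lower = rng.standard_normal(6)
h_upper = rng.standard_normal(4)
target_upper = h_upper - 0.1 * rng.standard_normal(4)  # a slightly shifted upper target

target_lower = difference_target(h_lower, h_upper, target_upper, inverse_fn)
print(target_lower)
```

Because the inverse is applied to both the upper activity and the upper target, any systematic error in an imperfect inverse cancels: when the upper target equals the upper activity, the lower target equals the lower activity exactly.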

It is important to note that target propagation permits the use of non-linearities that output discrete values (or Bernoulli sampled activations). The TNCN also possesses this useful property, since the element-wise functions used to compute neuronal post-activations no longer need to be differentiable. This allows us to work with highly non-linear transformations where the gradients are often near zero, for example, stochastic binary units.

2.3 Local Learning & Greedy Layerwise Training

The desire for useful local learning, toward which target propagation represents a strong modern step, is not new. It saw a small resurgence in the early days of training deeper networks in the form of layer-wise training of unsupervised models (Bengio et al., 2007), supervised models (Lee et al., 2014), and semi-supervised models (Ororbia II et al., 2015c; Ororbia II et al., 2015b), also known as hybrid training. However, among the many problems with these early approaches to deep learning was the lack of global coordination. Global coordination means that higher-level layers essentially direct lower-level layers in what patterns they should be extracting. A lower-level feature detector might be able to find different aspects of structure in its input since multiple patterns might satisfy its layer-wise objective. However, this detector will only locate the right bit of structure needed for the whole model to make sense, at any time-step, if a higher-level layer signals what pattern it should be looking for. Since greedy layer-wise approaches build the model from the bottom up, freezing the learned lower-level parameters, this coordination is impossible to achieve.

The TNCN’s localized learning approach was also motivated by the simple Bottom-Up-Top-Down learning algorithm (Ororbia II et al., 2015a), which showed that a stack of Boltzmann network modules (and other simple, auto-associative variations) could be learned in a “pseudo-joint” layerwise fashion. However, in order to build in some form of global coordination, a variation of back-propagation of errors was used, ultimately creating a global feedback path as part of the overall learning procedure. A more global approach was later presented in Ororbia II et al. (2015b), incorporating top-down information much like that in Salakhutdinov and Larochelle (2010). However, these algorithms were only built for and studied on stateless problems. Furthermore, these approaches would be difficult to scale when extended to sequential modeling problems given their strong dependence on Markov Chain Monte Carlo sampling.

2.4 Predictive Coding

Predictive coding theories posit that the brain is in a continuous process of creating and updating hypotheses that predict the sensory input it receives, directly influencing conscious experience (Panichello et al., 2013). Models of sparse coding (Olshausen and Field, 1997) and predictive coding (Rao and Ballard, 1999) embody the idea that the brain is a directed generative model where the processes of generation (top-down mechanisms) and inference (bottom-up mechanisms) are intertwined (Rauss and Pourtois, 2013) and interact to perform a sort of iterative inference of latent variables/states. Furthermore, nesting the ideas of predictive coding within the Kalman Filter framework (Rao and Ballard, 1997) can create dynamic models that handle time-varying data. Many variations and implementations of predictive coding have been developed (Chalasani and Principe, 2013; Lotter et al., 2016; Santana et al., 2017). Some approaches, such as predictive sparse decomposition (Kavukcuoglu et al., 2010), attempt to speed up the iterative inference by introducing an inference network, but this again creates a problem similar to that of variational inference: the generative model is constrained by the quality of the approximate inference model.

One key concept behind predictive coding that our own work embodies is that, for a multi-level objective to work well, each layer of a neural architecture would need an error feedback mechanism that could communicate the needs of the layer below it. If the learning signals are moved closer to the layers themselves, the error connections can directly transmit the information to the right representation units. Very importantly, this allows us to side-step the vanishing gradient problem that plagues pure back-propagation, where the internal layers of the architecture are trying to satisfy an objective that they only indirectly influence. If we were to compare the updates from this local learning approach to back-propagation, the updates would still ascend/descend towards a similar objective, just not the steepest ascent/descent, so long as they were within 90 degrees of the direction given by back-propagation. However, since steepest ascent/descent is a greedy form of optimization, updates from a more localized approach might lead to superior generalization results (van den Broeke, 2016).
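The 90-degree condition can be checked numerically on a toy objective. The quadratic loss and the particular update direction below are assumptions chosen only to make the geometry easy to see.

```python
import numpy as np

def loss(w):
    return 0.5 * float(w @ w)   # toy quadratic objective

def grad(w):
    return w                    # its exact gradient

w = np.array([1.0, 0.0])
g = grad(w)
d = np.array([1.0, 1.0])        # an update direction 45 degrees away from the gradient

cosine = float(g @ d / (np.linalg.norm(g) * np.linalg.norm(d)))
w_new = w - 0.1 * d             # small step along the non-steepest direction
print(cosine, loss(w), loss(w_new))
```

A positive cosine means the direction has a component along the gradient, so a sufficiently small step still lowers the loss even though it is not the steepest-descent step.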

Our TNCN takes an adaptive, state-corrective approach similar to the dynamic predictive coding model proposed by Rao and Ballard (1997). The general idea is to let the model first make a prediction and generate what it thinks the current frame or symbol will be at time t. The errors are computed for each layer via some feedback mechanism, starting from the sensors, which have direct access to the state of the world, and used to correct the model’s internal states before moving on to predict the next time-step. A learning signal is created by comparing the corrected states to the initially predicted states. Since intra-layer competition among neuronal units is important (for reasons we will describe later), we also follow in the spirit of Olshausen and Field (1997) and encourage sparsity indirectly through a penalty/constraint.

We combine the basic ideas described above to propose an algorithm for learning a temporal neural model, the Temporal Neural Coding Network. Generally, proposed alternative algorithms are tested on stateless data classification problems, such as the well known MNIST digit recognition dataset. In contrast, we investigate the potential of our algorithm on sequential/temporal problems. Our approach can be viewed as an online, adaptive approach to learning that requires no unrolling, unlike back-propagation through time, since the TNCN is continuously engaged in self-correction, or rather, minimizing its total discrepancy between its expectations and targets.

3 Learning a Temporal Neural Coding Network

Let us begin by formally defining a TNCN, at time t, with three layers of neural variables, where the lowest layer (layer 0) refers to the output sensors that directly connect the model with the environment/world. The TNCN architecture distinguishes between two sets of parameters: the generative/predictive parameters and the fixed, error feedback parameters. For each latent variable, we additionally define its pre-activations.

To calculate the necessary statistics for one step of error correction, we first define the model’s generated prediction for any (internal) arbitrary layer to be:

(1)

which assumes that any pre-activation is a linear combination of a filtration and a top-down (expectation) bias. The error units do two things: 1) create a local representation target by using information from the error units of the layer below and the TNCN’s initial guess of the representation, 2) measure the discrepancy between this target and the corresponding layer pre-activation. Specifically, an error unit at a given layer within the model is computed as follows:

(2)

noting that the second formula depicts how the error units compute a latent target using the error feedback weights. A post-activation function is applied at each layer, and a separate element-wise function is applied to the information coming from the error units below, meaning that the representation target is a non-linear function of the representation guess and the error fed back from below. [Footnote 6: In this paper, we used the hyperbolic tangent, scaled optimally according to LeCun et al. (1989).] This equation is reminiscent of the single corrective step found in dynamic predictive coding models formulated as Kalman Filters (Rao and Ballard, 1997). Note that these error feedback weights are fixed, as those of Lillicrap et al. (2014) were, but they do not necessarily have to be. [Footnote 7: Future work shall investigate neuro-biologically grounded ways of evolving these feedback weights.] However, the feedback weights of Lillicrap et al. (2014) were used to carry the partial derivatives across layers (much like a short-circuit) and ultimately serve as a global feedback path, whereas the weights we propose are meant to help correct or update the currently guessed representation (which itself is a function of past information and the top-down generative weights of the layer above) and create local targets useful for subsequent learning. It is important to note that the error weights transmit error information behind the non-linear activation function for any layer. Once the correct pre-activation target for any layer has been calculated, the final corrected representation is simply a re-application of the post-activation function for that layer to this target (this will then be used as the vector summary of the past when moving on to the next time step). A critical advantage of this proposed way of wiring the feedback connections (which is a unique departure from standard predictive coding models) is that we are now free to use any differentiable or non-differentiable non-linearity we like. This can include other sampling operations not amenable to the re-parametrization trick as well as discrete-valued activation functions (such as the classical hard threshold function).
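Since the displayed equations are not reproduced in this text, the following is only a hedged sketch of one generate-then-correct step that is consistent with the prose. Every symbol is an assumption: `V1`/`V2` stand for the recurrent "filtration" weights, `W1`/`W2` for the top-down generative weights, `E1`/`E2` for the fixed error feedback weights, `beta` for a hypothetical correction strength, and the sign conventions are one plausible choice.

```python
import numpy as np

def generate(z1_prev, z2_prev, p, phi=np.tanh):
    """Top-down guesses (pre-activations) for layer 2, layer 1, and the sensors (layer 0)."""
    h2 = p["V2"] @ z2_prev                       # top layer: filtration of its past state
    h1 = p["V1"] @ z1_prev + p["W2"] @ phi(h2)   # filtration plus top-down expectation
    h0 = p["W1"] @ phi(h1)                       # predicted sensor values
    return h0, h1, h2

def correct(x, h0, h1, h2, E1, E2, beta=0.1, phi=np.tanh, psi=np.tanh):
    """Error units, pre-activation targets, and corrected states for one time step."""
    e0 = x - h0                       # mismatch at the sensors
    y1 = h1 + beta * psi(E1 @ e0)     # local target for layer 1, in pre-activation space
    e1 = y1 - h1                      # error unit of layer 1
    y2 = h2 + beta * psi(E2 @ e1)     # local target for layer 2
    e2 = y2 - h2
    return (e0, e1, e2), (phi(y1), phi(y2))  # corrected states feed the next time step

rng = np.random.default_rng(0)
d0, d1, d2 = 4, 6, 5
p = {"V2": 0.1 * rng.standard_normal((d2, d2)), "V1": 0.1 * rng.standard_normal((d1, d1)),
     "W2": 0.1 * rng.standard_normal((d1, d2)), "W1": 0.1 * rng.standard_normal((d0, d1))}
E1 = 0.1 * rng.standard_normal((d1, d0))   # fixed, never trained in this sketch
E2 = 0.1 * rng.standard_normal((d2, d1))

z1_prev, z2_prev = rng.standard_normal(d1), rng.standard_normal(d2)
x = rng.standard_normal(d0)                # current datum
h0, h1, h2 = generate(z1_prev, z2_prev, p)
errors, corrected = correct(x, h0, h1, h2, E1, E2)
```

Because the feedback weights act on pre-activation targets and no derivative of `phi` is taken anywhere, `phi` could be swapped for a non-differentiable function such as a hard threshold without changing the structure of the step.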

We describe the TNCN’s learning procedure, Discrepancy Reduction, next.

3.1 The Discrepancy Reduction Algorithm

To compute the gradients of model parameters once latent representations have been corrected, we exploit the local learning signals that naturally arise given the way the error units we have designed work. The cost function that measures the total discrepancy within a TNCN composed of multiple layers of latent variables, applied to real-valued input distributions, can be naively formulated as follows:

(3)

We fix the variance terms; however, these can be additional parameters to be learned (details will appear in the appendix). If we want to take an information-theoretic view, which aligns better with the idea of reducing internal discrepancy in the system, we can instead use the Kullback-Leibler divergence (Kullback and Leibler, 1951) to measure the mismatch between the guess and target for each local representation:

(4)
(5)

noting that the guess and the target each have their own variance (both of these are diagonal covariance matrices, and fixing these to vectors of ones further simplifies the expression to look quite similar to Equation 3). The leftmost term of the right-hand side of the equations for the loss is the partially grounded term in input space, while the rest of the terms are the higher-level terms (albeit simple and perhaps crude). Note that, internally, the TNCN architecture will readily compute the representation pre-activity targets each time a sequential element is presented.
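The remark about unit variances can be verified with the standard closed form for the KL divergence between two diagonal Gaussians. The function names below are hypothetical, and the squared-error term is written in the usual half-sum-of-squares form, which is an assumption about the exact scaling of Equation 3.

```python
import numpy as np

def gaussian_discrepancy(target, guess):
    """Half squared Euclidean distance (unit-variance Gaussian term, up to a constant)."""
    return 0.5 * float(np.sum((target - guess) ** 2))

def diag_gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, diag(var_p)) || N(mu_q, diag(var_q)) ), standard closed form."""
    return 0.5 * float(np.sum(np.log(var_q / var_p)
                              + (var_p + (mu_p - mu_q) ** 2) / var_q
                              - 1.0))

guess = np.array([0.2, -0.5, 1.0])
target = np.array([0.0, -0.1, 0.7])
ones = np.ones(3)

kl_unit = diag_gaussian_kl(target, ones, guess, ones)
sq_term = gaussian_discrepancy(target, guess)
print(kl_unit, sq_term)
```

With both variance vectors fixed to ones, the log-ratio term vanishes and the remaining terms collapse to the half squared distance, which is why the two objectives look alike in that setting.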

Figure 1: (a) A two-layer TNCN architecture. (b) The feedback structure of a pair of cells. In (b), the error feedback loops are explicitly shown, where one pair of cells (in layer 0) communicate the discrepancy backwards to an earlier pair of cells (in layer 1). The error in layer 0 drives the representation target for layer 1, communicating the amount of mismatch in its own respective layer to help correct the representation above. In this sub-figure, solid lines represent the initial, feedforward pass of information while dashed lines represent the information flow along the structural feedback connections, which feed into an error unit that computes the target (which is also used to compute the corrected state). The dash-dotted line marks the separation of the forward phase and the error-correction phase in computing the latent variable.

If we find the first-order partial derivatives of Equation 3 with respect to the generative weights, we get the following updates (assuming we use stochastic gradient ascent as the update rule):

(6)
(7)

where the added term is a noise process that is directly applied to the estimated parameter gradient. Such a process can be zero-mean Gaussian noise (with a scalar variance chosen through cross-validation). Note that the input layer does not sport any recurrent connections (but if it did, these could be considered auto-regressive connections that relate input variables to past input variables) and is simply defined as:

(8)

where we simply set the corresponding gradient to zero, since setting the parameter matrix to zero will effectively delete these recurrent input connections (we did this simply to avoid the situation where the dimensionality of the input is large, which would require even larger parameter matrices). One favorable property of the Discrepancy Reduction learning algorithm for learning TNCNs is the (partial) parallelism one may exploit in calculating parameter gradients, much like the goal of Jaderberg et al. (2016). One simply needs to run the TNCN’s generation and target calculation procedures to get the guesses and targets, but once these statistics have been computed one can treat each layer as a computation sub-graph, the gradient estimates of which are independent of any other layer (this means that one could design each layer to be more intricate and complex than what is used in this paper).
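A hedged sketch of what such a layer-local update might look like is given below. It assumes the local term is a unit-variance Gaussian log likelihood whose pre-activation is `W @ z_input`, so that its gradient is the outer product of the error unit and the layer's input; the function name `local_update`, the learning rate, and the noise scale are illustrative and not taken from the paper's equations.

```python
import numpy as np

def local_update(W, error, z_input, lr=0.1, noise_std=0.0, rng=None):
    """One gradient-ascent step on a single layer using only its own error unit."""
    grad = np.outer(error, z_input)            # local gradient estimate
    if noise_std > 0.0:                        # optional zero-mean Gaussian gradient noise
        rng = np.random.default_rng() if rng is None else rng
        grad = grad + noise_std * rng.standard_normal(grad.shape)
    return W + lr * grad                       # ascent: move the guess toward the target

W = np.full((3, 4), 0.1)
z_input = np.array([1.0, 0.5, -0.5, 0.0])      # input to this layer (squared norm 1.5)
target = np.array([1.0, -1.0, 0.5])            # local pre-activation target

error = target - W @ z_input                   # error unit before the update
W_new = local_update(W, error, z_input)
error_new = target - W_new @ z_input           # error unit after the update
print(np.linalg.norm(error), np.linalg.norm(error_new))
```

Nothing in `local_update` refers to any other layer, so once guesses and targets are available the per-layer updates could be computed in parallel.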

Input: mini-batch at time t, current parameters, previous state variables, and meta-parameters
function generate
     Extract states and parameters
     Compute the guess of each representation
     return the guesses
function correct
     Set the sensor-level target to the datum at time t
     Compute the error units and the error-created targets for each layer
     for each refinement step do (could add a convergence criterion based on the error units)
          Recompute the targets, error units, and corrected states
     return the corrected states and error units
function updateModel
     1) Run the generative model (get guesses of each representation)
     2) Correct states via error-created targets
     3) Adjust parameters via gradient ascent and output corrected state variables
          Extract current parameters
          Apply the layer-wise parameter updates
     return the updated parameters and corrected state variables
Algorithm 1: The Discrepancy Reduction learning algorithm for building a TNCN model with two latent variable layers. A regularization function (or prior distribution) is imposed on each layer’s pre-activities.

In predictive coding theories of the brain, sparsity is often assumed to be a key ingredient. This means we seek representations of data in which only a small subset of the latent variables have non-zero values. If the TNCN is to disentangle concepts in its latent representations, the need for sparsity makes sense, since it is reasonable to assume that only a few of the many possible concepts/variables explain any given datum (useful in tasks such as corrupt image denoising (Cho, 2013)). From a theoretical perspective on feature learning, we desire compact representations of the input in which no information about the input is lost (Bengio, 2013; van Rooyen and Williamson, 2015). Dense representations, though rich, are highly entangled, since small changes in the input can lead to large changes in the representation vector. Sparse representations, in contrast, are robust and mostly conserve the set of non-zero features (Glorot et al., 2011). In this paper, we only enforce a “weak” form of lateral competition over the neuronal variables through a simple Laplacian prior distribution over the pre-activities. During the learning phase/step, lateral competition patterns (where the neurons fight to be active) are internalized in the model parameters via the regularization (prior) term (footnote 8). To better encourage sparsity, other prior distributions, such as the spike-and-slab distribution (Goodfellow et al., 2012) (to avoid controlling the pre-activation magnitudes), or architectural modifications (Szlam et al., 2011) might improve generalization and will be the subject of future work. In preliminary experiments, we found that sparsity was indeed a necessary component in improving performance.

Footnote 8: It is important to note that the simple way we encourage sparsity does not mean that all representations of the TNCN are guaranteed to be sparse. It is quite possible that the model could produce dense representations for data points outside the training sample, since sparsity is encouraged only during training. One way to remedy this would be to ensure that the Laplacian prior is active during inference (as in classical sparse coding).
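To make the role of the Laplacian prior on the pre-activities concrete, the following is a minimal sketch rather than the exact term used in Algorithm 1. It assumes a zero-mean Laplace prior whose negative log density reduces, up to a constant, to an L1 penalty; the names `laplacian_penalty`, `laplacian_penalty_grad`, and the weight `lam` are hypothetical.

```python
import numpy as np

def laplacian_penalty(z, lam=0.1):
    """Negative log of a zero-mean Laplace prior on pre-activities z,
    up to an additive constant (lam is a hypothetical inverse scale)."""
    return lam * np.sum(np.abs(z))

def laplacian_penalty_grad(z, lam=0.1):
    """Subgradient of the penalty with respect to z: lam * sign(z).
    np.sign returns 0 at z == 0, which is a valid subgradient choice."""
    return lam * np.sign(z)

# Toy pre-activity vector (illustrative values only).
z = np.array([0.5, -2.0, 0.0, 1.5])
penalty = laplacian_penalty(z)
grad = laplacian_penalty_grad(z)
```

Because the gradient has constant magnitude regardless of how large a pre-activity is, the penalty pushes every non-zero unit toward zero at the same rate, which is one way to read the "weak" competition described above.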

With all of the above taken together, the full algorithm for the TNCN (generation, representation correction, and parameter updating) is depicted in Algorithm 3. The TNCN (or rather its generation/inference mechanisms) and its learning algorithm are intricately tied together, since the learning procedure makes use of the representation targets created by the architecture’s error-driven correction mechanism. This operates in the spirit of predictive coding, which posits that the brain’s generation and inference procedures interact to formulate local learning signals. The mechanism we use for target creation is rather simple, and future work should investigate more sophisticated mechanisms (especially ones with evolving error weights). As depicted in Algorithm 3, the representation-correction mechanism can be extended to a process in which targets are iteratively refined, in the hope of shortening the overall learning phase.

With respect to higher-level objectives, we can see that the error units play a crucial role: they are in fact the first-order derivatives of the Gaussian log likelihood (with fixed unit variance). Learning is simple, since the error units can be easily re-used to calculate parameter gradients incrementally (when combined with the competition prior), and the only activation function derivative required in this approach is that of the output distribution model (which can be easily worked out for commonly used output distributions, such as the Gaussian, Bernoulli, and Categorical distributions). Note that better error units could be derived if one chose a different tactic for measuring the distance between predicted and corrected representation layers (for example, one could measure the Manhattan distance instead of the Euclidean distance used in our framework). However, the general idea is that the TNCN works to keep its layer-wise representations as close as possible to those suggested by the error units. It is optimizing not only in the input space but also in the latent space, giving us a rough measure of the quality of the model’s internal representations. In some sense, this bears a loose resemblance to the bottom-up-top-down algorithm of Ororbia II et al. (2015a), which offered a non-greedy way of learning a set of layer-wise experts. Through the feedback mechanism and the top-down generation paths, the local learning rules of the TNCN gain some form of global coordination, which was lacking in the greedy approaches of the past (Bengio et al., 2007; Ororbia II et al., 2015c) for training deep belief networks and their hybrid variants.
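The claim that the error units are first-order derivatives of the unit-variance Gaussian log likelihood can be checked numerically. The sketch below assumes the sign convention target minus prediction, which may differ from the convention in the algorithm itself; the function names are hypothetical.

```python
import numpy as np

def gaussian_log_likelihood(target, prediction):
    """Log density of target under N(prediction, I), dropping the constant."""
    diff = target - prediction
    return -0.5 * np.sum(diff ** 2)

def error_unit(target, prediction):
    """Analytic derivative of the log likelihood w.r.t. the prediction."""
    return target - prediction

def numerical_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return g

# Illustrative values only.
target = np.array([0.2, 0.9, 0.4])
prediction = np.array([0.5, 0.5, 0.5])

e = error_unit(target, prediction)
g = numerical_grad(lambda p: gaussian_log_likelihood(target, p), prediction)
```

Here `e` and `g` agree to within finite-difference precision, which is why the same error vector can be reused directly when forming parameter gradients.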

It is important to highlight that learning and inference under this model are ideally supposed to be continuous, meaning that the model simultaneously generates expectations and then corrects itself (both representations and parameters) each time a new datum from a sequence is presented. This makes the model directly suited to learning incrementally from data streams. Furthermore, the TNCN shows how two types of recurrence/feedback are at play when modeling sequences: 1) the model is recurrent across the temporal axis, since it is stateful (each processing layer depends on a vector summary of the past), and 2) the model is structurally recurrent, similar to deep Boltzmann machines and Hopfield Networks (Hopfield, 1982), since error is fed back in order to automatically correct guessed representations.
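The predict-then-correct pattern of continuous learning can be illustrated with a deliberately simplified stand-in. The sketch below is not the TNCN: it assumes a single linear next-step predictor updated with a delta rule after every datum, with no latent layers and no error feedback weights. The function name, learning rate, and initialization scale are hypothetical.

```python
import numpy as np

def online_next_step_learning(stream, lr=0.05, seed=0):
    """Process a stream one datum at a time: predict, measure error, update.

    stream: array of shape (num_steps, dim).
    Returns the learned weights and the squared error at each step.
    """
    rng = np.random.default_rng(seed)
    dim = stream.shape[1]
    W = rng.normal(0.0, 0.01, size=(dim, dim))
    errors = []
    prev = stream[0]
    for x in stream[1:]:
        prediction = W @ prev            # generate an expectation
        e = x - prediction               # error signal for this step
        W += lr * np.outer(e, prev)      # local, incremental update
        errors.append(float(np.sum(e ** 2)))
        prev = x
    return W, errors

# A trivially predictable toy stream (the same vector repeated).
stream = np.tile(np.array([1.0, 0.5]), (200, 1))
W, errors = online_next_step_learning(stream)
```

No pass over stored sequences and no unrolling through time is needed; each update uses only the current datum and the previous one.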

4 Experimental Results

4.1 The Bouncing Balls Process

To test our proposed TNCN architecture and its learning algorithm, Discrepancy Reduction, we benchmark performance on the bouncing balls dataset, following the experimental setup of Sutskever et al. (2009). This high-dimensional dataset was created by simulating the rudimentary physics of three balls bouncing around in a box. We generate a training set of 4000 sequences and a test set of 200 sequences (as well as another 200 sequences to form a development set). Furthermore, our models are given no prior knowledge of the task, e.g. convolutional weight matrices, much as was done in Taylor et al. (2007). On this dataset, Frame t-1 is the simplest possible baseline: a model that predicts the next frame to be simply the previously seen frame.
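The Frame t-1 baseline is simple enough to state in a few lines. The sketch below assumes sequences are stored as an array of flattened frames and that the error is summed over pixels and averaged over frames and sequences; that normalization is an assumption and may not match the one behind the reported numbers.

```python
import numpy as np

def frame_tm1_error(sequences):
    """Average per-frame squared error of predicting frame t as frame t-1.

    sequences: array of shape (num_sequences, num_frames, num_pixels).
    """
    prediction = sequences[:, :-1, :]    # previous frames
    target = sequences[:, 1:, :]         # frames to be predicted
    per_frame = np.sum((target - prediction) ** 2, axis=-1)
    return float(per_frame.mean())

# Hand-checkable toy sequence: one 2-pixel video with three frames.
toy = np.array([[[0.0, 0.0],
                 [1.0, 1.0],
                 [1.0, 1.0]]])
baseline = frame_tm1_error(toy)
```

In the toy sequence, the first transition costs 2 and the second costs 0, so the average is 1. Any learned model should be expected to beat this number on real data.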

We trained TNCNs with two and three layers of latent variables, searching over a range of layer sizes (with performance measured on the validation subset). In this experiment, the logistic sigmoid activation function ultimately proved to be the most useful (we also experimented with other nonlinearities, but found that these did not work as well). Parameters were initialized from centered Gaussian distributions (in the three-layer TNCN, a separate setting for the top-level recurrent and generative weights improved performance a bit). Error feedback parameters were likewise initialized from centered Gaussians (again with a separate setting for the top layer of the three-layer model). Gradient noise of a fixed magnitude was used to control the stochastic approximation of the prior over weights. The number of iterative inference steps was also held fixed (footnote 9). Gradients (at each time step) were estimated using mini-batches of 50 samples (across 50 parallel videos). Parameters were updated by gradient descent, using the Adam (Kingma and Ba, 2014) adaptive learning rate scheme with a fixed step size. We further applied norm-rescaling (Pascanu et al., 2013) with a fixed threshold to the gradients computed by Discrepancy Reduction, and took the Polyak average (Polyak and Juditsky, 1992) of the model at its best performance on the validation subset (i.e., early stopping).

Footnote 9: We also experimented with a naive alternative fixed setting and found that performance slightly worsened. We believe that to properly use iterative inference, we would need to employ an adaptive iterative inference schedule: in the early stages of learning, the model is learning to use its error feedback weights, but in later stages one should gradually raise the number of inference steps to give the model the opportunity to iteratively refine its representations.
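Two of the training details above, gradient norm-rescaling and Polyak averaging, can be sketched generically. This is an illustration under stated assumptions, not the training code: the threshold value is hypothetical, and the averaging shown is a uniform running mean over parameter iterates, which may differ from the exact averaging window used in the experiments.

```python
import numpy as np

def rescale_gradient(grad, threshold):
    """Rescale grad so that its L2 norm never exceeds threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad

class PolyakAverage:
    """Uniform running mean of parameter vectors seen so far."""

    def __init__(self):
        self.count = 0
        self.mean = None

    def update(self, params):
        self.count += 1
        if self.mean is None:
            self.mean = np.array(params, dtype=float)
        else:
            self.mean += (params - self.mean) / self.count
        return self.mean

# Illustrative values only.
clipped = rescale_gradient(np.array([3.0, 4.0]), threshold=1.0)

averager = PolyakAverage()
averager.update(np.array([0.0, 0.0]))
averaged = averager.update(np.array([2.0, 4.0]))
```

Rescaling preserves the gradient's direction while bounding the step length, and the averaged parameters, rather than the last iterate, would be the ones evaluated on the validation subset.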

We report our models’ average squared next-step prediction error per frame (over 20 trials) in Table 1 and compare against previously reported errors. The proposed TNCN performs better than the Boltzmann-based models. Furthermore, we see that the inclusion of an additional hidden layer helps the directed model, pushing it to nearly the same level as a deep temporal sigmoid belief network. Note that all of the models we compare our TNCN against use back-propagation through time as a core mechanism, while our approach is incremental and adaptive. To improve the performance of our model further, we believe that an adaptive iterative inference scheme and learnable variance parameters are key ingredients.

What is most surprising is that our simple way of building non-linear error units was effective in creating useful local representation targets. This is evidenced by the fact that performance vastly improves upon adding a layer of these types of neurons to the top-down generative model. What this might mean is that the TNCN is making use of the generated local targets and trying to minimize the mismatch between its initial guess of the representation (conditioned on past corrected representations) and the error-corrected representation. Since each layer higher up in the network aims to do a better job at explaining the layer representation below, this local target becomes useful during learning and inference. The target in effect helps keep the model on track as it updates its latent representations given the sequence data it encounters, step by step.

Ball Model Error
Frame t-1
TSBN-1 (Gan et al., 2015)
RTRBM (Sutskever et al., 2009)
SRTRBM (Mittelman et al., 2014)
TSBN-4 (Gan et al., 2015)
DTSBN-D (Gan et al., 2015)
DTSBN-s (Gan et al., 2015)
2-TNCN (present work)
3-TNCN (present work)
Table 1: Test-set performance on the bouncing ball problem.

5 Conclusions

In this paper, we proposed a novel neural architecture, the Temporal Neural Coding Network (TNCN), and its learning algorithm, Discrepancy Reduction. To derive our idea, we drew inspiration from several strands of work that seek biologically-plausible alternative learning algorithms that generalize better to out-of-sample data. Furthermore, we connected these various research paths with the aim of tackling the difficult problem of learning representations of data in an unsupervised manner. With this target in mind, we argued for the use of higher-level objectives, which can be interpreted as local learning rules that still result in a globally coherent model. We developed our model with the goal of learning from streams of sequential data, inspired by the dynamic variations of predictive coding theories of the brain. On one generative modeling benchmark, the bouncing balls problem, we showed that our model, using an incremental, adaptive procedure, can compete with various models that use back-propagation through time.

Breaking free from the global feedback path required in back-propagation, as we do in this work, brings us closer to building models that are better suited for true unsupervised learning (Barlow, 1989). Unsupervised learning requires the computer to capture all possible dependencies between all observed variables, since inputs are no longer distinguished from outputs, as they are in supervised learning. It is unsupervised learning that is more closely related to how humans learn, where much of the incoming data does not come with labels (or, at best, comes with very few labels, as mentioned in Ororbia II et al. (2015a, 2015c)). A successful unsupervised learning system is one that can discover all of the useful concepts and underlying causes that explain what it perceives (LeCun et al., 2015), just as an infant must do, by observation alone.

More importantly, learning generic representations in an unsupervised system would further free us from the rather inflexible models created by the task-specific nature of supervised learning. Since downstream supervised/reinforcement learning approaches focus on task-specific measurements, the objectives used in unsupervised learning must then attempt to measure the quality of the generic representations acquired by the generative models we train. Defining what constitutes a good-quality general representation is itself an open problem and an active area of theoretical research (van Rooyen and Williamson, 2015; McNamara et al., 2016). However, in lieu of the ideal metric, we took a small step towards higher-level objectives by literally interpreting this concept as a set of simple reconstruction loss terms that measure how well our neural architecture can predict a set of representation targets. Far better performance can be reached, we hypothesize, if better representation measurements and metrics can be developed.

We are continuing to run experiments on other datasets to show the generality of our architecture and learning algorithm, and ultimately seek to use the learned generative models in downstream supervised learning tasks. Furthermore, we believe that our incremental, adaptive algorithm is better suited to streaming problems, more commonly found in the online learning setting (Ororbia II et al., 2015b), which we argue is important when considering the even greater challenge of lifelong learning (Mitchell et al., 2015).

6 Acknowledgements

We would like to thank Yoshua Bengio for useful feedback.

References