For many problems, back-propagation of errors, or the application of reverse-mode differentiation to computation graphs, has been the primary algorithm of choice for conducting credit assignment in neural architectures. However, when neural architectures are made deeper, and thus more complex, the error gradients must pass backward through many layers. As a result of the additional multiplications, these gradients tend to either explode or vanish (Pascanu et al., 2013). In order to keep the values of the gradients within reasonable magnitudes, we often require the layers to behave sufficiently linearly to prevent saturation of neuronal post-activities, which would yield zero gradient. It has been shown that this required linearity can lead to undesirable extrapolation effects, creating the well-known problem of adversarial samples (Szegedy et al., 2013; Ororbia II et al., 2017a). Furthermore, this linearity also hinders usage of other important, non-linear mechanisms, such as lateral competition and saturation.
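As a minimal illustration of the multiplicative effect described above (not taken from the paper), the sketch below tracks the derivative of a chain of scalar units. The weight values, the depth of 50, and the helper name `chain_gradient` are hypothetical choices made only for this example.

```python
import math

def chain_gradient(weight, depth, x=0.5, linear=False):
    """Derivative of the output w.r.t. the input for `depth` stacked scalar units."""
    h, grad = x, 1.0
    for _ in range(depth):
        a = weight * h                          # pre-activation
        h = a if linear else math.tanh(a)       # post-activation
        local = 1.0 if linear else 1.0 - h * h  # derivative of the activation at a
        grad *= weight * local                  # chain rule: one extra factor per layer
    return grad

vanishing = chain_gradient(weight=0.5, depth=50)               # every factor is below 1
exploding = chain_gradient(weight=1.5, depth=50, linear=True)  # every factor is above 1
print(f"vanishing: {vanishing:.3e}  exploding: {exploding:.3e}")
```

With factors below one the product shrinks geometrically with depth, and with factors above one it grows geometrically, which is the behaviour the paragraph attributes to deep back-propagation.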
From a biological perspective, back-propagation has received a great deal of criticism due to the implausibility of its implementation in the brain. Some of the key problems include: 1) the “weight transport problem”, where the feedback weights must be the same as (the transpose of) the feedforward weights, 2) the forward pass and the backward pass require different computations, and 3) the error gradients must be stored separately from the activations. To remedy the last two conditions, one could use a symmetrical network that is solely used for propagating errors (an error-propagation network). However, beyond the fact that two information pathways have been created, there is no known biological mechanism that allows the error network to know the weights of the feedforward network it is serving. The earlier described requirement of linearity also violates what we know about biological neurons, which interleave linear and non-linear operations. As argued by Bengio et al. (2015), if the brain were to use feedback paths as implemented by back-propagation, it would require precise knowledge of the derivatives of the non-linear activation functions, which is not possible since not all neurons are the same. Furthermore, discrete-valued or stochastic activations (such as sampling a Bernoulli or Categorical distribution) cannot be used, even though we know that real neurons communicate with spikes (binary values) and not by continuous values.
More critically, back-propagation requires a global feedback path to carry error derivatives across the system. This is due to the nature of supervised learning systems: an objective is grounded in input/output space and the global pathway is used to relate how internal processing elements affect the target. One problem with this, especially when used to generatively model data, is that most of the learner’s time is spent on surface-level properties of the data and not on extracting latent structure necessary for generalization. A good example is in speech processing, where the log likelihood cost used leads the model to focus mostly on acoustic details rather than higher-level linguistic features (Yoshua Bengio, presentation at ReWork: Deep Learning Summit Montreal, 2017).
, which posit that local computations occur at multiple levels of a (somewhat) hierarchical structure. However, if we were to violate the idea of a global feedback path, where would the targets then come from in order to create learning signals for the hidden processing elements? One thing is likely: we will no longer be able to rely on a loss function that operates primarily in the input space, which is at the core of supervised learning. This means that the learning approach we seek will require higher-level objectives, or objectives that operate at various levels of latent space. More importantly, in designing higher-level objectives to create local targets, we can better encourage a neural system to find hidden/abstract structure in data. This type of objective directly connects to representation learning, better embodying one of the key assumptions behind unsupervised learning: by observing a stream of data points, it is possible to derive the predictable systemic relationships between variables (these variables can be pixels in images of a video or the characters of a word in a sentence) as well as relationships between these relationships, i.e., more complicated, abstract patterns. Higher-level patterns are what a representation learning system seeks to uncover: latent variables, or intermediate concepts and features, that capture useful statistical regularities of the world that the intelligent system is embedded in. To this end, in this paper, the intuition behind our learning algorithm is to measure and reduce the discrepancy, or mismatch, between what representations a neural model can currently generate and representations that better describe the input.
Attempting to build models and algorithms that resolve some of the above criticisms might open the door to learning approaches that generalize better. However, while many variations of back-propagation and alternatives have been proposed, most work has only shown their usefulness on static problems, typically on classification. Yet we know that in the human brain, many active processes, including those related to vision and speech, take in sequences of input stimuli and attempt to build a dynamic mental model of the world (Rao and Ballard, 1997). Given this dynamic view of the brain, which constructs an implicit, abstract knowledge base that is representative of the structure of the observed environment (Reber, 1989; Destrebecqz and Cleeremans, 2001), we design our approach with an eye toward stateful problems. From a machine learning perspective, this is important given the success of recurrent neural models on sequential problems such as language modeling (Mikolov et al., 2010; Ororbia II et al., 2017b). Critical to the success of recurrent models is the action of unrolling, a mechanism that is clearly neurobiologically implausible. In order to implement back-propagation through time, one needs to unroll (or unfold a recursively defined operation into an explicit chain of events) the underlying computation graph over the length of a given input sequence, creating a longer global feedback path for error information to traverse. The brain, in contrast, is an incremental, adaptive process. As such, we investigate the viability of our model, which learns from sequences without any unrolling, and offer some promising evidence that our learning algorithm can match or outperform some powerful neural models that rely on back-propagation through time or advanced variations, such as neural variational inference (Mnih and Gregor, 2014).
Notably, our proposed approach allows the learning of a directed generative model, which is important given the causal structure of the universe, without the need to correct for the imperfections of an approximate inference model.
The contributions of this article are the following:
We propose the Temporal Neural Coding Network (TNCN), a temporal neural model, and its learning algorithm, Discrepancy Reduction, for learning dynamically from sequential data. Our model incorporates some basic notions from predictive coding (Rao and Ballard, 1999) theories of the brain, notably lateral competition among neural variables.
To create our model’s unsupervised learning algorithm, we draw inspiration from random feedback alignment (Lillicrap et al., 2014) and difference target propagation (Lee et al., 2015). To create local targets for the model’s higher-level objectives, we show that simple fixed projection functions can be used to create special error units that can generate local targets.
To evaluate our model and learning algorithm, we experiment with a video modeling problem and discover promising results with our learning approach.
This work can also be viewed as another contribution towards the long-term goal of finding more biologically plausible machine learning approaches to the credit assignment problem (Bengio et al., 2015). Specifically, we offer a rather simple approach to implementing and training sequential predictive neural models without back-propagation through time.
2 Motivation & Related Work
There has been a great deal of research in finding more biologically-plausible alternatives to back-propagation of errors. Classically, the online alternative to back-propagation of errors was real-time recurrent learning (RTRL; Williams and Zipser, 1989), which employs forward-mode differentiation to compute gradients. However, this algorithm scales poorly, i.e., quadratically in the number of parameters. Some algorithms have been proposed to reduce the complexity of RTRL, including the NoBackTrack (Ollivier et al., 2015) and Unbiased Online Recurrent Learning (Tallec and Ollivier, 2017) algorithms, but these are noisy and slow down the learning procedure in trying to approximate the way back-propagation through time works in an online fashion.
The contrastive divergence recipe (Hinton, 2006), well-known for being the primary learning algorithm of restricted Boltzmann machines, and the Wake-Sleep algorithm (Hinton et al., 1995; Bornschein and Bengio, 2014) for training deeper Boltzmann-based architectures, were largely inspired by the role of sleep in human learning. However, these approaches to learning, which rely on Markov Chain Monte Carlo methods, suffer from a variety of problems including slow convergence to steady-state distributions due to poor mixing of modes and the constraint that the weights of the model must be symmetric. With some success, Boltzmann-based architectures have been applied to stateful problems (Taylor et al., 2007; Sutskever et al., 2009; Boulanger-Lewandowski et al., 2012; Mittelman et al., 2014), but require hybridizing the Contrastive Divergence approach with back-propagation through time, incurring the limitations and criticisms of both algorithms. Other approaches inspired by Boltzmann-based learning (and energy-based learning in general) include the variational walkback algorithm (Goyal et al., 2016) and equilibrium propagation (Scellier and Bengio, 2017). However, these algorithms have only been investigated on static modeling problems and it is not clear how one might extend them to stateful problems.
Learning deep Boltzmann machines can be quite difficult, requiring all sorts of tricks to make the process work well and efficiently (Salakhutdinov and Larochelle, 2010; Ororbia II et al., 2015b). In response, one line of work has taken on a variational inference scheme (Kingma and Welling, 2013; Mnih and Gregor, 2014; Serban et al., 2017), where we train an approximate inference machine parametrized by a neural network. While efficient in learning probabilistic models of data, the success of the generative model under the variational inference framework depends largely on how good the inference model is. Specifically, the inference model constrains the generative model, and any deficiency in the inference model must then be made up by the generative model. Instead, we would like to reverse this scenario: the generative model adjusts itself, where generation can be used to prime the feedback mechanisms that will guide learning and adaptation. The proposed TNCN embodies this idea in the attempt to circumvent the need for approximate inference machinery.
Our algorithm is inspired by three different strands of research focused on finding viable alternatives to the biologically-implausible back-propagation of errors.
2.1 Random Feedback Alignment
Feedback alignment (Lillicrap et al., 2014) and its variants (Nøkland, 2016; Baldi et al., 2016) have shown that random feedback weights can deliver useful teaching signals. This random form of back-propagation has also been used to develop an event-driven variation of the learning rule suitable for neuromorphic implementations of neural networks (Neftci et al., 2017). More importantly, feedback alignment algorithms resolve the weight-transport problem, a long-standing criticism of back-propagation (Grossberg, 1987; Liao et al., 2016), by showing that coherent learning is possible with asymmetric forward and backward pathways. That is, the error back-projection pathways need not be the transpose of the weights used to carry out forward propagation, and the learning process can instead be viewed as the alignment of feedforward weights with feedback weights.
Feedback alignment, however, suffers from several problems: 1) during the alignment phase, a given layer cannot learn before the upper layers have approximately “aligned”, 2) the procedure operates much like (supervised) greedy layerwise learning where each layer only learns something that is useful for a linear classifier but does not globally optimize or offer any coordination among the layers (personal communication, Yoshua Bengio). The TNCN’s learning algorithm deviates from feedback alignment in that it uses error feedback weights to create potentially better target representations instead of replacing the derivatives normally computed by back-propagation. Furthermore, the TNCN does not strive to learn by an algorithm that works approximately like back-propagation (as approaches like feedback alignment and contrastive Hebbian learning (Xie and Seung, 2003) do), which requires a global feedback path to conduct credit assignment.
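For readers unfamiliar with feedback alignment, the following is a minimal sketch of the idea on a toy regression task, assuming a two-layer tanh network. The layer sizes, learning rate, data, and all variable names are hypothetical and are not taken from Lillicrap et al. (2014) or from this paper; the only point is that the hidden-layer error is delivered through a fixed random matrix `B` rather than through the transpose of the forward weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, n_samples = 4, 8, 2, 64

W1 = rng.normal(0.0, 0.1, (n_hid, n_in))   # forward weights, layer 1
W2 = rng.normal(0.0, 0.1, (n_out, n_hid))  # forward weights, layer 2
B = rng.normal(0.0, 0.1, (n_hid, n_out))   # fixed random feedback weights
B_init = B.copy()

# Toy task (hypothetical): regress onto a fixed random linear map.
X = rng.normal(0.0, 1.0, (n_in, n_samples))
Y = rng.normal(0.0, 1.0, (n_out, n_in)) @ X

def loss():
    return float(np.mean((W2 @ np.tanh(W1 @ X) - Y) ** 2))

initial_loss = loss()
lr = 0.05
for _ in range(300):
    h1 = np.tanh(W1 @ X)
    e = W2 @ h1 - Y                        # output error
    # Feedback alignment: propagate e through B, not through W2.T.
    delta1 = (B @ e) * (1.0 - h1 ** 2)
    W2 -= lr * (e @ h1.T) / n_samples
    W1 -= lr * (delta1 @ X.T) / n_samples
final_loss = loss()
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

Note that `B` is never updated, and that the hidden-layer signal still travels along a global path from the output error, which is the property the TNCN is designed to avoid.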
2.2 Recirculation & Target Propagation
Recirculation (Hinton and McClelland, 1988; O’Reilly, 1996), the predecessor to target propagation and originally proposed for a single hidden layer auto-encoder, uses the datum as the target value for reconstruction (which affects the decoder) and the initial encoded representation of the datum as the target for the encoder, which is computed after a second forward pass. One key requirement for recirculation is that the weights of the encoder and decoder are symmetric; however, the learning process encourages these weights to automatically self-align to approximate the symmetry.
Difference target propagation (Lee et al., 2015) brings forth the idea that a learning signal might be created by calculating targets rather than loss gradients at each layer. This allows for the development of local learning rules, removing the need for a global error pathway to carry error derivatives across (and thus side-stepping the vanishing gradient problem). Furthermore, some connections can be made between target propagation and Spike-Time-Dependent-Plasticity (Andrew, 2003). The general approach in target propagation is that the feedback weights are trained to learn the inverse of the feedforward mapping. This has also been roughly applied to training recurrent network models (Wiseman et al., 2017) but still requires unrolling the computation graph along the length of the sequence. Target propagation also requires a few things to work, notably that every layer in the network model must be an autoencoder and that a linear correction term is used to account for the imperfectness of auto-encoders (which can obstruct learning).
It is important to note that target propagation permits the use of non-linearities that output discrete values (or Bernoulli sampled activations). The TNCN also possesses this useful property, since the element-wise functions used to compute neuronal post-activations no longer need to be differentiable. This allows us to work with highly non-linear transformations where the gradients are often near zero, for example, stochastic binary units.
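The linear correction term mentioned above can be sketched as follows. This is a minimal illustration of the difference target propagation rule for a single pair of layers; the layer sizes, the tanh mappings, and the names `f`, `g`, and `dtp_target` are assumptions made for this example rather than the notation of Lee et al. (2015).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.5, (3, 5))  # feedforward weights, lower (5) -> upper (3)
G = rng.normal(0.0, 0.5, (5, 3))  # feedback weights, an imperfect inverse of f

def f(h_lower):
    return np.tanh(W @ h_lower)

def g(h_upper):
    return np.tanh(G @ h_upper)

def dtp_target(h_lower, h_upper, target_upper):
    """Lower-layer target: g(target) plus the correction h_lower - g(h_upper)."""
    return h_lower + g(target_upper) - g(h_upper)

h_lower = rng.normal(size=5)
h_upper = f(h_lower)
# Hypothetical upper-layer target: nudge the activity toward a desired output.
target_upper = h_upper - 0.1 * (h_upper - np.ones(3))

target_lower = dtp_target(h_lower, h_upper, target_upper)
local_error = target_lower - h_lower  # layer-local learning signal
```

The correction guarantees that when the upper layer already matches its target, the lower layer's target equals its current activity, even if `g` is a poor inverse of `f`.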
2.3 Local Learning & Greedy Layerwise Training
The desire for useful local learning, toward which target propagation represents a strong modern step, is not new. It saw a small resurgence in the early days of training deeper networks in the form of layer-wise training of unsupervised models (Bengio et al., 2007), supervised models (Lee et al., 2014), and semi-supervised models (Ororbia II et al., 2015c; Ororbia II et al., 2015b), also known as hybrid training. However, among the many problems with these early approaches to deep learning was the lack of global coordination. Global coordination means that higher-level layers essentially direct lower-level layers in what patterns they should be extracting. A lower-level feature detector might be able to find different aspects of structure in its input since multiple patterns might satisfy its layer-wise objective. However, this detector will only locate the right bit of structure needed for the whole model to make sense, at any time-step, if a higher-level layer signals what pattern it should be looking for. Since greedy layer-wise approaches build the model from the bottom-up, freezing the learned lower-level parameters, this coordination is impossible to achieve.
The TNCN’s localized learning approach was also motivated by the simple Bottom-Up-Top-Down learning algorithm (Ororbia II et al., 2015a), which showed that a stack of Boltzmann network modules (and other simple, auto-associative variations) could be learned in a “pseudo-joint” layerwise fashion. However, in order to build in some form of global coordination, a variation of back-propagation of errors was used, ultimately creating a global feedback path as part of the overall learning procedure. A more global approach was later presented in Ororbia II et al. (2015b), incorporating top-down information much like that in Salakhutdinov and Larochelle (2010); however, these algorithms were only built for and studied on stateless problems. Furthermore, these approaches would be difficult to scale when extended to sequential modeling problems given their strong dependence on Markov Chain Monte Carlo sampling.
2.4 Predictive Coding
Predictive coding theories posit that the brain is in a continuous process of creating and updating hypotheses that predict the sensory input it receives, directly influencing conscious experience (Panichello et al., 2013). Models of sparse coding (Olshausen and Field, 1997) and predictive coding (Rao and Ballard, 1999) embody the idea that the brain is a directed generative model where the processes of generation (top-down mechanisms) and inference (bottom-up mechanisms) are intertwined (Rauss and Pourtois, 2013) and interact to perform a sort of iterative inference of latent variables/states. Furthermore, nesting the ideas of predictive coding within the Kalman Filter framework (Rao and Ballard, 1997) can create dynamic models that handle time-varying data. Many variations and implementations of predictive coding have been developed (Chalasani and Principe, 2013; Lotter et al., 2016; Santana et al., 2017). Some approaches, such as predictive sparse decomposition (Kavukcuoglu et al., 2010), attempt to speed up the iterative inference by introducing an inference network, but this again creates a problem similar to that of variational inference: the generative model is constrained by the quality of the approximate inference model.
One key concept behind predictive coding that our own work embodies is that, for a multi-level objective to work well, each layer of a neural architecture would need an error feedback mechanism that could communicate the needs of the layer below it. If the learning signals are moved closer to the layers themselves, the error connections can directly transmit the information to the right representation units. Very importantly, this allows us to side-step the vanishing gradient problem that plagues pure back-propagation, where the internal layers of the architecture are trying to satisfy an objective that they only indirectly influence. If we were to compare the updates from this local learning approach to back-propagation, the updates would still ascend/descend towards a similar objective, just not the steepest ascent/descent, so long as they were within 90 degrees of the direction given by back-propagation. However, since steepest ascent/descent is a greedy form of optimization, updates from a more localized approach might lead to superior generalization results (van den Broeke, 2016).
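The claim about update directions within 90 degrees of the gradient can be checked numerically. The sketch below uses a hypothetical convex objective and a hypothetical "local" update direction, both chosen only for illustration; it relies on the standard fact that any direction with a positive inner product with the gradient reduces the loss for a sufficiently small step.

```python
import math

def loss(w):
    # Hypothetical convex objective standing in for a training loss.
    return 0.5 * (3.0 * w[0] ** 2 + w[1] ** 2)

def grad(w):
    return [3.0 * w[0], w[1]]

def angle_deg(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(dot / norms))

w = [1.0, 2.0]
g = grad(w)                # direction back-propagation would supply
local_update = [0.5, 3.0]  # hypothetical direction supplied by a local rule

step = 0.01
w_new = [wi - step * di for wi, di in zip(w, local_update)]
angle = angle_deg(g, local_update)
print(f"angle: {angle:.1f} degrees, loss: {loss(w):.4f} -> {loss(w_new):.4f}")
```

The guarantee is only first-order: with a large step size, even the exact gradient direction can increase the loss.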
Our TNCN takes an adaptive, state-corrective approach similar to the dynamic predictive coding model proposed by Rao and Ballard (1997). The general idea is to let the model first make a prediction and generate what it thinks the current frame or symbol will be at time t. The errors are computed for each layer via some feedback mechanism, starting from the sensors, which have direct access to the state of the world, and used to correct the model’s internal states before moving on to predict the next time-step. A learning signal is created by comparing the corrected states to the initially predicted states. Since intra-layer competition among neuronal units is important (for reasons we will describe later), we also follow in the spirit of Olshausen and Field (1997) and encourage sparsity indirectly through a penalty/constraint.
We combine the basic ideas described above to propose an algorithm for learning a temporal neural model, the Temporal Neural Coding Network. Generally, proposed alternative algorithms are tested on stateless data classification problems, such as the well-known MNIST digit recognition dataset. In contrast, we investigate the potential of our algorithm on sequential/temporal problems. Our approach can be viewed as an online, adaptive approach to learning, requiring no unrolling as back-propagation through time does, since the TNCN is continuously engaged in self-correction, or rather, minimizing the total discrepancy between its expectations and targets.
3 Learning a Temporal Neural Coding Network
Let us begin by formally defining a TNCN, at time t, with three layers of neural variables, one of which comprises the output sensors that directly connect the model with the environment/world. The TNCN architecture distinguishes between two sets of parameters: the generative/predictive parameters and the fixed error feedback parameters. We additionally define the pre-activations of each latent variable.
To calculate the necessary statistics for one step of error correction, we first define the model’s generated prediction for any (internal) arbitrary layer to be:
which assumes that any pre-activation is a linear combination of a filtration and a top-down (expectation) bias. The error units do two things: 1) create a local representation target by using information from the error units of the layer below and the TNCN’s initial guess of the representation, 2) measure the discrepancy between this target and the corresponding layer pre-activation. Specifically, an error unit at a given layer within the model is computed as follows:
noting that the second formula depicts how the error units compute a latent target using the error feedback weights. The formulas involve two functions: the post-activation function applied at each layer, and the element-wise function (in this paper, the hyperbolic tangent, scaled optimally according to LeCun et al. (1989)) applied to the information coming from the error units below, meaning that the representation target is a non-linear function of the representation guess and the error fed back from below. This equation is reminiscent of the single corrective step found in dynamic predictive coding models formulated as Kalman Filters (Rao and Ballard, 1997). Note that these error feedback weights are fixed, as those of Lillicrap et al. (2014) were, but they do not necessarily have to be (future work shall investigate neuro-biologically grounded ways of evolving these feedback weights). However, the feedback weights of Lillicrap et al. (2014) were used to carry the partial derivatives across layers (much like a short-circuit) and ultimately serve as a global feedback path, whereas the weights we propose are meant to help correct or update the currently guessed representation (which itself is a function of past information and the top-down generative weights of the layer above) and create local targets useful for subsequent learning. It is important to note that the error weights transmit error information behind the non-linear activation function for any layer. Once the correct pre-activation target for any layer has been calculated, the final corrected representation is simply a re-application of the post-activation function for that layer,
(this will then be used as the vector summary of the past when moving on to the next time step). A critical advantage of this proposed way of wiring the feedback connections (which is a unique departure from standard predictive coding models) is that we are now free to use any differentiable or non-differentiable non-linearity we like. This can include other sampling operations not amenable to the re-parametrization trick as well as discrete-valued activation functions (such as the classical hard threshold function).
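The predict-then-correct cycle described in this section can be sketched roughly as follows. This is one plausible reading of the prose, not the paper's equations: the parameter names `U`, `V`, `E`, the layer widths, the identity output function at the sensors, and the additive form of the target (guess plus a squashed, fed-back error) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [6, 5, 4]  # hypothetical widths: sensors, latent layer 1, latent layer 2

# Hypothetical parameter names (not the paper's notation).
U = [rng.normal(0, 0.1, (sizes[l], sizes[l + 1])) for l in (0, 1)]       # top-down
V = [None] + [rng.normal(0, 0.1, (sizes[l], sizes[l])) for l in (1, 2)]  # recurrent
E = [None] + [rng.normal(0, 0.1, (sizes[l], sizes[l - 1])) for l in (1, 2)]  # fixed feedback

def phi(h):
    return np.tanh(h)  # post-activation function (assumed)

def predict(z_prev):
    """Pre-activation guesses: recurrent filtration plus top-down expectation."""
    h = [None, None, None]
    h[2] = V[2] @ z_prev[2]
    h[1] = V[1] @ z_prev[1] + U[1] @ phi(h[2])
    h[0] = U[0] @ phi(h[1])  # sensor layer has no recurrent connections
    return h

def correct(h, x):
    """Build local targets from the error below, then re-apply phi."""
    e = [x - h[0], None, None]  # sensors see the world directly
    y = [x, None, None]
    for l in (1, 2):
        y[l] = h[l] + np.tanh(E[l] @ e[l - 1])  # target behind the non-linearity
        e[l] = y[l] - h[l]                      # local discrepancy
    z = [x] + [phi(y[l]) for l in (1, 2)]       # corrected states
    return e, y, z

z_prev = [np.zeros(n) for n in sizes]
for x in rng.uniform(0.0, 1.0, (3, sizes[0])):  # three toy "frames"
    h = predict(z_prev)
    e, y, z = correct(h, x)
    z_prev = z  # corrected states summarize the past for the next step
```

Because the correction acts on pre-activations, `phi` is never differentiated in this sketch, so it could be swapped for a non-differentiable function such as a hard threshold.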
We describe the TNCN’s learning procedure, Discrepancy Reduction, next.
3.1 The Discrepancy Reduction Algorithm
To compute the gradients of model parameters once latent representations have been corrected, we exploit the local learning signals that naturally arise from the way our error units work. The cost function that measures the total discrepancy within a TNCN composed of multiple layers of latent variables, applied to real-valued input distributions, can be naively formulated as follows:
We fix the variance to one; however, the variances can instead be treated as additional parameters to be learned (details will appear in the appendix). If we want to take an information-theoretic view, which aligns better with the idea of reducing internal discrepancy in the system, we can instead use the Kullback-Leibler divergence (Kullback and Leibler, 1951) to measure the dissimilarity between the guess and target for each local representation:
noting that the guess and the target each have their own variance (both of these are diagonal covariance matrices, and fixing these to vectors of ones further simplifies the expression to look quite similar to Equation 3). The leftmost term of the right-hand side of the equations for the loss is the partially grounded term in input space, while the rest of the terms are the higher-level terms (albeit simple and perhaps crude). Note that, internally, the TNCN architecture will readily compute the representation pre-activity targets each time a sequential element is presented.
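For reference, the standard closed form of the Kullback-Leibler divergence between two diagonal Gaussians makes the simplification under unit variances explicit. The symbols below are hypothetical notation introduced only for this note ($\mu_g, \sigma_g^2$ for the guess and $\mu_y, \sigma_y^2$ for the target), and the direction of the divergence is an assumption; with equal unit variances the result is the same in either direction.

```latex
% Hypothetical notation: guess (\mu_g, \sigma_g^2), target (\mu_y, \sigma_y^2).
\mathrm{KL}\big(\mathcal{N}(\mu_g, \operatorname{diag}\sigma_g^2)\,\|\,
                \mathcal{N}(\mu_y, \operatorname{diag}\sigma_y^2)\big)
  = \frac{1}{2}\sum_i \left[ \log\frac{\sigma_{y,i}^2}{\sigma_{g,i}^2}
      + \frac{\sigma_{g,i}^2 + (\mu_{g,i}-\mu_{y,i})^2}{\sigma_{y,i}^2} - 1 \right]

% With \sigma_g^2 = \sigma_y^2 = \mathbf{1}, the log term vanishes and the
% variance ratio cancels the -1, leaving a squared-error term:
\mathrm{KL} = \frac{1}{2}\sum_i (\mu_{g,i}-\mu_{y,i})^2
            = \frac{1}{2}\,\lVert \mu_g - \mu_y \rVert_2^2
```

This is why the unit-variance case reduces to a sum of layer-wise squared discrepancies.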
If we find the first-order partial derivatives of Equation 3 with respect to the generative weights, we get the following updates (assuming we use stochastic gradient ascent as the update rule):
The updates include a noise process that is directly applied to the estimated parameter gradient. Such a process can be zero-mean Gaussian noise (with a scalar variance chosen through cross-validation). Note that the input layer does not sport any recurrent connections (but if it did, these could be considered auto-regressive connections that relate input variables to past input variables) and is simply defined as:
where we simply set the gradient to zero, since setting the corresponding parameter matrix to zero will effectively delete these recurrent input connections (we did this simply to avoid the situation where the dimensionality of the input is large, which would require even larger parameter matrices). One favorable property of the Discrepancy Reduction learning algorithm for learning TNCNs is the (partial) parallelism one may exploit in calculating parameter gradients, much like the goal of Jaderberg et al. (2016). One simply needs to run the TNCN’s generation and target calculation procedures to get the guesses and targets, but once these statistics have been computed one can treat each layer as a computation sub-graph, the gradient estimates of which are independent of any other layer (this means that we could design each layer to be more intricate and complex than what is used in this paper).
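The layer-wise independence noted above can be illustrated with a small sketch. It assumes, hypothetically, that each layer's pre-activation is linear in its weights, so that a squared-discrepancy objective yields an outer product between that layer's error units and the activities feeding it; the function name `local_gradient`, the layer names, and the noise level are placeholders, not the paper's update equations.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_gradient(error, presyn, noise_std=0.01):
    """Outer-product gradient estimate for one layer plus zero-mean Gaussian noise."""
    grad = np.outer(error, presyn)
    return grad + rng.normal(0.0, noise_std, grad.shape)

# Hypothetical per-layer statistics left behind by one predict/correct pass.
layers = {
    "layer1": {"error": rng.normal(size=5), "presyn": rng.normal(size=4)},
    "layer2": {"error": rng.normal(size=4), "presyn": rng.normal(size=4)},
}

# Each entry depends only on that layer's own statistics, so these calls
# could run in any order or in parallel.
grads = {name: local_gradient(s["error"], s["presyn"]) for name, s in layers.items()}
```

No quantity computed for one layer is needed by another, in contrast to back-propagation, where each layer's gradient waits on the layers above it.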
In predictive coding theories of the brain, it is often assumed that sparsity is a key ingredient. This means we seek representations of data where only a small subset of the latent variables have non-zero values. If the TNCN is to disentangle concepts in its latent representations, the need for sparsity makes sense since it is reasonable to assume that only a few out of the many possible concepts/variables explain any given datum (useful in tasks such as corrupt image denoising (Cho, 2013)). From a theoretical perspective on feature learning, we desire compact representations of the input in which no information is lost regarding the input (Bengio, 2013; van Rooyen and Williamson, 2015). Dense representations, though rich, are highly entangled since small changes in the input can lead to big changes in the representation vector. Sparse representations, in contrast, are robust and mostly conserve the set of non-zero features (Glorot et al., 2011). In this paper, we only enforce a “weak” form of lateral competition over the neuronal variables through a simple Laplacian prior distribution over the pre-activities. During the learning phase/step, lateral competition patterns (where the neurons fight to be active) get internalized in the model parameters via the prior term. (It is important to note that the simple way we encourage sparsity does not mean that all representations of the TNCN are guaranteed to be sparse. It is quite possible that the model could produce dense representations for data points outside the training sample since only during training is sparsity encouraged. One way to remedy this would be to ensure that the Laplacian prior is active during inference, as in classical sparse coding.)
To better encourage sparsity, other prior distributions, such as the spike-and-slab distribution (Goodfellow et al., 2012) (to avoid controlling the pre-activation magnitudes), or architectural modifications (Szlam et al., 2011) might improve generalization and will be the subject of future work. We found in preliminary experiments that sparsity was indeed a necessary component in improving performance.
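As a small illustration of how a Laplacian prior over pre-activities acts as a sparsity pressure, the sketch below writes the prior as an L1 penalty and takes one descent step on it. The strength `lam`, the step size, and the function names are hypothetical; how the paper folds this term into its parameter updates is not shown here.

```python
import numpy as np

def laplace_penalty(h, lam=0.1):
    """Negative log of a Laplace prior over pre-activities, up to a constant."""
    return lam * np.sum(np.abs(h))

def laplace_penalty_grad(h, lam=0.1):
    """Subgradient of the penalty: a constant-magnitude pull toward zero."""
    return lam * np.sign(h)

h = np.array([0.5, -1.0, 2.0, 0.02])
h_new = h - 0.1 * laplace_penalty_grad(h)  # one descent step on the penalty alone
```

Unlike a quadratic penalty, the pull has the same magnitude for small and large pre-activities, which is what drives small values toward exactly zero.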
With all of the above taken together, the full algorithm for the TNCN (generation, representation correction, and parameter updating) is depicted in Algorithm 3. The TNCN (or rather its generation/inference mechanisms) and its learning algorithm, are intricately tied together, since the learning procedure will make use of the representation targets created by the architecture’s error-driven correction mechanism. This operates in the spirit of predictive coding which posits that the brain’s generation and inference procedures interact to formulate local learning signals. The mechanism we use for target creation is rather simple, and future work should investigate more sophisticated mechanisms (especially ones with evolving error weights). As is depicted in Algorithm 3, the representation-correction mechanism can be extended to a process where targets can be iteratively refined, in the hopes of shortening the overall learning phase.
With respect to higher-level objectives, we can see that the error units play a crucial role: they are in fact the first-order derivatives of the Gaussian log likelihood (with fixed unit variance). Learning is simple since the error units can be easily re-used to calculate parameter gradients incrementally (when combined with the competition prior), and the only activation function derivative required in this approach is that of the output distribution model (which can be easily worked out for commonly used output distributions, such as the Gaussian, Bernoulli, and Categorical distributions). Note that better error units could be derived if one chose a different tactic for measuring the distance between predicted and corrected representation layers (for example, one could measure the Manhattan distance instead of the Euclidean distance used in our framework). However, the general idea is that the TNCN is engaged in ensuring that its layer-wise representations are as close as possible to those suggested by the error units: it is optimizing not only in the input space but also in the latent space, giving us some rough measure of the quality of the model’s internal representations. In some sense, this bears a loose resemblance to the bottom-up-top-down algorithm of Ororbia II et al. (2015a), which proposed a non-greedy way of learning a set of layer-wise experts. Through the feedback mechanism and the top-down generation paths, the local learning rules of the TNCN gain some form of global coordination, which was lacking in the greedy approaches of the past (Bengio et al., 2007; Ororbia II et al., 2015c) when training deep belief networks and their hybrid variants.
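The claim that the error units are first-order derivatives of a unit-variance Gaussian log likelihood can be checked numerically. The following is a minimal sketch, not the paper's implementation: the function names and the toy vectors are assumptions introduced here for illustration, and the error unit is taken to be the difference between a target (corrected) representation and its prediction.

```python
import math


def gaussian_log_likelihood(target, prediction):
    """Log-density of `target` under a Gaussian with mean `prediction`
    and fixed unit variance (identity covariance)."""
    dim = len(target)
    squared = sum((t - p) ** 2 for t, p in zip(target, prediction))
    return -0.5 * squared - 0.5 * dim * math.log(2.0 * math.pi)


def error_units(target, prediction):
    """Analytic gradient of the log likelihood w.r.t. the prediction:
    simply the mismatch (target - prediction)."""
    return [t - p for t, p in zip(target, prediction)]


def finite_difference_gradient(target, prediction, eps=1e-6):
    """Central-difference estimate of the same gradient, used as a check."""
    grad = []
    for i in range(len(prediction)):
        up = list(prediction)
        down = list(prediction)
        up[i] += eps
        down[i] -= eps
        diff = gaussian_log_likelihood(target, up) - gaussian_log_likelihood(target, down)
        grad.append(diff / (2.0 * eps))
    return grad


# Hypothetical toy values, chosen only for illustration.
target = [0.2, 0.9, 0.5]
prediction = [0.1, 0.7, 0.8]
analytic = error_units(target, prediction)
numeric = finite_difference_gradient(target, prediction)
```

Under these assumptions, `analytic` and `numeric` agree up to floating-point error. Swapping the squared (Euclidean) distance for a Manhattan distance would instead yield the sign of the mismatch as the error signal.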
It is important to highlight that learning and inference under this model are ideally supposed to be continuous, meaning that the model simultaneously generates expectations and then corrects itself (both representations and parameters) each time a new datum from a sequence is presented. This makes the model directly suited to learning incrementally from data streams. Furthermore, the TNCN shows how two types of recurrence/feedback are at play when modeling sequences: 1) the model is recurrent across the temporal axis because it is stateful, in that each processing layer depends on a vector summary of the past, and 2) the model is structurally recurrent, similar to deep Boltzmann machines and Hopfield Networks (Hopfield, 1982), since error is fed back in order to automatically correct guessed representations.
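The predict-then-correct pattern described above can be illustrated with a deliberately tiny stand-in. The sketch below is not the TNCN: it swaps in a one-parameter linear predictor trained with a delta rule, and the names `run_stream`, `weight`, and `learning_rate` are assumptions made for this example. It only shows the control flow of generating an expectation, measuring the mismatch, and updating immediately for every datum, with no stored history and no backward pass through time.

```python
def run_stream(stream, learning_rate=0.1):
    """Predict each datum from the previous one, then correct immediately."""
    weight = 0.0      # single parameter of the toy predictor
    previous = None   # state carried across time: the last datum seen
    errors = []
    for datum in stream:
        if previous is not None:
            prediction = weight * previous               # generate an expectation
            error = datum - prediction                   # local mismatch signal
            weight += learning_rate * error * previous   # update right away
            errors.append(error)
        previous = datum                                 # carry state forward
    return weight, errors


# Hypothetical stream: a signal that flips sign at every step,
# so the ideal weight is -1.
stream = [(-1.0) ** t for t in range(101)]
weight, errors = run_stream(stream)
```

On this toy stream the weight approaches -1 and the magnitude of the prediction error shrinks step by step, which is the behavior one would hope for from any incremental, error-driven learner.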
4 Experimental Results
4.1 The Bouncing Balls Process
To test our proposed TNCN architecture and its learning algorithm, Discrepancy Reduction, we benchmark performance on the bouncing balls dataset, following the experimental setup of Sutskever et al. (2009). This high-dimensional dataset was created by simulating the rudimentary physics of three balls bouncing around in a box. We generate a training set of 4000 sequences and a test set of 200 sequences, as well as another 200 sequences that serve as a development (validation) set. Furthermore, our models are given no prior knowledge of the task (e.g., convolutional weight matrices), much as was done in Taylor et al. (2007). On this dataset, Frame t-1 is the simplest possible baseline: a model that predicts the next frame to be simply the previously seen frame.
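The Frame t-1 baseline is simple enough to state in a few lines. The sketch below assumes each frame is a flat list of pixel intensities and that the reported quantity is the squared error summed over pixels and averaged over predicted frames; that normalization, like the function name, is an assumption made here for illustration.

```python
def frame_tm1_baseline_error(sequence):
    """Average per-frame squared error when frame t is predicted
    to be an exact copy of frame t-1.

    `sequence` is a list of frames; each frame is a flat list of floats.
    """
    if len(sequence) < 2:
        raise ValueError("need at least two frames to make a prediction")
    total = 0.0
    for previous, current in zip(sequence, sequence[1:]):
        # Squared error summed over all pixels of one predicted frame.
        total += sum((c - p) ** 2 for c, p in zip(current, previous))
    return total / (len(sequence) - 1)


# Hypothetical two-pixel "videos", for illustration only.
moving = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
static = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
moving_error = frame_tm1_baseline_error(moving)
static_error = frame_tm1_baseline_error(static)
```

A sequence that never changes yields zero error, so any useful model has to beat this baseline on the frames where the balls actually move.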
We trained TNCNs with two and three layers of latent variables, searching for the size of the layers over the range (with performance measured on the validation subset). In this experiment, the logistic sigmoid activation function ultimately proved to be the most useful (we also experimented with other nonlinearities, but found that these did not work as well). Parameters were initialized from centered Gaussian distributions with (except in the three-layer TNCN, where setting the top-level recurrent and generative weights using improved performance a bit). Error feedback parameters were initialized with centered Gaussians of (again, in the case of the three-layer model, we used for the top layer). The value of the gradient noise was set to (to control the stochastic approximation of the prior over weights). was to . (We also experimented with a naive approach by fixing and found that performance slightly worsened. We believe that, to properly use iterative inference, we would need to employ an adaptive iterative inference schedule: in the early stages of learning the model is still learning to use its error feedback weights, but in later stages one should raise gradually to give the model the opportunity to iteratively refine its representations.) Gradients (at each time step) were estimated using mini-batches of 50 samples (across 50 parallel videos). Parameters were updated by gradient descent using the Adam (Kingma and Ba, 2014) adaptive learning rate scheme with the step size fixed to . We further apply norm-rescaling to the gradients computed by Discrepancy Reduction (threshold is ) (Pascanu et al., 2013) and take the Polyak average (Polyak and Juditsky, 1992) of the model at its best performance on the validation subset (i.e., early stopping).
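Two of the training heuristics mentioned above, gradient norm-rescaling and Polyak averaging, can be sketched generically. This is a minimal illustration rather than the authors' code: the threshold value used in the example is hypothetical (the value used in the experiments is not reproduced here), parameters are represented as flat lists, and the running-mean form of the Polyak average is one common choice.

```python
import math


def rescale_by_norm(gradient, threshold):
    """Rescale `gradient` so its Euclidean norm never exceeds `threshold`."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in gradient]
    return list(gradient)


class PolyakAverage:
    """Running arithmetic mean of the parameter vectors seen so far."""

    def __init__(self):
        self.mean = None
        self.count = 0

    def update(self, parameters):
        self.count += 1
        if self.mean is None:
            self.mean = list(parameters)
        else:
            # Incremental mean: m <- m + (p - m) / n
            self.mean = [m + (p - m) / self.count
                         for m, p in zip(self.mean, parameters)]
        return self.mean


# Hypothetical values, for illustration only.
clipped = rescale_by_norm([3.0, 4.0], threshold=1.0)   # norm 5 is rescaled to 1
untouched = rescale_by_norm([0.3, 0.4], threshold=1.0)  # norm 0.5 is left alone

averager = PolyakAverage()
averager.update([1.0, 1.0])
averaged = averager.update([3.0, 5.0])
```

Rescaling keeps the direction of the gradient while bounding its length, and the averaged parameters, rather than the last iterate, are the ones that would be evaluated.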
We report our models’ average squared next-step prediction error per frame (over 20 trials) in Table 1 and compare against previously reported errors. The proposed TNCN performs better than the Boltzmann-based models. Furthermore, we see that the inclusion of an additional hidden layer actually helps the directed model, pushing it to nearly the same level as a deep temporal sigmoid belief network. Note that all of the models we compare our TNCN to utilize back-propagation through time as a core mechanism, while our approach is incremental and adaptive. To further improve the performance of our model, we believe that an adaptive iterative inference scheme and learnable variance parameters are key ingredients.
What is most surprising is that our simple way of building non-linear error units was effective in creating useful local representation targets. This is evidenced by the fact that performance vastly improves upon adding a layer of these types of neurons to the top-down generative model. What this might mean is that the TNCN is making use of the generated local targets and trying to minimize the mismatch between its initial guess of the representation (conditioned on past corrected representations) and the error-corrected representation. Since each layer higher up in the network aims to do a better job at explaining the layer representation below, this local target becomes useful during learning and inference. The target in effect helps keep the model on track as it updates its latent representations given the sequence data it encounters, step by step.
In this paper, we proposed a novel neural architecture, the Temporal Neural Coding Network (TNCN), and its learning algorithm, Discrepancy Reduction. To derive our idea, we drew inspiration from several strands of work that seek biologically-plausible alternative learning algorithms that generalize better to out-of-sample data. Furthermore, we connected these various research paths with the aim of tackling the difficult problem of learning representations of data in an unsupervised manner. With this target in mind, we argued for the use of higher-level objectives, which can be interpreted as local learning rules that still result in a globally coherent model. We developed our model with the goal of learning from streams of sequential data, inspired by the dynamic variations of predictive coding theories of the brain. On one generative modeling benchmark, the bouncing balls problem, we showed that our model, using an incremental, adaptive procedure, can compete with various models that use back-propagation through time.
Breaking free from the global feedback path required in back-propagation, as we do in this work, brings us closer to building models that are better suited for true unsupervised learning (Barlow, 1989). Unsupervised learning requires the computer to capture all possible dependencies between all observed variables, since inputs are no longer distinguished from outputs, as they are in supervised learning. It is unsupervised learning that is more closely related to how humans learn, where much of the incoming data does not come with labels (or, at best, comes with very few labels, as mentioned in Ororbia II et al., 2015c, a). A successful unsupervised learning system is one that can discover all of the useful concepts and underlying causes to explain what it perceives (LeCun et al., 2015), just as an infant must do, by observation alone.
More importantly, learning generic representations in an unsupervised system would further free us from the rather inflexible models created by the task-specific nature of supervised learning. Since downstream supervised/reinforcement learning approaches focus on task-specific measurements, the objectives used in unsupervised learning must instead attempt to measure the quality of the generic representations acquired by the generative models we train. Defining what constitutes a good-quality general representation is itself an open problem and an active area of theoretical research (van Rooyen and Williamson, 2015; McNamara et al., 2016). However, in lieu of the ideal metric, we took a small step towards higher-level objectives by literally interpreting this concept as a set of simple reconstruction loss terms that measure how well our neural architecture can predict a set of representation targets. Far better performance can be reached, we hypothesize, if better representation measurements and metrics can be developed.
We are continuing to run experiments on other datasets to show the generality of our architecture and learning algorithm, and ultimately seek to use the learned generative models in downstream supervised learning tasks. Furthermore, we believe that our incremental, adaptive algorithm is better suited to streaming problems, more commonly found in the online learning setting (Ororbia II et al., 2015b), which we argue is important when considering the even greater challenge of lifelong learning (Mitchell et al., 2015).
We would like to thank Yoshua Bengio for useful feedback.
- Andrew (2003) Andrew, A. M. (2003). Spiking neuron models: Single neurons, populations, plasticity. Kybernetes, 32(7/8).
- Baldi et al. (2016) Baldi, P., Sadowski, P., and Lu, Z. (2016). Learning in the machine: Random backpropagation and the learning channel. arXiv preprint arXiv:1612.02734.
- Barlow (1989) Barlow, H. B. (1989). Unsupervised learning. Neural computation, 1(3):295–311.
- Bengio (2013) Bengio, Y. (2013). Deep learning of representations: Looking forward. In International Conference on Statistical Language and Speech Processing, pages 1–37. Springer.
- Bengio et al. (2007) Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H., et al. (2007). Greedy layer-wise training of deep networks. Advances in neural information processing systems, 19:153.
- Bengio et al. (2015) Bengio, Y., Lee, D.-H., Bornschein, J., Mesnard, T., and Lin, Z. (2015). Towards biologically plausible deep learning. arXiv preprint arXiv:1502.04156.
- Bornschein and Bengio (2014) Bornschein, J. and Bengio, Y. (2014). Reweighted wake-sleep. arXiv preprint arXiv:1406.2751.
- Boulanger-Lewandowski et al. (2012) Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2012). Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. arXiv preprint arXiv:1206.6392.
- Chalasani and Principe (2013) Chalasani, R. and Principe, J. C. (2013). Deep predictive coding networks. arXiv preprint arXiv:1301.3541.
- Cho (2013) Cho, K. (2013). Simple sparsification improves sparse denoising autoencoders in denoising highly corrupted images. In International Conference on Machine Learning, pages 432–440.
- Clark (2013) Clark, A. (2013). Whatever next? predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 36(3):181–204.
- Destrebecqz and Cleeremans (2001) Destrebecqz, A. and Cleeremans, A. (2001). Can sequence learning be implicit? new evidence with the process dissociation procedure. Psychonomic bulletin & review, 8(2):343–350.
- Gan et al. (2015) Gan, Z., Li, C., Henao, R., Carlson, D. E., and Carin, L. (2015). Deep temporal sigmoid belief networks for sequence modeling. In Advances in Neural Information Processing Systems, pages 2467–2475.
- Glorot et al. (2011) Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315–323.
- Goodfellow et al. (2012) Goodfellow, I. J., Courville, A., and Bengio, Y. (2012). Spike-and-slab sparse coding for unsupervised feature discovery. arXiv preprint arXiv:1201.3382.
- Goyal et al. (2016) Goyal, A., Ke, N. R., Lamb, A., and Bengio, Y. (2016). The variational walkback algorithm. On openreview.net.
- Grossberg (1982) Grossberg, S. (1982). How does a brain build a cognitive code? In Studies of mind and brain, pages 1–52. Springer.
- Grossberg (1987) Grossberg, S. (1987). Competitive learning: From interactive activation to adaptive resonance. Cognitive Science, 11(1):23 – 63.
- Hinton (2006) Hinton, G. E. (2006). Training products of experts by minimizing contrastive divergence. Training, 14(8).
- Hinton et al. (1995) Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214):1158.
- Hinton and McClelland (1988) Hinton, G. E. and McClelland, J. L. (1988). Learning representations by recirculation. In Neural information processing systems, pages 358–366.
- Hopfield (1982) Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554–2558.
- Huang and Rao (2011) Huang, Y. and Rao, R. P. (2011). Predictive coding. Wiley Interdisciplinary Reviews: Cognitive Science, 2(5):580–593.
- Jaderberg et al. (2016) Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., and Kavukcuoglu, K. (2016). Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343.
- Kavukcuoglu et al. (2010) Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2010). Fast inference in sparse coding algorithms with applications to object recognition. arXiv preprint arXiv:1010.3467.
- Kingma and Ba (2014) Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kingma and Welling (2013) Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
- Kullback and Leibler (1951) Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The annals of mathematical statistics, 22(1):79–86.
- LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.
- LeCun et al. (1989) LeCun, Y. et al. (1989). Generalization and network design strategies. Connectionism in perspective, pages 143–155.
- Lee et al. (2014) Lee, C.-Y., Xie, S., Gallagher, P., Zhang, Z., and Tu, Z. (2014). Deeply-Supervised Nets. arXiv:1409.5185 [cs, stat].
- Lee et al. (2015) Lee, D.-H., Zhang, S., Fischer, A., and Bengio, Y. (2015). Difference target propagation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 498–515. Springer.
- Liao et al. (2016) Liao, Q., Leibo, J. Z., and Poggio, T. A. (2016). How important is weight symmetry in backpropagation? In AAAI, pages 1837–1844.
- Lillicrap et al. (2014) Lillicrap, T. P., Cownden, D., Tweed, D. B., and Akerman, C. J. (2014). Random feedback weights support learning in deep neural networks. arXiv preprint arXiv:1411.0247.
- Lotter et al. (2016) Lotter, W., Kreiman, G., and Cox, D. (2016). Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104.
- McNamara et al. (2016) McNamara, D., Ong, C. S., and Williamson, R. C. (2016). A modular theory of feature learning. arXiv preprint arXiv:1611.03125.
- Mikolov et al. (2010) Mikolov, T., Karafiát, M., Burget, L., Cernockỳ, J., and Khudanpur, S. (2010). Recurrent neural network based language model. In Interspeech, volume 2, page 3.
- Mitchell et al. (2015) Mitchell, T. M., Cohen, W. W., Hruschka Jr, E. R., Talukdar, P. P., Betteridge, J., Carlson, A., Mishra, B. D., Gardner, M., Kisiel, B., Krishnamurthy, J., et al. (2015). Never ending learning. In AAAI, pages 2302–2310.
- Mittelman et al. (2014) Mittelman, R., Kuipers, B., Savarese, S., and Lee, H. (2014). Structured recurrent temporal restricted boltzmann machines. In International Conference on Machine Learning, pages 1647–1655.
- Mnih and Gregor (2014) Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030.
- Neftci et al. (2017) Neftci, E. O., Augustine, C., Paul, S., and Detorakis, G. (2017). Event-driven random back-propagation: Enabling neuromorphic deep learning machines. Frontiers in neuroscience, 11:324.
- Nøkland (2016) Nøkland, A. (2016). Direct feedback alignment provides learning in deep neural networks. In Advances in Neural Information Processing Systems, pages 1037–1045.
- Ollivier et al. (2015) Ollivier, Y., Tallec, C., and Charpiat, G. (2015). Training recurrent networks online without backtracking. arXiv preprint arXiv:1507.07680.
- Olshausen and Field (1997) Olshausen, B. A. and Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research, 37(23):3311–3325.
- O’Reilly (1996) O’Reilly, R. C. (1996). Biologically plausible error-driven learning using local activation differences: The generalized recirculation algorithm. Neural computation, 8(5):895–938.
- Ororbia II et al. (2015a) Ororbia II, A. G., Giles, C. L., and Reitter, D. (2015a). Learning a deep hybrid model for semi-supervised text classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Lisbon, Portugal.
- Ororbia II et al. (2015b) Ororbia II, A. G., Giles, C. L., and Reitter, D. (2015b). Online semi-supervised learning with deep hybrid boltzmann machines and denoising autoencoders. arXiv preprint arXiv:1511.06964.
- Ororbia II et al. (2017a) Ororbia II, A. G., Kifer, D., and Giles, C. L. (2017a). Unifying adversarial training algorithms with data gradient regularization. Neural computation, 29(4):867–887.
- Ororbia II et al. (2017b) Ororbia II, A. G., Mikolov, T., and Reitter, D. (2017b). Learning simpler language models with the differential state framework. Neural Computation, 0(0):1–26. PMID: 28957029.
- Ororbia II et al. (2017c) Ororbia II, A. G., Reitter, D., and Giles, C. L. (2017c). The temporal neural coding network: Towards lifelong language learning. The New York Academy of Sciences: 11th Annual Machine Learning Symposium.
- Ororbia II et al. (2015c) Ororbia II, A. G., Reitter, D., Wu, J., and Giles, C. L. (2015c). Online learning of deep hybrid architectures for semi-supervised categorization. In Machine Learning and Knowledge Discovery in Databases (Proceedings, ECML PKDD 2015), volume 9284 of Lecture Notes in Computer Science, pages 516–532. Springer, Porto, Portugal.
- Panichello et al. (2013) Panichello, M., Cheung, O., and Bar, M. (2013). Predictive feedback and conscious visual experience. Frontiers in Psychology, 3:620.
- Pascanu et al. (2013) Pascanu, R., Mikolov, T., and Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318.
- Polyak and Juditsky (1992) Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855.
- Rao and Ballard (1997) Rao, R. P. and Ballard, D. H. (1997). Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural computation, 9(4):721–763.
- Rao and Ballard (1999) Rao, R. P. and Ballard, D. H. (1999). Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience, 2(1).
- Rauss and Pourtois (2013) Rauss, K. and Pourtois, G. (2013). What is bottom-up and what is top-down in predictive coding? Frontiers in Psychology, 4:276.
- Reber (1989) Reber, A. S. (1989). Implicit learning and tacit knowledge. Journal of experimental psychology: General, 118(3):219.
- Salakhutdinov and Larochelle (2010) Salakhutdinov, R. and Larochelle, H. (2010). Efficient learning of deep boltzmann machines. In International Conference on Artificial Intelligence and Statistics, pages 693–700.
- Santana et al. (2017) Santana, E., Emigh, M. S., Zegers, P., and Principe, J. C. (2017). Exploiting spatio-temporal structure with recurrent winner-take-all networks. IEEE Transactions on Neural Networks and Learning Systems.
- Scellier and Bengio (2017) Scellier, B. and Bengio, Y. (2017). Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Frontiers in computational neuroscience, 11.
- Serban et al. (2017) Serban, I. V., Ororbia II, A. G., Pineau, J., and Courville, A. (2017). Piecewise latent variables for neural variational text processing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 422–432.
- Sutskever et al. (2009) Sutskever, I., Hinton, G. E., and Taylor, G. W. (2009). The recurrent temporal restricted boltzmann machine. In Advances in Neural Information Processing Systems, pages 1601–1608.
- Szegedy et al. (2013) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. (2013). Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
- Szlam et al. (2011) Szlam, A. D., Gregor, K., and LeCun, Y. L. (2011). Structured sparse coding via lateral inhibition. In Advances in Neural Information Processing Systems, pages 1116–1124.
- Tallec and Ollivier (2017) Tallec, C. and Ollivier, Y. (2017). Unbiased online recurrent optimization. arXiv preprint arXiv:1702.05043.
- Taylor et al. (2007) Taylor, G. W., Hinton, G. E., and Roweis, S. T. (2007). Modeling human motion using binary latent variables. In Advances in neural information processing systems, pages 1345–1352.
- van den Broeke (2016) van den Broeke, G. (2016). What Auto-encoders Could Learn from Brains. Master’s thesis, Aalto University, Finland.
- van Rooyen and Williamson (2015) van Rooyen, B. and Williamson, R. C. (2015). A theory of feature learning. arXiv preprint arXiv:1504.00083.
- Williams and Zipser (1989) Williams, R. J. and Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280.
- Wiseman et al. (2017) Wiseman, S., Chopra, S., Ranzato, M., Szlam, A., Sun, R., Chintala, S., and Vasilache, N. (2017). Training language models using target-propagation. arXiv preprint arXiv:1702.04770.
- Xie and Seung (2003) Xie, X. and Seung, H. S. (2003). Equivalence of backpropagation and contrastive hebbian learning in a layered network. Neural computation, 15(2):441–454.