
Phoneme recognition in TIMIT with BLSTM-CTC
We compare the performance of a recurrent neural network with the best results published so far on phoneme recognition in the TIMIT database. These published results have been obtained with a combination of classifiers. However, in this paper we apply a single recurrent neural network to the same task. Our recurrent neural network attains an error rate of 24.6%, which is not significantly different from that obtained by the best existing methods, even though those methods rely on a combination of classifiers to achieve comparable performance.
04/21/2008 ∙ by Santiago Fernández, et al.
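The CTC training method used in the paper above maps many framewise labellings onto one phoneme sequence by collapsing repeats and removing a special blank symbol. A minimal sketch of that collapsing step (the decoder itself, not the paper's full training code; all names are illustrative):

```python
# Hedged sketch of CTC's many-to-one mapping from per-frame network outputs
# to a label sequence: collapse repeated labels, then drop the blank symbol.

BLANK = "-"  # special "no label" output used by CTC

def ctc_collapse(frame_labels):
    """Collapse a framewise labelling into a CTC output sequence."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return out

# e.g. best-path decoding of the framewise argmax labels for "sit":
print(ctc_collapse(["-", "s", "s", "-", "i", "i", "t", "-"]))  # ['s', 'i', 't']
```

Note that the blank also lets the network emit genuinely repeated labels: `["a", "-", "a"]` collapses to `["a", "a"]`, not `["a"]`.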

Neural Turing Machines
We extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with by attentional processes. The combined system is analogous to a Turing Machine or Von Neumann architecture but is differentiable end-to-end, allowing it to be efficiently trained with gradient descent. Preliminary results demonstrate that Neural Turing Machines can infer simple algorithms such as copying, sorting, and associative recall from input and output examples.
10/20/2014 ∙ by Alex Graves, et al.
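One of the attentional processes the abstract refers to is content-based addressing: a read head compares a key vector against every memory row and reads a softmax-weighted mix. A hedged sketch of that one ingredient (names like `memory`, `key`, and `beta` are illustrative, not the paper's code):

```python
# Hedged sketch of NTM content-based addressing: score each memory row by
# cosine similarity to a key, sharpen with beta, softmax, then read a
# weighted sum of rows.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)

def content_read(memory, key, beta=5.0):
    """Read from memory by similarity to `key`, sharpened by `beta`."""
    scores = [beta * cosine(row, key) for row in memory]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # weighted sum of memory rows
    return [sum(w * row[i] for w, row in zip(weights, memory))
            for i in range(len(memory[0]))]

memory = [[1.0, 0.0], [0.0, 1.0]]
r = content_read(memory, key=[1.0, 0.1])
# r lies close to the first row, since the key matches it best
```

Because every step is a smooth function of the key and the memory, gradients flow through the read, which is what makes the whole system trainable end-to-end.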

WaveNet: A Generative Model for Raw Audio
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
09/12/2016 ∙ by Aaron van den Oord, et al.
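The efficiency claim above rests on dilated causal convolutions: doubling the dilation at each layer makes the receptive field grow exponentially with depth, so each sample sees a long past context at modest cost. A sketch of the receptive-field arithmetic (the layer layout below is illustrative, not the exact published configuration):

```python
# Hedged sketch: receptive field of stacked dilated causal convolutions,
# i.e. how many past samples each output can condition on.

def receptive_field(kernel_size, dilations):
    """Past context visible to one output, for stacked causal convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# one stack of 10 layers with dilations 1, 2, 4, ..., 512 and kernel size 2:
dilations = [2 ** i for i in range(10)]
print(receptive_field(2, dilations))  # 1024
```

Ten layers already cover 1024 samples; repeating such stacks extends the context further while keeping the parameter count linear in depth.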

Playing Atari with Deep Reinforcement Learning
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
12/19/2013 ∙ by Volodymyr Mnih, et al.
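The value function the network estimates follows the Q-learning target: Q(s, a) is moved toward r + gamma * max over a' of Q(s', a'). A tabular toy version of that update (the paper's model is a CNN over raw pixels; the tiny state table here is purely illustrative):

```python
# Hedged sketch of the tabular Q-learning update the DQN approximates with
# a convolutional network.

def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.99):
    target = r + gamma * max(Q[s_next].values())   # bootstrapped return
    Q[s][a] += alpha * (target - Q[s][a])          # move toward the target

Q = {"s0": {"left": 0.0, "right": 0.0},
     "s1": {"left": 0.0, "right": 0.0}}
q_update(Q, "s0", "right", r=1.0, s_next="s1")
# Q["s0"]["right"] is now 0.5 (half-way toward the target of 1.0)
```

Replacing the table lookup with a network evaluated on pixels, plus experience replay, gives the variant of Q-learning the abstract describes.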

Asynchronous Methods for Deep Reinforcement Learning
We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training, allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.
02/04/2016 ∙ by Volodymyr Mnih, et al.
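The core pattern is several actor-learner threads applying gradient steps to shared parameters without synchronisation barriers. A toy sketch of that update loop, with a simple quadratic loss standing in for the RL objective (everything here is illustrative, not the paper's training code):

```python
# Hedged sketch of asynchronous parallel learners: each thread reads the
# shared parameters, computes a gradient, and writes an update with no lock.
import threading

params = {"w": 10.0}   # shared parameters; the loss w**2 has its optimum at 0
LR = 0.01

def actor_learner(steps):
    for _ in range(steps):
        grad = 2.0 * params["w"]      # d/dw of w**2, read without locking
        params["w"] -= LR * grad      # asynchronous write to shared state

threads = [threading.Thread(target=actor_learner, args=(100,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# the four threads together drive w close to the optimum at 0
print(abs(params["w"]) < 1.0)  # True
```

Even with stale reads, every update shrinks the parameter toward the optimum here; the paper's observation is that, for RL, such parallel workers also decorrelate the training data and stabilize learning.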

Decoupled Neural Interfaces using Synthetic Gradients
Training directed neural networks typically requires forward-propagating data through a computation graph, followed by backpropagating an error signal, to produce weight updates. All layers, or more generally, modules, of the network are therefore locked, in the sense that they must wait for the remainder of the network to execute forwards and propagate error backwards before they can be updated. In this work we break this constraint by decoupling modules: we introduce a model of the future computation of the network graph that predicts what the modelled subgraph will produce, using only local information. In particular we focus on modelling error gradients: by using the modelled synthetic gradient in place of true backpropagated error gradients we decouple subgraphs, and can update them independently and asynchronously, i.e. we realise decoupled neural interfaces. We show results for feedforward models, where every layer is trained asynchronously; recurrent neural networks (RNNs), where predicting one's future gradient extends the time over which the RNN can effectively model; and a hierarchical RNN system with ticking at different timescales. Finally, we demonstrate that in addition to predicting gradients, the same framework can be used to predict inputs, resulting in models which are decoupled in both the forward and backward pass, amounting to independent networks which co-learn such that they can be composed into a single functioning corporation.
08/18/2016 ∙ by Max Jaderberg, et al.
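A scalar toy makes the decoupling concrete: a layer updates immediately using a locally predicted gradient, and the predictor itself is trained later when the true gradient arrives. This is a hedged sketch of the idea only; the learning rates, the constant predictor, and all names are illustrative, not the paper's architecture:

```python
# Hedged sketch of a synthetic gradient: the layer never waits for
# backpropagation; it uses a local prediction of dL/dy instead.

# layer: y = w * x ; true loss L = 0.5 * (y - target)^2, so dL/dy = y - target
w = 2.0
sg = 0.0          # synthetic-gradient model: here just a learned constant

x, target = 1.0, 0.0
for _ in range(200):
    y = w * x
    # 1) decoupled update: use the *predicted* gradient, no waiting
    w -= 0.1 * sg * x
    # 2) later, the true gradient arrives and trains the predictor
    true_grad = y - target
    sg += 0.5 * (true_grad - sg)

print(abs(w * x - target) < 0.1)  # True: w has been driven toward the target
```

In the paper the predictor is itself a small network conditioned on the layer's activation, but the convergence mechanism is the same: the synthetic gradient tracks the true one closely enough for the decoupled updates to descend the real loss.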

Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes
Neural networks augmented with external memory have the ability to learn algorithmic solutions to complex tasks. These models appear promising for applications such as language modeling and machine translation. However, they scale poorly in both space and time as the amount of memory grows, limiting their applicability to real-world domains. Here, we present an end-to-end differentiable memory access scheme, which we call Sparse Access Memory (SAM), that retains the representational power of the original approaches whilst training efficiently with very large memories. We show that SAM achieves asymptotic lower bounds in space and time complexity, and find that an implementation runs 1,000× faster and with 3,000× less physical memory than non-sparse models. SAM learns with comparable data efficiency to existing models on a range of synthetic tasks and one-shot Omniglot character recognition, and can scale to tasks requiring 100,000s of time steps and memories. We also show how our approach can be adapted for models that maintain temporal associations between memories, as with the recently introduced Differentiable Neural Computer.
10/27/2016 ∙ by Jack W Rae, et al.
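The scaling trick is that each read touches only the top-K most similar memory slots instead of attending over all of them, so cost grows with K rather than with memory size. A hedged sketch of a sparse read (dot-product scoring and all names are illustrative simplifications of SAM):

```python
# Hedged sketch of a SAM-style sparse read: score all slots cheaply, keep
# only the top-K, and form the read vector from those K slots alone.
import heapq

def sparse_read(memory, key, k=2):
    """Read a weighted mix of the k slots most similar to `key`."""
    scored = [(sum(a * b for a, b in zip(row, key)), i)
              for i, row in enumerate(memory)]
    top = heapq.nlargest(k, scored)          # only K slots participate
    z = sum(s for s, _ in top) or 1.0
    read = [0.0] * len(key)
    for s, i in top:
        for j, v in enumerate(memory[i]):
            read[j] += (s / z) * v
    return read, sorted(i for _, i in top)

memory = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
read, used = sparse_read(memory, key=[1.0, 0.0], k=2)
print(used)  # [0, 1]: only the two best-matching slots were touched
```

Because the read (and, symmetrically, the write) involves only K slots, gradients are sparse too, which is what allows training with very large memories.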

Parallel WaveNet: Fast High-Fidelity Speech Synthesis
The recently developed WaveNet architecture is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous system. However, because WaveNet relies on sequential generation of one audio sample at a time, it is poorly suited to today's massively parallel computers, and therefore hard to deploy in a real-time production setting. This paper introduces Probability Density Distillation, a new method for training a parallel feedforward network from a trained WaveNet with no significant difference in quality. The resulting system is capable of generating high-fidelity speech samples at more than 20 times real-time speed, and is deployed online by Google Assistant, including serving multiple English and Japanese voices.
11/28/2017 ∙ by Aaron van den Oord, et al.
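The distillation objective trains the parallel "student" to minimise a KL divergence against the trained WaveNet "teacher". A one-dimensional Gaussian toy of that objective, where the KL has a closed form (this is only an illustration of the loss shape, not the paper's distribution classes or training setup):

```python
# Hedged sketch of a distillation loss: descend KL(student || teacher)
# with the teacher held fixed. 1-D Gaussians stand in for both networks.
import math

def kl_gauss(mu_s, sig_s, mu_t, sig_t):
    """Closed-form KL( N(mu_s, sig_s^2) || N(mu_t, sig_t^2) )."""
    return (math.log(sig_t / sig_s)
            + (sig_s ** 2 + (mu_s - mu_t) ** 2) / (2 * sig_t ** 2) - 0.5)

mu_s, sig_s = 0.0, 2.0          # student parameters
mu_t, sig_t = 1.0, 1.0          # fixed, pretrained teacher
for _ in range(500):            # gradient descent on the student mean only
    grad_mu = (mu_s - mu_t) / sig_t ** 2   # d KL / d mu_s
    mu_s -= 0.05 * grad_mu
print(round(mu_s, 2))  # 1.0: the student mean has matched the teacher
```

In the real system the student is an inverse-autoregressive flow whose samples can be scored in parallel by the teacher, which is what removes the sample-by-sample generation bottleneck.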

Automated Curriculum Learning for Neural Networks
We introduce a method for automatically selecting the path, or syllabus, that a neural network follows through a curriculum so as to maximise learning efficiency. A measure of the amount that the network learns from each data sample is provided as a reward signal to a non-stationary multi-armed bandit algorithm, which then determines a stochastic syllabus. We consider a range of signals derived from two distinct indicators of learning progress: rate of increase in prediction accuracy, and rate of increase in network complexity. Experimental results for LSTM networks on three curricula demonstrate that our approach can significantly accelerate learning, in some cases halving the time required to attain a satisfactory performance level.
04/10/2017 ∙ by Alex Graves, et al.
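The selection loop can be illustrated with a much simpler bandit than the paper's: an epsilon-greedy chooser with an exponential moving average to handle non-stationarity, rewarded by a toy "learning progress" signal (the paper uses Exp3-style bandits; everything here is a hedged stand-in):

```python
# Hedged sketch of bandit-driven curriculum selection: the bandit learns
# which task currently yields the most learning progress and favours it.
import random

random.seed(0)
progress = {"easy": 0.05, "hard": 0.5}   # toy learning-progress per task
value = {"easy": 0.0, "hard": 0.0}       # bandit's running reward estimates
counts = {"easy": 0, "hard": 0}

for _ in range(500):
    if random.random() < 0.1:                       # explore
        task = random.choice(["easy", "hard"])
    else:                                           # exploit best estimate
        task = max(value, key=value.get)
    reward = progress[task] + random.gauss(0, 0.05)  # noisy progress signal
    value[task] += 0.1 * (reward - value[task])      # EMA: non-stationary
    counts[task] += 1

print(counts["hard"] > counts["easy"])  # True: the syllabus favours "hard"
```

The EMA matters because learning progress changes as the network improves: a task that was rewarding early can stop being so, and the bandit's estimates must be able to decay.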

Stochastic Backpropagation through Mixture Density Distributions
The ability to backpropagate stochastic gradients through continuous latent distributions has been crucial to the emergence of variational autoencoders and stochastic gradient variational Bayes. The key ingredient is an unbiased and low-variance way of estimating gradients with respect to distribution parameters from gradients evaluated at distribution samples. The "reparameterization trick" provides a class of transforms yielding such estimators for many continuous distributions, including the Gaussian and other members of the location-scale family. However, the trick does not readily extend to mixture density models, due to the difficulty of reparameterizing the discrete distribution over mixture weights. This report describes an alternative transform, applicable to any continuous multivariate distribution with a differentiable density function from which samples can be drawn, and uses it to derive an unbiased estimator for mixture density weight derivatives. Combined with the reparameterization trick applied to the individual mixture components, this estimator makes it straightforward to train variational autoencoders with mixture-distributed latent variables, or to perform stochastic variational inference with a mixture density variational posterior.
07/19/2016 ∙ by Alex Graves, et al.
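The reparameterization trick the report builds on writes a Gaussian sample as z = mu + sigma * eps with eps ~ N(0, 1), so z is a deterministic, differentiable function of the parameters given the noise. A sketch of the resulting pathwise gradient estimator (the toy objective E[z^2] is illustrative):

```python
# Hedged sketch of the reparameterization trick: sampling noise first makes
# the sample differentiable in mu and sigma, giving pathwise gradients.
import random

random.seed(1)

def sample_z(mu, sigma):
    eps = random.gauss(0.0, 1.0)      # noise independent of the parameters
    z = mu + sigma * eps              # differentiable in mu and sigma
    # pathwise derivatives, available because z is deterministic given eps:
    dz_dmu, dz_dsigma = 1.0, eps
    return z, dz_dmu, dz_dsigma

# Monte-Carlo gradient of E[z^2] w.r.t. mu (true value: 2 * mu = 2.0)
mu, sigma, n = 1.0, 0.5, 200_000
est = sum(2 * sample_z(mu, sigma)[0] * 1.0 for _ in range(n)) / n
print(abs(est - 2.0) < 0.05)  # True
```

The report's contribution is an analogous transform for the piece this trick cannot handle: gradients with respect to the discrete mixture weights.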

Memory-Efficient Backpropagation Through Time
We propose a novel approach to reduce memory consumption of the backpropagation through time (BPTT) algorithm when training recurrent neural networks (RNNs). Our approach uses dynamic programming to balance a tradeoff between caching of intermediate results and recomputation. Since computational devices have limited memory capacity, maximizing computational performance within a fixed memory budget is a practical use case; the algorithm fits tightly within almost any user-set memory budget while finding an execution policy that minimizes computational cost. We provide asymptotic computational upper bounds for various regimes. The algorithm is particularly effective for long sequences: for sequences of length 1000, it saves 95% of memory usage while using only one third more time per iteration than standard BPTT.
06/10/2016 ∙ by Audrūnas Gruslys, et al.
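The caching-versus-recomputation tradeoff can be seen in its classic special case: checkpoint every sqrt(T)-th hidden state and re-run the forward pass within each segment during backprop. The paper's dynamic program generalises this to arbitrary memory budgets; the arithmetic below is a hedged sketch of the special case only:

```python
# Hedged sketch of sqrt(T) checkpointing: store O(sqrt(T)) hidden states,
# pay roughly one extra forward pass of recomputation during backprop.
import math

def checkpointed_bptt_costs(T):
    """(states stored, forward steps recomputed) for sqrt(T) checkpointing."""
    stride = max(1, round(math.sqrt(T)))
    stored = math.ceil(T / stride)         # checkpointed hidden states
    recomputed = T - stored                # forward steps re-run in backprop
    return stored, recomputed

stored, recomputed = checkpointed_bptt_costs(1000)
print(stored)      # 32: versus 1000 stored states for standard BPTT
print(recomputed)  # 968: less than one full extra forward pass
```

Storing 32 states instead of 1000 is a roughly 97% memory saving for one extra forward pass of compute, which is consistent in spirit with the 95%-memory / one-third-extra-time figures the abstract reports for its optimized policy.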
Alex Graves
Alex Graves is a DeepMind research scientist. He received a BSc in Theoretical Physics from Edinburgh and a PhD in AI from IDSIA under Jürgen Schmidhuber. He was also a postdoctoral researcher at TU Munich and at the University of Toronto under Geoffrey Hinton.
At IDSIA, he trained long short-term memory (LSTM) networks with a then-new method called connectionist temporal classification (CTC). In certain applications, this method outperformed traditional speech recognition models. In 2009, his CTC-trained LSTM became the first recurrent neural network to win pattern recognition competitions, taking several handwriting recognition awards.
The method has since become widely used; for example, Google uses CTC-trained LSTMs for speech recognition on smartphones. Graves also designed the Neural Turing Machine and the related Differentiable Neural Computer.