
Sample Efficient Adaptive Text-to-Speech
We present a meta-learning approach for adaptive text-to-speech (TTS) with little data. During training, we learn a multi-speaker model using a shared conditional WaveNet core and independent learned embeddings for each speaker. The aim of training is not to produce a neural network with fixed weights, which is then deployed as a TTS system. Instead, the aim is to produce a network that requires little data at deployment time to rapidly adapt to new speakers. We introduce and benchmark three strategies: (i) learning the speaker embedding while keeping the WaveNet core fixed, (ii) fine-tuning the entire architecture with stochastic gradient descent, and (iii) predicting the speaker embedding with a trained neural network encoder. The experiments show that these approaches are successful at adapting the multi-speaker neural network to new speakers, obtaining state-of-the-art results in both sample naturalness and voice similarity with merely a few minutes of audio data from new speakers.
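Strategy (i) can be sketched as plain gradient descent on a new speaker's embedding while the shared core stays frozen. The linear "core" `W` and all shapes below are illustrative stand-ins for the conditional WaveNet, not the paper's architecture:

```python
import numpy as np

# Toy sketch of strategy (i): adapt only the speaker embedding while the
# shared core stays frozen. A fixed linear map W plays the role of the
# conditional WaveNet core; `target` stands in for the few-shot adaptation
# data from a new speaker. All names and shapes are illustrative.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))            # frozen shared core
target = rng.normal(size=4)            # few-shot data from a new speaker

e = np.zeros(3)                        # the new speaker's embedding is the
lr = 0.1                               # only thing we update
for _ in range(500):
    pred = W @ e
    grad = W.T @ (pred - target)       # gradient of 0.5 * ||W e - target||^2
    e -= lr * grad
```

Strategy (ii) would instead let the update touch `W` as well; strategy (iii) would replace the loop with a learned encoder that maps the adaptation data directly to `e`.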
09/27/2018 ∙ by Yutian Chen, et al.

Plan, Attend, Generate: Planning for Sequence-to-Sequence Models
We investigate integrating a planning mechanism into sequence-to-sequence models that use attention. We develop a model that plans ahead when computing its alignments between input and output sequences, constructing a matrix of proposed future alignments and a commitment vector that governs whether to follow or recompute the plan. This mechanism is inspired by the recently proposed strategic attentive reader and writer (STRAW) model for reinforcement learning. Our proposed model is end-to-end trainable using primarily differentiable operations. We show that it outperforms a strong baseline on character-level translation tasks from WMT'15, the algorithmic task of finding Eulerian circuits of graphs, and question generation from text. Our analysis demonstrates that the model computes qualitatively intuitive alignments, converges faster than the baselines, and achieves superior performance with fewer parameters.
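A rough sketch of the planning loop: a matrix of k proposed future alignments is either followed (shifted forward one row) or recomputed, depending on a commitment value. The scoring and commitment functions below are arbitrary stand-ins, not the paper's learned parameterization:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative planning loop: keep a matrix of k proposed future alignments
# over the source positions, plus a commitment value that decides whether to
# follow the plan (shift it one row) or recompute it from the current state.
rng = np.random.default_rng(1)
src = rng.normal(size=(5, 2))          # 5 source positions, dim 2
k = 3                                  # planning horizon

def recompute_plan(state):
    # one alignment row per future step, each a distribution over sources
    return np.stack([softmax(src @ state * (t + 1)) for t in range(k)])

state = rng.normal(size=2)
plan = recompute_plan(state)
for step in range(4):
    commit = 1.0 / (1.0 + np.exp(-state.sum()))   # commitment in (0, 1)
    if commit > 0.5:                               # follow the plan: shift
        align, plan = plan[0], np.vstack([plan[1:], plan[-1:]])
    else:                                          # recompute from scratch
        plan = recompute_plan(state)
        align = plan[0]
    state = 0.9 * state + align @ src              # attend with the alignment
```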
11/28/2017 ∙ by Francis Dutil, et al.

Mollifying Networks
The optimization of deep neural networks can be more challenging than traditional convex optimization problems due to the highly non-convex nature of the loss function; for example, it can involve pathological landscapes such as saddle surfaces that are difficult to escape for algorithms based on simple gradient descent. In this paper, we attack the optimization of highly non-convex neural networks by starting with a smoothed, or mollified, objective function whose energy landscape gradually becomes more non-convex over the course of training. Our proposal is inspired by recent studies of continuation methods: similar to curriculum methods, we begin by learning an easier (possibly convex) objective function and let it evolve during training until it eventually returns to the original, difficult-to-optimize objective function. The complexity of the mollified networks is controlled by a single hyperparameter that is annealed during training. We show improvements on various difficult optimization tasks and establish a relationship with recent work on continuation methods for neural networks and mollifiers.
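The continuation idea can be illustrated on a one-dimensional toy objective: approximate the mollified loss by averaging over Gaussian perturbations, and anneal the perturbation scale toward zero so the original non-convex objective is recovered. This is a sketch of the principle, not the paper's network-level mollification:

```python
import numpy as np

def loss(w):
    return np.sin(5 * w) + 0.1 * w ** 2            # non-convex toy objective

# Mollification approximated by Monte Carlo averaging over Gaussian noise on
# the parameter; a single annealed scale (sigma) controls how smoothed the
# effective objective is, per the abstract's single hyperparameter.
rng = np.random.default_rng(2)
w, lr, steps = 3.0, 0.05, 300
for step in range(steps):
    sigma = 2.0 * (1 - step / steps)               # annealed smoothing scale
    ws = w + sigma * rng.normal(size=64)           # noisy parameter samples
    # finite-difference estimate of the mollified gradient
    grad = np.mean((loss(ws + 1e-3) - loss(ws - 1e-3)) / 2e-3)
    w -= lr * grad
```

Early on, the heavy smoothing hides the sine ripples and the iterate follows the easy quadratic trend; as sigma shrinks, the full non-convex landscape reappears and the iterate settles into a good minimum of the original loss.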
08/17/2016 ∙ by Caglar Gulcehre, et al.

Dynamic Neural Turing Machine with Soft and Hard Addressing Schemes
We extend the neural Turing machine (NTM) into a dynamic neural Turing machine (D-NTM) by introducing a trainable memory addressing scheme. This scheme maintains two separate vectors for each memory cell: a content vector and an address vector. This allows the D-NTM to learn a wide variety of location-based addressing strategies, including both linear and nonlinear ones. We implement the D-NTM with both continuous, differentiable and discrete, non-differentiable read/write mechanisms. We investigate the mechanisms and effects of learning to read from and write to a memory through experiments on the Facebook bAbI tasks, using both feedforward and GRU controllers, and show that the D-NTM outperforms NTM and LSTM baselines. We also provide an extensive analysis of our model and of different NTM variants on the bAbI tasks, as well as further experimental results on sequential pMNIST, Stanford Natural Language Inference, associative recall, and copy tasks.
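A minimal sketch of the addressing scheme: each cell pairs a trainable address vector with its content, the controller's key is matched against the addresses, and reading is either a soft average (differentiable) or a discrete argmax. Shapes and the random key are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Each memory cell carries a learnable address vector alongside its content,
# so location strategies are learned rather than hard-coded. The controller's
# key is matched against the addresses, not the contents.
rng = np.random.default_rng(3)
n_cells, d_addr, d_content = 6, 4, 8
addresses = rng.normal(size=(n_cells, d_addr))   # trainable address vectors
contents = rng.normal(size=(n_cells, d_content))

key = rng.normal(size=d_addr)                    # emitted by the controller
weights = softmax(addresses @ key)               # addressing distribution
soft_read = weights @ contents                   # continuous, differentiable
hard_read = contents[np.argmax(weights)]         # discrete, non-differentiable
```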
06/30/2016 ∙ by Caglar Gulcehre, et al.

Gated Orthogonal Recurrent Units: On Learning to Forget
We present a novel recurrent neural network (RNN) model that combines the remembering ability of unitary RNNs with the ability of gated RNNs to effectively forget redundant or irrelevant information in their memory. We achieve this by extending unitary RNNs with a gating mechanism. Our model outperforms LSTMs, GRUs, and unitary RNNs on several long-term dependency benchmark tasks. We show empirically that orthogonal/unitary RNNs lack the ability to forget, and that the GORU can remember long-term dependencies while simultaneously forgetting irrelevant information; this combination plays an important role in recurrent neural networks. We provide competitive results, along with an analysis of our model, on many natural sequential tasks including bAbI question answering, TIMIT speech spectrum prediction, Penn TreeBank, and synthetic long-term dependency tasks such as algorithmic, parenthesis, denoising, and copying tasks.
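A shape-level sketch of such a cell, assuming GRU-style update/reset gates wrapped around an orthogonal recurrent transition (here a random orthogonal matrix from a QR decomposition); the exact parameterization is illustrative, not the paper's:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Gates give the unitary/orthogonal recurrence the ability to forget: the
# norm-preserving transition U carries long-term information, while z and r
# decide what to keep and what to discard.
rng = np.random.default_rng(4)
d = 5
U, _ = np.linalg.qr(rng.normal(size=(d, d)))     # orthogonal recurrence
Wz, Wr, Wx = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

def goru_step(h, x):
    z = sigmoid(Wz @ x + h)                      # update gate: what to keep
    r = sigmoid(Wr @ x + h)                      # reset gate: what to forget
    h_tilde = np.tanh(U @ (r * h) + Wx @ x)      # orthogonal transition
    return z * h + (1 - z) * h_tilde

h = np.zeros(d)
for x in rng.normal(size=(7, d)):
    h = goru_step(h, x)
```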
06/08/2017 ∙ by Li Jing, et al.

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
In this paper we compare different types of recurrent units in recurrent neural networks (RNNs). In particular, we focus on more sophisticated units that implement a gating mechanism, such as the long short-term memory (LSTM) unit and the recently proposed gated recurrent unit (GRU). We evaluate these recurrent units on the tasks of polyphonic music modeling and speech signal modeling. Our experiments reveal that these advanced recurrent units are indeed better than more traditional recurrent units such as tanh units, and that the GRU is comparable to the LSTM.
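For reference, a single GRU step written out with random weights and no biases; a shape-level sketch of the gating equations, not a trained model:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# One GRU step: the update gate z interpolates between the old state and a
# candidate state, and the reset gate r controls how much of the old state
# feeds the candidate. Biases are omitted for brevity.
rng = np.random.default_rng(5)
dx, dh = 3, 4
Wz, Wr, Wh = (rng.normal(size=(dh, dx)) for _ in range(3))
Uz, Ur, Uh = (rng.normal(size=(dh, dh)) for _ in range(3))

def gru_step(h, x):
    z = sigmoid(Wz @ x + Uz @ h)                 # update gate
    r = sigmoid(Wr @ x + Ur @ h)                 # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))     # candidate state
    return z * h + (1 - z) * h_tilde             # interpolate old and new

h = np.zeros(dh)
for x in rng.normal(size=(6, dx)):
    h = gru_step(h, x)
```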
12/11/2014 ∙ by Junyoung Chung, et al.

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
In this paper, we propose a novel neural network model called the RNN Encoder-Decoder, which consists of two recurrent neural networks (RNNs). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes that representation into another sequence of symbols. The encoder and decoder are jointly trained to maximize the conditional probability of a target sequence given a source sequence. We find empirically that the performance of a statistical machine translation system improves when the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder are used as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.
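The data flow can be sketched with two tiny tanh RNNs standing in for the paper's gated units: one folds the source symbols into a fixed-length vector c, the other unrolls c into output symbols. Weights are random, so this shows the shapes of the computation, not translation quality:

```python
import numpy as np

# Encoder: fold a symbol sequence into a fixed-length vector c.
# Decoder: unroll c into an output sequence, conditioning every step on c.
rng = np.random.default_rng(6)
d, vocab = 4, 10
emb = rng.normal(size=(vocab, d))                # symbol embeddings
We, Ue = rng.normal(size=(d, d)) * 0.5, rng.normal(size=(d, d)) * 0.5
Wd, Wo = rng.normal(size=(d, d)) * 0.5, rng.normal(size=(vocab, d))

src = [3, 1, 4, 1, 5]                  # source symbol ids
h = np.zeros(d)
for s in src:                          # encode
    h = np.tanh(We @ emb[s] + Ue @ h)
c = h                                  # fixed-length summary of the source

out = []
h = np.tanh(Wd @ c)                    # decode, conditioned on c
for _ in range(4):
    out.append(int(np.argmax(Wo @ h))) # greedy symbol choice
    h = np.tanh(Wd @ h + c)
```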
06/03/2014 ∙ by Kyunghyun Cho, et al.

Learned-Norm Pooling for Deep Feedforward and Recurrent Neural Networks
In this paper we propose and investigate a novel nonlinear unit, called the L_p unit, for deep neural networks. The proposed L_p unit receives signals from several projections of a subset of units in the layer below and computes a normalized L_p norm. We note two interesting interpretations of the L_p unit. First, it can be understood as a generalization of a number of conventional pooling operators, such as the average, root-mean-square, and max pooling widely used in, for instance, convolutional neural networks (CNNs), HMAX models, and neocognitrons; it is also, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013), which achieved state-of-the-art object recognition results on a number of benchmark datasets. Second, we provide a geometrical interpretation of the activation function, based on which we argue that the L_p unit is more efficient at representing complex, nonlinear separating boundaries. Each L_p unit defines a super-elliptic boundary whose exact shape is determined by the order p, which makes it possible to model arbitrarily shaped, curved boundaries efficiently by combining a few L_p units of different orders; this insight justifies learning a different order for each unit in the model. We empirically evaluate the proposed L_p units on a number of datasets and show that multilayer perceptrons (MLPs) consisting of L_p units achieve state-of-the-art results on several benchmarks. We also evaluate the L_p unit on recently proposed deep recurrent neural networks (RNNs).
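The unit itself fits in one line: project the inputs and take a normalized L_p norm of the projections. With p = 2 this recovers root-mean-square pooling, and large p approaches max pooling, as the abstract notes:

```python
import numpy as np

def lp_unit(x, W, p):
    """Normalized L_p norm over several projections of the inputs below."""
    z = W @ x                                    # projections of the inputs
    return np.mean(np.abs(z) ** p) ** (1.0 / p)

rng = np.random.default_rng(7)
x = rng.normal(size=6)                           # inputs from the layer below
W = rng.normal(size=(4, 6))                      # 4 learned projections
z = np.abs(W @ x)                                # kept for comparison below
```

Learning p per unit lets each boundary interpolate between the round (p = 2) and box-like (large p) regimes.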
11/07/2013 ∙ by Caglar Gulcehre, et al.

Memory Augmented Neural Networks with Wormhole Connections
Recent empirical results on long-term dependency tasks have shown that neural networks augmented with an external memory can learn such tasks more easily, and generalize better, than vanilla recurrent neural networks (RNNs). We suggest that memory augmented neural networks (MANNs) can reduce the effects of vanishing gradients by creating shortcut, or wormhole, connections. Based on this observation, we propose a novel MANN called TARDIS (Temporal Automatic Relation Discovery in Sequences). The controller of TARDIS can store a selective set of embeddings of its own previous hidden states in an external memory and revisit them as needed. For TARDIS, the memory acts as storage for wormhole connections to the past that propagate gradients more effectively and help learn temporal dependencies. The memory structure of TARDIS is similar to both Neural Turing Machines (NTM) and Dynamic Neural Turing Machines (D-NTM), but its read and write operations are simpler and more efficient. We use discrete addressing for reads and writes, which helps substantially reduce the vanishing gradient problem on very long sequences. Read and write operations in TARDIS are tied by a heuristic once the memory becomes full, which makes the learning problem simpler than for NTM- or D-NTM-type architectures. We provide a detailed analysis of gradient propagation in MANNs in general, evaluate our models on different long-term dependency tasks, and report competitive results on all of them.
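A sketch of the tied read/write heuristic: summaries of past hidden states fill the memory, and once it is full each discretely addressed read slot is immediately overwritten with the current state. The dot-product scores are a stand-in for the learned read mechanism:

```python
import numpy as np

# Wormhole bookkeeping: the controller stores its past hidden states in a
# small external memory; once full, the slot that was just read is the one
# that gets overwritten (the tied read/write heuristic). Discrete argmax
# addressing matches the abstract; the scoring function is illustrative.
rng = np.random.default_rng(8)
n_slots, d = 4, 3
memory = np.zeros((n_slots, d))
used = 0

for t in range(7):
    h = rng.normal(size=d)                      # current hidden state
    if used < n_slots:                          # filling phase: just append
        memory[used] = h
        used += 1
    else:                                       # full: read, then overwrite
        scores = memory @ h
        slot = int(np.argmax(scores))           # discrete read address
        read = memory[slot]                     # wormhole to a past state
        memory[slot] = h                        # tied write to the same slot
```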
01/30/2017 ∙ by Caglar Gulcehre, et al.

Noisy Activation Functions
Common nonlinear activation functions used in neural networks can cause training difficulties due to their saturation behavior, which may hide dependencies that are not visible to vanilla SGD (using first-order gradients only). Gating mechanisms that use softly saturating activation functions to emulate the discrete switching of digital logic circuits are good examples of this. We propose to inject appropriate noise so that gradients flow easily even when the noiseless application of the activation function would yield zero gradient. Large noise dominates the noise-free gradient and allows stochastic gradient descent to explore more. By adding noise only to the problematic parts of the activation function, we allow the optimization procedure to explore the boundary between the degenerate (saturating) and well-behaved parts of the activation function. We also establish connections to simulated annealing when the amount of noise is annealed down, making it easier to optimize hard objective functions. We find experimentally that replacing saturating activation functions with noisy variants helps training in many contexts, yielding state-of-the-art or competitive results on different datasets and tasks, especially when training seems to be most difficult, e.g., when curriculum learning is necessary to obtain good results.
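The idea can be sketched on a hard sigmoid: detect the flat, zero-gradient regions and inject noise there, scaled by how far the pre-activation sits inside the saturated zone. The scaling rule below is a simplified stand-in for the paper's exact scheme:

```python
import numpy as np

def hard_sigmoid(x):
    return np.clip(0.25 * x + 0.5, 0.0, 1.0)

def noisy_hard_sigmoid(x, rng, c=0.1):
    """Add noise only where the activation saturates (zero gradient)."""
    y = hard_sigmoid(x)
    saturated = (x < -2.0) | (x > 2.0)           # the flat, clipped zones
    delta = np.abs(0.25 * x + 0.5 - y)           # distance into saturation
    noise = c * delta * rng.normal(size=np.shape(x))
    return np.where(saturated, y + noise, y)     # unsaturated part untouched

rng = np.random.default_rng(9)
x = np.linspace(-6, 6, 13)
y = noisy_hard_sigmoid(x, rng)
```

Because the noise magnitude grows with the distance into the saturated region, deeply saturated pre-activations get the most exploration, while the well-behaved linear part of the function is left exact.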
03/01/2016 ∙ by Caglar Gulcehre, et al.

Gated Feedback Recurrent Neural Networks
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, the gated-feedback RNN (GF-RNN), extends the conventional approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers, using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory, and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation revealed that on both tasks the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and gate the layer-to-layer interactions, including the top-down ones that are not usually present in a stacked RNN.
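A shape-level sketch of the gated-feedback connections, using tanh units and, as a simplification, a scalar gate per layer pair computed from the layer's input and the source layer's previous state (the paper's gates condition on all previous hidden states):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Every pair of layers (i, j) gets a global gate that scales the recurrent
# signal layer j sends to layer i -- including top-down signals (j > i) that
# a plain stacked RNN lacks. Weights are random; this shows the wiring only.
rng = np.random.default_rng(10)
L, d = 3, 4                                     # layers, hidden size
W = rng.normal(size=(L, d, d)) * 0.3            # input-to-layer weights
U = rng.normal(size=(L, L, d, d)) * 0.3         # layer-j-to-layer-i weights
wg = rng.normal(size=(L, L, d)) * 0.3           # gate weights (scalar gates)

def gf_step(hs, x):
    new = []
    inp = x
    for i in range(L):
        rec = np.zeros(d)
        for j in range(L):                      # gated feedback from all layers
            g = sigmoid(wg[i, j] @ (inp + hs[j]))
            rec += g * (U[i, j] @ hs[j])
        new.append(np.tanh(W[i] @ inp + rec))
        inp = new[i]                            # stack: feed the layer above
    return new

hs = [np.zeros(d) for _ in range(L)]
for x in rng.normal(size=(5, d)):
    hs = gf_step(hs, x)
```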
02/09/2015 ∙ by Junyoung Chung, et al.