
REGAL: Transfer Learning For Fast Optimization of Computation Graphs
We present a deep reinforcement learning approach to optimizing the execution cost of computation graphs in a static compiler. The key idea is to combine a neural network policy with a genetic algorithm, the Biased Random-Key Genetic Algorithm (BRKGA). The policy is trained to predict, given an input graph to be optimized, the node-level probability distributions for sampling mutations and crossovers in BRKGA. Our approach, "REINFORCE-based Genetic Algorithm Learning" (REGAL), uses the policy's ability to transfer to new graphs to significantly improve the solution quality of the genetic algorithm for the same objective evaluation budget. As a concrete application, we show results for minimizing peak memory in TensorFlow graphs by jointly optimizing device placement and scheduling. REGAL achieves on average 3.56% lower peak memory than BRKGA on previously unseen graphs, outperforming all the algorithms we compare to, and giving a 4.4x bigger improvement than the next best algorithm. We also evaluate REGAL on a production compiler team's performance benchmark of XLA graphs and achieve on average 3.74% lower peak memory than the other approaches. Our approach and analysis are made possible by collecting a dataset of 372 unique real-world TensorFlow graphs, more than an order of magnitude more data than previous work.
05/07/2019 ∙ by Aditya Paliwal, et al.
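As background for how REGAL plugs into the genetic algorithm, the sketch below is a toy BRKGA over random-key vectors with the standard elite-biased crossover. The `fitness` decoder and the constant `p_elite` are illustrative stand-ins: in REGAL the decoder would map keys to a device placement and schedule and return peak memory, and the learned policy would predict a per-node elite-inheritance probability instead of a constant.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(keys):
    # Toy decoder/objective: sort keys into a permutation and score its
    # distance from the identity permutation (lower is better).
    perm = np.argsort(keys)
    return int(np.abs(perm - np.arange(len(keys))).sum())

def brkga(n_genes=10, pop=30, n_elite=6, n_mutants=6, gens=40, p_elite=0.7):
    P = rng.random((pop, n_genes))
    best_per_gen = []
    for _ in range(gens):
        P = P[np.argsort([fitness(c) for c in P])]  # sort by fitness
        best_per_gen.append(fitness(P[0]))
        children = []
        for _ in range(pop - n_elite - n_mutants):
            e = P[rng.integers(n_elite)]             # elite parent
            o = P[rng.integers(n_elite, pop)]        # non-elite parent
            # Biased crossover: each gene comes from the elite parent
            # with probability p_elite (learned per node in REGAL).
            mask = rng.random(n_genes) < p_elite
            children.append(np.where(mask, e, o))
        # Next generation: elites survive, plus children and fresh mutants.
        P = np.vstack([P[:n_elite], children, rng.random((n_mutants, n_genes))])
    return best_per_gen
```

Because elites are copied forward unchanged, the best fitness is monotonically non-increasing across generations, which is the property the learned mutation/crossover distributions exploit.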

Attentive Neural Processes
Neural Processes (NPs) (Garnelo et al., 2018a;b) approach regression by learning to map a context set of observed input-output pairs to a distribution over regression functions. Each function models the distribution of the output given an input, conditioned on the context. NPs have the benefit of fitting observed data efficiently with linear complexity in the number of context input-output pairs, and can learn a wide family of conditional distributions; they learn predictive distributions conditioned on context sets of arbitrary size. Nonetheless, we show that NPs suffer a fundamental drawback of underfitting, giving inaccurate predictions at the inputs of the observed data they condition on. We address this issue by incorporating attention into NPs, allowing each input location to attend to the relevant context points for the prediction. We show that this greatly improves the accuracy of predictions, results in noticeably faster training, and expands the range of functions that can be modelled.
01/17/2019 ∙ by Hyunjik Kim, et al.
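The attention the paper adds can be pictured with plain scaled dot-product attention, where each target input queries the context inputs and averages their representations. A minimal NumPy sketch (the actual model uses learned multi-head attention rather than raw inputs as keys):

```python
import numpy as np

def cross_attention(query_x, context_x, context_r):
    """Each target input attends over the context points.

    query_x:   (n_target, d)  target inputs (queries)
    context_x: (n_context, d) context inputs (keys)
    context_r: (n_context, k) per-context-point representations (values)
    """
    scores = query_x @ context_x.T / np.sqrt(query_x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)             # softmax over context
    return w @ context_r                           # (n_target, k)
```

A query near one context input concentrates its weight there, which is exactly what lets the model fit the observed data points it conditions on instead of underfitting them.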

Graph Matching Networks for Learning the Similarity of Graph Structured Objects
This paper addresses the challenging problem of retrieval and matching of graph-structured objects, and makes two key contributions. First, we demonstrate how Graph Neural Networks (GNNs), which have emerged as an effective model for various supervised prediction problems defined on structured data, can be trained to produce embeddings of graphs in vector spaces that enable efficient similarity reasoning. Second, we propose a novel Graph Matching Network model that, given a pair of graphs as input, computes a similarity score between them by jointly reasoning on the pair through a new cross-graph attention-based matching mechanism. We demonstrate the effectiveness of our models on different domains, including the challenging problem of control-flow-graph-based function similarity search, which plays an important role in the detection of vulnerabilities in software systems. The experimental analysis demonstrates that our models are not only able to exploit structure in the context of similarity learning but can also outperform domain-specific baseline systems that have been carefully hand-engineered for these problems.
04/29/2019 ∙ by Yujia Li, et al.
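A stripped-down version of the embedding-model side makes the setup concrete: propagate node features along edges, pool to a graph vector, and compare vectors. The matching network additionally injects cross-graph attention inside propagation, which this sketch omits, and the real models use learned MLPs rather than a bare `tanh`.

```python
import numpy as np

def graph_embed(node_feats, adj, rounds=2):
    # Message passing: each node aggregates its neighbours' features,
    # then a nonlinearity is applied; finally sum-pool into one vector.
    h = node_feats
    for _ in range(rounds):
        h = np.tanh(h + adj @ h)
    return h.sum(axis=0)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

Similarity search then reduces to nearest-neighbour lookup in the embedding space, which is why the embedding model is cheap at retrieval time while the pairwise matching model is more accurate but costlier.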

Universal Transformers
Self-attentive feed-forward sequence models have been shown to achieve impressive results on sequence modeling tasks, thereby presenting a compelling alternative to recurrent neural networks (RNNs), which have remained the de facto standard architecture for many sequence modeling problems to date. Despite these successes, however, feed-forward sequence models like the Transformer fail to generalize in many tasks that recurrent models handle with ease (e.g. copying when the string lengths exceed those observed at training time). Moreover, and in contrast to RNNs, the Transformer model is not computationally universal, limiting its theoretical expressivity. In this paper we propose the Universal Transformer, which addresses these practical and theoretical shortcomings, and we show that it leads to improved performance on several tasks. Instead of recurring over the individual symbols of sequences like RNNs, the Universal Transformer repeatedly revises its representations of all symbols in the sequence with each recurrent step. In order to combine information from different parts of a sequence, it employs a self-attention mechanism in every recurrent step. Assuming sufficient memory, its recurrence makes the Universal Transformer computationally universal. We further employ an adaptive computation time (ACT) mechanism to allow the model to dynamically adjust the number of times the representation of each position in a sequence is revised. Beyond saving computation, we show that ACT can improve the accuracy of the model. Our experiments show that on various algorithmic tasks and a diverse set of large-scale language understanding tasks the Universal Transformer generalizes significantly better and outperforms both a vanilla Transformer and an LSTM in machine translation, and achieves a new state of the art on the bAbI linguistic reasoning task and the challenging LAMBADA language modeling task.
07/10/2018 ∙ by Mostafa Dehghani, et al.
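The depth-wise recurrence is easy to sketch: one shared transition (here a toy `tanh` layer with a single weight matrix, standing in for the paper's learned transition function) is applied repeatedly, with self-attention mixing positions at every step. Layer normalization, multiple heads, positional/timestep embeddings, and ACT halting are all omitted.

```python
import numpy as np

def self_attention(h):
    # Scaled dot-product self-attention over the positions in h.
    scores = h @ h.T / np.sqrt(h.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ h

def universal_transformer(h, W, steps=4):
    # Unlike a vanilla Transformer, the SAME parameters W are reused at
    # every step: recurrence over depth rather than per-layer weights.
    for _ in range(steps):
        a = h + self_attention(h)   # residual attention sub-layer
        h = a + np.tanh(a @ W)      # residual weight-tied transition
    return h
```

With ACT, `steps` would become a per-position quantity chosen by the model rather than a fixed constant.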

Preventing Posterior Collapse with δ-VAEs
Due to the phenomenon of "posterior collapse," current latent variable generative models pose a challenging design choice that either weakens the capacity of the decoder or requires augmenting the objective so that it does not only maximize the likelihood of the data. In this paper, we propose an alternative that utilizes the most powerful generative models as decoders, while optimising the variational lower bound and ensuring that the latent variables preserve and encode useful information. Our proposed δ-VAEs achieve this by constraining the variational family for the posterior to have a minimum distance to the prior. For sequential latent variable models, our approach resembles the classic representation learning approach of slow feature analysis. We demonstrate the efficacy of our approach at modeling text on LM1B and modeling images: learning representations, improving sample quality, and achieving state-of-the-art log-likelihood on CIFAR-10 and ImageNet 32×32.
01/10/2019 ∙ by Ali Razavi, et al.
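For a scalar Gaussian posterior and a standard normal prior, the committed-rate idea can be computed directly: KL(N(mu, sigma^2) || N(0, 1)) = 0.5*(sigma^2 + mu^2 - 1) - ln(sigma), so fixing sigma away from 1 guarantees a positive KL floor regardless of the mean the encoder outputs. This is a simplified stand-in; the paper's sequential construction instead uses a correlated (AR(1)) prior against an uncorrelated posterior.

```python
import numpy as np

def gaussian_kl(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ) for scalar parameters."""
    return 0.5 * (sigma**2 + mu**2 - 1.0) - np.log(sigma)

def committed_rate(sigma):
    # The KL is minimised over mu at mu = 0, so this value is a floor
    # ("delta") that holds whatever mean the encoder outputs.
    return gaussian_kl(0.0, sigma)
```

Because the KL term can never reach zero, the decoder cannot ignore the latent variable entirely, which is the mechanism that prevents posterior collapse.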

Classification Accuracy Score for Conditional Generative Models
Deep generative models (DGMs) of images are now sufficiently mature that they produce nearly photorealistic samples and obtain scores similar to the data distribution on heuristics such as Frechet Inception Distance. These results, especially on large-scale datasets such as ImageNet, suggest that DGMs are learning the data distribution in a perceptually meaningful space and can be used in downstream tasks. To test this latter hypothesis, we use class-conditional generative models from a number of model classes (variational autoencoders, autoregressive models, and generative adversarial networks) to infer the class labels of real data. We perform this inference by training an image classifier using only synthetic data, and using the classifier to predict labels on real data. The performance on this task, which we call Classification Accuracy Score (CAS), highlights some surprising results not captured by traditional metrics, and these results comprise our contributions. First, when using a state-of-the-art GAN (BigGAN), Top-5 accuracy decreases by 41.6% compared to the original data; other model classes, such as high-resolution VQ-VAE and Hierarchical Autoregressive Models, substantially outperform GANs on this benchmark. Second, CAS automatically surfaces particular classes for which generative models failed to capture the data distribution, and which were previously unknown in the literature. Third, we find that traditional GAN metrics such as Frechet Inception Distance are neither predictive of CAS nor useful when evaluating non-GAN models. Finally, we introduce Naive Augmentation Score, a variant of CAS where the image classifier is trained on both real and synthetic data, to demonstrate that naive augmentation improves classification performance only in limited circumstances. In order to facilitate better diagnoses of generative models, we open-source the proposed metric.
05/26/2019 ∙ by Suman Ravuri, et al.
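The metric itself is simple to express: train any classifier on labelled synthetic samples only, then score it on real data. The nearest-centroid classifier and two-class Gaussian toy data below are stand-ins for the Inception-style classifier and ImageNet samples used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_centroids(X, y):
    return np.stack([X[y == c].mean(axis=0) for c in np.unique(y)])

def predict(centroids, X):
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def classification_accuracy_score(synth_X, synth_y, real_X, real_y):
    # CAS: the classifier never sees real data during training.
    centroids = fit_centroids(synth_X, synth_y)
    return float((predict(centroids, real_X) == real_y).mean())

def two_class_sample(n):
    # Two well-separated Gaussian classes in 2-D.
    X = np.vstack([rng.normal(0.0, 1.0, (n, 2)), rng.normal(6.0, 1.0, (n, 2))])
    return X, np.repeat([0, 1], n)

real_X, real_y = two_class_sample(100)
synth_X, synth_y = two_class_sample(100)  # a "generator" that models the data well
cas = classification_accuracy_score(synth_X, synth_y, real_X, real_y)
```

A generator that drops modes or distorts a class would produce synthetic training data that misleads the classifier, lowering CAS for exactly those classes.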

Generating Diverse High-Fidelity Images with VQ-VAE-2
We explore the use of Vector Quantized Variational AutoEncoder (VQ-VAE) models for large-scale image generation. To this end, we scale and enhance the autoregressive priors used in VQ-VAE to generate synthetic samples of much higher coherence and fidelity than was possible before. We use simple feed-forward encoder and decoder networks, making our model an attractive candidate for applications where the encoding and/or decoding speed is critical. Additionally, VQ-VAE requires sampling an autoregressive model only in the compressed latent space, which is an order of magnitude faster than sampling in the pixel space, especially for large images. We demonstrate that a multi-scale hierarchical organization of VQ-VAE, augmented with powerful priors over the latent codes, is able to generate samples with quality that rivals that of state-of-the-art Generative Adversarial Networks on multifaceted datasets such as ImageNet, while not suffering from GANs' known shortcomings such as mode collapse and lack of diversity.
06/02/2019 ∙ by Ali Razavi, et al.
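At the core of VQ-VAE is a nearest-neighbour codebook lookup that replaces each continuous latent vector by its closest code; the autoregressive prior is then fit over the resulting discrete indices. A minimal sketch (training tricks such as the straight-through gradient estimator and codebook updates are omitted):

```python
import numpy as np

def vq_quantize(z, codebook):
    """Map each latent row of z to its nearest codebook entry.

    z:        (n, d) continuous encoder outputs
    codebook: (K, d) learned code vectors
    """
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)   # discrete codes the prior is fit over
    return codebook[idx], idx
```

In the multi-scale hierarchy, this quantization is applied at each level of the latent pyramid, with the top-level codes modelling global structure and the bottom-level codes local detail.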

Relational recurrent neural networks
Memory-based neural networks model temporal data by leveraging an ability to remember information for long periods. It is unclear, however, whether they also have an ability to perform complex relational reasoning with the information they remember. Here, we first confirm our intuitions that standard memory architectures may struggle at tasks that heavily involve an understanding of the ways in which entities are connected, i.e., tasks involving relational reasoning. We then remedy these deficits with a new memory module, a Relational Memory Core (RMC), which employs multi-head dot-product attention to allow memories to interact. Finally, we test the RMC on a suite of tasks that may profit from more capable relational reasoning across sequential information, and show large gains in RL domains (e.g. Mini Pac-Man), program evaluation, and language modeling, achieving state-of-the-art results on the WikiText-103, Project Gutenberg, and GigaWord datasets.
06/05/2018 ∙ by Adam Santoro, et al.
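A heavily simplified core step: the memory slots attend over the concatenation of themselves and the new input, so memories interact with each other and with incoming information. Learned query/key/value projections, multiple heads, gating, and the per-slot MLP of the real RMC are all omitted here.

```python
import numpy as np

def attend(q, k, v):
    # Scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def rmc_step(memory, x):
    """memory: (slots, d) matrix of memory rows; x: (d,) new input."""
    kv = np.vstack([memory, x[None, :]])    # memories + current input
    return memory + attend(memory, kv, kv)  # residual memory update
```

Letting each slot read from every other slot is what gives the module its relational character, as opposed to an LSTM cell whose state vectors never attend to one another.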

Learning Implicit Generative Models with the Method of Learned Moments
We propose a method of moments (MoM) algorithm for training large-scale implicit generative models. Moment estimation in this setting encounters two problems: it is often difficult to define the millions of moments needed to learn the model parameters, and it is hard to determine which properties are useful when specifying moments. To address the first issue, we introduce a moment network, and define the moments as the network's hidden units and the gradient of the network's output with respect to its parameters. To tackle the second problem, we use asymptotic theory to highlight desiderata for moments, namely that they should minimize the asymptotic variance of estimated model parameters, and introduce an objective to learn better moments. The sequence of objectives created by this Method of Learned Moments (MoLM) can train high-quality neural image samplers. On CIFAR-10, we demonstrate that MoLM-trained generators achieve significantly higher Inception Scores and lower Frechet Inception Distances than those trained with gradient-penalty-regularized and spectrally-normalized adversarial objectives. These generators also achieve nearly perfect Multi-Scale Structural Similarity Scores on CelebA, and can create high-quality samples of 128x128 images.
06/28/2018 ∙ by Suman Ravuri, et al.
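The classical method of moments the paper builds on works by picking statistics and choosing parameters so the model reproduces the data's values of those statistics. For a Gaussian, matching the first two moments has a closed form, shown below; MoLM replaces these hand-picked moments with millions of learned ones (a moment network's hidden units and parameter gradients) and trains an implicit sampler to match them.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Data" from an unknown Gaussian we want to fit.
data = rng.normal(loc=2.0, scale=1.5, size=50_000)

# Hand-picked moments: E[x] and E[x^2].
m1, m2 = data.mean(), (data**2).mean()

# Equate model moments to sample moments and solve for the parameters:
# E[x] = mu and Var[x] = E[x^2] - E[x]^2 = sigma^2.
mu_hat = m1
sigma_hat = np.sqrt(m2 - m1**2)
```

For implicit models there is no closed form, so the matching is done by gradient descent on a moment-discrepancy objective instead.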

Deep Audio-Visual Speech Recognition
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem: unconstrained natural language sentences, and in-the-wild videos. Our key contributions are: (1) we compare two models for lip reading, one using a CTC loss and the other using a sequence-to-sequence loss, both built on top of the transformer self-attention architecture; (2) we investigate to what extent lip reading is complementary to audio speech recognition, especially when the audio signal is noisy; (3) we introduce and publicly release a new dataset for audio-visual speech recognition, LRS2-BBC, consisting of thousands of natural sentences from British television. The models that we train surpass the performance of all previous work on a lip reading benchmark dataset by a significant margin.
09/06/2018 ∙ by Triantafyllos Afouras, et al.
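The difference between the two losses shows up at decoding time: a CTC model emits a label (possibly a blank) for every frame and relies on a deterministic collapse rule, while a sequence-to-sequence model emits output tokens directly. The collapse rule itself is tiny:

```python
def ctc_collapse(frame_labels, blank="-"):
    """Greedy CTC decoding: merge consecutive repeats, then drop blanks.

    The blank symbol lets the model separate genuinely repeated
    characters (e.g. the double 'l' in "hello") from one character
    held across several video frames.
    """
    out, prev = [], None
    for s in frame_labels:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return "".join(out)
```

This per-frame independence is what makes CTC fast to decode but unable to model output-side language structure, which the sequence-to-sequence model captures instead.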

Sample Efficient Adaptive Text-to-Speech
We present a meta-learning approach for adaptive text-to-speech (TTS) with little data. During training, we learn a multi-speaker model using a shared conditional WaveNet core and independent learned embeddings for each speaker. The aim of training is not to produce a neural network with fixed weights, which is then deployed as a TTS system. Instead, the aim is to produce a network that requires little data at deployment time to rapidly adapt to new speakers. We introduce and benchmark three strategies: (i) learning the speaker embedding while keeping the WaveNet core fixed, (ii) fine-tuning the entire architecture with stochastic gradient descent, and (iii) predicting the speaker embedding with a trained neural network encoder. The experiments show that these approaches are successful at adapting the multi-speaker neural network to new speakers, obtaining state-of-the-art results in both sample naturalness and voice similarity with merely a few minutes of audio data from new speakers.
09/27/2018 ∙ by Yutian Chen, et al.
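Strategy (i), freezing the shared core and fitting only the new speaker's embedding, can be sketched with a linear stand-in for the WaveNet core. Everything below, including the dimensions, learning rate, and the linear `synthesize` function, is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.normal(size=(8, 4))          # frozen shared "core" (stands in for WaveNet)

def synthesize(embedding):
    return W @ embedding             # core maps a speaker embedding to "audio"

# A little "audio" from a new speaker, generated by a hidden true embedding.
target = synthesize(rng.normal(size=4))

# Adapt: gradient descent on the embedding alone; W never changes.
e = np.zeros(4)
losses = []
for _ in range(200):
    err = synthesize(e) - target
    losses.append(float((err**2).mean()))
    e -= 0.05 * (2.0 / err.size) * (W.T @ err)   # exact MSE gradient w.r.t. e
```

Because only a handful of embedding parameters are updated, this strategy needs far less adaptation data than fine-tuning the whole network, which is the sample-efficiency trade-off the paper benchmarks.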