
Stochastically Rank-Regularized Tensor Regression Networks
Overparametrization of deep neural networks has recently been shown to be key to their successful training. However, it also renders them prone to overfitting and makes them expensive to store and train. Tensor regression networks significantly reduce the number of effective parameters in deep neural networks while retaining accuracy and ease of training. They replace the flattening and fully-connected layers with a tensor regression layer, where the regression weights are expressed through the factors of a low-rank tensor decomposition. In this paper, to further improve tensor regression networks, we propose a novel stochastic rank-regularization. It consists of a randomized tensor sketching method that approximates the weights of tensor regression layers. We theoretically and empirically establish the link between our proposed stochastic rank-regularization and dropout on low-rank tensor regression. Extensive experimental results on both synthetic data and real-world datasets (CIFAR-100 and the UK Biobank brain MRI dataset) show that the proposed approach i) improves performance in both classification and regression tasks, ii) decreases overfitting, iii) leads to more stable training, and iv) improves robustness to adversarial attacks and random noise.
02/27/2019 ∙ by Arinbjörn Kolbeinsson, et al.
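To make the layer concrete, here is a minimal NumPy sketch of a tensor regression layer with CP-factored weights, plus a rank-wise dropout mask in the spirit of the proposed regularizer. All names, dimensions, and the CP choice are illustrative, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Activation tensor from a convolutional backbone: (batch, channels, height, width).
batch, C, H, W, n_out, rank = 8, 16, 6, 6, 10, 4

# CP factors of the regression weight tensor, one factor matrix per mode.
# Full weight: W[c, h, w, o] = sum_r U_c[c, r] * U_h[h, r] * U_w[w, r] * U_o[o, r].
U_c = rng.standard_normal((C, rank))
U_h = rng.standard_normal((H, rank))
U_w = rng.standard_normal((W, rank))
U_o = rng.standard_normal((n_out, rank))

def tensor_regression(x, keep=None):
    """Tensor regression layer; `keep` optionally masks rank-1 components,
    mimicking dropout applied along the rank dimension."""
    if keep is None:
        keep = np.ones(rank)
    # Contract each input mode with its factor, summing over c, h, w; keep rank r.
    z = np.einsum('bchw,cr,hr,wr->br', x, U_c * keep, U_h, U_w)
    return z @ U_o.T  # (batch, n_out)

x = rng.standard_normal((batch, C, H, W))
y_full = tensor_regression(x)

# Train-time stochastic rank mask: randomly drop rank-1 terms, rescale the rest.
mask = (rng.random(rank) > 0.5).astype(float)
y_reg = tensor_regression(x, keep=mask / max(mask.mean(), 1e-8))
print(y_full.shape, y_reg.shape)  # (8, 10) (8, 10)
```

Because the weights never materialize as a full C x H x W x n_out tensor, the parameter count drops from C*H*W*n_out to rank*(C + H + W + n_out).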

Learning Causal State Representations of Partially Observable Environments
Intelligent agents can cope with sensory-rich environments by learning task-agnostic state abstractions. In this paper, we propose mechanisms to approximate causal states, which optimally compress the joint history of actions and observations in partially observable Markov decision processes. Our proposed algorithm extracts causal state representations from RNNs that are trained to predict subsequent observations given the history. We demonstrate that these learned task-agnostic state abstractions can be used to efficiently learn policies for reinforcement learning problems with rich observation spaces. We evaluate agents using multiple partially observable navigation tasks with both discrete (GridWorld) and continuous (VizDoom, ALE) observation processes that cannot be solved by traditional memory-limited methods. Our experiments demonstrate systematic improvement of the DQN and tabular models using approximate causal state representations with respect to recurrent-DQN baselines trained with raw inputs.
06/25/2019 ∙ by Amy Zhang, et al.

Out-of-Distribution Detection Using Neural Rendering Generative Models
Out-of-distribution (OoD) detection is a natural downstream task for deep generative models, due to their ability to learn the input probability distribution. There are two main classes of approaches to OoD detection with deep generative models: those based on the likelihood and those based on the reconstruction loss. However, both approaches are unable to carry out OoD detection effectively, especially when the OoD samples have smaller variance than the training samples. For instance, both flow-based and VAE models assign higher likelihood to images from SVHN when trained on CIFAR-10 images. We use a recently proposed generative model known as the neural rendering model (NRM) and derive metrics for OoD detection. We show that the NRM unifies both approaches, since it provides a likelihood estimate and also carries out reconstruction in each layer of the neural network. Among various measures, we found the joint likelihood of the latent variables to be the most effective for OoD detection. Our results show that, when trained on CIFAR-10, lower likelihood (of latent variables) is assigned to SVHN images. Additionally, we show that this metric is consistent across other OoD datasets. To the best of our knowledge, this is the first work to show consistently lower likelihood for OoD data with smaller variance using deep generative models.
07/10/2019 ∙ by Yujia Huang, et al.
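The lower-variance failure mode described above is easy to reproduce even with a plain density model. A hypothetical one-dimensional sketch (not the NRM metric itself, just the phenomenon it addresses):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fit a Gaussian density to in-distribution training data.
train = rng.normal(0.0, 1.0, size=10_000)
mu, sigma = train.mean(), train.std()

def log_likelihood(x):
    """Log-density of x under the fitted Gaussian model."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

# In-distribution test set vs. a *lower-variance* OoD set
# (analogous to SVHN images scored by a CIFAR-10-trained model).
in_dist = rng.normal(0.0, 1.0, size=10_000)
ood = rng.normal(0.0, 0.25, size=10_000)

print(log_likelihood(in_dist).mean())  # ≈ -1.42
print(log_likelihood(ood).mean())      # ≈ -0.95: the OoD data scores *higher*
```

The low-variance samples concentrate near the mode of the fitted density, so a raw likelihood threshold flags the wrong set; this is the gap the per-layer latent-variable likelihoods aim to close.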

Stochastic Linear Bandits with Hidden Low Rank Structure
High-dimensional representations often have a lower-dimensional underlying structure. This is particularly the case in many decision-making settings. For example, when the representation of actions is generated by a deep neural network, it is reasonable to expect a low-rank structure, whereas conventional structural assumptions such as sparsity are no longer valid. Subspace recovery methods, such as Principal Component Analysis (PCA), can find the underlying low-rank structure in the feature space and reduce the complexity of the learning task. In this work, we propose Projected Stochastic Linear Bandit (PSLB), an algorithm for high-dimensional stochastic linear bandits (SLB) in which the representation of actions has an underlying low-dimensional subspace structure. PSLB deploys PCA-based projection to iteratively find the low-rank structure in SLBs. We show that deploying projection methods assures dimensionality reduction and results in a tighter regret upper bound that depends on the dimensionality of the subspace and its properties, rather than the dimensionality of the ambient space. We recast the image classification task in the SLB setting and empirically show that, when a pre-trained DNN provides the high-dimensional feature representations, deploying PSLB results in a significant reduction of regret and faster convergence to an accurate model compared to a state-of-the-art algorithm.
01/28/2019 ∙ by Sahin Lale, et al.
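A minimal sketch of the PCA projection step at the heart of PSLB. Dimensions and the noise level are illustrative; the full algorithm interleaves this subspace estimate with the bandit's confidence-set updates:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_actions = 100, 5, 50  # ambient dim, subspace dim, number of arms

# Action features lie (approximately) in a k-dimensional subspace of R^d.
basis = np.linalg.qr(rng.standard_normal((d, k)))[0]  # orthonormal (d, k)
X = rng.standard_normal((n_actions, k)) @ basis.T \
    + 0.01 * rng.standard_normal((n_actions, d))

# PCA step: the top-k right singular vectors of X recover the subspace.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:k]  # (k, d) projection matrix

X_proj = X @ P.T  # the linear bandit now runs in R^k instead of R^d
print(X_proj.shape)  # (50, 5)

# Reconstruction error is tiny, so (almost) no reward-relevant signal is lost.
print(np.linalg.norm(X - X_proj @ P))
```

Regret bounds for linear bandits scale with the feature dimension, which is why running the confidence-ellipsoid machinery in the recovered k-dimensional subspace rather than the d-dimensional ambient space tightens the bound.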

Open Vocabulary Learning on Source Code with a Graph-Structured Cache
Machine learning models that take computer program source code as input typically use Natural Language Processing (NLP) techniques. However, a major challenge is that code is written using an open, rapidly changing vocabulary due to, e.g., the coinage of new variable and method names. Reasoning over such a vocabulary is not something for which most NLP methods are designed. We introduce a Graph-Structured Cache to address this problem; this cache contains a node for each new word the model encounters, with edges connecting each word to its occurrences in the code. We find that combining this graph-structured cache strategy with recent Graph-Neural-Network-based models for supervised learning on code improves the models' performance on a code completion task and a variable naming task (over 100% relative improvement on the latter), at the cost of a moderate increase in computation time.
10/18/2018 ∙ by Milan Cvitkovic, et al.
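A toy version of the cache data structure, with illustrative names (in the actual model these nodes and edges become part of the program graph consumed by a graph neural network):

```python
from collections import defaultdict

class GraphStructuredCache:
    """Toy cache: one node per out-of-vocabulary word, with edges
    to every location in the code where that word occurs."""

    def __init__(self, vocab):
        self.vocab = set(vocab)          # closed base vocabulary
        self.edges = defaultdict(list)   # word -> list of occurrence node ids

    def add_token(self, word, node_id):
        # Only open-vocabulary words (unseen identifiers, etc.) are cached.
        if word not in self.vocab:
            self.edges[word].append(node_id)

cache = GraphStructuredCache(vocab={'def', 'return', '(', ')'})
for i, tok in enumerate(['def', 'fibMemo', '(', 'n', ')']):
    cache.add_token(tok, i)

print(dict(cache.edges))  # {'fibMemo': [1], 'n': [3]}
```

Because each cached word is linked to all of its occurrences, the model can relate distinct uses of a brand-new identifier without ever having seen it in the training vocabulary.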

Robust Regression for Safe Exploration in Control
We study the problem of safe learning and exploration in sequential control problems. The goal is to safely collect data samples from an operating environment to learn an optimal controller. A central challenge in this setting is how to quantify uncertainty in order to choose provably safe actions that allow us to collect useful data and reduce uncertainty, thereby achieving both improved safety and optimality. To address this challenge, we present a deep robust regression model that is trained to directly predict the uncertainty bounds needed for safe exploration. We then show how to integrate our robust regression approach with model-based control methods by learning a dynamics model with robustness bounds. We derive generalization bounds for learning under domain shift and connect them with safety and stability bounds in control. We demonstrate empirically that our robust regression approach can outperform conventional Gaussian process (GP) based safe exploration in settings where it is difficult to specify a good GP prior.
06/13/2019 ∙ by Anqi Liu, et al.

Compact Tensor Pooling for Visual Question Answering
Performing high-level cognitive tasks requires the integration of feature maps with drastically different structure. In Visual Question Answering (VQA), image descriptors have spatial structure, while lexical inputs inherently follow a temporal sequence. The recently proposed Multimodal Compact Bilinear pooling (MCB) forms the outer products, via count-sketch approximation, of the visual and textual representations at each spatial location. While this procedure preserves spatial information locally, outer products are taken independently for each fiber of the activation tensor and therefore do not include spatial context. In this work, we introduce the multi-dimensional sketch (MD-sketch), a novel extension of the count sketch to tensors. Using this new formulation, we propose Multimodal Compact Tensor Pooling (MCT) to fully exploit the global spatial context during bilinear pooling operations. In contrast to MCB, our approach preserves spatial context by directly convolving the MD-sketch of the visual tensor features with the text vector feature using a higher-order FFT. Furthermore, we apply MCT incrementally at each step of the question embedding and accumulate the multimodal vectors with a second LSTM layer before the final answer is chosen.
06/20/2017 ∙ by Yang Shi, et al.
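The count-sketch/FFT trick underlying compact bilinear pooling can be sketched for plain vectors as follows (this is the MCB-style building block; the paper's MD-sketch generalizes it from vectors to whole tensors):

```python
import numpy as np

rng = np.random.default_rng(0)
d, c = 64, 128  # input dimension, sketch dimension

# Count-sketch hash functions: a bucket hash h and a sign hash s per mode,
# drawn once and held fixed.
h1, s1 = rng.integers(0, c, d), rng.choice([-1, 1], d)
h2, s2 = rng.integers(0, c, d), rng.choice([-1, 1], d)

def count_sketch(x, h, s):
    """Project x into c buckets with random signs: out[h[i]] += s[i] * x[i]."""
    out = np.zeros(c)
    np.add.at(out, h, s * x)
    return out

def compact_bilinear(x, y):
    """Sketch of the outer product x ⊗ y, computed as the circular
    convolution of the two count sketches via the FFT."""
    fx = np.fft.rfft(count_sketch(x, h1, s1))
    fy = np.fft.rfft(count_sketch(y, h2, s2))
    return np.fft.irfft(fx * fy, n=c)

x, y = rng.standard_normal(d), rng.standard_normal(d)
z = compact_bilinear(x, y)
print(z.shape)  # (128,): a c-dim surrogate for the d*d outer product
```

The payoff is that the d*d-dimensional bilinear feature is never materialized: inner products between sketches approximate inner products between outer products in expectation, at O(c log c) cost per location.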

Homotopy Analysis for Tensor PCA
Developing efficient and guaranteed non-convex algorithms has been an important challenge in modern machine learning. Algorithms with good empirical performance, such as stochastic gradient descent, often lack theoretical guarantees. In this paper, we analyze the class of homotopy or continuation methods for global optimization of non-convex functions. These methods start from an objective function that is efficient to optimize (e.g., convex) and progressively modify it to obtain the required objective, with the solutions passed along the homotopy path. For the challenging problem of tensor PCA, we prove global convergence of the homotopy method in the "high noise" regime. The signal-to-noise requirement for our algorithm is tight in the sense that it matches the recovery guarantee for the best degree-4 sum-of-squares algorithm. In addition, we prove a phase transition along the homotopy path for tensor PCA. This allows us to simplify the homotopy method to a local search algorithm, viz. tensor power iteration, with a specific initialization and a noise injection procedure, while retaining the theoretical guarantees.
10/28/2016 ∙ by Anima Anandkumar, et al.
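The simplified local search, tensor power iteration on a spiked tensor, can be sketched as follows. A generic spectral warm start from the matrix unfolding stands in for the paper's specific initialization, and the noise injection step is omitted; the signal strength is chosen generously so the toy run converges:

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 30, 15.0

# Spiked tensor model for tensor PCA: T = lam * v⊗v⊗v + noise.
v = rng.standard_normal(n)
v /= np.linalg.norm(v)
T = lam * np.einsum('i,j,k->ijk', v, v, v) \
    + rng.standard_normal((n, n, n)) / np.sqrt(n)

# Initialize from the top singular vector of the mode-1 unfolding,
# then refine with tensor power iterations: u <- T(·, u, u) / ||T(·, u, u)||.
u = np.linalg.svd(T.reshape(n, n * n))[0][:, 0]
for _ in range(20):
    u = np.einsum('ijk,j,k->i', T, u, u)
    u /= np.linalg.norm(u)

print(abs(u @ v))  # close to 1: the planted component is recovered
```

Each iteration is a single tensor contraction, which is what makes the local search far cheaper than solving the degree-4 sum-of-squares relaxation while, per the paper's analysis, matching its recovery threshold.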

Training Input-Output Recurrent Neural Networks through Spectral Methods
We consider the problem of training input-output recurrent neural networks (RNNs) for sequence labeling tasks. We propose a novel spectral approach for learning the network parameters. It is based on the decomposition of the cross-moment tensor between the output and a nonlinear transformation of the input, defined via score functions. We guarantee consistent learning with polynomial sample and computational complexity under transparent conditions such as non-degeneracy of model parameters, polynomial activations for the neurons, and a Markovian evolution of the input sequence. We also extend our results to bidirectional RNNs, which use both past and future information to output the label at each time point and are employed in many NLP tasks such as POS tagging.
03/03/2016 ∙ by Hanie Sedghi, et al.

Efficient approaches for escaping higher order saddle points in non-convex optimization
Local search heuristics for non-convex optimization are popular in applied machine learning. However, in general it is hard to guarantee that such algorithms even converge to a local minimum, due to the existence of complicated saddle point structures in high dimensions. Many functions have degenerate saddle points, at which the first- and second-order derivatives cannot distinguish them from local optima. In this paper, we use higher-order derivatives to escape such saddle points: we design the first efficient algorithm guaranteed to converge to a third-order local optimum (while existing techniques are at most second-order). We also show that it is NP-hard to extend this further to finding fourth-order local optima.
02/18/2016 ∙ by Anima Anandkumar, et al.
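A one-dimensional illustration of a degenerate saddle point: at x = 0, f(x) = x³ has vanishing first and second derivatives, so gradient- and Hessian-based methods see a flat critical point, and only third-order information reveals that it is not a local optimum (finite-difference formulas used purely for illustration):

```python
# Central finite-difference approximations of the first three derivatives.
def d1(f, x, h=1e-4):
    return (f(x + h) - f(x - h)) / (2 * h)

def d2(f, x, h=1e-4):
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

def d3(f, x, h=1e-3):
    return (f(x + 2 * h) - 2 * f(x + h) + 2 * f(x - h) - f(x - 2 * h)) / (2 * h**3)

f = lambda x: x ** 3

# First and second derivatives both vanish at 0: a degenerate critical point.
print(d1(f, 0.0), d2(f, 0.0))  # both ≈ 0
# The third derivative is 6, certifying a descent direction that
# gradient descent and Newton's method cannot see.
print(d3(f, 0.0))              # ≈ 6
```

The paper's algorithm exploits exactly this kind of third-order information to escape such points; the NP-hardness result says the analogous fourth-order check cannot be done efficiently in general.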

Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum under a set of mild non-degeneracy conditions. It consists of simple, embarrassingly parallel linear and multilinear operations, and is competitive with standard stochastic gradient descent (SGD) in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
06/28/2015 ∙ by Majid Janzamin, et al.
Anima Anandkumar
Bren Professor at Caltech and Principal Scientist at NVIDIA