Anima Anandkumar

is this you? claim profile


Bren Professor at Caltech and Principal Scientist at NVIDIA

  • Stochastically Rank-Regularized Tensor Regression Networks

    Over-parametrization of deep neural networks has recently been shown to be key to their successful training. However, it also renders them prone to overfitting and makes them expensive to store and train. Tensor regression networks significantly reduce the number of effective parameters in deep neural networks while retaining accuracy and the ease of training. They replace the flattening and fully-connected layers with a tensor regression layer, where the regression weights are expressed through the factors of a low-rank tensor decomposition. In this paper, to further improve tensor regression networks, we propose a novel stochastic rank-regularization. It consists of a novel randomized tensor sketching method to approximate the weights of tensor regression layers. We theoretically and empirically establish the link between our proposed stochastic rank-regularization and the dropout on low-rank tensor regression. Extensive experimental results with both synthetic data and real world datasets (i.e., CIFAR-100 and the UK Biobank brain MRI dataset) support that the proposed approach i) improves performance in both classification and regression tasks, ii) decreases overfitting, iii) leads to more stable training and iv) improves robustness to adversarial attacks and random noise.

    02/27/2019 ∙ by Arinbjörn Kolbeinsson, et al. ∙ 83 share

    read it

  • Stochastic Linear Bandits with Hidden Low Rank Structure

    High-dimensional representations often have a lower dimensional underlying structure. This is particularly the case in many decision making settings. For example, when the representation of actions is generated from a deep neural network, it is reasonable to expect a low-rank structure whereas conventional structures like sparsity are not valid anymore. Subspace recovery methods, such as Principle Component Analysis (PCA) can find the underlying low-rank structures in the feature space and reduce the complexity of the learning tasks. In this work, we propose Projected Stochastic Linear Bandit (PSLB), an algorithm for high dimensional stochastic linear bandits (SLB) when the representation of actions has an underlying low-dimensional subspace structure. PSLB deploys PCA based projection to iteratively find the low rank structure in SLBs. We show that deploying projection methods assures dimensionality reduction and results in a tighter regret upper bound that is in terms of the dimensionality of the subspace and its properties, rather than the dimensionality of the ambient space. We modify the image classification task into the SLB setting and empirically show that, when a pre-trained DNN provides the high dimensional feature representations, deploying PSLB results in significant reduction of regret and faster convergence to an accurate model compared to state-of-art algorithm.

    01/28/2019 ∙ by Sahin Lale, et al. ∙ 18 share

    read it

  • Open Vocabulary Learning on Source Code with a Graph-Structured Cache

    Machine learning models that take computer program source code as input typically use Natural Language Processing (NLP) techniques. However, a major challenge is that code is written using an open, rapidly changing vocabulary due to, e.g., the coinage of new variable and method names. Reasoning over such a vocabulary is not something for which most NLP methods are designed. We introduce a Graph-Structured Cache to address this problem; this cache contains a node for each new word the model encounters with edges connecting each word to its occurrences in the code. We find that combining this graph-structured cache strategy with recent Graph-Neural-Network-based models for supervised learning on code improves the models' performance on a code completion task and a variable naming task --- with over 100 --- at the cost of a moderate increase in computation time.

    10/18/2018 ∙ by Milan Cvitkovic, et al. ∙ 2 share

    read it

  • Robust Regression for Safe Exploration in Control

    We study the problem of safe learning and exploration in sequential control problems. The goal is to safely collect data samples from an operating environment to learn an optimal controller. A central challenge in this setting is how to quantify uncertainty in order to choose provably-safe actions that allow us to collect useful data and reduce uncertainty, thereby achieving both improved safety and optimality. To address this challenge, we present a deep robust regression model that is trained to directly predict the uncertainty bounds for safe exploration. We then show how to integrate our robust regression approach with model-based control methods by learning a dynamic model with robustness bounds. We derive generalization bounds under domain shifts for learning and connect them with safety and stability bounds in control. We demonstrate empirically that our robust regression approach can outperform conventional Gaussian process (GP) based safe exploration in settings where it is difficult to specify a good GP prior.

    06/13/2019 ∙ by Anqi Liu, et al. ∙ 1 share

    read it

  • Compact Tensor Pooling for Visual Question Answering

    Performing high level cognitive tasks requires the integration of feature maps with drastically different structure. In Visual Question Answering (VQA) image descriptors have spatial structures, while lexical inputs inherently follow a temporal sequence. The recently proposed Multimodal Compact Bilinear pooling (MCB) forms the outer products, via count-sketch approximation, of the visual and textual representation at each spatial location. While this procedure preserves spatial information locally, outer-products are taken independently for each fiber of the activation tensor, and therefore do not include spatial context. In this work, we introduce multi-dimensional sketch (MD-sketch), a novel extension of count-sketch to tensors. Using this new formulation, we propose Multimodal Compact Tensor Pooling (MCT) to fully exploit the global spatial context during bilinear pooling operations. Contrarily to MCB, our approach preserves spatial context by directly convolving the MD-sketch from the visual tensor features with the text vector feature using higher order FFT. Furthermore we apply MCT incrementally at each step of the question embedding and accumulate the multi-modal vectors with a second LSTM layer before the final answer is chosen.

    06/20/2017 ∙ by Yang Shi, et al. ∙ 0 share

    read it

  • Homotopy Analysis for Tensor PCA

    Developing efficient and guaranteed nonconvex algorithms has been an important challenge in modern machine learning. Algorithms with good empirical performance such as stochastic gradient descent often lack theoretical guarantees. In this paper, we analyze the class of homotopy or continuation methods for global optimization of nonconvex functions. These methods start from an objective function that is efficient to optimize (e.g. convex), and progressively modify it to obtain the required objective, and the solutions are passed along the homotopy path. For the challenging problem of tensor PCA, we prove global convergence of the homotopy method in the "high noise" regime. The signal-to-noise requirement for our algorithm is tight in the sense that it matches the recovery guarantee for the best degree-4 sum-of-squares algorithm. In addition, we prove a phase transition along the homotopy path for tensor PCA. This allows to simplify the homotopy method to a local search algorithm, viz., tensor power iterations, with a specific initialization and a noise injection procedure, while retaining the theoretical guarantees.

    10/28/2016 ∙ by Anima Anandkumar, et al. ∙ 0 share

    read it

  • Training Input-Output Recurrent Neural Networks through Spectral Methods

    We consider the problem of training input-output recurrent neural networks (RNN) for sequence labeling tasks. We propose a novel spectral approach for learning the network parameters. It is based on decomposition of the cross-moment tensor between the output and a non-linear transformation of the input, based on score functions. We guarantee consistent learning with polynomial sample and computational complexity under transparent conditions such as non-degeneracy of model parameters, polynomial activations for the neurons, and a Markovian evolution of the input sequence. We also extend our results to Bidirectional RNN which uses both previous and future information to output the label at each time point, and is employed in many NLP tasks such as POS tagging.

    03/03/2016 ∙ by Hanie Sedghi, et al. ∙ 0 share

    read it

  • Efficient approaches for escaping higher order saddle points in non-convex optimization

    Local search heuristics for non-convex optimizations are popular in applied machine learning. However, in general it is hard to guarantee that such algorithms even converge to a local minimum, due to the existence of complicated saddle point structures in high dimensions. Many functions have degenerate saddle points such that the first and second order derivatives cannot distinguish them with local optima. In this paper we use higher order derivatives to escape these saddle points: we design the first efficient algorithm guaranteed to converge to a third order local optimum (while existing techniques are at most second order). We also show that it is NP-hard to extend this further to finding fourth order local optima.

    02/18/2016 ∙ by Anima Anandkumar, et al. ∙ 0 share

    read it

  • Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods

    Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.

    06/28/2015 ∙ by Majid Janzamin, et al. ∙ 0 share

    read it

  • A Scale Mixture Perspective of Multiplicative Noise in Neural Networks

    Corrupting the input and hidden layers of deep neural networks (DNNs) with multiplicative noise, often drawn from the Bernoulli distribution (or 'dropout'), provides regularization that has significantly contributed to deep learning's success. However, understanding how multiplicative corruptions prevent overfitting has been difficult due to the complexity of a DNN's functional form. In this paper, we show that when a Gaussian prior is placed on a DNN's weights, applying multiplicative noise induces a Gaussian scale mixture, which can be reparameterized to circumvent the problematic likelihood function. Analysis can then proceed by using a type-II maximum likelihood procedure to derive a closed-form expression revealing how regularization evolves as a function of the network's weights. Results show that multiplicative noise forces weights to become either sparse or invariant to rescaling. We find our analysis has implications for model compression as it naturally reveals a weight pruning rule that starkly contrasts with the commonly used signal-to-noise ratio (SNR). While the SNR prunes weights with large variances, seeing them as noisy, our approach recognizes their robustness and retains them. We empirically demonstrate our approach has a strong advantage over the SNR heuristic and is competitive to retraining with soft targets produced from a teacher model.

    06/10/2015 ∙ by Eric Nalisnick, et al. ∙ 0 share

    read it

  • Learning Mixed Membership Community Models in Social Tagging Networks through Tensor Methods

    Community detection in graphs has been extensively studied both in theory and in applications. However, detecting communities in hypergraphs is more challenging. In this paper, we propose a tensor decomposition approach for guaranteed learning of communities in a special class of hypergraphs modeling social tagging systems or folksonomies. A folksonomy is a tripartite 3-uniform hypergraph consisting of (user, tag, resource) hyperedges. We posit a probabilistic mixed membership community model, and prove that the tensor method consistently learns the communities under efficient sample complexity and separation requirements.

    03/16/2015 ∙ by Anima Anandkumar, et al. ∙ 0 share

    read it