
Stochastically Rank-Regularized Tensor Regression Networks
Overparametrization of deep neural networks has recently been shown to be key to their successful training. However, it also renders them prone to overfitting and makes them expensive to store and train. Tensor regression networks significantly reduce the number of effective parameters in deep neural networks while retaining accuracy and the ease of training. They replace the flattening and fully-connected layers with a tensor regression layer, where the regression weights are expressed through the factors of a low-rank tensor decomposition. In this paper, to further improve tensor regression networks, we propose a novel stochastic rank-regularization. It consists of a novel randomized tensor sketching method to approximate the weights of tensor regression layers. We theoretically and empirically establish the link between our proposed stochastic rank-regularization and dropout on low-rank tensor regression. Extensive experimental results with both synthetic data and real-world datasets (i.e., CIFAR-100 and the UK Biobank brain MRI dataset) show that the proposed approach i) improves performance in both classification and regression tasks, ii) decreases overfitting, iii) leads to more stable training and iv) improves robustness to adversarial attacks and random noise.
02/27/2019 ∙ by Arinbjörn Kolbeinsson, et al.
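As a toy illustration of the regularization above, the following NumPy sketch builds a low-rank regression weight from CP factors and randomly drops rank-1 components at training time, i.e., dropout applied over the rank dimension. All shapes and the CP parametrization are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy tensor regression layer: a weight tensor of shape (4, 4, 3) expressed
# through rank-R CP factors (hypothetical sizes, not the paper's network).
R = 8
A = rng.standard_normal((4, R))
B = rng.standard_normal((4, R))
C = rng.standard_normal((3, R))   # output dimension 3

def trl_forward(x, keep_prob=0.5, train=True):
    """Low-rank tensor regression with stochastic rank-regularization:
    at train time each rank-1 component is kept with probability keep_prob
    (a Bernoulli mask on the rank dimension, akin to dropout on the rank)."""
    if train:
        mask = rng.binomial(1, keep_prob, size=R) / keep_prob  # inverted-dropout scaling
    else:
        mask = np.ones(R)
    # y_k = sum_r mask_r * (a_r^T x b_r) * C[k, r]
    scores = np.einsum('ir,ij,jr->r', A, x, B) * mask
    return C @ scores

x = rng.standard_normal((4, 4))
y_eval = trl_forward(x, train=False)
print(y_eval.shape)  # (3,)
```

In expectation the inverted-dropout scaling leaves the forward pass unbiased, while each step sees a randomly rank-reduced weight.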

Stochastic Linear Bandits with Hidden Low Rank Structure
High-dimensional representations often have a lower-dimensional underlying structure. This is particularly the case in many decision-making settings. For example, when the representation of actions is generated from a deep neural network, it is reasonable to expect a low-rank structure, whereas conventional structures like sparsity are not valid anymore. Subspace recovery methods, such as Principal Component Analysis (PCA), can find the underlying low-rank structures in the feature space and reduce the complexity of the learning tasks. In this work, we propose Projected Stochastic Linear Bandit (PSLB), an algorithm for high-dimensional stochastic linear bandits (SLB) when the representation of actions has an underlying low-dimensional subspace structure. PSLB deploys PCA-based projection to iteratively find the low-rank structure in SLBs. We show that deploying projection methods assures dimensionality reduction and results in a tighter regret upper bound that is in terms of the dimensionality of the subspace and its properties, rather than the dimensionality of the ambient space. We modify the image classification task into the SLB setting and empirically show that, when a pretrained DNN provides the high-dimensional feature representations, deploying PSLB results in a significant reduction of regret and faster convergence to an accurate model compared to state-of-the-art algorithms.
01/28/2019 ∙ by Sahin Lale, et al.
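A minimal NumPy sketch of the projection step, under toy assumptions (exactly low-rank action features and a plain ridge estimator in place of the full PSLB bandit loop): recover the subspace with PCA/SVD, then estimate the reward parameter in k dimensions rather than the ambient d.

```python
import numpy as np

rng = np.random.default_rng(1)

# Ambient dimension d, true subspace dimension k (toy sizes, not the paper's).
d, k, n = 20, 3, 200
U_true = np.linalg.qr(rng.standard_normal((d, k)))[0]
actions = rng.standard_normal((n, k)) @ U_true.T           # features lie in a k-dim subspace
theta = U_true @ rng.standard_normal(k)                    # unknown reward parameter
rewards = actions @ theta + 0.01 * rng.standard_normal(n)

# PCA step: recover the k leading directions from the observed actions.
_, _, Vt = np.linalg.svd(actions, full_matrices=False)
P = Vt[:k]                                                 # (k, d) projection

# Ridge regression in the projected k-dim space instead of the ambient d-dim space.
Z = actions @ P.T                                          # (n, k)
theta_low = np.linalg.solve(Z.T @ Z + 1e-3 * np.eye(k), Z.T @ rewards)
theta_hat = P.T @ theta_low                                # lift back to ambient coordinates

print(np.linalg.norm(theta_hat - theta))                   # small estimation error
```

The estimation problem now scales with k instead of d, which is the source of the tighter, subspace-dimension-dependent regret bound described above.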

Open Vocabulary Learning on Source Code with a Graph-Structured Cache
Machine learning models that take computer program source code as input typically use Natural Language Processing (NLP) techniques. However, a major challenge is that code is written using an open, rapidly changing vocabulary due to, e.g., the coinage of new variable and method names. Reasoning over such a vocabulary is not something for which most NLP methods are designed. We introduce a Graph-Structured Cache to address this problem; this cache contains a node for each new word the model encounters with edges connecting each word to its occurrences in the code. We find that combining this graph-structured cache strategy with recent Graph-Neural-Network-based models for supervised learning on code improves the models' performance on a code completion task and a variable naming task, with over 100% relative improvement on the latter, at the cost of a moderate increase in computation time.
10/18/2018 ∙ by Milan Cvitkovic, et al.
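The cache itself is a simple structure, sketched below as a plain data structure (the class name and tokens are hypothetical; in the paper these nodes and edges are part of the graph consumed by a graph neural network):

```python
# Minimal sketch of a graph-structured vocabulary cache: one node per
# newly encountered word, with edges to each of its occurrences in the code.
class GraphStructuredCache:
    def __init__(self):
        self.word_nodes = {}          # word -> list of occurrence node ids

    def observe(self, word, occurrence_node):
        """Create the word's cache node on first sight and add an edge
        from the word node to this occurrence."""
        self.word_nodes.setdefault(word, []).append(occurrence_node)

    def occurrences(self, word):
        return self.word_nodes.get(word, [])

cache = GraphStructuredCache()
# Tokens from a hypothetical snippet `total_price = unit_price * qty`.
for node_id, tok in enumerate(["total_price", "unit_price", "qty", "unit_price"]):
    cache.observe(tok, node_id)

print(cache.occurrences("unit_price"))   # [1, 3]
```

Because lookups key on the word string itself, out-of-vocabulary identifiers need no fixed embedding table: the model can reason over the word's occurrence edges instead.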

Robust Regression for Safe Exploration in Control
We study the problem of safe learning and exploration in sequential control problems. The goal is to safely collect data samples from an operating environment to learn an optimal controller. A central challenge in this setting is how to quantify uncertainty in order to choose provably safe actions that allow us to collect useful data and reduce uncertainty, thereby achieving both improved safety and optimality. To address this challenge, we present a deep robust regression model that is trained to directly predict the uncertainty bounds for safe exploration. We then show how to integrate our robust regression approach with model-based control methods by learning a dynamic model with robustness bounds. We derive generalization bounds under domain shifts for learning and connect them with safety and stability bounds in control. We demonstrate empirically that our robust regression approach can outperform conventional Gaussian process (GP) based safe exploration in settings where it is difficult to specify a good GP prior.
06/13/2019 ∙ by Anqi Liu, et al.
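The uncertainty-gated action selection can be illustrated with a toy stand-in: a bootstrap ensemble of linear regressors whose spread serves as the uncertainty bound (the paper instead trains a deep robust regression model to predict such bounds directly; the threshold and widths here are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(5)

# Unknown 1-D "dynamics": reward 2*x plus observation noise.
X = rng.uniform(-1, 1, size=(100, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(100)

# Bootstrap ensemble of linear fits; disagreement acts as an uncertainty bound.
models = []
for _ in range(20):
    idx = rng.integers(0, 100, 100)               # bootstrap resample
    Xb = np.c_[np.ones(100), X[idx, 0]]
    models.append(np.linalg.lstsq(Xb, y[idx], rcond=None)[0])

def upper_bound(x, width=3.0):
    """Mean prediction plus a spread-based margin (toy uncertainty bound)."""
    preds = np.array([m[0] + m[1] * x for m in models])
    return preds.mean() + width * preds.std()

# Only admit actions whose upper bound stays below a safety threshold.
safe_limit = 1.5
print(upper_bound(0.0) <= safe_limit, upper_bound(1.0) <= safe_limit)  # True False
```

The gate rejects x = 1.0 because its predicted upper bound exceeds the limit, which mirrors how provably safe exploration restricts itself to actions whose worst-case outcome is certified.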

Compact Tensor Pooling for Visual Question Answering
Performing high-level cognitive tasks requires the integration of feature maps with drastically different structure. In Visual Question Answering (VQA), image descriptors have spatial structures, while lexical inputs inherently follow a temporal sequence. The recently proposed Multimodal Compact Bilinear pooling (MCB) forms the outer products, via count-sketch approximation, of the visual and textual representations at each spatial location. While this procedure preserves spatial information locally, outer products are taken independently for each fiber of the activation tensor, and therefore do not include spatial context. In this work, we introduce multi-dimensional sketch (MD-sketch), a novel extension of count-sketch to tensors. Using this new formulation, we propose Multimodal Compact Tensor Pooling (MCT) to fully exploit the global spatial context during bilinear pooling operations. In contrast to MCB, our approach preserves spatial context by directly convolving the MD-sketch of the visual tensor features with the text vector feature using a higher-order FFT. Furthermore, we apply MCT incrementally at each step of the question embedding and accumulate the multimodal vectors with a second LSTM layer before the final answer is chosen.
06/20/2017 ∙ by Yang Shi, et al.
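The count-sketch mechanics behind this style of pooling can be checked in a few lines: the count sketch of an outer product a bᵀ equals the circular convolution (computed via FFT) of the two vectors' individual sketches. The dimensions and hash draws below are arbitrary toy choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def count_sketch(x, h, s, D):
    """Project x into D buckets: bucket h[i] accumulates s[i] * x[i]."""
    out = np.zeros(D)
    np.add.at(out, h, s * x)
    return out

d, D = 16, 64
a, b = rng.standard_normal(d), rng.standard_normal(d)
h1, h2 = rng.integers(0, D, d), rng.integers(0, D, d)
s1, s2 = rng.choice([-1.0, 1.0], d), rng.choice([-1.0, 1.0], d)

# Compact bilinear pooling identity: the count sketch of the outer product
# a b^T equals the circular convolution of the two sketches, done via FFT.
sa = count_sketch(a, h1, s1, D)
sb = count_sketch(b, h2, s2, D)
conv = np.real(np.fft.ifft(np.fft.fft(sa) * np.fft.fft(sb)))

# Direct sketch of the outer product, with combined hashes (h1[i]+h2[j]) mod D.
direct = np.zeros(D)
for i in range(d):
    for j in range(d):
        direct[(h1[i] + h2[j]) % D] += s1[i] * s2[j] * a[i] * b[j]

print(np.allclose(conv, direct))   # True
```

This is the vector case that MCB relies on; the MD-sketch described above generalizes the same hash-and-sign construction to full tensors so that convolution can be carried out across spatial modes as well.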

Homotopy Analysis for Tensor PCA
Developing efficient and guaranteed nonconvex algorithms has been an important challenge in modern machine learning. Algorithms with good empirical performance such as stochastic gradient descent often lack theoretical guarantees. In this paper, we analyze the class of homotopy or continuation methods for global optimization of nonconvex functions. These methods start from an objective function that is efficient to optimize (e.g. convex), and progressively modify it to obtain the required objective, and the solutions are passed along the homotopy path. For the challenging problem of tensor PCA, we prove global convergence of the homotopy method in the "high noise" regime. The signal-to-noise requirement for our algorithm is tight in the sense that it matches the recovery guarantee for the best degree-4 sum-of-squares algorithm. In addition, we prove a phase transition along the homotopy path for tensor PCA. This allows us to simplify the homotopy method to a local search algorithm, viz., tensor power iterations, with a specific initialization and a noise injection procedure, while retaining the theoretical guarantees.
10/28/2016 ∙ by Anima Anandkumar, et al.
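The local-search endpoint of the homotopy path, tensor power iteration, is easy to sketch on the planted spiked-tensor model. The SVD-based initialization below is a common generic choice and only a stand-in for the paper's specific initialization and noise-injection procedure.

```python
import numpy as np

rng = np.random.default_rng(3)

# Planted spike model for tensor PCA: T = lam * v (x) v (x) v + noise.
n, lam = 30, 8.0
v = rng.standard_normal(n); v /= np.linalg.norm(v)
T = lam * np.einsum('i,j,k->ijk', v, v, v) + 0.1 * rng.standard_normal((n, n, n))

# Initialize from the top left singular vector of the mode-1 unfolding,
# then run tensor power iterations: u <- T(I, u, u), normalized.
u = np.linalg.svd(T.reshape(n, n * n), full_matrices=False)[0][:, 0]
for _ in range(50):
    u = np.einsum('ijk,j,k->i', T, u, u)
    u /= np.linalg.norm(u)

print(abs(u @ v))   # close to 1 when the spike is recovered
```

In the high-noise regime a random start generally fails, which is exactly why the homotopy analysis above matters: it justifies a particular initialization under which plain power iterations retain the guarantee.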

Training Input-Output Recurrent Neural Networks through Spectral Methods
We consider the problem of training input-output recurrent neural networks (RNNs) for sequence labeling tasks. We propose a novel spectral approach for learning the network parameters. It is based on decomposition of the cross-moment tensor between the output and a nonlinear transformation of the input, based on score functions. We guarantee consistent learning with polynomial sample and computational complexity under transparent conditions such as non-degeneracy of model parameters, polynomial activations for the neurons, and a Markovian evolution of the input sequence. We also extend our results to Bidirectional RNNs, which use both previous and future information to output the label at each time point, and are employed in many NLP tasks such as POS tagging.
03/03/2016 ∙ by Hanie Sedghi, et al.

Efficient approaches for escaping higher order saddle points in nonconvex optimization
Local search heuristics for nonconvex optimization are popular in applied machine learning. However, in general it is hard to guarantee that such algorithms even converge to a local minimum, due to the existence of complicated saddle point structures in high dimensions. Many functions have degenerate saddle points such that the first- and second-order derivatives cannot distinguish them from local optima. In this paper we use higher-order derivatives to escape these saddle points: we design the first efficient algorithm guaranteed to converge to a third-order local optimum (while existing techniques are at most second-order). We also show that it is NP-hard to extend this further to finding fourth-order local optima.
02/18/2016 ∙ by Anima Anandkumar, et al.
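A concrete degenerate saddle makes the point: f(x, y) = x^2 * y has vanishing gradient and vanishing Hessian at the origin, so first- and second-order tests are inconclusive there and third-order information is needed. The brute-force direction search below merely illustrates why third-order information suffices at this point; it is not the paper's algorithm.

```python
import numpy as np

# f(x, y) = x^2 * y: at the origin the gradient and the Hessian are both
# zero, yet the origin is not a local minimum, since the cubic term x^2 * y
# is negative in some directions.
def f(p):
    x, y = p
    return x**2 * y

origin = np.zeros(2)

# Scan unit directions for one along which the (third-order) cubic term
# makes f decrease; stepping that way escapes the degenerate saddle.
angles = np.linspace(0, 2 * np.pi, 360, endpoint=False)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
vals = np.array([f(origin + 1e-2 * u) for u in dirs])
best = dirs[np.argmin(vals)]

print(f(origin + 1e-2 * best) < f(origin))   # True: the saddle is escaped
```

Any direction with x != 0 and y < 0 works here, e.g. (1, -1)/sqrt(2); the guaranteed third-order procedure described above finds such descent directions without exhaustive search.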

Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods
Training neural networks is a challenging nonconvex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple, embarrassingly parallel linear and multilinear operations, and is competitive with standard stochastic gradient descent (SGD) in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
06/28/2015 ∙ by Majid Janzamin, et al.

A Scale Mixture Perspective of Multiplicative Noise in Neural Networks
Corrupting the input and hidden layers of deep neural networks (DNNs) with multiplicative noise, often drawn from the Bernoulli distribution (or 'dropout'), provides regularization that has significantly contributed to deep learning's success. However, understanding how multiplicative corruptions prevent overfitting has been difficult due to the complexity of a DNN's functional form. In this paper, we show that when a Gaussian prior is placed on a DNN's weights, applying multiplicative noise induces a Gaussian scale mixture, which can be reparameterized to circumvent the problematic likelihood function. Analysis can then proceed by using a type-II maximum likelihood procedure to derive a closed-form expression revealing how regularization evolves as a function of the network's weights. Results show that multiplicative noise forces weights to become either sparse or invariant to rescaling. We find our analysis has implications for model compression as it naturally reveals a weight pruning rule that starkly contrasts with the commonly used signal-to-noise ratio (SNR). While the SNR prunes weights with large variances, seeing them as noisy, our approach recognizes their robustness and retains them. We empirically demonstrate our approach has a strong advantage over the SNR heuristic and is competitive to retraining with soft targets produced from a teacher model.
06/10/2015 ∙ by Eric Nalisnick, et al.
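The scale-mixture effect is easy to see empirically: multiplying a Gaussian weight by a mean-one random scale (a hypothetical Gamma-distributed noise here, standing in for Bernoulli or Gaussian dropout noise) produces heavier-than-Gaussian tails, which kurtosis detects.

```python
import numpy as np

rng = np.random.default_rng(4)

# Multiplicative noise on a Gaussian weight yields a Gaussian scale mixture:
# z = eps * w with w ~ N(0, 1) and eps a random positive scale. Scale
# mixtures are heavier-tailed than a Gaussian, visible in excess kurtosis.
n = 200_000
w = rng.standard_normal(n)
eps = rng.gamma(shape=2.0, scale=0.5, size=n)     # mean-one scale noise (illustrative choice)
z = eps * w

def kurtosis(x):
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2) ** 2

print(kurtosis(w))   # ~3 for a Gaussian
print(kurtosis(z))   # > 3: heavier tails from the scale mixture
```

It is this induced scale-mixture form that the paper reparameterizes to obtain the closed-form type-II maximum likelihood analysis described above.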

Learning Mixed Membership Community Models in Social Tagging Networks through Tensor Methods
Community detection in graphs has been extensively studied both in theory and in applications. However, detecting communities in hypergraphs is more challenging. In this paper, we propose a tensor decomposition approach for guaranteed learning of communities in a special class of hypergraphs modeling social tagging systems or folksonomies. A folksonomy is a tripartite, 3-uniform hypergraph consisting of (user, tag, resource) hyperedges. We posit a probabilistic mixed membership community model, and prove that the tensor method consistently learns the communities under efficient sample complexity and separation requirements.
03/16/2015 ∙ by Anima Anandkumar, et al.
Anima Anandkumar
Bren Professor at Caltech and Principal Scientist at NVIDIA