
Inverting Deep Generative models, One layer at a time
We study the problem of inverting a deep generative model with ReLU activations. Inversion corresponds to finding a latent code vector that explains observed measurements as well as possible. In most prior works this is performed by attempting to solve a nonconvex optimization problem involving the generator. In this paper we obtain several novel theoretical results for the inversion problem. We show that for the realizable case, single-layer inversion can be performed exactly in polynomial time, by solving a linear program. Further, we show that for multiple layers, inversion is NP-hard and the preimage set can be nonconvex. For generative models of arbitrary depth, we show that exact recovery is possible in polynomial time with high probability, if the layers are expanding and the weights are randomly selected. Very recent work analyzed the same problem for gradient descent inversion. Their analysis requires significantly higher expansion (logarithmic in the latent dimension) while our proposed algorithm can provably reconstruct even with constant factor expansion. We also provide provable error bounds for different norms for reconstructing noisy observations. Our empirical validation demonstrates that we obtain better reconstructions when the latent dimension is large.
06/18/2019 ∙ by Qi Lei, et al.
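
The single-layer claim in the abstract above can be illustrated with a small sketch. In the realizable case y = ReLU(Wz + b), every positive output pins down a linear equality and every zero output gives a linear inequality, so finding z is a linear feasibility problem. The function name `invert_relu_layer`, the zero objective, and the tolerance are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np
from scipy.optimize import linprog


def invert_relu_layer(W, b, y, tol=1e-9):
    """Find z with ReLU(W z + b) = y by solving a linear feasibility program."""
    active = y > tol
    # Active units: w_i^T z + b_i = y_i.  Inactive units: w_i^T z + b_i <= 0.
    A_eq, b_eq = W[active], y[active] - b[active]
    A_ub, b_ub = W[~active], -b[~active]
    res = linprog(
        c=np.zeros(W.shape[1]),  # pure feasibility, so the objective is zero
        A_ub=A_ub if A_ub.size else None,
        b_ub=b_ub if b_ub.size else None,
        A_eq=A_eq if A_eq.size else None,
        b_eq=b_eq if b_eq.size else None,
        bounds=(None, None),  # the latent code is unconstrained in sign
    )
    if not res.success:
        raise ValueError("no latent code reproduces the observation")
    return res.x


rng = np.random.default_rng(0)
W = rng.standard_normal((20, 4))  # expanding layer: 4 latent dims, 20 outputs
b = np.zeros(20)
z_true = rng.standard_normal(4)
y = np.maximum(W @ z_true + b, 0.0)
z_hat = invert_relu_layer(W, b, y)
```

Any feasible point reproduces the observation; whether it also equals the original latent code depends on how many units are active.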

Random Warping Series: A Random Features Method for Time-Series Embedding
Time series data analytics has been a problem of substantial interest for decades, and Dynamic Time Warping (DTW) has been the most widely adopted technique to measure dissimilarity between time series. A number of global-alignment kernels have since been proposed in the spirit of DTW to extend its use to kernel-based estimation methods such as support vector machines. However, those kernels suffer from diagonal dominance of the Gram matrix and a quadratic complexity w.r.t. the sample size. In this work, we study a family of alignment-aware positive definite (p.d.) kernels, with its feature embedding given by a distribution of Random Warping Series (RWS). The proposed kernel does not suffer from the issue of diagonal dominance while naturally enjoying a Random Features (RF) approximation, which reduces the computational complexity of existing DTW-based techniques from quadratic to linear in terms of both the number and the length of time series. We also study the convergence of the RF approximation for the domain of time series of unbounded length. Our extensive experiments on 16 benchmark datasets demonstrate that RWS outperforms or matches state-of-the-art classification and clustering methods in both accuracy and computational time. Our code and data are available at <https://github.com/IBM/RandomWarpingSeries>.
09/14/2018 ∙ by Lingfei Wu, et al.
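
To make the idea in the abstract above concrete, here is a minimal sketch of the classic DTW dynamic program, followed by a simplified embedding that aligns a series against a handful of short random series. The helper `random_warping_features`, the Gaussian sampling, and the use of raw DTW distances as features are illustrative assumptions; the paper's actual feature map may differ.

```python
import random


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic program for the DTW distance."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def random_warping_features(series, num_features=8, max_len=5, seed=0):
    """Hypothetical embedding: DTW distances to short random series."""
    rng = random.Random(seed)
    feats = []
    for _ in range(num_features):
        length = rng.randint(1, max_len)
        omega = [rng.gauss(0.0, 1.0) for _ in range(length)]
        feats.append(dtw_distance(series, omega))
    return feats


features = random_warping_features([0.0, 0.5, 1.0, 0.5, 0.0])
```

Because each random series has bounded length, one feature costs time linear in the length of the input series, and embedding a dataset is linear in the number of series, which is the source of the quadratic-to-linear saving the abstract describes.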

Primal-Dual Block Frank-Wolfe
We propose a variant of the Frank-Wolfe algorithm for solving a class of sparse/low-rank optimization problems. Our formulation includes Elastic Net, regularized SVMs and phase retrieval as special cases. The proposed Primal-Dual Block Frank-Wolfe algorithm reduces the per-iteration cost while maintaining linear convergence rate. The per-iteration cost of our method depends on the structural complexity of the solution (i.e. sparsity/low-rank) instead of the ambient dimension. We empirically show that our algorithm outperforms the state-of-the-art methods on (multi-class) classification tasks.
06/06/2019 ∙ by Qi Lei, et al.
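
For readers unfamiliar with the base method, the following is a sketch of the classical Frank-Wolfe iteration, not the primal-dual block variant proposed in the paper. It minimizes a least-squares objective over an l1 ball; the problem choice, the function name `frank_wolfe_l1`, and the 2/(t+2) step size are assumptions for illustration.

```python
import numpy as np


def frank_wolfe_l1(A, y, radius=1.0, iters=200):
    """Minimize 0.5 * ||A x - y||^2 subject to ||x||_1 <= radius."""
    x = np.zeros(A.shape[1])
    for t in range(iters):
        grad = A.T @ (A @ x - y)
        # Linear minimization oracle over the l1 ball: a single signed vertex.
        i = int(np.argmax(np.abs(grad)))
        s = np.zeros_like(x)
        s[i] = -radius * np.sign(grad[i])
        gamma = 2.0 / (t + 2.0)  # standard diminishing step size
        x = (1.0 - gamma) * x + gamma * s
    return x


A = np.eye(3)
y = np.array([0.5, 0.0, 0.0])
x_fw = frank_wolfe_l1(A, y)
objective = 0.5 * np.sum((A @ x_fw - y) ** 2)
```

Each step touches only one coordinate of the vertex, which hints at why the per-iteration cost of Frank-Wolfe-type methods can track the sparsity of the solution rather than the ambient dimension.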

Similarity Preserving Representation Learning for Time Series Analysis
A considerable number of machine learning algorithms take instance-feature matrices as their inputs. As such, they cannot directly analyze time series data due to its temporal nature, usually unequal lengths, and complex properties. This is a great pity since many of these algorithms are effective, robust, efficient, and easy to use. In this paper, we bridge this gap by proposing an efficient representation learning framework that is able to convert a set of time series with equal or unequal lengths to a matrix format. In particular, we guarantee that the pairwise similarities between time series are well preserved after the transformation. The learned feature representation is particularly suitable to the class of learning problems that are sensitive to data similarities. Given a set of n time series, we first construct an n × n partially observed similarity matrix by randomly sampling O(n log n) pairs of time series and computing their pairwise similarities. We then propose an extremely efficient algorithm that solves a highly nonconvex and NP-hard problem to learn new features based on the partially observed similarity matrix. We use the learned features to conduct experiments on both data classification and clustering tasks. Our extensive experimental results demonstrate that the proposed framework is both effective and efficient.
02/12/2017 ∙ by Qi Lei, et al.

Coordinate Descent Methods for Symmetric Nonnegative Matrix Factorization
Given a symmetric nonnegative matrix A, symmetric nonnegative matrix factorization (symNMF) is the problem of finding a nonnegative matrix H, usually with far fewer columns than A, such that A ≈ HH^T. SymNMF can be used for data analysis and in particular for various clustering tasks. In this paper, we propose simple and very efficient coordinate descent schemes that solve this problem and can handle large and sparse input matrices. The effectiveness of our methods is illustrated on synthetic and real-world data sets, and we show that they perform favorably compared to recent state-of-the-art methods.
09/04/2015 ∙ by Arnaud Vandaele, et al.

Hessian-based Analysis of Large Batch Training and Robustness to Adversaries
Large batch size training of Neural Networks has been shown to incur accuracy loss when trained with the current methods. The precise underlying reasons for this are still not completely understood. Here, we study large batch size training through the lens of the Hessian operator and robust optimization. In particular, we perform a Hessian-based study to analyze how the landscape of the loss functional is different for large batch size training. We compute the true Hessian spectrum, without approximation, by backpropagating the second derivative. Our results on multiple networks show that, when training at large batch sizes, one tends to stop at points in the parameter space with a noticeably larger Hessian spectrum, i.e., where the eigenvalues of the Hessian are much larger. We then study how batch size affects robustness of the model in the face of adversarial attacks. All the results show that models trained with large batches are more susceptible to adversarial attacks, as compared to models trained with small batch sizes. Furthermore, we prove a theoretical result which shows that the problem of finding an adversarial perturbation is a saddle-free optimization problem. Finally, we show empirical results that demonstrate that adversarial training leads to areas with smaller Hessian spectrum. We present detailed experiments with five different network architectures tested on MNIST, CIFAR-10, and CIFAR-100 datasets.
02/22/2018 ∙ by Zhewei Yao, et al.
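
The abstract above relies on Hessian eigenvalues computed without forming the Hessian. A common way to obtain the largest eigenvalue is power iteration on Hessian-vector products. The sketch below substitutes a central finite difference of the gradient for the exact second-derivative backpropagation described in the abstract, so it is an approximation and not the authors' procedure; `top_hessian_eigenvalue`, `grad_fn`, and the toy quadratic are hypothetical.

```python
import numpy as np


def top_hessian_eigenvalue(grad_fn, w, iters=100, eps=1e-4, seed=0):
    """Estimate the largest-magnitude Hessian eigenvalue at w by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    eig = 0.0
    for _ in range(iters):
        # Finite-difference Hessian-vector product (stand-in for exact backprop).
        hv = (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)
        eig = float(v @ hv)  # Rayleigh quotient for the current unit vector
        v = hv / np.linalg.norm(hv)
    return eig


# Toy loss 0.5 * w^T A w, whose Hessian is A with eigenvalues 1, 2 and 5.
A = np.diag([1.0, 2.0, 5.0])
top_eig = top_hessian_eigenvalue(lambda w: A @ w, np.ones(3))
```

With an automatic differentiation framework, the finite difference would normally be replaced by an exact Hessian-vector product.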

Stabilizing Gradients for Deep Neural Networks via Efficient SVD Parameterization
Vanishing and exploding gradients are two of the main obstacles in training deep neural networks, especially in capturing long-range dependencies in recurrent neural networks (RNNs). In this paper, we present an efficient parameterization of the transition matrix of an RNN that allows us to stabilize the gradients that arise in its training. Specifically, we parameterize the transition matrix by its singular value decomposition (SVD), which allows us to explicitly track and control its singular values. We attain efficiency by using tools that are common in numerical linear algebra, namely Householder reflectors for representing the orthogonal matrices that arise in the SVD. By explicitly controlling the singular values, our proposed Spectral-RNN method allows us to easily solve the exploding gradient problem and we observe that it empirically solves the vanishing gradient issue to a large extent. We note that the SVD parameterization can be used for any rectangular weight matrix, hence it can be easily extended to any deep neural network, such as a multi-layer perceptron. Theoretically, we demonstrate that our parameterization does not lose any expressive power, and show how it controls generalization of RNN for the classification task. Our extensive experimental results also demonstrate that the proposed framework converges faster, and has good generalization, especially in capturing long-range dependencies, as shown on the synthetic addition and copy tasks, as well as on MNIST and Penn Tree Bank data sets.
03/25/2018 ∙ by Jiong Zhang, et al.
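
The parameterization described in the abstract above can be sketched for a square matrix as follows: two orthogonal factors are built as products of Householder reflectors, and the singular values are squashed into a band around 1. The tanh squashing, the `margin` parameter, and the function names are assumptions made for this illustration; the paper's exact bounding function may differ.

```python
import numpy as np


def householder_product(vectors, n):
    """Orthogonal matrix formed as a product of reflectors I - 2 v v^T."""
    Q = np.eye(n)
    for v in vectors:
        v = v / np.linalg.norm(v)
        Q = Q - 2.0 * np.outer(Q @ v, v)  # right-multiply Q by (I - 2 v v^T)
    return Q


def svd_parameterized_matrix(u_vecs, v_vecs, raw_sigma, center=1.0, margin=0.1):
    """Build W = U diag(sigma) V^T with sigma confined to center +/- margin."""
    n = raw_sigma.shape[0]
    U = householder_product(u_vecs, n)
    V = householder_product(v_vecs, n)
    sigma = center + margin * np.tanh(raw_sigma)  # assumed squashing function
    return U @ np.diag(sigma) @ V.T, sigma


rng = np.random.default_rng(0)
n = 5
W_rnn, sigma = svd_parameterized_matrix(
    rng.standard_normal((3, n)), rng.standard_normal((3, n)), rng.standard_normal(n)
)
singular_values = np.linalg.svd(W_rnn, compute_uv=False)
```

Since every singular value stays within the chosen band, repeated multiplication by the matrix can neither blow up nor shrink a vector faster than that band allows, which is the mechanism behind the gradient stabilization.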

Discrete Attacks and Submodular Optimization with Applications to Text Classification
Adversarial examples are carefully constructed modifications to an input that completely change the output of a classifier but are imperceptible to humans. Despite the success of such attacks on continuous data (such as image and audio samples), generating adversarial examples for discrete structures such as text has proven significantly more challenging. In this paper we formulate the attacks with discrete input on a set function as an optimization task. We prove that this set function is submodular for some popular neural network text classifiers under simplifying assumptions. This finding guarantees a 1 - 1/e approximation factor for attacks that use the greedy algorithm. Meanwhile, we show how to use the gradient of the attacked classifier to guide the greedy search. Empirical studies with our proposed optimization scheme show significantly improved attack ability and efficiency on three different text classification tasks over various baselines. We also use a joint sentence and word paraphrasing technique to maintain the original semantics and syntax of the text. This is validated by a human subject evaluation using subjective metrics on the quality and semantic coherence of our generated adversarial text.
12/01/2018 ∙ by Qi Lei, et al.
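
The 1 - 1/e guarantee mentioned in the abstract above is the classical bound for greedily maximizing a monotone submodular function under a cardinality budget. The sketch below shows that greedy rule on set coverage, a textbook submodular function, rather than on the paper's attack objective; `greedy_max_coverage` and the toy sets are illustrative assumptions.

```python
def greedy_max_coverage(sets, budget):
    """Pick up to `budget` sets, each time taking the largest marginal gain."""
    chosen, covered = [], set()
    for _ in range(budget):
        candidates = [i for i in range(len(sets)) if i not in chosen]
        if not candidates:
            break
        best = max(candidates, key=lambda i: len(sets[i] - covered))
        if not sets[best] - covered:
            break  # no remaining set adds anything new
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered


sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1}]
chosen, covered = greedy_max_coverage(sets, budget=2)
```

In an attack setting, the elements being chosen would be candidate word or sentence replacements and the gain would be the change in classifier output.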
Qi Lei