
Exponential Family Estimation via Adversarial Dynamics Embedding
We present an efficient algorithm for maximum likelihood estimation (MLE) of the general exponential family, even in cases where the energy function is represented by a deep neural network. We consider the primal-dual view of the MLE for the kinetics-augmented model, which naturally introduces an adversarial dual sampler. The sampler is represented by a novel neural network architecture, the dynamics embedding, which mimics dynamics-based samplers, e.g., Hamiltonian Monte Carlo and its variants. The dynamics embedding parametrization inherits the flexibility of HMC and provides tractable entropy estimation of the augmented model. Meanwhile, it couples the adversarial dual sampler with the primal model, reducing memory and sample complexity. We further show that several existing estimators, including contrastive divergence (Hinton, 2002), score matching (Hyvärinen, 2005), pseudo-likelihood (Besag, 1975), noise-contrastive estimation (Gutmann and Hyvärinen, 2010), non-local contrastive objectives (Vickrey et al., 2010), and minimum probability flow (Sohl-Dickstein et al., 2011), can be recast as special cases of the proposed method with different prefixed dual samplers. Finally, we empirically demonstrate the superiority of the proposed estimator over existing state-of-the-art methods on synthetic and real-world benchmarks.
04/27/2019 ∙ by Bo Dai, et al.
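The dynamics-based samplers the abstract refers to are built from leapfrog integration of Hamiltonian dynamics. A minimal sketch of one leapfrog trajectory for a target density proportional to exp(-U(x)); the function names, step size, and Gaussian example are illustrative, not the paper's implementation:

```python
import numpy as np

def leapfrog(x, p, grad_U, step_size, n_steps):
    """One leapfrog trajectory for Hamiltonian Monte Carlo on a target
    density proportional to exp(-U(x))."""
    x, p = np.array(x, dtype=float), np.array(p, dtype=float)
    p = p - 0.5 * step_size * grad_U(x)      # half step on momentum
    for _ in range(n_steps - 1):
        x = x + step_size * p                # full step on position
        p = p - step_size * grad_U(x)        # full step on momentum
    x = x + step_size * p
    p = p - 0.5 * step_size * grad_U(x)      # closing half step
    return x, -p                             # negate momentum for reversibility

# Standard Gaussian target: U(x) = x^2 / 2, so grad_U(x) = x.
x0, p0 = np.array([1.0]), np.array([0.5])
x1, p1 = leapfrog(x0, p0, lambda z: z, step_size=0.1, n_steps=10)
```

Because leapfrog is a symplectic integrator, the Hamiltonian H = U(x) + p²/2 is nearly conserved along the trajectory, which keeps the Metropolis acceptance rate high.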

Cost-Effective Incentive Allocation via Structured Counterfactual Inference
We address a practical problem ubiquitous in modern industry, in which a mediator tries to learn a policy for allocating strategic financial incentives to customers in a marketing campaign and observes only bandit feedback. In contrast to traditional policy optimization frameworks, we rely on a specific assumption for the reward structure and we incorporate budget constraints. We develop a new two-step method for solving this constrained counterfactual policy optimization problem. First, we cast the reward estimation problem as a domain adaptation problem with supplementary structure. Subsequently, the estimators are used for optimizing the policy with constraints. We establish theoretical error bounds for our estimation procedure and we empirically show that the approach leads to significant improvement on both synthetic and real datasets.
02/07/2019 ∙ by Romain Lopez, et al.
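As a toy illustration of the second step, allocating incentives under a budget constraint given estimated rewards, here is a greedy reward-per-cost sketch; the paper's actual constrained policy optimization is more sophisticated, and every name below is hypothetical:

```python
def allocate_incentives(est_rewards, costs, budget):
    """Greedy knapsack-style stand-in: spend on customers in order of
    estimated reward per unit cost until the budget runs out."""
    order = sorted(range(len(est_rewards)),
                   key=lambda i: est_rewards[i] / costs[i], reverse=True)
    chosen, spent = [], 0.0
    for i in order:
        if spent + costs[i] <= budget:
            chosen.append(i)
            spent += costs[i]
    return chosen

est_rewards = [5.0, 3.0, 2.0]   # hypothetical estimated incremental rewards
costs = [2.0, 1.0, 2.0]         # hypothetical cost of each customer's incentive
chosen = allocate_incentives(est_rewards, costs, budget=3.0)
```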

Kernel Exponential Family Estimation via Doubly Dual Embedding
We investigate penalized maximum log-likelihood estimation for exponential family distributions whose natural parameter resides in a reproducing kernel Hilbert space. Key to our approach is a novel technique, doubly dual embedding, that avoids computation of the partition function. This technique also allows the development of a flexible sampling strategy that amortizes the cost of Monte Carlo sampling in the inference stage. The resulting estimator can be easily generalized to kernel conditional exponential families. We furthermore establish a connection between infinite-dimensional exponential family estimation and MMD-GANs, revealing a new perspective for understanding GANs. Compared to current score-matching-based estimators, the proposed method improves both memory and time efficiency while enjoying stronger statistical properties, such as fully capturing smoothness in its statistical convergence rate, where the score matching estimator appears to saturate. Finally, we show that the proposed estimator can empirically outperform state-of-the-art methods in both kernel exponential family estimation and its conditional extension.
11/06/2018 ∙ by Bo Dai, et al.
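For concreteness, a natural parameter f in an RKHS takes the form f(x) = Σᵢ αᵢ k(xᵢ, x), giving an unnormalized log-density log p(x) = f(x) − log Z. A minimal sketch with a Gaussian kernel; the anchor points and coefficients are illustrative, and the partition function Z is left intractable, as in the paper:

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * bandwidth ** 2))

def unnormalized_log_density(x, anchors, alphas, bandwidth=1.0):
    # f(x) = sum_i alpha_i * k(x_i, x): the natural parameter in the RKHS.
    # log p(x) = f(x) - log Z; the partition function Z stays intractable.
    return sum(a * gaussian_kernel(xi, x, bandwidth)
               for a, xi in zip(alphas, anchors))

anchors = [np.array([0.0]), np.array([1.0])]   # illustrative anchor points
alphas = [0.5, -0.2]                           # illustrative coefficients
val = unnormalized_log_density(np.array([0.5]), anchors, alphas)
```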

Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection
The problem of cross-platform binary code similarity detection aims at determining whether two binary functions coming from different platforms are similar or not. It has many security applications, including plagiarism detection, malware detection, vulnerability search, etc. Existing approaches rely on approximate graph-matching algorithms, which are inevitably slow, sometimes inaccurate, and hard to adapt to a new task. To address these issues, in this work we propose a novel neural network-based approach to compute an embedding, i.e., a numeric vector, based on the control flow graph of each binary function; similarity detection can then be done efficiently by measuring the distance between the embeddings of two functions. We implement a prototype called Gemini. Our extensive evaluation shows that Gemini outperforms state-of-the-art approaches by large margins with respect to similarity detection accuracy. Further, Gemini speeds up the prior art's embedding generation time by 3 to 4 orders of magnitude and reduces the required training time from more than 1 week down to 30 minutes to 10 hours. Our real-world case studies demonstrate that Gemini can identify significantly more vulnerable firmware images than the state-of-the-art approach, i.e., Genius. Our research showcases a successful application of deep learning to computer security problems.
08/22/2017 ∙ by Xiaojun Xu, et al.
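Once each binary function is mapped to a numeric vector, similarity detection reduces to a distance computation between embeddings. A minimal sketch using cosine similarity; the embedding vectors here are made up, and Gemini's actual embedding network is not shown:

```python
import numpy as np

def cosine_similarity(u, v):
    # Higher score means the two binary functions are more likely similar.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

emb_a = np.array([0.2, 0.9, -0.1])     # hypothetical embedding of function A
emb_b = np.array([0.25, 0.85, -0.05])  # hypothetical embedding of function B
score = cosine_similarity(emb_a, emb_b)
```

Because the comparison is a constant-time vector operation, it can be run at scale against a database of precomputed embeddings, which is the source of the speedup over graph matching.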

Deep Hyperspherical Learning
Convolution as inner product has been the founding basis of convolutional neural networks (CNNs) and the key to end-to-end visual representation learning. Benefiting from deeper architectures, recent CNNs have demonstrated increasingly strong representation abilities. Despite such improvement, the increased depth and larger parameter space have also led to challenges in properly training a network. In light of such challenges, we propose hyperspherical convolution (SphereConv), a novel learning framework that gives angular representations on hyperspheres. We introduce SphereNet, deep hyperspherical convolution networks that are distinct from conventional inner-product-based convolutional networks. In particular, SphereNet adopts SphereConv as its basic convolution operator and is supervised by a generalized angular softmax loss, a natural loss formulation under SphereConv. We show that SphereNet can effectively encode discriminative representations and alleviate training difficulty, leading to easier optimization, faster convergence, and comparable (or even better) classification accuracy over its convolutional counterparts. We also provide some theoretical insights into the advantages of learning on hyperspheres. In addition, we introduce the learnable SphereConv, i.e., a natural improvement over prefixed SphereConv, and SphereNorm, i.e., hyperspherical learning as a normalization method. Experiments have verified our conclusions.
11/08/2017 ∙ by Weiyang Liu, et al.
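A SphereConv unit operates on the angle θ between a filter w and an input patch x rather than on their inner product. A sketch of the linear variant, g(θ) = 1 − 2θ/π, which maps θ ∈ [0, π] to [1, −1]; treating the patch as a flat vector is a simplification for illustration:

```python
import numpy as np

def sphere_conv(w, x):
    # theta is the angle between filter w and input patch x; the response
    # depends only on direction, not on the magnitudes of w and x.
    cos_theta = np.dot(w, x) / (np.linalg.norm(w) * np.linalg.norm(x))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    return 1.0 - 2.0 * theta / np.pi    # linear SphereConv

w = np.array([1.0, 0.0])
aligned = sphere_conv(w, np.array([2.0, 0.0]))     # theta = 0
orthogonal = sphere_conv(w, np.array([0.0, 1.0]))  # theta = pi/2
opposite = sphere_conv(w, np.array([-1.0, 0.0]))   # theta = pi
```

Note that `aligned` ignores the input's scale: doubling x leaves the response unchanged, which is the sense in which the representation lives on the hypersphere.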

Towards Black-box Iterative Machine Teaching
In this paper, we take an important step towards black-box machine teaching by considering the cross-space teaching setting, where the teacher and the learner use different feature representations and the teacher cannot fully observe the learner's model. In such a scenario, we study how the teacher is still able to teach the learner to achieve a faster convergence rate than traditional passive learning. We propose an active teacher model that can actively query the learner (i.e., make the learner take exams) to estimate the learner's status, and we provide sample complexity results for both teaching and querying. In the experiments, we compare the proposed active teacher with the omniscient teacher and verify the effectiveness of the active teacher model.
10/21/2017 ∙ by Weiyang Liu, et al.

Iterative Machine Teaching
In this paper, we consider the problem of machine teaching, the inverse problem of machine learning. Different from traditional machine teaching, which views the learner as a batch algorithm, we study a new paradigm where the learner uses an iterative algorithm and a teacher can feed examples sequentially and intelligently based on the current performance of the learner. We show that the teaching complexity in the iterative case is very different from that in the batch case. Instead of constructing a minimal training set for learners, our iterative machine teaching focuses on achieving fast convergence in the learner model. Depending on the level of information the teacher has about the learner model, we design teaching algorithms that can provably reduce the number of teaching examples and achieve faster convergence than learning without teachers. We also validate our theoretical findings with extensive experiments on different data distributions and real image datasets.
05/30/2017 ∙ by Weiyang Liu, et al.
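A simple instance of such a teacher, for a linear least-squares learner: at each round, pick the pool example whose gradient step moves the learner's weights closest to the target w*. A sketch under that setup; the learning rate, pool size, and seed are illustrative, not the paper's experimental configuration:

```python
import numpy as np

def teacher_pick(w, w_star, pool, lr):
    """Omniscient teacher: choose the example whose single gradient step
    leaves the learner closest to the target model w_star."""
    best, best_dist = None, np.inf
    for x, y in pool:
        grad = (w @ x - y) * x                  # gradient of 0.5 * (w.x - y)^2
        dist = np.linalg.norm(w - lr * grad - w_star)
        if dist < best_dist:
            best, best_dist = (x, y), dist
    return best

rng = np.random.default_rng(1)
w_star = np.array([1.0, -2.0])                  # target model the teacher knows
pool = [(x, float(w_star @ x)) for x in rng.standard_normal((20, 2))]
w = np.zeros(2)                                 # learner's initial weights
for _ in range(50):
    x, y = teacher_pick(w, w_star, pool, lr=0.1)
    w = w - 0.1 * (w @ x - y) * x               # learner's gradient step
```

Each chosen step contracts the error e = w − w* by (I − lr·xxᵀ), so greedily selected examples drive the learner to w* far faster than examples drawn at random.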

Wasserstein Learning of Deep Generative Point Process Models
Point processes are becoming very popular for modeling asynchronous sequential data due to their sound mathematical foundation and strength in modeling a variety of real-world phenomena. Currently, they are often characterized via an intensity function, which limits the model's expressiveness due to the unrealistic assumptions on its parametric form used in practice. Furthermore, they are learned via the maximum likelihood approach, which is prone to failure for multi-modal distributions of sequences. In this paper, we propose an intensity-free approach to point process modeling that transforms nuisance processes into a target one. Furthermore, we train the model in a likelihood-free manner, leveraging the Wasserstein distance between point processes. Experiments on various synthetic and real-world data substantiate the superiority of the proposed point process model over conventional ones.
05/23/2017 ∙ by Shuai Xiao, et al.
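In one dimension, a Wasserstein-type distance between two event sequences can be computed by matching events in time order and padding the shorter sequence at the observation horizon. A sketch of that common construction; the paper's exact metric may differ in details:

```python
def sequence_distance(s, t, horizon):
    """W1-style distance between two event sequences on [0, horizon]:
    match events in time order and pad the shorter sequence at the
    horizon, so unmatched events pay their distance to the boundary."""
    n = max(len(s), len(t))
    s = sorted(s) + [horizon] * (n - len(s))
    t = sorted(t) + [horizon] * (n - len(t))
    return sum(abs(a - b) for a, b in zip(s, t))
```

Unlike a likelihood, this distance compares whole sequences directly, so a generator can be trained against it even when the sequence distribution is multi-modal.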

Learning Combinatorial Optimization Algorithms over Graphs
The design of good heuristics or approximation algorithms for NP-hard combinatorial optimization problems often requires significant specialized knowledge and trial-and-error. Can we automate this challenging, tedious process and learn the algorithms instead? In many real-world applications, it is typically the case that the same optimization problem is solved again and again on a regular basis, maintaining the same problem structure but differing in the data. This provides an opportunity for learning heuristic algorithms that exploit the structure of such recurring problems. In this paper, we propose a unique combination of reinforcement learning and graph embedding to address this challenge. The learned greedy policy behaves like a meta-algorithm that incrementally constructs a solution, with the action determined by the output of a graph embedding network capturing the current state of the solution. We show that our framework can be applied to a diverse range of optimization problems over graphs and learns effective algorithms for the Minimum Vertex Cover, Maximum Cut, and Traveling Salesman problems.
04/05/2017 ∙ by Hanjun Dai, et al.
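The greedy meta-algorithm can be sketched independently of the learned component: repeatedly add the highest-scoring node until the combinatorial constraint is met. Below, a degree heuristic stands in for the learned graph-embedding score on Minimum Vertex Cover; the `score_fn` interface is our assumption, not the paper's API:

```python
def greedy_vertex_cover(edges, score_fn):
    """Greedy meta-algorithm: add the highest-scoring node until every
    edge is covered. In the paper the score comes from a graph-embedding
    network trained with reinforcement learning; here it is pluggable."""
    uncovered = set(edges)
    cover = []
    while uncovered:
        candidates = {u for e in uncovered for u in e}
        best = max(candidates, key=lambda v: score_fn(v, uncovered))
        cover.append(best)
        uncovered = {e for e in uncovered if best not in e}
    return cover

# Degree heuristic as a stand-in for the learned scoring function.
degree = lambda v, uncovered: sum(v in e for e in uncovered)
edges = [(0, 1), (1, 2), (2, 3)]
cover = greedy_vertex_cover(edges, degree)
```

Swapping `degree` for a learned score is the whole point of the framework: the loop stays fixed across problems while the scoring function adapts to the recurring problem distribution.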

Deep Semi-Random Features for Nonlinear Function Approximation
We propose semi-random features for nonlinear function approximation. The flexibility of semi-random features lies between that of the fully adjustable units in deep learning and the random features used in kernel methods. For one-hidden-layer models with semi-random features, we prove, with no unrealistic assumptions, that the model classes contain an arbitrarily good function as the width increases (universality), and that, despite non-convexity, we can find such a good function (optimization theory) that generalizes to unseen new data (generalization bound). For deep models, with no unrealistic assumptions, we prove universal approximation ability, a lower bound on approximation error, a partial optimization guarantee, and a generalization bound. Depending on the problem, the generalization bound of deep semi-random features can be exponentially better than the known bounds for deep ReLU nets; our generalization error bound can be independent of the depth, the number of trainable weights, and the input dimensionality. In experiments, we show that semi-random features can match the performance of neural networks by using slightly more units, and that they outperform random features while using significantly fewer units. Moreover, we introduce a new implicit ensemble method by using semi-random features.
02/28/2017 ∙ by Kenji Kawaguchi, et al.
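One plausible reading of a linear semi-random unit: a frozen random weight vector supplies the 0/1 activation pattern, while a trainable weight vector carries the linear response. A sketch of a single such layer; the shapes, names, and this specific gating form are our assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def semi_random_layer(X, W_random, W_trainable):
    """Semi-random units: the random, frozen weights decide which units
    fire (a 0/1 gate), and the trainable weights set the output values."""
    gate = (X @ W_random > 0)        # fixed random activation pattern
    return gate * (X @ W_trainable)  # trainable linear response

X = rng.standard_normal((5, 3))
W_r = rng.standard_normal((3, 4))   # frozen after random initialization
W_t = rng.standard_normal((3, 4))   # would be learned by gradient descent
H = semi_random_layer(X, W_r, W_t)
```

Because the activation pattern does not depend on the trainable weights, the loss is convex in `W_t` for a single layer, which is what makes the optimization analysis tractable.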

Diverse Neural Network Learns True Target Functions
Neural networks are a powerful class of functions that can be trained with simple gradient descent to achieve state-of-the-art performance on a variety of applications. Despite their practical success, there is a paucity of results that provide theoretical guarantees on why they are so effective. Lying in the center of the problem is the difficulty of analyzing the non-convex loss function with potentially numerous local minima and saddle points. Can neural networks corresponding to the stationary points of the loss function learn the true target function? If yes, what are the key factors contributing to such nice optimization properties? In this paper, we answer these questions by analyzing one-hidden-layer neural networks with ReLU activation, and show that despite the non-convexity, neural networks with diverse units have no spurious local minima. We bypass the non-convexity issue by directly analyzing the first-order optimality condition, and show that the loss can be made arbitrarily small if the minimum singular value of the "extended feature matrix" is large enough. We make novel use of techniques from kernel methods and geometric discrepancy, and identify a new relation linking the smallest singular value to the spectrum of a kernel function associated with the activation function and to the diversity of the units. Our results also suggest a novel regularization function to promote unit diversity for potentially better generalization.
11/09/2016 ∙ by Bo Xie, et al.
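As we read the abstract, the "extended feature matrix" for a one-hidden-layer ReLU network stacks, per hidden unit, the inputs gated by that unit's activation pattern; its smallest singular value then controls how small the loss can be made. A sketch of computing it; the exact definition in the paper may differ, and all shapes here are illustrative:

```python
import numpy as np

def extended_feature_matrix(X, W):
    """Per hidden unit k, the column block holds each input x_j gated by
    unit k's ReLU activation 1[w_k . x_j > 0]; blocks are concatenated."""
    gates = X @ W.T > 0                          # (n, k) activation pattern
    blocks = [gates[:, [k]] * X for k in range(W.shape[0])]
    return np.hstack(blocks)                     # (n, k * d)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))                 # inputs
W = rng.standard_normal((8, 3))                  # hidden-unit weight vectors
D = extended_feature_matrix(X, W)
sigma_min = np.linalg.svd(D, compute_uv=False)[-1]
```

Diverse (well-spread) unit directions keep the column blocks far from linearly dependent, which is intuitively why diversity pushes `sigma_min` up and rules out spurious local minima.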
Le Song
Associate Director, Center for Machine Learning, Associate Professor at Georgia Institute of Technology