
Set Transformer
Many machine learning tasks such as multiple instance learning, 3D shape recognition and few-shot image classification are defined on sets of instances. Since solutions to such problems do not depend on the permutation of elements of the set, models used to address them should be permutation invariant. We present an attention-based neural network module, the Set Transformer, specifically designed to model interactions among elements in the input set. The model consists of an encoder and a decoder, both of which rely on attention mechanisms. In an effort to reduce computational complexity, we introduce an attention scheme inspired by inducing point methods from the sparse Gaussian process literature. It reduces the computation time of self-attention from quadratic to linear in the number of elements in the set. We show that our model is theoretically attractive and we evaluate it on a range of tasks, demonstrating increased performance compared to recent methods for set-structured data.
10/01/2018 ∙ by Juho Lee, et al.
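A minimal NumPy sketch of the inducing-point attention idea described above: a small set of inducing points attends to the full set, and the set then attends back to that summary, so the cost is linear rather than quadratic in the set size. The single-head, unparameterized form and the random inducing points are simplifications for illustration; they stand in for the paper's learned attention blocks.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def induced_set_attention(X, I):
    """Attend through m inducing points instead of all n set elements.

    X: (n, d) set elements, I: (m, d) inducing points with m << n.
    Cost is O(n * m) rather than the O(n^2) of full self-attention.
    """
    H = attention(I, X, X)     # (m, d): inducing points summarize the set
    return attention(X, H, H)  # (n, d): each element reads the summary back

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))      # a set with 1000 elements
I = rng.normal(size=(16, 64))        # 16 inducing points (learned in practice)
out = induced_set_attention(X, I)    # permutation-equivariant output, linear in n
```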

Bayesian inference on random simple graphs with power law degree distributions
We present a model for random simple graphs with a degree distribution that obeys a power law (i.e., is heavy-tailed). To attain this behavior, the edge probabilities in the graph are constructed from Bertoin-Fujita-Roynette-Yor (BFRY) random variables, which have been recently utilized in Bayesian statistics for the construction of power law models in several applications. Our construction readily extends to capture the structure of latent factors, similarly to stochastic blockmodels, while maintaining its power law degree distribution. The BFRY random variables are well approximated by gamma random variables in a variational Bayesian inference routine, which we apply to several network datasets for which power law degree distributions are a natural assumption. By learning the parameters of the BFRY distribution via probabilistic inference, we are able to automatically select the appropriate power law behavior from the data. In order to further scale our inference procedure, we adopt stochastic gradient ascent routines where the gradients are computed on minibatches (i.e., subsets) of the edges in the graph.
02/27/2017 ∙ by Juho Lee, et al.
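A rough NumPy sketch of the heavy-tailed ingredient: BFRY(alpha) variables sampled via their Gamma/Uniform representation and plugged into an edge-probability link. The 1 - exp(-w_i w_j / n) link and the scaling are illustrative assumptions for this demo, not necessarily the paper's exact construction.

```python
import numpy as np

def sample_bfry(alpha, size, rng):
    """BFRY(alpha) via X = G * U**(-1/alpha) with G ~ Gamma(1-alpha, 1)
    and U ~ Uniform(0, 1); the tail of X decays like x**(-1-alpha)."""
    g = rng.gamma(shape=1.0 - alpha, scale=1.0, size=size)
    u = rng.uniform(size=size)
    return g * u ** (-1.0 / alpha)

rng = np.random.default_rng(0)
n, alpha = 2000, 0.5
w = sample_bfry(alpha, n, rng)                      # heavy-tailed node weights
p = 1.0 - np.exp(-np.outer(w, w) / n)               # illustrative edge-probability link
np.fill_diagonal(p, 0.0)                            # simple graph: no self-loops
upper = np.triu(rng.uniform(size=(n, n)) < p, 1)    # sample each edge once
A = upper | upper.T                                 # symmetric adjacency matrix
degrees = A.sum(axis=1)                             # degree distribution is heavy-tailed
```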

Tree-Guided MCMC Inference for Normalized Random Measure Mixture Models
Normalized random measures (NRMs) provide a broad class of discrete random measures that are often used as priors for Bayesian nonparametric models. The Dirichlet process is a well-known example of an NRM. Most posterior inference methods for NRM mixture models rely on MCMC, since MCMC methods are easy to implement and their convergence is well studied. However, MCMC often suffers from slow convergence when the acceptance rate is low. Tree-based inference is an alternative deterministic posterior inference method, where Bayesian hierarchical clustering (BHC) and incremental Bayesian hierarchical clustering (IBHC) have been developed for DP and NRM mixture (NRMM) models, respectively. Although IBHC is a promising method for posterior inference in NRMM models due to its efficiency and applicability to online inference, its convergence is not guaranteed, since it relies on heuristics that simply select the best solution after multiple trials. In this paper, we present a hybrid inference algorithm for NRMM models which combines the merits of both MCMC and IBHC. Trees built by IBHC outline partitions of the data, which guide the Metropolis-Hastings procedure to employ appropriate proposals. Inheriting the nature of MCMC, our tree-guided MCMC (tgMCMC) is guaranteed to converge, and enjoys fast convergence thanks to the effective proposals guided by the trees. Experiments on both synthetic and real-world datasets demonstrate the benefit of our method.
11/18/2015 ∙ by Juho Lee, et al.
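The paper's contribution lies in how the IBHC trees shape the proposals; what makes any such proposal exact is the standard Metropolis-Hastings correction, sketched generically below. The proposal and its density here are placeholders, not the tree-guided split/merge proposals of the paper.

```python
import numpy as np

def metropolis_hastings_step(state, log_target, propose, log_proposal, rng):
    """One MH step: any proposal (e.g., a tree-guided split or merge of the
    current partition) is corrected by the usual acceptance ratio, so the
    chain still targets the exact posterior."""
    candidate = propose(state, rng)
    log_accept = (log_target(candidate) - log_target(state)
                  + log_proposal(candidate, state)    # q(state | candidate)
                  - log_proposal(state, candidate))   # q(candidate | state)
    if np.log(rng.uniform()) < log_accept:
        return candidate, True
    return state, False

# Toy usage: Gaussian target with a symmetric random-walk proposal.
rng = np.random.default_rng(0)
log_target = lambda x: -0.5 * x ** 2
propose = lambda x, rng: x + rng.normal(scale=0.5)
log_proposal = lambda x_to, x_from: 0.0               # symmetric, so it cancels
state, accepted = metropolis_hastings_step(0.0, log_target, propose, log_proposal, rng)
```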

Bayesian Hierarchical Clustering with Exponential Family: Small-Variance Asymptotics and Reducibility
Bayesian hierarchical clustering (BHC) is an agglomerative clustering method in which a probabilistic model is defined and its marginal likelihoods are evaluated to decide which clusters to merge. While BHC provides a few advantages over traditional distance-based agglomerative clustering algorithms, the successive evaluation of marginal likelihoods and careful hyperparameter tuning are cumbersome and limit scalability. In this paper we relax BHC into a non-probabilistic formulation, exploring small-variance asymptotics in conjugate-exponential models. We develop a novel clustering algorithm, referred to as relaxed BHC (RBHC), from the asymptotic limit of the BHC model, which exhibits the scalability of distance-based agglomerative clustering algorithms as well as the flexibility of Bayesian nonparametric models. We also investigate the reducibility of the dissimilarity measure that emerges from the asymptotic limit of the BHC model, which allows us to use scalable algorithms such as the nearest neighbor chain algorithm. Numerical experiments on both synthetic and real-world datasets demonstrate the validity and high performance of our method.
01/29/2015 ∙ by Juho Lee, et al.
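A compact sketch of the nearest-neighbor chain strategy that reducibility enables, written for a generic cluster dissimilarity. Single linkage is used as a placeholder; the RBHC dissimilarity from the small-variance limit would slot into the same interface.

```python
import numpy as np

def single_linkage(X, a, b):
    """Placeholder dissimilarity between clusters a and b (lists of indices);
    the RBHC dissimilarity from the small-variance limit would go here."""
    return min(np.linalg.norm(X[i] - X[j]) for i in a for j in b)

def nn_chain_clustering(X, dissim):
    """Agglomerative clustering via the nearest-neighbor chain algorithm.
    Correct whenever the dissimilarity is reducible; returns the merge history."""
    clusters = {i: [i] for i in range(len(X))}
    chain, merges = [], []
    while len(clusters) > 1:
        if not chain:
            chain.append(next(iter(clusters)))
        top = chain[-1]
        nearest = min((c for c in clusters if c != top),
                      key=lambda c: dissim(X, clusters[top], clusters[c]))
        if len(chain) > 1 and nearest == chain[-2]:
            a, b = chain.pop(), chain.pop()          # reciprocal nearest neighbors
            new_id = max(clusters) + 1
            clusters[new_id] = clusters.pop(a) + clusters.pop(b)
            merges.append((a, b, new_id))
        else:
            chain.append(nearest)
    return merges

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
history = nn_chain_clustering(X, single_linkage)     # 19 merges for 20 points
```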

DropMax: Adaptive Stochastic Softmax
We propose DropMax, a stochastic version of the softmax classifier which, at each iteration, drops non-target classes with some probability for each instance. Specifically, we overlay binary masking variables over the class output probabilities, which are learned based on the input via regularized variational inference. This stochastic regularization has the effect of building an ensemble classifier out of an exponential number of classifiers with different decision boundaries. Moreover, learning the dropout probabilities for non-target classes on each instance allows the classifier to focus more on classification against the most confusing classes. We validate our model on multiple public datasets for classification, on which it obtains improved accuracy over the regular softmax classifier and other baselines. Further analysis of the learned dropout masks shows that our model indeed selects confusing classes more often when it performs classification.
12/21/2017 ∙ by Hae Beom Lee, et al.
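A minimal sketch of the masked softmax at the heart of the idea: non-target logits are gated by Bernoulli masks before normalization, while the target class is always kept during training. The fixed retain probability is a placeholder; the paper learns per-class, per-instance retain probabilities via regularized variational inference.

```python
import numpy as np

def dropmax(logits, target, retain_prob, rng):
    """Stochastic softmax: non-target classes are dropped with probability
    1 - retain_prob; the target class is always kept during training."""
    z = rng.uniform(size=logits.shape) < retain_prob   # Bernoulli masks
    z[target] = True                                   # never drop the true class
    masked = np.exp(logits - logits.max()) * z
    return masked / masked.sum()

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.5, 0.3, -1.0])   # class scores for one instance
probs = dropmax(logits, target=0, retain_prob=0.5, rng=rng)
# Averaging predictions over many mask draws behaves like an ensemble of
# classifiers, each deciding against a different subset of competing classes.
```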

Adaptive Network Sparsification via Dependent Variational Beta-Bernoulli Dropout
While variational dropout approaches have been shown to be effective for network sparsification, they are still suboptimal in the sense that they set the dropout rate for each neuron without consideration of the input data. With such input-independent dropout, each neuron is evolved to be generic across inputs, which makes it difficult to sparsify networks without accuracy loss. To overcome this limitation, we propose adaptive variational dropout whose probabilities are drawn from a sparsity-inducing beta-Bernoulli prior. It allows each neuron to be evolved either to be generic or specific for certain inputs, or to be dropped altogether. Such input-adaptive, sparsity-inducing dropout allows the resulting network to tolerate a larger degree of sparsity without losing its expressive power, by removing redundancies among features. We validate our dependent variational beta-Bernoulli dropout on multiple public datasets, on which it obtains significantly more compact networks than baseline methods, with consistent accuracy improvements over the base networks.
05/28/2018 ∙ by Juho Lee, et al.
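A rough sketch of input-adaptive feature dropout in the spirit described above: a global sparsity-inducing beta-Bernoulli keep probability per neuron, modulated by the current input through a small gating map. The Beta hyperparameters and the gating weights are illustrative assumptions; the paper infers the dependent dropout posterior variationally.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 64                                         # neurons in the layer

# Global sparsity-inducing prior: with Beta(alpha/K, 1), most keep
# probabilities are pushed toward zero as K grows.
alpha = 4.0
pi = rng.beta(alpha / K, 1.0, size=K)

W_gate = rng.normal(scale=0.1, size=(K, K))    # illustrative gating weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dependent_dropout(h):
    """Input-adaptive dropout: each neuron's keep probability is the global
    beta-Bernoulli probability modulated by the current input."""
    keep = pi * sigmoid(W_gate @ h)            # per-neuron, per-input keep prob
    z = rng.uniform(size=K) < keep             # Bernoulli gates
    return h * z

h = rng.normal(size=K)                         # activations from the layer below
h_sparse = dependent_dropout(h)                # many entries are exactly zero
```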

Uncertainty-Aware Attention for Reliable Interpretation and Prediction
The attention mechanism is effective both at focusing deep learning models on relevant features and at interpreting them. However, attention may be unreliable, since the networks that generate it are often trained in a weakly-supervised manner. To overcome this limitation, we introduce the notion of input-dependent uncertainty to the attention mechanism, such that it generates attention for each feature with varying degrees of noise based on the given input, learning larger variance on instances it is uncertain about. We learn this Uncertainty-aware Attention (UA) mechanism using variational inference, and validate it on various risk prediction tasks from electronic health records, on which our model significantly outperforms existing attention models. The analysis of the learned attention shows that our model generates attention that complies with clinicians' interpretation, and provides richer interpretation via the learned variance. Further evaluation of both the accuracy of the uncertainty calibration and the prediction performance with an "I don't know" decision shows that UA yields networks with high reliability as well.
05/24/2018 ∙ by Jay Heo, et al.
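A small sketch of the core mechanism: each feature's attention weight is drawn from an input-dependent Gaussian, so the model reports both an attention value and how uncertain it is about that value. The two linear maps are illustrative stand-ins for the networks the paper trains with variational inference.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                         # number of input features

W_mu = rng.normal(scale=0.1, size=(d, d))      # maps input to attention means
W_logvar = rng.normal(scale=0.1, size=(d, d))  # maps input to attention log-variances

def uncertainty_aware_attention(x, n_samples=100):
    """Sample attention weights a ~ N(mu(x), sigma(x)^2), squash them to (0, 1),
    and apply them to the features; the spread across samples is the model's
    uncertainty about its own attention."""
    mu = W_mu @ x
    std = np.exp(0.5 * (W_logvar @ x))
    eps = rng.normal(size=(n_samples, d))
    a = 1.0 / (1.0 + np.exp(-(mu + std * eps)))   # sigmoid-squashed samples
    attended = a * x                              # (n_samples, d) attended features
    return attended.mean(axis=0), a.std(axis=0)   # averaged features, per-feature uncertainty

x = rng.normal(size=d)
context, attention_uncertainty = uncertainty_aware_attention(x)
```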

Mixed Effect Composite RNN-GP: A Personalized and Reliable Prediction Model for Healthcare
We present a personalized and reliable prediction model for healthcare, which can provide individually tailored medical services such as diagnosis, disease treatment, and prevention. Our proposed framework targets making reliable predictions from time-series data, such as Electronic Health Records (EHR), by modeling two complementary components: i) a shared component that captures the global trend across diverse patients, and ii) a patient-specific component that models the idiosyncratic variability of each patient. To this end, we propose a composite model of a deep recurrent neural network (RNN), which exploits the expressive power of the RNN to estimate global trends from a large number of patients, and Gaussian processes (GP), which probabilistically model individual time series given a relatively small number of time points. We evaluate the strength of our model on diverse and heterogeneous tasks in EHR datasets. The results show that our model significantly outperforms baselines such as an RNN alone, demonstrating a clear advantage over existing models when working with noisy medical data.
06/05/2018 ∙ by Ingyo Chung, et al.
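A stripped-down sketch of the two components: a shared trend supplies the population-level signal and a per-patient GP models the residual around it. Here the shared trend is a fixed placeholder function rather than a trained RNN, and the RBF kernel hyperparameters are arbitrary illustrative choices.

```python
import numpy as np

def rbf_kernel(t1, t2, lengthscale=5.0, variance=1.0):
    d = t1[:, None] - t2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def shared_trend(t):
    """Placeholder for the global trend; in the paper this role is played by
    a deep RNN trained on records from many patients."""
    return 0.05 * t

def personalized_predict(t_obs, y_obs, t_new, noise=0.1):
    """GP posterior mean over the patient-specific residuals, added back onto
    the shared trend to give a personalized forecast."""
    resid = y_obs - shared_trend(t_obs)
    K = rbf_kernel(t_obs, t_obs) + noise ** 2 * np.eye(len(t_obs))
    K_star = rbf_kernel(t_new, t_obs)
    return shared_trend(t_new) + K_star @ np.linalg.solve(K, resid)

rng = np.random.default_rng(0)
t_obs = np.array([0.0, 2.0, 5.0, 9.0, 14.0])    # few, irregularly spaced visits
y_obs = shared_trend(t_obs) + np.sin(t_obs / 3.0) + rng.normal(scale=0.1, size=5)
t_new = np.linspace(0.0, 20.0, 50)
y_pred = personalized_predict(t_obs, y_obs, t_new)
```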

Transductive Propagation Network for Few-shot Learning
Few-shot learning aims to build a learner that quickly generalizes to novel classes even when only a limited number of labeled examples is available (the so-called low-data problem). Meta-learning is commonly deployed to mimic the test environment in the training phase for good generalization, where episodes (i.e., learning problems) are manually constructed from the training set. This framework has gained a lot of attention for few-shot learning thanks to its impressive performance, though the low-data problem is not fully addressed. In this paper, we propose the Transductive Propagation Network (TPN), a transductive method that classifies the entire test set at once to alleviate the low-data problem. Specifically, our proposed network explicitly learns an underlying manifold space that is appropriate for propagating labels from the few-shot examples, where all parameters of the feature embedding, the manifold structure, and label propagation are estimated end-to-end on episodes. We evaluate the proposed method on the commonly used miniImageNet and tieredImageNet benchmarks and achieve state-of-the-art or promising results on these datasets.
05/25/2018 ∙ by Yanbin Liu, et al.
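The transductive step can be written as standard closed-form label propagation over the support and query points of an episode; a minimal sketch on generic embeddings follows. The fixed Gaussian similarity graph is an illustrative stand-in for the episode-specific, learned graph construction of TPN.

```python
import numpy as np

def label_propagation(embeddings, support_labels, n_classes, sigma=1.0, alpha=0.99):
    """Closed-form label propagation F* = (I - alpha * S)^(-1) Y over all support
    and query points of an episode, where S is the normalized similarity graph
    built on the embeddings."""
    n = len(embeddings)
    d2 = ((embeddings[:, None, :] - embeddings[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    deg = W.sum(axis=1)
    S = W / np.sqrt(np.outer(deg, deg))            # symmetric normalization
    Y = np.zeros((n, n_classes))
    for i, y in support_labels.items():            # one-hot labels for the support set
        Y[i, y] = 1.0
    F = np.linalg.solve(np.eye(n) - alpha * S, Y)  # propagate labels to query points
    return F.argmax(axis=1)

rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0.0, 0.3, (6, 2)), rng.normal(3.0, 0.3, (6, 2))])
support = {0: 0, 1: 0, 6: 1, 7: 1}                 # labeled few-shot examples
predictions = label_propagation(emb, support, n_classes=2)
```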

A Bayesian model for sparse graphs with flexible degree distribution and overlapping community structure
We consider a non-projective class of inhomogeneous random graph models with interpretable parameters and a number of interesting asymptotic properties. Using the results of Bollobás et al. [2007], we show that i) the class of models is sparse and ii) depending on the choice of the parameters, the model is either scale-free, with power-law exponent greater than 2, or has an asymptotic degree distribution which is power-law with exponential cutoff. We propose an extension of the model that can accommodate an overlapping community structure. Scalable posterior inference can be performed due to the specific choice of the link probability. We present experiments on five different real-world networks with up to 100,000 nodes and edges, showing that the model can provide a good fit to the degree distribution and recovers the latent community structure well.
10/03/2018 ∙ by Juho Lee, et al.
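A toy sketch of the kind of link probability such models use: each node carries nonnegative affiliations to a few latent communities, and the edge probability grows with the overlap of those affiliations. The gamma affiliations and the specific link function below are illustrative stand-ins for the paper's construction, which is chosen to control sparsity and the power-law exponent.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 500, 3                                      # nodes and latent communities

# Nonnegative community affiliations; the paper derives these from a specific
# construction that controls sparsity and the power-law exponent.
w = rng.gamma(shape=0.3, scale=1.0, size=(n, K))

p = 1.0 - np.exp(-(w @ w.T) / np.sqrt(n))          # edge probability grows with overlap
np.fill_diagonal(p, 0.0)                           # simple graph: no self-loops
upper = np.triu(rng.uniform(size=(n, n)) < p, 1)
A = upper | upper.T

degrees = A.sum(axis=1)
n_active = (w > 1.0).sum(axis=1)                   # nodes can belong to several communities
```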

Beyond the Chinese Restaurant and Pitman-Yor processes: Statistical Models with Double Power-law Behavior
Bayesian nonparametric approaches, in particular the Pitman-Yor process and the associated two-parameter Chinese Restaurant process, have been successfully used in applications where the data exhibit power-law behavior. Examples include natural language processing, natural images, and networks. There is also growing empirical evidence that some datasets exhibit a two-regime power-law behavior: one regime for small frequencies, and a second regime, with a different exponent, for high frequencies. In this paper, we introduce a class of completely random measures which are doubly regularly-varying. Contrary to the Pitman-Yor process, we show that when completely random measures in this class are normalized to obtain random probability measures and associated random partitions, such partitions exhibit a double power-law behavior. We discuss in particular three models within this class: the beta prime process (Broderick et al., 2015, 2018), a novel process called the generalized BFRY process, and a mixture construction. We derive efficient Markov chain Monte Carlo algorithms to estimate the parameters of these models. Finally, we show that the proposed models provide a better fit than the Pitman-Yor process on various datasets.
02/13/2019 ∙ by Fadhel Ayed, et al.
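Informally, and with sigma_1 and sigma_2 used here only as placeholder exponents, the two-regime behavior described above means the proportion p_j of clusters of size j obeys two different power laws, in contrast to the single exponent of the Pitman-Yor process:

```latex
p_j \asymp j^{-1-\sigma_1} \;\text{for small } j,
\qquad
p_j \asymp j^{-1-\sigma_2} \;\text{for large } j,
\qquad\text{whereas Pitman-Yor gives}\quad
p_j \asymp j^{-1-\sigma} \;\text{for all } j.
```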