Juho Lee

  • Set Transformer

    Many machine learning tasks, such as multiple instance learning, 3D shape recognition, and few-shot image classification, are defined on sets of instances. Since solutions to such problems do not depend on the permutation of elements of the set, models used to address them should be permutation invariant. We present an attention-based neural network module, the Set Transformer, specifically designed to model interactions among elements in the input set. The model consists of an encoder and a decoder, both of which rely on attention mechanisms. To reduce computational complexity, we introduce an attention scheme inspired by inducing point methods from the sparse Gaussian process literature. It reduces the computation time of self-attention from quadratic to linear in the number of elements in the set. We show that our model is theoretically attractive and we evaluate it on a range of tasks, demonstrating increased performance compared to recent methods for set-structured data.

    10/01/2018 ∙ by Juho Lee, et al.
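
    The inducing-point idea in the Set Transformer abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it omits the learned projections, multiple heads, and other layers a real model would have, and the function names (`attention`, `induced_attention`, `encode_set`) and the array sizes are hypothetical. A small set of m inducing points first attends to the n set elements, then the elements attend to the m summaries, so the cost grows as n times m rather than n squared.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Numerically stable softmax along the given axis.
    scores = scores - scores.max(axis=axis, keepdims=True)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    # Scaled dot-product attention: (nq, d), (nk, d), (nk, dv) -> (nq, dv).
    dim = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(dim), axis=-1)
    return weights @ values

def induced_attention(x, inducing):
    # Step 1: m inducing points attend to the n set elements, cost O(m * n).
    summary = attention(inducing, x, x)        # shape (m, d)
    # Step 2: the n elements attend to the m summaries, cost O(n * m).
    return attention(x, summary, summary)      # shape (n, d)

def encode_set(x, inducing):
    # Mean pooling over elements turns the equivariant map into an
    # invariant set representation (a simple stand-in for a decoder).
    return induced_attention(x, inducing).mean(axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 8))          # a set of n = 100 elements (assumed sizes)
inducing = rng.normal(size=(4, 8))     # m = 4 inducing points, random here, learned in practice
perm = rng.permutation(len(x))

out = induced_attention(x, inducing)
out_perm = induced_attention(x[perm], inducing)
print(np.allclose(out_perm, out[perm]))                                     # equivariance check
print(np.allclose(encode_set(x, inducing), encode_set(x[perm], inducing)))  # invariance check
```

    Permuting the input rows permutes the per-element outputs in the same way, and the pooled representation is unchanged, which is the permutation invariance the abstract asks for.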

  • Bayesian inference on random simple graphs with power law degree distributions

    We present a model for random simple graphs with a degree distribution that obeys a power law (i.e., is heavy-tailed). To attain this behavior, the edge probabilities in the graph are constructed from Bertoin-Fujita-Roynette-Yor (BFRY) random variables, which have been recently utilized in Bayesian statistics for the construction of power law models in several applications. Our construction readily extends to capture the structure of latent factors, similarly to stochastic blockmodels, while maintaining its power law degree distribution. The BFRY random variables are well approximated by gamma random variables in a variational Bayesian inference routine, which we apply to several network datasets for which power law degree distributions are a natural assumption. By learning the parameters of the BFRY distribution via probabilistic inference, we are able to automatically select the appropriate power law behavior from the data. In order to further scale our inference procedure, we adopt stochastic gradient ascent routines where the gradients are computed on minibatches (i.e., subsets) of the edges in the graph.

    02/27/2017 ∙ by Juho Lee, et al.

  • Tree-Guided MCMC Inference for Normalized Random Measure Mixture Models

    Normalized random measures (NRMs) provide a broad class of discrete random measures that are often used as priors for Bayesian nonparametric models. The Dirichlet process (DP) is a well-known example of an NRM. Most posterior inference methods for NRM mixture models rely on MCMC, since MCMC methods are easy to implement and their convergence is well studied. However, MCMC often suffers from slow convergence when the acceptance rate is low. Tree-based inference is an alternative deterministic posterior inference method, where Bayesian hierarchical clustering (BHC) and incremental Bayesian hierarchical clustering (IBHC) have been developed for DP and NRM mixture (NRMM) models, respectively. Although IBHC is a promising method for posterior inference in NRMM models due to its efficiency and applicability to online inference, its convergence is not guaranteed, since it uses a heuristic that simply selects the best solution after multiple trials. In this paper, we present a hybrid inference algorithm for NRMM models that combines the merits of both MCMC and IBHC. Trees built by IBHC outline partitions of the data, which guide the Metropolis-Hastings procedure to employ appropriate proposals. Inheriting the nature of MCMC, our tree-guided MCMC (tgMCMC) is guaranteed to converge, and it converges quickly thanks to the effective proposals guided by trees. Experiments on both synthetic and real-world datasets demonstrate the benefit of our method.

    11/18/2015 ∙ by Juho Lee, et al.

  • Bayesian Hierarchical Clustering with Exponential Family: Small-Variance Asymptotics and Reducibility

    Bayesian hierarchical clustering (BHC) is an agglomerative clustering method in which a probabilistic model is defined and its marginal likelihoods are evaluated to decide which clusters to merge. While BHC provides a few advantages over traditional distance-based agglomerative clustering algorithms, successive evaluation of marginal likelihoods and careful hyperparameter tuning are cumbersome and limit scalability. In this paper we relax BHC into a non-probabilistic formulation, exploring small-variance asymptotics in conjugate-exponential models. From the asymptotic limit of the BHC model, we develop a novel clustering algorithm, referred to as relaxed BHC (RBHC), that exhibits the scalability of distance-based agglomerative clustering algorithms as well as the flexibility of Bayesian nonparametric models. We also investigate the reducibility of the dissimilarity measure that emerges from the asymptotic limit of the BHC model, which allows us to use scalable algorithms such as the nearest neighbor chain algorithm. Numerical experiments on both synthetic and real-world datasets demonstrate the validity and high performance of our method.

    01/29/2015 ∙ by Juho Lee, et al.

  • DropMax: Adaptive Stochastic Softmax

    We propose DropMax, a stochastic version of the softmax classifier which, at each iteration, drops non-target classes with some probability for each instance. Specifically, we overlay binary masking variables over class output probabilities, which are learned based on the input via regularized variational inference. This stochastic regularization has the effect of building an ensemble classifier out of an exponential number of classifiers with different decision boundaries. Moreover, learning the dropout probabilities for non-target classes on each instance allows the classifier to focus more on classification against the most confusing classes. We validate our model on multiple public datasets for classification, on which it obtains improved accuracy over the regular softmax classifier and other baselines. Further analysis of the learned dropout masks shows that our model indeed selects confusing classes more often when it performs classification.

    12/21/2017 ∙ by Hae Beom Lee, et al.
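
    The masking step described in the DropMax abstract can be sketched as follows. This is a hedged illustration, not the paper's code: the retain probabilities are fixed, hypothetical numbers here, whereas the abstract says they are learned from the input by variational inference, and the function name `dropmax_probs` and the small constant `eps` are assumptions. A binary mask is sampled per class, the target class is always kept, and the softmax is renormalized over the surviving classes.

```python
import numpy as np

def dropmax_probs(logits, target, retain_prob, rng, eps=1e-20):
    # Sample one binary mask per class; 1 keeps the class, 0 drops it.
    retain_prob = np.asarray(retain_prob, dtype=float)
    mask = (rng.random(logits.shape) < retain_prob).astype(float)
    mask[target] = 1.0  # the target class is never dropped
    # Masked softmax: dropped classes get (near) zero weight.
    shifted = logits - logits.max()          # for numerical stability
    weights = (mask + eps) * np.exp(shifted)
    return mask, weights / weights.sum()

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, -1.0])    # hypothetical class scores for one instance
retain_prob = [0.9, 0.3, 0.5, 0.5]          # hypothetical; learned per instance in the paper
target = 0

mask, probs = dropmax_probs(logits, target, retain_prob, rng)
plain = np.exp(logits - logits.max())
plain = plain / plain.sum()                 # ordinary softmax for comparison
print(mask, probs, plain)
```

    Because dropped classes leave the normalizer, the target probability under any sampled mask is at least its ordinary softmax value. Averaging over many masks gives the ensemble effect the abstract mentions.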

  • Adaptive Network Sparsification via Dependent Variational Beta-Bernoulli Dropout

    While variational dropout approaches have been shown to be effective for network sparsification, they are still suboptimal in the sense that they set the dropout rate for each neuron without considering the input data. With such input-independent dropout, each neuron evolves to be generic across inputs, which makes it difficult to sparsify networks without accuracy loss. To overcome this limitation, we propose adaptive variational dropout whose probabilities are drawn from a sparsity-inducing beta-Bernoulli prior. It allows each neuron to evolve to be either generic or specific to certain inputs, or to be dropped altogether. Such input-adaptive, sparsity-inducing dropout allows the resulting network to tolerate a larger degree of sparsity without losing its expressive power by removing redundancies among features. We validate our dependent variational beta-Bernoulli dropout on multiple public datasets, on which it obtains significantly more compact networks than baseline methods, with consistent accuracy improvements over the base networks.

    05/28/2018 ∙ by Juho Lee, et al.

  • Uncertainty-Aware Attention for Reliable Interpretation and Prediction

    The attention mechanism is effective both in focusing deep learning models on relevant features and in interpreting them. However, attentions may be unreliable since the networks that generate them are often trained in a weakly-supervised manner. To overcome this limitation, we introduce the notion of input-dependent uncertainty to the attention mechanism, such that it generates attention for each feature with varying degrees of noise based on the given input, learning larger variance on instances it is uncertain about. We learn this Uncertainty-aware Attention (UA) mechanism using variational inference, and validate it on various risk prediction tasks from electronic health records, on which our model significantly outperforms existing attention models. The analysis of the learned attentions shows that our model generates attentions that comply with clinicians' interpretation and provides richer interpretation via learned variance. Further evaluation of both the accuracy of the uncertainty calibration and the prediction performance with an "I don't know" decision shows that UA yields networks with high reliability as well.

    05/24/2018 ∙ by Jay Heo, et al.

  • Mixed Effect Composite RNN-GP: A Personalized and Reliable Prediction Model for Healthcare

    We present a personalized and reliable prediction model for healthcare, which can provide individually tailored medical services such as diagnosis, disease treatment and prevention. Our proposed framework aims to make reliable predictions from time-series data, such as Electronic Health Records (EHR), by modeling two complementary components: i) a shared component that captures the global trend across diverse patients and ii) a patient-specific component that models idiosyncratic variability for each patient. To this end, we propose a composite model of a deep recurrent neural network (RNN), to exploit the expressive power of the RNN in estimating global trends from a large number of patients, and Gaussian Processes (GP), to probabilistically model individual time-series given a relatively small number of time points. We evaluate the strength of our model on diverse and heterogeneous tasks in EHR datasets. The results show that our model significantly outperforms baselines such as RNN, demonstrating a clear advantage over existing models when working with noisy medical data.

    06/05/2018 ∙ by Ingyo Chung, et al.

  • Transductive Propagation Network for Few-shot Learning

    Few-shot learning aims to build a learner that quickly generalizes to novel classes even when only a limited number of labeled examples are available (the so-called low-data problem). Meta-learning is commonly deployed to mimic the test environment in the training phase for good generalization, where episodes (i.e., learning problems) are manually constructed from the training set. This framework has attracted considerable attention in few-shot learning for its impressive performance, though the low-data problem is not fully addressed. In this paper, we propose the Transductive Propagation Network (TPN), a transductive method that classifies the entire test set at once to alleviate the low-data problem. Specifically, our proposed network explicitly learns an underlying manifold space that is appropriate for propagating labels from few-shot examples, where all parameters of feature embedding, manifold structure, and label propagation are estimated in an end-to-end way on episodes. We evaluate the proposed method on the commonly used miniImageNet and tieredImageNet benchmarks and achieve state-of-the-art or promising results on these datasets.

    05/25/2018 ∙ by Yanbin Liu, et al.
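
    The transductive step in the abstract, propagating labels from a few labeled examples to the whole test set at once, can be illustrated with classical closed-form label propagation on a similarity graph. This is a sketch under assumptions rather than the TPN method itself: a Gaussian similarity with a fixed `sigma` stands in for the learned manifold structure, raw 2-D points stand in for learned embeddings, and the name `propagate_labels` and the values of `alpha` and `sigma` are hypothetical.

```python
import numpy as np

def propagate_labels(features, labels, num_classes, alpha=0.99, sigma=1.0):
    # labels holds a class index for labeled points and -1 for unlabeled ones.
    n = len(features)
    # Gaussian similarity graph (fixed here; a stand-in for a learned one).
    sq_dists = ((features[:, None, :] - features[None, :, :]) ** 2).sum(axis=-1)
    w = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(w, 0.0)
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    degree = w.sum(axis=1)
    s = w / np.sqrt(np.outer(degree, degree))
    # One-hot rows for labeled points, zero rows for unlabeled points.
    y = np.zeros((n, num_classes))
    labeled = labels >= 0
    y[labeled, labels[labeled]] = 1.0
    # Closed-form propagation: F = (I - alpha * S)^{-1} Y.
    scores = np.linalg.solve(np.eye(n) - alpha * s, y)
    return scores.argmax(axis=1)

# Two well-separated clusters with a single labeled example each.
cluster = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5]])
features = np.vstack([cluster, cluster + 5.0])
labels = np.array([0, -1, -1, -1, 1, -1, -1, -1])

predictions = propagate_labels(features, labels, num_classes=2)
print(predictions)
```

    All unlabeled points are classified jointly through one linear solve, so the structure of the unlabeled data itself helps compensate for having only one label per class.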

  • A Bayesian model for sparse graphs with flexible degree distribution and overlapping community structure

    We consider a non-projective class of inhomogeneous random graph models with interpretable parameters and a number of interesting asymptotic properties. Using the results of Bollobás et al. [2007], we show that i) the class of models is sparse and ii) depending on the choice of the parameters, the model is either scale-free, with power-law exponent greater than 2, or has an asymptotic degree distribution that is power-law with exponential cut-off. We propose an extension of the model that can accommodate an overlapping community structure. Scalable posterior inference can be performed due to the specific choice of the link probability. We present experiments on five different real-world networks with up to 100,000 nodes and edges, showing that the model can provide a good fit to the degree distribution and recovers the latent community structure well.

    10/03/2018 ∙ by Juho Lee, et al.

  • Beyond the Chinese Restaurant and Pitman-Yor processes: Statistical Models with Double Power-law Behavior

    Bayesian nonparametric approaches, in particular the Pitman-Yor process and the associated two-parameter Chinese Restaurant process, have been successfully used in applications where the data exhibit power-law behavior. Examples include natural language processing, natural images, and networks. There is also growing empirical evidence that some datasets exhibit a two-regime power-law behavior: one regime for small frequencies, and a second regime, with a different exponent, for high frequencies. In this paper, we introduce a class of completely random measures which are doubly regularly-varying. Contrary to the Pitman-Yor process, we show that when completely random measures in this class are normalized to obtain random probability measures and associated random partitions, such partitions exhibit a double power-law behavior. We discuss in particular three models within this class: the beta prime process (Broderick et al., 2015, 2018), a novel process called the generalized BFRY process, and a mixture construction. We derive efficient Markov chain Monte Carlo algorithms to estimate the parameters of these models. Finally, we show that the proposed models provide a better fit than the Pitman-Yor process on various datasets.

    02/13/2019 ∙ by Fadhel Ayed, et al.
