
Efficient Transfer Bayesian Optimization with Auxiliary Information
We propose an efficient transfer Bayesian optimization method, which finds the maximum of an expensive-to-evaluate black-box function by using data on related optimization tasks. Our method uses auxiliary information that represents the task characteristics to effectively transfer knowledge for estimating a distribution over target functions. In particular, we use a Gaussian process, in which the mean and covariance functions are modeled with neural networks that simultaneously take both the auxiliary information and feature vectors as input. With a neural network mean function, we can estimate the target function even without evaluations. By using the neural network covariance function, we can extract nonlinear correlation among feature vectors that are shared across related tasks. Our Gaussian-process-based formulation not only enables an analytic calculation of the posterior distribution but also swiftly adapts the target function to observations. Our method is also advantageous because the computational costs scale linearly with the number of source tasks. Through experiments using a synthetic dataset and datasets for finding the optimal pedestrian traffic regulations and optimal machine learning algorithms, we demonstrate that our method identifies the optimal points with fewer target function evaluations than existing methods.
09/17/2019 ∙ by Tomoharu Iwata, et al.
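A minimal sketch of the mean-function idea in the abstract above: a Gaussian process whose prior mean is a learned model rather than zero, so the posterior reverts to that prior belief where no evaluations exist. A fixed quadratic stands in for the neural network mean (which the paper trains on source tasks and auxiliary information); all names here are illustrative.

```python
import numpy as np

def rbf(A, B, ls=1.0):
    # Squared-exponential kernel matrix between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def mean_fn(X):
    # Stand-in for the neural-network mean function: a fixed quadratic
    # prior belief about the target (the paper learns this from source
    # tasks and auxiliary task descriptors).
    return -(X[:, 0] - 0.3) ** 2

def gp_posterior(Xtr, ytr, Xte, noise=1e-4):
    # Exact GP posterior with a non-zero prior mean: condition on the
    # residuals y - m(x), then add the mean back at the test points.
    K = rbf(Xtr, Xtr) + noise * np.eye(len(Xtr))
    Ks = rbf(Xte, Xtr)
    resid = ytr - mean_fn(Xtr)
    mu = mean_fn(Xte) + Ks @ np.linalg.solve(K, resid)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mu, var

Xtr = np.array([[0.1], [0.5], [0.9]])
ytr = -(Xtr[:, 0] - 0.4) ** 2          # toy target function evaluations
Xte = np.linspace(0, 1, 5)[:, None]
mu, var = gp_posterior(Xtr, ytr, Xte)
```

With zero evaluations the posterior mean is exactly `mean_fn`, which is the "estimate the target function even without evaluations" property; the analytic Gaussian conditioning is what lets the estimate adapt swiftly as observations arrive.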

Deep Mixture Point Processes: Spatiotemporal Event Prediction with Rich Contextual Information
Predicting when and where events will occur in cities, like taxi pickups, crimes, and vehicle collisions, is a challenging and important problem with many applications in fields such as urban planning, transportation optimization and location-based marketing. Though many point processes have been proposed to model events in a continuous spatiotemporal space, none of them allow for the consideration of the rich contextual factors that affect event occurrence, such as weather, social activities, geographical characteristics, and traffic. In this paper, we propose DMPP (Deep Mixture Point Processes), a point process model for predicting spatiotemporal events with the use of rich contextual information; a key advance is its incorporation of the heterogeneous and high-dimensional context available in image and text data. Specifically, we design the intensity of our point process model as a mixture of kernels, where the mixture weights are modeled by a deep neural network. This formulation allows us to automatically learn the complex nonlinear effects of the contextual factors on event occurrence. At the same time, this formulation makes analytical integration over the intensity, which is required for point process estimation, tractable. We use real-world data sets from different domains to demonstrate that DMPP has better predictive performance than existing methods.
06/21/2019 ∙ by Maya Okawa, et al.
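The tractable-integral point of the abstract above can be sketched directly: if the intensity is a weighted sum of normalized Gaussian kernels, its integral over the whole plane is just the sum of the weights. The softplus-linear "deep network" below is a hypothetical stand-in for the paper's context encoder.

```python
import numpy as np

def gauss2d(x, c, s):
    # Isotropic 2-D Gaussian density kernel centred at c (integrates to 1).
    return np.exp(-((x - c) ** 2).sum(-1) / (2 * s**2)) / (2 * np.pi * s**2)

def mixture_weights(context):
    # Stand-in for the deep network mapping contextual features (weather,
    # text, imagery, ...) to non-negative mixture weights; a hypothetical
    # softplus-linear layer for illustration.
    W = np.array([[0.4, -0.2], [0.1, 0.5], [-0.3, 0.2]])
    return np.log1p(np.exp(W @ context))          # softplus -> w_j >= 0

centers = np.array([[0.0, 0.0], [2.0, 1.0], [-1.0, 3.0]])  # kernel centres
ctx = np.array([1.0, 0.5])                                  # context vector
w = mixture_weights(ctx)

def intensity(x):
    # lambda(x) = sum_j w_j * N(x; c_j, s^2 I)
    return sum(wj * gauss2d(x, cj, 0.5) for wj, cj in zip(w, centers))

# The integral of the intensity over the plane -- needed in the point
# process likelihood -- is analytic: each normalised kernel integrates
# to 1, so the total mass is simply sum(w).
total_mass = w.sum()
```

This is why the mixture-of-kernels design keeps estimation tractable even though the weights come from an arbitrary nonlinear network: the network only affects the weights, never the kernel integrals.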

Improving Output Uncertainty Estimation and Generalization in Deep Learning via Neural Network Gaussian Processes
We propose a simple method that combines neural networks and Gaussian processes. The proposed method can estimate the uncertainty of outputs and flexibly adjust target functions where training data exist, which are advantages of Gaussian processes. The proposed method can also achieve high generalization performance for unseen input configurations, which is an advantage of neural networks. With the proposed method, neural networks are used for the mean functions of Gaussian processes. We present a scalable stochastic inference procedure, where sparse Gaussian processes are inferred by stochastic variational inference, and the parameters of neural networks and kernels are estimated by stochastic gradient descent methods, simultaneously. We use two real-world spatiotemporal data sets to demonstrate experimentally that the proposed method achieves better uncertainty estimation and generalization performance than neural networks and Gaussian processes.
07/19/2017 ∙ by Tomoharu Iwata, et al.
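The division of labor described above can be sketched in a few lines: a flexible parametric model (here a polynomial, standing in for the trained network) supplies the mean, and a GP on the residuals supplies calibrated uncertainty, growing far from the data. The paper's sparse variational inference is omitted; this only illustrates the mean-function architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-1, 1, 20)[:, None]
y = np.sin(3 * X[:, 0]) + 0.05 * rng.standard_normal(20)

# Stand-in "neural network" mean: a small polynomial fit (the paper trains
# a network jointly with a sparse variational GP; only the idea is kept).
coef = np.polyfit(X[:, 0], y, 5)
nn_mean = lambda Z: np.polyval(coef, Z[:, 0])

def rbf(A, B, ls=0.3):
    return np.exp(-0.5 * (A[:, None, 0] - B[None, :, 0]) ** 2 / ls**2)

def predict(Z, noise=1e-2):
    # GP over the residuals y - nn_mean(X): near the data the GP adjusts
    # the fit; far from the data the prediction reverts to the network
    # mean and the predictive variance grows toward the prior variance.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(Z, X)
    alpha = np.linalg.solve(K, y - nn_mean(X))
    mu = nn_mean(Z) + Ks @ alpha
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mu, var
```

The variance behavior is the point: inside the data region it is small, and at an unseen input configuration it approaches the prior variance while the mean falls back on the network's generalization.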

Localized Lasso for High-Dimensional Regression
We introduce the localized Lasso, which is suited for learning models that are both interpretable and have a high predictive power in problems with high dimensionality d and small sample size n. More specifically, we consider a function defined by local sparse models, one at each data point. We introduce sample-wise network regularization to borrow strength across the models, and sample-wise exclusive group sparsity (a.k.a. ℓ_1,2 norm) to introduce diversity into the choice of feature sets in the local models. The local models are interpretable in terms of similarity of their sparsity patterns. The cost function is convex, and thus has a globally optimal solution. Moreover, we propose a simple yet efficient iterative least-squares-based optimization procedure for the localized Lasso, which does not need a tuning parameter, and is guaranteed to converge to a globally optimal solution. The solution is empirically shown to outperform alternatives for both simulated and genomic personalized medicine data.
03/22/2016 ∙ by Makoto Yamada, et al.
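The "one local model per data point, tied together by a network" idea above can be sketched with the network-regularization term alone (the exclusive-group sparsity term is omitted for brevity, so this is not the full localized Lasso). The resulting problem is convex with a closed-form stacked linear system; the graph here is given by construction rather than built from the data.

```python
import numpy as np

# Toy data: two latent regimes; samples 0-3 follow w=[1,0], samples 4-7 w=[0,1].
rng = np.random.default_rng(1)
n, d = 8, 2
X = rng.standard_normal((n, d))
w_true = np.array([[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4)
y = np.sum(X * w_true, axis=1)

# Sample-wise network: link samples within each regime (in practice this
# would be, e.g., a kNN graph over the inputs; here it is given).
A = np.zeros((n, n))
for g in (range(4), range(4, 8)):
    for i in g:
        for j in g:
            if i != j:
                A[i, j] = 1.0
L = np.diag(A.sum(1)) - A                      # graph Laplacian

# Network-regularised local models: minimise
#   sum_i (y_i - x_i . w_i)^2 + lam * tr(W^T L W),
# where tr(W^T L W) = 0.5 * sum_ij A_ij ||w_i - w_j||^2.
# Stationarity gives one linear system over all local models at once.
lam = 1.0
M = np.zeros((n * d, n * d))
b = np.zeros(n * d)
for i in range(n):
    sl = slice(i * d, (i + 1) * d)
    M[sl, sl] += np.outer(X[i], X[i])
    b[sl] = y[i] * X[i]
M += lam * np.kron(L, np.eye(d))
W = np.linalg.solve(M, b).reshape(n, d)        # one local model per sample
```

Because the graph only couples samples within a regime and the toy targets are noiseless, the solver recovers the two regime-specific weight vectors exactly, which is the "borrow strength across local models" effect in miniature.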

Multi-view Anomaly Detection via Probabilistic Latent Variable Models
We propose a nonparametric Bayesian probabilistic latent variable model for multi-view anomaly detection, which is the task of finding instances that have inconsistent views. With the proposed model, all views of a non-anomalous instance are assumed to be generated from a single latent vector. On the other hand, an anomalous instance is assumed to have multiple latent vectors, and its different views are generated from different latent vectors. By inferring the number of latent vectors used for each instance with Dirichlet process priors, we obtain multi-view anomaly scores. The proposed model can be seen as a robust extension of probabilistic canonical correlation analysis for noisy multi-view data. We present Bayesian inference procedures for the proposed model based on a stochastic EM algorithm. The effectiveness of the proposed model is demonstrated in terms of performance when detecting multi-view anomalies and imputing missing values in multi-view data with anomalies.
11/13/2014 ∙ by Tomoharu Iwata, et al.
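The core intuition above — an anomaly is an instance whose views disagree about its latent vector — can be sketched without the full Bayesian machinery. Below, the loadings are known and the latent is recovered from each view by least squares; the paper instead infers loadings and the per-instance number of latent vectors with Dirichlet process priors, so this is only a simplified stand-in for the scoring idea.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 50, 2
Z = rng.standard_normal((n, k))                 # shared latent vectors
A = rng.standard_normal((k, 4))                 # view-1 loadings
B = rng.standard_normal((k, 3))                 # view-2 loadings
V1 = Z @ A + 0.01 * rng.standard_normal((n, 4))
V2 = Z @ B + 0.01 * rng.standard_normal((n, 3))

# Plant a multi-view anomaly: instance 0 keeps its first view but takes a
# second view generated from an unrelated latent vector, so its views are
# mutually inconsistent.
V2[0] = rng.standard_normal(k) @ B

# Estimate each instance's latent vector separately from each view and
# score the disagreement between the two estimates.
Z1 = V1 @ np.linalg.pinv(A)
Z2 = V2 @ np.linalg.pinv(B)
scores = np.linalg.norm(Z1 - Z2, axis=1)        # multi-view anomaly score
```

Non-anomalous instances yield two nearly identical latent estimates (scores near the noise floor), while the planted inconsistent instance stands out, which is exactly the quantity the model's "multiple latent vectors" mechanism formalizes.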

Warped Mixtures for Nonparametric Cluster Shapes
A mixture of Gaussians fit to a single curved or heavy-tailed cluster will report that the data contains many clusters. To produce more appropriate clusterings, we introduce a model which warps a latent mixture of Gaussians to produce nonparametric cluster shapes. The possibly low-dimensional latent mixture model allows us to summarize the properties of the high-dimensional clusters (or density manifolds) describing the data. The number of manifolds, as well as the shape and dimension of each manifold, is automatically inferred. We derive a simple inference scheme for this model which analytically integrates out both the mixture parameters and the warping function. We show that our model is effective for density estimation, performs better than infinite Gaussian mixture models at recovering the true number of clusters, and produces interpretable summaries of high-dimensional datasets.
08/09/2014 ∙ by Tomoharu Iwata, et al.
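The generative picture above — Gaussian mixture in a latent space, warped into the observed space — can be sketched in one dimension with the change-of-variables formula. The fixed `sinh` warp is a hypothetical stand-in; the paper places a (GP) prior over the warping function and integrates it out rather than fixing it.

```python
import numpy as np

# Latent 1-D mixture of Gaussians (two components).
means = np.array([-1.5, 1.5])
stds = np.array([0.4, 0.4])
pis = np.array([0.5, 0.5])

def p_latent(z):
    # Mixture density in the latent space.
    comps = np.exp(-0.5 * ((z[:, None] - means) / stds) ** 2) \
        / (stds * np.sqrt(2 * np.pi))
    return comps @ pis

# Monotone warping function (illustrative stand-in for the learned warp).
f, f_inv, f_prime = np.sinh, np.arcsinh, np.cosh

def p_observed(y):
    # Change of variables: p_y(y) = p_z(f^{-1}(y)) / |f'(f^{-1}(y))|.
    z = f_inv(y)
    return p_latent(z) / np.abs(f_prime(z))
```

The warped density has stretched, non-Gaussian cluster shapes in the observed space while the latent description stays a plain two-component mixture — which is what makes the latent summary interpretable.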

Imitation networks: Few-shot learning of neural networks from scratch
In this paper, we propose imitation networks, a simple but effective method for training neural networks with a limited amount of training data. Our approach inherits the idea of knowledge distillation that transfers knowledge from a deep or wide reference model to a shallow or narrow target model. The proposed method employs this idea to mimic predictions of reference estimators that are much more robust against overfitting than the network we want to train. Unlike almost all previous work on knowledge distillation, which requires a large amount of labeled training data, the proposed method requires only a small amount of training data. Instead, we introduce pseudo training examples that are optimized as a part of model parameters. Experimental results for several benchmark datasets demonstrate that the proposed method outperforms the other baselines, such as naive training of the target model and standard knowledge distillation.
02/08/2018 ∙ by Akisato Kimura, et al.
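A stripped-down sketch of the distillation pipeline above: an overfitting-resistant reference estimator (kernel ridge regression here) is fit to a handful of labeled points, pseudo examples are labeled by that teacher, and a small student model is trained on them. One simplification to note: the pseudo inputs are merely sampled here, whereas the paper optimizes them jointly with the model parameters.

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.array([-0.8, -0.2, 0.3, 0.9])            # only 4 labelled points
y = np.sin(2 * X)

# Teacher: kernel ridge regressor, robust against overfitting on tiny data.
def k(a, b, ls=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

alpha = np.linalg.solve(k(X, X) + 1e-2 * np.eye(len(X)), y)
teacher = lambda Z: k(Z, X) @ alpha

# Pseudo training examples labelled by the teacher (sampled, not optimised).
Zp = rng.uniform(-1, 1, 200)
yp = teacher(Zp)

# Student: a small parametric model fitted to the teacher's predictions.
coef = np.polyfit(Zp, yp, 7)
student = lambda Z: np.polyval(coef, Z)
```

The student never sees the 4 original labels directly; it imitates the smoother teacher everywhere in the input region, which is the mechanism that substitutes for a large labeled set.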

Zero-shot Domain Adaptation without Domain Semantic Descriptors
We propose a method to infer domain-specific models such as classifiers for unseen domains, for which no data are given in the training phase, without domain semantic descriptors. When training and test distributions are different, standard supervised learning methods perform poorly. Zero-shot domain adaptation attempts to alleviate this problem by inferring models that generalize well to unseen domains by using training data in multiple source domains. Existing methods use observed semantic descriptors characterizing domains, such as time information, to infer the domain-specific models for the unseen domains. However, it cannot always be assumed that such metadata can be used in real-world applications. The proposed method can infer appropriate domain-specific models without any semantic descriptors by introducing the concept of latent domain vectors, which are latent representations for the domains and are used for inferring the models. The latent domain vector for the unseen domain is inferred from the set of the feature vectors in the corresponding domain, which is given in the testing phase. The domain-specific models consist of two components: the first is for extracting a representation of a feature vector to be predicted, and the second is for inferring model parameters given the latent domain vector. The posterior distributions of the latent domain vectors and the domain-specific models are parametrized by neural networks, and are optimized by maximizing the variational lower bound using stochastic gradient descent. The effectiveness of the proposed method was demonstrated through experiments using one regression and two classification tasks.
07/09/2018 ∙ by Atsutoshi Kumagai, et al.
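The key move above — infer a latent domain vector from the unlabeled feature set of a test domain, then adapt the model with it — can be sketched on a toy problem where each domain is a shifted copy of the same two-class task. The set-level summary here is just the feature mean, a crude stand-in for the paper's neural set encoder and variational inference.

```python
import numpy as np

rng = np.random.default_rng(4)

def make_domain(shift, n=100):
    # Two classes at -1/+1 on the first axis; each domain adds its own
    # (unobserved) offset, playing the role of the domain's identity.
    y = rng.integers(0, 2, n)
    X = np.stack([2.0 * y - 1.0 + shift, rng.standard_normal(n)], 1)
    X[:, 0] += 0.3 * rng.standard_normal(n)
    return X, y

# Source domains with different shifts; no semantic descriptors are given,
# mirroring the paper's setting.
train = [make_domain(s) for s in (-2.0, 0.0, 2.0)]

def domain_vector(X):
    # Latent domain vector inferred from the set of feature vectors alone
    # (here simply their mean; the paper uses a learned set encoder).
    return X.mean(0)

def classify(X):
    # Domain-specific classifier: remove the inferred domain offset, then
    # apply a classifier shared across all domains.
    return (X[:, 0] - domain_vector(X)[0] > 0).astype(int)
```

Because the domain vector is computed from the unlabeled test-time feature set, the same rule adapts to a domain whose shift was never seen during training.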

Variational Autoencoder with Implicit Optimal Priors
The variational autoencoder (VAE) is a powerful generative model that can estimate the probability of a data point by using latent variables. In the VAE, the posterior of the latent variable given the data point is regularized by the prior of the latent variable using Kullback-Leibler (KL) divergence. Although the standard Gaussian distribution is usually used for the prior, this simple prior incurs over-regularization. As a sophisticated prior, the aggregated posterior has been introduced, which is the expectation of the posterior over the data distribution. This prior is optimal for the VAE in terms of maximizing the training objective function. However, KL divergence with the aggregated posterior cannot be calculated in a closed form, which prevents us from using this optimal prior. With the proposed method, we introduce the density ratio trick to estimate this KL divergence without modeling the aggregated posterior explicitly. Since the density ratio trick does not work well in high dimensions, we rewrite this KL divergence that contains the high-dimensional density ratio into the sum of the analytically calculable term and the low-dimensional density ratio term, to which the density ratio trick is applied. Experiments on various datasets show that the VAE with this implicit optimal prior achieves high density estimation performance.
09/14/2018 ∙ by Hiroshi Takahashi, et al.
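The rewriting described above rests on the identity KL(q || r) = KL(q || p) - E_q[log r(z)/p(z)]: the first term is analytic when p is the standard Gaussian prior, and only the ratio r/p needs the density ratio trick. With one-dimensional Gaussians standing in for q(z|x), the prior, and the aggregated posterior, the identity can be checked in closed form (no ratio estimation involved in this sketch).

```python
import numpy as np

def kl_gauss(m1, s1, m2, s2):
    # KL( N(m1, s1^2) || N(m2, s2^2) ), closed form.
    return np.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

def e_log_gauss(mq, sq, m, s):
    # E_{z ~ N(mq, sq^2)}[ log N(z; m, s^2) ], closed form.
    return -np.log(s * np.sqrt(2 * np.pi)) \
        - (sq**2 + (mq - m) ** 2) / (2 * s**2)

mq, sq = 0.7, 0.5       # q(z|x), the encoder posterior
mp, sp = 0.0, 1.0       # p(z), the standard Gaussian prior
mr, sr = 0.2, 1.3       # stand-in for the aggregated posterior q(z)

# KL(q || r) computed directly ...
lhs = kl_gauss(mq, sq, mr, sr)
# ... and via the decomposition: analytic KL to the prior minus the
# expected log density ratio log r(z)/p(z) under q.
rhs = kl_gauss(mq, sq, mp, sp) \
    - (e_log_gauss(mq, sq, mr, sr) - e_log_gauss(mq, sq, mp, sp))
```

In the paper only the expectation of the ratio term is intractable, and the density ratio trick estimates it with a classifier over the low-dimensional latent space; the decomposition itself is exact, as the check confirms.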

Unsupervised Cross-lingual Word Embedding by Multilingual Neural Language Models
We propose an unsupervised method to obtain cross-lingual embeddings without any parallel data or pre-trained word embeddings. The proposed model, which we call multilingual neural language models, takes sentences of multiple languages as an input. The proposed model contains bidirectional LSTMs that perform as forward and backward language models, and these networks are shared among all the languages. The other parameters, i.e., word embeddings and the linear transformation between hidden states and outputs, are specific to each language. The shared LSTMs can capture the common sentence structure among all languages. Accordingly, word embeddings of each language are mapped into a common latent space, making it possible to measure the similarity of words across multiple languages. We evaluate the quality of the cross-lingual word embeddings on a word alignment task. Our experiments demonstrate that our model can obtain cross-lingual embeddings of much higher quality than existing unsupervised models when only a small amount of monolingual data (i.e., 50k sentences) is available, or the domains of monolingual data are different across languages.
09/07/2018 ∙ by Takashi Wada, et al.
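The parameter-sharing layout above — language-specific embeddings and output projections around a shared sequence model — can be sketched as an untrained forward pass. A plain tanh RNN cell stands in for the paper's bidirectional LSTMs; all sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
d_emb, d_hid = 8, 16
vocab = {"en": 100, "fr": 120}

# Language-specific parameters: word embeddings and output projections.
emb = {l: 0.1 * rng.standard_normal((v, d_emb)) for l, v in vocab.items()}
out = {l: 0.1 * rng.standard_normal((d_hid, v)) for l, v in vocab.items()}

# Shared recurrent weights: one set of sequence-model parameters for all
# languages (the paper shares bidirectional LSTMs the same way).
W_in = 0.1 * rng.standard_normal((d_emb, d_hid))
W_rec = 0.1 * rng.standard_normal((d_hid, d_hid))

def forward(lang, token_ids):
    # Run the shared RNN over one sentence of one language. Because W_in
    # and W_rec are shared, training pushes the embeddings of every
    # language into the same hidden space, which is what makes
    # cross-lingual cosine similarity meaningful after training.
    h = np.zeros(d_hid)
    logits = []
    for t in token_ids:
        h = np.tanh(emb[lang][t] @ W_in + h @ W_rec)
        logits.append(h @ out[lang])
    return np.array(logits)

en_logits = forward("en", [3, 17, 42])
fr_logits = forward("fr", [5, 99])
```

Only the embedding and output tables know which language a sentence came from; the recurrent core is language-blind, which is the whole alignment mechanism.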

Refining Coarse-grained Spatial Data using Auxiliary Spatial Data Sets with Various Granularities
We propose a probabilistic model for refining coarse-grained spatial data by utilizing auxiliary spatial data sets. Existing methods require that the spatial granularities of the auxiliary data sets are the same as the desired granularity of target data. The proposed model can effectively make use of auxiliary data sets with various granularities by hierarchically incorporating Gaussian processes. With the proposed model, a distribution for each auxiliary data set on the continuous space is modeled using a Gaussian process, where the representation of uncertainty considers the levels of granularity. The fine-grained target data are modeled by another Gaussian process that considers both the spatial correlation and the auxiliary data sets with their uncertainty. We integrate the Gaussian process with a spatial aggregation process that transforms the fine-grained target data into the coarse-grained target data, by which we can infer the fine-grained target Gaussian process from the coarse-grained data. Our model is designed such that the inference of model parameters based on the exact marginal likelihood is possible, in which the variables of fine-grained target and auxiliary data are analytically integrated out. Our experiments on real-world spatial data sets demonstrate the effectiveness of the proposed model.
09/21/2018 ∙ by Yusuke Tanaka, et al.
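The aggregation mechanism above is linear-Gaussian, so the fine-grained posterior is available in closed form. The one-dimensional sketch below (no auxiliary data sets, which the paper adds hierarchically) shows the core step: observe region averages of a GP-distributed fine field and condition on them.

```python
import numpy as np

# Fine-grained grid and a GP prior over it.
m = 30                                   # number of fine cells
x = np.linspace(0, 1, m)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.1**2)

# Spatial aggregation: each coarse region is the average of 10 fine cells,
# mirroring the paper's fine-to-coarse aggregation process.
A = np.zeros((3, m))
for r in range(3):
    A[r, 10 * r : 10 * (r + 1)] = 1.0 / 10

y_coarse = np.array([0.2, 1.0, -0.5])    # observed coarse-grained values
noise = 1e-4

# Linear-Gaussian conditioning: y = A f + eps with f ~ N(0, K) gives the
# posterior over the fine-grained field in closed form; the fine-grained
# variables are analytically integrated out of the marginal likelihood
# the same way.
S = A @ K @ A.T + noise * np.eye(3)
mu_fine = K @ A.T @ np.linalg.solve(S, y_coarse)
```

The posterior mean is a smooth fine-grained field whose region averages reproduce the coarse observations (up to the noise level), so refinement and consistency with the coarse data come out of the same conditioning step.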