Kohei Hayashi

Machine Learning researcher at AI Research Center, National Institute of Advanced Industrial Science and Technology (AIST)

  • On Random Subsampling of Gaussian Process Regression: A Graphon-Based Analysis

    In this paper, we study random subsampling of Gaussian process regression, one of the simplest approximation baselines, from a theoretical perspective. Although subsampling discards a large part of the training data, we show provable guarantees on the accuracy of the predictive mean/variance and on its generalization ability. For the analysis, we embed kernel matrices into graphons, which absorb differences in sample size and enable us to evaluate the approximation and generalization errors in a unified manner. The experimental results show that the subsampling approximation achieves a better trade-off between accuracy and runtime than the Nyström and random Fourier expansion methods.

    01/28/2019 ∙ by Kohei Hayashi, et al.

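As a concrete illustration, the subsampling baseline analyzed above can be sketched in a few lines. The use of scikit-learn's `GaussianProcessRegressor`, the RBF kernel, and all sizes below are our own assumptions for the sketch, not details from the paper.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
n, m = 2000, 100  # full sample size, subsample size

# Toy 1-d regression data: noisy sine.
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(n)

# Random subsampling: fit the GP on m points drawn uniformly without replacement.
idx = rng.choice(n, size=m, replace=False)
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.01)
gp.fit(X[idx], y[idx])

# Predictive mean and standard deviation on a test grid.
X_test = np.linspace(-3, 3, 50).reshape(-1, 1)
mean, std = gp.predict(X_test, return_std=True)
```

Fitting on m points costs O(m^3) instead of O(n^3), which is the runtime advantage the subsampling guarantees are about.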

  • Data Interpolating Prediction: Alternative Interpretation of Mixup

    Data augmentation by mixing samples, such as Mixup, has been widely used, typically for classification tasks. However, this strategy is not always effective because of the gap between the augmented samples used for training and the original samples used for testing. This gap may prevent a classifier from learning the optimal decision boundary and thus increase the generalization error. To overcome this problem, we propose an alternative framework called Data Interpolating Prediction (DIP). Unlike common data augmentation, we encapsulate the sample-mixing process in the hypothesis class of the classifier so that training and test samples are treated equally. We derive a generalization bound and show that DIP helps to reduce the original Rademacher complexity. We also empirically demonstrate that DIP can outperform existing Mixup.

    06/20/2019 ∙ by Takuya Shimada, et al.

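A minimal numpy sketch of the two views contrasted above, with hypothetical helper names (`mixup`, `dip_predict`): standard Mixup mixes samples only at training time, while a DIP-style predictor mixes the test point with training points and averages the predictions, so both phases see the same mixing process. This is our reading of the abstract, not the authors' code.

```python
import numpy as np

def mixup(x1, y1, x2, y2, rng, alpha=1.0):
    """Standard Mixup: convex-combine a pair of samples and their labels."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def dip_predict(predict_fn, x, X_train, rng, n_mix=32, alpha=1.0):
    """DIP-style inference (sketch): mix the test point with random training
    points and average the predictions, so train and test samples go
    through the same mixing process."""
    idx = rng.integers(len(X_train), size=n_mix)
    lam = rng.beta(alpha, alpha, size=n_mix)[:, None]
    mixed = lam * x[None, :] + (1 - lam) * X_train[idx]
    return predict_fn(mixed).mean(axis=0)

rng = np.random.default_rng(0)
X_train = rng.standard_normal((100, 3))
x_mix, y_mix = mixup(np.zeros(3), 0.0, np.ones(3), 1.0, rng)
pred = dip_predict(lambda X: X @ np.ones(3), np.ones(3), X_train, rng)
```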

  • An Optimality Proof for the PairDiff operator for Representing Relations between Words

    Representing the semantic relations that exist between two given words (or entities) is an important first step in a wide range of NLP applications such as analogical reasoning, knowledge base completion and relational information retrieval. A simple, yet surprisingly accurate, method for representing the relation between two words is to compute the vector offset (PairDiff) between the corresponding word embeddings. Despite its empirical success, it remains unclear whether PairDiff is the best operator for obtaining a relational representation from word embeddings. In this paper, we conduct a theoretical analysis of the PairDiff operator. In particular, we show that for word embeddings where cross-dimensional correlations are zero, PairDiff is the only bilinear operator that can minimise the ℓ_2 loss between analogous word-pairs. We experimentally show that for word embeddings created using a broad range of methods, the cross-dimensional correlations are approximately zero, demonstrating the general applicability of our theoretical result. Moreover, we empirically verify the implications of this result in a series of experiments in which we repeatedly discover PairDiff as the best bilinear operator for representing semantic relations between words in several benchmark datasets.

    09/19/2017 ∙ by Huda Hakami, et al.

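The PairDiff operator itself is a one-liner; the toy 4-dimensional embeddings below are made up purely to illustrate that analogous word pairs should produce similar offsets.

```python
import numpy as np

def pair_diff(a, b):
    """PairDiff: represent the relation between words a and b
    by the offset of their embedding vectors."""
    return a - b

# Hypothetical toy embeddings, for illustration only.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.9, 0.8, 0.9, 0.2]),
    "man":   np.array([0.1, 0.2, 0.1, 0.3]),
    "woman": np.array([0.1, 0.2, 0.9, 0.3]),
}

# Analogous pairs (king:man :: queen:woman) yield the same offset here.
r1 = pair_diff(emb["king"], emb["man"])
r2 = pair_diff(emb["queen"], emb["woman"])
```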

  • On Tensor Train Rank Minimization: Statistical Efficiency and Scalable Algorithm

    Tensor train (TT) decomposition provides a space-efficient representation for higher-order tensors. Despite this advantage, two crucial limitations arise when the TT decomposition is applied to machine learning problems: the lack of statistical theory and the lack of scalable algorithms. In this paper, we address both limitations. First, we introduce a convex relaxation of the TT decomposition problem and derive its error bound for the tensor completion task. Next, we develop an alternating optimization method with a randomization technique whose time complexity is as efficient as its space complexity. In experiments, we numerically confirm the derived bounds and empirically demonstrate the performance of our method on a real higher-order tensor.

    08/01/2017 ∙ by Masaaki Imaizumi, et al.

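To make the space savings concrete, here is a small numpy sketch of the TT format itself (reconstructing a full tensor from its cores); the shapes and ranks below are arbitrary choices of ours. A 4×5×6 tensor stores 120 numbers, while these TT cores store only 8 + 30 + 18 = 56.

```python
import numpy as np

def tt_reconstruct(cores):
    """Reconstruct a full tensor from tensor-train cores.
    Core k has shape (r_{k-1}, n_k, r_k), with boundary ranks r_0 = r_K = 1."""
    out = cores[0]  # shape (1, n_1, r_1)
    for core in cores[1:]:
        # contract the trailing rank index with the next core's leading one
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out[0, ..., 0]  # drop the boundary ranks r_0 = r_K = 1

# A 4x5x6 tensor in TT format with ranks (1, 2, 3, 1).
rng = np.random.default_rng(0)
cores = [rng.standard_normal(s) for s in [(1, 4, 2), (2, 5, 3), (3, 6, 1)]]
T = tt_reconstruct(cores)
```

Each entry T[i, j, k] is the product of one slice per core, which is what makes the representation scale linearly in the order of the tensor.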

  • Minimizing Quadratic Functions in Constant Time

    A sampling-based optimization method for quadratic functions is proposed. Our method approximately solves the following n-dimensional quadratic minimization problem in constant time, i.e., independently of n: z^* = min_{v ∈ R^n} 〈v, Av〉 + n〈v, diag(d)v〉 + n〈b, v〉, where A ∈ R^{n × n} is a matrix and d, b ∈ R^n are vectors. Our theoretical analysis specifies the number of samples k(δ, ϵ) such that the approximated solution z satisfies |z - z^*| = O(ϵ n^2) with probability 1 - δ. The empirical performance (accuracy and runtime) is confirmed by numerical experiments.

    08/25/2016 ∙ by Kohei Hayashi, et al.

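For reference, when A is symmetric and the objective strongly convex, the problem has a closed-form minimizer, which the sketch below compares with a subsample-and-rescale estimate. The (n/k)^2 rescaling is our guess motivated by the O(ϵ n^2) error bound above, not the paper's exact estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 400, 60  # problem dimension, number of sampled coordinates

A = rng.standard_normal((n, n))
A = (A + A.T) / 2                 # symmetrize
d = rng.uniform(1.0, 2.0, n)      # positive d keeps the problem strongly convex
b = rng.standard_normal(n)

def quad_min(A, d, b, n):
    """Exact minimum of <v, Av> + n<v, diag(d)v> + n<b, v>.
    Setting the gradient 2(A + n diag(d))v + n b to zero gives v*."""
    v = np.linalg.solve(2 * (A + n * np.diag(d)), -n * b)
    return v @ A @ v + n * v @ (d * v) + n * b @ v

z_exact = quad_min(A, d, b, n)

# Constant-time-style estimate (sketch): solve the problem induced on k
# uniformly sampled coordinates and rescale by (n/k)^2, since z* = O(n^2).
S = rng.choice(n, size=k, replace=False)
z_sub = (n / k) ** 2 * quad_min(A[np.ix_(S, S)], d[S], b[S], k)
```

The key point is that `quad_min` on the sampled subproblem touches only a k×k block, so its cost does not grow with n.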

  • Making Tree Ensembles Interpretable: A Bayesian Model Selection Approach

    Tree ensembles, such as random forests and boosted trees, are renowned for their high prediction performance. However, their interpretability is critically limited by their enormous complexity. In this study, we present a method for making a complex tree ensemble interpretable by simplifying the model. Specifically, we formalize the simplification of tree ensembles as a model selection problem: given a complex tree ensemble, we aim to obtain the simplest representation that is essentially equivalent to the original one. To this end, we derive a Bayesian model selection algorithm that optimizes the simplified model while maintaining its prediction performance. Our numerical experiments on several datasets showed that complicated tree ensembles can be reasonably approximated by interpretable models.

    06/29/2016 ∙ by Satoshi Hara, et al.


  • Making Tree Ensembles Interpretable

    Tree ensembles, such as random forests and boosted trees, are renowned for their high prediction performance, whereas their interpretability is critically limited. In this paper, we propose a post-processing method that improves the interpretability of tree ensembles. After learning a complex tree ensemble in a standard way, we approximate it by a simpler model that is interpretable to humans. To obtain the simpler model, we derive an EM algorithm that minimizes the KL divergence from the complex ensemble. A synthetic experiment showed that a complicated tree ensemble can be reasonably approximated by an interpretable model.

    06/17/2016 ∙ by Satoshi Hara, et al.

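The general flavor of these two simplification papers can be sketched with a generic distillation baseline: fit one shallow, human-readable tree to the ensemble's own predictions. This is a stand-in illustration of "approximate the complex ensemble by a simple model", not the authors' EM/KL-divergence or Bayesian model selection algorithm.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# The complex, hard-to-read model: 100 trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Distillation sketch: one depth-3 tree trained to mimic the forest's labels.
simple = DecisionTreeClassifier(max_depth=3, random_state=0)
simple.fit(X, forest.predict(X))

# How often the readable tree agrees with the ensemble it approximates.
agreement = np.mean(simple.predict(X) == forest.predict(X))
```

A depth-3 tree has at most 8 leaves, so its decision rules can be printed and inspected directly, which is the interpretability payoff both abstracts aim for.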

  • A Tractable Fully Bayesian Method for the Stochastic Block Model

    The stochastic block model (SBM) is a generative model that reveals macroscopic structure in graphs. Bayesian methods are used for (i) inferring cluster assignments and (ii) selecting the number of clusters. In this paper, we study the behavior of Bayesian inference in the SBM in the large-sample limit. Combining a variational approximation with Laplace's method, we establish a consistent criterion for the fully marginalized log-likelihood. Based on it, we derive a tractable algorithm that solves tasks (i) and (ii) concurrently, obviating the need for an outer loop that checks every model candidate. Our empirical and theoretical results demonstrate that our method is scalable in computation, accurate in approximation, and concise in model selection.

    02/06/2016 ∙ by Kohei Hayashi, et al.

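For readers unfamiliar with the model, here is a generative sketch of the SBM in numpy; the cluster sizes and edge-probability matrix below are arbitrary illustrative choices.

```python
import numpy as np

def sample_sbm(sizes, P, rng):
    """Draw an undirected, loop-free graph from a stochastic block model.
    sizes[c] nodes belong to cluster c; P[c1, c2] is the edge probability
    between a node in cluster c1 and a node in cluster c2."""
    z = np.repeat(np.arange(len(sizes)), sizes)   # cluster assignment per node
    probs = P[np.ix_(z, z)]                       # per-pair edge probabilities
    upper = np.triu(rng.random(probs.shape) < probs, k=1)
    return (upper | upper.T).astype(int), z

rng = np.random.default_rng(0)
P = np.array([[0.5, 0.05],
              [0.05, 0.5]])     # dense within clusters, sparse between them
A, z = sample_sbm([50, 50], P, rng)
```

Bayesian inference for the SBM runs this process in reverse: given only A, recover the assignments z (task (i)) and the number of blocks (task (ii)).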

  • Bayesian Masking: Sparse Bayesian Estimation with Weaker Shrinkage Bias

    A common strategy for sparse linear regression is to introduce regularization, which eliminates irrelevant features by driving the corresponding weights to zero. However, regularization also shrinks the estimates for relevant features, which can lead to incorrect feature selection. Motivated by this issue, we propose Bayesian masking (BM), a sparse estimation method that imposes no regularization on the weights. The key idea of BM is to introduce binary latent variables that randomly mask features; estimating the masking rates determines the relevance of each feature automatically. We derive a variational Bayesian inference algorithm that maximizes a lower bound of the factorized information criterion (FIC), a recently developed asymptotic criterion for evaluating the marginal log-likelihood. In addition, we propose a reparametrization that accelerates the convergence of the derived algorithm. Finally, we show that BM outperforms Lasso and automatic relevance determination (ARD) in terms of the sparsity-shrinkage trade-off.

    09/03/2015 ∙ by Yohei Kondo, et al.

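The shrinkage bias that motivates BM is easy to reproduce. The sketch below (an illustration of the problem, not of BM itself; the data and the regularization strength α are our choices) shows Lasso zeroing the irrelevant weights while also pulling the relevant ones below their true value.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
w_true = np.zeros(p)
w_true[:3] = 2.0                                # three relevant features
y = X @ w_true + 0.1 * rng.standard_normal(n)

ols = LinearRegression().fit(X, y)              # unbiased but not sparse
lasso = Lasso(alpha=0.5).fit(X, y)              # sparse but biased

# The l1 penalty zeroes lasso.coef_[3:], yet it also shrinks
# lasso.coef_[:3] well below their true value of 2.0.
```

BM's point is exactly this trade-off: sparsity via latent masks instead of a weight penalty, so the surviving weights are not shrunk.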

  • Doubly Decomposing Nonparametric Tensor Regression

    A nonparametric extension of tensor regression is proposed. Nonlinearity in a high-dimensional tensor space is broken into simple local functions by incorporating a low-rank tensor decomposition. Compared with naive nonparametric approaches, our formulation considerably improves the convergence rate of estimation while maintaining consistency with the same function class under specific conditions. To estimate the local functions, we develop a Bayesian estimator with a Gaussian process prior. Experimental results confirm its theoretical properties and show high performance in predicting a summary statistic of a real complex network.

    06/19/2015 ∙ by Masaaki Imaizumi, et al.

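As background, even the linear, rank-1 special case of tensor regression already shows how low-rank structure tames dimensionality: a p×q matrix input needs only p + q parameters instead of pq. The numpy sketch below fits y ≈ 〈X, u vᵀ〉 by alternating least squares; it is a simplified linear stand-in of our own, not the paper's nonparametric GP method.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 300, 8, 6
X = rng.standard_normal((n, p, q))              # matrix-valued inputs
u_true, v_true = rng.standard_normal(p), rng.standard_normal(q)
y = np.einsum('ipq,p,q->i', X, u_true, v_true) + 0.01 * rng.standard_normal(n)

# Alternating least squares: each subproblem is an ordinary least squares
# fit, because the model is linear in u when v is fixed and vice versa.
u, v = np.ones(p), np.ones(q)
for _ in range(20):
    Zu = np.einsum('ipq,q->ip', X, v)           # design matrix for u
    u = np.linalg.lstsq(Zu, y, rcond=None)[0]
    Zv = np.einsum('ipq,p->iq', X, u)           # design matrix for v
    v = np.linalg.lstsq(Zv, y, rcond=None)[0]

pred = np.einsum('ipq,p,q->i', X, u, v)
```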

  • Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal Likelihood

    The factorized information criterion (FIC) is a recently developed approximation of the marginal log-likelihood that provides an automatic model selection framework for a few latent variable models (LVMs) with tractable inference algorithms. This paper reconsiders FIC and fills theoretical gaps left by previous FIC studies. First, we reveal the core idea of FIC, which allows generalization to a broader class of LVMs, including continuous LVMs, in contrast to previous FICs, which are applicable only to binary LVMs. Second, we investigate the model selection mechanism of the generalized FIC. Our analysis provides a formal justification of FIC as a model selection criterion for LVMs, as well as a systematic procedure for pruning redundant latent variables that previous studies removed heuristically. Third, we interpret FIC as a variational free energy and uncover several previously unknown relationships between the two. A demonstrative study on Bayesian principal component analysis is provided, and numerical experiments support our theoretical results.

    04/22/2015 ∙ by Kohei Hayashi, et al.

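A quick, runnable relative of this model-selection goal: scikit-learn's PCA with `n_components='mle'` implements Minka's Bayesian criterion for automatically choosing the latent dimension of probabilistic PCA. It is not FIC, but it illustrates the same behavior of approximating the marginal likelihood to prune redundant latent dimensions, on the paper's demonstrative example (Bayesian PCA); the data below are synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, d, k_true = 500, 10, 3

# Data with a 3-dimensional latent subspace plus small isotropic noise.
W = rng.standard_normal((d, k_true))
Z = rng.standard_normal((n, k_true))
X = Z @ W.T + 0.1 * rng.standard_normal((n, d))

# 'mle' selects the number of components via Minka's Bayesian criterion,
# so no manual sweep over candidate dimensions is needed.
pca = PCA(n_components='mle').fit(X)
```

Like the generalized FIC's pruning of redundant latent variables, the criterion trades data fit against model complexity in a single fit rather than an outer model-search loop.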