
On Random Subsampling of Gaussian Process Regression: A Graphon-Based Analysis
In this paper, we study random subsampling of Gaussian process regression, one of the simplest approximation baselines, from a theoretical perspective. Although subsampling discards a large part of the training data, we show provable guarantees on the accuracy of the predictive mean/variance and on its generalization ability. For the analysis, we embed kernel matrices into graphons, which absorb differences in sample size and enable us to evaluate the approximation and generalization errors in a unified manner. Experimental results show that the subsampling approximation achieves a better trade-off between accuracy and runtime than the Nyström and random Fourier expansion methods.
01/28/2019 ∙ by Kohei Hayashi, et al.
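A minimal numpy sketch of the subsampling baseline this abstract analyzes: fit an exact GP on a random subset and compare its predictive mean with the full-data GP. The toy data, RBF kernel, and all hyperparameters are assumptions for illustration, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (assumption: noisy sine, 500 points)
n = 500
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

def rbf(A, B, ell=1.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def gp_predict(Xtr, ytr, Xte, noise=0.01):
    """Exact GP predictive mean and variance at test points Xte."""
    K = rbf(Xtr, Xtr) + noise * np.eye(len(Xtr))
    Ks = rbf(Xte, Xtr)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, ytr))
    mean = Ks @ alpha
    V = np.linalg.solve(L, Ks.T)
    var = 1.0 - (V ** 2).sum(axis=0)   # k(x, x) = 1 for this RBF kernel
    return mean, var

# Random subsampling: keep m << n points and run the exact GP on them.
m = 100
sub = rng.choice(n, size=m, replace=False)
Xte = np.linspace(-3, 3, 50)[:, None]
mean_sub, var_sub = gp_predict(X[sub], y[sub], Xte)
mean_full, _ = gp_predict(X, y, Xte)
```

Despite discarding 80% of the data, the subsampled predictive mean stays close to the full-data one on this smooth toy problem, which is the regime the graphon analysis formalizes.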

Data Interpolating Prediction: Alternative Interpretation of Mixup
Data augmentation by mixing samples, such as Mixup, has been widely used, typically for classification tasks. However, this strategy is not always effective because of the gap between the augmented samples used for training and the original samples used for testing. This gap may prevent a classifier from learning the optimal decision boundary and thus increase the generalization error. To overcome this problem, we propose an alternative framework called Data Interpolating Prediction (DIP). Unlike common data augmentation, we encapsulate the sample-mixing process in the hypothesis class of the classifier so that training and test samples are treated equally. We derive a generalization bound and show that DIP helps to reduce the original Rademacher complexity. We also empirically demonstrate that DIP can outperform existing Mixup.
06/20/2019 ∙ by Takuya Shimada, et al.
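To make the contrast concrete, here is a hedged numpy sketch: `mixup_batch` is the usual augmentation, while `dip_predict` follows the DIP idea of putting the sample-mixing inside the hypothesis, so a test input is mixed with training samples just as in training and the classifier's outputs are averaged. The linear scorer and all hyperparameters are assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup_batch(X, y_onehot, alpha=0.2):
    """Standard Mixup: convexly combine randomly paired samples and labels."""
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(X))
    X_mix = lam * X + (1 - lam) * X[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return X_mix, y_mix

def dip_predict(f, x, X_train, alpha=0.2, n_draws=32):
    """DIP-style prediction: mix the test input with random training samples
    (the same process used in training) and average the model's outputs."""
    lam = rng.beta(alpha, alpha, size=n_draws)[:, None]
    partners = X_train[rng.integers(len(X_train), size=n_draws)]
    mixed = lam * x[None, :] + (1 - lam) * partners
    return f(mixed).mean(axis=0)

# Demo with a hypothetical linear 3-class scorer over 2-D inputs.
X_train = rng.standard_normal((100, 2))
W = rng.standard_normal((2, 3))
scores = dip_predict(lambda Z: Z @ W, np.array([1.0, -1.0]), X_train)
```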

An Optimality Proof for the PairDiff operator for Representing Relations between Words
Representing the semantic relations that exist between two given words (or entities) is an important first step in a wide range of NLP applications such as analogical reasoning, knowledge base completion and relational information retrieval. A simple, yet surprisingly accurate, method for representing a relation between two words is to compute the vector offset (PairDiff) between the corresponding word embeddings. Despite its empirical success, it remains unclear whether PairDiff is the best operator for obtaining a relational representation from word embeddings. In this paper, we conduct a theoretical analysis of the PairDiff operator. In particular, we show that for word embeddings whose cross-dimensional correlations are zero, PairDiff is the only bilinear operator that can minimise the ℓ_2 loss between analogous word pairs. We experimentally show that, for word embeddings created using a broad range of methods, the cross-dimensional correlations are approximately zero, demonstrating the general applicability of our theoretical result. Moreover, we empirically verify the implications of the proven result in a series of experiments in which PairDiff repeatedly emerges as the best bilinear operator for representing semantic relations between words across several benchmark datasets.
09/19/2017 ∙ by Huda Hakami, et al.
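The PairDiff operator itself is one line: the relation between words a and b is represented by the offset of their embeddings. A toy numpy sketch (the embedding table is fabricated, with one analogy planted by construction):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy embedding table (in practice: word2vec / GloVe vectors).
vocab = ["man", "woman", "king", "queen", "paris", "france"]
E = {w: rng.standard_normal(8) for w in vocab}
# Plant an analogy structure: queen = king + (woman - man).
E["queen"] = E["king"] + (E["woman"] - E["man"])

def pairdiff(a, b):
    """PairDiff relation representation: the vector offset b - a."""
    return E[b] - E[a]

def solve_analogy(a, b, c):
    """a:b :: c:?  Pick the word whose PairDiff with c best matches b - a."""
    r = pairdiff(a, b)
    cands = [w for w in vocab if w not in (a, b, c)]
    scores = {w: -np.linalg.norm(pairdiff(c, w) - r) for w in cands}
    return max(scores, key=scores.get)

answer = solve_analogy("man", "woman", "king")   # "queen" by construction
```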

On Tensor Train Rank Minimization: Statistical Efficiency and Scalable Algorithm
Tensor train (TT) decomposition provides a space-efficient representation for higher-order tensors. Despite this advantage, two crucial limitations arise when applying TT decomposition to machine learning problems: the lack of statistical theory and the lack of scalable algorithms. In this paper, we address both limitations. First, we introduce a convex relaxation of the TT decomposition problem and derive its error bound for the tensor completion task. Next, we develop an alternating optimization method with a randomization technique whose time complexity is as efficient as its space complexity. In experiments, we numerically confirm the derived bounds and empirically demonstrate the performance of our method on a real higher-order tensor.
08/01/2017 ∙ by Masaaki Imaizumi, et al.
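The space-efficiency claim is easy to see by contracting TT cores back into a dense tensor. This sketch shows the representation only, not the paper's convex relaxation or randomized solver; the shapes and TT-ranks below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def tt_to_full(cores):
    """Contract TT cores G_k of shape (r_{k-1}, n_k, r_k) into the full tensor."""
    out = cores[0]                                   # shape (1, n_1, r_1)
    for G in cores[1:]:
        out = np.tensordot(out, G, axes=([-1], [0]))
    return out[0, ..., 0]                            # drop boundary ranks r_0 = r_K = 1

# A 10 x 10 x 10 x 10 tensor with TT-ranks (1, 3, 3, 3, 1).
shape, ranks = (10, 10, 10, 10), (1, 3, 3, 3, 1)
cores = [rng.standard_normal((ranks[k], shape[k], ranks[k + 1]))
         for k in range(4)]
T = tt_to_full(cores)
tt_params = sum(G.size for G in cores)   # 240 parameters vs 10**4 dense entries
```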

Minimizing Quadratic Functions in Constant Time
A sampling-based optimization method for quadratic functions is proposed. Our method approximately solves the following n-dimensional quadratic minimization problem in constant time, which is independent of n: z^* = min_{v∈R^n}〈v, Av〉 + n〈v, diag(d)v〉 + n〈b, v〉, where A ∈ R^{n × n} is a matrix and d, b ∈ R^n are vectors. Our theoretical analysis specifies the number of samples k(δ, ϵ) such that the approximated solution z satisfies |z − z^*| = O(ϵ n^2) with probability 1 − δ. The empirical performance (accuracy and runtime) is positively confirmed by numerical experiments.
08/25/2016 ∙ by Kohei Hayashi, et al.
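A hedged sketch of the sampling idea: solve the same quadratic problem on a random k × k principal submatrix and rescale by (n/k)², since the objective scales quadratically with the problem size. The instance below is exchangeable (constant A, d, b), chosen so the analytic minimum −n²/6 is easy to check; the paper's estimator and its sample-size bound k(δ, ϵ) are more refined than this.

```python
import numpy as np

rng = np.random.default_rng(0)

def qmin(A, d, b):
    """Exact minimum of <v, Av> + n<v, diag(d)v> + n<b, v>."""
    n = len(b)
    H = A + A.T + 2 * n * np.diag(d)     # Hessian of the objective
    v = np.linalg.solve(H, -n * b)       # stationary point
    return v @ A @ v + n * v @ (d * v) + n * (b @ v)

def qmin_sampled(A, d, b, k):
    """Estimate z* from a random k x k principal subproblem, rescaled by (n/k)^2."""
    n = len(b)
    S = rng.choice(n, size=k, replace=False)
    return (n / k) ** 2 * qmin(A[np.ix_(S, S)], d[S], b[S])

# Exchangeable toy instance whose minimum is exactly -n^2 / 6.
n = 400
A, d, b = np.full((n, n), 0.5), np.ones(n), np.ones(n)
z_exact = qmin(A, d, b)
z_hat = qmin_sampled(A, d, b, k=100)   # exact here, since every submatrix looks alike
```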

Making Tree Ensembles Interpretable: A Bayesian Model Selection Approach
Tree ensembles, such as random forests and boosted trees, are renowned for their high prediction performance. However, their interpretability is critically limited by their enormous complexity. In this study, we present a method for making a complex tree ensemble interpretable by simplifying the model. Specifically, we formalize the simplification of tree ensembles as a model selection problem. Given a complex tree ensemble, we aim to obtain the simplest representation that is essentially equivalent to the original one. To this end, we derive a Bayesian model selection algorithm that optimizes the simplified model while maintaining prediction performance. Our numerical experiments on several datasets show that complicated tree ensembles can be reasonably approximated by interpretable ones.
06/29/2016 ∙ by Satoshi Hara, et al.

Making Tree Ensembles Interpretable
Tree ensembles, such as random forests and boosted trees, are renowned for their high prediction performance, whereas their interpretability is critically limited. In this paper, we propose a post-processing method that improves the interpretability of tree ensembles. After learning a complex tree ensemble in a standard way, we approximate it by a simpler model that is interpretable to humans. To obtain the simpler model, we derive an EM algorithm that minimizes the KL divergence from the complex ensemble. A synthetic experiment shows that a complicated tree ensemble can be reasonably approximated by an interpretable model.
06/17/2016 ∙ by Satoshi Hara, et al.
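Neither paper's algorithm, but a crude numpy illustration of the shared idea in these two abstracts: treat the complex ensemble's output as the target and search for the simplest model (here a single decision stump) that mimics it. The toy ensemble of random stumps is a stand-in for a trained forest.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "tree ensemble": an average of 100 random axis-aligned stumps
# (a stand-in for a trained random forest or boosted trees).
X = rng.uniform(0, 1, size=(300, 2))
stumps = [(int(rng.integers(2)), rng.uniform(0.3, 0.7), rng.uniform(-1, 1))
          for _ in range(100)]

def ensemble(X):
    """Prediction of the complex model: average over all stumps."""
    return np.mean([np.where(X[:, j] > t, v, -v) for j, t, v in stumps], axis=0)

def best_single_stump(X, target):
    """Simplification step: find the single stump (feature, threshold) whose
    piecewise-mean prediction is closest in squared error to the ensemble."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for t in np.linspace(0.05, 0.95, 19):
            left = X[:, j] <= t
            pred = np.where(left, target[left].mean(), target[~left].mean())
            err = float(((pred - target) ** 2).mean())
            if err < best_err:
                best, best_err = (j, t), err
    return best, best_err

(feat, thr), err = best_single_stump(X, ensemble(X))
```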

A Tractable Fully Bayesian Method for the Stochastic Block Model
The stochastic block model (SBM) is a generative model that reveals macroscopic structure in graphs. Bayesian methods are used for (i) cluster assignment inference and (ii) model selection for the number of clusters. In this paper, we study the behavior of Bayesian inference in the SBM in the large-sample limit. Combining a variational approximation and Laplace's method, we establish a consistent criterion for the fully marginalized log-likelihood. Based on that, we derive a tractable algorithm that solves tasks (i) and (ii) concurrently, obviating the need for an outer loop to check all model candidates. Our empirical and theoretical results demonstrate that our method is scalable in computation, accurate in approximation, and concise in model selection.
02/06/2016 ∙ by Kohei Hayashi, et al.
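The generative side of the SBM is compact enough to sketch directly (the paper's inference algorithm is not shown here). Cluster sizes and the block matrix B below are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sbm(z, B):
    """Sample an undirected, self-loop-free graph from the SBM:
    z gives cluster labels, B[k, l] is the edge probability between
    clusters k and l."""
    n = len(z)
    P = B[np.ix_(z, z)]                              # per-pair edge probabilities
    upper = np.triu(rng.random((n, n)) < P, k=1).astype(int)
    return upper + upper.T                           # symmetrize

# Two planted clusters of 50 nodes: dense within, sparse across (toy values).
z = np.repeat([0, 1], 50)
B = np.array([[0.50, 0.05],
              [0.05, 0.50]])
Adj = sample_sbm(z, B)
within, across = Adj[:50, :50].mean(), Adj[:50, 50:].mean()
```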

Bayesian Masking: Sparse Bayesian Estimation with Weaker Shrinkage Bias
A common strategy for sparse linear regression is to introduce regularization, which eliminates irrelevant features by driving the corresponding weights to zero. However, regularization often shrinks the estimates for relevant features as well, which leads to incorrect feature selection. Motivated by this issue, we propose Bayesian masking (BM), a sparse estimation method that imposes no regularization on the weights. The key idea of BM is to introduce binary latent variables that randomly mask features; estimating the masking rates determines the relevance of each feature automatically. We derive a variational Bayesian inference algorithm that maximizes a lower bound of the factorized information criterion (FIC), a recently developed asymptotic criterion for evaluating the marginal log-likelihood. In addition, we propose a reparametrization that accelerates the convergence of the derived algorithm. Finally, we show that BM outperforms Lasso and automatic relevance determination (ARD) in terms of the sparsity-shrinkage trade-off.
09/03/2015 ∙ by Yohei Kondo, et al.

Doubly Decomposing Nonparametric Tensor Regression
A nonparametric extension of tensor regression is proposed. Nonlinearity in a high-dimensional tensor space is broken into simple local functions by incorporating low-rank tensor decomposition. Compared with naive nonparametric approaches, our formulation considerably improves the convergence rate of estimation while maintaining consistency with the same function class under specific conditions. To estimate the local functions, we develop a Bayesian estimator with a Gaussian process prior. Experimental results confirm its theoretical properties and show high performance in predicting a summary statistic of a real complex network.
06/19/2015 ∙ by Masaaki Imaizumi, et al.

Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal Likelihood
The factorized information criterion (FIC) is a recently developed approximation technique for the marginal log-likelihood that provides an automatic model selection framework for a few latent variable models (LVMs) with tractable inference algorithms. This paper reconsiders FIC and fills theoretical gaps left by previous FIC studies. First, we reveal the core idea of FIC, which allows its generalization to a broader class of LVMs, including continuous LVMs, in contrast to previous FICs, which are applicable only to binary LVMs. Second, we investigate the model selection mechanism of the generalized FIC. Our analysis provides a formal justification of FIC as a model selection criterion for LVMs, along with a systematic procedure for pruning redundant latent variables that previous studies removed heuristically. Third, we interpret FIC as a variational free energy and uncover several previously unknown relationships between the two. A demonstrative study on Bayesian principal component analysis is provided, and numerical experiments support our theoretical results.
04/22/2015 ∙ by Kohei Hayashi, et al.
Kohei Hayashi
Machine Learning researcher at AI Research Center, National Institute of Advanced Industrial Science and Technology (AIST)