Tensor regression deals with matrices or tensors (i.e., multi-dimensional arrays) as covariates (inputs) to predict scalar responses (outputs) Wang et al. (2014); Hung and Wang (2013); Zhao et al. (2014); Zhou et al. (2013); Tomioka et al. (2007); Suzuki (2015); Guhaniyogi et al. (2015). Suppose we have a set of observations ; is a respondent variable in the space and is a covariate with th-order tensor form in the space , where is the dimensionality of order . With the above setting, we consider the regression problem of learning a function as
is zero-mean Gaussian noise with variance. Such problems can be found in several applications. For example, studies on brain-computer interfaces attempt to predict human intentions (e.g., determining whether a subject imagines finger tapping) from brain activities. Electroencephalography (EEG) measures brain activities as electric signals at several points (channels) on the scalp, giving channel time matrices as covariates. Functional magnetic resonance imaging captures blood flow in the brain as three-dimensional voxels, giving X-axis Y-axis Z-axis time tensors.
There are primarily two approaches to the tensor regression problem. One is assuming linearity to as
where is a weight
parameter with the same dimensionalities as and denotes the inner product. Since is
very high-dimensional in general, several authors have incorporated a
low-rank structure to
Dyrholm et al. (2007); Zhou et al. (2013); Hung and Wang (2013); Wang et al. (2014); Suzuki (2015); Guhaniyogi et al. (2015). We
collectively refer to the linear models (2) with
low-rank as tensor linear regression (TLR)
tensor linear regression (TLR). As an alternative, a nonparametric approach has been proposed Zhao et al. (2013); Hou et al. (2015). When belongs to a proper functional space, with an appropriately choosing kernel function, the nonparametric method can estimate perfectly even if is nonlinear.
In terms of both theoretical and practical aspects, the bias-variance tradeoff is a central issue. In TLR, the function class that the model can represent is critically restricted due to its linearity and the low-rank constraint, implying that the variance error is low but the bias error is high if the true function is either nonlinear or full rank. In contrast, the nonparametric method can represent a wide range of functions and the bias error can be close to zero. However, at the expense of the flexibility, the variance error will be high due to the high dimensionality, the notorious nature of tensors. Generally, the optimal convergence rate of nonparametric models is given by
which is dominated by the input dimensionality and the smoothness of the true function (Tsybakov, 2008). For tensor regression, is the total number of ’s elements, i.e., . When each dimensionality is roughly the same as , , which significantly worsens the rate, and hinders application to even moderate-sized problems.
In this paper, to overcome the curse of dimensionality, we proposeadditive-multiplicative nonparametric regression (AMNR), a new class of nonparametric tensor regression. Intuitively, AMNR constructs as the sum of local functions taking the component of a rank-one tensor as inputs. In this approach, functional space and the input space are concurrently decomposed. This “double decomposition” simultaneously reduces model complexity and the effect of noise. For estimation, we propose a Bayes estimator with the Gaussian Process (GP) prior. The following theoretical results highlight the desirable properties of AMNR. Under some conditions,
AMNR represents the same function class as the general nonparametric model, while
the convergence rate (3) is improved as (), which is times better.
We verify the theoretical convergence rate by simulation and demonstrate the empirical performance for real application in network science.
2 AMNR: Additive-Multiplicative Nonparametric Regression
where denotes the tensor product,
is a unit vector in a set, and is the scale of satisfying for all . In this paper, is the rank of .
A similar relation holds for functions. Here, denotes a Sobolev space, which is times differentiable functions with support . Let be such a function. If is given by the direct product of multiple supports as , there exists a (possibly infinite) set of local functions satisfying
for any (Hackbusch, 2012, Example 4.40). This relation can be seen as an extension of tensor decomposition with infinite dimensionalities.
2.1 The Model
For brevity, we start with the case wherein is rank one. Let with vectors and be a function on a rank one tensor. For any , we can construct such that using function composition as with . Then, using (5), is decomposed into a set of local functions as:
where represents the complexity of (i.e., the “rank” of the model).
With CP decomposition, (6) is amenable to extend for having a higher rank. For , we define AMNR as follows:
Aside from the summation with respect to , AMNR (7) is very similar to CP decomposition (4) in terms of that it takes summation over ranks and multiplication over orders. In addition, as indicates the importance of component in CP decomposition, it controls how component contributes to the final output in AMNR. Note that, for , equality between and does not hold in general; see Section 4.
3 Truncated GP Estimator
3.1 Truncation of and
To construct AMNR (7), we must know . However, this is unrealistic because we do not know the true function. More crucially, can be infinite, and in such a case the exact estimation is computationally infeasible. We avoid these problems using predefined rather than and ignore the contribution from . This may increase the model bias; however, it decreases the variance of estimation. We discuss how to determine in Section 4.2.
For , we adopt the same strategy as , i.e., we prepare some and approximate as a rank- tensor. Because this approximation reduces some information in X, the prediction performance may degrade. However, if is not too small, this preprocessing is justifiable for the following reasons. First, this approximation possibly removes the noise in . In real data such as EEG data, often includes observation noise that hinders the prediction performance. However, if the power of the noise is sufficiently small, the low-rank approximation discards the noise as the residual and enhances the robustness of the model. In addition, even if the approximation discards some intrinsic information of , its negative effects could be limited because s of the discarded components are also small.
3.2 Estimation method and algorithm
For each local function , consider the GP prior
, which is represented as multivariate Gaussian distributionwhere is the zero element vector of size and is a kernel Gram matrix of size . The prior distribution of the local functions is then given by:
From the prior and the likelihood , Bayes’ rule yields the posterior distribution:
are dummy variables for the integral. We use the posterior mean as the Bayesian estimator of AMNR:
To obtain predictions with new inputs, we derive the mean of the predictive distribution in a similar manner.
Since the integrals in the above derivations have no analytical solution, we compute them numerically by sampling. The details of the entire procedure are summarized as follows. Note that denotes the number of random samples.
Step 1: CP decomposition of input tensors
With the dataset , apply rank- CP decomposition to and obtain and for .
Step 2: Construction of the GP prior distribution
Construct a kernel Gram matrix from for each and , and obtain random samples of the multivariate Gaussian distribution . For each sampling , obtain a value for each , and .
Step 4: Prediction with the predictive distribution
Given a new input , compute CP decomposition and obtain and . Then, sample from the prior for each . By multiplying the likelihood calculated in Step 3, derive the predictive distribution of and obtain its expectation with respect to .
Although CP decomposition is not unique up to sign permutation, our model estimation is not affected by this. For example, tensor with and has two equivalent decompositions: (A) and (B) . If training data only contains pattern (A), prediction for pattern (B) does not make sense. However, such a case is pathological. Indeed, if necessary, we can completely avoid the problem by flipping the sign of , and at random while maintaining the original sign of . Although the sign flipping decreases the effective sample size, it is absorbed as a constant term and the convergence rate is not affected.
4 Theoretical Analysis
Our main interest here is the asymptotic behavior of distance between the true function that generates data and an estimator (9). Preliminarily, let be the true function and be the estimator of . To analyze the distance in more depth, we introduce the notion of rank additivity111This type of additivity is often assumed in multivariate and additive model analysis Hastie and Tibshirani (1990); Ravikumar et al. (2009). for functions, which is assumed implicitly when we extend (6) to (7).
Definition 1 (Rank Additivity).
A function is rank additive if
Letting be a projection of onto the Sobolev space satisfying rank additivity, the distance is bounded above as
Unfortunately, the first term is difficult to evaluate, aside from a few exceptions; if or is rank additive, .
Therefore, we focus on the rest term . By definition, is rank additive and the functional tensor decomposition (6) guarantees that is decomposed as the AMNR form (7) with some . Here, the behavior of the distance strongly depends on . We consider the following two cases: (i) is finite and (ii) is infinite. In case (i), the consistency of to is shown with an explicit convergence rate (Theorem 1). More surprisingly, the consistency also holds in case (ii) with a mild assumption (Theorem 2).
Figure 1 illustrates the relations of these functions and the functional space. The rectangular areas are the classes of functions represented by AMNR with small , AMNR with large , and Sobolev space with rank additivity.
Note that the formal assumptions and proofs of this section are shown in supplementary material.
4.1 Estimation with Finite
The consistency of Bayesian nonparametric estimators is evaluated in terms of posterior consistency Ghosal et al. (2000); Ghosal and van der Vaart (2007); van der Vaart and van Zanten (2008). Here, we follow the same strategy. Let be the empirical norm. We define as the contraction rate of the estimator of local function
, which evaluates the probability mass of the GP around the true function. Note that the order ofdepends on the covariance kernel function of the GP prior, in which the optimal rate of is given by (3) with . For brevity, we suppose that the variance of the noise is known and the kernel in the GP prior is selected to be optimal.222We assume that the Matérn kernel is selected and the weight of the kernel is equal to . Under these conditions, the optimal rate is achieved (Tsybakov, 2008). Then, we obtain the following result.
Theorem 1 (Convergence analysis).
Let . Then, with Assumption 1 and some finite constant ,
4.2 Estimation with Infinite
When is infinite, we cannot use the same strategy used in Section 4.1. Instead, we truncate by finite and evaluate the bias error caused by the truncation. To evaluate the bias, we assume that the local functions are in descending order of their volumes , i.e., are ordered as satisfying for all . We then introduce the assumption that decays to zero polynomially with respect to .
With some constant ,
This assumption is sufficiently limited. For example, if we pick as are orthogonal to each ,333For example, we can obtain such by the Gram–Schmidt process.
by the Gram–Schmidt process.then the Parseval-type inequality leads to , implying as .
Then we claim the main result in this section.
Suppose we construct the estimator (9) with
where denotes equality up to a constant. Then, with some finite constant ,
The above theorem states that, even if we truncate by finite , the convergence rate is nearly the same as the case of finite (Theorem 1), which is slightly worsened by the factor .
Theorem 2 also suggests how to determine . For example, if , and , is recommended, which is much smaller than the sample size. Our experimental results (Section 6.3) also support this. Practically, very small is sufficient, such as or , even if is greater than .
Here, we show the conditional consistency of AMNR, which is directly derived from Theorem 2.
5 Related Work
5.1 Nonparametric Tensor Regression
The tensor GP (TGP) (Zhao et al., 2014; Hou et al., 2015) in a method that estimates the function in directly. TGP is essentially a GP regression model that flattens a tensor into a high-dimensional vector and takes the vector as an input. Zhao et al. (2014) proposed its estimator and applied it to image recognition from a monitoring camera. Hou et al. (2015) applied the method to analyze brain signals. Although both studies demonstrated the high performance of TGP, its theoretical aspects such as convergence have not been discussed.
Signoretto et al. (2013) proposed a regression model with tensor product reproducing kernel Hilbert spaces (TP-RKHSs). Given a set of vectors , their model is written as
where is a weight. The key difference between TP-RKHSs (11) and AMNR is in the input. TP-RKHSs take only a single vector for each order, meaning that the input is implicitly assumed as rank one. On the other hand, AMNR takes rank- tensors where can be greater than one. This difference allows AMNR to be used for more general purposes, because the tensor rank observed in the real world is mostly greater than one. Furthermore, the properties of the estimator, such as convergence, have not been investigated.
proposed a logistic regression where the expectation is given by (2) and is a rank-one matrix. Zhou et al. (2013) extended these concepts for tensor inputs. Suzuki (2015) and Guhaniyogi et al. (2015) proposed a Bayes estimator of TLR and investigated its convergence rate.
Interestingly, AMNR is interpretable as a piecewise nonparametrization of TLR. Suppose and have rank- and rank- CP decompositions, respectively. The inner product in the tensor space in (2) is then rewritten as the product of the inner product in the low-dimensional vector space, i.e.,
where is the order- decomposed vector of . The AMNR formation is obtained by replacing the inner product with local function .
From this perspective, we see that AMNR incorporates the advantages of TLR and TGP. AMNR captures nonlinear relations between and through , which is impossible for TLR due to its linearity. Nevertheless, in contrast to TGP, an input of the function constructed in a nonparametric way is given by an -dimensional vector rather than an -dimensional tensor. This reduces the dimension of the function’s support and significantly improves the convergence rate (Section 4).
5.3 Other studies
Koltchinskii et al. (2010) and Suzuki et al. (2013) investigated Multiple Kernel Learning (MKL) considering a nonparametric -variate regression model with an additive structure: . To handle high dimensional inputs, MKL reduces the input dimensionality by the additive structure for and
. Note that both studies deal with a vector input, and they do not fit to tensor regression analysis.
|l Method||Tensor Input||Non- linearity||Convergence Rate with (3)|
Table 1 summarizes the methods introduced in this section. As shown, MKL and TP-RKHSs are not applicable for general tensor input. In contrast, TLR, TGP, and AMNR can take multi-rank tensor data as inputs, and their applicability is much wider. Among the three methods, AMNR is only the one that manages nonlinearity and avoids the curse of dimensionality on tensors.
6.1 Synthetic Data
We compare the prediction performance of three models: TLR, TGP, and AMNR. In all experiments, we generate datasets by the data generating process (dgp) as and fix the noise variance as . We set the size of , i.e., and . By varying the sample size as , we evaluate the empirical risks by the mean-squared-error (MSE) for the testing data, for which we use one-half of the samples. For each experiment, we derive the mean and variance of the MSEs in 100 trials. For TGP and AMNR, we optimize the bandwidth of the kernel function by grid search in the training phase.
6.1.1 Low-Rank Data
First, we consider the case that and the dgp are exactly low rank. We set , the true rank of , as and
where . The results (Figure 3) show that AMNR and TGP clearly outperform TLR, implying that they successfully capture the nonlinearity of the true function. To closely examine the difference between AMNR and TGP, we enlarge the corresponding part (Figure 3(b)), which shows that AMNR consistently outperforms TGP. Note that the performance of TGP improves gradually as increases, implying that the sample size is insufficient for TGP due to its slow convergence rate.
6.1.2 Full-Rank Data
Next, we consider the case that is full rank and the dgp has no low-rank structure, i.e., model misspecification will occur in TLR and AMNR. We generate as with
The results (Figure 3) show that, as in the previous experiment, AMNR and TGP outperform TLR. Although the difference between AMNR and TGP (Figure 3(b)) is much smaller, AMNR still outperforms TGP. This implies that, while the effect of AMNR’s model misspecification is not negligible, TGP’s slow convergence rate is more problematic.
6.2 Sensitivity of Hyperparameters
Here, we investigate how the truncation of and affect prediction performance. In the following experiments, we fix the sample size as .
First, we investigate the sensitivity of . We use the same low-rank dgp used in Section 6.1.1 (i.e., .) The results (Figure 5) show that AMNR and TGP clearly outperform TLR. Although their performance is close, AMNR beats TGP when is not too large, implying that the negative effect of truncating is limited.
Next, we investigate the sensitivity of . We use the same full-rank dgp used in Section 6.1.2. Figure 5 compares the training and testing MSEs of AMNR, showing that both errors increase as increases. These results imply that the model bias decreases quickly and estimation error is more dominant. Indeed, the lowest testing MSE is achieved at . This agrees satisfactory with the analysis in Section 4.2, which recommends small .
6.3 Convergence Rate
Here, we confirm how the empirical convergence rates of AMNR and TGP meet the theoretical convergence rates. To clarify the relation, we generate synthetic data from dgp with such that the difference between TGP and AMNR is maximized. To do so, we design the dgp function as and
where is an orthonormal basis function of the functional space and .444This dgp is derived from a theory of Sobolev ellipsoid; see Tsybakov (2008).
For we consider three variations: , , and (Table 2). Figure 6 shows testing MSEs averaged over trials. The theoretical convergence rates are also depicted by the dashed line (TGP) and the solid line (AMNR). To align the theoretical and empirical rates, we adjust them at . The result demonstrates the theoretical rates agree with the practical performance.
6.4 Prediction of Epidemic Spreading
Here, we deal with the epidemic spreading problem in complex networks Anderson and May (1991); Vespignani (2012) as a matrix regression problem. More precisely, given an adjacency matrix network , we simulate the spreading process of a disease by the susceptible-infected-recovered (SIR) model as follows.
We select 10 nodes as the initially infected nodes.
The nodes adjacent to the infected nodes become infected with probability .
Repeat Step 2. After 10 epochs, the infected nodes recover and are no longer infected (one iteration = one epoch).
After the convergence of the above process, we count the total number of infected nodes as . Note that the number of infected nodes depends strongly on the network structure and its prediction is not trivial. Conducting the simulation is of course a reliable approach; however, it is time-consuming, especially for large-scale networks. In contrast, once we obtain a trained model, regression methods make prediction very quick.
As a real network, we use the Enron email dataset Klimt and Yang (2004), which is a collection of emails. We consider these emails as undirected links between senders and recipients (i.e., this is a problem of estimating the number of email addresses infected by an email virus). First, to reduce the network size, we select the top email addresses based on frequency and delete emails sent to and received from other addresses. After sorting the remaining emails by timestamp, we sequentially construct an adjacency matrix from every emails, and we finally obtain input matrices.
For the analysis, we set for the AMNR and TLR methods.555We also tested , and ; however, the results were nearly the same. Although
seems small, we can still use the top-two eigenvalues and eigenvectors, which contain a large amount of information about the original tensor. In addition, the top eigenvectors are closely related to the threshold of outbreaks in infection networksWang et al. (2003). From these perspectives, the good performance demonstrated by AMNR with is reasonable. The bandwidth of the kernel is optimized by grid search in the training phase.
Figure 7 shows the training and testing MSEs. Firstly, there is a huge performance gap between TLR and the nonparametric models in the testing error. This indicates that the relation between epidemic spreading and a graph structure is nonlinear and the linear model is deficient for this problem. Secondly, AMNR outperforms TGP for every in both training and testing errors. In addition, the performance of AMNR is constantly good and almost unaffected by . This suggests that the problem has some extrinsic information that inflates the dimensionality, and the efficiency of TGP is diminished. On the other hand, it seems AMNR successfully captures the intrinsic information in a low-dimensional space by its “double decomposition” so that AMNR achieves the low-bias and low-variance estimation.
7 Conclusion and Discussion
We have proposed AMNR, a new nonparametric model for the tensor regression problem. We constructed the regression function as the sum-product of the local functions, and developed an estimation method of AMNR. We have verified our theoretical analysis and demonstrated that AMNR was the most accurate method for predicting the spread of an epidemic in a real network.
The most important limitation of AMNR is the computational complexity, which is better than TGP but worse than TLR. The time complexity of AMNR with the GP estimator is , where CP decomposition requires Phan et al. (2013) for inputs and the GP prior requires for the local functions. On the contrary, TGP requires computation because it must evaluate all the elements of to construct the kernel Gram matrix. When and , which are satisfied in many practical situations, the proposed method is more efficient than TGP.
Approximation methods for GP regression can be used to reduce the computational burden of AMNR. For example, Williams and Seeger (2001) proposed the Nyström method, which approximates the kernel Gram matrix by a low-rank matrix. If we apply rank- approximation, the computational cost of AMNR can be reduced to .
Appendix A Proof of Theorem 1
Here, we describe the detail and proof of Theorem 1. At the beginning, we introduce a general theory for evaluating the convergence of a Bayesian estimator.
Preliminary, we introduce some theorems from previous studies. Let be a true distribution of ,
be the Kullback-Leibler divergence, and define. Let be the Hellinger distance, be the bracketing number, and be the packing number. Also we consider a reproducing kernel Hilbert space (RKHS), which is a closure of linear space spanned by a kernel function. Denote by the RKHS on .
The following theorem provides a novel tool to evaluate the Bayesian estimator by posterior contraction.
Theorem 3 (Theorem 2.1 in Ghosal et al. (2000)).
Consider a posterior distribution on a set . Let be a sequence such that and . Suppose that, for a constant and sets , we have
Then, for sufficiently large , .
Based on the theorem, van der Vaart and van Zanten (2008) provide a more useful result for the Bayesian estimator with the GP prior. They consider the estimator for an infinite dimensional parameter with the GP prior and investigated the posterior contraction of the estimator with GP. They provide the following conditions.
With some Banach space and RKHS ,
where is a random element in and is a true function in support of .
When the estimator with the GP prior satisfied the above conditions, the posterior contraction of the estimator is obtained as Theorem 2.1 in Ghosal et al. (2000).
Based on this, we obtain the following result. Consider a set of GP and let be a stochastic process that satisfies .
Also we assume that there exists a true function which is constituted by a unique set of local functions .
To describe posterior contraction, we define a contraction rate . It converges to zero as . Let be a concentration function such that
where is the norm induced by the inner product of RKHS. We define the contraction rate with . We denote a sequence satisfying
The order of depends on a choice of kernel function, where the optimal minimax rate is Tsybakov (2008). In the following part, we set for every . As , the satisfies the condition about the concentration function.
We also note the relation between posterior contraction and the well-known risk bound. Suppose the posterior contraction such that holds, where is a parameter, is a true value, and is a bounded metric. Then we obtain the following inequality:
where is a bound of . This leads . In addition, if is convex, the Jensen’s inequality provides . By taking the expectation, we obtain
We start to prove Theorem 1. First, we provide a lemma for functional decomposition. When the function has a form of a -product of local functions, we bound a distance between two functions with a -sum of distance by the local functions.
Suppose that two functions have a form and with local functions . Then we have a bound such that
We show the result based on induction. When , we have
Thus the result holds when .
Assume the result holds when . Let . The difference between -product functions is written as
From this, we obtain the bound
The distance is decomposed recursively by the case of . Then we obtain the result. ∎
We will show that satisfies the condition (A). Firstly, we check the third condition in the theorem.
According to Lemma 1, the value is bounded as
By denoting , we evaluate the probability as
where is a positive finite constant satisfying .
From van der Vaart and van Zanten (2008), we use the following inequality for every Gaussian random element :
Then, by seting and with some constant , we bound (13) below as
For the second condition, we define a subspace of the Banach space as
for all . Note and are unit balls in and . Also, we define as