Supervised learning framework in tensor network format using the Canonical Polyadic Decomposition
There has recently been increasing interest, both theoretical and practical, in utilizing tensor networks for the analysis and design of machine learning systems. In particular, a framework has been proposed that can handle both dense data (e.g., standard regression or classification tasks) and sparse data (e.g., recommender systems), unlike support vector machines and traditional deep learning techniques. Namely, it can be interpreted as applying local feature mappings to the data and, through the outer product operator, modelling all interactions of functions of the features; the corresponding weights are represented as a tensor network for computational tractability. In this paper, we derive efficient prediction and learning algorithms for supervised learning with the Canonical Polyadic (CP) decomposition, including suitable regularization and initialization schemes. We empirically demonstrate that the CP-based model performs at least on par with the existing models based on the Tensor Train (TT) decomposition on standard non-sequential tasks, and better on MovieLens 100K. Furthermore, in contrast to previous works which applied two-dimensional local feature maps to the data, we generalize the framework to handle arbitrarily high-dimensional maps, in order to gain a powerful lever on the expressiveness of the model. In order to enhance its stability and generalization capabilities, we propose a normalized version of the feature maps. Our experiments show that this version leads to dramatic improvements over the unnormalized and/or two-dimensional maps, as well as to performance on non-sequential supervised learning tasks that compares favourably with popular models, including neural networks.READ FULL TEXT VIEW PDF
Supervised learning framework in tensor network format using the Canonical Polyadic Decomposition
Tensor networks (TNs) have recently found various applications in supervised learning, owing to their interpretable, multilinear structure and ability to circumvent the curse of dimensionality by representing high-order tensors, i.e., multi-way arrays, by sets of interconnected lower-order tensors. For example, by establishing links between common TNs such as the Tensor Train (TT) or Hierarchical Tucker (HT) , and deep learning architectures such as Recurrent [10, 23]20], researchers were able to obtain new theoretical insights [7, 17, 16]. In addition, TNs have proven to be useful for the heavy compression of feedforward, convolutional, and recurrent networks without significant loss in performance [2, 25, 31].
This paper focuses on a related line of research, whereby TNs are employed for the design of scalable, interpretable, and general supervised learning models. More specifically, a framework has been introduced [30, 24] that can be interpreted as applying local feature maps and subsequently the outer product operator to construct a rank-1 tensor that holds all possible feature interactions. The model prediction is given by the inner product of this tensor with a corresponding weight tensor. To render the computations tractable, the exponentially-large weight tensor is represented by a TN. By employing the TT and matrix TT formats, researchers were able to obtain competitive results on tasks including image recognition, recommender systems, and sequence-to-sequence learning [30, 24, 11].
The advantages of such models include:
linear scaling with the dimensionality of the feature vectors and constant scaling with the training size (assuming mini-batch gradient descent is used for training), making them suitable for big data applications
enhanced interpretability, due to the multi-linear nature of the model, that is lost with other state-of-the-art approaches such as neural networks and SVMs
possession of the universal approximation property for large enough dimension of the local feature maps
suitability for a wide range of supervised learning settings, including recommender systems, for which models such as neural networks and SVMs tend to underperform
potential to derive novel machine learning models using multilinear algebra and to gain theoretical insights on existing approaches
Previous works have (either explicitly or implicitly) locally mapped the features to two-dimensional vectors, i.e., the local dimension was two. Being able to vary the local dimension plays a prominent role in the expressiveness of the models (they can only enjoy the universal approximation property if the local dimension is large enough) but leads to significant computational costs for learning algorithms that do not scale linearly with the local dimension, such as in , and results in instability or poor generalization depending on the feature map used.
In this work, we focus on the well-established Canonical Polyadic (CP) decomposition to represent the weight tensor and refer to this model as the CP-based model (naturally, we refer to the models based on TT as TT-based). The use of CP allows for simpler optimization algorithms, can lead to more parsimonious representations, and is insensitive to feature permutations, a desirable property for non-sequential data. We derive prediction and learning algorithms as well as initialization and regularization schemes that scale linearly with the dimensionality of the features and the local dimension. We show in Section IX that the model works at least on par with TT-based models on the non-sequential learning tasks that we experimented with and better on the MovieLens 100K recommender system dataset. A further contribution (that is agnostic to the type of decomposition used) is the extension of existing two-dimensional maps to generalized -dimensional maps, which are able to model all interactions of features raised to powers of up to . To avoid erratic behavior of the learning procedure for high local dimension , we propose to unit-normalize the vectors after mapping. Through experiments, we demonstrate that a higher can dramatically enhance performance and that the algorithms remain stable even for . The performance boosts enjoyed by increasing
facilitated these models to achieve competitive results with popular models, including SVMs and feedforward neural networks. The CP-based model has been implemented as a custom layer in TensorFlow 2.0 to enable straightforward experimentation with various optimizers, regularizers, loss functions, etc.222https://github.com/KritonKonstantinidis/CPD_Supervised_Learning
The rest of the paper is organized as follows. First, we present the tensor preliminaries necessary to follow the technical part of the paper (Section II). Next, we describe the tensor network framework for supervised learning on which our algorithm is based (Section III). After a discussion on the interpretability of the model and its universal approximation property (Section IV), we present the derivations and algorithms for model prediction and learning (Section V). We then propose local feature maps for different settings (Section VI) and derive procedures for order regularization  (Section VII). We show how to initialize the model using the solution of a linear model trained on transformed features (Section VIII), and discuss results obtained during our experiments (Section IX). Finally, we provide links between the CP-based model and related works, discuss the advantages of our approach (Section X) and we conclude with potential future research directions (Section XI).
A real-valued tensor is a multidimensional array, which we denote in calligraphic typeface as , where is the order of the tensor, and () is the size of its th mode. Matrices (referenced with capital bold letters, e.g., ) can be viewed as tensors with , vectors (given in lower-case bold letters, e.g., ) with , and scalars (denoted by lower-case letters, e.g., ) with . A specific entry of a tensor is given by . Moreover, we adopt a graphical notation, whereby a tensor is represented by a shape (e.g., a circle) with outgoing lines; the number of lines equals the order of the tensor (see  for more information on this graphical notation).
We employ the following conventions for basic linear/multilinear operations.
A multi-index (in reverse lexicographic ordering) is defined as .
The mode- matricization of a tensor reshapes the multidimensional array into a matrix with .
The outer product of two vectors and is given by with .
The Kronecker product of two matrices and is denoted by with .
The Khatri-Rao product of two matrices and is denoted by with columns , .
The Hadamard product of two th-order tensors and is denoted by with .
The contraction of an th-order tensor and an th-order tensor over the th and th modes respectively, where , results in an ()th-order tensor with entries . Contraction is illustrated elegantly in the graphical notation simply by connecting the corresponding lines of the tensors.
The inner product of two th-order tensors and is denoted by with .
A property (see Appendix A for proof) that we use in this paper is
where and .
A TN provides an efficient representation of a tensor as a set of lower-order tensors which are contracted over certain modes. The relatively low order of the tensors and their sparse interconnections allow the avoidance of the curse of dimensionality as the size of the representation tends to scale linearly with the tensor order (rather than exponentially).
A tensor can be represented in the Tensor Train (TT) format as
where , , are the TT cores, and denotes contraction over the last mode of and first mode of . The TT rank is given by the dimensions of the contracted modes, i.e., . The TT format is depicted on the left panel of Fig. 1.
where is the CP rank of the tensor (this is equal to the standard matrix rank when ). We can collect these vectors into factor matrices , (), in which case the CP decomposition can be expressed as a contraction of the identity tensor (all ones on the super-diagonal) with all factor matrices. This view allows the CP decomposition to be formulated as a TN, as is shown on the right panel in Fig. 1.
The entries of the CPD can be expressed as
where denotes the th row of the th factor matrix, and .
Upon matricization of we obtain
Furthermore, its vectorization is expressed as
Consider a local feature mapping applied to every feature , where is referred to as the local dimension of the mapping. The choice of the feature map is flexible and application-dependent. We take the outer product between the mapped vectors to obtain
Note that is shorthand for . The prediction produced by the model for one output (e.g., single-target regression or binary classification) is given by
where is the weight tensor, holding all model coefficients. Note that in the case of multi-target regression or multi-class classification with labels, the model is composed of different weight tensors (). For a graphical representation in tensor network notation, see Figure 2.
Consider the map . Then, for number of features , we have and
Thus, the model in this case captures all combinations of distinct features.
The size of the weight tensor scales exponentially with the number of features and is therefore computationally prohibitive to learn. For this reason, is represented as a tensor network with exponentially lower number of parameters. In this work, we represent in the CP format, which is insensitive to the ordering of the features (see X-A). Although this is also true for the Tucker decomposition , using the Tucker format entails exponential scaling with the number of features. Also, the Tucker decomposition is achieved through a core tensor whose modes typically have dimensions smaller than those of the original tensor, rendering them unsuitable for weight tensors with dimensions 2 across all modes, such as in Example 1.
It is important to notice that the model prediction can be expressed in terms of the rows of the CP model’s factor matrices and the corresponding elements of the feature mapping:
In other words, the coefficient for each (transformed) feature interaction can be readily obtained after training simply by performing a Hadamard product of the corresponding row vectors of the learned factor matrices and then summing over all entries of the resulting vector. This provides enhanced interpretability over SVMs or deep learning techniques, as the importance of a given feature interaction can be observed in time.
Consider again the mapping with . Then, the coefficient for would be obtained by finding the Hadamard product of , , and , and then summing over the entries.
It is known that the TN framework discussed in Section III can approximate any function with any approximation error, as long as is large enough . Since there always exists a CP rank for which a tensor can be exactly decomposed in its CP form, the proposed model satisfies the universal approximation property. This highlights the importance of using a higher for the feature map.
Consider ; then, by varying and , the model can capture any polynomial function. By Stone-Weierstrass theorem , there exist and such that the model can approximate any continuous function on compact subsets of .
This section is concerned with efficient algorithms for the model predictions as well as for learning the model parameters. The mathematical derivations of our algorithms were included for rigor; the reader can refer to the graphical notation to gain intuition on the underlying algorithms.
Given the factor matrices and the maps (), there are various ways of obtaining the model predictions, some of which are dramatically more computationally efficient than others. For example, implementing (8) directly yields a procedure that scales exponentially with .
We now derive an efficient procedure for computing the prediction. The prediction can be expressed as
The corresponding procedure is given in Algorithm 1, the complexity of which is .333Throughout the paper we give the asymptotic complexity of the algorithms, assuming that each operation is executed in sequence. The methods are, however, highly amenable to parallelization, and they can be implemented in TensorFlow very efficiently. The algorithm can be visualized using the graphical TN notation (see Fig. 3). Notice that obtaining the model predictions amounts to contracting the TN in Fig. 2.
In order to learn the model parameters (i.e., factor matrices) using an approach based on first-order derivatives (e.g., stochastic gradient descent, ADAM
, etc.) we need to obtain the partial derivative of the prediction with respect to each factor matrix (the desired gradient can subsequently be computed using the chain rule). Alternatively, one can employ automatic differentiation, for example in Keras, and specify only the forward pass (i.e., Algorithm 1).
With view to deriving the partial derivative analytically, consider the following expression of the prediction, which uses the mode- matricization of the tensors:
where denotes the trace operator and we have used its cyclic property.
The partial derivative of the prediction w.r.t. the th factor matrix is then given by
where we have used the well-known matrix calculus identity
In graphical notation, the corresponding procedure is depicted in Fig. 4. Notice that naively implementing (14) leads to a computational complexity of for obtaining the partial derivative w.r.t. each factor matrix and thus of w.r.t. to all factors. The quadratic scaling with is due to a repetition of the same matrix-vector products as we take the derivative w.r.t. different factors, which can be obviated by storing these products. This leads to Algorithm 2, which has a complexity of .
In this paper, we focus on features maps of the form
which is an extension to the map used in , i.e., . We include the subscript in (16) to emphasize the dependence on . The map given in (16) contains the linear model (see Section VIII), does not restrict the set of feature standardization techniques that can be employed (see Remark 1), and allows the straightforward modification of , so that we have an additional lever (aside from the rank ) on the expressiveness of the model.
At high enough , (16
) causes instability in the training process and very high overfitting, especially in the presence of outliers. We found that normalizing the resulting vector to unit length enables the use of very high, even , and also performs significantly better in the experiments that we conducted (see Section IX).
The map can be expressed as
Note that, due to the dependence of the denominator on the feature values, there no longer exists a bias term. One could include a 1 at the start of the vector in (17), or simply fit a bias with an independent parameter, though we achieved best performance in our experiments simply using (17).
The data are pre-processed by standardizing the features; otherwise, increasing
would quickly risk numerical overflow. We found in our experiments that a standard deviation of 1 (and mean of 0) worked well. These parameters imply that almost all feature values range between -3 and 3 after this preprocessing step, and so evendoes not cause overflow for 32-bit single-precision floating-point representation (one can use double-point precision for extremely high ). Furthermore, the entries of the vector in (17) either decrease exponentially with the index if the standardized feature values are between -1 and 1 or increase exponentially otherwise. Hence, only a few of the vector entries have a non-negligible impact on the prediction. This seems to have a positive effect on both the training and validation errors, since standardizing the powers of before vector normalization led to significantly worse results.
A possible reason for the stability of the algorithm and smoothness of the learning curves when the normalized map is used, even when the data contain outliers, is that the norms of the vectors after the matrix-vector multiplications in the prediction (see Fig. 3) remain bounded within the same range at each iteration:
denotes the smallest singular value of. Notice that the bound does not depend on the data.
An advantage of this framework for supervised learning is its ability to handle categorical features after they have been one-hot encoded. Although it is possible to concatenate all binary one-hot features into a large feature vector and allocate a factor matrix to each feature, thereby enabling the use of (16) with
, this would unnecessarily model interactions between one-hot features belonging to the same categorical variable. For this reason, the feature map used for categorical data is instead
where is the one-hot-encoded representation of the th categorical feature, and is the number of values that the feature can assume.
A feature map that was inspired by quantum physics and has been used in [30, 21] for object recognition tasks is . A potential issue of this map for other types of tasks is that and are one-to-one functions only if their domains are restricted; thus, if the input features are unbounded (i.e., there are no pre-defined minimum and maximum values), one can only apply Min-Max scaling (and not other standardization techniques) to the features as a preprocessing step. Although pixel values are bounded (typically from 0 to 255), this is not normally the case for other data types. This is especially problematic when there exist outliers, as, in that case, Min-Max scaling can result in extremely small standard deviations for the scaled features, adversely impacting performance.
Although the CP decomposition itself provides strong regularization for small rank, we can potentially improve the performance by guiding the training algorithm to hypotheses that are more likely to generalize well to unseen data. Standard regularizers, such as L1 and L2, do not differentiate between the coefficients for different feature interactions in (11). This is potentially suboptimal when using the polynomial or categorical mappings, because we might want to limit the coefficients of the higher-order terms relatively more. Order regularization  addresses this by penalizing large coefficients for higher-order terms more than for lower-order ones. In this section, we show how to apply order regularization to our model.
The penalty for order regularization is given by , where for a user-defined vector . In the case of polynomial functions, one choice is with : the coefficients of the higher-order terms are multiplied by a higher power of in the computation of the penalty function, and the factor matrices are adjusted in such a way as to shrink these coefficients relatively more. In the case of categorical feature mapping, is more suitable, since all binary, one-hot features should be treated equally.
Now, let and . The penalty can be re-written as
The procedure for computing the regularization penalty is given in Algorithm 3 and depicted graphically in Fig. 5. The algorithm has a computational cost of , so the training procedure scales quadratically with with order regularization (rather than linearly).
The partial derivative of the order regularization penalty w.r.t. each factor matrix can be found by first expressing the penalty in the following way:
Then, using the identity (see Appendix B for proof)
we obtain (notice that in this case )
As in Algorithm 2, by storing the Hadamard products, the procedure for computing the partial derivative of the penalty w.r.t. all factor matrices scales linearly with rather than quadratically, leading to a time complexity of (see Algorithm 4). Moreover, its memory complexity is . To further speed up the algorithm, it is possible to store the matrix products in the first for loop to avoid computing them in the second one. This would have no effect on the asymptotic computational complexity and would increase the memory complexity to .
One way to initialize the factor matrices is to use independent Gaussian noise, centered around 0 and with a tuned standard deviation. However, if the local feature map is of the form , where
, one can employ the solution of a linear model (linear or logistic regression depending on the task), trained on the set of features. Note that one can also train the linear model on a subset of this set, as is done in Example 4, but for clarity we assume for now the full set of transformed features.
Let denote the bias term of the linear model trained on the aforementioned set; in addition, let denote its weight corresponding to the th feature and the function . Then, the CP-based model produces the same predictions as the linear model if the entries of the factor matrices are given by
With such an initialization, the error at the first epoch tends to be lower than with random initialization and sometimes also converges to a lower point. This is especially common when the number of features is high (e.g., larger than 20).
To prove that the initialization indeed yields the same prediction as the linear model, notice from (11) that the bias term for the CP-based model is obtained through the sum of the Hadamard products of the first rows of the factor matrices. Therefore, the bias of the model equals
Moreover, the coefficient for equals
The coefficient for the second-order interaction for , , and is
since for all , , and all , . Finally, following this logic, it is obvious that the coefficients of the higher-order interactions are also 0.
A corollary is that the CP-based model with the polynomial feature map given by (16) contains the solution of the linear model trained on the original features. Namely, given the coefficient for the th feature , the relevant initialization for the entries of the factor matrices is
and all other rows equal , since they correspond to features raised to higher powers than 1.
When dealing with categorical features that are one-hot encoded, the initialization changes slightly due to the fact that the number of rows of the factor matrices is not constant (see Section VI). Given categorical features, each assuming () values, the rows of the factor matrices can be initialized as in (25), except that must be replaced with . Note that in this case, denotes the weight of the binary feature corresponding to the th categorical feature and its th possible value.
To demonstrate the performance and flexibility of our model on different tasks, experiments were performed on a synthetic polynomial regression task, the MovieLens 100K recommender system dataset (classification on very sparse data), and the famous California Housing dataset (regression with many outliers). The first dataset is a toy example that illustrates the effects of initialization from the linear model solution and order regularization for lower than and higher than optimal local dimensions. The MovieLens dataset shows how our model can be successfully employed for sparse (categorical) data. Finally, the utility of using very high local dimension, coupled with normalized polynomial feature mappings, is illustrated on the California Housing dataset.
We evaluated our model’s performance against the TT-based model as well as other common supervised learning models. For the former, we used the code provided in Google’s TensorNetwork library . Moreover, we used the scikit-learn implementation of the SVM and the Keras library to construct the neural networks, as well as polylearn444https://github.com/scikit-learn-contrib/polylearn and tffm555https://github.com/geffy/tffm for the polynomial networks and higher-order factorization machines, respectively.
For this experiment, the target function was set to a 2nd-order polynomial of 3,000 samples of four features with added white Gaussian noise, and the models were trained on seven features, i.e., three of the features were non-informative. We used the (unnormalized) polynomial map, so that the target function was included in the CP- and TT-based models’ hypotheses sets (for
). As a preprocessing step, we standardized the features by subtracting the mean and dividing by the standard deviation. We used the mean squared error as the loss function. Moreover, for the iterative methods, the mini-batch size was set to 32, and the ADAM optimizer was used, as it performed better than plain SGD. The neural network was composed of three hidden layers, with 20 neurons for the first two and 15 for the last, all followed by ReLu activation and batch normalization. 20% of the dataset was held out for testing and 20% of the remaining data points formed the validation set.
Fig. 6 shows the influence of linear initialization and order regularization on our model for different local dimensions, with the neural network acting as a baseline. Specifically, we trained the models for 60 epochs, first with random initialization (zero-mean Gaussian noise with standard deviation of 0.2) and no regularization, then with the linear model solution as initialization, and finally we added order regularization with and . In the top panel of the figure, we show the learning curves when and , i.e., the model was not expressive enough to capture the 2nd-order terms of the polynomial. Despite the linear-initialized model starting from a significantly lower validation error and converging faster, both initialization schemes led to the model reaching the same error after 4-5 epochs; the neural network converged after about 60 epochs to a similar value. Also, order regularization did not have a significant effect on the performance of the model. On the other hand, when was increased to four (a higher than optimal local dimension, as there were no 3rd-order terms) and to 30, random initialization took significantly longer to converge (around 60 epochs, then it started overfitting), whereas order regularization enabled the optimization procedure to reach a lower validation error.
The comparison with the other models is given in Table I, where the minimum validation score across 100 epochs is reported. We were not able to run the TT-based models with the unnormalized polynomial mapping (neither Google’s TensorNetwork nor Exponential Machines with
), as the loss diverged with all hyperparameters we tried.
|Method||Val Loss||Train Time (sec)|
|3rd-order Polynomial Network||0.1390||6.96|
|3rd-order Factorization Machine||0.1485||4.35|
Categorical data cause issues to traditional neural networks and SVMs due to data sparsity after one-hot encoding. For example, SVMs with a polynomial kernel find it difficult to learn the coefficients corresponding to interactions of two or more categorical features, because there are usually not enough training points where the relevant binary, one-hot features are both “hot.” In contrast, the CP-based model factorizes these coefficients and, thus, performs well in such settings (for a detailed discussion on why factorized models perform well with sparse data see ).
We demonstrate the performance of our model on the MovieLens 100K, a widely-used benchmark for sparse data classification. For details on the dataset, see . For the preprocessing of the data, we followed the procedure described in , in order to be able to directly compare the results with the TT-based model (Exponential Machines). The features were mappped as described in Section VI-C. The CP rank was set to 30, and order regularization was added with and , respectively. We trained the model until convergence, obtaining an Area Under Curve (AUC) score of 0.7863. The authors of Exponential Machines reported an AUC score of 0.784, and we could not achieve a higher score by tuning their model. Note that logistic regression obtains a score of 0.7821. For a comparison with other models, we refer the reader to . We found that both initialization from the linear model solution and order regularization helped significantly in achieving better performance on this task.
We now show the effect of high on a common regression dataset. The California Housing dataset is a single-target regression task, comprising eight features and 20,640 samples. We performed similar data pre-processing for this experiment as for the synthetic example. The CP rank was set to 20. We found that, for the polynomial map, increasing from two to three boosted performance, but any increase thereafter caused the learning procedure to be highly unstable and eventually to experience numerical overflow. Standardizing (zero mean, unit standard deviation) each element of the resulting vector after mapping prevented overflow but led to relatively erratic learning curves for validation (see Fig 7).
In contrast, the model with the normalized polynomial map (with random initialization666One could also use initialization from the linear model.) was able to be trained reliably even for very high , as can be seen from the smooth learning curves in Fig. 7. In addition, the training error kept decreasing to very small values without regularization, reflecting the gains in expressiveness (see Fig. 8). More surprisingly, however, the lowest validation error (without regularization) was obtained with , and it remained low even for dimensions around 100, indicating the regularization capabilities of low rank. Of course, with a smaller dataset, we would observe more overfitting for high . With L2 regularization on the factor matrices, it reached its smallest value at (we did not observe any improvement using order regularization over L2 with this map and on this dataset).
Not only does higher result in great gains in performance, but due to optimized matrix-vector multiplications in TensorFlow, high also has a small impact on training time, enabling fast tuning of this hyperparameter (see Fig. 9).
|Method||Val MSE||Train Time (sec)|
|CP-Based (, unnorm.)||0.3348||36.40|
|CP-Based (, unnorm.)||0.4221||41.29|
|CP-Based (, norm.)||0.2090||51.15|
|CP-Based (, norm., w/ L2)||0.1959||118.84|
|TT-Based (, norm.)||0.2139||773.25|
|4th order Polynomial Network||0.2761||9.06|
|4th order Factorization Machine||0.3322||51.69|
The comparison with the other models is given in Table II. As there is currently no implementation of Exponential Machines with Riemannian optimization for high , we used the TensorNetwork library to compare with a TT-based model. The TT rank was set to 5, leading to the same number of parameters as the CP-based model. The number of parameters for the TT-based model is , while for the CP-based model it is , where is the TT rank. For , corresponds to . Note that the very large training time compared to the CP-based model is likely due to the different implementation in TensorFlow, and not due to the TT-based model being inherently slower to train for the same number of parameters. The feedforward neural network that we constructed contained four hidden layers and also had an equal number of parameters. The minimum validation score across 100 epochs is reported.
The CP rank seems to affect performance differently for low than for high. As can be seen in Fig. 10, both the training and validation errors were decreasing up to rank equal to 50 and then plateaued for . On the other hand, when was increased to 75, although the training error was decreasing with higher rank, the validation error remained fairly constant. Notice that even with very small rank (about 2 to 5), the validation error still managed to stay dramatically lower than for any value of rank with . This suggests that the local dimension is more important than the rank for accurate model predictions. Finally, low validation error is observed even when the rank was set to larger values (e.g., 500) in the case with .
Increasing the number of features beyond a certain point (e.g., about 30) often makes the optimization process more sensitive to the standard deviation of the Gaussian distribution, assuming random initialization is used. This phenomenon is due to the many Hadamard products (or matrix-matrix products in the case of TT-based models) that are performed: if the standard deviation is too small or too big, this can lead to vanishingly small or exceedingly large predictions, respectively. Initialization with any linear model solution (not necessarily a linear model trained on the dataset at hand) largely alleviates this issue, as was confirmed when we managed to train a model on the (flattened) MNIST dataset with 784 features. It would be interesting to investigate other strategies that could enable these models to achieve state-of-the-art results on very high-dimensional datasets.
In this section, we discuss similarities and differences between the CP-based model and related works.
Both  and  used the TT format to limit the model parameters. The former proposed an algorithm inspired from quantum physics that allows the adaptation of the TT rank during training; unfortunately, it scales cubically with the local dimension . The latter group designed a stochastic version of a Riemannian optimization approach, which they found to be more robust to initialization than stochastic gradient descent methods. In terms of the local feature mappings, both works used (see Section VI).
To better understand how the TT-based models are linked to the CP-based one, it is worthwhile to notice that a CP decomposition can be expressed in terms of the TT format. Specifically, a tensor in the CP format with rank is equivalent to a tensor in the TT format, for which the TT cores are given by , , and for , where are the lateral slices of the cores . Hence any CP-decomposed weight tensor can be converted into its TT format. However, the predictions produced by the models are not equivalent, since an optimization process on the TT-based model alters the off-diagonal terms, i.e., they are not constrained to be zero. Thus, differences in performance between the two models must arise due to off-diagonal core elements either being helpful or detrimental to generalization capabilities for the dataset at hand. Viewed another way, for equal number of parameters, the TT-based model will achieve superior results if and only if it is better for generalization to have TT cores whose slices are relatively small matrices with off-diagonal terms rather than larger diagonal matrices. Note also that any tensor in the TT format can be mapped to a tensor in the CP format, but, in general, the CP rank equals , where .
A potential advantage of the CP-based model for non-sequential data is its insensitivity to the ordering of the features, as, when constrained to be laterally diagonal, the cores can be permuted in any way without altering the decomposition (since diagonal matrices commute). By the same token, the TT-based models may be more suitable for sequential data due to the inherent ordering of their cores.
The CP-based model with polynomial feature map resembles in some ways an SVM with the polynomial kernel, the prediction of which is given by
where maps the feature vector into a higher-dimensional space, and are the parameters of the SVM model. SVM with a polynomial kernel , can model an th-degree polynomial, similarly to our model (with ). The disadvantages of the polynomial SVM compared to the CP-based model, however, are that
they scale at least quadratically with the training set size
they tend to overfit the data for large since, unlike our model, the model parameters are independent of each other
they cannot effectively capture interactions for sparse (categorical) data (e.g., recommender systems)
they are not as interpretable since one cannot recover the coefficients of the polynomial
the prediction depends on the training data, or support vectors
Higher-Order Factorization Machines (HOFM)  address the limitations of the SVMs by factorizing the interaction parameters. The order in this case refers to the highest degree of feature interactions being modelled (e.g., an order of three refers to modelling interactions containing a maximum of three variables). Although HOFM resembles the CP-based model when , there are some important differences, which can be illuminated by casting HOFM into the tensor format .
HOFM with order can be expressed in this format as
where the weight tensors () are represented in the symmetric CP format. Furthermore, in this formulation we observe the outer product of the whole feature vector with itself times. Finally, only the entries above the super-diagonal of the weight tensors (which correspond to the products of distinct features) are used to construct the output. In contrast, the CP-based model takes the inner product between a tensor assumed to be in the (asymmetric) CP format (and of order equal to the number of features) and a tensor formed from the outer product of the local feature maps .
(the difference with (31) is in the subscript of the second sum) and is related in the same way to our model when .
Unlike HOFM and PN, the CP-model allows for the modelling of all-order interactions with a computational complexity that scales linearly with the number of features during both inference and training.
The CP-based model can be viewed as a special case of the model presented in  (see Section 3 in the reference). In that work, they represent a data instance as a collection of vectors, each of dimensionality . In the case of a grayscale image, these vectors may correspond to
consecutive pixels. They then apply so-called representation functions (analogous to the local feature maps) on each of these vectors, using the example of an affine transformation followed by an activation function. They refer to this model as Convolutional Arithmetic Circuit due to the nature of their representation function. The main differences between the CP-based model and the model presented in their work are that we useand the representation functions that we apply are the ones given in Section VI.
The authors proved in  that a model based on the Hierarchical Tucker decomposition is exponentially more expressive than one based on CP; that is, an HT-based model realizes functions that would almost always require a CP-based model with an exponentially large rank to even approximate them. A similar result was proved in , where they compared a model based on CP with one based on TT. Although this may at first seem like a disadvantage of the CP-based model, we argue that this is not so, at least for non-sequential data. The model architecture captures an inductive bias about the task at hand, and due to either potential overfitting or optimization difficulties, a model that is more expressive than necessary is likely to converge to a solution that is inferior to the one obtained by a model well-suited to the data. Our experiments on non-sequential data comparing CP- and TT-based models confirm this hypothesis. We also show that increasing the rank did not significantly improve the validation loss, which is further empirical evidence that it is unlikely that the target function for these tasks is one that the CP needs an exponentially large rank to approximate.
We have presented scalable inference and learning algorithms for a supervised learning model based on the Canonical Polyadic (CP) decomposition to serve as a potential alternative to the existing methods based on the Tensor Train (TT) for non-sequential tasks. This model is applicable to both standard classification and regression tasks as well as recommender systems. We also derived efficient procedures to incorporate order regularization and showed how to initialize the model using a linear model solution. In addition, we proposed a unit-normalized version of an arbitrarily high-dimensional local feature map, which enables the straightforward increase of model expressiveness and remains stable for very high dimensions.
With regard to future research directions, it would be interesting to extend the Riemannian optimization approach of Exponential Machines to handle high local dimensions for TT-based supervised learning and compare the performance with adaptive gradient approaches. It could also be fruitful to experiment with other local feature maps and compare them with those considered in this paper. Another potential follow-up work could be finding novel techniques to help in the training of the models when the dimensionality of the features is very high. Finally, future research could focus on the generalization ability of the models when different tensor networks (TNs) are employed to represent the weight tensor, depending on the nature of the data at hand. For example, given the inherent one-dimensional structure of natural language processing problems or the two-dimensional nature of images, it seems likely that the TT representation (1-D TNs) would lead to superior performance in the former case and PEPS (2-D TNs) would be more suitable in the latter case.
We prove here the identity in (23). Let and a colon () denote the inner product operator. Then,
We obtain the differential:
Finally, this implies that
The authors would like to thank Giuseppe Calvi and Bruno Scalzo Dees for helpful discussions. A.H. is supported by an Imperial College London President’s Scholarship. K.K. is supported by an International Doctoral Scholarship provided by EPSRC.
Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 6645–6649. Cited by: §I.