Exponential Machines

05/12/2016 · by Alexander Novikov et al. · HSE University, Moscow Institute of Physics and Technology, Skoltech

Modeling interactions between features improves the performance of machine learning solutions in many domains (e.g. recommender systems or sentiment analysis). In this paper, we introduce Exponential Machines (ExM), a predictor that models all interactions of every order. The key idea is to represent an exponentially large tensor of parameters in a factorized format called Tensor Train (TT). The Tensor Train format regularizes the model and lets us control the number of underlying parameters. To train the model, we develop a stochastic Riemannian optimization procedure, which allows us to fit tensors with 2^160 entries. We show that the model achieves state-of-the-art performance on synthetic data with high-order interactions and that it works on par with high-order factorization machines on the MovieLens 100K recommender system dataset.



1 Introduction

Solving machine learning problems with categorical data requires modeling interactions between the features. As an example, consider a sentiment analysis problem – detecting whether a review is positive or negative – and the following dataset: ‘I liked it’, ‘I did not like it’, ‘I’m not sure’. Judging by the presence of the word ‘like’ or the word ‘not’ alone, it is hard to understand the tone of the review. But the presence of the pair of words ‘not’ and ‘like’ strongly indicates a negative opinion.

If the dictionary has $d$ words, modeling pairwise interactions requires $O(d^2)$ parameters and will probably overfit to the data. Taking into account all interactions (all pairs, triplets, etc. of words) requires impractical $2^d$ parameters.
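For concreteness, a back-of-the-envelope check of these counts (the value $d = 160$ below is taken from the $2^{160}$ example in the abstract, not from this paragraph):

```python
# Parameter counts for pairwise vs. all-order interaction models.
d = 160                      # dictionary size / number of features (illustrative)
pairwise = d * (d - 1) // 2  # one weight per unordered pair of features
all_orders = 2 ** d          # one weight per subset of features
print(pairwise)              # 12720
print(all_orders)            # ~1.46e48, clearly infeasible to store explicitly
```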

In this paper, we show a scalable way to account for all interactions. Our contributions are:

  • We propose a predictor that models all interactions of $d$-dimensional data by representing the exponentially large tensor of parameters in a compact multilinear format – Tensor Train (TT-format) (Sec. 3). Factorizing the parameters into the TT-format leads to better generalization, a number of underlying parameters that is linear in $d$, and inference time that is linear in $d$ (Sec. 5). The TT-format allows controlling the number of underlying parameters through the TT-rank – a generalization of the matrix rank to tensors.

  • We develop a stochastic Riemannian optimization learning algorithm (Sec. 6.1). In our experiments, it outperformed the stochastic gradient descent baseline (Sec. 8.1) that is often used for models parametrized by a tensor decomposition (see related work, Sec. 9).

  • We show that the linear model (e.g. logistic regression) is a special case of our model with the TT-rank equal 2 (Sec. 8.2).

  • We extend the model to handle interactions between functions of the features, not just between the features themselves (Sec. 7).

2 Linear model

In this section, we describe a generalization of a class of machine learning algorithms – the linear model. Let us fix a training dataset of pairs $\{(x^{(f)}, y^{(f)})\}_{f=1}^{N}$, where $x^{(f)}$ is a $d$-dimensional feature vector of the $f$-th object, and $y^{(f)}$ is the corresponding target variable. Also fix a loss function $\ell(\widehat{y}, y): \mathbb{R}^2 \to \mathbb{R}$, which takes as input the predicted value $\widehat{y}$ and the ground truth value $y$. We call a model linear, if the prediction of the model depends on the features $x$ only via the dot product between the features $x$ and the $d$-dimensional vector of parameters $w$:

$$\widehat{y}_{\text{linear}}(x) = \langle x, w \rangle + b, \qquad (1)$$

where $b \in \mathbb{R}$ is the bias parameter.

Learning the parameters $w$ and $b$ of the model corresponds to minimizing the following loss

$$\sum_{f=1}^{N} \ell\bigl(\langle x^{(f)}, w \rangle + b, \; y^{(f)}\bigr) + \frac{\lambda}{2} \|w\|_2^2, \qquad (2)$$

where $\lambda$ is the regularization parameter. For the linear model we can choose any regularization term instead of $\|w\|_2^2$, but later the choice of the regularization term will become important (see Sec. 6.1).

Several machine learning algorithms can be viewed as special cases of the linear model with an appropriate choice of the loss function $\ell(\widehat{y}, y)$: least squares regression (squared loss), Support Vector Machine (hinge loss), and logistic regression (logistic loss).
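For illustration, a minimal numpy sketch of the objective (2) with the logistic loss; the $\{-1, +1\}$ label encoding is an assumption of this sketch, not a requirement of the model:

```python
import numpy as np

def linear_model_loss(w, b, X, y, lam):
    """Regularized objective (2) for the linear model with the logistic loss.

    X: (N, d) feature matrix, y: (N,) labels in {-1, +1},
    w: (d,) weight vector, b: scalar bias, lam: regularization parameter.
    """
    margins = y * (X @ w + b)               # y * (<x, w> + b) for every object
    logistic = np.logaddexp(0.0, -margins)  # log(1 + exp(-margin)), numerically stable
    return logistic.sum() + 0.5 * lam * np.dot(w, w)
```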

3 Our model

Before introducing our model equation in the general case, consider a $3$-dimensional example. The equation includes one term per each subset of features (each interaction):

$$\widehat{y}(x) = \mathcal{W}_{000} + \mathcal{W}_{100}\, x_1 + \mathcal{W}_{010}\, x_2 + \mathcal{W}_{001}\, x_3 + \mathcal{W}_{110}\, x_1 x_2 + \mathcal{W}_{101}\, x_1 x_3 + \mathcal{W}_{011}\, x_2 x_3 + \mathcal{W}_{111}\, x_1 x_2 x_3. \qquad (3)$$

Note that all permutations of features in a term (e.g. $x_1 x_2$ and $x_2 x_1$) correspond to a single term and have exactly one associated weight (e.g. $\mathcal{W}_{110}$).

In the general case, we enumerate the subsets of features with a binary vector $(i_1, \ldots, i_d)$, where $i_k = 1$ if the $k$-th feature belongs to the subset. The model equation looks as follows

$$\widehat{y}(x) = \sum_{i_1=0}^{1} \cdots \sum_{i_d=0}^{1} \mathcal{W}_{i_1 \ldots i_d} \prod_{k=1}^{d} x_k^{i_k}. \qquad (4)$$

Here we assume that $0^0 = 1$. The model is parametrized by a $d$-dimensional tensor $\mathcal{W}$, which consists of $2^d$ elements.

The model equation (4) is linear with respect to the weight tensor $\mathcal{W}$. To emphasize this fact and simplify the notation we rewrite the model equation (4) as a tensor dot product $\widehat{y}(x) = \langle \mathcal{X}, \mathcal{W} \rangle$, where the tensor $\mathcal{X}$ is defined as follows

$$\mathcal{X}_{i_1 \ldots i_d} = \prod_{k=1}^{d} x_k^{i_k}. \qquad (5)$$

Note that there is no need for a separate bias term, since it is already included in the model as the weight tensor element $\mathcal{W}_{0 \ldots 0}$ (see the model equation example (3)).
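To make equation (4) concrete, the following deliberately naive numpy function enumerates all $2^d$ index vectors; it is exponential in $d$ and is shown only to spell out the model (the efficient TT-based inference is described in Sec. 5):

```python
import itertools
import numpy as np

def exm_predict_naive(W, x):
    """Evaluate the model (4) by brute force: sum over all 2^d subsets of features.

    W: weight tensor of shape (2,) * d, x: (d,) feature vector.
    Only feasible for tiny d.
    """
    d = x.shape[0]
    y_hat = 0.0
    for idx in itertools.product((0, 1), repeat=d):    # binary vector (i_1, ..., i_d)
        y_hat += W[idx] * np.prod(x ** np.array(idx))  # W_{i_1...i_d} * prod_k x_k^{i_k}
    return y_hat

# For d = 3 the loop reproduces the example (3): W[0, 0, 0] is the bias,
# W[1, 1, 0] multiplies x_1 * x_2, W[1, 1, 1] multiplies x_1 * x_2 * x_3, etc.
```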

The key idea of our method is to compactly represent the exponentially large tensor of parameters in the Tensor Train format (Oseledets, 2011).

4 Tensor Train

A $d$-dimensional tensor $\mathcal{A}$ is said to be represented in the Tensor Train (TT) format (Oseledets, 2011), if each of its elements can be computed as the following product of matrices and vectors

$$\mathcal{A}_{i_1 \ldots i_d} = G_1[i_1]\, G_2[i_2] \cdots G_d[i_d], \qquad (6)$$

where for any $k \in \{2, \ldots, d-1\}$ and for any value of $i_k$, $G_k[i_k]$ is an $r \times r$ matrix, $G_1[i_1]$ is a $1 \times r$ vector and $G_d[i_d]$ is an $r \times 1$ vector (see Fig. 1). We refer to the collection of matrices $G_k[\,\cdot\,]$ corresponding to the same dimension $k$ (technically, a $3$-dimensional array) as the $k$-th TT-core, where $k = 1, \ldots, d$. The size $r$ of the slices $G_k[i_k]$ controls the trade-off between the representational power of the TT-format and computational efficiency of working with the tensor. We call $r$ the TT-rank of the tensor $\mathcal{A}$.
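A minimal sketch of reading one element off a TT representation via (6), assuming each TT-core is stored as a 3-dimensional array whose first axis enumerates the slices $G_k[i_k]$ (storage conventions differ between TT libraries):

```python
import numpy as np

def tt_element(cores, index):
    """Compute one element A_{i_1...i_d} of a TT-tensor as in (6).

    cores: list of d arrays; cores[k] has shape (n_k, r_k, r_{k+1}),
           with r_0 = r_d = 1 (so the first factor is a row, the last a column).
    index: tuple (i_1, ..., i_d).
    """
    result = np.ones((1, 1))
    for core, i in zip(cores, index):
        result = result @ core[i]   # multiply the matrices G_k[i_k] left to right
    return result[0, 0]
```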

An attractive property of the TT-format is the ability to perform algebraic operations on tensors without materializing them, i.e. by working with the TT-cores instead of the tensors themselves. The TT-format supports computing the norm of a tensor and the dot product between tensors; element-wise sum and element-wise product of two tensors (the result is a tensor in the TT-format with increased TT-rank), and some other operations (Oseledets, 2011).

Figure 1: An illustration of the TT-format for a tensor $\mathcal{A}$ with a fixed TT-rank.

5 Inference

In this section, we return to the model proposed in Sec. 3 and show how to compute the model equation (4) in linear time. To avoid the exponential complexity, we represent the weight tensor $\mathcal{W}$ and the data tensor $\mathcal{X}$ (5) in the TT-format. The TT-ranks of these tensors determine the efficiency of the scheme. During the learning, we initialize and optimize the tensor $\mathcal{W}$ in the TT-format and explicitly control its TT-rank. The TT-rank of the tensor $\mathcal{X}$ always equals 1. Indeed, the following TT-cores give the exact representation of the tensor $\mathcal{X}$:

$$G_k[i_k] = x_k^{i_k} \in \mathbb{R}^{1 \times 1}, \qquad k = 1, \ldots, d.$$

The $k$-th core $G_k[i_k]$ is a $1 \times 1$ matrix for any value of $i_k$, hence the TT-rank of the tensor $\mathcal{X}$ equals 1.

Now that we have TT-representations of the tensors $\mathcal{W}$ and $\mathcal{X}$, we can compute the model response $\widehat{y}(x) = \langle \mathcal{X}, \mathcal{W} \rangle$ in linear time with respect to the number of features $d$.

Theorem 1.

The model response $\widehat{y}(x)$ can be computed in $O(r^2 d)$ operations, where $r$ is the TT-rank of the weight tensor $\mathcal{W}$.

We refer the reader to Appendix A, where we propose an inference algorithm with $O(r^2 d)$ complexity and thus prove Theorem 1.

The TT-rank of the weight tensor $\mathcal{W}$ is a hyper-parameter of our method and it controls the efficiency vs. flexibility trade-off. A small TT-rank regularizes the model and yields fast learning and inference, but restricts the set of possible tensors $\mathcal{W}$. A sufficiently large TT-rank allows any tensor $\mathcal{W}$ and effectively leaves us with the full polynomial model without any advantages of the TT-format.

6 Learning

Learning the parameters of the proposed model corresponds to minimizing the loss under the TT-rank constraint:

$$\min_{\mathcal{W}} \; L(\mathcal{W}) \quad \text{subject to} \quad \text{TT-rank}(\mathcal{W}) = r_0, \qquad (7)$$

where the loss is defined as follows:

$$L(\mathcal{W}) = \sum_{f=1}^{N} \ell\bigl(\langle \mathcal{X}^{(f)}, \mathcal{W} \rangle,\; y^{(f)}\bigr) + \frac{\lambda}{2} \|\mathcal{W}\|_F^2. \qquad (8)$$

We consider two approaches to solving problem (7). In a baseline approach, we optimize the objective with stochastic gradient descent applied to the underlying parameters of the TT-format of the tensor $\mathcal{W}$ (i.e. the TT-cores).

A simple alternative to the baseline is to perform gradient descent with respect to the tensor $\mathcal{W}$, that is, to subtract the gradient from the current estimate of $\mathcal{W}$ on each iteration. The TT-format indeed allows subtracting tensors, but this operation increases the TT-rank on each iteration, making this approach impractical.

To improve upon the baseline and avoid the TT-rank growth, we exploit the geometry of the set of tensors that satisfy the TT-rank constraint (7) to build a Riemannian optimization procedure (Sec. 6.1). We experimentally show the advantage of this approach over the baseline in Sec. 8.1.

6.1 Riemannian optimization

The set $\mathcal{M}_{r_0}$ of all $d$-dimensional tensors with fixed TT-rank $r_0$ forms a Riemannian manifold (Holtz et al., 2012). This observation allows us to use Riemannian optimization to solve problem (7). Riemannian gradient descent consists of the following steps, which are repeated until convergence (see Fig. 2 for an illustration):

  1. Project the gradient $\frac{\partial L}{\partial \mathcal{W}}$ on the tangent space of $\mathcal{M}_{r_0}$ taken at the point $\mathcal{W}$. We denote the tangent space as $T_{\mathcal{W}}\mathcal{M}_{r_0}$ and the projection as $P_{T_{\mathcal{W}}\mathcal{M}_{r_0}}\bigl(\frac{\partial L}{\partial \mathcal{W}}\bigr)$.

  2. Follow along the projected gradient with some step size $\alpha$ (this operation increases the TT-rank).

  3. Retract the new point back to the manifold $\mathcal{M}_{r_0}$, that is, decrease its TT-rank to $r_0$.

We now describe how to implement each of the steps outlined above.

Projecting a TT-tensor $\mathcal{Z}$ on the tangent space of $\mathcal{M}_{r_0}$ at a point $\mathcal{W}$ can be done efficiently, in time linear in $d$ (Lubich et al., 2015). The TT-rank of the projection is bounded by a constant that is independent of the TT-rank of the tensor $\mathcal{Z}$: $\text{TT-rank}\bigl(P_{T_{\mathcal{W}}\mathcal{M}_{r_0}}(\mathcal{Z})\bigr) \le 2 r_0$.

Let us consider the gradient of the loss function (8):

$$\frac{\partial L}{\partial \mathcal{W}} = \sum_{f=1}^{N} \frac{\partial \ell}{\partial \widehat{y}}\, \mathcal{X}^{(f)} + \lambda \mathcal{W}. \qquad (9)$$

Using the fact that $P_{T_{\mathcal{W}}\mathcal{M}_{r_0}}(\mathcal{W}) = \mathcal{W}$ and that the projection is a linear operator, we get

$$P_{T_{\mathcal{W}}\mathcal{M}_{r_0}}\!\Bigl(\frac{\partial L}{\partial \mathcal{W}}\Bigr) = \sum_{f=1}^{N} \frac{\partial \ell}{\partial \widehat{y}}\, P_{T_{\mathcal{W}}\mathcal{M}_{r_0}}\bigl(\mathcal{X}^{(f)}\bigr) + \lambda \mathcal{W}. \qquad (10)$$

Since the resulting expression is a weighted sum of projections of the individual data tensors $\mathcal{X}^{(f)}$, we can project them in parallel. Since the TT-rank of each of them equals 1 (see Sec. 5), the total cost of all the projections is linear in both the dataset size $N$ and the dimensionality $d$. The TT-rank of the projected gradient is less than or equal to $2 r_0$ regardless of the dataset size $N$.

Note that here we used the particular choice of the regularization term: for regularization terms other than the squared Frobenius norm $\|\mathcal{W}\|_F^2$, the gradient may have an arbitrarily large TT-rank.

As a retraction – a way to return back to the manifold – we use the TT-rounding procedure (Oseledets, 2011). For a given tensor $\mathcal{Z}$ and rank $r_0$, the TT-rounding procedure returns a tensor $\widehat{\mathcal{Z}} = \text{TT-round}(\mathcal{Z}, r_0)$ such that its TT-rank equals $r_0$ and the Frobenius norm of the residual $\|\mathcal{Z} - \widehat{\mathcal{Z}}\|_F$ is as small as possible.

To choose the step size $\alpha_t$ on iteration $t$, we use backtracking: we start from an initial guess of the step size and multiply it by a constant $\beta \in (0, 1)$ until the step satisfies the generalization of the Armijo rule for Riemannian optimization (Sato & Iwai, 2015):

$$L\Bigl(\text{TT-round}\bigl(\mathcal{W} - \alpha_t\, P_{T_{\mathcal{W}}\mathcal{M}_{r_0}}\bigl(\tfrac{\partial L}{\partial \mathcal{W}}\bigr),\; r_0\bigr)\Bigr) \;\le\; L(\mathcal{W}) - c\, \alpha_t \Bigl\| P_{T_{\mathcal{W}}\mathcal{M}_{r_0}}\bigl(\tfrac{\partial L}{\partial \mathcal{W}}\bigr) \Bigr\|_F^2, \qquad (11)$$

where $c \in (0, 1)$ is a constant. Condition (11) is equivalent to the regular Armijo rule (Nocedal & Wright, 2006) with two exceptions: it uses the projected gradient instead of the regular gradient, and it uses the retracted point in the left-hand side of the inequality.

Since we aim for big datasets, we use a stochastic version of the Riemannian gradient descent: on each iteration we sample a random mini-batch of objects from the dataset, compute the stochastic gradient for this mini-batch, make a step along the projection of the stochastic gradient, and retract back to the manifold (Alg. 1).

Figure 2: An illustration of one step of the Riemannian gradient descent. The step size is assumed to be fixed for clarity of the figure.
Figure 3: The effect of dropout on Exponential Machines (the proposed model) on a synthetic binary classification dataset with high-order interactions (see Sec. 8.3). A dropout rate of $p$ means that each feature was zeroed with probability $p$.
  Input: training dataset, desired TT-rank $r_0$, number of iterations $T$, mini-batch size $M$, backtracking parameters
  Output: tensor $\mathcal{W}$ that approximately minimizes (7)
  Train the linear model (2) to get the parameters $w$ and $b$
  Initialize the tensor $\mathcal{W}_0$ from $w$ and $b$ with the TT-rank equal 2
  Initialize the step size $\alpha_0$
  for $t := 1$ to $T$ do
     Sample $M$ object indices uniformly from the training set
     Compute the stochastic gradient of the loss (8) on the sampled mini-batch
     $\mathcal{G}_t :=$ projection of the stochastic gradient on the tangent space $T_{\mathcal{W}_{t-1}}\mathcal{M}_{r_0}$ (10)
     $\alpha_t := \alpha_{t-1}$
     while the Armijo condition (11) is not satisfied do
        decrease the step size $\alpha_t$
     end while
     $\mathcal{W}_t :=$ TT-round$(\mathcal{W}_{t-1} - \alpha_t\, \mathcal{G}_t,\; r_0)$
  end for
Algorithm 1 Riemannian optimization
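For concreteness, a schematic Python rendering of one iteration of Algorithm 1 follows. The helpers `project_to_tangent`, `tt_round`, `tt_norm`, and `batch_loss` are hypothetical placeholder names (not the API of the released implementation or of the TT-Toolbox); the sketch only mirrors the structure of the update.

```python
# One iteration of Algorithm 1, schematically. The TT helpers below are
# hypothetical stand-ins for the tangent-space projection (Lubich et al., 2015),
# TT-rounding (Oseledets, 2011), the Frobenius norm of a TT-tensor, and the
# mini-batch loss (8).

def riemannian_sgd_step(W, batch, loss_derivative, lam, step, rank, beta=0.5, c=1e-4):
    # Stochastic gradient (9) on the mini-batch: a weighted sum of the rank-1
    # object-tensors X^(h) plus the regularization term lam * W.
    grad = sum(loss_derivative(W, x_tensor, y) * x_tensor
               for (x_tensor, y) in batch) + lam * W
    G = project_to_tangent(grad, W)           # step 1: project on T_W M_r0, see (10)
    while True:                               # backtracking with the Armijo rule (11)
        W_new = tt_round(W - step * G, rank)  # steps 2-3: gradient step, then retract
        if batch_loss(W_new, batch) <= batch_loss(W, batch) - c * step * tt_norm(G) ** 2:
            break
        step *= beta                          # shrink the step until (11) holds
    return W_new, step
```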

6.2 Initialization

We found that a random initialization of the TT-tensor $\mathcal{W}$ sometimes freezes the convergence of the optimization method (Sec. 8.2). We propose to initialize the optimization from the solution of the corresponding linear model (1).

The following theorem shows how to initialize the weight tensor $\mathcal{W}$ from the weights of the linear model.

Theorem 2.

For any $d$-dimensional vector $w$ and a bias term $b$ there exists a tensor $\mathcal{W}$ of TT-rank 2, such that for any $d$-dimensional vector $x$ and the corresponding object-tensor $\mathcal{X}$ the linear model response $\langle x, w \rangle + b$ and the dot product $\langle \mathcal{X}, \mathcal{W} \rangle$ coincide.

The proof is provided in Appendix B.

6.3 Dropout

To improve the generalization of the proposed model and to help the optimization avoid poor local minima, we use the dropout technique (Srivastava et al., 2014). On each iteration of the Riemannian SGD, for each object $x$ we independently sample a binary mask from a Bernoulli distribution and then use the element-wise product of the mask and $x$ instead of $x$. This corresponds to zeroing random slices of the object-tensor $\mathcal{X}$: whenever the mask zeroes the $k$-th feature, every element $\mathcal{X}_{i_1 \ldots i_d}$ with $i_k = 1$ becomes zero, for all values of the other indices. See the experimental evaluation of applying dropout to the proposed model in Sec. 8.3.
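A minimal sketch of the feature-level dropout described above (only the mask sampling; the rest of the training step is unchanged):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_features(x, p):
    """Zero each feature of x independently with probability p (Sec. 6.3).

    Zeroing x_k makes every model term that contains x_k vanish, which is
    equivalent to zeroing the corresponding slices of the object-tensor X.
    """
    mask = rng.binomial(1, 1.0 - p, size=x.shape)  # keep a feature with prob. 1 - p
    return x * mask
```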

7 Extending the model

In this section, we extend the proposed model to handle polynomials of any functions of the features. As an example, consider including the logarithms of the features: in addition to the original interactions, the model then contains terms such as $\log x_1$, $x_2 \log x_1$, and $\log x_1 \log x_2$.

In the general case, to model interactions between the features and functions of the features, we redefine the object-tensor $\mathcal{X}$: each index $i_k$ now selects which factor of the $k$-th feature (the constant $1$, the feature $x_k$ itself, or one of its functions) enters the term, and each element of $\mathcal{X}$ is again the product of the selected factors. The weight tensor $\mathcal{W}$ and the object-tensor $\mathcal{X}$ now have more than two slices per dimension and hence consist of more than $2^d$ elements. After this change to the object-tensor $\mathcal{X}$, the learning and inference algorithms stay unchanged compared to the original model (4).
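As an illustration, one natural way to encode the per-dimension factors of the extended object-tensor; the indexing convention (0 for the constant, 1 for the feature, larger indices for its functions) is an assumption of this sketch rather than the paper's exact definition:

```python
import numpy as np

def feature_slice(x_k, i_k, funcs=(np.log,)):
    """Per-dimension factor of the extended object-tensor (Sec. 7), as a sketch:
    index 0 selects the constant 1, index 1 selects x_k itself, and index 1 + j
    selects funcs[j](x_k)."""
    if i_k == 0:
        return 1.0
    if i_k == 1:
        return x_k
    return funcs[i_k - 2](x_k)

# The extended object-tensor is then X_{i_1...i_d} = prod_k feature_slice(x_k, i_k),
# so each (rank-1) TT-core of the data tensor simply gains extra 1x1 slices.
```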

Categorical features.

Our basic model handles a categorical feature $x_k \in \{1, \ldots, K\}$ by converting it into a $K$-dimensional one-hot vector. The downside of this approach is that it wastes the model capacity on modeling non-existing interactions between the one-hot vector elements which correspond to the same categorical feature. Instead, we propose to use one TT-core per categorical feature and to apply the model extension technique with an indicator function that selects the slice corresponding to the observed category value. This reduces the number of parameters per categorical feature without losing any representational power.

8 Experiments

We release a Python implementation of the proposed algorithm and the code to reproduce the experiments (https://github.com/Bihaqo/exp-machines). For Algorithm 1 we used a fixed setting of its hyperparameters. For the operations related to the TT-format, we used the Python version of the TT-Toolbox (https://github.com/oseledets/ttpy).

8.1 Riemannian optimization

In this experiment, we compared two approaches to training the model: Riemannian optimization (Sec. 6.1) vs. the baseline (Sec. 6). To choose the step size for the baseline optimization procedure, we used the same approach as for the Riemannian optimization: backtracking with the Armijo rule. We experimented on the Car and HIV datasets from the UCI repository (Lichman, 2013). The Car dataset is a classification problem with binary features (after one-hot encoding). For simplicity, we binarized the classification problem: we picked the first class (‘unacc’) and made a one-versus-rest binary classification problem from the original Car dataset. The HIV dataset is a binary classification problem. We report that on the Car dataset Riemannian optimization converges faster and achieves a better final value of the loss than the baseline (Fig. 4a). On the HIV dataset Riemannian optimization converges to the same loss value substantially faster than the baseline (Fig. 4b).

(a) Car dataset
(b) HIV dataset
Figure 4: A comparison between Riemannian optimization and SGD applied to the underlying parameters of the TT-format (the baseline). Numbers in the legend stand for the batch size. The methods marked with ‘rand init’ in the legend (square marker) were initialized from a random TT-tensor, all other methods were initialized from the solution of ordinary linear logistic regression.

8.2 Initialization

In this experiment, we compared random initialization with initialization from the solution of the corresponding linear problem (Sec. 6.2). To initialize a TT-tensor randomly, we filled its TT-cores with independent Gaussian noise and chose the variance so as to fix the Frobenius norm of the resulting tensor. We report that on the Car dataset random initialization slowed the convergence compared to initialization from the linear model solution (Fig. 4a), while on the HIV dataset the convergence was completely frozen in the case of random initialization (Fig. 4b).

8.3 Dropout

In this experiment, we investigated the influence of dropout on the proposed model. We generated synthetic train and test objects. Each entry of the data matrix was independently sampled from two possible values with equal probabilities. We also uniformly sampled random subsets of features (interactions) of a fixed high order. We set the ground truth target variable to a deterministic function of the input and sampled the weights of the interactions from a uniform distribution.

We report that dropout helps to generalize better and to avoid poor local minima, i.e. it improves both the train and the test losses (Fig. 3).
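A sketch of a comparable data-generating process; all sizes, the interaction order, the $\{-1, +1\}$ feature values, and the form of the deterministic labeling rule below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative sizes only.
n_objects, n_features, n_interactions, order = 10_000, 30, 20, 6

X = rng.choice([-1.0, 1.0], size=(n_objects, n_features))       # random sign features
subsets = [rng.choice(n_features, size=order, replace=False)    # random feature subsets
           for _ in range(n_interactions)]
weights = rng.uniform(-1.0, 1.0, size=n_interactions)           # interaction weights
scores = sum(w * X[:, s].prod(axis=1) for w, s in zip(weights, subsets))
y = (scores > 0).astype(int)                                    # deterministic labels
```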

8.4 Comparison to other approaches

In this experiment, we compared Exponential Machines (the proposed method) to other machine learning models on the dataset with high-order interactions described in the previous section (Sec. 8.3). We used the scikit-learn implementation (Pedregosa et al., 2011) of logistic regression, random forest, and kernel SVM; the FastFM implementation (Bayer, 2015) of 2-nd order Factorization Machines; and our implementations of Exponential Machines and of high-order Factorization Machines (https://github.com/geffy/tffm). For our method, we chose the dropout level based on the training set performance (Sec. 8.3). We compared the models based on the Area Under the Curve (AUC) metric, since it is applicable to all methods and is robust to unbalanced labels (Tbl. 1).

Method | Test AUC | Training time (s) | Inference time (s)

Log. reg.
RF
SVM RBF
SVM poly. 2
SVM poly. 6
2-nd order FM
6-th order FM
6-th order FM (more iters.)
ExM rank 8
ExM rank 30
ExM rank 30 (more iters.)
Table 1: A comparison between models on synthetic data with high-order interactions (Sec. 8.4). We trained each model 5 times on the same data and report the mean and standard deviation of the AUC across the runs. The variance of Random Forest (RF) was greater than zero but small enough to round to zero at the reported precision. We report the inference time on the test objects in the last column.

8.5 MovieLens 100K

MovieLens 100K is a recommender system dataset with 943 users and 1682 movies (Harper & Konstan, 2015). We followed Blondel et al. (2016) in preparing the features and in turning the problem into binary classification. This process yields 78 additional one-hot features for each user-movie pair. To encode the categorical features, we used the technique described in Sec. 7. Our method obtained a test AUC on par with both our implementation of 3-rd order factorization machines and the result reported by Blondel et al. (2016) for 3-rd order factorization machines on the same data.

9 Related work

Kernel SVM is a flexible non-linear predictor and, in particular, it can model interactions when used with the polynomial kernel (Boser et al., 1992). As a downside, it scales at least quadratically with the dataset size (Bordes et al., 2005) and overfits on highly sparse data.

With this in mind, Rendle (2010) developed Factorization Machine (FM), a general predictor that models pairwise interactions. To overcome the problems of polynomial SVM, FM restricts the rank of the weight matrix, which leads to a linear number of parameters and generalizes better on sparse data. FM running time is linear with respect to the number of nonzero elements in the data, which allows scaling to billions of training entries on sparse problems.

For high-order interactions FM uses CP-format (Caroll & Chang, 1970; Harshman, 1970) to represent the tensor of parameters. The choice of the tensor factorization is the main difference between the high-order FM and Exponential Machines. The TT-format comes with two advantages over the CP-format: first, the TT-format allows for Riemannian optimization; second, the problem of finding the best TT-rank approximation to a given tensor always has a solution and can be solved in polynomial time. We found Riemannian optimization superior to the SGD baseline (Sec. 6) that was used in several other models parametrized by a tensor factorization (Rendle, 2010; Lebedev et al., 2014; Novikov et al., 2015).

A number of works used full-batch or stochastic Riemannian optimization for data processing tasks (Meyer et al., 2011; Tan et al., 2014; Xu & Ke, 2016; Zhang et al., 2016). The last work Zhang et al. (2016) is especially interesting in the context of our method, since it improves the convergence rate of stochastic Riemannian gradient descent and is directly applicable to our learning procedure.

10 Discussion

We presented a predictor that models all interactions of every order. To regularize the model and to make the learning and inference feasible, we represented the exponentially large tensor of parameters in the Tensor Train format. The proposed model outperformed other machine learning algorithms on synthetic data with high-order interactions.

To train the model, we used Riemannian optimization in the stochastic regime and report that it outperforms a popular baseline based on stochastic gradient descent. However, the Riemannian learning algorithm does not support sparse data, so for datasets with hundreds of thousands of features we are forced to fall back on the baseline learning method. We found that the training process is sensitive to initialization and proposed an initialization strategy based on the solution of the corresponding linear problem. We also found that using dropout with our model improves generalization and helps the training procedure avoid poor local minima. The solutions developed in this paper for stochastic Riemannian optimization may suit other machine learning models parametrized by tensors in the TT-format.

References

  • Bayer (2015) I. Bayer. fastFM: a library for factorization machines. arXiv preprint arXiv:1505.00641, 2015.
  • Blondel et al. (2016) M. Blondel, A. Fujino, N. Ueda, and M. Ishihata. Higher-order factorization machines. In Advances in Neural Information Processing Systems 29 (NIPS), 2016.
  • Bordes et al. (2005) A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers with online and active learning. The Journal of Machine Learning Research, 6:1579–1619, 2005.
  • Boser et al. (1992) B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152, 1992.
  • Caroll & Chang (1970) J. D. Caroll and J. J. Chang. Analysis of individual differences in multidimensional scaling via n-way generalization of Eckart-Young decomposition. Psychometrika, 35:283–319, 1970.
  • Harper & Konstan (2015) F. M. Harper and J. A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 2015.
  • Harshman (1970) R. A. Harshman. Foundations of the PARAFAC procedure: models and conditions for an explanatory multimodal factor analysis. UCLA Working Papers in Phonetics, 16:1–84, 1970.
  • Holtz et al. (2012) S. Holtz, T. Rohwedder, and R. Schneider. On manifolds of tensors of fixed TT-rank. Numerische Mathematik, pp. 701–731, 2012.
  • Lebedev et al. (2014) V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. In International Conference on Learning Representations (ICLR), 2014.
  • Lichman (2013) M. Lichman. UCI machine learning repository, 2013.
  • Lubich et al. (2015) C. Lubich, I. V. Oseledets, and B. Vandereycken. Time integration of tensor trains. SIAM Journal on Numerical Analysis, pp. 917–941, 2015.
  • Meyer et al. (2011) G. Meyer, S. Bonnabel, and R. Sepulchre. Regression on fixed-rank positive semidefinite matrices: a Riemannian approach. The Journal of Machine Learning Research, pp. 593–625, 2011.
  • Nocedal & Wright (2006) J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2nd edition, 2006.
  • Novikov et al. (2015) A. Novikov, D. Podoprikhin, A. Osokin, and D. Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems 28 (NIPS). 2015.
  • Oseledets (2011) I. V. Oseledets. Tensor-Train decomposition. SIAM J. Scientific Computing, 33(5):2295–2317, 2011.
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • Rendle (2010) S. Rendle. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pp. 995–1000, 2010.
  • Sato & Iwai (2015) H. Sato and T. Iwai. A new, globally convergent Riemannian conjugate gradient method. Optimization: A Journal of Mathematical Programming and Operations Research, pp. 1011–1031, 2015.
  • Srivastava et al. (2014) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, pp. 1929–1958, 2014.
  • Tan et al. (2014) M. Tan, I. W. Tsang, L. Wang, B. Vandereycken, and S. J. Pan. Riemannian pursuit for big matrix recovery. 2014.
  • Xu & Ke (2016) Z. Xu and Y. Ke. Stochastic variance reduced Riemannian eigensolver. arXiv preprint arXiv:1605.08233, 2016.
  • Zhang et al. (2016) H. Zhang, S. J. Reddi, and S. Sra. Fast stochastic optimization on Riemannian manifolds. arXiv preprint arXiv:1605.07147, 2016.

Appendix A Proof of Theorem 1

Theorem 1 states that the inference complexity of the proposed algorithm is $O(r^2 d)$, where $r$ is the TT-rank of the weight tensor $\mathcal{W}$. In this section, we propose an algorithm that achieves the stated complexity and thus prove the theorem.

Proof.

Let us rewrite the definition of the model response (4) assuming that the weight tensor $\mathcal{W}$ is represented in the TT-format (6):

$$\widehat{y}(x) = \sum_{i_1=0}^{1} \cdots \sum_{i_d=0}^{1} \Bigl(\prod_{k=1}^{d} x_k^{i_k}\Bigr)\, G_1[i_1] \cdots G_d[i_d].$$

Let us group the factors that depend on the index $i_k$, $k = 1, \ldots, d$:

$$\widehat{y}(x) = \Bigl(\sum_{i_1=0}^{1} x_1^{i_1} G_1[i_1]\Bigr) \cdots \Bigl(\sum_{i_d=0}^{1} x_d^{i_d} G_d[i_d]\Bigr) = A_1 A_2 \cdots A_d,$$

where the matrices $A_k$ for $k = 1, \ldots, d$ are defined as follows:

$$A_k = \sum_{i_k=0}^{1} x_k^{i_k} G_k[i_k] = G_k[0] + x_k\, G_k[1].$$

The final value $\widehat{y}(x)$ can be computed from the matrices $A_k$ via $d-2$ matrix-by-vector multiplications and one vector-by-vector multiplication, which yields the $O(r^2 d)$ complexity.

Note that the proof is constructive and corresponds to an implementation of the inference algorithm. ∎
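The constructive proof translates directly into code; a minimal numpy version, assuming each TT-core of $\mathcal{W}$ is stored as an array of shape $(2, r_{k-1}, r_k)$ with $r_0 = r_d = 1$ (other storage layouts would need the obvious adjustments):

```python
import numpy as np

def exm_predict_tt(cores, x):
    """O(r^2 d) inference from Appendix A.

    cores: TT-cores of the weight tensor W; cores[k] has shape (2, r_k, r_{k+1})
           with r_0 = r_d = 1 (slice 0 is G_k[0], slice 1 is G_k[1]).
    x: (d,) feature vector.
    """
    result = np.ones((1, 1))
    for core, x_k in zip(cores, x):
        A_k = core[0] + x_k * core[1]   # A_k = G_k[0] + x_k * G_k[1]
        result = result @ A_k           # one vector-by-matrix product per feature
    return result[0, 0]
```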

Appendix B Proof of Theorem 2

Theorem 2 states that it is possible to initialize the weight tensor  of the proposed model from the weights of the linear model.

Theorem.

For any $d$-dimensional vector $w$ and a bias term $b$ there exists a tensor $\mathcal{W}$ of TT-rank 2, such that for any $d$-dimensional vector $x$ and the corresponding object-tensor $\mathcal{X}$ the linear model response $\langle x, w \rangle + b$ and the dot product $\langle \mathcal{X}, \mathcal{W} \rangle$ coincide.

To prove the theorem, in the rest of this section we show that the tensor $\mathcal{W}$ from Theorem 2 is representable in the TT-format with TT-cores (12) whose slices are of sizes $1 \times 2$, $2 \times 2$, and $2 \times 1$, and thus the TT-rank of the tensor $\mathcal{W}$ equals 2.

We start the proof with the following lemma:

Lemma 1.

For the TT-cores (12) and any $k \in \{1, \ldots, d-1\}$, the partial product $G_1[i_1] \cdots G_k[i_k]$ depends only on how many of the indices $i_1, \ldots, i_k$ equal 1: it marks that no feature has been selected yet if none of them do, it stores the weight $w_j$ of the single selected feature if exactly one index $i_j$ equals 1, and it is zero if two or more indices equal 1.

Proof.

We prove the lemma by induction over $k$. For $k = 1$ the statement of the lemma holds by the definition of the first TT-core $G_1[i_1]$.

Now suppose that the statement of Lemma 1 is true for some $k < d - 1$. If $i_{k+1} = 0$, then $G_{k+1}[i_{k+1}]$ is an identity matrix, so the partial product does not change. Also, the number of indices that equal 1 does not change, so the statement of the lemma stays the same.

If $i_{k+1} = 1$, then there are 3 options:

  • If none of $i_1, \ldots, i_k$ equal 1, then the partial product now stores the weight $w_{k+1}$ of the newly selected feature.

  • If exactly one of $i_1, \ldots, i_k$ equals 1, then two features are now selected and the partial product becomes zero.

  • If two or more of $i_1, \ldots, i_k$ equal 1, then the partial product was zero and stays zero.

Which is exactly the statement of Lemma 1. ∎

Proof of Theorem 2.

The product of all the TT-cores can be represented as the product of the first $d-1$ cores times the last core, and by using Lemma 1 we get the following. The elements of the obtained tensor $\mathcal{W}$ that correspond to interactions of order two or higher equal zero; the weight that corresponds to the single feature $x_k$ equals $w_k$; and the bias term $\mathcal{W}_{0 \ldots 0}$ equals $b$.

The TT-rank of the obtained tensor equals 2, since its TT-cores are of size at most $2 \times 2$. ∎
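For illustration, the following numpy sketch builds one TT-rank-2 tensor satisfying Theorem 2 (this particular arrangement of the cores is one valid choice and need not coincide with the exact cores in (12)) and checks it numerically against the linear model using the inference algorithm from Appendix A:

```python
import numpy as np

def linear_model_tt_cores(w, b):
    """One TT-rank-2 construction satisfying Theorem 2: <X, W> = <x, w> + b.
    Each returned core has shape (2, r_prev, r_next)."""
    d = len(w)
    cores = [np.array([[[1.0, 0.0]],            # first core: 1 x 2 slices
                       [[0.0, w[0]]]])]
    for k in range(1, d - 1):                   # middle cores: 2 x 2 slices
        cores.append(np.array([np.eye(2),
                               [[0.0, w[k]], [0.0, 0.0]]]))
    cores.append(np.array([[[b], [1.0]],        # last core: 2 x 1 slices
                           [[w[-1]], [0.0]]]))
    return cores

# Numerical check against the linear model (1) at a random point.
rng = np.random.default_rng(0)
d = 5
w, b, x = rng.normal(size=d), rng.normal(), rng.normal(size=d)
cores = linear_model_tt_cores(w, b)
result = np.ones((1, 1))
for core, x_k in zip(cores, x):
    result = result @ (core[0] + x_k * core[1])  # fast inference from Appendix A
assert np.isclose(result[0, 0], x @ w + b)
```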