The rise of deep learning methods has shown that complex, nonlinear problems can be learned in practice if the model families have sufficient capacity to cover the underlying interdependence. Extending the domain of known approaches further could increase our capability to tackle new challenging problems. In this paper, we study the problem of learning noisy, highly nonlinear multivariate functions from large data sets. We assume that the target functions to be learned are real valued and their domain is a bounded subset of a finite-dimensional Hilbert space. The corresponding training data contain a set of input-output pairs: points of the Hilbert space and the function values (scalar or vector) at those points. To construct the estimates of the functions, we further assume that they are continuous or can be approximated by a continuous function within a fixed norm-based tolerance. The continuity assumption allows us to exploit the Stone-Weierstrass theorem and its generalizations: such functions can be approximated by polynomials on a compact subset with an accuracy not worse than an arbitrarily small given error. Since every polynomial can be represented by tensors, the function learning problem can be formulated as a tensor reconstruction problem. This reconstruction is a certain inverse of tensor decomposition, e.g. HOSVD, in which a tensor is built up, for example, from rank-one tensors; see methods such as the CANDECOMP/PARAFAC tensor decomposition.
The tensor reconstruction approach makes it possible to reduce the number of parameters required to represent the polynomials. An arbitrary multivariate polynomial defined over the field of real numbers can be described by $\binom{n+d}{d}$ parameters, where $n$ is the number of variables and $d$ is the maximum degree; the size of the underlying order-$d$ tensor is $n^{d}$, which increases exponentially in the degree. A properly chosen variant of the tensor decomposition can significantly reduce that complexity. To this end, we consider polynomials of the form
$$p(\mathbf{x}) = \sum_{r=1}^{R} \lambda_{r} \prod_{t=1}^{d} f_{tr}(\mathbf{x}),$$
where $f_{tr}$, for all $t=1,\dots,d$ and $r=1,\dots,R$, are linear forms. This representation is also called the polynomial factorization or decomposition problem. The main difference between the HOSVD and the linear-form factorization is that in the former the orthogonality of the vectors defining the linear forms is generally assumed, whereas in the factorization only linear independence is expected. A significant advantage of the linear-form-based representation is that the polynomial function depends only linearly on its parameters. The basic model of this paper is outlined in Table 1.
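To make the representation concrete, the value of such a polynomial can be computed directly from the linear forms. The following NumPy sketch (the names `ltr_poly`, `P`, `lam` and the array shapes are our own illustration, not notation from the paper) evaluates a rank-$R$ sum of products of linear forms and checks its degree-$d$ homogeneity:

```python
import numpy as np

def ltr_poly(x, P, lam):
    """Evaluate p(x) = sum_r lam[r] * prod_t <P[t, r], x>.

    P has shape (d, R, n): one weight vector per degree factor t and rank r.
    """
    F = P @ x                       # inner products <p_{t,r}, x>, shape (d, R)
    return lam @ np.prod(F, axis=0) # product over factors, weighted sum over ranks

rng = np.random.default_rng(0)
d, R, n = 3, 4, 5
P = rng.normal(size=(d, R, n))
lam = rng.normal(size=R)
x = rng.normal(size=n)

# homogeneity: p(c*x) = c^d * p(x) for a degree-d homogeneous polynomial
assert np.isclose(ltr_poly(2.0 * x, P, lam), 2.0**d * ltr_poly(x, P, lam))
```

Note that the evaluation never materializes the underlying tensor; it only touches the $d \cdot R \cdot n$ parameters of the linear forms.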
The motivation of the learning approach presented in this paper comes from several sources. One of the influencing models is the factorization machine (FM), which applies a special class of polynomials, the symmetric polynomials, to approximate the underlying, unknown nonlinear functions. It was introduced by Rendle, and further extensions have been published later on, such as the higher-order case (HOFM) by Blondel et al. The FM itself also has a close connection to polynomial networks, as highlighted by Blondel et al. Our approach can be viewed as a reformulation of factorization machines that removes the limiting assumption of symmetry imposed on the polynomials, and hence extends the range of learnable functions. Furthermore, the latent tensor models [11, 12, 13] also provide important background to the proposed method. In those approaches, the moments of the input data, up to third order, are exploited to yield the tensor structure. In our method, the decomposition of the latent tensor is conditioned on the output data as well; thus it is also a cross-covariance based method.
The latent tensor reconstruction (LTR) based function learning can be interpreted as a certain algebraic alternative to a multi-layered deep artificial neural network (DNN). In a DNN, the linear subproblems are connected into a complete learning system by nonlinear activation functions, e.g. the sigmoid. By contrast, in LTR, the nonlinear subproblems are integrated by simple linear transfer functions. The latter model satisfies the basic principle of dynamic programming: if the subproblems are processed sequentially, then after the first $k$ subproblems have been computed, the optimization of subproblem $k+1$ does not influence the optimum solutions computed on the previous ones. This fact eliminates the need for the backpropagation generally required in the training of a DNN. The LTR model can also be applied as a special type of recurrent module embedded into a DNN based learner. In this paper we focus only on standalone applications.
Our contributions are summarized in the following points:
- LTR incrementally estimates the unknown function and monotonically reduces the approximation error. This makes it possible to control the exploration of the hidden structure of the function and of the underlying tensor as well.
- The proposed method provides a nonlinear, polynomial regression with computational complexity linear in the order $d$ of the tensor, in its rank $r$, in the sample size $m$, and in the number of variables $n$. The LTR is well suited to learning functions whose variables are interrelated.
- By applying mini-batch based processing, LTR also has constant memory complexity, similarly to DNN learners. Only the parameters are stored in memory during the iterative procedure, and in each step only a fixed-size subset of the data, a mini-batch, is required.
Notation and conventions: In the text, $\otimes$ is used to denote the tensor product of vectors; $\langle\cdot,\cdot\rangle$ and $\|\cdot\|$ stand for the inner product and the induced norm in a Hilbert space $\mathcal{H}$. The notation $\langle\cdot,\cdot\rangle$ is also applied for the Frobenius inner product of tensors, and $\circ$ denotes the pointwise product of tensors of the same shape, of any order. The pointwise power of a vector or a matrix is written as $\mathbf{A}^{\circ k}$. The trace of a square matrix $\mathbf{A}$ is denoted by $\operatorname{tr}(\mathbf{A})$, and $\operatorname{vec}(\mathbf{A})$ stands for the column-wise vectorization of the matrix $\mathbf{A}$. $\mathbf{1}_{n}$ denotes a vector of dimension $n$ with all components equal to $1$. The word polynomial can also mean polynomial function, depending on the context.
2.1 Data representation
In the learning problem we have a sample of $m$ examples given by input-output pairs $\{(\mathbf{x}_{i},\mathbf{y}_{i})\}_{i=1}^{m}$ taken from an unknown joint distribution of the input and output sources. The rows of the matrix $\mathbf{X}$ contain the vectors $\mathbf{x}_{i}$, and similarly the rows of $\mathbf{Y}$ hold the output vectors $\mathbf{y}_{i}$, for all $i=1,\dots,m$. In the first part of the paper we deal with the scalar-valued case, and in Section 3.3 the extension to the vector-valued case is presented.
2.2 Polynomials and Tensors
Polynomials are fundamental tools of computational algebra. Let $\{x_{1},\dots,x_{n}\}$ be a set of variables which can take values from a field $\mathbb{F}$, in our case from $\mathbb{R}$. A monomial is a product $x_{1}^{a_{1}}\cdots x_{n}^{a_{n}}$, where the powers $a_{1},\dots,a_{n}$ are non-negative integers. With the help of the $n$-tuples $\mathbf{a}=(a_{1},\dots,a_{n})$ and $\mathbf{x}=(x_{1},\dots,x_{n})$, a monomial can be written as $\mathbf{x}^{\mathbf{a}}$; thus $\mathbf{a}$ can be applied as a fingerprint of the monomial. Let $A$ be a finite set of tuples of this type; then a polynomial is a finite linear combination of monomials defined on the same set of variables, $p(\mathbf{x})=\sum_{\mathbf{a}\in A} c_{\mathbf{a}}\,\mathbf{x}^{\mathbf{a}}$, where $c_{\mathbf{a}}\in\mathbb{F}$. The degree of a monomial is the sum of the powers of its variables, $\sum_{i=1}^{n} a_{i}$. The degree of a polynomial is the maximum degree of the monomials it contains. A polynomial is homogeneous if all of its monomials have the same degree.
A non-homogeneous polynomial of degree $d$ in $n$ variables can be derived from a homogeneous one of degree $d$ in $n+1$ variables by substituting a real number, e.g. $1$, into one of the variables; thus $p(x_{1},\dots,x_{n},x_{n+1})$ is transformed into $p(x_{1},\dots,x_{n},1)$. Based on this fact, in the sequel we assume that the polynomials are homogeneous.
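The homogenization step can be checked on a toy example. In the sketch below (the concrete polynomial is our own illustration), the non-homogeneous $q(x)=2x^{2}+3x+5$ is recovered from its degree-2 homogenization $p(x_{1},x_{2})=2x_{1}^{2}+3x_{1}x_{2}+5x_{2}^{2}$ by substituting $x_{2}=1$:

```python
import numpy as np

# q(x) = 2x^2 + 3x + 5 is non-homogeneous of degree 2.
q = lambda x: 2 * x**2 + 3 * x + 5
# Its homogenization p(x1, x2) = 2x1^2 + 3x1x2 + 5x2^2 is homogeneous.
p = lambda x1, x2: 2 * x1**2 + 3 * x1 * x2 + 5 * x2**2

xs = np.linspace(-2, 2, 9)
assert np.allclose(p(xs, 1.0), q(xs))               # substituting x2 = 1 recovers q
assert np.isclose(p(3.0, 6.0), 9 * p(1.0, 2.0))     # degree-2 homogeneity: p(cx) = c^2 p(x)
```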
Let $V_{1},\dots,V_{d}$ be a set of finite-dimensional real vector spaces, and for each $t$ let $V_{t}^{*}$ denote the dual of $V_{t}$, the space of linear functionals acting on $V_{t}$. A tensor $\mathcal{T}$, as a multilinear form of order $d$, can be defined as an element of $V_{1}^{*}\otimes\cdots\otimes V_{d}^{*}$. If a basis is given in each $V_{t}$, then $\mathcal{T}$ can be represented by a $d$-way array. In the sequel, we may also refer to the $d$-way arrays as tensors. The tensors form a vector space of dimension $\prod_{t=1}^{d}\dim(V_{t})$.
We can write the multilinear function $T:V_{1}\times\cdots\times V_{d}\to\mathbb{R}$ as
$$T(\mathbf{v}_{1},\dots,\mathbf{v}_{d})=\langle \mathcal{T},\,\mathbf{v}_{1}\otimes\cdots\otimes\mathbf{v}_{d}\rangle.$$
If it is not stated otherwise, in this paper we generally assume that the vector spaces are the same, $V_{1}=\cdots=V_{d}=V$; thus the indexes of the vector spaces can be dropped.
A tensor $\mathcal{T}$ is symmetric if, for any permutation $\pi$ of the indexes, the identity $\mathcal{T}_{i_{1},\dots,i_{d}}=\mathcal{T}_{i_{\pi(1)},\dots,i_{\pi(d)}}$ holds. In some cases this symmetry is called super-symmetry.
The space of symmetric tensors of order $d$ defined on the vector space $V$ is isomorphic to the space of homogeneous polynomials of degree $d$ defined on $V$. Therefore, any homogeneous polynomial with $n$ variables and degree $d$, as a multilinear form defined over the field of real numbers, can be represented with the help of a symmetric tensor.
2.3 Representation of multilinear functions
where we exploit the well-known identity connecting the inner product and the tensor product, $\langle \mathbf{p}_{1}\otimes\cdots\otimes\mathbf{p}_{d},\,\mathbf{x}_{1}\otimes\cdots\otimes\mathbf{x}_{d}\rangle=\prod_{t=1}^{d}\langle\mathbf{p}_{t},\mathbf{x}_{t}\rangle$. This form consists only of products of scalar factors, where each scalar is the value of a linear functional acting on the space $V$. This transformation eliminates the potential difficulties which could arise in working directly with full tensors. Observe that the function is linear in each of the vector-valued parameters $\mathbf{p}_{t}$.
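The inner-product identity above can be verified numerically. The short NumPy check below (variable names are ours) builds the two rank-one tensors explicitly with `einsum` and compares their Frobenius inner product with the product of the factor-wise inner products:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 3
ps = [rng.normal(size=n) for _ in range(d)]
xs = [rng.normal(size=n) for _ in range(d)]

# full rank-one tensors p1 (x) p2 (x) p3 and x1 (x) x2 (x) x3
P = np.einsum('i,j,k->ijk', *ps)
X = np.einsum('i,j,k->ijk', *xs)

# Frobenius inner product of the tensors vs. product of the inner products
lhs = np.sum(P * X)
rhs = np.prod([p @ x for p, x in zip(ps, xs)])
assert np.isclose(lhs, rhs)
```

The right-hand side needs only $O(dn)$ operations, while the left-hand side materializes $n^{d}$ entries, which is exactly the saving the text refers to.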
2.4 Polynomial regression, the generic form
We start from a generic form of multivariate polynomial regression described by a tensor. Let the degree of the polynomial be equal to $d$; then, with the help of (3), we can write
Here the polynomial is assumed to be homogeneous; for the inhomogeneous case see Section 2.2. Representing functions by polynomials is an attractive approach supported by Stone-Weierstrass type theorems, which connect continuous functions to polynomials. A general form of those theorems is given here for real, separable Hilbert spaces.
Let $\mathcal{H}$ be a real, separable Hilbert space. The family of continuous polynomials on $\mathcal{H}$, restricted to a compact set $K\subset\mathcal{H}$, is dense in the set of continuous functions mapping $\mathcal{H}$ into $\mathbb{R}$, restricted to $K$, where the set of continuous functions carries the uniform norm topology.
Informally, this theorem states that for any continuous function there is a polynomial arbitrarily close to it on a compact set.
Since the tensor representing a multivariate polynomial defined over the field of real numbers is symmetric, the number of parameters in the polynomial is equal to $\binom{n+d-1}{d}$ in the homogeneous case. This number grows exponentially in the degree and in the number of variables. To find a sound polynomial approximation for large-scale problems, we need to reduce the dimension of the parameter space in a way which preserves the approximation flexibility but keeps the computational complexity linear in the number of variables and in the degree.
2.5 Factorization machines
Factorization machines (FM) apply a special class of polynomials, the symmetric polynomials, to estimate nonlinear functions. FMs were extended to the higher-order case (HOFM). FMs are also closely related to polynomial networks, as highlighted by Blondel et al.
The model of the factorization machine can be described via a compact, recursive form, in which a latent vector $\mathbf{v}_{i}\in\mathbb{R}^{k}$ is assigned to each variable $x_{i}$, $i=1,\dots,n$. This model can be derived from the fundamental theorem of symmetric polynomials, which states that every symmetric polynomial can be expressed as a polynomial of the power sums, provided that the base field of the polynomials contains the rational numbers. In the case of the FM, the power sums are formed from the products $v_{ij}x_{i}$ of the latent components and the variables. A function (polynomial) is symmetric if its value is invariant under any permutation of its variables, e.g. $f(x_{1},x_{2})=f(x_{2},x_{1})$. Symmetric polynomials can approximate symmetric functions well. In the non-symmetric cases the approximation capability of the factorization machine is limited; see the examples in Table 3.
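The symmetric, power-sum structure can be made concrete for the second-order FM term. The check below (variable names are ours) verifies Rendle's identity, which rewrites the pairwise interaction sum as a difference of power sums and reduces its cost from $O(n^{2}k)$ to $O(nk)$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 6, 3
V = rng.normal(size=(n, k))   # latent vector v_i for each input variable
x = rng.normal(size=n)

# naive second-order FM term: sum_{i<j} <v_i, v_j> x_i x_j, O(n^2 k) time
naive = sum((V[i] @ V[j]) * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# power-sum form of the same term (Rendle's identity), O(n k) time
fast = 0.5 * np.sum((V.T @ x) ** 2 - (V ** 2).T @ x ** 2)

assert np.isclose(naive, fast)
```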
2.6 Kernel Ridge regression
Kernel ridge regression (KRR) is a popular alternative for solving the polynomial regression problem on small to medium-sized data. It can be stated in the matrix form
$$\min_{\boldsymbol{\alpha}\in\mathbb{R}^{m}}\;\|\mathbf{y}-\mathbf{K}\boldsymbol{\alpha}\|^{2}+\lambda\,\boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha},$$
where $\mathbf{K}$ is the kernel matrix of the sample.
For polynomial regression, in the KRR problem the kernel is chosen as the polynomial kernel, $\kappa(\mathbf{x},\mathbf{z})=(\langle\mathbf{x},\mathbf{z}\rangle+b)^{d}$, where $b$ is the bias term. For further details of kernel-based learning, see for example the standard references. The parameter space of KRR has dimension equal to the size of the sample, independently of the number of variables. The latter property is an advantage when the number of variables is large. However, solving this type of problem with dense kernels requires $O(m^{3})$ time and $O(m^{2})$ space, which is a bottleneck in processing large data sets, although several approaches have been proposed to reduce those complexities, e.g. the Nyström approximation and random Fourier features.
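A minimal dual-form KRR sketch with a polynomial kernel makes the bottleneck visible: the dense $m\times m$ kernel matrix and the linear solve dominate the cost. The function names and hyperparameter values below are our own illustration:

```python
import numpy as np

def krr_poly_fit(X, y, degree=2, bias=1.0, reg=1e-6):
    """Dual-form kernel ridge regression with a polynomial kernel (a sketch).

    Building K costs O(m^2 n); solving the m x m system costs O(m^3) time
    and O(m^2) memory, the bottleneck discussed in the text.
    """
    K = (X @ X.T + bias) ** degree                     # dense m x m kernel matrix
    return np.linalg.solve(K + reg * np.eye(len(y)), y)

def krr_poly_predict(X_train, alpha, X_new, degree=2, bias=1.0):
    return ((X_new @ X_train.T + bias) ** degree) @ alpha

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = X[:, 0] * X[:, 1] - X[:, 2] ** 2                   # a quadratic target
alpha = krr_poly_fit(X, y)
y_hat = krr_poly_predict(X, alpha, X)
assert np.corrcoef(y, y_hat)[0, 1] > 0.99
```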
3 Latent tensor reconstruction
We first construct the basic problem, where the target function is assumed to be scalar valued. The idea of the learning task is outlined in Table 1. The corresponding optimization problem can be formulated similarly to kernel ridge regression by assuming least-squares loss and Tikhonov-type regularization of the parameters.
It is important to mention here that, without the regularization of the parameters, the tensor rank-one approximation problem might be ill-posed for certain tensors. We can make an important statement about this problem: if all parameter vectors except one are fixed, then the function inside the norm is linear, and consequently convex, in the remaining one. Therefore the objective function is also convex in that variable, since the norm and the summation preserve convexity.
3.1 LTR representation vs. factorization machine
In our polynomial representation, a matrix is assigned to each variable (component) of the input vectors; namely, for a fixed variable index $i$, the components of the linear forms corresponding to that variable, taken across all degrees and ranks, form a matrix. In the factorization machine (FM), only a single vector is assigned to the variable indexed by $i$.
The assignment used by the LTR makes it possible to decouple the factors of the polynomial; thus the formulation becomes more transparent. The linear dependence on the parameters allows us to use a simple gradient-descent algorithm. The increased parameter capacity requires a larger sample size to estimate the parameters properly. However, when the sample is sufficiently large, the LTR performs significantly better than the FM and KRR; see the examples in Section 5.
3.2 Learning algorithm
The summary of the basic algorithm and the solution to the rank-one subproblem are outlined in Table 2. The basic rank-wise algorithm follows the scheme of a singular value decomposition by adding the best-fitting rank-one layers to the tensor under construction. In the rank-one subproblem, the mini-batch-wise processing of the data can be interpreted as a piecewise polynomial approximation, and also as a certain variant of the spline methods, where the splines are connected via momentum-based [22, 23] gradient updates, e.g. ADAM.
The selection of the mini-batches could follow an online processing scheme; or, if the entire data set is available, they can be chosen uniformly at random out of the full set, imitating a stochastic gradient scheme.
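The rank-one subproblem can be sketched as follows. This is an illustrative implementation under our own assumptions, not the paper's algorithm verbatim: plain mini-batch SGD stands in for the momentum/ADAM updates, the scalar factor $\lambda$ is absorbed into the linear forms, and regularization is omitted for brevity:

```python
import numpy as np

def fit_rank_one(X, y, d=2, lr=0.02, epochs=300, batch=64, seed=0):
    """Fit f(x) = prod_t <p_t, x> to (X, y) by mini-batch gradient descent
    on the squared loss (lambda is absorbed into the vectors p_t)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    P = rng.normal(scale=0.5, size=(d, n))    # one linear form per factor
    for _ in range(epochs):
        idx = rng.permutation(m)
        for s in range(0, m, batch):
            B = idx[s:s + batch]
            F = X[B] @ P.T                    # factor values, shape (|B|, d)
            r = np.prod(F, axis=1) - y[B]     # batch residuals
            for t in range(d):
                # the model is linear in p_t when the other factors are fixed
                others = np.prod(np.delete(F, t, axis=1), axis=1)
                P[t] -= lr * (r * others) @ X[B] / len(B)
    return P

rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 3))
y = (X @ np.array([1.0, 2.0, 0.0])) * (X @ np.array([0.0, 1.0, -1.0]))
P = fit_rank_one(X, y)
pred = np.prod(X @ P.T, axis=1)
assert np.mean((pred - y) ** 2) < 0.9 * np.mean(y ** 2)
```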
3.3 Vector-valued output
The outputs of the learning problem might be given as vectors $\mathbf{y}_{i}$. Let the matrix $\mathbf{Y}$ contain the vectors $\mathbf{y}_{i}$ in its rows, for all $i=1,\dots,m$. We can extend the basic scalar-valued regression (9) into a vector-valued one, where the boxes highlight the changes applied to the scalar-valued case.
The terms in the loss function of the problem can be rewritten in a tensor form and in a pointwise product form as well.
In the vector-valued formulation, the parameter vectors play the role of certain nonlinear principal components of the output vectors. The coordinates of the output vectors with respect to those principal components are expressed by the polynomials defined on the input vectors.
3.4 Matrix representations
Let the following matrices be formed from the data and the parameters:
We can write Problem (10) in a compact matrix form.
3.5 Complexity of the algorithm
In the matrix representation of the optimization problem (12), the predictor function contains only matrix products and pointwise products of the corresponding matrices; thus the original tensor decomposition form completely disappears from the computation. The transformation flow sending the input matrix into the output one is summarized in Figure 1.
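The tensor-free, matrix-form prediction can be sketched directly (array names and shapes are our own illustration): the Hadamard product of the factor matrices $\mathbf{X}\mathbf{P}_{t}^{\top}$ followed by a weighted sum over the ranks reproduces the per-example rank-wise evaluation exactly:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, r, d = 50, 6, 4, 3
X = rng.normal(size=(m, n))
P = rng.normal(size=(d, r, n))      # P[t] holds the degree-t factor vectors
lam = rng.normal(size=r)

# matrix form: Hadamard product of the d factor matrices X P_t^T, then a
# weighted sum over the ranks -- no explicit tensor is ever formed
F = np.ones((m, r))
for t in range(d):
    F *= X @ P[t].T                 # O(m n r) per factor, O(d m n r) in total
y_hat = F @ lam                     # O(m r)

# reference: per-example evaluation of sum_r lam_r prod_t <p_{t,r}, x_i>
ref = np.array([lam @ np.prod(P @ x, axis=0) for x in X])
assert np.allclose(y_hat, ref)
```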
From that transformation flow and from the matrix expressions of the gradient, the time complexity of one iteration can be computed, and we can state:
At a fixed number of epochs, the time complexity of the gradient-based algorithm is $O(d\,m\,n\,r)$.
Note that the dominating part in (13) is the computation of the matrix products. Computing the factor matrices $\mathbf{X}\mathbf{P}_{t}^{\top}$, $t=1,\dots,d$, requires $O(d\,m\,n\,r)$ operations, their pointwise products have complexity $O(d\,m\,r)$, and finally the product with the vector $\boldsymbol{\lambda}$ has complexity $O(m\,r)$. Since these products can be evaluated sequentially, the complexity of one step is $O(d\,m\,n\,r)$. If the number of epochs is fixed, then this also gives the overall complexity of the entire algorithm. ∎
3.6 Multi-view case
The basic case can be extended to deal with multiple input sources, where the input matrix is replaced with a set of matrices with potentially different numbers of columns. Clearly, the optimization problem (12) can be solved in the same way; thus this extension is a very natural one. To implement the multi-view case, only the loss term of the objective function needs to be changed, so that each degree-wise factor is computed on its corresponding input view.
3.7 Classification problems
In a classification problem, binary or multi-class, the functions connecting the input and the output are generally discontinuous. To learn discontinuous functions, the polynomial function can be embedded into a differentiable activation function, e.g. the sigmoid function. By the chain rule, this only adds a scalar factor to the gradients computed for the polynomials. This type of activation function can be applied component-wise in the vector-valued prediction as well.
In this way, the logistic regression model is extended so that the original linear form is replaced with a multilinear one. The maximum-likelihood problem corresponding to the extended logistic regression can then be solved straightforwardly by a gradient-descent approach similar to that described for the least-squares regression case.
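The chain-rule claim can be checked numerically. In the sketch below (function names and shapes are our own, not from the paper), the gradient of the logistic loss with respect to one factor vector is the polynomial's gradient multiplied by the scalar $\sigma(f)-y$, and it is verified against central finite differences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def poly(x, P, lam):
    """Polynomial score of one example, sum_r lam_r prod_t <p_{t,r}, x>."""
    return lam @ np.prod(P @ x, axis=0)

def grad_p(x, y, P, lam, t, r):
    """Gradient of the logistic loss w.r.t. the factor vector p_{t,r}:
    by the chain rule it is the polynomial gradient times (sigma(f) - y)."""
    f = poly(x, P, lam)
    others = np.prod(np.delete(P[:, r, :], t, axis=0) @ x)
    return (sigmoid(f) - y) * lam[r] * others * x

rng = np.random.default_rng(6)
d, R, n = 2, 3, 4
P, lam = rng.normal(size=(d, R, n)), rng.normal(size=R)
x, y = rng.normal(size=n), 1.0

t, r, eps = 0, 1, 1e-6
g = grad_p(x, y, P, lam, t, r)
loss = lambda Q: -(y * np.log(sigmoid(poly(x, Q, lam)))
                   + (1 - y) * np.log(1 - sigmoid(poly(x, Q, lam))))
for i in range(n):
    Pp = P.copy(); Pp[t, r, i] += eps
    Pm = P.copy(); Pm[t, r, i] -= eps
    num = (loss(Pp) - loss(Pm)) / (2 * eps)     # central finite difference
    assert np.isclose(g[i], num, atol=1e-4)
```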
4 Multi-layered model
A multi-layered version of the LTR can handle the trade-off between the complexity of the polynomial function and the approximation error: the larger the rank, the smaller the loss, but the generalization performance might deteriorate.
The method presented here can be interpreted as a variant of the general gradient boosting framework. Each problem related to a range of ranks might be taken as a weak learner acting on the residual remaining after the aggregation of the approximations computed by the previous learners.
In this extension, the rank range is cut into blocks. Let $R_{1},\dots,R_{L}$ be a partition of the rank range $\{1,\dots,R\}$ into disjoint index intervals. With these notations, the multi-layered algorithm can be constructed as follows.
Step 1: Let $\ell=1$, and let the residual be $\mathbf{y}^{(1)}=\mathbf{y}$.
Step 2: Initialize the parameters $\lambda_{r}$ and $\mathbf{p}_{1r},\dots,\mathbf{p}_{dr}$ for all $r\in R_{\ell}$.
Step 3: Form the subproblem of the rank block $R_{\ell}$ on the current residual $\mathbf{y}^{(\ell)}$.
Step 4: Solve the optimization problem of the block, yielding $f_{\ell}$, and update the residual, $\mathbf{y}^{(\ell+1)}=\mathbf{y}^{(\ell)}-f_{\ell}(\mathbf{X})$.
Step 5: Let $\ell=\ell+1$; if $\ell\le L$, then go to Step 2.
Step 6: Provide the function $f=\sum_{\ell=1}^{L} f_{\ell}$.
The optimization problem in Step 4 can also be solved on a sequence of mini-batches to keep the input source related memory complexity low.
We can test whether adding a new layer of ranks to those accumulated earlier is a reliable step. Assuming that the output is scalar valued, the reliability can be measured by the correlation ratio (intraclass correlation) $\eta$. Adding a new layer to increase the ranks might matter if the change in the output values correlates strongly with the sum of those computed by the previous layers. Let $\hat{y}_{\ell i}$ be the predicted output at sample example $i$ in layer $\ell$, and let $m$ be the number of examples. Then the correlation ratio is given by
$$\eta^{2}=\frac{m\sum_{\ell=1}^{L}(\bar{y}_{\ell}-\bar{y})^{2}}{\sum_{\ell=1}^{L}\sum_{i=1}^{m}(\hat{y}_{\ell i}-\bar{y})^{2}},$$
where $\bar{y}_{\ell}$ is the layer-wise mean and $\bar{y}$ is the total mean computed on all layers.
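The correlation ratio can be computed in a few lines. The following sketch (function name and the synthetic layer data are our own illustration) treats each layer's predictions as one group and returns $\eta$, the square root of the between-layer to total variance ratio:

```python
import numpy as np

def correlation_ratio(Y):
    """Correlation ratio eta of layer-wise predictions.

    Y has shape (L, m): row ell holds the predicted outputs of layer ell.
    eta^2 is the between-layer variance of the means divided by the
    total variance over all layers and examples.
    """
    layer_means = Y.mean(axis=1)            # per-layer means
    total_mean = Y.mean()                   # grand mean over all layers
    m = Y.shape[1]
    between = m * np.sum((layer_means - total_mean) ** 2)
    total = np.sum((Y - total_mean) ** 2)
    return np.sqrt(between / total)

rng = np.random.default_rng(7)
# three layers whose outputs differ mainly by shifted means plus small noise
Y = np.stack([mu + 0.1 * rng.normal(size=100) for mu in (0.0, 1.0, 2.0)])
eta = correlation_ratio(Y)
assert 0.0 <= eta <= 1.0
assert eta > 0.9          # the mean shifts dominate the within-layer noise
```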
The experiments contain two main parts: in the first one, tests based on randomly generated data sets are presented, and in the second part, results of experiments on real data sets are reported.
The first collection of tests is generated on simple, two-variable polynomials, see Table 3, to demonstrate the basic difference between the LTR and the FM. While the LTR fits unrestricted polynomials, the FM applies symmetric ones, a condition that limits the potential performance if the polynomial is arbitrary, e.g. anti-symmetric.
Learning simple quadratic functions where the input is generated independently from the standard normal distribution; number of folds = 5, m = 1000, degree = 2, rank = 2, epochs = 10.
To test the general performance of the proposed learning method, samples of randomly generated polynomials are used. The generation procedure is based on (5), where the components of the data vectors, the parameter vectors, and the rank-wise scalar factors are chosen independently from the standard normal distribution. This sample can contain symmetric as well as non-symmetric polynomials. In the experiments, 2-fold cross-validation is used to keep the test size high relative to that of the training set; thus the distances from the test examples to the closest training ones can be sufficiently large. The results, the mean values and standard errors, are computed over all random examples and, for each random sample, over all folds.
In all tests, only one parameter runs through a range of possible values while all the other ones are fixed. The data-related parameters which can influence the performance are: the degree and the rank of the generated polynomials, the number of examples, and the number of variables. The parameters of the learner are: the degree ($d$) and the rank ($r$) of the tensor to be reconstructed, and the number of epochs. The other learner parameters, the learning speed, the size of the mini-batches, and the regularization constants, are fixed in all experiments. An additional parameter, the noise level, is also used. The noise is generated from a Gaussian distribution with zero mean and with standard deviation chosen as the standard deviation of the values of the random polynomials multiplied by the noise parameter. All parameters employed in the experiments are reported in the corresponding figures.
The accuracy is measured by the Pearson correlation between the predicted and the real values.
The methods compared are kernel ridge regression with polynomial kernels (KRR), the higher-order factorization machine implemented on TensorFlow (TFFM), linear regression (LR), and the proposed latent tensor reconstruction method (LTR). In all experiments, every method applies the same polynomial parameters, degree and rank. In the case of the TFFM, the mini-batch size is chosen to be the same as that used by the LTR. The methods are implemented in Python with the help of the NumPy package, except for the FM, whose model also uses TensorFlow. The tests were run on an HP Z230 workstation equipped with 8 CPU cores and 32 GB of memory. GPUs were not exploited in the experiments.
The second part of the experiments is built on real data sets. We processed three data sets, Corel5k, Espgame and Iaprtc12, which are available on the authors' web site (http://lear.inrialpes.fr/people/guillaumin/data.php). The published results on these data sets and the corresponding references are included in Table 5. These data sets consist of annotated images and 15 preprocessed image features. The annotations are vectors of labels identifying the objects appearing in the images. The image features, e.g. SIFT, are provided by the authors of the data sets. The high sparsity of the label vectors and the strong interrelations between the feature variables make the prediction problem rather challenging. This test is based on the vector-output model of the LTR presented in Section 3.3.
(The corresponding tables report, for each data set, the number of instances, the number of non-zero labels, and the label accuracy in F1 (%).)
5.1 Summary of the experiments
Figures 3, 4 and 5 present the better accuracy provided by the LTR compared to the FM at varying degrees, noise levels, and numbers of variables on large data sets. These experiments show that the LTR has sufficient capacity to learn highly complex functions on large data sets within a relatively low number of epochs. On small sets (Figure 6), the large parameter space of both the LTR and the TFFM leads to inferior performance compared to KRR, especially when the number of variables is large.
Figure 7 shows that the running time of KRR is several times larger than that of the LTR or the TFFM. The large-scale example (10M examples) in Figure 8 demonstrates the significant advantage of the LTR method, whose running time is linear in the degree of the polynomials. That property is the most critical one in capturing high-level interdependence between the variables.
In this paper, a latent tensor reconstruction based regression model is proposed for learning highly nonlinear target functions from large-scale data sets. We presented an efficient mini-batch based algorithm that has constant memory complexity and running time linear in all meaningful parameters (degree, rank, sample size, number of variables). These properties guarantee a high level of scalability on a broad range of complex problems. Our experiments demonstrate high accuracy in learning high-order polynomials from noisy, large-scale data, as well as competitive accuracy on several real-world data sets.
This study was supported by the Academy of Finland [Grant Nos. 311273, 313266, 310107, 313268 and 334790].
-  Walter Rudin. Principles of mathematical analysis. McGraw-Hill Book Co., New York, third edition, 1976. International Series in Pure and Applied Mathematics.
-  Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM REVIEW, 51(3):455–500, 2009.
-  L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4):1253–1278, 2000.
-  G. H. Golub and C. F. V. Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, MD, 4th edition, 2013.
-  Erich Kaltofen and Barry M. Trager. Computing with polynomials given by black boxes for their evaluations: Greatest common divisors, factorization, separation of numerators and denominators. J. Symb. Comput., 9(3):301–320, March 1990.
-  David Cox, John Little, and Donal O'Shea. Ideals, Varieties, and Algorithms. An Introduction to Computational Algebraic Geometry and Commutative Algebra. Springer, New York, NY, 2nd edition, 1996.
-  Steffen Rendle. Factorization machines. In Proceedings of the 2010 IEEE International Conference on Data Mining, ICDM ’10, pages 995–1000. IEEE Computer Society, 2010.
-  Mathieu Blondel, Akinori Fujino, Naonori Ueda, and Masakazu Ishihata. Higher-order factorization machines. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pages 3359–3367, USA, 2016. Curran Associates Inc.
-  Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS’14, pages 855–863, Cambridge, MA, USA, 2014. MIT Press.
-  Mathieu Blondel, Masakazu Ishihata, Akinori Fujino, and Naonori Ueda. Polynomial networks and factorization machines: New insights and efficient training algorithms. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pages 850–858. JMLR.org, 2016.
-  Elizabeth S. Allman, Catherine Matias, and John A. Rhodes. Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics, 37(6A), 2009.
-  Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773–2832, 2014.
-  Furong Huang, U. N. Niranjan, Mohammad Umar Hakeem, and Animashree Anandkumar. Online tensor methods for learning latent variable models. Journal of Machine Learning Research, 16:2797–2835, 2015.
-  Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. The MIT Press, 2016.
-  R. Bellman. Dynamic Programming. Princeton University Press, 1957. Dover paperback edition (2003).
-  Vin de Silva and Lek-Heng Lim. Tensor rank and the ill-posedness of the best low-rank approximation problem. SIAM J. Matrix Anal. Appl., 30(3):1084–1127, September 2008.
-  P.M. Prenter. A Weierstrass theorem for real, separable Hilbert spaces. Journal of Approximation Theory, 3:341–351, 1970.
-  Steffen Rendle. Factorization machines with libfm. ACM Trans. Intell. Syst. Technol., 3(3):57:1–57:22, May 2012.
-  J. Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
-  Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1177–1184. Curran Associates, Inc., 2008.
-  Jerome H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1–67, 1991.
-  Boris Teodorovich Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
-  Yu Nesterov. Smooth minimization of non-smooth functions. Mathematical programming, 103(1):127–152, 2005.
-  Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2014.
-  Jerome H. Friedman. Stochastic gradient boosting. Computational Statistics and Data Analysis, 38(4):367–378, 2002.
-  A. Rényi. Probability Theory. Dover, 2007.
-  Mikhail Trofimov and Alexander Novikov. tffm: TensorFlow implementation of an arbitrary order factorization machine. https://github.com/geffy/tffm, 2016.
-  Matthieu Guillaumin, Jakob Verbeek, and Cordelia Schmid. Multiple instance metric learning from automatically labeled bags of faces. In European Conference on Computer Vision, pages 634–647. Springer, 2010.
-  S. L. Feng, R. Manmatha, and V. Lavrenko. Multiple Bernoulli relevance models for image and video annotation. In CVPR, 2004.
-  Matthieu Guillaumin, Thomas Mensink, Jakob Verbeek, and Cordelia Schmid. Tagprop: Discriminative metric learning in nearest neighbor models for image auto-annotation. In 2009 IEEE 12th international conference on computer vision, pages 309–316. IEEE, 2009.
-  Ameesh Makadia, Vladimir Pavlovic, and Sanjiv Kumar. Baselines for image annotation. International Journal of Computer Vision, 90:88–105, 2010.
-  Minmin Chen, Alice Zheng, and Kilian Q. Weinberger. Fast image tagging. In ICML, 2013.