1 Introduction
The problem of identifying a nonlinear function from input-output examples is of paramount importance in machine learning, dynamical system identification and control, communications, and many other disciplines. Technological improvements and the availability of vast amounts of data have led to the development of state-of-the-art prediction models with unprecedented success in domains such as image classification, speech recognition, and language processing. Kernel methods, random forests and, more recently, neural networks and deep learning are powerful classes of machine learning models that can learn highly nonlinear functions and have been successfully applied in many supervised learning tasks
(Hastie et al., 2001). Each of the aforementioned methods can be well suited to a particular problem but may perform poorly on another, and it is seldom known in advance which method will perform best for a given problem.

This paper presents a simple and elegant alternative for nonlinear system identification based on low-rank tensor completion. The proposed approach requires little parameter tuning, can model complex functions, can handle data of mixed types with potentially missing predictors, and can also be used in multi-output regression problems. Tensor decomposition is a powerful tool for analyzing multiway data and has had major successes in applications spanning machine learning, statistics, signal processing and data mining (Sidiropoulos et al., 2017). The Canonical Polyadic Decomposition (CPD) model is one of the most popular tensor models, mainly due to its simplicity and its uniqueness properties. The CPD model has been applied in various machine learning applications, including recommender systems to model time-evolving relational data (Xiong et al., 2010), community detection and clustering to model user interactions across different networks (Papalexakis et al., 2013), knowledge base completion and link prediction for discovering unobserved subject-object interactions (Lacroix et al., 2018), and latent variable models for parameter identification (Anandkumar et al., 2014). These works deal with relatively low-order tensors; however, high-dimensional tensors also arise in practical scenarios: e.g., a joint probability mass function can be naturally regarded as a high-order probability tensor and modeled using a CPD model (Kargas et al., 2018).

In this work, we show that the CPD model offers an appealing solution for modeling and learning a general nonlinear system using a single high-order tensor. Note that tensors have been used to model low-order multivariate polynomial systems: a multivariate polynomial of order $d$ is represented by a tensor of order $d$; e.g., a second-order polynomial is represented by a quadratic form involving a single matrix (Rendle, 2010; Sidiropoulos et al., 2017). However, such an approach requires prior knowledge of the polynomial order, and assuming that one deals with a polynomial of a given degree can be highly restrictive in practice.
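To make the polynomial-as-tensor remark concrete, the following is a minimal numpy sketch (ours, not from the paper) showing that a second-order polynomial is fully described by an order-2 tensor, i.e., a matrix, via a quadratic form:

```python
import numpy as np

# A second-order polynomial f(x) = x^T Q x is fully described by an
# order-2 tensor (a matrix) Q; an order-d polynomial needs an order-d tensor.
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 3))
x = rng.standard_normal(3)

f_poly = x @ Q @ x                          # quadratic form
f_tensor = np.einsum('i,ij,j->', x, Q, x)   # same value via tensor contraction

assert np.isclose(f_poly, f_tensor)
```

This is exactly why fixing the polynomial order in advance is restrictive: the representation is tied to a single, fixed tensor order.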
Instead, what we advocate here is a simple and general approach: a nonlinear system having $N$ discrete inputs and a single output can be naturally represented as an $N$-way tensor $\underline{\mathbf{X}}$, where the $N$-tuple of input variables $\mathbf{x} = (i_1, \ldots, i_N)$ can be viewed as a cell multi-index and the cell content is the response of the system $y = f(\mathbf{x})$. Given a new data point, the corresponding tensor cell is queried and its content is used as the prediction. Note that in general only a small fraction of the tensor entries are observed during training, and we are ultimately interested in answering queries for unobserved data points (Figure 1). This motivates the use of low-rank tensor models as a tool for capturing interactions between the predictors and imputing the missing data. Both experimental and theoretical studies have shown that exact tensor completion from limited samples is possible
(Krishnamurthy and Singh, 2013; Jain and Oh, 2014). The implication of this simple but profound modeling idea is compelling: every nonlinear system with discrete inputs can be modeled in this way, because every tensor admits a CPD of bounded rank. If the rank is low enough, provably correct nonlinear system identification is possible from limited samples. Of course, in practice, tensors corresponding to real-world systems may not be low-rank; the key is that they can often be approximated by a low-rank tensor, just as every matrix can be approximated by a low-rank matrix. In this sense, we are aiming for the "principal components" of a nonlinear operator, in a sense that will be made precise in the sequel.

Even though tensor recovery can be guaranteed under a low-rank assumption, tensor decomposition can often benefit from additional knowledge about the application, incorporated through constraints such as nonnegativity, sparsity or smoothness. In our present context, smoothness is a desirable property for applications where we expect that small perturbations in the input will most probably cause small changes in the output of the system. Tensor decomposition with smooth latent factors has mainly been applied in the area of image processing. Specifically, CPD and Tucker models with smoothness constraints or regularization have been used for the recovery of incomplete image data (Yokota et al., 2016; Imaizumi and Hayashi, 2017). Smoothness regularization has also been used to capture variations across particular modes of a tensor in fluorescence data analysis (Fu et al., 2015).
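The tensor-as-lookup viewpoint can be illustrated with a small numpy sketch (an illustration of ours; the sizes, rank, and factor names are arbitrary): a function of $N$ discrete inputs is exactly an $N$-way tensor, and querying the system at a new input is a single cell lookup.

```python
import numpy as np

# Toy illustration: a function of N = 3 discrete inputs, each drawn from an
# alphabet of size I = 4, is exactly an order-3 tensor of shape (4, 4, 4).
I, N, F = 4, 3, 2
rng = np.random.default_rng(1)

# Build a rank-2 ground-truth tensor from random factors (CPD form).
factors = [rng.random((I, F)) for _ in range(N)]
X = np.einsum('ir,jr,kr->ijk', *factors)

# "Querying" the system at input (i1, i2, i3) is a single cell lookup,
# which matches the CPD synthesis formula entry by entry.
x_new = (0, 2, 3)
y_pred = X[x_new]
assert np.isclose(
    y_pred,
    sum(np.prod([factors[n][x_new[n], f] for n in range(N)]) for f in range(F)),
)
```

In practice only a few cells of such a tensor are observed, which is precisely where the low-rank completion machinery of the following sections comes in.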
To the best of our knowledge, tensor decomposition (with or without smooth latent factors) has not yet been considered as a tool for general system identification. As an example, consider the task of estimating students' grades in future courses based on their grades in past courses, an important topic in educational data mining, as it can facilitate the creation of personalized degree paths that potentially lead to timely graduation (Elbadrawy et al., 2016). The predictors correspond to the grades a student has received in past courses, and the predicted response is the student's grade in a future course. We are interested in building a model that maps an $N$-dimensional discrete feature vector to the output response. Adding a smoothness constraint or regularization guarantees that the model will produce similar outputs for two students whose past grades differ only slightly, as they are likely to perform similarly in the future.
Contributions: In this work, we propose modeling a general nonlinear system using a single high-order tensor admitting a CPD model. Specifically, we formulate the problem as a smooth tensor decomposition problem with missing data. Although our method is naturally suited to discrete features, it can also be used with continuous-valued features (Kargas and Sidiropoulos, 2019) and enhanced using ensemble techniques. Additionally, leveraging the structure of the CPD model, we propose a simple yet effective approach to handling randomly missing input variables. Finally, we discuss how the approach can be extended to vector-valued function prediction. We propose an easy-to-implement Block Coordinate Descent (BCD) algorithm and demonstrate its performance on UCI machine learning datasets against competitive baselines, as well as on a challenging grade prediction task using real student grade data.
2 Notation and Background
We use the symbols $x$, $\mathbf{x}$, $\mathbf{X}$, $\underline{\mathbf{X}}$ for scalars, vectors, matrices, and tensors respectively. We use the notation $\mathbf{x}(i)$, $\mathbf{X}(:, j)$, $\underline{\mathbf{X}}(:, :, k)$ to refer to a particular element of a vector, a column of a matrix and a slab of a tensor. The symbols $\circ$, $\otimes$, $\ast$, $\odot$ denote the outer, Kronecker, Hadamard and Khatri-Rao (column-wise Kronecker) products respectively.
An $N$-way tensor $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ is a multidimensional array whose entries are indexed by $N$ coordinates. A polyadic decomposition expresses $\underline{\mathbf{X}}$ as a sum of $F$ rank-one components, $\underline{\mathbf{X}} = \sum_{f=1}^{F} \mathbf{a}_1^f \circ \cdots \circ \mathbf{a}_N^f$, where $\circ$ denotes the outer product and $\mathbf{a}_n^f \in \mathbb{R}^{I_n}$. If the number $F$ of rank-one components is minimal, then the decomposition is called the CPD of $\underline{\mathbf{X}}$ and $F$ is called the rank of $\underline{\mathbf{X}}$. By defining factor matrices $\mathbf{A}_n := [\mathbf{a}_n^1, \ldots, \mathbf{a}_n^F] \in \mathbb{R}^{I_n \times F}$, the elements of the tensor can be expressed as
\underline{\mathbf{X}}(i_1, \ldots, i_N) = \sum_{f=1}^{F} \prod_{n=1}^{N} \mathbf{A}_n(i_n, f). \qquad (1)
We adopt the common shorthand $\underline{\mathbf{X}} = [\![\mathbf{A}_1, \ldots, \mathbf{A}_N]\!]$ to denote the tensor synthesized from the CPD model using these factors. The mode-$n$ fibers of a tensor are the vectors obtained by fixing all the indices except for the $n$-th index. We can represent tensor $\underline{\mathbf{X}}$ using a matrix $\mathbf{X}_{(n)}$, called the mode-$n$ matricization, obtained by arranging the mode-$n$ fibers of the tensor as columns of the resulting matrix
\mathbf{X}_{(n)} = \mathbf{A}_n \left(\mathbf{A}_N \odot \cdots \odot \mathbf{A}_{n+1} \odot \mathbf{A}_{n-1} \odot \cdots \odot \mathbf{A}_1\right)^T. \qquad (2)
The mode-$n$ product of a tensor with a matrix $\mathbf{W} \in \mathbb{R}^{J \times I_n}$ is denoted by $\underline{\mathbf{X}} \times_n \mathbf{W}$, and an entry of the resulting tensor is given by $(\underline{\mathbf{X}} \times_n \mathbf{W})(i_1, \ldots, j, \ldots, i_N) = \sum_{i_n=1}^{I_n} \underline{\mathbf{X}}(i_1, \ldots, i_n, \ldots, i_N)\,\mathbf{W}(j, i_n)$. Given that the tensor admits a CPD of rank $F$, the mode-$n$ product can be expressed as $\underline{\mathbf{X}} \times_n \mathbf{W} = [\![\mathbf{A}_1, \ldots, \mathbf{W}\mathbf{A}_n, \ldots, \mathbf{A}_N]\!]$.
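These identities are easy to verify numerically. The following numpy sketch (our illustration; the helper `khatri_rao` and all sizes are ours) synthesizes a CPD tensor per Eq. (1) and checks the matricization and mode-product identities, noting that numpy's C-order reshape corresponds to the reversed Khatri-Rao ordering relative to Eq. (2):

```python
import numpy as np
from functools import reduce

def khatri_rao(mats):
    """Column-wise Kronecker product of a list of matrices."""
    return reduce(lambda A, B: np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1]), mats)

rng = np.random.default_rng(2)
I1, I2, I3, F = 3, 4, 5, 2
A1, A2, A3 = (rng.standard_normal((I, F)) for I in (I1, I2, I3))

# Synthesize X = [[A1, A2, A3]] entrywise, per Eq. (1).
X = np.einsum('ir,jr,kr->ijk', A1, A2, A3)

# Mode-1 matricization: numpy's C-order reshape orders columns with the last
# index fastest, so here X_(1) = A1 (A2 kr A3)^T (the reverse of Eq. (2)'s
# Khatri-Rao ordering, which lists A_N first).
X1 = X.reshape(I1, -1)
assert np.allclose(X1, A1 @ khatri_rao([A2, A3]).T)

# Mode-1 product with a matrix W equals [[W A1, A2, A3]].
W = rng.standard_normal((2, I1))
lhs = np.einsum('li,ijk->ljk', W, X)
rhs = np.einsum('ir,jr,kr->ijk', W @ A1, A2, A3)
assert np.allclose(lhs, rhs)
```

The ordering caveat matters in any implementation: mixing the two Khatri-Rao conventions is a classic source of silent bugs in ALS code.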
3 Proposed Approach
3.1 Canonical System Identification (CSID)
We are given a training dataset of input-output pairs $\{(\mathbf{x}_m, y_m)\}_{m=1}^{M}$. Let us assume that all predictors are discrete and take values from a common alphabet $\{1, \ldots, I\}$. The scalar output is a nonlinear function of the input, distorted by some unknown noise
y_m = f(\mathbf{x}_m) + v_m. \qquad (3)
The nonlinear function $f$ can be modeled as an $N$-way tensor $\underline{\mathbf{X}}$, where each input vector $\mathbf{x} = (i_1, \ldots, i_N)$ can be viewed as a cell multi-index and the cell content $\underline{\mathbf{X}}(i_1, \ldots, i_N)$ is the estimated response of the system. We are interested in building a model that minimizes the Mean Square Error (MSE) between the model predictions and the actual responses
\min_{\underline{\mathbf{X}}} \ \frac{1}{M} \sum_{m=1}^{M} \left(y_m - \underline{\mathbf{X}}(\mathbf{x}_m)\right)^2. \qquad (4)
It is evident that it is impossible to infer the response at unobserved data points without any assumptions on $\underline{\mathbf{X}}$. To alleviate this problem, we aim for the principal components of the nonlinear operator by minimizing the tensor rank. Assuming a low-rank CPD model, the problem of finding the rank-$F$ approximation which best fits our data can be formulated as
\min_{\{\mathbf{A}_n\}_{n=1}^{N}} \ \sum_{m=1}^{M} \left(y_m - \underline{\mathbf{X}}(\mathbf{x}_m)\right)^2 + \lambda \sum_{n=1}^{N} \|\mathbf{A}_n\|_F^2 \qquad (5)
\text{s.t.} \quad \underline{\mathbf{X}} = [\![\mathbf{A}_1, \ldots, \mathbf{A}_N]\!],
where $\lambda$ is a regularization parameter. It is convenient to express the problem in the following equivalent form
\min_{\{\mathbf{A}_n\}_{n=1}^{N}} \ \left\|\sqrt{\underline{\mathbf{W}}} \ast \left(\underline{\mathbf{Y}} - \underline{\mathbf{X}}\right)\right\|_F^2 + \lambda \sum_{n=1}^{N} \|\mathbf{A}_n\|_F^2 \qquad (6)
\text{s.t.} \quad \underline{\mathbf{X}} = [\![\mathbf{A}_1, \ldots, \mathbf{A}_N]\!],
where $\underline{\mathbf{W}}$ is a tensor containing the number of times a particular data point appears in the dataset and $\underline{\mathbf{Y}}$ is a tensor containing the mean response of the corresponding data points. The equivalence between Problems (5) and (6) is straightforward.
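The count tensor and the mean-response tensor are simple to accumulate from the training data. A minimal sketch, using dense arrays for readability (a real implementation would use a sparse coordinate format; the variable names are ours):

```python
import numpy as np

# Hypothetical toy data: (multi-index, response) pairs for a 3 x 2 tensor.
samples = [((0, 1), 2.0), ((0, 1), 4.0), ((2, 0), 1.0)]
shape = (3, 2)

W = np.zeros(shape)          # counts: how often each cell was observed
S = np.zeros(shape)          # running sum of responses per cell
for idx, y in samples:
    W[idx] += 1
    S[idx] += y

# Mean response where observed; zero (and zero weight) elsewhere.
Y = np.divide(S, W, out=np.zeros_like(S), where=W > 0)

assert W[0, 1] == 2 and np.isclose(Y[0, 1], 3.0)
assert W[1, 1] == 0 and Y[1, 1] == 0.0   # unobserved cell carries zero weight
```

Cells with zero count contribute nothing to the weighted loss, which is exactly what makes the formulation a completion problem.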
The set $\Omega_{\mathbf{x}}$ contains the indices $m$ at which the data point $\mathbf{x}$ appears in the dataset. The proposed formulation can handle categorical predictors, i.e., predictors that take values indicating membership in one of several categories without an intrinsic ordering between them. Oftentimes, datasets contain both categorical and ordinal predictors, the latter being either discrete or continuous. In the presence of ordinal predictors, a desirable property of a regression model is having smooth prediction surfaces, i.e., small variations in the input should cause small changes in the output. Therefore, we propose augmenting the CPD tensor completion problem with smoothness regularization on the ordinal latent factors
\min_{\{\mathbf{A}_n\}_{n=1}^{N}} \ \left\|\sqrt{\underline{\mathbf{W}}} \ast \left(\underline{\mathbf{Y}} - \underline{\mathbf{X}}\right)\right\|_F^2 + \lambda \sum_{n=1}^{N} \|\mathbf{A}_n\|_F^2 + \sum_{n=1}^{N} \mu_n \|\mathbf{T}\mathbf{A}_n\|_F^2 \qquad (7)
\text{s.t.} \quad \underline{\mathbf{X}} = [\![\mathbf{A}_1, \ldots, \mathbf{A}_N]\!],
where the matrix $\mathbf{T} \in \mathbb{R}^{(I-1) \times I}$ is a smoothness-promoting matrix, typically defined as the first-order difference operator with $\mathbf{T}(i, i) = -1$, $\mathbf{T}(i, i+1) = 1$ and zeros elsewhere.
We set $\mu_n = 0$ for categorical predictors and $\mu_n > 0$ otherwise. Penalizing the difference of consecutive row elements of a factor guarantees that varying the $n$-th input while keeping the remaining inputs fixed will have a small impact on the predicted response. Another appealing feature of the proposed smoothness regularization is that it can potentially measure feature importance. Note that the effect a variable has on the prediction is minimized when each column of the corresponding factor is constant. Irrelevant features are thus more likely to have factors that vary only slightly. On the contrary, factors associated with predictive features will vary more and induce a larger penalty cost.
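The first-order difference operator and its effect on smooth versus rough factor columns can be sketched in a few lines of numpy (an illustration of ours; the helper name is arbitrary):

```python
import numpy as np

def difference_matrix(I):
    """First-order difference operator T in R^{(I-1) x I}: (T a)_i = a_{i+1} - a_i."""
    T = np.zeros((I - 1, I))
    T[np.arange(I - 1), np.arange(I - 1)] = -1.0
    T[np.arange(I - 1), np.arange(1, I)] = 1.0
    return T

T = difference_matrix(5)
a_smooth = np.linspace(0.0, 1.0, 5)              # slowly varying column
a_rough = np.array([0.0, 1.0, 0.0, 1.0, 0.0])    # oscillating column

# The penalty ||T a||^2 is small for smooth columns and large for rough ones.
assert np.linalg.norm(T @ a_smooth) ** 2 < np.linalg.norm(T @ a_rough) ** 2
```

A constant column incurs zero penalty, which matches the feature-importance interpretation above: irrelevant predictors can settle on near-constant factor columns at no cost.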
3.2 Algorithm
The workhorse of tensor decomposition is the so-called Alternating Least Squares (ALS) algorithm. ALS is a special type of BCD which offers two distinct advantages: monotonic decrease of the cost function, and no need for parameter tuning. In this section, we propose an ALS approach to tackle Problem (7).
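The following Python sketch shows the shape of such an ALS scheme for the weighted completion objective. It is a simplified illustration of ours, not the paper's implementation: it loops densely over observed cells, uses only ridge regularization (the smoothness term of (7) is omitted for brevity), and all names are arbitrary.

```python
import numpy as np

def als_step(Y, W, factors, n, lam=1e-2):
    """One row-wise ALS update of factor n for the weighted problem
    min || sqrt(W) * (Y - [[A_1, ..., A_N]]) ||_F^2 + lam * sum_n ||A_n||_F^2.
    Simplified sketch: smoothness regularization omitted."""
    N, F = len(factors), factors[0].shape[1]
    obs = np.argwhere(W > 0)               # coordinate list of observed cells
    An = factors[n]
    for i in range(An.shape[0]):
        rows = obs[obs[:, n] == i]         # observed cells whose n-th index is i
        if len(rows) == 0:
            continue
        # Needed rows of the Khatri-Rao product: Hadamard product of the
        # other factors' rows (no full Khatri-Rao is ever instantiated).
        Z = np.ones((len(rows), F))
        for m in range(N):
            if m != n:
                Z *= factors[m][rows[:, m], :]
        w = W[tuple(rows.T)]
        y = Y[tuple(rows.T)]
        # Weighted ridge normal equations for this single row.
        G = (Z * w[:, None]).T @ Z + lam * np.eye(F)
        An[i] = np.linalg.solve(G, (Z * w[:, None]).T @ y)

# Sanity check on a fully observed synthetic rank-2 tensor: the fit improves.
rng = np.random.default_rng(0)
A_true = [rng.random((4, 2)) for _ in range(3)]
Y = np.einsum('ir,jr,kr->ijk', *A_true)
W = np.ones_like(Y)
facs = [rng.random((4, 2)) for _ in range(3)]
err = lambda f: np.linalg.norm(Y - np.einsum('ir,jr,kr->ijk', *f))
e0 = err(facs)
for _ in range(10):
    for mode in range(3):
        als_step(Y, W, facs, mode)
assert err(facs) < e0
```

The equations below make this precise for the full objective, including the smoothness term and the sparse bookkeeping that avoids ever forming the dense tensor.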
Tensors, despite being high-dimensional, are in general very sparse, and optimized sparse tensor formats can offer huge memory and computational savings (Smith and Karypis, 2015; Li et al., 2018). Here, we use the traditional coordinate format illustrated in Figure 2. The idea of ALS is to cyclically update one factor at a time while fixing the remaining variables at their last updated values. Assume that we fix the estimates $\{\mathbf{A}_j\}_{j \neq n}$; we then need to solve the following optimization problem
\min_{\mathbf{A}_n} \ \left\|\widetilde{\mathbf{W}}_{(n)} \ast \left(\mathbf{Y}_{(n)} - \mathbf{A}_n \mathbf{Z}_n^T\right)\right\|_F^2 + \lambda \|\mathbf{A}_n\|_F^2 + \mu_n \|\mathbf{T}\mathbf{A}_n\|_F^2, \qquad (8)
where $\widetilde{\mathbf{W}} = \sqrt{\underline{\mathbf{W}}}$, with the square root computed elementwise, and $\mathbf{Z}_n = \mathbf{A}_N \odot \cdots \odot \mathbf{A}_{n+1} \odot \mathbf{A}_{n-1} \odot \cdots \odot \mathbf{A}_1$. Equivalently, we have
\min_{\mathbf{A}_n} \ \left\|\widetilde{\mathbf{w}} \ast \left(\mathbf{y} - (\mathbf{Z}_n \otimes \mathbf{I})\,\mathrm{vec}(\mathbf{A}_n)\right)\right\|_2^2 + \lambda \|\mathbf{A}_n\|_F^2 + \mu_n \|\mathbf{T}\mathbf{A}_n\|_F^2, \qquad (9)
where $\mathbf{y} = \mathrm{vec}(\mathbf{Y}_{(n)})$ and $\widetilde{\mathbf{w}} = \mathrm{vec}(\widetilde{\mathbf{W}}_{(n)})$. Note that we do not need to instantiate $\mathbf{Z}_n \otimes \mathbf{I}$, because only the nonzero elements of the sparse vector $\widetilde{\mathbf{w}}$ contribute to the cost function. The nonzero elements of $\widetilde{\mathbf{w}}$ correspond to the observed data points, and for a data point whose $n$-th variable takes the value $i$ we only need the corresponding rows of the Khatri-Rao product. Problem (9) can be optimally solved by finding the solution to a set of linear equations obtained after setting the gradient to zero, e.g., using the conjugate gradient algorithm (Bertsekas, 1997). Simpler updates can be obtained by fixing all variables except for a single row of the factor $\mathbf{A}_n$. Let us fix every parameter except for the $i$-th row of $\mathbf{A}_n$:
\min_{\mathbf{A}_n(i,:)} \ \left\|\mathbf{D}_i^{1/2}\left(\mathbf{y}_i - \mathbf{Z}_i \mathbf{A}_n(i,:)^T\right)\right\|_2^2 + \lambda \|\mathbf{A}_n(i,:)\|_2^2 + \mu_n \sum_{j \in \{i-1,\, i\}} \|\mathbf{A}_n(j+1,:) - \mathbf{A}_n(j,:)\|_2^2, \qquad (10)
where $\mathbf{y}_i$ collects the responses of the observed data points whose $n$-th variable equals $i$, $\mathbf{Z}_i$ the corresponding rows of the Khatri-Rao product $\mathbf{Z}_n$, and $\mathbf{D}_i$ the diagonal matrix of the corresponding weights. The solution for $\mathbf{A}_n(i,:)$ is given by
\mathbf{A}_n(i,:)^T = \left(\mathbf{Z}_i^T \mathbf{D}_i \mathbf{Z}_i + (\lambda + 2\mu_n)\mathbf{I}\right)^{-1}\left(\mathbf{Z}_i^T \mathbf{D}_i \mathbf{y}_i + \mu_n\left(\mathbf{A}_n(i-1,:) + \mathbf{A}_n(i+1,:)\right)^T\right), \qquad (11)
with the obvious modification at the boundary rows, which results in very lightweight row-wise updates. BCD algorithms usually offer faster convergence in terms of the cost function compared to stochastic algorithms for small or moderately sized problems. For large-scale problems, on the other hand, Stochastic Gradient Descent (SGD) can attain moderate solution accuracy faster than BCD. The merits of both alternating optimization and stochastic optimization can be combined by considering block-stochastic updates
(Xu and Yin, 2015). In this work, we propose an easy-to-implement ALS algorithm, as our main goal is to present a fresh perspective on the nonlinear identification problem through low-rank tensor completion. Further algorithmic developments are underway, but beyond the scope of this first submission. Next, we show how the proposed approach can be extended to handle partially observed data and multi-output regression tasks.

3.3 Missing Data
It is quite common to have observations with missing values for one or more predictors. For example, in the grade prediction task described in the introduction, the predictions for a student rely on the student's performance in previously taken courses. Consider a student-grade matrix where our goal is to predict the grade in a target course. This matrix will in general be sparse, since each student enrolls in only a few of the available courses, and the selected courses vary from student to student.
Common approaches for handling missing data include (i) removal of observations with any missing values, (ii) imputing the missing values before training, e.g., by replacing them with the mean, median, or mode, and (iii) handling the imputation directly within the algorithm. Let $\mathcal{O}$ and $\mathcal{M}$ denote the indices of the observed and missing entries of a single observation, respectively. Instead of ignoring observations with missing entries, we aim at computing the expectation of the nonlinear function conditioned on the observed variables, i.e., we set
\hat{y} = \mathbb{E}\left[f(\mathbf{x}) \mid \mathbf{x}_{\mathcal{O}}\right] = \sum_{\mathbf{x}_{\mathcal{M}}} f(\mathbf{x})\,\Pr\left(\mathbf{x}_{\mathcal{M}} \mid \mathbf{x}_{\mathcal{O}}\right). \qquad (12)
Estimating the conditional probability directly is not possible, since the number of parameters grows exponentially with the number of missing entries. Given the low-rank structure of the nonlinear function, we propose modeling the joint Probability Mass Function (PMF) of the inputs using a nonnegative CPD model, which is a universal model for PMF estimation (Kargas et al., 2018). For the sake of simplicity, we adopt a simple rank-one joint PMF model estimated via the empirical first-order marginals (Huang and Sidiropoulos, 2017). Without loss of generality, assume that the first $J$ predictors are observed and the remaining $N - J$ are missing; then, the expectation can be computed very efficiently
\hat{y} = \sum_{f=1}^{F} \prod_{n=1}^{J} \mathbf{A}_n(i_n, f) \prod_{n=J+1}^{N} \mathbf{p}_n^T \mathbf{A}_n(:, f), \qquad (13)
where $\mathbf{p}_n$ is the empirical marginal PMF of the $n$-th predictor.
This modification can be easily incorporated into the ALS algorithm. Richer dependencies between the variables can also be captured using a higher-order PMF model, but we defer this discussion to follow-up work due to space limitations.
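The rank-one PMF expectation above has a particularly simple implementation: observed modes select a factor row, missing modes average the rows under the empirical marginal. A small sketch of ours (function and variable names are illustrative, and the marginals here are uniform for the sake of an exact check):

```python
import numpy as np

def expected_response(factors, x_obs, marginals):
    """E[f(x) | observed entries] under a rank-one (independent) PMF model.
    x_obs: dict {mode: index} of observed predictors.
    marginals: list of empirical PMFs p_n, one per mode (used for missing modes)."""
    F = factors[0].shape[1]
    val = np.ones(F)
    for n, A in enumerate(factors):
        if n in x_obs:
            val *= A[x_obs[n], :]        # observed mode: pick the indexed row
        else:
            val *= marginals[n] @ A      # missing mode: average rows under p_n
    return val.sum()

rng = np.random.default_rng(3)
factors = [rng.random((3, 2)) for _ in range(3)]
p = [np.full(3, 1 / 3) for _ in range(3)]

# With mode 2 missing and a uniform marginal, the expectation equals the
# plain average of the full tensor over the missing index.
X = np.einsum('ir,jr,kr->ijk', *factors)
assert np.isclose(expected_response(factors, {0: 1, 1: 2}, p), X[1, 2, :].mean())
```

The cost is linear in the number of modes and the rank, never exponential in the number of missing entries.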
3.4 MultiOutput Regression
The proposed framework is quite flexible and can easily be extended to vector-valued functions $f : \{1, \ldots, I\}^N \rightarrow \mathbb{R}^K$. When there is no correlation between the output variables of a system, one can build $K$ independent models, one for each output, and then use those models to independently predict each of the outputs. However, it is likely that the output values related to the same input are themselves correlated, and it is often better to build a single model capable of predicting all outputs simultaneously. We can treat each individual model as an $N$-way tensor and stack the $K$ models together to build an $(N+1)$-way tensor. The new tensor model can be described by $N$ factors $\mathbf{A}_1, \ldots, \mathbf{A}_N$ associated with the predictors and an additional factor $\mathbf{B} \in \mathbb{R}^{K \times F}$ for the output mode of dimension $K$. In matrix form, the vector-valued prediction for an input $\mathbf{x} = (i_1, \ldots, i_N)$ is given by
\hat{\mathbf{y}} = \mathbf{B}\left(\mathbf{A}_N(i_N,:) \ast \cdots \ast \mathbf{A}_1(i_1,:)\right)^T. \qquad (14)
No modification is needed for the ALS updates presented in Section 3.2. Depending on the application, one may or may not need to apply smoothness regularization on $\mathbf{B}$.
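The stacked multi-output model can be sketched in a few lines of numpy (an illustration of ours with arbitrary sizes): the prediction is the output factor applied to the Hadamard product of the selected input-factor rows, and it agrees entrywise with the full $(N+1)$-way tensor.

```python
import numpy as np

# Multi-output sketch: stack K single-output models into an (N+1)-way tensor
# [[A_1, ..., A_N, B]]; the prediction for input (i_1, ..., i_N) is
# y_hat = B (A_N(i_N,:) * ... * A_1(i_1,:))^T, as in Eq. (14).
rng = np.random.default_rng(4)
I, F, N, K = 4, 2, 3, 3
A = [rng.random((I, F)) for _ in range(N)]
B = rng.random((K, F))

x = (1, 0, 3)
h = np.ones(F)
for n in range(N):
    h *= A[n][x[n], :]           # Hadamard product of the selected rows
y_hat = B @ h                    # K-dimensional prediction

# Consistency check against the full (N+1)-way tensor.
X = np.einsum('ir,jr,kr,lr->ijkl', *A, B)
assert np.allclose(y_hat, X[x])
```

Predicting all $K$ outputs thus costs a single $F$-dimensional Hadamard product plus one matrix-vector multiply, regardless of $K$-specific retraining.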
Table 1: RMSE on the UCI datasets with no missing predictors.
Dataset  RR  SVR (RBF)  SVR (polynomial)  DT  MLP (5 Layer)  CSID 

Energy Eff. (1)  
Energy Eff. (2)  
C. Comp. Strength  
SkillCraft Master Table  
Abalone  
Wine Quality  
Parkinsons Tel. (1)  
Parkinsons Tel. (2)  
C. Cycle Power Plant  
Bike Sharing (1)  
Bike Sharing (2)  
Phys. Prop. 
Table 2: RMSE on the UCI datasets with missing predictors.
Dataset  RR  SVR (RBF)  SVR (polynomial)  DT  MLP (5 Layer)  CSID 

Energy Eff. (1)  
Energy Eff. (2)  
C. Comp. Strength  
SkillCraft Master Table  
Abalone  
Wine Quality  
Parkinsons Tel. (1)  
Parkinsons Tel. (2)  
C. Cycle Power Plant  
Bike Sharing (1)  
Bike Sharing (2)  
Phys. Prop. 
Table 3: RMSE for multi-output regression.
Dataset  RR  MLP (1 Layer)  MLP (3 Layer)  MLP (5 Layer)  DT  CSID 

En. Eff. (2)  
Park. Tel. (2)  
B. Shar. (2) 
4 Experiments
We evaluate the proposed approach in single-output regression tasks using several datasets obtained from the UCI machine learning repository (Lichman, 2013). Our approach is implemented in MATLAB using the Tensor Toolbox (Bader and Kolda, 2007) for tensor operations. We then assess the ability of our model to handle missing predictors by hiding a portion of the data, as well as its ability to predict vector-valued responses. For each experiment we split the dataset into a training set and a test set and run Monte Carlo simulations. Finally, we evaluate the performance of our approach in a challenging student grade prediction task using a real student grade dataset. For each method, we tune the hyperparameters using cross-validation on a subset of the training set held out as a validation set. A description of the algorithms and datasets used, as well as details regarding parameter selection, are given in the supplementary material. We compare the performance of the different algorithms in terms of the Root Mean Square Error (RMSE).
Table 4: RMSE for the student grade prediction tasks.
Dataset  GPA  BMF  CSID 

CSCI1  
CSCI2  
CSCI3  
CSCI4  
CSCI5  
CSCI6  
CSCI7  
CSCI8  
CSCI9  
CSCI10 
Dataset  GPA  BMF  CSID 

CSCI11  
CSCI12  
CSCI13  
CSCI14  
CSCI15  
CSCI16  
CSCI17  
CSCI18  
CSCI19  
CSCI20 
4.1 UCI Datasets
We used four different machine learning algorithms as baselines: Ridge Regression (RR), Support Vector Regression (SVR), Decision Trees (DT) and Multilayer Perceptrons (MLPs), using the implementations of scikit-learn (Pedregosa et al., 2011). For RR, SVR and MLP we standardize each ordinal feature to zero mean and unit variance. Categorical features are transformed using one-hot encoding. For DT no preprocessing step is required. For our method, we fix the alphabet size and use the Lloyd-Max scalar quantizer for discretization of continuous predictors. For the MLPs, we varied the number of hidden layers and the number of nodes per layer. We observed that in most cases the deeper MLPs performed better than the shallower ones, and that further increasing the number of layers did not improve performance. Table 1 shows the RMSE performance of the different methods on the datasets when there are no missing predictors. The number inside the square brackets denotes the number of nodes in each layer of the MLP. We highlight the two best-performing methods for each dataset. Our approach performs similarly to or better than the best baseline on most of the datasets. Note that both decision trees and our approach rely on discretization of continuous predictors; however, the smoothness regularization plays a significant role in boosting the RMSE performance of our method.

Next, we evaluate our approach on partially observed datasets. We randomly hide a portion of the full dataset and repeat the Monte Carlo simulations. Before fitting the baseline algorithms, we replace each missing entry of an ordinal predictor with the mean, and for each categorical predictor we use the most frequent value (mode). For our algorithm, we use a rank-one approximation of the joint PMF tensor estimated from the training data. Table 2 shows the performance of the different algorithms in this setting. Again, our approach performs similarly to or better than the best baseline.
Finally, we test our approach in predicting multi-output responses against RR, DT and MLPs. Table 3 contains the results for three datasets.
4.2 Grade Prediction Datasets
Finally, we evaluate our method in a student grade prediction task on a real dataset obtained from the CS department of a university. The predictors correspond to the course grades the students have received. Specifically, we used the most frequent courses to build independent single-output regression tasks. Grades take discrete values and, due to the natural ordering between the different values, smoothness regularization was applied on all factors. We used the Grade Point Average (GPA) and Biased Matrix Factorization (BMF) (Koren et al., 2009) as our baselines. Low-rank matrix completion is considered a state-of-the-art method in student grade prediction (Polyzou and Karypis, 2016; Almutairi et al., 2017). Note that in the matrix case each course is represented by a column, while in the proposed tensor approach each course is represented by a tensor mode (Figure 3). Table 4 shows the results for the different algorithms. Our approach outperforms BMF in the majority of the tasks and performs on par with or worse than it in the remaining few.
5 Conclusion and Future work
In this paper, we considered the problem of nonlinear system identification. We formulated it as a smooth tensor completion problem with missing data and developed a lightweight BCD algorithm to tackle it. We proposed a simple approach to handle randomly missing data and extended our model to vector-valued function approximation. Experiments on several real-data regression tasks showcased the effectiveness of the proposed approach.
References
 Almutairi et al. (2017) F. M. Almutairi, N. D. Sidiropoulos, and G. Karypis. Contextaware recommendationbased learning analytics using tensor and coupled matrix factorization. IEEE Journal of Selected Topics in Signal Processing, 11(5):729–741, Aug 2017.
 Anandkumar et al. (2014) A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773–2832, 2014.
 Bader and Kolda (2007) B. W. Bader and T. G. Kolda. Efficient MATLAB computations with sparse and factored tensors. SIAM Journal on Scientific Computing, 30(1):205–231, December 2007.
 Bertsekas (1997) D. P. Bertsekas. Nonlinear programming. Journal of the Operational Research Society, 48(3):334–334, 1997.
 Elbadrawy et al. (2016) A. Elbadrawy, A. Polyzou, Z. Ren, M. Sweeney, G. Karypis, and H. Rangwala. Predicting student performance using personalized analytics. Computer, 49(4):61–69, 2016.
 Fu et al. (2015) X. Fu, K. Huang, W. Ma, N. Sidiropoulos, and R. Bro. Joint tensor factorization and outlying slab suppression with applications. IEEE Transactions on Signal Processing, 63(23):6315–6328, Dec 2015.
 Hastie et al. (2001) T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. 2001.
 Huang and Sidiropoulos (2017) K. Huang and N. D. Sidiropoulos. KullbackLeibler principal component for tensors is not NPhard. In 2017 51st Asilomar Conference on Signals, Systems, and Computers, pages 693–697, Oct 2017.
 Imaizumi and Hayashi (2017) M. Imaizumi and K. Hayashi. Tensor decomposition with smoothness. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1597–1606, 06–11 Aug 2017.
 Jain and Oh (2014) P. Jain and S. Oh. Provable tensor factorization with missing data. In Advances in Neural Information Processing Systems 27, pages 1431–1439. 2014.
 Kargas and Sidiropoulos (2019) N. Kargas and N. D. Sidiropoulos. Learning mixtures of smooth product distributions: Identifiability and algorithm. In Artificial Intelligence and Statistics, volume 89, pages 388–396, 2019.
 Kargas et al. (2018) N. Kargas, N. D. Sidiropoulos, and X. Fu. Tensors, learning, and “Kolmogorov extension” for finitealphabet random vectors. IEEE Transactions on Signal Processing, 66(18):4854–4868, Sept 2018.
 Koren et al. (2009) Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, (8):30–37, 2009.
 Krishnamurthy and Singh (2013) A. Krishnamurthy and A. Singh. Lowrank matrix and tensor completion via adaptive sampling. In Advances in Neural Information Processing Systems, pages 836–844, 2013.
 Lacroix et al. (2018) T. Lacroix, N. Usunier, and G. Obozinski. Canonical tensor decomposition for knowledge base completion. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 2863–2872, 2018.
 Li et al. (2018) J. Li, J. Sun, and R. Vuduc. Hicoo: Hierarchical storage of sparse tensors. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 238–252, Nov 2018.
 Lichman (2013) M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
 Papalexakis et al. (2013) E. E. Papalexakis, L. Akoglu, and D. Ience. Do more views of a graph help? community detection and clustering in multigraphs. In Proceedings of the 16th International Conference on Information FUSION, pages 899–905, 2013.
 Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikitlearn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

 Polyzou and Karypis (2016) A. Polyzou and G. Karypis. Grade prediction with models specific to students and courses. International Journal of Data Science and Analytics, 2(3):159–171, Dec 2016.
 Rendle (2010) S. Rendle. Factorization machines. In Proceedings of the 2010 IEEE International Conference on Data Mining, pages 995–1000, Dec 2010.
 Sidiropoulos et al. (2017) N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos. Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing, 65(13):3551–3582, July 2017.
 Smith and Karypis (2015) S. Smith and G. Karypis. Tensormatrix products with a compressed sparse tensor. In Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms, page 5, 2015.
 Xiong et al. (2010) L. Xiong, X. Chen, T.K. Huang, J. Schneider, and J. G. Carbonell. Temporal collaborative filtering with bayesian probabilistic tensor factorization. In Proceedings of the 2010 SIAM international conference on data mining, pages 211–222, 2010.
 Xu and Yin (2015) Y. Xu and W. Yin. Block stochastic gradient iteration for convex and nonconvex optimization. SIAM Journal on Optimization, 25(3):1686–1716, 2015.
 Yokota et al. (2016) T. Yokota, Q. Zhao, and A. Cichocki. Smooth PARAFAC decomposition for tensor completion. IEEE Transactions on Signal Processing, 64(20):5423–5436, Oct 2016.