Conventional machine learning problems generally aim to recover a mathematical mapping from the feature space to the label space. We can represent the unknown true mapping steering the real-world data distribution as , where and denote the feature space and label space respectively, and represents the variables in the mapping function. Depending on the application setting, such a mapping can be either a simple or a quite complicated equation involving both the variables and the extracted features.
To approximate such a mapping, various machine learning models have been proposed, which can be trained on a small number of feature-label pairs sampled from space , where . Formally, we can represent the approximated mapping outlined by the general machine learning models (including deep neural networks) from the feature space to the label space as parameterized by . By minimizing the following objective function, the machine learning models can learn the optimal variable :
represents the loss function of the prediction results compared against the ground truth, and denotes the feasible variable space. Terms and represent the feature matrix and label vector of the training data respectively.
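As a minimal sketch of the objective above, the following Python snippet minimizes an empirical squared loss for a linear model by plain gradient descent. The squared loss, the linear model, the synthetic binary data, and all names (empirical_loss, fit_by_gradient_descent, true_w) are illustrative assumptions rather than the setup used in this paper; the feasible variable space is taken to be unconstrained.

```python
import numpy as np

def empirical_loss(w, X, y):
    """Mean squared loss of the linear model X @ w against the labels y."""
    residual = X @ w - y
    return float(np.mean(residual ** 2))

def fit_by_gradient_descent(X, y, lr=0.1, steps=2000):
    """Minimize the empirical loss over w with plain gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = (2.0 / n) * X.T @ (X @ w - y)  # gradient of the mean squared loss
        w -= lr * grad
    return w

# Synthetic training data (assumption): binary features, noiseless linear labels.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(200, 4)).astype(float)
true_w = np.array([1.0, -2.0, 0.5, 3.0])  # hypothetical ground-truth variables
y_train = X_train @ true_w

w_hat = fit_by_gradient_descent(X_train, y_train)
print("learned variables:", np.round(w_hat, 4))
print("empirical loss:", empirical_loss(w_hat, X_train, y_train))
```

Because the synthetic labels are exactly linear in the features, the learned variables should match true_w closely; with a non-linear target the same procedure would stop at a non-zero loss.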
Existing machine learning research mostly assumes that the mapping with the optimal variables provides a good approximation of the true mapping between the feature and label spaces and . Here, variable can be either locally or globally optimal, depending on whether the loss function is convex with respect to variable . Meanwhile, depending on the specific machine learning algorithm adopted, function
usually has very different representations, e.g., weighted linear summation for linear models, probabilistic projection for graphical models, and nested projections for deep neural networks via various non-polynomial activation functions.
In this paper, we analyze the errors introduced by the learning model in approximating the true mapping function . Specifically, we say function can be approximated with model mapping based on a dataset iff (1) function achieves low empirical loss on the training dataset , and (2) the empirical loss is also close to the expected loss of on space . Under different prior assumptions about the distribution of the learning space , the existing machine learning model mapping functions, which take different forms, usually achieve very different performance. In this paper, we try to provide a unified representation of the diverse existing machine learning algorithms and illustrate why they obtain different learning performance.
2 Unified Machine Learning Model Representation
In the remainder of this paper, to simplify the analysis, we assume the labels of the data instances are real numbers of dimension , e.g., , and the feature vectors of data instances are binary of dimension , e.g., (i.e., and ). For label vectors of higher dimensions and continuous feature values, similar analysis results can be obtained.
Based on the space , any function , where and , can be represented as a finite weighted sum of monomial terms about as follows:
where term denotes a weight computed based on . Functions and project the feature and variable vectors to a space of the same dimension, which are called the “feature kernel function” and “variable reconciling function” respectively.
Note that the original feature and variable vectors can actually be of different dimensions, i.e., . There exist various definitions of functions and . For instance, according to the above equation, we can provide an example of these functions as follows, which project the features and variables to a shared space of dimension :
Formally, given a mapping function , the true label of data instance featured by vector can be represented as . At , the mapping function can be represented as the weighted summation of polynomial terms according to the Taylor expansion theorem, i.e.,
Here, the key point is whether is finite or will be approaching . Depending on the highest order of polynomial terms involved in the true mapping , we have the following two cases:
Case 1: When the largest order of the polynomial terms in is a finite number , it is easy to show that . In other words, is a finite number as well and .
Case 2: When the derivative of function exists for any , it seems the equation will have an infinite number of polynomial terms. However, considering that , the power of will be equal to , i.e., for any . In other words, the high-order polynomial term can always hold, which reduces the infinite number of polynomial terms to a finite number of polynomial terms of instead, i.e., is still a finite number.
Based on the above analysis, we can simplify the above equation as follows:
where the weight term is the sum of the -order derivative value and the derivatives of even higher order, e.g., and so forth.
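The claim that a function over binary features collapses to a finite weighted sum of monomials can be checked numerically. The sketch below is an illustration under assumptions, not the construction used in this paper: it recovers the monomial weights of an arbitrary function on {0,1}^d by Moebius inversion over feature subsets, and the target function g and the names monomial_weights and evaluate are hypothetical.

```python
from itertools import combinations, product
import math

def monomial_weights(g, d):
    """Weights c[S] such that g(x) = sum_S c[S] * prod_{i in S} x_i on {0,1}^d.

    Uses Moebius inversion: c[S] = sum over T subset of S of
    (-1)^(|S|-|T|) * g(indicator vector of T).
    """
    weights = {}
    for size in range(d + 1):
        for S in combinations(range(d), size):
            total = 0.0
            for r in range(size + 1):
                for T in combinations(S, r):
                    x = [1 if i in T else 0 for i in range(d)]
                    total += (-1) ** (size - r) * g(x)
            weights[S] = total
    return weights

def evaluate(weights, x):
    """Evaluate the weighted monomial sum at a binary point x."""
    return sum(c * math.prod(x[i] for i in S) for S, c in weights.items())

def g(x):
    """Hypothetical non-polynomial target with infinitely many Taylor terms."""
    return math.exp(x[0] + 2 * x[1]) * math.cos(x[2])

d = 3
weights = monomial_weights(g, d)
max_gap = max(abs(evaluate(weights, x) - g(x)) for x in product([0, 1], repeat=d))
print("number of monomial terms:", len(weights))
print("largest reconstruction gap:", max_gap)
```

The target has an infinite Taylor series, yet on binary inputs 2^d monomial weights reproduce it exactly up to floating-point error, because every power of a binary feature equals the feature itself.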
2.1 Approximation Error Analysis
The mappings of machine learning models are similar: they can also be represented as a polynomial summation, i.e., , where , and . Given the learning space , the approximation process of function for true mapping generally involves four key factors:
dimension of parameter : ,
objective learning space dimension: ,
weight reconciling function ,
feature kernel function .
If picks factors identical to those of function , then precisely recovers . However, in real applications, precise recovery of is usually impossible. Meanwhile, according to the Vapnik-Chervonenkis theory VC15 ; BEHW89 , to measure the quality of function , we can compute the error it introduces compared against the true mapping function over the learning space , which can be represented as
denotes the probability density function of , and denotes a norm measuring the difference between and .
Minimization of term is equivalent to the minimization of and simultaneously. In the learning process, the training data is given, but we have no knowledge of the remaining data instances . In other words, computing the expected loss term is impossible. Existing machine learning algorithms address the problem in two ways: (1) minimizing the empirical loss , and (2) minimizing the gap between the empirical loss and the overall loss, i.e., . To achieve such an objective, various machine learning models have been proposed. In the following section, we will illustrate how the existing machine learning algorithms determine the factors so as to minimize the model approximation loss terms.
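To make the distinction between empirical and expected loss concrete, the sketch below uses a binary feature space small enough to enumerate, so the expected loss can be computed exactly. The uniform density, the squared loss, the least-squares linear model, and the target with a single feature interaction are all assumptions made for illustration.

```python
from itertools import product
import numpy as np

def squared_loss(w, X, y):
    """Mean squared loss of the linear model X @ w against y."""
    return float(np.mean((X @ w - y) ** 2))

def fit_least_squares(X, y):
    """Least-squares linear fit (minimum-norm solution if rank deficient)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

d = 6
space = np.array(list(product([0.0, 1.0], repeat=d)))   # all 2^d binary points
space = np.hstack([np.ones((len(space), 1)), space])     # prepend a bias column
target = space[:, 1] * space[:, 2] + space[:, 3]          # hypothetical true mapping

def empirical_and_expected_loss(train_idx):
    """Fit on a training subset; report loss on that subset and on the whole space."""
    w = fit_least_squares(space[train_idx], target[train_idx])
    empirical = squared_loss(w, space[train_idx], target[train_idx])
    expected = squared_loss(w, space, target)  # uniform density over the space
    return empirical, expected

rng = np.random.default_rng(0)
small = empirical_and_expected_loss(rng.choice(len(space), size=12, replace=False))
full = empirical_and_expected_loss(np.arange(len(space)))
print("12 training points (empirical, expected):", small)
print("whole space        (empirical, expected):", full)
```

When the whole space is used for training, the two losses coincide and equal the residual that no linear model can remove (1/16 here, contributed by the interaction term). With only a few training points, the expected loss can never fall below that value, whatever the empirical loss is.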
3 Classic Machine Learning Model Approximation Analysis
In this section, we provide a comprehensive analysis of the existing machine learning algorithms, and illustrate that they can all be represented as the inner product of the kernel function of the features and the reconciling function of the variables.
3.1 Linear Model Approximation Error Analysis
We begin with an analysis of the linear models CV95 ; YS09 , which provides the foundation for studying more complicated learning models. Formally, given a data instance featured by vector of dimension , based on a linear model parameterized by the optimal weight vector , we can represent the mapping result of the data instance as
According to the representation, we have the factors for linear models as: (1) , (2) , (3) , and (4) . Compared with the true values, we can represent the approximation error by the model for as
where and with entry .
For the linear models, the approximation error is mainly introduced by approximating the high-order remainder term with the linear function . In other words, for the linear models, the empirical error term is usually large when dealing with data instances that are not linearly separable. Even if can provide a good approximation of the whole error term , the overall approximation performance will still be poor in such situations.
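A small worked example of this limitation, under the assumption of a squared loss and a least-squares fit: on the four XOR points, the best linear model predicts 0.5 everywhere, while adding the single interaction feature x1*x2 removes the error entirely. The variable names are illustrative.

```python
import numpy as np

# Four XOR points with a leading bias column: columns are (1, x1, x2).
X = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]], dtype=float)
y_xor = np.array([0.0, 1.0, 1.0, 0.0])

# Best linear fit in the least-squares sense.
w_linear, *_ = np.linalg.lstsq(X, y_xor, rcond=None)
linear_predictions = X @ w_linear
linear_loss = float(np.mean((linear_predictions - y_xor) ** 2))

# Same fit after appending the second-order interaction feature x1 * x2.
X_quadratic = np.hstack([X, (X[:, 1] * X[:, 2])[:, None]])
w_quadratic, *_ = np.linalg.lstsq(X_quadratic, y_xor, rcond=None)
quadratic_loss = float(np.mean((X_quadratic @ w_quadratic - y_xor) ** 2))

print("linear predictions:", np.round(linear_predictions, 6))
print("linear loss:", linear_loss)
print("loss with interaction term:", quadratic_loss)
```

The linear loss of 0.25 is exactly the energy of the interaction term that the linear function cannot represent, since XOR equals x1 + x2 - 2*x1*x2 on binary inputs.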
3.2 Quadratic Model Approximation Error Analysis
To resolve such a problem, in recent years, some works have proposed incorporating the interactions among the features into the model learning process, and several learning models, such as the Factorization Machine (FM) R10 , have been proposed.
FM proposes to combine the advantages of linear models, e.g., SVM, with those of factorization models. Formally, given the data instance featured by vector , the label predicted by FM can be represented as
where denotes the variable vector. Operator denotes the concatenation of vectors.
For the data instances featured by vectors of dimension , the total number of variables involved in FM will be , learning of which is a challenging problem for large . To resolve such a problem, besides the weights for linear and bias terms, FM introduces an extra factor vector to define weights in , which can be represented by matrix . Formally, FM defines the quadratic polynomial term weight as , where and are the factor vectors corresponding to the and feature respectively.
Therefore, for the FM model, we have the key factors: (1) , (2) , (3) , and (4) . Here, are the variables to be learned in FM.
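The factorized pairwise weights can be sketched as follows. This is an illustrative implementation of a second-order FM prediction, in which each pairwise weight is the inner product of two factor vectors; the names w0, w, V, and the factor dimension k are notation assumed here. The second function uses the standard rearrangement of the pairwise sum that avoids the explicit double loop.

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """Bias + linear terms + explicit sum over feature pairs i < j."""
    d = len(x)
    pairwise = 0.0
    for i in range(d):
        for j in range(i + 1, d):
            pairwise += (V[i] @ V[j]) * x[i] * x[j]  # weight = <v_i, v_j>
    return w0 + w @ x + pairwise

def fm_predict_fast(x, w0, w, V):
    """Same prediction in O(d * k) time.

    Uses sum_{i<j} <v_i, v_j> x_i x_j
       = 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ].
    """
    xv = x @ V                   # shape (k,)
    x2v2 = (x ** 2) @ (V ** 2)   # shape (k,)
    return w0 + w @ x + 0.5 * np.sum(xv ** 2 - x2v2)

rng = np.random.default_rng(1)
d, k = 8, 3                       # hypothetical feature and factor dimensions
x = rng.integers(0, 2, size=d).astype(float)
w0, w, V = 0.3, rng.normal(size=d), rng.normal(size=(d, k))

naive_prediction = fm_predict_naive(x, w0, w, V)
fast_prediction = fm_predict_fast(x, w0, w, V)
print(naive_prediction, fast_prediction)
print("pairwise weights without factorization:", d * (d - 1) // 2)
print("factor variables with factorization:", d * k)
```

The factorization trades d(d-1)/2 free pairwise weights for d*k factor variables, which is the reduction in the number of variables that motivates the design.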
3.3 Higher-Order Model Approximation Error Analysis
Meanwhile, the recent Multi-View Machine (MVM) CZLY16 proposes to partition the feature vector into several segments (each segment denotes a view), and to incorporate even higher-order feature interactions among these views into the model. Formally, we can represent the multi-view feature vector as , where the superscript denotes the view index and is the total view number. The prediction result by MVM can be represented as
where denotes the feature length of the view, i.e., the length of vector .
For the higher-order variable, e.g., , MVM also introduces a factorization style method to define the variable reconciling function based on a sequence of matrices for the view, where . Therefore, the key factors involved in the MVM are as follows: (1) , (2) , (3) , and (4) .
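The following sketch illustrates the factorization idea for three views. It assumes, as a labeled simplification, that each view vector is extended with a constant 1 so that lower-order interactions are included, and that the full weight tensor is a sum of k rank-one terms built from one factor matrix per view. The function names and dimensions are hypothetical, and the full-tensor version is written for exactly three views only to verify the factorized form.

```python
import numpy as np

def mvm_predict_factorized(views, factors):
    """Sum over factors f of the product over views p of <z_p, A_p[:, f]>."""
    out = np.ones(factors[0].shape[1])
    for x, A in zip(views, factors):
        z = np.append(x, 1.0)   # append a constant feature to the view (assumption)
        out *= z @ A            # per-factor projection of this view
    return float(out.sum())

def mvm_predict_full_tensor(views, factors):
    """Reference computation with the explicit third-order weight tensor."""
    z1, z2, z3 = (np.append(x, 1.0) for x in views)
    A1, A2, A3 = factors
    W = np.einsum('if,jf,kf->ijk', A1, A2, A3)   # full weight tensor
    return float(np.einsum('ijk,i,j,k->', W, z1, z2, z3))

rng = np.random.default_rng(2)
view_lengths, k = [3, 4, 2], 5   # hypothetical view sizes and factor count
views = [rng.integers(0, 2, size=n).astype(float) for n in view_lengths]
factors = [rng.normal(size=(n + 1, k)) for n in view_lengths]

factorized = mvm_predict_factorized(views, factors)
full = mvm_predict_full_tensor(views, factors)
print(factorized, full)
```

The factorized form never materializes the weight tensor, whose size grows as the product of the view lengths, while the factor matrices grow only with their sum.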
The FM can be viewed as a special case of the MVM, which involves views denoted by each feature in the vector . Furthermore, compared against the output of true models, the error introduced by the MVM (with order ) on instance can be represented as
which denotes the error introduced by using -order polynomial equation to approximate the remainder term
. By examining linear models, FM, and MVM (and other machine learning models, like the support tensor machine (STM) KGP12 ), we see that their main drawback lies in their inability to model higher-order feature interactions, which leads to large empirical error in fitting functions that involve such interactions. In the following section, we will show that deep learning models can effectively resolve this problem, since they can fit any function to any degree of accuracy.
4 Deep Learning Model Approximation Error Analysis
In this section, we focus on the approximation error analysis of deep learning models. First, we take the perceptron neural network as an example and analyze the approximation errors it introduces. After that, we analyze deep neural network models with hidden layers.
4.1 Perceptron Neural Network Approximation
Figure 2 illustrates a neural network model with a shallow architecture, involving merely the input layer and the output layer. The input for the model can be represented as vector , and terms are the connection weights for each input feature,
is the bias term. Here, we use the sigmoid function as the activation function. Formally, we can represent the prediction output for the input vector as
The sigmoid function has a convenient derivative property, which can be expressed with the following lemmas.
Let , where and . Given the derivative of , i.e., , the derivative of regarding can be represented as
In addition, to simplify the notations, we can use to denote the recursive application of function , where . Therefore, we can have the concrete representation for the derivative of function as follows:
The derivative of function regarding the variables in (where and ) in a sequence can be represented as
Based on the lemmas, we can rewrite the approximation function for term at as the sum of an infinite polynomial series according to the following theorem.
For the neural network model with sigmoid function as the activation function, its mapping function can be expanded as an infinite polynomial sequence at point :
where and .
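The polynomial expansion of the sigmoid can be sketched numerically. The snippet below relies only on the well-known identity that the derivative of the sigmoid equals sigmoid * (1 - sigmoid), so every higher derivative is a polynomial in the sigmoid value itself. Expanding at 0 and the function names are assumptions made for illustration; the snippet does not reproduce the exact statements of the lemmas or of Theorem 1.

```python
import math
import numpy as np
from numpy.polynomial import polynomial as P

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_taylor_coefficients(order):
    """Taylor coefficients of the sigmoid at 0, up to the given order.

    Each derivative is stored as a polynomial in s = sigmoid(z); the chain
    rule gives d/dz p(s) = p'(s) * s * (1 - s).
    """
    s_times_one_minus_s = np.array([0.0, 1.0, -1.0])  # s - s^2
    deriv = np.array([0.0, 1.0])                       # the sigmoid itself: p(s) = s
    coeffs = []
    for n in range(order + 1):
        # sigmoid(0) = 0.5, so evaluate the n-th derivative polynomial at s = 0.5.
        coeffs.append(float(P.polyval(0.5, deriv)) / math.factorial(n))
        deriv = P.polymul(P.polyder(deriv), s_times_one_minus_s)
    return coeffs

def taylor_sigmoid(z, order):
    """Truncated polynomial approximation of the sigmoid around 0."""
    return sum(c * z ** n for n, c in enumerate(sigmoid_taylor_coefficients(order)))

coeffs = sigmoid_taylor_coefficients(9)
print("leading coefficients:", [round(c, 6) for c in coeffs[:4]])
print("sigmoid(1.0):", sigmoid(1.0), "order-9 expansion:", taylor_sigmoid(1.0, 9))
```

The truncated series is accurate only near the expansion point: the Taylor series of the sigmoid at 0 converges for |z| < pi, so inputs of larger magnitude need a different expansion point.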
According to Theorem 1, the expansion of at can be represented as
Next, we will mainly focus on studying the relationship between the weight variables . Based on the mapping function, we have the concrete representation of the bias scalar terms of the polynomial terms in the expanded equation at expansion point as . Meanwhile, for the remaining terms, we propose to compute the derivative of regarding on both sides:
Here, we know that , therefore we can have ,
Therefore, we can have the -order scalar weight for variable term to be
In other words, for the perceptron neural network model, we have the key factors as follows: (1) , (2) , (3) , and (4) . It seems the perceptron model can fit any high-order polynomial term, since projects to any high-order product of the features. However, as an example problem in the next subsection will show, the perceptron may fail to work well for non-monotone functions, e.g., XOR. Therefore, the perceptron may introduce a large empirical loss in fitting non-monotone functions.
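The XOR limitation can be illustrated with a brute-force check. The sketch below thresholds the output of a single sigmoid unit at 0.5 and searches a coarse grid of weights and biases; the grid, the threshold, and the function names are assumptions made for illustration. The grid search is only a demonstration: the bound of three correct points follows from XOR not being linearly separable.

```python
import itertools
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def correct_count(w1, w2, b):
    """Number of XOR points a single sigmoid unit classifies correctly."""
    hits = 0
    for (x1, x2), label in XOR_DATA:
        predicted = 1 if sigmoid(w1 * x1 + w2 * x2 + b) > 0.5 else 0
        hits += int(predicted == label)
    return hits

# Coarse grid over (w1, w2, b): values from -10 to 10 in steps of 0.5.
grid = [g / 2 for g in range(-20, 21)]
best = max(correct_count(w1, w2, b)
           for w1, w2, b in itertools.product(grid, repeat=3))
print("best number of correctly classified XOR points:", best)
```

For example, w1 = w2 = 1 and b = -0.5 classify three of the four points correctly, and no setting on the grid classifies all four.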
4.2 Deep Neural Network with Hidden Layers Approximation
We start this subsection with a deep neural network model with one hidden layer, as shown in Figure 2. In the plot, is the input feature vector, and
are the connection weight variables. Formally, we can represent the output neuron state as follows:
where . Although the neural network model shown in Figure 2 has a very simple structure, it can be used as the foundation to learn the reconciled polynomial representation of any deep neural network C89 ; M96 ; HSW89 ; H91 .
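As a small illustration of what one hidden layer adds, the sketch below hard-codes a network with two sigmoid hidden units that reproduces XOR, a function a single sigmoid unit cannot fit. The weights are chosen by hand for this example (they are not learned and do not come from the paper): the first hidden unit approximates OR, the second approximates NAND, and the output unit approximates their AND.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def one_hidden_layer_net(x1, x2):
    """Two sigmoid hidden units followed by one sigmoid output unit."""
    h_or = sigmoid(20 * (x1 + x2) - 10)      # close to 1 unless both inputs are 0
    h_nand = sigmoid(-20 * (x1 + x2) + 30)   # close to 1 unless both inputs are 1
    return sigmoid(20 * (h_or + h_nand) - 30)  # close to 1 only if both hidden units fire

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [one_hidden_layer_net(x1, x2) for x1, x2 in inputs]
for point, value in zip(inputs, outputs):
    print(point, round(value, 4))
```

Rounding the four outputs gives 0, 1, 1, 0, matching XOR; larger weight magnitudes push the outputs closer to exact 0 and 1.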
Given any deep neural network model, denoted by function , and any , function can provide a good approximation of with some value , i.e.,
Given any deep neural network model, denoted by function , it can be approximately represented with the following polynomial summation
where and , and and can be represented according to Theorem 2.
According to Theorem 2, the sigmoid function can be represented as a reconciled polynomial summation. Therefore, we have equation