1 Introduction
In statistics, linear regression is a linear approach for modelling the relationship between a scalar response variable $y$ and one or more explanatory variables $x_1,\dots,x_p$. The parameters of the model
can be estimated via the method of least squares.
The first clear and concise exposition of the method of least squares was published by Legendre [6] in 1805. Later, in 1809, Gauss published his method of calculating the orbits of celestial bodies, in which he claimed to have been in possession of the method of least squares since 1795. The basic theorem of linear regression is as follows.
Theorem 1.1.
Consider the linear model $y_i=\beta^{\top}x_i+\varepsilon_i$, $i=1,\dots,n$, where the errors $\varepsilon_i$ are independent Gaussian variables with mean zero and common variance $\sigma^2$. If the design matrix $X=(x_1,\dots,x_n)^{\top}$ has full column rank, then the least-squares estimator of $\beta$ is
$$\hat{\beta}=(X^{\top}X)^{-1}X^{\top}Y,\qquad Y=(y_1,\dots,y_n)^{\top}.$$
In the above theorem, the errors are assumed to be independent Gaussian variables. Therefore, $y_1,\dots,y_n$ are also independent Gaussian variables.
When the i.i.d. (independent and identically distributed) assumption is not satisfied, the usual method of least squares does not work well. This can be illustrated by the following example.
Example 1.1.
Suppose the sample data (training set) is $\{(x_i,y_i)\}_{i=1}^{n}$.
The following is the result of the ordinary least-squares fit.

We can see from the graph that most of the sample data deviates from the regression line. The main reason is that a number of the data points are the same sample, so the i.i.d. condition is violated.
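To make this concrete, the following is a minimal NumPy sketch of the phenomenon. The synthetic data set is an illustrative assumption (the paper's original data is not reproduced here): most of the points are copies of one sample, and the ordinary least-squares line is pulled towards that repeated sample.

```python
import numpy as np

# Hypothetical data: 2 "genuine" points plus 8 copies of one sample,
# so the training set is far from i.i.d.
x = np.array([0.0, 1.0] + [3.0] * 8)
y = np.array([0.0, 1.0] + [0.5] * 8)

# Ordinary least squares for y = a*x + b.
A = np.stack([x, np.ones_like(x)], axis=1)
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"OLS fit: y = {a:.3f} * x + {b:.3f}")

# The repeated sample dominates the average loss, so the fitted line stays
# near (3.0, 0.5) while the two genuine points have large residuals.
residuals = y - (a * x + b)
print("residuals:", np.round(residuals, 3))
```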
In light of this, Lin et al. [7] studied linear regression without the i.i.d. condition by using the nonlinear expectation framework laid down by Peng [9]. They split the training set into several groups so that within each group the i.i.d. condition can be satisfied. The average loss is used for each group, and the maximum of the average losses among the groups is used as the final loss function. They show that the linear regression problem under the nonlinear expectation framework reduces to the following mini-max problem.
$$\min_{\beta}\;\max_{1\le j\le N}\;\frac{1}{n_j}\sum_{i=1}^{n_j}\bigl(y^{j}_{i}-\beta^{\top}x^{j}_{i}\bigr)^{2} \qquad (1)$$
Here, $(x^{j}_{i},y^{j}_{i})$, $i=1,\dots,n_j$, are the samples in group $j$ and $N$ is the number of groups.
They propose a genetic algorithm to solve this problem. However, the algorithm does not perform well in general.
Motivated by the work of Lin et al. [7] and Peng [9], we consider nonlinear regression problems without the i.i.d. assumption in this paper. We propose a corresponding mini-max problem and outline a numerical algorithm for solving it based on the work of Kiwiel [4]. Moreover, problem (1) in Lin's paper can also be solved well by this algorithm. We also report experiments on regression and machine learning problems.
2 Nonlinear Regression without i.i.d. Assumption
Nonlinear regression is a form of regression analysis in which observational data are modeled by a function that is a nonlinear combination of the model parameters and depends on one or more explanatory variables (see, e.g., [10]). Suppose the sample data (training set) is
$$\{(x_i,y_i)\}_{i=1}^{n},$$
where $x_i\in\mathcal{X}$ and $y_i\in\mathcal{Y}$. $\mathcal{X}$ is called the input space and $\mathcal{Y}$ is called the output (label) space. The goal of nonlinear regression is to find (learn) a function $f\colon\mathcal{X}\to\mathcal{Y}$ from the hypothesis space $\mathcal{H}$ such that $f(x_i)$ is as close to $y_i$ as possible.
The closeness is usually characterized by a loss function $L\bigl(f(x),y\bigr)\ge 0$ such that $L\bigl(f(x),y\bigr)=0$
if and only if $f(x)=y$.
Then the learning problem is reduced to an optimization problem of minimizing the total loss over $f\in\mathcal{H}$.
The following are two kinds of loss functions, namely, the average loss
$$L_{\mathrm{avg}}(f)=\frac{1}{n}\sum_{i=1}^{n}L\bigl(f(x_i),y_i\bigr)$$
and the maximal loss
$$L_{\max}(f)=\max_{1\le i\le n}L\bigl(f(x_i),y_i\bigr).$$
The average loss is popular, particularly in machine learning, since it can be conveniently minimized using online algorithms, which process a few instances at each iteration. The idea behind the average loss is to learn a function that performs equally well on each training point. However, when the i.i.d. assumption is not satisfied, minimizing the average loss can be problematic.
To overcome this difficulty, we use the max-mean as the loss function. First, we split the training set into several groups so that within each group the i.i.d. condition can be satisfied. Then the average loss is used for each group, and the maximum of the average losses among the groups is used as the final loss function. We propose the following mini-max problem for the nonlinear regression problem.
$$\min_{\theta}\;\max_{1\le j\le N}\;\frac{1}{n_j}\sum_{i=1}^{n_j}L\bigl(f(x^{j}_{i};\theta),\,y^{j}_{i}\bigr) \qquad (2)$$
Here, $(x^{j}_{i},y^{j}_{i})$, $i=1,\dots,n_j$, are the samples in group $j$, $n_j$ is the number of samples in group $j$, and $f(\cdot;\theta)\in\mathcal{H}$ is parameterized by $\theta$.
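As a concrete illustration, the following is a minimal NumPy sketch of the max-mean objective in (2) for grouped data with the squared loss. The function and variable names (`max_mean_loss`, `groups`, the linear `model`) are illustrative assumptions, not part of the paper.

```python
import numpy as np

def max_mean_loss(theta, groups, model, loss):
    """Maximum over groups of the average loss within each group (objective (2))."""
    group_means = []
    for X, y in groups:                      # each group: (inputs, labels)
        preds = model(X, theta)
        group_means.append(np.mean(loss(preds, y)))
    return max(group_means)

# Example with a simple linear model and the squared loss.
model = lambda X, theta: X @ theta
loss = lambda pred, y: (pred - y) ** 2

rng = np.random.default_rng(0)
groups = [(rng.normal(size=(100, 3)), rng.normal(size=100)) for _ in range(5)]
theta = np.zeros(3)
print(max_mean_loss(theta, groups, model, loss))
```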
Problem (2) is a generalization of problem (1). Next, we will give a numerical algorithm which solves problem (2).
Remark 2.1.
Peng and Jin [2] put forward a max-mean method to give the parameter estimation when the usual i.i.d. condition is not satisfied. They show that if $Z_1,\dots,Z_N$ are nonlinearly independent and drawn from a maximal distribution, then the optimal unbiased estimation for the upper mean $\overline{\mu}$ is $\max\{Z_1,\dots,Z_N\}$.
This fact, combined with the Law of Large Numbers under nonlinear expectation, leads to the max-mean estimation: the data are split into groups and the maximum of the group averages is taken as the estimate.
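A minimal sketch of the max-mean estimation just described, assuming the data have already been split into groups; the group construction and the numbers below are illustrative only.

```python
import numpy as np

def max_mean_estimate(groups):
    """Estimate the upper mean by the maximum of the group means."""
    return max(np.mean(g) for g in groups)

# Illustrative data: each group has a different (unknown) mean.
rng = np.random.default_rng(1)
groups = [rng.normal(loc=mu, scale=0.1, size=100) for mu in (0.2, 0.5, 0.9)]
print(max_mean_estimate(groups))   # close to the largest group mean, 0.9
```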
3 Algorithm
Problem (2) is a mini-max problem. Mini-max problems arise in many mathematical fields, such as game theory and worst-case optimization. The general mini-max problem is described as
$$\min_{x\in\mathbb{R}^{n}}\;\max_{y\in Y}\;\phi(x,y) \qquad (3)$$
Here, $\phi(x,y)$ is continuous on $\mathbb{R}^{n}\times Y$ and differentiable with respect to $x$, and $Y$ is a compact subset of $\mathbb{R}^{m}$.
Problem (3) was considered theoretically by Klessig and Polak [5] in 1973 and by Panin [8] in 1981. Later, in 1987, Kiwiel [4] gave a concrete algorithm for problem (3).
Kiwiel linearized $\phi(\cdot,y)$ at each iterate $x_k$ and obtained a convex approximation of $F(x)=\max_{y\in Y}\phi(x,y)$ as
$$\widetilde{F}_k(x)=\max_{y\in Y}\bigl\{\phi(x_k,y)+\langle\nabla_x\phi(x_k,y),\,x-x_k\rangle\bigr\}.$$
The next step is to find $x_{k+1}$, which minimizes $\widetilde{F}_k$.
In general, $\widetilde{F}_k$ is not strictly convex with respect to $x$ and thus may not admit a minimum. To overcome this difficulty, he added a regularization term, and the minimization problem is reduced to
$$\min_{x\in\mathbb{R}^{n}}\;\widetilde{F}_k(x)+\frac{1}{2}\|x-x_k\|^{2}.$$
Writing $d=x-x_k$, it can be converted to the following form
$$\min_{d\in\mathbb{R}^{n}}\;\max_{y\in Y}\bigl\{\phi(x_k,y)+\langle\nabla_x\phi(x_k,y),\,d\rangle\bigr\}+\frac{1}{2}\|d\|^{2},$$
which is equivalent to minimizing
$$v+\frac{1}{2}\|d\|^{2}$$
over all $(d,v)\in\mathbb{R}^{n}\times\mathbb{R}$ satisfying
$$\phi(x_k,y)+\langle\nabla_x\phi(x_k,y),\,d\rangle\le v\qquad\text{for all } y\in Y.$$
By duality theory, the above problem is further transformed into maximizing
$$\sum_{j}\lambda_{j}\,\phi(x_k,y_{j})-\frac{1}{2}\Bigl\|\sum_{j}\lambda_{j}\,\nabla_x\phi(x_k,y_{j})\Bigr\|^{2}$$
over all weights $\lambda_{j}\ge 0$ with finitely many $y_{j}\in Y$ and
$$\sum_{j}\lambda_{j}=1.$$
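In the finite case relevant to problem (2), where $Y$ is replaced by the group indices $\{1,\dots,N\}$, the dual problem above is a quadratic program over the probability simplex. The following is a minimal SciPy sketch of solving it and recovering the search direction $d=-\sum_j\lambda_j g_j$; the function name `direction_finding`, the use of SLSQP, and the made-up input values are assumptions of this illustration, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def direction_finding(phi_vals, grads):
    """Solve max over the simplex of sum(l*phi) - 0.5*||sum(l*g)||^2.

    phi_vals: array of shape (N,)   -- phi(x_k, y_j) for each group j
    grads:    array of shape (N, n) -- gradients w.r.t. x for each group j
    Returns the optimal weights and the direction d = -sum_j l_j * g_j.
    """
    N = len(phi_vals)

    def neg_dual(lam):
        g = lam @ grads
        return -(lam @ phi_vals - 0.5 * g @ g)

    cons = ({"type": "eq", "fun": lambda lam: np.sum(lam) - 1.0},)
    res = minimize(neg_dual, np.full(N, 1.0 / N), bounds=[(0.0, 1.0)] * N,
                   constraints=cons, method="SLSQP")
    lam = res.x
    return lam, -(lam @ grads)

# Tiny usage example with made-up values.
phi_vals = np.array([1.0, 2.0, 0.5])
grads = np.array([[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]])
lam, d = direction_finding(phi_vals, grads)
print("weights:", np.round(lam, 3), "direction:", np.round(d, 3))
```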
Denote by $f_{j}(x)$, $j=1,\dots,N$, the average loss of group $j$ regarded as a function of the parameter $x\in\mathbb{R}^{n}$. We give a numerical algorithm for the following mini-max problem.
$$\min_{x\in\mathbb{R}^{n}}\;\max_{1\le j\le N}\;f_{j}(x) \qquad (4)$$
Suppose each $f_{j}$ is differentiable with respect to $x$ and denote $g_{j}(x)=\nabla f_{j}(x)$ and $F(x)=\max_{1\le j\le N}f_{j}(x)$.
Step 1. Initialization
Select an arbitrary starting point $x_{1}\in\mathbb{R}^{n}$, a tolerance $\varepsilon\ge 0$, and a line-search parameter $\sigma\in(0,\tfrac{1}{2}]$. Set $k=1$.
Step 2. Direction Finding
Assume that we have obtained $x_{k}$. For $j=1,\dots,N$, compute $f_{j}(x_{k})$ and $g_{j}(x_{k})$.
Step 2.1. Initialization
Set the inner counter $i=1$ and $J_{1}=\{j_{1}\}$ with $j_{1}\in\arg\max_{1\le j\le N}f_{j}(x_{k})$. Compute $F(x_{k})=f_{j_{1}}(x_{k})$.
Step 2.2. Weight Finding
Solve the following quadratic optimization problem.
$$\max_{\lambda}\;\sum_{j\in J_{i}}\lambda_{j}f_{j}(x_{k})-\frac{1}{2}\Bigl\|\sum_{j\in J_{i}}\lambda_{j}g_{j}(x_{k})\Bigr\|^{2}
\quad\text{subject to}\quad \lambda_{j}\ge 0,\;\sum_{j\in J_{i}}\lambda_{j}=1.$$
Set
$$d_{i}=-\sum_{j\in J_{i}}\lambda_{j}g_{j}(x_{k}),\qquad
v_{i}=\sum_{j\in J_{i}}\lambda_{j}f_{j}(x_{k})-\|d_{i}\|^{2},\qquad
w_{i}=\frac{1}{2}\|d_{i}\|^{2}+F(x_{k})-v_{i}.$$
If $w_{i}\le\varepsilon$, stop; otherwise, goto Step 2.3.
Step 2.3. Primal Optimality Testing
Set $\hat{j}\in\arg\max_{1\le j\le N}\bigl\{f_{j}(x_{k})+\langle g_{j}(x_{k}),d_{i}\rangle\bigr\}$.
If
$$f_{\hat{j}}(x_{k})+\langle g_{\hat{j}}(x_{k}),d_{i}\rangle\le v_{i},$$
goto Step 2.4; otherwise, set $J_{i+1}=J_{i}\cup\{\hat{j}\}$, $i=i+1$, and goto Step 2.2.
Step 2.4. Return
Set $d_{k}=d_{i}$ and $w_{k}=w_{i}$.
If $\|d_{k}\|\le\varepsilon$, stop; otherwise, goto Step 3.
Step 3. Line Search
Set
$$t_{k}=\max\Bigl\{t\in\{1,\tfrac{1}{2},\tfrac{1}{4},\dots\}:\;F(x_{k}+t\,d_{k})\le F(x_{k})-\sigma\,t\,w_{k}\Bigr\}$$
and
$$x_{k+1}=x_{k}+t_{k}\,d_{k}.$$
Set $k=k+1$
and goto Step 2.
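The following is a simplified Python sketch of the above procedure under the assumption that the number of groups $N$ is small: instead of building up the index set $J_{i}$ as in Steps 2.1-2.4, it solves the direction-finding quadratic program over all groups at once and then performs the backtracking line search of Step 3. All function names and parameter values are illustrative, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def direction(phi_vals, G):
    """Dual QP over the simplex: max sum(l*phi) - 0.5*||sum(l*g)||^2."""
    N = len(phi_vals)
    neg_dual = lambda lam: -(lam @ phi_vals - 0.5 * np.sum((lam @ G) ** 2))
    res = minimize(neg_dual, np.full(N, 1.0 / N), bounds=[(0.0, 1.0)] * N,
                   constraints=({"type": "eq", "fun": lambda l: l.sum() - 1.0},),
                   method="SLSQP")
    lam = res.x
    return lam, -(lam @ G)

def minimax_solve(fs, grads, x0, eps=1e-6, sigma=0.5, max_iter=200):
    """Minimize max_j f_j(x) by direction finding plus a backtracking line search."""
    x = np.asarray(x0, dtype=float)
    F = lambda z: max(f(z) for f in fs)
    for _ in range(max_iter):
        phi_vals = np.array([f(x) for f in fs])
        G = np.stack([g(x) for g in grads])
        lam, d = direction(phi_vals, G)
        v = lam @ phi_vals - d @ d            # model value at the direction d
        w = 0.5 * d @ d + phi_vals.max() - v  # predicted decrease
        if w <= eps:                          # approximate stationarity
            break
        t = 1.0
        while F(x + t * d) > F(x) - sigma * t * w and t > 1e-12:
            t *= 0.5                          # backtracking line search
        x = x + t * d
    return x

# Usage: minimize max(f1, f2) for two simple quadratics (illustrative only).
f1 = lambda x: (x[0] - 1.0) ** 2 + x[1] ** 2
f2 = lambda x: (x[0] + 1.0) ** 2 + x[1] ** 2
g1 = lambda x: np.array([2.0 * (x[0] - 1.0), 2.0 * x[1]])
g2 = lambda x: np.array([2.0 * (x[0] + 1.0), 2.0 * x[1]])
print(minimax_solve([f1, f2], [g1, g2], x0=[3.0, 2.0]))  # roughly [0, 0]
```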
4 Experiment
4.1 The Linear Regression Case
Example 1.1 can be solved numerically by the above algorithm with the squared loss.
The corresponding optimization problem is the linear special case of problem (2), namely problem (1).
The figure below summarizes the numerical result obtained with the algorithm in Section 3. It can be seen that the method using the maximal loss function (black line) performs better than the traditional least-squares method (pink line).

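For small problems of this kind, the max-mean objective for the linear case can also be handed to a generic optimizer as a cross-check. The following sketch uses scipy.optimize.minimize with the Nelder-Mead method on synthetic grouped data; the data, the group structure, and the choice of optimizer are assumptions of this illustration, not the paper's setup.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic grouped data: the same linear relation in every group, but the
# group sizes differ wildly, so pooling and averaging would over-weight the
# large group, while the max of group losses treats the groups evenly.
true_beta = np.array([2.0, -1.0])
groups = []
for size in (5, 5, 200):
    X = rng.normal(size=(size, 2))
    y = X @ true_beta + rng.normal(scale=0.1, size=size)
    groups.append((X, y))

def max_group_mse(beta):
    return max(np.mean((X @ beta - y) ** 2) for X, y in groups)

res = minimize(max_group_mse, x0=np.zeros(2), method="Nelder-Mead")
print("estimated beta:", np.round(res.x, 3))
```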
4.2 Machine Learning
In this case, we use the MNIST database of handwritten digits (http://yann.lecun.com/exdb/mnist/) to perform the experiment. Many machine learning models cater to the identification of handwritten digits. We use a multi-layer perceptron model with three hidden layers along with an input layer and an output layer. The numbers of neurons in the layers are 784, 50, 20, 12, and 10, respectively.
The training data is chosen as follows. First, we choose 1000 training samples and split them randomly into $N=10$ groups, named $G_{1},\dots,G_{10}$. Each group has 100 i.i.d. samples. Then we set the group loss
$$l_{j}(\theta)=\frac{1}{100}\sum_{(x,y)\in G_{j}}L\bigl(f(x;\theta),y\bigr),\qquad j=1,\dots,10,$$
where $\theta$ denotes the parameters of the network. Two different methods are applied. One uses the average group loss
$$L_{\mathrm{avg}}(\theta)=\frac{1}{10}\sum_{j=1}^{10}l_{j}(\theta),$$
and the other uses the maximal group loss
$$L_{\max}(\theta)=\max_{1\le j\le 10}l_{j}(\theta).$$
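A minimal PyTorch sketch of training with the maximal group loss, using the 784-50-20-12-10 perceptron described above. The choice of cross-entropy loss, the Adam optimizer, and the random tensors standing in for the MNIST groups are assumptions of this illustration rather than details given in the paper.

```python
import torch
import torch.nn as nn

# Multi-layer perceptron with layers 784-50-20-12-10.
model = nn.Sequential(
    nn.Linear(784, 50), nn.ReLU(),
    nn.Linear(50, 20), nn.ReLU(),
    nn.Linear(20, 12), nn.ReLU(),
    nn.Linear(12, 10),
)
criterion = nn.CrossEntropyLoss()          # assumed per-sample loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in for the 10 groups of 100 MNIST samples each.
groups = [(torch.randn(100, 784), torch.randint(0, 10, (100,))) for _ in range(10)]

for epoch in range(5):
    optimizer.zero_grad()
    group_losses = torch.stack([criterion(model(X), y) for X, y in groups])
    loss = group_losses.max()              # maximal group loss
    # loss = group_losses.mean()           # average group loss, for comparison
    loss.backward()
    optimizer.step()
    print(epoch, loss.item())
```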
10000 additional test data are used to test these two models. The method using the maximal loss achieves a higher test accuracy than the method using the average loss.
In this experiment, the whole training set is not i.i.d., while each subgroup is i.i.d. It turns out that the method using the maximal loss performs better than the method using the average loss. In the last 20 years, deep learning with many more hidden layers and other machine learning algorithms have achieved accuracy over 99% on handwriting recognition and other problems. However, we think that the method in this paper can also improve performance when the training set is not i.i.d.
5 Conclusion
In this paper, we consider a class of nonlinear regression problems without the assumption that the data are independent and identically distributed. We propose a corresponding mini-max problem for nonlinear regression and outline a numerical algorithm. The algorithm can be applied to regression and machine learning problems, and yields better results than traditional regression and machine learning methods when the i.i.d. assumption fails.
Acknowledgement
The authors would like to thank Professor Shige Peng for useful discussions.
References
- [1] Ben-Israel, Adi, Greville, Thomas N.E. (2003). Generalized inverses: Theory and applications (2nd ed.). New York, NY: Springer.
- [2] Hanqing Jin, Shige Peng (2016). Optimal Unbiased Estimation for Maximal Distribution. https://arxiv.org/abs/1611.07994.
- [3] Kendall, M. G., Stuart, A. (1968). The Advanced Theory of Statistics, Volume 3: Design and Analysis, and Time-Series (2nd ed.). London: Griffin.
- [4] Kiwiel, K.C. (1987). A Direct Method of Linearization for Continuous Minimax Problems. Journal of Optimization Theory and Applications, 55, 271-287.
- [5] Klessig, R. and E. Polak (1973). An Adaptive Precision Gradient Method for Optimal Control. SIAM Journal on Control, 11, 80-93.
- [6] Legendre, Adrien-Marie (1805). Nouvelles méthodes pour la détermination des orbites des comètes.
- [7] Lin, L., Shi, Y., Wang, X., and Yang, S. (2013). Sublinear Expectation Linear Regression. Statistics.
- [8] Panin, V.M. (1981). Linearization Method for Continuous Min-max Problems. Kibernetika, 2, 75-78.
- [9] Peng, S. (2005). Nonlinear expectations and nonlinear Markov chains. Chin. Ann. Math., 26B(2), 159-184.
- [10] Seber, G. A. F., Wild, C. J. (1989). Nonlinear Regression. New York: John Wiley and Sons.