 # Nonlinear Regression without i.i.d. Assumption

In this paper, we consider a class of nonlinear regression problems without the assumption that the data are independent and identically distributed (i.i.d.). We propose a corresponding mini-max problem for nonlinear regression and outline a numerical algorithm for solving it. The algorithm can be applied to regression and machine learning problems, and yields better results than traditional regression and machine learning methods when the i.i.d. assumption fails.


## 1 Introduction

In statistics, linear regression is a linear approach for modelling the relationship between a response variable $y$ and one or more explanatory variables denoted $x=(x_1,\cdots,x_d)^\top$:

$$y=w^\top x+b.$$

The parameters $w$ and $b$ can be estimated via the method of least squares.

The first clear and concise exposition of the method of least squares was published by Legendre in 1805. Later in 1809, Gauss published his method of calculating the orbits of celestial bodies. In that work he claimed to have been in possession of the method of least squares since 1795. Here is the basic theorem of linear regression.

###### Theorem 1.1.

Suppose $(x_i,y_i)$, $i=1,\cdots,m$, are drawn from the linear model with

$$y_i=w^\top x_i+b+\varepsilon_i,$$

where the error terms $\varepsilon_i$ are independent Gaussian variables with mean $0$ and variance $\sigma^2$. Denote

$$A=\begin{pmatrix}x_{11}&x_{12}&\cdots&x_{1d}&1\\x_{21}&x_{22}&\cdots&x_{2d}&1\\\vdots&\vdots&\ddots&\vdots&\vdots\\x_{m1}&x_{m2}&\cdots&x_{md}&1\end{pmatrix},\qquad c=\begin{pmatrix}y_1\\y_2\\\vdots\\y_m\end{pmatrix}.$$

Then the least-squares solution is

$$(w_1,w_2,\cdots,w_d,b)^\top=A^{+}c.$$

Here, $A^{+}$ is the Moore–Penrose inverse of $A$ (for its definition and properties, see Ben-Israel and Greville (2003)).

In the above theorem, the errors $\varepsilon_i$ are assumed to be independent Gaussian variables. Therefore, the $y_i$ are also independent Gaussian variables.
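As an illustration of Theorem 1.1, the least-squares solution can be computed with NumPy's Moore–Penrose pseudoinverse; the data below are made up for the sketch:

```python
import numpy as np

# Illustrative data: m = 5 samples, d = 2 features.
X = np.array([[0.1, 1.0],
              [0.4, 0.8],
              [0.9, 0.2],
              [1.5, 0.5],
              [2.0, 1.1]])
y = np.array([1.2, 1.5, 2.1, 3.0, 3.8])

# Build A by appending a column of ones for the intercept b.
A = np.hstack([X, np.ones((X.shape[0], 1))])

# (w_1, ..., w_d, b)^T = A^+ c, with A^+ the Moore-Penrose inverse.
params = np.linalg.pinv(A) @ y
w, b = params[:-1], params[-1]

# np.linalg.lstsq solves the same least-squares problem directly
# and agrees with the pseudoinverse formula.
params_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
```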

When the i.i.d. (independent and identically distributed) assumption is not satisfied, the usual method of least squares does not work well. This can be illustrated by the following example.

###### Example 1.1.

Suppose the sample data (training set) is

$$\begin{aligned}
&(x_1,y_1)=(x_2,y_2)=\cdots=(x_{500},y_{500})=(0.15,1.48),\\
&(x_{501},y_{501})=(x_{502},y_{502})=\cdots=(x_{1000},y_{1000})=(0.43,1.45),\\
&(x_{1001},y_{1001})=(x_{1002},y_{1002})=\cdots=(x_{1500},y_{1500})=(0.04,1.59),\\
&(x_{1501},y_{1501})=(1.23,3.01),\quad(x_{1502},y_{1502})=(0.63,2.89),\quad(x_{1503},y_{1503})=(1.64,4.54),\\
&(x_{1504},y_{1504})=(0.98,3.32),\quad(x_{1505},y_{1505})=(1.92,5.0),\quad(x_{1506},y_{1506})=(1.26,3.96),\\
&(x_{1507},y_{1507})=(1.77,3.92),\quad(x_{1508},y_{1508})=(1.1,2.8),\quad(x_{1509},y_{1509})=(1.22,2.84),\\
&(x_{1510},y_{1510})=(1.48,4.52),\quad(x_{1511},y_{1511})=(0.71,3.17),\quad(x_{1512},y_{1512})=(0.77,2.59),\\
&(x_{1513},y_{1513})=(1.89,5.1),\quad(x_{1514},y_{1514})=(1.31,3.17),\quad(x_{1515},y_{1515})=(1.31,2.91),\\
&(x_{1516},y_{1516})=(1.63,4.02),\quad(x_{1517},y_{1517})=(0.56,1.79).
\end{aligned}$$

The usual least-squares fit gives

$$y=0.4711\,x+1.4258.$$

We can see from the graph that most of the sample data deviate from the regression line. The main reason is that $(x_1,y_1),\cdots,(x_{1500},y_{1500})$ consist of only three distinct samples, each repeated 500 times, so the i.i.d. condition is violated.
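The least-squares line can be reproduced numerically; the following is a minimal NumPy sketch that builds the 1517 training points of Example 1.1 and runs ordinary least squares (the last digits of the coefficients may differ slightly):

```python
import numpy as np

# Three samples repeated 500 times each, plus 17 distinct samples.
x = np.concatenate([
    np.full(500, 0.15), np.full(500, 0.43), np.full(500, 0.04),
    [1.23, 0.63, 1.64, 0.98, 1.92, 1.26, 1.77, 1.1, 1.22,
     1.48, 0.71, 0.77, 1.89, 1.31, 1.31, 1.63, 0.56],
])
y = np.concatenate([
    np.full(500, 1.48), np.full(500, 1.45), np.full(500, 1.59),
    [3.01, 2.89, 4.54, 3.32, 5.0, 3.96, 3.92, 2.8, 2.84,
     4.52, 3.17, 2.59, 5.1, 3.17, 2.91, 4.02, 1.79],
])

# Ordinary least squares on the full (non-i.i.d.) training set.
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
# The 1500 repeated points dominate the fit, pulling the line
# away from the 17 distinct samples.
```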

In light of this, Lin et al. studied linear regression without the i.i.d. condition by using the nonlinear expectation framework laid down by Peng. They split the training set into several groups such that within each group the i.i.d. condition can be taken to hold. The average loss is used for each group, and the maximum of the average losses among groups is used as the final loss function. They show that the linear regression problem under the nonlinear expectation framework reduces to the following mini-max problem:

$$\min_{w,b}\max_{1\le j\le N}\frac{1}{M}\sum_{l=1}^{M}\bigl(w^\top x_{jl}+b-y_{jl}\bigr)^2. \tag{1}$$

They propose a genetic algorithm to solve this problem. However, the algorithm does not work well in general.

Motivated by Lin and Peng's work, we consider nonlinear regression problems without the i.i.d. assumption in this paper. We propose a corresponding mini-max problem and outline a numerical algorithm for solving it based on the work of Kiwiel. Meanwhile, problem (1) in Lin's paper can also be solved well by this algorithm. We also report experiments on regression and machine learning problems.

## 2 Nonlinear Regression without i.i.d. Assumption

Nonlinear regression is a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more explanatory variables (see, e.g., Seber and Wild (1989)).

Suppose the sample data (training set) is

$$S=\{(x_1,y_1),(x_2,y_2),\cdots,(x_m,y_m)\},$$

where $x_i\in X$ and $y_i\in Y$. $X$ is called the input space and $Y$ is called the output (label) space. The goal of nonlinear regression is to find (learn) a function $g_\theta$ from the hypothesis space $\{g_\theta:\theta\in\Theta\}$ such that $g_\theta(x_i)$ is as close to $y_i$ as possible.

The closeness is usually characterized by a loss function $\varphi\ge 0$ such that

$$\varphi\bigl(g_\theta(x_1),y_1,\cdots,g_\theta(x_m),y_m\bigr)\to 0$$

if and only if

$$g_\theta(x_i)-y_i\to 0,\quad 1\le i\le m.$$

Then the learning problem reduces to an optimization problem of minimizing $\varphi$ over $\theta$.

The following are two common loss functions, namely the average loss and the maximal loss:

$$\varphi_2=\frac{1}{m}\sum_{j=1}^{m}\bigl(g_\theta(x_j)-y_j\bigr)^2,\qquad
\varphi_\infty=\max_{1\le j\le m}\bigl(g_\theta(x_j)-y_j\bigr)^2.$$

The average loss is popular, particularly in machine learning, since it can be conveniently minimized using online algorithms, which process a few instances at each iteration. The idea behind the average loss is to learn a function that performs equally well on each training point. However, when the i.i.d. assumption is not satisfied, minimizing the average loss may become problematic.

To overcome this difficulty, we use the max-mean as the loss function. First, we split the training set into several groups such that within each group the i.i.d. condition can be taken to hold. Then the average loss is used for each group, and the maximum of the average losses among groups is used as the final loss function. This leads to the following mini-max problem for nonlinear regression:

$$\min_{\theta}\max_{1\le j\le N}\frac{1}{n_j}\sum_{l=1}^{n_j}\bigl(g_\theta(x_{jl})-y_{jl}\bigr)^2. \tag{2}$$

Here, $n_j$ is the number of samples in group $j$.

Problem (2) is a generalization of problem (1). Next, we will give a numerical algorithm which solves problem (2).
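As a sketch, the max-mean objective in problem (2) can be written in a few lines of NumPy; the model `g` and the groups below are toy placeholders:

```python
import numpy as np

def max_mean_loss(groups, g):
    """Maximum over groups of the mean squared error, as in problem (2).

    groups: list of (x_array, y_array) pairs, one pair per group
            (group sizes n_j may differ).
    g:      the regression function g_theta.
    """
    return max(np.mean((g(x) - y) ** 2) for x, y in groups)

# Toy usage with a placeholder quadratic model.
g = lambda x: 2.0 * x ** 2
groups = [
    (np.array([0.0, 1.0]), np.array([0.0, 2.0])),  # group 1: exact fit, mean loss 0
    (np.array([1.0, 2.0]), np.array([2.0, 9.0])),  # group 2: worst group, mean loss 0.5
]
loss = max_mean_loss(groups, g)  # loss == 0.5
```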

###### Remark 2.1.

Peng and Jin put forward a max-mean method for parameter estimation when the usual i.i.d. condition is not satisfied. They show that if $Z_1,Z_2,\cdots,Z_k$ are drawn from the maximal distribution and are nonlinearly independent, then the optimal unbiased estimator for the upper mean $\overline{\mu}$ is

$$\max\{Z_1,Z_2,\cdots,Z_k\}.$$

This fact, combined with the Law of Large Numbers under nonlinear expectation (Theorem 19 in Jin and Peng (2016)), leads to the max-mean estimation of $\overline{\mu}$. We borrow this idea and use the max-mean as the loss function for the nonlinear regression problem.

## 3 Algorithm

Problem (2) is a mini-max problem. Mini-max problems arise in various mathematical fields, such as game theory and worst-case optimization. The general mini-max problem is described as

$$\min_{u\in\mathbb{R}^N}\max_{v\in V}h(u,v). \tag{3}$$

Here, $h$ is continuous on $\mathbb{R}^N\times V$ and differentiable with respect to $u$, and $V$ is a compact subset of a Euclidean space.

Problem (3) was considered theoretically by Klessig and Polak in 1973 and Panin in 1981. Later in 1987, Kiwiel gave a concrete algorithm for problem (3).

Kiwiel linearized $h$ at each iterate $u_k$ and obtained the convex approximation of $h$:

$$\hat{h}(u)=\max_{v\in V}\bigl\{h(u_k,v)+\langle\nabla_u h(u_k,v),u-u_k\rangle\bigr\}.$$

The next step is to find $u_{k+1}$, which minimizes $\hat{h}$.

In general, $\hat{h}$ is not strictly convex with respect to $u$ and thus may not admit a minimum. To overcome this difficulty, he added a regularization term, and the minimization problem becomes

$$\min_{u\in\mathbb{R}^N}\Bigl\{\hat{h}(u)+\frac{1}{2}|u-u_k|^2\Bigr\}.$$

It can be converted to the following form:

$$\min_{d\in\mathbb{R}^N}\Bigl\{\max_{v\in V}\bigl\{h(u_k,v)+\langle\nabla_u h(u_k,v),d\rangle\bigr\}+\frac{1}{2}|d|^2\Bigr\},$$

which is equivalent to minimizing

$$\frac{1}{2}|d|^2+a$$

over all $(d,a)\in\mathbb{R}^N\times\mathbb{R}$ satisfying

$$h(u_k,v)+\langle\nabla_u h(u_k,v),d\rangle\le a,\quad\forall\,v\in V.$$

By duality theory, the above problem is further transformed into

$$\min_{\lambda}\ \frac{1}{2}\Bigl|\sum_{v\in V}\lambda_v\nabla_u h(u_k,v)\Bigr|^2-\sum_{v\in V}\lambda_v h(u_k,v)$$

over all $\lambda=(\lambda_v)_{v\in V}$ with finitely many nonzero components and

$$\lambda_v\ge 0\ \ \forall\,v\in V,\qquad\sum_{v\in V}\lambda_v=1.$$
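To see where this dual comes from, write down the Lagrangian of the constrained problem; the following is a standard duality sketch, included for completeness:

```latex
% Primal: minimize (1/2)|d|^2 + a subject to
%   h(u_k, v) + <grad_u h(u_k, v), d> <= a  for all v in V.
% Lagrangian with multipliers lambda_v >= 0:
L(d, a, \lambda) = \frac{1}{2}|d|^2 + a
  + \sum_{v \in V} \lambda_v \Bigl( h(u_k, v)
  + \langle \nabla_u h(u_k, v), d \rangle - a \Bigr).

% Stationarity in a forces sum_v lambda_v = 1;
% stationarity in d gives the optimal direction
d^{*} = -\sum_{v \in V} \lambda_v \nabla_u h(u_k, v).

% Substituting d^* back, the dual maximizes
%   -(1/2)|sum_v lambda_v grad_u h(u_k,v)|^2 + sum_v lambda_v h(u_k,v),
% i.e. it minimizes the dual objective stated above.
```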

Our problem (2) is a special case of problem (3) with $V=\{1,2,\cdots,N\}$ and

$$h(u,j)=\frac{1}{n_j}\sum_{l=1}^{n_j}\bigl(g_u(x_{jl})-y_{jl}\bigr)^2.$$

Denote $f_j(u)=h(u,j)$; we now give a numerical algorithm for the following mini-max problem:

$$\min_{u\in\mathbb{R}^n}\max_{1\le j\le N}f_j(u). \tag{4}$$

Suppose each $f_j$ is differentiable with respect to $u$, and denote

$$\Phi(u)=\max_{1\le j\le N}f_j(u).$$

Step 1. Initialization

Select an arbitrary $u_0\in\mathbb{R}^n$. Set $k=0$ and

$$\begin{aligned}
&\text{termination accuracy } \xi=10^{-6},\\
&\text{linear approximation parameter } m=2\times 10^{-4},\\
&\text{line search parameter } c=10^{-4},\\
&\text{stepsize factor } \sigma=0.5.
\end{aligned}$$

Step 2. Direction Finding

Assume that we have obtained the iterate $u_k$. The search direction is constructed by the following inner loop over $i=1,2,\cdots$.

Step 2.1. Initialization

Set $i=1$ and choose $v_1\in\arg\max_{1\le j\le N}f_j(u_k)$. Compute

$$p_0=\nabla f_{v_1}(u_k),\qquad\theta_0=f_{v_1}(u_k).$$

Step 2.2. Weight Finding

Solve the following quadratic optimization problem.

$$\mu_i=\mathop{\arg\min}_{\mu\in\mathbb{R}}\Bigl\{\frac{1}{2}\bigl\|(1-\mu)p_{i-1}+\mu\nabla f_{v_i}(u_k)\bigr\|^2-(1-\mu)\theta_{i-1}-\mu f_{v_i}(u_k)\Bigr\}.$$

Set

$$p_i=(1-\mu_i)p_{i-1}+\mu_i\nabla f_{v_i}(u_k),\qquad\theta_i=(1-\mu_i)\theta_{i-1}+\mu_i f_{v_i}(u_k),$$

$$\Psi_i=-\bigl(\|p_i\|^2+\Phi(u_k)-\theta_i\bigr).$$

If $\Psi_i\ge-\xi$, stop; otherwise, go to Step 2.3.

Step 2.3. Primal Optimality Testing

Set $d_i=-p_i$ and

$$v_{i+1}=\mathop{\arg\max}_{1\le j\le N}\bigl\{f_j(u_k)+\langle\nabla f_j(u_k),d_i\rangle\bigr\}.$$

If

$$f_{v_{i+1}}(u_k)+\langle\nabla f_{v_{i+1}}(u_k),d_i\rangle-\Phi(u_k)\le m\Psi_i,$$

go to Step 2.4; otherwise, set $i=i+1$ and go to Step 2.2.

Step 2.4. Return

Set

$$d_k=-p_i,\qquad\eta_k=\Psi_i.$$

If $\eta_k\ge-\xi$, stop; otherwise, go to Step 3.

Step 3. Line Search

Set

$$I_k=\bigl\{\sigma^j:\Phi(u_k+\sigma^j d_k)-\Phi(u_k)\le c\,\sigma^j\eta_k,\ j=0,1,\cdots\bigr\}$$

and

$$\alpha_k=\max I_k.$$

Set

$$u_{k+1}=u_k+\alpha_k d_k,\qquad k=k+1,$$

and go to Step 2.
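For concreteness, the steps above can be sketched in NumPy for the finite mini-max problem (4). Where the description leaves details open, we make our own choices: $\mu$ is clamped to $[0,1]$, ties in the argmax go to the smallest index, and iteration caps are added as safeguards.

```python
import numpy as np

def minimax_solve(fs, grads, u0, xi=1e-6, m=2e-4, c=1e-4, sigma=0.5,
                  max_outer=200, max_inner=50):
    """Sketch of the Section 3 algorithm for min_u max_j f_j(u)."""
    u = np.asarray(u0, dtype=float)
    N = len(fs)
    for _ in range(max_outer):
        fvals = np.array([f(u) for f in fs])
        Phi = fvals.max()
        # Step 2.1: start the bundle from a worst-case index.
        v = int(np.argmax(fvals))
        p = np.asarray(grads[v](u), dtype=float)
        theta = fvals[v]
        eta = None
        for _ in range(max_inner):
            # Step 2.2: fold the linearization at v into the aggregate (p, theta).
            q = np.asarray(grads[v](u), dtype=float)
            fv = fs[v](u)
            denom = np.dot(q - p, q - p)
            mu = 0.0 if denom < 1e-16 else float(
                np.clip((fv - theta - np.dot(p, q - p)) / denom, 0.0, 1.0))
            p = (1.0 - mu) * p + mu * q
            theta = (1.0 - mu) * theta + mu * fv
            psi = -(np.dot(p, p) + Phi - theta)
            if psi >= -xi:          # Psi_i near 0: u_k is (near-)optimal
                eta = psi
                break
            # Step 2.3: primal optimality test along d = -p.
            d = -p
            lin = np.array([fs[j](u) + np.dot(grads[j](u), d) for j in range(N)])
            v = int(np.argmax(lin))
            if lin[v] - Phi <= m * psi:
                eta = psi           # Step 2.4: accept the direction
                break
        if eta is None or eta >= -xi:
            return u
        d = -p
        # Step 3: backtracking line search on Phi.
        alpha = 1.0
        for _ in range(50):
            if max(f(u + alpha * d) for f in fs) - Phi <= c * alpha * eta:
                break
            alpha *= sigma
        u = u + alpha * d
    return u

# Toy check: minimize max{(u-1)^2, (u+1)^2}; the minimax point is u = 0.
fs = [lambda u: (u[0] - 1.0) ** 2, lambda u: (u[0] + 1.0) ** 2]
grads = [lambda u: np.array([2.0 * (u[0] - 1.0)]),
         lambda u: np.array([2.0 * (u[0] + 1.0)])]
u_star = minimax_solve(fs, grads, np.array([0.5]))
```

On this toy pair the iteration reaches $u=0$, where both component functions equal $1$, after a single outer step.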

## 4 Experiment

### 4.1 The Linear Regression Case

Example 1.1 can be solved numerically by the above algorithm with

$$f_j(w,b)=(w x_j+b-y_j)^2,\quad j=1,2,\cdots,1517.$$

The corresponding optimization problem is

$$\min_{w,b}\max_{1\le j\le 1517}(w x_j+b-y_j)^2.$$

The numerical result using the algorithm in Section 3 is

$$y=1.7589\,x+1.2591.$$

The following picture summarizes the result. It can be seen that the method using the maximal loss function (black line) performs better than the traditional least-squares method (pink line).

### 4.2 Machine Learning

In this case, we use the MNIST database of handwritten digits to perform the experiment. Many machine learning models cater to the identification of handwritten digits; we use a multi-layer perceptron with an input layer, three hidden layers, and an output layer, whose numbers of neurons are 784, 50, 20, 12, and 10, respectively.

The training data are chosen as follows. First, we choose 1000 training samples and split them randomly into 10 groups, named $G_1,G_2,\cdots,G_{10}$; each group then consists of 100 i.i.d. samples. Then we set

$$A_1=A_2=\cdots=A_{91}=G_1,\quad A_{92}=G_2,\ \cdots,\ A_{100}=G_{10}.$$

Two different methods are applied. One uses the average group loss

$$\frac{1}{100}\sum_{j=1}^{100}\Bigl\{\frac{1}{100}\sum_{l=1}^{100}\bigl|h_\theta(x_{jl})-y_{jl}\bigr|^2\Bigr\}.$$

The other uses the maximal group loss

$$\max_{1\le j\le 100}\Bigl\{\frac{1}{100}\sum_{l=1}^{100}\bigl|h_\theta(x_{jl})-y_{jl}\bigr|^2\Bigr\}.$$

10000 additional test data are used to test these two models. The accuracy of the method using maximal loss is , while the accuracy of the method using average loss is .

In this experiment, the whole training set is not i.i.d., while each subgroup is i.i.d. It turns out that the method using the maximal loss performs better than the method using the average loss. In the last 20 years, deep learning with many more hidden layers and other machine learning algorithms have achieved very high accuracy in handwriting recognition and other problems. However, we think that the method in this paper can still improve performance when the training set is not i.i.d.

## 5 Conclusion

In this paper, we considered a class of nonlinear regression problems without the assumption that the data are independent and identically distributed. We proposed a corresponding mini-max problem for nonlinear regression and outlined a numerical algorithm for solving it. The algorithm can be applied to regression and machine learning problems, and yields better results than traditional methods when the i.i.d. assumption fails.

## Acknowledgement

The authors would like to thank Professor Shige Peng for useful discussions.

## References

•  Ben-Israel, Adi, Greville, Thomas N.E. (2003). Generalized inverses: Theory and applications (2nd ed.). New York, NY: Springer.
•  Jin, H., Peng, S. (2016). Optimal Unbiased Estimation for Maximal Distribution. https://arxiv.org/abs/1611.07994.
•  Kendall, M. G., Stuart, A. (1968). The Advanced Theory of Statistics, Volume 3: Design and Analysis, and Time-Series (2nd ed.). London: Griffin.
•  Kiwiel, K.C. (1987). A Direct Method of Linearization for Continuous Minimax Problems. Journal of Optimization Theory and Applications, 55, 271-287.
•  Klessig, R. and E. Polak (1973). An Adaptive Precision Gradient Method for Optimal Control. SIAM Journal on Control, 11, 80-93.
•  Legendre, Adrien-Marie (1805). Nouvelles methodes pour la determination des orbites des cometes.
•  Lin, L., Shi, Y., Wang, X., and Yang, S. (2013). Sublinear Expectation Linear Regression. Statistics.
•  Panin, V.M. (1981). Linearization Method for Continuous Min-max Problems. Kibernetika, 2, 75-78.
•  Peng, S. (2005). Nonlinear expectations and nonlinear Markov chains. Chin. Ann. Math., 26B(2), 159-184.
•  Seber, G. A. F., Wild, C. J. (1989). Nonlinear Regression. New York: John Wiley and Sons.