1 Introduction
Being a cornerstone in statistics, regression is a fundamental prediction technique. Due to its simplicity and robustness, it has been widely used in numerous disciplines of science and engineering as well as other fields such as economics and business. Once a regression model is established, we are able to adopt it for predicting expected values within a certain level of confidence.
Calculating regression coefficients is an essential statistical function of various software and systems. It has been integrated as part of both commercial software such as MATLAB (MathWorks), SPSS Statistics (IBM) and Excel (Microsoft) and opensource libraries including GNU Scientific Library (C/C++), Scikitlearn, Scipy, StatsModels (Python), Accord, MathNet (F#), and GoLearn (Go). On the other hand, new systems may be built from scratch in some situations, such as the development of statistical libraries in new programming language. Besides, some programmers may need to modify the regression, for examples, transforming the variables
(lin2000; luu2015; wu2017; luu2018) or parallelizing the computations (baboulin2006; dau2019). In most cases, the calculation involves extensive computations. It is thus prone to programming bugs, and data mishandling. Blunders in computing regression by these software or systems would definitely cast doubts on the reliability and robustness of data and scientific analyses based on them.Advancements in machine learning enable us to cope with the growth of big data, having the fastest rate over recent decades
(goodfellow2016). Despite the rapid change, linear regression remains as the technique to be widely used due to two main reasons. First, it has been serving as a baseline for evaluating more sophisticated modelling systems for years (weisberg2005; montgomery2012). For example, simple linear regression is frequently adopted to assess the performance of various artificial neural networks (ANN) systems
(bibault2018; ahmad2019; mondal2020; ayoubloo2011). Second, the ANN equations are reducible to the linear regression form, being referred to as deep linear network (saxe2014; lathuiliere2019). Recently, the deep linear network has received lots of attention as it helps to better understand the properties and reliability of general ANN (lathuiliere2019; lee2019; bah2020). Reliability of many systems therefore somehow relies on the correctness of linear regression calculation.Traditionally, the testing of regression system is difficult because we do not have a test oracle in general. For example, if we have some data points, say and
, and want to find a line that has the predicted values “closest” to these points. (In case of higher dimensions, the plane or hyperplane is used instead of line.) When the regression program returns the line as
, we have no simple way to tell whether or not this line is the best fitted line. Testers may find the regression line manually by solving the required equations constructed from these data points. However, this is not viable and is very tedious.Testing such programs becomes even more difficult in reality because we often never know the true line due to roundoff errors (higham2002; solomon2015)
. In other words, we are not able to determine whether or not the computed value from a regression algorithm is the exact estimator that should be returned by the algorithm. In fact, previous studies
(higham2002; gratton2013; marland2018) pointed out that solving the linear regression may cause a large accumulation of roundoff and truncation errors because achieving the regression solutions requires intensive matrix calculations. When the true value is unavailable, there is no direct way to validate the regression systems.Incorrect linear regression computation may lead to severe consequences. In many cases, even a marginal difference in estimating coefficients of regression may cause a huge impact. For instance, while a previous study estimated the global mean sea level has been rising at a rate of mm per year since the beginning of this century (church2011), another work suggested that the rate should be mm per year (hay2015). Due to the acceleration of global warming, the two regression coefficients after 50 years will lead to a gap larger than 2.5 cm in the predicted sea level that miscounted multi billion dollars per year in flood damage cost, according to a recent research (jevrejeva2018). Given that fact, while roundoff error was attributed to the US Army’s Patriot Missile Defense System failure in shooting down the SCUD missile causing death of 28 American soldiers (zhivich2009), the rounding is also known to deteriorate the accuracy of regression (higham2002). Indeed, a faulty regression program may cause damages more severe than the roundoff error. Validating the accuracy of regression is therefore not solely a computational issue but also has reallife impacts. Considering its very diversified range of applications, testing the correctness of regression systems is certainly of great importance.
In view of such problem, we should find an appropriate approach to tackle the testing of linear regressionbased systems. The software testing technique of Metamorphic Testing (MT), does allow us to alleviate the absence of oracle, and to generate test sets automatically (segura2016; chen2017). It has been applied successfully in testing various software systems, such as numerical analysis (chan1998)
, partial differential equations
(chen2002) search engines (zhou2012; zhou2016), Web Application Programming Interfaces (segura2018), feature models (segura2010; segura2011), scientific software (ding2016; lin2018), programming language compilers (le2014), graphics shader compilers (donaldson2017), contextsensitive middlewarebased applications (chan2005), object detection (wang2019), cybersecurity (chen2016; mai2020).Recently, MT has been applied efficiently to validate various machine learning and artificial intelligence systems
(xie2011; xie2018; segura2018b). A testing tool named DeepTest was developed to detect the erroneous responses of autonomous cardriving systems backed by deep neural networks (DNN) (tian2018). It helped revealing thousands of dangerous situations under different weather conditions affecting the functioning of DNNbased programs that were submitted to the Udacity selfdriving challenge. Another example in testing a reallife driverless car system named Apollo of Baidu (zhou2019) pointed out that the random noises added will significantly alter the system’s responses, potentially leading to fatal crashes. In fact, MT has been evolving beyond the domain of conventional testing. It is extended to validation (lin2018; zhou2019), program understanding (zhou2018) and quality assessment (pullum2012). MT was also combined with “statistical hypothesis test” to deduce the likelihood of a system’s faultiness
(guderlei2007). Despite of its many successes with complex systems, it remains unclear to what extent that MT is applicable to test linear regressionbased systems.The aim of this paper is to propose the use of MT to test multiple linear regression systems. We focus on testing the implementation of linear regression algorithm for three reasons. First of all, the algorithm is the core component of these regression systems. While testing a specific system may require a particular set of features and properties specific to that system, the technique for testing such system may not be applicable to test another system. For example, testing a C++ program may involve an explicit declaration of variables, while it is not required in testing a Python or Ruby program, making dissimilar outcomes in assessing the bugs associated with initializing variables in different systems. Second, we may inspire interested readers to come up with new intrinsic properties, or extend them for nonlinear regressionbased systems. Third, it has been reported that many statistical users have mistakenly manipulated the regression calculation, and thus might derive incorrect predictive models. For example, many users came up with erroneous results while applying linear regression in Python (bugpython1; bugpython2; bugpython3), Matlab (bugmatlab2) or R (bugr1; bugr2) software, and it turned out that they manipulated the regression wrongly by mishandling the intercept. By considering our different ways to check the implementation of regression algorithm, they may be able to test their own manipulation process, and thus improve the reliability of their works.
For MT to work on the regression systems, testers or users need to know some intrinsic properties of the regression. These properties should be supported by formal “mathematical proofs” to avoid misuses. For example, there has been some reports about users adopting some properties that are not necessary properties of the algorithms to be implemented (xie2011). This will cause misinterpretations of the results. One of the strengths of this paper is that we prove all these intrinsic properties of the regression systems. Contributions of this paper are summarized as follows:

We derive a set of mathematical properties of estimator of multiple linear regression related to the addition of data points, the rescaling of inputs, the shifting of variables, the reordering of data, and the rotation of independent variables.

We then develop 11 Metamorphic Relation (MRs) which are grouped in 6 categories to support the testing of related regression systems.

The technique of mutation analysis is applied on a wide range of regression programs to examine the performance of these MRs.

We further examine how the testing of regression could be carried out in a more effective way.
The structure of this paper is organised as follows. After the introduction, we start with recalling basic concepts of linear regression and MT. In Section 3, we present selected properties of estimator, and associated MRs that will later be used for verifying the regression estimates. Section 4 is on experimentation, presenting the mutation setup, the generation of the input datasets, and the assessment criteria. The effectiveness of MRs is reported and discussed in Section 5, before a comparison between MT and random testing is presented in Section 6. We describe the threats to validity of our experiments in Section 7. Finally, the paper is enclosed by concluding remarks in Section 8.
2 Preliminaries
2.1 Multiple linear regression
Regression is a model to describe the behavior of a certain random variable of interest. This variable may be the price of stocks in the financial market, the growth of a biological species, or the detection opportunity of the gravitational wave. It is referred to as the dependent variable, and denoted with
. Information of the dependent variable is provided on the basis of predictor or explanatory variables; such as time for the price of stocks, food for the species to grow or the number of observations for the detection of wave. These terms are usually called the independent variables, and denoted with . The regression model is to relate dependent variable to given independent variables by means of a function(1) 
Given the form of function , the regression model is said to be determined when the unknown parameter, being referred to as the estimator , is obtained. The optimal estimator is chosen such that the modelled dependent variable becomes closest to the true dependent variable . In multiple linear regression model, the variable has a linear relationship with respect to the components of optimized estimator by means of:
(2) 
where in which the upper symbol represents the transpose of the relevant matrix. Each individual variable ( = 0, 1, …, ) can be, for instance, a trend rate, an oscillatory factor or a function in the multiple linear regression (luu2015; luu2018), but it must be independent of all other variables . The intercept form is adopted if we set as a constant, say , in contrast with other input variables that are fed with data. Then we have
(3) 
Assume we have data points for the model, and denote and (all
) as corresponding vectors of data for the variables
and , and () is the vector of modelled dependent variable . Their matrices are expressed as(4)  
The linear regression equation to derive is
(5) 
To establish the model, a cost function (also named as objective function) is adopted to quantify the estimator based on a certain metrics. The idea is to minimize the difference between the true and the modelled , which is referred to as the residual (or error). In general, the cost function consists of the residual, the estimator and different
forms of parameters and weights. Some regression methods may add a penalty quantity associated with
, or take into account the standard deviation. In the ordinary least square (OLS) regression, the cost function is simply defined as the total of square of residuals. The optimized estimator is determined by the minimization (
) of this cost function using the square of Euclidean norm (), that is(6) 
There are different ways to derive the solution of this equation such as derivatives of sum of square errors, maximum likelihood, projection, generalized method of moments. When the matrix
has the full rank, i.e. all its rows and columns are linearly independent, we have the unique solution derived from the normal equation:(7) 
The upper numeric subscript () represents the inverse of a related matrix. Once the model is established after the derivation of the estimator , we may predict the change of dependent variable under varying conditions of . In the prediction system based on linear regression, the core model is trained with a set of input data to obtain the optimized estimator (), before it can be used to predict. A more comprehensive description about linear regression can be found in (weisberg2005; montgomery2012).
2.2 Metamorphic testing
Metamorphic Testing (MT) has been developed to alleviate the test oracle problem, which is referred to as the situations where test results of a program are impossible or extremely difficult to be validated. Consider an example of a program that implements a particular algorithm to find the shortest path from one node to another in a graph. Assume that is a graph having 100 nodes, and that, on average, each node has about 20 edges. In general, given the graph and two nodes and in , it is very timeconsuming to verify that the program’s actual output is really a shortest path in from to because a manual process may involve validating against possible paths connecting x and y.
The core of MT is the concept of Metamorphic Relations (MRs), which are derived from properties of the targeted algorithm to be implemented. If the program does not uphold these properties, we can conclude that the program has errors. Using the shortest path example mentioned earlier, assume further that , and are three different nodes in . Based on the domain knowledge of the shortest path in a graph, one can derive the metamorphic relation – “if is in the shortest path in from to , the length of the shortest path in from to is equal to the sum of the length of the shortest path in from to and that from to ”. Since the program is implementing a particular “shortest path” algorithm, one can expect that upholds this metamorphic relationship in the sense that “If is in the actual output , the length of is equal to the sum of the lengths of and . The original test case is considered as a source test case in MT. The two test cases and are referred to as the followup test cases in MT because these test cases are followups of the source test case. In this example, the MR involves one source test case and two followup test cases, and the followup test cases may depend on the actual output of the program with the source test case as input. In general, an MR can involve multiple source test cases and multiple followup test cases. Following is the formal definitions of MR and other concepts used in MT (chen2017).
Definition 1: Let be a target function or algorithm to be implemented. A Metamorphic Relation (MR) is a necessary property of over a sequence having two or more inputs , where , and the sequence of corresponding outputs . The relation is denoted by , where is the subset relation, and and are the Cartesian products of input and output spaces, respectively. We may simply adopt to represent .
Definition 2: Consider an MR . Suppose that each () is constructed based on according to . For any , is referred to as a source input. For any , is referred to as a followup input. That is, for a given , if all source inputs () are specified, then the followup inputs () can be constructed based on the source inputs and, if necessary, their corresponding outputs. The sequence of inputs is referred to as a metamorphic test group (MTG) of inputs for the MR.
Process for Metamorphic Testing (MT). Let be an implementation of a target function . For an MR , suppose that we have . Metamorphic Testing (MT) based on this MR for involves the following steps:
(1) Define from by replacing by
(2) Given a sequence of source test cases . After execution, their respective source outputs are given by . Construct and execute a sequence of followup test cases with reference to , and obtain their corresponding followup outputs .
(3) Compare the executed results against the expectation given by . If is not satisfied, then is determined to be faulty by this MR.
In other words, to apply MT to test a program using a given MR, we first need to generate the relevant source test cases. We then execute the program with these source test cases to obtain their respective source outputs. Based on the given MR, we can generate the relevant followup test cases. After that, the program is executed with these followup test cases to obtain the respective followup outputs. Finally, we can check whether the given MR is satisfied by the source test cases, the source outputs, the followup test cases, and the followup outputs. If the MR is violated, it means that the program under test is faulty. This entire process of MT can be automated.
3 Metaphoric relations of estimator
In this section, we propose 11 MRs that can be used to test linear regression based systems. As mentioned earlier, they are derived from the properties of the targeted algorithm to be implemented. It is important to discuss the intuition of each property, understand the properties and explain how MRs could be derived from the relevant properties. Interested readers may find the mathematical proofs of these properties in Appendix A. For ease of discussion and illustration, we will use 2dimensional examples, if needed. Last, but not least, all properties discussed in this section involve the optimized estimator .
3.1 Properties of estimator
3.1.1 Property 1. Inserting new data
This property is about inserting an additional data point to the original set of data points for linear regression. Proposition 1 shows that, after a new data point is added to the original set of data points, the new linear regression line obtained from the new set of data points can be derived from a relationship with the original regression line and the new data point.
Proposition 1.
Suppose that the dependent variable is related to a linear relationship with independent variables with the estimator derived from the least square fitting. Let and (all ) denote the vectors of data for the variables and , respectively, whose matrices are expressed in Equation (4). And let and be a data point being added into the original data set. The linear estimator obtained from the new data set and can be derived from the following equation
(8) 
where is the vector of size defined by and as follows
(9) 
We now derive an important property that forms the foundation of the first two metamorphic relations, namely MR 1.1 and MR 1.2 in Section 3.2.
Corollary 1. Same notation as in Proposition 1. If , then .
Proof: Suppose . It is straightforward from Equation (8) that .
In other words, if the new data point is generated from the model derived by the original data set, we do expect that the new model obtained from the new data set will be the same as the original model. For example, suppose we have 3 data points in the data set for linear regression and that the linear regression system returns the line . If we use this line equation to generate a new data point, say (7, 15), and feed all 4 (original 3 plus this new) data points to the linear regression system, we expect that the new regression line obtained will be the same as the original one.
Corollary 2. Same notation as in Proposition 1. If
(10)  
(11) 
for , then for the regression with intercept, being explicitly expressed in Equation (3).
Proof: The summation of both sides of Equation (3) gives us
(12) 
The linear regression with intercept is unbiased, that is
(13) 
As a result, from the definitions of and , we obtain
(14) 
This equation can be rewritten in form of vector as . It follows immediately after Corollary 1 that .
3.1.2 Property 2. Scaling data
The second property is about the scaling of certain coordinates of data points. Proposition 2 shows that, if certain coordinates of data points along a certain axis are scaled by a given factor, the slope of regression line with respect to this axis will be scaled proportionally by the same factor.
Proposition 2. Suppose that the values of the dependent variable are scaled by a factor of with respect to the original values , and the values of an independent variables being factorized by a factor of with respect to the original values , that is
(15)  
(16) 
where the constants and are nonzero real numbers (). The new estimator can be computed from the original estimator as follows
(17) 
Based on this proposition, we can further derive two properties of mirroring the components of the data points, and two properties of scaling the components of the data points. These properties are the basis of four MRs, namely MR 2.1, MR 2.2, MR 3.1 and MR 3.2 in Section 3.2. The proofs of these properties are straightforward from Equation (17). We will leave them to the readers.
Corollary 3. Same notation as in Proposition 2. If and , then .
That is, if we reflect the sign of the dependent variable, we expect that the sign of estimator will also be reflected. This is the basis of our MR 2.1.
Corollary 4. Same notation as in Proposition 2. If and , then is given by the following equation
(18) 
More simply, if we only reflect a particular independent variable, only the component of estimator related to this independent variable is reflected. This is the basis of our MR 2.2.
Corollary 5. Same notation as in Proposition 2. If and , then .
Intuitively, if we scale the dependent variable by a factor of , the estimator will also be scaled by the same factor. This is the basis of our MR 3.1.
Corollary 6. Same notation as in Proposition 2. If and , then is given by the following equation
(19) 
That is, if we only scale a particular independent variable by a factor of , only the component of the estimator related to this independent variable is scaled reciprocally. This is the basis of our MR 3.2.
3.1.3 Property 3. Shifting data
The third property is about the shifting of certain components of the data in the original data set. Proposition 3 shows that, if the coordinates of data points along a certain axis are shifted by a given distance, the projection of the regression line with intercept is also shifted by a proportional distance.
Proposition 3. Suppose that the values of the dependent variable are shifted by a distance of with respect to the original values , and the values of an independent variables being shifted by a distance of with respect to the original values , that is
(20)  
(21) 
where and are real constants (). In the linear regression with intercept, the new estimator can be determined from the original estimator as follows
(22) 
Based on this proposition, we can further derive two properties of shifting the components of the data points. These properties are the basis of two MRs, namely MR 4.1 and MR 4.2 in Section 3.2. The proofs of these properties are straightforward from Equation (22).
Corollary 7. Same notation as in Proposition 3. If and , then is given by the following equation
(23) 
In other words, if we add up the dependent variable by a value of , the intercept component of estimator will also be increased by the same value. This is the basis of our MR 4.1.
Corollary 8. Same notation as in Proposition 3. If and , then is given by the following equation
(24) 
That is, if we add up a particular independent variable by a value of , the intercept component of estimator will also be decreased by an amount proportional to both this value and the component of estimator related to this variable. This is the basis of our MR 4.2.
3.1.4 Property 4. Permuting data
The fourth property is about the permutation of certain components or certain samples of the data. Proposition 4 shows that, if the coordinates of the data points along certain axes are permuted, the corresponding projections of the regression line along these axes are permuted in the same order. Moreover, the proposition shows that, permuting the data points does not change the regression line.
Proposition 4. Suppose that the samples of the dependent variable are permuted by the function with respect to the original variable ; and the samples of the independent variables are permuted by both functions and with respect to the original values , such that
(25)  
(26) 
where is a permutation (bijective function) from set to ; whilst is the corresponding bijective renumbering of the set of sample index . The new estimator can be determined from the original estimator as follows
(27) 
Based on this proposition, we can further derive two properties of permuting the data points. These properties are the basis of two MRs, namely MR 5.1 and MR 5.2 in Section 3.2. The proofs of these properties are straightforward from Equation (27).
Corollary 9. Same notation as in Proposition 4. If is the identity function (i.e., ) and is a bijective function, then .
Put it differently, if we permute the data points only without permuting the dependent and independent variables, the estimator will remain unchanged. This is the basis of our MR 5.1.
Corollary 10. Same notation as in Proposition 4. If is a permutation that only swaps and where and , while is the identity function, then is given by the following equation
(28) 
That is, if we swap two independent variables and their relevant values in the data points, the corresponding components of the estimator will be swapped accordingly. This is the basis of our MR 5.2.
3.1.5 Property 5. Rotating data
The fifth property is about the rotation of coordinate system. Proposition 5 shows that, if the axes related to the independent variables of the data points are rotated by a given rotation angle, the corresponding projections of the regression line are rotated by the same angle.
Proposition 5. Suppose that the values of the dependent variable is kept unchanged, while the components of the independent variables are rotated by the matrix of size . In other words, rotates the matrix with respect to the original matrix , that is
(29) 
Then the new estimator can be determined from the original estimator as follows
(30) 
Based on this proposition, we can derive the property of rotating the data points. This property is the basis of the MR 6 in Section 3.2. The proof of this property is straightforward from Equation (30).
Corollary 11. Same notation as in Proposition 5. Suppose two components and with are rotated by an angle in the counterclockwise direction, that is
(31) 
in other words, the rotation matrix has the form
(32)  
then the new estimator is determined from the original estimator as follows
(33) 
That is, if we rotate two components of the data points by an angle, the corresponding component of the estimator will be rotated by the same angle.
3.2 Metamorphic relations
In this subsection, we will present 11 MRs for testing multiple linear regression systems. They are derived from properties presented in Section 3.1 and are grouped into 6 categories. We will describe the regression form with intercept first, and discuss the regression form without intercept later. Table 1 summaries these MRs and their applicable types of regression, and Figure 1 illustrates the transformation examples for each MR.
0.2cm
Category  MR  Name  Form 
Unchanged  1.1  Inserting a predicted point  I,C 
predictions  1.2  Inserting the centroid  I 
Mirrored  2.1  Reflecting the dependent variable  I,C 
regression  2.2  Reflecting an independent variable  I,C 
Scaled  3.1  Scaling the dependent variable  I,C 
regression  3.2  Scaling an independent variable  I,C 
Shifted  4.1  Shifting the dependent variable  I 
regression  4.2  Shifting an independent variable  I 
Reordered  5.1  Swapping samples  I,C 
regression  5.2  Swapping two independent variables  I,C 
Rotated  6  Rotating two independent variables  I,C 
regression  
I: Regression with intercept as in Equation (3) where ;  
C: Constrained regression (without intercept) as in Equation (2)  
where is an input variable whose data are provided by user. 
Before we discuss the individual MR, let us introduce some notations for the regression with intercept. This regression is adopted as the default unless stated otherwise. First, let denote the source input of a linear regression system under test (SUT), and it consists of a set of data points
(34) 
where each data point () can be denoted as
(35) 
in which () is the xcoordinate and is the ycoordinate of the point . The output of the SUT using the source input can be expressed by
(36) 
The output is referred to as the source output. That is, the equation of the regression form with intercept is
(37) 
Let us denote the followup input be a set of data points, being generated using a given MR, that is
(38) 
where
(39) 
for . The output of the same program using the followup input is expressed by
(40) 
The output is referred to as the followup output. For convenience in describing the MRs, we refer and ( and ) to as the source (followup) data set and output, respectively.
In the context of MT, if , , and do not satisfy the relevant MR, the SUT is said to be faulty. Now, we are going to describe the MRs based on the previously mentioned properties, to be used in our experimentation to validate regression systems.
3.2.1 Category 1: Unchanged predictions
Two MRs in this category are based on Corollary 1, that is, adding a point predicted by the regression line into the original data set shall compute the same regression line.
MR1.1. Inserting a predicted point
The regression line will remain the same after being updated by adding a point selected arbitrarily from the line into the original data set. This MR is illustrated in Figure 1c.
Given the source input and source output , the followup input is formed by adding a new point to the source input , that is
(41) 
where
(42) 
such that are arbitrary real numbers, and is computed from these () values and using the following equation
(43) 
In other words, is a point that falls in the “predicted” regression equation. Then we expect to have the followup output be the same as the source output.
(44) 
To sum up, the followup estimator is the same as the source estimator when a point predicted from the source model is included in the followup estimation. Instead of a single point, we can add several new points at the same time for the regression. The addup will preserve the source estimators as long as the points are derived from the regression model.
It is noted that, in practice, floatingpoint computations may give us slightly different values due to rounding. We thus need to consider the impact of roundoff error in determining the violation of each MR. For example, instead of having the equality in this MR, the comparison should be relaxed by an inequality that accounts for an error tolerance. This will be explained in details in Subsection 4.3 (Assessment).
MR1.2. Inserting the centroid
For the intercept form, the linear regression line will remain the same after the centroid is added into the original data set, as illustrated in Figure 1d. The centroid is the coordinate-wise arithmetic mean of the data points, which is easy to compute, and it always lies on the fitted regression line.
Given the source input $D$ and source output $\hat{\beta}$, the follow-up input $D'$ is formed by adding the centroid point $\bar{p}$ to the source input $D$, that is,
\[ D' = D \cup \{\bar{p}\} \quad (45) \]
where
\[ \bar{p} = (\bar{x}_1, \dots, \bar{x}_k, \bar{y}) \quad (46) \]
is such that $\bar{x}_j$ and $\bar{y}$ are the averages of the corresponding values in the source data $D$, that is,
\[ \bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i \quad (47) \]
for $j = 1, \dots, k$. If the regression program is configured to run with an intercept, then we have
\[ \hat{\beta}' = \hat{\beta} \quad (48) \]
In other words, adding the centroid of the data into the source input to form a follow-up input will not change the follow-up estimator for the regression form with intercept.
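MR1.2 can be sketched the same way; as before, fit is a hypothetical OLS stand-in and the data are synthetic:

```python
import numpy as np

def fit(X, y):
    # Illustrative OLS-with-intercept stand-in for the SUT.
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))
y = 0.5 + X @ np.array([1.0, 2.0]) + rng.normal(scale=0.2, size=15)

beta = fit(X, y)

# Follow-up input: append the centroid (mean of every coordinate).
X2 = np.vstack([X, X.mean(axis=0)])
y2 = np.append(y, y.mean())

beta2 = fit(X2, y2)
assert np.allclose(beta, beta2)   # MR1.2 holds for the intercept form
```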
3.2.2 Category 2: Mirrored regression
The MRs in this category are based on Corollary 3 and Corollary 4 with the scaling parameter set to $-1$. This set of MRs describes how reflecting the data points reflects the regression line accordingly.
MR2.1. Reflecting the dependent variable
Reflecting the points over a certain axis will reflect the regression line over the same axis, as illustrated in Figure 1e.
Given the source input $D$ and source output $\hat{\beta}$, the follow-up input $D'$ consists of points
\[ (x'_{i1}, \dots, x'_{ik}, y'_i), \quad i = 1, \dots, n \quad (49) \]
where for each $i$ ($1 \le i \le n$), the values of its x-coordinates remain unchanged (that is, $x'_{ij} = x_{ij}$ for $j = 1, \dots, k$) and its y-coordinate is reflected, that is,
\[ y'_i = -y_i \quad (50) \]
Then, we have
\[ \hat{\beta}' = -\hat{\beta} \quad (51) \]
That is, the follow-up estimator is a reflection of the source estimator when the sign of the dependent variable is reversed in the follow-up input set.
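A short sketch of MR2.1 under the same assumptions (fit is an illustrative OLS stand-in, data synthetic):

```python
import numpy as np

def fit(X, y):
    # Illustrative OLS-with-intercept stand-in for the SUT.
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

rng = np.random.default_rng(2)
X = rng.normal(size=(25, 2))
y = 2.0 + X @ np.array([1.0, -0.5]) + rng.normal(scale=0.3, size=25)

beta = fit(X, y)
beta2 = fit(X, -y)                 # follow-up: every y reflected, x unchanged

assert np.allclose(beta2, -beta)   # MR2.1: all components reverse sign
```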
MR2.2. Reflecting an independent variable while keeping the others unchanged
Reflecting the points over the y axis will reflect the regression line over the same axis, as illustrated in Figure 1f.
Given the source input $D$ and source output $\hat{\beta}$, the follow-up input $D'$ is defined to consist of points
\[ (x'_{i1}, \dots, x'_{ik}, y'_i), \quad i = 1, \dots, n \quad (52) \]
where for each $i$ ($1 \le i \le n$), the x-coordinate of one independent variable, say the $j$-th, is reflected while the x-coordinates of the other independent variables and the y-coordinates remain the same, that is,
\[ x'_{ij} = -x_{ij}, \qquad x'_{im} = x_{im} \ (m \ne j), \qquad y'_i = y_i \quad (53) \]
The follow-up output
\[ \hat{\beta}' = (\hat{\beta}'_0, \hat{\beta}'_1, \dots, \hat{\beta}'_k) \quad (54) \]
is determined by the following equation
\[ \hat{\beta}'_j = -\hat{\beta}_j, \qquad \hat{\beta}'_m = \hat{\beta}_m \ (m \ne j) \quad (55) \]
In a nutshell, if the sign of an independent variable is reversed, the sign of the corresponding component of the estimator is also reversed, while the other components remain unchanged.
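MR2.2, sketched with the same illustrative stand-in; note the index shift because the intercept occupies the first estimator slot:

```python
import numpy as np

def fit(X, y):
    # Illustrative OLS-with-intercept stand-in for the SUT.
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

rng = np.random.default_rng(3)
X = rng.normal(size=(25, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=25)

beta = fit(X, y)

j = 1                                # reflect the second independent variable
X2 = X.copy()
X2[:, j] = -X2[:, j]
beta2 = fit(X2, y)

expected = beta.copy()
expected[j + 1] = -expected[j + 1]   # beta[0] is the intercept, hence j + 1
assert np.allclose(beta2, expected)
```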
3.2.3 Category 3: Scaled regression
The MRs in this category are based on Corollary 5 (MR3.1) and Corollary 6 (MR3.2) for an arbitrary positive scaling factor. When particular coordinates of the data points are scaled along an axis by a positive factor, the regression line is scaled accordingly.
MR3.1. Scaling the dependent variable
After the points are scaled in the y axis by a given factor, the regression line will be scaled by the same ratio, as shown in Figure 1g.
Given the source input $D$ and source output $\hat{\beta}$, the follow-up input $D'$ is defined to consist of points
\[ (x'_{i1}, \dots, x'_{ik}, y'_i), \quad i = 1, \dots, n \quad (56) \]
where for each $i$ ($1 \le i \le n$), the values of its x-coordinates remain unchanged ($x'_{ij} = x_{ij}$, $j = 1, \dots, k$) and its y-coordinate is scaled by a factor $s$ ($s > 0$), that is,
\[ y'_i = s y_i \quad (57) \]
Then, the follow-up output would be scaled by the same factor:
\[ \hat{\beta}' = s \hat{\beta} \quad (58) \]
In other words, when the values of the dependent variable are scaled by a given factor, all components of the estimator will be scaled proportionally by the same factor.
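MR3.1 in the same illustrative setting (fit is a stand-in OLS routine, s an arbitrary positive factor):

```python
import numpy as np

def fit(X, y):
    # Illustrative OLS-with-intercept stand-in for the SUT.
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 2))
y = 1.0 + X @ np.array([2.0, -1.5]) + rng.normal(scale=0.2, size=30)

beta = fit(X, y)

s = 2.5                        # arbitrary positive scaling factor
beta2 = fit(X, s * y)          # follow-up: dependent variable scaled by s

assert np.allclose(beta2, s * beta)   # MR3.1: every component scaled by s
```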
MR3.2. Scaling an independent variable while keeping the others unchanged
After the points are scaled along a particular x axis by a given factor, the slope of the regression line will be scaled reciprocally with respect to that axis, as illustrated in Figure 1h.
Given the source input $D$ and source output $\hat{\beta}$, the follow-up input $D'$ is defined to consist of points
\[ (x'_{i1}, \dots, x'_{ik}, y'_i), \quad i = 1, \dots, n \quad (59) \]
where for each $i$ ($1 \le i \le n$), the x-values of one independent variable, say the $j$-th, are scaled by a factor $s$ ($s > 0$), while the values of the other independent variables and the dependent variable remain the same, that is,
\[ x'_{ij} = s x_{ij} \quad (60) \]
and
\[ x'_{im} = x_{im} \ (m \ne j), \qquad y'_i = y_i \quad (61) \]
The follow-up output
\[ \hat{\beta}' = (\hat{\beta}'_0, \hat{\beta}'_1, \dots, \hat{\beta}'_k) \quad (62) \]
would be defined by the following equation
\[ \hat{\beta}'_j = \hat{\beta}_j / s, \qquad \hat{\beta}'_m = \hat{\beta}_m \ (m \ne j) \quad (63) \]
In other words, if the values of an independent variable are scaled by a given factor, the corresponding component of the estimator would be scaled by the reciprocal of the scaling factor, while the other components remain unchanged.
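MR3.2, sketched under the same assumptions; only the coefficient of the scaled variable changes, by the reciprocal of s:

```python
import numpy as np

def fit(X, y):
    # Illustrative OLS-with-intercept stand-in for the SUT.
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

rng = np.random.default_rng(6)
X = rng.normal(size=(30, 3))
y = 0.5 + X @ np.array([1.0, -2.0, 3.0]) + rng.normal(scale=0.2, size=30)

beta = fit(X, y)

j, s = 2, 4.0                  # scale the third independent variable by s
X2 = X.copy()
X2[:, j] *= s
beta2 = fit(X2, y)

expected = beta.copy()
expected[j + 1] /= s           # index shifted by 1: beta[0] is the intercept
assert np.allclose(beta2, expected)
```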
3.2.4 Category 4: Shifted regression
The MRs in this category are based on Corollary 7 and Corollary 8, and are applicable only to regression with an intercept. When the points are shifted along an axis, the shift distance is absorbed into the intercept of the regression line.
MR4.1. Shifting the dependent variable
When the points are shifted by a given distance along the y axis, the regression line will be shifted along this axis by the same distance, as illustrated in Figure 1i.
Given the source input $D$ and source output $\hat{\beta}$, the follow-up input $D'$ is defined to consist of points
\[ (x'_{i1}, \dots, x'_{ik}, y'_i), \quad i = 1, \dots, n \quad (64) \]
where for each $i$ ($1 \le i \le n$), the x-coordinates remain unchanged and the value of its y-coordinate is shifted by a distance $d$:
\[ y'_i = y_i + d \quad (65) \]
Then, the follow-up output
\[ \hat{\beta}' = (\hat{\beta}'_0, \hat{\beta}'_1, \dots, \hat{\beta}'_k) \quad (66) \]
would be defined by the following equation
\[ \hat{\beta}'_0 = \hat{\beta}_0 + d, \qquad \hat{\beta}'_m = \hat{\beta}_m \ (m = 1, \dots, k) \quad (67) \]
In brief, if a constant is added to the values of the dependent variable, the intercept of the new estimator is increased by the same value while the slope components remain unchanged.
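MR4.1 in the same illustrative setting; only the intercept component of the estimator moves:

```python
import numpy as np

def fit(X, y):
    # Illustrative OLS-with-intercept stand-in for the SUT.
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

rng = np.random.default_rng(7)
X = rng.normal(size=(30, 2))
y = 1.0 + X @ np.array([0.5, 2.0]) + rng.normal(scale=0.2, size=30)

beta = fit(X, y)

d = 3.0
beta2 = fit(X, y + d)          # follow-up: dependent variable shifted by d

expected = beta.copy()
expected[0] += d               # only the intercept moves
assert np.allclose(beta2, expected)
```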
MR4.2. Shifting an independent variable while keeping the others unchanged
When the points are shifted by a given distance along a certain x axis, the regression line will be shifted in parallel along this axis accordingly, as illustrated in Figure 1j.
Given the source input $D$ and source output $\hat{\beta}$, the follow-up input $D'$ is defined to consist of points
\[ (x'_{i1}, \dots, x'_{ik}, y'_i), \quad i = 1, \dots, n \quad (68) \]
where for each $i$ ($1 \le i \le n$), the value of one independent variable, say the $j$-th, is shifted by a distance $d$ while the x-coordinates of the other independent variables and the y-coordinates remain the same, that is,
\[ x'_{ij} = x_{ij} + d, \qquad x'_{im} = x_{im} \ (m \ne j), \qquad y'_i = y_i \quad (69) \]
The follow-up output