    Testing Multiple Linear Regression Systems with Metamorphic Testing

Regression is one of the most commonly used statistical techniques. However, testing regression systems is a great challenge because, in general, there is no test oracle. In this paper, we show that Metamorphic Testing is an effective approach for testing multiple linear regression systems. To do so, we identify intrinsic mathematical properties of linear regression, and then propose 11 Metamorphic Relations to be used for testing. Their effectiveness is examined using mutation analysis with a range of different regression programs. We further look at how the testing could be carried out more effectively. Our work is applicable to examining the reliability of regression-based predictive systems, which have been widely used in economics, engineering and science, as well as the reliability of regression calculations performed by statistical users.


05/28/2019


1 Introduction

Being a cornerstone of statistics, regression is a fundamental prediction technique. Due to its simplicity and robustness, it has been widely used in numerous disciplines of science and engineering, as well as in other fields such as economics and business. Once a regression model is established, we can use it to predict expected values within a certain level of confidence.

Calculating regression coefficients is an essential statistical function of various software and systems. It has been integrated into both commercial software such as MATLAB (MathWorks), SPSS Statistics (IBM) and Excel (Microsoft), and open-source libraries including the GNU Scientific Library (C/C++), Scikit-learn, SciPy, StatsModels (Python), Accord, MathNet (F#), and GoLearn (Go). On the other hand, new systems may be built from scratch in some situations, such as the development of statistical libraries in a new programming language. Besides, some programmers may need to modify the regression, for example by transforming the variables (lin2000; luu2015; wu2017; luu2018) or parallelizing the computations (baboulin2006; dau2019). In most cases, the calculation involves extensive computations. It is thus prone to programming bugs and data mishandling. Blunders in computing regression by such software or systems would cast doubt on the reliability and robustness of the data and scientific analyses based on them.

Advancements in machine learning enable us to cope with the growth of big data, which has been accelerating over recent decades (goodfellow2016). Despite this rapid change, linear regression remains widely used for two main reasons. First, it has served for years as a baseline for evaluating more sophisticated modelling systems (weisberg2005; montgomery2012). For example, simple linear regression is frequently adopted to assess the performance of various artificial neural network (ANN) systems (bibault2018; ahmad2019; mondal2020; ayoubloo2011). Second, the ANN equations are reducible to the linear regression form, referred to as a deep linear network (saxe2014; lathuiliere2019). Recently, the deep linear network has received much attention because it helps us better understand the properties and reliability of general ANNs (lathuiliere2019; lee2019; bah2020). The reliability of many systems therefore depends, to some extent, on the correctness of the linear regression calculation.

Traditionally, testing a regression system is difficult because, in general, we do not have a test oracle. For example, suppose we have a set of data points and want to find a line whose predicted values are "closest" to these points. (In higher dimensions, a plane or hyperplane is used instead of a line.) When the regression program returns a line, we have no simple way to tell whether or not it is the best-fitted line. Testers may find the regression line manually by solving the required equations constructed from these data points. However, this is not viable in practice and is very tedious.

Testing such programs becomes even more difficult in reality because we often never know the true line due to round-off errors (higham2002; solomon2015). In other words, we are not able to determine whether or not the computed value from a regression algorithm is the exact estimator that should be returned by the algorithm. In fact, previous studies (higham2002; gratton2013; marland2018) pointed out that solving the linear regression may cause a large accumulation of round-off and truncation errors, because obtaining the regression solutions requires intensive matrix calculations. When the true value is unavailable, there is no direct way to validate regression systems.

Incorrect linear regression computation may lead to severe consequences. In many cases, even a marginal difference in the estimated regression coefficients may have a huge impact. For instance, while a previous study estimated one rate at which the global mean sea level has been rising since the beginning of this century (church2011), another work suggested a different rate (hay2015). Due to the acceleration of global warming, the two regression coefficients will, after 50 years, lead to a gap larger than 2.5 cm in the predicted sea level, which translates into billions of dollars per year of miscounted flood-damage costs, according to a recent study (jevrejeva2018). Moreover, while round-off error was blamed for the failure of the US Army's Patriot Missile Defense System to shoot down a SCUD missile, causing the death of 28 American soldiers (zhivich2009), rounding is also known to deteriorate the accuracy of regression (higham2002). Indeed, a faulty regression program may cause damage even more severe than round-off error does. Validating the accuracy of regression is therefore not solely a computational issue but also has real-life impacts. Considering its very diverse range of applications, testing the correctness of regression systems is certainly of great importance.

In view of this problem, we should find an appropriate approach to the testing of linear regression-based systems. The software testing technique of Metamorphic Testing (MT) allows us to alleviate the absence of an oracle and to generate test sets automatically (segura2016; chen2017). It has been applied successfully in testing various software systems, such as numerical analysis (chan1998; chen2002), search engines (zhou2012; zhou2016), Web Application Programming Interfaces (segura2018), feature models (segura2010; segura2011), scientific software (ding2016; lin2018), programming language compilers (le2014), graphics shader compilers (donaldson2017), context-sensitive middleware-based applications (chan2005), object detection (wang2019), and cybersecurity (chen2016; mai2020).

Recently, MT has been applied effectively to validate various machine learning and artificial intelligence systems (xie2011; xie2018; segura2018b). A testing tool named DeepTest was developed to detect erroneous responses of autonomous car-driving systems backed by deep neural networks (DNN) (tian2018). It helped reveal thousands of dangerous situations, under different weather conditions, affecting the functioning of DNN-based programs that were submitted to the Udacity self-driving challenge. Another example, testing a real-life driverless car system named Apollo from Baidu (zhou2019), pointed out that adding random noise can significantly alter the system's responses, potentially leading to fatal crashes. In fact, MT has been evolving beyond the domain of conventional testing: it has been extended to validation (lin2018; zhou2019), program understanding (zhou2018) and quality assessment (pullum2012). MT has also been combined with statistical hypothesis testing to deduce the likelihood of a system's faultiness (guderlei2007). Despite its many successes with complex systems, it remains unclear to what extent MT is applicable to testing linear regression-based systems.

The aim of this paper is to propose the use of MT to test multiple linear regression systems. We focus on testing the implementation of the linear regression algorithm for three reasons. First of all, the algorithm is the core component of these regression systems. While testing a specific system may require a particular set of features and properties specific to that system, the technique for testing that system may not be applicable to another. For example, testing a C++ program may involve explicit declarations of variables, which are not required when testing a Python or Ruby program, leading to dissimilar outcomes when assessing bugs associated with initializing variables in different systems. Second, we may inspire interested readers to come up with new intrinsic properties, or to extend them to non-linear regression-based systems. Third, it has been reported that many statistical users have manipulated the regression calculation mistakenly, and thus might have derived incorrect predictive models. For example, many users obtained erroneous results while applying linear regression in Python (bug-python1; bug-python2; bug-python3), Matlab (bug-matlab2) or R (bug-r1; bug-r2), and it turned out that they had manipulated the regression wrongly by mishandling the intercept. By considering our different ways of checking the implementation of the regression algorithm, they may be able to test their own manipulation process and thus improve the reliability of their work.

For MT to work on regression systems, testers or users need to know some intrinsic properties of the regression. These properties should be supported by formal mathematical proofs to avoid misuse. For example, there have been reports of users adopting properties that are not necessary properties of the algorithms being implemented (xie2011), which causes misinterpretation of the results. One of the strengths of this paper is that we prove all the intrinsic properties of the regression systems that we use. The contributions of this paper are summarized as follows:

• We derive a set of mathematical properties of the estimator of multiple linear regression related to the addition of data points, the rescaling of inputs, the shifting of variables, the reordering of data, and the rotation of independent variables.

• We then develop 11 Metamorphic Relations (MRs), grouped into 6 categories, to support the testing of related regression systems.

• The technique of mutation analysis is applied to a wide range of regression programs to examine the performance of these MRs.

• We further examine how the testing of regression could be carried out in a more effective way.

The remainder of this paper is organised as follows. After the introduction, we recall basic concepts of linear regression and MT. In Section 3, we present selected properties of the estimator, and the associated MRs that will later be used for verifying the regression estimates. Section 4 is devoted to the experimentation, presenting the mutation set-up, the generation of the input datasets, and the assessment criteria. The effectiveness of the MRs is reported and discussed in Section 5, before a comparison between MT and random testing is presented in Section 6. We describe the threats to the validity of our experiments in Section 7. Finally, the paper closes with concluding remarks in Section 8.

2 Preliminaries

2.1 Multiple linear regression

Regression is a model to describe the behavior of a certain random variable of interest. This variable may be the price of stocks in the financial market, the growth of a biological species, or the detection opportunity of a gravitational wave. It is referred to as the dependent variable, and denoted by y. Information about the dependent variable is provided on the basis of predictor or explanatory variables, such as time for the price of stocks, food for the species to grow, or the number of observations for the detection of a wave. These are usually called the independent variables, and denoted by x0, x1, …, xd. The regression model relates the dependent variable to the given independent variables by means of a function

 y≈f(x0,x1,…,xd) (1)

Given the form of the function f, the regression model is said to be determined when the unknown parameter, referred to as the estimator ^β, is obtained. The optimal estimator is chosen such that the modelled dependent variable ^y becomes closest to the true dependent variable y. In the multiple linear regression model, the variable ^y has a linear relationship with respect to the components of the optimized estimator ^β by means of:

 ^y=x0^β0+x1^β1+…+xd^βd (2)

where ^β = [^β0 ^β1 … ^βd]T, in which the upper symbol T represents the transpose of the relevant matrix. Each individual variable xj (j = 0, 1, …, d) can be, for instance, a trend rate, an oscillatory factor or a function in the multiple linear regression (luu2015; luu2018), but it must be independent of all the other variables. The intercept form is adopted if we set x0 to a constant, say x0 = 1, in contrast with the other input variables, which are fed with data. Then we have

 ^y=^β0+x1^β1+…+xd^βd (3)

Assume we have n data points for the model, and denote xj and y (for all j = 0, 1, …, d) as the corresponding vectors of data for the variables xj and y, while ^y is the vector of the modelled dependent variable. Their matrices are expressed as

 X = [x0 x1 … xd]T =
 ⎡ x1,0  x2,0  …  xn,0 ⎤
 ⎢ x1,1  x2,1  …  xn,1 ⎥
 ⎢  …     …    …   …  ⎥
 ⎣ x1,d  x2,d  …  xn,d ⎦ ;
 y = [y1 y2 … yn]T;  ^y = [^y1 ^y2 … ^yn]T (4)

The linear regression equation to derive ^y is

 ^y=XT^β (5)

To establish the model, a cost function (also called the objective function) is adopted to quantify the estimator based on a certain metric. The idea is to minimize the difference between the true y and the modelled ^y, which is referred to as the residual (or error). In general, the cost function consists of the residual, the estimator and different forms of parameters and weights. Some regression methods may add a penalty quantity associated with ^β, or take the standard deviation into account. In ordinary least squares (OLS) regression, the cost function is simply defined as the sum of the squares of the residuals. The optimized estimator is determined by the minimization (arg min) of this cost function using the square of the Euclidean norm (∥⋅∥2), that is

 ^β=arg minβ∥y−^y∥22 (6)

There are different ways to derive the solution of this equation, such as taking derivatives of the sum of squared errors, maximum likelihood, projection, or the generalized method of moments. When the matrix X has full rank, i.e. its rows are linearly independent, we have the unique solution derived from the normal equation:

 ^β=(XXT)−1Xy (7)

The upper numeric superscript (−1) represents the inverse of the related matrix. Once the model is established after the derivation of the estimator ^β, we may predict the change of the dependent variable under varying conditions of the independent variables. In a prediction system based on linear regression, the core model is trained with a set of input data to obtain the optimized estimator ^β, before it can be used to predict. A more comprehensive description of linear regression can be found in (weisberg2005; montgomery2012).
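As a concrete sketch (ours, not part of the paper), Equation (7) can be implemented directly in NumPy, following the paper's convention that X is a (d+1)×n matrix whose first row is all ones for the intercept. The function name `ols_estimator` is our own.

```python
import numpy as np

def ols_estimator(X, y):
    """Solve the normal equation beta = (X X^T)^{-1} X y,
    where X is (d+1) x n: rows are variables, columns are samples."""
    return np.linalg.solve(X @ X.T, X @ y)

# Small example: recover y = 2 + 3*x1 from noise-free data.
x1 = np.array([0.0, 1.0, 2.0, 3.0])
X = np.vstack([np.ones_like(x1), x1])   # first row = intercept variable x0 = 1
y = 2 + 3 * x1
beta = ols_estimator(X, y)
print(beta)  # approximately [2., 3.]
```

In practice `np.linalg.lstsq` is numerically preferable to forming XXT explicitly, but the version above mirrors Equation (7) term by term.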

2.2 Metamorphic testing

Metamorphic Testing (MT) has been developed to alleviate the test oracle problem, which refers to situations where the test results of a program are impossible or extremely difficult to validate. Consider the example of a program that implements a particular algorithm to find the shortest path from one node to another in a graph. Assume that G is a graph having 100 nodes and that, on average, each node has about 20 edges. In general, given the graph G and two nodes x and y in G, it is very time-consuming to verify that the program's actual output P(x, y) is really a shortest path in G from x to y, because a manual process may involve validating against a huge number of possible paths connecting x and y.

The core of MT is the concept of Metamorphic Relations (MRs), which are derived from properties of the targeted algorithm to be implemented. If the program does not uphold these properties, we can conclude that the program has errors. Using the shortest path example mentioned earlier, assume further that x, y and z are three different nodes in G. Based on domain knowledge of the shortest path in a graph, one can derive the metamorphic relation: "if z is in the shortest path in G from x to y, the length of the shortest path in G from x to y is equal to the sum of the length of the shortest path in G from x to z and that from z to y". Since the program P implements a particular shortest path algorithm, one can expect that P upholds this metamorphic relation in the sense that "if z is in the actual output P(x, y), the length of P(x, y) is equal to the sum of the lengths of P(x, z) and P(z, y)". The original test case (x, y) is considered a source test case in MT. The two test cases (x, z) and (z, y) are referred to as follow-up test cases because they are follow-ups of the source test case. In this example, the MR involves one source test case and two follow-up test cases, and the follow-up test cases may depend on the actual output of the program on the source test case. In general, an MR can involve multiple source test cases and multiple follow-up test cases. The following are the formal definitions of MR and other concepts used in MT (chen2017).

Definition 1: Let f be a target function or algorithm to be implemented. A Metamorphic Relation (MR) is a necessary property of f over a sequence of two or more inputs ⟨x1, x2, …, xk⟩, where k ≥ 2, and the sequence of their corresponding outputs ⟨f(x1), f(x2), …, f(xk)⟩. The relation is denoted by R ⊆ Xk × Yk, where ⊆ is the subset relation, and Xk and Yk are the Cartesian products of k input spaces and k output spaces, respectively. We may simply adopt R(x1, x2, …, xk, f(x1), f(x2), …, f(xk)) to represent this relation.

Definition 2: Consider an MR R(x1, x2, …, xk, f(x1), f(x2), …, f(xk)). Suppose that each xj (j = i+1, i+2, …, k) is constructed based on x1, x2, …, xi according to R. For any j = 1, 2, …, i, xj is referred to as a source input. For any j = i+1, i+2, …, k, xj is referred to as a follow-up input. That is, for a given R, if all source inputs x1, x2, …, xi are specified, then the follow-up inputs xi+1, xi+2, …, xk can be constructed based on the source inputs and, if necessary, their corresponding outputs. The sequence of inputs ⟨x1, x2, …, xk⟩ is referred to as a metamorphic test group (MTG) of inputs for the MR.

Process for Metamorphic Testing (MT). Let P be an implementation of a target function f. For an MR R, suppose that we have R(x1, x2, …, xk, f(x1), f(x2), …, f(xk)). Metamorphic Testing based on this MR for P involves the following steps:

(1) Define R′ from R by replacing f by P.

(2) Given a sequence of source test cases ⟨x1, x2, …, xi⟩, execute them to obtain their respective source outputs ⟨P(x1), P(x2), …, P(xi)⟩. Construct and execute a sequence of follow-up test cases ⟨xi+1, xi+2, …, xk⟩ with reference to R′, and obtain their corresponding follow-up outputs ⟨P(xi+1), P(xi+2), …, P(xk)⟩.

(3) Compare the executed results against the expectation given by R′. If R′ is not satisfied, then P is determined to be faulty by this MR.

In other words, to apply MT to test a program using a given MR, we first need to generate the relevant source test cases. We then execute the program with these source test cases to obtain their respective source outputs. Based on the given MR, we can generate the relevant follow-up test cases. After that, the program is executed with these follow-up test cases to obtain the respective follow-up outputs. Finally, we can check whether the given MR is satisfied by the source test cases, the source outputs, the follow-up test cases, and the follow-up outputs. If the MR is violated, it means that the program under test is faulty. This entire process of MT can be automated.
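The steps above can be sketched as a small, generic harness. This is our own illustration, not part of the paper; `program`, `make_follow_up` and `check_relation` are hypothetical names, and a toy MR for the sine function, sin(x) = sin(π − x), stands in for a regression MR.

```python
import numpy as np

def metamorphic_test(program, source_input, make_follow_up, check_relation):
    """One MT step: run the source test case, build the follow-up test case,
    run it, and check whether the MR is satisfied."""
    source_output = program(source_input)
    follow_up_input = make_follow_up(source_input, source_output)
    follow_up_output = program(follow_up_input)
    return check_relation(source_input, source_output,
                          follow_up_input, follow_up_output)

# Toy example: the MR sin(x) = sin(pi - x) for the program np.sin.
passed = metamorphic_test(
    program=np.sin,
    source_input=0.3,
    make_follow_up=lambda x, _: np.pi - x,
    check_relation=lambda xs, ys, xf, yf: np.isclose(ys, yf),
)
print(passed)  # True: the MR holds, so no fault is revealed
```

A violation (`passed == False`) would indicate a fault in the program under test; satisfaction alone, of course, does not prove correctness.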

3 Metamorphic relations of the estimator

In this section, we propose 11 MRs that can be used to test linear regression-based systems. As mentioned earlier, they are derived from properties of the targeted algorithm to be implemented. It is important to discuss the intuition behind each property, understand the properties, and explain how the MRs can be derived from them. Interested readers may find the mathematical proofs of these properties in Appendix A. For ease of discussion and illustration, we will use 2-dimensional examples where needed. Last, but not least, all properties discussed in this section involve the optimized estimator ^β.

3.1 Properties of estimator

3.1.1 Property 1. Inserting new data

This property is about inserting an additional data point to the original set of data points for linear regression. Proposition 1 shows that, after a new data point is added to the original set of data points, the new linear regression line obtained from the new set of data points can be derived from a relationship with the original regression line and the new data point.

Proposition 1.

Suppose that the dependent variable y is related by a linear relationship to the independent variables x0, x1, …, xd with the estimator ^β derived from least-square fitting. Let xj and y (for all j = 0, 1, …, d) denote the vectors of data for the variables xj and y, respectively, whose matrices are expressed in Equation (4). And let x∗ and y∗ be a data point added to the original data set. The linear estimator ^β∗ obtained from the new data set can be derived from the following equation

 ^β∗=^β+G(y∗−x∗T^β) (8)

where G is the vector of size (d+1) defined by X and x∗ as follows

 G = (XXT)−1x∗ / (1 + x∗T(XXT)−1x∗) (9)

We now derive an important property that forms the foundation of the first two metamorphic relations, namely MR 1.1 and MR 1.2 in Section 3.2.

Corollary 1. Same notation as in Proposition 1. If y∗ = x∗T^β, then ^β∗ = ^β.

Proof: Suppose y∗ = x∗T^β. It is straightforward from Equation (8) that ^β∗ = ^β.

In other words, if the new data point is generated from the model derived from the original data set, we expect the new model obtained from the new data set to be the same as the original model. For example, suppose we have 3 data points in the data set for linear regression, and the linear regression system returns a certain line. If we use this line's equation to generate a new data point, say (7, 15), and feed all 4 (the original 3 plus this new) data points to the linear regression system, we expect the new regression line to be the same as the original one.
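This behaviour is easy to check numerically. The sketch below is our own illustration (not from the paper), with a hypothetical `fit` helper that solves the normal equation: it fits a random two-dimensional data set, appends a point taken from the fitted line, and refits.

```python
import numpy as np

def fit(X, y):
    # OLS via the normal equation; X is (d+1) x n with a leading row of ones.
    return np.linalg.solve(X @ X.T, X @ y)

rng = np.random.default_rng(0)
x1 = rng.uniform(-5, 5, size=10)
y = 1.5 + 0.8 * x1 + rng.normal(0, 0.1, size=10)
X = np.vstack([np.ones_like(x1), x1])
beta = fit(X, y)

# Follow-up input: add one point taken from the fitted line itself.
x_new = 7.0
y_new = beta[0] + beta[1] * x_new
X_f = np.hstack([X, [[1.0], [x_new]]])
y_f = np.append(y, y_new)
beta_f = fit(X_f, y_f)
print(np.allclose(beta, beta_f))  # True: the estimator is unchanged
```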

Corollary 2. Same notation as in Proposition 1. If y∗ = ¯y and x∗j = ¯xj, where

 ¯y = (1/n) ∑i=1..n yi (10)
 ¯xj = (1/n) ∑i=1..n xi,j (11)

for j = 0, 1, …, d, then ^β∗ = ^β for the regression with intercept, as explicitly expressed in Equation (3).

Proof: The summation of both sides of Equation (3) over all n data points, with the convention xi,0 = 1, gives us

 ∑i=1..n ^yi = ∑j=0..d ^βj ∑i=1..n xi,j (12)

The linear regression with intercept is unbiased, that is

 (1/n) ∑i=1..n ^yi = (1/n) ∑i=1..n yi (13)

As a result, from the definitions of ¯y and ¯xj, we obtain

 ¯y = ∑j=0..d ^βj ¯xj (14)

This equation can be rewritten in vector form as ¯y = ¯xT^β. It then follows immediately from Corollary 1 that ^β∗ = ^β.
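Corollary 2 says that the fitted hyperplane with intercept always passes through the mean point, so appending that point should leave the estimator unchanged. A quick numerical check (our own sketch, not from the paper, with a hypothetical `fit` helper):

```python
import numpy as np

def fit(X, y):
    # OLS via the normal equation; X is (d+1) x n with a leading row of ones.
    return np.linalg.solve(X @ X.T, X @ y)

rng = np.random.default_rng(1)
x1 = rng.uniform(0, 10, size=8)
y = 2.0 - 0.5 * x1 + rng.normal(0, 0.2, size=8)
X = np.vstack([np.ones_like(x1), x1])
beta = fit(X, y)

# Follow-up input: append the mean point (x̄1, ȳ).
X_f = np.hstack([X, [[1.0], [x1.mean()]]])
y_f = np.append(y, y.mean())
beta_f = fit(X_f, y_f)
print(np.allclose(beta, beta_f))  # True
```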

3.1.2 Property 2. Scaling data

The second property is about the scaling of certain coordinates of the data points. Proposition 2 shows that, if the coordinates of the data points along a certain axis are scaled by a given factor, the corresponding components of the estimator are scaled accordingly.

Proposition 2. Suppose that the values of the dependent variable y are scaled by a factor of a with respect to the original values, and the values of an independent variable xk are scaled by a factor of b with respect to the original values, that is

 y∗ = a y (15)
 x∗k = b xk (16)

where the constants a and b are non-zero real numbers (a, b ≠ 0). The new estimator ^β∗ can be computed from the original estimator ^β as follows

 ^β∗ = [a^β0  a^β1  …  a^βk−1  (a/b)^βk  a^βk+1  …  a^βd]T (17)

Based on this proposition, we can further derive two properties of mirroring the components of the data points, and two properties of scaling the components of the data points. These properties are the basis of four MRs, namely MR 2.1, MR 2.2, MR 3.1 and MR 3.2 in Section 3.2. The proofs of these properties are straightforward from Equation  (17). We will leave them to the readers.

Corollary 3. Same notation as in Proposition 2. If a = −1 and b = 1, then ^β∗ = −^β.

That is, if we reflect the sign of the dependent variable, we expect the sign of the estimator to be reflected as well. This is the basis of our MR 2.1.

Corollary 4. Same notation as in Proposition 2. If a = 1 and b = −1, then ^β∗ is given by the following equation

 ^β∗ = [^β0  ^β1  …  ^βk−1  −^βk  ^βk+1  …  ^βd]T (18)

More simply, if we reflect only a particular independent variable, only the component of the estimator related to this variable is reflected. This is the basis of our MR 2.2.

Corollary 5. Same notation as in Proposition 2. If a ≠ 0 and b = 1, then ^β∗ = a^β.

Intuitively, if we scale the dependent variable by a factor of a, the estimator will also be scaled by the same factor. This is the basis of our MR 3.1.

Corollary 6. Same notation as in Proposition 2. If a = 1 and b ≠ 0, then ^β∗ is given by the following equation

 ^β∗ = [^β0  ^β1  …  ^βk−1  (1/b)^βk  ^βk+1  …  ^βd]T (19)

That is, if we scale only a particular independent variable by a factor of b, only the component of the estimator related to this variable is scaled, reciprocally. This is the basis of our MR 3.2.
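Corollaries 5 and 6 can be checked numerically with a few lines of NumPy. The sketch below is our own illustration, again using a hypothetical `fit` helper that solves the normal equation.

```python
import numpy as np

def fit(X, y):
    # OLS via the normal equation; X is (d+1) x n with a leading row of ones.
    return np.linalg.solve(X @ X.T, X @ y)

rng = np.random.default_rng(2)
x1 = rng.uniform(-3, 3, size=12)
y = 0.7 + 1.9 * x1 + rng.normal(0, 0.05, size=12)
X = np.vstack([np.ones_like(x1), x1])
beta = fit(X, y)

a, b = 4.0, -2.5
# Corollary 5: scaling y by a scales every component of beta by a.
assert np.allclose(fit(X, a * y), a * beta)
# Corollary 6: scaling x1 by b scales beta_1 by 1/b (intercept unchanged).
X_b = np.vstack([np.ones_like(x1), b * x1])
assert np.allclose(fit(X_b, y), [beta[0], beta[1] / b])
print("scaling MRs hold")
```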

3.1.3 Property 3. Shifting data

The third property is about the shifting of certain components of the data in the original data set. Proposition 3 shows that, if the coordinates of data points along a certain axis are shifted by a given distance, the projection of the regression line with intercept is also shifted by a proportional distance.

Proposition 3. Suppose that the values of the dependent variable y are shifted by a distance of a with respect to the original values, and the values of an independent variable xk are shifted by a distance of b with respect to the original values, that is

 y∗ = y + a 1n (20)
 x∗k = xk + b 1n (21)

where a and b are real constants and 1n is the vector of n ones. In the linear regression with intercept, the new estimator ^β∗ can be determined from the original estimator ^β as follows

 ^β∗ = [^β0 − b^βk + a  ^β1  ^β2  …  ^βd]T (22)

Based on this proposition, we can further derive two properties of shifting the components of the data points. These properties are the basis of two MRs, namely MR 4.1 and MR 4.2 in Section 3.2. The proofs of these properties are straightforward from Equation (22).

Corollary 7. Same notation as in Proposition 3. If a ≠ 0 and b = 0, then ^β∗ is given by the following equation

 ^β∗ = [^β0 + a  ^β1  ^β2  …  ^βd]T (23)

In other words, if we shift the dependent variable up by a value of a, the intercept component of the estimator will be increased by the same value. This is the basis of our MR 4.1.

Corollary 8. Same notation as in Proposition 3. If a = 0 and b ≠ 0, then ^β∗ is given by the following equation

 ^β∗ = [^β0 − b^βk  ^β1  ^β2  …  ^βd]T (24)

That is, if we shift a particular independent variable up by a value of b, the intercept component of the estimator will be decreased by an amount proportional to both this value and the component of the estimator related to this variable. This is the basis of our MR 4.2.
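The two shifting corollaries can be verified in the same style (our own sketch, with a hypothetical `fit` helper):

```python
import numpy as np

def fit(X, y):
    # OLS via the normal equation; X is (d+1) x n with a leading row of ones.
    return np.linalg.solve(X @ X.T, X @ y)

rng = np.random.default_rng(3)
x1 = rng.uniform(-2, 2, size=10)
y = -1.0 + 3.0 * x1 + rng.normal(0, 0.1, size=10)
X = np.vstack([np.ones_like(x1), x1])
beta = fit(X, y)

a, b = 5.0, 2.0
# Corollary 7: shifting y by a raises the intercept by a.
assert np.allclose(fit(X, y + a), [beta[0] + a, beta[1]])
# Corollary 8: shifting x1 by b lowers the intercept by b * beta_1.
X_b = np.vstack([np.ones_like(x1), x1 + b])
assert np.allclose(fit(X_b, y), [beta[0] - b * beta[1], beta[1]])
print("shifting MRs hold")
```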

3.1.4 Property 4. Permuting data

The fourth property is about the permutation of certain components or certain samples of the data. Proposition 4 shows that, if the coordinates of the data points along certain axes are permuted, the corresponding projections of the regression line along these axes are permuted in the same order. Moreover, the proposition shows that, permuting the data points does not change the regression line.

Proposition 4. Suppose that the samples of the dependent variable are permuted by the function σs with respect to the original variable y, and the samples of the independent variables are permuted by both functions σs and σv with respect to the original values, such that

 y∗ = σs(y) (25)
 x∗k = σs(xσv(k)) (26)

where σv is a permutation (bijective function) from the set {0, 1, …, d} to itself, whilst σs is the corresponding bijective renumbering of the set of sample indices {1, 2, …, n}. The new estimator ^β∗ can be determined from the original estimator ^β as follows

 ^β∗ = [^βσv(0)  ^βσv(1)  …  ^βσv(d)]T (27)

Based on this proposition, we can further derive two properties of permuting the data points. These properties are the basis of two MRs, namely MR 5.1 and MR 5.2 in Section 3.2. The proofs of these properties are straightforward from Equation (27).

Corollary 9. Same notation as in Proposition 4. If σv is the identity function (i.e., σv(j) = j) and σs is a bijective function, then ^β∗ = ^β.

Put differently, if we permute the data points only, without permuting the dependent and independent variables, the estimator will remain unchanged. This is the basis of our MR 5.1.

Corollary 10. Same notation as in Proposition 4. If σv is a permutation that only swaps p and q, where p ≠ q and p, q ∈ {0, 1, …, d}, while σs is the identity function, then ^β∗ is given by the following equation

 ^β∗j = ⎧ ^βq  for j = p
     ⎨ ^βp  for j = q (28)
     ⎩ ^βj  otherwise

That is, if we swap two independent variables and their relevant values in the data points, the corresponding components of the estimator will be swapped accordingly. This is the basis of our MR 5.2.
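Both permutation corollaries admit a direct numerical check. The sketch below (ours, not the paper's, with a hypothetical `fit` helper) uses two independent variables plus an intercept:

```python
import numpy as np

def fit(X, y):
    # OLS via the normal equation; X is (d+1) x n with a leading row of ones.
    return np.linalg.solve(X @ X.T, X @ y)

rng = np.random.default_rng(4)
n = 10
x1, x2 = rng.uniform(-1, 1, (2, n))
y = 0.5 + 2.0 * x1 - 1.0 * x2 + rng.normal(0, 0.05, n)
X = np.vstack([np.ones(n), x1, x2])
beta = fit(X, y)

# Corollary 9: permuting the samples leaves the estimator unchanged.
perm = rng.permutation(n)
assert np.allclose(fit(X[:, perm], y[perm]), beta)
# Corollary 10: swapping x1 and x2 swaps beta_1 and beta_2.
X_swap = np.vstack([np.ones(n), x2, x1])
assert np.allclose(fit(X_swap, y), [beta[0], beta[2], beta[1]])
print("permutation MRs hold")
```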

3.1.5 Property 5. Rotating data

The fifth property is about the rotation of coordinate system. Proposition 5 shows that, if the axes related to the independent variables of the data points are rotated by a given rotation angle, the corresponding projections of the regression line are rotated by the same angle.

Proposition 5. Suppose that the values of the dependent variable y are kept unchanged, while the components of the independent variables are rotated by the matrix R of size (d+1)×(d+1). In other words, the new matrix X∗ is obtained by rotating the original matrix X, that is

 X∗=RX (29)

Then the new estimator can be determined from the original estimator as follows

 ^β∗=R^β (30)

Based on this proposition, we can derive the property of rotating the data points. This property is the basis of the MR 6 in Section 3.2. The proof of this property is straightforward from Equation (30).

Corollary 11. Same notation as in Proposition 5. Suppose two components xp and xq with p ≠ q are rotated by an angle θ in the counter-clockwise direction, that is

 x∗p = xp cosθ − xq sinθ
 x∗q = xp sinθ + xq cosθ (31)

in other words, the rotation matrix Rpq has the form

 Rpq =
 ⎡ 1  0  …  0     …  0     …  0 ⎤
 ⎢ 0  1  …  0     …  0     …  0 ⎥
 ⎢ …  …  …  …     …  …     …  … ⎥
 ⎢ 0  0  …  cosθ  … −sinθ  …  0 ⎥  ← row p
 ⎢ …  …  …  …     …  …     …  … ⎥
 ⎢ 0  0  …  sinθ  …  cosθ  …  0 ⎥  ← row q
 ⎢ …  …  …  …     …  …     …  … ⎥
 ⎣ 0  0  …  0     …  0     …  1 ⎦ (32)

then the new estimator is determined from the original estimator as follows

 ^β∗j = ⎧ ^βp cosθ − ^βq sinθ  for j = p
     ⎨ ^βp sinθ + ^βq cosθ  for j = q (33)
     ⎩ ^βj  otherwise

That is, if we rotate two components of the data points by an angle, the corresponding components of the estimator will be rotated by the same angle.
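As a numerical illustration (ours, not the paper's), rotating the rows of X corresponding to two independent variables and refitting should rotate the matching components of the estimator; the intercept row is left untouched. A hypothetical `fit` helper solves the normal equation:

```python
import numpy as np

def fit(X, y):
    # OLS via the normal equation; X is (d+1) x n with a leading row of ones.
    return np.linalg.solve(X @ X.T, X @ y)

rng = np.random.default_rng(5)
n = 12
x1, x2 = rng.uniform(-1, 1, (2, n))
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(0, 0.05, n)
X = np.vstack([np.ones(n), x1, x2])
beta = fit(X, y)

# Rotate the (x1, x2) plane by theta; row 0 (the intercept) is untouched.
theta = 0.7
c, s = np.cos(theta), np.sin(theta)
R = np.array([[1.0, 0.0, 0.0],
              [0.0, c,  -s ],
              [0.0, s,   c ]])
beta_rot = fit(R @ X, y)
assert np.allclose(beta_rot, R @ beta)
print("rotation MR holds")
```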

3.2 Metamorphic relations

In this subsection, we present 11 MRs for testing multiple linear regression systems. They are derived from the properties presented in Section 3.1 and are grouped into 6 categories. We will describe the regression form with intercept first, and discuss the regression form without intercept later. Table 1 summarizes these MRs and their applicable types of regression, and Figure 1 illustrates the transformation examples for each MR.

Figure 1: Simplified original points and their regression line (hyperplane) in (a) two-dimensional and (b) three-dimensional views, and their transformations due to (c) MR1.1, (d) MR1.2, (e) MR2.1, (f) MR2.2, (g) MR3.1, (h) MR3.2, (i) MR4.1, (j) MR4.2, (k) MR5.1, (l) MR5.2, and (m) MR6. The origin of coordinates is denoted by the black point. The dashed grey circles represent original points, i.e. the source inputs. The open black circles denote transformed points, which may overlay the original points if there is no transformation. Newly added points are marked by the red circles. The follow-up inputs thus consist of transformed and added points. Black lines (grey hyperplanes) exhibit the follow-up regression lines (hyperplanes). The dashed red arrows sketch the transformations applied.

Before we discuss the individual MRs, let us introduce some notation for the regression with intercept. This regression form is adopted as the default unless stated otherwise. First, let I_s denote the source input of the linear regression system under test (SUT); it consists of a set of n data points

 Is ={Ps1,Ps2,…,Psn} (34)

where each data point P_si (i = 1, 2, …, n) can be denoted as

 Psi=(xsi,1,xsi,2,…,xsi,d,ysi) (35)

in which x_si,j (j = 1, 2, …, d) is the j-th x-coordinate and y_si is the y-coordinate of the point P_si. The output of the SUT using the source input I_s can be expressed by

 Os =[^βs0^βs1…^βsd]T (36)

The output O_s is referred to as the source output. That is, the equation of the regression form with intercept is

 ^y=^βs0+x1^βs1+…+xd^βsd (37)

Let the follow-up input I_f be a set of m data points generated using a given MR, that is

 If ={Pf1,Pf2,…,Pfm} (38)

where

 Pfi=(xfi,1,xfi,2,…,xfi,d,yfi) (39)

for i = 1, 2, …, m. The output of the same program using the follow-up input I_f is expressed by

 Of =[^βf0^βf1…^βfd]T (40)

The output O_f is referred to as the follow-up output. For convenience in describing the MRs, we refer to I_s and O_s (I_f and O_f) as the source (follow-up) data set and output, respectively.

In the context of MT, if I_s, O_s, I_f, and O_f do not satisfy the relevant MR, the SUT is said to be faulty. We now describe the MRs, based on the previously mentioned properties, to be used in our experimentation to validate regression systems.
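To make this notation concrete, the following sketch (our own illustration, not from the paper) builds a source input I_s, obtains the source output O_s from numpy's least squares used as a stand-in SUT, and evaluates the fitted Equation (37); the helper name `fit_with_intercept` and the data values are hypothetical.

```python
import numpy as np

def fit_with_intercept(X, y):
    """Stand-in SUT: returns the estimator O = [b0, b1, ..., bd]."""
    A = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

# Source input I_s: n = 6 points with d = 2 independent variables
Xs = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
               [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
ys = np.array([3.1, 4.0, 8.9, 10.2, 14.8, 16.1])

Os = fit_with_intercept(Xs, ys)        # source output, Equation (36)
y_hat = Os[0] + Xs @ Os[1:]            # predictions from Equation (37)
residuals = ys - y_hat                 # residuals of an intercept fit sum to zero
```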

3.2.1 Category 1: Unchanged predictions

The two MRs in this category are based on Corollary 1: adding a point predicted by the regression line into the original data set shall yield the same regression line.

MR1.1. Inserting a predicted point

The regression line will remain the same after being updated by adding a point selected arbitrarily from the line into the original data set. This MR is illustrated in Figure 1c.

Given the source input I_s and source output O_s, the follow-up input I_f is formed by adding a new point P_f(n+1) to the source input I_s, that is

 If =Is∪{Pfn+1} (41)

where

 Pfn+1=(xfn+1,1,xfn+1,2,…,xfn+1,d,yfn+1) (42)

such that x_f(n+1),1, …, x_f(n+1),d are arbitrary real numbers, and y_f(n+1) is computed from these x values and the source output O_s using the following equation

 yfn+1=^βs0+xfn+1,1^βs1+…+xfn+1,d^βsd (43)

In other words, P_f(n+1) is a point that falls on the “predicted” regression equation. We then expect the follow-up output to be the same as the source output.

 Of =Os (44)

To sum up, the follow-up estimator is the same as the source estimator when a point predicted from the source model is included in the follow-up estimation. Instead of a single point, several new points can be added at the same time; the addition preserves the source estimator as long as all the points are derived from the regression model.

It is noted that, in practice, floating-point computations may give slightly different values due to rounding. We thus need to consider the impact of round-off error when determining the violation of each MR. For example, instead of requiring exact equality in this MR, the comparison should be relaxed to an inequality that accounts for an error tolerance. This will be explained in detail in Subsection 4.3 (Assessment).
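MR1.1 can be sketched as follows (our own illustration under the assumption that numpy's least squares serves as the SUT; the helper name `fit` is hypothetical): a point lying exactly on the fitted hyperplane is appended, and the follow-up estimator is compared to the source estimator within floating-point tolerance.

```python
import numpy as np

def fit(X, y):
    # OLS with intercept: returns the estimator [b0, b1, ..., bd]
    A = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

rng = np.random.default_rng(1)
Xs = rng.normal(size=(30, 2))
ys = Xs @ [2.0, -1.0] + 0.5 + rng.normal(scale=0.2, size=30)
Os = fit(Xs, ys)                       # source output

x_new = np.array([10.0, -3.0])         # arbitrary x-coordinates
y_new = Os[0] + x_new @ Os[1:]         # Equation (43): y predicted by the source model
Xf = np.vstack([Xs, x_new])
yf = np.append(ys, y_new)
Of = fit(Xf, yf)                       # MR1.1 expects Of == Os up to round-off
```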

MR1.2. Inserting the centroid

The linear regression line will remain the same after being updated by adding the centroid into the original data set for the intercept form, as illustrated in Figure 1d. The centroid is the point whose coordinates are the arithmetic means of the corresponding coordinates of all data points, which can be easily computed. This point lies on the linear regression line.

Given the source input I_s and source output O_s, the follow-up input I_f is formed by adding the centroid point P_f(n+1) to the source input I_s, that is

 If =Is∪{Pfn+1} (45)

where

 Pfn+1=(xfn+1,1,xfn+1,2,…,xfn+1,d,yfn+1) (46)

is such that x_f(n+1),j and y_f(n+1) are the averages of the corresponding values in the source data I_s, that is

\[
\begin{aligned}
y_{f(n+1)} &= \tfrac{1}{n}\,(y_{s1} + y_{s2} + \ldots + y_{sn}) \\
x_{f(n+1),j} &= \tfrac{1}{n}\,(x_{s1,j} + x_{s2,j} + \ldots + x_{sn,j})
\end{aligned}
\quad (47)
\]

for j = 1, 2, …, d. If the regression program is configured to run with intercept, then we have

 Of =Os (48)

In other words, adding the centroid of the data into the source input to form a follow-up input will not change the estimator for the regression form with intercept.
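MR1.2 can be sketched in the same way (again our own illustration, with numpy's least squares as a stand-in SUT and the hypothetical helper `fit`): the centroid of the source data is appended, and the estimator is expected to be unchanged for the intercept form.

```python
import numpy as np

def fit(X, y):
    # OLS with intercept: returns the estimator [b0, b1, ..., bd]
    A = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

rng = np.random.default_rng(2)
Xs = rng.normal(size=(30, 2))
ys = Xs @ [1.0, 3.0] - 2.0 + rng.normal(scale=0.2, size=30)
Os = fit(Xs, ys)                       # source output

# Follow-up input: append the centroid, Equation (47)
Xf = np.vstack([Xs, Xs.mean(axis=0)])
yf = np.append(ys, ys.mean())
Of = fit(Xf, yf)                       # MR1.2 expects Of == Os for the intercept form
```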

3.2.2 Category 2: Mirrored regression

The MRs in this category are based on Corollary 3 and Corollary 4 with the scaling parameter set to −1. This set of MRs describes how reflecting the data points reflects the regression line accordingly.

MR2.1. Reflecting the dependent variable

Reflecting the points over the x-axis (that is, negating the y-coordinates) will reflect the regression line over the same axis, as illustrated in Figure 1e.

Given the source input I_s and source output O_s, the follow-up input I_f consists of n points

 If ={Pf1,Pf2,…,Pfn} (49)

where for each point P_fi (i = 1, 2, …, n), the values of its x-coordinates remain unchanged (that is, x_fi,j = x_si,j for j = 1, 2, …, d) and its y-coordinate is reflected, that is

 yfi=−ysi (50)

Then, we have

 Of =−Os (51)

That is, the follow-up estimator is the reflection of the source estimator when the sign of the dependent variable is reversed in the follow-up input set.
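A sketch of MR2.1 (our own illustration, assuming numpy's least squares as a stand-in SUT with the hypothetical helper `fit`): negating the dependent variable should negate every component of the estimator, per Equations (50) and (51).

```python
import numpy as np

def fit(X, y):
    # OLS with intercept: returns the estimator [b0, b1, ..., bd]
    A = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

rng = np.random.default_rng(3)
Xs = rng.normal(size=(30, 2))
ys = Xs @ [0.5, 2.0] + 1.0 + rng.normal(scale=0.2, size=30)

Os = fit(Xs, ys)                       # source output
Of = fit(Xs, -ys)                      # follow-up: reflect the dependent variable
```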

MR2.2. Reflecting an independent variable while keeping the others unchanged

Reflecting the points over the y-axis (that is, negating one independent variable) will reflect the regression line over the same axis, as illustrated in Figure 1f.

Given the source input I_s and source output O_s, the follow-up input I_f is defined to consist of n points

 If ={Pf1,Pf2,…,Pfn} (52)

where for each point P_fi (i = 1, 2, …, n), the x-coordinate of one independent variable, say the k-th, is reflected while the x-coordinates of the other independent variables and the y-coordinates remain the same, that is

\[
x_{fi,j} =
\begin{cases}
-x_{si,k} & \text{for } j = k \\
x_{si,j} & \text{for } j = 1, 2, \ldots, d \text{ and } j \neq k
\end{cases}
\quad (53)
\]

The follow-up output

 Of =[^βf0^βf1…^βfd]T (54)

is determined by the following equation

\[
\hat\beta_{fj} =
\begin{cases}
-\hat\beta_{sk} & \text{for } j = k \\
\hat\beta_{sj} & \text{for } j = 0, 1, \ldots, d \text{ and } j \neq k
\end{cases}
\quad (55)
\]

In a nutshell, if the sign of an independent variable is reversed, the sign of the corresponding component of the estimator should also be reversed.
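A sketch of MR2.2 (our own illustration, assuming numpy's least squares as a stand-in SUT; `fit` is a hypothetical helper): negating the k-th independent variable should flip the sign of the k-th slope only, per Equation (55).

```python
import numpy as np

def fit(X, y):
    # OLS with intercept: returns the estimator [b0, b1, ..., bd]
    A = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

rng = np.random.default_rng(4)
Xs = rng.normal(size=(30, 3))
ys = Xs @ [1.0, -2.0, 0.5] + 3.0 + rng.normal(scale=0.2, size=30)

k = 1                                  # reflect the k-th independent variable (0-indexed)
Xf = Xs.copy()
Xf[:, k] = -Xs[:, k]

Os, Of = fit(Xs, ys), fit(Xf, ys)
expected = Os.copy()                   # Equation (55): only component k flips sign
expected[1 + k] = -Os[1 + k]
```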

3.2.3 Category 3: Scaled regression

The MRs in this category are based on Corollary 5 (MR3.1) and Corollary 6 (MR3.2) for an arbitrary positive scaling factor. When particular coordinates of the data points are scaled along an axis by a positive factor, the slope of the regression line will be scaled accordingly.

MR3.1. Scaling the dependent variable

After the points are scaled in the y axis by a given factor, the regression line will be scaled by the same ratio, as shown in Figure 1g.

Given the source input I_s and source output O_s, the follow-up input I_f is defined to consist of n points

 If ={Pf1,Pf2,…,Pfn} (56)

where for each point P_fi (i = 1, 2, …, n), the values of its x-coordinates remain unchanged (x_fi,j = x_si,j for j = 1, 2, …, d) and its y-coordinate is scaled by a factor a (a > 0), that is

 yfi=a ysi (57)

Then, the follow-up output would be scaled by the same factor

 Of =a Os (58)

In other words, when the values of the dependent variable are scaled by a given factor, all components of the estimator will be scaled proportionally by the same factor.
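A sketch of MR3.1 (our own illustration, assuming numpy's least squares as a stand-in SUT; `fit` is a hypothetical helper): scaling the dependent variable by a factor a should scale every component of the estimator by a, per Equations (57) and (58).

```python
import numpy as np

def fit(X, y):
    # OLS with intercept: returns the estimator [b0, b1, ..., bd]
    A = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

rng = np.random.default_rng(5)
Xs = rng.normal(size=(30, 2))
ys = Xs @ [1.5, -0.5] + 2.0 + rng.normal(scale=0.2, size=30)

a = 2.5                                # positive scaling factor
Os = fit(Xs, ys)                       # source output
Of = fit(Xs, a * ys)                   # follow-up: scale the dependent variable
```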

MR3.2. Scaling an independent variable while keeping the others unchanged

After the points are scaled along a particular x axis by a given factor, the slope of the regression line will be scaled reciprocally with respect to that axis, as illustrated in Figure 1h.

Given the source input I_s and source output O_s, the follow-up input I_f is defined to consist of n points

 If ={Pf1,Pf2,…,Pfn} (59)

where for each point P_fi (i = 1, 2, …, n), the x-values of one independent variable, say the k-th, are scaled by a factor b (b > 0), while the values of the other independent variables and the dependent variable remain the same, that is

\[
x_{fi,j} =
\begin{cases}
b\, x_{si,k} & \text{for } j = k \\
x_{si,j} & \text{for } j = 1, 2, \ldots, d \text{ and } j \neq k
\end{cases}
\quad (60)
\]

and

 yfi=ysi (61)

The follow-up output

 Of =[^βf0^βf1…^βfd]T (62)

would be defined by the following equation

\[
\hat\beta_{fj} =
\begin{cases}
\frac{1}{b}\,\hat\beta_{sk} & \text{for } j = k \\
\hat\beta_{sj} & \text{for } j = 0, 1, \ldots, d \text{ and } j \neq k
\end{cases}
\quad (63)
\]

In other words, if the values of an independent variable are scaled by a given factor, the corresponding component of the estimator would be scaled by the reciprocal of the scaling constant.
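A sketch of MR3.2 (our own illustration, assuming numpy's least squares as a stand-in SUT; `fit` is a hypothetical helper): scaling the k-th independent variable by b should scale the k-th slope by 1/b and leave the other components untouched, per Equation (63).

```python
import numpy as np

def fit(X, y):
    # OLS with intercept: returns the estimator [b0, b1, ..., bd]
    A = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

rng = np.random.default_rng(6)
Xs = rng.normal(size=(30, 3))
ys = Xs @ [1.0, 2.0, -1.0] + 0.5 + rng.normal(scale=0.2, size=30)

b, k = 4.0, 0                          # scale the k-th independent variable by b
Xf = Xs.copy()
Xf[:, k] = b * Xs[:, k]

Os, Of = fit(Xs, ys), fit(Xf, ys)
expected = Os.copy()                   # Equation (63): only slope k is divided by b
expected[1 + k] = Os[1 + k] / b
```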

3.2.4 Category 4: Shifted regression

The MRs in this category are based on Corollary 7 and Corollary 8, and are applicable only to the regression with intercept. When the data points are shifted along an axis, the resulting change in the regression line is absorbed by its intercept.

MR4.1. Shifting the dependent variable

When the points are shifted by a given distance along the y axis, the regression line will be shifted along this axis by the same distance, as illustrated in Figure 1i.

Given the source input I_s and source output O_s, the follow-up input I_f is defined to consist of n points

 If ={Pf1,Pf2,…,Pfn} (64)

where for each point P_fi (i = 1, 2, …, n), the value of its y-coordinate is shifted by a distance a, that is

 yfi=ysi+a (65)

Then, the follow-up output

 Of =[^βf0^βf1…^βfd]T (66)

would be defined by the following equation

\[
\hat\beta_{fj} =
\begin{cases}
\hat\beta_{s0} + a & \text{for } j = 0 \\
\hat\beta_{sj} & \text{for } j = 1, 2, \ldots, d
\end{cases}
\quad (67)
\]

In brief, if a constant is added to the values of the dependent variable, the intercept of the new estimator would be increased by the same value.
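A sketch of MR4.1 (our own illustration, assuming numpy's least squares as a stand-in SUT; `fit` is a hypothetical helper): shifting the dependent variable by a constant a should shift only the intercept by a, per Equations (65) and (67).

```python
import numpy as np

def fit(X, y):
    # OLS with intercept: returns the estimator [b0, b1, ..., bd]
    A = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

rng = np.random.default_rng(7)
Xs = rng.normal(size=(30, 2))
ys = Xs @ [2.0, 1.0] - 1.0 + rng.normal(scale=0.2, size=30)

a = 7.0                                # shift distance along the y axis
Os = fit(Xs, ys)
Of = fit(Xs, ys + a)                   # follow-up: shift the dependent variable
expected = Os.copy()                   # Equation (67): only the intercept moves
expected[0] = Os[0] + a
```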

MR4.2. Shifting an independent variable while keeping the others unchanged

When the points are shifted by a given distance along a certain x axis, the regression line will be shifted in parallel along this axis accordingly, as illustrated in Figure 1j.

Given the source input I_s and source output O_s, the follow-up input I_f is defined to consist of n points

 If ={Pf1,Pf2,…,Pfn} (68)

where for each point P_fi (i = 1, 2, …, n), the value of one independent variable, say the k-th, is shifted by a distance b while the x-coordinates of the other independent variables and the y-coordinates remain the same, that is

\[
x_{fi,j} =
\begin{cases}
x_{si,k} + b & \text{for } j = k \\
x_{si,j} & \text{for } j = 1, 2, \ldots, d \text{ and } j \neq k
\end{cases}
\quad (69)
\]

The follow-up output

 Of =[^βf0^βf