Ordinary Least Squares

What is Ordinary Least Squares?

Ordinary Least Squares (OLS) is a method for estimating the parameters of a linear regression model, one of the most fundamental and widely used predictive techniques in statistics and machine learning. With a single independent variable, OLS finds the best-fitting straight line through a set of points; with several independent variables, it finds the best-fitting plane or hyperplane. The fitted line is known as the regression line and is used to predict the value of a dependent variable from the values of one or more independent variables.

How Does Ordinary Least Squares Work?

The goal of OLS is to minimize the sum of the squared differences between the observed values of the dependent variable and the values predicted by the linear function. These differences are called "residuals" and correspond to the vertical distances between the data points and the regression line.
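The quantity being minimized can be sketched in a few lines of Python. The data below are a hypothetical toy sample, and the function name sum_squared_residuals is chosen here purely for illustration. The sketch compares a flat line at the mean of y with the line that OLS selects for this sample.

```python
# Hypothetical toy data: five (x, y) observations.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]


def sum_squared_residuals(intercept, slope, xs, ys):
    """Return the sum of squared vertical gaps between the data and a line."""
    total = 0.0
    for x, y in zip(xs, ys):
        predicted = intercept + slope * x
        residual = y - predicted  # observed minus predicted
        total += residual ** 2
    return total


# A horizontal line at y = 4 (the mean of ys) as a naive candidate.
ssr_flat = sum_squared_residuals(4.0, 0.0, xs, ys)

# The line y = 2.2 + 0.6x, which is the OLS solution for this sample.
ssr_ols = sum_squared_residuals(2.2, 0.6, xs, ys)

print(ssr_flat, ssr_ols)  # roughly 6.0 and 2.4
```

Any other intercept and slope give a larger sum for this sample; OLS picks the pair for which the sum is as small as possible.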

To achieve this, OLS estimates the parameters of the linear function, which are the coefficients in the regression equation. The regression equation can be represented as:

y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε

where:

y is the dependent variable,
β₀ is the y-intercept,
β₁, β₂, ..., βₙ are the coefficients of the independent variables x₁, x₂, ..., xₙ,
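ε is the error term, the part of y that the linear function does not explain. It is unobserved; the residuals are its estimates after fitting.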

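The OLS method assumes that the relationship between the independent and dependent variables is linear in the coefficients. It also assumes that the errors have mean zero given the independent variables, are uncorrelated with each other, and have constant variance (homoscedasticity), and that no independent variable is an exact linear combination of the others (no perfect multicollinearity). Normally distributed errors are not needed to compute the estimates; they are an additional assumption that justifies exact t-tests and confidence intervals in small samples. High, but not perfect, correlation among the independent variables does not invalidate OLS, though it makes the individual coefficient estimates imprecise.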

Calculating the Coefficients

The coefficients are calculated by deriving the least squares estimator, which involves taking partial derivatives of the sum of squared residuals with respect to each coefficient, setting them to zero, and solving the resulting system of equations. This process leads to a set of normal equations, which can be solved to find the coefficient estimates.
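In matrix form the normal equations read (XᵀX)β = Xᵀy, where X is the matrix of independent variables with a leading column of ones for the intercept. The following is a minimal sketch in plain Python, assuming a small, well-conditioned problem. The helper names solve_linear_system and ols_fit are illustrative, and the data are constructed without noise from y = 1 + 2x₁ + 3x₂ so that the recovered coefficients can be checked by eye. For real data, a numerical library's least squares routine would usually be preferred over forming XᵀX by hand.

```python
def solve_linear_system(a, b):
    """Solve a @ beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so that row operations apply to both sides.
    aug = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Use the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular (perfect multicollinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back-substitution on the upper-triangular system.
    beta = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * beta[k] for k in range(row + 1, n))
        beta[row] = (aug[row][n] - known) / aug[row][row]
    return beta


def ols_fit(rows, ys):
    """Estimate [b0, b1, ..., bn] by solving the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # leading 1.0 = intercept
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical noise-free data generated from y = 1 + 2*x1 + 3*x2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

beta = ols_fit(rows, ys)
print(beta)  # approximately [1.0, 2.0, 3.0]
```

If one independent variable were an exact linear combination of the others, XᵀX would be singular and the sketch would raise an error, which is the computational face of the no-perfect-multicollinearity requirement.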

The formula for the coefficients in a simple linear regression (one independent variable) can be represented as:

β₁ = Σ((xᵢ - x̄)(yᵢ - ȳ)) / Σ((xᵢ - x̄)²)

β₀ = ȳ - β₁x̄

where x̄ and ȳ are the sample means of the independent and dependent variables, respectively.
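These two formulas translate directly into code. The sketch below uses a hypothetical five-point sample, and the function name simple_ols is illustrative rather than taken from any library.

```python
def simple_ols(xs, ys):
    """Return (intercept, slope) for a one-variable OLS fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical toy data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

intercept, slope = simple_ols(xs, ys)
print(intercept, slope)  # approximately 2.2 and 0.6
```

Here x̄ = 3 and ȳ = 4, the numerator sums to 6 and the denominator to 10, so β₁ = 0.6 and β₀ = 4 - 0.6 × 3 = 2.2. The function divides by zero if every x value is identical, since no slope can be estimated from a sample with no variation in x.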

Goodness of Fit

Once the regression line is calculated, the goodness of fit of the model can be assessed using the coefficient of determination, denoted as R². R² measures the proportion of the variance in the dependent variable that is explained by the independent variable(s). An R² value of 1 indicates that the regression line fits the data perfectly, while an R² of 0 indicates that the model explains none of the variance, so it predicts no better than the mean of the dependent variable.
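R² is commonly computed as 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below assumes that definition and uses hypothetical toy data together with the line y = 2.2 + 0.6x, which is the OLS fit for that sample; the function name r_squared is illustrative.

```python
def r_squared(ys, predictions):
    """Return 1 - SS_res / SS_tot for observed values and predictions."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data and the OLS line fitted to it.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
fitted = [2.2 + 0.6 * x for x in xs]

# Baseline that ignores x and always predicts the mean of y.
mean_only = [sum(ys) / len(ys)] * len(ys)

r2_fitted = r_squared(ys, fitted)
r2_mean = r_squared(ys, mean_only)
print(r2_fitted, r2_mean)  # approximately 0.6 and exactly 0.0
```

For this sample the fitted line explains about 60% of the variance in y, while the mean-only baseline scores 0 by construction.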

Applications of Ordinary Least Squares

OLS regression is used across various fields, from economics to engineering. It is applied in situations where the relationship between variables needs to be quantified and predictions are required. For example, OLS can be used to predict consumer spending based on income, or to estimate the impact of education level on wages.

Limitations of Ordinary Least Squares

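While OLS is a powerful tool, it has limitations. It is sensitive to outliers: because residuals are squared, a single extreme point can significantly change the slope and intercept of the regression line. Violated assumptions affect the estimates in different ways. An unmodeled non-linear relationship biases the coefficients. Heteroscedasticity leaves them unbiased but inefficient and makes the usual standard errors unreliable. Multicollinearity inflates the variance of the affected coefficients, which makes them unstable. In such cases, other methods such as weighted least squares (for heteroscedasticity) or robust regression (for outliers) might be more appropriate.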
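The sensitivity to outliers is easy to see on a small example. The sketch below fits a hypothetical sample that lies exactly on y = 2x, then refits after corrupting a single observation; the function name simple_ols and all data values are illustrative.

```python
def simple_ols(xs, ys):
    """Return (intercept, slope) for a one-variable OLS fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


xs = [1.0, 2.0, 3.0, 4.0, 5.0]

# Clean sample: every point lies exactly on y = 2x.
clean_ys = [2.0, 4.0, 6.0, 8.0, 10.0]

# Same sample with the last observation corrupted from 10 to 30.
outlier_ys = [2.0, 4.0, 6.0, 8.0, 30.0]

clean_intercept, clean_slope = simple_ols(xs, clean_ys)
outlier_intercept, outlier_slope = simple_ols(xs, outlier_ys)

print(clean_intercept, clean_slope)      # approximately 0.0 and 2.0
print(outlier_intercept, outlier_slope)  # approximately -8.0 and 6.0
```

One corrupted point out of five triples the estimated slope and moves the intercept from 0 to about -8, even though the other four points still lie exactly on the original line.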

Conclusion

Ordinary Least Squares regression is a cornerstone of statistical analysis and serves as a starting point for many predictive modeling tasks. Its simplicity and interpretability make it a go-to method for estimating relationships between variables and making predictions. However, practitioners must ensure that the assumptions underlying OLS are met and be aware of its limitations when applying it to real-world data.