Logistic Regression

What is Logistic Regression?

Logistic Regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes). It is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is binary, meaning it only contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.).

Understanding Logistic Regression

In logistic regression, we are essentially trying to find the weights that will convert our input data (independent variables) into predictions (dependent variable). The logistic function, also known as the sigmoid function, is what we use to ensure these predictions are in the range of 0 to 1. This function maps any real-valued number into another value between 0 and 1. In the case of logistic regression, it converts the linear regression output to a probability.

The logistic function has an S-shaped curve and is defined by the formula:

Probability of event = 1 / (1 + e^(-y))

where y is the linear combination of the input features weighted by the model coefficients. The 'e' is the base of the natural logarithms and 'y' is the equation of the line (y = mx + b in simple linear regression).

Logistic Regression Model

The logistic regression model builds a regression model to predict the probability that a given data entry belongs to the category numbered as "1". Just like a linear regression model, logistic regression is a predictive analysis. Logistic regression is used to describe data and the relationship between one dependent binary variable and one or more nominal, ordinal, interval, or ratio-level independent variables.

Types of Logistic Regression

1. Binary Logistic Regression: The categorical response has only two 2 possible outcomes. Example: Spam or Not Spam, Sick or Healthy.

2. Multinomial Logistic Regression: Three or more categories without ordering. Example: Predicting food quality rated as Good, Very Good, Best.

3. Ordinal Logistic Regression: Three or more categories with ordering. Example: Movie rating from 1 to 5.

Logistic Regression Assumptions

Logistic regression does not require a linear relationship between the dependent and independent variables. However, it does require the independent variables to be linearly related to the log odds (logit).

Other assumptions include:

- No high intercorrelations (multicollinearity) among the predictors.

- Independent errors.

- Linearity of independent variables and log odds.

- A large sample size.

Modeling with Logistic Regression

To perform logistic regression, we use the logistic function to model the probability that an input (X) belongs to the default class (Y=1). We can write this as:

P(Y=1|X) = logistic(w0 + w1*X1 + w2*X2 + ... + wn*Xn)

where P(Y=1|X) is the probability that the class label Y is 1 given the input X, logistic() is the logistic function, and w0, w1, ..., wn are the coefficients learned by the model.

The coefficients are learned from the training data using maximum likelihood estimation. Maximum likelihood estimation seeks to find the set of coefficients that maximizes the likelihood of the observed set of class labels in the training data.

Advantages and Disadvantages

Advantages of logistic regression include its simplicity and the fact that it can be regularized to avoid overfitting. Logistic models can be updated easily with new data using stochastic gradient descent.

Disadvantages include the assumption of linearity between the dependent variable and the independent variables. It can also require a large number of observations to achieve stable results.

Applications of Logistic Regression

Logistic regression is used in various fields, including machine learning, most medical fields, and social sciences. For example, the Trauma and Injury Severity Score (TRISS), which is widely used to predict mortality in injured patients, was originally developed using logistic regression. Many other medical scales used to assess severity of a patient have been developed using logistic regression. Logistic regression may be used to predict the risk of developing a given disease (e.g., diabetes, coronary heart disease), based on observed characteristics of the patient (e.g., age, sex, body mass index, results of various blood tests, etc.).

Conclusion

Logistic regression is a powerful statistical way of modeling a binomial outcome with one or more explanatory variables. It measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function. As a result, logistic regression is an important tool for data analysis and machine learning and provides a method for modeling binary outcomes in a variety of fields.