Linear Regression Algorithm

Definition:

Linear Regression is a supervised learning algorithm used for predictive modeling of continuous numerical variables. It establishes a linear relationship between a dependent variable (the target) and one or more independent variables (the features). The goal of linear regression is to model this relationship using a straight line (linear equation) to predict the target values based on input features.

Characteristics:

  • Regression Model:
    Unlike classification, linear regression predicts continuous values rather than categories.

  • Linear Relationship:
    Assumes a linear relationship between the features and the target variable, where changes in feature values proportionally affect the target.

  • Minimizing Error:
    The model minimizes the difference between the actual values and predicted values using a method called Ordinary Least Squares (OLS).

Types of Linear Regression:

  1. Simple Linear Regression:
    Used when there is one independent variable to predict the target.
    Example: Predicting house price based solely on square footage.

  2. Multiple Linear Regression:
    Used when there are two or more independent variables to predict the target.
    Example: Predicting house price based on square footage, number of rooms, and age of the house (a sketch contrasting the two follows this list).
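To make the distinction concrete, here is a minimal sketch using scikit-learn's LinearRegression. The prices and square footages mirror the example later in this article; the room counts and house ages are made-up values for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

y = np.array([200000, 230000, 250000, 270000])  # Target (price)

# Simple linear regression: a single feature (square footage)
X_simple = np.array([[1500], [1700], [1800], [1900]])
simple_model = LinearRegression().fit(X_simple, y)

# Multiple linear regression: three features per house
# (square footage, number of rooms, age in years -- illustrative values)
X_multi = np.array([
    [1500, 3, 20],
    [1700, 4, 15],
    [1800, 4, 10],
    [1900, 5, 5],
])
multi_model = LinearRegression().fit(X_multi, y)

print(simple_model.coef_)  # one slope
print(multi_model.coef_)   # one slope per feature
```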

Linear Equation:

In linear regression, the relationship between the target ( y ) and the input features ( X ) is modeled using the equation of a straight line:

y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ

Where:

  • ( y ) is the dependent variable (target).
  • x₁, x₂, …, xₚ are the independent variables (features).
  • β₀ is the intercept (the value of ( y ) when all features are zero).
  • β₁, β₂, …, βₚ are the coefficients (slopes), representing the change in ( y ) for a unit change in the corresponding feature (the sketch below evaluates this equation for one data point).
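Once the parameters are learned, a prediction is just the intercept plus a weighted sum of the features, i.e. a dot product. A minimal sketch with made-up parameter values:

```python
import numpy as np

# Hypothetical learned parameters for p = 3 features
beta_0 = 50000.0                            # intercept
betas = np.array([120.0, 8000.0, -500.0])   # one slope per feature

# One new data point: [square footage, rooms, age]
x_new = np.array([1800.0, 4.0, 10.0])

# y = beta_0 + beta_1*x_1 + beta_2*x_2 + beta_3*x_3
y_pred = beta_0 + betas @ x_new
print(y_pred)  # 50000 + 120*1800 + 8000*4 - 500*10 = 293000.0
```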

How Linear Regression Works:

  1. Data Collection:
    Gather a dataset with one or more features (independent variables) and the corresponding target variable (dependent variable).

  2. Model Training:
    The algorithm attempts to find the best-fitting line by optimizing the parameters (intercept and slopes). This is achieved using Ordinary Least Squares (OLS), which minimizes the sum of squared residuals (the differences between actual and predicted values).

  3. Making Predictions:
    Once trained, the model can predict the target value ( y ) for new data points by applying the learned linear equation.

  4. Residuals:
    The residual is the difference between the actual and predicted value: eᵢ = yᵢ − ŷᵢ

    The goal is to minimize these residuals; the sketch after this list fits a one-feature model in closed form and computes them.
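For a single feature, OLS has a closed-form solution: the slope is cov(x, y) / var(x) and the intercept is mean(y) − slope · mean(x). The sketch below (variable names are my own) applies this to the square-footage data from the example further down and prints the residuals:

```python
import numpy as np

# Toy data: square footage vs. price
x = np.array([1500.0, 1700.0, 1800.0, 1900.0])
y = np.array([200000.0, 230000.0, 250000.0, 270000.0])

# Closed-form OLS for one feature:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Residuals: e_i = y_i - y_hat_i
y_hat = intercept + slope * x
residuals = y - y_hat

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print("residuals:", residuals)
```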

Problem Statement:

Given a dataset with independent variables (features), the objective is to learn a linear regression model that can predict the target variable based on new input values.

Key Concepts:

  • Slope (Coefficient):
    The slope represents how much the target variable changes when the corresponding feature increases by one unit. In multiple regression, each feature has its own slope.

  • Intercept:
    The intercept is the predicted value of the target when all feature values are zero.

  • Best-Fit Line:
    Linear regression aims to find the line (or hyperplane for multiple regression) that best fits the data, meaning it minimizes the overall distance between the data points and the line.

Loss Function:

The loss function used in linear regression is the Mean Squared Error (MSE), which calculates the average of the squared differences between the actual and predicted values:

MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

Where:

  • yᵢ is the actual target value of the i-th data point.
  • ŷᵢ is the predicted target value for that point.
  • ( n ) is the total number of samples.
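This formula translates directly into NumPy (and matches sklearn.metrics.mean_squared_error); the predicted values below are made up purely to exercise the function:

```python
import numpy as np

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Squared Error: the average of the squared residuals."""
    return float(np.mean((y_true - y_pred) ** 2))

y_true = np.array([200000.0, 230000.0, 250000.0, 270000.0])
y_pred = np.array([198000.0, 233000.0, 251000.0, 268000.0])  # illustrative predictions
print(mse(y_true, y_pred))  # (2000^2 + 3000^2 + 1000^2 + 2000^2) / 4 = 4500000.0
```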

Gradient Descent (Alternative Training Method):

Another approach to finding the best-fit line is gradient descent, which iteratively updates the model parameters by moving in the direction of the steepest decrease in the loss function.

  • Update rule for each parameter: θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ

    Where:

    • α is the learning rate (controls the step size).
    • J(θ) is the loss function (MSE), and ∂J(θ)/∂θⱼ is its partial derivative with respect to θⱼ.

    The parameters are updated in each iteration to reduce the error (see the sketch below).
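A minimal sketch of batch gradient descent for one feature; the learning rate and iteration count are illustrative, and the feature is standardized first so a fixed learning rate converges:

```python
import numpy as np

# Toy data; standardize the feature so gradient descent behaves well
x_raw = np.array([1500.0, 1700.0, 1800.0, 1900.0])
y = np.array([200000.0, 230000.0, 250000.0, 270000.0])
x = (x_raw - x_raw.mean()) / x_raw.std()

theta0, theta1 = 0.0, 0.0  # intercept and slope, initialized at zero
alpha = 0.1                # learning rate
n = len(x)

for _ in range(1000):
    y_hat = theta0 + theta1 * x
    # Partial derivatives of MSE with respect to each parameter
    grad0 = (-2.0 / n) * np.sum(y - y_hat)
    grad1 = (-2.0 / n) * np.sum((y - y_hat) * x)
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(f"intercept={theta0:.2f}, slope={theta1:.2f}  (for the standardized feature)")
```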

Time Complexity:

  • Best, Average, and Worst Case: O(n · p)
    Training cost is linear in the number of samples ( n ) and the number of features ( p ) per pass over the data (as in one gradient-descent iteration), making linear regression efficient for large datasets. Solving OLS in closed form instead costs O(n·p² + p³).

Space Complexity:

  • Space Complexity: O(p)
    The space complexity is proportional to the number of features ( p ), as the model stores one coefficient per feature, plus the intercept.

Example:

Consider a dataset for predicting the price of a house based on square footage:

  • Dataset:

    | Square Footage | Price ($) |
    |----------------|-----------|
    | 1500           | 200,000   |
    | 1700           | 230,000   |
    | 1800           | 250,000   |
    | 1900           | 270,000   |

Step-by-Step Execution:

  1. Fit the model:
    Linear regression will learn the relationship between square footage (independent variable) and price (dependent variable).

  2. Equation:
    The model will output an equation like: Price ≈ −63,142.86 + 174.29 × Square Footage (the intercept and slope OLS produces for this dataset).

  3. Predict price:
    For a new house with 2000 square feet, the model will predict the price using the equation.

Python Implementation:

Here is a basic implementation of Linear Regression in Python using scikit-learn:

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Sample data
X = np.array([[1500], [1700], [1800], [1900]])  # Features (square footage)
y = np.array([200000, 230000, 250000, 270000])  # Target (price)

# Create the linear regression model
model = LinearRegression()

# Train the model (OLS fit)
model.fit(X, y)

# Predict the price for a 2000-square-foot house
predicted_price = model.predict([[2000]])
print(f"Predicted price: ${predicted_price[0]:,.2f}")

# Display the model's learned parameters
print(f"Intercept: {model.intercept_}")
print(f"Coefficient: {model.coef_[0]}")
```

Summary:

The Linear Regression Algorithm is one of the most fundamental techniques for predicting continuous outcomes. Its simplicity and interpretability make it a powerful tool for many real-world applications, particularly in finance, economics, and engineering. However, it assumes a linear relationship between variables and may not work well for datasets with non-linear patterns.