
12.2: Linear Regression


    Introduction to Linear Regression

Building on our discussion of optimization and parameter estimation, we now turn to a widely used statistical method for modeling the relationship between variables: linear regression. While the introduction to optimization focused on nonlinear functions, linear regression is a foundational concept that often serves as a starting point for understanding how models are fitted to data.

    What is Linear Regression?

    Linear regression is a statistical method used to model the relationship between a dependent variable (often denoted as Y) and one or more independent variables (often denoted as X). The core idea is to find the "best-fitting" straight line (or a hyperplane in higher dimensions) that describes how the dependent variable changes as the independent variable(s) change.

    The term "linear" refers to the fact that the model assumes a linear relationship between the parameters and the dependent variable. Even if the independent variables are transformed (e.g., \(X^2, \log(X)\)), as long as the parameters are multiplied by these terms and added together, the model remains linear in its parameters.
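As a brief illustration (a minimal sketch with made-up data, not part of the worked examples below), a model that is quadratic in \(X\) can still be fitted by ordinary least squares because it is linear in its parameters:

import numpy as np

# Minimal sketch: y depends on x and x**2, but the model is still linear
# in the parameters (beta0, beta1, beta2). Data and coefficients are made up.
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 30)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0, 0.3, x.size)

# Design matrix with columns [1, x, x^2]; OLS solves for the three parameters.
A = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)  # approximately [1.0, 2.0, -0.5]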

    Simple Linear Regression

    The simplest form is simple linear regression, which involves one independent variable and one dependent variable. The mathematical model for simple linear regression is:

    \[Y = \beta_0 + \beta_1 X + \epsilon\]

    Where:

    • Y is the dependent variable (the outcome you are trying to predict).
    • X is the independent variable (the predictor).
    • \(\beta_0\) (beta-naught) is the Y-intercept, representing the expected value of Y when X is 0.
    • \(\beta_1\) (beta-one) is the slope of the line, representing the change in Y for a one-unit change in X.
    • \(\epsilon\) (epsilon) is the error term (or residual), representing the difference between the observed Y value and the predicted Y value. This accounts for variability not explained by the model.

    Multiple Linear Regression

    When there are two or more independent variables, it's called multiple linear regression:

    \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \epsilon\]

Here, \(\beta_i\) represents the change in Y for a one-unit change in \(X_i\), holding all other independent variables constant.

    How are the Parameters (β values) Estimated?

The goal of linear regression is to find the values for the parameters \((\beta_0, \beta_1, \ldots, \beta_n)\) that minimize the sum of the squared differences between the observed values of Y and the values predicted by the linear model. This method is known as the Ordinary Least Squares (OLS) method.

The error, or residual, for each data point is \(e_i = Y_i - \hat{Y}_i\), where \(\hat{Y}_i\) is the predicted value. The objective function to minimize is the sum of squared residuals:

\[\min_{\beta_0, \ldots, \beta_n} \; \sum_{i=1}^{m} \left(Y_i - (\beta_0 + \beta_1 X_{i1} + \cdots + \beta_n X_{in})\right)^2\]

    Unlike nonlinear regression, the parameters in linear regression can often be estimated directly using a closed-form mathematical solution (matrix algebra), although iterative optimization algorithms can also be used, especially for very large datasets.
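For simple linear regression the closed-form estimates are especially compact: \(\hat{\beta}_1 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}\) and \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\). A minimal NumPy sketch of these formulas, using made-up data, is:

import numpy as np

# Made-up data with a known slope of 2 and intercept of 4
rng = np.random.default_rng(1)
x = np.linspace(0, 5, 50)
y = 4.0 + 2.0 * x + rng.normal(0, 1.0, x.size)

# Closed-form OLS estimates for simple linear regression
x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar
print(f"slope = {beta1:.3f}, intercept = {beta0:.3f}")  # close to 2 and 4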

    Applications of Linear Regression

    Linear regression is widely used in various fields for:

    • Prediction: Forecasting future values (e.g., predicting house prices based on size and location).
    • Understanding Relationships: Determining the strength and direction of the relationship between variables (e.g., how advertising spending affects sales).
    • Trend Analysis: Identifying trends over time.

    While linear regression is powerful, it's important to remember its assumptions (e.g., linearity of relationship, independence of errors, homoscedasticity) and to use it appropriately.


    Linear Regression with SciPy

Linear least squares regression models the relationship between a dependent variable (what you are trying to predict) and one or more independent variables (the predictors) by fitting a linear equation to the observed data. The line of best fit is the one that minimizes the sum of the squares of the vertical distances (residuals, or errors) between the observed data points and the values predicted by the line. A residual (or error) is the difference between an observed value \(y_i\) of the dependent variable and the predicted value \(\hat{y}_i\) from the regression line for a given independent variable value \(x_i\):

    \[e_i = y_i - \hat{y}_i\]

    Example 1 - Simple Linear Regression

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt
    
    # Generate sample data
    #np.random.seed(42)
    slope_act, intercept_act = 2, 4
    sigma = 3
    x_max, ndata = 5, 50
    x = np.linspace(0, x_max, ndata)
    # Generate y with some noise
    y = slope_act * x + intercept_act + np.random.normal(0, sigma, ndata)
    
    # 2. Perform linear regression using scipy.stats.linregress
    slope, intercept, r_value, p_value, stderr_slope = stats.linregress(x, y)
    
    # 3. Calculate the predicted y values
    y_pred = slope * x + intercept
    
    # Calculate the standard error of the regression estimate (s_y_x)
    # This is the standard deviation of the residuals
    residuals = y - y_pred
    s_y_x = np.sqrt(np.sum(residuals**2) / (len(x) - 2))
    
    # Calculate the standard error of the mean predicted value for each x
    # This is an approximation for simplicity, a more rigorous approach
    # involves the full covariance matrix and Student's t-distribution for confidence intervals.
    # For a given x_i, the standard error of the predicted mean is:
    # SE_pred = s_y_x * sqrt(1/n + (x_i - mean(x))^2 / sum((x_j - mean(x))^2))
    mean_x = np.mean(x)
    sum_sq_diff_x = np.sum((x - mean_x)**2)
    se_pred_mean = s_y_x * np.sqrt(1/len(x) + (x - mean_x)**2 / sum_sq_diff_x)
    
    # Determine the critical t-value for the desired confidence level
    # For a 95% confidence interval and n-2 degrees of freedom
    alpha = 0.05
    degrees_freedom = len(x) - 2
    t_critical = stats.t.ppf(1 - alpha/2, degrees_freedom)
    
    # Calculate the upper and lower confidence bounds
    upper_bound = y_pred + t_critical * se_pred_mean
    lower_bound = y_pred - t_critical * se_pred_mean
    
    # Plot the data, regression line, and confidence intervals
    plt.figure(figsize=(10, 6))
    plt.scatter(x, y, label='Data points')
    plt.plot(x, y_pred, color='red', label=f'Estimated Regression Line: y = {slope:.2f}x + {intercept:.2f}')
    plt.fill_between(x, lower_bound, upper_bound, color='gray', alpha=0.3, label='95% Confidence Interval')
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.title('Linear Regression with 95% Confidence Interval')
    plt.legend()
    plt.grid(True)
    plt.show()

    Minimizing the Sum of Squared Residuals (SSR or SSE)

So that larger errors have a disproportionately greater impact on the fit, linear least squares regression squares each residual and then sums them. The method finds the line that makes this sum of squared residuals (SSR), also called the sum of squared errors (SSE), as small as possible.

    \[SSR = \sum_{i=1}^n (y_i - \hat{y}_i)^2\]

The model for simple linear regression (one independent variable) is typically represented by the equation of a straight line:

\[\hat{y} = \beta_0 + \beta_1 x\]

The process of linear least squares regression involves calculating the specific values of \(\beta_0\) and \(\beta_1\) that minimize the sum of squared residuals for the given dataset. Example 1 above uses the linregress function from the scipy.stats library to obtain the estimated slope, intercept, r value, p value for the slope, and standard error of the slope; recent SciPy versions also report the standard error of the intercept.
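A minimal sketch of that call (with made-up data) is shown below; the returned result object exposes these quantities as attributes, and in SciPy 1.6 and later the standard error of the intercept is available as intercept_stderr:

import numpy as np
from scipy import stats

# Made-up data with slope 2 and intercept 4
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)
y = 4.0 + 2.0 * x + rng.normal(0, 1.0, x.size)

res = stats.linregress(x, y)
print(res.slope, res.intercept)          # estimated slope and intercept
print(res.rvalue, res.pvalue)            # correlation coefficient and p-value for the slope
print(res.stderr, res.intercept_stderr)  # standard errors of the slope and intercept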

    Example Linear Ordinary Least Squares Regression with Scikit-learn (sklearn)

    LinearRegression fits a linear model with p coefficients, \(w = (w_1, w_2 , ..., w_p)\) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation. Mathematically it solves a problem of the form:

    \[ \min_w \parallel X w - y \parallel^2_2 \]

where \(X\) contains the independent variables and \(y\) the dependent variable. The least squares solution is computed using the singular value decomposition of \(X\). If \(X\) is a matrix of shape \((N_{samples}, N_{features})\), this method has a cost of \(O(N_{samples} N^2_{features})\), assuming that \(N_{samples} \ge N_{features}\).
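As a rough check (a minimal sketch with made-up data), the coefficients returned by LinearRegression should match the pseudo-inverse least squares solution computed directly with NumPy:

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: two features with known coefficients 2 and -3 and intercept 1
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0, 0.1, 30)

model = LinearRegression().fit(X, y)

# Same problem solved with the Moore-Penrose pseudo-inverse of [1, X]
A = np.column_stack([np.ones(len(y)), X])
beta = np.linalg.pinv(A) @ y

print(model.intercept_, model.coef_)  # should closely match beta below
print(beta)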

The scikit-learn library provides a machine-learning approach to linear regression that can be applied to a single independent variable or to multiple independent variables. It is a robust way to perform linear regression in Python.

    This example will cover:

    1. Generating Sample Data: Creating synthetic data that has a linear relationship.
    2. Splitting Data: Dividing the data into training and testing sets.
    3. Model Training: Fitting a LinearRegression model to the training data.
    4. Prediction: Making predictions on the test data.
    5. Evaluation: Assessing the model's performance.
    6. Visualization: Plotting the results to understand the linear fit.

    Example 2 - Supervised Machine Learning Regression

    %pip install scikit-learn
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import linear_model
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error, r2_score
    np.random.seed(42) # for reproducibility
    slope = 2
    intercept = 4
    X = 2 * np.random.rand(100, 1) # 100 samples, 1 feature (x values)
    y = intercept + slope * X + np.random.randn(100, 1) # y values with some noise
    # Split data into training and testing dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = linear_model.LinearRegression()
    # Train the model using the training data
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"\nModel Coefficient (Slope): {model.coef_[0][0]:.2f}")
    print(f"Model Intercept: {model.intercept_[0]:.2f}")
    print(f"Mean Squared Error (MSE): {mse:.2f}")
    print(f"R-squared (R2 Score): {r2:.2f}")
    plt.figure(figsize=(10, 7))
    # Plot the original training data
    plt.scatter(X_train, y_train, color='blue', label='Training Data', alpha=0.7)
    # Plot the original test data
    plt.scatter(X_test, y_test, color='red', label='Test Data', alpha=0.7)
    # Plot the regression line using the test data's X values and the model's predictions
    plt.plot(X_test, y_pred, color='red', linewidth=2, label='Regression Line (Predictions)')
    plt.plot([],[], ' ', label=f'Slope: {model.coef_[0][0]:.2f}, Intercept: {model.intercept_[0]:.2f}', color='red')
    plt.title('Linear Ordinary Least Squares Regression Fit with sklearn on Synthetic Data')
    plt.xlabel('X (Feature)')
    plt.ylabel('y (Target)')
    plt.legend()
    plt.grid(True)
    plt.show()
    
    Model Coefficient (Slope): 2.80
    Model Intercept: 4.14
    Mean Squared Error (MSE): 0.65
    R-squared (R2 Score): 0.81
    

    Multiple Linear Regression

    Multiple linear regression is a statistical technique used to model the relationship between a dependent variable and two or more independent variables/features, \(N_{features} \ge 2\), by fitting a linear equation to observed data. In Python, this can be efficiently implemented using libraries like scikit-learn or statsmodels.

    For reliable results, multiple linear regression relies on several assumptions:

• No Significant Multicollinearity: Independent variables are not highly correlated with each other. High multicollinearity can make it difficult to determine the individual impact of each predictor (a quick check using variance inflation factors is sketched after this list).
    • Linearity: There's a linear relationship between the dependent variable and each independent variable.
    • Independence of Errors: The errors (residuals) are independent of each other.
    • Homoscedasticity: The variance of the errors is constant across all levels of the independent variables.
    • Normality of Errors: The errors are normally distributed.
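The multicollinearity assumption above can be checked with variance inflation factors (VIFs). The sketch below uses statsmodels' variance_inflation_factor on deliberately correlated, made-up predictors; values well above roughly 5-10 are usually read as a warning sign:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Made-up predictors: x2 is deliberately correlated with x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 50)
x2 = 0.9 * x1 + rng.normal(0, 0.5, 50)
x3 = rng.uniform(0, 10, 50)
X = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})

# Add the intercept column, then compute a VIF for each predictor
X_c = sm.add_constant(X)
vifs = pd.Series(
    [variance_inflation_factor(X_c.values, i) for i in range(1, X_c.shape[1])],
    index=X.columns,
)
print(vifs)  # x1 and x2 should show much larger VIFs than x3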

The least squares regression class sklearn.linear_model.LinearRegression is demonstrated in the following example with home prices as a function of square footage and number of bedrooms.

    Example 3 - Multiple Linear Regression

    %pip install scikit-learn
    import numpy as np
    from sklearn import linear_model
    import pandas as pd # Often used for handling multiple independent variables
    
    # Sample Data Generation
    # Let's consider a scenario where 'y' (house price) depends on 'X1' (size in sq ft)
    # and 'X2' (number of bedrooms).
    data = {
        'Size_sqft': [1000, 1200, 1500, 1300, 1100, 1600, 1400, 1700, 1800, 1900],
        'Num_Bedrooms': [2, 3, 3, 2, 2, 4, 3, 4, 4, 5],
        'House_Price_k': [250, 300, 380, 320, 280, 420, 350, 450, 480, 520]
    }
    df = pd.DataFrame(data)
    
    # Define independent variables (features) and dependent variable (target)
    X_multi = df[['Size_sqft', 'Num_Bedrooms']]
    y_multi = df['House_Price_k']
    
    # Model Initialization and Training
    model_multi = linear_model.LinearRegression()
    model_multi.fit(X_multi, y_multi)
    
    # Accessing Model Parameters
    # Each coefficient corresponds to its respective independent variable.
    print(f"Coefficients (slopes) for Size_sqft and Num_Bedrooms: {model_multi.coef_}")
    print(f"Intercept: {model_multi.intercept_:.2f}")
    
    # Making Predictions
    # Predict the price of a house with 1450 sq ft and 3 bedrooms.
    new_house = pd.DataFrame([[1450, 3]], columns=['Size_sqft', 'Num_Bedrooms'])
    predicted_price = model_multi.predict(new_house)
    print(f"Predicted price for a 1450 sqft, 3-bedroom house: ${predicted_price[0]:,.2f}k")
    Coefficients (slopes) for Size_sqft and Num_Bedrooms: [ 0.26103448 11.59482759]
    Intercept: -40.60
    Predicted price for a 1450 sqft, 3-bedroom house: $372.68k
    

    Example 4 - Multiple Linear Regression

    Compare scikit-learn with statsmodels.

%pip install scikit-learn statsmodels
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score
    import statsmodels.api as sm
    
    data = {
        'TV': [230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6, 199.8],
        'Radio': [37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6, 2.1, 2.6],
        'Newspaper': [69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6, 1.0, 21.2],
        'Sales': [22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.6]
    }
    df = pd.DataFrame(data)
    print("Sample Data:")
    print(df)
    # Define independent variables (X) and dependent variable (y)
    X = df[['TV', 'Radio', 'Newspaper']]
    y = df['Sales']
    
    # Split data into training and testing sets (80% train, 20% test)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Create a Linear Regression model object
    model_skl = LinearRegression()
    
    # Train the model using the training data
    model_skl.fit(X_train, y_train)
    
    # Make predictions on the test set
    y_pred_skl = model_skl.predict(X_test)
    
    # Evaluate the model's performance
    mse_skl = mean_squared_error(y_test, y_pred_skl)
    r2_skl = r2_score(y_test, y_pred_skl)
    
    print("\n--- Scikit-learn Results ---")
    print(f"Coefficients (β1, β2, β3 for TV, Radio, Newspaper respectively): {model_skl.coef_}")
    print(f"Intercept (β0): {model_skl.intercept_:.2f}")
    print(f"Mean Squared Error (MSE): {mse_skl:.2f}")
    print(f"R-squared (R²): {r2_skl:.2f}")
    
    # Define independent variables (X) and dependent variable (y)
    # For statsmodels, it's good practice to add a constant term for the intercept
    X_sm = sm.add_constant(df[['TV', 'Radio', 'Newspaper']])
    y_sm = df['Sales']
    
    # Create and fit the OLS (Ordinary Least Squares) model
    model_sm = sm.OLS(y_sm, X_sm).fit()
    
    # Print the comprehensive summary of the regression results
    print("\n--- Statsmodels Results ---")
    print(model_sm.summary())
    Sample Data:
          TV  Radio  Newspaper  Sales
    0  230.1   37.8       69.2   22.1
    1   44.5   39.3       45.1   10.4
    2   17.2   45.9       69.3    9.3
    3  151.5   41.3       58.5   18.5
    4  180.8   10.8       58.4   12.9
    5    8.7   48.9       75.0    7.2
    6   57.5   32.8       23.5   11.8
    7  120.2   19.6       11.6   13.2
    8    8.6    2.1        1.0    4.8
    9  199.8    2.6       21.2   10.6
    
    --- Scikit-learn Results ---
    Coefficients (β1, β2, β3 for TV, Radio, Newspaper respectively): [ 0.07468737  0.3029154  -0.06925603]
    Intercept (β0): -1.55
    Mean Squared Error (MSE): 13.20
    R-squared (R²): -0.68
    
    --- Statsmodels Results ---
                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:                  Sales   R-squared:                       0.909
    Model:                            OLS   Adj. R-squared:                  0.864
    Method:                 Least Squares   F-statistic:                     20.07
    Date:                Wed, 16 Jul 2025   Prob (F-statistic):            0.00157
    Time:                        15:59:36   Log-Likelihood:                -17.930
    No. Observations:                  10   AIC:                             43.86
    Df Residuals:                       6   BIC:                             45.07
    Df Model:                           3                                         
    Covariance Type:            nonrobust                                         
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    const          1.8555      1.589      1.168      0.287      -2.032       5.742
    TV             0.0633      0.009      7.165      0.000       0.042       0.085
    Radio          0.2374      0.062      3.839      0.009       0.086       0.389
    Newspaper     -0.0671      0.039     -1.699      0.140      -0.164       0.030
    ==============================================================================
    Omnibus:                        2.174   Durbin-Watson:                   2.204
    Prob(Omnibus):                  0.337   Jarque-Bera (JB):                0.988
    Skew:                          -0.762   Prob(JB):                        0.610
    Kurtosis:                       2.783   Cond. No.                         366.
    ==============================================================================
    
    Notes:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
    

    Interpretation of scikit-learn Results:

    • Coefficients: These values indicate the change in 'Sales' for a one-unit increase in the respective advertising medium, holding other variables constant. For example, model_skl.coef_[0] is the coefficient for 'TV'.
    • Intercept: This is the predicted 'Sales' when all independent variables ('TV', 'Radio', 'Newspaper') are zero.
    • Mean Squared Error (MSE): This measures the average of the squares of the errors or deviations. A lower MSE indicates a better fit.
    • R-squared (R²): This represents the proportion of the variance in the dependent variable ('Sales') that is predictable from the independent variables. An R² of 1 indicates a perfect fit, while 0 indicates no linear relationship.
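For reference (assuming \(n\) test samples with observations \(y_i\), predictions \(\hat{y}_i\), and mean \(\bar{y}\)), these two metrics are computed as

\[MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}\]

which is how sklearn.metrics.mean_squared_error and r2_score evaluate them. Note that on a held-out test set \(R^2\) can be negative when the model predicts worse than simply using the mean of the test targets; this is why the very small test sets in these examples can produce negative \(R^2\) values.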

    Interpretation of statsmodels Results:

    The statsmodels summary is rich with information:

    • Coefficients (coef): Same as in scikit-learn, these are the estimated beta values.
    • std err (Standard Error): Measures the precision of the coefficient estimates. Smaller standard errors indicate more precise estimates.
    • t and P>|t| (t-value and P-value):
      • The t-value is the coefficient divided by its standard error.
      • The P-value (P>|t|) indicates the probability of observing such a t-value if the true coefficient were zero (i.e., no relationship). A common threshold for statistical significance is P-value < 0.05. If the P-value is less than 0.05, we typically consider the independent variable to be a statistically significant predictor of the dependent variable.
    • [0.025, 0.975] (Confidence Interval): Provides a range within which the true coefficient is likely to fall with 95% confidence.
    • R-squared and Adj. R-squared: Similar to scikit-learn, they tell us how much variance in 'Sales' is explained by the independent variables. Adjusted R-squared is generally preferred when comparing models with different numbers of independent variables, as it accounts for the number of predictors.
    • F-statistic and Prob (F-statistic): These assess the overall statistical significance of the entire regression model. A low Prob (F-statistic) (e.g., < 0.05) suggests that at least one of the independent variables significantly predicts the dependent variable.
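For reference, the usual adjusted R-squared formula is \(\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}\), where \(n\) is the number of observations and \(p\) the number of predictors; it penalizes the addition of predictors that do not improve the fit.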

    Both scikit-learn and statsmodels are valuable tools for multiple linear regression in Python, serving different primary purposes: scikit-learn for predictive modeling and statsmodels for statistical inference and hypothesis testing.

    Example 5: Predicting Concrete Compressive Strength

    Concrete strength is a crucial factor in civil engineering. It's influenced by the mix proportions of its ingredients. The following are variables for this model:

    • Dependent Variable (Y): Concrete Compressive Strength (MPa)
    • Independent Variables (X):
      • Cement (kg/m³)
      • Blast Furnace Slag (kg/m³)
      • Fly Ash (kg/m³)
      • Water (kg/m³)
      • Superplasticizer (kg/m³)
      • Coarse Aggregate (kg/m³)
      • Fine Aggregate (kg/m³)
      • Age (days)

    Since actual concrete strength data can be extensive, a small, representative synthetic dataset is created for this simulation. In a real engineering scenario, this data would come from laboratory experiments or field measurements.

%pip install scikit-learn statsmodels
    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score
    import statsmodels.api as sm
    
    # Synthetic Data for Concrete Compressive Strength
    # (Actual datasets are much larger and more complex)
    np.random.seed(42) # for reproducibility
    data = {
        'Cement': np.random.uniform(100, 500, 20),
        'BlastFurnaceSlag': np.random.uniform(0, 300, 20),
        'FlyAsh': np.random.uniform(0, 200, 20),
        'Water': np.random.uniform(150, 250, 20),
        'Superplasticizer': np.random.uniform(0, 20, 20),
        'CoarseAggregate': np.random.uniform(800, 1200, 20),
        'FineAggregate': np.random.uniform(600, 900, 20),
        'Age': np.random.randint(7, 90, 20),
        # Compressive strength: a simplified linear combination plus noise.
        # Note: fresh random draws are used below rather than the feature
        # columns generated above, so the recorded strength is not computed
        # from the recorded mix variables and the fitted relationship will
        # be weak.
        'CompressiveStrength': (
            5 + 0.05 * np.random.uniform(100, 500, 20) + # Cement
            0.03 * np.random.uniform(0, 300, 20) +      # Slag
            0.02 * np.random.uniform(0, 200, 20) -      # Fly Ash (slight negative impact sometimes)
            0.1 * np.random.uniform(150, 250, 20) +     # Water (negative impact)
            0.2 * np.random.uniform(0, 20, 20) +        # Superplasticizer
            0.01 * np.random.uniform(800, 1200, 20) +   # Coarse Aggregate
            0.01 * np.random.uniform(600, 900, 20) +    # Fine Aggregate
            0.3 * np.random.randint(7, 90, 20) +        # Age
            np.random.normal(0, 5, 20) # Noise
        )
    }
    df = pd.DataFrame(data)
    
    # Ensure strength is positive and somewhat realistic
    df['CompressiveStrength'] = df['CompressiveStrength'].apply(lambda x: max(10, x)).round(2)
    
    #print("Sample Concrete Compressive Strength Data:")
    #print(df.head())
    # Define independent variables (X) and dependent variable (y)
    X = df[['Cement', 'BlastFurnaceSlag', 'FlyAsh', 'Water',
            'Superplasticizer', 'CoarseAggregate', 'FineAggregate', 'Age']]
    y = df['CompressiveStrength']
    
    # Split data into training and testing sets (e.g., 80% train, 20% test)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Create a Linear Regression model object
    model_skl = LinearRegression()
    
    # Train the model using the training data
    model_skl.fit(X_train, y_train)
    
    # Make predictions on the test set
    y_pred_skl = model_skl.predict(X_test)
    
    # Evaluate the model's performance
    mse_skl = mean_squared_error(y_test, y_pred_skl)
    rmse_skl = np.sqrt(mse_skl) # Root Mean Squared Error
    r2_skl = r2_score(y_test, y_pred_skl)
    
    print("\n--- Scikit-learn Results (for Prediction) ---")
    print("Coefficients for each constituent (Cement to Age):")
    for i, col in enumerate(X.columns):
        print(f"  {col}: {model_skl.coef_[i]:.4f}")
    print(f"Intercept: {model_skl.intercept_:.2f}")
    print(f"Mean Squared Error (MSE): {mse_skl:.2f}")
    print(f"Root Mean Squared Error (RMSE): {rmse_skl:.2f}") # Easier to interpret than MSE
    print(f"R-squared (R²): {r2_skl:.2f}")
    
    # Example prediction: Predict strength for a new mix
    new_mix = pd.DataFrame([[350, 50, 20, 180, 5, 1000, 700, 28]],
                           columns=X.columns)
    predicted_strength = model_skl.predict(new_mix)
    print(f"\nPredicted Compressive Strength for a new mix: {predicted_strength[0]:.2f} MPa")
    
    # statsmodels analysis
    # Add a constant (intercept) term to the independent variables
    X_sm = sm.add_constant(df[['Cement', 'BlastFurnaceSlag', 'FlyAsh', 'Water',
                               'Superplasticizer', 'CoarseAggregate', 'FineAggregate', 'Age']])
    y_sm = df['CompressiveStrength']
    
    # Create and fit the OLS (Ordinary Least Squares) model
    model_sm = sm.OLS(y_sm, X_sm).fit()
    
    # Print the comprehensive summary of the regression results
    print("\n--- Statsmodels Results (for Statistical Inference) ---")
    print(model_sm.summary())
    
    --- Scikit-learn Results (for Prediction) ---
    Coefficients for each constituent (Cement to Age):
      Cement: 0.0112
      BlastFurnaceSlag: -0.0080
      FlyAsh: 0.0179
      Water: 0.0435
      Superplasticizer: 0.3829
      CoarseAggregate: 0.0161
      FineAggregate: 0.0186
      Age: 0.0564
    Intercept: -7.78
    Mean Squared Error (MSE): 117.65
    Root Mean Squared Error (RMSE): 10.85
    R-squared (R²): -0.45
    
    Predicted Compressive Strength for a new mix: 36.53 MPa
    
    --- Statsmodels Results (for Statistical Inference) ---
                                 OLS Regression Results                            
    ===============================================================================
    Dep. Variable:     CompressiveStrength   R-squared:                       0.227
    Model:                             OLS   Adj. R-squared:                 -0.335
    Method:                  Least Squares   F-statistic:                    0.4044
    Date:                 Wed, 16 Jul 2025   Prob (F-statistic):              0.896
    Time:                         18:37:33   Log-Likelihood:                -70.722
    No. Observations:                   20   AIC:                             159.4
    Df Residuals:                       11   BIC:                             168.4
    Df Model:                            8                                         
    Covariance Type:             nonrobust                                         
    ====================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------------
    const               -4.1808     38.362     -0.109      0.915     -88.616      80.254
    Cement               0.0047      0.031      0.152      0.882      -0.063       0.072
    BlastFurnaceSlag     0.0020      0.042      0.048      0.963      -0.091       0.095
    FlyAsh               0.0171      0.045      0.382      0.710      -0.081       0.115
    Water                0.0928      0.091      1.015      0.332      -0.108       0.294
    Superplasticizer     0.5224      0.527      0.992      0.343      -0.637       1.682
    CoarseAggregate      0.0210      0.023      0.928      0.373      -0.029       0.071
    FineAggregate       -0.0104      0.042     -0.247      0.810      -0.103       0.083
    Age                  0.0921      0.151      0.610      0.555      -0.240       0.425
    ==============================================================================
    Omnibus:                        2.366   Durbin-Watson:                   1.765
    Prob(Omnibus):                  0.306   Jarque-Bera (JB):                1.118
    Skew:                          -0.075   Prob(JB):                        0.572
    Kurtosis:                       1.851   Cond. No.                     2.01e+04
    ==============================================================================
    
    Notes:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
    [2] The condition number is large, 2.01e+04. This might indicate that there are
    strong multicollinearity or other numerical problems.
    

    Engineering Relevance of scikit-learn Output:

    • Coefficients: An engineer can interpret these to understand the relative impact of each ingredient on strength. For example, a positive coefficient for 'Cement' means more cement generally leads to higher strength, while a negative coefficient for 'Water' means more water (assuming all else constant) tends to reduce strength.
    • RMSE: Provides an intuitive measure of the typical error in predictions, in the same units as the dependent variable (MPa). An engineer wants to minimize this.
    • R²: Indicates how well the model explains the variability in concrete strength. A high R² suggests the model's inputs are good predictors.

    Engineering Relevance of statsmodels Output:

    • P-values \(P>|t|\): This is perhaps the most critical output for an engineer designing concrete mixes. A low p-value (e.g., < 0.05) for a specific ingredient indicates that its quantity has a statistically significant impact on concrete strength. For instance, if 'Water' has a very low p-value, it confirms that controlling the water content is statistically important for strength.
    • Confidence Intervals ([0.025, 0.975]): Provide a range for the true effect of each ingredient. An engineer can use this to understand the variability and robustness of the relationships.
    • F-statistic and Prob (F-statistic): These indicate if the overall model is statistically significant. A low p-value here means that the chosen mix of ingredients (as a whole) significantly predicts concrete strength.
    • R-squared and Adjusted R-squared: Similar to scikit-learn, but often accompanied by more detailed statistical tests and diagnostics, which are valuable for a deeper understanding of model fit and assumptions.

    This engineering example demonstrates how multiple linear regression can be used not just for prediction but also for gaining actionable insights into complex material behaviors, allowing engineers to optimize designs, control quality, and understand the fundamental relationships between input parameters and performance.

    Summary

    Choosing the right method for linear regression in Python depends on your goals and the level of detail you need. The two most popular libraries for this task are scikit-learn and statsmodels. There is also a simpler, more direct approach using NumPy for basic cases.

    Scikit-learn

    Scikit-learn is a powerful and widely used machine learning library. It is the go-to choice for building predictive models.

    • When to use it: Use scikit-learn when your primary goal is to build a predictive model and integrate it into a larger machine learning pipeline. It's excellent for tasks like:
      • Making predictions on new data.
      • Evaluating model performance with metrics like R2, Mean Squared Error (MSE), and Root Mean Squared Error (RMSE).
      • Performing other machine learning tasks like classification, clustering, and cross-validation, as it has a consistent API.
      • Working with multiple linear regression (multiple independent variables).
    • How it works: You use the LinearRegression class from sklearn.linear_model. The process is typically:
    1. Import LinearRegression.
    2. Create an instance of the model: model = LinearRegression().
    3. Fit the model to your data: model.fit(X, y), where X is your feature matrix and y is your target variable.
    4. Access the coefficients (model.coef_) and intercept (model.intercept_).
    5. Make predictions with model.predict(X_new).
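Putting these five steps together, a minimal sketch with made-up data looks like the following (the numbers are illustrative only):

import numpy as np
from sklearn.linear_model import LinearRegression

# Step 1 is the import above; the data here are made up
X = np.array([[1.0], [2.0], [3.0], [4.0]])   # feature matrix, shape (n_samples, n_features)
y = np.array([3.1, 4.9, 7.2, 8.8])           # target values

model = LinearRegression()                   # step 2: create the model
model.fit(X, y)                              # step 3: fit to the data
print(model.coef_, model.intercept_)         # step 4: inspect coefficients and intercept
print(model.predict([[5.0]]))                # step 5: predict for a new sample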

    Statsmodels

Statsmodels is a library that focuses on providing statistical models and tests. Its output is similar to what you would get from statistical software like R.

    • When to use it: Use statsmodels when your focus is on statistical inference and understanding the relationships between variables. It provides a detailed summary of the model, which is crucial for:
      • Examining p-values to determine the statistical significance of each independent variable.
      • Checking for multicollinearity and other regression diagnostics.
      • Performing hypothesis testing.
      • Getting a comprehensive summary table with many statistical details.
    • How it works: You can use the OLS (Ordinary Least Squares) method from statsmodels.api or statsmodels.formula.api. The formula API is particularly useful as it allows you to specify the model using R-style formulas.
    1. Import statsmodels.api as sm or statsmodels.formula.api as smf.
    2. Add a constant to your independent variables if you are using the sm API: X = sm.add_constant(X).
    3. Define and fit the model: model = sm.OLS(y, X).fit().
    4. Get the detailed summary table with print(model.summary()).
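The formula interface can be sketched as follows, assuming the advertising DataFrame df from Example 4 (columns TV, Radio, Newspaper, Sales) is already defined; the formula adds the intercept automatically, so add_constant is not needed:

import statsmodels.formula.api as smf

# R-style formula: Sales modeled as a linear function of the three media budgets
model_formula = smf.ols('Sales ~ TV + Radio + Newspaper', data=df).fit()
print(model_formula.summary())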

    NumPy

NumPy, the fundamental package for scientific computing in Python, can also be used for linear regression, especially for simple cases; however, its routines are not as feature-rich as scikit-learn or statsmodels for more complex analyses or model evaluation.

• Use numpy.linalg.lstsq for simple linear regression problems (e.g., a single predictor variable).
• numpy.polynomial.Polynomial.fit can be used to fit a polynomial (including a degree-1 polynomial, i.e., a straight line) to your data.
    • You can also manually calculate the coefficients using linear algebra functions within NumPy.
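A minimal sketch of both NumPy approaches, using made-up data, is:

import numpy as np
from numpy.polynomial import Polynomial

# Made-up data with intercept 4 and slope 2
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 40)
y = 4.0 + 2.0 * x + rng.normal(0, 1.0, x.size)

# numpy.linalg.lstsq with an explicit design matrix [1, x]
A = np.column_stack([np.ones_like(x), x])
(beta0, beta1), *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta0, beta1)  # close to 4 and 2

# Polynomial.fit with degree 1; convert() returns coefficients in the
# ordinary basis, ordered from the constant term upward
p = Polynomial.fit(x, y, deg=1).convert()
print(p.coef)  # approximately [4.0, 2.0]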

    Summary Table

Method | Best for... | Key Features
scikit-learn | Predictive modeling, machine learning pipelines | Simple API, performance metrics, easy integration with other ML tools.
statsmodels | Statistical inference, hypothesis testing | Detailed summary output, p-values, regression diagnostics, R-style formulas.
NumPy | Simple linear regression, manual calculations | Direct, low-level control, good for understanding the fundamentals.

    For most users doing data science and machine learning, scikit-learn is the most common and practical choice. If your work is more rooted in classical statistics and you need to formally test hypotheses about your variables, statsmodels is the superior option.

    Conclusion

    Multiple linear regression is a powerful and widely used tool for understanding and predicting relationships between variables. Python's scikit-learn provides a straightforward way to build predictive models, while statsmodels offers in-depth statistical insights crucial for hypothesis testing and understanding the underlying relationships.


    This page titled 12.2: Linear Regression is shared under a CC BY-SA 4.0 license and was authored, remixed, and/or curated by .