12.2: Linear Regression
Introduction to Linear Regression
Building on our discussion of optimization and parameter estimation, let's turn to a widely used statistical method for modeling the relationship between variables: linear regression. While the Introduction to Optimization section focused on nonlinear functions, linear regression is a foundational concept that often serves as a starting point for understanding how models are fitted to data.
What is Linear Regression?
Linear regression is a statistical method used to model the relationship between a dependent variable (often denoted as Y) and one or more independent variables (often denoted as X). The core idea is to find the "best-fitting" straight line (or a hyperplane in higher dimensions) that describes how the dependent variable changes as the independent variable(s) change.
The term "linear" refers to the fact that the model assumes a linear relationship between the parameters and the dependent variable. Even if the independent variables are transformed (e.g., \(X^2, \log(X)\)), as long as the parameters are multiplied by these terms and added together, the model remains linear in its parameters.
Simple Linear Regression
The simplest form is simple linear regression, which involves one independent variable and one dependent variable. The mathematical model for simple linear regression is:
\[Y = \beta_0 + \beta_1 X + \epsilon\]
Where:
- Y is the dependent variable (the outcome you are trying to predict).
- X is the independent variable (the predictor).
- \(\beta_0\) (beta-naught) is the Y-intercept, representing the expected value of Y when X is 0.
- \(\beta_1\) (beta-one) is the slope of the line, representing the change in Y for a one-unit change in X.
- \(\epsilon\) (epsilon) is the error term (or residual), representing the difference between the observed Y value and the predicted Y value. This accounts for variability not explained by the model.
Multiple Linear Regression
When there are two or more independent variables, it's called multiple linear regression:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \epsilon\]
Here, \(\beta_i\) represents the change in Y for a one-unit change in \(X_i\), holding all other independent variables constant.
How are the Parameters (β values) Estimated?
The goal of linear regression is to find the values for the parameters \((\beta_0, \beta_1, \ldots, \beta_n)\) that minimize the sum of the squared differences between the observed values of Y and the values predicted by the linear model. This method is known as the Ordinary Least Squares (OLS) method.
The "error" or residual for each data point is \(e_i = Y_i - \hat{Y}_i \), where \(\hat{Y}_i\) is the predicted value. The objective function to minimize is the sum of squared residuals:
Least Squares Minimization of \(\displaystyle\sum_{i=1}^{m} \left(Y_i - (\beta_0 + \beta_1 X_{i1} + \cdots + \beta_n X_{in})\right)^2 \)
Unlike nonlinear regression, the parameters in linear regression can often be estimated directly using a closed-form mathematical solution (matrix algebra), although iterative optimization algorithms can also be used, especially for very large datasets.
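In matrix form, if \(X\) denotes the design matrix (with a leading column of ones for the intercept) and \(Y\) the vector of observations, this closed-form solution is given by the normal equations:
\[\hat{\beta} = (X^{T}X)^{-1}X^{T}Y\]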
Applications of Linear Regression
Linear regression is widely used in various fields for:
- Prediction: Forecasting future values (e.g., predicting house prices based on size and location).
- Understanding Relationships: Determining the strength and direction of the relationship between variables (e.g., how advertising spending affects sales).
- Trend Analysis: Identifying trends over time.
While linear regression is powerful, it's important to remember its assumptions (e.g., linearity of relationship, independence of errors, homoscedasticity) and to use it appropriately.
Linear Regression
Linear Regression with SciPy
The basis for linear least squares regression is finding the line of best fit for a set of data points by minimizing the sum of the squares of the vertical distances (residuals or errors) between the observed data points and the values predicted by the line. Linear least squares regression models the relationship between a dependent variable (what you are trying to predict) and one or more independent variables (the predictors) by fitting a linear equation to the observed data, with the "best fit" determined by minimizing a specific error criterion. A residual (or error) is the difference between an observed value \(y_i\) of the dependent variable and the predicted value \(\hat{y}_i\) from the regression line for a given independent variable value \(x_i\):
\[e_i = y_i - \hat{y}_i\]
Example 1 - Simple Linear Regression
Minimizing the Sum of Squared Residuals (SSR or SSE)
To ensure that larger errors have a proportionally greater impact on the minimization, linear least squares regression squares each residual and then sums them up. The method then finds the line that makes this sum of squared residuals (SSR) or sum of squared errors (SSE) as small as possible.
\[SSR = \sum_{i=1}^n (y_i - \hat{y}_i)^2\]
The model for simple linear regression (with one independent variable) is typically represented by the equation of a straight line:
\[\hat{y} = \beta_0 + \beta_1 x\]
The process of linear least squares regression involves calculating the specific values of \(\beta_0\) and \(\beta_1\) that minimize the sum of squared residuals for the given dataset. The following example uses the linregress function from the scipy.stats module to obtain the estimated slope, intercept, r value, p value of the slope, standard error of the slope, and standard error of the intercept.
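A minimal sketch of this workflow is given below; the x and y arrays are made-up values used purely for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration (any paired x, y arrays would work)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1, 13.9, 16.2])

# Fit the least squares line y_hat = intercept + slope * x
result = stats.linregress(x, y)

print(f"slope               = {result.slope:.4f}")
print(f"intercept           = {result.intercept:.4f}")
print(f"r value             = {result.rvalue:.4f}")
print(f"p value (slope)     = {result.pvalue:.4g}")
print(f"std err (slope)     = {result.stderr:.4f}")
print(f"std err (intercept) = {result.intercept_stderr:.4f}")

# Predicted values from the fitted line
y_hat = result.intercept + result.slope * x
```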
Example Linear Ordinary Least Squares Regression with Scikit-learn (sklearn)
LinearRegression fits a linear model with coefficients \(w = (w_1, w_2, \ldots, w_p)\) to minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation. Mathematically, it solves a problem of the form:
\[ \min_w \parallel X w - y \parallel^2_2 \]
where \(X\) holds the independent variables and \(y\) the dependent variable. The least squares solution is computed using the singular value decomposition of \(X\). If \(X\) is a matrix of shape \((N_{samples}, N_{features})\), this method has a cost of \(O(N_{samples} N^2_{features})\), assuming that \(N_{samples} \ge N_{features}\).
The Scikit-learn library provides a machine learning approach to linear regression that can be applied to either a single independent variable or multiple independent variables. It provides a robust way to perform linear regression in Python.
This example will cover:
- Generating Sample Data: Creating synthetic data that has a linear relationship.
- Splitting Data: Dividing the data into training and testing sets.
- Model Training: Fitting a LinearRegression model to the training data.
- Prediction: Making predictions on the test data.
- Evaluation: Assessing the model's performance.
- Visualization: Plotting the results to understand the linear fit.
Example 2 - Supervised Machine Learning Regression
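A hedged sketch of this workflow is shown below, using synthetic data generated with NumPy; the true slope (3.0), intercept (5.0), and noise level are arbitrary choices made only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# 1. Generate synthetic data with a linear relationship plus noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))                      # single feature
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 2.0, size=100)   # assumed true line

# 2. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 3. Fit a LinearRegression model to the training data
model = LinearRegression()
model.fit(X_train, y_train)

# 4. Make predictions on the test data
y_pred = model.predict(X_test)

# 5. Evaluate the model
print("slope    :", model.coef_[0])
print("intercept:", model.intercept_)
print("MSE      :", mean_squared_error(y_test, y_pred))
print("R^2      :", r2_score(y_test, y_pred))

# 6. Visualize the linear fit against the observed test data
plt.scatter(X_test, y_test, label="observed")
plt.plot(X_test, y_pred, color="red", label="predicted")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()
```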
Multiple Linear Regression
Multiple linear regression is a statistical technique used to model the relationship between a dependent variable and two or more independent variables/features, \(N_{features} \ge 2\), by fitting a linear equation to observed data. In Python, this can be efficiently implemented using libraries like scikit-learn or statsmodels.
For reliable results, multiple linear regression relies on several assumptions:
- No Significant Multicollinearity: Independent variables are not highly correlated with each other. High multicollinearity can make it difficult to determine the individual impact of each predictor.
- Linearity: There's a linear relationship between the dependent variable and each independent variable.
- Independence of Errors: The errors (residuals) are independent of each other.
- Homoscedasticity: The variance of the errors is constant across all levels of the independent variables.
- Normality of Errors: The errors are normally distributed.
The least squares regression function sklearn.linear_model.LinearRegression is demonstrated in the following example, with home prices modeled as a function of square footage and number of bedrooms.
Example 3 - Multiple Linear Regression
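A minimal sketch of this example is given below; the square footages, bedroom counts, and prices are invented values, not real market data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: [square footage, number of bedrooms] -> price (in $1000s)
X = np.array([
    [1400, 3], [1600, 3], [1700, 4], [1875, 4],
    [1100, 2], [1550, 3], [2350, 5], [2450, 4],
    [1425, 3], [1700, 3],
])
y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

model = LinearRegression()
model.fit(X, y)

print("Coefficients (sqft, bedrooms):", model.coef_)
print("Intercept:", model.intercept_)

# Predict the price of a hypothetical 2000 sqft, 3-bedroom home
print("Predicted price ($1000s):", model.predict([[2000, 3]])[0])
```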
Example 4 - Multiple Linear Regression
Compare scikit-learn with statsmodels.
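The sketch below compares the two libraries on a synthetic advertising dataset (columns TV, Radio, Newspaper, Sales); the data-generating coefficients are arbitrary assumptions chosen only so the fitted results are interpretable.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic advertising data for illustration
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "TV": rng.uniform(0, 300, n),
    "Radio": rng.uniform(0, 50, n),
    "Newspaper": rng.uniform(0, 100, n),
})
df["Sales"] = (3.0 + 0.045 * df["TV"] + 0.19 * df["Radio"]
               + 0.002 * df["Newspaper"] + rng.normal(0, 1.5, n))

X = df[["TV", "Radio", "Newspaper"]]
y = df["Sales"]

# --- scikit-learn: predictive modeling ---
model_skl = LinearRegression().fit(X, y)
y_pred = model_skl.predict(X)
print("Coefficients:", model_skl.coef_)
print("Intercept   :", model_skl.intercept_)
print("MSE         :", mean_squared_error(y, y_pred))
print("R^2         :", r2_score(y, y_pred))

# --- statsmodels: statistical inference ---
X_sm = sm.add_constant(X)          # add an intercept column
model_sm = sm.OLS(y, X_sm).fit()
print(model_sm.summary())          # coefficients, std errors, t, P>|t|, CIs, F-statistic
```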
Interpretation of scikit-learn Results:
- Coefficients: These values indicate the change in 'Sales' for a one-unit increase in the respective advertising medium, holding other variables constant. For example, model_skl.coef_[0] is the coefficient for 'TV'.
- Intercept: This is the predicted 'Sales' when all independent variables ('TV', 'Radio', 'Newspaper') are zero.
- Mean Squared Error (MSE): This measures the average of the squares of the errors or deviations. A lower MSE indicates a better fit.
- R-squared (R²): This represents the proportion of the variance in the dependent variable ('Sales') that is predictable from the independent variables. An R² of 1 indicates a perfect fit, while 0 indicates no linear relationship.
Interpretation of statsmodels Results:
The statsmodels summary is rich with information:
- Coefficients (coef): Same as in scikit-learn, these are the estimated beta values.
- std err (Standard Error): Measures the precision of the coefficient estimates. Smaller standard errors indicate more precise estimates.
- t and P>|t| (t-value and P-value):
- The t-value is the coefficient divided by its standard error.
- The P-value (P>|t|) indicates the probability of observing such a t-value if the true coefficient were zero (i.e., no relationship). A common threshold for statistical significance is P-value < 0.05. If the P-value is less than 0.05, we typically consider the independent variable to be a statistically significant predictor of the dependent variable.
- [0.025, 0.975] (Confidence Interval): Provides a range within which the true coefficient is likely to fall with 95% confidence.
- R-squared and Adj. R-squared: Similar to scikit-learn, they tell us how much variance in 'Sales' is explained by the independent variables. Adjusted R-squared is generally preferred when comparing models with different numbers of independent variables, as it accounts for the number of predictors.
- F-statistic and Prob (F-statistic): These assess the overall statistical significance of the entire regression model. A low Prob (F-statistic) (e.g., < 0.05) suggests that at least one of the independent variables significantly predicts the dependent variable.
Both scikit-learn and statsmodels are valuable tools for multiple linear regression in Python, serving different primary purposes: scikit-learn for predictive modeling and statsmodels for statistical inference and hypothesis testing.
Example 5: Predicting Concrete Compressive Strength
Concrete strength is a crucial factor in civil engineering. It's influenced by the mix proportions of its ingredients. The following are variables for this model:
- Dependent Variable (Y): Concrete Compressive Strength (MPa)
- Independent Variables (X):
- Cement (kg/m³)
- Blast Furnace Slag (kg/m³)
- Fly Ash (kg/m³)
- Water (kg/m³)
- Superplasticizer (kg/m³)
- Coarse Aggregate (kg/m³)
- Fine Aggregate (kg/m³)
- Age (days)
Since actual concrete strength data can be extensive, a small, representative synthetic dataset is created for this simulation. In a real engineering scenario, this data would come from laboratory experiments or field measurements.
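One possible implementation is sketched below; the synthetic mix proportions and the coefficients used to generate the strength values are illustrative assumptions, not measured data.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Small synthetic dataset standing in for laboratory measurements
rng = np.random.default_rng(1)
n = 60
df = pd.DataFrame({
    "Cement":           rng.uniform(150, 500, n),
    "BlastFurnaceSlag": rng.uniform(0, 250, n),
    "FlyAsh":           rng.uniform(0, 200, n),
    "Water":            rng.uniform(120, 250, n),
    "Superplasticizer": rng.uniform(0, 30, n),
    "CoarseAggregate":  rng.uniform(800, 1150, n),
    "FineAggregate":    rng.uniform(600, 1000, n),
    "Age":              rng.uniform(1, 365, n),
})
# Assumed linear relationship (coefficients chosen only for illustration)
df["Strength"] = (10 + 0.10 * df["Cement"] + 0.08 * df["BlastFurnaceSlag"]
                  + 0.07 * df["FlyAsh"] - 0.15 * df["Water"]
                  + 0.25 * df["Superplasticizer"] + 0.02 * df["Age"]
                  + rng.normal(0, 4.0, n))

X = df.drop(columns="Strength")
y = df["Strength"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# scikit-learn fit and evaluation
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
print(pd.Series(model.coef_, index=X.columns))    # effect of each ingredient
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)), "MPa")
print("R^2 :", r2_score(y_test, y_pred))

# statsmodels fit for p-values and confidence intervals
ols = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(ols.summary())
```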
Engineering Relevance of scikit-learn Output:
- Coefficients: An engineer can interpret these to understand the relative impact of each ingredient on strength. For example, a positive coefficient for 'Cement' means more cement generally leads to higher strength, while a negative coefficient for 'Water' means more water (assuming all else constant) tends to reduce strength.
- RMSE: Provides an intuitive measure of the typical error in predictions, in the same units as the dependent variable (MPa). An engineer wants to minimize this.
- R²: Indicates how well the model explains the variability in concrete strength. A high R² suggests the model's inputs are good predictors.
Engineering Relevance of statsmodels Output:
- P-values \(P>|t|\): This is perhaps the most critical output for an engineer designing concrete mixes. A low p-value (e.g., < 0.05) for a specific ingredient indicates that its quantity has a statistically significant impact on concrete strength. For instance, if 'Water' has a very low p-value, it confirms that controlling the water content is statistically important for strength.
- Confidence Intervals ([0.025, 0.975]): Provide a range for the true effect of each ingredient. An engineer can use this to understand the variability and robustness of the relationships.
- F-statistic and Prob (F-statistic): These indicate if the overall model is statistically significant. A low p-value here means that the chosen mix of ingredients (as a whole) significantly predicts concrete strength.
- R-squared and Adjusted R-squared: Similar to scikit-learn, but often accompanied by more detailed statistical tests and diagnostics, which are valuable for a deeper understanding of model fit and assumptions.
This engineering example demonstrates how multiple linear regression can be used not just for prediction but also for gaining actionable insights into complex material behaviors, allowing engineers to optimize designs, control quality, and understand the fundamental relationships between input parameters and performance.
Summary
Choosing the right method for linear regression in Python depends on your goals and the level of detail you need. The two most popular libraries for this task are scikit-learn and statsmodels. There is also a simpler, more direct approach using NumPy for basic cases.
Scikit-learn
Scikit-learn is a powerful and widely used machine learning library. It is the go-to choice for building predictive models.
- When to use it: Use scikit-learn when your primary goal is to build a predictive model and integrate it into a larger machine learning pipeline. It's excellent for tasks like:
- Making predictions on new data.
- Evaluating model performance with metrics like R², Mean Squared Error (MSE), and Root Mean Squared Error (RMSE).
- Performing other machine learning tasks like classification, clustering, and cross-validation, as it has a consistent API.
- Working with multiple linear regression (multiple independent variables).
- How it works: You use the LinearRegression class from sklearn.linear_model. The process is typically:
- Import LinearRegression.
- Create an instance of the model: model = LinearRegression().
- Fit the model to your data: model.fit(X, y), where X is your feature matrix and y is your target variable.
- Access the coefficients (model.coef_) and intercept (model.intercept_).
- Make predictions with model.predict(X_new).
Statsmodels
Statsmodels is a library that focuses on providing statistical models and tests. Its output is closer to what you would get from statistical software such as R.
- When to use it: Use statsmodels when your focus is on statistical inference and understanding the relationships between variables. It provides a detailed summary of the model, which is crucial for:
- Examining p-values to determine the statistical significance of each independent variable.
- Checking for multicollinearity and other regression diagnostics.
- Performing hypothesis testing.
- Getting a comprehensive summary table with many statistical details.
- How it works: You can use the OLS (Ordinary Least Squares) method from statsmodels.api or statsmodels.formula.api. The formula API is particularly useful as it allows you to specify the model using R-style formulas.
- Import statsmodels.api as sm or statsmodels.formula.api as smf.
- Add a constant to your independent variables if you are using the sm API: X = sm.add_constant(X).
- Define and fit the model: model = sm.OLS(y, X).fit().
- Get the detailed summary table with print(model.summary()).
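As a minimal sketch, assuming df is a pandas DataFrame containing Sales, TV, Radio, and Newspaper columns, the formula interface looks like this:

```python
import statsmodels.formula.api as smf

# R-style formula: model Sales on TV, Radio, and Newspaper spending
# (the formula API adds the intercept automatically)
model = smf.ols("Sales ~ TV + Radio + Newspaper", data=df).fit()
print(model.summary())
```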
NumPy
NumPy, the fundamental package for scientific computing in Python, can also be used for linear regression, especially in simple cases; however, its routines are not as feature-rich as scikit-learn or statsmodels for more complex analyses or model evaluation.
- Use numpy.linalg.lstsq for very simple linear regression problems (e.g., a single predictor variable).
- numpy.polynomial.Polynomial.fit can be used to fit a polynomial (including a degree-1 polynomial, i.e., a straight line) to your data.
- You can also manually calculate the coefficients using linear algebra functions within NumPy.
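A brief sketch of the first two approaches, using made-up data:

```python
import numpy as np
from numpy.polynomial import Polynomial

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 4.1, 6.0, 8.1, 9.9])

# numpy.linalg.lstsq: build the design matrix [x, 1] and solve for [slope, intercept]
A = np.vstack([x, np.ones_like(x)]).T
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
print(slope, intercept)

# numpy.polynomial.Polynomial.fit: degree-1 polynomial (a straight line)
line = Polynomial.fit(x, y, deg=1).convert()
print(line.coef)   # [intercept, slope]
```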
Summary Table

| Method | Best for... | Key Features |
|---|---|---|
| scikit-learn | Predictive modeling, machine learning pipelines | Simple API, performance metrics, easy integration with other ML tools. |
| statsmodels | Statistical inference, hypothesis testing | Detailed summary output, p-values, regression diagnostics, R-style formulas. |
| NumPy | Simple linear regression, manual calculations | Direct, low-level control, good for understanding the fundamentals. |
For most users doing data science and machine learning, scikit-learn is the most common and practical choice. If your work is more rooted in classical statistics and you need to formally test hypotheses about your variables, statsmodels is the superior option.
Conclusion
Multiple linear regression is a powerful and widely used tool for understanding and predicting relationships between variables. Python's scikit-learn provides a straightforward way to build predictive models, while statsmodels offers in-depth statistical insights crucial for hypothesis testing and understanding the underlying relationships.


