Learning Objectives
By the end of this chapter, you will be able to:
- Understand the concept and purpose of regression analysis
- Calculate regression coefficients using the least squares method
- Write and interpret the regression equation
- Use the regression equation for prediction
- Understand the relationship between correlation and regression
What is Regression Analysis?
Regression analysis is a statistical method used to establish a mathematical relationship between a dependent variable (Y) and one or more independent variables (X).
flowchart LR
A[Independent Variable<br/>X = Predictor] --> B[Regression<br/>Equation]
B --> C[Dependent Variable<br/>Y = Response]
D["Example:<br/>Training Hours (X)"] --> E["Predicts"]
E --> F["Performance Score (Y)"]
Key Differences: Correlation vs Regression
| Aspect | Correlation | Regression |
|---|---|---|
| Purpose | Measures strength of relationship | Establishes mathematical equation |
| Direction | Symmetric (X↔Y) | Asymmetric (X→Y) |
| Output | Single coefficient (r) | Equation (Y = a + bX) |
| Prediction | Not designed for prediction | Predicts Y from X |
| Causation | No implication | Treats X as the explanatory variable, but does not by itself prove causation |
The Simple Linear Regression Model
The equation for a straight line:
\[Y = a + bX\]
Or written as:
\[\hat{Y} = a + bX\]
Where:
- $\hat{Y}$ (Y-hat) = Predicted value of Y
- $a$ = Y-intercept (value of Y when X = 0)
- $b$ = Slope (change in Y for one unit change in X)
- $X$ = Independent variable value
flowchart TB
subgraph "Regression Line Components"
A["a = Y-intercept<br/>(where line crosses Y-axis)"]
B["b = Slope<br/>(steepness of line)"]
end
C["If b > 0: Line goes upward<br/>If b < 0: Line goes downward<br/>If b = 0: Horizontal line"]
Calculating Regression Coefficients
The Least Squares Method
The method finds the line that minimizes the sum of squared differences between actual and predicted values.
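As a short sketch of where the coefficient formulas come from (the symbol $SSE$ for the sum of squared errors is introduced here for convenience): write the quantity to be minimized as a function of $a$ and $b$, then set both partial derivatives to zero.
\[SSE = \sum (Y - a - bX)^2\]
\[\frac{\partial SSE}{\partial a} = -2\sum (Y - a - bX) = 0 \quad\Rightarrow\quad \sum Y = na + b\sum X\]
\[\frac{\partial SSE}{\partial b} = -2\sum X(Y - a - bX) = 0 \quad\Rightarrow\quad \sum XY = a\sum X + b\sum X^2\]
Solving these two equations (often called the normal equations) for $a$ and $b$ yields the slope and intercept formulas used throughout this chapter.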
Formulas for Regression Coefficients
Slope (b):
\[b = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}\]
Y-intercept (a):
\[a = \bar{Y} - b\bar{X}\]
Or:
\[a = \frac{\sum Y - b\sum X}{n}\]
Relationship with Correlation
\[b = r \times \frac{s_Y}{s_X}\]
Where $r$ is the correlation coefficient, $s_Y$ is the standard deviation of Y, and $s_X$ is the standard deviation of X.
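The formulas above translate into a few lines of code. The following is a minimal Python sketch, not a reference implementation. The function names `fit_line` and `pearson_r` and the four-point dataset are illustrative assumptions, not part of the chapter's data. It computes $b$ and $a$ from the raw sums and then checks the identity $b = r \times s_Y / s_X$.

```python
import math
import statistics


def fit_line(xs, ys):
    """Return (a, b) for the least squares line Y-hat = a + b*X."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


def pearson_r(xs, ys):
    """Correlation coefficient r from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical data, chosen only to keep the arithmetic small.
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]

a, b = fit_line(xs, ys)
r = pearson_r(xs, ys)
b_from_r = r * statistics.stdev(ys) / statistics.stdev(xs)

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"r * s_Y / s_X = {b_from_r:.3f}")
```

Both printed slope values should agree, since the two formulas are algebraically equivalent.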
Step-by-Step Example 1: Basic Regression
Problem: A study examines the relationship between training hours (X) and test scores (Y) of 6 employees. Calculate the regression equation and predict the score for an employee with 14 hours of training.
| Employee | Training Hours (X) | Test Score (Y) |
|---|---|---|
| A | 10 | 65 |
| B | 12 | 70 |
| C | 15 | 75 |
| D | 8 | 60 |
| E | 18 | 82 |
| F | 20 | 88 |
Solution:
Step 1: Create calculation table
| $X$ | $Y$ | $XY$ | $X^2$ |
|---|---|---|---|
| 10 | 65 | 650 | 100 |
| 12 | 70 | 840 | 144 |
| 15 | 75 | 1125 | 225 |
| 8 | 60 | 480 | 64 |
| 18 | 82 | 1476 | 324 |
| 20 | 88 | 1760 | 400 |
| $\sum X = 83$ | $\sum Y = 440$ | $\sum XY = 6331$ | $\sum X^2 = 1257$ |
Step 2: Note the values
- $n = 6$
- $\sum X = 83$
- $\sum Y = 440$
- $\sum XY = 6331$
- $\sum X^2 = 1257$
Step 3: Calculate slope (b)
\[b = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}\]
\[b = \frac{6(6331) - (83)(440)}{6(1257) - (83)^2}\]
\[b = \frac{37986 - 36520}{7542 - 6889}\]
\[b = \frac{1466}{653} = 2.245\]
Step 4: Calculate means
\[\bar{X} = \frac{83}{6} = 13.83\]
\[\bar{Y} = \frac{440}{6} = 73.33\]
Step 5: Calculate Y-intercept (a)
\[a = \bar{Y} - b\bar{X}\]
\[a = 73.33 - (2.245)(13.83)\]
\[a = 73.33 - 31.05 = 42.28\]
Step 6: Write the regression equation
\[\hat{Y} = 42.28 + 2.245X\]
Step 7: Prediction for X = 14
\[\hat{Y} = 42.28 + 2.245(14)\]
\[\hat{Y} = 42.28 + 31.43 = 73.71\]
Answer:
- Regression equation: $\hat{Y} = 42.28 + 2.245X$
- Predicted score for 14 hours training: 73.71
Interpretation:
- The slope (b = 2.245) means that each additional hour of training is associated with an increase of about 2.25 points in the predicted test score, on average.
- The intercept (a = 42.28) is the predicted score with zero training. Because X = 0 lies outside the observed range (8 to 20 hours), this value may not be practically meaningful.
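The hand calculation in Example 1 can be checked with a short script. This is a sketch that follows the same steps (sums, slope, intercept, prediction). The variable names are illustrative.

```python
# Data from Example 1: training hours (X) and test scores (Y).
hours = [10, 12, 15, 8, 18, 20]
scores = [65, 70, 75, 60, 82, 88]

n = len(hours)
sum_x = sum(hours)
sum_y = sum(scores)
sum_xy = sum(x * y for x, y in zip(hours, scores))
sum_x2 = sum(x * x for x in hours)

# Slope and intercept from the least squares formulas.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

# Prediction for 14 hours of training.
predicted_14 = a + b * 14

print(f"sums: {sum_x}, {sum_y}, {sum_xy}, {sum_x2}")   # sums: 83, 440, 6331, 1257
print(f"b = {b:.3f}, a = {a:.2f}")                      # b = 2.245, a = 42.28
print(f"predicted score at X = 14: {predicted_14:.2f}") # 73.71
```

The script carries full precision through every step, so intermediate values can differ from the hand calculation in the last decimal place even though the rounded results match.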
Step-by-Step Example 2: Exam-Style Problem
Problem: The following data shows the relationship between years of experience (X) and monthly salary in thousands of NPR (Y) for government officers:
| Experience (X) | 2 | 4 | 6 | 8 | 10 |
|---|---|---|---|---|---|
| Salary (Y) | 25 | 30 | 35 | 42 | 45 |
(a) Find the regression equation of Y on X
(b) Estimate the salary for an officer with 7 years of experience
(c) Interpret the regression coefficients
Solution:
Step 1: Create calculation table
| $X$ | $Y$ | $XY$ | $X^2$ |
|---|---|---|---|
| 2 | 25 | 50 | 4 |
| 4 | 30 | 120 | 16 |
| 6 | 35 | 210 | 36 |
| 8 | 42 | 336 | 64 |
| 10 | 45 | 450 | 100 |
| $\sum X = 30$ | $\sum Y = 177$ | $\sum XY = 1166$ | $\sum X^2 = 220$ |
Step 2: Calculate slope (b)
\[b = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}\]
\[b = \frac{5(1166) - (30)(177)}{5(220) - (30)^2}\]
\[b = \frac{5830 - 5310}{1100 - 900}\]
\[b = \frac{520}{200} = 2.6\]
Step 3: Calculate Y-intercept (a)
\[\bar{X} = \frac{30}{5} = 6\]
\[\bar{Y} = \frac{177}{5} = 35.4\]
\[a = \bar{Y} - b\bar{X} = 35.4 - (2.6)(6) = 35.4 - 15.6 = 19.8\]
Step 4: Write regression equation
\[\hat{Y} = 19.8 + 2.6X\]
Step 5: Prediction for X = 7
\[\hat{Y} = 19.8 + 2.6(7) = 19.8 + 18.2 = 38\]
Answers:
(a) Regression Equation: $\hat{Y} = 19.8 + 2.6X$
(b) Estimated Salary for 7 years: NPR 38,000
(c) Interpretation:
- Slope (b = 2.6): Each additional year of experience is associated with an increase of about NPR 2,600 in the predicted monthly salary.
- Intercept (a = 19.8): A fresh employee (X = 0) would theoretically start at NPR 19,800 per month, which can be read as a base salary. Because X = 0 lies outside the observed range (2 to 10 years), treat this figure with caution.
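As a cross-check on Example 2, the sketch below computes the slope with the deviation form $b = \sum (X - \bar{X})(Y - \bar{Y}) / \sum (X - \bar{X})^2$. This form does not appear in the chapter. It is algebraically equivalent to the raw-sums formula and is used here only to confirm the result by a second route. Variable names are illustrative.

```python
# Data from Example 2: experience in years (X), salary in thousands of NPR (Y).
experience = [2, 4, 6, 8, 10]
salary = [25, 30, 35, 42, 45]

n = len(experience)
x_bar = sum(experience) / n
y_bar = sum(salary) / n

# Deviation form of the slope (equivalent to the raw-sums formula).
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(experience, salary))
s_xx = sum((x - x_bar) ** 2 for x in experience)

b = s_xy / s_xx
a = y_bar - b * x_bar
estimate_7 = a + b * 7

print(f"b = {b:.2f}, a = {a:.2f}")
print(f"estimated salary at 7 years: {estimate_7:.1f} thousand NPR")
```

The rounded values should match the hand calculation: a slope of 2.6, an intercept of 19.8, and an estimate of 38 thousand NPR.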
Understanding Regression Visually
flowchart TD
subgraph "Regression Line"
A["Actual Data Points (X, Y)"]
B["Regression Line: Y = a + bX"]
C["Residuals = Actual - Predicted"]
end
D["The 'best fit' line minimizes<br/>the sum of squared residuals"]
Components of Regression Analysis
| Term | Symbol | Definition |
|---|---|---|
| Actual Value | $Y$ | Observed value of dependent variable |
| Predicted Value | $\hat{Y}$ | Value calculated from regression equation |
| Residual | $e = Y - \hat{Y}$ | Difference between actual and predicted |
| SSE | $\sum(Y - \hat{Y})^2$ | Sum of Squared Errors (minimized by least squares) |
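To make these terms concrete, the sketch below computes the predicted values, residuals, and SSE for the Example 2 line $\hat{Y} = 19.8 + 2.6X$. It then compares that SSE with the SSE of a slightly different line. The helper name `sse` and the alternative slope of 2.5 are arbitrary choices for illustration.

```python
# Data and fitted coefficients from Example 2.
xs = [2, 4, 6, 8, 10]
ys = [25, 30, 35, 42, 45]
a, b = 19.8, 2.6


def sse(a, b, xs, ys):
    """Sum of squared errors for the line Y-hat = a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


predicted = [a + b * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

sse_fit = sse(a, b, xs, ys)
sse_other = sse(a, 2.5, xs, ys)  # same intercept, arbitrary different slope

print("residuals:", [round(e, 2) for e in residuals])
print(f"sum of residuals = {sum(residuals):.2f}")
print(f"SSE (least squares line) = {sse_fit:.2f}")
print(f"SSE (slope 2.5 instead)  = {sse_other:.2f}")
```

Two things are worth noticing in the output. The residuals of a least squares line with an intercept sum to zero (up to floating-point rounding), and the alternative line has a larger SSE than the fitted one.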
Standard Error of Estimate
The standard error of estimate ($S_{YX}$) measures how far the observed Y values typically fall from the regression line, and therefore how accurate predictions are likely to be:
\[S_{YX} = \sqrt{\frac{\sum(Y - \hat{Y})^2}{n-2}}\]
Or using the computational formula:
\[S_{YX} = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{n-2}}\]
Interpretation
- Lower $S_{YX}$ = Better fit, more accurate predictions
- It’s measured in the same units as Y
Step-by-Step Example 3: Complete Analysis
Problem: Calculate the regression equation and standard error for:
| X | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Y | 2 | 5 | 6 | 8 | 10 |
Solution:
Step 1: Calculation table
| $X$ | $Y$ | $XY$ | $X^2$ | $Y^2$ |
|---|---|---|---|---|
| 1 | 2 | 2 | 1 | 4 |
| 2 | 5 | 10 | 4 | 25 |
| 3 | 6 | 18 | 9 | 36 |
| 4 | 8 | 32 | 16 | 64 |
| 5 | 10 | 50 | 25 | 100 |
| $\sum X = 15$ | $\sum Y = 31$ | $\sum XY = 112$ | $\sum X^2 = 55$ | $\sum Y^2 = 229$ |
Step 2: Calculate regression coefficients
\[b = \frac{5(112) - (15)(31)}{5(55) - (15)^2} = \frac{560 - 465}{275 - 225} = \frac{95}{50} = 1.9\]
\[\bar{X} = 3, \quad \bar{Y} = 6.2\]
\[a = 6.2 - (1.9)(3) = 6.2 - 5.7 = 0.5\]
Regression Equation: $\hat{Y} = 0.5 + 1.9X$
Step 3: Calculate Standard Error
\[S_{YX} = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{n-2}}\]
\[S_{YX} = \sqrt{\frac{229 - (0.5)(31) - (1.9)(112)}{5-2}}\]
\[S_{YX} = \sqrt{\frac{229 - 15.5 - 212.8}{3}}\]
\[S_{YX} = \sqrt{\frac{0.7}{3}} = \sqrt{0.233} = 0.48\]
Answer: $S_{YX} = 0.48$
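The two standard error formulas can be checked against each other on the Example 3 data. The sketch below computes the numerator once from the residuals (the definition) and once from the computational formula. Variable names are illustrative.

```python
import math

# Data and coefficients from Example 3.
xs = [1, 2, 3, 4, 5]
ys = [2, 5, 6, 8, 10]
a, b = 0.5, 1.9
n = len(xs)

# Numerator from the definition: sum of squared residuals.
sse_definition = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Numerator from the computational formula.
sum_y = sum(ys)
sum_y2 = sum(y * y for y in ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sse_shortcut = sum_y2 - a * sum_y - b * sum_xy

s_yx = math.sqrt(sse_definition / (n - 2))

print(f"SSE (definition) = {sse_definition:.3f}")
print(f"SSE (shortcut)   = {sse_shortcut:.3f}")
print(f"S_YX = {s_yx:.3f}")
```

Both numerators come out at about 0.7, giving $S_{YX} \approx 0.48$. Note that the computational formula subtracts large, nearly equal quantities, so using heavily rounded values of $a$ and $b$ can shift its result noticeably. In this example the coefficients are exact, so the two routes agree.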
Coefficient of Determination in Regression
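The coefficient of determination ($R^2$) tells us what proportion of the variation in Y is explained by X. In simple linear regression (one independent variable), it equals the square of the correlation coefficient:
\[R^2 = r^2\]
Where $r$ is the correlation coefficient.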
Interpretation
- $R^2 = 0.81$ means 81% of variation in Y is explained by X
- Remaining 19% is unexplained (due to other factors)
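The "proportion explained" reading can be verified numerically. The sketch below uses the Example 3 data and computes $R^2$ in two ways: as $r^2$, and as $1 - SSE/SST$, where $SST = \sum (Y - \bar{Y})^2$ is the total variation in Y. The second form is the standard explained-variation definition. It is not stated in this chapter and is included here as an assumption for comparison.

```python
import math

# Data from Example 3.
xs = [1, 2, 3, 4, 5]
ys = [2, 5, 6, 8, 10]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)  # total variation in Y (SST)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Correlation coefficient and least squares coefficients.
r = s_xy / math.sqrt(s_xx * s_yy)
b = s_xy / s_xx
a = y_bar - b * x_bar

# Unexplained variation (SSE) and the two versions of R^2.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
r_squared_from_r = r ** 2
r_squared_from_sse = 1 - sse / s_yy

print(f"r = {r:.4f}")
print(f"r^2 = {r_squared_from_r:.4f}")
print(f"1 - SSE/SST = {r_squared_from_sse:.4f}")
```

For this data both values are about 0.981, so roughly 98% of the variation in Y is explained by X.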
Assumptions of Linear Regression
flowchart TD
A[Regression Assumptions] --> B[1. Linearity]
A --> C[2. Independence]
A --> D[3. Homoscedasticity]
A --> E[4. Normality of Residuals]
B --> B1["Relationship is linear"]
C --> C1["Observations are independent"]
D --> D1["Constant variance of errors"]
E --> E1["Errors are normally distributed"]
Limitations and Cautions
1. Don’t Extrapolate Beyond Data Range
If X ranges from 2 to 10 in your data, don’t predict for X = 20.
2. Correlation ≠ Causation
A significant regression doesn’t prove X causes Y.
3. Check for Outliers
Outliers can heavily influence the regression line.
4. Verify Assumptions
Always check if data meets regression assumptions.
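The first caution can be built into code. The sketch below wraps prediction in a hypothetical helper, `predict_within_range`, that refuses to extrapolate. The helper name and the choice to raise an error (rather than, say, return a warning) are assumptions made for illustration. The coefficients and range come from Example 2.

```python
def predict_within_range(a, b, x, x_min, x_max):
    """Return a + b*x, but only if x lies inside the observed range of X."""
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"X = {x} is outside the observed range [{x_min}, {x_max}]; "
            "the fitted line is not supported there."
        )
    return a + b * x


# Example 2 line: Y-hat = 19.8 + 2.6X, fitted on X from 2 to 10 years.
a, b = 19.8, 2.6
x_min, x_max = 2, 10

inside = predict_within_range(a, b, 7, x_min, x_max)
print(f"X = 7  -> {inside:.1f}")

try:
    predict_within_range(a, b, 20, x_min, x_max)
except ValueError as error:
    print(f"X = 20 -> refused: {error}")
```

A strict check like this is only one possible policy. In practice a prediction just outside the range may still be reasonable, but it should be flagged as an extrapolation.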
Practice Problems
Problem 1
Calculate the regression equation for:
| X | 5 | 10 | 15 | 20 | 25 |
|---|---|---|---|---|---|
| Y | 15 | 25 | 32 | 38 | 45 |
(a) Find the regression line
(b) Predict Y when X = 18
Problem 2
Government data shows:
| Budget (millions) X | 10 | 15 | 20 | 25 | 30 |
|---|---|---|---|---|---|
| Projects Completed Y | 8 | 12 | 14 | 18 | 22 |
(a) Derive the regression equation
(b) Estimate projects if budget is 22 million
(c) Calculate $R^2$ if r = 0.98
Problem 3
Explain why we cannot use the regression equation to predict training hours (X) from test scores (Y) using the same equation we derived to predict Y from X.
Summary
| Concept | Formula/Key Point |
|---|---|
| Regression Equation | $\hat{Y} = a + bX$ |
| Slope (b) | $b = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}$ |
| Intercept (a) | $a = \bar{Y} - b\bar{X}$ |
| Standard Error | $S_{YX} = \sqrt{\frac{\sum(Y - \hat{Y})^2}{n-2}}$; lower values mean a closer fit |
| R² | Proportion of variance explained |
| Key Limitation | Don’t extrapolate beyond data range |
Unit 2 Complete!
In Unit 3, we will study Probability Theory - the mathematical foundation for inferential statistics, including basic probability, probability distributions, and their applications.

