Learning Objectives

By the end of this chapter, you will be able to:

  • Understand the concept and purpose of regression analysis
  • Calculate regression coefficients using the least squares method
  • Write and interpret the regression equation
  • Use the regression equation for prediction
  • Understand the relationship between correlation and regression

What is Regression Analysis?

Regression analysis is a statistical method used to establish a mathematical relationship between a dependent variable (Y) and one or more independent variables (X).

flowchart LR
    A[Independent Variable<br/>X = Predictor] --> B[Regression<br/>Equation]
    B --> C[Dependent Variable<br/>Y = Response]

    D["Example:<br/>Training Hours (X)"] --> E["Predicts"]
    E --> F["Performance Score (Y)"]

Key Differences: Correlation vs Regression

| Aspect | Correlation | Regression |
|---|---|---|
| Purpose | Measures strength of relationship | Establishes mathematical equation |
| Direction | Symmetric (X↔Y) | Asymmetric (X→Y) |
| Output | Single coefficient (r) | Equation (Y = a + bX) |
| Prediction | Cannot predict | Can predict Y from X |
| Causation | No implication | Treats X as the predictor of Y, but does not by itself prove causation |

The Simple Linear Regression Model

The equation for a straight line:

\[Y = a + bX\]

Or, to show that the line gives a predicted value of Y rather than an observed one:

\[\hat{Y} = a + bX\]

Where:

  • $\hat{Y}$ (Y-hat) = Predicted value of Y
  • $a$ = Y-intercept (value of Y when X = 0)
  • $b$ = Slope (change in Y for one unit change in X)
  • $X$ = Independent variable value

flowchart TB
    subgraph "Regression Line Components"
        A["a = Y-intercept<br/>(where line crosses Y-axis)"]
        B["b = Slope<br/>(steepness of line)"]
    end

    C["If b > 0: Line goes upward<br/>If b < 0: Line goes downward<br/>If b = 0: Horizontal line"]

Calculating Regression Coefficients

The Least Squares Method

The least squares method chooses the values of $a$ and $b$ that make the sum of squared differences between the actual values ($Y$) and the predicted values ($\hat{Y}$) as small as possible.
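
In symbols, the quantity being minimized can be written as follows (the subscript $i$, which indexes the $n$ observations, is notation introduced here for clarity):

\[SSE(a, b) = \sum_{i=1}^{n} (Y_i - a - bX_i)^2\]

Setting the partial derivatives of this sum with respect to $a$ and $b$ to zero yields the two normal equations:

\[\sum Y = na + b\sum X\]

\[\sum XY = a\sum X + b\sum X^2\]

Solving this pair for $a$ and $b$ gives the slope and intercept formulas used throughout this chapter.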

Formulas for Regression Coefficients

Slope (b):

\[b = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}\]

Y-intercept (a):

\[a = \bar{Y} - b\bar{X}\]

Or:

\[a = \frac{\sum Y - b\sum X}{n}\]
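
The two formulas above translate almost directly into code. The following is a minimal sketch in plain Python; the function name `fit_line` and the small test data are assumptions made for illustration, not part of the chapter.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line Y-hat = a + b*X.

    Hypothetical helper for illustration; uses the sum-based formulas.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all X values are identical; slope is undefined")

    b = (n * sum_xy - sum_x * sum_y) / denominator  # slope
    a = (sum_y - b * sum_x) / n                     # intercept
    return a, b


# Made-up check: these points lie exactly on Y = 1 + 2X
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # prints: 1.0 2.0
```

The denominator check matters in practice: if every X value is the same, the slope formula divides by zero and no unique line exists.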

Relationship with Correlation

\[b = r \times \frac{s_Y}{s_X}\]

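Where $r$ is the correlation coefficient, $s_Y$ is the standard deviation of Y, and $s_X$ is the standard deviation of X. Because standard deviations are never negative, the slope $b$ always has the same sign as $r$.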
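
The identity $b = r \times s_Y / s_X$ can be checked numerically. The sketch below uses small illustrative values (not taken from the chapter's examples) and the standard Pearson formula for $r$, which this section does not restate. It assumes the same kind of standard deviation (here the sample version) is used for both X and Y.

```python
import math
import statistics

# Illustrative values only (not from the chapter's examples)
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)

# Pearson correlation coefficient r (computational form)
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

# Slope from the least squares formula
slope_direct = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Slope from b = r * s_Y / s_X, using sample standard deviations for both
slope_from_r = r * statistics.stdev(ys) / statistics.stdev(xs)

print(round(slope_direct, 6), round(slope_from_r, 6))
```

Both routes should give the same slope, up to floating-point rounding.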


Step-by-Step Example 1: Basic Regression

Problem: A study examines the relationship between training hours (X) and test scores (Y) of 6 employees. Calculate the regression equation and predict the score for an employee with 14 hours of training.

| Employee | Training Hours (X) | Test Score (Y) |
|---|---|---|
| A | 10 | 65 |
| B | 12 | 70 |
| C | 15 | 75 |
| D | 8 | 60 |
| E | 18 | 82 |
| F | 20 | 88 |

Solution:

Step 1: Create calculation table

| $X$ | $Y$ | $XY$ | $X^2$ |
|---|---|---|---|
| 10 | 65 | 650 | 100 |
| 12 | 70 | 840 | 144 |
| 15 | 75 | 1125 | 225 |
| 8 | 60 | 480 | 64 |
| 18 | 82 | 1476 | 324 |
| 20 | 88 | 1760 | 400 |
| $\sum X = 83$ | $\sum Y = 440$ | $\sum XY = 6331$ | $\sum X^2 = 1257$ |

Step 2: Note the values

  • $n = 6$
  • $\sum X = 83$
  • $\sum Y = 440$
  • $\sum XY = 6331$
  • $\sum X^2 = 1257$

Step 3: Calculate slope (b)

\[b = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}\]

\[b = \frac{6(6331) - (83)(440)}{6(1257) - (83)^2}\]

\[b = \frac{37986 - 36520}{7542 - 6889}\]

\[b = \frac{1466}{653} = 2.245\]

Step 4: Calculate means

\[\bar{X} = \frac{83}{6} = 13.83\]

\[\bar{Y} = \frac{440}{6} = 73.33\]

Step 5: Calculate Y-intercept (a)

\[a = \bar{Y} - b\bar{X}\]

\[a = 73.33 - (2.245)(13.83)\]

\[a = 73.33 - 31.05 = 42.28\]

Step 6: Write the regression equation

\[\hat{Y} = 42.28 + 2.245X\]

Step 7: Prediction for X = 14

\[\hat{Y} = 42.28 + 2.245(14)\]

\[\hat{Y} = 42.28 + 31.43 = 73.71\]

Answer:

  • Regression equation: $\hat{Y} = 42.28 + 2.245X$
  • Predicted score for 14 hours training: 73.71

Interpretation:

  • The slope (b = 2.245) means that each additional hour of training is associated with an increase of approximately 2.25 points in the predicted test score.
  • The intercept (a = 42.28) is the predicted score with zero training. Because X = 0 lies outside the observed range (8 to 20 hours), this value may not be practically meaningful.
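
The hand calculation in Example 1 can be double-checked with a few lines of Python. This is a sketch for verification only; the variable names are chosen here for readability. It carries full precision through every step, so intermediate values differ slightly from the rounded ones shown above, but the final results agree after rounding.

```python
# Data from Example 1
hours = [10, 12, 15, 8, 18, 20]    # training hours (X)
scores = [65, 70, 75, 60, 82, 88]  # test scores (Y)
n = len(hours)

sum_x = sum(hours)
sum_y = sum(scores)
sum_xy = sum(x * y for x, y in zip(hours, scores))
sum_x2 = sum(x * x for x in hours)

# Slope and intercept from the least squares formulas
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * (sum_x / n)

# Prediction for 14 hours of training
predicted = a + b * 14

print(f"b = {b:.3f}, a = {a:.2f}, prediction = {predicted:.2f}")
# prints: b = 2.245, a = 42.28, prediction = 73.71
```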

Step-by-Step Example 2: Exam-Style Problem

Problem: The following data shows the relationship between years of experience (X) and monthly salary in thousands of NPR (Y) for government officers:

| Experience (X) | 2 | 4 | 6 | 8 | 10 |
|---|---|---|---|---|---|
| Salary (Y) | 25 | 30 | 35 | 42 | 45 |

(a) Find the regression equation of Y on X
(b) Estimate the salary for an officer with 7 years of experience
(c) Interpret the regression coefficients

Solution:

Step 1: Create calculation table

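| $X$ | $Y$ | $XY$ | $X^2$ |
|---|---|---|---|
| 2 | 25 | 50 | 4 |
| 4 | 30 | 120 | 16 |
| 6 | 35 | 210 | 36 |
| 8 | 42 | 336 | 64 |
| 10 | 45 | 450 | 100 |
| $\sum X = 30$ | $\sum Y = 177$ | $\sum XY = 1166$ | $\sum X^2 = 220$ |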

Step 2: Calculate slope (b)

\[b = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}\]

\[b = \frac{5(1166) - (30)(177)}{5(220) - (30)^2}\]

\[b = \frac{5830 - 5310}{1100 - 900}\]

\[b = \frac{520}{200} = 2.6\]

Step 3: Calculate Y-intercept (a)

\[\bar{X} = \frac{30}{5} = 6\]

\[\bar{Y} = \frac{177}{5} = 35.4\]

\[a = \bar{Y} - b\bar{X} = 35.4 - (2.6)(6) = 35.4 - 15.6 = 19.8\]

Step 4: Write regression equation

\[\hat{Y} = 19.8 + 2.6X\]

Step 5: Prediction for X = 7

\[\hat{Y} = 19.8 + 2.6(7) = 19.8 + 18.2 = 38\]

Answers:

(a) Regression Equation: $\hat{Y} = 19.8 + 2.6X$

(b) Estimated Salary for 7 years: NPR 38,000

(c) Interpretation:

  • Slope (b = 2.6): For each additional year of experience, the predicted monthly salary increases by NPR 2,600.
  • Intercept (a = 19.8): An officer with no experience (X = 0) would theoretically start at NPR 19,800 per month. Because X = 0 lies outside the observed range (2 to 10 years), treat this as a rough baseline rather than a reliable estimate.

Understanding Regression Visually

flowchart TD
    subgraph "Regression Line"
        A["Actual Data Points (X, Y)"]
        B["Regression Line: Y = a + bX"]
        C["Residuals = Actual - Predicted"]
    end

    D["The 'best fit' line minimizes<br/>the sum of squared residuals"]

Components of Regression Analysis

| Term | Symbol | Definition |
|---|---|---|
| Actual Value | $Y$ | Observed value of dependent variable |
| Predicted Value | $\hat{Y}$ | Value calculated from regression equation |
| Residual | $e = Y - \hat{Y}$ | Difference between actual and predicted |
| SSE | $\sum(Y - \hat{Y})^2$ | Sum of Squared Errors (minimized by least squares) |
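
To make these terms concrete, the sketch below computes the predicted values, residuals, and SSE for the Example 2 data, using the fitted line $\hat{Y} = 19.8 + 2.6X$ from that example. The variable names are chosen here for illustration.

```python
# Data and fitted coefficients from Example 2
experience = [2, 4, 6, 8, 10]   # X
salary = [25, 30, 35, 42, 45]   # Y, in thousands
a, b = 19.8, 2.6

# Predicted value for each observation
predicted = [a + b * x for x in experience]

# Residual = actual - predicted
residuals = [y - y_hat for y, y_hat in zip(salary, predicted)]

# Sum of squared errors
sse = sum(e ** 2 for e in residuals)

for x, y, y_hat, e in zip(experience, salary, predicted, residuals):
    print(f"X={x:2d}  Y={y}  Y-hat={y_hat:.1f}  residual={e:+.1f}")

print("Sum of residuals:", round(sum(residuals), 6))
print("SSE:", round(sse, 2))
```

For a least squares line with an intercept, the residuals sum to zero (up to floating-point rounding), so positive and negative errors cancel. Squaring them before summing is what makes SSE a usable measure of fit; here it comes out to about 2.8.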

Standard Error of Estimate

The standard error of estimate ($S_{YX}$) measures how far the actual values of Y typically fall from the regression line, and so indicates the accuracy of predictions:

\[S_{YX} = \sqrt{\frac{\sum(Y - \hat{Y})^2}{n-2}}\]

Or, using the computational formula:

\[S_{YX} = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{n-2}}\]

Interpretation

  • Lower $S_{YX}$ = Better fit, more accurate predictions
  • It’s measured in the same units as Y

Step-by-Step Example 3: Complete Analysis

Problem: Calculate the regression equation and standard error for:

| X | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Y | 2 | 5 | 6 | 8 | 10 |

Solution:

Step 1: Calculation table

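| $X$ | $Y$ | $XY$ | $X^2$ | $Y^2$ |
|---|---|---|---|---|
| 1 | 2 | 2 | 1 | 4 |
| 2 | 5 | 10 | 4 | 25 |
| 3 | 6 | 18 | 9 | 36 |
| 4 | 8 | 32 | 16 | 64 |
| 5 | 10 | 50 | 25 | 100 |
| $\sum X = 15$ | $\sum Y = 31$ | $\sum XY = 112$ | $\sum X^2 = 55$ | $\sum Y^2 = 229$ |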

Step 2: Calculate regression coefficients

\[b = \frac{5(112) - (15)(31)}{5(55) - (15)^2} = \frac{560 - 465}{275 - 225} = \frac{95}{50} = 1.9\]

\[\bar{X} = 3, \quad \bar{Y} = 6.2\]

\[a = 6.2 - (1.9)(3) = 6.2 - 5.7 = 0.5\]

Regression Equation: $\hat{Y} = 0.5 + 1.9X$

Step 3: Calculate Standard Error

\[S_{YX} = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{n-2}}\]

\[S_{YX} = \sqrt{\frac{229 - (0.5)(31) - (1.9)(112)}{5-2}}\]

\[S_{YX} = \sqrt{\frac{229 - 15.5 - 212.8}{3}}\]

\[S_{YX} = \sqrt{\frac{0.7}{3}} = \sqrt{0.233} = 0.48\]

Answer: $S_{YX} = 0.48$
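
As a cross-check on Example 3, the sketch below computes the sum of squared errors both ways: directly from the residuals and with the computational formula. The variable names are assumptions made for illustration. The two routes agree here because $a = 0.5$ and $b = 1.9$ are exact; with rounded coefficients, the computational formula can drift noticeably, since it subtracts large numbers that are close together.

```python
import math

# Data and coefficients from Example 3
xs = [1, 2, 3, 4, 5]
ys = [2, 5, 6, 8, 10]
a, b = 0.5, 1.9
n = len(xs)

# Definition: sum of squared residuals
sse_definition = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Computational formula: sum(Y^2) - a*sum(Y) - b*sum(XY)
sum_y = sum(ys)
sum_y2 = sum(y * y for y in ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sse_shortcut = sum_y2 - a * sum_y - b * sum_xy

# Standard error of estimate, with n - 2 in the denominator
s_yx = math.sqrt(sse_definition / (n - 2))

print(round(sse_definition, 4), round(sse_shortcut, 4))
print(f"S_YX = {s_yx:.2f}")  # prints: S_YX = 0.48
```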


Coefficient of Determination in Regression

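The coefficient of determination ($R^2$) tells us what proportion of the variation in Y is explained by X. In simple linear regression (one independent variable), it equals the square of the correlation coefficient: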

\[R^2 = r^2\]

Where $r$ is the correlation coefficient.

Interpretation

  • $R^2 = 0.81$ means 81% of the variation in Y is explained by X
  • The remaining 19% is not explained by X (it is due to other factors or random variation)
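
The sketch below illustrates this with the Example 3 data. It computes $r^2$ from the correlation coefficient and compares it with $1 - SSE/SST$, where $SST = \sum(Y - \bar{Y})^2$ is the total variation in Y. The $1 - SSE/SST$ form is the standard definition of $R^2$ and is an addition here, not a formula stated in this chapter.

```python
import math

# Data from Example 3
xs = [1, 2, 3, 4, 5]
ys = [2, 5, 6, 8, 10]
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)

# Correlation coefficient r (standard Pearson formula)
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

# Least squares line
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

# Unexplained variation (SSE) and total variation (SST)
mean_y = sum_y / n
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
sst = sum((y - mean_y) ** 2 for y in ys)

r_squared_from_r = r ** 2
r_squared_from_ss = 1 - sse / sst

print(f"{r_squared_from_r:.3f} {r_squared_from_ss:.3f}")  # prints: 0.981 0.981
```

For this data set, about 98% of the variation in Y is explained by X.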

Assumptions of Linear Regression

flowchart TD
    A[Regression Assumptions] --> B[1. Linearity]
    A --> C[2. Independence]
    A --> D[3. Homoscedasticity]
    A --> E[4. Normality of Residuals]

    B --> B1["Relationship is linear"]
    C --> C1["Observations are independent"]
    D --> D1["Constant variance of errors"]
    E --> E1["Errors are normally distributed"]

Limitations and Cautions

1. Don’t Extrapolate Beyond Data Range

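If X ranges from 2 to 10 in your data, don’t predict for X = 20. The data only support a linear relationship within the observed range; beyond it, the pattern may change.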

2. Correlation ≠ Causation

A significant regression doesn’t prove X causes Y.

3. Check for Outliers

Outliers can heavily influence the regression line.

4. Verify Assumptions

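Always check whether the data meet the regression assumptions (linearity, independence, homoscedasticity, and normality of residuals) before relying on the results.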
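
The first caution above, about extrapolation, is easy to enforce in code. The sketch below wraps the prediction step in a range check. The function `predict_within_range` is a hypothetical helper written for illustration, applied to the Example 2 equation $\hat{Y} = 19.8 + 2.6X$.

```python
def predict_within_range(x_new, observed_xs, a, b):
    """Predict Y-hat = a + b*x_new, refusing to extrapolate.

    Hypothetical helper for illustration; not a library function.
    """
    x_min, x_max = min(observed_xs), max(observed_xs)
    if not (x_min <= x_new <= x_max):
        raise ValueError(
            f"X = {x_new} is outside the observed range [{x_min}, {x_max}]"
        )
    return a + b * x_new


experience = [2, 4, 6, 8, 10]  # X values from Example 2

# Inside the observed range: prediction is allowed
estimate = predict_within_range(7, experience, a=19.8, b=2.6)
print(round(estimate, 2))  # prints: 38.0

# Outside the observed range: the helper refuses
try:
    predict_within_range(20, experience, a=19.8, b=2.6)
except ValueError as error:
    print("Refused:", error)
```

Raising an error is a deliberately strict choice. A gentler variant could return the prediction together with a warning instead.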


Practice Problems

Problem 1

Calculate the regression equation for:

| X | 5 | 10 | 15 | 20 | 25 |
|---|---|---|---|---|---|
| Y | 15 | 25 | 32 | 38 | 45 |

(a) Find the regression line
(b) Predict Y when X = 18

Problem 2

Government data shows:

| Budget in millions (X) | 10 | 15 | 20 | 25 | 30 |
|---|---|---|---|---|---|
| Projects Completed (Y) | 8 | 12 | 14 | 18 | 22 |

(a) Derive the regression equation
(b) Estimate projects if budget is 22 million
(c) Calculate $R^2$ if r = 0.98

Problem 3

Explain why we cannot use the regression equation to predict training hours (X) from test scores (Y) using the same equation we derived to predict Y from X.


Summary

| Concept | Formula/Key Point |
|---|---|
| Regression Equation | $\hat{Y} = a + bX$ |
| Slope (b) | $b = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}$ |
| Intercept (a) | $a = \bar{Y} - b\bar{X}$ |
| Standard Error ($S_{YX}$) | Measures prediction accuracy |
| $R^2$ | Proportion of variance explained |
| Key Limitation | Don’t extrapolate beyond data range |

Unit 2 Complete!

In Unit 3, we will study Probability Theory - the mathematical foundation for inferential statistics, including basic probability, probability distributions, and their applications.