Learning Objectives

By the end of this chapter, you will be able to:

  • Explain the concept of correlation and its types
  • Calculate Karl Pearson’s correlation coefficient
  • Interpret the strength and direction of correlation
  • Identify limitations and assumptions of Pearson’s correlation

What is Correlation?

Correlation is a statistical measure that describes the strength and direction of a relationship between two variables.

flowchart TD
    A[Correlation Analysis] --> B[Measures relationship<br/>between TWO variables]
    B --> C{Type of Relationship}
    C --> D[Positive<br/>Both increase together]
    C --> E[Negative<br/>One increases, other decreases]
    C --> F[Zero<br/>No linear relationship]

Key Terminology

| Term | Definition | Example |
|---|---|---|
| Independent Variable (X) | The variable we think causes change | Training hours |
| Dependent Variable (Y) | The variable we think is affected | Performance score |
| Bivariate Data | Data with paired observations | (Training, Performance) for each employee |

Types of Correlation

By Direction

flowchart LR
    subgraph "Positive Correlation (r > 0)"
        A1["As X increases"] --> A2["Y also increases"]
    end

    subgraph "Negative Correlation (r < 0)"
        B1["As X increases"] --> B2["Y decreases"]
    end

    subgraph "Zero Correlation (r = 0)"
        C1["X changes"] --> C2["Y shows no pattern"]
    end

Examples in Public Administration:

| Type | X Variable | Y Variable | Expected r |
|---|---|---|---|
| Positive | Training hours | Job performance | Positive |
| Positive | Education level | Salary | Positive |
| Negative | Distance from office | Punctuality | Negative |
| Negative | Bureaucratic delays | Citizen satisfaction | Negative |
| Zero | Shoe size | Job performance | Zero |

By Strength

| Correlation Coefficient ($r$) | Interpretation |
|---|---|
| 0.00 to ±0.19 | Very weak or negligible |
| ±0.20 to ±0.39 | Weak |
| ±0.40 to ±0.59 | Moderate |
| ±0.60 to ±0.79 | Strong |
| ±0.80 to ±1.00 | Very strong |
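The strength bands above can be applied mechanically. The following is a minimal sketch, not a standard library routine: the function name `describe_strength` is hypothetical, and it assumes each band is half-open (for example, 0.195 is treated as "very weak"), since the table leaves the gaps between bands undefined.

```python
def describe_strength(r: float) -> str:
    """Map a correlation coefficient to the verbal labels in the strength table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)  # strength ignores the sign; the sign only gives direction
    if size < 0.20:
        return "very weak or negligible"
    if size < 0.40:
        return "weak"
    if size < 0.60:
        return "moderate"
    if size < 0.80:
        return "strong"
    return "very strong"


print(describe_strength(0.989))  # very strong
print(describe_strength(-0.45))  # moderate
```

These cut-offs are a convention rather than a rule; other textbooks draw the boundaries differently.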

Karl Pearson’s Correlation Coefficient

The Pearson correlation coefficient ($r$) measures the linear relationship between two continuous variables.

Formula 1: Definition Formula

\[r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \cdot \sum(y_i - \bar{y})^2}}\]

Formula 2: Computational Formula (Most Used)

\[r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}\]
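The two formulas are algebraically identical. A short derivation, using $\bar{x} = \sum x / n$ and $\bar{y} = \sum y / n$, shows why. Expanding the product of deviations gives:

\[\sum(x_i - \bar{x})(y_i - \bar{y}) = \sum xy - \bar{y}\sum x - \bar{x}\sum y + n\bar{x}\bar{y} = \sum xy - \frac{(\sum x)(\sum y)}{n}\]

The same expansion applied to the squared deviations gives:

\[\sum(x_i - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}, \qquad \sum(y_i - \bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n}\]

Substituting these into Formula 1 and multiplying both numerator and denominator by $n$ yields Formula 2. The computational form is convenient for hand calculation because it needs only column totals, not the deviation of every observation from its mean.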

Properties of $r$

  1. Range: $-1 \leq r \leq +1$
  2. r = +1: Perfect positive correlation
  3. r = -1: Perfect negative correlation
  4. r = 0: No linear correlation
  5. Unit-free: Same regardless of measurement units
  6. Symmetric: Correlation of X with Y equals correlation of Y with X
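The properties listed above can be checked numerically. The sketch below is illustrative only: the helper `pearson_r` is a hypothetical name implementing Formula 2, and the `hours` and `score` values are made-up data, not taken from the chapter's examples.

```python
from math import isclose, sqrt


def pearson_r(xs, ys):
    """Pearson's r via the computational formula (Formula 2)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Made-up illustrative data.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 60, 68]
r = pearson_r(hours, score)

# Range: r always lies between -1 and +1.
print(-1 <= r <= 1)                                        # True

# Unit-free: measuring training in minutes instead of hours leaves r unchanged.
minutes = [h * 60 for h in hours]
print(isclose(r, pearson_r(minutes, score)))               # True

# Symmetric: swapping the roles of X and Y leaves r unchanged.
print(isclose(r, pearson_r(score, hours)))                 # True

# Perfect correlation: points on a straight line give r = +1 or r = -1.
print(isclose(pearson_r([1, 2, 3], [2, 4, 6]), 1.0))       # True
print(isclose(pearson_r([1, 2, 3], [6, 4, 2]), -1.0))      # True
```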

Step-by-Step Example 1: Basic Calculation

Problem: Calculate the correlation between training hours (X) and performance score (Y) for 6 employees:

| Employee | Training Hours (X) | Performance (Y) |
|---|---|---|
| A | 10 | 65 |
| B | 15 | 70 |
| C | 12 | 68 |
| D | 18 | 75 |
| E | 8 | 60 |
| F | 20 | 78 |

Solution:

Step 1: Create calculation table

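| $x$ | $y$ | $xy$ | $x^2$ | $y^2$ |
|---|---|---|---|---|
| 10 | 65 | 650 | 100 | 4225 |
| 15 | 70 | 1050 | 225 | 4900 |
| 12 | 68 | 816 | 144 | 4624 |
| 18 | 75 | 1350 | 324 | 5625 |
| 8 | 60 | 480 | 64 | 3600 |
| 20 | 78 | 1560 | 400 | 6084 |
| $\sum x = 83$ | $\sum y = 416$ | $\sum xy = 5906$ | $\sum x^2 = 1257$ | $\sum y^2 = 29058$ |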

Step 2: Note the values

  • $n = 6$
  • $\sum x = 83$
  • $\sum y = 416$
  • $\sum xy = 5906$
  • $\sum x^2 = 1257$
  • $\sum y^2 = 29058$

Step 3: Calculate the numerator

\[\begin{aligned}
\text{Numerator} &= n\sum xy - (\sum x)(\sum y) \\
&= 6(5906) - (83)(416) \\
&= 35436 - 34528 = 908
\end{aligned}\]

Step 4: Calculate the denominator

\[\text{Denominator} = \sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}\]

First part: $n\sum x^2 - (\sum x)^2 = 6(1257) - (83)^2 = 7542 - 6889 = 653$

Second part: $n\sum y^2 - (\sum y)^2 = 6(29058) - (416)^2 = 174348 - 173056 = 1292$

\[\text{Denominator} = \sqrt{653 \times 1292} = \sqrt{843676} = 918.52\]

Step 5: Calculate $r$

\[r = \frac{908}{918.52} = 0.989\]

Answer: $r = 0.989$

Interpretation: There is a very strong positive correlation between training hours and performance. As training hours increase, performance scores tend to increase in a nearly linear fashion.
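The hand calculation can be cross-checked in a few lines of code. This is a minimal sketch that follows Steps 1 to 5 with the computational formula; the variable names are illustrative choices, and only the data come from the problem statement.

```python
from math import sqrt

# Data from Example 1: training hours (X) and performance score (Y).
x = [10, 15, 12, 18, 8, 20]
y = [65, 70, 68, 75, 60, 78]

# Steps 1-2: column totals from the calculation table.
n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# Step 3: numerator.
numerator = n * sum_xy - sum_x * sum_y      # 908

# Step 4: the two parts under the square root.
part_x = n * sum_x2 - sum_x ** 2            # 653
part_y = n * sum_y2 - sum_y ** 2            # 1292
denominator = sqrt(part_x * part_y)

# Step 5: the coefficient.
r = numerator / denominator
print(round(r, 3))                          # 0.989
```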


Step-by-Step Example 2: Exam-Style Problem

Problem: The following data shows the relationship between years of service (X) and job satisfaction score (Y) for 8 government employees:

| X (Years) | 2 | 4 | 6 | 8 | 10 | 12 | 14 | 16 |
|---|---|---|---|---|---|---|---|---|
| Y (Score) | 45 | 50 | 52 | 55 | 60 | 58 | 65 | 62 |

Calculate Karl Pearson’s correlation coefficient and interpret the result.

Solution:

Step 1: Set up the calculation table

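| $x$ | $y$ | $xy$ | $x^2$ | $y^2$ |
|---|---|---|---|---|
| 2 | 45 | 90 | 4 | 2025 |
| 4 | 50 | 200 | 16 | 2500 |
| 6 | 52 | 312 | 36 | 2704 |
| 8 | 55 | 440 | 64 | 3025 |
| 10 | 60 | 600 | 100 | 3600 |
| 12 | 58 | 696 | 144 | 3364 |
| 14 | 65 | 910 | 196 | 4225 |
| 16 | 62 | 992 | 256 | 3844 |
| $\sum x = 72$ | $\sum y = 447$ | $\sum xy = 4240$ | $\sum x^2 = 816$ | $\sum y^2 = 25287$ |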

Step 2: Extract values

  • $n = 8$
  • $\sum x = 72$
  • $\sum y = 447$
  • $\sum xy = 4240$
  • $\sum x^2 = 816$
  • $\sum y^2 = 25287$

Step 3: Apply the formula

\[r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}\]

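Numerator:

\[\begin{aligned}
n\sum xy - (\sum x)(\sum y) &= 8(4240) - (72)(447) \\
&= 33920 - 32184 = 1736
\end{aligned}\]

Denominator:

\[\begin{aligned}
\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]} &= \sqrt{[8(816) - (72)^2][8(25287) - (447)^2]} \\
&= \sqrt{[6528 - 5184][202296 - 199809]} \\
&= \sqrt{[1344][2487]} \\
&= \sqrt{3342528} = 1828.26
\end{aligned}\]

Final Calculation:

\[r = \frac{1736}{1828.26} = 0.950\]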

Answer: $r = 0.950$

Interpretation: There is a very strong positive correlation (r = 0.95) between years of service and job satisfaction. Employees with more years of service tend to have higher satisfaction scores.
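As a cross-check, the same data can be run through the definition formula (Formula 1), which works with deviations from the means instead of raw totals. This is a sketch with illustrative variable names; it should agree with the computational result above.

```python
from math import sqrt

# Data from Example 2: years of service (X) and job satisfaction score (Y).
years = [2, 4, 6, 8, 10, 12, 14, 16]
score = [45, 50, 52, 55, 60, 58, 65, 62]

n = len(years)
mean_x = sum(years) / n                     # 9.0
mean_y = sum(score) / n                     # 55.875

# Sums of products and squares of the deviations from the means.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, score))
s_xx = sum((x - mean_x) ** 2 for x in years)
s_yy = sum((y - mean_y) ** 2 for y in score)

r = s_xy / sqrt(s_xx * s_yy)
print(round(r, 3))                          # 0.95
```

Here `s_xy` equals 1736 / 8 and `s_xx` equals 1344 / 8, which are the bracketed quantities from the hand calculation divided by $n$; the factors of $n$ cancel in the ratio.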


Coefficient of Determination ($r^2$)

The coefficient of determination tells us what proportion of variation in Y is explained by X.

\[r^2 = (\text{correlation coefficient})^2\]

Example:

If $r = 0.95$, then: \(r^2 = (0.95)^2 = 0.9025 = 90.25\%\)

Interpretation: About 90.25% of the variation in job satisfaction scores is accounted for by their linear relationship with years of service.
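One practical detail is that squaring a rounded $r$ carries the rounding error into $r^2$. The sketch below, using the Example 2 data and illustrative variable names, computes $r^2$ directly from the unrounded quantities and compares it with the value obtained from $r = 0.95$.

```python
# Data from Example 2: years of service (X) and job satisfaction score (Y).
years = [2, 4, 6, 8, 10, 12, 14, 16]
score = [45, 50, 52, 55, 60, 58, 65, 62]

n = len(years)
sum_x, sum_y = sum(years), sum(score)
sum_xy = sum(x * y for x, y in zip(years, score))
sum_x2 = sum(x * x for x in years)
sum_y2 = sum(y * y for y in score)

numerator = n * sum_xy - sum_x * sum_y      # 1736
part_x = n * sum_x2 - sum_x ** 2            # 1344
part_y = n * sum_y2 - sum_y ** 2            # 2487

# r^2 is the squared numerator over the product under the square root,
# so no square root (and no intermediate rounding) is needed.
r_squared = numerator ** 2 / (part_x * part_y)

print(round(r_squared, 4))                  # 0.9016 (unrounded r)
print(round(0.95 ** 2, 4))                  # 0.9025 (r rounded to 0.95 first)
```

The difference is small here, but it is a reason to keep full precision in intermediate steps and round only the final answer.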


Assumptions of Pearson’s Correlation

flowchart TD
    A[Assumptions of Pearson's r] --> B[1. Linear Relationship]
    A --> C[2. Continuous Variables]
    A --> D[3. Normally Distributed]
    A --> E[4. No Significant Outliers]
    A --> F[5. Homoscedasticity]

    F --> F1["Equal variance<br/>across all X values"]

When NOT to Use Pearson’s Correlation

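| Situation | Alternative |
|---|---|
| Ordinal/ranked data | Spearman’s Rank Correlation |
| Non-linear relationship | Spearman’s Rank Correlation (if the relationship is monotonic) or a non-linear regression model |
| Significant outliers | Spearman’s Rank Correlation, or investigate the outliers and remove them only if justified |
| Categorical variables | Chi-square test of independence |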
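Two of these situations are easy to demonstrate with small made-up data sets. The sketch below is illustrative only: `pearson_r` is a hypothetical helper implementing the computational formula, and none of the numbers come from real data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r via the computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# 1. Curved relationship: Y is completely determined by X (y = x^2),
#    yet r is zero because the relationship is not linear.
x_curve = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [x * x for x in x_curve]
r_curve = pearson_r(x_curve, y_curve)
print(r_curve)                              # 0.0

# 2. Outlier: five points with a weak negative trend...
x_base = [1, 2, 3, 4, 5]
y_base = [3, 1, 2, 3, 1]
r_without = pearson_r(x_base, y_base)
print(round(r_without, 3))                  # about -0.316

# ...and the same points plus one extreme observation.
r_with = pearson_r(x_base + [20], y_base + [20])
print(round(r_with, 3))                     # about 0.965
```

A single extreme point turns a weak negative correlation into a very strong positive one, which is why a scatter plot should be examined before $r$ is reported.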

Correlation vs Causation

Critical Warning: Correlation does NOT imply causation!

flowchart TB
    A[High Correlation Found] --> B{Does X cause Y?}
    B --> C[Maybe: X → Y]
    B --> D[Maybe: Y → X]
    B --> E[Maybe: Z → X and Z → Y]
    B --> F[Maybe: Coincidence]

    C --> G[Need experimental<br/>design to prove]

Example:

Finding a high correlation between ice cream sales and drowning deaths doesn’t mean ice cream causes drowning! Both rise in hot weather, which acts as a confounding variable.
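A small simulation can make the confounding idea concrete. Everything in the sketch below is hypothetical: the temperature range, the coefficients, and the noise levels are invented for illustration and are not estimates from real data. The point is that `ice_cream` and `drownings` are each generated from `temperature` alone, with neither appearing in the other's equation, yet they end up strongly correlated.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r via the computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


rng = random.Random(42)  # fixed seed so the run is repeatable

# Z: the confounder (invented daily temperatures in degrees Celsius).
temperature = [rng.uniform(15, 35) for _ in range(500)]

# X and Y each depend on Z plus independent noise, never on each other.
ice_cream = [10 * t + rng.gauss(0, 20) for t in temperature]
drownings = [0.5 * t + rng.gauss(0, 1) for t in temperature]

r_xy = pearson_r(ice_cream, drownings)
print(f"r(ice cream, drownings) = {r_xy:.2f}")
print(f"r(temperature, ice cream) = {pearson_r(temperature, ice_cream):.2f}")
print(f"r(temperature, drownings) = {pearson_r(temperature, drownings):.2f}")
```

With these settings the first correlation comes out strongly positive even though the simulation contains no causal link between the two variables.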


Common Mistakes to Avoid

| Mistake | Correct Approach |
|---|---|
| Assuming correlation = causation | Correlation only shows association |
| Using Pearson’s r for curved relationships | Check scatter plot first |
| Ignoring outliers | Examine data for unusual values |
| Reporting r without interpretation | Always explain the practical meaning |
| Using r for ordinal data | Use Spearman’s rank correlation |

Practice Problems

Problem 1

Calculate the correlation coefficient for:

| X | 5 | 7 | 9 | 11 | 13 |
|---|---|---|---|---|---|
| Y | 12 | 16 | 18 | 22 | 24 |

Problem 2

The following data shows budget allocation (in millions) and project completion rate (%) for 7 departments:

| Budget (X) | 10 | 15 | 20 | 25 | 30 | 35 | 40 |
|---|---|---|---|---|---|---|---|
| Completion (Y) | 60 | 65 | 70 | 72 | 78 | 82 | 85 |

a) Calculate Pearson’s correlation coefficient
b) Find the coefficient of determination
c) Interpret the results

Problem 3

Why is it incorrect to conclude that “more training causes better performance” just because we found r = 0.95?


Summary

| Concept | Key Points |
|---|---|
| Correlation | Measures strength and direction of linear relationship |
| Range | $-1 \leq r \leq +1$ |
| Positive r | Both variables move in same direction |
| Negative r | Variables move in opposite directions |
| r = 0 | No linear relationship |
| $r^2$ | Proportion of variance explained |
| Limitation | Does not prove causation |

Next Chapter

In the next chapter, we will study Spearman’s Rank Correlation, a non-parametric method suitable for ordinal data or when Pearson’s assumptions are violated.