Learning Objectives
By the end of this chapter, you will be able to:
- Explain the concept of correlation and its types
- Calculate Karl Pearson’s correlation coefficient
- Interpret the strength and direction of correlation
- Identify limitations and assumptions of Pearson’s correlation
What is Correlation?
Correlation is a statistical measure that describes the strength and direction of a relationship between two variables.
flowchart TD
A[Correlation Analysis] --> B[Measures relationship<br/>between TWO variables]
B --> C{Type of Relationship}
C --> D[Positive<br/>Both increase together]
C --> E[Negative<br/>One increases, other decreases]
C --> F[Zero<br/>No relationship]
Key Terminology
| Term | Definition | Example |
|---|---|---|
| Independent Variable (X) | The variable we think causes change | Training hours |
| Dependent Variable (Y) | The variable we think is affected | Performance score |
| Bivariate Data | Data with paired observations | (Training, Performance) for each employee |
Types of Correlation
By Direction
flowchart LR
subgraph "Positive Correlation (r > 0)"
A1["As X increases"] --> A2["Y also increases"]
end
subgraph "Negative Correlation (r < 0)"
B1["As X increases"] --> B2["Y decreases"]
end
subgraph "Zero Correlation (r = 0)"
C1["X changes"] --> C2["Y shows no pattern"]
end
Examples in Public Administration:
| Type | X Variable | Y Variable | Expected r |
|---|---|---|---|
| Positive | Training hours | Job performance | Positive |
| Positive | Education level | Salary | Positive |
| Negative | Distance from office | Punctuality | Negative |
| Negative | Bureaucratic delays | Citizen satisfaction | Negative |
| Zero | Shoe size | Job performance | Zero |
By Strength
| Correlation Coefficient ($r$) | Interpretation |
|---|---|
| 0.00 to ±0.19 | Very weak or negligible |
| ±0.20 to ±0.39 | Weak |
| ±0.40 to ±0.59 | Moderate |
| ±0.60 to ±0.79 | Strong |
| ±0.80 to ±1.00 | Very strong |
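The bands above can be turned into a small lookup, sketched here in Python. The function name `describe_correlation` is hypothetical, and the handling of values that fall between the printed bands (for example 0.195) is an assumption: each band is treated as running up to the start of the next one.

```python
def describe_correlation(r: float) -> str:
    """Return a verbal label for r using the strength bands in the table above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    if r == 0:
        return "no linear correlation"

    size = abs(r)
    # Assumption: each band extends up to the lower bound of the next band.
    if size < 0.20:
        strength = "very weak or negligible"
    elif size < 0.40:
        strength = "weak"
    elif size < 0.60:
        strength = "moderate"
    elif size < 0.80:
        strength = "strong"
    else:
        strength = "very strong"

    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


print(describe_correlation(0.72))   # strong positive
print(describe_correlation(-0.45))  # moderate negative
```

Cutoffs like these are conventions rather than fixed rules, so the labels are best read as a rough guide.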
Karl Pearson’s Correlation Coefficient
The Pearson correlation coefficient ($r$) measures the linear relationship between two continuous variables.
Formula 1: Definition Formula
\[r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \cdot \sum(y_i - \bar{y})^2}}\]
Formula 2: Computational Formula (Most Used)
\[r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}\]
Properties of $r$
- Range: $-1 \leq r \leq +1$
- r = +1: Perfect positive correlation
- r = -1: Perfect negative correlation
- r = 0: No linear correlation
- Unit-free: Same regardless of measurement units
- Symmetric: Correlation of X with Y equals correlation of Y with X
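The properties listed above can be checked numerically. The sketch below implements the definition formula in a helper named `pearson_r` (a name chosen here for illustration) and applies it to a small made-up data set; the numbers are hypothetical and serve only to exercise the properties.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from the definition (deviation) formula."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Note: if either variable is constant, sxx or syy is 0 and r is undefined.
    return sxy / math.sqrt(sxx * syy)


hours = [1, 2, 3, 4, 5]          # hypothetical training hours
scores = [52, 55, 61, 60, 68]    # hypothetical performance scores

r_xy = pearson_r(hours, scores)
r_yx = pearson_r(scores, hours)                       # symmetric: same value
r_minutes = pearson_r([h * 60 for h in hours], scores)  # unit-free: hours -> minutes
r_flipped = pearson_r(hours, [-s for s in scores])    # reversing Y flips the sign

print(r_xy, r_yx, r_minutes, r_flipped)
```

Swapping the variables or changing the unit of X leaves $r$ unchanged (up to floating-point rounding), while reversing the direction of one variable only changes the sign.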
Step-by-Step Example 1: Basic Calculation
Problem: Calculate the correlation between training hours (X) and performance score (Y) for 6 employees:
| Employee | Training Hours (X) | Performance (Y) |
|---|---|---|
| A | 10 | 65 |
| B | 15 | 70 |
| C | 12 | 68 |
| D | 18 | 75 |
| E | 8 | 60 |
| F | 20 | 78 |
Solution:
Step 1: Create calculation table
| $x$ | $y$ | $xy$ | $x^2$ | $y^2$ |
|---|---|---|---|---|
| 10 | 65 | 650 | 100 | 4225 |
| 15 | 70 | 1050 | 225 | 4900 |
| 12 | 68 | 816 | 144 | 4624 |
| 18 | 75 | 1350 | 324 | 5625 |
| 8 | 60 | 480 | 64 | 3600 |
| 20 | 78 | 1560 | 400 | 6084 |
| $\sum x = 83$ | $\sum y = 416$ | $\sum xy = 5906$ | $\sum x^2 = 1257$ | $\sum y^2 = 29058$ |
Step 2: Note the values
- $n = 6$
- $\sum x = 83$
- $\sum y = 416$
- $\sum xy = 5906$
- $\sum x^2 = 1257$
- $\sum y^2 = 29058$
Step 3: Calculate the numerator
\[\text{Numerator} = n\sum xy - (\sum x)(\sum y) = 6(5906) - (83)(416) = 35436 - 34528 = 908\]
Step 4: Calculate the denominator
\[\text{Denominator} = \sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}\]
First part: $n\sum x^2 - (\sum x)^2 = 6(1257) - (83)^2 = 7542 - 6889 = 653$
Second part: $n\sum y^2 - (\sum y)^2 = 6(29058) - (416)^2 = 174348 - 173056 = 1292$
\[\text{Denominator} = \sqrt{653 \times 1292} = \sqrt{843676} = 918.52\]
Step 5: Calculate $r$
\[r = \frac{908}{918.52} = 0.989\]
Answer: $r = 0.989$
Interpretation: There is a very strong positive correlation between training hours and performance. In this sample, employees with more training hours tend to have higher performance scores.
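The hand calculation in Example 1 can be reproduced with a few lines of Python. This is a minimal sketch that follows the computational formula step by step; the variable names are chosen here for readability and are not part of the example.

```python
import math

training_hours = [10, 15, 12, 18, 8, 20]   # X, from the table in Example 1
performance = [65, 70, 68, 75, 60, 78]     # Y, from the table in Example 1

# Steps 1 and 2: the column totals of the calculation table
n = len(training_hours)
sum_x = sum(training_hours)
sum_y = sum(performance)
sum_xy = sum(x * y for x, y in zip(training_hours, performance))
sum_x2 = sum(x * x for x in training_hours)
sum_y2 = sum(y * y for y in performance)

# Step 3: numerator of the computational formula
numerator = n * sum_xy - sum_x * sum_y

# Step 4: denominator of the computational formula
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

# Step 5: the correlation coefficient
r = numerator / denominator

print(f"numerator = {numerator}")   # numerator = 908
print(f"r = {r:.3f}")               # r = 0.989
```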
Step-by-Step Example 2: Exam-Style Problem
Problem: The following data shows the relationship between years of service (X) and job satisfaction score (Y) for 8 government employees:
| X (Years) | 2 | 4 | 6 | 8 | 10 | 12 | 14 | 16 |
|---|---|---|---|---|---|---|---|---|
| Y (Score) | 45 | 50 | 52 | 55 | 60 | 58 | 65 | 62 |
Calculate Karl Pearson’s correlation coefficient and interpret the result.
Solution:
Step 1: Set up the calculation table
| $x$ | $y$ | $xy$ | $x^2$ | $y^2$ |
|---|---|---|---|---|
| 2 | 45 | 90 | 4 | 2025 |
| 4 | 50 | 200 | 16 | 2500 |
| 6 | 52 | 312 | 36 | 2704 |
| 8 | 55 | 440 | 64 | 3025 |
| 10 | 60 | 600 | 100 | 3600 |
| 12 | 58 | 696 | 144 | 3364 |
| 14 | 65 | 910 | 196 | 4225 |
| 16 | 62 | 992 | 256 | 3844 |
| $\sum x = 72$ | $\sum y = 447$ | $\sum xy = 4240$ | $\sum x^2 = 816$ | $\sum y^2 = 25287$ |
Step 2: Extract values
- $n = 8$
- $\sum x = 72$
- $\sum y = 447$
- $\sum xy = 4240$
- $\sum x^2 = 816$
- $\sum y^2 = 25287$
Step 3: Apply the formula
\[r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}\]
Numerator:
\[n\sum xy - (\sum x)(\sum y) = 8(4240) - (72)(447) = 33920 - 32184 = 1736\]
Denominator:
\[\sqrt{[8(816) - (72)^2][8(25287) - (447)^2]} = \sqrt{[6528 - 5184][202296 - 199809]} = \sqrt{[1344][2487]} = \sqrt{3342528} = 1828.26\]
Final Calculation:
\[r = \frac{1736}{1828.26} = 0.950\]
Answer: $r = 0.950$
Interpretation: There is a very strong positive correlation (r = 0.95) between years of service and job satisfaction. Employees with more years of service tend to have higher satisfaction scores.
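As a cross-check, the definition formula (Formula 1) can be applied to the same data. The working below uses the standard identities $\sum(x_i - \bar{x})(y_i - \bar{y}) = \sum xy - n\bar{x}\bar{y}$ and $\sum(x_i - \bar{x})^2 = \sum x^2 - n\bar{x}^2$, which are not stated in the chapter but follow directly from expanding the sums.

\[\bar{x} = \frac{72}{8} = 9, \qquad \bar{y} = \frac{447}{8} = 55.875\]
\[\sum(x_i - \bar{x})(y_i - \bar{y}) = 4240 - 8(9)(55.875) = 4240 - 4023 = 217\]
\[\sum(x_i - \bar{x})^2 = 816 - 8(9)^2 = 816 - 648 = 168\]
\[\sum(y_i - \bar{y})^2 = 25287 - 8(55.875)^2 = 25287 - 24976.125 = 310.875\]
\[r = \frac{217}{\sqrt{168 \times 310.875}} = \frac{217}{\sqrt{52227}} \approx \frac{217}{228.53} \approx 0.950\]

Both formulas give the same value, as expected: each term here is the corresponding term of the computational formula divided by $n = 8$.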
Coefficient of Determination ($r^2$)
The coefficient of determination tells us what proportion of the variation in Y is explained by its linear relationship with X.
\[r^2 = (\text{correlation coefficient})^2\]
Example:
If $r = 0.95$, then:
\[r^2 = (0.95)^2 = 0.9025 = 90.25\%\]
Interpretation: 90.25% of the variation in job satisfaction can be explained by years of service.
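One way to see what "explained" means is to fit a least-squares straight line to the data and compare the leftover (residual) variation with the total variation in Y. The sketch below does this for the Example 2 data; reading "explained" as "explained by a least-squares line" is the assumption here, and the variable names are illustrative.

```python
years = [2, 4, 6, 8, 10, 12, 14, 16]            # X, from Example 2
satisfaction = [45, 50, 52, 55, 60, 58, 65, 62]  # Y, from Example 2

n = len(years)
mean_x = sum(years) / n
mean_y = sum(satisfaction) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, satisfaction))
sxx = sum((x - mean_x) ** 2 for x in years)
syy = sum((y - mean_y) ** 2 for y in satisfaction)  # total variation in Y

r = sxy / (sxx * syy) ** 0.5

# Least-squares line y = intercept + slope * x
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Variation in Y left over after using the line to predict it
ss_residual = sum((y - (intercept + slope * x)) ** 2
                  for x, y in zip(years, satisfaction))
explained_share = 1 - ss_residual / syy

print(f"r^2             = {r ** 2:.4f}")          # r^2             = 0.9016
print(f"explained share = {explained_share:.4f}")  # explained share = 0.9016
```

The two numbers agree. They come out at about 0.9016 rather than 0.9025 because the code squares the unrounded $r$, whereas the example squares $r$ rounded to 0.95.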
Assumptions of Pearson’s Correlation
flowchart TD
A[Assumptions of Pearson's r] --> B[1. Linear Relationship]
A --> C[2. Continuous Variables]
A --> D[3. Normally Distributed]
A --> E[4. No Significant Outliers]
A --> F[5. Homoscedasticity]
F --> F1["Equal variance<br/>across all X values"]
When NOT to Use Pearson’s Correlation
| Situation | Alternative |
|---|---|
| Ordinal/ranked data | Spearman’s Rank Correlation |
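| Non-linear relationship | Spearman’s (if the relationship is monotonic) or non-linear (e.g., polynomial) regression |
| Significant outliers | Spearman’s, or investigate the outliers before deciding whether to exclude them |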
| Categorical variables | Chi-square test |
Correlation vs Causation
Critical Warning: Correlation does NOT imply causation!
flowchart TB
A[High Correlation Found] --> B{Does X cause Y?}
B --> C[Maybe: X → Y]
B --> D[Maybe: Y → X]
B --> E[Maybe: Z → X and Z → Y]
B --> F[Maybe: Coincidence]
C --> G[Need experimental<br/>design to prove]
Example:
Finding a high correlation between ice cream sales and drowning deaths doesn’t mean ice cream causes drowning! Both are caused by a confounding variable - hot weather.
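A small simulation can make the confounding pattern concrete. In the sketch below, every number is invented: temperature drives both ice cream sales and drownings, and neither outcome is given any direct effect on the other. The coefficients, noise levels, and sample size are hypothetical choices, not real data.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson's r via the definition formula."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical daily temperatures (the confounder Z), in degrees Celsius
temperature = [random.uniform(15, 35) for _ in range(200)]

# Both outcomes depend on temperature plus independent noise;
# neither depends on the other. All coefficients are invented.
ice_cream_sales = [20 * t + random.gauss(0, 50) for t in temperature]
drownings = [0.3 * t + random.gauss(0, 1) for t in temperature]

r_raw = pearson_r(ice_cream_sales, drownings)

# Strip out the temperature-driven part of each series. This is only
# possible here because we built the data and know the coefficients.
sales_noise = [s - 20 * t for s, t in zip(ice_cream_sales, temperature)]
drowning_noise = [d - 0.3 * t for d, t in zip(drownings, temperature)]
r_adjusted = pearson_r(sales_noise, drowning_noise)

print(f"r(sales, drownings)             = {r_raw:.2f}")
print(f"r after removing temperature    = {r_adjusted:.2f}")
```

The raw correlation is high even though the simulation contains no causal link between the two outcomes, and it largely disappears once the shared influence of temperature is removed.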
Common Mistakes to Avoid
| Mistake | Correct Approach |
|---|---|
| Assuming correlation = causation | Correlation only shows association |
| Using Pearson’s r for curved relationships | Check scatter plot first |
| Ignoring outliers | Examine data for unusual values |
| Reporting r without interpretation | Always explain the practical meaning |
| Using r for ordinal data | Use Spearman’s rank correlation |
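Two of these mistakes are easy to demonstrate with toy numbers. The sketch below uses made-up data: first a perfectly predictable but curved relationship, then a small scattered sample with and without a single extreme point. The helper name `pearson_r` and all values are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r via the definition formula."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Mistake: using r for a curved relationship.
# Y is completely determined by X (y = x^2), yet r is zero
# because the relationship is not linear.
x_curve = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [x ** 2 for x in x_curve]
r_curve = pearson_r(x_curve, y_curve)

# Mistake: ignoring outliers.
# Six scattered points (hypothetical) show only a weak correlation ...
x_base = [1, 2, 3, 4, 5, 6]
y_base = [2, 1, 3, 2, 3, 2]
r_base = pearson_r(x_base, y_base)

# ... but one extreme observation pushes r into the "very strong" band.
r_with_outlier = pearson_r(x_base + [20], y_base + [20])

print(f"curved data:      r = {r_curve:.2f}")
print(f"without outlier:  r = {r_base:.2f}")
print(f"with outlier:     r = {r_with_outlier:.2f}")
```

In both cases a scatter plot would reveal the problem immediately, which is why plotting the data before computing $r$ is worthwhile.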
Practice Problems
Problem 1
Calculate the correlation coefficient for:
| X | 5 | 7 | 9 | 11 | 13 |
|---|---|---|---|---|---|
| Y | 12 | 16 | 18 | 22 | 24 |
Problem 2
The following data shows budget allocation (in millions) and project completion rate (%) for 7 departments:
| Budget (X) | 10 | 15 | 20 | 25 | 30 | 35 | 40 |
|---|---|---|---|---|---|---|---|
| Completion (Y) | 60 | 65 | 70 | 72 | 78 | 82 | 85 |
a) Calculate Pearson’s correlation coefficient
b) Find the coefficient of determination
c) Interpret the results
Problem 3
Why is it incorrect to conclude that “more training causes better performance” just because we found r = 0.95?
Summary
| Concept | Key Points |
|---|---|
| Correlation | Measures strength and direction of linear relationship |
| Range | $-1 \leq r \leq +1$ |
| Positive r | Both variables move in same direction |
| Negative r | Variables move in opposite directions |
| r = 0 | No linear relationship |
| $r^2$ | Proportion of variance explained |
| Limitation | Does not prove causation |
Next Chapter
In the next chapter, we will study Spearman’s Rank Correlation - a non-parametric method suitable for ordinal data or when Pearson’s assumptions are violated.

