Correlation isn’t causality
- Correlation: a normalized form of covariance; an index that tells whether two variables move together in some way. For more details read: covariance.
- Causality: means that one variable directly produces a change in another. For example, if X changes, then Y changes as a result.
To claim causality, much stronger evidence is required: controlled experiments, randomized trials, strong domain knowledge, ruling out confounding variables, temporal ordering (the cause comes before the effect).
Example: if you increase the dosage of a drug in a randomized trial and patient outcomes change accordingly, that is evidence of a causal effect.
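To make the relationship between covariance and correlation concrete, here is a minimal sketch in plain Python (the `covariance` and `correlation` helpers are illustrative, written from the standard formulas, not taken from any library):

```python
import statistics as st

def covariance(xs, ys):
    """Sample covariance: average co-movement of x and y around their means."""
    mx, my = st.mean(xs), st.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance normalized by the standard deviations,
    so the result always falls in [-1, 1]."""
    return covariance(xs, ys) / (st.stdev(xs) * st.stdev(ys))

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]     # y moves perfectly with x
print(covariance(x, y))  # 5.0
print(correlation(x, y)) # ≈ 1.0 (perfect positive correlation)
```

Covariance depends on the units of the data, while correlation is unit-free, which is why it works as a general "do they move together" index.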
In general two variables can be correlated because of:
- coincidence
- a hidden third variable (confounder)
- reverse causality (Y causes X, rather than X causes Y)
- shared underlying trends
Without proper analysis, correlation alone can mislead you.
Visual example:

Y
↑
10 |                          *
 9 |                       *
 8 |                    *
 7 |                 *
 6 |              *
 5 |           *
 4 |        *
 3 |     *
 2 |
 1 +------------------------------→ X
     1  2  3  4  5  6  7  8  9

X = ice cream sales, Y = drowning incidents
Visually it seems like more ice cream sales cause more drownings. But this is NOT causal. Both increase because of a third variable: temperature.
- When it’s hot → more people buy ice cream
- When it’s hot → more people go swimming → more accidents
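This confounding effect can be reproduced in a small simulation (all the numbers below are invented for illustration): temperature drives both variables, neither variable touches the other, yet they come out strongly correlated.

```python
import random

random.seed(0)

# Hypothetical data: temperature (the confounder) drives both variables;
# neither variable causes the other.
temps = [random.uniform(10, 35) for _ in range(500)]
ice_cream_sales = [5.0 * t + random.gauss(0, 10) for t in temps]
drownings = [0.3 * t + random.gauss(0, 1) for t in temps]

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from the definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Strong correlation, zero direct causal link between the two variables:
print(pearson(ice_cream_sales, drownings))
```

Conditioning on temperature (e.g. comparing only days with the same temperature) would make this spurious correlation disappear.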
Real Machine Learning Cases
Case 1 - The Hospital Mortality Paradox
Task: predict whether a patient admitted to the ICU will survive. After training, the model discovered that patients who received a certain treatment had a higher survival rate.
So the model learned a positive correlation: treatment → higher survival.
But in real world, doctors give that treatment only to patients who are already stable and not extremely sick:
- Stable patients → get treatment → more likely to survive.
- Critical patients → cannot receive treatment → less likely to survive.
The real cause was how sick the patient already was, not the treatment.
If the model were deployed in practice, it would not improve patient survival and in some cases would be dangerous for them.
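The paradox above can be sketched with a toy simulation (the severity model and probabilities are made up for illustration): the treatment does literally nothing, yet treated patients show a much higher survival rate because severity drives both the treatment decision and the outcome.

```python
import random

random.seed(1)

# Hypothetical ICU simulation: severity drives BOTH the treatment decision
# and survival; the treatment itself has zero effect in this model.
patients = []
for _ in range(2000):
    severity = random.random()             # 0 = stable, 1 = critical
    treated = severity < 0.5               # doctors treat only stable patients
    survived = random.random() > severity  # sicker patients survive less often
    patients.append((treated, survived))

def survival_rate(group):
    return sum(survived for _, survived in group) / len(group)

treated_group = [p for p in patients if p[0]]
untreated_group = [p for p in patients if not p[0]]

# Treated patients survive far more often, but only because they were
# healthier to begin with, not because the treatment works.
print(survival_rate(treated_group), survival_rate(untreated_group))
```

A model trained on this data would happily use "received treatment" as a strong predictor of survival, exactly the trap described above.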
Case 2 - The Job-Hiring Algorithm (Amazon)
Task: predict whether a candidate should be hired.
The model was trained on a dataset of current and former employees’ résumés. Historically, Amazon had more male applicants in technical roles.
The model could then learn that being male → more likely to be hired.
The bias wasn’t explicit: gender itself wasn’t a feature, but features correlated with it (“women’s colleges”, certain keywords) were penalized.
In reality, gender doesn’t cause higher job performance.
Causal chain: biased historical hiring → biased training data → model penalizes features correlated with gender.
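This proxy-feature leakage can be shown in a toy simulation (the feature name, rates, and probabilities are all invented for illustration): gender is never given to the model, yet a correlated keyword feature inherits the historical bias.

```python
import random

random.seed(2)

# Hypothetical sketch: gender is dropped from the features, but a proxy
# (e.g. a "women's college" keyword) correlates with it and leaks the bias.
rows = []
for _ in range(2000):
    male = random.random() < 0.5
    proxy = (not male) and random.random() < 0.8      # keyword appears mostly for women
    hired = random.random() < (0.7 if male else 0.3)  # biased historical labels
    rows.append((float(proxy), float(hired)))

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from the definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

proxies = [r[0] for r in rows]
labels = [r[1] for r in rows]

# The proxy is negatively correlated with the hiring label, so a model
# trained on this data penalizes it even though gender was removed.
print(pearson(proxies, labels))
```

This is why simply deleting the protected attribute does not remove the bias: any feature correlated with it re-encodes the same causal chain.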