Correlation vs. Causation
Many have heard the phrase:
Correlation is not causation!
This battle cry is often used in arguments about highly emotional topics where neither side is willing to change its mind, e.g., social science, public policy, etc. Let’s start with a definition for correlation. Merriam-Webster defines it as:
The distinction between correlation and causation matters because we’re trying to figure out what to change to achieve a different outcome.
Epistemology is concerned with the limits of knowledge.
What’s crucial to understand about correlation is that it’s fundamentally an Epistemic Opacity problem—meaning that the answer is there somewhere, but at any given time we might be unable to see it. This is especially true when there are thousands or millions of variables and there hasn’t been enough time or proper study to sift through them.
Said differently, if we knew about, and could see, all the variables that feed into an outcome, we wouldn’t be forced to simply observe correlations.
Correlations are what we have when we cannot see under the covers.
The less information you have, the more you are forced to observe correlations and come up with theories. And the more knowledge you have—the more transparent things are—the more you can see actual causal relationships.
Examples
Someone might notice that when people buy ice cream they also buy sunglasses. Someone else might notice that when it’s sunny outside people buy sunglasses. Sunglasses and ice cream are correlated, but sunny days are actually causing both.
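The ice cream and sunglasses example can be sketched as a small simulation. This is only an illustration under made-up assumptions: the purchase probabilities, the function names (`simulate_days`, `pearson`), and the day count are all hypothetical, and the model assumes that sunshine is the only thing driving either purchase.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_days(n_days=20000, seed=0):
    """Hypothetical model: weather drives both purchases; neither drives the other."""
    rng = random.Random(seed)
    days = []
    for _ in range(n_days):
        sunny = rng.random() < 0.5
        # Made-up purchase probability that depends only on the weather.
        p_buy = 0.7 if sunny else 0.1
        ice_cream = int(rng.random() < p_buy)
        sunglasses = int(rng.random() < p_buy)
        days.append((sunny, ice_cream, sunglasses))
    return days


days = simulate_days()
sunny_days = [d for d in days if d[0]]
cloudy_days = [d for d in days if not d[0]]

# Correlation across all days, ignoring the weather.
overall = pearson([d[1] for d in days], [d[2] for d in days])
# Correlation once the weather is held fixed.
within_sunny = pearson([d[1] for d in sunny_days], [d[2] for d in sunny_days])
within_cloudy = pearson([d[1] for d in cloudy_days], [d[2] for d in cloudy_days])

print(f"all days:    {overall:.2f}")
print(f"sunny only:  {within_sunny:.2f}")
print(f"cloudy only: {within_cloudy:.2f}")
```

Across all days the two purchases are clearly correlated, but within sunny days alone, or cloudy days alone, the correlation is close to zero. Seeing the hidden variable makes the apparent link between ice cream and sunglasses disappear.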
In the past people noticed that sick people often smelled bad, and they assumed that bad smells caused disease. What they didn’t realize was that germs caused disease, and disease caused bad smells. So disease and smells were correlated, but smells were not the cause of disease; the causal arrow ran the other way.
Moving from correlation to causation
Keep in mind that everything is actually caused. Everything has one or more variables that lead to the outcome. The question is only how much data we can gather about those variables and their interactions, and whether it’s enough to move from correlations to statements about causation.
In general there are two main ways to move from correlation to causation:
Extensive visibility into, and study of, the variables involved—usually over long periods of time.
Tightly controlled scientific studies that properly control all variables except the ones being tested.
In a complex system, and without one of the two scenarios above, we are usually Epistemologically Shielded from causal truth. If we cannot either see and track the variables, or tightly control them via experiment, the thoughtful and cautious are limited to noticing correlations and exploring possible causal relationships.
Unless you’re in marketing or journalism.
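The second route above, a controlled experiment, can be illustrated with a rough simulation. Everything here is an assumption made for the sketch: the hidden variable, the effect size `TRUE_EFFECT`, and the function `difference_in_means` are hypothetical and stand in for whatever system is actually being studied.

```python
import random

TRUE_EFFECT = 1.0  # assumed causal effect of the treatment on the outcome


def mean(values):
    return sum(values) / len(values)


def difference_in_means(n, randomized, seed):
    """Return mean(outcome | treated) - mean(outcome | untreated)."""
    rng = random.Random(seed)
    treated_outcomes, control_outcomes = [], []
    for _ in range(n):
        hidden = rng.gauss(0, 1)  # a variable the observer cannot see
        if randomized:
            # Experiment: a coin flip decides treatment, independent of `hidden`.
            treated = rng.random() < 0.5
        else:
            # Observation: the hidden variable also influences who gets treated.
            treated = hidden + rng.gauss(0, 1) > 0
        # The outcome depends on the treatment and on the hidden variable.
        outcome = TRUE_EFFECT * treated + 2.0 * hidden + rng.gauss(0, 1)
        (treated_outcomes if treated else control_outcomes).append(outcome)
    return mean(treated_outcomes) - mean(control_outcomes)


observational = difference_in_means(20000, randomized=False, seed=1)
experimental = difference_in_means(20000, randomized=True, seed=1)

print(f"observed difference:   {observational:.2f}")
print(f"randomized difference: {experimental:.2f}")
print(f"true effect:           {TRUE_EFFECT:.2f}")
```

In the observational run the hidden variable inflates the apparent effect well beyond its true size. In the randomized run, treatment is assigned independently of the hidden variable, so the difference in means lands close to the true effect even though that variable is never measured.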
Summary
Everything has one or more causes; the question is how much information about the variables is available to us.
Generally, if we don’t have deep knowledge of the variables (usually over long periods of study), or we can’t perform well-structured experiments, it’s very hard to isolate causes from within the multiple options.
The amount of Epistemic Transparency, or Variable Visibility, into the system being scrutinized is what determines whether conversations about causation are possible or not. If it’s largely a black box, you’re going to be left with correlations and theorizing.
Be cautious of arguments that dismiss correlations as worthless: correlations are often valuable, and any one of them could turn out to be the causal link.
Most importantly, be cautious of people who see correlations and assume they have causation.
Notes
These are some early, well-respected criteria for moving from correlation to causation. Link
For a great book on this, I recommend Naked Statistics, by Charles Wheelan. Link