The 2 Sigma Problem and the AI Tutor Dream
June 3, 2025
By Vassili Philippov
The AI tutor dream is built, in part, on a 1984 study by Benjamin Bloom, “The 2 Sigma Problem.” 5,000+ citations! It’s based on a Randomised Controlled Trial (RCT), so it must be legit. Right? Well, it’s actually not.
My worldview was strongly influenced by this study, as it has been for many others in EdTech today (see Sal Khan’s famous TED Talk, for example). So it was a shock to discover the study’s unpleasant details when I started digging deeper.
Bloom’s study is like a magic trick. Here’s the setup:
- Take 30 students. Teach half normally. Tutor the other half one-on-one.
- Test them on a single, obscure topic (like probability).
- Boom! The tutored kids, on average, beat 98% of the class: a difference of two standard deviations (the “2 sigma” effect), which is enormous.
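As a quick sanity check on where that 98% figure comes from: if test scores are roughly normally distributed, the standard normal CDF converts an effect size in sigmas into a percentile. Here is a minimal sketch in Python using only the standard library (the helper name `effect_size_to_percentile` is mine, for illustration, not anything from Bloom’s paper):

```python
import math

def effect_size_to_percentile(d: float) -> float:
    """Share of the control-group distribution that the average treated
    student outscores, assuming normally distributed scores with equal
    variance: the standard normal CDF evaluated at the effect size d."""
    return 0.5 * (1.0 + math.erf(d / math.sqrt(2.0)))

# Bloom's reported effect of ~2 standard deviations:
print(f"{effect_size_to_percentile(2.0):.1%}")  # 97.7% -- roughly "98% of the class"
```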
Paul von Hippel did a great job investigating how the research was conducted and why it’s broken, and I highly recommend his article. However, as someone looking to raise the standard of RCT practice for AI tutoring, I’d like to summarise my own thoughts here, so we can avoid these basic mistakes when designing RCTs.
Problem 1: The Three-Week Miracle
Imagine claiming you’ve cured obesity… after extrapolating from the progress of a three-week diet.
That’s Bloom’s study. The experiment lasted only three weeks.
The big question is: how transferable are these results to real life? Would we see the same effect if the intervention lasted a year? Or is this effect specific to such a short sprint and would fade out over time?
There’s plenty of research showing that interventions that work in the short term often fail to deliver the same impact in the long term. In short, the first problem is simply that the study lasted only three weeks.
Problem 2: The Custom Test Trap
Bloom’s team did something sneaky: they invented their own tests.
Say my aim is to teach you algebra, but what I actually do is teach you to multiply three-digit numbers. Then I test you on… multiplying three-digit numbers. You’ll ace it! But how does that relate to my bigger aim of teaching you algebra?
That’s Bloom’s trick. His team chose a particular topic that nobody knew beforehand and created custom tests on that topic, then showed improvement on those same custom tests. But how relevant are these results to real life?
In reality, students need to master a wide range of topics and pass comprehensive exams, such as a complete math exam or exams in multiple subjects. There are serious questions about whether such narrowly focused, custom testing reflects real educational outcomes.
We already know the answer from other research. For example, Cohen, Kulik, and Kulik’s 1982 meta-analysis found that tutoring effects averaged 0.84 standard deviations when measured with narrow, custom-made tests, but only 0.27 standard deviations when measured with broader, standardised tests. It’s always easier to show significant improvements on custom tests tailored to the intervention, but it’s much harder to achieve tangible improvements across a broader curriculum.
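To see how large that gap is in practical terms, the same normal-CDF conversion as above translates both effect sizes into percentiles (again just a sketch, assuming roughly normal score distributions):

```python
import math

def percentile(d: float) -> float:
    """Standard normal CDF at d: the share of the comparison group that
    the average tutored student outscores, for an effect size of d sigma."""
    return 0.5 * (1.0 + math.erf(d / math.sqrt(2.0)))

print(f"Narrow custom tests (d = 0.84): {percentile(0.84):.0%}")   # about the 80th percentile
print(f"Broad standard tests (d = 0.27): {percentile(0.27):.0%}")  # about the 61st percentile
```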
Problem 3: Methodological Imbalance
Bloom’s study wasn’t just tutoring vs. teaching. It was tutoring-with-extras vs. teaching-as-usual.
Here’s what the tutored kids got extra:
- Non-stop feedback: Fail a quiz? Try again and again. Regular kids got just one shot.
- Super-teachers: The tutors received extra training. The regular classroom teachers did not.
Roughly half of the “2 sigma” effect came from these extras. Imagine giving regular teachers the same tools. Would tutoring still shine?
So, where does this leave us? Let’s not kid ourselves: Bloom’s study is like a shiny coin at the bottom of a murky pond. Tempting, sure. But can you really grab it? There are three big cracks in the story: the study was short, the testing was narrow, and the playing field wasn’t even. That means we can’t just take its “2 sigma miracle” and slap it onto real AI-driven classrooms, no matter how much we want to believe in educational magic.
Tutoring does help, and it helps a lot! But not “rocket-to-the-moon” a lot. Most serious studies land in the 0.4 to 0.9 sigma range. That’s solid, but it’s not the two-sigma jackpot Bloom dangled in front of us.
It’s important to understand that these criticisms do not mean the results are necessarily wrong. They simply mean that, based on this study alone, we cannot be confident that the results are correct. It’s like flipping a coin, covering it before you see the result, and then guessing heads or tails based on reading tea leaves. The method is unreliable, so you can’t trust the outcome—not because it’s definitely wrong, but because the process doesn’t allow for a trustworthy conclusion.
So next time you hear someone shout “2 sigma!” before rushing into building another AI tutor, ask yourself: is it science, or just a really good story?