What six decades of learning research says about visual, interactive AI — and the design choices we made because of it.

In the fall of 2023, a team of Harvard physicists ran a quiet experiment in their largest physics course. Half the students learned two new topics the way Harvard teaches at its best — in an active-learning classroom with peer instruction and expert facilitators, the format that decades of research had already crowned the gold standard. The other half learned the same material at home, alone, from a custom AI tutor.

The students with the AI tutor learned more than twice as much, in less time, and reported feeling more engaged while doing it.

That result, published in Scientific Reports in 2025 by Kestin and colleagues, is the single most striking piece of evidence yet that a well-designed AI tutor can outperform even strong human-led instruction. But the word doing the work in that sentence is "well-designed." In the same eighteen months, another large field experiment gave AI tutors to nearly a thousand high-school math students and found that those who used the answer-giving version were worse at math once the tool was taken away.

Same technology. Opposite outcomes. The difference was entirely in the design.

This is the gap that most "AI in education" conversations skip over, and it's the gap Mika was built to live in. This piece lays out the research we built on — not as proof that Mika works, but as the blueprint we followed, and the reason our product looks the way it does instead of looking like a chat box with a textbook bolted on.

Part 1: The oldest, most boring finding in learning science

Before we get to AI, it's worth grounding in something that has been replicated so many times it's essentially settled: people learn better from words and pictures together than from words alone.

Richard Mayer has spent the decades since the 1970s establishing this, and his multimedia principle and the Cognitive Theory of Multimedia Learning behind it are among the most tested ideas in educational psychology. A 2025 meta-analysis of Mayer's research program (Cromley & Chen, Educational Research Review) pooled 591 separate effects from 181 studies and confirmed large, consistent advantages for material that pairs text with diagrams, across factual recall, inference, and transfer to new problems.

Why does this work? Two converging theories explain it. Allan Paivio's Dual Coding Theory holds that the mind processes verbal and visual information through two distinct but linked channels; encode an idea in both and you leave two retrieval paths instead of one. John Sweller's Cognitive Load Theory adds the mechanism: working memory is painfully limited, and a well-chosen visual offloads work from the verbal channel onto the visual one, freeing up capacity for actual understanding instead of mental bookkeeping.

The practical upshot is almost embarrassingly simple. When a student is wrestling with a parametric curve, the worst thing you can do is hand them a paragraph of prose. The best thing you can do is show them the curve and explain it — together, aligned in space and time.

So Mika doesn't describe a polar rose curve. It draws one. It doesn't tell you a saddle surface "curves up in one direction and down in another." It renders the surface and lets you orbit around it until the geometry is obvious.

Polar rose curve and 3-D saddle surface generated by Mika

Words alone ask a student to imagine r = 2 sin(3θ) or z = x² − y². Mika draws both — turning an abstract equation into something the eye can grasp directly.
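To make those two equations concrete, here is a minimal sketch of the sampling step that sits behind any plot of them. It is an illustration only, not Mika's rendering code; the function names, the sample count, and the choice to return raw coordinates instead of drawing are all our assumptions for the example.

```python
import math

def polar_rose(a=2.0, k=3, samples=360):
    """Sample (x, y) points on the polar curve r = a * sin(k * theta).

    For odd k the whole curve is traced once as theta runs over [0, pi).
    """
    points = []
    for i in range(samples):
        theta = math.pi * i / samples
        r = a * math.sin(k * theta)
        # Convert from polar (r, theta) to Cartesian (x, y).
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points

def saddle(x, y):
    """Height of the saddle surface z = x^2 - y^2 at the point (x, y)."""
    return x ** 2 - y ** 2

rose = polar_rose()
# The petal tips sit at distance a = 2 from the origin.
max_radius = max(math.hypot(x, y) for x, y in rose)

# Walking away from the origin along x climbs; walking along y descends.
along_x = [saddle(step / 2, 0.0) for step in range(-2, 3)]
along_y = [saddle(0.0, step / 2) for step in range(-2, 3)]

print("largest petal radius:", round(max_radius, 6))
print("heights along the x-axis:", along_x)
print("heights along the y-axis:", along_y)
```

The two height lists are the numeric version of "curves up in one direction and down in another": the same surface rises along one axis and falls along the other, which is far easier to see in a rendered surface than in a list of numbers.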

Part 2: Static pictures are good. Things you can move are better.

Here the research gets more interesting — and more conditional.

If a picture beats text, does a moving picture beat a static one? The honest answer from the literature is: sometimes, and only under specific conditions. Höffler and Leutner's meta-analysis of animation studies found animations outperformed static images by a moderate margin overall (d ≈ 0.37), with much larger effects when the animation was representational — when the motion carried information the still image couldn't — and largest of all for procedural knowledge. But Tversky and colleagues issued a famous warning: animation that's merely decorative, or that floods working memory with fleeting detail, buys you nothing. Motion for motion's sake can even hurt.

The thing that consistently does help is interactivity — letting the learner control the pace, change a parameter, and watch the consequence. This is where two more robust findings converge:

The generation effect (Slamecka & Graf, 1978, and fifty years of replication): material a learner actively generates is remembered better than material they passively read.
The ICAP framework (Chi & Wylie, 2014): learning yield rises predictably as engagement moves from Passive → Active → Constructive → Interactive. The deepest learning happens when students manipulate and explain, not when they watch.

This is the single most important design principle in Mika, and it's why our visuals aren't illustrations — they're instruments. When you drag the partition count on a Riemann sum and watch the rectangles converge toward the true area, you're not reading about the limit definition of the integral; you're generating it. When you slide the number of terms in a Taylor approximation and watch the curve creep outward to hug the sine wave, you're discovering the relationship yourself. That act of manipulation is the mechanism the research rewards.

Interactive Riemann sum widget showing rectangles converging on the true area

Drag n higher and watch the rectangles converge on the true area. The student isn't reading about the integral — they're generating it, which is exactly what the generation effect and the ICAP framework predict will stick.
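The convergence a student sees when dragging n can be sketched in a few lines. This is a hedged illustration, not the widget's actual code: the left-endpoint rule, the test function x² on [0, 1], and the name riemann_sum are assumptions chosen because the exact area (1/3) is easy to check by hand.

```python
def riemann_sum(f, a, b, n):
    """Left-endpoint Riemann sum of f on [a, b] using n equal-width rectangles."""
    width = (b - a) / n
    heights = (f(a + i * width) for i in range(n))
    return sum(heights) * width

def square(x):
    return x * x

exact_area = 1.0 / 3.0  # integral of x^2 from 0 to 1
partition_counts = (4, 16, 64, 256)
errors = [abs(riemann_sum(square, 0.0, 1.0, n) - exact_area)
          for n in partition_counts]

for n, err in zip(partition_counts, errors):
    print(f"n = {n:4d}  gap to true area = {err:.5f}")
```

Each increase in n shrinks the gap between the rectangles and the true area, which is the limit definition of the integral expressed as something a learner can produce by moving a slider.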

Interactive Taylor series widget approximating sin x with successive terms

Adding terms one at a time, the Taylor polynomial visibly stretches out to hug sin x. Interactivity, not animation for its own sake, is what the evidence rewards.
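The same idea in numbers: a small sketch of the Maclaurin (Taylor about 0) polynomial for sin x, with the term count as the "slider." The function name taylor_sin and the evaluation point x = 2 are our assumptions for illustration; this is not Mika's widget code.

```python
import math

def taylor_sin(x, terms):
    """Sum the first `terms` nonzero terms of the Maclaurin series for sin x:
    x - x^3/3! + x^5/5! - ...
    """
    total = 0.0
    for k in range(terms):
        power = 2 * k + 1
        total += (-1) ** k * x ** power / math.factorial(power)
    return total

x = 2.0
errors = [abs(taylor_sin(x, terms) - math.sin(x)) for terms in range(1, 7)]

for terms, err in enumerate(errors, start=1):
    print(f"{terms} term(s): gap to sin({x}) = {err:.7f}")
```

At a fixed x, every added term narrows the gap; on a plot, that shows up as the polynomial hugging the sine wave over a wider and wider interval.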

It's also, not coincidentally, supported by the longest-standing evidence in mathematics-software research specifically. A meta-analysis of GeoGebra studies (Juandi et al., 2021, Heliyon) covering 2,111 students found a large overall effect (≈ 0.96 standard deviations) of dynamic mathematics software on student ability — with the strongest gains on understanding representations of functions, exactly the territory where a static graph falls short.

Mika derivative diagram showing a secant line approaching the tangent

The strongest GeoGebra gains were on understanding representations of functions. Watching the secant line collapse into the tangent as the interval shrinks makes the definition of the derivative visible rather than symbolic.
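The secant-to-tangent collapse can also be checked numerically. The sketch below is illustrative only; the function x³, the point x = 1, and the name secant_slope are assumptions picked so that the true tangent slope (3) is easy to verify by hand.

```python
def secant_slope(f, x, h):
    """Slope of the secant line through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h

def cube(x):
    return x ** 3

x0 = 1.0
tangent_slope = 3.0  # derivative of x^3 is 3x^2, which equals 3 at x = 1
interval_widths = (1.0, 0.1, 0.01, 0.001)
slopes = [secant_slope(cube, x0, h) for h in interval_widths]
gaps = [abs(slope - tangent_slope) for slope in slopes]

for h, slope in zip(interval_widths, slopes):
    print(f"h = {h:<6} secant slope = {slope:.6f}")
```

As h shrinks, the secant slope settles toward the tangent slope, which is the definition of the derivative as a limit.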

Part 3: The case for AI tutoring — and the case against it

Now to the part everyone actually wants to argue about.

The strongest evidence that active engagement drives STEM learning predates AI entirely. Freeman et al.'s landmark 2014 meta-analysis in PNAS — 225 studies — found that active learning raised exam scores by roughly half a standard deviation, and that students in traditional lecture courses were about 1.5 times more likely to fail than those in active-learning classes. Hake's classic 1998 study of over 6,000 physics students found interactive-engagement courses produced learning gains roughly double those of traditional lecture. The direction of travel in education research has been clear for thirty years: passive bad, interactive good.

The question AI raises is whether software can deliver that interactivity at the scale and availability a human teacher can't. The tutoring research suggests the ceiling is high: VanLehn's 2011 review found that intelligent tutoring systems produced learning gains nearly matching one-on-one human tutors, and Kulik & Fletcher's 2016 meta-analysis put the typical effect at around 0.66 standard deviations — enough to move a median student to the 75th percentile.
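For readers who want to check the percentile arithmetic, here is a minimal sketch of how an effect size in standard deviations translates into a percentile shift. It assumes outcomes are normally distributed with equal spread in both groups, which is a simplification; the function name percentile_after_shift is ours.

```python
from statistics import NormalDist

def percentile_after_shift(effect_size_sd, start_percentile=50.0):
    """Percentile (within the comparison group) reached by a student who starts
    at `start_percentile` and improves by `effect_size_sd` standard deviations.

    Assumes normally distributed scores with equal variance in both groups.
    """
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    start_z = standard_normal.inv_cdf(start_percentile / 100.0)
    return 100.0 * standard_normal.cdf(start_z + effect_size_sd)

median_student_after = percentile_after_shift(0.66)
print(f"0.66 SD moves a median student to about the "
      f"{median_student_after:.0f}th percentile")
```

Under those assumptions, a 0.66 SD gain carries a median student to roughly the 75th percentile, consistent with the figure quoted above.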

And then there's Kestin et al. (2025), the Harvard study we opened with. Its result is genuinely remarkable, but the reason for the result is what matters here. Their AI tutor wasn't a raw chatbot. It was deliberately engineered around the same pedagogical principles as the in-class lessons: present one step at a time, never give away the answer before the student attempts it, manage cognitive load, and prompt a growth mindset.

Which brings us to the cautionary tale. Bastani et al. (2025, PNAS) gave nearly a thousand high-school students access to GPT-4 math tutors. During practice, the AI helped enormously — a version with guardrails boosted practice performance by 127%. But on an exam with the AI removed, students who'd used the unguarded, answer-giving version scored measurably worse than peers who'd never had it. They had outsourced the thinking. The guardrailed version — the one that gave hints instead of answers — erased the damage.

The lesson is unambiguous, and it's the thesis of this entire piece: an AI tutor that hands over answers is worse than no tutor at all. An AI tutor that scaffolds, withholds, and makes the student do the work is the one that helps. Recent aggregate evidence echoes the nuance rather than the hype — a 2026 meta-analysis of 35 experimental studies (Wang, Humanities and Social Sciences Communications) found a moderate positive overall effect of ChatGPT on learning (≈ 0.67 SD), strongest in structured, sustained instructional settings and weakest when used as a casual answer engine. (It's worth noting the field is still maturing: an earlier, more breathless meta-analysis reporting much larger effects was retracted in 2026 over methodological concerns. We'd rather cite the conservative, intact number.)

This is the distinction that defines Ask Mika. By default, not as an optional setting a student has to find, Ask Mika works Socratically: it asks before it answers, breaks a problem into steps, and prompts the student to attempt each one rather than handing over a finished solution. It is, deliberately, the guardrailed tutor from the Bastani study, not the answer engine. And it goes a step further than a fixed set of guardrails: Ask Mika remembers how an individual student works, saves what it learns about their reasoning and their sticking points, and adapts in future conversations to match how that particular student learns. The result is the rarest combination in the research: the scaffolding that protects learning and the one-to-one personalization that Bloom identified, four decades ago, as the thing classrooms could never quite scale.

Part 4: How Mika maps onto the evidence

Here's the honest framing. No study above tested Mika. What the research validates is a category — visual, interactive, scaffolded, adaptive AI tutoring — and a set of design principles that separate the tools that help from the tools that hurt. We built Mika as a deliberate attempt to satisfy every one of those principles. The same approach applies across every STEM discipline — here it is generating the right representation for biology, chemistry, physics, and calculus in turn:

Mika enzyme kinetics curve — reaction rate vs. substrate concentration

Mika ideal gas law diagram — pressure vs. volume

Mika projectile motion diagram with parabolic trajectory

Mika solid of revolution diagram using the disk method

The same principle applies whether the subject is Michaelis–Menten kinetics, the ideal gas law, projectile motion, or a solid of revolution. Mika generates the right representation on demand across STEM.
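For reference, these are the standard textbook relationships behind those four diagrams, written out as a worked block. The symbols are the conventional ones and are our choice, not taken from Mika's output; note that V denotes gas volume in the second line and the volume of the solid in the fourth.

```latex
\begin{align*}
v &= \frac{V_{\max}\,[S]}{K_m + [S]}
  && \text{(Michaelis-Menten: reaction rate vs. substrate concentration)} \\
P &= \frac{nRT}{V}
  && \text{(ideal gas law: pressure vs. volume at fixed } n \text{ and } T\text{)} \\
y &= x\tan\theta - \frac{g\,x^{2}}{2\,v_0^{2}\cos^{2}\theta}
  && \text{(projectile trajectory, launch angle } \theta\text{, speed } v_0\text{, no drag)} \\
V &= \pi \int_a^b \bigl[f(x)\bigr]^{2}\,dx
  && \text{(disk method, rotating } y = f(x) \text{ about the } x\text{-axis)}
\end{align*}
```

Each formula is compact on the page but describes a shape: a saturating curve, a hyperbola, a parabola, and a stack of circular slices. That shape is what the corresponding diagram makes visible.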

Judge for yourself how well the research maps onto our design choices:

What the research says works	The Mika design choice
Words + pictures beat words alone (Mayer; Paivio; Sweller)	Ask Mika generates a relevant graph, diagram, or image for explanations rather than answering in prose
Representational, interactive visuals beat static or decorative ones (Höffler & Leutner; Tversky)	Visuals are manipulable instruments — drag a Riemann partition, slide a Taylor term, orbit a 3-D surface
Learners must generate, not consume (Slamecka & Graf; ICAP)	Students adjust parameters and predict outcomes; the system is built around doing, not watching
Withholding answers is essential; answer-giving harms (Bastani et al.)	Ask Mika is Socratic by default — it asks before it answers and scaffolds toward solutions step by step, rather than dumping final answers
Support must fade as expertise grows (expertise reversal effect)	Ask Mika remembers how each student works, identifies their learning method, and adapts in future chats; adaptive difficulty and gap detection reduce scaffolding as a student demonstrates mastery
Immediate, specific feedback drives gains (Hattie & Timperley)	Auto-marked constructed responses and real-time feedback, not just a grade
Multiple linked representations build understanding (GeoGebra evidence; representational fluency)	The same concept is surfaced symbolically, graphically, and numerically, with translation between them

That's the argument. We didn't reverse-engineer a marketing claim from the research; we read the research first and let it dictate the product.

Part 5: What we've actually seen

The science above describes the category. The only Mika-specific evidence we have is our own — and we want to be clear about exactly what it is and isn't.

Across two semesters and 600+ students in institutional pilots, we observed:

A +8.7 percentage-point lift in pass rate versus the prior-semester baseline in our second pilot
A +4.35 percentage-point lift in average score, sustained even as the cohort roughly doubled
97% of high-achieving students actively using the platform across both pilots

These are real numbers from real classrooms, and we're proud of them. They are also not a randomized controlled trial. They're observational results from live deployments, with all the confounds that implies — motivated instructors, cohort differences, the novelty of a new tool. We present them as encouraging early evidence of what happens when the principles in this article meet an actual curriculum, not as proof of causation. The rigorous causal claims in this piece belong to the published researchers we cited; the pilot numbers belong to us, with appropriate humility about what they can and can't show.

The bottom line

The evidence for visual, interactive, scaffolded AI tutoring in STEM is strong, deep, and — crucially — conditional. It rests on six decades of cognitive science and a growing body of classroom RCTs. But the same evidence base contains a clear warning: get the design wrong, hand students an answer machine, and you can actively set learning back.

We find that genuinely clarifying rather than threatening. It means the question isn't "AI or no AI." It's "designed how?" And that's a question we're happy to be measured on.

References

Bastani, H., Bastani, O., Sungu, A., Ge, H., Kabakcı, Ö., & Mariman, R. (2025). Generative AI without guardrails can harm learning: Evidence from high school mathematics. Proceedings of the National Academy of Sciences, 122(26), e2422633122. https://doi.org/10.1073/pnas.2422633122

Chi, M. T. H., & Wylie, R. (2014). The ICAP framework: Linking cognitive engagement to active learning outcomes. Educational Psychologist, 49(4), 219–243. https://doi.org/10.1080/00461520.2014.965823

Cromley, J. G., & Chen, R. (2025). A meta-analysis of Richard Mayer's multimedia learning research: Searching for boundary conditions of design principles across multiple media types. Educational Research Review, 49, 100730. https://doi.org/10.1016/j.edurev.2025.100730

Freeman, S., Eddy, S. L., McDonough, M., Smith, M. K., Okoroafor, N., Jordt, H., & Wenderoth, M. P. (2014). Active learning increases student performance in science, engineering, and mathematics. Proceedings of the National Academy of Sciences, 111(23), 8410–8415. https://doi.org/10.1073/pnas.1319030111

Hake, R. R. (1998). Interactive-engagement versus traditional methods: A six-thousand-student survey of mechanics test data for introductory physics courses. American Journal of Physics, 66(1), 64–74.

Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1), 81–112.

Höffler, T. N., & Leutner, D. (2007). Instructional animation versus static pictures: A meta-analysis. Learning and Instruction, 17(6), 722–738. https://doi.org/10.1016/j.learninstruc.2007.09.013

Juandi, D., Kusumah, Y. S., Tamur, M., Perbowo, K. S., & Wijaya, T. T. (2021). A meta-analysis of GeoGebra software decade of assisted mathematics learning: What to learn and where to go? Heliyon, 7(5), e06953. https://doi.org/10.1016/j.heliyon.2021.e06953

Kestin, G., Miller, K., Klales, A., Milbourne, T., & Ponti, G. (2025). AI tutoring outperforms in-class active learning: An RCT introducing a novel research-based design in an authentic educational setting. Scientific Reports, 15, 17458. https://doi.org/10.1038/s41598-025-97652-6

Kulik, J. A., & Fletcher, J. D. (2016). Effectiveness of intelligent tutoring systems: A meta-analytic review. Review of Educational Research, 86(1), 42–78. https://doi.org/10.3102/0034654315581420

Mayer, R. E. (2009). Multimedia learning (2nd ed.). Cambridge University Press.

Paivio, A. (1986). Mental representations: A dual coding approach. Oxford University Press.

Slamecka, N. J., & Graf, P. (1978). The generation effect: Delineation of a phenomenon. Journal of Experimental Psychology: Human Learning and Memory, 4(6), 592–604.

Sweller, J. (1988). Cognitive load during problem solving: Effects on learning. Cognitive Science, 12(2), 257–285.

Tversky, B., Morrison, J. B., & Bétrancourt, M. (2002). Animation: Can it facilitate? International Journal of Human–Computer Studies, 57(4), 247–262.

VanLehn, K. (2011). The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational Psychologist, 46(4), 197–221. https://doi.org/10.1080/00461520.2011.611369

Wang, S. (2026). ChatGPT's impact on student learning outcomes: A meta-analysis of 35 experimental studies. Humanities and Social Sciences Communications, 13, 684. https://doi.org/10.1057/s41599-026-07019-z

A note on the evidence in this piece: every study cited here has been verified against its original source. None of these studies tested Mika directly; they establish the principles Mika is built on. Our own pilot results are observational, not causal, and are labeled as such.

The Science Behind Mika: Why We Built a STEM Tutor That Draws, Withholds, and Adapts