Blog · May 27, 2026 · 15 min read

Why a Grade Tells You Almost Nothing: The Science Behind Mika's Diagnostic Loop

By Asma Mughrabi & Ahmed Zraiqat

Part two of the Mika Labs research series. Part one looked at why visual, interactive tutoring works. This one is about what happens after a student gets something wrong.


A student gets 68% on a Calculus II midterm. What do you actually know?

Almost nothing useful. You know that they struggled, not why. Two students can land on the same 68% for completely different reasons — one never solidified the Product Rule from Calc I and quietly falls apart every time it's hidden inside a harder problem; the other has a confident, specific, wrong belief about how integration distributes across a product. Same grade. Opposite problems. And the worst possible response — the one most courses default to — is to re-teach Integration by Parts to both of them and hope.

The research on this is unusually clear, and unusually old. The reason Mika doesn't just grade a wrong answer but tries to explain it comes down to a single finding that has survived fifty years of scrutiny: what a student already knows — and what they wrongly believe — is the dominant factor in what they learn next. This piece lays out that evidence, and the three-part diagnostic loop we built on top of it. As with the first piece in this series, the studies below establish a category of approach; none of them tested Mika. The argument is that we built Mika to match what they found.


Part 1: Prior knowledge is the whole game — but not the way the slogan says

David Ausubel wrote the sentence that every education student eventually memorizes: if all of educational psychology had to be reduced to one principle, it would be that the most important single factor influencing learning is what the learner already knows, and that you should ascertain this and teach accordingly (Ausubel, 1968). It is, remarkably, still the best-supported one-liner in the field.

But the modern evidence adds a twist that matters enormously for how you act on it. The largest synthesis ever conducted on this question — Simonsmeier, Flaig, Deiglmayr, Schalk, and Schneider's 2022 meta-analysis in Educational Psychologist, covering 493 articles, 8,776 effect sizes, and over 126,000 learners — found that prior knowledge correlates strongly with where a student ends up (r = .534) but only weakly and wildly variably with how much they gain, with a prediction interval so wide it runs from −.688 to .621. Their conclusion is worth quoting precisely: this variability, they write, falsifies both "knowledge is power" and "the effect of prior knowledge is negligible."

The practical reading is the one that shaped Mika's design: prior knowledge tells you where a student can start, not how much they'll learn. Which means the valuable thing isn't a global measure of "how much they know." It's knowing exactly which specific prerequisite is missing behind a specific wrong answer. A grade is a scalar. Learning needs a map.


Part 2: Two kinds of wrong

Here is the distinction the whole diagnostic loop turns on. When a student gets something wrong, the error is almost always one of two fundamentally different kinds — and they need opposite responses.

A gap is missing knowledge. The student never firmly acquired a prerequisite, so when it shows up — often buried inside a harder problem — they stall. The Integration-by-Parts student who keeps picking the wrong parts isn't failing integration; they're failing fluent Product-Rule differentiation under load, a Calc I skill. The fix is to go back down the chain and rebuild the missing rung.

A misconception is the opposite problem: not missing knowledge but confident, coherent, wrong knowledge. And in mathematics, these are everywhere — they're just less talked about than in physics because they hide inside notation.

Take the most studied math misconception of all: the illusion of linearity. Students develop a deep, almost irresistible belief that everything behaves proportionally — that if you double the input, you double the output. It produces the classic algebra errors every calculus instructor sees: writing (a+b)² = a² + b², or √(a+b) = √a + √b, or f(a+b) = f(a)+f(b). Van Dooren, De Bock, and colleagues (2004) documented how robust this is — students apply the linear model even in geometry problems where enlarging a figure by a factor of k multiplies its area by k², not k — and, crucially, showed that fixing it requires a conceptual-change intervention, not more practice problems. Re-teaching the right rule doesn't help, because the student isn't missing a rule; they're confidently applying the wrong one.
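
A single numeric counterexample is often enough to make the linear model visibly fail. The short Python check below is only an illustration (the values of a, b, k, and side are arbitrary choices, not taken from the cited study): it shows the cross term that the linear belief drops, and the k² scaling of area.

```python
import math

a, b = 3.0, 4.0  # arbitrary illustrative values

# Squaring is not additive: the linear model silently drops the cross term 2ab.
lhs_square = (a + b) ** 2        # correct expansion of (a+b)^2
rhs_square = a ** 2 + b ** 2     # what the "linear" belief predicts
cross_term = 2 * a * b           # the part that goes missing

# The square root is not additive either.
lhs_root = math.sqrt(a + b)
rhs_root = math.sqrt(a) + math.sqrt(b)

# Enlarging a square by a factor k multiplies its area by k^2, not k.
k, side = 2, 5.0
area_ratio = (k * side) ** 2 / side ** 2

print(lhs_square, rhs_square, cross_term)
print(lhs_root, rhs_root)
print(area_ratio)
```

With a = 3 and b = 4 the two sides of the squaring "rule" differ by exactly 2ab = 24, which is the kind of concrete failure a confrontation-style prompt can put in front of a student.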

Or take limits, the conceptual heart of calculus. In one of the most cited papers in all of mathematics education, Tall and Vinner (1981) showed that students carry a "concept image" of a limit that quietly conflicts with its formal definition — most famously, the belief that a sequence can approach its limit but never actually reach it (so 0.999… must be "just less than" 1). That belief is coherent, intuitive, and wrong, and it survives the standard ε–δ lecture almost untouched, because the lecture addresses the definition while the student is reasoning from the image.

This is the sobering pattern across the misconception literature: a confident wrong belief is not dislodged by re-explaining the correct one. Posner, Strike, Hewson, and Gertzog (1982) laid out why, and what works instead — genuine conceptual change requires the student to first become dissatisfied with their existing belief, usually by being confronted with a case it cannot explain, before a new conception can take hold. You don't fix a misconception by repeating the right answer louder. You make the wrong belief visibly fail.

The clearest quantified demonstration of all this happens to come from physics, where the misconceptions were catalogued earliest. The Force Concept Inventory (Hestenes, Wells, & Swackhamer, 1992) built its wrong answers out of documented commonsense beliefs and found conventional instruction barely shifted them; Hake's (1998) survey of 6,542 students found that traditional lecture produced an average normalized gain of just 0.23, while the interactive-engagement courses in his sample, the kind that confront misconceptions directly rather than lecture past them, averaged 0.48, more than double. The instrument is physics, but the lesson is general, and it's exactly what Tall & Vinner and the linearity researchers found in pure mathematics: documented, coherent, wrong mental models that resist re-teaching and yield only to confrontation. (There's a long-running debate about what misconceptions fundamentally are: coherent mini-theories versus loose context-triggered fragments, as in diSessa's "knowledge in pieces" account. For a tutoring system it's a debate we can sidestep: the symptoms are stable and documentable even where the underlying cognitive story is contested.)
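
For readers unfamiliar with the metric: Hake's normalized gain is the fraction of the possible pre-to-post improvement that a class actually achieved, (post − pre) / (100 − pre), with scores in percent. The sketch below computes it in Python. The pre-test and post-test averages are hypothetical, chosen only so the results land on the two averages quoted above; they are not Hake's data.

```python
def normalized_gain(pre_pct: float, post_pct: float) -> float:
    """Achieved gain divided by the maximum possible gain (scores in percent)."""
    if not 0 <= pre_pct < 100:
        raise ValueError("pre_pct must be in [0, 100)")
    return (post_pct - pre_pct) / (100.0 - pre_pct)


# Hypothetical class averages, not taken from Hake (1998).
traditional = normalized_gain(pre_pct=40.0, post_pct=53.8)
interactive = normalized_gain(pre_pct=40.0, post_pct=68.8)

print(round(traditional, 2), round(interactive, 2))
```

Because the gain is normalized by how much room a class had to improve, courses that start from different pre-test averages can be compared on the same scale.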

This is why Mika has two separate detectors, not one. Telling a gap apart from a misconception is the single most consequential judgment in remediation, because mistaking one for the other guarantees the wrong fix.


Part 3: Why detecting from a fixed list — not free generation — is the right call

Here's the part a skeptical faculty member should press hardest on, because it's where most AI tutors quietly fail. If you ask a general-purpose chatbot to diagnose why a student got something wrong, it will happily produce a fluent, confident, plausible-sounding answer — that is sometimes simply invented.

The cost of that isn't hypothetical. A 2025 randomized study presented at the ACM Learning @ Scale conference (Steinbach, Bhandari, Meyer, & Pardos) deliberately varied how often an LLM math tutor's feedback contained hallucinations (0%, 50%, or 100%) across 252 learners. As the hallucination rate rose, student confusion increased and their trust in the feedback's accuracy collapsed. A diagnostic engine that's confidently wrong is worse than one that says nothing, because it sends the student off to fix a problem they don't have.

This is the reasoning behind a specific architectural choice in Mika that's worth stating plainly. Mika's gap and misconception detection runs on a small, fine-tuned model whose entire output vocabulary is a curated, expert-validated list of known prerequisites and documented misconceptions for each topic. It can label a wrong answer with the most likely diagnosis from that list — or it can decline. It structurally cannot invent a gap or a misconception that isn't on the list, because inventing one isn't in its output space. The accuracy, speed, and cost benefits of a small specialized model are real, but the hallucination prevention is the point.
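
The closed-set idea can be sketched in a few lines of Python. This is an illustration of the principle, not Mika's implementation: the label names, the score vector, and the 0.6 threshold are all hypothetical. The point is that the function can only return an entry of the curated list or decline; there is no code path that produces a new diagnosis.

```python
from typing import Optional, Sequence

# Hypothetical curated list for one topic; the real list is not given in this article.
CURATED_LABELS = (
    "gap:product_rule",
    "misconception:integral_distributes_over_product",
    "misconception:illusion_of_linearity",
)


def diagnose(scores: Sequence[float], threshold: float = 0.6) -> Optional[str]:
    """Map one score per curated label to a diagnosis, or None to decline.

    `scores` stands in for a classifier's per-label confidences (assumed here).
    """
    if len(scores) != len(CURATED_LABELS):
        raise ValueError("expected exactly one score per curated label")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best_index] < threshold:
        return None  # decline rather than guess
    return CURATED_LABELS[best_index]


confident = diagnose([0.1, 0.8, 0.1])
uncertain = diagnose([0.4, 0.3, 0.3])
print(confident, uncertain)
```

In this sketch the first call returns the second curated label and the second call declines, because no label clears the threshold.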

This isn't a novel trick so much as an old principle applied to a new tool. The Force Concept Inventory works precisely because its wrong-answer options aren't freely generated — they're drawn from a documented taxonomy of real student misconceptions. And the adaptive-tutoring systems that have worked at scale for decades, going back to Corbett and Anderson's (1995) Bayesian Knowledge Tracing in the Cognitive Tutor, have always tracked mastery over a fixed, defined skill set. Mika uses a modern language-aware model where they used a Bayesian network, but the closed-set discipline is the same — and it's the same discipline that keeps the system honest.
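
For reference, the classic Bayesian Knowledge Tracing update for a single skill looks like this in Python. It is the textbook four-parameter form (initial knowledge, learning rate, slip, guess), shown to illustrate what tracking mastery over a fixed skill set means; it is not Mika's model, and the parameter values are illustrative placeholders rather than fitted estimates.

```python
def bkt_update(p_known: float, correct: bool,
               p_transit: float = 0.1,
               p_slip: float = 0.1,
               p_guess: float = 0.2) -> float:
    """One Bayesian Knowledge Tracing step: observe an answer, update P(skill known)."""
    if correct:
        # A correct answer comes from knowing and not slipping, or from a lucky guess.
        evidence = p_known * (1 - p_slip) + (1 - p_known) * p_guess
        posterior = p_known * (1 - p_slip) / evidence
    else:
        # A wrong answer comes from knowing but slipping, or from not knowing and not guessing.
        evidence = p_known * p_slip + (1 - p_known) * (1 - p_guess)
        posterior = p_known * p_slip / evidence
    # Account for the chance the skill was learned during this practice opportunity.
    return posterior + (1 - posterior) * p_transit


p = 0.3  # illustrative prior probability that the skill is already known
for answer in (True, True, False, True):
    p = bkt_update(p, answer)
    print(round(p, 3))
```

Each skill in the fixed set carries its own estimate, so the system's state is a vector of probabilities over known skills rather than free text.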

It's also worth being candid about the limit here: a curated list is, by definition, not exhaustive. Mika catches the documented, common gaps and misconceptions for a topic — the ones decades of discipline-based education research have actually catalogued — not every idiosyncratic error a student could conceivably make. That's a deliberate trade: high trust on the cases that matter most, rather than confident guesses across the long tail.


Part 4: The remediation half — diagnose, then test, not re-tell

Detecting the problem is only half the loop, and arguably the easier half. What you do next is where most of the learning actually happens — and here the research points somewhere counterintuitive.

The instinct, once you've found a gap, is to re-explain: show the student the right method again, more carefully. The evidence says this is close to the weakest thing you can do. The single most replicated finding in the modern science of learning is that being tested is itself one of the most powerful learning events available — far more powerful than re-reading or reviewing.

The numbers are striking. Roediger and Karpicke (2006) showed that, on a test given a week later, students who had taken a practice recall test outperformed students who had simply restudied the same material, even though the restudy group felt more confident. Karpicke and Blunt (2011), in Science, pitted retrieval practice against elaborative concept mapping and found that retrieval produced an advantage of about one and a half standard deviations on later inference questions. Adesope, Trevisan, and Sundararajan's (2017) meta-analysis of 272 independent effects found practice testing beat restudying and every other comparison condition, with effects in the medium-to-large range. And when Dunlosky and colleagues (2013) reviewed ten popular study techniques for Psychological Science in the Public Interest, only two earned their top "high utility" rating: practice testing and distributed practice.

So Mika's Adaptive Practice Engine doesn't re-explain a diagnosed gap and move on. It builds the fix around retrieval — and around three further things the research is specific about:

First, spacing. Cepeda and colleagues (2006, 2008) established not just that spacing out practice beats cramming, but a usable scaling rule: the ideal gap before re-testing is roughly 10–20% of how long you want the knowledge to last. The engine re-surfaces a remediated concept on that kind of schedule rather than drilling it once and abandoning it.
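
As a rough illustration of that scaling rule, the Python sketch below turns a retention horizon into a re-test date. The function names and the 15% default are assumptions made for the example; the article does not specify how the engine computes its schedule.

```python
from datetime import date, timedelta


def review_gap_range(retention_days: float) -> tuple:
    """Rough window for the gap before re-testing: 10-20% of the retention horizon."""
    return (0.10 * retention_days, 0.20 * retention_days)


def next_review(last_practice: date, retention_days: int, fraction: float = 0.15) -> date:
    """Schedule the next re-test a fixed fraction of the retention horizon after practice."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    gap_days = max(1, round(retention_days * fraction))  # never schedule same-day
    return last_practice + timedelta(days=gap_days)


low, high = review_gap_range(100)
due = next_review(date(2026, 1, 1), retention_days=100)
print(low, high, due)
```

For knowledge that needs to last about 100 days, this gives a window of roughly 10 to 20 days, and the default fraction places the re-test 15 days after the last practice.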

Second, retrieval over recognition. Per Karpicke and Blunt, generating an answer beats reviewing one — even though, tellingly, their students predicted the opposite. The re-test is a problem to solve, not a slide to re-read.

Third, mastery gating. The engine doesn't let a student move on until the specific diagnosed gap or misconception is actually resolved on fresh items. This is the old mastery-learning principle (formative test, corrective feedback, re-test until solid) that Bloom (1984) put at the center of his work, and it fits naturally with the feedback-at-every-step design of the "step-based" tutoring systems that VanLehn (2011) found produce learning gains (around 0.76 standard deviations over no tutoring) statistically indistinguishable from those of human tutors.
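
One simple way to make "resolved on fresh items" concrete is a streak criterion, sketched below in Python. The function name and the three-in-a-row threshold are assumptions made for illustration; the article does not state the engine's actual mastery rule.

```python
from typing import Sequence


def gap_resolved(fresh_item_results: Sequence[bool], required_streak: int = 3) -> bool:
    """True once the most recent `required_streak` fresh items were all answered correctly."""
    if required_streak < 1:
        raise ValueError("required_streak must be at least 1")
    if len(fresh_item_results) < required_streak:
        return False  # not enough evidence yet
    return all(fresh_item_results[-required_streak:])


print(gap_resolved([True, False, True, True, True]))
print(gap_resolved([True, True, False]))
```

Under this rule an early mistake does not block the student forever, but a recent one keeps the gate closed until the streak is rebuilt.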

One important caveat we hold ourselves to: feedback is not automatically good. Kluger and DeNisi's (1996) classic meta-analysis found that more than a third of feedback interventions actually reduced performance, typically when the feedback targeted the student ("you're behind") rather than the task ("here's where this step went wrong"). Mika's diagnoses are deliberately task-level, never judgments about the student.


Part 5: How the loop maps onto the evidence

No study above tested Mika. What they establish is a set of design principles for diagnosis and remediation — and the loop was built to satisfy each one. Judge the fit for yourself:

| What the research says | The Mika design choice |
| --- | --- |
| Prior knowledge governs where learning can start; the useful unit is the specific missing prerequisite, not a global score (Ausubel; Simonsmeier et al.) | Gap Detection traces a wrong answer back to the specific unmastered prerequisite, not just a topic-level grade |
| Math misconceptions are coherent and resist re-teaching; they must be directly confronted (Tall & Vinner; Van Dooren et al.; Posner et al.; with the sharpest quantified case in Hestenes et al. and Hake) | Misconception Detection is a separate signal from gap detection, routing to confrontation-style remediation rather than repetition |
| Confidently-wrong AI feedback raises confusion and destroys trust (Steinbach et al.) | Diagnosis runs on a small fine-tuned model constrained to a curated list; it cannot hallucinate a diagnosis that isn't documented |
| Adaptive tutoring should track mastery over a fixed, fine-grained skill set (Corbett & Anderson; VanLehn) | The detectors operate over an expert-validated prerequisite-and-misconception ontology per topic |
| Testing beats re-reading; retrieval is a learning event (Roediger & Karpicke; Karpicke & Blunt; Adesope et al.; Dunlosky et al.) | The Adaptive Practice Engine remediates with retrieval problems, not re-presented explanations |
| Spacing should scale with the retention horizon (Cepeda et al.) | Re-tests are scheduled at spaced intervals, not massed into one session |
| Don't advance until mastery (Bloom; VanLehn) | Mastery gating: the loop closes only when the gap or misconception is resolved on fresh items |
| Feedback aimed at the self can backfire (Kluger & DeNisi) | Diagnoses are task-level, never evaluations of the student |

We didn't pick these features and then find research to justify them. The loop is shaped the way the evidence said it should be.


Part 6: What we've actually seen

As in the first piece, the only Mika-specific evidence we have is our own, and we want to be exact about what it is and isn't.

Across two semesters and 600+ students in institutional pilots, we observed a +8.7 percentage-point lift in pass rate against the prior-semester baseline, a +4.35 percentage-point lift in average score that held as the cohort roughly doubled, and active use by 97% of high-achieving students.

A diagnose-remediate-retest loop is exactly the kind of mechanism that should produce a pass-rate effect like that — catching the students who'd otherwise quietly fail on a buried prerequisite. But we'll say plainly what we said before: these are observational results from live deployments, not a controlled trial. Motivated instructors, cohort differences, and novelty are all uncontrolled. We offer the numbers as encouraging real-world signal consistent with the research, not as causal proof. The rigorous causal claims in this piece belong to the cited studies; the pilot figures are ours, held to a lower standard of certainty and labeled as such.


The bottom line

A grade tells you a student is struggling. It doesn't tell you whether they're missing a foundation or holding onto a wrong idea — and those need opposite fixes. The research is clear that the highest-leverage thing an educational system can do is diagnose which of those it's facing, confront a misconception rather than repeat past it, and then remediate through testing rather than re-telling — without advancing until it's actually fixed.

That's the loop. We constrained the diagnosis to a curated list because a confidently-wrong tutor is worse than none, and we built the remediation around retrieval and spacing because that's what fifty years of evidence says works. None of it is magic. It's just the research, engineered into a product.


References

Adesope, O. O., Trevisan, D. A., & Sundararajan, N. (2017). Rethinking the use of tests: A meta-analysis of practice testing. Review of Educational Research, 87(3), 659–701. https://doi.org/10.3102/0034654316689306

Ausubel, D. P. (1968). Educational psychology: A cognitive view. Holt, Rinehart & Winston.

Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education, 5(1), 7–74. https://doi.org/10.1080/0969595980050102

Bloom, B. S. (1984). The 2 sigma problem: The search for methods of group instruction as effective as one-to-one tutoring. Educational Researcher, 13(6), 4–16. https://doi.org/10.3102/0013189X013006004

Cepeda, N. J., Pashler, H., Vul, E., Wixted, J. T., & Rohrer, D. (2006). Distributed practice in verbal recall tasks: A review and quantitative synthesis. Psychological Bulletin, 132(3), 354–380. https://doi.org/10.1037/0033-2909.132.3.354

Cepeda, N. J., Vul, E., Rohrer, D., Wixted, J. T., & Pashler, H. (2008). Spacing effects in learning: A temporal ridgeline of optimal retention. Psychological Science, 19(11), 1095–1102. https://doi.org/10.1111/j.1467-9280.2008.02209.x

Corbett, A. T., & Anderson, J. R. (1995). Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4), 253–278. https://doi.org/10.1007/BF01099821

diSessa, A. A. (1993). Toward an epistemology of physics. Cognition and Instruction, 10(2–3), 105–225.

Dunlosky, J., Rawson, K. A., Marsh, E. J., Nathan, M. J., & Willingham, D. T. (2013). Improving students' learning with effective learning techniques: Promising directions from cognitive and educational psychology. Psychological Science in the Public Interest, 14(1), 4–58. https://doi.org/10.1177/1529100612453266

Hake, R. R. (1998). Interactive-engagement versus traditional methods: A six-thousand-student survey of mechanics test data for introductory physics courses. American Journal of Physics, 66(1), 64–74. https://doi.org/10.1119/1.18809

Hestenes, D., Wells, M., & Swackhamer, G. (1992). Force Concept Inventory. The Physics Teacher, 30(3), 141–158. https://doi.org/10.1119/1.2343497

Karpicke, J. D., & Blunt, J. R. (2011). Retrieval practice produces more learning than elaborative studying with concept mapping. Science, 331(6018), 772–775. https://doi.org/10.1126/science.1199327

Kluger, A. N., & DeNisi, A. (1996). The effects of feedback interventions on performance: A historical review, a meta-analysis, and a preliminary feedback intervention theory. Psychological Bulletin, 119(2), 254–284. https://doi.org/10.1037/0033-2909.119.2.254

Posner, G. J., Strike, K. A., Hewson, P. W., & Gertzog, W. A. (1982). Accommodation of a scientific conception: Toward a theory of conceptual change. Science Education, 66(2), 211–227. https://doi.org/10.1002/sce.3730660207

Roediger, H. L., & Karpicke, J. D. (2006). Test-enhanced learning: Taking memory tests improves long-term retention. Psychological Science, 17(3), 249–255. https://doi.org/10.1111/j.1467-9280.2006.01693.x

Simonsmeier, B. A., Flaig, M., Deiglmayr, A., Schalk, L., & Schneider, M. (2022). Domain-specific prior knowledge and learning: A meta-analysis. Educational Psychologist, 57(1), 31–54. https://doi.org/10.1080/00461520.2021.1939700

Steinbach, M., Bhandari, S., Meyer, J., & Pardos, Z. A. (2025). When LLMs hallucinate: Examining the effects of erroneous feedback in math tutoring systems. In Proceedings of the Twelfth ACM Conference on Learning @ Scale (L@S '25). https://doi.org/10.1145/3698205.3729555

Tall, D., & Vinner, S. (1981). Concept image and concept definition in mathematics with particular reference to limits and continuity. Educational Studies in Mathematics, 12(2), 151–169. https://doi.org/10.1007/BF00305619

Van Dooren, W., De Bock, D., Hessels, A., Janssens, D., & Verschaffel, L. (2004). Remedying secondary school students' illusion of linearity: A teaching experiment aiming at conceptual change. Learning and Instruction, 14(5), 485–501. https://doi.org/10.1016/j.learninstruc.2004.06.013

VanLehn, K. (2011). The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational Psychologist, 46(4), 197–221. https://doi.org/10.1080/00461520.2011.611369


A note on the evidence in this piece: the studies cited here have been checked against their original sources. None tested Mika directly; they establish the principles the diagnostic loop is built on. Our own pilot results are observational, not causal, and are labeled as such.