Open-source evaluation

RedPen Benchmark — an open benchmark for AI in higher education.

An open-source benchmark that measures how helpful and accurate AI is in higher education, graded the way an educator grades a student, across real university-level STEM subjects. Run it yourself, read every rubric, and reproduce our numbers.

Five axes

What we grade on.

Every answer is scored against the same rubric a student is held to — graded by educators, not a model.

A. Correctness
Is the answer right at degree level, factually and methodologically, not just plausible?
B. Targetedness
Does it answer the actual question and the student's specific gap, rather than giving a generic restatement?
C. Level / Tone
Is the explanation pitched at the right level and register for a higher-education student?
D. Actionability
Can the student act on it, with a concrete next step rather than vague encouragement?
E. Guidance
Does it lead the student toward understanding, or just hand over the final answer?

We ran it ourselves · degree-level STEM

Results by subject.

We ran the full benchmark across four subjects and published the graded results. A PDF report is available for each subject.

Mathematics · Report ready · Designed by educators · June 2026 edition
Chemistry · Report ready · Designed by educators · June 2026 edition
Physics · Report ready · Designed by educators · June 2026 edition
Biology · Report ready · Designed by educators · June 2026 edition

More subjects are coming. The benchmark is open, so anyone can add a subject and submit it.

Scores are educator-graded composites of correctness, targetedness, level/tone, actionability, and guidance. Full rubrics, transcripts, and reproduction steps live in the repo.
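
As a rough illustration of what a composite across the five axes could look like, here is a minimal sketch. It assumes every axis is scored on the same scale and weighted equally. Neither assumption is stated on this page, and the function name `composite_score` and the axis keys are hypothetical. The rubrics in the repo remain the authority on how scores are actually combined.

```python
from statistics import mean
from typing import Mapping

# The five axes named on this page (key spellings are illustrative).
AXES = ("correctness", "targetedness", "level_tone", "actionability", "guidance")


def composite_score(axis_scores: Mapping[str, float]) -> float:
    """Combine per-axis educator scores into a single number.

    Assumption: all axes share one scale and carry equal weight.
    The real scale and weighting are defined by the published rubrics.
    """
    missing = [axis for axis in AXES if axis not in axis_scores]
    if missing:
        raise ValueError(f"missing axis scores: {missing}")
    # Unweighted mean over the five axes, ignoring any extra keys.
    return mean(axis_scores[axis] for axis in AXES)
```

If the repo turns out to use per-axis weights, the unweighted mean here would be replaced by a weighted one. The check for missing axes would still apply, since a composite over fewer than five axes is not comparable to the published scores.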

MikaLabs

An advanced AI research lab building the learning systems universities trust to teach.

© 2026 MikaLabs