Open-source evaluation
RedPen Benchmark — an open benchmark for AI in higher education.
An open-source benchmark for how helpful and accurate AI is in higher education — graded the way an educator grades a student, across real university-level STEM. Run it yourself, read every rubric, and reproduce our numbers.
Five axes
What we grade on.
Every answer is scored against the same rubric a student is held to — graded by educators, not a model.
A. Correctness
Is the answer right at degree level — factually and methodologically, not just plausible?
B. Targetedness
Does it answer the actual question and the student's specific gap — not a generic restatement?
C. Level / Tone
Is the explanation pitched at the right level and register for a higher-education student?
D. Actionability
Can the student act on it — a concrete next step, not vague encouragement?
E. Guidance
Does it lead the student toward understanding, or just hand over the final answer?
We ran it ourselves · degree-level STEM
Results by subject.
We ran the full benchmark across four subjects and published the graded results. Download the report for any of them.
Mathematics
Chemistry
Physics
Biology
More subjects coming: the benchmark is open, so anyone can add a subject and submit it.
Scores are educator-graded composites of correctness, targetedness, level/tone, actionability & guidance. Full rubrics, transcripts and reproduction steps live in the repo.
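As a rough illustration of what a composite over the five axes could look like, here is a minimal Python sketch. The 0-4 scale, the equal weighting, and the names `RubricScores` and `composite` are assumptions made for this example only; the rubrics in the repo define the actual scale and aggregation.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass(frozen=True)
class RubricScores:
    """Educator-assigned scores for one answer, one field per axis.

    Hypothetical structure: the 0-4 scale is assumed for illustration.
    """
    correctness: float
    targetedness: float
    level_tone: float
    actionability: float
    guidance: float


def composite(scores: RubricScores) -> float:
    """Unweighted mean of the five axes.

    Equal weighting is an assumption; the real benchmark may weight axes differently.
    """
    return mean(getattr(scores, f.name) for f in fields(scores))


# Made-up scores for a single answer, not real benchmark data.
example = RubricScores(
    correctness=4.0,
    targetedness=3.0,
    level_tone=4.0,
    actionability=2.0,
    guidance=2.0,
)
print(composite(example))
```

An unweighted mean keeps the sketch simple. If one axis, such as correctness, should count for more, a weighted mean would be the natural variation.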