Open-source evaluation

RedPen Benchmark — an open benchmark for AI in higher education.

An open-source benchmark for how helpful and accurate AI is in higher education — graded the way an educator grades a student, across real university-level STEM. Run it yourself, read every rubric, and reproduce our numbers.


Five axes

What we grade on.

Every answer is scored against the same rubric a student is held to — graded by educators, not a model.

A. Correctness

Is the answer right at degree level — factually and methodologically, not just plausible?

B. Targetedness

Does it answer the actual question and the student's specific gap — not a generic restatement?

C. Level / Tone

Is the explanation pitched at the right level and register for a higher-education student?

D. Actionability

Can the student act on it — a concrete next step, not vague encouragement?

E. Guidance

Does it lead the student toward understanding, or just hand over the final answer?

We ran it ourselves · degree-level STEM

Results by subject.

We ran the full benchmark across four subjects and published the graded results. A report is available to download for each subject.

Mathematics: report ready (PDF)

Chemistry: report ready (PDF)

Physics: report ready (PDF)

Biology: report ready (PDF)

More subjects are coming. The benchmark is open, so anyone can add and submit one.

Scores are educator-graded composites of correctness, targetedness, level/tone, actionability & guidance. Full rubrics, transcripts and reproduction steps live in the repo.
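As a rough illustration of what a five-axis composite could look like, here is a minimal Python sketch. It is not the repo's scoring code: the page does not state how the composite is weighted or which scale graders use, so the equal-weight mean, the `AxisScores` type, and the `composite` and `subject_score` helpers are all assumptions made for the example.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass(frozen=True)
class AxisScores:
    """One educator's marks for a single answer (hypothetical structure)."""
    correctness: float
    targetedness: float
    level_tone: float
    actionability: float
    guidance: float


def composite(scores: AxisScores) -> float:
    """Equal-weight mean of the five axes.

    ASSUMPTION: the real benchmark may weight axes differently;
    consult the rubrics in the repo for the actual rule.
    """
    return mean(getattr(scores, f.name) for f in fields(scores))


def subject_score(graded_answers: list[AxisScores]) -> float:
    """Average the per-answer composites for one subject (also an assumption)."""
    return mean([composite(s) for s in graded_answers])
```

With made-up marks, `composite(AxisScores(4.0, 3.0, 2.0, 3.0, 3.0))` averages the five values, and `subject_score` averages those per-answer composites across a subject. Treat the rubrics and reproduction steps in the repo as the authoritative definition.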