How We Caught 638 Bad AI-Generated Questions Before Our Users Saw Them
TL;DR: We ran a per-user audit on every CLEP student who got a low score or abandoned a session in the past 30 days. We re-checked the exact questions they saw, in order. 74% of those questions had already been flagged as broken by our quality pipeline and removed from circulation. Zero of the broken ones are still visible to future users. This post walks through how we got there.
The problem with LLM-generated test prep
Every edtech startup right now is racing to generate practice questions with GPT-4, Claude, or Gemini. The output looks great in a demo. In production, it breaks in ways students notice immediately:
- Stem says "What is the maximum?" — correct answer is the median.
- Explanation says "B is correct because..." — but stored answer is C.
- Options include both "4" and "2+2".
- "What is x?" with options that don't include the actual root.
- An explanation that contradicts the question.
A student who hits one of these doesn't think "the AI is having an off day." They think the platform is broken. They abandon. They write a Reddit post. They warn their friends.
What we built instead
PrepLion has 60+ deterministic validation gates that run on every question before it enters the approved pool. Each one catches a specific failure mode we've seen in real LLM output:
- letter-mismatch — explanation says "B is correct" but stored letter is C.
- stem-extremum-mismatch — stem asks for "greatest" but correct answer isn't the largest option.
- options-duplicate — two options are semantically identical.
- explanation-numeric-mismatch — explanation's number doesn't appear in the correct option.
- confession-phrase — explanation uses words like "approximately," undermining single-correct-answer claims.
- permutation-equivalent-options — options like {2, 3} and {3, 2} both selectable.
- hint-in-option — option text leaks the answer.
- stem-double-negation — confusing negation patterns.
- distractor-verbatim-from-stem — distractor lifted directly from stem.
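To make the idea of a deterministic gate concrete, here is a minimal sketch of two of the gates listed above, letter-mismatch and stem-extremum-mismatch. The `Question` shape, the field names, and the regular expressions are assumptions made for illustration; they are not PrepLion's actual schema or rules, and a production gate would need narrower triggers.

```typescript
// Hypothetical question shape; field names are illustrative assumptions.
interface Question {
  stem: string;
  options: Record<string, string>; // e.g. { A: "3", B: "9", C: "5" }
  correctLetter: string;
  explanation: string;
}

interface GateResult {
  gate: string;
  pass: boolean;
  detail: string; // empty when the gate passes
}

// letter-mismatch: the explanation names one letter as correct,
// but a different letter is stored as the answer.
function letterMismatch(q: Question): GateResult {
  const gate = "letter-mismatch";
  const claimed = q.explanation.match(/\b([A-E]) is (?:the )?correct\b/);
  if (claimed === null) return { gate, pass: true, detail: "" };
  const letter = claimed[1] ?? "";
  const pass = letter === q.correctLetter;
  const detail = pass ? "" : `explanation says ${letter}, stored ${q.correctLetter}`;
  return { gate, pass, detail };
}

// stem-extremum-mismatch: the stem asks for the greatest (or least) value,
// but the stored answer is not the largest (or smallest) numeric option.
function stemExtremumMismatch(q: Question): GateResult {
  const gate = "stem-extremum-mismatch";
  const wantsMax = /\b(greatest|largest|maximum)\b/i.test(q.stem);
  const wantsMin = /\b(least|smallest|minimum)\b/i.test(q.stem);
  // Skip when no extremum is requested or the stem mentions both.
  if (wantsMax === wantsMin) return { gate, pass: true, detail: "" };
  const values = Object.values(q.options).map((text) => Number(text));
  // Only applies when every option parses as a number.
  if (values.some((v) => Number.isNaN(v))) return { gate, pass: true, detail: "" };
  const target = wantsMax ? Math.max(...values) : Math.min(...values);
  const pass = Number(q.options[q.correctLetter]) === target;
  return { gate, pass, detail: pass ? "" : `expected option with value ${target}` };
}

// Run every gate and keep only the failures.
function failedGates(q: Question): GateResult[] {
  return [letterMismatch(q), stemExtremumMismatch(q)].filter((r) => !r.pass);
}
```

Because each gate is a pure function of the question record, the same code can run at generation time and again in later sweeps over the approved pool.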
Plus three deeper layers:
- CAS recompute for math courses. We re-solve the question with mathjs and compare the result to the stored answer. This catches sign-flipped roots, swapped numerators and denominators, and off-by-one errors.
- Second-pass LLM verifier (Haiku 4.5). It independently solves the question and flags FAIL when its answer disagrees with the claimed correct answer.
- Trust certification. Every question gets a Gold/Silver/Bronze score based on its weighted gate-pass rate.
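The recompute idea can be sketched for one narrow case, a quadratic equation. The post says the real layer uses mathjs; the version below substitutes the closed-form quadratic formula in plain TypeScript so that it stays self-contained. The `QuadraticQuestion` shape and the tolerance are assumptions for illustration only.

```typescript
// Hypothetical record for a "solve a*x^2 + b*x + c = 0" question.
interface QuadraticQuestion {
  a: number;
  b: number;
  c: number;
  storedRoots: number[]; // the roots claimed by the stored answer
}

// Re-solve the equation independently of the stored answer.
// Returns the real roots in ascending order.
function solveQuadratic(a: number, b: number, c: number): number[] {
  if (a === 0) throw new Error("not a quadratic: a must be non-zero");
  const disc = b * b - 4 * a * c;
  if (disc < 0) return []; // no real roots
  if (disc === 0) return [-b / (2 * a)];
  const s = Math.sqrt(disc);
  return [(-b - s) / (2 * a), (-b + s) / (2 * a)].sort((x, y) => x - y);
}

// Compare the recomputed roots to the stored ones, ignoring order.
function recomputeMatches(q: QuadraticQuestion, tol = 1e-9): boolean {
  const fresh = solveQuadratic(q.a, q.b, q.c);
  const stored = [...q.storedRoots].sort((x, y) => x - y);
  if (fresh.length !== stored.length) return false;
  return fresh.every((root, i) => Math.abs(root - (stored[i] ?? NaN)) <= tol);
}
```

For x^2 - 5x + 6 = 0 the recomputed roots are 2 and 3, so a stored answer of {-2, -3} (a sign flip) fails the comparison while {3, 2} passes.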
The audit we ran this week
We took every external user who had a low-score session (<30% correct) or abandoned a session in the past 30 days. That gave us 25 users, 67 problematic sessions, and 860 distinct questions that those users encountered.
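The selection step might look like the sketch below. The `Session` fields, the choice to compute the score over answered questions, and the function names are assumptions for illustration. The filter for external users is left out.

```typescript
// Hypothetical session record; field names are illustrative assumptions.
interface Session {
  userId: string;
  endedAt: Date;
  questionIds: string[]; // questions in the order the user saw them
  answeredCount: number;
  correctCount: number;
  abandoned: boolean;
}

// A session is problematic if it scored under 30% correct or was abandoned.
function isProblematic(s: Session): boolean {
  const lowScore = s.answeredCount > 0 && s.correctCount / s.answeredCount < 0.3;
  return lowScore || s.abandoned;
}

// Collect the users, sessions, and distinct questions inside the audit window.
function collectAuditSet(sessions: Session[], now: Date, windowDays = 30) {
  const cutoff = now.getTime() - windowDays * 24 * 60 * 60 * 1000;
  const flagged = sessions.filter(
    (s) => s.endedAt.getTime() >= cutoff && isProblematic(s),
  );
  const users = new Set<string>();
  const questions = new Set<string>();
  for (const s of flagged) {
    users.add(s.userId);
    for (const id of s.questionIds) questions.add(id);
  }
  return { users, sessions: flagged, questions };
}
```

Using sets for users and questions gives the distinct counts directly, which is how a question seen in several sessions is counted only once.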
For each question, we ran every current gate. Here's what came back:
| Status | Count | % |
|---|---|---|
| Already unapproved by prior sweeps (cannot reappear) | 638 | 74% |
| Approved + gate-clean now | 222 | 26% |
| Still approved but failing current gates | 0 | 0% |
The users with negative feedback saw a pool that was 74% broken by today's standards. Every one of those broken questions has since been removed from circulation.
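The three statuses in the audit can be expressed as a small classification plus a percentage summary. This is a sketch under assumptions: the `PoolQuestion` shape and the `failsCurrentGates` callback are hypothetical stand-ins for the real pool record and gate runner.

```typescript
type AuditStatus = "unapproved" | "approved-clean" | "approved-failing";

// Hypothetical pool record; only the approval flag matters here.
interface PoolQuestion {
  id: string;
  approved: boolean;
}

// Unapproved questions cannot reappear, so they are bucketed first;
// approved questions are then re-checked against the current gates.
function classify(
  q: PoolQuestion,
  failsCurrentGates: (q: PoolQuestion) => boolean,
): AuditStatus {
  if (!q.approved) return "unapproved";
  return failsCurrentGates(q) ? "approved-failing" : "approved-clean";
}

// Count each status and express it as a rounded percentage of the total.
function summarize(statuses: AuditStatus[]) {
  const counts: Record<AuditStatus, number> = {
    "unapproved": 0,
    "approved-clean": 0,
    "approved-failing": 0,
  };
  for (const s of statuses) counts[s] += 1;
  const total = statuses.length;
  const pct = (n: number) => (total === 0 ? 0 : Math.round((100 * n) / total));
  return {
    total,
    counts,
    percent: {
      "unapproved": pct(counts["unapproved"]),
      "approved-clean": pct(counts["approved-clean"]),
      "approved-failing": pct(counts["approved-failing"]),
    },
  };
}
```

With the counts from the table (638, 222, and 0 out of 860), the rounded percentages come out to 74, 26, and 0. The number to watch is the third bucket: anything above zero means a question known to fail a gate can still be served.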
What the worst experience looked like
The most polluted experience belonged to a user who joined April 12. They saw 273 distinct questions across 9 sessions in PRECALCULUS and INTRODUCTORY_SOCIOLOGY. 252 of those 273 — 92% — have since been unapproved. That's the experience that drives a user to never come back.
A more recent user who joined in May saw 18 questions across BIOLOGY and COLLEGE_ALGEBRA. Zero of them have been flagged since. Same product, four weeks later, completely different experience.
Why this matters for trust
The hardest part of building AI-generated prep is convincing students you're not the platform that's going to break on them. The honest answer is: everyone produces broken questions sometimes — what matters is what you do about them.
Our answer is:
- Block at generation time via 60+ gates.
- Catch retroactively via nightly sweeps against the entire approved pool.
- Replay the sessions of users who had a bad experience to verify that nothing they saw is still poisoning the pool.
The third step is the one most platforms skip. It's also the only one that actually rebuilds trust with a user who already had a bad experience.
What's next
We're packaging the deterministic gate layer as an open-source npm package, so other edtech platforms can run the same validations against their question banks. If you're building exam prep and tired of shipping broken AI content, watch this space.
If you're a student who hit a bad question on PrepLion in the past — sorry. Try us again. The pool you'll see now is different.