Rubrics
Calibrate AI Scores Against Teacher Judgment
Calibrating first prevents grading drift and reduces manual correction after batch runs.
Why calibration matters
Even strong models need assignment-specific setup. Calibration ensures the AI is applying your rubric the way you would, especially around partial credit and borderline responses.
Without calibration, small ambiguities can compound across a full class set. A short calibration pass up front usually saves substantial correction time later.
- Primary goal: Align scoring behavior to teacher intent before full-scale grading.
- Secondary goal: Improve feedback quality so comments are specific and instruction-ready.
- Best timing: Calibrate once per new assignment type, major rubric revision, or model change.
Choose a representative sample
Select a small set that reflects the real spread of student performance: high, middle, and developing responses. Include at least one ambiguous response that typically creates disagreement among graders.
A practical sample size is 6-12 submissions. If your class has multiple sections or accommodations that change response patterns, include examples from each context.
- Include: One near-perfect response, several mid-range responses, and one clearly below-standard response.
- Avoid: Sampling only top work, which hides partial-credit edge cases.
- Tip: Label each sample with your teacher score before running AI so comparison stays objective (a minimal labeling sketch follows this list).
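To keep the comparison objective, it helps to record teacher scores before the AI sees anything. The snippet below is a minimal sketch of such a labeled calibration set; the field names, band labels, and 4-point scale are assumptions, not a required format.

```python
# Minimal sketch of a labeled calibration set. Field names, band labels, and
# the 4-point scale are illustrative assumptions, not a required format.
calibration_set = [
    {"submission_id": "S01", "band": "high",       "teacher_score": 4.0},
    {"submission_id": "S07", "band": "middle",     "teacher_score": 3.0},
    {"submission_id": "S11", "band": "middle",     "teacher_score": 2.5},
    {"submission_id": "S15", "band": "developing", "teacher_score": 1.5},
    {"submission_id": "S22", "band": "ambiguous",  "teacher_score": 2.0},  # the response graders disagree on
]

# Scores are recorded before any AI run, so the later comparison is blind to AI output.
assert all(s["teacher_score"] is not None for s in calibration_set)
```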
Compare AI vs teacher scores
Compare both the final score and the criterion-level reasoning. A matching score with weak rationale still signals a rubric clarity issue that can show up later at scale.
Track mismatches by pattern, not paper by paper. Look for repeated issues such as consistent over-scoring of evidence quality or under-scoring of reasoning depth; a tally sketch follows the checklist below.
- Check 1: Does the AI identify the same evidence you used?
- Check 2: Are partial-credit decisions consistent with your rubric language?
- Check 3: Do feedback comments name the missing skill clearly?
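One way to make the pattern-level comparison concrete is a small tally script. The sketch below assumes criterion-level numeric scores and a 0.5-point tolerance; both are placeholders to adapt to your own rubric and scale.

```python
from collections import Counter

# Hypothetical criterion-level scores for two submissions. Criterion names,
# the point scale, and the 0.5 tolerance are assumptions; match them to your rubric.
pairs = [
    # (teacher_scores, ai_scores)
    ({"evidence": 3.0, "reasoning": 2.5}, {"evidence": 4.0, "reasoning": 1.5}),
    ({"evidence": 2.0, "reasoning": 3.0}, {"evidence": 3.0, "reasoning": 3.0}),
]

TOLERANCE = 0.5  # differences at or below this are treated as agreement

def mismatch_patterns(teacher, ai, tolerance=TOLERANCE):
    """Label each criterion outside tolerance as over- or under-scored."""
    labels = []
    for criterion, t_score in teacher.items():
        diff = ai[criterion] - t_score
        if abs(diff) > tolerance:
            labels.append(f"{criterion}: {'over-scored' if diff > 0 else 'under-scored'}")
    return labels

# Tally across the whole calibration set so single outliers don't drive rubric edits.
counts = Counter(label for t, a in pairs for label in mismatch_patterns(t, a))
print(counts.most_common())  # e.g. [('evidence: over-scored', 2), ('reasoning: under-scored', 1)]
```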
Adjust one rule at a time
Revise one rubric element or instruction at a time, then retest. Isolating changes makes it obvious which edit improved or worsened alignment.
Most calibration fixes come from clarifying criteria wording, strengthening level descriptors, and tightening partial-credit boundaries.
- Start with criteria clarity: Replace broad phrases like "good analysis" with observable evidence requirements (see the before-and-after sketch following this list).
- Then refine levels: Define exactly what separates full credit from partial credit.
- Finally refine feedback rules: Require one strength and one actionable next step tied to the criterion.
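As an illustration of the first bullet, here is a hypothetical before-and-after for a single criterion descriptor. The criterion name and wording are invented, and only one element changes so the next calibration run isolates its effect.

```python
# Hypothetical criterion descriptor, before and after one targeted revision.
# The criterion name and wording are illustrative assumptions.
criterion_before = {
    "name": "analysis",
    "full_credit": "Good analysis of the source material.",  # too broad to score consistently
}

criterion_after = {
    "name": "analysis",
    "full_credit": (
        "Explains how at least two cited pieces of evidence support the claim "
        "and names one limitation of that evidence."
    ),
}

# Apply this one change, re-run the same calibration set, and compare mismatch
# tallies against the previous run before editing any other rubric element.
```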
Re-run and lock settings
After revisions, run a second calibration set before full grading. If alignment is stable, lock your model, rubric, and instruction settings for that assignment batch.
Locking settings prevents hidden drift caused by mid-run prompt edits or model switching. Document the final version so you can reuse the same calibrated setup next term; a gate-and-save sketch follows the checklist below.
- Quality gate: Proceed when most scores are within your acceptable difference threshold.
- Consistency gate: Confirm criterion-level rationale is repeatable across similar responses.
- Operational gate: Save the calibrated configuration as a reusable template.
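Putting the three gates together, a short script can check agreement and save the locked configuration. The thresholds, field names, and file path below are assumptions to adapt to your own setup, not a required format.

```python
import json

# Gate check and saved configuration template (illustrative values throughout).
score_diffs = [0.0, 0.5, 0.5, 1.5, 0.0, 0.5]   # |AI - teacher| per calibration submission
TOLERANCE = 0.5         # acceptable per-submission difference
AGREEMENT_TARGET = 0.8  # share of submissions that must fall within tolerance

within = sum(d <= TOLERANCE for d in score_diffs) / len(score_diffs)

if within >= AGREEMENT_TARGET:
    # Lock the calibrated setup so mid-run edits or model switches can't introduce drift.
    config = {
        "assignment": "argument-essay-unit-3",
        "model": "model-id-used-for-calibration",
        "rubric_version": "2025-01-rev2",
        "instructions_version": "v3",
        "tolerance": TOLERANCE,
    }
    with open("calibrated_grading_config.json", "w") as f:
        json.dump(config, f, indent=2)
else:
    print(f"Only {within:.0%} of scores within tolerance; revise the rubric and recalibrate.")
```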
