Rubrics
Calibrate AI Scores Against Teacher Judgment
Calibrating first prevents grading drift and reduces manual correction after batch runs.
Why calibration matters
Even strong models need assignment-specific setup. Calibration ensures the AI is applying your rubric the way you would, especially around partial credit and borderline responses.
Without calibration, small ambiguities can compound across a full class set. A short calibration pass up front usually saves substantial correction time later.
- Primary goal: Align scoring behavior to teacher intent before full-scale grading.
- Secondary goal: Improve feedback quality so comments are specific and instruction-ready.
- Best timing: Calibrate once per new assignment type, major rubric revision, or model change.
Choose a representative sample
Select a small set that reflects the real spread of student performance: high, middle, and developing responses. Include at least one ambiguous response that typically creates disagreement among graders.
A practical sample size is 6-12 submissions. If your class has multiple sections or accommodations that change response patterns, include examples from each context.
- Include: One near-perfect response, several mid-range responses, and one clearly below-standard response.
- Avoid: Sampling only top work, which hides partial-credit edge cases.
- Tip: Label each sample with your teacher score before running AI so comparison stays objective (a minimal labeling sketch follows this list).
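To keep the comparison objective, it helps to record teacher scores before the AI sees anything. The snippet below is a minimal sketch of such a labeled calibration set; the field names, band labels, and 4-point scale are assumptions, not a required format.

```python
# Minimal sketch of a labeled calibration set. Field names, band labels, and
# the 4-point scale are illustrative assumptions, not a required format.
calibration_set = [
    {"submission_id": "S01", "band": "high",       "teacher_score": 4.0},
    {"submission_id": "S07", "band": "middle",     "teacher_score": 3.0},
    {"submission_id": "S11", "band": "middle",     "teacher_score": 2.5},
    {"submission_id": "S15", "band": "developing", "teacher_score": 1.5},
    {"submission_id": "S22", "band": "ambiguous",  "teacher_score": 2.0},  # the response graders disagree on
]

# Scores are recorded before any AI run, so the later comparison is blind to AI output.
assert all(s["teacher_score"] is not None for s in calibration_set)
```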
Compare AI vs teacher scores
Compare both the final score and the criterion-level reasoning. A matching score with weak rationale still signals a rubric clarity issue that can show up later at scale.
Track mismatches by pattern, not paper by paper. Look for repeated issues such as consistent over-scoring of evidence quality or under-scoring of reasoning depth; a tally sketch follows the checklist below.
- Check 1: Does the AI identify the same evidence you used?
- Check 2: Are partial-credit decisions consistent with your rubric language?
- Check 3: Do feedback comments name the missing skill clearly?
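One way to make the pattern-level comparison concrete is a small tally script. The sketch below assumes criterion-level numeric scores and a 0.5-point tolerance; both are placeholders to adapt to your own rubric and scale.

```python
from collections import Counter

# Hypothetical criterion-level scores for two submissions. Criterion names,
# the point scale, and the 0.5 tolerance are assumptions; match them to your rubric.
pairs = [
    # (teacher_scores, ai_scores)
    ({"evidence": 3.0, "reasoning": 2.5}, {"evidence": 4.0, "reasoning": 1.5}),
    ({"evidence": 2.0, "reasoning": 3.0}, {"evidence": 3.0, "reasoning": 3.0}),
]

TOLERANCE = 0.5  # differences at or below this are treated as agreement

def mismatch_patterns(teacher, ai, tolerance=TOLERANCE):
    """Label each criterion outside tolerance as over- or under-scored."""
    labels = []
    for criterion, t_score in teacher.items():
        diff = ai[criterion] - t_score
        if abs(diff) > tolerance:
            labels.append(f"{criterion}: {'over-scored' if diff > 0 else 'under-scored'}")
    return labels

# Tally across the whole calibration set so single outliers don't drive rubric edits.
counts = Counter(label for t, a in pairs for label in mismatch_patterns(t, a))
print(counts.most_common())  # e.g. [('evidence: over-scored', 2), ('reasoning: under-scored', 1)]
```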
Adjust one rule at a time
Revise one rubric element or instruction at a time, then retest. Isolating changes makes it obvious which edit improved or worsened alignment.
Most calibration fixes come from clarifying criteria wording, strengthening level descriptors, and tightening partial-credit boundaries.
- Start with criteria clarity: Replace broad phrases like "good analysis" with observable evidence requirements (see the before-and-after sketch following this list).
- Then refine levels: Define exactly what separates full credit from partial credit.
- Finally refine feedback rules: Require one strength and one actionable next step tied to the criterion.
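As an illustration of the first bullet, here is a hypothetical before-and-after for a single criterion descriptor. The criterion name and wording are invented, and only one element changes so the next calibration run isolates its effect.

```python
# Hypothetical criterion descriptor, before and after one targeted revision.
# The criterion name and wording are illustrative assumptions.
criterion_before = {
    "name": "analysis",
    "full_credit": "Good analysis of the source material.",  # too broad to score consistently
}

criterion_after = {
    "name": "analysis",
    "full_credit": (
        "Explains how at least two cited pieces of evidence support the claim "
        "and names one limitation of that evidence."
    ),
}

# Apply this one change, re-run the same calibration set, and compare mismatch
# tallies against the previous run before editing any other rubric element.
```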
Re-run and lock settings
After revisions, run a second calibration set before full grading. If alignment is stable, lock your model, rubric, and instruction settings for that assignment batch.
Locking settings prevents hidden drift caused by mid-run prompt edits or model switching. Document the final version so you can reuse the same calibrated setup next term; a gate-and-save sketch follows the checklist below.
- Quality gate: Proceed when most scores are within your acceptable difference threshold.
- Consistency gate: Confirm criterion-level rationale is repeatable across similar responses.
- Operational gate: Save the calibrated configuration as a reusable template.
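Putting the three gates together, a short script can check agreement and save the locked configuration. The thresholds, field names, and file path below are assumptions to adapt to your own setup, not a required format.

```python
import json

# Gate check and saved configuration template (illustrative values throughout).
score_diffs = [0.0, 0.5, 0.5, 1.5, 0.0, 0.5]   # |AI - teacher| per calibration submission
TOLERANCE = 0.5         # acceptable per-submission difference
AGREEMENT_TARGET = 0.8  # share of submissions that must fall within tolerance

within = sum(d <= TOLERANCE for d in score_diffs) / len(score_diffs)

if within >= AGREEMENT_TARGET:
    # Lock the calibrated setup so mid-run edits or model switches can't introduce drift.
    config = {
        "assignment": "argument-essay-unit-3",
        "model": "model-id-used-for-calibration",
        "rubric_version": "2025-01-rev2",
        "instructions_version": "v3",
        "tolerance": TOLERANCE,
    }
    with open("calibrated_grading_config.json", "w") as f:
        json.dump(config, f, indent=2)
else:
    print(f"Only {within:.0%} of scores within tolerance; revise the rubric and recalibrate.")
```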
