Judge dashboard
The Judge page surfaces:
- Daily mean composite score over the trailing 30 days, with a 7-day moving average.
- Per-component breakdown of judge scores, showing which prompts struggle.
- Prompt-A/B status — every active candidate with its sample size, current win/loss vs. the active baseline, and confidence interval.
- Drift alerts — when the daily mean drops more than 0.10 from the trailing 7-day baseline.
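
The drift-alert rule in the last bullet can be sketched as a small check. This is an illustration only, not the dashboard's actual implementation: the function name `drift_alert`, its parameters, and the choice to compute the baseline as the plain mean of the seven daily means preceding today (excluding today itself) are all assumptions. Only the 0.10 threshold and the 7-day window come from the text above.

```python
from statistics import mean

# Values taken from the dashboard description.
DRIFT_THRESHOLD = 0.10
BASELINE_DAYS = 7


def drift_alert(daily_means, threshold=DRIFT_THRESHOLD, window=BASELINE_DAYS):
    """Return True when the latest daily mean has dropped more than
    `threshold` below the trailing baseline.

    `daily_means` is ordered oldest first; the last entry is today.
    Assumption: the baseline is the mean of the `window` days before
    today, so today's value does not dilute its own baseline.
    """
    if len(daily_means) < window + 1:
        # Not enough history to form a full baseline; stay silent.
        return False
    today = daily_means[-1]
    baseline = mean(daily_means[-(window + 1):-1])
    return (baseline - today) > threshold


# Hypothetical data: a stable week at 0.80, then a drop to 0.65.
history = [0.80] * 7 + [0.65]
print(drift_alert(history))
```

Only drops trigger the alert in this sketch; a rise of the same size is ignored, which matches the "drops more than 0.10" wording.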