
Fix quality scores

The Judge page shows how well Mushi’s plain-English reads match reality over time. If classification quality is slipping — or a new prompt is ready to promote — you’ll see it here before bad reads reach your queue.


Trailing score panel

The top of the page shows:

  • 30-day mean composite: the rolling average of reports.judge_score across all classifications in your project for the last 30 days.
  • 7-day moving average: plotted as a line chart so you can spot week-over-week drift at a glance.
  • Yesterday vs. trailing baseline: the delta between yesterday’s mean and the trailing baseline, highlighted in green / amber / red. If the delta drops below the drift threshold (default: −0.10), a judge.drift alert has already fired.
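
The three figures above can be approximated offline from exported scores. The following is a minimal sketch, not Mushi's implementation. It assumes you already have one mean judge score per day (oldest first, last entry is yesterday), that the baseline is the seven days before yesterday, and that an alert fires when the delta is strictly below the threshold. The function names are hypothetical, and the 30-day figure here averages daily means rather than individual classifications.

```python
from statistics import mean

DRIFT_THRESHOLD = -0.10  # default drift threshold quoted in the text


def moving_average(values, window=7):
    """Simple moving average; one output point per full window."""
    return [
        mean(values[i - window + 1 : i + 1])
        for i in range(window - 1, len(values))
    ]


def trailing_summary(daily_means, threshold=DRIFT_THRESHOLD):
    """Summarise daily mean scores (oldest first, last entry = yesterday).

    Assumption: the baseline is the mean of up to seven days before yesterday.
    """
    if len(daily_means) < 2:
        raise ValueError("need at least two days of scores")
    yesterday = daily_means[-1]
    baseline = mean(daily_means[-8:-1])  # the days preceding yesterday
    delta = yesterday - baseline
    return {
        "mean_30d": mean(daily_means[-30:]),  # approximation: mean of daily means
        "baseline_7d": baseline,
        "delta": delta,
        "drift": delta < threshold,  # assumption: strict comparison
    }
```

Calling `trailing_summary` on a week of stable scores followed by a sharp drop yields `drift` set to `True`, which corresponds to the red highlight described above.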

Per-component breakdown

A table ranked by mean composite score per component, showing:

| Column | Description |
| --- | --- |
| Component | The component tag the classifier assigned |
| Mean composite | Average judge score for this component’s reports |
| Volume | Report count in the last 30 days |
| Accuracy | Mean accuracy sub-score |
| Severity calibration | Mean severity_calibration sub-score |
| Component tagging | Mean component_tagging sub-score (a lower score means the classifier struggles here) |
| Repro quality | Mean repro_quality sub-score |

Sort by any column to find which components the classifier handles worst. These are the best candidates for a targeted fine-tune export.
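
If you export report rows, the same breakdown can be rebuilt locally. This sketch assumes each report is a dict with a `component` key (a hypothetical field name), a `judge_score`, and the four sub-scores named in the table. It sorts ascending so the weakest components come first, which is an assumption about the ranking direction.

```python
from collections import defaultdict
from statistics import mean

# Sub-score names as they appear in the table above.
SUB_SCORES = (
    "accuracy",
    "severity_calibration",
    "component_tagging",
    "repro_quality",
)


def component_breakdown(reports):
    """Group reports by component and average each score column."""
    grouped = defaultdict(list)
    for report in reports:
        grouped[report["component"]].append(report)

    rows = []
    for component, items in grouped.items():
        row = {
            "component": component,
            "mean_composite": mean(r["judge_score"] for r in items),
            "volume": len(items),
        }
        for key in SUB_SCORES:
            row[key] = mean(r[key] for r in items)
        rows.append(row)

    # Lowest mean composite first: likely candidates for a fine-tune export.
    rows.sort(key=lambda row: row["mean_composite"])
    return rows
```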


Prompt A/B status

Every active A/B experiment shows:

| Column | Description |
| --- | --- |
| Candidate version | The stage1_prompt_version or stage2_prompt_version under test |
| Sample size | Classifications scored so far in this window |
| Candidate mean | Current mean judge score for the candidate slice |
| Baseline mean | Concurrent mean for the active production prompt |
| Delta | Candidate − Baseline |
| 95% CI | Confidence interval; auto-promotes when CI excludes zero and delta ≥ 0.05 |
| Status | running · promoted · demoted · inconclusive |

Click Promote to manually promote a candidate before the automated threshold is reached (useful when a prompt change is clearly winning but the CI window hasn’t closed). Click Demote to abort an experiment.
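
The auto-promotion rule in the table (95% CI excludes zero and delta ≥ 0.05) can be sketched as follows. The page does not say how the interval is computed, so this example assumes a normal-approximation interval on the difference of two independent means; the function name and return shape are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

Z_95 = 1.96  # two-sided 95% critical value under a normal approximation


def ab_decision(candidate, baseline, min_delta=0.05):
    """Check the promotion rule for two lists of judge scores.

    Assumption: unequal-variance standard error with a normal approximation.
    Each list needs at least two scores so the sample variance is defined.
    """
    delta = mean(candidate) - mean(baseline)
    std_err = sqrt(
        variance(candidate) / len(candidate)
        + variance(baseline) / len(baseline)
    )
    low, high = delta - Z_95 * std_err, delta + Z_95 * std_err
    excludes_zero = low > 0 or high < 0
    return {
        "delta": delta,
        "ci": (low, high),
        # Both conditions from the table must hold.
        "promote": excludes_zero and delta >= min_delta,
    }
```

A candidate whose interval excludes zero but whose delta is only 0.02 would not be promoted under this rule, which is the case where a manual Promote decision may still apply.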


Drift alerts

The drift alert panel shows the last 30 judge.drift events with:

  • Timestamp of the alert.
  • The daily mean that triggered it.
  • The trailing 7-day baseline it was compared against.
  • Whether the drift was subsequently resolved (mean recovered) or is still open.
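
For scripts that consume these events, the four fields listed above map onto a small record type. This is an illustrative sketch only: the class and field names are hypothetical and do not reflect Mushi's actual event schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DriftAlert:
    """One judge.drift event, mirroring the fields listed in the panel."""

    fired_at: datetime   # timestamp of the alert
    daily_mean: float    # the daily mean that triggered it
    baseline_7d: float   # the trailing 7-day baseline it was compared against
    resolved: bool       # True once the mean has recovered, False while open

    @property
    def delta(self) -> float:
        """Daily mean minus baseline; negative when quality dropped."""
        return self.daily_mean - self.baseline_7d


def latest_alerts(alerts, limit=30):
    """Return the most recent alerts, newest first (the panel shows 30)."""
    return sorted(alerts, key=lambda a: a.fired_at, reverse=True)[:limit]
```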

Alerts route to Slack if you’ve configured the integration in Settings → Integrations → Slack. Each alert includes a direct link back to this panel.

