
Judge & self-improvement

Mushi doesn’t ship a static prompt. The classifier improves continuously via four loops:

1. Judge

judge-batch runs nightly on a sample of yesterday’s classifications. A separate judge model (typically a different family from the classifier — by default Anthropic Sonnet judging Anthropic Haiku, with OpenAI gpt-4o as fallback) scores each sampled classification on every component:

| Score | Weight |
| --- | --- |
| accuracy | 0.35 |
| severity_calibration | 0.25 |
| component_tagging | 0.20 |
| repro_quality | 0.20 |

The composite lands on the report itself (reports.judge_score) and persists in classification_evaluations for audit.
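The composite is a weighted mean over the four components. A minimal sketch, with the weights taken from the table above (the helper name and the sample scores are illustrative, not part of Mushi’s API):

```python
# Component weights, mirroring the judge's scoring table.
WEIGHTS = {
    "accuracy": 0.35,
    "severity_calibration": 0.25,
    "component_tagging": 0.20,
    "repro_quality": 0.20,
}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted mean of per-component judge scores in [0, 1]."""
    return round(sum(WEIGHTS[k] * scores[k] for k in WEIGHTS), 4)

# Example: a report judged strong on accuracy but weak on repro steps.
print(composite_score({
    "accuracy": 0.9,
    "severity_calibration": 0.8,
    "component_tagging": 1.0,
    "repro_quality": 0.7,
}))  # 0.855
```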

2. Prompt A/B testing

Each classification stage carries a stage1_prompt_version and stage2_prompt_version. The candidate prompt runs on a slice of traffic (default 5%) for a configurable window. When the candidate’s mean judge score wins by ≥ 0.05 with a 95% confidence interval that doesn’t include zero, the candidate is promoted automatically and the active prompt is demoted. All this is project-scoped — your prompts never leak into another project’s A/B counters.
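The promotion rule above can be sketched as follows, assuming a normal approximation for the confidence interval on the difference of means (the function name and sample handling are illustrative; Mushi’s actual statistics may differ):

```python
import math

def should_promote(candidate: list[float], active: list[float],
                   min_lift: float = 0.05, z: float = 1.96) -> bool:
    """Promote when the candidate wins by >= min_lift and the 95% CI
    on the difference of mean judge scores excludes zero."""
    mean_c = sum(candidate) / len(candidate)
    mean_a = sum(active) / len(active)
    lift = mean_c - mean_a
    # Standard error of the difference of means (Welch-style).
    var_c = sum((x - mean_c) ** 2 for x in candidate) / (len(candidate) - 1)
    var_a = sum((x - mean_a) ** 2 for x in active) / (len(active) - 1)
    se = math.sqrt(var_c / len(candidate) + var_a / len(active))
    ci_low = lift - z * se
    return lift >= min_lift and ci_low > 0

# Clear win: large lift, CI well clear of zero.
print(should_promote([0.9, 0.85, 0.95, 0.9], [0.6, 0.65, 0.55, 0.6]))  # True
```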

3. Fine-tune export

The Fine-Tuning page in the admin console lets you export the best-scoring classifications, validate the export against an open-eval harness, and promote a fine-tuned variant to production with a single click. The full pipeline:

exporting → exported → training → trained → validating → validated → promoted (or rejected at validation)
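The pipeline above is a linear state machine with a single branch at validation. A minimal sketch of the allowed transitions (the table and helper are illustrative, not Mushi’s internal schema):

```python
# Legal transitions for the fine-tune export pipeline; validation is
# the only state that can branch (to validated or rejected).
TRANSITIONS = {
    "exporting": {"exported"},
    "exported": {"training"},
    "training": {"trained"},
    "trained": {"validating"},
    "validating": {"validated", "rejected"},
    "validated": {"promoted"},
}

def advance(state: str, target: str) -> str:
    """Move to target, rejecting any transition not in the table."""
    if target not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {target}")
    return target
```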

Validation runs an offline benchmark and refuses to promote a candidate whose judge mean is below the current production prompt’s mean.
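That gate reduces to a single comparison. A hedged sketch, assuming the offline benchmark yields per-classification judge scores for the candidate (the function name is illustrative):

```python
def validate(candidate_scores: list[float], production_mean: float) -> str:
    """Refuse to promote a candidate whose mean judge score is below
    the current production prompt's mean."""
    candidate_mean = sum(candidate_scores) / len(candidate_scores)
    return "validated" if candidate_mean >= production_mean else "rejected"

print(validate([0.9, 0.8], production_mean=0.84))  # validated (mean 0.85)
```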

4. Drift detection

If yesterday’s mean composite drops > 0.10 vs. the trailing 7-day mean, judge-batch posts a judge.drift Slack alert. This is the first signal that an upstream model or the prompts themselves regressed.
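The drift check is a simple threshold against the trailing baseline. A minimal sketch, with an illustrative function name and sample values:

```python
DRIFT_THRESHOLD = 0.10  # alert when yesterday falls this far below baseline

def drift_alert(yesterday_mean: float, trailing_7d: list[float]) -> bool:
    """True when yesterday's mean composite drops more than
    DRIFT_THRESHOLD below the trailing 7-day mean."""
    baseline = sum(trailing_7d) / len(trailing_7d)
    return (baseline - yesterday_mean) > DRIFT_THRESHOLD

# Baseline 0.85; a drop to 0.70 (-0.15) fires the judge.drift alert.
print(drift_alert(0.70, [0.85] * 7))  # True
```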
