Fine-tuning
The Fine-Tuning page guides you through exporting your project’s best-scored classifications, training a fine-tuned variant, validating it against an offline benchmark, and promoting it to replace the active production prompt — all without leaving the console.
Pipeline
Validation refuses to promote if the candidate judge mean doesn't beat production by ≥ 0.05 with p < 0.05.
Each stage is visible as a stepper in the UI. You can pause between stages — for example, export the dataset, review it offline, and come back to start training later.
Stage details
exporting → exported
Click Export dataset to queue a dataset export. The exporter:
- Fetches the last N classifications (configurable, default 2 000) where
judge_score >= 0.8— only high-confidence examples train the model. - Converts each classification into a fine-tune training pair:
{ prompt: <raw report text>, completion: <category, severity, component, tags> }. - Validates the export schema with
@mushi-mushi/inventory-schema’s classifier format. - Uploads the JSONL file to your BYO Storage bucket (or Mushi-managed storage).
The exported file URL appears in the Exports tab for 30 days.
training → trained
Click Start training to kick off a fine-tune job with your provider:
| Provider | Model | Notes |
|---|---|---|
| OpenAI | gpt-4o-mini | Default. Fast, cheap, good for classification. |
| Anthropic | claude-haiku-3-5 | Higher accuracy on nuanced severity calibration. |
| Custom | Any OpenAI-compatible endpoint | Self-hosted models via fine_tune_endpoint in project settings. |
Training runs asynchronously. The stepper polls the provider’s job status every 60 seconds and updates the UI.
validating → validated / rejected
When training completes, the validation harness runs automatically:
- Eval accuracy — classification accuracy on a held-out test set (20% of the export).
- Severity calibration delta — how much the model’s severity distribution drifts from the ground-truth labels.
- Component tagging F1 — precision × recall for component assignment.
- Cost-per-classification projection — estimated cost per 1 000 classifications at the provider’s current pricing.
A candidate is only promotable if its judge-score mean beats the current production prompt’s mean by ≥ 0.05 with p < 0.05. If the candidate loses, the UI offers Reject with a one-line reason. Rejected runs are archived in the History tab for future reference.
promoted
Click Promote to replace the active production prompt with the validated
candidate. The promotion is atomic — a database transaction flips the
active_prompt_version column and writes a prompt.promoted audit event.
The old prompt is demoted to archived status; it can be re-promoted from the
History tab if needed.
History tab
Every fine-tune run — promoted, rejected, or abandoned — appears in the History tab with:
- Start and end timestamps
- Export row count and held-out test accuracy
- Judge mean (candidate vs. baseline at time of validation)
- Promotion outcome and, for rejections, the rejection reason
- A Restore button that re-promotes a previously archived prompt
Cost projection
The validation report includes a cost-per-classification projection based on the provider’s current per-token pricing. This is an estimate using the median report length in your project. Actual costs vary with report complexity.
See also
- Concepts → Judge & self-improvement — how the nightly judge, A/B testing, and fine-tune pipeline fit together.
- Admin → Judge dashboard — the per-component score breakdown that helps you pick the right export slice.
- Admin → Intelligence reports — weekly digest that includes classifier quality metrics.