Fine-tuning

The Fine-Tuning page guides you through exporting your project’s best-scored classifications, training a fine-tuned variant, validating it against an offline benchmark, and promoting it to replace the active production prompt — all without leaving the console.

Pipeline

Fine-tune pipeline

exporting

export CSV from admin

exported

S3 / BYO storage

training

provider fine-tune job

trained

checkpoint ready

validating

offline eval harness

validated

judge mean wins +0.05

promoted

replaces active prompt

rejected

archived with reason

Validation refuses to promote if the candidate judge mean doesn't beat production by ≥ 0.05 with p < 0.05.

Each stage is visible as a stepper in the UI. You can pause between stages — for example, export the dataset, review it offline, and come back to start training later.

Stage details

exporting → exported

Click Export dataset to queue a dataset export. The exporter:

Fetches the last N classifications (configurable, default 2 000) where judge_score >= 0.8 — only high-confidence examples train the model.
Converts each classification into a fine-tune training pair: { prompt: <raw report text>, completion: <category, severity, component, tags> }.
Validates the export schema with @mushi-mushi/inventory-schema’s classifier format.
Uploads the JSONL file to your BYO Storage bucket (or Mushi-managed storage).

The exported file URL appears in the Exports tab for 30 days.

training → trained

Click Start training to kick off a fine-tune job with your provider:

Provider	Model	Notes
OpenAI	`gpt-4o-mini`	Default. Fast, cheap, good for classification.
Anthropic	`claude-haiku-3-5`	Higher accuracy on nuanced severity calibration.
Custom	Any OpenAI-compatible endpoint	Self-hosted models via `fine_tune_endpoint` in project settings.

Training runs asynchronously. The stepper polls the provider’s job status every 60 seconds and updates the UI.

validating → validated / rejected

When training completes, the validation harness runs automatically:

Eval accuracy — classification accuracy on a held-out test set (20% of the export).
Severity calibration delta — how much the model’s severity distribution drifts from the ground-truth labels.
Component tagging F1 — precision × recall for component assignment.
Cost-per-classification projection — estimated cost per 1 000 classifications at the provider’s current pricing.

A candidate is only promotable if its judge-score mean beats the current production prompt’s mean by ≥ 0.05 with p < 0.05. If the candidate loses, the UI offers Reject with a one-line reason. Rejected runs are archived in the History tab for future reference.

promoted

Click Promote to replace the active production prompt with the validated candidate. The promotion is atomic — a database transaction flips the active_prompt_version column and writes a prompt.promoted audit event. The old prompt is demoted to archived status; it can be re-promoted from the History tab if needed.

History tab

Every fine-tune run — promoted, rejected, or abandoned — appears in the History tab with:

Start and end timestamps
Export row count and held-out test accuracy
Judge mean (candidate vs. baseline at time of validation)
Promotion outcome and, for rejections, the rejection reason
A Restore button that re-promotes a previously archived prompt

Cost projection

The validation report includes a cost-per-classification projection based on the provider’s current per-token pricing. This is an estimate using the median report length in your project. Actual costs vary with report complexity.

Fine-tuning

Pipeline

Stage details

exporting → exported

training → trained

validating → validated / rejected

promoted

History tab

Cost projection

See also