Skip to Content
v0.8.0 · shippedNative mobile SDKs, optional Sentry enrichment, and bring-your-own keys/storage. Read the changelog →
Admin consoleFine-tuning
Fine-tuning jobs queue beside prompt versions · · open live demo ↗

Fine-tuning

The Fine-Tuning page guides you through exporting your project’s best-scored classifications, training a fine-tuned variant, validating it against an offline benchmark, and promoting it to replace the active production prompt — all without leaving the console.


Pipeline

Each stage is visible as a stepper in the UI. You can pause between stages — for example, export the dataset, review it offline, and come back to start training later.


Stage details

exporting → exported

Click Export dataset to queue a dataset export. The exporter:

  1. Fetches the last N classifications (configurable, default 2 000) where judge_score >= 0.8 — only high-confidence examples train the model.
  2. Converts each classification into a fine-tune training pair: { prompt: <raw report text>, completion: <category, severity, component, tags> }.
  3. Validates the export schema with @mushi-mushi/inventory-schema’s classifier format.
  4. Uploads the JSONL file to your BYO Storage bucket (or Mushi-managed storage).

The exported file URL appears in the Exports tab for 30 days.

training → trained

Click Start training to kick off a fine-tune job with your provider:

ProviderModelNotes
OpenAIgpt-4o-miniDefault. Fast, cheap, good for classification.
Anthropicclaude-haiku-3-5Higher accuracy on nuanced severity calibration.
CustomAny OpenAI-compatible endpointSelf-hosted models via fine_tune_endpoint in project settings.

Training runs asynchronously. The stepper polls the provider’s job status every 60 seconds and updates the UI.

validating → validated / rejected

When training completes, the validation harness runs automatically:

  • Eval accuracy — classification accuracy on a held-out test set (20% of the export).
  • Severity calibration delta — how much the model’s severity distribution drifts from the ground-truth labels.
  • Component tagging F1 — precision × recall for component assignment.
  • Cost-per-classification projection — estimated cost per 1 000 classifications at the provider’s current pricing.

A candidate is only promotable if its judge-score mean beats the current production prompt’s mean by ≥ 0.05 with p < 0.05. If the candidate loses, the UI offers Reject with a one-line reason. Rejected runs are archived in the History tab for future reference.

Click Promote to replace the active production prompt with the validated candidate. The promotion is atomic — a database transaction flips the active_prompt_version column and writes a prompt.promoted audit event. The old prompt is demoted to archived status; it can be re-promoted from the History tab if needed.


History tab

Every fine-tune run — promoted, rejected, or abandoned — appears in the History tab with:

  • Start and end timestamps
  • Export row count and held-out test accuracy
  • Judge mean (candidate vs. baseline at time of validation)
  • Promotion outcome and, for rejections, the rejection reason
  • A Restore button that re-promotes a previously archived prompt

Cost projection

The validation report includes a cost-per-classification projection based on the provider’s current per-token pricing. This is an estimate using the median report length in your project. Actual costs vary with report complexity.


See also

Last updated on