Prompt lab
Route: /prompt-lab
Scenario: Your fix-worker has been generating patches that touch the wrong file 20% of the time. You suspect the Stage 2 classifier is mislabelling the component field, so the fix agent is looking at the wrong area of the codebase. You want to try a new prompt, test it on real traffic, and promote it only if the judge confirms it’s better than what’s live.
The Prompt lab is where you iterate on the LLM prompts that power every Mushi pipeline. You never need to redeploy an edge function: change a prompt here and it takes effect immediately on the share of traffic you assign to it.
The 4-step workflow
The ribbon at the top of the page is the mental model for every prompt change:
1 Baseline → 2 Clone & edit → 3 A/B test → 4 Promote
Step 1 — Baseline: Every stage has one active prompt running at 100% traffic. This is your current production behaviour.
Step 2 — Clone & edit: Click Clone on the active prompt. A fork is created at 0% traffic — it won’t affect any live requests yet. Click Edit to open the full template editor and make your changes.
Step 3 — A/B test: Use Set traffic % to gradually increase the candidate’s share (try 5% first, then 20%, then 50%). judge-batch runs overnight and scores both versions. Come back the next day and compare the Avg judge score column for the two versions.
Step 4 — Promote: When the candidate’s score consistently beats the baseline over several judge runs, click Activate to set it to 100% and demote the old version.
judge-batch runs once per night. A/B tests typically need 2–3 nights of data before
the scores are statistically meaningful. Don’t promote after a single night.
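The promotion rule above can be sketched as a small check. This is a minimal illustration, not Mushi's implementation: the `NightlyScore` shape, the `shouldPromote` and `nextTrafficStep` helpers, and the default of two nights are assumptions made for the example. It reads "consistently beats the baseline" as "wins on every recorded night".

```typescript
// One night's judge-batch result for a baseline/candidate pair.
// Hypothetical shape: the Prompt lab itself only shows an aggregated "Avg judge score".
interface NightlyScore {
  baseline: number;
  candidate: number;
}

// True only when there are at least `minNights` nights of data
// and the candidate beat the baseline on every one of them.
function shouldPromote(nights: NightlyScore[], minNights = 2): boolean {
  if (nights.length < minNights) return false;
  return nights.every((n) => n.candidate > n.baseline);
}

// Ramp suggested in Step 3: 5%, then 20%, then 50%.
const RAMP_STEPS: number[] = [5, 20, 50];

// Next traffic share to try, or null once the ramp is finished.
function nextTrafficStep(current: number): number | null {
  const next = RAMP_STEPS.find((step) => step > current);
  return next === undefined ? null : next;
}
```

With a single night of data `shouldPromote` returns `false` regardless of the scores, which mirrors the advice not to promote after one night.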
KPI tiles
| Tile | What to watch |
|---|---|
| Active prompts | Should equal the number of pipeline stages — a gap means a stage has no active prompt and is using a hardcoded fallback |
| Candidates | How many prompts are currently in A/B test |
| Best judge score | The highest score any candidate has achieved — if it’s not beating the baseline, the edits aren’t working |
| Eval dataset size | More labelled examples = more reliable judge scores. Aim for > 100 labelled examples per stage |
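The Active prompts check can also be done by hand against a list of prompts. The sketch below is illustrative only: the `PromptRow` shape borrows the `stage` and `isActive` field names from the API section, while the stage identifiers and the `stagesWithoutActivePrompt` helper are hypothetical.

```typescript
// Hypothetical row shape; only `stage` and `isActive` are needed for this check.
interface PromptRow {
  stage: string;
  isActive: boolean;
}

// Returns the stages that have no active prompt. Per the KPI table,
// such a stage is running on a hardcoded fallback.
function stagesWithoutActivePrompt(stages: string[], prompts: PromptRow[]): string[] {
  const activeStages = new Set(
    prompts.filter((p) => p.isActive).map((p) => p.stage)
  );
  return stages.filter((stage) => !activeStages.has(stage));
}
```

An empty result means the Active prompts tile should match the number of pipeline stages.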
Stage tabs
Each tab is one pipeline stage. Start with the stage that’s causing your problem:
| Stage | Prompt controls |
|---|---|
| Stage 1 (fast-filter) | Spam detection — edit if legitimate reports are being dropped |
| Stage 2 (classify) | Category / severity / component — edit if triage labels are wrong |
| Judge | Accuracy scoring — edit if judge scores don’t match your human assessment |
| Fix-worker | Code patch generation — edit if fixes keep touching the wrong files |
| Intelligence digest | Weekly report format — edit for style or content changes |
| NL→SQL planner | Natural-language query translation — edit if query results are off |
Prompt table columns explained
| Column | What it tells you |
|---|---|
| Version name | Your label — use descriptive names like “v2-component-field-fix” |
| Active | The green badge = this is running in production |
| Avg judge score | Higher is better. Compare candidate vs baseline to decide |
| Evaluations | How many times judge-batch has scored this version — low count = unreliable score |
| Traffic % | What fraction of live requests this prompt is serving |
Actions per row
| Action | When to use it |
|---|---|
| Clone | Starting point for any edit. Always clone; never edit the active prompt directly |
| Edit | Open the full template editor for a cloned (non-active) prompt |
| Diff | Side-by-side comparison against the active prompt — use before promoting |
| Set traffic % | Gradually increase a candidate’s traffic share |
| Activate | Set to 100% and demote all other versions — do this after A/B data confirms it’s better |
| Delete | Remove a candidate that didn’t work out — not available for active prompts |
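The Activate and Delete rules in the table can be modelled as pure functions over the versions of a single stage. This is a sketch under assumptions: the `PromptVersion` shape reuses the `isActive` and `trafficPercentage` names from the API section, and "demote" is taken to mean inactive at 0% traffic, which the page implies but does not spell out.

```typescript
// Hypothetical in-memory view of one stage's prompt versions.
interface PromptVersion {
  id: string;
  isActive: boolean;
  trafficPercentage: number;
}

// Activate one version: it goes to 100% and every other version is demoted.
// Assumption: a demoted version becomes inactive with 0% traffic.
function activate(versions: PromptVersion[], id: string): PromptVersion[] {
  if (!versions.some((v) => v.id === id)) {
    throw new Error(`unknown prompt id: ${id}`);
  }
  return versions.map((v) =>
    v.id === id
      ? { ...v, isActive: true, trafficPercentage: 100 }
      : { ...v, isActive: false, trafficPercentage: 0 }
  );
}

// Delete is not available for active prompts.
function canDelete(version: PromptVersion): boolean {
  return !version.isActive;
}
```

After `activate`, exactly one version is active and the traffic shares still sum to 100%.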
Common tasks
Fixing wrong component classification
- Open the Stage 2 (classify) tab.
- Click Clone on the active prompt.
- Click Edit → in the template, add explicit examples of the component field format your codebase uses.
- Set traffic to 5% via Set traffic %.
- Wait at least 2 nights. Check the Avg judge score column: is the candidate beating the baseline?
- If yes: diff → activate. If not: edit and retry.
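The same steps could in principle be scripted against the endpoints listed in the API section. The sketch below only builds request descriptions and sends nothing. The paths and body field names come from that section; everything else is an assumption: the helper names, the `"classify"` stage identifier, the idea that Clone maps to a `POST`, and the 0 to 100 range check.

```typescript
// A request description; sending it (base URL, auth) is left to the caller.
interface PromptLabRequest {
  method: "POST" | "PATCH";
  path: string;
  body: Record<string, unknown>;
}

const PROMPTS_PATH = "/v1/admin/prompt-lab/prompts";

// Create a candidate prompt for a stage (assumed equivalent of Clone + Edit).
function createCandidate(stage: string, name: string, promptTemplate: string): PromptLabRequest {
  return { method: "POST", path: PROMPTS_PATH, body: { stage, name, promptTemplate } };
}

// Assign a traffic share to a candidate (assumed equivalent of Set traffic %).
function setTraffic(id: string, trafficPercentage: number): PromptLabRequest {
  if (trafficPercentage < 0 || trafficPercentage > 100) {
    throw new RangeError("trafficPercentage must be between 0 and 100");
  }
  return {
    method: "PATCH",
    path: `${PROMPTS_PATH}/${encodeURIComponent(id)}`,
    body: { trafficPercentage },
  };
}

// Promote a candidate (assumed equivalent of Activate).
function activatePrompt(id: string): PromptLabRequest {
  return {
    method: "PATCH",
    path: `${PROMPTS_PATH}/${encodeURIComponent(id)}`,
    body: { isActive: true },
  };
}
```

For example, `setTraffic(candidateId, 5)` describes the 5% step, and `activatePrompt(candidateId)` describes the final promotion once the judge data supports it.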
Investigating a bad judge score
- Open the stage tab where the score is low.
- Click the low-score prompt → look at Evaluations count. If < 10, the score isn’t meaningful yet — wait.
- If evaluations are > 20 and the score is still low: the prompt is structurally wrong. Clone and start fresh.
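The thresholds above can be written down as a small triage function. This is a sketch under assumptions: the page does not define what counts as a "low" score or what to do between 10 and 20 evaluations, so the `lowScore` default of 0.5 and the `"inconclusive"` verdict are placeholders chosen for the example.

```typescript
type ScoreVerdict = "wait" | "inconclusive" | "rewrite" | "ok";

// avgScore: the Avg judge score column; evaluations: the Evaluations column.
// lowScore is a hypothetical cut-off; adjust it to your own score scale.
function triageScore(avgScore: number, evaluations: number, lowScore = 0.5): ScoreVerdict {
  // Fewer than 10 evaluations: the score is not meaningful yet.
  if (evaluations < 10) return "wait";
  if (avgScore >= lowScore) return "ok";
  // Low score with more than 20 evaluations: clone and start fresh.
  // Between 10 and 20 the page gives no rule, so this sketch reports "inconclusive".
  return evaluations > 20 ? "rewrite" : "inconclusive";
}
```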
Building the eval dataset
A large eval dataset makes judge scores reliable. The Synthetic Reports card lets you generate artificial reports that match your real report patterns — useful for quickly expanding coverage without waiting for real users.
Supporting cards
Fine-tuning jobs
Links to the Fine-tuning page. Once your eval dataset is large enough (typically 500+ labelled examples), you can export it to train a smaller, faster, cheaper custom model that you own.
Eval dataset
Shows total and labelled counts. The labelled count matters — only labelled examples contribute to judge scoring. Label examples by triaging real reports through the normal Reports flow — each triage action generates a labelled example automatically.
API
GET /v1/admin/prompt-lab
POST /v1/admin/prompt-lab/prompts { stage, name, promptTemplate }
PATCH /v1/admin/prompt-lab/prompts/:id { isActive, trafficPercentage, promptTemplate }
DELETE /v1/admin/prompt-lab/prompts/:id
Related pages
- Judge dashboard — nightly judge scores and A/B tournament results
- Fine-tuning — export the eval dataset to train a custom model
- Integration health — verify LLM keys before testing prompt changes