Closed-loop evolution — the thesis

“Sentry sees what code throws. Mushi sees what users feel — and closes the loop with AI.”

The vibe-coding era — AI-assisted development where you build and ship in hours — solved the creation bottleneck. The new bottleneck is selection: out of everything you could fix next, which changes actually make the app better for real users? Without a closed feedback loop, you’re flying blind. This page is the long-form argument for why the loop matters and how Mushi implements it.

For a shorter version see The evolution loop. For the manifesto see the EVOLUTION-LOOP.md doc.

The NTSB moment software is missing

In 1977 two Boeing 747s collided on a foggy runway in Tenerife. 583 people died. In the years that followed, aviation did something unusual: it systematically refused to let a crash be a tragedy and nothing more. Accident investigators such as the NTSB (the US National Transportation Safety Board) dissected every failure, named every cause, and encoded each cause as a rule the next generation of pilots, controllers, and aircraft designers had to obey.

The result is the safest form of transport in history. Not because pilots got braver, or planes got sturdier, but because every crash was turned into institutional memory.

Matthew Syed documents this contrast in Black Box Thinking (2015): aviation learned from failure; medicine, for most of its history, did not. Doctors who lost patients rarely analysed why. Complications were attributed to the complexity of the case, not to a systemic flaw that could be fixed and encoded. The mortality rate reflected the difference.

Software development is closer to pre-NTSB medicine than to aviation. Every production bug is technically a crash. Most get fixed and forgotten. The knowledge that this class of bug appeared, under these conditions, in this pattern, simply vanishes. The next developer to join the project makes the same mistake. Six months later a user reports the same friction. The fix is almost identical. The cost is real.

Mushi is the NTSB layer software has been missing.

“In a world where crashes are hidden, disasters accumulate. In a world where crashes are dissected, progress compounds.” — Matthew Syed, Black Box Thinking (2015), p. 9

Why the linear SDLC is the wrong unit

The textbook software development lifecycle runs one way: PM defines spec → dev implements → QA tests → ships → user experiences → (maybe) bug reported → ticket created → fixed in a future sprint.

At every junction information is lost, delayed, or distorted:

The PM’s spec is a theory about user behaviour, not evidence from it.
The dev’s implementation is tested against the spec, not against the real user journey.
QA tests are mostly happy paths — the edge cases that cause real user friction are rarely in the test plan.
The bug report, when it arrives, is a support ticket with no reproduction context, no screenshot, no stack trace, no session replay, no device info — just “it doesn’t work.”
By the time the fix lands, the developer who wrote the original code has often moved to a different project. The lesson is not encoded anywhere.

This is not a people problem. It is an information architecture problem. The loop does not close. Each iteration starts roughly where the last one started.

Figure: Random walk (no memory)

PM defines spec (months of research) → Dev implements (works on my machine) → QA signs off (happy paths only) → Ships to user (feedback? support ticket) → Bug reported (Jira ticket, someday) → ≥ 2 sprints later: fixed (maybe)

Fitness floor never rises. Mistakes repeat.

Figure: Cumulative selection

User feels friction (variation event) → Mushi captures (phenotype recorded) → Mistake DB clusters (selection pressure, coherence ≥ 0.75) → Lesson encoded (.mushi/lessons.json) → PR inherits genome (agent reads the ruleset) → Reporter credited ("Kenji helped fix this")

↻ selection with memory: each loop lifts the fitness floor

Fitness compounds. The codebase evolves.

Cumulative selection: why it works

Dawkins introduced cumulative selection in The Blind Watchmaker and developed it further in Climbing Mount Improbable. The key insight: single-step random search (pure chance) is hopeless at finding complex solutions. But selection with memory — where each generation keeps the improvements of the last — can reach solutions of arbitrary complexity, given enough iterations.

Applied to software:

A single bug report is noise. A cluster of similar reports is signal.
A cluster that reappears across releases is a systemic failure mode.
A systemic failure mode that is named, documented, and encoded as a rule the next developer sees before they repeat it is a permanent improvement to the development process.

This is cumulative selection. The software system gets permanently better with each encoded lesson — not just temporarily patched.

“Cumulative selection is the key to all of evolution. … Each improvement, however slight, is retained and passed on.” — Richard Dawkins, Climbing Mount Improbable (1996), p. 74

Mushi operationalises this: catch the bug → embed it → cluster similar bugs → name the cluster → promote it to a learning rule → inject the rule into the next PR review and the next agent fix → reward the user who found it → repeat.
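The cluster and promote steps can be sketched in a few lines. This is a minimal illustration, not Mushi's implementation: it uses a greedy nearest-centroid clusterer in place of the BIRCH-style one, and it assumes that "coherence" means the mean cosine similarity of a cluster's members to its centroid. The names `Cluster`, `assign`, `coherence`, `shouldPromote`, and the `joinThreshold` default are all hypothetical. Only the promotion thresholds (coherence ≥ 0.75, size ≥ 3) come from this page.

```typescript
// Hypothetical shapes; none of these names come from the Mushi codebase.
type Vec = number[];

interface Cluster {
  centroid: Vec;  // running mean of member embeddings
  members: Vec[]; // embeddings assigned so far
}

function cosine(a: Vec, b: Vec): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Assign one embedding to the most similar cluster, or open a new one.
// joinThreshold is an assumed tuning knob, not a documented Mushi value.
function assign(clusters: Cluster[], v: Vec, joinThreshold = 0.8): Cluster {
  let best: Cluster | undefined;
  let bestSim = -Infinity;
  for (const c of clusters) {
    const sim = cosine(c.centroid, v);
    if (sim > bestSim) { best = c; bestSim = sim; }
  }
  if (best === undefined || bestSim < joinThreshold) {
    const fresh: Cluster = { centroid: [...v], members: [v] };
    clusters.push(fresh);
    return fresh;
  }
  best.members.push(v);
  const n = best.members.length;
  // Incremental mean update: new = old + (v - old) / n
  best.centroid = best.centroid.map((x, i) => x + (v[i] - x) / n);
  return best;
}

// Assumed definition: mean cosine similarity of members to the centroid.
function coherence(c: Cluster): number {
  const total = c.members.reduce((sum, m) => sum + cosine(c.centroid, m), 0);
  return total / c.members.length;
}

// Promotion gate using the thresholds stated on this page.
function shouldPromote(c: Cluster): boolean {
  return c.members.length >= 3 && coherence(c) >= 0.75;
}
```

In this sketch a lone report never promotes, however vivid it is. That is the "single report is noise, cluster is signal" rule expressed as a size check.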

Antifragility as design goal

Taleb’s Antifragile (2012) distinguishes between:

Fragile: breaks under stress (most software without monitoring).
Robust: survives stress unchanged (good monitoring, alert fatigue).
Antifragile: gets stronger from stress.

A Mushi-equipped project is antifragile: each bug report not only gets fixed, it also strengthens the rules that prevent the next bug of the same class. The more bugs that appear, the richer the lesson library becomes. The more stress the system absorbs, the better it gets at absorbing the next unit of stress.

“Innovations emerge as a consequence of trial-and-error and then become encoded in heuristics and practical knowhow. … The more the system fails, the more it learns.” — Nassim Nicholas Taleb, Antifragile (2012), p. 230

What the loop looks like in practice

Figure: The Mushi closed loop. Selection nodes: Mistake DB and Learning rule.

End user (feels friction) → report → Mushi SDK (shake to report) → embed → Mistake DB (vector cluster) → promote → Learning rule (named + encoded) → inject → PR review (rule injected) → merge → Reward loop (credits reporter) → findings → Drift agent (walks live app) → regressions → PDCA agent (N iterations)

↻ thank: reporter credited → next session starts the loop again

1. A user feels friction (a dead button, a slow screen, a layout that breaks on one phone) and shakes the app or clicks the feedback stamp. (Source: SDK, or Anti-gaming for synthetic probes.)
2. Mushi SDK captures the moment: screenshot, session breadcrumbs, device info, the current route. (Visible in Reports & triage.)
3. The server embeds the report with text-embedding-3-small and runs the BIRCH-style streaming clusterer. (Architecture: classify-report edge function.)
4. Similar reports collapse into a mistake_cluster. When the cluster reaches coherence ≥ 0.75 and size ≥ 3, the judge model promotes it to a lesson with an auto-generated name, summary, and one-shot rule. (Visible in Judge dashboard and Lessons.)
5. The lesson is injected into the next PR review (token-budget packed to ≤ 3 000 tokens) and into the next agentic fix run, so both human reviewers and the AI agent see the rule before they can repeat the same class of mistake. (Triggered from Fix orchestrator or autonomously via Iterate (PDCA).)
6. The user is credited in the changelog: “Fixed by Kenji: the Settings page back-button was double-counting the safe-area inset”. The SDK shows a toast on the next session. (Managed in Releases and Rewards.)
7. Drift and anomaly detectors run continuously, generating candidate reports for issues no user has noticed yet. Those reports enter the same pipeline at step 2, so the loop closes without waiting for a human to file a ticket. (See Drift scanner and Anomaly detection.)
8. The PDCA loop runs on a schedule or on demand: crawl the live URL, plan improvements, fix them, verify. (See the Iterate (PDCA) page; powered by the pdca-runner edge function.)
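The injection step above packs lessons into a budget of ≤ 3 000 tokens. One plausible way to do that is greedy packing by priority, sketched below. Everything here is an assumption made for illustration: the `Lesson` shape, ranking by `clusterSize`, and the rough four-characters-per-token estimate (a real system would use the model's tokenizer). Only the 3 000-token ceiling comes from this page.

```typescript
// Hypothetical lesson shape; the real .mushi/lessons.json schema may differ.
interface Lesson {
  title: string;
  rule: string;
  clusterSize: number; // number of reports behind the lesson
}

// Crude estimate: about 4 characters per token. Not a real tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Greedy packing: most-reported lessons first; skip any lesson that
// would push the running total past the budget.
function packLessons(lessons: Lesson[], budget = 3000): Lesson[] {
  const ranked = [...lessons].sort((a, b) => b.clusterSize - a.clusterSize);
  const packed: Lesson[] = [];
  let used = 0;
  for (const lesson of ranked) {
    const cost = estimateTokens(`${lesson.title}: ${lesson.rule}`);
    if (used + cost > budget) continue;
    packed.push(lesson);
    used += cost;
  }
  return packed;
}
```

Greedy packing is not optimal in the knapsack sense, but it is predictable: the lessons with the most evidence behind them are the first to reach the reviewer or the agent.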

The three intellectual anchors

Book	Principle	Mushi equivalent
Black Box Thinking — Syed (2015)	Every failure is data; name it, encode it, inherit it	`mistake_clusters` → `lessons` → `.mushi/lessons.json`
Antifragile — Taleb (2012)	Systems improve by failing, learning, failing again	Each encoded lesson makes the next agent run less likely to repeat the mistake
Climbing Mount Improbable — Dawkins (1996)	Selection with memory reaches solutions random search cannot	The lesson library grows with every bug; each iteration starts from a better baseline

What this means for the roadmap

Every feature in the Mushi roadmap is either:

A sensor: something that generates more signal about where the system is failing (rewards that incentivise reporting, drift agents that find latent failures, A/B tests that detect which version performs better).
A processor: something that turns raw signal into named, structured knowledge (the mistake clusterer, the judge coherence gate, the PDCA loop, anomaly detection).
An effector: something that injects knowledge back into the development process (lesson injection into PRs, changelog attribution, the .mushi/lessons.json repo file, the PDCA draft PR).

The loop is: sense → process → effect → sense again. This is the NTSB model applied to software. It is how cumulative selection works. It is why Mushi is not just a feedback widget — it is an antifragile development infrastructure layer.
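The three roles can be read as function types wired into a single pass. The sketch below is only a way to picture the split. The type names (`FrictionSignal`, `Knowledge`, `SenseFn`, `ProcessFn`, `EffectFn`) and the toy counting processor are hypothetical and do not reflect Mushi's actual interfaces.

```typescript
// Hypothetical role types; names are illustrative, not Mushi's API.
interface FrictionSignal {
  source: string;  // e.g. "sdk" or "drift-agent"
  symptom: string; // normalised description of the friction
}

interface Knowledge {
  rule: string;
  evidenceCount: number;
}

type SenseFn = () => FrictionSignal[];
type ProcessFn = (signals: FrictionSignal[]) => Knowledge[];
type EffectFn = (knowledge: Knowledge[]) => void;

// One pass of sense → process → effect.
function runLoopOnce(sensors: SenseFn[], processor: ProcessFn, effectors: EffectFn[]): Knowledge[] {
  let signals: FrictionSignal[] = [];
  for (const sense of sensors) signals = signals.concat(sense());
  const knowledge = processor(signals);
  for (const effect of effectors) effect(knowledge);
  return knowledge;
}

// Toy processor: a symptom seen at least minCount times becomes a rule.
function countingProcessor(minCount: number): ProcessFn {
  return (signals) => {
    const counts: { [symptom: string]: number } = {};
    for (const s of signals) counts[s.symptom] = (counts[s.symptom] || 0) + 1;
    return Object.keys(counts)
      .filter((k) => counts[k] >= minCount)
      .map((k) => ({ rule: `Avoid: ${k}`, evidenceCount: counts[k] }));
  };
}
```

Keeping the roles separate means a new sensor (say, another drift agent) adds signal without touching how knowledge is produced or delivered.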

See Roadmap for the implementation schedule. See Architecture for the wire-level sequence diagrams.