There are many ways to build agentic loops. This is a simplified version of what I do. The tools are moving fast, and some of this may even be outdated by the time you finish reading this deck!
Take the principles, not the syntax. The shape of the loop matters more than which file does what.
“AI hallucinates.” “It can't do X.” “We've seen it fail at other companies.”
Put an automatic-only driver in this car and they spin it into the wall...then tell you why it's undrivable. Put a pro in the same seat, and it's the fastest thing on Earth. Both are correct.
The most responsible-sounding teams are often the least practiced.
Low proficiency: before the reps are in
Output disappoints: a hallucination, some slop
Distrust hardens: must be the tool
Confirmation bias: knew I couldn't rely on it
Retreat to old ways: known beats unknown
Every stage of this loop masquerades as good judgment, but it is really a signal of low proficiency and low agency. Distrust is a status report on your scaffolding.
And the hallucination fear assumes the model is always guessing...it isn't. Hold that thought.
We're shifting the burden from the model to specification & scaffolding.
Model
The engine
Raw capability. Necessary, but the part everyone talks about.
+
Files
The judgment
Standards, specs, and memory. Where your engineering lives.
+
Loop
The discipline
Build, verify, document (repeat until done).
Take any one away and it stops working. And most of what makes it good isn't the model. It's the files (and what we'll cover in the rest of this deck).
Only one of these can hallucinate. Knowing which is what makes the loop trustworthy.
Generate: can hallucinate
Text from weights
The model writes the most plausible next tokens from what it learned. Brilliant for synthesis and code — and the only place a guess can sneak in.
prompt→model→plausible text
Call: returns a fact
A tool runs, for real
The model emits a structured request; a deterministic system runs it — code, a search, a test — and hands back ground truth the model then reads.
request→tool runs→fact→model reads
In a loop, most of the work is tool calls returning ground truth, not the model free-associating. run-qa’s Playwright test actually ran; the model didn’t decide it passed.
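The split between generating and calling can be sketched in a few lines of JavaScript. This is only an illustration, not how any particular agent runtime works: the `tools` registry, the `{ tool, args }` request shape, and `handleToolRequest` are hypothetical names chosen for this sketch.

```javascript
// Hypothetical tool registry: each tool is plain, deterministic code.
const tools = {
  // Stand-in for "run a test" or "search a file".
  add: ({ a, b }) => a + b,
  // Counts the lines in a string.
  countLines: ({ text }) => text.split("\n").length,
};

// The model only emits a structured request; this function runs it.
// The returned value is produced by code, not written from weights.
function handleToolRequest(request) {
  const tool = tools[request.tool];
  if (!tool) {
    return { ok: false, error: `unknown tool: ${request.tool}` };
  }
  return { ok: true, result: tool(request.args) };
}
```

The model then reads the returned object as input to its next turn; it never decides what the result was.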
Executes the spec for a single feature / sprint. Then done.
The Builder is just a Claude Code session — no special framework, no persistent agent. Spun up fresh for one sprint, then thrown away. Its power is that nothing carries over.
Stateless
Clean brain every turn — no context rot from the last sprint.
Self-checking
It calls run-qa on its own work before it ever declares done.
Hands off
Writes its memory to PROGRESS.md so the next session can pick up.
One call — run-qa — runs all three QA agents and returns a single verdict: GREEN or RED. RED three times and it stops for a human.
run-qa
→
Lint
→
Review
→
Test
→
GREEN
↺ RED → read failures → fix → re-run (max 3 attempts, then mark blocked & stop)
.claude/skills/run-qa/SKILL.md
---
name: run-qa
description: Full QA pass — lint, review,
e2e. Returns one GREEN/RED verdict.
---
# run-qa
Run these in order, each as a subagent:
1. Delegate to linter.
2. Delegate to code-reviewer.
3. Delegate to playwright-tester.
Verdict: any lint error, CRITICAL
finding, or failed spec → RED.
Otherwise → GREEN.
If RED: fix, re-run from step 1.
Stop after 3 attempts; report to human.
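The verdict rule in this skill is mechanical enough to write down as a function. The sketch below assumes hypothetical result shapes for the three subagents (the deck does not define them), and the `MINOR` severity is invented purely to contrast with `CRITICAL`.

```javascript
// Assumed result shapes, for illustration only:
//   lint:   { errors: string[] }
//   review: { findings: [{ severity: "CRITICAL" | "MINOR", message: string }] }
//   test:   { failures: [{ name: string, message: string }] }
function verdict({ lint, review, test }) {
  const failed = [
    // Any lint error counts.
    ...lint.errors.map((e) => `lint: ${e}`),
    // Only CRITICAL review findings count toward RED.
    ...review.findings
      .filter((f) => f.severity === "CRITICAL")
      .map((f) => `review: ${f.message}`),
    // Any failed spec counts.
    ...test.failures.map((f) => `test: ${f.name}: ${f.message}`),
  ];
  return { status: failed.length === 0 ? "GREEN" : "RED", failed };
}
```

Returning the `failed` list alongside the status is what lets the caller act on the specific failures rather than on a bare colour.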
RED doesn't stop the loop — it routes back to the Builder.
On RED, the verdict carries the actual failures — the Builder reads them, patches, and re-runs. Three attempts, then it stops. RED isn't an opinion — it's a failed assertion; the fix targets a fact, not a feeling.
RED → read failures · fix · re-run
The Builder
Write & fix
Builds the feature. On a RED return, reads the failures and patches the code.
→
run-qa
Verify
Lint · review · test. Returns one verdict — and the failures with it.
→
RED: loops back ↑
GREEN: exits → Document
Three attempts, then it stops. Still RED after the third try → mark the sprint [!] blocked and hand back to a human. Autonomous, never infinite.
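The bounded retry described above can be sketched as a small control loop. This is a simplified, synchronous illustration: `build`, `runQa`, and `fix` are hypothetical callbacks standing in for what the Builder session actually does, and `runQa` is assumed to return `{ status, failed }`.

```javascript
// Runs one sprint: build once, then verify up to maxAttempts times.
// Returns the roadmap mark to write: "[x]" on GREEN, "[!]" when blocked.
function runSprint({ build, runQa, fix, maxAttempts = 3 }) {
  build();
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = runQa();
    if (result.status === "GREEN") {
      return { mark: "[x]", attempts: attempt };
    }
    // Patch only if another QA attempt remains; otherwise fall through.
    if (attempt < maxAttempts) {
      fix(result.failed);
    }
  }
  // Still RED after the last attempt: stop and hand back to a human.
  return { mark: "[!]", attempts: maxAttempts };
}
```

The hard cap on the loop counter is what keeps the run autonomous without letting it spin forever.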
$ DEMO_MODE=planted ./run.sh # guaranteed Sprint-3 RED → GREEN
A fresh session spins up for each sprint — it builds the feature, runs the QA gate, ticks the roadmap, and hands off. Then the next sprint starts with a clean brain. You write specs; the loop writes code.
A full run takes up to ten minutes. Watch the console scroll by if you like, or step away, stretch, refill your water, and let the loop cook. It doesn't need you for this part.
A skill file holds a procedure, not facts — how to do a recurring job: run the three checks, return one verdict. The method becomes reusable and versioned, kept separate from any single task it runs on.
Each subagent gets its own context window and exactly one narrow job. The reviewer never sees the builder’s scratch work; the tester starts fresh. Narrow scope means a cheaper model and no cross-contamination.
A model’s context is finite and degrades as it fills. So every sprint launches a new session with an empty window — the roadmap and PROGRESS files carry state forward, not the model’s fading memory.
That’s the clean brain per sprint → roadmap.md + PROGRESS.md
When a session learns something durable — a gotcha, a convention — it writes it back into the broad context. The next session starts smarter. The system quietly edits its own instructions.
Generation is fuzzy by nature — synthesis, design, judgment. Verification isn’t: the assertion passes or it doesn’t. You never trust the model’s opinion of its own work; you run a tool that returns a fact.
# Click Farm
## Identity & Mission
You maintain the Click Farm game. Done = the feature
works, QA is GREEN, nothing else broke.
## Map
One index.html (inline CSS + JS). State lives in
localStorage. Specs are in specs/.
State: localStorage key clickfarm = { score, farmhands }.
Stable ids: #score, #harvest, #hire, #farmhands, #cost.
## Conventions
Vanilla HTML/CSS/JS — no frameworks, no build step.
Mobile-first, system fonts, one accent color.
## Running it
Open index.html in a browser to play. run-qa to check.
No package manager.
## Guardrails
Never add dependencies or a build tool. Don't edit
other sprints' specs. Ask before changing the schema.
## Definition of Done
Not done until run-qa returns GREEN.
If RED, read it, fix it, re-run. Never ship on RED.
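For the state described under Map (localStorage key `clickfarm` holding `{ score, farmhands }`), a load/save pair might look like the sketch below. The function names are hypothetical, and falling back to a fresh game on missing or invalid data is an assumption, not something the file specifies. The storage object is passed in so the same code works with `window.localStorage` in the browser or a stub in a test.

```javascript
const STORAGE_KEY = "clickfarm";

// Reads state from a Storage-like object (window.localStorage in the browser).
// Assumption: anything missing or malformed resets to a fresh game.
function loadState(storage) {
  const fresh = { score: 0, farmhands: 0 };
  const raw = storage.getItem(STORAGE_KEY);
  if (raw === null) return fresh;
  try {
    const parsed = JSON.parse(raw);
    const valid =
      Number.isInteger(parsed.score) && Number.isInteger(parsed.farmhands);
    return valid
      ? { score: parsed.score, farmhands: parsed.farmhands }
      : fresh;
  } catch {
    // Covers invalid JSON and non-object values such as "null".
    return fresh;
  }
}

// Writes only the two schema fields, so stray properties never leak in.
function saveState(storage, state) {
  const { score, farmhands } = state;
  storage.setItem(STORAGE_KEY, JSON.stringify({ score, farmhands }));
}
```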
# Sprint Turn
Read roadmap.md and pick the first sprint marked [ ].
Read that sprint's spec and PROGRESS.md for context.
Build the feature per the spec.
Then run the run-qa skill until GREEN (max 3 attempts).
When GREEN:
- Mark the sprint [x] in roadmap.md.
- Append a handoff to PROGRESS.md: what you built,
key decisions, gotchas, what the next sprint needs.
If you can't reach GREEN after 3 attempts:
- Mark the sprint [!] in roadmap.md, then stop.
Updating the status is your final action — never skip it.
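The roadmap bookkeeping in this prompt amounts to two small text operations: find the first open sprint, then rewrite its marker. The sketch below assumes each sprint is a line of the form `- [ ] Sprint 2: Farmhands`; the deck shows the `[ ]`, `[x]`, and `[!]` markers but not the exact line layout, so that format and both function names are assumptions.

```javascript
// Assumed roadmap line format: "- [ ] Title", "- [x] Title" or "- [!] Title".
const SPRINT_LINE = /^- \[( |x|!)\] (.+)$/;

// Returns the title of the first sprint still marked [ ], or null if none.
function nextSprint(roadmap) {
  for (const line of roadmap.split("\n")) {
    const m = SPRINT_LINE.exec(line);
    if (m && m[1] === " ") return m[2];
  }
  return null;
}

// Returns a copy of the roadmap with the named sprint's marker replaced.
// mark is "x" for done or "!" for blocked.
function markSprint(roadmap, title, mark) {
  return roadmap
    .split("\n")
    .map((line) => {
      const m = SPRINT_LINE.exec(line);
      return m && m[2] === title ? `- [${mark}] ${title}` : line;
    })
    .join("\n");
}
```

Because the marker lives in a file, a fresh session can recover the loop's position without any memory of the previous one.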
# Sprint 2: Farmhands (auto-harvest)
## What to build
A "Hire Farmhand" button (cost 10). Hiring deducts
10 and adds a farmhand. Each farmhand earns
+1 score/sec automatically.
## Acceptance Criteria
- Clicking #hire with score ≥ 10 deducts 10 and adds a farmhand (shown in #farmhands)
- With ≥1 farmhand, #score rises ~1/sec on its own
- #cost shows the hire cost (10)
## Technical Notes
- New ids: #hire, #farmhands, #cost
- setInterval 1000ms; persist state on each tick
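One way to satisfy this spec is to keep the game rules as pure functions and wire them to the DOM separately. The sketch below is illustrative, not the code the Builder produces: `hire` and `tick` are hypothetical names, and state is the `{ score, farmhands }` object from the project map.

```javascript
const HIRE_COST = 10; // flat cost in Sprint 2

// Next state after a click on #hire.
function hire(state) {
  if (state.score < HIRE_COST) return state; // cannot afford: no change
  return { score: state.score - HIRE_COST, farmhands: state.farmhands + 1 };
}

// Next state after one second: each farmhand earns +1 score.
function tick(state) {
  return { ...state, score: state.score + state.farmhands };
}
```

In the page, `tick` would run inside a `setInterval` callback with a 1000 ms delay, persisting state after each call as the Technical Notes require.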
# Sprint 3: Scaling Cost
## What to build
Farmhands get more expensive the more you own.
cost = floor(10 * 1.15 ^ farmhands). #cost
shows the NEXT price, updates after each hire.
## Acceptance Criteria
- #cost increases after each hire (per the formula)
- When score < cost, clicking #hire does nothing
(score & farmhands unchanged)
- Score never goes negative; hiring at score == cost succeeds
## Technical Notes
- Integer math, Math.floor
- Guard lives in the acceptance test; the bug is
planted via the fixture swap, not the spec
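The cost formula and the affordability guard from this spec translate directly into code. The sketch below shows the guarded behaviour the acceptance criteria describe, not the planted bug; `hireCost` and `hireScaled` are hypothetical names.

```javascript
// cost = floor(10 * 1.15 ^ farmhands), the price of the NEXT hire.
function hireCost(farmhands) {
  return Math.floor(10 * Math.pow(1.15, farmhands));
}

// Next state after a click on #hire with scaling cost.
function hireScaled(state) {
  const cost = hireCost(state.farmhands);
  // Guard: when score < cost, nothing changes. score == cost is allowed.
  if (state.score < cost) return state;
  return { score: state.score - cost, farmhands: state.farmhands + 1 };
}

// First few prices: 10, 11, 13, 15, 17, 20
```

Since the guard uses a strict less-than, a hire at exactly the cost leaves the score at 0, which satisfies both the "never negative" and the "score == cost succeeds" criteria.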
# Sprint 4: Polish
## What to build
Number formatting + click feedback.
## Acceptance Criteria
- Score ≥ 1000 shows abbreviated (1.2K / 3.4M / 5.6B);
below 1000 shows the integer
- Same formatting on #cost
- Underlying state stays an exact integer (display-only)
- #harvest pulses briefly on click
## Technical Notes
- Format on display only — never round the stored value
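A display-only formatter for these criteria could look like the sketch below. The spec does not say how to round, so truncating to one decimal (1999 shows as 1.9K) and always showing that decimal (1000 shows as 1.0K) are assumptions; `formatNumber` is a hypothetical name.

```javascript
// Formats a non-negative integer for display; the stored value is untouched.
function formatNumber(n) {
  const units = [
    [1e9, "B"],
    [1e6, "M"],
    [1e3, "K"],
  ];
  for (const [size, suffix] of units) {
    if (n >= size) {
      // Work in whole tenths to avoid floating-point surprises.
      const tenths = Math.floor((n * 10) / size);
      return `${Math.floor(tenths / 10)}.${tenths % 10}${suffix}`;
    }
  }
  return String(n); // below 1000: plain integer
}
```

Calling this only when writing to `#score` and `#cost` keeps the underlying state an exact integer.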
---
name: run-qa
description: Full QA pass — lint, review, e2e.
Returns one GREEN/RED verdict.
---
# run-qa
Run these in order. Each is a subagent; use its
returned message as the result.
1. Delegate to linter.
2. Delegate to code-reviewer.
3. Delegate to playwright-tester.
Verdict:
- Any lint error, any CRITICAL finding, or any
failed spec → RED. List what failed.
- Otherwise → GREEN.
If RED: fix the issues, re-run from step 1.
Stop after 3 attempts; then report to the human.
---
name: linter
description: Validates HTML structure and JS errors.
tools: Read, Edit, Bash
model: haiku
---
Run validation on index.html:
1. Check HTML is well-formed (tags, nesting)
2. Run node --check on inline script content
3. Check for unclosed strings, missing semicolons,
undefined variables
Auto-fix trivial violations (formatting, whitespace).
For judgment calls, leave it and report it.
Return "CLEAN" if no errors remain, else the list
of remaining errors with location.
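Step 2 of this agent, running `node --check` on inline script content, could be scripted roughly as follows under Node.js. This is a sketch under assumptions: the function names are hypothetical, the regex is a shortcut that suits one hand-written page rather than arbitrary HTML, and `node --check` only catches syntax errors, so undefined variables would still need a separate pass.

```javascript
const { spawnSync } = require("node:child_process");
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Pulls the bodies of <script> tags out of an HTML string.
// Not a full HTML parser; adequate for a single simple page.
function extractInlineScripts(html) {
  const scripts = [];
  const pattern = /<script\b[^>]*>([\s\S]*?)<\/script>/gi;
  let match;
  while ((match = pattern.exec(html)) !== null) {
    scripts.push(match[1]);
  }
  return scripts;
}

// Writes each inline script to a temp file and syntax-checks it.
// Returns "CLEAN" or a list of error strings.
function checkInlineScripts(html) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "lint-"));
  const errors = [];
  extractInlineScripts(html).forEach((source, index) => {
    const file = path.join(dir, `inline-${index}.js`);
    fs.writeFileSync(file, source);
    // process.execPath is the running node binary.
    const result = spawnSync(process.execPath, ["--check", file], {
      encoding: "utf8",
    });
    if (result.status !== 0) {
      const detail = result.stderr
        ? result.stderr.trim()
        : String(result.error);
      errors.push(`script ${index}: ${detail}`);
    }
  });
  fs.rmSync(dir, { recursive: true, force: true });
  return errors.length === 0 ? "CLEAN" : errors;
}
```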
---
name: playwright-tester
description: Runs behavioral tests against the app.
tools: Read, Write, Bash
model: sonnet
---
Write and run a Playwright test for the sprint's
acceptance criteria.
1. Create test-farm.spec.js from the sprint spec
2. Open index.html via file:// or a local server
3. Test each acceptance criterion
Key scenarios for Sprint 3:
- Click → score increments
- Hire with enough score → cost deducted, +1 farmhand
- Hire when you can't afford it → blocked, no change
- Hire at score == cost → succeeds, score ≥ 0
- Score never goes below 0
- State persists across refresh
Return "GREEN" if all pass, else "RED" plus, per
failure: test name, failed assertion, actual result.
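The return contract at the end of this agent (GREEN, or RED plus test name, failed assertion, and actual result per failure) can be illustrated with a small formatter. The `results` shape and the `formatReport` name are hypothetical; the real agent composes this message itself from Playwright's output.

```javascript
// results: [{ name, passed, assertion, actual }] is an assumed shape.
function formatReport(results) {
  const failures = results.filter((r) => !r.passed);
  if (failures.length === 0) return "GREEN";
  // One line per failure: test name, failed assertion, actual result.
  const lines = failures.map(
    (f) => `- ${f.name}: expected ${f.assertion}, got ${f.actual}`
  );
  return ["RED", ...lines].join("\n");
}
```

Keeping the failed assertion and the actual value in the message gives the Builder something concrete to patch against on the next attempt.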