The start-task → ship loop: how I turned Claude Code into a software factory

21 Jun 2026

Specs should be the source of truth

I don’t really write features any more. I maintain a loop, and the loop writes them. I built a real product this way - a backend and three native SDKs.

The loop has two commands. /start-task takes an idea and turns it into a plan I’ve agreed to, then a diff that implements it. /ship takes that diff and turns it into a reviewed, tested, merged PR. I’ve written about the mechanics of both before. This post is about what turns them into a loop rather than a one-way trip, because that’s the part that makes it work.

What turns them into a loop is a set of specs.

Code is the output, not the artifact

A spec describes what a part of the product should do. Not how - what. There’s one per capability, with a user story and acceptance criteria written so an agent or machine can check them. Here’s a trimmed example:

---
id: CAP-002
title: Manual / programmatic report invoke
status: shipped
surfaces: [android-sdk, ios-sdk, rn-sdk]
code_refs:
  - sdks/android/sdk/src/main/java/app/bugscreen/android/BugScreenSDK.kt
  - sdks/ios/Sources/BugScreenSDK/BugScreenSDK.swift
verified_by:
  - sdks/android/.../BugScreenSDKLifecycleTest.kt
---

User story
As a mobile developer, I want to open the bug reporter programmatically, so
that I can wire it to my own trigger instead of (or as well as) the
screenshot trigger.

Acceptance criteria
- Given the SDK is initialized, when the host app calls the open-reporter
  API, then the report UI opens.
- Given the SDK is not initialized, when the open-reporter API is called,
  then it is a safe no-op (does not throw).

The criteria are Given / when / then because something other than me has to be able to check them. code_refs and verified_by are the spec pointing at the code that implements it and the test that proves it.
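To make "something other than me can check them" concrete, here is a minimal sketch of reading those bullets into structured form. It is an illustration, not the tooling behind /start-task or /ship: the names Criterion and parse_criteria are hypothetical, and it assumes every criterion is a "- " bullet in exactly the Given / when / then shape shown above, with wrapped lines indented.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    given: str
    when: str
    then: str


# Matches "Given <context>, when <action>, then <outcome>." case-insensitively.
_PATTERN = re.compile(
    r"^given\s+(?P<given>.+?),\s*when\s+(?P<when>.+?),\s*then\s+(?P<then>.+?)\.?$",
    re.IGNORECASE | re.DOTALL,
)


def parse_criteria(markdown: str) -> list[Criterion]:
    """Collect '- Given ..., when ..., then ...' bullets, joining wrapped lines."""
    bullets: list[str] = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("- "):
            bullets.append(stripped[2:])
        elif stripped and bullets and line.startswith(" "):
            # An indented line continues the previous bullet.
            bullets[-1] += " " + stripped

    criteria = []
    for text in bullets:
        match = _PATTERN.match(text)
        if match is None:
            # A criterion that cannot be parsed cannot be checked by a machine.
            raise ValueError(f"not in Given/when/then form: {text!r}")
        criteria.append(Criterion(**match.groupdict()))
    return criteria
```

Rejecting a bullet that does not parse is a deliberate choice in this sketch: a criterion the tooling cannot read is one nobody but the author can verify.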

The spec is the thing I maintain, and the code is derived from it. I don’t write the spec prose by hand. I tell Claude what I want and it drafts the spec; I edit until it says what I actually want.

When the spec and the code disagree, one of them is wrong and has to change. Usually it’s the code, because the spec is what I signed off on. Occasionally building proves the spec wrong, and then the spec changes instead. What never happens is the two quietly drifting apart.

I keep one rule to stop that happening: a shipped spec has to point at the test that proves it (that’s what verified_by is), and /ship checks that the code actually matches the spec before it lets anything merge.
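The cheapest slice of that rule is mechanical: a shipped spec whose pointers lead nowhere has already drifted. The sketch below shows one way such a check could look. It is an assumption-laden illustration, not /ship itself: read_front_matter and spec_problems are hypothetical names, the parser handles only the small YAML subset in the example header (scalars and "- " lists), and it checks only that the referenced files exist, not that the code matches the criteria.

```python
from pathlib import Path


def read_front_matter(text: str) -> dict[str, object]:
    """Parse the header between the '---' lines: scalars and '- ' lists only."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("spec has no front matter")
    data: dict[str, object] = {}
    key = None
    for line in lines[1:]:
        if line.strip() == "---":
            return data
        if line.startswith("  - ") and isinstance(data.get(key), list):
            data[key].append(line[4:].strip())
        elif ":" in line:
            key, _, value = line.partition(":")
            key, value = key.strip(), value.strip()
            # A key with no inline value starts a list.
            data[key] = value if value else []
    raise ValueError("front matter is not closed")


def spec_problems(spec_text: str, repo_root: Path) -> list[str]:
    """Return reasons a shipped spec fails the 'must point at real files' rule."""
    meta = read_front_matter(spec_text)
    if meta.get("status") != "shipped":
        return []  # the rule only binds shipped specs
    problems = []
    for field in ("code_refs", "verified_by"):
        paths = meta.get(field)
        if not isinstance(paths, list) or not paths:
            problems.append(f"{field} is missing or empty")
            continue
        for rel in paths:
            if not (repo_root / rel).is_file():
                problems.append(f"{field}: {rel} does not exist")
    return problems
```

An empty result means only that the pointers resolve. Whether the test behind verified_by really proves the criteria is still a job for the test run and the review.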

/start-task points the agent at the spec

Work doesn’t start from a prompt. It starts from a spec I’ve signed off on.

Before any code is written, /start-task reconciles what I’m asking for against the spec and makes me agree the acceptance criteria. Only then does it cut a branch and drop into planning. “Done” is defined up front, by me.

This looks like overhead but it’s the opposite. The expensive part of agentic coding isn’t writing code - the agent does that in minutes - it’s discovering, after the fact, that it built the wrong thing correctly. Agreeing the criteria first is the cheapest place to catch that.

It also gives the agent a target it can check itself against. Because the acceptance criteria are machine-checkable, Claude can run in a loop - write code, run the tests that prove the criteria, read the failures, try again - until the spec is satisfied. I’m agreeing the definition of done; the agent is looping until it gets there.
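Stripped of the agent, the shape of that loop is small. The sketch below is illustrative only: SuiteResult, loop_until_green, run_tests and revise are hypothetical names, with run_tests standing in for whatever runs the tests behind the acceptance criteria and revise standing in for the agent changing the code in response to the failures. The attempt cap is my assumption, not something the workflow above specifies.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SuiteResult:
    passed: bool
    failures: list[str]


def loop_until_green(
    run_tests: Callable[[], SuiteResult],
    revise: Callable[[list[str]], None],
    max_attempts: int = 5,
) -> int:
    """Return how many test runs it took to pass; raise if the budget runs out."""
    for attempt in range(1, max_attempts + 1):
        result = run_tests()
        if result.passed:
            return attempt
        if attempt < max_attempts:
            # Hand the failures back so the next change is aimed at them.
            revise(result.failures)
    raise RuntimeError(f"still failing after {max_attempts} attempts")
```

The cap matters more than it looks: a loop that never gives up hides the case where the criteria themselves are wrong, which is a decision for a human rather than another retry.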

/ship closes the loop

The important step in /ship is that it reconciles the diff back against the spec - proving each acceptance criterion with a test, and updating the spec for anything that changed during the implementation loop. Building is how you find out the plan was wrong: a constraint surfaces that the spec didn’t anticipate, and the only correct behaviour turns out to be different from the one I wrote down. The code is right and the spec is stale, so /ship rewrites the criterion to match. The loop closes: spec → code → back to the spec. The two never get to drift apart, because shipping isn’t finished until they agree again.
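One way to picture the reconciliation step is as a verdict per criterion plus a merge gate. This is a hedged sketch rather than how /ship is implemented: Verdict, reconcile and may_merge are hypothetical names, and it assumes test outcomes have already been grouped by the criterion they claim to prove.

```python
from enum import Enum


class Verdict(Enum):
    PROVEN = "proven"      # at least one test, and all of them pass
    FAILING = "failing"    # the code or the criterion has to change
    UNPROVEN = "unproven"  # no test claims this criterion


def reconcile(
    criteria: list[str], results: dict[str, list[bool]]
) -> dict[str, Verdict]:
    """Give each acceptance criterion a verdict from its tests' pass/fail outcomes."""
    verdicts = {}
    for criterion in criteria:
        outcomes = results.get(criterion, [])
        if not outcomes:
            verdicts[criterion] = Verdict.UNPROVEN
        elif all(outcomes):
            verdicts[criterion] = Verdict.PROVEN
        else:
            verdicts[criterion] = Verdict.FAILING
    return verdicts


def may_merge(verdicts: dict[str, Verdict]) -> bool:
    """Shipping is finished only when every criterion is proven."""
    return all(v is Verdict.PROVEN for v in verdicts.values())
```

A FAILING verdict deliberately does not say which side is wrong. Deciding whether to fix the code or rewrite the stale criterion is the judgement call described above; the gate only refuses to merge until the two agree.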

What’s left for me

Not much. My inputs are three: approving the spec, approving the plan, reviewing the PR. Everything in between belongs to the agent.

What’s left is judgement: scope, taste, architecture, and noticing when “looks fixed” isn’t actually fixed - which an agent will cheerfully assure you it is. The work moved up a level; it didn’t go away.

Where it strains

I’m not going to pretend the loop is seamless.

- Agents will accept “looks fixed” off a screenshot without reading the code underneath. You have to make them go and look.
- The runs aren’t deterministic. I’ve had the same spec produce two different answers on two passes, because each read a different third-party doc page and reasoned from it.
- Follow-ups evaporate. A ticket I file “to come back to” is a ticket I never pick up, so I’d rather the loop do the small thing now than promise to do it later.

This is more process than just prompting a model and reading the diff. But it’s the difference between vibe coding and vibe engineering. It’s how BugScreen got built. For anything more than a weekend project I think the extra scaffolding is worthwhile.