Edward Harker

The start-task → ship loop: how I turned Claude Code into a software factory

Specs should be the source of truth

I don’t really write features any more. I maintain a loop, and the loop writes them. I built a real product this way - a backend and three native SDKs.

The loop has two commands. /start-task takes an idea and turns it into a plan I’ve agreed to, then a diff that implements it. /ship takes that diff and turns it into a reviewed, tested, merged PR. I’ve written about the mechanics of both before. This post is about what turns them into a loop rather than a one-way trip, because that’s the part that makes it work.

What turns them into a loop is a set of specs.

Code is the output, not the artifact

A spec describes what a part of the product should do. Not how - what. There’s one per capability, with a user story and acceptance criteria written so an agent or machine can check them. Here’s a trimmed example:

---
id: CAP-002
title: Manual / programmatic report invoke
status: shipped
surfaces: [android-sdk, ios-sdk, rn-sdk]
code_refs:
  - sdks/android/sdk/src/main/java/app/bugscreen/android/BugScreenSDK.kt
  - sdks/ios/Sources/BugScreenSDK/BugScreenSDK.swift
verified_by:
  - sdks/android/.../BugScreenSDKLifecycleTest.kt
---

User story
As a mobile developer, I want to open the bug reporter programmatically, so
that I can wire it to my own trigger instead of (or as well as) the
screenshot trigger.

Acceptance criteria
- Given the SDK is initialized, when the host app calls the open-reporter
  API, then the report UI opens.
- Given the SDK is not initialized, when the open-reporter API is called,
  then it is a safe no-op (does not throw).

The criteria are Given / when / then because something other than me has to be able to check them. code_refs and verified_by are the spec pointing at the code that implements it and the test that proves it.

The spec is the thing I maintain, and the code is derived from it. I don’t write the spec prose by hand. I tell Claude what I want and it drafts the spec; I edit until it says what I actually want.

When the spec and the code disagree, one of them is wrong and has to change. Usually it’s the code, because the spec is what I signed off on. Occasionally building proves the spec wrong, and then the spec changes instead. What never happens is the two quietly drifting apart.

I keep one rule to stop that happening: a shipped spec has to point at the test that proves it - that’s what verified_by is - and /ship checks the code actually matches the spec before it lets anything merge.

/start-task points the agent at the spec

Work doesn’t start from a prompt. It starts from a spec I’ve signed off on.

Before any code is written, /start-task reconciles what I’m asking for against the spec and makes me agree the acceptance criteria. Only then does it cut a branch and drop into planning. “Done” is defined up front, by me.

This looks like overhead but it’s the opposite. The expensive part of agentic coding isn’t writing code - the agent does that in minutes - it’s discovering, after the fact, that it built the wrong thing correctly. Agreeing the criteria first is the cheapest place to catch that.

It also gives the agent a target it can check itself against. Because the acceptance criteria are machine-checkable, Claude can run in a loop - write code, run the tests that prove the criteria, read the failures, try again - until the spec is satisfied. I’m agreeing the definition of done; the agent is looping until it gets there.
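The shape of that inner loop can be sketched in a few lines of Python. This is an illustration under assumptions, not Claude Code's internals: write_code and run_tests are hypothetical callables standing in for the agent's edit step and the test runner, and the attempt budget is something I have added so the sketch cannot spin forever.

```python
from typing import Callable, List


def loop_until_green(
    write_code: Callable[[List[str]], None],
    run_tests: Callable[[], List[str]],
    max_attempts: int = 5,
) -> int:
    """Repeat write -> test -> read failures until no test fails.

    write_code receives the failures from the previous pass.
    run_tests returns the names of failing tests (empty list means green).
    Returns the number of attempts used; raises if the budget runs out.
    """
    failures: List[str] = []  # the first pass has no failures to read yet
    for attempt in range(1, max_attempts + 1):
        write_code(failures)
        failures = run_tests()
        if not failures:
            return attempt
    raise RuntimeError(f"still failing after {max_attempts} attempts: {failures}")
```

The point of the sketch is the exit condition: the loop stops when the tests that encode the agreed criteria pass, not when the agent decides it is finished.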

/ship closes the loop

The important step in /ship is that it reconciles the diff back against the spec - proving each acceptance criterion with a test, and updating the spec for anything that changed during the implementation loop. Building is how you find out the plan was wrong: a constraint surfaces that the spec didn’t anticipate, and the only correct behaviour turns out to be different from the one I wrote down. The code is right and the spec is stale, so /ship rewrites the criterion to match. The loop closes: spec → code → back to the spec. The two never get to drift apart, because shipping isn’t finished until they agree again.

What’s left for me

Not much. My inputs are three: approving the spec, approving the plan, reviewing the PR. Everything in between belongs to the agent.

What’s left is judgement: scope, taste, architecture, and noticing when “looks fixed” isn’t actually fixed - which an agent will cheerfully assure you it is. The work moved up a level; it didn’t go away.

Where it strains

I’m not going to pretend the loop is seamless.

  • Agents will accept “looks fixed” off a screenshot without reading the code underneath. You have to make them go and look.
  • The runs aren’t deterministic. I’ve had the same spec produce two different answers on two passes, because each pass read a different third-party doc page and reasoned from it.
  • Follow-ups evaporate. A ticket I file “to come back to” is a ticket I never pick up, so I’d rather the loop do the small thing now than promise to do it later.

This is more process than just prompting a model and reading the diff. But it’s the difference between vibe coding and vibe engineering. It’s how BugScreen got built. For anything more than a weekend project I think the extra scaffolding is worthwhile.