Two days ago I wrote about building an app in 19 days with Claude Code. That post told part of the story. Here’s the rest.
What Actually Shipped
Between December 21st, 2025 and January 17th, 2026—27 days—I shipped:
- judoka.ai — A production iOS/Android/web app. 309,000 lines of code. 2,726 tests. 165 screens. App Store approved.
- judoka.blog — Marketing site with CMS, image manager, video manager, email system, newsletter, and campaign manager. Eight languages.
- agentic — Open source framework for Claude Code projects.
- A content site (launching soon).
- A scenario analysis tool (launching soon), which is genuinely complex.
One person.
The Actual Numbers
I kept the receipts.
| Item | Cost |
| --- | --- |
| Claude Max 20x plan | $200 |
| Overages (Dec 21-28) | $100 |
| Overages (Dec 28 - Jan 12) | $380 |
| API direct (Jan 15-17) | ~$820 |
| Claude total | ~$1,500 |
| Hosting, services, domains | ~$300 |
| Everything | ~$1,800 |
For context: that’s my sweet spot for three good bottles of wine.
Token Counts
This is where it gets concrete.
Over the full 27 days, I used approximately 1 billion input tokens.
The API period (January 15-17) gives exact numbers:

| Metric | Count |
| --- | --- |
| Input tokens | 541,115,512 |
| Output tokens | 3,012,847 |
| Input:output ratio | 180:1 |
| Cache hit rate (by day 2) | ~50% |
The Max plan period (December 21 – January 14) added another 300-400 million tokens, though I was throttled for parts of it.
To put 1 billion tokens in perspective:
- ~750 million words
- ~1.5 million pages of text
- ~8,000 novels
- A small library
The 180:1 ratio reflects the reality of agentic coding. You’re sending codebase context repeatedly. Claude reads far more than it writes.
Why So Many Tokens
I was running 6-15 parallel Claude Code instances.
Each instance sends full context independently. If your codebase is 100K tokens of context and you’re running 9 instances averaging 17 prompts per hour over a 10-hour day, that’s:

9 instances × 17 prompts/hour × 100K tokens × 10 hours = 153M tokens/day
The math checks out.
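To sanity-check that burn rate, here’s the same estimate as code. The inputs are the rough assumptions from the paragraph above, not figures pulled from my usage logs.

```ts
// Back-of-the-envelope input-token burn for parallel Claude Code instances.
// All inputs are rough assumptions taken from the estimate above.
const instances = 9;           // parallel Claude Code sessions
const promptsPerHour = 17;     // average prompts per instance per hour
const contextTokens = 100_000; // codebase context sent with each prompt
const hoursPerDay = 10;

const inputTokensPerDay =
  instances * promptsPerHour * contextTokens * hoursPerDay;

console.log(`${inputTokensPerDay / 1_000_000}M input tokens/day`); // "153M input tokens/day"
```

I wasn’t at that peak every day; averaged over 27 days, ~1 billion tokens works out to roughly 37M input tokens per day.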
This is also why I hit the Max plan ceiling: at a certain point they wouldn’t even let me pay for more overages. Anthropic caps overages; you can’t just buy unlimited extra usage. By mid-January the throttling was slowing everything down, so I cut over to the direct API on January 15th.
The difference was immediate. No throttling. No waiting. Full speed.
What Claude Actually Did
My first post focused on code. That undersold it.
For the judo app alone
Code and tests
- 309,000 lines of TypeScript/TSX
- 2,726 tests across 111 test suites
- 214 database migrations
- 29 Supabase Edge Functions
DevOps and infrastructure
- Expo build and deployment pipeline with CI/CD
- Supabase configuration (RLS policies, functions, storage)
- Firebase Cloud Messaging integration
- Sentry.io error tracking
- Full observability stack
Integrations
- Stripe Connect (payments, subscriptions, donations)
- Whoop OAuth (fitness data sync)
- Vimeo API (video hosting)
- Google OAuth
- Claude API (AI Sensei feature)
Localization
- Japanese translation for the app, plus a complete internationalization setup ready for more languages, including RTL support (e.g. Arabic, Farsi, Urdu)
- Eight languages for the marketing site: English, German, Spanish, French, Portuguese, Russian, Korean, Japanese
Content
- All marketing copy
- Documentation
Design
- Logo
- UI design
- App Store screenshots
Compliance
- App Store privacy nutrition labels
- Permission declarations
- Review guidance
When I hit issues during App Store review—needed a demo account, parental control settings, an iPad bug—Claude walked me through the fixes. Four submissions, seven days, approved.
The Economics
Let me make this explicit.
Traditional estimates for this scope of work:

| Method | Estimate |
| --- | --- |
| COCOMO (lines-based) | ~20,000 hours |
| Feature decomposition | ~9,000-12,000 hours |
| Industry benchmark | 5-6 years solo, or 12-15 months with 5 junior engineers |
| Traditional cost | $750,000 - $1,500,000 |
What I spent: $1,800.
That’s a 99.8% cost reduction. It’s also not a fair comparison—traditional development wouldn’t produce identical output. But the delta is large enough that precision doesn’t matter.
How I Actually Ran It
Here’s what clicked: I just needed to do what I’ve always done with large teams and open source projects. I was never going to read everything anyway.
I’ve run development teams from 1 to 10,000 people. I was the CTO at Joyent when we created Node.js and npm. I’ve been a long-time Postgres user. I was there when TypeScript was created. In all of those contexts, no single person reads all the code. You set architecture, build systems that ensure quality, and dive in when things break.
So I ran Claude like a team.
Specialized agents with gated permissions
I ran at least four review agents, separate from the agents writing code:
- Testing: could write to tests only
- Code Review: could write to issues and REFACTOR.md only
- Security: could write to a timestamped SECURITY_REPORT.md only
- Performance: could write to performance reports only

On top of those, a dedicated platform agent handled build/deployment/devops (that was the only thing it ever had in context), and one agent worked full time on documentation, keeping what the other agents would read in sync with in-progress, committed, and decided work.
The coding agents wrote code. The review agents could only write to their designated outputs. Hard boundaries.
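The enforcement mechanism matters less than the principle: each role gets an explicit allowlist of paths it can write to. Here’s a minimal sketch of that idea. The role names come from above; the path patterns, file names, and matching logic are hypothetical, and this is not Claude Code’s actual permission format.

```ts
// Hypothetical write-permission map per agent role. Illustrates the gating idea only.
type AgentRole = "coder" | "testing" | "code-review" | "security" | "performance";

const writePermissions: Record<AgentRole, string[]> = {
  coder: ["app/**", "lib/**", "supabase/**"],   // product code
  testing: ["**/*.test.ts", "**/*.test.tsx"],   // tests only
  "code-review": ["REFACTOR.md", "issues/**"],  // review notes and issues only
  security: ["SECURITY_REPORT_*.md"],           // timestamped security reports only
  performance: ["PERF_REPORT_*.md"],            // performance reports only
};

function escapeRegExp(s: string): string {
  return s.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
}

// Crude glob matcher for the sketch: every "*" (or "**") becomes ".*".
function canWrite(role: AgentRole, path: string): boolean {
  return writePermissions[role].some((pattern) => {
    const re = new RegExp("^" + pattern.split("*").map(escapeRegExp).join(".*") + "$");
    return re.test(path);
  });
}

canWrite("security", "SECURITY_REPORT_2026-01-10.md"); // true
canWrite("security", "lib/auth/session.ts");           // false
```

A reviewer that tries to touch product code simply has no matching pattern, so the change never reaches a commit.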
Spatial organization
I had 10-15 terminal windows running at a time, spatially organized:
- Left: Test, security, performance, review, deployment and documentation agents
- Center: Primary engineer across the code base and product manager
- Right: Coders working on specific phases of an RFD.
While they worked on their tasks, I’d be in the center window with the product manager, reviewing commits and PRs, merging, and making sure things closed out properly. I’d also work step by step on the things that are harder to get right (e.g. instructional content you no longer want to sell, but that existing buyers still need access to).
Cadence
Every major release followed the same rhythm:
- 4 days to build
- 2 days to review, refactor, secure, performance test
The review phase wasn’t optional. Testing agent scans everything. Security agent files reports. I read every report.
The Documentation Stack
This is the part that made it work. Each agent had shared context through a layered documentation system:
MEMORY.md — System-wide development standards. Goes into Claude’s /memory. Things like: always use Supabase SDK, never write direct SQL in application code, 300 line file limit, data layer lives in /lib, components never import Supabase directly. Universal constraints that apply to every project.
CLAUDE.md — Project-specific context. What we’re building, current focus, key architectural decisions, stack details. This is the first thing every agent reads.
_FRAGILE.md — Danger zones. Auth flows, payment code, RLS policies, anything that breaks in non-obvious ways. Before touching anything in fragile areas, agents run /fragile to review what could go wrong.
_NEXT_SESSION_MEMO.md — Session handoffs. Every time an agent paused, it updated this file. What was accomplished, what’s in progress, what’s blocked. When I spun up a new instance, it read this first.
RFDs — Request for Discussion documents. Technical designs that agents updated as they worked. One agent writes a security report, another picks it up and drafts an RFD, I review, we iterate, a third agent implements.
The startup prompt was simple:
Read CLAUDE.md. Read _FRAGILE.md. Read _NEXT_SESSION_MEMO.md. We’re working on RFD_0172. Your role is SECURITY. Your only write permission is a timestamped SECURITY_REPORT.
That’s it. Short prompts plus documentation beat elaborate role definitions. The standards and details already live in the repo; the agent reads them and gets to work.
Commands That Mattered
I built a small set of slash commands that enforced discipline:
- /wrap — End of session. Update _NEXT_SESSION_MEMO.md, commit everything, document what’s done.
- /sup — Quick status. Five-second check on where things stand.
- /fragile — Review danger zones before making changes to sensitive code.
- /plan — Two-phase workflow. First we plan and get approval, then hand off to implement. No yolo commits on complex features.
- /research — Deep exploration in a forked context. Investigation doesn’t pollute the main conversation.
These aren’t complicated. They just enforce the habits that make large codebases manageable.
Architecture Constraints
The documentation encoded architectural rules that prevented drift:
Single source of truth. All environment variables in lib/config/env.ts. All query keys in lib/queries/keys.ts. All formatters in lib/constants/units.ts. No exceptions.
Strict data flow. Database → lib/supabase/queries → lib/models → lib/hooks → Components. Components never import Supabase directly. If you need data, you use a hook.
File size limits. 300 lines max. Split earlier rather than later. This kept diffs reviewable.
Context-based IDs. No hardcoded organization or tenant IDs in runtime code. Use useOrganization(), never const orgId = 'org_123'.
Responsive from day one. Use useLayout() for breakpoints. Use useAdaptiveNavigation() instead of hardcoded navigation. Retrofitting responsive layouts later breaks everything.
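To make the data-flow and context-ID rules concrete, here’s a minimal sketch. The layer paths (lib/supabase/queries, lib/hooks, lib/queries/keys.ts) and useOrganization() come from the constraints above; the table, function, and key names are hypothetical, React Query is an assumption based on the query-key convention, and the lib/models step is elided to keep it short.

```ts
// lib/supabase/queries/trainingSessions.ts: the only layer that talks to Supabase.
import { supabase } from "../client";

export async function fetchTrainingSessions(orgId: string) {
  const { data, error } = await supabase
    .from("training_sessions")
    .select("*")
    .eq("organization_id", orgId);
  if (error) throw error;
  return data;
}

// lib/hooks/useTrainingSessions.ts: components get data only through hooks like this.
import { useQuery } from "@tanstack/react-query";
import { queryKeys } from "../queries/keys";          // single source of truth for keys
import { useOrganization } from "./useOrganization";  // context-based ID, never hardcoded
import { fetchTrainingSessions } from "../supabase/queries/trainingSessions";

export function useTrainingSessions() {
  const { orgId } = useOrganization();
  return useQuery({
    queryKey: queryKeys.trainingSessions(orgId),
    queryFn: () => fetchTrainingSessions(orgId),
  });
}

// A screen component calls useTrainingSessions(); it never imports the Supabase client.
```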
These constraints meant any agent could work on any part of the codebase and produce consistent output. The architecture was in the documentation, not in my head.
I also kept and regularly reviewed _VOCABULARY.md: the canonical terms and semantics for the project, used consistently. Without it, a next-token predictor will happily mix up member, members, and membership.
Commit Discipline
I was reviewing 80-100 commits a day. With the sizing constraints—small files, focused changes—it was manageable. The agents actually had good commit discipline.
The key was keeping changes small enough to review. A 50-line change to one file is easy to verify. A 500-line change across 12 files is a nightmare. The constraints enforced reviewability.
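That sizing rule is easy to back with tooling. Here’s a sketch of the kind of pre-commit check that enforces it; the script is illustrative, not something lifted from the repos.

```ts
#!/usr/bin/env node
// Sketch of a pre-commit check: flag staged TypeScript files over the 300-line limit.
import { execSync } from "node:child_process";
import { readFileSync } from "node:fs";

const MAX_FILE_LINES = 300;

const staged = execSync("git diff --cached --name-only --diff-filter=ACM", {
  encoding: "utf8",
})
  .split("\n")
  .filter((f) => /\.(ts|tsx)$/.test(f));

let failed = false;
for (const file of staged) {
  const lineCount = readFileSync(file, "utf8").split("\n").length;
  if (lineCount > MAX_FILE_LINES) {
    console.error(`${file}: ${lineCount} lines (limit ${MAX_FILE_LINES}); split it before committing`);
    failed = true;
  }
}
process.exit(failed ? 1 : 0);
```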
The Difference
This is not vibe coding.
Vibe coding is one long session where you’re fighting context window limits and the model forgets where to deploy. That works for toys. It doesn’t work for production.
What I did was run a startup. I told people what to do. I kept them focused and unblocked. I gave them chunks of work. I reviewed their output. I maintained the architecture.
The agents were easier to work with than most people you don’t know who end up on your team. They don’t have ego. They don’t argue about tabs versus spaces. They don’t disappear for two hours. They just work.
Well, I did have to add the following guidance, because they do occasionally complain and default to human timelines (they’d say 4-6 weeks when I knew they’d be done in an hour).
LLM Decision Making
- Never factor “tedious” or “repetitive” into recommendations — mechanical work is instant for an LLM
- Prefer technically cleaner solutions over “easier” ones — the effort delta doesn’t exist
- Don’t estimate human time (“this will take 2-3 days”) — focus on dependencies and risk
- When choosing between “do it right now” vs “do it later”, bias toward now if it’s cleaner
- Codemod-style migrations (find/replace patterns across files) are trivial — never defer them for effort reasons
What I’d Do Differently
Start on API sooner. The Max plan is good up to a point. Once you’re running parallel instances and hitting limits, the throttling costs more in time than the API costs in money.
Monitor token usage from day one. I didn’t have good visibility into my consumption patterns until I was on direct API. Knowing your burn rate helps you plan.
Cache aggressively. By day two I had 50% cache hit rates. That’s $0.50/MTok instead of $5/MTok for Opus. On 1 billion tokens, caching saved me thousands of dollars.
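The arithmetic: at a 50% hit rate, roughly half of the ~1 billion input tokens were served from cache, and every cached million tokens costs $4.50 less than an uncached one.

```ts
// Rough caching savings, using the rates above and a ~50% hit rate on ~1B input tokens.
const cachedMTok = 500;             // ~500 million tokens read from cache
const savingsPerMTok = 5.0 - 0.5;   // $4.50 saved per million cached tokens
const totalSavings = cachedMTok * savingsPerMTok;
console.log(`~$${totalSavings} saved`); // "~$2250 saved"
```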
Extract the framework earlier. The agentic framework came from patterns that emerged during the build. I wish I’d started with more of it in place. The constraints and documentation templates compound—they’re more valuable at the start than the end.
By the end, I had 30,000+ lines of documentation supporting the codebase. That ratio—roughly 10% documentation to code—is what makes the system maintainable.
What This Means
I built a complete “business” and three “side projects”—app, marketing site, CMS, payment infrastructure, eight-language localization, App Store approval—in 27 days for $1,800.
That sentence is amazing.
The implications are still unfolding. But a few things seem clear:
The unit economics of software have changed. Not “will change” or “are changing.” Have changed. The cost floor for production software dropped by two orders of magnitude.
Management skills transfer. The same practices that let you run large engineering organizations—specialization, gating, review cycles, documentation—work on AI agents. Maybe better, because the agents actually follow the process.
Solo builders can now ship at startup scale. Not startup quality—startup scale. The constraint isn’t implementation capacity anymore. It’s judgment, taste, and knowing what to build.
Links
- The app: judoka.ai
- Marketing site: judoka.blog
- App Store: Judoka.ai on the App Store
- Framework: github.com/jasonhoffman/agentic
- Previous post: On Building Software with Claude Code