# Claude Code as a Poor Man's QA Team
by Sylvain Artois on Feb 27, 2026
- #Testing
- #AI
- #Claude Code
In a small startup, bugs are expensive. Not just in engineering time — in trust. When your first ten clients rely on your product and something breaks, the cost is not a Jira ticket. It is a phone call, an apology, and a feature that slips another week.
Wispra is a small startup I work with as a freelance AI engineering expert. We optimize GEO (Generative Engine Optimization) for SMBs — helping local businesses appear in AI-generated search results. The tech stack is a FastAPI backend, PostgreSQL, and a fair amount of LLM orchestration. The team is tiny: I handle most of the AI stack myself.
I have over 1,300 unit tests. They cover services, routes, edge cases. They pass. And yet, they are not enough.
Unit tests verify isolated functions with mocked dependencies. They do not catch the bug where a prompt generation function produces duplicates because the deduplication logic depends on a database state that no fixture replicates. They do not catch the silent data corruption when a migration changes a column default and three services downstream assume the old value.
What I needed was something between unit tests and a full QA pass. Something that checks actual database state after running a real pipeline.
## The E2E detour
I did try traditional end-to-end tests. I wrote about setting up local website mirrors with Docker network aliases and Caddy so the crawler would hit real HTML served locally. It works, but E2E tests in a fast-moving startup are a maintenance burden. You can rewrite a module in a few days — and then spend another day fixing the E2E tests that broke. Even with Claude Code helping, the ROI is negative when you ship multiple features a week.
## How it started: a test plan and a realization
Recently, I worked through a major feature set: GEO Analysis V2. It spanned 8 PRDs (Product Requirements Documents) — covering prompt generation, crawling, scoring, competitor tracking, directory analysis, and benchmarking.
My workflow with Claude Code was fairly standard at that point: XML prompts stored in the repo, PRD generation, then implementation. After the 8 PRDs were done, I had a lot of new behavior to verify before shipping.
So I asked Claude to generate a test plan — a structured document listing every SQL query to run, every API endpoint to check, every expected value to verify. The idea was to execute it by hand. It came out as a markdown file full of SQL blocks and checkboxes:
## Phase 3 — Starter Plan Checks
### 3.1 Prompt distribution
```sql
SELECT prompt_type, COUNT(*) as nb
FROM generated_prompts
WHERE run_id = '<RUN_ID>' AND plan = 'starter'
GROUP BY prompt_type ORDER BY prompt_type;
```
Expected (10 prompts):
| type | nb |
|------|-----|
| brand-perception | 1 |
| transactional | 9 |
- [ ] Total = 10 prompts
- [ ] 0 informational, 0 comparative (starter budget)
Then I looked at it and thought: Claude can already run SQL queries via docker exec. It can call APIs with curl. It can run pytest. Why am I the one executing this?
I asked Claude to run the test plan. It did. The first run caught 3 real bugs before deployment. The production release went smoothly.
## Architecture: slash commands, subagents, and state
What started as a one-off experiment became a repeatable workflow. Claude Code supports custom slash commands — markdown files placed in .claude/commands/ that you invoke with /command-name. Each file has YAML frontmatter and a markdown body with instructions.
I built two commands:

- `/qa` — for day-to-day QA testing (run test plans, unit tests, generate a report)
- `/release-qa {release version}` — for pre-release validation (adds commit analysis, release notes, git tagging)
```markdown
---
name: qa
description: QA test execution — run test plans, unit tests, and generate results report
---

# QA — Project AI

You are the QA test executor. Run the QA test plans on the current state
of the repository.

Your role is to **coordinate subagents** for each major step...
```
The key design decision: the slash command is an orchestrator, not an executor. It delegates work to subagents (via Claude’s Task tool) that each handle one portion of the test plan. This keeps the main conversation context lean — verbose SQL output and pytest logs stay inside each subagent’s context. The orchestrator only surfaces structured summaries.
The orchestrator maintains a lightweight state object that grows through the workflow:
```yaml
QA_STATE:
  RUN_IDS: { STARTER: uuid, PLUS: uuid, PRO: uuid }
  PHASE_RESULTS: { 0..8, logs }
  UNIT_TEST_SUMMARY: { PASSED: 1332, FAILED: 0, SKIPPED: 0 }
  REPORT_PATH: QA/results/TEST_RESULTS_v0.9.0_2026-02-27.md
  VERDICT: READY TO SHIP
```
Each subagent receives only the portion of state it needs. This is like passing function arguments — minimal coupling, clear contracts.
```text
/qa command (orchestrator)
|-- Step 0: Docker check, log clear, DB ping (direct)
|-- Step 1: Phase 0 prerequisites --> subagent
|-- Step 1: Phases 1-2 pipeline execution (interactive)
|-- Step 1: Phases 3-6 plan-specific checks --> subagent
|-- Step 1: Phase 7 + Phase 8 + Logs --> 3 parallel subagents
|-- Step 2: Data pipeline test --> subagent
|-- Step 3: Report generation --> subagent
+-- Step 4: Discrepancy analysis + correction --> subagent + user
```
## Implementation: what happens inside each step

### Step 0 — The boring but critical setup
Every QA run starts with three sanity checks. If Docker is not running, nothing works. If the log file is not cleared, you cannot tell which errors belong to this run.
```bash
# 1. Verify Docker services are up
cd /path/to/infrastructure && docker compose ps

# 2. Clear logs for a clean baseline
> /path/to/api/logs/app.log

# 3. Verify database connectivity
docker exec db_container psql -U db_user -d database -c "SELECT 1"
```
These run directly in the main context — no subagent needed for three quick commands.
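If you want the three checks to fail fast as one unit, they can be wrapped in a small guard script — a minimal sketch, reusing the same illustrative container and path names as above. Real failures would abort the run; here each step just reports OK or FAIL so the control flow is visible without a live Docker host.

```shell
# Preflight guard sketch (container/path names are illustrative).
set -u

LOG_FILE="${LOG_FILE:-/tmp/app.log}"

check_docker() { docker compose ps >/dev/null 2>&1; }
clear_logs()   { : > "$LOG_FILE"; }   # truncate for a clean baseline
check_db()     { docker exec db_container psql -U db_user -d database \
                   -c "SELECT 1" >/dev/null 2>&1; }

# In the real flow, a FAIL here would abort the QA run entirely.
for step in check_docker clear_logs check_db; do
  if "$step"; then echo "OK   $step"; else echo "FAIL $step"; fi
done
```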
### Delegating work to subagents
The core pattern: the orchestrator sends a structured prompt to a general-purpose subagent. The prompt specifies what to read, how to execute, and what format to return.
```markdown
**Delegate to a `general-purpose` subagent** with this prompt:

> You are running Phase 3 (Plan-Specific Checks) of the QA test plan.
>
> **RUN IDs:**
>
> - STARTER: {RUN_ID_STARTER}
> - PLUS: {RUN_ID_PLUS}
> - PRO: {RUN_ID_PRO}
>
> 1. Read Phases 3, 4, and 5 from `QA/TEST_PLAN.md`
> 2. Replace every `<RUN_ID>` placeholder with actual UUIDs
> 3. Execute every SQL check:
>    `docker exec db_container psql -U user -d database -c "SQL"`
> 4. Evaluate each result against expected values
>
> **Return results as markdown tables, one per phase:**
>
> | Check       | Result         | Notes   |
> | ----------- | -------------- | ------- |
> | description | PASS/FAIL/SKIP | details |
```
The subagent reads the test plan file, substitutes the RUN_IDs, runs every SQL query and curl command, compares results to expected values, and returns a structured summary. The orchestrator collects these summaries into its state.
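Mechanically, each check reduces to comparing a scalar from psql against the plan's expected value and emitting a results row. A sketch of that comparison — the database call is stubbed with a shell function so the shape is visible without a live container, and the check name is made up:

```shell
# One QA check: run the SQL, compare to the expected value, emit a table row.
# In the real flow `actual` would come from:
#   docker exec db_container psql -U db_user -d database -tA -c "<SQL>"
# (-t: tuples only, -A: unaligned output — i.e. a bare scalar). Stubbed here:
run_sql() { echo 10; }

expected=10
actual=$(run_sql)

if [ "$actual" -eq "$expected" ]; then
  echo "| prompt count | PASS | got $actual |"
else
  echo "| prompt count | FAIL | expected $expected, got $actual |"
fi
```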
### Three subagents, one message
Some phases are independent. Unit tests, admin status checks, and log verification do not depend on each other. The slash command instructs Claude to launch all three in a single message — they run in parallel, cutting wall-clock time significantly.
```markdown
### Phase 7 + Phase 8 + Log Check (Parallel Subagents)

**Launch all three subagents in a SINGLE message.**

#### Phase 7 — Unit Tests
**Delegate to subagent:** Run pytest, return pass/fail/skip counts...

#### Phase 8 — Admin Checks
**Delegate to subagent:** Run SQL checks for admin views...

#### Log Verification
**Delegate to subagent:** grep ERROR|CRITICAL in logs, categorize...
```
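The log-verification step is essentially a grep with grouping. A self-contained sketch — a synthetic log stands in for the real `app.log`, and the line format is illustrative:

```shell
# Log-check sketch: count error-level lines, then group them by source.
log=$(mktemp)
printf '%s\n' \
  "2026-02-26 12:00:01 ERROR crawler: timeout on site A" \
  "2026-02-26 12:00:05 INFO  scoring: run complete" \
  "2026-02-26 12:00:09 ERROR crawler: timeout on site B" > "$log"

grep -cE "ERROR|CRITICAL" "$log"                 # total error-level lines
grep -oE "ERROR [a-z]+" "$log" | sort | uniq -c  # grouped by component
```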
## SQL: the real verification layer
The core of every phase is SQL queries against the actual database. Not HTTP mocks, not fixture assertions — real queries against the same schema and data as production.
```sql
-- Verify prompt count matches plan configuration
SELECT prompt_type, COUNT(*) as nb,
       SUM(CASE WHEN brand_mentioned THEN 1 ELSE 0 END) as branded
FROM generated_prompts
WHERE run_id = '78e0fd5a-...'
GROUP BY prompt_type ORDER BY prompt_type;
-- Expected: total=20, ...
```
Each check becomes a row in the results table: PASS if the value matches, FAIL if it does not, SKIP if it requires manual verification (like visual UI checks).
## The test plan that fixes itself
This is probably the most useful part of the system. After every QA run, a subagent compares actual results against the test plan expectations and proposes corrections.
Three categories of discrepancies:
- **STALE_EXPECTATIONS** — the expected value is outdated, not a bug. For example, the test plan says “1300 unit tests” but you added 47 new ones. Confidence levels help triage: HIGH means the same discrepancy appeared in 3+ consecutive runs (safe to update), MEDIUM means 2 runs, LOW means first occurrence (investigate before updating).
- **MISSING_CHECKS** — new features observed in results with no test plan coverage. A new database column is populated but no SQL check verifies it.
- **OBSOLETE_CHECKS** — checks for behavior that no longer exists. A legacy function was removed, but the test plan still greps for it.
```markdown
STALE_EXPECTATIONS:

| Phase | Check            | Current | Proposed | Confidence |
|-------|------------------|---------|----------|------------|
| 7     | Unit test count  | 1300    | 1347     | HIGH       |

MISSING_CHECKS:

| Phase | Proposed Check         | Reason                      |
|-------|------------------------|-----------------------------|
| 5     | reputation_score range | New feature added in v0.8.3 |

OBSOLETE_CHECKS:

| Phase | Check                | Reason                     |
|-------|----------------------|----------------------------|
| 7     | Legacy function grep | Function removed in v0.8.2 |
```
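The confidence triage described above is just a threshold on consecutive occurrences. A minimal sketch of that rule as a shell function:

```shell
# Confidence triage for a STALE_EXPECTATIONS candidate (illustrative):
# HIGH after 3+ consecutive runs with the same mismatch, MEDIUM after 2,
# LOW on first occurrence.
confidence() {
  if   [ "$1" -ge 3 ]; then echo HIGH
  elif [ "$1" -eq 2 ]; then echo MEDIUM
  else                      echo LOW
  fi
}

confidence 3   # HIGH — safe to auto-propose the update
confidence 1   # LOW — investigate before touching the plan
```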
The developer reviews each proposal and approves, selects specific corrections, or skips entirely. Only after approval does another subagent apply the changes to the test plan file. An audit trail is added at the top of the document:
> Last correction: 2026-02-26 (QA run)
> Applied: 3 expectations updated, 7 checks added, 0 obsolete checks removed
Over time, the test plan drifts less and less from reality. It becomes a living document that tracks the actual behavior of the system.
## From QA to release: /release-qa v0.8.3
The /release-qa command extends /qa with steps for release management. It takes a version number as argument:
```markdown
---
name: release-qa
description: Full pre-release QA workflow
argument-hint: <version e.g. v0.9.0>
---
```
Three additional steps at the beginning:

1. **Commit analysis** — a subagent reads `git log` since the last tag, inspects the changed files, and categorizes commits into features, bug fixes, improvements, and migrations.
2. **Test plan update** — based on the code changes, a subagent decides whether the test plan needs new checks, updated expectations, or removed items.
3. **Release notes generation** — a parallel subagent writes release notes following the format of previous releases in the repo.
Steps 2 and 3 run in parallel since they both depend only on the commit analysis output.
At the end, if all checks pass, the command creates the git tag and pushes it:
```bash
git tag v0.9.0 && git push origin v0.9.0
```
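If you wanted to script that gate outside Claude, it could look like the sketch below — the report path and verdict format are assumptions, and the git commands are echoed rather than executed:

```shell
# Gate the release tag on the QA verdict (report format/path illustrative).
report=$(mktemp)
echo "Verdict: READY TO SHIP" > "$report"   # stand-in for the QA report

verdict=$(grep -m1 '^Verdict:' "$report" | cut -d' ' -f2-)
if [ "$verdict" = "READY TO SHIP" ]; then
  echo "tagging: git tag v0.9.0 && git push origin v0.9.0"
else
  echo "blocked: verdict was '$verdict'"
fi
```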
## Why this works: production data, not fixtures
The single most important design decision: testing against a local copy of the production database, not fixtures.
Fixtures are a representation of what you think the data looks like. A production dump is what the data actually looks like — with legacy rows, edge cases, and real-world inconsistencies that no fixture author would think to include.
This means the QA catches bugs that fixtures would never surface: data migration issues, foreign key cascade problems, scoring formulas that break on real values.
A word of caution: if your production data contains personal information, you need to anonymize it. In the EU, GDPR makes this non-negotiable. PostgreSQL Anonymizer handles this well — it can mask emails, names, and other sensitive fields while preserving the data structure. In my case, the data is business-facing (company names, websites, SEO scores), not personal, so the risk is lower. But think about this before copying your database.
I do not try to test everything. The system runs the main pipelines for a few businesses on different subscription plans and verifies the output state. If the happy path works on real data, the release is in good shape.
Results from a real run:
```markdown
## QA Summary — v0.8.3

| Metric       | Previous (v0.8.0) | Current (v0.8.3) | Delta |
| ------------ | ----------------- | ---------------- | ----- |
| Total checks | 26                | 72               | +46   |
| PASS         | 26                | 66               | +40   |
| FAIL         | 0                 | 0                | —     |
| SKIP         | 0                 | 6                | +6    |

Verdict: READY TO SHIP
```
The 6 skipped checks are visual verification items (admin UI badges) that require a human eye. Everything else is automated.
## Caveats and honest limitations
This is not a replacement for a real QA team. It works for a solo developer or a tiny team shipping fast. At scale, you need humans doing exploratory testing, checking accessibility, and thinking about scenarios that no test plan covers.
Context window limits. Large test plans can exceed what a single subagent can hold in context. The phase-by-phase delegation pattern solves this — each subagent handles a portion, not the whole document.
Cost. Each QA run consumes LLM tokens. A full /release-qa run with 70+ checks, several subagents, and report generation is not free. For a startup, the cost is justified by the bugs it catches. But track it.
Non-determinism. LLM-based orchestration means the same check can occasionally be interpreted slightly differently. The structured prompt format and explicit return format reduce this, but do not eliminate it. I have not seen it cause a false PASS, though — the SQL queries are deterministic, and the subagent just reports the output.
## What I actually shipped
Two slash commands. A test plan in markdown. A production database copy. Subagents that run SQL, curl, and pytest.
The system is not elegant. It is a collection of markdown files telling an AI what to check and how to check it. But after several releases, it has caught real bugs, kept the test plan current through its own correction loop, and automated the most tedious part of solo development: verifying that everything still works after a big change.
For a solo CTO shipping features weekly, this has been the best investment in quality outside of unit tests. Not E2E tests (too expensive to maintain alone). Not manual QA passes (too slow and error-prone when you are the one who wrote the code). This sits in between — automated enough to run in minutes, thorough enough to catch real problems, and cheap enough to run before every release.