Paula Silva | Software Global Black Belt
Engineering discipline for the AI era

Spec-Driven and Test-Driven Development,
with GitHub Spec-Kit and Specky.

How specifications and tests turn AI coding agents from unpredictable assistants into reliable engineering partners.

Author: Paula Silva
Role: Software Global Black Belt
Duration: 90 to 120 minutes
Date: 2026-06-10
Agenda

Eleven chapters, one thesis: intent first, code last.

01 The problem: vibe coding and intent debt
02 Spec-Driven Development fundamentals
03 Test-Driven Development fundamentals
04 Where SDD meets TDD
05 GitHub Spec-Kit in depth
06 Specky in depth
07 The end-to-end workflow
08 Quality gates and traceability
09 Anti-patterns and failure modes
10 Results from the field
11 Getting started: an adoption roadmap
Chapter 01 · The problem
I

Vibe coding does not scale.

AI coding agents are powerful, but prompting them without a written specification produces software whose intent lives nowhere. This chapter names the failure patterns and the debt they create.

Three failure patterns

Why "prompt and pray" fails on anything bigger than a script.

Pattern 01 · Lost intent

The "why" exists only in a chat history.

Requirements are scattered across prompts. Six months later nobody can say what the system was supposed to do, only what it happens to do.

Pattern 02 · Drift per iteration

Every regeneration moves the target.

Without a stable spec, each "fix this" prompt re-interprets the goal. The agent optimizes the last message, not the product.

Pattern 03 · Unverifiable output

Code that looks right is not code that is right.

Plausible output passes review by humans who skim. Without tests derived from requirements, correctness is a feeling, not a fact.

The cost
Intent debt

Technical debt is code you owe. Intent debt is understanding you owe: every behavior shipped without a written, testable requirement is a decision your future team must reverse-engineer from the code itself.

Symptoms: onboarding measured in months, "do not touch" modules, rewrites that lose features nobody documented, and AI agents that cannot help because the repository carries no machine-readable intent.

Two ways of working

Same agent, same model. The discipline is the difference.

Vibe coding

Prompt, accept, hope

  • Intent lives in ephemeral chat threads
  • Tests written after the fact, if at all
  • Each change re-negotiates the goal
  • Review checks style, not requirements
  • Agent context rebuilt by hand every session
Spec plus test driven

Specify, verify, implement

  • Intent versioned in the repository as specs
  • Tests derived from acceptance criteria, written first
  • Changes amend the spec, then the code
  • Review checks spec compliance and test coverage
  • Agents load specs and constitution as durable context
Chapter 02 · Foundations
II

Spec-Driven Development.

SDD inverts the usual relationship between specification and code: the spec is the primary artifact, executable and versioned, and code becomes its generated, verifiable expression.

Definition

In SDD, the specification is the source of truth. Code is a projection of it.

A spec in SDD is not a Word document that dies after kickoff. It is a structured, versioned artifact that states what the system must do, why, and how success is verified. Humans and AI agents both read it, both are accountable to it, and every implementation decision traces back to a line in it.

01 Written before code, refined continuously, never abandoned.
02 Precise enough that an agent can plan and implement from it.
03 Testable by construction: every requirement maps to acceptance criteria.
The artifact chain

Four artifacts, each derived from the previous one.

01
Specification
User stories, functional requirements, acceptance criteria
What must the system do, and how do we know it does?
spec.md
02
Technical plan
Architecture, stack choices, data model, contracts
How will we build it, within which constraints?
plan.md
03
Task breakdown
Small, ordered, independently testable units of work
In what sequence, and what can run in parallel?
tasks.md
04
Implementation
Code plus tests, generated and reviewed against the spec
Does every change trace back to a requirement?
src/
Core principles

Four principles that make SDD work in practice.

PRINCIPLE 01
Intent before mechanism
What and why precede how
  • Specs state outcomes, not implementations
  • Technology choices live in the plan, not the spec
  • Ambiguity is flagged, never silently resolved
PRINCIPLE 02
Executable specifications
Specs drive generation
  • Structured enough for agents to act on
  • Acceptance criteria become test cases
  • The spec is input, not documentation output
PRINCIPLE 03
Continuous refinement
Specs evolve with the product
  • Change requests amend the spec first
  • Spec and code reviewed together in PRs
  • Stale specs are treated as build breaks
PRINCIPLE 04
Constitution as law
Non-negotiables in one file
  • Project-wide principles every artifact obeys
  • Security, quality, and style baselines
  • Checked at every phase gate
Lifecycle

The SDD loop: every change flows through the spec.

01 Specify: requirements and acceptance criteria
02 Plan: architecture and technical choices
03 Break down: ordered, testable tasks
04 Implement: tests first, then code, task by task
05 Verify: gates, traceability, review
Feedback amends the spec, never just the code.
Quality bar

What separates a usable spec from a wish list.

01 Testable. Every requirement has acceptance criteria a machine can check. "Fast" becomes "p95 under 200 ms at 1,000 RPS".
02 Unambiguous. EARS notation forces structure: "WHEN a user submits invalid credentials, THE SYSTEM SHALL return 401 within 100 ms".
03 Scoped. Explicit non-goals prevent agents and humans from gold-plating beyond the requirement.
04 Marked where uncertain. Open questions carry a [NEEDS CLARIFICATION] tag instead of a silent guess.
05 Traceable. Stable requirement IDs (FR-001, NFR-003) that tests, tasks, and commits reference.
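Several of these criteria can be checked mechanically. The sketch below is a hypothetical linter, not part of Spec-Kit or Specky. It assumes one requirement per line, an FR-/NFR- ID prefix, and only the event-driven WHEN form of EARS quoted above; the names lint_requirement and EARS_WHEN are illustrative.

```python
import re

# Assumed shape: "WHEN <trigger>, THE SYSTEM SHALL <response>".
# Other EARS forms (WHILE, WHERE, IF) are deliberately out of scope here.
EARS_WHEN = re.compile(r"^WHEN\s+.+?,\s+THE SYSTEM SHALL\s+.+$", re.IGNORECASE)
REQ_ID = re.compile(r"\b(?:FR|NFR)-\d{3}\b")
OPEN_TAG = "[NEEDS CLARIFICATION]"


def lint_requirement(text: str) -> list:
    """Return the problems found in one requirement line (empty list = clean)."""
    problems = []
    if not REQ_ID.search(text):
        problems.append("missing stable requirement ID")
    if OPEN_TAG in text:
        problems.append("unresolved clarification")
    # Drop the ID prefix, then check the EARS shape of what remains.
    body = REQ_ID.sub("", text, count=1).strip(" :")
    if not EARS_WHEN.match(body):
        problems.append("not in WHEN ... THE SYSTEM SHALL ... form")
    return problems


print(lint_requirement(
    "FR-012: WHEN a user submits invalid credentials, "
    "THE SYSTEM SHALL return 401 within 100 ms"
))
print(lint_requirement("The dashboard should be fast [NEEDS CLARIFICATION]"))
```

A check like this catches shape, not substance: a requirement can match the pattern and still be untestable, so it complements human review rather than replacing it.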
Chapter 03 · Foundations
III

Test-Driven Development.

TDD is the discipline of writing a failing test before the code that makes it pass. More than two decades of practice, newly essential: tests are how humans verify what agents produce.

Definition

Red, green, refactor: the smallest loop in software engineering.

Write a test that fails because the behavior does not exist yet. Write the minimum code that makes it pass. Improve the design while the tests stay green. Repeat in cycles of minutes, not days. The test suite becomes a living, executable specification of behavior.

01 Red. The failing test proves the test itself works and the behavior is missing.
02 Green. The simplest passing implementation, resisting speculative generality.
03 Refactor. Design improves under the protection of a green suite.
The cycle

Minutes per loop. The speed is the point.

  • RED: failing test. Then: write just enough code.
  • GREEN: minimal code passes. Then: clean up, tests stay green.
  • REFACTOR: improve design. Then: next behavior, new test.
The three laws of TDD

Robert C. Martin's formulation, still the sharpest.

01 You may not write production code until you have written a failing unit test.
02 You may not write more of a unit test than is sufficient to fail, and not compiling counts as failing.
03 You may not write more production code than is sufficient to pass the currently failing test.

Applied to agents, the laws become a contract: the agent must show the failing test before it is allowed to write the implementation. Spec-Kit and Specky both encode this ordering into their task templates.
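One way to make "show red before green" checkable is to inspect the chronological record of test runs. The following is a minimal sketch under assumptions: the (test_id, outcome) log format and the function name red_before_green are hypothetical, and neither tool is claimed to store its evidence this way.

```python
def red_before_green(runs):
    """Check that each test was seen failing before it first passed.

    runs: iterable of (test_id, outcome) pairs in chronological order,
          where outcome is "FAIL" or "PASS" (assumed log format).
    Returns {test_id: bool} for every test that has passed at least once.
    """
    seen_red = set()
    verdict = {}
    for test_id, outcome in runs:
        if outcome == "FAIL":
            seen_red.add(test_id)
        elif outcome == "PASS" and test_id not in verdict:
            # Decide once, at the first PASS: was a FAIL recorded earlier?
            verdict[test_id] = test_id in seen_red
    return verdict


log = [
    ("test_contract_dashboard", "FAIL"),   # red: endpoint missing
    ("test_contract_dashboard", "PASS"),   # green after implementation
    ("test_sprint_filter", "PASS"),        # never seen red: suspicious
]
print(red_before_green(log))
```

A test that passes on its first recorded run is not necessarily wrong, but it has not yet proven that it can fail, which is exactly what the first law asks for.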

Why it pays

What a test-first suite buys you, especially with agents in the loop.

Design pressure

Hard-to-test code is hard-to-use code.

Writing the test first forces small interfaces, injected dependencies, and low coupling before the implementation calcifies.

Regression shield

Refactoring without fear.

A green suite makes aggressive cleanup safe, for humans and for agents asked to "modernize this module".

Executable documentation

Tests state behavior precisely.

Each test names a behavior and proves it. New team members and agents read the suite as ground truth.

Agent verification

The only scalable review of AI output.

Humans cannot line-review thousands of generated lines per day. A requirement-derived suite can.

Chapter 04 · Convergence
IV

Where SDD meets TDD.

Two disciplines, one chain of trust: the spec defines what correct means, the tests enforce it, and the code earns its way in by passing. Together they close the loop that vibe coding leaves open.

Complementary by design

SDD and TDD answer different questions. Neither replaces the other.

SDD answers

What and why

  • Captures intent at product and feature level
  • Defines scope, constraints, and non-goals
  • Aligns stakeholders before code exists
  • Gives agents durable, versioned context
  • Granularity: feature, days
TDD answers

Does it, provably

  • Verifies behavior at unit and integration level
  • Drives modular, testable design
  • Catches regressions on every change
  • Gives agents a binary pass/fail signal
  • Granularity: behavior, minutes
The bridge

Acceptance criteria are the translation layer between spec and test.

spec.md · requirement FR-007
## FR-007 Account lockout
WHEN a user fails authentication
  5 times within 15 minutes,
THE SYSTEM SHALL lock the account
  for 30 minutes
AND SHALL notify the user by email.

Acceptance criteria:
- 5th failure inside window locks
- 4 failures do not lock
- lock expires after 30 minutes
- email job enqueued exactly once
test_lockout.py · derived first, before any code
def test_fifth_failure_locks_account():
    # FR-007 / AC-1
    fail_login(user, times=5, window="15m")
    assert account(user).locked

def test_four_failures_do_not_lock():
    # FR-007 / AC-2
    fail_login(user, times=4, window="15m")
    assert not account(user).locked

def test_lock_expires_after_30_minutes():
    # FR-007 / AC-3
    ...
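The tests above deliberately reference helpers that do not exist yet. For readers who want to run the idea, here is a self-contained sketch of one possible implementation together with its criteria. Everything in it is an assumption for illustration: the LockoutPolicy class, the in-memory storage, the one-minute spacing between failures, and the counter standing in for the email queue.

```python
from datetime import datetime, timedelta


class LockoutPolicy:
    """In-memory sketch of FR-007 (illustrative, not production code)."""

    MAX_FAILURES = 5
    WINDOW = timedelta(minutes=15)
    LOCK_DURATION = timedelta(minutes=30)

    def __init__(self):
        self.failures = []        # timestamps of recent failed logins
        self.locked_until = None
        self.emails_enqueued = 0  # stand-in for the email job queue

    def is_locked(self, now):
        return self.locked_until is not None and now < self.locked_until

    def record_failure(self, now):
        if self.is_locked(now):
            return  # already locked: do not count or notify again
        # Keep only failures inside the sliding 15-minute window.
        self.failures = [t for t in self.failures if now - t < self.WINDOW]
        self.failures.append(now)
        if len(self.failures) >= self.MAX_FAILURES:
            self.locked_until = now + self.LOCK_DURATION
            self.failures.clear()
            self.emails_enqueued += 1


def fail_login(policy, times, start):
    """Record `times` failures one minute apart; return the last timestamp."""
    for i in range(times):
        policy.record_failure(start + timedelta(minutes=i))
    return start + timedelta(minutes=times - 1)


T0 = datetime(2026, 1, 1, 12, 0)


def test_fifth_failure_locks_account():      # FR-007 / AC-1
    policy = LockoutPolicy()
    last = fail_login(policy, 5, T0)
    assert policy.is_locked(last)


def test_four_failures_do_not_lock():        # FR-007 / AC-2
    policy = LockoutPolicy()
    last = fail_login(policy, 4, T0)
    assert not policy.is_locked(last)
```

Passing explicit timestamps instead of reading the system clock keeps the window and expiry criteria testable without sleeping.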
The double loop

An outer spec loop wrapping an inner test loop.

OUTER LOOP · specification (per feature, days)
  • Write / amend spec: requirements + criteria
  • Derive test list: one test per criterion
INNER LOOP · TDD (per behavior, minutes)
  • red, green, refactor
All criteria green: feature done, learnings flow back into the spec.
Chapter 05 · Tooling
V

GitHub Spec-Kit.

An open-source toolkit from GitHub that operationalizes SDD: a CLI, structured templates, and slash commands that walk any coding agent through specify, plan, tasks, and implement.

github.com/github/spec-kit

Spec-Kit turns SDD from a philosophy into a repeatable workflow.

Install the Specify CLI, initialize a project, and the kit scaffolds a constitution, spec templates, and agent commands. From then on, every feature follows the same phased path, and every artifact lands in a predictable place in the repository, reviewable like code.

01 Open source, agent-agnostic, MIT licensed.
02 Works inside the agent you already use, via slash commands.
03 Artifacts are plain Markdown in git: diffable, reviewable, durable.
Architecture

Three layers: CLI, templates, and your agent.

01
Specify CLI
uvx specify init, project scaffolding, agent detection
Bootstraps the structure once per project
Python
02
Templates and scripts
spec, plan, and tasks templates, constitution, phase-gate checks
Encodes the method so quality does not depend on memory
.specify/
03
Your coding agent
GitHub Copilot, Claude Code, Gemini CLI, Cursor, and others
Executes each phase through slash commands
/specify
The phases

Four commands, four artifacts, four gates.

/specify
Specification
From idea to requirements
  • User stories and functional requirements
  • Acceptance criteria per requirement
  • [NEEDS CLARIFICATION] for every gap
/plan
Technical plan
From requirements to design
  • Stack, architecture, data model
  • Contracts and API surfaces
  • Checked against the constitution
/tasks
Task breakdown
From design to work units
  • Small, ordered, testable tasks
  • [P] markers for parallelizable work
  • Tests listed before implementation
/implement
Implementation
From tasks to verified code
  • Executes task by task, test first
  • Stops at gates and failures
  • Output traceable to requirement IDs
Phase 1 in practice

/specify: describe the what and why, get a structured spec.

agent session · feature 003-team-dashboard
> /specify Build a team dashboard where managers see open tasks per member,
  filter by sprint, and export the view to CSV. Managers only; members see
  their own tasks. No real-time updates needed in v1.

Created branch: 003-team-dashboard
Created spec:   specs/003-team-dashboard/spec.md

## User stories          3 stories, 11 functional requirements
## Acceptance criteria  per requirement, EARS formatted
## Non-goals            real-time sync, member-level analytics
## Open questions       2 × [NEEDS CLARIFICATION]
  - CSV export: include archived tasks? 
  - "manager" defined by org chart or project role?

The command creates a branch and a numbered spec folder. Open questions block the next phase until a human answers them.

Phase 2 in practice

/plan: technical choices, constrained by the constitution.

agent session · feature 003-team-dashboard
> /plan Use the existing React + FastAPI stack. Postgres for storage.
  Reuse the auth middleware. CSV generation server-side.

Reading constitution.md ... 9 articles loaded
Created plan:      specs/003-team-dashboard/plan.md
Created contracts: specs/003-team-dashboard/contracts/dashboard-api.yaml
Created data model: specs/003-team-dashboard/data-model.md

Constitution check
  ✓ Art. III  test-first mandate acknowledged in plan
  ✓ Art. VII  no new framework without justification
  ✗ Art. V   observability: missing structured logging section
Plan gate: 1 violation to resolve before /tasks

The plan inherits every requirement ID from the spec. Constitution violations are surfaced as gate failures, not buried in prose.

Phases 3 and 4 in practice

/tasks orders the work, /implement executes it test-first.

agent session · feature 003-team-dashboard
> /tasks
Created tasks: specs/003-team-dashboard/tasks.md · 18 tasks
  T001  setup: migration for dashboard_views table
  T002 [P] contract test: GET /api/teams/:id/dashboard  (FR-001)
  T003 [P] contract test: GET .../export.csv            (FR-009)
  T004  unit tests: sprint filter logic                  (FR-004)
  T005  implement: dashboard query service  depends: T002, T004
  ...

> /implement
T002 ... wrote test, ran suite: FAIL (expected, endpoint missing)
T005 ... implemented service, ran suite: PASS 14/14
Gate: coverage on changed lines 96%, traceability 18/18 tasks linked

Tests are tasks of their own and always precede the implementation tasks they verify. The agent must show red before green.
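The ordering rule can be expressed as a dependency graph. The sketch below uses Python's standard graphlib module on a hand-written subset of the task list; it does not parse a real tasks.md, and the dictionary layout, the assumed dependency of the test tasks on T001, and both function names are illustrative.

```python
from graphlib import TopologicalSorter

# Hypothetical in-memory view of a few tasks: id -> (kind, dependencies).
# Only "T005 depends on T002, T004" comes from the session shown above;
# the T001 dependencies are an assumption for the example.
TASKS = {
    "T001": ("setup", []),
    "T002": ("test", ["T001"]),
    "T004": ("test", ["T001"]),
    "T005": ("implement", ["T002", "T004"]),
}


def execution_order(tasks):
    """Return one valid order in which every dependency runs first."""
    graph = {task_id: deps for task_id, (_, deps) in tasks.items()}
    return list(TopologicalSorter(graph).static_order())


def implementation_without_tests(tasks):
    """Return implementation tasks that depend on no test task."""
    return [
        task_id
        for task_id, (kind, deps) in tasks.items()
        if kind == "implement"
        and not any(tasks[dep][0] == "test" for dep in deps)
    ]


print(execution_order(TASKS))
print(implementation_without_tests(TASKS))
```

Tasks with no path between them in the graph, such as T002 and T004 here, are the ones that could carry a [P] marker.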

The constitution

Project law: principles that every spec, plan, and task must obey.

01 Written once at init, amended rarely and deliberately.
02 Checked automatically at the plan gate and the implement gate.
03 Typical articles: test-first mandate, simplicity, observability, security baselines, dependency policy.
04 Gives agents stable values that survive across sessions and models.
memory/constitution.md · excerpt
# Article III · Test-first (non-negotiable)
All implementation tasks MUST be preceded
by failing tests derived from acceptance
criteria. Red before green, no exceptions.

# Article VII · Simplicity
Start with the simplest design that meets
the spec. New frameworks require written
justification in plan.md.

NEVER merge with failing gates.
NEVER resolve a [NEEDS CLARIFICATION]
  by guessing.
Agent support

One method, many agents. Spec-Kit is deliberately agent-agnostic.

First-party

GitHub Copilot

Slash commands in VS Code and GitHub Copilot coding agent. Tightest integration with the GitHub flow: branches, PRs, checks.

CLI agents

Claude Code, Gemini CLI, Codex CLI

Commands installed as agent-native prompts at init. The same four phases, the same artifacts, in the terminal.

IDE agents

Cursor, Windsurf, Qwen, and more

specify init --ai <agent> generates the right command format per tool. Teams can mix agents on one repo because the artifacts are shared.

The artifacts, not the agent, carry the project. Switching agents mid-project costs nothing because the spec, plan, and tasks are plain Markdown in git.

Chapter 06 · Tooling
VI

Specky.

A spec-driven orchestrator that adds interactive discovery, EARS-notation requirements, quality gates with traceability matrices, and dual-runtime support for GitHub Copilot and Claude Code.

Specky · open source

Specky turns natural language into production-grade specs through guided discovery.

Where Spec-Kit gives you the phased skeleton, Specky leans into the conversation: it interviews you about the feature, scans existing codebases, drafts EARS-notation requirements, and will not let a phase close until its quality gate passes. It installs as agent commands for both GitHub Copilot and Claude Code from a single source.

01 Interactive discovery: the agent asks before it assumes.
02 Brownfield aware: auto-scans the repo to ground specs in reality.
03 Gates with evidence: traceability matrix generated, not promised.
Capabilities

Four capabilities that define the Specky workflow.

CAPABILITY 01
Guided discovery
Interview before artifact
  • Structured questions on scope, users, constraints
  • Edge cases surfaced before they become bugs
  • Answers recorded into the spec, not lost in chat
CAPABILITY 02
EARS requirements
Structured, testable language
  • WHEN / WHILE / WHERE / IF templates
  • Every SHALL maps to a verification
  • Ambiguity becomes syntactically visible
CAPABILITY 03
Design and diagrams
Architecture made visible
  • Mermaid architecture and sequence diagrams
  • Data models and API contracts in the design doc
  • Pre-implementation review gate for humans
CAPABILITY 04
Sequenced execution
Tasks with gates
  • [P] parallel markers, dependency ordering
  • Quality gates per phase, evidence required
  • Requirement-to-test traceability matrix
Choosing a tool

Spec-Kit and Specky: same philosophy, different center of gravity.

GitHub Spec-Kit

The standard skeleton

  • Origin: GitHub open source
  • Focus: phased workflow and templates
  • Requirements style: structured Markdown
  • Breadth: a dozen-plus supported agents
  • Best when: adopting SDD as a team standard on greenfield features
Specky

The opinionated orchestrator

  • Origin: open source, Kiro-inspired
  • Focus: discovery depth and quality gates
  • Requirements style: EARS notation
  • Depth: GitHub Copilot and Claude Code, one source
  • Best when: brownfield work, strict traceability, regulated or audit-heavy contexts
Workflow

From conversation to gated delivery in five moves.

01
Discover
Interactive interview, codebase scan on brownfield
What exists, what is wanted, what is out of scope?
chat
02
Specify
EARS requirements with acceptance criteria
Gate: every requirement testable, zero open guesses
requirements.md
03
Design
Architecture, Mermaid diagrams, contracts, data model
Gate: human review before any implementation
design.md
04
Plan tasks
Sequenced tasks, [P] parallel markers, test-first ordering
Gate: every task traces to a requirement
tasks.md
05
Execute and verify
Agent implements, gates run, traceability matrix emitted
Gate: red-green evidence plus matrix completeness
handoff
Chapter 07 · Practice
VII

The end-to-end workflow.

Greenfield and brownfield follow the same spine with different first moves. This chapter walks a feature from intent to merged pull request.

Greenfield

New project: constitution first, then features in numbered folders.

01 Day 0. specify init, write the constitution with the team: test-first, simplicity, security and observability baselines.
02 Per feature. /specify, answer clarifications, stakeholder sign-off on the spec before any planning.
03 Design. /plan against the constitution; contracts and data model reviewed like code.
04 Build. /tasks then /implement; tests precede code, [P] tasks fan out to parallel agent sessions.
05 Merge. PR contains spec delta, code, tests, and the gate report side by side.
Brownfield

Existing codebase: specify the seam, not the whole system.

01 Scan. Let the tool index the codebase: stack, patterns, test layout, integration points. The spec must respect what exists.
02 Characterize. Before changing legacy behavior, write characterization tests that pin current behavior down.
03 Specify the delta. The spec covers the change and its blast radius, with explicit compatibility requirements.
04 Migrate incrementally. Tasks sized so each lands behind a green suite; no big-bang rewrites.
05 Pay down intent debt. Each touched module leaves with a spec it never had.
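A characterization test records what legacy code does today, documented or not, so that later changes are deliberate. The sketch below is illustrative only: legacy_shipping_fee is an invented stand-in for real legacy code, and in practice the golden values would be captured once and committed rather than recomputed on every run.

```python
def legacy_shipping_fee(weight_kg, express=False):
    """Invented stand-in for legacy code with undocumented behavior."""
    fee = 5.0 + 1.5 * weight_kg
    if express:
        fee *= 2
    if weight_kg > 20:
        fee += 10  # surcharge that no document mentions
    return round(fee, 2)


def snapshot(fn, cases):
    """Record the current output of fn for a fixed set of inputs."""
    return {case: fn(*case) for case in cases}


CASES = [(1, False), (1, True), (25, False)]

# Captured from the legacy code as it behaves today.
GOLDEN = snapshot(legacy_shipping_fee, CASES)


def assert_pinned(fn):
    """Fail if fn no longer reproduces the recorded behavior."""
    assert snapshot(fn, CASES) == GOLDEN


def test_legacy_behavior_is_pinned():
    assert_pinned(legacy_shipping_fee)
```

If a refactor drops the surcharge, assert_pinned fails on the 25 kg case, and the team decides in the spec whether that behavior was a requirement or a bug.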
One feature, end to end

What actually lands in the pull request.

01 Intent: issue or idea, stated by a human
02 specs/00N-feature/: spec.md, plan.md + contracts/, tasks.md (reviewed before code)
03 tests/: contract, unit, integration, each tagged FR-xxx (written first, seen red)
04 src/: implementation, task by task, agent-generated (earns its way in green)
05 Pull request: spec delta + code + tests + gate report, reviewed together (traceability attached)

The reviewer's question changes from "does this code look fine" to "does this code satisfy this spec", a question with an answer.

Chapter 08 · Assurance
VIII

Quality gates and traceability.

A gate is a checkpoint that blocks progress until evidence exists. Traceability is the evidence: every requirement linked to the tests that verify it and the code that satisfies it.

The gates

Four gates between an idea and a merge.

G1
Spec gate
After /specify
All requirements testable, zero unresolved clarifications
blocks /plan
G2
Plan gate
After /plan
Constitution articles satisfied, contracts cover every FR
blocks /tasks
G3
Task gate
After /tasks
Every task linked to a requirement, tests ordered before code
blocks /implement
G4
Merge gate
Before PR merge
Suite green, coverage threshold met, traceability matrix complete
blocks merge
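The gate sequence behaves like an ordered list of checks in which the first failure stops progress. The sketch below models only G1 and G2. The Gate dataclass, the state dictionary keys, and the check functions are assumptions for illustration and do not reflect how Spec-Kit or Specky implement their gates.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Gate:
    name: str
    blocks: str                      # what stays blocked while the gate fails
    check: Callable[[dict], list]    # returns a list of violation messages


def spec_gate(state: dict) -> list:
    # G1: zero unresolved clarifications allowed.
    return [f"unresolved: {q}" for q in state.get("open_questions", [])]


def plan_gate(state: dict) -> list:
    # G2: every constitution article must be satisfied.
    return [f"constitution: {v}" for v in state.get("constitution_violations", [])]


GATES = [
    Gate("G1 spec gate", "/plan", spec_gate),
    Gate("G2 plan gate", "/tasks", plan_gate),
]


def first_blocking_gate(state: dict) -> Optional[tuple]:
    """Run gates in order; return (gate, violations) for the first failure."""
    for gate in GATES:
        violations = gate.check(state)
        if violations:
            return gate, violations
    return None


state = {
    "open_questions": [],
    "constitution_violations": ["Art. V observability: missing structured logging section"],
}
blocked = first_blocking_gate(state)
if blocked:
    gate, violations = blocked
    print(f"{gate.name} blocks {gate.blocks}: {violations}")
```

Returning the violations as data rather than a boolean is what lets a gate report be attached to the pull request as evidence.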
Traceability

The matrix: requirement, test, code, status. No orphans in either direction.

traceability.md · generated at the merge gate
| Requirement | Tests                                  | Implementation              | Status |
|-------------|----------------------------------------|-----------------------------|--------|
| FR-001      | test_dashboard_contract.py::3 cases    | api/dashboard.py            | PASS   |
| FR-004      | test_sprint_filter.py::5 cases         | services/filters.py         | PASS   |
| FR-007      | test_lockout.py::4 cases               | auth/lockout.py             | PASS   |
| FR-009      | test_csv_export.py::3 cases            | services/export.py          | PASS   |
| NFR-002     | test_perf_dashboard.py::p95_under_200  | (query index, migration 14) | PASS   |
| FR-011      | MISSING                                | views/archive.py            | GAP    |

Two failure smells the matrix exposes instantly: requirements with no test (unverified intent) and code with no requirement (unrequested behavior). Both block the merge gate.
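Both smells reduce to set lookups once the links are available as data. The following sketch assumes the links have already been extracted into plain dictionaries; the function name find_orphans, the data layout, and the file views/banner.py are hypothetical and only illustrate the two directions of the check.

```python
def find_orphans(requirements, tests, code):
    """Return (untested requirements, code files with no known requirement).

    requirements: set of requirement IDs from the spec
    tests:        requirement ID -> list of test files
    code:         source path -> requirement ID, or None when unlinked
    """
    untested = sorted(r for r in requirements if not tests.get(r))
    unrequested = sorted(
        path for path, req in code.items() if req not in requirements
    )
    return untested, unrequested


REQUIREMENTS = {"FR-001", "FR-011"}
TESTS = {
    "FR-001": ["test_dashboard_contract.py"],
    "FR-011": [],                      # the GAP row in the matrix above
}
CODE = {
    "api/dashboard.py": "FR-001",
    "views/archive.py": "FR-011",
    "views/banner.py": None,           # hypothetical file nobody asked for
}

untested, unrequested = find_orphans(REQUIREMENTS, TESTS, CODE)
print("unverified intent:", untested)
print("unrequested behavior:", unrequested)
```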

Chapter 09 · Caution
IX

Anti-patterns.

Both disciplines fail in predictable ways. Naming the failure modes is cheaper than living them.

SDD anti-patterns

Five ways spec-driven work goes wrong.

01 The frozen spec. Written once, never amended; code drifts and the spec becomes fiction. Treat spec updates as part of every change.
02 Waterfall cosplay. Specifying the entire system for months before any code. SDD specs are per-feature and days deep, not project-wide tomes.
03 Implementation leakage. Specs that dictate libraries and table names. The what contaminated by the how loses its power to outlive the stack.
04 Silent assumption. The agent (or the author) resolves ambiguity by guessing instead of tagging [NEEDS CLARIFICATION].
05 Gate theater. Gates exist but are skipped under deadline pressure. A gate that can be waved through is documentation, not a gate.
TDD anti-patterns

Five ways test-first work goes wrong, doubly so with agents.

01 Test-after rationalization. Code first, then tests that mirror the implementation. They pass by construction and verify nothing.
02 Mock everything. Suites that test the mocks. Integration behavior, the thing that breaks in production, stays unverified.
03 Coverage worship. 95% line coverage with assertion-free tests. Coverage measures execution, not verification.
04 Agent gaming the suite. An agent told "make tests pass" may weaken the tests. Derive tests from the spec, and treat the spec-side criteria as ground truth the agent can read but not edit.
05 Brittle coupling. Tests bound to private internals shatter on every refactor and teach teams to delete them. Test behavior at stable interfaces.
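Assertion-free tests, the core of coverage worship, can be spotted statically. The sketch below uses Python's standard ast module to flag test functions that contain no assert statement. It is a rough heuristic under stated assumptions: it only recognizes the bare assert keyword, so tests that verify through pytest.raises or unittest's assert methods would be flagged incorrectly, and the sample source is invented.

```python
import ast

# Invented sample suite; export_csv is never called because the
# source is only parsed, not executed.
SAMPLE = '''
def test_export_runs():
    export_csv(42)  # executes code, verifies nothing

def test_export_has_header():
    assert export_csv(42).startswith("task_id")
'''


def assertion_free_tests(source: str) -> list:
    """Return names of test_* functions that contain no assert statement."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test_"):
            has_assert = any(isinstance(n, ast.Assert) for n in ast.walk(node))
            if not has_assert:
                flagged.append(node.name)
    return flagged


print(assertion_free_tests(SAMPLE))
```

Both sample tests would contribute the same line coverage for export_csv, yet only one of them can ever fail for a wrong result.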
Chapter 10 · Evidence
X

Results from the field.

What changes, measurably, when teams move from prompt-driven to spec-and-test-driven AI development.

What teams report

The pattern across early adopters is consistent in direction.

Rework

Fewer regenerate-from-scratch cycles.

Clarifications happen at spec time, where a wrong answer costs a sentence, not at code time, where it costs a sprint.

Review

PR review shifts from style to substance.

With gates checking tests and traceability, humans spend review time on design judgment, the part machines are worst at.

Onboarding

Specs become the fastest path into a codebase.

New engineers and new agent sessions load the same artifacts. Context rebuilding stops being a per-person tax.

Honesty

The method exposes weak requirements early.

Teams discover that many "agent failures" were specification failures. The discipline relocates the problem to where it can be fixed.

Treat any specific percentage with care: published numbers vary by team, codebase, and baseline. The directional pattern above is what recurs.

Chapter 11 · Adoption
XI

Getting started.

You do not adopt SDD plus TDD in one reorg. You adopt it one feature at a time, with a roadmap that survives contact with deadlines.

Roadmap

Crawl, walk, run: ninety days to a working practice.

01
Crawl · weeks 1 to 3
One pilot feature, one team, one tool
Init Spec-Kit or Specky, write the constitution, run one feature end to end
1 feature
02
Walk · weeks 4 to 8
Default for new features on the pilot team
Wire gates into CI, require the traceability matrix in PRs, refine templates from retro feedback
1 team
03
Run · weeks 9 to 13
Scale out, brownfield in
Second and third teams onboard with the proven constitution; characterization-test the first legacy seams
org

The only metric that matters in week 3: did the pilot team voluntarily start their second feature with /specify?

Takeaways and references

If you remember five things from these fifty slides.

01 Vibe coding creates intent debt; specs are how intent survives.
02 SDD answers what and why; TDD proves that it does. You need both loops.
03 Acceptance criteria are the bridge: every criterion becomes a test, written first.
04 Spec-Kit gives the standard skeleton; Specky adds discovery depth and gated traceability.
05 Adopt one feature at a time. The constitution and the gates do the scaling.
Primary references
github.com/github/spec-kit
Spec-Kit repository, Specify CLI, templates
Specky · open source
Specky repository, EARS workflow, dual runtime
Kent Beck, Test-Driven Development: By Example
Robert C. Martin, the three laws of TDD
EARS: Easy Approach to Requirements Syntax, Mavin et al.
GitHub blog: Spec-driven development with AI
Thank you

Specify first.
Then let the agents build.

The teams winning with AI coding agents are not the ones with the best prompts. They are the ones whose intent is written down, testable, and versioned next to the code.

Contact
Paula Silva
Software Global Black Belt
paulasilva@microsoft.com
This deck
SDD + TDD with Spec-Kit and Specky
v1.0.0 · Published 2026-06-10
Next step
Pick one feature this week
Run it through /specify, /plan, /tasks, /implement