SDD + TDD in the AI era

From vibe coding to production discipline.

A practical guide on combining Spec-Driven and Test-Driven Development with GitHub Spec-Kit and Specky to build high-quality software with AI agents.

Author: Paula Silva

Role: Software Global Black Belt

Contact: paulasilva@microsoft.com

Version · Date: v3.1.0 · 2026-04-27

Agenda

Eight parts. From vibe coding to production discipline.

PART I · The vibe coding problem and the empirical evidence

PART II · Spec-Driven Development foundations, the four phases

PART III · Test-Driven Development in the AI era

PART IV · EARS notation, writing requirements that work

PART V · GitHub Spec-Kit, the open source methodology

PART VI · Specky, the MCP engine with 53 tools

PART VII · Constitutional SDD, security by construction

PART VIII · Practice, comparison, and ROI

Who is speaking

Paula Silva, Software Global Black Belt.

Building the future of software development with AI and Agentic DevOps.

I work with enterprise customers in Brazil and across Latin America on agentic AI, platform engineering and software modernization. This deck distills three years of patterns observed in dozens of customers, where promising AI-assisted prototypes stalled before production. SDD plus TDD is the discipline that separates teams shipping AI-generated code with confidence from teams chasing bugs they cannot trace.

Part I

The vibe coding problem.

Code that works is not the same as code that is correct, secure, maintainable, and aligned with business requirements. An SQL injection works, until someone exploits it.

Definition · Karpathy, Feb 2025

Vibe coding is accepting whatever AI generates without critical review.

Andrej Karpathy

"There's a new kind of coding I call vibe coding, where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's not really coding, I just see stuff, say stuff, run stuff."

The developer describes what they want, the AI produces code, and the developer accepts it without reading.

The five stages of AI-native development

Each stage solves the limitation of the previous one.

STAGE 01

AI Assistant

Inline suggestions

STAGE 02

Vibe Coding

Conversational prompts

STAGE 03

Prompt Engineering

Systematic refinement

STAGE 04

Context Engineering

Structured context

STAGE 05

Spec-Driven Development

Spec as source of truth

SDD does not replace prompt engineering or context engineering. It absorbs them and adds a versioned, executable spec layer on top.

The four risks of vibe coding

Speed without discipline compounds debt.

RISK 01

Scale and maintenance

Spaghetti code, low cohesion, high coupling
Hard to maintain, hard to expand

RISK 02

Context loss

No external source of truth
Sprint 1 decisions forgotten by Sprint 5

RISK 03

Prototype gap

Promising starts stall before production
Demo-grade vs production-grade

RISK 04

Collaboration friction

Decisions untraceable
Without specs, nobody knows why

Empirical evidence

Four studies. One direction.

Pearce et al · 2022

40%

of GitHub Copilot output is vulnerable in security-relevant scenarios.

Liu et al · 2026

26.1%

of 31,132 agent skills contain at least one security vulnerability.

Becker et al · 2025

19%

slower for experienced devs using AI on complex projects.

Issue tracker agent

8%

of agent invocations resulted in complete success, merged PR.

Sources: Pearce 2022, IEEE S&P; Liu et al 2026, arXiv 2601.10338; Becker et al 2025, arXiv 2507.09089; Huang et al 2025, arXiv 2512.14012.

Huang et al · 99 experienced developers · 2025

Professionals do not vibe. They control.

Qualitative research with 99 experienced developers found a clear pattern. Pros do not accept output uncritically. They guide agents through planning and active supervision.

Closing line of the paper

"AI capabilities and interfaces are changing fast: this work paints a picture of what is and is not working now. As of now, the AIs are not taking over yet, experienced developers are still in control."

Interactive · Vibe coding risk calculator

Move the sliders. Watch debt compound.

Example reading: vulnerability rate 40%, spec-code drift 35%, context loss 50%. Score: 4.2 (Monitor zone).

Vulnerabilities and drift are accumulating. Adopt the SDD spec layer before scaling agentic workflows.

Score combines vulnerability rate (Pearce 40%, Liu 26.1%) with spec drift (Marri 2026) and context loss (Huang 2025). Move toward zero by adopting Constitutional SDD plus TDD.
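
The deck does not say how the calculator weights its three inputs. One formula that happens to reproduce the reading shown (40%, 35%, 50% giving 4.2) is an unweighted mean rescaled to 0 to 10. The function below is an illustrative assumption built on that observation, not the calculator's actual implementation; the name `risk_score` is hypothetical.

```python
def risk_score(vulnerability_rate: float, spec_drift: float, context_loss: float) -> float:
    """Hypothetical score: unweighted mean of three percentages, rescaled to 0-10.

    Assumption: the real calculator may weight its inputs differently.
    """
    inputs = (vulnerability_rate, spec_drift, context_loss)
    if any(not 0 <= value <= 100 for value in inputs):
        raise ValueError("each input must be a percentage between 0 and 100")
    return round(sum(inputs) / len(inputs) / 10, 1)


# The slider positions shown on the slide.
score = risk_score(40, 35, 50)
print(score)
```

Under this assumption every input contributes equally, so lowering any one slider by 30 points lowers the score by one point.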

Part II

SDD foundations.

SDD flips the script. For decades, code was king and specs were scaffolding we discarded. SDD makes the spec executable.

The fundamental inversion

From code as truth to intent as truth.

Old, code-centric

Code is the source of truth

Docs go stale within weeks
Tests document past behavior
Decisions live in chat history
AI re-explained on every prompt

SDD, intent-centric

Spec is the source of truth

Versioned specs replace stale docs
Reviews happen at the spec layer
Code, tests, docs are derived
AI agents have persistent context

The four phases of the core process

Specify → Plan → Tasks → Implement.

PHASE 01 · SPECIFY

What and why

User stories, goals, success criteria
Owner: product manager
Output: spec.md

PHASE 02 · PLAN

How

Architecture, stack, data model, APIs
Owner: architect or tech lead
Output: plan.md

PHASE 03 · TASKS

Decomposition

Atomic, executable, traceable tasks
Owner: dev lead
Output: tasks.md

PHASE 04 · IMPLEMENT

Execution

Executes tasks, validates against spec
TDD continuous validation, review
Owner: developer or coding agent

Anatomy of a specification

A well-formed spec has six components.

GOAL

Primary objective, stated concisely

What the system enables, ideally in a single sentence and never more than one paragraph. The North Star for every downstream artifact.

USER STORIES

As a [user], I want [capability], so that [benefit]

Each story numbered (US-001). Maps to test cases. Defines persona, capability, business value.

SUCCESS CRITERIA

Measurable outcomes

Numbers and thresholds, not adjectives. "Login under 200ms p95" not "fast login".

SCOPE

In and out

In-scope and explicitly out-of-scope. Defines the iteration boundary the agent must respect.

DEPENDENCIES

External requirements

SMTP, databases, third-party APIs. What must exist for the system to function.

REQUIREMENTS

EARS-formatted, numbered

The contract the AI compiles against. Every requirement traceable to tests and to code.
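
The six components can be held in a small data structure and checked mechanically. The sketch below is an assumption about how such a check might look, not part of Spec-Kit or Specky: the `Spec` class, the `lint_spec` function, and the rule "a success criterion must contain a number" are all hypothetical, the last one derived from the "numbers and thresholds, not adjectives" guidance.

```python
import re
from dataclasses import dataclass


@dataclass
class Spec:
    """Hypothetical container for the six components (scope is split in two fields)."""
    goal: str
    user_stories: dict[str, str]      # e.g. {"US-001": "As a ..."}
    success_criteria: list[str]
    in_scope: list[str]
    out_of_scope: list[str]
    dependencies: list[str]
    requirements: dict[str, str]      # e.g. {"R-001": "WHEN ..."}


def lint_spec(spec: Spec) -> list[str]:
    """Return human-readable problems; an empty list means the basic checks passed."""
    problems = []
    if not spec.goal.strip():
        problems.append("goal is empty")
    for criterion in spec.success_criteria:
        # Adjective-only criteria ("fast login") carry no threshold to test against.
        if not re.search(r"\d", criterion):
            problems.append(f"success criterion has no number: {criterion!r}")
    if not spec.out_of_scope:
        problems.append("out-of-scope list is empty")
    if not spec.requirements:
        problems.append("no requirements")
    return problems


draft = Spec(
    goal="Let registered users sign in with email and password.",
    user_stories={"US-001": "As a user, I want to log in, so that I can see my data"},
    success_criteria=["Login under 200ms p95", "fast login"],
    in_scope=["email and password login"],
    out_of_scope=["social login"],
    dependencies=["SMTP", "database"],
    requirements={"R-001": "WHEN a user submits login, the system SHALL respond within 200ms p95"},
)
problems = lint_spec(draft)
print(problems)
```

The check is deliberately shallow: it flags missing structure, while judging whether a threshold is the right one stays with the reviewer.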

Anatomy of a plan

The plan translates what into how.

Architecture

System structure, components, boundaries, deployment topology.

Tech Stack

Layer + Tech + Version + Rationale. Not just names. The why matters.

Data Models

Database schemas, indexes, constraints. SQL where applicable.

API Design

Endpoints with request and response shapes. OpenAPI when possible.

Performance

Latency targets at p50, p95, p99. Throughput. Concurrency.

Security

Hashing, encryption, rate limits, transport. Specific algorithms.
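
Latency targets in a plan only help if they can be checked against measurements. As a minimal sketch, assuming the nearest-rank definition of a percentile (other definitions interpolate and give slightly different numbers), the helper names `percentile` and `meets_targets` below are hypothetical.

```python
import math


def percentile(samples_ms: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def meets_targets(samples_ms: list[float], targets_ms: dict[int, float]) -> bool:
    """True when every percentile is strictly under its target, e.g. {95: 200}."""
    return all(percentile(samples_ms, p) < limit for p, limit in targets_ms.items())


# Synthetic measurements: 1 ms, 2 ms, ..., 100 ms.
latencies = [float(ms) for ms in range(1, 101)]
print(percentile(latencies, 50), percentile(latencies, 95), percentile(latencies, 99))
print(meets_targets(latencies, {50: 100, 95: 200, 99: 500}))
```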

Anatomy of tasks

Atomic units with explicit dependencies and TDD steps.

## Phase 2: Authentication Core (Parallel after T001, T002)

### T003 [P]: Implement password hashing utility
- File:       app/core/security.py
- Output:     hash_password() and verify_password() functions
- Depends on: T002
- Acceptance: Unit tests pass, bcrypt cost=12 verified

### T006: Register endpoint with TDD
- TDD steps:
  1. Write test for successful registration → RED
  2. Write test for duplicate email → RED
  3. Implement endpoint → GREEN
  4. Refactor for clarity

[P] markers indicate parallel tasks. Explicit dependencies ensure correct execution order.
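
Deriving an execution order from explicit dependencies is a topological sort. The sketch below uses Python's standard `graphlib` module to group tasks into batches that could run in parallel. Only "T003 depends on T002" comes from the excerpt above; the other edges, T004, and the function name `parallel_batches` are hypothetical.

```python
from graphlib import TopologicalSorter


def parallel_batches(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into batches; every task in a batch has all its dependencies done."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are all complete
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Map each task to the tasks it depends on.
tasks = {
    "T001": set(),
    "T002": {"T001"},            # hypothetical edge
    "T003": {"T002"},            # from the tasks.md excerpt
    "T004": {"T002"},            # hypothetical task
    "T006": {"T003", "T004"},    # hypothetical edges
}
batches = parallel_batches(tasks)
print(batches)
```

In this example T003 and T004 land in the same batch, which is the situation a [P] marker describes.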

SDD benefits

Four categories of measurable business value.

BENEFIT 01

Higher code quality

Architectural consistency
Reduces tech debt at the source

BENEFIT 02

Traceable decisions

Every decision logged in spec
Onboarding from weeks to days

BENEFIT 03

Early risk detection

Dependencies surface in planning
Cheap to fix early, expensive late

BENEFIT 04

Tames complexity

AI handles bigger modules
Spec is the persistent context

Constitutional SDD impact, case study

Same baseline. SDD wins on every metric.

Metric	Without SDD	With SDD	Improvement
Security defects detected	11	3	73% reduction
Time to first secure build	9 days	4 days	56% faster
Compliance documentation	23%	100%	4.3x coverage
Security review iterations	4	1	75% reduction
Lines of security code	612	847	38% more thorough

Source: Marri 2026, Constitutional Spec-Driven Development, arXiv 2602.02584.
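
The Improvement column follows from the two measurement columns by simple arithmetic. The sketch below recomputes it from the table values; the helper names are hypothetical and the rounding convention (nearest whole percent, one decimal for the ratio) is an assumption.

```python
def reduction_pct(before: float, after: float) -> int:
    """Percentage decrease from before to after, rounded to a whole percent."""
    return round((before - after) / before * 100)


def increase_pct(before: float, after: float) -> int:
    """Percentage increase from before to after, rounded to a whole percent."""
    return round((after - before) / before * 100)


def coverage_ratio(before_pct: float, after_pct: float) -> float:
    """How many times larger the second coverage figure is, to one decimal."""
    return round(after_pct / before_pct, 1)


improvements = {
    "Security defects detected": reduction_pct(11, 3),
    "Time to first secure build": reduction_pct(9, 4),
    "Compliance documentation": coverage_ratio(23, 100),
    "Security review iterations": reduction_pct(4, 1),
    "Lines of security code": increase_pct(612, 847),
}
print(improvements)
```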

Part III

TDD foundations.

Tests written before production code. The classic Red-Green-Refactor cycle, now amplified by AI agents that are good at translating acceptance criteria into tests.

The Red-Green-Refactor cycle

Three short phases. Repeat for every behavior.

PHASE 01 · RED

Write a failing test

The test expresses desired behavior that does not exist yet. If it does not fail, the test is wrong or the behavior already exists.

PHASE 02 · GREEN

Minimum code to pass

Write only the minimum code necessary. No optimization, no extra features. Resist adding logic that is not being tested.

PHASE 03 · REFACTOR

Clean while keeping green

Improve design, remove duplication, rename for clarity. Never add behavior here. Tests stay green throughout.

Interactive · RGR live demo

Click each phase. Watch the code transform.

RED · test fails

Three short phases. Repeat for every behavior. The cycle compounds discipline.
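
One pass through the cycle can be shown in a few lines. This is a minimal sketch with a made-up behavior (email normalization) and a tiny hypothetical runner named `run`; a real project would use pytest or a similar test runner instead.

```python
def test_email_is_normalized():
    # RED: written first, before normalize_email exists.
    assert normalize_email("  Alice@Example.COM ") == "alice@example.com"


def run(test) -> str:
    """Tiny stand-in for a test runner: report RED on failure, GREEN on success."""
    try:
        test()
    except (AssertionError, NameError):
        return "RED"
    return "GREEN"


first_run = run(test_email_is_normalized)   # fails: the function is not defined yet


def normalize_email(raw: str) -> str:
    # GREEN: the minimum code that makes the test pass, nothing more.
    return raw.strip().lower()


second_run = run(test_email_is_normalized)  # passes with the minimal implementation
print(first_run, second_run)
```

The refactor step would then change the shape of `normalize_email` without touching the test, rerunning it after each change to confirm it stays green.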

Anatomy of a test, Arrange-Act-Assert

Every test follows three sections: given, when, then.

test_user_registration.py
# create_test_database and register_user are assumed to be provided
# by the application under test.
def test_user_can_register_with_valid_credentials():
    # ARRANGE (Given): set up the test context
    db = create_test_database()
    user_data = {
        "email": "alice@example.com",
        "password": "SecurePass123!"
    }

    # ACT (When): execute the behavior being tested
    response = register_user(db, user_data)

    # ASSERT (Then): verify the outcome
    assert response.status_code == 201
    assert response.json["email"] == user_data["email"]
    assert "password" not in response.json
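
To run the test above without an application behind it, the two helpers can be replaced by in-memory stand-ins. Everything in this sketch is hypothetical scaffolding: the `Response` class, the dictionary used as a database, and the 409 status for a duplicate email are assumptions chosen only to make the Arrange-Act-Assert structure executable.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """Hypothetical response object exposing the two attributes the test reads."""
    status_code: int
    json: dict


def create_test_database() -> dict:
    # Stand-in: an empty dict keyed by email plays the role of the users table.
    return {}


def register_user(db: dict, user_data: dict) -> Response:
    email = user_data["email"]
    if email in db:
        return Response(409, {"error": "email already registered"})
    # A real implementation would store a password hash here, never the password.
    db[email] = {"email": email}
    return Response(201, {"email": email})


def test_user_can_register_with_valid_credentials():
    db = create_test_database()                                        # Arrange
    user_data = {"email": "alice@example.com", "password": "SecurePass123!"}
    response = register_user(db, user_data)                            # Act
    assert response.status_code == 201                                 # Assert
    assert response.json["email"] == user_data["email"]
    assert "password" not in response.json


def test_duplicate_email_is_rejected():
    db = create_test_database()                                        # Arrange
    user_data = {"email": "alice@example.com", "password": "SecurePass123!"}
    register_user(db, user_data)
    response = register_user(db, user_data)                            # Act
    assert response.status_code == 409                                 # Assert


test_user_can_register_with_valid_credentials()
test_duplicate_email_is_rejected()
```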
  

TDD in the AI era

AI agents are good at translating acceptance criteria into tests.

SDD provides the spec. The agent reads EARS requirements and acceptance criteria, then generates the test scaffold. The developer reviews, refines, runs RED. The agent generates the implementation. RED becomes GREEN. Refactor stays human-led.

Without TDD

AI generates code, you debug it

No safety net for refactoring
Edge cases discovered in production
Tests written after the fact, often skipped

With TDD + AI

Tests are the spec the AI compiles to

Test failure is the feedback loop
Refactoring becomes safe and frequent
Coverage is built-in, not retrofitted

Part IV

EARS, requirements that work.

Easy Approach to Requirements Syntax. Six patterns that turn vague intent into testable requirements an AI can compile against.

The 6 EARS patterns

Each pattern has a trigger keyword and a clear shape.

UBIQUITOUS

Always true

"The system SHALL [behavior]." Used for invariants. No trigger keyword.

EVENT-DRIVEN

Triggered by an event

"WHEN [event], the system SHALL [behavior]." Most common pattern in practice.

STATE-DRIVEN

Active during a state

"WHILE [state], the system SHALL [behavior]." Used for ongoing conditions.

OPTIONAL

Feature-flag conditional

"WHERE [feature included], the system SHALL [behavior]." For optional capabilities.

UNWANTED

Failure modes

"IF [unwanted], THEN the system SHALL [response]." For error handling, security, edge cases.

COMPLEX

Combined patterns

"WHEN A, IF B, THEN system SHALL X." Combines triggers; use sparingly.

Interactive · EARS validator

Pick a pattern. See the converted requirement.

Vague requirement

EARS-formatted requirement

UBIQUITOUS

REQ-001: The system shall serve API responses with p50 latency under 100ms and p99 under 500ms.

Each pattern eliminates a specific kind of ambiguity. Specky validates these programmatically with Zod schemas.
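
Programmatic validation can be approximated with one regular expression per pattern. The sketch below is a Python stand-in written for this guide; it is not Specky's Zod-based validator, and the `classify` function and its exact regular expressions are assumptions that only cover the sentence shapes shown in this part.

```python
import re
from typing import Optional

# Order matters: the complex shape must be tried before the simpler WHEN shape.
EARS_PATTERNS = [
    ("complex", re.compile(r"^WHEN .+, IF .+, THEN (the )?system SHALL .+", re.I)),
    ("event-driven", re.compile(r"^WHEN .+, the system SHALL .+", re.I)),
    ("state-driven", re.compile(r"^WHILE .+, the system SHALL .+", re.I)),
    ("optional", re.compile(r"^WHERE .+, the system SHALL .+", re.I)),
    ("unwanted", re.compile(r"^IF .+, THEN (the )?system SHALL .+", re.I)),
    ("ubiquitous", re.compile(r"^The system SHALL .+", re.I)),
]


def classify(requirement: str) -> Optional[str]:
    """Return the name of the first matching EARS pattern, or None if none match."""
    text = requirement.strip()
    for name, pattern in EARS_PATTERNS:
        if pattern.match(text):
            return name
    return None


examples = [
    "WHEN a user submits login, the system SHALL respond within 200ms p95",
    "The system SHALL hash passwords with bcrypt cost 12",
    "IF login fails 5 times, THEN system SHALL lock account for 15 minutes",
    "The system should be fast and secure.",
]
results = [classify(text) for text in examples]
print(results)
```

A shape check like this only rejects sentences that lack a trigger or a SHALL clause; whether the stated behavior is measurable still needs review.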

Bad vs EARS, real example

From vague to testable.

Vague (rejected)

"The system should be fast and secure."

"Fast" is undefined, no threshold
"Secure" against what threat model
No trigger, no measurable behavior
Cannot be tested or compiled

EARS (accepted)

Three concrete requirements

R-001 WHEN a user submits login, the system SHALL respond within 200ms p95
R-002 The system SHALL hash passwords with bcrypt cost 12 (ubiquitous)
R-003 IF login fails 5 times, THEN the system SHALL lock the account for 15 minutes
Each requirement is testable, traceable, executable
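
R-003 translates almost word for word into a test. The sketch below is one possible reading, with a hypothetical `LoginGuard` class and an injected clock so the 15 minutes can be tested without waiting. The requirement leaves open whether failures must be consecutive and whether a successful login resets the counter; this sketch simply counts failures and ignores both questions.

```python
class LoginGuard:
    """Hypothetical lockout tracker for R-003; not taken from any real library."""
    MAX_FAILURES = 5
    LOCK_SECONDS = 15 * 60

    def __init__(self, clock):
        self._clock = clock            # callable returning the current time in seconds
        self._failures = {}
        self._locked_until = {}

    def is_locked(self, user: str) -> bool:
        return self._clock() < self._locked_until.get(user, 0.0)

    def record_failure(self, user: str) -> None:
        count = self._failures.get(user, 0) + 1
        if count >= self.MAX_FAILURES:
            self._locked_until[user] = self._clock() + self.LOCK_SECONDS
            count = 0                  # start counting again after the lock is set
        self._failures[user] = count


def test_r003_account_locks_for_15_minutes_after_5_failures():
    now = [0.0]                        # mutable cell so the test controls time
    guard = LoginGuard(clock=lambda: now[0])
    for _ in range(4):
        guard.record_failure("alice")
    assert not guard.is_locked("alice")    # four failures: still open
    guard.record_failure("alice")
    assert guard.is_locked("alice")        # fifth failure: locked
    now[0] = 15 * 60 - 1
    assert guard.is_locked("alice")        # one second before expiry
    now[0] = 15 * 60
    assert not guard.is_locked("alice")    # lock has expired


test_r003_account_locks_for_15_minutes_after_5_failures()
```

Writing the test surfaces the open questions in the requirement early, which is the point of pairing EARS with TDD.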

Part V

GitHub Spec-Kit.

The open source methodology. Slash commands, templates, and a constitution. Standard across 30+ AI agents.

Spec-Kit slash commands

Seven commands. One workflow. The five core commands are shown here.

/specify

Generate spec.md from intent

Input: business problem. Output: structured spec with goals, user stories, EARS requirements.

/plan

Generate plan.md from spec

Input: spec.md. Output: architecture, tech stack, data model, API contracts, security plan.

/tasks

Decompose into atomic tasks

Input: plan.md. Output: tasks.md with [P] parallel markers, explicit dependencies, TDD steps.

/implement

Execute tasks with TDD loop

Input: tasks.md. Agent writes failing tests first, then minimum code, then refactors.

/constitution

Define non-negotiable principles

Input: project context. Output: principles, security policies, architectural constraints.

Anatomy of a Spec-Kit project

Five files. One source of truth.

project structure

my-project/
├── .specify/
│   └── constitution.md        # Non-negotiable principles, security, compliance
├── specs/
│   └── 001-feature-name/
│       ├── spec.md            # Goals, user stories, EARS requirements
│       ├── plan.md            # Architecture, stack, data model, security
│       └── tasks.md           # Atomic tasks with [P] markers and TDD steps
├── src/                       # Generated and reviewed code
├── tests/                     # Generated tests, RED then GREEN
└── README.md

Every artifact in the spec layer is versioned, reviewable, and acts as input for the next phase. Code is downstream of intent.
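
Because the layout is fixed, a short script can confirm that each feature folder carries all three artifacts before the next phase starts. This is a sketch under the assumption that the tree above is the whole convention; `missing_artifacts` is a hypothetical helper, not a Spec-Kit command.

```python
import tempfile
from pathlib import Path

REQUIRED_FILES = ("spec.md", "plan.md", "tasks.md")


def missing_artifacts(project_root: Path) -> list[str]:
    """List spec-layer files that the layout expects but the project lacks."""
    missing = []
    if not (project_root / ".specify" / "constitution.md").is_file():
        missing.append(".specify/constitution.md")
    specs_dir = project_root / "specs"
    feature_dirs = sorted(p for p in specs_dir.iterdir() if p.is_dir()) if specs_dir.is_dir() else []
    for feature in feature_dirs:
        for name in REQUIRED_FILES:
            if not (feature / name).is_file():
                missing.append(f"specs/{feature.name}/{name}")
    return missing


# Demo: build a throwaway project that has no tasks.md yet.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / ".specify").mkdir()
    (root / ".specify" / "constitution.md").write_text("# Project Constitution\n")
    feature = root / "specs" / "001-feature-name"
    feature.mkdir(parents=True)
    (feature / "spec.md").write_text("# Spec\n")
    (feature / "plan.md").write_text("# Plan\n")
    report = missing_artifacts(root)

print(report)
```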

Real example · building Taskify

A Trello-style platform built end-to-end with the workflow.

D 01 · specify init taskify --ai claude creates the repo and agent guide.

D 02 · Constitution drafted: code quality, testing, UX, performance principles.

D 03 · Spec drafted: 5 predefined users, 3 sample projects, Kanban columns, comments.

D 04 · Clarify resolves: tasks per project, comment ordering, mobile drag-and-drop.

D 05 · Plan generates research.md, data-model.md, OpenAPI spec, SignalR protocol.

D 06 · Tasks broken down by user story with [P] parallel markers and TDD steps.

D 07 · Implement validates prerequisites, runs tasks in dependency order, follows TDD.

Part VI

Specky, the MCP engine for SDD.

53 MCP tools. State machine. Programmatic EARS. Compliance check. Cross-artifact analysis.

Spec-Kit vs Specky

Two complementary tools. One workflow.

Spec-Kit, methodology

Open source toolkit, prompts and templates

Slash commands and Markdown templates
Standard across 30+ AI agents
Defines the methodology, not the runtime
github.com/github/spec-kit

Specky, engine

MCP server, 53 tools, state machine

Programmatic EARS validation
Cross-artifact consistency analysis
Constitution compliance check
github.com/paulasilvatech/specky

Specky inventory

53 tools across 10 phases.

MCP tools

53

Specify, plan, taskify, generate, validate, audit, compliance, traceability.

Pipeline phases

10

From intent to deployed code, with LGTM gates between every phase.

Input methods

6

Conversation, transcript, document, ticket, code, hybrid. The agent picks the right one.

Project types

3

Greenfield, brownfield, modernization. Each with a different starting state and gates.

53 Specky tools · 12 categories

Every category solves a recurring problem.

Input · 5

PDF, DOCX, transcripts, Figma. Auto-pipeline from any source.

Pipeline Core · 8

init, discover, write_spec, clarify, write_design, write_tasks, run_analysis, advance_phase.

Quality · 5

Checklist, verify_tasks, compliance_check, cross_analyze, validate_ears.

Diagrams · 4

17 Mermaid types. Auto-generated per feature. FigJam-ready exports.

IaC · 3

Terraform, Bicep, Dockerfile. Validation via Terraform MCP.

Dev Env · 3

Local Docker setup, GitHub Codespaces, devcontainer.json.

Integration · 5

Branch naming, work items export, PR creation, implement, research.

Docs · 4

Auto-docs, API docs from design, runbook, onboarding guide.

Testing · 3

vitest, jest, playwright, pytest, junit. Property-based with fast-check or Hypothesis.

Plus: Turnkey (1), Checkpointing (3), Utility/Ecosystem (6) totaling 53 tools across 12 categories.

Interactive · Specky 10-phase pipeline

Click any phase. See tools and LGTM gates.

Phase 01 · Init

Creates structure, constitution, scans codebase.

Output: CONSTITUTION.md, scope diagram, project skeleton.

Tools: sdd_init, sdd_scan_codebase.

State machine blocks phase-skipping. LGTM gates between Specify, Design, Tasks, Verify ensure human oversight.
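
The two guarantees in that sentence, no phase-skipping and no progress without a human LGTM, fit in a very small state machine. The sketch below is not Specky's implementation: the `Pipeline` class is hypothetical, it models only the five phase names mentioned on this slide (in an assumed order) rather than all ten, and it gates every transition for simplicity.

```python
class PhaseError(Exception):
    """Raised when a transition would skip a phase or bypass an approval."""


class Pipeline:
    # Assumed subset and order; the real pipeline has 10 phases.
    PHASES = ("init", "specify", "design", "tasks", "verify")

    def __init__(self):
        self._index = 0
        self._approved = set()

    @property
    def phase(self) -> str:
        return self.PHASES[self._index]

    def approve(self, reviewer: str) -> None:
        """Record a human LGTM for the current phase."""
        self._approved.add(self.phase)

    def advance(self, target: str) -> None:
        if self._index + 1 >= len(self.PHASES):
            raise PhaseError("already at the final phase")
        expected = self.PHASES[self._index + 1]
        if target != expected:
            raise PhaseError(f"cannot move from {self.phase} to {target}; next phase is {expected}")
        if self.phase not in self._approved:
            raise PhaseError(f"{self.phase} has no LGTM yet")
        self._index += 1


pipeline = Pipeline()
blocked = []
for attempt in ("design", "specify"):      # a skip, then an unapproved move
    try:
        pipeline.advance(attempt)
    except PhaseError as error:
        blocked.append(str(error))

pipeline.approve("reviewer")
pipeline.advance("specify")                # allowed once the gate is passed
print(blocked, pipeline.phase)
```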

Part VII

Constitutional SDD.

Security by construction, not by inspection. Non-negotiable principles in the spec layer. Traceability from principle to file and line.

The problem with AI-generated code

Four recurring patterns in unsafe AI code.

PROBLEM 01

Injection vectors

String concatenation in SQL
Unescaped HTML in templates

PROBLEM 02

Weak crypto defaults

MD5, SHA-1, ECB mode
Hardcoded keys and secrets

PROBLEM 03

Authorization gaps

Missing access checks
IDOR, privilege escalation

PROBLEM 04

Silent error handling

Catch all, return generic 500
No audit trail of failures

CONSTITUTION.md, example

Principles every artifact must honor.

.specify/constitution.md
# Project Constitution

## Principle 01: All user input is hostile
- All inputs MUST pass through a parameterized query layer
- All HTML output MUST be escaped at the templating boundary
- Compliance: CWE-89, CWE-79
- Trace: src/db/repository.py, src/web/templates/

## Principle 02: No secrets in code or config
- Secrets MUST come from a secret manager at runtime
- Build pipelines MUST scan for high-entropy strings
- Compliance: CWE-798
- Trace: .github/workflows/secret-scan.yml

## Principle 03: Every privileged action is authorized and audited
- All privileged endpoints MUST verify identity AND permission
- All privileged actions MUST emit a structured audit event
- Compliance: CWE-862, CWE-285
- Trace: src/auth/decorators.py, src/audit/
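
Principle 01 is easiest to appreciate side by side with the pattern it forbids. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the table, the function names, and the payload are illustrative assumptions and do not come from the repository paths listed in the constitution.

```python
import sqlite3


def find_user_unsafe(conn: sqlite3.Connection, email: str) -> list:
    # Violates Principle 01: user input is concatenated into the SQL text (CWE-89).
    return conn.execute("SELECT email FROM users WHERE email = '" + email + "'").fetchall()


def find_user_safe(conn: sqlite3.Connection, email: str) -> list:
    # Honors Principle 01: the value travels as a bound parameter, never as SQL text.
    return conn.execute("SELECT email FROM users WHERE email = ?", (email,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?)",
    [("alice@example.com",), ("bob@example.com",)],
)

payload = "' OR '1'='1"                     # classic injection string
leaked = find_user_unsafe(conn, payload)    # the WHERE clause becomes always true
safe = find_user_safe(conn, payload)        # treated as a literal email, matches nothing
print(len(leaked), safe)
```

Both functions return the same result for a well-formed email, which is why the unsafe version "works" until someone exploits it.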

Best practices · Bandara et al. 2025

Nine principles for production-grade agentic workflows.

01

Tool calls over MCP. Direct calls for deterministic operations.

02

Direct functions over tool calls. Pure functions for infra ops.

03

One agent, one tool. Reduces ambiguity.

04

Single-responsibility agents. No mixing generation + validation.

05

External prompts. Prompts in .md files in Git.

06

Multi-LLM consortium. Reasoner agent consolidates outputs.

07

Workflow vs MCP separation. REST API + lightweight MCP adapter.

08

Containerized deploy. Docker + Kubernetes for portability.

09

KISS principle. Avoid over-engineering. Avoid microservices when not needed.

Three project types

Same pipeline. Different starting point.

TYPE 01 · Greenfield · From scratch · sdd_init → sdd_discover → standard pipeline.

TYPE 02 · Brownfield · Existing codebase · scan + init with context → pipeline with awareness.

TYPE 03 · Modernization · Legacy migration · scan + batch_import + transcripts → pipeline with constraints.

Specky adapts to any project type. The pipeline is the same; the starting point changes.

Part VIII

Practice and ROI.

SDD plus TDD plus Constitutional SDD. Methodology plus engine plus security layer. Production-ready AI development in 2026.

The SDD ecosystem in 2026

Spec-Kit + Specky + Constitutional SDD

SDD ensures you are building the right thing. TDD ensures you are building it correctly. Constitutional SDD ensures you are building it safely.

METHODOLOGY

Spec-Kit

Open source toolkit. Templates and slash commands. Standard across 30+ AI agents.

ENGINE

Specky

53 MCP tools. State machine. Programmatic EARS. Compliance check. Cross-artifact analysis.

SECURITY LAYER

Constitutional SDD

Non-negotiable principles in the spec layer. Secure by construction, not by inspection.

Closing

From vibes to discipline. SDD plus TDD makes AI work.

Building the future of software development with AI and Agentic DevOps.

Paula Silva | Software Global Black Belt

Contact paulasilva@microsoft.com