Paula Silva | Software Global Black Belt
SDD + TDD in the AI era

From vibe coding to production discipline.

A practical guide on combining Spec-Driven and Test-Driven Development with GitHub Spec-Kit and Specky to build high-quality software with AI agents.

Author: Paula Silva
Role: Software Global Black Belt
Version · Date: v3.1.0 · 2026-04-27
Agenda

Eight parts. From vibe coding to production discipline.

PART I · The vibe coding problem and the empirical evidence
PART II · Spec-Driven Development foundations, the four phases
PART III · Test-Driven Development in the AI era
PART IV · EARS notation, writing requirements that work
PART V · GitHub Spec-Kit, the open source methodology
PART VI · Specky, the MCP engine with 53 tools
PART VII · Constitutional SDD, security by construction
PART VIII · Practice, comparison, and ROI
Who is speaking

Paula Silva, Software Global Black Belt.

Building the future of software development with AI and Agentic DevOps.

I work with enterprise customers in Brazil and across Latin America on agentic AI, platform engineering and software modernization. This deck distills three years of patterns observed in dozens of customers, where promising AI-assisted prototypes stalled before production. SDD plus TDD is the discipline that separates teams shipping AI-generated code with confidence from teams chasing bugs they cannot trace.

Part I

The vibe coding problem.

Code that works is not the same as code that is correct, secure, maintainable, and aligned with business requirements. An SQL injection works, until someone exploits it.

Definition · Karpathy, Feb 2025

Vibe coding is accepting whatever AI generates without critical review.

Andrej Karpathy

"There's a new kind of coding I call vibe coding, where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's not really coding, I just see stuff, say stuff, run stuff."

The developer describes what they want, the AI produces code, and the developer accepts it without reading. Code that works is not the same as code that is correct, secure, maintainable, and aligned with business requirements.

The five stages of AI-native development

Each stage solves the limitation of the previous one.

STAGE 01
AI Assistant
Inline suggestions
STAGE 02
Vibe Coding
Conversational prompts
STAGE 03
Prompt Engineering
Systematic refinement
STAGE 04
Context Engineering
Structured context
STAGE 05
Spec-Driven Development
Spec as source of truth

SDD does not replace prompt engineering or context engineering. It absorbs them and adds a versioned, executable spec layer on top.

[Figure: Stage evolution from AI Assistant to Spec-Driven Development, stages 01 to 05]
The four risks of vibe coding

Speed without discipline compounds debt.

RISK 01
Scale and maintenance
  • Spaghetti code, low cohesion, high coupling
  • Hard to maintain, hard to expand
RISK 02
Context loss
  • No external source of truth
  • Sprint 1 decisions forgotten by Sprint 5
RISK 03
Prototype gap
  • Promising starts stall before production
  • Demo-grade vs production-grade
RISK 04
Collaboration friction
  • Decisions untraceable
  • Without specs, nobody knows why
Empirical evidence

Four studies. One direction.

Pearce et al · 2022
40%
of GitHub Copilot output is vulnerable in security-relevant scenarios.
Liu et al · 2026
26.1%
of 31,132 agent skills contain at least one security vulnerability.
Becker et al · 2025
19%
slower for experienced devs using AI on complex projects.
Issue tracker agent
8%
of agent invocations ended in complete success (a merged PR).

Sources: Pearce 2022, IEEE S&P; Liu et al 2026, arXiv 2601.10338; Becker et al 2025, arXiv 2507.09089; Huang et al 2025, arXiv 2512.14012.

Huang et al · 99 experienced developers · 2025

Professionals do not vibe. They control.

Qualitative research with 99 experienced developers found a clear pattern. Pros do not accept output uncritically. They guide agents through planning and active supervision.

Closing line of the paper

"AI capabilities and interfaces are changing fast: this work paints a picture of what is and is not working now. As of now, the AIs are not taking over yet, experienced developers are still in control."

Interactive · Vibe coding risk calculator

Move the sliders. Watch debt compound.

Vulnerability rate 40%
Spec-code drift 35%
Context loss 50%
4.2
Monitor zone

Vulnerabilities and drift are accumulating. Adopt an SDD spec layer before scaling agentic workflows.

Score combines vulnerability rate (Pearce 40%, Liu 26.1%) with spec drift (Marri 2026) and context loss (Huang 2025). Move toward zero by adopting Constitutional SDD plus TDD.
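The slide does not state the formula behind the score. One reading that happens to reproduce the displayed 4.2 is an equal-weight average of the three slider values, rescaled from 0-100 to 0-10. The sketch below encodes that assumption only; the function name and the equal weights are hypothetical, not taken from the calculator itself.

```python
def risk_score(vulnerability_rate: float, spec_drift: float, context_loss: float) -> float:
    """Equal-weight average of three percentages, rescaled to a 0-10 score.

    Assumption: the real calculator may weight its inputs differently.
    """
    inputs = (vulnerability_rate, spec_drift, context_loss)
    for value in inputs:
        if not 0 <= value <= 100:
            raise ValueError("each input must be a percentage between 0 and 100")
    return round(sum(inputs) / len(inputs) / 10, 1)


# Slider values shown on the slide: 40%, 35%, 50%.
score = risk_score(40, 35, 50)
print(score)  # 4.2
```

Under this reading, lowering any one slider lowers the score linearly, which matches the slide's advice to "move toward zero".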

Part II

SDD foundations.

SDD flips the script. For decades, code was king and specs were scaffolding we discarded. SDD makes the spec executable.

The fundamental inversion

From code as truth to intent as truth.

Old, code-centric

Code is the source of truth

  • Docs go stale within weeks
  • Tests document past behavior
  • Decisions live in chat history
  • AI re-explained on every prompt
SDD, intent-centric

Spec is the source of truth

  • Versioned specs replace stale docs
  • Reviews happen at the spec layer
  • Code, tests, docs are derived
  • AI agents have persistent context
[Figure: Pyramid inversion, from code to spec as source of truth. Old, code-centric: code is the truth, surrounded by docs, tests, and comments. SDD, intent-centric: the spec is the truth; plan, tasks, code, tests, and docs are derived.]
The four phases of the core process

Specify → Plan → Tasks → Implement.

PHASE 01 · SPECIFY
What and why
  • User stories, goals, success criteria
  • Owner: product manager
  • Output: spec.md
PHASE 02 · PLAN
How
  • Architecture, stack, data model, APIs
  • Owner: architect or tech lead
  • Output: plan.md
PHASE 03 · TASKS
Decomposition
  • Atomic, executable, traceable tasks
  • Owner: dev lead
  • Output: tasks.md
PHASE 04 · IMPLEMENT
Execution
  • Executes tasks, validates against spec
  • TDD continuous validation, review
  • Owner: developer or coding agent
[Figure: Spec-Driven Development four-phase workflow: Specify (spec.md) → Plan (plan.md) → Tasks (tasks.md) → Implement (code + tests)]

Each output is the input for the next phase. Spec is the source of truth.
Anatomy of a specification

A well-formed spec has six components.

GOAL
Primary objective in one paragraph
What the system enables. Single sentence. The North Star for every downstream artifact.
USER STORIES
As a [user], I want [capability], so that [benefit]
Each story numbered (US-001). Maps to test cases. Defines persona, capability, business value.
SUCCESS CRITERIA
Measurable outcomes
Numbers and thresholds, not adjectives. "Login under 200ms p95" not "fast login".
SCOPE
In and out
In-scope and explicitly out-of-scope. Defines the iteration boundary the agent must respect.
DEPENDENCIES
External requirements
SMTP, databases, third-party APIs. What must exist for the system to function.
REQUIREMENTS
EARS-formatted, numbered
The contract the AI compiles against. Every requirement traceable to tests and to code.
Anatomy of a plan

The plan translates what into how.

Architecture
System structure, components, boundaries, deployment topology.
Tech Stack
Layer + Tech + Version + Rationale. Not just names. The why matters.
Data Models
Database schemas, indexes, constraints. SQL where applicable.
API Design
Endpoints with request and response shapes. OpenAPI when possible.
Performance
Latency targets at p50, p95, p99. Throughput. Concurrency.
Security
Hashing, encryption, rate limits, transport. Specific algorithms.
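A latency target such as "Login under 200ms p95" only becomes checkable once the percentile method is pinned down. The sketch below uses the nearest-rank definition, which is one of several common conventions; the sample latencies and helper names are made up for illustration.

```python
import math


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range 1..100")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def meets_target(samples: list[float], p: int, limit_ms: float) -> bool:
    """True when the p-th percentile latency is within the limit."""
    return percentile(samples, p) <= limit_ms


# Hypothetical login latencies in milliseconds (20 samples).
latencies_ms = [120, 95, 180, 140, 210, 130, 110, 160, 150, 170,
                100, 125, 135, 145, 155, 165, 175, 185, 190, 198]

print(percentile(latencies_ms, 95))         # 198
print(meets_target(latencies_ms, 95, 200))  # True
```

Note that one slow request (210 ms) does not break a p95 target in a sample of 20, but it would break a p99 target of the same size. That is why the plan should name the percentile explicitly.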
Anatomy of tasks

Atomic units with explicit dependencies and TDD steps.

## Phase 2: Authentication Core (Parallel after T001, T002)

### T003 [P]: Implement password hashing utility
- File: app/core/security.py
- Output: hash_password() and verify_password() functions
- Depends on: T002
- Acceptance: Unit tests pass, bcrypt cost=12 verified

### T006: Register endpoint with TDD
- TDD steps:
  1. Write test for successful registration → RED
  2. Write test for duplicate email → RED
  3. Implement endpoint → GREEN
  4. Refactor for clarity

[P] markers indicate parallel tasks. Explicit dependencies ensure correct execution order.
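To make the link between "Depends on" fields and execution order concrete, here is a minimal sketch that groups tasks into waves using Python's standard `graphlib` module. Only the "T003 depends on T002" edge comes from the excerpt above; the other edges are hypothetical, and this is not how Spec-Kit or any agent necessarily schedules work.

```python
from graphlib import TopologicalSorter

# Task -> set of tasks it depends on.
# Assumption: only "T003 depends on T002" is from the excerpt; the rest is illustrative.
dependencies = {
    "T001": set(),
    "T002": set(),
    "T003": {"T002"},
    "T006": {"T001", "T003"},
}


def execution_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves; tasks in the same wave have no unmet dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(dependencies)
print(waves)  # [['T001', 'T002'], ['T003'], ['T006']]
```

Tasks that land in the same wave are the ones a [P] marker would allow to run in parallel.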

SDD benefits

Four categories of measurable business value.

BENEFIT 01
Higher code quality
  • Architectural consistency
  • Reduces tech debt at the source
BENEFIT 02
Traceable decisions
  • Every decision logged in spec
  • Onboarding from weeks to days
BENEFIT 03
Early risk detection
  • Dependencies surface in planning
  • Cheap to fix early, expensive late
BENEFIT 04
Tames complexity
  • AI handles bigger modules
  • Spec is the persistent context
Constitutional SDD impact, case study

Same baseline. SDD wins on every metric.

Metric | Without SDD | With SDD | Improvement
Security defects detected | 11 | 3 | 73% reduction
Time to first secure build | 9 days | 4 days | 56% faster
Compliance documentation | 23% | 100% | 4.3x coverage
Security review iterations | 4 | 1 | 75% reduction
Lines of security code | 612 | 847 | 38% more thorough

Source: Marri 2026, Constitutional Spec-Driven Development, arXiv 2602.02584.
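The improvement column follows directly from the before and after values. As a quick arithmetic check (the helper names are ours, not from the paper):

```python
def reduction_pct(before: float, after: float) -> int:
    """Percentage decrease from before to after, rounded to the nearest integer."""
    return round((before - after) / before * 100)


def increase_pct(before: float, after: float) -> int:
    """Percentage increase from before to after, rounded to the nearest integer."""
    return round((after - before) / before * 100)


def ratio(before: float, after: float) -> float:
    """How many times larger the after value is, to one decimal place."""
    return round(after / before, 1)


print(reduction_pct(11, 3))    # 73  (security defects)
print(reduction_pct(9, 4))     # 56  (days to first secure build)
print(reduction_pct(4, 1))     # 75  (security review iterations)
print(increase_pct(612, 847))  # 38  (lines of security code)
print(ratio(23, 100))          # 4.3 (compliance documentation coverage)
```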

[Figure: Constitutional SDD impact, without SDD vs with SDD. Lower is better: security defects, time to first secure build (days), security review iterations. Higher is better: lines of security code.]
Part III

TDD foundations.

Tests written before production code. The classic Red-Green-Refactor cycle, now amplified by AI agents that are good at translating acceptance criteria into tests.

The Red-Green-Refactor cycle

Three short phases. Repeat for every behavior.

PHASE 01 · RED
Write a failing test

The test expresses desired behavior that does not exist yet. If it does not fail, the test is wrong or the behavior already exists.

PHASE 02 · GREEN
Minimum code to pass

Write only the minimum code necessary. No optimization, no extra features. Resist adding logic that is not being tested.

PHASE 03 · REFACTOR
Clean while keeping green

Improve design, remove duplication, rename for clarity. Never add behavior here. Tests stay green throughout.

[Figure: Red-Green-Refactor cyclical loop: RED (write a failing test) → GREEN (minimum code to pass) → REFACTOR (clean while keeping green), repeated for every behavior]
Interactive · RGR live demo

Click each phase. Watch the code transform.


Three short phases. Repeat for every behavior. The cycle compounds discipline.
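As a static stand-in for the cycle, the sketch below walks one behavior through all three phases in a single script. The `slugify` function and the tiny `run` helper are hypothetical and exist only to make the RED and GREEN states observable without a test runner.

```python
def run(test) -> str:
    """Return 'GREEN' if the test passes, 'RED' if it fails."""
    try:
        test()
    except (AssertionError, NotImplementedError):
        return "RED"
    return "GREEN"


def test_slugify_replaces_spaces_and_lowercases():
    assert slugify("Hello World") == "hello-world"


# RED: the test exists, the behavior does not.
def slugify(title):
    raise NotImplementedError

red = run(test_slugify_replaces_spaces_and_lowercases)


# GREEN: the minimum code that makes the test pass.
def slugify(title):
    return title.lower().replace(" ", "-")

green = run(test_slugify_replaces_spaces_and_lowercases)


# REFACTOR: same behavior, clearer intent, test still green.
def slugify(title: str) -> str:
    words = title.lower().split(" ")
    return "-".join(words)

refactor = run(test_slugify_replaces_spaces_and_lowercases)

print(red, green, refactor)  # RED GREEN GREEN
```

The refactor step changes only the shape of the code. If it had changed the result for any input, that would be new behavior and would need its own RED test first.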

Anatomy of a test, Arrange-Act-Assert

Every test follows three sections: given, when, then.

test_user_registration.py
def test_user_can_register_with_valid_credentials():
    # ARRANGE (Given): set up the test context
    db = create_test_database()
    user_data = {
        "email": "alice@example.com",
        "password": "SecurePass123!"
    }

    # ACT (When): execute the behavior being tested
    response = register_user(db, user_data)

    # ASSERT (Then): verify the outcome
    assert response.status_code == 201
    assert response.json["email"] == user_data["email"]
    assert "password" not in response.json
TDD in the AI era

AI agents are good at translating acceptance criteria into tests.

SDD provides the spec. The agent reads EARS requirements and acceptance criteria, then generates the test scaffold. The developer reviews, refines, runs RED. The agent generates the implementation. RED becomes GREEN. Refactor stays human-led.

Without TDD

AI generates code, you debug it

  • No safety net for refactoring
  • Edge cases discovered in production
  • Tests written after the fact, often skipped
With TDD + AI

Tests are the spec the AI compiles to

  • Test failure is the feedback loop
  • Refactoring becomes safe and frequent
  • Coverage is built-in, not retrofitted
[Figure: Spec, test, and code feedback loop. SPEC (source of truth: acceptance criteria, EARS requirements) drives TEST (contract: AI generates RED, from criteria to assertions), which drives CODE (derived: AI implements GREEN). Test failures feed spec refinement.]
Part IV

EARS, requirements that work.

Easy Approach to Requirements Syntax. Six patterns that turn vague intent into testable requirements an AI can compile against.

The 6 EARS patterns

Each pattern has a trigger keyword and a clear shape.

UBIQUITOUS
Always true
"The system SHALL [behavior]." Used for invariants. No trigger keyword.
EVENT-DRIVEN
Triggered by an event
"WHEN [event], the system SHALL [behavior]." Most common pattern in practice.
STATE-DRIVEN
Active during a state
"WHILE [state], the system SHALL [behavior]." Used for ongoing conditions.
OPTIONAL
Feature-flag conditional
"WHERE [feature included], the system SHALL [behavior]." For optional capabilities.
UNWANTED
Failure modes
"IF [unwanted], THEN the system SHALL [response]." For error handling, security, edge cases.
COMPLEX
Combined patterns
"WHEN A, IF B, THEN system SHALL X." Combines triggers; use sparingly.
Interactive · EARS validator

Pick a pattern. See the converted requirement.

Vague requirement
EARS-formatted requirement
UBIQUITOUS
REQ-001: The system shall serve API responses with p50 latency under 100ms and p99 under 500ms.

Each pattern eliminates a specific kind of ambiguity. Specky validates these programmatically with Zod schemas.
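To illustrate what programmatic validation of these shapes can look like, here is a minimal regex-based classifier. It is a Python sketch written for this deck, not Specky's Zod-based implementation, and the patterns only check the keyword skeleton, not whether the behavior is measurable.

```python
import re

# Order matters: the complex pattern must be tried before event-driven and unwanted.
EARS_PATTERNS = [
    ("complex", re.compile(r"^WHEN .+, IF .+, THEN (the )?system SHALL .+", re.IGNORECASE)),
    ("event-driven", re.compile(r"^WHEN .+, the system SHALL .+", re.IGNORECASE)),
    ("state-driven", re.compile(r"^WHILE .+, the system SHALL .+", re.IGNORECASE)),
    ("optional", re.compile(r"^WHERE .+, the system SHALL .+", re.IGNORECASE)),
    ("unwanted", re.compile(r"^IF .+, THEN (the )?system SHALL .+", re.IGNORECASE)),
    ("ubiquitous", re.compile(r"^The system SHALL .+", re.IGNORECASE)),
]


def classify(requirement: str):
    """Return the name of the first EARS pattern the text matches, or None."""
    text = requirement.strip()
    for name, pattern in EARS_PATTERNS:
        if pattern.match(text):
            return name
    return None


print(classify("The system should be fast and secure."))
# None
print(classify("WHEN a user submits login, the system SHALL respond within 200ms p95"))
# event-driven
print(classify("IF login fails 5 times, THEN system SHALL lock account for 15 minutes"))
# unwanted
```

A requirement that returns None is rejected before it reaches planning, which is the point of enforcing the syntax mechanically.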

Bad vs EARS, real example

From vague to testable.

Vague (rejected)

"The system should be fast and secure."

  • "Fast" is undefined, no threshold
  • "Secure" against what threat model
  • No trigger, no measurable behavior
  • Cannot be tested or compiled
EARS (accepted)

Three concrete requirements

  • R-001 WHEN a user submits login, the system SHALL respond within 200ms p95
  • R-002 The system SHALL hash passwords with bcrypt cost 12 (ubiquitous)
  • R-003 IF login fails 5 times, THEN the system SHALL lock account for 15 minutes
  • Each requirement is testable, traceable, executable
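To show why such a requirement is testable, here is a minimal sketch that turns R-003 into assertions. The `LoginGuard` class is hypothetical, and the decision to reset the failure counter on lockout is our assumption; the requirement itself does not say what happens to the counter.

```python
from datetime import datetime, timedelta

MAX_FAILURES = 5                       # from R-003
LOCK_DURATION = timedelta(minutes=15)  # from R-003


class LoginGuard:
    """Tracks failed logins for one account; the clock is passed in for deterministic tests."""

    def __init__(self) -> None:
        self.failures = 0
        self.locked_until = None

    def record_failure(self, now: datetime) -> None:
        self.failures += 1
        if self.failures >= MAX_FAILURES:
            self.locked_until = now + LOCK_DURATION
            self.failures = 0  # assumption: counter resets once the lock starts

    def is_locked(self, now: datetime) -> bool:
        return self.locked_until is not None and now < self.locked_until


def test_r003_account_locks_after_five_failures_for_fifteen_minutes():
    start = datetime(2026, 1, 1, 12, 0, 0)
    guard = LoginGuard()

    for _ in range(4):
        guard.record_failure(start)
    assert not guard.is_locked(start)  # four failures: still open

    guard.record_failure(start)
    assert guard.is_locked(start)                                     # fifth failure locks
    assert guard.is_locked(start + timedelta(minutes=14, seconds=59))
    assert not guard.is_locked(start + timedelta(minutes=15))         # lock has expired


test_r003_account_locks_after_five_failures_for_fifteen_minutes()
```

The test name carries the requirement ID, so a failing test points straight back to the line in the spec it protects.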
Part V

GitHub Spec-Kit.

The open source methodology. Slash commands, templates, and a constitution. Standard across 30+ AI agents.

Spec-Kit slash commands

Seven commands. One workflow.

/specify
Generate spec.md from intent
Input: business problem. Output: structured spec with goals, user stories, EARS requirements.
/plan
Generate plan.md from spec
Input: spec.md. Output: architecture, tech stack, data model, API contracts, security plan.
/tasks
Decompose into atomic tasks
Input: plan.md. Output: tasks.md with [P] parallel markers, explicit dependencies, TDD steps.
/implement
Execute tasks with TDD loop
Input: tasks.md. Agent writes failing tests first, then minimum code, then refactors.
/constitution
Define non-negotiable principles
Input: project context. Output: principles, security policies, architectural constraints.
[Figure: Spec-Kit slash command timeline: /constitution (principles, non-negotiables) → /specify (spec.md, what + why) → /plan (plan.md, how) → /tasks (tasks.md, decompose) → /implement (code + tests, execute TDD) → /analyze (audit, verify)]
Anatomy of a Spec-Kit project

Five files. One source of truth.

project structure
my-project/
├── .specify/
│   └── constitution.md      # Non-negotiable principles, security, compliance
├── specs/
│   └── 001-feature-name/
│       ├── spec.md          # Goals, user stories, EARS requirements
│       ├── plan.md          # Architecture, stack, data model, security
│       └── tasks.md         # Atomic tasks with [P] markers and TDD steps
├── src/                     # Generated and reviewed code
├── tests/                   # Generated tests, RED then GREEN
└── README.md

Every artifact in the spec layer is versioned, reviewable, and acts as input for the next phase. Code is downstream of intent.
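Because each artifact feeds the next phase, a missing file is worth catching early. The sketch below checks every feature directory under `specs/` for the three artifacts shown in the tree. It is a hypothetical helper for illustration, not a Spec-Kit command, and it builds a throwaway project in a temporary directory so it can run anywhere.

```python
import tempfile
from pathlib import Path

REQUIRED_ARTIFACTS = ("spec.md", "plan.md", "tasks.md")


def missing_artifacts(project_root: Path) -> dict[str, list[str]]:
    """Map each feature directory under specs/ to the artifacts it lacks."""
    report = {}
    specs_dir = project_root / "specs"
    if not specs_dir.is_dir():
        return report
    for feature in sorted(p for p in specs_dir.iterdir() if p.is_dir()):
        missing = [name for name in REQUIRED_ARTIFACTS if not (feature / name).is_file()]
        if missing:
            report[feature.name] = missing
    return report


# Demo: a feature that has a spec and a plan but no tasks yet.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    feature_dir = root / "specs" / "001-feature-name"
    feature_dir.mkdir(parents=True)
    (feature_dir / "spec.md").write_text("# Spec\n")
    (feature_dir / "plan.md").write_text("# Plan\n")
    report = missing_artifacts(root)

print(report)  # {'001-feature-name': ['tasks.md']}
```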

Real example · building Taskify

A Trello-style platform built end-to-end with the workflow.

D 01 · specify init taskify --ai claude creates the repo and agent guide.
D 02 · Constitution drafted: code quality, testing, UX, performance principles.
D 03 · Spec drafted: 5 predefined users, 3 sample projects, Kanban columns, comments.
D 04 · Clarify resolves: tasks per project, comment ordering, mobile drag-and-drop.
D 05 · Plan generates research.md, data-model.md, OpenAPI spec, SignalR protocol.
D 06 · Tasks broken down by user story with [P] parallel markers and TDD steps.
D 07 · Implement validates prerequisites, runs tasks in dependency order, follows TDD.
Part VI

Specky, the MCP engine for SDD.

53 MCP tools. State machine. Programmatic EARS. Compliance check. Cross-artifact analysis.

Spec-Kit vs Specky

Two complementary tools. One workflow.

Spec-Kit, methodology

Open source toolkit, prompts and templates

  • Slash commands and Markdown templates
  • Standard across 30+ AI agents
  • Defines the methodology, not the runtime
  • github.com/github/spec-kit
Specky, engine

MCP server, 53 tools, state machine

  • Programmatic EARS validation
  • Cross-artifact consistency analysis
  • Constitution compliance check
  • github.com/paulasilvatech/specky
[Figure: Spec-Kit and Specky complementary scope. Methodology (Spec-Kit): slash commands, Markdown templates, 30+ AI agents. Engine (Specky): 53 MCP tools, state machine, EARS validator. Shared SDD workflow: spec · plan · tasks · impl.]

Spec-Kit defines the methodology. Specky executes it programmatically.
Specky inventory

53 tools across 10 phases.

MCP tools
53
Specify, plan, taskify, generate, validate, audit, compliance, traceability.
Pipeline phases
10
From intent to deployed code, with LGTM gates between every phase.
Input methods
6
Conversation, transcript, document, ticket, code, hybrid. The agent picks the right one.
Project types
3
Greenfield, brownfield, modernization. Each with a different starting state and gates.
53 Specky tools · 12 categories

Every category solves a recurring problem.

Input · 5
PDF, DOCX, transcripts, Figma. Auto-pipeline from any source.
Pipeline Core · 8
init, discover, write_spec, clarify, write_design, write_tasks, run_analysis, advance_phase.
Quality · 5
Checklist, verify_tasks, compliance_check, cross_analyze, validate_ears.
Diagrams · 4
17 Mermaid types. Auto-generated per feature. FigJam-ready exports.
IaC · 3
Terraform, Bicep, Dockerfile. Validation via Terraform MCP.
Dev Env · 3
Local Docker setup, GitHub Codespaces, devcontainer.json.
Integration · 5
Branch naming, work items export, PR creation, implement, research.
Docs · 4
Auto-docs, API docs from design, runbook, onboarding guide.
Testing · 3
vitest, jest, playwright, pytest, junit. Property-based with fast-check or Hypothesis.

Plus: Turnkey (1), Checkpointing (3), Utility/Ecosystem (6) totaling 53 tools across 12 categories.

Interactive · Specky 10-phase pipeline

Click any phase. See tools and LGTM gates.

Phase 01 · Init
Creates structure, constitution, scans codebase.
Output: CONSTITUTION.md, scope diagram, project skeleton.
sdd_init sdd_scan_codebase

State machine blocks phase-skipping. LGTM gates between Specify, Design, Tasks, Verify ensure human oversight.
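The idea of a state machine that refuses to skip phases can be sketched in a few lines. The phase list below is a hypothetical five-phase subset (the slide names ten phases without listing them all), and the class is an illustration of the gating concept, not Specky's code.

```python
# Assumption: illustrative subset of phases, in order.
PHASES = ["init", "specify", "design", "tasks", "verify"]


class PhaseError(Exception):
    """Raised when a transition would skip a phase or bypass an LGTM gate."""


class Pipeline:
    def __init__(self) -> None:
        self._index = 0
        self._approved: set[str] = set()

    @property
    def phase(self) -> str:
        return PHASES[self._index]

    def approve(self, phase: str) -> None:
        """Record a human LGTM for the current phase."""
        if phase != self.phase:
            raise PhaseError(f"cannot approve {phase!r} while in {self.phase!r}")
        self._approved.add(phase)

    def advance(self, target: str) -> None:
        """Move to the next phase only if it is adjacent and the current one is approved."""
        if target not in PHASES:
            raise PhaseError(f"unknown phase {target!r}")
        if PHASES.index(target) != self._index + 1:
            raise PhaseError(f"cannot move from {self.phase!r} to {target!r}")
        if self.phase not in self._approved:
            raise PhaseError(f"{self.phase!r} has no LGTM yet")
        self._index += 1


pipeline = Pipeline()
errors = []
for attempt in ("design", "specify"):  # skip a phase, then bypass the gate
    try:
        pipeline.advance(attempt)
    except PhaseError as exc:
        errors.append(str(exc))

pipeline.approve("init")
pipeline.advance("specify")
print(errors)          # ["cannot move from 'init' to 'design'", "'init' has no LGTM yet"]
print(pipeline.phase)  # specify
```

Both rejected transitions fail for different reasons: the first skips a phase, the second lacks human approval. Keeping those two checks separate makes the error message tell the agent exactly what is missing.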

Part VII

Constitutional SDD.

Security by construction, not by inspection. Non-negotiable principles in the spec layer. Traceability from principle to file and line.

The problem with AI-generated code

Four recurring patterns in unsafe AI code.

PROBLEM 01
Injection vectors
  • String concatenation in SQL
  • Unescaped HTML in templates
PROBLEM 02
Weak crypto defaults
  • MD5, SHA-1, ECB mode
  • Hardcoded keys and secrets
PROBLEM 03
Authorization gaps
  • Missing access checks
  • IDOR, privilege escalation
PROBLEM 04
Silent error handling
  • Catch all, return generic 500
  • No audit trail of failures
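The first pattern is easy to demonstrate. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table, the rows, and the function names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice@example.com", "admin"), ("bob@example.com", "user")],
)


def find_user_unsafe(conn: sqlite3.Connection, email: str) -> list:
    # Vulnerable: user input is concatenated into the SQL text.
    query = "SELECT email, role FROM users WHERE email = '" + email + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, email: str) -> list:
    # Parameterized: the driver passes the value separately from the SQL text.
    return conn.execute(
        "SELECT email, role FROM users WHERE email = ?", (email,)
    ).fetchall()


payload = "nobody' OR '1'='1"
unsafe_rows = find_user_unsafe(conn, payload)
safe_rows = find_user_safe(conn, payload)

print(len(unsafe_rows))  # 2 (the injected OR clause matches every row)
print(safe_rows)         # []
```

Both functions "work" for a normal email address, which is exactly why this class of defect survives an uncritical review.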
CONSTITUTION.md, example

Principles every artifact must honor.

.specify/constitution.md
# Project Constitution

## Principle 01: All user input is hostile
- All inputs MUST pass through a parameterized query layer
- All HTML output MUST be escaped at the templating boundary
- Compliance: CWE-89, CWE-79
- Trace: src/db/repository.py, src/web/templates/

## Principle 02: No secrets in code or config
- Secrets MUST come from a secret manager at runtime
- Build pipelines MUST scan for high-entropy strings
- Compliance: CWE-798
- Trace: .github/workflows/secret-scan.yml

## Principle 03: Every privileged action is authorized and audited
- All privileged endpoints MUST verify identity AND permission
- All privileged actions MUST emit a structured audit event
- Compliance: CWE-862, CWE-285
- Trace: src/auth/decorators.py, src/audit/
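Principle 02 asks pipelines to scan for high-entropy strings. A rough sketch of that idea, using Shannon entropy per character, follows. The threshold of 4.0 bits and the minimum length of 20 are illustrative assumptions, the sample key is made up, and a dedicated secret scanner would do considerably more than this.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(text).values())


def suspicious_tokens(line: str, threshold: float = 4.0, min_length: int = 20) -> list[str]:
    """Return long tokens whose entropy suggests a random-looking secret."""
    for separator in ('"', "'", "="):
        line = line.replace(separator, " ")
    return [
        token
        for token in line.split()
        if len(token) >= min_length and shannon_entropy(token) >= threshold
    ]


# Made-up value with 32 distinct characters, standing in for a leaked key.
fake_key = "A1b2C3d4E5f6G7h8I9j0KlMnOpQrStUv"

print(suspicious_tokens(f'API_KEY = "{fake_key}"') == [fake_key])  # True
print(suspicious_tokens("timeout = 30"))                           # []
print(suspicious_tokens("padding = " + "a" * 24))                  # []
```

Entropy alone produces false positives (hashes, UUIDs) and misses short or patterned secrets, so in practice this check would be one signal among several.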
Best practices · Bandara et al. 2025

Nine principles for production-grade agentic workflows.

01
Tool calls over MCP. Direct calls for deterministic operations.
02
Direct functions over tool calls. Pure functions for infra ops.
03
One agent, one tool. Reduces ambiguity.
04
Single-responsibility agents. No mixing generation + validation.
05
External prompts. Prompts in .md files in Git.
06
Multi-LLM consortium. Reasoner agent consolidates outputs.
07
Workflow vs MCP separation. REST API + lightweight MCP adapter.
08
Containerized deploy. Docker + Kubernetes for portability.
09
KISS principle. Avoid over-engineering. Avoid microservices when not needed.
Three project types

Same pipeline. Different starting point.

TYPE 01 · Greenfield · From scratch: sdd_init → sdd_discover → standard pipeline.
TYPE 02 · Brownfield · Existing codebase: scan + init with context → pipeline with awareness.
TYPE 03 · Modernization · Legacy migration: scan + batch_import + transcripts → pipeline with constraints.

Specky adapts to any project type. The pipeline is the same; the starting point changes.

[Figure: Decision tree, when to apply SDD, TDD, or both. Start: new feature or change? If requirements are unclear, Path A · SDD: spec first, capture intent. If behavior is well-defined, Path B · TDD: test first, lock the contract. Recommended: SDD + TDD, where spec drives test drives code.]

SDD ensures you build the right thing. TDD ensures you build it correctly.
Part VIII

Practice and ROI.

SDD plus TDD plus Constitutional SDD. Methodology plus engine plus security layer. Production-ready AI development in 2026.

The SDD ecosystem in 2026

Spec-Kit + Specky + Constitutional SDD

SDD ensures you are building the right thing. TDD ensures you are building it correctly. Constitutional SDD ensures you are building it safely.

METHODOLOGY
Spec-Kit

Open source toolkit. Templates and slash commands. Standard across 30+ AI agents.

ENGINE
Specky

53 MCP tools. State machine. Programmatic EARS. Compliance check. Cross-artifact analysis.

SECURITY LAYER
Constitutional SDD

Non-negotiable principles in the spec layer. Secure by construction, not by inspection.

[Figure: SDD ecosystem. Methodology (Spec-Kit: templates, slash commands), engine (Specky: 53 MCP tools, state machine), and security layer (Constitutional SDD: principles + traceability) combine into production-grade AI development.]
Closing

From vibes to discipline. SDD plus TDD makes AI work.

Building the future of software development with AI and Agentic DevOps.

Paula Silva | Software Global Black Belt
Contact paulasilva@microsoft.com