Eight parts. From vibe coding to production discipline.
PART I · The vibe coding problem and the empirical evidence
PART II · Spec-Driven Development foundations, the four phases
PART III · Test-Driven Development in the AI era
PART IV · EARS notation, writing requirements that work
PART V · GitHub Spec-Kit, the open source methodology
PART VI · Specky, the MCP engine with 53 tools
PART VII · Constitutional SDD, security by construction
PART VIII · Practice, comparison, and ROI
Who is speaking
Paula Silva, Software Global Black Belt.
Building the future of software development with AI and Agentic DevOps.
I work with enterprise customers in Brazil and across Latin America on agentic AI, platform engineering and software modernization. This deck distills three years of patterns observed in dozens of customers, where promising AI-assisted prototypes stalled before production. SDD plus TDD is the discipline that separates teams shipping AI-generated code with confidence from teams chasing bugs they cannot trace.
Part I
I
The vibe coding problem.
Code that works is not the same as code that is correct, secure, maintainable, and aligned with business requirements. Code with a SQL injection flaw works, until someone exploits it.
Definition · Karpathy, Feb 2025
Vibe coding is accepting whatever AI generates without critical review.
Andrej Karpathy
"There's a new kind of coding I call vibe coding, where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's not really coding, I just see stuff, say stuff, run stuff."
The developer describes what they want, the AI produces code, and the developer accepts it without reading.
The five stages of AI-native development
Each stage solves the limitation of the previous one.
STAGE 01
AI Assistant
Inline suggestions
STAGE 02
Vibe Coding
Conversational prompts
STAGE 03
Prompt Engineering
Systematic refinement
STAGE 04
Context Engineering
Structured context
STAGE 05
Spec-Driven Development
Spec as source of truth
SDD does not replace prompt engineering or context engineering. It absorbs them and adds a versioned, executable spec layer on top.
The four risks of vibe coding
Speed without discipline compounds debt.
RISK 01
Scale and maintenance
Spaghetti code, low cohesion, high coupling
Hard to maintain, hard to expand
RISK 02
Context loss
No external source of truth
Sprint 1 decisions forgotten by Sprint 5
RISK 03
Prototype gap
Promising starts stall before production
Demo-grade vs production-grade
RISK 04
Collaboration friction
Decisions untraceable
Without specs, nobody knows why
Empirical evidence
Four studies. One direction.
Pearce et al · 2022
40%
of GitHub Copilot output is vulnerable in security-relevant scenarios.
Liu et al · 2026
26.1%
of 31,132 agent skills contain at least one security vulnerability.
Becker et al · 2025
19%
slower for experienced devs using AI on complex projects.
Issue tracker agent
8%
of agent invocations resulted in complete success (a merged PR).
Sources: Pearce 2022, IEEE S&P; Liu et al 2026, arXiv 2601.10338; Becker et al 2025, arXiv 2507.09089; Huang et al 2025, arXiv 2512.14012.
Huang et al · 99 experienced developers · 2025
Professionals do not vibe. They control.
Qualitative research with 99 experienced developers found a clear pattern. Pros do not accept output uncritically. They guide agents through planning and active supervision.
Closing line of the paper
"AI capabilities and interfaces are changing fast: this work paints a picture of what is and is not working now. As of now, the AIs are not taking over yet, experienced developers are still in control."
Interactive · Vibe coding risk calculator
Move the sliders. Watch debt compound.
Vulnerability rate: 40%
Spec-code drift: 35%
Context loss: 50%
4.2
Monitor zone
Vulnerabilities and drift are accumulating. Adopt the SDD spec layer before scaling agentic workflows.
Score combines vulnerability rate (Pearce 40%, Liu 26.1%) with spec drift (Marri 2026) and context loss (Huang 2025). Move toward zero by adopting Constitutional SDD plus TDD.
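The deck does not state how the calculator combines its three sliders. One reading that happens to reproduce the displayed 4.2 from 40%, 35%, and 50% is an unweighted mean rescaled to a 0-10 range. The sketch below assumes exactly that; the function name and the equal weighting are illustrative, not the calculator's actual formula.

```python
def risk_score(vulnerability_rate: float, spec_drift: float, context_loss: float) -> float:
    """Return a 0-10 score from three percentages (0-100).

    Assumption: an unweighted mean, rescaled from 0-100 to 0-10.
    The deck does not publish the calculator's real formula.
    """
    for value in (vulnerability_rate, spec_drift, context_loss):
        if not 0 <= value <= 100:
            raise ValueError("each input must be a percentage between 0 and 100")
    mean_percent = (vulnerability_rate + spec_drift + context_loss) / 3
    return round(mean_percent / 10, 1)


# Slider values shown on the slide.
print(risk_score(40, 35, 50))
```

Under this assumption, lowering any one slider lowers the score linearly, which matches the "move toward zero" framing on the slide.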
Part II
II
SDD foundations.
SDD flips the script. For decades, code was king and specs were scaffolding we discarded. SDD makes the spec executable.
The fundamental inversion
From code as truth to intent as truth.
Old, code-centric
Code is the source of truth
Docs go stale within weeks
Tests document past behavior
Decisions live in chat history
AI re-explained on every prompt
SDD, intent-centric
Spec is the source of truth
Versioned specs replace stale docs
Reviews happen at the spec layer
Code, tests, docs are derived
AI agents have persistent context
The four phases of the core process
Specify → Plan → Tasks → Implement.
PHASE 01 · SPECIFY
What and why
User stories, goals, success criteria
Owner: product manager
Output: spec.md
PHASE 02 · PLAN
How
Architecture, stack, data model, APIs
Owner: architect or tech lead
Output: plan.md
PHASE 03 · TASKS
Decomposition
Atomic, executable, traceable tasks
Owner: dev lead
Output: tasks.md
PHASE 04 · IMPLEMENT
Execution
Executes tasks, validates against spec
TDD continuous validation, review
Owner: developer or coding agent
Anatomy of a specification
A well-formed spec has six components.
GOAL
Primary objective in one paragraph
What the system enables. Single sentence. The North Star for every downstream artifact.
USER STORIES
As a [user], I want [capability], so that [benefit]
Each story numbered (US-001). Maps to test cases. Defines persona, capability, business value.
SUCCESS CRITERIA
Measurable outcomes
Numbers and thresholds, not adjectives. "Login under 200ms p95" not "fast login".
SCOPE
In and out
In-scope and explicitly out-of-scope. Defines the iteration boundary the agent must respect.
DEPENDENCIES
External requirements
SMTP, databases, third-party APIs. What must exist for the system to function.
REQUIREMENTS
EARS-formatted, numbered
The contract the AI compiles against. Every requirement traceable to tests and to code.
Anatomy of a plan
The plan translates what into how.
Architecture
System structure, components, boundaries, deployment topology.
Tech Stack
Layer + Tech + Version + Rationale. Not just names. The why matters.
Data Models
Database schemas, indexes, constraints. SQL where applicable.
API Design
Endpoints with request and response shapes. OpenAPI when possible.
Performance
Latency targets at p50, p95, p99. Throughput. Concurrency.
Security
Hashing, encryption, rate limits, transport. Specific algorithms.
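Latency targets at p50, p95, and p99 only become checkable once they are computed the same way every time. A minimal sketch using the standard library, assuming latencies are collected as a list of millisecond samples; the function names and the sample data are hypothetical.

```python
import statistics


def latency_percentiles(samples_ms):
    """Return p50, p95 and p99 of a list of latency samples in milliseconds."""
    # quantiles(n=100) returns the 99 cut points between percentiles 1..99,
    # so index 49 is p50, index 94 is p95 and index 98 is p99.
    cuts = statistics.quantiles(samples_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def meets_targets(samples_ms, targets_ms):
    """True when every measured percentile is at or under its target."""
    measured = latency_percentiles(samples_ms)
    return all(measured[name] <= limit for name, limit in targets_ms.items())


# Hypothetical samples: 1 ms, 2 ms, ..., 100 ms.
samples = [float(ms) for ms in range(1, 101)]
print(latency_percentiles(samples))
print(meets_targets(samples, {"p95": 200.0}))
```

A plan that names the percentile and the threshold together, as in "200ms p95", can be turned into a check like `meets_targets` without further interpretation.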
Anatomy of tasks
Atomic units with explicit dependencies and TDD steps.
## Phase 2: Authentication Core (Parallel after T001, T002)
### T003 [P]: Implement password hashing utility
- File: app/core/security.py
- Output: hash_password() and verify_password() functions
- Depends on: T002
- Acceptance: Unit tests pass, bcrypt cost=12 verified
### T006: Register endpoint with TDD
- TDD steps:
1. Write test for successful registration → RED
2. Write test for duplicate email → RED
3. Implement endpoint → GREEN
4. Refactor for clarity
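Explicit "Depends on" lines make the task list a dependency graph, so an execution order and the parallel groups can be derived rather than guessed. A minimal sketch with the standard library's `graphlib`. Only the edge "T003 depends on T002" comes from the excerpt above; the dependencies shown for T006 are hypothetical.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on.
# Only "T003 depends on T002" is taken from the excerpt; the T006 edges
# are hypothetical and exist to make the example complete.
dependencies = {
    "T001": set(),
    "T002": set(),
    "T003": {"T002"},
    "T006": {"T001", "T003"},
}


def execution_batches(deps):
    """Yield groups of tasks whose dependencies are all complete."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        yield ready
        sorter.done(*ready)


batches = list(execution_batches(dependencies))
print(batches)
```

Tasks that land in the same batch have no unmet dependencies on each other, which is the property the `[P]` marker appears to signal.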
Tests written before production code. The classic Red-Green-Refactor cycle, now amplified by AI agents that are good at translating acceptance criteria into tests.
The Red-Green-Refactor cycle
Three short phases. Repeat for every behavior.
PHASE 01 · RED
Write a failing test
The test expresses desired behavior that does not exist yet. If it does not fail, the test is wrong or the behavior already exists.
PHASE 02 · GREEN
Minimum code to pass
Write only the minimum code necessary. No optimization, no extra features. Resist adding logic that is not being tested.
PHASE 03 · REFACTOR
Clean while keeping green
Improve design, remove duplication, rename for clarity. Never add behavior here. Tests stay green throughout.
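The three phases can be sketched in a few lines. The behavior under test, `normalize_email`, is a hypothetical example and not part of the deck's project; the point is the order in which the pieces are written.

```python
# Step 1 (RED): write the test first. Run at this point, it fails with
# NameError because normalize_email does not exist yet.
def test_normalize_email_trims_and_lowercases():
    assert normalize_email("  Alice@Example.COM ") == "alice@example.com"


# Step 2 (GREEN): the minimum code that makes the test pass.
def normalize_email(raw: str) -> str:
    return raw.strip().lower()


# Step 3 (REFACTOR): nothing to clean up yet; any later restructuring
# must keep the test above passing without adding behavior.
test_normalize_email_trims_and_lowercases()
print("green")
```

The implementation does no validation and no domain handling because no test asks for them yet; those would each start with their own failing test.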
Interactive · RGR live demo
Click each phase. Watch the code transform.
RED · test fails
The cycle compounds discipline.
Anatomy of a test, Arrange-Act-Assert
Every test follows three sections: given, when, then.
test_user_registration.py
def test_user_can_register_with_valid_credentials():
    # create_test_database and register_user are project helpers assumed to exist.
    # ARRANGE (Given): set up the test context
    db = create_test_database()
    user_data = {
        "email": "alice@example.com",
        "password": "SecurePass123!",
    }
    # ACT (When): execute the behavior being tested
    response = register_user(db, user_data)
    # ASSERT (Then): verify the outcome
    assert response.status_code == 201
    assert response.json["email"] == user_data["email"]
    assert "password" not in response.json
TDD in the AI era
AI agents are good at translating acceptance criteria into tests.
SDD provides the spec. The agent reads EARS requirements and acceptance criteria, then generates the test scaffold. The developer reviews, refines, runs RED. The agent generates the implementation. RED becomes GREEN. Refactor stays human-led.
Without TDD
AI generates code, you debug it
No safety net for refactoring
Edge cases discovered in production
Tests written after the fact, often skipped
With TDD + AI
Tests are the spec the AI compiles to
Test failure is the feedback loop
Refactoring becomes safe and frequent
Coverage is built-in, not retrofitted
Part IV
IV
EARS, requirements that work.
Easy Approach to Requirements Syntax. Six patterns that turn vague intent into testable requirements an AI can compile against.
The 6 EARS patterns
Each pattern has a trigger keyword and a clear shape.
UBIQUITOUS
Always true
"The system SHALL [behavior]." Used for invariants. No trigger keyword.
EVENT-DRIVEN
Triggered by an event
"WHEN [event], the system SHALL [behavior]." Most common pattern in practice.
STATE-DRIVEN
Active during a state
"WHILE [state], the system SHALL [behavior]." Used for ongoing conditions.
OPTIONAL
Feature-flag conditional
"WHERE [feature included], the system SHALL [behavior]." For optional capabilities.
UNWANTED
Failure modes
"IF [unwanted], THEN the system SHALL [response]." For error handling, security, edge cases.
COMPLEX
Combined patterns
"WHEN A, IF B, THEN system SHALL X." Combines triggers; use sparingly.
Interactive · EARS validator
Pick a pattern. See the converted requirement.
Vague requirement
EARS-formatted requirement
UBIQUITOUS
REQ-001: The system shall serve API responses with p50 latency under 100ms and p99 under 500ms.
Each pattern eliminates a specific kind of ambiguity. Specky validates these programmatically with Zod schemas.
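Because each pattern has a fixed keyword shape, a requirement can be classified mechanically. The sketch below uses Python regular expressions as a stand-in; it is not Specky's Zod-based validation, and the pattern table and `classify` function are assumptions made for illustration.

```python
import re

# Ordered from most to least specific: COMPLEX must be tried before
# EVENT-DRIVEN because both start with WHEN.
EARS_PATTERNS = [
    ("COMPLEX", re.compile(r"^WHEN .+, IF .+, THEN .+ SHALL .+", re.I)),
    ("EVENT-DRIVEN", re.compile(r"^WHEN .+, .+ SHALL .+", re.I)),
    ("STATE-DRIVEN", re.compile(r"^WHILE .+, .+ SHALL .+", re.I)),
    ("OPTIONAL", re.compile(r"^WHERE .+, .+ SHALL .+", re.I)),
    ("UNWANTED", re.compile(r"^IF .+, THEN .+ SHALL .+", re.I)),
    ("UBIQUITOUS", re.compile(r"^The .+ SHALL .+", re.I)),
]


def classify(requirement: str):
    """Return the EARS pattern name, or None if no pattern matches."""
    text = requirement.strip()
    for name, pattern in EARS_PATTERNS:
        if pattern.match(text):
            return name
    return None


print(classify("WHEN a user submits login, the system SHALL respond within 200ms p95"))
print(classify("The system should be fast and secure."))
```

A `None` result is the useful signal: a sentence without SHALL and without a recognizable trigger is one a reviewer should send back before any code is generated.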
Bad vs EARS, real example
From vague to testable.
Vague (rejected)
"The system should be fast and secure."
"Fast" is undefined, no threshold
"Secure" against what threat model
No trigger, no measurable behavior
Cannot be tested or compiled
EARS (accepted)
Three concrete requirements
R-001 WHEN a user submits login, the system SHALL respond within 200ms p95
R-002 The system SHALL hash passwords with bcrypt cost 12 (ubiquitous)
R-003 IF login fails 5 times, THEN the system SHALL lock the account for 15 minutes
Each requirement is testable, traceable, executable
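To show what "testable" means in practice, R-003 can be turned directly into a test and a small implementation. This is a sketch under stated assumptions: `LoginGuard` is a hypothetical in-memory class, and because R-003 does not say whether the five failures must be consecutive or fall within a time window, the sketch simply counts failures per account.

```python
from datetime import datetime, timedelta

MAX_FAILURES = 5                       # threshold from R-003
LOCK_DURATION = timedelta(minutes=15)  # duration from R-003


class LoginGuard:
    """Hypothetical in-memory tracker of failed logins per account."""

    def __init__(self):
        self._failures = {}
        self._locked_until = {}

    def is_locked(self, account: str, now: datetime) -> bool:
        until = self._locked_until.get(account)
        return until is not None and now < until

    def record_failure(self, account: str, now: datetime) -> None:
        count = self._failures.get(account, 0) + 1
        self._failures[account] = count
        if count >= MAX_FAILURES:
            # Lock the account and start counting again after the lock.
            self._locked_until[account] = now + LOCK_DURATION
            self._failures[account] = 0


def test_r003_account_locks_after_five_failures():
    guard = LoginGuard()
    start = datetime(2025, 1, 1, 12, 0, 0)
    for _ in range(4):
        guard.record_failure("alice", start)
    assert not guard.is_locked("alice", start)      # four failures: still open
    guard.record_failure("alice", start)
    assert guard.is_locked("alice", start)          # fifth failure: locked
    assert guard.is_locked("alice", start + timedelta(minutes=14))
    assert not guard.is_locked("alice", start + timedelta(minutes=15))


test_r003_account_locks_after_five_failures()
```

Passing `now` explicitly keeps the test deterministic; the ambiguity about consecutive failures is the kind of gap a spec review should close with an additional requirement.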
Part V
V
GitHub Spec-Kit.
The open source methodology. Slash commands, templates, and a constitution. Standard across 30+ AI agents.
Spec-Kit slash commands
Seven commands. One workflow.
/specify
Generate spec.md from intent
Input: business problem. Output: structured spec with goals, user stories, EARS requirements.
/plan
Generate plan.md from spec
Input: spec.md. Output: architecture, tech stack, data model, API contracts, security plan.
State machine blocks phase-skipping. LGTM gates between Specify, Design, Tasks, Verify ensure human oversight.
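A gate of this kind can be sketched as a small state machine in which a phase cannot advance until a human approval has been recorded for it. The phase names come from the line above; the class, method names, and error type are hypothetical and do not describe Specky's actual implementation.

```python
PHASES = ["Specify", "Design", "Tasks", "Verify"]


class PhaseGateError(Exception):
    """Raised when a transition would skip a phase or a human approval."""


class Workflow:
    def __init__(self):
        self._index = 0
        self._approved = set()

    @property
    def phase(self) -> str:
        return PHASES[self._index]

    def approve(self, phase: str) -> None:
        """Record a human LGTM for the current phase only."""
        if phase != self.phase:
            raise PhaseGateError(f"cannot approve {phase} while in {self.phase}")
        self._approved.add(phase)

    def advance(self) -> str:
        """Move to the next phase, but only after the current one is approved."""
        if self.phase not in self._approved:
            raise PhaseGateError(f"{self.phase} has no LGTM yet")
        if self._index == len(PHASES) - 1:
            raise PhaseGateError("already at the final phase")
        self._index += 1
        return self.phase


workflow = Workflow()
workflow.approve("Specify")
print(workflow.advance())
```

Because `advance` only ever moves one step and `approve` only accepts the current phase, there is no sequence of calls that reaches Tasks without an approved Design.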
Part VII
VII
Constitutional SDD.
Security by construction, not by inspection. Non-negotiable principles in the spec layer. Traceability from principle to file and line.
The problem with AI-generated code
Four recurring patterns in unsafe AI code.
PROBLEM 01
Injection vectors
String concatenation in SQL
Unescaped HTML in templates
PROBLEM 02
Weak crypto defaults
MD5, SHA-1, ECB mode
Hardcoded keys and secrets
PROBLEM 03
Authorization gaps
Missing access checks
IDOR, privilege escalation
PROBLEM 04
Silent error handling
Catch all, return generic 500
No audit trail of failures
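The first pattern is the easiest to demonstrate. The sketch below uses an in-memory SQLite database with a hypothetical `users` table and contrasts a concatenated query with a parameterized one; the table, rows, and function names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT, is_admin INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice@example.com", 0), ("root@example.com", 1)],
)


def find_user_unsafe(email):
    # PROBLEM 01: string concatenation lets the input rewrite the query.
    query = "SELECT email FROM users WHERE email = '" + email + "'"
    return conn.execute(query).fetchall()


def find_user_safe(email):
    # The ? placeholder sends the value separately from the SQL text.
    return conn.execute(
        "SELECT email FROM users WHERE email = ?", (email,)
    ).fetchall()


payload = "' OR '1'='1"
print(len(find_user_unsafe(payload)))  # every row leaks
print(len(find_user_safe(payload)))    # no row matches the literal string
```

Both functions return the right row for a normal email address, which is why the unsafe version passes a casual demo and still fails a security review.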
CONSTITUTION.md, example
Principles every artifact must honor.
.specify/constitution.md
# Project Constitution
## Principle 01: All user input is hostile
- All inputs MUST pass through a parameterized query layer
- All HTML output MUST be escaped at the templating boundary
- Compliance: CWE-89, CWE-79
- Trace: src/db/repository.py, src/web/templates/
## Principle 02: No secrets in code or config
- Secrets MUST come from a secret manager at runtime
- Build pipelines MUST scan for high-entropy strings
- Compliance: CWE-798
- Trace: .github/workflows/secret-scan.yml
## Principle 03: Every privileged action is authorized and audited
- All privileged endpoints MUST verify identity AND permission
- All privileged actions MUST emit a structured audit event
- Compliance: CWE-862, CWE-285
- Trace: src/auth/decorators.py, src/audit/
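A constitution in this shape is regular enough to check mechanically. The sketch below parses the heading and bullet layout used above and reports any principle that lacks a Compliance or Trace line. The parser, the lint rules, and the shortened sample text (with Principle 02's Trace line removed on purpose) are assumptions for illustration, not a Spec-Kit or Specky feature.

```python
import re

# Shortened sample in the same layout; Principle 02 deliberately has no Trace line.
CONSTITUTION = """\
# Project Constitution
## Principle 01: All user input is hostile
- All inputs MUST pass through a parameterized query layer
- Compliance: CWE-89, CWE-79
- Trace: src/db/repository.py, src/web/templates/
## Principle 02: No secrets in code or config
- Secrets MUST come from a secret manager at runtime
- Compliance: CWE-798
"""


def split_items(value: str):
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_principles(text: str):
    """Collect rules, compliance IDs and trace paths for each principle."""
    principles = []
    for line in text.splitlines():
        heading = re.match(r"^## Principle (\d+): (.+)$", line)
        if heading:
            principles.append({"id": heading.group(1), "title": heading.group(2),
                               "rules": [], "compliance": [], "trace": []})
        elif principles and line.startswith("- "):
            body, current = line[2:], principles[-1]
            if body.startswith("Compliance:"):
                current["compliance"] = split_items(body[len("Compliance:"):])
            elif body.startswith("Trace:"):
                current["trace"] = split_items(body[len("Trace:"):])
            else:
                current["rules"].append(body)
    return principles


def lint(principles):
    """Return a list of human-readable problems; empty means traceable."""
    problems = []
    for p in principles:
        for field in ("rules", "compliance", "trace"):
            if not p[field]:
                problems.append(f"Principle {p['id']} has no {field} entry")
    return problems


print(lint(parse_principles(CONSTITUTION)))
```

Run in CI, a check like this would reject a new principle that states a rule without pointing to the files expected to enforce it.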
Best practices · Bandara et al. 2025
Nine principles for production-grade agentic workflows.
01
Tool calls over MCP. Direct calls for deterministic operations.
02
Direct functions over tool calls. Pure functions for infra ops.
03
One agent, one tool. Reduces ambiguity.
04
Single-responsibility agents. No mixing generation + validation.