An executive and technical briefing on usage-based billing, what it changes for your engineering organization, and how to operate well under it.
On June 1, 2026, GitHub Copilot stops charging per request. Tokens become the unit of cost.
1 AI Credit equals US$ 0.01. Input, output, and cached tokens are metered at each model's published rate. Cost is now tied to the work performed, not to how often you ask.
Under request-based billing, a one-word prompt and a forty-turn agent session counted the same, one premium request each. Predictable to budget, but disconnected from the compute actually consumed.
Under usage-based billing, input, output, and cached tokens are metered at each model's rate. A long agentic session is genuinely far more expensive than one chat question. The bill now matches reality.
cost = (input tokens × in-rate) + (cached tokens × cache-rate) + (output tokens × out-rate). 1 AI Credit = US$ 0.01. Every tactic in this deck acts on the model or the tokens.
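A minimal sketch of that formula in Python, for illustration only. The per-million-token rates are hypothetical placeholders, not values from any rate card, and the names `Rates`, `cost_usd`, and `to_credits` are invented here.

```python
from dataclasses import dataclass

CREDIT_USD = 0.01  # from the briefing: 1 AI Credit = US$ 0.01


@dataclass(frozen=True)
class Rates:
    """US dollars per one million tokens. Values used below are hypothetical."""
    input_per_m: float
    cached_per_m: float
    output_per_m: float


def cost_usd(input_tokens: int, cached_tokens: int, output_tokens: int, rates: Rates) -> float:
    """cost = input * in-rate + cached * cache-rate + output * out-rate."""
    total = (
        input_tokens * rates.input_per_m
        + cached_tokens * rates.cached_per_m
        + output_tokens * rates.output_per_m
    )
    return total / 1_000_000


def to_credits(usd: float) -> float:
    """Convert dollars to AI Credits at US$ 0.01 per credit."""
    return usd / CREDIT_USD


# Hypothetical rates, chosen only to keep the arithmetic easy to follow.
EXAMPLE_RATES = Rates(input_per_m=3.00, cached_per_m=0.30, output_per_m=15.00)

if __name__ == "__main__":
    usd = cost_usd(10_000, 40_000, 2_000, EXAMPLE_RATES)
    print(f"US$ {usd:.3f} is {to_credits(usd):.1f} credits")
```

With these placeholder rates, one turn of 10,000 fresh input tokens, 40,000 cached tokens, and 2,000 output tokens comes to about 7 credits. Swap in the published rates for the model you actually use.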
Chat, agents, the coding agent, and premium models consume credits. Code completions and Next Edit Suggestions do not. GitHub Copilot code review costs twice: AI Credits plus GitHub Actions minutes.
Pro US$ 10, Pro+ US$ 39, Business US$ 19, Enterprise US$ 39 per user. Each license adds its dollar value as credits to a shared pool.
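A rough sketch of the pool arithmetic, assuming only what the briefing states: each license contributes its monthly dollar price as credits at US$ 0.01 per credit. Only the two organization plans are modeled, and `pool_credits` is a name invented for this example.

```python
# Monthly price per user in US$, as listed in this briefing.
PLAN_PRICE_USD = {"Business": 19, "Enterprise": 39}
CREDITS_PER_USD = 100  # 1 AI Credit = US$ 0.01


def pool_credits(seats: dict[str, int]) -> int:
    """Credits the given seat counts add to the shared pool each month."""
    return sum(
        PLAN_PRICE_USD[plan] * CREDITS_PER_USD * count
        for plan, count in seats.items()
    )


if __name__ == "__main__":
    print(pool_credits({"Business": 100}))
```

One hundred Business seats at US$ 19 each put 190,000 credits into the pool under this reading.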
Code completions and Next Edit Suggestions remain unlimited and consume no AI Credits. The day-to-day inline experience is unchanged.
Chat, agent mode, the coding agent, Spaces, and premium models consume credits. That is where the meter actually spins.
The expensive dish always cost a lot to make. The flat buffet hid the cost of each plate and quietly rewarded waste.
Usage-based billing hands you the à la carte menu with prices printed. The kitchen did not change. Now you can choose.
FinOps 2026: 98 percent of organizations now manage AI spend, up from 31 percent in 2024. This is market maturity, not a GitHub quirk.
Providers bill GitHub in tokens. A single request can consume a few hundred tokens or tens of thousands. Equal-request pricing never matched the cost.
Labs release models and reprice constantly. A token framework absorbs that without a pricing overhaul each time.
Larger context windows and richer agentic workflows only make economic sense when cost is metered by consumption.
Announced April 27. The May billing preview is your window to instrument monitoring and configure budgets before anything bills.
Usage-based billing goes live June 1 with extra included headroom that softens the first months. Use it as a learning window for real consumption.
The cushion expires August 31. September is the first un-cushioned month. Re-baseline budgets here, never against promotional spend.
Two companies feel the same change in opposite ways. The difference is usage maturity, not headcount.
Teams with governance pay in proportion to value. Teams without it pay for the chaos the flat price used to hide.
DORA 2025 found AI does not fix a team, it amplifies what is already there. Usage billing simply makes that visible on the invoice.
How LLMs, agents, tokens, and context windows work, the mental model the rest depends on.
An LLM is text in, text out. It cannot tell relevant from irrelevant, and it cannot tell a hallucination from a fact. Both come from the same math.
The core rule: as little context as possible, but as much as required. Too much biases the answer and costs more, too little invites hallucination.
A token is roughly three quarters of a word. 1M tokens is about the Lord of the Rings trilogy plus The Hobbit. Prompts, files, and replies all consume them.
Irrelevant information is weighed, not ignored. It pushes the model toward wrong answers, and you pay for every extra token, every turn.
Missing critical context makes the model fill the gap, with no error message. The math makes no distinction between a fact and a fabrication.
A token is roughly three quarters of a word. English averages about 1.3 tokens per word.
Portuguese and Spanish run 1.5 to 1.7 tokens per word. Code tokenizes at variable density.
You have little control over tokenization. Think higher level: prompts, files, and replies all consume, and compound every turn.
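For back-of-the-envelope sizing, the per-word ratios above can be turned into a quick estimator. This is a rough sketch, not a tokenizer: real counts depend on the model, and `estimate_tokens` is a hypothetical helper.

```python
# (low, high) tokens per word, taken from the ratios quoted in this briefing.
TOKENS_PER_WORD = {
    "english": (1.3, 1.3),
    "portuguese": (1.5, 1.7),
    "spanish": (1.5, 1.7),
}


def estimate_tokens(words: int, language: str = "english") -> tuple[int, int]:
    """Return a (low, high) token estimate for a text of the given word count."""
    low, high = TOKENS_PER_WORD[language.lower()]
    return round(words * low), round(words * high)


if __name__ == "__main__":
    print(estimate_tokens(1_000))
    print(estimate_tokens(1_000, "Portuguese"))
```

The same 1,000-word prompt lands at about 1,300 tokens in English and 1,500 to 1,700 in Portuguese or Spanish.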
An agent is an application: VS Code Chat, GitHub Copilot CLI, the coding agent, even Claude Code or Codex. It talks to the model on your behalf, many times per task.
The model itself is just GPT, Claude, or Gemini. The interaction is not magic, it is text, and the model is stateless.
You influence it through three things: your prompt, the files in your project, and the agent configuration (instructions, skills, MCP).
A session that carries 50,000 tokens of context per turn for 40 turns ships about 2 million input tokens (50,000 times 40), even if your last prompt was 20 words. Because LLMs are stateless, the whole conversation is re-sent each turn, so context bloat is the single biggest source of waste.
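The re-send arithmetic can be checked in a few lines. A minimal sketch with two simplified models of a session: one with constant context per turn, one where context grows by a fixed amount each turn. Both function names and the growth figures are illustrative assumptions.

```python
def resent_input_tokens(context_tokens: int, turns: int) -> int:
    """Total input when the same amount of context is re-sent on every turn."""
    return context_tokens * turns


def growing_input_tokens(start_tokens: int, growth_per_turn: int, turns: int) -> int:
    """Total input when context starts small and grows by a fixed amount per turn."""
    return sum(start_tokens + growth_per_turn * turn for turn in range(turns))


if __name__ == "__main__":
    print(resent_input_tokens(50_000, 40))
    print(growing_input_tokens(10_000, 2_000, 40))
```

The constant model gives exactly 2,000,000 input tokens. The growing model, starting at 10,000 tokens and adding 2,000 per turn, reaches 1,960,000 over the same 40 turns, so even a session that starts lean ends up in the same range.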
You cannot change these, but they are why every session starts north of zero.
The slice most people never think about, and the one that compounds with session length.
File reads, shell output, test runs. One careless read of a large file rides along forever.
Your typed prompt is usually the smallest slice. The model response, including reasoning tokens, is also billed.
Everything you send: prompt, history, attachments, system prompt, tool schemas. Re-billed every turn. The baseline rate.
What the model generates, including invisible reasoning tokens. Typically 4 to 10 times the input rate. Verbose answers cost real money.
A stable prefix the provider replays from cache. Billed at about 10 percent of the input rate, a 90 percent discount. The largest lever for agentic work.
1 token is about three quarters of a word. 1M tokens is roughly the Lord of the Rings trilogy plus The Hobbit. The window is the capacity.
Tokens meter linearly whether you are at 20 or 80 percent. Sit at 500K on a 1M window and you pay for 500K every turn. Compaction saves you from the wall, not the cost of nearing it.
Models use information best at the start and end of the window, worst in the middle. Switch tasks mid-session and the model can snap back to the first instruction. Start a fresh window per task.
Chroma tested 18 frontier models. All degrade as input grows, often well before the window is full. Keep the window under roughly 60 to 70 percent.
One bill, three behaviors. Each surface has its own dominant cost driver and its own dials.
GitHub is repos, pull requests, issues, and Actions. GitHub Copilot is the AI layer woven through it, reached from VS Code, the CLI, the web, and other IDEs.
Free and unlimited: code completions and Next Edit Suggestions. Metered in AI Credits: chat, agents, the coding agent, Spaces, premium models.
The same tiers come from Anthropic, OpenAI, and Google, routed through one control plane and one billing unit, AI Credits.
Driver: re-send of the full history every turn. Dials: /clear, /compact, /context, specialist sub-agents.
Driver: attachments plus agent tool-call sprawl. Dials: mode choice, hash-file (#file) references, instructions, MCP scope.
Driver: coding-agent runs with a large blast radius. Dials: tightly scoped issues, narrow Spaces, PR review.
/clear resets between unrelated tasks, /compact summarizes mid-task, /context shows the breakdown, /usage shows what the session cost.
/model lets you plan on a strong model, implement on a cheap one, review on a specialist. Each step on the cheapest tier that finishes it.
/explore, /plan, /review, and /delegate run in isolated context. The main session only sees the summary, not the files they read.
Read-only. Explanations, design questions, doc lookups. No file writes. The cheapest mode and the right one for most questions.
Spec-first. Drafts a reviewable plan you hand to Agent. The plan narrows what Agent reads and changes, which is where savings come from.
Autonomous tool-using loop. Cheapest when driven by a spec. Default to hash-file over hash-codebase, and restart rather than trim.
Hash-codebase queries a semantic index, not a live grep. For GitHub repos it is remote and near-instant since March 2025. A cold index forces broader, costlier reads.
Summarizes older turns to free context, with a hint about what to keep. Cheaper than starting fresh when continuity matters, far cheaper than overflowing.
Issue to branch to PR in a sandbox VM, GA since September 2025. Issue quality dominates the bill, so scope tightly with file pointers.
Curated context across repos and docs. Three files beat three repos. One Space per question shape, refresh rather than accumulate.
copilot-instructions.md is honored by VS Code, the coding agent, and the CLI. Invest in it once, benefit everywhere.
/explore, /plan, narrow scope. When you are already in the terminal for a quick edit.
Plan mode drafts a spec, hand it to Agent with hash-file scope. For a mid-size change you review first.
Issue with file pointers, assign @copilot, review the PR. For a bounded change you can step away from.
Bumping deps, adding tests for one module, a repo-wide rename. Issues with file pointers and acceptance criteria, in repos with strong tests and CI.
Vague exploratory work, cross-cutting architecture, greenfield repos, repos without tests, or anything that should have been a ten-line conversation.
Selected repos and specific paths, pasted notes, free-text instructions, external doc links. Shipped into context at query time.
Narrow sources, one Space per question shape, refresh rather than accumulate. Stale sources cost the same as fresh ones.
Spaces launched May 2025 and replaced Knowledge Bases in September 2025. If you still see KB, that is the legacy term.
Repo-anchored Q&A and file-scoped edits. The cheapest entry points on the web.
Auto-review scoped to the diff. Now also consumes Actions minutes for its agentic analysis.
Generate full apps, or issue-to-PR runs. The heaviest per-run, long-running generation.
Automatic compaction runs in the background near 80 percent of the window and hard near 95 percent, scaled to the model. It does not fire until you are close.
At 500K on a 1M window you pay 500K every turn, long before compaction. Bigger windows are not smaller bills.
Under 40 percent, do not pay to compact yet. Resume from a saved summary instead of pasting context back.
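The thresholds above can be read as a simple decision ladder. This sketch is one interpretation, not product behavior: the 80, 95, and 40 percent marks come from the lines above, the 60 percent mark from the context-rot guidance earlier in the briefing, and the labels and `window_advice` name are invented here.

```python
def window_advice(used_tokens: int, window_tokens: int) -> str:
    """Map context-window usage to a suggested action label."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    pct = 100 * used_tokens / window_tokens
    if pct >= 95:
        return "hard-compaction"        # forced compaction territory
    if pct >= 80:
        return "background-compaction"  # automatic compaction starts about here
    if pct >= 60:
        return "quality-risk"           # wrap up or start a fresh session
    if pct < 40:
        return "hold"                   # too early to pay for compaction
    return "ok"


if __name__ == "__main__":
    for used in (300_000, 500_000, 650_000, 820_000, 960_000):
        print(used, window_advice(used, 1_000_000))
```

Note that every label still bills the full `used_tokens` on each turn. The ladder only says when to act, not what the turn costs.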
Why optimizing for quality beats optimizing for cost, and why they move in the same direction.
Like firing 20 cheap rockets at the Moon and hoping one lands. Little context, a lazy prompt, send it, and try again if it fails. Unsustainable when every developer dispatches dozens of agents a day.
Raise the value and quality of each agent before you send it. Fewer, better agents automatically means fewer tokens. Optimize for return on investment per agent.
ROI = (value of output − token cost) ÷ token cost. You cannot compute it cleanly, but it still guides: driving cost to zero on a worthless output is not a win.
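As a quick illustration of why a cheap but worthless run is not a win, here is the ROI expression as a function. The dollar values are made up for the example, and `agent_roi` is a hypothetical name.

```python
def agent_roi(value_usd: float, token_cost_usd: float) -> float:
    """ROI = (value of output - token cost) / token cost."""
    if token_cost_usd <= 0:
        raise ValueError("token cost must be positive")
    return (value_usd - token_cost_usd) / token_cost_usd


if __name__ == "__main__":
    # A one-cent run that produces nothing usable.
    print(agent_roi(0.0, 0.01))
    # A two-dollar run whose output is worth ten dollars.
    print(agent_roi(10.0, 2.0))
```

The one-cent run scores -1.0, a total loss, while the run that costs 200 times more scores 4.0.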
Raising an agent's value is often achieved by sending fewer tokens. Trimming irrelevant context is one lever that improves quality and lowers cost at once.
A test passes or fails, no probability. A failing test stops a drifting agent and forces it to rebuild on a stable base.
The GitHub Copilot CLI team ships about 500 PRs a week. Their number-one practice is tests, 53 percent of the codebase.
Linters, type checkers, and security scanners are guardrails the agent executes. Without them, it stacks buggy change on buggy change.
Squeeze tokens so hard that the agent produces worse output, and you have not saved money: you have paid for a worse result and the rework that follows.
Cutting irrelevant context raises quality and lowers cost at the same time. Match your effort to your maturity: the more agents you run, the more this pays.
Around 10 agents a day, AI as an assistant. Even 50 percent savings on a 20-dollar month is only 10 dollars. Learn the fundamentals, apply the top few habits.
Dozens or hundreds of async agents. Every percent of tokens and quality is multiplied across the fleet (remember the compounding math). The full framework earns its keep.
Coding was never the value, analysis was. Telling an agent precisely what to do, in the domain's language, becomes the most valuable skill.
Domain-driven design, hexagonal, CQRS, event-driven. Clean architecture gives agents guardrails and reduces misses.
You are a context engineer now. Treat agent misses like incidents, keep configs fresh, recreate them quarterly.
Six compounding levers, each fixing one driver of token spend. Field models land the combined effect in the 37 to 44 percent band.
Open files, indexing, and history sent every turn.
Frontier versus lightweight tier, the Opus to mini spread.
The same project context re-sent every turn without caching.
High and max thinking traces, billed as output tokens.
Agent loops re-bill the work, MCP schemas re-sent every turn.
History compounds across turns, turn 30 unmanaged.
Attach the narrowest scope that lets the model answer. The cheapest token is the one you never send.
Match the model to the task. A 10 to 24 times spread across tiers is real money. Default to Auto.
Write team standards once into cacheable files, then reuse them at roughly 90 percent off.
Thinking is billed as output. Default to medium, escalate only for genuinely hard turns.
Trim MCP overhead, specialize roles, delegate discovery to subagents. Govern the loop or the bill explodes.
History compounds every turn. One topic, one session. Reset, do not compound.
Applied together, the levers land a roughly 41 percent token reduction with no measurable productivity loss, inside a 37 to 44 percent target band. The figures are illustrative, measure your own baseline.
Naming the 3 files that matter instead of searching the whole codebase cut a real refactor from about 25,000 to 1,700 input tokens per turn, and gave a sharper answer.
Plan on a powerful model, implement on a lightweight one, review on a versatile one. One OAuth feature dropped from about US$ 13.75 to US$ 2.30, same code quality.
No model is free under usage-based billing: every model, including the lightweight tier, consumes credits per token. Only code completions and Next Edit Suggestions consume no credits. Rates and model names are illustrative as of mid-2026. Always validate against the official GitHub Copilot models and pricing rate card before committing budgets.
Simple Q&A, formatting, boilerplate, simple refactors. The cheap default for routine work.
Everyday coding, feature work, review, debugging. The workhorse where most spend lives.
Architecture, hard debugging, security review, novel logic. Reserve it, the multiplier is steep.
Keep team files stable so the cache holds. Append new rules at the bottom, never reorder. Scaled to 50 developers this saves on the order of US$ 32,000 a year on the instruction prefix alone.
On reasoning models, high or max effort can multiply the bill 10 to 80 times. Decompose: one medium plan, many cheap low steps, one medium review. About 7 times cheaper per task.
1x. About 200 to 800 thinking tokens. Fine for most steps.
2 to 4x. About 1K to 4K. The right default for most coding.
10 to 25x. About 5K to 20K. Save for genuinely hard turns.
50 to 80x. Up to 64K thinking tokens. Architectural design and subtle bugs only.
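To compare an all-high session with a decomposed one, the tier multipliers above can be summed per step. A sketch only: it assumes the four cards map to low, medium, high, and max in that order, counts thinking effort in relative units rather than dollars, and uses an invented helper name.

```python
# (low, high) relative multipliers per effort tier, from the cards above.
EFFORT_MULTIPLIER = {
    "low": (1, 1),
    "medium": (2, 4),
    "high": (10, 25),
    "max": (50, 80),
}


def relative_thinking_cost(steps: list[str]) -> tuple[int, int]:
    """Sum the (low, high) multipliers over a sequence of effort levels."""
    low = sum(EFFORT_MULTIPLIER[step][0] for step in steps)
    high = sum(EFFORT_MULTIPLIER[step][1] for step in steps)
    return low, high


if __name__ == "__main__":
    all_high = ["high"] * 10
    decomposed = ["medium"] + ["low"] * 8 + ["medium"]
    print(relative_thinking_cost(all_high))
    print(relative_thinking_cost(decomposed))
```

Ten high-effort turns total 100 to 250 units, against 12 to 16 for one medium plan, eight low steps, and one medium review. Thinking tokens are only part of the bill, so the end-to-end saving will be smaller than this ratio suggests.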
MCP trims what is sent, hooks trim what is stored, skills skip exploration. Specialize roles and delegate discovery to subagents. A 30-turn debug went from about US$ 45 to US$ 6.
By turn 30 the history slice is about 90 percent of every turn's bill. New topic, new chat. Compact with focus, fork to explore, archive when done.
Output is the priciest token. Bound it: one sentence, three bullets, code only. Code-only cuts output 40 to 70 percent.
Mode sets the number of calls: Ask is one, Agent is 5 to 25. Do not pay agent-loop overhead for a bounded question.
Pick three: /clear, /model, /usage. Context control, cost control, cost visibility, one for each dimension.
Disable servers you do not need today, use a CLI for one-offs. From about 5K down to 1K per request.
A local hook filters noisy output before it enters history. From about 10K raw down to 200 filtered.
A SKILL.md describes the structure once instead of reading 5+ files. From about 5K down to 500.
Read-only tools, produces a 5-step plan and design notes. The frontier rate, paid once.
Edit tools, executes one plan step per turn. The bulk of the work, on the cheap tier.
Read-only, diff and tests only. Ten naive Opus turns cost about US$ 15, the handoff about US$ 2.11, roughly 7x cheaper.
The parent reloads the same file content each turn. A 20-file refactor over 4 turns ships 86K input, paid four times.
The subagent reads the files in isolated context and returns a 1K summary. Parent input stays flat, about 8K total, roughly 10 times less.
Bounded discovery, file or context volume the bottleneck, one answer, parallel fan-out, parent stays. Discovery goes to a subagent.
Multi-phase build, reasoning depth the bottleneck, three-plus phases, sequential, the baton passes forward. A build goes to handoff.
Context engineering, budgets and controls, observability, and a sequenced adoption playbook.
copilot-instructions.md rides every session. Keep it tiny, write it yourself, log recurring agent misses, recreate it quarterly.
Path-scoped instructions and skills load only when relevant. Prompt files and chat modes cap the toolset and the blast radius.
Pin model and tools per role to prevent wrong paths. Subagents read files once and return only a summary, keeping the parent lean.
Not "fix the bug", but "issue 45 describes X, fix it". Point at the exact symptom and file.
Once the bug is fixed and tests pass, stop. Keeps the agent from wandering into unrequested work.
If you know the files, docs, or skill, name them. Every discovery loop you save is tokens you do not pay.
Loads many files, most irrelevant to implementation. Do it in its own window or a subagent.
A reasoning model produces a precise spec, a detailed to-do that does the thinking up front.
Execute the spec on a cheaper model with only relevant context. With a clear spec, parallel agents become possible.
Pass or fail, no probability. A red test stops a drifting agent and forces a rebuild on a stable base.
Linters, type checkers, security scanners. Any guardrail you can express as code, the agent will execute.
applyTo path globs load only when matching files are attached, keeping the always-on file small.
Versioned, manually invoked prompts. Kill the every-dev-has-a-pet-prompt drift.
A chat mode with a tight tool allowlist literally cannot enter an Agent loop, so the bill is bounded by design.
A skill exposes only its name and description until your prompt matches. Install 20 skills and you add a couple of KB, not the full bodies. An open standard across VS Code, CLI, and the cloud agent.
Every enabled MCP tool ships its schema on every turn, used or not. Scope per workspace, prune to weekly-used, and prefer a CLI like gh for read-heavy work.
Non-obvious hazards, the required test framework and build command, explicit do-not rules for files and patterns. Roughly 150 tokens, paid once and then cached.
Onboarding essays, architecture as ASCII art, full style guides, duplicated rules. 2026 research: generated context files add 20 to 23 percent more tokens for about a 2 percent drop in correctness.
Always use pnpm, never npm. Tests live in spec/. Do not modify files in generated/. Document exports with jsdoc, and so on.
pkg: pnpm. tests: spec/. no-edit: generated/. docs: jsdoc on exports. Same meaning, paid on every turn, 64 percent lighter.
Lives in git, reviewable, scoped to the repo that needs it. One bad global server pollutes every project.
A 40-tool set advertises 10 to 15 KB per turn, used or not. Trim to weekly-used, and prefer the gh CLI for read-heavy work.
Scores the repo across 9 pillars and a 5-level maturity model. Use it as a CI gate that fails below level 3.
Generates copilot-instructions.md, the MCP config, settings, and AGENTS.md from the codebase itself.
Re-runs saved cases to catch instruction rot in CI. Experimental, so pin a version.
Pin model and tools per role: Planner, Implementer, Reviewer. The real win is not token savings, it is not handing the agent a tool you do not want it to use.
A subagent reads many files in isolated context and returns only a summary. The parent stays lean across a long session, often about 10 times less re-billed input.
Write a script to filter output before the model sees it. Filter an API response to the fields that matter, do not dump the whole payload.
For read-heavy work, a known CLI like gh is leaner than an MCP server's tool surface. Fewer static tokens, same answer.
Run chronicle to analyze your own session logs for improvement areas, and tools like RTK to trim long shell output to what the agent needs.
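A filter script of the kind described above can be very small. This sketch assumes two common cases, a JSON API response and a noisy test log. The function names, keywords, and sample payload are all illustrative, not part of any GitHub Copilot hook API.

```python
import json


def keep_fields(payload: dict, fields: list[str]) -> dict:
    """Keep only the named top-level fields of an API response."""
    return {name: payload[name] for name in fields if name in payload}


def filter_log(text: str, keywords: tuple[str, ...] = ("FAIL", "ERROR"), max_lines: int = 50) -> str:
    """Keep only log lines that contain one of the keywords, capped at max_lines."""
    hits = [line for line in text.splitlines() if any(k in line for k in keywords)]
    return "\n".join(hits[:max_lines])


if __name__ == "__main__":
    raw = json.loads('{"id": 45, "title": "Login fails", "reactions": {}, "events": []}')
    print(keep_fields(raw, ["id", "title"]))
    print(filter_log("ok test_a\nFAIL test_b\nok test_c\nERROR timeout"))
```

The agent then sees two fields and two log lines instead of the whole payload, and that smaller text is what gets re-sent on later turns.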
Conditional, by file path. Offered to the agent like skills. Start static, move to scoped only when the file grows.
Manually invoked, reusable prompts. A common starting point to kick off skills or custom agents.
Small, automatic, always-on guidance learned from your behavior, applied across surfaces. Check it periodically.
A light user's unused requests could not help a heavy user, who hit overage while others sat idle.
All users draw from one pool first. Only spend beyond the pool is additional spend, governed by the budget layers. User-level budgets cap any one person.
A light user's unused requests could not help a heavy one, who hit overage while others sat idle. One user could not drain another.
Users draw from one pool by real need, so unused value is not stranded. The trade-off: consumption is uneven, which is why user-level budgets exist.
Caps total additional spend beyond the pool. Alerts at 75, 90, and 100 percent. Cost centers can be excluded so a funded unit keeps working.
Allocates additional spend to an org or user group. Can map to an Azure subscription per cost center.
One default cap on total consumption per user. The easy way to stop one person from draining the shared pool.
Overrides the universal default for specific users. Max 10,000 budgets across the enterprise, so use overrides sparingly.
150 percent is high enough never to block normal work and low enough to bound a runaway loop to about 12 hours. A zero budget blocks access entirely, there is no fallback to a cheaper model.
A universal user budget sets the default, individual overrides lift specific power users. Baseline control with per-person flexibility.
Cost-center budgets per org, with the enterprise budget as the global ceiling and a failsafe for anyone outside a cost center.
All four layers together: universal default, individual overrides inside a cost center, the unit budget, and the enterprise ceiling, optionally excluding the unit.
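How the per-user layers resolve can be sketched as a lookup: an individual override wins, otherwise the universal default applies. This assumes the 150 percent figure is measured against the seat's own monthly credits, which the briefing does not spell out, and all names below are hypothetical.

```python
CREDITS_PER_USD = 100  # 1 AI Credit = US$ 0.01


def default_user_budget(license_usd: int, percent: int = 150) -> int:
    """Universal per-user budget in credits, as a percentage of the seat's credits (assumption)."""
    return license_usd * CREDITS_PER_USD * percent // 100


def effective_user_budget(user: str, universal_default: int, overrides: dict[str, int]) -> int:
    """An individual override replaces the universal default for that user."""
    return overrides.get(user, universal_default)


def has_access(spent: int, budget: int) -> bool:
    """A user keeps access while spend is below budget. A zero budget blocks entirely."""
    return spent < budget


if __name__ == "__main__":
    default = default_user_budget(19)  # a US$ 19 Business seat
    print(default, effective_user_budget("power-user", default, {"power-user": 6_000}))
```

Under this reading a Business seat defaults to 2,850 credits, and only the named power user is lifted to 6,000.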
How much is the business willing to spend on AI services?
How much should any single engineer be able to spend?
What is the maximum each business unit or team can spend?
Who needs an exception to those budgets?
At 75 percent, an email lands. Start monitoring closely.
Raise the budget, or let it cap. A conscious choice, not a surprise.
Users stop until the next cycle or an admin raises the cap, which takes seconds.
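The alert flow can be mirrored in a small check for your own dashboards. A sketch assuming the 75, 90, and 100 percent thresholds the briefing gives for budget alerts. The functions are illustrative and do not call any GitHub API.

```python
ALERT_THRESHOLDS = (75, 90, 100)  # percent of budget, per the briefing


def alerts_fired(spent: int, budget: int) -> list[int]:
    """Return the alert thresholds that current spend has reached."""
    if budget <= 0:
        return list(ALERT_THRESHOLDS)
    # Integer comparison avoids floating-point edge cases at exact thresholds.
    return [t for t in ALERT_THRESHOLDS if spent * 100 >= t * budget]


def is_capped(spent: int, budget: int) -> bool:
    """At 100 percent, usage stops until the cycle resets or the cap is raised."""
    return spent >= budget


if __name__ == "__main__":
    print(alerts_fired(80, 100), is_capped(80, 100))
    print(alerts_fired(100, 100), is_capped(100, 100))
```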
.env, .env.*, secrets/*, *.pem, *.key. Never let these flow into context.
dist/, build/, node_modules/, vendor/. Noise the model should never read.
compliance/, customer-data/. Globs per repo, org, or enterprise, propagate in about 30 minutes.
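Before rolling out exclusion patterns, it can help to dry-run them against a file list. This sketch uses Python's `fnmatch` as an approximation. GitHub's own matcher may treat these patterns differently, so treat the result as a preview, not a guarantee. `is_excluded` is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Patterns copied from the three groups above.
EXCLUDED = [
    ".env", ".env.*", "secrets/*", "*.pem", "*.key",
    "dist/", "build/", "node_modules/", "vendor/",
    "compliance/", "customer-data/",
]


def is_excluded(path: str, patterns: list[str] = EXCLUDED) -> bool:
    """Approximate check of whether a repo-relative path matches any pattern."""
    p = PurePosixPath(path)
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory pattern: match if any parent directory has this name.
            if pattern.rstrip("/") in p.parts[:-1]:
                return True
        elif "/" in pattern:
            # Path pattern: match against the full path from the repo root.
            if fnmatchcase(path, pattern):
                return True
        elif fnmatchcase(p.name, pattern):
            # Bare file pattern: match against the file name only.
            return True
    return False


if __name__ == "__main__":
    for candidate in (".env.local", "certs/server.pem", "web/dist/bundle.js", "src/app.py"):
        print(candidate, is_excluded(candidate))
```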
Auto picks an appropriate model and earns a roughly 10 percent paid-user discount. Make it the org default.
Cap premium models for routine work in Organization Settings, GitHub Copilot, Policies. Approve by workflow, team, and measurable need, not the honor system.
VS Code 1.119 emits OpenTelemetry spans with a token and cache breakdown per turn, plus the resolved model and multiplier inline. Free, local, today.
In-session and end-of-session cost signals before the invoice. Run /usage at the end of your next three sessions to learn your baseline.
A published dataset, now including CLI activity, exported to CSV for trend analysis and monthly budget re-tiering.
The open-source Metrics Viewer and Microsoft's Grafana dashboards turn token waste into a shareable chart. Track cache hit rate above 60 percent.
GenAI spans with token and cache breakdown per turn, exported to any OTLP backend.
Shows the model Auto resolved and its multiplier inline. On by default.
Offloads to-do bookkeeping to a lightweight model so the main model does not pay for it. Off by default, flip it on.
Open Agent mode, point it at hash-codebase, iterate 8 to 12 turns in one chat, accept whatever it produced. The control case.
Plan mode drafts a spec, you review it, hand it to Agent with tight hash-file scope, repo instructions guarantee conventions. Measure OTel tokens, turn count, time-to-mergeable.
Confirms the instruction prefix is stable. A falling rate means someone is editing team files mid-day and losing the discount.
Down from 30 percent plus in uncontrolled baselines. Review monthly averages, not daily spikes.
Look at the ten heaviest sessions each week. They reveal a fixable pattern: an always-on MCP, a never-cleared session, a premium model pinned for triage.
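The cache hit rate mentioned above can be computed from per-turn token counts. A sketch under stated assumptions: the rate is defined here as cached tokens over all prompt tokens, and the field names `input_tokens` and `cached_tokens` are hypothetical, not the actual telemetry attribute names.

```python
def cache_hit_rate(turns: list[dict]) -> float:
    """Share of prompt tokens served from cache across a set of turns."""
    cached = sum(turn["cached_tokens"] for turn in turns)
    fresh = sum(turn["input_tokens"] for turn in turns)  # uncached input only
    total = cached + fresh
    return cached / total if total else 0.0


def meets_target(rate: float, target: float = 0.60) -> bool:
    """Compare against the 60 percent target quoted in the briefing."""
    return rate > target


if __name__ == "__main__":
    session = [
        {"input_tokens": 2_000, "cached_tokens": 8_000},
        {"input_tokens": 3_000, "cached_tokens": 7_000},
    ]
    rate = cache_hit_rate(session)
    print(f"{rate:.0%}", meets_target(rate))
```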
Instructions on top repos, content exclusion at the org level, Auto as the default, OTel wired for high-volume teams.
Generate the instructions stack with AgentRC, scope MCP per repo, convert prompts to skills, stand up the Metrics export and the 75 percent alert.
File coding-agent-ready issues, add an AgentRC readiness gate in CI, re-tier budgets quarterly from real telemetry.
Five lines, conventions only, compressed style.
Your highest-traffic directory, for example src/api/.
Per-workspace beats global, prune to weekly-used.
PR review, triage, regression check.
Your number-one cross-cutting question, three sources max.
Bounded scope, file pointers, acceptance criteria.
About two hours. Set the user-level budget (ULB) to 150 percent, configure the overage cap with a 75 percent alert, define cost centers, enable Auto org-wide, commit a starter instructions file.
Train on context discipline and hash-mentions, audit MCP per team, set effort to medium, roll out path-specific instructions, start a weekly top-10 outlier review.
Export the usage CSV, review monthly averages, adjust ULB for justified power users, check cache hit rates above 60 percent, refine instructions and toolsets.
Are the budgets and policies still appropriate? Confirm or adjust early, while the cushion is still active.
What is the realistic budget at the standard allocation? Re-baseline and communicate. The most important checkpoint.
Is the tier still optimal? Are there renegotiation levers? Renew, change tier, or renegotiate.
Weight 25. The single biggest cost lever, the right model per task.
Weight 20. The root of runaway cost when absent.
Weight 15. Attacks the tokens factor directly.
Weight 15. Controls uncontrolled spend.
Weight 15. The start of structural discipline.
Weight 10. The guardrail against horror bills.
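If the six weights are read as a 100-point scorecard, the score is a weighted sum. That reading is an assumption. The dimension names are not shown in this text, so the sketch addresses them by position, and `weighted_score` is an invented name.

```python
# Weights in the order the six dimensions are listed above.
WEIGHTS = (25, 20, 15, 15, 15, 10)


def weighted_score(ratings: list[float]) -> float:
    """Combine six ratings in the range 0.0 to 1.0 into a score out of 100."""
    if len(ratings) != len(WEIGHTS):
        raise ValueError("expected one rating per dimension")
    return sum(weight * rating for weight, rating in zip(WEIGHTS, ratings))


if __name__ == "__main__":
    # Strong on the first dimension, half-way on the rest.
    print(weighted_score([1.0, 0.5, 0.5, 0.5, 0.5, 0.5]))
```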
Model governance, budgets as guardrails, context curation, repository primitives. All act on the model-times-tokens equation.
A sedimented platform, a context and orchestration layer, a control center in Microsoft Foundry. Efficiency becomes a property of the infrastructure, not of individual discipline.
Strategy, diagnosis, governance, architecture, the what and the why.
They integrate and validate in the customer's environment, the how.
Data, participation, implementation. A checkpoint every 15 days proves the commitment is alive.
Pre-paid, one-year, via Azure subscriptions. Shared tiers: 20k = US$ 19k (5%), 100k = US$ 90k (10%), 500k = US$ 425k (15%).
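The tier arithmetic checks out and is easy to reproduce. A minimal sketch that assumes the first figure in each tier is the list value in US dollars and the percentage is the discount. `prepaid_price` is a hypothetical helper.

```python
# (list value in US$, discount percent), as quoted in the tiers above.
TIERS = ((20_000, 5), (100_000, 10), (500_000, 15))


def prepaid_price(list_value_usd: int) -> int:
    """Price paid for a given tier after its discount."""
    for value, discount_pct in TIERS:
        if value == list_value_usd:
            return value * (100 - discount_pct) // 100
    raise ValueError(f"no tier with list value {list_value_usd}")


if __name__ == "__main__":
    for value, _ in TIERS:
        print(value, prepaid_price(value))
```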
Below 15 percent ACD the P3 wins, 15 to 25 only AI Credit P3 with Deal Desk, 25 percent plus the ACD wins. Validate against the P3 FAQ.
Wrong. The bill is a mirror of platform maturity, not a verdict. Heavy workloads can be legitimate.
The number changes in September. Always budget against the real allocation.
It is conditional. And never measure individual productivity, measure organizational delivery.
Auto routes to the cheapest capable model at a paid-user discount. Hydra, a small task-intent classifier, picks the fit at no extra cost and switches only at cache boundaries.
The coding agent is GA, Spaces replaced Knowledge Bases, agent skills are an open standard, and Grafana dashboards ship for GitHub Copilot, Claude Code, and Codex.
Token-aware routing, tiers of Auto (Eco, Fast, Balanced, Max), and admin policies to enforce Auto. The model catalog keeps turning over, so treat rates as snapshots.
A small ModernBERT classifier scores prompt complexity across reasoning, debugging, code-gen, and tool needs, and picks the cheapest best fit. It adds no extra cost.
Accounts for capacity, latency, and errors. Auto switches models only at cache boundaries, the start of a conversation and after compaction, so it never blows your cache.
Task intent matches work to a model path, so users need not weigh trade-offs.
Enterprise Teams and cost centers assign budgets where usage happens.
Be confident but cautious on timing and future routing.
More usage, fewer surprises, better controls.
Instant semantic code-search indexing went generally available, making hash-codebase fast and accurate on GitHub-hosted repos.
The async coding agent reached general availability, and GitHub Copilot Spaces replaced Knowledge Bases.
GitHub Copilot CLI gained richer agents and context management, and folded CLI activity into the usage Metrics API.
OpenTelemetry tracing, the model and multiplier badge, and the background todo agent landed, with the usage-based billing (UBB) UI pre-staged.
Auto rolls out to more clients and becomes the only option for Free and EDU, with UX to explain which model was chosen and cache optimizations.
Token-aware routing and tiers of Auto: Eco, Fast, Balanced, Max. A single dial for your cost-versus-capability posture.
Admin policies to enforce Auto and pin its version. The efficient choice becomes the enforceable default. Treat model rates as snapshots.
Two caches people conflate. One you influence, the other you host. And whether Foundry plus Redis sits behind VS Code and the CLI.
The provider replays a stable prefix across turns, the cached tokens on your bill, about 90 percent off. VS Code reached over 93 percent cache reuse per request. There is no knob, you keep the prefix stable.
Returns a stored answer when a new prompt is semantically similar to a prior one. You build it for your own apps and agents on Azure, it does not lower your GitHub Copilot subscription bill.
Hold instruction files and the MCP set stable during a session, append new rules at the bottom, and let Auto switch models at cache boundaries.
Editing instructions mid-day, switching models by hand, or changing the tool set rewrites the prefix and you re-pay it at full input price.
VS Code 1.119 OTel shows cache read and creation per turn, /context shows the composition. Target a cache hit rate above 60 percent.
Azure Managed Redis with the RediSearch module serves as the external cache and vector store. Enable RediSearch at creation.
Azure API Management is the GenAI gateway in front of your models, also available built-in from Foundry.
The llm-semantic-cache-lookup and llm-semantic-cache-store policies compare prompts by vector proximity and return the stored answer for a similar prompt.
You are on Cache A, the prompt cache. Protect it with prefix discipline. Nothing to deploy.
Use Cache B, the semantic cache, on Foundry plus Redis. A real, governable cost lever.
Common in mature platforms. They coexist and never overlap, different layers entirely.
GitHub Copilot is a managed product. You cannot insert your own Redis or Foundry gateway into its cache or billing path. The only lever is prefix discipline.
API Management as a GenAI gateway with the semantic-cache policies plus Azure Managed Redis and RediSearch cuts tokens on the models you call directly. The transformation-horizon platform.
On every surface, what you ship into the chat is what you pay for. Prune, scope, restart.
Hash-file over hash-codebase, tight issue over vague, focused Space over kitchen-sink.
Instructions, chat modes, and Spaces in git outlast any one prompt.
The currency of usage-based billing. 1 credit = US$ 0.01, metered per token.
A word fragment, about three quarters of a word. Input, output, or cached.
The harness re-sends the whole conversation every turn, so context compounds.
What the model can hold. Distinct from what you put in it, which is what you pay.
Provider replay of a stable prefix, about 90 percent off. The cached tokens on your bill.
Quality decays as input grows, often before the window is full.
Auto routes to the cheapest capable model, with a paid-user discount. Hydra is its task-intent classifier.
Issue to branch to PR in a sandbox. GA since September 2025.
A portable, on-demand capability. Loads only when its description matches.
External tools for agents. Each advertises its schema every turn.
User-level budget, a cap on total consumption. Recommended at 150 percent.
Admin control that blocks files from context across all surfaces.
The usage-based billing blog post, the models and pricing rate card, budgets and Auto docs, the coding agent and Spaces changelogs.
VS Code 1.119 notes, semantic caching and AI gateway in API Management, Azure Managed Redis, Grafana for AI agents.
Liu et al. Lost in the Middle, Chroma Context Rot, the 2025 DORA report, the FinOps 2026 survey.
Pioneering software development with AI and Agentic DevOps.
Run /usage at the end of your next three sessions to learn your baseline, set the user-level budget to 150 percent, enable Auto as the default, and commit a small instructions file on your most active repo.