Agentic Workflows for Engineers

Context Amnesia

Marco Hefti — Mon, 15 Dec 2025 11:20:29 GMT

Compaction can make completed work invisible to the model. This explains what gets dropped, why auto-compaction is risky, and how to keep long sessions coherent.

TL;DR

Compaction preserves your messages, but drops assistant turns and tool transcripts. After compaction, Codex can’t see what it said, what tools it ran, or what it verified.
The model may redo work it already completed because, from its perspective, that work never happened.
Auto-compaction runs after a turn when total token usage crosses the configured token limit. That can happen at an undesirable time.
Store progress context in files, not in conversation. After any compaction event, verify the model knows what has already been done before continuing.
Your messages usually survive, but there’s a per-message budget. Very long prompts can get truncated, often in the middle.

Context / Problem

Last month I was in the middle of a payment migration. Nothing exotic: webhook handlers, subscription state, tests, the usual.

We’d just finished a chunk of work. Files updated. Tests green. I told Codex to move on to the next phase.

A small notice appeared in the terminal: “compact completed.”

Codex started redoing the work we had just finished. It rewrote handlers that already existed, created a second version of a module under a slightly different name, and broke tests that had been green minutes earlier.

I scrolled up. The full conversation was right there: my instructions, its proposals, the diffs, the test output, its confirmation that everything worked. But when I asked Codex what happened, it had no memory of any of it. As far as it knew, it was starting the migration fresh.

I call this context amnesia. The name is slightly misleading. Codex didn’t forget my instructions. It forgot its own work. My messages survived. Its confirmations, tool outputs, and the record of what it actually did were discarded.

It’s not human-style forgetting. After compaction, the agent loses its own receipts.

The terminal history showed the full conversation. The model sees something different: a summary and your recent messages, but none of its own responses.

What you’ll get from this:

What compaction actually discards (and what it keeps).
Why the model redoes work instead of forgetting instructions.
How auto-compaction timing creates dangerous gaps.
A workflow that makes completed work visible after compaction.

The core mismatch: you have history, the model has a bridge

This is not “forgetting” in the human sense. You and the model are continuing from different inputs.

You can scroll through the full history. The model can’t. After compaction, it continues from a rebuilt history that may not include what you’re seeing.

Once you internalize that mismatch, the rest of the behavior stops being mysterious.

It also explains the common complaint: “it got worse after /compact.” The model lost the receipts for what actually happened.

At the time of writing, most of what compaction keeps is out of your control. The reliable fix is workflow: make completion visible in files, and verify state after any compaction.

Why this is hard to fix cleanly

It’s tempting to treat compaction as a simple bug: “keep more context” or “write a better summary.”

In practice, it’s a trade-off that doesn’t go away:

The model has limits. Even with large context windows, there is always a limit. Something has to be dropped.
Tool transcripts are huge and noisy. Keeping raw outputs verbatim makes sessions expensive and harder to fit. That’s why summaries exist.
The important detail is often obvious only in hindsight. A summary can’t reliably predict which line of output will matter two hours later.
Humans trust the terminal history, agents trust the prompt. If those differ, you lose sync with what the agent believes.

So compaction isn’t just “memory loss.” It’s a context mismatch: what you see is not what the agent continues from.

This also isn’t a Codex-only problem. Any tool that “summarizes and continues” is building a bridge between two contexts. The failure mode is always the same: the bridge drops the receipt you needed later.

How other tools handle it

Most coding agents end up in the same place: they need some way to keep conversations going past a context window.

The difference is rarely “Model A is smarter.” It’s usually two design choices:

How visible the boundary is. Do you notice a rewrite happened, and can you tell what survived?
Where durable state lives. Does the workflow push you to store progress and constraints outside the chat?

If a tool feels better, it’s often because it makes compaction more explicit or it nudges you toward durable state. If it feels worse, it’s usually because the rewrite is silent and you only notice after the agent starts going off the rails.

Mental model: What survives and what does not

When compaction runs, Codex rebuilds the conversation from scratch. The rebuilt history contains:

Base instructions. System prompts, AGENTS.md content, environment context. These are regenerated fresh.
User messages. In the local CLI compaction path, Codex keeps up to 20,000 tokens of user messages. Very long prompts can be cut off in the middle.
A summary message. A model-generated summary of what happened, inserted as a user-role message.

Everything else is discarded:

All assistant messages (what Codex said)
All tool calls (commands it ran)
All tool outputs (results it received)
All reasoning traces

The model’s entire contribution to the conversation is gone. The only record of its work is whatever the summary happened to capture.
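To make the rebuild concrete, here is a toy model in Python. It is a sketch of the behavior described above, not Codex's implementation: the `Message` type, the `compact` function, and word counts standing in for tokens are all assumptions made for illustration, and it drops whole messages rather than truncating a single long prompt.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str  # "system", "user", "assistant", or "tool"
    text: str


def rough_tokens(text: str) -> int:
    # Assumption: one whitespace-separated word stands in for one token.
    return len(text.split())


def compact(history: list[Message], summary: str, user_budget: int = 20_000) -> list[Message]:
    """Rebuild a history: base instructions, user messages within a budget, then a summary."""
    rebuilt = [m for m in history if m.role == "system"]
    kept: list[Message] = []
    remaining = user_budget
    # Walk user messages newest-first so recent ones win the budget.
    for m in reversed([m for m in history if m.role == "user"]):
        cost = rough_tokens(m.text)
        if cost > remaining:
            break
        kept.append(m)
        remaining -= cost
    rebuilt.extend(reversed(kept))
    # The summary is inserted as a user-role message.
    rebuilt.append(Message("user", summary))
    return rebuilt


history = [
    Message("system", "Follow AGENTS.md"),
    Message("user", "Migrate the webhook handlers"),
    Message("assistant", "I'll update stripe-webhooks.ts..."),
    Message("tool", "47 tests passed"),
    Message("assistant", "Migration complete. Tests passing."),
    Message("user", "Commit and start on invoices"),
]
after = compact(history, summary="Discussed migration approach.")
for m in after:
    print(m.role, "|", m.text)
```

Every assistant and tool message is gone from `after`. Whether the model knows the migration is finished now depends entirely on the summary string.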

Why the model redoes work

Consider what this means for a typical exchange:

Before compaction

You: “Migrate the webhook handlers”
Codex: “I’ll update stripe-webhooks.ts...”
Codex runs tools:
Writes stripe-webhooks.ts
Runs tests (47 pass)
Codex: “Migration complete. Tests passing.”
You: “Commit and start on invoices”

After compaction

The model can still see your messages (“migrate...” and “commit and start...”).
It cannot see:
What it said
What tools it ran
Test output
It only sees whatever the summary captured.

After compaction, the model sees your request to migrate, your request to continue, and a summary. If the summary says “migrated webhook handlers, tests passing,” it proceeds. If it says “discussed migration approach,” it has no way to know the work is done.

Your instructions are preserved. The evidence that those instructions were executed is not.

The summary is a compression function

Compaction runs a built-in prompt (or your override) and takes the resulting output as the handoff summary. The default prompt is generic:

Create a handoff summary for another LLM that will resume the task. Include current progress and key decisions made, important context or constraints, what remains to be done, and any critical data needed to continue.

This works reasonably well for simple sessions. For complex, multi-step work, the summary often compresses completed tasks into vague statements, drops constraints that seemed minor at the time, or fails to capture the specific state of the codebase.

In practice, compaction injects a handoff summary and expects the next model to continue from it. If the summary is vague, completed work becomes invisible.

Auto-compaction timing

Auto-compaction triggers after a turn when total token usage exceeds model_auto_compact_token_limit.

The hard part is that it’s opaque. You get a short notice, but you don’t get to preview what the next turn will actually know.

That’s why it can feel like the model suddenly changed personality. You and the model are now continuing from different context.

Don’t overfit to the exact “context left” number. Treat it as a rough signal. What matters is whether you’re at a safe boundary and can restart from durable state.

Compaction usually fires after a long turn, not mid-turn. The timing still matters.

If it fires right after a big implementation, the summary can capture “feature A done.”

If it fires right after you queue the next request (“now do feature B”), the summary can capture “user requested feature B.”

It may still be vague about whether feature A is already complete.
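The trigger itself is easy to picture. The sketch below simulates a session where the check runs after each completed turn. The function names, the turn costs, and the size of the history after compaction are made-up values for illustration. Only the "after a turn, compare total usage to the limit" rule comes from the description above.

```python
def should_auto_compact(total_tokens: int, limit: int) -> bool:
    """True when a finished turn has pushed total usage past the limit."""
    return total_tokens > limit


def run_session(turn_costs, limit, size_after_compaction=2_000):
    """Yield (turn_number, total_tokens, compacted) for each completed turn."""
    total = 0
    for turn, cost in enumerate(turn_costs, start=1):
        total += cost
        compacted = should_auto_compact(total, limit)
        yield turn, total, compacted
        if compacted:
            # Assumption: the rebuilt history is much smaller than the original.
            total = size_after_compaction


timeline = list(run_session([40_000, 50_000, 30_000, 10_000], limit=100_000))
for turn, total, compacted in timeline:
    print(f"turn {turn}: {total} tokens, compacted={compacted}")
```

Nothing in the check knows whether turn 3 ended at a task boundary. It only sees a number.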

User message truncation

User messages are preserved with a budget. In the local CLI compaction path, that budget is 20,000 tokens. When you exceed it, a long prompt may be truncated.

Constraints buried in the middle of a long prompt can disappear. The more common failure mode, though, is losing the assistant’s proof that the work was executed.
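Here is a small sketch of what middle truncation does to a long prompt. The `truncate_middle` function and the marker text are hypothetical, and words stand in for tokens. The real truncation logic may differ in detail.

```python
def truncate_middle(text: str, budget: int, marker: str = "[...truncated...]") -> str:
    """Keep the head and tail of a prompt and drop the middle.

    Assumption: one whitespace-separated word counts as one token.
    """
    words = text.split()
    if len(words) <= budget:
        return text
    head = budget // 2
    tail = budget - head
    kept_tail = words[-tail:] if tail > 0 else []
    return " ".join(words[:head] + [marker] + kept_tail)


prompt = (
    "Migrate the handlers. "
    + "filler " * 50
    + "Do not touch the ledger table. "
    + "filler " * 50
    + "Run tests when done."
)
shortened = truncate_middle(prompt, budget=20)
print(shortened)
```

The opening task and the closing instruction survive. The constraint about the ledger table, which sat in the middle, does not.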

Solution: Making completed work visible

The goal is to ensure the model knows what has been done, regardless of whether compaction has run.

In my day-to-day workflow, I treat compaction as a normal risk. I keep a checked plan file, I compact after a checkpoint, and I start a new session when I switch phases.

1. Record progress in files, not in conversation

Conversation history compacts. Files do not.

For multi-step work, maintain a plan file that the model updates as it completes tasks:

# plans/stripe-migration.md

## Webhook handlers
- [x] Migrate to billing meter API (src/stripe-webhooks.ts)
- [x] Update payload parsing for new schema
- [x] Update subscription logic (src/billing/subscriptions.ts)
- [x] Tests passing (47/47)

## Invoice reconciliation
- [ ] Add meter usage aggregation
- [ ] Wire up invoice.created webhook
- [ ] Add reconciliation tests

Instruct Codex to update this file when it completes work:

Read plans/stripe-migration.md. This is the source of truth. When you complete a task, check it off in the file before reporting completion. Never redo a checked item without explicit instruction.

After compaction, the model can re-read the file and see exactly what has been done.
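Because the plan file is plain Markdown, you can also check it from a script. This is a minimal sketch that parses the checklist format shown above. `read_plan` and `pending` are hypothetical helper names, not part of Codex.

```python
import re

# Matches "- [x] task" and "- [ ] task" checklist lines.
CHECKBOX = re.compile(r"^\s*[-*] \[( |x|X)\] (.+)$")

PLAN = """\
# plans/stripe-migration.md

## Webhook handlers
- [x] Migrate to billing meter API (src/stripe-webhooks.ts)
- [x] Update payload parsing for new schema
- [x] Update subscription logic (src/billing/subscriptions.ts)
- [x] Tests passing (47/47)

## Invoice reconciliation
- [ ] Add meter usage aggregation
- [ ] Wire up invoice.created webhook
- [ ] Add reconciliation tests
"""


def read_plan(markdown: str) -> dict[str, list[tuple[bool, str]]]:
    """Group checklist items under their nearest '## ' heading."""
    sections: dict[str, list[tuple[bool, str]]] = {}
    current = "(no section)"
    for line in markdown.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections.setdefault(current, [])
            continue
        match = CHECKBOX.match(line)
        if match:
            done = match.group(1).lower() == "x"
            sections.setdefault(current, []).append((done, match.group(2).strip()))
    return sections


def pending(markdown: str) -> list[str]:
    """Return every task that is not checked off yet."""
    return [task for items in read_plan(markdown).values() for done, task in items if not done]


print(pending(PLAN))
```

A check like this is handy in a pre-compaction routine: if the pending list does not match what you believe is left, fix the file before you continue.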

2. Compact at safe boundaries

Auto-compaction fires based on token count, not task completion. For important work, take control of the timing.

Run /compact manually at these points:

After completing a logical unit of work (a feature, a migration phase, a refactor).
After tests pass and changes are committed.
Before starting anything expensive or hard to undo.

Before compacting, ask the model to update the plan file. That way the summary has a clean snapshot and the repo has durable state.

If you’re finishing a major task, my default is even simpler: end the session.

Start a new session with a manual handoff prompt. It costs a minute and saves you an hour of untangling.

This is the part people underestimate: a manual handoff isn’t just “better context.” It’s visibility. You can read the handoff prompt and know what the next session will know.

2.1 Write a receipt for “done”

Compaction discards the agent’s confirmations. If you want a step to survive, write your own receipt.

In practice that means:

Check the item off in your plan file.
Write a one-line receipt in the task file (“Phase 1 complete. Tests green.”).
Commit or checkpoint the repo when it matters.

It’s simple and it works.

2.2 When to compact (and when not to)

The rule I use is simple:

Only compact when a fresh session could restart from durable state.

Durable state means Codex can infer the context from things that exist outside the conversation:

A plan/task file.
Repo reality (what’s checked in or in your working tree).
A clear signal for “done” (tests, a curl check, a migration output you wrote down).

Compact when:

You’re at the end of a phase and the plan file is up to date.
The state is easy to verify from the repo (tests, typecheck, CI, a single script run).
You want to intentionally switch context (new phase, new subsystem, new risk profile).

Avoid compacting when:

You are mid-investigation and the important facts only exist in terminal history.
You just ran an expensive or risky command and the only record is the tool output.
You haven’t written down the decision that makes the next step safe (“do not touch X”, “this backfill is complete”, “this is the exact flag we used”).
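If it helps, the rule can be written down as a checklist you answer honestly before running /compact. The sketch below is only that: `SessionState` and its fields are hypothetical names for the questions above, and nothing here inspects a real session.

```python
from dataclasses import dataclass


@dataclass
class SessionState:
    plan_file_up_to_date: bool
    done_signal_recorded: bool    # e.g. test results written into the plan file
    facts_only_in_terminal: bool  # findings that exist only in tool output
    unwritten_decisions: bool     # "do not touch X" style decisions not yet in a file


def compaction_blockers(state: SessionState) -> list[str]:
    """Return the reasons not to compact yet. An empty list means it is safe."""
    blockers = []
    if not state.plan_file_up_to_date:
        blockers.append("update the plan file")
    if not state.done_signal_recorded:
        blockers.append("record a clear signal for done")
    if state.facts_only_in_terminal:
        blockers.append("write down facts that only exist in terminal history")
    if state.unwritten_decisions:
        blockers.append("write down the decisions that make the next step safe")
    return blockers


end_of_phase = SessionState(True, True, False, False)
mid_investigation = SessionState(False, False, True, True)
print(compaction_blockers(end_of_phase))
print(compaction_blockers(mid_investigation))
```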

3. Optional: Local compaction (non-OpenAI providers)

Codex CLI decides where compaction happens based on your provider.

If you use the OpenAI provider, compaction is handled upstream. The local compact prompt is not used.

If you use a non-OpenAI provider, Codex runs compaction locally. In that case, experimental_compact_prompt_file can influence the handoff summary.

4. Commit early, commit often

Git commits create durable checkpoints that the model can verify.

After completing a feature, commit the changes before moving on. If compaction causes confusion, the model can check git status and git log to see what actually happened:

Run git log --oneline -5 and git status to see what has been completed.

The commit history survives compaction. The conversation history does not.
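When I drive this from a script, I turn the two git commands into a short receipt that can be pasted into a prompt or a plan file. The sketch below assumes you pass `--porcelain` to `git status` so the output is stable to parse. The helper names are hypothetical, and `git()` is never called unless you invoke it inside a repository.

```python
import subprocess


def git(*args: str) -> str:
    """Run a git command in the current directory and return its stdout."""
    return subprocess.run(["git", *args], check=True, capture_output=True, text=True).stdout


def parse_log(oneline: str) -> list[tuple[str, str]]:
    """Parse `git log --oneline` output into (short_sha, subject) pairs."""
    commits = []
    for line in oneline.splitlines():
        sha, _, subject = line.partition(" ")
        if sha:
            commits.append((sha, subject))
    return commits


def parse_status(porcelain: str) -> list[tuple[str, str]]:
    """Parse `git status --porcelain` (v1) output into (status, path) pairs."""
    # Each line is two status characters, a space, then the path.
    return [(line[:2], line[3:]) for line in porcelain.splitlines() if len(line) > 3]


def repo_receipt(log_text: str, status_text: str) -> str:
    """Summarize recent commits and working tree state as plain text."""
    lines = ["Recent commits:"]
    lines += [f"  {sha} {subject}" for sha, subject in parse_log(log_text)]
    changes = parse_status(status_text)
    if changes:
        lines.append(f"Working tree: {len(changes)} changed path(s)")
        lines += [f"  {status} {path}" for status, path in changes]
    else:
        lines.append("Working tree: clean")
    return "\n".join(lines)


# Inside a repository you would call:
#   repo_receipt(git("log", "--oneline", "-5"), git("status", "--porcelain"))
```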

5. Keep critical instructions near the end of prompts

User messages are preserved with recent content prioritized. If you have a long prompt, keep critical constraints near the end, not buried in the middle.

Better yet, keep prompts short and put detailed specifications in files. Reference the file and quote only the immediately relevant section.

Failure modes & mitigations

Repeated work. The model reimplements a feature it already built because the summary did not record completion.

Mitigation: Record completed tasks in files. Have the model check the file before starting any work.

Conflicting implementations. The model creates a second version of something that exists, leading to duplicate code or broken imports.

Mitigation: Ask the model to check for existing implementations before creating new files. Use the plan file to track which files belong to which feature.

Lost test state. The model re-runs tests it already ran, or assumes tests failed when they passed, because tool outputs are discarded.

Mitigation: Record test results in the plan file. After compaction, have the model re-run tests to verify current state rather than relying on memory.

Summary drift. Multiple compactions in a long session compound the problem. Each summary is based on the previous summary plus recent messages, not on the original conversation.

Mitigation: Keep sessions focused. Start new sessions for new tasks rather than running one session for hours.

Undo confusion. Ghost snapshots let you roll back the UI, but the model continues from its compacted history. Rolling back does not restore the model’s memory.

Mitigation: After using undo, explicitly restate the current state and what work has been done.

Parallel sessions collide on diffs. Even if two agents touch different files, they still share the same working tree. Codex tries to converge to a diff that solves its current task. If another session changed the tree “out of scope,” you can see rollbacks or confusing overwrites.

Mitigation: Use separate working copies (for example via git worktree) when running agents in parallel. Treat each session like a developer with its own branch.

Noisy outputs trigger early compaction. Reading huge files and pasting long logs burns context quickly. You hit compaction sooner, and the summary has more chances to drop a critical detail.

Mitigation: Prefer targeted search over dumping whole files. Avoid pasting entire test logs. If a file is large, ask the agent to rg for specific patterns and open only the relevant sections.
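The idea behind targeted search fits in a few lines. This sketch returns only the matching lines plus a little context, which is roughly what you get from rg with a context flag. `grep_with_context` is a hypothetical helper, shown only to illustrate how much smaller the result is than the full log.

```python
import re


def grep_with_context(text: str, pattern: str, context: int = 2) -> list[str]:
    """Return matching lines plus `context` lines around each match, with line numbers."""
    lines = text.splitlines()
    regex = re.compile(pattern)
    keep: set[int] = set()
    for i, line in enumerate(lines):
        if regex.search(line):
            keep.update(range(max(0, i - context), min(len(lines), i + context + 1)))
    return [f"{i + 1}: {lines[i]}" for i in sorted(keep)]


log_lines = [f"ok step {i}" for i in range(1, 101)]
log_lines[49] = "ERROR: webhook timeout"
hits = grep_with_context("\n".join(log_lines), r"ERROR", context=1)
print("\n".join(hits))
```

Three lines reach the context window instead of a hundred.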

Variants

Team workflows. Standardize a rule: every significant task gets recorded in a plan file before it is considered done. The plan file is the source of truth, not the conversation.

CI and scripted runs. When driving Codex from scripts, log the summary after every compaction event. If the model starts redoing work, the logs show what the summary contained and where it diverged from reality.

Long-running refactors. For refactors that span many files, maintain a tracking file that lists completed files. The model checks this file before touching any file.

Building Software With AI Agents

Marco Hefti — Wed, 10 Dec 2025 06:46:51 GMT

It took me about an hour to get a working 3D configurator for a standard shipping container in front of a customer.

Normally that is a “we will get back to you in a few weeks” kind of request. In this case, the rules for the application were already written down in an ISO standard the model had seen before. Once I referenced the norm, I had a rough prototype running. It was not production ready. But they could click around, break it, complain about it, and suddenly we were iterating on something real instead of talking about a concept.

The interesting part isn’t that it took an hour. It’s that the hour was possible at all. That only worked because I treated AI like a junior developer I could manage and feed with context, a plan, guardrails, and a clear scope.

A lot of people are arguing about whether AI will replace developers. That is the wrong question. The real shift is more operational and much more important.

Agentic development changes the economics of iteration.

When you can show something tomorrow and refine it with real feedback, you make different product and engineering decisions.

If you have tried to build anything non-trivial with AI, you have probably felt it: things move fast at the start, then collapse around the last 20 percent. This article is about why that happens and how to change it.

TL;DR Agentic development is treating AI like a swarm of junior developers inside your repo. The limiting factor is the environment you build around them: context, plans, signals, and guardrails. That environment changes how fast you can iterate and how you structure work, which is why this matters.

In this piece, I will cover:

What agentic development looks like in practice.
Why it became practical once the Codex, Claude, and Gemini CLIs caught up.
How agentic workflows let one developer manage a small team of assistants working on their behalf.
The skill ceiling people do not like to talk about.
The core pillars that keep agents from collapsing at 80 percent.

Think of this as a practical guide to agentic engineering: patterns that have worked for me after months of daily use, independent of any specific model.

What agentic development really is

Agentic development is software engineering where you turn intentions into executable loops: plan, act, verify, repeat, inside a repository that stores context on purpose.

The closest analogy I have:

It feels like having junior developers on call all day: fast at implementation, dependent on you for context.

That can sound like a downgrade: why would you choose juniors when you could have seniors?

Because you do not get seniors on demand. You get what you get and you try to make the most out of it.

So what you do get is a scalable amount of junior-level execution, as long as you provide:

a clear plan
enough context
guardrails and tools
and the discipline to review and steer

Your role shifts:

You stop being “the person that types all the code”
You become the architect and product owner of your own project

That comes with a simple constraint:

Agentic development has a ceiling, and it is set by you.

If you do not know what “done” looks like, the agent will not either. If your repo is undocumented and fragile, every session is a fresh onboarding.

Agents amplify skill. They do not replace it. A common assumption is that “agents can just figure it out.” They cannot. If you do not give them context that tells them about plans, signals, and guardrails, you are not doing agentic development. You are pulling a slot machine and hoping to hit the jackpot (yes, I’m carefully dancing around the word “vibecoding”).

Why prompts alone break: state and context

Most AI content stops at “paste your code into a browser and ask for refactors”. That frames it as a prompt problem. The real issues are state and context:

Work is stateful, prompts are not.

Why is this a workflow problem and not a “better prompt” problem?

Because prompting is a one-shot interaction, while real work involves:

state
retries and partial progress
verification
memory across sessions

If nothing in your system stores that state, in plans, task files, runbooks, or docs, you pay the same tax every time:

you rediscover the same facts
the model forgets them too
your “workflow” is just a series of one-off chats

Agentic development starts when you decide to stop paying that tax and treat state as something you design on purpose. If your system does not store state anywhere, you are not building workflows, you are starting a new conversation over and over again.

From feature work to context work

Agentic development flips the center of gravity: from features to context.

Traditional development is feature-oriented:

“Build the thing.”

Agentic engineering is context-oriented:

“Build the environment so assistants that start fresh each session can build the thing reliably.”

That sounds like overhead until you look at what changes:

You stop relying on “ask the lead engineer”.
You stop keeping setup quirks in your head.
You start encoding decisions where they belong, in the repo.

A new developer cloning a repo for the first time often hits problems that only exist on first setup. Someone gets called over. It gets fixed “just this once”. Nobody writes it down. A month later, someone else hits the same issue.

Agents are direct about this because every new session is a new developer trying to understand your project. If the project is not self-contained, they stall. If you are tired of repeating yourself, give your agents the right context and the tools to validate their own work. That is closing the loop.

It works like the pensieve from Harry Potter: Dumbledore pulling a memory out of his head and storing it in a bowl. Good agentic teams do that with context. They pull it out of brains and put it into the repository.

Loops, not guesses: “curl until green”

Consider a simple case: an endpoint is failing in staging.

Prompt-based pattern

you paste the stack trace into ChatGPT, Claude, or Gemini
you get suggestions
you try a change
both you and the model forget what you already tried
you spiral

You are still guessing, only faster.

Agentic loop

Define the success signal:
“curl must return 200 with the expected JSON.”
Give the agent tools:
run tests, run lints, build and restart services, inspect logs.
Run the loop until the signal is green.

The curl is not just a command. It is the signal.

The loop is the agent.

This is the core difference: you are turning “try a thing” into “run a loop against reality until a clear condition is met”.

Once you see it this way, almost any debugging or repair workflow can be expressed as: define the signal, then loop against reality until the signal is green.
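In code, the shape of that loop is small. The sketch below is an illustration under assumptions: `loop_until_green`, `http_signal`, and the attempt cap are hypothetical names and choices, and the "fix" step is a callback where a real setup would hand control to the agent.

```python
import json
import urllib.request
from typing import Callable


def http_signal(url: str, expected: dict) -> bool:
    """The curl-style signal: HTTP 200 and the expected JSON body."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200 and json.load(response) == expected
    except (OSError, ValueError):
        # Connection errors, non-2xx responses, and invalid JSON all count as red.
        return False


def loop_until_green(
    signal: Callable[[], bool],
    attempt_fix: Callable[[int], None],
    max_attempts: int = 5,
) -> int:
    """Check the signal, fix, and repeat. Returns the number of fixes it took."""
    for attempt in range(max_attempts + 1):
        if signal():
            return attempt
        if attempt == max_attempts:
            break
        attempt_fix(attempt + 1)
    raise RuntimeError(f"signal still red after {max_attempts} attempts")


# A fake service that turns green after the second fix, to show the control flow.
state = {"fixes": 0}
attempts = loop_until_green(
    signal=lambda: state["fixes"] >= 2,
    attempt_fix=lambda n: state.update(fixes=n),
)
print(f"green after {attempts} fix attempts")
```

The cap matters. A loop without a stop condition for failure is how you burn context on the same wrong idea.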

Why now: tooling made this practical

Models have been decent at code for a while, but the step change for day-to-day work came from tooling and harness.

What changed recently is not just that the models got better. It is ergonomics:

Terminal-native workflows in the Codex, Claude, and Gemini CLIs that live next to your docs, tests, and git history.
Standardised tool access through MCP servers (Model Context Protocol servers that expose tools and surfaces like browsers), so agents can drive real surfaces instead of guessing.
A reusable skills library that lets the agent see your runbooks and workflows everywhere without pasting them into every prompt.

Once agents live in your terminal, with your tools, you can build loops like the curl example. If they stay in a browser tab, they rarely make it past the demo stage.

Autonomy comes in levels

Can you say:

“Design a complex web app, implement it, deploy it,”

and hand that to an agent?

No. Not reliably.

You can think about it in levels:

Level 1: lint fixes and very small edits that are easy to verify.
Level 2: missing tests around existing behavior.
Level 3: code changes that run under a test suite and type checks.
Level 4: multi-step repair loops built around a clear signal such as the curl example.
Level 5: small orchestrated flows that follow a written plan and reuse context across steps, for example a small migration or internal tool flow that chains a few tasks together.

Your workflow becomes:

you groom
you plan
you verify

If you want meaningful autonomy, expect to write plans. A lot of plans. That is not a failure. That is the work.

The pillars: context, plans, signals, guardrails, tooling

This is the core of agentic engineering: agentic development becomes reliable when you treat it like engineering and build a harness.

1) Context architecture

Agentic workflows push context out of people’s heads and into the repo:

AGENTS.md as an entry point to rules, runbooks, and conventions.
Runbooks for common operations.
Docs that make setup self-contained.
Task files that store the current plan and state.

2) Plans as durable memory

Plans and task files are independent storage for context. Plans survive the session and act as shared memory. If a run crashes, hits a token limit, or you swap models, point the next session at the file and it’s immediately caught up.

3) Signals and success criteria

Prompts tell the agent what to try. Signals tell it when to stop.

A signal is a machine-checkable condition that separates “we are still working” from “this part is done”.

The interesting part is what happens when you bundle signals. A real task often looks like this from the agent’s point of view:

tests for the affected module are green
curl /health returns 200
a specific error log pattern no longer appears under load

When all of these are true, the task is done. When any of them are false, keep working.

That is how I think about plans in agentic engineering.

Plans consist of procedures and signals. Procedures are how to approach the task. Signals are what must be true before you move on. From the agent’s perspective, the set of signals is the spine of the plan. It does not have to invent what “good” looks like. It has a list of conditions it is trying to make true.

In my own setup this usually ends up as small task files that contain a description, a few procedure hints, and a list of signals. Walking that list once is a single turn of work for the agent.
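One way to picture such a task file is as a small data structure. This is a sketch with hypothetical names (`Signal`, `Task`), and the checks are stubbed with constants where a real setup would run tests, curl an endpoint, or scan logs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Signal:
    name: str
    check: Callable[[], bool]  # machine-checkable: returns True when satisfied


@dataclass
class Task:
    description: str
    procedure_hints: list[str]
    signals: list[Signal]

    def failing(self) -> list[str]:
        """Names of the signals that are not yet true."""
        return [s.name for s in self.signals if not s.check()]

    def done(self) -> bool:
        """The task is done only when every signal is true."""
        return not self.failing()


task = Task(
    description="Fix the failing health endpoint",
    procedure_hints=["Check recent changes to the router first"],
    signals=[
        Signal("module tests green", lambda: True),        # stub
        Signal("curl /health returns 200", lambda: True),  # stub
        Signal("error pattern gone under load", lambda: False),  # stub
    ],
)
print(task.done(), task.failing())
```

Walking the list once is one turn. The agent keeps working on whatever `failing()` returns.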

4) Verification and guardrails

Treat the agent like a developer and send it through the same guardrails as everyone else:

Static checks and language analyzers: catch type and semantic issues (tsc --noEmit, mypy/pyright, go vet/staticcheck)
Style checks and formatters: enforce style and hygiene (eslint/prettier, ruff/black, gofmt)
Dependency rules: enforce layer boundaries (depcruiser, import-linter for Python, ArchUnit for Java)
Tests: unit/integration (Jest snapshots, Go/Python golden files, JUnit approvals)

Run the same commands locally that CI will run. CI is the safety net, not the first place you discover issues.

These are the same gates you would use for any PR. Make the agent run them too.
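A single entry point for those gates can be a very small script. The sketch below runs a list of commands in order and stops at the first failure. `run_gate` is a hypothetical helper, and the command list in the trailing comment is only an example of what a TypeScript project might use.

```python
import subprocess
import sys


def run_gate(commands: list[list[str]]) -> list[tuple[str, int]]:
    """Run each command in order and stop at the first non-zero exit code."""
    results = []
    for cmd in commands:
        completed = subprocess.run(cmd, capture_output=True, text=True)
        results.append((" ".join(cmd), completed.returncode))
        if completed.returncode != 0:
            break
    return results


def gate_passed(results: list[tuple[str, int]]) -> bool:
    """True when every command that ran exited with code 0."""
    return all(code == 0 for _, code in results)


# Stand-in commands so the sketch runs anywhere Python is installed.
demo = [
    [sys.executable, "-c", "pass"],
    [sys.executable, "-c", "import sys; sys.exit(3)"],
    [sys.executable, "-c", "pass"],
]
demo_results = run_gate(demo)
print(demo_results, gate_passed(demo_results))

# Hypothetical real gate for a TypeScript repo:
#   run_gate([["npx", "tsc", "--noEmit"], ["npx", "eslint", "."], ["npx", "jest"]])
```

The agent and CI call the same script, so "it passed for me" means the same thing in both places.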

5) Tooling and harness

Agents need a harness so they can run, see context, and stay safe.

Project docs as entry points: repo and sub-directory AGENTS.md files that point to the right commands and tools.
Validation: one command that runs the validation suite (tests, style, static checks, dependency rules), plus scoped variants for quick loops. The agent runs the same gate humans do.
Task and state files: a small task system with status snapshots and execution logs, so the next session resumes from recorded state instead of prompt memory.
Environment and adapters: declare how the project runs (for example Docker or local scripts) and provide wrappers to restart services, seed data, and instructions on how to hit HTTP/DB/browser surfaces.
Safety and logging: run in a restricted environment, keep secrets out of reach by default, and log command output for audit.

Skip the harness and you’re back to prompting and hoping.

The downside: the 80 percent problem

So where does this usually fall apart?

In the last 20 percent:

edge cases show up
integrations behave differently than expected
misunderstandings (context drift)
success criteria were never written down

This is not a reason to avoid agents. It is a reason to add the missing engineering layer:

clear plans
explicit signals
verification loops
context that persists beyond the session

In demos you mostly see the 80 percent. When you close the last 20 percent with plans, signals, and guardrails, you move from vibe coding to maintainable and scalable environments that can be driven by agents.

What it unlocks in practice

Take the configurator from the intro. The only reason that hour of work was enough is that the behavior of the material lived in an ISO spec the agent could rely on once I mentioned it. The agent did not invent physics. It stitched together a UI, used the right tools because it understood the scope, and orchestrated code around rules it had been trained on.

The goal in that case was simple: give a customer something interactive they could click through and critique.

This is the real unlock:

You collapse the cost of iteration.
You make “show, then refine” cheap.
You change how you sell, prototype, and shape products.

On the other end of the spectrum, agents quietly handle boring work:

resolving lint errors
filling in missing tests
updating docs and runbooks
chasing down regressions with repeatable loops

You get leverage at both ends: faster experiments and less drag from maintenance.

How to start, for real (safely and securely)

You do not need a big AI strategy. You can start with one repo.

Most teams try to start at level ten instead of level one. They hand a vague project to an agent and hope for the best. The steps below keep you at the low end of autonomy while you learn what works.

1. Pick a safe surface

Choose an internal tool, admin panel, or non-critical service you understand well.

2. Prepare the environment

Make sure lint, test, and type commands exist and are reliable.
Add a short AGENTS.md that explains what the repo does, how to run it, and where to find key docs. See it as a map you provide to your agent.
Fix the obvious “new dev setup” traps you already know about, because your agents will stumble over them and you will get annoyed.

3. Delegate low risk loops

Use an agent to propose a plan for lint fixes, missing tests, or a small refactor.
Let it execute under your guardrails while you review diffs and keep architectural judgment.
For one or two bugs, define a clear signal, like the curl loop, and let the agent drive fixes until the signal is green.

4. Encode what you learn

Turn the rough plan into a task file or runbook.
Update AGENTS.md with any new patterns.
Add a guardrail, test, check, or script so the same issue is cheaper next time.

That is enough to move from demos to a small, real agentic workflow.

Why this actually matters

Agentic development matters because it shifts software engineering up a layer. This is what building software with AI agents, for real, looks like in practice.

Programming languages become second-level abstractions. The primary skills become:

designing workflows
encoding context in the repo
managing levels of autonomy
verifying outcomes through signals

AI is not replacing developers any time soon. But developers and teams who learn to manage agents inside well prepared environments will outpace those who do not. The leverage adds up. The gap widens, quietly at first, then obviously.

If you found this useful, you can subscribe to get future practitioner-first pieces on agentic development and real-world AI workflows.

If you want help rolling these patterns into your own teams, reach me at marco@heftiweb.ch.

For shorter, more frequent notes and experiments, I’m on X as [@mheftii](https://x.com/mheftii).

Skills in Codex: A library for your workflows

Marco Hefti — Wed, 03 Dec 2025 09:18:43 GMT

TL;DR

Skills turn your repeatable workflows into named building blocks. Each skill is a small SKILL.md file under ~/.codex/skills that describes a workflow you want Codex to know about everywhere.
Codex surfaces a compact ## Skills section instead of pasting full playbooks. The model sees each skill’s name and description plus a pointer to where its source lives on your machine. That is enough for it to know what exists.
You use Skills to stop retyping the same runbooks. Incident checklists, release steps, and “how we run tests” move out of every AGENTS.md and into a library that follows you between projects.
To try Skills, you only need a small skills directory and a feature flag. Create ~/.codex/skills, add a SKILL.md with name and description, enable the experimental Skills flag in your test config, and restart Codex to see a new entry under ## Skills.
Treat Skills as a library. Keep names and descriptions short, focus on workflows you repeat across repos, and refine each SKILL.md over time.

1. Context: Runbooks in every prompt

If you use Codex regularly, you probably have a few prompts and workflows that you keep retyping or pasting into AGENTS.md. Common examples are “how we deploy to staging”, “how to triage a production incident”, or “how we create diagrams for documentation”.

You can shove those into project docs, but that has real costs. Long instructions inflate every prompt, eat into the available context window, and are hard to keep consistent across repos.

Skills aim to solve that problem. They give you a place under ~/.codex/skills for reusable playbooks, and Codex turns that into a small skills section that the model can see on every run. The model learns that a skill exists, what it does, and where to find its detailed description, without you pasting big blocks of Markdown into each conversation. Think of it as MCP, but for workflows instead of tools.

2. Mental model: A library of named workflows

A skill is a named entry in a library that Codex loads at startup. Conceptually, each skill is:

A short name that you can reference in prompts and logs.
A description that explains what the skill does and when to use it.
A Markdown body with detailed instructions that you maintain for humans.

Codex does not send the full body to the model. Instead, it builds a skills list and injects that list into the instructions as a ## Skills section. Each bullet looks like:

- name: description (file: path)

From the model’s point of view, each skill is a named capability with a clear description and a pointer to a file it can mention in its plans or suggestions. From your point of view, skills are bookmarks for your own runbooks that Codex makes visible to the model without paying the token cost of the full documents.

Skills are global to your Codex home, not per project, so one skill can back multiple repos and sessions.

What this guide covers

What Skills are and how they differ from AGENTS.md.
A concrete d2-diagrams example, including the SKILL file and how we use it.
A safe way to try Skills while they are still behind an experimental flag.
Limits, failure modes, and patterns for using Skills on your own and with a team.

3. What Skills look like in your Codex home and in Codex

Skills live in your Codex home directory under a single root:

~/.codex/skills

Codex looks for files named SKILL.md anywhere under that tree. For example:

~/.codex/skills/
  d2-diagrams/
    SKILL.md
  release-checklist/
    SKILL.md
  incident-response/
    SKILL.md

Each SKILL.md has two parts:

A small YAML header at the top.
A Markdown body with the full playbook.

Codex uses the header to build the ## Skills section. The body stays in your skills directory and can be read by Codex when needed.

Here is a simplified version of a real diagrams skill:

---
name: d2-diagrams
description: How to edit and regenerate D2 diagrams for documentation; use when a .d2 file changes or a new diagram is needed.
---

# D2 diagram workflow

1. Edit the source in `diagrams/*.d2`.
2. Regenerate outputs (keep both source and exported image):
   ```sh
   d2 diagrams/NAME.d2 diagrams/NAME.png
   # or SVG if preferred
   d2 diagrams/NAME.d2 diagrams/NAME.svg
   ```
3. Keep filenames stable; reuse the same NAME for updates.
4. Apply our default Material-inspired styling:
   - Primary color: #6200ee; secondary: #03dac6; muted text: #5f6368.
   - Arrow style: clean, medium weight; avoid overly thin lines.
   - Font: default sans; avoid serif.
   - Background: light, no gradients.

When the Skills feature is enabled and Codex finds this file, Codex receives instructions like:

```md
## Skills
These skills are discovered at startup from ~/.codex/skills. Each entry shows name, description, and file path so you can open the source for full instructions. Content is not inlined to keep context lean.
- d2-diagrams: How to edit and regenerate D2 diagrams for documentation; use when a .d2 file changes or a new diagram is needed. (file: /absolute/path/to/.codex/skills/d2-diagrams/SKILL.md)
```

The model sees the header line, the short explanatory sentence, and a bullet with the skill’s name, description, and file path. The body that starts with # D2 diagram workflow stays local and is not sent to the model.
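
To make the header-to-bullet step concrete, here is a minimal sketch of how a SKILL.md header could be turned into one ## Skills entry. This is not Codex's actual implementation: `parse_header` and `render_entry` are hypothetical helpers, and the parser only handles simple one-line `key: value` pairs rather than full YAML.

```python
def parse_header(text: str) -> dict[str, str]:
    """Read simple `key: value` pairs from the block between the first two `---` lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    header: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return header  # closing marker reached, the body below is ignored
        key, sep, value = line.partition(":")
        if sep:
            header[key.strip()] = value.strip()
    return {}  # no closing `---`: treat the header as missing


def render_entry(header: dict[str, str], path: str) -> str:
    """Format one bullet in the shape shown in the ## Skills example."""
    return f"- {header['name']}: {header['description']} (file: {path})"


skill_text = """---
name: d2-diagrams
description: How to edit and regenerate D2 diagrams for documentation; use when a .d2 file changes or a new diagram is needed.
---

# D2 diagram workflow
"""

header = parse_header(skill_text)
entry = render_entry(header, "/absolute/path/to/.codex/skills/d2-diagrams/SKILL.md")
print(entry)
```

Only the two header fields and the path end up in the entry, which is why the body can grow without inflating the prompt.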

4. Getting started: Your first skill (concrete example)

Creating a new skill is a simple filesystem workflow.

Before you add one, you need a Codex build with Skills enabled. As of this writing, Skills live only in experimental Codex CLI builds behind a skills feature flag, not in the default stable @openai/codex install. A safe pattern is to give Skills their own CLI install and their own Codex home so you can test them without altering your main setup.

In our own test we used an alpha build that included Skills:

npm i -g @openai/codex@0.65.0-alpha.2 --prefix "$HOME/.codex-skills-install"

At the time of writing, 0.65.0-alpha.2 is the first tag we used that contains the Skills feature. Future builds may move or rename this, so treat the exact tag as an example and check the latest release notes for a build that mentions Skills before copying this verbatim.

Once you have an experimental CLI installed into its own prefix, you can hook it up to a dedicated Codex home:

Create a dedicated Codex home and config root for this install, for example:

  mkdir -p "$HOME/.codex-skills" "$HOME/.config/codex-skills"

Add a small wrapper script so you can run this install without touching your main Codex setup:

  #!/usr/bin/env bash
  # ~/bin/codex-skills-test
  # Run the experimental install against its own Codex home and config root.
  export CODEX_HOME="$HOME/.codex-skills"
  export XDG_CONFIG_HOME="$HOME/.config/codex-skills"
  exec "$HOME/.codex-skills-install/bin/codex" "$@"

Make the script executable (chmod +x ~/bin/codex-skills-test) and ensure ~/bin is on your PATH.

Copy your usual config.toml into the new Codex home and set skills = true under [features]:

  [features]
  skills = true

Run your wrapper command to confirm that the feature is enabled:

  codex-skills-test features list

You should see skills listed as true.

The exact tag and wrapper name will change, but isolating an experimental CLI install, giving it its own Codex home, and flipping the skills flag is a pattern that will age better than any specific version.

Once you have a Codex home with Skills enabled, adding your first skill looks like this:

Create the skills directory.

   mkdir -p ~/.codex/skills/d2-diagrams

Write the skill file. Name it SKILL.md.

   ---
   name: d2-diagrams
   description: How to edit and regenerate D2 diagrams for documentation; use when a .d2 file changes or a new diagram is needed.
   ---

   # D2 diagram workflow
   - Edit the source in `diagrams/*.d2`.
   - Regenerate outputs and keep both the `.d2` file and an exported PNG or SVG.
   - Apply your house style (colors, line weights, fonts).

Keep the header within the basic limits.

name and description must both be present and non empty.
name should stay under roughly 100 characters.
description should stay under roughly 500 characters and fit on a single line.

Restart Codex so it rescans skills.
Enable the experimental Skills feature so it actually injects the skills section. This flag adds the ## Skills block to the instructions.

With that in place, your next Codex session shows a d2-diagrams entry under ## Skills, and you can reference it explicitly in prompts. For example, “Use the d2-diagrams skill to plan how to update the diagrams for this article.”
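
If you want to catch header problems before Codex does, a small check like the following can help. It is a sketch under assumptions: the article only gives rough limits, so the exact thresholds of 100 and 500 characters are illustrative, and `header_problems` is a hypothetical helper, not part of Codex.

```python
MAX_NAME_CHARS = 100          # assumption: the limit is described only as "roughly 100"
MAX_DESCRIPTION_CHARS = 500   # assumption: the limit is described only as "roughly 500"


def header_problems(name: str, description: str) -> list[str]:
    """Return human-readable problems for a skill header, or an empty list if it looks fine."""
    problems: list[str] = []
    if not name.strip():
        problems.append("name is missing or empty")
    elif len(name) > MAX_NAME_CHARS:
        problems.append(f"name is longer than {MAX_NAME_CHARS} characters")
    if not description.strip():
        problems.append("description is missing or empty")
    else:
        if len(description) > MAX_DESCRIPTION_CHARS:
            problems.append(f"description is longer than {MAX_DESCRIPTION_CHARS} characters")
        if "\n" in description:
            problems.append("description spans more than one line")
    return problems


print(header_problems("d2-diagrams", "How to edit and regenerate D2 diagrams."))
print(header_problems("", "line one\nline two"))
```

Running a check like this in a pre-commit hook for a shared skills repo is one way to keep broken headers out of everyone's Codex home.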

5. Good use cases for Skills

Skills are best suited to workflows and playbooks that you want to reuse across projects and sessions.

Good candidates include:

A standard way to edit and regenerate D2 diagrams for documentation.
A consistent process for running linters and formatters before commits.
A deployment checklist for staging and production.
An incident response checklist with the first commands to run and who to notify.
A personal playbook for triaging pull requests or handling legacy code.

The common pattern is that you want the model to know these workflows exist, even when you are working in a new repo, without pasting the same instructions over and over.

Skills are not:

A mechanism to dump large instructions into every prompt.
Per project configuration or architecture notes.
A way to expose the full skill body to the model automatically.

When you keep skills tight and task focused, they act as a library of habits that Codex always sees up front while leaving your AGENTS.md files free to focus on repo specific behavior.

6. How Skills fit into your day to day work

Once Skills are enabled in your Codex setup, they change how you interact with each session.

In a real codebase you might use the d2-diagrams skill we set up earlier in this guide to describe how you create and regenerate D2 diagrams for documentation. When you start Codex in that repo and ask a simple question:

Are you aware of the diagram skill?

Codex answers that it can see a d2-diagrams skill registered, describes it as a workflow for editing and regenerating D2 diagrams used in documentation, and points at the skill file in the skills directory as the source of truth. That response comes entirely from the injected ## Skills section; we do not paste the skill body into the prompt.

When you actually need a diagram, you ask Codex to lean on that skill:

Use the d2-diagrams skill to design a new D2 diagram that documents the context flow for this service.
Goal:
- Create diagrams/context-flow.d2 plus a matching PNG export following the d2-diagrams skill.
- Show Codex reading AGENTS.md, scanning the skills tree, building a ## Skills block, and feeding that context into the model for this service.

From there Codex creates:

diagrams/context-flow.d2 – a D2 diagram that you can drop into your architecture or runbook documentation. It shows nodes for the client, the main service, its core dependencies, and any queues or topics in between. Arrows show how requests and data move through the system.
diagrams/context-flow.png – a PNG export generated via d2 diagrams/context-flow.d2 diagrams/context-flow.png, checked in alongside the source.

The D2 file follows the workflow and style from the skill without you restating it in the prompt. The important part is that the diagram is shaped by your SKILL.md rather than whatever the model happens to improvise on a given day.

In practice, this is what day to day usage looks like:

You keep detailed workflows in SKILL.md files and treat them as the source of truth.
Codex always starts each session knowing which skills exist and where they live, without you pasting those workflows into AGENTS.md or every prompt.
When you need help on a task that has a skill, you mention the skill by name (“use the d2-diagrams skill…”) and let Codex plan from there, rather than re-explaining the checklist each time.

7. Under the hood: What Codex actually does

Most of the time you do not need to think about the internals, but it helps to know roughly what happens behind the scenes.

On startup, Codex scans ~/.codex/skills for SKILL.md files.
It parses each one, checks that name and description are present and within basic limits, and records any errors.
It builds a ## Skills block from the valid entries only.
It then assembles the instructions for the model by combining any user or system instructions, project docs from AGENTS.*, and the skills block into a single string.

Project documentation like AGENTS.md stays repo specific. Skills live under ~/.codex/skills and are global to your Codex home. In the final instructions that reach the model, user or system instructions (if any) come first, project docs come next, and the ## Skills section is appended after project docs when the feature is enabled.
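
The startup scan and the assembly order can be sketched in a few lines. This is an illustration of the steps described above, not Codex's source: `find_skill_files` and `assemble_instructions` are hypothetical names, and the blank-line separator between parts is an assumption.

```python
import tempfile
from pathlib import Path


def find_skill_files(root: Path) -> list[Path]:
    """Return every SKILL.md anywhere under root, sorted for stable output."""
    return sorted(root.rglob("SKILL.md"))


def assemble_instructions(user_instructions: str, project_docs: str, skills_block: str) -> str:
    """Join the parts in the described order, skipping any that are empty."""
    parts = [user_instructions, project_docs, skills_block]
    return "\n\n".join(part.strip() for part in parts if part.strip())


# Build a throwaway skills tree to show discovery.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "skills"
    for skill in ("release-checklist", "d2-diagrams"):
        (root / skill).mkdir(parents=True)
        (root / skill / "SKILL.md").write_text("---\nname: example\ndescription: example\n---\n")
    found = [path.parent.name for path in find_skill_files(root)]

prompt = assemble_instructions(
    "Be concise.",
    "# AGENTS.md\nRun tests with make test.",
    "## Skills\n- d2-diagrams: How to edit and regenerate D2 diagrams. (file: /path/to/SKILL.md)",
)
print(found)
print(prompt)
```

Because the skills block is appended last, repo specific guidance in AGENTS.md always sits ahead of the global skills list in the final instructions.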

From the model’s point of view, skills are pure documentation. Codex does not execute code from skills or automatically load extra files into context. The model only sees a short description and a path it can mention in its responses.

If you have used Anthropic Claude, the pattern will look familiar because it also uses SKILL.md files with YAML headers and Markdown bodies. The main difference is that Codex stops at a simple ## Skills section, while Claude can decide to read a skill body into context or run scripts associated with a skill in some setups.

If you maintain other tools that use SKILL.md files, such as Superpowers style skill libraries, you can reuse their documents by placing compatible SKILL.md files under ~/.codex/skills. Codex will pick up any file that has a valid name and description, and those skills will show up in the ## Skills section. The orchestration and automation still live in the external tool.

8. Clarifying a few common questions

Two questions have already come up in early discussions about Skills. It helps to answer them explicitly.

How are Skills different from more AGENTS.md files?

Skills and AGENTS.md do related but different jobs. AGENTS.md is tied to a repo, can be a long narrative project doc, and is pasted into the prompt in full (up to the byte limit). A Skill is a small named handle under ~/.codex/skills with a one line description and a body you maintain for humans, and it only appears as a compact entry in a global ## Skills block (name: description (file: path)). In practice, AGENTS.md is for “how to work in this codebase”, while Skills cover habits and playbooks you want available in every codebase, like the d2-diagrams workflow.

Are Skills just tools with a different name?

Skills are not tools. Tools are executable capabilities such as the shell, MCP servers, HTTP clients, and test runners that Codex can call to do work. Skills are documentation: a SKILL.md that explains a workflow and when to use it, and Codex only sees the short entry in the ## Skills section, not an API surface. The relationship is one way: a Skill can describe how to use tools (“run d2 like this”, “call our deploy script like that”), but Skills never execute code on their own.

9. Limits and failure modes

Skills ship as an experimental feature.

They are controlled by a skills feature flag that is off by default.
Behavior and format may change, and you should expect breaking changes over time.

There are a few practical limits worth naming explicitly.

Global scope, no project scoped skills yet

Today, Skills are global to ~/.codex/skills. There is no first class project scoped skills feature.
Mitigation: keep repo specific rules and architecture notes in AGENTS.md. Reserve Skills for workflows you genuinely reuse across repos.

Broken SKILL.md files are ignored

Invalid YAML or fields that violate the constraints show up in a startup message, and invalid skills are ignored.
Mitigation: Fix or remove broken SKILL.md files.

Skill bloat and token cost

Each skill entry adds a line to the ## Skills block. Hundreds of long descriptions will make that section noisy and eat into the context window.
Mitigation: keep name and description short and concrete. Periodically prune unused skills and move “maybe useful someday” notes back into personal docs.

Path visibility

The model sees the absolute path to each skill file. This is intentional so it can refer to specific files, but it may reveal parts of your filesystem layout.
Mitigation: avoid putting sensitive directory names into the skills path, and be mindful of screenshots or logs that include the ## Skills block.

From a safety perspective, skills are conservative.

Each skill contributes only a couple of short lines to the prompt.
The body of the skill never reaches the model unless you or a tool explicitly paste it into the conversation.
Skills do not execute code on their own, and they do not change what commands Codex can run.
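
To keep an eye on the skill bloat limit mentioned above, you can estimate how much a ## Skills block costs. This is a back-of-the-envelope sketch: the four characters per token ratio is a common rough heuristic and an assumption here, not how Codex or any tokenizer actually counts.

```python
import math


def estimate_block_cost(entries: list[str], chars_per_token: float = 4.0) -> tuple[int, int]:
    """Return (characters, rough token estimate) for a list of ## Skills bullets."""
    chars = sum(len(entry) + 1 for entry in entries)  # +1 for the newline after each bullet
    return chars, math.ceil(chars / chars_per_token)


entries = [
    "- d2-diagrams: How to edit and regenerate D2 diagrams for documentation. (file: /home/me/.codex/skills/d2-diagrams/SKILL.md)",
    "- release-checklist: Steps for cutting a release. (file: /home/me/.codex/skills/release-checklist/SKILL.md)",
]
chars, tokens = estimate_block_cost(entries)
print(f"{len(entries)} skills, {chars} characters, roughly {tokens} tokens")
```

A few short entries are negligible; the number only becomes interesting once a library grows into dozens of skills with long descriptions.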

10. Variants and team setups

The simplest way to start is as a single user.

Pick two or three workflows you already reuse in three or more repos.
Create a SKILL.md for each under ~/.codex/skills.
Keep name and description short and concrete so the ## Skills section stays readable.

For teams, it helps to treat Skills like code.

Keep shared skills in a small Git repository that teammates clone under ~/.codex/skills/shared.
Review changes to shared SKILL.md files the same way you review application code.
Encourage each person to maintain their own personal skills alongside the shared library so they can experiment without polluting the team set.

If you already have a source of SKILL.md files, such as a Superpowers skills library, you can copy compatible skills into ~/.codex/skills to seed your library. The Skills feature then gives you a consistent list of skills across tools without forcing you to adopt a single orchestration layer.

Over time, a small curated set of well named skills tends to work better than a giant catalog. It keeps the ## Skills section useful and keeps you honest about which workflows are actually worth codifying.

Chrome MCP with Codex: Drive a Real Browser from Your Agent

Marco Hefti — Tue, 02 Dec 2025 06:01:48 GMT

TL;DR

Chrome DevTools MCP lets Codex drive a full Chrome browser through MCP, so you can test real flows, trace performance, and debug network issues from inside a Codex session.

MCP is how Codex talks to tools - the Model Context Protocol defines a standard way for agents to call local and remote tools over a simple JSON protocol.
Chrome DevTools MCP exposes a live browser - Codex can open pages, click, fill forms, wait for elements, record traces, inspect network calls, and inspect console output through a single MCP server.
Codex setup is simple but version sensitive - you need a working Codex install, a current Chrome, and Node 22.12.0 or newer for the Codex process itself; otherwise the MCP handshake fails.
Start with an isolated, almost zero setup config - run Chrome DevTools MCP in an isolated mode using npx chrome-devtools-mcp@latest in your Codex config.toml so the server auto launches Chrome and cleans up its profile after use.
Use a custom Chrome session for environment sensitive flows - switch to an attached mode when debugging issues tied to a specific user, cookie set, or browser environment. Attach Chrome MCP to a Chrome instance you started with a dedicated user data directory and remote debugging port.

1. Context / Problem

Most Codex usage stays inside code and the terminal. That works for purely logical changes, but sometimes it would be useful to have Codex play through a real browser flow. Without that, you end up copy pasting logs, URLs, and error messages between Chrome and Codex instead of letting the agent reproduce the issue itself.

Chrome DevTools MCP closes that gap. It gives Codex a full Chrome DevTools surface through MCP so the agent can open real pages, click through UI flows, wait on selectors, inspect network calls, run JavaScript, and record performance traces.

The useful part is that you do not need a custom Playwright test harness or a separate automation stack. With one MCP server and a small Codex config change, you get a browser session that Codex can control like any other tool.

In this guide I will focus on three things:

A quick mental model for MCP and what Chrome DevTools MCP adds on top.
A baseline configuration that auto starts Chrome from Codex and that you can keep in your config.toml.
An advanced profile that attaches to a long lived Chrome session for environment sensitive flows (based on a real issue I had), plus the failure modes you are likely to hit.

2. Mental model

2.1 What MCP is in practice

In practice, MCP is a protocol that lets an agent talk to tools as if they were local commands. The agent speaks JSON over standard input and output, and the MCP client handles communication through requests and responses. Codex uses this to talk to local tools such as file systems, terminals, and browser controllers without baking those integrations into the model itself.

For you as a developer, the important part is that each MCP server shows up in Codex as a set of tools. Each tool has a name and parameters. Codex decides which tool to call and in what order, but the MCP client handles the transport and execution.

2.2 What Chrome DevTools MCP adds

Chrome DevTools MCP is an MCP server that wraps Chrome DevTools and Puppeteer. It starts or attaches to a Chrome instance and exposes a set of tools that Codex can call.

At a high level you get:

Input tools such as click and fill_form so Codex can interact with the page like a user. In practice this means you can ask Codex to log in, type into fields, submit forms, and trigger buttons.
Navigation tools such as navigate_page and wait_for so the agent can move through flows and capture state. In practice this covers opening URLs and stepping through wizards.
Performance tools such as performance_start_trace and performance_analyze_insight so Codex can record and interpret traces. In practice you can have Codex run a trace for a route and tell you which scripts or requests take the most time.
Debugging tools such as evaluate_script, list_console_messages, and list_network_requests so the agent can see console output and network behavior. In practice this is how you have Codex pull console errors, inspect failing requests, or read values directly from the page.

Under the hood it uses Puppeteer and Chrome DevTools, but from a Codex session it is just another MCP server with tools you can call.

There are two important constraints:

The Chrome profile that the MCP server uses is fully visible to Codex. Cookies, local storage, and page content are accessible through the tools.
The MCP server must be able to either start Chrome itself inside your environment or attach to a Chrome instance that you start with remote debugging enabled.

2.3 How this changes Codex workflows

With Chrome DevTools MCP wired into Codex you can:

Ask Codex to open a URL, reproduce a bug, and show you the exact steps and network calls involved.
Have Codex run a full OAuth login flow against a provider, capture the redirect URL, and explain why a callback fails in a specific environment.
Record performance traces for a specific route, compare them between builds, and have Codex highlight slow scripts or layout shifts.
Hand Codex a broken form flow, let it click through and then ask it to propose fixes in the codebase.

It feels like a no setup browser automation rig that you can call from the same place you already run your agentic coding tasks.

3. Concrete example

3.1 Baseline: an isolated Chrome MCP session

Start from a machine where you already have:

A working Codex CLI installation.
Google Chrome installed in a current stable version.
Node 22.12.0 or newer for the Codex process itself.

If you have not set up Codex yet, follow the separate setup guide “Codex CLI in Practice: Install It Once and Trust It” so you have codex installed, logged in, and pointed at a config directory you trust.

In your Codex config.toml, add a Chrome DevTools MCP server entry:

[mcp_servers.chrome-devtools]
command = "npx"
args = [
  "-y",
  "chrome-devtools-mcp@latest",
  "--isolated=true",
  "--headless=false",
  "--chromeArg=--disable-extensions",
  "--chromeArg=--no-first-run",
  "--chromeArg=--disable-sync",
]
startup_timeout_sec = 30.0

This configuration does a few things:

Uses npx to run chrome-devtools-mcp@latest so you always pick up the newest Chrome MCP server.
Runs Chrome in an isolated user data directory that is created per run and cleaned up when the browser closes.
Disables extensions, first run prompts, and sync to keep the session predictable.
Starts Chrome in non headless mode so you can see the browser windows as Codex drives them.

Once this entry is in place, restart Codex so it reloads the MCP configuration. Then run a quick test from inside a Codex session:

Prompt: Use chrome-devtools to check the performance of https://developers.chrome.com. Tell me what stands out in the trace, list any console errors, and summarize the slowest network requests.

Codex should:

Start the Chrome DevTools MCP server through npx.
Launch Chrome with a fresh profile.
Open the page, record a performance trace, and inspect console and network output.
Return a summary of the performance insights plus any errors and slow requests it found.

If you see the browser open and some activity in the window, the baseline functionality is working.

3.2 A real user flow with console and network checks

Once the basics work, treat Chrome MCP like a shared QA browser that Codex can drive. For example, you can ask Codex to run through a checkout flow:

Prompt: 
Open localhost in chrome-devtools. 
Log in using the test account credentials from AGENTS.md, add any product to the cart, run through checkout until the final confirmation page, then:
- List any console errors during the flow.
- Export or summarize the key network requests for the checkout API calls.
- Tell me where the slowest step is and how it could be optimized.

Behind the scenes, Codex will call tools like navigate_page, wait_for, fill_form, click, and list_network_requests. You do not need to know the tool names in advance, but understanding that they exist makes it easier to understand what Codex is doing.

3.3 Advanced: attach to a long lived Chrome session

For environment sensitive flows, I use a different pattern. Instead of letting Chrome MCP start its own browser, I start Chrome myself with a dedicated profile and remote debugging enabled, then tell Chrome MCP to attach to that session.

On macOS that looks like:

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
  --remote-debugging-port=9222 \
  --user-data-dir=/tmp/atlas-mcp \
  --disable-features=AutomationControlled \
  --disable-blink-features=AutomationControlled \
  --no-first-run \
  --no-default-browser-check

This starts a fresh Chrome profile in /tmp/atlas-mcp, opens a window, and exposes a remote debugging port on 127.0.0.1:9222. I log into whatever providers or test apps I need from that window first so the cookies and session state live in that profile. In one case this was an OAuth flow that only failed when a specific bot detection system saw a headless browser.

Then I point Codex at that running browser by changing the MCP config to use --browser-url:

[mcp_servers.chrome-devtools]
command = "npx"
args = [
  "-y",
  "chrome-devtools-mcp@latest",
  "--browser-url=http://127.0.0.1:9222",
  "--headless=false",
  "--acceptInsecureCerts=true",
]
startup_timeout_sec = 30.0

Now Chrome MCP does not launch Chrome at all. It attaches to the already running browser that has my cookies, logins, and any manual steps I completed. From Codex this still looks like the same chrome-devtools toolbox, but the agent is operating on a real browser session that has already gone through whatever checks the site applies.

The trade off is that the remote debugging port gives any process on the machine control over that browser. I keep this pattern limited to a dedicated profile with no sensitive data and I close that Chrome instance as soon as I am done.
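
When attached mode does not connect, a quick way to see whether Chrome is listening is to ask its debugging endpoint for version info. The sketch below relies on the `/json/version` HTTP endpoint that Chrome serves on the remote debugging port; the helper names and the sample payload values are made up for illustration, and `probe` only works while a Chrome instance with remote debugging is running.

```python
import json
import urllib.request


def version_url(browser_url: str) -> str:
    """Build the version endpoint URL for a given --browser-url value."""
    return browser_url.rstrip("/") + "/json/version"


def summarize_version(payload: str) -> str:
    """Pull the browser name and websocket URL out of a /json/version response."""
    info = json.loads(payload)
    browser = info.get("Browser", "unknown browser")
    websocket = info.get("webSocketDebuggerUrl", "no websocket URL")
    return f"{browser} at {websocket}"


def probe(browser_url: str, timeout_s: float = 2.0) -> str:
    """Query a running Chrome; raises OSError if nothing is listening on the port."""
    with urllib.request.urlopen(version_url(browser_url), timeout=timeout_s) as response:
        return summarize_version(response.read().decode("utf-8"))


# Offline illustration with a made-up payload in the shape Chrome returns.
sample = '{"Browser": "Chrome/0.0.0.0", "webSocketDebuggerUrl": "ws://127.0.0.1:9222/devtools/browser/example"}'
summary = summarize_version(sample)
print(summary)
```

With Chrome started as shown earlier, calling `probe("http://127.0.0.1:9222")` should return a similar summary; an error instead means the port is not open or another profile grabbed it.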

4. Solution / Playbook

This is the pattern I use when setting up Chrome DevTools MCP in Codex.

Confirm runtime and Codex basics
Make sure the host has a current Node runtime and a working Codex install. Run node --version and aim for at least 22.12.0 for the Codex process itself, then codex --version and a simple codex session in a real repo. If the baseline Codex CLI is unstable, fix that before adding Chrome MCP.

Add a simple Chrome MCP entry in config (isolated mode)
Start with a single mcp_servers.chrome-devtools block using npx chrome-devtools-mcp@latest, --isolated=true, and a generous startup_timeout_sec as in section 3.1. This gets you a disposable browser profile and avoids tangling with your normal Chrome profile. Keep --headless=false at first so you can see what Codex is doing.

Run a short smoke test prompt
Inside a Codex session, ask the agent to check the performance of a simple public site and report console errors and slow network requests. If Chrome does not open or the response hangs, check the Node version, the error output in the Codex pane, and whether npx can fetch the MCP package.

Tighten headless and isolation settings for day to day use
Once you trust the wiring, you can flip --headless=true to keep the browser off your desktop. I still keep --isolated=true by default so each run starts with a clean profile and you do not leak cookies or local storage between sessions. For long running investigations I sometimes switch back to --headless=false so I can watch Codex work.

Introduce a second config for environment sensitive flows (attached mode)
When you need Codex to walk through flows that depend on specific cookies, users, or browser behavior, start a dedicated Chrome instance yourself with a custom user data directory and remote debugging enabled as in section 3.3. If the site also blocks headless automation, this is where you pass flags like --disable-features=AutomationControlled. Log in manually, then change the Codex MCP config to use --browser-url=http://127.0.0.1:9222 so Chrome MCP attaches to that session instead of starting its own.

Lean on performance and debugging tools explicitly
When debugging something non trivial, ask Codex to record a trace and analyze it rather than just clicking around. Have it call the performance tools, capture network requests, and pull console logs. In practice this makes the difference between a vague bug report and a concrete list of issues or failing endpoints that map straight back to code.

Add a Chrome MCP smoke test to bigger Codex tasks
For larger changes, I like to end the Codex task with a short chrome-devtools run over the main flow. I ask Codex to play through the flow and then tell me whether the console and network tabs look healthy. This catches obvious breakage before I spend time on a manual click through.

Codify your defaults in version control
I keep a checked in config.toml template in my dotfiles so every machine has the same Chrome MCP defaults. That includes which Chrome flags to pass, how long to wait for startup, and whether MCP should run headless by default. It turns Chrome MCP from an experiment into a standard part of the Codex environment.

5. Failure modes & gotchas

Chrome MCP adds its own set of failure modes on top of whatever Codex already has. These are the ones I see most often.

5.1 Runtime and environment

Node version too old for MCP
If Codex runs on an older Node version, the Chrome MCP handshake can fail with an error like:

  ⚠ MCP client for chrome-devtools failed to start: MCP startup failed: handshaking with MCP server failed: connection closed: initialize response

Fix this by upgrading the Node runtime that Codex uses to at least 22.12.0 (see section 3.1) and restarting Codex. If node --version still shows an older release, check your shell PATH and any version managers that might be pinning Node.

Chrome cannot start from inside a sandboxed environment
Some MCP clients and shells run MCP servers in a sandbox that does not let Chrome create its own sandbox processes. When that happens, chrome-devtools-mcp may fail to launch Chrome at all. The usual workaround is to start Chrome yourself with --remote-debugging-port and a custom user data directory, then point Chrome MCP at that running instance with --browser-url. This bypasses the sandbox constraint because Chrome is running outside the MCP sandbox.
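
For the Node version failure described earlier in this section, the comparison is easy to get wrong by eye (22.9 is older than 22.12). A small sketch that compares the output of node --version against the 22.12.0 minimum; the helper names are hypothetical, and prerelease suffixes are simply ignored.

```python
MINIMUM = (22, 12, 0)  # minimum Node version named in this guide


def parse_version(text: str) -> tuple[int, int, int]:
    """Turn `v22.12.0` (the format `node --version` prints) into (22, 12, 0)."""
    core = text.strip().lstrip("v").split("-")[0]  # drop any prerelease suffix
    major, minor, patch = core.split(".")[:3]
    return int(major), int(minor), int(patch)


def node_is_new_enough(version_output: str) -> bool:
    """Compare numerically, so 22.9.0 is correctly treated as older than 22.12.0."""
    return parse_version(version_output) >= MINIMUM


for candidate in ("v20.18.1", "v22.9.0", "v22.12.0", "v24.1.0"):
    status = "ok" if node_is_new_enough(candidate) else "too old"
    print(f"{candidate}: {status}")
```

To check the real runtime, pass in the text printed by node --version from the same shell that launches Codex, since a version manager may give a different Node to other shells.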

5.2 Remote debugging and security

Remote debugging opens a powerful control surface
When you start Chrome with --remote-debugging-port=9222, any process on your machine can connect and control that browser. Only use this mode with a dedicated user data directory and avoid logging into sensitive sites from that profile. Closing that Chrome instance closes the control surface.

5.3 Site behavior and automation detection

Headless automation blocked by bot detection or captchas
Some sites refuse to serve real content to headless browsers or detect automation. In those cases I start Chrome manually with flags like --disable-features=AutomationControlled and --disable-blink-features=AutomationControlled, plus a dedicated --user-data-dir. Once I have logged in and solved any captchas by hand, I let Codex attach via --browser-url so it inherits the specific environment and authenticated session.

5.4 Startup stability

Startup flakiness and slow environments
On slower laptops or when Chrome updates itself in the background, the MCP server can hit a timeout before Chrome is ready. Increasing startup_timeout_sec in the MCP config and passing a --logFile with DEBUG=* in the environment makes it much easier to see what Chrome DevTools MCP is doing before it fails.
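As a rough illustration, the helper below prints a config.toml fragment that combines a longer startup timeout with debug logging. The server name chrome-devtools matches the error message shown earlier; the function name, the 30 second timeout, and the log path are assumptions to adapt, not recommended values.

```shell
#!/usr/bin/env bash
# mcp_debug_toml TIMEOUT_SEC LOG_PATH
# Hypothetical helper: prints a config.toml fragment for review.
# It does not modify any file; merge the output by hand.
mcp_debug_toml() {
  cat <<EOF
[mcp_servers.chrome-devtools]
command = "npx"
args = ["-y", "chrome-devtools-mcp@latest", "--logFile=$2"]
env = { DEBUG = "*" }
startup_timeout_sec = $1
EOF
}

# Assumed values: a 30 second timeout and a log file under /tmp.
mcp_debug_toml 30 /tmp/chrome-devtools-mcp.log
```

Once startup is stable again, dropping the env line keeps the log file from growing with verbose output.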

5.5 Profile data and contamination

Unexpected data or cookies from reused profiles
If you let Chrome MCP use a long-lived profile, any cookies, extensions, or experimental flags in that profile affect Codex runs. That can be useful for reproducing a user-specific bug, but it also means you need to be deliberate about which profile Codex uses. I default to isolated profiles and only attach to shared ones when I have a specific reason.

In my own projects I treat Chrome MCP as a power tool for debugging and flow exploration, but not as a replacement for browser tests. I still keep my canonical checks in Playwright or Cypress and use Codex plus Chrome MCP when I need fast, targeted investigations and console or network traces that would take longer to script by hand.

5.6 How I actually run this

On my laptops I keep Chrome MCP wired into Codex but fairly conservative. Day to day I use the isolated config with --headless=true so Codex can open and close its own browsers without touching my regular Chrome profile. When I am debugging something subtle, I flip headless to false so I can watch Codex work through the flow.

For environment-sensitive issues, like an OAuth flow that only fails for a specific cookie set or login state, I start a dedicated Chrome instance with a throwaway user data directory and --remote-debugging-port. I log in once by hand so the profile has the right cookies and session, and then attach Chrome MCP to that session until I finish the investigation.

6. Variants & extensions

Once the basics work, there are a few ways to adapt this pattern.

Switch Chrome channels for feature testing
Chrome DevTools MCP supports a --channel option so you can use canary, beta, or dev builds of Chrome. This is useful when you want Codex to verify behavior that depends on upcoming browser features without changing your daily browser.

Use --viewport and emulation for mobile flows
You can set an initial viewport like 1280x720 with the --viewport option and combine it with Chrome DevTools emulation tools to approximate mobile devices. I use this when asking Codex to test responsive layouts or mobile-only flows before writing full end-to-end tests.
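The channel and viewport options combine on one command line. The sketch below prints such a command for a given channel and viewport. mcp_variant_cmd is a hypothetical helper, and the 390x844 phone-sized viewport is an assumed example value.

```shell
#!/usr/bin/env bash
# mcp_variant_cmd CHANNEL VIEWPORT
# Hypothetical helper: validates the channel name, then prints the command.
mcp_variant_cmd() {
  case "$1" in
    stable|canary|beta|dev) ;;
    *) echo "unknown channel: $1" >&2; return 1 ;;
  esac
  printf 'npx -y chrome-devtools-mcp@latest --channel=%s --viewport=%s\n' "$1" "$2"
}

mcp_variant_cmd canary 1280x720
mcp_variant_cmd beta 390x844   # assumed phone-sized viewport
```

The viewport only sets the initial window size, so touch and user-agent behavior still need the emulation tools.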

Attach from containers or remote hosts
In containerized or remote setups where starting Chrome from inside the container is painful, running Chrome on the host with remote debugging and pointing the containerized MCP server at it via --browser-url is often simpler. You still need to handle any port forwarding, but the pattern is the same.

Codex CLI in Practice: Install Once and Trust It

Marco Hefti — Mon, 01 Dec 2025 09:14:11 GMT

TL;DR

Installing via npm keeps you on the latest version and makes Codex available just like any other CLI tool.

Use npm for faster releases - npm i -g @openai/codex tracks the newest CLI builds hours (sometimes days) before Homebrew or OS stores update.
Check the fundamentals first - Codex needs Node 16+ (I recommend at least 22.12.x so MCP works) and either a ChatGPT Plus/Pro/Business/Edu/Enterprise login or an API key.
Set Codex defaults in config - Use Codex's config.toml plus profiles/projects to define sandbox mode, approval policy, and model/provider defaults.
Confirm Codex runs end-to-end - After installation, run codex --version, log in with codex login, then start codex in a real repo and make sure you land in an interactive shell that can see your files.

1. Context / Problem

Codex now spans the browser, GitHub, IDEs, and the terminal, but the CLI is how all of those surfaces talk to it. When the initial setup feels unreliable, engineers quickly decide not to use Codex.

Right now, the @openai/codex npm package tends to ship releases a few days before Homebrew, OS-specific installers, or IDE extensions. The fastest way to get a stable CLI into your workflow is to treat it like any other global Node tool.

npm i -g @openai/codex

CLI vs Cloud (quick clarification): Codex (the CLI) is an open-source wrapper that runs on your machine and talks directly to OpenAI models. Codex Cloud is a hosted service that connects to your repositories, spins up sandbox VMs, boots the same CLI inside them, and returns the output to you. When this article says “install Codex,” it refers to the local CLI. The cloud product simply runs that CLI on managed infrastructure.

You can install Codex on macOS, Linux, and Windows 11 (via WSL2). Running Codex natively on Windows is possible, but it relies on an experimental sandbox that cannot fully protect directories writable by everyone. The docs recommend using WSL2 when you need the hardened Linux sandbox. Also note that native Windows and WSL2 have separate home directories by default, so each environment keeps its own Codex state unless you deliberately point both at the same CODEX_HOME.

For each Codex session, your configuration file is the source of truth. Codex looks for it under CODEX_HOME. If that environment variable is unset, it defaults to ~/.codex on macOS, Linux, and WSL2, and to %USERPROFILE%\.codex on native Windows. Every sandbox mode, approval policy, profile, and trusted project entry lives under that directory, and Codex resumes old sessions and applies experimental flags based on whatever defaults it finds there.
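The lookup rule is easy to mirror in a shell when you want to know which file a session will read. This is a minimal sketch for macOS, Linux, and WSL2 only. codex_home and codex_config_path are hypothetical helper names, not Codex commands.

```shell
#!/usr/bin/env bash
# codex_home: print the directory Codex state is expected to live in,
# using CODEX_HOME when set and ~/.codex otherwise (Unix-style default only).
codex_home() {
  printf '%s\n' "${CODEX_HOME:-$HOME/.codex}"
}

# codex_config_path: print the config file path under that directory.
codex_config_path() {
  printf '%s/config.toml\n' "$(codex_home)"
}

cfg="$(codex_config_path)"
if [ -f "$cfg" ]; then
  echo "config found: $cfg"
else
  echo "no config yet at: $cfg"
fi
```

Running this in both native Windows tooling and WSL2 makes the separate home directories visible at a glance.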

I keep that directory under my dotfiles repository so I can diff changes and keep environments aligned. This walkthrough focuses on getting those defaults right before you use it for real work.

2. What this guide covers

This guide is a practical setup pattern you can reuse across machines:

Why Codex CLI needs deliberate setup – when installs feel buggy, most engineers quietly stop using the tool.
Five-layer mental model – runtime, CLI surface, identity, configuration, and sessions/history as the core layers you harden.
Concrete example (atlas-inventory) – a full walkthrough of installing via npm, authenticating once, wiring config.toml, and proving codex works end-to-end in a real TypeScript service.
Reusable playbook – the small set of decisions I repeat on every host before trusting Codex in a repo.
Failure modes & gotchas – the ways Codex setup most often goes wrong and how to map each issue back to one of the five layers.
Variants & extensions – how to adapt the pattern for locked-down laptops, CI, remote dev containers, and WSL2.

3. Mental model

Think of Codex setup as five layers you harden from the outside in:

Runtime layer

Node >=16, npm (or bun), and your shell PATH. In practice we target Node 22.12.0+ so MCP servers and future tooling work without hacks. If Node is out of date or the global npm prefix is not on PATH, the CLI will never launch. Lock this down before touching Codex itself.

CLI surface

Codex ships a multi-tool CLI with interactive (codex), scripted (codex exec), resume (codex resume), patch (codex apply), sandbox debugging, MCP, and cloud helpers. Decide up front which surfaces belong in your day‑to‑day workflow so you can document the minimal flag set you need. I keep a short list of the two or three commands I rely on and treat everything else as optional.

Identity layer

Codex can authenticate either with your ChatGPT plan (Plus, Pro, Business, Edu, or Enterprise) or an API key tied to an Org/Project. Pick one, store it via codex login, and audit who owns the credentials. If you plan to use OSS providers, add the relevant --oss or --local-provider defaults now so nobody improvises later.

Configuration layer

CODEX_HOME/config.toml stores global defaults, profiles, project trust levels, and feature toggles. By default CODEX_HOME resolves to ~/.codex on macOS/Linux/WSL2 and %USERPROFILE%\.codex on native Windows, but you can override it for shared or network locations. When you pass --profile, --sandbox, or --ask-for-approval, you are just overriding this file. Treat it like infrastructure.

Session & history layer

Codex writes rollouts and history under CODEX_HOME/sessions and CODEX_HOME/history.jsonl. codex resume reads from these paths, so it only works when the directory is writable. Include it in backups the way you’d treat any other tooling state. I once lost a week of sessions to a cleanup script, so now I tag that path as “never delete.”
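One way to act on this is a small backup step that also checks writability. The sketch below assumes bash and tar. backup_codex_state is a hypothetical helper, and the archive destination is whatever path you pass in.

```shell
#!/usr/bin/env bash
# backup_codex_state DEST_TGZ
# Hypothetical helper: archives sessions/ and history.jsonl from the Codex home.
backup_codex_state() {
  local home="${CODEX_HOME:-$HOME/.codex}" dest="$1"
  local -a items=()

  if [ ! -d "$home" ]; then
    echo "no Codex home at $home" >&2
    return 1
  fi
  if [ ! -w "$home" ]; then
    echo "$home is not writable; codex resume may find nothing" >&2
    return 1
  fi

  # Only archive what actually exists yet.
  if [ -d "$home/sessions" ]; then items+=("sessions"); fi
  if [ -f "$home/history.jsonl" ]; then items+=("history.jsonl"); fi
  if [ "${#items[@]}" -eq 0 ]; then
    echo "nothing to back up yet under $home" >&2
    return 1
  fi

  tar -czf "$dest" -C "$home" "${items[@]}"
}
```

A call such as backup_codex_state "$HOME/codex-state.tgz" fits in a dotfiles sync script or a cron job, ahead of any cleanup scripts.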

4. Concrete example

Imagine onboarding Codex to a TypeScript service called atlas-inventory that lives in ~/Projects/atlas-inventory (on Windows, think C:\Users\you\Projects\atlas-inventory and substitute whatever path matches your setup). The goal is to let engineers run Codex in workspace-write mode with on-request approvals so the agent can edit files but still pause before risky commands.

First, bring the CLI onto the box and prove it runs end-to-end:

node --version                      # expect at least 16.x (I recommend at least 22.12.x)
npm i -g @openai/codex              # install the latest CLI from npm
codex --version                     # confirm the command works
codex login                         # or run `codex login --with-api-key` to log in with an API key
codex login status                  # verify the stored credentials
codex                               # start an interactive session

Next, set up the config in CODEX_HOME/config.toml (by default ~/.codex/config.toml on macOS/Linux/WSL2 or %USERPROFILE%\.codex\config.toml on native Windows):

model = "gpt-5.1-codex"
sandbox_mode = "workspace-write"
approval_policy = "on-request"

[projects."/Users/marco/Projects/atlas-inventory"]
trust_level = "trusted"

[profiles."oss-lab"]
model = "llama-guard"
oss_provider = "lmstudio"
sandbox_mode = "read-only"

The top-level defaults apply workspace-write + on-request to every session, so even if you forget to pass flags you stay inside the specified permissions. The [projects] entry marks that absolute path as trusted. Per-project entries only honor trust_level, so sandbox and approval settings belong at the top level or in a profile. Codex requires absolute paths here (~ will not resolve), so capture the real path you use. The optional profile demonstrates how to carve out a different set of defaults (for example, an OSS lab that must remain read-only), and you can use it when needed via codex --profile oss-lab.

If you want to see how all of this fits together in a real config.toml, here is the template I use:

model = "gpt-5.1-codex-max"            # Default model for a new session unless a profile/CLI flag overrides it
model_reasoning_effort = "medium"      # Ask the model for a moderate amount of reasoning

approval_policy = "on-request"         # Let Codex auto-run safe commands but prompt for risky ones
sandbox_mode    = "read-only"          # Keep sessions read-only until I explicitly opt into workspace-write
project_doc_max_bytes = 98304          # Default truncation clipped AGENTS.md at ~500 lines, so I raised the limit to cover roughly 3× that content
model_auto_compact_token_limit = 263840  # Start compacting when only ~3% of the 272k window is left (instead of the default ~10%) so sessions stay longer before summarizing
tool_output_token_limit = 25000        # Truncate long tool outputs so instructions and code stay visible

[sandbox_workspace_write]
network_access = true                  # When I switch to workspace-write, allow outbound network access

# macOS example path, replace with the absolute path to your repo
[projects."/Users/marcohefti/Projects/atlas-inventory"]
trust_level = "trusted"                # Skip the trust prompt and apply repo-specific overrides for this path
# Windows example (escape backslashes):
# [projects."C:\\Users\\you\\Projects\\atlas-inventory"]

[features]
web_search_request   = true            # Enable the built-in web_search tool
apply_patch_freeform = true            # Opt into the experimental free-form apply_patch tool so Codex can edit multiple files in one patch
rmcp_client          = true            # Enable the experimental Rust MCP client for MCP login/auth management

Codex loads this file from CODEX_HOME/config.toml regardless of OS, and the [projects] keys must match the absolute path format of that environment (/Users/... on macOS/Linux/WSL2, C:\Users\... on native Windows).

Per-project entries only use trust_level today, so sandbox/approval overrides still happen via CLI flags or profiles. Likewise, there is no global exec_timeout_ms field. Timeouts are specified per tool call, so I leave it out of the config and document desired timeouts in AGENTS.md or runbooks instead.
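The two numeric limits in the template are plain arithmetic, which helps when you adapt them to a different context window. The sketch below reproduces them with shell integer math. compact_limit is a hypothetical helper, and the 32 KiB default document budget is an assumption inferred from the "roughly 3×" comment rather than a documented constant.

```shell
#!/usr/bin/env bash
# compact_limit WINDOW_TOKENS RESERVE_PCT
# Hypothetical helper: token count at which compaction starts when
# RESERVE_PCT percent of the window should remain free.
compact_limit() {
  echo $(( $1 * (100 - $2) / 100 ))
}

compact_limit 272000 3     # prints 263840, the value used in the template
compact_limit 272000 10    # prints 244800, roughly where a ~10% reserve would land

# Assumed 32 KiB default document budget, tripled.
echo $(( 3 * 32 * 1024 ))  # prints 98304
```

A smaller reserve keeps more raw history in view but leaves less room for the turn that triggers compaction.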

Now an engineer can start a session with:

cd ~/Projects/atlas-inventory
codex

Do a single task, exit, and then resume to prove history works:

# Show the most recent sessions
codex resume 

# Jump straight back into the latest run
codex resume --last

5. Failure modes & gotchas

If something feels off, it is usually one of these:

Runtime out of sync. Anything older than Node 16.20 tends to fail outright, so run node --version before debugging anything else.
Global npm path not actually on PATH. If codex is “installed” but the shell cannot see it, check npm config get prefix and verify that prefix’s bin directory is exported (/opt/homebrew/bin, /usr/local/bin, %APPDATA%\npm, or similar). When codex: command not found shows up, it is almost always this.
Auth cache confusion. Mixing ChatGPT device auth and API-key auth on the same host can confuse the CLI cache. If you see odd login behavior, treat that as a signal to redo authentication. Run codex logout, remove CODEX_HOME/auth.json, and log back in cleanly.
History disappearing. codex resume will not find anything if CODEX_HOME lives on a non-writable path.
Multiple installs fighting each other. If the CLI keeps telling you “✨ Update available!” after npm i -g @openai/codex@latest, there is probably another Codex binary somewhere on PATH. Use which codex (or Get-Command codex in PowerShell) to find the real binary, remove the others, and re-run codex --version to confirm.
Network and proxy oddities. On locked-down networks, Codex might start fine and then silently fail API calls. Exporting HTTPS_PROXY / NO_PROXY and testing with a simple codex exec -- curl ... call is usually enough to confirm whether the network path is healthy.
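The PATH and duplicate-install checks from the list above can be scripted so they behave the same on every host. This is a minimal sketch assuming bash and a Unix-style npm layout where global binaries live in the prefix's bin directory. path_has_dir and list_on_path are hypothetical helper names.

```shell
#!/usr/bin/env bash
# path_has_dir DIR: succeeds when DIR is an exact entry in PATH.
path_has_dir() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# list_on_path NAME: print every executable file called NAME on PATH,
# in lookup order. More than one line means competing installs.
list_on_path() {
  local dir
  local IFS=:
  for dir in $PATH; do
    if [ -x "$dir/$1" ] && [ ! -d "$dir/$1" ]; then
      printf '%s\n' "$dir/$1"
    fi
  done
  return 0
}

if command -v npm >/dev/null 2>&1; then
  npm_bin="$(npm config get prefix)/bin"   # Unix layout assumed
  if path_has_dir "$npm_bin"; then
    echo "$npm_bin is on PATH"
  else
    echo "$npm_bin is missing from PATH"
  fi
fi

list_on_path codex
```

The first path printed by list_on_path is the one your shell actually runs, so compare it with the location npm just installed into.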

6. Variants & extensions

When npm is locked down. If you cannot install globals via npm (for example, on managed laptops), brew install --cask codex is a reasonable fallback as long as you accept that it may lag a couple of releases behind npm.
Remote and multi-host setups. For remote dev containers or WSL2, mounting a shared CODEX_HOME and running codex --sandbox read-only by default lets you reuse history and config without giving a new environment full write access until you trust it.
