Make the agent show its work
An autonomous agent will happily rewrite your codebase and hand you a thousand-line diff. The logs tell you what it touched — never why. This is the case for treating an agent's decisions, not just its actions, as something you can see.
- Where I'm at: three experiments are built (a wine recommender, an agent decision-surface prototype, and a wine-GPT trained from scratch on one consumer GPU).
- What's next: wire the decision-surface patterns into a real app, pressure-test them past 1000 records, and measure what an extra inference every few events actually costs.
You can see everything it did. And nothing about why.
Traces capture every action an agent takes — and none of the reasoning behind it: the forks taken, the alternatives rejected, the confidence in each move. This essay treats that missing layer as the thing to build.
AI agents now write production code on their own, but the tools we watch them with show what they did, not why. Traces and event streams capture actions; the reasoning — the forks taken, the alternatives rejected, the confidence behind each move — goes unrecorded.
This essay argues that legibility should be a first-class layer of the agent stack, and reports three experiments toward it: a recommender whose explanations mattered more than its rankings, a UI that collapses an agent's event log into legible decisions you can interrogate, and a small model trained to deliberate and defer. The through-line: the decision, not the output, is the unit of trust, and it can be surfaced without opening the black box.
[Panel: raw output. What it did, never why.]
Agents act. You can't see why.
I've managed AI coding agents for over a year, and the daily friction has barely changed: babysit the tool command by command, or come back to a massive diff with no idea what it did — or why.
The tooling got better at showing the run — Cursor, Claude Code, Codex all stream traces. But a trace answers the wrong question. You see the action at each step and still have no clue why. (That's the panel on the right: every span, in order, saying nothing.)
Before agents, a simpler legibility lesson. I built a recommender the stubborn way: fourteen designs, one shared evaluation, and a metric — not my taste — to pick the winner. Ask it for a wine, and it ranks the cellar.
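A minimal sketch of what "one shared evaluation, and a metric to pick the winner" could look like. The essay does not name the metric or the interface, so everything here is an assumption for illustration: precision@k stands in for the metric, and `Recommender`, `precision_at_k`, and `pick_winner` are hypothetical names.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical interface: a design maps a query to a ranked list of item ids.
Recommender = Callable[[str], List[str]]


def precision_at_k(ranked: List[str], relevant: Set[str], k: int = 3) -> float:
    """Fraction of the top-k results that appear in the relevant set."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for item in top if item in relevant) / len(top)


def pick_winner(
    designs: Dict[str, Recommender],
    eval_set: Dict[str, Set[str]],
    k: int = 3,
) -> Tuple[str, Dict[str, float]]:
    """Score every design on the same queries; return (best_name, all_scores)."""
    if not designs or not eval_set:
        raise ValueError("need at least one design and one evaluation query")
    scores: Dict[str, float] = {}
    for name, recommend in designs.items():
        per_query = [
            precision_at_k(recommend(query), relevant, k)
            for query, relevant in eval_set.items()
        ]
        scores[name] = sum(per_query) / len(per_query)
    best = max(scores, key=lambda name: scores[name])
    return best, scores
```

The point of the shape is that every design sees the same queries and the same scoring function, so the choice comes out of `scores` rather than out of anyone's taste.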
What stuck wasn't the leaderboard. It was that every result shipped with an explanation — why this wine, for this dish — and that the explanation was the part a person actually used. Scrub the panel from score-only to with-why: the number barely matters; the reason is the product.
A recommendation you can't interrogate is just an assertion with a number on it. Switch the query on the right — watch the reasoning and the matched attributes change, not just the rank.
- 1. Côtes du Rhône 2020 (score 0.89)
  why: Peppery and savory with firm tannins; stands up to the char without overpowering the meat.
  matched attributes: peppery, grenache/syrah, $24, medium-full
- 2. Mendoza Malbec (score 0.81)
  why: Plush dark fruit and soft tannins; an easy crowd-pleaser with grilled red meat.
  matched attributes: plush, malbec, $18, full
- 3. Chianti Classico (score 0.63)
  why: Bright acidity cuts the fat, but a lighter body gives up presence against a thick cut.
  matched attributes: high-acid, sangiovese, $22, medium
Outputs were one thing. But an agent doesn't hand you a ranking — it hands you actions. So picture the same wine pick, made by an agent: above you get its raw event log. Every action, in order, answering nothing — exactly the genre of tool we have today.
This is the experiment the essay is really about. I wanted to manage an agent the way you'd review a colleague — not by reading every keystroke, and not by waking up to a finished diff, but by seeing the handful of decisions that shaped the work.
So: every n events, prompt the agent to look back and decide whether a decision was made — and if so, collapse those records into a single DECISION. Flip the panel from raw log to decisions: eight events become one — “chose a Côtes du Rhône over Malbec or Chianti.”
Collapsed, the decision isn't just shorter — it's interrogable. Click it on the right and ask why this wine over that one; the answer draws on the full context up to that turn. The reasoning, finally first-class.
Each decision also carries a confidence — and an honest source tag: did the agent state this, or did I infer it? The decision, not the action, not the output, is the unit of trust. Nothing in the standard stack was recording it. (full write-up)
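The collapse step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the prototype's code: the log is cut into fixed, non-overlapping windows of `n` events, and `review` is a hypothetical stand-in for the extra inference that asks the agent whether a decision was made. The `Decision` fields mirror what the essay names (a summary, a confidence, a stated-or-inferred source tag); the field names themselves are invented here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Union


@dataclass
class Decision:
    summary: str        # e.g. "chose a Côtes du Rhône over Malbec or Chianti"
    confidence: float   # assumed to be on a 0.0 to 1.0 scale
    source: str         # "stated" by the agent, or "inferred" from the trace
    events: List[str] = field(default_factory=list)  # raw records it replaces


# Hypothetical reviewer: looks back over a window, returns a Decision or None.
Reviewer = Callable[[List[str]], Optional[Decision]]


def collapse(
    events: List[str], review: Reviewer, n: int = 8
) -> List[Union[str, Decision]]:
    """Replace each window of n events with one Decision when the reviewer
    finds one; otherwise keep that window's raw events unchanged."""
    if n <= 0:
        raise ValueError("n must be positive")
    surface: List[Union[str, Decision]] = []
    for start in range(0, len(events), n):
        window = events[start:start + n]
        decision = review(window)  # the one extra inference per window
        if decision is None:
            surface.extend(window)
        else:
            decision.events = window  # keep the raw records for interrogation
            surface.append(decision)
    return surface
```

Keeping the raw window on the decision is what leaves it open to later questions. It also makes the cost question concrete: under this windowing, a log of 1000 records with `n = 8` triggers 125 review calls.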
A surface can infer decisions from the outside. But what if the model just… told you? To see how small a trustworthy model could get, I trained a GPT-2 from scratch on a single consumer GPU — wine baked into pretraining.
It couldn't follow an instruction to save its life — which was the lesson: the gap between a small model and a frontier one is breadth, not depth. So the move is a small model with one added behavior, emitting the exact object the surface already renders: { options, chosen, confidence }.
Legibility, baked in at the source. The model deliberates out loud — here are the options it weighed for the steak, and the one it chose. Same decision object, now coming from the weights instead of inferred from a trace.
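For illustration, here is one way a surface might validate a decision object emitted by the model before rendering it. Only the three field names come from the essay. The rest is assumed: that the model serializes the object as JSON, that `confidence` lies in [0, 1], and that `chosen` must be one of `options`. `DecisionObject` and `parse_decision` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class DecisionObject:
    options: List[str]
    chosen: str
    confidence: float


def parse_decision(raw: str) -> DecisionObject:
    """Parse a model-emitted decision; raise ValueError if the shape is wrong."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("decision must be a JSON object")
    missing = {"options", "chosen", "confidence"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    options = data["options"]
    if not isinstance(options, list) or not options:
        raise ValueError("options must be a non-empty list")
    if data["chosen"] not in options:
        raise ValueError("chosen must be one of the options")
    try:
        confidence = float(data["confidence"])
    except (TypeError, ValueError) as exc:
        raise ValueError("confidence must be a number") from exc
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return DecisionObject([str(o) for o in options], str(data["chosen"]), confidence)


# Illustrative input only; the confidence value is made up for the example.
example = parse_decision(
    '{"options": ["Malbec", "Côtes du Rhône", "Chianti"],'
    ' "chosen": "Côtes du Rhône", "confidence": 0.8}'
)
```

Rejecting malformed objects at this boundary matters more for a small model, since a surface that renders whatever arrives would display a broken decision as if it were a real one.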
And it knows when not to answer. Ask it the harder questions on the right: low confidence escalates to a stronger model; an irreversible, high-stakes call hands off to you. Deferral is itself a decision — and the model states it.
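The deferral rule in that paragraph can be written down as a small routing function. This is a sketch with assumed details: the essay gives no threshold, so `0.6` is a placeholder, and checking the human hand-off before the confidence test is a design choice made here, not something the essay specifies. `Route` and `route` are hypothetical names.

```python
from enum import Enum


class Route(Enum):
    ANSWER = "answer"        # the small model answers on its own
    ESCALATE = "escalate"    # defer to a stronger model
    HAND_OFF = "hand_off"    # defer to the human


def route(
    confidence: float,
    irreversible: bool,
    high_stakes: bool,
    threshold: float = 0.6,  # placeholder value, not from the essay
) -> Route:
    """Decide who should make the call, and return that as an explicit route."""
    # An irreversible, high-stakes call goes to a person regardless of confidence.
    if irreversible and high_stakes:
        return Route.HAND_OFF
    # Low confidence escalates to a stronger model.
    if confidence < threshold:
        return Route.ESCALATE
    return Route.ANSWER
```

Returning a `Route` value rather than silently switching models keeps the deferral itself visible as a decision, which is the behavior the paragraph above describes.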
Options weighed:
- Malbec
- Côtes du Rhône
- Chianti
Chosen: Côtes du Rhône. Peppery and savory, it stands up to the char without overpowering the meat.
Agent legibility should be part of the stack.
Three experiments, three layers — the output, the action, the model itself — and the same answer each time: what you have to trust is the decision, and the decision is exactly what today's tools throw away. We log everything an agent does and record nothing of why it did it.
Legibility closes that gap, and it doesn't require opening the black box or standing up new infrastructure. You can infer decisions from the trace, render them so a human can supervise at a glance, and — eventually — train models that simply state them. That layer should be added.
Still in progress
This is a working argument, not a finished system. The decision surface is a prototype; next is to wire the patterns into a real app, test them past 1000 records, and grow the chat-per-record pane into a real exploration tool.
Which feels like the right place to end an essay about legible decisions: by showing you mine, and what they're still missing.