DRAFT · June 2026

An essay · agent decision legibility

Make the agent show its work

Let an AI agent loose on your codebase and it'll happily rewrite it and hand you a thousand-line diff. You can scroll back through every action it took and still have no idea what it was thinking. So this is my case for putting an agent's decisions on the screen, somewhere you can actually read them.

Harvey Zheng

where I'm at: a wine recommender, an agent decision-surface prototype on a coding task, and a GPT-2 trained from scratch on one consumer GPU (wine metadata in the pretraining data) as a feasibility probe.
what's next: Build a coding-explanation LoRA on a capable model that deliberates and defers, wire the decision-surface patterns into a real app, pressure-test them past 1000 records, and measure what an extra inference every few events actually costs.

the problem

You can see everything it did. And nothing about why.

what it did · fully logged

tool_call write_file

tool_result ok · 24ms

tool_call run_tests

reasoning …

tool_call grep_repo

tool_result 14 matches

why it did it · unrecorded

A trace logs every action an agent takes, but not really any of the reasoning behind it: the forks it took, the options it threw out, how sure it was at each step. This essay is mostly about building that missing layer.

Abstract

AI agents now write production code on their own, but the tools we watch them with mostly show you what they did, not why. Traces and event streams capture the actions; the reasoning behind them (the forks it took, the options it threw out, how sure it was) just doesn't get recorded.

What I want to argue is that legibility should be a first-class layer of the agent stack, and I'll walk through three experiments toward it: a recommender that showed the components behind each ranking, a UI that collapses an agent's event log into decisions you can actually interrogate, and a small model meant to deliberate and defer. The through-line ended up being that the decision is the real unit of trust, and you can surface it without ever opening the black box.

the demos · three experiments, one question

Each panel below is really after the same thing: the why behind the answer. First there's a wine recommender that shows you what drove each rank (the score, decomposed). Then a coding agent whose decisions you can read back and poke at. Then a small model that's meant to state its reasoning before it answers.

trace · 1 of 1 · 47 spans

span session#s0 → start task=refactor_auth —

span user_msg#u1 → "extract middleware" —

span reasoning#r1 → 612 chars —

span tool_call#a17 → read_file(login.ts) ok 24ms

span tool_result#t1 → 2.1kb —

span tool_call#a18 → grep(requireAuth) ok 31ms

span tool_result#t2 → 14 matches —

span reasoning#r2 → 793 chars —

span tool_call#a19 → write_file(auth.ts) ok 12ms

span tool_result#t3 → ok —

span tool_call#a20 → run_tests(auth) ok 1.9s

span tool_result#t4 → pass 18/18 —

span reasoning#r3 → 1041 chars —

span tool_call#a21 → edit_file(login.ts) ok 11ms

span tool_result#t5 → ok —

span reasoning#r4 → 488 chars —

span tool_call#a22 → grep(...) ok …

span tool_result#t6 → … —

span reasoning#r5 → … —

span tool_call#a23 → … …

raw output: what it did, never why

The problem

Agents act. You can't see why.

I've been using AI coding agents for a little over a year now, and honestly the friction hasn't really changed. You either babysit the tool command by command, or you let it go wild and come back to a massive diff with no idea what it did. Neither's a great flow.

The tooling's gotten better at showing the run. Cursor, Claude Code, and Codex all stream the events. But a trace answers the wrong question: you can see the action it took at each step, and still have no clue why it took it. (That's the trace panel: every span, in order, telling you nothing.)

I · recommend-arena

Before the agent stuff, a simpler lesson in legibility. I built a recommender kind of the stubborn way: fourteen different designs, one shared evaluation, and a metric (not my taste) picking the winner. You ask it for a wine, and it ranks the cellar.

What stuck with me wasn't the leaderboard. It was that every result came with its score decomposed (the filters it cleared, the lexical hit, the semantic match) instead of just a bare rank. No prose, nothing the model wrote: just the parts that added up to the number. Scrub the panel from score-only to with-why and the difference shows: a rank on its own is just an assertion, but the same rank with its parts is something you can actually check.

Show me the parts a score is made of and I can actually argue with it.

A recommendation you can't interrogate is kind of just a number you're told to trust. Switch the query on the right and you can watch the score components and the matched attributes shift along with the rank.

query

1. Côtes du Rhône 2020 · score 0.89
why: matched filters: price ≤ $30 · BM25 8.4 · vector cos 0.72 · RRF rerank
peppery · grenache/syrah · $24 · medium-full
2. Mendoza Malbec · score 0.81
why: matched filters: price ≤ $30 · BM25 6.1 · vector cos 0.69 · RRF rerank
plush · malbec · $18 · full
3. Chianti Classico · score 0.63
why: matched filters: price ≤ $30 · BM25 4.7 · vector cos 0.55 · RRF rerank
high-acid · sangiovese · $22 · medium
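
To make "the score, decomposed" concrete, here is a minimal sketch of a ranking that keeps its parts. It assumes the pipeline is a hard price filter followed by reciprocal rank fusion (RRF) over a BM25 list and a vector-similarity list, which is my reading of the "why" lines in the panel, not a description of the actual recommender. The constant `RRF_K = 60` and the names `Explained`, `ranks`, and `explain_ranking` are hypothetical. The BM25 and cosine values are copied from the panel; the fused numbers will not match the panel's displayed scores, since the essay does not say how those are scaled.

```python
from dataclasses import dataclass

RRF_K = 60  # assumption: a commonly used RRF constant; the essay gives none


@dataclass
class Explained:
    """One ranked result that carries the parts behind its score."""
    name: str
    price: int
    bm25: float
    cosine: float
    fused: float

    def why(self, max_price: int) -> str:
        return (f"matched filters: price ≤ ${max_price} · BM25 {self.bm25} · "
                f"vector cos {self.cosine} · RRF {self.fused:.4f}")


def ranks(scores: dict) -> dict:
    """Map each name to its 1-based rank, highest score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def explain_ranking(wines: list, max_price: int) -> list:
    kept = [w for w in wines if w["price"] <= max_price]  # hard filter first
    bm25_rank = ranks({w["name"]: w["bm25"] for w in kept})
    cos_rank = ranks({w["name"]: w["cosine"] for w in kept})
    results = []
    for w in kept:
        name = w["name"]
        # RRF: sum of 1 / (k + rank) over the two ranked lists.
        fused = 1 / (RRF_K + bm25_rank[name]) + 1 / (RRF_K + cos_rank[name])
        results.append(Explained(name, w["price"], w["bm25"], w["cosine"], fused))
    return sorted(results, key=lambda r: r.fused, reverse=True)


# Component values copied from the panel above.
cellar = [
    {"name": "Chianti Classico", "price": 22, "bm25": 4.7, "cosine": 0.55},
    {"name": "Côtes du Rhône 2020", "price": 24, "bm25": 8.4, "cosine": 0.72},
    {"name": "Mendoza Malbec", "price": 18, "bm25": 6.1, "cosine": 0.69},
]
ranking = explain_ranking(cellar, max_price=30)
for position, r in enumerate(ranking, start=1):
    print(f"{position}. {r.name}\n   why: {r.why(30)}")
```

The point of the shape is that `fused` is never returned alone: whoever reads the rank also gets the filter, the lexical score, and the semantic score it was built from.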

II · observe-ui

Outputs were one thing. But an agent doesn't hand you a ranking. It hands you actions. So picture a coding agent refactoring auth across your routes: up top you get its raw event log. Every action, in order, and none of it telling you why, which is pretty much the kind of tool we've got today.

This is the experiment I actually care about. I wanted to manage an agent the way you'd review a colleague: you read the handful of decisions that actually shaped the work, and you trust them with the keystrokes in between.

So here's the harness: every n events, prompt the agent to look back and figure out whether it actually made a decision, and if it did, collapse those records into a single DECISION. Flip the panel from raw log to decisions and eight events become one: “chose a shared middleware over per-route guards.”
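
The windowing loop described above can be sketched roughly like this. It is a sketch under assumptions, not the prototype's code: `collapse`, `Judge`, and `stub_judge` are hypothetical names, and the judge here is a keyword stub standing in for the real step, which would prompt the agent itself.

```python
from typing import Callable, Optional

Event = dict[str, str]
# A judge looks back over one window and returns a one-line decision, or None.
Judge = Callable[[list[Event]], Optional[str]]


def collapse(events: list[Event], n: int, judge: Judge) -> list[dict]:
    """Walk the log in windows of n events, making one judge call per window."""
    if n < 1:
        raise ValueError("n must be at least 1")
    surface = []
    for start in range(0, len(events), n):
        window = events[start:start + n]
        summary = judge(window)
        if summary is None:
            # Nothing decided here: keep only a count, as a quiet placeholder.
            surface.append({"type": "window", "events": len(window)})
        else:
            # Keep the raw events attached so the decision can be interrogated.
            surface.append({"type": "decision", "summary": summary, "covers": window})
    return surface


def stub_judge(window: list[Event]) -> Optional[str]:
    # Hypothetical stand-in for the model call: treat a file write as
    # evidence that a decision was made somewhere in this window.
    if any(event["kind"] == "write_file" for event in window):
        return "chose a shared middleware over per-route guards"
    return None


log = [{"kind": kind} for kind in [
    "user_msg", "reasoning", "read_file", "grep",
    "reasoning", "write_file", "run_tests", "reasoning",
    "grep", "reasoning",
]]
surface = collapse(log, n=8, judge=stub_judge)
for entry in surface:
    if entry["type"] == "decision":
        print(f"decision · {entry['summary']} · {len(entry['covers'])} events")
    else:
        print(f"window · {entry['events']} events · no decision")
```

In this shape the harness makes one extra inference per window, so a log of N events costs roughly N / n judge calls on top of the run itself. That is the overhead still to be measured.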

Collapsed, the decision isn't just shorter. It's something you can actually interrogate. Click it on the right and ask why this approach over that one, and the answer pulls from the full context up to that turn. The reasoning finally gets recorded alongside the action.

Each decision also carries a confidence, plus an honest source tag: did the agent actually state this, or did I infer it? This is the part I keep coming back to: the decision is the unit of trust here, and nothing in the standard stack was recording it.

You're not really betting on what the agent typed. You're betting on the call it made.

observe-ui · decision surface

turn 0 · session_start · model=qwen3.5-9b

turn 0 · user_message · Extract the per-route auth checks into one place.

◆ decision · turn 1 · stated

Chose a shared middleware over per-route guards

covers turns 0–5 · deliberation

confidence 0.84

window · 2 events · no decision
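
One way to hold what the panel shows is a small record with the source tag as a required field. This is a sketch with assumed field names (`summary`, `turn`, `covers`, `confidence`, `source`); the prototype's actual schema may differ.

```python
from dataclasses import dataclass

SOURCES = ("stated", "inferred")  # did the agent say it, or did the surface infer it?


@dataclass(frozen=True)
class Decision:
    summary: str
    turn: int
    covers: tuple[int, int]  # first and last turn collapsed into this decision
    confidence: float
    source: str

    def __post_init__(self):
        # Reject records that would make the surface lie about what it knows.
        if self.source not in SOURCES:
            raise ValueError(f"source must be one of {SOURCES}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if self.covers[0] > self.covers[1]:
            raise ValueError("covers must be (first_turn, last_turn)")

    def render(self) -> str:
        first, last = self.covers
        return "\n".join([
            f"decision · turn {self.turn} · {self.source}",
            self.summary,
            f"covers turns {first}–{last}",
            f"confidence {self.confidence:.2f}",
        ])


d = Decision(
    summary="Chose a shared middleware over per-route guards",
    turn=1,
    covers=(0, 5),
    confidence=0.84,
    source="stated",
)
print(d.render())
```

Making `source` mandatory is the design choice that matters: a decision the surface inferred can never be displayed as if the agent had stated it.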

III · code-gpt

A surface can infer decisions from the outside. But what if the model just… told you itself? To see how small a trustworthy model could get, I trained a GPT-2 from scratch on a single consumer GPU, with wine metadata baked into the pretraining.

It couldn't follow an instruction to save its life, which honestly was the lesson. The gap between a small model and a frontier one turned out to be mostly breadth: mine had simply seen too little of the world. A bigger pretrain won't happen on one consumer GPU, so the plan is to add one extra behavior instead: emitting the exact object the surface already renders, { options, chosen, confidence }. Coding's a harder domain than wine, though, so the next thing I'm building is a LoRA you attach to a capable model, taught to explain its coding calls.

The model itself would explain the call as it made it. That's the goal, anyway. The specialist deliberates out loud: here are the options it'd weigh for an auth refactor, and the one it goes with. Same decision object, except now the model writes it out itself, straight from the weights.

And it knows when not to answer. Try the harder questions on the right: if it's not confident enough, it escalates to a stronger model; if the call is irreversible and high-stakes, it hands off to you. Deferring is a decision too, and the model just says so.

ask

deliberation

per-route guards
shared middleware
a decorator

confidence ▁▃▅▆ 0.84

✓ answers

Shared middleware: the same check is duplicated across fourteen call sites; centralizing it removes the drift risk without scattering logic.
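
The answer / escalate / hand-off behavior described above can be written down as a small routing rule over the { options, chosen, confidence } object. This is a sketch under assumptions: the 0.7 confidence floor, the `irreversible` and `high_stakes` flags, and the choice to check the human hand-off before the confidence floor are all mine, since the essay does not fix any of them.

```python
from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.7  # assumption: the essay does not name a threshold


@dataclass
class Deliberation:
    options: list[str]
    chosen: str
    confidence: float


def route(d: Deliberation, irreversible: bool, high_stakes: bool,
          floor: float = CONFIDENCE_FLOOR) -> str:
    """Decide who makes the call: the model, a stronger model, or a person."""
    # Assumed precedence: an irreversible, high-stakes call goes to a person
    # no matter how confident the model is.
    if irreversible and high_stakes:
        return "hand_off_to_human"
    if d.confidence < floor:
        return "escalate_to_stronger_model"
    return "answer"


auth_refactor = Deliberation(
    options=["per-route guards", "shared middleware", "a decorator"],
    chosen="shared middleware",
    confidence=0.84,
)
print(route(auth_refactor, irreversible=False, high_stakes=False))
```

Whatever the thresholds end up being, the returned label is itself worth logging, since deferring is a decision too.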

Conclusion

Agent legibility should be part of the stack.

Three experiments, three layers (the output, the action, and the model itself), and honestly I landed on the same answer each time: the thing you have to trust is the decision, and the decision is exactly what today's tools throw away. We log everything an agent does and record basically nothing about why it did it.

The part you actually need to review never makes it into the log.

The good news is that closing this gap doesn't mean opening the black box or standing up a bunch of new infrastructure. You can infer decisions from the trace, render them so a person can supervise at a glance, and (eventually) train models that just state them outright. That's the layer I think the stack is missing.

Still in progress

All of this is very much in progress. The decision surface is a prototype; next up is wiring the patterns into a real app, testing them past 1000 records, and growing the chat-per-record pane into something you'd actually explore with.

So that's where I'll leave it: these are the decisions I've made so far, and the parts I still haven't worked out.