
Why I Make My Agents Keep Diaries

The avoidable process error was mine: I briefly launched two flash/probe jobs in parallel against the same /dev/ttyACM0 device. That is invalid for this setup and could have polluted the evidence, so I killed both jobs, wrote down the mistake here, and added a standing repo instruction to AGENTS.md so future work treats each serial device as single-owner during flash/probe work.

This is what an AI agent wrote about its own work process. I didn't ask it to explain itself after the fact — I instead asked it to keep a diary (a technique I borrowed from Jesse Vincent and refined over many months).

The Problem With Most Agent Memory

What memory in agents should look like is endlessly argued about: how it can be queried, how big it needs to be. Vector stores, retrieval pipelines, compaction strategies: the machinery keeps growing. But most of it misses what actually matters: why a certain sequence unfolded, not just what happened.


The standard approach is to store facts and retrieve them through some mechanism (CLI, MCP, markdown files, etc...). But the thing I and the agents actually need when we resume work is not a list of facts — it's the narrative. Which ideas were considered and abandoned, what didn't work and why, what the model thought it was doing when it went sideways, and what should happen next. That is the part that is hardest to reconstruct after the fact, and the part that LLMs are most likely to fake if you ask them to explain themselves retroactively.

(Side note: you can also find this article on Substack; like and subscribe if you want to help me spread the good word.)

Why The Word “Diary” Works

I'm always on the lookout for words that pack a punch in very few tokens, because those mean that the concept is anchored strongly in the general human training corpus: everybody understands them, there is a breadth of examples in literature, and that means these words also transpose well across models and thus don't need much engineering effort to be tweaked. Plus, fewer words make for more targeted prompting.

"Diary" is one of those words. It carries a whole constellation of behaviors with it: dated, first-person entries; honest reflection on what happened and why; admissions of mistakes; and notes to a future self about what should happen next.

You do not have to engineer much of this: the model already knows what a diary is.

In a way, using a diary allows me to delegate the task of figuring out what a memory system should be to the latent space of the model itself. It knows what it can do through commands and task execution, so it knows how much to write about each thing and how to delegate parts of the diary to tools later on. You do not need to overdesign the memory structure first. You need to pick the right word.

The Real Payoff: A Development Artifact I Actually Want To Read

One big aspect of the diary is that it is an artifact I enjoy reading. I have always enjoyed reading author diaries and write-ups of coding sessions, and here I can see the unfolding of the codebase as it was written, in full detail and by a very "knowledgeable" and diligent "programmer".

That changes the role of memory entirely. The diary is not only there so the agent can resume work later. It is there so I can review what happened without reverse-engineering intent from code fragments, scattered tool output, and diffs. It lets me quickly see what decisions the model made, what loops it got into, what it struggled with, what files it touched, and what it figured it might do next, in a form that is not really captured in the conversation output that I see in the agent window, and certainly not captured in the resulting diffs.

It also helps me build a holistic understanding of the codebase and how it fits together. That is a very important part of it. While reading code fragments and doing pull request reviews of diffs often hides the forest for the trees, the diary lets me see the shape of the system: how subsystems relate, where complexity is accumulating, where issues are likely to be, and why a function or component ended up existing in the first place. I learn the codebase by reading the diary in a way that ordinary review surfaces rarely allow.

In practice, that makes the diary feel less like memory infrastructure than like a literate pull request: a narrative account of how the codebase moved, what the change was trying to do, what failed along the way, and what should happen next.

Because of the word diary, models — even terse ones like GPT-5 — tend to write in a pretty pleasant style. Reading diaries on my tablet is one of my favourite activities, as it allows me to pretty clearly see the difference between models and prompts.

The Template

The content of the diary went through multiple iterations, even if the prompt itself is still somewhat vibed and has never really been optimized. Concretely, each entry covers:

What I did — the factual record: actions taken, files touched, commands run.

Prompt verbatim — my input, the model's interpretation of the instructions, and the inferred intent. This is useful both for review and for seeing how a prompt actually played out.

Why I did it — forces the model to connect actions to goals rather than just list steps.

What worked — the positive signal: what to keep doing.

What didn't work — the practical feedback loop. These entries often point directly to the eventual solution.

What I learned — tacit knowledge that would otherwise disappear with the context window.

What was tricky — the friction points, the hidden complexity, the parts that took longer than they should have.

What warrants a second pair of eyes — I'm skeptical of this one. It might actually introduce too much complexity into designs rather than flagging genuine concerns, so it is something I'll have to assess again.

What should be done in the future — continuity across sessions.

Code review instructions — tells a human reviewer where to look and what to verify.

Technical details — file paths, API calls, commands, commit hashes, and anything else concrete enough to anchor the rest.

This is part of why the diary works well as a review surface. It does not just describe the work; it packages the work in a form that is navigable. You can find the skill on my GitHub, but I would strongly suggest you create your own, since these are my personal skills.
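As a rough sketch, the template above could be stamped out as a markdown skeleton. The section wording matches the list in this post, but the helper itself is mine, not taken from the author's actual skill file:

```python
# Hypothetical sketch: render an empty diary-entry skeleton from the
# eleven sections described above. Names and layout are illustrative.
DIARY_SECTIONS = [
    "What I did",
    "Prompt verbatim",
    "Why I did it",
    "What worked",
    "What didn't work",
    "What I learned",
    "What was tricky",
    "What warrants a second pair of eyes",
    "What should be done in the future",
    "Code review instructions",
    "Technical details",
]

def diary_skeleton(step: int, title: str) -> str:
    """Return a markdown skeleton for one diary entry."""
    lines = [f'## Step {step}: "{title}"', ""]
    for section in DIARY_SECTIONS:
        lines += [f"### {section}", "", "TODO", ""]
    return "\n".join(lines)
```

The agent then fills in each `TODO` as it works, rather than reconstructing everything at the end.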

What This Looks Like In Practice

Here are diaries from the same codebase, using the same prompting technique, across different kinds of work.

Example 1: Architecture Research Before Writing Code

An agent is starting work on a BLE temperature sensor logger for an ESP32-S3 Cardputer. Before writing any code, it needs to figure out which BLE stack to use, which sensor, which I2C API version — the kind of platform research that usually happens in someone's head and never gets written down. From Step 1: "Ticket Creation and Initial Research," "What I learned":

BLE Stack Choice: Codebase uses Bluedroid, not NimBLE. For consistency, should use Bluedroid even though NimBLE is lighter.

ESP-IDF Version: Cardputer uses v4.4.6, but example code uses v5.x I2C APIs. Need to verify compatibility or adapt APIs.

Architecture Pattern: Existing BLE code uses callback-based event handling with GAP and GATTS callbacks. Should follow same pattern.

A few steps later, the firmware compiles but fails at link time. The diary describes how the agent reads the framework source to understand the design constraint. From Step 4: "Confirm BLE 5.0 default disables legacy advertising APIs in ESP-IDF 5.4.1 (source inspection)":

In components/bt/host/bluedroid/Kconfig.in, BLE 5.0 is enabled by default and the prompt explicitly warns BLE 4.2 + BLE 5.0 are mutually exclusive. The legacy advertising structs + helper APIs are guarded by #if (BLE_42_FEATURE_SUPPORT == TRUE), so when BLE 5.0 is enabled and BLE 4.2 is disabled, these symbols are not built into the library.

Then the agent cross-checks its findings against research done by another contributor — and catches a factual error in one of the reference documents. From Step 5: "Document BLE 4.2 vs 5.0 advertising APIs (cross-validated with intern research)":

Intern doc proposed CONFIG_BT_BLE_42_ADV_EN=y, but that symbol does not exist in the ESP-IDF 5.4.1 tree for Bluedroid; relying on it would stall bring-up.

This one is a research journal. The model is building up knowledge about a hardware platform before writing a single line of application code. Six months from now, when someone asks "why did we use Bluedroid instead of NimBLE?", the answer is in the diary. More importantly, I get to learn about the decisions made without having to scroll through opaque transcripts.

Example 2: Normal Feature Implementation

This example shows routine implementation work with no drama. The diary is valuable even when nothing goes badly wrong: it captures structure, layering, and incremental design decisions that would otherwise be spread across diffs.

An agent is building a demo application with a menu system for an ESP32 Cardputer — keyboard input, list views, scrolling. This is just normal feature work. From Step 1: "Create demo-suite project scaffold + A2 list view skeleton":

Implemented CardputerKeyboard (matrix scan) to emit keypress edge events (W/S/Enter/Del). Implemented ListView widget with selection + scrolling + ellipsis truncation. Wired it together in app_main so the device boots into a simple menu list.

Several steps later, the agent refactors the rendering into a layered architecture and adds a scene switcher — the kind of incremental design work that builds up a codebase. From Step 6: "Implement A1 (HUD header) + B2 (perf overlay) + scene switching":

Refactored rendering into layered sprites: header (16px), body (content), footer (16px). Added a minimal scene switcher: Enter opens a demo from the menu, Tab / Shift+Tab cycles scenes, Del returns to the home menu. Implemented A1: HUD header bar with screen name + fps + heap/dma. Implemented B2: perf footer bar with EMA-smoothed totals (update/render/present).

Then adding a new subsystem — a terminal console with scrollback. From Step 7: "Implement E1 terminal/log console scene":

Added a ConsoleState buffer with scrollback, input line, and dirty flag. Implemented wrapping via textWidth() when appending lines to the buffer. Rendered the console into the body sprite (excluding header/footer). Integrated the console as a real scene (E1TerminalDemo) in the scene switcher.

This is the most common kind of diary — just normal work. The value is that six sessions later I can still read back through and understand the architectural choices that led to the current state. I also get to learn the layout and APIs of the burgeoning codebase, and quickly intervene before things compound at the speed of agentic programming.

Example 3: End-to-End Validation And Shipping

The BLE temperature logger is finally working end-to-end. The agent documents not just that it works, but the exact tmux workflow, the PATH quirks it solved, and what to stress-test next. From Step 6: "Ship BLE 5.0 extended advertising + validate full E2E (firmware ↔ host) with tmux + btmon," "What worked":

Firmware compiles and links under ESP-IDF 5.4.1 (BLE 5.0 enabled). Device is discoverable as CP_TEMP_LOGGER and accepts connections. Host can: scan and find the device, connect, read temperature, subscribe to notifications and receive periodic values, adjust notify period via the control characteristic.

The workflow itself gets documented — the kind of detail that makes a setup reproducible months later. From the same step, "What I did":

Added two "pane scripts" to make the workflow reliable in tmux: tools/run_fw_flash_monitor.sh sources ESP-IDF 5.4.1 and runs idf.py flash monitor using the IDF venv python to avoid tmux/direnv PATH mismatches. tools/run_host_ble_client.sh runs the host bleak client, normalizing argparse global flags.

And the practical learning — the kind of thing you would otherwise rediscover repeatedly. From the same step, "What I learned":

ESP-IDF tool stability in tmux comes from being explicit: source the IDF export script, then run IDF_PATH/tools/idf.py using the IDF python venv, not whatever idf.py resolves to on PATH.

And then the handoff to the next phase. From the same step, "What should be done in the future":

Run stress scenarios (reconnect loops, long streaming) and confirm: notifies_fail stays at 0, heap minimum free remains stable over time. Proceed with hardware work (SHT3x I2C pins + driver) once BLE baseline is stable.

This diary documents the moment something actually works — and just as importantly, it captures the workflow that makes it reproducible. The tmux scripts, the PATH quirks, the exact validation steps. These will help an agent coming back to the issue quickly get started, often saving a full context window of trial and error. These types of diaries can also easily be "recycled" into full tooling or documentation.

Prompt Sharing And Delayed Interpretation

Diaries also serve as a reasonable way of sharing prompts, because sharing raw sessions is insufferable. A diary gives you not just my prompt, but how the model ended up interpreting it and what that interpretation produced, in a form that is actually readable.


As a side-note, you always have to be a bit wary of after-the-fact explanations by an LLM. But in this case the order helps — the sequence is usually: prompt → thinking → work → thinking → work → diary. The inferred intent of the prompt is only generated after all the work is done, rather than being locked in at the very beginning when the model has the least context. However, I read it in the more narrative sequence, and can usually trust that what the agent wrote up is also what ended up happening.

Where The Diaries Live

I have a system to manage all temporary output of LLMs and keep it in the codebase without having the model randomly select where to put these documents. They are all grouped by day, by ticket. A diary from March 22nd about an EPD instrumentation task lives at ttmp/2026/03/22/ESP-41-.../reference/01-diary.md.
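As an illustration of that layout, the day-and-ticket grouping could be derived with a small helper. The function name and the ticket slug below are hypothetical; the real layout is managed by docmgr:

```python
from datetime import date
from pathlib import Path

def diary_path(root: Path, day: date, ticket_slug: str, seq: int = 1) -> Path:
    """Build a ttmp/YYYY/MM/DD/<ticket>/reference/NN-diary.md path,
    mirroring the day-and-ticket grouping described above."""
    return (root / f"{day.year:04d}" / f"{day.month:02d}" / f"{day.day:02d}"
            / ticket_slug / "reference" / f"{seq:02d}-diary.md")

# Hypothetical ticket slug for illustration:
p = diary_path(Path("ttmp"), date(2026, 3, 22), "ESP-41-example-ticket")
print(p)  # ttmp/2026/03/22/ESP-41-example-ticket/reference/01-diary.md
```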

While the diary prompting is not directly part of this tool, called docmgr (which is btw a complete vibe and not something I would want you to use as is; I encourage you to "launder" the idea of it), the combination of both might actually be the most important part of my workflow. I can look up any repository of mine, for any piece of code, and find the diary that led to the creation or modification of that file, and track why a function exists. It is very easy to point an agent at a problem and just tell it to use diaries and git to answer questions.

docmgr provides an option to crosslink files and symbols across temporary files for easier querying, but the LLM is also able to use any symbols and filenames mentioned in the diary as a navigational map of the codebase for the work that was done within that ticket.
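A minimal sketch of that second use: scanning the diary tree for any file that mentions a given symbol, so an agent (or a human) can jump from a function name to the ticket that touched it. The filename convention is the one assumed above; docmgr's own crosslinking is richer than this:

```python
from pathlib import Path

def diaries_mentioning(root: Path, symbol: str) -> list[Path]:
    """Return all diary files under the ttmp tree whose text
    mentions the given symbol or filename."""
    hits = []
    for diary in sorted(root.rglob("*-diary.md")):
        if symbol in diary.read_text(encoding="utf-8"):
            hits.append(diary)
    return hits
```

An agent given this kind of lookup can answer "where did `ConsoleState` come from?" by reading the matching diary instead of spelunking through git history.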

What This Buys Me Day To Day

The practical impact is straightforward. I am able to resume work without worrying much about memory or compaction, because the diary becomes an external narrative memory that survives context limits. In combination with the diary, Codex can go through an almost unlimited number of compactions without the work becoming unintelligible.

It also helps the model avoid mistakes made in the past, since those are marked as "what didn't work" and usually point toward the solution — a script, a configuration fix, a different workflow, a cleaner approach.

Just as importantly, it helps me build a holistic understanding of the codebase. The diary does not merely tell me what changed in one file. It shows me how the pieces fit together, where complexity is accumulating, where issues are likely to be hiding, and what sort of architectural pressure the system is under. I learn the codebase from the diary in a way that fragmented code review rarely permits.

That makes the diaries useful later for researching bugs, but also for creating high quality playbooks, implementation notes, onboarding material, and other documents that depend on understanding not just the code but the path that produced it.

And finally, they are pleasant to read. They allow me to do high-quality reviews without getting bogged down in minutiae, while still leaving very clear instructions on how to look those details up when I need them. Because they are written in recognizably different voices, they are also one of the clearest ways I have found to compare models in practice.

What Is Still Rough

Some things are still rough.

Claude Code is a bit trickier because it tends to forget to write its diary over time, especially after compaction, so I usually reinject instructions to do so with a hook. Anthropic models on their own — say with the PI agent — do much better. Codex handles it well too.

I am still skeptical of the "what warrants a second pair of eyes" field. It may introduce too much complexity into designs rather than flagging genuine concerns, so that part of the template probably needs another pass.

I also do not currently track which model was used and which harness for each diary entry, and I should. Reading diaries is one of the clearest ways I have found to see the differences between models, so that metadata belongs in the artifact.

And I suspect there are other aspects of diaries as a concept that I have not yet tapped into — properties of the word that are activating behaviors I have noticed but can't fully put my finger on yet. Both the docmgr tool and the diary prompt are still somewhat vibed and have never really been optimized. The whole approach is deliberately underengineered, which is pretty much part of the point.

Conclusion

I originally came to diaries as a memory mechanism, but that is no longer the main reason I value them. Their real value is that they give me a document I can read.

Instead of reconstructing a change from code fragments, tool traces, pull request diffs, and whatever the model happened to say in the agent window, I get a narrative account of how the codebase moved: what was attempted, what failed, what was learned, what changed, how the system fits together, and what should happen next. That helps the model resume work, but it also helps me review, understand, and learn the codebase in a more holistic way.

That is why diary prompting matters to me. Not because it is an elaborate memory architecture, and not because it perfectly solves prompt interpretation, but because it produces an artifact that sits between transcript and diff — readable, inspectable, and useful. Something closer to literate programming than to logging. Something closer to a pull request written as a story of decisions.

For agentic coding, that middle layer matters a lot: it allows me to know my codebase without having to read an inscrutable log and a massive pile of disjoint diffs.