Tim Trailor

/autonomous

A persistent task runner with a delivery guarantee.

The sentence that forced this skill into existence was "email me when you are done". I had left the laptop running overnight on a long task, closed the lid and assumed the agent would email me when it finished, and in the morning there was no email. The task had probably completed and the session had died somewhere in the process, but I had no record of the result and no way to recover it.

/autonomous is the skill I now invoke instead. It kicks off a background runner that executes the task, retries on failure and is responsible for delivery, where delivery is not "the agent said it sent an email" but "the SMTP server returned 250 OK to this script, and the script also wrote a local file as a fallback in case the email itself fails to arrive".

The runner reads three things in order on each wake: its task description, the results of the last attempt and a running log. From those it decides whether the work is done, whether to retry, or whether to escalate, where escalating means sending me an ntfy push on my phone saying the autonomous task is stuck. When that happens, the escalation is the thing that makes me go and look at what has gone wrong.

The discipline the skill enforces is small and a little boring. Write the result to a file, send the email from the runner rather than from the agent, keep a log, and do not trust that the parent session is still alive when the work finishes. It has run jobs that complete in under a minute and jobs that run for many hours, and none have silently failed.

Full essay: "Email me when done": a persistent task runner with a delivery guarantee →

/debate

Three AI models, same question, parallel answers, then informed rounds.

Three models from three different families: Claude, Gemini 2.5 Pro and GPT-5.4. /debate sends the same question to all three at once in Round 0 with no knowledge of what the others will say, and from Round 1 onwards each model sees the other two answers and is asked to respond. I read the final round.

The blind Round 0 is the part that earned the skill its place, because it catches the cases where two models would have converged toward a wrong answer if they had seen each other early. A single-model review tends to say "this looks fine" without much fuss, whereas a debate opens with three separate reads and forces the disagreements out before any consensus has had a chance to form.

The value is not speed; a single-model review is faster and cheaper. The value is that the debate has caught two categories of error in my own usage that single-model review had missed: a plausible-looking migration plan that would have lost data, and an architecture proposal that recreated a previously-deleted pattern. Both were caught in blind Round 0 by the two models that were not the primary.

I run /debate before any irreversible architectural decision and do not run it for small edits or quick fixes. The confidence trap is real, in the sense that three models agreeing feels like strong evidence and often is, but when they happen to agree incorrectly they tend to agree very confidently. The protocol is not a substitute for me reading the answer carefully myself.

Full essay: Three-way AI model debate as a pre-commit gate →

/dream

Nightly consolidation that promotes session insights into canonical memory.

The memory system has two tiers. The canonical tier is a set of hand-curated topic files in plain Markdown which are the source of truth, and the ambient tier is every session's compressed conversation log, indexed by both semantic embedding and keyword. The ambient tier is large and noisy whereas the canonical tier is small and curated, and the two need to stay in sync if either is to remain useful.

/dream is the thing that connects them. It runs once a day, reads the sessions that have happened since the previous run, and identifies material that has stabilised into a genuine pattern that should be promoted into the canonical tier: a new feedback rule, a lesson from an incident, a fact that has been mentioned in three sessions and is now clearly load-bearing. The promotion is drafted as a diff against the relevant topic file and is either applied directly or queued for me to review.

The skill has a specific failure mode I have had to tune it for, which is over-promotion. A session in which I complained about a tool once is not a pattern, whereas a session in which I corrected the agent twice on the same point is, and the drafting prompt asks for evidence of repetition or confirmation before any promotion is accepted. Without that filter the canonical tier starts to rot quite quickly.

Memory that does not consolidate becomes memory that cannot really be searched. The canonical tier is the part I read, and the ambient tier is the part I search when I need something specific, and /dream is the pipe that keeps both honest.

Full essay: Memory that sleeps: a tiered memory architecture with daily consolidation →

/deep-context

Pre-assemble the context file before the task starts.

A context window is a budget. The mistake I kept making in the first few months of running an agentic environment was treating it as a container you fill rather than as an artifact you build, with files read on demand, relevant passages pulled from memory mid-conversation, and the window filling up with whatever the agent happened to look at first. By the time it got to the actual task the budget was half spent on things the task did not really need.

/deep-context inverts the order. Before the task runs, a planning pass builds a task-specific context file containing the relevant code extracts, the applicable lessons, the prior related decisions and the specific constraints. The file is bounded in size and is checked against a quality bar before the task is allowed to start, and the task itself then runs as a sub-session that reads only this file and the brief.

The sub-session cannot go off and read random files, cannot exceed its context budget because the budget is the file, and cannot forget the relevant prior decisions because those were included up front. What it produces is scoped to what the file said, and if the file was wrong the output will be wrong, which is now a visible failure rather than a silent one.

The shift from "hope the context ends up right" to "build the context deliberately" was the single largest quality improvement I made in my own use, and the one I would now recommend earliest to anyone trying to push the limits of how much they can hand to an agent.

Full essay: Context as a first-class artifact: the /deep-context pipeline →

Lessons as code

Twenty-three postmortem patterns, checked before every non-trivial change.

I keep a file called lessons.md, with twenty-three numbered patterns in it as of writing. Each pattern is a specific way I have broken my own system and the rule I now apply to prevent a recurrence, and every pattern has happened to me at least twice. The file is read at the start of every session, so the patterns are always somewhere near the top of mind.

Before any non-trivial change a skill called /lessons-check compares the proposed change against the patterns and flags any match, with messages naming the pattern number, the rule, and the dated incidents that put the rule in the file. The flag is a stop rather than a warning, which means I have to decide whether the match is real before the change is allowed to ship.

The patterns themselves are not at all clever; they are boring, specific and backed by a dated incident. "Do not hardcode paths in settings.json that differ between the laptop and the Mac Mini." "Never trust that a process is running as evidence that the feature works." "If a helper script rewrites an OAuth token, the backup goes in /tmp, never in the project folder." The value is that they get checked rather than that they are novel.

The portable part of all of this is the pattern itself rather than my specific list. Any agentic development environment accumulates incidents over time, and the question is whether the next change knows about them; a file plus a pre-flight skill is the thinnest version of that answer I have found so far.

Full essay: Lessons as code: turning postmortems into pre-flight checks →