Tim Trailor
Technical Manual

Personal AI Operating Environment

The full technical specification for the personal AI operating environment described across this site. Architecture, memory, context, hooks, safety, review, security, daemons, skills, and the incidents that shaped each of them.

Tim Trailor · Version updated 2026-04-23 · Roughly 10,500 words, 19 sections

1. Overview

This document describes a personal AI operating environment built around Anthropic’s Claude Code CLI over several months of daily use. It is not a product. It is what one person ended up with after using an AI coding assistant as the primary interface for managing infrastructure, running 3D printers, reviewing code, and reaching information from any device, and after fixing every failure that turned up along the way.

The design philosophy is short: measurement over narrative, enforcement over instruction, and defence-in-depth for anything that touches the physical world. Every safety rule in the system exists because a text-based instruction failed under pressure at some point. Every hook exists because a “don’t do this” rule was ignored when the agent was head-down on a task. Every layer of printer protection exists because the previous layer turned out to be insufficient during a real incident.

The environment today spans a Mac Mini server, a laptop thin client, and mobile devices connected over a Tailscale mesh network, running persistent daemons, enforcement hooks across the lifecycle stages, reusable workflow skills, and a hybrid memory system indexing every past conversation. A control plane repository gates every deployment through a scenario test suite that executes hooks with real payloads and asserts correct behaviour, and three 3D printers serve as the highest-stakes test of the safety architecture, because a firmware restart during a long print destroys a real, physical object.

The patterns described here (context budgets, incident-to-enforcement pipelines, multi-model adversarial review, persistent memory with hybrid search, context pre-assembly via a /deep-context pipeline) are not specific to this system; they apply to any environment where an AI agent operates with meaningful autonomy over real infrastructure.

2. Infrastructure

The infrastructure started on two machines and quickly demonstrated why that was a problem. A Mac Mini was the always-on server, a MacBook Pro was used for development, code existed in both places, and configuration drifted silently between them. A change made on one machine would work fine for days before someone discovered it had broken the other. The worst instance was a memory search MCP server that was silently broken for nearly three weeks because the settings file hardcoded a path that only existed on one of the machines.

The fix was to make the architecture honest about what it actually is, ie a single-server system with thin clients. The Mac Mini is the authoritative host. All persistent daemons run there, and all canonical code lives there. The laptop is permitted only a minimal set of services and everything else happens over SSH, whilst the phone connects through Tailscale or through a conversation server API. This is not elegant distributed systems design; it is the pragmatic recognition that for a single-person setup, keeping two machines in sync is harder than keeping one machine reliable.

Tailscale provides the network layer, with every machine (server, laptop, phone) sitting on the same mesh network and reachable from any physical network. This matters because the system needs to be accessible from a coffee shop, from a phone on mobile data, and from the couch on home Wi-Fi, all without VPN configuration. Tailscale makes the server reachable from anywhere with a single stable IP address per device.

Three 3D printers sit on the local network. The primary is a Sovol SV08 Max, a large-format Klipper-based machine with a 500 × 500 × 500 mm build volume, a chamber heater on a CAN bus, and four microcontrollers. A Snapmaker U1 runs Klipper with Fluidd. A Bambu A1 uses a proprietary protocol over MQTT and FTPS. The printers are in this document not because they are the system’s purpose, but because they are its most demanding safety test. A badly timed command to the primary printer during a long print does not just waste filament; it can warp a build plate, jam an extruder, or destroy a part that took many hours to produce. Every layer of the safety architecture in Section 9 was born from a real incident with real consequences.

3. Control Plane

Before the control plane existed, Claude Code configuration lived wherever it had been created, with hook scripts in one directory, rules in another, and skills scattered across the filesystem. Deploying a change meant manually copying files, verifying a change meant hoping nothing broke, and rolling back meant remembering what the files looked like before.

The control plane is a single private Git repository that serves as the source of truth for all Claude Code configuration: rules, hooks, agents, skills, service declarations, and host manifests. It was created during a seven-phase platform rebuild in April 2026, after a cascade of failures (described in Section 18) made it clear that unversioned infrastructure configuration was not sustainable.

Three scripts form the operational core. A deploy script copies versioned configuration from the repository into the correct locations on whichever host it runs on. A verify script runs the full scenario test suite. A drift-check script diffs configuration across all managed locations and reports divergence, running both on demand and weekly via cron. The verify tests do not just check that files exist; they feed real Claude Code JSON payloads into hook scripts and assert the correct allow/deny behaviour.
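The scenario tests described above can be sketched as a small harness: feed a JSON payload to a hook on standard input and compare its exit code with the expected allow/deny outcome. This is a minimal sketch, not the control plane's actual verify script. The stand-in hook, the scenario names, and the reduced payload shape (`tool_name`, `tool_input.command`) are assumptions made for illustration.

```python
import json
import subprocess
import sys

ALLOW, DENY = 0, 2  # exit codes: 0 allows the tool call, 2 blocks it


def run_hook(command, payload):
    """Run a hook command with the payload as JSON on stdin; return its exit code."""
    result = subprocess.run(
        command,
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        timeout=10,
    )
    return result.returncode


def run_scenarios(command, scenarios):
    """Return the names of scenarios whose exit code differs from the expected one."""
    failures = []
    for name, payload, expected in scenarios:
        if run_hook(command, payload) != expected:
            failures.append(name)
    return failures


# Hypothetical stand-in hook: denies any Bash command containing FIRMWARE_RESTART.
STAND_IN_HOOK = [
    sys.executable,
    "-c",
    "import json, sys; "
    "cmd = json.load(sys.stdin).get('tool_input', {}).get('command', ''); "
    "sys.exit(2 if 'FIRMWARE_RESTART' in cmd else 0)",
]

# Hypothetical scenarios: (name, payload, expected exit code).
SCENARIOS = [
    ("harmless command is allowed",
     {"tool_name": "Bash", "tool_input": {"command": "ls"}}, ALLOW),
    ("firmware restart is denied",
     {"tool_name": "Bash", "tool_input": {"command": "send FIRMWARE_RESTART"}}, DENY),
]
```

A deploy gate built this way would refuse to proceed whenever `run_scenarios` returns a non-empty list, which is the difference between checking that a hook file exists and checking that it behaves.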

Each host has a YAML manifest declaring its role policy. The server manifest lists allowed LaunchAgents, required symlinks, required files, and forbidden files. The laptop manifest allows only a small set of supporting agents and explicitly forbids the conversation server, enforcing the thin-client policy in code rather than relying on memory. If a file that should only exist on the server appears on the laptop, the next scheduled drift check catches it.
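A manifest check of this kind might look like the sketch below. The manifest is shown as a plain dictionary rather than parsed YAML to keep the example dependency-free, and the key names and file names are hypothetical placeholders, not the real manifest schema.

```python
from pathlib import Path


def check_manifest(manifest, root):
    """Return policy violations for one host, given its manifest and a filesystem root."""
    root = Path(root)
    violations = []
    for rel in manifest.get("required_files", []):
        if not (root / rel).exists():
            violations.append(f"missing required file: {rel}")
    for rel in manifest.get("forbidden_files", []):
        if (root / rel).exists():
            violations.append(f"forbidden file present: {rel}")
    return violations


# Hypothetical laptop manifest; keys and file names are illustrative only.
LAPTOP_MANIFEST = {
    "role": "thin-client",
    "required_files": ["settings.json"],
    "forbidden_files": ["conversation_server.py"],
}
```

An empty list means the host matches its declared role; anything else is drift to report.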

The pattern here is versioned infrastructure-as-code for personal AI. The same principles organisations apply to fleet management (declarative manifests, automated verification, controlled deployment) turn out to be just as valuable when the “fleet” is two machines and a phone. The alternative, which this system tried first, is configuration that works fine until, silently, it doesn’t.

4. Conversation Server

The conversation server solves a specific problem: Claude Code is a CLI tool that runs in a terminal, but the system needs to be accessible from a phone, a browser, and a native iOS app. The server is a persistent Flask daemon that brokers access to Claude CLI subprocesses, providing WebSocket and Server-Sent Event (SSE) streaming, a terminal proxy, push notifications via Apple’s Push Notification service (APNs), Live Activity updates on iOS, printer control endpoints, and system health APIs.

At over seven thousand lines in a single Python file, it is the system’s largest monolith and its most instructive cautionary tale, having grown to that size through organic feature accretion. Each feature was small and reasonable in isolation, but the aggregate violates every context management principle the system now enforces (Section 6); no single Claude session can reliably edit a file of that size, because the “Lost in the Middle” effect means information in the middle of that context is systematically missed.

The decomposition is underway, guided by the same principles the system applies elsewhere. A target module map splits the monolith along natural seam lines: authentication, session management, core routes, printer safety, terminal management, notifications, Live Activity logic, and health monitoring. The decomposition follows a charter-first approach, ie each module gets a clear statement of what it owns and what invariants it maintains before any code moves. The key invariants are about ownership: auth owns the Flask app and the global lock, sessions owns the subprocess lifecycle, and the printer safety helpers are entirely self-contained, reading printer state via HTTP and never touching session state. A hot bug in one module therefore cannot accidentally take down another via shared mutable state.

The pattern for other engineers is the mobile access layer, ie a persistent subprocess broker that exposes a CLI tool as a multi-device API. The specific implementation matters less than the architectural insight that CLI-native tools can be made universally accessible through a thin server layer, and that this server will inevitably grow unless it is decomposed early.

A small set of iOS apps consume this interface: TerminalApp (primary: browser, terminal, chat, and push-notified alerts), GovernorsApp (governance document Q&A) and PrinterPilot (direct printer control), with a TimSharedKit Swift package providing the shared SwiftUI framework reused across the app targets. Remote Control via claude.ai/code, live since March 2026, provides an additional zero-install path through Anthropic’s own interface for sessions initiated from a browser rather than the native app.

5. Memory System

Every Claude Code session starts from zero. The model has no memory of previous conversations, no awareness of decisions made yesterday, and no knowledge of incidents resolved last week. For a system used daily to manage real infrastructure this is not a minor inconvenience. It means every session risks repeating a mistake that was already analysed and resolved.

The memory system is a two-tier search engine exposed as an MCP server, indexing all conversation history into both semantic and keyword search backends. The semantic tier uses ChromaDB with local Open Neural Network Exchange (ONNX) embeddings, ie no API key required and no external dependency, and the keyword tier uses SQLite Full-Text Search version 5 (FTS5) with boolean operators; together they index every past conversation transcript.

Both search types exist because each fails where the other succeeds. Semantic search finds conceptually related content (“printer safety incidents”) but reliably misses specific dates, IP addresses, and error messages, whilst keyword search finds exact strings (“2026-03-11 FIRMWARE_RESTART”) but misses conceptual connections. The system’s retrieval protocol therefore always specifies which search type to use first based on the query: date-specific queries start with keyword search, open-ended questions start with semantic search, and both expand into the other tier for completeness.
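The routing rule can be sketched as a small classifier that decides which tier to query first. The regular expressions below are assumed heuristics for "exact string" queries (ISO dates, IPv4 addresses, upper-case identifiers); the real protocol's criteria are not given in this document.

```python
import re

# Assumed heuristics for queries that look like exact strings.
EXACT_PATTERNS = [
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),            # ISO date, e.g. 2026-03-11
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),      # IPv4 address
    re.compile(r"\b[A-Z][A-Z0-9]*_[A-Z0-9_]+\b"),    # identifier, e.g. FIRMWARE_RESTART
]


def search_order(query):
    """Return the search tiers to query, first tier first."""
    if any(pattern.search(query) for pattern in EXACT_PATTERNS):
        return ["keyword", "semantic"]
    return ["semantic", "keyword"]
```

Both tiers always appear in the result, reflecting the rule that each search expands into the other tier for completeness; only the starting point changes.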

Alongside the search engine, canonical knowledge lives in structured topic files, one file per topic, covering printers, infrastructure, lessons learned, applications, incidents, and behavioural feedback. An index file (MEMORY.md) is loaded into every session’s system prompt, kept under 200 lines following Anthropic’s own guidance on instruction file length. Topic files are loaded on demand when relevant.

The memory architecture enforces a strict precedence rule between layers. Topic files are the curated, current truth and they win on conflict; compressed session summaries (see Section 7) are routing and recall, not authoritative on facts; raw session transcripts arbitrate when summary and topic disagree; and ChromaDB and FTS5 indices are derived, so they can be regenerated from source at any time. The precedence rule lives in CLAUDE.md and is enforced as a convention that /dream promotion respects: insights can move from compressed into topics, never the reverse.

Memory consolidation happens through a mechanism called /dream. A hook at session end checks whether consolidation is due. If so, it sets a flag that the next session start detects, triggering a background subagent that reviews recent sessions, proposes updates to topic files, and consolidates learnings into the persistent knowledge base for future sessions to pick up. Memory topic files are version-controlled in a dedicated Git repository, automatically pulled at session start and pushed at session end, ensuring cross-machine consistency.

The pattern here is persistent hybrid-search memory for AI agents. The specific backends (ChromaDB, FTS5) matter less than the principles: dual search modalities that complement each other’s weaknesses, structured topic files that are human-readable and version-controlled, automatic indexing so no conversation is lost, a precedence rule that prevents poisoning at the summary layer, and a consolidation process that prevents knowledge rot.

6. Context Management

The most important lesson in this system was discovering that technical context limits are not productive context limits, ie Claude can accept large amounts of context but cannot reliably use all of it. Research by Liu et al. (2023), in “Lost in the Middle: How Language Models Use Long Contexts”, showed that language model performance is often highest when relevant information appears at the beginning or end of the input, and degrades materially when information sits in the middle.

This is not a theoretical concern. The system’s school governance application (Section 17) loaded a large document corpus into context and asked about a specific meeting, where the meeting minutes were present in the corpus but buried in the middle of the file, and the model missed them entirely. A school governor relying on this system for meeting preparation got an incomplete answer about a meeting whose minutes were right there in the context.

The response was to set concrete productive ceilings, enforced by tests:

  • A line ceiling for a single code file being edited; beyond this, edits become unreliable because the model cannot hold the full file in productive attention, and the conversation server monolith is exhibit A, with its decomposition driven by this ceiling.
  • 200 lines for any CLAUDE.md instruction file. This is Anthropic’s own stated guidance, and longer instruction files consume more context and reduce adherence to the instructions they contain.
  • A bounded loaded-corpus budget; beyond it, retrieval-then-load outperforms full-load, which is why the governance application’s corpus now uses query-type routing instead of full context loading.
  • A small number of subsystem files per edit. If an edit requires loading many files from a subsystem, the module boundary is wrong.

These ceilings are enforced through the control plane’s test suite, which checks file sizes and CLAUDE.md lengths. Per-subsystem CLAUDE.md files then scope the model’s attention: instead of one project-level instruction file trying to cover everything, each subsystem directory has its own CLAUDE.md with a module map showing exactly which line ranges to load for which type of change.
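A ceiling test of this kind could be as simple as the sketch below. The 200-line limit for CLAUDE.md comes from the text; the code-file ceiling is left as a caller-supplied parameter because the document does not state the figure, and restricting the check to `.py` files is an assumption made for brevity.

```python
from pathlib import Path

CLAUDE_MD_MAX_LINES = 200  # stated in the text


def over_budget(root, code_max_lines):
    """Return (relative path, line count, limit) for every file over its ceiling.

    code_max_lines is a caller-supplied ceiling for Python source files;
    the real figure is not given in this document.
    """
    offenders = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if path.name == "CLAUDE.md":
            limit = CLAUDE_MD_MAX_LINES
        elif path.suffix == ".py":
            limit = code_max_lines
        else:
            continue
        count = len(path.read_text(encoding="utf-8").splitlines())
        if count > limit:
            offenders.append((str(path.relative_to(root)), count, limit))
    return offenders
```

Run as part of a verify step, a non-empty result fails the build, so a file cannot quietly grow past the point where edits become unreliable.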

The pattern is context budget management, ie treat the model’s productive attention as a scarce resource with measurable limits, not as an infinitely expandable container. Every file, every instruction, every corpus load should have a budget justification. Anthropic’s own subagent mechanism (documented in their Claude Code docs) is the official answer to context protection: isolated context windows for side tasks that would otherwise flood the main conversation. The community’s “12 Factor Agents” framework (Factor 3: “Own Your Context Window”) arrives at the same principle from the practitioner side.

7. Deep-Context Pipeline

Context ceilings (Section 6) describe how much context to give a task. The deep-context pipeline describes how to assemble that context before the task starts. The two belong together, but the pipeline deserves its own section because it is the most recent and most substantial addition to the system. It inverts an assumption that most agent workflows quietly accept, ie that the relevant information will be found at task time, if the model looks for it.

The pipeline treats context as a first-class artifact. The /deep-context <brief> skill runs before a high-stakes task and produces a task-specific file called context.md that is under 50,000 tokens (tunable), structured, deduplicated, and cited by source. The task itself is then spawned as a sub-session with the brief and context.md as its only input. The sub-session does not need to search; the search has already happened.

The pipeline has four stages.

Pre-filter. The corpus of things to consider is three separate stores. Topic files are the curated current truth, hand-maintained. Compressed session summaries are one per closed session, generated by a separate compression pass, stored under ~/code/memory_server_data/sessions/. Raw session transcripts are the JSON Lines files Claude Code writes, several hundred megabytes across hundreds of past sessions. The pre-filter runs five queries against these stores: time window, topic overlap with the brief, file-path overlap with the brief, FTS5 keyword match, and ChromaDB semantic match. The union typically returns a manageable shortlist of candidate sessions plus the relevant topic files.
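The union step can be sketched as follows. Each of the five queries is represented by a callable that takes the brief and returns session identifiers; the query names and the stub results are hypothetical, standing in for the real time-window, overlap, FTS5, and ChromaDB lookups.

```python
def prefilter(queries, brief):
    """Run every pre-filter query and return the union of session ids,
    mapped to the names of the queries that matched each one."""
    matched = {}
    for name, query in queries.items():
        for session_id in query(brief):
            matched.setdefault(session_id, []).append(name)
    return matched


# Hypothetical stubs standing in for the five real queries.
STUB_QUERIES = {
    "time_window": lambda brief: ["s1", "s2"],
    "topic_overlap": lambda brief: ["s2"],
    "path_overlap": lambda brief: [],
    "fts5_keyword": lambda brief: ["s3"],
    "chroma_semantic": lambda brief: ["s2", "s3"],
}
```

Keeping the list of matching queries per session is a design choice made for this sketch: it lets a later stage rank candidates that several queries agree on, without changing the union itself.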

Fan-out. Three agents run in parallel. Agent A reads topic files, agent B reads the candidate compressed sessions, and agent C walks the codebase via Glob and Grep. None of them reads full raw transcripts; they read the compressed summaries only. Each agent returns relevant excerpts with source tags and a list of session identifiers flagged for deeper reading. The flagged-session list is the way the fan-out says “the compressed summary is suggestive but not enough; someone should read the raw transcript before answering this part of the brief”.

Aggregation. The aggregator reads the flagged raw transcripts (pre-stripped to remove tool definitions and truncated tool outputs) and assembles the final context.md. The structure is fixed: recent state from topics, relevant history from compressed plus raw re-reads, unresolved threads, files likely to touch, and a citations block where every claim is tagged by source. The tagging is mechanical, done at aggregation time, not self-reported by the model. This matters because the model’s own account of where information came from is untrustworthy; the tags are the record the system itself kept.
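Mechanical tagging can be illustrated with a small assembler in which the source of each excerpt is recorded by the pipeline when the file is read, so the model never supplies it. This is a sketch under assumptions: the `Excerpt` type, the Markdown heading layout, and the numbered citation style are hypothetical, and only the section order follows the structure described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Excerpt:
    section: str  # one of SECTIONS below
    text: str
    source: str   # recorded by the pipeline from the file it read, never by the model


SECTIONS = [
    "Recent state",
    "Relevant history",
    "Unresolved threads",
    "Files likely to touch",
]


def build_context(excerpts):
    """Assemble context.md text: fixed sections, deduplicated, every claim tagged."""
    seen = set()
    citations = []
    lines = []
    for section in SECTIONS:
        lines.append(f"## {section}")
        for excerpt in excerpts:
            if excerpt.section != section or excerpt.text in seen:
                continue  # wrong section, or a duplicate of an earlier claim
            seen.add(excerpt.text)
            if excerpt.source not in citations:
                citations.append(excerpt.source)
            tag = citations.index(excerpt.source) + 1
            lines.append(f"- {excerpt.text} [{tag}]")
    lines.append("## Citations")
    lines.extend(f"[{i}] {src}" for i, src in enumerate(citations, start=1))
    return "\n".join(lines)
```

Because the tag is derived from the `source` field rather than from generated text, a claim cannot appear in the output without a citation that the system itself recorded.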

Sub-session. The original task is spawned as a fresh Claude session with the brief and context.md, nothing else, so the full context budget is pre-committed to substantive information and the sub-session does not waste tokens searching.

The aggregator is the single point of failure. If it fabricates, the whole pipeline fabricates; if it drops something load-bearing, the whole pipeline misses it. The pipeline is therefore evaluated against real briefs with an independent scorer comparing the synthesised output to alternative approaches, and a change to the aggregator prompt, the pre-filter, or the fan-out has to be re-run against the same comparison before being accepted.

The pipeline is invoked explicitly. It is not auto-triggered from plan mode. The overhead (tens of minutes of wall-clock time and a substantial Opus token spend per run) is only worth it for tasks that warrant the preparation, ie architectural changes, migrations, multi-file refactors in core systems, anything with blast radius. The rule of thumb is simple: if the task would warrant plan mode, it warrants /deep-context.

The first real run of the pipeline (on a recurring iOS prompt-button regression) caught a specific unaddressed seam in a previous fix, where a stale-guard had been added to one code path but not the parallel lock-screen path. The synthesis named the substrate (two parallel signal sources, fixes alternating which one they patched) and pointed at the fix, which then shipped as a small follow-up commit. The class of catch the pipeline is designed for is this one: cross-session connections that no single researcher had read, surfaced because the fan-out reads them all and the synthesis pass connects them.

The pattern generalises beyond this system: stop treating context as a free resource, manufacture it before the task, benchmark the manufacturing, and enforce a precedence rule between the stores that feed it.

8. Hooks and Enforcement

Text rules fail under pressure. This is the single most important lesson in the system. It was learned through direct financial and operational consequences.

The first significant failure was a rule that said “never use the API key directly, always use the subscription”. The agent, focused on completing a task, spawned a Claude CLI subprocess with the API key in the environment, and the resulting bill was £60. That is small in absolute terms, but it showed clearly that a text instruction provides zero protection when the agent is optimising for task completion and the rule sits in a file the agent never has to consult before acting. The fix was four lines of code: a function called env_for_claude_cli() that strips the API key from the environment before any CLI spawn. With it in place, the mistake became structurally impossible.

The second was a rule that said “never send FIRMWARE_RESTART during a print”. A daemon, attempting to recover from a printer error, sent FIRMWARE_RESTART while a 12-hour print was running, and the print was destroyed. The text rule existed; the daemon did not read text rules, it executed code. The fix had three parts: a Klipper macro that checks print state before allowing the command at all, a PreToolUse hook that intercepts the command at the agent layer before it ever reaches the printer, and an absolute policy that FIRMWARE_RESTART requires explicit human approval regardless of printer state, even after a print has finished cleanly.

Claude Code hooks are shell scripts triggered by lifecycle events; they receive JSON on standard input describing the tool call or event, and they return structured JSON responses. An exit code of 0 means allow; an exit code of 2 means deny, which blocks the action entirely, and this mechanism turns text rules into technical enforcement.

The system uses hooks across five lifecycle stages:

  • SessionStart hooks run when a new Claude session begins. They validate that every referenced hook and MCP launcher actually exists on disk, verify that the memory system can query its database (not just that the process is running), pull memory updates from Git, and check that the MEMORY.md index matches actual topic files. These hooks catch configuration drift before a session starts working with stale or broken tools.
  • SessionEnd hooks run when a session completes. They push memory changes to Git, index the session transcript into the search backends, trigger drift checks, and send a notification confirming the session ended. These hooks ensure no session’s work is lost.
  • PreToolUse hooks intercept tool calls before execution. This is where the safety-critical enforcement lives: a printer safety hook that checks print state and enforces the command allowlist, a protected path hook that blocks dangerous launchctl operations, a credential leak hook that scans file writes for API keys and passwords, and guards for file renames and commit operations. Each hook exists because a specific incident showed that text rules were insufficient.
  • PostToolUse hooks run after tool execution. An audit log hook records every Bash command for the audit trail. A lint hook runs language-specific linters on edited files and logs findings.
  • UserPromptSubmit and Stop hooks manage session lifecycle events. A Live Activity hook updates the iOS status display on every user message. A dream-check hook decides whether memory consolidation is due when a session ends.

The principle is straightforward, ie every incident should produce a hook that makes repetition structurally impossible. A text rule is a request; a hook is an enforcement, and the difference between the two is whether the system still works correctly when the agent is distracted, hurried, or optimising for something else. After several months of operation, that is exactly the situation in which hooks written to enforce lessons learned have proved more reliable than any amount of carefully worded instruction.

9. Printer Safety

Printer safety is the system’s showcase for defence-in-depth, not because printers are the most important component, but because they give the clearest demonstration of why single-layer protection fails. A Klipper-based 3D printer accepts G-code commands over HTTP, and many of those commands are harmless at idle but catastrophic during a print. A firmware restart during a long print does not just cancel the job; it de-energises the stepper motors, causing the print head to drop onto the partially completed object, warping or destroying both the print and potentially the build surface.

The safety architecture has six layers, each born from a specific incident:

  • Layer 1: Text rules. A CLAUDE.md rules file defines the command allowlist and the FIRMWARE_RESTART policy. This layer catches the obvious cases, ie when the agent is paying attention and consults its instructions. It failed when the agent was focused on recovering from an error and did not check its rules first.
  • Layer 2: PreToolUse hook. A shell script intercepts every Bash command. When the printer reports its state as “printing” or “paused”, the hook checks the command against the allowlist. Only seven commands are permitted during active prints: display messages, Z-offset adjustments, speed changes within safe bounds, flow rate changes within safe bounds, fan control, pause/resume, and the confirmed cancel command. Everything else returns a deny code and blocks execution. This layer exists because a daemon sent FIRMWARE_RESTART during a print on 11 March 2026, destroying a 12-hour job.
  • Layer 3: Klipper macros. Safety checks embedded in the printer’s own firmware. The SAVE_CONFIG macro blocks itself if a print is active, because on 5 March 2026 a SAVE_CONFIG during a print killed a 12-hour job. The G28 (home axes) macro checks whether axes are already homed during a print, preventing unnecessary re-homing that would crash the print head into the bed.
  • Layer 4: Daemon state checks. The printer monitoring daemon checks print state before any recovery action. If a print is active, it alerts instead of attempting automated recovery. This layer exists because automated recovery was the most common cause of print destruction. Every “helper” daemon that tried to fix printer problems during prints made things worse.
  • Layer 5: Absolute human-approval policy. FIRMWARE_RESTART and RESTART are never sent without explicit human approval, regardless of printer state, even after a print finishes. This is the only absolute rule in the system: no exceptions, no automation, no “it seems safe” judgement calls.
  • Layer 6: Audit trail. Every printer command is logged via a PostToolUse hook. This does not prevent incidents, but it ensures they can be diagnosed. When something goes wrong during a print, the audit log gives the exact sequence of commands that led to the failure.
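The decision logic of Layers 2 and 5 can be sketched as a single function. The actual hook is a shell script; this Python version is an illustration only. The G-code names chosen for each permitted category and the numeric bounds for speed and flow are assumptions, since the document names the categories but not the commands or limits, and the confirmation step for cancel is not modelled.

```python
import re

ACTIVE_STATES = {"printing", "paused"}
ALWAYS_NEEDS_HUMAN = {"FIRMWARE_RESTART", "RESTART"}

# Assumed G-code names for the permitted categories.
ALLOWED_DURING_PRINT = {
    "M117",              # display message
    "SET_GCODE_OFFSET",  # Z-offset adjustment
    "M220",              # speed factor
    "M221",              # flow rate
    "M106",              # fan control
    "PAUSE", "RESUME",
    "CANCEL_PRINT",
}
# Illustrative percentage bounds; the real safe bounds are not stated.
PERCENT_BOUNDS = {"M220": (50, 150), "M221": (80, 120)}


def is_allowed(gcode, printer_state, human_approved=False):
    """Return True if the G-code line may be sent in the given printer state."""
    parts = gcode.strip().split()
    if not parts:
        return False
    name = parts[0].upper()
    # Layer 5: restarts need explicit approval whatever the printer state.
    if name in ALWAYS_NEEDS_HUMAN and not human_approved:
        return False
    if printer_state not in ACTIVE_STATES:
        return True
    # Layer 2: during an active print, only the allowlist passes.
    if name not in ALLOWED_DURING_PRINT:
        return False
    if name in PERCENT_BOUNDS:
        match = re.search(r"\bS(\d+(?:\.\d+)?)", gcode, flags=re.IGNORECASE)
        if not match:
            return False
        low, high = PERCENT_BOUNDS[name]
        return low <= float(match.group(1)) <= high
    return True
```

Note the ordering: the human-approval check runs before the state check, so an idle printer does not make a restart acceptable, and the allowlist still blocks a restart during a print even when approval has been given.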

The proof that the architecture holds was a full system audit in April 2026, when a multi-model review team examined the entire infrastructure while a long print was running on the primary printer. The audit involved extensive tool calls, file reads, and system probes. The print completed successfully, and every layer held.

The pattern is defence-in-depth with progressive enforcement: text rules for the common case, hooks for the agent case, firmware for the hardware case, daemon checks for the automated case, human approval for the irreversible case, and audit trails for the diagnostic case. No single layer is trusted to be sufficient, and each layer catches what the layer above it misses.

10. Multi-Model Review

Single-model code review is unreliable for the same reason single-author code review is unreliable, ie the reviewer shares the author’s framing biases. When Claude reviews code that Claude wrote, it tends to evaluate the code on the terms the code was written on, rather than questioning whether those terms were correct. This is not a hypothetical concern but an observed pattern: Claude’s self-review consistently missed structural issues that an independent reviewer would have caught at first read.

The system’s response is multi-model adversarial review, using three models with different training: Claude (Anthropic), Gemini 2.5 Pro (Google), and GPT-5.4 (OpenAI). The key insight is that models from different organisations, trained on different data with different objectives, produce genuinely independent analytical perspectives. They do not just find different bugs; they frame problems differently, prioritise different concerns, and challenge different assumptions. The same code put through all three reviewers comes back with three distinct attack surfaces rather than three versions of the same one.

The /debate protocol structures this independence rigorously. In round 0, each model gets the question blind (without seeing the others’ positions) and produces an initial assessment with confidence intervals, which prevents anchoring. In subsequent rounds, each model sees the others’ reasoning and must explicitly update its position, stating what it found persuasive and what it still disagrees with. A rotating devil’s advocate role forces one model each round to defend the position it finds weakest, preventing consensus drift, and each model must state retraction criteria, ie what specific evidence would change its assessment. The debate continues until genuine convergence or a maximum number of rounds.
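The round structure can be sketched as a schedule: a blind round 0, then open rounds in which the devil's advocate role rotates through the models. This is an assumed reading of the protocol; in particular, starting the rotation at round 1 and the order of rotation are guesses, and early stopping on convergence is not modelled.

```python
def debate_schedule(models, max_rounds):
    """Return one entry per round: round 0 is blind, later rounds are open
    and hand the devil's advocate role to each model in turn."""
    schedule = [{"round": 0, "blind": True, "devils_advocate": None}]
    for number in range(1, max_rounds + 1):
        advocate = models[(number - 1) % len(models)]
        schedule.append(
            {"round": number, "blind": False, "devils_advocate": advocate}
        )
    return schedule
```

With three models and four open rounds, the role returns to the first model in round 4, so no model can settle permanently into agreement.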

A critical discovery was that convergence should be questioned, not celebrated. When all three models agree immediately, it often means they share a common training-data bias rather than that the answer is obviously correct, and the most valuable debates are the ones where models maintain principled disagreements, because those disagreements highlight genuine trade-offs that a single perspective would flatten into a false consensus.

For routine code review, a lighter-weight /review pipeline serves as the standard pre-commit gate: automated linting, static analysis, a code-reviewer subagent that checks against the lessons-learned database, and optional independent review from one or both external models. The reviewer subagent applies an eight-point checklist derived from the system’s documented error patterns, focusing on the specific categories of mistakes this system has actually made.

The pattern is adversarial multi-model review, ie use the genuine independence of differently-trained models to create review diversity that a single model, no matter how capable, cannot provide on its own.

11. Security

The system’s security posture evolved from convenience to defence-in-depth through real incidents, not through planning. Early on, the API key was available in the environment, credentials lived in plaintext files, and security was a matter of “don’t do the wrong thing”. The £60 API bill changed that.

The exemplar of the incident-to-enforcement pattern is env_for_claude_cli(), a four-line function in the shared utilities module. When the system needs to spawn a Claude CLI subprocess, this function builds the environment variables, strips the API key, and forces subscription authentication, and the function’s source code comment cites the exact incident and date that motivated it. Before this function existed, any code that spawned a CLI subprocess could accidentally include the API key, resulting in pay-per-token billing instead of the flat-rate subscription; after this function the mistake is structurally impossible, not because anyone remembers to avoid it, but because the code does not allow it.
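A function of this shape might look like the sketch below. It is a reconstruction from the description, not the system's source. The variable name `ANTHROPIC_API_KEY` is an assumption about which key is stripped, and the sketch relies on the further assumption that the CLI falls back to subscription authentication when no API key is present.

```python
import os


def env_for_claude_cli(base=None):
    """Return a copy of the environment with the API key removed.

    Assumption: ANTHROPIC_API_KEY is the variable that triggers
    pay-per-token billing when a CLI subprocess inherits it.
    """
    env = dict(os.environ if base is None else base)
    env.pop("ANTHROPIC_API_KEY", None)
    return env
```

Every spawn site would then pass `env=env_for_claude_cli()` to `subprocess.run` or `subprocess.Popen` instead of inheriting the parent environment, which is what moves the protection from a rule into the code path.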

This pattern (real loss, then code that makes the loss impossible to repeat) runs through every security control in the system. A credential leak hook scans every file write and edit for patterns matching API keys, passwords, and tokens, blocking the write before it reaches disk. A protected path hook blocks dangerous system commands (like those that start or stop persistent services) without explicit approval. Pre-push hooks prevent code from being pushed from the wrong machine. Weekly automated scanning checks for hardcoded credentials in service configuration files, verifies that quarantined secrets have not resurfaced, and confirms that only authorised processes write to shared authentication tokens.
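The core of a credential leak check is a pattern scan over the content about to be written. The sketch below shows the idea with three illustrative patterns; they are assumptions chosen for the example, and a real hook would carry a longer, tuned list and its own handling of false positives.

```python
import re

# Illustrative patterns only, not the system's actual list.
SECRET_PATTERNS = {
    "anthropic-style key": re.compile(r"sk-ant-[A-Za-z0-9_-]{20,}"),
    "aws access key id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "password assignment": re.compile(
        r"(?i)\b(?:password|passwd|secret|token)\s*[:=]\s*['\"][^'\"\s]{8,}['\"]"
    ),
}


def find_secrets(text):
    """Return the sorted names of patterns that match the proposed file content."""
    return sorted(
        name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)
    )
```

A PreToolUse hook wrapping this function would deny the write whenever the returned list is non-empty, so the content is blocked before it reaches disk.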

Credential management follows a single-writer principle. An OAuth token file was being written by seven different scripts, each refreshing with different permission scopes, so the last writer would silently strip scopes that other scripts needed. The token would work for one service and break another, and re-authentication fixed it only until the next refresh overwrote it again. The fix was to designate a single background service as the sole writer, running every 30 minutes with the full scope set; every other script reads the token but never writes it back. The problem disappeared permanently.
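
As a sketch of the single-writer split, assuming the token is stored as JSON (the file format and function names here are hypothetical): the one writer replaces the file atomically, and every other script only ever calls the reader.

```python
import json
import os
import tempfile


def write_token(path, token):
    """Sole-writer path: write to a temp file, then atomically replace the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(token, handle)
        os.replace(tmp_path, path)  # readers see the old file or the new one, never a partial write
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_token(path):
    """Reader path for every other script: load the token, never write it back."""
    with open(path) as handle:
        return json.load(handle)
```

Only the refresh service would import `write_token`; keeping the writer in one place is what makes an unexpected second writer detectable by the weekly scan.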

The public repository policy is simple. Code files may be pushed to public GitHub repositories, but all secrets live in a credentials file that is gitignored on every machine, public repositories use redacted placeholders where secrets would appear, and memory files are private and can contain operational details. The distinction is maintained by hooks, not by vigilance.

The credential-rotation daemon is the active-maintenance counterpart to the credential-leak hook. It runs daily, scanning a manifest of managed secrets against per-secret max-age thresholds and rotating any that have exceeded their window, with state in a dedicated JSON file on the Mac Mini; the hook prevents new secrets from being written in plaintext, and the daemon ensures existing secrets age out on schedule.
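
The max-age comparison at the heart of the rotation daemon can be sketched as follows. The manifest keys (`name`, `rotated_at`, `max_age_days`) are assumptions made for illustration; the real manifest schema is not shown here.

```python
from datetime import datetime, timedelta, timezone


def secrets_due_for_rotation(manifest, now=None):
    """Return names of secrets whose age exceeds their own max-age threshold.

    Each entry is a dict with hypothetical keys: name (str),
    rotated_at (timezone-aware datetime) and max_age_days (int).
    """
    now = now or datetime.now(timezone.utc)
    return [
        entry["name"]
        for entry in manifest
        if now - entry["rotated_at"] > timedelta(days=entry["max_age_days"])
    ]
```

Because the threshold lives on each entry rather than in the daemon, a high-risk secret can carry a shorter window than a low-risk one without any code change.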

One plaintext secret is retained by design: the login-keychain password itself, in a chmod 600 file that the unlock-keychain LaunchAgent reads at boot to make every other daemon’s secrets accessible on a headless machine. This is the system’s only documented plaintext-secret exception, explicitly flagged in the host service manifest as accepted risk rather than as an oversight, and the alternative, requiring an interactive login each time the Mac Mini reboots, would break the entire daemon fleet’s boot path.

12. Security Testing and Penetration Testing

Security posture decays silently. This is not a theoretical claim: it was demonstrated vividly during a multi-model audit in April 2026 that exposed zombie services that had been crash-looping unnoticed, plaintext credentials in world-readable configuration files, and passwordless sudo rules for dangerous commands, all in a system that was ostensibly secure. Advisory checks that detected problems but exited with success codes had been running the entire time; they were not security controls, they were log entries nobody read.

The lesson reshaped the system’s approach to security testing, ie rather than treating security as a static configuration set once and trusted, the environment is designed to be continuously probed through three complementary mechanisms.

Automated weekly scanning runs on a fixed schedule and covers five areas. It searches all service configuration files for hardcoded credentials (any match is a failure, not a warning). It verifies that quarantined credential files have not resurfaced. It checks for service duplication. It scans for passwordless sudo entries on dangerous commands like reboot and shutdown (a lateral risk that bypasses printer safety controls). Finally, it verifies that the OAuth token file has only the small set of expected writers (additional writers indicate credential management drift). The scan exits non-zero on any failure, sending a priority alert to the operator’s phone.
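
The difference between an advisory check and a control is the exit code. A minimal sketch of a scan runner that fails closed, with hypothetical check names and a set-difference check for token writers, might look like this:

```python
def run_scan(checks):
    """Run every check and return (exit_code, failures).

    `checks` maps a check name to a zero-argument callable that returns a
    list of findings; an empty list means the check passed.
    """
    failures = {}
    for name, check in checks.items():
        findings = check()
        if findings:
            failures[name] = findings
    return (1 if failures else 0), failures


def unexpected_token_writers(observed, expected):
    """Writers outside the expected set indicate credential-management drift."""
    return sorted(set(observed) - set(expected))
```

A wrapper script would pass the returned code to `sys.exit`, so that the scheduler and the alerting path both see the failure rather than a success code with a warning buried in a log.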

Multi-model adversarial security review applies the three-way debate protocol (Section 10) to security-focused analysis. The three models receive the system’s security configuration and independently identify attack surfaces. The rotating steelman mechanism forces one model each round to defend the current security posture while the other two probe it, preventing the consensus-drift problem where all models agree something is acceptable without genuine challenge. The April 2026 audit is the canonical example: three models independently identified the same core vulnerability pattern (declarative security claims not matched by filesystem reality) from three different analytical angles.

Scheduled penetration testing systematically exercises every security boundary through automated adversarial sessions. Hook enforcement testing feeds forged dangerous commands into security hooks and asserts they are blocked, then feeds benign commands and asserts they are allowed; testing both the positive and negative paths matters, because a false alarm that desensitises operators is as dangerous as a hole. Credential exposure scanning performs deep searches across all code, configuration, and skill definitions for secret patterns, catching anything that entered through channels the real-time hook does not monitor (manual edits, git pulls, backup restores). Network surface auditing enumerates all listening ports and flags any undeclared listener, motivated by the discovery of an ngrok tunnel that had been exposing a terminal to the public internet for 12 days without detection (Pattern 22 in the lessons-learned database). Privilege boundary testing verifies that the billing safety function is consistently used for all CLI spawns and that hooks cannot be bypassed through encoding tricks.
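
Hook enforcement testing, covering both the must-block and must-allow paths, can be sketched as below. The hook here is a hypothetical stand-in that inspects plain command strings; the real hooks receive structured payloads, and their rules are not reproduced in this document.

```python
import re

# Hypothetical stand-in for a security hook.
BLOCKED = [
    re.compile(r"\blaunchctl\s+(bootstrap|bootout|load|unload)\b"),
    re.compile(r"\bFIRMWARE_RESTART\b"),
]


def hook_blocks(command):
    """Return True when the stand-in hook would block the command."""
    return any(pattern.search(command) for pattern in BLOCKED)


def run_hook_tests(hook, must_block, must_allow):
    """Return a list of failure messages; an empty list means both paths behave."""
    failures = [f"hole: allowed {c!r}" for c in must_block if not hook(c)]
    failures += [f"false alarm: blocked {c!r}" for c in must_allow if hook(c)]
    return failures
```

Reporting holes and false alarms in the same list keeps the two failure types equally visible, which matches the point that an over-eager hook is as much a defect as a permissive one.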

The incident response protocol follows the five-layer Root Cause Analysis (RCA) described in Section 18. The immediate fix comes first (quarantine the credential, disable the service, close the port), then the failure is classified by control class, then, critically, a new automated check is added to prevent regression. Every security incident results in a new assertion in the weekly scan, a new hook, or a new invariant in the verification suite, and the principle is that every incident should make the system structurally incapable of repeating that specific failure.

The control plane repository itself is Continuous Integration (CI) enforced. GitHub branch protection requires status checks on main, and a CODEOWNERS file gates enforcement-logic surfaces (hooks, rules, service manifests) behind explicit review. The portable scenarios workflow runs the full 78-test scenario suite server-side, so a pull request that would silently bypass a local hook fails at the repository level. An hourly enforcement-state sentinel polls the self-policing gates, and a Mac Mini-side CI failure poller pushes APNs notifications to the operator within minutes of a failing main-branch build. Bypasses of any hook or gate are logged to a bypass audit file that is reviewed weekly, and the intent is to make the repository itself untrustworthy only in detectable ways: any route around the enforcement produces an alert, and the alert’s existence is itself part of the assertion set.

13. Daemon Layer

Persistent services matter because the system must be available continuously. Claude must be reachable from any device at any time. Printers must be monitored during prints that run for many hours. Backups must run on schedule, and authentication tokens must be refreshed before they expire. None of these requirements can be met by on-demand processes that start when someone opens a terminal.

All services run as macOS LaunchAgents in KeepAlive mode, ie the operating system automatically restarts them if they crash. A small fleet of daemons runs on the Mac Mini, each with a specific purpose and a specific probe that verifies functionality, not just process existence.

The conversation server daemon is the system’s front door, running the Flask server described in Section 4 and brokering access to Claude CLI subprocesses. Its health probe does not just check that the process is running, it calls the health endpoint and verifies that internal threads are alive, that authentication state is valid, and that the subprocess bridge is responsive.

The printer snapshot daemon monitors all connected printers with adaptive polling, ie every 30 seconds during active prints and every 5 minutes when idle. It records state snapshots and Estimated Time of Arrival (ETA) data for the iOS app’s progress charts. Importantly, this daemon observes without intervening. Earlier iterations attempted automated recovery (restarting firmware, adjusting settings, clearing errors), and every such attempt caused more damage than it prevented; the observation-without-intervention principle is now a hard rule for printer monitoring, ie the daemon’s job is to collect data and alert, never to act.
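
The adaptive polling rule reduces to a single decision per printer. In this sketch the two intervals come from the text, while the state string `"printing"` is an assumption about how the printer reports its status.

```python
ACTIVE_INTERVAL_S = 30     # during active prints
IDLE_INTERVAL_S = 5 * 60   # when idle


def poll_interval(state):
    """Return the seconds to wait before the next snapshot of one printer."""
    return ACTIVE_INTERVAL_S if state == "printing" else IDLE_INTERVAL_S
```

Note what is absent: the function returns a delay and nothing else, so there is no code path from an observed state to a command sent to the printer.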

The health check daemon runs hourly, producing a system-wide health report consumed by the iOS app, and its findings surface as red/yellow/green indicators on the phone. This daemon’s design carries a lesson about refresh cycles. After fixing a health check script, the developer declared “all clean” without running the check manually, forgetting that the hourly schedule meant the iOS app was still showing stale results from the pre-fix run. As a result, the phone gave a false-clean for nearly an hour after the actual fix landed. The lesson (Pattern 20) is that after editing any monitor, always run it manually to refresh its output, then verify the downstream consumer shows fresh results.

The backup daemon runs at 03:00 daily, backing up 257 files (code, configuration, credentials, memory topics, service definitions, certificates, the full control plane repository, and the OAuth plus APNs signing material) to Google Drive. The token refresh daemon runs every 30 minutes as the sole writer of the OAuth token file, preserving all scopes (Section 11), the governance document sync daemon pulls school governance documents weekly, and a date monitoring daemon watches an external website for schedule changes, alerting when new dates appear.

The pattern for daemon design is that every daemon must have a functional probe that tests what it does, not just whether it is running. A process check (“is the PID alive?”) catches crashes. A functional probe (“does the API return correct data?”) catches silent failures, broken configurations, expired credentials, and stale state, ie all the failure modes that killed this system’s services for days or weeks before anyone noticed they were dead.
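
A functional probe can be sketched as a function that exercises the service and validates what comes back. Everything named here is illustrative: `fetch` is a caller-supplied callable, and the payload is assumed to carry a `timestamp` in seconds since the epoch.

```python
def functional_probe(fetch, required_keys, max_age_s, now):
    """Return (ok, reason): the service must return fresh, well-formed data."""
    try:
        payload = fetch()
    except Exception as exc:  # a probe should report failure, not crash
        return False, f"request failed: {exc}"
    missing = [key for key in (*required_keys, "timestamp") if key not in payload]
    if missing:
        return False, f"missing keys: {missing}"
    if now - payload["timestamp"] > max_age_s:
        return False, "stale data"
    return True, "ok"
```

A process check would pass in all three failing cases above (request error, malformed payload, stale data), which is exactly the class of silent failure the paragraph describes.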

Several further daemons joined the manifest in April 2026 as the enforcement surface grew. The credential-rotation daemon (Section 11) rotates managed secrets on schedule. The trend-tracker daemon captures per-run compliance metrics and enforces a ratchet, ie the “Persistent 9/10” policy prevents the monthly compliance score from regressing below 0.9 without an explicit override. The acceptance-tests daemon runs a deterministic compliance suite on a fixed interval, producing a numeric score consumed by the iOS home tab. The CI failure poller watches GitHub Actions and pushes notifications on main-branch failures. A scheduled cron job periodically indexes new JSON Lines (JSONL) session transcripts into the ChromaDB and FTS5 backends, so a session’s content is searchable shortly after the session ends rather than waiting for the next /dream cycle. The unlock-keychain agent runs at boot and is the single dependency root for the entire fleet, ie if it fails, no other daemon can access its secrets, so the boot-order contract is explicit in the service manifest.

The observation-without-intervention principle deserves an explicit name because it is the most expensive lesson in the daemon layer. Every persistent process that touched external hardware with authority to act (the Uninterruptible Power Supply (UPS) watchdog, the auto-speed adjuster, the power-loss recovery chain, the printer auto-recovery daemon) destroyed more prints than it saved. The policy for any new daemon in the manifest is now a hard requirement: observe and alert, never act, and intervention authority requires a human in the loop. The alert-responder pattern described in Section 16 is the operational answer to the question “how do you intervene quickly without giving a daemon the authority to do it autonomously?”.

14. MCP Integrations

The Model Context Protocol (MCP) is a standard for extending AI assistants with additional tools. Rather than hardcoding every capability into the agent, MCP servers expose tools that the agent can discover and call through a consistent interface, and this system uses MCP servers in three categories: local tools, cloud tools, and reasoning aids.

Local tools run on the Mac Mini and provide capabilities specific to this system. The memory server (Section 5) exposes five tools for searching and indexing conversation history, a filesystem server provides structured file operations within sandboxed paths, and a GitHub server wraps the GitHub API for repository operations. Each local MCP server uses standard input/output (stdio) transport, ie it communicates through standard input/output pipes rather than network connections.

Cloud tools are available when Claude Code runs through the cloud interface. Gmail integration enables email operations, Google Calendar integration enables event management and scheduling, Databricks SQL enables direct database queries, and a presentation tool generates documents and slide decks; together these extend the agent’s reach into cloud services without requiring custom API integration code.

Reasoning aids help the agent think more effectively. A sequential thinking server provides structured reasoning support for complex architectural decisions. A library documentation server pulls up-to-date docs for specific frameworks and tools into context, reducing hallucination about API details. A semantic code navigation server provides symbol-level awareness across Python and Swift codebases, addressing the cross-location drift problem with structural understanding rather than text-based grep searches.

A practical lesson from running MCP servers across two machines is to use launcher scripts instead of hardcoded paths. The GitHub MCP server binary lives in different locations on the server and the laptop, so rather than maintaining two different settings files, a small bash wrapper script detects which machine it is running on and executes the correct binary. This pattern (a launcher that resolves machine-specific paths at runtime) prevents the silent breakage that happens when a settings file hardcodes a path that only exists on one machine.

For engineers building their own systems, the MCP servers most likely to provide immediate value are: a memory/search server (persistent knowledge across sessions), a filesystem server (structured file access), a web search server (current information), and a code navigation server (structural codebase understanding). Everything else is domain-specific and can be added as needs emerge.

15. Skills Directory

Skills are reusable workflows encoded as named commands. Instead of re-specifying a complex multi-step process every session, eg “run the linter, then the static analyser, then the code reviewer subagent, then optionally get a second opinion from an external model”, a skill encodes the entire workflow as a single invocation: /review.

The system has a couple of dozen skills covering review, debate, autonomous execution, memory consolidation, deep-context assembly, merchant-risk assessment and the various printer and credential utilities. The insight behind skills is not that they save typing; it is that they encode institutional knowledge about how workflows should be executed, ie a /review skill does not just run linters, it runs them in the right order, with the right configuration, checking against the right lessons-learned patterns, and producing a verdict (approve, changes requested, or block) that follows the system’s quality standards. Without the skill, each session would need to reconstruct this workflow from scratch, with inevitable variation and omission.

The most architecturally interesting skills are:

  • /debate orchestrates the three-way multi-model debate described in Section 10. It manages the blind initial round, subsequent rounds with cross-model visibility, the rotating devil’s advocate assignment, confidence intervals, and retraction criteria. Encoding this protocol as a skill ensures it runs the same way every time, with the same rigour, regardless of which session invokes it.
  • /review implements the standard pre-commit quality gate: linting, static analysis, code review against the lessons-learned database, and optional external model review. This is the system’s most frequently used skill and its primary quality enforcement mechanism.
  • /autonomous activates a persistent retry-loop runner for tasks that need to complete without human supervision. When the operator steps away (“email me when done”), this skill takes over, making conservative decisions autonomously, retrying on failure, and sending a completion notification via email. It encodes the decision framework for what can be decided autonomously (simple choices, service restarts, commits to private repos) versus what requires human approval (public pushes, printer commands during prints, irreversible deletions).
  • /dream runs memory consolidation, reviewing recent sessions, proposing updates to topic files, and consolidating learnings into the persistent knowledge base. This skill ensures that institutional memory improves over time rather than decaying.
  • /deep-context runs the context pre-assembly pipeline described in Section 7. It is the newest core skill and the most expensive per invocation, reserved for tasks that warrant plan mode. The output is a dense, cited context.md that the task consumes as a sub-session input.

The pattern is skill-as-workflow-encoding, ie capture complex multi-step processes as named, versioned, reproducible commands. Skills are stored in the control plane repository alongside hooks and rules, versioned and deployed through the same pipeline. They are not convenience aliases. They are the system’s operational playbook, encoded in a form that can be executed reliably regardless of which session needs to use them.

16. Automated Maintenance

The maintenance architecture has three layers, each catching failures that the other layers miss.

The automatic layer runs on schedules without human involvement. Cron jobs handle tasks that must happen reliably: hourly memory health checks that verify the search database can actually query (not just that the process is running), bidirectional memory synchronisation every 30 minutes, weekly cross-machine configuration drift checks, weekly security scans for credential leaks and policy violations, nightly service manifest verification on both machines, and nightly host-role compliance checks. These jobs catch slow drift, ie the kind of degradation that happens over days or weeks and is invisible in any single session.

The per-session layer runs through hooks at session start and end. Session-start hooks validate that all referenced tools exist, verify memory search functionality, pull cross-machine updates, and check index consistency, whilst session-end hooks push memory changes, index the session transcript, trigger drift checks, and auto-commit memory updates. These hooks catch fast drift, ie changes made during a session that need to be propagated or verified before the session’s context is lost.

The periodic human review layer runs monthly, triggered by a cron-sent reminder. The review involves reading every instruction and rules file, retiring stale rules, checking topic file counts against the index, running a balanced audit against system invariants, and reviewing exempt files in the context budget configuration. This layer exists because automation cannot exercise judgement about whether a rule is still relevant; automated checks can verify that rules are followed, but only a human can decide that a rule should be retired because its conditions no longer apply.

The /dream consolidation sits between the automatic and human layers. It is triggered automatically (by the session-end hook detecting that consolidation is due) but performs a cognitive task (reviewing recent sessions and proposing knowledge updates) that requires the model’s judgement. It prevents knowledge rot, ie the gradual degradation of institutional memory as sessions accumulate without their learnings being captured in the persistent topic files.

The pattern is three-layer maintenance with distinct failure domains, ie automatic checks catch drift, per-session hooks catch propagation failures, and human review catches relevance decay, with each layer assuming the other two are insufficient and built to keep working in the absence of the others.

A newer self-diagnosing loop bridges the automatic and human layers in the other direction. When a persistent health alert fires (a specific check remaining red for several consecutive hourly runs, for instance), the conversation server’s internal alert-fired endpoint spawns an alert-responder subagent in an isolated session. The subagent runs the five-layer RCA protocol (Section 18) against the alert, proposes a concrete fix, and pushes the analysis to the iOS app via APNs with three action buttons: Accept (apply the proposed fix), Reject (dismiss), or Discuss (open a conversation thread to refine the proposal). This closes the loop from detection to remediation without requiring the operator to be at a terminal, whilst preserving human judgement for the irreversible step, and it is the operational expression of the observation-without-intervention principle: the daemon layer detects and analyses, the human authorises the action.
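
The two decisions in this loop, when an alert counts as persistent and what each button does, can be sketched as follows. The threshold of three runs is an assumption (the text says only "several"), and the function names are hypothetical.

```python
ACTIONS = ("accept", "reject", "discuss")


def is_persistent(history, threshold=3):
    """True when the most recent `threshold` hourly results are all red.

    `history` is ordered oldest to newest.
    """
    if len(history) < threshold:
        return False
    return all(status == "red" for status in history[-threshold:])


def handle_response(action, apply_fix):
    """Only an explicit Accept from the operator applies the proposed fix."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    if action == "accept":
        apply_fix()
        return "applied"
    return "dismissed" if action == "reject" else "thread opened"
```

The fix is passed in as a callable and invoked in exactly one branch, so the subagent can prepare a remedy but cannot run it without the operator's tap.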

17. Governor App: A Case Study in Context Management

The governance application is a Streamlit application that helps a school governor prepare for inspections by querying governance documents with AI analysis. It serves as the system’s most instructive case study for context management because it failed in a way that directly validated the research cited in Section 6, and the fix demonstrated every principle from that section in practice.

The application’s document corpus is a large body of governance documents, ie meeting minutes, policy reviews, budget reports, safeguarding updates, and committee notes for a two-school federation. The initial implementation loaded the entire corpus into context for every query, which worked for broad questions (“What are the federation’s strategic priorities?”) but failed badly for specific ones.

The failure that drove the redesign was a governor asking what was discussed at a specific Full Governing Body (FGB) meeting. The meeting minutes were present in the corpus but buried in the middle of the file, and the model’s response omitted the meeting entirely. The governor, relying on the system for meeting preparation, got an incomplete answer about a meeting whose minutes were right there in the context. This is exactly the “Lost in the Middle” effect that Liu et al. documented: information at middle positions in long contexts is systematically under-retrieved.

The rebuild implemented query-type routing, ie different retrieval strategies for different kinds of questions. Date-specific queries (“What happened at the 25 March FGB?”) use keyword search on the date string first, because semantic embeddings reliably miss specific dates, then expand with semantic search for conceptual context. Entity queries (a person, school, or committee) start with keyword search on the entity name. Open-ended policy and strategy questions use semantic search to assemble the most relevant chunks into a bounded working context. Full-corpus loading is the fallback of last resort, used only when the query is genuinely ambiguous and retrieval returns very few relevant results.
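
The routing step can be sketched as a small classifier that returns an ordered list of retrieval strategies. The date regex, the strategy labels, and the choice to follow entity keyword search with semantic search are all assumptions for illustration; the full-corpus fallback is omitted.

```python
import re

MONTHS = "january|february|march|april|may|june|july|august|september|october|november|december"
DATE_RE = re.compile(
    rf"\b\d{{1,2}}\s+(?:{MONTHS})\b|\b\d{{1,2}}/\d{{1,2}}/\d{{2,4}}\b",
    re.IGNORECASE,
)


def route_query(query, known_entities):
    """Return the ordered retrieval strategies for a query."""
    if DATE_RE.search(query):
        return ["keyword:date", "semantic"]   # exact date string first
    lowered = query.lower()
    if any(entity.lower() in lowered for entity in known_entities):
        return ["keyword:entity", "semantic"]
    return ["semantic"]                        # open-ended policy or strategy question
```

Called with the example from the text, `route_query("What happened at the 25 March FGB?", [])` selects the date route, so the keyword search runs before any embedding lookup.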

The application also maintains a known-data-location index that maps section names to line ranges in the combined file, and this metadata enables targeted extraction that bypasses the lost-in-the-middle problem entirely for structured queries.

The broader lesson is that context management is not an optimisation; it is a correctness requirement. Loading the full corpus into context and asking a question is not equivalent to searching that corpus and loading only the relevant portion; the former misses information that the latter finds. For any application serving real users with real consequences, retrieval-first architecture is not a performance choice but a reliability one.

The corpus is not a static fixture on disk. It is re-downloaded weekly from GovernorHub (the upstream source of truth for the federation’s governance documents) by a dedicated LaunchAgent, and the Streamlit application encrypts the combined context at rest. This matters for disaster recovery (Section 19). The case study’s raw corpus is not backed up to the system’s Google Drive backup set because it does not need to be; GovernorHub is the recovery path. The backup set describes what the system considers irreplaceable, with anything re-derivable from an upstream source deliberately excluded.

18. Lessons Learned Framework

The system maintains a living document of error patterns, not as a historical record, but as an active checklist reviewed at the start of every session. Every pattern in the document represents a mistake that happened at least twice. The document exists because of a cultural commitment, ie every repeated mistake is treated as a system failure, not a human failure: if the same error happens twice, the first occurrence was an incident, but the second is evidence that the system’s controls are insufficient and need to be strengthened beyond text into something the agent cannot ignore.

The framework uses two severity tiers. Tier 1 patterns caused real damage: destroyed prints, financial loss, broken services, security exposures. These are always read at session start, regardless of the day’s planned work. Tier 2 patterns are behavioural, ie tendencies that lead to problems if unchecked but do not cause immediate damage. These are read when relevant to the current task.

The most important innovation is the five-layer RCA protocol. Every incident analysis must cover five layers. First: what happened (the sequence of events and immediate cause). Second: what controls existed (every rule, check, or enforcement mechanism that should have prevented the incident). Third: why each control failed (specifically, for each control, what gap or oversight allowed it to be bypassed). Fourth: whether the proposed fix is technical enforcement or another text rule (and if it is a text rule, an explanation of why this one will succeed where the previous rules did not). Fifth: the control class (was this a known-known, ie the agent knew the rule but skipped it; an unknown-known, ie the rule existed but the agent did not consult it; or an unknown-unknown, ie nobody knew the action was dangerous).

This classification matters because each class requires a different response. Known-knowns need enforcement (the agent knew but was optimising for something else, so a hook removes the choice). Unknown-knowns need better surfacing (the rule existed but was not loaded when relevant, so a session-start check ensures it is visible). Unknown-unknowns need protective defaults (nobody predicted the failure, so filesystem protection, wrapper scripts, and pre-execution backups limit the blast radius of unpredicted actions).

A few patterns illustrate the framework’s value:

  • “Fix creates new problem” (Pattern 1). Every automated process built to help with printing (a UPS power watchdog, an automatic speed adjuster, a power-loss recovery chain, a daemon with auto-recovery) ended up destroying prints, and the watchdog alone caused three or four failures before being permanently deleted. The prevention is now a mandatory five-question pre-flight checklist before any code that touches external systems: What commands can it send? Does it check state before every action? What happens on network failure? What happens in error states? Can the operator stop it with a single command?
  • “Silent failures go unnoticed for weeks” (Pattern 3). An OAuth token expired in February and was not noticed until March, a full month later. Security hooks parsed the wrong JSON format from the day they were written, silently passing through every command for months, discovered only when a test suite executed them with real payloads. The prevention is that verification must test the user experience, not process health. Do not check “is the daemon running?”, check “does the feature actually work?”.
  • “Escalating corrections” (Pattern 5). Printer safety rules escalated four times. “Check state before acting” was ignored, so it became “never restart without permission”, which was ignored, so it became “never restart even after print finishes”, which was still insufficient, so it became a firmware macro that blocks the command at the hardware level. The prevention is that if the same category of mistake is corrected twice, the rule is not strong enough. Make it absolute, add technical enforcement, and remember that the correction means the previous mitigation failed, so understand why before writing a weaker version of the same rule.

The escalation rule is the framework’s enforcement mechanism, ie if an error pattern is corrected twice, the next intervention must include technical enforcement (a hook, a macro, a filesystem protection, or a test), not another text rule. Text rules catch known-knowns when the agent is paying attention; technical enforcement catches everything else.
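
The classification and the escalation rule are simple enough to encode directly. This sketch uses hypothetical names; the responses attached to each class are taken from the text above.

```python
from enum import Enum


class ControlClass(Enum):
    KNOWN_KNOWN = "enforcement"               # rule known but skipped
    UNKNOWN_KNOWN = "better surfacing"        # rule existed but was not consulted
    UNKNOWN_UNKNOWN = "protective defaults"   # nobody knew the action was dangerous


def text_rule_acceptable(times_corrected):
    """After two corrections, the next intervention must be technical enforcement."""
    return times_corrected < 2
```

An RCA template could refuse to accept a proposed fix of type "text rule" whenever `text_rule_acceptable` returns False, turning the escalation rule itself into an enforced check rather than a convention.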

Patterns that now have technical enforcement include: silent failure detection (scenario tests with real payloads), printer safety (Klipper macros plus PreToolUse hooks), token scope management (single-writer LaunchAgent), destructive command protection (PreToolUse hooks for both plist extraction and service management), and observability drift (automated cross-checking of service manifests against monitoring configuration).

The pattern for other engineers is systematic learning through classification, escalation, and enforcement. Maintain a living error-pattern document and review it at the start of every work session. Classify failures by whether the control was missing, present but unconsulted, or known but skipped, and escalate from text rules to technical enforcement after two occurrences. Treat every repeated mistake as evidence that the system needs to change, not that the operator needs to try harder.

The current Tier 1 patterns, each one a mistake that caused real damage, and each one now backed by technical enforcement, are:

  1. Pattern 1 (Fix creates new problem). Every automated printer “helper” (UPS watchdog, auto-speed, power-loss recovery chain) destroyed more prints than it saved. Enforcement: five-question pre-flight checklist plus the observation-without-intervention rule.
  2. Pattern 2 (Safety guards added after the incident). FIRMWARE_RESTART, SAVE_CONFIG and G28 each got state checks only after they killed a twelve-hour print. Enforcement: guard-first coding standard for any external-system command path.
  3. Pattern 4 (Fixes that do not stick). Auto-speed patches reapplied three times before the capability was removed outright. Enforcement: a fix that has failed twice must include technical enforcement, not a stronger text rule.
  4. Pattern 5 (Escalating corrections). Printer safety rules escalated four times before reaching firmware-level enforcement. Enforcement: Klipper macros plus the PreToolUse hook plus the absolute human-approval policy on FIRMWARE_RESTART.
  5. Pattern 9 (Shared token file, multiple writers). Seven scripts wrote the OAuth token with different scope subsets, silently stripping each other’s scopes. Enforcement: token-refresh LaunchAgent as sole writer; all other scripts refresh in memory only.
  6. Pattern 10 (Infrastructure change based on false assumption). Reverting the Tailscale default to the LAN IP broke all remote access. Enforcement: pre-commit hook warns on IP and host changes; review agent fact-checks infrastructure commits.
  7. Pattern 12 (plutil -extract overwrites files in place). Destroyed fourteen LaunchAgent plists in one command that the agent believed was read-only. Enforcement: settings hook blocks plutil -extract without -o; chflags uchg on every plist.
  8. Pattern 13 (LaunchAgent operations require explicit approval). Attempted bootstrap of all fourteen plists without asking. Enforcement: protected_path_hook blocks launchctl state-changing commands.
  9. Pattern 22 (Rogue process, audit blind spot). An ngrok tunnel ran for twelve days exposing ttyd to the public internet, missed by every audit because no check enumerated undeclared listeners. Enforcement: health check now scans for rogue tunnel processes and unexpected listening ports beyond the declared service list.
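
As an illustration of what a technical control for Pattern 12 might look like, the sketch below flags any `plutil -extract` invocation that lacks an `-o` output target. The function name is hypothetical and the real settings hook is not shown in this document; wiring the check into a PreToolUse hook is left out, and the token matching is deliberately conservative, so it may flag some harmless commands.

```python
import os
import shlex


def plutil_extract_is_unsafe(command: str) -> bool:
    """True if the command appears to run `plutil -extract` with no `-o` target."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quoting: fail closed, but only if plutil is mentioned at all.
        return "plutil" in command
    runs_plutil = any(os.path.basename(token) == "plutil" for token in tokens)
    return runs_plutil and "-extract" in tokens and "-o" not in tokens
```

For example, `plutil -extract Label xml1 com.example.plist` is flagged, while the same command with `-o -` (write to standard output) passes. Erring towards false positives is a deliberate choice in this sketch: a blocked read costs a retry, whereas a missed in-place overwrite costs the file.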

The pattern across the list is that text rules catch known-knowns when the agent is paying attention, and technical enforcement catches everything else. The ratio of text rules to technical enforcement in the lessons-learned database is a direct measure of the system’s maturity.

19. Backup and Disaster Recovery

The backup posture followed the same incident-to-enforcement path as the rest of the system. The naive initial implementation was a cron job that copied a hardcoded list of about forty files to Google Drive. It was symptomatic of every anti-pattern this document catalogues: static file lists that silently drifted as the codebase grew, log-and-forget error handling that hid real failures, and a runbook claiming the existence of subfolders that had never been created. An audit in April 2026 exposed all three.

The current design is a single daily differential backup to a private Google Drive folder, triggered by a LaunchAgent at 03:00 with retry-and-alert via ntfy and Simple Mail Transfer Protocol (SMTP). The backup set is defined by glob patterns rather than an explicit file list, so new scripts added to the source tree are captured without manual intervention. Six categories are covered: top-level scripts and configuration under the code directory, the full control plane repository (redundant with GitHub, but copied to Drive so the system survives a simultaneous GitHub-account and Mac Mini loss), daemon wrapper scripts, Claude Code configuration, LaunchAgent plist files, and memory topic files for the non-git-backed project. The manifest tracks 257 files totalling approximately 1.9 megabytes.
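
A glob-defined backup set can be sketched as follows. The patterns and directory names are hypothetical placeholders, since the real backup configuration is not reproduced here; the point is only that the manifest is recomputed from patterns on every run, so a newly added script is picked up without editing a list.

```python
from pathlib import Path

# Hypothetical patterns relative to a root directory (assumed, not the real set).
BACKUP_GLOBS = ["code/*.py", "code/*.sh", "LaunchAgents/*.plist"]


def build_manifest(root: Path, patterns):
    """Expand glob patterns under root into a sorted, de-duplicated file list."""
    files = set()
    for pattern in patterns:
        files.update(p for p in root.glob(pattern) if p.is_file())
    return sorted(files)


def manifest_summary(files):
    """Return (file count, total bytes) for reporting alongside the backup."""
    return len(files), sum(p.stat().st_size for p in files)
```

Logging the summary on every run gives a cheap drift signal: a sudden drop in file count suggests a pattern has stopped matching, which is the failure mode a static list hides.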

Several categories are deliberately excluded. iOS application source trees live in GitHub, and their Xcode build artefacts are regenerable and would otherwise dominate the backup volume. The ChromaDB embedding store is derivable from the JSONL transcripts via a rebuild script. School governance documents are re-downloaded weekly from GovernorHub (Section 17), so they take no space on Drive. The memory git repository is backed up independently by its own git push, so its contents are not duplicated into the Drive set. Each exclusion is a deliberate statement that the item is regenerable, upstream-sourced, or accepted as lost on disk failure.

The disaster recovery runbook defines a one-hour Recovery Time Objective for full Mac Mini disk loss, documented as an ordered checklist. First, reinstall macOS and rejoin the Tailscale mesh. Next, clone the control plane before any other repository (because it orchestrates everything else), then clone the application repositories and restore the memory repository. Finally, download non-source artefacts from Drive, seed the macOS Keychain from printed backup codes kept offline, re-download governance documents from GovernorHub, run the deploy script, and verify against the same smoke tests used to gate ordinary deploys. Each recovery step is numbered so progress is externally visible during execution.
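
The ordered, numbered checklist can be sketched as a small runner that prints its position before each step and halts at the first failure. The step labels follow the runbook order described above; the actions are placeholders that always succeed, and the names `run_checklist` and `RECOVERY_STEPS` are hypothetical rather than taken from the real runbook.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_checklist(steps: List[Step], report=print) -> int:
    """Run steps in order, stopping at the first failure. Returns steps completed."""
    total = len(steps)
    for number, (label, action) in enumerate(steps, start=1):
        report(f"[{number}/{total}] {label}")
        if not action():
            report(f"FAILED at step {number}: {label}")
            return number - 1
    report(f"All {total} steps complete")
    return total


# Placeholder actions: each lambda stands in for a real command or manual check.
RECOVERY_STEPS: List[Step] = [
    ("Reinstall macOS and rejoin the Tailscale mesh", lambda: True),
    ("Clone the control plane repository", lambda: True),
    ("Clone the application repositories", lambda: True),
    ("Restore the memory repository", lambda: True),
    ("Download non-source artefacts from Drive", lambda: True),
    ("Seed the macOS Keychain from offline backup codes", lambda: True),
    ("Re-download governance documents", lambda: True),
    ("Run the deploy script", lambda: True),
    ("Verify against the smoke tests", lambda: True),
]
```

Stopping at the first failure rather than continuing reflects the ordering constraint in the runbook: later steps assume the control plane and repositories are already in place.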

Two trust boundaries matter for the backup’s security model. The Google account itself is the first: plaintext secrets (credentials file, OAuth tokens, TLS private keys, APNs signing key) sit in Drive protected only by Google’s at-rest encryption and the account’s two-factor authentication, and account recovery goes through a family member’s email and the operator’s phone, both outside the Drive trust boundary. This is accepted risk, not an oversight. Encrypting secrets at rest with a user-held key is a roughly fifty-line addition if the threat model ever changes. The second boundary is the GitHub account: the control plane repository is copied to Drive specifically so that a GitHub-account compromise does not leave the system without a recovery path, and the backup set is deliberately redundant at that one point for that one reason.

The pattern is backup-as-system-documentation: the set of files the system chooses to back up is a declarative statement about what it considers irreplaceable, and everything else is regenerable, sourced from an upstream, or deliberately accepted as lost on disk failure. An engineer reading the backup manifest should be able to reconstruct the system’s dependency graph. If they cannot, the manifest is wrong: either it is missing something the system actually depends on, or it is backing up something the system does not actually need.
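
The two ways a manifest can be wrong reduce to two set differences. The sketch below assumes the dependency list and the list of regenerable items are maintained somewhere as plain collections of paths; that assumption, the function name, and the example paths are illustrative only.

```python
def audit_manifest(manifest, dependencies, regenerable):
    """Compare a backup manifest against what the system depends on.

    Returns (missing, unneeded):
      missing  - irreplaceable dependencies that are not in the manifest
      unneeded - manifest entries that are not irreplaceable dependencies
    """
    irreplaceable = set(dependencies) - set(regenerable)
    missing = sorted(irreplaceable - set(manifest))
    unneeded = sorted(set(manifest) - irreplaceable)
    return missing, unneeded


# Hypothetical paths, for illustration only.
missing, unneeded = audit_manifest(
    manifest=["hooks/guard.py", "build/cache.bin"],
    dependencies=["hooks/guard.py", "daemons/wrapper.sh", "embeddings.db"],
    regenerable=["embeddings.db"],
)
```

Here `missing` is `["daemons/wrapper.sh"]` and `unneeded` is `["build/cache.bin"]`; an empty result on both sides is the condition under which the manifest can be read as documentation.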

Appendix: External References

  1. Liu et al. 2023. “Lost in the Middle: How Language Models Use Long Contexts”. arxiv.org/abs/2307.03172
  2. Anthropic. “Introducing Contextual Retrieval”. anthropic.com/research/contextual-retrieval
  3. Anthropic. “Building Effective Agents”. anthropic.com/research/building-effective-agents
  4. Claude Code Documentation. “How Claude remembers your project”. code.claude.com/docs/en/memory
  5. Claude Code Documentation. “Create custom subagents”. code.claude.com/docs/en/sub-agents
  6. HumanLayer. “12 Factor Agents”. github.com/humanlayer/12-factor-agents