
macOS AI Agent State Memory: Why macos-use Hands The LLM A File Path, Not The AX Tree

The first page of Google for this keyword is LangGraph checkpointers, Mem0, LangMem, and vector stores. Those articles solve the agent's memory of its own reasoning. macos-use solves the agent's memory of your Mac. Every tool call writes the AX tree to /tmp/macos-use/<ts>_<tool>.txt as one grep-friendly line per element, returns only a ~15-line summary plus the file path to the model, and ships instructions on the MCP initialize handshake that tell the LLM to Grep/Read the file rather than load it into context. The screen lives on disk. The agent remembers it by grepping.

Matthew Diakonov
12 min read
Memory policy baked into the MCP initialize handshake at main.swift:1411-1437
One AX element = one grep-friendly line (main.swift:972-989)
~15-line summary + file path on wire; full tree on disk, never inline

The SERP For "macOS AI Agent State Memory" Is Answering The Wrong Question

Every top result for this keyword is a variation on one of three themes. LangGraph checkpointers persisting AgentState across nodes so a graph can resume after a crash. Vector-store articles (Mem0, LangMem, Zep, Letta) on how to store long-term semantic facts about the user. Short-term versus long-term memory taxonomies from the usual platform blogs. All of it is about the agent remembering its own reasoning.

If you are driving the macOS UI with an agent, that is the wrong layer. Your problem is not "what did the model discuss with the user six turns ago". Your problem is "what is actually on the screen right now, and how does the model see it cheaply". Pumping the full AX tree into context after every action works once, and then it bankrupts your token budget as soon as the user opens Slack or Xcode.

macos-use solves the screen-memory layer. The agent never sees the tree directly. It sees a file path pointing at the tree, a ~15-line sample of visible elements, a diff summary, and a matching PNG path. Everything else stays on disk. The product's own MCP handshake teaches the model the recall pattern before the first tool call ever fires.


Use Grep/Read on the file to find specific elements. NEVER estimate coordinates visually from screenshots.

main.swift:1416, 1429 — verbatim from the instructions literal sent on MCP initialize

The Memory-Recall Policy Ships In The MCP Handshake

Most MCP servers either leave instructions empty or use it for ad-hoc usage tips. macos-use uses it to prescribe how the agent should recall UI state. The sentence "Use Grep/Read on the file to find specific elements" is in the server constructor at main.swift:1411-1437. Every MCP client sees it during the initialize request, and it becomes part of the model's system context for this server. You do not need to write a system prompt explaining the memory model — the server already told the model.

Sources/MCPServer/main.swift
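The handshake is easy to inspect from a client log. A minimal sketch, assuming only that the initialize result carries an `instructions` string: the quoted sentence is verbatim from main.swift:1416, but the JSON framing here is illustrative, not captured server output.

```shell
# Sketch: the initialize result as a client might log it. Only the quoted
# instructions sentence is verbatim (main.swift:1416); the JSON framing is
# an illustrative assumption.
cat > /tmp/init_result.json <<'EOF'
{
  "serverInfo": { "name": "SwiftMacOSServerDirect", "version": "1.6.0" },
  "instructions": "Use Grep/Read on the file to find specific elements. NEVER estimate coordinates visually from screenshots."
}
EOF
# The client folds the instructions string into the model's system context.
grep -o 'Use Grep/Read on the file to find specific elements' /tmp/init_result.json
```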

One tool call. Two outputs for the LLM. Two outputs on disk.

A click_and_traverse call captures a traverseBefore AX snapshot, a traverseAfter AX snapshot, and a CGWindow screenshot. The MCPServer then routes:

  • LLM context: 15-line summary
  • LLM context: file path pointer
  • Disk: <ts>_<tool>.txt (grep target)
  • Disk: <ts>_<tool>.png (visual check)

One AX Element = One Grep-Friendly Line

The disk format is deliberate. Every AX element gets exactly one line, with role in brackets, truncated text in quotes, and integer coordinates inlined. That is what makes the tree grep-addressable instead of parse-required. The agent never parses — it slices by role, by substring, by the visible suffix, or by the diff prefix.

Sources/MCPServer/main.swift
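The shape is easy to exercise from a plain shell. A sketch against a synthetic file: the element lines follow the format the FAQ quotes from formatElementLine, but the directory, timestamp, and content are made up.

```shell
# Synthetic snapshot in the documented one-element-per-line shape.
# Directory, timestamp, and element lines are illustrative.
mkdir -p /tmp/macos-use-demo
f=/tmp/macos-use-demo/1700000000000_click_and_traverse.txt
cat > "$f" <<'EOF'
# Slack — 3 elements (0.01s)
[AXButton (button)] "Send" x:820 y:612 w:60 h:28 visible
[AXTextField (text field)] "Message #general" x:120 y:600 w:640 h:36 visible
[AXStaticText (text)] "Thread" x:40 y:80 w:90 h:20
EOF
grep -n 'AXButton' "$f"     # slice by role
grep -n '"Send"' "$f"       # slice by visible text
grep -c 'visible$' "$f"     # count on-screen elements  # → 2
```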

How The Agent Actually Recalls State

Four greps do the work that a naive agent would do by asking the MCP server for the full tree again. The agent's scratchpad holds only the file path from the last tool call. It runs Grep, picks the line it wants, and passes the coordinates into the next tool call. The tree never enters the model's context window.

Recall the Send button without re-reading the tree
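As a sketch, the recall loop looks like this against a synthetic snapshot (filenames and element lines are illustrative, not real server output):

```shell
# Synthetic snapshot; filenames and element lines are illustrative.
dir=/tmp/macos-use-demo
mkdir -p "$dir"
f="$dir/1700000005000_click_and_traverse.txt"
cat > "$f" <<'EOF'
[AXButton (button)] "Send" x:820 y:612 w:60 h:28 visible
[AXButton (button)] "Attach" x:760 y:612 w:60 h:28 visible
[AXTextField (text field)] "Message" x:120 y:600 w:640 h:36 visible
EOF
ls "$dir" | tail -1          # newest snapshot = current screen
grep -n 'AXButton' "$f"      # every button, with coordinates
grep -n '"Send"' "$f"        # the one we want
# pull x and y out of the matching row for the next tool call
grep -n '"Send"' "$f" | sed 's/.*x:\([0-9]*\) y:\([0-9]*\).*/\1 \2/'   # → 820 612
```

The coordinates go straight into the next click call; the tree itself never enters the model's context.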

The + / - / ~ Prefixes Are The Delta Index

For every action that mutates state — click, type, press, scroll — the server writes a diff file instead of a full traversal. Three prefixes: + for added, - for removed, ~ for modified. The agent runs grep '^[-+~]' <file>.txt (dash first, so the bracket expression reads as three literals rather than a character range) and gets exactly the changes from the last action. No subtraction in context. No re-reading the pre-state.

Sources/MCPServer/main.swift
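A sketch of the delta recall against a synthetic diff file: the three prefixed lines reuse the examples quoted in the FAQ below; the path and remaining content are illustrative.

```shell
# Synthetic diff file; prefixed lines reuse the FAQ's examples, the rest
# is illustrative.
mkdir -p /tmp/macos-use-demo
f=/tmp/macos-use-demo/1700000010000_click_and_traverse.txt
cat > "$f" <<'EOF'
# diff: +1 added, -1 removed, ~1 modified
+ [AXStaticText] "Sending…"
- [AXButton] "Send"
~ [AXTextField] | AXValue: 'Hey are you free Friday' -> ''
[AXButton (button)] "Attach" x:760 y:612 w:60 h:28 visible
EOF
grep '^# diff' "$f"     # one-line change summary
grep '^[-+~]' "$f"      # only the changed elements (dash first: three literals)
grep -c '^[-+~]' "$f"   # → 3
```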

What The LLM Actually Sees, Per Turn

Bounded. On the order of 15 lines for most calls, capped by the visible-element sampler before it hits the wire. Compare that against a full traversal of a busy app, which can run 40 to 80 KB of text. The summary is the model's short-term memory of the screen; the file path in the summary is the hook into its long-term memory. Short-term is free. Long-term is one grep away.

One MCP response, as the model sees it
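For illustration, here is a plausible on-wire summary assembled from the fields the article lists (status line, pid, app, file path, screenshot path, diff counts, visible sample). The layout is an assumption, not verbatim server output.

```shell
# Illustrative on-wire summary; field layout is an assumption built from the
# fields the article lists, not captured server output.
cat > /tmp/mcp_summary_demo.txt <<'EOF'
clicked (820, 612) in Slack (pid 4821)
file=/tmp/macos-use/1700000015000_click_and_traverse.txt
screenshot=/tmp/macos-use/1700000015000_click_and_traverse.png
diff: +1 added, -1 removed, ~1 modified
visible_elements (sample):
[AXButton (button)] "Attach" x:760 y:612 w:60 h:28 visible
[AXTextField (text field)] "Message" x:120 y:600 w:640 h:36 visible
EOF
cat /tmp/mcp_summary_demo.txt
wc -l < /tmp/mcp_summary_demo.txt   # comfortably inside the ~15-line budget
```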

Six Design Choices That Make The Memory Cheap

One element per line

formatElementLine at main.swift:972 forces `[Role] "text" x:N y:N w:W h:H visible` on every row. Grep can slice by role, by keyword, or by coordinate range. No parser required.

Timestamped chronological log

Filenames are `<millisecond>_<tool>.txt`. `ls /tmp/macos-use/ | tail -5` is a five-action audit trail. Lexical sort equals temporal sort.

Diff prefixes for recall

+ added, - removed, ~ modified. One grep (`^[-+~]`) returns only what changed since the last traversal. The LLM reads deltas, not trees.

Summary is the agent's short-term memory

~15 lines per response: pid, app, file path, screenshot path, diff counts, up to 30 visible elements. Everything else is one grep away.

Screenshot verifies the tree

Same timestamp stem writes a PNG. The server instructions tell the model to Read it whenever the tree looks suspicious.

Policy ships in the handshake

main.swift:1411-1437 sends 'Use Grep/Read on the file' in the initialize response. The agent knows the memory pattern before its first tool call.

~15 lines: typical summary size on wire
1 file path + 1 screenshot path per call
3 prefixes: + / - / ~ for grep-by-delta
Every tool writes the same on-disk shape

Five Steps Of A Single Recall

1

Tool call returns a pointer

The MCP response contains file=<ts>_<tool>.txt, screenshot=<ts>_<tool>.png, and a ~15-line summary. The full AX tree is on disk, not in the response.

2

Agent decides whether it already knows enough

The summary includes up to 30 visible_elements. Most turns, that is enough — the agent can pick the next action from the summary alone.

3

If not, grep the file

The model runs its Grep tool on the file path, scoped to the role or text it needs. `grep -n 'AXTextField' <file>.txt` pulls every text field with its coordinates.

4

If the tree looks wrong, read the PNG

The server's own instructions tell the model to visually verify the screen when the AX tree seems stale or mismatched. The PNG shares the filename stem, so the model reads it by swapping the extension.

5

Recall across multiple tool calls

Older screens are still on disk. `ls /tmp/macos-use/ | tail -20` gives the last 20 tool calls. The agent can reason about 'what I did three steps ago' by rereading that file, not by re-traversing the app.
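The lexical-equals-temporal property behind step 5 is easy to demonstrate in a scratch directory (the directory and timestamps here are made up; the real log lives in /tmp/macos-use/):

```shell
# Scratch directory standing in for /tmp/macos-use/; timestamps are made up.
dir=/tmp/macos-use-demo-log
mkdir -p "$dir"
for ts in 1700000001000 1700000002000 1700000003000; do
  : > "$dir/${ts}_click_and_traverse.txt"
done
: > "$dir/1700000002500_type_and_traverse.txt"
# ms-precision filenames mean lexical sort equals temporal sort
ls "$dir" | tail -2     # the two most recent tool calls, oldest first
```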

Naive Tree-In-Context vs. Grep-Addressed Disk Memory

The contrast is stark on any busy app. Toggle between the two patterns below.

Two ways an agent can 'remember' the screen

The naive pattern: each tool call returns the entire AX tree as its result text. The agent pastes it into its reasoning, pays the token cost, and does it again next turn. Slack alone can be 40 to 80 KB per turn. Ten turns and the context is full of stale trees.

  • Token cost scales with screen complexity
  • Model context fills up fast on busy apps
  • No efficient way to ask 'what changed since last turn'
  • Stale copies linger in context across turns

The anchor fact, verbatim

The phrase "Use Grep/Read on the file to find specific elements" is on line 1416 of Sources/MCPServer/main.swift. It sits inside the multi-line string literal passed as instructions to the Server(name: "SwiftMacOSServerDirect", version: "1.6.0", ...) constructor at line 1411. Every MCP client this server talks to receives that string during the handshake. That is where the agent's memory-recall policy lives: not in a prompt, not in docs, but in the protocol response the LLM sees on connect.

You can verify by cloning the repo, opening main.swift, and jumping to line 1411. You can verify at runtime by connecting any MCP client with protocol logging enabled and watching the initialize result — the string is right there in the response body.

Framework Agent Memory vs. macos-use Screen Memory

Stack them, do not substitute them. LangGraph/Mem0 solves one layer; macos-use solves a different one.

Feature | Framework memory (LangGraph / Mem0 / LangMem) | macos-use (/tmp/macos-use/)
What is stored | Conversation turns, AgentState dict, tool-call traces | AX tree snapshot + diff + window PNG, per tool call
When it writes | At each graph node, on a configured channel | After every click, type, press, scroll, open, refresh
Storage backend | SQLite, Postgres, Redis, in-memory | Flat .txt + .png pair in /tmp/macos-use/, ms-precision
Recall pattern | Resume the graph from a saved node | Grep the .txt by role/text/coords, Read the .png if unsure
What the LLM sees on wire | Full state object or replayed history | ~15-line summary + file path + screenshot path
Token cost per turn | Scales with state size | Near-constant (summary is bounded by visible-element caps)
Cross-reboot durability | Survives (database-backed) | Ephemeral (/tmp clears on reboot); pair with a checkpointer for long-term

Want a grep-addressable memory layer for your agent's macOS loop?

Fifteen minutes on a call to see how macos-use plugs under your existing LangGraph or Claude Agent SDK stack without rewriting the upper layers.

Frequently asked questions

What does macos-use mean by 'agent state memory'? Is this another LangGraph checkpointer?

No. It is orthogonal to LangGraph, Mem0, LangMem, and vector-DB memory. Those store the agent's reasoning: message history, the AgentState dict, tool-call traces, long-term facts. macos-use stores the agent's view of the screen. After every click, type, press, scroll, open, or refresh, the MCP server writes the current accessibility tree (or a diff of it) to /tmp/macos-use/<ms_timestamp>_<tool>.txt as line-per-element flat text, captures a PNG of the target window at the same timestamp, and returns a ~15-line summary plus both file paths to the model. The LLM's short-term memory of 'what is on screen right now' is therefore a file path, not a tokenized tree. The LLM's long-term memory of 'what the screen looked like earlier' is `ls /tmp/macos-use/ | grep -i slack | tail -5`. This is the memory layer; LangGraph is a layer above it.

Where exactly is this memory policy baked into the product?

It is baked into the MCP handshake itself. main.swift:1411-1437 constructs `Server(name: "SwiftMacOSServerDirect", version: "1.6.0", instructions: ...)` and the instructions literal contains the sentence `Use Grep/Read on the file to find specific elements.` That string is sent to every MCP client during the initialize request and becomes part of the model's system context for this server. You do not need to tell the LLM how to recall UI state; the server tells it on connect. Most MCP servers either omit the instructions field or use it for ad-hoc tips. macos-use uses it to define the agent's memory-recall strategy.

What is the on-disk format, and why is it one element per line?

formatElementLine at main.swift:972-989 emits one AX element as one line: `[AXButton (button)] "Send" x:820 y:612 w:60 h:28 visible`. Role in brackets, truncated text in quotes, integer coordinates prefixed by x:, y:, w:, h:, then `visible` when the element is inside the window viewport. buildFlatTextResponse at main.swift:992-1048 concatenates those lines and prepends headers like `# Slack — 471 elements (0.42s)` and `# diff: +2 added, -1 removed, ~3 modified`. The shape is a deliberate grep target. `grep -n 'AXButton' <file>.txt` returns every button with its coordinates. `grep '^+' <file>.txt` returns only the added elements from the last action. The agent never reads the whole file into its context; it greps for the subsection it needs.

How big is the summary the LLM actually sees per tool call?

Small enough to fit a dozen of them in a single context window. The MCP response text contains: one status line, the PID, the app name, the full file path, the screenshot path, a visible_elements sample (up to 20 interactive plus 10 static-text lines on full traversals, up to 30 on diffs), and a one-line diff summary when applicable. Compare that against a full traversal, which for a busy app like Slack or Gmail can easily be 40 to 80 KB of text, sometimes more. The delta matters because the LLM's attention is a scarce resource. Keeping the on-wire summary short lets the model stay on the task it is actually doing and pull specific elements on demand via a second tool call (Read or Grep on the file path the summary already named).

What are the + / - / ~ prefixes on diff lines?

They map to added, removed, and modified AX elements since the last traversal. buildFlatTextResponse at main.swift:1012-1027 writes them exactly once per element. `+ [AXStaticText] "Sending…"` means the element appeared after the action. `- [AXButton] "Send"` means it disappeared. `~ [AXTextField] | AXValue: 'Hey are you free Friday' -> ''` means the value transitioned from the old string to the new one, with the attribute name and both values inline. The agent recalls 'what changed' by running `grep '^[-+~]' <latest>.txt`. It never has to reason from two full trees subtracted in-context.

Is the memory persisted across agent restarts, or is it throwaway?

It persists for the life of /tmp/macos-use/, which macOS clears on reboot and sometimes earlier. Each file is named <millisecond_timestamp>_<tool_name>.txt so the directory is an append-only chronological log of every tool call the agent made, sorted lexically by name. Equivalent to a flat-file commit log. A thousand tool calls produces a thousand .txt + .png pairs, timestamped to the millisecond. The agent can walk back in time with `ls /tmp/macos-use/ | tail -N`. For durable memory across reboots, pair it with a checkpointer on top (LangGraph, Postgres, whatever) that records file paths alongside conversation state — the filesystem holds the snapshot, the checkpointer holds the pointer.

Why does the MCP response include both a .txt path and a .png path?

Because the accessibility tree lies sometimes. An agent reading only the .txt can be fooled by a stale label, a misidentified role, or a sheet whose AXSheet ancestor was not walked. The .png is captured at the same millisecond timestamp from the same window (with a red crosshair drawn at the click point when relevant) so the agent can visually confirm the state. The convention is reinforced in the server instructions at main.swift:1417-1420: `IMPORTANT: Use the Read tool on this .png file to visually verify the screen state — the accessibility tree alone can be misleading`. Two views, same moment, same filename stem. The agent's memory of the screen is text + image, not text alone.

How does this differ from naive 'dump the tree into the prompt' approaches?

Naive approach: every turn, the agent receives the full AX tree as a tool result, fills its context with thousands of AX lines, pays the token cost, and usually re-reads elements it has already seen. macos-use approach: every turn, the agent receives a summary + path. It pulls specific elements with Grep when it needs them and leaves the rest on disk. If the user says 'click the second send button' the agent runs `grep -n 'Send' <file>.txt | head -5`, picks the row with the matching coordinates, and passes them to click_and_traverse. The tree stays on disk. The model's working memory holds only the decision, not the input to the decision.
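A sketch of that 'second send button' recall against a synthetic file (the snapshot content is illustrative):

```shell
# Synthetic snapshot with two send buttons; content is illustrative.
mkdir -p /tmp/macos-use-demo
f=/tmp/macos-use-demo/1700000020000_open_and_traverse.txt
cat > "$f" <<'EOF'
[AXButton (button)] "Send" x:820 y:612 w:60 h:28 visible
[AXStaticText (text)] "Send a message" x:200 y:300 w:180 h:20 visible
[AXButton (button)] "Send" x:820 y:112 w:60 h:28 visible
EOF
line=$(grep -n '\[AXButton.*"Send"' "$f" | sed -n '2p')   # second send *button*
echo "$line"
coords=$(echo "$line" | sed 's/.*x:\([0-9]*\) y:\([0-9]*\).*/\1,\2/')
echo "$coords"   # → 820,112
```

Note the role filter: grepping for the bare word Send would also match the static text, so the agent scopes by `[AXButton` first.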

Can I inspect this manually without running the MCP loop?

Yes. Build the binary with `xcrun --toolchain com.apple.dt.toolchain.XcodeDefault swift build`, run it with an MCP-capable client like Claude Desktop, and open a Terminal window tailing `ls -lt /tmp/macos-use/ | head`. Trigger any tool call and watch the .txt + .png pair appear within milliseconds. Run `head -1 <file>.txt` for the tree header (app name, element count, processing time), `grep '^# diff' <file>.txt` for the diff summary, and `grep -E '\[AX(Button|TextField|Link)\]' <file>.txt` for interactive elements. This is the same path the agent's own Grep tool follows; you are just running it by hand.

Does the agent ever load the full .txt file into its context?

Only if it decides to, and only for tiny apps. The server never inlines the full tree in the tool response. If the agent chooses to Read the file, that is a deliberate tool call with its own cost. In practice Claude and similar agents default to Grep first (`grep -n 'keyword' <file>.txt`) and only Read as a fallback, which the server instructions reinforce. For common interactive apps (Slack, Gmail, Notion, Xcode) the full .txt is larger than most models would prefer to hold in scratchpad memory, so grep-first is the sustainable pattern.

Is there a risk that the agent loses track because the memory is external?

The opposite: because memory is external and addressable by timestamp, the agent can always re-derive state by rereading the latest file. There is no drift between the agent's internal model and the true state of the machine, because the agent's model is the file, plus what the last tool call added to its summary. If the agent's context gets compressed or evicted, the screen memory does not evaporate — it is still on disk at the same paths. The loop can resume from any tool-call boundary by pointing at the newest file in /tmp/macos-use/.

How does this stack with framework-level agent memory like LangGraph or Mem0?

Stack them. LangGraph checkpoints the graph state (which node is next, what messages have been exchanged) into SQLite or Postgres. Mem0 or LangMem store long-term semantic memory (facts about the user, preferences, summaries). macos-use stores the per-action OS state. A full deployment looks like: Postgres holds conversation history, Mem0 holds learned facts about the user, /tmp/macos-use/ holds the physical screen snapshots. Each layer solves a different problem and does not contend for the same bytes. The macos-use layer is the one nobody else is solving because nobody else is driving the macOS UI at the AX level.

macos-use · MCP server for native macOS control
© 2026 macos-use. All rights reserved.