macOS Accessibility Tree For Agents: The Part Every Other Guide Skips Is The Diff After The Click
Every article about macOS accessibility trees for AI agents teaches you the tree. What those articles skip is what agents actually want on iteration: not the whole tree again, just what changed. mcp-server-macos-use does this with a one-flag switch on click, type, press, and scroll. The server snapshots before, performs the action, snapshots after, filters scroll-bar and empty-container noise, then writes a flat + / - / ~ diff to disk. Grep-able, token-cheap, and specific.
Every Other Page Stops At "The Tree"
Search the keyword and every top result explains the same thing. The macOS accessibility tree is a structured hierarchy of UI elements exposed by the AXUIElement APIs. Buttons, text fields, menus, rows, cells. Structured, semantic, cheaper than screenshots. Fazm's blog post says it. Screen2AX's paper says it. Ghost OS's docs say it. MacPaw's macapptree repo says it. Peek says it. agent-native says it.
That is half the answer. An agent reads the tree once and uses it to pick a target. Then it clicks. Then what? The agent needs to know whether the click worked, whether the Send button became disabled, whether a new row appeared in the table, whether a sheet opened. The naive answer is to re-traverse and diff in the model's head. That is expensive and error-prone. The specific answer mcp-server-macos-use encodes is: the server already has the before-tree, so the server does the diff.
That is the part the other pages skip. The tree is the input. The diff is the signal. The rest of this page is how the diff is constructed, what gets filtered out before it reaches the agent, and what the resulting flat file looks like on disk.
The Numbers That Anchor The Pattern
main.swift:1408 defines the tool list. click, type, press, scroll set showDiff=true. open and refresh do not.
Full Traversal After Every Click, Versus Diff-Only
The same click, narrated two ways.
What the agent reads after click_and_traverse
Agent calls click_and_traverse. Server returns the entire post-click accessibility tree — 1,847 elements, ~180 KB of flat text, most of them identical to the pre-click state. The agent has to diff in its own head, re-read the tree, and guess at stable identifiers. Scroll position shifted one pixel? Every message row appears to have 'changed' because its y coordinate moved.
- Thousands of elements, most unchanged
- Scroll pixel drift looks like UI change
- Agent burns tokens re-reading the world
- Stable identifiers fragile across traversals
How One Tool Call Turns Into A Diff
Inputs on the left are the AX primitives and sources the server reads. Outputs on the right are what gets written to disk for the agent to grep.
Inside a single click_and_traverse call
The Line That Makes Click Tools Emit Diffs
A single flag on the per-tool action options distinguishes mutation-type tools (which agents want diffs for) from snapshot tools (which agents want full traversals for).
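As a rough illustration of that flag (written in Python for brevity; the server itself is Swift, and the names ActionOptions, TOOL_OPTIONS, and needs_before_snapshot are hypothetical), the split between mutation tools and snapshot tools could be modeled like this:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionOptions:
    """Hypothetical stand-in for the server's per-tool action options."""
    show_diff: bool = False  # mirrors the showDiff flag described above


# Tool names come from the article; the mapping itself is an illustrative sketch.
TOOL_OPTIONS = {
    "open_application_and_traverse": ActionOptions(show_diff=False),
    "refresh_traversal": ActionOptions(show_diff=False),
    "click_and_traverse": ActionOptions(show_diff=True),
    "type_and_traverse": ActionOptions(show_diff=True),
    "press_key_and_traverse": ActionOptions(show_diff=True),
    "scroll_and_traverse": ActionOptions(show_diff=True),
}


def needs_before_snapshot(tool: str) -> bool:
    """A diff needs a pre-action traversal, so show_diff implies two traversals."""
    return TOOL_OPTIONS[tool].show_diff
```

The point of keeping this as one boolean is that the traversal and formatting code can stay shared; only the decision to snapshot before the action differs per tool.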
The Writer: Three Prefixes, One File
+ added elements, - removed elements, ~ modified elements followed by the attribute transition. The format is flat because the consumer is an LLM, not a tree-walking parser.
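A minimal sketch of such a writer, in Python rather than the server's Swift. The dictionary keys (role, text, x, y, w, h, visible, changes, attr, old, new) and the function names are assumptions chosen for illustration; only the line shapes follow the formats described on this page.

```python
def format_element(e: dict) -> str:
    """One element per line: role in brackets, text in quotes, four coordinates."""
    line = f'[{e["role"]}] "{e["text"]}" x:{e["x"]} y:{e["y"]} w:{e["w"]} h:{e["h"]}'
    return f"{line} visible" if e.get("visible") else line


def format_modified(e: dict) -> str:
    """Modified entries carry attribute transitions after a pipe separator."""
    changes = ", ".join(
        f"{c['attr']}: '{c['old']}' -> '{c['new']}'" for c in e["changes"]
    )
    return f'[{e["role"]}] "{e["text"]}" | {changes}'


def render_diff(added: list, removed: list, modified: list) -> str:
    """Flatten the three buckets into + / - / ~ prefixed lines."""
    lines = [f"+ {format_element(e)}" for e in added]
    lines += [f"- {format_element(e)}" for e in removed]
    lines += [f"~ {format_modified(e)}" for e in modified]
    return "\n".join(lines)
```

Because every line is self-describing, a consumer can act on any single line without having read the ones around it.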
What A Real Diff File Looks Like On Disk
One Messages "Send" click, post-filter. Seven lines, all semantic.
The Noise Rules: What Never Reaches The Agent
Two filter functions and one coordinate guard. If any of these relaxed, the diff would balloon and stop being readable.
Scroll-bar elements
AXScrollBar, value indicators, page buttons, arrow buttons. They mutate on every pixel of scroll. main.swift:591-597.
Empty structural containers
AXRow, AXCell, AXColumn, AXMenu with no text. Containers alone are not actionable for agents. main.swift:599-607.
Coordinate-only changes
If a modified element only has x/y/width/height deltas, it is filtered out at main.swift:681-682. Window moves do not pollute the diff.
Resolved container text
AXRow with no text? findTextForElement at main.swift:551-589 uses coordinate containment + list proximity to lift text from child AXStaticText before emitting the diff line.
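The filters and the coordinate guard above might look roughly like the following Python sketch. It is not the Swift source: the matching rules (case-insensitive substring match for scroll-bar roles, first-token match for structural roles) and the attribute names in GEOMETRY_ATTRS are assumptions.

```python
SCROLL_NOISE = ("scrollbar", "scroll bar", "value indicator", "page button", "arrow button")
STRUCTURAL_ROLES = {"AXRow", "AXCell", "AXColumn", "AXMenu"}
GEOMETRY_ATTRS = {"x", "y", "width", "height"}


def is_scroll_bar_noise(role: str) -> bool:
    """Scroll-bar parts mutate on every scroll step and are never click targets."""
    lowered = role.lower()
    return any(keyword in lowered for keyword in SCROLL_NOISE)


def is_structural_noise(role: str, text: str) -> bool:
    """Containers with no text of their own give the agent nothing to act on."""
    base_role = role.split(" ", 1)[0]  # "AXRow (row)" -> "AXRow" (assumed shape)
    return base_role in STRUCTURAL_ROLES and not text.strip()


def is_coordinate_only(changed_attrs: list) -> bool:
    """True when every changed attribute is pure geometry (e.g. a window move)."""
    return bool(changed_attrs) and all(a in GEOMETRY_ATTRS for a in changed_attrs)
```

Matching on the first token keeps AXMenuItem from being treated as an AXMenu container, which is a deliberate choice in this sketch rather than a confirmed detail of the server.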
Six Stages From Tool Call To The Diff Landing On Disk
Tool call arrives with showDiff = true
click_and_traverse, type_and_traverse, press_key_and_traverse, and scroll_and_traverse set options.showDiff = true at main.swift:1600, 1617, 1633, and in the scroll branch. That flag forces a pre-action traversal.
Traverse #1: snapshot the tree before the action
MacosUseSDK traverses the target PID's AXUIElement tree. Stats include total element count, processing time in seconds, and per-role counts.
Execute the input event
CGEvent is posted at the auto-centered point (x + w/2, y + h/2) from the click coordinates. If text or pressKey were passed, those chain after.
Traverse #2: snapshot the tree after
Same SDK call against the same PID. If the action handed focus off to a different app, main.swift:1788-1808 also traverses the new frontmost app.
Subtract, filter, enrich
buildToolResponse at main.swift:612 diffs before and after, drops scrollbar and structural noise, drops coord-only modified entries, and marks each added element with in_viewport using the window bounds collected at main.swift:623-629.
Write flat text + screenshot receipt
main.swift:1821-1839 writes /tmp/macos-use/<ts>_<tool>.txt and /tmp/macos-use/<ts>_<tool>.png on the same millisecond-precision timestamp. The response to the MCP client is a compact summary pointing at the file path.
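The subtract step in the middle of that pipeline can be sketched as a plain dictionary comparison. This is a Python illustration under a loud assumption: elements are keyed by a stable identity such as (role, text). How the server actually matches elements across the two traversals is not stated on this page, and diff_snapshots is a hypothetical name.

```python
def diff_snapshots(before: dict, after: dict):
    """Compare two snapshots shaped as {element_key: {attribute: value}}.

    Returns (added_keys, removed_keys, modified), where modified maps a key
    to {attribute: (old_value, new_value)} for every attribute that differs.
    """
    added = [key for key in after if key not in before]
    removed = [key for key in before if key not in after]
    modified = {}
    for key in before.keys() & after.keys():
        old, new = before[key], after[key]
        changes = {
            attr: (old.get(attr), new.get(attr))
            for attr in old.keys() | new.keys()
            if old.get(attr) != new.get(attr)
        }
        if changes:
            modified[key] = changes
    return added, removed, modified
```

Filtering and enrichment would then run over these three buckets before anything is written to disk.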
One Sequence, Two Traversals, One File
The model never sees the two traversals. It sees a summary line plus a file path. The traversals happen on the server because the server is where the before-state still exists.
click_and_traverse on the wire
“The greppable wire format the server writes is one line per node, role in brackets, text in quotes, four coordinate fields, and a trailing 'visible' token if it falls inside any window bounds. That format is produced by formatElementLine in main.swift:979.”
Sources/MCPServer/main.swift
The Flat-Text Line Format, Verbatim
Both the full traversal and the diff use the same per-element line format. The only difference is a leading prefix for diff entries. Grepping the file for a role or a substring of text gives the agent coordinates it can pass directly back into click_and_traverse, which auto-centers at (x + w/2, y + h/2).
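On the agent side, turning one of those lines back into a click point takes a single regular expression. The Python sketch below assumes the line shape quoted on this page and numeric coordinates; parse_line and click_point are hypothetical helper names, not part of the server.

```python
import re

LINE_RE = re.compile(
    r'^(?:(?P<prefix>[+~-]) )?'          # optional diff prefix
    r'\[(?P<role>[^\]]+)\] '             # role in brackets
    r'"(?P<text>.*)" '                   # text in quotes
    r'x:(?P<x>-?\d+(?:\.\d+)?) y:(?P<y>-?\d+(?:\.\d+)?) '
    r'w:(?P<w>\d+(?:\.\d+)?) h:(?P<h>\d+(?:\.\d+)?)'
    r'(?P<visible> visible)?$'
)


def parse_line(line: str):
    """Parse one flat-text element line; returns None if the shape differs."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "prefix": m["prefix"],
        "role": m["role"],
        "text": m["text"],
        "x": float(m["x"]), "y": float(m["y"]),
        "w": float(m["w"]), "h": float(m["h"]),
        "visible": m["visible"] is not None,
    }


def click_point(el: dict):
    """Center of the element, matching the (x + w/2, y + h/2) auto-centering."""
    return (el["x"] + el["w"] / 2, el["y"] + el["h"] / 2)
```

For the example line with x:820 y:612 w:60 h:28, the computed center is (850.0, 626.0).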
Why This Detail Doesn't Show Up In Other Guides
The accessibility tree is a macOS concept. The diff is not. It is an agent-ergonomics pattern that only makes sense once you have a specific loop in mind: the agent acts, the agent needs to know what changed, and the agent is billed per token of context it re-reads. Articles written for developers who want to read the tree treat the tree as the product. Articles written for agents that drive the tree treat the diff as the product.
macos-use is the second kind. The surface area is six tools; four of them return diffs, two return trees; all six write the same line format; everything lands in /tmp/macos-use/ as a .txt plus a .png pair the agent can reference after the fact.
Try it: one click, one diff, one receipt
Clone the repo, swift build, point Claude Desktop at the binary, call click_and_traverse on any Mac app. Watch /tmp/macos-use/ fill with <ts>_<tool>.txt files that each hold the specific accessibility-tree diff for that single action.
Read the source on GitHub →
Frequently asked questions
What is the macOS accessibility tree, in the form an agent actually receives it here?
It is a flat list, not a nested JSON blob. macos-use writes one element per line to /tmp/macos-use/<timestamp>_<tool>.txt in the shape '[AXButton (button)] "Send" x:820 y:612 w:60 h:28 visible'. That format is produced by formatElementLine and buildFlatTextResponse in Sources/MCPServer/main.swift:991-1048. Role in brackets, text in quotes, four coordinate fields, and a trailing 'visible' token if the element falls inside the current window bounds. The agent greps the file; it does not parse a tree. The tree is a detail of how the Accessibility APIs expose the data; the wire format the model reads is one line per node.
Why does mcp-server-macos-use return a diff instead of the full tree after a click?
Because re-sending the tree on every action is expensive and misleading. A typical app traversal is thousands of elements; most of them did not change. The file an agent actually needs after a click contains only the new Send button, only the label that flipped from 'Message' to 'Sending…', and nothing else. buildToolResponse at main.swift:612 branches on the hasDiff flag. For click, type, press, and scroll, the handlers at main.swift:1600, 1617, 1633, and the scroll branch set options.showDiff = true, which forces a traverseBefore pass, runs the action, traverses after, then subtracts. open_application and refresh_traversal keep the full-traversal path at main.swift:719-722.
What exactly gets filtered out of the diff as noise?
Two filters, applied in this order, plus one coordinate guard. isScrollBarNoise at main.swift:591-597 drops any element whose role matches scrollbar, scroll bar, value indicator, page button, or arrow button. Those mutate every time scroll position shifts by a pixel and are never actionable for agents. isStructuralNoise at main.swift:599-607 drops AXRow, AXCell, AXColumn, AXMenu, and outline-row elements when they have no text of their own; containers alone are not actionable. Finally, coordinate-only changes on modified elements (x, y, width, height) are dropped at main.swift:681-682 so you do not see a window move and assume the UI changed.
How many tools does this MCP server actually expose?
Six. The full list is declared at main.swift:1408: open_application_and_traverse, click_and_traverse, type_and_traverse, press_key_and_traverse, scroll_and_traverse, and refresh_traversal. click, type, press, and scroll carry the diff contract. open and refresh return the full enriched traversal. Every tool writes a flat .txt and a .png screenshot to /tmp/macos-use/ on the same timestamp.
How does the server decide which elements are 'visible' to the agent?
Multi-window viewport check. main.swift:623-629 collects all window bounds for the target app (not just the main window — Sparkle update dialogs, Preferences, and secondary windows all count). An element is marked in_viewport if its top-left point falls inside any of those rectangles. main.swift:631-638 also checks for AXSheet children (save, open, and attached dialogs). If a sheet is present, the viewport is scoped to the sheet bounds instead, so 'visible' means 'visible inside the active sheet'. That is why the flat-text file appends 'visible' only to elements an agent could plausibly click right now.
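A small Python sketch of that check, assuming rectangles are (x, y, width, height) tuples and that the left and top edges count as inside. Both the edge handling and the function names are assumptions for illustration.

```python
def contains(rect, px, py) -> bool:
    """Point-in-rectangle test for an (x, y, width, height) tuple."""
    x, y, w, h = rect
    return x <= px < x + w and y <= py < y + h


def in_viewport(el_x, el_y, window_bounds, sheet_bounds=None) -> bool:
    """An element counts as visible if its top-left point is inside any window.

    When a sheet is present, the viewport is scoped to the sheet alone.
    """
    rects = [sheet_bounds] if sheet_bounds is not None else window_bounds
    return any(contains(r, el_x, el_y) for r in rects)
```

Scoping to the sheet first is what keeps controls behind a modal from being reported as clickable.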
What is the action-chaining optimization this server encodes?
click_and_traverse accepts text and pressKey arguments so one tool call performs click + type + press with a single round trip. The additionalActions array is populated at main.swift:1604-1610. type_and_traverse also accepts pressKey at main.swift:1620-1624. The server instructions string at main.swift:1414-1432 tells the model explicitly to prefer one combined call over three. So 'type a Slack message and send it' is a single JSON-RPC request, not three. Only one diff is produced — the diff between the pre-click tree and the post-press tree.
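For illustration, here is how a client might assemble that single request in Python. The tools/call envelope follows the MCP convention, and text and pressKey are the argument names mentioned above; the pid, x, and y argument names and the example key value are assumptions, so check the tool's input schema before relying on them.

```python
import json


def chained_click_request(request_id, pid, x, y, text=None, press_key=None) -> dict:
    """Build one JSON-RPC request that clicks, then optionally types and presses a key."""
    arguments = {"pid": pid, "x": x, "y": y}  # argument names assumed
    if text is not None:
        arguments["text"] = text
    if press_key is not None:
        arguments["pressKey"] = press_key
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "click_and_traverse", "arguments": arguments},
    }


if __name__ == "__main__":
    request = chained_click_request(1, pid=4242, x=850, y=626,
                                    text="On my way", press_key="Return")
    print(json.dumps(request, indent=2))
```

Leaving out text and press_key degrades this to a plain click, so the same builder covers both the chained and unchained cases.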
Why flat text instead of JSON?
Two reasons. Grep-ability and token cost. The flat format at main.swift:979-988 is 'prefix [role] "text" x:N y:N w:W h:H visible', one element per line. An agent that wants every AXButton runs 'grep -n AXButton <filepath>' and gets a list of clickable targets with coordinates, without parsing JSON. The hint line the server returns with every summary at main.swift:761 literally shows the grep command. Token-wise, each element is one line (~50-80 chars) versus a JSON object with field names per entry (~150-200 chars). On a 2,000-element app that is a meaningful difference in context.
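An agent without shell access can get the same result as grep -n with a few lines of Python. This is a generic sketch, and grep_n is a hypothetical helper, not something the server ships.

```python
from pathlib import Path


def grep_n(path, needle: str):
    """Return (line_number, line) pairs containing needle, like `grep -n`."""
    hits = []
    with Path(path).open(encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if needle in line:
                hits.append((number, line.rstrip("\n")))
    return hits
```

Each hit already carries the coordinates, so no second lookup is needed before the next click.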
How does the diff surface attribute-level state changes like a button going from disabled to enabled?
Modified elements carry a changes array. Each entry has attributeName, oldValue, newValue, addedText, and removedText fields (main.swift:684-691). When the diff is flattened to text at main.swift:1019-1027, the line reads: ~ [AXButton] "Send" | AXEnabled: 'false' -> 'true'. That is one tilde prefix, the role, the current text, a pipe separator, then the attribute transition. Multiple attribute changes on the same element join with ', '. That is the single line an agent reads to know the Send button is now clickable, with no re-traversal needed.
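Reading such a line back into structured transitions is straightforward. The Python sketch below assumes the tilde-line shape described in this answer and values that contain no single quotes; parse_modified is a hypothetical helper name.

```python
import re

MODIFIED_RE = re.compile(r'^~ \[(?P<role>[^\]]+)\] "(?P<text>.*)" \| (?P<changes>.+)$')
CHANGE_RE = re.compile(r"(\w+): '(.*?)' -> '(.*?)'")


def parse_modified(line: str):
    """Split a '~' diff line into role, text, and (attribute, old, new) triples."""
    m = MODIFIED_RE.match(line.strip())
    if m is None:
        return None
    return {
        "role": m["role"],
        "text": m["text"],
        "changes": CHANGE_RE.findall(m["changes"]),
    }
```

An agent waiting for a button to become clickable can then test for ("AXEnabled", "false", "true") in the changes list instead of string-matching the whole line.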
Does the diff catch a cross-app handoff, like a click that launches Mail and hands off focus?
Yes, at main.swift:1788-1808 (approximately). After the action, the handler checks the current frontmost app. If the PID differs from the one the tool was called with, it re-traverses the new frontmost app and populates appSwitchPid and appSwitchTraversal on the response. The flat-text file appends an 'app_switch:' header at main.swift:1030-1037 followed by the new app's element list. The compact summary includes the new app name and PID. The agent does not lose the thread: one tool call, one .txt, but two traversals when focus escapes.
What does 'findTextForElement' solve that naive diff code misses?
Container rows in AXTable and AXOutline often have no text of their own; the text lives in a child AXStaticText. A naive diff that emits '~ [AXRow] ""' is useless. findTextForElement at main.swift:551-589 runs two strategies. Strategy 1 is coordinate containment: find the text-bearing child whose point falls inside the container's bounds. Strategy 2 is list proximity: because the traversal is depth-first, children follow the parent in the flat list; walk the next few entries with ±2px coordinate tolerance and lift the first non-empty text. That is how diff lines for a Messages chat row come back as '~ [AXRow] "Hey, are you free Friday" | ...' instead of '~ [AXRow] ""'.
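The two strategies can be sketched in Python as follows. This is a guess at the shape of the logic, not the Swift implementation: the lookahead of five entries and the reading of the ±2px tolerance as "x or y origin within 2 pixels of the container's origin" are both assumptions, as are the dictionary keys.

```python
def find_text_for_element(elements: list, index: int, lookahead: int = 5, tol: float = 2) -> str:
    """Resolve display text for a container that has none of its own.

    elements is the flat depth-first traversal; each entry is a dict with
    text, x, y, w, h keys.
    """
    container = elements[index]
    if container["text"]:
        return container["text"]

    # Strategy 1: coordinate containment. Take the first text-bearing element
    # whose origin falls inside the container's bounds.
    for i, el in enumerate(elements):
        if i == index or not el["text"]:
            continue
        inside_x = container["x"] <= el["x"] <= container["x"] + container["w"]
        inside_y = container["y"] <= el["y"] <= container["y"] + container["h"]
        if inside_x and inside_y:
            return el["text"]

    # Strategy 2: list proximity. Children follow the parent in a depth-first
    # list, so scan the next few entries for text near the container's origin.
    for el in elements[index + 1:index + 1 + lookahead]:
        near = abs(el["x"] - container["x"]) <= tol or abs(el["y"] - container["y"]) <= tol
        if el["text"] and near:
            return el["text"]
    return ""
```

The second strategy matters mostly for containers that report zero-size or stale bounds, where containment alone finds nothing.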
Where does this leave a non-macOS agent like one targeting Windows?
Complementary tool, different host. macos-use is the macOS half; Terminator is the Windows half and uses UI Automation instead of AXUIElement. Both speak MCP to the same client. An agent routed to different hosts can hold the same mental model (a tool call returns a diff plus a screenshot receipt) across both operating systems. The specific filters (AXRow vs ListItem, AXScrollBar vs ScrollBar) differ; the pattern does not.
How would I verify any of this on my own machine?
Clone the repo, xcrun --toolchain com.apple.dt.toolchain.XcodeDefault swift build, point Claude Desktop or any MCP client at the binary. Call open_application_and_traverse with a small app (Calculator is good). Then call click_and_traverse with the coordinates of the '7' button. Open /tmp/macos-use/ in another terminal — you will see one <ts>_open_application_and_traverse.txt with a full tree and one <ts>_click_and_traverse.txt that starts with '# diff: +N added, -N removed, ~N modified' and contains maybe a dozen lines instead of a thousand. That is the pattern the page is describing, live.
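To check the result programmatically rather than by eye, the header line can be parsed for its three counts. This Python sketch assumes the header shape quoted above; parse_diff_header is a hypothetical helper.

```python
import re

HEADER_RE = re.compile(r"^# diff: \+(\d+) added, -(\d+) removed, ~(\d+) modified")


def parse_diff_header(first_line: str):
    """Return the added/removed/modified counts, or None if this is not a diff file."""
    m = HEADER_RE.match(first_line.strip())
    if m is None:
        return None
    added, removed, modified = (int(g) for g in m.groups())
    return {"added": added, "removed": removed, "modified": modified}
```

A file whose first line does not parse is presumably a full traversal from open_application_and_traverse or refresh_traversal rather than a diff.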