
What Are MCP Servers? The Part Everyone Skips: What The Response Looks Like On Disk

Every article on page one of Google tells you an MCP server is a JSON-RPC 2.0 process that exposes tools to an AI client and returns a text result. Correct, and maybe 80% of the answer for a remote SQL server. For an MCP server that traverses a live macOS app, the remaining 20% is the only interesting part: what do you do when the tool output is 27 KB of structured text and the client has a context budget? macos-use answers that by writing every response to /tmp/macos-use/<ms>_<tool>.txt with a same-basename .png, and returning a ~12-line summary with a grep hint.

Matthew Diakonov, Written with AI

Published April 19, 2026 · 10 min read


Response shape defined in Sources/MCPServer/main.swift:731 and 1822

27,343-byte tree on disk, ~800-byte summary returned to the model

Paired .txt and .png at the same millisecond timestamp, every call

What an MCP server really returns

The on-disk response pattern the spec never names

A tool on macOS can produce a 27 KB accessibility tree

Inline JSON-RPC would cost ~8,000 tokens per call

macos-use writes the tree to /tmp/macos-use/<ms>_<tool>.txt

The client gets a 12-line summary with a grep hint

The model pays tokens only for the lines it actually greps


The Textbook Answer, And Why It Is Not Enough

MCP servers are long-running processes that speak JSON-RPC 2.0 over stdio or HTTP. They advertise three kinds of primitives to an AI client: tools (callable actions with typed arguments), resources (documents the server can read out), and prompts (templated instructions). The client picks which tools to call; the server executes them and returns a CallTool.Result with a content array.
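As a concrete picture of that exchange, here is a minimal Python sketch of the two JSON-RPC 2.0 frames behind one tool call. The `tools/call` method name and the `content` array follow the MCP specification; the request id, the argument names and values (`pid`, `element`), and the result text are placeholders for illustration, not captured traffic.

```python
import json

# Request the client writes to the server's stdin (one JSON object per line).
request = {
    "jsonrpc": "2.0",
    "id": 1,  # placeholder request id
    "method": "tools/call",
    "params": {
        "name": "macos-use_click_and_traverse",
        # Argument names and values are assumptions for this sketch.
        "arguments": {"pid": 1234, "element": "Open"},
    },
}

# Response the server writes to stdout: a result carrying a content array.
response = {
    "jsonrpc": "2.0",
    "id": request["id"],  # must echo the request id
    "result": {
        "content": [{"type": "text", "text": "<summary text goes here>"}],
        "isError": False,
    },
}

wire_request = json.dumps(request)
wire_response = json.dumps(response)
print(wire_request)
print(wire_response)
```

Everything the rest of this page discusses is about what goes inside that single `text` entry.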

That is the answer every generic explainer gives. It is complete for a remote SQL server whose tool output is a paginated row count. It is half the answer for a server whose tool output is a live view of a desktop app.

On macOS, a single accessibility traversal of a moderate-sized app window runs 20 to 100 KB of structured text. Returning that inline would exhaust a meaningful slice of the client's context budget on every call. The design question nobody writes about is: how do you get 27 KB of tree data into a conversation without actually putting 27 KB of tree data into the conversation? macos-use answers that question with a specific on-disk pattern, and everything below is an audit of the exact bytes involved.

What A Single Tool Call Actually Produces

Zoom in on one callTool. The server receives JSON-RPC on stdin, drives the target app through the Accessibility APIs, and fans its result into three artifacts: a large flat-text tree on disk, a paired screenshot on disk, and a compact summary returned over stdout. The client only reads the summary directly.

One callTool, three artifacts, one summary on the wire

The Numbers That Anchor The Pattern

Every number below is an empirical value I measured on this repo today. Clone the repo, run one tool call against an app, and ls -la /tmp/macos-use/ yourself. The ratios are what make the file-paging pattern worth the disk I/O.

27,343 bytes in one real refresh_traversal.txt on my machine

451 accessibility elements in that single file

~12 lines in the summary returned to the client

6 tools registered at main.swift:1408, every one uses this shape

The Two Shapes A Tool Call Can Take

The same tool call can be shaped two ways. The first is the shape most "what is an MCP server" articles imply: the whole payload inline. The second is what macos-use actually returns: a short summary that points at files on disk, which is the only reason long-running agent loops stay affordable.

Inline payload vs. on-disk paged response

The old-school way: return the entire accessibility tree in the JSON-RPC content block. A 27 KB dump becomes roughly 8,000 tokens the model has to read before thinking. With five tool calls in a task, that is ~40,000 tokens burned on raw tree dumps before any decision. Long tasks slow to a crawl and the model runs out of room for actual reasoning. The visual layer (screenshots) never reaches the conversation either, because there is no budget left for it.

27 KB of tree text lands in the JSON-RPC content block
Roughly 8,000 tokens per tool call, before reasoning
Five calls ≈ 40,000 tokens spent on raw tree dumps
No budget left for the paired screenshot
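The arithmetic behind those figures, as a small Python sketch. The per-call numbers (about 8,000 tokens inline, about 200 for the summary) are this article's own estimates rather than tokenizer output, and the function name is only for illustration.

```python
def context_cost(tokens_per_call: int, calls: int) -> int:
    """Total tokens consumed when every call returns tokens_per_call tokens."""
    return tokens_per_call * calls

INLINE_TOKENS = 8_000   # ~27 KB tree returned inline (article's estimate)
SUMMARY_TOKENS = 200    # ~800-byte summary (article's estimate)
CALLS = 5

inline_total = context_cost(INLINE_TOKENS, CALLS)
paged_total = context_cost(SUMMARY_TOKENS, CALLS)

# Prints the two totals and how many times larger the inline total is.
print(inline_total, paged_total, inline_total // paged_total)
```

The paged total covers the summaries only; whatever lines the model later greps out of the file are added on top, which is the point of paying per matching line.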

Anchor fact · what lands on stdout

The Exact ~12 Lines The Client Reads

This is a real summary, emitted by macos-use for a refresh_traversal call against a dev app window. Line-for-line the format is deterministic: status, pid, app, file, file_size, hint, screenshot, summary, visible_elements. Everything the model needs to know either answers the tool call directly or points to where on disk the full answer lives.

stdout — CallTool.Result content[0].text

27 KB → ~200 tokens

“Don't read entire files into context — use targeted grep searches.”

mcp-server-macos-use/CLAUDE.md (repo root, 'MCP Response Files' section)

What Each Field In The Summary Is For

Nine fields: seven fixed, one tool-specific summary line, and a capped visible_elements block. Every field earns its place, and most of them point outward to files or subsequent tool calls.

status

One of success or error. Derived from primaryActionError and traversalError at main.swift:735. The model branches on this before anything else.

pid

The AX PID of the target app. Needed for every follow-up click, type, press, or scroll — those tools all require pid as a parameter.

app

Human-readable app name from the AX traversal. Useful when the tool call caused a cross-app switch; appSwitchPid is also surfaced when present.

file

Absolute path to the flat-text tree, e.g. /tmp/macos-use/1776457209163_refresh_traversal.txt. The model uses Grep / Read on this, not on the summary.

file_size

Bytes and element count. A quick sanity check that the traversal actually captured something. Real values seen in practice: 27,343 bytes / 451 elements.

hint

Literal shell command: `grep -n 'AXButton' <filepath> # search by role or text`. Tells an LLM client exactly how to consume the paged file.

screenshot

Path to a .png with the same basename as the .txt. Captured by captureWindowScreenshot at main.swift:378. Read the PNG to visually verify the tree.

summary

One-line result, tool-specific. For click_and_traverse: `Clicked element 'Open' [AXButton]. 2 added, 0 removed, 1 modified.` Built in the switch at main.swift:776.

visible_elements

Capped inline preview: up to 30 interactive + 10 static text entries from the viewport, emitted by buildVisibleElementsSection. Everything else lives in the file.
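A client that wants the fields programmatically could split the summary as in the Python sketch below. The `key: value` line syntax, the `file_size` wording, and the `pid`, `app`, and `summary` values in `SAMPLE` are assumptions: the sample is hand-written from the field descriptions above, not captured output. `parse_summary` is a hypothetical helper name.

```python
def parse_summary(text: str) -> dict:
    """Split a summary into top-level fields plus the visible_elements lines.

    Assumes each field is a `key: value` line and that every non-blank line
    after a bare `visible_elements:` line belongs to that block.
    """
    fields: dict = {}
    visible = []
    in_visible = False
    for line in text.splitlines():
        if in_visible:
            if line.strip():
                visible.append(line.strip())
            continue
        key, sep, value = line.partition(":")  # split on the first colon only
        if not sep:
            continue
        if key.strip() == "visible_elements":
            in_visible = True
            continue
        fields[key.strip()] = value.strip()
    fields["visible_elements"] = visible
    return fields

# Hand-written sample in an assumed layout (placeholder pid, app, summary).
SAMPLE = """\
status: success
pid: 1234
app: ExampleApp
file: /tmp/macos-use/1776457209163_refresh_traversal.txt
file_size: 27343 bytes (451 elements)
hint: grep -n 'AXButton' /tmp/macos-use/1776457209163_refresh_traversal.txt  # search by role or text
screenshot: /tmp/macos-use/1776457209163_refresh_traversal.png
summary: placeholder one-line result
visible_elements:
  [AXButton (button)] "Open" x:680 y:520 w:80 h:30 visible
"""

parsed = parse_summary(SAMPLE)
print(parsed["status"], parsed["file"])
```

In practice an LLM client reads the summary as text; a parser like this is mostly useful for logging or for tests around the server.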

Anchor code 1 of 2 · main.swift:1822

Where The File Gets Written

Here is the actual block, between the point where the tool finishes executing and the point where CallTool.Result returns over stdio. Directory creation is best-effort. The timestamp is millisecond-precision, so two calls would have to land in the same millisecond with the same tool name to collide. The screenshot helper is a subprocess because ReplayKit leaks framework state into its host.

Sources/MCPServer/main.swift:1822-1842

Anchor code 2 of 2 · main.swift:731

How The Summary Is Built

The function that turns the full tool response into the 12-line text block the client reads. It never serializes the tree; it only appends the fields the client needs to either act directly or know where to grep. Caps at 30 interactive + 10 text visible_elements, 3 text diffs, 60 chars per diff value.

Sources/MCPServer/main.swift:731-884 (excerpted)

See The Files On Disk For Yourself

Run the MCP server, point an AI client at it, fire one tool call. This is the exact shell session you will see. The numbers below are ground truth from /tmp/macos-use/ on my machine today.

Real bytes, real paths

The Path A Tool Call Takes, End To End

Not a list of abstract protocol phases. This is the actual order the macos-use handler runs, from the moment a callTool JSON-RPC frame arrives on stdin to the moment the summary returns on stdout.

One callTool, six steps

1. callTool arrives. The AI client (Claude Desktop, Cursor) sends JSON-RPC over stdio with the tool name and params.
2. Primary action executes. click_and_traverse posts a synthetic CGEvent; open_application launches the app; all of it is guarded by InputGuard.
3. Accessibility tree is traversed. AX APIs walk the target PID's windows; each element becomes one flat-text line: [Role] "text" x y w h visible.
4. Tree + screenshot written to disk. main.swift:1822-1842 writes <ms>_<tool>.txt and shells out to a subprocess to capture the paired .png.
5. Summary assembled. buildCompactSummary at main.swift:731 appends the nine summary fields, ending with the capped visible_elements block.
6. Text content returned. CallTool.Result with content: [.text(summary)]. The model greps the file on demand.
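Steps 4 through 6 can be sketched in a few lines of Python. This is an illustration of the write-then-summarize pattern, not a port of the Swift handler: `write_and_summarize` is a hypothetical name, the `key: value` summary syntax is an assumption, and the screenshot is represented only by its path.

```python
import tempfile
import time
from pathlib import Path
from typing import Optional

PREFIX = "macos-use_"

def write_and_summarize(tool: str, tree_lines: list, out_dir: Path,
                        now_ms: Optional[int] = None) -> str:
    """Write the full tree to disk and return only a short summary."""
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = now_ms if now_ms is not None else int(time.time() * 1000)
    # Strip the server prefix so the file reads <ms>_<tool>.txt.
    short = tool[len(PREFIX):] if tool.startswith(PREFIX) else tool
    txt_path = out_dir / f"{stamp}_{short}.txt"
    body = "\n".join(tree_lines) + "\n"
    txt_path.write_text(body, encoding="utf-8")
    # The tree itself never enters the summary, only pointers to it.
    return "\n".join([
        "status: success",
        f"file: {txt_path}",
        f"file_size: {len(body.encode('utf-8'))} bytes ({len(tree_lines)} elements)",
        f"hint: grep -n 'AXButton' {txt_path}  # search by role or text",
        f"screenshot: {txt_path.with_suffix('.png')}",  # path only; no capture here
    ])

with tempfile.TemporaryDirectory() as tmp:
    print(write_and_summarize(
        "macos-use_click_and_traverse",
        ['[AXButton (button)] "Open" x:680 y:520 w:80 h:30 visible'],
        Path(tmp),
        now_ms=1776457217931,
    ))
```

The sketch writes into a temporary directory so it can be run anywhere; the real server uses /tmp/macos-use as described above.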

The Six Tools, Each One Writing The Same Shape

The server advertises six tools via ListTools. Each one executes different input synthesis (click, type, scroll, keypress, app launch, or just a re-traversal), but all six funnel through the same response path: write the tree, capture the screenshot, build the summary, return it.

Every one of these returns the same ~12-line summary shape

macos-use_open_application_and_traverse — opens an app by name, path, or bundle ID; returns a full traversal in the paired file. Starts most sessions.
macos-use_click_and_traverse — clicks at (x, y) or by `element` text match; optionally types text and presses a key in the SAME call; returns a diff of what changed.
macos-use_type_and_traverse — types into the frontmost field, with an optional pressKey afterwards. Returns a diff paired with a post-type screenshot.
macos-use_press_key_and_traverse — sends a key with optional modifiers (Command, Shift, etc.). Useful for shortcuts (cmd+R, cmd+,) and navigation (Return, Escape, Tab).
macos-use_scroll_and_traverse — posts a scroll wheel event at (x, y). Delta is in lines, not pixels. Needed when target elements are below the fold.
macos-use_refresh_traversal — re-traverses the app without any action. Emits a full tree + screenshot. No diff; useful to re-anchor when the model is confused.

Greppable Role Prefixes The Format Was Designed For

Every line in the on-disk tree starts with its AX role so the model can grep -n 'AXButton' and get every clickable button without loading the rest. The prefixes worth grepping by, ordered by how often they show up in real apps:

AXButton, AXLink, AXTextField, AXTextArea, AXCheckBox, AXRadioButton, AXPopUpButton, AXComboBox, AXSlider, AXMenuItem, AXMenuButton, AXTab, AXStaticText, AXImage, AXGroup, AXCell, AXRow, AXWindow, AXScrollBar, AXValueIndicator

The interactive subset of that list (AXButton through AXTab) is defined at main.swift:916-919 and feeds the inline visible_elements cap. It is also the order the server biases toward when one viewport holds more than 30 interactive elements.
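For clients without a shell, the same role search is a few lines of Python. This sketch does a plain substring match per line, which is what `grep -n 'AXButton'` amounts to for a literal pattern. `grep_tree` is a hypothetical helper, and the sample tree is hand-written: only the `[AXButton (button)]` line shape comes from this article, the other subroles and labels are made up.

```python
import tempfile
from pathlib import Path

def grep_tree(path: Path, needle: str) -> list:
    """Return (line_number, line) pairs containing needle, like `grep -n`."""
    hits = []
    with path.open(encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if needle in line:
                hits.append((number, line.rstrip("\n")))
    return hits

# Hand-written sample lines (illustrative, not a real traversal).
SAMPLE_TREE = "\n".join([
    '[AXWindow (window)] "Demo" x:0 y:0 w:1200 h:800 visible',
    '[AXButton (button)] "Open" x:680 y:520 w:80 h:30 visible',
    '[AXStaticText (text)] "Ready" x:20 y:760 w:60 h:16 visible',
    '[AXButton (button)] "Cancel" x:580 y:520 w:80 h:30 visible',
]) + "\n"

with tempfile.TemporaryDirectory() as tmp:
    tree_file = Path(tmp) / "sample_tree.txt"
    tree_file.write_text(SAMPLE_TREE, encoding="utf-8")
    for number, line in grep_tree(tree_file, "AXButton"):
        print(f"{number}:{line}")
```

Only the matching lines would enter the conversation; the window and static-text lines stay on disk.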

Typical MCP server response vs. macos-use response

Most 'what is an MCP server' guides describe the column on the left. macos-use is the column on the right. The delta is the entire point of this page.

Feature	Typical MCP server	macos-use
Where the tool response lives	Inline in the JSON-RPC text content block	/tmp/macos-use/<ms>_<tool>.txt (plus .png), path returned inline
Summary the model reads first	None, or a free-form blurb	~12-line fixed-format summary with status, file, hint, screenshot
How the model consumes large trees	Full payload enters the conversation	Model greps the file; only the matching lines enter context
Visual ground truth	Optional, usually absent	Same-basename PNG next to every .txt, captured via ReplayKit subprocess
Inline element cap	Unbounded; real trees run 20-100 KB	30 interactive + 10 static text entries (main.swift:863-868)
Line format in the on-disk tree	Usually verbose JSON	[Role] "text" x y w h visible — one line per element, greppable

Frequently asked questions

What are MCP servers in one sentence?

MCP servers are long-running processes that speak JSON-RPC 2.0 to an AI client (Claude Desktop, Cursor, VS Code, Cline) and advertise a list of typed tools the model can call. When the model invokes a tool, the server executes it and returns a result. macos-use is an MCP server that advertises 6 tools over stdio, registered in the allTools array at Sources/MCPServer/main.swift:1408, and drives macOS apps through the Accessibility APIs.

What does an MCP server actually return when a tool is called?

The spec says a CallTool.Result with a content array of text or image blocks. In practice, most generic explainers stop there. The question nobody answers is what happens when the output is large. A traversal of Safari with a few tabs open is easily 60 to 150 KB of structured text. Returning that inline would blow a big chunk of the model's context on a single call. macos-use returns, instead, a ~12-line text summary with a pointer to a file on disk and a grep hint. See buildCompactSummary at main.swift:731.

Where does the file live and what does its name look like?

main.swift:1822 sets outputDir to /tmp/macos-use. Line 1825 makes a millisecond-precision timestamp. Line 1827 builds the filename as <timestamp>_<tool>.txt, stripping the macos-use_ prefix from the tool name. So a click call writes /tmp/macos-use/1776457217931_click_and_traverse.txt and a paired 1776457217931_click_and_traverse.png next to it. The timestamp is the join key between the two files: grep the .txt for coordinates, Read the .png for visual confirmation.

What does the 12-line summary actually contain?

status (success or error), pid, app name, file path, file size in bytes and element count, a grep hint (literally `grep -n 'AXButton' <filepath>`), screenshot path, a tool-specific one-liner (e.g. `Clicked element 'Open'. 2 added, 0 removed, 1 modified.`), up to 3 notable text diffs, and a compact `visible_elements:` section with the interactive elements in the viewport. The line ordering is fixed in buildCompactSummary at main.swift:735 through 884. Every line is the same format across every tool call, so the model can parse the output deterministically.

Why not just return the full tree every time?

A real accessibility tree on this machine right now measures 27,343 bytes and 451 lines for a moderate-sized dev app window. Returned as a JSON-RPC text block it is roughly 8,000 tokens. Multiply by even 5 tool calls in a task and you have burned 40,000 tokens of the model's context budget on raw element dumps before any reasoning happens. Writing the tree to disk and handing back a path is a form of paging: the model loads the 12-line summary for about 200 tokens and pays for the rest of the tree only when a grep or Read actually pulls it in.

Is the grep hint just documentation, or does the server expect the client to use it?

It is a real instruction. The server emits the exact command `grep -n 'AXButton' <filepath> # search by role or text` at main.swift:761. The repo's own CLAUDE.md section 'MCP Response Files' tells clients explicitly: `Don't read entire files into context — use targeted grep searches.` The interactive role list at main.swift:916-919 (AXButton, AXLink, AXTextField, AXTextArea, AXCheckBox, AXRadioButton, AXPopUpButton, AXComboBox, AXSlider, AXMenuItem, AXMenuButton, AXTab) doubles as a set of grep targets, because every line in the tree file starts with its role and can be matched by role or by text.

What is the format of a single element line in the tree file?

`[Role (subrole)] "text" x:N y:N w:W h:H visible`. For example: `[AXButton (button)] "Open" x:680 y:520 w:80 h:30 visible`. Role is the accessibility role, subrole is the human-readable variant, text is the element's label or value, x/y are the top-left point in screen coordinates, w/h are the size, and `visible` is present only if the element is on screen. The click_and_traverse tool accepts exactly these four numbers (x, y, w, h) and centers the click automatically, which is why grep on the tree gives the model everything it needs to click without guessing pixels from a screenshot.
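To make that line format concrete, here is a Python sketch that parses one element line and computes the center point of its rectangle. The regular expression is inferred from the single example above, so treat it as an assumption about the format rather than a specification; `Element` and `parse_element` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

class Element(NamedTuple):
    role: str
    subrole: Optional[str]
    text: str
    x: float
    y: float
    w: float
    h: float
    visible: bool

    def center(self) -> tuple:
        """Midpoint of the element's rectangle, in screen coordinates."""
        return (self.x + self.w / 2, self.y + self.h / 2)

# Pattern inferred from the example line; subrole and `visible` are optional.
LINE_RE = re.compile(
    r'^\[(?P<role>\w+)(?: \((?P<subrole>[^)]*)\))?\] '
    r'"(?P<text>.*)" '
    r'x:(?P<x>-?\d+(?:\.\d+)?) y:(?P<y>-?\d+(?:\.\d+)?) '
    r'w:(?P<w>\d+(?:\.\d+)?) h:(?P<h>\d+(?:\.\d+)?)'
    r'(?P<visible> visible)?$'
)

def parse_element(line: str) -> Optional[Element]:
    """Parse one tree line, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return Element(
        role=match["role"],
        subrole=match["subrole"],
        text=match["text"],
        x=float(match["x"]),
        y=float(match["y"]),
        w=float(match["w"]),
        h=float(match["h"]),
        visible=match["visible"] is not None,
    )

element = parse_element('[AXButton (button)] "Open" x:680 y:520 w:80 h:30 visible')
print(element.center())  # (720.0, 535.0)
```

The center is simply x + w/2 and y + h/2, which is the point a click lands on when the tool is handed all four numbers.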

Why pair every .txt with a .png?

Two different signals for the same moment. The tree tells you what is semantically on screen; the screenshot tells you what is visually on screen. They can disagree. A stale tree from a frame before redraw, an element clipped by a sheet, a modal that has no AX role: any of those will mislead a model that trusts only the tree. The server instructions at main.swift:1417-1420 explicitly say: `Always check the screenshot after interactions (click, type, press) to confirm the action had the intended visual effect.` The PNG is captured out of process by a subprocess helper (captureWindowScreenshot, main.swift:378) that loads ReplayKit in a short-lived child so the parent server doesn't leak framework state.

Do remote MCP servers have this problem?

Mostly no. A SQL MCP server can paginate. A GitHub MCP server returns a JSON blob that is already small. The problem is specific to local MCP servers whose tool output is a live view of some large application state: a DOM snapshot, an accessibility tree, a file system diff, a database schema dump. For those, you either paginate, write to disk and return a path, or compress. macos-use picked write-to-disk because the model's own filesystem tools (Read, Grep) are the most natural way for an LLM client like Claude Desktop or Cursor to consume the output.

Does this mean the MCP server is stateful?

Barely. The server does not persist per-client state; it just writes a file under /tmp/macos-use with a unique millisecond timestamp and forgets it. The AI client is what holds state: the previous tool's file path lives in the conversation log, and the model can grep it again later if it wants. /tmp cleanup is the OS's job. The server itself is as stateless as any tool-call-then-return MCP server; only the response transport is different.

What stops a tool call from ballooning past context anyway?

The summary has a hard ceiling. buildCompactSummary caps visible_elements at 30 interactive and 10 static text entries (main.swift:863-868 via buildVisibleElementsSection) and caps notable text changes at 3 (main.swift:841-857). Single text values in diffs are truncated to 60 characters (main.swift:844-845) and the 'typed' text line to 40 (main.swift:804). The entire summary returned inline is bounded. The full data is still on disk if the model wants it.
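The effect of those caps can be sketched in Python. The limits (30, 10, 3, 60) and the interactive role names come from this article; how the Swift code orders entries or marks truncation is not shown here, so the plain slicing below is an assumption, and the function names are hypothetical.

```python
INTERACTIVE_ROLES = {
    "AXButton", "AXLink", "AXTextField", "AXTextArea", "AXCheckBox",
    "AXRadioButton", "AXPopUpButton", "AXComboBox", "AXSlider",
    "AXMenuItem", "AXMenuButton", "AXTab",
}
MAX_INTERACTIVE = 30
MAX_STATIC_TEXT = 10
MAX_TEXT_DIFFS = 3
MAX_DIFF_CHARS = 60

def truncate(value: str, limit: int = MAX_DIFF_CHARS) -> str:
    """Cut a value to at most `limit` characters (no ellipsis; an assumption)."""
    return value[:limit]

def cap_visible(elements: list) -> list:
    """Keep at most 30 interactive and 10 static-text (role, text) entries."""
    interactive = [e for e in elements if e[0] in INTERACTIVE_ROLES]
    static = [e for e in elements if e[0] == "AXStaticText"]
    return interactive[:MAX_INTERACTIVE] + static[:MAX_STATIC_TEXT]

def cap_diffs(diffs: list) -> list:
    """Keep at most 3 text diffs, each truncated to 60 characters."""
    return [truncate(d) for d in diffs[:MAX_TEXT_DIFFS]]

# 50 buttons and 20 labels in one viewport collapse to 40 inline entries.
crowded = [("AXButton", f"b{i}") for i in range(50)]
crowded += [("AXStaticText", f"t{i}") for i in range(20)]
print(len(cap_visible(crowded)))
```

Because every cap is a constant, the summary size has an upper bound that does not depend on how large the app's tree is.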

Can I see this with my own eyes?

Yes. Clone the repo, `swift build -c release`, point Claude Desktop or another MCP client at the resulting binary, fire a tool call. Then `ls -la /tmp/macos-use/` and you will see the `<ms>_<tool>.txt` and `<ms>_<tool>.png` pair. `wc -c` on the txt gives real bytes. `head -20` on it shows the element format. On this machine, one refresh_traversal of a dev app produced /tmp/macos-use/1776457209163_refresh_traversal.txt at exactly 27,343 bytes and 451 lines, plus a same-basename PNG.
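If you would rather script that check than eyeball `ls`, a Python sketch along these lines lists every .txt that has a same-basename .png, with the byte and line counts `wc` would report. `list_pairs` is a hypothetical helper; the demo runs against a temporary directory with dummy files so it works on any machine.

```python
import tempfile
from pathlib import Path

def list_pairs(directory: Path) -> list:
    """Return (txt, png, byte_count, line_count) for each paired .txt/.png."""
    pairs = []
    for txt in sorted(directory.glob("*.txt")):
        png = txt.with_suffix(".png")  # same basename, different extension
        if png.exists():
            data = txt.read_bytes()
            # Count newlines, which matches `wc -l`.
            pairs.append((txt, png, len(data), data.count(b"\n")))
    return pairs

with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    # Dummy stand-ins for one response pair plus one unpaired file.
    (demo / "1776457209163_refresh_traversal.txt").write_text("a\nb\n", encoding="utf-8")
    (demo / "1776457209163_refresh_traversal.png").write_bytes(b"")
    (demo / "1776457217931_click_and_traverse.txt").write_text("c\n", encoding="utf-8")
    for txt, png, size, lines in list_pairs(demo):
        print(txt.name, png.name, size, lines)
```

Against a real session you would call `list_pairs(Path("/tmp/macos-use"))` after at least one tool call has run.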
