
What 'MCP Server' Actually Means: A Receipt-Pointer Contract, Not A Tool Return Value

The spec answer is that an MCP server exposes tools, resources, and prompts over a JSON-RPC transport. True, and not the interesting part. In macos-use, 'MCP server' means something specific. Six tools are registered at main.swift:1408. Every tool call writes a pair of files to /tmp/macos-use/ with the same millisecond timestamp: {ms_epoch}_{toolname}.txt for the flat-text traversal and {ms_epoch}_{toolname}.png for the window screenshot. What the MCP client sees over the wire is not the traversal but a compact summary that says 'grep this file'. The accessibility tree stays on disk. The model decides what to pull back in.

Matthew Diakonov, Written with AI

Published April 18, 2026 · 8 min read



6 tools declared as one Swift array at main.swift:1408

Every call writes two files sharing a millisecond-precision timestamp prefix

The MCP response payload is a pointer, not the accessibility tree

The MCP response is not the data. It is a pointer to the data.

What the phrase 'MCP server' means inside macos-use

The spec says: an MCP server exposes tools over JSON-RPC

macos-use says: six tools, registered once, each writing to disk

Every tool call produces a {ts}.txt + a {ts}.png in /tmp/macos-use/

The client receives a summary, usually under 500 bytes, not the traversal

The model decides what to grep back into context on the next turn


The SERP Answers The Question About The Protocol. Not About The Implementation Contract.

Search 'mcp server means' and the top results walk you through the Model Context Protocol spec: client, server, transport, tools, resources, prompts. That reading is correct at the protocol level. It is also useless for anyone deciding whether a local MCP server will hold up inside a real agent session with a 200k-context model that fires twenty tool calls in a minute.

The interesting question is: what does the tool return? In every macOS MCP server I checked, the answer is 'the traversal, inline in the response text'. That works for a toy five-step flow. It falls apart when the target is an accessibility tree with 1500 elements, because the client has to carry those elements through every subsequent turn.

macos-use answers the question differently. The tool returns a pointer. The data goes on disk. The client pulls back in only the lines it needs. That choice, more than any MCP spec detail, is what the phrase 'MCP server' refers to inside this codebase.

What Goes In, What Comes Back, What Lands On Disk

Three inputs describe the tool call (name, arguments, target PID). One compact summary comes back over MCP. Two files land in /tmp/macos-use/ with a shared timestamp. The model chooses what, if anything, to read back in.

One tool call, one summary over MCP, two files on disk

The Numbers That Make The Contract Concrete

Each value below comes from a specific line or constant in Sources/MCPServer/main.swift at HEAD. Clone the repo and grep the line. No invented benchmarks.

6 tools registered at main.swift:1408

2 files written per tool call (.txt + .png)

Millisecond precision on the shared filename prefix

1 compact summary returned over MCP

main.swift:1408: line that aggregates the 6 tools into one array

main.swift:1825: line that sets the millisecond timestamp

main.swift:1829: line that writes the flat-text .txt file

main.swift:1842: line that returns the compact summary over MCP

Anchor code 1 of 2

The Entire Receipt-Pointer Contract In 21 Lines

Everything that makes an 'MCP server' mean something specific in this repo happens here: the millisecond timestamp, the tool-name prefix stripping, the two filenames sharing that prefix, and the compact summary built at the bottom. It runs on every single tool call.

Sources/MCPServer/main.swift:1821-1842

The safeName substitution strips the macos-use_ prefix so the on-disk filenames stay short: 1744996800123_click_and_traverse.txt, not 1744996800123_macos-use_click_and_traverse.txt.
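The naming scheme is easy to reproduce outside the server. What follows is a minimal Python sketch of it, not the Swift source: the function name receipt_paths and the constants are my own, and only the prefix, the directory, and the filename pattern come from the text above.

```python
import time
from typing import Optional, Tuple

TOOL_PREFIX = "macos-use_"     # prefix the article says is stripped
OUTPUT_DIR = "/tmp/macos-use"  # directory named in the article


def receipt_paths(tool_name: str, now_ms: Optional[int] = None) -> Tuple[str, str]:
    """Return the (.txt, .png) paths for one tool call (hypothetical helper)."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)  # milliseconds since the Unix epoch
    safe_name = tool_name.replace(TOOL_PREFIX, "")
    stem = f"{OUTPUT_DIR}/{now_ms}_{safe_name}"
    # Both paths are built from the same stem, so they always share a prefix.
    return f"{stem}.txt", f"{stem}.png"


txt, png = receipt_paths("macos-use_click_and_traverse", now_ms=1744996800123)
print(txt)
print(png)
```

Passing now_ms explicitly makes the result reproducible; leaving it out uses the current clock, as a live call would.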

Inline Payload vs. Receipt Pointer, Same Information

Left: the shape other macOS MCP servers return. Right: the shape macos-use returns. Same underlying facts (status, PID, diff, traversal), delivered differently. The right side is what an 'MCP server' means in this repo.

The same tool call, two response shapes

# A typical 'inline' macOS MCP response. Every tool call carries the tree.
# Each element is a JSON object. Each click echoes 1000+ of them.
# Over a 50-step session, the model's context window fills with duplicates.
{
  "status": "success",
  "pid": 1247,
  "elements": [
    { "role": "AXWindow", "x": 0, "y": 0, "w": 1512, "h": 982 },
    { "role": "AXToolbar", "x": 12, "y": 38, "w": 1488, "h": 52 },
    { "role": "AXButton", "text": "Reload", ... },
    { "role": "AXButton", "text": "Back", ... },
    { "role": "AXGroup", ... },
    { "role": "AXTabGroup", ... },
    // ... 1477 more entries, every call ...
  ],
  "diff": {
    "added": [ /* ... */ ],
    "removed": [ /* ... */ ],
    "modified": [ /* ... */ ]
  }
}

Every response carries the full tree, often hundreds to 1000+ elements
Duplicate elements accumulate in model context across turns
Model cannot skip past data it does not need; it is all inline
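To put a rough number on the accumulation, here is a back-of-the-envelope Python sketch. The byte counts are the figures quoted elsewhere in this guide (a 74821-byte traversal file and a summary of around 400 bytes). The assumption that every response stays in context for the whole session is mine, and the sketch ignores whatever lines the model later greps back in.

```python
FULL_TRAVERSAL_BYTES = 74_821  # file_size from the guide's sample summary
SUMMARY_BYTES = 400            # the guide's "around 400 bytes" figure
STEPS = 50                     # session length used in the inline example


def carried_bytes(per_call_bytes: int, steps: int) -> int:
    """Bytes added to the conversation if every response stays in context."""
    return per_call_bytes * steps


inline_total = carried_bytes(FULL_TRAVERSAL_BYTES, STEPS)
pointer_total = carried_bytes(SUMMARY_BYTES, STEPS)

print(f"inline:  {inline_total:,} bytes")
print(f"pointer: {pointer_total:,} bytes")
print(f"ratio:   {inline_total / pointer_total:.0f}:1")
```

With these inputs the inline shape carries about 3.7 MB across the session and the pointer shape about 20 KB, which is where the roughly 200:1 figure comes from.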

Watch The Pair Land After One Click

After a single click_and_traverse call, list /tmp/macos-use/ by mtime. The newest four entries are always the last two tool calls: each call writes its .txt and .png with the exact same prefix. Then pull two lines back with grep. Everything else stays on disk.

A real session transcript

Four Things Happen On Every Tool Call, In Order

From the tool handler's perspective, the receipt-pointer contract is four steps. Each one maps to a specific line range in main.swift. The sequence does not vary between the 6 tools; only the primary action in step one differs.

1. Perform primary action
Click, type, press, scroll, open, or just re-traverse. Wrapped in the InputGuard at main.swift:1696 for the five disruptive tools.
2. Build flat-text response
buildFlatTextResponse at main.swift:992 renders the traversal or diff into one-line-per-element text. No JSON, no escaping.
3. Write .txt + .png with shared timestamp
main.swift:1825 grabs the ms-epoch timestamp; lines 1827 and 1834 interpolate the same integer into both filenames. Collision-proof per call.
4. Return compact summary over MCP
main.swift:1842 calls buildCompactSummary. The result goes into an MCPContent.text and back to the client. File path + grep hint included, payload excluded.
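The four steps above can be sketched end to end in a few lines of Python. This is an illustration of the sequence, not a port of the Swift handler: handle_tool_call and fake_click are hypothetical names, the action is a stub that returns pre-rendered lines, the .png is an empty placeholder, and the summary carries only a subset of the real fields.

```python
import os
import tempfile
import time


def handle_tool_call(tool_name, action, out_dir):
    """Run one tool call and return only a short pointer summary."""
    # Step 1: perform the primary action; the stub returns rendered element lines.
    elements = action()
    # Step 2: build the flat-text response, one element per line, no JSON.
    flat_text = "\n".join(elements) + "\n"
    # Step 3: write both files under one shared millisecond timestamp.
    os.makedirs(out_dir, exist_ok=True)
    stem = os.path.join(out_dir, f"{int(time.time() * 1000)}_{tool_name}")
    with open(stem + ".txt", "w", encoding="utf-8") as fh:
        fh.write(flat_text)
    with open(stem + ".png", "wb") as fh:
        fh.write(b"")  # placeholder; a real server would store a screenshot here
    # Step 4: return the pointer, never the traversal itself.
    size = os.path.getsize(stem + ".txt")
    return "\n".join([
        "status: success",
        f"file: {stem}.txt",
        f"file_size: {size} bytes ({len(elements)} elements)",
        f"hint: grep -n 'AXButton' {stem}.txt",
        f"screenshot: {stem}.png",
    ])


def fake_click():
    # Stand-in for a real click followed by a traversal.
    return ['[AXButton (button)] "Open" x:680 y:520 w:80 h:30 visible']


demo_dir = tempfile.mkdtemp(prefix="macos-use-demo-")
summary = handle_tool_call("click_and_traverse", fake_click, demo_dir)
print(summary)
```

The element line ends up in the .txt file only. The returned summary mentions the path and the element count, but none of the element text.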

The Six Tools That Share One Return-Value Contract

Every tool below writes the same .txt + .png pair and returns the same summary shape. The tools differ in what the primary action does, and whether they engage the InputGuard, but the receipt-pointer contract is identical across all six.

macos-use_open_application_and_traverse

Activate an app by name, path, or bundle ID, then return the full accessibility tree. The .txt holds every element; the .png holds the chosen window after the app becomes frontmost. main.swift:1301.

macos-use_click_and_traverse

Click at x,y (or search for an element by text) and return a diff. Optionally composes into click→type→press→traverse in one call. main.swift:1329.

macos-use_type_and_traverse

Type text into the focused field and return a diff. Optional pressKey chains a final key event. main.swift:1349.

macos-use_press_key_and_traverse

Press one key with optional modifiers (Return, Tab, Escape, Cmd+W, Cmd+Shift+A) and return a diff. main.swift:1384.

macos-use_scroll_and_traverse

Scroll deltaY at x,y in the target app, then return a diff so the model knows which elements entered or left the viewport. main.swift:1402.

macos-use_refresh_traversal

The only non-disruptive tool. No input events, no InputGuard engagement. Still writes the same .txt + .png pair, so its output is indistinguishable from the others on disk. main.swift:1363.

Anchor code 2 of 2

The Summary The Client Actually Sees

The function below builds the text content of the MCP response. No JSON. No escaping. Lines a model can parse on its first token. The grep hint line is deliberate: it tells the model how to interact with the .txt file without loading it.

Sources/MCPServer/main.swift:731-765

file: /tmp/macos-use/1744996800123_click_and_traverse.txt
file_size: 74821 bytes (1483 elements)
hint: grep -n 'AXButton' /tmp/macos-use/1744996800123_click_and_traverse.txt # search by role or text
screenshot: /tmp/macos-use/1744996800123_click_and_traverse.png

Those four lines are the pointer itself. The rest of the summary (status, pid, app name, and a tool-specific one-liner) wraps around them. Everything else lives on disk at the paths above, which stay readable for the rest of the session and are cleaned up at the next reboot, because /tmp is reset on reboot.
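Because the summary is plain "key: value" text, a client can pick it apart without a JSON parser. The Python sketch below parses the four sample lines shown above. parse_summary is a hypothetical helper, and it assumes every field sits on its own line with a colon-space separator, which is what the sample shows.

```python
import re

SAMPLE = """\
file: /tmp/macos-use/1744996800123_click_and_traverse.txt
file_size: 74821 bytes (1483 elements)
hint: grep -n 'AXButton' /tmp/macos-use/1744996800123_click_and_traverse.txt # search by role or text
screenshot: /tmp/macos-use/1744996800123_click_and_traverse.png
"""


def parse_summary(text):
    """Split 'key: value' lines into a dict, using the first ': ' as separator."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # skip lines that are not key/value pairs
            fields[key.strip()] = value.strip()
    return fields


fields = parse_summary(SAMPLE)
size_match = re.match(r"(\d+) bytes \((\d+) elements\)", fields["file_size"])
byte_count, element_count = (int(g) for g in size_match.groups())

print(fields["file"])
print(byte_count, element_count)
```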

Roughly 200:1, on-disk file size to MCP summary size

“MCP tool responses are saved as flat text files to /tmp/macos-use/ to reduce context bloat. Each tool call returns a compact summary plus a file path instead of the full traversal data.”

project CLAUDE.md, Sources/MCPServer/main.swift:1821-1842

What stays on disk so the MCP response can stay small

1500-element accessibility tree
diff lines with +/-/~ prefixes
x/y/width/height per element
viewport visibility flag per element
window bounds from CGWindowList
app name + PID
cross-app handoff traversal
AXSheet overlay detection
red crosshair PNG annotation
per-tool processing time in seconds
stderr log around the subprocess call
full element role taxonomy

What the receipt-pointer contract guarantees

Every tool call produces a .txt + .png pair with an identical millisecond-precision filename prefix
The MCP response is a compact summary, never the raw traversal or diff
The model can grep the .txt by role or by text without loading the whole file
The .png shares the same prefix so tool-call-to-screenshot correlation is lexicographic
Compact summary carries status, pid, app name, diff counts, and the grep hint (main.swift:731-830)
refresh_traversal is indistinguishable from the other five tools on disk; same contract, no input events
Filenames strip the macos-use_ prefix via replacingOccurrences so they stay short
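The pairing guarantee in the list above can be checked from the outside. The Python sketch below groups files by stem and reports any stem that is missing its partner. The helper names are mine, and the demo runs against a throwaway directory with made-up filenames so it works on any machine; point it at /tmp/macos-use/ to check a real session.

```python
import os
import tempfile
from collections import defaultdict


def group_receipts(out_dir):
    """Map each filename stem to the set of extensions found for it."""
    groups = defaultdict(set)
    for name in os.listdir(out_dir):
        stem, ext = os.path.splitext(name)
        if ext in (".txt", ".png"):
            groups[stem].add(ext)
    return dict(groups)


def unpaired(out_dir):
    """Return the sorted stems that are missing either the .txt or the .png."""
    groups = group_receipts(out_dir)
    return sorted(s for s, exts in groups.items() if exts != {".txt", ".png"})


demo_dir = tempfile.mkdtemp(prefix="macos-use-demo-")
for name in (
    "1744996800123_click_and_traverse.txt",
    "1744996800123_click_and_traverse.png",
    "1744996800950_type_and_traverse.txt",  # deliberately missing its .png
):
    open(os.path.join(demo_dir, name), "w").close()

print(sorted(group_receipts(demo_dir)))  # sorted stems are also in call order
print(unpaired(demo_dir))
```

Sorting the stems as strings matches call order only while every timestamp has the same number of digits, which holds for 13-digit millisecond epochs.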

Run One Tool Call, Watch The Pair Appear

Point any MCP client (Claude Desktop, Cursor, Cline) at the built binary. Fire a single tool call. Then ls -1t /tmp/macos-use/ and the newest two files are the pair that call produced. Diff their prefix against the previous pair to confirm the millisecond-precision timestamp scheme is working.

git clone https://github.com/mediar-ai/mcp-server-macos-use
cd mcp-server-macos-use
xcrun --toolchain com.apple.dt.toolchain.XcodeDefault swift build -c release

# Point your MCP client at .build/release/mcp-server-macos-use
# After the first tool call:
ls -1t /tmp/macos-use/ | head -4

# Inspect the pointer the client actually received:
# (search the client's MCP log viewer for 'file:')

# Now grep the backing file for whatever you care about:
# (ls already prints the full path, so it is not prefixed again)
grep -n 'AXButton' "$(ls -1t /tmp/macos-use/*.txt | head -1)"

Frequently Asked Questions About What 'MCP Server' Means Here


What does 'MCP server' actually mean for macos-use?

It means six tools registered in one Server object at main.swift:1408 (open, click, type, press_key, scroll, refresh), a stdio transport that carries JSON-RPC, and a strict return-value shape: every tool call writes two files to /tmp/macos-use/ (a flat-text response and a PNG screenshot, both prefixed with the same millisecond timestamp) and returns a compact summary pointing at them. main.swift:1825-1842 is the whole contract. The MCP client never receives the full traversal in its conversation context; it receives a file path and a grep hint, and decides for itself what to pull back in.

Why write a file instead of returning the traversal inline?

Because the accessibility tree for a real macOS app is huge. Opening Safari and traversing its window can produce 1500+ elements across tabs, toolbar buttons, bookmark bars, and sheet overlays. Inlining that into every tool response exhausts a 200k-token context window after a handful of calls. main.swift:1821-1830 takes the flat-text rendering from buildFlatTextResponse (main.swift:992) and puts it on disk at /tmp/macos-use/{ms_epoch}_{safeName}.txt. The MCP summary that does go over the wire is usually under 500 bytes. The client can grep that file for a role or a text label before pulling anything back into context.

What exactly does the .txt file contain?

One element per line, verbatim from buildFlatTextResponse at Sources/MCPServer/main.swift:992. Each line is a role plus an optional text plus coordinates plus viewport status: '[AXButton (button)] "Open" x:680 y:520 w:80 h:30 visible'. For click/type/press/scroll the file holds a diff instead, with + added, - removed, ~ modified prefixes. For open/refresh it holds the whole tree. Diff files also include counts at the top and up to 3 notable text changes; open files include app_name, total element count, and processing time. No JSON, no escaping, no parsing. The file is readable by a model with a single Read tool call.
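A line in that shape is easy to split into fields. The regular expression below is a sketch built from the single example line quoted above, so treat it as an assumption about the format rather than a specification. In particular, I assume the diff prefix is one character followed by a space and that coordinates are integers.

```python
import re

LINE_RE = re.compile(
    r'^(?P<change>[+\-~] )?'     # optional diff prefix: + added, - removed, ~ modified
    r'\[(?P<role>\w+)[^\]]*\]'   # role inside brackets, e.g. [AXButton (button)]
    r'(?: "(?P<text>[^"]*)")?'   # optional quoted text
    r' x:(?P<x>-?\d+) y:(?P<y>-?\d+) w:(?P<w>\d+) h:(?P<h>\d+)'
    r'(?: (?P<visibility>\w+))?$'  # trailing viewport status, if present
)


def parse_element(line):
    """Return a dict of fields for one element line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("x", "y", "w", "h"):
        fields[key] = int(fields[key])
    fields["change"] = fields["change"].strip() if fields["change"] else None
    return fields


plain = parse_element('[AXButton (button)] "Open" x:680 y:520 w:80 h:30 visible')
added = parse_element('+ [AXButton (button)] "Save" x:10 y:20 w:30 h:40 visible')
print(plain)
print(added)
```

Lines that do not match return None instead of raising, which suits header lines such as the counts at the top of a diff file.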

What does the .png file contain?

A CGWindowListCreateImage capture of the target app's chosen window, saved as PNG by the subprocess at Sources/ScreenshotHelper/main.swift, and, if the tool call was a click, a red crosshair plus a ring drawn at the click point so you can visually confirm where the event landed. The screenshot shares the same {ms_epoch}_{toolname} prefix as the .txt, so `ls -1 /tmp/macos-use/1744996800123_click_and_traverse.*` returns the pair. The screenshot-helper is a separate executableTarget declared at Package.swift:25-28; the main server never calls the capture API directly. That isolation has its own story (see the macos-use guide), but from the caller's point of view the .png just appears next to the .txt.

How is the MCP client supposed to use these two files?

The compact summary includes `hint: grep -n 'AXButton' /tmp/macos-use/{file}.txt # search by role or text` at main.swift:761. The intended flow is: the model reads the summary, decides it wants to click 'Open', runs grep 'Open' against the file path, reads the two or three matching lines, pulls out the x/y/width/height of the best match, and passes those back in the next click_and_traverse tool call. The PNG is the tiebreaker when grep's best match is ambiguous or when an action produces an unexpected diff. Most of a productive session never loads the full .txt into model context.
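The grep-then-click flow described above looks roughly like this in Python. It is a sketch under stated assumptions: the three sample lines are made up in the documented line shape, and the argument names in the returned dict (x, y, width, height) are hypothetical, since the tool's input schema is not reproduced in this guide.

```python
import os
import re
import tempfile

GEOMETRY_RE = re.compile(r" x:(-?\d+) y:(-?\d+) w:(\d+) h:(\d+)")


def grep_lines(path, needle):
    """Return (line_number, line) pairs that contain needle, like `grep -n`."""
    with open(path, encoding="utf-8") as fh:
        return [(n, line.rstrip("\n")) for n, line in enumerate(fh, 1) if needle in line]


def click_arguments(line):
    """Pull x/y/width/height out of one element line (argument names are hypothetical)."""
    match = GEOMETRY_RE.search(line)
    if match is None:
        raise ValueError(f"no geometry in line: {line!r}")
    x, y, w, h = (int(g) for g in match.groups())
    return {"x": x, "y": y, "width": w, "height": h}


# Made-up traversal file standing in for /tmp/macos-use/{ms_epoch}_{toolname}.txt
demo_path = os.path.join(tempfile.mkdtemp(prefix="macos-use-demo-"), "demo.txt")
with open(demo_path, "w", encoding="utf-8") as fh:
    fh.write('[AXWindow (window)] "Downloads" x:0 y:0 w:1512 h:982 visible\n')
    fh.write('[AXButton (button)] "Cancel" x:590 y:520 w:80 h:30 visible\n')
    fh.write('[AXButton (button)] "Open" x:680 y:520 w:80 h:30 visible\n')

matches = grep_lines(demo_path, '"Open"')
next_call = click_arguments(matches[0][1])
print(matches)
print(next_call)
```

Only the matching line enters the model's context; the other two lines of the file are never read back.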

What is inside the compact summary that does go over MCP?

Nine short lines built by buildCompactSummary at main.swift:731-830. Status, pid, app name, optional 'dialog: AXSheet detected' line, the file path, file_size in bytes with element count in parentheses, the grep hint, the screenshot path, and a tool-specific one-liner (e.g. 'Clicked at (412,598). +3 added, -1 removed, ~0 modified'). For a typical click_and_traverse that summary is around 400 bytes. The full file on disk for the same call is often 60-80kB. The compression ratio is roughly 200:1, and the model decides what it actually pulls into context.

Why are the filenames timestamped with millisecond precision?

Because two calls to the same tool can land within the same second, and second-precision names would then collide. main.swift:1825 sets `timestamp = Int(Date().timeIntervalSince1970 * 1000)` and both filenames at main.swift:1827 and main.swift:1834 interpolate that same integer, so the .txt and .png of one call always share a filename prefix. Second precision would work for a human running one tool at a time; millisecond precision is cheap and it prevents collisions during rapid-fire composed calls (click→type→press, which lands at main.swift:1709-1750 and writes its final file after the last traversal).
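The collision is easy to see with two timestamps a few hundred milliseconds apart. The Python sketch below uses invented timestamps and a hypothetical filename helper. It works in integer milliseconds to keep the arithmetic exact.

```python
def filename(epoch_ms: int, tool: str, precision: str) -> str:
    """Build a .txt name at either second ('s') or millisecond ('ms') precision."""
    stamp = epoch_ms if precision == "ms" else epoch_ms // 1000
    return f"{stamp}_{tool}.txt"


# Two calls to the same tool, 327 ms apart, inside the same wall-clock second.
first_ms, second_ms = 1744996800123, 1744996800450

sec_names = {filename(t, "click_and_traverse", "s") for t in (first_ms, second_ms)}
ms_names = {filename(t, "click_and_traverse", "ms") for t in (first_ms, second_ms)}

print(len(sec_names), sorted(sec_names))  # one name: the second call overwrites the first
print(len(ms_names), sorted(ms_names))    # two names: both receipts survive
```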

What happens to old files in /tmp/macos-use/?

Nothing automatic. The directory is created with FileManager.default.createDirectory at main.swift:1823 if missing, but files are not cleaned up on server start or shutdown. macOS will reap them on reboot because /tmp is cleared at boot; until then they accumulate. In practice that is the behavior you want during debugging: you can scroll back through a session and see exactly which tool call produced which on-screen state, correlated by timestamp. If you are running hour-long sessions the directory grows. Add `find /tmp/macos-use -type f -mmin +60 -delete` to a cron job if that matters.
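If cron is not convenient, the same pruning can run from a script. This Python sketch is roughly equivalent to the find command above; prune_older_than is a hypothetical helper, and the demo uses a throwaway directory with a backdated file rather than touching /tmp/macos-use/.

```python
import os
import tempfile
import time


def prune_older_than(out_dir, max_age_seconds, now=None):
    """Delete regular files whose mtime is older than max_age_seconds."""
    now = time.time() if now is None else now
    with os.scandir(out_dir) as it:
        entries = list(it)  # snapshot first, then delete
    removed = []
    for entry in entries:
        if entry.is_file() and now - entry.stat().st_mtime > max_age_seconds:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)


demo_dir = tempfile.mkdtemp(prefix="macos-use-demo-")
old_path = os.path.join(demo_dir, "old.txt")
new_path = os.path.join(demo_dir, "new.txt")
for path in (old_path, new_path):
    open(path, "w").close()

# Backdate one file by two hours so it falls outside the one-hour window.
two_hours_ago = time.time() - 2 * 3600
os.utime(old_path, (two_hours_ago, two_hours_ago))

removed = prune_older_than(demo_dir, max_age_seconds=3600)
print(removed, os.listdir(demo_dir))
```

Because the .txt and .png of one call are written moments apart, an age-based cutoff removes them together in practice.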

Is this approach an MCP anti-pattern?

No. The MCP spec lets a tool return any text content it wants. Returning a pointer instead of the payload is idiomatic when the payload is large and the payload is also accessible to the model via another MCP tool (here, the filesystem via Read/Grep, which Claude Code exposes natively). It would be an anti-pattern if the file were on a remote resource the model could not reach, or if the summary hid information the model needed to make a decision. Neither is the case. The summary carries the decision-critical fields (status, diff counts, pid); the bulk details live on disk where grep is the right tool, not natural language.

How many tools does 'MCP server' mean in this codebase exactly?

Six, declared together at main.swift:1408 as `let allTools = [openAppTool, clickTool, typeTool, pressKeyTool, scrollTool, refreshTool]`. Each has a schema object (main.swift:1293-1405) and a handler branch in the server's CallTool dispatcher (main.swift:1525-1662). Five of the six are 'disruptive' at main.swift:1667 (`isDisruptive = params.name != refreshTool.name`), meaning they engage the InputGuard overlay and restore cursor/frontmost-app state on exit. refresh_traversal is the only read-only tool; it never touches input, never engages the guard, and still writes the same .txt + .png pair so its response is indistinguishable on disk from the others.
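The disruptive/read-only split reduces to one comparison. The Python sketch below mirrors that rule with the six tool names from this guide. The input_guard context manager is a hypothetical stand-in that only records when a guard would engage and restore; it does not reproduce what the real InputGuard does.

```python
from contextlib import contextmanager

ALL_TOOLS = [
    "macos-use_open_application_and_traverse",
    "macos-use_click_and_traverse",
    "macos-use_type_and_traverse",
    "macos-use_press_key_and_traverse",
    "macos-use_scroll_and_traverse",
    "macos-use_refresh_traversal",
]
REFRESH_TOOL = "macos-use_refresh_traversal"


def is_disruptive(tool_name):
    """Every tool except the refresh tool sends input events."""
    return tool_name != REFRESH_TOOL


@contextmanager
def input_guard(log):
    log.append("guard engaged")
    try:
        yield
    finally:
        log.append("state restored")  # runs even if the action raises


def run(tool_name, log):
    """Run a tool, wrapping it in the guard only when it is disruptive."""
    if is_disruptive(tool_name):
        with input_guard(log):
            log.append(f"ran {tool_name}")
    else:
        log.append(f"ran {tool_name}")
    return log


disruptive = [t for t in ALL_TOOLS if is_disruptive(t)]
print(len(disruptive))
print(run("macos-use_click_and_traverse", []))
print(run(REFRESH_TOOL, []))
```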

What makes macos-use different from other macOS MCP servers on this specific point?

Every macOS MCP server I checked returns the traversal inline. steipete/macos-automator-mcp returns AppleScript output; CursorTouch/MacOS-MCP returns a JSON accessibility tree; ashwwwin/automation-mcp and mb-dev/macos-ui-automation-mcp return elements verbatim in the tool response. None of them put the payload on disk and return a pointer. As a consequence, long sessions with those servers force the model to re-read the same traversal from context on each step or to accept truncation. The receipt-pointer contract at main.swift:1821-1842 is the distinguishing structural detail of this server, and it is a consequence of the platform: macOS accessibility trees are big because macOS apps are deeply nested (toolbars inside splitters inside tab groups inside windows inside sheets).

Can I bypass the disk step if I do not want files written?

Not from the client side today. The file write at main.swift:1829 is unconditional (it only has `try?` for I/O error swallowing, not a feature flag). If you fork the server you can gate it behind an environment variable, but the compact summary intentionally does not carry the full traversal string; removing the file write would leave the model with a reference to a non-existent path. The cleaner modification is to point outputDir somewhere other than /tmp, e.g. a per-session directory under ~/Library/Caches. The timestamp scheme does not care about the path.
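For anyone considering that fork, the redirect logic is small. The Python sketch below shows the shape of it only. The environment variable MACOS_USE_OUTPUT_DIR is a name I made up for illustration, and the server as described here does not read it.

```python
import os

DEFAULT_OUTPUT_DIR = "/tmp/macos-use"
ENV_VAR = "MACOS_USE_OUTPUT_DIR"  # hypothetical name; the real server does not read it


def resolve_output_dir(environ=os.environ):
    """Return the override directory if one is set, else the default."""
    override = environ.get(ENV_VAR, "").strip()
    if not override:
        return DEFAULT_OUTPUT_DIR
    return os.path.expanduser(override)  # allows values like ~/Library/Caches/...


print(resolve_output_dir({}))
print(resolve_output_dir({ENV_VAR: "/var/tmp/macos-use-session-1"}))
```

Only the directory changes. The timestamped filenames and the summary shape stay as they are, which is why the article's suggestion is to redirect the output rather than remove the write.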

Read The Handler That Builds Every Pair

main.swift:1525-1845 is one function. It contains the full tool-dispatch logic plus the receipt-pointer contract. MIT-licensed Swift, one file, 320 lines that define what 'MCP server' means inside this repo.
