Drive native macOS apps via the AX tree with MCP: the 9 tools and the one chaining trick
A grounded read of what an MCP client actually calls when you ask Claude to do something on your Mac, where the actuation surface starts and stops, and why the design wins on round-trip count.
Direct answer · verified 2026-05-13
Install macos-use, an open-source Swift MCP server (mediar-ai/mcp-server-macos-use, MIT). It exposes 9 MCP tools to any client that speaks MCP (Claude Code, Cursor, Claude Desktop, Cline, VS Code, Windsurf, Zed). Each tool reads the macOS accessibility tree of a target app, optionally mutates it (click, type, press, scroll, AX attribute write, AX action), and returns the new tree on disk so the model can grep it. Three of the nine tools chain click + type + press into one round trip via the element, text, and pressKey parameters. Source of truth: Sources/MCPServer/main.swift:1322-1483.
Install (1 command for Claude Code, JSON for the rest)
claude mcp add macos-use -- npx -y mcp-server-macos-use
Requires Claude Code (npm i -g @anthropic-ai/claude-code) and macOS 13+. Swift builds on first run, ~20 seconds.
On first run, macOS will prompt for Accessibility permission for the host (Claude Code, Cursor, etc.) under System Settings → Privacy & Security → Accessibility. Grant it, restart the client, and the 9 tools show up in the MCP picker.
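For the JSON-config clients, the same npx invocation is usually written as a server entry in the client's config file. The sketch below prints one such entry. It assumes the top-level `mcpServers` shape that clients such as Claude Desktop and Cursor read; the config file location and exact keys vary by client, so treat this as a starting point and check your client's documentation.

```python
import json


def macos_use_config() -> dict:
    """Build an MCP server entry equivalent to the `claude mcp add` command."""
    # Assumption: the client expects a top-level "mcpServers" object.
    # The entry name "macos-use" is arbitrary; command and args mirror
    # the npx invocation shown in the install command.
    return {
        "mcpServers": {
            "macos-use": {
                "command": "npx",
                "args": ["-y", "mcp-server-macos-use"],
            }
        }
    }


print(json.dumps(macos_use_config(), indent=2))
```

Paste the printed object into the client's MCP config, restart the client, and grant the Accessibility permission when macOS prompts.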
The 9 verbs the MCP exposes
Pulled directly from Sources/MCPServer/main.swift (the Tool(name: ...) declarations start at line 1322 and end at line 1479; the aggregate list lives at line 1482 as let allTools = [...]). One read-only verb, eight actuation verbs.
open_application_and_traverse
Launches or activates an app by bundle identifier or name, waits for its window to be ready, then dumps the full AX tree to /tmp/macos-use/<ts>_<tool>.txt. Returns pid, file path, and a screenshot path. Declared at main.swift:1322.
refresh_traversal
Re-reads the AX tree for the given pid without any side effect. This is the only non-disruptive verb (main.swift:1800 defines isDisruptive as 'every tool except refresh_traversal'). Use it when state may have changed because of something the user did, not the agent.
click_and_traverse
Synthetic CGEvent left/right/double click at (x, y) or at the first AX node whose text matches the element parameter. Optionally chains text and pressKey. Declared at main.swift:1349. This is the high-traffic tool; the chaining design is the whole point.
type_and_traverse
Posts CGEvent keystrokes for the given string into the focused field of the target pid. Also accepts pressKey for an immediate trailing key (Return, Tab, Escape). Declared at main.swift:1369.
press_key_and_traverse
Single named key with optional modifiers. Useful for menu shortcuts (Command+S), navigation (PageDown, Home, End), and confirmation (Return, Escape). Declared at main.swift:1404.
scroll_and_traverse
CGEvent scroll-wheel event at (x, y) with deltaY in lines (negative = up, positive = down). Optional deltaX. The traversal after the scroll re-reads what is now visible. Declared at main.swift:1422.
set_value_and_traverse
AXUIElementSetAttributeValue with kAXValueAttribute on the AX element under (x, y). Bypasses the input event tap entirely. Use when typed events fail (Catalyst right-pane fields, sandboxed contexts). Declared at main.swift:1440.
press_ax_and_traverse
AXUIElementPerformAction with kAXPressAction on the AX element under (x, y). The right primitive when synthetic mouse clicks are silently dropped by the host app (most often a Catalyst right-pane button). Declared at main.swift:1457.
set_selected_and_traverse
Writes kAXSelectedAttribute on rows, sidebar entries, and outline items that expose AXSelected but no AXPress action. The escalation when both regular click and press_ax fail. Declared at main.swift:1475.
The chaining trick: 1 call instead of 3
The most under-documented thing about this server is the design of click_and_traverse. It is not just "click at this coordinate." The schema (main.swift:1327-1347) defines two optional parameters that almost nobody mentions when they describe the server:
- text: a string typed into the field after the click lands. The click and the typing happen in the same tool invocation.
- pressKey: a key name pressed after the typing is in. So pressKey: "Return" sends the message you just typed.
The server's own instructions string (returned to every MCP client when it connects, defined at main.swift:1488-1507) tells the model to default to the chained form:
CRITICAL — Minimize tool calls by chaining actions: - click_and_traverse supports `text` and `pressKey` params to click, type, AND press a key — all in ONE call. - Example: to type into a Slack message box and send it, use ONE click_and_traverse call with element="Message to X", text="hello", pressKey="Return" — do NOT split into separate click, type, and press calls.
Concretely:
// Three MCP round trips. Same result, 3x the latency, 3x the token cost.
{ "tool": "macos-use_click_and_traverse", "arguments": { "pid": 51234, "element": "Message to engineering" } }
{ "tool": "macos-use_type_and_traverse", "arguments": { "pid": 51234, "text": "shipping in 10" } }
{ "tool": "macos-use_press_key_and_traverse", "arguments": { "pid": 51234, "keyName": "Return" } }
- 3 MCP round trips for one user-visible action
- 3 disruptive engages of the input guard (overlay flashes 3 times)
- Tree state can drift between calls (other apps, OS events, animation)
- 3x the token cost in tool-call frames
type_and_traverse has the same pressKey parameter at main.swift:1360, so "type something and submit" from an already-focused field is one call too. The model does not have to invent this; the server tells it explicitly.
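To make the difference concrete, here is a minimal sketch that builds both the split and the chained form as plain dictionaries. The tool names and argument keys are the ones quoted on this page; the helper names `split_calls` and `chained_call` and the pid are illustrative, not part of the server.

```python
def split_calls(pid: int, element: str, text: str, key: str) -> list:
    """The three-round-trip form: click, then type, then press."""
    return [
        {"tool": "macos-use_click_and_traverse",
         "arguments": {"pid": pid, "element": element}},
        {"tool": "macos-use_type_and_traverse",
         "arguments": {"pid": pid, "text": text}},
        {"tool": "macos-use_press_key_and_traverse",
         "arguments": {"pid": pid, "keyName": key}},
    ]


def chained_call(pid: int, element: str, text: str, key: str) -> list:
    """The one-round-trip form: click + type + press in a single frame."""
    return [
        {"tool": "macos-use_click_and_traverse",
         "arguments": {"pid": pid, "element": element,
                       "text": text, "pressKey": key}},
    ]


args = (51234, "Message to engineering", "shipping in 10", "Return")
print(len(split_calls(*args)), "calls vs", len(chained_call(*args)), "call")
```

The chained frame carries the same three pieces of information, but the input guard engages once and the tree is re-read once.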
What one tool call looks like end-to-end
From the moment Claude decides to call click_and_traverse with element="Send", six things happen before the model sees a result. It is worth walking the wire once so you have a mental model when you write a prompt.
One chained click_and_traverse call
The full tree never crosses the wire. The model gets a path to a .txt file (one element per line) and a path to a .png screenshot, plus a short visible-elements sample. If the model needs to find a specific control, it greps the .txt itself with its existing file tools and only pulls the lines it needs. Tool-call frames stay small even when the underlying app has 6,000 AX nodes.
The 9 tools at a glance
| Tool | Disruptive? | What it does |
|---|---|---|
| open_application_and_traverse | Yes (input guard engages) | Launch or activate by bundle ID/name, dump tree |
| refresh_traversal | No (read-only) | Re-read the AX tree for a pid; no side effects |
| click_and_traverse | Yes | Click (+ optional type + pressKey, chained) |
| type_and_traverse | Yes | Type into focused field (+ optional pressKey) |
| press_key_and_traverse | Yes | Named key with optional modifiers (Command+S, etc.) |
| scroll_and_traverse | Yes | Scroll wheel at (x, y) with deltaX, deltaY in lines |
| set_value_and_traverse | Yes | kAXValueAttribute write; bypasses the event tap |
| press_ax_and_traverse | Yes | kAXPressAction on the element under (x, y) |
| set_selected_and_traverse | Yes | kAXSelectedAttribute on rows/list entries |
The three at the bottom (set_value, press_ax, set_selected) are the fallback ladder for the cases where the synthetic event sources land on a control that ignores them. The dedicated walkthrough on those three is the three-rung fallback ladder page.
A working example: Mail, draft, send
The shape of an agent loop driving Mail.app, expressed as the minimum number of tool calls. Six round trips for compose + write + send + verify. Notice that two of them (the To and Subject steps) fold several actions into a single call.
- open_application_and_traverse with identifier: "com.apple.mail". Tree shows the New Message button. The model reads the .txt, finds it, grabs its (x, y, w, h).
- click_and_traverse with element: "New Message" (or those explicit coordinates). The compose window opens. The new tree shows To/Subject/Body fields.
- click_and_traverse with element: "To", text: "alice@example.com", pressKey: "Tab". Recipient is in, focus jumps to Subject. One call, three actions.
- type_and_traverse with text: "Shipping at 5", pressKey: "Tab". Subject is in, focus jumps to body.
- type_and_traverse with the body text. No trailing key; the model is going to use the menu shortcut for send.
- press_key_and_traverse with keyName: "D", modifierFlags: ["Command", "Shift"]. That is Mail's Send shortcut (Shift-Command-D). The diff after the send shows the compose window gone and the "Sent" mailbox count incremented; the model verifies and stops.
Six tool calls for compose + write + send. The naive version of the same flow without chaining is nine to twelve calls. Round trips, input-guard flashes, and tokens all scale with that count.
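The arithmetic behind those counts can be sketched as follows. Each step of the Mail flow is recorded with the number of user-visible actions folded into it. The counting model is an assumption: the unchained version is taken to issue exactly one call per action, with no extra refresh_traversal calls in between (which is what would push the total toward twelve).

```python
# Each entry: (tool, number of user-visible actions folded into the call).
MAIL_PLAN = [
    ("open_application_and_traverse", 1),  # launch or activate Mail
    ("click_and_traverse", 1),             # click New Message
    ("click_and_traverse", 3),             # click To + type address + Tab
    ("type_and_traverse", 2),              # type subject + Tab
    ("type_and_traverse", 1),              # type body
    ("press_key_and_traverse", 1),         # send shortcut
]

chained_calls = len(MAIL_PLAN)
# Assumption: without chaining, every action becomes its own tool call.
unchained_calls = sum(actions for _, actions in MAIL_PLAN)

print(chained_calls, unchained_calls)  # prints: 6 9
```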
What this is not
Three honest limitations the README does not advertise loudly enough.
- Not a browser driver. If the workflow is "open a tab, fill a form, click submit," the right MCP is a browser-automation server (Playwright MCP, Stagehand, etc.). macos-use can drive Safari at the AX level, but it sees WKWebView accessibility nodes, not the DOM. Use the right tool for the layer.
- Not a remote driver. The server runs on the same machine as the apps it drives. If you want Claude to drive a Mac you are not sitting at, the right pick is mcp-remote-macos-use (VNC-backed). The input-guard semantics here only make sense for the local case.
- Not magic against opt-out apps. Many full-screen game engines (Metal-only renderers), DRM-protected video players, and some kiosk apps expose only an empty AX tree. No tool here can fix that; the agent's plan has to fall back to a different control surface entirely.
Frequently asked questions
What does 'drive' actually mean when the MCP is reading and writing the AX tree?
Two halves of one round trip. Every tool except refresh_traversal is 'disruptive' (defined at main.swift:1800): the server first acts on the app (CGEvent post, AX attribute write, AX action) and then re-traverses the tree, writes the new state to /tmp/macos-use/<ts>_<tool>.txt with a diff against the previous traversal, and returns the file path to the LLM. Reading is free (refresh_traversal). Anything that mutates state goes through one of the eight other tools and pays the read cost on the way out.
Why are there nine tools instead of one generic 'do_action' tool?
Because the LLM has to know in its tool schema which actuation primitive it is asking for. click_and_traverse posts a synthetic CGEvent mouse down/up; type_and_traverse posts CGEvent keystrokes through the system source; press_ax_and_traverse calls AXUIElementPerformAction with kAXPressAction; set_value_and_traverse calls AXUIElementSetAttributeValue with kAXValueAttribute; set_selected_and_traverse writes kAXSelectedAttribute. They have different fallback semantics and different failure modes (Catalyst right-pane buttons silently drop synthetic clicks but accept press_ax). A single generic tool would force the model to encode that decision in a string parameter the schema cannot constrain. Nine narrow tools let the model pick the right primitive from the description and the schema validates the call before the server runs.
What is the chained click_and_traverse call and why is it the default path?
click_and_traverse accepts three optional parameters that turn a multi-tool dance into one call: element (case-insensitive partial text match across visible AX nodes), text (typed after the click lands), and pressKey (key name pressed after the text is in). One MCP round trip becomes click + type + return. The reason it is the default is in the server's own instructions block at main.swift:1488-1507: 'CRITICAL — Minimize tool calls by chaining actions. Example: to type into a Slack message box and send it, use ONE click_and_traverse call with element="Message to X", text="hello", pressKey="Return" — do NOT split into separate click, type, and press calls.' The server tells the model this on every connect; the model defaults to the chained form on first contact.
How does the model know the coordinates of the element it wants to click?
Two paths. Path one: pass element="Send" and let the server scan the previous traversal for a visible AX node whose text contains that substring; it clicks the first match. Path two: read the .txt file at /tmp/macos-use/<ts>_<tool>.txt, parse one of the lines that looks like [AXButton] "Send" x:820 y:612 w:60 h:28 visible, and pass (x, y, w, h) explicitly so the tool centers the click at (x + w/2, y + h/2). The server instructions block explicitly forbids estimating coordinates from screenshots ('NEVER estimate coordinates visually from screenshots') because pixel positions in the .png and screen coordinates differ by the window origin offset. The .txt is the ground truth.
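Path two can be sketched in a few lines. The parser below assumes every traversal line follows the shape of the sample quoted above (`[Role] "text" x: y: w: h: visible`); the real file may carry extra fields, so the regex is illustrative. The server centers the click itself when given (x, y, w, h); the function only mirrors that arithmetic so you can see where the click lands.

```python
import re

# Assumption: line layout inferred from the single sample line on this page.
LINE_RE = re.compile(
    r'\[(?P<role>\w+)\]\s+"(?P<text>[^"]*)"\s+'
    r'x:(?P<x>-?\d+(?:\.\d+)?)\s+y:(?P<y>-?\d+(?:\.\d+)?)\s+'
    r'w:(?P<w>\d+(?:\.\d+)?)\s+h:(?P<h>\d+(?:\.\d+)?)'
    r'(?P<visible>\s+visible)?'
)


def find_click_center(lines, needle):
    """Return (cx, cy) for the first visible element whose text contains needle."""
    needle = needle.lower()
    for line in lines:
        m = LINE_RE.search(line)
        if not m or not m.group("visible"):
            continue  # skip unparseable lines and off-screen elements
        if needle in m.group("text").lower():
            x, y, w, h = (float(m.group(k)) for k in ("x", "y", "w", "h"))
            return (x + w / 2, y + h / 2)
    return None


sample = ['[AXButton] "Send" x:820 y:612 w:60 h:28 visible']
print(find_click_center(sample, "send"))  # (850.0, 626.0)
```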
When does click_and_traverse fail, and what's the escalation?
Synthetic CGEvent clicks land cleanly on AppKit and most SwiftUI apps but are silently dropped by Mac Catalyst right-pane controls (Messages, Maps), some sandboxed apps, and any field in a secure-input context. The tree shows the element, the click coordinates are right, the action just does not register. The escalation lives in three other tools: press_ax_and_traverse (kAXPressAction on the element) for buttons, set_value_and_traverse (kAXValueAttribute) for text fields that ignore typed events, and set_selected_and_traverse (kAXSelectedAttribute) for list and outline rows that expose AXSelected but no AXPress action. The fallback ladder is documented in the side guide on the three-rung escalation.
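The escalation order described above can be sketched as a small lookup plus a retry loop. The role-to-ladder mapping is an illustration of the prose, not a table taken from the server, and `attempt` is a hypothetical callback that would issue the MCP call and report whether the returned diff shows the action registered.

```python
from typing import Callable, Optional

# Assumption: this mapping paraphrases the escalation described in the text.
LADDERS = {
    "AXButton": ["click_and_traverse", "press_ax_and_traverse"],
    "AXTextField": ["type_and_traverse", "set_value_and_traverse"],
    "AXRow": ["click_and_traverse", "press_ax_and_traverse",
              "set_selected_and_traverse"],
}


def actuate(role: str, attempt: Callable[[str], bool]) -> Optional[str]:
    """Try each tool for `role` in order; return the first one that registers."""
    for tool in LADDERS.get(role, ["click_and_traverse"]):
        if attempt(tool):
            return tool
    return None  # nothing registered; the plan needs another control surface


# Simulate a Catalyst button that drops synthetic clicks but accepts AXPress.
print(actuate("AXButton", lambda tool: tool == "press_ax_and_traverse"))
```

In a real loop, `attempt` would compare the diff returned by the tool against the expected state change before moving to the next rung.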
Does each tool block my keyboard while it runs?
Yes, briefly, for every disruptive call. The handler at main.swift:1835 fires InputGuard.shared.engage which installs a CGEventTap at .cghidEventTap headInsert with a mask covering keyDown, keyUp, both mouse buttons, mouseMoved, dragged, scrollWheel, and flagsChanged. The tap callback drops hardware events (eventSourceStateID == 0) and forwards programmatic events (non-zero stateID), which is how the agent's synthesized clicks pass through while your typing does not. The overlay shows a pulsing orange dot and a one-line label of what the agent is doing. Plain Esc with no modifiers is the kill switch. A 30-second watchdog auto-disengages if anything hangs.
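The filtering rule in that answer can be modeled as a small decision function. This is a Python model of the behavior described here, not the Swift tap callback itself: the real watchdog is presumably a timer rather than a check inside the callback, and the parameter names are hypothetical. The Escape keycode (53) is the standard macOS virtual keycode.

```python
ESC_KEYCODE = 53          # kVK_Escape on macOS
WATCHDOG_SECONDS = 30.0   # auto-disengage timeout quoted in the text


def decide(source_state_id: int, is_key_down: bool, keycode: int,
           has_modifiers: bool, engaged_at: float, now: float) -> str:
    """Return 'pass', 'drop', or 'disengage' for one intercepted event."""
    if now - engaged_at >= WATCHDOG_SECONDS:
        return "disengage"   # watchdog: the tool call hung
    if source_state_id != 0:
        return "pass"        # programmatic event (the agent's own input)
    if is_key_down and keycode == ESC_KEYCODE and not has_modifiers:
        return "disengage"   # plain Esc is the kill switch
    return "drop"            # any other hardware input is swallowed


print(decide(0, True, ESC_KEYCODE, False, engaged_at=0.0, now=1.0))
```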
Which MCP clients can call these 9 tools today?
Any client that speaks MCP: Claude Code, Cursor, Claude Desktop, Cline, VS Code (with an MCP extension), Windsurf, and Zed. The server is stdio-based, registered via npx, and known to work with every client that follows the standard MCP transport. Claude Code is the lowest-friction install: claude mcp add macos-use -- npx -y mcp-server-macos-use. For the JSON-config clients (Cursor, Claude Desktop, VS Code, Windsurf, Cline, Zed) the snippet is in the install block on this page.
What does the LLM actually see when a tool returns?
A compact text summary, not the full tree. The summary contains: a file: field pointing at /tmp/macos-use/<ts>_<tool>.txt with the full element list, a screenshot: field pointing at a sibling .png of the captured window, and a visible_elements: sample showing up to a few dozen on-screen elements with coordinates. The full tree (often 1000+ lines for a real app) stays on disk; the model greps the file when it needs to find an element by text. Smaller summary on the wire, lower token cost per turn, full state available when the model asks.
How is the diff used after an action?
Every disruptive call writes both an absolute traversal and a diff against the prior traversal. The diff is line-prefixed: + for new elements, - for removed elements, ~ for elements whose attributes changed. Coordinate-only churn (a window shifted one pixel) is filtered out before the diff is built. Scroll-bar and structural-only noise is dropped at main.swift:591-607. The point is to give the model a small, semantic delta after each action: 'this button became enabled', 'this row was selected', 'this text field's AXValue changed from empty to your typed string', not the entire tree again.
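A consumer of that diff only needs to bucket lines by their prefix. The sketch below does that; the three prefixes come from the description above, while the sample lines and everything after the prefix are invented for illustration, since the exact diff layout is not shown on this page.

```python
from collections import defaultdict

PREFIXES = {"+": "added", "-": "removed", "~": "changed"}


def group_diff(lines):
    """Bucket diff lines into added / removed / changed by their first character."""
    groups = defaultdict(list)
    for line in lines:
        kind = PREFIXES.get(line[:1])
        if kind:  # ignore headers or any line without a known prefix
            groups[kind].append(line[1:].strip())
    return dict(groups)


# Hypothetical diff lines; only the +, -, ~ prefixes are taken from the text.
sample_diff = [
    '+ [AXButton] "Undo Send"',
    '- [AXWindow] "New Message"',
    '~ [AXStaticText] "Sent"',
    "unprefixed line",
]
print(group_diff(sample_diff))
```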
Is this a one-off project or is something real running on it?
Fazm's screen-control feature is built on the same Swift core. The MCP server is the standalone surface around that core, packaged so any MCP client can drive it without touching the Fazm app. Open source under MIT, repo at github.com/mediar-ai/mcp-server-macos-use. The maintainers (Mediar) ship a consumer macOS app on top of it, so behavior is exercised in production every day, not only in the README.
Related reading
- Accessibility tree, native macOS apps: the two practical ways to read one — the reading half of the same story: Inspector for browsing, macos-use for a flat-text dump you can grep.
- The three-rung fallback ladder for when click silently drops — set_value, press_ax, set_selected, in the order to try them.
- Who owns the keyboard while the LLM is clicking — the input-arbitration layer that wraps every disruptive tool call on this page.
- 3 macOS automation MCP servers compared — native AX vs AppleScript wrapper vs VNC-backed remote, with the cases each one wins.