Drive native macOS apps via the AX tree with MCP: the 9 tools and the one chaining trick

A grounded read of what an MCP client actually calls when you ask Claude to do something on your Mac, where the actuation surface starts and stops, and why the design wins on round-trip count.

Matthew Diakonov, Written with AI

Published May 13, 2026 · 7 min

Direct answer · verified 2026-05-13

Install macos-use, an open-source Swift MCP server (mediar-ai/mcp-server-macos-use, MIT). It exposes 9 MCP tools to any client that speaks MCP (Claude Code, Cursor, Claude Desktop, Cline, VS Code, Windsurf, Zed). Each tool reads the macOS accessibility tree of a target app, optionally mutates it (click, type, press, scroll, AX attribute write, AX action), and returns the new tree on disk so the model can grep it. Three of the nine tools chain click + type + press into one round trip via the element, text, and pressKey parameters. Source of truth: Sources/MCPServer/main.swift:1322-1483.

Install (1 command for Claude Code, JSON for the rest)

claude mcp add macos-use -- npx -y mcp-server-macos-use

Requires Claude Code (npm i -g @anthropic-ai/claude-code) and macOS 13+. Swift builds on first run, ~20 seconds.

On first run, macOS will prompt for Accessibility permission for the host (Claude Code, Cursor, etc.) under System Settings → Privacy & Security → Accessibility. Grant it, restart the client, and the 9 tools show up in the MCP picker.
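For the JSON-config clients, the entry mirrors the Claude Code command above. A minimal sketch, assuming the `mcpServers` top-level key that Claude Desktop and Cursor use; other clients may expect a different key or config file, so check each client's documentation. The function name `macos_use_config` is hypothetical and only builds the fragment so it can be printed.

```python
import json

# Assumption: the "mcpServers" layout used by Claude Desktop and Cursor.
# Other clients may expect a different top-level key or config file.
def macos_use_config() -> dict:
    return {
        "mcpServers": {
            "macos-use": {  # server label, matches the `claude mcp add` name
                "command": "npx",
                "args": ["-y", "mcp-server-macos-use"],
            }
        }
    }

if __name__ == "__main__":
    # Paste the printed JSON into the client's MCP config file.
    print(json.dumps(macos_use_config(), indent=2))
```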

The 9 verbs the MCP exposes

Pulled directly from Sources/MCPServer/main.swift (the Tool(name: ...) declarations start at line 1322 and end at line 1479; the aggregate list lives at line 1482 as let allTools = [...]). One read-only verb, eight actuation verbs.

open_application_and_traverse

Launches or activates an app by bundle identifier or name, waits for its window to be ready, then dumps the full AX tree to /tmp/macos-use/<ts>_<tool>.txt. Returns pid, file path, and a screenshot path. Declared at main.swift:1322.

refresh_traversal

Re-reads the AX tree for the given pid without any side effect. This is the only non-disruptive verb (main.swift:1800 defines isDisruptive as 'every tool except refresh_traversal'). Use it when state may have changed because of something the user did, not the agent.

click_and_traverse

Synthetic CGEvent left/right/double click at (x, y) or at the first AX node whose text matches the element parameter. Optionally chains text and pressKey. Declared at main.swift:1349. This is the high-traffic tool; the chaining design is the whole point.

type_and_traverse

Posts CGEvent keystrokes for the given string into the focused field of the target pid. Also accepts pressKey for an immediate trailing key (Return, Tab, Escape). Declared at main.swift:1369.

press_key_and_traverse

Single named key with optional modifiers. Useful for menu shortcuts (Command+S), navigation (PageDown, Home, End), and confirmation (Return, Escape). Declared at main.swift:1404.

scroll_and_traverse

CGEvent scroll-wheel event at (x, y) with deltaY in lines (negative = up, positive = down). Optional deltaX. The traversal after the scroll re-reads what is now visible. Declared at main.swift:1422.

set_value_and_traverse

AXUIElementSetAttributeValue with kAXValueAttribute on the AX element under (x, y). Bypasses the input event tap entirely. Use when typed events fail (Catalyst right-pane fields, sandboxed contexts). Declared at main.swift:1440.

press_ax_and_traverse

AXUIElementPerformAction with kAXPressAction on the AX element under (x, y). The right primitive when synthetic mouse clicks are silently dropped by the host app (most often a Catalyst right-pane button). Declared at main.swift:1457.

set_selected_and_traverse

Writes kAXSelectedAttribute on rows, sidebar entries, and outline items that expose AXSelected but no AXPress action. The escalation when both regular click and press_ax fail. Declared at main.swift:1475.

The chaining trick: 1 call instead of 3

The most under-documented thing about this server is the design of click_and_traverse. It is not just "click at this coordinate." The schema (main.swift:1327-1347) defines two optional parameters that almost nobody mentions when they describe the server:

text — a string typed into the field after the click lands. The click and the typing happen in the same tool invocation.
pressKey — a key name pressed after the typing is in. So pressKey: "Return" sends the message you just typed.

The server's own instructions string (returned to every MCP client when it connects, defined at main.swift:1488-1507) tells the model to default to the chained form:

Sources/MCPServer/main.swift:1496-1500

CRITICAL — Minimize tool calls by chaining actions:
- click_and_traverse supports `text` and `pressKey` params to click, type,
  AND press a key — all in ONE call.
- Example: to type into a Slack message box and send it, use ONE
  click_and_traverse call with element="Message to X", text="hello",
  pressKey="Return" — do NOT split into separate click, type, and press calls.

Concretely:

// Three MCP round trips. Same result, 3x the latency, 3x the token cost.
{ "tool": "macos-use_click_and_traverse", "arguments": { "pid": 51234, "element": "Message to engineering" } }
{ "tool": "macos-use_type_and_traverse", "arguments": { "pid": 51234, "text": "shipping in 10" } }
{ "tool": "macos-use_press_key_and_traverse", "arguments": { "pid": 51234, "keyName": "Return" } }

3 MCP round trips for one user-visible action
3 disruptive engages of the input guard (overlay flashes 3 times)
Tree state can drift between calls (other apps, OS events, animation)
3x the token cost in tool-call frames

type_and_traverse has the same pressKey parameter at main.swift:1360, so "type something and submit" from an already-focused field is one call too. The model does not have to invent this; the server tells it explicitly.
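For comparison with the three calls above, the chained form collapses into a single payload. The sketch below builds that payload. The tool name and the `pid`, `element`, `text`, and `pressKey` parameters come from the text above; the helper `chained_click` and the pid value are illustrative only.

```python
from typing import Optional

def chained_click(pid: int, element: str,
                  text: Optional[str] = None,
                  press_key: Optional[str] = None) -> dict:
    """Build one click_and_traverse call; text and pressKey stay optional."""
    arguments = {"pid": pid, "element": element}
    if text is not None:
        arguments["text"] = text           # typed after the click lands
    if press_key is not None:
        arguments["pressKey"] = press_key  # pressed after the text is in
    return {"tool": "macos-use_click_and_traverse", "arguments": arguments}

# One round trip: click the message box, type, press Return.
call = chained_click(51234, "Message to engineering",
                     text="shipping in 10", press_key="Return")
```

Leaving out `text` and `press_key` degrades to a plain click, so the same helper covers both cases.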

What one tool call looks like end-to-end

From the moment Claude decides to call click_and_traverse with element="Send", six things happen before the model sees a result. This section walks the wire so that you have a mental model when you write a prompt.

[Diagram: one chained click_and_traverse call]

The full tree never crosses the wire. The model gets a path to a .txt file (one element per line) and a path to a .png screenshot, plus a short visible-elements sample. If the model needs to find a specific control, it greps the .txt itself with its existing file tools and only pulls the lines it needs. Tool-call frames stay small even when the underlying app has 6,000 AX nodes.
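The "grep it yourself" step can be as small as a case-insensitive substring filter over the traversal file. This is a sketch that assumes one element per line, as described above; the helper name `find_elements` and the sample lines are made up for the demo and are not server output.

```python
import os
import tempfile

def find_elements(traversal_path: str, needle: str) -> list[str]:
    """Return the traversal lines that contain `needle`, ignoring case."""
    needle = needle.lower()
    with open(traversal_path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if needle in line.lower()]

# Demo on an invented traversal file (real files live under /tmp/macos-use/).
sample_lines = [
    '[AXWindow] "Inbox" x:0 y:25 w:1200 h:800 visible',
    '[AXButton] "Send" x:820 y:612 w:60 h:28 visible',
    '[AXTextField] "Subject" x:100 y:200 w:600 h:24 visible',
]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("\n".join(sample_lines) + "\n")
    sample_path = tmp.name

matches = find_elements(sample_path, "send")
os.remove(sample_path)
```

Only the matching lines need to enter the model's context; the rest of the file stays on disk.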

The 9 tools at a glance

Tool	Disruptive?	What it does
open_application_and_traverse	Yes (input guard engages)	Launch or activate by bundle ID/name, dump tree
refresh_traversal	No (read-only)	Re-read the AX tree for a pid; no side effects
click_and_traverse	Yes	Click (+ optional type + pressKey, chained)
type_and_traverse	Yes	Type into focused field (+ optional pressKey)
press_key_and_traverse	Yes	Named key with optional modifiers (Command+S, etc.)
scroll_and_traverse	Yes	Scroll wheel at (x, y) with deltaX, deltaY in lines
set_value_and_traverse	Yes	kAXValueAttribute write; bypasses the event tap
press_ax_and_traverse	Yes	kAXPressAction on the element under (x, y)
set_selected_and_traverse	Yes	kAXSelectedAttribute on rows/list entries

The three at the bottom (set_value, press_ax, set_selected) are the fallback ladder for the cases where the synthetic event sources land on a control that ignores them. The dedicated walkthrough on those three is the three-rung fallback ladder page.

A working example: Mail, draft, send

The shape of an agent loop driving Mail.app, expressed as the minimum number of tool calls. Six round trips for compose + write + send + verify. Notice that two of them (the recipient and subject steps) chain several actions into one call.

open_application_and_traverse with identifier: "com.apple.mail". Tree shows the New Message button. The model reads the .txt, finds it, grabs its (x, y, w, h).
click_and_traverse with element: "New Message" (or those explicit coordinates). The compose window opens. The new tree shows To/Subject/Body fields.
click_and_traverse with element: "To", text: "alice@example.com", pressKey: "Tab". Recipient is in, focus jumps to Subject. One call, three actions.
type_and_traverse with text: "Shipping at 5", pressKey: "Tab". Subject is in, focus jumps to body.
type_and_traverse with the body text. No trailing key, the model is going to use the menu shortcut for send.
press_key_and_traverse with keyName: "d", modifierFlags: ["Command", "Shift"]. That is Mail's Send shortcut (Shift-Command-D). The diff after the send shows the compose window gone and the "Sent" mailbox count incremented; the model verifies and stops.

Six tool calls for compose + write + send. The naive version of the same flow without chaining is nine to twelve calls. Round trips, input-guard flashes, and tokens all scale with that count.
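The call-count arithmetic can be checked by listing which primitive actions each of the six steps folds together. This sketch is a tally, not server code; the plan structure and function names are hypothetical. Un-chaining alone gives the low end of the nine-to-twelve range; any extra refresh or lookup calls a naive agent makes are not modeled here.

```python
# Each tuple lists the primitive actions folded into one tool call.
MAIL_PLAN = [
    ("open",),                   # open_application_and_traverse
    ("click",),                  # click_and_traverse on New Message
    ("click", "type", "press"),  # To field: click + text + pressKey
    ("type", "press"),           # Subject: text + pressKey
    ("type",),                   # body text, no trailing key
    ("press",),                  # send shortcut
]

def chained_calls(plan) -> int:
    """One round trip per step when chaining is used."""
    return len(plan)

def unchained_calls(plan) -> int:
    """One round trip per primitive action when every step is split."""
    return sum(len(step) for step in plan)

chained = chained_calls(MAIL_PLAN)
unchained = unchained_calls(MAIL_PLAN)
```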

What this is not

Three honest limitations the README does not advertise loudly enough.

Not a browser driver. If the workflow is "open a tab, fill a form, click submit," the right MCP is a browser-automation server (Playwright MCP, Stagehand, etc.). macos-use can drive Safari at the AX level, but it sees WKWebView accessibility nodes, not the DOM. Use the right tool for the layer.
Not a remote driver. The server runs on the same machine as the apps it drives. If you want Claude to drive a Mac you are not sitting at, the right pick is mcp-remote-macos-use (VNC-backed). The input-guard semantics here only make sense for the local case.
Not magic against opt-out apps. Many full-screen game engines (Metal-only renderers), DRM-protected video players, and some kiosk apps expose only an empty AX tree. No tool here can fix that; the agent's plan has to fall back to a different control surface entirely.

Frequently asked questions

What does 'drive' actually mean when the MCP is reading and writing the AX tree?

Two halves of one round trip. Every tool except refresh_traversal is 'disruptive' (defined at main.swift:1800): the server first acts on the app (CGEvent post, AX attribute write, AX action) and then re-traverses the tree, writes the new state to /tmp/macos-use/<ts>_<tool>.txt with a diff against the previous traversal, and returns the file path to the LLM. Reading is free (refresh_traversal). Anything that mutates state goes through one of the eight other tools and pays the read cost on the way out.

Why are there nine tools instead of one generic 'do_action' tool?

Because the LLM has to know in its tool schema which actuation primitive it is asking for. click_and_traverse posts a synthetic CGEvent mouse down/up; type_and_traverse posts CGEvent keystrokes through the system source; press_ax_and_traverse calls AXUIElementPerformAction with kAXPressAction; set_value_and_traverse calls AXUIElementSetAttributeValue with kAXValueAttribute; set_selected_and_traverse writes kAXSelectedAttribute. They have different fallback semantics and different failure modes (Catalyst right-pane buttons silently drop synthetic clicks but accept press_ax). A single generic tool would force the model to encode that decision in a string parameter the schema cannot constrain. Nine narrow tools let the model pick the right primitive from the description and the schema validates the call before the server runs.

What is the chained click_and_traverse call and why is it the default path?

click_and_traverse accepts three optional parameters that turn a multi-tool dance into one call: element (case-insensitive partial text match across visible AX nodes), text (typed after the click lands), and pressKey (key name pressed after the text is in). One MCP round trip becomes click + type + return. The reason it is the default is in the server's own instructions block at main.swift:1488-1507: 'CRITICAL — Minimize tool calls by chaining actions. Example: to type into a Slack message box and send it, use ONE click_and_traverse call with element="Message to X", text="hello", pressKey="Return" — do NOT split into separate click, type, and press calls.' The server tells the model this on every connect; the model defaults to the chained form on first contact.

How does the model know the coordinates of the element it wants to click?

Two paths. Path one: pass element="Send" and let the server scan the previous traversal for a visible AX node whose text contains that substring; it clicks the first match. Path two: read the .txt file at /tmp/macos-use/<ts>_<tool>.txt, parse one of the lines that looks like [AXButton] "Send" x:820 y:612 w:60 h:28 visible, and pass (x, y, w, h) explicitly so the tool centers the click at (x + w/2, y + h/2). The server instructions block explicitly forbids estimating coordinates from screenshots ('NEVER estimate coordinates visually from screenshots') because pixel positions in the .png and screen coordinates differ by the window origin offset. The .txt is the ground truth.
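Path two can be sketched as a small parser for the line shape quoted above, plus the centering arithmetic. The regular expression assumes lines look exactly like the `[AXButton] "Send" x:820 y:612 w:60 h:28 visible` example; real traversal lines may carry more fields, and `click_center` is a hypothetical helper name.

```python
import re

# Assumed line shape: [Role] "text" x:N y:N w:N h:N ...
LINE_RE = re.compile(
    r'^\[(?P<role>\w+)\]\s+"(?P<text>.*)"\s+'
    r'x:(?P<x>-?\d+(?:\.\d+)?)\s+y:(?P<y>-?\d+(?:\.\d+)?)\s+'
    r'w:(?P<w>\d+(?:\.\d+)?)\s+h:(?P<h>\d+(?:\.\d+)?)'
)

def click_center(line: str) -> tuple[float, float]:
    """Return the (x + w/2, y + h/2) point for one traversal line."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized traversal line: {line!r}")
    x, y, w, h = (float(m[k]) for k in ("x", "y", "w", "h"))
    return (x + w / 2, y + h / 2)

center = click_center('[AXButton] "Send" x:820 y:612 w:60 h:28 visible')
# center == (850.0, 626.0)
```

When (x, y, w, h) are passed to the tool explicitly, the server does this centering itself; the sketch only makes the arithmetic visible.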

When does click_and_traverse fail, and what's the escalation?

Synthetic CGEvent clicks land cleanly on AppKit and most SwiftUI apps but are silently dropped by Mac Catalyst right-pane controls (Messages, Maps), some sandboxed apps, and any field in a secure-input context. The tree shows the element, the click coordinates are right, the action just does not register. The escalation lives in three other tools: press_ax_and_traverse (kAXPressAction on the element) for buttons, set_value_and_traverse (kAXValueAttribute) for text fields that ignore typed events, and set_selected_and_traverse (kAXSelectedAttribute) for list and outline rows that expose AXSelected but no AXPress action. The fallback ladder is documented in the side guide on the three-rung escalation.
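The escalation described above amounts to an ordered list of tools per kind of control. This is a minimal sketch of how an agent loop might pick the next rung after a failure. The tool names are the server's; the control kinds (`button`, `text_field`, `row`) and the `next_tool` helper are assumptions made for illustration.

```python
from typing import Optional

# Fallback order per control kind, following the escalation in the text.
LADDER = {
    "button": ["click_and_traverse", "press_ax_and_traverse"],
    "text_field": ["type_and_traverse", "set_value_and_traverse"],
    "row": ["click_and_traverse", "press_ax_and_traverse",
            "set_selected_and_traverse"],
}

def next_tool(kind: str, failed: list[str]) -> Optional[str]:
    """Return the first tool on the ladder that has not failed yet."""
    for tool in LADDER.get(kind, []):
        if tool not in failed:
            return tool
    return None  # ladder exhausted; the plan needs another control surface
```

A caller would detect "failed" from the diff after the action: if nothing changed where a change was expected, add the tool to `failed` and ask again.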

Does each tool block my keyboard while it runs?

Yes, briefly, for every disruptive call. The handler at main.swift:1835 fires InputGuard.shared.engage, which installs a CGEventTap at .cghidEventTap with .headInsertEventTap placement and a mask covering keyDown, keyUp, both mouse buttons, mouseMoved, dragged, scrollWheel, and flagsChanged. The tap callback drops hardware events (eventSourceStateID == 0) and forwards programmatic events (non-zero stateID), which is how the agent's synthesized clicks pass through while your typing does not. The overlay shows a pulsing orange dot and a one-line label of what the agent is doing. Plain Esc with no modifiers is the kill switch. A 30-second watchdog auto-disengages the guard if anything hangs.

Which MCP clients can call these 9 tools today?

Any client that speaks MCP: Claude Code, Cursor, Claude Desktop, Cline, VS Code (with an MCP extension), Windsurf, and Zed. The server is stdio-based, registered via npx, and known to work with every client that follows the standard MCP transport. Claude Code is the lowest-friction install: claude mcp add macos-use -- npx -y mcp-server-macos-use. For the JSON-config clients (Cursor, Claude Desktop, VS Code, Windsurf, Cline, Zed) the snippet is in the install block on this page.

What does the LLM actually see when a tool returns?

A compact text summary, not the full tree. The summary contains: a file: field pointing at /tmp/macos-use/<ts>_<tool>.txt with the full element list, a screenshot: field pointing at a sibling .png of the captured window, and a visible_elements: sample showing up to a few dozen on-screen elements with coordinates. The full tree (often 1000+ lines for a real app) stays on disk; the model greps the file when it needs to find an element by text. Smaller summary on the wire, lower token cost per turn, full state available when the model asks.

How is the diff used after an action?

Every disruptive call writes both an absolute traversal and a diff against the prior traversal. The diff is line-prefixed: + for new elements, - for removed elements, ~ for elements whose attributes changed. Coordinate-only churn (a window shifted one pixel) is filtered out before the diff is built. Scroll-bar and structural-only noise is dropped at main.swift:591-607. The point is to give the model a small, semantic delta after each action: 'this button became enabled', 'this row was selected', 'this text field's AXValue changed from empty to your typed string', not the entire tree again.
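The idea of a semantic delta that ignores coordinate-only churn can be modeled in a few lines. This is a sketch only: the server's actual diff is written in Swift and may key and compare elements differently. The element keys, the attribute names, and the `semantic_diff` helper are all hypothetical.

```python
GEOMETRY_KEYS = {"x", "y", "w", "h"}

def semantic_diff(before: dict, after: dict) -> list[str]:
    """Diff two traversals given as {element_key: attribute_dict}."""
    lines = [f"+ {key}" for key in after if key not in before]
    lines += [f"- {key}" for key in before if key not in after]
    for key in before:
        if key in after:
            # Compare everything except geometry, so a one-pixel shift is silent.
            old = {k: v for k, v in before[key].items() if k not in GEOMETRY_KEYS}
            new = {k: v for k, v in after[key].items() if k not in GEOMETRY_KEYS}
            if old != new:
                lines.append(f"~ {key}")
    return lines

before = {
    '[AXButton] "Send"': {"x": 820, "y": 612, "w": 60, "h": 28, "enabled": False},
    '[AXTextField] "Subject"': {"x": 100, "y": 200, "w": 600, "h": 24, "value": ""},
    '[AXStaticText] "Draft"': {"x": 10, "y": 10, "w": 80, "h": 16},
}
after = {
    '[AXButton] "Send"': {"x": 821, "y": 612, "w": 60, "h": 28, "enabled": True},
    '[AXTextField] "Subject"': {"x": 101, "y": 200, "w": 600, "h": 24, "value": ""},
    '[AXStaticText] "Saved"': {"x": 10, "y": 10, "w": 80, "h": 16},
}
delta = semantic_diff(before, after)
```

In this demo the Subject field moved by one pixel and produces no line, while the Send button's changed `enabled` flag produces a `~` line.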

Is this a one-off project or is something real running on it?

Fazm's screen-control feature is built on the same Swift core. The MCP server is the standalone surface around that core, packaged so any MCP client can drive it without touching the Fazm app. Open source under MIT, repo at github.com/mediar-ai/mcp-server-macos-use. The maintainers (Mediar) ship a consumer macOS app on top of it, so behavior is exercised in production every day, not only in the README.
