Drive any Mac app from Claude Code
A Swift MCP server that hands your AI assistant the same accessibility tree Apple gives VoiceOver. Click any button by text. Type into any field. Drive Xcode, Slack, Mail, System Settings, anything with an AX tree.
- Not an AppleScript wrapper. Native AX APIs + CGEvent, so it works on apps with no scripting dictionary (most modern Mac apps).
- Not a screenshot agent. Structured tree responses, diff-only after each action. No OCR tax, no vision-model bill.
- 100% local. One Swift binary over stdio. No SaaS, no network egress from the server.
- You stay in control. Every action is gated by your MCP client's approval prompt. An InputGuard overlay blocks stray input during automation, and Escape cancels instantly.
The setup pack adds per-client JSON for Cursor, Claude Desktop, VS Code, Windsurf, Cline & Zed, the six prompts below ready to paste, and a heads-up when new tools ship. One email, no list.
$ claude mcp add macos-use -- npx -y mcp-server-macos-use
Cursor, Claude Desktop, VS Code, Windsurf, Cline, Zed? Same package, JSON config below.
macOS 13+ · Swift builds on first run (~20s) · 326 ★ on GitHub
Real session: Claude Code calling macos-use to open an app, read the accessibility tree, click by text, and verify the result. No edits, no AppleScript, no screenshot loops.
Install
Copy. Paste. Approve once.
The whole install is a single command for Claude Code or a five-line JSON config for everything else. macOS 13+. Swift builds on first run, about twenty seconds.
claude mcp add macos-use -- npx -y mcp-server-macos-use
Requires Claude Code (npm i -g @anthropic-ai/claude-code) and macOS 13+. Swift builds on first run, ~20 seconds.
Want the full setup pack by email?
Per-client config paths (Cursor, Claude Desktop, VS Code, Windsurf, Cline, Zed), the 30-second Accessibility-permission walkthrough, the six prompts above formatted for paste-in, and a heads-up when new tools ship. One email, no list.
First run prompts for Accessibility permission on whichever app is running your MCP client. Revoke anytime in System Settings → Privacy & Security.
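For clients other than Claude Code, the JSON entry can be derived from the same `npx` command. The sketch below generates one, assuming the `mcpServers` top-level key that Claude Desktop and Cursor use. Some clients use a different key or file layout, so treat the output as a starting point and check your client's documentation for the exact shape and path.

```python
import json


def mcp_server_entry(name="macos-use", package="mcp-server-macos-use"):
    """Build a stdio server entry that mirrors the `claude mcp add` command."""
    return {
        # Assumption: the client reads servers from an "mcpServers" object.
        "mcpServers": {
            name: {
                "command": "npx",
                "args": ["-y", package],
            }
        }
    }


config_text = json.dumps(mcp_server_entry(), indent=2)
print(config_text)
```

Paste the printed JSON into the client's MCP config file, merging it with any servers already listed there.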
In practice
Once it's installed, type things like this
Real prompts you can drop into Claude Code or Cursor today. The server resolves each one through the accessibility tree, so the agent clicks the right button instead of guessing pixels.
Launches Xcode, presses ⌘R, watches the issue navigator, returns the error region.
Focuses Slack, opens the DM by accessibility label, types into the message field, sends.
Walks the settings tree, reads the toggles row by row, returns a structured list.
Cross-app handoff: focuses Cursor, ⌘P to open, ⌘G to jump, ⌘L for the panel.
Uses Finder + Preview through the accessibility tree, no AppleScript glue.
Mail traversal + native CGEvent clicks. Same flow that ships inside Fazm.
Every call is gated by your MCP client's approval prompt. You see the action before the server runs it.
Tools
Six tools. Full control.
Every tool returns the updated accessibility tree as a diff, so the agent always knows what changed.
- `open_application_and_traverse`: Launch or focus any app by name, bundle ID, or path.
- `click_and_traverse`: Click at coordinates or by element text. Optionally type and press a key in one call.
- `type_and_traverse`: Type into the focused field, with optional modifier keystroke.
- `press_key_and_traverse`: Arrow keys, ⌘⇧4, anything. Full modifier support.
- `scroll_and_traverse`: Scroll lines in any direction at a given position.
- `refresh_traversal`: Re-read the accessibility tree without taking an action.
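Under the hood, an MCP client invokes any of these tools with a JSON-RPC `tools/call` request. Here is a minimal sketch of that message for a click-by-text call. The `element` argument comes from the examples on this page. The real tool schema may require more arguments (for instance, which process to target), so ask the server for the authoritative schema with `tools/list`.

```python
import json


def tools_call(request_id, tool_name, arguments):
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Assumption: `element` alone is enough for this illustration; the actual
# schema of click_and_traverse may need additional fields.
request = tools_call(1, "click_and_traverse", {"element": "Submit"})

# MCP's stdio transport sends each message as a single line of JSON.
print(json.dumps(request))
```

You rarely write this by hand, because the MCP client builds it for you. It is useful mainly when debugging the server from a terminal.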
Why macos-use
Why native accessibility beats screenshots
Screenshot agents burn tokens re-describing the screen every step and guess pixel positions. macos-use hands Claude a structured tree with semantic roles and coordinates, then returns only what changed after each action.
Accessibility tree, not pixels
Every action returns structured elements with role, text, and coordinates: `[AXButton] "Open" x:680 y:520 w:80 h:30 visible`. No OCR, no vision model tax.
Click by text
element: "Submit" finds and clicks. No pixel guessing.
Diff responses
After each action, only changed elements come back. Cheaper tokens, faster loops.
Native event injection
CGEvent clicks and keystrokes are OS-level. Works with apps that reject other simulated input.
InputGuard + Escape
User input is blocked during automation so you don't fight the agent. Escape cancels, and a 30-second watchdog prevents lockout.
Cross-app handoff
Click a link that opens Safari? The server detects the new frontmost app and traverses it automatically.
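The element line format shown under "Accessibility tree, not pixels" is simple enough to parse with a regular expression. The sketch below is one way to do it and is not the server's own parser. It assumes integer coordinates, that `x`/`y` mark the top-left corner of the element, and that element text contains no escaped quotes. `Element`, `parse_line`, and `find_by_text` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional

LINE_RE = re.compile(
    r'\[(?P<role>\w+)\]\s+"(?P<text>[^"]*)"\s+'
    r'x:(?P<x>-?\d+)\s+y:(?P<y>-?\d+)\s+w:(?P<w>\d+)\s+h:(?P<h>\d+)'
    r'(?P<visible>\s+visible)?'
)


@dataclass
class Element:
    role: str
    text: str
    x: int
    y: int
    w: int
    h: int
    visible: bool

    def center(self):
        # Assumes x/y are the top-left corner of the element's frame.
        return (self.x + self.w // 2, self.y + self.h // 2)


def parse_line(line: str) -> Optional[Element]:
    """Parse one traversal line, or return None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return Element(
        m["role"], m["text"],
        int(m["x"]), int(m["y"]), int(m["w"]), int(m["h"]),
        m["visible"] is not None,
    )


def find_by_text(lines, text):
    """Return the first visible element whose text matches exactly."""
    for line in lines:
        el = parse_line(line)
        if el is not None and el.visible and el.text == text:
            return el
    return None


sample = ['[AXButton] "Open" x:680 y:520 w:80 h:30 visible']
button = find_by_text(sample, "Open")
print(button.role, button.center())  # AXButton (720, 535)
```

In normal use the server does this matching for you when you pass element text. A parser like this helps when you want to inspect a tree dump yourself.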
Response shape
What Claude actually receives
Every tool returns a compact summary plus a path to the full accessibility tree dump. Claude greps the file for the element it wants. No screenshots in the prompt, no OCR pass, no pixel guessing.
pid: 4218
app: Slack
elements: 412 total, 87 visible
file: /tmp/macos-use/slack-traversal.txt
screenshot: /tmp/macos-use/slack.png
processing_time_seconds: "0.31"

[AXButton] "Direct messages" x:14 y:198 w:236 h:32 visible
[AXRow] "Sarah Chen" x:14 y:286 w:236 h:36 visible
[AXTextArea] "Message Sarah" x:268 y:812 w:892 h:42 visible
[AXButton] "Send" x:1188 y:818 w:32 h:30 visible
[AXStaticText] "on it" x:268 y:812 w:54 h:18 visible
{
  "added": [
    { "role": "AXStaticText", "text": "on it", "x": 268, "y": 760, "in_viewport": true },
    { "role": "AXStaticText", "text": "Just now", "x": 1108, "y": 760, "in_viewport": true }
  ],
  "removed": [
    { "role": "AXStaticText", "text": "on it", "x": 268, "y": 812, "in_viewport": true }
  ],
  "modified": [
    {
      "before": { "role": "AXTextArea", "text": "on it" },
      "after": { "role": "AXTextArea", "text": "" },
      "changes": [{ "attributeName": "AXValue", "oldValue": "on it", "newValue": "" }]
    }
  ]
}

Click sends in ~300ms and returns five fields, not a screenshot. The agent sees the message left the input and landed in the thread, then moves on.
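A diff in this shape can also be checked programmatically. The sketch below condenses the `added` / `removed` / `modified` lists into one line per change, then tests the "message left the input and landed in the thread" condition. The helper names `summarize` and `message_sent` are hypothetical, and the field names are taken from the sample diff rather than from a published schema.

```python
diff = {
    "added": [
        {"role": "AXStaticText", "text": "on it", "x": 268, "y": 760},
        {"role": "AXStaticText", "text": "Just now", "x": 1108, "y": 760},
    ],
    "removed": [
        {"role": "AXStaticText", "text": "on it", "x": 268, "y": 812},
    ],
    "modified": [
        {
            "before": {"role": "AXTextArea", "text": "on it"},
            "after": {"role": "AXTextArea", "text": ""},
            "changes": [
                {"attributeName": "AXValue", "oldValue": "on it", "newValue": ""}
            ],
        }
    ],
}


def summarize(d):
    """Render one short line per added, removed, or modified element."""
    lines = []
    for el in d.get("added", []):
        lines.append(f'+ {el["role"]} "{el["text"]}"')
    for el in d.get("removed", []):
        lines.append(f'- {el["role"]} "{el["text"]}"')
    for change in d.get("modified", []):
        role = change["after"]["role"]
        for c in change.get("changes", []):
            lines.append(
                f'~ {role} {c["attributeName"]}: '
                f'{c["oldValue"]!r} -> {c["newValue"]!r}'
            )
    return lines


def message_sent(d, text):
    """True if `text` appeared in the UI and a text area holding it was cleared."""
    landed = any(el["text"] == text for el in d.get("added", []))
    cleared = any(
        ch["before"].get("text") == text
        and ch["after"].get("role") == "AXTextArea"
        and ch["after"].get("text") == ""
        for ch in d.get("modified", [])
    )
    return landed and cleared


print("\n".join(summarize(diff)))
print("sent:", message_sent(diff, "on it"))
```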
macos-use vs. AppleScript-based MCP servers
If you've tried steipete/macos-automator-mcp, peakmojo/applescript-mcp, or any other osascript wrapper, here's what changes.
| Feature | AppleScript MCPs | macos-use |
|---|---|---|
| What the AI gets | Free-form text from `osascript`. The agent has to know the right script for every app. | Live accessibility tree with roles, labels, and coordinates. Same data Apple gives VoiceOver. |
| App coverage | Only apps that ship a real AppleScript dictionary. Most modern apps don't. | Every app macOS can describe via AX, including Electron apps, browsers, settings panels. |
| How clicks happen | AppleScript `click button` calls, often blocked by sandboxing. Many apps just refuse. | Native CGEvent at the OS level. Indistinguishable from a real user, works everywhere. |
| Failure mode | Cryptic AppleScript errors, the agent retries blind. | Diff response shows exactly which element changed. Agent can self-correct. |
| Auth & runtime | AppleScript runs in the host app's permission scope, hard to reason about. | One Swift binary over stdio. Local, open source, pinnable npm version. |
AppleScript still wins for a handful of legacy automation tasks (Finder folder actions, Mail rules), and macos-use stays out of those lanes. For everything else, the AX tree is simply a better data source.
macos-use vs. Claude computer use & other screenshot agents
Anthropic's computer-use beta, OpenAI Operator, and most desktop agents in 2026 ground every decision in pixels. On a hosted VM that's the only option. On your real Mac, the accessibility tree is a faster, cheaper, more reliable substrate.
| Feature | Claude computer use / screenshot agents | macos-use |
|---|---|---|
| How it sees the UI | Screenshot + OCR / vision model | Accessibility tree with roles and coordinates |
| Token cost per action | Full screen re-described every step | Diff-only: elements added / removed / changed |
| Latency per click | 1-3s per step (vision inference + re-screenshot) | ~300ms per click + tree diff |
| Click targeting | Pixel guess from screenshot | Exact coords from tree, or element text match |
| Input injection | Simulated keystrokes via vision loop | CGEvent, indistinguishable from real user input |
| Cross-platform | Yes (universal — pixels everywhere) | macOS only (uses native AX). Pair with Terminator on Windows. |
| Setup | Electron/Docker/Python stack | One Swift binary + stdio MCP |
| Where it runs | Often hosted SaaS | 100% local on your Mac |
Screenshots still matter for apps that expose no accessibility tree, and Claude's computer use is the right call when you need a single agent that works the same on macOS, Windows, and Linux. macos-use captures windows on demand so you can combine both when you need to.
Choose a screenshot agent when:
- You need the same agent to run on macOS, Windows, and Linux.
- The target app is a custom-rendered canvas (most games, some Electron builds) with no AX tree at all.
- You're fine paying ~1-3s and a vision-model call per click for universal compatibility.

Choose macos-use when:
- You're on macOS and want sub-second clicks with no vision-model bill.
- You want the agent to click by text (“Send”, “Submit”), not by guessed pixel coordinates.
- You want everything local: one Swift binary over stdio, no SaaS, no Docker.
- You already speak MCP and want a tool Claude can pick up next to your other servers.
You can run both. Anthropic's own guidance is that Claude tries MCP tools first and falls back to screen control only when no better integration exists, so macos-use becomes the fast path and computer use becomes the safety net.
Battle tested in production
The same server ships inside Fazm as the screen-control layer for a real, paying-customer product. If it works there, it works for your side project.
Every line is on GitHub. Pin a version, fork it, audit the Swift. Local binary over stdio, no network calls from the server itself.
Anything that exposes an accessibility tree, which covers most modern Mac apps. A sampling of what developers run regularly:
Electron apps (Slack, Notion, Discord, Figma, Linear, Cursor, VS Code) all expose AX — the tree is richer than people expect.
Honest about the limits
What it doesn't do (and what to do instead)
Every accessibility-driven tool has a ceiling. Here's where macos-use hits one, and the workarounds we and Fazm actually use in production.
Apps with no accessibility tree
A handful of Electron apps strip AX. Most games are custom-rendered with no semantic elements at all.
What to do
Pass raw coordinates to click_and_traverse. The tool still injects native CGEvent clicks even when the tree is empty.
Cross-window drag & drop
Single-window drags work. Drag-and-drop across two windows (Finder → app, Safari → Notes) is brittle because AX loses the source mid-gesture.
What to do
Use copy/paste plus keyboard focus, or the Services menu. Both are fully covered by the existing six tools.
No record/replay API yet
You can't capture a session and replay it deterministically. Every run goes through the LLM that called the tools.
What to do
Save your prompt history. The accessibility-tree dumps under /tmp/macos-use/ are enough to reconstruct a run.
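To line up a run after the fact, you can list the dump files in the order they were written. This is a minimal sketch. It assumes each tool call leaves its own `.txt` file under `/tmp/macos-use/`, which is a guess based on the `slack-traversal.txt` example. If the server overwrites one file per app, only the latest state survives. `list_dumps` is a hypothetical helper.

```python
from pathlib import Path


def list_dumps(directory="/tmp/macos-use", pattern="*.txt"):
    """Return dump files oldest-first by modification time."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(root.glob(pattern), key=lambda p: p.stat().st_mtime)


for path in list_dumps():
    print(path.name, path.stat().st_size, "bytes")
```

Pairing this ordered list with your saved prompt history gives a rough timeline of what the agent saw at each step.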
Hit something else? Open an issue — the roadmap follows actual users, not a wishlist.
Questions developers ask before installing
Which MCP clients does it work with?
Anything that speaks MCP over stdio. Tested daily with Claude Code, Claude Desktop, Cursor, VS Code (Copilot Chat), Windsurf, Cline, and Zed. Same JSON config, different file path per client.
How is this different from screenshot-based macOS agents?
It reads Apple's native accessibility tree (AXUIElement), so the AI gets structured elements with roles, labels, and coordinates instead of pixels. No OCR, no vision-model tax, no guessing pixel positions. Click by text match (element: "Submit") or exact coordinates from the tree. Responses are diff-only, so after an action you get what changed in the UI, not the whole screen again.
What macOS permissions does it need, and who grants them?
Accessibility permission is granted to the host process (Claude Desktop, Terminal, iTerm, VS Code, whoever spawns the MCP server), not to macos-use itself. That's macOS's TCC model. Screen Recording is needed only if you want window screenshots. Both are revocable from System Settings > Privacy & Security.
Will it click things I didn't approve?
No. Every tool call is gated by your MCP client's approval UI (Claude Code shows a diff-style prompt before each call). During automation, an InputGuard overlay blocks stray keyboard and mouse input so you don't fight the agent. Escape cancels the current action immediately. A 30-second watchdog prevents permanent lockout.
Is this safe to install? Where does the code run?
Fully local. The MCP server is a Swift binary running on your Mac, communicating with your AI client over stdio. No network egress from the server itself. Source is open on GitHub under mediar-ai/mcp-server-macos-use. Pin a specific npm version if you want reproducible installs.
What can't it do yet?
Three real limits: apps with no accessibility tree fall back to coordinate-only clicks, cross-window drag-and-drop is brittle (use copy/paste instead), and there's no record/replay API yet. The 'Honest about the limits' section above walks through each one with the workaround. Open an issue if you hit something not listed.
How do I uninstall?
Remove the entry from your MCP config file and (optionally) run `npm uninstall -g mcp-server-macos-use`. To revoke Accessibility or Screen Recording, open System Settings > Privacy & Security and remove the host app from each list.
Ready to try it?
Install with one command. If you're building something bigger on top of it and want the Swift, accessibility, or MCP side tailored to your use case, book 20 minutes with the team.