Drive any Mac app from Claude Code
A Swift MCP server that hands your AI assistant the same accessibility tree Apple gives VoiceOver. Click any button by text. Type into any field. Drive Xcode, Slack, Mail, System Settings, anything with an AX tree.
- Not an AppleScript wrapper. Native AX APIs + CGEvent, so it works on apps with no scripting dictionary (most modern Mac apps).
- Not a screenshot agent. Structured tree responses, diff-only after each action. No OCR tax, no vision-model bill.
- 100% local. One Swift binary over stdio. No SaaS, no network egress from the server.
Works with Claude Code · Claude Desktop · Cursor · VS Code · Windsurf · Cline · Zed
Real session: Claude Code calling macos-use to open an app, read the accessibility tree, click by text, and verify the result. No edits, no AppleScript, no screenshot loops.
Setup
Three steps to automation
Install
One command in Claude Code, or paste the JSON config into your MCP client.
Approve
Grant Accessibility permission to your MCP host on first run.
Ask Claude
Tell Claude to open apps, click buttons, fill forms, anything with an AX tree.
Installation
Drop your email. Pick your client. Done.
We'll email you the one-line install for Claude Code plus the JSON config for Cursor, Claude Desktop, VS Code, and Windsurf. Requires macOS 13+; the Swift binary builds on install.
First run prompts for Accessibility permission on whichever app is running your MCP client.
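For clients that take a JSON config, the entry usually has the shape below. Treat this as a sketch, not the official config: the `mcpServers` key is the convention used by clients such as Claude Desktop and Cursor, while the server name `macos-use` and the `npx -y mcp-server-macos-use` launch command are assumptions based on the npm package name mentioned in the FAQ. The config from the install email is the source of truth.

```python
import json


def build_mcp_config(server_name="macos-use", package="mcp-server-macos-use"):
    """Return an MCP client config dict with a single stdio server entry.

    Assumption: the server is launched through npx from the npm package;
    the emailed config may use a different command or arguments.
    """
    return {
        "mcpServers": {
            server_name: {
                "command": "npx",
                "args": ["-y", package],
            }
        }
    }


if __name__ == "__main__":
    # Print JSON you could paste into a client's MCP config file.
    print(json.dumps(build_mcp_config(), indent=2))
```

The file this goes into differs per client, so check your client's MCP documentation for the path.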
In practice
Once it's installed, type things like this
Real prompts you can drop into Claude Code or Cursor today. The server resolves each one through the accessibility tree, so the agent clicks the right button instead of guessing pixels.
Launches Xcode, presses ⌘R, watches the issue navigator, returns the error region.
Focuses Slack, opens the DM by accessibility label, types into the message field, sends.
Walks the settings tree, reads the toggles row by row, returns a structured list.
Cross-app handoff: focuses Cursor, ⌘P to open, ⌘G to jump, ⌘L for the panel.
Uses Finder + Preview through the accessibility tree, no AppleScript glue.
Mail traversal + native CGEvent clicks. Same flow that ships inside Fazm.
Every call is gated by your MCP client's approval prompt. You see the action before the server runs it.
Tools
Six tools. Full control.
Every tool returns the updated accessibility tree as a diff, so the agent always knows what changed.
- `open_application_and_traverse`: Launch or focus any app by name, bundle ID, or path.
- `click_and_traverse`: Click at coordinates or by element text. Optionally type and press a key in one call.
- `type_and_traverse`: Type into the focused field, with optional modifier keystroke.
- `press_key_and_traverse`: Arrow keys, ⌘⇧4, anything. Full modifier support.
- `scroll_and_traverse`: Scroll lines in any direction at a given position.
- `refresh_traversal`: Re-read the accessibility tree without taking an action.
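Your MCP client normally builds these calls for you, but it helps to see what goes over the wire. Below is a minimal sketch of a JSON-RPC 2.0 `tools/call` request. The `tools/call` method and the `name` / `arguments` fields come from the MCP specification, and `element` is the click-by-text argument described on this page; the `pid` argument name is an assumption.

```python
import json


def tools_call_request(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 `tools/call` message as defined by MCP."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Assumption: `pid` is how the target process is identified.
request = tools_call_request(
    1, "click_and_traverse", {"pid": 4218, "element": "Send"}
)

# The stdio transport sends one JSON message per line, no embedded newlines.
line = json.dumps(request) + "\n"
```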
Why macos-use
Why native accessibility beats screenshots
Screenshot agents burn tokens re-describing the screen every step and guess pixel positions. macos-use hands Claude a structured tree with semantic roles and coordinates, then returns only what changed after each action.
Accessibility tree, not pixels
Every action returns structured elements with role, text, and coordinates: `[AXButton] "Open" x:680 y:520 w:80 h:30 visible`. No OCR, no vision model tax.
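If you want to post-process these lines yourself, a small parser is enough. This is a sketch that assumes every line follows exactly the layout shown above and that element text contains no embedded double quotes; `Element` and `parse_element` are illustrative names, not part of macos-use.

```python
import re
from dataclasses import dataclass

# Matches lines like: [AXButton] "Open" x:680 y:520 w:80 h:30 visible
ELEMENT_RE = re.compile(
    r'\[(?P<role>\w+)\]\s+"(?P<text>[^"]*)"\s+'
    r'x:(?P<x>-?\d+)\s+y:(?P<y>-?\d+)\s+w:(?P<w>\d+)\s+h:(?P<h>\d+)'
    r'(?P<visible>\s+visible)?\s*$'
)


@dataclass
class Element:
    role: str
    text: str
    x: int
    y: int
    w: int
    h: int
    visible: bool


def parse_element(line):
    """Parse one traversal line into an Element, or None if it doesn't match."""
    m = ELEMENT_RE.match(line.strip())
    if m is None:
        return None
    return Element(
        role=m["role"],
        text=m["text"],
        x=int(m["x"]),
        y=int(m["y"]),
        w=int(m["w"]),
        h=int(m["h"]),
        visible=m["visible"] is not None,
    )


button = parse_element('[AXButton] "Open" x:680 y:520 w:80 h:30 visible')
```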
Click by text
element: "Submit" finds and clicks. No pixel guessing.
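Here is one plausible way a text target could be resolved to a click point. The server's actual matching rules (case sensitivity, exact vs. substring, tie-breaking) aren't described on this page, so the exact-then-substring, case-insensitive strategy and the click-at-center choice below are assumptions for illustration.

```python
def find_by_text(elements, query):
    """Return the first visible element matching `query`, preferring exact matches."""
    q = query.casefold()
    visible = [e for e in elements if e.get("visible")]
    # First pass: exact (case-insensitive) match on the element text.
    for e in visible:
        if e["text"].casefold() == q:
            return e
    # Second pass: fall back to a substring match.
    for e in visible:
        if q in e["text"].casefold():
            return e
    return None


def click_point(element):
    """Center of the element's frame, using integer division."""
    return (
        element["x"] + element["w"] // 2,
        element["y"] + element["h"] // 2,
    )


elements = [
    {"role": "AXTextArea", "text": "Message Sarah",
     "x": 268, "y": 812, "w": 892, "h": 42, "visible": True},
    {"role": "AXButton", "text": "Send",
     "x": 1188, "y": 818, "w": 32, "h": 30, "visible": True},
]

target = find_by_text(elements, "send")
point = click_point(target)  # (1204, 833)
```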
Diff responses
After each action, only changed elements come back. Cheaper tokens, faster loops.
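To make the idea concrete, here is a sketch of how a diff between two traversals could be computed. Keying elements by role and position is an assumption chosen for illustration; the server's own matching logic may differ.

```python
def diff_traversals(before, after):
    """Compare two element lists and return added / removed / modified entries."""

    def key(e):
        # Assumption: role + position identifies "the same" element.
        return (e["role"], e["x"], e["y"])

    old = {key(e): e for e in before}
    new = {key(e): e for e in after}

    added = [new[k] for k in new if k not in old]
    removed = [old[k] for k in old if k not in new]
    modified = [
        {"before": old[k], "after": new[k]}
        for k in old
        if k in new and old[k]["text"] != new[k]["text"]
    ]
    return {"added": added, "removed": removed, "modified": modified}


before = [
    {"role": "AXTextArea", "text": "on it", "x": 268, "y": 812},
    {"role": "AXButton", "text": "Send", "x": 1188, "y": 818},
]
after = [
    {"role": "AXTextArea", "text": "", "x": 268, "y": 812},
    {"role": "AXButton", "text": "Send", "x": 1188, "y": 818},
    {"role": "AXStaticText", "text": "on it", "x": 268, "y": 760},
]

diff = diff_traversals(before, after)
```

The unchanged "Send" button never appears in `diff`, which is where the token savings come from.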
Native event injection
CGEvent clicks and keystrokes are OS-level. Works with apps that reject other simulated input.
InputGuard + Escape
User input blocked during automation so you can't fight the agent. Escape cancels, 30s watchdog prevents lockout.
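The watchdog idea is simple enough to model in a few lines. The sketch below only illustrates the timeout logic; it is not how macos-use implements InputGuard (which runs in Swift at the OS event level), and `InputGuardSketch` is a hypothetical name.

```python
import threading


class InputGuardSketch:
    """Models a guard that always releases itself after a timeout."""

    def __init__(self, timeout_seconds=30.0):
        self.blocking = False
        self._timeout = timeout_seconds
        self._generation = 0
        self._lock = threading.Lock()

    def engage(self):
        """Start blocking and arm a watchdog for this engagement."""
        with self._lock:
            self._generation += 1
            generation = self._generation
            self.blocking = True
        timer = threading.Timer(self._timeout, self._expire, args=(generation,))
        timer.daemon = True
        timer.start()

    def _expire(self, generation):
        # Only release if no newer engage/release happened in the meantime.
        with self._lock:
            if generation == self._generation:
                self.blocking = False

    def release(self):
        """Stop blocking now (action finished, or the user pressed Escape)."""
        with self._lock:
            self._generation += 1  # invalidates any pending watchdog
            self.blocking = False
```

Call `engage()` before an action and `release()` when it completes or the user cancels; if neither happens, the timer clears `blocking` on its own.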
Cross-app handoff
Click a link that opens Safari? The server detects the new frontmost app and traverses it automatically.
Response shape
What Claude actually receives
Every tool returns a compact summary plus a path to the full accessibility tree dump. Claude greps the file for the element it wants. No screenshots in the prompt, no OCR pass, no pixel guessing.
pid: 4218
app: Slack
elements: 412 total, 87 visible
file: /tmp/macos-use/slack-traversal.txt
screenshot: /tmp/macos-use/slack.png
processing_time_seconds: "0.31"

[AXButton] "Direct messages" x:14 y:198 w:236 h:32 visible
[AXRow] "Sarah Chen" x:14 y:286 w:236 h:36 visible
[AXTextArea] "Message Sarah" x:268 y:812 w:892 h:42 visible
[AXButton] "Send" x:1188 y:818 w:32 h:30 visible
[AXStaticText] "on it" x:268 y:812 w:54 h:18 visible

{
"added": [
{ "role": "AXStaticText", "text": "on it", "x": 268, "y": 760, "in_viewport": true },
{ "role": "AXStaticText", "text": "Just now", "x": 1108, "y": 760, "in_viewport": true }
],
"removed": [
{ "role": "AXStaticText", "text": "on it", "x": 268, "y": 812, "in_viewport": true }
],
"modified": [
{
"before": { "role": "AXTextArea", "text": "on it" },
"after": { "role": "AXTextArea", "text": "" },
"changes": [{ "attributeName": "AXValue", "oldValue": "on it", "newValue": "" }]
}
]
}

The click sends the message in about 300 ms and returns a handful of summary fields rather than a screenshot in the prompt. The agent sees that the message left the input and landed in the thread, then moves on.
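An agent or a test harness can turn a diff like this into a yes/no check. The sketch below assumes the diff shape shown above; `message_was_sent` is a hypothetical helper, not part of macos-use.

```python
import json


def message_was_sent(diff, message):
    """True if `message` left an AXTextArea and appeared as new AXStaticText."""
    cleared = any(
        m["before"].get("role") == "AXTextArea"
        and m["before"].get("text") == message
        and m["after"].get("text") == ""
        for m in diff.get("modified", [])
    )
    landed = any(
        e.get("role") == "AXStaticText" and e.get("text") == message
        for e in diff.get("added", [])
    )
    return cleared and landed


# A trimmed copy of the diff shown above.
raw = """
{
  "added": [
    {"role": "AXStaticText", "text": "on it", "x": 268, "y": 760, "in_viewport": true}
  ],
  "removed": [],
  "modified": [
    {
      "before": {"role": "AXTextArea", "text": "on it"},
      "after": {"role": "AXTextArea", "text": ""}
    }
  ]
}
"""

diff = json.loads(raw)
sent = message_was_sent(diff, "on it")
```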
macos-use vs. AppleScript-based MCP servers
If you've tried steipete/macos-automator-mcp, peakmojo/applescript-mcp, or any other osascript wrapper, here's what changes.
| Feature | AppleScript MCPs | macos-use |
|---|---|---|
| What the AI gets | Free-form text from `osascript`. The agent has to know the right script for every app. | Live accessibility tree with roles, labels, and coordinates. Same data Apple gives VoiceOver. |
| App coverage | Only apps that ship a real AppleScript dictionary. Most modern apps don't. | Every app macOS can describe via AX, including Electron apps, browsers, settings panels. |
| How clicks happen | AppleScript `click button` calls, often blocked by sandboxing. Many apps just refuse. | Native CGEvent at the OS level. Indistinguishable from a real user, works everywhere. |
| Failure mode | Cryptic AppleScript errors, the agent retries blind. | Diff response shows exactly which element changed. Agent can self-correct. |
| Auth & runtime | AppleScript runs in the host app's permission scope, hard to reason about. | One Swift binary over stdio. Local, open source, pinnable npm version. |
AppleScript still wins for a handful of legacy automation tasks (Finder folder actions, Mail rules). macos-use stays out of those lanes; for everything else, the AX tree is simply a better data source.
macos-use vs. screenshot-based agents
| Feature | Screenshot agents | macos-use |
|---|---|---|
| How it sees the UI | Screenshot + OCR / vision model | Accessibility tree with roles and coordinates |
| Token cost per action | Full screen re-described every step | Diff-only: elements added / removed / changed |
| Click targeting | Pixel guess from screenshot | Exact coords from tree, or element text match |
| Input injection | Simulated keystrokes via vision loop | CGEvent, indistinguishable from real user input |
| Setup | Electron/Docker/Python stack | One Swift binary + stdio MCP |
| Where it runs | Often hosted SaaS | 100% local on your Mac |
Screenshots still matter for apps that expose no accessibility tree. macos-use captures windows on demand so you can combine both when you need to.
Battle tested in production
The same server ships inside Fazm as the screen-control layer for a real, paying-customer product. If it works there, it works for your side project.
Every line is on GitHub. Pin a version, fork it, audit the Swift. Local binary over stdio, no network calls from the server itself.
Questions developers ask before installing
Which MCP clients does it work with?
Anything that speaks MCP over stdio. Tested daily with Claude Code, Claude Desktop, Cursor, VS Code (Copilot Chat), Windsurf, Cline, and Zed. Same JSON config, different file path per client.
How is this different from screenshot-based macOS agents?
It reads Apple's native accessibility tree (AXUIElement), so the AI gets structured elements with roles, labels, and coordinates instead of pixels. No OCR, no vision-model tax, no guessing pixel positions. Click by text match (element: "Submit") or exact coordinates from the tree. Responses are diff-only, so after an action you get what changed in the UI, not the whole screen again.
What macOS permissions does it need, and who grants them?
Accessibility permission is granted to the host process (Claude Desktop, Terminal, iTerm, VS Code, whoever spawns the MCP server), not to macos-use itself. That's macOS's TCC model. Screen Recording is needed only if you want window screenshots. Both are revocable from System Settings > Privacy & Security.
Will it click things I didn't approve?
No. Every tool call is gated by your MCP client's approval UI (Claude Code shows a diff-style prompt before each call). During automation, an InputGuard overlay blocks stray keyboard and mouse input so you don't fight the agent. Escape cancels the current action immediately. A 30-second watchdog prevents permanent lockout.
Is this safe to install? Where does the code run?
Fully local. The MCP server is a Swift binary running on your Mac, communicating with your AI client over stdio. No network egress from the server itself. Source is open on GitHub under mediar-ai/mcp-server-macos-use. Pin a specific npm version if you want reproducible installs.
What can't it do yet?
Apps that expose no accessibility tree (some Electron apps and custom-rendered games) fall back to coordinate-only clicks. Drag gestures across windows are basic. There is no recording/replay API yet. If you hit something missing, open an issue or book a call; the roadmap follows actual users.
How do I uninstall?
Remove the entry from your MCP config file and (optionally) run `npm uninstall -g mcp-server-macos-use`. To revoke permissions, open System Settings > Privacy & Security and remove the host app from both the Accessibility list and the Screen Recording list.
Ready to try it?
Install with one command. If you're building something bigger on top of it and want the Swift, accessibility, or MCP side tailored to your use case, book 20 minutes with the team.