macOS automation tools split into three tiers, and only one of them re-reads the screen after every click.
Every listicle lumps AppleScript, Keyboard Maestro, Shortcuts, Hammerspoon, and (lately) AI agents into one bucket. They are three different delivery mechanisms with three different ceilings: Apple Events, input synthesis, and AI-agent MCP. This guide draws the line, then shows what the newest tier is doing that the older two cannot, using one concrete file path: Sources/MCPServer/main.swift.
The map, drawn once
Every tool on a "best macOS automation apps" list is in one of three tiers. The tier is defined by how the tool gets intent into the OS, not by who the tool is for.
Tier 1 sends Apple Events. The app receives a high-level verb ("make new sheet", "duplicate") because it published a scripting dictionary. AppleScript, osascript, Automator, and most of Shortcuts live here.
Tier 2 synthesizes input. The app receives a CGEvent or an AX method call that mimics a human. Keyboard Maestro, BetterTouchTool, Alfred workflows, Hammerspoon, and Raycast scripts live here. The trigger is human-authored (hotkey, gesture, cron), the target is usually a pre-recorded coordinate or UI path.
Tier 3 combines both: the primitives match tier 2 (CGEventPost, AXUIElement), the trigger is an LLM reasoning over a fresh read of the accessibility tree, and the response format is shaped for the model to consume. macos-use is a tier-3 tool. Terminator is its Windows sibling.
Tier 1 — Apple Events scripting
AppleScript, osascript, Automator, Shortcuts. Talks to apps that publish a scripting dictionary via NSAppleEventDescriptor. High-level verbs ('make new sheet', 'duplicate selection'). Cannot drive apps that never exposed a dictionary (most Electron apps, most web views, most new macOS apps). System Events' UI scripting is the escape hatch and it is a reluctant wrapper over the Accessibility API.
Tier 2 — Input synthesis hotkeys
Keyboard Maestro, BetterTouchTool, Alfred workflows, Hammerspoon, Raycast scripts. Trigger is a hotkey, gesture, or schedule. Delivery is CGEventPost, AXUIElement calls, or shell invocations. The script is a human-written sequence. Breaks when the UI shifts, because coordinates or AX paths were recorded, not discovered.
Tier 3 — AI-agent MCP servers
macos-use, Terminator (Windows sibling). Primitives match tier 2 (CGEventPost, AXUIElement) but the trigger is an LLM reasoning over a live read of the UI. The schema is JSON-RPC, the response is a diff, the observer is a model. Every mutation carries state back so the next decision has fresh ground truth.
Where Shortcuts actually sits
Shortcuts is nominally tier 1 (it speaks Intents / App Intents, which are an evolution of Apple Events). In practice it bolts on tier-2 UI actions for everything else. Powerful for scriptable apps. Awkward the moment the workflow crosses into an app that has not adopted Intents.
The ceiling of each tier
Tier 1 can only automate what the app's dictionary exposes. Tier 2 can automate anything you can click, but is brittle to layout change. Tier 3 reads the layout on every action so it stays correct under layout change and reaches apps that tiers 1 and 2 cannot. Different ceilings, not a ranking.
Same task, three tiers, three delivery mechanisms
"Save the active document" looks the same to the user in all three tiers. Underneath, the OS is hearing three completely different messages.
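The contrast can be made concrete. Below is a hedged sketch of the three messages, one per tier: the AppleScript line and the CGEvent pseudocall are illustrative, and the tier-3 request's argument names (`key`, `modifiers`) are assumptions, not the server's actual schema.

```swift
import Foundation

// Tier 1 — an Apple Event, expressed as the AppleScript that generates it.
let tier1 = #"tell application "TextEdit" to save the front document"#

// Tier 2 — input synthesis: a Cmd+S keystroke posted as a CGEvent.
// Keycode 1 is 's' on a US layout; the event never asks what app is frontmost.
let tier2 = "CGEventPost(.cghidEventTap, cmdDown + keyDown(1) + keyUp(1) + cmdUp)"

// Tier 3 — an MCP CallTool request; the LLM chose the tool and filled the
// params after reading the accessibility tree from the previous response.
let tier3 = """
{"jsonrpc":"2.0","id":7,"method":"tools/call",
 "params":{"name":"macos-use_press_key_and_traverse",
           "arguments":{"pid":4242,"key":"s","modifiers":["command"]}}}
"""

// Only the tier-3 message is machine-parseable, state-carrying JSON.
let parsed = try! JSONSerialization.jsonObject(with: Data(tier3.utf8)) as! [String: Any]
print(parsed["method"] as! String)
```

The point of the sketch: tiers 1 and 2 encode a fixed intent at authoring time, while the tier-3 payload is constructed fresh each turn by the model.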
The six tools that define tier-3 on macOS
An MCP client asks the server for its tool list once, at connect. That list is the six-tool array at Sources/MCPServer/main.swift:1408. Five tools mutate the UI and return a diff. One re-reads the tree and returns a snapshot. The LLM picks one per turn.
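A minimal sketch of that list as data, with a simplified assumed shape (the real Tool instances carry full JSON Schemas, per the FAQ at main.swift:1293-1399):

```swift
import Foundation

// Simplified stand-in for the server's Tool type; only the names below are
// taken from the article, the struct itself is an assumption.
struct ToolSketch { let name: String; let mutatesUI: Bool }

let allTools = [
    ToolSketch(name: "macos-use_open_application_and_traverse", mutatesUI: true),
    ToolSketch(name: "macos-use_click_and_traverse",            mutatesUI: true),
    ToolSketch(name: "macos-use_type_and_traverse",             mutatesUI: true),
    ToolSketch(name: "macos-use_press_key_and_traverse",        mutatesUI: true),
    ToolSketch(name: "macos-use_scroll_and_traverse",           mutatesUI: true),
    ToolSketch(name: "macos-use_refresh_traversal",             mutatesUI: false),
]

// Five mutators, one pure read.
print(allTools.filter(\.mutatesUI).count)
```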
One boolean. Six behaviors. The heart of tier 3.
This is what a tier-2 tool does not have: a unified read/write distinction at the server level that gates cursor save, app save, input guard, cancel checks, cursor restore, and app restore in one place. The refresh tool opts out of all of it because it mutates nothing.
What hangs off that one line
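A minimal Swift sketch of the gate, with assumed names (the real computation is the one line at main.swift:1667; the six behaviors are the ones listed in the FAQ below):

```swift
import Foundation

// One boolean, computed from the tool name, gates every side effect
// around the primary action. The refresh tool is the only pure read.
struct ToolCall { let name: String }
let refreshToolName = "macos-use_refresh_traversal"

func sideEffects(for call: ToolCall) -> [String] {
    // Anything that is not the read-only refresh tool is disruptive.
    let isDisruptive = call.name != refreshToolName
    guard isDisruptive else { return [] } // pure read: no guard, no save/restore
    return ["saveCursor", "saveFrontmostApp", "engageInputGuard",
            "checkCancellation", "restoreCursor", "restoreFrontmostApp"]
}

print(sideEffects(for: ToolCall(name: "macos-use_click_and_traverse")).count) // six behaviors
print(sideEffects(for: ToolCall(name: refreshToolName)).count)                // zero
```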
Tier 3 by the numbers
The number to pause on is two. A tier-3 tool call that triggers an XPC-hosted dialog (the Save Panel that appears in File → Save flows in scriptable and non-scriptable apps alike) returns two accessibility trees in the same JSON-RPC response, not one. That is why the next tool call from the model already knows the new PID and does not need a refresh round-trip.
End-to-end lifecycle of one tier-3 tool call
Eight moments between "client sent a CallTool request" and "server wrote the response". Everything in between is what tier-1 and tier-2 tools do not do.
Client picks one of the six tools from the schema
ListTools at Sources/MCPServer/main.swift:1465 returns the six-tool array and each tool's JSON Schema. The LLM decides which tool, fills in the params, and sends a CallTool request.
Server computes isDisruptive
main.swift:1667. One line. Anything that is not refreshTool is disruptive and triggers cursor save, frontmost-app save, and InputGuard.engage with a tool-specific overlay message.
For click_and_traverse, the target app is activated first
main.swift:1582-1586. NSRunningApplication(processIdentifier: pid).activate(options: []) then a 200ms sleep, because a click posted before activation propagates will land on the wrong window.
The primary action fires through MacosUseSDK
performAction runs on the MainActor (main.swift:1703-1706) or the composed-mode path at main.swift:1709-1751 if the call carries chained text/pressKey params.
After each step, throwIfCancelled unwinds if you pressed Esc
main.swift:1708, :1721, :1728, :1734, :1758. InputGuard installs a CGEventTap at CGEventTapLocation.cghidEventTap, so Esc wins against synthetic events.
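The cancellation pattern can be sketched in a few lines, with assumed names (the real InputGuard sits in Sources/MCPServer/InputGuard.swift and sees events through its CGEventTap): a flag flips when a bare Esc arrives, and throwIfCancelled is called between every step so the unwind is prompt.

```swift
import Foundation

struct CancelledByUser: Error {}

// Sketch only: the real guard's callback receives CGEvents from the tap;
// here we feed it keycodes directly.
final class InputGuardSketch {
    private(set) var wasCancelled = false
    func sawKey(code: Int64, modifiers: UInt64) {
        // Esc is keycode 53; only a bare press (no modifiers) cancels.
        if code == 53 && modifiers == 0 { wasCancelled = true }
    }
    func throwIfCancelled() throws {
        if wasCancelled { throw CancelledByUser() }
    }
}

let guardSketch = InputGuardSketch()
guardSketch.sawKey(code: 53, modifiers: 0) // user pressed bare Esc mid-run
do { try guardSketch.throwIfCancelled(); print("continued") }
catch { print("cancelled") }
```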
Post-action: traverse, diff, write .txt, capture .png
buildToolResponse (main.swift:612) assembles the diff. buildFlatTextResponse (main.swift:992) writes one line per element to /tmp/macos-use/. captureWindowScreenshot (main.swift:386) spawns the sibling screenshot-helper binary to capture the PNG without leaking ReplayKit into the server process.
Handoff check: if frontmost PID changed, traverse the new one too
main.swift:1786-1809. If NSWorkspace.shared.frontmostApplication?.processIdentifier differs from the PID you passed in, the server re-traverses the new app, attaches appSwitchTraversal, and writes a '# app_switch:' header into the flat-text file.
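The decision itself is a PID comparison; a hedged sketch of its shape (assumed function name, not the literal main.swift code):

```swift
import Foundation

// After the action: compare the now-frontmost PID against the target PID the
// client passed in. A mismatch means focus escaped to another process and a
// second traversal of the new frontmost app is needed.
func needsAppSwitchTraversal(targetPID: pid_t, frontmostPID: pid_t?) -> Bool {
    guard let frontmost = frontmostPID else { return false }
    return frontmost != targetPID
}

// A click in pid 4242 popped a dialog owned by a sibling process (pid 555):
print(needsAppSwitchTraversal(targetPID: 4242, frontmostPID: 555))
print(needsAppSwitchTraversal(targetPID: 4242, frontmostPID: 4242))
```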
Cursor and frontmost app restored before the response is sent
main.swift:1767-1772 posts a .mouseMoved CGEvent to the saved point. main.swift:1774-1780 re-activates the original frontmost app if something else took focus that was not the target app.
“A tier-1 AppleScript cannot return 'what changed in the UI' because an Apple Event reply carries only what the scripting dictionary defines. A tier-2 hotkey cannot return it because it never reads the UI. Tier 3 returns a diff because the response shape exists for a model to reason over, not for a human to read in a log.”
Compiled from main.swift:612 (buildToolResponse) and main.swift:992 (buildFlatTextResponse)
Classical automation tool vs. tier-3 MCP server
| Feature | Tier 1 / Tier 2 | macos-use (tier 3) |
|---|---|---|
| How the tool gets intent into the OS | Apple Events (tier 1) or CGEventPost/hotkey (tier 2) | CGEventPost driven by an LLM reading the live AX tree (tier 3) |
| What the tool returns after an action | Either nothing, or the app's scripting dictionary result | A diff of the AX tree (added/removed/modified) + flat-text file + screenshot |
| Reaches apps that never published a scripting dictionary | Tier 1: no. Tier 2: yes but via recorded coords | Yes, because AXUIElement is system-wide |
| Chains click + type + press into one call | No (AppleScript can sequence, but returns no intermediate state) | Yes, composed-mode path at main.swift:1709-1751 |
| Detects a dialog that opened in a sibling process | No concept of frontmost-PID comparison | Yes, cross-app handoff detector at main.swift:1786-1809 |
| Treats user keyboard as a shared resource during automation | No (user is assumed to be hands-off) | InputGuard engages a CGEventTap with 30s watchdog + Esc kill-switch |
| Consumer of the response | A human reading logs, or the next script step | An LLM picking the next tool call from the 6-tool schema |
Verify everything in one terminal tail
Build the server, wire it into any MCP-compliant client, fire a click, and look at what hit disk.
Frequently asked questions
What counts as a 'macOS automation tool' in this guide?
Anything a user installs to get the Mac to perform a task without them clicking through it step by step. That spans three unrelated technologies. Tier 1 tools (AppleScript, Automator, Shortcuts, osascript) speak Apple Events to apps that publish a scripting dictionary. Tier 2 tools (Keyboard Maestro, BetterTouchTool, Alfred, Hammerspoon, Raycast scripts) synthesize input: CGEventPost strokes, hotkey triggers, recorded coordinates. Tier 3 tools (AI-agent MCP servers like macos-use) read the live Accessibility tree and synthesize input against the coordinates they just read. The boundary matters because each tier has a different ceiling on what it can automate. Grouping them in one listicle hides that ceiling.
What does 'delivery mechanism' mean for an automation tool?
How the tool gets the user's intent into the operating system. Apple Events is message-passing: you say `tell application "Numbers" to make new sheet` and NSAppleEventDescriptor routes it to Numbers' scripting dictionary. Input synthesis is event injection: CGEvent(keyboardEventSource:...).post writes a keystroke directly at the HID event tap. AI-agent MCP combines live read (AXUIElement traversal) with input synthesis (same CGEventPost), but adds an LLM in the loop that decides what to click based on the tree it just received. The delivery mechanism determines what kind of app the tool can drive. Electron apps, for example, expose almost nothing to Apple Events but everything to Accessibility.
Where does macos-use fit and what exactly does it ship?
Tier 3, the AI-agent branch. It ships six MCP tools defined at Sources/MCPServer/main.swift:1408: macos-use_open_application_and_traverse, macos-use_click_and_traverse, macos-use_type_and_traverse, macos-use_press_key_and_traverse, macos-use_scroll_and_traverse, and macos-use_refresh_traversal. Five of them mutate state; one is read-only. Every mutation tool accepts a PID (required) plus its own action params, fires the CGEvent, retraverses the app's Accessibility tree, and returns the DIFF (added/removed/modified elements) along with a flat-text file path at /tmp/macos-use/<ts>_<tool>.txt and a screenshot PNG. A classical input-synthesis tool would only return 'event posted'; macos-use also returns what the event changed.
What is the one boolean this page keeps mentioning?
The line `let isDisruptive = params.name != refreshTool.name` at Sources/MCPServer/main.swift:1667. It is computed once per tool call and decides whether the server will save the cursor position (main.swift:1672-1675), save the currently frontmost app (main.swift:1671), engage InputGuard to block your keyboard and show the red overlay (main.swift:1696), check InputGuard.wasCancelled between the primary action and the follow-ups (main.swift:1708/1721/1728/1734/1758), restore the cursor after the action (main.swift:1767-1772), and restore the frontmost app if focus escaped (main.swift:1774-1780). Six behaviors hanging off one boolean. The refresh tool skips all of them because it is a pure read.
What is the 'two AX trees per tool call' claim?
When a mutation tool causes focus to escape to a different process — a Save Panel owned by openAndSavePanelService, a Share sheet owned by SharingUIServer, a permissions prompt from tccd — the server notices by comparing NSWorkspace.shared.frontmostApplication?.processIdentifier against the PID you passed in. If they differ, the block at Sources/MCPServer/main.swift:1786-1809 calls traverseAccessibilityTree on the NEW frontmost PID and attaches the result as appSwitchTraversal on the same ToolResponse. One JSON-RPC request, two trees, one screenshot of the new window, one compact summary with an 'app_switch:' line and a sampled visible_elements block for the new app. No other category of macOS automation tool has an equivalent concept because no other category reads state after each write.
Can AppleScript / Shortcuts do what macos-use does?
Only for apps that publish a scripting dictionary. If an app does not expose Apple Events, AppleScript falls back to System Events' UI scripting, which drives the Accessibility API in a much narrower way (tell window 1 of process 'X' to click button 'Save'). It cannot return a diff, cannot chain click+type+press into one response, cannot detect that the Save Panel is owned by a sibling process, and cannot be driven by an LLM that has already read the tree. Where AppleScript wins: it is preinstalled, has a 30-year corpus of examples, and talks to scriptable apps (OmniFocus, BBEdit, Numbers, Mail, Finder) at a higher level of abstraction than pixel-accurate clicks.
Can Keyboard Maestro / Hammerspoon do what macos-use does?
They can post the same CGEvent that macos-use posts. What they cannot do is let an LLM pick the target at runtime based on a live read of the Accessibility tree. Hammerspoon's hs.axuielement module exposes the AX API in Lua, which is the closest tier-2 cousin to tier 3, but it is still a human-written script triggered by a hotkey. The tier-3 shift is that the automation instructions are a JSON-RPC schema (the 6-tool array at main.swift:1408), a model decides which one to call based on prior observations, and the response format is shaped for an LLM to consume (grep-friendly flat text, compact summary, diff not snapshot). Different loop, different consumer.
Why does the server save the cursor before every non-refresh action?
Because an automation run can move the mouse to a coordinate, click, and leave the user's cursor halfway across the screen — awkward when you are watching the agent work. main.swift:1672-1675 flips NSEvent.mouseLocation (AppKit, bottom-left origin) into CGEvent coordinates (top-left origin) by computing primaryScreen.frame.height - nsPos.y, so the saved CGPoint can be passed straight into CGEvent(mouseEventSource:mouseType:.mouseMoved, mouseCursorPosition:) on restore at main.swift:1767-1772. Skip the flip and the cursor restores to the wrong Y on multi-monitor setups. The frontmost app restore at main.swift:1774-1780 only fires if the current frontmost differs from the one we saved, so a click that legitimately ended with the target app in focus does not re-activate.
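The flip itself is one subtraction; a sketch of the arithmetic under the assumption of a single primary display:

```swift
import Foundation

// AppKit's NSEvent.mouseLocation uses a bottom-left origin; CGEvent uses
// top-left. The saved Y must be flipped against the primary screen's height
// before it can be restored with a .mouseMoved CGEvent.
func cgY(fromAppKitY nsY: Double, primaryScreenHeight: Double) -> Double {
    primaryScreenHeight - nsY
}

// On a 1080-pt-tall primary display, an AppKit Y of 100 (near the bottom)
// becomes a CGEvent Y of 980 — still near the bottom, now in top-left coords.
print(cgY(fromAppKitY: 100, primaryScreenHeight: 1080))
```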
Where exactly are the six tools registered and how is the array used?
At Sources/MCPServer/main.swift:1408 the array is assembled as `let allTools = [openAppTool, clickTool, typeTool, pressKeyTool, scrollTool, refreshTool]` and then handed to the ListTools handler at main.swift:1465 so any MCP client can enumerate them. Each Tool instance carries its own JSON Schema (main.swift:1293-1399) describing the params each tool accepts. The CallTool handler at main.swift:1474 does one switch on params.name against the tool names, builds a PrimaryAction, runs MacosUseSDK's performAction on the MainActor, and packs the result through buildToolResponse at main.swift:612 and buildFlatTextResponse at main.swift:992. The whole thing is 500 lines of dispatch and the tier-3 behaviors fall out of it.
What does 'chained' action mean for macos-use and why does it matter?
click_and_traverse accepts `text` and `pressKey` parameters (schema at main.swift:1318-1324). If both are passed, one MCP call runs click + type + press in a single JSON-RPC round trip via the composed-mode path at main.swift:1709-1751. The mechanics: primary click with traverseAfter=false (main.swift:1714-1716), sleep 100ms, type, sleep 100ms, press, then ONE final traverseOnly call (main.swift:1737-1741) to capture the after-state. The model does not pay three round-trip latencies to type-and-send a Slack message; it pays one. The server's instructions string at main.swift:1422-1426 tells the client model to prefer this shape.
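The sequencing can be sketched as a plan builder, with assumed step names (the real path is the composed-mode block at main.swift:1709-1751):

```swift
import Foundation

// One CallTool request fans out into click + type + press, with a single
// traversal at the end instead of three — one round trip, one after-state.
func composedPlan(text: String?, pressKey: String?) -> [String] {
    var steps = ["click(traverseAfter: false)"]
    if text != nil     { steps += ["sleep(100ms)", "type(text)"] }
    if pressKey != nil { steps += ["sleep(100ms)", "press(key)"] }
    steps.append("traverseOnly()") // one final read captures the after-state
    return steps
}

// click + sleep + type + sleep + press + traverseOnly = 6 steps, 1 round trip
print(composedPlan(text: "hello", pressKey: "return").count)
```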
What is the InputGuard and why does a tier-3 tool need one?
It is a CGEventTap installed at the HID event-tap level in Sources/MCPServer/InputGuard.swift that swallows your keyboard and mouse while the agent is posting events, so a stray keypress from you does not race with the synthetic one the server posted. It engages from main.swift:1696 with a per-tool description string, disengages at main.swift:1759, has a 30-second watchdog auto-release at InputGuard.swift:24, and treats Esc (keycode 53, no modifiers) as a hard cancel via throwIfCancelled at InputGuard.swift:53. Tier-1 tools do not need this (Apple Events do not touch your input stream). Tier-2 tools usually skip it and rely on the user not touching the keyboard. Tier 3 has to treat input as a shared resource because the automation loop is long and the user is usually watching.
How do I see all of this running?
Clone the repo, build with `xcrun --toolchain com.apple.dt.toolchain.XcodeDefault swift build`, wire the binary into Claude Desktop or Cursor as an MCP server, and `ls -lt /tmp/macos-use/`. Every tool call writes a timestamped pair: <ts>_<tool>.txt for the flat-text response (one line per accessibility element) and <ts>_<tool>.png for the window screenshot with a red crosshair at the click point. `grep -n '# app_switch' /tmp/macos-use/*.txt` surfaces every handoff that fired. `grep -c '^\[AX' /tmp/macos-use/<ts>_*.txt` counts how many elements the agent could see on a given action.
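The greps above rely on the flat-text layout being one prefix-addressable line per element. A sketch with fabricated sample lines — the exact line format here is an assumption inferred from the grep patterns, not copied from the server's output:

```swift
import Foundation

// Hypothetical flat-text content in the style of /tmp/macos-use/<ts>_<tool>.txt:
// an optional '# app_switch:' header after a handoff, then one '[AX...' line
// per visible accessibility element.
let flatText = """
# app_switch: frontmost pid changed
[AXButton] Save (123, 456)
[AXTextField] File name: (88, 410)
"""

let lines = flatText.split(separator: "\n")
let elementCount = lines.filter { $0.hasPrefix("[AX") }.count        // grep -c '^\\[AX'
let handoffCount = lines.filter { $0.hasPrefix("# app_switch") }.count // grep '# app_switch'
print(elementCount, handoffCount)
```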
Deeper into the tier-3 internals
The MCP Server Desktop-App Problem No One Documents
Your click just opened a dialog owned by a different process. The block at main.swift:1786-1809 compares the frontmost PID after every action and returns both trees in the same response.
macOS Accessibility Tree Agents
The diff format, the in_viewport enrichment, the noise filters. What the tier-3 tree actually looks like when it reaches the model.
macOS AI Agent State Memory
The .txt files under /tmp/macos-use/ are the agent's memory. One line per element, grep-addressable, no tokens until the agent opens the file.