Guide / 2026
macOS accessibility automation in 2026: the return-shape problem, and the ReplayKit subprocess hack nobody mentions.
Every other guide on this topic explains AXUIElement and how to traverse the tree. That is the easy half. The harder half, the one that decides whether your automation survives ten minutes of real use, is the response. This is a walk through what changed when the caller stopped being a human and started being an LLM.
Direct answer / verified 2026-05-04
macOS accessibility automation is the practice of programmatically reading and driving native Mac apps through Apple's system Accessibility API (AXUIElement, AXObserver) and CGEvent input synthesis, the same APIs that power VoiceOver and Switch Control, instead of through screen scraping or AppleScript GUI scripting.
Authoritative reference: Apple developer docs, AXUIElement.h. The shape of that API has barely changed since OS X 10.2. What changed in 2026 is who calls it.
The thesis
For twenty years, the consumer of the Accessibility API on macOS was a human. Either a person who needed VoiceOver, or a developer who wrote a Swift utility, or someone using AppleScript's tell application System Events wrapper. All three want the same thing out of the API: pose a question, get a synchronous answer, write code that reacts.
Starting in 2025 a different consumer showed up. An LLM agent over MCP, running inside Claude Code or Cursor or VS Code, calling tools that drive a real Mac. The API is the same. The constraints on the response are not. A model with a context window does not have unlimited room for a 200 KB AX tree on every step. It cannot efficiently parse a deeply nested JSON blob describing eight thousand elements. It needs the action to come back with just enough signal to plan the next action, in a form that grep and Read can chew on cheaply.
That is the design problem. The rest of this page is what it looks like when you build for it on purpose.
What an agent actually receives after one tool call
One click_and_traverse call, traced from the model out and back. The full accessibility tree lives on disk. The model reads a file path, a hint, a screenshot path, and a small inline sample.
round-trip of a single click
The response shape, byte for byte
Below is what arrives back inside the MCP CallTool response after a click on Slack's Send button. Eight lines of structured metadata, one human-readable summary, a short list of text changes, and a sample of visible interactive elements. The full tree is at the file path. The annotated screenshot is at the screenshot path. Everything else is on disk for grep.
```
# Sample compact summary returned to the agent after a click.
# This is what arrives in the MCP CallTool result. The full
# accessibility tree is on disk; the model reads it on demand.
status: success
pid: 4821
app: Slack
file: /tmp/macos-use/1745496712384_click_and_traverse.txt
file_size: 1734 bytes (4 elements)
hint: grep -n 'AXButton' /tmp/macos-use/1745496712384_click_and_traverse.txt
screenshot: /tmp/macos-use/1745496712384_slack.png
summary: Clicked element 'Send'. 3 added, 1 modified.
text_changes: 'Message #general' -> 'Sending…'
visible_elements:
  [AXButton]   "Send"              x:820 y:612 w:60  h:28  visible
  [AXTextArea] "Message"           x:280 y:612 w:520 h:28  visible
  [AXButton]   "Add file"          x:240 y:612 w:32  h:28  visible
  [AXButton]   "Channel: general"  x:240 y:108 w:200 h:28  visible
  ...
```
Constructed by buildCompactSummary at Sources/MCPServer/main.swift:731. The visible_elements section is filtered to interactive roles (AXButton, AXLink, AXTextField, AXTextArea, AXCheckBox, AXRadioButton, AXPopUpButton, AXComboBox, AXSlider, AXMenuItem, AXMenuButton, AXTab) at main.swift:937-941.
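To make the filter concrete, here is a minimal sketch of that selection step in Swift. It is not the macos-use source: the AXElementInfo struct and the formatLine and visibleElementsSample names are illustrative, and the real traversal populates its own types from the AX attribute calls. The role list and the one-line format are the ones quoted above.

```swift
import Foundation
import CoreGraphics

// Illustrative stand-in for one traversed AX node; the real server
// builds its own representation during traversal.
struct AXElementInfo {
    let role: String      // e.g. "AXButton"
    let text: String      // title, value, or description
    let frame: CGRect     // screen coordinates
    let isVisible: Bool
}

// Roles the inline sample keeps, mirroring the list quoted above.
let interactiveRoles: Set<String> = [
    "AXButton", "AXLink", "AXTextField", "AXTextArea", "AXCheckBox",
    "AXRadioButton", "AXPopUpButton", "AXComboBox", "AXSlider",
    "AXMenuItem", "AXMenuButton", "AXTab"
]

// One element per line: [Role] "text" x:N y:N w:W h:H visible
func formatLine(_ e: AXElementInfo) -> String {
    let vis = e.isVisible ? "visible" : "hidden"
    return "[\(e.role)] \"\(e.text)\" " +
        "x:\(Int(e.frame.origin.x)) y:\(Int(e.frame.origin.y)) " +
        "w:\(Int(e.frame.width)) h:\(Int(e.frame.height)) \(vis)"
}

// The inline sample is a capped slice of visible interactive elements;
// the full, unfiltered tree goes to the .txt file on disk.
func visibleElementsSample(_ all: [AXElementInfo], cap: Int = 30) -> [String] {
    all.filter { $0.isVisible && interactiveRoles.contains($0.role) }
       .prefix(cap)
       .map(formatLine)
}
```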
The screenshot trap that ate our parent process
Here is a thing nobody warns you about. Calling CGWindowListCreateImage on macOS quietly side-loads ReplayKit. ReplayKit is the framework that powers screen recording. Once it is loaded into your process, it parks a thread that idles around 19 percent CPU and never goes away, even after every reference to the API is gone. For a CLI utility that exits in milliseconds: invisible. For a long-running MCP server: a slow CPU and memory leak that the user notices first as fan noise.
long-running MCP server, screenshot path
An MCP server captures a window screenshot via CGWindowListCreateImage in-process on every tool call. The first call works fine. By the tenth call, the server's RSS has grown to half a gig and a background ReplayKit thread is sitting at 19 percent CPU. The user reports their fans spinning up while the agent is idle. Activity Monitor shows the server consuming CPU between turns. Restarting the server fixes it. The user does this every twenty minutes and eventually uninstalls.
- ReplayKit thread holds ~19% CPU forever
- RSS climbs across the session
- Restarting fixes it for twenty minutes
The naive code, and the code that ships
Both paths produce a PNG. Only one of them produces a server you can leave running for an afternoon. The naive path is shown first, with a sketch of the subprocess pattern after it.
screenshot capture path
```swift
// The naive way most people write a screenshot grab.
// This is correct in a CLI utility that exits in milliseconds.
// In a long-running MCP server it leaks a ReplayKit thread.
import CoreGraphics

func grabWindow(windowID: CGWindowID, to path: String) {
    guard let image = CGWindowListCreateImage(
        .null,
        .optionIncludingWindow,
        windowID,
        [.boundsIgnoreFraming, .bestResolution]
    ) else { return }
    // Side effect: ReplayKit framework loaded into THIS process.
    // It will idle at ~19% CPU on a background thread forever.
    // Restarting the screenshot path does not reclaim it.
    // The cost compounds across an automation session.
    writePNG(image, path)   // writePNG: whatever PNG encoder you already have
}
```
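And the path that ships: hand the capture to a short-lived helper so ReplayKit loads into a process that exits milliseconds later. The sketch below is not the macos-use source; the helper path and flag spellings are illustrative (the real parent passes --click and --bounds, per the FAQ further down), but the isolation pattern is the point.

```swift
// A minimal sketch of the pattern that ships: run the capture in a
// short-lived helper so ReplayKit loads into a process that exits.
// The helper path and flag spellings here are illustrative.
import Foundation
import CoreGraphics

func grabWindowViaHelper(windowID: CGWindowID,
                         windowBounds: CGRect,
                         clickPoint: CGPoint?,
                         outputPath: String) throws {
    let helper = Process()
    // Hypothetical location of the built screenshot-helper executable.
    helper.executableURL = URL(fileURLWithPath: "/usr/local/libexec/screenshot-helper")

    var args = [
        "--window-id", String(windowID),
        "--bounds", "\(Int(windowBounds.minX)),\(Int(windowBounds.minY))," +
                    "\(Int(windowBounds.width)),\(Int(windowBounds.height))",
        "--output", outputPath
    ]
    if let p = clickPoint {
        // The helper draws the red crosshair at this point (next section).
        args += ["--click", "\(Int(p.x)),\(Int(p.y))"]
    }
    helper.arguments = args

    // CGWindowListCreateImage runs inside the helper, so the ReplayKit
    // thread it drags in dies when the helper exits.
    try helper.run()
    helper.waitUntilExit()
    guard helper.terminationStatus == 0 else {
        throw NSError(domain: "screenshot-helper",
                      code: Int(helper.terminationStatus),
                      userInfo: nil)
    }
}
```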
The crosshair is drawn inside the subprocess too
The subprocess is not just an isolation trick. It is also where macos-use draws the click annotation that the model reads. The parent passes the click point in screen coordinates and the window bounds; the helper translates into the captured image's local pixel space, flips the y-axis for CoreGraphics, and stamps a red crosshair plus a 10-point circle. The agent loads the .png the same turn it gets the response and verifies visually that it clicked the button it thought it was clicking.
```swift
// Sources/ScreenshotHelper/main.swift:50-90
// The helper draws the click crosshair before writing the PNG.
// Multimodal models read the .png and verify the click landed.
if let clickPoint = clickPoint, let windowRect = windowRect {
    let scaleX = imageWidth / windowRect.width
    let scaleY = imageHeight / windowRect.height
    let localX = (clickPoint.x - windowRect.origin.x) * scaleX
    let localY = (clickPoint.y - windowRect.origin.y) * scaleY
    // CoreGraphics origin is bottom-left, so flip y.
    let drawX = localX
    let drawY = imageHeight - localY
    ctx.setStrokeColor(CGColor(red: 1, green: 0, blue: 0, alpha: 1))
    ctx.setLineWidth(2.0 * max(scaleX, scaleY))
    let arm: CGFloat = 15 * max(scaleX, scaleY)
    ctx.move(to: CGPoint(x: drawX - arm, y: drawY))
    ctx.addLine(to: CGPoint(x: drawX + arm, y: drawY))
    ctx.move(to: CGPoint(x: drawX, y: drawY - arm))
    ctx.addLine(to: CGPoint(x: drawX, y: drawY + arm))
    ctx.strokePath()
    let radius: CGFloat = 10 * max(scaleX, scaleY)
    ctx.addEllipse(in: CGRect(
        x: drawX - radius, y: drawY - radius,
        width: radius * 2, height: radius * 2))
    ctx.strokePath()
}
```
Source: Sources/ScreenshotHelper/main.swift:50-90. The whole helper is 111 lines. Read time, two minutes.
What happens between the click and the response
Six things, in order, every time the agent issues a disruptive tool call. The two halves visible to the model are the first (the request) and the last (the compact summary). Everything between them is the server's job to make small.
one click, end to end
1. click_and_traverse
2. InputGuard engage
3. CGEvent.post
4. AX traversal diff (sketched just below the list)
5. screenshot subprocess
6. compact summary
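Step four is the one that keeps responses small. Below is a minimal sketch of the snapshot-act-snapshot-subtract idea, assuming elements are already rendered in the flat one-line format described later on this page. The regex, the helper names, and the commented usage (traverse, performClick) are illustrative; the real filters at main.swift:591-682 also track modified elements and drop scroll-bar churn, which this sketch elides.

```swift
import Foundation

// Illustrative sketch of the snapshot / act / snapshot / subtract step.
// Assumes every element is already rendered as one flat line, e.g.
//   [AXButton] "Send" x:820 y:612 w:60 h:28 visible
// Coordinate-only churn is dropped by stripping x/y/w/h before comparing.
func stripCoordinates(_ line: String) -> String {
    line.replacingOccurrences(
        of: #" x:-?\d+ y:-?\d+ w:\d+ h:\d+"#,
        with: "",
        options: .regularExpression
    )
}

func diffSnapshots(before: [String], after: [String]) -> (added: [String], removed: [String]) {
    let beforeKeys = Set(before.map(stripCoordinates))
    let afterKeys = Set(after.map(stripCoordinates))
    let added = after.filter { !beforeKeys.contains(stripCoordinates($0)) }
    let removed = before.filter { !afterKeys.contains(stripCoordinates($0)) }
    return (added, removed)
}

// Usage shape (traverse and performClick are hypothetical helpers):
//   let before = traverse(app)
//   performClick(at: point)
//   let after = traverse(app)
//   let diff = diffSnapshots(before: before, after: after)
//   summaryLine = "\(diff.added.count) added, \(diff.removed.count) removed"
```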
How this differs from the tools you already have
The shape of the comparison is not "does it work". AppleScript and Shortcuts and Keyboard Maestro all work. The shape is what comes back to the caller. None of these tools were designed for an LLM in a tool loop, because the loop did not exist when they were built.
Response and capability surface
| Feature | AppleScript / Shortcuts / Keyboard Maestro | macos-use |
|---|---|---|
| Returns post-action UI state | No, you call System Events again on the next line | Yes, AX-tree diff written to disk + summary in the tool response |
| Screenshot capture with click annotation | Not part of the language | PNG with red crosshair at click point, drawn by the subprocess |
| Token-aware response shape for LLMs | No, output is for human eyes or osascript callers | 5-line summary + file path + grep hint + 10-30 inline elements |
| Reaches Catalyst right-pane controls | Often silently no-ops on Notes, Reminders, App Store | Falls back to AXUIElementPerformAction(kAXPressAction) |
| Sets values on sandboxed text fields | Synthetic key events get dropped | kAXValueAttribute write bypasses the input event tap |
| Survives a long automation session | n/a, it exits after the script | ReplayKit isolated into a subprocess so the parent stays clean |
Tool category, not direct head-to-head: AppleScript still wins on app-specific scripting dictionaries (TextEdit, Mail, Numbers), and Shortcuts wins on App Intents. The point of the table is response shape for an LLM caller, not raw reach.
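The two fallback rows in the table rest on two plain Accessibility calls. A minimal sketch, assuming you have already resolved the target AXUIElement; error handling is reduced to a Bool:

```swift
import ApplicationServices

// Fallback 1: press via the AX action instead of a synthetic mouse click.
// Useful where Catalyst right-pane controls ignore CGEvent clicks.
func pressViaAXAction(_ element: AXUIElement) -> Bool {
    AXUIElementPerformAction(element, kAXPressAction as CFString) == .success
}

// Fallback 2: write the value attribute directly instead of typing.
// Useful where sandboxed fields drop synthetic key events.
func setTextViaAXValue(_ element: AXUIElement, _ text: String) -> Bool {
    AXUIElementSetAttributeValue(element,
                                 kAXValueAttribute as CFString,
                                 text as CFString) == .success
}
```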
Why a file plus grep, instead of returning the tree
Two reasons, both about token economics.
One. JSON returned in a tool response goes straight into the model's conversation context. Whatever you return is multiplied by every subsequent turn that has to read the conversation. A single 200 KB tree dump becomes 2 MB of context across ten steps, and that is before the next eight tool calls. A file path costs a few dozen tokens, regardless of how big the file actually is. The model decides whether to call Read or Grep, and if it does, it reads only the slice it needs.
Two. Grep is sharper than free-form scanning. If the model wants the "Send" button on Slack, the right call is grep -n 'AXButton.*Send' /tmp/macos-use/<file>.txt. That returns one line, with x:820 y:612 w:60 h:28, which is exactly the four numbers the next click_and_traverse call needs. The cost of that grep is far below the cost of asking the model to scan a JSON tree for the same substring.
The shape of the file is one element per line: [Role] "text" x:N y:N w:W h:H visible. That shape parses in your head and in any grep regex you will ever write. The structured tree is reconstructible from the flat text if you really need it, but in nine months of using this shape, neither I nor the model has needed to.
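To close the loop, here is a hedged sketch of turning one grepped line back into a click target. The field parsing follows the flat format above; aiming at the center of the bounds is an assumption for illustration, not necessarily what macos-use does.

```swift
import CoreGraphics

// Parse "x:820 y:612 w:60 h:28" out of one flat element line and
// return the element's center point as a click target.
// Assumes the line format shown above; returns nil if a field is missing.
func clickTarget(fromLine line: String) -> CGPoint? {
    var fields: [String: Int] = [:]
    for token in line.split(separator: " ") {
        let parts = token.split(separator: ":", maxSplits: 1)
        if parts.count == 2, let value = Int(parts[1]) {
            fields[String(parts[0])] = value
        }
    }
    guard let x = fields["x"], let y = fields["y"],
          let w = fields["w"], let h = fields["h"] else { return nil }
    return CGPoint(x: Double(x) + Double(w) / 2,
                   y: Double(y) + Double(h) / 2)
}

// clickTarget(fromLine: #"[AXButton] "Send" x:820 y:612 w:60 h:28 visible"#)
// -> CGPoint(x: 850.0, y: 626.0)
```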
Verify any of the above in the source
Seven steps from clone to grounded. The repo is MIT-licensed, github.com/mediar-ai/mcp-server-macos-use.
confirm the design exists, in seven reads
- Clone: git clone https://github.com/mediar-ai/mcp-server-macos-use
- Build: swift build (requires Swift 5.9+, Xcode command-line tools)
- Open Sources/MCPServer/main.swift and jump to line 382. The doc comment names ReplayKit as the reason the helper exists.
- Open Sources/ScreenshotHelper/main.swift. Confirm it is a complete 111-line standalone executable with its own main() and exit().
- Open Sources/MCPServer/main.swift:731. Read buildCompactSummary end to end. Note that no point in the function returns the full AX tree to the caller.
- Run scripts/test_mcp.py against your local build. Watch the JSON-RPC tool response come back with file: and screenshot: keys, not a tree.
- Tail /tmp/macos-use/. Every tool call leaves a timestamped .txt and a matching .png. The .txt is what the model greps; the .png has the click crosshair.
Wiring an LLM into a Mac and hitting the response-shape ceiling?
Walk through your stack with the macos-use maintainers. We have shipped this against real Catalyst apps, real long sessions, and real ReplayKit leaks; the conversation tends to save weeks.
Frequently asked questions
What is macOS accessibility automation, in 2026 terms?
It is the practice of programmatically reading and driving native Mac apps through Apple's system Accessibility API (AXUIElement, AXObserver) and CGEvent input synthesis. The same surface that powers VoiceOver and Switch Control. The tree exposes role, identifier, position, value, and AX actions for every element a sighted user can interact with. The 2026 wrinkle is who the caller is. Until recently it was a human writing AppleScript, a Swift utility, or a Hammerspoon config. Now it is an LLM agent over MCP, and the design constraints on the response a server returns have changed accordingly.
Why does the caller's identity change anything? The API is the same.
The API is the same. The token budget is not. A naive accessibility-tree dump on a typical Slack window is around 8,000 elements, somewhere in the 200 KB to 800 KB range as JSON. Every model on the market gets lost reading that on every tool call. macos-use writes the full tree to a file in /tmp/macos-use, returns a five-line summary plus the file path plus a grep hint, attaches a PNG screenshot of the window with a red crosshair drawn at the click point, and inlines a small sample (10-30 entries) of visible interactive elements with role, text, and coordinates. The model uses Read and Grep tools on the .txt and Read on the .png. The wire format is what shifted, not the API.
What is the ReplayKit screenshot trap actually?
Calling CGWindowListCreateImage on macOS quietly loads the ReplayKit framework as a side-effect, regardless of whether you are recording the screen. Once loaded into a long-running process, ReplayKit holds a thread that idles around 19% CPU forever and does not unload when you stop using the API. This is fine for a CLI utility that exits in milliseconds. It is fatal for an MCP server that has to stay alive for the duration of a session. macos-use isolates the call into a separate executable named screenshot-helper, runs it as a subprocess on every screenshot, and lets ReplayKit die with the helper. The helper is 111 lines (Sources/ScreenshotHelper/main.swift) and the rationale is in a doc comment at Sources/MCPServer/main.swift:382-386.
What does the screenshot subprocess actually do beyond capture?
It also draws the click annotation. main.swift:451-453 passes --click <x>,<y> --bounds <x>,<y>,<w>,<h> to the helper, and the helper at ScreenshotHelper/main.swift:50-90 translates global screen coordinates into the captured image's local pixel space (scaleX = imageWidth / windowRect.width, scaleY = imageHeight / windowRect.height), flips the y-axis for CoreGraphics drawing, and stamps a red crosshair plus a circle at the click location. The annotation is what lets the model verify visually that it clicked the button it thought it was clicking. The .png path arrives in the tool-call summary so a multimodal model can Read it the same turn.
What does the compact summary that an agent receives actually look like?
buildCompactSummary at main.swift:731 produces something like: status: success / pid: 4821 / app: Slack / file: /tmp/macos-use/1745496712384_click_and_traverse.txt / file_size: 1734 bytes (4 elements) / hint: grep -n 'AXButton' /tmp/macos-use/... / screenshot: /tmp/macos-use/1745496712384_slack.png / summary: Clicked element 'Send'. 3 added, 1 modified. / text_changes: 'Message' -> 'Sending…' / visible_elements: ten lines of interactive elements with role, text, x, y, w, h. The point is that nothing on this list is the full AX tree. The full tree is on disk. The summary is what the model reads first, and it is structured so the model knows how to ask for more (a Grep on the file path) without re-traversing.
How is the post-action diff shaped, and why is that better than re-sending the tree?
For click, type, press, and scroll, the server snapshots the AX tree before the action, runs the action, snapshots after, subtracts, and writes the differences (added, removed, modified) to disk plus a count to the summary. Coordinate-only changes are dropped, scroll-bar churn is filtered, and empty structural containers are dropped. A typical click on a real app produces 2-15 entries in the diff, not 8,000. The model gets enough signal to plan the next action without paying the full-tree token cost on every step. Sources are at main.swift:591-682 (filters), main.swift:612 (branch on showDiff), main.swift:1600 (click handler that flips the flag).
How is this different from AppleScript, Shortcuts, or Keyboard Maestro?
Three different layers. AppleScript GUI scripting (System Events) wraps the same Accessibility API; you write tell application System Events to click button 'Save' and osascript resolves the path at send time. It does not return the tree after the click and does not surface diffs. Shortcuts and Automator target App Intents, not pixel-level UI; they reach apps that explicitly opt in, which is most Apple apps and a small slice of third-party. Keyboard Maestro records and replays at the input layer; it has no AX-tier introspection. macos-use sits on the AX layer like AppleScript, but the response shape is designed for a model loop instead of a synchronous script return value. That is the line.
Why a file plus grep instead of returning JSON in the tool response?
Token economics and tool composition. JSON returned in a tool response goes straight into the model's context, costing tokens on every subsequent turn. A file path costs a few dozen tokens; the model decides whether to Read or Grep it, and if it does, it reads only the slice it needs. On a ten-step automation the difference is the model staying coherent versus the model running out of context halfway through. A secondary benefit is that grep is a sharper instrument than free-form scanning: grep -n 'AXButton.*Send' /tmp/macos-use/<file>.txt returns one line and that line has the coordinates the next click_and_traverse needs.
Does any of this require special permissions or entitlements?
Yes. The Accessibility API needs the host app (Claude Code, Cursor, Claude Desktop, the terminal you launched the server from) to be granted Accessibility permission in System Settings, Privacy and Security, Accessibility. CGWindowListCreateImage additionally needs Screen Recording permission for the host app (a requirement since Catalina, not new to Sequoia), otherwise the captured PNG is blank or contains only the menu bar. Both prompts fire on first call and the user has to allow once per host. macos-use does not bundle a privileged helper or anything that requires manual entitlements.
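If you want to preflight both permissions before the first tool call, the system exposes a check for each. A minimal sketch; the prompting variants are named in the comments rather than called:

```swift
import ApplicationServices
import CoreGraphics

// Check the Accessibility permission. AXIsProcessTrustedWithOptions can
// additionally pop the system prompt if you pass kAXTrustedCheckOptionPrompt.
func hasAccessibilityPermission() -> Bool {
    AXIsProcessTrusted()
}

// Check the Screen Recording permission without triggering the prompt;
// CGRequestScreenCaptureAccess() is the call that actually asks the user.
func hasScreenRecordingPermission() -> Bool {
    CGPreflightScreenCaptureAccess()
}
```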
How can I verify any of this in the source myself?
git clone https://github.com/mediar-ai/mcp-server-macos-use, then read four files in this order. Sources/MCPServer/main.swift:382-386 (the ReplayKit comment that justifies the subprocess). Sources/ScreenshotHelper/main.swift (the entire 111-line helper). Sources/MCPServer/main.swift:731-906 (buildCompactSummary, the response shape). Sources/MCPServer/main.swift:1147-1296 (findElementByText and the scroll chaser). Total reading time is around twenty minutes and you will have grounded every claim on this page.