
macOS Accessibility Tree Automation: The Control-Arbitration Problem Every Other Guide Skips

Most articles about macOS accessibility tree automation teach you AXUIElementCopyAttributeValue, AXObserver, and how to walk the tree. They skip the question that actually decides whether a real user keeps the tool installed: when an LLM is driving the AX tree on a machine you also use, who owns the keyboard for the next 800 milliseconds? This page is about the file that answers that question, Sources/MCPServer/InputGuard.swift.

Matthew Diakonov
11 min read
CGEventTap installed at head-insert position so it runs first — InputGuard.swift:131
Hardware vs programmatic split by eventSourceStateID — InputGuard.swift:329-332
Plain Esc is keycode 53 with empty modifier intersection — InputGuard.swift:340-351
Watchdog auto-release at watchdogTimeout = 30 seconds — InputGuard.swift:24

What The Other Guides Cover, And Where They Stop

Search this topic and you will get a stack of useful but incomplete articles. Apple's documentation walks you through the AXUIElement APIs. MacPaw's macapptree repo dumps the tree to JSON. AccessibilityInspector ships with Xcode and lets you point at any UI element and read its role and attributes. Hammerspoon's hs.axuielement wraps the same APIs in Lua. AppleScript GUI scripting is the decade-old answer. Anthropic's computer-use cookbook shows you how to call the model.

Every one of those treats the accessibility tree as the artifact and stops there. The artifact is necessary; it is not sufficient. The moment you put a model behind the tools and the machine is also a machine you sit at, you hit a question none of them answer: while the model is mid-action, what happens when a human reaches for the keyboard out of muscle memory? On the naive path, the human's keystrokes race the model's into the focused app, the action lands somewhere different from where the agent thinks it landed, and the next traversal describes a UI state that did not arise from the action. From that point, the agent's belief about the world and the world disagree, and there is no robust way to reconcile them without restarting.

The fix has to be in the layer that sits between hardware input and the focused app. That is what Sources/MCPServer/InputGuard.swift is. The rest of this page is a tour of what it does, why it does each thing, and how every piece is verifiable in the source tree.

The Same Click, Two Worlds

One mid-action user keystroke. Two outcomes.

A 700ms type_and_traverse on Slack

An LLM agent is mid-way through a 700ms type_and_traverse on Slack. The user, unaware, brings their hand back to the keyboard and presses Cmd-K out of muscle memory. The Cmd-K races the agent's typed characters into Slack, which interprets the chord as 'jump to channel' and pops a modal. The agent's next CGEvent lands inside that modal. The accessibility-tree diff returned to the agent now describes the modal, not the message. The agent has no idea what just happened.

  • User Cmd-K races the agent's keystrokes
  • Modal pops, agent's next event lands in the modal
  • Diff describes the modal, not the message
  • Agent's belief and the world diverge silently

The Numbers That Anchor The Pattern

30s
watchdog auto-release
53
Esc keycode (no modifiers)
11
CGEventTypes in the tap mask
0
hardware event sourceStateID

Every number above is a literal in InputGuard.swift. Verifiable with grep.

The Tap Mask, Built Bit By Bit

The first thing the guard does on engage is install a tap that sees the eleven CGEventTypes worth swallowing. The mask is built incrementally because the Swift type-checker times out on the equivalent single OR expression. Head-insert placement (rawValue: 0) means this tap runs before any other listener for the duration of the call.

Sources/MCPServer/InputGuard.swift
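The shape of that mask build, as a sketch. The event-type list is inferred from the mask described in prose (keyDown, keyUp, flagsChanged, both mouse buttons, moved, dragged, scrollWheel), not copied verbatim from the file:

```swift
import CoreGraphics

// Build the mask one bit at a time: a single 11-term OR expression
// can time out the Swift type-checker, which is why the real file
// does it incrementally too.
var mask: CGEventMask = 0
let swallowed: [CGEventType] = [
    .keyDown, .keyUp, .flagsChanged,
    .leftMouseDown, .leftMouseUp, .rightMouseDown, .rightMouseUp,
    .mouseMoved, .leftMouseDragged, .rightMouseDragged, .scrollWheel,
]
for type in swallowed {
    mask |= CGEventMask(1) << type.rawValue
}

// Head-insert placement (rawValue 0) puts this tap ahead of every
// other listener on the HID event stream for the life of the tap.
let tap = CGEvent.tapCreate(
    tap: .cghidEventTap,
    place: .headInsertEventTap,   // rawValue: 0 — runs first
    options: .defaultTap,         // active tap: may swallow events
    eventsOfInterest: mask,
    callback: { _, _, event, _ in
        // Real arbitration logic lives in InputGuard.swift; this
        // placeholder passes everything through.
        return Unmanaged.passUnretained(event)
    },
    userInfo: nil
)
```

An active tap (`.defaultTap`) is what allows the callback to return nil and drop an event; a listen-only tap could observe but never block.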

The Two Lines That Make The Whole Pattern Work

One field on every CGEvent decides whether it came from hardware (your keyboard) or from CGEvent.post (the agent). The tap callback reads it once and short-circuits. That is the whole arbitration mechanism, and it is the part other guides on macOS accessibility tree automation do not mention because they assume the tree is the only thing the agent needs.

Sources/MCPServer/InputGuard.swift
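As a sketch, the whole pass/drop decision reduces to one field read. The function name here is illustrative; only the field (`CGEventField.eventSourceStateID`) and the zero/non-zero split come from the source:

```swift
import CoreGraphics

// Agent events posted via CGEvent.post carry a non-zero source state
// ID; hardware keyboard and mouse events carry 0.
func arbitrate(_ event: CGEvent) -> Unmanaged<CGEvent>? {
    if event.getIntegerValueField(.eventSourceStateID) != 0 {
        return Unmanaged.passUnretained(event)  // agent: pass through
    }
    return nil                                   // human: swallowed
}
```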

Where Each Event Goes

Inputs are every kind of event the tap sees during a single tool call; outputs are what the tap callback returns to the OS for each one. Every other tap on the system sees only what this tap passes through.

Inside the InputGuard tap callback

  • User keyDown → dropped (return nil)
  • User mouseMoved → dropped (return nil)
  • User scrollWheel → dropped (return nil)
  • Plain Esc → cancellation
  • Agent CGEvent.post → pass through (stateID != 0)

Six Mechanisms That Keep You In Control

The tap is the headline. There are five other mechanisms around it. Together they make the difference between "I wrote a CGEventTap once" and "a CGEventTap that survives a long automation run on a real user's machine".

Head-insert CGEventTap

Tap is created with CGEventTapPlacement(rawValue: 0) at InputGuard.swift:131 so it runs before any other listener. Mask covers keyDown, keyUp, both mouse buttons, mouseMoved, dragged, scrollWheel, flagsChanged.

Programmatic vs hardware

Source state ID at InputGuard.swift:329-332. Non-zero passes; zero is dropped. The agent and the human use the same tap, separated by one integer field.

Plain Esc as kill switch

keycode 53, modifier mask intersection must be empty. InputGuard.swift:340-351. Cmd-Esc and Opt-Esc stay blocked along with the rest of your input.

30-second watchdog

DispatchSource timer on a global queue fires unconditionally at watchdogTimeout. Even if the Swift task hangs, the tap is torn down and the overlay disappears.

Cursor save/restore

NSEvent.mouseLocation captured at main.swift:1672, flipped to CGEvent coords, replayed via mouseMoved CGEvent at main.swift:1767-1771. Even if the agent dragged the cursor across three monitors, it ends back where it started.

Frontmost-app restore

NSWorkspace.frontmostApplication saved at main.swift:1671 and reactivated with .activate(options: []) at main.swift:1778, but only if the saved app is not terminated. Cross-app handoffs are detected and respected separately.

The Watchdog: A 30-Second Hard Cap

CGEventTaps that stay engaged forever are how you produce unrecoverable lockouts. The guard schedules a DispatchSource timer on a global queue at engage time. The timer fires from a different thread than the one running the Swift task, so even if the task is wedged, the tap is torn down and the overlay disappears.

Sources/MCPServer/InputGuard.swift
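A minimal sketch of that watchdog, assuming the names `watchdogTimeout` and `disengage` from the prose rather than the literal file contents:

```swift
import Foundation

final class Watchdog {
    let watchdogTimeout: TimeInterval = 30
    private var timer: DispatchSourceTimer?

    // Arm a DispatchSource timer on a global queue. Because it fires
    // on that queue, it runs even if the Swift task driving the
    // action is wedged on another thread.
    func arm(onFire disengage: @escaping () -> Void) {
        let t = DispatchSource.makeTimerSource(queue: .global())
        t.schedule(deadline: .now() + watchdogTimeout)
        t.setEventHandler {
            disengage()  // tear down the tap, hide the overlay
        }
        t.resume()
        timer = t
    }

    // Normal disengage cancels the watchdog before it fires.
    func cancel() {
        timer?.cancel()
        timer = nil
    }
}
```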
2 saves, 2 restores per call

Every disruptive tool call also saves the cursor position via NSEvent.mouseLocation and the frontmost app via NSWorkspace.shared.frontmostApplication, then restores both after the action. The cursor is replayed with a CGEvent mouseMoved at the saved point. The previous app is reactivated only if it is still alive. Cross-app handoffs are detected separately and respected.

Sources/MCPServer/main.swift:1671-1781
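The save/restore pair can be sketched like this. NSEvent.mouseLocation uses bottom-left AppKit coordinates while CGEvent uses top-left, hence the y-axis flip against the primary screen height; variable names are illustrative, not a verbatim copy of main.swift:

```swift
import AppKit

// Save: frontmost app by reference, cursor as a flipped CGPoint.
let prevApp = NSWorkspace.shared.frontmostApplication
let loc = NSEvent.mouseLocation
let screenHeight = NSScreen.screens.first?.frame.height ?? 0
let savedPoint = CGPoint(x: loc.x, y: screenHeight - loc.y)

// ... the disruptive action runs here ...

// Restore: replay a mouseMoved CGEvent at the saved point.
if let move = CGEvent(mouseEventSource: nil, mouseType: .mouseMoved,
                      mouseCursorPosition: savedPoint,
                      mouseButton: .left) {
    move.post(tap: .cghidEventTap)
}

// Reactivate the previous app only if it is still alive.
if let prev = prevApp, !prev.isTerminated {
    prev.activate(options: [])
}
```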

Cursor And Focus, Saved And Restored

The control boundary covers more than keyboard and mouse events. It also covers the side effects of having clicked somewhere: the cursor is now wherever the click landed, and whatever app the action touched is now the frontmost app. Neither of those states is what the user had before the call. Both are restored.

Sources/MCPServer/main.swift

The Lifecycle Of One Disruptive Tool Call

Seven moments, all in the source. A typical 700ms click_and_traverse visits each one. Refresh tools (which only read the tree) skip the entire input-guard path because isDisruptive is set to false at main.swift:1667.

1

T+0ms: tool call arrives

MCP handler at main.swift:1666 sets isDisruptive = (params.name != refreshTool.name). Refresh tools skip the input guard entirely; everything else runs through it.

2

T+1ms: cursor and focus snapshot

NSEvent.mouseLocation read, y axis flipped using NSScreen.screens.first.frame.height, stashed as CGPoint. NSWorkspace.shared.frontmostApplication saved by reference. The current PID gets logged with localizedName so post-mortems are readable.

3

T+2ms: InputGuard.shared.engage

Lock acquired, _engaged set to true, _cancelled reset. createEventTap is called synchronously on the main thread because CGEventTaps must register with the main run loop. showOverlaySync builds the NSWindow before the call returns.

4

T+3ms: watchdog timer armed

DispatchSource.makeTimerSource on a global queue scheduled at .now() + 30. setEventHandler closes over [weak self] and calls disengage. This timer fires regardless of what the Swift task is doing.

5

T+4ms to T+~700ms: action runs

performAction issues CGEvent.post calls with non-zero sourceStateID. The tap sees them, returns Unmanaged.passUnretained(event), and they reach the target app. User keystrokes during this window have stateID=0 and return nil from the tap callback.

6

T+~700ms: action completes

200ms grace period via Task.sleep so the user has a moment to press Esc after the visible action completes. If wasCancelled is true, the function throws InputGuardCancelled and the MCP response is an error, not a diff.

7

T+~900ms: disengage and restore

InputGuard.disengage tears down the tap, removes the run loop source from CFRunLoopGetMain, hides the overlay, and stops the watchdog. Cursor restored via CGEvent mouseMoved. Frontmost app reactivated if it changed and is still alive.

What The Server Logs During A Real Call

One click_and_traverse on the Slack "Send" button while the user half-typed "he" on their keyboard. The two keyDown lines with sourceState=0 are the user's; both return nil from the tap callback, so neither key reaches any app. The 643ms is the inner action, not the visible call duration.

stderr from MCPServer during click_and_traverse

What Happens When You Press Esc Mid-Call

The Esc keycode is 53. The tap recognizes it, writes a marker file to /tmp/macos-use/esc_pressed.txt so a post-mortem is possible, sets the cancelled flag, and disengages. The next throwIfCancelled inside the action throws and the tool call returns an error instead of a diff.

stderr when the user presses Esc mid-call

How This Compares To Common Alternatives

AppleScript GUI scripting, Hammerspoon's hs.axuielement, and most ad-hoc shell wrappers ignore the control-arbitration problem entirely. They assume the script is short and the user is not at the keyboard. That assumption breaks the moment an LLM holds the tools.

Feature | naive AX automation | macos-use
Block hardware input during the action | no, your keystrokes race the script | head-insert CGEventTap drops every hardware event
Distinguish your input from the agent's | no, both look like CGEvents | eventSourceStateID == 0 vs != 0
Single-key abort while the action runs | Cmd-period sometimes works in AppleScript | plain Esc, keycode 53, no modifiers required
Auto-release if the script hangs | no, you reach for the power button | 30s DispatchSource timer on a global queue
Cursor returns to where you left it | no, the cursor sits where the agent dropped it | NSEvent.mouseLocation snapshot and CGEvent replay
Frontmost app restored after the call | no, focus stays in whatever the script touched last | NSWorkspace.frontmostApplication saved and reactivated
Visible 'AI is using your computer' state | no, your screen looks normal during automation | screensaver-level pill, pulsing orange dot, custom message

The Surface Area That Sits Around The Tree

The accessibility tree is the input. These are the layers and primitives that wrap it on macOS. Every chip is a real symbol from the source tree or a sibling project.

AXUIElement
AXObserver
CGEventTap
NSWorkspace
NSEvent
MacosUseSDK
AppleScript
hs.axuielement
AccessibilityInspector
Hammerspoon
macapptree
Terminator (Windows)
Anthropic computer-use
Model Context Protocol

Why This Detail Doesn't Show Up In Existing Articles

The accessibility tree is a macOS concept; the control-arbitration problem is an agent-on-real-user concept. They live in different worlds. Articles aimed at developers who are reading the tree to test their own apps treat the tree as the product. Articles aimed at agents that drive someone else's machine have to also answer what happens to that machine's human while the agent acts. The first kind is the entire current corpus on this topic. The second kind is what an agent host actually ships.

macos-use is the second kind. The accessibility tree is served as a flat, grep-able file with diff semantics on iteration. The tree access is gated by an input guard that gives the human a hard boundary while each call runs. Together those two layers are what make "LLM driving my Mac" a tool you reach for, not a story you watch from a safe distance.

Verify Every Claim On This Page Yourself

Two minutes from clone to verified. None of this requires faith.

Eight steps to confirm the input boundary is real

  • Clone mediar-ai/mcp-server-macos-use
  • swift build with the Xcode default toolchain
  • Grant Accessibility permission to the resulting binary
  • Point Claude Desktop or any MCP client at .build/debug/MCPServer
  • Call open_application_and_traverse on Calculator (small target, fast tree)
  • Call type_and_traverse and mash your keyboard during the call
  • Confirm /tmp/macos-use/tap_status.txt was rewritten with tap_created: enabled=true
  • Press Esc mid-call and confirm /tmp/macos-use/esc_pressed.txt appears
30 watchdog seconds
53 Esc keycode
11 CGEventTypes blocked
6 MCP tools, 5 input-guarded

Putting an LLM on your Mac and want a sanity check?

Walk through your control-arbitration boundary with the maintainers and pressure-test the guard for your workflow.

Frequently asked questions

What does macos-use actually do during a single accessibility-tree automation call?

Six things happen in order around every disruptive tool call (open, click, type, press, scroll). One: the handler captures the cursor location with NSEvent.mouseLocation and the frontmost app with NSWorkspace.shared.frontmostApplication at main.swift:1671-1675. Two: InputGuard.shared.engage installs a CGEventTap on .cghidEventTap at the head-insert position with a mask covering keyDown, keyUp, both mouse buttons, mouseMoved, dragged, scrollWheel, and flagsChanged at InputGuard.swift:113-153. Three: a 30-second DispatchSource timer is armed as a watchdog at InputGuard.swift:172-181. Four: a fullscreen NSWindow at .screenSaver level is shown with a centered dark pill, a 16-point pulsing orange dot, and a 20-point semibold white label that says what the AI is doing. Five: the action runs and posts CGEvents that pass through because their eventSourceStateID is non-zero. Six: the overlay disengages, the cursor is restored to its saved point, and the previously frontmost app is reactivated at main.swift:1767-1781.

How does the event tap tell my keystrokes apart from the agent's keystrokes?

Source state ID. Every CGEvent carries an eventSourceStateID field. Programmatic events posted via CGEvent.post with the .hidSystemState source carry a non-zero stateID. Hardware events that originate from the keyboard or trackpad always carry stateID = 0. The tap callback at InputGuard.swift:329-332 reads that field first and short-circuits with Unmanaged.passUnretained(event) when the value is non-zero. That is how the agent can synthesize a keystroke through the same tap that is currently swallowing your typing: the bit is set by the OS, not inferable from key code alone.

Why is plain Esc the only key that gets through, and how is 'plain' defined?

InputGuard.swift:340-351 checks for keyCode == 53 and zero intersection with the modifier mask {.maskCommand, .maskControl, .maskAlternate, .maskShift}. Cmd-Esc, Opt-Esc, and Shift-Esc are all blocked along with the rest of your input. Only the unmodified Esc reaches handleEscPressed, which sets the cancelled flag, disengages the tap, and calls the optional onUserCancelled callback. The reason the bar is plain Esc and not a chord is reachability under panic: if the agent has done something visibly wrong, you want the dumbest possible single-key to interrupt, not a combo your hand has to find under pressure.
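That check can be sketched in a few lines. The keycode (53) and the four-mask intersection come from the source; the helper name is illustrative:

```swift
import CoreGraphics

// 'Plain Esc' means keycode 53 AND an empty intersection with the
// four modifier masks, so Cmd-Esc, Opt-Esc, Ctrl-Esc and Shift-Esc
// stay blocked along with everything else.
func isPlainEsc(_ event: CGEvent) -> Bool {
    let keyCode = event.getIntegerValueField(.keyboardEventKeycode)
    let modifiers: CGEventFlags = [.maskCommand, .maskControl,
                                   .maskAlternate, .maskShift]
    return keyCode == 53 && event.flags.intersection(modifiers).isEmpty
}
```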

What stops the tool from locking my keyboard forever if the Swift process hangs?

The watchdog. InputGuard.swift:24 declares watchdogTimeout: TimeInterval = 30 and startWatchdog at InputGuard.swift:172-181 schedules a DispatchSource timer on a global queue that calls disengage() unconditionally after 30 seconds. Even if the action never returns, the tap is torn down, the run loop source is removed from CFRunLoopGetMain, and the overlay is hidden. The 30-second budget is also why click-then-type-then-press is exposed as a single composed tool call rather than three separate ones: a chained call costs the same one budget window as a single click.

What does the overlay actually look like on screen and why is it intrusive on purpose?

buildAndShowOverlay at InputGuard.swift:202-276 creates a borderless NSWindow at .screenSaver level (above almost everything except the menu bar), tinted black at 0.15 alpha so the desktop bleeds through, with collectionBehavior = [.canJoinAllSpaces, .fullScreenAuxiliary] so it follows you across Spaces and into fullscreen apps. Centered: a 720pt-or-50%-wide pill at 92% opacity, rounded to half its 80pt height, holding a 16pt orange dot that pulses between 1.0 and 0.3 opacity every 0.8 seconds (CABasicAnimation, autoreversing, infinite repeat) and a 20pt semibold label. The intrusiveness is the point. If the model is driving your computer, the screen should look different from when you are.
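The pulsing dot alone looks roughly like this, assuming a layer-backed NSView; the numbers (1.0 → 0.3 opacity, 0.8s, autoreverse, infinite repeat) are the ones stated above:

```swift
import AppKit
import QuartzCore

// 16pt orange dot, rounded into a circle.
let dot = NSView(frame: NSRect(x: 0, y: 0, width: 16, height: 16))
dot.wantsLayer = true
dot.layer?.backgroundColor = NSColor.orange.cgColor
dot.layer?.cornerRadius = 8

// Pulse opacity 1.0 -> 0.3 every 0.8s, autoreversing forever.
let pulse = CABasicAnimation(keyPath: "opacity")
pulse.fromValue = 1.0
pulse.toValue = 0.3
pulse.duration = 0.8
pulse.autoreverses = true
pulse.repeatCount = .infinity
dot.layer?.add(pulse, forKey: "pulse")
```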

How does the cursor get restored to where I left it?

At main.swift:1672-1675 the handler reads NSEvent.mouseLocation (which uses bottom-left AppKit coordinates), flips the y axis using primaryScreen.frame.height, and stashes the resulting CGPoint. After the action, main.swift:1767-1771 builds a CGEvent with type .mouseMoved at that saved point and posts it via cghidEventTap. The cursor visually snaps back to where it was when the call started, even if the agent dragged it across three monitors during a click. This works because backingScaleFactor on these screens is 1.0 (1pt == 1px), as documented in the project CLAUDE.md.

What about focus? If the agent activates Mail, am I left in Mail when the call ends?

No. main.swift:1671 saves NSWorkspace.shared.frontmostApplication at the start, and main.swift:1775-1781 calls .activate(options: []) on that saved app after the action, but only if the previous app is still alive (prevApp.isTerminated == false). If the action handed focus off to a different app and that handoff is the user's intent, the diff already records it: main.swift:1788-1808 detects cross-app frontmost changes and traverses the new frontmost app, attaching its tree as appSwitchTraversal in the response. Focus restoration only kicks in for incidental focus theft, not for intentional handoffs.

Can the system disable my event tap mid-call, and what happens if it does?

Yes. macOS will disable a CGEventTap if it takes too long to return from the callback (.tapDisabledByTimeout) or if user input subverted it (.tapDisabledByUserInput). InputGuard.swift:298-306 catches both cases in the callback's preamble and calls CGEvent.tapEnable(tap: tap, enable: true) to re-arm without touching the run loop source. Because the tap was inserted at the head-insert placement, it remains the first listener after re-enable. This is the difference between 'I wrote a CGEventTap once' and 'a CGEventTap that survives a long automation run'.
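The re-arm preamble reduces to one call, sketched here; `tap` is the CFMachPort returned by tapCreate, and the function name is illustrative:

```swift
import CoreGraphics

// If macOS disabled the tap (callback took too long, or user input
// subverted it), re-enable it in place. The run loop source is left
// untouched, and head-insert placement means the tap is first in
// line again after re-enable.
func reArmIfDisabled(type: CGEventType, tap: CFMachPort?) {
    if type == .tapDisabledByTimeout || type == .tapDisabledByUserInput {
        if let tap = tap {
            CGEvent.tapEnable(tap: tap, enable: true)
        }
    }
}
```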

How does this complement Terminator on Windows?

Same shape, different host. macos-use uses AXUIElement APIs for the tree and CGEventTap for input arbitration. Terminator uses UI Automation for the tree and a Windows raw-input hook for the equivalent block-and-pass-through. Both speak MCP, so an agent that wants to run on a mixed fleet holds the same mental model: the OS gives you a structured tree, the server filters noise, and the server protects the human's input boundary while the action happens. The specific code (.cghidEventTap vs RegisterRawInputDevices) differs; the contract does not.

Why a CGEventTap and not just dimming the screen or showing a banner?

Banners are advisory; a tap is enforced. If you only show an overlay, the user's keystrokes still race the agent's into the focused app. A real example: agent is mid-way through typing a Slack message, you bring up Spotlight by reflex, your Cmd-Space lands inside Slack as Cmd-Space, and the message text fragments. With a head-insert CGEventTap, your Cmd-Space is intercepted before any app sees it; the agent's CGEvent.post calls flow through because their stateID is non-zero. The tap is what makes 'AI is using the computer' a hard fact, not a polite request.

How do I verify any of this on my own machine?

Clone the repo. xcrun --toolchain com.apple.dt.toolchain.XcodeDefault swift build. Point an MCP client at .build/debug/MCPServer. Call open_application_and_traverse with a small target like Calculator. Then call type_and_traverse and start mashing your keyboard during the 800ms the call takes. Your keys land nowhere (the tap swallows them). The orange-dot pill is centered on screen. /tmp/macos-use/tap_status.txt is rewritten with tap_created: enabled=true at <timestamp>. Press Esc; /tmp/macos-use/esc_pressed.txt appears with the timestamp and the call returns an InputGuardCancelled error. Two minutes from clone to verified.

Is this only relevant for AI agents, or is it useful for traditional automation too?

It is most acute for AI because LLM-driven actions are slower, less predictable, and more frequent than scripted ones. A 50ms AppleScript click finishes before you can interrupt it; a model-issued click_and_traverse can take 700ms because of the before-and-after traversals. That window is exactly long enough for a human to start typing. But the same primitives (event tap, programmatic stateID filtering, watchdog auto-release, cursor restore) apply to any long-running GUI automation: long-running build wizards, recorded macros that cross app boundaries, and accessibility-tree fuzzers all benefit from a hard input boundary while the script runs.

macos-useMCP server for native macOS control
© 2026 macos-use. All rights reserved.