macOS Accessibility Tree Agent: The Three-Rung Fallback Ladder For When The Click Silently Drops
An accessibility-tree-driven agent on macOS hits a wall the moment it touches a Mac Catalyst app, a sandboxed app, or a secure-input field. The CGEvent click reaches the OS, the OS hands it to the app, and the app drops it. The diff comes back empty. mcp-server-macos-use ships three escalation tools for exactly this case, declared at main.swift:1441, 1458, and 1476. Each one bypasses the input event tap and fires the action directly on the AX element.
The Click That Doesn't Click
Most articles about driving macOS via the accessibility tree stop at the same place. Read the tree, find an AXButton, post a CGEvent at the centered point, done. That works on most native AppKit and SwiftUI apps. It does not work on Mac Catalyst right panes. It does not work on a number of sandboxed apps. It does not work when a secure-input field has focus. And it does not work on Catalyst rows, where the row exposes AXSelected but no AXPress and the click would have nothing to actuate even if it did land.
The failure mode is the worst kind: the click reaches the OS, the OS routes it through the HID event tap, the foreground window server queue accepts it, and the target app drops it on the floor. No exception, no error code, no callback. The traversal-after looks identical to the traversal-before. The agent is left with an empty diff and no signal about what to do next.
mcp-server-macos-use solves this with three additional tools. All three skip the input event tap entirely and operate on the AX element directly. The agent escalates: try the click, see the empty diff, call the next rung. If the third rung also fails, the agent knows the element itself is not actuatable, rather than suspecting it picked the wrong delivery path.
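The escalation described above can be sketched as a small agent-side loop. This is a minimal illustration, not code from main.swift: the `Rung`, `Outcome`, and `LadderResult` types and the `escalate` function are hypothetical names, and the `call` closure stands in for whatever MCP client call the agent actually makes.

```swift
// Hypothetical agent-side model; none of these types come from main.swift.
enum Rung: String {
    case click = "click_and_traverse"
    case pressAX = "press_ax_and_traverse"
    case setValue = "set_value_and_traverse"
    case setSelected = "set_selected_and_traverse"
}

enum Outcome {
    case changed            // diff contained at least one +/-/~ entry
    case emptyDiff          // call returned, nothing changed
    case actionUnsupported  // e.g. press_ax reported kAXErrorActionUnsupported
}

struct LadderResult {
    let landed: Rung?   // nil when no rung changed the UI
    let tried: [Rung]   // every rung attempted, in order
}

/// Walks click -> press_ax -> (set_value | set_selected) and stops at the
/// first rung whose diff shows a change.
func escalate(isTextField: Bool, call: (Rung) -> Outcome) -> LadderResult {
    let ladder: [Rung] = [.click, .pressAX, isTextField ? .setValue : .setSelected]
    var tried: [Rung] = []
    for rung in ladder {
        tried.append(rung)
        if case .changed = call(rung) {
            return LadderResult(landed: rung, tried: tried)
        }
    }
    return LadderResult(landed: nil, tried: tried)
}

// Simulated Catalyst sidebar row: the click drops, press_ax is unsupported,
// and set_selected finally changes the tree.
let rowResult = escalate(isTextField: false) { rung in
    switch rung {
    case .click:       return .emptyDiff
    case .pressAX:     return .actionUnsupported
    case .setSelected: return .changed
    case .setValue:    return .emptyDiff
    }
}
print(rowResult.landed?.rawValue ?? "not actuatable")
```

A `nil` in `landed` after all three rungs is the "not actuatable" signal: every delivery path was tried, so the problem is the element, not the route.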
The Inventory, In Numbers
The let allTools declaration at main.swift:1482 lists all nine tools. Three of them exist only for the failure case.
How Much Of The Server Exists Just For This Failure Mode
Where The Click Goes When It Doesn't Land
Inputs on the left are the happy-path tools that post CGEvents. Outputs on the right are the three escalation rungs. The hub is where the agent decides which path to take based on the diff from the previous call.
Happy path on the left, fallback ladder on the right
What The Agent Sees: Empty Diff, Then The Real Diff
The click reaches the OS, the OS hands it to a Catalyst right-pane button, and the button drops it. The first call returns a 64-byte file with no diff entries. The agent reads the empty diff, falls back to press_ax_and_traverse, and gets the actual state change.
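Deciding that a diff is empty can be done mechanically on the agent side. The sketch below assumes each diff entry is a line starting with `+ `, `- `, or `~ `; that line format is an assumption based on the +/-/~ notation used here, not a documented file layout, and `DiffSummary` and `summarize` are hypothetical names.

```swift
// Counts of diff entries by kind; all zero means the action changed nothing.
struct DiffSummary: Equatable {
    var added = 0, removed = 0, modified = 0
    var isEmpty: Bool { added + removed + modified == 0 }
}

/// Tallies lines that start with "+ ", "- ", or "~ " (after leading whitespace).
/// Assumption: one diff entry per line, prefixed by its marker.
func summarize(diffText: String) -> DiffSummary {
    var summary = DiffSummary()
    for line in diffText.split(separator: "\n") {
        let trimmed = line.drop(while: { $0 == " " || $0 == "\t" })
        if trimmed.hasPrefix("+ ") {
            summary.added += 1
        } else if trimmed.hasPrefix("- ") {
            summary.removed += 1
        } else if trimmed.hasPrefix("~ ") {
            summary.modified += 1
        }
    }
    return summary
}

let dropped = summarize(diffText: "")
let landed = summarize(diffText: "+ [AXStaticText] \"Saved\"\n~ [AXButton] \"Save\"")
print(dropped.isEmpty, landed.isEmpty)
```

An agent would escalate when `isEmpty` is true for an action that should have changed state, and move on otherwise.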
With Versus Without The Ladder
Same Catalyst app, same Save button, same agent. The only difference is whether the agent escalates after an empty diff.
Catalyst right-pane Save button, two agent loops
Agent reads the AX tree, sees AXButton 'Save' at (612, 408), calls click_and_traverse. The diff comes back empty: zero added, zero removed, zero modified. The agent has no way to know if the click was just slow, the button was disabled, or the app dropped the synthetic event. Most agents loop the same call, hit the same empty diff, then bail out and ask the user.
- click_and_traverse returns empty diff
- Agent does not know if click landed
- Loops the same call, gets the same empty diff
- Bails out and asks the user
The Three Rungs, Source-Level
Each tool is declared once. The description string is the contract: it tells the model exactly when to reach for that tool. Read these three back-to-back and the failure-mode map is self-documenting.
The Dispatch That Picks The AX Action Variant
The handler maps each tool name to a different variant of .input(action: ...). set_value goes to .axSetValue, press_ax goes to .axPress, set_selected goes to .axSetSelected. All three flip showDiff so the agent gets the same +/-/~ output it gets from a regular click.
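The mapping can be pictured roughly as follows. This is a sketch under assumptions: `InputAction`, `ActionOptions`, `Dispatch`, and `dispatch` are hypothetical stand-ins written for illustration, and only the variant names (.axSetValue, .axPress, .axSetSelected) and the showDiff flag are taken from the description above.

```swift
// Hypothetical stand-ins for the server's action enum and options struct.
enum InputAction: Equatable {
    case axSetValue(String)
    case axPress
    case axSetSelected(Bool)
}

struct ActionOptions: Equatable {
    var showDiff = false
}

struct Dispatch: Equatable {
    let action: InputAction
    let options: ActionOptions
}

/// Maps a fallback tool name to its AX action variant and turns showDiff on.
/// Returns nil for names outside the three escalation tools.
func dispatch(tool: String, text: String? = nil, selected: Bool = true) -> Dispatch? {
    let action: InputAction
    switch tool {
    case "set_value_and_traverse":
        guard let text = text else { return nil }  // set_value needs a payload
        action = .axSetValue(text)
    case "press_ax_and_traverse":
        action = .axPress
    case "set_selected_and_traverse":
        action = .axSetSelected(selected)
    default:
        return nil
    }
    return Dispatch(action: action, options: ActionOptions(showDiff: true))
}

if let d = dispatch(tool: "set_selected_and_traverse") {
    print(d.action, d.options.showDiff)
}
```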
The Order The Agent Walks
Try the cheapest path first. Each rung leaves behind a timestamped .txt + .png pair so the model can audit which attempts ran and which one finally actuated the UI.
1. click_and_traverse
CGEvent posted at the auto-centered point. Works for ~95 percent of native AppKit and SwiftUI apps. Use first.
2. press_ax_and_traverse
Bypasses the event tap. Calls kAXPressAction on the AX element. Catalyst right-pane buttons, sandboxed apps.
3a. set_value_and_traverse
kAXValueAttribute. For text fields where typing failed: Catalyst right-pane fields, secure-input contexts.
3b. set_selected_and_traverse
kAXSelectedAttribute. Catalyst rows, sidebar entries, outlines that expose AXSelected without AXPress.
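At the accessibility API level, rungs 2, 3a, and 3b each reduce to a single call on an already-resolved AXUIElement. The sketch below uses the public ApplicationServices functions; the wrapper names `pressAX`, `setValue`, and `setSelected` are hypothetical, and the server's own implementation may differ. It is macOS-only and needs Accessibility permission to succeed against a real app.

```swift
import Foundation
import ApplicationServices

/// Rung 2: ask the element to perform AXPress, bypassing the event tap.
func pressAX(_ element: AXUIElement) -> AXError {
    AXUIElementPerformAction(element, kAXPressAction as CFString)
}

/// Rung 3a: write a string straight into kAXValueAttribute.
func setValue(_ element: AXUIElement, to text: String) -> AXError {
    AXUIElementSetAttributeValue(element, kAXValueAttribute as CFString, text as CFTypeRef)
}

/// Rung 3b: flip kAXSelectedAttribute on a row or list entry.
func setSelected(_ element: AXUIElement, _ selected: Bool) -> AXError {
    AXUIElementSetAttributeValue(element, kAXSelectedAttribute as CFString, selected as CFTypeRef)
}

// The system-wide element is not a pressable control, so this reports an
// error code rather than success; a real caller passes a hit-tested element.
let systemWide = AXUIElementCreateSystemWide()
print(pressAX(systemWide).rawValue)
```

Each function returns the raw AXError so the caller can tell "unsupported" apart from "permission missing" or "element gone" before picking the next rung.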
What Each Rung Is Designed For
Mac Catalyst right pane
UIKit-rendered fields and buttons participate in the AX tree but drop CGEvent posts at their coordinates. set_value (fields) and press_ax (buttons) land. Cited at main.swift:1442 and 1459.
Sandboxed apps
Sandbox entitlements opt the app out of receiving synthetic events. AX actions still go through because they cross the accessibility boundary, not the input event tap.
Secure-input contexts
When a secure field has focus, synthetic keyboard events are blocked system-wide. set_value writes via kAXValueAttribute and is not subject to the secure-input mode.
AXSelected-only rows
Catalyst tables, sidebar lists, outline rows. press_ax errors with kAXErrorActionUnsupported. set_selected toggles kAXSelectedAttribute and the row activates. main.swift:1477.
kAXErrorActionUnsupported, Then The Recovery
Catalyst sidebar rows are the canonical example. The row participates in the AX tree, exposes AXSelected, and ignores both the synthetic click and kAXPressAction. The error code kAXErrorActionUnsupported from press_ax is the agent's cue to flip to set_selected_and_traverse.
“Right primitive for Catalyst table rows, sidebar list entries, outline rows, and other selection-bearing controls that expose AXSelected but no AXPress action (where regular click is dropped and press_ax errors with kAXErrorActionUnsupported).”
set_selected_and_traverse description, verbatim
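The press-then-select recovery can also be expressed in-process. This is a sketch under assumptions, not the server's code: `supportsPress` and `pressOrSelect` are hypothetical helpers built on the public AXUIElementCopyActionNames, AXUIElementPerformAction, and AXUIElementSetAttributeValue calls.

```swift
import Foundation
import ApplicationServices

/// True when the element advertises AXPress among its action names.
func supportsPress(_ element: AXUIElement) -> Bool {
    var names: CFArray?
    guard AXUIElementCopyActionNames(element, &names) == .success,
          let actions = names as? [String] else { return false }
    return actions.contains(kAXPressAction)
}

/// Tries AXPress first; only when the element reports
/// kAXErrorActionUnsupported does it fall back to setting AXSelected.
func pressOrSelect(_ element: AXUIElement) -> (rung: String, error: AXError) {
    let pressError = AXUIElementPerformAction(element, kAXPressAction as CFString)
    guard pressError == .actionUnsupported else {
        return ("press_ax", pressError)
    }
    let selectError = AXUIElementSetAttributeValue(
        element, kAXSelectedAttribute as CFString, true as CFTypeRef)
    return ("set_selected", selectError)
}

// Placeholder target; a real caller passes the hit-tested row element.
let probe = AXUIElementCreateSystemWide()
print(supportsPress(probe), pressOrSelect(probe).rung)
```

Checking `supportsPress` up front lets a caller skip the press rung entirely for selection-only rows, at the cost of one extra AX round trip.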
The Two Wire Diagrams That Explain The Whole Thing
First call: the click goes through the input event tap, the target app drops it, the diff is empty. Second call: the AX action goes around the tap and lands on the element directly. The agent sees both.
click drops, press_ax lands
The Aggregated Tool List
Six tools cover the happy path. Three exist purely for the apps that don't cooperate with synthetic events. Reading the allTools line as a literal answers the question of how much of this server is dedicated to the failure mode.
CGEvent Path vs. AX API Path, Side By Side
The two paths are not redundant. Each one wins where the other loses. CGEvent works on webviews, Electron, games, and anything that doesn't implement AX actions. AX API actions work on Catalyst, sandboxed apps, and selection-only rows. The agent doesn't pick a side; it picks based on what the previous rung returned.
| Dimension | CGEvent path (click/type/press/scroll) | AX API path (set_value/press_ax/set_selected) |
|---|---|---|
| Path through the OS | Posts CGEvent. OS routes through the HID event tap and on to the foreground app's window server queue. | Calls AXUIElementPerformAction or AXUIElementSetAttributeValue. Delivered through the accessibility API, never touches HID. |
| Catalyst right pane | Click drops silently. Empty diff. Agent has no recourse. | Action lands on the UIKit-bridged element via kAXPressAction or kAXValueAttribute. |
| Secure-input field | type_and_traverse blocked at the OS level. No keystrokes delivered. | set_value_and_traverse writes via kAXValueAttribute. Not subject to secure-input filtering. |
| Selection-only row | Click hits the row visually; nothing changes because the row has no AXPressAction. | set_selected_and_traverse flips kAXSelectedAttribute. Detail pane updates. |
| Diff contract | click_and_traverse returns +/-/~ diff (empty when click drops). | All three fallback tools return the same +/-/~ diff (showDiff=true at main.swift:1745, 1764, 1784). |
| Coordinate handling | Click auto-centers: the CGEvent is posted at (x + w/2, y + h/2). | Hit-tests at (x + w/2, y + h/2) when width/height are passed; the point is resolved through findAXElementAtPoint at main.swift:1078. |
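
For comparison with the AX column, the CGEvent column amounts to roughly the following. This is a minimal sketch using the public Core Graphics event API; `centerPoint` and `postClick` are hypothetical helper names, and the server's click path may add delays or other handling not shown here.

```swift
import CoreGraphics

/// Center of an element's frame: (x + w/2, y + h/2).
func centerPoint(x: Double, y: Double, width: Double, height: Double) -> CGPoint {
    CGPoint(x: x + width / 2, y: y + height / 2)
}

/// Posts a left mouse-down/mouse-up pair at the HID event tap.
/// The Bool only says the events were created and posted; it says nothing
/// about whether the target app acted on them.
@discardableResult
func postClick(at point: CGPoint) -> Bool {
    guard let down = CGEvent(mouseEventSource: nil, mouseType: .leftMouseDown,
                             mouseCursorPosition: point, mouseButton: .left),
          let up = CGEvent(mouseEventSource: nil, mouseType: .leftMouseUp,
                           mouseCursorPosition: point, mouseButton: .left)
    else { return false }
    down.post(tap: .cghidEventTap)
    up.post(tap: .cghidEventTap)
    return true
}

// Frame values chosen so the center matches the Save button at (612, 408).
let saveCenter = centerPoint(x: 600, y: 400, width: 24, height: 16)
print(saveCenter)
```

Note that `postClick` has no return path from the app, which is why the before/after diff is the only evidence that a click landed.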
Why This Doesn't Show Up In Other Guides
Most writing about driving macOS through the accessibility tree treats the tree as the goal: read the tree, parse the tree, render the tree, embed the tree. The agent layer is left as an exercise. The interesting part of agent work isn't the read; it's the action loop. And the action loop on macOS has a specific cliff that the read-side guides never get to: the click that the OS accepts and the app silently throws away.
The recovery is not a generic retry(). It is three different AX actions, each backing off to a different attribute or perform-action call, each tuned to a specific failure mode named in the source. set_value for fields that ignore the keystroke. press_ax for buttons that ignore the click. set_selected for rows that do not implement press at all.
If your agent already drives macOS through the AX tree and you have ever shrugged and told a teammate "Catalyst is weird, the click just doesn't work," this is the shape of the answer.
Frequently asked questions
What does it mean when an agent's click silently drops on macOS, and how do I tell?
The agent reads the AX tree, finds an AXButton at known coordinates, posts a CGEvent mouse-down/mouse-up at the centered point, and the app does not respond. No exception is raised and no error is returned. The traversal-after looks identical to the traversal-before. macos-use makes this easy to spot because every action returns a diff: when the diff is empty (zero added, zero removed, zero modified) on what should have been a state-changing click, the target app most likely dropped the synthetic event. This happens predictably on Mac Catalyst right-pane controls, sandboxed apps that opt out of synthetic events, and any app where a secure-input field has focus.
What are the three fallback tools and what does each one do at the AX layer?
set_value_and_traverse writes a string into the element under (x,y) via kAXValueAttribute (declared at main.swift:1441). press_ax_and_traverse calls kAXPressAction on the element (main.swift:1458). set_selected_and_traverse sets kAXSelectedAttribute (main.swift:1476). All three skip CGEvent entirely. They get the AXUIElement at the hit-tested point and call AXUIElementSetAttributeValue or AXUIElementPerformAction directly. The input event tap is never engaged, so apps that drop synthetic events still receive the action because the action is happening on the accessibility object, not on the keyboard or mouse hardware path.
Why are Mac Catalyst right-pane fields specifically called out in the source?
Catalyst is UIKit-on-AppKit. The right pane of a Catalyst window often hosts UIKit-rendered text fields and buttons that participate in the AX tree (so the agent can see them) but ignore CGEvent posts targeted at their screen coordinates. The macos-use source names this case explicitly at main.swift:1442: 'Use when typing fails (Catalyst right-pane fields, sandboxed/secure-input contexts).' For those fields, the only thing that lands is AXUIElementSetAttributeValue with kAXValueAttribute. Same for Catalyst right-pane buttons at main.swift:1459: 'Use when a synthetic mouse click is dropped (Catalyst right-pane buttons, sandboxed apps). Often the only path that actuates buttons in those apps.'
When would I need set_selected instead of press_ax?
When the element exposes the AXSelected attribute but does not implement AXPressAction. The source comment at main.swift:1477 spells this out: 'where regular click is dropped and press_ax errors with kAXErrorActionUnsupported.' Common offenders are Catalyst table rows, sidebar list entries, and outline rows. Pressing them does nothing because there is no press action to perform; selecting them is the operation. set_selected_and_traverse calls AXUIElementSetAttributeValue with kAXSelectedAttribute and the boolean payload. The traversal-after typically shows the row's AXSelected flipping from false to true and the detail pane updating with the new content.
What does the escalation order look like in practice?
Try click_and_traverse first. It is a single call that posts a CGEvent and can chain an optional type/press step, and it works for ~95 percent of native AppKit and SwiftUI apps. If the diff comes back empty for a click that should have caused a visible state change, escalate to press_ax_and_traverse with the same x, y, width, height. On rows, press_ax errors with kAXErrorActionUnsupported; when that error appears, escalate to set_selected_and_traverse. For text fields, escalate to set_value_and_traverse. The diff returned by each rung tells the agent immediately whether the rung worked: an empty diff means try the next rung.
How does the server know which AX element corresponds to the (x, y) the agent passes in?
findAXElementAtPoint at main.swift:1078-1104 walks the AX tree depth-first from the application root. For each element it reads AXPosition and AXSize, builds a CGRect, and checks whether the point falls inside. It returns the deepest match. Fallback tools pass the centered point (x + width/2, y + height/2) when width and height are present, which is why the schema for set_value, press_ax, and set_selected accepts optional width/height parameters at main.swift:1432-1435 and 1452-1455. The hit-tested element is the one the AX action is performed on, not whatever was visually under the cursor.
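The depth-first, deepest-match idea can be illustrated on an in-memory tree. This is a simplified sketch, not the code at main.swift:1078: the `Node` struct and `deepestNode` function are hypothetical, and the real function reads AXPosition and AXSize from live AXUIElements rather than stored frames.

```swift
import Foundation

// Hypothetical in-memory stand-in for an AX element.
struct Node {
    let role: String
    let frame: CGRect
    var children: [Node] = []
}

/// Depth-first search that prefers descendants over ancestors: a node is
/// returned only if none of its children (searched first) contain the point.
func deepestNode(at point: CGPoint, in node: Node) -> Node? {
    for child in node.children {
        if let hit = deepestNode(at: point, in: child) { return hit }
    }
    return node.frame.contains(point) ? node : nil
}

let button = Node(role: "AXButton", frame: CGRect(x: 600, y: 400, width: 24, height: 16))
let pane = Node(role: "AXGroup", frame: CGRect(x: 500, y: 300, width: 300, height: 200),
                children: [button])
let window = Node(role: "AXWindow", frame: CGRect(x: 0, y: 0, width: 800, height: 600),
                  children: [pane])

// Centered point (x + width/2, y + height/2) for the button's frame.
let center = CGPoint(x: 600 + 24 / 2, y: 400 + 16 / 2)
print(deepestNode(at: center, in: window)?.role ?? "none")  // prints "AXButton"
```

Searching children before testing the node itself is what makes the result the deepest match rather than the first enclosing container.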
Does the agent get a diff back from these fallback tools, or just a status?
Same diff contract as click. All three escalation tools set options.showDiff = true at main.swift:1745, 1764, and 1784. That triggers a traverseBefore pass, executes the AX action, traverses again, subtracts the two traversals, and returns the +/-/~ diff. So the agent sees exactly what changed: the AXValue of a text field flipping from empty to the new string, an AXButton's AXEnabled going from true to false, a row's AXSelected flipping. Same /tmp/macos-use/<ts>_<tool>.txt + .png pair, same flat format, same grep workflow.
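The traverse, act, traverse, subtract sequence can be sketched with two snapshots keyed by element. This is an illustration under assumptions: the dictionary shape, the key strings, and the `TraversalDiff` and `traversalDiff` names are hypothetical, not the server's data model.

```swift
// Entries grouped the way the +/-/~ output groups them.
struct TraversalDiff: Equatable {
    var added: [String] = []
    var removed: [String] = []
    var modified: [String] = []
}

/// Subtracts two snapshots. Keys identify elements, values hold their state.
/// Assumption: an element keeps the same key across both traversals.
func traversalDiff(before: [String: String], after: [String: String]) -> TraversalDiff {
    var diff = TraversalDiff()
    for (key, newValue) in after {
        if let oldValue = before[key] {
            if oldValue != newValue {
                diff.modified.append("~ \(key): \(oldValue) -> \(newValue)")
            }
        } else {
            diff.added.append("+ \(key): \(newValue)")
        }
    }
    for (key, oldValue) in before where after[key] == nil {
        diff.removed.append("- \(key): \(oldValue)")
    }
    // Dictionary iteration order is unspecified, so sort for stable output.
    diff.added.sort()
    diff.removed.sort()
    diff.modified.sort()
    return diff
}

let snapshotBefore = ["AXRow Inbox": "AXSelected=false", "AXStaticText Welcome": "visible"]
let snapshotAfter = ["AXRow Inbox": "AXSelected=true", "AXStaticText Messages": "visible"]
let change = traversalDiff(before: snapshotBefore, after: snapshotAfter)
print(change.modified, change.added, change.removed)
```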
Why not just always use kAXPressAction and skip CGEvent entirely?
Because not every AX element implements every action. AXButton usually does. AXMenuItem usually does. AXTextField and most controls in webviews, Electron apps, and games do not implement AXPressAction. CGEvent works against any window the OS knows about because it is firing real mouse and keyboard events the OS routes through the input event tap. The fallback ladder exists because neither path covers everything alone: CGEvent loses to Catalyst-and-sandboxed; AX actions lose to webviews-and-games. So the server exposes both, and the agent picks based on what the previous rung did.
What about secure-input fields, password boxes, the lock screen?
Secure input is a system-level mode where macOS blocks synthetic keyboard events from reaching the foreground app while a secure field has focus. Apps like Terminal opt into it; banking sites and 1Password trigger it. type_and_traverse will silently fail. set_value_and_traverse at main.swift:1441 is the recovery. It writes into the field via kAXValueAttribute, which is not blocked by secure input because no synthetic key event is fired. Whether the target app then accepts the written value depends on whether the field calls a controller that reads kAXValue, but the input is at least delivered. For a true password field on the macOS lock screen, neither path works, by design.
How does this fit alongside the click that does work, in one agent loop?
The agent calls click_and_traverse with element="Save" and reads the diff. If the diff has changes, the click landed and the agent moves on. If the diff is empty, the agent grabs the x, y, width, height for the same element from the prior traversal file and calls press_ax_and_traverse. If press_ax returns kAXErrorActionUnsupported, the agent flips to set_selected_and_traverse for a row or set_value_and_traverse for a field. Each rung writes its own /tmp/macos-use/<ts>_<tool>.txt with diff or error, so the model has a complete trace of which rungs it attempted and which one actuated the UI.