macOS automation only works when the tool re-reads the screen and rescues the click when the button is below the fold.
AppleScript sends an Apple Event and hopes. Shortcuts runs an intent and hopes. Keyboard Maestro fires a CGEvent at a recorded coordinate and hopes. None of them return the post-action accessibility tree, and none of them know what to do when the target has scrolled off-screen. macos-use does both. This guide shows the 30-step scroll chaser at Sources/MCPServer/main.swift:1159 and the 15-pixel viewport inset that tells it when the rescue is done.
The pipe and the loop
Most macOS automation stacks are a pipe. Intent enters one end, an event exits the other end, the app reacts, the script ends. If the event missed, the script errors or, worse, silently succeeds against the wrong target. The pipe is fine when the UI is stable, hand-written, and verified by a human on every change.
A loop is different. Intent enters, the tool reads the live accessibility tree, picks a coordinate, posts an event, re-reads the tree, and returns the diff. The next intent is authored against that diff, not against a stale snapshot. The model driving the loop can be an LLM or a script; the point is that the feedback path exists at the tool boundary, not in the user's head.
The rest of this page is what the loop has to include to work on real macOS apps: a viewport filter, a scroll chaser, and a cancel path for the user.
The specific problem: click coordinate outside the window bounds
An LLM picked a coordinate out of a prior traversal. Between then and now the user scrolled. The point is now 1400 pixels below the viewport. Four classes of automation tool give four different wrong answers.
The rescue code
One function, roughly 130 lines, runs before any off-screen click. The top half tracks a target with known text; the bottom half probes an edge point when the text is not known. The full file is at Sources/MCPServer/main.swift:1159-1285.
Where the rescue decides it is done
findElementByText accepts a match only when the element center sits inside a viewport shrunk by 15 pixels top and bottom. That margin is not cosmetic — it is the difference between a click that lands on a sticky-header row and a click that lands on the real row you asked for.
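A minimal sketch of that accept check, assuming the element's AX frame and the window viewport are plain CGRects (function and parameter names here are illustrative, not the repo's):

```swift
import Foundation  // CGRect / CGPoint ship with Foundation on macOS and Linux alike

/// Accept a candidate element only when its center sits inside the
/// viewport shrunk by 15px top and bottom -- the sticky-header margin.
func isAcceptablyVisible(elementFrame: CGRect, viewport: CGRect) -> Bool {
    let safeArea = viewport.insetBy(dx: 0, dy: 15)
    let center = CGPoint(x: elementFrame.midX, y: elementFrame.midY)
    return safeArea.contains(center)
}
```

A row whose center sits 10px above the window's bottom edge is rejected even though it is technically in-viewport — that is exactly the row a sticky header or toolbar shadow would eat.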
What flows through scrollIntoViewIfNeeded
The rescue, step by step
Seven moments between "the click target is off-screen" and "the click is posted at a rescued coordinate". Every moment is a line or two of Swift in one function.
Client sends click_and_traverse with (x, y) from a prior traversal
The LLM picked coordinates out of the flat-text response file that a previous tool call wrote to /tmp/macos-use. That traversal was taken when the list was scrolled to the top. The user's new intent targets a row that is now below the fold.
Server resolves the window bounds for the target PID
getWindowContainingPoint walks the app's AX tree, finds the window whose frame contains the target point, and returns its bounds. If no window contains the point, the original coordinate is returned as-is — the caller handles the failure.
If the point is already in-viewport, return it unchanged
main.swift:1168-1175. The chaser explicitly skips the AX-tree refine step here because overlapping full-width groups (message rows that span the window) would shadow sidebar items and deflect the click to the wrong location.
Pick a scroll cadence proportional to the off-screen distance
Distance < 80px: 1 line per step. Distance < 250px: 2 lines per step. Else: 3 lines per step (main.swift:1187). A scroll line is ~20-40px in the AX coordinate space, so a 1-line step is enough for tiny offsets and a 3-line step converges on long lists without overshooting.
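The cadence pick is a pure function of the off-screen distance; a sketch (the function name is illustrative):

```swift
/// Lines of scroll-wheel travel per step, proportional to how far
/// off-screen the target is. One AX scroll line is ~20-40px, so tiny
/// offsets get a gentle 1-line step and long lists get 3 lines.
func linesPerStep(offscreenDistance: Double) -> Int {
    if offscreenDistance < 80 { return 1 }
    if offscreenDistance < 250 { return 2 }
    return 3
}
```

Per the FAQ on this page, the cadence is picked once before the loop runs, so a target 1400px below the fold is chased at 3 lines per step throughout.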
Scroll, sleep 100ms, re-query the AX tree by text
CGEvent(scrollWheelEvent2Source:...) at the window midY. After each step, findElementByText walks the tree for the exact text the original target had, and accepts the match only when the element center sits inside viewport.insetBy(dx:0, dy:15).
If the target had no text, probe an edge point and wait
For far off-screen sidebar items where AX returned a coordinate but no text, the loop switches to probing a point 60px inside the viewport edge (main.swift:1230-1233). As the user would see it, content enters from the bottom edge on scroll down; the server watches that edge for newly-revealed text.
Return the rescued center, or give up after 30 steps
On success, the caller gets back a CGPoint inside the window's reliably-clickable area. If the chase failed, the original point is returned and the click is posted anyway — the diff will show that no useful state change followed, which is the signal the model needs to re-plan.
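The seven moments above compress into one loop skeleton. This is a hedged re-creation, not the repo's code: the macOS-only effects (the CGEvent scroll wheel and the AX re-query) are injected as closures so only the shape of the loop is shown.

```swift
import Foundation

/// Skeleton of the scroll chase: scroll, wait, re-query, accept or give up.
/// `scroll` stands in for posting one wheel event of `lines` lines;
/// `locateTarget` stands in for re-walking the AX tree and returning the
/// target's center only when the inset-viewport accept check passes.
func chase(original: CGPoint,
           lines: Int,
           maxSteps: Int = 30,
           scroll: (Int) -> Void,
           locateTarget: () -> CGPoint?) -> CGPoint {
    for _ in 0..<maxSteps {
        scroll(lines)                      // CGEvent scroll wheel in the real code
        // (the real loop sleeps 100ms for text tracking, 150ms for edge probing)
        if let rescued = locateTarget() {  // findElementByText in the real code
            return rescued                 // center inside the inset viewport
        }
    }
    return original  // give up: post the click anyway; the diff reveals the miss
}
```

A fake `scroll` that moves a simulated target 90px per step rescues a point 1400px below the fold in 10 steps, well under the 30-step cap.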
The JSON-RPC round trip with one off-screen click inside it
The client sees one request and one response. Inside the server, the rescue loop is N scroll events and N AX-tree reads before the mouseDown ever fires.
click_and_traverse with an off-screen target
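A hedged sketch of what that single request/response pair could look like on the wire. The `tools/call` envelope is standard MCP JSON-RPC; the argument names beyond x/y, the `id`, and the result text are illustrative assumptions, not the server's exact shape.

```json
{
  "jsonrpc": "2.0",
  "id": 7,
  "method": "tools/call",
  "params": {
    "name": "click_and_traverse",
    "arguments": { "x": 412, "y": 1934 }
  }
}
```

```json
{
  "jsonrpc": "2.0",
  "id": 7,
  "result": {
    "content": [
      { "type": "text",
        "text": "diff written to /tmp/macos-use/<timestamp>_click_and_traverse.txt (+12 added, -3 removed, ~5 modified)" }
    ]
  }
}
```

Between those two messages, the server may have posted a dozen scroll events and read the AX tree a dozen times; the client never sees the chase.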
The rescue, by the numbers
The 30-step cap is the ceiling that keeps the rescue inside the InputGuard 30-second watchdog budget. Worst case is 30 steps × ~150ms per step ≈ 4.5s. That leaves ~25s of headroom for the click itself, the follow-up type, press, and the post-action traversal inside a single chained tool call.
“A macOS automation tool that does not re-read the accessibility tree after every action is a pipe. You can build a script on top of a pipe. You cannot build an agent. The reason is simple: the pipe has no way to say 'the click actually hit nothing because the target was below the fold.' That signal is what turns a script into a loop.”
main.swift:1159 (scrollIntoViewIfNeeded) + main.swift:992 (buildFlatTextResponse)
How this loop compares to what people usually mean by "macOS automation"
None of the older tools were wrong at the time. Each one is optimized for a human author. macos-use is optimized for a loop. The comparison is about which half of that statement you need today.
Classical macOS automation vs. feedback-loop MCP
| Feature | AppleScript / Shortcuts / Hammerspoon / cliclick | macos-use |
|---|---|---|
| Returns the post-action UI state | AppleScript / osascript: no (fire and forget) | Yes — accessibility-tree diff (added / removed / modified) |
| Handles a click target that is off-screen | Shortcuts / Automator: errors or silently targets whatever is at the coordinate | scrollIntoViewIfNeeded rescues the target (main.swift:1159) |
| Knows which elements are visible vs. offscreen | Hammerspoon's hs.axuielement exposes the tree but no viewport filter | Every element tagged in_viewport via window-bounds intersection |
| Sticky-header safety margin on the accept check | pyautogui / cliclick: click by coordinate, no AX concept at all | viewport.insetBy(dx:0, dy:15) in findElementByText |
| Cap on the rescue loop so a stuck scroll cannot hang | Most hotkey tools: no concept of a rescue loop | maxSteps = 30 and a 30s CGEventTap watchdog (InputGuard.swift:24) |
| LLM-consumable response shape | Apple Event result / shell stdout / Lua table | Flat-text file under /tmp/macos-use grep-addressable by AX role / text |
Verify every claim on this page with grep
No benchmark numbers, no vendor quotes. Every specific on this page maps to a line in the repo. Clone it and check.
What to grep for
- main.swift:1159 is the scroll-chaser entry point.
- Line 1187 picks 1, 2, or 3 lines per step from the distance.
- Line 1189 caps the loop at 30 steps.
- Line 1128 shrinks the accept viewport by 15px top and bottom.
- Line 1232-1233 places the edge probe 60px inside the window.
- Line 992 writes the flat-text response to /tmp/macos-use/.
- InputGuard.swift:24 caps the CGEventTap watchdog at 30 seconds.
Planning a macOS agent that survives real UI state?
We will walk through your specific app, your off-screen cases, and where the feedback loop pays for itself.
Frequently asked questions
What makes macOS automation actually hard, if input events are easy to synthesize?
Posting a CGEvent is one line. Knowing where to post it on the next call is the whole problem. macOS apps re-layout on window resize, tab switch, dynamic content load, and when a sibling process steals focus with a save panel. A hotkey tool records a coordinate once and hopes it still matches. An AppleScript leans on the app's scripting dictionary and falls back to UI scripting when the dictionary is thin. Neither strategy returns the post-action state of the UI, so the next step is authored against a guess. macos-use closes that loop by returning an accessibility-tree diff after every mutation: added elements, removed elements, and modified elements, each tagged with their new viewport coordinates. The next tool call is authored against ground truth, not a snapshot from 200ms ago.
What is the 'viewport-filtered' part of the response?
Every element in the returned accessibility tree carries an `in_viewport` boolean. The server computes it by intersecting each element's AX frame with the union of visible window bounds for the target PID. The enrichment happens in enrichDiff at Sources/MCPServer/main.swift around line 610: AXSheet bounds override window bounds when a save panel or share sheet is open, so the model does not waste tokens on elements that exist in the AX tree but are clipped behind the sheet. The flat-text response at /tmp/macos-use/<ts>_<tool>.txt marks each line with `visible` or `offscreen`, so a grep for `AXButton.*visible` gives you the clickable set without any AX coordinate math on the client side.
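The enrichment reduces to a rect intersection. A sketch, assuming the visible window (or AXSheet) bounds for the PID have already been resolved; names are illustrative:

```swift
import Foundation

/// Tag an element in_viewport when its AX frame intersects any visible
/// window rect for the target PID. When a sheet is open, callers would
/// pass the AXSheet bounds instead, so sheet-occluded elements drop out.
func isInViewport(axFrame: CGRect, visibleBounds: [CGRect]) -> Bool {
    visibleBounds.contains { $0.intersects(axFrame) }
}
```

An element whose AX frame is valid but 800px below the lowest window rect gets `offscreen`, which is what routes it into the scroll chaser instead of a blind click.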
What does 'scroll chaser' actually do, step by step?
When the click target is outside the window viewport, scrollIntoViewIfNeeded at Sources/MCPServer/main.swift:1159 runs a loop that is up to 30 steps long. Before the loop, it picks lines-per-step proportional to the off-screen distance: 1 line if the target is less than 80px outside, 2 lines if less than 250px, 3 lines otherwise. It fires a CGEvent(scrollWheelEvent2Source:...) at the window midY on every step, then either (a) re-queries the accessibility tree for the target element's text and returns the new center once it sits within viewport.insetBy(dx: 0, dy: 15), or (b) if the target coordinates had no text attached, probes a point 60px inside the viewport edge on every step, waits for text to appear there, then switches to text tracking. Text-tracking pauses 100ms between steps; edge-probe pauses 150ms. If 30 steps is not enough, the server falls back to the original point and returns — no infinite loop, no silent hang.
Why does findElementByText inset the viewport by 15 pixels vertically?
The inset at Sources/MCPServer/main.swift:1128 (`viewport.insetBy(dx: 0, dy: 15)`) is a safety margin. An element whose center is exactly on the top or bottom edge of the window is technically in-viewport but often clipped by a header, toolbar shadow, or content inset that the AX API does not report. A 15px top/bottom shrink moves the acceptance boundary inside the reliably-clickable area, so the click lands on pixels that are actually visible. Without the inset, scroll chases were overshooting by a row or two on list views with sticky headers.
How is this different from AppleScript's 'tell application to click button'?
AppleScript's UI scripting wraps the Accessibility API with a synchronous command grammar. You write `click button "Save" of window 1 of process "TextEdit"` and osascript resolves the path at send time. Two things it does not do: it does not scroll the list to find "Save" if it is off-screen, and it does not return the accessibility tree after the click. If the click fails because the button is below the fold, you get an error instead of a rescued click. macos-use's scroll chaser is the missing half-loop. The diff is the other missing half.
Why does Shortcuts / Automator not need this?
Because they aim at a different target. Shortcuts and Automator automate apps that have adopted Apple Events or App Intents, and those interfaces expose high-level verbs (`Get Contents of URL`, `Create Event`) rather than pixel-level clicks. There is no scroll problem because there is no coordinate problem. The trade-off is reach: anything that did not adopt Intents (most Electron apps, web views inside native apps, system preference sub-panes that were never scripted) is opaque to Shortcuts. macos-use reaches all of them because AXUIElement is system-wide, and the scroll chaser is what makes reaching them practical when the target is scrolled out of frame.
What does the flat-text response file actually look like?
One element per line, prefixed by its AX role and quoted text, followed by x/y/width/height and a `visible` or `offscreen` tag. For a diff response, lines are prefixed with `+` (added), `-` (removed), or `~` (modified, with a trailing attribute change list). The file is written to /tmp/macos-use/<timestamp>_<tool>.txt by buildFlatTextResponse at Sources/MCPServer/main.swift:992. A typical click into Slack returns 120-200 lines; an open-application response on a large browser window can be 600+. Because it is plain text, the LLM client uses grep / head / tail against it instead of streaming the whole tree through the model context.
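The exact field order is the repo's to define; the shape described above looks roughly like the sample below (illustrative lines, not verbatim server output), and the grep workflow is the same either way:

```shell
# Build a sample file in the documented shape: diff prefix, AX role,
# quoted text, x/y/w/h, then a visible/offscreen tag.
cat > /tmp/sample_traverse.txt <<'EOF'
+ AXButton "Send" 1204 766 44 28 visible
~ AXStaticText "Alice" 24 118 180 20 visible (value changed)
- AXRow "Bob" 24 1460 320 44 offscreen
EOF

# The clickable set, with no client-side coordinate math:
grep 'AXButton.*visible' /tmp/sample_traverse.txt
```

Because the file is line-oriented, `head`, `tail`, and `grep -c` all work as cheap summaries before the model reads anything into context.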
Can I test the scroll chaser without Claude or Cursor connected?
Yes. Run `python3 scripts/test_mcp.py` from the repo root. It spawns a fresh server binary over stdio, calls open_application_and_traverse on Messages, walks to a recipient whose row is below the viewport, and calls click_and_traverse with the off-screen coordinates returned by the open traversal. If scrollIntoViewIfNeeded is working, the click still lands on the right row and the diff shows the thread panel populated. If the rescue fails, the click lands on whatever is at the target y and the diff shows an unexpected thread panel. The log file under /tmp/macos-use includes every scroll step with the AX frame of the target for that step, so you can see the chase unfold.
What is the 30-second watchdog and why is it related to the scroll chaser?
InputGuard.swift:24 sets `watchdogTimeout = 30` seconds. While the server is posting synthetic events — including the scroll wheel events the chaser fires — user input is suppressed by a CGEventTap at the head of the HID queue (InputGuard.swift:132-150). The watchdog forces the tap to disengage after 30s no matter what, so a misbehaving scroll chase cannot lock the keyboard forever. The max 30-step loop at main.swift:1189 is inside that budget: 30 steps × ~150ms per step ≈ 4.5s worst case, well under the watchdog. Pressing Esc (keycode 53, no modifiers) still cancels instantly via throwIfCancelled at main.swift:1708.
Which apps benefit most from the scroll chaser, in practice?
Messaging apps (Messages, Slack, Discord) where the recipient list is longer than the sidebar viewport. Mail when a folder contains hundreds of threads. System Settings where a sub-pane is far down the sidebar. Any web view embedded in a native app where the AX tree reports elements that the user has to scroll to. The common pattern is a virtualized or tall list where the target's AX frame is valid, populated, and far off-screen. If the app uses true lazy loading that destroys off-screen rows, the chaser falls back to edge-probing (the text-less case at main.swift:1220-1283) and works on what scroll reveals.
What changes if I run macos-use on a multi-monitor setup?
The scroll chaser uses window bounds from the AX tree, not screen bounds, so it is invariant to monitor layout. The cursor and frontmost-app save/restore at main.swift:1672-1675 and main.swift:1774-1780 flip AppKit's bottom-left origin into CGEvent's top-left origin per screen. The project's CLAUDE.md records a 3-screen rig: built-in at (0,0), left external at x≈-3840, right external at x≈3456, all at backingScaleFactor=1.0. Negative x coordinates work. The scroll-wheel event in the chaser is posted with `location = CGPoint(x: point.x, y: windowBounds.midY)` (main.swift:1197), which stays inside the target window regardless of which screen it is on.
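The origin flip mentioned above is a one-line affine map. A sketch, assuming only the main display's height (AppKit measures y upward from the bottom-left of the main screen; CGEvent measures y downward from its top-left); the function name is illustrative:

```swift
import Foundation

/// Convert an AppKit (bottom-left-origin) y into a CGEvent
/// (top-left-origin) y. `mainScreenHeight` is the primary display's
/// height, which anchors both coordinate systems. x passes through
/// unchanged, including negative x on a left external monitor.
func flipToCGEventY(appKitY: CGFloat, mainScreenHeight: CGFloat) -> CGFloat {
    mainScreenHeight - appKitY
}
```

On the 3-screen rig described above, a point at AppKit y = 0 (bottom of the main display) maps to CGEvent y = mainScreenHeight, regardless of which screen's x range the point falls in.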
More on the tier-3 pattern
macOS Automation Tools: The Three Tiers Nobody Draws The Line Between
Apple Events vs. input synthesis vs. AI-agent MCP. Six tools, one isDisruptive boolean, two AX trees on a handoff. A map of the category.
macOS Accessibility Tree Agents
The diff format, the in_viewport enrichment, the noise filters. What the tier-3 tree actually looks like when it reaches the model.
macOS AI Agent State Memory
The .txt files under /tmp/macos-use are the agent's memory. One line per element, grep-addressable, no tokens until the agent opens the file.