The MCP Server Desktop-App Problem No One Documents: Your Click Just Opened A Dialog Owned By A Different Process

On macOS the Save Panel, Share sheet, Print dialog, and permissions prompt all live in their own processes. The click you fired in Numbers can land the frontmost-app PID in openAndSavePanelService before your next tool call. A remote MCP server never deals with this. A desktop MCP server that ignores it leaves the agent staring at a tree that no longer has the button it wants. In macos-use this is 22 lines of Swift, and the payoff is two AX trees and a retargeted screenshot in a single MCP response.

Matthew Diakonov
10 min read
22 Swift lines at main.swift:1786-1809 solve the cross-process handoff problem
Two AX trees + one retargeted screenshot in a single MCP response
Works for Save Panels, Share sheets, Print dialogs, permissions prompts, notifications

The problem in one sentence

A remote MCP server runs somewhere else and responds with data. A desktop MCP server runs here and drives your UI. The second kind has to reckon with the fact that the window your click just opened may not be a window belonging to the app you were automating at all.

On macOS this is not a corner case. Save Panels are not a feature of Numbers; they are served by com.apple.appkit.xpc.openAndSavePanelService. Share sheets are not a feature of Safari; they are served by com.apple.SharingUIServer. Permissions prompts are owned by tccd. Print dialogs go through PrintUIService. You cannot automate a mac end-to-end without landing in at least one of these at least once.

One check. Four tool types. Two trees when it matters.

Diagram: click_and_traverse, type_and_traverse, press_key_and_traverse, and scroll_and_traverse feed one PID compare, which produces the tree of the original PID, the tree of the new frontmost PID, a retargeted screenshot, and an app_switch: line in the summary.

The 22 lines that do it

This is the entire cross-process handoff detector. It runs only on diff-producing tools (click, type, press, scroll). One NSWorkspace call. One PID compare. One extra traversal when needed. No heuristics, no dialog classname matching, no screenshot OCR.
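A minimal sketch of the PID compare described above, not the code from main.swift. The function names `handoffPid` and `currentFrontmostPid` are hypothetical; only `NSWorkspace.shared.frontmostApplication?.processIdentifier` is taken from the article. The compare is kept as a pure function so it can be exercised without a running desktop session.

```swift
import Foundation
#if canImport(AppKit)
import AppKit
#endif

/// Hypothetical helper: returns the PID that should ALSO be traversed,
/// or nil when no handoff happened (same PID, or no frontmost app reported).
func handoffPid(originalPid: Int32, frontmostPid: Int32?) -> Int32? {
    guard let frontmost = frontmostPid, frontmost != originalPid else {
        return nil
    }
    return frontmost
}

/// Hypothetical helper: reads the frontmost PID on macOS.
/// Returns nil on platforms where AppKit is unavailable.
func currentFrontmostPid() -> Int32? {
    #if canImport(AppKit)
    return NSWorkspace.shared.frontmostApplication?.processIdentifier
    #else
    return nil
    #endif
}

// Example with the PIDs used in this article: the click targeted Numbers
// (PID 4821) and a service process (PID 9912) took focus afterwards.
if let newPid = handoffPid(originalPid: 4821, frontmostPid: 9912) {
    print("app switch detected, also traverse PID \(newPid)")
}
```

In a live server the second argument would come from `currentFrontmostPid()`, read after the action completes. The extra traversal only runs when the function returns a non-nil PID.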

Sources/MCPServer/main.swift

What the agent actually sees

The compact summary below is the whole MCP response body that comes back from one click. Two extra lines name the new process. The full tree of the new process is already on disk at the file path in the response, waiting for a grep.

MCP response (compact summary)

And the on-disk flat text

Same event, written to /tmp/macos-use/<ts>_click_and_traverse.txt. The # app_switch: header (emitted by the formatter at main.swift:1030-1037) is the agent's grep target when it wants to jump to the new tree's elements.
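On the consuming side, an agent harness could pick the new PID out of the flat text by matching the header. The sketch below assumes the header shape `# app_switch: <AppName> (PID: <newPid>)` quoted in the FAQ; the type `AppSwitchHeader`, the function `parseAppSwitchHeader`, and the sample file contents are hypothetical.

```swift
import Foundation

/// Hypothetical value type for a parsed handoff header.
struct AppSwitchHeader: Equatable {
    let appName: String
    let pid: Int32
}

/// Parses a line of the form "# app_switch: <AppName> (PID: <pid>)".
/// Returns nil for any other line, including ordinary element lines.
func parseAppSwitchHeader(_ line: String) -> AppSwitchHeader? {
    let prefix = "# app_switch: "
    let pidMarker = " (PID: "
    guard line.hasPrefix(prefix), line.hasSuffix(")") else { return nil }
    // Strip the prefix and the trailing ")" -> "<AppName> (PID: <pid>"
    let body = line.dropFirst(prefix.count).dropLast()
    // Search from the end so app names containing "(" still parse.
    guard let marker = body.range(of: pidMarker, options: .backwards) else { return nil }
    let name = String(body[body.startIndex..<marker.lowerBound])
    guard let pid = Int32(body[marker.upperBound...]) else { return nil }
    return AppSwitchHeader(appName: name, pid: pid)
}

// Illustrative file contents only; real output may differ in detail.
let flatText = """
# app_switch: openAndSavePanelService (PID: 9912)
[Button] "Save" x:100 y:200 w:80 h:24 visible
"""

for line in flatText.split(separator: "\n") {
    if let header = parseAppSwitchHeader(String(line)) {
        print("next call should target PID \(header.pid) (\(header.appName))")
    }
}
```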

1713644901221_click_and_traverse.txt

Sequence diagram of one handoff-producing click

Four actors, eight messages, one response. Everything right of the dotted MCP-server lifeline happens on the user's mac; everything left of it happens in the agent.

click_and_traverse with cross-process handoff

Actors: Agent, MCP Server, Numbers (PID 4821), openAndSavePanelService.

Messages, in order:
1. click_and_traverse pid=4821 element="Save"
2. traverseBefore
3. post click event
4. spawn Save Panel (XPC)
5. traverseAfter (PID 4821)
6. NSWorkspace.frontmostApplication?
7. traverseAccessibilityTree(pid: 9912)
8. summary + two trees + retargeted screenshot
22 lines of Swift that implement the handoff check (main.swift:1786-1809)
2 AX trees returned in one response when a handoff fires
1 NSWorkspace call per diff-producing action
4 tool types where the check runs (click / type / press / scroll)

The dialogs this covers

None of these are modelled specially. The PID compare handles all of them because the common property is a frontmost-app PID change, not a dialog classname.

Save / Export panels

NSSavePanel and NSOpenPanel live in openAndSavePanelService. Every 'File > Save' in Numbers, Pages, TextEdit, Preview, Safari's 'Save As' flow triggers this service. If the MCP server keeps staring at the source app, the agent cannot click Save inside the panel.

Share sheets

com.apple.SharingUIServer owns the share picker. Agent clicks the Share toolbar button, focus jumps, the picker opens, the source app's AX tree is now useless.

Print dialog

PrintUIService is its own process. The second-tier 'Show Details' expansion spawns an entirely new dialog owned by that service.

Permissions prompts

tccd / SystemUIServer put up 'Grant access to X'. The source app cannot see the sheet; only the system process can.

Notifications

UserNotificationCenter banners are owned by com.apple.UserNotificationCenter. Click-to-expand drops focus onto the center's process.

Color / Font / Character picker

NSColorPanel and friends usually live in the app that summons them, but when invoked from a service menu they hand off. Cheap to detect, expensive to miss.

Step-by-step: what happens inside one tool call

1. Client sends click_and_traverse

The agent passes the PID it thinks is still in charge. The tool schema requires pid; no handoff prediction is asked of the agent.

2. Primary click executes on the original PID

CGEventPost fires. The click lands. macOS delivers the event to the target window, which may spawn a service-owned dialog.

3. Server takes traverseAfter on the original PID

This is the normal diff. It captures the source app's post-click state. If the click opened a sheet in the same process, the sheet shows up in added elements here.

4. Server asks NSWorkspace who is frontmost NOW

One line: NSWorkspace.shared.frontmostApplication?.processIdentifier. Cheap, synchronous, runs unconditionally on diff tools.

5. If it differs, traverse the new PID too

traverseAccessibilityTree(pid: newPid) on a MainActor task. Response enriched with window bounds, assigned to appSwitchTraversal.

6. Retarget the screenshot to the new PID

captureWindowScreenshot uses appSwitchPid ?? traversalPid ?? pidForTraversal at main.swift:1837 so the PNG matches the tree the agent just got.

7. Emit one MCP response, two trees, one screenshot

Compact summary includes an app_switch: line and sample elements. Flat-text file includes a '# app_switch:' header plus the new app's full tree. The next tool call already knows which PID to target.

3 round-trips → 1 round-trip

Compared against a naive MCP server that always returns the original PID's tree. Measured by counting tool-call round-trips to reach the Save button after a File > Save click in Numbers.

Hand-tested April 2026

How the summary code appends the app_switch line

The summary formatter does not care whether a handoff occurred; it just checks appSwitchPid and, if present, emits two lines plus a sample. The agent does not have to ask for this — it arrives automatically.
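As a rough illustration of that behaviour, the sketch below appends the two summary lines only when handoff data is present. The line formats follow the ones quoted in the FAQ; the `AppSwitchInfo` type and the `appendAppSwitchLines` function are hypothetical and are not the formatter in main.swift.

```swift
/// Hypothetical container for what the summary needs to know about a handoff.
struct AppSwitchInfo {
    let appName: String
    let pid: Int32
    let totalElements: Int
    let visibleElements: Int
}

/// Appends the two app_switch lines when a handoff was recorded;
/// leaves the summary untouched when `info` is nil.
func appendAppSwitchLines(to summary: inout [String], info: AppSwitchInfo?) {
    guard let info = info else { return }
    summary.append("app_switch: \(info.appName) (PID: \(info.pid)) is now frontmost")
    summary.append("app_switch_elements: \(info.totalElements) total, \(info.visibleElements) visible")
}

// Illustrative values only.
var summary = ["status: success"]
let info = AppSwitchInfo(appName: "openAndSavePanelService", pid: 9912,
                         totalElements: 120, visibleElements: 45)
appendAppSwitchLines(to: &summary, info: info)
print(summary.joined(separator: "\n"))
```

Keeping the handoff data optional means the no-handoff path costs one nil check and adds nothing to the response.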

Sources/MCPServer/main.swift

Handoff-aware vs. naive desktop MCP server

| Feature | Naive MCP server | macos-use |
| --- | --- | --- |
| Detects cross-process dialog ownership | No: always returns the tree of the PID you asked for | Yes: compares frontmost PID after every diff-producing action |
| Response shape when a dialog opens in a sibling process | One AX tree (the stale one) + a screenshot of the old window | Two AX trees, one screenshot of the new window, one summary that names the new PID |
| Round-trips to recover from a handoff | 2+ (refresh + probe for the frontmost PID) | 0 (new tree arrives inside the same response) |
| Screenshot target when handoff fires | Original window (probably hidden under the dialog) | New frontmost window (main.swift:1837 uses appSwitchPid first) |
| Works for save panels, share sheets, print, permissions | Partial at best; usually fails on XPC-backed services | Single code path covers all of them by comparing PIDs, not classnames |
| Cost per action when no handoff happens | N/A | One NSWorkspace call, skipped traversal (branch taken only when PIDs differ) |

Try it in one terminal tail

Build the binary, connect it to any MCP client, open Numbers, and tail the output directory:

Verify the handoff path

Frequently asked questions

What is an MCP server that runs as (or inside) a desktop app?

It is an MCP server co-located on the same machine as the end user and whose tools drive that machine's native UI. Unlike a remote MCP server that returns data from an API, a desktop MCP server issues real OS events: CGEventPost on macOS, SendInput on Windows, XTest on X11. mcp-server-macos-use is the macOS variant. It speaks JSON-RPC 2.0 over stdio to clients like Claude Desktop, Cursor, Cline, and VS Code, and its six tools (open_application_and_traverse, click_and_traverse, type_and_traverse, press_key_and_traverse, scroll_and_traverse, refresh_traversal) move the cursor and keyboard on your real mac. All tools are registered in the allTools array at Sources/MCPServer/main.swift:1408.

What is the cross-process dialog problem, exactly?

On macOS a dialog you think of as 'part of Safari' is frequently owned by a separate process: Save Panels are served by com.apple.appkit.xpc.openAndSavePanelService, SharingService pickers live in com.apple.SharingUIServer, print dialogs in PrintUIService. When your click triggers one of these, the frontmost-app PID changes from (say) Safari to a service PID that the agent has never seen. A naive MCP server returns the AX tree of Safari — which no longer has the element the agent wants to click — and the agent gets stuck poking at a window that is no longer accepting input.

How does mcp-server-macos-use detect the handoff?

With one comparison after every diff-producing action. Sources/MCPServer/main.swift:1786-1809 reads NSWorkspace.shared.frontmostApplication?.processIdentifier after the tool call completes and compares it to options.pidForTraversal (the PID you originally passed in). If they differ, it calls traverseAccessibilityTree(pid: newPid) on the new frontmost process and attaches the result to toolResponse.appSwitchTraversal. The screenshot code at main.swift:1837 then uses appSwitchPid as the effective PID so the PNG captures the NEW window, not the old one. All of this happens inside the single MCP response so the agent's next reasoning step already has the new tree in hand.

What does the response look like when a handoff fires?

Two extra lines appear in the compact summary (main.swift:871-882): 'app_switch: <AppName> (PID: <newPid>) is now frontmost' and 'app_switch_elements: <total> total, <visible> visible', followed by a sampled visible_elements block for the new app. The on-disk flat text file gets a '# app_switch: <AppName> (PID: <newPid>)' header (main.swift:1030-1037) and then one line per element of the new app's tree in the same format the agent is already grepping: `[Role] "text" x:N y:N w:W h:H visible`. One response, two trees, one grep target.

Why only for diff-producing actions and not for every tool call?

Because open_application_and_traverse and refresh_traversal are full-traversal tools. They do not produce a diff, and they already pick the PID you asked for. The handoff check is gated on `if hasDiff` at main.swift:1788, which is only true for click / type / press / scroll. That is exactly the set of actions that can cause a process to pop a service-owned window. The trade is deliberate: skip the compare when it cannot matter, do it when it can.
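One way to picture the gate is a per-tool flag. The sketch below lists the six tool names from this article and marks four of them as diff-producing. The `Tool` enum and the way `hasDiff` is derived here are assumptions for illustration; the article only states that the check sits behind `if hasDiff`.

```swift
/// Hypothetical enumeration of the six tools named in this article.
enum Tool: String, CaseIterable {
    case openApplicationAndTraverse = "open_application_and_traverse"
    case clickAndTraverse = "click_and_traverse"
    case typeAndTraverse = "type_and_traverse"
    case pressKeyAndTraverse = "press_key_and_traverse"
    case scrollAndTraverse = "scroll_and_traverse"
    case refreshTraversal = "refresh_traversal"

    /// True for the tools that return a before/after diff,
    /// which are the only ones that run the frontmost-PID compare.
    var hasDiff: Bool {
        switch self {
        case .clickAndTraverse, .typeAndTraverse, .pressKeyAndTraverse, .scrollAndTraverse:
            return true
        case .openApplicationAndTraverse, .refreshTraversal:
            return false
        }
    }
}

// List the tools for which the handoff check would run.
let checkedTools = Tool.allCases.filter { $0.hasDiff }.map { $0.rawValue }
print(checkedTools)
```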

Is the screenshot of the old app or the new app?

The new app. main.swift:1837 evaluates `toolResponse.appSwitchPid ?? toolResponse.traversalPid ?? options.pidForTraversal` left to right, so the effective screenshot PID is appSwitchPid whenever a handoff was detected and falls back to the other two otherwise. You click Save in Numbers, the Save Panel opens owned by openAndSavePanelService, and the PNG captured at the same millisecond timestamp shows the Save Panel. The screenshot and AX tree in the response pair up.
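The fallback order can be shown in isolation. This is a sketch with stand-in types: `ToolResponse`, `Options`, and `screenshotPid` are hypothetical and model only the three fields named above.

```swift
/// Stand-in for the response fields named in the article.
struct ToolResponse {
    var appSwitchPid: Int32?
    var traversalPid: Int32?
}

/// Stand-in for the options field named in the article.
struct Options {
    let pidForTraversal: Int32
}

/// Picks the screenshot target: the handoff PID if one was detected,
/// else the traversed PID, else the PID the caller originally passed.
func screenshotPid(response: ToolResponse, options: Options) -> Int32 {
    return response.appSwitchPid ?? response.traversalPid ?? options.pidForTraversal
}

let options = Options(pidForTraversal: 4821)
let withHandoff = ToolResponse(appSwitchPid: 9912, traversalPid: 4821)
let withoutHandoff = ToolResponse(appSwitchPid: nil, traversalPid: 4821)

print(screenshotPid(response: withHandoff, options: options))    // 9912
print(screenshotPid(response: withoutHandoff, options: options)) // 4821
```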

Does this mean I can chain a click into the dialog's buttons in one MCP call?

Not quite: it takes two calls. The first call returns the handoff tree with the Save Panel elements and the new PID. The second call uses that new PID. The click/type/press schemas all require pid, so once you have appSwitchPid from the first response, you pass it as `pid` in the second call. The benefit is that you do not have to waste a round-trip on refresh_traversal just to discover that the frontmost app changed. The agent's next tool call can target the right process immediately.
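A client-side sketch of the second call, assuming the standard MCP `tools/call` JSON-RPC request shape. The `pid` argument is required per the article. The `element` argument name is an assumption taken from the sequence diagram label, and the Swift types and the `followUpClick` function are hypothetical.

```swift
import Foundation

/// Hypothetical argument payload; `element` is an assumed parameter name.
struct ClickArguments: Codable {
    let pid: Int32
    let element: String
}

struct ToolCallParams: Codable {
    let name: String
    let arguments: ClickArguments
}

struct ToolCallRequest: Codable {
    var jsonrpc = "2.0"
    let id: Int
    var method = "tools/call"
    let params: ToolCallParams
}

/// Builds the follow-up click, targeting the handoff PID when the first
/// response reported one and the original PID otherwise.
func followUpClick(originalPid: Int32, appSwitchPid: Int32?, element: String, id: Int) -> ToolCallRequest {
    let targetPid = appSwitchPid ?? originalPid
    let arguments = ClickArguments(pid: targetPid, element: element)
    return ToolCallRequest(id: id, params: ToolCallParams(name: "click_and_traverse", arguments: arguments))
}

// The first response reported a handoff to PID 9912, so call two targets it.
let second = followUpClick(originalPid: 4821, appSwitchPid: 9912, element: "Save", id: 2)
let encoder = JSONEncoder()
encoder.outputFormatting = [.sortedKeys]
if let data = try? encoder.encode(second), let json = String(data: data, encoding: .utf8) {
    print(json)
}
```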

What about a dialog that belongs to the ORIGINAL app (like a sheet)?

Sheets attached to a window are owned by the same process, so frontmost PID does not change and the handoff branch does not fire. The normal diff (added/removed/modified elements) on the original PID still captures the sheet because it is part of the same accessibility tree. The handoff branch is narrowly for the case where a NEW process became frontmost. It is the only case the simple diff path cannot express.

Can I see the exact lines of Swift that implement this in the repo?

Yes. `git clone https://github.com/mediar-ai/mcp-server-macos-use && cd mcp-server-macos-use && sed -n '1786,1809p' Sources/MCPServer/main.swift` prints the 22-line block. The flat-text formatter at main.swift:1030-1037 writes the handoff header. The compact-summary formatter at main.swift:871-882 writes the app_switch lines. The screenshot-target selector at main.swift:1837 uses appSwitchPid first. Four locations, one behavior.

How do I verify this is happening live?

Tail /tmp/macos-use/. Run `ls -lt /tmp/macos-use/ | head` in a terminal, then use a Claude Desktop session to open Numbers or Preview, call macos-use_click_and_traverse on a File > Save menu item, and watch a new pair of files appear under the same millisecond timestamp prefix. Run `grep -n '^# app_switch' /tmp/macos-use/<latest>.txt` and you should see the header. Run `grep -n '^app_switch:' /tmp/macos-use/<latest>.txt` to see the line the compact summary emitted. If both are there, the handoff path ran.

Which MCP clients does this server work with?

Any MCP-compliant client that can spawn a stdio process. Tested with Claude Desktop, Claude Code, Cursor, Cline, and VS Code's MCP support. The handshake is at Sources/MCPServer/main.swift:1411-1437 and ships the server's own tool-use instructions in the `instructions` field of the initialize response, which every compliant client forwards into the model's system context. You do not have to teach the model how to use this server per-client; the server tells it on connect.

How is this different from Terminator, which is for Windows?

Terminator (github.com/mediar-ai/terminator) is the Windows sibling and solves the same class of problem (AI-driven UI automation) on a different OS. It uses UI Automation instead of the macOS Accessibility API, SendInput instead of CGEventPost, and Windows-specific tricks for window ownership. The cross-process handoff problem exists on Windows too (print dialogs, file pickers) and Terminator addresses it differently because the Win32 window ownership model is not the same as macOS process ownership. macos-use is the macOS-specific answer; Terminator is the Windows-specific answer; together they cover the two desktop OSes where MCP clients currently ship.
