The MCP Server Desktop-App Problem No One Documents: Your Click Just Opened A Dialog Owned By A Different Process
On macOS the Save Panel, Share sheet, Print dialog, and permissions prompt all live in their own processes. The click you fired in Numbers can move the frontmost-app PID to openAndSavePanelService before your next tool call. A remote MCP server never deals with this. A desktop MCP server that ignores it leaves the agent staring at a tree that no longer has the button it wants. In macos-use the fix is 22 lines of Swift, and the payoff is two AX trees and a retargeted screenshot in a single MCP response.
The problem in one sentence
A remote MCP server runs somewhere else and responds with data. A desktop MCP server runs here and drives your UI. The second kind has to reckon with the fact that the window your click just opened may not be a window belonging to the app you were automating at all.
On macOS this is not a corner case. Save Panels are not a feature of Numbers; they are served by com.apple.appkit.xpc.openAndSavePanelService. Share sheets are not a feature of Safari; they are served by com.apple.SharingUIServer. Permissions prompts are owned by tccd. Print dialogs go through PrintUIService. You cannot automate a mac end-to-end without landing in at least one of these at least once.
One check. Four tool types. Two trees when it matters.
The 22 lines that do it
This is the entire cross-process handoff detector. It runs only on diff-producing tools (click, type, press, scroll). One NSWorkspace call. One PID compare. One extra traversal when needed. No heuristics, no dialog classname matching, no screenshot OCR.
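The listing itself is at Sources/MCPServer/main.swift:1786-1809. As a rough illustration of the logic described, and not a copy of that code, a minimal sketch might look like the following. `HandoffResponse`, `detectHandoff`, and the two closure parameters are hypothetical names chosen so the sketch runs without AppKit. On macOS the `frontmostPid` closure would wrap `NSWorkspace.shared.frontmostApplication?.processIdentifier`, and `traverse` stands in for `traverseAccessibilityTree(pid:)`.

```swift
// Sketch only: type and function names here are hypothetical,
// not taken from main.swift.
struct HandoffResponse {
    var appSwitchPid: Int32? = nil            // pid_t is Int32 on macOS
    var appSwitchTraversal: [String]? = nil   // stand-in for the real traversal type
}

/// Runs the cross-process check after a tool call has finished.
/// - hasDiff: true only for click / type / press / scroll.
/// - originalPid: the PID the caller passed in.
/// - frontmostPid: returns the PID of the frontmost app, if any.
/// - traverse: returns the accessibility tree for a PID.
func detectHandoff(
    hasDiff: Bool,
    originalPid: Int32,
    frontmostPid: () -> Int32?,
    traverse: (Int32) -> [String],
    into response: inout HandoffResponse
) {
    // Full-traversal tools skip the check entirely.
    guard hasDiff else { return }
    // One lookup, one compare. Same PID means no handoff.
    guard let newPid = frontmostPid(), newPid != originalPid else { return }
    // A different process is frontmost: record it and traverse it too.
    response.appSwitchPid = newPid
    response.appSwitchTraversal = traverse(newPid)
}
```

The extra traversal only runs inside the second `guard`, which is why the no-handoff path costs a single frontmost-app lookup.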
What the agent actually sees
The compact summary below is the whole MCP response body that comes back from one click. Two extra lines name the new process. The full tree of the new process is already on disk at the file path in the response, waiting for a grep.
And the on-disk flat text
Same event, written to /tmp/macos-use/<ts>_click_and_traverse.txt. The # app_switch: header (emitted by the formatter at main.swift:1030-1037) is the agent's grep target when it wants to jump to the new tree's elements.
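To make the file layout concrete, here is a minimal sketch of a writer that produces the `# app_switch:` header followed by one line per element in the `[Role] "text" x:N y:N w:W h:H visible` shape. `FlatElement`, `flatLine`, and `appSwitchSection` are hypothetical names, and dropping the trailing `visible` for off-screen elements is an assumption about the format, not something confirmed by the source.

```swift
// Sketch only: hypothetical types, not the formatter from main.swift.
struct FlatElement {
    let role: String
    let text: String
    let x: Int, y: Int, w: Int, h: Int
    let visible: Bool
}

/// Renders one element as a single grep-friendly line.
func flatLine(_ e: FlatElement) -> String {
    var line = "[\(e.role)] \"\(e.text)\" x:\(e.x) y:\(e.y) w:\(e.w) h:\(e.h)"
    // Assumption: only on-screen elements carry the trailing marker.
    if e.visible { line += " visible" }
    return line
}

/// Renders the handoff section: header first, then the new app's elements.
func appSwitchSection(appName: String, pid: Int32, elements: [FlatElement]) -> String {
    var lines = ["# app_switch: \(appName) (PID: \(pid))"]
    lines += elements.map(flatLine)
    return lines.joined(separator: "\n")
}
```

Because the header starts with `#` and every element sits on its own line, a single `grep -n '^# app_switch'` gives the agent the line number where the new process's tree begins.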
Sequence diagram of one handoff-producing click
Four actors, eight messages, one response. Everything right of the dotted MCP-server lifeline happens on the user's mac; everything left of it happens in the agent.
click_and_traverse with cross-process handoff
The dialogs this covers
None of these are modelled specially. The PID compare handles all of them because the common property is a frontmost-app PID change, not a dialog classname.
Save / Export panels
NSSavePanel and NSOpenPanel live in openAndSavePanelService. Every 'File > Save' in Numbers, Pages, TextEdit, or Preview, as well as Safari's 'Save As' flow, triggers this service. If the MCP server keeps staring at the source app, the agent cannot click Save inside the panel.
Share sheets
com.apple.SharingUIServer owns the share picker. The agent clicks the Share toolbar button, focus jumps, the picker opens, and the source app's AX tree no longer contains anything the agent needs.
Print dialog
PrintUIService is its own process. The second-tier 'Show Details' expansion spawns an entirely new dialog owned by that service.
Permissions prompts
tccd / SystemUIServer puts up 'Grant access to X'. The source app cannot see the sheet; only the system process can.
Notifications
UserNotificationCenter banners are owned by com.apple.UserNotificationCenter. Click-to-expand drops focus onto the center's process.
Color / Font / Character picker
NSColorPanel and friends usually live in the app that summons them, but when invoked from a service menu they hand off to another process. Cheap to detect, expensive to miss.
Step-by-step: what happens inside one tool call
Client sends click_and_traverse
The agent passes the PID it thinks is still in charge. The tool schema requires pid; no handoff prediction is asked of the agent.
Primary click executes on the original PID
CGEventPost fires. The click lands. macOS delivers the event to the target window, which may spawn a service-owned dialog.
Server takes traverseAfter on the original PID
This is the normal diff. It captures the source app's post-click state. If the click opened a sheet in the same process, the sheet shows up in added elements here.
Server asks NSWorkspace who is frontmost NOW
One line: NSWorkspace.shared.frontmostApplication?.processIdentifier. Cheap, synchronous, runs unconditionally on diff tools.
If it differs, traverse the new PID too
traverseAccessibilityTree(pid: newPid) on a MainActor task. Response enriched with window bounds, assigned to appSwitchTraversal.
Retarget the screenshot to the new PID
captureWindowScreenshot uses appSwitchPid ?? traversalPid ?? pidForTraversal at main.swift:1837 so the PNG matches the tree the agent just got.
Emit one MCP response, two trees, one screenshot
Compact summary includes an app_switch: line and sample elements. Flat-text file includes a '# app_switch:' header plus the new app's full tree. The next tool call already knows which PID to target.
“Compared against a naive MCP server that always returns the original PID's tree. Measured by counting tool-call round-trips to reach the Save button after a File > Save click in Numbers.”
Hand-tested April 2026
How the summary code appends the app_switch line
The summary formatter does not care whether a handoff occurred; it just checks appSwitchPid and, if present, emits two lines plus a sample. The agent does not have to ask for this — it arrives automatically.
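A minimal sketch of that append step, assuming the summary is built as an array of lines. `SwitchInfo` and `appendAppSwitch` are hypothetical names. The two line formats follow the `app_switch:` and `app_switch_elements:` shapes this article describes for the compact summary, and the sampled elements block is left out because its exact layout is not specified here.

```swift
// Sketch only: hypothetical names, not the formatter from main.swift.
struct SwitchInfo {
    let appName: String
    let pid: Int32
    let totalElements: Int
    let visibleElements: Int
}

/// Appends the handoff lines when a switch was recorded.
/// With no switch info the summary is left untouched.
func appendAppSwitch(to summary: inout [String], info: SwitchInfo?) {
    guard let info = info else { return }
    summary.append("app_switch: \(info.appName) (PID: \(info.pid)) is now frontmost")
    summary.append("app_switch_elements: \(info.totalElements) total, \(info.visibleElements) visible")
}
```

Keeping the check to a single optional means the no-handoff summary is byte-for-byte what it would have been without this feature.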
Handoff-aware vs. naive desktop MCP server
| Feature | Naive MCP server | macos-use |
|---|---|---|
| Detects cross-process dialog ownership | No — returns the tree of the PID you asked for, always | Yes — compares frontmost PID after every diff-producing action |
| Response shape when a dialog opens in a sibling process | One AX tree (the stale one) + a screenshot of the old window | Two AX trees, one screenshot of the new window, one summary that names the new PID |
| Round-trips to recover from a handoff | 2+ (refresh + probe for the frontmost PID) | 0 (new tree arrives inside the same response) |
| Screenshot target when handoff fires | Original window (probably hidden under the dialog) | New frontmost window (main.swift:1837 uses appSwitchPid first) |
| Works for save panels, share sheets, print, permissions | Partial at best — usually fails on XPC-backed services | Single code path covers all of them by comparing PIDs, not classnames |
| Cost per action when no handoff happens | N/A | One NSWorkspace call, skipped traversal (branch taken only when PIDs differ) |
Try it in one terminal tail
Build the binary, connect it to any MCP client, open Numbers, and watch the output directory, for example by rerunning `ls -lt /tmp/macos-use/ | head` after each tool call.
Frequently asked questions
What is an MCP server that runs as (or inside) a desktop app?
It is an MCP server that runs on the same machine as the end user and whose tools drive that machine's native UI. Unlike a remote MCP server that returns data from an API, a desktop MCP server issues real OS events: CGEventPost on macOS, SendInput on Windows, XTest on X11. mcp-server-macos-use is the macOS variant. It speaks JSON-RPC 2.0 over stdio to clients like Claude Desktop, Cursor, Cline, and VS Code, and its six tools (open_application_and_traverse, click_and_traverse, type_and_traverse, press_key_and_traverse, scroll_and_traverse, refresh_traversal) move the cursor and keyboard on your real mac. All tools are registered in the allTools array at Sources/MCPServer/main.swift:1408.
What is the cross-process dialog problem, exactly?
On macOS a dialog you think of as 'part of Safari' is frequently owned by a separate process: Save Panels are served by com.apple.appkit.xpc.openAndSavePanelService, SharingService pickers live in com.apple.SharingUIServer, print dialogs in PrintUIService. When your click triggers one of these, the frontmost-app PID changes from (say) Safari to a service PID that the agent has never seen. A naive MCP server returns the AX tree of Safari — which no longer has the element the agent wants to click — and the agent gets stuck poking at a window that is no longer accepting input.
How does mcp-server-macos-use detect the handoff?
With one comparison after every diff-producing action. Sources/MCPServer/main.swift:1786-1809 reads NSWorkspace.shared.frontmostApplication?.processIdentifier after the tool call completes and compares it to options.pidForTraversal (the PID you originally passed in). If they differ, it calls traverseAccessibilityTree(pid: newPid) on the new frontmost process and attaches the result to toolResponse.appSwitchTraversal. The screenshot code at main.swift:1837 then uses appSwitchPid as the effective PID so the PNG captures the NEW window, not the old one. All of this happens inside the single MCP response so the agent's next reasoning step already has the new tree in hand.
What does the response look like when a handoff fires?
Two extra lines appear in the compact summary (main.swift:871-882): 'app_switch: <AppName> (PID: <newPid>) is now frontmost' and 'app_switch_elements: <total> total, <visible> visible', followed by a sampled visible_elements block for the new app. The on-disk flat text file gets a '# app_switch: <AppName> (PID: <newPid>)' header (main.swift:1030-1037) and then one line per element of the new app's tree in the same format the agent is already grepping: `[Role] "text" x:N y:N w:W h:H visible`. One response, two trees, one grep target.
Why only for diff-producing actions and not for every tool call?
Because open_application_and_traverse and refresh_traversal are full-traversal tools. They do not produce a diff, and they already pick the PID you asked for. The handoff check is gated on `if hasDiff` at main.swift:1788, which is only true for click / type / press / scroll. That is exactly the set of actions that can cause a process to pop a service-owned window. The trade is deliberate: skip the compare when it cannot matter, do it when it can.
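One way to picture the gate is as a property on the tool itself. The sketch below is an assumption about structure, not the repo's code: `MacosUseTool` is a hypothetical enum, while the raw values are the six tool names listed earlier in this FAQ.

```swift
// Sketch only: a hypothetical enum modelling which tools produce a diff.
enum MacosUseTool: String, CaseIterable {
    case openApplicationAndTraverse = "open_application_and_traverse"
    case clickAndTraverse = "click_and_traverse"
    case typeAndTraverse = "type_and_traverse"
    case pressKeyAndTraverse = "press_key_and_traverse"
    case scrollAndTraverse = "scroll_and_traverse"
    case refreshTraversal = "refresh_traversal"

    /// True for the four action tools that return a before/after diff.
    /// Only these can trigger the frontmost-PID comparison.
    var hasDiff: Bool {
        switch self {
        case .clickAndTraverse, .typeAndTraverse, .pressKeyAndTraverse, .scrollAndTraverse:
            return true
        case .openApplicationAndTraverse, .refreshTraversal:
            return false
        }
    }
}
```

An exhaustive `switch` forces anyone adding a seventh tool to decide explicitly whether it belongs on the diff path.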
Is the screenshot of the old app or the new app?
The new app. main.swift:1837 evaluates `toolResponse.appSwitchPid ?? toolResponse.traversalPid ?? options.pidForTraversal` in that order, so the effective screenshot PID is appSwitchPid whenever a handoff was detected and falls back to the other two only when it is nil. You click Save in Numbers, the Save Panel opens owned by openAndSavePanelService, and the PNG captured at the same millisecond timestamp shows the Save Panel. The screenshot and AX tree in the response pair up.
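The fallback order is easiest to see in isolation. This is a minimal sketch with a hypothetical function name, assuming the last value in the chain is non-optional; the source only shows the `??` expression itself.

```swift
// Sketch only: hypothetical wrapper around the nil-coalescing chain.
/// Picks the PID whose window should be captured.
/// Preference order: handoff PID, then traversal PID, then the caller's PID.
func effectiveScreenshotPid(
    appSwitchPid: Int32?,
    traversalPid: Int32?,
    pidForTraversal: Int32   // assumption: always present
) -> Int32 {
    return appSwitchPid ?? traversalPid ?? pidForTraversal
}
```

With no handoff, `appSwitchPid` is nil and the screenshot targets the same process as the normal traversal, so the tree and the PNG stay paired in both cases.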
Does this mean I can chain a click into the dialog's buttons in one MCP call?
Not quite. It takes two calls. The first call returns the handoff tree with the Save Panel elements and the new PID. The second call uses that new PID: the click/type/press schemas all require pid, so once you have appSwitchPid from response one, you pass it as `pid` in call two. The benefit is that you do not have to waste a round-trip on refresh_traversal just to discover that the frontmost app changed. The agent's next tool call can target the right process immediately.
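On the client side, the two-call pattern can be sketched as below, assuming the agent loop only has the compact summary text and recovers the new PID from the `app_switch: <AppName> (PID: <newPid>) is now frontmost` line. Both function names are hypothetical, and a real client may read the PID from a structured field instead.

```swift
import Foundation

// Sketch only: hypothetical client-side helpers, not part of the server.
/// Extracts the new frontmost PID from a compact summary, if a handoff was reported.
func appSwitchPid(inSummary summary: String) -> Int32? {
    for line in summary.split(separator: "\n") {
        // "app_switch_elements: ..." does not match because of the colon and space.
        guard line.hasPrefix("app_switch: "),
              let open = line.range(of: "(PID: ", options: .backwards),
              let close = line.range(of: ")", range: open.upperBound..<line.endIndex)
        else { continue }
        if let pid = Int32(line[open.upperBound..<close.lowerBound]) {
            return pid
        }
    }
    return nil
}

/// Chooses the `pid` argument for the next tool call:
/// the handoff PID when one was reported, otherwise the PID already in use.
func pidForNextCall(previousPid: Int32, lastSummary: String) -> Int32 {
    return appSwitchPid(inSummary: lastSummary) ?? previousPid
}
```

The second call then passes the result of `pidForNextCall` as `pid`, with no intermediate refresh_traversal.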
What about a dialog that belongs to the ORIGINAL app (like a sheet)?
Sheets attached to a window are owned by the same process, so frontmost PID does not change and the handoff branch does not fire. The normal diff (added/removed/modified elements) on the original PID still captures the sheet because it is part of the same accessibility tree. The handoff branch is narrowly for the case where a NEW process became frontmost. It is the only case the simple diff path cannot express.
Can I see the exact lines of Swift that implement this in the repo?
Yes. `git clone https://github.com/mediar-ai/mcp-server-macos-use && cd mcp-server-macos-use && sed -n '1786,1809p' Sources/MCPServer/main.swift` prints the 22-line block. The flat-text formatter at main.swift:1030-1037 writes the handoff header. The compact-summary formatter at main.swift:871-882 writes the app_switch lines. The screenshot-target selector at main.swift:1837 uses appSwitchPid first. Four locations, one behavior.
How do I verify this is happening live?
Watch /tmp/macos-use/. Run `ls -lt /tmp/macos-use/ | head` in a terminal, then use a Claude Desktop session to open Numbers or Preview and call macos-use_click_and_traverse on a File > Save menu item. Run the listing again and a new pair of files should be at the top. Run `grep -n '^# app_switch' /tmp/macos-use/<latest>.txt` and you should see the header. Run `grep -n '^app_switch:' /tmp/macos-use/<latest>.txt` to see the line the compact summary emitted. If both are there, the handoff path ran.
Which MCP clients does this server work with?
Any MCP-compliant client that can spawn a stdio process. Tested with Claude Desktop, Claude Code, Cursor, Cline, and VS Code's MCP support. The handshake is at Sources/MCPServer/main.swift:1411-1437 and ships the server's own tool-use instructions in the `instructions` field of the initialize response, which clients can forward into the model's system context. With a client that does so, you do not have to teach the model how to use this server per-client; the server tells it on connect.
How is this different from Terminator, which is for Windows?
Terminator (github.com/mediar-ai/terminator) is the Windows sibling and solves the same class of problem (AI-driven UI automation) on a different OS. It uses UI Automation instead of the macOS Accessibility API, SendInput instead of CGEventPost, and Windows-specific tricks for window ownership. The cross-process handoff problem exists on Windows too (print dialogs, file pickers) and Terminator addresses it differently because the Win32 window ownership model is not the same as macOS process ownership. macos-use is the macOS-specific answer; Terminator is the Windows-specific answer; together they cover the two desktop OSes where MCP clients currently ship.
More on what a desktop MCP server actually has to do
What Is An MCP Server, Really
The other thing a desktop MCP server has to handle: sharing your keyboard with you. CGEventTap, 30-second watchdog, Esc as kill-switch. InputGuard.swift, 355 lines the spec never covers.
macOS AI Agent State Memory
Grep-addressable screen memory. Every tool call writes the AX tree to /tmp/macos-use/<ts>_<tool>.txt as one line per element. The LLM gets a file path, not tokens.
AI Agent UI State Checkpointing
Three snapshots (cursor, frontmost app, AX tree) around every disruptive tool call, two restored on exit. The sibling concept to the handoff detector.