
MCP Agent Plan Execution On A Real Desktop: Why macos-use Refuses To Have A Plan Primitive

Every article you will find for this keyword describes plan execution as an orchestration-layer problem: planner, executor, synthesizer, Temporal-style durable workflows, agents compiling plans into code. All useful. None of it addresses plan execution on a desktop where the human is at the keyboard, focus can shift to a different app mid-plan, the target element can be off-screen, and a single keystroke needs to kill the whole thing. This is the below-MCP half of the problem, and macos-use solves it with no plan primitive at all.

Matthew Diakonov, Written with AI

Published April 20, 2026. 10 min read.



One tool call runs 5 ordered OS events: traverseBefore → click → type → press → traverseAfter (main.swift:1709-1751)

throwIfCancelled() polled between each step at main.swift:1708, 1721, 1728, 1734; Esc kills the chain at the current boundary

scroll_into_view probes up to 30 steps, app switch handoff traverses the new frontmost, watchdog releases the input tap after 30s

The 6 tools macos-use exposes (no plan primitive)

open_application_and_traverse, click_and_traverse, type_and_traverse, press_key_and_traverse, scroll_and_traverse, refresh_traversal

One MCP call, five OS events, one diff


The MCP client makes one call

click_and_traverse with text and pressKey both set.

What The Top Results For This Keyword All Get Right, And What They All Miss

The first-page SERP for mcp agent plan execution is consistent: lastmile-ai/mcp-agent describes planner / orchestrator / worker patterns from Anthropic's "Building Effective Agents." Agent-MCP frames multi-agent coordination with shared context. Cloudflare's Code Mode MCP argues that agents should compile plans into code snippets to avoid loading every endpoint definition into context. OpenAI's Agents SDK documents MCP server integration. Anthropic's engineering blog covers code execution with MCP. All of these live above the MCP protocol: they describe how the agent decides what to call.

None of them describe what happens when the call itself is to a real desktop app running on a real user's machine. On a desktop, plan execution has failure modes that do not exist in a web sandbox or a Docker container. The user is at the keyboard. Focus can shift because an action launched a second app. The AX element the plan computed an hour ago may now be off-screen. And if something goes wrong, there has to be a way to stop it that does not involve the agent re-planning anything.

That is what the rest of this page is about.

The Plan Compression Trick: One Tool Call, Five OS Events

macos-use has no plan tool and no session state. What it has is optional chaining params on click_and_traverse. If you pass both text and pressKey, the server runs five ordered OS events inside the single MCP request, with a cancellation checkpoint between each.
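As a rough illustration of what such a request could look like on the wire, the sketch below builds a JSON-RPC 2.0 `tools/call` message for click_and_traverse. The argument names (pid, x, y, text, pressKey) are the ones this article mentions; the exact schema, the helper name `build_chained_click`, and the sample values are assumptions for illustration, not taken from the server source.

```python
import json


def build_chained_click(request_id, pid, x, y, text=None, press_key=None):
    """Build a JSON-RPC 2.0 `tools/call` request for click_and_traverse.

    Hypothetical helper: the optional chaining params are only included
    when the caller actually supplies them.
    """
    arguments = {"pid": pid, "x": x, "y": y}
    if text is not None:
        arguments["text"] = text
    if press_key is not None:
        arguments["pressKey"] = press_key
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "click_and_traverse", "arguments": arguments},
    }


# Sample values are made up; a real pid and coordinates come from a traversal.
request = build_chained_click(
    1, pid=4321, x=640.0, y=212.0, text="alice@example.com", press_key="Return"
)
print(json.dumps(request, indent=2))
```

Leaving out `text` and `press_key` yields a plain click, so the same helper covers both the chained and the unchained form.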

The five events, in order

traverseBefore

showDiff = true implies a pre-action AX tree snapshot

click

auto-activates target app, optional scroll_into_view up to 30 steps

type

only runs if caller passed the text param on click_and_traverse

press

only runs if caller passed the pressKey param

traverseAfter

subtract from before to build the diff the agent reads next

Execution order is set at main.swift:1701-1751. Every try InputGuard.shared.throwIfCancelled() is a boundary where pressing Esc can still kill the chain.
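The checkpoint pattern can be modelled in a few lines. This is a simplified Python sketch, not the server's Swift code: `Guard`, `run_chain`, and `restore` are hypothetical names, and polling before every step (including the first) is an assumption about where the boundaries sit.

```python
class Cancelled(Exception):
    """Raised at a chain boundary once a cancel has been requested."""


class Guard:
    """Stand-in for the cancellation flag that the Esc key sets."""

    def __init__(self):
        self.cancelled = False

    def throw_if_cancelled(self):
        if self.cancelled:
            raise Cancelled("cancelled by user")


def run_chain(guard, steps, restore):
    """Run (name, action) pairs in order, polling the guard before each one.

    Returns (completed step names, whether the chain was cancelled).
    `restore` always runs, mirroring restore-on-cancel behaviour.
    """
    completed = []
    try:
        for name, action in steps:
            guard.throw_if_cancelled()
            action()
            completed.append(name)
        return completed, False
    except Cancelled:
        return completed, True
    finally:
        restore()


guard = Guard()
restored = []


def noop():
    pass


def click():
    # Simulate the user pressing Esc while the click is in flight.
    guard.cancelled = True


steps = [
    ("traverseBefore", noop),
    ("click", click),
    ("type", noop),
    ("press", noop),
    ("traverseAfter", noop),
]
completed, was_cancelled = run_chain(guard, steps, lambda: restored.append(True))
print(completed, was_cancelled)
```

In this model the step that was already running finishes, and the chain stops at the next boundary, which matches the "current boundary, not the final step" behaviour described here.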

OS events per call: 5 (click + type + press, plus the before and after traversals)

MCP round-trips: 1 (one diff returned)

Max scroll steps: 30 (main.swift:1189)

Input-tap watchdog: 30s (InputGuard.swift:24)

Naive Plan Versus Chained Plan, On The Wire

If you write the agent loop as three sequential MCP calls (click, then type, then press), you pay for three LLM turns, three AX traversals, and three windows during which the user can race you or focus can shift. The chained form collapses all three into one.

Plan round-trips

// Naive plan: one MCP call per OS event.
// Four LLM turns. Four traversals. Four chances for the user
// to type in the middle of your plan.

// Turn 1
call click_and_traverse { pid, element: "To field" }
  → 140 KB diff, LLM re-plans

// Turn 2
call type_and_traverse { pid, text: "alice@example.com" }
  → 140 KB diff, LLM re-plans

// Turn 3
call press_key_and_traverse { pid, key: "Tab" }
  → 140 KB diff, LLM re-plans

// Turn 4
call click_and_traverse { pid, element: "Send" }
  → 140 KB diff, LLM sees the new screen

// Total: 4 MCP round-trips, 4 LLM turns, 4 traversals.
// Every round-trip is a window for the user to interfere
// and a window for focus to shift to a different app.

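Chained, the same plan takes 2 MCP round-trips instead of 4: the click, type, and press collapse into one click_and_traverse call, followed by the click on Send.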
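To make the collapse concrete, here is a small agent-side sketch that folds a naive call list into chained calls. It is a hypothetical planner helper, not part of macos-use; the argument names (`element`, `text`, `key`, `pressKey`) follow the listing above, and the pid value is made up.

```python
def chain_plan(naive_calls):
    """Fold click -> type -> press runs into single click_and_traverse calls.

    `naive_calls` is a list of (tool_name, arguments) pairs. A type or press
    call is folded only when it directly follows a click on the same pid and
    the slot it needs on that click is still free.
    """
    chained = []
    for tool, args in naive_calls:
        last = chained[-1] if chained else None
        can_fold = (
            last is not None
            and last[0] == "click_and_traverse"
            and last[1].get("pid") == args.get("pid")
        )
        if (
            tool == "type_and_traverse"
            and can_fold
            and "text" not in last[1]
            and "pressKey" not in last[1]
        ):
            last[1]["text"] = args["text"]
        elif tool == "press_key_and_traverse" and can_fold and "pressKey" not in last[1]:
            last[1]["pressKey"] = args["key"]
        else:
            chained.append((tool, dict(args)))
    return chained


naive = [
    ("click_and_traverse", {"pid": 4321, "element": "To field"}),
    ("type_and_traverse", {"pid": 4321, "text": "alice@example.com"}),
    ("press_key_and_traverse", {"pid": 4321, "key": "Tab"}),
    ("click_and_traverse", {"pid": 4321, "element": "Send"}),
]
chained = chain_plan(naive)
print(len(naive), "->", len(chained))
```

The first three naive calls become one click_and_traverse with `text` and `pressKey` set; the click on Send stays a separate call because it is a new primary action.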

The Composed Execution Path, Verbatim

This is the branch the server takes when click_and_traverse has extra input actions to run after the click. Every arrow labeled main.swift:17xx in the comments is a line where the plan can be cancelled mid-execution.

Sources/MCPServer/main.swift

The Plan-Break Events The Server Handles Silently

If the agent had to detect and retry every one of these, the outer planner would be half plan and half recovery logic. macos-use absorbs them below the MCP boundary. The agent never learns most of them happened.

Off-screen target

scroll_into_view probes up to 30 scroll steps with line-scaled deltas (1 / 2 / 3 lines per step based on distance) before giving up. The agent never learns it happened. main.swift:1189 sets maxSteps = 30.

Focus shifts to a different app

After the action, the handler compares the new frontmost PID to the one the call targeted. If they differ, it traverses the new frontmost and returns it as appSwitchTraversal on the same response. main.swift:1788-1808.

User types mid-plan

CGEventTap at .cghidEventTap filters by source state ID. Hardware events (stateID=0) are dropped; events from the macos-use process pass through. Human input cannot race with the agent.

User presses Esc

The tap captures keycode 53 with no modifiers, sets _cancelled, and throwIfCancelled() raises on the next chain boundary. The restore code runs on the cancel path too. InputGuard.swift:289-350.

The whole process hangs

DispatchSource timer at InputGuard.swift:173 auto-disengages the event tap after 30 seconds. The user always gets their keyboard back, even if the Swift process is stuck.
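The keyboard-related cases above boil down to one decision per intercepted event. The sketch below is a pure-Python model of that decision, not the Swift event-tap callback: `tap_decision` and `AGENT_STATE_ID` are hypothetical, and the hardware state ID of 0 is taken from this article rather than checked against the Quartz headers. Keycode 53 is the macOS virtual keycode for Escape.

```python
ESC_KEYCODE = 53        # macOS virtual keycode for Escape
HARDWARE_STATE_ID = 0   # value this article gives for hardware events
AGENT_STATE_ID = 42     # hypothetical ID for events posted by the server


def tap_decision(source_state_id, keycode=None, modifiers=frozenset(), is_key_down=False):
    """Classify one intercepted event as 'pass', 'cancel', or 'drop'."""
    if source_state_id != HARDWARE_STATE_ID:
        # Synthetic events from the automation itself go through untouched.
        return "pass"
    if is_key_down and keycode == ESC_KEYCODE and not modifiers:
        # Bare Escape from the human requests cancellation of the chain.
        return "cancel"
    # Any other human input is swallowed while the tool call runs.
    return "drop"


print(tap_decision(HARDWARE_STATE_ID, ESC_KEYCODE, is_key_down=True))
print(tap_decision(HARDWARE_STATE_ID, ESC_KEYCODE, frozenset({"cmd"}), True))
print(tap_decision(AGENT_STATE_ID, 0, is_key_down=True))
```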

5 events / 1 diff

“Passing both text and pressKey on click_and_traverse triggers the composed path at main.swift:1709-1751. Five ordered OS events, an Esc-cancellation poll between each, one diff written to /tmp/macos-use/<ts>_click_and_traverse.txt. That is what 'plan execution' means in this server.”

main.swift:1709

A Plan, From Agent Turn To Diff On Disk

The whole round-trip in six stages. No planner object is maintained between stages; the diff from step 5 is the only state that survives into the next agent turn.

The agent issues one tool call

No plan primitive. No session handle. One request with optional chaining params. The server figures out the rest.

The server checkpoints OS state

Frontmost app, cursor position (flipped into CGEvent coordinates), AX tree. Input tap engaged with a 30s watchdog.

The server fires N OS events in order

For a click+type+press request, that is 5 ordered events with a 100ms gap between the input actions and a throwIfCancelled poll between each.

The server handles plan-break events

Off-screen: scroll up to 30 steps until the target is visible. App switch: traverse the new frontmost and attach it to the response. User Esc: throw and restore.

The server writes the diff and restores OS state

One .txt with + / - / ~ diff lines, one .png screenshot with a click crosshair. Cursor moved back, previous frontmost reactivated. MCP response returns.

The agent re-plans from the diff

Not from a planner object, not from a graph-state checkpoint. From a flat-text delta. The diff is the plan state.

What Keeps A Plan From Surviving On A Desktop

Six things, orbiting one handler. The handler is the only code that ever sees all of them at once. Each orbit item corresponds to a named code path inside Sources/MCPServer/.

macos-use handler, orbited by: off-screen target, app switch mid-plan, user typing, user Esc, process hang, cursor drift.

Reproducing This, Start To Finish

1. Clone the repo and build: xcrun --toolchain com.apple.dt.toolchain.XcodeDefault swift build.
2. Point an MCP client (Claude Desktop, Cursor, Zed, whatever speaks MCP) at the built binary.
3. Call click_and_traverse with both a text and a pressKey argument. Watch /tmp/macos-use/ for the new .txt + .png pair.
4. Call it again, then press Esc before the chain finishes. The response should come back with isError: true and a message indicating cancellation. Your cursor will be where you left it.
5. Click a button that launches a different app (a mailto: link in a browser works). Check the response for appSwitchTraversal. The new app's full AX tree is in the same response.
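When working through these steps, a small script can pick out the newest output files instead of eyeballing the directory. This is a convenience sketch under assumptions: the `.txt` name pattern `<ts>_<tool>.txt` comes from this article, but the PNG naming is not specified here, so the helper simply returns the most recently modified `.png`. `latest_outputs` is a hypothetical name.

```python
from pathlib import Path


def latest_outputs(directory, tool):
    """Return (newest diff .txt for `tool`, newest .png), by modification time.

    Either element is None when no matching file exists.
    """
    root = Path(directory)
    if not root.is_dir():
        return (None, None)
    diffs = sorted(root.glob(f"*_{tool}.txt"), key=lambda p: p.stat().st_mtime)
    shots = sorted(root.glob("*.png"), key=lambda p: p.stat().st_mtime)
    return (diffs[-1] if diffs else None, shots[-1] if shots else None)


diff_path, png_path = latest_outputs("/tmp/macos-use", "click_and_traverse")
print(diff_path, png_path)
```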

Why This Isn't An Orchestration-Layer Concern

Everything macos-use does inside that one tool call is invisible to the agent framework above. LangGraph, Temporal, mcp-agent, whatever else you are using to decide the next step: they see one MCP request and one MCP response. The planner is not aware that five OS events happened in between, that the cursor was parked at the corner of the screen and restored, that the input tap blocked a keystroke the user aimed at the target window, or that the 24th of 30 scroll steps was the one that revealed the element.

That is the point. Plan execution on a desktop is a pile of concerns that the agent framework should not have to model. The tool either atomically moved the desktop into the requested state, or returned an error that said why. The contract is the same one a SQL driver gives to a database ORM: hand me a statement, I will either commit it or tell you what happened.

You can and should stack this under a real planner. The planner decides "send Alice an email about the Friday meeting." macos-use handles each of the seven or eight tool calls that realize it, and absorbs the thirty or forty OS-layer things that could go sideways per call.


Frequently asked questions

Does macos-use have a 'plan' tool, a 'sequence' tool, or a planner primitive?

No. The server exposes six tools (open, click, type, press_key, scroll, refresh) and nothing else. There is no plan object, no sequence handle, no agent-side state the server persists. The agent is the planner. What the server does offer is optional parameter chaining: click_and_traverse accepts optional text and pressKey params, so click + type + press fires inside one tool call. That is as close as the server gets to a plan primitive, and it is intentional. See Sources/MCPServer/main.swift:1300-1408 for the tool definitions.

What is the minimum number of MCP round-trips to fill a form field and submit it?

One. Calling click_and_traverse with { x, y, text: 'hello world', pressKey: 'Return' } runs traverseBefore → click → type → press Return → traverseAfter inside the server, returns one response with one diff, and writes one /tmp/macos-use/<ts>_click_and_traverse.txt file. A naive agent that issues three separate MCP calls (click, then type, then press) pays for 3x the LLM round-trips, 3x the traversal noise, and 3x the chance of the plan getting interrupted between tool calls. The composed path is main.swift:1709-1751.

If a chained plan is 5 OS events deep inside one MCP call, how does the user kill it?

Press Esc. InputGuard installs a CGEventTap at .cghidEventTap that sees every keystroke while the tool call is running. An Escape keydown with no modifiers sets _cancelled = true at InputGuard.swift:292. The handler polls InputGuard.shared.throwIfCancelled() between every action in the chain: main.swift:1708, 1721, 1728, 1734. After the chain finishes there is a 200ms grace window at main.swift:1757 during which a late Esc still cancels. On cancel the handler throws InputGuardCancelled, runs the cursor + foreground restore code, and returns an isError response. The plan stops at the current boundary, not at the final step.

What does the agent get back as 'plan state' for the next step?

A diff. After every action the server traverses the target app twice (before and after), subtracts them, and returns the delta as a flat-text list prefixed with + added, - removed, and ~ modified. The diff gets written to /tmp/macos-use/<timestamp>_<tool>.txt alongside a PNG screenshot with a red crosshair at the click point. The agent reads that file (grep by role, by text, by coordinates) to choose the next tool call. There is no JSON blob of 'plan progress' the server maintains between calls. The diff is the state. See main.swift:1007-1028 for the diff format and main.swift:1828-1840 for the file write.
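An agent-side reader for that file can be very small. The sketch below groups lines by their leading marker; it assumes each diff line starts with `+`, `-`, or `~` followed by the element description, and the sample lines are invented for illustration rather than copied from real server output.

```python
def parse_diff(text):
    """Group diff lines into added / removed / modified by their first character."""
    groups = {"added": [], "removed": [], "modified": []}
    markers = {"+": "added", "-": "removed", "~": "modified"}
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and stripped[0] in markers:
            groups[markers[stripped[0]]].append(stripped[1:].strip())
    return groups


# Invented sample; the real element formatting may differ.
sample = "\n".join(
    [
        '+ [AXButton] "Send"',
        '- [AXStaticText] "Draft saved"',
        '~ [AXTextField] "To" value changed',
        "a line without a marker is ignored",
    ]
)
result = parse_diff(sample)
print(result["added"], result["removed"], result["modified"])
```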

What happens if the click launches a different app and focus shifts mid-plan?

The server catches it without the agent asking. After the action, main.swift:1788-1808 compares NSWorkspace.shared.frontmostApplication?.processIdentifier against the PID the tool call was addressed to. If the frontmost app changed, the handler traverses the new frontmost app, writes its tree into toolResponse.appSwitchTraversal, and sets appSwitchPid + appSwitchAppName. The agent's next planning step gets two trees in one response: the diff of the original app, plus the full tree of whatever is now in front. No retry loop, no 'app not found' error.
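On the agent side, the practical consequence is choosing which PID the next call should address. A minimal sketch, assuming the decoded tool response is a dictionary carrying the appSwitchPid and appSwitchAppName fields named above (how they are nested in the actual MCP payload is not shown in this article, and `next_target_pid` is a hypothetical helper):

```python
def next_target_pid(tool_response, requested_pid):
    """Pick the PID the next tool call should address.

    Falls back to the originally requested PID when no app switch is reported.
    """
    switched_pid = tool_response.get("appSwitchPid")
    if switched_pid is not None and switched_pid != requested_pid:
        return switched_pid
    return requested_pid


# Made-up response fragments for illustration.
switched = {"appSwitchPid": 9001, "appSwitchAppName": "Mail"}
unchanged = {}
print(next_target_pid(switched, 4321), next_target_pid(unchanged, 4321))
```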

What happens if the target element is off-screen when the agent tries to click it?

scroll_into_view kicks in automatically. main.swift:1159-1285 computes the direction from the click point relative to the window bounds, then scrolls up to 30 steps (main.swift:1189, maxSteps = 30) with line-scaled deltas: 1 line/step if the distance is under 80px, 2 lines if under 250px, 3 lines otherwise. After every scroll step it probes the AX tree by text match (when the target has text) or by watching the viewport edge for newly-revealed elements. If the target appears, it clicks. If it doesn't appear after 30 steps, it logs a warning and clicks at the original coordinates anyway. The agent never learns the element was off-screen; it just gets a diff that shows the click landed.
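The stepping rule can be written down directly. The thresholds (80px, 250px) and the 30-step cap are from the text above; everything else in this sketch is an assumption, including computing the delta once per call, the 25px line height in the simulation, and the `FakeViewport` stand-in for the AX probe.

```python
MAX_STEPS = 30


def lines_per_step(distance_px):
    """Line-scaled delta: 1 line under 80px, 2 under 250px, otherwise 3."""
    if distance_px < 80:
        return 1
    if distance_px < 250:
        return 2
    return 3


def scroll_into_view(distance_px, is_visible, scroll_lines):
    """Scroll up to MAX_STEPS times; return steps used, or None on give-up."""
    lines = lines_per_step(abs(distance_px))
    for step in range(1, MAX_STEPS + 1):
        scroll_lines(lines)
        if is_visible():
            return step
    return None


class FakeViewport:
    """Simulated window: the target is `remaining` pixels outside the view."""

    def __init__(self, remaining, px_per_line=25):
        self.remaining = remaining
        self.px_per_line = px_per_line

    def scroll_lines(self, lines):
        self.remaining -= lines * self.px_per_line

    def is_visible(self):
        return self.remaining <= 0


view = FakeViewport(remaining=300)
steps_used = scroll_into_view(300, view.is_visible, view.scroll_lines)
print(steps_used)
```

A `None` result corresponds to the give-up branch, where the server logs a warning and clicks at the original coordinates anyway.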

Why 30 steps and not unlimited? Isn't a cap a footgun?

Accessibility-driven scrolling is cheap but not free: each step posts a CGEvent scrollWheelEvent2 at main.swift:1196 and waits 100ms before the next tree probe. 30 steps at 2 lines each = 60 scroll lines ≈ 1500px of travel, which covers most practical cases (long tables, sidebar lists, chat scrollback). Beyond that the likely explanation is that the AX coordinate the agent computed is stale, the window scrolled back, or the element is inside an unscrollable container. An unbounded loop would turn a planning error into a silent hang. The cap turns it into a 3-second log line.
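The numbers in this answer follow from simple multiplication. In the worked version below, the 25px line height is not stated in the article; it is the value implied by 60 lines covering about 1500px, and real line heights vary by app.

```python
MAX_STEPS = 30
LINES_PER_STEP = 2       # the middle tier of the 1 / 2 / 3 line deltas
PX_PER_LINE = 25         # assumption implied by 60 lines covering ~1500px
WAIT_PER_STEP_MS = 100   # pause before each tree probe

total_lines = MAX_STEPS * LINES_PER_STEP      # scroll lines before giving up
travel_px = total_lines * PX_PER_LINE         # approximate pixels travelled
worst_case_ms = MAX_STEPS * WAIT_PER_STEP_MS  # time spent waiting at the cap

print(total_lines, travel_px, worst_case_ms)
```

The worst-case wait of 3000ms is where the "3-second log line" figure comes from.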

How is this different from mcp-agent or Agent-MCP or Cloudflare's Code Mode MCP?

Those operate above MCP: they describe how a planner LLM breaks a user goal into sub-tasks, dispatches them to worker agents, and synthesizes results. Plan execution in that world is an orchestration-layer problem solved with Temporal-style durable workflows, code-compilation of plans, or multi-agent message passing. macos-use operates below MCP: it is one of the tools those planners call. Its job is to make a single call on a real desktop atomic, cancellable, and self-repairing, so the outer orchestration does not have to retry for reasons like 'the user was typing' or 'the window scrolled.' You can and should run both. See the SERP results at the top of the page for links.

Can I chain more than click + type + press in one call?

Not yet. The composed path at main.swift:1710-1751 special-cases a primary action followed by a series of input actions (type or press). You can issue multiple presses in the pressKey param (the exact parsing is at main.swift:1349-1383 for type_and_traverse and 1384-1408 for press_key_and_traverse). If you need a richer sequence, issue it as separate MCP calls; the diff-as-state model means each call self-plans off the previous result.

Is the InputGuard overlay safe during a plan that takes longer than 30 seconds?

The guard auto-releases the event tap after 30 seconds regardless of whether the tool call returned (InputGuard.swift:24 sets watchdogTimeout = 30). If your chained plan is still running when the watchdog fires, the input tap drops, the user regains full keyboard/mouse control, but the agent's OS events are still posted (they come from inside the macos-use process and do not need the tap). The tradeoff is safety over automation duration: a crashed Swift process holding an engaged tap would lock the Mac, so the ceiling is non-negotiable. If you need longer plans, split them into multiple tool calls and let the agent re-engage the guard per call.
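The watchdog idea is independent of Swift and can be modelled with a timer thread. This is an illustrative Python sketch, not the InputGuard implementation: `InputTapGuard` and its methods are hypothetical, and `threading.Timer` stands in for the DispatchSource timer the article describes.

```python
import threading


class InputTapGuard:
    """Toy guard that releases itself after `timeout_s`, even if nobody calls disengage."""

    def __init__(self, timeout_s):
        self.engaged = False
        self._timeout_s = timeout_s
        self._timer = None
        self._lock = threading.Lock()

    def engage(self):
        with self._lock:
            self.engaged = True
            self._timer = threading.Timer(self._timeout_s, self._release)
            self._timer.daemon = True
            self._timer.start()

    def disengage(self):
        """Normal path: the tool call finished, so cancel the watchdog and release."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
        self._release()

    def wait_for_watchdog(self):
        """Block until the watchdog thread has finished (fired or cancelled)."""
        if self._timer is not None:
            self._timer.join()

    def _release(self):
        with self._lock:
            self.engaged = False


# A short timeout keeps the demo fast; the article's value is 30 seconds.
guard = InputTapGuard(timeout_s=0.2)
guard.engage()
engaged_at_start = guard.engaged
guard.wait_for_watchdog()   # simulate a tool call that never returns
print(engaged_at_start, guard.engaged)
```

The point of the design is that release does not depend on the tool call returning: the timer fires on its own thread regardless of what the caller is doing.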