Per-action OS transactions · Three snapshots, two restored · Esc cancels, 30s watchdog

AI Agent UI State Checkpointing: The Three Snapshots macOS-Use Takes Around Every Click

Most articles on AI agent state checkpointing point at LangGraph, AG-UI, and conversation memory. mcp-server-macos-use checkpoints something the agent layer never touches: the operating system. Before every click, type, press or scroll tool call, the server takes three snapshots: the frontmost app, the cursor position, and the AX tree. Two of them are restored when the call returns. The third is differenced and written to disk as the artifact the agent reads next. There is also a 30-second watchdog so a hung tool call cannot lock the machine.

Matthew Diakonov
11 min read
Three snapshots per disruptive tool call: frontmost app, cursor, AX tree (main.swift:1669-1697)
Two restored on exit: cursor at main.swift:1767-1772, frontmost app at main.swift:1775-1781
30-second InputGuard watchdog at InputGuard.swift:24 prevents lockout if the tool call hangs

What Other "Agent Checkpointing" Articles Are Talking About, And Why It Is Not This

If you search for AI agent UI state checkpointing in 2026, the top results are all variations on the same theme. LangGraph checkpointers persist AgentState across nodes so a graph can resume after a crash. AG-UI defines a protocol for streaming agent state into a web frontend. Articles from Fast.io and the eunomia blog catalog file-based, database-backed, and in-memory storage backends. DeepLearning.AI forum threads explain how to thread a checkpointer through the LangGraph compile step. All useful. None of it tells you what should happen to the human's desktop when an agent clicks "Send" in a real running app on a real machine.

That is the gap. macos-use treats every disruptive tool call as a transaction at the OS layer, separate from anything the agent framework above it does. The agent can be in LangGraph, in the Claude Agent SDK, in a hand-rolled loop. The MCP server below it still snapshots the same three pieces of OS state, restores the same two pieces, and writes the same diff. Stack the layers; do not conflate them.

The rest of this page is the specific implementation. Line numbers are checkable in Sources/MCPServer/main.swift and Sources/MCPServer/InputGuard.swift in the repo.

The Three Snapshots, By The Numbers

3 snapshots taken: frontmost app, cursor, AX tree
2 restored on exit: cursor + frontmost app
30s lockout watchdog: InputGuard.swift:24

See main.swift:1669-1781 for the snapshot/restore block.

The Loop, In Five Stages

Every disruptive tool call walks the same path. Refresh-traversal is the only call that skips the guard and restore (it does not modify state).

  1. Snapshot: Frontmost app, cursor (flipped), AX tree before action.

  2. Guard: Block human input, arm 30s watchdog, accept Esc to cancel.

  3. Act: Post the CGEvent. Optionally chain type and press in one call.

  4. Diff: Traverse again, subtract, drop scroll-bar and coord-only noise.

  5. Restore: Move cursor back, reactivate prev frontmost, write .txt + .png.
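
The five stages can be sketched as one wrapper function. This is a minimal, platform-neutral illustration and not the code in main.swift: `DesktopState`, `Desktop`, `Checkpoint` and `runDisruptive` are hypothetical names, the AX tree is stood in for by a set of strings, and the guard stage is left out.

```swift
// Hypothetical stand-ins for OS state; the real server reads these from AppKit and the AX API.
struct DesktopState: Equatable {
    var frontmostApp: String
    var cursorX: Double
    var cursorY: Double
}

struct Desktop {
    var state: DesktopState
    var tree: Set<String>   // stand-in for the AX tree: one string per element
}

struct Checkpoint {
    var added: Set<String>
    var removed: Set<String>
}

/// Wraps one disruptive action in snapshot, act, diff, restore.
func runDisruptive(on desktop: inout Desktop,
                   action: (inout Desktop) -> Void) -> Checkpoint {
    // 1. Snapshot the two restorable pieces and the tree.
    let saved = desktop.state
    let before = desktop.tree

    // 2. Guard: omitted in this sketch.

    // 3. Act: the action may move the cursor, steal focus, and change the tree.
    action(&desktop)

    // 4. Diff: compare the tree after the action with the tree before it.
    let after = desktop.tree
    let checkpoint = Checkpoint(added: after.subtracting(before),
                                removed: before.subtracting(after))

    // 5. Restore: cursor and frontmost app go back; the tree change stays.
    desktop.state = saved
    return checkpoint
}
```

The point of the shape is that the caller gets the diff back while the cursor and focus fields end up exactly where they started.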

The Snapshot Block, Verbatim

Lines 1669 through 1697 of main.swift. This is the "before" half of the transaction. The screen-flip on the cursor is the part most macOS automation forgets and pays for later on multi-monitor setups.

Sources/MCPServer/main.swift

The Restore Block, Verbatim

Lines 1766 through 1781. Cursor first, then foreground app, then return. Both halves run on success and on cancellation; the catch block at main.swift:1847 invokes the same restore code path so pressing Esc still leaves the user's workspace intact.

Sources/MCPServer/main.swift

The Lockout Watchdog That Stops Hung Tool Calls From Locking The Mac

CGEventTap is powerful: when engaged, it can suppress every keystroke and mouse click on the machine. If a tool call hangs while the tap is engaged, the user is locked out. The watchdog at InputGuard.swift:24 is the safety net.

Sources/MCPServer/InputGuard.swift

What The Checkpoint Pipeline Looks Like End To End

Inputs on the left are what the server reads from the OS before the action. Outputs on the right are what gets written back, both to the OS (cursor, foreground app) and to disk (the diff).

Three reads in. Two writes back to the OS. One diff to disk.

Inputs: NSWorkspace.frontmostApplication; NSEvent.mouseLocation (flipped); AX tree (traverseBefore); InputGuard.engage()
Hub: macos-use handler (main.swift:1474)
Outputs: Cursor restore (CGEvent .mouseMoved); App reactivate (prevApp.activate); Diff written to <ts>_<tool>.txt; Watchdog releases input tap

One Tool Call, On The Wire

The MCP client sees a single request and a single response. Everything in between is private to the server. This is the same pattern any database transaction takes; the difference is that the "rows" here are accessibility elements and the "commit log" is a flat .txt file.

click_and_traverse, including snapshot + restore

Participants: MCP Client, Handler, InputGuard, AX Snapshot, CGEvent, /tmp/macos-use

  1. click_and_traverse {pid, element}
  2. save frontmost + cursor (1671-1676)
  3. engage (Esc=cancel, 30s watchdog)
  4. traverseBefore (showDiff=true)
  5. tree snapshot #1
  6. post click event
  7. event delivered
  8. traverseAfter
  9. tree snapshot #2
  10. subtract → diff
  11. disengage (or Esc → throw)
  12. restore cursor mouseMoved
  13. reactivate prev frontmost
  14. write <ts>_<tool>.txt + .png
  15. summary + filepath

What The Persisted Checkpoint Looks Like On Disk

One Messages "Send" click, post-filter. Six diff lines. The cursor and foreground app have already been restored by the time you can read these files.

/tmp/macos-use/

The cursor save flips NSEvent.mouseLocation by primaryScreen.frame.height - nsPos.y so the saved CGPoint can be passed directly to CGEvent(mouseEventSource:mouseType:.mouseMoved, mouseCursorPosition:) on restore. Skip that flip and the cursor restores to the wrong y on multi-monitor setups.

main.swift:1672-1676

How This Is Different From LangGraph And AG-UI Checkpointing

Both are useful. They sit at different layers and address different failure modes. You can run all three together (a LangGraph thread that calls the macos-use MCP from inside an AG-UI frontend) without overlap.

| Feature | LangGraph / AG-UI | macos-use checkpoints |
| --- | --- | --- |
| What gets persisted | AgentState dict, message history, next node pointer | OS-side: frontmost app PID, cursor CGPoint, AX tree diff |
| When it fires | After each LangGraph node, on a configured channel | Around every disruptive MCP tool call (click, type, press, scroll) |
| Storage backend | SQLite, Postgres, in-memory checkpointer | Flat .txt + .png pair in /tmp/macos-use/, ms-precision timestamps |
| Restore semantic | Resume the agent's reasoning loop from a saved node | Put the human's desktop back: cursor moveTo, app reactivate |
| Cancellation | Interrupt API, requires a checkpointed thread_id | Esc keydown, intercepted by CGEventTap, throws InputGuardCancelled |
| Lockout protection | Not applicable | 30s watchdog at InputGuard.swift:24 auto-releases the input tap |

Each Stage In One Sentence

  1. Snapshot foreground app: savedFrontmostApp captured at main.swift:1671. Used by the restore branch to put focus back where the human left it.

  2. Snapshot cursor (flipped): main.swift:1672-1676 saves the cursor in CGEvent coordinates (top-left origin), not AppKit coordinates (bottom-left). Skip the flip and restore lands off-screen on multi-monitor.

  3. Engage input guard: CGEventTap blocks human input. Esc keydown sets _cancelled. 30s watchdog at InputGuard.swift:24 is the hard ceiling.

  4. Traverse before, act, traverse after: showDiff = true at main.swift:1600 enables the implicit AX tree snapshot. Two traversals bracket the input event.

  5. Restore cursor + foreground: CGEvent .mouseMoved at main.swift:1767-1772, then prevApp.activate at main.swift:1775-1781. The diff is already on disk.

Why This Pattern Only Shows Up In macOS-Specific Agent Stacks

Browser agents do not need it because the browser is already a sandbox; closing the tab restores the world. Web-only agents do not need it because they never own focus. Container-based agents do not need it because there is no human at the keyboard whose state to preserve. Native desktop agents that run on the user's actual machine, against the user's actual apps, while the user is still at the keyboard, are the only case where checkpoint-and-restore at the OS layer matters. macos-use is the macOS half of that pattern; Terminator is the Windows half, where the equivalent Windows APIs (GetForegroundWindow, GetCursorPos and SetCursorPos, the UI Automation tree) can encode the same contract.

If you are building an agent loop where the human watches the screen while the agent works, you want this. If you are building a headless batch runner, you do not.


Frequently asked questions

What is 'AI agent UI state checkpointing' as mcp-server-macos-use defines it?

It is not graph-state checkpointing in the LangGraph sense. It is a per-tool-call transaction at the OS layer. Before every disruptive action (click, type, press, scroll, open) the server captures three snapshots: NSWorkspace.shared.frontmostApplication at main.swift:1671, NSEvent.mouseLocation flipped into top-left CGEvent coordinates at main.swift:1672-1676, and the full AX tree of the target app via showDiff = true at main.swift:1600/1617/1633/1651. The action runs. Two of those snapshots are then restored: the cursor via a CGEvent mouseMoved at main.swift:1767-1772, and the previous frontmost app via prevApp.activate at main.swift:1775-1781. The third (the AX tree) is differenced against a fresh post-action traversal and written to /tmp/macos-use/ as the artifact the agent reads.

Why restore the cursor and the frontmost app at all? The agent does not care.

The agent does not. The human does. macOS automation that fights the user for foreground focus or strands the cursor in the wrong app produces a workflow nobody can supervise. The contract this server encodes is: when the tool call returns, the human's interrupted state is back where they left it. That is what makes the loop usable as something the user runs in the foreground while still doing their own work, instead of a batch script they kick off and walk away from.

What protects the human if a tool call hangs while the input guard is engaged?

InputGuard.swift sets watchdogTimeout = 30 seconds at line 24. The engage() path schedules a one-shot timer at InputGuard.swift:174 that fires regardless of whether the action ever returns. When the timer fires it logs 'watchdog fired after 30s — auto-disengaging' and tears down the event tap. Without that, a hung handler holding an active CGEventTap could lock keyboard and mouse for the user indefinitely. The 30-second ceiling is the lockout safety net.
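
A one-shot watchdog of this kind can be sketched with Dispatch. This is an illustration under assumptions, not the InputGuard.swift implementation: the `Watchdog` class, its `arm`/`disarm` methods and the use of `DispatchWorkItem` are choices made for the sketch, and the teardown of the event tap is represented by a plain closure.

```swift
import Foundation

/// One-shot timer: if `disarm()` is not called within `timeout`, `onFire` runs once.
final class Watchdog {
    private let timeout: TimeInterval
    private let onFire: () -> Void
    private let queue = DispatchQueue(label: "watchdog.sketch")
    private var pending: DispatchWorkItem?

    init(timeout: TimeInterval, onFire: @escaping () -> Void) {
        self.timeout = timeout
        self.onFire = onFire
    }

    /// Call when input blocking starts. Re-arming cancels any earlier timer.
    func arm() {
        disarm()
        let item = DispatchWorkItem { [onFire] in onFire() }
        pending = item
        // The timer runs on its own queue, so it fires even if the caller is stuck.
        queue.asyncAfter(deadline: .now() + timeout, execute: item)
    }

    /// Call on the normal exit path so the watchdog never fires.
    func disarm() {
        pending?.cancel()
        pending = nil
    }
}
```

In a real guard, `onFire` would log the timeout and release the tap; with `timeout: 30` this mirrors the ceiling described above. Running the timer on a separate queue is what lets it fire while the tool-call thread is blocked.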

How does Esc cancel an in-flight tool call?

InputGuard installs a CGEventTap that intercepts every keyboard and mouse event during automation. When it sees an Escape keydown with no modifiers, it sets _cancelled = true at InputGuard.swift:292 and invokes onUserCancelled. The MCP handler checks InputGuard.shared.throwIfCancelled() between every primary and additional action (main.swift:1708, 1721, 1728, 1734) and again after a 200ms grace period at main.swift:1757-1763. If cancelled, it throws InputGuardCancelled, which the catch block at main.swift:1847 traps. Inside that catch block: disengage the guard, restore the cursor, reactivate the previous frontmost app, return an isError response. The cancellation path runs the same restore code as the success path.
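
The check-between-steps pattern, with one restore path shared by success and cancellation, might look like the sketch below. `CancelFlag` and `runGuarded` are hypothetical names, `defer` is a swapped-in way to get "same restore on both paths" (the source describes a catch block doing this), and the 200ms grace period is not modeled.

```swift
import Foundation

struct InputGuardCancelled: Error {}

/// Thread-safe flag: the event-tap side calls cancel(), the handler side polls it.
final class CancelFlag {
    private let lock = NSLock()
    private var cancelled = false

    func cancel() {
        lock.lock()
        cancelled = true
        lock.unlock()
    }

    func throwIfCancelled() throws {
        lock.lock()
        defer { lock.unlock() }
        if cancelled { throw InputGuardCancelled() }
    }
}

/// Checks for cancellation before every step and once after the last one.
/// `restore` runs exactly once, whether the steps finish or a check throws.
func runGuarded(steps: [() -> Void], flag: CancelFlag, restore: () -> Void) throws {
    defer { restore() }
    for step in steps {
        try flag.throwIfCancelled()
        step()
    }
    try flag.throwIfCancelled()
}
```

A step that is already running is not interrupted; cancellation takes effect at the next check, which is why the checks sit between the primary and additional actions.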

Why does NSEvent.mouseLocation get flipped before being saved?

AppKit gives mouse coordinates with origin at the bottom-left of the primary screen. CGEvent posts mouse events with origin at the top-left. main.swift:1673-1675 flips by computing primaryScreen.frame.height - nsPos.y so the saved CGPoint can be passed directly back to CGEvent(mouseEventSource:mouseType:.mouseMoved, mouseCursorPosition:) on restore. Skip that flip and the cursor restores to a vertically mirrored y; on multi-monitor setups it can land off-screen entirely.
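
The flip itself is a single subtraction. The sketch below uses a hypothetical `ScreenPoint` struct in place of CGPoint so it stays platform-neutral; in the server the inputs would come from NSEvent.mouseLocation and the primary screen's frame.

```swift
struct ScreenPoint: Equatable {
    var x: Double
    var y: Double
}

/// Converts between AppKit coordinates (origin bottom-left of the primary screen)
/// and CGEvent coordinates (origin top-left of the primary screen).
/// The same function converts in both directions.
func flipToCGEvent(_ p: ScreenPoint, primaryScreenHeight: Double) -> ScreenPoint {
    ScreenPoint(x: p.x, y: primaryScreenHeight - p.y)
}

// A cursor 80 points above the bottom edge of a 1080-point-tall primary screen
// sits 1000 points below the top edge.
let saved = flipToCGEvent(ScreenPoint(x: 100, y: 80), primaryScreenHeight: 1080)
// saved == ScreenPoint(x: 100, y: 1000)
```

Using the primary screen's height, rather than the height of whichever screen the cursor is on, is what keeps the result valid when the cursor sits on a secondary display: a cursor on a monitor stacked above the primary has an AppKit y larger than the primary height and therefore a negative CGEvent y.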

Where is the diff between checkpoints actually written?

Every tool call writes two files to /tmp/macos-use/ named with millisecond-precision timestamps: <ts>_<tool>.txt and <ts>_<tool>.png. The .txt holds the flat-text diff produced at main.swift:1007-1028 with prefixes + (added), - (removed), and ~ (modified, with attribute transitions). The .png is captured at main.swift:1832-1840 with a red crosshair drawn at the click point if applicable. Both files share the same timestamp so an agent can correlate the visual receipt with the symbolic diff.
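
A flat-text diff with those three prefixes could be produced roughly as follows. This is a sketch under assumptions: elements are keyed by an invented string id, attributes are plain string pairs, and the `old -> new` rendering of attribute transitions is a guess at a reasonable format, not the server's actual output.

```swift
typealias Attributes = [String: String]

/// Renders added (+), removed (-) and modified (~) elements, one line each,
/// sorted by element id so the output is deterministic.
func renderDiff(before: [String: Attributes], after: [String: Attributes]) -> [String] {
    var lines: [String] = []
    for id in Set(before.keys).union(after.keys).sorted() {
        switch (before[id], after[id]) {
        case (nil, _?):
            lines.append("+ \(id)")
        case (_?, nil):
            lines.append("- \(id)")
        case let (old?, new?) where old != new:
            // List only the attributes whose value changed.
            let transitions = Set(old.keys).union(new.keys).sorted()
                .filter { old[$0] != new[$0] }
                .map { "\($0): \(old[$0] ?? "nil") -> \(new[$0] ?? "nil")" }
            lines.append("~ \(id) (\(transitions.joined(separator: ", ")))")
        default:
            break   // unchanged elements produce no line
        }
    }
    return lines
}
```

Unchanged elements are dropped entirely, which is what keeps the file short enough for an agent to read after every call.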

What happens if the action accidentally hands focus to a different app, like clicking an email link that launches Mail?

The handler detects this at main.swift:1788-1808. After the action it compares the current NSWorkspace.shared.frontmostApplication processIdentifier against the original PID the tool was called with. If they differ, it traverses the new frontmost app, populates appSwitchPid, appSwitchAppName, and appSwitchTraversal on the response, and appends an 'app_switch:' header to the .txt file. The agent gets one tool call, one .txt, but two traversals when focus escapes. This is also the only case where the frontmost-app restore is intentionally relaxed: if the previous frontmost was the launching app and the launched app is now what the user expects to see, restoring would be wrong. The restore at main.swift:1775-1781 only fires when isDisruptive and the previous app is still alive.

Does this overlap with what LangGraph or AG-UI mean by checkpointing?

No, it is orthogonal. LangGraph checkpointers persist agent graph state (the AgentState dict, message history, the next-node pointer) into a SQLite or Postgres backend so the workflow can resume after a process exit. AG-UI synchronizes UI state between a running agent and a web frontend. Neither addresses what mcp-server-macos-use addresses: the OS-level state of a real human's desktop while an agent is poking at it. You can run macos-use under a LangGraph agent and stack both layers — the LangGraph checkpointer covers conversation resume, the macOS checkpoint-restore covers per-action atomicity on the user's desktop.

How does the diff payload actually fit into the 'checkpoint' framing?

Think of each tool call as a database transaction. traverseBefore captures the read snapshot. The action is the write. traverseAfter captures the post-write read. The diff at main.swift:612-718 is the changed-rows result set. Filters at main.swift:591-607 strip non-actionable noise before persisting. The /tmp/macos-use/<ts>_<tool>.txt file is the commit log. If you ran the agent for a thousand actions, /tmp/macos-use/ would hold a thousand .txt + .png pairs, one per transaction, replayable from disk for post-mortem.
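
The noise-filtering step might be sketched as a predicate over diff entries. The `Change` struct, the attribute names and the exact rules are assumptions for illustration; the source only says that scroll-bar and coordinate-only changes are dropped. `AXScrollBar` is the standard accessibility role name for scroll bars.

```swift
/// One diff entry: the element's role and the names of the attributes that changed.
/// An empty set stands for a pure add or remove.
struct Change: Equatable {
    var role: String
    var changedAttributes: Set<String>
}

// Hypothetical attribute names for position and size.
let geometryKeys: Set<String> = ["x", "y", "width", "height"]

/// True for entries an agent cannot act on: scroll bars, and elements
/// whose only change is where they sit on screen.
func isNoise(_ change: Change) -> Bool {
    if change.role == "AXScrollBar" { return true }
    return !change.changedAttributes.isEmpty
        && change.changedAttributes.isSubset(of: geometryKeys)
}
```

Applying `changes.filter { !isNoise($0) }` before writing the file keeps a layout shift from burying the one line that matters, such as a button becoming disabled.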

Does the InputGuard overlay block the agent's own input events?

No. The CGEventTap installed by InputGuard at the .cghidEventTap layer filters events by source. Agent input is posted with CGEvent's post(tap: .cghidEventTap) inside the same process; those events are emitted from the macos-use binary and are not blocked. Human input from the physical keyboard and mouse is blocked because it originates from outside the process. The Esc keydown is the one human keystroke the tap acts on: it sets _cancelled and then drops the event. Other keystrokes are swallowed silently.
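
The decision the tap makes for each event can be written as a pure function. This is a sketch, not the InputGuard callback: `InputEvent` and `Verdict` are hypothetical types, and identifying the source by process ID is an assumption about how "filters by source" is implemented. Key code 53 is the macOS virtual key code for Escape.

```swift
enum Verdict {
    case pass     // let the event through (the agent's own input)
    case drop     // swallow the event (human input during automation)
    case cancel   // swallow the event and flag the tool call as cancelled
}

/// Hypothetical, simplified view of what the tap callback sees.
struct InputEvent {
    var sourcePID: Int32
    var isKeyDown: Bool
    var keyCode: UInt16
    var hasModifiers: Bool
}

let escapeKeyCode: UInt16 = 53

func verdict(for event: InputEvent, ownPID: Int32) -> Verdict {
    // Events posted by the server itself are the automation; never block them.
    if event.sourcePID == ownPID { return .pass }
    // A bare Escape keydown from the human is the cancel signal.
    if event.isKeyDown && event.keyCode == escapeKeyCode && !event.hasModifiers {
        return .cancel
    }
    // Everything else from the human is swallowed.
    return .drop
}
```

Keeping the decision separate from the tap plumbing makes the rule easy to test without installing a real event tap.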

Why is this checkpoint-and-restore design specifically a macOS thing? Could a Windows agent do the same?

The pattern generalizes; the APIs do not. macOS uses NSWorkspace + NSEvent + CGEvent and exposes the foreground app via processIdentifier. The Windows equivalent (the Terminator project, also MCP-speaking) uses GetForegroundWindow + UI Automation and would need GetCursorPos / SetCursorPos instead of CGEvent mouseMoved, and would diff against UIA tree snapshots instead of AX tree snapshots. The contract — three snapshots, two restored, one diffed and persisted — ports cleanly. The implementation is per-OS.

What is the smallest reproducible test of all three snapshots firing?

Build the binary with xcrun --toolchain com.apple.dt.toolchain.XcodeDefault swift build, point an MCP client at it, and call click_and_traverse on any Mac app while you have a different app frontmost. Watch /tmp/macos-use/ for the new <ts>_click_and_traverse.txt + .png pair. Move your cursor to the corner of the screen before triggering the call; when the call returns, the cursor lands back at that corner and the originally-frontmost app is foreground again. The .txt holds the AX diff. That is all three checkpoints visible in one round trip.
