What Is An MCP Server? The Part The Spec Skips: Sharing Your Keyboard With The Model

Every explainer tells you the same thing: an MCP server is a JSON-RPC 2.0 process that exposes tools to an AI client, the "USB-C for AI". That is correct. For any MCP server that actually runs on your machine and moves the cursor, it is also only about half the real answer. The other half is what the spec doesn't define: how the server shares your keyboard and mouse with you while it runs. In macos-use that answer is 355 lines of Swift in Sources/MCPServer/InputGuard.swift: a system-wide CGEventTap installed at the HID level, a 30-second watchdog, and plain Esc (keycode 53) as a hard kill-switch.

Matthew Diakonov, Written with AI

Published April 18, 2026 · 9 min read

6 tools advertised over JSON-RPC 2.0 via stdio transport

11 hardware event types blocked by a HID-level CGEventTap

30-second watchdog so the keyboard is never unrecoverably locked

A local MCP server has to share your keyboard with you

What the protocol spec never defines

The spec: JSON-RPC 2.0, tools, resources, prompts, transport

Reality on a local server: the cursor and keys are shared hardware

macos-use installs a CGEventTap before each tool call runs

Plain Esc (keycode 53) is the universal kill-switch, anywhere

A 30-second watchdog guarantees the tap always releases

The One-Line Answer, And Why It Is Not Enough

An MCP server is a long-running process that speaks JSON-RPC 2.0 to an AI client (Claude Desktop, Cursor, VS Code, Cline) and advertises a list of typed tools the model is allowed to call. When the client invokes a tool, the server executes it and returns a text result. The transport is usually stdio for local servers and HTTP for remote ones. That is the textbook answer and every article on the first page of Google will give you some version of it.

The textbook answer is complete for remote servers. If the server runs on someone else's box and its tool is "run a SQL query", the only resource it shares with you is network bandwidth. Nothing to mediate.

The textbook answer is half the story for local servers. When the MCP server runs on your own laptop and its tool is "click at (412, 598) in Safari and type a password", the cursor and the keyboard are shared hardware. You and the server are both trying to drive them. That coordination problem is the part the spec is silent about, and it is where a real local MCP server earns its keep.

What A Local MCP Server Actually Has To Do

Zoom in on a single tool call. The JSON-RPC layer is the dashed line around the whole diagram. Everything inside is platform-level work the spec never mentions: process state, OS permissions, input mediation, cancel semantics, cleanup guarantees. A local MCP server that skips any one of these becomes a liability the first time the model loops.

[Diagram: one MCP server tool call, surrounded by the OS-level work it triggers: CGEventTap install, 30s watchdog arm, Esc kill-switch, overlay paint, cursor save/restore, frontmost app save, tap re-enable on timeout, throwIfCancelled checks.]

What Actually Travels On The Wire Between Client And Server

The spec covers only what this diagram shows: four JSON-RPC messages (one handshake, one tool listing, one tool invocation, one result). What macos-use does between receiving the tool call and returning the result is the OS-level contract covered in the next section.

MCP over stdio — the client-server exchange
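As a rough illustration of those four messages, the sketch below builds each one as a single line of JSON, the framing a stdio server reads. The method names initialize, tools/list, and tools/call are the wire names from the MCP spec, and click_and_traverse with pid, x, y comes from this article. The protocol version string, the client name, the pid, and the result text are placeholders, and the helper names are hypothetical.

```swift
import Foundation

/// Serializes one JSON-RPC 2.0 message as a single line of JSON.
func encodeLine(_ message: [String: Any]) -> String {
    // sortedKeys keeps the key order stable between runs.
    let data = try! JSONSerialization.data(withJSONObject: message, options: [.sortedKeys])
    return String(decoding: data, as: UTF8.self)
}

/// Builds a request: a message that carries an id and expects a response.
func request(id: Int, method: String, params: [String: Any]) -> String {
    encodeLine(["jsonrpc": "2.0", "id": id, "method": method, "params": params])
}

// 1. Handshake (version string and client name are placeholders).
let handshake = request(id: 1, method: "initialize", params: [
    "protocolVersion": "2024-11-05",
    "capabilities": [String: Any](),
    "clientInfo": ["name": "example-client", "version": "0.0.0"],
])

// 2. Tool listing.
let listing = request(id: 2, method: "tools/list", params: [:])

// 3. Tool invocation (the pid is a placeholder).
let invocation = request(id: 3, method: "tools/call", params: [
    "name": "click_and_traverse",
    "arguments": ["pid": 1234, "x": 412, "y": 598],
])

// 4. Result: the server answers with the same id and a text content item.
let result = encodeLine([
    "jsonrpc": "2.0",
    "id": 3,
    "result": ["content": [["type": "text", "text": "placeholder summary"]]],
])

for line in [handshake, listing, invocation, result] {
    print(line)
}
```

Everything the guard does happens between message 3 arriving and message 4 being written.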

The Numbers That Anchor The Guard

Each value below is a concrete constant in Sources/MCPServer/InputGuard.swift. Clone the repo and grep the file. None of these are tunable at runtime; they are tuned values baked into the server.

30s watchdog timeout at InputGuard.swift:24

53 keycode for Esc (line 345)

11 hardware event types blocked (lines 115-126)

355 total lines in InputGuard.swift

6 tools advertised over JSON-RPC (main.swift:1408)

5 of those 6 engage the InputGuard (main.swift:1667)

0 runtime flags to disable the guard; fork to change it

Anchor code 1 of 2

Installing The Tap: The Block That Makes The Keyboard Yours Until The Tool Returns

Right before a disruptive tool (click, type, press, scroll, open) executes, the server calls InputGuard.shared.engage(), which runs the block below. It creates a .cghidEventTap placed at the head of the dispatch chain, with a mask for eleven event types, and installs the run-loop source on the main run loop. From this point until disengage(), every hardware keyboard and mouse event on the machine passes through a single callback.

Sources/MCPServer/InputGuard.swift:113-155
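The repository's own block is not reproduced here. What follows is a minimal sketch of the same technique, assuming a default (non-listen-only) tap inserted at the head of the chain. The helper names, the raw event-type table, and the pass-through callback are all hypothetical. The callback deliberately lets every event through so that running the sketch cannot lock anyone out.

```swift
import Foundation
#if canImport(CoreGraphics)
import CoreGraphics
#endif

// Raw CGEventType values for the eleven event types named in this article.
let blockedEventTypes: [UInt32] = [
    10, 11,      // keyDown, keyUp
    1, 2,        // leftMouseDown, leftMouseUp
    3, 4,        // rightMouseDown, rightMouseUp
    5, 6, 7,     // mouseMoved, leftMouseDragged, rightMouseDragged
    22,          // scrollWheel
    12,          // flagsChanged
]

/// An event mask is a bit set: bit N is on when event type N should reach the tap.
func eventMask(for rawTypes: [UInt32]) -> UInt64 {
    rawTypes.reduce(UInt64(0)) { mask, raw in mask | (UInt64(1) << UInt64(raw)) }
}

#if canImport(CoreGraphics)
/// Pass-through callback. Returning nil instead is what swallows an event.
let passThroughCallback: CGEventTapCallBack = { _, _, event, _ in
    Unmanaged.passUnretained(event)
}

/// Hypothetical helper, not the repository's engage().
/// Returns nil if the tap cannot be created (for example, missing Accessibility permission).
func installTap() -> CFMachPort? {
    guard let tap = CGEvent.tapCreate(
        tap: .cghidEventTap,            // where HID events enter the system
        place: .headInsertEventTap,     // ahead of any other tap
        options: .defaultTap,           // an active tap that may modify or drop events
        eventsOfInterest: eventMask(for: blockedEventTypes),
        callback: passThroughCallback,
        userInfo: nil
    ) else { return nil }

    // The tap only delivers events once its source is on a running run loop.
    let source = CFMachPortCreateRunLoopSource(kCFAllocatorDefault, tap, 0)
    CFRunLoopAddSource(CFRunLoopGetMain(), source, .commonModes)
    CGEvent.tapEnable(tap: tap, enable: true)
    return tap
}
#endif

print(String(eventMask(for: blockedEventTypes), radix: 2))
```

Keeping the mask computation separate from the install call makes the set of blocked event types easy to audit on its own.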

Anchor code 2 of 2

The Callback: Keycode 53 Is The Only Key That Means Anything

The free function below runs on the main run loop for every hardware event the tap captures. Three things happen: events posted by the server itself (non-zero sourceStateID) pass through, plain Esc (keycode 53, no modifiers) writes /tmp/macos-use/esc_pressed.txt and cancels the automation, and everything else is returned as nil so the frontmost app never sees it.

Sources/MCPServer/InputGuard.swift:311-355
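The three-way decision described above can be sketched as a pure function with a thin CoreGraphics wrapper around it. This is an illustration, not the repository's callback: TapDecision, decide, and sketchCallback are hypothetical names, and the treatment of sourceStateID follows this article's description rather than anything verified here.

```swift
import Foundation
#if canImport(CoreGraphics)
import CoreGraphics
#endif

enum TapDecision: Equatable {
    case passThrough        // hand the event on unchanged
    case cancelAndSuppress  // plain Esc: cancel the automation, hide the key
    case suppress           // swallow the event
}

/// Decides what to do with one event, using only plain values so it can be tested.
func decide(isKeyDown: Bool, keyCode: Int64, hasModifiers: Bool, sourceStateID: Int64) -> TapDecision {
    // Per the article, a non-zero source state marks events the server posted itself.
    if sourceStateID != 0 { return .passThrough }
    // Keycode 53 is Esc; it only counts with no Cmd, Ctrl, Option, or Shift held.
    if isKeyDown && keyCode == 53 && !hasModifiers { return .cancelAndSuppress }
    return .suppress
}

#if canImport(CoreGraphics)
/// Wrapper that reads the fields from a real CGEvent and applies the decision.
let sketchCallback: CGEventTapCallBack = { _, type, event, _ in
    let modifierMask: CGEventFlags = [.maskCommand, .maskControl, .maskAlternate, .maskShift]
    let decision = decide(
        isKeyDown: type == .keyDown,
        keyCode: event.getIntegerValueField(.keyboardEventKeycode),
        hasModifiers: !event.flags.intersection(modifierMask).isEmpty,
        sourceStateID: event.getIntegerValueField(.eventSourceStateID)
    )
    switch decision {
    case .passThrough:
        return Unmanaged.passUnretained(event)
    case .cancelAndSuppress, .suppress:
        // A real callback would record the cancel before returning for the Esc case.
        return nil
    }
}
#endif

print(decide(isKeyDown: true, keyCode: 53, hasModifiers: false, sourceStateID: 0))
```

Splitting the logic this way means the rule "Esc with Shift held is not a cancel" can be checked without posting a single real event.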

Press Esc Mid-Automation. Check The Marker File. See The Log.

The Esc handler writes a timestamp to disk before anything else. That gives you a ground-truth marker that is independent of anything the model reports back: you can always verify the cancel fired at the OS level by checking one file.
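A marker write of this kind could look roughly like the sketch below. The function name writeEscMarker is hypothetical, and the demo writes into a temporary directory rather than /tmp/macos-use so it does not touch a real server's files. The esc_at_ prefix and the file name follow the article.

```swift
import Foundation

/// Writes a timestamped marker file and returns its location.
func writeEscMarker(in directory: URL, now: Date = Date()) throws -> URL {
    // Create the directory first so the write cannot fail on a fresh machine.
    try FileManager.default.createDirectory(at: directory, withIntermediateDirectories: true)
    let marker = directory.appendingPathComponent("esc_pressed.txt")
    // Atomic write: a reader never sees a half-written marker.
    try "esc_at_\(now)".write(to: marker, atomically: true, encoding: .utf8)
    return marker
}

let demoDirectory = FileManager.default.temporaryDirectory
    .appendingPathComponent("macos-use-sketch")
let markerURL = try! writeEscMarker(in: demoDirectory)
let markerContents = try! String(contentsOf: markerURL, encoding: .utf8)
print(markerContents)
```

Writing the marker before any other cancel work means the file exists even if a later step of the cancel path fails.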

A real cancel, logged end to end

What Happens In The 400ms Between Esc And The Cancelled Response

Press Esc while a composed click_and_traverse call is mid-flight. Here is the exact sequence the server executes, start to finish. Each step maps to a line range in InputGuard.swift or main.swift.

Hardware Esc reaches the tap callback

keyDown, keyCode 53, sourceStateID 0. The callback at InputGuard.swift:311 picks it up before any app window does.

Marker file is written

Line 347 writes `esc_at_<Date>` to /tmp/macos-use/esc_pressed.txt. This is the ground-truth marker you can grep for later.

Cancelled flag is flipped

handleEscPressed at InputGuard.swift:289 locks, sets _cancelled = true, and calls disengage(). The event tap tears down.

Tap returns nil, the Esc event is suppressed

The callback at line 349 returns nil. The frontmost app never sees the keystroke, so nothing in the foreground reacts to it.

Next throwIfCancelled check fires in the handler

Between composed actions, main.swift:1708/1721/1728/1734 calls try InputGuard.shared.throwIfCancelled(). The flag is true, so it throws InputGuardCancelled.

Cursor and frontmost app are restored, error goes back over MCP

Saved cursor position (set at main.swift:1674) and saved frontmost app are restored in the catch block. The JSON-RPC response carries the error.
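The flag-and-check part of that sequence can be sketched without any macOS APIs. InputGuardCancelled and throwIfCancelled are names from the article; CancelFlag, runComposed, and the step names are hypothetical, and the "restored" entry stands in for the cursor and frontmost-app restore.

```swift
import Foundation

struct InputGuardCancelled: Error {}

/// A lock-protected flag: set from the tap callback, read between steps.
final class CancelFlag {
    private let lock = NSLock()
    private var cancelled = false

    func cancel() {
        lock.lock()
        cancelled = true
        lock.unlock()
    }

    func throwIfCancelled() throws {
        lock.lock()
        defer { lock.unlock() }
        if cancelled { throw InputGuardCancelled() }
    }
}

/// Runs steps in order, checking the flag before each one.
/// Returns the names of the steps that ran, plus "restored" if the chain was cancelled.
func runComposed(steps: [(name: String, run: () -> Void)], flag: CancelFlag) -> [String] {
    var trace: [String] = []
    do {
        for step in steps {
            try flag.throwIfCancelled()
            step.run()
            trace.append(step.name)
        }
    } catch {
        // Stand-in for restoring the saved cursor position and frontmost app.
        trace.append("restored")
    }
    return trace
}

let flag = CancelFlag()
let trace = runComposed(steps: [
    (name: "click", run: {}),
    (name: "type", run: { flag.cancel() }),  // simulates Esc arriving during this step
    (name: "press", run: {}),
], flag: flag)
print(trace)  // ["click", "type", "restored"]
```

Note the granularity: the step that is in flight when Esc arrives still finishes, and the chain stops at the next check.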

Remote MCP Server vs. Local MCP Server, The Safety Delta

Feature	Remote MCP server	Local MCP server (macos-use)
Shared hardware with user	No. Runs on a different machine.	Yes. Same keyboard, same mouse, same screen.
Needs an input kill-switch	No. Disconnect the client.	Yes. Esc (keycode 53) cancels anywhere on the OS.
Needs a watchdog timeout	HTTP timeout on the client is enough.	Yes. 30s at InputGuard.swift:24 prevents hardware lockout.
Needs OS-level permissions	API keys, OAuth.	macOS Accessibility + Screen Recording, granted per app.
Needs visible user affordance	Client UI is enough.	Full-screen overlay with pulsing orange dot and cancel hint.
Cleanup on crash	TCP close.	Process exit releases the CGEventTap even if disengage never runs.
Transport	HTTP/SSE or WebSocket	stdio, parent client manages the process lifecycle.

355 lines

“The CGEventTap must be on the main run loop to receive events. Watchdog fires after 30s auto-disengaging. Plain Esc (keycode 53, no modifiers) writes /tmp/macos-use/esc_pressed.txt and returns nil to suppress the event.”

Sources/MCPServer/InputGuard.swift:150, 176, 340-349

Safety guarantees the guard gives you

Every disruptive tool call installs the tap before the action and tears it down after
Eleven hardware event types (keys, mouse buttons, moves, drags, scroll, modifiers) are blocked for the duration
Plain Esc (no Cmd, Ctrl, Option, Shift) cancels automation, regardless of which app has focus
A 30-second watchdog force-releases the tap if the tool hangs or disengage never runs
A full-screen overlay with a pulsing orange dot and a 'press Esc to cancel' hint is visible the whole time
Programmatic events the server itself posts pass through (sourceStateID != 0), so the server can still click while the user is blocked
Cursor position and frontmost app are saved before the tool and restored after, even on cancel
refresh_traversal is the only non-disruptive tool; it never engages the guard and never blocks input

Try It Yourself In Under Five Minutes

git clone https://github.com/mediar-ai/mcp-server-macos-use
cd mcp-server-macos-use
xcrun --toolchain com.apple.dt.toolchain.XcodeDefault swift build -c release

# Point your MCP client at .build/release/mcp-server-macos-use
# Grant Accessibility permission when macOS prompts.
# Fire any disruptive tool (e.g. open Safari), then press Esc mid-way.

# The marker file is your receipt that the cancel fired at the OS level:
cat /tmp/macos-use/esc_pressed.txt

# The tap status file confirms the tap was armed at engage():
cat /tmp/macos-use/tap_status.txt

# Server logs (stderr) show the full lifecycle:
# log: InputGuard: engaging — AI: Clicking in app… — press Esc to cancel
# log: InputGuard TAP: keyDown keyCode=53 sourceState=0
# log: InputGuard: Esc pressed — user cancelled
# log: InputGuard: disengaging
# log: InputGuard: CGEventTap destroyed

Frequently Asked Questions

What is an MCP server in one sentence?

An MCP server is a long-running process that speaks JSON-RPC 2.0 to an AI client (Claude Desktop, Cursor, VS Code, Cline) and advertises a list of typed tools the model is allowed to call. When the client invokes a tool, the server executes the action and returns a text or structured result. macos-use is an MCP server that exposes six such tools, registered in one array at Sources/MCPServer/main.swift:1408, which control macOS apps via the Accessibility APIs.

What does the MCP spec say an MCP server must do?

The spec defines three primitives (tools, resources, prompts), a JSON-RPC 2.0 wire format, a handful of methods (initialize, tools/list, tools/call, resources/list, resources/read, prompts/list), and a transport (stdio or HTTP). That is the contract between client and server. It does not define what the server does to your OS while executing the tool. Everything below this line happens inside a block the spec is silent about.

Why does a local MCP server need a keyboard kill-switch at all?

Because a remote MCP server runs on someone else's box, while a local MCP server runs on yours. If its tool is 'click at (412, 598) and type hello', it is competing with you for the same cursor and the same keyboard the whole time. Without a guard, a user who switches apps mid-automation can land a keystroke inside the wrong field, or the model can loop and never release focus. macos-use solves this by installing a CGEventTap at the HID level (InputGuard.swift:113), which swallows hardware keyboard and mouse events for the duration of the tool call, and a 30-second watchdog (InputGuard.swift:24), which force-releases the tap so the machine is never unrecoverably locked.

How exactly does the Esc kill-switch work?

The CGEventTap callback at InputGuard.swift:311-355 reads every hardware keyboard event. When it sees a keyDown, it pulls the keycode with event.getIntegerValueField(.keyboardEventKeycode). Line 345 checks `keyCode == 53 && flags.intersection(modifierMask).isEmpty`, which is plain Escape with no Cmd, Ctrl, Option, or Shift. On match, line 347 writes `/tmp/macos-use/esc_pressed.txt` as a ground-truth marker (useful for post-hoc debugging), calls handleEscPressed, and returns `nil` to suppress the Esc event so it never reaches the frontmost app. Between automation steps, throwIfCancelled at InputGuard.swift:53 reads that cancelled flag and throws InputGuardCancelled, which aborts the composed click→type→press chain.

Why 30 seconds for the watchdog?

It is long enough to cover a normal tool call (click, scroll, multi-step open), but short enough that if the server hangs or the tap gets stuck, the user is not locked out of their machine for more than about half a minute. Line 24 of InputGuard.swift declares `var watchdogTimeout: TimeInterval = 30`, and startWatchdog at line 172 arms a DispatchSource.makeTimerSource on the global queue that fires after that interval, logs `watchdog fired after 30.0s — auto-disengaging`, and calls disengage(). If your tool genuinely needs longer, bump the value in your fork; there is no runtime flag.
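A one-shot watchdog of that shape can be sketched with a dispatch timer source. The Watchdog class and its method names are hypothetical, and the demo uses a 50 ms timeout instead of 30 s so it finishes quickly.

```swift
import Foundation
import Dispatch

/// Fires a handler once after a timeout unless it is disarmed first.
final class Watchdog {
    private let lock = NSLock()
    private var timer: DispatchSourceTimer?

    func arm(after timeout: TimeInterval, onFire: @escaping () -> Void) {
        let source = DispatchSource.makeTimerSource(queue: DispatchQueue.global())
        source.schedule(deadline: .now() + timeout)  // no repeat interval: one shot
        source.setEventHandler(handler: onFire)
        lock.lock()
        timer?.cancel()       // re-arming replaces any earlier timer
        timer = source
        lock.unlock()
        source.resume()
    }

    func disarm() {
        lock.lock()
        timer?.cancel()
        timer = nil
        lock.unlock()
    }
}

// Demo 1: the tool "hangs", so the watchdog fires and would release the tap.
let fired = DispatchSemaphore(value: 0)
let watchdog = Watchdog()
watchdog.arm(after: 0.05) { fired.signal() }
let hangOutcome = fired.wait(timeout: .now() + 2)

// Demo 2: the tool returns in time, so the watchdog is disarmed and never fires.
let firedLate = DispatchSemaphore(value: 0)
watchdog.arm(after: 0.05) { firedLate.signal() }
watchdog.disarm()
let normalOutcome = firedLate.wait(timeout: .now() + 0.3)

print(hangOutcome == .success, normalOutcome == .timedOut)
```

Running the timer on a global queue rather than the main run loop matters for this use: a watchdog that shares a thread with the thing it watches cannot fire when that thread is stuck.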

Which event types does the tap actually block?

Eleven types, enumerated at InputGuard.swift:115-126: keyDown, keyUp, leftMouseDown, leftMouseUp, rightMouseDown, rightMouseUp, mouseMoved, leftMouseDragged, rightMouseDragged, scrollWheel, and flagsChanged (modifier key state). That covers the keyboard, the left and right mouse buttons, pointer movement, and scrolling. Returning `nil` from the callback at line 354 swallows the event. Programmatic events the server itself posts via CGEvent.post use .hidSystemState (non-zero sourceStateID), so the check at lines 329-332 lets those through: the server can still click and type while the user is blocked.

What is the difference between an MCP server and an AI agent?

An MCP server is just a tool provider. It does not decide what to do; the client's LLM does. The server advertises `{tool: 'click_and_traverse', params: {pid, x, y}}`, the model reasons about when to invoke it, the client sends the JSON-RPC call over stdio, the server performs the click, the server returns a summary. The agent is the loop that wraps the model and the client; the MCP server is a passive endpoint. macos-use, specifically, implements only the server side: it has no prompts, no memory, and no model. It is 1917 lines of Swift at Sources/MCPServer/main.swift plus 355 lines of input-guard code at InputGuard.swift.

What transport does macos-use use?

Stdio. Line 1874 of main.swift wires a StdioTransport() into the MCP SDK's Server instance. That means the server reads newline-delimited JSON-RPC from stdin and writes responses to stdout, and the AI client (e.g. Claude Desktop) launches the binary as a child process. There is no HTTP listener, no port, no socket. That choice matters for the safety story: because the server is a child of the client, quitting the client normally takes the server down with it, and process exit releases any CGEventTap the server had installed. The OS cleans up on process exit even if InputGuard.disengage() never runs.
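To make the newline-delimited framing concrete, here is a minimal sketch of a per-line handler. It does not use the MCP SDK, and handleLine is a hypothetical name. The tool list is cut down to one entry, and malformed lines are simply ignored, which a real server would not do.

```swift
import Foundation

/// Handles one line read from stdin and returns the line to write to stdout, if any.
func handleLine(_ line: String) -> String? {
    guard let parsed = try? JSONSerialization.jsonObject(with: Data(line.utf8)),
          let message = parsed as? [String: Any],
          let method = message["method"] as? String,
          let id = message["id"]   // notifications carry no id and get no reply
    else { return nil }

    var reply: [String: Any] = ["jsonrpc": "2.0", "id": id]
    switch method {
    case "tools/list":
        reply["result"] = ["tools": [["name": "refresh_traversal"]]]
    default:
        // -32601 is JSON-RPC's standard "method not found" code.
        reply["error"] = ["code": -32601, "message": "Method not found"] as [String: Any]
    }
    let data = try! JSONSerialization.data(withJSONObject: reply, options: [.sortedKeys])
    return String(decoding: data, as: UTF8.self)
}

// A real server would loop over stdin:
//   while let line = readLine() { if let out = handleLine(line) { print(out) } }
// The demo feeds two sample lines instead so it terminates on its own.
let sampleInput = [
    #"{"jsonrpc":"2.0","id":7,"method":"tools/list"}"#,
    #"{"jsonrpc":"2.0","method":"notifications/initialized"}"#,
]
let replies = sampleInput.compactMap(handleLine)
replies.forEach { print($0) }
```

With this framing, readLine() returns nil once stdin reaches end of file, so the loop in the comment would end when the parent closes the pipe.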

Can the AI client ever accidentally engage the guard without showing the overlay?

Not in practice. The engage() path at InputGuard.swift:69-95 is synchronous on the main thread: it creates the event tap and shows the overlay before returning. If the tap fails (Accessibility permission not granted, for example), line 140 logs an error, flips _engaged back to false, and the call is a no-op. If only the overlay failed, the tap would already be installed, so hardware would be blocked with no visual indication for the user. In practice, both succeed or both fail together because both go through the same NSApplication.shared + main run loop, and the check for Accessibility permission happens once at server boot.

How does this compare to other macOS MCP servers?

Most macOS-flavored MCP servers I checked do not ship any keyboard kill-switch at all. They assume a cooperative user. The ones that do ship a kill-switch usually rely on the AI client to implement the cancel UI, which means pressing Esc in the client window works, but pressing Esc while focus is inside the app being automated does not. macos-use installs the CGEventTap at .cghidEventTap placement (InputGuard.swift:133), which sits ahead of every app's event dispatcher including its own. Esc from inside any app on the machine cancels automation. That is the structural thing the protocol spec doesn't talk about.

What does an MCP server return to the client?

A CallTool.Result that carries `content: [MCPContent]`. macos-use returns a single text content item containing a compact summary (status, pid, app name, file path to the on-disk traversal, grep hint, screenshot path, tool-specific one-liner). The model reads the summary, then uses its own filesystem tools to grep the traversal file on demand. The return value is always text; structured content is supported by the spec but this server does not use it. See buildCompactSummary at main.swift:731.

Is the InputGuard idea something other MCP servers can copy?

Yes, and they probably should for any local MCP server that touches hardware input. The pattern is platform-specific (CGEventTap on macOS, low-level keyboard hooks on Windows, evdev/libinput on Linux) but the shape generalizes: install a system-level input blocker before the tool executes, arm a short watchdog timer, reserve one plain keystroke as a universal cancel, and emit a visible overlay so the user knows who is driving. The 355 lines of InputGuard.swift are the reference implementation for macOS; the constants (30s watchdog, keycode 53, .cghidEventTap) are tuned values, not cargo-culted defaults.
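The platform-independent shape can be sketched as a small protocol plus a wrapper that guarantees release. InputBlocker, withInputGuard, RecordingBlocker, and ToolFailed are hypothetical names. A real conformance would wrap CGEventTap on macOS or the equivalent hook elsewhere, and would also arm the watchdog and show the overlay inside engage().

```swift
import Foundation

/// What any platform-specific input blocker has to offer.
protocol InputBlocker {
    func engage() throws   // install the blocker, arm the watchdog, show the overlay
    func disengage()       // release everything; safe to call from any exit path
}

/// Runs a tool body with input blocked, releasing the blocker on every exit path.
func withInputGuard(_ blocker: InputBlocker, _ body: () throws -> Void) throws {
    try blocker.engage()
    defer { blocker.disengage() }  // runs on success and on a thrown error alike
    try body()
}

/// Test double that records calls instead of touching real hardware input.
final class RecordingBlocker: InputBlocker {
    private(set) var events: [String] = []
    func engage() { events.append("engage") }
    func disengage() { events.append("disengage") }
}

struct ToolFailed: Error {}

let blocker = RecordingBlocker()
do {
    try withInputGuard(blocker) { throw ToolFailed() }
} catch {
    print("tool failed, input released")
}
print(blocker.events)  // ["engage", "disengage"]
```

Putting the release in a defer is the in-process half of the guarantee; the watchdog covers the case where the body never returns at all.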

Read The 355 Lines That Define The OS Contract

InputGuard.swift is one file. One class, one free-function callback, two constants (30s, keycode 53), eleven event types. MIT-licensed Swift, exactly where the MCP spec stops and the operating system starts.

Open InputGuard.swift on GitHub →