
macOS Automation MCP Server: The 355-Line Input Guard That Decides Whether You Can Actually Share A Keyboard With Your Agent

Three macOS automation MCP servers are worth comparing in 2026. Two of them have no answer for the moment your hand drifts back to the keyboard while the agent is mid-click. The third ships a single Swift file, 355 lines, that drops every hardware keyboard and mouse event while a tool call is in flight, lets every synthetic event from the agent through, and surrenders the second you press plain Esc. This page is what that file does, what the alternatives do not do, and why that single design choice is the difference between using one of these every day and using none of them after the second time the agent typed into the same field as you.

Matthew Diakonov
14 min read

Direct answer · verified 2026-05-08

Of the three macOS automation MCP servers in active distribution, macos-use (mediar-ai/mcp-server-macos-use) is the only one with a runtime input-arbitration layer. macos-automator-mcp runs AppleScript and JXA; mcp-remote-macos-use drives a Mac you are not sitting at. Neither installs an event tap, neither has an Esc cancel, neither has a watchdog. If the agent is on the same machine as your hands, that gap is the whole story.

claude mcp add macos-use -- npx -y mcp-server-macos-use
Requires Claude Code (npm i -g @anthropic-ai/claude-code) and macOS 13+. Swift builds on first run, ~20 seconds.

What InputGuard.swift does, in five facts

  • 355 lines, one file, no dependencies beyond AppKit and CoreGraphics
  • Tap created at .cghidEventTap with headInsert placement so it sees events first
  • Drops events where eventSourceStateID == 0 (hardware), forwards everything else
  • Plain Esc (keycode 53, zero modifiers) is the unconditional kill switch
  • watchdogTimeout = 30s auto-releases the tap so you cannot get locked out

The MCP spec covers stdio framing and tool calls. It does not specify what happens when a local server fires synthetic input on a machine the user is also using. That gap is where InputGuard lives.

InputGuard.swift, the file

The Premise Every Other Comparison Skips

Most comparisons of macOS automation MCP servers stop at the feature matrix: which apps the server can drive, whether it uses AppleScript or accessibility APIs, what its tool surface looks like. Those matter. They are not the part that decides whether you keep the server installed after week two.

The part that decides is what happens when the agent is in the middle of a 700 ms click and your hand drifts back to your keyboard. On every other macOS MCP server I have tested, the answer is your keystrokes race the agent’s into the same focused field. AppleScript-based servers run osascript synchronously and have no concept of input ownership. VNC-based remote servers assume nobody is at the local keyboard. Both are correct designs for what they target. Neither is correct for the case where you are sitting at the Mac the agent is driving.

macos-use treats input arbitration as a first-class responsibility of the server. The implementation is in one file (Sources/MCPServer/InputGuard.swift) and is wrapped around every disruptive tool call from main.swift:1835 (engage) to main.swift:1898 (disengage). The design has three moving parts: a head-insert event tap that filters by source, an explicit user kill switch, and an unconditional watchdog. The rest of this page is what each of those is and why each one matters.

What Happens When You Type While The Agent Clicks

The scenario below is the unguarded case. The setup: a Reminders right-pane click and a stray Tab keypress from you mid-action. With the guard engaged, that same Tab is a hardware event, so it is dropped at the tap and never reaches Reminders.

You ask the agent to set up a recurring reminder. The agent issues open_application_and_traverse on Reminders, then click_and_traverse on the right-pane '+ New Reminder' button. While the click is mid-flight (the synthetic CGEvent has been posted but the AX traversal has not yet captured the new field), your hand drifts back to the keyboard and you press Tab thinking you are about to fix a typo in your editor. Tab fires into Reminders because Reminders is briefly the frontmost app. The new field receives a tab character before the agent's next type_and_traverse arrives. The agent reads the resulting AX tree, sees an unexpected initial value, and either retries (corrupting your input further) or gives up. You did not realize you contributed to the failure.

  • your keystroke lands in the agent's target app
  • AX traversal captures the corrupted input
  • agent's plan diverges silently
  • no signal to either party that interference happened

The Two Unobvious Bits: Source-State Filtering And Plain Esc

A naive event tap drops every event. That breaks the agent immediately because the agent’s synthesized clicks are also events, and a tap that drops everything drops them too. The interesting design question is: how does the tap tell the difference?

The answer is eventSourceStateID, a CGEventField that ships on every CGEvent. Hardware events (your keyboard, your trackpad) carry stateID == 0. Programmatic events posted by CGEvent.post with a non-default source carry stateID != 0. macos-use posts all of its synthetic input via CGEvent(mouseEventSource: .hidSystemState, ...), which sets a non-zero stateID. The callback at InputGuard.swift:329 reads that field and forwards the event if it is non-zero, silently drops it if it is zero.
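A minimal sketch of that filter decision, written as a pure function so the rule is easy to see. The names `TapDecision` and `decide(sourceStateID:)` are hypothetical and not taken from InputGuard.swift, and the zero-versus-non-zero convention is the one described above rather than something verified here.

```swift
import Foundation

/// What a tap callback does with one event.
/// Hypothetical type for illustration; not from InputGuard.swift.
enum TapDecision: Equatable {
    case drop     // the callback would return nil
    case forward  // the callback would return the event unchanged
}

/// Filter rule as described in the text: a source state ID of 0 is treated
/// as hardware input and dropped; any other value is treated as synthetic.
func decide(sourceStateID: Int64) -> TapDecision {
    return sourceStateID == 0 ? .drop : .forward
}

// In a real callback the ID would be read from the event:
//   let id = event.getIntegerValueField(.eventSourceStateID)
print(decide(sourceStateID: 0))  // drop
print(decide(sourceStateID: 1))  // forward
```

Keeping the decision separate from the CoreGraphics plumbing makes the rule testable without installing a tap.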

The kill switch is the second unobvious decision. The naive choice is some chord the user is unlikely to press by accident (Cmd-Shift-Esc, F19, a custom hotkey). macos-use picks plain Esc, no modifiers (keyCode == 53 && flags.intersection(modifierMask).isEmpty at line 345). This is the right call because the agent is usually doing something the user wants to cancel now, not after they remember a chord. Plain Esc is the most instinctive cancel key on the platform; it is the universal modal-dismiss key. Reusing it means the user does not have to learn anything to abort.

The price of that choice is that during a tool call, your normal Esc gets eaten. If you wanted Esc to close an unrelated dialog in another app while the agent is clicking in Reminders, that Esc cancels the agent instead of the dialog. The trade-off is intentional: the agent’s action is shorter than your reaction time most of the time, and giving you a single key that always means “stop now” is more valuable than letting Esc pass through.
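The plain-Esc test can be sketched as follows. This is an illustration under assumptions: the bit values are the raw values of `CGEventFlags.maskShift`, `.maskControl`, `.maskAlternate` and `.maskCommand`, written out so the sketch needs only Foundation, and the choice of exactly these four modifiers for `modifierMask` is a guess at what the real mask contains.

```swift
import Foundation

// Raw bit values of CGEventFlags.maskShift, .maskControl, .maskAlternate
// and .maskCommand, written out so the sketch does not need CoreGraphics.
let shiftBit: UInt64   = 0x0002_0000
let controlBit: UInt64 = 0x0004_0000
let optionBit: UInt64  = 0x0008_0000
let commandBit: UInt64 = 0x0010_0000

// Assumption: these four modifiers count; the real modifierMask may differ.
let modifierMask: UInt64 = shiftBit | controlBit | optionBit | commandBit

let escKeyCode: Int64 = 53

/// True only for Esc with none of the masked modifier bits set.
/// Other flag bits are ignored, which is why the check masks first
/// instead of comparing the whole flags value to zero.
func isPlainEsc(keyCode: Int64, flags: UInt64) -> Bool {
    return keyCode == escKeyCode && (flags & modifierMask) == 0
}

print(isPlainEsc(keyCode: 53, flags: 0x100))       // true: no modifier bits set
print(isPlainEsc(keyCode: 53, flags: commandBit))  // false: Cmd-Esc
print(isPlainEsc(keyCode: 36, flags: 0))           // false: Return, not Esc
```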

One Tool Call, Step By Step

What actually happens between the moment a tool call lands on stdin and the moment a response goes back. The guard wraps the entire disruptive path; everything outside the guard is read-only and skips the dance.

1. Tool call arrives over stdio

main.swift:1800 sets isDisruptive = (params.name != refreshTool.name). Every tool except refresh_traversal triggers the guard. refresh_traversal is read-only (no CGEvent.post, no AX writes), so the guard is intentionally skipped.

2. Pre-flight state save

main.swift:1804-1808 captures NSWorkspace.shared.frontmostApplication and NSEvent.mouseLocation (translated to CG coordinates). These get restored at main.swift:1906-1920 after the action so the user's window focus and cursor end where they were.

3. InputGuard.shared.engage(message:)

main.swift:1835 calls engage with a tool-specific message ('AI: Clicking in app... press Esc to cancel'). engage installs the CGEventTap, registers it on the main run loop, shows the fullscreen pill overlay, and starts the 30-second watchdog. If you are not on the main thread, it dispatches synchronously so the tap is live before the action runs.

4. Action runs, throwIfCancelled between steps

Composed actions (click + type + press in one call) call throwIfCancelled between every step (main.swift:1847, 1860, 1867, 1873). Each check reads InputGuard.shared.wasCancelled under the lock and throws InputGuardCancelled if set. The agent gets one cancellation per logical action even on multi-step calls.

5. 200 ms grace period after the action

main.swift:1894-1902 sleeps 200 ms after the action completes, then re-checks wasCancelled before disengaging. This catches the case where the user reads the response and presses Esc the moment the action lands, but before the next tool call. Late cancellations turn the response into an isError result.

6. Disengage and restore

InputGuard.shared.disengage() at main.swift:1898 tears down the tap, removes the run-loop source, stops the watchdog, and hides the overlay. main.swift:1906-1911 posts a CGEvent mouseMoved to restore the cursor; main.swift:1914-1920 reactivates the previously frontmost app if focus drifted.
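The control flow of the steps above can be sketched without any event tap at all. This is a simplified model, not the code in main.swift: `GuardSketch`, `ToolResult` and `runGuarded` are hypothetical names, the tap, overlay and focus restore are left out, and only the cancellation flag, the per-step check and the 200 ms grace period are modelled.

```swift
import Foundation

/// Error thrown on cancellation; the name mirrors the one used in the text.
struct InputGuardCancelled: Error {}

/// Stand-in for the guard. Only the cancellation flag is modelled.
final class GuardSketch {
    private let lock = NSLock()
    private var cancelled = false
    private(set) var engaged = false

    func engage() { engaged = true; setCancelled(false) }
    func disengage() { engaged = false }

    /// Would be called by the (omitted) event tap when plain Esc is seen.
    func markCancelled() { setCancelled(true) }

    var wasCancelled: Bool {
        lock.lock(); defer { lock.unlock() }
        return cancelled
    }

    func throwIfCancelled() throws {
        if wasCancelled { throw InputGuardCancelled() }
    }

    private func setCancelled(_ value: Bool) {
        lock.lock(); defer { lock.unlock() }
        cancelled = value
    }
}

enum ToolResult: Equatable {
    case ok(String)
    case error(String)
}

func runGuarded(_ guardSketch: GuardSketch,
                steps: [() -> Void],
                grace: TimeInterval = 0.2) -> ToolResult {
    guardSketch.engage()
    defer { guardSketch.disengage() }  // always runs, even on cancellation
    do {
        for step in steps {
            // Checked before each step, so a cancel skips the remaining steps.
            try guardSketch.throwIfCancelled()
            step()
        }
        // Grace period: catch an Esc pressed just as the action lands.
        Thread.sleep(forTimeInterval: grace)
        try guardSketch.throwIfCancelled()
        return .ok("done")
    } catch {
        // The only error thrown inside the do block is InputGuardCancelled.
        return .error("Cancelled: user pressed Esc")
    }
}

let sketch = GuardSketch()
let result = runGuarded(sketch, steps: [
    { print("click") },
    { sketch.markCancelled() },  // simulate Esc arriving mid-call
    { print("type") },           // never runs
])
print(result)
```

The `defer` is the part worth copying: whichever way the call exits, the guard is released before the response goes back.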

The Same Sequence As A Diagram

Hardware keystroke from you, programmatic click from the server, and an Esc cancellation, in the order they hit the tap.

One disruptive tool call with the guard engaged

Participants: You, macos-use, InputGuard, Target app. Messages, in order:

  • engage("AI: Clicking in app...")
  • tapCreate + overlay + watchdog 30s
  • hardware keystroke
  • drop (sourceStateID==0)
  • CGEvent.post(.leftMouseDown) sourceStateID!=0
  • forward (sourceStateID!=0)
  • AX diff post-action
  • press Esc (keycode 53, no modifiers)
  • throw InputGuardCancelled
  • disengage

macos-use vs. The Other macOS Automation MCP Servers

Each row is a behavior that decides whether you can actually keep the server installed once you start using your Mac for both your work and the agent’s work in the same hour.

| Feature | macos-automator-mcp / mcp-remote-macos-use | macos-use |
| --- | --- | --- |
| Architecture | AppleScript/JXA via osascript (automator); VNC + remote CGEvent (remote-macos-use) | Native Swift, AXUIElement read + CGEvent post, runs in-process |
| Runtime input arbitration (you and the agent share the keyboard) | None. Trust-based. "Use your robot wisely." | Head-insert CGEventTap at .cghidEventTap, hardware events dropped |
| Cancel a tool call mid-flight | Send SIGINT to the client and hope the action finished | Plain Esc, keycode 53, intercepted in the tap before the foreground app |
| Lockout safety net | None. If the script wedges, you reboot or wait it out. | 30-second watchdog auto-disengage, plus auto-recovery on tapDisabledByTimeout |
| Surfacing 'AI is acting' to the user | No UI. The script runs invisibly. | Fullscreen overlay with pulsing dot and per-tool-call message |
| Cursor and frontmost-app restore after the call | Whatever the script left behind | savedCursorPos and savedFrontmostApp restored at main.swift:1906-1920 |
| Best fit | automator: AppleScript-friendly Apple apps, no human at keyboard. remote-macos-use: a Mac you are not at. | Local agent on the Mac you are using right now |

The other two servers are correct designs for their target use case. macos-automator is right for AppleScript-friendly workflows when you are not at the keyboard. mcp-remote-macos-use is right for a Mac you are not sitting at. The input guard is only load-bearing for the local-and-also-using-the-Mac case.

Reading The File Yourself

Every claim on this page traces to a line in the public source. If you are evaluating whether to install one of these and want to verify rather than trust, this is the path. Each step is a single command and a single line you should see.

Verify in 7 commands

  • git clone https://github.com/mediar-ai/mcp-server-macos-use
  • cat Sources/MCPServer/InputGuard.swift | wc -l should print 355
  • grep -n 'watchdogTimeout: TimeInterval' Sources/MCPServer/InputGuard.swift returns line 24
  • grep -n 'sourceStateID' Sources/MCPServer/InputGuard.swift returns line 329
  • grep -n 'keyCode == 53' Sources/MCPServer/InputGuard.swift returns line 345
  • grep -n 'InputGuard.shared.engage' Sources/MCPServer/main.swift returns line 1835
  • swift build -c release in the repo root produces .build/release/mcp-server-macos-use

Where The Guard Does Not Help

Honest section. The input guard is a narrow primitive and there are several scenarios where it either does not apply or does not help.

If the server hangs mid-action, the 30-second watchdog is the only in-process backstop. If it crashes outright, the watchdog timer dies with the process and you depend on macOS reclaiming the tap. macOS does eventually reclaim event taps from dead processes, but the exact reclaim path is not documented, and on a crash inside the tap callback you can be looking at a few seconds of frozen keyboard. The watchdog covers a hang for up to 30 seconds; nothing covers a crash except the OS noticing.

If you lose Accessibility permission mid-session (System Settings revokes it, another security tool resets it), CGEvent.tapCreate returns nil at line 132 and the guard does not engage. The server logs the failure to stderr but proceeds to issue the CGEvent.post calls anyway, so the action runs without the guard. This is a known weakness; a stricter design would refuse the tool call when the tap fails to install.
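One way the stricter design could look, as a sketch rather than a patch against the real server. `GuardError` and `runStrict` are hypothetical names, and `installTap` is a stand-in closure for `CGEvent.tapCreate`, which returns nil when the tap cannot be created.

```swift
import Foundation

enum GuardError: Error, Equatable {
    case tapUnavailable
}

/// Runs `action` only if the tap could be installed.
/// `installTap` stands in for CGEvent.tapCreate; a nil result means
/// the tap was not created (for example, no Accessibility permission).
func runStrict<T>(installTap: () -> AnyObject?,
                  action: () throws -> T) throws -> T {
    guard let tap = installTap() else {
        // Refuse the tool call instead of running it unguarded.
        throw GuardError.tapUnavailable
    }
    // Keep the tap object alive for the duration of the action.
    return try withExtendedLifetime(tap) { try action() }
}

do {
    let value = try runStrict(installTap: { nil }, action: { "clicked" })
    print(value)
} catch {
    print("refused:", error)
}
```

The cost of this design is that a revoked permission turns every disruptive tool call into an error until the user re-grants it, which is arguably the behaviour you want.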

If you have a custom hardware-based input source (a Stream Deck, a foot pedal, a macro keyboard) that posts events with a non-zero stateID, the guard will let those through too. The filter is on source, not on physical origin, and there is no robust way to tell those apart from the agent’s synthetic input. In practice this is rare; in principle it is a soft spot.

The guard does not protect other apps from the agent. If the agent decides to type into the wrong window, the agent types. The four-primitive click-attribution layer in macos-use makes the agent less likely to do that, but the input guard is about you and the agent, not the agent and your other apps. Different problem, different layer.

Installing And Watching The Guard Engage

Four steps. The third one is where you confirm the guard is actually doing what this page claims.

1. Add the server

claude mcp add macos-use -- npx -y mcp-server-macos-use. The npm postinstall hook runs xcrun swift build -c release against the bundled Swift sources, so the only host requirement is Xcode Command Line Tools.

2. Grant Accessibility permission

First disruptive tool call triggers the system prompt. Allow it. Without this permission CGEvent.tapCreate returns nil and the input guard fails to engage; the server logs 'failed to create CGEventTap (check Accessibility permissions)' to stderr.

3. Smoke test the guard

From the agent: open Calculator and click any digit. You should see the orange-dot overlay flash for ~600 ms. Press Esc during a longer action (a typing tool call against TextEdit works) and you should see the action stop with a 'Cancelled: user pressed Esc' response.

4. Verify the breadcrumbs

Run ls /tmp/macos-use/. You should see tap_status.txt (written at InputGuard.swift:154 every time the tap is created), cancel_check.txt (written at line 56 on every throwIfCancelled call), and esc_pressed.txt (written at line 347 if you cancelled). These are deliberate forensic files.
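If you would rather check the breadcrumbs from a script than from ls, a small sketch like this works. The file names and directory are the ones listed above; `breadcrumbReport(in:)` is a hypothetical helper, and nothing here reads or interprets the file contents.

```swift
import Foundation

/// File names as listed in the text above.
let breadcrumbNames = ["tap_status.txt", "cancel_check.txt", "esc_pressed.txt"]

/// Reports, for each expected breadcrumb, whether it exists in `directory`.
func breadcrumbReport(in directory: URL) -> [String: Bool] {
    var report: [String: Bool] = [:]
    for name in breadcrumbNames {
        let path = directory.appendingPathComponent(name).path
        report[name] = FileManager.default.fileExists(atPath: path)
    }
    return report
}

let report = breadcrumbReport(in: URL(fileURLWithPath: "/tmp/macos-use"))
for name in breadcrumbNames {
    print(name, report[name] == true ? "present" : "missing")
}
```

A missing esc_pressed.txt is expected if you have never cancelled a tool call.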

MCP servers that ship for macOS treat input arbitration as a TODO. The one that actually shipped a 355-line solution to it is the one I keep installed.
Matthew Diakonov
working on macos-use


Frequently asked questions

Which macOS automation MCP server should I install if I want Claude Code to actually drive my Mac?

Three are in active distribution as of May 2026. macos-use (mediar-ai/mcp-server-macos-use, Swift, native Accessibility APIs + CGEvent, MIT) is the local one with the input guard described on this page. macos-automator-mcp (steipete/macos-automator-mcp, TypeScript, runs AppleScript and JXA) is a thin wrapper around osascript. mcp-remote-macos-use (baryhuang/mcp-remote-macos-use, Python, VNC + remote synthetic events) targets a Mac you are not sitting at. If you want an agent on the same machine you actually use, macos-use is the only one with a runtime input-arbitration layer; the other two delegate that responsibility to you. If you need remote, mcp-remote-macos-use is the right pick and the input guard is irrelevant. If you only need scripted Apple-app workflows that the AppleScript dictionary already covers (Mail filters, Notes search, Reminders queries), macos-automator-mcp is fine because the model is not racing you on a keyboard.

What does the input guard actually block?

Eleven CGEventTypes are added to the tap mask at InputGuard.swift:116-126: keyDown, keyUp, leftMouseDown, leftMouseUp, rightMouseDown, rightMouseUp, mouseMoved, leftMouseDragged, rightMouseDragged, scrollWheel, flagsChanged. The tap is created at .cghidEventTap with .defaultTap options and headInsert placement (lines 131-138), which means it sits in front of every other tap on the system. The callback at line 311 inspects each event's eventSourceStateID. Hardware events have stateID == 0 (line 329); programmatic events posted by CGEvent.post inside the server use .hidSystemState which has a non-zero stateID. The callback returns nil for hardware events (drops them) and Unmanaged.passUnretained(event) for programmatic events (forwards them). The result: while a tool call is in flight, your physical keyboard and mouse do nothing, but the agent's synthesized clicks and keystrokes go through.
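For readers who have not written an event tap before, here is a hedged sketch of the mask, callback and install call that paragraph describes. It is not the code in InputGuard.swift: `guardedTypes`, `guardedMask`, `filterCallback` and `installGuardTap` are hypothetical names, the Esc handling, overlay and watchdog are omitted, and the source-state rule is taken from the description above. It is macOS-only because it imports CoreGraphics.

```swift
import CoreGraphics
import Foundation

// The event types named in the text above.
let guardedTypes: [CGEventType] = [
    .keyDown, .keyUp, .flagsChanged,
    .leftMouseDown, .leftMouseUp, .rightMouseDown, .rightMouseUp,
    .mouseMoved, .leftMouseDragged, .rightMouseDragged, .scrollWheel,
]

// A CGEventMask has one bit per event type, at bit position rawValue.
let guardedMask: CGEventMask = guardedTypes.reduce(0) { mask, type in
    mask | (CGEventMask(1) << CGEventMask(type.rawValue))
}

// Callback: drop events whose source state ID is 0, forward the rest.
let filterCallback: CGEventTapCallBack = { _, type, event, _ in
    // Tap-disabled notifications are not input; pass them through untouched.
    if type == .tapDisabledByTimeout || type == .tapDisabledByUserInput {
        return Unmanaged.passUnretained(event)
    }
    let sourceID = event.getIntegerValueField(.eventSourceStateID)
    if sourceID == 0 {
        return nil  // treated as hardware input: swallow it
    }
    return Unmanaged.passUnretained(event)
}

/// Creates the tap and attaches it to the main run loop.
/// Returns nil when the tap cannot be created.
func installGuardTap() -> CFMachPort? {
    guard let tap = CGEvent.tapCreate(
        tap: .cghidEventTap,
        place: .headInsertEventTap,
        options: .defaultTap,
        eventsOfInterest: guardedMask,
        callback: filterCallback,
        userInfo: nil
    ) else { return nil }
    let source = CFMachPortCreateRunLoopSource(kCFAllocatorDefault, tap, 0)
    CFRunLoopAddSource(CFRunLoopGetMain(), source, .commonModes)
    CGEvent.tapEnable(tap: tap, enable: true)
    return tap
}
```

`installGuardTap()` is deliberately not called here. Without the Esc handler and the watchdog, calling it from a process with a running main run loop could swallow your own keyboard and mouse input until that process exits.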

What happens if the server crashes or hangs while the guard is engaged?

Three independent failsafes prevent a lockout. First, the watchdog timer at InputGuard.swift:24, with default watchdogTimeout = 30 seconds, calls disengage automatically (lines 172-181). Second, plain Esc with no modifiers (keycode 53, checked at line 345) is the explicit kill switch and is the only key that survives the drop filter; the callback returns nil after handleEscPressed runs, so even Esc does not leak into the foreground app. Third, macOS itself disables an event tap if the run loop blocks for too long, which fires .tapDisabledByTimeout in the callback (lines 300-305); the server re-enables the tap if it can, but if the process is dead the tap is gone with it. The combination is intentional: you get an opt-in lock that you can unconditionally break.
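The watchdog idea is simple enough to sketch in a few lines. This version is an assumption-laden stand-in: it uses a `DispatchWorkItem` on a background queue, which may not be the timer mechanism the real file uses, and `Watchdog` is a hypothetical name.

```swift
import Foundation

/// One-shot watchdog: runs `onTimeout` unless `cancel()` is called first.
/// Sketch only; start and cancel are expected to be called from one thread.
final class Watchdog {
    private let queue = DispatchQueue(label: "watchdog.sketch")
    private var work: DispatchWorkItem?

    func start(timeout: TimeInterval, onTimeout: @escaping () -> Void) {
        cancel()  // restarting replaces any pending timeout
        let item = DispatchWorkItem(block: onTimeout)
        work = item
        queue.asyncAfter(deadline: .now() + timeout, execute: item)
    }

    func cancel() {
        work?.cancel()  // a cancelled item is skipped when its deadline arrives
        work = nil
    }
}

let watchdog = Watchdog()
watchdog.start(timeout: 30) { print("watchdog fired: disengage") }
// ... the tool call would run here ...
watchdog.cancel()  // a normal disengage stops the timer
```

Running the timeout off the main thread means a blocked main run loop does not stop it from firing, though whatever it calls still has to be safe to run from that queue.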

Does the input guard slow down the agent?

The tap install itself is the expensive part: CGEvent.tapCreate plus the run-loop source registration plus the synchronous overlay window draw cost roughly 30-60 ms on a 2024-class Mac. The per-event cost in the callback is a single integer field read (eventSourceStateID) plus one keyCode read on keyDown frames; sub-microsecond. The callback runs on the main run loop, so callbacks during a long-running tool call interleave with whatever the agent is doing on the main thread. The disengage at the end of every disruptive tool call (main.swift:1898) costs another ~10 ms because it tears down the run-loop source and hides the overlay window. In normal use you notice the overlay flash on every action; you do not notice the latency.

Why is this not just a system-level Do Not Disturb mode or an OS feature?

Because OS-level focus modes do not block your keyboard. Do Not Disturb suppresses notifications. Screen Time blocks apps, not events. There is no Apple-supplied API for 'pause my hardware input for 800 ms while this synthetic click happens, but let the synthetic click through.' The closest primitive Apple ships is the CGEventTap, which is exactly what InputGuard wraps. Doing this correctly requires three things you would not get from a generic OS toggle: filtering by eventSourceStateID so synthetic events still flow, scoping the block to one tool-call duration so the user is not locked out for an entire session, and a kill switch that runs from inside the tap before the event reaches the foreground app. None of those are exposed at any layer above the tap API, which is why the server has to ship its own.

Can I disable the input guard if I want to keep typing while the agent works?

Not as a config flag right now. The engage call at main.swift:1835 fires unconditionally on any tool call where isDisruptive == true (defined at main.swift:1800 as 'every tool except refresh_traversal'). The trade-off is deliberate: the alternative is the user typing into the same focused field as the agent, which will silently corrupt either the user's work or the agent's plan. If you want a 'no guard' mode for narrow cases (you are recording a video and need to overlay your own keystrokes, you are running automation against a different display than your active one), the patch is small (one boolean parameter on the tool input schema, gated at the engage call site) but it is not in the shipped server.

What does the overlay look like and can I customize it?

A fullscreen NSWindow at .screenSaver level (so it floats above every app), 15% black tint, with a centered dark pill containing a pulsing orange dot and a single line of text. The pill width is min(720, 50% of screen width); height is 80 px; corner radius is 40 px. The pill text comes from the tool description selected at main.swift:1810-1834, which maps every tool name to a short imperative ('Clicking in app...', 'Typing in app...', 'Pressing Return...'). The window has ignoresMouseEvents = true so it does not intercept clicks visually; only the event tap intercepts them. Code starts at InputGuard.swift:202 (buildAndShowOverlay) and is plain AppKit, so customizing it is a 30-line patch.
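The pill dimensions quoted above reduce to one small function. A sketch, with `PillGeometry` and `pillGeometry(screenWidth:)` as hypothetical names; the real code presumably works in CGFloat inside AppKit rather than plain Double.

```swift
import Foundation

struct PillGeometry: Equatable {
    let width: Double
    let height: Double
    let cornerRadius: Double
}

/// Width is capped at 720 and otherwise half the screen width;
/// height and corner radius are the fixed values quoted in the text.
func pillGeometry(screenWidth: Double) -> PillGeometry {
    return PillGeometry(width: min(720, screenWidth * 0.5),
                        height: 80,
                        cornerRadius: 40)
}

print(pillGeometry(screenWidth: 1440).width)  // 720.0
print(pillGeometry(screenWidth: 1280).width)  // 640.0
```

A corner radius of half the height is what makes the rectangle read as a pill rather than a rounded box.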

How do I install macos-use and verify the input guard is actually engaged?

claude mcp add macos-use -- npx -y mcp-server-macos-use installs and rebuilds the Swift binary against your Xcode toolchain. After the first tool call you will see two things: the system Accessibility prompt (one-time), and the orange-dot overlay during every subsequent disruptive call. To verify programmatically, tail /tmp/macos-use/tap_status.txt while issuing a tool call (the file is written at InputGuard.swift:154 with the line tap_created: enabled=true at <Date>). Press Esc mid-action and check /tmp/macos-use/esc_pressed.txt for a marker file with the timestamp (written at InputGuard.swift:347). Both files are deliberate breadcrumbs for exactly this kind of verification.

