macOS computer use

When the AI and your hand reach for the same mouse

Matthew Diakonov
8 min read

Computer use on macOS means an AI assistant reads what is on screen and then takes real actions: it clicks, it types, it opens apps, it walks a multi-step task to completion. The clean way to do that on a Mac is not to screenshot the display and guess pixel coordinates. It is to read the same accessibility tree that VoiceOver uses (AXUIElement) and post real input events through CGEvent. That is what an MCP server like macos-use does.

But here is the part almost every guide on this topic skips. Because the events are real, they drive your actual keyboard and mouse, not a VM. So when the model is mid-task and you reach for the trackpad to check Slack, you are both grabbing the same physical input device. What stops you from fighting over the cursor?

Direct answer · verified 2026-06-15

macos-use ships a hardware input guard. While a disruptive tool call runs, a CGEventTap blocks your physical keyboard and mouse, a floating overlay reads “AI is controlling your computer — press Esc to cancel”, and pressing Esc kills the run instantly. A 30-second watchdog guarantees you are never locked out. The whole mechanism is ~355 lines in InputGuard.swift.

Native computer use shares one set of hands

There are two broad ways to give an AI computer use. The first is isolation: spin up a virtual machine or a remote desktop, screenshot it, and let the model click around inside a sandbox that is not your working session. The second is native: drive the machine you are actually sitting in front of. Browser-only servers are a narrow version of the first. macos-use is the second, because the entire reason it exists is to reach the apps a browser cannot: Finder, Mail, Calendar, Xcode, and whatever third-party Mac GUI you live in.

Native is more powerful and also more dangerous in a very specific, physical way. The model posts a left mouse click and the real cursor jumps. It types a string and the characters land in whatever app holds focus. If you happen to click somewhere else at the same instant, focus moves, and the model’s next keystroke goes into the wrong window. The failure mode is not abstract. It is you and a language model both trying to steer one cursor a few times a second.

Most write-ups on macOS computer use stop at “use the accessibility API, not screenshots.” True, and covered well elsewhere. The unglamorous engineering that makes native computer use actually usable is the handoff: a clear, enforced boundary around the moments the AI is holding the controls.

The handoff, in order

When a tool call that moves the mouse or types arrives, the server wraps the action in an engage / disengage pair. Your hardware is locked out for exactly that window, and only that window.

One disruptive tool call

Actors: You, macos-use, macOS HID

1. engage(): install CGEventTap + overlay
2. block hardware key/mouse events
3. post synthetic click via CGEvent
4. state ID != 0, event passes
5. you press Esc (keycode 53)
6. throwIfCancelled() stops the run
7. disengage(): tear down tap + overlay

In this sequence, macOS HID is the layer the events flow through, macos-use is the server, and “You” is where your physical input and your Esc both originate.
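The engage / disengage pairing can be sketched in a few lines of Swift. This is a minimal illustration, not the code from macos-use: the `InputGuarding` protocol, `RecordingGuard`, `ActionFailed`, and `runToolCall` are hypothetical names, and the stub only records calls where the real guard installs a CGEventTap. The part worth noticing is the `defer`, which runs `disengage()` whether the action returns or throws.

```swift
// Hypothetical interface standing in for the real input guard.
protocol InputGuarding {
    func engage()
    func disengage()
}

// Test double: records the order of calls instead of touching hardware.
final class RecordingGuard: InputGuarding {
    private(set) var log: [String] = []
    func engage() { log.append("engage") }
    func disengage() { log.append("disengage") }
    func note(_ entry: String) { log.append(entry) }
}

// Hypothetical error used to simulate an action that fails mid-run.
struct ActionFailed: Error {}

// Wrap only disruptive actions; pure reads run without blocking the user.
func runToolCall<T>(isDisruptive: Bool,
                    inputGuard: InputGuarding,
                    action: () throws -> T) rethrows -> T {
    guard isDisruptive else { return try action() }
    inputGuard.engage()
    defer { inputGuard.disengage() }   // runs on success and on throw
    return try action()
}
```

Passing `isDisruptive: false` skips the guard entirely, which matches the idea that reads should never lock you out.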

How the guard tells your hand from the model’s

A tap that blocked all input would block the server’s own clicks too, and nothing would happen. The trick is that the events the server synthesizes and the events your hand generates are not identical at the system level. Synthetic CGEvents posted by the server carry a non-zero eventSourceStateID; genuine hardware events carry zero. The callback reads that field on every event and uses it as the gate.

That single check is what makes the whole thing work. Pass the server’s own events through untouched, swallow everything that came from a human hand, and special-case one key, Esc, as the escape hatch.

InputGuard.swift · inputGuardCallback
// inputGuardCallback: runs for every event the tap sees while the guard is engaged
let sourceStateID = event.getIntegerValueField(.eventSourceStateID)
if sourceStateID != 0 {
    return Unmanaged.passUnretained(event)   // our own synthetic CGEvent, let it through
}

// keycode 53 == Esc. No Command/Control/Option/Shift held.
if type == .keyDown {
    let keyCode = event.getIntegerValueField(.keyboardEventKeycode)
    let modifierMask: CGEventFlags = [.maskCommand, .maskControl, .maskAlternate, .maskShift]
    if keyCode == 53 && event.flags.intersection(modifierMask).isEmpty {
        guard_.handleEscPressed()
        return nil                            // swallow Esc, cancel the run
    }
}

return nil                                    // block every other hardware event

Returning nil from a CGEventTap callback drops the event; returning the event passes it on. So return nil on the last line is what actually freezes your trackpad.
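The same gate can be modeled as a pure function, which makes the three outcomes easy to reason about and to unit test without installing a real tap. This is a sketch under assumptions: `GateDecision`, `Modifiers`, and `decide` are hypothetical names, `Modifiers` stands in for CGEventFlags, and the zero / non-zero state ID rule is taken from the description above rather than verified independently.

```swift
// Outcome of the gate for one event seen by the tap.
enum GateDecision {
    case pass     // return the event: it continues to the focused app
    case cancel   // plain Esc: swallow it and cancel the run
    case block    // return nil: the hardware event is dropped
}

// Hypothetical modifier set, standing in for CGEventFlags.
struct Modifiers: OptionSet {
    let rawValue: Int
    static let command = Modifiers(rawValue: 1 << 0)
    static let control = Modifiers(rawValue: 1 << 1)
    static let option  = Modifiers(rawValue: 1 << 2)
    static let shift   = Modifiers(rawValue: 1 << 3)
}

let escKeyCode: Int64 = 53

func decide(sourceStateID: Int64,
            isKeyDown: Bool,
            keyCode: Int64,
            modifiers: Modifiers) -> GateDecision {
    // Synthetic events posted by the server pass untouched.
    if sourceStateID != 0 { return .pass }
    // Esc key-down with no modifiers is the escape hatch.
    if isKeyDown && keyCode == escKeyCode && modifiers.isEmpty { return .cancel }
    // Everything else that came from hardware is dropped.
    return .block
}
```

Because the state ID check comes first, a synthetic Esc posted by the server passes through as an ordinary key press, while Shift+Esc from the keyboard is blocked like any other hardware key.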

The lifecycle of one click

Reads do not engage the guard; only actions that touch your hardware do. Here is what happens around a single disruptive tool call.

1. Engage

The server calls engage() before acting. It installs the CGEventTap on the main run loop and draws a full-screen, click-through overlay with a pulsing dot and the message line. Your input is now blocked.

2. Start the watchdog

A 30-second timer arms immediately. If anything hangs, it auto-disengages the tap so you are never trapped. This runs no matter what the automation does next.

3. Act, checking for cancel

The server posts the synthetic click or keystrokes. Between steps it calls throwIfCancelled(), which throws InputGuardCancelled the moment the cancelled flag is set. The model's events pass the tap because their state ID is non-zero.

4. Esc, any time (optional)

If you press plain Esc, the callback flips the cancelled flag, suppresses the key, disengages, and fires onUserCancelled. The next throwIfCancelled() stops the run instead of doing the next click.

5. Disengage

On success or cancel, disengage() tears down the tap, cancels the watchdog, and hides the overlay. Your keyboard and mouse are yours again. A late-cancel check catches an Esc that landed right at the boundary.
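The cancel check between steps can be sketched as a flag plus a throwing guard call. This is an illustration under assumptions, not the macos-use source: `CancelFlag` and `runSteps` are hypothetical names, and `InputGuardCancelled` is redefined here as a plain struct so the example stands alone. A lock protects the flag because the Esc handler and the automation would normally run on different threads.

```swift
import Foundation

// Stand-in for the error the article says throwIfCancelled() throws.
struct InputGuardCancelled: Error {}

// Minimal cancel state; the real guard also owns the tap and the overlay.
final class CancelFlag {
    private let lock = NSLock()
    private var cancelled = false

    // What the Esc handler would call.
    func cancel() {
        lock.lock()
        cancelled = true
        lock.unlock()
    }

    // What the automation calls between steps.
    func throwIfCancelled() throws {
        lock.lock()
        let isCancelled = cancelled
        lock.unlock()
        if isCancelled { throw InputGuardCancelled() }
    }
}

// Run steps in order, checking the flag before each one.
func runSteps(_ steps: [() -> Void], flag: CancelFlag) throws {
    for step in steps {
        try flag.throwIfCancelled()
        step()
    }
}
```

A step that is already running finishes; cancellation takes effect at the next check, which is why the checks sit between individual clicks and keystrokes rather than around the whole task.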

What the tap blocks, and what slips through

The event mask covers the full set of input you could use to disrupt a run. Everything below is intercepted while the guard is engaged, with two deliberate exceptions.

During a disruptive tool call

  • Hardware key down / key up — blocked
  • Left, right mouse down / up — blocked
  • Mouse moved and drag events — blocked
  • Scroll wheel and modifier flag changes — blocked
  • The server's own synthetic CGEvents (state ID != 0) — passed
  • Plain Esc — passed to the cancel handler, then swallowed

The last two rows are not failures. They are the intentional gaps that let the model work and let you bail out.
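An event tap selects what it intercepts with a bitmask that has one bit per event type. The sketch below builds such a mask for the rows listed above. It is an assumption-laden illustration: `InputEventType`, `eventMask`, `maskContains`, and `guardMask` are hypothetical names, the raw values mirror Core Graphics' CGEventType so that the code runs without importing it, and the exact set of types in the real mask is inferred from the list rather than read from InputGuard.swift.

```swift
// Raw values mirror CGEventType in Core Graphics; on macOS, use CGEventType itself.
enum InputEventType: UInt64, CaseIterable {
    case leftMouseDown = 1, leftMouseUp = 2
    case rightMouseDown = 3, rightMouseUp = 4
    case mouseMoved = 5
    case leftMouseDragged = 6, rightMouseDragged = 7
    case keyDown = 10, keyUp = 11
    case flagsChanged = 12
    case scrollWheel = 22
}

// A tap's mask sets bit N for the event type whose raw value is N.
func eventMask(for types: [InputEventType]) -> UInt64 {
    types.reduce(0) { mask, eventType in mask | (1 << eventType.rawValue) }
}

// True when the mask would intercept the given event type.
func maskContains(_ mask: UInt64, _ eventType: InputEventType) -> Bool {
    (mask & (1 << eventType.rawValue)) != 0
}

// Assumed guard mask: every type listed in the table above.
let guardMask = eventMask(for: InputEventType.allCases)
```

Any event type whose bit is not set never reaches the callback at all, so the mask decides what the tap sees and the callback decides which of those events get dropped.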

Why this is the part that matters

A sandboxed, screenshot-based computer-use setup does not need any of this. The model is poking at a VM you are not touching, so there is no contention and no need for an escape key. The price is that it can only do what lives inside the sandbox, which on macOS means you give up the native apps that are the whole reason to automate a Mac in the first place.

The moment you choose native control, the shared-hardware problem is yours to solve, and a guard like this is not a nice-to-have. It is the thing that makes you willing to hand the model your real cursor at all, because you know one keypress takes it back and a stuck run releases on its own in 30 seconds. If you are evaluating any macOS computer-use tool, ask where its Esc is. If the answer is “force-quit the client,” that tells you how seriously the handoff was designed.

macos-use is open source under BSL 1.1, runs on macOS 13 and up, and drops into Claude Code, Cursor, VS Code, or Claude Desktop as a standard stdio MCP server. The input guard is the same code path whether you run it standalone or through a product built on top of it.


Frequently asked questions

What does macOS computer use actually mean?

Computer use is an AI taking real actions on a computer instead of only writing text: opening apps, clicking buttons, typing into fields, navigating menus, and chaining those into a task. On macOS specifically, the cleanest path is an MCP server built on Apple's Accessibility APIs (AXUIElement to read UI state, CGEvent to synthesize clicks and keystrokes) rather than screenshotting the screen and guessing pixel coordinates. The accessibility tree gives the model real element roles, labels, and positions, so it clicks the button named 'Send' instead of a pixel that might be a button.

Does macOS computer use drive my real keyboard and mouse?

With a native, accessibility-based server, yes. Unlike browser computer use or a sandboxed VM, the events are posted to the real system event stream via CGEvent, so they move your actual cursor and type into your actual focused app. That is the whole point: it can drive Finder, Mail, Calendar, Xcode, and any third-party GUI, not just a web page. The tradeoff is that you and the model are now sharing one physical input device, which is exactly the problem the input guard solves.

What happens if I touch the trackpad while the AI is working?

On the macos-use server, your physical input is blocked for the duration of a disruptive tool call. A CGEventTap installed by InputGuard intercepts hardware keyboard and mouse events and swallows them, so a stray trackpad swipe cannot move focus out from under the model mid-action. The synthetic events the server itself posts are allowed through, because they carry a non-zero event source state ID while hardware events carry zero. When the tool call finishes, the tap is torn down and your input works normally again.

How do I stop the AI if it goes off the rails?

Press Esc with no modifiers. The input guard watches for keycode 53 with an empty modifier mask, and when it sees it, it sets a cancelled flag, suppresses the Esc event so it does not leak into the app, tears down the tap, and fires an onUserCancelled callback. The server checks that flag between automation steps via throwIfCancelled(), so the in-flight tool call stops rather than completing the next click. It is a single key, no chord to remember.

What if the guard gets stuck and locks me out of my own Mac?

It cannot stay stuck for long. InputGuard starts a watchdog timer the moment it engages, and the watchdog auto-disengages the tap after watchdogTimeout seconds (30 by default), even if the automation never calls disengage. The guard also re-enables the tap if macOS disables it because of a timeout or user input (the tapDisabledByTimeout and tapDisabledByUserInput cases). So the worst case is a 30-second pause, not a permanent lockout.
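A watchdog of this kind can be sketched with a cancellable DispatchWorkItem. This is a minimal model under stated assumptions: `WatchdogGuard` is a hypothetical class, `watchdogTimeout` borrows the name mentioned above, no event tap is installed, and the tap re-enable path for tapDisabledByTimeout is not modeled.

```swift
import Foundation

// Stand-in for the guard's engaged state plus its watchdog timer.
final class WatchdogGuard {
    private let lock = NSLock()
    private var engaged = false
    private var watchdog: DispatchWorkItem?
    let watchdogTimeout: TimeInterval

    init(watchdogTimeout: TimeInterval = 30) {
        self.watchdogTimeout = watchdogTimeout
    }

    var isEngaged: Bool {
        lock.lock()
        defer { lock.unlock() }
        return engaged
    }

    func engage() {
        // The watchdog force-disengages if nobody else does in time.
        let item = DispatchWorkItem { [weak self] in self?.disengage() }
        lock.lock()
        engaged = true
        watchdog?.cancel()
        watchdog = item
        lock.unlock()
        DispatchQueue.global().asyncAfter(deadline: .now() + watchdogTimeout, execute: item)
    }

    func disengage() {
        lock.lock()
        engaged = false
        watchdog?.cancel()   // a normal disengage disarms the timer
        watchdog = nil
        lock.unlock()
    }
}
```

The timer is armed inside `engage()` itself rather than by the caller, so a code path that forgets to call `disengage()` still releases after the timeout.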

Is the input guard active during every tool call?

No. It only engages for disruptive tool calls, the ones that actually move the mouse or type. Reading the accessibility tree (a traversal or refresh) does not touch your hardware, so it does not need the guard and does not block you. In main.swift the guard is engaged before the action and disengaged after, with throwIfCancelled() checks interleaved only when isDisruptive is true. Pure reads stay out of your way.

How is this different from Anthropic's reference computer use?

Anthropic's reference computer-use tool is screenshot-driven: the model is shown an image of the screen and returns pixel coordinates to click. That works anywhere but it is slow, token-heavy, and brittle when the layout shifts. The macos-use approach reads the structured accessibility tree instead, so the model targets named elements and gets a compact text summary plus a screenshot path back rather than burning tokens on raw pixels. Both are 'computer use'; one looks at pixels, the other reads the same tree VoiceOver uses.

Can I use this with Claude Code, Cursor, or Claude Desktop?

Yes. It is a standard MCP server that speaks over stdio, so any MCP-compatible client works: Claude Code, Cursor, VS Code, Claude Desktop, Cline, and custom Anthropic or OpenAI clients. You add it to your MCP config and the six-plus tools (open, click, type, press key, scroll, read tree, and a few AX-action variants) show up to the model. It runs locally on your Mac; there is no hosted endpoint.

