MCP latency by backend, with real numbers per call
Every macOS MCP server wraps one of four backends: AppleScript, the Accessibility API + CGEvent, screenshot-plus-vision, or browser CDP. The backend choice determines almost all of the per-call latency, yet most guides on this topic quote a single "MCP latency" number without saying which backend they measured. This page breaks it down for the first three; browser CDP is not measured here. The non-obvious answer up top: the AX + CGEvent backend that macos-use uses pays a ~575ms median AX-walk cost per call (n=50, real audit log on a 387-element window), plus a hardcoded 200ms activation grace at Sources/MCPServer/main.swift:1659. That is the entire budget; the click itself is sub-10ms.
Per-call latency, fastest to slowest, on a modern Apple Silicon Mac:
- AppleScript, persistent osascript: ~5ms per call (process kept alive between commands).
- AppleScript, cold spawn: ~80ms per call (fresh osascript each time).
- Accessibility API + CGEvent (macos-use): ~600 to 1200ms per click_and_traverse (200ms activation + ~575ms median AX walk + <10ms CGEvent + disk write).
- Screenshot + VLM: ~200ms window capture + 1 to 4s vision-model round trip; backend latency is dominated by the model, not the macOS side.
Source: read of Sources/MCPServer/main.swift plus 50 real traversals out of /tmp/macos-use/ on a Fazm Dev window (median 575ms, p90 1100ms, max 5210ms).
The numbers, from this dev machine
Every macos-use tool call writes the AX-walk time into the header of the file it drops in /tmp/macos-use/. The line looks like # Fazm Dev — 387 elements (0.55s). Sampling 50 of those headers on a current Apple Silicon Mac gives this distribution: median 575ms, p90 1100ms, max 5210ms. It is not a benchmark; it is a working session.
The four backends, ranked by per-call latency
Each row is one macOS automation backend an MCP server can sit on top of. The numbers are the per-call cost of the backend, before the model turn and before any thinking time. The rows do not measure the same workload, so treat the ranking as a shape, not a leaderboard.
1. AppleScript with persistent osascript
~5 ms
Keep the osascript process alive between commands and pipe in successive AppleScript snippets. Per-command cost is roughly 5ms in published Safari MCP benchmarks. The catch is coverage: AppleScript only works for apps that ship a scripting dictionary (Mail, Safari, Finder, Music, Pages). Most third-party native apps, all Catalyst apps, and almost all Electron-on-Mac apps do not.
2. AppleScript with cold osascript spawn
~80 ms
The naive way: an MCP server that shells out to osascript -e '...' every call. The fork-exec is the dominant cost. About 16x the persistent path. This is what the macos-automator-style MCP servers pay when they do not pool processes. Same coverage limits as row 1.
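The cold-spawn figure is easy to check on your own machine. The sketch below times repeated fresh process launches using only Python's standard library. The helper name `time_cold_spawns` is hypothetical, and `osascript -e 'return 1'` in the usage comment is an assumed stand-in for whatever snippet your server actually runs.

```python
import statistics
import subprocess
import sys
import time


def time_cold_spawns(command, runs=20):
    """Return per-run wall times in ms for launching `command` fresh each time."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # A new process per iteration: this is the spawn cost being measured.
        subprocess.run(command, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


if __name__ == "__main__":
    # Pass the command on the CLI, e.g.: osascript -e 'return 1'
    # Falls back to a no-op Python process so the script runs anywhere.
    command = sys.argv[1:] or [sys.executable, "-c", "pass"]
    samples = time_cold_spawns(command)
    print(f"median {statistics.median(samples):.1f} ms over {len(samples)} runs")
```

The result includes Python's own subprocess overhead, so read it as an upper bound on the spawn cost rather than a precise figure.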
3. Accessibility API + CGEvent (macos-use)
~600 to 1200 ms
The native-everything path. AXUIElement walks the focused window to give the model a tree it can grep; CGEvent posts the synthetic click, type, or scroll. Pays a 200ms activation grace, a median 575ms AX walk on a 387-element window (p90 ~1.1s), and a sub-10ms event post. Works on every Mac app with Accessibility permission, including the ones AppleScript cannot reach. Three to five orders of magnitude richer state per call than osascript, at one-to-two orders of magnitude more latency.
4. Screenshot + VLM (computer-use style)
~1 to 5 s
Capture the window with CGWindowListCreateImage or ScreenCaptureKit (~100 to 300ms), then send it to a vision model that returns coordinates to click. The macOS side is faster than the AX walk; the vision RTT is what burns the budget. Coverage is the best of the four (works on anything with pixels), accuracy is the lowest (misclicks scale with model accuracy on small UI), and the per-call cost is the highest. macos-use deliberately does not use this backend as the primary path; the .png in /tmp/macos-use/ is an audit artifact, not the input the model uses to decide.
“On this dev machine, the median accessibility-tree walk across 50 real MCP tool calls was 575ms, on a 387-element Fazm Dev window. The hardcoded 200ms app-activation grace sits on top. Together that is the per-call budget you see in Claude Code; the click itself is sub-10ms.”
50 traversals from /tmp/macos-use/ on 2026-05-21, headers parsed from the .txt files; Sources/MCPServer/main.swift:1659, 1018
The AX + CGEvent budget, line by line
Inside a single click_and_traverse on macos-use, this is where the time goes. Each step maps to a specific line in Sources/MCPServer/main.swift; the ones that block the response are the AX walks, not the input post.
One click_and_traverse, top to bottom:
1. App activate: NSRunningApplication.activate() (main.swift:1657)
2. 200ms grace: Task.sleep 200_000_000 ns (main.swift:1659)
3. Scroll into view: 0 if visible, +100-150ms/step otherwise
4. Pre-traversal: AX walk (~575ms median, n=50)
5. CGEvent post: click / type / scroll (<10ms)
6. Post-traversal: AX walk + diff build
7. Write + reply: .txt to /tmp + JSON-RPC out
Two things in this diagram are not obvious. First, the AX walk runs twice on a mutating tool: once before the click so the diff has a baseline, and once after so the model gets the resulting tree. On a big window that is two 575ms walks back to back, and the post-walk, not the click, is what the model is waiting on. Second, the screenshot helper is not on the critical path. It runs in a forked subprocess after the response is built; the JSON-RPC reply ships even if the helper hits its 5s timeout (main.swift:476-477).
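The steps above can be read as a rough additive model. The sketch below is not the server's code: the constants are the figures quoted on this page, the function name and the `walks` parameter are hypothetical, and charging both walks at the median is an assumption. Diff build and disk write are left out because the page gives no numbers for them.

```python
ACTIVATION_GRACE_MS = 200   # Task.sleep cited at main.swift:1659
MEDIAN_AX_WALK_MS = 575     # n=50 sample on a 387-element window
CGEVENT_POST_MS = 10        # upper bound; the page says "sub-10ms"


def click_and_traverse_budget_ms(walk_ms=MEDIAN_AX_WALK_MS, walks=1):
    """Sum the blocking steps of one call, ignoring diff build and disk write."""
    return ACTIVATION_GRACE_MS + walks * walk_ms + CGEVENT_POST_MS


print(click_and_traverse_budget_ms(walks=1))                 # 785
print(click_and_traverse_budget_ms(walks=2))                 # 1360
print(click_and_traverse_budget_ms(walk_ms=1100, walks=1))   # 1310 (p90 walk)
```

The single-walk figure sits inside the ~600 to 1200ms range quoted earlier; the two-walk figure is an upper-side estimate under the assumption stated.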
What the budget does not show
The numbers above are the steady-state cost on a hot, focused window. Real sessions hit several long-tail conditions that the median does not capture; the audit log is full of them. Six to keep in mind:
Hidden latency leaks
- Cold app launch can add 2-5s to the first traversal (we measured a 41-element Fazm window at 5.09s on first walk, ~100ms on the next).
- Catalyst apps with deep AXGroup nesting cost N synchronous IPC round trips, not one batched call. AXUIElementCopyAttributeValue is per-attribute, per-node.
- scrollIntoViewIfNeeded loops up to 30 scroll steps at 100-150ms each (main.swift:1208-1234). An off-screen click can quietly cost 3-4.5s of pure scroll time before any AX work happens.
- Composed calls (click + type + Return) add a 100ms gap between each additional input action (main.swift:1866) on top of the primary path.
- After every disruptive tool, a 200ms post-action grace runs to let an Esc keypress cancel the call before the response goes out (main.swift:1896).
- Screenshot subprocess can time out at 5s on a contended system (main.swift:476-477). The .txt still ships, but the call duration spikes.
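Two of the bullets above reduce to simple arithmetic. The following sketch reproduces the scroll bound and the composed-call overhead from the constants quoted in the list. The function names are hypothetical, and adding the post-action grace to every composed call is my assumption about how the two delays combine.

```python
MAX_SCROLL_STEPS = 30          # loop bound cited at main.swift:1208-1234
SCROLL_STEP_MS = (100, 150)    # per-step range quoted in the list
COMPOSED_GAP_MS = 100          # gap per additional input action (main.swift:1866)
POST_ACTION_GRACE_MS = 200     # Esc-cancel window (main.swift:1896)


def worst_case_scroll_ms(steps=MAX_SCROLL_STEPS):
    """Return (low, high) scroll time in ms if all `steps` scroll steps run."""
    low, high = SCROLL_STEP_MS
    return steps * low, steps * high


def composed_overhead_ms(actions):
    """Extra ms for a composed call: one gap per action after the first, plus grace."""
    return max(actions - 1, 0) * COMPOSED_GAP_MS + POST_ACTION_GRACE_MS


print(worst_case_scroll_ms())    # (3000, 4500)
print(composed_overhead_ms(3))   # 400 for click + type + Return
```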
Measuring it yourself
macos-use writes the AX-walk time into every audit file by default, so you do not need an instrumented build. After a few Claude Code or Cursor sessions against your own apps, dump the headers:
cd /tmp/macos-use && for f in *.txt; do head -1 "$f"; done | awk '/elements/ {print $NF}'
The shape of your distribution will look different from this dev machine. Catalyst apps (built with UIKit and bridged to the Mac, which covers many of Apple's first-party apps and a growing share of third-party ones) walk slower because the AXGroup nesting is deeper. Pure Cocoa apps with shallow hierarchies walk faster. Either way, the file headers are ground truth from your real workload; you do not have to trust ours.
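To turn those headers into the same median / p90 / max summary used on this page, something like the following Python sketch should work. It assumes the header format shown earlier (`# <title> — <N> elements (<seconds>s)`). The regex, the helper names, and the nearest-rank percentile are my choices, not part of macos-use.

```python
import glob
import re
import statistics

# Matches e.g. "# Fazm Dev — 387 elements (0.55s)"
HEADER_RE = re.compile(r"(\d+) elements \((\d+(?:\.\d+)?)s\)")


def parse_header(line):
    """Return (element_count, walk_seconds), or None if the line does not match."""
    match = HEADER_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


def summarize(walk_seconds):
    """Median, nearest-rank p90, and max of a non-empty list of walk times."""
    ordered = sorted(walk_seconds)
    # Nearest rank: ceil(0.9 * n) - 1, done in integers to avoid float rounding.
    p90_index = (9 * len(ordered) + 9) // 10 - 1
    return {
        "n": len(ordered),
        "median_s": statistics.median(ordered),
        "p90_s": ordered[p90_index],
        "max_s": ordered[-1],
    }


if __name__ == "__main__":
    walks = []
    for path in glob.glob("/tmp/macos-use/*.txt"):
        with open(path, encoding="utf-8", errors="replace") as handle:
            parsed = parse_header(handle.readline())
        if parsed is not None:
            walks.append(parsed[1])
    print(summarize(walks) if walks else "no traversal headers found")
```

Because `parse_header` also returns the element count, the same loop can be extended to group walk times by window size.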
So which backend should an MCP server pick?
Picking on latency alone gives you osascript; picking on coverage alone gives you screenshot + VLM. The honest answer is that the dominant cost of an MCP-driven workflow is not the backend, it is the model turn that decides what to click. A 575ms AX walk is a rounding error next to a 3 to 8s assistant turn. What the AX backend buys versus the others is a clean, queryable state representation per call, which the model uses to decide the next step without paying another round trip. The latency you save by skipping the walk you usually pay back in extra model turns. That is the trade macos-use makes on purpose. If you are using a different backend and shipping it on macOS, the numbers above are the bar you need to clear; if you are building on macos-use, the budget above is the one you can already read from disk on every call.
Frequently asked questions
Which macOS MCP backend is fastest per call?
Persistent osascript wins on raw per-call latency: about 5ms once the osascript process is kept alive between commands, against roughly 80ms when each call spawns a fresh osascript. The catch is that AppleScript only fully covers apps with a scripting dictionary; for native apps that do not expose one, you need the Accessibility API instead, and that backend pays a much larger walk-the-tree cost (median 575ms in our 50-call sample). Screenshot + VLM backends are the slowest, dominated by the vision model round trip (1 to 4 seconds), not by the macOS side.
Why is the Accessibility API backend so much slower than osascript?
Different unit of work. osascript runs one targeted command and returns; the Accessibility API call in an MCP server walks the entire AX tree of the focused window so the model has something to grep. On a Fazm Dev window with 387 elements the median walk was 575ms (n=50), p90 was about 1.1s, max was 5.2s when the window was still hydrating. The walk dwarfs the actual click. macos-use exposes the walk result via the 'file:' line in the JSON-RPC response so the model can grep the tree instead of re-walking it.
What is the 200ms Task.sleep in main.swift actually doing?
It is an activation grace period. Sources/MCPServer/main.swift calls NSRunningApplication(processIdentifier:).activate() at line 1657 and then sleeps 200_000_000 nanoseconds (200ms) at line 1659 before posting the CGEvent. Without that pause, the click sometimes lands while the previous frontmost app still owns input focus: the event hits the wrong window, and the AX tree the server walks afterward does not match what the user sees. The 200ms value is empirical; shorter values still produced focus races on slower Macs.
Why does macos-use chain its actions with 100ms gaps?
click + type + Return is the common composed call, and the gap at main.swift:1866 (100_000_000 nanoseconds = 100ms) gives the target app a chance to update its focused element before the next synthetic event lands. Without that pause, typed characters sometimes get eaten by the previous focus owner, or Return fires before the text field finishes processing the keystrokes. 100ms is the lowest value that worked across the apps we tested; you can tune it down in your fork if you only drive pure Cocoa apps.
How do I measure the latency for my own MCP backend?
macos-use already writes the timing to disk on every call. Look at the header of any file in /tmp/macos-use/, e.g. '# Fazm Dev — 387 elements (0.55s)'. The 0.55s is the AX walk. The gap between a file's mtime and that of the previous file in the directory approximates the per-call wall time including activation, click, walk, and PNG capture; it also includes any idle time between calls, so read it as an upper bound. A one-liner: cd /tmp/macos-use && for f in *.txt; do head -1 "$f"; done | awk '/elements/ {print $NF}'. That gives you a real distribution from your own usage, not a benchmark slide.
Does the latency scale with window size?
Yes, roughly linearly with element count for the AX walk. A 41-element Fazm dialog walked in 5.09s (a cold launch outlier), but on a hot run a 41-element window walks in ~100ms. A 547-element main window walks in ~650ms on average, 1.57s p90. Catalyst apps with deep nested AXGroup hierarchies are the worst case: each AXUIElementCopyAttributeValue is a synchronous IPC call to the target process, and a deep tree means thousands of round trips. The fix is not faster code, it is fewer attributes per node, which macos-use already does (it reads role, value, frame, and a small attribute set per node, not every AX attribute).
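A quick check of "roughly linearly", using the three hot-run data points quoted on this page (41 elements at ~100ms, 387 at a 575ms median, 547 at a ~650ms average). This is only arithmetic on those figures; mixing a median with averages is a limitation of the available numbers, and the helper name is hypothetical.

```python
# (element_count, walk_ms) pairs quoted on this page for hot runs.
SAMPLES = [(41, 100), (387, 575), (547, 650)]


def ms_per_element(elements, walk_ms):
    """Average walk cost per AX element for one data point."""
    return walk_ms / elements


for elements, walk_ms in SAMPLES:
    cost = ms_per_element(elements, walk_ms)
    print(f"{elements:>4} elements: {walk_ms:>4} ms  ->  {cost:.2f} ms/element")
```

The per-element cost falls from about 2.4 ms to about 1.2 ms as the window grows, which is consistent with a fixed per-walk overhead plus a roughly linear per-element term.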
Where does the screenshot fit in the budget?
It is out of the critical path. captureWindowScreenshot at main.swift:386 forks a subprocess (screenshot-helper) so the ReplayKit framework loaded as a side effect dies with the child instead of pinning the parent at ~19% CPU forever. The parent gives the helper a 5s timeout (deadline = DispatchTime.now() + 5.0 at main.swift:476-477) and waits with group.wait. In normal operation the PNG lands in 100-300ms. If the helper times out, the .txt and the JSON-RPC reply still go through; you get a degraded but not-broken response.
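The pattern described here, a helper subprocess with a hard deadline whose failure degrades the response instead of breaking it, can be sketched in Python as follows. It illustrates the pattern only and is not the Swift implementation, which this page describes as using DispatchTime and group.wait. `run_helper_with_deadline`, `build_reply`, the reply keys, and the stand-in commands are all hypothetical.

```python
import subprocess
import sys


def run_helper_with_deadline(command, timeout_s=5.0):
    """Run a helper process; return True on success, False on timeout or failure.

    The caller builds its reply either way, so a slow helper only costs time.
    """
    try:
        # subprocess.run kills the child and raises if the timeout expires.
        result = subprocess.run(command, timeout=timeout_s,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def build_reply(tree_path, screenshot_ok):
    """Assemble a reply that omits the screenshot when the helper did not finish."""
    reply = {"file": tree_path}
    if screenshot_ok:
        reply["screenshot"] = tree_path.replace(".txt", ".png")
    return reply


if __name__ == "__main__":
    # Stand-in helper that sleeps longer than the deadline.
    slow = [sys.executable, "-c", "import time; time.sleep(2)"]
    ok = run_helper_with_deadline(slow, timeout_s=0.2)
    print(build_reply("/tmp/macos-use/example.txt", ok))
```

Running the helper in a child process also means anything it loads is released when the child exits, which is the isolation benefit this answer describes.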
Can I bypass the AX walk to get a faster MCP click?
Yes, but you give up state observability. macos-use deliberately fuses every input tool with a traversal so the diff (+/-/~ in the .txt) is part of the response the model reads. If you only want to click, fork the server and skip options.showDiff and options.traverseAfter for clickTool; you save the 575ms walk on the back half and the call drops to ~250ms (200ms activation + ~10ms CGEvent post + ~40ms response build). The trade-off is that the model now has no idea whether the click did anything, so it usually issues a follow-up refresh_traversal anyway and you end up paying the walk on the next turn instead of this one.
Is MCP itself adding meaningful latency on top of the backend?
Not in a way you can see. MCP over stdio is framing JSON-RPC on a local pipe between the client and the server process. Round-trip overhead is sub-millisecond. The dominant transport cost in a Claude Code session is the model RTT to Anthropic, not the MCP stdio. When people quote 'MCP latency' numbers in the 200ms to 5s range, they are quoting the backend the server is wrapping, not the protocol. The protocol is free; the backend you picked is what you are paying for.
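To see the transport cost in isolation, the sketch below sends newline-delimited JSON-RPC messages to a trivial child process over stdio and times each round trip. The echo server and the helper names are stand-ins for illustration, not an MCP implementation. Numbers will vary by machine and include Python's own overhead.

```python
import json
import statistics
import subprocess
import sys
import time

# Child process: answer every JSON-RPC request line with an empty result.
ECHO_SERVER = r"""
import json, sys
for line in sys.stdin:
    request = json.loads(line)
    reply = {"jsonrpc": "2.0", "id": request["id"], "result": {}}
    sys.stdout.write(json.dumps(reply) + "\n")
    sys.stdout.flush()
"""


def measure_stdio_round_trips(runs=200):
    """Return (round-trip times in ms, last reply) for `runs` request/reply pairs."""
    child = subprocess.Popen([sys.executable, "-u", "-c", ECHO_SERVER],
                             stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                             text=True, bufsize=1)
    samples, reply = [], None
    try:
        for request_id in range(runs):
            request = {"jsonrpc": "2.0", "id": request_id, "method": "ping"}
            start = time.perf_counter()
            child.stdin.write(json.dumps(request) + "\n")
            child.stdin.flush()
            reply = json.loads(child.stdout.readline())
            samples.append((time.perf_counter() - start) * 1000.0)
    finally:
        child.stdin.close()   # EOF ends the child's read loop
        child.wait()
        child.stdout.close()
    return samples, reply


if __name__ == "__main__":
    samples, _ = measure_stdio_round_trips()
    print(f"median round trip: {statistics.median(samples):.3f} ms")
```

Comparing this median against the backend figures on this page shows how small the pipe framing is relative to an AX walk or a vision-model call.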