MCP Write Tools vs Read: 8 Fused, 1 Pure

MCP itself has no separate primitive for read vs write. Both are Tools, distinguished only by readOnlyHint and friends, and even those are advisory. So the design is yours. The macOS-use server fuses every write with a read of the post-action state and returns a diff. Eight tools work that way. One is a pure read. The reason is below, with the Swift that proves it.

Matthew Diakonov, Written with AI

Published May 8, 2026 · 9 min read

See main.swift on GitHub · MCP spec (2025-11-25)

Direct answer

Verified 2026-05-08 against the MCP spec and main.swift

Split read from write when the read is independent of the write. Fuse them when the write changes a surface the LLM has to keep modeling. Filesystem and database servers split (read_file vs write_file, SELECT vs INSERT). UI-driving servers fuse, because every action changes the screen and the LLM's next decision depends on the new screen. The macOS-use MCP server fuses 8 of its 9 tools and returns a diff after every mutation. The diff is the read.

Fused tools (8): every name ends in _and_traverse. open_application, click, type, press_key, scroll, set_value, press_ax, set_selected.
Pure read (1): refresh_traversal. Used when the LLM wants the tree without doing anything.
Tradeoff: fused tools cannot honestly carry readOnlyHint: true. Clients that prompt on writes will prompt on every fused call. Acceptable for UI driving; would be wrong for a database server.

9 tools total: 8 fused write+read, 1 pure read (refresh_traversal)

Every fused tool name ends in _and_traverse — the contract is the name

Diff (added/removed/modified) is structurally part of every write's return value

main.swift:1482 is the authoritative tool registry; main.swift:858-878 is the diff serializer

What MCP Does (And Does Not) Say About Reads vs Writes

The MCP spec gives you Tools, Resources, and Prompts. Resources are read-only by design (URIs you list and fetch). Prompts are user-controlled. Tools are model-controlled and cover everything else. There is no separate read-tool and write-tool primitive. The spec adds annotation hints (readOnlyHint, destructiveHint, idempotentHint, openWorldHint) and states explicitly that they are advisory. Clients use them for confirmation dialogs and audit logs, not for sandboxing.
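A minimal sketch of how a client might act on those hints. The tool names come from this article; the annotation values, the fused tool's description, and the should_prompt policy are assumptions made for illustration, not the macOS-use server's declared metadata or any real client's behavior.

```python
# Tool entries shaped like items in a tools/list result.
# Annotation values here are assumed for illustration.
TOOLS = [
    {
        "name": "refresh_traversal",
        "description": "Useful for getting the current UI state without performing an action.",
        "annotations": {"readOnlyHint": True},
    },
    {
        "name": "click_and_traverse",
        "description": "Click, then return a diff of the UI tree.",  # illustrative wording
        "annotations": {"readOnlyHint": False},
    },
]


def should_prompt(tool: dict) -> bool:
    """Hypothetical client policy: auto-allow only tools that declare readOnlyHint true.

    A missing annotation is treated as not read-only, matching the spec's
    default of false. The hint is advisory: this is a UX decision, not a sandbox.
    """
    annotations = tool.get("annotations") or {}
    return not annotations.get("readOnlyHint", False)


prompts = {tool["name"]: should_prompt(tool) for tool in TOOLS}
print(prompts)
```

Under this policy the pure read is auto-allowed and the fused tool prompts, which is the annotation tradeoff this article keeps returning to.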

So the read-vs-write decision is yours. The reference servers (filesystem, github, postgres) split: distinct read_file and write_file, query and execute. That is the right call for transactional surfaces. The LLM rarely needs the whole table after an INSERT, and a separate query tool keeps annotations clean (readOnlyHint: true on query).

UI driving is not a transactional surface. The screen is live state the LLM has to model, and every action changes it. Splitting click from read forces the LLM to remember to refresh after every click. The macOS-use server made the opposite call: fuse the read into the write, return a diff, and keep one pure read for the case where the LLM wants the tree without acting.

The Two Patterns Compared

One row per design dimension. The point of the table is not to crown a winner. Splitting wins on transactional surfaces. Fusion wins on live-state surfaces. The table is here to help you avoid the wrong call on either side.

Feature	Split read/write tools	macos-use (fused)
Round trips per logical action (click, observe new state)	two: one for write, one for read	one: write tool returns the new state inline as a diff
Class of bug: LLM forgets to refresh after a write	common; bug surface scales with action count	structurally impossible; refresh is part of the tool's return
readOnlyHint annotation cleanliness	clean: read tools mark true, write tools mark false	fused tools cannot mark readOnlyHint true; only refresh_traversal can
What the tool returns after a mutation	an opaque success/failure; LLM must call read next	added/removed/modified diff plus text-change samples
Best fit	transactional surfaces (database, filesystem, queue, API)	live-state surfaces (UI tree, simulator, board game, document)
Permission UX in clients that prompt per tool	read tools auto-allow, write tools prompt; minimal noise	every fused tool prompts; correct for UI driving, noisy elsewhere
Latency on a 400-element app, write then observe	roughly 400ms (200ms write + 200ms read, plus encode hop)	roughly 280ms (write piggybacks on the same traversal)
Stale-state failure mode	LLM acts on cached tree from a previous read	the tree is fresh after every write by construction

Database, filesystem, queue, and pure-API MCP servers should keep splitting. The fused pattern is correct for UI trees, simulators, board games, chat threads, and other surfaces the LLM has to keep modeling between actions.

One Logical Action, Two Wire Patterns

The same agent goal: click a button and observe what changed. The split pattern issues two MCP tool calls and routes through two JSON-RPC round trips. The fused pattern issues one. The diff is small (typical clicks change 1 to 20 elements out of 400), so shipping it inline costs very little.

Split: click then read_screen

Fused: click_and_traverse
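A toy model of the two wire patterns, counting one round trip per tool call. ToyScreen, CountingClient, and the split-side tool names click and read_screen are hypothetical stand-ins; nothing here talks to a real MCP server.

```python
from dataclasses import dataclass, field


@dataclass
class ToyScreen:
    """Stand-in for a live UI: a set of visible element labels."""
    elements: set = field(default_factory=lambda: {"Save", "Cancel"})

    def click(self, label: str) -> None:
        # In this toy model, clicking "Save" makes a "Confirm" button appear.
        if label == "Save":
            self.elements.add("Confirm")


class CountingClient:
    """Counts one round trip per tool call, as a JSON-RPC client would incur."""

    def __init__(self, screen: ToyScreen):
        self.screen = screen
        self.round_trips = 0

    # Split pattern: a write tool and a separate read tool.
    def click(self, label: str) -> dict:
        self.round_trips += 1
        self.screen.click(label)
        return {"ok": True}  # no information about the new state

    def read_screen(self) -> list:
        self.round_trips += 1
        return sorted(self.screen.elements)

    # Fused pattern: the write returns the change inline.
    def click_and_traverse(self, label: str) -> dict:
        self.round_trips += 1
        before = set(self.screen.elements)
        self.screen.click(label)
        after = self.screen.elements
        return {"added": sorted(after - before), "removed": sorted(before - after)}


split = CountingClient(ToyScreen())
split.click("Save")
state = split.read_screen()

fused = CountingClient(ToyScreen())
diff = fused.click_and_traverse("Save")

print(split.round_trips, fused.round_trips)
print(diff)
```

Both clients end up knowing that Confirm appeared. The split client needed two calls and had to remember the second one.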

Where Stale State Bites An Agent

The LLM-forgets-to-refresh class of bug is the practical reason the macOS-use server fuses. Same target, same dialog, two paths. Watch what the agent has to remember in each case.

Dismissing a dialog that spawns a confirmation modal

An MCP client calls click_button to dismiss a dialog. The server returns success. The LLM moves on, calls type_text into what it thinks is the now-focused field underneath. But the dialog actually spawned a second confirmation modal that grabbed focus. The type goes into the modal's text field instead of the intended one. The agent had a stale model of the screen because the click tool only returned a boolean. To avoid this, the LLM has to call read_screen after every click, every time. In practice agents forget. The bug surface scales linearly with action count.

click returns success only, no information about new modal
LLM has to remember to call read_screen after every click
If forgotten, next action targets a stale tree
Bug surface scales with action count
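The failure can be reproduced in a few lines. This is a toy simulation under stated assumptions: ToyApp, its focus model, and both agent functions are hypothetical, and the fused tool reports only a focus change rather than a full tree diff.

```python
class ToyApp:
    """Toy UI: a dialog over a document field. Dismissing the dialog spawns a modal."""

    def __init__(self):
        self.focused = "dialog"
        self.fields = {"document": "", "confirm_modal": ""}

    def click_dismiss(self) -> bool:
        # The dismissal spawns a confirmation modal that grabs focus.
        self.focused = "confirm_modal"
        return True

    def type_text(self, text: str) -> None:
        # Typing lands in whatever currently has focus.
        self.fields[self.focused] = self.fields.get(self.focused, "") + text


def click_and_report(app: ToyApp) -> dict:
    """Fused tool sketch: perform the click and return what changed."""
    focus_before = app.focused
    app.click_dismiss()
    return {"focus_before": focus_before, "focus_after": app.focused}


def split_agent(app: ToyApp) -> None:
    app.click_dismiss()     # returns success only
    app.type_text("hello")  # agent assumes the document field has focus


def fused_agent(app: ToyApp) -> dict:
    diff = click_and_report(app)
    if diff["focus_after"] == "document":
        app.type_text("hello")
    return diff  # otherwise the agent handles the modal first


split_app = ToyApp()
split_agent(split_app)

fused_app = ToyApp()
fused_diff = fused_agent(fused_app)

print(split_app.fields)
print(fused_app.fields, fused_diff)
```

The split agent's text ends up in the modal's field. The fused agent never types, because the return value of the click already told it that focus moved to the modal.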

The 9 Tools, Named

The aggregate array at Sources/MCPServer/main.swift:1482 lists every tool the server registers. Eight names end in _and_traverse. One does not. The naming is the contract: if you see the suffix, you know the tool returns the post-action tree.

Fused write+read (8)

open_application_and_traverse
click_and_traverse
type_and_traverse
press_key_and_traverse
scroll_and_traverse
set_value_and_traverse
press_ax_and_traverse
set_selected_and_traverse

Pure read (1)

refresh_traversal

Tool description verbatim: "Useful for getting the current UI state without performing an action."
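Because the suffix is the contract, a client can classify tools without any other metadata. A short sketch using the nine names listed above; the classification rule itself is the only assumption.

```python
TOOL_NAMES = [
    "open_application_and_traverse",
    "click_and_traverse",
    "type_and_traverse",
    "press_key_and_traverse",
    "scroll_and_traverse",
    "set_value_and_traverse",
    "press_ax_and_traverse",
    "set_selected_and_traverse",
    "refresh_traversal",
]

SUFFIX = "_and_traverse"

# A name carrying the suffix promises a post-action tree in the return value.
fused = [name for name in TOOL_NAMES if name.endswith(SUFFIX)]
pure_reads = [name for name in TOOL_NAMES if not name.endswith(SUFFIX)]

print(len(fused), pure_reads)
```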

What The Diff Actually Looks Like

After every fused call the server walks the new accessibility tree, compares it to the pre-action snapshot, and serializes a four-part diff. Each fused tool case in the dispatch block invokes buildDiffSummary inline (search the file: 6 hits in the dispatch case block). The summary line is what the LLM reads first; the full traversal is written to a side file the LLM can grep when it needs detail.

Summary returned by click_and_traverse

Clicked at (132, 280). 1 added, 0 removed, 2 modified.

text_changes (up to 3, truncated to 60 chars per side)

'Are you sure?' -> ''
'OK' -> 'OK (disabled)'

file

/tmp/macos-use/1746028923_click_and_traverse.txt

Source: main.swift lines 791 (buildDiffSummary call inside click case), 856 (lines.append summary line), 858-878 (text_changes block).
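A minimal sketch of how such a summary could be built from two snapshots. It is not the server's Swift code: summarize_diff is a hypothetical name, snapshots are modeled as dicts from an assumed stable element id to its text, and attribute changes are left out. Only the three-sample cap and the 60-character truncation come from the description above.

```python
def summarize_diff(action, before, after, max_samples=3, max_chars=60):
    """Return a summary line followed by up to max_samples text changes.

    before and after map a (hypothetical) stable element id to its text.
    """
    added = [key for key in after if key not in before]
    removed = [key for key in before if key not in after]
    modified = [key for key in after if key in before and before[key] != after[key]]

    lines = [
        f"{action}. {len(added)} added, {len(removed)} removed, {len(modified)} modified."
    ]
    for key in modified[:max_samples]:
        old = before[key][:max_chars]  # truncate each side independently
        new = after[key][:max_chars]
        lines.append(f"'{old}' -> '{new}'")
    return "\n".join(lines)


before = {"label": "Are you sure?", "ok": "OK", "cancel": "Cancel"}
after = {"label": "", "ok": "OK (disabled)", "cancel": "Cancel", "spinner": "Working"}

summary = summarize_diff("Clicked at (132, 280)", before, after)
print(summary)
```

With these inputs the sketch produces the same summary line and text changes as the example response above.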

When You Should Split, When You Should Fuse

The decision is about the surface, not the LLM. Ask: after a successful write, does the LLM still need to read the surface? If yes, fuse. If no, split.

Fuse when the surface is a UI tree, a simulator state, a board position, a chat thread the LLM is moderating, a live document the LLM is editing, or any other state the LLM has to keep modeling between actions. The post-write read is going to happen anyway; making it part of the write eliminates a class of bugs and a round trip.

Split when the read is independent: a database (you query different tables than the one you wrote to), a filesystem (writing one file does not change what you read from another), an API gateway (POST and GET hit different endpoints), a queue (enqueue and dequeue are independent operations). Splitting preserves clean readOnlyHint annotations and lets clients auto-allow the cheap read tools.

Edge case: if the write affects what the read returns but only sometimes (for example, INSERT into one table and sometimes the LLM wants to SELECT from the joined view), keep them split and let the LLM choose to call the read. Fusing in that case would force a read the LLM did not need.

What Fusion Costs

Two real downsides. First, the readOnlyHint annotation is forfeit: a fused tool mutates state, so it cannot mark itself read-only. Clients that auto-allow read-only tools and prompt on writes will prompt on every fused call. For UI driving that is correct (the user wants to know when the agent is going to click). For a database server the same rule would be too noisy and the right call would be to split.

Second, the inline diff inflates response size. Typical macOS-use traversals are 100 to 800 elements; a click usually mutates 1 to 20. The diff stays small because it is only the changed nodes, but the full tree is also written to a file the agent can grep. If your surface has thousands of elements that mutate often, fusion will push more bytes than splitting plus an opt-in read.

Neither is a dealbreaker for live-state surfaces. Both are real for transactional surfaces. The decision tree is short and the answer is usually obvious; the mistake is doing the same thing on every server because you saw it once in the reference implementation.

“System Events is just a wrapper around public [Obj]C system APIs, so you could bypass AppleScript and call those APIs directly.”

Apple Developer Documentation

Technical Q&A QA1888 (referenced because the macOS-use server's fused pattern is built directly on those APIs)

Verify The Tool Counts Yourself

You do not have to take any of the numbers above on trust. The whole repo is one Swift file (2056 lines as of this writing) plus a small input-guard module. Eight steps, all reproducible from the public source.

Reproduce the 8-fused / 1-pure split from the source

Clone github.com/mediar-ai/mcp-server-macos-use
Open Sources/MCPServer/main.swift in your editor
Search for _and_traverse: 8 hits in the tool name strings
Search for refresh_traversal: 1 hit, the only pure read
Read line 1482: the aggregate array literal lists all 9 tools by name
Read line 791 (clickTool case): buildDiffSummary is invoked inline after the click
Run the server in stdio mode, click into a small AppKit app, observe the diff in the response JSON
Compare against an MCP server that splits write and read; count the round trips

Frequently asked questions

What is the practical difference between an MCP read tool and an MCP write tool?

MCP itself does not have a separate primitive for the two. Both are Tools. What it has are annotations: readOnlyHint, idempotentHint, destructiveHint, and openWorldHint. A read tool is a Tool whose readOnlyHint is true and that does not mutate any external state. A write tool is the inverse. Annotations are hints, not enforcement. The MCP spec says the same thing: clients use them for UX (do I prompt before invoking?), not for sandboxing. So the read-vs-write distinction lives entirely in the server's design choices.

Why does the macOS-use MCP server fuse write and read instead of exposing them as separate tools?

Because the surface is a live UI. After clicking a button, the only useful next thing for the LLM to know is what changed in the accessibility tree. Splitting click and refresh into two tools means every click is followed by a refresh, every time, with no exceptions. That is two round trips for one logical operation, and it introduces a class of bugs where the LLM forgets to refresh and reasons against stale tree state. Fusing them into click_and_traverse eliminates the second round trip and makes the post-action state structurally part of the tool's return value. The server returns a diff (added, removed, modified, attribute changes) so the LLM sees exactly what the click moved, not just the new state.

How many fused tools does macOS-use ship and how many pure read tools?

Nine tools total. Eight are fused write+read: open_application_and_traverse, click_and_traverse, type_and_traverse, press_key_and_traverse, scroll_and_traverse, set_value_and_traverse, press_ax_and_traverse, set_selected_and_traverse. One is pure read: refresh_traversal. The aggregate list lives at Sources/MCPServer/main.swift line 1482, and the dispatch case block runs from line 777 to line 853. Every one of the eight fused tools ends in _and_traverse in its tool name string, on purpose. The naming is the contract.

When should I split read from write in my own MCP server?

When the read is independent of the write and the LLM does not need it after every write. A database MCP server is the canonical case: the LLM rarely needs the entire table after an INSERT. The filesystem reference server splits read_file from write_file for the same reason. A vector-search MCP server splits search_documents from index_document because indexing is async and search is independent. Fuse when the write changes a stateful surface the LLM has to keep modeling: a UI tree, a board game, a simulator, a chat thread, a live document. The test is: after the write, does the LLM still need the surface? If yes, fuse.

What is in the diff that fused tools return?

Four arrays: added, removed, modified, and attribute changes. After click_and_traverse, the macOS-use server walks the accessibility tree, compares it to the pre-action snapshot, and returns the structural diff plus a small text-changes block (up to three modified text or AXValue fields, truncated to 60 characters per side). The diff is what tells the LLM whether the click did anything. If the diff is empty, the click most likely had no effect. If the diff shows a new modal window appeared, the LLM knows to interact with the modal next. The summary line is appended at main.swift line 856, and the text-changes block lives at lines 858 to 878.

Does fusing write and read make the tools harder to annotate?

Yes. A fused tool cannot honestly set readOnlyHint to true because it mutates state, even though the bulk of its return value is observation. The macOS-use server simply does not lie in the annotation: every fused tool is implicitly a write tool from the spec's perspective. The pure read (refresh_traversal) is the one tool that could carry readOnlyHint true. This is a real downside of fusion: clients that auto-allow read-only tools but prompt for writes will prompt on every fused call. For UI driving that is correct behavior. For other surfaces it would be too noisy.

Could I implement the same diff pattern with separate read and write tools?

You can, and several MCP servers do. The pattern is: write tool returns a snapshot ID, then a separate read_diff tool takes a before-snapshot-ID and an after-snapshot-ID and returns the diff. It works, and it preserves clean readOnlyHint annotations. The cost is one extra round trip and a stateful snapshot store on the server. The macOS-use server skipped that path because the diff is small (typical AX traversals are 100 to 800 elements, and a click usually mutates 1 to 20 of them) and shipping it inline costs very little. The choice is taste and surface, not correctness.
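A rough sketch of that split alternative, to make the extra state visible. SnapshotStore, SplitServer, and the shape of the click return value are hypothetical; only the idea of snapshot IDs plus a read_diff tool comes from the answer above.

```python
import itertools


class SnapshotStore:
    """Server-side state that the split design needs and the fused design does not."""

    def __init__(self):
        self._snapshots = {}
        self._ids = itertools.count(1)

    def save(self, tree: dict) -> int:
        snapshot_id = next(self._ids)
        self._snapshots[snapshot_id] = dict(tree)  # copy, so later writes do not alter it
        return snapshot_id

    def diff(self, before_id: int, after_id: int) -> dict:
        before = self._snapshots[before_id]
        after = self._snapshots[after_id]
        shared = before.keys() & after.keys()
        return {
            "added": sorted(after.keys() - before.keys()),
            "removed": sorted(before.keys() - after.keys()),
            "modified": sorted(k for k in shared if before[k] != after[k]),
        }


class SplitServer:
    def __init__(self, tree: dict):
        self.tree = dict(tree)
        self.store = SnapshotStore()

    def click(self, element_id: str) -> dict:
        """Write tool: mutates the toy tree and returns snapshot IDs only."""
        before_id = self.store.save(self.tree)
        self.tree[element_id] += " (pressed)"
        self.tree["modal"] = "Are you sure?"
        after_id = self.store.save(self.tree)
        return {"before": before_id, "after": after_id}

    def read_diff(self, before_id: int, after_id: int) -> dict:
        """Read tool: no mutation, so readOnlyHint true would be honest here."""
        return self.store.diff(before_id, after_id)


server = SplitServer({"ok": "OK"})
ids = server.click("ok")                              # round trip 1
changes = server.read_diff(ids["before"], ids["after"])  # round trip 2
print(ids, changes)
```

The store grows with every write in this sketch. A real server would also need an eviction policy, which is part of the cost the fused design avoids.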

Do other MCP servers fuse write and read this way?

Some. Browser-use MCP servers tend to fuse navigate_and_screenshot or click_and_screenshot for the same reason: the LLM driving a page needs the new page after every action. The Playwright MCP server has a click action whose return includes a snapshot reference. Database servers usually do not fuse: queries and mutations are kept distinct. Filesystem servers do not fuse: read_file and write_file are separate, and that is the right call because writing a file does not implicitly change what the LLM should read next. Fusion is correct for live-state surfaces; splitting is correct for transactional surfaces.

What does the dispatch code look like for a fused tool?

In main.swift around line 787, the case for click_and_traverse picks the click handler, runs it, then runs the traversal in the same call frame, then builds a summary that includes a buildDiffSummary call (line 791). The summary line ends with the diff. There is no second tool dispatch, no second JSON-RPC round trip, and no second permission prompt. The traversal cost is amortized across the click cost because both go through the same AXUIElement timeout and the same response serializer.

What happens if I want only the read on a tool that is fused?

Use refresh_traversal. It is the dedicated pure read at main.swift line 1383 and its description literally says "Useful for getting the current UI state without performing an action." It takes a PID and returns the same structured tree the fused tools return. The reason to keep it separate is exactly this case: the LLM sometimes needs a fresh tree without doing anything (the user reopened the app, time passed, an external event likely changed the UI). Splitting refresh from the eight write tools is the right call. Splitting click from refresh would not be.

Is the read on a fused write tool free, or does it cost as much as a separate read?

It costs slightly less than a separate read. Same AXUIElementCreateApplication call, same tree walk, same serialization. The savings come from skipping the JSON-RPC encode/decode cycle, the MCP transport hop, the permission check on the client (if it prompts on every tool call), and the LLM context budget for an extra tool call message. On macOS 14 the entire fused click_and_traverse on a 400-element app finishes in 220 to 380ms end to end, of which roughly 180 to 320ms is the traversal. A separate refresh would add another 200ms on average. Three writes plus three reads in sequence are about 1.2 seconds (3 × 400ms); three fused calls at roughly 280ms each are about 840ms.

Does fusion confuse the LLM about what the tool actually did?

It can if the diff is presented poorly. The macOS-use server addresses this by writing the diff into a structured summary at the top of the response (see main.swift lines 856 to 878) and writing the full traversal to a side file the LLM only reads if it needs to. So the LLM gets a one-line summary like "Clicked at (132, 280). 1 added, 0 removed, 2 modified." plus a file path it can grep. The LLM never has to wade through an 800-element tree to find what changed. The fused contract stays legible within the model's context budget instead of overwhelming it.
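A small sketch of the summary-plus-side-file shape. write_response is a hypothetical helper, the tree lines are made up, and the timestamp_toolname file naming is inferred from the example path shown earlier rather than read from main.swift. The sketch writes into a temporary directory instead of /tmp/macos-use.

```python
import tempfile
import time
from pathlib import Path


def write_response(tool_name, summary, tree_lines, out_dir):
    """Write the full traversal to a side file and return only summary + path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{int(time.time())}_{tool_name}.txt"
    path.write_text("\n".join(tree_lines) + "\n", encoding="utf-8")
    # The response stays small no matter how large the tree is.
    return {"summary": summary, "file": str(path)}


# Illustrative tree lines; a real traversal would have hundreds of these.
tree_lines = [
    "AXWindow 'Untitled'",
    "AXButton 'OK (disabled)'",
    "AXButton 'Cancel'",
]

response = write_response(
    "click_and_traverse",
    "Clicked at (132, 280). 1 added, 0 removed, 2 modified.",
    tree_lines,
    tempfile.mkdtemp(),
)
print(response["summary"])
print(response["file"])
```

The model reads the one-line summary on every call and opens the file only when the summary is not enough.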