How To Control Someone's Screen On FaceTime When The Viewer Cannot See Your Cursor: The Accessibility-Tree Diff That Narrates Every Click
Apple's native FaceTime remote control (iOS 18, macOS 15) is one workflow. The other workflow keeps the cursor with the host, puts an AI in the middle, and lets the remote viewer direct by voice while the AI narrates. The piece that makes the narration work is not the video feed. It is the flat-text accessibility-tree diff mcp-server-macos-use writes after every disruptive tool call: one line per added, removed, or modified element, with attribute-level before and after. The diff plus a PNG with a red crosshair at the click point survive on disk as a per-call receipt.
The SERP Thinks You Want Someone Else Driving Your Cursor
Search the keyword and every top result tells you to use FaceTime's built-in remote control, Zoom remote control, TeamViewer, AnyDesk, or macOS Screen Sharing. Different products, same workflow: the remote person moves your cursor directly. That workflow has its place (support calls where the remote expert has to touch the UI), and Apple's native feature is genuinely good on iOS 18 and macOS 15, outside the EU.
This page is about the workflow they miss. The remote person never gets the cursor. An AI on your Mac does, via mcp-server-macos-use. The remote person talks, the AI acts, and after every action the AI narrates what changed. The question that workflow raises is: how does the remote person know the click worked? The answer is not the SharePlay video feed. It is the accessibility-tree diff written to the server's flat-text response.
Two ways to know what happened after a click
Way one: watch the pixels. The remote viewer watches the compressed 30fps SharePlay stream and tries to spot the change. A Send button going from disabled to enabled is often one or two pixels of gray shift, easily lost to H.264 blocking. A label swapping from 'Send' to 'Sending…' may or may not survive a compression pass. Narration depends on visual acuity and luck.
- Relies on what the encoder preserved
- Small state flips are often invisible
- No persistent record after the call
- Cursor position muddies the signal
Way two: read the diff. “The response file opens with '# diff: +N added, -N removed, ~N modified' at main.swift:1008. Modified lines use the exact shape ~ [AXButton] "Send" | AXEnabled: 'false' -> 'true' at main.swift:1024-1026. The paired PNG at /tmp/macos-use/<ts>_<tool>.png carries a 15pt red crosshair and a 10pt circle at lastClickPoint, drawn by ScreenshotHelper/main.swift:70-85. Both files survive the FaceTime call as a per-action receipt.”
Sources/MCPServer/main.swift and Sources/ScreenshotHelper/main.swift
One Tool Call, Three Outputs
Every disruptive tool call (click, type, press, scroll) produces the same three artifacts. The compact summary is what the MCP client sees inline. The .txt file is the full diff for grep. The .png is the crosshair receipt. They are all keyed to the same millisecond timestamp so you can pair them up after the call.
One call, three receipts
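A hypothetical directory listing after two clicks and one type action, assuming the <ts>_<tool> naming described below (timestamps and ordering invented for illustration):

```text
/tmp/macos-use/
  1713456789012_click_and_traverse.txt   # full diff for the first click
  1713456789012_click_and_traverse.png   # crosshair receipt, same timestamp
  1713456791200_type_and_traverse.txt
  1713456791200_type_and_traverse.png
  1713456792512_click_and_traverse.txt
  1713456792512_click_and_traverse.png
```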
The Diff Block, Verbatim From main.swift
Three loops, one header. Added elements print with a plus prefix, removed with a minus, modified with a tilde. The interesting case is the modified loop. Every changed attribute becomes one '<name>: old -> new' fragment, joined by commas, tail-appended after a pipe. That is the shape the AI reads to narrate; it is also the shape a human reader can scan to answer “did the click do anything?”
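A minimal Swift sketch of the shape those three loops produce. The Element and Change types and the loop bodies here are illustrative, not the repo's verbatim code; only the +/-/~ line shapes are taken from the quoted format.

```swift
struct Element { let role: String; let text: String }
struct Change { let element: Element; let attrs: [(name: String, old: String, new: String)] }

// Header first, then one loop per bucket: + added, - removed, ~ modified.
func formatDiff(added: [Element], removed: [Element], modified: [Change]) -> String {
    var lines = ["# diff: +\(added.count) added, -\(removed.count) removed, ~\(modified.count) modified"]
    for e in added    { lines.append("+ [\(e.role)] \"\(e.text)\"") }
    for e in removed  { lines.append("- [\(e.role)] \"\(e.text)\"") }
    for c in modified {
        // Each changed attribute becomes one "<name>: 'old' -> 'new'" fragment,
        // comma-joined and appended after a pipe.
        let tail = c.attrs.map { "\($0.name): '\($0.old)' -> '\($0.new)'" }
                          .joined(separator: ", ")
        lines.append("~ [\(c.element.role)] \"\(c.element.text)\" | \(tail)")
    }
    return lines.joined(separator: "\n")
}
```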
What The Remote Viewer Actually Hears On The Call
Step by step, here is the loop a single “click the Send button” request runs through. The remote viewer is on the far side of a FaceTime call, SharePlay is active, and the host has mcp-server-macos-use wired into an MCP client. Nothing in this loop depends on what the remote viewer sees in the video feed.
1. Remote viewer narrates: 'click Send'
Voice or text from the remote side of the FaceTime call. The host forwards it to the MCP client. FaceTime carries no input from the remote side in this workflow.
2. AI picks a tool and calls it
The MCP client issues macos-use_click_and_traverse with (pid, element: 'Send'). The handler at main.swift:1474 resolves the element, runs the click, and builds the diff via buildToolResponse at main.swift:612.
3. Server filters noise out of the diff
Scroll-bar elements are dropped by isScrollBarNoise at main.swift:591. Structural containers without text (AXRow, AXCell, AXColumn, AXMenu) are dropped by isStructuralNoise at main.swift:600-607. Coordinate-only modified entries are dropped at main.swift:681-682.
4. Server writes the receipt pair
main.swift:1827-1829 writes the flat-text response to /tmp/macos-use/<timestamp>_<tool>.txt. main.swift:1834-1839 launches the screenshot-helper subprocess to capture the window and draw the crosshair, writing the PNG with the same timestamp.
5. AI narrates from the summary
The summary lines 'summary: Clicked element Send. 0 added, 0 removed, 1 modified.' and 'text_changes:' feed the AI's response: 'Okay, Send fired. The button is greyed out now and the composer is empty.' The remote viewer hears that, not video interpretation.
6. After the call, the pair is your audit trail
Every action taken during the call left a .txt + .png in /tmp/macos-use/. Timestamps are in milliseconds so ordering is preserved. If something went wrong, you can reconstruct exactly what the AI clicked, where the cursor was, and what the accessibility tree reported afterward.
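A hypothetical compact summary, the terse block the AI reads in the narration step above (every value here is invented for illustration, following the fragments quoted in the FAQ below):

```text
summary: Clicked element Send. 0 added, 0 removed, 1 modified.
text_changes:
  'Send' -> 'Sending…'
file: /tmp/macos-use/1713456789012_click_and_traverse.txt
hint: grep -n AXButton /tmp/macos-use/1713456789012_click_and_traverse.txt
```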
A Real Diff Response, Line By Line
What the AI sees when it reads /tmp/macos-use/<ts>_click_and_traverse.txt after a click on the Send button in Mail. The header counts, the modified block carries the AXEnabled flip and the text swap, the added block surfaces a spinner that appeared in the toolbar.
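A hypothetical reconstruction of that file, assuming the line shapes quoted earlier (the coordinates, the spinner element, and the direction of the AXEnabled flip are invented):

```text
# diff: +1 added, -0 removed, ~1 modified
+ [AXProgressIndicator (progress indicator)] "" x:702 y:18 w:16 h:16
~ [AXButton] "Send" | AXEnabled: 'true' -> 'false', text: 'Send' -> 'Sending…'
```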
The Receipt Pair, Written On One Timestamp
The .txt and .png share the same ms-precision timestamp by construction, not by coincidence. Both filenames are built at main.swift:1827 and main.swift:1834 from the single timestamp captured at main.swift:1825. So sorting /tmp/macos-use/ by name is sorting by chronological order, and the pair is always adjacent.
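A sketch of that construction, using the timestamp expression quoted in the FAQ below (variable names hypothetical):

```swift
import Foundation

// One timestamp, captured once, shared by both filenames.
let ts = Int(Date().timeIntervalSince1970 * 1000)           // ms precision
let tool = "macos-use_click_and_traverse"
    .replacingOccurrences(of: "macos-use_", with: "")       // strip the prefix
let txtPath = "/tmp/macos-use/\(ts)_\(tool).txt"            // diff receipt
let pngPath = "/tmp/macos-use/\(ts)_\(tool).png"            // crosshair receipt
```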
The Crosshair, Verbatim
The crosshair is a separate binary (ScreenshotHelper) so the main server never links against Quartz drawing paths it does not otherwise need. The helper reads --click-point from argv, captures the window with CGWindowListCreateImage, then draws a red 2pt stroke through the point with a 10pt circle around it. The point is scaled into image space via scaleX and scaleY computed from the window rect at ScreenshotHelper/main.swift:55-58.
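A minimal sketch of the drawing pass, assuming the capture has already happened and the click point is already scaled into image space. The function and parameter names are illustrative, not the repo's; the 2pt stroke, 15pt arms, and 10pt circle follow the description above.

```swift
import CoreGraphics

func stampCrosshair(on image: CGImage, at p: CGPoint) -> CGImage? {
    guard let ctx = CGContext(data: nil,
                              width: image.width, height: image.height,
                              bitsPerComponent: 8, bytesPerRow: 0,
                              space: CGColorSpaceCreateDeviceRGB(),
                              bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue)
    else { return nil }
    ctx.draw(image, in: CGRect(x: 0, y: 0, width: image.width, height: image.height))
    ctx.setStrokeColor(CGColor(red: 1, green: 0, blue: 0, alpha: 1))
    ctx.setLineWidth(2)                                   // 2pt red stroke
    let arm: CGFloat = 15                                 // 15pt crosshair arms
    ctx.move(to: CGPoint(x: p.x - arm, y: p.y))
    ctx.addLine(to: CGPoint(x: p.x + arm, y: p.y))
    ctx.move(to: CGPoint(x: p.x, y: p.y - arm))
    ctx.addLine(to: CGPoint(x: p.x, y: p.y + arm))
    ctx.strokePath()
    ctx.strokeEllipse(in: CGRect(x: p.x - 10, y: p.y - 10, width: 20, height: 20)) // 10pt radius ring
    return ctx.makeImage()
}
```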
One Call, Four Actors, Framed Around The Diff
The remote FaceTime viewer, the host's FaceTime (sharing the screen), the host's AI client (running MCP), and mcp-server-macos-use. Notice how the diff flows left-to-right and the video flows right-to-left. They are independent channels.
Click -> diff -> narration, by actor
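A rough sketch of the two channels (the arrow directions are the point, not the layout):

```text
diff channel:   mcp-server-macos-use --diff--> AI client --narration--> remote viewer
video channel:  remote viewer <--SharePlay-- host FaceTime <--pixels-- host screen
```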
Against The Top SERP Workflows, Row By Row
| Feature | FaceTime remote control / Zoom / TeamViewer | macos-use MCP + FaceTime SharePlay |
|---|---|---|
| Who drives the cursor | the remote person, directly | the AI on the host, never the remote |
| How the remote party knows a click landed | their own eyes on the pixel stream | structured AX diff narrated by the AI |
| Click evidence after the call | none by default | .txt + .png pair in /tmp/macos-use/ |
| Works without Apple contacts relationship | remote control requires contacts | yes, any FaceTime call works |
| Available in the EU | FaceTime remote control: no | yes, no regional gate |
| State changes invisible to video compression | often lost to H.264 blocking | captured in the diff (AXEnabled, AXValue) |
| Grep-able audit trail per action | a screen recording, if you remembered to start one | yes; main.swift:761 prints the grep command |
Why The Pair Matters, By Situation
The click seemed to do nothing on SharePlay
Grep the tool's most recent .txt. If the diff says '0 added, 0 removed, 0 modified', the click really did nothing. If it says '1 modified', the UI changed but your viewer missed the pixel shift. Open the .png to see exactly where the crosshair landed.
You want to file a repro for a flaky app
Zip /tmp/macos-use/<ts>*.txt and <ts>*.png for the affected call range. You now have a timeline of accessibility state + click crosshairs for every action, no screen recording needed.
The remote viewer is on a bad connection
SharePlay may be dropping to a few fps. That does not matter. The diff is already on the wire from your AI client; the narration does not depend on the video reaching them cleanly.
A click silently launched another app
main.swift:1788-1808 detects the cross-app handoff, re-traverses the new frontmost app, and appends 'app_switch:' to the .txt. Your AI narrates 'that opened Mail, here is its window' without waiting for the video feed to resolve.
Frequently asked questions
What exactly does the accessibility-tree diff look like in the response file?
Three blocks under a header. The header is 'diff: +N added, -N removed, ~N modified' written at main.swift:1008. Added elements print with a plus prefix at main.swift:1014 ('+ [AXButton (button)] "Send" x:820 y:612 w:60 h:28'). Removed elements print with a minus prefix at main.swift:1017. Modified elements print with a tilde prefix at main.swift:1026 in the shape ~ [AXButton] "Send" | AXEnabled: 'false' -> 'true'. The full response is written to /tmp/macos-use/<timestamp>_<tool>.txt so you can grep it later.
Why is that format good for the remote viewer on FaceTime instead of just watching the video?
SharePlay encodes at roughly 30fps and compresses text aggressively. Small UI state changes, like a disabled button going to enabled or a label swap from 'Send' to 'Sending…', are routinely lost to compression blur. The diff is unambiguous: the exact element role, the exact before and after text, the AXEnabled change. The host's AI reads 'AXButton Send changed AXEnabled false -> true' and narrates 'the Send button is enabled now' without ever inspecting a video frame.
Which tools return a diff and which return a full traversal?
The switch is inside buildToolResponse at main.swift:612 on the hasDiff flag. hasDiff is true for click, type, press, scroll — the four that mutate UI state. open_application and refresh_traversal return a full traversal instead, written out by the branch at main.swift:720-722. So the diff format is specific to mutation calls, which is the useful case during a FaceTime session. You do not need a full dump of the accessibility tree after every click, just what changed.
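A hypothetical reconstruction of that branch. The flag name hasDiff and the click/type/press/scroll split come from the source; the exact tool-name strings and the function shape are assumptions.

```swift
// Mutating tools get a diff; traversal tools get the full element dump.
func buildBody(toolName: String, diffText: String, traversalText: String) -> String {
    let mutating: Set<String> = [
        "macos-use_click_and_traverse", "macos-use_type_and_traverse",
        "macos-use_press_and_traverse", "macos-use_scroll_and_traverse",
    ]
    let hasDiff = mutating.contains(toolName)
    return hasDiff ? diffText : traversalText   // open_application, refresh_traversal fall through
}
```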
Where does the red crosshair in the screenshot come from?
ScreenshotHelper/main.swift:70-85. After CGWindowListCreateImage captures the frontmost window, ScreenshotHelper draws a 2pt red stroke crosshair with 15pt arms centered at lastClickPoint, plus a 10pt radius circle around it. The click coordinates are passed from main.swift:1839 via the --click-point flag on the helper subprocess. lastClickPoint is set per-call at the click_and_traverse handler site, so the PNG shows where the cursor landed even though the cursor itself has already snapped back.
Where does the .txt file come from and how is it named?
main.swift:1825-1829. The handler builds a timestamp in milliseconds ('Int(Date().timeIntervalSince1970 * 1000)'), strips the 'macos-use_' prefix from the tool name, and writes the response to '/tmp/macos-use/<ts>_<toolname>.txt'. The screenshot at main.swift:1834-1839 reuses the same timestamp so the .txt and .png names match. If you collect five clicks in one call they will be 1713456789012_click_and_traverse.txt through 1713456792512_click_and_traverse.txt, each paired with its own PNG.
Does filtering remove noise from the diff, or is every accessibility change surfaced?
Filtering happens in buildToolResponse at main.swift:648-718. Scroll-bar elements are dropped by isScrollBarNoise (main.swift:591). Structural containers like AXRow, AXCell, AXColumn, AXMenu without text are dropped by isStructuralNoise at main.swift:600-607. Coordinate-only changes (x, y, width, height attributes) are filtered out of modified entries at main.swift:681-682. What you are left with is role + text + the semantic attribute that flipped, which is exactly what narrates well.
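A sketch of those two predicates under stated assumptions: the predicate names and the role list follow the description above, but the element shape and exact string checks are illustrative, not the repo's code.

```swift
struct AXElementInfo { let role: String; let text: String? }

// Scroll bars never narrate well; drop them outright.
func isScrollBarNoise(_ e: AXElementInfo) -> Bool {
    e.role == "AXScrollBar"
}

// Structural containers only count as noise when they carry no text.
func isStructuralNoise(_ e: AXElementInfo) -> Bool {
    let structural: Set<String> = ["AXRow", "AXCell", "AXColumn", "AXMenu"]
    return structural.contains(e.role) && (e.text ?? "").isEmpty
}
```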
What does 'text_changes' mean in the compact summary the MCP client actually sees?
The tool returns a short summary to the MCP client, with the full diff written to the .txt file. The summary at main.swift:838-857 collects up to three modified elements whose changed attribute is 'text' or 'AXValue' and prints them as 'text_changes:' followed by 'old' -> 'new' lines. That is the terse signal the AI reads first. If it wants more, the 'file:' line tells it where to grep. The hint line at main.swift:761 even shows the grep command: 'hint: grep -n AXButton <filepath>'.
Can the remote viewer or their AI read the .txt file directly?
No, only the host's AI can. The .txt and .png live in /tmp/macos-use/ on the host machine. The MCP client (running on the host) sees the summary, then can shell out to read the full file if it decides to. The remote viewer sees neither; they see the host AI's narration and the SharePlay video feed. The receipt pair is for the host: it is what they hand a teammate, an auditor, or a bug report after the call to say 'this is exactly what happened'.
Does the diff tell you if the action silently opened a different app?
Yes, via the cross-app handoff section at main.swift:1788-1808. If hasDiff is true and the frontmost app PID changed from the one passed to the tool, the handler sets toolResponse.appSwitchPid and re-traverses the new frontmost app. The .txt file then appends a second 'app_switch:' header followed by the new app's element list (main.swift:1031-1036). The summary includes 'app_switch: <App> (PID: N) is now frontmost'. So the AI narrates 'that click launched Mail, here is its new window'.
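A hypothetical tail of the .txt after such a handoff (app name, PID, and elements invented; the 'app_switch:' header shape is from the source):

```text
app_switch: Mail (PID: 4821) is now frontmost
[AXWindow] "Inbox" x:0 y:25 w:1440 h:875
[AXButton (button)] "New Message" x:12 y:40 w:110 h:28
...
```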
What if the click did nothing — is the diff empty or is there a default message?
buildDiffSummary at main.swift:888-894 returns 'No changes.' when all three arrays are empty, and that string is appended to the one-line summary. So a click that landed on a non-interactive element, or an AXButton that did not change state, produces a response like 'Clicked at (420, 300). No changes.' and the .txt file has the header '# diff: +0 added, -0 removed, ~0 modified' followed by a blank element section. The AI can read that and narrate 'nothing happened, try a different spot'.
Why both a .txt and a .png instead of just one? Isn't the diff enough?
The diff describes the post-click world in accessibility terms. The PNG describes where the click physically landed in pixel terms, with the red crosshair showing the exact coordinate. Most of the time you only need the diff. But when an action does nothing, the PNG is the tiebreaker: you can see the crosshair fell on a disabled area, or missed the target, or landed on an overlay you did not know was there. Two formats, two angles on the same event.
Can I clear the receipt files, or will /tmp/macos-use grow forever?
Nothing in the server prunes them. /tmp is cleared by macOS on reboot and by periodic launchd tasks (typically anything untouched for 3 days). For a single FaceTime session you will accumulate on the order of tens to low-hundreds of file pairs. If you need to keep them, copy /tmp/macos-use/ somewhere persistent before rebooting. If you want them gone sooner, 'rm -rf /tmp/macos-use/*' between calls is safe — the directory is recreated by main.swift:1823 before the next write.
Read the diff format, the receipt-pair writer, and the crosshair drawer in one sitting
Three spots, total under 60 lines: main.swift:1007-1028 for the +/-/~ format, main.swift:1821-1840 for the .txt + .png pair, and ScreenshotHelper/main.swift:70-85 for the red crosshair. All open source, MIT-licensed, no accounts, no telemetry.
Browse the repo on GitHub →