How To Control Someone's Screen On FaceTime When The Viewer Cannot See Your Cursor: The Accessibility-Tree Diff That Narrates Every Click
Apple's native FaceTime remote control (iOS 18, macOS 15) is one workflow. The other workflow keeps the cursor with the host, puts an AI in the middle, and lets the remote viewer direct by voice while the AI narrates. The piece that makes the narration work is not the video feed. It is the flat-text accessibility-tree diff mcp-server-macos-use writes after every disruptive tool call: one line per added, removed, or modified element, with attribute-level before and after. The diff plus a PNG with a red crosshair at the click point survive on disk as a per-call receipt.
The SERP Thinks You Want Someone Else Driving Your Cursor
Search the keyword and every top result tells you to use FaceTime's built-in remote control, Zoom remote control, TeamViewer, AnyDesk, or macOS Screen Sharing. Different products, same workflow: the remote person moves your cursor directly. That workflow has its place (support calls where the remote expert has to touch the UI), and Apple's native feature is genuinely good on iOS 18 and macOS 15, outside the EU.
This page is about the workflow they miss. The remote person never gets the cursor. An AI on your Mac does, via mcp-server-macos-use. The remote person talks, the AI acts, and after every action the AI narrates what changed. The question that workflow raises is: how does the remote person know the click worked? The answer is not the SharePlay video feed. It is the accessibility-tree diff written to the server's flat-text response.
Two ways to know what happened after a click
Way one: watch the pixels. The remote viewer watches the compressed 30fps SharePlay stream and tries to spot the change. A Send button going from disabled to enabled is often one or two pixels of gray shift, easily lost to H.264 blocking. A label swapping from 'Send' to 'Sending…' may or may not survive a compression pass. Narration depends on visual acuity and luck.
- Relies on what the encoder preserved
- Small state flips are often invisible
- No persistent record after the call
- Cursor position muddies the signal
Way two: read the diff. “The response file opens with '# diff: +N added, -N removed, ~N modified' at main.swift:1008. Modified lines use the exact shape ~ [AXButton] "Send" | AXEnabled: 'false' -> 'true' at main.swift:1024-1026. The paired PNG at /tmp/macos-use/<ts>_<tool>.png carries a 15pt red crosshair and a 10pt circle at lastClickPoint, drawn by ScreenshotHelper/main.swift:70-85. Both files survive the FaceTime call as a per-action receipt.”
Sources/MCPServer/main.swift and Sources/ScreenshotHelper/main.swift
One Tool Call, Three Outputs
Every disruptive tool call (click, type, press, scroll) produces the same three artifacts. The compact summary is what the MCP client sees inline. The .txt file is the full diff for grep. The .png is the crosshair receipt. They are all keyed to the same millisecond timestamp so you can pair them up after the call.
One call, three receipts
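A hypothetical directory listing after two clicks and one type action, assuming the <ts>_<tool> naming described below (timestamps and ordering invented for illustration):

```text
/tmp/macos-use/
  1713456789012_click_and_traverse.txt   # full diff for the first click
  1713456789012_click_and_traverse.png   # crosshair receipt, same timestamp
  1713456791200_type_and_traverse.txt
  1713456791200_type_and_traverse.png
  1713456792512_click_and_traverse.txt
  1713456792512_click_and_traverse.png
```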
The Diff Block, Verbatim From main.swift
Three loops, one header. Added elements print with a plus prefix, removed with a minus, modified with a tilde. The interesting case is the modified loop. Every changed attribute becomes one '<name>: old -> new' fragment, joined by commas, tail-appended after a pipe. That is the shape the AI reads to narrate; it is also the shape a human reader can scan to answer “did the click do anything?”
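A minimal Swift sketch of the shape those three loops produce. The Element and Change types and the loop bodies here are illustrative, not the repo's verbatim code; only the +/-/~ line shapes are taken from the quoted format.

```swift
struct Element { let role: String; let text: String }
struct Change { let element: Element; let attrs: [(name: String, old: String, new: String)] }

// Header first, then one loop per bucket: + added, - removed, ~ modified.
func formatDiff(added: [Element], removed: [Element], modified: [Change]) -> String {
    var lines = ["# diff: +\(added.count) added, -\(removed.count) removed, ~\(modified.count) modified"]
    for e in added    { lines.append("+ [\(e.role)] \"\(e.text)\"") }
    for e in removed  { lines.append("- [\(e.role)] \"\(e.text)\"") }
    for c in modified {
        // Each changed attribute becomes one "<name>: 'old' -> 'new'" fragment,
        // comma-joined and appended after a pipe.
        let tail = c.attrs.map { "\($0.name): '\($0.old)' -> '\($0.new)'" }
                          .joined(separator: ", ")
        lines.append("~ [\(c.element.role)] \"\(c.element.text)\" | \(tail)")
    }
    return lines.joined(separator: "\n")
}
```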
What The Remote Viewer Actually Hears On The Call
Step by step, here is the loop a single “click the Send button” request runs through. The remote viewer is on the far side of a FaceTime call, SharePlay is active, and the host has mcp-server-macos-use wired into an MCP client. Nothing in this loop depends on what the remote viewer sees in the video feed.
1. Remote viewer narrates: 'click Send'
Voice or text from the remote side of the FaceTime call. The host forwards it to the MCP client. FaceTime carries no input from the remote side in this workflow.
2. AI picks a tool and calls it
The MCP client issues macos-use_click_and_traverse with (pid, element: 'Send'). The handler at main.swift:1474 resolves the element, runs the click, and builds the diff via buildToolResponse at main.swift:612.
3. Server filters noise out of the diff
Scroll-bar elements are dropped by isScrollBarNoise at main.swift:591. Structural containers without text (AXRow, AXCell, AXColumn, AXMenu) are dropped by isStructuralNoise at main.swift:600-607. Coordinate-only modified entries are dropped at main.swift:681-682.
4. Server writes the receipt pair
main.swift:1827-1829 writes the flat-text response to /tmp/macos-use/<timestamp>_<tool>.txt. main.swift:1834-1839 launches the screenshot-helper subprocess to capture the window and draw the crosshair, writing the PNG with the same timestamp.
5. AI narrates from the summary
The summary lines 'summary: Clicked element Send. 0 added, 0 removed, 1 modified.' and 'text_changes:' feed the AI's response: 'Okay, Send fired. The button is greyed out now and the composer is empty.' The remote viewer hears that, not video interpretation.
6. After the call, the pair is your audit trail
Every action taken during the call left a .txt + .png in /tmp/macos-use/. Timestamps are in milliseconds so ordering is preserved. If something went wrong, you can reconstruct exactly what the AI clicked, where the cursor was, and what the accessibility tree reported afterward.
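A hypothetical compact summary, the terse block the AI reads in the narration step above (every value here is invented for illustration, following the fragments quoted in the FAQ below):

```text
summary: Clicked element Send. 0 added, 0 removed, 1 modified.
text_changes:
  'Send' -> 'Sending…'
file: /tmp/macos-use/1713456789012_click_and_traverse.txt
hint: grep -n AXButton /tmp/macos-use/1713456789012_click_and_traverse.txt
```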
A Real Diff Response, Line By Line
What the AI sees when it reads /tmp/macos-use/<ts>_click_and_traverse.txt after a click on the Send button in Mail. The header counts, the modified block carries the AXEnabled flip and the text swap, the added block surfaces a spinner that appeared in the toolbar.
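A hypothetical reconstruction of that file, assuming the line shapes quoted earlier (the coordinates, the spinner element, and the direction of the AXEnabled flip are invented):

```text
# diff: +1 added, -0 removed, ~1 modified
+ [AXProgressIndicator (progress indicator)] "" x:702 y:18 w:16 h:16
~ [AXButton] "Send" | AXEnabled: 'true' -> 'false', text: 'Send' -> 'Sending…'
```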
The Receipt Pair, Written On One Timestamp
The .txt and .png share the same ms-precision timestamp by construction, not by coincidence. Both filenames are built at main.swift:1827 and main.swift:1834 from the single timestamp captured at main.swift:1825. So sorting /tmp/macos-use/ by name is sorting by chronological order, and the pair is always adjacent.
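A sketch of that construction, using the timestamp expression quoted in the FAQ below (variable names hypothetical):

```swift
import Foundation

// One timestamp, captured once, shared by both filenames.
let ts = Int(Date().timeIntervalSince1970 * 1000)           // ms precision
let tool = "macos-use_click_and_traverse"
    .replacingOccurrences(of: "macos-use_", with: "")       // strip the prefix
let txtPath = "/tmp/macos-use/\(ts)_\(tool).txt"            // diff receipt
let pngPath = "/tmp/macos-use/\(ts)_\(tool).png"            // crosshair receipt
```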
The Crosshair, Verbatim
The crosshair is a separate binary (ScreenshotHelper) so the main server never links against Quartz drawing paths it does not otherwise need. The helper reads --click-point from argv, captures the window with CGWindowListCreateImage, then draws a red 2pt stroke through the point with a 10pt circle around it. The point is scaled into image space via scaleX and scaleY computed from the window rect at ScreenshotHelper/main.swift:55-58.
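A minimal sketch of the drawing pass, assuming the capture has already happened and the click point is already scaled into image space. The function and parameter names are illustrative, not the repo's; the 2pt stroke, 15pt arms, and 10pt circle follow the description above.

```swift
import CoreGraphics

func stampCrosshair(on image: CGImage, at p: CGPoint) -> CGImage? {
    guard let ctx = CGContext(data: nil,
                              width: image.width, height: image.height,
                              bitsPerComponent: 8, bytesPerRow: 0,
                              space: CGColorSpaceCreateDeviceRGB(),
                              bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue)
    else { return nil }
    ctx.draw(image, in: CGRect(x: 0, y: 0, width: image.width, height: image.height))
    ctx.setStrokeColor(CGColor(red: 1, green: 0, blue: 0, alpha: 1))
    ctx.setLineWidth(2)                                   // 2pt red stroke
    let arm: CGFloat = 15                                 // 15pt crosshair arms
    ctx.move(to: CGPoint(x: p.x - arm, y: p.y))
    ctx.addLine(to: CGPoint(x: p.x + arm, y: p.y))
    ctx.move(to: CGPoint(x: p.x, y: p.y - arm))
    ctx.addLine(to: CGPoint(x: p.x, y: p.y + arm))
    ctx.strokePath()
    ctx.strokeEllipse(in: CGRect(x: p.x - 10, y: p.y - 10, width: 20, height: 20)) // 10pt radius ring
    return ctx.makeImage()
}
```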
One Call, Four Actors, Framed Around The Diff
The remote FaceTime viewer, the host's FaceTime (sharing the screen), the host's AI client (running MCP), and mcp-server-macos-use. Notice how the diff flows left-to-right and the video flows right-to-left. They are independent channels.
Click -> diff -> narration, by actor
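A rough sketch of the two channels (the arrow directions are the point, not the layout):

```text
diff channel:   mcp-server-macos-use --diff--> AI client --narration--> remote viewer
video channel:  remote viewer <--SharePlay-- host FaceTime <--pixels-- host screen
```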
Against The Top SERP Workflows, Row By Row
| Feature | FaceTime remote control / Zoom / TeamViewer | macos-use MCP + FaceTime SharePlay |
|---|---|---|
| Who drives the cursor | the remote person, directly | the AI on the host, never the remote |
| How the remote party knows a click landed | their own eyes on the pixel stream | structured AX diff narrated by the AI |
| Click evidence after the call | none by default | .txt + .png pair in /tmp/macos-use/ |
| Works without Apple contacts relationship | remote control requires contacts | yes, any FaceTime call works |
| Available in the EU | FaceTime remote control: no | yes, no regional gate |
| State changes invisible to video compression | often lost to H.264 blocking | captured in the diff (AXEnabled, AXValue) |
| Grep-able audit trail per action | a screen recording, if you remembered to start one | yes; main.swift:761 prints the grep command |
Why The Pair Matters, By Situation
The click seemed to do nothing on SharePlay
Grep the tool's most recent .txt. If the diff says '0 added, 0 removed, 0 modified', the click really did nothing. If it says '1 modified', the UI changed but your viewer missed the pixel shift. Open the .png to see exactly where the crosshair landed.
You want to file a repro for a flaky app
Zip /tmp/macos-use/<ts>*.txt and <ts>*.png for the affected call range. You now have a timeline of accessibility state + click crosshairs for every action, no screen recording needed.
The remote viewer is on a bad connection
SharePlay may be dropping to a few fps. That does not matter. The diff is already on the wire from your AI client; the narration does not depend on the video reaching them cleanly.
A click silently launched another app
main.swift:1788-1808 detects the cross-app handoff, re-traverses the new frontmost app, and appends 'app_switch:' to the .txt. Your AI narrates 'that opened Mail, here is its window' without waiting for the video feed to resolve.
Frequently asked questions
What exactly does the accessibility-tree diff look like in the response file?
Three blocks under a header. The header is 'diff: +N added, -N removed, ~N modified' written at main.swift:1008. Added elements print with a plus prefix at main.swift:1014 ('+ [AXButton (button)] "Send" x:820 y:612 w:60 h:28'). Removed elements print with a minus prefix at main.swift:1017. Modified elements print with a tilde prefix at main.swift:1026 in the shape ~ [AXButton] "Send" | AXEnabled: 'false' -> 'true'. The full response is written to /tmp/macos-use/<timestamp>_<tool>.txt so you can grep it later.
Why is that format good for the remote viewer on FaceTime instead of just watching the video?
SharePlay encodes at roughly 30fps and compresses text aggressively. Small UI state changes, like a disabled button going to enabled or a label swap from 'Send' to 'Sending…', are routinely lost to compression blur. The diff is unambiguous: the exact element role, the exact before and after text, the AXEnabled change. The host's AI reads 'AXButton Send changed AXEnabled false -> true' and narrates 'the Send button is enabled now' without ever inspecting a video frame.
Which tools return a diff and which return a full traversal?
The switch is inside buildToolResponse at main.swift:612 on the hasDiff flag. hasDiff is true for click, type, press, scroll — the four that mutate UI state. open_application and refresh_traversal return a full traversal instead, written out by the branch at main.swift:720-722. So the diff format is specific to mutation calls, which is the useful case during a FaceTime session. You do not need a full dump of the accessibility tree after every click, just what changed.
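A hypothetical reconstruction of that branch. The flag name hasDiff and the click/type/press/scroll split come from the source; the exact tool-name strings and the function shape are assumptions.

```swift
// Mutating tools get a diff; traversal tools get the full element dump.
func buildBody(toolName: String, diffText: String, traversalText: String) -> String {
    let mutating: Set<String> = [
        "macos-use_click_and_traverse", "macos-use_type_and_traverse",
        "macos-use_press_and_traverse", "macos-use_scroll_and_traverse",
    ]
    let hasDiff = mutating.contains(toolName)
    return hasDiff ? diffText : traversalText   // open_application, refresh_traversal fall through
}
```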
Where does the red crosshair in the screenshot come from?
ScreenshotHelper/main.swift:70-85. After CGWindowListCreateImage captures the frontmost window, ScreenshotHelper draws a 2pt red stroke crosshair with 15pt arms centered at lastClickPoint, plus a 10pt radius circle around it. The click coordinates are passed from main.swift:1839 via the --click-point flag on the helper subprocess. lastClickPoint is set per-call at the click_and_traverse handler site, so the PNG shows where the cursor landed even though the cursor itself has already snapped back.
Where does the .txt file come from and how is it named?
main.swift:1825-1829. The handler builds a timestamp in milliseconds ('Int(Date().timeIntervalSince1970 * 1000)'), strips the 'macos-use_' prefix from the tool name, and writes the response to '/tmp/macos-use/<ts>_<toolname>.txt'. The screenshot at main.swift:1834-1839 reuses the same timestamp so the .txt and .png names match. If you collect five clicks in one call they will be 1713456789012_click_and_traverse.txt through 1713456792512_click_and_traverse.txt, each paired with its own PNG.
Does filtering remove noise from the diff, or is every accessibility change surfaced?
Filtering happens in buildToolResponse at main.swift:648-718. Scroll-bar elements are dropped by isScrollBarNoise (main.swift:591). Structural containers like AXRow, AXCell, AXColumn, AXMenu without text are dropped by isStructuralNoise at main.swift:600-607. Coordinate-only changes (x, y, width, height attributes) are filtered out of modified entries at main.swift:681-682. What you are left with is role + text + the semantic attribute that flipped, which is exactly what narrates well.
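A sketch of those two predicates under stated assumptions: the predicate names and the role list follow the description above, but the element shape and exact string checks are illustrative, not the repo's code.

```swift
struct AXElementInfo { let role: String; let text: String? }

// Scroll bars never narrate well; drop them outright.
func isScrollBarNoise(_ e: AXElementInfo) -> Bool {
    e.role == "AXScrollBar"
}

// Structural containers only count as noise when they carry no text.
func isStructuralNoise(_ e: AXElementInfo) -> Bool {
    let structural: Set<String> = ["AXRow", "AXCell", "AXColumn", "AXMenu"]
    return structural.contains(e.role) && (e.text ?? "").isEmpty
}
```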
What does 'text_changes' mean in the compact summary the MCP client actually sees?
The tool returns a short summary to the MCP client, with the full diff written to the .txt file. The summary at main.swift:838-857 collects up to three modified elements whose changed attribute is 'text' or 'AXValue' and prints them as 'text_changes:' followed by 'old' -> 'new' lines. That is the terse signal the AI reads first. If it wants more, the 'file:' line tells it where to grep. The hint line at main.swift:761 even shows the grep command: 'hint: grep -n AXButton <filepath>'.
Can the remote viewer or their AI read the .txt file directly?
No, only the host's AI can. The .txt and .png live in /tmp/macos-use/ on the host machine. The MCP client (running on the host) sees the summary, then can shell out to read the full file if it decides to. The remote viewer sees neither; they see the host AI's narration and the SharePlay video feed. The receipt pair is for the host: it is what they hand a teammate, an auditor, or a bug report after the call to say 'this is exactly what happened'.
Does the diff tell you if the action silently opened a different app?
Yes, via the cross-app handoff section at main.swift:1788-1808. If hasDiff is true and the frontmost app PID changed from the one passed to the tool, the handler sets toolResponse.appSwitchPid and re-traverses the new frontmost app. The .txt file then appends a second 'app_switch:' header followed by the new app's element list (main.swift:1031-1036). The summary includes 'app_switch: <App> (PID: N) is now frontmost'. So the AI narrates 'that click launched Mail, here is its new window'.
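A hypothetical tail of the .txt after such a handoff (app name, PID, and elements invented; the 'app_switch:' header shape is from the source):

```text
app_switch: Mail (PID: 4821) is now frontmost
[AXWindow] "Inbox" x:0 y:25 w:1440 h:875
[AXButton (button)] "New Message" x:12 y:40 w:110 h:28
...
```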
What if the click did nothing — is the diff empty or is there a default message?
buildDiffSummary at main.swift:888-894 returns 'No changes.' when all three arrays are empty, and that string is appended to the one-line summary. So a click that landed on a non-interactive element, or an AXButton that did not change state, produces a response like 'Clicked at (420, 300). No changes.' and the .txt file has the header '# diff: +0 added, -0 removed, ~0 modified' followed by a blank element section. The AI can read that and narrate 'nothing happened, try a different spot'.
Why both a .txt and a .png instead of just one? Isn't the diff enough?
The diff describes the post-click world in accessibility terms. The PNG describes where the click physically landed in pixel terms, with the red crosshair showing the exact coordinate. Most of the time you only need the diff. But when an action does nothing, the PNG is the tiebreaker: you can see the crosshair fell on a disabled area, or missed the target, or landed on an overlay you did not know was there. Two formats, two angles on the same event.
Can I clear the receipt files, or will /tmp/macos-use grow forever?
Nothing in the server prunes them. /tmp is cleared by macOS on reboot and by periodic launchd tasks (typically anything untouched for 3 days). For a single FaceTime session you will accumulate on the order of tens to low-hundreds of file pairs. If you need to keep them, copy /tmp/macos-use/ somewhere persistent before rebooting. If you want them gone sooner, 'rm -rf /tmp/macos-use/*' between calls is safe — the directory is recreated by main.swift:1823 before the next write.
Read the diff format, the receipt-pair writer, and the crosshair drawer in one sitting
Three spots, total under 60 lines: main.swift:1007-1028 for the +/-/~ format, main.swift:1821-1840 for the .txt + .png pair, and ScreenshotHelper/main.swift:70-85 for the red crosshair. All open source, MIT-licensed, no accounts, no telemetry.
Browse the repo on GitHub →