Guide / 2026
Local LLMs and macOS backoffice automation: the context budget problem nobody talks about.
Every guide on running a local model on a Mac stops at the install step. Every guide on AI computer use assumes a 200K-context cloud model. Almost no one writes about what happens in the middle, where a 7B or 8B local model on Ollama tries to drive Mail or Numbers and falls out of context on the third tool call. This is the piece I wish existed when I started.
Direct answer / verified 2026-05-07
Yes, a local LLM can run a real macOS backoffice agent. The hardware floor is a Mac with 16 GB of unified memory and a 7B+ instruct model on Ollama or LM Studio. The thing that decides whether it actually works is the MCP server's response shape, not the model. macos-use returns a compact summary plus an on-disk accessibility tree the model can grep, instead of dumping 200 KB of tree on every turn. That is the unlock. Everything else is setup.
Why the model size is not the bottleneck
People assume that running a backoffice agent locally needs a 70B model. It does not. A clean Llama 3.1 8B Instruct on Ollama, given an MCP server with a sane response shape, will close out a vendor invoice triage in twelve to fifteen tool calls without losing the plot. A 14B model is comfortable. A 30B is overkill for almost every backoffice task that is not legal review.
The bottleneck is the context budget. A 7B model loaded with something like Llama 3.1 has, depending on quantisation and how you launch the runtime, somewhere between 8K and 32K tokens of context. Once the system prompt is loaded and the MCP tool schemas are injected (those are not free), you have maybe 6,000 to 28,000 tokens left to actually do work. A naive computer-use server, trying to be helpful, dumps the full accessibility tree of the app it just touched. Mail with a healthy inbox is around 8,200 elements, roughly 200 KB of JSON, roughly 50,000 tokens. Two tool calls in, you are out of room. The model produces gibberish. You blame the model.
The model was fine. The response was the wrong shape.
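How you launch the runtime is a real lever here. Ollama in particular ships with a modest default window unless you raise it. A minimal sketch of two ways to do that, assuming a recent Ollama build; num_ctx is Ollama's documented context-window option, and a bigger window costs more unified memory for the KV cache:
# Per-request: pass num_ctx in the options of an API call.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b-instruct-q4_K_M",
  "prompt": "ping",
  "options": { "num_ctx": 32768 }
}'
# Persistent: bake the window into a derived model with a Modelfile.
#   FROM llama3.1:8b-instruct-q4_K_M
#   PARAMETER num_ctx 32768
# Then build it once and point the client at the new name:
ollama create llama3.1-32k -f Modelfile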
Same model, two response shapes
Both captures below follow one click on Mail's compose button. The naive shape is what most computer-use code on GitHub does today. The macos-use shape is what fits inside a small local model.
One click, two response shapes, one survives a 7B
A small local model on Ollama, 7B parameters, 8K context after the system prompt is loaded. The user asks it to open Mail, find the latest invoice from a vendor, and copy the amount into a Numbers cell. The MCP server, trying to be helpful, returns the full accessibility tree of Mail on the first call. The tree is 8,200 elements. The tool response uses the entire context window. The model produces no output, or hallucinates a completion, or tries to call a tool with garbled arguments. The user thinks the model is bad. The model is fine. The response was the wrong shape.
- 200 KB JSON, ~50,000 tokens per turn
- First tool call exhausts an 8K context
- Looks like a model failure, is a server failure
The bytes, side by side
The naive payload below is an illustrative shape, not literal output from a specific server. The macos-use payload is the literal shape that buildCompactSummary produces at Sources/MCPServer/main.swift:731. The byte counts are roughly two orders of magnitude apart.
# What a naive computer-use MCP server hands back to the model.
# This is what kills a 7B local model.
# After ONE click on Mail's compose button:
{
"tree": [
{ "role": "AXApplication", "text": "Mail", "children": [ ... ] },
{ "role": "AXWindow", "text": "Inbox", "children": [
{ "role": "AXSplitGroup", "children": [
{ "role": "AXScrollArea", "children": [
{ "role": "AXOutline", "children": [ /* 1,200 mailbox rows */ ] }
] },
{ "role": "AXScrollArea", "children": [
{ "role": "AXTable", "children": [ /* 8,000 message rows */ ] }
] }
] }
] },
/* ...hundreds more nodes... */
]
}
# Total payload: ~210 KB of JSON, ~52,000 tokens.
# A 7B model with 8K usable context: dead on arrival.
# A 14B with 32K context: alive for two more turns.

# What macos-use returns instead. The bytes a 7B can actually read.
# Same click on Mail's compose button.
# Source: buildCompactSummary at Sources/MCPServer/main.swift:731
status: success
pid: 4821
app: Mail
file: /tmp/macos-use/1746619200000_click_and_traverse.txt
file_size: 2104 bytes (6 elements)
hint: grep -n 'AXButton' /tmp/macos-use/1746619200000_click_and_traverse.txt
screenshot: /tmp/macos-use/1746619200000_mail.png
summary: Clicked element 'New Message'. 4 added, 2 modified.
text_changes: 'Inbox - 1,247 messages' -> 'New Message'
visible_elements:
  [AXTextField] "To:" x:280 y:140 w:520 h:24 visible
  [AXTextField] "Subject:" x:280 y:172 w:520 h:24 visible
  [AXTextArea] "" x:280 y:208 w:520 h:380 visible
  [AXButton] "Send" x:760 y:108 w:60 h:28 visible
  [AXButton] "Attach" x:680 y:108 w:60 h:28 visible
  [AXPopUpButton] "From:" x:280 y:108 w:200 h:24 visible
  ...
# Total payload: ~640 bytes, ~190 tokens. 270x smaller.
# A 7B model: still has 7,800 tokens to think with.
The interactive-role allowlist that decides which roles get inlined into visible_elements lives at main.swift:937-941. The output directory /tmp/macos-use is set at main.swift:1961. The grep hint is emitted at main.swift:761.
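When the summary is not enough, the model follows the emitted hint and reads a slice of the on-disk tree instead of asking for another dump. A sketch of that round trip from a shell; the matched lines are illustrative, not literal server output:
# Follow the hint: pull only the actionable rows out of the full tree.
grep -n 'AXButton' /tmp/macos-use/1746619200000_click_and_traverse.txt
# Illustrative output, a handful of bytes instead of the whole tree:
# 4:[AXButton] "Send" x:760 y:108 w:60 h:28 visible
# 5:[AXButton] "Attach" x:680 y:108 w:60 h:28 visible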
Backoffice tasks worth running locally
Not every workflow benefits. The shape of a good local-only task is small in steps, repetitive, confined to one or two apps, and handles data the user is not eager to send to a vendor. These six are where I have seen real wins.
Vendor invoice triage
Pull amounts and counterparties from inbound PDFs in Mail, sort into the right Numbers tab, flag mismatches against a vendor list. Stays inside Mail, Preview, and Numbers. Never leaves the laptop.
Expense categorisation
Statement PDF open in Preview, target categories in a Numbers sheet, agent reads line by line and writes the category alongside. Mistakes are visible because the source row is right there.
Mail-merge generation
Numbers range as input, Mail compose window as output, one personalised draft per row. Saved to Drafts so a human approves the send. Local-only because the names and addresses are private.
CRM hygiene pass
Cross-references Mail's signature parsing with a CRM tab in Safari, fixes obvious drift (job title changes, new email domain), logs everything it changed to a Notes file you can review.
Receipt and statement pull
Logs into a small handful of vendor portals in Safari, downloads PDFs to a designated folder in Finder, renames by vendor and date. The browsing happens in your real Safari profile.
File housekeeping
Walks Finder windows, moves files matching a description, renames, archives. Useful at month-end. The agent literally drives Finder; no shell, no AppleScript bridge.
Why local at all, given the cloud is faster
A cloud model running a backoffice agent is faster per token, has stronger reasoning, and gets new capabilities monthly. Those are real advantages. The argument for local is not that local is better. It is that for one specific class of work, the bytes cannot leave the laptop.
Vendor invoices have counterparty bank details. CRM notes have things a human wrote in confidence. Statement exports from a bank have transaction-level history that is not yours to upload. The policy at most companies handling this kind of data is informal but firm: do not paste into ChatGPT, do not pipe through a third-party agent. A local model, on a Mac, with an MCP server that drives apps natively and writes its scratch state under /tmp/macos-use, is the boring compliant version of computer use. Nothing is sent anywhere.
The other reason is per-call cost. A backoffice automation that takes thirty tool calls to finish, run twenty times a day, on a cloud model with vision turned on, runs into real money over a month. Local is free at the margin once the laptop is paid for.
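A back-of-envelope version of that claim, with every number an explicit assumption rather than a measured bill:
# Assumed: 30 tool calls per task, 20 tasks/day, 22 working days.
#   30 x 20 x 22 = 13,200 cloud calls per month.
# Assumed: ~2,000 input tokens per call once vision payloads are included.
#   13,200 x 2,000 = 26.4M input tokens per month.
# At an assumed $3 per million input tokens: ~$79/month, before output
# tokens, retries, or re-traversals. Local: $0 at the margin.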
Cloud computer-use vs local + macos-use
Where the design choices actually differ.
| Feature | Cloud model + screenshot computer-use | Local LLM + macos-use |
|---|---|---|
| Per-turn payload size to the model | 100-300 KB raw accessibility tree, often as JSON | ~600 bytes summary plus on-disk file the model greps on demand |
| Survives an 8K-context local model | No, the first tree dump exhausts context | Yes, the steady-state cost is small enough for a 7B |
| Interactive elements surfaced inline | Either everything or nothing | 10-30 elements, role-filtered to the kind a model can act on |
| Post-action signal | Re-traverse and re-send the world | Diff vs before, only changed elements written to disk |
| Visual verification for multimodal local models | None, or a separate screenshot tool | PNG with a red crosshair drawn at the click point |
| User keyboard collision while the agent works | Synthetic events race with whatever the user types | InputGuard suppresses input during action, Esc cancels |
| Lockout if the agent hangs | Possible, often requires force-kill | 30-second watchdog auto-disengages, keyboard returns |
What the agent does to your keyboard while it works
The first time a local agent drives a Mac it is uncanny. The cursor jumps, a window comes forward, text appears in a field, a button highlights and clicks. If you are still typing into another app, that is a problem. CGEvent, which is what macos-use uses for input, posts events into the same global event stream as your keyboard. They mix.
macos-use wraps every disruptive action in an InputGuard. It engages before the action, suppresses the user's keyboard and mouse so the synthetic events go through cleanly, disengages after, and ships a 30-second watchdog that breaks the guard if anything hangs (Sources/MCPServer/InputGuard.swift:24). Pressing Esc at any point cancels the in-flight action and returns control. The watchdog is the part that matters: a stuck agent that locked your keyboard for an hour would be the end of the experiment.
In practice, the lock is short enough to be invisible most of the time. You will know it is on because a small floating overlay says what the AI is doing.
What the local stack looks like end to end
Eight pieces. Three for the model, three for the agent loop, two for the operator. Nothing here costs money to run.
Ollama
Local model runtime. Run Llama 3.1 8B or Qwen 2.5 14B.
LM Studio
Desktop UI for local models, exposes an OpenAI-style API.
llama.cpp
Lower-level runtime, useful when you want a custom server.
Claude Desktop
Reference MCP client, swap the model for a local endpoint.
Cline
VS Code extension, MCP-aware, supports local model backends.
Goose
Block's open-source MCP agent, runs against any model.
MCP Inspector
Debug tool, watch every tool call live before you wire it up.
Activity Monitor
Built-in Mac tool, watch RSS while the agent loops.
The setup, condensed
- Install the runtime. Ollama (brew install ollama) or LM Studio. Pull a 7B+ instruct model, e.g. ollama pull llama3.1:8b-instruct-q4_K_M.
- Install macos-use. npm install -g mcp-server-macos-use. The postinstall step runs swift build -c release and writes the binary into the npm prefix.
- Pick an MCP-aware client. Claude Desktop is the easiest to point at a local model via a proxy; Cline is the most ergonomic inside VS Code; Goose is the most opinionated for autonomous loops. All three accept an MCP server in the standard config.
- Add macos-use to the client config. "macos-use": { "command": "mcp-server-macos-use" }. Restart the client. A fuller config sketch follows this list.
- Grant permissions to the host. System Settings, Privacy and Security. Accessibility for the client app, Screen Recording for the client app. Both prompts fire on first call.
- Sanity check. Ask the model to open Mail. If you see a tool call go through and a response with a file: line and a screenshot: line, the wiring is right.
- Tail the scratch dir. ls -lt /tmp/macos-use/ | head. You will see one .txt and one .png per tool call. The .txt is the full accessibility tree on disk. That is what makes the compact summary affordable.
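The full client config for step four is small. A sketch of the Claude Desktop shape; the path below is Claude Desktop's documented default config location, and Cline and Goose take the same server entry in their own config files:
# ~/Library/Application Support/Claude/claude_desktop_config.json
{
  "mcpServers": {
    "macos-use": {
      "command": "mcp-server-macos-use"
    }
  }
}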
The honest counterargument
There are tasks a local 7B will fumble that a frontier cloud model handles cleanly. Anything that requires holding a long, fuzzy plan in working memory across a dozen apps. Anything that needs deep reading of a 40-page contract. Anything where the success criteria are ambiguous and the model has to negotiate them with you. For those, the cloud is the right tool, and macos-use works fine against a cloud model too.
What this page is about is the other 80 percent. The form filling, the file moving, the cross-tab reconciliation, the mostly-mechanical work that is too dull for a person and too sensitive for the cloud. That is where local pays back.
Wiring this up for a real backoffice workflow?
Twenty minutes on a call, I'll help you pick the model, the client, and the first task that's worth automating.
Frequently asked questions
Can a local LLM on a Mac actually drive backoffice apps reliably?
Yes, on a Mac with 16 GB or more of unified memory running a 7B or 8B instruct model via Ollama or LM Studio, paired with macos-use as the MCP server. The bottleneck is not the model's reasoning, it is the context budget. A 7B model with 8K of usable context dies on the first full accessibility-tree dump from a naive computer-use server. macos-use instead returns a 5-line summary plus a file path plus 10-30 visible elements per turn, which fits the budget. That is the single design choice that makes the rest possible.
Why not just use Claude or GPT for this?
Cloud is fine for a side project. It stops being fine the moment the data going through the agent is regulated. Vendor invoices have counterparty bank details on them. Mail merges contain customer addresses. CRM updates contain notes a human wrote in confidence. A local model on a Mac means the bytes never leave the laptop. You also avoid the per-call fees, which add up fast in a tool loop where the agent makes 30 calls to do one task. The trade is wall-clock time and reasoning quality, not capability.
Which local models actually work for this?
Anything in the 7B-to-30B instruct range with strong tool-calling behavior. As of May 2026: Llama 3.1 8B Instruct, Qwen 2.5 7B Instruct and 14B, Mistral Nemo 12B, Phi-4 14B, gpt-oss-20b. The 7B models are tight on backoffice flows that span more than five steps, the 14B-and-up models are comfortable. The 30B+ models are the sweet spot if you have 32 GB of unified memory; Llama 3.3 70B at Q4 fits on a 64 GB Mac and behaves like a small cloud model. Reasoning models are usually overkill for backoffice work, which is mostly form filling.
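Pull commands for several of these, assuming current Ollama library tags; tags drift, so verify against ollama.com/library before copying:
# Assumed Ollama tags; check ollama.com/library if a pull 404s.
ollama pull llama3.1:8b-instruct-q4_K_M
ollama pull qwen2.5:14b-instruct
ollama pull mistral-nemo:12b
ollama pull phi4:14b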
What does the MCP response actually look like? Show me the bytes.
After one click_and_traverse call, what arrives in the model's context is roughly: status line, pid, app name, a file path under /tmp/macos-use/ where the full accessibility tree was written, a file_size line that includes the element count, a literal 'hint: grep -n AXButton <path>' line so even a small model knows the next step, a screenshot path with a red crosshair drawn at the click point, a one-line summary like 'Clicked element Send. 3 added, 1 modified.', and a sample of 10-30 interactive elements with role, text, x, y, width, height. The full tree is on disk. The model reads or greps the file only when it needs more, so steady-state per-turn cost is tiny. The function that builds this is buildCompactSummary at Sources/MCPServer/main.swift:731.
Backoffice work that is a fit for a local model on a Mac?
Repetitive, structured, confined to one or two apps, with clear success criteria. Examples that have shipped well: vendor invoice triage in Mail with extraction into Numbers, expense categorisation pulling from a statement PDF in Preview, mail-merge generation from a Numbers row range, CRM hygiene passes that reconcile contacts across Mail and a CRM tab in Safari, file housekeeping across Finder windows. Bad fits: anything that requires reading dense PDFs end to end (use a separate document model for that), anything that needs cross-machine state, anything where a wrong action is unrecoverable.
Will the agent fight me for keyboard control while it works?
No. macos-use ships an InputGuard that engages while a tool runs and suppresses the user's keyboard and mouse so a stray keystroke does not collide with a synthetic CGEvent, plus a 30-second watchdog that auto-disengages if anything hangs (Sources/MCPServer/InputGuard.swift line 24). Pressing Esc cancels the current action immediately and returns control. In practice you can keep using the Mac; the agent briefly takes over for the action and gives the keyboard back. This matters more than it sounds. A backoffice automation that runs for an hour cannot lock you out for an hour.
How do I wire this up end to end?
Install Ollama or LM Studio and pull a 7B+ instruct model. Pick an MCP-aware client: Claude Desktop, Cline, or Goose all work. Add macos-use to the client's MCP config as 'mcp-server-macos-use' (npm install -g mcp-server-macos-use does the build). Grant Accessibility and Screen Recording permission to the host (the client app) in System Settings. Confirm tools light up by asking the model to open Mail. From there, the rest is prompt engineering: describe the backoffice task, give the model the success criteria, let it loop. Total wall-clock setup is around fifteen minutes.
How does macos-use handle long automations without exhausting the model's context?
Three habits, all in the response shape. First, the post-action diff. After click, type, press, and scroll, the server snapshots the accessibility tree before, runs the action, snapshots after, subtracts, and only the changed entries make it into the on-disk file. Coordinate-only churn and scroll-bar noise are dropped. Second, interactive-role filtering. The visible_elements section in the summary is filtered to the small set of roles a model can actually act on (AXButton, AXLink, AXTextField, AXTextArea, AXCheckBox, AXRadioButton, AXPopUpButton, AXComboBox, AXSlider, AXMenuItem, AXMenuButton, AXTab) at main.swift:937-941. Third, file-and-grep instead of inline JSON. A file path costs a few dozen tokens. A 200 KB JSON tree costs the model's whole turn.
What is the failure mode I should expect?
On a 7B model, the most common failure is the model picking the wrong button when several have similar labels. The fix is to push it toward the role-filtered text search, which lives on the click_and_traverse tool itself: pass element='Send' and role='AXButton' instead of trying to compute coordinates from the tree. The second-most-common failure is the model deciding it is done before the work is actually complete; mitigate with a small post-step verifier in the prompt that re-reads /tmp/macos-use/<latest>.txt and confirms a specific element appeared. The third is permission denial because Screen Recording is not granted; the screenshot will be a blank PNG.
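The first fix, sketched as an MCP tools/call payload using the element and role parameters named above; the real click_and_traverse schema may carry additional fields:
# Target by role-filtered text match, not computed coordinates.
{
  "name": "click_and_traverse",
  "arguments": {
    "element": "Send",
    "role": "AXButton"
  }
}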
How do I verify any of this in source?
Clone github.com/mediar-ai/mcp-server-macos-use and read four spans in this order. Sources/MCPServer/main.swift line 731 (buildCompactSummary, the response shape itself). Lines 937-941 (the interactive-role allowlist). Line 1961 (the /tmp/macos-use output directory). Sources/MCPServer/InputGuard.swift line 24 (the 30-second watchdog timeout). That is the whole load-bearing core of why a small local model can drive a Mac. Total reading time is around fifteen minutes.
Same neighbourhood
Related guides
macOS accessibility automation in 2026
The return-shape problem and the ReplayKit subprocess hack nobody mentions. Same family of constraints, different angle.
MCP agent plan execution
How a multi-step automation actually loops between the model and the tool layer, and where the cheap wins live.
macOS AI agent state and memory
What an agent should remember between tool calls, where to put it on disk, and why it matters more for small models.