A local AI agent with computer use, fully offline on a Mac
- Computer use means the AI reads screenshots and then clicks or types through macOS Accessibility.
- Vision 35B does the screen-reading locally on a 24 GB+ Mac (~16 tok/s on the V9 engine).
- By default it asks before every action. You can scope-grant a whole workflow so it stops asking.
- You give up some visual-reasoning quality versus cloud. In return, nothing on your screen ever leaves the machine.
"Computer use" is the AI capability where the model looks at a screenshot of your desktop, decides what to click or type, then does it. Anthropic's cloud version gets most of the attention. The local version is quieter and it does lag on a few things. But it works, and this is what it actually looks like on a Mac.
The three primitives
Three things have to be in place.
- Screen perception. A model that can take a screenshot as input and actually reason about what's in it. In Outlier that's the Vision 35B-A3B tier, a multimodal MoE that handles image plus text.
- Action emission. The model emits structured tool calls. Things like "click at (x, y)" or "type 'hello'." Outlier's agent loop takes those and hands them to a sandboxed executor.
- Action execution. A small driver that does the physical part: moves the mouse, presses keys, types. macOS exposes this through the Accessibility framework. You grant the app Accessibility permission once and you're done.
Not one of those steps touches the network. The screenshot is captured on your machine, the model runs on your machine, the clicks happen on your machine.
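Here is a minimal sketch of how those three pieces could fit into one loop. It is not Outlier's actual code. The JSON tool-call schema, the `Action` type, and the `capture`, `decide`, and `execute` callables are all hypothetical stand-ins for the screenshot grabber, the Vision 35B call, and the Accessibility driver.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Action:
    kind: str        # "click", "type", or "done" (assumed vocabulary)
    x: int = 0
    y: int = 0
    text: str = ""


def parse_tool_call(raw: str) -> Action:
    """Turn a JSON tool call from the model into a typed Action.

    The schema ({"action": ..., "x": ..., "y": ..., "text": ...}) is an
    assumption made for this sketch.
    """
    data = json.loads(raw)
    kind = data["action"]
    if kind == "click":
        return Action("click", x=int(data["x"]), y=int(data["y"]))
    if kind == "type":
        return Action("type", text=str(data["text"]))
    if kind == "done":
        return Action("done")
    raise ValueError(f"unknown action: {kind!r}")


def run_loop(
    capture: Callable[[], bytes],       # screen perception: returns a screenshot
    decide: Callable[[bytes], str],     # action emission: model returns a tool call
    execute: Callable[[Action], None],  # action execution: driver clicks or types
    max_steps: int = 10,
) -> List[Action]:
    """Look, decide, act, and repeat until the model says it is done."""
    performed: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture()
        action = parse_tool_call(decide(screenshot))
        if action.kind == "done":
            break
        execute(action)
        performed.append(action)
    return performed
```

Every callable here is a plain local function, which is the point: nothing in the loop needs a socket. The `max_steps` cap is a design choice for the sketch so a confused model can't click forever.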
What Vision 35B sees and decides
Vision 35B is a 35-billion-parameter Mixture-of-Experts model with an image encoder, and only 3.6B of those parameters fire per token. You need a 24 GB+ Mac. It runs on the V9 paged engine at roughly 16 tok/s. On a 64 GB Mac Studio you've got plenty of room left over for long image-heavy sessions.
The prompt for a computer-use task looks roughly like this: "Here's a screenshot of my desktop. The user wants the export-to-PDF button in this app. Where do I click?" The model hands back rough coordinates or a description of the element it found. Then the Outlier agent does one of three things: clicks it, asks you first, or gives control back.
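One way to picture that three-way choice is the sketch below. The rule it uses is an assumption, since the article doesn't spell out how the agent picks. Here it hands control back when the model found no target, asks first when approval is required, and clicks otherwise. `Outcome` and `resolve` are hypothetical names.

```python
from enum import Enum
from typing import Optional, Tuple


class Outcome(Enum):
    CLICK = "click"
    ASK_FIRST = "ask_first"
    HAND_BACK = "hand_back"


def resolve(target: Optional[Tuple[int, int]], needs_approval: bool) -> Outcome:
    """Pick one of the three outcomes for a single model reply.

    target is the (x, y) the model returned, or None if it found nothing.
    """
    if target is None:
        return Outcome.HAND_BACK   # nothing to click, so give control back
    if needs_approval:
        return Outcome.ASK_FIRST   # a target exists but the user must OK it
    return Outcome.CLICK           # a target exists and no approval is pending
```

For example, `resolve((412, 88), needs_approval=True)` yields `Outcome.ASK_FIRST`, which matches the default behavior of asking before acting.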
Where the local version trails
No point pretending it's even with cloud. It isn't, in three places.
- Visual reasoning quality. Vision 35B is genuinely good, but it's a smaller model than Claude's vision stack. Hand it a crowded or low-contrast UI (older Mac apps, dense data tables, custom themes) and it slips up more often.
- Latency. Each look-then-decide-then-act loop takes seconds, because on every pass the model has to encode the screenshot, reason about it, and emit the action. Cloud isn't wildly faster, but over a long sequence the gap shows.
- Tool ergonomics. The computer-use surface is newer than the chat and agent surfaces. Rougher edges. More approval prompts than you'd like.
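To see why the latency adds up, here is a back-of-the-envelope estimate. The only number taken from this article is the roughly 16 tok/s decode rate. The screenshot encode time and the tokens emitted per action are inputs you supply, and the values used in the example are made up for illustration, not measurements.

```python
def loop_seconds(output_tokens: int, encode_seconds: float,
                 tokens_per_second: float = 16.0) -> float:
    """Estimate one look-decide-act pass: encode time plus decode time."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return encode_seconds + output_tokens / tokens_per_second


def sequence_seconds(steps: int, output_tokens: int, encode_seconds: float,
                     tokens_per_second: float = 16.0) -> float:
    """Estimate a whole workflow of identical passes."""
    return steps * loop_seconds(output_tokens, encode_seconds, tokens_per_second)


if __name__ == "__main__":
    # Illustrative inputs only: 32 output tokens and 1.0 s of image encoding.
    print(loop_seconds(32, 1.0))          # 1.0 + 32 / 16 = 3.0
    print(sequence_seconds(20, 32, 1.0))  # 20 passes at 3.0 s each = 60.0
```

With those made-up inputs a single pass is about three seconds and a twenty-step workflow is about a minute, which is how a small per-loop gap turns into a visible one over a long sequence.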
What the local version gives back
- Nothing on screen leaves your machine. Your open emails, a password manager, customer data, an internal dashboard. Whatever's visible stays local. This is the whole reason to run computer use locally.
- No per-action billing. Cloud computer-use APIs bill you for every screenshot they process. There's no meter running here.
- Works offline. Handy for on-device automation that has to keep going with no network. Backup workflows, scheduled scripted jobs, a kiosk-style setup left running on its own.
How to try it
Open the Outlier app and switch to Vision 35B. Any Pro tier includes it ($20/mo, $149/yr, or $99 lifetime via Founding 200). Go into Agent mode. Grant Accessibility permission when it asks. Then try something small: "open Calculator and compute the square root of 169." The agent grabs a screenshot, works out what to do, asks for your OK, and acts. That approval gate is on by default for anything with side effects. If you'd rather it run start to finish without interrupting, scope-grant the workflow.
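The approval gate and scope-grant can be modeled in a few lines. This is a sketch of the idea, not Outlier's implementation: `ApprovalGate`, the workflow names, and the `ask_user` callback are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


@dataclass
class ApprovalGate:
    """Asks before side-effecting actions unless the workflow is scope-granted."""
    granted_workflows: Set[str] = field(default_factory=set)

    def scope_grant(self, workflow: str) -> None:
        """Let one named workflow run without further prompts."""
        self.granted_workflows.add(workflow)

    def allow(self, workflow: str, has_side_effects: bool,
              ask_user: Callable[[str], bool]) -> bool:
        """Return True if the action may run."""
        if not has_side_effects:
            return True                  # read-only steps never prompt
        if workflow in self.granted_workflows:
            return True                  # covered by an earlier scope grant
        return ask_user(workflow)        # default: prompt on every action


if __name__ == "__main__":
    gate = ApprovalGate()
    approve = lambda name: input(f"Allow action in '{name}'? [y/N] ") == "y"
    gate.scope_grant("calculator-sqrt")
    # No prompt is shown here because the workflow was scope-granted above.
    print(gate.allow("calculator-sqrt", True, approve))
```

Keeping the grant keyed to a workflow name, rather than a global off switch, means one trusted task can run start to finish while everything else still asks.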
Frequently asked questions
Can an AI control my Mac offline?
Yes. With screen perception from Vision 35B and macOS Accessibility permission, the agent can see the screen and click or type, fully locally.
How good is local computer use versus cloud?
It works, but it trails on visual reasoning for crowded or unusual interfaces, and each look-decide-act loop takes a few seconds.
Is computer use safe?
Every action with side effects is gated behind an approval prompt by default; you can scope-grant a single workflow to run uninterrupted.
Try Outlier free
Free Nano + Lite — local, private, no account. Pro $20/mo or $149/yr adds everything (Plus 397B, Marathon mode, Computer use, Deep Research v3, long context to 128K). Lifetime Pro from $99 (Founding 200, first 200 seats) or $200 (Founders 500). Apple Silicon only.