
Can local AI see images? Vision AI on Mac, tested

Quick answer

Yes — multiple local vision models run on Apple Silicon. Outlier supports vision via drag-and-drop: images are analyzed on your Mac with nothing uploaded. Quality is solid for screenshots, documents, and everyday images; complex visual reasoning is where cloud still has an edge.

When most people think of AI that can "see," they think of GPT-4o or Claude Sonnet — cloud tools where images get sent to a server somewhere and analyzed there. What isn't as widely known is that the same capability now runs locally on a MacBook. I've been testing this in Outlier on an M1 Ultra, dragging in receipts, error screenshots, charts, and personal photos. This article covers how it works, which models are available, what you can actually do with it, and where it's still not as capable as cloud vision.

How vision AI works

A vision-language model (VLM) is a standard language model with one extra component: a visual encoder bolted to the front. When you send an image, the encoder divides it into a grid of small patches (think of it as tiling the image into hundreds of squares) and converts each patch into a vector of numbers, an embedding that the language model treats as a token. Those image tokens get fed into the language model alongside your text prompt, and the model reasons over the combined input just as it would over a long document.
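To make the patch grid concrete, here is a minimal Python sketch of the arithmetic. The 14-pixel patch size and the 336×336 input are illustrative assumptions, not figures taken from Outlier or from any specific model in this article. Real VLMs often resize images or merge neighboring patches, so actual token counts will differ.

```python
def patch_grid(width: int, height: int, patch_size: int = 14) -> tuple[int, int]:
    """Return (columns, rows) of whole patches that tile the image.

    patch_size=14 is an assumed, illustrative value.
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    return width // patch_size, height // patch_size


def image_token_count(width: int, height: int, patch_size: int = 14) -> int:
    """One token per patch, in this simplified picture."""
    cols, rows = patch_grid(width, height, patch_size)
    return cols * rows


if __name__ == "__main__":
    print(patch_grid(336, 336))         # (24, 24)
    print(image_token_count(336, 336))  # 576
```

Under these assumptions a modest image already costs several hundred tokens, which is why an image behaves like a long chunk of text from the model's side.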

From the model's point of view, "look at this image and tell me what's in it" isn't fundamentally different from "read this document and summarize it." The visual encoder is what transforms pixels into something a language model can work with. The quality of that encoder — and how well it was trained alongside the language model — is what separates good VLMs from mediocre ones.
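The "image as a document" idea can be sketched as a toy in Python. Everything here is a hypothetical stand-in: `encode_image` and `encode_text` are placeholders for a real visual encoder and tokenizer, and real models pass embedding vectors rather than strings. The only point is that the language model ends up with a single sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    kind: str   # "image" or "text"
    value: str  # placeholder payload; real models use embedding vectors


def encode_image(num_patches: int) -> list[Token]:
    """Stand-in for the visual encoder: one token per patch."""
    return [Token("image", f"patch_{i}") for i in range(num_patches)]


def encode_text(prompt: str) -> list[Token]:
    """Stand-in for a tokenizer: naive whitespace split."""
    return [Token("text", word) for word in prompt.split()]


def build_input(num_patches: int, prompt: str) -> list[Token]:
    """Image tokens first, then the text prompt, as one sequence."""
    return encode_image(num_patches) + encode_text(prompt)


if __name__ == "__main__":
    seq = build_input(4, "What is in this image?")
    print(len(seq))                # 9
    print([t.kind for t in seq])
```

The ordering (image first, then text) is an assumption made for the sketch; models differ in how they interleave the two.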

One practical consequence: image encoding takes time before any text is generated. On my M1 Ultra, that encoding step typically takes 1–3 seconds regardless of model size. Once encoding finishes, generation starts at the normal speed. That pause is normal and expected — it's not a stall.
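As a rough back-of-the-envelope model, the total wait is the one-off encoding pause plus generation time. In the Python sketch below, the 1 and 3 second values come from the range quoted above; the 200-token answer and the 40 tokens-per-second rate are made-up illustrative numbers, not measurements.

```python
def response_time(encode_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Seconds until the last token: one-off encode pause plus generation time."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return encode_s + output_tokens / tokens_per_s


if __name__ == "__main__":
    # Hypothetical answer length and generation speed, for illustration only.
    for encode_s in (1.0, 3.0):
        total = response_time(encode_s, output_tokens=200, tokens_per_s=40.0)
        print(f"encode {encode_s:.0f}s -> total {total:.1f}s")
```

With these numbers the encoding pause is a fixed cost of a second or a few on top of however long the answer itself takes to stream.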

Which local vision models are available

The open-source vision model landscape has moved quickly. Here are the ones worth knowing about:

Model            | Made by             | Sizes    | Notes
-----------------|---------------------|----------|-------------------------------------------------
Qwen2-VL         | Alibaba             | 7B, 72B  | Strong text-in-image; what Outlier ships
LLaMA 3.2 Vision | Meta                | 11B, 90B | Good general understanding; wide community support
LLaVA            | Open (UW/Microsoft) | Various  | Early open VLM; spawned many fine-tunes
Phi-3 Vision     | Microsoft           | 4B       | Small and efficient; decent for documents
Pixtral          | Mistral             | 12B      | Native high-res; good at charts and diagrams

For comparison, ChatGPT (GPT-4o), Claude Sonnet and Opus, and Gemini Pro all have vision built in — but all of them send your image to their servers to process it. The models above run the entire pipeline on your own hardware.

What you can actually do with local vision

I ran through a range of tasks over several weeks. Here's what worked well:

In Outlier specifically: drag an image into the chat window, or paste with ⌘V. No setup, no extra steps. The model that handles vision is selected automatically.

The privacy argument for local vision

Here's the thing that doesn't get enough attention: when you send an image to a cloud AI, that image travels to a data center. The provider's terms govern what happens to it from there — how long it's retained, whether it's used for training, who can access it.

Most of the time that's fine. But think about what kinds of images people actually want AI help with: medical records, prescription labels, financial documents, legal paperwork, personal photos, work documents marked confidential. These are exactly the images you'd probably rather not upload anywhere.

With local vision, the image never leaves your device. There's no upload, no server-side copy, no retention question. You can run it with Wi-Fi off and it works identically — that's the clearest proof nothing is going out. For sensitive images in particular, local vision removes an entire category of exposure that cloud-based vision carries by design.

Where local vision is not as capable

Being honest here matters. Local VLMs are not on par with GPT-4o for every task:

The realistic framing: local vision is strong for practical, document-oriented tasks — reading text in images, parsing structured information, describing scenes, understanding UI. It's less reliable for tasks that demand nuanced interpretation of complex visuals. If that's your use case, cloud vision still has a practical advantage. If it's receipts, screenshots, and everyday images — local handles it well and your images stay on your machine.

How to use vision in Outlier

There's no setup. Open Outlier, start a chat, and either drag an image into the window or paste one from your clipboard with ⌘V. The vision-capable model — Qwen2-VL on my setup — activates automatically when an image is attached. Type your question alongside the image, or just drop it in and hit return to get a description.

The 1–3 second encoding pause happens before the first token appears; after that, streaming continues at normal speed. You can attach images alongside text prompts, follow up with more questions about the same image, or clear the context and start fresh. Everything stays on-device throughout.

Receipts: Vision examples in this article were run on my own M1 Ultra using Outlier's Qwen2-VL model locally. Images were processed on-device; no images were uploaded to any external service during testing.

Try vision in Outlier

Free Nano + Lite — local, private, no account. Pro $20/mo or $149/yr adds all model tiers including vision models. Lifetime Pro from $99 (Founding 200, first 200 seats). Apple Silicon only.
