How to run AI on Mac without a discrete GPU
Apple Silicon Macs have GPU cores built directly into the chip, sharing unified memory with the CPU, so you don't need a separate graphics card. Even an M4 MacBook Air can run a 7B model at 20+ tokens per second.
If you've looked into running AI locally on a PC, you've probably run into the GPU wall: most guides assume you have an NVIDIA card with dedicated VRAM. The Mac story is different, and it comes down to how Apple Silicon is physically built.
Why AI usually needs a GPU
On a typical Windows or Linux desktop, the CPU and GPU are separate chips connected by a PCIe bus. The GPU has its own pool of dedicated video RAM (VRAM) — 8 GB, 16 GB, 24 GB depending on the card. LLM inference is essentially a huge sequence of matrix multiplications, and GPUs are built to execute those in parallel across thousands of cores.
The catch: a model has to fit inside that dedicated VRAM. An RTX 4090 has 24 GB. A 70B-parameter model in 4-bit quantization needs roughly 35 GB. It won't load. You're either paying for multiple high-end cards or you're stuck with smaller models.
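The 35 GB figure follows from simple arithmetic: parameter count times bytes per weight. The sketch below is a rough estimate only; the function name is illustrative, and it counts weights alone, ignoring the KV cache, activations, and quantization metadata that make real footprints somewhat larger.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billion * bytes_per_weight


# Compare 16-bit and 4-bit footprints for two common model sizes.
for name, params in [("7B", 7), ("70B", 70)]:
    for bits in (16, 4):
        size = weight_memory_gb(params, bits)
        verdict = "fits" if size <= 24 else "does not fit"
        print(f"{name} at {bits}-bit: ~{size:.1f} GB, {verdict} in 24 GB of VRAM")
```

By this estimate a 70B model at 4-bit needs about 35 GB for weights, which is why it cannot load on a single 24 GB card.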
The other catch is the software stack. GPU-accelerated inference on PC relies almost entirely on NVIDIA's CUDA platform. AMD GPUs have ROCm, but driver and library coverage is narrower. Intel integrated graphics on Windows — the kind most laptops ship with — offer very limited acceleration for LLM workloads. Running a 7B+ model on integrated Intel or AMD graphics on a Windows laptop is possible but slow enough to be frustrating in practice.
Why Mac is different
Apple Silicon uses a unified memory architecture: the CPU, GPU, Neural Engine, and all the other processors share a single pool of fast on-package memory. There is no separate VRAM, no PCIe link between the CPU and GPU, and no copy from system RAM to GPU RAM. Every processor works from the same memory.
This changes the math for local AI entirely. When a model loads on Apple Silicon, it lands in unified memory where both the CPU and GPU can read it directly. The GPU cores in the chip perform the matrix multiplications that drive inference — the same work a discrete GPU does on a PC — but they're operating on memory that's already shared. No transfer overhead, no 24 GB ceiling imposed by a separate card.
What the M-series chip actually contains
Apple Silicon isn't just a CPU with a small integrated GPU bolted on. Every M-series chip packs several compute engines into one package:
- CPU cores (performance + efficiency clusters): handle the OS, scheduling, and non-matrix work
- GPU cores: the M4 MacBook Air has up to 10, the M4 Max up to 40, and the M1 Ultra up to 64. These do the heavy matrix math for inference.
- Neural Engine: a dedicated ML accelerator. The M4's Neural Engine is rated at 38 TOPS. Used for certain model operations and on-device Core ML tasks.
- Unified memory: shared by all of the above, with high-bandwidth access across the whole chip
The GPU cores are real GPU cores. They're not a watered-down integrated graphics chip; they're the same tile-based deferred renderer Apple uses for gaming and creative workloads. For LLM inference, frameworks like MLX and llama.cpp's Metal backend route matrix operations directly to these cores.
What this means for local AI performance
The practical result is that a Mac without any discrete GPU can run local AI at speeds that would have required a dedicated workstation GPU a couple of years ago.
Measured on real hardware with Outlier:
- Outlier Nano (1.5B) on M4 MacBook Air: ~32 tokens/second
- Outlier Core (27B) on M1 Ultra: ~20 tokens/second
- Outlier Plus (397B) on M1 Ultra via paged MoE inference: ~2.1 tokens/second
Memory bandwidth, more than raw GPU core count, is the real limiting factor for LLM inference. The GPU has to read the model weights on every forward pass, so how fast it can stream them from memory sets the ceiling on tokens per second. The M1 Ultra and M2 Ultra each deliver 800 GB/s of memory bandwidth, and the M4 Max delivers up to 546 GB/s. An RTX 4090 hits 1,008 GB/s, but only across 24 GB of VRAM. Once a model exceeds that 24 GB, it either fails to load or has to offload layers to system RAM across the PCIe bus, which cuts throughput sharply; on Apple Silicon, larger models just use more of the shared pool.
| Spec | M4 MacBook Air | RTX 4090 gaming PC |
|---|---|---|
| GPU cores | 10-core GPU (on-chip) | 16,384 CUDA cores (discrete) |
| Memory pool | 16–32 GB unified memory | 24 GB VRAM (system RAM separate) |
| Memory bandwidth | 120 GB/s | 1,008 GB/s |
| Max model size | Up to available RAM (7B–13B on 16 GB) | Capped at 24 GB VRAM |
| Approximate price | $1,099 (complete laptop) | ~$1,600 (GPU only) |
| Discrete GPU required | No | Yes |
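The bandwidth argument can be turned into a back-of-the-envelope ceiling: if every generated token requires reading all the weights once, tokens per second cannot exceed bandwidth divided by model size. The sketch below assumes a dense model and ignores compute time, caching, and KV-cache reads, so treat the results as an upper bound rather than a prediction. The function name is illustrative; the bandwidth figures are the ones quoted in this article, and the 4.5 GB size is the 4-bit 7B footprint used in the RAM guide.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Bandwidth-bound ceiling: each token requires one full pass over the weights."""
    return bandwidth_gb_s / model_size_gb


# Memory bandwidth in GB/s, as quoted in the article.
chips = {
    "M4 (MacBook Air)": 120,
    "M4 Max": 546,
    "M2 Ultra": 800,
    "RTX 4090": 1008,
}

model_gb = 4.5  # a 7B model in 4-bit quantization, per the article's estimate

for chip, bandwidth in chips.items():
    ceiling = max_tokens_per_second(bandwidth, model_gb)
    print(f"{chip}: at most ~{ceiling:.0f} tokens/second for a {model_gb} GB model")
```

Real throughput lands below these ceilings, but the ordering shows why bandwidth matters more than core count for generation speed.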
The catch: you still need enough RAM
Unified memory solves the VRAM ceiling problem, but it doesn't eliminate the memory requirement. A model has to fit in your Mac's RAM. Here's a rough guide by RAM tier:
- 8 GB — Small models only (1B–3B). Usable for quick tasks; macOS needs ~4 GB, leaving little headroom for a 7B model.
- 16 GB — Comfortable for 7B models in 4-bit quantization (~4.5 GB), with room for the OS and other apps. A practical everyday setup.
- 32 GB — Opens up 13B–27B models. Good for serious local AI work without compromise.
- 64 GB+ — Runs 70B models, or with paged MoE inference, models far larger than the chip's RAM through streaming from disk.
If you're buying a Mac for local AI, 16 GB is the practical floor for everyday use. If budget allows, 32 GB gives you substantially more model range without needing any special software tricks.
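The RAM tiers above reduce to one comparison: model size against installed RAM minus what macOS keeps for itself. This is a minimal sketch under that assumption. The 4 GB reserve is the article's rough figure, the function name is hypothetical, and other open apps and the KV cache will eat into the remaining headroom.

```python
def fits_in_ram(model_gb: float, ram_gb: float, os_reserve_gb: float = 4.0) -> bool:
    """True if the model fits in RAM after setting aside a reserve for macOS."""
    return model_gb <= ram_gb - os_reserve_gb


# Model sizes in GB: 7B at 4-bit (article's estimate) and 70B at 4-bit.
models = {"7B (4-bit)": 4.5, "70B (4-bit)": 35.0}

for ram in (8, 16, 32, 64):
    for name, size in models.items():
        status = "fits" if fits_in_ram(size, ram) else "too large"
        print(f"{ram} GB Mac, {name}: {status}")
```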
Getting started on Mac without a GPU
The minimum you need: any M1 or newer Mac, macOS 13 Ventura or later, and model weights that fit in your available RAM. No discrete GPU, no CUDA, no NVIDIA driver setup.
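If you are unsure whether a machine meets those requirements, a short script can check. This sketch uses only Python's standard `platform` module; the helper names are illustrative. One caveat: a Python build running under Rosetta reports `x86_64` even on Apple Silicon, so a negative result is not conclusive.

```python
import platform


def parse_major(version: str) -> int:
    """Major version from a string like '14.5' or '13.6.1'; 0 if the string is empty."""
    return int(version.split(".")[0]) if version else 0


def meets_minimum(machine: str, macos_version: str, min_major: int = 13) -> bool:
    """Apple Silicon reports 'arm64'; macOS 13 (Ventura) is the stated minimum."""
    return machine == "arm64" and parse_major(macos_version) >= min_major


# platform.mac_ver()[0] is the macOS release string, or '' on other systems.
machine = platform.machine()
release = platform.mac_ver()[0]
print(f"machine={machine!r}, macOS={release or 'n/a'}")
print("Meets the minimum" if meets_minimum(machine, release) else "Does not meet the minimum")
```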
From there, you have a few paths:
- Outlier — native Mac app that manages models, runs inference via MLX and Metal, and supports models from 1.5B up to 397B with paged inference. Nano tier is free.
- Ollama — command-line tool with a local HTTP API. Pulls models from a registry, uses llama.cpp with Metal acceleration under the hood.
- LM Studio — GUI application, good for browsing and testing models without writing any code.
All three use the Metal GPU backend on Apple Silicon, meaning they route matrix operations to your chip's GPU cores automatically. You don't configure anything; the Mac just works.
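As an illustration of the Ollama path, its local HTTP API can be called from a few lines of Python. This is a sketch, not a full client: it assumes Ollama is running on its default port (11434) and that you have already pulled the model you name. The helper names are hypothetical; the `eval_count` and `eval_duration` fields come from Ollama's non-streaming response and let you measure tokens per second on your own machine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_payload(model: str, prompt: str) -> bytes:
    """JSON body for a single, non-streaming generation request."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")


def tokens_per_second(result: dict) -> float:
    """eval_count is the number of generated tokens; eval_duration is in nanoseconds."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)


def generate(model: str, prompt: str, url: str = OLLAMA_URL) -> dict:
    """Send the request to a running Ollama server and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With the server running, `result = generate("your-model-name", "Why is the sky blue?")` returns a dictionary whose `response` key holds the generated text, and `tokens_per_second(result)` reports the measured generation speed. The model name here is a placeholder for whichever model you have pulled.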
Try local AI on your Mac
Outlier runs on any M1 or newer Mac. Download the app, pick a model, and start running inference on your own hardware — no GPU, no cloud account required.