Plain-English explainers on MLX, unified memory, quantization, and what actually happens when a model runs on Apple Silicon.
What is K_override on the Plus tier?
K_override sets how many experts the paged engine keeps resident in RAM at any moment. A higher K reduces cache misses but increases RAM use, and it can slow decoding if the model is compute-bound rather than I/O-bound.
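To make the trade-off concrete, here is a minimal sketch of a resident-expert cache with a capacity of K. The class name `ResidentExperts`, the least-recently-used eviction policy, and the request trace are all assumptions for illustration; the eviction policy the engine actually uses is not described here.

```python
from collections import OrderedDict


class ResidentExperts:
    """Holds at most k experts in RAM (hypothetical class, LRU eviction assumed)."""

    def __init__(self, k):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self.resident = OrderedDict()  # expert_id -> weights, oldest first
        self.hits = 0
        self.misses = 0

    def fetch(self, expert_id):
        if expert_id in self.resident:
            # Cache hit: mark this expert as most recently used.
            self.resident.move_to_end(expert_id)
            self.hits += 1
            return self.resident[expert_id]
        # Cache miss: stand-in for reading the expert's weights from disk.
        self.misses += 1
        weights = f"weights-for-expert-{expert_id}"
        self.resident[expert_id] = weights
        if len(self.resident) > self.k:
            # Over capacity: evict the least recently used expert.
            self.resident.popitem(last=False)
        return weights


def miss_count(k, requests):
    """Replay a sequence of expert requests and count disk reads."""
    cache = ResidentExperts(k)
    for expert_id in requests:
        cache.fetch(expert_id)
    return cache.misses


# Illustrative trace of expert ids requested by successive tokens.
requests = [0, 1, 2, 0, 1, 3, 0, 1, 2, 3]
for k in (2, 3, 4):
    print(k, miss_count(k, requests))
```

On this trace the miss count falls as K rises, while the number of experts held in RAM grows with K. The sketch only models the I/O side; it says nothing about the compute-bound case, where a larger K costs RAM without reducing decode time.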
What is MLX and why does Outlier use it?
MLX is Apple’s array framework for Apple Silicon. Its arrays live in memory shared by the CPU and GPU, so it avoids the host-to-device copy that CUDA frameworks require on discrete GPUs, and it ships a quantization toolkit that targets the 4-bit dense format Outlier uses for every shipping tier except Plus.
What is a paged Mixture-of-Experts model?
A 397B-parameter MoE that does not fit in unified memory has to be paged from disk. The router picks a subset of experts per token; the engine keeps the top-K resident in RAM and reads the rest on demand.
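The per-token routing step can be sketched as follows. This is an illustration under stated assumptions, not the engine's code: the function names `route` and `experts_to_load` are hypothetical, and normalizing the gate weights with a softmax over only the selected experts is one common choice among several.

```python
import math


def route(logits, top_k):
    """Pick the top_k highest-scoring experts for one token.

    Returns (expert_index, gate_weight) pairs. Gate weights are a softmax
    over the selected experts only (an assumed normalization).
    """
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    chosen = ranked[:top_k]
    peak = max(logits[i] for i in chosen)  # subtract the max for stability
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def experts_to_load(selected, resident_ids):
    """Experts the router chose that are not in RAM and must be read from disk."""
    return [i for i, _ in selected if i not in resident_ids]


# Router scores for four experts on one token (made-up numbers).
selected = route([0.1, 2.0, -1.0, 1.5], top_k=2)
print(selected)
print(experts_to_load(selected, resident_ids={1}))
```

Only the experts returned by `experts_to_load` cost a disk read for that token; every other selected expert is served from RAM.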
What is ternary quantization?
Ternary quantization stores model weights as three values (typically -1, 0, +1) plus a per-channel scale. The compression is aggressive (around 1.6 bits per weight) and the matmul becomes a sign-and-add instead of a multiply, which is friendly to commodity hardware.
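A three-valued weight carries log2(3) ≈ 1.58 bits of information, which is where the figure of around 1.6 bits comes from. The sketch below shows the idea on a single channel. It assumes the scale is the mean absolute weight and that weights are rounded to the nearest of -1, 0, +1; other scale and threshold rules exist, and the function names are hypothetical.

```python
def ternarize(row):
    """Quantize one channel to codes in {-1, 0, +1} plus one float scale.

    Assumed rule: scale = mean |w|, code = round(w / scale) clipped to [-1, 1].
    """
    scale = sum(abs(w) for w in row) / len(row)
    if scale == 0:
        return [0] * len(row), 0.0
    codes = [max(-1, min(1, round(w / scale))) for w in row]
    return codes, scale


def ternary_dot(codes, scale, x):
    """Dot product with ternary codes: add, subtract, or skip each input.

    The only multiplication is the single per-channel scale at the end.
    """
    acc = 0.0
    for code, value in zip(codes, x):
        if code == 1:
            acc += value
        elif code == -1:
            acc -= value
    return scale * acc


row = [0.9, -0.05, -1.1, 0.4]
codes, scale = ternarize(row)
print(codes, scale)
print(ternary_dot(codes, scale, [1.0, 2.0, 3.0, 4.0]))
```

The inner loop contains no per-weight multiply, which is the property the sign-and-add description refers to.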
What is unified memory on Apple Silicon?
Unified memory is a shared address space for the CPU and GPU on Apple Silicon. There is no separate VRAM and no PCIe round-trip; the GPU reads model weights directly from main memory at the chip’s memory bandwidth.
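Because the weights are read straight from main memory, memory bandwidth puts a rough ceiling on decode speed. The back-of-envelope sketch below assumes a dense model that reads every weight once per generated token and ignores quantization scales, the KV cache, and activations. The parameter count and bandwidth are illustrative figures, not measurements of any particular chip or tier, and the function names are hypothetical.

```python
def weight_bytes(num_params, bits_per_weight):
    """Approximate size of the weights in bytes (scales and metadata ignored)."""
    return num_params * bits_per_weight / 8


def decode_ceiling_tokens_per_s(bytes_read_per_token, bandwidth_bytes_per_s):
    """Upper bound on tokens per second when decoding is memory-bandwidth-bound."""
    return bandwidth_bytes_per_s / bytes_read_per_token


# Illustrative inputs: 8 billion parameters at 4 bits, 400 GB/s of bandwidth.
size = weight_bytes(8e9, 4)
ceiling = decode_ceiling_tokens_per_s(size, 400e9)
print(f"{size / 1e9:.1f} GB of weights, at most {ceiling:.0f} tokens/s")
```

Real throughput lands below this ceiling, but the ratio explains why fewer bits per weight tends to speed up decoding on a bandwidth-bound model.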