01 The Number That Made Me Try It

Here's a number that should make any Mac owner pay attention: on a four-year-old M1 Max, oMLX served 291 tokens per second across four concurrent requests, a 1.78× speedup over running them serially. Ollama, the default answer for "run an LLM on my Mac", would have queued those same four requests and finished in roughly twice the wall-clock time.

That gap is the entire reason this post exists.

I had two free hours on Sunday afternoon. I had a Mac Studio I barely use for ML. I had an older Mac mini quietly running my home-server stack. And I had this nagging suspicion that oMLX, the LLM inference server that kept popping up on my feed, was the first project to take continuous batching seriously on Apple Silicon.

Two hours turned into a weekend. I learned more about Homebrew's Python bottle ABI policy than I ever wanted to. I uncovered a Hugging Face download bug that is quietly hanging on thousands of machines right now. And I came out the other side with a working, OpenAI-compatible LLM server humming on my Mac Studio.

If you are thinking about running a local LLM in 2026, this is the honest report.

02 What oMLX Actually Is

In one sentence: oMLX is what happens when someone takes the ideas from vLLM (the production-grade NVIDIA LLM server) and ports them properly to Apple Silicon.

Three things make it genuinely different from what came before:

  1. Continuous batching. Most "run an LLM at home" projects (Ollama, LM Studio, llama.cpp's server) process one request at a time. oMLX batches multiple in-flight requests through the model the way vLLM does on H100s. If you have agents fanning out parallel tool calls, this changes the economics.

  2. Tiered KV caching. It maintains a hot in-memory cache and a paged SSD cache for prefix reuse, configurable up to 100 GB. The same RAG context, hit a thousand times by different user turns, gets cached on disk and reloaded fast. This is the kind of feature you usually only see in commercial inference platforms.

  3. OpenAI-compatible API. /v1/models, /v1/chat/completions, the whole shape. Anything that talks to OpenAI can talk to oMLX: your existing Python SDK, your existing curl scripts, your existing agent framework. Swap one base URL and you are done.
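
For example, clients that read the standard OpenAI environment variables (the official Python SDK does) only need the base URL pointed at wherever the server is listening. A minimal sketch, assuming the omlx serve --port 8765 invocation I use later in this post:

export OPENAI_BASE_URL="http://127.0.0.1:8765/v1"   # wherever omlx serve is listening
export OPENAI_API_KEY="local-placeholder"           # many SDKs insist on a non-empty key even if the server never checks it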

It is built on Apple's MLX framework, so it is Apple Silicon only. macOS 15+ (Sequoia or newer). Python 3.10+. Apache 2.0 license. There is also a macOS menu-bar app for people who would rather not touch the terminal, but for serious use, the CLI is where you live.

03 The Install: One Mac Worked, One Mac Wanted to Fight

Two Macs. Same brew. Same tap. Two completely different stories.

The Mac Studio: Textbook Install

This box runs macOS 26.3 (Tahoe). The Homebrew path was exactly as the README promised:

brew tap jundot/omlx https://github.com/jundot/omlx
brew install omlx

About twelve minutes start to finish. Most of that was spent downloading llvm and rust, because the formula compiles pydantic-core, rpds-py, and tiktoken from source. When it was done, omlx --help worked, brew services start jundot/omlx/omlx would set it as a login service, and /opt/homebrew/opt/omlx/libexec/bin/pip install mcp would add MCP support. Lovely.
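
For reference, those follow-up steps collected into one block (the paths are exactly what the formula laid down on my machine):

omlx --help                                          # sanity check
brew services start jundot/omlx/omlx                 # run as a login service
/opt/homebrew/opt/omlx/libexec/bin/pip install mcp   # optional: MCP support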

The Mac mini: Same Command, Total Failure

The mini runs macOS 26.0.1: same Apple Silicon, just a slightly older point release. Same brew install omlx. About ninety seconds in, the post-install step blew up with this:

ImportError: dlopen(...pyexpat.cpython-311-darwin.so):
  Symbol not found: _XML_SetAllocTrackerActivationThreshold
  Expected in: /usr/lib/libexpat.1.dylib

Here is what happened, because it is genuinely surprising: the Homebrew python@3.11 bottle was built against a newer libexpat than the one Apple shipped in macOS 26.0.1's /usr/lib. The macOS update to 26.3 quietly fixed the OS-side library. On 26.0.1, you cannot create a Python venv with brew's Python. Which means brew install omlx, or any Homebrew formula that builds a Python venv in post-install, dies on the spot.

Brew's error message blames out-of-date Command Line Tools. That is a red herring: the CLT does not ship /usr/lib/libexpat.1.dylib; Apple does, as part of the OS itself. Updating Xcode would not have helped me.
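
If you want to know whether your own machine is affected before brew gets halfway through an install, importing pyexpat with brew's Python reproduces the failure directly. A quick check, assuming Homebrew's python@3.11 is already installed at the default Apple Silicon prefix:

/opt/homebrew/opt/python@3.11/bin/python3.11 -c "import pyexpat" \
  && echo "system libexpat is new enough" \
  || echo "ABI mismatch: expect brew's post-install venv step to fail"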

The fix is to step around brew's Python entirely. Install uv and run oMLX from source against uv's bundled Python:

git clone https://github.com/jundot/omlx.git
cd omlx
uv venv --python 3.11 .venv
uv pip install --python .venv/bin/python -e .

Total time: under a minute. uv ships its own standalone Python (built from python-build-standalone, which statically links its own expat). The ABI bug just evaporates. Eight minutes of brew anguish, dodged with a single tool nobody told me I needed.
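
Starting the server from that venv is then just a matter of calling the entry point inside .venv; this assumes the source install exposes the same omlx command the brew formula does:

.venv/bin/omlx serve --port 8765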

I will swap the mini to the clean brew install once Apple pushes 26.3 down to it. Until then, the from-source venv is the answer.

04 The hf-xet Trap (or: How I Lost Twenty Minutes to a Silent Hang)

Once oMLX is installed, you need models. oMLX discovers them from ~/.omlx/models/<id>/, so you grab one with hf download:

hf download mlx-community/Llama-3.2-3B-Instruct-4bit \
  --local-dir ~/.omlx/models/llama-3b

On my Mac Studio, this command hung. For twenty minutes. Zero bytes written. No progress bar movement. No error. Nothing.

I almost gave up. Then I poked at the cache directory and saw a single .incomplete file at exactly zero bytes, timestamped twenty minutes ago. Suspicious.
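
If you want to check for the same symptom, the stalled partials are easy to spot. This looks in both the default hub cache and the local model directory from above:

find ~/.cache/huggingface ~/.omlx/models -name "*.incomplete" -ls 2>/dev/null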

Here is the cause, and it is going to bite a lot of people: the modern huggingface_hub client uses Xet (a new content-addressable backend) by default for some repos. When Xet hangs (and it does, on certain network setups, including mine apparently), the CLI gives you no signal whatsoever. The download just sits there.

The fix is one environment variable:

export HF_HUB_DISABLE_XET=1

With that set, the same download finished in twenty-five seconds. Add it to your ~/.zshrc until the Hugging Face team improves their timeout and failure surfacing. The hub's GitHub issues are full of confused users who have no idea why their downloads hang. Now you know.
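
In practice that means persisting the variable and re-running the exact download from before:

export HF_HUB_DISABLE_XET=1
echo 'export HF_HUB_DISABLE_XET=1' >> ~/.zshrc
hf download mlx-community/Llama-3.2-3B-Instruct-4bit \
  --local-dir ~/.omlx/models/llama-3b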

05 First Inference (Two and a Half Hours In)

With the model finally on disk, I started the server:

omlx serve --port 8765

And hit it with curl:

curl -s http://127.0.0.1:8765/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen-0.5b","messages":[{"role":"user","content":"Hello"}]}'

Response:

{
  "choices":[{"message":{"role":"assistant",
    "content":"Hello there! How can I assist you today?"}}],
  "usage":{"prompt_tokens":37,"completion_tokens":10}
}
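
One note on the curl above: the model field is the directory name under ~/.omlx/models (the <id> from the discovery path earlier), not the full Hugging Face repo id. If the server ever rejects your model name, /v1/models lists exactly which ids it discovered:

curl -s http://127.0.0.1:8765/v1/models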

Two and a half hours of yak-shaving for a polite hello. Worth every minute, because the next benchmark made me sit up.

06 The Numbers That Matter

I ran a focused benchmark on the Mac Studio with mlx-community/Qwen2.5-0.5B-Instruct-4bit, a tiny model, deliberately, so I was measuring the server's overhead and not the model's compute time.

Hardware: M1 Max, 32 GB unified memory, 10-core CPU. Four years old. Already obsolete by Apple's own marketing standards.

Test                             Wall time   Output tokens   Throughput
1 request, 256-token gen         1.31 s      256             196 tok/s
4 concurrent, 256 tokens each    2.94 s      855             291 tok/s aggregate

Memory footprint: 711 MB RSS with the model loaded. The whole thing is so small it would run on a base-model M4 mini with room to spare.
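
That number is easy to check for yourself: on macOS, ps reports RSS in kilobytes, and the pgrep pattern below assumes the server process's command line contains omlx serve:

ps -o rss=,comm= -p "$(pgrep -f 'omlx serve' | head -1)"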

The interesting number is the second row. Four parallel requests finished in 2.94 seconds. Not 4 × 1.31 = 5.24 seconds. That is a 1.78× speedup purely from continuous batching: requests sharing forward passes through the model instead of queueing politely behind each other.

On a 0.5B model the gain is modest because the model is memory-bandwidth-bound at this size. The GPU is not the bottleneck. On a 7B or 13B model, where the GPU genuinely is the bottleneck, the batching win compounds. I will rerun with Qwen2.5-7B-Instruct-4bit and update this post with those numbers; I expect the parallel-vs-serial gap to widen significantly.
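
If you want to reproduce the shape of the concurrent test yourself, a crude fan-out with curl is enough; the model id is the same one from the first-inference example, and max_tokens matches the 256-token generations above:

time ( for i in 1 2 3 4; do
    curl -s http://127.0.0.1:8765/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model":"qwen-0.5b","messages":[{"role":"user","content":"Write a short story."}],"max_tokens":256}' \
      > /dev/null &
  done; wait )

Drop the trailing & (and the wait) to time the same four requests serially for comparison.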

07 Where oMLX Genuinely Beats Ollama

Ollama is the default answer when someone says "run an LLM on my Mac." It is polished, popular, and works out of the box. So where does oMLX have a real edge?

Concurrent requests. This is the headline. Ollama processes requests serially per model. If you have an agent fanning out ten parallel tool calls, Ollama queues all ten and walks through them one at a time. oMLX runs them through the model together. If your workload looks anything like modern agentic AI (and increasingly, everyone's does), this matters.

Prefix caching with SSD spill. Ollama keeps a small in-memory KV cache. oMLX adds a paged SSD-backed cache up to 100 GB. The same system prompt and RAG context, hit a thousand times with different user turns, gets cached on disk and reloaded in milliseconds. For RAG-heavy or agent-heavy workloads this is a real win.

Memory ceilings that mean something. --max-model-memory and --max-process-memory are enforced limits. Ollama's memory management is more opportunistic. It works, but it does not coexist as politely with other workloads on the same machine. If your Mac is doing more than one thing, oMLX is the better neighbor.

An honest path to production. oMLX exposes log levels, request limits, host binding, API keys, HTTP proxy support, custom CA bundles, base-path prefixes for reverse proxying. None of that is glamorous. All of it is what you actually need when this stops being a toy and becomes part of your real stack.

08 Where It Does Not

I would not be writing this honestly if I did not call out the gaps.

Smaller model library. Ollama has a curated registry: ollama pull llama3.1 and you are done. With oMLX you download directly from Hugging Face's mlx-community repos. More flexible, less hand-holding. If you do not already know which quantization you want, this is a steeper curve.

Smaller community. At the time of writing, oMLX has a few hundred GitHub stars. Ollama has tens of thousands. Documentation, third-party guides, Stack Overflow answers, blog posts. Ollama wins by an order of magnitude. You will be reading source code more often than you would like.

Apple Silicon only. Ollama runs on Linux and Windows too. If you have an Intel Mac, a Linux box, or a Windows machine, oMLX is not your tool. That is by design, since it is built on Apple's MLX framework, but it is also a real limit.

A Homebrew install path that can break on point releases. The libexpat ABI bug I hit on 26.0.1 is not oMLX's fault, but it is a real risk surface. The from-source path is bulletproof; the brew path is convenient when it works.

09 The Honest Verdict

If you are on Apple Silicon, and your workflow fans out parallel LLM requests (agents, batched evaluation, multi-document summarization, ensembling), switch this afternoon. The OpenAI-compatible API means swapping it in is one base-URL line in your client config. The throughput difference is real and measurable.

If you are occasionally chatting with one model from one client, the difference will not matter much in wall-clock time. Stay with Ollama or LM Studio. The polish and the ecosystem are worth the modest performance gap at single-user scale.

I am leaving oMLX running on the Mac Studio as the inference backend for my home agent stack. The mini stays on the from-source install until Apple ships 26.3 to it. And I now have a much clearer mental model of what a "production" LLM server looks like at home, and why "production" mostly means "does not fall over when you ask it two questions at the same time."

Two hours? It cost me a weekend. I would do it again tomorrow.


Benchmarks run May 10, 2026 on macOS 26.3 / M1 Max / 32 GB. Your mileage will vary. Source: github.com/jundot/omlx.