# mlx-server

Local LLM inference server for Apple Silicon using vllm-mlx. Serves MLX-quantized models via an OpenAI-compatible API.

## Setup

Requires Python 3.13+ and uv.

```sh
uv sync
```

## Usage

```sh
./serve.sh
```

The server starts on port 8082 by default, serving `mlx-community/Qwen3.5-27B-6bit`.

Override with environment variables:

```sh
MLX_MODEL=mlx-community/Qwen3.5-27B-4bit MLX_PORT=8080 ./serve.sh
```
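Because the server speaks an OpenAI-compatible API, any OpenAI client can talk to it. As a minimal sketch using only the Python standard library (the `/v1/chat/completions` path follows the OpenAI convention; the port and model name are the defaults quoted above and should be adjusted if overridden):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8082/v1"  # default MLX_PORT from serve.sh
MODEL = "mlx-community/Qwen3.5-27B-6bit"  # default model


def build_request(prompt, model=MODEL):
    """Build an OpenAI-compatible chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(prompt):
    """POST the payload to the local server and return the reply text."""
    req = urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```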

## Benchmarks

Qwen3.5-27B generation throughput (isolated 3-run averages, M3 Max 96GB):

| Backend | Quantization | tok/s |
|---------|--------------|-------|
| MLX     | 4bit         | 16.8  |
| MLX     | 6bit         | 11.8  |
| MLX     | mxfp8        | 9.4   |
| Ollama  | Q4_K_M       | 9.6   |

MLX 6bit is the default: it offers the best balance of quality and throughput (+23% over Ollama Q4_K_M).
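The throughput figures above are 3-run averages. A sketch of how such a number can be derived from per-run token counts and wall-clock times (the timing method here is an assumption for illustration, not the repo's actual benchmark script; the run values are hypothetical):

```python
import statistics


def tokens_per_second(completion_tokens, elapsed_seconds):
    """Generation throughput for a single run."""
    return completion_tokens / elapsed_seconds


def average_throughput(runs):
    """Mean tok/s over isolated runs, e.g. three runs per backend."""
    return statistics.mean(tokens_per_second(t, s) for t, s in runs)


# Three hypothetical runs: (completion_tokens, elapsed_seconds)
runs = [(472, 40.0), (590, 50.0), (354, 30.0)]
print(round(average_throughput(runs), 1))  # 11.8 tok/s
```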
