# mlx-server

Local LLM inference server for Apple Silicon using vllm-mlx. Serves MLX-quantized models via an OpenAI-compatible API.

## Setup

Requires Python 3.13+ and uv.

```sh
uv sync
```

## Usage

```sh
./serve.sh
```

The server starts on port 8082 by default, serving `mlx-community/Qwen3.5-27B-6bit`.

Override with environment variables:

```sh
MLX_MODEL=mlx-community/Qwen3.5-27B-4bit MLX_PORT=8080 ./serve.sh
```
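Because the server speaks an OpenAI-compatible API, any OpenAI client can talk to it. As a minimal sketch using only the Python standard library (the `/v1/chat/completions` path follows the OpenAI convention; the port and model name are the defaults quoted above and should be adjusted if overridden):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8082/v1"  # default MLX_PORT from serve.sh
MODEL = "mlx-community/Qwen3.5-27B-6bit"  # default model


def build_request(prompt, model=MODEL):
    """Build an OpenAI-compatible chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(prompt):
    """POST the payload to the local server and return the reply text."""
    req = urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```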

## Benchmarks

Qwen3.5-27B generation throughput (isolated 3-run averages, M3 Max 96GB):

| Backend | Quantization | tok/s |
|---------|--------------|-------|
| MLX     | 4bit         | 16.8  |
| MLX     | 6bit         | 11.8  |
| MLX     | mxfp8        | 9.4   |
| Ollama  | Q4_K_M       | 9.6   |

MLX 6bit is the default: it offers the best balance of quality and throughput (+23% over Ollama Q4_K_M).
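The throughput figures above are 3-run averages. A sketch of how such a number can be derived from per-run token counts and wall-clock times (the timing method here is an assumption for illustration, not the repo's actual benchmark script; the run values are hypothetical):

```python
import statistics


def tokens_per_second(completion_tokens, elapsed_seconds):
    """Generation throughput for a single run."""
    return completion_tokens / elapsed_seconds


def average_throughput(runs):
    """Mean tok/s over isolated runs, e.g. three runs per backend."""
    return statistics.mean(tokens_per_second(t, s) for t, s in runs)


# Three hypothetical runs: (completion_tokens, elapsed_seconds)
runs = [(472, 40.0), (590, 50.0), (354, 30.0)]
print(round(average_throughput(runs), 1))  # 11.8 tok/s
```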
