-
Notifications
You must be signed in to change notification settings - Fork 14.1k
feat: add --moe-n-expert flag for MoE expert count override #18029
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Add ability to reduce the number of active experts in MoE models at runtime, providing significant speedup with minimal quality loss when using 50% of default experts. Implementation: - Add moe_n_expert_override parameter to llama_context_params - Add --moe-n-expert CLI flag to override n_expert_used - Implement "Hard Mask" in build_moe_ffn() that slices expert tensors - Uses ggml_view_2d/3d + ggml_cont to reduce actual computation Benchmark results (AOCL BLIS 5.0, AMD EPYC 9655): - Qwen3-Coder-480B-A35B: 2.5 → 3.7 t/s (48% speedup) - GLM-4.6-355B-A32B: 2.2 → 3.0 t/s (36% speedup) - Qwen3-Coder-30B-A3B: 26.6 → 33.6 t/s (26% speedup) - Qwen3-VL-30B-A3B: 32.2 → 38.9 t/s (21% speedup) Quality: Excellent at 50% experts, degraded at 25%, gibberish at 12.5% Usage: llama-cli -m model.gguf --moe-n-expert 4 -p "prompt" 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
|
I would be surprised if this survives perplexity tests. Seems like a lobotomy. |
|
This seems to be based on some misunderstanding, the "hard mask" makes no sense and completely destroys warmup. Besides you can easily override the |
|
@pestopoppa your idea is the same as this repo and with |
|
Thank you @am17an, @CISC, and @congson1293 for the feedback! You raised valid points that prompted us to investigate further. Our Use Case: CPU-Only InferenceWe're optimizing for CPU-only inference on an AMD EPYC 9655 (96 cores, 1.13TB DDR5). Memory capacity isn't our constraint—bandwidth is. We wanted to understand if reducing active experts could improve throughput. Hard Mask vs Soft Mask ComparisonFollowing @CISC's suggestion, we benchmarked
Hard mask is ~32% faster in absolute terms (38.9 vs 29.4 t/s). The difference appears to be that hard mask skips the routing computation for excluded experts entirely, while soft mask still routes over all 8 before computing only 4. Larger Models Benefit More
Quality Assessment@am17an - You're right that we need perplexity measurements. Subjectively, at 50% experts the output quality appears unchanged for conversational tasks. At 25% we saw degradation, and at 12.5% it was gibberish. We'll add formal perplexity benchmarks. SummaryFor CPU inference where memory isn't the bottleneck, hard mask provides better speedups than soft mask. The routing "warmup" concern may be more relevant for GPU scenarios where VRAM is precious and you want proper expert selection. Would you recommend we:
Happy to adjust the PR based on your guidance! |

Summary
Add ability to reduce the number of active experts in MoE models at runtime, providing significant speedup with minimal quality loss when using 50% of default experts.
moe_n_expert_overrideparameter tollama_context_params--moe-n-expertCLI flag to overriden_expert_usedbuild_moe_ffn()that slices expert tensorsggml_view_2d/3d+ggml_contto reduce actual computationMotivation
MoE models like Qwen3 and GLM-4 select Top-K experts per token (e.g., Top-8 out of 128). By reducing the number of active experts at runtime, we can trade a small amount of quality for significant speedup - useful for drafting, interactive use, or resource-constrained environments.
Benchmark Results
Tested on AMD EPYC 9655 (96 cores) with AOCL BLIS 5.0:
Quality Assessment
Usage
Test Plan
--moe-n-expertvalues from 1 to N on multiple MoE models🤖 Generated with Claude Code