@pestopoppa

Summary

Add ability to reduce the number of active experts in MoE models at runtime, providing significant speedup with minimal quality loss when using 50% of default experts.

  • Add moe_n_expert_override parameter to llama_context_params
  • Add --moe-n-expert CLI flag to override n_expert_used
  • Implement a "hard mask" in build_moe_ffn() that slices the expert routing tensors
  • Use ggml_view_2d/3d + ggml_cont so the smaller expert count reduces the actual computation (see the sketch below)
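
As a rough illustration of the hard-mask idea (a sketch under assumptions, not the PR's actual implementation): the hypothetical helper below narrows the routing outputs that build_moe_ffn() already produces in llama.cpp — selected_experts (expert ids per token, sorted best-first by ggml_top_k) and the matching weights — so that everything downstream only sees k experts. The name moe_hard_mask and the parameter k are illustrative; the tensor shapes follow the existing comments in build_moe_ffn(), and ggml_cont on the I32 id tensor is assumed to be fine on the CPU backend this PR targets.

```cpp
#include "ggml.h"

// Hypothetical helper (sketch, not the PR's actual code): given the routing
// outputs of build_moe_ffn(), keep only the first k experts per token.
// `selected_experts` is [n_expert_used, n_tokens] (I32, sorted best-first by
// ggml_top_k); `weights` is [1, n_expert_used, n_tokens] (F32).
static void moe_hard_mask(
        struct ggml_context * ctx,
        struct ggml_tensor ** selected_experts,
        struct ggml_tensor ** weights,
        int64_t               n_tokens,
        int64_t               k) {
    // narrow the expert ids: [n_expert_used, n_tokens] -> [k, n_tokens]
    *selected_experts = ggml_cont(ctx, ggml_view_2d(ctx, *selected_experts,
            k, n_tokens, (*selected_experts)->nb[1], 0));

    // narrow the routing weights: [1, n_expert_used, n_tokens] -> [1, k, n_tokens]
    *weights = ggml_cont(ctx, ggml_view_3d(ctx, *weights,
            1, k, n_tokens, (*weights)->nb[1], (*weights)->nb[2], 0));

    // the caller should also use k (instead of n_expert_used) for any later
    // weight normalization and for ggml_mul_mat_id over the expert tensors,
    // so the graph does proportionally less work.
}
```

The intended call site would be right after ggml_top_k / ggml_get_rows, whenever the override is smaller than n_expert_used. Because ggml_top_k returns ids sorted by router score, keeping the first k rows keeps the highest-scoring experts, and the ggml_cont copies turn the narrower views into genuinely smaller inputs for ggml_mul_mat_id rather than a masked full-size computation.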

Motivation

MoE models like Qwen3 and GLM-4 select Top-K experts per token (e.g., Top-8 out of 128). By reducing the number of active experts at runtime, we can trade a small amount of quality for a significant speedup, which is useful for drafting, interactive use, or resource-constrained environments.

Benchmark Results

Tested on AMD EPYC 9655 (96 cores) with AOCL BLIS 5.0:

| Model | Baseline | 50% Experts | Speedup |
|---|---|---|---|
| Qwen3-Coder-480B-A35B | 2.5 t/s | 3.7 t/s | 48% |
| GLM-4.6-355B-A32B | 2.2 t/s | 3.0 t/s | 36% |
| Qwen3-Coder-30B-A3B | 26.6 t/s | 33.6 t/s | 26% |
| Qwen3-VL-30B-A3B | 32.2 t/s | 38.9 t/s | 21% |

Quality Assessment

  • 50% experts (e.g., Top-4 instead of Top-8): Excellent quality, nearly identical to baseline
  • 25% experts: Noticeable degradation, still coherent
  • 12.5% experts: Significant quality loss, not recommended

Usage

# Use 4 experts instead of the default 8
llama-cli -m model.gguf --moe-n-expert 4 -p "prompt"

# Combine with other options
llama-cli -m qwen3-30b.gguf --moe-n-expert 4 -t 96 -c 4096 -p "Hello"

Test Plan

  • Verified baseline (no flag) produces identical output to before
  • Tested with --moe-n-expert values from 1 to N on multiple MoE models
  • Confirmed speedup scales with expert reduction
  • Validated quality remains acceptable at 50% expert count

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.5 <[email protected]>

@am17an (Collaborator) commented Dec 14, 2025

I would be surprised if this survives perplexity tests. Seems like a lobotomy.

@CISC (Collaborator) commented Dec 14, 2025

This seems to be based on some misunderstanding, the "hard mask" makes no sense and completely destroys warmup.

Besides, you can easily override the <arch>.expert_used_count with --override-kv if you wish. :)

@IIIIIllllIIIIIlllll

May I ask: I have seen a similar setting in LM Studio. Is it the same as this PR?

(screenshot of the LM Studio setting)

@congson1293

@pestopoppa your idea is the same as this repo, and with llama-server we can change expert_used_count with the --override-kv param, for example: --override-kv qwen3moe.expert_used_count=int:4. I haven't tried this idea, so like @am17an I'm curious about the perplexity test when reducing to 50% of the active experts.

@pestopoppa (Author)

Thank you @am17an, @CISC, and @congson1293 for the feedback! You raised valid points that prompted us to investigate further.

Our Use Case: CPU-Only Inference

We're optimizing for CPU-only inference on an AMD EPYC 9655 (96 cores, 1.13TB DDR5). Memory capacity isn't our constraint—bandwidth is. We wanted to understand if reducing active experts could improve throughput.

Hard Mask vs Soft Mask Comparison

Following @CISC's suggestion, we benchmarked --override-kv (soft mask) against our hard mask approach on Qwen3-VL-30B-A3B:

| Approach | Method | Baseline | 4 Experts | Speedup |
|---|---|---|---|---|
| Soft mask | --override-kv qwen3vlmoe.expert_used_count=int:4 | 25.1 t/s | 29.4 t/s | +17% |
| Hard mask | --moe-n-expert 4 | 32.2 t/s | 38.9 t/s | +21% |

Comparing absolute throughput at 4 experts, hard mask is ~32% faster than soft mask (38.9 vs 29.4 t/s). The difference appears to be that the hard mask skips the routing computation for the excluded experts entirely, while the soft mask still routes over all 8 before computing only 4.

Larger Models Benefit More

| Model | Baseline | Hard Mask (50%) | Speedup |
|---|---|---|---|
| Qwen3-Coder-480B-A35B | 2.5 t/s | 3.7 t/s | +48% |
| GLM-4.6-355B-A32B | 2.2 t/s | 3.0 t/s | +36% |
| Qwen3-Coder-30B-A3B | 26.6 t/s | 33.6 t/s | +26% |

Quality Assessment

@am17an - You're right that we need perplexity measurements. Subjectively, at 50% experts the output quality appears unchanged for conversational tasks. At 25% we saw degradation, and at 12.5% it was gibberish. We'll add formal perplexity benchmarks.

Summary

For CPU inference where memory isn't the bottleneck, hard mask provides better speedups than soft mask. The routing "warmup" concern may be more relevant for GPU scenarios where VRAM is precious and you want proper expert selection.

Would you recommend we:

  1. Add perplexity benchmarks to validate quality claims?
  2. Document this as a CPU-specific optimization?
  3. Something else entirely?

Happy to adjust the PR based on your guidance!
