mtmd: add Eagle2-VL vision and projector support #17360
base: master
Conversation
ngxson left a comment:
Can you explicitly confirm if part of the PR is generated by AI? I feel very suspicious about some redundant code
While you said in the PR description that you tested it, you haven't even mentioned a link to the model, or how you tested it.
tools/mtmd/clip.cpp (Outdated)

```cpp
learned_pos_embd,
nullptr);

// keep runtime quiet in normal runs; shapes are correct by construction
```
some indentations seem off here
tools/mtmd/clip.cpp (Outdated)

```cpp
if (model.mm_0_b) {
    embeddings = ggml_add(ctx0, embeddings, model.mm_0_b);
}

embeddings = ggml_gelu(ctx0, embeddings);

GGML_ASSERT(model.mm_2_w != nullptr);
// keep [n_in, n_tokens] layout for the second matmul as well
embeddings = ggml_reshape_2d(ctx0, embeddings, embeddings->ne[0], embeddings->ne[1]);
embeddings = ggml_cont_2d(ctx0, embeddings, embeddings->ne[0], embeddings->ne[1]);
// Weights are canonicalized at conversion time to [n_in, n_out]; multiply directly.
embeddings = ggml_mul_mat(ctx0, model.mm_2_w, embeddings);
if (model.mm_2_b) {
    embeddings = ggml_add(ctx0, embeddings, model.mm_2_b);
}
```
better replacing this whole block with build_ffn
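For context, the sequence in the snippet above (matmul, optional bias, GELU, matmul, optional bias) is the standard two-layer FFN pattern that a helper like build_ffn factors out. In numpy terms the computation is roughly the following; shapes and names here are hypothetical, not the PR's code:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, similar to what ggml uses
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def projector_mlp(x, w0, b0, w2, b2):
    # x: (n_tokens, n_in); w0: (n_hidden, n_in); w2: (n_out, n_hidden)
    h = x @ w0.T + b0        # first Linear (mm.0)
    h = gelu(h)              # activation
    return h @ w2.T + b2     # second Linear (mm.2)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
y = projector_mlp(x,
                  rng.standard_normal((16, 8)), np.zeros(16),
                  rng.standard_normal((6, 16)), np.zeros(6))
print(y.shape)  # (4, 6)
```

Factoring the pattern into one helper call removes the manual bias checks and the redundant reshape/cont pair the reviewer flagged.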
```python
mlp_pos = name.find("mlp1.")
if mlp_pos != -1:
    mlp_suffix = name[mlp_pos + len("mlp1."):]
    # Skip LayerNorm (mlp1.0.*)
    if mlp_suffix.startswith("0."):
        return []
    # Map first Linear (mlp1.1.*) -> mm.0.*
    if mlp_suffix.startswith("1."):
        new_name = "mm.0." + mlp_suffix[2:]
        if new_name.endswith(".weight"):
```
I think all of this code is redundant. This model: https://huggingface.co/nvidia/Eagle2-1B has a simple .mlp.fc1 and .mlp.fc2 MLP; there is no nested mlp1.1.* as you described.
convert_hf_to_gguf.py (Outdated)

```python
]

# 5) Conv3D patch embed -> two Conv2D kernels
if name.endswith("patch_embed.proj.weight") and data_torch.ndim == 5:
```
are you sure about this? seems like bad copy-paste code from QwenVL
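For reference, the Qwen2-VL-style transformation being questioned splits a temporal patch-embedding kernel along its time axis into two 2D kernels. A minimal numpy sketch of that split, with hypothetical shapes, not the PR's code:

```python
import numpy as np

# Hypothetical Conv3D patch-embed weight: (out_ch, in_ch, t, h, w), with t == 2
w3d = np.arange(3 * 3 * 2 * 4 * 4, dtype=np.float32).reshape(3, 3, 2, 4, 4)

# Split along the temporal axis into two Conv2D kernels of shape (out_ch, in_ch, h, w)
w2d_a = w3d[:, :, 0]
w2d_b = w3d[:, :, 1]

print(w2d_a.shape, w2d_b.shape)  # (3, 3, 4, 4) (3, 3, 4, 4)
```

Whether Eagle2-VL checkpoints actually ship a 5-D patch-embed weight is exactly what the reviewer is asking the author to verify.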
Pull request overview
This PR adds Eagle2-VL multimodal model support (1B/2B variants) to the MTMD pipeline. The implementation introduces a dedicated projector type with a 2-layer MLP architecture (LayerNorm → Linear → GELU → Linear) that operates on spatially-merged vision tokens. The changes are self-contained and follow established MTMD patterns for projector implementations.
Key Changes
- New Eagle2VL projector type with metadata-driven spatial merge (default 2×2) and learned absolute position embeddings
- Python converter handles HuggingFace checkpoint normalization, including projector weight canonicalization to [n_in, n_out] layout and QKV tensor splitting
- Runtime graph builder implements ViT encoder with RMS normalization, spatial merge, and 2-layer MLP projector using existing helper functions
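The QKV tensor splitting mentioned in the converter bullet typically separates a fused attention weight into three projections. A simplified numpy sketch, assuming a fused [3*n_embd, n_embd] layout (an illustration, not the PR's code):

```python
import numpy as np

n_embd = 8
# Hypothetical fused attention weight, Q/K/V stacked along the output dimension
fused = np.arange(3 * n_embd * n_embd, dtype=np.float32).reshape(3 * n_embd, n_embd)

# Split into separate Q, K, V weights
wq, wk, wv = np.split(fused, 3, axis=0)

print(wq.shape, wk.shape, wv.shape)  # (8, 8) (8, 8) (8, 8)
```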
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| gguf-py/gguf/constants.py | Adds EAGLE2VL projector type constant to VisionProjectorType enum |
| tools/mtmd/clip-impl.h | Registers PROJECTOR_TYPE_EAGLE2VL enum and string mapping "eagle2vl" |
| tools/mtmd/clip.cpp | Implements build_eagle2vl() graph builder, parameter loading, preprocessing integration, and embedding dimension calculation |
| convert_hf_to_gguf.py | Adds Eagle2VLVisionModel converter with metadata extraction, tensor name normalization, projector weight canonicalization, and QKV splitting |
Co-authored-by: Copilot <[email protected]>
This PR adds initial support for the Eagle2-VL multimodal models (1B / 2B) in the MTMD pipeline.
The update introduces a dedicated converter path and runtime builder for the Eagle2-VL vision tower and its 2-layer projector.
All changes are fully self-contained and do not affect any existing model architectures.
Converter (convert_hf_to_gguf.py)
- Adds Eagle2VLVisionModel, writing VisionProjectorType=EAGLE2VL into GGUF metadata.
- Records the spatial merge size (spatial_merge_size, default: 2×2).
- Canonicalizes projector weights (mm.0, mm.2) to [n_in, n_out]; supports optional biases.

GGUF (gguf-py/gguf/constants.py)
- Adds the projector type EAGLE2VL.

Runtime (tools/mtmd/clip.cpp)
- Implements the build_eagle2vl() vision path: spatial merge followed by the 2-layer MLP projector (mm.0 → GELU → mm.2) using canonical [n_in, n_out] weights.
- Routes PROJECTOR_TYPE_EAGLE2VL to the new builder.
- Derives the output embedding dimension from mm_2_w->ne[1].

Integration & Compatibility
Validation
Tested locally on Eagle2-VL 1B and 2B checkpoints:
Scope
This PR focuses on Eagle2-VL (1B / 2B).
Support for additional Eagle2 variants (e.g., 9B) will be handled in a follow-up.
Closes #16704