model: support GLM4V vision encoder #18042
Conversation
@ngxson Can this branch be used with GLM 4.6V (the 106B one)? I can assist with testing if desired.

@tarruda just tested, it should work with the latest commit (feel free to give it a try)

Will do. Did you publish any GGUF weights?
```cpp
case LLM_ARCH_GLM4:
    return model->hparams.use_mrope() ? LLAMA_ROPE_TYPE_MROPE : LLAMA_ROPE_TYPE_NORM;
case LLM_ARCH_GLM4_MOE:
    return model->hparams.use_mrope() ? LLAMA_ROPE_TYPE_MROPE : LLAMA_ROPE_TYPE_NEOX;
```
Because the two models (vision and non-vision) are mostly the same except for the rope mode, I was lazy and did not duplicate it into a new arch (which would involve quite a lot of copy-pasted code).
I hope we can eventually de-duplicate some of this code via #18051.
In the meantime, let me know if you're OK with keeping this hack, or if a new arch is still preferable @ggerganov @CISC

No, because there is a chance we will change the arch name.
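For context, here is a minimal, self-contained sketch of the kind of hparams-based dispatch discussed in this thread. The `use_mrope()` helper and the `rope_sections` field below are assumptions used for illustration only; they are not copied from this PR.

```cpp
// Sketch only: choose the rope type from the converted hparams instead of
// introducing a separate vision arch. A checkpoint converted with M-RoPE
// sections set (the vision variant, by assumption) gets MROPE; a text-only
// checkpoint falls back to the arch's usual rope type.
#include <array>
#include <cstdint>

enum rope_type { ROPE_TYPE_NORM, ROPE_TYPE_NEOX, ROPE_TYPE_MROPE };

struct hparams_sketch {
    // Per-dimension M-RoPE section sizes; all zero for a text-only model.
    std::array<int32_t, 4> rope_sections = {0, 0, 0, 0};

    bool use_mrope() const {
        for (int32_t s : rope_sections) {
            if (s != 0) {
                return true;
            }
        }
        return false;
    }
};

// Mirrors the switch in the diff above: same arch, different rope mode.
rope_type pick_rope_type_glm4_moe(const hparams_sketch & hp) {
    return hp.use_mrope() ? ROPE_TYPE_MROPE : ROPE_TYPE_NEOX;
}
```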
Sorry for the interruption. I have quantized a Q4 model, but the current PR does not yet support the vision module.
Yes it does?

At first glance, it seems to be an easy model to support, as the HF implementation is pretty much the same as Qwen2.5VL.
However, there are some very subtle differences that even some LLMs will miss (I tried both Grok and Gemini 3, and both missed the first two points):
The embedding output was tested against HF transformers and confirmed to match.
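As a rough illustration of what "confirmed to match" can mean in practice, here is a hedged sketch of comparing an embedding dumped from HF transformers against one produced by this implementation. Using max absolute difference plus cosine similarity is an assumption for illustration, not the exact procedure used for this PR.

```cpp
// Sketch only: numeric comparison of two embedding vectors (e.g. one dumped
// from HF transformers, one from this implementation). Reports the maximum
// absolute difference and the cosine similarity.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

struct embed_diff {
    float max_abs_diff;
    float cosine_sim;
};

embed_diff compare_embeddings(const std::vector<float> & a, const std::vector<float> & b) {
    embed_diff d = {0.0f, 0.0f};
    double dot = 0.0, na = 0.0, nb = 0.0;
    const size_t n = std::min(a.size(), b.size());
    for (size_t i = 0; i < n; ++i) {
        d.max_abs_diff = std::max(d.max_abs_diff, std::fabs(a[i] - b[i]));
        dot += (double) a[i] * b[i];
        na  += (double) a[i] * a[i];
        nb  += (double) b[i] * b[i];
    }
    d.cosine_sim = (float) (dot / (std::sqrt(na) * std::sqrt(nb) + 1e-12));
    return d;
}

int main() {
    // Hypothetical vectors standing in for the two dumped embeddings.
    std::vector<float> ref  = {0.12f, -0.98f, 0.33f};
    std::vector<float> test = {0.12f, -0.98f, 0.33f};
    const embed_diff d = compare_embeddings(ref, test);
    printf("max_abs_diff = %g, cosine_sim = %g\n", d.max_abs_diff, d.cosine_sim);
    return 0;
}
```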
Important
RoPE ordering is corrected upon conversion - no backend changes are needed in this PR.
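To make the note above concrete, below is a hedged sketch of the kind of reordering that "corrected upon conversion" can refer to: moving the rotated dimensions of a Q/K projection from an interleaved pair layout to a half-split layout per head. Whether GLM4V requires exactly this permutation is an assumption; the actual fix lives in the conversion code of this PR.

```cpp
// Sketch only: reorder the rows of a row-major Q/K weight so that, within each
// head, the RoPE dimensions go from interleaved pairs (x0, y0, x1, y1, ...)
// to a half-split layout (x0, x1, ..., y0, y1, ...).
#include <algorithm>
#include <cstddef>
#include <vector>

// weight has shape [n_heads * head_dim, n_cols], stored row-major.
std::vector<float> interleaved_to_half_split(const std::vector<float> & weight,
                                             size_t n_heads, size_t head_dim, size_t n_cols) {
    std::vector<float> out(weight.size());
    for (size_t h = 0; h < n_heads; ++h) {
        for (size_t r = 0; r < head_dim; ++r) {
            // First half of the output rows comes from the even source rows,
            // second half from the odd source rows.
            const size_t src = (r < head_dim / 2) ? 2 * r : 2 * (r - head_dim / 2) + 1;
            const float * src_row = &weight[(h * head_dim + src) * n_cols];
            std::copy(src_row, src_row + n_cols, &out[(h * head_dim + r) * n_cols]);
        }
    }
    return out;
}
```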
Testing
https://huggingface.co/zai-org/GLM-4.6V-Flash
I'm using the ./tools/mtmd/test-1.jpeg already included in this repo:
```
llama-mtmd-cli -m ..... -mm ..... --image ./tools/mtmd/test-1.jpeg -p "extract all texts from this image" --temp 0 -n 1024
```
Output: