-
Notifications
You must be signed in to change notification settings - Fork 72
feat: add multimodal message support with image data URI handling #497
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Summary of ChangesHello @ilopezluna, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the Ollama API's message handling by introducing robust support for multimodal messages. It enables the system to process and correctly format messages that include both textual content and image data, converting them into an OpenAI-compatible structure. A crucial part of this change is the intelligent handling of image data URIs, ensuring that images are always correctly prefixed for downstream processing, even when the input format varies from sources like OpenWebUI. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request adds support for multimodal messages by handling image data URIs, which is a great feature. The implementation correctly converts messages to the OpenAI multimodal format and includes comprehensive tests for various scenarios.
My review includes a few suggestions to improve the implementation:
- In
ensureDataURIPrefix, I've suggested a way to make image type detection more robust instead of hardcodingimage/jpeg. - In
convertMessages, I've recommended pre-allocating a slice to improve performance. - For the new tests, I've proposed a more robust way to compare JSON objects to avoid failures from formatting or key order differences.
Overall, this is a solid contribution. Addressing these points will make the code more resilient and performant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes - here's some feedback:
- The
ensureDataURIPrefixhelper assumes all raw base64 should be treated asimage/jpeg; if other formats are possible from clients, consider either detecting the MIME type or making the default configurable to avoid mislabeling non-JPEG data. - The tests in
TestConvertMessages_Multimodalcompare JSON via exact string equality on map-encoded objects, which is fragile due to nondeterministic map key order; consider unmarshaling both expected and actual JSON into structs ormap[string]interface{}and comparing structurally instead. - In
ensureDataURIPrefix, you may want tostrings.TrimSpacethe input before checking prefixes so that leading/trailing whitespace from UIs does not prevent correct detection of an existing scheme.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `ensureDataURIPrefix` helper assumes all raw base64 should be treated as `image/jpeg`; if other formats are possible from clients, consider either detecting the MIME type or making the default configurable to avoid mislabeling non-JPEG data.
- The tests in `TestConvertMessages_Multimodal` compare JSON via exact string equality on map-encoded objects, which is fragile due to nondeterministic map key order; consider unmarshaling both expected and actual JSON into structs or `map[string]interface{}` and comparing structurally instead.
- In `ensureDataURIPrefix`, you may want to `strings.TrimSpace` the input before checking prefixes so that leading/trailing whitespace from UIs does not prevent correct detection of an existing scheme.
## Individual Comments
### Comment 1
<location> `pkg/ollama/http_handler.go:736-737` </location>
<code_context>
+ return imageData
+ }
+
+ // Assume raw base64 data - add data URI prefix
+ return "data:image/jpeg;base64," + imageData
+}
+
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Hardcoding the MIME type to image/jpeg may be too restrictive for non-JPEG images.
This treats all raw base64 input as `image/jpeg`. If upstream may send PNG/WebP or other formats, the data URI will be incorrect. Consider making the MIME type configurable (e.g., from a parameter or `msg` metadata) or otherwise centralizing/documenting this assumption so it’s easy to change if additional formats are supported.
Suggested implementation:
```golang
// ensureDataURIPrefix ensures that image data has a proper data URI prefix.
// OpenWebUI may send raw base64 data without prefix, but llama.cpp requires it.
// This function:
// - Returns data as-is if it already starts with "data:", "http://", or "https://"
-// - Prepends "data:image/jpeg;base64," to raw base64 strings
-func ensureDataURIPrefix(imageData string) string {
+// - Prepends "data:<mimeType>;base64," to raw base64 strings
+// The mimeType parameter is used for the data URI; if empty, "image/jpeg" is assumed.
+func ensureDataURIPrefix(imageData string, mimeType string) string {
// Check if already has a URI scheme
if strings.HasPrefix(imageData, "data:") ||
strings.HasPrefix(imageData, "http://") ||
strings.HasPrefix(imageData, "https://") {
return imageData
}
+
+ // Default to image/jpeg if no MIME type was provided
+ if mimeType == "" {
+ mimeType = "image/jpeg"
+ }
- // Assume raw base64 data - add data URI prefix
- return "data:image/jpeg;base64," + imageData
+ // Assume raw base64 data - add data URI prefix using the provided MIME type
+ return "data:" + mimeType + ";base64," + imageData
}
```
1. All call sites of `ensureDataURIPrefix` must be updated from `ensureDataURIPrefix(imageData)` to `ensureDataURIPrefix(imageData, mimeType)` (or `ensureDataURIPrefix(imageData, "")` to keep the existing default JPEG behavior).
2. Where possible, derive `mimeType` from upstream metadata (e.g., message fields like `image_mime_type`, HTTP headers such as `Content-Type`, or other existing structures) so that PNG/WebP/etc. are passed through correctly.
3. If there is a central place where image messages are handled, consider defining a small helper that encapsulates how the MIME type is resolved (e.g., `resolveImageMimeType(msg)`), and use that helper when calling `ensureDataURIPrefix`.
</issue_to_address>
### Comment 2
<location> `pkg/ollama/http_handler_test.go:106-107` </location>
<code_context>
+ t.Fatalf("Failed to marshal result: %v", err)
+ }
+
+ // Compare JSON strings
+ if string(resultJSON) != tt.expected {
+ t.Errorf("convertMessages() mismatch\nGot: %s\nExpected: %s", string(resultJSON), tt.expected)
+ }
</code_context>
<issue_to_address>
**suggestion (testing):** Avoid brittle JSON string comparison in table test
Comparing raw JSON strings here is brittle for `map[string]interface{}` because key order isn’t guaranteed and may vary by Go version/implementation. Instead, unmarshal both `resultJSON` and `tt.expected` into a structured type (e.g. `[]map[string]interface{}` or `[]any`) and compare with `reflect.DeepEqual` or a cmp helper so the test only fails on real semantic differences, not ordering or formatting.
Suggested implementation:
```golang
// Marshal to JSON for comparison
resultJSON, err := json.Marshal(result)
if err != nil {
t.Fatalf("Failed to marshal result: %v", err)
}
// Unmarshal both actual and expected into structured values to avoid brittle
// string comparison that depends on JSON key ordering or formatting.
var resultData []map[string]interface{}
if err := json.Unmarshal(resultJSON, &resultData); err != nil {
t.Fatalf("Failed to unmarshal result JSON: %v", err)
}
var expectedData []map[string]interface{}
if err := json.Unmarshal([]byte(tt.expected), &expectedData); err != nil {
t.Fatalf("Failed to unmarshal expected JSON: %v", err)
}
// Compare structured values so the test only fails on semantic differences.
if !reflect.DeepEqual(resultData, expectedData) {
t.Errorf("convertMessages() mismatch\nGot: %s\nExpected: %s", string(resultJSON), tt.expected)
}
```
1. Ensure the file imports the `reflect` package at the top:
- Add `reflect` to the existing import block, e.g.:
`import ( ... "encoding/json" "reflect" ... )`
2. If the `expected` field in the table tests is not JSON representing a `[]map[string]interface{}`, adjust the `resultData`/`expectedData` types (e.g. to `[]any` or a concrete struct) to match the actual shape of `convertMessages` output and the expected JSON.
</issue_to_address>
### Comment 3
<location> `pkg/ollama/http_handler_test.go:91` </location>
<code_context>
+ {
+ name: "assistant message with tool calls",
</code_context>
<issue_to_address>
**suggestion (testing):** Add coverage for messages with `tool_call_id` (tool result messages)
Since `convertMessages` handles `ToolCalls` and `ToolCallID` separately, please add a test case for a tool result message (e.g., `Role` = "tool" with `ToolCallID` set) and assert that the output map includes the correct `"tool_call_id"`. This will exercise the tool-result path and protect against regressions.
</issue_to_address>
### Comment 4
<location> `pkg/ollama/http_handler_test.go:114` </location>
<code_context>
+ }
+}
+
+func TestEnsureDataURIPrefix(t *testing.T) {
+ tests := []struct {
+ name string
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding an empty-string case for `ensureDataURIPrefix`
You already cover prefixed data URIs and http/https URLs. Please also add a case for an empty-string input, which currently becomes just the `"data:image/jpeg;base64,"` prefix, so this behavior is explicitly tested and guarded against future changes.
Suggested implementation:
```golang
func TestEnsureDataURIPrefix(t *testing.T) {
tests := []struct {
name string
input string
expected string
}{
{
name: "jpeg data uri",
input: "data:image/jpeg;base64,abc123",
expected: "data:image/jpeg;base64,abc123",
},
{
name: "http url",
input: "http://example.com/image.jpg",
expected: "http://example.com/image.jpg",
},
{
name: "https url",
input: "https://example.com/image.jpg",
expected: "https://example.com/image.jpg",
},
{
name: "raw base64 without prefix",
input: "abc123",
expected: "data:image/jpeg;base64,abc123",
},
{
name: "empty string",
input: "",
// Current behavior: empty input becomes just the data URI prefix.
expected: "data:image/jpeg;base64,",
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
result := ensureDataURIPrefix(tt.input)
if result != tt.expected {
t.Errorf("ensureDataURIPrefix(%q) = %q, want %q", tt.input, result, tt.expected)
}
})
}
}
```
If the concrete test cases or field names in `TestEnsureDataURIPrefix` differ from what I inferred, adjust the SEARCH block to match the existing tests slice and insert the `empty string` case in the same style:
- Keep `name`, `input`, and `expected` field names consistent.
- Ensure the expected value for the empty-string input is `"data:image/jpeg;base64,"`, reflecting the current behavior you want to guard.
</issue_to_address>
### Comment 5
<location> `pkg/ollama/http_handler_test.go:157-166` </location>
<code_context>
+func TestConvertMessages_PreservesOrder(t *testing.T) {
</code_context>
<issue_to_address>
**suggestion (testing):** Strengthen multimodal assertions in `TestConvertMessages_PreservesOrder`
To improve robustness, also assert the structure of the two content parts (e.g., first has `type == "text"` with the expected text, second has `type == "image_url"` with the expected URL) so the test validates the multimodal `content` array contents, not just its length.
Suggested implementation:
```golang
func TestConvertMessages_PreservesOrder(t *testing.T) {
messages := []Message{
{Role: "system", Content: "You are a helpful assistant"},
{Role: "user", Content: "Hello"},
{Role: "assistant", Content: "Hi there!"},
{Role: "user", Content: "What's in this image?", Images: []string{"data:image/jpeg;base64,abc123"}},
}
result := convertMessages(messages)
if len(result) != 4 {
t.Fatalf("expected 4 converted messages, got %d", len(result))
}
// Verify roles are preserved in order
expectedRoles := []string{"system", "user", "assistant", "user"}
for i, role := range expectedRoles {
if result[i].Role != role {
t.Fatalf("message %d role mismatch: got %q, want %q", i, result[i].Role, role)
}
}
// Verify multimodal content structure for the last user message
last := result[3]
// We expect a multimodal content array with two parts: text then image_url
if len(last.Content) != 2 {
t.Fatalf("expected last message to have 2 content parts, got %d", len(last.Content))
}
// First part: text
textPart, ok := last.Content[0].(openai.ChatCompletionMessageContentPartText)
if !ok {
t.Fatalf("expected first content part to be text, got %T", last.Content[0])
}
if textPart.Type != "text" {
t.Fatalf("expected first content part type %q, got %q", "text", textPart.Type)
}
if textPart.Text != "What's in this image?" {
t.Fatalf("unexpected text content: got %q, want %q", textPart.Text, "What's in this image?")
}
// Second part: image_url
imagePart, ok := last.Content[1].(openai.ChatCompletionMessageContentPartImage)
if !ok {
t.Fatalf("expected second content part to be image_url, got %T", last.Content[1])
}
if imagePart.Type != "image_url" {
t.Fatalf("expected second content part type %q, got %q", "image_url", imagePart.Type)
}
if imagePart.ImageURL.URL != "data:image/jpeg;base64,abc123" {
t.Fatalf("unexpected image_url content URL: got %q, want %q", imagePart.ImageURL.URL, "data:image/jpeg;base64,abc123")
}
```
1. This change assumes:
- `result` is a slice of a struct with a `Role string` field and a `Content []interface{}` (or similar) field.
- `Content` elements are concrete types `openai.ChatCompletionMessageContentPartText` and `openai.ChatCompletionMessageContentPartImage` from the `github.com/sashabaranov/go-openai` package, with fields:
- `Type string`
- `Text string` on the text part
- `ImageURL.URL string` on the image part.
2. If your actual types differ (e.g., `[]openai.ChatCompletionMessagePart` with a `Type` discriminator instead of Go interfaces), adjust the type assertions and field accesses accordingly:
- Replace the type assertions with direct indexing (e.g. `last.Content[0].Type == openai.ChatMessagePartTypeText`).
- Replace `imagePart.ImageURL.URL` with the correct field path used in your code.
3. Ensure the test file imports the `openai` package if it is not already imported:
- Add `openai "github.com/sashabaranov/go-openai"` (or the correct alias/path) to the imports of `pkg/ollama/http_handler_test.go`.
</issue_to_address>
### Comment 6
<location> `pkg/ollama/http_handler_test.go:69` </location>
<code_context>
+ {
+ name: "multimodal message with raw base64 from OpenWebUI (no prefix)",
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding a case for multiple raw base64 images without prefix
You already cover a single raw base64 image (no prefix) being normalized via `ensureDataURIPrefix` in `convertMessages`. Since this also runs for multiple images, please add a test where `Images` contains at least two raw base64 strings without prefixes, and assert that each resulting `image_url.url` is correctly prefixed. This helps catch any off‑by‑one or loop issues in the multimodal path.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| // Compare JSON strings | ||
| if string(resultJSON) != tt.expected { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (testing): Avoid brittle JSON string comparison in table test
Comparing raw JSON strings here is brittle for map[string]interface{} because key order isn’t guaranteed and may vary by Go version/implementation. Instead, unmarshal both resultJSON and tt.expected into a structured type (e.g. []map[string]interface{} or []any) and compare with reflect.DeepEqual or a cmp helper so the test only fails on real semantic differences, not ordering or formatting.
Suggested implementation:
// Marshal to JSON for comparison
resultJSON, err := json.Marshal(result)
if err != nil {
t.Fatalf("Failed to marshal result: %v", err)
}
// Unmarshal both actual and expected into structured values to avoid brittle
// string comparison that depends on JSON key ordering or formatting.
var resultData []map[string]interface{}
if err := json.Unmarshal(resultJSON, &resultData); err != nil {
t.Fatalf("Failed to unmarshal result JSON: %v", err)
}
var expectedData []map[string]interface{}
if err := json.Unmarshal([]byte(tt.expected), &expectedData); err != nil {
t.Fatalf("Failed to unmarshal expected JSON: %v", err)
}
// Compare structured values so the test only fails on semantic differences.
if !reflect.DeepEqual(resultData, expectedData) {
t.Errorf("convertMessages() mismatch\nGot: %s\nExpected: %s", string(resultJSON), tt.expected)
}- Ensure the file imports the
reflectpackage at the top:- Add
reflectto the existing import block, e.g.:
import ( ... "encoding/json" "reflect" ... )
- Add
- If the
expectedfield in the table tests is not JSON representing a[]map[string]interface{}, adjust theresultData/expectedDatatypes (e.g. to[]anyor a concrete struct) to match the actual shape ofconvertMessagesoutput and the expected JSON.
| }, | ||
| }, | ||
| // The tool_calls will have arguments converted to JSON string | ||
| // Note: JSON field order follows struct definition |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (testing): Add coverage for messages with tool_call_id (tool result messages)
Since convertMessages handles ToolCalls and ToolCallID separately, please add a test case for a tool result message (e.g., Role = "tool" with ToolCallID set) and assert that the output map includes the correct "tool_call_id". This will exercise the tool-result path and protect against regressions.
| } | ||
| } | ||
|
|
||
| func TestEnsureDataURIPrefix(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (testing): Consider adding an empty-string case for ensureDataURIPrefix
You already cover prefixed data URIs and http/https URLs. Please also add a case for an empty-string input, which currently becomes just the "data:image/jpeg;base64," prefix, so this behavior is explicitly tested and guarded against future changes.
Suggested implementation:
func TestEnsureDataURIPrefix(t *testing.T) {
tests := []struct {
name string
input string
expected string
}{
{
name: "jpeg data uri",
input: "data:image/jpeg;base64,abc123",
expected: "data:image/jpeg;base64,abc123",
},
{
name: "http url",
input: "http://example.com/image.jpg",
expected: "http://example.com/image.jpg",
},
{
name: "https url",
input: "https://example.com/image.jpg",
expected: "https://example.com/image.jpg",
},
{
name: "raw base64 without prefix",
input: "abc123",
expected: "data:image/jpeg;base64,abc123",
},
{
name: "empty string",
input: "",
// Current behavior: empty input becomes just the data URI prefix.
expected: "data:image/jpeg;base64,",
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
result := ensureDataURIPrefix(tt.input)
if result != tt.expected {
t.Errorf("ensureDataURIPrefix(%q) = %q, want %q", tt.input, result, tt.expected)
}
})
}
}If the concrete test cases or field names in TestEnsureDataURIPrefix differ from what I inferred, adjust the SEARCH block to match the existing tests slice and insert the empty string case in the same style:
- Keep
name,input, andexpectedfield names consistent. - Ensure the expected value for the empty-string input is
"data:image/jpeg;base64,", reflecting the current behavior you want to guard.
| func TestConvertMessages_PreservesOrder(t *testing.T) { | ||
| messages := []Message{ | ||
| {Role: "system", Content: "You are a helpful assistant"}, | ||
| {Role: "user", Content: "Hello"}, | ||
| {Role: "assistant", Content: "Hi there!"}, | ||
| {Role: "user", Content: "What's in this image?", Images: []string{"data:image/jpeg;base64,abc123"}}, | ||
| } | ||
|
|
||
| result := convertMessages(messages) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (testing): Strengthen multimodal assertions in TestConvertMessages_PreservesOrder
To improve robustness, also assert the structure of the two content parts (e.g., first has type == "text" with the expected text, second has type == "image_url" with the expected URL) so the test validates the multimodal content array contents, not just its length.
Suggested implementation:
func TestConvertMessages_PreservesOrder(t *testing.T) {
messages := []Message{
{Role: "system", Content: "You are a helpful assistant"},
{Role: "user", Content: "Hello"},
{Role: "assistant", Content: "Hi there!"},
{Role: "user", Content: "What's in this image?", Images: []string{"data:image/jpeg;base64,abc123"}},
}
result := convertMessages(messages)
if len(result) != 4 {
t.Fatalf("expected 4 converted messages, got %d", len(result))
}
// Verify roles are preserved in order
expectedRoles := []string{"system", "user", "assistant", "user"}
for i, role := range expectedRoles {
if result[i].Role != role {
t.Fatalf("message %d role mismatch: got %q, want %q", i, result[i].Role, role)
}
}
// Verify multimodal content structure for the last user message
last := result[3]
// We expect a multimodal content array with two parts: text then image_url
if len(last.Content) != 2 {
t.Fatalf("expected last message to have 2 content parts, got %d", len(last.Content))
}
// First part: text
textPart, ok := last.Content[0].(openai.ChatCompletionMessageContentPartText)
if !ok {
t.Fatalf("expected first content part to be text, got %T", last.Content[0])
}
if textPart.Type != "text" {
t.Fatalf("expected first content part type %q, got %q", "text", textPart.Type)
}
if textPart.Text != "What's in this image?" {
t.Fatalf("unexpected text content: got %q, want %q", textPart.Text, "What's in this image?")
}
// Second part: image_url
imagePart, ok := last.Content[1].(openai.ChatCompletionMessageContentPartImage)
if !ok {
t.Fatalf("expected second content part to be image_url, got %T", last.Content[1])
}
if imagePart.Type != "image_url" {
t.Fatalf("expected second content part type %q, got %q", "image_url", imagePart.Type)
}
if imagePart.ImageURL.URL != "data:image/jpeg;base64,abc123" {
t.Fatalf("unexpected image_url content URL: got %q, want %q", imagePart.ImageURL.URL, "data:image/jpeg;base64,abc123")
}- This change assumes:
resultis a slice of a struct with aRole stringfield and aContent []interface{}(or similar) field.Contentelements are concrete typesopenai.ChatCompletionMessageContentPartTextandopenai.ChatCompletionMessageContentPartImagefrom thegithub.com/sashabaranov/go-openaipackage, with fields:Type stringText stringon the text partImageURL.URL stringon the image part.
- If your actual types differ (e.g.,
[]openai.ChatCompletionMessagePartwith aTypediscriminator instead of Go interfaces), adjust the type assertions and field accesses accordingly:- Replace the type assertions with direct indexing (e.g.
last.Content[0].Type == openai.ChatMessagePartTypeText). - Replace
imagePart.ImageURL.URLwith the correct field path used in your code.
- Replace the type assertions with direct indexing (e.g.
- Ensure the test file imports the
openaipackage if it is not already imported:- Add
openai "github.com/sashabaranov/go-openai"(or the correct alias/path) to the imports ofpkg/ollama/http_handler_test.go.
- Add
| Images: []string{"/9j/4AAQSkZJRgABAQEBLA...."}, | ||
| }, | ||
| }, | ||
| // Should auto-add the data URI prefix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (testing): Consider adding a case for multiple raw base64 images without prefix
You already cover a single raw base64 image (no prefix) being normalized via ensureDataURIPrefix in convertMessages. Since this also runs for multiple images, please add a test where Images contains at least two raw base64 strings without prefixes, and assert that each resulting image_url.url is correctly prefixed. This helps catch any off‑by‑one or loop issues in the multimodal path.
Adds support for multimodal messages in Ollama API