
Fix LlamaIndexEmbeddingOperator returning None vectors for all chunks#68424

Closed
bujjibabukatta wants to merge 1 commit into apache:main from bujjibabukatta:fix/llamaindex-embedding-vector-none-68416


@bujjibabukatta

Contributor

Problem

LlamaIndexEmbeddingOperator was returning vector: None for every chunk in its output, making the results unusable for downstream vector storage tasks.

Root cause: VectorStoreIndex._get_node_with_embedding() in llama-index-core calls node.copy() internally before attaching embedding vectors. As a result, embeddings are stored only on the internal copies, and the original node objects in the nodes list retain embedding=None.

Minimal reproduction:

from llama_index.core import Document, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.embeddings.mock_embed_model import MockEmbedding

docs = [Document(text="hello world")]
nodes = SentenceSplitter(chunk_size=512).get_nodes_from_documents(docs)
index = VectorStoreIndex(nodes, embed_model=MockEmbedding(embed_dim=8))

print(nodes[0].embedding)  # None  ← bug
print(index.vector_store.data.embedding_dict)  # {node_id: [...]}  ← vector is here, not on the node

Fix

Pre-embed the nodes using embed_model.get_text_embedding_batch() before building the index and assign the results directly to the original node objects. Since VectorStoreIndex skips re-embedding nodes that already carry a vector, this avoids redundant API calls while ensuring node.embedding is correctly set on the objects we read from later.
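The pre-embedding step can be sketched without llama-index installed. Everything below is an illustrative assumption rather than the operator's actual code: StubNode, StubEmbedModel, build_index_on_copies and pre_embed_nodes are hypothetical names, and build_index_on_copies only mimics the behaviour described above (embed copies, skip nodes that already carry a vector). The only name taken from the PR is get_text_embedding_batch().

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass
class StubNode:
    """Hypothetical stand-in for a llama-index node: text plus optional vector."""
    text: str
    embedding: Optional[List[float]] = None

    def get_content(self) -> str:
        return self.text


class StubEmbedModel:
    """Hypothetical embed model exposing get_text_embedding_batch()."""

    def __init__(self, embed_dim: int) -> None:
        self.embed_dim = embed_dim
        self.calls = 0  # counts batch calls so redundant embedding is visible

    def get_text_embedding_batch(self, texts: List[str]) -> List[List[float]]:
        self.calls += 1
        # Deterministic fake vector: text length repeated embed_dim times.
        return [[float(len(t))] * self.embed_dim for t in texts]


def build_index_on_copies(nodes, embed_model):
    """Mimics the described index behaviour: vectors go onto copies only,
    and nodes that already have a vector are not embedded again."""
    missing = [n for n in nodes if n.embedding is None]
    vectors = (
        embed_model.get_text_embedding_batch([n.get_content() for n in missing])
        if missing
        else []
    )
    by_id = {id(n): v for n, v in zip(missing, vectors)}
    return [replace(n, embedding=by_id.get(id(n), n.embedding)) for n in nodes]


def pre_embed_nodes(nodes, embed_model) -> None:
    """Assign vectors to the ORIGINAL node objects before index construction."""
    texts = [n.get_content() for n in nodes]
    vectors = embed_model.get_text_embedding_batch(texts)
    for node, vector in zip(nodes, vectors):
        node.embedding = vector


model = StubEmbedModel(embed_dim=4)

# Without pre-embedding: only the internal copies receive a vector.
buggy_nodes = [StubNode("hello world")]
build_index_on_copies(buggy_nodes, model)

# With pre-embedding: the originals carry the vector, and the index
# construction step has nothing left to embed.
fixed_nodes = [StubNode("hello world")]
pre_embed_nodes(fixed_nodes, model)
calls_before = model.calls
build_index_on_copies(fixed_nodes, model)
```

In this sketch buggy_nodes[0].embedding stays None, while fixed_nodes[0].embedding is populated and model.calls does not increase during the second index build. With real llama-index nodes, the text passed to the embed model may need a specific metadata mode; that detail is not covered here.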

Changes

providers/common/ai/.../operators/llamaindex_embedding.py - added pre-embedding step before VectorStoreIndex construction
providers/common/ai/tests/.../test_llamaindex_embedding.py - updated existing tests to mock get_text_embedding_batch, added regression test
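A regression test along these lines might look like the sketch below. It is not the test added in this PR: pre_embed_nodes is a hypothetical helper standing in for the operator's pre-embedding step, and the nodes are plain SimpleNamespace objects rather than llama-index nodes. It shows only the general shape of mocking get_text_embedding_batch and asserting that vectors land on the original node objects.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def pre_embed_nodes(nodes, embed_model) -> None:
    """Hypothetical helper: write batch embeddings onto the original nodes."""
    texts = [node.get_content() for node in nodes]
    for node, vector in zip(nodes, embed_model.get_text_embedding_batch(texts)):
        node.embedding = vector


def test_vectors_land_on_original_nodes() -> None:
    # Stand-in nodes: each exposes get_content() and starts with embedding=None.
    nodes = [
        SimpleNamespace(get_content=lambda t=text: t, embedding=None)
        for text in ("alpha", "beta")
    ]
    embed_model = MagicMock()
    embed_model.get_text_embedding_batch.return_value = [[0.1, 0.2], [0.3, 0.4]]

    pre_embed_nodes(nodes, embed_model)

    # One batch call with all chunk texts, and no node left without a vector.
    embed_model.get_text_embedding_batch.assert_called_once_with(["alpha", "beta"])
    assert [n.embedding for n in nodes] == [[0.1, 0.2], [0.3, 0.4]]
    assert all(n.embedding is not None for n in nodes)


test_vectors_land_on_original_nodes()
```

Asserting on the original node objects, rather than on the index's internal store, is what would catch a regression of the vector: None behaviour.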

…turning None vectors

VectorStoreIndex._get_node_with_embedding() calls node.copy() internally
before attaching embeddings, so reading node.embedding from the original
node list after index construction always returned None.

Fix by calling embed_model.get_text_embedding_batch() before building the
index and assigning the results directly to the original node objects.
VectorStoreIndex then skips re-embedding nodes that already carry a vector.

Closes apache#68416