Segmentation fault core dumped inserting item into vec0 table #245

@robogeek

I am close to having integrated vector tables into my static file content management system - AkashaCMS. It is a fairly mature system that is already using SQLITE3 as an in-memory database indexing the various files involved in building a site.

The crash occurs while simply indexing the files in the test suite documents folder. The sqlite-lembed and sqlite-vec packages are the latest versions, and the model for computing embeddings is one of the recommended ones.

The files are added to the embeddings database one at a time, rather than all at once as recommended. This is because AkashaCMS supports running in watch mode, where files can change and the database must be updated to match.

In the test run, four items are successfully added to both the DOCUMENTS and vec_documents tables. The fifth item is added to DOCUMENTS, but the process dies with "Segmentation fault (core dumped)" while running the SQL that adds it to vec_documents.

The first four items are all about 600 bytes, and the fifth is over 5k bytes.

Therefore, I suspect a buffer overrun. I'm also curious how to determine the correct size for the FLOAT array in the vec_documents table.
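
On the FLOAT array size: the number in FLOAT[384] should match the number of dimensions the model emits, and all-MiniLM-L6-v2 produces 384-dimensional vectors. One way to double-check is to inspect the BLOB that lembed() returns. The sketch below assumes that BLOB is a packed array of 4-byte floats (the layout sqlite-vec accepts for FLOAT columns); embeddingDimensions is a hypothetical helper, not part of either extension.

```typescript
// Assumption: the embedding BLOB is a packed array of float32 values,
// so its dimension count is the byte length divided by 4.
const FLOAT32_BYTES = 4;

// Hypothetical helper: returns the number of dimensions in an embedding BLOB.
function embeddingDimensions(blob: Uint8Array): number {
    if (blob.byteLength % FLOAT32_BYTES !== 0) {
        throw new Error(
            `BLOB length ${blob.byteLength} is not a multiple of ${FLOAT32_BYTES}`
        );
    }
    return blob.byteLength / FLOAT32_BYTES;
}

// Hypothetical helper: fails loudly if the BLOB does not match the
// dimension declared in the vec0 table (for example FLOAT[384]).
function assertDimensions(blob: Uint8Array, expected: number): void {
    const actual = embeddingDimensions(blob);
    if (actual !== expected) {
        throw new Error(`expected ${expected} dimensions, got ${actual}`);
    }
}
```

A BLOB fetched with a query such as SELECT lembed($lembedModel, $bodyEmbed) AS v arrives in Node as a Buffer, which is a Uint8Array, so it can be passed to these helpers directly. For a 384-dimension model the BLOB should be 1536 bytes long.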

I'm using this SQLITE3 driver and these extensions:

    "sqlite-lembed": "^0.0.1-alpha.8",
    "sqlite-regex": "^0.2.4-alpha.1",
    "sqlite-vec": "^0.1.7-alpha.2",
    "sqlite3": "^5.1.7",

And for the model to compute embeddings: all-MiniLM-L6-v2.e4ce9877.q8_0.gguf

To initialize the database and extensions:

    console.log({
        lembedModelFile,
        lembedModelName,
        lembed: sqlite_lembed.getLoadablePath(),
        vec: sqlite_vec.getLoadablePath()
    });
    sqlite_lembed.load(<any>sqdb.inner);
    sqlite_vec.load(<any>sqdb.inner);

    await sqdb.run(`
        INSERT INTO temp.lembed_models(name, model)
        select ?, lembed_model_from_file(?);
    `, [
        lembedModelName,
        lembedModelFile
    ]);

The model name and model file name are passed in using environment variables. The log message above looks like:

{
  lembedModelFile: '/home/david/Projects/akasharender/akasharender/test/all-MiniLM-L6-v2.e4ce9877.q8_0.gguf',
  lembedModelName: 'all-MiniLM-L6-v2',
  lembed: '/home/david/Projects/akasharender/akasharender/node_modules/sqlite-lembed-linux-x64/lembed0.so',
  vec: '/home/david/Projects/akasharender/akasharender/node_modules/sqlite-vec-linux-x64/vec0.so'
}

The vector table is declared with this:

-- Support sqlite_vec and sqlite_lembed
CREATE VIRTUAL TABLE IF NOT EXISTS vec_documents USING vec0(
        vpath TEXT,
        -- Ignore this for now
        -- title_embeddings FLOAT[384],
        body_embeddings  FLOAT[384]
);
-- CREATE INDEX "vec_body"
--         ON "vec_documents" ("body_embeddings");

My original intent was to compute embeddings for both the title and body, but the SQL for the query looks too complex, so I'm starting with computing embeddings for the body content.

For every document added to the DOCUMENTS table, an entry is added to vec_documents like so:

INSERT INTO vec_documents(
    vpath,
    -- Ignore this for now
    -- title_embeddings,
    body_embeddings
) VALUES (
    $vpath,
    -- lembed($lembedModel, $titleEmbed),
    lembed($lembedModel, $bodyEmbed)
);

I have console.log statements before and after this is run, so I know that the segmentation fault core dumped message is printed as a consequence of running this SQL:

            console.log(this.#insertLembedDocuments, {
                $vpath: info.vpath,
                $lembedModel: lembedModelName,
                // $titleEmbed: info.title,
                $bodyEmbed:  info.docBody
            });
            await this.db.run(this.#insertLembedDocuments, {
                $vpath: info.vpath,
                $lembedModel: lembedModelName,
                // $titleEmbed: info.title,
                $bodyEmbed:  info.docBody
            });
            console.log(`vec_documents inserted ${info.vpath}`);

After the program crashes -- it's simply indexing documents -- the DOCUMENTS table has these entries:

sqlite> select vpath from DOCUMENTS;
anchor-cleanups-handlebars.html.md
anchor-cleanups-liquid.html.md
anchor-cleanups-nunjucks.html.md
anchor-cleanups.html.md
asciidoctor-handlebars.html.adoc

But vec_documents has only these entries:

sqlite> select vpath from vec_documents;
anchor-cleanups-handlebars.html.md
anchor-cleanups-liquid.html.md
anchor-cleanups-nunjucks.html.md
anchor-cleanups.html.md

The segmentation fault core dumped message was printed while running the INSERT INTO vec_documents for asciidoctor-handlebars.html.adoc.

One possible clue is the file sizes:

-rw-rw-r-- 1 david david  662 Jun 26 23:41 documents/anchor-cleanups-handlebars.html.md
-rw-rw-r-- 1 david david  650 Jun 26 23:41 documents/anchor-cleanups.html.md
-rw-rw-r-- 1 david david  658 Jun 26 23:41 documents/anchor-cleanups-liquid.html.md
-rw-rw-r-- 1 david david  655 Jun 26 23:41 documents/anchor-cleanups-nunjucks.html.md
-rw-rw-r-- 1 david david 5796 Jun 26 23:41 documents/asciidoctor-handlebars.html.adoc
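
If input length does turn out to be the trigger, one experiment is to split the body into smaller pieces before passing each piece to lembed(). This is only a sketch: chunkText is a hypothetical helper, and any limit passed to it (say 1000 characters) is an arbitrary guess, not a documented limit of sqlite-lembed or the model.

```typescript
// Hypothetical helper: splits text into chunks of at most maxChars
// characters, breaking at whitespace where possible. Runs of whitespace
// (including newlines) are collapsed to single spaces, which is usually
// acceptable for text that is only going to be embedded.
function chunkText(text: string, maxChars: number): string[] {
    if (maxChars <= 0) {
        throw new Error("maxChars must be positive");
    }
    const chunks: string[] = [];
    let current = "";
    const words = text.split(/\s+/).filter((w) => w.length > 0);
    for (const word of words) {
        let rest = word;
        // A single word longer than maxChars is hard-split.
        while (rest.length > maxChars) {
            if (current) {
                chunks.push(current);
                current = "";
            }
            chunks.push(rest.slice(0, maxChars));
            rest = rest.slice(maxChars);
        }
        const candidate = current ? `${current} ${rest}` : rest;
        if (candidate.length > maxChars) {
            // current is non-empty here, because rest alone fits.
            chunks.push(current);
            current = rest;
        } else {
            current = candidate;
        }
    }
    if (current) {
        chunks.push(current);
    }
    return chunks;
}
```

Inserting the chunks one at a time, starting with a small limit and raising it, would show whether the crash depends on input size at all. If the 5796-byte file still crashes when every chunk is as small as the four files that succeed, length is probably not the cause.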
