A from-scratch, single-file relational database engine written in C++17, supporting a SQL-like dialect over a paged, B+ tree–indexed file format. Inspired by Connor Stack's Let's Build a Simple Database and SQLite's overall architecture.
This is the v2 evolution of an earlier C implementation. It moves the project from a hard-coded single-table schema to a dynamic, multi-table catalog with typed values, real SQL parsing, WHERE expressions over any column, ORDER BY / LIMIT, aggregates, and an EXPLAIN plan visualizer.
- Hand-written SQL frontend. A streaming lexer (zero-copy via
std::string_view) feeds a recursive-descent parser that produces typed AST nodes and a compiled bytecodeProgram. Parse errors carry line/column info. - Bytecode VM with a query planner. Eighteen opcodes; the parser-side planner picks
OP_FIND_KEYfor primary-key equality lookups andOP_REWIND+OP_FILTERscans otherwise. TryEXPLAIN <stmt>to see exactly what the engine will run. - Real B+ tree index. Leaf and internal node layouts with on-page binary headers, leaf splits, internal splits, root-split-induced tree height growth, parent pointer maintenance, and a leaf sibling chain for ordered scans.
- Dynamic, persisted schema. A typed system catalog lives on page 0 with a 4-byte magic and version; foreign files are rejected cleanly. Up to four user tables, each with its own B+ tree.
- Typed values.
std::variant<std::monostate, int64_t, std::string>for in-memory values; fixed-width on-disk layout (length-prefixed TEXT) keeps cell sizing predictable. - Modern C++ idioms throughout. RAII for the pager, database, and program;
std::unique_ptrfor AST ownership;std::sortforORDER BY;enum classopcodes; exceptions caught at the REPL boundary. - Tested under AddressSanitizer + UBSan in CI. 30 doctest unit assertions + 74 shell-based integration checks across 10 test files.
┌────────────────────────────────────────────────────────────────┐
│ REPL (src/main.cpp) │
│ reads a line, dispatches meta-commands (.schema, .stats, …) │
│ or hands SQL to the frontend. │
└────────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Frontend │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ Lexer │ -> │ Parser │ -> │ Planner │ │
│ │ string_view │ │ recursive │ │ pk-eq fast path │ │
│ │ tokens │ │ descent │ │ vs scan+filter │ │
│ └──────────────┘ └──────────────┘ └──────────────────┘ │
│ produces Program { code[], predicate, … } │
└────────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Bytecode VM (src/vm.cpp) │
│ switch over enum class Opcode { OPEN, REWIND, NEXT, │
│ FIND_KEY, FILTER, RESULT, COLLECT, SORT_EMIT, │
│ AGG_COUNT / MIN / MAX / SUM, AGG_EMIT, │
│ INSERT, UPDATE, DELETE, CLOSE, HALT } │
│ Holds: aggregate accumulators, collected rows for ORDER BY │
└────────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Storage │
│ ┌──────────────┐ ┌──────────────┐ ┌─────────────────────┐ │
│ │ Cursor │ ← │ B+ Tree │ ← │ Pager (RAII) │ │
│ │ leaf+cell │ │ leaf/intern. │ │ open/read/write/ │ │
│ │ iteration │ │ splits, root │ │ lseek + page cache │ │
│ │ │ │ height grow │ │ + cache stats │ │
│ └──────────────┘ └──────────────┘ └─────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ File on disk │
│ page 0: Catalog (magic, version, tables[], root pages) │
│ page N: B+ tree leaves / internals (per table) │
└──────────────────────────────────────────────────────────┘
The on-disk catalog is a static_assert-enforced trivially-copyable POD; the in-memory representation uses STL containers. The row codec is the bridge.
make # build (gcc / g++17)
make test # build + run unit + integration tests
make asan # rebuild under AddressSanitizer + UBSan and re-test
make clean
./db_engine mydata.db # open or create a database filedb > CREATE TABLE users (id INT PRIMARY KEY, name TEXT(32), age INT);
Executed.
db > INSERT INTO users VALUES (1, 'alice', 30);
db > INSERT INTO users VALUES (2, 'bob', 25);
db > INSERT INTO users VALUES (3, 'carol', 35);
db > SELECT * FROM users WHERE age > 26 ORDER BY age DESC;
(3, carol, 35)
(1, alice, 30)
db > SELECT COUNT(*) FROM users;
(3)
db > EXPLAIN SELECT * FROM users WHERE id = 2;
Program table=users predicate=id = INT(2)
0 OP_OPEN
1 OP_FIND_KEY 2 4 0
2 OP_RESULT
3 OP_CLOSE
4 OP_HALT
db > EXPLAIN SELECT * FROM users WHERE age > 30;
Program table=users predicate=age > INT(30)
0 OP_OPEN
1 OP_REWIND 5 0 0
2 OP_FILTER 4 0 0
3 OP_RESULT
4 OP_NEXT 2 0 0
5 OP_CLOSE
6 OP_HALTNotice the planner difference: an equality lookup on the primary key compiles to OP_FIND_KEY (B+ tree descent — O(log n)), while a non-PK predicate compiles to a OP_REWIND / OP_NEXT loop with an in-VM OP_FILTER (full leaf-chain scan — O(n)).
CREATE TABLE name (col TYPE [PRIMARY KEY], ...); -- TYPE = INT | TEXT(n)
DROP TABLE name;
INSERT INTO name VALUES (v1, v2, ...);
SELECT * | col[, col...] | agg(...) FROM name
[WHERE expr]
[ORDER BY col [ASC|DESC]]
[LIMIT n];
UPDATE name SET col = v [, col = v]... [WHERE expr];
DELETE FROM name [WHERE expr];
EXPLAIN <any-statement>;exprsupports= != < <= > >= AND ORover any column.- Aggregates:
COUNT(*),MIN(col),MAX(col),SUM(col),AVG(col)(INT only for SUM/AVG). - Exactly one
PRIMARY KEYcolumn is required per table; it must beINT.
| Command | Purpose |
|---|---|
.help |
List commands |
.exit |
Flush and quit |
.tables |
List user tables |
.schema [name] |
Pretty-print schema |
.schema --json |
Dump catalog as JSON (no library — handwritten) |
.stats |
Pages, rows per table, cache hits/misses, dirty writes |
.btree [name] |
Print the tree structure of a table |
.constants |
Sizing constants |
| v1 (C) | v2 (C++17) | |
|---|---|---|
| Schema | Hard-coded (id, username, email) row |
Dynamic CREATE TABLE with INT and TEXT(n) |
| Tables | One | Up to four, with their own B+ tree each |
| Values | Fixed C struct | std::variant<monostate, int64_t, std::string> |
| Parsing | strtok-style splitter |
Lexer (std::string_view) + recursive-descent parser with line/col errors |
WHERE |
PK only | Any column; = != < <= > >= AND OR |
SELECT |
* only |
Column projection, ORDER BY (any column, ASC/DESC), LIMIT |
| Aggregates | None | COUNT(*) MIN MAX SUM AVG |
| Plan visibility | None | EXPLAIN prints opcodes; planner picks PK fast path vs scan |
| Errors | Return codes | Exceptions caught at REPL boundary |
| Memory model | Manual free |
RAII; unique_ptr for AST |
| Build | Make | Make (primary) + CMakeLists.txt provided |
| Tests | 16 shell checks | 30 doctest assertions + 74 integration checks |
| CI | None | GitHub Actions: build+test and ASan |
include/dbe/ typed value, schema, catalog, pager, btree,
cursor, lexer, expr, parser, bytecode, vm, db
src/ one .cpp per header, plus main.cpp (REPL)
tests/unit/ doctest-based C++ tests
tests/integration/ shell-driven end-to-end tests
.github/workflows ci.yml — build+test and asan jobs on Ubuntu
About 3,300 lines of C++ across the engine.
Deliberately scoped out for v2; each is a possible follow-up:
- No WAL/journal. A crash mid-write can corrupt the file. A rollback journal (sidecar
.journal, the SQLite-3 model) is sketched out but not implemented. - No LRU eviction. The page cache is unbounded; the working set must fit in memory.
- No secondary indexes. Non-PK
WHEREalways scans. - No JOINs, no GROUP BY, no transactions.
DROP TABLEleaks pages. A free-page list would reclaim them.- Fixed-width rows. Variable-length records (SQLite's record format) would let TEXT be unbounded; this is the single largest follow-up.
- No NULL surfaced through SQL. The variant has a
monostateslot but no syntax produces it yet.
The architecture was designed so each of these can be added without rewriting the layers below it.
Personal/educational. No warranty.