Skip to content

Dileepkumar65/db_engine

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

db_engine — a relational database engine in modern C++17

A from-scratch, single-file relational database engine written in C++17, supporting a SQL-like dialect over a paged, B+ tree–indexed file format. Inspired by Connor Stack's Let's Build a Simple Database and SQLite's overall architecture.

This is the v2 evolution of an earlier C implementation. It moves the project from a hard-coded single-table schema to a dynamic, multi-table catalog with typed values, real SQL parsing, WHERE expressions over any column, ORDER BY / LIMIT, aggregates, and an EXPLAIN plan visualizer.


Highlights

  • Hand-written SQL frontend. A streaming lexer (zero-copy via std::string_view) feeds a recursive-descent parser that produces typed AST nodes and a compiled bytecode Program. Parse errors carry line/column info.
  • Bytecode VM with a query planner. Eighteen opcodes; the parser-side planner picks OP_FIND_KEY for primary-key equality lookups and OP_REWIND + OP_FILTER scans otherwise. Try EXPLAIN <stmt> to see exactly what the engine will run.
  • Real B+ tree index. Leaf and internal node layouts with on-page binary headers, leaf splits, internal splits, root-split-induced tree height growth, parent pointer maintenance, and a leaf sibling chain for ordered scans.
  • Dynamic, persisted schema. A typed system catalog lives on page 0 with a 4-byte magic and version; foreign files are rejected cleanly. Up to four user tables, each with its own B+ tree.
  • Typed values. std::variant<std::monostate, int64_t, std::string> for in-memory values; fixed-width on-disk layout (length-prefixed TEXT) keeps cell sizing predictable.
  • Modern C++ idioms throughout. RAII for the pager, database, and program; std::unique_ptr for AST ownership; std::sort for ORDER BY; enum class opcodes; exceptions caught at the REPL boundary.
  • Tested under AddressSanitizer + UBSan in CI. 30 doctest unit assertions + 74 shell-based integration checks across 10 test files.

Architecture

┌────────────────────────────────────────────────────────────────┐
│  REPL  (src/main.cpp)                                          │
│  reads a line, dispatches meta-commands (.schema, .stats, …)   │
│  or hands SQL to the frontend.                                 │
└────────────────────────────────────────────────────────────────┘
            │
            ▼
┌────────────────────────────────────────────────────────────────┐
│  Frontend                                                      │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────┐  │
│  │ Lexer        │ -> │ Parser       │ -> │ Planner          │  │
│  │ string_view  │    │ recursive    │    │ pk-eq fast path  │  │
│  │ tokens       │    │ descent      │    │ vs scan+filter   │  │
│  └──────────────┘    └──────────────┘    └──────────────────┘  │
│              produces  Program { code[], predicate, … }        │
└────────────────────────────────────────────────────────────────┘
            │
            ▼
┌────────────────────────────────────────────────────────────────┐
│  Bytecode VM  (src/vm.cpp)                                     │
│  switch over enum class Opcode { OPEN, REWIND, NEXT,           │
│    FIND_KEY, FILTER, RESULT, COLLECT, SORT_EMIT,               │
│    AGG_COUNT / MIN / MAX / SUM, AGG_EMIT,                      │
│    INSERT, UPDATE, DELETE, CLOSE, HALT }                       │
│  Holds: aggregate accumulators, collected rows for ORDER BY    │
└────────────────────────────────────────────────────────────────┘
            │
            ▼
┌────────────────────────────────────────────────────────────────┐
│  Storage                                                       │
│  ┌──────────────┐   ┌──────────────┐   ┌─────────────────────┐ │
│  │ Cursor       │ ← │ B+ Tree      │ ← │ Pager (RAII)        │ │
│  │ leaf+cell    │   │ leaf/intern. │   │ open/read/write/    │ │
│  │ iteration    │   │ splits, root │   │ lseek + page cache  │ │
│  │              │   │ height grow  │   │ + cache stats       │ │
│  └──────────────┘   └──────────────┘   └─────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
            │
            ▼
   ┌──────────────────────────────────────────────────────────┐
   │  File on disk                                            │
   │  page 0: Catalog (magic, version, tables[], root pages)  │
   │  page N: B+ tree leaves / internals (per table)          │
   └──────────────────────────────────────────────────────────┘

The on-disk catalog is a static_assert-enforced trivially-copyable POD; the in-memory representation uses STL containers. The row codec is the bridge.


Quick start

make                         # build (gcc / g++17)
make test                    # build + run unit + integration tests
make asan                    # rebuild under AddressSanitizer + UBSan and re-test
make clean
./db_engine mydata.db        # open or create a database file

Example session

db > CREATE TABLE users (id INT PRIMARY KEY, name TEXT(32), age INT);
Executed.
db > INSERT INTO users VALUES (1, 'alice', 30);
db > INSERT INTO users VALUES (2, 'bob', 25);
db > INSERT INTO users VALUES (3, 'carol', 35);
db > SELECT * FROM users WHERE age > 26 ORDER BY age DESC;
(3, carol, 35)
(1, alice, 30)
db > SELECT COUNT(*) FROM users;
(3)
db > EXPLAIN SELECT * FROM users WHERE id = 2;
Program table=users predicate=id = INT(2)
0 OP_OPEN
1 OP_FIND_KEY 2 4 0
2 OP_RESULT
3 OP_CLOSE
4 OP_HALT
db > EXPLAIN SELECT * FROM users WHERE age > 30;
Program table=users predicate=age > INT(30)
0 OP_OPEN
1 OP_REWIND 5 0 0
2 OP_FILTER 4 0 0
3 OP_RESULT
4 OP_NEXT 2 0 0
5 OP_CLOSE
6 OP_HALT

Notice the planner difference: an equality lookup on the primary key compiles to OP_FIND_KEY (B+ tree descent — O(log n)), while a non-PK predicate compiles to a OP_REWIND / OP_NEXT loop with an in-VM OP_FILTER (full leaf-chain scan — O(n)).

Supported SQL

CREATE TABLE name (col TYPE [PRIMARY KEY], ...);   -- TYPE = INT | TEXT(n)
DROP TABLE name;

INSERT INTO name VALUES (v1, v2, ...);

SELECT * | col[, col...] | agg(...) FROM name
  [WHERE expr]
  [ORDER BY col [ASC|DESC]]
  [LIMIT n];

UPDATE name SET col = v [, col = v]... [WHERE expr];
DELETE FROM name [WHERE expr];

EXPLAIN <any-statement>;
  • expr supports = != < <= > >= AND OR over any column.
  • Aggregates: COUNT(*), MIN(col), MAX(col), SUM(col), AVG(col) (INT only for SUM/AVG).
  • Exactly one PRIMARY KEY column is required per table; it must be INT.

Meta-commands

Command Purpose
.help List commands
.exit Flush and quit
.tables List user tables
.schema [name] Pretty-print schema
.schema --json Dump catalog as JSON (no library — handwritten)
.stats Pages, rows per table, cache hits/misses, dirty writes
.btree [name] Print the tree structure of a table
.constants Sizing constants

What changed from v1

v1 (C) v2 (C++17)
Schema Hard-coded (id, username, email) row Dynamic CREATE TABLE with INT and TEXT(n)
Tables One Up to four, with their own B+ tree each
Values Fixed C struct std::variant<monostate, int64_t, std::string>
Parsing strtok-style splitter Lexer (std::string_view) + recursive-descent parser with line/col errors
WHERE PK only Any column; = != < <= > >= AND OR
SELECT * only Column projection, ORDER BY (any column, ASC/DESC), LIMIT
Aggregates None COUNT(*) MIN MAX SUM AVG
Plan visibility None EXPLAIN prints opcodes; planner picks PK fast path vs scan
Errors Return codes Exceptions caught at REPL boundary
Memory model Manual free RAII; unique_ptr for AST
Build Make Make (primary) + CMakeLists.txt provided
Tests 16 shell checks 30 doctest assertions + 74 integration checks
CI None GitHub Actions: build+test and ASan

Source layout

include/dbe/      typed value, schema, catalog, pager, btree,
                  cursor, lexer, expr, parser, bytecode, vm, db
src/              one .cpp per header, plus main.cpp (REPL)
tests/unit/       doctest-based C++ tests
tests/integration/ shell-driven end-to-end tests
.github/workflows ci.yml — build+test and asan jobs on Ubuntu

About 3,300 lines of C++ across the engine.


Limitations & future work

Deliberately scoped out for v2; each is a possible follow-up:

  • No WAL/journal. A crash mid-write can corrupt the file. A rollback journal (sidecar .journal, the SQLite-3 model) is sketched out but not implemented.
  • No LRU eviction. The page cache is unbounded; the working set must fit in memory.
  • No secondary indexes. Non-PK WHERE always scans.
  • No JOINs, no GROUP BY, no transactions.
  • DROP TABLE leaks pages. A free-page list would reclaim them.
  • Fixed-width rows. Variable-length records (SQLite's record format) would let TEXT be unbounded; this is the single largest follow-up.
  • No NULL surfaced through SQL. The variant has a monostate slot but no syntax produces it yet.

The architecture was designed so each of these can be added without rewriting the layers below it.


License

Personal/educational. No warranty.

About

A relational database engine in modern C++17 — SQL frontend, bytecode VM with query planner, B+ tree storage

Topics

Resources

Stars

Watchers

Forks

Contributors

Languages