feat(iceberg): PostgreSQL compatibility via backend capability model #659

Merged
EDsCODE merged 8 commits into main from iceberg-pg-compat
Jun 2, 2026
@EDsCODE EDsCODE commented Jun 2, 2026

What

Makes Duckgres present a stable, documented PostgreSQL-compatible surface when the backing catalog is Iceberg via Lakekeeper (executed by DuckDB). Generalizes the prior DuckLakeMode machinery into a reusable typed backend-profile model and adds an Iceberg profile. Unsupported PostgreSQL semantics return predictable PG-shaped errors / safe command tags rather than raw DuckDB/Flight failures. Not full PG emulation.

Catalog identity follows main unchanged: the startup database param is pure catalog selection ("" / ducklake / iceberg; anything else → 3D000), and current_database() / pg_catalog / information_schema report the real attached catalog. There is no logical-database masking. The PR keeps the control-plane diff to ~3 lines (one hook); it does not change routing/identity.

Core model

  • Typed backend profiles (transpiler/backend): a Profile bundles catalog/DDL/DML/metadata policies, selected per session from the resolved physical catalog. Presets for memory, ducklake, iceberg; the DuckLake preset reproduces prior behavior exactly. Iceberg mirrors DuckLake's DDL/DML policy but keeps the physical schema public.
  • Backend selection: server.SetConnectionPhysicalCatalog records the resolved catalog on the connection so newTranspiler picks the right profile. This is the only control-plane addition.
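
A minimal sketch of how per-catalog profile selection could look. The `Profile` fields, the `presets` table, and `ProfileFor` are hypothetical names chosen for illustration and are not taken from transpiler/backend; the fallback to the memory preset for an unrecognized catalog is likewise an assumption. The only facts carried over from the description are that Iceberg keeps `public` as its physical schema and otherwise mirrors DuckLake's DDL/DML policy.

```go
package main

import "fmt"

// Profile is a hypothetical stand-in for a typed backend profile.
// The real type bundles catalog/DDL/DML/metadata policies; only a
// few illustrative flags are modeled here.
type Profile struct {
	Name             string
	PhysicalSchema   string // schema that unqualified tables live in
	RewritePublic    bool   // rewrite "public" to PhysicalSchema
	StripConstraints bool   // drop PK/UNIQUE/FK/CHECK from DDL
	ConflictToMerge  bool   // turn ON CONFLICT (cols) into MERGE
}

// presets holds one profile per physical catalog. Values for "memory"
// are placeholders; the iceberg row mirrors ducklake except for the schema.
var presets = map[string]Profile{
	"memory":   {Name: "memory", PhysicalSchema: "main", RewritePublic: true},
	"ducklake": {Name: "ducklake", PhysicalSchema: "main", RewritePublic: true, StripConstraints: true, ConflictToMerge: true},
	"iceberg":  {Name: "iceberg", PhysicalSchema: "public", RewritePublic: false, StripConstraints: true, ConflictToMerge: true},
}

// ProfileFor picks the preset for the resolved physical catalog.
// Falling back to "memory" is an assumption made for this sketch.
func ProfileFor(physicalCatalog string) Profile {
	if p, ok := presets[physicalCatalog]; ok {
		return p
	}
	return presets["memory"]
}

func main() {
	for _, c := range []string{"ducklake", "iceberg"} {
		p := ProfileFor(c)
		fmt.Printf("%s: schema=%s rewritePublic=%t\n", p.Name, p.PhysicalSchema, p.RewritePublic)
	}
}
```

Keying the lookup on the physical catalog rather than on the startup `database` string keeps the transpiler independent of how the control plane resolved the session.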

Changes

  • Type mapping (server/pgtypes.ForIceberg): canonical Iceberg→PG type map (nested → jsonb) applied at REST-metadata load time; udt_name populated; internal __duckgres_iceberg_column_metadata hidden from information_schema.tables.
  • DDL policy (transpiler transforms): strip PK/UNIQUE/FK/CHECK constraints, rewrite SERIAL→integer, strip volatile defaults (DEFAULT now()) and GENERATED columns, no-op DDL (CREATE/DROP INDEX, VACUUM, ANALYZE, REINDEX, CLUSTER, GRANT/REVOKE, COMMENT, REFRESH MATVIEW), DROP … CASCADE→RESTRICT, split multi-command ALTER TABLE.
  • DML policy: INSERT … ON CONFLICT (cols) DO UPDATE/NOTHING→MERGE (requires an explicit column list). ON CONFLICT ON CONSTRAINT and ALTER COLUMN TYPE … USING are rejected with 0A000 via a SQLSTATE-carrying transform.CodedError.
  • public→main rewrite disabled for Iceberg (its physical schema is literally public).
  • Classify fix: ANALYZE and GENERATED are now detected as DDL triggers — previously a statement whose only trigger word was one of these skipped the DDL transform and reached DuckDB unhandled (ANALYZE errored; STORED GENERATED columns errored).
  • Error normalization: unwrap Arrow-Flight/gRPC envelopes so worker errors classify to proper SQLSTATEs instead of XX000; map Not implemented→0A000.
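
The error-normalization step can be illustrated with a small sketch. It assumes the worker error arrives as a string wrapped in grpc-go's standard `rpc error: code = … desc = …` format and that the DuckDB message begins with `Not implemented`; the helper names `unwrapEnvelope` and `sqlstateFor` are hypothetical, and the real classifier presumably covers many more SQLSTATEs than the single mapping shown.

```go
package main

import (
	"fmt"
	"strings"
)

// grpcDescMarker precedes the wrapped message in grpc-go's error string.
const grpcDescMarker = " desc = "

// unwrapEnvelope returns the text after the last "desc = " marker, so that
// nested envelopes are peeled down to the innermost message. Strings without
// a marker are returned unchanged.
func unwrapEnvelope(msg string) string {
	if i := strings.LastIndex(msg, grpcDescMarker); i >= 0 {
		return strings.TrimSpace(msg[i+len(grpcDescMarker):])
	}
	return msg
}

// sqlstateFor classifies the unwrapped message. Only the mapping named in
// the PR description is modeled; everything else falls back to XX000.
func sqlstateFor(msg string) string {
	inner := unwrapEnvelope(msg)
	if strings.HasPrefix(inner, "Not implemented") {
		return "0A000" // feature_not_supported
	}
	return "XX000" // internal_error
}

func main() {
	wrapped := "rpc error: code = Unknown desc = Not implemented Error: Vacuum is only implemented for DuckDB tables"
	fmt.Println(unwrapEnvelope(wrapped))
	fmt.Println(sqlstateFor(wrapped))
}
```

Classifying after unwrapping is the point: the same prefix check applied to the raw envelope would never match, which is how such errors end up as XX000.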

Compatibility summary

  • Supported: SELECT / INSERT / UPDATE / DELETE / COPY, CREATE/DROP TABLE+SCHEMA, ADD COLUMN (idempotent), multi-command ALTER (split), ALTER COLUMN TYPE without USING, introspection (information_schema, pg_catalog, current_database(), JDBC).
  • No-op (acknowledged): CREATE/DROP INDEX, VACUUM, ANALYZE, REINDEX, CLUSTER, GRANT/REVOKE, COMMENT, REFRESH MATVIEW; constraints + volatile/GENERATED defaults stripped.
  • Unsupported → 0A000: ON CONFLICT ON CONSTRAINT, ALTER COLUMN TYPE … USING, DML RETURNING via extended-query Describe; FOR UPDATE/SHARE stripped.

Testing

  • go build ./... + -tags kubernetes, go vet ./... — clean
  • transpiler / server / controlplane unit suites — green
  • integration suite — 933/0 (DuckLake, wire protocol, transpilation)
  • live Lakekeeper exhaustive DML + DDL QA against a real Iceberg table (local MinIO): constraint stripping (PK/UNIQUE/CHECK/FK/SERIAL/DEFAULT/GENERATED verified by behavior), all no-op DDL, ALTER variants, 0A000 rejections, INSERT/UPDATE/DELETE, ON CONFLICT→MERGE, MERGE — all pass. This QA found and fixed the ANALYZE/GENERATED Classify bugs above.

Known gaps / out of scope

  • Iceberg optimistic concurrency is not auto-retried or normalized: DML immediately after DDL raises "… is already outdated. Please restart your transaction", and ALTER commits can hit a REST Conflict_409. Clients must retry (a 40001-style mapping/retry is a follow-up).
  • ON CONFLICT→MERGE requires an explicit column list; INSERT INTO t VALUES (…) ON CONFLICT … (no column list) is not converted.
  • RETURNING on Iceberg INSERT is rejected by DuckDB itself (RETURNING clause not yet supported for insertion into Iceberg table).
  • Nested-type wire OID still falls back to text (information_schema correctly reports jsonb).
  • Standalone session-open does not call setIcebergDefault (multitenant worker path does); full data-path validation in CI is the tests/k8s lane against real cloud storage.
  • LOCK TABLE / advisory locks / SET TRANSACTION normalization; metadata perf, full client-compat matrix, observability metrics, and staged rollout are not implemented.
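
Until a 40001-style mapping lands, a client-side retry might look like the sketch below. It is an illustration only: `isIcebergConflict` and `retryOnConflict` are hypothetical helpers, and matching on the two message fragments quoted in the first gap above is an assumption about what the client actually sees on the wire.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// isIcebergConflict reports whether err looks like one of the two
// optimistic-concurrency failures listed under known gaps.
func isIcebergConflict(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, "is already outdated") ||
		strings.Contains(msg, "Conflict_409")
}

// retryOnConflict runs op up to attempts times, doubling the backoff
// between tries. Non-conflict errors are returned immediately.
func retryOnConflict(attempts int, backoff time.Duration, op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		err = op()
		if err == nil || !isIcebergConflict(err) {
			return err
		}
		if i < attempts-1 {
			time.Sleep(backoff)
			backoff *= 2
		}
	}
	return err
}

func main() {
	calls := 0
	err := retryOnConflict(3, time.Millisecond, func() error {
		calls++
		if calls < 2 {
			// Simulated failure on the first attempt.
			return errors.New("snapshot is already outdated. Please restart your transaction")
		}
		return nil
	})
	fmt.Println(calls, err)
}
```

The operation passed to `retryOnConflict` should re-run the whole transaction, since the error text itself asks the client to restart it.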

🤖 Generated with Claude Code

EDsCODE and others added 8 commits June 2, 2026 09:02
Generalize DuckLakeMode into a per-backend capability model and add an
Iceberg-via-Lakekeeper preset so PostgreSQL clients get a stable, documented
compatibility subset. Reconciled onto main's catalog-identity model
(current_database() reports the real catalog; no logical-database masking).

- transpiler capability model (StorageBackend + BackendCapabilities); DuckLake
  preset reproduces prior DuckLakeMode behavior exactly
- Iceberg preset: full DDL/DML policy, plus public-schema rewrite disabled
  (Iceberg's physical schema is "public") and three-part refs left untouched
- canonical Iceberg->PostgreSQL type map (nested -> jsonb) feeding REST metadata;
  hide __duckgres_iceberg_column_metadata from information_schema.tables
- DDL/DML: ON CONFLICT ON CONSTRAINT and ALTER COLUMN TYPE ... USING rejected
  with 0A000 via transform.CodedError (conn.go honors the carried SQLSTATE)
- error normalization: unwrap Flight/gRPC envelopes so worker errors classify to
  proper PostgreSQL SQLSTATEs instead of XX000; map "Not implemented" to 0A000
- docs/iceberg-pg-compat.md compatibility contract
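
As an illustration of the SQLSTATE-carrying error pattern mentioned above, here is a minimal sketch. The struct fields, `rejectOnConstraint`, and `sqlstateOf` are assumptions made for this example and are not copied from the transform package; the sketch only shows how a connection handler can recover the carried code with `errors.As` even when the error has been wrapped.

```go
package main

import (
	"errors"
	"fmt"
)

// CodedError is a hypothetical reconstruction of an error that carries
// a PostgreSQL SQLSTATE alongside its message.
type CodedError struct {
	Code string // five-character SQLSTATE, e.g. "0A000"
	Msg  string
}

func (e *CodedError) Error() string { return e.Msg }

// rejectOnConstraint mimics a transform refusing an unsupported form.
// The extra wrapping shows that the code survives fmt.Errorf("%w").
func rejectOnConstraint() error {
	return fmt.Errorf("transpile: %w", &CodedError{
		Code: "0A000",
		Msg:  "ON CONFLICT ON CONSTRAINT is not supported",
	})
}

// sqlstateOf returns the carried SQLSTATE, or fallback when the error
// chain contains no CodedError.
func sqlstateOf(err error, fallback string) string {
	var ce *CodedError
	if errors.As(err, &ce) {
		return ce.Code
	}
	return fallback
}

func main() {
	err := rejectOnConstraint()
	fmt.Println(sqlstateOf(err, "XX000"), err)
	fmt.Println(sqlstateOf(errors.New("plain failure"), "XX000"))
}
```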

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…plit

Restructures the Iceberg PG-compatibility work:
- transpiler capability model → typed backend profiles (transpiler/backend)
- canonical Iceberg→PG type map → server/pgtypes (ForIceberg)
- session identity split into client database vs physical catalog
  (server/sessioncatalog.Selection); current_database()/pg_catalog surfaces
  expose the client-visible database while execution routes to the physical
  DuckDB catalog

Fix (regression caught by the integration suite): rewriteDirectQuery now maps
`USE <client-database-name>` to the physical catalog's default schema
(ducklake.main / iceberg.public). Without it the common round-trip of reading
current_database() and issuing `USE` on it emitted a bogus `USE <client-db>`
that DuckDB rejected, stranding the session on the wrong catalog. Adds unit
coverage in direct_query_rewrite_test.go.

Verified: unit/server/controlplane green; integration suite 933/0; live
Lakekeeper Iceberg smoke (attach, identity, 0A000 rejections, introspection).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Recent main (#651) defaults sniRoutingMode to "enforce" when unset, so
unresolvable managed hostnames are rejected on the configStore-backed
multi-tenant path. The iceberg refactor dropped this default; restore it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adhere to main (#651): the startup `database` param is pure catalog selection —
only "" / "ducklake" / "iceberg" are accepted; arbitrary names fail closed
(3D000). current_database()/pg_catalog surfaces de-mask to the real attached
catalog rather than echoing the startup name.

- configstore: reject non-catalog database names (CatalogValid=false)
- control plane + standalone conn: de-mask the session identity to the resolved
  physical catalog before installing session metadata
- drop the now-dead `USE <client-db>` rewrite case (current_database() is again
  the physical catalog, so the existing `USE ducklake`/`USE iceberg` cases cover
  the round-trip) and its unit test
- restore tests/integration/catalog_demask_test.go to main's de-masking contract

Verified: server + controlplane green; integration 933/0; live Lakekeeper smoke
(current_database() de-masks to "iceberg").

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- restore accurate comments describing the startup `database` param as pure
  catalog selection (not a client-visible database name), matching the
  fail-closed behavior
- drop the vestigial PostgresConnectionResolution.ClientDatabase field (it was
  captured then immediately de-masked)
- inline the probeAttachedCatalogs helper back to main's two direct probes
- revert the session-default helpers to plain string params (no Selection),
  dropping the sessioncatalog import from session_search_path

No behavior change; the shared sessioncatalog.ResolveSelection (used by the
standalone server path too) is retained.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eberg compat

The session-catalog abstraction (sessioncatalog.Selection / ResolveSelection,
the ClientDatabase-vs-PhysicalCatalog split) was built for a client-database
identity model that was reverted to match main. With that gone, the abstraction
was dead weight, so remove it and restore main's inline catalog resolution:

- delete server/sessioncatalog package
- control.go / conn.go: restore main's inline resolveEffectiveCatalog +
  de-mask-to-real-catalog; the only addition is SetConnectionPhysicalCatalog,
  which feeds the transpiler's backend-profile selection (the actual iceberg hook)
- sessionmeta: restore the catalog-string signature and main's metadata views;
  keep the genuine iceberg type-map additions (udt_name column +
  __duckgres_iceberg_column_metadata filter)
- configstore/session_search_path/tests/k8s: restored to main

controlplane diff vs main is now ~3 lines. Verified: build (both tags) +
server/controlplane/transpiler suites green; integration 933/0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Exhaustive Iceberg DML/DDL QA surfaced two statements that skipped the DDL
transform because Classify's substring list didn't mention them — when they were
the only DDL trigger word in the statement, FlagDDL was never set and the
statement reached DuckDB unhandled:

- ANALYZE <table>: parses as a VacuumStmt but lacked an "ANALYZE" trigger, so it
  was not no-op'd and DuckDB rejected it ("Vacuum is only implemented for DuckDB
  tables"). Add "ANALYZE" to Classify and give it its own command tag
  (distinguished from VACUUM via VacuumStmt.IsVacuumcmd).
- CREATE TABLE ... col GENERATED ALWAYS AS (...) STORED: with no other trigger
  word, the generated column was not stripped and DuckDB rejected the STORED
  generated column. Add "GENERATED" to Classify.

Unit tests added for both (ANALYZE no-op tag; GENERATED stripped when sole
trigger). Verified: transpiler/server/controlplane suites green; integration
933/0; live Lakekeeper exhaustive DDL+DML QA all pass.
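
The failure mode can be sketched as a keyword pre-filter. The trigger list and the `hasDDLTrigger` helper below are hypothetical and do not reproduce Classify's real list or flags; the sketch only shows why a statement whose sole trigger word is missing from the list bypasses the transform.

```go
package main

import (
	"fmt"
	"strings"
)

// Illustrative trigger lists. The real Classify list is not reproduced
// here; "before" simply lacks the two words this commit adds.
var (
	triggersBefore = []string{"PRIMARY KEY", "SERIAL", "VACUUM", "INDEX"}
	triggersAfter  = append(append([]string{}, triggersBefore...), "ANALYZE", "GENERATED")
)

// hasDDLTrigger reports whether sql contains any trigger word. It is a
// cheap substring check used to decide whether to run the DDL transform.
func hasDDLTrigger(sql string, triggers []string) bool {
	upper := strings.ToUpper(sql)
	for _, t := range triggers {
		if strings.Contains(upper, t) {
			return true
		}
	}
	return false
}

func main() {
	stmts := []string{
		"ANALYZE events",
		"CREATE TABLE t (a int, b int GENERATED ALWAYS AS (a * 2) STORED)",
	}
	for _, s := range stmts {
		fmt.Printf("before=%t after=%t  %s\n",
			hasDDLTrigger(s, triggersBefore), hasDDLTrigger(s, triggersAfter), s)
	}
}
```

In this sketch both statements report `before=false after=true`, which matches the description above: without the new words the transform is never invoked and the statement reaches DuckDB unchanged.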

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@EDsCODE EDsCODE marked this pull request as ready for review June 2, 2026 19:05
@EDsCODE EDsCODE merged commit e9b139b into main Jun 2, 2026
22 checks passed
@EDsCODE EDsCODE deleted the iceberg-pg-compat branch June 2, 2026 19:16