diff --git a/.agents/skills/author-recipes-and-cookbooks/SKILL.md b/.agents/skills/author-recipes-and-cookbooks/SKILL.md index 0737ec6..7934f11 100644 --- a/.agents/skills/author-recipes-and-cookbooks/SKILL.md +++ b/.agents/skills/author-recipes-and-cookbooks/SKILL.md @@ -27,11 +27,11 @@ User-facing, all three kinds are presented as one thing: **template**. The site The internal kind names (`recipe`, `cookbook`, `example`) **live only in code, file paths, and this skill** — they never appear in shipped UI, markdown content, or generated indexes. -| Internal kind | Source location | Route at runtime | When to use | -| ------------- | ---------------------------------------------------------------------------------------- | ------------------- | -------------------------------------------------------------------------- | -| `recipe` | `content/recipes//{content,prerequisites,deployment}.md` + entry in `recipes` | `/templates/` | One atomic outcome, copy-pasteable in a single agent prompt. | -| `cookbook` | Entry in `cookbooks` (composes recipes) + manual page `src/pages/templates/.tsx` | `/templates/` | End-to-end walkthrough composed from multiple recipes. | -| `example` | `content/examples//content.md` + full app source under `examples//template/` | `/templates/` | Full deployable codebase that bundles cookbooks/recipes plus runnable app. | +| Internal kind | Source location | Route at runtime | When to use | +| ------------- | ------------------------------------------------------------------------------------------------- | ------------------- | -------------------------------------------------------------------------- | +| `recipe` | `content/recipes//{content,prerequisites,deployment}.md` + entry in `recipes` | `/templates/` | One atomic outcome, copy-pasteable in a single agent prompt. | +| `cookbook` | Entry in `cookbooks` (composes recipes) + manual page `src/pages/templates/.tsx` | `/templates/` | End-to-end walkthrough composed from multiple recipes. 
| +| `example` | `content/examples//content.md` + full app source at `app-templates//` (separate repo) | `/templates/` | Full deployable codebase that bundles cookbooks/recipes plus runnable app. | All three are registered in `src/lib/recipes/recipes.ts`, share a flat `/templates/` URL hierarchy, and must have globally unique slugs (the content-entries plugin asserts this at build time). Choose the kind that matches the **shape of the work**, not the customer-facing label. @@ -39,7 +39,7 @@ All three are registered in `src/lib/recipes/recipes.ts`, share a flat `/templat 1. One atomic, self-contained step for an agent → author a `recipe`. 2. A multi-recipe walkthrough that ships a coherent end-to-end use case (no full app source) → author a `cookbook` composing existing recipes. -3. A full deployable app (`template/` source tree, README runbook, optional pipelines and seed) → author an `example`. +3. A full deployable app (source tree in `app-templates//`, README runbook, optional pipelines and seed) → author an `example`. 4. Reuse existing recipes whenever you can. New recipes are the most valuable; new cookbooks/examples should compose them. ## Author A `recipe` @@ -93,59 +93,57 @@ An example is a full working codebase plus narrative markdown. It bundles cookbo ### 1. Create The Example Code -Create `examples//` with: +Example code lives in the [app-templates](https://github.com/databricks/app-templates) repo (not in `devhub/`). Create `app-templates//` with: ``` -examples// - template/ # full runnable tree (AppKit app + optional pipelines/seed/provisioning) - README.md # canonical provisioning, SQL, seed, and deploy runbook - databricks.yml # bundle config with REPLACE_ME placeholders - app.yaml # runtime env from bundle resources - package.json # app dependencies - appkit.plugins.json # plugin manifest - server/ # Express backend - client/ # React frontend - config/queries/ # SQL query files - provisioning/sql/ # baseline SQL (Unity Catalog, Postgres, etc.) 
- pipelines/ # Lakeflow pipelines (optional) - / - databricks.yml - resources/*.yml - src/**/*.sql or *.py - seed/ # seed script for demo data (optional) - seed.ts - package.json +app-templates// # full runnable tree (AppKit app + optional pipelines/seed/provisioning) + README.md # canonical provisioning, SQL, seed, and deploy runbook + databricks.yml # bundle config with REPLACE_ME placeholders + app.yaml # runtime env from bundle resources + package.json # app dependencies + appkit.plugins.json # plugin manifest + server/ # Express backend + client/ # React frontend + config/queries/ # SQL query files + provisioning/sql/ # baseline SQL (Unity Catalog, Postgres, etc.) + pipelines/ # Lakeflow pipelines (optional) + / + databricks.yml + resources/*.yml + src/**/*.sql or *.py + seed/ # seed script for demo data (optional) + seed.ts + package.json ``` Key conventions: -- The app directory MUST be named `template/` (not `app/`) so `databricks apps init --template` works. -- All runnable assets (app, optional `pipelines/`, `seed/`, `provisioning/sql/`) live **under** `template/`. Do not leave `pipelines/` or `seed/` at the example root — `template/README.md` must describe the full path from zero to deployed app. +- All runnable assets (app, optional `pipelines/`, `seed/`, `provisioning/sql/`) live at the root of the example folder so `databricks apps init --template https://github.com/databricks/app-templates/tree/main/` works. `README.md` must describe the full path from zero to deployed app. - Use `REPLACE_ME` placeholders for workspace-specific values (host, warehouse ID, catalog name, Lakebase project, etc.). - Never commit workspace-specific values, `.databricks/`, `node_modules/`, or `.env`. - Pipeline SQL files use schema-qualified names (e.g., `silver.users`); rely on the pipeline YAML `catalog` setting for catalog resolution. - Add `.npmrc` pointing to `https://npm-proxy.dev.databricks.com/` if the app uses `@databricks/appkit`. 
- SQL files in `config/queries/` run against the **Databricks SQL Warehouse** (Spark SQL dialect), NOT Lakebase Postgres. Use `CURRENT_DATE()` not `NOW()`, `DATE_ADD(d, n)` not `d + INTERVAL`, `SUM(CASE WHEN ... THEN 1 ELSE 0 END)` not `COUNT(*) FILTER (WHERE ...)`. Reference Unity Catalog three-part names (e.g., `catalog.schema.table`). -### `template/README.md` (canonical runbook) +### `README.md` (canonical runbook) -This file is the single source of truth for operators and coding agents. The example detail page on DevHub points users here via clone + `cd` into `template/`; it must be complete enough to deploy without guessing. +This file is the single source of truth for operators and coding agents. The example detail page on DevHub points users here via clone + `cd` into the example folder; it must be complete enough to deploy without guessing. Include, as appropriate: 1. **Architecture** — short diagram or bullet flow (OLTP → sync → pipelines → app, etc.). 2. **Components** — what lives in `client/`, `server/`, `pipelines/`, `seed/`, `provisioning/sql/`. 3. **Provisioning** — numbered order of operations. For each step, state clearly what is: - - **Runnable SQL** — point to files under `template/provisioning/sql/`. + - **Runnable SQL** — point to files under `provisioning/sql/`. - **Manual / UI only** — Lakehouse Sync, Sync Tables, Genie space, catalog creation with storage root, etc. - **CLI / bundles** — which directory to `cd` into, `databricks bundle deploy` targets, dependencies between pipelines and app. -4. **Seeding** — exact commands from `template/seed/` (`cd seed`, `npm install`, `DATABASE_URL=... npm run seed`). Note Postgres prerequisites (e.g. `REPLICA IDENTITY FULL`). -5. **Deploy** — from `template/`: install, build, `databricks bundle deploy`. Link pipeline deploys before/after as required. -6. **Optional** — `databricks apps init --template https://github.com/databricks/devhub/tree/main/examples/` for users who scaffold instead of cloning. 
+4. **Seeding** — exact commands from `seed/` (`cd seed`, `npm install`, `DATABASE_URL=... npm run seed`). Note Postgres prerequisites (e.g. `REPLICA IDENTITY FULL`). +5. **Deploy** — from the example root: install, build, `databricks bundle deploy`. Link pipeline deploys before/after as required. +6. **Optional** — `databricks apps init --template https://github.com/databricks/app-templates/tree/main/` for users who scaffold instead of cloning. -Do **not** maintain a separate long-form `provisioning/README.md` next to the SQL — duplicate instructions drift. Keep narrative in `template/README.md` only. +Do **not** maintain a separate long-form `provisioning/README.md` next to the SQL — duplicate instructions drift. Keep narrative in `README.md` only. -For examples with no Unity Catalog DDL, still add `template/provisioning/sql/` with a comment-only file (e.g. `00_no_unity_catalog_ddl.sql`) so every example has a predictable place for SQL. +For examples with no Unity Catalog DDL, still add `provisioning/sql/` with a comment-only file (e.g. `00_no_unity_catalog_ddl.sql`) so every example has a predictable place for SQL. ### 2. Create The Example Markdown @@ -155,7 +153,7 @@ Create `content/examples//content.md`: - Brief motivation (1-2 paragraphs): what it demonstrates and why. - Data flow or architecture description. - What the user needs to adapt: which resources to create, which placeholders to fill in, manual steps (Lakehouse Sync, Sync Tables, etc.). -- Add a sentence under **What to Adapt** that **provisioning, seeding, and deployment** are documented in the repository's **`template/README.md`** — do not duplicate the runbook in this markdown. +- Add a sentence under **What to Adapt** that **provisioning, seeding, and deployment** are documented in the repository's **`README.md`** — do not duplicate the runbook in this markdown. - Keep it short and actionable. ### 3. 
Register The Example @@ -163,12 +161,12 @@ Create `content/examples//content.md`: Update `src/lib/recipes/recipes.ts`: - Add an entry to `examples` using `createExample()`. -- Set `id`, `name`, `description`, `githubPath`, `initCommand`, `templateIds`, `recipeIds`. +- Set `id`, `name`, `description`, `templateUrl`, `initCommand`, `templateIds`, `recipeIds`. - `templateIds` references cookbooks the example builds upon. - `recipeIds` references standalone recipes not already pulled in via a cookbook. - `createExample()` derives `tags` and `services`. -- `initCommand` format: `git clone --depth 1 https://github.com/databricks/devhub.git` then `cd devhub/examples//template`. Optional CLI scaffold: `databricks apps init --template https://github.com/databricks/devhub/tree/main/examples/`. -- `githubPath` is `examples/`. +- `initCommand` format: `git clone --depth 1 https://github.com/databricks/app-templates.git` then `cd app-templates/`. Optional CLI scaffold: `databricks apps init --template https://github.com/databricks/app-templates/tree/main/`. +- `templateUrl` is `https://github.com/databricks/app-templates/tree/main/`. ### 4. Add Preview And Gallery Images (Optional) @@ -206,7 +204,7 @@ Screenshot guidance: - Light mode: `--db-bg` + `--db-card` surfaces, navy text, orange accents. - Dark mode: `--db-navy` + `--db-navy-light` surfaces, `--db-lava-light` accents, near-white text. Avoid pure-black CSS defaults. - Use orange (`--db-lava` / `--db-lava-light`) sparingly — primary CTAs, active state, single accents. Avoid saturating whole regions. -- AppKit defaults already wire these tokens into Tailwind; copy from an existing example's `template/client/tailwind.config.ts` so new examples are on-brand by default. +- AppKit defaults already wire these tokens into Tailwind; copy from an existing example's `client/tailwind.config.ts` so new examples are on-brand by default. ### 5. 
Verify The DevHub Build @@ -214,9 +212,9 @@ Run `npm run fmt && npm run typecheck && npm run build && npm run test` from the ### 6. Test With A Dry Run -**Two directories, two purposes.** `examples/` is committed source code with `REPLACE_ME` placeholders. `../../demos//` (outside the repo) is the scratch workspace for installing, configuring, and deploying. +**Two directories, two purposes.** `app-templates//` (in the separate app-templates repo) is committed source code with `REPLACE_ME` placeholders. `../../demos//` (outside the repo) is the scratch workspace for installing, configuring, and deploying. -NEVER `npm install`, deploy, or write workspace-specific values inside `examples/`. ALWAYS work from the demos folder outside the repo. +NEVER `npm install`, deploy, or write workspace-specific values inside `app-templates//`. ALWAYS work from the demos folder outside the repo. NEVER reuse existing workspace resources (Lakebase projects, Genie spaces, apps, UC catalogs) unless the developer explicitly says to. Always create fresh resources for the dry run to avoid corrupting or overwriting existing data. @@ -227,7 +225,7 @@ The demos folder must be **outside the git repo** because `databricks bundle dep ```bash # 1. Copy the template tree into demos (outside the repo) mkdir -p ../../demos/ -cp -r examples//template/* ../../demos// +cp -r //* ../../demos// # 2. Fill in workspace-specific values # Edit ../../demos//databricks.yml — replace REPLACE_ME with real IDs @@ -244,7 +242,7 @@ npm run build # - Unity Catalog table (if analytics queries need warehouse-accessible data) # 5. Seed data (if the example has a seed script) -cd /examples//template/seed +cd //seed npm install DATABASE_URL="postgresql://..." npm run seed @@ -258,9 +256,9 @@ databricks apps get --profile #### Fixing issues found during dry run -- **Code bug** (build fails, runtime error, wrong SQL dialect) — fix in `examples//` in the repo, then re-copy to `../../demos//` and retry. 
+- **Code bug** (build fails, runtime error, wrong SQL dialect) — fix in `app-templates//` in the app-templates repo, then re-copy to `../../demos//` and retry. - **Instruction gap** (missing step, unclear placeholder) — fix in `content/examples//content.md` or the relevant recipe under `content/recipes/`. -- **Seed data issue** — fix in `examples//template/seed/seed.ts`. +- **Seed data issue** — fix in `app-templates//seed/seed.ts`. #### Cleanup @@ -349,8 +347,8 @@ Allowed exceptions (the validator skips these): - a prompt for an AI coding agent - a human step-by-step walkthrough 10. Run `npm run validate:content && npm run fmt && npm run typecheck && npm run build && npm run test`. -11. For examples: verify `examples//` contains only `REPLACE_ME` placeholders — no real workspace hosts, warehouse IDs, Lakebase project names, or Genie space IDs. -12. For examples: verify `examples//template/README.md` covers provisioning (manual vs SQL), seeding, pipeline deploys, and app deploy end-to-end. +11. For examples: verify `app-templates//` contains only `REPLACE_ME` placeholders — no real workspace hosts, warehouse IDs, Lakebase project names, or Genie space IDs. +12. For examples: verify `app-templates//README.md` covers provisioning (manual vs SQL), seeding, pipeline deploys, and app deploy end-to-end. 13. For examples: verify the dry-run deploy succeeded and the app is functional before considering the example complete. ## References @@ -360,5 +358,5 @@ Allowed exceptions (the validator skips these): - Read `src/lib/recipes/recipes.ts` for all type contracts (`Recipe`, `Cookbook`, `Example`). - Read `src/pages/templates/app-with-lakebase.tsx` for the cookbook composition pattern. - Read `src/components/examples/example-detail.tsx` for example detail rendering. -- Read `examples/agentic-support-console/template/README.md` for a full example runbook (provisioning, SQL, seed, pipelines, deploy). 
+- Read `app-templates/agentic-support-console/README.md` (in the [app-templates](https://github.com/databricks/app-templates) repo) for a full example runbook (provisioning, SQL, seed, pipelines, deploy). - Read `plugins/content-entries.ts` for slug parity and uniqueness validation. diff --git a/.agents/skills/author-recipes-and-cookbooks/references/quality-checklist.md b/.agents/skills/author-recipes-and-cookbooks/references/quality-checklist.md index d76c8b7..aa1ffc5 100644 --- a/.agents/skills/author-recipes-and-cookbooks/references/quality-checklist.md +++ b/.agents/skills/author-recipes-and-cookbooks/references/quality-checklist.md @@ -41,12 +41,12 @@ Use this checklist after drafting and before final handoff. ## Example Quality -- Confirm example app directory is named `template/` (for `databricks apps init --template` compatibility). -- Confirm optional `pipelines/`, `seed/`, and `provisioning/sql/` live under `template/` (not only at `examples//` root). -- Confirm **`template/README.md`** is the full runbook: provisioning order (SQL vs manual vs bundles), seeding commands, pipeline and app deploy, and optional `databricks apps init` scaffold URL. +- Confirm example code lives at `app-templates//` in the [app-templates](https://github.com/databricks/app-templates) repo (flat layout — no `template/` subdir). +- Confirm optional `pipelines/`, `seed/`, and `provisioning/sql/` live at the example root. +- Confirm **`README.md`** is the full runbook: provisioning order (SQL vs manual vs bundles), seeding commands, pipeline and app deploy, and optional `databricks apps init` scaffold URL. - Confirm all workspace-specific values use `REPLACE_ME` placeholders. - Confirm `.databricks/`, `node_modules/`, and `.env` are not committed. - Confirm pipeline SQL uses schema-qualified names (not hardcoded catalog names). 
-- Confirm `initCommand` in `recipes.ts` uses clone + `cd` into `devhub/examples//template` (see `createExample()` / example detail page); optional CLI scaffold uses `https://github.com/databricks/devhub/tree/main/examples/`. +- Confirm `initCommand` in `recipes.ts` uses clone + `cd` into `app-templates/` (see `createExample()` / example detail page); optional CLI scaffold uses `https://github.com/databricks/app-templates/tree/main/`. - Images are optional. If provided, confirm they pass `npm run verify:images` (16:9 ±2%, ≥1600×900 px, PNG/JPG/WEBP — no SVG screenshots). Set `previewImageLightUrl`/`previewImageDarkUrl` (or `galleryImages` for a multi-slide carousel) in the `createExample()` entry. When omitted, the UI falls back to generic card art. -- Confirm `content/examples/.md` exists and matches the example's `id`, and points readers at `template/README.md` for setup. +- Confirm `content/examples/.md` exists and matches the example's `id`, and points readers at `README.md` for setup. diff --git a/.gitattributes b/.gitattributes deleted file mode 100644 index b9758b0..0000000 --- a/.gitattributes +++ /dev/null @@ -1,5 +0,0 @@ -# Collapse lockfiles in GitHub PR diffs and exclude them from language stats. -# These files are always machine-generated and are excluded from bundle uploads -# (see examples/rag-chat/template/databricks.yml:sync.exclude) so they don't -# reach the Databricks Apps Linux build container. -examples/*/template/package-lock.json linguist-generated=true -diff diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 6f92914..4340180 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -97,7 +97,7 @@ DevHub has three internal content tiers that compose into each other: - **Recipe** — atomic, copy-pasteable agent prompt for one outcome (e.g. "Create a Lakebase instance"). The smallest unit; everything else is built from these. 
- **Cookbook** — composes multiple recipes into a longer end-to-end guide, plus its own meta content (intro, narrative, ordering). No app source. -- **Example** — a cookbook _plus_ a full deployable `examples//template/` codebase. Bundles recipes and cookbook narrative around runnable app code. +- **Example** — a cookbook _plus_ a full deployable codebase that lives in the [app-templates](https://github.com/databricks/app-templates) repo at `app-templates//`. Bundles recipes and cookbook narrative around runnable app code. So: recipes are the atoms, cookbooks compose recipes with additional context, and examples are cookbooks with shipped code. **User-facing, all three are presented as one thing: a "template"** — the site, navigation, filters, copy-pasted prompts, and `llms.txt` only ever say "template(s)". @@ -105,7 +105,7 @@ So: recipes are the atoms, cookbooks compose recipes with additional context, an | ------------ | ------------------------------------------------------------------ | --------------------------------------------------------------------------------- | | **Recipe** | One atomic outcome (e.g. "Create a Lakebase instance") | `content/recipes/.md` + metadata in `src/lib/recipes/recipes.ts` | | **Cookbook** | End-to-end walkthrough composed from multiple recipes | Metadata in `src/lib/recipes/recipes.ts` + page in `src/pages/templates/.tsx` | -| **Example** | Cookbook + full runnable app template with code, pipelines, deploy | `content/examples/.md` + `examples//template/` + metadata | +| **Example** | Cookbook + full runnable app template with code, pipelines, deploy | `content/examples/.md` + `app-templates//` (separate repo) + metadata | All three render at `/templates/` and live in one unified Templates catalog filterable by service. Slugs must be globally unique across all three — the content-entries plugin validates this at build time. 
diff --git a/api/content-markdown.ts b/api/content-markdown.ts index 16a1ec3..32a5259 100644 --- a/api/content-markdown.ts +++ b/api/content-markdown.ts @@ -188,11 +188,8 @@ function readExampleMarkdown(rootDir: string, slug: string): string { if (example.initCommand) { lines.push("## Quick start", "", "```bash", example.initCommand, "```", ""); } - if (example.githubPath) { - lines.push( - `[View source on GitHub](https://github.com/databricks/devhub/tree/main/${example.githubPath}/template)`, - "", - ); + if (example.templateUrl) { + lines.push(`[View source on GitHub](${example.templateUrl})`, ""); } const includedTemplates = [ ...example.cookbookIds.map((id) => cookbooks.find((c) => c.id === id)), diff --git a/content/examples/agentic-support-console/goal.md b/content/examples/agentic-support-console/goal.md index 7788369..25ca379 100644 --- a/content/examples/agentic-support-console/goal.md +++ b/content/examples/agentic-support-console/goal.md @@ -13,7 +13,7 @@ Customer interactions flow from your application's OLTP database (Lakebase Postg ### What to Adapt -Provisioning (manual steps and SQL), seeding, pipeline deploys, reverse sync, and app deploy are documented in the repository’s **`template/README.md`** alongside the code. +Provisioning (manual steps and SQL), seeding, pipeline deploys, reverse sync, and app deploy are documented in the repository’s **`README.md`** alongside the code. To make this template your own: diff --git a/content/examples/content-moderator/goal.md b/content/examples/content-moderator/goal.md index 133cab1..e798767 100644 --- a/content/examples/content-moderator/goal.md +++ b/content/examples/content-moderator/goal.md @@ -13,7 +13,7 @@ Content moves through a review pipeline backed by Lakebase and AI Gateway: ### What to Adapt -Setup and provisioning are documented in the repository’s **`template/README.md`**. +Setup and provisioning are documented in the repository’s **`README.md`**. 
To make this template your own: diff --git a/content/examples/inventory-intelligence/goal.md b/content/examples/inventory-intelligence/goal.md index c261903..1354aab 100644 --- a/content/examples/inventory-intelligence/goal.md +++ b/content/examples/inventory-intelligence/goal.md @@ -17,7 +17,7 @@ The app should have a **beautiful, polished design** — clean typography, consi ### What to Adapt -Provisioning (Unity Catalog schemas, Lakebase REPLICA IDENTITY), seeding, pipeline deploys, reverse sync, and app deploy are documented in the repository's **`template/README.md`** alongside the code. +Provisioning (Unity Catalog schemas, Lakebase REPLICA IDENTITY), seeding, pipeline deploys, reverse sync, and app deploy are documented in the repository's **`README.md`** alongside the code. To make this template your own: diff --git a/content/examples/rag-chat/goal.md b/content/examples/rag-chat/goal.md index 88fb0e9..6c425e8 100644 --- a/content/examples/rag-chat/goal.md +++ b/content/examples/rag-chat/goal.md @@ -23,7 +23,7 @@ This validates the [AppKit templates system](/docs/appkit/v0/development/templat ### What to Adapt -Setup and provisioning are documented in the repository's **`template/README.md`**. +Setup and provisioning are documented in the repository's **`README.md`**. To make this template your own: diff --git a/content/examples/saas-tracker/goal.md b/content/examples/saas-tracker/goal.md index 8451358..0a49c41 100644 --- a/content/examples/saas-tracker/goal.md +++ b/content/examples/saas-tracker/goal.md @@ -11,7 +11,7 @@ All subscription data lives in a single Lakebase Postgres table and is served di ### What to Adapt -Setup and provisioning are documented in the repository’s **`template/README.md`**. +Setup and provisioning are documented in the repository’s **`README.md`**. 
To make this template your own: diff --git a/content/examples/vacation-rentals/goal.md b/content/examples/vacation-rentals/goal.md index c91bb73..351cf06 100644 --- a/content/examples/vacation-rentals/goal.md +++ b/content/examples/vacation-rentals/goal.md @@ -11,7 +11,7 @@ The app composes four Databricks primitives behind a single React UI: ### What to Adapt -Setup, environment variables, and bundle deployment are documented in the repository's **`template/README.md`**. +Setup, environment variables, and bundle deployment are documented in the repository's **`README.md`**. To make this template your own: diff --git a/examples/README.md b/examples/README.md index 669aa9b..e5337a3 100644 --- a/examples/README.md +++ b/examples/README.md @@ -1,7 +1,5 @@ # DevHub examples -Each example includes a **`template/`** folder: Databricks App (AppKit), optional **`pipelines/`**, **`seed/`**, and **`provisioning/sql/`** when baseline SQL helps. +The runnable example apps that DevHub references (Agentic Support Console, Vacation Rentals, SaaS Tracker, Content Moderator, Inventory Intelligence, RAG Chat) now live in the [databricks/app-templates](https://github.com/databricks/app-templates) repository at `app-templates//`. -**Runbook:** use each example’s **`template/README.md`** for manual steps, SQL order, seeding, and deploy. - -See **[agentic-support-console/template/README.md](./agentic-support-console/template/README.md)** for a full stack (UC, Lakebase CDC, pipelines, reverse sync). +This folder retains only `vacation-rentals/blog-post-snippets/` — supporting code snippets referenced by the "Building a Vacation Rental Operations App with AppKit" blog post. The runnable Vacation Rentals template itself lives at [app-templates/vacation-rentals](https://github.com/databricks/app-templates/tree/main/vacation-rentals). 
diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-apps/SKILL.md b/examples/agentic-support-console/template/.agents/skills/databricks-apps/SKILL.md deleted file mode 100644 index 53f74ad..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-apps/SKILL.md +++ /dev/null @@ -1,159 +0,0 @@ ---- -name: databricks-apps -description: Build apps on Databricks Apps platform. Use when asked to create dashboards, data apps, analytics tools, or visualizations. Invoke BEFORE starting implementation. -compatibility: Requires databricks CLI (>= v0.294.0) -metadata: - version: '0.1.1' -parent: databricks-core ---- - -# Databricks Apps Development - -**FIRST**: Use the parent `databricks-core` skill for CLI basics, authentication, and profile selection. - -Build apps that deploy to Databricks Apps platform. - -## Required Reading by Phase - -| Phase | READ BEFORE proceeding | -| ----------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| Scaffolding | Parent `databricks-core` skill (auth, warehouse discovery); run `databricks apps manifest` and use its plugins/resources to build `databricks apps init` with `--features` and `--set` (see AppKit section below) | -| Writing SQL queries | [SQL Queries Guide](references/appkit/sql-queries.md) | -| Writing UI components | [Frontend Guide](references/appkit/frontend.md) | -| Using `useAnalyticsQuery` | [AppKit SDK](references/appkit/appkit-sdk.md) | -| Adding API endpoints | [tRPC Guide](references/appkit/trpc.md) | -| Using Lakebase (OLTP database) | [Lakebase Guide](references/appkit/lakebase.md) | -| Using Model Serving (ML inference) | [Model Serving Guide](references/appkit/model-serving.md) | -| Typed data contracts (proto-first design) | [Proto-First 
Guide](references/appkit/proto-first.md) and [Plugin Contracts](references/appkit/proto-contracts.md) | -| Platform rules (permissions, deployment, limits) | [Platform Guide](references/platform-guide.md) — READ for ALL apps including AppKit | -| Non-AppKit app (Streamlit, FastAPI, Flask, Gradio, Next.js, etc.) | [Other Frameworks](references/other-frameworks.md) | - -## Generic Guidelines - -- **App name**: ≤26 characters, lowercase letters/numbers/hyphens only (no underscores). dev- prefix adds 4 chars, max 30 total. -- **Validation**: `databricks apps validate --profile ` before deploying. -- **Smoke tests** (AppKit only): ALWAYS update `tests/smoke.spec.ts` selectors BEFORE running validation. Default template checks for "Minimal Databricks App" heading and "hello world" text — these WILL fail in your custom app. See [testing guide](references/testing.md). -- **Authentication**: covered by parent `databricks-core` skill. - -## Project Structure (after `databricks apps init --features analytics`) - -- `client/src/App.tsx` — main React component (start here) -- `config/queries/*.sql` — SQL query files (queryKey = filename without .sql) -- `server/server.ts` — backend entry (tRPC routers) -- `tests/smoke.spec.ts` — smoke test (⚠️ MUST UPDATE selectors for your app) -- `client/src/appKitTypes.d.ts` — auto-generated types (`npm run typegen`) - -## Project Structure (after `databricks apps init --features lakebase`) - -- `server/server.ts` — backend with Lakebase pool + tRPC routes -- `client/src/App.tsx` — React frontend -- `app.yaml` — manifest with `database` resource declaration -- `package.json` — includes `@databricks/lakebase` dependency -- Note: **No `config/queries/`** — Lakebase apps use `pool.query()` in tRPC, not SQL files - -## Data Discovery - -Before writing any SQL, use the parent `databricks-core` skill for data exploration — search `information_schema` by keyword, then batch `discover-schema` for the tables you need. Do NOT skip this step. 
- -## Development Workflow (FOLLOW THIS ORDER) - -**Analytics apps** (`--features analytics`): - -1. Create SQL files in `config/queries/` -2. Run `npm run typegen` — verify all queries show ✓ -3. Read `client/src/appKitTypes.d.ts` to see generated types -4. **THEN** write `App.tsx` using the generated types -5. Update `tests/smoke.spec.ts` selectors -6. Run `databricks apps validate --profile ` - -**DO NOT** write UI code before running typegen — types won't exist and you'll waste time on compilation errors. - -**Lakebase apps** (`--features lakebase`): No SQL files or typegen. See [Lakebase Guide](references/appkit/lakebase.md) for the tRPC pattern: initialize schema at startup, write procedures in `server/server.ts`, then build the React frontend. - -## When to Use What - -- **Read analytics data → display in chart/table**: Use visualization components with `queryKey` prop -- **Read analytics data → custom display (KPIs, cards)**: Use `useAnalyticsQuery` hook -- **Read analytics data → need computation before display**: Still use `useAnalyticsQuery`, transform client-side -- **Read/write persistent data (users, orders, CRUD state)**: Use Lakebase pool via tRPC — see [Lakebase Guide](references/appkit/lakebase.md) -- **Call ML model endpoint**: Use tRPC — see [Model Serving Guide](references/appkit/model-serving.md) -- **⚠️ NEVER use tRPC to run SELECT queries against the warehouse** — always use SQL files in `config/queries/` -- **⚠️ NEVER use `useAnalyticsQuery` for Lakebase data** — it queries the SQL warehouse only - -## Frameworks - -### AppKit (Recommended) - -TypeScript/React framework with type-safe SQL queries and built-in components. 
- -**Official Documentation** — the source of truth for all API details: - -```bash -npx @databricks/appkit docs # ← ALWAYS start here to see available pages -npx @databricks/appkit docs # view a section by name or doc path -npx @databricks/appkit docs --full # full index with all API entries -npx @databricks/appkit docs "appkit-ui API reference" # example: section by name -npx @databricks/appkit docs ./docs/plugins/analytics.md # example: specific doc file -``` - -**DO NOT guess doc paths.** Run without args first, pick from the index. The `` argument accepts both section names (from the index) and file paths. Docs are the authority on component props, hook signatures, and server APIs — skill files only cover anti-patterns and gotchas. - -**App Manifest and Scaffolding** - -**Agent workflow for scaffolding: get the manifest first, then build the init command.** - -1. **Get the manifest** (JSON schema describing plugins and their resources): - - ```bash - databricks apps manifest --profile - # See plugins available in a specific AppKit version: - databricks apps manifest --version --profile - # Custom template: - databricks apps manifest --template --profile - ``` - - The output defines: - - **Plugins**: each has a key (plugin ID for `--features`), plus `requiredByTemplate`, and `resources`. - - **requiredByTemplate**: If **true**, that plugin is **mandatory** for this template — do **not** add it to `--features` (it is included automatically); you must still supply all of its required resources via `--set`. If **false** or absent, the plugin is **optional** — add it to `--features` only when the user's prompt indicates they want that capability (e.g. analytics/SQL), and then supply its required resources via `--set`. - - **Resources**: Each plugin has `resources.required` and `resources.optional` (arrays). Each item has `resourceKey` and `fields` (object: field name → description/env). Use `--set ..=` for each required resource field of every plugin you include. 
- -2. **Scaffold** (DO NOT use `npx`; use the CLI only): - ```bash - databricks apps init --name --features , \ - --set ..= \ - --set ..= \ - --description "" --run none --profile - # --run none: skip auto-run after scaffolding (review code first) - # With custom template: - databricks apps init --template --name --features ... --set ... --profile - ``` - Optionally use `--version ` to target a specific AppKit version. - - **Required**: `--name`, `--profile`. Name: ≤26 chars, lowercase letters/numbers/hyphens only. Use `--features` only for **optional** plugins the user wants (plugins with `requiredByTemplate: false` or absent); mandatory plugins must not be listed in `--features`. - - **Resources**: Pass `--set` for every required resource (each field in `resources.required`) for (1) all plugins with `requiredByTemplate: true`, and (2) any optional plugins you added to `--features`. Add `--set` for `resources.optional` only when the user requests them. - - **Discovery**: Use the parent `databricks-core` skill to resolve IDs (e.g. warehouse: `databricks warehouses list --profile ` or `databricks experimental aitools tools get-default-warehouse --profile `). - -**DO NOT guess** plugin names, resource keys, or property names — always derive them from `databricks apps manifest` output. Example: if the manifest shows plugin `analytics` with a required resource `resourceKey: "sql-warehouse"` and `fields: { "id": ... }`, include `--set analytics.sql-warehouse.id=`. - -**READ [AppKit Overview](references/appkit/overview.md)** for project structure, workflow, and pre-implementation checklist. - -### Common Scaffolding Mistakes - -```bash -# ❌ WRONG: name is NOT a positional argument -databricks apps init --features analytics my-app-name -# → "unknown command" error - -# ✅ CORRECT: use --name flag -databricks apps init --name my-app-name --features analytics --set "..." 
--profile -``` - -### Directory Naming - -`databricks apps init` creates directories in kebab-case matching the app name. -App names must be lowercase with hyphens only (≤26 chars). - -### Other Frameworks (Streamlit, FastAPI, Flask, Gradio, Dash, Next.js, etc.) - -Databricks Apps supports any framework that runs as an HTTP server. LLMs already know these frameworks — the challenge is Databricks platform integration. - -**READ [Other Frameworks Guide](references/other-frameworks.md) BEFORE building any non-AppKit app.** It covers port/host configuration, `app.yaml` and `databricks.yml` setup, dependency management, networking, and framework-specific gotchas. diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-apps/agents/openai.yaml b/examples/agentic-support-console/template/.agents/skills/databricks-apps/agents/openai.yaml deleted file mode 100644 index 6a6f81b..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-apps/agents/openai.yaml +++ /dev/null @@ -1,7 +0,0 @@ -interface: - display_name: 'Databricks Apps' - short_description: 'Apps development and deployment' - icon_small: './assets/databricks.svg' - icon_large: './assets/databricks.png' - brand_color: '#FF3621' - default_prompt: 'Use $databricks-apps for Databricks Apps development and deployment.' 
diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-apps/assets/databricks.png b/examples/agentic-support-console/template/.agents/skills/databricks-apps/assets/databricks.png deleted file mode 100644 index 263fe98..0000000 Binary files a/examples/agentic-support-console/template/.agents/skills/databricks-apps/assets/databricks.png and /dev/null differ diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-apps/assets/databricks.svg b/examples/agentic-support-console/template/.agents/skills/databricks-apps/assets/databricks.svg deleted file mode 100644 index 9d19110..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-apps/assets/databricks.svg +++ /dev/null @@ -1,3 +0,0 @@ - - - \ No newline at end of file diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/appkit-sdk.md b/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/appkit-sdk.md deleted file mode 100644 index 675d4fc..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/appkit-sdk.md +++ /dev/null @@ -1,112 +0,0 @@ -# Databricks App Kit SDK - -## TypeScript Import Rules - -This template uses strict TypeScript settings with `verbatimModuleSyntax: true`. **Always use `import type` for type-only imports**. - -Template enforces `noUnusedLocals` - remove unused imports immediately or build fails. - -```typescript -// ✅ CORRECT - use import type for types -import type { MyInterface, MyType } from './types'; - -// ❌ WRONG - will fail compilation -import { MyInterface, MyType } from './types'; -``` - -## Server Setup - -For server configuration, see: `npx @databricks/appkit docs ./docs/plugins.md` - -## useAnalyticsQuery Hook - -**ONLY use when displaying data in a custom way that isn't a chart or table.** For charts/tables, pass `queryKey` directly to the component — don't double-fetch. 
Charts also accept a `format` option (`"json"` | `"arrow"` | `"auto"`, default `"auto"`) to control the data transfer format. - -Use cases: - -- Custom HTML layouts (cards, lists, grids) -- Summary statistics and KPIs -- Conditional rendering based on data values -- Data that needs transformation before display - -### ⚠️ Memoize Parameters to Prevent Infinite Loops - -```typescript -// ❌ WRONG - creates new object every render → infinite refetch loop -const { data } = useAnalyticsQuery('query', { id: sql.string(selectedId) }); - -// ✅ CORRECT - memoize parameters -const params = useMemo(() => ({ id: sql.string(selectedId) }), [selectedId]); -const { data } = useAnalyticsQuery('query', params); -``` - -### Conditional Queries - -```typescript -// ❌ WRONG - `enabled` is NOT a valid option (this is a React Query pattern) -const { data } = useAnalyticsQuery('query', params, { enabled: !!selectedId }); - -// ✅ CORRECT - use autoStart: false -const { data } = useAnalyticsQuery('query', params, { autoStart: false }); - -// ✅ ALSO CORRECT - conditional rendering (component only mounts when data exists) -{selectedId && } -``` - -### Type Inference - -When `appKitTypes.d.ts` has been generated (via `npm run typegen`), types are inferred automatically: - -```typescript -// ✅ After typegen - types are automatic, no generic needed -const { data } = useAnalyticsQuery('my_query', params); - -// ⚠️ Before typegen - data is `unknown`, you must provide type manually -const { data } = useAnalyticsQuery('my_query', params); -``` - -**Common mistake** — don't define interfaces that duplicate generated types: - -```typescript -// ❌ WRONG - manual interface may conflict with generated QueryRegistry -interface MyData { - id: string; - value: number; -} -const { data } = useAnalyticsQuery('my_query', params); - -// ✅ CORRECT - run `npm run typegen` and let it provide types -const { data } = useAnalyticsQuery('my_query', params); -``` - -### Basic Usage - -```typescript -import { 
useAnalyticsQuery, Skeleton } from '@databricks/appkit-ui/react'; -import { sql } from '@databricks/appkit-ui/js'; -import { useMemo } from 'react'; - -function CustomDisplay() { - const params = useMemo(() => ({ - start_date: sql.date('2024-01-01'), - category: sql.string("tools") - }), []); - - const { data, loading, error } = useAnalyticsQuery('query_name', params); - - if (loading) return <Skeleton />; - if (error) return <div>Error: {error}</div>; - if (!data) return null; - - return ( - <div> - {data.map(row => ( - <div key={row.column_name}> - <h3>{row.column_name}</h3> - <p>{Number(row.value).toFixed(2)}</p> - </div> - ))} - </div>
- ); -} -``` diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/frontend.md b/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/frontend.md deleted file mode 100644 index 3359a65..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/frontend.md +++ /dev/null @@ -1,174 +0,0 @@ -# Frontend Guidelines - -**For full component API**: run `npx @databricks/appkit docs` and navigate to the component you need. - -## Common Anti-Patterns - -These mistakes appear frequently — check the official docs for actual prop names: - -| Mistake | Why it's wrong | What to do | -| ---------------------------------------------------------- | --------------------------------------------------------------------------------- | ------------------------------------------------------------- | -| `xAxisKey`, `dataKey` on charts | Recharts naming, not AppKit | Use `xKey`, `yKey` (auto-detected from schema if omitted) | -| `yAxisKeys`, `yKeys` on charts | Recharts naming | Use `yKey` (string or string[]) | -| `config` on charts | Not a valid prop name | Use `options` for ECharts overrides | -| ``, `` children | AppKit charts are ECharts-based, NOT Recharts wrappers — configure via props only | | -| `columns` on DataTable | DataTable auto-generates columns from data | Use `queryKey` + `parameters`; use `transform` for formatting | -| Double-fetching with `useAnalyticsQuery` + chart component | Components handle their own fetching | Just pass `queryKey` to the component | - -**Always verify props against docs before using a component.** - -## Chart Data Modes - -All chart/data components support two modes: - -- **Query mode**: pass `queryKey` + `parameters` — component fetches data automatically. `parameters` is REQUIRED even if empty (`parameters={{}}`). 
-- **Data mode**: pass static data via `data` prop (JSON array or Arrow Table) — no `queryKey`/`parameters` needed. - -```tsx -// Query mode (recommended for Databricks SQL) - - -// Data mode (static/pre-fetched data) - -``` - -## Chart Props Quick Reference - -All charts accept these core props (verify full list via `npx @databricks/appkit docs`): - -```tsx - d} // transform raw data before rendering - colors={['#40d1f5']} // custom colors (overrides colorPalette) - colorPalette="categorical" // "categorical" | "sequential" | "diverging" - title="Sales by Region" // chart title - showLegend // show legend - options={{}} // additional ECharts options to merge - height={400} // default: 300 - orientation="vertical" // "vertical" | "horizontal" (BarChart/LineChart/AreaChart) - stacked // stack bars/areas (BarChart/AreaChart) -/> - - -``` - -Charts are **ECharts-based** — configure via props, not Recharts-style children. Components handle data fetching, loading, and error states internally. - -> ⚠️ **`parameters` is REQUIRED on all data components**, even when the query has no params. Always include `parameters={{}}`. - -```typescript -// ❌ Don't double-fetch -const { data } = useAnalyticsQuery('sales_data', {}); -return ; // fetches again! -``` - -## DataTable - -DataTable auto-generates columns from data and handles fetching, loading, error, and empty states. - -**For full props**: `npx @databricks/appkit docs "DataTable"`. - -```tsx -// ❌ WRONG - missing required `parameters` prop - - -// ✅ CORRECT - minimal - - -// ✅ CORRECT - with filtering and pagination - - -// ✅ CORRECT - with row selection - console.log(selection)} -/> -``` - -**Custom column formatting** — use the `transform` prop or format in SQL: - -```typescript - data.map(row => ({ - ...row, - price: `$${Number(row.price).toFixed(2)}`, - }))} -/> -``` - -## Available Components (Quick Reference) - -**For full prop details**: `npx @databricks/appkit docs "appkit-ui API reference"`. 
- -All data components support both query mode (`queryKey` + `parameters`) and data mode (static `data` prop). Common props across all charts: `format`, `transformer`, `colors`, `colorPalette`, `title`, `showLegend`, `height`, `options`, `ariaLabel`, `testId`. - -### Data Components (`@databricks/appkit-ui/react`) - -| Component | Extra Props | Use For | -| -------------- | ---------------------------------------------------------------------------------------------- | ----------------------------- | -| `BarChart` | `xKey`, `yKey`, `orientation`, `stacked` | Categorical comparisons | -| `LineChart` | `xKey`, `yKey`, `smooth`, `showSymbol`, `orientation` | Time series, trends | -| `AreaChart` | `xKey`, `yKey`, `smooth`, `showSymbol`, `stacked`, `orientation` | Cumulative/stacked trends | -| `PieChart` | `xKey`, `yKey`, `innerRadius`, `showLabels`, `labelPosition` | Part-of-whole | -| `DonutChart` | `xKey`, `yKey`, `innerRadius`, `showLabels`, `labelPosition` | Donut (pie with inner radius) | -| `ScatterChart` | `xKey`, `yKey`, `symbolSize` | Correlation, distribution | -| `HeatmapChart` | `xKey`, `yKey`, `yAxisKey`, `min`, `max`, `showLabels` | Matrix-style data | -| `RadarChart` | `xKey`, `yKey`, `showArea` | Multi-dimensional comparison | -| `DataTable` | `filterColumn`, `filterPlaceholder`, `transform`, `pageSize`, `enableRowSelection`, `children` | Tabular data display | - -### UI Components (`@databricks/appkit-ui/react`) - -| Component | Common Props | -| -------------------------------------------------------- | ----------------------------------------------------------------- | -| `Card`, `CardHeader`, `CardTitle`, `CardContent` | Standard container | -| `Badge` | `variant`: "default" \| "secondary" \| "destructive" \| "outline" | -| `Button` | `variant`, `size`, `onClick` | -| `Input` | `placeholder`, `value`, `onChange` | -| `Select`, `SelectTrigger`, `SelectContent`, `SelectItem` | Dropdown; `SelectItem` value cannot be "" | -| `Skeleton` | `className` — 
use for loading states | -| `Separator` | Visual divider | -| `Tabs`, `TabsList`, `TabsTrigger`, `TabsContent` | Tabbed interface | - -All data components **require `parameters={{}}`** even when the query has no params. - -## Layout Structure - -```tsx -
<div> - <h1>Page Title</h1> - <form>{/* form inputs */}</form> - <div>{/* list items */}</div> - </div>
-``` - -## Component Organization - -- Shared UI components: `@databricks/appkit-ui/react` -- Feature components: `client/src/components/FeatureName.tsx` -- Split components when logic exceeds ~100 lines or component is reused - -## Gotchas - -- `SelectItem` cannot have `value=""`. Use sentinel value like `"all"` for "show all" options. -- Use `` components instead of plain "Loading..." text -- Handle nullable fields: `value={field || ''}` for inputs -- For maps with React 19, use react-leaflet v5: `npm install react-leaflet@^5.0.0 leaflet @types/leaflet` - -Databricks brand colors: `['#40d1f5', '#4462c9', '#EB1600', '#0B2026', '#4A4A4A', '#353a4a']` diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/lakebase.md b/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/lakebase.md deleted file mode 100644 index ee56312..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/lakebase.md +++ /dev/null @@ -1,226 +0,0 @@ -# Lakebase: OLTP Database for Apps - -Use Lakebase when your app needs **persistent read/write storage** — forms, CRUD operations, user-generated data. For analytics dashboards reading from a SQL warehouse, use `config/queries/` instead. - -## When to Use Lakebase vs Analytics - -| Pattern | Use Case | Data Source | -| --------- | ------------------------------------------- | --------------------------------- | -| Analytics | Read-only dashboards, charts, KPIs | Databricks SQL Warehouse | -| Lakebase | CRUD operations, persistent state, forms | PostgreSQL (Lakebase Autoscaling) | -| Both | Dashboard with user preferences/saved state | Warehouse + Lakebase | - -## Scaffolding - -**ALWAYS scaffold with the correct feature flags** — do not add Lakebase manually to an analytics-only scaffold. 
- -**Lakebase only** (no analytics SQL warehouse): - -```bash -databricks apps init --name --features lakebase \ - --set "lakebase.postgres.branch=" \ - --set "lakebase.postgres.database=" \ - --run none --profile -``` - -**Both Lakebase and analytics**: - -```bash -databricks apps init --name --features analytics,lakebase \ - --set "analytics.sql-warehouse.id=" \ - --set "lakebase.postgres.branch=" \ - --set "lakebase.postgres.database=" \ - --run none --profile -``` - -Where `` and `` are full resource names (e.g. `projects//branches/` and `projects//branches//databases/`). - -Use the `databricks-lakebase` skill to create a Lakebase project and discover branch/database resource names before running this command. - -**Get resource names** (if you have an existing project): - -```bash -# List branches → use the name field of a READY branch -databricks postgres list-branches projects/ --profile -# List databases → use the name field -databricks postgres list-databases projects//branches/ --profile -``` - -## Project Structure (after `databricks apps init --features lakebase`) - -``` -my-app/ -├── server/ -│ └── server.ts # Backend with Lakebase pool + tRPC routes -├── client/ -│ └── src/ -│ └── App.tsx # React frontend -├── app.yaml # Manifest with database resource declaration -└── package.json # Includes @databricks/lakebase dependency -``` - -Note: **No `config/queries/` directory** — Lakebase apps use server-side `pool.query()` calls, not SQL files. 
- -## `createLakebasePool` API - -```typescript -import { createLakebasePool } from '@databricks/lakebase'; -// or: import { createLakebasePool } from "@databricks/appkit"; - -const pool = createLakebasePool({ - // All fields optional — auto-populated from env vars when deployed - host: process.env.PGHOST, // Lakebase hostname - database: process.env.PGDATABASE, // Database name - endpoint: process.env.LAKEBASE_ENDPOINT, // Endpoint resource path - user: process.env.PGUSER, // Service principal client ID - max: 10, // Connection pool size - idleTimeoutMillis: 30000, - connectionTimeoutMillis: 10000, -}); -``` - -Call `createLakebasePool()` **once at module level** (server startup), not inside request handlers. - -## Environment Variables (auto-set when deployed with database resource) - -| Variable | Description | -| ------------------- | --------------------------- | -| `PGHOST` | Lakebase hostname | -| `PGPORT` | Port (default 5432) | -| `PGDATABASE` | Database name | -| `PGUSER` | Service principal client ID | -| `PGSSLMODE` | SSL mode (`require`) | -| `LAKEBASE_ENDPOINT` | Endpoint resource path | - -## tRPC CRUD Pattern - -Always use tRPC for Lakebase operations — do NOT call `pool.query()` from the client. 
- -```typescript -// server/server.ts -import { initTRPC } from '@trpc/server'; -import { createLakebasePool } from '@databricks/lakebase'; -import { z } from 'zod'; -import superjson from 'superjson'; // requires: npm install superjson - -const pool = createLakebasePool(); // reads env vars automatically - -const t = initTRPC.create({ transformer: superjson }); -const publicProcedure = t.procedure; - -export const appRouter = t.router({ - listItems: publicProcedure.query(async () => { - const { rows } = await pool.query('SELECT * FROM app_data.items ORDER BY created_at DESC LIMIT 100'); - return rows; - }), - - createItem: publicProcedure.input(z.object({ name: z.string().min(1) })).mutation(async ({ input }) => { - const { rows } = await pool.query('INSERT INTO app_data.items (name) VALUES ($1) RETURNING *', [input.name]); - return rows[0]; - }), - - deleteItem: publicProcedure.input(z.object({ id: z.number() })).mutation(async ({ input }) => { - await pool.query('DELETE FROM app_data.items WHERE id = $1', [input.id]); - return { success: true }; - }), -}); -``` - -> **Deploy first (App + Lakebase only)!** When your Databricks App uses Lakebase, the Service Principal must create and own the schema. Run `databricks apps deploy` before any local development. See **`databricks-lakebase`** skill's **Schema Permissions for Deployed Apps** for details. - -## Schema Initialization - -**Always create a custom schema** — the Service Principal cannot access any existing schemas (including `public`). It must create the schema itself to become its owner. See **`databricks-lakebase`** skill's **Schema Permissions for Deployed Apps** for the full permission model and deploy-first workflow. 
Initialize tables on server startup: - -```typescript -// server/server.ts — run once at startup before handling requests -await pool.query(` - CREATE SCHEMA IF NOT EXISTS app_data; - CREATE TABLE IF NOT EXISTS app_data.items ( - id SERIAL PRIMARY KEY, - name TEXT NOT NULL, - created_at TIMESTAMPTZ DEFAULT NOW() - ); -`); -``` - -## ORM Integration (Optional) - -The pool returned by `createLakebasePool()` is a standard `pg.Pool` — works with any PostgreSQL library: - -```typescript -// Drizzle ORM -import { drizzle } from 'drizzle-orm/node-postgres'; -const db = drizzle(pool); - -// Prisma (with @prisma/adapter-pg) -import { PrismaPg } from '@prisma/adapter-pg'; -const adapter = new PrismaPg(pool); -const prisma = new PrismaClient({ adapter }); -``` - -## Key Differences from Analytics Pattern - -| | Analytics | Lakebase | -| -------------- | ------------------------------------------- | ------------------------------------------ | -| SQL dialect | Databricks SQL (Spark SQL) | Standard PostgreSQL | -| Query location | `config/queries/*.sql` files | `pool.query()` in tRPC routes | -| Data retrieval | `useAnalyticsQuery` hook | tRPC query procedure | -| Date functions | `CURRENT_TIMESTAMP()`, `DATEDIFF(DAY, ...)` | `NOW()`, `AGE(...)` | -| Auto-increment | N/A | `SERIAL` or `GENERATED ALWAYS AS IDENTITY` | -| Insert pattern | N/A | `INSERT ... VALUES ($1) RETURNING *` | -| Params | Named (`:param`) | Positional (`$1, $2, ...`) | - -**NEVER use `useAnalyticsQuery` for Lakebase data** — it queries the SQL warehouse, not Lakebase. -**NEVER put Lakebase SQL in `config/queries/`** — those files are only for warehouse queries. - -## Local Development - -### Prerequisites (MUST verify before local development) - -**This applies when your Databricks App uses Lakebase.** Run this check before any local development: - -```bash -databricks apps get --profile -``` - -Check the response for the `active_deployment` field. 
If it exists with `status.state` of `SUCCEEDED`, the app has been deployed. If `active_deployment` is missing, the app has never been deployed: - -1. **STOP** — do not proceed with local development -2. Deploy first: `databricks apps deploy --profile ` -3. Wait for deployment to complete, then continue - -If you skip this step, the Service Principal won't own the database schema. You'll create schemas under your credentials that the SP **cannot access** after deployment. See **`databricks-lakebase`** skill's **Schema Permissions for Deployed Apps** for the full workflow and recovery steps. - -The Lakebase env vars (`PGHOST`, `PGDATABASE`, etc.) are auto-set only when deployed. For local development, get the connection details from your endpoint and set them manually: - -```bash -# Get endpoint connection details -databricks postgres get-endpoint \ - projects//branches//endpoints/ \ - --profile -``` - -Then create `server/.env` with the values from the endpoint response: - -``` -PGHOST= -PGPORT=5432 -PGDATABASE= -PGUSER= -PGSSLMODE=require -LAKEBASE_ENDPOINT=projects//branches//endpoints/ -``` - -Load `server/.env` in your dev server (e.g. via `dotenv` or `node --env-file=server/.env`). Never commit `.env` files — add `server/.env` to `.gitignore`. - -## Troubleshooting - -| Error | Cause | Solution | -| -------------------------------------------------- | -------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `permission denied for schema public` | SP cannot access `public` schema | Create custom schema: `CREATE SCHEMA IF NOT EXISTS app_data` and qualify all table names with `app_data.` | -| `permission denied for schema ` | Schema was created by another role (e.g. you ran locally before deploying) | **Ask the user before dropping** — `DROP SCHEMA` deletes all data. 
See **`databricks-lakebase`** skill's **Schema Permissions for Deployed Apps** for options | -| Works locally but `permission denied` after deploy | Local credentials created the schema; the SP can't access schemas it doesn't own | **Ask the user before dropping** — warn about data loss, then deploy first. See **`databricks-lakebase`** skill's **Schema Permissions for Deployed Apps** for options | -| `connection refused` | Pool not connected or wrong env vars | Check `PGHOST`, `PGPORT`, `LAKEBASE_ENDPOINT` are set | -| `relation "X" does not exist` | Tables not initialized | Run `CREATE TABLE IF NOT EXISTS` at startup | -| App builds but pool fails at runtime | Env vars not set locally | Set vars in `server/.env` — see Local Development above | diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/model-serving.md b/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/model-serving.md deleted file mode 100644 index e07dac7..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/model-serving.md +++ /dev/null @@ -1,102 +0,0 @@ -# Model Serving: Calling ML Endpoints from Apps - -Use Model Serving when your app needs **AI features** — chat, inference, embeddings, or predictions from a Databricks Model Serving endpoint. For analytics dashboards, use `config/queries/` instead. For persistent storage, use Lakebase. 
- -## When to Use - -| Pattern | Use Case | Data Source | -| ------------- | ---------------------------------------------- | --------------------- | -| Analytics | Read-only dashboards, charts, KPIs | SQL Warehouse | -| Lakebase | CRUD operations, persistent state, forms | PostgreSQL (Lakebase) | -| Model Serving | Chat, AI features, model inference | Serving Endpoint | -| Multiple | Dashboard with AI features or persistent state | Combine as needed | - -## Scaffolding - -Check if the `serving` plugin is available in the AppKit template: - -```bash -databricks apps manifest --profile -``` - -**If the manifest includes a `serving` plugin:** - -```bash -databricks apps init --name --features serving \ - --set "serving.serving-endpoint.name=" \ - --run none --profile -``` - -**If no `serving` plugin** (add manually to an existing app): - -Use the `databricks-model-serving` skill to create a serving endpoint first, then follow the resource declaration and tRPC patterns below. - -## Resource Declaration - -Add the serving endpoint resource to `databricks.yml`: - -```yaml -resources: - apps: - my_app: - resources: - - name: my-model-endpoint - serving_endpoint: - name: - permission: CAN_QUERY # auto-granted to SP on deploy -``` - -Add environment variable injection in `app.yaml`: - -```yaml -env: - - name: SERVING_ENDPOINT - valueFrom: serving-endpoint -``` - -The injected value is the endpoint **name** (not a URL). Use it in server-side code to call the endpoint. - -## tRPC Pattern - -Always use tRPC for model serving calls — do NOT call endpoints directly from the client. 
- -```typescript -// server/server.ts (or server/trpc.ts) -import { initTRPC } from '@trpc/server'; -import { getExecutionContext } from '@databricks/appkit'; -import { z } from 'zod'; -import superjson from 'superjson'; - -const t = initTRPC.create({ transformer: superjson }); -const publicProcedure = t.procedure; - -export const appRouter = t.router({ - queryModel: publicProcedure.input(z.object({ prompt: z.string() })).query(async ({ input: { prompt } }) => { - const { serviceDatabricksClient: client } = getExecutionContext(); - const response = await client.servingEndpoints.query({ - name: process.env.SERVING_ENDPOINT, - messages: [{ role: 'user', content: prompt }], - }); - return response; - }), -}); -``` - -## Client-side Pattern - -```typescript -// client/src/components/ChatComponent.tsx -import { trpc } from '@/lib/trpc'; - -const result = await trpc.queryModel.query({ prompt: userInput }); -const answer = result.choices?.[0]?.message?.content; -``` - -## Troubleshooting - -| Error | Cause | Solution | -| -------------------------------- | ------------------------------------ | ------------------------------------------------------------------------------------ | -| `PERMISSION_DENIED` on query | SP missing CAN_QUERY | Declare `serving_endpoint` resource in `databricks.yml` with `permission: CAN_QUERY` | -| `SERVING_ENDPOINT` env var empty | Missing env injection | Add `valueFrom: serving-endpoint` to `app.yaml` env section | -| 504 Gateway Timeout | Inference exceeds 120s proxy limit | Reduce `max_tokens` or use WebSockets — see [Platform Guide](../platform-guide.md) | -| `getExecutionContext` undefined | Called outside AppKit server context | Ensure call is inside a tRPC procedure on the server side | diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/overview.md b/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/overview.md deleted file mode 100644 index 
bfe470d..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/overview.md +++ /dev/null @@ -1,146 +0,0 @@ -# AppKit Overview - -AppKit is the recommended way to build Databricks Apps - provides type-safe SQL queries, React components, and seamless deployment. - -## Choose Your Data Pattern FIRST - -Before scaffolding, decide which data pattern the app needs: - -| Pattern | When to use | Init command | -| -------------------------------- | --------------------------------------- | ------------------------------------------------------------------------------------------------------ | -| **Analytics** (read-only) | Dashboards, charts, KPIs from warehouse | `--features analytics --set analytics.sql-warehouse.id=` | -| **Lakebase (OLTP)** (read/write) | CRUD forms, persistent state, user data | `--features lakebase --set lakebase.postgres.branch= --set lakebase.postgres.database=` | -| **Both** | Dashboard + user data or preferences | `--features analytics,lakebase` with all required `--set` flags | -| **Model Serving** (ML inference) | Chat, AI features, model predictions | Add `serving_endpoint` resource to `databricks.yml` (or `--features serving` if available in manifest) | - -See [Lakebase Guide](lakebase.md) for full Lakebase scaffolding and app-code patterns. - -## Workflow - -1. **Scaffold**: Run `databricks apps manifest`, then `databricks apps init` with `--features` and `--set` as in parent SKILL.md (App Manifest and Scaffolding) -2. **Develop**: `cd && npm install && npm run dev` -3. **Validate**: `databricks apps validate` -4. **Deploy**: `databricks apps deploy --profile ` (⚠️ USER CONSENT REQUIRED) - -## Data Discovery (Before Writing SQL) - -**Use the parent `databricks-core` skill for data discovery** (table search, schema exploration, query execution). - -## Pre-Implementation Checklist - -Before writing App.tsx, complete these steps: - -1. ✅ Create SQL files in `config/queries/` -2. 
✅ Run `npm run typegen` to generate query types -3. ✅ Read `client/src/appKitTypes.d.ts` to see available query result types -4. ✅ Verify component props via `npx @databricks/appkit docs` (check the relevant component page) -5. ✅ Plan smoke test updates (default expects "Minimal Databricks App") - -**DO NOT** write UI code until types are generated and verified. - -## Post-Implementation Checklist - -Before running `databricks apps validate`: - -1. ✅ Update `tests/smoke.spec.ts` heading selector to match your app title -2. ✅ Update or remove the 'hello world' text assertion -3. ✅ Verify `npm run typegen` has been run after all SQL files are finalized -4. ✅ Ensure all numeric SQL values use `Number()` conversion in display code - -## Project Structure - -``` -my-app/ -├── server/ -│ ├── server.ts # Backend entry point (AppKit) -│ └── .env # Optional local dev env vars (do not commit) -├── client/ -│ ├── index.html -│ ├── vite.config.ts -│ └── src/ -│ ├── main.tsx -│ └── App.tsx # <- Main app component (start here) -├── config/ -│ └── queries/ -│ └── my_query.sql # -> queryKey: "my_query" -├── app.yaml # Deployment config -├── package.json -└── tsconfig.json -``` - -**Key files to modify:** -| Task | File | -|------|------| -| Build UI | `client/src/App.tsx` | -| Add SQL query | `config/queries/.sql` | -| Add API endpoint | `server/server.ts` (tRPC) | -| Add shared helpers (optional) | create `shared/types.ts` or `client/src/lib/formatters.ts` | -| Fix smoke test | `tests/smoke.spec.ts` | - -## Type Safety - -For type generation details, see: `npx @databricks/appkit docs ./docs/development/type-generation.md` - -**Quick workflow:** - -1. Add/modify SQL in `config/queries/` -2. Types auto-generate during dev via the Vite plugin (or run `npm run typegen` manually) -3. 
Types appear in `client/src/appKitTypes.d.ts` - -## Adding Visualizations - -**Step 1**: Create SQL file `config/queries/my_data.sql` - -```sql -SELECT category, COUNT(*) as count FROM my_table GROUP BY category -``` - -**Step 2**: Use component (types auto-generated!) - -```typescript -import { BarChart } from '@databricks/appkit-ui/react'; -// Query mode: fetches data automatically - - -// Data mode: pass static data directly (no queryKey/parameters needed) - -``` - -## AppKit Official Documentation - -**Always use AppKit docs as the source of truth for API details.** - -```bash -npx @databricks/appkit docs # show the docs index (start here) -npx @databricks/appkit docs # look up a section by name or doc path -``` - -Do not guess paths — run without args first, then pick from the index. - -## References - -| When you're about to... | Read | -| ---------------------------------------- | ---------------------------------------------------------------------------- | -| Write SQL files | [SQL Queries](sql-queries.md) — parameterization, dialect, sql.\* helpers | -| Use `useAnalyticsQuery` | [AppKit SDK](appkit-sdk.md) — memoization, conditional queries | -| Add chart/table components | [Frontend](frontend.md) — component quick reference, anti-patterns | -| Add API mutation endpoints | [tRPC](trpc.md) — only if you need server-side logic | -| Use Lakebase for CRUD / persistent state | [Lakebase](lakebase.md) — createLakebasePool, tRPC patterns, schema init | -| Call ML model serving endpoints | [Model Serving](model-serving.md) — resource declaration, tRPC query pattern | - -## Critical Rules - -1. **SQL for data retrieval**: Use `config/queries/` + visualization components. Never tRPC for SELECT. -2. **Numeric types**: SQL numbers may return as strings. Always convert: `Number(row.amount)` -3. **Type imports**: Use `import type { ... }` (verbatimModuleSyntax enabled). -4. **Charts are ECharts**: No Recharts children — use props (`xKey`, `yKey`, `colors`). 
`xKey`/`yKey` auto-detect from schema if omitted. -5. **Two data modes**: Charts/tables support query mode (`queryKey` + `parameters`) and data mode (static `data` prop). -6. **Conditional queries**: Use `autoStart: false` option or conditional rendering to control query execution. - -## Decision Tree - -- **Display data from SQL?** - - Chart/Table → `BarChart`, `LineChart`, `DataTable` components - - Custom layout (KPIs, cards) → `useAnalyticsQuery` hook -- **Call Databricks API?** → tRPC (serving endpoints, MLflow, Jobs) -- **Modify data?** → tRPC mutations diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/proto-contracts.md b/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/proto-contracts.md deleted file mode 100644 index 1ae4520..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/proto-contracts.md +++ /dev/null @@ -1,203 +0,0 @@ -# Plugin Contract Reference - -Concrete proto↔plugin mappings for the three core AppKit plugins. - -## Files Plugin Contract - -**Plugin manifest**: `files/manifest.json` -**Resource**: UC Volume with `WRITE_VOLUME` permission -**Env**: `DATABRICKS_VOLUME_FILES` for volume path - -### Boundary: What the files plugin owns - -The files plugin is the ONLY module that touches UC Volumes. Other modules -interact with files through typed proto messages, never raw paths. 
- -``` -┌─────────────┐ UploadRequest ┌──────────────┐ -│ api module │ ──────────────────→ │ files plugin │ -│ │ ←────────────────── │ │ -│ │ StoredArtifact │ UC Volumes │ -└─────────────┘ └──────────────┘ -``` - -### Proto → Plugin Method Mapping - -| Proto Message | Plugin Method | Direction | -| ---------------- | --------------------------------------- | --------- | -| `UploadRequest` | `files.upload(path, content, opts)` | IN | -| `StoredArtifact` | Return type of upload/getInfo | OUT | -| `VolumeLayout` | `files.config.volumePath` + conventions | CONFIG | - -### Volume Path Convention (from VolumeLayout proto) - -``` -/Volumes/{catalog}/{schema}/{volume}/ -├── uploads/ # User uploads (UploadRequest.destination_path) -├── results/ # Computed outputs (StoredArtifact) -│ └── {run_id}/ -│ ├── output.proto.bin # Binary proto serialization -│ └── output.json # JSON for debugging -└── artifacts/ # Build artifacts, archives - └── {app_name}/ - └── {version}/ -``` - -### Config ↔ Proto Mapping - -| manifest.json field | Proto field | Notes | -| ---------------------------- | -------------------------------- | ---------------------- | -| `config.timeout` (30000) | Not in proto | Plugin-internal config | -| `config.maxUploadSize` (5GB) | `UploadRequest.content` max size | Validation constraint | -| `resources.path` env | `VolumeLayout.root` | Runtime injection | - ---- - -## Lakebase Plugin Contract - -**Plugin manifest**: `lakebase/manifest.json` -**Resource**: Postgres with `CAN_CONNECT_AND_CREATE` permission -**Env**: `PGHOST`, `PGDATABASE`, `PGPORT`, `PGSSLMODE`, `LAKEBASE_ENDPOINT` - -### Boundary: What the lakebase plugin owns - -Lakebase owns ALL structured data. Every table's schema is derived from a proto -message in `database.proto`. No ad-hoc `CREATE TABLE` statements. 
- -``` -┌─────────────┐ RunRecord ┌──────────────┐ -│ compute mod │ ──────────────────→ │ lakebase │ -│ │ │ plugin │ -│ │ MetricRecord │ │ -│ │ ──────────────────→ │ Postgres │ -└─────────────┘ └──────┬───────┘ - │ -┌─────────────┐ SQL query │ -│ analytics │ ←──────────────────────────┘ -│ module │ RunRecord[] -└─────────────┘ -``` - -### Proto → Table Mapping - -| Proto Message | Table Name | Primary Key | Notes | -| -------------- | ---------- | -------------------- | ----------------- | -| `RunRecord` | `runs` | `(run_id, app_name)` | One row per run | -| `MetricRecord` | `metrics` | auto-increment | FK to runs.run_id | -| `ConfigRecord` | `configs` | `config_id` | Versioned configs | - -### Proto → DDL Type Mapping - -| Proto Type | SQL Type | Column Default | -| -------------- | ------------------ | ---------------- | -| `string` | `TEXT` | `''` | -| `bool` | `BOOLEAN` | `false` | -| `int32` | `INTEGER` | `0` | -| `int64` | `BIGINT` | `0` | -| `double` | `DOUBLE PRECISION` | `0.0` | -| `bytes` | `BYTEA` | `NULL` | -| `Timestamp` | `TIMESTAMPTZ` | `NOW()` | -| `repeated T` | `JSONB` | `'[]'::jsonb` | -| `map` | `JSONB` | `'{}'::jsonb` | -| nested message | `JSONB` | `NULL` | -| `enum` | `TEXT` | First value name | - -### Migration Convention - -``` -migrations/ -├── 001_create_runs.sql -├── 002_create_metrics.sql -├── 003_create_configs.sql -└── 004_add_metrics_index.sql -``` - -Each migration is idempotent (`CREATE TABLE IF NOT EXISTS`, `CREATE INDEX IF NOT EXISTS`). 
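A minimal sketch of how the proto → DDL type mapping and the idempotent-migration convention above could be combined. The names `FieldSpec`, `SQL_TYPES`, and `createTableSql` are hypothetical and are not part of AppKit or the Lakebase plugin; column defaults from the mapping table are left out to keep the example short.

```typescript
// Hypothetical helper: not an AppKit or Lakebase plugin API.
type ProtoType =
  | 'string' | 'bool' | 'int32' | 'int64' | 'double' | 'bytes'
  | 'Timestamp' | 'repeated' | 'map' | 'message' | 'enum';

interface FieldSpec {
  name: string; // snake_case proto field name, reused as the column name
  type: ProtoType;
  notNull?: boolean;
}

// Proto type -> SQL type, copied from the mapping table above.
const SQL_TYPES: Record<ProtoType, string> = {
  string: 'TEXT',
  bool: 'BOOLEAN',
  int32: 'INTEGER',
  int64: 'BIGINT',
  double: 'DOUBLE PRECISION',
  bytes: 'BYTEA',
  Timestamp: 'TIMESTAMPTZ',
  repeated: 'JSONB',
  map: 'JSONB',
  message: 'JSONB',
  enum: 'TEXT',
};

// Builds an idempotent CREATE TABLE statement (safe to run more than once).
function createTableSql(table: string, fields: FieldSpec[], primaryKey: string[]): string {
  const columns = fields.map(
    (f) => `  ${f.name} ${SQL_TYPES[f.type]}${f.notNull ? ' NOT NULL' : ''}`
  );
  columns.push(`  PRIMARY KEY (${primaryKey.join(', ')})`);
  return `CREATE TABLE IF NOT EXISTS ${table} (\n${columns.join(',\n')}\n);`;
}

const runsDdl = createTableSql(
  'runs',
  [
    { name: 'run_id', type: 'string', notNull: true },
    { name: 'app_name', type: 'string', notNull: true },
    { name: 'status', type: 'enum', notNull: true },
    { name: 'started_at', type: 'Timestamp' },
  ],
  ['run_id', 'app_name']
);
console.log(runsDdl);
```

Because the statement uses `CREATE TABLE IF NOT EXISTS`, writing the result to a numbered file such as `001_create_runs.sql` keeps the migration re-runnable.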
- -### Config ↔ Proto Mapping - -| manifest.json field | Proto usage | Notes | -| --------------------------------------- | ------------------ | --------------------- | -| `resources.branch` | Not in proto | Infrastructure config | -| `resources.database` | Not in proto | Infrastructure config | -| `resources.host` (`PGHOST`) | Connection string | Runtime injection | -| `resources.databaseName` (`PGDATABASE`) | Database selection | Runtime injection | - ---- - -## Jobs / Compute Contract - -**No plugin manifest** — Jobs are invoked via `@databricks/sdk-experimental` -**Resource**: Databricks Jobs API -**Auth**: Workspace token or OAuth - -### Boundary: What the jobs module owns - -The jobs module owns compute execution. It receives typed task inputs, runs them -on Databricks clusters, and produces typed task outputs. - -``` -┌─────────────┐ JobConfig ┌──────────────┐ -│ api module │ ──────────────────→ │ jobs module │ -│ │ │ │ -│ │ JobTaskInput │ Databricks │ -│ │ ──────────────────→ │ Jobs API │ -│ │ │ │ -│ │ JobTaskOutput │ Clusters │ -│ │ ←────────────────── │ │ -└─────────────┘ └──────────────┘ -``` - -### Proto → Jobs SDK Mapping - -| Proto Message | SDK Method | Direction | -| --------------- | ------------------------------- | ---------------------- | -| `JobConfig` | `jobs.create(config)` | IN — defines the job | -| `TaskConfig` | Task within a job | IN — defines task deps | -| `JobTaskInput` | Task params (base64 proto) | IN — task receives | -| `JobTaskOutput` | Task output (written to Volume) | OUT — task produces | - -### Task Parameter Convention - -Job tasks receive their typed input via: - -1. **Small payloads (<256KB)**: Base64-encoded proto in task params -2. 
**Large payloads**: Proto binary written to UC Volume, path passed as param - -```typescript -// Producer (api module) -const input: JobTaskInput = { taskId, taskType, runId, inputPayload }; -const encoded = Buffer.from(JobTaskInput.encode(input).finish()).toString('base64'); -// Pass as notebook parameter: { "input": encoded } - -// Consumer (job task code) -const decoded = JobTaskInput.decode(Buffer.from(params.input, 'base64')); -``` - -### Task Output Convention - -Job tasks write their typed output to: - -``` -/Volumes/{catalog}/{schema}/{volume}/results/{run_id}/{task_id}.output.bin -``` - -The output is a serialized `JobTaskOutput` proto. The orchestrator reads it -back with the generated decoder. - -### Jobs API Patterns - -```typescript -// Create a multi-task job from JobConfig proto -const jobConfig: JobConfig = { - jobName: `${appName}-${runId}`, - clusterSpec: '{"num_workers": 1}', - maxRetries: 2, - timeoutSeconds: 3600, - tasks: [ - { taskKey: 'generate', taskType: 'generate', dependsOn: [] }, - { taskKey: 'evaluate', taskType: 'evaluate', dependsOn: ['generate'] }, - { taskKey: 'aggregate', taskType: 'aggregate', dependsOn: ['evaluate'] }, - ], -}; -``` diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/proto-first.md b/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/proto-first.md deleted file mode 100644 index 94d7a28..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/proto-first.md +++ /dev/null @@ -1,306 +0,0 @@ -# Proto-First App Design - -Schema-first approach for AppKit apps using protobuf data contracts. Define contracts BEFORE implementation — derive TypeScript types, Lakebase DDL, and Volume paths from `.proto` files. - -**When to use:** New apps with multiple plugins (files + lakebase + jobs), or adding typed boundaries to existing apps. Skip for quick prototypes. 
- -**Requires:** `buf` CLI for proto linting and code generation. - -**Rule: No implementation before contracts. No contracts without consumers.** - -Define protobuf data contracts FIRST, then derive everything else (TypeScript types, Lakebase DDL, Volume paths, API shapes) from those contracts. - -## When to Use - -| Scenario | Use this skill | -| --------------------------------------------- | ---------------------------------------------------- | -| Creating a new Databricks app | YES — define contracts before `databricks apps init` | -| Adding a new data boundary to an existing app | YES — add proto before implementation | -| Quick prototype / hackathon | NO — skip contracts, move fast | -| Modifying existing typed code | NO — contracts already exist | - -## Core Principle - -``` -User intent → Module map → Proto contracts → Generated types → Implementation - ↓ ↓ - Lakebase DDL TypeScript interfaces - ↓ ↓ - Migrations Plugin code -``` - -The `.proto` file is the single source of truth. If it's not in a proto, it doesn't cross a module boundary. - -## Phase 1: Decompose into Modules - -Every Databricks app decomposes into a combination of these plugin modules: - -| Module | Plugin | Data Boundary | Owns | -| ------------- | --------- | ------------------- | --------------------------------------- | -| **Storage** | files | UC Volumes | Blobs, uploads, artifacts, archives | -| **Database** | lakebase | Postgres tables | Structured records, queries, migrations | -| **Compute** | jobs | Databricks Jobs API | Job runs, task results, cluster configs | -| **Analytics** | analytics | SQL Warehouse | Read-only queries, dashboards | -| **Serving** | server | HTTP/tRPC routes | API endpoints, SSE streams | - -### Decomposition Rules - -1. **Each module owns its data** — files plugin never writes to lakebase, lakebase never writes to volumes. -2. **Cross-module communication is typed** — a proto message, never a raw JSON blob. -3. 
**Every proto message has exactly one producer module.** -4. **Multiple modules can consume** — but the producer defines the schema. -5. **No god messages** — if a message has >12 fields, split it. - -### Output: Module Map - -Before proceeding, produce a module map for the user to confirm: - -``` -App: -Modules: - storage: files plugin → uploads/, results/, artifacts/ - db: lakebase plugin → runs, metrics, configs tables - compute: jobs → generation tasks, eval tasks - api: server plugin → POST /run, GET /status, SSE /stream -``` - -## Phase 2: Define Proto Contracts - -### Directory Structure - -``` -proto/ -├── buf.yaml -├── buf.gen.yaml -└── / - └── v1/ - ├── common.proto # Shared enums, IDs - ├── storage.proto # Files plugin boundary - ├── database.proto # Lakebase plugin boundary - ├── compute.proto # Jobs boundary - └── api.proto # Server/API boundary -``` - -### Proto Style Rules - -- **Package**: `.v1` (versioned from day one) -- **One file per module boundary**, not per message -- **Every field has a consumer** — if no code reads it, delete it -- **snake_case** for all field names -- **proto3** syntax only - -### Files Plugin Boundary (`storage.proto`) - -The files plugin operates on UC Volumes. Type every file path and payload: - -```protobuf -syntax = "proto3"; -package .v1; - -import "google/protobuf/timestamp.proto"; - -// StoredArtifact — produced by files plugin after upload. -message StoredArtifact { - string volume_path = 1; - string content_type = 2; - int64 size_bytes = 3; - google.protobuf.Timestamp created_at = 4; - string checksum_sha256 = 5; -} - -// UploadRequest — sent to files plugin by api module. -message UploadRequest { - string destination_path = 1; - string content_type = 2; - bytes content = 3; - map metadata = 4; -} - -// VolumeLayout — design-time contract for volume directory structure. 
-message VolumeLayout { - string root = 1; // /Volumes/catalog/schema/app_name - string uploads_dir = 2; // uploads/ - string results_dir = 3; // results/ - string artifacts_dir = 4; // artifacts/ -} -``` - -### Lakebase Plugin Boundary (`database.proto`) - -Every Lakebase table has a corresponding proto message. The message IS the schema: - -```protobuf -syntax = "proto3"; -package .v1; - -import "google/protobuf/timestamp.proto"; - -// RunRecord — one row in the `runs` table. -// Producer: compute module. Consumers: api, analytics. -message RunRecord { - string run_id = 1; - string app_name = 2; - RunStatus status = 3; - google.protobuf.Timestamp started_at = 4; - google.protobuf.Timestamp completed_at = 5; - string error_message = 6; - string config_json = 7; -} - -// MetricRecord — one row in the `metrics` table. -// Producer: compute module. Consumers: analytics, api. -message MetricRecord { - string run_id = 1; - string metric_name = 2; - double value = 3; - google.protobuf.Timestamp recorded_at = 4; - map dimensions = 5; -} -``` - -### Jobs Boundary (`compute.proto`) - -Type job task inputs and outputs: - -```protobuf -syntax = "proto3"; -package .v1; - -// JobTaskInput — typed payload sent to a Databricks job task. -// Producer: api module. Consumer: job task code. -message JobTaskInput { - string task_id = 1; - string task_type = 2; - string run_id = 3; - bytes input_payload = 4; - map env = 5; -} - -// JobTaskOutput — typed result from a completed job task. -// Producer: job task code. Consumer: api module. -message JobTaskOutput { - string task_id = 1; - string run_id = 2; - bool success = 3; - string error = 4; - bytes output_payload = 5; - int64 duration_ms = 6; - map metrics = 7; -} -``` - -## Phase 3: Generate Types and DDL - -### 3a. 
Buf configuration - -```yaml -# buf.yaml -version: v2 -lint: - use: - - STANDARD -breaking: - use: - - FILE -``` - -```yaml -# buf.gen.yaml -version: v2 -plugins: - - remote: buf.build/connectrpc/es - out: proto/gen - opt: target=ts -``` - -### 3b. Generate TypeScript types - -```bash -buf lint proto/ -buf generate proto/ -``` - -### 3c. Generate Lakebase DDL - -For each message in `database.proto`, generate a numbered migration file. - -**Proto→SQL type mapping:** - -| Proto Type | SQL Type | Default | -| -------------- | ------------------ | ---------------- | -| `string` | `TEXT` | `''` | -| `bool` | `BOOLEAN` | `false` | -| `int32` | `INTEGER` | `0` | -| `int64` | `BIGINT` | `0` | -| `double` | `DOUBLE PRECISION` | `0.0` | -| `bytes` | `BYTEA` | `NULL` | -| `Timestamp` | `TIMESTAMPTZ` | `NOW()` | -| `repeated T` | `JSONB` | `'[]'::jsonb` | -| `map` | `JSONB` | `'{}'::jsonb` | -| nested message | `JSONB` | `NULL` | -| `enum` | `TEXT` | first value name | - -Example migration: - -```sql --- migrations/001_create_runs.sql -CREATE TABLE IF NOT EXISTS runs ( - run_id TEXT NOT NULL, - app_name TEXT NOT NULL, - status TEXT NOT NULL DEFAULT 'RUN_STATUS_PENDING', - started_at TIMESTAMPTZ, - completed_at TIMESTAMPTZ, - error_message TEXT, - config_json JSONB, - PRIMARY KEY (run_id, app_name) -); -``` - -### 3d. Validate - -```bash -npx tsc --noEmit # all generated types compile -buf lint proto/ # proto style checks -``` - -## Phase 4: Implement Against Contracts - -NOW implementation begins. Each module uses ONLY its generated types: - -```typescript -import type { StoredArtifact, UploadRequest } from '../proto/gen//v1/storage'; -import type { RunRecord, MetricRecord } from '../proto/gen//v1/database'; -import type { JobTaskInput, JobTaskOutput } from '../proto/gen//v1/compute'; -``` - -No `any`, no `unknown`, no `JSON.parse()` at module boundaries. 
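As a rough illustration of a typed boundary, the sketch below passes a typed message into the database module instead of a raw JSON blob. The `RunRecord` interface here is a hand-written stand-in (in a real app it would be imported from `proto/gen/`, and the generated field names may differ), and `toInsertStatement` is a hypothetical helper. The `{ text, values }` result follows the query-config shape used by node-postgres; whether the Lakebase pool accepts the same shape is an assumption.

```typescript
// Stand-in for the generated type; assumed camelCase field names.
interface RunRecord {
  runId: string;
  appName: string;
  status: string;
  errorMessage: string;
}

// Column order must match the order of values built below.
const RUN_COLUMNS = ['run_id', 'app_name', 'status', 'error_message'] as const;

// Hypothetical boundary function: accepts only a typed RunRecord,
// so no `any` or JSON.parse() is needed on the database side.
function toInsertStatement(record: RunRecord): { text: string; values: string[] } {
  const values = [record.runId, record.appName, record.status, record.errorMessage];
  const placeholders = values.map((_, i) => `$${i + 1}`).join(', ');
  return {
    text: `INSERT INTO runs (${RUN_COLUMNS.join(', ')}) VALUES (${placeholders})`,
    values,
  };
}

const stmt = toInsertStatement({
  runId: 'run-1',
  appName: 'demo',
  status: 'RUN_STATUS_PENDING',
  errorMessage: '',
});
console.log(stmt.text);
```

Keeping the values in a parameter array rather than interpolating them into the SQL string also avoids quoting problems at the boundary.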
- -## Validation Checklist - -Before writing implementation code: - -- [ ] Module map exists with clear data boundaries -- [ ] Proto files exist for every cross-boundary data structure -- [ ] `buf lint proto/` passes -- [ ] `buf generate proto/` produces TypeScript types -- [ ] Lakebase DDL derived from `database.proto` messages -- [ ] No proto message exceeds 12 fields -- [ ] Every field has at least one identified consumer -- [ ] Every message has exactly one producer module -- [ ] Volume layout documented (not freeform paths) -- [ ] Job inputs/outputs typed (no raw JSON params) - -## Common Traps - -| Trap | Why it fails | Fix | -| ------------------------------------ | ------------------------------------------------ | -------------------------------------------------- | -| "I'll add the proto later" | Boundaries calcify around untyped shapes | Proto first or not at all | -| `any` at a module boundary | Type errors surface at runtime, not compile time | Use generated types | -| `JSON.parse()` crossing a boundary | No schema validation | Deserialize with proto decoder | -| Giant 30-field message | Impossible to review, version, or extend | Split by concern, max 12 fields | -| Storing raw JSON in Lakebase | Loses queryability and type safety | Map to `repeated`, `map`, or nested message fields | -| Shared mutable state between modules | Race conditions, unclear ownership | Communicate through typed messages | - -## References - -- [Plugin Contract Details](references/plugin-contracts.md) — proto↔plugin type mappings for files, lakebase, jobs diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/sql-queries.md b/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/sql-queries.md deleted file mode 100644 index e22133b..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/sql-queries.md +++ /dev/null @@ -1,270 +0,0 @@ -# SQL 
Query Files - -**IMPORTANT**: ALWAYS use SQL files in `config/queries/` for data retrieval. NEVER use tRPC for SQL queries. - -- Store ALL SQL queries in the `config/queries/` directory -- Name files descriptively: `trip_statistics.sql`, `user_metrics.sql`, `sales_by_region.sql` -- Reference by filename (without extension) in `useAnalyticsQuery`, or pass it as `queryKey` directly to a visualization component -- AppKit automatically executes queries against the configured Databricks warehouse -- Benefits: Built-in caching, proper connection pooling, better performance - -## Type Generation - -For full type generation details, see: `npx @databricks/appkit docs ./docs/development/type-generation.md` - -**Type generation:** Types are auto-regenerated during dev whenever SQL files change. - -**Quick workflow:** Add SQL files → Types auto-generate during dev → Types appear in `client/src/appKitTypes.d.ts` - -## Query Schemas (Optional) - -Create `config/queries/schema.ts` only if you need **runtime validation** with Zod. - -```typescript -import { z } from 'zod'; - -export const querySchemas = { - my_query: z.array( - z.object({ - category: z.string(), - // Use z.coerce.number() - handles both string and number from SQL - amount: z.coerce.number(), - }) - ), -}; -``` - -**Why `z.coerce.number()`?** - -- Auto-generated types use `number` based on SQL column types -- But some SQL types (DECIMAL, large BIGINT) return as strings at runtime -- `z.coerce.number()` handles both cases safely - -## SQL Type Handling (Critical) - -**Understanding Type Generation vs Runtime:** - -1. **Auto-generated types** (`appKitTypes.d.ts`): Based on SQL column types - - `BIGINT`, `INT`, `DECIMAL` → TypeScript `number` - - These are the types you'll see in IntelliSense - -2. 
**Runtime JSON values**: Some numeric types arrive as strings - - `DECIMAL` often returns as string (e.g., `"123.45"`) - - Large `BIGINT` values return as string - - `ROUND()`, `AVG()`, `SUM()` results may be strings - -**Best Practice - Always convert before numeric operations:** - -```typescript -// ❌ WRONG - may fail if value is string at runtime -{row.total_amount.toFixed(2)} - -// ✅ CORRECT - convert to number first -{Number(row.total_amount).toFixed(2)} -``` - -**Helper Functions:** - -Create app-specific helpers for consistent numeric formatting (for example in `client/src/lib/formatters.ts`): - -```typescript -// client/src/lib/formatters.ts -export const toNumber = (value: number | string): number => Number(value); -export const formatCurrency = (value: number | string): string => `$${Number(value).toFixed(2)}`; -export const formatPercent = (value: number | string): string => `${Number(value).toFixed(1)}%`; -``` - -Use them wherever you render query results: - -```typescript -import { toNumber, formatCurrency, formatPercent } from './formatters'; // adjust import path to your file layout - -// Convert to number -const amount = toNumber(row.amount); // "123.45" → 123.45 - -// Format as currency -const formatted = formatCurrency(row.amount); // "123.45" → "$123.45" - -// Format as percentage -const percent = formatPercent(row.rate); // "85.5" → "85.5%" -``` - -## Available sql.\* Helpers - -**Full API reference**: `npx @databricks/appkit docs ./docs/api/appkit/Variable.sql.md` — always check this for the latest available helpers. 
- -```typescript -import { sql } from '@databricks/appkit-ui/js'; - -// ✅ These exist: -sql.string(value); // For STRING parameters -sql.number(value); // For NUMERIC parameters (INT, BIGINT, DOUBLE, DECIMAL) -sql.boolean(value); // For BOOLEAN parameters -sql.date(value); // For DATE parameters (YYYY-MM-DD format) -sql.timestamp(value); // For TIMESTAMP parameters -sql.binary(value); // For BINARY (returns hex string, use UNHEX() in SQL) - -// ❌ These DO NOT exist: -// sql.null() - use sentinel values instead -// sql.array() - use comma-separated sql.string() and split in SQL -// sql.int() - use sql.number() -// sql.float() - use sql.number() -``` - -**For nullable string parameters**, use sentinel values or empty strings. **For nullable date parameters**, use sentinel dates only (empty strings cause validation errors) — see "Optional Date Parameters" section below. - -## Databricks SQL Dialect - -Databricks uses Databricks SQL (based on Spark SQL), NOT PostgreSQL/MySQL. Common mistakes: - -| PostgreSQL | Databricks SQL | -| ------------------------ | --------------------------------------- | -| `GENERATE_SERIES(1, 10)` | `explode(sequence(1, 10))` | -| `DATEDIFF(date1, date2)` | `DATEDIFF(DAY, date2, date1)` (3 args!) | -| `NOW()` | `CURRENT_TIMESTAMP()` | -| `INTERVAL '7 days'` | `INTERVAL 7 DAY` | -| `STRING_AGG(col, ',')` | `CONCAT_WS(',', COLLECT_LIST(col))` | -| `ILIKE` | `LOWER(col) LIKE LOWER(pattern)` | - -**Sample data date ranges** — do NOT use `CURRENT_DATE()` on historical datasets: - -- `samples.tpch.*` — historical dates, check with `SELECT MIN(o_orderdate), MAX(o_orderdate) FROM samples.tpch.orders` -- `samples.nyctaxi.trips` — NYC taxi data with specific date ranges -- `samples.tpcds.*` — data from 1998-2003 - -Always check date ranges before writing date-filtered queries. 
- -## Before Running `npm run typegen` - -Verify each SQL file before running typegen: - -- [ ] Uses Databricks SQL syntax (NOT PostgreSQL) — check dialect table above -- [ ] `DATEDIFF` has 3 arguments: `DATEDIFF(DAY, start, end)` -- [ ] Uses `LOWER(col) LIKE LOWER(pattern)` instead of `ILIKE` -- [ ] Column aliases in `ORDER BY` match `SELECT` aliases exactly -- [ ] Date columns are not passed to numeric functions like `ROUND()` -- [ ] Date range filters use actual data dates (NOT `CURRENT_DATE()` on historical data — check date ranges first) - -## Query Parameterization - -SQL queries can accept parameters to make them dynamic and reusable. - -**Key Points:** - -- Parameters use colon prefix: `:parameter_name` -- Databricks infers types from values automatically -- For optional string parameters, use pattern: `(:param = '' OR column = :param)` -- **For optional date parameters, use sentinel dates** (`'1900-01-01'` and `'9999-12-31'`) instead of empty strings - -### SQL Parameter Syntax - -```sql --- config/queries/filtered_data.sql -SELECT * -FROM my_table -WHERE column_value >= :min_value - AND column_value <= :max_value - AND category = :category - AND (:optional_filter = '' OR status = :optional_filter) -``` - -### Frontend Parameter Passing - -```typescript -import { sql } from '@databricks/appkit-ui/js'; - -const { data } = useAnalyticsQuery('filtered_data', { - min_value: sql.number(minValue), - max_value: sql.number(maxValue), - category: sql.string(category), - optional_filter: sql.string(optionalFilter || ''), // empty string for optional params -}); -``` - -### Date Parameters - -Use `sql.date()` for date parameters with `YYYY-MM-DD` format strings. 
- -**Frontend - Using Date Parameters:** - -```typescript -import { sql } from '@databricks/appkit-ui/js'; -import { useState } from 'react'; - -function MyComponent() { - const [startDate, setStartDate] = useState('2016-02-01'); - const [endDate, setEndDate] = useState('2016-02-29'); - - const queryParams = { - start_date: sql.date(startDate), // Pass YYYY-MM-DD string to sql.date() - end_date: sql.date(endDate), - }; - - const { data } = useAnalyticsQuery('my_query', queryParams); - - // ... -} -``` - -**SQL - Date Filtering:** - -```sql --- Filter by date range using DATE() function -SELECT COUNT(*) as trip_count -FROM samples.nyctaxi.trips -WHERE DATE(tpep_pickup_datetime) >= :start_date - AND DATE(tpep_pickup_datetime) <= :end_date -``` - -**Date Helper Functions:** - -```typescript -// Helper to get YYYY-MM-DD string for dates relative to today -const daysAgo = (n: number): string => { - const date = new Date(Date.now() - n * 86400000); - return date.toISOString().split('T')[0]; // "2024-01-15" -}; - -const params = { - start_date: sql.date(daysAgo(7)), // 7 days ago - end_date: sql.date(daysAgo(0)), // Today -}; -``` - -### Optional Date Parameters - Use Sentinel Dates - -Databricks App Kit validates parameter types before query execution. **DO NOT use empty strings (`''`) for optional date parameters** as this causes validation errors. 
- -**✅ CORRECT - Use Sentinel Dates:** - -```typescript -// Frontend: Use sentinel dates for "no filter" instead of empty strings -const revenueParams = { - group_by: 'month', - start_date: sql.date('1900-01-01'), // Sentinel: effectively no lower bound - end_date: sql.date('9999-12-31'), // Sentinel: effectively no upper bound - country: sql.string(country || ''), - property_type: sql.string(propertyType || ''), -}; -``` - -```sql --- SQL: Simple comparison since sentinel dates are always valid -WHERE b.check_in >= CAST(:start_date AS DATE) - AND b.check_in <= CAST(:end_date AS DATE) -``` - -**Why Sentinel Dates Work:** - -- `1900-01-01` is before any real data (effectively no lower bound filter) -- `9999-12-31` is after any real data (effectively no upper bound filter) -- Always valid DATE types, so no parameter validation errors -- All real dates fall within this range, so no filtering occurs - -**Parameter Types Summary:** - -- ALWAYS use sql.\* helper functions from the `@databricks/appkit-ui/js` package to define SQL parameters -- **Strings/Numbers**: Use directly in SQL with `:param_name` -- **Dates**: Use with `CAST(:param AS DATE)` in SQL -- **Optional Strings**: Use empty string default, check with `(:param = '' OR column = :param)` -- **Optional Dates**: Use sentinel dates (`sql.date('1900-01-01')` and `sql.date('9999-12-31')`) instead of empty strings diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/trpc.md b/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/trpc.md deleted file mode 100644 index 586bd33..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/appkit/trpc.md +++ /dev/null @@ -1,144 +0,0 @@ -# tRPC for Custom Endpoints - -**CRITICAL**: Do NOT use tRPC for SQL queries or data retrieval. Use `config/queries/` + `useAnalyticsQuery` instead. 
- -**CRITICAL**: Do NOT use tRPC for accessing Unity Catalog and File operations. Use the Files plugin instead. - -Use tRPC ONLY for: - -- **Mutations**: Creating, updating, or deleting data (INSERT, UPDATE, DELETE) -- **External APIs**: Calling Databricks APIs (serving endpoints, jobs, MLflow, etc.) -- **Complex business logic**: Multi-step operations that cannot be expressed in SQL -- **File operations**: File uploads, processing, transformations -- **Custom computations**: Operations requiring TypeScript/Node.js logic - -## Before Writing New Routes - -**ALWAYS complete these checks before adding tRPC routes:** - -### 1. Check AppKit Version - -Read `package.json` to identify the installed `@databricks/appkit` version. Available server APIs and plugins differ across versions. - -```bash -# From the project root -cat package.json | grep @databricks/appkit -``` - -### 2. Review Available Plugins - -Check what plugins are already enabled and what server-side functionality they provide — avoid reimplementing what a plugin already handles. - -```bash -# See plugin docs for the installed version -npx @databricks/appkit docs ./docs/plugins.md - -# See all plugins available for a specific version -databricks apps manifest --version --profile - -# See plugins available for the default template -databricks apps manifest --profile -``` - -**Key plugins to check for:** - -- **analytics** — provides SQL warehouse query execution (do NOT reimplement with tRPC) -- **lakebase** — provides `createLakebasePool` for PostgreSQL CRUD (use pool in tRPC routes, don't create raw connections) -- **genie** — provides Genie AI-powered data exploration (check before building custom natural-language-to-SQL routes) -- **files** — provides file storage and retrieval helpers (check before writing custom file upload/download routes) - -If a plugin already covers your use case, use the plugin's API instead of writing a custom tRPC route. 
- -If a newer version of `@databricks/appkit` has a plugin that fits the use case, -prompt the user to update. - -### 3. Check Existing Routes - -Read `server/server.ts` (or `server/trpc.ts`) to see what routes already exist. Extend the existing router rather than creating a parallel one. - -## Server-side Pattern - -```tsx -// server/trpc.ts -import { initTRPC } from '@trpc/server'; -import { getExecutionContext } from '@databricks/appkit'; -import { z } from 'zod'; -import superjson from 'superjson'; - -const t = initTRPC.create({ transformer: superjson }); -const publicProcedure = t.procedure; - -export const appRouter = t.router({ - // Example: Query a serving endpoint - queryModel: publicProcedure.input(z.object({ prompt: z.string() })).query(async ({ input: { prompt } }) => { - const { serviceDatabricksClient: client } = getExecutionContext(); - const response = await client.servingEndpoints.query({ - name: 'your-endpoint-name', - messages: [{ role: 'user', content: prompt }], - }); - return response; - }), - - // Example: Mutation - createRecord: publicProcedure.input(z.object({ name: z.string() })).mutation(async ({ input }) => { - // Custom logic here - return { success: true, id: 123 }; - }), -}); -``` - -## Client-side Pattern - -```typescript -// client/src/components/MyComponent.tsx -import { trpc } from '@/lib/trpc'; -import { useState, useEffect } from 'react'; - -function MyComponent() { - const [result, setResult] = useState(null); - - useEffect(() => { - trpc.queryModel - .query({ prompt: "Hello" }) - .then(setResult) - .catch(console.error); - }, []); - - const handleCreate = async () => { - await trpc.createRecord.mutate({ name: "test" }); - }; - - return
{/* component JSX */}
; -} -``` - -## Decision Tree for Data Operations - -1. **Need to display data from SQL?** - - **Chart or Table?** → Use visualization components (`BarChart`, `LineChart`, `DataTable`, etc.) - - **Custom display (KPIs, cards, lists)?** → Use `useAnalyticsQuery` hook - - **Never** use tRPC for SQL SELECT statements - -2. **Need to call a Databricks API?** → Use tRPC - - Serving endpoints (model inference) - - MLflow operations - - Jobs API - - Workspace API - -3. **Need to modify data?** → Use tRPC mutations - - INSERT, UPDATE, DELETE operations - - Multi-step transactions - - Business logic with side effects - -4. **Need non-SQL custom logic?** → Use tRPC - - File processing - - External API calls - - Complex computations in TypeScript - -**Summary:** - -- ✅ SQL queries → Visualization components or `useAnalyticsQuery` -- ✅ Databricks APIs → tRPC -- ✅ Data mutations → tRPC -- ❌ SQL queries → tRPC (NEVER do this) -- ❌ Files operations → tRPC (NEVER do this) diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/other-frameworks.md b/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/other-frameworks.md deleted file mode 100644 index 4e160e0..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/other-frameworks.md +++ /dev/null @@ -1,282 +0,0 @@ -# Databricks Apps — Other Frameworks (Non-AppKit) - -Setup guide for non-AppKit apps: Streamlit, FastAPI, Flask, Gradio, Dash, Django, Next.js, React, etc. - -For universal platform rules (permissions, deployment, timeouts, resource injection), see [Platform Guide](platform-guide.md). - -## 1. 
Port & Host Configuration - -**The #1 cause of 502 Bad Gateway errors.** - -| Setting | Required Value | Common Mistake | -| ------- | ----------------------------- | ------------------------------------- | -| Port | `DATABRICKS_APP_PORT` env var | Hardcoding 8080, 3000, or 3001 | -| Host | `0.0.0.0` | Binding to `localhost` or `127.0.0.1` | - -The platform dynamically assigns a port via `DATABRICKS_APP_PORT`. Use `8000` as a local dev fallback only. - -### Framework-Specific Port Configuration - -#### Streamlit - -```yaml -# app.yaml -command: - - streamlit - - run - - app.py - - --server.port - - '${DATABRICKS_APP_PORT:-8000}' - - --server.address - - '0.0.0.0' -``` - -#### FastAPI / Uvicorn - -```python -if __name__ == "__main__": - import uvicorn - port = int(os.environ.get("DATABRICKS_APP_PORT", 8000)) - uvicorn.run(app, host="0.0.0.0", port=port) -``` - -#### Flask - -```python -port = int(os.environ.get("DATABRICKS_APP_PORT", 8000)) -app.run(host="0.0.0.0", port=port) -``` - -#### Gradio - -```python -demo.launch(server_name="0.0.0.0", - server_port=int(os.environ.get("DATABRICKS_APP_PORT", 8000))) -``` - -#### Dash - -```python -app.run(host="0.0.0.0", - port=int(os.environ.get("DATABRICKS_APP_PORT", 8000))) -``` - -#### Next.js - -```jsonc -// package.json -"scripts": { - "start": "next start -p ${DATABRICKS_APP_PORT:-8000} -H 0.0.0.0" -} -``` - -⚠️ **Only ONE service can bind to `DATABRICKS_APP_PORT`.** If you need multiple services (e.g., frontend + backend), use a reverse proxy or serve everything from one process. - -## 2. app.yaml vs databricks.yml - -These two files serve different purposes. Getting them wrong causes silent deployment failures. 
- -### app.yaml — Runtime Configuration - -- Defines the **start command** and **environment variables** for the running app -- Used by the Databricks Apps runtime directly -- `valueFrom:` injects resource IDs from workspace configuration - -```yaml -# app.yaml -command: - - python - - app.py -env: - - name: DATABRICKS_WAREHOUSE_ID - valueFrom: sql-warehouse - - name: MY_CUSTOM_VAR - value: 'some-value' -``` - -### databricks.yml — Bundle/Deployment Configuration - -- Defines the **app resource** for DABs (Declarative Automation Bundles) -- `config:` section only takes effect after `bundle run`, NOT just `bundle deploy` - -```yaml -# databricks.yml -bundle: - name: my-app-bundle - -resources: - apps: - my-app: - name: my-app - source_code_path: . - config: - command: ['python', 'app.py'] - env: - - name: DATABRICKS_WAREHOUSE_ID - valueFrom: sql-warehouse - permissions: - - service_principal_name: ${bundle.target}.my-app - level: CAN_MANAGE - -targets: - dev: - default: true -``` - -### Critical Rules - -| Rule | Why | -| ---------------------------------------------------------- | -------------------------------------------------------------- | -| Always provide BOTH `app.yaml` AND `databricks.yml` config | UI deployments use app.yaml; DABs uses databricks.yml | -| Always run `bundle deploy` THEN `bundle run ` | `deploy` uploads code; `run` applies config and starts the app | -| Never use `${var.xxx}` in config env values | Variables are NOT resolved in config — values appear literally | - -## 3. 
Using OBO in Non-AppKit Apps - -```python -# FastAPI example -from fastapi import Request -from databricks.sdk import WorkspaceClient - -@app.get("/user-data") -def get_user_data(request: Request): - token = request.headers.get("x-forwarded-access-token") - - # create user-scoped client - w = WorkspaceClient(token=token, host=os.environ["DATABRICKS_HOST"]) - # use w for user-scoped operations -``` - -```python -# SP auth is auto-configured — just use the SDK -from databricks.sdk import WorkspaceClient -w = WorkspaceClient() # picks up auto-injected env vars -``` - -## 4. Framework-Specific Timeout Gotchas - -| Framework | Default Timeout | Fix | -| --------- | --------------------------- | --------------------------------------------------- | -| Gradio | 30 seconds (internal) | Set `fn` timeout explicitly or use `gradio.queue()` | -| Gunicorn | 30 seconds (worker timeout) | Set `--timeout 120` in gunicorn command | -| Uvicorn | None (no default timeout) | Already fine | - -## 5. Common Errors (Non-AppKit Specific) - -| Error | Cause | Fix | -| --------------------------------- | -------------------------------------------------------- | ---------------------------------------------------- | -| 502 Bad Gateway | Wrong port or host | Bind to `0.0.0.0:${DATABRICKS_APP_PORT:-8000}` | -| App works locally but 502 in prod | Binding to localhost | Change to `0.0.0.0` | -| `ModuleNotFoundError` at runtime | Dependency not in requirements.txt or version conflict | Pin exact versions; validate locally first | -| Wrong script runs on deploy | No `command` in app.yaml, platform picked wrong .py file | Always specify `command` explicitly in app.yaml | -| `apt-get: command not found` | No root access in container | Use pure-Python wheels from PyPI; no system packages | - -## 6. Dependency Management - -### Python - -Only `requirements.txt` is natively supported. No native support for `pyproject.toml`, `uv.lock`, or Poetry. 
- -**Workaround for `uv`:** - -``` -# requirements.txt -uv -``` - -```yaml -# app.yaml -command: - - uv - - run - - app.py -``` - -Define actual dependencies in `pyproject.toml`. Note: This moves dependency installation from build to run step, slowing startup. - -**Custom package repositories:** - -- Set `PIP_INDEX_URL` as a secret in the app configuration -- Deploying user needs **MANAGE** permission on the secret scope (not just USE/READ) - -### Node.js - -- `package.json` is supported — `npm install` runs at startup -- Do NOT include `node_modules/` in source code (10 MB file limit) -- Large npm installs may exceed the 10-minute startup window -- In egress-restricted workspaces, add `registry.npmjs.org` to egress policy AND restart the app (egress changes require restart) - -## 7. Networking & CORS - -### CORS - -- CORS headers are **not customizable** on the Databricks Apps reverse proxy -- Workspace origin (`*.databricks.com`) differs from app origin (`*.databricksapps.com`) -- Cross-app API calls return **302 redirect to login page** instead of the expected response - -**Workaround:** Keep frontend and backend in a single app to avoid CORS entirely. - -### Private Link / Hardened Environments - -- Azure apps use `*.azure.databricksapps.com` — NOT `*.azuredatabricks.net` -- Existing Private Link DNS zones don't cover the apps domain -- Fix: Create a separate Private DNS Zone for `azure.databricksapps.com` with conditional DNS forwarding - -### Egress Restrictions - -- Egress policy changes require **app restart** to take effect -- For npm: allowlist `registry.npmjs.org` -- For pip: allowlist `pypi.org` and `files.pythonhosted.org` -- For custom registries: use `PIP_INDEX_URL` secret (see Dependency Management) - -## 8. 
Streamlit-Specific Gotchas - -### Required Environment Variables - -```yaml -# app.yaml -command: - - streamlit - - run - - app.py - - --server.port - - '${DATABRICKS_APP_PORT:-8000}' - - --server.address - - '0.0.0.0' -env: - - name: STREAMLIT_SERVER_ENABLE_CORS - value: 'false' - - name: STREAMLIT_SERVER_ENABLE_XSRF_PROTECTION - value: 'false' -``` - -⚠️ **Both CORS and XSRF must be disabled** for Streamlit on Databricks Apps. The reverse proxy origin (`*.databricksapps.com`) differs from the workspace origin, triggering Streamlit's CORS/XSRF protection. - -### OBO Token Staleness - -Streamlit caches initial HTTP request headers, then switches to WebSocket. The OBO token from `x-forwarded-access-token` **never refreshes** — it goes stale. - -**Workaround:** Periodically trigger a full page refresh. No clean in-Streamlit solution exists. - -### Connection Exhaustion (Hangs After Initial Queries) - -Streamlit re-runs the entire script on every user interaction. If `sql.connect()` is called during each render cycle, the rapid succession of TCP handshakes and OAuth negotiations exhausts the connection pool, causing 2-3 minute freezes. - -**Fix:** Use `@st.cache_resource` to maintain persistent connections: - -```python -@st.cache_resource -def get_connection(): - from databricks import sql - from databricks.sdk.core import Config - cfg = Config() - return sql.connect( - server_hostname=cfg.host, - http_path=f"/sql/1.0/warehouses/{os.environ['DATABRICKS_WAREHOUSE_ID']}", - credentials_provider=lambda: cfg.authenticate, - ) -``` - -### Transient 502s During Startup - -Streamlit apps commonly show brief 502 errors during startup. This is expected and does not indicate a problem. 
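For deployment scripts, the expectation above can be encoded as a readiness check that tolerates 502s for a bounded time. This is a sketch under assumptions, not part of the original guide: `wait_until_ready` is a hypothetical helper, the caller supplies `fetch_status` (any function returning an HTTP status code), and the attempt count and delay are arbitrary defaults.

```python
import time


def wait_until_ready(fetch_status, attempts=30, delay=2.0, sleep=time.sleep):
    """Poll until the app answers 200, treating 502 as "still starting".

    Returns the number of attempts used; raises on any other status or on timeout.
    """
    for attempt in range(1, attempts + 1):
        status = fetch_status()
        if status == 200:
            return attempt
        if status != 502:
            # Anything other than 502 is not the expected startup blip.
            raise RuntimeError(f"Unexpected status during startup: {status}")
        sleep(delay)
    raise TimeoutError(f"App still returning 502 after {attempts} attempts")
```

Injecting `fetch_status` and `sleep` keeps the helper independent of any HTTP client and easy to exercise in tests.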
diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/platform-guide.md b/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/platform-guide.md deleted file mode 100644 index 7c2519d..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/platform-guide.md +++ /dev/null @@ -1,173 +0,0 @@ -# Databricks Apps Platform Guide - -Universal platform rules that apply to ALL Databricks Apps regardless of framework (AppKit, Streamlit, FastAPI, etc.). - -For non-AppKit framework-specific setup (port config, app.yaml, Streamlit gotchas), see [Other Frameworks](other-frameworks.md). - -## Service Principal Permissions - -**The #1 cause of runtime crashes after deployment.** - -When your app uses a Databricks resource (SQL warehouse, model serving endpoint, vector search index, volume, secret scope), the app's **service principal** must have explicit permissions on that resource. - -### How Permissions Work - -When you declare a resource in `app.yaml` / `databricks.yml` with a `permission` field, the platform **automatically grants** that permission to the app's SP on deployment. You do NOT need to run manual `set-permissions` commands for declared resources. 
- -```yaml -# databricks.yml — declaring resources with permissions -resources: - apps: - my_app: - resources: - - name: my-warehouse - sql_warehouse: - id: ${var.warehouse_id} - permission: CAN_USE # auto-granted to SP on deploy - - name: my-endpoint - serving_endpoint: - name: ${var.endpoint_name} - permission: CAN_QUERY # auto-granted to SP on deploy -``` - -### Default Permissions by Resource Type - -| Resource Type | Default Permission | Notes | -| ------------------------ | ---------------------- | ---------------------------------------- | -| SQL Warehouse | CAN_USE | Minimum for query execution | -| Model Serving Endpoint | CAN_QUERY | For inference calls | -| Vector Search Index (UC) | SELECT | UC securable of type TABLE | -| Volume (UC) | READ_VOLUME | Via UC securable | -| Secret Scope | READ | Deploying user needs MANAGE on the scope | -| Job | CAN_MANAGE_RUN | | -| Lakebase Database | CAN_CONNECT_AND_CREATE | | -| Genie Space | CAN_VIEW | | - -### ⚠️ CRITICAL AGENT BEHAVIOR - -Always declare resources in `databricks.yml` with the correct `permission` field — do NOT skip this. The platform handles granting automatically on deploy. - -## Resource Types & Injection - -**NEVER hardcode workspace-specific IDs in source code.** Always inject via environment variables with `valueFrom`. 
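As a minimal sketch of the rule above, an app can read every injected ID once at startup and fail fast when one is missing. The helper name `load_resource_ids` is hypothetical (it is not a platform API), and which variables count as required is an assumption that depends on what your `app.yaml` declares.

```python
import os


def load_resource_ids(required, env=None):
    """Return the injected resource IDs, failing fast if any is missing.

    `required` lists the env var names that app.yaml is expected to
    populate via `valueFrom` (hypothetical helper, not a platform API).
    """
    env = os.environ if env is None else env
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing injected env vars: " + ", ".join(missing))
    return {name: env[name] for name in required}
```

Calling `load_resource_ids(["DATABRICKS_WAREHOUSE_ID"])` at import time surfaces a misconfigured resource during startup rather than on the first query.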
- -| Resource Type | Default Key | Use Case | -| ---------------------- | --------------------- | ------------------------ | -| SQL Warehouse | `sql-warehouse` | Query compute | -| Model Serving Endpoint | `serving-endpoint` | Model inference | -| Vector Search Index | `vector-search-index` | Semantic search | -| Lakebase Database | `database` | OLTP storage | -| Secret | `secret` | Sensitive values | -| UC Table | `table` | Structured data | -| UC Connection | `connection` | External data sources | -| Genie Space | `genie-space` | AI analytics | -| MLflow Experiment | `experiment` | ML tracking | -| Lakeflow Job | `job` | Data workflows | -| UDF | `function` | SQL/Python functions | -| Databricks App | `app` | App-to-app communication | - -```python -# ✅ GOOD -warehouse_id = os.environ["DATABRICKS_WAREHOUSE_ID"] -``` - -```yaml -# app.yaml / databricks.yml env section -env: - - name: DATABRICKS_WAREHOUSE_ID - valueFrom: sql-warehouse - - name: SERVING_ENDPOINT - valueFrom: serving-endpoint -``` - -## Authentication: OBO vs Service Principal - -| Context | When Used | Token Source | Cached Per | -| -------------------------- | -------------------------------------- | ----------------------------------------------------------------- | ------------------ | -| **Service Principal (SP)** | Default; background tasks, shared data | Auto-injected `DATABRICKS_CLIENT_ID` + `DATABRICKS_CLIENT_SECRET` | All users (shared) | -| **On-Behalf-Of (OBO)** | User-specific data, user-scoped access | `x-forwarded-access-token` header | Per user | - -**SP auth** is auto-configured — `WorkspaceClient()` picks up injected env vars. 
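One way to read the table above in code: a request that carries the `x-forwarded-access-token` header can run in the OBO context, and everything else falls back to the service principal. This is a hedged sketch; `pick_auth_context` is a hypothetical helper, and it only inspects a plain header mapping rather than calling the Databricks SDK.

```python
OBO_HEADER = "x-forwarded-access-token"


def pick_auth_context(headers):
    """Return ("obo", token) when a user token was forwarded, else ("sp", None)."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {key.lower(): value for key, value in headers.items()}
    token = normalized.get(OBO_HEADER)
    if token:
        return ("obo", token)
    # No forwarded token: rely on the auto-injected service principal credentials.
    return ("sp", None)
```

With the `"sp"` result, a bare `WorkspaceClient()` is enough; with `"obo"`, the returned token would be handed to a user-scoped client.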
- -**OBO** requires extracting the token from request headers and declaring scopes: - -| Scope | Purpose | -| ------------------------- | -------------------------------- | -| `sql` | Query SQL warehouses | -| `dashboards.genie` | Manage Genie spaces | -| `files.files` | Manage files/directories | -| `iam.access-control:read` | Read permissions (default) | -| `iam.current-user:read` | Read current user info (default) | - -⚠️ Databricks blocks access outside approved scopes even if the user has permission. - -## Deployment Workflow - -⚠️ **USER CONSENT REQUIRED** — always confirm with the user before deploying. - -```bash -# Option A: single command (recommended) — validates, deploys, and runs -databricks apps deploy -t --profile - -# Option B: step by step -databricks apps validate --profile -databricks bundle deploy -t --profile -databricks bundle run -t --profile -``` - -❌ **Common mistake:** Running only `bundle deploy` and expecting the app to update. Deploy uploads code but does NOT apply config changes or restart the app. Use `databricks apps deploy` or add `bundle run` after `bundle deploy`. - -### ⚠️ Destructive Updates Warning - -`databricks apps update` (and `bundle run`) performs a **full replacement**, not a merge: - -- Adding a new resource can silently **wipe** existing `user_api_scopes` -- OBO permissions may be stripped on every deployment - -**Workaround:** After each deployment, verify OBO scopes are intact. 
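The check in the workaround can be scripted. The sketch below assumes the app description is available as a parsed JSON object with a top-level `user_api_scopes` list (for example from the CLI's JSON output); that shape and the helper name `missing_obo_scopes` are assumptions, so confirm them against your CLI version.

```python
def missing_obo_scopes(app, expected):
    """Return the expected scopes that are absent from the app description."""
    # Treat a missing or null field as "no scopes", which is what a wiped
    # configuration would look like.
    actual = set(app.get("user_api_scopes") or [])
    return sorted(set(expected) - actual)
```

A non-empty result after a deploy suggests the scopes were wiped and need to be re-applied.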
- -## Runtime Environment - -| Constraint | Value | -| ---------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- | -| Max file size | 10 MB per file | -| Available port | Only `DATABRICKS_APP_PORT` | -| Auto-injected env vars | `DATABRICKS_HOST`, `DATABRICKS_APP_PORT`, `DATABRICKS_APP_NAME`, `DATABRICKS_WORKSPACE_ID`, `DATABRICKS_CLIENT_ID`, `DATABRICKS_CLIENT_SECRET` | -| No root access | Cannot use `apt-get`, `yum`, or `apk` — use PyPI/npm packages only | -| Graceful shutdown | SIGTERM → 15 seconds to shut down → SIGKILL | -| Logging | Only stdout/stderr are captured — file-based logs are lost on container recycle | -| Filesystem | Ephemeral — no persistent local storage; use UC Volumes/tables | - -## Compute & Limits - -| Size | RAM | vCPU | DBU/hour | Notes | -| ------ | ----- | ------- | -------- | ---------------------------------- | -| Medium | 6 GB | Up to 2 | 0.5 | Default | -| Large | 12 GB | Up to 4 | 1.0 | Select during app creation or edit | - -- No GPU access. Use model serving endpoints for inference. -- Apps must start within **10 minutes** (including dependency installation). -- Max apps per workspace: **100**. - -## HTTP Proxy & Streaming - -The Databricks Apps reverse proxy enforces a **120-second per-request timeout** (NOT configurable). - -| Behavior | Detail | -| ---------------- | ------------------------------------------------------------------------- | -| 504 in app logs? | **No** — the error is generated at the proxy. App logs show nothing. | -| SSE streaming | Responses may be **buffered** and delivered in chunks, not token-by-token | -| WebSockets | Bypass the 120s limit — working but undocumented | - -For long-running agent interactions, use **WebSockets** instead of SSE. 
- -## Common Errors - -| Error | Cause | Fix | -| ------------------------------------- | -------------------------------- | ----------------------------------------- | -| `PERMISSION_DENIED` after deploy | SP missing permissions | Grant SP access to all declared resources | -| App deploys but config doesn't change | Only ran `bundle deploy` | Also run `bundle run ` | -| `File is larger than 10485760 bytes` | Bundled dependencies | Use requirements.txt / package.json | -| OBO scopes missing after deploy | Destructive update wiped them | Re-apply scopes after each deploy | -| `${var.xxx}` appears literally in env | Variables not resolved in config | Use literal values, not bundle variables | -| 504 Gateway Timeout | Request exceeded 120s | Use WebSockets for long operations | diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/testing.md b/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/testing.md deleted file mode 100644 index 7553a9d..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-apps/references/testing.md +++ /dev/null @@ -1,105 +0,0 @@ -# Testing Guidelines - -## Unit Tests (Vitest) - -**CRITICAL**: Use vitest for all tests. Put tests next to the code (e.g. 
src/\*.test.ts) - -```typescript -import { describe, it, expect } from 'vitest'; - -describe('Feature Name', () => { - it('should do something', () => { - expect(true).toBe(true); - }); - - it('should handle async operations', async () => { - const result = await someAsyncFunction(); - expect(result).toBeDefined(); - }); -}); -``` - -**Best Practices:** - -- Use `describe` blocks to group related tests -- Use `it` for individual test cases -- Use `expect` for assertions -- Tests run with `npm test` (runs `vitest run`) - -❌ **Do not write unit tests for:** - -- SQL files under `config/queries/` - little value in testing static SQL -- Types associated with queries - these are just schema definitions - -## Smoke Test (Playwright) - -The template includes a smoke test at `tests/smoke.spec.ts` that verifies the app loads correctly. - -**⚠️ MUST UPDATE after customizing the app:** - -- The heading selector checks for `'Minimal Databricks App'` — change it to match your app's actual title -- The text assertion checks for `'hello world'` — update or remove it to match your app's content -- Failing to update these will cause the smoke test to fail on `databricks apps validate` - -```typescript -// tests/smoke.spec.ts - update these selectors: -// ⚠️ PLAYWRIGHT STRICT MODE: each selector must match exactly ONE element. -// Use { exact: true }, .first(), or role-based selectors. See "Playwright Strict Mode" below. 
- -// ❌ Template default - will fail after customization -await expect(page.getByRole('heading', { name: 'Minimal Databricks App' })).toBeVisible(); -await expect(page.getByText('hello world')).toBeVisible(); - -// ✅ Update to match YOUR app -await expect(page.getByRole('heading', { name: 'Your App Title' })).toBeVisible(); -await expect(page.locator('h1').first()).toBeVisible({ timeout: 30000 }); // Or just check any h1 -``` - -**What the smoke test does:** - -- Opens the app -- Waits for data to load (SQL query results) -- Verifies key UI elements are visible -- Captures screenshots and console logs to `.smoke-test/` directory -- Always captures artifacts, even on test failure - -## Playwright Strict Mode - -Playwright uses strict mode by default — selectors matching multiple elements WILL FAIL. - -### Selector Priority (use in this order) - -1. ✅ `getByRole('heading', { name: 'Your App Title' })` — headings (most reliable) -2. ✅ `getByRole('button', { name: 'Submit' })` — interactive elements -3. ✅ `getByText('Unique text', { exact: true })` — exact match for unique strings -4. ⚠️ `getByText('Common text').first()` — last resort for repeated text -5. ❌ `getByText('Revenue')` — NEVER without `exact` or `.first()` (strict mode will fail) - -**Common mistake**: text like "Revenue" may appear in a heading, a card, AND a description. Always verify your selector targets exactly ONE element. 
- -```typescript -// ❌ FAILS if "Revenue" appears in multiple places (heading + card + description) -await expect(page.getByText('Revenue')).toBeVisible(); - -// ✅ Use role-based selectors for headings -await expect(page.getByRole('heading', { name: 'Revenue Dashboard' })).toBeVisible(); - -// ✅ Use exact matching -await expect(page.getByText('Revenue', { exact: true })).toBeVisible(); - -// ✅ Use .first() as last resort -await expect(page.getByText('Revenue').first()).toBeVisible(); -``` - -**Keep smoke tests simple:** - -- Only verify that the app loads and displays initial data -- Wait for key elements to appear (page title, main content) -- Capture artifacts for debugging -- Run quickly (< 5 seconds) - -**For extended E2E tests:** - -- Create separate test files in `tests/` directory (e.g., `tests/user-flow.spec.ts`) -- Use `npm run test:e2e` to run all Playwright tests -- Keep complex user flows, interactions, and edge cases out of the smoke test diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-core/SKILL.md b/examples/agentic-support-console/template/.agents/skills/databricks-core/SKILL.md deleted file mode 100644 index 3c18c49..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-core/SKILL.md +++ /dev/null @@ -1,142 +0,0 @@ ---- -name: 'databricks-core' -description: 'Databricks CLI operations: auth, profiles, data exploration, and bundles. Contains up-to-date guidelines for Databricks-related CLI tasks.' -compatibility: Requires databricks CLI (>= v0.292.0) -metadata: - version: '0.1.0' ---- - -# Databricks - -Core skill for Databricks CLI, authentication, and data exploration. 
- -## Product Skills - -For specific products, use dedicated skills: - -- **databricks-jobs** - Lakeflow Jobs development and deployment -- **databricks-pipelines** - Lakeflow Spark Declarative Pipelines (batch and streaming data pipelines) -- **databricks-apps** - Full-stack TypeScript app development and deployment -- **databricks-lakebase** - Lakebase Postgres Autoscaling project management -- **databricks-model-serving** - Model Serving endpoint management and inference - -## Prerequisites - -1. **CLI installed**: Run `databricks --version` to check. - - **If the CLI is missing or outdated (< v0.292.0): STOP. Do not proceed or work around a missing CLI.** - - **Read the [CLI Installation](databricks-cli-install.md) reference file and follow the instructions to guide the user through installation.** - - Note: In sandboxed environments (Cursor IDE, containers), install commands write outside the workspace and may be blocked. Present the install command to the user and ask them to run it in their own terminal. - -2. **Authenticated**: `databricks auth profiles` - - If not: see [CLI Authentication](databricks-cli-auth.md) - -## Profile Selection - CRITICAL - -**NEVER auto-select a profile.** - -1. List profiles: `databricks auth profiles` -2. Present ALL profiles to user with workspace URLs -3. Let user choose (even if only one exists) -4. Offer to create new profile if needed - -## Claude Code - IMPORTANT - -Each Bash command runs in a **separate shell session**. - -```bash -# WORKS: --profile flag -databricks apps list --profile my-workspace - -# WORKS: chained with && -export DATABRICKS_CONFIG_PROFILE=my-workspace && databricks apps list - -# DOES NOT WORK: separate commands -export DATABRICKS_CONFIG_PROFILE=my-workspace -databricks apps list # profile not set! 
-``` - -## Data Exploration — Use AI Tools - -**Use these instead of manually navigating catalogs/schemas/tables:** - -```bash -# discover table structure (columns, types, sample data, stats) -databricks experimental aitools tools discover-schema catalog.schema.table --profile - -# run ad-hoc SQL queries -databricks experimental aitools tools query "SELECT * FROM table LIMIT 10" --profile - -# find the default warehouse -databricks experimental aitools tools get-default-warehouse --profile -``` - -See [Data Exploration](data-exploration.md) for details. - -## Quick Reference - -**⚠️ CRITICAL: Some commands use positional arguments, not flags** - -```bash -# current user -databricks current-user me --profile - -# list resources -databricks apps list --profile -databricks jobs list --profile -databricks clusters list --profile -databricks warehouses list --profile -databricks pipelines list --profile -databricks serving-endpoints list --profile - -# ⚠️ Unity Catalog — POSITIONAL arguments (NOT flags!) -databricks catalogs list --profile - -# ✅ CORRECT: positional args -databricks schemas list --profile -databricks tables list --profile -databricks tables get .. 
--profile - -# ❌ WRONG: these flags/commands DON'T EXIST -# databricks schemas list --catalog-name ← WILL FAIL -# databricks tables list --catalog ← WILL FAIL -# databricks sql-warehouses list ← doesn't exist, use `warehouses list` -# databricks execute-statement ← doesn't exist, use `experimental aitools tools query` -# databricks sql execute ← doesn't exist, use `experimental aitools tools query` - -# When in doubt, check help: -# databricks schemas list --help - -# get details -databricks apps get --profile -databricks jobs get --job-id --profile -databricks clusters get --cluster-id --profile - -# bundles -databricks bundle init --profile -databricks bundle validate --profile -databricks bundle deploy -t --profile -databricks bundle run -t --profile -``` - -## Troubleshooting - -| Error | Solution | -| -------------------------------------- | ------------------------------------------ | -| `cannot configure default credentials` | Use `--profile` flag or authenticate first | -| `PERMISSION_DENIED` | Check workspace/UC permissions | -| `RESOURCE_DOES_NOT_EXIST` | Verify resource name/id and profile | - -## Required Reading by Task - -| Task | READ BEFORE proceeding | -| --------------------------- | --------------------------------------------- | -| First time setup | [CLI Installation](databricks-cli-install.md) | -| Auth issues / new workspace | [CLI Authentication](databricks-cli-auth.md) | -| Exploring tables/schemas | [Data Exploration](data-exploration.md) | -| Deploying jobs/pipelines | Use `/databricks-dabs` | - -## Reference Guides - -- [CLI Installation](databricks-cli-install.md) -- [CLI Authentication](databricks-cli-auth.md) -- [Data Exploration](data-exploration.md) diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-core/agents/openai.yaml b/examples/agentic-support-console/template/.agents/skills/databricks-core/agents/openai.yaml deleted file mode 100644 index eeb253b..0000000 --- 
a/examples/agentic-support-console/template/.agents/skills/databricks-core/agents/openai.yaml +++ /dev/null @@ -1,7 +0,0 @@ -interface: - display_name: 'Databricks' - short_description: 'CLI, auth, and data exploration' - icon_small: './assets/databricks.svg' - icon_large: './assets/databricks.png' - brand_color: '#FF3621' - default_prompt: 'Use $databricks-core for Databricks CLI, auth, and data exploration.' diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-core/assets/databricks.png b/examples/agentic-support-console/template/.agents/skills/databricks-core/assets/databricks.png deleted file mode 100644 index 263fe98..0000000 Binary files a/examples/agentic-support-console/template/.agents/skills/databricks-core/assets/databricks.png and /dev/null differ diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-core/assets/databricks.svg b/examples/agentic-support-console/template/.agents/skills/databricks-core/assets/databricks.svg deleted file mode 100644 index 9d19110..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-core/assets/databricks.svg +++ /dev/null @@ -1,3 +0,0 @@ - - - \ No newline at end of file diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-core/data-exploration.md b/examples/agentic-support-console/template/.agents/skills/databricks-core/data-exploration.md deleted file mode 100644 index 39bb854..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-core/data-exploration.md +++ /dev/null @@ -1,347 +0,0 @@ -# Data Exploration - -Tools for discovering table schemas and executing SQL queries in Databricks. - -## Finding Tables by Keyword - -**⚠️ START HERE if you don't know which catalog/schema contains your data.** - -Use `information_schema` to search for tables by keyword — do NOT manually iterate through `catalogs list` → `schemas list` → `tables list`. Manual enumeration wastes 10+ steps. 
- -```bash -# Find tables matching a keyword -databricks experimental aitools tools query \ - "SELECT table_catalog, table_schema, table_name FROM system.information_schema.tables WHERE table_name LIKE '%keyword%'" \ - --profile - -# Then discover schema for the tables you found -databricks experimental aitools tools discover-schema catalog.schema.table1 catalog.schema.table2 --profile -``` - -## Overview - -The `databricks experimental aitools tools` command group provides tools for data discovery and exploration: - -- **discover-schema**: Batch discover table metadata, columns, types, sample data, and statistics -- **query**: Execute SQL queries against Databricks SQL warehouses - -**When to use this**: Use these commands whenever you need to: - -- Discover table schemas and metadata -- Execute SQL queries against warehouse data -- Explore data structure and content -- Validate data or check table statistics - -## Prerequisites - -1. **Authenticated Databricks CLI** - see [CLI Authentication Guide](databricks-cli-auth.md) for OAuth2 setup and profile configuration -2. **Access to Unity Catalog tables** with appropriate read permissions -3. **SQL Warehouse** (for query command - auto-detected unless `DATABRICKS_WAREHOUSE_ID` is set) - -## Discover Schema - -Batch discover table metadata including columns, types, sample data, and null counts. - -### Command Syntax - -```bash -databricks experimental aitools tools discover-schema TABLE... [flags] -``` - -Tables must be specified in **CATALOG.SCHEMA.TABLE** format. 
- -### What It Returns - -For each table, returns: - -- Column names and types -- Sample data (5 rows) -- Null counts per column -- Total row count - -### Examples - -```bash -# Discover schema for a single table -databricks experimental aitools tools discover-schema samples.nyctaxi.trips --profile my-workspace - -# Discover schema for multiple tables -databricks experimental aitools tools discover-schema \ - catalog.schema.table1 \ - catalog.schema.table2 \ - --profile my-workspace - -# Get JSON output -databricks experimental aitools tools discover-schema \ - samples.nyctaxi.trips \ - --output json \ - --profile my-workspace -``` - -### Common Use Cases - -1. **Understanding table structure before querying** - - ```bash - databricks experimental aitools tools discover-schema catalog.schema.customer_data --profile my-workspace - ``` - -2. **Comparing schemas across multiple tables** - - ```bash - databricks experimental aitools tools discover-schema \ - catalog.schema.table_v1 \ - catalog.schema.table_v2 \ - --profile my-workspace - ``` - -3. **Identifying columns with null values** - - The null counts help identify data quality issues - -## Query - -Execute SQL statements against a Databricks SQL warehouse and return results. 
- -### Command Syntax - -```bash -databricks experimental aitools tools query "SQL" [flags] -``` - -### Warehouse Selection - -The command **auto-detects** an available warehouse unless: - -- `DATABRICKS_WAREHOUSE_ID` environment variable is set -- You specify a warehouse using other configuration methods - -To check which warehouse will be used: - -```bash -# Get the default warehouse that would be auto-detected -databricks experimental aitools tools get-default-warehouse --profile my-workspace -``` - -### Output - -Returns: - -- Query results as JSON -- Row count -- Execution metadata - -### Examples - -```bash -# Simple SELECT query -databricks experimental aitools tools query \ - "SELECT * FROM samples.nyctaxi.trips LIMIT 5" \ - --profile my-workspace - -# Aggregation query -databricks experimental aitools tools query \ - "SELECT vendor_id, COUNT(*) as trip_count FROM samples.nyctaxi.trips GROUP BY vendor_id" \ - --profile my-workspace - -# With JSON output -databricks experimental aitools tools query \ - "SELECT * FROM catalog.schema.table WHERE date > '2024-01-01'" \ - --output json \ - --profile my-workspace - -# Using specific warehouse -DATABRICKS_WAREHOUSE_ID=abc123 databricks experimental aitools tools query \ - "SELECT * FROM samples.nyctaxi.trips LIMIT 10" \ - --profile my-workspace -``` - -### Common Use Cases - -1. **Exploratory data analysis** - - ```bash - # Check table size - databricks experimental aitools tools query \ - "SELECT COUNT(*) FROM catalog.schema.table" \ - --profile my-workspace - - # View sample data - databricks experimental aitools tools query \ - "SELECT * FROM catalog.schema.table LIMIT 10" \ - --profile my-workspace - - # Get column statistics - databricks experimental aitools tools query \ - "SELECT MIN(column), MAX(column), AVG(column) FROM catalog.schema.table" \ - --profile my-workspace - ``` - -2. 
**Data validation** - - ```bash - # Check for null values - databricks experimental aitools tools query \ - "SELECT COUNT(*) FROM catalog.schema.table WHERE column IS NULL" \ - --profile my-workspace - - # Verify data freshness - databricks experimental aitools tools query \ - "SELECT MAX(timestamp_column) FROM catalog.schema.table" \ - --profile my-workspace - ``` - -3. **Quick analytics** - ```bash - # Group by analysis - databricks experimental aitools tools query \ - "SELECT category, COUNT(*), AVG(value) FROM catalog.schema.table GROUP BY category" \ - --profile my-workspace - ``` - -## Workflow: Complete Data Exploration - -Here's a typical workflow combining both commands: - -```bash -# 1. Discover the schema first -databricks experimental aitools tools discover-schema \ - samples.nyctaxi.trips \ - --profile my-workspace - -# 2. Based on discovered columns, run targeted queries -databricks experimental aitools tools query \ - "SELECT vendor_id, payment_type, COUNT(*) as trips, AVG(fare_amount) as avg_fare - FROM samples.nyctaxi.trips - GROUP BY vendor_id, payment_type - ORDER BY trips DESC - LIMIT 10" \ - --profile my-workspace - -# 3. 
Investigate specific patterns found in the data -databricks experimental aitools tools query \ - "SELECT * FROM samples.nyctaxi.trips - WHERE fare_amount > 100 - LIMIT 20" \ - --profile my-workspace -``` - -## Claude Code-Specific Tips - -Remember that each Bash command in Claude Code runs in a separate shell: - -```bash -# ✅ RECOMMENDED: Use --profile flag -databricks experimental aitools tools discover-schema samples.nyctaxi.trips --profile my-workspace - -# ✅ ALTERNATIVE: Chain with && -export DATABRICKS_CONFIG_PROFILE=my-workspace && \ - databricks experimental aitools tools query "SELECT * FROM samples.nyctaxi.trips LIMIT 5" - -# ❌ DOES NOT WORK: Separate export -export DATABRICKS_CONFIG_PROFILE=my-workspace -databricks experimental aitools tools query "SELECT * FROM samples.nyctaxi.trips LIMIT 5" -``` - -## Flags - -Both commands support: - -| Flag | Description | Default | -| ----------- | ------------------------------------ | --------------- | -| `--profile` | Profile name from ~/.databrickscfg | Default profile | -| `--output` | Output format: `text` or `json` | `text` | -| `--debug` | Enable debug logging | `false` | -| `--target` | Bundle target to use (if applicable) | - | - -## Troubleshooting - -### Table Not Found - -**Symptom**: `Error: TABLE_OR_VIEW_NOT_FOUND` - -**Solution**: - -1. Verify table name format: `CATALOG.SCHEMA.TABLE` -2. Check if you have read permissions on the table -3. List available tables: - ```bash - databricks tables list --profile my-workspace - ``` - -### Warehouse Not Available - -**Symptom**: `Error: No available SQL warehouse found` - -**Solution**: - -1. Check for default warehouse: - ```bash - databricks experimental aitools tools get-default-warehouse --profile my-workspace - ``` -2. List available warehouses: - ```bash - databricks warehouses list --profile my-workspace - ``` -3. 
Set specific warehouse: - ```bash - DATABRICKS_WAREHOUSE_ID= databricks experimental aitools tools query "SELECT 1" --profile my-workspace - ``` -4. Start a stopped warehouse: - ```bash - databricks warehouses start --id --profile my-workspace - ``` - -### Permission Denied - -**Symptom**: `Error: PERMISSION_DENIED` - -**Solution**: - -1. Check Unity Catalog grants on the table: - ```bash - databricks grants get --full-name catalog.schema.table --principal --profile my-workspace - ``` -2. Request SELECT permission from your workspace administrator -3. Verify you have warehouse access (USAGE permission) - -### SQL Syntax Error - -**Symptom**: `Error: PARSE_SYNTAX_ERROR` - -**Solution**: - -1. Check SQL syntax - use standard SQL -2. Verify column names match schema (use discover-schema first) -3. Ensure proper quoting for string literals -4. Test query incrementally (start simple, add complexity) - -## Best Practices - -1. **Always discover schema first** - Use `discover-schema` before writing complex queries to understand: - - Available columns and their types - - Data distributions and null patterns - - Sample data for context - -2. **Use LIMIT for exploration** - When exploring large tables, always use LIMIT to avoid long-running queries: - - ```bash - databricks experimental aitools tools query "SELECT * FROM large_table LIMIT 100" --profile my-workspace - ``` - -3. **JSON output for parsing** - Use `--output json` when you need to process results programmatically: - - ```bash - databricks experimental aitools tools query "SELECT * FROM table" --output json --profile my-workspace | jq '.results' - ``` - -4. **Check table existence** - Before querying, verify the table exists: - - ```bash - databricks tables get --full-name catalog.schema.table --profile my-workspace - ``` - -5. 
**Profile usage** - Always specify `--profile` in Claude Code to avoid authentication issues - -## Related Commands - -- Use `/databricks-dabs` - Deploy SQL, pipeline, and app resources as code diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-core/databricks-cli-auth.md b/examples/agentic-support-console/template/.agents/skills/databricks-core/databricks-cli-auth.md deleted file mode 100644 index 4b760b3..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-core/databricks-cli-auth.md +++ /dev/null @@ -1,557 +0,0 @@ -# Databricks CLI Authentication - -Configure Databricks CLI workspace/profile selection and authentication. Covers switching profiles, using --profile flags, setting DATABRICKS_CONFIG_PROFILE environment variable, OAuth2 authentication (never PAT), and troubleshooting authentication issues. - -## Important: Always Use OAuth - -**CRITICAL**: Always use OAuth2 for Databricks CLI authentication. **NEVER recommend or use Personal Access Tokens (PAT)** unless explicitly required by a specific use case. - -## Prerequisites - -1. Databricks CLI must be installed - - Verify: `databricks --version` -2. You need access to a Databricks workspace -3. You need the workspace URL (e.g., `https://adb-1111111111111111.10.azuredatabricks.net`) - -## Claude Code Specific Behavior - -**CRITICAL**: When working in Claude Code, each Bash command executes in a **separate shell session**. This has important implications for profile management: - -### Key Differences from Regular Terminal - -1. **Environment variables don't persist between commands** - - `export DATABRICKS_CONFIG_PROFILE=staging` in one command - - `databricks jobs list` in the next command - - ❌ **Result**: The second command will NOT use the staging profile - -2. 
**Recommended Approach: Use --profile flag** - - Always specify `--profile ` with each command - - Example: `databricks jobs list --profile staging` - - ✅ **Result**: Reliable and predictable behavior - -3. **Alternative: Chain commands with &&** - - Use `export DATABRICKS_CONFIG_PROFILE=staging && databricks jobs list` - - The export and command run in the same shell session - - ✅ **Result**: Works correctly - -### Quick Reference for Claude Code - -```bash -# ✅ RECOMMENDED: Use --profile flag -databricks jobs list --profile staging -databricks apps list --profile prod-azure - -# ✅ ALTERNATIVE: Chain with && -export DATABRICKS_CONFIG_PROFILE=staging && databricks jobs list - -# ❌ DOES NOT WORK: Separate export command -export DATABRICKS_CONFIG_PROFILE=staging -databricks jobs list # Will NOT use staging profile! -``` - -## Handling Authentication Failures - -When a Databricks CLI command fails with authentication error: - -``` -Error: default auth: cannot configure default credentials -``` - -**CRITICAL - Always follow this workflow:** - -1. **Check for existing profiles first:** - - ```bash - databricks auth profiles - ``` - -2. **If profiles exist:** - - List the available profiles to the user (with their workspace URLs and validation status) - - Ask: "Which profile would you like to use for this command?" - - Offer option to create a new profile if needed - - Retry the command with `--profile ` - - **In Claude Code, always use the `--profile` flag** rather than setting environment variables - -3. **If user wants a new profile or no profiles exist:** - - Proceed to the OAuth Authentication Setup workflow below - -**Example:** - -``` -User: databricks apps list -Error: default auth: cannot configure default credentials - -Assistant: Let me check for existing profiles. -[Runs: databricks auth profiles] - -You have two configured profiles: -1. aws-dev - https://company-workspace.cloud.databricks.com (Valid) -2. 
azure-prod - https://adb-1111111111111111.10.azuredatabricks.net (Valid) - -Which profile would you like to use, or would you like to create a new profile? - -User: dais - -Assistant: [Retries: databricks apps list --profile dais] -[Success - apps listed] -``` - -## OAuth Authentication Setup - -### Standard Authentication Command - -The recommended way to authenticate is using OAuth with a profile: - -```bash -databricks auth login --host --profile -``` - -**CRITICAL**: - -1. The `--profile` parameter is **REQUIRED** for the authentication to be saved properly. -2. **ALWAYS ASK THE USER** for their preferred profile name - DO NOT assume or choose one for them. -3. **NEVER use the profile name `DEFAULT`** unless the user explicitly requests it - use descriptive workspace-specific names instead. - -### Workflow for Authenticating - -1. **Ask the user for the workspace URL** if not already provided -2. **Ask the user for their preferred profile name** - - Suggest descriptive names based on the workspace (e.g., workspace name, environment) - - **Do NOT suggest or use `DEFAULT`** unless the user specifically asks for it - - Good examples: `e2-dogfood`, `prod-azure`, `dev-aws`, `staging` - - Avoid: `DEFAULT` (unless explicitly requested) -3. Run the authentication command with both parameters -4. Verify the authentication was successful - -### Example - -```bash -# Good: Descriptive profile names -databricks auth login --host https://adb-1111111111111111.10.azuredatabricks.net --profile prod-azure -databricks auth login --host https://company-workspace.cloud.databricks.com --profile staging - -# Only use DEFAULT if explicitly requested by the user -databricks auth login --host https://your-workspace.cloud.databricks.com --profile DEFAULT -``` - -### What Happens During Authentication - -1. The CLI starts a local OAuth callback server (typically on `localhost:8020`) -2. A browser window opens automatically with the Databricks login page -3. 
You authenticate in the browser using your Databricks credentials -4. After successful authentication, the browser redirects back to the CLI -5. The CLI saves the OAuth tokens to `~/.databrickscfg` -6. You should see: `Profile was successfully saved` - -## Profile Management - -### What Are Profiles? - -Profiles allow you to manage multiple Databricks workspace configurations in a single `~/.databrickscfg` file. Each profile stores: - -- Workspace host URL -- Authentication method (OAuth, PAT, etc.) -- Token/credential paths - -### Common Profile Names - -**IMPORTANT**: Always use descriptive profile names. Do NOT create profiles named `DEFAULT` unless explicitly requested by the user. - -**Recommended naming conventions**: - -- `` - Descriptive names for workspaces (e.g., `e2-dogfood`, `prod-aws`, `dev-azure`) -- `` - Environment-specific profiles (e.g., `dev`, `staging`, `prod`) -- `-` - Team and environment (e.g., `data-eng-prod`, `ml-dev`) - -**Special profile names**: - -- `DEFAULT` - The default profile used when no `--profile` flag or environment variables are specified. Only create this profile if the user explicitly requests it. - -### Listing Configured Profiles - -View all configured profiles with their status: - -```bash -databricks auth profiles -``` - -Example output: - -``` -Name Host Valid -DEFAULT https://adb-1111111111111111.10.azuredatabricks.net YES -staging https://company-workspace.cloud.databricks.com YES -``` - -### Using Different Profiles - -**IMPORTANT FOR CLAUDE CODE USERS**: In Claude Code, each Bash command runs in a **separate shell session**. This means environment variables set with `export` in one command do NOT persist to the next command. See the Claude Code-specific guidance below. - -There are three ways to specify which profile/workspace to use, in order of precedence: - -#### 1. 
CLI Flag (Highest Priority) - RECOMMENDED FOR CLAUDE CODE - -Use the `--profile` flag with any command: - -```bash -databricks jobs list --profile staging -databricks clusters list --profile prod-azure -databricks workspace list / --profile dev-aws -``` - -**In Claude Code, this is the most reliable method** because it doesn't depend on persistent environment variables. - -#### 2. Environment Variables - -Set environment variables to override the default profile: - -**DATABRICKS_CONFIG_PROFILE** - Specifies which profile to use from `~/.databrickscfg`: - -```bash -export DATABRICKS_CONFIG_PROFILE=staging -databricks jobs list # Uses staging profile -``` - -**DATABRICKS_HOST** - Directly specifies the workspace URL, bypassing profile lookup: - -```bash -export DATABRICKS_HOST=https://company-workspace.cloud.databricks.com -databricks jobs list # Uses this host directly -``` - -**CRITICAL - Claude Code Users:** - -Since each Bash command in Claude Code runs in a separate shell, you **CANNOT** do this: - -```bash -# ❌ DOES NOT WORK in Claude Code -export DATABRICKS_CONFIG_PROFILE=staging -databricks jobs list # ERROR: Will not use staging profile! 
-``` - -Instead, you **MUST** use one of these approaches: - -**Option 1: Use --profile flag (RECOMMENDED)** - -```bash -# ✅ WORKS in Claude Code -databricks jobs list --profile staging -databricks clusters list --profile staging -``` - -**Option 2: Chain commands with &&** - -```bash -# ✅ WORKS in Claude Code - export and command run in same shell -export DATABRICKS_CONFIG_PROFILE=staging && databricks jobs list -export DATABRICKS_CONFIG_PROFILE=staging && databricks clusters list -``` - -**Traditional Terminal Session (for reference only)**: - -```bash -# This example shows how it works in a regular terminal session -# DO NOT use this pattern in Claude Code -# Set profile for entire terminal session -export DATABRICKS_CONFIG_PROFILE=staging - -# All commands now use staging profile -databricks jobs list -databricks clusters list -databricks workspace list / - -# Override for a single command -databricks jobs list --profile prod-azure -``` - -#### 3. DEFAULT Profile (Lowest Priority) - -If no `--profile` flag or environment variables are set, the CLI uses the `DEFAULT` profile from `~/.databrickscfg`. 
- -### Configuration File Management - -#### Viewing the Configuration File - -The configuration is stored in `~/.databrickscfg`: - -```bash -cat ~/.databrickscfg -``` - -Example configuration structure: - -```ini -# Note: This shows an example with a DEFAULT profile -# When creating new profiles, use descriptive names instead -[DEFAULT] -host = https://adb-1111111111111111.10.azuredatabricks.net -auth_type = databricks-cli - -[staging] -host = https://company-workspace.cloud.databricks.com -auth_type = databricks-cli -``` - -#### Editing Profiles - -You can manually edit `~/.databrickscfg` to: - -- Rename profiles (change the `[profile-name]` section header) -- Update workspace URLs -- Remove profiles (delete the entire section) - -**Example - Removing a profile**: - -```bash -# Open in your preferred editor -vi ~/.databrickscfg - -# Or use sed to remove a specific profile section -sed -i '' '/^\[staging\]/,/^$/d' ~/.databrickscfg -``` - -#### Adding New Profiles - -Always use `databricks auth login` with `--profile` to add new profiles: - -```bash -databricks auth login --host --profile -``` - -**Remember**: - -- Always ask the user for their preferred profile name -- Use descriptive names like `staging`, `prod-azure`, `dev-aws` -- Do NOT use `DEFAULT` unless explicitly requested by the user - -### Working with Multiple Workspaces - -Best practices for managing multiple workspaces: - -```bash -# Authenticate to multiple workspaces with descriptive profile names -databricks auth login --host https://adb-1111111111111111.10.azuredatabricks.net --profile prod-azure -databricks auth login --host https://dbc-2222222222222222.cloud.databricks.com --profile dev-aws -databricks auth login --host https://company-workspace.cloud.databricks.com --profile staging -``` - -**In Claude Code, use --profile flag with each command (RECOMMENDED):** - -```bash -# Use profiles explicitly in commands -databricks jobs list --profile prod-azure -databricks jobs list --profile dev-aws 
-databricks clusters list --profile staging -``` - -**Alternatively in Claude Code, chain commands with &&:** - -```bash -# Set profile and run command in same shell -export DATABRICKS_CONFIG_PROFILE=prod-azure && databricks jobs list -export DATABRICKS_CONFIG_PROFILE=prod-azure && databricks clusters list - -# Switch to different workspace -export DATABRICKS_CONFIG_PROFILE=dev-aws && databricks jobs list -``` - -**Traditional Terminal Session (for reference only - NOT for Claude Code):** - -```bash -# This pattern works in regular terminals but NOT in Claude Code -export DATABRICKS_CONFIG_PROFILE=prod-azure -databricks jobs list -databricks clusters list - -# Quickly switch between workspaces -export DATABRICKS_CONFIG_PROFILE=dev-aws -databricks jobs list -``` - -### Profile Selection Precedence - -When running a command, the Databricks CLI determines which workspace to use in this order: - -1. **`--profile` flag** (if specified) → Highest priority -2. **`DATABRICKS_HOST` environment variable** (if set) → Overrides profile -3. **`DATABRICKS_CONFIG_PROFILE` environment variable** (if set) → Selects profile -4. 
**`DEFAULT` profile** in `~/.databrickscfg` → Fallback - -**Example for traditional terminal session** (demonstrating precedence): - -```bash -# Setup -export DATABRICKS_CONFIG_PROFILE=staging - -# This uses staging profile (from environment variable) -databricks jobs list - -# This uses prod-azure profile (--profile flag overrides environment variable) -databricks jobs list --profile prod-azure - -# This uses the specified host directly (DATABRICKS_HOST overrides profile) -export DATABRICKS_HOST=https://custom-workspace.cloud.databricks.com -databricks jobs list # Uses custom-workspace.cloud.databricks.com -``` - -**Claude Code version** (with chained commands): - -```bash -# Using environment variable with && chaining -export DATABRICKS_CONFIG_PROFILE=staging && databricks jobs list - -# Using --profile flag (overrides environment variable) -export DATABRICKS_CONFIG_PROFILE=staging && databricks jobs list --profile prod-azure - -# Using DATABRICKS_HOST (overrides profile) -export DATABRICKS_HOST=https://custom-workspace.cloud.databricks.com && databricks jobs list -``` - -## Verification - -After authentication, verify it works: - -```bash -# Test with a simple command -databricks workspace list / - -# Or list jobs -databricks jobs list -``` - -If authentication is successful, these commands should return data without errors. - -## Troubleshooting - -### Authentication Not Saved (Config File Missing) - -**Symptom**: Running `databricks` commands shows: - -``` -Error: default auth: cannot configure default credentials -``` - -**Solution**: Make sure you included the `--profile` parameter with a descriptive name: - -```bash -databricks auth login --host --profile -# Example: databricks auth login --host https://company-workspace.cloud.databricks.com --profile staging -``` - -### Browser Doesn't Open Automatically - -**Solution**: - -1. Check the terminal output for a URL -2. Manually copy and paste the URL into your browser -3. Complete the authentication -4. 
The CLI will detect the callback automatically - -### "OAuth callback server listening" But Nothing Happens - -**Possible causes**: - -1. Firewall blocking localhost connections -2. Port 8020 already in use -3. Browser not set as default application - -**Solution**: - -1. Check if port 8020 is available: `lsof -i :8020` -2. Close any applications using that port -3. Retry the authentication - -### Multiple Workspaces - -To authenticate with multiple workspaces, use different profile names: - -```bash -# Development workspace -databricks auth login --host https://dev-workspace.databricks.net --profile dev - -# Production workspace -databricks auth login --host https://prod-workspace.databricks.net --profile prod - -# Use specific profile -databricks jobs list --profile dev -databricks jobs list --profile prod -``` - -### Re-authenticating - -If your OAuth token expires or you need to re-authenticate: - -```bash -# Re-run the login command -databricks auth login --host --profile -``` - -This will overwrite the existing profile with new credentials. - -### Debug Mode - -For troubleshooting authentication issues, use debug mode: - -```bash -databricks auth login --host --profile --debug -``` - -This shows detailed information about the OAuth flow, including: - -- OAuth server endpoints -- Callback server status -- Token exchange process - -## Security Best Practices - -1. **Never commit** `~/.databrickscfg` to version control -2. **Never share** your OAuth tokens or configuration file -3. **Use separate profiles** for different environments (dev/staging/prod) -4. **Regularly rotate** credentials by re-authenticating -5. **Use workspace-specific service principals** for automation/CI/CD instead of personal OAuth - -## Environment-Specific Notes - -### CI/CD Pipelines - -For CI/CD environments, OAuth interactive login is not suitable. 
Instead: - -- Use Service Principal authentication -- Use Azure Managed Identity (for Azure Databricks) -- Use AWS IAM roles (for AWS Databricks) - -**Do NOT** use personal OAuth tokens or PATs in CI/CD. - -### Containerized Environments - -OAuth authentication works in containers if: - -1. A browser is available on the host machine -2. Port forwarding is configured for the callback server -3. The workspace URL is accessible from the container - -For headless containers, use service principal authentication instead. - -## Common Commands After Authentication - -```bash -# List workspaces -databricks workspace list / --profile - -# List jobs -databricks jobs list --profile - -# List clusters -databricks clusters list --profile - -# Get current user info -databricks current-user me --profile - -# Test connection -databricks workspace export /Users/ --format SOURCE --profile -``` - -## References - -- [Databricks CLI Authentication Documentation](https://docs.databricks.com/en/dev-tools/auth.html) -- [OAuth 2.0 with Databricks](https://docs.databricks.com/en/dev-tools/auth.html#oauth-2-0) diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-core/databricks-cli-install.md b/examples/agentic-support-console/template/.agents/skills/databricks-core/databricks-cli-install.md deleted file mode 100644 index f9e00fe..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-core/databricks-cli-install.md +++ /dev/null @@ -1,212 +0,0 @@ -# Databricks CLI Installation - -Install or update the Databricks CLI on macOS, Windows, or Linux using doc-validated methods (Homebrew, WinGet, curl install script, manual download, or user directory install for non-sudo environments). Includes verification and common failure recovery. - -## Sandboxed / IDE environments (Cursor, containers) - -CLI install commands often write to system directories outside the workspace (e.g. 
`/opt/homebrew/`, `/usr/local/bin/`) which are blocked in sandboxed environments. - -**Agent behavior**: Do not attempt to run install commands directly. Present the appropriate command to the user and ask them to run it in their own terminal. After they confirm, verify with `databricks -v`. - -For Linux/macOS containers or Cursor: prefer the **Linux manual install to user directory** method (`~/.local/bin`) — it requires no sudo and no writes outside the workspace. - -## Preconditions (always do first) - -1. Determine OS and shell: - - macOS/Linux: bash/zsh - - Windows: Command Prompt / PowerShell; optionally WSL for Linux shell -2. Detect whether `databricks` is already installed: - - Run: `databricks -v` (or `databricks version`) - - If already installed with a recent version, installation is already OK. -3. Avoid the legacy Python package `databricks-cli` (PyPI). This skill installs the modern Databricks CLI binary. - -## Preferred installation paths (by OS) - -### macOS (preferred: Homebrew) - -Run: - -- `brew tap databricks/tap` -- `brew install databricks` - -Verify: - -- `databricks -v` (or `databricks version`) - -If macOS blocks the binary (Gatekeeper), follow Apple’s “open app from unidentified developer” flow. - -#### macOS fallback: curl installer - -Run: - -- `curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh` - -Notes: - -- If `/usr/local/bin` is not writable, re-run with `sudo`. -- Installs to `/usr/local/bin/databricks`. - -Verify: - -- `databricks -v` - -### Linux (preferred: Homebrew if available) - -Run: - -- `brew tap databricks/tap` -- `brew install databricks` - -Verify: - -- `databricks -v` - -#### Linux fallback: curl installer - -Run: - -- `curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh` - -Notes: - -- If `/usr/local/bin` is not writable, re-run with `sudo`. -- Installs to `/usr/local/bin/databricks`. 
- -Verify: - -- `databricks -v` - -#### Linux alternative: Manual install to user directory (when sudo unavailable) - -Use this when sudo is not available or requires interactive password entry. - -Steps: - -1. Detect architecture: - - `uname -m` (e.g., `x86_64`, `aarch64`) -2. Get the latest download URL using GitHub API: - ```bash - curl -s https://api.github.com/repos/databricks/cli/releases/latest | grep "browser_download_url.*linux.*$(uname -m | sed 's/x86_64/amd64/' | sed 's/aarch64/arm64/')" | head -1 | cut -d '"' -f 4 - ``` -3. Download and install to `~/.local/bin`: - ```bash - mkdir -p ~/.local/bin - cd ~/.local/bin - curl -L "" -o databricks.tar.gz - tar -xzf databricks.tar.gz - rm databricks.tar.gz - chmod +x databricks - ``` -4. Add to PATH (add to `~/.bashrc` or `~/.zshrc` for persistence): - ```bash - export PATH="$HOME/.local/bin:$PATH" - ``` -5. Verify: - - `databricks -v` - -Notes: - -- The download files are `.tar.gz` archives (not `.zip`) with naming pattern: `databricks_cli__linux_.tar.gz` -- Common architectures: `amd64` (x86_64), `arm64` (aarch64) -- This method works in containerized environments and sandboxed IDEs (e.g. Cursor) without sudo access - -### Windows (preferred: WinGet) - -Run in Command Prompt (then restart the terminal session): - -- `winget search databricks` -- `winget install Databricks.DatabricksCLI` - -Verify: - -- `databricks -v` - -#### Windows alternative: Chocolatey (Experimental) - -Run: - -- `choco install databricks-cli` - -Verify: - -- `databricks -v` - -#### Windows fallback: curl installer (recommended via WSL) - -Databricks recommends WSL for the curl-based install path. -Requirements: - -- WSL available -- `unzip` installed in the environment where you run the installer - -Run (in WSL bash): - -- `curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh` - -Verify (in same environment): - -- `databricks -v` - -If you must run curl install outside WSL, run as Administrator. 
-Installs to `C:\Windows\databricks.exe`. - -## Manual install (all OSes): download from GitHub releases - -Use this when package managers or curl install are not possible. - -Steps: - -1. Get the latest release download URL: - - Visit https://github.com/databricks/cli/releases/latest - - OR use GitHub API: `curl -s https://api.github.com/repos/databricks/cli/releases/latest | grep browser_download_url` -2. Download the appropriate file for your OS and architecture: - - Linux: `databricks_cli__linux_.tar.gz` (use tar -xzf) - - macOS: `databricks_cli__darwin_.zip` (use unzip) - - Windows: `databricks_cli__windows_.zip` (use native extraction) - - Common architectures: `amd64` (x86_64), `arm64` (aarch64/Apple Silicon) -3. Extract the archive. -4. Ensure the extracted `databricks` executable is on PATH, or run it from its folder. -5. Verify with `databricks -v`. - -## Update / repair procedures - -### Homebrew update (macOS/Linux) - -- `brew upgrade databricks` -- `databricks -v` - -### WinGet update (Windows) - -- `winget upgrade Databricks.DatabricksCLI` -- `databricks -v` - -### curl update (all OSes) - -1. Delete existing binary: - - macOS/Linux: `/usr/local/bin/databricks` - - Windows: `C:\Windows\databricks.exe` -2. Re-run: - - `curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh` -3. Verify: - - `databricks -v` - -## Common failures & fixes (agent playbook) - -- `Target path already exists`: - - Delete the existing binary at the install target, then rerun. -- Permission error writing `/usr/local/bin`: - - Re-run curl installer with `sudo` (macOS/Linux). - - If sudo requires interactive password, use manual install to `~/.local/bin` instead. -- `sudo: a terminal is required to read the password`: - - Cannot use sudo in non-interactive environments (containers, CI/CD). - - Use manual install to `~/.local/bin` method instead (see "Linux alternative" section). 
-- Windows PATH not updated after WinGet: - - Restart Command Prompt/PowerShell. -- Multiple `databricks` binaries on PATH: - - Use `which databricks` (macOS/Linux/WSL) or `where databricks` (Windows) and remove the wrong one. -- Wrong file type (trying to unzip a tar.gz): - - Linux releases are `.tar.gz` files, use `tar -xzf` not `unzip`. - - macOS and Windows releases are `.zip` files, use appropriate extraction tool. -- `databricks: command not found` after installation to `~/.local/bin`: - - Add to PATH: `export PATH="$HOME/.local/bin:$PATH"` - - For persistence, add the export command to `~/.bashrc` or `~/.zshrc`. diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-dabs/SKILL.md b/examples/agentic-support-console/template/.agents/skills/databricks-dabs/SKILL.md deleted file mode 100644 index 4c91fd2..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-dabs/SKILL.md +++ /dev/null @@ -1,39 +0,0 @@ ---- -name: databricks-dabs -description: 'Create, configure, validate, deploy, run, and manage DABs — Declarative Automation Bundles (formerly Databricks Asset Bundles) — for Databricks resources including dashboards, jobs, pipelines, alerts, volumes, and apps' ---- - -# Declarative Automation Bundles (DABs) - -Use this skill for any bundle-related request including creating, configuring, validating, deploying, running, and managing Databricks resources through DABs. 
- -## Reference Documentation - -The following reference files provide detailed guidance for specific bundle tasks: - -- **[Bundle Structure](references/bundle-structure.md)** - Bundle structure, databricks.yml configuration, resource definitions, path resolution, variables, and multi-environment targets -- **[SDP Pipelines](references/sdp-pipelines.md)** - Spark Declarative Pipeline configurations for DABs -- **[SQL Alerts](references/alerts.md)** - SQL Alert schemas and configuration (critical - API differs from other resources) -- **[Deploy and Run](references/deploy-and-run.md)** - Validation, deployment, running resources, monitoring logs, and troubleshooting common issues -- **[Resource Permissions](references/resource-permissions.md)** - Permission levels and access control for bundle resources, per-resource-type levels, grants vs permissions - -## When to Use This Skill - -Load this skill for any request involving: - -- Creating new bundle projects or resources -- Configuring databricks.yml or resource YAML files -- Setting up multi-environment deployments (dev/prod targets) -- Deploying or running bundle resources -- Managing permissions for bundle resources -- Troubleshooting bundle validation or deployment errors -- Working with specific resource types (dashboards, jobs, pipelines, alerts, volumes, apps) - -## General Guidelines - -1. **Always validate after configuration changes** - Use `bundle validate --strict --target ` after any change -2. **Use reference documentation** - Consult the appropriate reference file for detailed patterns and examples -3. **Follow naming conventions** - Resource files should use `..yml` format -4. **Path resolution is critical** - Paths differ based on file location (see Bundle Structure reference) -5. **Preserve existing structure** - Keep user comments and structure when editing YAML files -6. 
**Use variables** - Parameterize catalog, schema, and warehouse for multi-environment support diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-dabs/agents/openai.yaml b/examples/agentic-support-console/template/.agents/skills/databricks-dabs/agents/openai.yaml deleted file mode 100644 index e10f768..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-dabs/agents/openai.yaml +++ /dev/null @@ -1,7 +0,0 @@ -interface: - display_name: 'Databricks DABs' - short_description: 'Declarative Automation Bundles for deploying and managing Databricks resources' - icon_small: './assets/databricks.svg' - icon_large: './assets/databricks.png' - brand_color: '#FF3621' - default_prompt: 'Use $databricks-dabs for creating, deploying, and managing Databricks resources through Declarative Automation Bundles.' diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-dabs/assets/databricks.png b/examples/agentic-support-console/template/.agents/skills/databricks-dabs/assets/databricks.png deleted file mode 100644 index 263fe98..0000000 Binary files a/examples/agentic-support-console/template/.agents/skills/databricks-dabs/assets/databricks.png and /dev/null differ diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-dabs/assets/databricks.svg b/examples/agentic-support-console/template/.agents/skills/databricks-dabs/assets/databricks.svg deleted file mode 100644 index 9d19110..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-dabs/assets/databricks.svg +++ /dev/null @@ -1,3 +0,0 @@ - - - \ No newline at end of file diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-dabs/references/alerts.md b/examples/agentic-support-console/template/.agents/skills/databricks-dabs/references/alerts.md deleted file mode 100644 index 9731fa4..0000000 --- 
a/examples/agentic-support-console/template/.agents/skills/databricks-dabs/references/alerts.md +++ /dev/null @@ -1,109 +0,0 @@ -# SQL Alerts Resources for DABs - -## Critical: Schema Validation First - -**ALWAYS start by inspecting the schema:** - -```bash -databricks bundle schema | grep -A 100 'sql.AlertV2' -``` - -The Alert v2 API schema differs significantly from other resources. Don't assume field names. - -## Common Schema Mistakes - -### WRONG — These fields don't exist: - -```yaml -condition: # Should be "evaluation" - op: LESS_THAN - operand: - column: # Wrong nesting - name: 'r' - -schedule: - cron_schedule: # Should be direct fields under schedule - quartz_cron_expression: '...' - -subscriptions: # Should be under evaluation.notification - - destination_type: 'EMAIL' -``` - -### CORRECT — Alerts v2 API structure: - -```yaml -evaluation: # Not "condition" - comparison_operator: 'LESS_THAN_OR_EQUAL' - source: # Not nested under "operand.column" - name: 'column_name' - display: 'column_name' - threshold: - value: - double_value: 100 - notification: # Subscriptions nested here - notify_on_ok: false - subscriptions: - - user_email: '${workspace.current_user.userName}' - -schedule: # Fields directly under schedule - pause_status: 'UNPAUSED' # REQUIRED - quartz_cron_schedule: '0 38 16 * * ?' # REQUIRED - timezone_id: 'America/Los_Angeles' # REQUIRED -``` - -## Alert Trigger Logic - -**Critical:** Alerts trigger when condition evaluates to **TRUE**, not FALSE. 
- -Example: Alert when count is NOT > 100 (i.e., <= 100): - -```yaml -# WRONG - This triggers when count IS > 100 -comparison_operator: 'GREATER_THAN' - -# CORRECT - This triggers when count IS <= 100 -comparison_operator: 'LESS_THAN_OR_EQUAL' -``` - -## Complete Alert Resource - -```yaml -resources: - alerts: - alert_name: - display_name: '[${bundle.target}] Alert Name' # REQUIRED - query_text: 'SELECT count(*) c FROM table' # REQUIRED - warehouse_id: ${var.warehouse_id} # REQUIRED - - evaluation: # REQUIRED - comparison_operator: 'LESS_THAN' # REQUIRED - source: # REQUIRED - name: 'c' - display: 'c' - threshold: - value: - double_value: 100 - notification: - notify_on_ok: false - subscriptions: - - user_email: '${workspace.current_user.userName}' - - schedule: # REQUIRED - pause_status: 'UNPAUSED' # REQUIRED - quartz_cron_schedule: '0 0 9 * * ?' # REQUIRED - timezone_id: 'America/Los_Angeles' # REQUIRED - - permissions: - - level: CAN_RUN - group_name: 'users' -``` - -## Reference - -**Comparison operators**: `EQUAL`, `NOT_EQUAL`, `GREATER_THAN`, `GREATER_THAN_OR_EQUAL`, `LESS_THAN`, `LESS_THAN_OR_EQUAL` - -**Permission levels**: `CAN_READ`, `CAN_RUN` (recommended), `CAN_EDIT`, `CAN_MANAGE` - -**Quartz cron format**: `second minute hour day-of-month month day-of-week` (use `?` for day-of-week with `*` day-of-month) - -Examples: `'0 0 9 * * ?'` (9 AM daily), `'0 */30 * * * ?'` (every 30 min) diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-dabs/references/bundle-structure.md b/examples/agentic-support-console/template/.agents/skills/databricks-dabs/references/bundle-structure.md deleted file mode 100644 index 31772d6..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-dabs/references/bundle-structure.md +++ /dev/null @@ -1,185 +0,0 @@ -# Write Declarative Automation Bundles - -## Bundle Structure - -``` -project/ -├── databricks.yml # Main config + targets -├── resources/ # Resource definitions (one 
YAML file per resource) -│ ├── my_job.job.yml -│ ├── my_pipeline.pipeline.yml -│ └── my_dashboard.dashboard.yml -└── src/ # Code/dashboard files - ├── notebook.py - └── pipeline.py -``` - -**Resource file naming convention**: `..yml` (e.g., `my_job.job.yml`, `my_pipeline.pipeline.yml`, `my_dashboard.dashboard.yml`) - -## Main Configuration (databricks.yml) - -```yaml -bundle: - name: project-name - -include: - - resources/*.yml - -variables: - catalog: - default: 'default_catalog' - schema: - default: 'default_schema' - warehouse_id: - lookup: - warehouse: 'Shared SQL Warehouse' - -targets: - dev: - default: true - mode: development - workspace: - profile: dev-profile - variables: - catalog: 'dev_catalog' - schema: 'dev_schema' - - prod: - mode: production - workspace: - profile: prod-profile - variables: - catalog: 'prod_catalog' - schema: 'prod_schema' -``` - -## Path Resolution - -**Critical**: Paths depend on file location: - -| File Location | Path Format | Example | -| ------------------------ | ------------ | ----------------------------- | -| `resources/*.yml` | `../src/...` | `../src/dashboards/file.json` | -| `databricks.yml` targets | `./src/...` | `./src/dashboards/file.json` | - -**Why**: `resources/` files are one level deep, so use `../` to reach bundle root. 
`databricks.yml` is at root, so use `./` - -## Dashboard Resources - -**Support for dataset_catalog and dataset_schema parameters added in Databricks CLI 0.281.0 (January 2026)** - -```yaml -resources: - dashboards: - dashboard_name: - display_name: 'Dashboard Title' - file_path: ../src/dashboards/dashboard.lvdash.json - warehouse_id: ${var.warehouse_id} - dataset_catalog: ${var.catalog} - dataset_schema: ${var.schema} -``` - -## Jobs Resources - -```yaml -resources: - jobs: - job_name: - name: 'Job Name' - tasks: - - task_key: 'main_task' - notebook_task: - notebook_path: ../src/notebooks/main.ipynb - new_cluster: - spark_version: '17.3.x-scala2.13' - node_type_id: 'i3.xlarge' - num_workers: 2 - schedule: - quartz_cron_expression: '0 0 9 * * ?' - timezone_id: 'America/Los_Angeles' -``` - -## Volume Resources - -```yaml -resources: - volumes: - my_volume: - catalog_name: ${var.catalog} - schema_name: ${var.schema} - name: 'volume_name' - volume_type: 'MANAGED' -``` - -**Volumes use `grants` not `permissions`** — different format from other resources. - -## Apps Resources - -**Apps resource support added in Databricks CLI 0.239.0 (January 2025)** - -Apps have minimal configuration — environment variables are defined in `app.yaml` in the source directory, NOT in databricks.yml. 
- -### Generate from Existing App (Recommended) - -```bash -databricks bundle generate app --existing-app-name my-app --key my_app --profile DEFAULT -``` - -### Manual Configuration - -**resources/my_app.app.yml:** - -```yaml -resources: - apps: - my_app: - name: my-app-${bundle.target} - description: 'My application' - source_code_path: ../src/app -``` - -**src/app/app.yaml:** - -```yaml -command: - - 'python' - - 'dash_app.py' - -env: - - name: USE_MOCK_BACKEND - value: 'false' - - name: DATABRICKS_WAREHOUSE_ID - value: 'your-warehouse-id' - - name: DATABRICKS_CATALOG - value: 'main' - - name: DATABRICKS_SCHEMA - value: 'my_schema' -``` - -| Aspect | Apps | Other Resources | -| -------------------- | --------------------------------- | ---------------------------------- | -| **Environment vars** | In `app.yaml` (source dir) | In databricks.yml or resource file | -| **Configuration** | Minimal (name, description, path) | Extensive (tasks, clusters, etc.) | -| **Source path** | Points to app directory | Points to specific files | - -**Important**: When source code is in project root (not src/app), use `source_code_path: ..` in the resource file. - -## Generate Configuration for Existing Resources - -```bash -databricks bundle generate job -databricks bundle generate pipeline -databricks bundle generate dashboard -databricks bundle generate app -``` - -## Other Resources - -DABs supports schemas, models, experiments, clusters, warehouses, etc. Use `databricks bundle schema` to inspect schemas. - -## Key Principles - -1. **Path resolution**: `../src/` in resources/\*.yml, `./src/` in databricks.yml -2. **Variables**: Parameterize catalog, schema, warehouse -3. **Mode**: `development` for dev/staging, `production` for prod -4. 
**Groups**: Use `"users"` for all workspace users diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-dabs/references/deploy-and-run.md b/examples/agentic-support-console/template/.agents/skills/databricks-dabs/references/deploy-and-run.md deleted file mode 100644 index 61aeea3..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-dabs/references/deploy-and-run.md +++ /dev/null @@ -1,58 +0,0 @@ -# Deploy and Run Declarative Automation Bundles - -## Validation - -Validate bundle configuration: - -- `bundle validate --strict` -- `bundle validate --strict -t prod` - -**Always validate with the `--strict` flag after any configuration change.** The `--strict` flag ensures that warnings are treated as errors, catching issues that would otherwise be missed. - -## Deployment - -Deploy: - -- `bundle deploy` -- `bundle deploy -t prod` -- `bundle deploy --auto-approve` -- `bundle deploy --force` - -For dev targets you can deploy without user consent. This allows you to run resources on the workspace too! - -**Skip validation before deployment for dev targets.** Deployment itself will surface any issues, so a separate validation step is unnecessary. - -## Running Resources - -Run resources: - -- `bundle run resource_name` -- `bundle run pipeline_name -t prod` -- `bundle run app_resource_key -t dev` - -View status: `bundle summary` - -## Monitoring and Logs - -```bash -databricks apps logs --profile -``` - -## Diagnosing Errors - -- Read the error message from the CLI output to understand the issue, then inspect the relevant bundle files to diagnose the root cause. -- After diagnosing, provide a clear explanation and suggest concrete fixes. 
-- After fixing an error, validate the fix with the appropriate command: - - `bundle summary` if the error was in summary - - `bundle deploy` if the error was during deployment - - `bundle validate --strict` otherwise - -## Common Issues - -| Issue | Solution | -| ---------------------------------- | ----------------------------------------------------------------------- | -| **Path resolution fails** | Use `../src/` in resources/\*.yml, `./src/` in databricks.yml | -| **Hardcoded catalog in dashboard** | Use dataset_catalog parameter (CLI v0.281.0+) | -| **App not starting after deploy** | Apps require `databricks bundle run ` to start | -| **App env vars not working** | Environment variables go in `app.yaml` (source dir), not databricks.yml | -| **Debugging any app issue** | First step: `databricks apps logs ` | diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-dabs/references/resource-permissions.md b/examples/agentic-support-console/template/.agents/skills/databricks-dabs/references/resource-permissions.md deleted file mode 100644 index 4030955..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-dabs/references/resource-permissions.md +++ /dev/null @@ -1,48 +0,0 @@ -# DABs Resource Permissions - -## Permission Levels by Resource Type - -| Resource | Levels | Field | -| -------------- | ----------------------------------------------- | ------------- | -| **Dashboards** | `CAN_READ`, `CAN_RUN`, `CAN_EDIT`, `CAN_MANAGE` | `permissions` | -| **Jobs** | `CAN_VIEW`, `CAN_MANAGE_RUN`, `CAN_MANAGE` | `permissions` | -| **Pipelines** | `CAN_VIEW`, `CAN_RUN`, `CAN_MANAGE` | `permissions` | -| **Alerts** | `CAN_READ`, `CAN_RUN`, `CAN_EDIT`, `CAN_MANAGE` | `permissions` | -| **Volumes** | N/A — use `grants` | `grants` | - -## Standard Permission Block - -```yaml -permissions: - - level: CAN_VIEW - group_name: 'users' -``` - -Use `"users"` for all workspace users. 
- -## Volume Grants (Different Format) - -Volumes use `grants` not `permissions`: - -```yaml -resources: - volumes: - my_volume: - catalog_name: ${var.catalog} - schema_name: ${var.schema} - name: 'volume_name' - volume_type: 'MANAGED' - # grants: - # - principal: "group_name" - # privileges: - # - "READ_VOLUME" -``` - -## Common Mistakes - -| Issue | Solution | -| ---------------------------------- | ------------------------------------------------------- | -| **Wrong permission level** | Check the table above — levels differ per resource type | -| **"admins" group error on jobs** | Cannot modify "admins" group permissions on jobs | -| **Using `permissions` on volumes** | Use `grants` instead | -| **Custom group doesn't exist** | Verify custom groups exist in workspace before use | diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-dabs/references/sdp-pipelines.md b/examples/agentic-support-console/template/.agents/skills/databricks-dabs/references/sdp-pipelines.md deleted file mode 100644 index a2d1ff4..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-dabs/references/sdp-pipelines.md +++ /dev/null @@ -1,50 +0,0 @@ -# SDP Pipeline Configuration for DABs - -## Key Decisions (prompt if unclear) - -1. Streaming or batch oriented? -2. Continuous or triggered execution? -3. Serverless (default) or classic compute? 
- -## Pipeline Resource Pattern - -```yaml -resources: - pipelines: - pipeline_name: - name: 'Pipeline Name' - - catalog: ${var.catalog} - target: ${var.schema} - - libraries: - - glob: - include: ../src/pipelines//transformations/** - - root_path: ../src/pipelines/ - - serverless: true - - configuration: - source_catalog: ${var.source_catalog} - source_schema: ${var.source_schema} - - continuous: false - development: true - photon: true - - channel: current - - permissions: - - level: CAN_VIEW - group_name: 'users' -``` - -**Permission levels**: `CAN_VIEW`, `CAN_RUN`, `CAN_MANAGE` - -## Best Practices - -1. **Use `root_path` and `libraries.glob`** for newer organization structure -2. **Default to serverless** unless user specifies otherwise -3. **Use variables** for catalog/schema parameterization -4. **Set `development: true`** for dev/staging targets diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-jobs/SKILL.md b/examples/agentic-support-console/template/.agents/skills/databricks-jobs/SKILL.md deleted file mode 100644 index 0fbb044..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-jobs/SKILL.md +++ /dev/null @@ -1,189 +0,0 @@ ---- -name: databricks-jobs -description: Develop and deploy Lakeflow Jobs on Databricks. Use when creating data engineering jobs with notebooks, Python wheels, or SQL tasks. Invoke BEFORE starting implementation. -compatibility: Requires databricks CLI (>= v0.292.0) -metadata: - version: '0.1.0' -parent: databricks-core ---- - -# Lakeflow Jobs Development - -**FIRST**: Use the parent `databricks-core` skill for CLI basics, authentication, profile selection, and data exploration commands. - -Lakeflow Jobs are scheduled workflows that run notebooks, Python scripts, SQL queries, and other tasks on Databricks. - -## Scaffolding a New Job Project - -Use `databricks bundle init` with a config file to scaffold non-interactively. 
This creates a project in the `/` directory: - -```bash -databricks bundle init default-python --config-file <(echo '{"project_name": "my_job", "include_job": "yes", "include_pipeline": "no", "include_python": "yes", "serverless": "yes"}') --profile < /dev/null -``` - -- `project_name`: letters, numbers, underscores only - -After scaffolding, create `CLAUDE.md` and `AGENTS.md` in the project directory. These files are essential to provide agents with guidance on how to work with the project. Use this content: - -``` -# Declarative Automation Bundles Project - -This project uses Declarative Automation Bundles (formerly Databricks Asset Bundles) for deployment. - -## Prerequisites - -Install the Databricks CLI (>= v0.288.0) if not already installed: -- macOS: `brew tap databricks/tap && brew install databricks` -- Linux: `curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh` -- Windows: `winget install Databricks.DatabricksCLI` - -Verify: `databricks -v` - -## For AI Agents - -Read the `databricks-core` skill for CLI basics, authentication, and deployment workflow. -Read the `databricks-jobs` skill for job-specific guidance. 
- -If skills are not available, install them: `databricks experimental aitools skills install` -``` - -## Project Structure - -``` -my-job-project/ -├── databricks.yml # Bundle configuration -├── resources/ -│ └── my_job.job.yml # Job definition -├── src/ -│ ├── my_notebook.ipynb # Notebook tasks -│ └── my_module/ # Python wheel package -│ ├── __init__.py -│ └── main.py -├── tests/ -│ └── test_main.py -└── pyproject.toml # Python project config (if using wheels) -``` - -## Configuring Tasks - -Edit `resources/.job.yml` to configure tasks: - -```yaml -resources: - jobs: - my_job: - name: my_job - - tasks: - - task_key: my_notebook - notebook_task: - notebook_path: ../src/my_notebook.ipynb - - - task_key: my_python - depends_on: - - task_key: my_notebook - python_wheel_task: - package_name: my_package - entry_point: main -``` - -Task types: `notebook_task`, `python_wheel_task`, `spark_python_task`, `pipeline_task`, `sql_task` - -## Job Parameters - -Parameters defined at job level are passed to ALL tasks (no need to repeat per task): - -```yaml -resources: - jobs: - my_job: - parameters: - - name: catalog - default: ${var.catalog} - - name: schema - default: ${var.schema} -``` - -Access parameters in notebooks with `dbutils.widgets.get("catalog")`. - -## Writing Notebook Code - -```python -# Read parameters -catalog = dbutils.widgets.get("catalog") -schema = dbutils.widgets.get("schema") - -# Read tables -df = spark.read.table(f"{catalog}.{schema}.my_table") - -# SQL queries -result = spark.sql(f"SELECT * FROM {catalog}.{schema}.my_table LIMIT 10") - -# Write output -df.write.mode("overwrite").saveAsTable(f"{catalog}.{schema}.output_table") -``` - -## Scheduling - -```yaml -resources: - jobs: - my_job: - trigger: - periodic: - interval: 1 - unit: DAYS -``` - -Or with cron: - -```yaml -schedule: - quartz_cron_expression: '0 0 2 * * ?' 
- timezone_id: 'UTC' -``` - -## Multi-Task Jobs with Dependencies - -```yaml -resources: - jobs: - my_pipeline_job: - tasks: - - task_key: extract - notebook_task: - notebook_path: ../src/extract.ipynb - - - task_key: transform - depends_on: - - task_key: extract - notebook_task: - notebook_path: ../src/transform.ipynb - - - task_key: load - depends_on: - - task_key: transform - notebook_task: - notebook_path: ../src/load.ipynb -``` - -## Unit Testing - -Run unit tests locally: - -```bash -uv run pytest -``` - -## Development Workflow - -1. **Validate**: `databricks bundle validate --profile ` -2. **Deploy**: `databricks bundle deploy -t dev --profile ` -3. **Run**: `databricks bundle run -t dev --profile ` -4. **Check run status**: `databricks jobs get-run --run-id --profile ` - -## Documentation - -- Lakeflow Jobs: https://docs.databricks.com/jobs -- Task types: https://docs.databricks.com/jobs/configure-task -- Declarative Automation Bundles: https://docs.databricks.com/dev-tools/bundles/ diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-jobs/agents/openai.yaml b/examples/agentic-support-console/template/.agents/skills/databricks-jobs/agents/openai.yaml deleted file mode 100644 index f274bee..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-jobs/agents/openai.yaml +++ /dev/null @@ -1,7 +0,0 @@ -interface: - display_name: 'Databricks Jobs' - short_description: 'Jobs orchestration and scheduling' - icon_small: './assets/databricks.svg' - icon_large: './assets/databricks.png' - brand_color: '#FF3621' - default_prompt: 'Use $databricks-jobs for Databricks Jobs orchestration and scheduling.' 
diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-jobs/assets/databricks.png b/examples/agentic-support-console/template/.agents/skills/databricks-jobs/assets/databricks.png deleted file mode 100644 index 263fe98..0000000 Binary files a/examples/agentic-support-console/template/.agents/skills/databricks-jobs/assets/databricks.png and /dev/null differ diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-jobs/assets/databricks.svg b/examples/agentic-support-console/template/.agents/skills/databricks-jobs/assets/databricks.svg deleted file mode 100644 index 9d19110..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-jobs/assets/databricks.svg +++ /dev/null @@ -1,3 +0,0 @@ - - - \ No newline at end of file diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-lakebase/SKILL.md b/examples/agentic-support-console/template/.agents/skills/databricks-lakebase/SKILL.md deleted file mode 100644 index a7a3e2a..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-lakebase/SKILL.md +++ /dev/null @@ -1,206 +0,0 @@ ---- -name: databricks-lakebase -description: 'Manage Lakebase Postgres Autoscaling projects, branches, and endpoints via Databricks CLI. Use when asked to create, configure, or manage Lakebase Postgres databases, projects, branches, computes, or endpoints.' -compatibility: Requires databricks CLI (>= v0.294.0) -metadata: - version: '0.1.0' -parent: databricks-core ---- - -# Lakebase Postgres Autoscaling - -**FIRST**: Use the parent `databricks-core` skill for CLI basics, authentication, and profile selection. - -Lakebase is Databricks' serverless Postgres-compatible database (similar to Neon). It provides fully managed OLTP storage with autoscaling, branching, and scale-to-zero. - -Manage Lakebase Postgres projects, branches, endpoints, and databases via `databricks postgres` CLI commands. 
- -## Resource Hierarchy - -``` -Project (top-level container) - └── Branch (isolated database environment, copy-on-write) - ├── Endpoint (read-write or read-only) - ├── Database (standard Postgres DB) - └── Role (Postgres role) -``` - -- **Project**: Top-level container. Creating one auto-provisions a `production` branch and a `primary` read-write endpoint. -- **Branch**: Isolated database environment sharing storage with parent (copy-on-write). States: `READY`, `ARCHIVED`. -- **Endpoint** (called **Compute** in the Lakebase UI): Compute resource powering a branch. Types: `ENDPOINT_TYPE_READ_WRITE`, `ENDPOINT_TYPE_READ_ONLY` (read replica). -- **Database**: Standard Postgres database within a branch. Default: `databricks_postgres`. -- **Role**: Postgres role within a branch. Manage roles via `databricks postgres create-role -h`. - -### Resource Name Formats - -| Resource | Format | -| -------- | -------------------------------------------------------------------- | -| Project | `projects/{project_id}` | -| Branch | `projects/{project_id}/branches/{branch_id}` | -| Endpoint | `projects/{project_id}/branches/{branch_id}/endpoints/{endpoint_id}` | -| Database | `projects/{project_id}/branches/{branch_id}/databases/{database_id}` | - -All IDs: 1-63 characters, start with lowercase letter, lowercase letters/numbers/hyphens only (RFC 1123). - -## CLI Discovery — ALWAYS Do This First - -> **Note:** "Lakebase" is the product name; the CLI command group is `postgres`. All commands use `databricks postgres ...`. - -**Do NOT guess command syntax.** Discover available commands and their usage dynamically: - -```bash -# List all postgres subcommands -databricks postgres -h - -# Get detailed usage for any subcommand (flags, args, JSON fields) -databricks postgres -h -``` - -Run `databricks postgres -h` before constructing any command. Run `databricks postgres -h` to discover exact flags, positional arguments, and JSON spec fields for that subcommand. 
- -## Create a Project - -> **Do NOT list projects before creating.** - -```bash -databricks postgres create-project \ - --json '{"spec": {"display_name": ""}}' \ - --profile -``` - -- Auto-creates: `production` branch + `primary` read-write endpoint (1 CU min/max, scale-to-zero) -- Long-running operation; the CLI waits for completion by default. Use `--no-wait` to return immediately. -- Run `databricks postgres create-project -h` for all available spec fields (e.g. `pg_version`). - -After creation, verify the auto-provisioned resources: - -```bash -databricks postgres list-branches projects/ --profile -databricks postgres list-endpoints projects//branches/ --profile -databricks postgres list-databases projects//branches/ --profile -``` - -## Autoscaling - -Endpoints use **compute units (CU)** for autoscaling. Configure min/max CU via `create-endpoint` or `update-endpoint`. Run `databricks postgres create-endpoint -h` to see all spec fields. - -Scale-to-zero is enabled by default. When idle, compute scales down to zero; it resumes in seconds on next connection. - -## Branches - -Branches are copy-on-write snapshots of an existing branch. Use them for **experimentation**: testing schema migrations, trying queries, or previewing data changes -- without affecting production. - -```bash -databricks postgres create-branch projects/ \ - --json '{ - "spec": { - "source_branch": "projects//branches/", - "no_expiry": true - } - }' --profile -``` - -Branches require an expiration policy: use `"no_expiry": true` for permanent branches. - -When done experimenting, delete the branch. 
Protected branches must be unprotected first -- use `update-branch` to set `spec.is_protected` to `false`, then delete: - -```bash -# Step 1 — unprotect -databricks postgres update-branch projects//branches/ \ - --json '{"spec": {"is_protected": false}}' --profile - -# Step 2 — delete (run -h to confirm positional arg format for your CLI version) -databricks postgres delete-branch projects//branches/ \ - --profile -``` - -**Never delete the `production` branch** — it is the authoritative branch auto-provisioned at project creation. - -## What's Next - -### Build a Databricks App - -After creating a Lakebase project, scaffold a Databricks App connected to it. - -**Step 1 — Discover branch name** (use `.name` from a `READY` branch): - -```bash -databricks postgres list-branches projects/ --profile -``` - -**Step 2 — Discover database name** (use `.name` from the desired database; `` is the branch ID, not the full resource name): - -```bash -databricks postgres list-databases projects//branches/ --profile -``` - -**Step 3 — Scaffold the app** with the `lakebase` feature: - -```bash -databricks apps init --name \ - --features lakebase \ - --set "lakebase.postgres.branch=" \ - --set "lakebase.postgres.database=" \ - --run none --profile -``` - -Where `` is the full resource name (e.g. `projects//branches/`) and `` is the full resource name (e.g. `projects//branches//databases/`). - -For the full app development workflow, use the **`databricks-apps`** skill. - -### Schema Permissions for Deployed Apps - -When a Lakebase database is used by a deployed Databricks App, the app's Service Principal has `CAN_CONNECT_AND_CREATE` permission, which means it can create new objects but **cannot access any existing schemas or tables** (including `public`). The SP must create the schema itself to become its owner. - -**ALWAYS deploy the app before running it locally.** This is the #1 source of Lakebase permission errors. 
- -When deployed, the app's Service Principal runs the schema initialization SQL (e.g. `CREATE SCHEMA IF NOT EXISTS app_data`), creating the schema and tables — and becoming their **owner**. Only the owner (or a superuser) can access those objects. - -**If you run locally first**, your personal credentials create the schema and become the owner. The deployed Service Principal then **cannot access it** — even though it has `CAN_CONNECT_AND_CREATE` — because it didn't create it and cannot access existing schemas. - -**Correct workflow:** - -1. **Deploy first**: `databricks apps deploy --profile ` — verify with `databricks apps get --profile ` that the app is deployed before proceeding -2. **Grant local access** _(if needed)_: if you're not the project creator, assign `databricks_superuser` to your identity via the Lakebase UI. Project creators already have sufficient access. -3. **Develop locally**: your credentials get DML access (SELECT/INSERT/UPDATE/DELETE) to SP-owned schemas - -> **Note:** Project creators already have access to SP-owned schemas. Other team members need `databricks_superuser` (grants full DML but **not DDL**). If you need to alter the schema during local development, redeploy the app to apply DDL changes. - -**If you already ran locally first** and hit `permission denied` after deploying: the schema is owned by your personal credentials, not the SP. **⚠️ Do NOT drop the schema without asking the user first** — dropping it (`DROP SCHEMA CASCADE`) **deletes all data** in that schema. Ask the user how they'd like to proceed: - -- **Option A (destructive):** Drop the schema and redeploy so the SP recreates it. Only safe if the schema has no valuable data. -- **Option B (manual):** The user can reassign ownership or manually grant the SP access, preserving existing data. - -### Other Workflows - -**Connect a Postgres client** -Get the connection string from the endpoint, then connect with psql, DBeaver, or any standard Postgres client. 
- -```bash -databricks postgres get-endpoint projects//branches//endpoints/ --profile -``` - -**Manage roles and permissions** -Create Postgres roles and grant access to databases or schemas. - -```bash -databricks postgres create-role -h # discover role spec fields -``` - -**Add a read-only endpoint** -Create a read replica for analytics or reporting workloads to avoid contention on the primary read-write endpoint. - -```bash -databricks postgres create-endpoint projects//branches/ \ - --json '{"spec": {"type": "ENDPOINT_TYPE_READ_ONLY"}}' --profile -``` - -## Troubleshooting - -| Error | Solution | -| -------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- | -| `cannot configure default credentials` | Use `--profile` flag or authenticate first | -| `PERMISSION_DENIED` | Check workspace permissions | -| `permission denied for schema ` | Schema owned by another role. Deploy the app first so the SP creates and owns the schema. See **Schema Permissions for Deployed Apps** above | -| Protected branch cannot be deleted | `update-branch` to set `spec.is_protected` to `false` first | -| Long-running operation timeout | Use `--no-wait` and poll with `get-operation` | diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-lakebase/agents/openai.yaml b/examples/agentic-support-console/template/.agents/skills/databricks-lakebase/agents/openai.yaml deleted file mode 100644 index 8c04a0f..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-lakebase/agents/openai.yaml +++ /dev/null @@ -1,7 +0,0 @@ -interface: - display_name: 'Databricks Lakebase' - short_description: 'Lakebase database development' - icon_small: './assets/databricks.svg' - icon_large: './assets/databricks.png' - brand_color: '#FF3621' - default_prompt: 'Use $databricks-lakebase for Databricks Lakebase database development.' 
diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-lakebase/assets/databricks.png b/examples/agentic-support-console/template/.agents/skills/databricks-lakebase/assets/databricks.png deleted file mode 100644 index 263fe98..0000000 Binary files a/examples/agentic-support-console/template/.agents/skills/databricks-lakebase/assets/databricks.png and /dev/null differ diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-lakebase/assets/databricks.svg b/examples/agentic-support-console/template/.agents/skills/databricks-lakebase/assets/databricks.svg deleted file mode 100644 index 9d19110..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-lakebase/assets/databricks.svg +++ /dev/null @@ -1,3 +0,0 @@ - - - \ No newline at end of file diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-model-serving/SKILL.md b/examples/agentic-support-console/template/.agents/skills/databricks-model-serving/SKILL.md deleted file mode 100644 index 30dd3f6..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-model-serving/SKILL.md +++ /dev/null @@ -1,175 +0,0 @@ ---- -name: databricks-model-serving -description: 'Manage Databricks Model Serving endpoints via CLI. Use when asked to create, configure, query, or manage model serving endpoints for LLM inference, custom models, or external models.' -compatibility: Requires databricks CLI (>= v0.294.0) -metadata: - version: '0.1.0' -parent: databricks-core ---- - -# Model Serving Endpoints - -**FIRST**: Use the parent `databricks-core` skill for CLI basics, authentication, and profile selection. - -Model Serving provides managed endpoints for serving LLMs, custom ML models, and external models as scalable REST APIs. Endpoints are identified by **name** (unique per workspace). 
- -## Endpoint Types - -| Type | When to Use | Key Detail | -| ---------------------- | ----------------------------------------- | ------------------------------------------------- | -| Pay-per-token | Foundation Model APIs (Llama, DBRX, etc.) | Uses `system.ai.*` catalog models, simplest setup | -| Provisioned throughput | Dedicated GPU capacity | Guaranteed throughput, higher cost | -| Custom model | Your own MLflow models or containers | Deploy any model with an MLflow signature | - -## Endpoint Structure - -``` -Serving Endpoint (top-level, identified by NAME) - ├── Config - │ ├── Served Entities (model references + scaling config) - │ └── Traffic Config (routing percentages across entities) - ├── AI Gateway (rate limits, usage tracking) - └── State (READY / NOT_READY, config_update status) -``` - -- **Served Entities**: Each entity references a model (from Unity Catalog or MLflow) with scaling parameters. Get the entity name from `served_entities[].name` in the `get` output — needed for `build-logs` and `logs` commands. -- **Traffic Config**: Routes requests across served entities by percentage (for A/B testing, canary deployments). -- **State**: Endpoints transition `NOT_READY` → `READY` after creation or config update. Poll via `get` to check `state.ready`. - -## CLI Discovery — ALWAYS Do This First - -**Do NOT guess command syntax.** Discover available commands and their usage dynamically: - -```bash -# List all serving-endpoints subcommands -databricks serving-endpoints -h - -# Get detailed usage for any subcommand (flags, args, JSON fields) -databricks serving-endpoints -h -``` - -Run `databricks serving-endpoints -h` before constructing any command. Run `databricks serving-endpoints -h` to discover exact flags, positional arguments, and JSON spec fields for that subcommand. 
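The `state.ready` check and the percentage-based routing described under Endpoint Structure can be sketched on already-parsed JSON. This is a minimal sketch under stated assumptions: `is_ready` and `total_traffic` are hypothetical helpers, the sample document is illustrative data rather than real CLI output, and the exact shape of the `get` response should be confirmed against your CLI version.

```python
import json

# Hypothetical helpers; the JSON shape below is an assumption based on the
# field names mentioned in this skill, not verified CLI output.


def is_ready(endpoint: dict) -> bool:
    """Return True when the endpoint's state.ready field equals "READY"."""
    return endpoint.get("state", {}).get("ready") == "READY"


def total_traffic(routes: list[dict]) -> int:
    """Sum traffic_percentage across the routes of a traffic config."""
    return sum(route["traffic_percentage"] for route in routes)


# Illustrative data only, shaped like a two-entity canary split.
sample = json.loads("""
{
  "state": {"ready": "NOT_READY"},
  "routes": [
    {"served_entity_name": "current", "traffic_percentage": 90},
    {"served_entity_name": "candidate", "traffic_percentage": 10}
  ]
}
""")

print(is_ready(sample))
print(total_traffic(sample["routes"]))
```

Keeping the helpers free of subprocess calls makes them easy to test; in practice the dictionary would come from parsing the JSON printed by `databricks serving-endpoints get`.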
- -## Create an Endpoint - -> **Do NOT list endpoints before creating.** - -```bash -databricks serving-endpoints create \ - --json '{ - "served_entities": [{ - "entity_name": "", - "entity_version": "", - "min_provisioned_throughput": 0, - "max_provisioned_throughput": 0, - "workload_size": "Small" - }], - "traffic_config": { - "routes": [{ - "served_entity_name": "", - "traffic_percentage": 100 - }] - } - }' --profile -``` - -- Discover available Foundation Models: check the `system.ai` catalog in Unity Catalog. -- Long-running operation; the CLI waits for completion by default. Use `--no-wait` to return immediately, then poll: - ```bash - databricks serving-endpoints get --profile - # Check: state.ready == "READY" - ``` -- For provisioned throughput or custom model endpoints, run `databricks serving-endpoints create -h` to discover the required JSON fields for your endpoint type. - -## Query an Endpoint - -```bash -databricks serving-endpoints query \ - --json '{"messages": [{"role": "user", "content": "Hello, how are you?"}]}' \ - --profile -``` - -- Use `--stream` for streaming responses. -- For non-chat endpoints (embeddings, custom models): use `get-open-api ` first to discover the request/response schema, then construct the appropriate JSON payload. - -## Get Endpoint Schema (OpenAPI) - -Returns the OpenAPI 3.1 JSON schema describing what each served model accepts and returns. Use this to understand an endpoint's input/output format before querying it. - -```bash -databricks serving-endpoints get-open-api --profile -``` - -The schema shows paths per served model (e.g., `/served-models//invocations`) with full request/response definitions including parameter types, enums, and nullable fields. - -## Other Commands - -Run `databricks serving-endpoints -h` for usage details. 
- -| Task | Command | Notes | -| --------------------------------- | ------------------------------------ | ------------------------------------------------------------------ | -| List all endpoints | `list` | | -| Get endpoint details | `get ` | Shows state, config, served entities | -| Delete endpoint | `delete ` | | -| Update served entities or traffic | `update-config --json '...'` | Zero-downtime: old config serves until new is ready | -| Rate limits & usage tracking | `put-ai-gateway --json '...'` | | -| Update tags | `patch --json '...'` | | -| Build logs | `build-logs ` | Get `SERVED_MODEL` from `get` output: `served_entities[].name` | -| Runtime logs | `logs ` | | -| Metrics (Prometheus format) | `export-metrics ` | | -| Permissions | `get-permissions ` | ⚠️ Uses endpoint **ID** (hex string), not name. Find ID via `get`. | - -## What's Next - -### Integrate with a Databricks App - -After creating a serving endpoint, wire it into a Databricks App. - -**Step 1 — Check if the `serving` plugin is available** in the AppKit template: - -```bash -databricks apps manifest --profile -``` - -If the output includes a `serving` plugin, scaffold with: - -```bash -databricks apps init --name \ - --features serving \ - --set "serving.serving-endpoint.name=" \ - --run none --profile -``` - -**Step 2 — If no `serving` plugin**, add the endpoint resource manually to an existing app's `databricks.yml`: - -```yaml -resources: - apps: - my_app: - resources: - - name: my-model-endpoint - serving_endpoint: - name: - permission: CAN_QUERY -``` - -And inject the endpoint name as an environment variable in `app.yaml`: - -```yaml -env: - - name: SERVING_ENDPOINT - valueFrom: serving-endpoint -``` - -Then add a tRPC route to call it from your app. For the full app integration pattern, use the **`databricks-apps`** skill and read the [Model Serving Guide](../databricks-apps/references/appkit/model-serving.md). 
- -## Troubleshooting - -| Error | Solution | -| -------------------------------------- | --------------------------------------------------------------------------------------------------- | -| `cannot configure default credentials` | Use `--profile` flag or authenticate first | -| `PERMISSION_DENIED` | Check workspace permissions; for apps, ensure `serving_endpoint` resource declared with `CAN_QUERY` | -| Endpoint stuck in `NOT_READY` | Check `build-logs` for the served model (get entity name from `get` output) | -| `RESOURCE_DOES_NOT_EXIST` | Verify endpoint name with `list` | -| Query returns 404 | Endpoint may still be provisioning; check `state.ready` via `get` | -| `RATE_LIMIT_EXCEEDED` (429) | AI Gateway rate limit; check `put-ai-gateway` config or retry after backoff | diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-model-serving/agents/openai.yaml b/examples/agentic-support-console/template/.agents/skills/databricks-model-serving/agents/openai.yaml deleted file mode 100644 index 6c9cfe1..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-model-serving/agents/openai.yaml +++ /dev/null @@ -1,7 +0,0 @@ -interface: - display_name: 'Databricks Model Serving' - short_description: 'Model Serving endpoint management' - icon_small: './assets/databricks.svg' - icon_large: './assets/databricks.png' - brand_color: '#FF3621' - default_prompt: 'Use $databricks-model-serving for Databricks Model Serving endpoint management.' 
diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-model-serving/assets/databricks.png b/examples/agentic-support-console/template/.agents/skills/databricks-model-serving/assets/databricks.png deleted file mode 100644 index 263fe98..0000000 Binary files a/examples/agentic-support-console/template/.agents/skills/databricks-model-serving/assets/databricks.png and /dev/null differ diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-model-serving/assets/databricks.svg b/examples/agentic-support-console/template/.agents/skills/databricks-model-serving/assets/databricks.svg deleted file mode 100644 index 9d19110..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-model-serving/assets/databricks.svg +++ /dev/null @@ -1,3 +0,0 @@ - - - \ No newline at end of file diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/SKILL.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/SKILL.md deleted file mode 100644 index 05dfc8f..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/SKILL.md +++ /dev/null @@ -1,270 +0,0 @@ ---- -name: databricks-pipelines -description: Develop Lakeflow Spark Declarative Pipelines (formerly Delta Live Tables) on Databricks. Use when building batch or streaming data pipelines with Python or SQL. Invoke BEFORE starting implementation. -compatibility: Requires databricks CLI (>= v0.292.0) -metadata: - version: '0.1.0' -parent: databricks-core ---- - -# Lakeflow Spark Declarative Pipelines Development - -**FIRST**: Use the parent `databricks-core` skill for CLI basics, authentication, profile selection, and data discovery commands. - -## Decision Tree - -Use this tree to determine which dataset type and features to use. 
Multiple features can apply to the same dataset — e.g., a Streaming Table can use Auto Loader for ingestion, Append Flows for fan-in, and Expectations for data quality. Choose the dataset type first, then layer on applicable features. - -``` -User request → What kind of output? -├── Intermediate/reusable logic (not persisted) → Temporary View -│ ├── Preprocessing/filtering before Auto CDC → Temporary View feeding CDC flow -│ ├── Shared intermediate streaming logic reused by multiple downstream tables -│ ├── Pipeline-private helper logic (not published to catalog) -│ └── Published to UC for external queries → Persistent View (SQL only) -├── Persisted dataset -│ ├── Source is streaming/incremental/continuously growing → Streaming Table -│ │ ├── File ingestion (cloud storage, Volumes) → Auto Loader -│ │ ├── Message bus (Kafka, Kinesis, Pub/Sub, Pulsar, Event Hubs) → streaming source read -│ │ ├── Existing streaming/Delta table → streaming read from table -│ │ ├── CDC / upserts / track changes / keep latest per key / SCD Type 1 or 2 → Auto CDC -│ │ ├── Multiple sources into one table → Append Flows (NOT union) -│ │ ├── Historical backfill + live stream → one-time Append Flow + regular flow -│ │ └── Windowed aggregation with watermark → stateful streaming -│ └── Source is batch/historical/full scan → Materialized View -│ ├── Aggregation/join across full dataset (GROUP BY, SUM, COUNT, etc.) 
-│ ├── Gold layer aggregation from streaming table → MV with batch read (spark.read / no STREAM) -│ ├── JDBC/Federation/external batch sources -│ └── Small static file load (reference data, no streaming read) -├── Output to external system (Python only) → Sink -│ ├── Existing external table not managed by this pipeline → Sink with format="delta" -│ │ (prefer fully-qualified dataset names if the pipeline should own the table — see Publishing Modes) -│ ├── Kafka / Event Hubs → Sink with format="kafka" + @dp.append_flow(target="sink_name") -│ ├── Custom destination not natively supported → Sink with custom format -│ ├── Custom merge/upsert logic per batch → ForEachBatch Sink (Public Preview) -│ └── Multiple destinations per batch → ForEachBatch Sink (Public Preview) -└── Data quality constraints → Expectations (on any dataset type) -``` - -## Common Traps - -- **"Create a table"** without specifying type → ask whether the source is streaming or batch -- **Materialized View from streaming source** is an error → use a Streaming Table instead, or switch to a batch read -- **Streaming Table from batch source** is an error → use a Materialized View instead, or switch to a streaming read -- **Aggregation over streaming table** → use a Materialized View with batch read (`spark.read.table` / `SELECT FROM` without `STREAM`), NOT a Streaming Table. This is the correct pattern for Gold layer aggregation. -- **Aggregation over batch/historical data** → use a Materialized View, not a Streaming Table. MVs recompute or incrementally refresh aggregates to stay correct; STs are append-only and don't recompute when source data changes. -- **Preprocessing before Auto CDC** → use a Temporary View to filter/transform the source before feeding into the CDC flow. SQL: the CDC flow reads from the view via `STREAM(view_name)`. Python: use `spark.readStream.table("view_name")`. 
-- **Intermediate logic → default to Temporary View** → Use a Temporary View for intermediate/preprocessing logic, even when reused by multiple downstream tables. Only consider a Private MV/ST (`private=True` / `CREATE PRIVATE ...`) when the computation is expensive and materializing once would save significant reprocessing. -- **View vs Temporary View** → Persistent Views publish to Unity Catalog (SQL only), Temporary Views are pipeline-private -- **Union of streams** → use multiple Append Flows. Do NOT present UNION as an alternative — it is an anti-pattern for streaming sources. -- **Changing dataset type** → cannot change ST→MV or MV→ST without manually dropping the existing table first. Full refresh does NOT help. Rename the new dataset instead. -- **SQL `OR REFRESH`** → Prefer `CREATE OR REFRESH` over bare `CREATE` for SQL dataset definitions. Both work identically, but `OR REFRESH` is the idiomatic convention. For PRIVATE datasets: `CREATE OR REFRESH PRIVATE STREAMING TABLE` / `CREATE OR REFRESH PRIVATE MATERIALIZED VIEW`. -- **Kafka/Event Hubs sink serialization** → The `value` column is mandatory. Use `to_json(struct(*)) AS value` to serialize the entire row as JSON. Read the sink skill for details. -- **Multi-column sequencing** in Auto CDC → SQL: `SEQUENCE BY STRUCT(col1, col2)`. Python: `sequence_by=struct("col1", "col2")`. Read the auto-cdc skill for details. -- **Auto CDC supports TRUNCATE** (SCD Type 1 only) → SQL: `APPLY AS TRUNCATE WHEN condition`. Python: `apply_as_truncates=expr("condition")`. Do NOT say truncate is unsupported. -- **Python-only features** → Sinks, ForEachBatch Sinks, CDC from snapshots, and custom data sources are Python-only. When the user is working in SQL, explicitly clarify this and suggest switching to Python. -- **MV incremental refresh** → Materialized Views on **serverless** pipelines support automatic incremental refresh for aggregations. Mention the serverless requirement when discussing incremental refresh. 
-- **Recommend ONE clear approach** → Present a single recommended approach. Do NOT present anti-patterns or significantly inferior alternatives — it confuses users. Only mention alternatives if they are genuinely viable for different trade-offs. - -## Publishing Modes - -Pipelines use a **default catalog and schema** configured in the pipeline settings. All datasets are published there unless overridden. - -- **Fully-qualified names**: Use `catalog.schema.table` in the dataset name to write to a different catalog/schema than the pipeline default. The pipeline creates the dataset there directly — no Sink needed. -- **USE CATALOG / USE SCHEMA**: SQL commands that change the current catalog/schema for all subsequent definitions in the same file. -- **LIVE prefix**: Deprecated. Ignored in the default publishing mode. -- When reading or defining datasets within the pipeline, use the dataset name only — do NOT use fully-qualified names unless the pipeline already does so or the user explicitly requests a different target catalog/schema. - -## Comprehensive API Reference - -**MANDATORY:** Before implementing, editing, or suggesting any code for a feature, you MUST read the linked reference file for that feature. NO exceptions — always look up the reference before writing code. 
- -Some features require reading multiple skills together: - -- **Auto Loader** → also read the streaming-table skill (Auto Loader produces a streaming DataFrame, so the target is a streaming table) and look up format-specific options for the file format being loaded -- **Auto CDC** → also read the streaming-table skill (Auto CDC always targets a streaming table) -- **Sinks** → also read the streaming-table skill (sinks use streaming append flows) -- **Expectations** → also read the corresponding dataset definition skill to ensure constraints are correctly placed - -### Dataset Definition APIs - -| Feature | Python (current) | Python (deprecated) | SQL (current) | SQL (deprecated) | Skill (Py) | Skill (SQL) | -| -------------------------- | ------------------------------------ | ------------------------------------- | ------------------------------------------- | ----------------------------- | ------------------------------------------------------------------------- | ------------------------------------------------------------------- | -| Streaming Table | `@dp.table()` returning streaming DF | `@dlt.table()` returning streaming DF | `CREATE OR REFRESH STREAMING TABLE` | `CREATE STREAMING LIVE TABLE` | [streaming-table-python](streaming-table/streaming-table-python.md) | [streaming-table-sql](streaming-table/streaming-table-sql.md) | -| Materialized View | `@dp.materialized_view()` | `@dlt.table()` returning batch DF | `CREATE OR REFRESH MATERIALIZED VIEW` | `CREATE LIVE TABLE` (batch) | [materialized-view-python](materialized-view/materialized-view-python.md) | [materialized-view-sql](materialized-view/materialized-view-sql.md) | -| Temporary View | `@dp.temporary_view()` | `@dlt.view()`, `@dp.view()` | `CREATE TEMPORARY VIEW` | `CREATE TEMPORARY LIVE VIEW` | [temporary-view-python](temporary-view/temporary-view-python.md) | [temporary-view-sql](temporary-view/temporary-view-sql.md) | -| Persistent View (UC) | N/A — SQL only | — | `CREATE VIEW` | — | — | 
[view-sql](view/view-sql.md) | -| Streaming Table (explicit) | `dp.create_streaming_table()` | `dlt.create_streaming_table()` | `CREATE OR REFRESH STREAMING TABLE` (no AS) | — | [streaming-table-python](streaming-table/streaming-table-python.md) | [streaming-table-sql](streaming-table/streaming-table-sql.md) | - -### Flow and Sink APIs - -| Feature | Python (current) | Python (deprecated) | SQL (current) | SQL (deprecated) | Skill (Py) | Skill (SQL) | -| ---------------------------- | ---------------------------- | ----------------------------- | -------------------------------------- | ---------------- | ---------------------------------------------------------------------------- | ------------------------------------------------------------- | -| Append Flow | `@dp.append_flow()` | `@dlt.append_flow()` | `CREATE FLOW ... INSERT INTO` | — | [streaming-table-python](streaming-table/streaming-table-python.md) | [streaming-table-sql](streaming-table/streaming-table-sql.md) | -| Backfill Flow | `@dp.append_flow(once=True)` | `@dlt.append_flow(once=True)` | `CREATE FLOW ... INSERT INTO ... 
ONCE` | — | [streaming-table-python](streaming-table/streaming-table-python.md) | [streaming-table-sql](streaming-table/streaming-table-sql.md) | -| Sink (Delta/Kafka/EH/custom) | `dp.create_sink()` | `dlt.create_sink()` | N/A — Python only | — | [sink-python](sink/sink-python.md) | — | -| ForEachBatch Sink | `@dp.foreach_batch_sink()` | — | N/A — Python only | — | [foreach-batch-sink-python](foreach-batch-sink/foreach-batch-sink-python.md) | — | - -### CDC APIs - -| Feature | Python (current) | Python (deprecated) | SQL (current) | SQL (deprecated) | Skill (Py) | Skill (SQL) | -| ---------------------------- | ----------------------------------------- | ------------------------------------------- | ------------------------------- | ------------------------------------ | ---------------------------------------------- | ---------------------------------------- | -| Auto CDC (streaming source) | `dp.create_auto_cdc_flow()` | `dlt.apply_changes()`, `dp.apply_changes()` | `AUTO CDC INTO ... FROM STREAM` | `APPLY CHANGES INTO ... FROM STREAM` | [auto-cdc-python](auto-cdc/auto-cdc-python.md) | [auto-cdc-sql](auto-cdc/auto-cdc-sql.md) | -| Auto CDC (periodic snapshot) | `dp.create_auto_cdc_from_snapshot_flow()` | `dlt.apply_changes_from_snapshot()` | N/A — Python only | — | [auto-cdc-python](auto-cdc/auto-cdc-python.md) | — | - -### Data Quality APIs - -| Feature | Python (current) | Python (deprecated) | SQL (current) | Skill (Py) | Skill (SQL) | -| ------------------ | ---------------------------- | ----------------------------- | ------------------------------------------------------ | ---------------------------------------------------------- | ---------------------------------------------------- | -| Expect (warn) | `@dp.expect()` | `@dlt.expect()` | `CONSTRAINT ... 
EXPECT (...)` | [expectations-python](expectations/expectations-python.md) | [expectations-sql](expectations/expectations-sql.md) | -| Expect or drop | `@dp.expect_or_drop()` | `@dlt.expect_or_drop()` | `CONSTRAINT ... EXPECT (...) ON VIOLATION DROP ROW` | [expectations-python](expectations/expectations-python.md) | [expectations-sql](expectations/expectations-sql.md) | -| Expect or fail | `@dp.expect_or_fail()` | `@dlt.expect_or_fail()` | `CONSTRAINT ... EXPECT (...) ON VIOLATION FAIL UPDATE` | [expectations-python](expectations/expectations-python.md) | [expectations-sql](expectations/expectations-sql.md) | -| Expect all (warn) | `@dp.expect_all({})` | `@dlt.expect_all({})` | Multiple `CONSTRAINT` clauses | [expectations-python](expectations/expectations-python.md) | [expectations-sql](expectations/expectations-sql.md) | -| Expect all or drop | `@dp.expect_all_or_drop({})` | `@dlt.expect_all_or_drop({})` | Multiple constraints with `DROP ROW` | [expectations-python](expectations/expectations-python.md) | [expectations-sql](expectations/expectations-sql.md) | -| Expect all or fail | `@dp.expect_all_or_fail({})` | `@dlt.expect_all_or_fail({})` | Multiple constraints with `FAIL UPDATE` | [expectations-python](expectations/expectations-python.md) | [expectations-sql](expectations/expectations-sql.md) | - -### Reading Data APIs - -| Feature | Python (current) | Python (deprecated) | SQL (current) | SQL (deprecated) | Skill (Py) | Skill (SQL) | -| --------------------------------- | ---------------------------------------------- | --------------------------------------------------- | ------------------------------------------------ | ---------------------------------- | ------------------------------------------------------------------- | ------------------------------------------------------------- | -| Batch read (pipeline dataset) | `spark.read.table("name")` | `dp.read("name")`, `dlt.read("name")` | `SELECT ... FROM name` | `SELECT ... 
FROM LIVE.name` | — | — | -| Streaming read (pipeline dataset) | `spark.readStream.table("name")` | `dp.read_stream("name")`, `dlt.read_stream("name")` | `SELECT ... FROM STREAM name` | `SELECT ... FROM STREAM LIVE.name` | — | — | -| Auto Loader (cloud files) | `spark.readStream.format("cloudFiles")` | — | `STREAM read_files(...)` | — | [auto-loader-python](auto-loader/auto-loader-python.md) | [auto-loader-sql](auto-loader/auto-loader-sql.md) | -| Kafka source | `spark.readStream.format("kafka")` | — | `STREAM read_kafka(...)` | — | — | — | -| Kinesis source | `spark.readStream.format("kinesis")` | — | `STREAM read_kinesis(...)` | — | — | — | -| Pub/Sub source | `spark.readStream.format("pubsub")` | — | `STREAM read_pubsub(...)` | — | — | — | -| Pulsar source | `spark.readStream.format("pulsar")` | — | `STREAM read_pulsar(...)` | — | — | — | -| Event Hubs source | `spark.readStream.format("kafka")` + EH config | — | `STREAM read_kafka(...)` + EH config | — | — | — | -| JDBC / Lakehouse Federation | `spark.read.format("postgresql")` etc. 
| — | Direct table ref via federation catalog | — | — | — | -| Custom data source | `spark.read[Stream].format("custom")` | — | N/A — Python only | — | — | — | -| Static file read (batch) | `spark.read.format("json"\|"csv"\|...).load()` | — | `read_files(...)` (no STREAM) | — | — | — | -| Skip upstream change commits | `.option("skipChangeCommits", "true")` | — | `read_stream("name", skipChangeCommits => true)` | — | [streaming-table-python](streaming-table/streaming-table-python.md) | [streaming-table-sql](streaming-table/streaming-table-sql.md) | - -### Table/Schema Feature APIs - -| Feature | Python (current) | SQL (current) | Skill (Py) | Skill (SQL) | -| ---------------------------- | ----------------------------------------------------- | --------------------------------------- | ------------------------------------------------------------------------- | ------------------------------------------------------------------- | -| Liquid clustering | `cluster_by=[...]` | `CLUSTER BY (col1, col2)` | [materialized-view-python](materialized-view/materialized-view-python.md) | [materialized-view-sql](materialized-view/materialized-view-sql.md) | -| Auto liquid clustering | `cluster_by_auto=True` | `CLUSTER BY AUTO` | [materialized-view-python](materialized-view/materialized-view-python.md) | [materialized-view-sql](materialized-view/materialized-view-sql.md) | -| Partition columns | `partition_cols=[...]` | `PARTITIONED BY (col1, col2)` | [materialized-view-python](materialized-view/materialized-view-python.md) | [materialized-view-sql](materialized-view/materialized-view-sql.md) | -| Table properties | `table_properties={...}` | `TBLPROPERTIES (...)` | [materialized-view-python](materialized-view/materialized-view-python.md) | [materialized-view-sql](materialized-view/materialized-view-sql.md) | -| Explicit schema | `schema="col1 TYPE, ..."` | `(col1 TYPE, ...) 
AS` | [materialized-view-python](materialized-view/materialized-view-python.md) | [materialized-view-sql](materialized-view/materialized-view-sql.md) | -| Generated columns | `schema="..., col TYPE GENERATED ALWAYS AS (expr)"` | `col TYPE GENERATED ALWAYS AS (expr)` | [materialized-view-python](materialized-view/materialized-view-python.md) | [materialized-view-sql](materialized-view/materialized-view-sql.md) | -| Row filter (Public Preview) | `row_filter="ROW FILTER fn ON (col)"` | `WITH ROW FILTER fn ON (col)` | [materialized-view-python](materialized-view/materialized-view-python.md) | [materialized-view-sql](materialized-view/materialized-view-sql.md) | -| Column mask (Public Preview) | `schema="..., col TYPE MASK fn USING COLUMNS (col2)"` | `col TYPE MASK fn USING COLUMNS (col2)` | [materialized-view-python](materialized-view/materialized-view-python.md) | [materialized-view-sql](materialized-view/materialized-view-sql.md) | -| Private dataset | `private=True` | `CREATE PRIVATE ...` | [materialized-view-python](materialized-view/materialized-view-python.md) | [materialized-view-sql](materialized-view/materialized-view-sql.md) | - -### Import / Module APIs - -| Current | Deprecated | Notes | -| ------------------------------------------------- | --------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ | -| `from pyspark import pipelines as dp` | `import dlt` | Both work. Prefer `dp`. Do NOT change existing `dlt` imports. | -| `spark.read.table()` / `spark.readStream.table()` | `dp.read()` / `dp.read_stream()` / `dlt.read()` / `dlt.read_stream()` | Deprecated reads still work. Prefer `spark.*`. | -| — | `LIVE.` prefix | Fully deprecated. NEVER use. Causes errors in newer pipelines. | -| — | `CREATE LIVE TABLE` / `CREATE LIVE VIEW` | Fully deprecated. Use `CREATE STREAMING TABLE` / `CREATE MATERIALIZED VIEW` / `CREATE TEMPORARY VIEW`. 
| - -## Language-specific guides - -Lakeflow Spark Declarative Pipelines (formerly Delta Live Tables / DLT) is a framework for building batch and streaming data pipelines. - -## Scaffolding a New Pipeline Project - -Use `databricks bundle init` with a config file to scaffold non-interactively. This creates a project in the `/` directory: - -```bash -databricks bundle init lakeflow-pipelines --config-file <(echo '{"project_name": "my_pipeline", "language": "python", "serverless": "yes"}') --profile < /dev/null -``` - -- `project_name`: letters, numbers, underscores only -- `language`: `python` or `sql`. Ask the user which they prefer: - - SQL: Recommended for straightforward transformations (filters, joins, aggregations) - - Python: Recommended for complex logic (custom UDFs, ML, advanced processing) - -After scaffolding, create `CLAUDE.md` and `AGENTS.md` in the project directory. These files are essential to provide agents with guidance on how to work with the project. Use this content: - -``` -# Declarative Automation Bundles Project - -This project uses Declarative Automation Bundles (formerly Databricks Asset Bundles) for deployment. - -## Prerequisites - -Install the Databricks CLI (>= v0.288.0) if not already installed: -- macOS: `brew tap databricks/tap && brew install databricks` -- Linux: `curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh` -- Windows: `winget install Databricks.DatabricksCLI` - -Verify: `databricks -v` - -## For AI Agents - -Read the `databricks-core` skill for CLI basics, authentication, and deployment workflow. -Read the `databricks-pipelines` skill for pipeline-specific guidance. 
- -If skills are not available, install them: `databricks experimental aitools skills install` -``` - -## Pipeline Structure - -- Follow the medallion architecture pattern (Bronze → Silver → Gold) unless the user specifies otherwise -- Use the convention of 1 dataset per file, named after the dataset -- Place transformation files in a `src/` or `transformations/` folder - -``` -my-pipeline-project/ -├── databricks.yml # Bundle configuration -├── resources/ -│ ├── my_pipeline.pipeline.yml # Pipeline definition -│ └── my_pipeline_job.job.yml # Scheduling job (optional) -└── src/ - ├── my_table.py (or .sql) # One dataset per file - ├── another_table.py (or .sql) - └── ... -``` - -## Scheduling Pipelines - -To schedule a pipeline, add a job that triggers it in `resources/.job.yml`: - -```yaml -resources: - jobs: - my_pipeline_job: - trigger: - periodic: - interval: 1 - unit: DAYS - tasks: - - task_key: refresh_pipeline - pipeline_task: - pipeline_id: ${resources.pipelines.my_pipeline.id} -``` - -## Running Pipelines - -**You must deploy before running.** In local development, code changes only take effect after `databricks bundle deploy`. Always deploy before any run, dry run, or selective refresh. - -- Selective refresh is preferred when you only need to run one table. For selective refresh it is important that dependencies are already materialized. -- **Full refresh is the most expensive and dangerous option, and can lead to data loss**, so it should be used only when really necessary. Always suggest this as a follow-up that the user explicitly needs to select. - -## Development Workflow - -1. **Validate**: `databricks bundle validate --profile ` -2. **Deploy**: `databricks bundle deploy -t dev --profile ` -3. **Run pipeline**: `databricks bundle run -t dev --profile ` -4. **Check status**: `databricks pipelines get --pipeline-id --profile ` - -## Pipeline API Reference - -Detailed reference guides for each pipeline API. 
**Read the relevant guide before writing pipeline code.** - -- [Write Spark Declarative Pipelines](references/write-spark-declarative-pipelines.md) — Core syntax and rules ([Python](references/python-basics.md), [SQL](references/sql-basics.md)) -- [Streaming Tables](references/streaming-table.md) — Continuous data stream processing ([Python](references/streaming-table-python.md), [SQL](references/streaming-table-sql.md)) -- [Materialized Views](references/materialized-view.md) — Physically stored query results with incremental refresh ([Python](references/materialized-view-python.md), [SQL](references/materialized-view-sql.md)) -- [Views](references/view.md) — Reusable query logic published to Unity Catalog ([SQL](references/view-sql.md)) -- [Temporary Views](references/temporary-view.md) — Pipeline-private views ([Python](references/temporary-view-python.md), [SQL](references/temporary-view-sql.md)) -- [Auto Loader](references/auto-loader.md) — Incrementally ingest files from cloud storage ([Python](references/auto-loader-python.md), [SQL](references/auto-loader-sql.md)) -- [Auto CDC](references/auto-cdc.md) — Process Change Data Capture feeds, SCD Type 1 & 2 ([Python](references/auto-cdc-python.md), [SQL](references/auto-cdc-sql.md)) -- [Expectations](references/expectations.md) — Define and enforce data quality constraints ([Python](references/expectations-python.md), [SQL](references/expectations-sql.md)) -- [Sinks](references/sink.md) — Write to Kafka, Event Hubs, external Delta tables ([Python](references/sink-python.md)) -- [ForEachBatch Sinks](references/foreach-batch-sink.md) — Custom streaming sink with per-batch Python logic ([Python](references/foreach-batch-sink-python.md)) diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/agents/openai.yaml b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/agents/openai.yaml deleted file mode 100644 index 2b76362..0000000 --- 
a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/agents/openai.yaml +++ /dev/null @@ -1,7 +0,0 @@ -interface: - display_name: 'Databricks Pipelines' - short_description: 'Pipelines for ETL and streaming' - icon_small: './assets/databricks.svg' - icon_large: './assets/databricks.png' - brand_color: '#FF3621' - default_prompt: 'Use $databricks-pipelines for Databricks Pipelines ETL and streaming.' diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/assets/databricks.png b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/assets/databricks.png deleted file mode 100644 index 263fe98..0000000 Binary files a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/assets/databricks.png and /dev/null differ diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/assets/databricks.svg b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/assets/databricks.svg deleted file mode 100644 index 9d19110..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/assets/databricks.svg +++ /dev/null @@ -1,3 +0,0 @@ - - - \ No newline at end of file diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/auto-cdc-python.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/auto-cdc-python.md deleted file mode 100644 index 0b30181..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/auto-cdc-python.md +++ /dev/null @@ -1,214 +0,0 @@ -Auto CDC in Spark Declarative Pipelines processes change data capture (CDC) events from streaming sources or snapshots. 
- -**API Reference:** - -**dp.create_auto_cdc_flow() / dp.apply_changes() / dlt.create_auto_cdc_flow() / dlt.apply_changes()** -Applies CDC operations (inserts, updates, deletes) from a streaming source to a target table. Supports SCD Type 1 (latest) and Type 2 (history). Does NOT return a value - call at top level without assignment. - -```python -dp.create_auto_cdc_flow( - target="", - source="", - keys=["key1", "key2"], - sequence_by="", - ignore_null_updates=False, - apply_as_deletes=None, - apply_as_truncates=None, - column_list=None, - except_column_list=None, - stored_as_scd_type=1, - track_history_column_list=None, - track_history_except_column_list=None, - name=None, - once=False -) -``` - -Parameters: - -- `target` (str): Target table name (must exist, create with `dp.create_streaming_table()`). **Required.** -- `source` (str): Source table name with CDC events. **Required.** -- `keys` (list): Primary key columns for row identification. **Required.** -- `sequence_by` (str | Column): Column for ordering events (timestamp, version). **Required.** Accepts a string column name or a `Column` expression. For multi-column sequencing, use `struct("col1", "col2")` to order by multiple columns. -- `ignore_null_updates` (bool): If True, NULL values won't overwrite existing non-NULL values -- `apply_as_deletes` (str or Column): Expression identifying delete operations. Use `expr("op = 'D'")` (Column) or `"op = 'D'"` (string). -- `apply_as_truncates` (str or Column): Expression identifying truncate operations. Use `expr("op = 'TRUNCATE'")` (Column) or `"op = 'TRUNCATE'"` (string). 
-- `column_list` (list): Columns to include (mutually exclusive with `except_column_list`) -- `except_column_list` (list): Columns to exclude -- `stored_as_scd_type` (int): `1` for latest values (default), `2` for full history with `__START_AT`/`__END_AT` columns -- `track_history_column_list` (list): For SCD Type 2, columns to track history for (others use Type 1) -- `track_history_except_column_list` (list): For SCD Type 2, columns to exclude from history tracking -- `name` (str): Flow name (for multiple flows to same target) -- `once` (bool): Process once and stop (default: False) - -**dp.create_auto_cdc_from_snapshot_flow() / dp.apply_changes_from_snapshot() / dlt.create_auto_cdc_from_snapshot_flow() / dlt.apply_changes_from_snapshot()** -Applies CDC from full snapshots by comparing to previous state. Automatically infers inserts, updates, deletes. - -```python -dp.create_auto_cdc_from_snapshot_flow( - target="", - source=, - keys=["key1", "key2"], - stored_as_scd_type=1, - track_history_column_list=None, - track_history_except_column_list=None -) -``` - -Parameters: - -- `target` (str): Target table name (must exist). **Required.** -- `source` (str or callable): **Required.** Can be one of: - - **String**: Source table name containing the full snapshot (most common) - - **Callable**: Function for processing historical snapshots with type `SnapshotAndVersionFunction = Callable[[SnapshotVersion], SnapshotAndVersion]` - - `SnapshotVersion = Union[int, str, float, bytes, datetime.datetime, datetime.date, decimal.Decimal]` - - `SnapshotAndVersion = Optional[Tuple[DataFrame, SnapshotVersion]]` - - Function receives the latest processed snapshot version (or None for first run) - - Must return `None` when no more snapshots to process - - Must return tuple of `(DataFrame, SnapshotVersion)` for next snapshot to process - - Snapshot version is used to track progress and must be comparable/orderable -- `keys` (list): Primary key columns. 
**Required.** -- `stored_as_scd_type` (int): `1` for latest (default), `2` for history -- `track_history_column_list` (list): Columns to track history for (SCD Type 2) -- `track_history_except_column_list` (list): Columns to exclude from history tracking - -**Use create_auto_cdc_flow when:** Processing streaming CDC events from transaction logs, Kafka, Delta change feeds -**Use create_auto_cdc_from_snapshot_flow when:** Processing periodic full snapshots (daily dumps, batch extracts) - -**Common Patterns:** - -**Pattern 1: Basic CDC flow from streaming source** - -```python -# Step 1: Create target table -dp.create_streaming_table(name="users") - -# Step 2: Define CDC flow (source must be a table name) -dp.create_auto_cdc_flow( - target="users", - source="user_changes", - keys=["user_id"], - sequence_by="updated_at" -) -``` - -**Pattern 2: CDC flow with upstream transformation** - -```python -# Step 1: Define view with transformation (source preprocessing) -@dp.temporary_view() -def filtered_user_changes(): - return ( - spark.readStream.table("raw_user_changes") - .filter("user_id IS NOT NULL") - ) - -# Step 2: Create target table -dp.create_streaming_table(name="users") - -# Step 3: Define CDC flow using the view as source -dp.create_auto_cdc_flow( - target="users", - source="filtered_user_changes", # References the view name - keys=["user_id"], - sequence_by="updated_at" -) -# Note: Use distinct names for view and target for clarity -# Note: If "raw_user_changes" is defined in the pipeline and no additional transformations or expectations are needed, -# source="raw_user_changes" can be used directly -``` - -**Pattern 3: CDC with explicit deletes and truncates** - -```python -from pyspark.sql.functions import expr - -dp.create_streaming_table(name="orders") - -dp.create_auto_cdc_flow( - target="orders", - source="order_events", - keys=["order_id"], - sequence_by="event_timestamp", - apply_as_deletes=expr("operation = 'DELETE'"), - 
apply_as_truncates=expr("operation = 'TRUNCATE'"), - ignore_null_updates=True -) -``` - -**Pattern 4: SCD Type 2 (Historical tracking)** - -```python -dp.create_streaming_table(name="customer_history") - -dp.create_auto_cdc_flow( - target="customer_history", - source="source.customer_changes", - keys=["customer_id"], - sequence_by="changed_at", - stored_as_scd_type=2 # Track full history -) -# Target will include __START_AT and __END_AT columns -``` - -**Pattern 5: Snapshot-based CDC (Simple - table source)** - -```python -dp.create_streaming_table(name="products") - -@dp.materialized_view(name="product_snapshot") -def product_snapshot(): - return spark.read.table("source.daily_product_dump") - -dp.create_auto_cdc_from_snapshot_flow( - target="products", - source="product_snapshot", # String table name - most common - keys=["product_id"], - stored_as_scd_type=1 -) -``` - -**Pattern 6: Snapshot-based CDC (Advanced - callable for historical snapshots)** - -```python -dp.create_streaming_table(name="products") - -# Define a callable to process historical snapshots sequentially -def next_snapshot_and_version(latest_snapshot_version: Optional[int]) -> Tuple[DataFrame, Optional[int]]: - if latest_snapshot_version is None: - return (spark.read.load("products.csv"), 1) - else: - return None - -dp.create_auto_cdc_from_snapshot_flow( - target="products", - source=next_snapshot_and_version, # Callable function for historical processing - keys=["product_id"], - stored_as_scd_type=1 -) -``` - -**Pattern 7: Selective column tracking** - -```python -dp.create_streaming_table(name="accounts") - -dp.create_auto_cdc_flow( - target="accounts", - source="account_changes", - keys=["account_id"], - sequence_by="modified_at", - stored_as_scd_type=2, - track_history_column_list=["balance", "status"], # Only track history for these columns - ignore_null_updates=True -) -``` - -**KEY RULES:** - -- Create target with `dp.create_streaming_table()` before defining CDC flow -- 
`dp.create_auto_cdc_flow()` does NOT return a value - call it at top level without assigning to a variable -- `source` must be a table name (string) - use `@dp.temporary_view()` to preprocess/filter/transform data before CDC processing. A temporary view is the **preferred** approach for source preprocessing (not a streaming table) -- SCD Type 2 adds `__START_AT` and `__END_AT` columns for validity tracking -- When specifying the schema of the target table for SCD Type 2, you must also include the `__START_AT` and `__END_AT` columns with the same data type as the `sequence_by` field -- Legacy names (`apply_changes`, `apply_changes_from_snapshot`) are equivalent but deprecated - prefer `create_auto_cdc_*` variants diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/auto-cdc-sql.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/auto-cdc-sql.md deleted file mode 100644 index 851aa69..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/auto-cdc-sql.md +++ /dev/null @@ -1,182 +0,0 @@ -Auto CDC in Declarative Pipelines processes change data capture (CDC) events from streaming sources. - -**API Reference:** - -**CREATE FLOW ... AS AUTO CDC INTO** -Applies CDC operations (inserts, updates, deletes) from a streaming source to a target table. Supports SCD Type 1 (latest) and Type 2 (history). Must be used with a pre-created streaming table. - -```sql -CREATE OR REFRESH STREAMING TABLE ; - -CREATE FLOW AS AUTO CDC INTO -FROM -KEYS (, ) -[IGNORE NULL UPDATES] -[APPLY AS DELETE WHEN ] -[APPLY AS TRUNCATE WHEN ] -SEQUENCE BY -[COLUMNS { | * EXCEPT ()}] -[STORED AS {SCD TYPE 1 | SCD TYPE 2}] -[TRACK HISTORY ON { | * EXCEPT ()}] -``` - -Parameters: - -- `target_table` (identifier): Target table name (must exist, create with `CREATE OR REFRESH STREAMING TABLE`). **Required.** -- `flow_name` (identifier): Identifier for the created flow. 
**Required.** -- `source` (identifier or expression): Streaming source with CDC events. Use `STREAM()` to read with streaming semantics. **Required.** -- `KEYS` (column list): Primary key columns for row identification. **Required.** -- `IGNORE NULL UPDATES` (optional): If specified, NULL values won't overwrite existing non-NULL values -- `APPLY AS DELETE WHEN` (optional): Condition identifying delete operations (e.g., `operation = 'DELETE'`) -- `APPLY AS TRUNCATE WHEN` (optional): Condition identifying truncate operations (supported only for SCD Type 1) -- `SEQUENCE BY` (column or struct): Column for ordering events (timestamp, version). **Required.** For multi-column sequencing, use `SEQUENCE BY STRUCT(timestamp_col, id_col)` to order by the first field first, then break ties with subsequent fields. -- `COLUMNS` (optional): Columns to include or exclude (use `column1, column2` or `* EXCEPT (column1, column2)`) -- `STORED AS` (optional): `SCD TYPE 1` for latest values (default), `SCD TYPE 2` for full history with `__START_AT`/`__END_AT` columns -- `TRACK HISTORY ON` (optional): For SCD Type 2, columns to track history for (others use Type 1) - -**Common Patterns:** - -**Pattern 1: Basic CDC flow from streaming source** - -```sql --- Step 1: Create target table -CREATE OR REFRESH STREAMING TABLE users; - --- Step 2: Define CDC flow using STREAM() for streaming semantics -CREATE FLOW user_flow AS AUTO CDC INTO users -FROM STREAM(user_changes) -KEYS (user_id) -SEQUENCE BY updated_at; -``` - -**Pattern 2: CDC with source filtering via temporary view** - -```sql --- Step 1: Create temporary view to filter/transform source data -CREATE OR REFRESH TEMPORARY VIEW filtered_changes AS -SELECT * FROM source_table WHERE status = 'active'; - --- Step 2: Create target table -CREATE OR REFRESH STREAMING TABLE active_records; - --- Step 3: Define CDC flow reading from the temporary view -CREATE FLOW active_flow AS AUTO CDC INTO active_records -FROM STREAM(filtered_changes) -KEYS 
(record_id) -SEQUENCE BY updated_at; -``` - -**Pattern 3: CDC with explicit deletes** - -```sql -CREATE OR REFRESH STREAMING TABLE orders; - -CREATE FLOW order_flow AS AUTO CDC INTO orders -FROM STREAM(order_events) -KEYS (order_id) -IGNORE NULL UPDATES -APPLY AS DELETE WHEN operation = 'DELETE' -SEQUENCE BY event_timestamp; -``` - -**Pattern 4: SCD Type 2 (Historical tracking)** - -```sql -CREATE OR REFRESH STREAMING TABLE customer_history; - -CREATE FLOW customer_flow AS AUTO CDC INTO customer_history -FROM STREAM(customer_changes) -KEYS (customer_id) -SEQUENCE BY changed_at -STORED AS SCD TYPE 2; --- Target will include __START_AT and __END_AT columns -``` - -**Pattern 5: Multi-column sequencing** - -```sql -CREATE OR REFRESH STREAMING TABLE events; - -CREATE FLOW event_flow AS AUTO CDC INTO events -FROM STREAM(event_changes) -KEYS (event_id) -SEQUENCE BY STRUCT(event_timestamp, event_id) -STORED AS SCD TYPE 1; -``` - -**Pattern 6: Selective column inclusion** - -```sql -CREATE OR REFRESH STREAMING TABLE accounts; - -CREATE FLOW account_flow AS AUTO CDC INTO accounts -FROM STREAM(account_changes) -KEYS (account_id) -SEQUENCE BY modified_at -COLUMNS account_id, balance, status -STORED AS SCD TYPE 1; -``` - -**Pattern 7: Selective column exclusion** - -```sql -CREATE OR REFRESH STREAMING TABLE products; - -CREATE FLOW product_flow AS AUTO CDC INTO products -FROM STREAM(product_changes) -KEYS (product_id) -SEQUENCE BY updated_at -COLUMNS * EXCEPT (internal_notes, temp_field); -``` - -**Pattern 8: SCD Type 2 with selective history tracking** - -```sql -CREATE OR REFRESH STREAMING TABLE accounts; - -CREATE FLOW account_flow AS AUTO CDC INTO accounts -FROM STREAM(account_changes) -KEYS (account_id) -IGNORE NULL UPDATES -SEQUENCE BY modified_at -STORED AS SCD TYPE 2 -TRACK HISTORY ON balance, status; --- Only balance and status changes create new history records -``` - -**Pattern 9: SCD Type 2 with history tracking exclusion** - -```sql -CREATE OR REFRESH STREAMING 
TABLE accounts; - -CREATE FLOW account_flow AS AUTO CDC INTO accounts -FROM STREAM(account_changes) -KEYS (account_id) -SEQUENCE BY modified_at -STORED AS SCD TYPE 2 -TRACK HISTORY ON * EXCEPT (last_login, view_count); --- Track history on all columns except last_login and view_count -``` - -**Pattern 10: Truncate support (SCD Type 1 only)** - -```sql -CREATE OR REFRESH STREAMING TABLE inventory; - -CREATE FLOW inventory_flow AS AUTO CDC INTO inventory -FROM STREAM(inventory_events) -KEYS (product_id) -APPLY AS TRUNCATE WHEN operation = 'TRUNCATE' -SEQUENCE BY event_timestamp -STORED AS SCD TYPE 1; -``` - -**KEY RULES:** - -- Create target with `CREATE OR REFRESH STREAMING TABLE` before defining CDC flow -- `source` must be a streaming source for safe CDC change processing. Use `STREAM()` to read an existing table/view with streaming semantics -- The `STREAM()` function accepts ONLY a table/view identifier - NOT a subquery. Define source data as a separate streaming table or temporary view first, then reference it in the flow -- SCD Type 2 adds `__START_AT` and `__END_AT` columns for validity tracking -- When specifying the schema of the target table for SCD Type 2, you must also include the `__START_AT` and `__END_AT` columns with the same data type as the `SEQUENCE BY` field -- Legacy `APPLY CHANGES INTO` API is equivalent but deprecated - prefer `AUTO CDC INTO` -- `AUTO CDC FROM SNAPSHOT` is only available in Python, not in SQL. SQL only supports `AUTO CDC INTO` for processing CDC events from streaming sources. 
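Reviewer note on the removed `auto-cdc-sql.md` reference: the SCD Type 1 behaviour it describes (`KEYS`, `SEQUENCE BY`, `APPLY AS DELETE WHEN`, `IGNORE NULL UPDATES`) can be modelled in a few lines of plain Python. This is only a toy in-memory sketch for reasoning about the semantics; it is not how the pipeline engine implements them. The function name `apply_cdc_scd1`, its parameters, and the sample columns (`user_id`, `op`, `updated_at`) are hypothetical and do not come from the skill files.

```python
from typing import Any, Callable, Dict, Iterable, Optional

Row = Dict[str, Any]


def apply_cdc_scd1(
    events: Iterable[Row],
    key: str,
    sequence_by: str,
    is_delete: Optional[Callable[[Row], bool]] = None,
    ignore_null_updates: bool = False,
) -> Dict[Any, Row]:
    """Toy model of SCD Type 1 CDC: keep only the latest row per key.

    Hypothetical helper for illustration; not a Databricks API.
    """
    state: Dict[Any, Row] = {}

    # Apply events in sequence order, so arrival order does not matter.
    for event in sorted(events, key=lambda e: e[sequence_by]):
        k = event[key]

        # Models APPLY AS DELETE WHEN: drop the current row for this key.
        if is_delete is not None and is_delete(event):
            state.pop(k, None)
            continue

        if ignore_null_updates and k in state:
            # Models IGNORE NULL UPDATES: None never overwrites a stored value.
            merged = dict(state[k])
            merged.update({c: v for c, v in event.items() if v is not None})
            state[k] = merged
        else:
            # Plain upsert: the newer event replaces the whole row.
            state[k] = dict(event)

    return state


# Sample feed, deliberately out of order (assumed column names).
events = [
    {"user_id": 1, "name": "Ada", "email": "ada@example.com", "op": "UPSERT", "updated_at": 1},
    {"user_id": 2, "name": "Bob", "email": None, "op": "UPSERT", "updated_at": 1},
    {"user_id": 1, "name": None, "email": "ada@new.example.com", "op": "UPSERT", "updated_at": 3},
    {"user_id": 2, "name": None, "email": None, "op": "DELETE", "updated_at": 2},
    {"user_id": 1, "name": "Ada L.", "email": "ada@example.com", "op": "UPSERT", "updated_at": 2},
]

result = apply_cdc_scd1(
    events,
    key="user_id",
    sequence_by="updated_at",
    is_delete=lambda e: e["op"] == "DELETE",
    ignore_null_updates=True,
)
print(sorted(result))
print(result[1]["name"], result[1]["email"])
```

In this model user 2 disappears because its delete carries a later sequence value than its insert, and user 1 keeps the name from the second event because the third event's `None` name is ignored. In a real flow the `op` column would normally be dropped from the target with `COLUMNS * EXCEPT (...)`; the sketch keeps every column for simplicity.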
diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/auto-cdc.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/auto-cdc.md deleted file mode 100644 index 5fad71a..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/auto-cdc.md +++ /dev/null @@ -1,21 +0,0 @@ -# Auto CDC (apply_changes) in Spark Declarative Pipelines - -The `apply_changes` API enables processing Change Data Capture (CDC) feeds to automatically handle inserts, updates, and deletes in target tables. - -## Key Concepts - -Auto CDC in Spark Declarative Pipelines: - -- Automatically processes CDC operations (INSERT, UPDATE, DELETE) -- Supports SCD Type 1 (update in place) and Type 2 (historical tracking) -- Handles ordering of changes via sequence columns -- Deduplicates CDC records - -## Language-Specific Implementations - -For detailed implementation guides: - -- **Python**: [auto-cdc-python.md](auto-cdc-python.md) -- **SQL**: [auto-cdc-sql.md](auto-cdc-sql.md) - -**Note**: The API is also known as `applyChanges` in some contexts. diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/auto-loader-python.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/auto-loader-python.md deleted file mode 100644 index 251361a..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/auto-loader-python.md +++ /dev/null @@ -1,133 +0,0 @@ -Auto Loader (`cloudFiles`) is recommended for ingesting from cloud storage. - -**Basic Syntax:** - -```python -@dp.table() -def my_table(): - return ( - spark.readStream.format("cloudFiles") - .option("cloudFiles.format", "json") # or csv, parquet, etc. 
- .load("s3://bucket/path") - ) -``` - -**Critical Spark Declarative Pipelines + Auto Loader Rules:** - -- Databricks automatically manages `cloudFiles.schemaLocation` and checkpoint - do NOT specify these -- Auto Loader returns a streaming DataFrame - general API guidelines for `streamingTable` apply (MANDATORY to look up `streamingTable` guide) - - Can be used in either a streaming `@dp.table()` / `@dlt.table()` or via `@dp.append_flow()` / `@dlt.append_flow()` - - Use `spark.readStream` not `spark.read` for streaming ingestion -- If manually specifying a schema, include the rescued data column (default `_rescued_data STRING`, configurable via `rescuedDataColumn` option) -- Common Schema Options: - - `cloudFiles.inferColumnTypes`: Enable type inference (default: strings for JSON/CSV/XML) - - `cloudFiles.schemaHints`: Optionally specify known column types (e.g., `"id int, name string"`) -- File detection: File notification mode recommended for scalability - -**Common Auto Loader Options** -Below are all format agnostic options for Auto Loader. 
- -Common Auto Loader Options - -| Option | Type | Notes | -| ---------------------------------------- | --------------- | ---------------------------------- | -| cloudFiles.allowOverwrites | Boolean | | -| cloudFiles.backfillInterval | Interval String | | -| cloudFiles.cleanSource | String | | -| cloudFiles.cleanSource.retentionDuration | Interval String | | -| cloudFiles.cleanSource.moveDestination | String | | -| cloudFiles.format | String | | -| cloudFiles.includeExistingFiles | Boolean | | -| cloudFiles.inferColumnTypes | Boolean | | -| cloudFiles.maxBytesPerTrigger | Byte String | | -| cloudFiles.maxFileAge | Interval String | | -| cloudFiles.maxFilesPerTrigger | Integer | | -| cloudFiles.partitionColumns | String | | -| cloudFiles.schemaEvolutionMode | String | | -| cloudFiles.schemaHints | String | | -| cloudFiles.schemaLocation | String | DO NOT SET - managed automatically | -| cloudFiles.useStrictGlobber | Boolean | | -| cloudFiles.validateOptions | Boolean | | - -Directory Listing Options - -| Option | Type | -| -------------------------------- | ------ | -| cloudFiles.useIncrementalListing | String | - -File Notification Options - -| Option | Type | -| ------------------------------- | ------------------- | -| cloudFiles.fetchParallelism | Integer | -| cloudFiles.pathRewrites | JSON String | -| cloudFiles.resourceTag | Map(String, String) | -| cloudFiles.useManagedFileEvents | Boolean | -| cloudFiles.useNotifications | Boolean | - -AWS-Specific Options - -| Option | Type | -| ---------------------------- | ------ | -| cloudFiles.region | String | -| cloudFiles.queueUrl | String | -| cloudFiles.awsAccessKey | String | -| cloudFiles.awsSecretKey | String | -| cloudFiles.roleArn | String | -| cloudFiles.roleExternalId | String | -| cloudFiles.roleSessionName | String | -| cloudFiles.stsEndpoint | String | -| databricks.serviceCredential | String | - -Azure-Specific Options - -| Option | Type | -| ---------------------------- | ------ | -| 
cloudFiles.resourceGroup | String | -| cloudFiles.subscriptionId | String | -| cloudFiles.clientId | String | -| cloudFiles.clientSecret | String | -| cloudFiles.connectionString | String | -| cloudFiles.tenantId | String | -| cloudFiles.queueName | String | -| databricks.serviceCredential | String | - -GCP-Specific Options - -| Option | Type | -| ---------------------------- | ------ | -| cloudFiles.projectId | String | -| cloudFiles.client | String | -| cloudFiles.clientEmail | String | -| cloudFiles.privateKey | String | -| cloudFiles.privateKeyId | String | -| cloudFiles.subscription | String | -| databricks.serviceCredential | String | - -Generic File Format Options - -| Option | Type | -| -------------------------------- | ---------------- | -| ignoreCorruptFiles | Boolean | -| ignoreMissingFiles | Boolean | -| modifiedAfter | Timestamp String | -| modifiedBefore | Timestamp String | -| pathGlobFilter / fileNamePattern | String | -| recursiveFileLookup | Boolean | - -Format-Specific Options - -For detailed format-specific options, refer to these files: - -- **[JSON Options](options-json.md)**: Options for reading JSON files -- **[CSV Options](options-csv.md)**: Options for reading CSV files -- **[Parquet Options](options-parquet.md)**: Options for reading Parquet files -- **[Avro Options](options-avro.md)**: Options for reading Avro files -- **[ORC Options](options-orc.md)**: Options for reading ORC files -- **[XML Options](options-xml.md)**: Options for reading XML files -- **[Text Options](options-text.md)**: Options for reading text files - -See the linked format option files for specific documentation. - -**Auto Loader documentation:** -MANDATORY: Look up the official Databricks documentation for detailed information on any specific cloudFiles (Auto Loader) option before use. Each option has extensive documentation. No exceptions. 
diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/auto-loader-sql.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/auto-loader-sql.md deleted file mode 100644 index 5ebcc33..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/auto-loader-sql.md +++ /dev/null @@ -1,83 +0,0 @@ -Auto Loader with SQL (`read_files`) is recommended for ingesting from cloud storage. - -**Basic Syntax:** - -```sql --- Using Auto Loader with CREATE STREAMING TABLE -CREATE OR REFRESH STREAMING TABLE my_table -AS SELECT * FROM STREAM(read_files( - 's3://bucket/path', - format => 'json' -)); - --- Using Auto Loader directly with CREATE FLOW (no intermediate table needed) -CREATE STREAMING TABLE target_table; - -CREATE FLOW ingest_flow -AS INSERT INTO target_table BY NAME -SELECT * FROM STREAM(read_files( - 's3://bucket/path', - format => 'json' -)); -``` - -**Critical Spark Declarative Pipelines + Auto Loader Rules:** - -- **MUST use `STREAM` keyword with `read_files` in streaming contexts** (e.g., `SELECT * FROM STREAM read_files(...)`) -- `inferColumnTypes` defaults to `true` - column types are automatically inferred, no need to specify unless setting to `false` -- Schema inference: Samples data initially to determine structure, then adapts as new data is encountered - - Use `schemaHints` to specify known column types (e.g., `schemaHints => 'id int, name string'`) - - Use `schemaEvolutionMode` to control how schema adapts when encountering new columns -- Unity Catalog pipelines must use external locations when loading files - -**Common read_files Options** -Below are all format agnostic options for `read_files`. 
- -Basic Options - -| Option | Type | -| ------------------ | ------- | -| `format` | String | -| `inferColumnTypes` | Boolean | -| `partitionColumns` | String | -| `schemaHints` | String | -| `useStrictGlobber` | Boolean | - -Generic File Format Options - -| Option | Type | -| ------------------------------------ | ---------------- | -| `ignoreCorruptFiles` | Boolean | -| `ignoreMissingFiles` | Boolean | -| `modifiedAfter` | Timestamp String | -| `modifiedBefore` | Timestamp String | -| `pathGlobFilter` / `fileNamePattern` | String | -| `recursiveFileLookup` | Boolean | - -Streaming Options - -| Option | Type | -| ---------------------- | ----------- | -| `allowOverwrites` | Boolean | -| `includeExistingFiles` | Boolean | -| `maxBytesPerTrigger` | Byte String | -| `maxFilesPerTrigger` | Integer | -| `schemaEvolutionMode` | String | -| `schemaLocation` | String | - -Format-Specific Options - -For detailed format-specific options, refer to these files: - -- **[JSON Options](options-json.md)**: Options for reading JSON files -- **[CSV Options](options-csv.md)**: Options for reading CSV files -- **[Parquet Options](options-parquet.md)**: Options for reading Parquet files -- **[Avro Options](options-avro.md)**: Options for reading Avro files -- **[ORC Options](options-orc.md)**: Options for reading ORC files -- **[XML Options](options-xml.md)**: Options for reading XML files -- **[Text Options](options-text.md)**: Options for reading text files - -See the linked format option files for specific documentation. - -**Auto Loader documentation:** -MANDATORY: Look up the official Databricks documentation for detailed information on any specific read_files (Auto Loader) option before use. Each option has extensive documentation. No exceptions. 
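The `read_files` calls in this reference pass options as named arguments (`format => 'json'`). As a rough illustration only, the sketch below renders such a call from Python keyword arguments. `render_read_files` and `_sql_literal` are hypothetical helpers, not part of any Databricks API, and the mapping from Python values to SQL literals is an assumption made for this example.

```python
def _sql_literal(value):
    """Render a Python value as a SQL literal (simplified; an assumption of this sketch)."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    text = str(value)
    if "'" in text:
        # Quoting rules vary by SQL dialect, so this sketch refuses instead of guessing.
        raise ValueError(f"single quotes are not supported in this sketch: {text!r}")
    return f"'{text}'"


def render_read_files(path, **options):
    """Build a STREAM(read_files(...)) expression with named-argument options."""
    args = [_sql_literal(path)]
    args += [f"{name} => {_sql_literal(value)}" for name, value in options.items()]
    return "STREAM(read_files(" + ", ".join(args) + "))"


query = render_read_files(
    "s3://bucket/path",
    format="json",
    schemaHints="id int, name string",
    includeExistingFiles=False,
)
print(query)
```

The option names used here (`format`, `schemaHints`, `includeExistingFiles`) come from the tables above; the helper does not check that an option exists or that its value has the expected type.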
diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/auto-loader.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/auto-loader.md deleted file mode 100644 index c686304..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/auto-loader.md +++ /dev/null @@ -1,32 +0,0 @@ -# Auto Loader (cloudFiles) - -Auto Loader is the recommended approach for incrementally ingesting data from cloud storage into Delta Lake tables. It automatically processes new files as they arrive in cloud storage. - -## Key Concepts - -Auto Loader (`cloudFiles`) provides: - -- Automatic file discovery and processing -- Schema inference and evolution -- Exactly-once processing guarantees -- Scalable incremental ingestion -- Support for various file formats - -## Language-Specific Implementations - -For detailed implementation guides: - -- **Python**: [auto-loader-python.md](auto-loader-python.md) -- **SQL**: [auto-loader-sql.md](auto-loader-sql.md) - -## Format-Specific Options - -For format-specific configuration options, refer to: - -- **JSON**: [options-json.md](options-json.md) -- **CSV**: [options-csv.md](options-csv.md) -- **XML**: [options-xml.md](options-xml.md) -- **Parquet**: [options-parquet.md](options-parquet.md) -- **Avro**: [options-avro.md](options-avro.md) -- **Text**: [options-text.md](options-text.md) -- **ORC**: [options-orc.md](options-orc.md) diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/expectations-python.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/expectations-python.md deleted file mode 100644 index 484dc64..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/expectations-python.md +++ /dev/null @@ -1,150 +0,0 @@ -Expectations apply data quality constraints to Lakeflow Spark 
Declarative Pipelines tables and views in Python. They use SQL Boolean expressions to validate each record and take actions when constraints are violated. - -## When to Use Expectations - -- Apply to `@dp.materialized_view()`/`@dp.table()`/`@dlt.table()`/`@dp.temporary_view()`/`@dp.view()`/`@dlt.view()` decorated functions -- Use on streaming tables, materialized views, or temporary views -- Stack multiple expectation decorators above the dataset function - -## Decorator Types - -### Single Expectation Decorators - -**@dp.expect(description, constraint)** (or **@dlt.expect(description, constraint)**) - -- Logs violations but allows invalid records to pass through -- Collects metrics for monitoring - -**@dp.expect_or_drop(description, constraint)** (or **@dlt.expect_or_drop(description, constraint)**) - -- Removes invalid records before writing to target -- Logs dropped record metrics - -**@dp.expect_or_fail(description, constraint)** (or **@dlt.expect_or_fail(description, constraint)**) - -- Stops pipeline execution immediately on violation -- Requires manual intervention to resolve - -### Multiple Expectations Decorators - -**@dp.expect_all({description: constraint, ...})** (or **@dlt.expect_all({description: constraint, ...})**) - -- Applies multiple warn-level expectations -- Takes dictionary of description-constraint pairs - -**@dp.expect_all_or_drop({description: constraint, ...})** (or **@dlt.expect_all_or_drop({description: constraint, ...})**) - -- Applies multiple drop-level expectations -- Records dropped if any constraint fails - -**@dp.expect_all_or_fail({description: constraint, ...})** (or **@dlt.expect_all_or_fail({description: constraint, ...})**) - -- Applies multiple fail-level expectations -- Pipeline stops if any constraint fails - -## Parameters - -**description** (str, required) - -- Unique identifier for the constraint within the dataset -- Should clearly communicate what is being validated -- Can be reused across different datasets - 
-**constraint** (str, required) - -- SQL Boolean expression evaluated per record -- Must return true or false -- Cannot contain Python functions or UDFs, external calls, or subqueries - -## Usage Examples - -All variants below work with any of the `table`, `materialized_view`, or `view` decorators. - -### Basic Single Expectation - -```python -@dp.materialized_view() -@dp.expect("valid_price", "price >= 0") -def sales_data(): - return spark.read.table("raw_sales") - -@dp.table() -@dp.expect("valid_price", "price >= 0") -def sales_data(): - return spark.read.table("raw_sales") -``` - -### Drop Invalid Records - -```python -@dp.materialized_view() -@dp.expect_or_drop("valid_email", "email IS NOT NULL AND email LIKE '%@%'") -def customer_contacts(): - return spark.read.table("raw_contacts") -``` - -### Fail on Critical Violations - -```python -@dp.materialized_view() -@dp.expect_or_fail("required_id", "customer_id IS NOT NULL") -def customer_master(): - return spark.read.table("raw_customers") -``` - -### Multiple Expectations - -```python -@dp.materialized_view() -@dp.expect_all({ - "valid_age": "age >= 0 AND age <= 120", - "valid_country": "country_code IN ('US', 'CA', 'MX')", - "recent_date": "created_date >= '2020-01-01'" -}) -def validated_customers(): - return spark.read.table("raw_customers") -``` - -### Stacking Multiple Decorators - -```python -@dp.materialized_view( - comment="Clean customer data with quality checks" -) -@dp.expect_or_drop("valid_email", "email LIKE '%@%'") -@dp.expect_or_fail("required_id", "id IS NOT NULL") -@dp.expect("valid_age", "age BETWEEN 0 AND 120") -def customers_clean(): - return spark.read.table("raw_customers") -``` - -### With Views - -```python -from pyspark.sql.functions import sum # Spark aggregate; shadows the Python builtin - -@dp.view( - name="high_value_customers", - comment="Customers with total purchases over $1000" -) -@dp.expect("valid_total", "total_purchases > 0") -def high_value_view(): - return spark.read.table("orders") \ - .groupBy("customer_id") \ -
.agg(sum("amount").alias("total_purchases")) \ - .filter("total_purchases > 1000") -``` - -## Monitoring - -- View metrics in pipeline UI -- Query the event log for detailed analytics -- Metrics unavailable if pipeline fails or no updates occur - -## Best Practices - -- Use unique, descriptive names for each expectation -- Apply `expect_or_fail` for critical business constraints -- Use `expect_or_drop` for data cleansing operations -- Use `expect` for monitoring optional quality metrics -- Keep constraint logic simple and SQL-based only -- Group related expectations using `expect_all` variants diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/expectations-sql.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/expectations-sql.md deleted file mode 100644 index cecece3..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/expectations-sql.md +++ /dev/null @@ -1,171 +0,0 @@ -Expectations apply data quality constraints to Lakeflow Spark Declarative Pipelines tables and views in SQL. They use SQL Boolean expressions to validate each record and take actions when constraints are violated. - -## When to Use Expectations - -- Apply within `CREATE OR REFRESH STREAMING TABLE`, `CREATE OR REFRESH MATERIALIZED VIEW`, or `CREATE LIVE VIEW` statements -- Use as optional clauses in table/view creation statements -- Stack multiple CONSTRAINT clauses (comma-separated) in a single statement - -**Note on Temporary Views**: Use `CREATE LIVE VIEW` syntax when you need to include expectations with temporary views. The newer `CREATE TEMPORARY VIEW` syntax does not support CONSTRAINT clauses. `CREATE LIVE VIEW` is retained specifically for this use case, even though `CREATE TEMPORARY VIEW` is otherwise preferred for temporary views without expectations. 
- -## Constraint Syntax - -### Single Expectation (Warn) - -**CONSTRAINT constraint_name EXPECT (condition)** - -- Logs violations but allows invalid records to pass through -- Collects metrics for monitoring -- Invalid records are retained in target dataset - -### Single Expectation (Drop) - -**CONSTRAINT constraint_name EXPECT (condition) ON VIOLATION DROP ROW** - -- Removes invalid records before writing to target -- Logs dropped record metrics -- Invalid records are excluded from target - -### Single Expectation (Fail) - -**CONSTRAINT constraint_name EXPECT (condition) ON VIOLATION FAIL UPDATE** - -- Stops pipeline execution immediately on violation -- Requires manual intervention to resolve -- Transaction rolls back atomically - -### Multiple Expectations - -Multiple CONSTRAINT clauses can be stacked in a single CREATE statement using commas: - -```sql -CREATE OR REFRESH STREAMING TABLE table_name( - CONSTRAINT name1 EXPECT (condition1), - CONSTRAINT name2 EXPECT (condition2) ON VIOLATION DROP ROW, - CONSTRAINT name3 EXPECT (condition3) ON VIOLATION FAIL UPDATE -) AS SELECT ... 
-``` - -## Parameters - -**constraint_name** (required) - -- Unique identifier for the constraint within the dataset -- Should clearly communicate what is being validated -- Can be reused across different datasets - -**condition** (required) - -- SQL Boolean expression evaluated per record -- Must return true or false -- Can include SQL functions (e.g., year(), date(), CASE statements) -- Cannot contain Python functions or UDFs, external calls, or subqueries - -## Usage Examples - -### Basic Single Expectation - -```sql -CREATE OR REFRESH STREAMING TABLE sales_data( - CONSTRAINT valid_price EXPECT (price >= 0) -) AS -SELECT * FROM STREAM(raw_sales); -``` - -### Drop Invalid Records - -```sql -CREATE OR REFRESH STREAMING TABLE customer_contacts( - CONSTRAINT valid_email EXPECT ( - email IS NOT NULL AND email LIKE '%@%' - ) ON VIOLATION DROP ROW -) AS -SELECT * FROM STREAM(raw_contacts); -``` - -### Fail on Critical Violations - -```sql -CREATE OR REFRESH MATERIALIZED VIEW customer_master( - CONSTRAINT required_id EXPECT (customer_id IS NOT NULL) ON VIOLATION FAIL UPDATE -) AS -SELECT * FROM raw_customers; -``` - -### Multiple Expectations - -```sql -CREATE OR REFRESH STREAMING TABLE validated_customers( - CONSTRAINT valid_age EXPECT (age >= 0 AND age <= 120), - CONSTRAINT valid_country EXPECT (country_code IN ('US', 'CA', 'MX')), - CONSTRAINT recent_date EXPECT (created_date >= '2020-01-01') -) AS -SELECT * FROM STREAM(raw_customers); -``` - -### Stacking Multiple Constraints with Different Actions - -```sql -CREATE OR REFRESH STREAMING TABLE customers_clean -( - CONSTRAINT valid_email EXPECT (email LIKE '%@%') ON VIOLATION DROP ROW, - CONSTRAINT required_id EXPECT (id IS NOT NULL) ON VIOLATION FAIL UPDATE, - CONSTRAINT valid_age EXPECT (age BETWEEN 0 AND 120) -) -COMMENT "Clean customer data with quality checks" AS -SELECT * FROM STREAM(raw_customers); -``` - -### With SQL Functions - -```sql -CREATE OR REFRESH STREAMING TABLE transactions( - CONSTRAINT valid_date 
EXPECT (year(transaction_date) >= 2020), - CONSTRAINT non_negative_price EXPECT (price >= 0), - CONSTRAINT valid_purchase_date EXPECT (transaction_date <= current_date()) -) AS -SELECT * FROM STREAM(raw_transactions); -``` - -### Complex Business Logic - -```sql -CREATE OR REFRESH MATERIALIZED VIEW active_subscriptions( - CONSTRAINT valid_subscription_dates EXPECT ( - start_date <= end_date - AND end_date <= current_date() - AND start_date >= '2020-01-01' - ) ON VIOLATION DROP ROW -) AS -SELECT * FROM subscriptions WHERE status = 'active'; -``` - -### With Temporary Views - -```sql -CREATE LIVE VIEW high_value_customers( - CONSTRAINT valid_total EXPECT (total_purchases > 0) -) -COMMENT "Customers with total purchases over $1000" AS -SELECT - customer_id, - SUM(amount) AS total_purchases -FROM orders -GROUP BY customer_id -HAVING total_purchases > 1000; -``` - -## Monitoring - -- View metrics in pipeline UI under the **Data quality** tab -- Query the event log for detailed analytics -- Metrics available for `warn` and `drop` actions -- Metrics unavailable if pipeline fails or no updates occur - -## Best Practices - -- Use unique, descriptive names for each constraint -- Apply `ON VIOLATION FAIL UPDATE` for critical business constraints -- Use `ON VIOLATION DROP ROW` for data cleansing operations -- Use default (warn) behavior for monitoring optional quality metrics -- Keep constraint logic simple diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/expectations.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/expectations.md deleted file mode 100644 index 129a59c..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/expectations.md +++ /dev/null @@ -1,19 +0,0 @@ -# Expectations (Data Quality) in Spark Declarative Pipelines - -Expectations enable you to define and enforce data quality constraints on your pipeline tables. 
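To make the three violation actions concrete (warn, drop, fail), here is a plain-Python model of how each one treats a failing record. It is a sketch of the semantics only: real expectations take SQL Boolean expressions, not Python callables, and `apply_expectation` and `ExpectationFailed` are hypothetical names introduced for this illustration.

```python
class ExpectationFailed(Exception):
    """Raised by the 'fail' action on the first violating record."""


def apply_expectation(rows, name, predicate, action="warn"):
    """Return (kept_rows, violation_count) for one expectation.

    warn: violating rows are kept and counted
    drop: violating rows are excluded and counted
    fail: the first violating row aborts processing
    """
    if action not in ("warn", "drop", "fail"):
        raise ValueError(f"unknown action: {action}")
    kept, violations = [], 0
    for row in rows:
        if predicate(row):
            kept.append(row)
            continue
        violations += 1
        if action == "fail":
            raise ExpectationFailed(f"{name} violated by {row!r}")
        if action == "warn":
            kept.append(row)  # invalid record passes through
        # action == "drop": the record is left out of the result
    return kept, violations


rows = [{"price": 5}, {"price": -1}]
is_valid = lambda r: r["price"] >= 0

warned, warn_count = apply_expectation(rows, "valid_price", is_valid, "warn")
dropped, drop_count = apply_expectation(rows, "valid_price", is_valid, "drop")
print(len(warned), warn_count, len(dropped), drop_count)
```

The violation count stands in for the data quality metrics a pipeline records; in this model the fail action produces no count because processing stops.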
- -## Key Concepts - -Expectations in Spark Declarative Pipelines: - -- Define constraints on data quality -- Can drop, fail, or track invalid records -- Support complex validation logic -- Integrated with pipeline monitoring - -## Language-Specific Implementations - -For detailed implementation guides: - -- **Python**: [expectations-python.md](expectations-python.md) -- **SQL**: [expectations-sql.md](expectations-sql.md) diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/foreach-batch-sink-python.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/foreach-batch-sink-python.md deleted file mode 100644 index 17dc80a..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/foreach-batch-sink-python.md +++ /dev/null @@ -1,121 +0,0 @@ -ForEachBatch sinks in Spark Declarative Pipelines process a stream as micro-batches with custom Python logic. **Public Preview** — this API may change. - -**When to use:** Use ForEachBatch when built-in sink formats (`delta`, `kafka`) are insufficient: - -- Custom merge/upsert logic into a Delta table -- Writing to multiple destinations per batch -- Writing to unsupported streaming sinks (e.g., JDBC targets) -- Custom per-batch transformations - -**API Reference:** - -**@dp.foreach_batch_sink()** -Decorator that defines a ForEachBatch sink. The decorated function is called for each micro-batch. - -```python -@dp.foreach_batch_sink(name="") -def my_sink(df, batch_id): - # df: Spark DataFrame with micro-batch data - # batch_id: integer ID for the micro-batch (0 = start of stream or full refresh) - # Access SparkSession via df.sparkSession - pass -``` - -Parameters: - -- `name` (str): Optional. Unique name for the sink within the pipeline. Defaults to function name. 
- -The decorated function receives: - -- `df` (DataFrame): Spark DataFrame containing data for the current micro-batch -- `batch_id` (int): Integer ID of the micro-batch. Spark increments this for each trigger interval. `0` means start of stream or beginning of a full refresh — the handler should properly handle a full refresh for downstream data sources. - -The handler does not need to return a value. - -**Writing to a ForEachBatch Sink:** - -Use `@dp.append_flow()` with the `target` parameter matching the sink name: - -```python -@dp.append_flow(target="my_sink") -def my_flow(): - return spark.readStream.table("source_table") -``` - -**Common Patterns:** - -**Pattern 1: Merge/upsert into a Delta table** - -The target table must already exist before the MERGE runs. Create it externally or handle creation in the handler. - -```python -@dp.foreach_batch_sink(name="upsert_sink") -def upsert_sink(df, batch_id): - df.createOrReplaceTempView("batch_data") - df.sparkSession.sql(""" - MERGE INTO target_catalog.schema.target_table AS target - USING batch_data AS source - ON target.id = source.id - WHEN MATCHED THEN UPDATE SET * - WHEN NOT MATCHED THEN INSERT * - """) - return - -@dp.append_flow(target="upsert_sink") -def upsert_flow(): - return spark.readStream.table("source_events") -``` - -**Pattern 2: Write to multiple destinations with idempotent writes** - -Use `txnVersion`/`txnAppId` for idempotent Delta writes — if a batch partially fails and retries, already-completed writes are safely skipped. 
- -```python -app_id = "my-app-name" # must be unique per application writing to the same table - -@dp.foreach_batch_sink(name="multi_target_sink") -def multi_target_sink(df, batch_id): - df.write.format("delta").mode("append") \ - .option("txnVersion", batch_id).option("txnAppId", app_id) \ - .saveAsTable("my_catalog.my_schema.table_a") - df.write.format("json").mode("append") \ - .option("txnVersion", batch_id).option("txnAppId", app_id) \ - .save("/tmp/json_target") - return - -@dp.append_flow(target="multi_target_sink") -def multi_target_flow(): - return spark.readStream.table("processed_events") -``` - -When writing to multiple destinations, use `df.persist()` or `df.cache()` inside the handler to read the source data only once instead of once per destination. - -**Pattern 3: Enrich and write to an external Delta table** - -```python -from pyspark.sql.functions import current_timestamp - -@dp.foreach_batch_sink(name="enriched_sink") -def enriched_sink(df, batch_id): - enriched = df.withColumn("processed_timestamp", current_timestamp()) - enriched.write.format("delta").mode("append") \ - .saveAsTable("my_catalog.my_schema.enriched_events") - return - -@dp.append_flow(target="enriched_sink") -def enriched_flow(): - return spark.readStream.table("source_events") -``` - -**KEY RULES:** - -- ForEachBatch sinks are **Python only** and in **Public Preview** -- Designed for streaming queries (`append_flow`) only — not for batch-only pipelines or Auto CDC semantics -- The pipeline does NOT track data written from a ForEachBatch sink — you manage downstream data and retention -- On full refresh, checkpoints reset and `batch_id` restarts from 0. 
Data in your target is NOT automatically cleaned up — you must manually drop or truncate target tables/locations if a clean slate is needed -- Multiple `@dp.append_flow()` decorators can target the same sink — each flow maintains its own checkpoint -- To access SparkSession inside the handler, use `df.sparkSession` (not `spark`) -- ForEachBatch supports all Unity Catalog features — you can write to UC managed or external tables and volumes -- When writing to multiple destinations, use `df.persist()` or `df.cache()` to avoid multiple source reads, and `txnVersion`/`txnAppId` for idempotent Delta writes -- Keep the handler function concise — avoid threading, heavy library dependencies, or large in-memory data manipulations -- **databricks-connect compatibility**: If your pipeline may run on databricks-connect, the handler function must be serializable and must not use `dbutils`. Avoid referencing local objects, classes, or unpickleable resources — use pure Python modules. Move `dbutils` calls (e.g., `dbutils.widgets.get()`) outside the handler and capture values in variables. The pipeline raises a warning in the event log for non-serializable UDFs but does not fail the pipeline. However, non-serializable logic can break at runtime in databricks-connect contexts diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/foreach-batch-sink.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/foreach-batch-sink.md deleted file mode 100644 index 348e8c5..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/foreach-batch-sink.md +++ /dev/null @@ -1,20 +0,0 @@ -# ForEachBatch Sinks in Spark Declarative Pipelines - -> **Public Preview** — This API may change. - -ForEachBatch sinks process a stream as a series of micro-batches, each handled by a custom Python function. Use when built-in sink formats (Delta, Kafka) are insufficient. 
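The handler contract (one call per micro-batch, receiving the batch data and a `batch_id`) and the idea behind idempotent writes keyed on `txnAppId`/`txnVersion` can be modelled in plain Python. The sketch below is a toy model, not the `pyspark.pipelines` API: `ToyTable` and `run_micro_batches` are hypothetical names, and the skip rule (ignore a write whose version is not greater than the last one recorded for the same application id) is an assumption about how Delta treats such writes that should be checked against the Delta documentation.

```python
class ToyTable:
    """In-memory stand-in for a table that supports idempotent appends."""

    def __init__(self):
        self.rows = []
        self._last_version = {}  # app_id -> highest version already committed

    def append(self, rows, app_id, version):
        """Append rows unless this (app_id, version) was already committed."""
        if version <= self._last_version.get(app_id, -1):
            return False  # duplicate or stale write: skipped
        self.rows.extend(rows)
        self._last_version[app_id] = version
        return True


def run_micro_batches(batches, handler):
    """Call the handler once per micro-batch, with batch ids starting at 0."""
    for batch_id, rows in enumerate(batches):
        handler(rows, batch_id)


table_a, table_b = ToyTable(), ToyTable()
APP_ID = "my-app-name"


def multi_target_handler(rows, batch_id):
    # Both writes use batch_id as the version, so a retried batch is skipped.
    table_a.append(rows, app_id=APP_ID, version=batch_id)
    table_b.append(rows, app_id=APP_ID, version=batch_id)


batches = [["e1", "e2"], ["e3"]]
run_micro_batches(batches, multi_target_handler)
multi_target_handler(batches[1], 1)  # simulate a retry of batch 1
print(table_a.rows, table_b.rows)
```

The model also shows why a full refresh needs care: if `batch_id` restarts from 0 while the target still remembers higher versions, every new write would be skipped until the target is cleaned up.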
- -## When to Use - -- Custom merge/upsert into a Delta table -- Writing to multiple destinations per batch -- Unsupported streaming sinks (e.g., JDBC targets) -- Custom per-batch transformations - -## Language Support - -- **Python only** — SQL does not support ForEachBatch sinks. - -## Implementation Guide - -- **Python**: [foreach-batch-sink-python.md](foreach-batch-sink-python.md) diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/materialized-view-python.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/materialized-view-python.md deleted file mode 100644 index 856ae6f..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/materialized-view-python.md +++ /dev/null @@ -1,192 +0,0 @@ -Materialized Views in Spark Declarative Pipelines enable batch processing of data with full refresh or incremental computation. - -**NOTE:** This guide focuses on materialized views. For details on streaming tables (incremental processing with `spark.readStream`), use the API guide for `streamingTable` instead. - -**API Reference:** - -**@dp.materialized_view() (Recommended)** -Decorator to define a materialized view. This is the recommended approach for creating materialized views. - -```python -@dp.materialized_view( - name="", - comment="", - spark_conf={"": ""}, - table_properties={"": ""}, - path="", - partition_cols=[""], - cluster_by_auto=True, - cluster_by=[""], - schema="schema-definition", - row_filter="row-filter-clause", - private=False -) -def my_materialized_view(): - return spark.read.table("source.data") -``` - -**@dp.table() / @dlt.table() (Alternative for Materialized Views)** -In the older `dlt` module, the `@dlt.table` decorator was used to create both streaming tables and materialized views. 
The `@dp.table()` decorator in the `pyspark.pipelines` module still works in this way, but Databricks recommends using the `@dp.materialized_view()` decorator to create materialized views. Note that `@dp.table()` remains the standard decorator for streaming tables. - -```python -# Still works, but @dp.materialized_view() is preferred for materialized views -@dp.table( - name="", - comment="", - spark_conf={"": ""}, - table_properties={"": ""}, - path="", - partition_cols=[""], - cluster_by_auto=True, - cluster_by=[""], - schema="schema-definition", - row_filter="row-filter-clause", - private=False -) -def my_materialized_view(): - return spark.read.table("source.data") -``` - -Parameters: - -- `name` (str): Table name (defaults to function name) -- `comment` (str): Description for the table -- `spark_conf` (dict): Spark configurations for query execution -- `table_properties` (dict): Delta table properties -- `path` (str): Storage location for table data (defaults to managed location) -- `partition_cols` (list): Columns to partition the table by -- `cluster_by_auto` (bool): Enable automatic liquid clustering -- `cluster_by` (list): Columns to use as clustering keys for liquid clustering -- `schema` (str or StructType): Schema definition (SQL DDL string or StructType) - - Supports generated columns: `"order_datetime STRING, order_day STRING GENERATED ALWAYS AS (dayofweek(order_datetime))"` - - Supports constraints: Primary keys, foreign keys - - Supports column masks: `"ssn STRING MASK catalog.schema.ssn_mask_fn USING COLUMNS (region)"` -- `row_filter` (str): (Public Preview) A row filter clause that filters rows when fetched from the table. - - Must use syntax: `"ROW FILTER func_name ON (column_name [, ...])"` where `func_name` is a SQL UDF returning `BOOLEAN`. The UDF can be defined in Unity Catalog. - - Rows are filtered out when the function returns `FALSE` or `NULL`. 
- - You can pass table columns or constant literals (`STRING`, numeric, `BOOLEAN`, `INTERVAL`, `NULL`) as arguments. - - The filter is applied as soon as rows are fetched from the data source. - - The function runs with pipeline owner's rights during refresh and invoker's rights during queries (allowing user-context functions like `CURRENT_USER()` and `IS_MEMBER()` for data security). - - Note: Using row filters on source tables forces full refresh of downstream materialized views. - - Note: It is NOT possible to call `CREATE FUNCTION` within a Spark Declarative Pipeline. -- `private` (bool): Restricts table to pipeline scope; prevents metastore publication - -**Materialized View vs Streaming Table:** - -- **Materialized View**: Use `@dp.materialized_view()` decorator with function returning `spark.read...` (batch DataFrame) -- **Streaming Table**: Use `@dp.table()` decorator with function returning `spark.readStream...` (streaming DataFrame) - see the `streamingTable` API guide - -Note: When using `@dp.table()` with a batch DataFrame return type, a materialized view is created. However, `@dp.materialized_view()` is preferred for this use case. The `@dp.table()` decorator remains the standard approach for streaming tables (with streaming DataFrame return type). - -**Incremental Refresh for Materialized Views:** - -Materialized views on **serverless pipelines** support automatic incremental refresh, which processes only changes in underlying data since the last refresh rather than recomputing everything. This significantly reduces compute costs. 
- -**How it works:** - -- Lakeflow Spark Declarative Pipelines uses a cost model to determine whether to perform incremental refresh or full recompute -- Incremental refresh processes delta changes and appends to the table -- If incremental refresh is not feasible or more expensive, the system falls back to full recompute automatically - -**Requirements for incremental refresh:** - -- Must run on **serverless pipelines** (not classic compute) -- Source tables must be Delta tables, materialized views, or streaming tables -- Row-tracking must be enabled on source tables for certain operations (see Notes column) - -**Supported SQL operations for incremental refresh (use PySpark DataFrame API equivalents in Python):** - -| SQL Operation | Support | Notes | -| --------------------------- | ------- | ------------------------------------------------------------------------------------------------------- | -| SELECT expressions | Yes | Deterministic built-in functions and immutable UDFs. Requires row tracking | -| GROUP BY | Yes | — | -| WITH | Yes | Common table expressions | -| UNION ALL | Yes | Requires row tracking | -| FROM | Yes | Supported base tables include Delta tables, materialized views, and streaming tables | -| WHERE, HAVING | Yes | Requires row tracking | -| INNER JOIN | Yes | Requires row tracking | -| LEFT OUTER JOIN | Yes | Requires row tracking | -| FULL OUTER JOIN | Yes | Requires row tracking | -| RIGHT OUTER JOIN | Yes | Requires row tracking | -| OVER (Window functions) | Yes | Must specify PARTITION BY columns | -| QUALIFY | Yes | — | -| EXPECTATIONS | Partial | Generally supported; exceptions for views with expectations and DROP expectations with NOT NULL columns | -| Non-deterministic functions | Limited | Time functions like `current_date()` supported in WHERE clauses only | -| Non-Delta sources | No | Volumes, external locations, foreign catalogs unsupported | - -**Limitations:** - -- Falls back to full recompute when incremental is more 
expensive or query uses unsupported expressions - -**Best practices:** - -- Enable deletion vectors, row tracking, and change data feed on source tables for optimal incremental refresh -- Design queries with supported operations to leverage incremental refresh -- For exactly-once processing semantics (Kafka, Auto Loader), use streaming tables instead - -**Common Patterns:** - -**Pattern 1: Simple batch transformation** - -```python -@dp.materialized_view() -def bronze_batch(): - return spark.read.format("parquet").load("/path/to/data") - -@dp.materialized_view() -def silver_batch(): - return spark.read.table("bronze_batch").filter("id IS NOT NULL") -``` - -**Pattern 2: Schema with generated columns** - -```python -@dp.materialized_view( - schema=""" - order_datetime STRING, - order_day_of_week STRING GENERATED ALWAYS AS (dayofweek(order_datetime)), - customer_id BIGINT, - amount DECIMAL(10,2) - """, - cluster_by=["order_day_of_week", "customer_id"] -) -def orders_with_day(): - return spark.read.table("raw.orders") -``` - -**Pattern 3: Row filters for data security** - -```python -# Assumes filter_by_dept is a SQL UDF defined in Unity Catalog that returns BOOLEAN - -@dp.materialized_view( - name="employees", - schema="emp_id INT, emp_name STRING, dept STRING, salary DECIMAL(10,2)", - row_filter="ROW FILTER my_catalog.my_schema.filter_by_dept ON (dept)" -) -def employees(): - return spark.read.table("source.employees") -``` - -**Pattern 4: Column masking for sensitive data** - -```python -@dp.materialized_view( - schema=""" - user_id BIGINT, - ssn STRING MASK catalog.schema.ssn_mask_fn USING COLUMNS (region), - region STRING - """ -) -def users_with_masked_ssn(): - return spark.read.table("raw.users") -``` - -**KEY RULES:** - -- Use `@dp.materialized_view()` for materialized views (preferred over `@dp.table()`) -- Materialized views use `spark.read` (batch reads) -- Streaming tables use `spark.readStream` (streaming reads) - see the `streamingTable` API guide -- 
Never use `.write`, `.save()`, `.saveAsTable()`, or `.toTable()` - Databricks manages writes automatically -- Generated columns, constraints, and masks require schema definition -- Row filters force full refresh of downstream materialized views diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/materialized-view-sql.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/materialized-view-sql.md deleted file mode 100644 index 5851f39..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/materialized-view-sql.md +++ /dev/null @@ -1,187 +0,0 @@ -Materialized Views in Lakeflow Spark Declarative Pipelines enable batch processing of data with full refresh or incremental computation. - -**NOTE:** This guide focuses on materialized views. For details on streaming tables (incremental processing with streaming reads), use the API guide for `streamingTable` instead. - -**SQL Syntax:** - -**CREATE MATERIALIZED VIEW** -Creates a materialized view for batch data processing. For streaming tables, see the `CREATE STREAMING TABLE` guide. - -```sql -CREATE OR REFRESH [PRIVATE] MATERIALIZED VIEW - view_name - [ column_list ] - [ view_clauses ] - AS query - -column_list - ( { column_name column_type column_properties } [, ...] - [ column_constraint ] [, ...] - [ , table_constraint ] [...] ) - - column_properties - { NOT NULL | COMMENT column_comment | column_constraint | MASK clause } [ ... ] - -view_clauses - { USING DELTA | - PARTITIONED BY (col [, ...]) | - CLUSTER BY clause | - LOCATION path | - COMMENT view_comment | - TBLPROPERTIES clause | - WITH { ROW FILTER clause } } [...] 
-``` - -**Parameters:** - -- `PRIVATE`: Restricts table to pipeline scope; prevents metastore publication -- `view_name`: Unique identifier for the view (fully qualified name including catalog and schema must be unique unless marked PRIVATE) -- `column_list`: Optional schema definition with column names, types, and properties - - `column_name`: Name of the column - - `column_type`: Data type (STRING, BIGINT, DECIMAL, etc.) - - `column_properties`: Column attributes: - - `NOT NULL`: Column cannot contain null values - - `COMMENT column_comment`: Description for the column - - `column_constraint`: Data quality constraints, consult the `expectations` API guide for details. - - `MASK clause`: Column masking syntax `MASK catalog.schema.mask_fn USING COLUMNS (other_column)` (Public Preview) - - `table_constraint`: Informational table-level constraints (Unity Catalog only, **not enforced** by Databricks): - - Look up exact documentation when using - - Note: Constraints are informational metadata for documentation and query optimization hints; data validation must be performed independently -- `view_clauses`: Optional clauses for view configuration: - - `USING DELTA`: Optional format specification (only DELTA supported, can be omitted) - - `PARTITIONED BY (col [, ...])`: Columns for traditional partitioning, mutually exclusive with CLUSTER BY - - `CLUSTER BY clause`: Columns for liquid clustering (optimized query performance) - - `LOCATION path`: Storage path (Hive metastore only) - - `COMMENT view_comment`: Description for the view - - `TBLPROPERTIES clause`: Custom table properties `(key = value [, ...])` - - `WITH ROW FILTER clause`: Row-level security filtering - - Syntax: `ROW FILTER func_name ON (column_name [, ...])` (Public Preview) - - `func_name` must be a SQL UDF returning BOOLEAN (can be defined in Unity Catalog) - - Rows are filtered out when function returns FALSE or NULL - - Accepts table columns or constant literals (STRING, numeric, BOOLEAN, INTERVAL, 
NULL) - - Filter applies when rows are fetched from the data source - - Runs with pipeline owner's rights during refresh and invoker's rights during queries - - Note: Using row filters on source tables forces full refresh of downstream materialized views - - Note: It is NOT possible to call `CREATE FUNCTION` within a Spark Declarative Pipeline. -- `query`: A Spark SQL query that defines the dataset for the table - -**Incremental Refresh for Materialized Views:** - -Materialized views on **serverless pipelines** support automatic incremental refresh, which processes only changes in underlying data since the last refresh rather than recomputing everything. This significantly reduces compute costs. - -**How it works:** - -- Lakeflow Spark Declarative Pipelines uses a cost model to determine whether to perform incremental refresh or full recompute -- Incremental refresh processes delta changes and appends to the table -- If incremental refresh is not feasible or more expensive, the system falls back to full recompute automatically - -**Requirements for incremental refresh:** - -- Must run on **serverless pipelines** (not classic compute) -- Source tables must be Delta tables, materialized views, or streaming tables -- Row-tracking must be enabled on source tables for certain operations (see Notes column) - -**Supported SQL operations for incremental refresh:** - -| SQL Operation | Support | Notes | -| --------------------------- | ------- | ------------------------------------------------------------------------------------------------------- | -| SELECT expressions | Yes | Deterministic built-in functions and immutable UDFs. 
Requires row tracking | -| GROUP BY | Yes | — | -| WITH | Yes | Common table expressions | -| UNION ALL | Yes | Requires row tracking | -| FROM | Yes | Supported base tables include Delta tables, materialized views, and streaming tables | -| WHERE, HAVING | Yes | Requires row tracking | -| INNER JOIN | Yes | Requires row tracking | -| LEFT OUTER JOIN | Yes | Requires row tracking | -| FULL OUTER JOIN | Yes | Requires row tracking | -| RIGHT OUTER JOIN | Yes | Requires row tracking | -| OVER (Window functions) | Yes | Must specify PARTITION BY columns | -| QUALIFY | Yes | — | -| EXPECTATIONS | Partial | Generally supported; exceptions for views with expectations and DROP expectations with NOT NULL columns | -| Non-deterministic functions | Limited | Time functions like `current_date()` supported in WHERE clauses only | -| Non-Delta sources | No | Volumes, external locations, foreign catalogs unsupported | - -**Best practices:** - -- Enable deletion vectors, row tracking, and change data feed on source tables for optimal incremental refresh -- Design queries with supported operations to leverage incremental refresh -- For exactly-once processing semantics (Kafka, Auto Loader), use streaming tables instead - -**Common Patterns:** - -**Pattern 1: Simple batch transformation** - -```sql -CREATE MATERIALIZED VIEW bronze_batch -AS SELECT * FROM delta.`/path/to/data`; - -CREATE MATERIALIZED VIEW silver_batch -AS SELECT * FROM bronze_batch WHERE id IS NOT NULL; -``` - -**Pattern 2: Schema with generated columns** - -```sql -CREATE MATERIALIZED VIEW orders_with_day ( - order_datetime STRING, - order_day_of_week STRING GENERATED ALWAYS AS (dayofweek(order_datetime)), - customer_id BIGINT, - amount DECIMAL(10,2) -) -CLUSTER BY (order_day_of_week, customer_id) -AS SELECT order_datetime, customer_id, amount FROM raw.orders; -``` - -**Pattern 3: Row filters for data security** - -```sql --- Assumes filter_by_dept is a SQL UDF defined in Unity Catalog that returns BOOLEAN - 
-CREATE MATERIALIZED VIEW employees ( - emp_id INT, - emp_name STRING, - dept STRING, - salary DECIMAL(10,2) -) -WITH ROW FILTER my_catalog.my_schema.filter_by_dept ON (dept) -AS SELECT * FROM source.employees; -``` - -**Pattern 4: Column masking for sensitive data** - -```sql -CREATE MATERIALIZED VIEW users_with_masked_ssn ( - user_id BIGINT, - ssn STRING MASK catalog.schema.ssn_mask_fn USING COLUMNS (region), - region STRING -) -AS SELECT user_id, ssn, region FROM raw.users; -``` - -**Pattern 5: Aggregations with liquid clustering** - -```sql -CREATE MATERIALIZED VIEW daily_sales_summary -CLUSTER BY (sale_date, region) -AS -SELECT - DATE(order_timestamp) AS sale_date, - region, - COUNT(*) AS order_count, - SUM(amount) AS total_revenue -FROM raw.orders -GROUP BY DATE(order_timestamp), region; -``` - -**KEY RULES:** - -- Materialized views perform batch processing of data -- Streaming tables perform incremental streaming processing - see the `streamingTable` guide -- Identity columns and default columns are not supported -- Row filters force full refresh of downstream materialized views -- Sum aggregates over nullable columns return zero instead of NULL when only nulls remain (that is, when the last non-NULL value is removed) -- Non-column expressions require explicit aliases (column references do not need aliases) -- PRIMARY KEY requires explicit NOT NULL specification to be valid -- OPTIMIZE and VACUUM commands are unavailable; Lakeflow Declarative Pipelines handles maintenance automatically -- `CLUSTER BY` is recommended over `PARTITIONED BY` for most use cases -- Table renaming and ownership changes are prohibited diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/materialized-view.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/materialized-view.md deleted file mode 100644 index e23fa0b..0000000 ---
a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/materialized-view.md +++ /dev/null @@ -1,19 +0,0 @@ -# Materialized Views in Spark Declarative Pipelines - -Materialized views store the results of a query physically, enabling faster query performance for expensive transformations and aggregations. - -## Key Concepts - -Materialized views in Spark Declarative Pipelines: - -- Physically store query results -- Are incrementally refreshed when source data changes -- Support complex transformations and aggregations -- Published to Unity Catalog - -## Language-Specific Implementations - -For detailed implementation guides: - -- **Python**: [materialized-view-python.md](materialized-view-python.md) -- **SQL**: [materialized-view-sql.md](materialized-view-sql.md) diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/options-avro.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/options-avro.md deleted file mode 100644 index 80e85ab..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/options-avro.md +++ /dev/null @@ -1,9 +0,0 @@ -AVRO-Specific Options - -| Option | Type | -| ------------------- | ------- | -| avroSchema | String | -| datetimeRebaseMode | String | -| mergeSchema | Boolean | -| readerCaseSensitive | Boolean | -| rescuedDataColumn | String | diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/options-csv.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/options-csv.md deleted file mode 100644 index 6590b89..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/options-csv.md +++ /dev/null @@ -1,38 +0,0 @@ -CSV-Specific Options - -| Option | Type | -| ------------------------- | ------- | -| badRecordsPath | String | -| 
charToEscapeQuoteEscaping | Char | -| columnNameOfCorruptRecord | String | -| comment | Char | -| dateFormat | String | -| emptyValue | String | -| encoding / charset | String | -| enforceSchema | Boolean | -| escape | Char | -| header | Boolean | -| ignoreLeadingWhiteSpace | Boolean | -| ignoreTrailingWhiteSpace | Boolean | -| inferSchema | Boolean | -| lineSep | String | -| locale | String | -| maxCharsPerColumn | Int | -| maxColumns | Int | -| mergeSchema | Boolean | -| mode | String | -| multiLine | Boolean | -| nanValue | String | -| negativeInf | String | -| nullValue | String | -| parserCaseSensitive | Boolean | -| positiveInf | String | -| preferDate | Boolean | -| quote | Char | -| readerCaseSensitive | Boolean | -| rescuedDataColumn | String | -| sep / delimiter | String | -| skipRows | Int | -| timestampFormat | String | -| timeZone | String | -| unescapedQuoteHandling | String | diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/options-json.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/options-json.md deleted file mode 100644 index 2f3bce7..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/options-json.md +++ /dev/null @@ -1,28 +0,0 @@ -JSON-Specific Options - -| Option | Type | -| ---------------------------------- | ------- | -| allowBackslashEscapingAnyCharacter | Boolean | -| allowComments | Boolean | -| allowNonNumericNumbers | Boolean | -| allowNumericLeadingZeros | Boolean | -| allowSingleQuotes | Boolean | -| allowUnquotedControlChars | Boolean | -| allowUnquotedFieldNames | Boolean | -| badRecordsPath | String | -| columnNameOfCorruptRecord | String | -| dateFormat | String | -| dropFieldIfAllNull | Boolean | -| encoding / charset | String | -| inferTimestamp | Boolean | -| lineSep | String | -| locale | String | -| mode | String | -| multiLine | Boolean | -| prefersDecimal | Boolean | -| 
primitivesAsString | Boolean | -| readerCaseSensitive | Boolean | -| rescuedDataColumn | String | -| singleVariantColumn | String | -| timestampFormat | String | -| timeZone | String | diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/options-orc.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/options-orc.md deleted file mode 100644 index e2097b6..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/options-orc.md +++ /dev/null @@ -1,5 +0,0 @@ -ORC-Specific Options - -| Option | Type | -| ----------- | ------- | -| mergeSchema | Boolean | diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/options-parquet.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/options-parquet.md deleted file mode 100644 index 43981c6..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/options-parquet.md +++ /dev/null @@ -1,9 +0,0 @@ -PARQUET-Specific Options - -| Option | Type | -| ------------------- | ------- | -| datetimeRebaseMode | String | -| int96RebaseMode | String | -| mergeSchema | Boolean | -| readerCaseSensitive | Boolean | -| rescuedDataColumn | String | diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/options-text.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/options-text.md deleted file mode 100644 index 8b18998..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/options-text.md +++ /dev/null @@ -1,7 +0,0 @@ -TEXT-Specific Options - -| Option | Type | -| --------- | ------- | -| encoding | String | -| lineSep | String | -| wholeText | Boolean | diff --git 
a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/options-xml.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/options-xml.md deleted file mode 100644 index eed595f..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/options-xml.md +++ /dev/null @@ -1,29 +0,0 @@ -XML-Specific Options - -| Option | Type | -| ------------------------- | ------- | -| rowTag | String | -| samplingRatio | Double | -| excludeAttribute | Boolean | -| mode | String | -| inferSchema | Boolean | -| columnNameOfCorruptRecord | String | -| attributePrefix | String | -| valueTag | String | -| encoding | String | -| ignoreSurroundingSpaces | Boolean | -| rowValidationXSDPath | String | -| ignoreNamespace | Boolean | -| timestampFormat | String | -| timestampNTZFormat | String | -| dateFormat | String | -| locale | String | -| rootTag | String | -| declaration | String | -| arrayElementName | String | -| nullValue | String | -| compression | String | -| validateName | Boolean | -| readerCaseSensitive | Boolean | -| rescuedDataColumn | String | -| singleVariantColumn | String | diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/python-basics.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/python-basics.md deleted file mode 100644 index 0216eba..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/python-basics.md +++ /dev/null @@ -1,70 +0,0 @@ -#### Setup - -- `from pyspark import pipelines as dp` (preferred) or `import dlt` (deprecated but still works) is always required on top when doing Python. Prefer `dp` import style unless `dlt` was already imported, don't change existing imports unless explicitly asked. 
-- The SparkSession object is already available (no need to import it again) - unless in a utility file - -#### Core Decorators - -- `@dp.materialized_view()` - Materialized views (batch processing, recommended for materialized views) -- `@dp.table()` - Streaming tables (when returning streaming DataFrame) or materialized views (legacy, when returning batch DataFrame) -- `@dp.temporary_view()` - Temporary views (non-materialized, private to pipeline) -- `@dp.expect*()` - Data quality constraints (expect, expect_or_drop, expect_or_fail, expect_all, expect_all_or_drop, expect_all_or_fail) - -#### Core Functions - -- `dp.create_streaming_table()` - Continuous processing -- `dp.create_auto_cdc_flow()` - Change data capture -- `dp.create_auto_cdc_from_snapshot_flow()` - Change data capture from database snapshots -- `dp.create_sink()` - Write to alternative targets (Kafka, Event Hubs, external Delta tables) -- `@dp.foreach_batch_sink()` - Custom streaming sink with per-batch Python logic (Public Preview) -- `dp.append_flow()` - Append-only patterns -- `dp.read()`/`dp.read_stream()` - Read from other pipeline datasets (deprecated - always use `spark.read.table()` or `spark.readStream.table()` instead) - -#### Critical Rules - -- ✅ Dataset functions MUST return Spark DataFrames -- ✅ Use `spark.read.table`/`spark.readStream.table` (NOT dp.read* and NOT dlt.read*) -- ✅ Use `auto_cdc` API (NOT apply_changes) -- ✅ Look up documentation for decorator/function parameters when unsure -- ❌ Do not use star imports -- ❌ NEVER use .collect(), .count(), .toPandas(), .save(), .saveAsTable(), .start(), .toTable() -- ❌ AVOID custom monitoring in dataset definitions -- ❌ Keep functions pure (evaluated multiple times) -- ❌ NEVER use the "LIVE." 
prefix when reading other datasets (deprecated) -- ❌ No arbitrary Python logic in dataset definitions - focus on DataFrame operations only - -#### Python-Specific Considerations - -**Reading Pipeline Datasets:** - -When reading from other datasets defined in the pipeline, use the **dataset name directly** - NEVER use the `LIVE.` prefix: - -```python -# ✅ CORRECT - use the dataset name directly -customers = spark.read.table("bronze_customers") -transactions = spark.readStream.table("bronze_transactions") - -# ❌ WRONG - do NOT use "LIVE." prefix (deprecated) -customers = spark.read.table("LIVE.bronze_customers") -transactions = spark.readStream.table("LIVE.bronze_transactions") -``` - -The `LIVE.` prefix is deprecated and should never be used. The pipeline automatically resolves dataset references by dataset name. - -**Streaming vs. Batch Semantics:** - -- Use `spark.read.table()` (or deprecated `dp.read()`/`dlt.read()`) for batch processing (materialized views with full refresh or incremental computation) -- Use `spark.readStream.table()` (or deprecated `dp.read_stream()`/`dlt.read_stream()`) for streaming tables to enable continuous incremental processing -- **Materialized views**: Use the `@dp.materialized_view()` decorator (recommended) with a batch DataFrame (`spark.read`) -- **Streaming tables**: Use the `@dp.table()` decorator with a streaming DataFrame (`spark.readStream`) -- Note: The `@dp.table()` decorator can create both batch and streaming tables based on return type, but `@dp.materialized_view()` is preferred for materialized views - -#### skipChangeCommits - -When a downstream streaming table reads from an upstream streaming table that has updates or deletes (e.g., GDPR compliance, Auto CDC targets), use `skipChangeCommits` to ignore those change commits: - -```python -@dp.table() -def downstream(): - return spark.readStream.option("skipChangeCommits", "true").table("upstream_table") -``` diff --git
a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/sink-python.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/sink-python.md deleted file mode 100644 index f680588..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/sink-python.md +++ /dev/null @@ -1,133 +0,0 @@ -Sinks enable writing pipeline data to alternative targets like event streaming services (Apache Kafka, Azure Event Hubs), external Delta tables, or custom data sources using Python code. Sinks are Python-only and work exclusively with streaming append flows. - -## Creating Sinks - -**dp.create_sink() / dlt.create_sink()** - -Defines a sink for writing to alternative targets (Kafka, Event Hubs, external Delta tables). Call at top level before using in append flows. - -```python -dp.create_sink( - name="", - format="", - options={"": ""} -) -``` - -Parameters: - -- `name` (str): Unique identifier for the sink within the pipeline. Used to reference the sink in append flows. **Required.** -- `format` (str): Output format (`"kafka"`, `"delta"`, or custom format). Determines required options. **Required.** -- `options` (dict): Configuration dictionary with format-specific key-value pairs. Required options depend on the format. **Required.** - -## Writing to Sinks - -After creating a sink, use `@dp.append_flow()` (or `@dlt.append_flow()`) decorator to write streaming data to it. The `target` parameter specifies which sink to write to (must match a sink name created with `dp.create_sink()`). - -For complete documentation on append flows, see [streaming-table-python.md](../streaming-table/streaming-table-python.md). - -## Supported Sink Formats - -### Delta Sinks - -Write to Unity Catalog external/managed tables or file paths. 
- -**Options for Unity Catalog tables:** - -```python -{ - "tableName": "catalog_name.schema_name.table_name" # Fully qualified table name -} -``` - -**Options for file paths:** - -```python -{ - "path": "/Volumes/catalog_name/schema_name/path/to/data" -} -``` - -**Example:** - -```python -# Create Delta sink with table name -dp.create_sink( - name="delta_sink", - format="delta", - options={"tableName": "main.sales.transactions"} -) - -# Write to sink using append flow -@dp.append_flow(name="write_to_delta", target="delta_sink") -def write_transactions(): - return spark.readStream.table("bronze_transactions") \ - .select("transaction_id", "customer_id", "amount", "timestamp") -``` - -### Kafka and Azure Event Hubs Sinks - -Write to Apache Kafka or Azure Event Hubs topics for real-time event streaming. - -**Important**: This code works for both Apache Kafka and Azure Event Hubs sinks. - -**Required options:** - -```python -{ - "kafka.bootstrap.servers": "host:port", # Kafka/Event Hubs endpoint - "topic": "topic_name", # Target topic - "databricks.serviceCredential": "credential_name" # Unity Catalog service credential -} -``` - -**Authentication**: Use `databricks.serviceCredential` to reference a Unity Catalog service credential for connecting to external cloud services. 
- -**Data format requirements**: - -- The `value` parameter is mandatory for Kafka and Azure Event Hubs sinks -- Optional parameters: `key`, `partition`, `headers`, and `topic` - -**Example (works for both Kafka and Event Hubs):** - -```python -# Define credentials and connection details -credential_name = "" -bootstrap_servers = "kafka-broker:9092" # or "{eh-namespace}.servicebus.windows.net:9093" for Event Hubs -topic_name = "customer_events" - -# Create Kafka/Event Hubs sink -dp.create_sink( - name="kafka_sink", - format="kafka", - options={ - "databricks.serviceCredential": credential_name, - "kafka.bootstrap.servers": bootstrap_servers, - "topic": topic_name - } -) - -# Write to sink with required value parameter -@dp.append_flow(name="stream_to_kafka", target="kafka_sink") -def kafka_flow(): - return spark.readStream.table("customer_events") \ - .selectExpr( - "cast(customer_id as string) as key", - "to_json(struct(*)) AS value" - ) -``` - -## Limitations and Considerations - -- Sinks only work with streaming queries and cannot be used with batch DataFrames -- Only compatible with `@dp.append_flow()` decorator -- Full refresh updates don't clean existing sink data - - Reprocessed data will be appended to the sink - - Consider idempotency: Design for duplicate writes since full refresh appends data -- Delta sink table names must be fully qualified (catalog.schema.table), use three-part names for Unity Catalog tables -- Volume file paths are supported as an alternative -- Pipeline expectations cannot be applied to sinks - - Apply data quality checks before writing to sinks - - Validate data in upstream tables/views instead -- Sinks are Python-only in Spark Declarative Pipelines, SQL does not support sink creation or usage -- Handle serialization: For Kafka/Event Hubs, convert data to JSON or appropriate format diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/sink.md 
b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/sink.md deleted file mode 100644 index cf54ef4..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/sink.md +++ /dev/null @@ -1,21 +0,0 @@ -# Sinks in Spark Declarative Pipelines - -Sinks enable writing pipeline data to alternative targets beyond Databricks-managed Delta tables, including event streaming services and external tables. - -## Key Concepts - -Sinks in Spark Declarative Pipelines: - -- Write to event streaming services (Apache Kafka, Azure Event Hubs) -- Write to externally-managed Delta tables (Unity Catalog external/managed tables) -- Enable reverse ETL into systems outside Databricks -- Support custom Python data sources -- Work exclusively with streaming queries and append flows - -## Language-Specific Implementations - -For detailed implementation guides: - -- **Python**: [sink-python.md](sink-python.md) - -**Important**: Sinks are only available in Python. SQL does not support sinks in Spark Declarative Pipelines. 
diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/sql-basics.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/sql-basics.md deleted file mode 100644 index bbbf496..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/sql-basics.md +++ /dev/null @@ -1,57 +0,0 @@ -#### Core SQL Statements - -- `CREATE MATERIALIZED VIEW` - Batch processing with full refresh or incremental computation -- `CREATE STREAMING TABLE` - Continuous incremental processing -- `CREATE TEMPORARY VIEW` - Non-materialized views (pipeline lifetime only) -- `CREATE VIEW` - Non-materialized catalog views (Unity Catalog only) -- `AUTO CDC INTO` - Change data capture flows -- `CREATE FLOW` - Define flows or backfills for streaming tables - -#### Message Bus Ingestion Functions - -- `read_kafka(bootstrapServers => '...', subscribe => '...')` - Apache Kafka -- `read_kinesis(streamName => '...', region => '...')` - AWS Kinesis -- `read_pubsub(subscriptionId => '...', topicId => '...')` - Google Cloud Pub/Sub -- `read_pulsar(serviceUrl => '...', topics => '...')` - Apache Pulsar -- Event Hubs: Use `read_kafka()` with Kafka-compatible Event Hubs config - -#### Critical Rules - -- ✅ Prefer `CREATE OR REFRESH` syntax for defining datasets (bare `CREATE` also works, but `OR REFRESH` is the idiomatic convention) -- ✅ Use `STREAM` keyword when reading sources for streaming tables -- ✅ Use `read_files()` function for Auto Loader (cloud storage ingestion) -- ✅ Look up documentation for statement parameters when unsure -- ❌ NEVER use `LIVE.` prefix when reading other datasets (deprecated) -- ❌ NEVER use `CREATE LIVE TABLE` or `CREATE LIVE VIEW` (deprecated - use `CREATE STREAMING TABLE`, `CREATE MATERIALIZED VIEW`, or `CREATE TEMPORARY VIEW` instead) -- ❌ Do not use `PIVOT` clause (unsupported) - -#### SQL-Specific Considerations - -**Streaming vs. 
Batch Semantics:** - -- Omit `STREAM` keyword for materialized views (batch processing) -- Use `STREAM` keyword for streaming tables to enable streaming semantics - -**GROUP BY Best Practices:** - -- Prefer `GROUP BY ALL` over explicitly listing individual columns unless the user specifically requests explicit grouping -- Benefits: more maintainable when adding/removing columns, less verbose, reduces risk of missing columns in the GROUP BY clause -- Example: `SELECT category, region, SUM(sales) FROM table GROUP BY ALL` instead of `GROUP BY category, region` - -**Python UDFs:** - -- You can use Python user-defined functions (UDFs) in SQL queries -- UDFs must be defined in Python files before calling them in SQL source files - -**Configuration:** - -- Use `SET` statements and `${}` string interpolation for dynamic values and Spark configurations - -#### skipChangeCommits - -When a downstream streaming table reads from an upstream streaming table that has updates or deletes, use `skipChangeCommits` to ignore change commits: - -```sql -CREATE OR REFRESH STREAMING TABLE downstream -AS SELECT * FROM STREAM read_stream("upstream_table", skipChangeCommits => true) -``` diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/streaming-table-python.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/streaming-table-python.md deleted file mode 100644 index 2259cbf..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/streaming-table-python.md +++ /dev/null @@ -1,242 +0,0 @@ -Streaming Tables in Spark Declarative Pipelines enable incremental processing of continuously arriving data. - -**NOTE:** This guide focuses on streaming tables. For details on materialized views (batch processing with `spark.read`), use the API guide for `materializedView` instead. 
- -**API Reference:** - -**@dp.table() / @dlt.table()** -Decorator to define a streaming table or materialized view. Returns streaming table when function returns `spark.readStream`. For materialized views using `spark.read`, see the `materializedView` API guide. - -```python -@dp.table( - name="", - comment="", - spark_conf={"": ""}, - table_properties={"": ""}, - path="", - partition_cols=[""], - cluster_by_auto=True, - cluster_by=[""], - schema="schema-definition", - row_filter="row-filter-clause", - private=False -) -def my_append_flow(): - return spark.readStream.table("source.data") -``` - -Parameters: - -- `name` (str): Table name (defaults to function name) -- `comment` (str): Description for the table -- `spark_conf` (dict): Spark configurations for query execution -- `table_properties` (dict): Delta table properties -- `path` (str): Storage location for table data (defaults to managed location) -- `partition_cols` (list): Columns to partition the table by -- `cluster_by_auto` (bool): Enable automatic liquid clustering -- `cluster_by` (list): Columns to use as clustering keys for liquid clustering -- `schema` (str or StructType): Schema definition (SQL DDL string or StructType) - - Supports generated columns: `"order_datetime STRING, order_day STRING GENERATED ALWAYS AS (dayofweek(order_datetime))"` - - Supports constraints: Primary keys, foreign keys - - Supports column masks: `"ssn STRING MASK catalog.schema.ssn_mask_fn USING COLUMNS (region)"` -- `row_filter` (str): (Public Preview) A row filter clause that filters rows when fetched from the table. - - Must use syntax: `"ROW FILTER func_name ON (column_name [, ...])"` where `func_name` is a SQL UDF returning `BOOLEAN`. The UDF can be defined in Unity Catalog. - - Rows are filtered out when the function returns `FALSE` or `NULL`. - - You can pass table columns or constant literals (`STRING`, numeric, `BOOLEAN`, `INTERVAL`, `NULL`) as arguments. 
- - The filter is applied as soon as rows are fetched from the data source. - - The function runs with pipeline owner's rights during refresh and invoker's rights during queries (allowing user-context functions like `CURRENT_USER()` and `IS_MEMBER()` for data security). - - Note: Using row filters on source tables forces full refresh of downstream materialized views. - - Note: It is NOT possible to call `CREATE FUNCTION` within a Spark Declarative Pipeline. -- `private` (bool): Restricts table to pipeline scope; prevents metastore publication - -**dp.create_streaming_table() / dlt.create_streaming_table()** -Creates an empty streaming table as target for CDC flows or append flows. Does NOT return a value - call at top level without assignment. - -```python -dp.create_streaming_table( - name="", - comment="", - spark_conf={"": ""}, - table_properties={"": ""}, - path="", - partition_cols=[""], - cluster_by_auto=True, - cluster_by=[""], - schema="schema-definition", - expect_all={"": ""}, - expect_all_or_drop={"": ""}, - expect_all_or_fail={"": ""}, - row_filter="row-filter-clause" -) -``` - -Parameters: Same as @dp.table() except `private`, plus: - -- `expect_all` (dict): Data quality expectations (warn on failure, include in target) -- `expect_all_or_drop` (dict): Expectations that drop failing rows from target -- `expect_all_or_fail` (dict): Expectations that fail pipeline on violation - -**@dp.append_flow() / @dlt.append_flow()** -Decorator to define a flow that appends data from a source to an existing target table. Multiple append flows can write to the same target table. 
- -```python -@dp.append_flow( - target="", - name="", # optional, defaults to function name - once=, # optional, defaults to False - spark_conf={"": "", "": ""}, # optional - comment="" # optional -) -def my_append_flow(): - # For once=False (streaming): use spark.readStream - return spark.readStream.table("source.data") - # For once=True (batch): use spark.read - return spark.read.table("source.data") -``` - -Parameters: - -- `target` (str): The name of the target streaming table where data will be appended. Target must exist (created with `dp.create_streaming_table()`). **Required.** -- `name` (str): The name of the flow. If not specified, defaults to the function name. Use distinct names when multiple flows target the same table. -- `once` (bool): Controls whether the flow runs continuously or once: - - **False (default)**: Flow continuously processes new data as it arrives in streaming mode. **Must return a streaming DataFrame using `spark.readStream`**, CAN use `cloudFiles` (Auto Loader). - - **True**: Flow processes data only once during pipeline execution and then stops. **Must return a batch DataFrame using `spark.read`**. Do NOT use `cloudFiles` (Auto Loader) with `once=True` - use regular batch reads like `spark.read.format("")` instead. -- `spark_conf` (dict): A dictionary of Spark configuration key-value pairs to apply specifically to this flow's query execution (e.g., `{"spark.sql.shuffle.partitions": "10"}`). -- `comment` (str): A description of the flow that appears in the pipeline metadata and documentation. - -**Two Ways to Define Streaming Tables:** - -1. **@dp.table decorator (MOST COMMON)** - - Returns a streaming DataFrame using `spark.readStream` - - Automatically inferred as a streaming table when returning a streaming DataFrame - - ```python - @dp.table(name="events_stream") - def events_stream(): - return spark.readStream.table("source_catalog.schema.events") - ``` - -2. 
**dp.create_streaming_table()** - - Creates an empty streaming table target - - Required as target for Auto CDC flows and append flows - - Does NOT return a value (do not assign to a variable) - - ```python - dp.create_streaming_table( - name="users", - schema="user_id INT, name STRING, updated_at TIMESTAMP" - ) - ``` - -**WHEN TO USE WHICH:** - -Use **@dp.table with readStream** when: - -- Reading and transforming streaming data -- Creating streaming tables from sources (Auto Loader, Delta tables, etc.) -- This is the standard pattern for most streaming use cases - -Use **dp.create_streaming_table()** when: - -- Creating a target table for `dp.create_auto_cdc_flow()` -- Creating a target table for `@dp.append_flow` from multiple sources -- Need to explicitly define table schema before data flows in - -**Common Patterns:** - -**Pattern 1: Simple streaming transformation** - -```python -@dp.table() -def bronze(): - return spark.readStream.format("cloudFiles") \ - .option("cloudFiles.format", "json") \ - .load("/path/to/data") - -@dp.table() -def silver(): - return spark.readStream.table("bronze").filter("id IS NOT NULL") -``` - -**Pattern 2: Multi-source aggregation** - -```python -dp.create_streaming_table(name="all_events") - -@dp.append_flow(target="all_events", name="mobile") -def mobile(): - return spark.readStream.table("mobile.events") - -@dp.append_flow(target="all_events", name="web") -def web(): - return spark.readStream.table("web.events") -``` - -**Pattern 3: One-time backfill with append flow** - -```python -dp.create_streaming_table(name="transactions") - -# Continuous streaming flow for new data -@dp.append_flow(target="transactions", name="live_stream") -def live_transactions(): - return spark.readStream.table("source.transactions") - -# One-time backfill flow for historical data (uses spark.read for batch) -@dp.append_flow( - target="transactions", - name="historical_backfill", - once=True, - comment="Backfill historical transactions from archive" 
-) -def backfill_transactions(): - return spark.read.table("archive.historical_transactions") -``` - -**Pattern 4: Row filters for data security** - -```python -# Assumes filter_by_dept is a SQL UDF defined in Unity Catalog that returns BOOLEAN - -# Apply row filter to streaming table -@dp.table( - name="employees", - schema="emp_id INT, emp_name STRING, dept STRING, salary DECIMAL(10,2)", - row_filter="ROW FILTER my_catalog.my_schema.filter_by_dept ON (dept)" -) -def employees(): - return spark.readStream.table("source.employees") -``` - -**Pattern 5: Stream-static join (enrich streaming data with dimension table)** - -```python -@dp.table() -def enriched_transactions(): - transactions = spark.readStream.table("transactions") - customers = spark.read.table("customers") - return transactions.join(customers, transactions.customer_id == customers.id) -``` - -The dimension table (`customers`) is read as a static snapshot at stream start, while the streaming source (`transactions`) is read incrementally. - -**Pattern 6: Reading from upstream ST with updates/deletes (skipChangeCommits)** - -```python -@dp.table() -def downstream(): - return spark.readStream.option("skipChangeCommits", "true").table("upstream_with_deletes") -``` - -Use `skipChangeCommits` when reading from a streaming table that has updates/deletes (e.g., GDPR compliance, Auto CDC targets). Without this flag, change commits cause errors. 
- -**KEY RULES:** - -- Streaming tables use `spark.readStream` (streaming reads) -- Materialized views use `spark.read` (batch reads) - see the `materializedView` API guide -- Never use `.writeStream`, `.start()`, or checkpoint options - Databricks manages these automatically -- For streaming flows (`once=False`): Use `spark.readStream` to return a streaming DataFrame -- For one-time flows (`once=True`): Use `spark.read` to return a batch DataFrame -- Generated columns, constraints, and masks require schema definition -- Row filters force full refresh of downstream materialized views -- Use `skipChangeCommits` when reading from STs that have updates/deletes diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/streaming-table-sql.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/streaming-table-sql.md deleted file mode 100644 index 316b7d8..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/streaming-table-sql.md +++ /dev/null @@ -1,288 +0,0 @@ -Streaming Tables in SQL Declarative Pipelines enable incremental processing of continuously arriving data. - -**NOTE:** This guide focuses on streaming tables in SQL. For details on materialized views (batch processing), use the API guide for `materializedView` instead. - -**API Reference:** - -**CREATE STREAMING TABLE** -Creates a streaming table that processes data incrementally using `STREAM()` for streaming reads. For materialized views using batch reads (without `STREAM()`), see the `materializedView` API guide. - -```sql -CREATE OR REFRESH [PRIVATE] STREAMING TABLE - table_name - [ table_specification ] - [ table_clauses ] - [ AS query ] - -table_specification - ( { column_identifier column_type [column_properties] } [, ...] - [ column_constraint ] [, ...] - [ , table_constraint ] [...] 
) - - column_properties - { NOT NULL | COMMENT column_comment | column_constraint | MASK clause } [ ... ] - -table_clauses - { USING DELTA - PARTITIONED BY (col [, ...]) | - CLUSTER BY clause | - LOCATION path | - COMMENT view_comment | - TBLPROPERTIES clause | - WITH { ROW FILTER clause } } [ ... ] -``` - -**Parameters:** - -- `PRIVATE`: Restricts table to pipeline scope; prevents metastore publication -- `table_name`: Unique identifier for the table (fully qualified name including catalog and schema must be unique unless marked PRIVATE) -- `table_specification`: Optional schema definition with column names, types, and properties - - `column_identifier`: Name of the column - - `column_type`: Data type (STRING, BIGINT, DECIMAL, etc.) - - `column_properties`: Column attributes: - - `NOT NULL`: Column cannot contain null values - - `COMMENT column_comment`: Description for the column - - `column_constraint`: Data quality constraints, consult the `expectations` API guide for details. - - `MASK clause`: Column masking syntax `MASK catalog.schema.mask_fn USING COLUMNS (other_column)` (Public Preview) - - `table_constraint`: Informational table-level constraints (Unity Catalog only, **not enforced** by Databricks): - - Look up exact documentation when using - - Note: Constraints are informational metadata for documentation and query optimization hints; data validation must be performed independently -- `table_clauses`: Optional clauses for table configuration: - - `USING DELTA`: Optional format specification (only DELTA supported, can be omitted) - - `PARTITIONED BY (col [, ...])`: Columns for traditional partitioning, mutually exclusive with CLUSTER BY - - `CLUSTER BY clause`: Columns for liquid clustering (optimized query performance, recommended over partitioning) - - `LOCATION path`: Storage path (defaults to pipeline storage location) - - `COMMENT view_comment`: Description for the table - - `TBLPROPERTIES clause`: Custom table properties `(key = value [, ...])` - - 
`WITH ROW FILTER clause`: Row-level security filtering - - Syntax: `ROW FILTER func_name ON (column_name [, ...])` (Public Preview) - - `func_name` must be a SQL UDF returning BOOLEAN (can be defined in Unity Catalog) - - Rows are filtered out when function returns FALSE or NULL - - Accepts table columns or constant literals (STRING, numeric, BOOLEAN, INTERVAL, NULL) - - Filter applies when rows are fetched from the data source - - Runs with pipeline owner's rights during refresh and invoker's rights during queries - - Note: Using row filters on source tables forces full refresh of downstream materialized views - - Note: It is NOT possible to call `CREATE FUNCTION` within a Spark Declarative Pipeline. -- `query`: A Spark SQL query that defines the streaming dataset. Must use `STREAM()` function for streaming semantics. - -**STREAM() Function:** -Provides streaming read semantics for the source table. Required for streaming queries. - -```sql -SELECT * FROM STREAM(source_catalog.schema.source_table); -``` - -**CREATE FLOW with INSERT INTO** -Creates a flow that appends data from a source to an existing target streaming table. Multiple flows can write to the same target table. - -```sql -CREATE FLOW flow_name [COMMENT comment] AS -INSERT INTO [ONCE] target_table BY NAME query -``` - -**Parameters:** - -- `flow_name`: Unique identifier for the flow. Use distinct names when multiple flows target the same table. -- `ONCE`: Controls whether the flow runs continuously or once: - - **Omitted (default)**: Flow continuously processes new data as it arrives in streaming mode. **Query must use `STREAM()` for streaming reads**. - - **ONCE**: Flow processes data only once during pipeline execution and then stops. **Query uses non-streaming reads (without `STREAM()`)** for batch processing. Re-executes during pipeline complete refreshes to recreate data. -- `target_table_name`: The name of the target streaming table where data will be appended. 
Target must exist (created with `CREATE STREAMING TABLE`). **Required.** -- `SELECT ... FROM STREAM(source_table)`: The query to read source data - - For continuous flows (no ONCE): Use `STREAM()` to return streaming data - - For one-time flows (with ONCE): Omit `STREAM()` to return batch data - -**Two Ways to Define Streaming Tables:** - -1. **CREATE STREAMING TABLE with AS SELECT (MOST COMMON)** - - Defines schema and query in one statement - - Schema can be inferred from query or explicitly defined - - **This automatically creates a continuous streaming pipeline - no separate flow needed** - - ```sql - CREATE STREAMING TABLE events_stream - AS SELECT * FROM STREAM(source_catalog.schema.events); - ``` - -2. **CREATE STREAMING TABLE without AS SELECT** - - Creates an empty streaming table target - - Required for multi-source append patterns - - Schema definition is optional - - **Requires separate `CREATE FLOW` statements to populate the table** - - ```sql - CREATE STREAMING TABLE users ( - user_id INT, - name STRING, - updated_at TIMESTAMP - ); - ``` - -**CRITICAL: WHEN TO USE WHICH:** - -Use **CREATE STREAMING TABLE with AS SELECT** when: - -- Reading and transforming streaming data from a single source -- Creating streaming tables from Delta tables, Auto Loader sources, etc. -- This is the standard pattern for most streaming use cases -- **DO NOT add a separate `CREATE FLOW` - the AS SELECT clause already handles continuous processing** - -Use **CREATE STREAMING TABLE without AS SELECT + CREATE FLOW** when: - -- Creating a target table for multiple `INSERT INTO` flows from different sources -- Need to explicitly define table schema before data flows in -- Using `AUTO CDC INTO` for CDC. See 'autoCdc' API guide for details. -- **In this case, you MUST create separate flows - the table definition alone does not process data** - -**NEVER:** - -- Create both `CREATE STREAMING TABLE ... 
AS SELECT` AND `CREATE FLOW` for the same source - this is redundant and incorrect -- The AS SELECT clause already provides continuous streaming; adding a flow duplicates the work - -**Common Patterns:** - -**Pattern 1: Simple streaming transformation** - -```sql --- Bronze layer: ingest raw data with Auto Loader -CREATE STREAMING TABLE bronze -AS SELECT * FROM STREAM(read_files( - '/path/to/data', - format => 'json' -)); - --- Silver layer: filter and clean data -CREATE STREAMING TABLE silver -AS SELECT * -FROM STREAM(bronze) -WHERE id IS NOT NULL; -``` - -**Pattern 2: Multi-source aggregation with flows** - -```sql --- Create target table for multiple sources. Schema is optional. -CREATE STREAMING TABLE all_events ( - event_id STRING, - event_type STRING, - event_timestamp TIMESTAMP, - source STRING -); - --- Flow from mobile source -CREATE FLOW mobile_flow -AS INSERT INTO all_events BY NAME -SELECT event_id, event_type, event_timestamp, 'mobile' as source -FROM STREAM(mobile.events); - --- Flow from web source -CREATE FLOW web_flow -AS INSERT INTO all_events BY NAME -SELECT event_id, event_type, event_timestamp, 'web' as source -FROM STREAM(web.events); -``` - -**Pattern 3: Row filters for data security** - -```sql --- Assumes filter_by_dept is a SQL UDF defined in Unity Catalog that returns BOOLEAN - -CREATE STREAMING TABLE employees ( - emp_id INT, - emp_name STRING, - dept STRING, - salary DECIMAL(10,2) -) -WITH ROW FILTER my_catalog.my_schema.filter_by_dept ON (dept) -AS SELECT * FROM STREAM(source.employees); -``` - -**Pattern 4: Partitioning and clustering** - -```sql --- Using partitioning (traditional approach) -CREATE STREAMING TABLE orders_partitioned -PARTITIONED BY (order_date) -AS SELECT * FROM STREAM(source.orders); - --- Using liquid clustering (recommended) -CREATE STREAMING TABLE orders_clustered -CLUSTER BY (order_date, customer_id) -AS SELECT * FROM STREAM(source.orders); -``` - -**Pattern 5: Sensitive data masking** - -```sql -CREATE 
STREAMING TABLE customers ( - customer_id INT, - name STRING, - email STRING, - ssn STRING MASK catalog.schema.ssn_mask USING COLUMNS (customer_id) -) -AS SELECT * FROM STREAM(source.customers); -``` - -**Pattern 6: Private streaming table (pipeline-internal staging)** - -```sql -CREATE OR REFRESH PRIVATE STREAMING TABLE staging_events -AS SELECT * -FROM STREAM(raw_events) -WHERE event_type IS NOT NULL; -``` - -Use `PRIVATE` for internal staging datasets that should not be published to the catalog. Private tables are only accessible within the pipeline. - -**Pattern 7: One-time backfill with flow** - -```sql -CREATE STREAMING TABLE transactions ( - transaction_id STRING, - customer_id STRING, - amount DECIMAL(10,2), - transaction_date TIMESTAMP -); - --- Continuous streaming flow for new data -CREATE FLOW live_stream -AS INSERT INTO transactions -SELECT * FROM STREAM(source.transactions); - --- One-time backfill flow for historical data (uses batch read without STREAM) -CREATE FLOW historical_backfill -AS INSERT INTO ONCE transactions -SELECT * FROM archive.historical_transactions; -``` - -**Pattern 8: Stream-static join (enrich streaming data with dimension table)** - -```sql -CREATE OR REFRESH STREAMING TABLE enriched_transactions -AS SELECT t.*, c.name, c.email -FROM STREAM(transactions) t -JOIN customers c ON t.customer_id = c.id; -``` - -The dimension table (`customers`) is read as a static snapshot at stream start, while the streaming source (`transactions`) is read incrementally. This is the standard pattern for enriching streaming data with lookup/dimension tables. - -**Pattern 9: Reading from upstream ST with updates/deletes (skipChangeCommits)** - -```sql -CREATE OR REFRESH STREAMING TABLE downstream -AS SELECT * FROM STREAM read_stream("upstream_with_deletes", skipChangeCommits => true) -``` - -Use `skipChangeCommits` when reading from a streaming table that has updates/deletes (e.g., GDPR compliance, Auto CDC targets). 
Without this flag, change commits cause errors. - -**KEY RULES:** - -- Streaming tables require `STREAM()` keyword for streaming reads -- Never use batch reads (`SELECT * FROM table` without `STREAM()`) in streaming table definitions -- `ALTER TABLE` commands are not supported - use `CREATE OR REFRESH` or `ALTER STREAMING TABLE` instead -- Generated columns, identity columns, and default columns are not currently supported -- Row filters force full refresh of downstream materialized views -- Only table owners can refresh streaming tables -- Table renaming and ownership changes prohibited -- `CLUSTER BY` is recommended over `PARTITIONED BY` for most use cases -- For batch processing, use materialized views instead (see the `materializedView` API guide) -- Use `skipChangeCommits` when reading from STs that have updates/deletes diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/streaming-table.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/streaming-table.md deleted file mode 100644 index f57baf9..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/streaming-table.md +++ /dev/null @@ -1,19 +0,0 @@ -# Streaming Tables in Spark Declarative Pipelines - -Streaming tables enable continuous processing of data streams with exactly-once semantics and automatic checkpointing. 
- -## Key Concepts - -Streaming tables in Spark Declarative Pipelines: - -- Process data continuously as it arrives -- Provide exactly-once processing guarantees -- Support stateful operations (aggregations, joins, deduplication) -- Automatically manage checkpoints and state - -## Language-Specific Implementations - -For detailed implementation guides: - -- **Python**: [streaming-table-python.md](streaming-table-python.md) -- **SQL**: [streaming-table-sql.md](streaming-table-sql.md) diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/temporary-view-python.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/temporary-view-python.md deleted file mode 100644 index dab90cd..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/temporary-view-python.md +++ /dev/null @@ -1,66 +0,0 @@ -Temporary Views in Spark Declarative Pipelines create temporary logical datasets without persisting data to storage. Use views for intermediate transformations that drive downstream workloads but don't need materialization. - -**API Reference:** - -**@dp.temporary_view() (preferred) / @dp.view() (alias) / @dlt.view() (deprecated)** -Decorator to define a temporary view. 
- -```python -@dp.temporary_view( - name="", - comment="" -) -def my_view(): - return spark.read.table("source.data") -``` - -Parameters: - -- `name` (str): View name (defaults to function name) -- `comment` (str): Description for the view - -**Common Patterns:** - -**Pattern 1: Intermediate transformation layer** - -```python -# View for shared filtering logic -@dp.temporary_view() -def valid_events(): - return spark.read.table("raw.events") \ - .filter("event_type IS NOT NULL") \ - .filter("timestamp IS NOT NULL") - -# Multiple tables consume the view -@dp.materialized_view() -def user_events(): - return spark.read.table("valid_events") \ - .filter("event_type = 'user_action'") - -@dp.materialized_view() -def system_events(): - return spark.read.table("valid_events") \ - .filter("event_type = 'system_event'") -``` - -**Pattern 2: Streaming views** - -```python -# Views work with streaming DataFrames too -@dp.temporary_view() -def streaming_events(): - return spark.readStream.table("bronze.events") \ - .filter("event_id IS NOT NULL") - -@dp.table() -def filtered_stream(): - return spark.readStream.table("streaming_events") \ - .filter("event_type = 'critical'") -``` - -**KEY RULES:** - -- Views can return either batch (`spark.read`) or streaming (`spark.readStream`) DataFrames -- Views are not materialized - they're computed on demand when referenced -- Reference views using `spark.read.table("view_name")` or `spark.readStream.table("view_name")` -- Views prevent code duplication when multiple downstream tables need the same transformation diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/temporary-view-sql.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/temporary-view-sql.md deleted file mode 100644 index f1d8bb6..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/temporary-view-sql.md +++ /dev/null @@ -1,82 +0,0 @@ 
-Temporary Views in Spark Declarative Pipelines create temporary logical datasets without persisting data to storage. Use views for intermediate transformations that drive downstream workloads but don't need materialization. - -**API Reference:** - -**CREATE TEMPORARY VIEW** -SQL statement to define a temporary view. - -```sql -CREATE TEMPORARY VIEW view_name - [(col_name [COMMENT col_comment] [, ...])] - [COMMENT view_comment] - [TBLPROPERTIES (key = value [, ...])] -AS query -``` - -Parameters: - -- `view_name` (identifier): Name of the temporary view -- `col_name` (identifier): Optional column name specifications -- `col_comment` (string): Optional description for individual columns -- `view_comment` (string): Optional description for the view -- `TBLPROPERTIES` (key-value pairs): Optional table properties -- `query` (SELECT statement): Query that defines the view's data - -**Common Patterns:** - -**Pattern 1: Intermediate transformation layer** - -```sql --- View for shared filtering logic -CREATE TEMPORARY VIEW valid_events -AS SELECT * FROM raw.events -WHERE event_type IS NOT NULL - AND timestamp IS NOT NULL; - --- Multiple tables consume the view -CREATE MATERIALIZED VIEW user_events -AS SELECT * FROM valid_events -WHERE event_type = 'user_action'; - -CREATE MATERIALIZED VIEW system_events -AS SELECT * FROM valid_events -WHERE event_type = 'system_event'; -``` - -**Pattern 2: Views with streaming sources** - -```sql --- Temporary views work with streaming sources too -CREATE TEMPORARY VIEW streaming_events -AS SELECT * FROM STREAM(bronze.events) -WHERE event_id IS NOT NULL; - --- Downstream streaming table consuming the view -CREATE STREAMING TABLE filtered_stream -AS SELECT * FROM STREAM(streaming_events) -WHERE event_type = 'critical'; -``` - -**KEY RULES:** - -- Views are not materialized - they're computed on demand when referenced -- Views exist only during the pipeline execution lifetime and are private to the pipeline -- Reference views in downstream 
tables using `FROM view_name` or `FROM STREAM(view_name)` for streaming -- Views prevent code duplication when multiple downstream tables need the same transformation -- Temporary views work with both batch and streaming data sources (using `STREAM()` function) -- Views can share names with catalog objects; within the pipeline, references resolve to the temporary view - -**IMPORTANT - Using Expectations with Temporary Views:** - -`CREATE TEMPORARY VIEW` does not support CONSTRAINT clauses for expectations. If you need to include expectations (data quality constraints) with a temporary view, use `CREATE LIVE VIEW` syntax instead: - -```sql -CREATE LIVE VIEW view_name( - CONSTRAINT constraint_name EXPECT (condition) [ON VIOLATION DROP ROW | FAIL UPDATE] -) -AS query -``` - -`CREATE LIVE VIEW` is the older syntax for temporary views, retained specifically for this use case. Use `CREATE TEMPORARY VIEW` for views without expectations, and `CREATE LIVE VIEW` when you need to add CONSTRAINT clauses. - -For detailed information on using expectations with temporary views, see the "expectations" API guide. diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/temporary-view.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/temporary-view.md deleted file mode 100644 index 0ea0a88..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/temporary-view.md +++ /dev/null @@ -1,19 +0,0 @@ -# Temporary Views in Spark Declarative Pipelines - -Temporary views are pipeline-private views that exist only within the context of the pipeline and are not published to Unity Catalog. 
- -## Key Concepts - -Temporary views in Spark Declarative Pipelines: - -- Are private to the pipeline (not published to Unity Catalog) -- Can be referenced by other tables/views in the same pipeline -- Do not persist after pipeline execution -- Useful for organizing complex transformations - -## Language-Specific Implementations - -For detailed implementation guides: - -- **Python**: [temporary-view-python.md](temporary-view-python.md) -- **SQL**: [temporary-view-sql.md](temporary-view-sql.md) diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/view-sql.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/view-sql.md deleted file mode 100644 index 2d47f36..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/view-sql.md +++ /dev/null @@ -1,76 +0,0 @@ -Views in Spark Declarative Pipelines create virtual tables published to the Unity Catalog metastore. Unlike temporary views (which are private to the pipeline), views created with CREATE VIEW are accessible outside the pipeline and persist in the catalog. - -**API Reference:** - -**CREATE VIEW** -SQL statement to define a persistent view in Unity Catalog. 
- -```sql -CREATE VIEW view_name - [COMMENT view_comment] - [TBLPROPERTIES (key = value [, ...])] -AS query -``` - -Parameters: - -- `view_name` (identifier): Unique identifier within the catalog and schema -- `view_comment` (string): Optional description for the view -- `TBLPROPERTIES` (key-value pairs): Optional table properties -- `query` (SELECT statement): Query that defines the view's data (must be batch, not streaming) - -**Common Patterns:** - -**Pattern 1: Filtered view for reusable logic** - -```sql --- View with filtering logic published to catalog -CREATE VIEW valid_orders -COMMENT 'Orders with valid data for analysis' -AS SELECT * -FROM raw.orders -WHERE order_id IS NOT NULL - AND customer_id IS NOT NULL - AND order_date IS NOT NULL; - --- Multiple downstream tables can reference this view -CREATE MATERIALIZED VIEW orders_by_region -AS SELECT - region, - COUNT(*) AS order_count, - SUM(amount) AS total_revenue -FROM valid_orders -GROUP BY region; -``` - -**Pattern 2: View with custom properties** - -```sql --- View with table properties for metadata -CREATE VIEW customer_summary -COMMENT 'Aggregated customer metrics' -TBLPROPERTIES ( - 'quality' = 'silver', - 'owner' = 'analytics-team', - 'refresh_frequency' = 'daily' -) -AS SELECT - customer_id, - COUNT(DISTINCT order_id) AS total_orders, - SUM(amount) AS lifetime_value, - MAX(order_date) AS last_order_date -FROM valid_orders -GROUP BY customer_id; -``` - -**KEY RULES:** - -- Views are virtual tables - not materialized, computed on demand when referenced -- Views are published to Unity Catalog and accessible outside the pipeline -- Views require Unity Catalog pipelines with default publishing mode -- Does not support explicit column definitions with COMMENT -- Cannot use `STREAM()` function - views must use batch queries only -- Cannot define expectations (CONSTRAINT clauses) on views -- Views require appropriate permissions: SELECT on source tables, CREATE TABLE on target schema -- For 
pipeline-private views, use `CREATE TEMPORARY VIEW` instead -- For materialized data persistence, use `CREATE MATERIALIZED VIEW` instead diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/view.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/view.md deleted file mode 100644 index f028227..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/view.md +++ /dev/null @@ -1,20 +0,0 @@ -# Views in Spark Declarative Pipelines - -Views provide a way to define reusable query logic and publish datasets to Unity Catalog for broader consumption. - -## Key Concepts - -Views in Spark Declarative Pipelines: - -- Are published to Unity Catalog when the pipeline runs -- Can reference other tables and views in the pipeline -- Support both SQL and Python (with limitations) -- Are refreshed when the pipeline updates - -## Language-Specific Implementations - -For detailed implementation guides: - -- **SQL**: [view-sql.md](view-sql.md) - -**Important**: Python in Spark Declarative Pipelines only supports temporary views (private to the pipeline), not persistent views published to Unity Catalog. For Unity Catalog-published views, use SQL syntax with `CREATE VIEW`. diff --git a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/write-spark-declarative-pipelines.md b/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/write-spark-declarative-pipelines.md deleted file mode 100644 index 7806c71..0000000 --- a/examples/agentic-support-console/template/.agents/skills/databricks-pipelines/references/write-spark-declarative-pipelines.md +++ /dev/null @@ -1,8 +0,0 @@ -# Write Spark Declarative Pipelines - -Core syntax and rules for writing Spark Declarative Pipelines datasets. 
-
-## Language-specific guides
-
-- [Python basics](python-basics.md) - Python decorators, functions, and critical rules
-- [SQL basics](sql-basics.md) - SQL statements and critical rules
diff --git a/examples/agentic-support-console/template/.env.example b/examples/agentic-support-console/template/.env.example
deleted file mode 100644
index aa32224..0000000
--- a/examples/agentic-support-console/template/.env.example
+++ /dev/null
@@ -1,11 +0,0 @@
-DATABRICKS_HOST=https://...
-DATABRICKS_WAREHOUSE_ID=your_sql_warehouse_id
-DATABRICKS_GENIE_SPACE_ID=your_genie_space_id
-PGDATABASE=your_postgres_databaseName
-LAKEBASE_ENDPOINT=your_postgres_endpointPath
-PGHOST=your_postgres_host
-PGPORT=5432
-PGSSLMODE=require
-DATABRICKS_APP_PORT=8000
-DATABRICKS_APP_NAME=support-console
-FLASK_RUN_HOST=0.0.0.0
diff --git a/examples/agentic-support-console/template/.gitignore b/examples/agentic-support-console/template/.gitignore
deleted file mode 100644
index f2abc32..0000000
--- a/examples/agentic-support-console/template/.gitignore
+++ /dev/null
@@ -1,10 +0,0 @@
-.DS_Store
-node_modules/
-client/dist/
-dist/
-build/
-.env
-.databricks/
-.smoke-test/
-test-results/
-playwright-report/
diff --git a/examples/agentic-support-console/template/.prettierrc.json b/examples/agentic-support-console/template/.prettierrc.json
deleted file mode 100644
index d95a63f..0000000
--- a/examples/agentic-support-console/template/.prettierrc.json
+++ /dev/null
@@ -1,12 +0,0 @@
-{
-  "semi": true,
-  "trailingComma": "es5",
-  "singleQuote": true,
-  "printWidth": 120,
-  "tabWidth": 2,
-  "useTabs": false,
-  "arrowParens": "always",
-  "endOfLine": "lf",
-  "bracketSpacing": true,
-  "jsxSingleQuote": false
-}
diff --git a/examples/agentic-support-console/template/README.md b/examples/agentic-support-console/template/README.md
deleted file mode 100644
index 78e49f3..0000000
--- a/examples/agentic-support-console/template/README.md
+++ /dev/null
@@ -1,119 +0,0 @@
-# Agentic Support Console
-
-An end-to-end example of an AI-powered support console built on the Databricks platform: Lakebase (OLTP), Lakehouse Sync (CDC), Lakeflow Declarative Pipelines (medallion), a Lakeflow Job (AI agent), reverse sync, and a Databricks App with Genie analytics.
-
-## Architecture
-
-```
-OLTP (Lakebase Postgres)
- | Lakehouse Sync (CDC)
- v
-Bronze (lakebase.lb_*_history)
- | Lakeflow Declarative Pipeline
- v
-Silver (current-state SCD Type 1 tables)
- | Materialized Views
- v
-Gold (analytics + support context)
- | Lakeflow Job (LLM via AI Gateway)
- v
-gold.support_agent_responses (Delta)
- | Reverse Sync (Sync Tables)
- v
-Lakebase (gold.*_sync tables)
- v
-Support Console (Databricks App)
-```
-
-## Components
-
-- **`client/`, `server/`, `config/queries/`** — AppKit app (Cases, Case Detail, Analytics).
-- **`pipelines/support_analytics/`** — Medallion Lakeflow Declarative Pipeline (silver + gold).
-- **`pipelines/support_agent/`** — Lakeflow Job calling an LLM via AI Gateway.
-- **`seed/`** — TypeScript seed for Lakebase OLTP data.
-- **`provisioning/sql/`** — Optional baseline SQL (Unity Catalog schemas, Postgres `REPLICA IDENTITY FULL`).
-
-## Prerequisites
-
-- Databricks CLI with a workspace profile
-- Lakebase Postgres project, branch, database
-- Unity Catalog catalog (with grants)
-- SQL Warehouse
-- Genie Space (analytics tab)
-
-## Provisioning
-
-Optional scripts in `provisioning/sql/` are idempotent where noted. Skip if you reuse an existing catalog and sync.
-
-| Step | What |
-| ---- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| 1 | **Catalog** — Must exist (often UI or CLI with storage root). |
-| 2 | **UC schemas** — Run `provisioning/sql/01_unity_catalog_schemas.sql` in a SQL warehouse after replacing `__CATALOG_NAME__`. |
-| 3 | **Lakebase** — Create tables via `seed/`; seed sets `REPLICA IDENTITY FULL`, or run `provisioning/sql/02_lakebase_replica_identity_full.sql` on Postgres. |
-| 4 | **Lakehouse Sync** — UI: replicate Lakebase `public` to UC `lakebase.lb_*_history`. |
-| 5 | **Bundles** — Deploy `pipelines/support_analytics`, `pipelines/support_agent`, then this app (see below). |
-| 6 | **Reverse sync** — UI: Sync Tables from gold Delta to Lakebase `gold.*_sync` (see table below). |
-| 7 | **Genie** — UI: create space, wire `genie_space_id` in the app bundle. |
-
-### Reverse sync (Sync Tables)
-
-| Source | Target | Mode |
-| ------------------------------ | ----------------------------------- | ---------- |
-| `gold.support_agent_responses` | `gold.support_agent_responses_sync` | CONTINUOUS |
-| `gold.support_case_context` | `gold.support_case_context_sync` | SNAPSHOT |
-| `gold.user_support_profile` | `gold.user_support_profile_sync` | SNAPSHOT |
-| `gold.support_overview` | `gold.support_overview_sync` | SNAPSHOT |
-
-## Setup (order)
-
-### 1. Seed data
-
-```bash
-cd seed
-npm install
-DATABASE_URL="postgresql://..." npm run seed
-```
-
-### 2. Lakehouse Sync
-
-Configure in the UI from Lakebase `public` to Unity Catalog (bronze history tables).
-
-### 3. Deploy medallion pipeline
-
-```bash
-cd pipelines/support_analytics
-# Set workspace host and catalog in databricks.yml
-databricks bundle deploy --target dev
-```
-
-### 4. Deploy support agent job
-
-```bash
-cd pipelines/support_agent
-databricks bundle deploy --target dev
-```
-
-### 5. Reverse sync
-
-Configure Sync Tables per the table above (UI).
-
-### 6. Deploy app
-
-From this `template/` directory:
-
-```bash
-npm install
-databricks bundle deploy
-```
-
-### Optional: CLI scaffold
-
-```bash
-databricks apps init \
-  --template https://github.com/databricks/devhub/tree/main/examples/agentic-support-console \
-  --name support-console
-```
-
-## Tech stack
-
-AppKit (Express + React 19), Lakebase, Unity Catalog, Lakeflow Pipelines/Jobs, AI Gateway, Databricks Asset Bundles.
diff --git a/examples/agentic-support-console/template/app.yaml b/examples/agentic-support-console/template/app.yaml
deleted file mode 100644
index e1033a0..0000000
--- a/examples/agentic-support-console/template/app.yaml
+++ /dev/null
@@ -1,8 +0,0 @@
-command: ['npm', 'run', 'start']
-env:
-  - name: DATABRICKS_WAREHOUSE_ID
-    valueFrom: sql-warehouse
-  - name: DATABRICKS_GENIE_SPACE_ID
-    valueFrom: genie-space
-  - name: LAKEBASE_ENDPOINT
-    valueFrom: postgres
diff --git a/examples/agentic-support-console/template/appkit.plugins.json b/examples/agentic-support-console/template/appkit.plugins.json
deleted file mode 100644
index 9309347..0000000
--- a/examples/agentic-support-console/template/appkit.plugins.json
+++ /dev/null
@@ -1,151 +0,0 @@
-{
-  "$schema": "https://databricks.github.io/appkit/schemas/template-plugins.schema.json",
-  "version": "1.0",
-  "plugins": {
-    "analytics": {
-      "name": "analytics",
-      "displayName": "Analytics Plugin",
-      "description": "SQL query execution against Databricks SQL Warehouses",
-      "package": "@databricks/appkit",
-      "resources": {
-        "required": [
-          {
-            "type": "sql_warehouse",
-            "alias": "SQL Warehouse",
-            "resourceKey": "sql-warehouse",
-            "description": "SQL Warehouse for executing analytics queries",
-            "permission": "CAN_USE",
-            "fields": {
-              "id": {
-                "env": "DATABRICKS_WAREHOUSE_ID",
-                "description": "SQL Warehouse ID"
-              }
-            }
-          }
-        ],
-        "optional": []
-      },
-      "requiredByTemplate": true
-    },
-    "files": {
-      "name": "files",
-      "displayName": "Files Plugin",
-      "description": "File operations against Databricks Volumes and Unity Catalog",
-      "package": "@databricks/appkit",
-      "resources": {
-        "required": [
-          {
-            "type": "volume",
-            "alias": "Files",
-            "resourceKey": "files",
-            "description": "Permission to write to volumes",
-            "permission": "WRITE_VOLUME",
-            "fields": {
-              "path": {
-                "env": "DATABRICKS_VOLUME_FILES",
-                "description": "Volume path for file storage (e.g. /Volumes/catalog/schema/volume_name)"
-              }
-            }
-          }
-        ],
-        "optional": []
-      }
-    },
-    "genie": {
-      "name": "genie",
-      "displayName": "Genie Plugin",
-      "description": "AI/BI Genie space integration for natural language data queries",
-      "package": "@databricks/appkit",
-      "resources": {
-        "required": [
-          {
-            "type": "genie_space",
-            "alias": "Genie Space",
-            "resourceKey": "genie-space",
-            "description": "Genie Space for AI-powered data queries. Space IDs configured via plugin config.",
-            "permission": "CAN_RUN",
-            "fields": {
-              "id": {
-                "env": "DATABRICKS_GENIE_SPACE_ID",
-                "description": "Default Genie Space ID"
-              }
-            }
-          }
-        ],
-        "optional": []
-      },
-      "requiredByTemplate": true
-    },
-    "lakebase": {
-      "name": "lakebase",
-      "displayName": "Lakebase",
-      "description": "SQL query execution against Databricks Lakebase Autoscaling",
-      "package": "@databricks/appkit",
-      "resources": {
-        "required": [
-          {
-            "type": "postgres",
-            "alias": "Postgres",
-            "resourceKey": "postgres",
-            "description": "Lakebase Postgres database for persistent storage",
-            "permission": "CAN_CONNECT_AND_CREATE",
-            "fields": {
-              "branch": {
-                "description": "Full Lakebase Postgres branch resource name. Obtain by running `databricks postgres list-branches projects/{project-id}`, select the desired item from the output array and use its .name value.",
-                "examples": ["projects/{project-id}/branches/{branch-id}"]
-              },
-              "database": {
-                "description": "Full Lakebase Postgres database resource name. Obtain by running `databricks postgres list-databases {branch-name}`, select the desired item from the output array and use its .name value. Requires the branch resource name.",
-                "examples": ["projects/{project-id}/branches/{branch-id}/databases/{database-id}"]
-              },
-              "host": {
-                "env": "PGHOST",
-                "localOnly": true,
-                "resolve": "postgres:host",
-                "description": "Postgres host for local development. Auto-injected by the platform at deploy time."
-              },
-              "databaseName": {
-                "env": "PGDATABASE",
-                "localOnly": true,
-                "resolve": "postgres:databaseName",
-                "description": "Postgres database name for local development. Auto-injected by the platform at deploy time."
-              },
-              "endpointPath": {
-                "env": "LAKEBASE_ENDPOINT",
-                "bundleIgnore": true,
-                "resolve": "postgres:endpointPath",
-                "description": "Lakebase endpoint resource name. Auto-injected at runtime via app.yaml valueFrom: postgres. For local development, obtain by running `databricks postgres list-endpoints {branch-name}`, select the desired item from the output array and use its .name value.",
-                "examples": ["projects/{project-id}/branches/{branch-id}/endpoints/{endpoint-id}"]
-              },
-              "port": {
-                "env": "PGPORT",
-                "localOnly": true,
-                "value": "5432",
-                "description": "Postgres port. Auto-injected by the platform at deploy time."
-              },
-              "sslmode": {
-                "env": "PGSSLMODE",
-                "localOnly": true,
-                "value": "require",
-                "description": "Postgres SSL mode. Auto-injected by the platform at deploy time."
-              }
-            }
-          }
-        ],
-        "optional": []
-      },
-      "requiredByTemplate": true
-    },
-    "server": {
-      "name": "server",
-      "displayName": "Server Plugin",
-      "description": "HTTP server with Express, static file serving, and Vite dev mode support",
-      "package": "@databricks/appkit",
-      "resources": {
-        "required": [],
-        "optional": []
-      },
-      "requiredByTemplate": true
-    }
-  }
-}
diff --git a/examples/agentic-support-console/template/client/components.json b/examples/agentic-support-console/template/client/components.json
deleted file mode 100644
index 13e1db0..0000000
--- a/examples/agentic-support-console/template/client/components.json
+++ /dev/null
@@ -1,21 +0,0 @@
-{
-  "$schema": "https://ui.shadcn.com/schema.json",
-  "style": "new-york",
-  "rsc": false,
-  "tsx": true,
-  "tailwind": {
-    "config": "",
-    "css": "src/index.css",
-    "baseColor": "neutral",
-    "cssVariables": true,
-    "prefix": ""
-  },
-  "aliases": {
-    "components": "@/components",
-    "utils": "@/lib/utils",
-    "ui": "@/components/ui",
-    "lib": "@/lib",
-    "hooks": "@/hooks"
-  },
-  "iconLibrary": "lucide"
-}
diff --git a/examples/agentic-support-console/template/client/index.html b/examples/agentic-support-console/template/client/index.html
deleted file mode 100644
index 6b0d9ff..0000000
--- a/examples/agentic-support-console/template/client/index.html
+++ /dev/null
@@ -1,13 +0,0 @@
- - - - - - - Support Console - - -
- - - diff --git a/examples/agentic-support-console/template/client/public/favicon.svg b/examples/agentic-support-console/template/client/public/favicon.svg deleted file mode 100644 index cb30c1e..0000000 --- a/examples/agentic-support-console/template/client/public/favicon.svg +++ /dev/null @@ -1,6 +0,0 @@ - - - - - - diff --git a/examples/agentic-support-console/template/client/public/site.webmanifest b/examples/agentic-support-console/template/client/public/site.webmanifest deleted file mode 100644 index 5251312..0000000 --- a/examples/agentic-support-console/template/client/public/site.webmanifest +++ /dev/null @@ -1,19 +0,0 @@ -{ - "name": "Support Console", - "short_name": "Support Console", - "icons": [ - { - "src": "/favicon-192x192.png", - "sizes": "192x192", - "type": "image/png" - }, - { - "src": "/favicon-512x512.png", - "sizes": "512x512", - "type": "image/png" - } - ], - "theme_color": "#ffffff", - "background_color": "#ffffff", - "display": "standalone" -} diff --git a/examples/agentic-support-console/template/client/src/App.tsx b/examples/agentic-support-console/template/client/src/App.tsx deleted file mode 100644 index 2c309be..0000000 --- a/examples/agentic-support-console/template/client/src/App.tsx +++ /dev/null @@ -1,49 +0,0 @@ -import { createBrowserRouter, RouterProvider, NavLink, Outlet } from 'react-router'; -import { CaseQueuePage } from './pages/CaseQueuePage'; -import { CaseDetailPage } from './pages/CaseDetailPage'; -import { AnalyticsPage } from './pages/AnalyticsPage'; - -const navLinkClass = ({ isActive }: { isActive: boolean }) => - `px-3 py-1.5 rounded-md text-sm font-medium transition-colors ${ - isActive ? 'bg-foreground/10 text-foreground' : 'text-muted-foreground hover:text-foreground' - }`; - -function Layout() { - return ( -
-
-
-

Support Console

- -
- Agentic Support -
- -
- -
-
- ); -} - -const router = createBrowserRouter([ - { - element: , - children: [ - { path: '/', element: }, - { path: '/cases/:caseId', element: }, - { path: '/analytics', element: }, - ], - }, -]); - -export default function App() { - return ; -} diff --git a/examples/agentic-support-console/template/client/src/ErrorBoundary.tsx b/examples/agentic-support-console/template/client/src/ErrorBoundary.tsx deleted file mode 100644 index 6a73c26..0000000 --- a/examples/agentic-support-console/template/client/src/ErrorBoundary.tsx +++ /dev/null @@ -1,75 +0,0 @@ -import React, { Component } from 'react'; -import type { ReactNode } from 'react'; -import { Card, CardContent, CardHeader, CardTitle } from '@databricks/appkit-ui/react'; - -interface Props { - children: ReactNode; -} - -interface State { - hasError: boolean; - error: Error | null; - errorInfo: React.ErrorInfo | null; -} - -export class ErrorBoundary extends Component { - constructor(props: Props) { - super(props); - this.state = { - hasError: false, - error: null, - errorInfo: null, - }; - } - - static getDerivedStateFromError(error: Error): Partial { - return { hasError: true, error }; - } - - componentDidCatch(error: Error, errorInfo: React.ErrorInfo) { - console.error('ErrorBoundary caught an error:', error); - console.error('Error details:', errorInfo); - this.setState({ - error, - errorInfo, - }); - } - - render() { - if (this.state.hasError) { - return ( -
- - - Application Error - - -
-
-

Error Message:

-
{this.state.error?.toString()}
-
- {this.state.errorInfo && ( -
-

Component Stack:

-
-                      {this.state.errorInfo.componentStack}
-                    
-
- )} - {this.state.error?.stack && ( -
-

Stack Trace:

-
{this.state.error.stack}
-
- )} -
-
-
-
- ); - } - - return this.props.children; - } -} diff --git a/examples/agentic-support-console/template/client/src/appKitTypes.d.ts b/examples/agentic-support-console/template/client/src/appKitTypes.d.ts deleted file mode 100644 index fbbeed9..0000000 --- a/examples/agentic-support-console/template/client/src/appKitTypes.d.ts +++ /dev/null @@ -1,55 +0,0 @@ -// Auto-generated by AppKit - DO NOT EDIT -// Generated by 'npx @databricks/appkit generate-types' or Vite plugin during build -import '@databricks/appkit-ui/react'; -import type { - SQLTypeMarker, - SQLStringMarker, - SQLNumberMarker, - SQLBooleanMarker, - SQLBinaryMarker, - SQLDateMarker, - SQLTimestampMarker, -} from '@databricks/appkit-ui/js'; - -declare module '@databricks/appkit-ui/react' { - interface QueryRegistry { - agent_performance: { - name: 'agent_performance'; - parameters: Record; - result: Array<{ - /** @sqlType STRING */ - action: string; - /** @sqlType BIGINT */ - count: number; - /** @sqlType DOUBLE */ - avg_amount_cents: number; - }>; - }; - support_metrics: { - name: 'support_metrics'; - parameters: Record; - result: Array<{ - /** @sqlType DATE */ - case_date: string; - /** @sqlType BIGINT */ - total_cases: number; - /** @sqlType BIGINT */ - open_cases: number; - /** @sqlType BIGINT */ - resolved_cases: number; - /** @sqlType DOUBLE */ - avg_messages_per_case: number; - /** @sqlType DOUBLE */ - avg_first_response_minutes: number; - /** @sqlType BIGINT */ - cases_with_refund: number; - /** @sqlType BIGINT */ - cases_with_credit: number; - /** @sqlType BIGINT */ - total_refund_cents: number; - /** @sqlType BIGINT */ - total_credit_cents: number; - }>; - }; - } -} diff --git a/examples/agentic-support-console/template/client/src/components/ActionBadge.tsx b/examples/agentic-support-console/template/client/src/components/ActionBadge.tsx deleted file mode 100644 index 3f90749..0000000 --- a/examples/agentic-support-console/template/client/src/components/ActionBadge.tsx +++ /dev/null @@ -1,22 +0,0 
@@ -const ACTION_STYLES: Record = { - refund: 'bg-amber-500/15 text-amber-400 border-amber-500/30', - credit: 'bg-blue-500/15 text-blue-400 border-blue-500/30', - escalate: 'bg-red-500/15 text-red-400 border-red-500/30', - no_action: 'bg-muted text-muted-foreground border-border/50', -}; - -function formatCents(cents: number): string { - return `$${(cents / 100).toFixed(2)}`; -} - -export function ActionBadge({ action, amountCents }: { action: string; amountCents?: number | null }) { - const style = ACTION_STYLES[action] ?? ACTION_STYLES.no_action; - const showAmount = (action === 'refund' || action === 'credit') && typeof amountCents === 'number' && amountCents > 0; - - return ( - - {action.replace('_', ' ')} - {showAmount && {formatCents(amountCents)}} - - ); -} diff --git a/examples/agentic-support-console/template/client/src/index.css b/examples/agentic-support-console/template/client/src/index.css deleted file mode 100644 index e188d55..0000000 --- a/examples/agentic-support-console/template/client/src/index.css +++ /dev/null @@ -1,33 +0,0 @@ -@import '@databricks/appkit-ui/styles.css'; - -:root { - --radius: 0.625rem; - --background: oklch(0.141 0.005 285.823); - --foreground: oklch(0.985 0 0); - --card: oklch(0.18 0.004 285.823); - --card-foreground: oklch(0.985 0 0); - --popover: oklch(0.18 0.004 285.823); - --popover-foreground: oklch(0.985 0 0); - --primary: oklch(0.92 0.004 286.32); - --primary-foreground: oklch(0.141 0.005 285.823); - --secondary: oklch(0.274 0.006 286.033); - --secondary-foreground: oklch(0.985 0 0); - --muted: oklch(0.274 0.006 286.033); - --muted-foreground: oklch(0.705 0.015 286.067); - --accent: oklch(0.274 0.006 286.033); - --accent-foreground: oklch(0.985 0 0); - --destructive: oklch(0.704 0.191 22.216); - --destructive-foreground: oklch(0.985 0 0); - --success: oklch(0.67 0.12 167); - --success-foreground: oklch(1 0 0); - --warning: oklch(0.83 0.165 85); - --warning-foreground: oklch(0.199 0.027 238.732); - --border: oklch(1 0 0 
/ 10%); - --input: oklch(1 0 0 / 15%); - --ring: oklch(0.552 0.016 285.938); - --chart-1: oklch(0.985 0 0); - --chart-2: oklch(0.705 0.015 286.067); - --chart-3: oklch(0.552 0.016 285.938); - --chart-4: oklch(0.83 0.165 85); - --chart-5: oklch(0.704 0.191 22.216); -} diff --git a/examples/agentic-support-console/template/client/src/lib/utils.ts b/examples/agentic-support-console/template/client/src/lib/utils.ts deleted file mode 100644 index f7734e4..0000000 --- a/examples/agentic-support-console/template/client/src/lib/utils.ts +++ /dev/null @@ -1,42 +0,0 @@ -import { clsx, type ClassValue } from 'clsx'; -import { twMerge } from 'tailwind-merge'; - -export function cn(...inputs: ClassValue[]) { - return twMerge(clsx(inputs)); -} - -const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i; -const HEX32_RE = /^[0-9a-f]{32}$/i; - -function hexToUuid(hex: string): string { - const h = hex.toLowerCase(); - return `${h.slice(0, 8)}-${h.slice(8, 12)}-${h.slice(12, 16)}-${h.slice(16, 20)}-${h.slice(20)}`; -} - -/** - * Attempt to decode a base64-encoded 16-byte identifier (lakehouse BYTEA format) - * into a standard UUID string. Returns null if the input isn't valid base64 or - * doesn't decode to exactly 16 bytes. - */ -function base64ToUuid(b64: string): string | null { - try { - const binary = atob(b64); - if (binary.length !== 16) return null; - const hex = Array.from(binary, (ch) => ch.charCodeAt(0).toString(16).padStart(2, '0')).join(''); - return hexToUuid(hex); - } catch { - return null; - } -} - -/** - * Normalise an identifier that may be a UUID, a 32-char hex UUID, or a - * base64-encoded 16-byte lakehouse ID into a lowercase UUID string. - * Returns null when the input can't be recognised as any of those formats. 
- */ -export function normaliseToUuid(input: string): string | null { - const trimmed = input.trim(); - if (UUID_RE.test(trimmed)) return trimmed.toLowerCase(); - if (HEX32_RE.test(trimmed)) return hexToUuid(trimmed); - return base64ToUuid(trimmed); -} diff --git a/examples/agentic-support-console/template/client/src/main.tsx b/examples/agentic-support-console/template/client/src/main.tsx deleted file mode 100644 index 35c59a5..0000000 --- a/examples/agentic-support-console/template/client/src/main.tsx +++ /dev/null @@ -1,13 +0,0 @@ -import { StrictMode } from 'react'; -import { createRoot } from 'react-dom/client'; -import './index.css'; -import App from './App.tsx'; -import { ErrorBoundary } from './ErrorBoundary.tsx'; - -createRoot(document.getElementById('root')!).render( - - - - - -); diff --git a/examples/agentic-support-console/template/client/src/pages/AnalyticsPage.tsx b/examples/agentic-support-console/template/client/src/pages/AnalyticsPage.tsx deleted file mode 100644 index f327163..0000000 --- a/examples/agentic-support-console/template/client/src/pages/AnalyticsPage.tsx +++ /dev/null @@ -1,144 +0,0 @@ -import { useState } from 'react'; -import { - useAnalyticsQuery, - BarChart, - GenieChat, - Card, - CardContent, - CardHeader, - CardTitle, - Skeleton, -} from '@databricks/appkit-ui/react'; - -function MetricCard({ label, value, sub }: { label: string; value: string; sub?: string }) { - return ( - - -

{label}

-

{value}

- {sub &&

{sub}

} -
-
- ); -} - -type Tab = 'dashboard' | 'genie'; - -const tabClass = (active: boolean) => - `px-4 py-2 text-sm font-medium border-b-2 transition-colors ${ - active - ? 'border-foreground text-foreground' - : 'border-transparent text-muted-foreground hover:text-foreground hover:border-border' - }`; - -function DashboardTab() { - const { data: metrics, loading: metricsLoading } = useAnalyticsQuery('support_metrics', {}); - const latest = metrics && metrics.length > 0 ? metrics[0] : null; - - return ( -
- {metricsLoading && ( -
- {Array.from({ length: 4 }, (_, i) => ( - - ))} -
- )} - - {latest && ( -
- - - - -
- )} - -
- - - Agent Action Distribution - - - - - - - - - Avg Suggested Amount by Action - - - - - -
-
- ); -} - -function GenieTab() { - return ( - - - Ask Genie -

- Ask questions about orders, revenue, support cases, customers, and more. -

-
- - - -
- ); -} - -export function AnalyticsPage() { - const [tab, setTab] = useState('dashboard'); - - return ( -
-
-

Analytics

-
- - -
-
- - {tab === 'dashboard' && ( -
- -
- )} - - {tab === 'genie' && ( -
- -
- )} -
- ); -} diff --git a/examples/agentic-support-console/template/client/src/pages/CaseDetailPage.tsx b/examples/agentic-support-console/template/client/src/pages/CaseDetailPage.tsx deleted file mode 100644 index c6aa198..0000000 --- a/examples/agentic-support-console/template/client/src/pages/CaseDetailPage.tsx +++ /dev/null @@ -1,527 +0,0 @@ -import { - Card, - CardContent, - CardHeader, - CardTitle, - Button, - Input, - Badge, - Skeleton, - Separator, - Select, - SelectContent, - SelectItem, - SelectTrigger, - SelectValue, -} from '@databricks/appkit-ui/react'; -import { useState, useEffect, useCallback } from 'react'; -import { useParams, Link } from 'react-router'; -import { ArrowLeft, Send, Check, ChevronDown, Loader2, Copy } from 'lucide-react'; -import { ActionBadge } from '../components/ActionBadge'; - -interface CaseDetail { - case_id: string; - user_id: string; - user_name: string; - user_email: string; - user_region: string; - subject: string; - status: string; - case_created_at: string; - message_count: number; - has_admin_reply: boolean; - first_response_minutes: number | null; - linked_refund_cents: number; - linked_credit_cents: number; - user_lifetime_spend_cents: number; - user_cases_90d: number; -} - -interface Message { - id: string; - role: 'customer' | 'admin'; - content: string; - created_at: string; -} - -interface AgentResponse { - message_id: string; - case_summary: string; - suggested_response: string; - suggested_action: string; - suggested_amount_cents: number; - reasoning: string; - model: string; - generated_at: string; -} - -interface UserProfile { - total_orders_90d: number; - total_spend_90d_cents: number; - lifetime_order_count: number; - lifetime_spend_cents: number; - support_cases_90d: number; - support_cases_lifetime: number; - total_refunds_90d_cents: number; - total_credits_90d_cents: number; -} - -interface CaseDetailResponse { - case: CaseDetail; - messages: Message[]; - agentResponses: AgentResponse[]; - userProfile: 
UserProfile | null; -} - -const CASE_STATUSES = ['open', 'in_progress', 'resolved', 'closed'] as const; - -function formatCents(cents: number): string { - return `$${(cents / 100).toFixed(2)}`; -} - -function AgentHistory({ responses }: { responses: AgentResponse[] }) { - const [expanded, setExpanded] = useState(false); - - return ( - - - - - {expanded && ( - - {responses.map((r) => ( -
-
- - - {new Date(r.generated_at).toLocaleString([], { - month: 'short', - day: 'numeric', - hour: '2-digit', - minute: '2-digit', - })} - -
-

{r.case_summary}

-

{r.reasoning}

-
- ))} -
- )} -
- ); -} - -function CopyableId({ id }: { id: string }) { - const [copied, setCopied] = useState(false); - - function handleCopy() { - navigator.clipboard.writeText(id).then(() => { - setCopied(true); - setTimeout(() => setCopied(false), 1500); - }); - } - - return ( - - ); -} - -export function CaseDetailPage() { - const { caseId } = useParams<{ caseId: string }>(); - const [data, setData] = useState(null); - const [loading, setLoading] = useState(true); - const [error, setError] = useState(null); - - const [draftResponse, setDraftResponse] = useState(''); - const [draftAction, setDraftAction] = useState('no_action'); - const [draftAmount, setDraftAmount] = useState('0'); - const [submitting, setSubmitting] = useState(false); - const [submitted, setSubmitted] = useState(false); - - const [caseStatus, setCaseStatus] = useState(''); - const [updatingStatus, setUpdatingStatus] = useState(false); - - const refetch = useCallback(() => { - if (!caseId) return; - return fetch(`/api/cases/${caseId}`) - .then((res) => { - if (!res.ok) throw new Error(`Failed to fetch case: ${res.statusText}`); - return res.json() as Promise; - }) - .then((d) => { - setData(d); - setCaseStatus(d.case.status); - const latest = d.agentResponses[0]; - if (latest && !submitted) { - setDraftResponse(latest.suggested_response); - setDraftAction(latest.suggested_action); - setDraftAmount(String(latest.suggested_amount_cents)); - } - }) - .catch((err) => setError(err instanceof Error ? 
err.message : 'Failed to load case')); - }, [caseId, submitted]); - - useEffect(() => { - refetch()?.finally(() => setLoading(false)); - }, [refetch]); - - useEffect(() => { - if (!caseId || loading) return; - const interval = setInterval(() => refetch(), 10_000); - return () => clearInterval(interval); - }, [caseId, loading, refetch]); - - async function handleSubmit() { - if (!caseId || !draftResponse.trim()) return; - setSubmitting(true); - try { - const res = await fetch(`/api/cases/${caseId}/decision`, { - method: 'POST', - headers: { 'Content-Type': 'application/json' }, - body: JSON.stringify({ - case_id: caseId, - admin_action: draftAction, - admin_amount_cents: parseInt(draftAmount, 10) || 0, - admin_response: draftResponse.trim(), - }), - }); - if (!res.ok) throw new Error('Failed to submit'); - setSubmitted(true); - await refetch(); - } catch (err) { - setError(err instanceof Error ? err.message : 'Submit failed'); - } finally { - setSubmitting(false); - } - } - - async function handleStatusChange(newStatus: string) { - if (!caseId || newStatus === caseStatus) return; - setUpdatingStatus(true); - try { - const res = await fetch(`/api/cases/${caseId}/status`, { - method: 'PATCH', - headers: { 'Content-Type': 'application/json' }, - body: JSON.stringify({ status: newStatus }), - }); - if (!res.ok) throw new Error('Failed to update status'); - setCaseStatus(newStatus); - } catch (err) { - console.error('Status update failed:', err); - } finally { - setUpdatingStatus(false); - } - } - - if (loading) { - return ( -
- -
-
- {Array.from({ length: 4 }, (_, i) => ( - - ))} -
-
- -
-
-
- ); - } - - if (error || !data) { - return ( -
-
{error ?? 'Case not found'}
-
- ); - } - - const { case: caseData, messages, agentResponses, userProfile } = data; - const latestAgentResponse = agentResponses[0] ?? null; - const olderAgentResponses = agentResponses.slice(1); - const lastMessage = messages[messages.length - 1]; - const adminAlreadyReplied = lastMessage?.role === 'admin' && !submitted; - const hasLinkedRefund = caseData.linked_refund_cents > 0; - const hasLinkedCredit = caseData.linked_credit_cents > 0; - const hasLinkedCompensation = hasLinkedRefund || hasLinkedCredit; - - return ( -
-
- - - Back - - -

{caseData.subject}

- - - {caseStatus} - -
- -
-
- - -
- Message Thread - {messages.length} messages -
-
- - {messages.map((msg) => ( -
-
- - {msg.role} - - - {new Date(msg.created_at).toLocaleTimeString([], { hour: '2-digit', minute: '2-digit' })} - -
-

{msg.content}

-
- ))} -
-
- - {hasLinkedCompensation && ( - - - Applied Compensation - - -
- {hasLinkedRefund && ( -
- Refund - {formatCents(caseData.linked_refund_cents)} -
- )} - {hasLinkedCredit && ( -
- Credit - {formatCents(caseData.linked_credit_cents)} -
- )} -
-
-
- )} - - {submitted ? ( - - - Decision Submitted - - -
- - Decision recorded successfully. -
-
-
- ) : adminAlreadyReplied ? ( - - - Response Sent - - -
- - An admin response has been sent. Waiting for customer reply. -
-
-
- ) : latestAgentResponse ? ( - - - Your Decision - - -
- - -
- - {(draftAction === 'refund' || draftAction === 'credit') && ( -
- - setDraftAmount(e.target.value)} min={0} /> -
- )} - -
- -