fix: avoid versioned describe table for namespace opens #7250
yanghua
left a comment
Can we also add a test for this change?
version : optional, int | str
    If specified, load a specific version of the Lance dataset. Else, loads the
    latest version. A version number (`int`) or a tag (`str`) can be provided.
The public behavior of version does not change here: it still opens the requested dataset version.
This change only adjusts the internal namespace resolution flow. We now use describe_table to resolve table metadata/location/storage options, and apply version later in the lower-level dataset open path. I'm not sure this internal detail should be exposed in the public parameter docs.
Do you think the current version doc is misleading for users, or were you suggesting documenting this namespace-specific implementation detail?
> and apply version later in the lower-level dataset open path
Can you share more details about this?
Currently some namespace implementations can support describing a specific table version in describe_table, but not all implementations do. For example, REST-backed namespace implementations may only support resolving the current table metadata/location/storage options from describe_table. This is the background of lance-format/lance-spark#609.
This PR avoids requiring every namespace implementation to support versioned describe_table for the dataset-open path. We only use describe_table to resolve the table location and storage options. The requested dataset version is still passed to the actual dataset open path afterward.
So this keeps compatibility with namespace implementations that support only latest-table describe, while preserving the user-facing behavior of version: the opened dataset is still the requested version.
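The two-step flow described above can be sketched roughly as follows. This is a self-contained illustration, not Lance's actual code: `TableInfo`, `LatestOnlyNamespace`, `open_dataset_at`, and `open_via_namespace` are hypothetical stand-ins, and the real namespace request/response types differ.

```python
from dataclasses import dataclass, field


@dataclass
class TableInfo:
    """Hypothetical describe result: only what the open path needs."""
    location: str
    storage_options: dict = field(default_factory=dict)


class LatestOnlyNamespace:
    """Hypothetical namespace that can only describe the current table state."""

    def describe_table(self, table_id, version=None):
        if version is not None:
            # Models an implementation without versioned describe support.
            raise NotImplementedError("versioned describe_table is not supported")
        return TableInfo(
            location=f"memory://{'/'.join(table_id)}.lance",
            storage_options={"region": "example"},
        )


def open_dataset_at(location, storage_options, version=None):
    """Stand-in for the lower-level open path, where the version is applied."""
    return {"location": location, "storage_options": storage_options, "version": version}


def open_via_namespace(namespace, table_id, version=None):
    # Step 1: resolve location and storage options; no version is passed here.
    info = namespace.describe_table(table_id)
    # Step 2: apply the requested version when actually opening the dataset.
    return open_dataset_at(info.location, info.storage_options, version=version)


ds = open_via_namespace(LatestOnlyNamespace(), ["db", "events"], version=3)
print(ds["location"], ds["version"])
```

In this sketch, a namespace that rejects versioned describe calls still serves an open at version 3, because the version only reaches the second step.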
Added Java and Python regression tests that verify namespace
What
Avoid passing the requested dataset version to namespace `describe_table`/`describeTable` when opening a dataset through a namespace client.
The namespace describe call is only used to resolve the table location and storage options. The requested dataset version is still applied by the lower-level dataset open path after namespace resolution.
Why
Some namespace implementations do not support describing a table from a specific version. Passing `version` to `describe_table` can fail even though opening the dataset at that version would work once the location and storage options are resolved.
Fixes lance-format/lance-spark#609.
Testing
- `JAVA_HOME=/opt/homebrew/Cellar/openjdk@17/17.0.18/libexec/openjdk.jdk/Contents/Home mvn test -Dtest=DirectoryNamespaceTest#testOpenSpecificVersionDoesNotPassVersionToDescribeTable` from `java/`
- `UV_HTTP_TIMEOUT=120 uv run pytest python/tests/test_namespace_dir.py::test_dataset_namespace_open_does_not_pass_version_to_describe_table` from `python/`
- `JAVA_HOME=/opt/homebrew/Cellar/openjdk@17/17.0.18/libexec/openjdk.jdk/Contents/Home ./mvnw spotless:check` from `java/`
- `UV_HTTP_TIMEOUT=120 uv run ruff format --check --diff python` from `python/`
- `UV_HTTP_TIMEOUT=120 uv run ruff check python` from `python/`
- `git diff --check`
- `UV_HTTP_TIMEOUT=120 uv run make lint` from `python/` (fails in pyright because local optional imports `tensorflow` and `torch` are not installed; `ruff format --check` and `ruff check` passed)
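For readers who want the shape of such a regression check without the repository at hand, here is a minimal sketch using `unittest.mock`. It is not the test added in this PR: `open_via_namespace` and `open_fn` are hypothetical stand-ins for the namespace-backed open path and the lower-level dataset open call.

```python
from unittest.mock import Mock


def open_via_namespace(namespace, table_id, open_fn, version=None):
    """Hypothetical open path: describe without a version, then open with it."""
    info = namespace.describe_table(table_id)
    return open_fn(info["location"], info["storage_options"], version=version)


# A mock namespace records exactly how describe_table was called.
namespace = Mock()
namespace.describe_table.return_value = {
    "location": "memory://db/events.lance",
    "storage_options": {},
}
# A mock for the lower-level open call, so no real dataset is needed.
open_fn = Mock(return_value="dataset@v3")

result = open_via_namespace(namespace, ["db", "events"], open_fn, version=3)

# The describe call must not receive the requested version...
namespace.describe_table.assert_called_once_with(["db", "events"])
# ...while the lower-level open call must still receive it.
open_fn.assert_called_once_with("memory://db/events.lance", {}, version=3)
```

The two assertions mirror the two halves of the fix: the namespace sees no version, and the dataset open still does.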