Conversation
Force-pushed from a445f7d to 1c99386.
Merging this PR will degrade performance by 24.94%

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | new_bp_prim_test_between[i64, 32768] | 177.3 µs | 236.2 µs | -24.94% |
| ⚡ | Simulation | bitwise_not_vortex_buffer_mut[128] | 275.3 ns | 246.1 ns | +11.85% |
Comparing aduffy/geo-v0 (f499553) with develop (44a6367)
Footnotes
- 138 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.
Force-pushed from 3382e4f to 05dcff9.
```cpp
duckdb_logical_type duckdb_vx_create_geometry(const char *crs) {
	D_ASSERT(crs);
	auto geom =
	    (*crs == '\0') ? duckdb::LogicalType::GEOMETRY() : duckdb::LogicalType::GEOMETRY(std::string(crs));
	auto copy = duckdb::make_uniq<duckdb::LogicalType>(std::move(geom));
	return reinterpret_cast<duckdb_logical_type>(copy.release());
}
```
These C++ changes are needed because DuckDB upstream doesn't expose the full geometry type machinery over the C API.
```cpp
auto copy = duckdb::make_uniq<duckdb::LogicalType>(std::move(geom));
return reinterpret_cast<duckdb_logical_type>(copy.release());
```
Claude generated this function and the header update; most of the rest of the PR was braincoded.
```rust
let blob = unsafe { cpp::duckdb_get_blob(self.as_ptr()) };
let slice =
    unsafe { std::slice::from_raw_parts(blob.data.cast::<u8>(), blob.size.as_()) };
```
In certain situations this could lead to UB depending on the allocator: if duckdb_malloc returns NULL, you'd call slice::from_raw_parts(NULL), which is UB. See the new take_blob function below.

I only fixed this because I was going to repeat this pattern for the GEOMETRY branch, and figured I'd fix it in both places instead.
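To make the pattern concrete, here's a minimal, self-contained sketch of the guard (the helper name is hypothetical, not this PR's `take_blob`): when the data pointer is null, hand back a valid empty slice instead of passing NULL to `slice::from_raw_parts`, which is UB even for length 0.

```rust
use std::slice;

/// Hypothetical helper mirroring the spirit of the fix: never feed a null
/// pointer to `slice::from_raw_parts`; substitute a valid empty slice.
///
/// # Safety
/// If `data` is non-null, it must point to `size` readable bytes that
/// outlive `'a`.
unsafe fn blob_as_slice<'a>(data: *const u8, size: usize) -> &'a [u8] {
    if data.is_null() {
        // A null data pointer is only sane for an empty blob.
        debug_assert_eq!(size, 0, "non-empty blob with null data pointer");
        &[]
    } else {
        slice::from_raw_parts(data, size)
    }
}

fn main() {
    // Null pointer: returns an empty slice instead of invoking UB.
    let empty = unsafe { blob_as_slice(std::ptr::null(), 0) };
    assert!(empty.is_empty());

    // Normal case: a real buffer round-trips unchanged.
    let buf = [1u8, 2, 3];
    let s = unsafe { blob_as_slice(buf.as_ptr(), buf.len()) };
    assert_eq!(s, &[1, 2, 3]);
    println!("ok");
}
```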
```diff
-pub unsafe fn set_vector_buffer(&self, buffer: &VectorBufferRef) {
+pub unsafe fn set_vector_buffer(&mut self, buffer: &VectorBufferRef) {
```
These methods should've always been &mut for soundness reasons; the &mut was just missing before.
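As a standalone illustration of the soundness point (hypothetical types, not this crate's): with `&mut self`, the borrow checker guarantees exclusive access for the duration of the mutation, which a `&self` signature would not.

```rust
// Hypothetical stand-in for a vector whose buffer can be replaced.
struct Vector {
    buffer: Vec<u8>,
}

impl Vector {
    // Mirrors the signature change in this diff: `&self` -> `&mut self`.
    // With `&mut self`, no other live reference can observe the buffer
    // mid-replacement.
    fn set_buffer(&mut self, buffer: &[u8]) {
        self.buffer.clear();
        self.buffer.extend_from_slice(buffer);
    }
}

fn main() {
    let mut v = Vector { buffer: Vec::new() };
    v.set_buffer(&[1, 2, 3]);
    // A shared reference held across the call would not compile:
    // `let r = &v; v.set_buffer(&[4]); let _ = r;` is rejected.
    assert_eq!(v.buffer, vec![1, 2, 3]);
    println!("ok");
}
```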
```rust
#[test]
fn test_geometry() {
```
See this for a (rather simple) example of querying geometry data from DuckDB.
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
```rust
type NativeValue<'a> = Wkb<'a>;

fn id(&self) -> ExtId {
    ExtId::new_static("geo.wkb")
```
Let's think about this id a little more. What namespace do we want?
```rust
if ext.ext_dtype().is::<AnyTemporal>() {
    return temporal::new_exporter(TemporalArray::try_from(ext)?, ctx);
}

if ext.ext_dtype().is::<WellKnownBinary>() {
    return geo::new_wkb_exporter(WellKnownBinaryData::try_from(ext)?, ctx);
}
```
Shall we move this into an ext mod?
```rust
.execute::<Canonical>(ctx)?
.into_varbinview();
```
Suggested change:

```diff
-    .execute::<Canonical>(ctx)?
-    .into_varbinview();
+    .execute::<VarBinView>(ctx)?;
```
```rust
fn temporal_to_duckdb(temporal: TemporalMetadata) -> VortexResult<LogicalType> {
    let duckdb_type = match temporal {
        TemporalMetadata::Timestamp(unit, None) => match unit {
            TimeUnit::Nanoseconds => DUCKDB_TYPE::DUCKDB_TYPE_TIMESTAMP_NS,
            TimeUnit::Microseconds => DUCKDB_TYPE::DUCKDB_TYPE_TIMESTAMP,
            TimeUnit::Milliseconds => DUCKDB_TYPE::DUCKDB_TYPE_TIMESTAMP_MS,
            TimeUnit::Seconds => DUCKDB_TYPE::DUCKDB_TYPE_TIMESTAMP_S,
            _ => vortex_bail!("Invalid TimeUnit {} for timestamp", unit),
        },
        TemporalMetadata::Timestamp(unit, Some(tz)) => {
            if tz.as_ref() != "UTC" {
                vortex_bail!("Invalid timezone for timestamp_tz {tz}, must be UTC");
            }
            if unit != &TimeUnit::Microseconds {
                vortex_bail!(
                    "Invalid TimeUnit {} for timestamp_tz, must be Microseconds",
                    unit
                );
            }
            DUCKDB_TYPE::DUCKDB_TYPE_TIMESTAMP_TZ
        }
        TemporalMetadata::Date(unit) => match unit {
            TimeUnit::Days => DUCKDB_TYPE::DUCKDB_TYPE_DATE,
            _ => vortex_bail!("Invalid TimeUnit {} for date", unit),
        },
        TemporalMetadata::Time(unit) => match unit {
            TimeUnit::Microseconds => DUCKDB_TYPE::DUCKDB_TYPE_TIME,
            TimeUnit::Nanoseconds => DUCKDB_TYPE::DUCKDB_TYPE_TIME_NS,
            _ => vortex_bail!("Invalid TimeUnit {} for time", unit),
        },
    };

    Ok(LogicalType::new(duckdb_type))
}
```
Shall we also move these into an ext mod?
```cpp
const auto val = reinterpret_cast<duckdb::Value *>(value);
const auto &str = duckdb::StringValue::Get(*val);
```
Is this a panic, or memory-unsafe, if val is not a string type?
```cpp
///
/// This bypasses the GEOMETRY -> BLOB default cast (which would require the spatial extension to
/// be loaded). The returned `data` pointer must be freed with `duckdb_free`. The caller must
/// ensure `value` is a non-null GEOMETRY value, otherwise behavior is undefined.
```
Summary
Part of the implementation of #7686
This PR adds a new crate, `vortex-geo`, which will hold all extension types, custom layouts, and encodings necessary to store geospatial vector datasets in Vortex files. The goal of this crate is to enable integration with DuckDB Spatial, GeoDataFusion, SedonaDB, and Iceberg v3 geo types.

This initial PR implements an extension type for the Well-Known Binary (WKB) encoding. This encoding is the most common format for geospatial data in analytics; it's what both GeoParquet and DuckDB use to represent geometry types.
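For a concrete sense of what a WKB payload looks like, here's a minimal sketch (not this crate's API) that hand-encodes a little-endian 2-D Point: one byte-order marker byte (1 = little-endian / NDR), a u32 geometry type tag (1 = Point), then x and y as IEEE-754 f64 values, 21 bytes total.

```rust
// Hand-encode a WKB 2-D Point in little-endian byte order.
// Layout: [byte order: u8][geometry type: u32][x: f64][y: f64].
fn wkb_point(x: f64, y: f64) -> Vec<u8> {
    let mut buf = Vec::with_capacity(21);
    buf.push(1u8); // byte order: 1 = little-endian (NDR)
    buf.extend_from_slice(&1u32.to_le_bytes()); // geometry type: 1 = Point
    buf.extend_from_slice(&x.to_le_bytes());
    buf.extend_from_slice(&y.to_le_bytes());
    buf
}

fn main() {
    let wkb = wkb_point(1.0, 2.0);
    assert_eq!(wkb.len(), 21); // 1 + 4 + 8 + 8
    assert_eq!(&wkb[..5], &[1, 1, 0, 0, 0]); // LE marker + Point tag
    println!("ok");
}
```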
We also wire this into the Vortex DuckDB extension to support converting Geometry columns between the Vortex and DuckDB formats.
API Changes
- Adds a new crate, `vortex-geo`, with extension type `WellKnownBinary` (`geo.wkb`).
- Adds support for geometry columns to DuckDB as well, e.g. you can now do something like querying a geometry column directly. It won't be very performant yet, not until we support better layouts + stats for geometry.
Testing
We have unit tests for metadata serde, round trip conversion between Vortex <> DuckDB geometry format, and an additional E2E test that demonstrates reading a Vortex file with geometry data and providing it to the DuckDB Spatial extension.