-
Notifications
You must be signed in to change notification settings - Fork 512
Fix Lance writer to emit Arrow FixedSizeList for array columns to enable native vector search #2707
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
35cb755
a6acf1e
029e204
4164ceb
5e9046f
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -47,31 +47,84 @@ | |
|
|
||
| import java.util.Collections; | ||
| import java.util.List; | ||
| import java.util.Map; | ||
| import java.util.stream.Collectors; | ||
|
|
||
| import static org.apache.fluss.utils.Preconditions.checkArgument; | ||
|
|
||
| /** | ||
| * Utilities for converting Fluss RowType to non-shaded Arrow Schema. This is needed because Lance | ||
| * requires non-shaded Arrow API. | ||
| */ | ||
| public class LanceArrowUtils { | ||
|
|
||
| /** Property suffix for configuring a fixed-size list Arrow type on array columns. */ | ||
| public static final String FIXED_SIZE_LIST_SIZE_SUFFIX = ".arrow.fixed-size-list.size"; | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The property key format is <column_name>.arrow.fixed-size-list.size. If the column name itself contains . (e.g., a.b), it leads to parsing ambiguity. For example, a.b.arrow.fixed-size-list.size — it's unclear whether this refers to column a.b or column a with some unrelated suffix. Suggestion:
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good comment, I actually look up Lance doc further and found that they do not support dots in columns. Ref: https://docs.lancedb.com/search/filtering#advanced-sql-filters Updated to throw exception |
||
|
|
||
| /** Returns the non-shaded Arrow schema of the specified Fluss RowType. */ | ||
| public static Schema toArrowSchema(RowType rowType) { | ||
| return toArrowSchema(rowType, Collections.emptyMap()); | ||
| } | ||
|
|
||
| /** | ||
| * Returns the non-shaded Arrow schema of the specified Fluss RowType, using table properties to | ||
| * determine whether array columns should use FixedSizeList instead of List. | ||
| * | ||
| * <p>When a table property {@code <column>.arrow.fixed-size-list.size} is set, the | ||
| * corresponding ARRAY column will be emitted as {@code FixedSizeList<element>(size)} instead of | ||
| * {@code List<element>}. | ||
| */ | ||
| public static Schema toArrowSchema(RowType rowType, Map<String, String> tableProperties) { | ||
| List<Field> fields = | ||
| rowType.getFields().stream() | ||
| .map(f -> toArrowField(f.getName(), f.getType())) | ||
| .map(f -> toArrowField(f.getName(), f.getType(), tableProperties)) | ||
| .collect(Collectors.toList()); | ||
| return new Schema(fields); | ||
| } | ||
|
|
||
| private static Field toArrowField(String fieldName, DataType logicalType) { | ||
| FieldType fieldType = | ||
| new FieldType(logicalType.isNullable(), toArrowType(logicalType), null); | ||
| private static Field toArrowField( | ||
| String fieldName, DataType logicalType, Map<String, String> tableProperties) { | ||
| checkArgument( | ||
| !fieldName.contains("."), | ||
| "Column name '%s' must not contain periods. " | ||
| + "Lance does not support field names with periods.", | ||
| fieldName); | ||
| ArrowType arrowType; | ||
| if (logicalType instanceof ArrayType && tableProperties != null) { | ||
| String sizeStr = tableProperties.get(fieldName + FIXED_SIZE_LIST_SIZE_SUFFIX); | ||
| if (sizeStr != null) { | ||
| int listSize; | ||
| try { | ||
| listSize = Integer.parseInt(sizeStr); | ||
| } catch (NumberFormatException e) { | ||
| throw new IllegalArgumentException( | ||
| String.format( | ||
| "Invalid value '%s' for property '%s', expected a positive integer.", | ||
| sizeStr, fieldName + FIXED_SIZE_LIST_SIZE_SUFFIX), | ||
| e); | ||
| } | ||
|
|
||
| checkArgument( | ||
| listSize > 0, | ||
| "Invalid value '%s' for property '%s'. Expected a positive integer.", | ||
| sizeStr, | ||
| fieldName + FIXED_SIZE_LIST_SIZE_SUFFIX); | ||
| arrowType = new ArrowType.FixedSizeList(listSize); | ||
| } else { | ||
| arrowType = toArrowType(logicalType); | ||
| } | ||
| } else { | ||
| arrowType = toArrowType(logicalType); | ||
| } | ||
| FieldType fieldType = new FieldType(logicalType.isNullable(), arrowType, null); | ||
| List<Field> children = null; | ||
| if (logicalType instanceof ArrayType) { | ||
| children = | ||
| Collections.singletonList( | ||
| toArrowField("element", ((ArrayType) logicalType).getElementType())); | ||
| toArrowField( | ||
| "element", | ||
| ((ArrayType) logicalType).getElementType(), | ||
| tableProperties)); | ||
| } | ||
| return new Field(fieldName, fieldType, children); | ||
| } | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to confirm the return type of
getCustomProperties(). If it returns an immutable Map or a Properties type rather thanMap<String, String>, willtableProperties.get(fieldName + FIXED_SIZE_LIST_SIZE_SUFFIX)in toArrowField work correctly? Suggest usingMap<String, String>in the new method signature and ensuring compatibility at the call site.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getCustomProperties() returns Map<String, String>