Add Databricks direct unload file containing complex map key #87
Thank you @njaremko 🙏 Could you please add some comments about what this file contains and how you created it? Ideally you could follow the model of https://github.com/apache/parquet-testing/blob/master/data/README.md. Also, I noticed the file is 8 KB. Given how widely this repo is cloned and copied, is there any way to make the example file smaller?
I've removed the unneeded columns, and the file is 2 KB now. I've also updated the README.
What are the next steps to get this merged?
alamb left a comment
Thank you @njaremko. This looks good to me.
I dumped the schema and layout of data/complex_map_key.parquet (output below).
Note that I am not a committer on parquet, so I cannot commit this PR. Perhaps @wgtmac or @emkornfield could take a look.
Also, FYI, I don't think we need to gate the fix in arrow-rs on this PR. I will comment on apache/arrow-rs#7769 as well.
parquet-layout data/complex_map_key.parquet
required group field_id=-1 spark_schema {
required group field_id=-1 map_nested (Map) {
repeated group field_id=-1 key_value {
required binary field_id=-1 key (String);
required group field_id=-1 value (Map) {
repeated group field_id=-1 key_value {
required binary field_id=-1 key (String);
required binary field_id=-1 value (String);
}
}
}
}
required group field_id=-1 map_nested_array (Map) {
repeated group field_id=-1 key_value {
required group field_id=-1 key (List) {
repeated group field_id=-1 list {
required int32 field_id=-1 element;
}
}
required group field_id=-1 value (Map) {
repeated group field_id=-1 key_value {
required binary field_id=-1 key (String);
required int32 field_id=-1 value;
}
}
}
}
}
and
File Name: data/complex_map_key.parquet
Version: 1.0
Created By: parquet-mr version 1.12.3-databricks-0002 (build 2484a95dbe16a0023e3eb29c201f99ff9ea771ee)
Total rows: 1
Number of RowGroups: 1
Number of Real Columns: 2
Number of Columns: 6
Number of Selected Columns: 6
Column 0: map_nested.key_value.key (BYTE_ARRAY / String / UTF8)
Column 1: map_nested.key_value.value.key_value.key (BYTE_ARRAY / String / UTF8)
Column 2: map_nested.key_value.value.key_value.value (BYTE_ARRAY / String / UTF8)
Column 3: map_nested_array.key_value.key.list.element (INT32)
Column 4: map_nested_array.key_value.value.key_value.key (BYTE_ARRAY / String / UTF8)
Column 5: map_nested_array.key_value.value.key_value.value (INT32)
--- Row Group: 0 ---
--- Total Bytes: 256 ---
--- Total Compressed Bytes: 266 ---
--- Rows: 1 ---
Column 0
Values: 1, Null Values: 0, Distinct Values: 0
Max: a, Min: a
Compression: SNAPPY, Encodings: PLAIN
Uncompressed Size: 40, Compressed Size: 42
Column 1
Values: 1, Null Values: 0, Distinct Values: 0
Max: b, Min: b
Compression: SNAPPY, Encodings: PLAIN
Uncompressed Size: 42, Compressed Size: 44
Column 2
Values: 1, Null Values: 0, Distinct Values: 0
Max: c, Min: c
Compression: SNAPPY, Encodings: PLAIN
Uncompressed Size: 42, Compressed Size: 44
Column 3
Values: 2, Null Values: 0, Distinct Values: 0
Max: 2, Min: 1
Compression: SNAPPY, Encodings: PLAIN
Uncompressed Size: 45, Compressed Size: 47
Column 4
Values: 1, Null Values: 0, Distinct Values: 0
Max: green, Min: green
Compression: SNAPPY, Encodings: PLAIN
Uncompressed Size: 46, Compressed Size: 46
Column 5
Values: 1, Null Values: 0, Distinct Values: 0
Max: 5, Min: 5
Compression: SNAPPY, Encodings: PLAIN
Uncompressed Size: 41, Compressed Size: 43
--- Values ---
key |key |value |element |key   |value |
a   |b   |c     |1       |green |5     |
    |    |      |2       |      |      |
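For readers cross-checking the dump, here is a minimal Python sketch (standard library only) of how the two top-level fields flatten into six leaf columns and how the per-column sizes add up to the row-group totals. The dict-based schema model, the `leaf_paths` helper, and the `ROW` literal are illustrative assumptions, not part of any Parquet library. `ROW` is one reading of the values table, with a tuple standing in for the list key because Python lists are not hashable.

```python
# Hypothetical model of the schema above: a group is a dict of child nodes,
# a leaf is its physical type written as a string.
SCHEMA = {
    "map_nested": {
        "key_value": {
            "key": "BYTE_ARRAY",
            "value": {"key_value": {"key": "BYTE_ARRAY", "value": "BYTE_ARRAY"}},
        }
    },
    "map_nested_array": {
        "key_value": {
            "key": {"list": {"element": "INT32"}},
            "value": {"key_value": {"key": "BYTE_ARRAY", "value": "INT32"}},
        }
    },
}


def leaf_paths(node, prefix=()):
    """Yield (dotted_path, physical_type) for every leaf, depth first."""
    for name, child in node.items():
        path = prefix + (name,)
        if isinstance(child, dict):
            yield from leaf_paths(child, path)
        else:
            yield ".".join(path), child


COLUMNS = list(leaf_paths(SCHEMA))

# Assumed reading of the single row shown in the values table.
ROW = {
    "map_nested": {"a": {"b": "c"}},
    "map_nested_array": {(1, 2): {"green": 5}},
}

# Per-column sizes copied from the metadata dump (Column 0 through Column 5).
UNCOMPRESSED = [40, 42, 42, 45, 46, 41]
COMPRESSED = [42, 44, 44, 47, 46, 43]

if __name__ == "__main__":
    for index, (path, physical_type) in enumerate(COLUMNS):
        print(f"Column {index}: {path} ({physical_type})")
    print("Total Bytes:", sum(UNCOMPRESSED))
    print("Total Compressed Bytes:", sum(COMPRESSED))
```

The sums match the reported totals (256 and 266). The compressed total is the larger of the two, which is plausible for values this small, where compression framing can outweigh any savings.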
wgtmac left a comment
Do we actually need the map_nested field if we just want to add complex key/value types?
| File | Description |
|------|-------------|
| complex_map_key.parquet | Contains a map with an array key. |
Is it worth describing the exact schema of the file at this line?
Required for 7769