Skip to content

feat: Mean aggregate#7201

Merged
blaginin merged 25 commits intodevelopfrom
db/count-and-mean
Apr 27, 2026
Merged

feat: Mean aggregate#7201
blaginin merged 25 commits intodevelopfrom
db/count-and-mean

Conversation

@blaginin
Copy link
Copy Markdown
Member

@blaginin blaginin commented Mar 29, 2026

No description provided.

Signed-off-by: blaginin <dima@spiraldb.com>
Co-authored-by: Claude <noreply@anthropic.com>
@blaginin blaginin self-assigned this Mar 29, 2026
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: blaginin <dima@spiraldb.com>
/// Return the count of non-null elements in an array.
///
/// See [`Count`] for details.
pub fn count(array: &ArrayRef, ctx: &mut ExecutionCtx) -> VortexResult<u64> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is counting things in an array, should it return a usize?

false
}

fn accumulate(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to register a generic aggregate kernel to reduce count-non-null to be Array.validity().sum(), then we avoid decompressing all the data.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

575672a

added generic kernels first, but then it's hard to express "I never want Count::accumulate to be called" - so switched to try_accumulate instead

}

fn partial_struct_dtype() -> DType {
DType::Struct(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you static LazyLock this entire dtype inside this function?

@gatesn gatesn added the changelog/feature A new feature label Mar 30, 2026
Copy link
Copy Markdown
Contributor

@joseph-isaacs joseph-isaacs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be decimal for decimals?

Comment on lines +53 to +56
pub struct MeanPartial {
sum: f64,
count: u64,
}
Copy link
Copy Markdown
Contributor

@joseph-isaacs joseph-isaacs Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will compute with a pretty large error?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$$\mu_n = \sum_{i=0}^n x_i/n \iff \mu_{n+1} = \mu_n + \frac{x_{n+1} - \mu_n}{1+n}$$

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need only hold the partial mean and count

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can incrementally push the next value into this partial, but you cannot combine two partials afaik?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For combining you do. $$\mu_{AB} = \frac{n_A \mu_A + n_B \mu_B}{n_A + n_B}$$

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DF sum is T and ours is f64

Copy link
Copy Markdown
Contributor

@joseph-isaacs joseph-isaacs Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error comes from numbers of different scales being combined, not from number of OPS.

Since we are storing these on disk we cannot "just change it later".

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DF only supports AVG for floats, decimals, and durations. And coerces all floats to f64.

Copy link
Copy Markdown
Contributor

@joseph-isaacs joseph-isaacs Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

| Input Type   | DuckDB | DataFusion | Velox      |
|--------------|--------|------------|------------|
| Integer types| DOUBLE | Float64    | DOUBLE     |
| FLOAT/REAL   | DOUBLE | Float64    | REAL       |
| DOUBLE       | DOUBLE | Float64    | DOUBLE     |
| DECIMAL(p,s) | DOUBLE | Float64    | DECIMAL(p,s)|

Copy link
Copy Markdown
Member Author

@blaginin blaginin Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought about this more - i think it should decimal for decimals? - happy to change

blaginin added 3 commits April 2, 2026 14:24
Signed-off-by: blaginin <github@blaginin.me>
Signed-off-by: blaginin <github@blaginin.me>
@blaginin blaginin changed the title Count and Mean aggregates feat: Count and Mean aggregates Apr 2, 2026
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented Apr 2, 2026

Merging this PR will improve performance by 27.62%

⚡ 2 improved benchmarks
✅ 1161 untouched benchmarks

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation varbinview_zip_block_mask 3.7 ms 2.9 ms +27.62%
Simulation varbinview_zip_fragmented_mask 7.3 ms 6.5 ms +11.46%

Comparing db/count-and-mean (4401d73) with develop (c0c6fb9)

Open in CodSpeed

Signed-off-by: blaginin <github@blaginin.me>
@blaginin blaginin changed the title feat: Count and Mean aggregates feat: Mean aggregate Apr 2, 2026
blaginin and others added 4 commits April 2, 2026 17:56
Signed-off-by: blaginin <github@blaginin.me>
# Conflicts:
#	vortex-array/public-api.lock
#	vortex-array/src/aggregate_fn/fns/mod.rs
Signed-off-by: blaginin <github@blaginin.me>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: blaginin <github@blaginin.me>
Comment thread vortex-array/src/aggregate_fn/fns/mean/mod.rs Outdated
blaginin added 3 commits April 8, 2026 16:23
Signed-off-by: blaginin <github@blaginin.me>
Signed-off-by: blaginin <github@blaginin.me>
Signed-off-by: blaginin <github@blaginin.me>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can factor this out of this PR, we could at a later date add the back to on disk version will be the same?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

disk version will be the same

i blocked serialization for now?

we can factor this out of this PR

why?

Comment on lines +49 to +51
/// - Decimals are kept as decimals with widened precision and scale
/// (`+4` each, capped at [`MAX_PRECISION`] / [`MAX_SCALE`]), matching
/// DataFusion's `coerce_avg_type`.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you definitely want higher for the sum, that is only 10k max sized values for an overflow

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, but Sum.partial_dtype() also adds +10

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then this comment need to explain what the +4 is for, 4 extra least sig figures?

@blaginin blaginin marked this pull request as ready for review April 9, 2026 14:25
blaginin and others added 5 commits April 9, 2026 15:26
Signed-off-by: blaginin <github@blaginin.me>
Signed-off-by: blaginin <github@blaginin.me>
Signed-off-by: blaginin <github@blaginin.me>
@blaginin blaginin enabled auto-merge (squash) April 20, 2026 18:06
@blaginin blaginin requested a review from joseph-isaacs April 20, 2026 18:06
Comment on lines +95 to +97
let sum_cast = sum.cast(target.clone())?;
let count_cast = count.cast(target)?;
sum_cast.binary(count_cast, Operator::Div)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handle div by zero?

Signed-off-by: blaginin <github@blaginin.me>
Signed-off-by: blaginin <github@blaginin.me>
Signed-off-by: blaginin <github@blaginin.me>
Signed-off-by: blaginin <github@blaginin.me>
@joseph-isaacs
Copy link
Copy Markdown
Contributor

This is a mean? not avg?

@blaginin
Copy link
Copy Markdown
Member Author

I think the pattern is avg is the function and mean is an alias

Signed-off-by: blaginin <github@blaginin.me>
@joseph-isaacs
Copy link
Copy Markdown
Contributor

I think we want mean with avg as an alias

@joseph-isaacs
Copy link
Copy Markdown
Contributor

or really arithmetic_mean

let sum = sum_cast.as_primitive().typed_value::<f64>();
let count = count_cast.as_primitive().typed_value::<f64>();
let value = match (sum, count) {
(None, _) | (_, None) | (_, Some(0.0)) => return Ok(Scalar::null(target)), // Sum overflowed
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empty or overflowed?

Comment on lines +68 to +76
/// Field name for the left child in the partial struct dtype.
fn left_name(&self) -> &'static str {
"left"
}

/// Field name for the right child in the partial struct dtype.
fn right_name(&self) -> &'static str {
"right"
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why default these?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

those don't matter, so may as well use left/right


impl<T: BinaryCombined> AggregateFnVTable for Combined<T> {
type Options = CombinedOptions<T>;
type Partial = (LeftPartial<T>, RightPartial<T>);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will this partial be seralized?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a struct scalar with left/right fields

Comment on lines +256 to +262
fn struct_dtype(left_name: &str, right_name: &str, left: DType, right: DType) -> DType {
DType::Struct(
StructFields::new(
FieldNames::from_iter([FieldName::from(left_name), FieldName::from(right_name)]),
vec![left, right],
),
Nullability::NonNullable,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

method on combined?

@joseph-isaacs joseph-isaacs disabled auto-merge April 27, 2026 13:23
@blaginin blaginin enabled auto-merge (squash) April 27, 2026 17:25
@blaginin blaginin merged commit 1689d7a into develop Apr 27, 2026
59 checks passed
@blaginin blaginin deleted the db/count-and-mean branch April 27, 2026 17:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/feature A new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants