-
Notifications
You must be signed in to change notification settings - Fork 1k
Topn doing order |> head/tail #5167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5167 +/- ##
==========================================
- Coverage 99.02% 98.84% -0.18%
==========================================
Files 87 88 +1
Lines 16758 16836 +78
==========================================
+ Hits 16595 16642 +47
- Misses 163 194 +31 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Very nice! This is a common use case and would be great to get in. No problems about the code. Just thinking about API.
|
This comment was marked as outdated.
This comment was marked as outdated.
|
Regarding API: Regarding Functionality: Regarding implementation: I like the idea of a versatile LIMIT in the light of prototyping. However, my most common use case for this feature is only |
|
I wonder if possibly https://github.com/Rdatatable/data.table/blob/master/src/quickselect.c could be reused? I used it in naive rolling median algorithm to find partial (half) ordering. |
Possibly. But quickselect returns values of |
|
ah yes, you are correct. Anyway you can compare speed of returning a value vs index, and at least you will know if there is something to improve regarding your current implementation, in case quickselect would be faster |
|
BTW. those benchmark timings tables are terrible to look at when different rows use different units (ms vs s). |
will add a version with quickselect but my guess is that heapselect is faster for smaller |
|
Then topn can be confusing name because it is commonly used in MSSQL for what is LIMIT in some other dbses. |
|
I added a quickselect version called Will update the benchmarks to make an informed decision. |
Good call-out: https://learn.microsoft.com/en-us/dax/topn-function-dax Examples there are not all that helpful, but AFAICT it suffers from the same confusing API where you are writing I am leaning more towards
I'm not sure a |
|
No obvious timing issues in HEAD=topn_heap Generated via commit 9e0786c Download link for the artifact containing the test results: ↓ atime-results.zip
|
| if (j <= n) l = i; \ | ||
| } \ | ||
| } \ | ||
| memcpy(ians, ix, n * sizeof(CTYPE)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is somewhat scary for character vectors, but I'm not seeing anything that would break right now.
From the GC generations viewpoint, x and ans are likely from the same GC generation; x is possibly older. There shouldn't be any problem with elements of newer, more-frequently-sweeped ans pointing to values from an older, less-frequently-sweeped GC generation. (It's the opposite that causes use-after-frees.)
From the reference counts viewpoint, it'll be one less than what it should be for elements of ans, but CHARSXPs are cached and immutable anyway.
Co-authored-by: aitap <[email protected]>
Co-authored-by: aitap <[email protected]>

Closes #3804
I see the general use case of
topnfor arrays where sorting costs much and using as few additional memory as possible with good performance.Benchmarks
Integer
Worst case
Array is sorted ascending and we want the maximum
topnso we need to update the heap at every step after nBest case
Array is sorted ascending and we want the minimum
topnso we never need to update after n(not benchmarking with kit since it errors from n=1e4 onwards)
Random permutation (mimicking average case)
Double
Worst case
Best case
Random permutation
Strings
Random strings