Topn doing order |> head/tail #5167

ben-schwen · 2021-09-17T14:15:16Z

R/C code
tests
man page
news

I see the general use case of topn as arrays where a full sort is expensive and we want good performance while using as little additional memory as possible.
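For orientation, the result that topn is compared against in the benchmarks can be written in base R as a full ordering followed by head. This is only a sketch of the assumed semantics (indices into x, not values); the helper name topn_reference is hypothetical and is not part of the PR.

```r
# Hypothetical reference helper: full sort, then keep the first n indices.
# Assumes topn(x, n, decreasing) is meant to match order(x, ...)[1:n].
topn_reference <- function(x, n, decreasing = FALSE) {
  head(order(x, decreasing = decreasing), n)
}

x <- c(30L, 10L, 50L, 20L, 40L)
topn_reference(x, 2L, decreasing = TRUE)   # positions of 50 and 40
topn_reference(x, 2L, decreasing = FALSE)  # positions of 10 and 20
```

The point of the PR is to return the same indices without paying for the full ordering.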

Benchmarks

Integer

Worst case

Array is sorted ascending and we want the maximum topn, so the heap has to be updated at every step after the first n elements.

library(data.table)
setDTthreads(1L)
x = seq.int(1e8)
b <- function() {
  bench::mark(check=FALSE,
    topn(x,n,decreasing=TRUE),
    quickn(x,n,decreasing=TRUE),
    kit::topn(x,n,decreasing=TRUE),
    data.table:::forder(x,decreasing=TRUE)[1:n]
  )
}

n = 1e0
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)   150.43ms 162.42ms     6.17    381.5MB    0    
#> 2 quickn(x, n, decreasing = TRUE) 332.06ms 370.74ms     2.70    381.5MB    2.70 
#> 3 kit::topn(x, n, decreasing = T… 415.41ms 431.38ms     2.32     39.8KB    0    
#> 4 data.table:::forder(x, decreas…    1.83s    1.83s     0.547   381.5MB    0.547

n = 1e1
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)      2.67s    2.67s     0.375        0B    0    
#> 2 quickn(x, n, decreasing = TRUE) 253.91ms 263.16ms     3.80      381MB    3.80 
#> 3 kit::topn(x, n, decreasing = T… 833.57ms 833.57ms     1.20         0B    0    
#> 4 data.table:::forder(x, decreas…    1.65s    1.65s     0.607     381MB    0.607

n = 1e2
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)      4.45s    4.45s     0.225      448B    0    
#> 2 quickn(x, n, decreasing = TRUE) 269.91ms 271.52ms     3.68      381MB    3.68 
#> 3 kit::topn(x, n, decreasing = T…     7.3s     7.3s     0.137      448B    0    
#> 4 data.table:::forder(x, decreas…    1.65s    1.65s     0.605     381MB    0.605

Best case

Array is sorted ascending and we want the minimum topn, so the heap never needs to be updated after the first n elements.
(not benchmarking with kit since it errors from n=1e4 onwards)
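Why sorted input gives both the worst and the best case can be seen in a small heap-select sketch in plain R. This is an illustration under assumptions, not the C code in src/topn.c: the function name heap_topn and the updates counter are hypothetical, ties and NA values are not handled, and a real implementation would avoid R-level loops.

```r
# Hypothetical heap-select over indices; counts how often the root is replaced.
heap_topn <- function(x, n, decreasing = FALSE) {
  n <- as.integer(n)
  stopifnot(n >= 1L, n <= length(x))
  # worse(a, b): x[a] is the weaker candidate, so it belongs nearer the root.
  worse <- if (decreasing) function(a, b) x[a] < x[b] else function(a, b) x[a] > x[b]
  sift_down <- function(heap, i, size) {
    repeat {
      l <- 2L * i; r <- l + 1L; w <- i
      if (l <= size && worse(heap[l], heap[w])) w <- l
      if (r <= size && worse(heap[r], heap[w])) w <- r
      if (w == i) break
      tmp <- heap[i]; heap[i] <- heap[w]; heap[w] <- tmp
      i <- w
    }
    heap
  }
  heap <- seq_len(n)                      # start with the first n indices
  if (n >= 2L) for (i in (n %/% 2L):1L) heap <- sift_down(heap, i, n)
  updates <- 0L
  for (j in seq_len(length(x) - n) + n) { # scan the remaining elements
    if (worse(heap[1L], j)) {             # x[j] beats the weakest kept element
      heap[1L] <- j
      heap <- sift_down(heap, 1L, n)
      updates <- updates + 1L
    }
  }
  # Put the surviving indices into the requested order.
  list(idx = heap[order(x[heap], decreasing = decreasing)], updates = updates)
}

x <- 1:1000
heap_topn(x, 10L, decreasing = TRUE)$updates   # every later element replaces the root
heap_topn(x, 10L, decreasing = FALSE)$updates  # the root is never replaced
```

On ascending input, asking for the maximum replaces the root for each of the length(x) - n remaining elements, while asking for the minimum only ever compares against the root.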

library(data.table)
setDTthreads(1L)
x = seq.int(1e8)
b <- function() {
  bench::mark(check=FALSE,
    topn(x,n,decreasing=FALSE),
    quickn(x,n,decreasing=FALSE),
    data.table:::forder(x,decreasing=FALSE)[1:n]
  )
}

n = 1e0
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)       107ms  109ms      9.20     381MB     0   
#> 2 quickn(x, n, decreasing = FALSE)     245ms  259ms      3.86     381MB     3.86
#> 3 data.table:::forder(x, decreasing =… 837ms  837ms      1.19     382MB     1.19

n = 1e1
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)       125ms  126ms      7.88        0B     0   
#> 2 quickn(x, n, decreasing = FALSE)     249ms  256ms      3.91     381MB     3.91
#> 3 data.table:::forder(x, decreasing =… 971ms  971ms      1.03     381MB     1.03

n = 1e2
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)       124ms  127ms      7.76      448B     0   
#> 2 quickn(x, n, decreasing = FALSE)     252ms  256ms      3.91     381MB     3.91
#> 3 data.table:::forder(x, decreasing =… 889ms  889ms      1.12     381MB     1.12

n = 1e3
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)       122ms  123ms      8.11    3.95KB     0   
#> 2 quickn(x, n, decreasing = FALSE)     245ms  257ms      3.89  381.47MB     3.89
#> 3 data.table:::forder(x, decreasing =… 911ms  911ms      1.10  381.48MB     1.10

n = 1e4
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)       118ms  120ms      8.32    39.1KB     0   
#> 2 quickn(x, n, decreasing = FALSE)     248ms  250ms      4.00   381.5MB     4.00
#> 3 data.table:::forder(x, decreasing =… 876ms  876ms      1.14   381.5MB     1.14

n = 1e5
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)       119ms  124ms      8.13     391KB     0   
#> 2 quickn(x, n, decreasing = FALSE)     254ms  280ms      3.57     382MB     3.57
#> 3 data.table:::forder(x, decreasing =… 924ms  924ms      1.08     382MB     1.08

n = 1e6
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)  133.82ms 136.46ms     7.28     3.81MB    0    
#> 2 quickn(x, n, decreasing = FALS… 272.88ms 279.49ms     3.58   385.29MB    3.58 
#> 3 data.table:::forder(x, decreas…    1.04s    1.04s     0.964   389.1MB    0.964

n = 1e7
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)       203ms  209ms      4.74    38.1MB     0   
#> 2 quickn(x, n, decreasing = FALSE)     265ms  280ms      3.57   419.6MB     3.57
#> 3 data.table:::forder(x, decreasing =… 987ms  987ms      1.01   457.8MB     1.01

Random permutation (mimicking average case)

library(data.table)
setDTthreads(1L)
set.seed(373)
x = sample(seq.int(1e8))
b <- function() {
  bench::mark(check=FALSE,
    topn(x,n,decreasing=TRUE),
    quickn(x,n,decreasing=TRUE),
    kit::topn(x,n,decreasing=TRUE),
    data.table:::forder(x,decreasing=TRUE)[1:n]
  )
}

n = 1e0
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)     83.98ms 85.38ms    11.7      3.19KB    0    
#> 2 quickn(x, n, decreasing = TRUE)     1.04s   1.04s     0.964  381.47MB    0.964
#> 3 kit::topn(x, n, decreasing = TRU…  45.9ms 46.12ms    21.6     39.77KB    0    
#> 4 data.table:::forder(x, decreasin…   2.26s   2.26s     0.443  381.55MB    0.443

n = 1e1
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch:> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)      81.7ms 82.12ms    12.1          0B    0    
#> 2 quickn(x, n, decreasing = TRUE)     1.09s   1.09s     0.916     381MB    0.916
#> 3 kit::topn(x, n, decreasing = TRUE) 46.3ms 46.52ms    21.3          0B    0    
#> 4 data.table:::forder(x, decreasing…  2.25s   2.25s     0.444     381MB    0.444

n = 1e2
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)     82.62ms 84.17ms    11.9        448B    0    
#> 2 quickn(x, n, decreasing = TRUE)     1.07s   1.07s     0.931     381MB    0.931
#> 3 kit::topn(x, n, decreasing = TRU… 46.58ms 47.92ms    20.2        448B    0    
#> 4 data.table:::forder(x, decreasin…   2.42s   2.42s     0.414     381MB    0.414

n = 1e3
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)     95.41ms 98.98ms     9.98     3.95KB    0    
#> 2 quickn(x, n, decreasing = TRUE)     1.13s   1.13s     0.884  381.47MB    0.884
#> 3 kit::topn(x, n, decreasing = TRU… 58.84ms 60.07ms    16.5      3.95KB    0    
#> 4 data.table:::forder(x, decreasin…   2.51s   2.51s     0.398  381.48MB    0.398

n = 1e4
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)   118.19ms 119.21ms     8.33     39.1KB    0    
#> 2 quickn(x, n, decreasing = TRUE)    1.03s    1.03s     0.967   381.5MB    0.967
#> 3 kit::topn(x, n, decreasing = T…     2.1s     2.1s     0.477   381.5MB    0.477
#> 4 data.table:::forder(x, decreas…    2.25s    2.25s     0.445   381.5MB    0.445

n = 1e5
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)   519.26ms 519.26ms     1.93      391KB    0    
#> 2 quickn(x, n, decreasing = TRUE)    1.07s    1.07s     0.930     382MB    0.930
#> 3 kit::topn(x, n, decreasing = T…    1.91s    1.91s     0.523     382MB    0.523
#> 4 data.table:::forder(x, decreas…    2.35s    2.35s     0.426     382MB    0.426

n = 1e6
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)        5.44s  5.44s     0.184    3.81MB    0    
#> 2 quickn(x, n, decreasing = TRUE)      1.02s  1.02s     0.983  385.29MB    0.983
#> 3 kit::topn(x, n, decreasing = TRUE)   1.94s  1.94s     0.517  385.29MB    0.517
#> 4 data.table:::forder(x, decreasing =… 2.27s  2.27s     0.441   389.1MB    0.441

Double

Worst case

library(data.table)
setDTthreads(1L)
x = as.double(seq.int(1e7))
b <- function() {
  bench::mark(check=FALSE,
    topn(x,n,decreasing=TRUE),
    quickn(x,n,decreasing=TRUE),
    kit::topn(x,n,decreasing=TRUE),
    data.table:::forder(x,decreasing=TRUE)[1:n]
  )
}

n = 1e0
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)      24.2ms  26.3ms     37.2     76.3MB     0   
#> 2 quickn(x, n, decreasing = TRUE)    53.6ms  54.6ms     17.9     76.3MB    17.9 
#> 3 kit::topn(x, n, decreasing = TRU…  51.9ms    55ms     17.7     39.8KB     0   
#> 4 data.table:::forder(x, decreasin… 449.7ms 456.1ms      2.19    38.2MB     1.10

n = 1e1
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)     337.5ms 342.8ms      2.92        0B     0   
#> 2 quickn(x, n, decreasing = TRUE)    54.3ms  56.4ms     17.4     76.3MB    17.4 
#> 3 kit::topn(x, n, decreasing = TRU… 158.3ms 167.4ms      5.83        0B     0   
#> 4 data.table:::forder(x, decreasin… 386.9ms 424.8ms      2.35    38.1MB     1.18

n = 1e2
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)   584.17ms 584.17ms     1.71       448B     0   
#> 2 quickn(x, n, decreasing = TRUE)   52.1ms  53.86ms    18.6      76.3MB    18.6 
#> 3 kit::topn(x, n, decreasing = T…    1.94s    1.94s     0.515      448B     0   
#> 4 data.table:::forder(x, decreas… 367.69ms 383.86ms     2.61     38.1MB     1.30

n = 1e3
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)     873.2ms 873.2ms    1.15      3.95KB     0   
#> 2 quickn(x, n, decreasing = TRUE)    50.9ms  51.5ms   19.3       76.3MB    19.3 
#> 3 kit::topn(x, n, decreasing = TRU…   18.3s   18.3s    0.0546    3.95KB     0   
#> 4 data.table:::forder(x, decreasin…   358ms 358.3ms    2.79     38.16MB     1.40

Best case

library(data.table)
setDTthreads(1L)
x = as.double(seq.int(1e7))
b <- function() {
  bench::mark(check=FALSE,
    topn(x,n,decreasing=FALSE),
    quickn(x,n,decreasing=FALSE),
    data.table:::forder(x,decreasing=FALSE)[1:n]
  )
}

n = 1e0
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)     18.8ms  19.2ms     51.3     76.3MB     0   
#> 2 quickn(x, n, decreasing = FALSE)   54.7ms  55.8ms     17.9     76.3MB    17.9 
#> 3 data.table:::forder(x, decreasin… 179.5ms 209.6ms      4.44    38.2MB     1.48

n = 1e1
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)     20.2ms  21.3ms     46.7         0B     0   
#> 2 quickn(x, n, decreasing = FALSE)   56.5ms  59.9ms     16.2     76.3MB    16.2 
#> 3 data.table:::forder(x, decreasin…   188ms 190.5ms      5.11    38.1MB     1.70

n = 1e2
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)     18.5ms  19.8ms     49.8       448B     0   
#> 2 quickn(x, n, decreasing = FALSE)     53ms  54.7ms     18.0     76.3MB    18.0 
#> 3 data.table:::forder(x, decreasin… 178.6ms 187.8ms      5.30    38.1MB     1.77

n = 1e3
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)     18.5ms  18.9ms     51.6     3.95KB     0   
#> 2 quickn(x, n, decreasing = FALSE)   54.2ms  56.1ms     16.7     76.3MB    16.7 
#> 3 data.table:::forder(x, decreasin… 237.2ms 243.6ms      4.11   38.16MB     1.37

n = 1e4
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)     20.6ms  21.7ms     45.8     39.1KB     0   
#> 2 quickn(x, n, decreasing = FALSE)   58.2ms  59.3ms     16.4     76.4MB    16.4 
#> 3 data.table:::forder(x, decreasin… 180.7ms 199.2ms      5.15    38.2MB     1.72

n = 1e5
b()
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)     19.5ms  20.1ms     48.3    390.7KB     2.01
#> 2 quickn(x, n, decreasing = FALSE)   61.2ms  61.2ms     16.3     77.1MB   114.  
#> 3 data.table:::forder(x, decreasin… 171.7ms 185.8ms      5.38    38.9MB     2.69

n = 1e6
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)       27ms    28ms     34.3     3.81MB     0   
#> 2 quickn(x, n, decreasing = FALSE)   52.8ms  56.7ms     15.5    83.92MB    15.5 
#> 3 data.table:::forder(x, decreasin… 179.2ms 180.7ms      5.54   45.78MB     1.85

n = 1e7
b()
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)    139.7ms 139.9ms      6.97    38.1MB     2.32
#> 2 quickn(x, n, decreasing = FALSE)   63.1ms  63.1ms     15.9    152.6MB    79.3 
#> 3 data.table:::forder(x, decreasin… 200.9ms 202.3ms      4.94   114.4MB     2.47

Random permutation

library(data.table)
setDTthreads(1L)
set.seed(373)
x = sample(as.double(seq.int(1e7)))
b <- function() {
  bench::mark(check=FALSE,
    topn(x,n,decreasing=TRUE),
    quickn(x,n,decreasing=TRUE),
    kit::topn(x,n,decreasing=TRUE),
    data.table:::forder(x,decreasing=TRUE)[1:n]
  )
}

n = 1e0
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)    15.76ms  16.95ms     58.8     3.19KB     0   
#> 2 quickn(x, n, decreasing = TRUE) 123.92ms 125.14ms      8.01    76.3MB     8.01
#> 3 kit::topn(x, n, decreasing = T…   9.45ms   9.69ms    103.     39.77KB     0   
#> 4 data.table:::forder(x, decreas… 457.36ms 487.79ms      2.05   38.22MB     1.03

n = 1e1
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)      18.4ms  19.1ms     51.8         0B     0   
#> 2 quickn(x, n, decreasing = TRUE)   143.7ms 145.9ms      6.80    76.3MB     6.80
#> 3 kit::topn(x, n, decreasing = TRU…  10.6ms  10.7ms     91.7         0B     0   
#> 4 data.table:::forder(x, decreasin… 521.7ms 521.7ms      1.92    38.1MB     0

n = 1e2
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                       <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)     16.43ms  17.1ms     57.4       448B     0   
#> 2 quickn(x, n, decreasing = TRUE)  120.77ms 126.8ms      7.97    76.3MB     7.97
#> 3 kit::topn(x, n, decreasing = TR…   9.89ms    11ms     90.9       448B     0   
#> 4 data.table:::forder(x, decreasi… 471.95ms 484.7ms      2.06    38.1MB     1.03

n = 1e3
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)      19.6ms  20.2ms     49.0     3.95KB     0   
#> 2 quickn(x, n, decreasing = TRUE)   128.9ms   133ms      7.36    76.3MB     7.36
#> 3 kit::topn(x, n, decreasing = TRU…  31.9ms  33.3ms     29.9     3.95KB     0   
#> 4 data.table:::forder(x, decreasin… 510.5ms 510.5ms      1.96   38.16MB     0

n = 1e4
b()
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)      38.4ms  40.6ms     24.7     39.1KB     0   
#> 2 quickn(x, n, decreasing = TRUE)   140.3ms 140.3ms      7.13    76.4MB    21.4 
#> 3 kit::topn(x, n, decreasing = TRU… 326.9ms 326.9ms      3.06    38.2MB     3.06
#> 4 data.table:::forder(x, decreasin… 530.2ms 530.2ms      1.89    38.2MB     0

n = 1e5
b()
#> # A tibble: 4 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)        271ms  275ms      3.63   390.7KB     0   
#> 2 quickn(x, n, decreasing = TRUE)      143ms  143ms      7.01    77.1MB    21.0 
#> 3 kit::topn(x, n, decreasing = TRUE)   443ms  443ms      2.26    38.5MB     2.26
#> 4 data.table:::forder(x, decreasing =… 510ms  510ms      1.96    38.9MB     0

n = 1e6
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)        3.8s    3.8s     0.263    3.81MB     0   
#> 2 quickn(x, n, decreasing = TRUE)   129.2ms   139ms     6.67    83.92MB     6.67
#> 3 kit::topn(x, n, decreasing = TRU… 344.9ms 378.2ms     2.64    41.96MB     1.32
#> 4 data.table:::forder(x, decreasin… 530.2ms 530.2ms     1.89    45.78MB     0

Strings

Random strings

library(data.table)
setDTthreads(1L)
x = stringi::stri_rand_strings(1e6, 10)
b <- function() {
  bench::mark(check=FALSE,
    topn(x,n,decreasing=FALSE),
    quickn(x,n,decreasing=FALSE),
    data.table:::forder(x,decreasing=FALSE)[1:n]
  )
}

n = 1e0
b()
#> # A tibble: 3 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)    8.94ms   9.22ms    108.      3.19KB     0   
#> 2 quickn(x, n, decreasing = FALS…  41.45ms  42.73ms     23.2     7.63MB    23.2 
#> 3 data.table:::forder(x, decreas… 208.66ms 209.69ms      4.77    3.89MB     2.38

n = 1e1
b()
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)       10ms  10.3ms     96.7         0B      0  
#> 2 quickn(x, n, decreasing = FALSE)   47.4ms  47.7ms     20.7     7.63MB     13.8
#> 3 data.table:::forder(x, decreasin… 210.6ms 216.9ms      4.63    3.81MB      0

n = 1e2
b()
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)     10.2ms  10.6ms     94.0       448B     0   
#> 2 quickn(x, n, decreasing = FALSE)   46.7ms  47.7ms     20.9     7.63MB     5.96
#> 3 data.table:::forder(x, decreasin… 215.8ms 217.6ms      4.60    3.82MB     0

n = 1e3
b()
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)     12.2ms  12.7ms     78.1     3.95KB     0   
#> 2 quickn(x, n, decreasing = FALSE)     44ms  45.2ms     22.1     7.64MB     5.53
#> 3 data.table:::forder(x, decreasin… 206.4ms 212.8ms      4.63    3.82MB     0

n = 1e4
b()
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)     35.7ms  36.7ms     27.1    39.11KB     0   
#> 2 quickn(x, n, decreasing = FALSE)   43.6ms  44.3ms     22.4     7.71MB     5.61
#> 3 data.table:::forder(x, decreasin… 206.1ms 206.5ms      4.77    3.89MB     0

n = 1e5
b()
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)    233.2ms 233.7ms      4.27  390.67KB      0  
#> 2 quickn(x, n, decreasing = FALSE)   44.1ms  45.9ms     21.9     8.39MB     11.0
#> 3 data.table:::forder(x, decreasin… 202.5ms 205.3ms      4.80    4.58MB      0

n = 1e6
b()
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)     54.1ms  56.1ms     17.6     3.81MB     5.88
#> 2 quickn(x, n, decreasing = FALSE)   24.9ms  25.8ms     38.8    15.26MB   116.  
#> 3 data.table:::forder(x, decreasin… 206.6ms 209.2ms      4.71   11.44MB     0

codecov · 2021-09-17T14:22:34Z

Codecov Report

❌ Patch coverage is 60.25641% with 31 lines in your changes missing coverage. Please review.
✅ Project coverage is 98.84%. Comparing base (9029d79) to head (9e0786c).

Files with missing lines	Patch %	Lines
src/topn.c	60.52%	30 Missing ⚠️
R/wrappers.R	50.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5167      +/-   ##
==========================================
- Coverage   99.02%   98.84%   -0.18%     
==========================================
  Files          87       88       +1     
  Lines       16758    16836      +78     
==========================================
+ Hits        16595    16642      +47     
- Misses        163      194      +31


mattdowle · 2021-09-24T22:36:31Z

Very nice! This is a common use case and would be great to get in. No problems about the code. Just thinking about API.

1. I agree with this comment that the word 'top' doesn't convey min or max. How about minn/maxn, or min_n/max_n?
2. This comment made a good point that maybe it should be topn(n, ...), with passing multiple columns in the future in mind, iiuc, which followed from @MichaelChirico's good point.
3. A concept in SQL land is LIMIT. Whether SQL engines typically know that LIMIT is set and optimize accordingly, I don't know. Regardless, a function or parameter to limit the number of rows returned could apply to and optimize other operations too; e.g. X[Y,limit=10] could return the first 10 rows of the X[Y] result without computing it all. One use case that springs to mind would be testing a join on large data to check it's returning the expected result before removing the limit to get the full result. But it's just an example; really any query could be limited, e.g. X[,j,by,limit=3] could return the first 3 groups, say, if each group took a long time because j was costly. I was just about to write that X[Y][1:10] could be optimized to X[Y, limit=10], but it's hard to see how to optimize across two [...][...] calls unless we make [...] lazy (which isn't impossible). Anyway, X[order(col), limit=10] could do what X[topn(col, 10)] is proposed to do and would avoid needing to discuss 1 and 2 above. It wouldn't change this PR much since the meat is in the C code, just the API to call that C code.
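To make the API comparison concrete, here is a small sketch of what can already be written with data.table and what the two proposed spellings would replace. The table X and its columns are made up for illustration, and both limit= and topn() are proposals from this thread, so they appear only in comments.

```r
library(data.table)

X <- data.table(id = 1:8, col = c(5, 3, 8, 1, 7, 2, 6, 4))

# Works today: order every row, materialise the sorted table, then cut it.
sorted_then_cut <- X[order(col)][1:3]

# Also works today: cut the ordering vector first, then subset once.
cut_then_subset <- X[head(order(col), 3L)]

# Proposed in this thread (hypothetical, shown for comparison only):
#   X[order(col), limit = 3]
#   X[topn(col, 3)]
```

Both runnable forms still compute the full ordering; the proposals differ only in how the partial ordering would be exposed to the user.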

ben-schwen · 2021-09-25T12:43:03Z

Regarding API:
What about nmin and nmax, respectively? This would go nicely with nmin(n, ....). Thinking about the API and the future, it is easy to just return the root and basically cover nth(n, ...) with the same code.

Regarding Functionality:
Should the indices always be returned in the "right" order as specified by decreasing and na.last, or would it make sense to add a sorted argument? Skipping the final ordering would speed up the runtime by k * log(n) for topn(x,k) with n = length(x).

Regarding implementation:
The current binary heap can be replaced by a d-ary heap. However, this results in a slightly slower running time for lower k and only seems to overtake the binary heap for k >= 1e4.
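The trade-off between a binary and a d-ary heap comes from the index arithmetic: a larger d gives a shallower tree, but each sift-down step has to look at up to d children. The helpers below are a hypothetical sketch using 1-based positions as in R vectors; they are not taken from the PR.

```r
# 1-based positions; d is the number of children per node.
dary_children <- function(i, d) d * (i - 1L) + 1L + seq_len(d)
dary_parent   <- function(i, d) (i - 2L) %/% d + 1L

dary_children(1L, 2L)  # binary heap: positions 2 and 3
dary_children(2L, 4L)  # 4-ary heap: positions 6 to 9
dary_parent(9L, 4L)    # back to position 2

# Number of levels needed to hold k entries in a complete d-ary tree.
dary_levels <- function(k, d) {
  levels <- 0L; capacity <- 0; width <- 1
  while (capacity < k) {
    capacity <- capacity + width  # add one more full level
    width <- width * d
    levels <- levels + 1L
  }
  levels
}

dary_levels(1e4, 2L)  # levels for a binary heap holding 1e4 entries
dary_levels(1e4, 4L)  # levels for a 4-ary heap holding 1e4 entries
```

For k = 1e4 the binary heap needs 14 levels and the 4-ary heap 8, which is one way to read the observation that the d-ary variant only pays off once k is large.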

I like the idea of a versatile LIMIT in the light of prototyping. However, my most common use case for this feature is only head(X)[,Y] and I'm not sure if I would really switch to X[,Y, limit=6L] for that.

jangorecki · 2024-01-11T06:59:11Z

I wonder if possibly https://github.com/Rdatatable/data.table/blob/master/src/quickselect.c could be reused?
Or maybe benchmark against that implementation?

I used it in naive rolling median algorithm to find partial (half) ordering.

ben-schwen · 2024-01-11T12:32:23Z

I wonder if possibly https://github.com/Rdatatable/data.table/blob/master/src/quickselect.c could be reused? Or maybe benchmark against that implementation?

I used it in naive rolling median algorithm to find partial (half) ordering.

Possibly. But quickselect returns values of x not of order(x).
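The values-versus-indices distinction can be illustrated with base R alone. Here sort(x, partial = n) stands in for a selection routine that returns values; it is used purely as an analogy and is not the code in src/quickselect.c.

```r
x <- c(9, 2, 7, 4, 1, 8)
n <- 3L

# Selection on values: the n smallest values end up in the first n slots,
# in no guaranteed order, and their original positions are lost.
smallest_values <- sort(x, partial = n)[seq_len(n)]

# What topn needs instead: positions into x, as order(x) would give them.
smallest_idx <- order(x)[seq_len(n)]

sort(smallest_values)  # same values as x[smallest_idx]
x[smallest_idx]
```

Recovering indices from a value-based selection needs either an extra pass over x or a selection routine that permutes an index vector alongside the values.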

jangorecki · 2024-01-11T12:37:34Z

ah yes, you are correct. Anyway you can compare speed of returning a value vs index, and at least you will know if there is something to improve regarding your current implementation, in case quickselect would be faster

jangorecki · 2024-01-11T12:38:58Z

BTW. those benchmark timings tables are terrible to look at when different rows use different units (ms vs s).

ben-schwen · 2024-01-12T17:12:38Z

ah yes, you are correct. Anyway you can compare speed of returning a value vs index, and at least you will know if there is something to improve regarding your current implementation, in case quickselect would be faster

will add a version with quickselect but my guess is that heapselect is faster for smaller k and quickselect will be faster as soon as k starts to grow.

jangorecki · 2024-09-12T19:21:03Z

Then topn can be a confusing name because it is commonly used in MSSQL for what is LIMIT in some other DBMSes.

ben-schwen · 2024-09-16T22:17:29Z

I added a quickselect version called quickn. This would make sense if we make topn mostly internal, e.g. DT[order(...), ..., limit = n, method=c("heapselect", "quickselect")]

Will update the benchmarks to make an informed decision.

MichaelChirico · 2024-12-03T05:54:02Z

> Then topn can be confusing name because it is commonly used in MSSQL for what is LIMIT in some other dbses.

Good call-out: https://learn.microsoft.com/en-us/dax/topn-function-dax

Examples there are not all that helpful, but AFAICT it suffers from the same confusing API where you are writing TOPN(..., DESC/ASC) and "top" is no longer the best phrasing.

I am leaning more towards limit= argument to [. It will be a good eventual complement to other FRs e.g. adding having= (#788), where= (#2911), join= (#3946), to make [ queries highly SQL-compatible.

> method=c("heapselect", "quickselect")

I'm not sure a method= argument to [ is warranted, I think options(datatable.query.limit.method) makes more sense.

github-actions · 2025-12-21T14:02:36Z

No obvious timing issues in HEAD=topn_heap

Generated via commit 9e0786c

Download link for the artifact containing the test results: ↓ atime-results.zip

Task	Duration
R setup and installing dependencies	2 minutes and 48 seconds
Installing different package versions	23 seconds
Running and plotting the test cases	5 minutes and 7 seconds

aitap · 2025-12-22T15:25:52Z

src/topn.c

+      if (j <= n) l = i;                            \
+    }                                               \
+  }                                                 \
+  memcpy(ians, ix, n * sizeof(CTYPE))               


This is somewhat scary for character vectors, but I'm not seeing anything that would break right now.

From the GC generations viewpoint, x and ans are likely from the same GC generation; x is possibly older. There shouldn't be any problem with elements of the newer, more frequently swept ans pointing to values from an older, less frequently swept GC generation. (It's the opposite that causes use-after-frees.)

From the reference counts viewpoint, the count will be one less than what it should be for elements of ans, but CHARSXPs are cached and immutable anyway.

Co-authored-by: aitap <[email protected]>

ben-schwen added 9 commits September 12, 2021 00:49

init topn

b972e13

make stable

a8f4096

add complex

1767174

rename index array

7c9ce8b

ISNAN_COMPLEX

43cc962

added tests

1687cb6

added man

fc7b948

fix man

90a652f

finish tests

c513efb

ben-schwen added 3 commits September 17, 2021 16:28

add coverage

ab25d3f

typo

6785cf5

add NEWS

78502c7

Kamgang-B requested a review from mattdowle September 25, 2021 10:19

ben-schwen mentioned this pull request Jan 2, 2022

nth max and nth min #919

Open

ben-schwen added 3 commits January 10, 2024 23:02

add sorted argument

c9b7a07

update tests

87aa734

add CODEOWNERS

446ca7f

tdhock mentioned this pull request Jan 10, 2024

data.table topn heap tdhock/atime#18

Open

ben-schwen added 3 commits January 10, 2024 23:23

update NEWS

db055f5

fix docs

2b16d6a

add arg to doc

13caa28

MichaelChirico mentioned this pull request Sep 13, 2024

export forder #3447

Open

ben-schwen added 2 commits September 17, 2024 00:02

add quickselect support

c70fbc1

add string support for quickselect

f8ff588

ben-schwen added 2 commits September 17, 2024 00:22

use memcpy instead of assignment

2dfb740

update NEWS

0604714

ben-schwen closed this Nov 10, 2024

ben-schwen reopened this Nov 10, 2024

MichaelChirico modified the milestones: 1.17.0, 1.18.0 Jan 17, 2025

This was referenced Jun 30, 2025

Add options= to test(), convert most Rraw scripts #5845

Draft

made todo labels consistent #7113

Open

jangorecki modified the milestones: 1.18.0, 1.19.0 Nov 30, 2025

ben-schwen added 2 commits December 21, 2025 14:39

Merge branch 'master' into topn_heap

6a162de

make linter happy

e9688fb

jangorecki closed this Dec 21, 2025

jangorecki reopened this Dec 21, 2025

aitap reviewed Dec 22, 2025

ben-schwen and others added 8 commits January 9, 2026 08:54

use RO pointer

63f1cb1

Co-authored-by: aitap <[email protected]>

more RO pointer

d049b7e

Co-authored-by: aitap <[email protected]>

use R_alloc instead of malloc

ee27215

save call of strcmp

611a7c0

Merge branch 'master' into topn_heap

2249c71

use DATAPTR_RO and add comment

c18537f

add clamp with warning

e5a0ccb

adjust test warning

9e0786c
