Stretto

A high-performance, thread-safe, memory-bound Rust cache.

Stretto is a pure Rust implementation of https://github.com/dgraph-io/ristretto.

Features

  • Internal Mutability - You do not need to wrap the cache in Arc<RwLock<Cache<...>>> for concurrent code; a plain Cache<...> or AsyncCache<...> is enough.
  • Sync and Async - Stretto supports both sync and runtime-agnostic async.
    • In sync mode, Cache starts two extra OS-level threads: a policy thread and a writing thread.
    • In async mode, AsyncCache starts two extra green threads: a policy thread and a writing thread.
  • Store policy - Stretto only stores the value, which means the cache does not store the key itself.
  • High Hit Ratios - with the Dgraph developers' unique admission/eviction policy pairing, Ristretto's performance is best in class.
    • Eviction: SampledLFU - on par with exact LRU and better performance on Search and Database traces.
    • Admission: TinyLFU - extra performance with little memory overhead (12 bits per counter).
  • Designed for Database Workloads - on OLTP-style traces with small working sets and strong frequency skew, Stretto keeps the hot set far better than general-purpose caches (see Benchmarks).
  • Fast Throughput - Stretto uses a variety of techniques to manage contention, and the result is excellent throughput.
  • Cost-Based Eviction - any large new item deemed valuable can evict multiple smaller items (cost could be anything).
  • Fully Concurrent - you can use as many threads as you want with little throughput degradation.
  • Metrics - optional performance metrics for throughput, hit ratios, and other stats.
  • Simple API - just figure out your ideal CacheBuilder/AsyncCacheBuilder values and you're off and running.

Benchmarks

Hit ratios produced by cachebench against the ARC trace suite, 16 concurrent clients. Compared against QuickCache 0.6, TinyUFO 0.8, and Moka 0.12. Higher is better. "Stretto" and "Stretto Async" report the sync Cache and async AsyncCache respectively — both run the same TinyLFU policy and 64-stripe insert buffer (see Sync vs async for a side-by-side discussion).

Where Stretto fits. TinyLFU admission compares each candidate's recent access frequency against the current eviction victim's and only admits when the candidate is "hotter." Two conditions together make this win:

  1. Capacity is small relative to the working set (typically ≤ ~25%). The admission filter then has a real choice on every insert; rejecting a one-off scan beats evicting a hot tail entry.
  2. Access frequencies are skewed (Zipf-like). The frequency sketch needs signal to distinguish hot from cold; uniform or scan-heavy traffic gives it none.

When both hold, Stretto leads by 7–47 percentage points — OLTP at every capacity, the 20K rows of P1–P13, the 100K rows of S1–S3. When either fails (DS1 at 4M+, S1/S2 at 800K, P14, ConCat, MergeP, MergeS, the 160K rows of P4/P6/P7/P11/P12), admission rejects items LRU/W-TinyLFU would have kept and Stretto trails by up to 57 points. Decision rule: if your working set is several times your cache size and hot keys repeat, pick Stretto; if traffic is bursty scans or your cache is sized to hold most of the data, pick Moka or QuickCache.

Read-heavy ≠ Stretto-friendly. The fit axis is access pattern, not the read-to-write ratio. A read-heavy workload with a tight hot set and skewed frequencies (OLTP, P1–P13) is Stretto's sweet spot; a read-heavy workload with a wide, churning keyspace (DS1) is its worst case — admission rejects items that LRU/W-TinyLFU would have kept, regardless of how few writes there are. Sized your cache to hold most of the data? Use an LRU-style cache (Moka, QuickCache).

S3 trace (Search)

S3 traces an entire workload from small to large capacity in one section. At 100K (cold tail still being rejected) Stretto wins by ~8 points; at 400K (Stretto's "frontier") it edges past Moka Segmented by 1.6 points; at 800K (capacity ≈ working set) admission filtering hurts and Moka Segmented pulls ahead by 13.6 points. The 400K row is the cleanest single illustration of where Stretto's regime ends.

| Capacity | QuickCache | Stretto | Stretto Async | TinyUFO | Moka Sync | Moka Async | Moka Segmented(8) |
|---------:|-----------:|--------:|--------------:|--------:|----------:|-----------:|------------------:|
| 100,000 | 12.83% | 20.72% | 20.90% | 2.80% | 10.40% | 10.49% | 10.08% |
| 400,000 | 42.29% | 43.92% | 43.21% | 20.85% | 40.98% | 41.59% | 42.30% |
| 800,000 | 68.35% | 56.52% | 58.50% | 62.28% | 66.35% | 67.37% | 70.10% |

DS1 trace (Database server)

DS1 is a "database" trace in name only — its access pattern has weak frequency skew across a huge key space and rewards LRU-style retention. Even at 1M (≈5% of interesting working set) Stretto trails QuickCache by 2 points; the gap widens to 20 points at 4M and 45 points at 8M, as LRU policies keep recently-touched items that admission filtering rejects. DS1 is the canonical "do not pick Stretto" workload.

| Capacity | QuickCache | Stretto | Stretto Async | TinyUFO | Moka Sync | Moka Async | Moka Segmented(8) |
|---------:|-----------:|--------:|--------------:|--------:|----------:|-----------:|------------------:|
| 1,000,000 | 15.09% | 12.98% | 14.83% | 3.36% | 12.51% | 13.19% | 14.74% |
| 4,000,000 | 44.42% | 24.92% | 26.97% | 24.71% | 41.48% | 39.22% | 44.65% |
| 8,000,000 | 69.31% | 24.57% | 34.64% | 53.42% | 64.38% | 64.85% | 66.43% |

OLTP trace

OLTP is the workload Ristretto was designed for: a tight, repeatedly-accessed working set with strong frequency skew. Stretto's hit ratio is 33–47 percentage points above the next-best cache and stays nearly flat across capacity (75.7% at cap=256, 76.4% at cap=2000) — TinyLFU pins the hot tail regardless of slack. Other caches' ratios climb with capacity (QuickCache: 22% → 42%, Moka Sync: 26% → 43%) because they have to physically retain the working set first. v0.9.0's 64-stripe insert buffer keeps producer contention out of the picture under 16 concurrent clients, so the result reflects the policy itself, not buffer-overflow drops.

| Capacity | QuickCache | Stretto | Stretto Async | TinyUFO | Moka Sync | Moka Async | Moka Segmented(8) |
|---------:|-----------:|--------:|--------------:|--------:|----------:|-----------:|------------------:|
| 256 | 22.28% | 75.74% | 75.78% | 16.35% | 26.40% | 25.55% | 28.80% |
| 512 | 28.79% | 76.02% | 75.66% | 19.89% | 32.71% | 32.00% | 33.53% |
| 1,000 | 35.12% | 76.13% | 75.87% | 27.08% | 38.02% | 37.48% | 37.80% |
| 2,000 | 41.68% | 76.39% | 76.15% | 35.05% | 42.99% | 42.47% | 42.85% |

ARC P-series traces (Workstation)

The P-series captures workstation block-IO over a few hours per machine. Two distinct regimes:

  • At 20K capacity (~12.5% of working set) — Stretto wins all 13 of P1–P13, by 13–40 points. P14 (a trace roughly 5× larger, so even 80K capacity is a tighter capacity-to-working-set ratio) sits past Stretto's regime, and QuickCache leads.
  • At 160K capacity (~100% of working set) — the picture flattens. Stretto still wins clearly on P2 (+6), P3 (+5), P8 (+9), P10 (+7); ties P1, P5, P9, P13 within ~1 point; trails P4, P6, P7, P11, P12, P14 where W-TinyLFU's recency-and-frequency signal extracts more from a saturated cache than admission filtering can.

| Trace | Capacity | QuickCache | Stretto | Stretto Async | TinyUFO | Moka Sync | Moka Async | Moka Segmented(8) |
|-------|---------:|-----------:|--------:|--------------:|--------:|----------:|-----------:|------------------:|
| P1 | 20,000 | 22.70% | 54.97% | 54.61% | 17.39% | 21.37% | 21.25% | 21.86% |
| P1 | 160,000 | 68.61% | 68.94% | 69.41% | 64.30% | 63.98% | 64.84% | 66.96% |
| P2 | 20,000 | 21.02% | 60.94% | 60.75% | 16.20% | 19.39% | 19.44% | 20.16% |
| P2 | 160,000 | 67.15% | 73.48% | 73.55% | 63.91% | 63.38% | 63.85% | 66.83% |
| P3 | 20,000 | 8.92% | 44.17% | 44.82% | 3.82% | 11.22% | 11.25% | 11.02% |
| P3 | 160,000 | 54.16% | 59.44% | 63.39% | 50.02% | 48.23% | 48.89% | 50.91% |
| P4 | 20,000 | 4.79% | 18.61% | 18.51% | 4.52% | 4.79% | 4.74% | 4.66% |
| P4 | 160,000 | 30.62% | 22.46% | 22.60% | 26.98% | 28.81% | 29.04% | 30.71% |
| P5 | 20,000 | 8.98% | 29.91% | 30.27% | 6.26% | 8.55% | 8.53% | 8.38% |
| P5 | 160,000 | 46.58% | 47.39% | 47.08% | 38.57% | 43.14% | 43.28% | 45.16% |
| P6 | 20,000 | 17.22% | 37.05% | 37.48% | 8.89% | 16.29% | 15.89% | 14.89% |
| P6 | 160,000 | 84.41% | 64.87% | 44.88% | 58.88% | 77.27% | 78.01% | 81.28% |
| P7 | 20,000 | 10.46% | 43.19% | 41.85% | 5.12% | 8.22% | 8.10% | 7.46% |
| P7 | 160,000 | 57.35% | 53.52% | 54.12% | 44.31% | 53.20% | 54.30% | 56.63% |
| P8 | 20,000 | 22.64% | 62.46% | 61.90% | 18.43% | 21.38% | 21.63% | 21.73% |
| P8 | 160,000 | 76.54% | 85.26% | 83.24% | 73.72% | 72.72% | 73.19% | 76.03% |
| P9 | 20,000 | 12.49% | 40.13% | 39.63% | 9.88% | 13.63% | 13.65% | 13.87% |
| P9 | 160,000 | 55.31% | 56.59% | 56.10% | 51.33% | 50.28% | 51.42% | 53.15% |
| P10 | 20,000 | 6.56% | 19.93% | 19.77% | 3.36% | 4.58% | 4.61% | 4.53% |
| P10 | 160,000 | 28.72% | 37.08% | 35.26% | 21.48% | 30.38% | 29.57% | 29.99% |
| P11 | 20,000 | 19.53% | 42.40% | 44.59% | 17.57% | 18.81% | 18.69% | 18.91% |
| P11 | 160,000 | 68.28% | 56.77% | 57.73% | 66.50% | 65.71% | 65.91% | 68.26% |
| P12 | 20,000 | 10.09% | 32.71% | 32.38% | 8.96% | 9.23% | 9.26% | 9.11% |
| P12 | 160,000 | 45.06% | 41.79% | 41.53% | 40.61% | 41.94% | 42.58% | 44.66% |
| P13 | 20,000 | 12.60% | 43.26% | 42.98% | 7.32% | 11.36% | 11.16% | 10.99% |
| P13 | 160,000 | 55.44% | 56.74% | 57.17% | 47.44% | 50.05% | 50.76% | 53.95% |
| P14 | 80,000 | 29.80% | 24.99% | 24.76% | 19.53% | 28.13% | 28.22% | 29.11% |
| P14 | 640,000 | 57.93% | 33.05% | 33.99% | 53.13% | 51.93% | 52.15% | 54.30% |

ARC S-series traces (Search)

S1 and S2 are search-engine traces with weak frequency skew. The crossover between Stretto's regime and LRU-style policies' regime sits between 100K and 800K capacity:

  • At 100K (working set still doesn't fit), Stretto's filter rejects the long cold tail and wins by 8–9 points.
  • At 800K (capacity nears the working set), LRU/W-TinyLFU keep recently-touched items that admission rejects, and Stretto trails by 14–27 points.

| Trace | Capacity | QuickCache | Stretto | Stretto Async | TinyUFO | Moka Sync | Moka Async | Moka Segmented(8) |
|-------|---------:|-----------:|--------:|--------------:|--------:|----------:|-----------:|------------------:|
| S1 | 100,000 | 9.59% | 18.78% | 18.96% | 2.64% | 8.88% | 8.89% | 8.69% |
| S1 | 800,000 | 56.86% | 29.81% | 30.62% | 49.44% | 55.32% | 55.38% | 56.08% |
| S2 | 100,000 | 13.11% | 21.24% | 21.27% | 2.78% | 10.53% | 10.61% | 10.24% |
| S2 | 800,000 | 68.36% | 56.55% | 62.35% | 62.57% | 66.78% | 67.67% | 70.33% |

Merged ARC traces (scan-heavy)

ConCat (DS1 ∥ S3), MergeP (P1–P14 interleaved), and MergeS (S1–S3 interleaved) splice ARC traces into long sequences with weak frequency skew over very large key spaces. TinyLFU's frequency sketch is built for a single stable workload, not concatenated phases, so admission rejects items LRU-style policies would retain — Stretto trails by 30–60 points across every measured capacity. These traces document the regime where Stretto is the wrong tool; do not use them to size a Stretto cache.

ConCat (DS1 ∥ S3):

| Capacity | QuickCache | Stretto | Stretto Async | TinyUFO | Moka Sync | Moka Async | Moka Segmented(8) |
|---------:|-----------:|--------:|--------------:|--------:|----------:|-----------:|------------------:|
| 200,000 | 60.43% | 14.31% | 14.55% | 40.81% | 54.80% | 55.38% | 57.86% |
| 400,000 | 72.44% | 15.31% | 16.61% | 57.37% | 63.41% | 64.51% | 67.96% |
| 3,200,000 | 87.74% | 43.26% | 42.40% | 86.87% | 73.09% | 76.99% | 82.22% |

MergeP (P1–P14 interleaved):

| Capacity | QuickCache | Stretto | Stretto Async | TinyUFO | Moka Sync | Moka Async | Moka Segmented(8) |
|---------:|-----------:|--------:|--------------:|--------:|----------:|-----------:|------------------:|
| 400,000 | 44.31% | 10.71% | 10.54% | 38.40% | 38.09% | 38.48% | 40.31% |
| 1,000,000 | 57.98% | 22.83% | 20.23% | 53.41% | 50.66% | 51.46% | 53.77% |
| 3,200,000 | 75.23% | 39.62% | 39.40% | 71.23% | 65.02% | 66.43% | 70.42% |

MergeS (S1–S3 interleaved):

| Capacity | QuickCache | Stretto | Stretto Async | TinyUFO | Moka Sync | Moka Async | Moka Segmented(8) |
|---------:|-----------:|--------:|--------------:|--------:|----------:|-----------:|------------------:|
| 400,000 | 19.96% | 14.67% | 14.96% | 7.29% | 20.62% | 20.68% | 21.14% |
| 1,000,000 | 44.81% | 28.25% | 29.26% | 30.15% | 41.67% | 42.01% | 44.19% |
| 3,200,000 | 83.02% | 45.29% | 51.21% | 82.03% | 77.54% | 78.38% | 80.38% |

Sync vs async cache

AsyncCache since v0.9.0 uses the same TinyLFU policy and the same 64-stripe insert buffer as sync Cache. The full breakdown is in the per-trace tables above; the summary:

  • On workloads matching Stretto's regime (OLTP, P1–P5, P8–P10, P12, P13, S1/S2 at 100K, S3 ≤ 400K), sync and async hit ratios track within ~1 percentage point — the underlying admission decisions are the same.
  • On scan-heavy or burst-heavy traces where the producer-side insert channel saturates, drop-on-overflow timing differs between sync (crossbeam_channel::bounded) and async (async_channel over std::thread) and the two diverge by 5–20 points. The sign isn't fixed: async wins some, sync wins others.

| Trace, Capacity | Sync | Async | Δ (points) |
|-----------------|-----:|------:|-----------:|
| P6, 160,000 | 64.87% | 44.88% | −19.99 |
| DS1, 8,000,000 | 24.57% | 34.64% | +10.06 |
| S2, 800,000 | 56.55% | 62.35% | +5.80 |
| MergeS, 3,200,000 | 45.29% | 51.21% | +5.92 |
| P3, 160,000 | 59.44% | 63.39% | +3.95 |

These swings are timing-dependent, not policy divergence: the affected traces sit outside Stretto's sweet spot (capacity ≈ working set, or scan-heavy phases) where hit ratio isn't the deciding metric anyway. Throughput is comparable: at S3 cap=400K sync finishes in ~2.9s vs async ~4.2s; at OLTP cap=256 both complete in under 0.07s. Pick AsyncCache for runtime ergonomics — hit ratio is not the tradeoff for workloads matching Stretto's regime.

Reproducing

git clone --recursive https://github.com/al8n/cachebench.git
cd cachebench/cache-trace/arc && for f in *.lis.zst; do zstd -d "$f"; done && cd ../..
cargo run --release --features "stretto,quick_cache,tiny-ufo,moka-v012" -- \
    -f oltp,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,s1,s2,s3,ds1,concat,merge-p,merge-s -n 16

spc1likeread is omitted from the tables above; with a 30,000,000-entry default capacity it takes ≈30 minutes per cache implementation and adds little signal beyond what DS1 already shows.

Installation

  • Use Cache.

    [dependencies]
    stretto = "0.9"
  • Use AsyncCache with tokio

    [dependencies]
    stretto = { version = "0.9", features = ["tokio"] }
  • Use AsyncCache with smol

    [dependencies]
    stretto = { version = "0.9", features = ["smol"] }

Usage

Example

See the examples folder for more details; a minimal sync sketch also follows this list:

  • sync: use stretto's cache in a sync environment
  • tokio: use stretto's cache with the tokio async runtime
  • smol: use stretto's cache with the smol async runtime
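
A minimal sync sketch, mirroring the sync example above; the exact method set (new, insert, wait, get, release, clear) follows that example and may differ slightly between versions:

use stretto::Cache;

fn main() {
    // num_counters = 12,960 (~10x the expected item count), max_cost = 1,000,000
    let c = Cache::new(12960, 1_000_000).unwrap();

    // insert a value with a cost of 1
    c.insert("a", "a", 1);

    // wait until the value has passed through the insert buffer and the policy
    c.wait().unwrap();

    // get returns a value reference; release it when done
    let v = c.get(&"a").unwrap();
    assert_eq!(v.value(), &"a");
    v.release();

    // drop everything and wait for pending operations to finish
    c.clear().unwrap();
    c.wait().unwrap();
    assert!(c.get(&"a").is_none());
}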

Config

The CacheBuilder struct is used when creating Cache instances if you want to customize the Cache settings.

num_counters

num_counters is the number of 4-bit access counters to keep for admission and eviction. Dgraph's developers have seen good performance in setting this to 10x the number of items you expect to keep in the cache when full.

For example, if you expect each item to have a cost of 1 and max_cost is 100, set num_counters to 1,000. Or, if you use variable cost values but expect the cache to hold around 10,000 items when full, set num_counters to 100,000. The important thing is the number of unique items in the full cache, not necessarily the max_cost value.
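
As a sketch, sizing a cache that is expected to hold about 10,000 items when full (the two-argument Cache::new(num_counters, max_cost) constructor is assumed from the usage example above):

use stretto::Cache;

fn main() {
    // Expect ~10,000 items resident when full, each inserted with a cost of 1,
    // so num_counters = 10 * 10_000 and max_cost = 10_000.
    let cache = Cache::new(100_000, 10_000).unwrap();

    cache.insert("user:42".to_string(), "profile bytes".to_string(), 1);
    cache.wait().unwrap();
}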

max_cost

max_cost is how eviction decisions are made. For example, if max_cost is 100 and a new item with a cost of 1 increases total cache cost to 101, 1 item will be evicted.

max_cost can also be used to denote the max size in bytes. For example, if max_cost is 1,000,000 (1MB) and the cache is full with 1,000 1KB items, a new 5KB item (that's accepted) would cause 5 1KB items to be evicted.

max_cost could be anything as long as it matches how you're using the cost values when calling insert.
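
A sketch of the byte-budget style, charging each insert its payload size (again assuming the Cache::new constructor from the usage example):

use stretto::Cache;

fn main() {
    // max_cost as a 1 MB byte budget; each insert is charged its payload size.
    let cache = Cache::new(10_000, 1_000_000).unwrap();

    let value = vec![0u8; 1024]; // a 1 KB payload
    let cost = value.len() as i64; // charge exactly what is stored
    cache.insert("blob:1".to_string(), value, cost);
    cache.wait().unwrap();
}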

key_builder

pub trait KeyBuilder {
    type Key: Hash + Eq + ?Sized;

    /// hash_index is used to hash the key to a u64.
    fn hash_index<Q>(&self, key: &Q) -> u64
    where
        Self::Key: core::borrow::Borrow<Q>,
        Q: Hash + Eq + ?Sized;

    /// If you want 128-bit hashes, implement this method;
    /// otherwise leave it returning 0.
    fn hash_conflict<Q>(&self, _key: &Q) -> u64
    where
        Self::Key: core::borrow::Borrow<Q>,
        Q: Hash + Eq + ?Sized,
    {
        0
    }

    /// Build the key into a pair of 64-bit hashes.
    fn build_key<Q>(&self, k: &Q) -> (u64, u64)
    where
        Self::Key: core::borrow::Borrow<Q>,
        Q: Hash + Eq + ?Sized,
    {
        (self.hash_index(k), self.hash_conflict(k))
    }
}

KeyBuilder is the hashing algorithm used for every key. In Stretto, the Cache will never store the real key; the key is processed by the KeyBuilder. Stretto ships two built-in key builders: TransparentKeyBuilder and DefaultKeyBuilder. If your key implements the TransparentKey trait, use TransparentKeyBuilder, which is faster than DefaultKeyBuilder; otherwise, use DefaultKeyBuilder. You can also write your own key builder for the Cache by implementing the KeyBuilder trait.

Note that if you want 128-bit hashes you should use the full (u64, u64); otherwise just fill the u64 at position 0, and it will behave like any 64-bit hash.
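
A minimal sketch of a custom key builder against the trait above; StdKeyBuilder is a hypothetical name, only hash_index is implemented, and hash_conflict keeps its default of 0 (import path assumed):

use core::borrow::Borrow;
use core::hash::{Hash, Hasher};
use std::collections::hash_map::DefaultHasher;
use stretto::KeyBuilder;

/// Hypothetical key builder for String keys that only fills the 64-bit index
/// hash, so the cache behaves as if using a 64-bit hash (see the note above).
#[derive(Default)]
struct StdKeyBuilder;

impl KeyBuilder for StdKeyBuilder {
    type Key = String;

    fn hash_index<Q>(&self, key: &Q) -> u64
    where
        Self::Key: Borrow<Q>,
        Q: Hash + Eq + ?Sized,
    {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        hasher.finish()
    }
}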

buffer_size

Both Cache and AsyncCache buffer admission requests in a 64-stripe ring that hashes each thread (or task) to its own Mutex<Vec<_>> so producers rarely contend. When a stripe accumulates insert_stripe_high_water items (default 64) the full batch is sent to the policy processor; the processor also drains every stripe inline every drain_interval (default 500ms) as a safety net.

Use set_insert_stripe_high_water(items) to tune the per-stripe batch size. Larger values amortize channel sends but delay admission decisions; values around 64–256 work well for caches under ~10K capacity. The stripe count and per-batch channel capacity are not user-tunable.
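
A sketch of tuning the batch size; Cache::builder(...).finalize() is assumed from the Config section, and set_insert_stripe_high_water is assumed to be exposed on the builder:

use stretto::Cache;

fn main() {
    let cache = Cache::builder(100_000, 10_000)
        // hand a stripe's batch to the policy processor after 128 buffered inserts
        .set_insert_stripe_high_water(128)
        .finalize()
        .unwrap();

    cache.insert("k".to_string(), "v".to_string(), 1);
    cache.wait().unwrap();
}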

metrics

Set metrics to true when you want real-time logging of a variety of stats. The reason this is a CacheBuilder flag is that it adds roughly 10% throughput overhead.

ignore_internal_cost

Defaults to true: each insert is charged only the caller-supplied cost, so max_cost behaves as an entry budget when you pass 1 per insert.

Set to false when max_cost represents a byte budget and you need each stored item to also account for ~56 bytes of per-entry bookkeeping (key, conflict, version, value wrapper, time).
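
A sketch of the byte-budget configuration; set_ignore_internal_cost is an assumed name for the builder setter of this flag:

use stretto::Cache;

fn main() {
    // A 1 MB byte budget: disable the default so each entry is also charged
    // the ~56 bytes of per-entry bookkeeping described above.
    let cache = Cache::builder(10_000, 1_000_000)
        .set_ignore_internal_cost(false) // assumed setter name for this flag
        .finalize()
        .unwrap();

    let value = vec![0u8; 4096];
    let cost = value.len() as i64;
    cache.insert("page:7".to_string(), value, cost);
    cache.wait().unwrap();
}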

cleanup_duration

The Cache cleans up expired values every 500ms by default.

update_validator

pub trait UpdateValidator: Send + Sync + 'static {
    type Value: Send + Sync + 'static;

    /// should_update is called when a value already exists in cache and is being updated.
    fn should_update(&self, prev: &Self::Value, curr: &Self::Value) -> bool;
}

By default, the Cache will always update the value if it already exists in the cache; this trait lets you decide whether the value should be updated.
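
A minimal sketch against the trait above; VersionGate and Versioned are hypothetical types that only let an update through when it carries a newer version:

use stretto::UpdateValidator;

#[derive(Clone, Debug)]
struct Versioned {
    version: u64,
    payload: String,
}

/// Hypothetical validator: only accept an update whose version is newer
/// than the value currently held by the cache.
#[derive(Default)]
struct VersionGate;

impl UpdateValidator for VersionGate {
    type Value = Versioned;

    fn should_update(&self, prev: &Self::Value, curr: &Self::Value) -> bool {
        curr.version > prev.version
    }
}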

callback

pub trait CacheCallback: Send + Sync + 'static {
    type Value: Send + Sync + 'static;

    /// on_exit is called whenever a value is removed from cache. This can be
    /// used to do manual memory deallocation. Would also be called on eviction
    /// and rejection of the value.
    fn on_exit(&self, val: Option<Self::Value>);

    /// on_evict is called for every eviction and passes the hashed key, value,
    /// and cost to the function.
    fn on_evict(&self, item: Item<Self::Value>) {
        self.on_exit(item.val)
    }

    /// on_reject is called for every rejection done via the policy.
    fn on_reject(&self, item: Item<Self::Value>) {
        self.on_exit(item.val)
    }
}

CacheCallback lets you customize extra operations on values when the related events happen.
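
A sketch against the trait above; DropCounter is a hypothetical callback that counts every value leaving the cache and relies on the default on_evict/on_reject forwarding to on_exit:

use std::sync::atomic::{AtomicU64, Ordering};
use stretto::CacheCallback;

/// Hypothetical callback that counts every value leaving the cache,
/// whether by explicit removal, eviction, or rejection by the policy.
#[derive(Default)]
struct DropCounter {
    dropped: AtomicU64,
}

impl CacheCallback for DropCounter {
    type Value = String;

    fn on_exit(&self, val: Option<Self::Value>) {
        if val.is_some() {
            self.dropped.fetch_add(1, Ordering::Relaxed);
            // a real callback could release external resources tied to the value here
        }
    }
}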

coster

pub trait Coster: Send + Sync + 'static {
    type Value: Send + Sync + 'static;

    /// cost evaluates a value and outputs a corresponding cost. This function
    /// is run after insert is called for a new item or an item update with a cost
    /// param of 0.
    fn cost(&self, val: &Self::Value) -> i64;
}

Coster is a trait you can pass to the CacheBuilder in order to evaluate item cost at runtime, and only for the insert calls that aren't dropped (this is useful if calculating item cost is particularly expensive, and you don't want to waste time on items that will be dropped anyway).

To signal to Stretto that you'd like to use this Coster trait:

  1. Set the Coster field to your own Coster implementation.
  2. When calling insert for new items or item updates, use a cost of 0 (see the sketch below).
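
Putting the two steps together; ByteCoster is a hypothetical coster that charges each value its byte length, and set_coster is the assumed builder hook for step 1:

use stretto::{Cache, Coster};

/// Hypothetical coster: charge each value its byte length at admission time.
#[derive(Default)]
struct ByteCoster;

impl Coster for ByteCoster {
    type Value = Vec<u8>;

    fn cost(&self, val: &Self::Value) -> i64 {
        val.len() as i64
    }
}

fn main() {
    // step 1: hand the coster to the builder (set_coster is the assumed hook)
    let cache = Cache::builder(10_000, 1_000_000)
        .set_coster(ByteCoster)
        .finalize()
        .unwrap();

    // step 2: insert with a cost of 0 so ByteCoster::cost runs only for admitted items
    cache.insert("blob:1".to_string(), vec![0u8; 2048], 0);
    cache.wait().unwrap();
}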

hasher

The hasher for the Cache; the default is SeaHasher.

Acknowledgements

  • Thanks to the Dgraph developers for providing the amazing Go implementation of Ristretto.

License

Licensed under either of Apache License, Version 2.0 or MIT license at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this project by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
