A high-performance, thread-safe, memory-bound Rust cache.
Stretto is a pure Rust implementation of https://github.com/dgraph-io/ristretto.
- Internal Mutability - no need to wrap the cache in `Arc<RwLock<Cache<...>>>` for concurrent code; a plain `Cache<...>` or `AsyncCache<...>` is enough.
- Sync and Async - Stretto supports both sync and runtime-agnostic async.
- In sync mode, `Cache` starts two extra OS-level threads: a policy thread and a writing thread.
- In async mode, `AsyncCache` starts two extra green threads: a policy thread and a writing thread.
- Store policy - Stretto stores only the value; the cache does not store the real key.
- High Hit Ratios - with Dgraph's developers' unique admission/eviction policy pairing, Ristretto's performance is best in class.
- Eviction: SampledLFU - on par with exact LRU and better performance on Search and Database traces.
- Admission: TinyLFU - extra performance with little memory overhead (12 bits per counter).
- Designed for Database Workloads - on OLTP-style traces with small working sets and strong frequency skew, Stretto keeps the hot set far better than general-purpose caches (see Benchmarks).
- Fast Throughput - uses a variety of techniques for managing contention, with excellent throughput as the result.
- Cost-Based Eviction - any large new item deemed valuable can evict multiple smaller items (cost could be anything).
- Fully Concurrent - you can use as many threads as you want with little throughput degradation.
- Metrics - optional performance metrics for throughput, hit ratios, and other stats.
- Simple API - just figure out your ideal `CacheBuilder`/`AsyncCacheBuilder` values and you're off and running.
Hit ratios produced by cachebench against the ARC trace suite, 16 concurrent clients. Compared against QuickCache 0.6, TinyUFO 0.8 and Moka 0.12. Higher is better; bold marks the leader for each row. "Stretto" and "Stretto Async" report the sync Cache and async AsyncCache respectively — both run the same TinyLFU policy and 64-stripe insert buffer (see Sync vs async for a side-by-side discussion).
Where Stretto fits. TinyLFU admission compares each candidate's recent access frequency against the current eviction victim's and only admits when the candidate is "hotter." Two conditions together make this win:
- Capacity is small relative to the working set (typically ≤ ~25%). The admission filter then has a real choice on every insert; rejecting a one-off scan beats evicting a hot tail entry.
- Access frequencies are skewed (Zipf-like). The frequency sketch needs signal to distinguish hot from cold; uniform or scan-heavy traffic gives it none.
When both hold, Stretto leads by 7–47 percentage points — OLTP at every capacity, the 20K rows of P1–P13, the 100K rows of S1–S3. When either fails (DS1 at 4M+, S1/S2 at 800K, P14, ConCat, MergeP, MergeS, the 160K rows of P4/P6/P7/P11/P12), admission rejects items LRU/W-TinyLFU would have kept and Stretto trails by up to 57 points. Decision rule: if your working set is several times your cache size and hot keys repeat, pick Stretto; if traffic is bursty scans or your cache is sized to hold most of the data, pick Moka or QuickCache.
Read-heavy ≠ Stretto-friendly. The fit axis is access pattern, not the read-to-write ratio. A read-heavy workload with a tight hot set and skewed frequencies (OLTP, P1–P13) is Stretto's sweet spot; a read-heavy workload with a wide, churning keyspace (DS1) is its worst case — admission rejects items that LRU/W-TinyLFU would have kept, regardless of how few writes there are. Sized your cache to hold most of the data? Use an LRU-style cache (Moka, QuickCache).
S3 traces an entire workload from small to large capacity in one section. At 100K (cold tail still being rejected) Stretto wins by ~8 points; at 400K (Stretto's "frontier") it edges past Moka Segmented by 1.6; at 800K (capacity ≈ working set) admission filtering hurts and Moka Segmented pulls ahead by 13.6. The 400K row is the cleanest single illustration of where Stretto's regime ends.
| Capacity | QuickCache | Stretto | Stretto Async | TinyUFO | Moka Sync | Moka Async | Moka Segmented(8) |
|---|---|---|---|---|---|---|---|
| 100,000 | 12.83% | 20.72% | 20.90% | 2.80% | 10.40% | 10.49% | 10.08% |
| 400,000 | 42.29% | 43.92% | 43.21% | 20.85% | 40.98% | 41.59% | 42.30% |
| 800,000 | 68.35% | 56.52% | 58.50% | 62.28% | 66.35% | 67.37% | 70.10% |
DS1 is a "database" trace in name only — its access pattern has weak frequency skew across a huge key space and rewards LRU-style retention. Even at 1M (≈5% of interesting working set) Stretto trails QuickCache by 2 points; the gap widens to 20 points at 4M and 45 points at 8M, as LRU policies keep recently-touched items that admission filtering rejects. DS1 is the canonical "do not pick Stretto" workload.
| Capacity | QuickCache | Stretto | Stretto Async | TinyUFO | Moka Sync | Moka Async | Moka Segmented(8) |
|---|---|---|---|---|---|---|---|
| 1,000,000 | 15.09% | 12.98% | 14.83% | 3.36% | 12.51% | 13.19% | 14.74% |
| 4,000,000 | 44.42% | 24.92% | 26.97% | 24.71% | 41.48% | 39.22% | 44.65% |
| 8,000,000 | 69.31% | 24.57% | 34.64% | 53.42% | 64.38% | 64.85% | 66.43% |
OLTP is the workload Ristretto was designed for: a tight, repeatedly-accessed working set with strong frequency skew. Stretto's hit ratio is 33–47 percentage points above the next-best cache and stays nearly flat across capacity (75.7% at cap=256, 76.4% at cap=2000) — TinyLFU pins the hot tail regardless of slack. Other caches' ratios climb with capacity (QuickCache: 22% → 42%, Moka Sync: 26% → 43%) because they have to physically retain the working set first. v0.9.0's 64-stripe insert buffer keeps producer contention out of the picture under 16 concurrent clients, so the result reflects the policy itself, not buffer-overflow drops.
| Capacity | QuickCache | Stretto | Stretto Async | TinyUFO | Moka Sync | Moka Async | Moka Segmented(8) |
|---|---|---|---|---|---|---|---|
| 256 | 22.28% | 75.74% | 75.78% | 16.35% | 26.40% | 25.55% | 28.80% |
| 512 | 28.79% | 76.02% | 75.66% | 19.89% | 32.71% | 32.00% | 33.53% |
| 1,000 | 35.12% | 76.13% | 75.87% | 27.08% | 38.02% | 37.48% | 37.80% |
| 2,000 | 41.68% | 76.39% | 76.15% | 35.05% | 42.99% | 42.47% | 42.85% |
The P-series captures workstation block-IO over a few hours per machine. Two distinct regimes:
- At 20K capacity (~12.5% of working set) — Stretto wins all 13 of P1–P13, by 13–40 points. P14 (a ~5×-larger trace, 80K is already a tighter ratio) sits past Stretto's regime and QuickCache leads.
- At 160K capacity (~100% of working set) — the picture flattens. Stretto still wins clearly on P2 (+6), P3 (+5), P8 (+9), P10 (+7); ties P1, P5, P9, P13 within ~1 point; trails P4, P6, P7, P11, P12, P14 where W-TinyLFU's recency-and-frequency signal extracts more from a saturated cache than admission filtering can.
| Trace | Capacity | QuickCache | Stretto | Stretto Async | TinyUFO | Moka Sync | Moka Async | Moka Segmented(8) |
|---|---|---|---|---|---|---|---|---|
| P1 | 20,000 | 22.70% | 54.97% | 54.61% | 17.39% | 21.37% | 21.25% | 21.86% |
| P1 | 160,000 | 68.61% | 68.94% | 69.41% | 64.30% | 63.98% | 64.84% | 66.96% |
| P2 | 20,000 | 21.02% | 60.94% | 60.75% | 16.20% | 19.39% | 19.44% | 20.16% |
| P2 | 160,000 | 67.15% | 73.48% | 73.55% | 63.91% | 63.38% | 63.85% | 66.83% |
| P3 | 20,000 | 8.92% | 44.17% | 44.82% | 3.82% | 11.22% | 11.25% | 11.02% |
| P3 | 160,000 | 54.16% | 59.44% | 63.39% | 50.02% | 48.23% | 48.89% | 50.91% |
| P4 | 20,000 | 4.79% | 18.61% | 18.51% | 4.52% | 4.79% | 4.74% | 4.66% |
| P4 | 160,000 | 30.62% | 22.46% | 22.60% | 26.98% | 28.81% | 29.04% | 30.71% |
| P5 | 20,000 | 8.98% | 29.91% | 30.27% | 6.26% | 8.55% | 8.53% | 8.38% |
| P5 | 160,000 | 46.58% | 47.39% | 47.08% | 38.57% | 43.14% | 43.28% | 45.16% |
| P6 | 20,000 | 17.22% | 37.05% | 37.48% | 8.89% | 16.29% | 15.89% | 14.89% |
| P6 | 160,000 | 84.41% | 64.87% | 44.88% | 58.88% | 77.27% | 78.01% | 81.28% |
| P7 | 20,000 | 10.46% | 43.19% | 41.85% | 5.12% | 8.22% | 8.10% | 7.46% |
| P7 | 160,000 | 57.35% | 53.52% | 54.12% | 44.31% | 53.20% | 54.30% | 56.63% |
| P8 | 20,000 | 22.64% | 62.46% | 61.90% | 18.43% | 21.38% | 21.63% | 21.73% |
| P8 | 160,000 | 76.54% | 85.26% | 83.24% | 73.72% | 72.72% | 73.19% | 76.03% |
| P9 | 20,000 | 12.49% | 40.13% | 39.63% | 9.88% | 13.63% | 13.65% | 13.87% |
| P9 | 160,000 | 55.31% | 56.59% | 56.10% | 51.33% | 50.28% | 51.42% | 53.15% |
| P10 | 20,000 | 6.56% | 19.93% | 19.77% | 3.36% | 4.58% | 4.61% | 4.53% |
| P10 | 160,000 | 28.72% | 37.08% | 35.26% | 21.48% | 30.38% | 29.57% | 29.99% |
| P11 | 20,000 | 19.53% | 42.40% | 44.59% | 17.57% | 18.81% | 18.69% | 18.91% |
| P11 | 160,000 | 68.28% | 56.77% | 57.73% | 66.50% | 65.71% | 65.91% | 68.26% |
| P12 | 20,000 | 10.09% | 32.71% | 32.38% | 8.96% | 9.23% | 9.26% | 9.11% |
| P12 | 160,000 | 45.06% | 41.79% | 41.53% | 40.61% | 41.94% | 42.58% | 44.66% |
| P13 | 20,000 | 12.60% | 43.26% | 42.98% | 7.32% | 11.36% | 11.16% | 10.99% |
| P13 | 160,000 | 55.44% | 56.74% | 57.17% | 47.44% | 50.05% | 50.76% | 53.95% |
| P14 | 80,000 | 29.80% | 24.99% | 24.76% | 19.53% | 28.13% | 28.22% | 29.11% |
| P14 | 640,000 | 57.93% | 33.05% | 33.99% | 53.13% | 51.93% | 52.15% | 54.30% |
S1 and S2 are search-engine traces with weak frequency skew. The crossover between Stretto's regime and LRU-style policies' regime sits between 100K and 800K capacity:
- At 100K (working set still doesn't fit), Stretto's filter rejects the long cold tail and wins by 8–9 points.
- At 800K (capacity nears the working set), LRU/W-TinyLFU keep recently-touched items that admission rejects, and Stretto trails by 14–27 points.
| Trace | Capacity | QuickCache | Stretto | Stretto Async | TinyUFO | Moka Sync | Moka Async | Moka Segmented(8) |
|---|---|---|---|---|---|---|---|---|
| S1 | 100,000 | 9.59% | 18.78% | 18.96% | 2.64% | 8.88% | 8.89% | 8.69% |
| S1 | 800,000 | 56.86% | 29.81% | 30.62% | 49.44% | 55.32% | 55.38% | 56.08% |
| S2 | 100,000 | 13.11% | 21.24% | 21.27% | 2.78% | 10.53% | 10.61% | 10.24% |
| S2 | 800,000 | 68.36% | 56.55% | 62.35% | 62.57% | 66.78% | 67.67% | 70.33% |
ConCat (DS1 ∥ S3), MergeP (P1–P14 interleaved), and MergeS (S1–S3 interleaved) splice ARC traces into long sequences with weak frequency skew over very large key spaces. TinyLFU's frequency sketch is built for a single stable workload, not concatenated phases, so admission rejects items LRU-style policies would retain — Stretto trails by 30–60 points across every measured capacity. These traces document the regime where Stretto is the wrong tool; do not use them to size a Stretto cache.
ConCat (DS1 ∥ S3):
| Capacity | QuickCache | Stretto | Stretto Async | TinyUFO | Moka Sync | Moka Async | Moka Segmented(8) |
|---|---|---|---|---|---|---|---|
| 200,000 | 60.43% | 14.31% | 14.55% | 40.81% | 54.80% | 55.38% | 57.86% |
| 400,000 | 72.44% | 15.31% | 16.61% | 57.37% | 63.41% | 64.51% | 67.96% |
| 3,200,000 | 87.74% | 43.26% | 42.40% | 86.87% | 73.09% | 76.99% | 82.22% |
MergeP (P1–P14 interleaved):
| Capacity | QuickCache | Stretto | Stretto Async | TinyUFO | Moka Sync | Moka Async | Moka Segmented(8) |
|---|---|---|---|---|---|---|---|
| 400,000 | 44.31% | 10.71% | 10.54% | 38.40% | 38.09% | 38.48% | 40.31% |
| 1,000,000 | 57.98% | 22.83% | 20.23% | 53.41% | 50.66% | 51.46% | 53.77% |
| 3,200,000 | 75.23% | 39.62% | 39.40% | 71.23% | 65.02% | 66.43% | 70.42% |
MergeS (S1–S3 interleaved):
| Capacity | QuickCache | Stretto | Stretto Async | TinyUFO | Moka Sync | Moka Async | Moka Segmented(8) |
|---|---|---|---|---|---|---|---|
| 400,000 | 19.96% | 14.67% | 14.96% | 7.29% | 20.62% | 20.68% | 21.14% |
| 1,000,000 | 44.81% | 28.25% | 29.26% | 30.15% | 41.67% | 42.01% | 44.19% |
| 3,200,000 | 83.02% | 45.29% | 51.21% | 82.03% | 77.54% | 78.38% | 80.38% |
AsyncCache since v0.9.0 uses the same TinyLFU policy and the same 64-stripe insert buffer as sync Cache. The full breakdown is in the per-trace tables above; the summary:
- On workloads matching Stretto's regime (OLTP, P1–P5, P8–P10, P12, P13, S1/S2 at 100K, S3 ≤ 400K), sync and async hit ratios track within ~1 percentage point — the underlying admission decisions are the same.
- On scan-heavy or burst-heavy traces where the producer-side insert channel saturates, drop-on-overflow timing differs between sync (`crossbeam_channel::bounded` over `std::thread`) and async (`async_channel` over green threads), and the two diverge by 5–20 points. The sign isn't fixed: async wins some, sync wins others.
| Trace, Capacity | Sync | Async | Δ |
|---|---|---|---|
| P6, 160,000 | 64.87% | 44.88% | −19.99 |
| DS1, 8,000,000 | 24.57% | 34.64% | +10.07 |
| S2, 800,000 | 56.55% | 62.35% | +5.80 |
| MergeS, 3,200,000 | 45.29% | 51.21% | +5.92 |
| P3, 160,000 | 59.44% | 63.39% | +3.95 |
These swings are timing-dependent, not policy divergence: the affected traces sit outside Stretto's sweet spot (capacity ≈ working set, or scan-heavy phases) where hit ratio isn't the deciding metric anyway. Throughput is comparable: at S3 cap=400K sync finishes in ~2.9s vs async ~4.2s; at OLTP cap=256 both complete in under 0.07s. Pick AsyncCache for runtime ergonomics — hit ratio is not the tradeoff for workloads matching Stretto's regime.
```shell
git clone --recursive https://github.com/al8n/cachebench.git
cd cachebench/cache-trace/arc && for f in *.lis.zst; do zstd -d "$f"; done && cd ../..
cargo run --release --features "stretto,quick_cache,tiny-ufo,moka-v012" -- \
  -f oltp,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,s1,s2,s3,ds1,concat,merge-p,merge-s -n 16
```

The spc1likeread trace is omitted from the tables above; with a 30,000,000-entry default capacity it takes ≈30 minutes per cache implementation and adds little signal beyond what DS1 already shows.
- Use `Cache`:

```toml
[dependencies]
stretto = "0.9"
```

- Use `AsyncCache` with tokio:

```toml
[dependencies]
stretto = { version = "0.9", features = ["tokio"] }
```

- Use `AsyncCache` with smol:

```toml
[dependencies]
stretto = { version = "0.9", features = ["smol"] }
```
See examples folder for more details:
- sync: Use stretto's cache in sync environment
- tokio: Use stretto's cache with tokio async runtime
- smol: Use stretto's cache with smol async runtime
The CacheBuilder struct is used when creating Cache instances if you want to customize the Cache settings.
num_counters is the number of 4-bit access counters to keep for admission and eviction. Dgraph's developers have seen good performance in setting this to 10x the number of items you expect to keep in the cache when full.
For example, if you expect each item to have a cost of 1 and max_cost is 100, set num_counters to 1,000. Or, if you use variable cost values but expect the cache to hold around 10,000 items when full, set num_counters to 100,000. The important thing is the number of unique items in the full cache, not necessarily the max_cost value.
max_cost is how eviction decisions are made. For example, if max_cost is 100 and a new item with a cost of 1 increases total cache cost to 101, 1 item will be evicted.
max_cost can also be used to denote the max size in bytes. For example, if max_cost is 1,000,000 (1MB) and the cache is full with 1,000 1KB items, a new item (that's accepted) would cause 5 1KB items to be evicted.
max_cost could be anything as long as it matches how you're using the cost values when calling insert.
```rust
pub trait KeyBuilder {
    type Key: Hash + Eq + ?Sized;

    /// hash_index is used to hash the key to u64
    fn hash_index<Q>(&self, key: &Q) -> u64
    where
        Self::Key: core::borrow::Borrow<Q>,
        Q: Hash + Eq + ?Sized;

    /// if you want 128-bit hashes, implement this method;
    /// otherwise keep the default, which returns 0
    fn hash_conflict<Q>(&self, _key: &Q) -> u64
    where
        Self::Key: core::borrow::Borrow<Q>,
        Q: Hash + Eq + ?Sized,
    {
        0
    }

    /// build the key into a pair of 64-bit hashes (128 bits total)
    fn build_key<Q>(&self, k: &Q) -> (u64, u64)
    where
        Self::Key: core::borrow::Borrow<Q>,
        Q: Hash + Eq + ?Sized,
    {
        (self.hash_index(k), self.hash_conflict(k))
    }
}
```

KeyBuilder is the hashing algorithm used for every key. In Stretto, the Cache never stores the real key.
Every key is processed by the KeyBuilder. Stretto has two built-in key builders: TransparentKeyBuilder and DefaultKeyBuilder. If your key implements the TransparentKey trait, use TransparentKeyBuilder, which is faster than DefaultKeyBuilder; otherwise, use DefaultKeyBuilder.
You can also write your own key builder for the Cache by implementing the KeyBuilder trait.
Note that if you want 128-bit hashes you should fill the full (u64, u64); otherwise fill only the u64 in position 0 and it will behave like any 64-bit hash.
Both Cache and AsyncCache buffer admission requests in a 64-stripe ring that hashes each thread (or task) to its own Mutex<Vec<_>> so producers rarely contend. When a stripe accumulates insert_stripe_high_water items (default 64) the full batch is sent to the policy processor; the processor also drains every stripe inline every drain_interval (default 500ms) as a safety net.
Use set_insert_stripe_high_water(items) to tune the per-stripe batch size. Larger values amortize channel sends but delay admission decisions; values around 64–256 work well for caches under ~10K capacity. The stripe count and per-batch channel capacity are not user-tunable.
Metrics is true when you want real-time logging of a variety of stats. The reason this is a CacheBuilder flag is that it carries a ~10% throughput overhead.
Defaults to true: each insert is charged only the caller-supplied cost, so max_cost behaves as an entry budget when you pass 1 per insert.
Set to false when max_cost represents a byte budget and you need each stored item to also account for ~56 bytes of per-entry bookkeeping (key, conflict, version, value wrapper, time).
The Cache cleans up expired values every 500ms by default.
```rust
pub trait UpdateValidator: Send + Sync + 'static {
    type Value: Send + Sync + 'static;

    /// should_update is called when a value already exists in cache and is being updated.
    fn should_update(&self, prev: &Self::Value, curr: &Self::Value) -> bool;
}
```

By default, the Cache always updates a value that already exists; implement this trait to decide whether an update should be applied.
```rust
pub trait CacheCallback: Send + Sync + 'static {
    type Value: Send + Sync + 'static;

    /// on_exit is called whenever a value is removed from cache. This can be
    /// used to do manual memory deallocation. Would also be called on eviction
    /// and rejection of the value.
    fn on_exit(&self, val: Option<Self::Value>);

    /// on_evict is called for every eviction and passes the hashed key, value,
    /// and cost to the function.
    fn on_evict(&self, item: Item<Self::Value>) {
        self.on_exit(item.val)
    }

    /// on_reject is called for every rejection done via the policy.
    fn on_reject(&self, item: Item<Self::Value>) {
        self.on_exit(item.val)
    }
}
```

CacheCallback lets you customize extra operations on values when the related events happen.
```rust
pub trait Coster: Send + Sync + 'static {
    type Value: Send + Sync + 'static;

    /// cost evaluates a value and outputs a corresponding cost. This function
    /// is run after insert is called for a new item or an item update with a
    /// cost param of 0.
    fn cost(&self, val: &Self::Value) -> i64;
}
```

Coster is a trait you can pass to the CacheBuilder in order to evaluate item cost at runtime, only for the insert calls that aren't dropped (this is useful if calculating item cost is particularly expensive, and you don't want to waste time on items that will be dropped anyway).
To signal to Stretto that you'd like to use this Coster trait:
- Set the Coster field to your own Coster implementation.
- When calling insert for new items or item updates, use a cost of 0.
The hasher for the Cache; the default is SeaHasher.
- Thanks to Dgraph's developers for providing the amazing Go version of Ristretto.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this project by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.