# TIDE -- Token-Informed Depth Execution

<p align="center">
  <img src="assets/tide-diagram.svg" alt="TIDE: Per-token early exit for transformer inference" width="100%" />
</p>

**Make any LLM faster by skipping layers tokens don't need.**

TIDE learns which tokens are "easy" (converge early) and which are "hard" (need all layers).
Easy tokens exit early. Hard tokens go deep. No model retraining. No architecture changes.
Drop it onto any HuggingFace model in 3 lines.

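A minimal sketch of those three lines (the model id is illustrative; `TIDE.TIDE(model, "router.pt")` is the entry point used throughout this README):

```python
import TIDE
from transformers import AutoModelForCausalLM

# Load any HuggingFace causal LM, then wrap it with a trained router checkpoint.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
engine = TIDE.TIDE(model, "router.pt")  # per-token early exit from here on
```
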
## Install

```bash
git clone https://github.com/RightNow-AI/TIDE
cd TIDE
pip install .
```

## Supported Models

TIDE auto-probes your model's architecture. No adapter code needed.

| Model Family | Examples | Status |
|---|---|---|
| LLaMA | LLaMA 3.3, LLaMA 4 Scout/Maverick | Benchmarked |
| DeepSeek | DeepSeek R1, R1 Distill 8B/32B/70B | Benchmarked |
| Qwen | Qwen3 8B/32B, Qwen 2.5 | Benchmarked |
| Mistral | Mistral Small 3.1, Mixtral | Supported |
| Gemma | Gemma 3 12B/27B | Supported |
| GPT-2 | GPT-2, DistilGPT-2 | Tested |
| GPT-NeoX | Pythia, GPT-NeoX-20B | Supported |
| Phi | Phi-3, Phi-4 | Supported |
| Falcon | Falcon 7B/40B | Supported |
| OPT | OPT-1.3B through OPT-30B | Supported |
| **Anything else** | Any `AutoModelForCausalLM` | Auto-probed |
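
Unlisted architectures take the same path. A hedged sketch (the StableLM model id is an arbitrary example of an architecture that gets auto-probed):

```python
import TIDE
from transformers import AutoModelForCausalLM

# No per-family adapter code: TIDE probes the layer stack when it wraps the model.
model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-2-1_6b")
engine = TIDE.TIDE(model, "router.pt")  # UniversalAdapter handles it
```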

## GPU Support

GPU architecture is auto-detected at install time.

| GPU | Arch | Status |
|---|---|---|
| V100 | sm_70 | Supported |
| T4 | sm_75 | Supported |
| A100 | sm_80 | Benchmarked |
| A10G | sm_86 | Tested in CI |
| L4 / L40S | sm_89 | Supported |
| H100 / H200 | sm_90 | Supported |
| B100 / B200 | sm_100 | Supported |
| GB200 / GB300 | sm_120 | Supported (PTX fallback) |

Override: `TORCH_CUDA_ARCH_LIST="8.6" pip install .`

No GPU? TIDE works in pure PyTorch (CPU fallback, no CUDA kernels needed).

## Benchmark Results

All benchmarks on **NVIDIA A100-SXM4-40GB**, bf16, 2000 WikiText calibration samples.
16 prompts (8 reasoning/math + 8 general knowledge).

### Prefill: 100% Exit Rate

Every token finds an early exit point. On reasoning + general prompts:

```
Model                   Layers  Exit Rate  Early Exits (before last checkpoint)
======================  ======  =========  =====================================
DeepSeek R1 Distill 8B  32      100%       5% exit at Layer 11 (1/3 depth)
Qwen3 8B                36      100%       10% exit across L11 + L23 (1/3-2/3)
```

### Latency: Up to 7% Faster Prefill

Single reasoning prompt, 20 runs averaged on A100:

```
Model                   Baseline  TIDE     Speedup
======================  ========  =======  =======
DeepSeek R1 Distill 8B  39.08ms   36.26ms  -7.2%
Qwen3 8B (36 layers)    46.82ms   44.14ms  -5.7%
```

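For context, a sketch of the timing methodology (3 warmup passes, then 20 synchronized runs; the prompt is illustrative, and the TIDE comparison reuses the quickstart wrapping):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).to("cuda")
inputs = tok("Prove that the sum of two odd numbers is even.", return_tensors="pt").to("cuda")

@torch.no_grad()
def prefill_ms(m, runs=20, warmup=3):
    for _ in range(warmup):      # warm up kernels and the allocator
        m(**inputs)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        m(**inputs)
    torch.cuda.synchronize()     # flush GPU work before stopping the clock
    return (time.perf_counter() - start) / runs * 1000

print(f"prefill: {prefill_ms(model):.2f} ms")  # run again after wrapping with TIDE.TIDE(...)
```
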
### Throughput: Up to 8% More Tokens/sec

```
Model                   Batch  Baseline     TIDE         Gain
======================  =====  ===========  ===========  =====
DeepSeek R1 Distill 8B  1      973 tok/s    1,037 tok/s  +6.5%
Qwen3 8B                1      258 tok/s    271 tok/s    +5.0%
Qwen3 8B                8      1,781 tok/s  1,926 tok/s  +8.1%
```

### Decode: 99% of Reasoning Tokens Exit Early

DeepSeek R1 Distill 8B solving a math problem, 256 tokens, `temperature=0`:

```
Threshold  Decode Exit Rate  Unique Tokens  Quality
=========  ================  =============  =========================
1.0 (off)  0%                99             Correct solution
0.85       98%               95             Correct solution
0.70       99%               95             Correct solution (stable)
0.50       99.6%             95             Correct solution (stable)
```

**99% of decode tokens exit early** while the model still solves the math
problem correctly. Output remains coherent with 95+ unique tokens.

### Convergence: 340K Tokens Analyzed

```
Model                   Layers  Tokens   Finding
======================  ======  =======  =====================
DeepSeek R1 Distill 8B  32      339,853  100% converge by L31
Qwen3 8B                36      314,530  100% converge by L35
GPT-2 (124M)            12      78,843   100% converge by L11
```

The penultimate checkpoint captures the full model output for every token;
the last few layers contribute negligible change to hidden state representations.
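
This is straightforward to reproduce on any HF model with `output_hidden_states=True`. A sketch on GPT-2, using per-token cosine similarity against the final hidden state (the 0.98 cutoff is an illustrative convergence criterion):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

final = out.hidden_states[-1]  # (1, seq_len, hidden)
for layer, h in enumerate(out.hidden_states[1:], start=1):
    sim = torch.cosine_similarity(h, final, dim=-1)    # per-token similarity
    pct = (sim > 0.98).float().mean().item() * 100     # illustrative cutoff
    print(f"L{layer}: {pct:.0f}% of tokens converged")
```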

## Tuning the Threshold

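A sketch of sweeping the exit threshold over the values benchmarked above (the `threshold` keyword argument is an assumption for illustration; check the TIDE API for the exact name):

```python
import TIDE
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16).to("cuda")
inputs = tok("What is 17 * 24?", return_tensors="pt").to("cuda")

# Higher threshold: stricter convergence check, fewer early exits, output
# closest to baseline. Lower threshold: more exits, more speedup.
# NOTE: the `threshold` kwarg is assumed from the configurations above.
for threshold in (0.95, 0.85, 0.70, 0.50):
    engine = TIDE.TIDE(model, "router.pt", threshold=threshold)
    with torch.no_grad():
        out = engine(**inputs)  # assumed to mirror the wrapped model's forward
    print(threshold, out.logits[:, -1].argmax(-1).item())
```
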
## Citation

```bibtex
@software{tide2026,
  title  = {TIDE: Token-Informed Depth Execution},
  author = {RightNow AI},
  year   = {2026},
  url    = {https://github.com/RightNow-AI/TIDE}
}
```