# TIDE -- Token-Informed Depth Execution

<p align="center">
  <img src="assets/tide-diagram.svg" alt="TIDE: Per-token early exit for transformer inference" width="100%" />
</p>

**Make any LLM faster by skipping layers tokens don't need.**

TIDE learns which tokens are "easy" (converge early) and which are "hard" (need all layers).
Easy tokens exit early. Hard tokens go deep. No model retraining. No architecture changes.
Drop it onto any HuggingFace model in 3 lines.

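A minimal sketch of those three lines (the model id is illustrative; `TIDE.TIDE(model, "router.pt")` is the entry point used throughout this README):

```python
import TIDE
from transformers import AutoModelForCausalLM

# Load any HuggingFace causal LM, then wrap it with a trained router checkpoint.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
engine = TIDE.TIDE(model, "router.pt")  # per-token early exit from here on
```
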
## Install

```bash
git clone https://github.com/RightNow-AI/TIDE
cd TIDE
pip install .
```

## Supported Models

TIDE auto-probes your model's architecture. No adapter code needed.

| Model Family | Examples | Status |
|---|---|---|
| LLaMA | LLaMA 3.3, LLaMA 4 Scout/Maverick | Benchmarked |
| DeepSeek | DeepSeek R1, R1 Distill 8B/32B/70B | Benchmarked |
| Qwen | Qwen3 8B/32B, Qwen 2.5 | Benchmarked |
| Mistral | Mistral Small 3.1, Mixtral | Supported |
| Gemma | Gemma 3 12B/27B | Supported |
| GPT-2 | GPT-2, DistilGPT-2 | Tested |
| GPT-NeoX | Pythia, GPT-NeoX-20B | Supported |
| Phi | Phi-3, Phi-4 | Supported |
| Falcon | Falcon 7B/40B | Supported |
| OPT | OPT-1.3B through OPT-30B | Supported |
| **Anything else** | Any `AutoModelForCausalLM` | Auto-probed |
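
Unlisted architectures take the same path. A hedged sketch (the StableLM model id is an arbitrary example of an architecture that gets auto-probed):

```python
import TIDE
from transformers import AutoModelForCausalLM

# No per-family adapter code: TIDE probes the layer stack when it wraps the model.
model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-2-1_6b")
engine = TIDE.TIDE(model, "router.pt")  # UniversalAdapter handles it
```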

## GPU Support

GPU architecture is auto-detected at install time.

| GPU | Arch | Status |
|---|---|---|
| V100 | sm_70 | Supported |
| T4 | sm_75 | Supported |
| A100 | sm_80 | Benchmarked |
| A10G | sm_86 | Tested in CI |
| L4 / L40S | sm_89 | Supported |
| H100 / H200 | sm_90 | Supported |
| B100 / B200 | sm_100 | Supported |
| GB200 / GB300 | sm_120 | Supported (PTX fallback) |

Override: `TORCH_CUDA_ARCH_LIST="8.6" pip install .`

No GPU? TIDE works in pure PyTorch (CPU fallback, no CUDA kernels needed).

## Benchmark Results

All benchmarks on **NVIDIA A100-SXM4-40GB**, bf16, 2000 WikiText calibration samples.
16 prompts (8 reasoning/math + 8 general knowledge).

### Prefill: 100% Exit Rate

Every token finds an early exit point. On reasoning + general prompts:

```
Model                   Layers  Exit Rate  Early Exits (before last checkpoint)
======================  ======  =========  =====================================
DeepSeek R1 Distill 8B  32      100%       5% exit at Layer 11 (1/3 depth)
Qwen3 8B                36      100%       10% exit across L11 + L23 (1/3-2/3)
```

### Latency: Up to 7% Faster Prefill

Single reasoning prompt, 20 runs averaged on A100:

```
Model                   Baseline  TIDE     Speedup
======================  ========  =======  =======
DeepSeek R1 Distill 8B  39.08ms   36.26ms  -7.2%
Qwen3 8B (36 layers)    46.82ms   44.14ms  -5.7%
```

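For context, a sketch of the timing methodology (3 warmup passes, then 20 synchronized runs; the prompt is illustrative, and the TIDE comparison reuses the quickstart wrapping):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).to("cuda")
inputs = tok("Prove that the sum of two odd numbers is even.", return_tensors="pt").to("cuda")

@torch.no_grad()
def prefill_ms(m, runs=20, warmup=3):
    for _ in range(warmup):      # warm up kernels and the allocator
        m(**inputs)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        m(**inputs)
    torch.cuda.synchronize()     # flush GPU work before stopping the clock
    return (time.perf_counter() - start) / runs * 1000

print(f"prefill: {prefill_ms(model):.2f} ms")  # run again after wrapping with TIDE.TIDE(...)
```
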
### Throughput: Up to 8% More Tokens/sec

```
Model                   Batch  Baseline     TIDE         Gain
======================  =====  ===========  ===========  =====
DeepSeek R1 Distill 8B  1      973 tok/s    1,037 tok/s  +6.5%
Qwen3 8B                1      258 tok/s    271 tok/s    +5.0%
Qwen3 8B                8      1,781 tok/s  1,926 tok/s  +8.1%
```

### Decode: 99% of Reasoning Tokens Exit Early

DeepSeek R1 Distill 8B solving a math problem, 256 tokens, `temperature=0`:

```
Threshold  Decode Exit Rate  Unique Tokens  Quality
=========  ================  =============  =========================
1.0 (off)  0%                99             Correct solution
0.85       98%               95             Correct solution
0.70       99%               95             Correct solution (stable)
0.50       99.6%             95             Correct solution (stable)
```

**99% of decode tokens exit early** while the model still solves the math
problem correctly. Output remains coherent with 95+ unique tokens.

### Convergence: 340K Tokens Analyzed

```
Model                   Layers  Tokens   Finding
======================  ======  =======  =====================
DeepSeek R1 Distill 8B  32      339,853  100% converge by L31
Qwen3 8B                36      314,530  100% converge by L35
GPT-2 (124M)            12      78,843   100% converge by L11
```

The penultimate checkpoint captures the full model output for every token;
the last few layers contribute negligible change to hidden state representations.
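
This is straightforward to reproduce on any HF model with `output_hidden_states=True`. A sketch on GPT-2, using per-token cosine similarity against the final hidden state (the 0.98 cutoff is an illustrative convergence criterion):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

final = out.hidden_states[-1]  # (1, seq_len, hidden)
for layer, h in enumerate(out.hidden_states[1:], start=1):
    sim = torch.cosine_similarity(h, final, dim=-1)    # per-token similarity
    pct = (sim > 0.98).float().mean().item() * 100     # illustrative cutoff
    print(f"L{layer}: {pct:.0f}% of tokens converged")
```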

## Tuning the Threshold

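A sketch of sweeping the exit threshold over the values benchmarked above (the `threshold` keyword argument is an assumption for illustration; check the TIDE API for the exact name):

```python
import TIDE
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16).to("cuda")
inputs = tok("What is 17 * 24?", return_tensors="pt").to("cuda")

# Higher threshold: stricter convergence check, fewer early exits, output
# closest to baseline. Lower threshold: more exits, more speedup.
# NOTE: the `threshold` kwarg is assumed from the configurations above.
for threshold in (0.95, 0.85, 0.70, 0.50):
    engine = TIDE.TIDE(model, "router.pt", threshold=threshold)
    with torch.no_grad():
        out = engine(**inputs)  # assumed to mirror the wrapped model's forward
    print(threshold, out.logits[:, -1].argmax(-1).item())
```
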
## Citation

```bibtex
@software{tide2026,
  title  = {TIDE: Token-Informed Depth Execution},
  author = {RightNow AI},
  year   = {2026},
  url    = {https://github.com/RightNow-AI/TIDE}
}
```