A curated collection of papers and open-source code repositories dedicated to the application of Vision-Language Models (VLMs) for streaming video.

- Project
- Technical Report
- Proactive Interaction
- Long-term Memory Management
- Real-time Inference
- Streaming with Thinking
- Benchmarks
- Training Datasets
- Survey
- Resources
## Project

| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2025.11 | Live VLM WebUI | [docs] | [GitHub] | Real-Time Vision Language Model Interaction with Webcam Streaming |
## Technical Report

| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.02 | Seed2.0 | [pdf] | [GitHub] | Doubao Video Calling; OVBench & LiveSports-3K & OVOBench & ODVBench & ViSpeak |
| 2025.12 | Seed1.8 | [pdf] | [GitHub] | Doubao Video Calling; OVBench & LiveSports-3K & OVOBench & ViSpeak & StreamingBench & OmniMMI |
## Proactive Interaction

| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.03 | Proact-VL: A Proactive VideoLLM for Real-Time AI Companions | [pdf] | [GitHub] | <\|FLAG\|> Token Response Head with Transition-Smoothed Classification & Stability Regularization |
| 2026.03 | StreamReady: Learning What to Answer and When in Long Streaming Videos | [pdf] | - | Learnable <RDY> Token with Readiness Head for Evidence-Gated Response Triggering |
| 2025.05 | StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant | [pdf] | [GitHub] | Activation Model via <ACT> Token with Binary Score Head |
| 2025.03 | ViSpeak: Visual Instruction Feedback in Streaming Videos | [pdf] | [GitHub] | Informative Head for <seg> Token |
| 2025.03 | StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition | [pdf] | [GitHub] | Cognition Gate Network (Shallow Layer Transfer from LLM) for Binary </response> / </silence> Classification |
| 2025.01 | Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction | [pdf] | [GitHub] | Binary Classification Head on <TODO> Token Embedding (BCE Loss); <ANS> for History Marking; <SILENT> for Reaction-Stage Filtering |
| 2024.11 | VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format | [pdf] | [GitHub] | Informative Head & Relevance Head |
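Across this table the recurring mechanism is a lightweight binary head that scores a special token's hidden state each frame and gates whether the model speaks. A minimal sketch in plain Python (the function, weights, and 0.5 threshold are illustrative assumptions, not any single paper's actual head):

```python
import math

def response_head(hidden, weights, bias=0.0, threshold=0.5):
    """Score a special token's hidden state with a binary head and decide
    whether to respond or stay silent. Sketch of the shared pattern above
    (e.g. a head on a <TODO>/<ACT>-style token); values are illustrative."""
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid over the head's logit
    return prob, prob >= threshold
```

In practice such heads are trained with frame-level speak/silence supervision, e.g. the BCE loss and binary score heads the Dispider and StreamBridge entries mention.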
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.04 | AURA: Always-On Understanding and Real-Time Assistance via Video Streams | [pdf] | [GitHub] | Unified <\|silent\|> Token for Silent Observation with Real-Time QA / Proactive QA / Multi-Response QA; Silent-Speech Balanced Loss |
| 2026.03 | STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding | [pdf] | [GitHub] | Activation Token Denoising over 0/1/[M]; Sequence Duplication; Selective Re-masking |
| 2025.12 | Streaming Video Instruction Tuning | [pdf] | [GitHub] | Three-State Response Tokens (</Silence>, </Standby>, </Response>) via Unified Next-Token Prediction; Focal Loss for Response Imbalance |
| 2025.06 | Proactive Assistant Dialogue Generation from Streaming Egocentric Videos | [pdf] | [GitHub] | Frame-Level [EOS] Silence Prediction with Negative Frame Sub-Sampling |
| 2025.03 | LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant | [pdf] | [GitHub] | Streaming EOS Prediction via Fast-Slow Dual-Path; Token Aggregation & Dropping Router for Visual Feature Compression |
| 2025.03 | AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis | [pdf] | - | EOS Prediction with Task-Adaptive Threshold for Anomaly Alerting |
| 2024.07 | What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction | [pdf] | [GitHub] | Action Tokens <next> and <feedback> |
| 2024.06 | VideoLLM-online: Online Video Large Language Model for Streaming Video | [pdf] | [GitHub] | Streaming EOS Token Prediction |
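The EOS-prediction entries above all reduce to the same per-frame loop: emit silence while the learned score stays low, respond when it crosses a threshold. A toy version with a short refractory period so the trigger does not fire on every subsequent frame (the scores, threshold, and cooldown are made up for illustration):

```python
def streaming_eos(frame_scores, threshold=0.6, cooldown=2):
    """Streaming EOS-style trigger: stay silent ([EOS]) frame by frame until
    the response score crosses a threshold, then hold a short cooldown.
    Threshold and cooldown are illustrative, not values from any paper above."""
    actions, wait = [], 0
    for score in frame_scores:
        if wait > 0:
            actions.append("silence")
            wait -= 1
        elif score >= threshold:
            actions.append("respond")
            wait = cooldown
        else:
            actions.append("silence")
    return actions
```

Task-adaptive thresholds (as in the AssistPDA entry) replace the fixed constant with one tuned per task.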
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.03 | Thinking in Streaming Video | [pdf] | [GitHub] | Watch-Think-Speak with <silent> & <response> Action Tokens; Streaming RLVR (Format + Time + Accuracy Reward) |
| 2026.01 | Learning to Respond: A Large-Scale Benchmark and Progressive Learning Framework for Trigger-Centric Online Video Understanding | [pdf] | - | Trigger-Centric Online Video Understanding |
| 2025.12 | MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning | [pdf] | [GitHub] | Text-to-Text "NO REPLY" Response Decision with Multi-Objective Reward (PAUC + Replication + In-Span + Prefix) GRPO |
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.03 | FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding | [pdf] | [GitHub] | Scene-Change Ratio Trigger Reusing Temporal Adjacency Selection Statistics |
| 2026.01 | QueryStream: Advancing Streaming Video Understanding with Query-Aware Pruning and Proactive Response | [pdf] | - | Relevance-Triggered Active Response |
| 2026.01 | Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams | [pdf] | - | Motion-Semantic-Prediction Boundary Score; Adaptive Threshold; Event-triggered Decoding |
| 2025.11 | LiveStar: Live Streaming Assistant for Real-World Online Video Understanding | [pdf] | [GitHub] | Streaming Verification Decoding Perplexity Gate for Response & Silence |
| 2025.08 | StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding | [pdf] | - | Reactive/Proactive/Speculative Planning; Heuristic Trigger; Tool-Guided Information Hunting |
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.03 | Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding | [pdf] | [GitHub] | Query-Time Instruction-Guided Visual Proposal Generation (SFT+GRPO); Lightweight Embedding Cosine Similarity Surge Triggering |
| 2026.03 | StreamingClaw Technical Report | [pdf] | - | Training-Free (Reminder Node); Training-Based (Scenario-Specific Trigger Tokens) |
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2025.10 | Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video | [pdf] | [GitHub] | Weighted Interval Supervision & Uncertainty-Guided High-Resolution Requests |
| 2025.02 | EgoSpeak: Learning When to Speak for Egocentric Conversational Agents in the Wild | [pdf] | [GitHub] | Standalone Audio-Visual Frame-Level Three-Way Classifier (Background / Self / Other) with Anticipatory Prediction |
## Long-term Memory Management

| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.03 | CurveStream: Boosting Streaming Video Understanding in MLLMs via Curvature-Aware Hierarchical Visual Memory Management | [pdf] | - | Curvature-Aware Scorer (Motion Variation + Geometric Curvature); EMA-Based K-Sigma Dynamic Thresholding; Hierarchical Clear/Blurred/Discard Memory with FIFO Eviction |
| 2026.03 | StreamingClaw Technical Report | [pdf] | - | Hierarchical Memory Evolution |
| 2026.03 | StreamReady: Learning What to Answer and When in Long Streaming Videos | [pdf] | - | 3-Level Visual Memory Tree (FIFO-Centroid-Prototype) & Contextual Memory Bank |
| 2026.03 | WAT: Online Video Understanding Needs Watching Before Thinking | [pdf] | - | Dual-Level Memory: Short-Term FIFO Queue + Long-Term Memory with Redundancy-Aware Eviction Policy |
| 2026.03 | Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism | [pdf] | [GitHub] | Dual-Pathway Compression for Context Memory and Local Memory; Visual KV-Cache Memory Bank; Cross-attention Memory Recall & MemIndex |
| 2026.03 | FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding | [pdf] | [GitHub] | Three-Level Hierarchical Memory (Short/Mid/Long-Term); Temporal Adjacency Selection & Spatial Domain Consolidation |
| 2026.02 | FreshMem: Brain-Inspired Frequency-Space Hybrid Memory for Streaming Video Understanding | [pdf] | - | Short-term Sliding Window + Multi-scale Frequency Memory + Space Thumbnail Memory |
| 2026.01 | HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding | [pdf] | [GitHub] | Hierarchical KV-Cache (Sensory-Working-Long-Term Memory); Exponential Forgetting Curve & Frame-Level Anchor Tokens; Cross-Layer Memory Smoothing & Position Re-Indexing |
| 2025.12 | Venus: An Efficient Edge Memory-and-Retrieval System for VLM-based Online Video Understanding | [pdf] | - | Hierarchical Raw Data Layer & Semantic Index Layer; Scene Segmentation and Incremental Clustering for Sparse Memory Construction |
| 2025.08 | StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding | [pdf] | - | Streaming KV-Cache; CPU Long-Term & GPU Short-Term Hierarchical Memory; Layer-Adaptive KV Recall |
| 2025.07 | StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling | [pdf] | [GitHub] | Sliding-Window KV-Cache with Selective Offloading; 3D Voxel-Based Spatial Token Pruning for Cross-Frame Redundancy Removal |
| 2025.01 | Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge | [pdf] | [GitHub] | Hierarchical Memory: Ebbinghaus Forgetting Curve Short-Term + Tree-Structured Clustering Long-Term + FAISS Dialogue Memory; Forgetting Probability-Based Frame Sampling |
| 2024.09 | VideoLLaMB: Long Streaming Video Understanding with Recurrent Memory Bridges | [pdf] | [GitHub] | Recurrent Memory Bridge with Self-Attention Recursive Update & Retrieval Attention; SceneTiling Semantic Segmentation; Linear Memory Scaling |
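Several memory designs above decay old frames along an Ebbinghaus-style forgetting curve, R = exp(-t/S). A minimal sketch of retention-based eviction (the strength S and keep threshold are hypothetical values for illustration):

```python
import math

def retention(age, strength=5.0):
    """Ebbinghaus-style forgetting curve R = exp(-t/S): retention falls with
    age; a larger strength S means slower forgetting."""
    return math.exp(-age / strength)

def evict_forgotten(frames, strength=5.0, keep_threshold=0.3):
    """Keep only (timestamp, frame) pairs whose retention is still above the
    threshold. Values are illustrative, not from any paper above."""
    now = max(t for t, _ in frames)
    return [(t, f) for t, f in frames
            if retention(now - t, strength) >= keep_threshold]
```

The hierarchical variants above additionally promote salient frames to a long-term tier (raising their effective strength) instead of dropping them outright.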
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.04 | AURA: Always-On Understanding and Real-Time Assistance via Video Streams | [pdf] | [GitHub] | Dual Sliding-Window Context over Recent Video and QA History; Out-of-Window Video Chunks and <\|silent\|> Tokens Discarded |
| 2026.04 | A Simple Baseline for Streaming Video Understanding | [pdf] | [GitHub] | Fixed Recent-frame Sliding Window; Old-frame Eviction without External Memory |
| 2026.03 | Proact-VL: A Proactive VideoLLM for Real-Time AI Companions | [pdf] | [GitHub] | Dual-Cache (System & Streaming) Sliding Window with Reverse-RoPE Eviction for Infinite Streaming |
| 2026.03 | STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding | [pdf] | [GitHub] | Bounded Visual Cache Eviction; Sliding-window Frame Retention |
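The entries above share the simplest possible memory policy: a fixed recent-frame window with FIFO eviction and no external store. A sketch using a bounded deque (the capacity is arbitrary):

```python
from collections import deque

class SlidingWindow:
    """Fixed recent-frame sliding window: old frames are evicted FIFO with
    no external memory, as in the simple-baseline entries above (sketch)."""
    def __init__(self, capacity):
        self.frames = deque(maxlen=capacity)

    def push(self, frame):
        self.frames.append(frame)  # deque drops the oldest frame automatically

    def context(self):
        return list(self.frames)
```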
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2025.10 | Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs | [pdf] | - | Attention-Score-Based Visual Token Selection with Recurrent FIFO Token Queue; Maximal Marginal Relevance Caption Retrieval |
| 2025.05 | StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant | [pdf] | [GitHub] | Producer-Consumer Memory Buffer with Conditional Round-Decayed Token Compression (Prioritizing Recent Frames) |
| 2025.04 | TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos | [pdf] | [GitHub] | Differential Token Drop (Primary); FIFO Slimmed Token Memory Bank (Supplementary) |
| 2025.03 | VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers | [pdf] | [GitHub] | Single Semantic Carrier Token per Frame (Avg-Pool Embedding + Reused KV Aggregation); Cosine Similarity-Based Dynamic Memory Bank Eviction |
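A common thread here (e.g. TimeChat-Online's differential drop) is discarding visual tokens that barely changed since the previous frame. A toy cosine-similarity version (the token vectors and the 0.95 threshold are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two (non-zero) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def differential_drop(prev_tokens, new_tokens, sim_threshold=0.95):
    """Keep only the new frame's tokens that differ enough from the token at
    the same spatial position in the previous frame (sketch)."""
    return [tok for prev, tok in zip(prev_tokens, new_tokens)
            if cosine(prev, tok) < sim_threshold]
```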
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.02 | Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory | [pdf] | [GitHub] | Adaptive Key Selection for Sparse Sliding-Window Encoding; Training-Free Retrieval MoE via Reciprocal Rank Fusion of Internal & External Signals |
| 2025.12 | V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval | [pdf] | - | Hash-Bit Hamming Clustering & Weighted Cumulative Sum Early-Exit Thresholding for Dynamic KV-Cache Retrieval; Hierarchical Memory Offloading |
| 2025.11 | StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression | [pdf] | [GitHub] | Cosine Similarity-Based Dynamic Semantic Segmentation; Summary-Vector Representative Key Retrieval; Guidance-Prompt-Driven KV Compression with Layer-Adaptive Budget Allocation |
| 2025.11 | LiveStar: Live Streaming Assistant for Real-World Online Video Understanding | [pdf] | [GitHub] | Peak-End Memory Compression & Dual-Level Streaming KV Cache |
| 2025.10 | StreamingTOM: Streaming Token Compression for Efficient Video Understanding | [pdf] | [GitHub] | Causal Temporal Token Reduction (Static-Dynamic DPC & Attention Selection) + Online 4-bit Quantized KV Memory with Representative-Key Retrieval |
| 2025.08 | StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding | [pdf] | - | Chat Template Proxy Query for Query-Agnostic KV-Cache Pruning & Weighted Merging |
| 2025.06 | InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding | [pdf] | [GitHub] | KV-Cache Compression with Temporal-Axis Redundancy Pruning & Value-Norm Ranking with Layer-Adaptive Pooling |
| 2025.05 | LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval | [pdf] | - | Streaming-Oriented KV Cache with Video-Specific KV Compression; Frame-Wise KV Merging & FIFO KV Chunk Memory |
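Several systems above keep KV state as whole chunks under a fixed token budget with FIFO eviction (e.g. the LiveVLM entry's FIFO KV chunk memory). A sketch of that bookkeeping (the class name and budget are hypothetical):

```python
class KVChunkMemory:
    """FIFO KV-chunk memory under a fixed token budget: append per-window
    chunks and evict the oldest whole chunks once the budget is exceeded
    (sketch of the pattern above; budget value illustrative)."""
    def __init__(self, budget):
        self.budget, self.chunks = budget, []

    def append(self, chunk_id, n_tokens):
        self.chunks.append((chunk_id, n_tokens))
        while sum(n for _, n in self.chunks) > self.budget:
            self.chunks.pop(0)  # FIFO eviction of the oldest chunk

    def ids(self):
        return [c for c, _ in self.chunks]
```

Evicting whole chunks (rather than individual tokens) keeps positional bookkeeping simple and matches how the window was encoded.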
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.03 | PEARL: Personalized Streaming Video Understanding Model | [pdf] | [GitHub] | Dual-Grained Memory (Streaming Memory + Concept Memory); Concept-Aware Retrieval with Query Rewriting for Personalized Concept Grounding |
| 2026.02 | Vista: Scene-Aware Optimization for Streaming Video Question Answering under Post-Hoc Queries | [pdf] | - | Scene-aware Segmentation; Temporal-Spatial Scene Token Compression; CPU Offloaded Full Frames with Query-conditioned Top-k Recall |
| 2026.02 | WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs | [pdf] | [GitHub] | Uncertainty-Gated Hierarchical Retrieval with Entropy-Threshold Triggered Past Context Access |
| 2025.12 | CogStream: Context-guided Streaming Video Question Answering | [pdf] | [GitHub] | Temporal-Semantic Clustering with Question-Aware Event Compression; Historical Dialogue Retrieval via LLM Selection |
| 2025.11 | CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding | [pdf] | - | Dynamic Token Dropping; GRU-Based Compressive Memory; KV Offloading and Rehydration; Consensus-First Retrieval |
| 2025.06 | Flash-VStream: Efficient Real-Time Understanding for Long Video Streams | [pdf] | [GitHub] | K-Means Clustered Context Synopsis Memory & Feature-Centric Detail Augmentation Memory with Disk-Offloadable Feature Bank |
| 2025.03 | Streaming Video Question-Answering with In-context Video KV-Cache Retrieval | [pdf] | [GitHub] | Sliding-Window Encoding with KV-Cache Offloading to RAM/Disk; Internal (Self-Attention Key Averaging) & External (CLIP Cosine Similarity) Frame-Level Retrieval |
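The retrieval-memory entries typically offload frame features and recall the top-k most query-similar ones at question time. A toy CLIP-style ranking over stored embeddings (all vectors and names here are made up):

```python
import math

def cosine(a, b):
    """Cosine similarity between two (non-zero) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve_topk(query, memory, k=2):
    """Query-conditioned top-k recall over an offloaded frame bank: rank
    stored frame embeddings by cosine similarity to the query (sketch of the
    external-retrieval pattern above; embeddings here are toy vectors)."""
    ranked = sorted(memory.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]
```

Real systems pair this with offloading (RAM/disk, as in the entries above) so only the recalled frames are rehydrated into the context.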
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.03 | Click-to-Ask: An AI Live Streaming Assistant with Offline Copywriting and Online Interactive QA | [pdf] | - | Streaming Event Segmentation; Event-level Historical Caption Memory; Knowledge Extraction Accelerator |
| 2026.03 | Thinking in Streaming Video | [pdf] | [GitHub] | Reasoning-Compressed Streaming Memory: Visual Sliding Window + Reasoning Tokens as Long-Term Semantic Anchors |
| 2026.03 | Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously | [pdf] | [GitHub] | Short-Term Visual Sliding Window & Long-Term Textual Streaming-Thought Memory with FIFO Eviction; Recursive Temporal Segmentation |
| 2026.02 | Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge | [pdf] | - | Convert Video into a Lightweight Textual Memory |
| 2025.06 | Proactive Assistant Dialogue Generation from Streaming Egocentric Videos | [pdf] | [GitHub] | Iterative Progress Summarization as Summary-Based Memory Compression |
| 2025.04 | Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding | [pdf] | [GitHub] | Multimodal Interleaved Cache: Online Verbalization of Visual-to-Text for Long-Term & Short-Term Visual Tokens |
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.02 | EventMemAgent: Hierarchical Event-Centric Memory for Online Video Understanding with Adaptive Tool Use | [pdf] | [GitHub] | Event-Centric Dual-Layer Memory (STM with Online Event Segmentation & Reservoir Sampling + LTM with Structured Event Tuples) |
| 2026.01 | Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams | [pdf] | - | Event-level Memory Bank; Merge-or-Append Event Consolidation |
| 2025.12 | VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs | [pdf] | [GitHub] | Prediction-Guided Elastic-Scale Event Segmentation & Hierarchical Cross-Attention Event Consolidation |
| 2025.09 | StreamForest: Efficient Online Video Understanding with Persistent Event Memory | [pdf] | [GitHub] | Event-Level Tree Hierarchy with Adaptive Merging via Similarity & Merge-Count & Temporal Penalty; Short-Term Spatiotemporal Sliding Window |
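Event-centric memories above segment the stream by deciding, per frame, whether to merge into the current event or open a new one. A sketch of merge-or-append consolidation against a running centroid (the 0.9 threshold is illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two (non-zero) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def consolidate(events, frame_emb, merge_threshold=0.9):
    """Merge-or-append event consolidation: a new frame joins the last event
    when it is similar to that event's running centroid, otherwise it starts
    a new event (sketch of the pattern above)."""
    if events and cosine(events[-1]["centroid"], frame_emb) >= merge_threshold:
        ev = events[-1]
        n = ev["count"]
        ev["centroid"] = [(c * n + x) / (n + 1) for c, x in zip(ev["centroid"], frame_emb)]
        ev["count"] = n + 1
    else:
        events.append({"centroid": list(frame_emb), "count": 1})
    return events
```

The tree-based variants (e.g. the StreamForest entry) extend this with merge-count and temporal penalties so long-running events do not absorb everything.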
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.01 | OnlineSI: Taming Large Language Model for Online 3D Understanding and Grounding | [pdf] | [GitHub] | Fixed-size Spatial Memory with Time-adaptive Sampling and Concatenation; Explicit Point Cloud and Semantic Memory |
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2025.10 | video-SALMONN S: Streaming Audio-Visual LLMs Beyond Length Limits via Memory | [pdf] | [GitHub] | Test-Time Training Fast-Weight MLP as Streaming Memory with Dual (Reconstruction + Long-Span Prediction) Objective; Cosine Similarity Token Discarding; Prompt-Dependent Modality-Aware KV-Cache Chunk Reading |
## Real-time Inference

| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.03 | Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models | [pdf] | [GitHub] | Parallel Dual KV-Cache with Merge-Generate-Split Loop for Concurrent Encoding-Decoding; Decoupled Cross-Modal RoPE; Near-Zero Time-to-First-Token |
| 2026.03 | Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in MLLMs | [pdf] | [GitHub] | Threaded Parallel Watch-Think Pipeline with Async Segment Prefetch & Adaptive Attention Backend |
| 2026.01 | Speak While Watching: Unleashing TRUE Real-Time Video Understanding Capability of Multimodal Large Language Models | [pdf] | [GitHub] | Decoupled Positional Encoding (Overlapped / Group-Decoupled / Gap-Isolated) for Parallel Perception-Generation Streaming |
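The parallel-decoding entries above overlap perception with generation instead of alternating them. A minimal threaded producer-consumer sketch of that watch/think overlap (this is only the concurrency skeleton, not any paper's actual pipeline):

```python
import queue
import threading

def watch_think_pipeline(frames):
    """Threaded watch-think pipeline: a 'watch' thread keeps ingesting frames
    into a queue while a 'think' consumer processes them, so perception never
    blocks on generation (sketch; the per-frame 'thinking' is a stub)."""
    q, processed = queue.Queue(), []

    def watch():
        for f in frames:
            q.put(f)
        q.put(None)  # end-of-stream sentinel

    def think():
        while (f := q.get()) is not None:
            processed.append(f"seen:{f}")  # stand-in for reasoning on frame f

    producer = threading.Thread(target=watch)
    consumer = threading.Thread(target=think)
    producer.start(); consumer.start()
    producer.join(); consumer.join()
    return processed
```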
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.03 | STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding | [pdf] | [GitHub] | Two-stage Activation-to-Generation Pipeline; Event-gated Downstream Video-LLM Invocation |
| 2026.03 | Color When It Counts: Grayscale-Guided Online Triggering for Always-On Streaming Video Sensing | [pdf] | [GitHub] | Windowed Grayscale Affinity Analysis with Quadratic Programming; Credit-Budgeted RGB Activation; Dynamic Token Router with Asymmetric Grayscale and RGB Token Capacity |
| 2026.03 | StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition | [pdf] | [GitHub] | SSM-Based Single-Token Perception with Event-Gated Sparse LLM Invocation |
| 2026.01 | Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams | [pdf] | - | Boundary-aware Event Pooling; Event-triggered Sparse Decoding; Hysteresis Pacing Control |
| 2025.09 | Open-ended Hierarchical Streaming Video Understanding with Vision Language Models | [pdf] | - | Lightweight RNN Streaming Module with Event-Gated Sparse Frozen VLM Invocation |
| 2025.03 | LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant | [pdf] | [GitHub] | Routing-Based Response Determination via Fast-Slow Dual-Path; Token Aggregation & Dropping for High-FPS Routing |
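Event-gated designs above run a cheap per-frame gate and invoke the expensive VLM only when it fires, so per-frame cost stays low. A toy gate that also reports the resulting invocation rate (the threshold is illustrative):

```python
def event_gated_stream(scores, gate_threshold=0.7):
    """Event-gated sparse invocation: given lightweight per-frame gate scores,
    return the frame indices where the heavy VLM would be invoked, plus the
    fraction of frames that triggered it (sketch of the pattern above)."""
    invocations = [t for t, s in enumerate(scores) if s >= gate_threshold]
    return invocations, len(invocations) / max(len(scores), 1)
```

The invocation rate is the quantity these systems try to minimize while keeping response quality, e.g. via hysteresis (the Event-VStream entry) so the gate does not flap around the threshold.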
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.04 | CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference | [pdf] | - | Single-pass Compressed-bitstream Ingestion; Motion Vector-guided Patch Pruning; I-frame Anchor KV-Cache Refresh with RoPE-based Position Correction |
| 2026.04 | A Simple Baseline for Streaming Video Understanding | [pdf] | [GitHub] | Fixed Recent-frame Window; Bounded-memory Low-latency Inference |
| 2026.03 | Thinking in Streaming Video | [pdf] | [GitHub] | Eager Prefill + CUDA Graph Decode-and-Prune |
| 2026.03 | FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding | [pdf] | [GitHub] | Visual Token Compression via TAS & SDC with Otsu-Based Adaptive Thresholding for Latency & Memory Reduction |
| 2026.02 | QueryStream: Advancing Streaming Video Understanding with Query-Aware Pruning and Proactive Response | [pdf] | - | Query-Aware Differential Pruning (QDP) & Relevance-Triggered Active Response (RTAR) Scheduling |
| 2025.12 | StreamingAssistant: Efficient Visual Token Pruning for Accelerating Online Video Understanding | [pdf] | - | Checkerboard-Masked Parallel Spatial Pruning with Adjacency-Constrained Redundancy; Query-Agnostic Continuous Pre-Pruning Pipeline |
| 2025.12 | Accelerating Streaming Video Large Language Models via Hierarchical Token Compression | [pdf] | [GitHub] | Hierarchical Token Compression: ViT Cache-Aware Selective Computation & Dual-Anchor Novelty Pruning |
| 2025.10 | Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs | [pdf] | - | Attention-Based Visual Token Compression & Caption-Only Question Answering |
| 2025.03 | VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers | [pdf] | [GitHub] | Single Semantic Carrier per Frame; Prefill-Decode Decoupling; Visual Tokens Discarded after Prefill |
| 2024.08 | VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation | [pdf] | - | Vision Token Computation Skipping with LayerExpert |
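On the extreme end of token compression, the VideoScan-style semantic carrier collapses each frame's visual tokens into a single vector. A sketch using plain average pooling (the actual method also reuses KV aggregation, which is omitted here):

```python
def semantic_carrier(frame_tokens):
    """Average-pool a frame's visual token embeddings into one carrier token,
    reducing per-frame cost from N tokens to 1 (sketch; toy vectors)."""
    n = len(frame_tokens)
    dim = len(frame_tokens[0])
    return [sum(tok[d] for tok in frame_tokens) / n for d in range(dim)]
```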
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.04 | AURA: Always-On Understanding and Real-Time Assistance via Video Streams | [pdf] | [GitHub] | Floating Video/QA Sliding Windows with Batched N' Chunk Truncation for Prefix KV-Cache Reuse and Lower TTFT |
| 2026.01 | HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding | [pdf] | [GitHub] | KV-Cache Reuse for Instant Query Response; Hierarchical Token Compression within Fixed Cache Budget |
| 2025.12 | V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval | [pdf] | - | Hardware-Software Co-Design with Dynamic KV Cache Retrieval Engine Accelerator; Pipelined KV Prediction-Retrieval Overlapped with LLM Computation |
| 2025.11 | StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression | [pdf] | [GitHub] | Sliding-Window Segment Encoding with Immediate Post-Encoding KV Compression |
| 2025.10 | StreamingVLM: Real-Time Understanding for Infinite Video Streams | [pdf] | [GitHub] | Bounded KV-Cache Eviction for Constant Memory; No Redundant Recomputation across Windows |
| 2025.10 | StreamingTOM: Streaming Token Compression for Efficient Video Understanding | [pdf] | [GitHub] | Pre-LLM Causal Token Budget Cap for Prefill Acceleration; Post-LLM Quantized Memory with Selective Dequantization |
| 2025.05 | LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval | [pdf] | - | Pre-Generated Video KV Cache; Mean-Pooled Query-Key Chunk Retrieval with FIFO KV Chunk Management |
| 2025.03 | Streaming Video Question-Answering with In-context Video KV-Cache Retrieval | [pdf] | [GitHub] | Multi-Process Parallel Encoding-Answering; Sliding-Window Attention for Stable-Latency Incremental Processing |
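Bounded KV-cache designs above keep memory constant on infinite streams by retaining a few early positions plus a recent window and evicting the middle (an attention-sink-style policy popularized by streaming-LLM work). A sketch over token positions (sink and window sizes are illustrative):

```python
def bounded_kv_evict(positions, n_sink=2, n_recent=4):
    """Bounded KV-cache eviction: keep the first n_sink 'sink' entries plus
    the n_recent most recent ones, evicting everything in between, so cache
    size stays constant as the stream grows (sketch)."""
    if len(positions) <= n_sink + n_recent:
        return positions
    return positions[:n_sink] + positions[-n_recent:]
```

Real systems must also re-index positions after eviction (e.g. the RoPE position correction and re-indexing several entries above mention), which this sketch omits.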
## Streaming with Thinking

| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.03 | Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models | [pdf] | [GitHub] | Causal Streaming Attention Masking; Decoupled Cross-Modal Positional Encoding; Parallel Dual KV-Cache Mechanism |
| 2026.03 | Thinking in Streaming Video | [pdf] | [GitHub] | Streaming Watch-Think-Speak Paradigm; Reasoning-Compressed Streaming Memory (RCSM); Streaming RLVR (Format + Time + Accuracy Reward) |
| 2026.03 | Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously | [pdf] | [GitHub] | The Video Streaming Thinking Paradigm; VKG-Based Streaming CoT Data Synthesis; Two-Stage VST-SFT & VST-RL Training |
| 2026.03 | Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in MLLMs | [pdf] | [GitHub] | Segment-Level Streaming Causal Mask & Positional Encoding; Three-Stage CoT Training; Concurrent Watch-Think Pipeline |
| 2025.10 | StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA | [pdf] | [GitHub] | Streaming VideoQA and Multimodal CoT Tasks |
## Benchmarks

| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.03 | PEARL: Personalized Streaming Video Understanding Model | [pdf] | [GitHub] | Frame-Level Personalization & Video-Level Personalization; Concept-Definition QA & Real-Time QA & Past-Time QA |
| 2026.03 | StreamingEval: A Unified Evaluation Protocol towards Realistic Streaming Video Understanding | [pdf] | [GitHub] | Streaming Evaluation Protocol |
| 2026.03 | HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household | [pdf] | [GitHub] | Unsafe Action Detection; Early Warning Timing; Severity Assessment |
| 2026.02 | Artic: AI-oriented Real-time Communication for MLLM Video Assistant | [pdf] | [GitHub] | Degradation-sensitive QA |
| 2026.01 | OnlineSI: Taming Large Language Model for Online 3D Understanding and Grounding | [pdf] | [GitHub] | Online 3D Object Detection |
| 2026.01 | PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios | [pdf] | [GitHub] | Mobile-Centric Scenarios; Perception & Interaction & Planning |
| 2025.12 | StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios | [pdf] | [GitHub] | Embodied Scenarios |
| 2025.12 | StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos | [pdf] | [GitHub] | Gaze-Guided Streaming Data; Past & Present & Proactive |
| 2025.10 | Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance? | [pdf] | [GitHub] | Qualcomm Interactive Cooking |
| 2025.10 | Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video | [pdf] | [GitHub] | Explicit Proactive; Implicit Proactive; Contextual Proactive |
| 2025.07 | OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding | [pdf] | [GitHub] | Agent State; Agent Visible Info; Agent-Object Spatial Relationship |
| 2025.07 | ProactiveVideoQA: A Comprehensive Benchmark Evaluating Proactive Interactions in Video Large Language Models | [pdf] | [GitHub] | Proactive Web-Video QA; Proactive Ego-Centric Video QA; Proactive TV-Series Video QA; Proactive Video Anomaly Detection |
| 2025.07 | Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI | [pdf] | [GitHub] | DeViBench |
| 2025.05 | RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video | [pdf] | [GitHub] | Perception; Understanding; Reasoning |
| 2025.04 | LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale | [pdf] | [GitHub] | LiveSports-3K-CC/QA |
| 2025.03 | OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts | [pdf] | [GitHub] | Streaming Video Understanding; Proactive Reasoning |
| 2025.02 | SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding | [pdf] | [GitHub] | Multi-Turn Dialogues |
| 2025.01 | Online Video Understanding: OVBench and VideoChat-Online | [pdf] | [GitHub] | Past; Current; Future |
| 2025.01 | OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? | [pdf] | [GitHub] | Backward Tracing; Real-Time Understanding; Forward Active Responding |
| 2025.01 | Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge | [pdf] | [GitHub] | Object Search; Long-Term Memory Search; Short-Term Memory Search; Conversational Interaction; Knowledge-Based Question Answering; Simple Factual |
| 2024.11 | StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding | [pdf] | [GitHub] | Real-Time Visual Understanding; Omni-Source Understanding; Contextual Understanding |
| 2024.07 | What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction | [pdf] | [GitHub] | Fitness Activity Recognition and Coaching |
| 2024.03 | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | [pdf] | [GitHub] | Movies and TV |
## Training Datasets

| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.03 | WAT: Online Video Understanding Needs Watching Before Thinking | [pdf] | - | Real-Time Perception; Backward Tracing; Forecasting |
| 2026.03 | Thinking in Streaming Video | [pdf] | [GitHub] | Video Segmentation and Dense Captioning; Diverse Instruction Synthesis; Time-Grounded CoT Generation |
| 2026.01 | Learning to Respond: A Large-Scale Benchmark and Progressive Learning Framework for Trigger-Centric Online Video Understanding | [pdf] | - | TV-Online |
| 2026.01 | ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding | [pdf] | [GitHub] | Online Proactive; Online Narration; Reactive QA |
| 2025.12 | Streaming Video Instruction Tuning | [pdf] | [GitHub] | Real-Time Narration; Event Caption; Action Caption; Event Grounding; Time-Sensitive QA |
| 2025.12 | MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning | [pdf] | [GitHub] | Scene Segmentation and Captioning; QA Generation; Proactive Dialogue Construction |
| 2025.12 | CogStream: Context-guided Streaming Video Question Answering | [pdf] | [GitHub] | Semi-Automatic QA Pipeline; 1,088 Videos with 59K Hierarchical QA Pairs (Basic, Streaming, Global) |
| 2025.11 | LiveStar: Live Streaming Assistant for Real-World Online Video Understanding | [pdf] | [GitHub] | Real-Time Narration Generation; Online Temporal Grounding; Frame-Level Dense QA; Contextual Online QA; Multi-Turn Interactive QA |
| 2025.10 | StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA | [pdf] | [GitHub] | Hierarchical Video Dense Captioning; Dynamic Question-Answer Pairs Construction; Multimodal Chain-of-Thought Generation |
| 2025.10 | StreamingVLM: Real-Time Understanding for Infinite Video Streams | [pdf] | [GitHub] | Sports; Narration |
| 2025.10 | Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video | [pdf] | [GitHub] | Explicit Proactive Tasks; Implicit Proactive Tasks; Contextual Proactive Tasks |
| 2025.06 | Proactive Assistant Dialogue Generation from Streaming Egocentric Videos | [pdf] | [GitHub] | Multi-Round Dialogue |
| 2025.04 | LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale | [pdf] | [GitHub] | ASR Transcripts |
| 2025.03 | AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis | [pdf] | - | Anomaly Prediction; Anomaly Detection; Anomaly Analysis |
| 2025.02 | EgoSpeak: Learning When to Speak for Egocentric Conversational Agents in the Wild | [pdf] | [GitHub] | Prediction vs Detection; Frame-Level Speech Labeling |
| 2024.11 | VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format | [pdf] | [GitHub] | Multi-Answer Video Grounded QA; Dense Captioning; Temporal Video Grounding |
| 2024.07 | What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction | [pdf] | [GitHub] | Fitness Activity Recognition and Coaching |
| 2024.06 | VideoLLM-online: Online Video Large Language Model for Streaming Video | [pdf] | [GitHub] | Narration Stream |
## Survey

| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2024.01 | A Survey on Generative AI and LLM for Video Generation, Understanding, and Streaming | [pdf] | - | - |
## Resources

- JeffShine/Awesome-Streaming-Video-Understanding
- sotayang/Awesome-Streaming-Video-Understanding
- LJungang/Awesome-Video-Reasoning-Landscape
We're hiring multimodal research scientists and interns at JD Explore Academy! If you have top-tier publications and are passionate about video understanding and VLMs, please send your resume to: siqingyi.phoebus@jd.com. We'd love to hear from you!