ydyhello/Awesome-VLM-Streaming-Video

Awesome-VLM-Streaming-Video 🎬

📒 Introduction

📚 A curated collection of papers and open-source code repositories dedicated to the application of Vision-Language Models (VLMs) for streaming video.

📖 Contents

🛠️ Project

Date Title Paper Code Comment
2025.11 Live VLM WebUI [docs] [GitHub] Real-Time Vision Language Model Interaction with Webcam Streaming

📋 Technical Report

Date Title Paper Code Comment
2026.02 Seed2.0 [pdf] [GitHub] Doubao Video Calling; OVBench & LiveSports-3K & OVO-Bench & ODVBench & ViSpeak
2025.12 Seed1.8 [pdf] [GitHub] Doubao Video Calling; OVBench & LiveSports-3K & OVO-Bench & ViSpeak & StreamingBench & OmniMMI

💬 Proactive Interaction

Auxiliary Response Head

Date Title Paper Code Comment
2026.03 Proact-VL: A Proactive VideoLLM for Real-Time AI Companions [pdf] [GitHub] <|FLAG|> Token Response Head with Transition-Smoothed Classification & Stability Regularization
2026.03 StreamReady: Learning What to Answer and When in Long Streaming Videos [pdf] - Learnable <RDY> Token with Readiness Head for Evidence-Gated Response Triggering
2025.05 StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant [pdf] [GitHub] Activation Model via <ACT> Token with Binary Score Head
2025.03 ViSpeak: Visual Instruction Feedback in Streaming Videos [pdf] [GitHub] Informative Head for <seg> Token
2025.03 StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition [pdf] [GitHub] Cognition Gate Network (Shallow Layer Transfer from LLM) for Binary </response> / </silence> Classification
2025.01 Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction [pdf] [GitHub] Binary Classification Head on <TODO> Token Embedding (BCE Loss); <ANS> for History Marking; <SILENT> for Reaction-Stage Filtering
2024.11 VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format [pdf] [GitHub] Informative Head & Relevance Head

Generative Token-based Trigger

Date Title Paper Code Comment
2026.04 AURA: Always-On Understanding and Real-Time Assistance via Video Streams [pdf] [GitHub] Unified <|silent|> Token for Silent Observation with Real-Time QA / Proactive QA / Multi-Response QA; Silent-Speech Balanced Loss
2026.03 STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding [pdf] [GitHub] Activation Token Denoising over 0/1/[M]; Sequence Duplication; Selective Re-masking
2025.12 Streaming Video Instruction Tuning [pdf] [GitHub] Three-State Response Tokens (</Silence>, </Standby>, </Response>) via Unified Next-Token Prediction; Focal Loss for Response Imbalance
2025.06 Proactive Assistant Dialogue Generation from Streaming Egocentric Videos [pdf] [GitHub] Frame-Level [EOS] Silence Prediction with Negative Frame Sub-Sampling
2025.03 LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant [pdf] [GitHub] Streaming EOS Prediction via Fast-Slow Dual-Path; Token Aggregation & Dropping Router for Visual Feature Compression
2025.03 AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis [pdf] - EOS Prediction with Task-Adaptive Threshold for Anomaly Alerting
2024.07 What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction [pdf] [GitHub] Action Tokens <next> and <feedback>
2024.06 VideoLLM-online: Online Video Large Language Model for Streaming Video [pdf] [GitHub] Streaming EOS Token Prediction
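
Several entries in this table decide whether to speak by looking at the probability the model assigns to an end-of-sequence or silence token at the current frame. The following is a minimal sketch of that decision rule, not the implementation of any listed paper. The function names, the `"<EOS>"` token string, and the default threshold of 0.5 are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Convert a {token: logit} mapping into a {token: probability} mapping."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def should_respond(next_token_logits, silence_token="<EOS>", threshold=0.5):
    """Return True when the silence-token probability falls below `threshold`.

    `silence_token` and `threshold` are hypothetical defaults; real systems
    pick their own token and tune the threshold per task.
    """
    probs = softmax(next_token_logits)
    return probs.get(silence_token, 0.0) < threshold


# Usage: a frame where the model strongly prefers silence, then one where it does not.
quiet_frame = {"<EOS>": 5.0, "The": 0.0}
busy_frame = {"<EOS>": 0.0, "The": 2.0}
print(should_respond(quiet_frame), should_respond(busy_frame))
```

Lowering the threshold makes the assistant more talkative, which is why some entries describe the threshold as task-adaptive.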

RL-optimized Proactive Response

Date Title Paper Code Comment
2026.03 Thinking in Streaming Video [pdf] [GitHub] Watch-Think-Speak with <silent> & <response> Action Tokens; Streaming RLVR (Format + Time + Accuracy Reward)
2026.01 Learning to Respond: A Large-Scale Benchmark and Progressive Learning Framework for Trigger-Centric Online Video Understanding [pdf] - Trigger-Centric Online Video Understanding
2025.12 MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning [pdf] [GitHub] Text-to-Text "NO REPLY" Response Decision with Multi-Objective Reward (PAUC + Replication + In-Span + Prefix) GRPO

Training-free / Feature-based Trigger

Date Title Paper Code Comment
2026.03 FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding [pdf] [GitHub] Scene-Change Ratio Trigger Reusing Temporal Adjacency Selection Statistics
2026.01 QueryStream: Advancing Streaming Video Understanding with Query-Aware Pruning and Proactive Response [pdf] - Relevance-Triggered Active Response
2026.01 Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams [pdf] - Motion-Semantic-Prediction Boundary Score; Adaptive Threshold; Event-triggered Decoding
2025.11 LiveStar: Live Streaming Assistant for Real-World Online Video Understanding [pdf] [GitHub] Streaming Verification Decoding Perplexity Gate for Response & Silence
2025.08 StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding [pdf] - Reactive/Proactive/Speculative Planning; Heuristic Trigger; Tool-Guided Information Hunting
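
Training-free triggers of the kind listed above typically compare features of neighbouring frames and fire when they diverge. The sketch below shows one simple variant under stated assumptions: each frame is already reduced to an embedding vector, and a response is triggered when cosine similarity to the previous frame drops below a fixed threshold. The helper names and the 0.8 threshold are hypothetical, and the listed methods use richer scores (motion, perplexity, adaptive thresholds).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def scene_change_triggers(embeddings, threshold=0.8):
    """Return indices of frames whose similarity to the previous frame is below `threshold`."""
    triggers = []
    for i in range(1, len(embeddings)):
        if cosine_similarity(embeddings[i - 1], embeddings[i]) < threshold:
            triggers.append(i)
    return triggers


# Usage: four toy 2-D frame embeddings with one abrupt change at index 2.
frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
print(scene_change_triggers(frames))
```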

Hybrid Trigger Framework

Date Title Paper Code Comment
2026.03 Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding [pdf] [GitHub] Query-Time Instruction-Guided Visual Proposal Generation (SFT+GRPO); Lightweight Embedding Cosine Similarity Surge Triggering
2026.03 StreamingClaw Technical Report [pdf] - Training-Free (Reminder Node); Training-Based (Scenario-Specific Trigger Tokens)

Learned Response Timing

Date Title Paper Code Comment
2025.10 Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video [pdf] [GitHub] Weighted Interval Supervision & Uncertainty-Guided High-Resolution Requests
2025.02 EgoSpeak: Learning When to Speak for Egocentric Conversational Agents in the Wild [pdf] [GitHub] Standalone Audio-Visual Frame-Level Three-Way Classifier (Background / Self / Other) with Anticipatory Prediction

🧠 Long-term Memory Management

Hierarchical Multi-level Memory

Date Title Paper Code Comment
2026.03 CurveStream: Boosting Streaming Video Understanding in MLLMs via Curvature-Aware Hierarchical Visual Memory Management [pdf] - Curvature-Aware Scorer (Motion Variation + Geometric Curvature); EMA-Based K-Sigma Dynamic Thresholding; Hierarchical Clear/Blurred/Discard Memory with FIFO Eviction
2026.03 StreamingClaw Technical Report [pdf] - Hierarchical Memory Evolution
2026.03 StreamReady: Learning What to Answer and When in Long Streaming Videos [pdf] - 3-Level Visual Memory Tree (FIFO-Centroid-Prototype) & Contextual Memory Bank
2026.03 WAT: Online Video Understanding Needs Watching Before Thinking [pdf] - Dual-Level Memory: Short-Term FIFO Queue + Long-Term Memory with Redundancy-Aware Eviction Policy
2026.03 Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism [pdf] [GitHub] Dual-Pathway Compression for Context Memory and Local Memory; Visual KV-Cache Memory Bank; Cross-attention Memory Recall & MemIndex
2026.03 FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding [pdf] [GitHub] Three-Level Hierarchical Memory (Short/Mid/Long-Term); Temporal Adjacency Selection & Spatial Domain Consolidation
2026.02 FreshMem: Brain-Inspired Frequency-Space Hybrid Memory for Streaming Video Understanding [pdf] - Short-term Sliding Window + Multi-scale Frequency Memory + Space Thumbnail Memory
2026.01 HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding [pdf] [GitHub] Hierarchical KV-Cache (Sensory-Working-Long-Term Memory); Exponential Forgetting Curve & Frame-Level Anchor Tokens; Cross-Layer Memory Smoothing & Position Re-Indexing
2025.12 Venus: An Efficient Edge Memory-and-Retrieval System for VLM-based Online Video Understanding [pdf] - Hierarchical Raw Data Layer & Semantic Index Layer; Scene Segmentation and Incremental Clustering for Sparse Memory Construction
2025.08 StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding [pdf] - Streaming KV-Cache; CPU Long-Term & GPU Short-Term Hierarchical Memory; Layer-Adaptive KV Recall
2025.07 StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling [pdf] [GitHub] Sliding-Window KV-Cache with Selective Offloading; 3D Voxel-Based Spatial Token Pruning for Cross-Frame Redundancy Removal
2025.01 Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge [pdf] [GitHub] Hierarchical Memory: Ebbinghaus Forgetting Curve Short-Term + Tree-Structured Clustering Long-Term + FAISS Dialogue Memory; Forgetting Probability-Based Frame Sampling
2024.09 VideoLLaMB: Long Streaming Video Understanding with Recurrent Memory Bridges [pdf] [GitHub] Recurrent Memory Bridge with Self-Attention Recursive Update & Retrieval Attention; SceneTiling Semantic Segmentation; Linear Memory Scaling

Sliding-window / Eviction

Date Title Paper Code Comment
2026.04 AURA: Always-On Understanding and Real-Time Assistance via Video Streams [pdf] [GitHub] Dual Sliding-Window Context over Recent Video and QA History; Out-of-Window Video Chunks and <|silent|> Tokens Discarded
2026.04 A Simple Baseline for Streaming Video Understanding [pdf] [GitHub] Fixed Recent-frame Sliding Window; Old-frame Eviction without External Memory
2026.03 Proact-VL: A Proactive VideoLLM for Real-Time AI Companions [pdf] [GitHub] Dual-Cache (System & Streaming) Sliding Window with Reverse-RoPE Eviction for Infinite Streaming
2026.03 STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding [pdf] [GitHub] Bounded Visual Cache Eviction; Sliding-window Frame Retention
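
The common idea behind the sliding-window entries above is a bounded buffer that keeps only the most recent frames and discards the oldest one when full. A minimal sketch of that behaviour follows. The `SlidingWindowCache` class and its methods are hypothetical names for illustration, and the listed systems apply the same policy to KV-cache entries or visual tokens rather than to plain Python lists.

```python
from collections import deque


class SlidingWindowCache:
    """Keep the most recent `max_frames` entries, evicting the oldest first (FIFO)."""

    def __init__(self, max_frames):
        if max_frames <= 0:
            raise ValueError("max_frames must be positive")
        self._frames = deque(maxlen=max_frames)

    def add(self, frame_id, tokens):
        """Store a frame and return the id of the evicted frame, or None."""
        evicted = None
        if len(self._frames) == self._frames.maxlen:
            evicted = self._frames[0][0]  # oldest entry is dropped by the append below
        self._frames.append((frame_id, tokens))
        return evicted

    def frame_ids(self):
        """Ids currently in the window, oldest to newest."""
        return [frame_id for frame_id, _ in self._frames]


# Usage: a window of three frames fed five frames.
cache = SlidingWindowCache(max_frames=3)
evictions = [cache.add(i, [f"tok{i}"]) for i in range(5)]
print(evictions, cache.frame_ids())
```

Memory stays constant regardless of stream length, at the cost of losing anything older than the window, which is the gap the other memory categories in this section try to close.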

Token Compression / Pruning

Date Title Paper Code Comment
2025.10 Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs [pdf] - Attention-Score-Based Visual Token Selection with Recurrent FIFO Token Queue; Maximal Marginal Relevance Caption Retrieval
2025.05 StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant [pdf] [GitHub] Producer-Consumer Memory Buffer with Conditional Round-Decayed Token Compression (Prioritizing Recent Frames)
2025.04 TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos [pdf] [GitHub] Differential Token Drop (Primary); FIFO Slimmed Token Memory Bank (Supplementary)
2025.03 VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers [pdf] [GitHub] Single Semantic Carrier Token per Frame (Avg-Pool Embedding + Reused KV Aggregation); Cosine Similarity-Based Dynamic Memory Bank Eviction

KV-Cache Compression / Retrieval / Reuse

Date Title Paper Code Comment
2026.02 Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory [pdf] [GitHub] Adaptive Key Selection for Sparse Sliding-Window Encoding; Training-Free Retrieval MoE via Reciprocal Rank Fusion of Internal & External Signals
2025.12 V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval [pdf] - Hash-Bit Hamming Clustering & Weighted Cumulative Sum Early-Exit Thresholding for Dynamic KV-Cache Retrieval; Hierarchical Memory Offloading
2025.11 StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression [pdf] [GitHub] Cosine Similarity-Based Dynamic Semantic Segmentation; Summary-Vector Representative Key Retrieval; Guidance-Prompt-Driven KV Compression with Layer-Adaptive Budget Allocation
2025.11 LiveStar: Live Streaming Assistant for Real-World Online Video Understanding [pdf] [GitHub] Peak-End Memory Compression & Dual-Level Streaming KV Cache
2025.10 StreamingTOM: Streaming Token Compression for Efficient Video Understanding [pdf] [GitHub] Causal Temporal Token Reduction (Static-Dynamic DPC & Attention Selection) + Online 4-bit Quantized KV Memory with Representative-Key Retrieval
2025.08 StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding [pdf] - Chat Template Proxy Query for Query-Agnostic KV-Cache Pruning & Weighted Merging
2025.06 InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding [pdf] [GitHub] KV-Cache Compression with Temporal-Axis Redundancy Pruning & Value-Norm Ranking with Layer-Adaptive Pooling
2025.05 LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval [pdf] - Streaming-Oriented KV Cache with Video-Specific KV Compression; Frame-Wise KV Merging & FIFO KV Chunk Memory

Retrieval-augmented Memory

Date Title Paper Code Comment
2026.03 PEARL: Personalized Streaming Video Understanding Model [pdf] [GitHub] Dual-Grained Memory (Streaming Memory + Concept Memory); Concept-Aware Retrieval with Query Rewriting for Personalized Concept Grounding
2026.02 Vista: Scene-Aware Optimization for Streaming Video Question Answering under Post-Hoc Queries [pdf] - Scene-aware Segmentation; Temporal-Spatial Scene Token Compression; CPU Offloaded Full Frames with Query-conditioned Top-k Recall
2026.02 WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs [pdf] [GitHub] Uncertainty-Gated Hierarchical Retrieval with Entropy-Threshold Triggered Past Context Access
2025.12 CogStream: Context-guided Streaming Video Question Answering [pdf] [GitHub] Temporal-Semantic Clustering with Question-Aware Event Compression; Historical Dialogue Retrieval via LLM Selection
2025.11 CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding [pdf] - Dynamic Token Dropping; GRU-Based Compressive Memory; KV Offloading and Rehydration; Consensus-First Retrieval
2025.06 Flash-VStream: Efficient Real-Time Understanding for Long Video Streams [pdf] [GitHub] K-Means Clustered Context Synopsis Memory & Feature-Centric Detail Augmentation Memory with Disk-Offloadable Feature Bank
2025.03 Streaming Video Question-Answering with In-context Video KV-Cache Retrieval [pdf] [GitHub] Sliding-Window Encoding with KV-Cache Offloading to RAM/Disk; Internal (Self-Attention Key Averaging) & External (CLIP Cosine Similarity) Frame-Level Retrieval
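
Retrieval-based memories like those above store past video in chunks and recall only the chunks most relevant to the current query. The sketch below illustrates one simple scoring scheme under stated assumptions: each chunk is summarized by the mean of its key vectors, and chunks are ranked by dot product with the query vector. `mean_vector` and `retrieve_top_k` are hypothetical helpers, not the API of any listed project.

```python
import heapq


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def retrieve_top_k(query, chunks, k=2):
    """Return ids of the k best-matching chunks, sorted by chunk id.

    `chunks` maps a chunk id to its list of key vectors. Sorting the result
    restores temporal order when ids increase with time.
    """
    def score(chunk_id):
        summary = mean_vector(chunks[chunk_id])
        return sum(q * s for q, s in zip(query, summary))

    best = heapq.nlargest(k, chunks, key=score)
    return sorted(best)


# Usage: three toy chunks with 2-D keys and a query aligned with the first axis.
chunks = {
    0: [[1.0, 0.0], [1.0, 0.0]],
    1: [[0.0, 1.0], [0.0, 1.0]],
    2: [[0.9, 0.1], [0.7, 0.3]],
}
print(retrieve_top_k([1.0, 0.0], chunks, k=2))
```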

Semantic/Textual Abstraction

Date Title Paper Code Comment
2026.03 Click-to-Ask: An AI Live Streaming Assistant with Offline Copywriting and Online Interactive QA [pdf] - Streaming Event Segmentation; Event-level Historical Caption Memory; Knowledge Extraction Accelerator
2026.03 Thinking in Streaming Video [pdf] [GitHub] Reasoning-Compressed Streaming Memory: Visual Sliding Window + Reasoning Tokens as Long-Term Semantic Anchors
2026.03 Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously [pdf] [GitHub] Short-Term Visual Sliding Window & Long-Term Textual Streaming-Thought Memory with FIFO Eviction; Recursive Temporal Segmentation
2026.02 Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge [pdf] - Convert Video into a Lightweight Textual Memory
2025.06 Proactive Assistant Dialogue Generation from Streaming Egocentric Videos [pdf] [GitHub] Iterative Progress Summarization as Summary-Based Memory Compression
2025.04 Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding [pdf] [GitHub] Multimodal Interleaved Cache: Online Verbalization of Visual-to-Text for Long-Term & Short-Term Visual Tokens

Event-centric Structured Memory

Date Title Paper Code Comment
2026.02 EventMemAgent: Hierarchical Event-Centric Memory for Online Video Understanding with Adaptive Tool Use [pdf] [GitHub] Event-Centric Dual-Layer Memory (STM with Online Event Segmentation & Reservoir Sampling + LTM with Structured Event Tuples)
2026.01 Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams [pdf] - Event-level Memory Bank; Merge-or-Append Event Consolidation
2025.12 VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs [pdf] [GitHub] Prediction-Guided Elastic-Scale Event Segmentation & Hierarchical Cross-Attention Event Consolidation
2025.09 StreamForest: Efficient Online Video Understanding with Persistent Event Memory [pdf] [GitHub] Event-Level Tree Hierarchy with Adaptive Merging via Similarity & Merge-Count & Temporal Penalty; Short-Term Spatiotemporal Sliding Window

Spatial / 3D Memory

Date Title Paper Code Comment
2026.01 OnlineSI: Taming Large Language Model for Online 3D Understanding and Grounding [pdf] [GitHub] Fixed-size Spatial Memory with Time-adaptive Sampling and Concatenation; Explicit Point Cloud and Semantic Memory

Parametric / Fast-weight Memory

Date Title Paper Code Comment
2025.10 video-SALMONN S: Streaming Audio-Visual LLMs Beyond Length Limits via Memory [pdf] [GitHub] Test-Time Training Fast-Weight MLP as Streaming Memory with Dual (Reconstruction + Long-Span Prediction) Objective; Cosine Similarity Token Discarding; Prompt-Dependent Modality-Aware KV-Cache Chunk Reading

⚡ Real-time Inference

Encoding-Decoding Parallelism

Date Title Paper Code Comment
2026.03 Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models [pdf] [GitHub] Parallel Dual KV-Cache with Merge-Generate-Split Loop for Concurrent Encoding-Decoding; Decoupled Cross-Modal RoPE; Near-Zero Time-to-First-Token
2026.03 Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in MLLMs [pdf] [GitHub] Threaded Parallel Watch-Think Pipeline with Async Segment Prefetch & Adaptive Attention Backend
2026.01 Speak While Watching: Unleashing TRUE Real-Time Video Understanding Capability of Multimodal Large Language Models [pdf] [GitHub] Decoupled Positional Encoding (Overlapped / Group-Decoupled / Gap-Isolated) for Parallel Perception-Generation Streaming

Selective Model Invocation

Date Title Paper Code Comment
2026.03 STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding [pdf] [GitHub] Two-stage Activation-to-Generation Pipeline; Event-gated Downstream Video-LLM Invocation
2026.03 Color When It Counts: Grayscale-Guided Online Triggering for Always-On Streaming Video Sensing [pdf] [GitHub] Windowed Grayscale Affinity Analysis with Quadratic Programming; Credit-Budgeted RGB Activation; Dynamic Token Router with Asymmetric Grayscale and RGB Token Capacity
2026.01 Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams [pdf] - Boundary-aware Event Pooling; Event-triggered Sparse Decoding; Hysteresis Pacing Control
2025.09 Open-ended Hierarchical Streaming Video Understanding with Vision Language Models [pdf] - Lightweight RNN Streaming Module with Event-Gated Sparse Frozen VLM Invocation
2025.03 StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition [pdf] [GitHub] SSM-Based Single-Token Perception with Event-Gated Sparse LLM Invocation
2025.03 LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant [pdf] [GitHub] Routing-Based Response Determination via Fast-Slow Dual-Path; Token Aggregation & Dropping for High-FPS Routing

Visual Token Reduction

Date Title Paper Code Comment
2026.04 CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference [pdf] - Single-pass Compressed-bitstream Ingestion; Motion Vector-guided Patch Pruning; I-frame Anchor KV-Cache Refresh with RoPE-based Position Correction
2026.04 A Simple Baseline for Streaming Video Understanding [pdf] [GitHub] Fixed Recent-frame Window; Bounded-memory Low-latency Inference
2026.03 Thinking in Streaming Video [pdf] [GitHub] Eager Prefill + CUDA Graph Decode-and-Prune
2026.03 FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding [pdf] [GitHub] Visual Token Compression via TAS & SDC with Otsu-Based Adaptive Thresholding for Latency & Memory Reduction
2026.02 QueryStream: Advancing Streaming Video Understanding with Query-Aware Pruning and Proactive Response [pdf] - Query-Aware Differential Pruning (QDP) & Relevance-Triggered Active Response (RTAR) Scheduling
2025.12 StreamingAssistant: Efficient Visual Token Pruning for Accelerating Online Video Understanding [pdf] - Checkerboard-Masked Parallel Spatial Pruning with Adjacency-Constrained Redundancy; Query-Agnostic Continuous Pre-Pruning Pipeline
2025.12 Accelerating Streaming Video Large Language Models via Hierarchical Token Compression [pdf] [GitHub] Hierarchical Token Compression: ViT Cache-Aware Selective Computation & Dual-Anchor Novelty Pruning
2025.10 Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs [pdf] - Attention-Based Visual Token Compression & Caption-Only Question Answering
2025.03 VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers [pdf] [GitHub] Single Semantic Carrier per Frame; Prefill-Decode Decoupling; Visual Tokens Discarded after Prefill
2024.08 VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [pdf] - Vision Token Computation Skipping with LayerExpert

KV-Cache Optimization

Date Title Paper Code Comment
2026.04 AURA: Always-On Understanding and Real-Time Assistance via Video Streams [pdf] [GitHub] Floating Video/QA Sliding Windows with Batched N' Chunk Truncation for Prefix KV-Cache Reuse and Lower TTFT
2026.01 HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding [pdf] [GitHub] KV-Cache Reuse for Instant Query Response; Hierarchical Token Compression within Fixed Cache Budget
2025.12 V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval [pdf] - Hardware-Software Co-Design with Dynamic KV Cache Retrieval Engine Accelerator; Pipelined KV Prediction-Retrieval Overlapped with LLM Computation
2025.11 StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression [pdf] [GitHub] Sliding-Window Segment Encoding with Immediate Post-Encoding KV Compression
2025.10 StreamingVLM: Real-Time Understanding for Infinite Video Streams [pdf] [GitHub] Bounded KV-Cache Eviction for Constant Memory; No Redundant Recomputation across Windows
2025.10 StreamingTOM: Streaming Token Compression for Efficient Video Understanding [pdf] [GitHub] Pre-LLM Causal Token Budget Cap for Prefill Acceleration; Post-LLM Quantized Memory with Selective Dequantization
2025.05 LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval [pdf] - Pre-Generated Video KV Cache; Mean-Pooled Query-Key Chunk Retrieval with FIFO KV Chunk Management
2025.03 Streaming Video Question-Answering with In-context Video KV-Cache Retrieval [pdf] [GitHub] Multi-Process Parallel Encoding-Answering; Sliding-Window Attention for Stable-Latency Incremental Processing

💭 Streaming with Thinking

Date Title Paper Code Comment
2026.03 Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models [pdf] [GitHub] Causal Streaming Attention Masking; Decoupled Cross-Modal Positional Encoding; Parallel Dual KV-Cache Mechanism
2026.03 Thinking in Streaming Video [pdf] [GitHub] Streaming Watch-Think-Speak Paradigm; Reasoning-Compressed Streaming Memory (RCSM); Streaming RLVR (Format + Time + Accuracy Reward)
2026.03 Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously [pdf] [GitHub] The Video Streaming Thinking Paradigm; VKG-Based Streaming CoT Data Synthesis; Two-Stage VST-SFT & VST-RL Training
2026.03 Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in MLLMs [pdf] [GitHub] Segment-Level Streaming Causal Mask & Positional Encoding; Three-Stage CoT Training; Concurrent Watch-Think Pipeline
2025.10 StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA [pdf] [GitHub] Streaming VideoQA and Multimodal CoT Tasks

📊 Benchmarks

Date Title Paper Code Comment
2026.03 PEARL: Personalized Streaming Video Understanding Model [pdf] [GitHub] Frame-Level Personalization & Video-Level Personalization; Concept-Definition QA & Real-Time QA & Past-Time QA
2026.03 StreamingEval: A Unified Evaluation Protocol towards Realistic Streaming Video Understanding [pdf] [GitHub] Streaming Evaluation Protocol
2026.03 HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household [pdf] [GitHub] Unsafe Action Detection; Early Warning Timing; Severity Assessment
2026.02 Artic: AI-oriented Real-time Communication for MLLM Video Assistant [pdf] [GitHub] Degradation-sensitive QA
2026.01 OnlineSI: Taming Large Language Model for Online 3D Understanding and Grounding [pdf] [GitHub] Online 3D Object Detection
2026.01 PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios [pdf] [GitHub] Mobile-Centric Scenarios; Perception & Interaction & Planning
2025.12 StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios [pdf] [GitHub] Embodied Scenarios
2025.12 StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos [pdf] [GitHub] Gaze-Guided Streaming Data; Past & Present & Proactive
2025.10 Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance? [pdf] [GitHub] Qualcomm Interactive Cooking
2025.10 Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video [pdf] [GitHub] Explicit Proactive; Implicit Proactive; Contextual Proactive
2025.07 OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding [pdf] [GitHub] Agent State; Agent Visible Info; Agent-Object Spatial Relationship
2025.07 ProactiveVideoQA: A Comprehensive Benchmark Evaluating Proactive Interactions in Video Large Language Models [pdf] [GitHub] Proactive Web-Video QA; Proactive Ego-Centric Video QA; Proactive TV-Series Video QA; Proactive Video Anomaly Detection
2025.07 Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI [pdf] [GitHub] DeViBench
2025.05 RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video [pdf] [GitHub] Perception; Understanding; Reasoning
2025.04 LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale [pdf] [GitHub] LiveSports-3K-CC/QA
2025.03 OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts [pdf] [GitHub] Streaming Video Understanding; Proactive Reasoning
2025.02 SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding [pdf] [GitHub] Multi-Turn Dialogues
2025.01 Online Video Understanding: OVBench and VideoChat-Online [pdf] [GitHub] Past; Current; Future
2025.01 OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? [pdf] [GitHub] Backward Tracing; Real-Time Understanding; Forward Active Responding
2025.01 Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge [pdf] [GitHub] Object Search; Long-Term Memory Search; Short-Term Memory Search; Conversational Interaction; Knowledge-Based Question Answering; Simple Factual
2024.11 StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding [pdf] [GitHub] Real-Time Visual Understanding; Omni-Source Understanding; Contextual Understanding
2024.07 What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction [pdf] [GitHub] Fitness Activity Recognition and Coaching
2024.03 MovieChat: From Dense Token to Sparse Memory for Long Video Understanding [pdf] [GitHub] Movies and TV

📦 Training Datasets

Date Title Paper Code Comment
2026.03 WAT: Online Video Understanding Needs Watching Before Thinking [pdf] - Real-Time Perception; Backward Tracing; Forecasting
2026.03 Thinking in Streaming Video [pdf] [GitHub] Video Segmentation and Dense Captioning; Diverse Instruction Synthesis; Time-Grounded CoT Generation
2026.01 Learning to Respond: A Large-Scale Benchmark and Progressive Learning Framework for Trigger-Centric Online Video Understanding [pdf] - TV-Online
2026.01 ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding [pdf] [GitHub] Online Proactive; Online Narration; Reactive QA
2025.12 Streaming Video Instruction Tuning [pdf] [GitHub] Real-Time Narration; Event Caption; Action Caption; Event Grounding; Time-Sensitive QA
2025.12 MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning [pdf] [GitHub] Scene Segmentation and Captioning; QA Generation; Proactive Dialogue Construction
2025.12 CogStream: Context-guided Streaming Video Question Answering [pdf] [GitHub] Semi-Automatic QA Pipeline; 1,088 Videos with 59K Hierarchical QA Pairs (Basic, Streaming, Global)
2025.11 LiveStar: Live Streaming Assistant for Real-World Online Video Understanding [pdf] [GitHub] Real-Time Narration Generation; Online Temporal Grounding; Frame-Level Dense QA; Contextual Online QA; Multi-Turn Interactive QA
2025.10 StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA [pdf] [GitHub] Hierarchical Video Dense Captioning; Dynamic Question-Answer Pairs Construction; Multimodal Chain-of-Thought Generation
2025.10 StreamingVLM: Real-Time Understanding for Infinite Video Streams [pdf] [GitHub] Sports; Narration
2025.10 Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video [pdf] [GitHub] Explicit Proactive Tasks; Implicit Proactive Tasks; Contextual Proactive Tasks
2025.06 Proactive Assistant Dialogue Generation from Streaming Egocentric Videos [pdf] [GitHub] Multi-Round Dialogue
2025.04 LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale [pdf] [GitHub] ASR Transcripts
2025.03 AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis [pdf] - Anomaly Prediction; Anomaly Detection; Anomaly Analysis
2025.02 EgoSpeak: Learning When to Speak for Egocentric Conversational Agents in the Wild [pdf] [GitHub] Prediction vs Detection; Frame-Level Speech Labeling
2024.11 VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format [pdf] [GitHub] Multi-Answer Video Grounded QA; Dense Captioning; Temporal Video Grounding
2024.07 What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction [pdf] [GitHub] Fitness Activity Recognition and Coaching
2024.06 VideoLLM-online: Online Video Large Language Model for Streaming Video [pdf] [GitHub] Narration Stream

📝 Survey

Date Title Paper Code Comment
2024.01 A Survey on Generative AI and LLM for Video Generation, Understanding, and Streaming [pdf] - -

🔗 Resources

☎️ We're Hiring!

We're hiring multimodal research scientists and interns at JD Explore Academy! If you have top-tier publications and are passionate about video understanding and VLMs, please send your resume to: siqingyi.phoebus@jd.com. We'd love to hear from you!
