A curated collection of papers and open-source code repositories dedicated to the application of Vision-Language Models (VLMs) for streaming video.

- Project
- Technical Report
- Proactive Interaction
- Long-term Memory Management
- Real-time Inference
- Streaming with Thinking
- Benchmarks
- Training Datasets
- Survey
- Resources
## Project

| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2025.11 | Live VLM WebUI | [docs] | [GitHub] | Real-Time Vision Language Model Interaction with Webcam Streaming |
## Technical Report

| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.02 | Seed2.0 | [pdf] | [GitHub] | Doubao Video Calling; OVBench & LiveSports-3K & OVOBench & ODVBench & ViSpeak |
| 2025.12 | Seed1.8 | [pdf] | [GitHub] | Doubao Video Calling; OVBench & LiveSports-3K & OVOBench & ViSpeak & StreamingBench & OmniMMI |
## Proactive Interaction

| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.03 | Proact-VL: A Proactive VideoLLM for Real-Time AI Companions | [pdf] | [GitHub] | <\|FLAG\|> Token Response Head with Transition-Smoothed Classification & Stability Regularization |
| 2026.03 | StreamReady: Learning What to Answer and When in Long Streaming Videos | [pdf] | - | Learnable <RDY> Token with Readiness Head for Evidence-Gated Response Triggering |
| 2025.05 | StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant | [pdf] | [GitHub] | Activation Model via <ACT> Token with Binary Score Head |
| 2025.03 | ViSpeak: Visual Instruction Feedback in Streaming Videos | [pdf] | [GitHub] | Informative Head for <seg> Token |
| 2025.03 | StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition | [pdf] | [GitHub] | Cognition Gate Network (Shallow Layer Transfer from LLM) for Binary </response> / </silence> Classification |
| 2025.01 | Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction | [pdf] | [GitHub] | Binary Classification Head on <TODO> Token Embedding (BCE Loss); <ANS> for History Marking; <SILENT> for Reaction-Stage Filtering |
| 2024.11 | VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format | [pdf] | [GitHub] | Informative Head & Relevance Head |
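Across this table the recurring mechanism is a lightweight binary head that scores a special token's hidden state each frame and gates whether the model speaks. A minimal sketch in plain Python (the function, weights, and 0.5 threshold are illustrative assumptions, not any single paper's actual head):

```python
import math

def response_head(hidden, weights, bias=0.0, threshold=0.5):
    """Score a special token's hidden state with a binary head and decide
    whether to respond or stay silent. Sketch of the shared pattern above
    (e.g. a head on a <TODO>/<ACT>-style token); values are illustrative."""
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid over the head's logit
    return prob, prob >= threshold
```

In practice such heads are trained with frame-level speak/silence supervision, e.g. the BCE loss and binary score heads the Dispider and StreamBridge entries mention.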
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.04 | AURA: Always-On Understanding and Real-Time Assistance via Video Streams | [pdf] | [GitHub] | Unified <\|silent\|> Token for Silent Observation with Real-Time QA / Proactive QA / Multi-Response QA; Silent-Speech Balanced Loss |
| 2026.03 | STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding | [pdf] | [GitHub] | Activation Token Denoising over 0/1/[M]; Sequence Duplication; Selective Re-masking |
| 2025.12 | Streaming Video Instruction Tuning | [pdf] | [GitHub] | Three-State Response Tokens (</Silence>, </Standby>, </Response>) via Unified Next-Token Prediction; Focal Loss for Response Imbalance |
| 2025.06 | Proactive Assistant Dialogue Generation from Streaming Egocentric Videos | [pdf] | [GitHub] | Frame-Level [EOS] Silence Prediction with Negative Frame Sub-Sampling |
| 2025.03 | LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant | [pdf] | [GitHub] | Streaming EOS Prediction via Fast-Slow Dual-Path; Token Aggregation & Dropping Router for Visual Feature Compression |
| 2025.03 | AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis | [pdf] | - | EOS Prediction with Task-Adaptive Threshold for Anomaly Alerting |
| 2024.07 | What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction | [pdf] | [GitHub] | Action Tokens <next> and <feedback> |
| 2024.06 | VideoLLM-online: Online Video Large Language Model for Streaming Video | [pdf] | [GitHub] | Streaming EOS Token Prediction |
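The EOS-prediction entries above all reduce to the same per-frame loop: emit silence while the learned score stays low, respond when it crosses a threshold. A toy version with a short refractory period so the trigger does not fire on every subsequent frame (the scores, threshold, and cooldown are made up for illustration):

```python
def streaming_eos(frame_scores, threshold=0.6, cooldown=2):
    """Streaming EOS-style trigger: stay silent ([EOS]) frame by frame until
    the response score crosses a threshold, then hold a short cooldown.
    Threshold and cooldown are illustrative, not values from any paper above."""
    actions, wait = [], 0
    for score in frame_scores:
        if wait > 0:
            actions.append("silence")
            wait -= 1
        elif score >= threshold:
            actions.append("respond")
            wait = cooldown
        else:
            actions.append("silence")
    return actions
```

Task-adaptive thresholds (as in the AssistPDA entry) replace the fixed constant with one tuned per task.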
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.03 | Thinking in Streaming Video | [pdf] | [GitHub] | Watch-Think-Speak with <silent> & <response> Action Tokens; Streaming RLVR (Format + Time + Accuracy Reward) |
| 2026.01 | Learning to Respond: A Large-Scale Benchmark and Progressive Learning Framework for Trigger-Centric Online Video Understanding | [pdf] | - | Trigger-Centric Online Video Understanding |
| 2025.12 | MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning | [pdf] | [GitHub] | Text-to-Text "NO REPLY" Response Decision with Multi-Objective Reward (PAUC + Replication + In-Span + Prefix) GRPO |
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.03 | FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding | [pdf] | [GitHub] | Scene-Change Ratio Trigger Reusing Temporal Adjacency Selection Statistics |
| 2026.01 | QueryStream: Advancing Streaming Video Understanding with Query-Aware Pruning and Proactive Response | [pdf] | - | Relevance-Triggered Active Response |
| 2026.01 | Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams | [pdf] | - | Motion-Semantic-Prediction Boundary Score; Adaptive Threshold; Event-triggered Decoding |
| 2025.11 | LiveStar: Live Streaming Assistant for Real-World Online Video Understanding | [pdf] | [GitHub] | Streaming Verification Decoding Perplexity Gate for Response & Silence |
| 2025.08 | StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding | [pdf] | - | Reactive/Proactive/Speculative Planning; Heuristic Trigger; Tool-Guided Information Hunting |
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.03 | Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding | [pdf] | [GitHub] | Query-Time Instruction-Guided Visual Proposal Generation (SFT+GRPO); Lightweight Embedding Cosine Similarity Surge Triggering |
| 2026.03 | StreamingClaw Technical Report | [pdf] | - | Training-Free (Reminder Node); Training-Based (Scenario-Specific Trigger Tokens) |
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2025.10 | Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video | [pdf] | [GitHub] | Weighted Interval Supervision & Uncertainty-Guided High-Resolution Requests |
| 2025.02 | EgoSpeak: Learning When to Speak for Egocentric Conversational Agents in the Wild | [pdf] | [GitHub] | Standalone Audio-Visual Frame-Level Three-Way Classifier (Background / Self / Other) with Anticipatory Prediction |
## Long-term Memory Management

| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.03 | CurveStream: Boosting Streaming Video Understanding in MLLMs via Curvature-Aware Hierarchical Visual Memory Management | [pdf] | - | Curvature-Aware Scorer (Motion Variation + Geometric Curvature); EMA-Based K-Sigma Dynamic Thresholding; Hierarchical Clear/Blurred/Discard Memory with FIFO Eviction |
| 2026.03 | StreamingClaw Technical Report | [pdf] | - | Hierarchical Memory Evolution |
| 2026.03 | StreamReady: Learning What to Answer and When in Long Streaming Videos | [pdf] | - | 3-Level Visual Memory Tree (FIFO-Centroid-Prototype) & Contextual Memory Bank |
| 2026.03 | WAT: Online Video Understanding Needs Watching Before Thinking | [pdf] | - | Dual-Level Memory: Short-Term FIFO Queue + Long-Term Memory with Redundancy-Aware Eviction Policy |
| 2026.03 | Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism | [pdf] | [GitHub] | Dual-Pathway Compression for Context Memory and Local Memory; Visual KV-Cache Memory Bank; Cross-attention Memory Recall & MemIndex |
| 2026.03 | FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding | [pdf] | [GitHub] | Three-Level Hierarchical Memory (Short/Mid/Long-Term); Temporal Adjacency Selection & Spatial Domain Consolidation |
| 2026.02 | FreshMem: Brain-Inspired Frequency-Space Hybrid Memory for Streaming Video Understanding | [pdf] | - | Short-term Sliding Window + Multi-scale Frequency Memory + Space Thumbnail Memory |
| 2026.01 | HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding | [pdf] | [GitHub] | Hierarchical KV-Cache (Sensory-Working-Long-Term Memory); Exponential Forgetting Curve & Frame-Level Anchor Tokens; Cross-Layer Memory Smoothing & Position Re-Indexing |
| 2025.12 | Venus: An Efficient Edge Memory-and-Retrieval System for VLM-based Online Video Understanding | [pdf] | - | Hierarchical Raw Data Layer & Semantic Index Layer; Scene Segmentation and Incremental Clustering for Sparse Memory Construction |
| 2025.08 | StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding | [pdf] | - | Streaming KV-Cache; CPU Long-Term & GPU Short-Term Hierarchical Memory; Layer-Adaptive KV Recall |
| 2025.07 | StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling | [pdf] | [GitHub] | Sliding-Window KV-Cache with Selective Offloading; 3D Voxel-Based Spatial Token Pruning for Cross-Frame Redundancy Removal |
| 2025.01 | Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge | [pdf] | [GitHub] | Hierarchical Memory: Ebbinghaus Forgetting Curve Short-Term + Tree-Structured Clustering Long-Term + FAISS Dialogue Memory; Forgetting Probability-Based Frame Sampling |
| 2024.09 | VideoLLaMB: Long Streaming Video Understanding with Recurrent Memory Bridges | [pdf] | [GitHub] | Recurrent Memory Bridge with Self-Attention Recursive Update & Retrieval Attention; SceneTiling Semantic Segmentation; Linear Memory Scaling |
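Several memory designs above decay old frames along an Ebbinghaus-style forgetting curve, R = exp(-t/S). A minimal sketch of retention-based eviction (the strength S and keep threshold are hypothetical values for illustration):

```python
import math

def retention(age, strength=5.0):
    """Ebbinghaus-style forgetting curve R = exp(-t/S): retention falls with
    age; a larger strength S means slower forgetting."""
    return math.exp(-age / strength)

def evict_forgotten(frames, strength=5.0, keep_threshold=0.3):
    """Keep only (timestamp, frame) pairs whose retention is still above the
    threshold. Values are illustrative, not from any paper above."""
    now = max(t for t, _ in frames)
    return [(t, f) for t, f in frames
            if retention(now - t, strength) >= keep_threshold]
```

The hierarchical variants above additionally promote salient frames to a long-term tier (raising their effective strength) instead of dropping them outright.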
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.04 | AURA: Always-On Understanding and Real-Time Assistance via Video Streams | [pdf] | [GitHub] | Dual Sliding-Window Context over Recent Video and QA History; Out-of-Window Video Chunks and <\|silent\|> Tokens Discarded |
| 2026.04 | A Simple Baseline for Streaming Video Understanding | [pdf] | [GitHub] | Fixed Recent-frame Sliding Window; Old-frame Eviction without External Memory |
| 2026.03 | Proact-VL: A Proactive VideoLLM for Real-Time AI Companions | [pdf] | [GitHub] | Dual-Cache (System & Streaming) Sliding Window with Reverse-RoPE Eviction for Infinite Streaming |
| 2026.03 | STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding | [pdf] | [GitHub] | Bounded Visual Cache Eviction; Sliding-window Frame Retention |
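The entries above share the simplest possible memory policy: a fixed recent-frame window with FIFO eviction and no external store. A sketch using a bounded deque (the capacity is arbitrary):

```python
from collections import deque

class SlidingWindow:
    """Fixed recent-frame sliding window: old frames are evicted FIFO with
    no external memory, as in the simple-baseline entries above (sketch)."""
    def __init__(self, capacity):
        self.frames = deque(maxlen=capacity)

    def push(self, frame):
        self.frames.append(frame)  # deque drops the oldest frame automatically

    def context(self):
        return list(self.frames)
```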
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2025.10 | Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs | [pdf] | - | Attention-Score-Based Visual Token Selection with Recurrent FIFO Token Queue; Maximal Marginal Relevance Caption Retrieval |
| 2025.05 | StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant | [pdf] | [GitHub] | Producer-Consumer Memory Buffer with Conditional Round-Decayed Token Compression (Prioritizing Recent Frames) |
| 2025.04 | TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos | [pdf] | [GitHub] | Differential Token Drop (Primary); FIFO Slimmed Token Memory Bank (Supplementary) |
| 2025.03 | VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers | [pdf] | [GitHub] | Single Semantic Carrier Token per Frame (Avg-Pool Embedding + Reused KV Aggregation); Cosine Similarity-Based Dynamic Memory Bank Eviction |
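A common thread here (e.g. TimeChat-Online's differential drop) is discarding visual tokens that barely changed since the previous frame. A toy cosine-similarity version (the token vectors and the 0.95 threshold are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two (non-zero) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def differential_drop(prev_tokens, new_tokens, sim_threshold=0.95):
    """Keep only the new frame's tokens that differ enough from the token at
    the same spatial position in the previous frame (sketch)."""
    return [tok for prev, tok in zip(prev_tokens, new_tokens)
            if cosine(prev, tok) < sim_threshold]
```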
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.02 | Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory | [pdf] | [GitHub] | Adaptive Key Selection for Sparse Sliding-Window Encoding; Training-Free Retrieval MoE via Reciprocal Rank Fusion of Internal & External Signals |
| 2025.12 | V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval | [pdf] | - | Hash-Bit Hamming Clustering & Weighted Cumulative Sum Early-Exit Thresholding for Dynamic KV-Cache Retrieval; Hierarchical Memory Offloading |
| 2025.11 | StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression | [pdf] | [GitHub] | Cosine Similarity-Based Dynamic Semantic Segmentation; Summary-Vector Representative Key Retrieval; Guidance-Prompt-Driven KV Compression with Layer-Adaptive Budget Allocation |
| 2025.11 | LiveStar: Live Streaming Assistant for Real-World Online Video Understanding | [pdf] | [GitHub] | Peak-End Memory Compression & Dual-Level Streaming KV Cache |
| 2025.10 | StreamingTOM: Streaming Token Compression for Efficient Video Understanding | [pdf] | [GitHub] | Causal Temporal Token Reduction (Static-Dynamic DPC & Attention Selection) + Online 4-bit Quantized KV Memory with Representative-Key Retrieval |
| 2025.08 | StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding | [pdf] | - | Chat Template Proxy Query for Query-Agnostic KV-Cache Pruning & Weighted Merging |
| 2025.06 | InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding | [pdf] | [GitHub] | KV-Cache Compression with Temporal-Axis Redundancy Pruning & Value-Norm Ranking with Layer-Adaptive Pooling |
| 2025.05 | LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval | [pdf] | - | Streaming-Oriented KV Cache with Video-Specific KV Compression; Frame-Wise KV Merging & FIFO KV Chunk Memory |
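Several systems above keep KV state as whole chunks under a fixed token budget with FIFO eviction (e.g. the LiveVLM entry's FIFO KV chunk memory). A sketch of that bookkeeping (the class name and budget are hypothetical):

```python
class KVChunkMemory:
    """FIFO KV-chunk memory under a fixed token budget: append per-window
    chunks and evict the oldest whole chunks once the budget is exceeded
    (sketch of the pattern above; budget value illustrative)."""
    def __init__(self, budget):
        self.budget, self.chunks = budget, []

    def append(self, chunk_id, n_tokens):
        self.chunks.append((chunk_id, n_tokens))
        while sum(n for _, n in self.chunks) > self.budget:
            self.chunks.pop(0)  # FIFO eviction of the oldest chunk

    def ids(self):
        return [c for c, _ in self.chunks]
```

Evicting whole chunks (rather than individual tokens) keeps positional bookkeeping simple and matches how the window was encoded.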
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.03 | PEARL: Personalized Streaming Video Understanding Model | [pdf] | [GitHub] | Dual-Grained Memory (Streaming Memory + Concept Memory); Concept-Aware Retrieval with Query Rewriting for Personalized Concept Grounding |
| 2026.02 | Vista: Scene-Aware Optimization for Streaming Video Question Answering under Post-Hoc Queries | [pdf] | - | Scene-aware Segmentation; Temporal-Spatial Scene Token Compression; CPU Offloaded Full Frames with Query-conditioned Top-k Recall |
| 2026.02 | WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs | [pdf] | [GitHub] | Uncertainty-Gated Hierarchical Retrieval with Entropy-Threshold Triggered Past Context Access |
| 2025.12 | CogStream: Context-guided Streaming Video Question Answering | [pdf] | [GitHub] | Temporal-Semantic Clustering with Question-Aware Event Compression; Historical Dialogue Retrieval via LLM Selection |
| 2025.11 | CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding | [pdf] | - | Dynamic Token Dropping; GRU-Based Compressive Memory; KV Offloading and Rehydration; Consensus-First Retrieval |
| 2025.06 | Flash-VStream: Efficient Real-Time Understanding for Long Video Streams | [pdf] | [GitHub] | K-Means Clustered Context Synopsis Memory & Feature-Centric Detail Augmentation Memory with Disk-Offloadable Feature Bank |
| 2025.03 | Streaming Video Question-Answering with In-context Video KV-Cache Retrieval | [pdf] | [GitHub] | Sliding-Window Encoding with KV-Cache Offloading to RAM/Disk; Internal (Self-Attention Key Averaging) & External (CLIP Cosine Similarity) Frame-Level Retrieval |
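The retrieval-memory entries typically offload frame features and recall the top-k most query-similar ones at question time. A toy CLIP-style ranking over stored embeddings (all vectors and names here are made up):

```python
import math

def cosine(a, b):
    """Cosine similarity between two (non-zero) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve_topk(query, memory, k=2):
    """Query-conditioned top-k recall over an offloaded frame bank: rank
    stored frame embeddings by cosine similarity to the query (sketch of the
    external-retrieval pattern above; embeddings here are toy vectors)."""
    ranked = sorted(memory.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]
```

Real systems pair this with offloading (RAM/disk, as in the entries above) so only the recalled frames are rehydrated into the context.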
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.03 | Click-to-Ask: An AI Live Streaming Assistant with Offline Copywriting and Online Interactive QA | [pdf] | - | Streaming Event Segmentation; Event-level Historical Caption Memory; Knowledge Extraction Accelerator |
| 2026.03 | Thinking in Streaming Video | [pdf] | [GitHub] | Reasoning-Compressed Streaming Memory: Visual Sliding Window + Reasoning Tokens as Long-Term Semantic Anchors |
| 2026.03 | Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously | [pdf] | [GitHub] | Short-Term Visual Sliding Window & Long-Term Textual Streaming-Thought Memory with FIFO Eviction; Recursive Temporal Segmentation |
| 2026.02 | Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge | [pdf] | - | Convert Video into a Lightweight Textual Memory |
| 2025.06 | Proactive Assistant Dialogue Generation from Streaming Egocentric Videos | [pdf] | [GitHub] | Iterative Progress Summarization as Summary-Based Memory Compression |
| 2025.04 | Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding | [pdf] | [GitHub] | Multimodal Interleaved Cache: Online Verbalization of Visual-to-Text for Long-Term & Short-Term Visual Tokens |
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.02 | EventMemAgent: Hierarchical Event-Centric Memory for Online Video Understanding with Adaptive Tool Use | [pdf] | [GitHub] | Event-Centric Dual-Layer Memory (STM with Online Event Segmentation & Reservoir Sampling + LTM with Structured Event Tuples) |
| 2026.01 | Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams | [pdf] | - | Event-level Memory Bank; Merge-or-Append Event Consolidation |
| 2025.12 | VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs | [pdf] | [GitHub] | Prediction-Guided Elastic-Scale Event Segmentation & Hierarchical Cross-Attention Event Consolidation |
| 2025.09 | StreamForest: Efficient Online Video Understanding with Persistent Event Memory | [pdf] | [GitHub] | Event-Level Tree Hierarchy with Adaptive Merging via Similarity & Merge-Count & Temporal Penalty; Short-Term Spatiotemporal Sliding Window |
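Event-centric memories above segment the stream by deciding, per frame, whether to merge into the current event or open a new one. A sketch of merge-or-append consolidation against a running centroid (the 0.9 threshold is illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two (non-zero) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def consolidate(events, frame_emb, merge_threshold=0.9):
    """Merge-or-append event consolidation: a new frame joins the last event
    when it is similar to that event's running centroid, otherwise it starts
    a new event (sketch of the pattern above)."""
    if events and cosine(events[-1]["centroid"], frame_emb) >= merge_threshold:
        ev = events[-1]
        n = ev["count"]
        ev["centroid"] = [(c * n + x) / (n + 1) for c, x in zip(ev["centroid"], frame_emb)]
        ev["count"] = n + 1
    else:
        events.append({"centroid": list(frame_emb), "count": 1})
    return events
```

The tree-based variants (e.g. the StreamForest entry) extend this with merge-count and temporal penalties so long-running events do not absorb everything.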
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.01 | OnlineSI: Taming Large Language Model for Online 3D Understanding and Grounding | [pdf] | [GitHub] | Fixed-size Spatial Memory with Time-adaptive Sampling and Concatenation; Explicit Point Cloud and Semantic Memory |
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2025.10 | video-SALMONN S: Streaming Audio-Visual LLMs Beyond Length Limits via Memory | [pdf] | [GitHub] | Test-Time Training Fast-Weight MLP as Streaming Memory with Dual (Reconstruction + Long-Span Prediction) Objective; Cosine Similarity Token Discarding; Prompt-Dependent Modality-Aware KV-Cache Chunk Reading |
## Real-time Inference

| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.03 | Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models | [pdf] | [GitHub] | Parallel Dual KV-Cache with Merge-Generate-Split Loop for Concurrent Encoding-Decoding; Decoupled Cross-Modal RoPE; Near-Zero Time-to-First-Token |
| 2026.03 | Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in MLLMs | [pdf] | [GitHub] | Threaded Parallel Watch-Think Pipeline with Async Segment Prefetch & Adaptive Attention Backend |
| 2026.01 | Speak While Watching: Unleashing TRUE Real-Time Video Understanding Capability of Multimodal Large Language Models | [pdf] | [GitHub] | Decoupled Positional Encoding (Overlapped / Group-Decoupled / Gap-Isolated) for Parallel Perception-Generation Streaming |
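The parallel-decoding entries above overlap perception with generation instead of alternating them. A minimal threaded producer-consumer sketch of that watch/think overlap (this is only the concurrency skeleton, not any paper's actual pipeline):

```python
import queue
import threading

def watch_think_pipeline(frames):
    """Threaded watch-think pipeline: a 'watch' thread keeps ingesting frames
    into a queue while a 'think' consumer processes them, so perception never
    blocks on generation (sketch; the per-frame 'thinking' is a stub)."""
    q, processed = queue.Queue(), []

    def watch():
        for f in frames:
            q.put(f)
        q.put(None)  # end-of-stream sentinel

    def think():
        while (f := q.get()) is not None:
            processed.append(f"seen:{f}")  # stand-in for reasoning on frame f

    producer = threading.Thread(target=watch)
    consumer = threading.Thread(target=think)
    producer.start(); consumer.start()
    producer.join(); consumer.join()
    return processed
```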
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.03 | STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding | [pdf] | [GitHub] | Two-stage Activation-to-Generation Pipeline; Event-gated Downstream Video-LLM Invocation |
| 2026.03 | Color When It Counts: Grayscale-Guided Online Triggering for Always-On Streaming Video Sensing | [pdf] | [GitHub] | Windowed Grayscale Affinity Analysis with Quadratic Programming; Credit-Budgeted RGB Activation; Dynamic Token Router with Asymmetric Grayscale and RGB Token Capacity |
| 2026.03 | StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition | [pdf] | [GitHub] | SSM-Based Single-Token Perception with Event-Gated Sparse LLM Invocation |
| 2026.01 | Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams | [pdf] | - | Boundary-aware Event Pooling; Event-triggered Sparse Decoding; Hysteresis Pacing Control |
| 2025.09 | Open-ended Hierarchical Streaming Video Understanding with Vision Language Models | [pdf] | - | Lightweight RNN Streaming Module with Event-Gated Sparse Frozen VLM Invocation |
| 2025.03 | LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant | [pdf] | [GitHub] | Routing-Based Response Determination via Fast-Slow Dual-Path; Token Aggregation & Dropping for High-FPS Routing |
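Event-gated designs above run a cheap per-frame gate and invoke the expensive VLM only when it fires, so per-frame cost stays low. A toy gate that also reports the resulting invocation rate (the threshold is illustrative):

```python
def event_gated_stream(scores, gate_threshold=0.7):
    """Event-gated sparse invocation: given lightweight per-frame gate scores,
    return the frame indices where the heavy VLM would be invoked, plus the
    fraction of frames that triggered it (sketch of the pattern above)."""
    invocations = [t for t, s in enumerate(scores) if s >= gate_threshold]
    return invocations, len(invocations) / max(len(scores), 1)
```

The invocation rate is the quantity these systems try to minimize while keeping response quality, e.g. via hysteresis (the Event-VStream entry) so the gate does not flap around the threshold.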
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.04 | CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference | [pdf] | - | Single-pass Compressed-bitstream Ingestion; Motion Vector-guided Patch Pruning; I-frame Anchor KV-Cache Refresh with RoPE-based Position Correction |
| 2026.04 | A Simple Baseline for Streaming Video Understanding | [pdf] | [GitHub] | Fixed Recent-frame Window; Bounded-memory Low-latency Inference |
| 2026.03 | Thinking in Streaming Video | [pdf] | [GitHub] | Eager Prefill + CUDA Graph Decode-and-Prune |
| 2026.03 | FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding | [pdf] | [GitHub] | Visual Token Compression via TAS & SDC with Otsu-Based Adaptive Thresholding for Latency & Memory Reduction |
| 2026.02 | QueryStream: Advancing Streaming Video Understanding with Query-Aware Pruning and Proactive Response | [pdf] | - | Query-Aware Differential Pruning (QDP) & Relevance-Triggered Active Response (RTAR) Scheduling |
| 2025.12 | StreamingAssistant: Efficient Visual Token Pruning for Accelerating Online Video Understanding | [pdf] | - | Checkerboard-Masked Parallel Spatial Pruning with Adjacency-Constrained Redundancy; Query-Agnostic Continuous Pre-Pruning Pipeline |
| 2025.12 | Accelerating Streaming Video Large Language Models via Hierarchical Token Compression | [pdf] | [GitHub] | Hierarchical Token Compression: ViT Cache-Aware Selective Computation & Dual-Anchor Novelty Pruning |
| 2025.10 | Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs | [pdf] | - | Attention-Based Visual Token Compression & Caption-Only Question Answering |
| 2025.03 | VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers | [pdf] | [GitHub] | Single Semantic Carrier per Frame; Prefill-Decode Decoupling; Visual Tokens Discarded after Prefill |
| 2024.08 | VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation | [pdf] | - | Vision Token Computation Skipping with LayerExpert |
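On the extreme end of token compression, the VideoScan-style semantic carrier collapses each frame's visual tokens into a single vector. A sketch using plain average pooling (the actual method also reuses KV aggregation, which is omitted here):

```python
def semantic_carrier(frame_tokens):
    """Average-pool a frame's visual token embeddings into one carrier token,
    reducing per-frame cost from N tokens to 1 (sketch; toy vectors)."""
    n = len(frame_tokens)
    dim = len(frame_tokens[0])
    return [sum(tok[d] for tok in frame_tokens) / n for d in range(dim)]
```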
| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.04 | AURA: Always-On Understanding and Real-Time Assistance via Video Streams | [pdf] | [GitHub] | Floating Video/QA Sliding Windows with Batched N' Chunk Truncation for Prefix KV-Cache Reuse and Lower TTFT |
| 2026.01 | HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding | [pdf] | [GitHub] | KV-Cache Reuse for Instant Query Response; Hierarchical Token Compression within Fixed Cache Budget |
| 2025.12 | V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval | [pdf] | - | Hardware-Software Co-Design with Dynamic KV Cache Retrieval Engine Accelerator; Pipelined KV Prediction-Retrieval Overlapped with LLM Computation |
| 2025.11 | StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression | [pdf] | [GitHub] | Sliding-Window Segment Encoding with Immediate Post-Encoding KV Compression |
| 2025.10 | StreamingVLM: Real-Time Understanding for Infinite Video Streams | [pdf] | [GitHub] | Bounded KV-Cache Eviction for Constant Memory; No Redundant Recomputation across Windows |
| 2025.10 | StreamingTOM: Streaming Token Compression for Efficient Video Understanding | [pdf] | [GitHub] | Pre-LLM Causal Token Budget Cap for Prefill Acceleration; Post-LLM Quantized Memory with Selective Dequantization |
| 2025.05 | LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval | [pdf] | - | Pre-Generated Video KV Cache; Mean-Pooled Query-Key Chunk Retrieval with FIFO KV Chunk Management |
| 2025.03 | Streaming Video Question-Answering with In-context Video KV-Cache Retrieval | [pdf] | [GitHub] | Multi-Process Parallel Encoding-Answering; Sliding-Window Attention for Stable-Latency Incremental Processing |
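Bounded KV-cache designs above keep memory constant on infinite streams by retaining a few early positions plus a recent window and evicting the middle (an attention-sink-style policy popularized by streaming-LLM work). A sketch over token positions (sink and window sizes are illustrative):

```python
def bounded_kv_evict(positions, n_sink=2, n_recent=4):
    """Bounded KV-cache eviction: keep the first n_sink 'sink' entries plus
    the n_recent most recent ones, evicting everything in between, so cache
    size stays constant as the stream grows (sketch)."""
    if len(positions) <= n_sink + n_recent:
        return positions
    return positions[:n_sink] + positions[-n_recent:]
```

Real systems must also re-index positions after eviction (e.g. the RoPE position correction and re-indexing several entries above mention), which this sketch omits.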
## Streaming with Thinking

| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.03 | Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models | [pdf] | [GitHub] | Causal Streaming Attention Masking; Decoupled Cross-Modal Positional Encoding; Parallel Dual KV-Cache Mechanism |
| 2026.03 | Thinking in Streaming Video | [pdf] | [GitHub] | Streaming Watch-Think-Speak Paradigm; Reasoning-Compressed Streaming Memory (RCSM); Streaming RLVR (Format + Time + Accuracy Reward) |
| 2026.03 | Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously | [pdf] | [GitHub] | The Video Streaming Thinking Paradigm; VKG-Based Streaming CoT Data Synthesis; Two-Stage VST-SFT & VST-RL Training |
| 2026.03 | Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in MLLMs | [pdf] | [GitHub] | Segment-Level Streaming Causal Mask & Positional Encoding; Three-Stage CoT Training; Concurrent Watch-Think Pipeline |
| 2025.10 | StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA | [pdf] | [GitHub] | Streaming VideoQA and Multimodal CoT Tasks |
## Benchmarks

| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.03 | PEARL: Personalized Streaming Video Understanding Model | [pdf] | [GitHub] | Frame-Level Personalization & Video-Level Personalization; Concept-Definition QA & Real-Time QA & Past-Time QA |
| 2026.03 | StreamingEval: A Unified Evaluation Protocol towards Realistic Streaming Video Understanding | [pdf] | [GitHub] | Streaming Evaluation Protocol |
| 2026.03 | HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household | [pdf] | [GitHub] | Unsafe Action Detection; Early Warning Timing; Severity Assessment |
| 2026.02 | Artic: AI-oriented Real-time Communication for MLLM Video Assistant | [pdf] | [GitHub] | Degradation-sensitive QA |
| 2026.01 | OnlineSI: Taming Large Language Model for Online 3D Understanding and Grounding | [pdf] | [GitHub] | Online 3D Object Detection |
| 2026.01 | PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios | [pdf] | [GitHub] | Mobile-Centric Scenarios; Perception & Interaction & Planning |
| 2025.12 | StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios | [pdf] | [GitHub] | Embodied Scenarios |
| 2025.12 | StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos | [pdf] | [GitHub] | Gaze-Guided Streaming Data; Past & Present & Proactive |
| 2025.10 | Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance? | [pdf] | [GitHub] | Qualcomm Interactive Cooking |
| 2025.10 | Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video | [pdf] | [GitHub] | Explicit Proactive; Implicit Proactive; Contextual Proactive |
| 2025.07 | OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding | [pdf] | [GitHub] | Agent State; Agent Visible Info; Agent-Object Spatial Relationship |
| 2025.07 | ProactiveVideoQA: A Comprehensive Benchmark Evaluating Proactive Interactions in Video Large Language Models | [pdf] | [GitHub] | Proactive Web-Video QA; Proactive Ego-Centric Video QA; Proactive TV-Series Video QA; Proactive Video Anomaly Detection |
| 2025.07 | Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI | [pdf] | [GitHub] | DeViBench |
| 2025.05 | RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video | [pdf] | [GitHub] | Perception; Understanding; Reasoning |
| 2025.04 | LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale | [pdf] | [GitHub] | LiveSports-3K-CC/QA |
| 2025.03 | OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts | [pdf] | [GitHub] | Streaming Video Understanding; Proactive Reasoning |
| 2025.02 | SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding | [pdf] | [GitHub] | Multi-Turn Dialogues |
| 2025.01 | Online Video Understanding: OVBench and VideoChat-Online | [pdf] | [GitHub] | Past; Current; Future |
| 2025.01 | OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? | [pdf] | [GitHub] | Backward Tracing; Real-Time Understanding; Forward Active Responding |
| 2025.01 | Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge | [pdf] | [GitHub] | Object Search; Long-Term Memory Search; Short-Term Memory Search; Conversational Interaction; Knowledge-Based Question Answering; Simple Factual |
| 2024.11 | StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding | [pdf] | [GitHub] | Real-Time Visual Understanding; Omni-Source Understanding; Contextual Understanding |
| 2024.07 | What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction | [pdf] | [GitHub] | Fitness Activity Recognition and Coaching |
| 2024.03 | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | [pdf] | [GitHub] | Movies and TV |
## Training Datasets

| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2026.03 | WAT: Online Video Understanding Needs Watching Before Thinking | [pdf] | - | Real-Time Perception; Backward Tracing; Forecasting |
| 2026.03 | Thinking in Streaming Video | [pdf] | [GitHub] | Video Segmentation and Dense Captioning; Diverse Instruction Synthesis; Time-Grounded CoT Generation |
| 2026.01 | Learning to Respond: A Large-Scale Benchmark and Progressive Learning Framework for Trigger-Centric Online Video Understanding | [pdf] | - | TV-Online |
| 2026.01 | ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding | [pdf] | [GitHub] | Online Proactive; Online Narration; Reactive QA |
| 2025.12 | Streaming Video Instruction Tuning | [pdf] | [GitHub] | Real-Time Narration; Event Caption; Action Caption; Event Grounding; Time-Sensitive QA |
| 2025.12 | MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning | [pdf] | [GitHub] | Scene Segmentation and Captioning; QA Generation; Proactive Dialogue Construction |
| 2025.12 | CogStream: Context-guided Streaming Video Question Answering | [pdf] | [GitHub] | Semi-Automatic QA Pipeline; 1,088 Videos with 59K Hierarchical QA Pairs (Basic, Streaming, Global) |
| 2025.11 | LiveStar: Live Streaming Assistant for Real-World Online Video Understanding | [pdf] | [GitHub] | Real-Time Narration Generation; Online Temporal Grounding; Frame-Level Dense QA; Contextual Online QA; Multi-Turn Interactive QA |
| 2025.10 | StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA | [pdf] | [GitHub] | Hierarchical Video Dense Captioning; Dynamic Question-Answer Pairs Construction; Multimodal Chain-of-Thought Generation |
| 2025.10 | StreamingVLM: Real-Time Understanding for Infinite Video Streams | [pdf] | [GitHub] | Sports; Narration |
| 2025.10 | Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video | [pdf] | [GitHub] | Explicit Proactive Tasks; Implicit Proactive Tasks; Contextual Proactive Tasks |
| 2025.06 | Proactive Assistant Dialogue Generation from Streaming Egocentric Videos | [pdf] | [GitHub] | Multi-Round Dialogue |
| 2025.04 | LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale | [pdf] | [GitHub] | ASR Transcripts |
| 2025.03 | AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis | [pdf] | - | Anomaly Prediction; Anomaly Detection; Anomaly Analysis |
| 2025.02 | EgoSpeak: Learning When to Speak for Egocentric Conversational Agents in the Wild | [pdf] | [GitHub] | Prediction vs Detection; Frame-Level Speech Labeling |
| 2024.11 | VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format | [pdf] | [GitHub] | Multi-Answer Video Grounded QA; Dense Captioning; Temporal Video Grounding |
| 2024.07 | What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction | [pdf] | [GitHub] | Fitness Activity Recognition and Coaching |
| 2024.06 | VideoLLM-online: Online Video Large Language Model for Streaming Video | [pdf] | [GitHub] | Narration Stream |
## Survey

| Date | Title | Paper | Code | Comment |
|---|---|---|---|---|
| 2024.01 | A Survey on Generative AI and LLM for Video Generation, Understanding, and Streaming | [pdf] | - | - |
## Resources

- JeffShine/Awesome-Streaming-Video-Understanding
- sotayang/Awesome-Streaming-Video-Understanding
- LJungang/Awesome-Video-Reasoning-Landscape
We're hiring multimodal research scientists and interns at JD Explore Academy! If you have top-tier publications and are passionate about video understanding and VLMs, please send your resume to: siqingyi.phoebus@jd.com. We'd love to hear from you!