OpenGVLab/Future-L1

💭 Imagine Before You Predict

Interleaved Latent Visual Reasoning for Video Event Prediction

Future-L1 teaches multimodal LLMs to alternate between language tokens and continuous latent visual spans, enabling compact future-state imagination before answering video event prediction questions.

Motivation of interleaved latent visual reasoning

Figure 1. Text-CoT can be verbose and visually lossy, while pixel-space future simulation is computationally heavy. Future-L1 inserts compact latent visual spans that preserve dynamic future semantics without generating full frames.


✨ Highlights

  • Interleaved latent visual reasoning. Future-L1 alternates between <reason> text and bounded <|latent_start|>…<|latent_end|> spans during autoregressive decoding, keeping dynamic visual structure in a continuous channel instead of verbalizing every intermediate hypothesis.
  • Future-L1-50K. We curate 50K high-utility examples from TwiFF-style trajectories by visual-gain selection: retain samples where intermediate future visual hints measurably improve prediction over a text-only baseline.
  • LA-DAPO RL. A latent-aware extension of DAPO with outcome-contrastive (R_ctr) and temporal-diversity (R_div) rewards that optimize sampled latent trajectories without intermediate-frame annotations at RL time.
  • State-of-the-art VEP performance. Future-L1-RL reaches 85.4% on FutureBench and 3.04 average score on TwiFF-Bench, with especially strong gains on multi-hop and non-consecutive future-event splits.
  • Compact inference. Accuracy improves through latent visual computation rather than long text-only chains or multi-turn search.
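As a rough illustration of the interleaved format described in the first highlight, the sketch below splits a decoded trace into alternating text and latent segments. It assumes the trace is available as a list of string tokens and that each latent position shows up as a placeholder token; the `<z>` placeholder and the `split_segments` helper are hypothetical and not part of the Future-L1 codebase. In the actual model the latent positions carry continuous vectors rather than discrete tokens.

```python
from typing import List, Tuple

# Marker strings as quoted in the highlights above.
LATENT_START = "<|latent_start|>"
LATENT_END = "<|latent_end|>"


def split_segments(tokens: List[str]) -> List[Tuple[str, List[str]]]:
    """Split a token trace into ("text", [...]) and ("latent", [...]) segments."""
    segments: List[Tuple[str, List[str]]] = []
    current: List[str] = []
    in_latent = False
    for tok in tokens:
        if tok == LATENT_START:
            if in_latent:
                raise ValueError("nested latent span")
            if current:
                segments.append(("text", current))
            current, in_latent = [], True
        elif tok == LATENT_END:
            if not in_latent:
                raise ValueError("latent end marker without a start marker")
            segments.append(("latent", current))
            current, in_latent = [], False
        else:
            current.append(tok)
    if in_latent:
        raise ValueError("unterminated latent span")
    if current:
        segments.append(("text", current))
    return segments


# "<z>" is a hypothetical stand-in for one continuous latent step.
trace = ["<reason>", "The", "cup", "tips",
         LATENT_START, "<z>", "<z>", "<z>", LATENT_END,
         "so", "it", "spills"]
segments = split_segments(trace)
latent_lengths = [len(body) for kind, body in segments if kind == "latent"]
print(latent_lengths)  # [3]
```

Counting the `"latent"` segments per trace gives the kind of span-count statistic that a usage analysis would aggregate.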

Future-L1 pipeline

Figure 2. (Left) Future-L1-50K is built by ranking TwiFF candidates by visual gain p_v − p_t. (Center) SFT trains interleaved text–latent trajectories, aligning latent spans with future visual states. (Right) LA-DAPO further optimizes sampled trajectories with outcome-contrastive and temporal-diversity rewards.
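The visual-gain ranking can be sketched as follows. This is a minimal illustration, not the released curation code: it assumes p_t is a prediction score from a text-only baseline and p_v is the score when intermediate future visual hints are provided, and the `Candidate` class, `select_by_visual_gain` function, and `min_gain` threshold are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    sample_id: str
    p_text: float    # p_t: assumed score of the text-only baseline
    p_visual: float  # p_v: assumed score with future visual hints

    @property
    def visual_gain(self) -> float:
        return self.p_visual - self.p_text


def select_by_visual_gain(candidates: List[Candidate], top_k: int,
                          min_gain: float = 0.0) -> List[Candidate]:
    """Keep candidates whose gain exceeds min_gain, ranked by gain, top_k at most."""
    kept = [c for c in candidates if c.visual_gain > min_gain]
    kept.sort(key=lambda c: c.visual_gain, reverse=True)
    return kept[:top_k]


# Toy scores, made up for illustration only.
pool = [
    Candidate("a", p_text=0.4, p_visual=0.9),  # hints help a lot
    Candidate("b", p_text=0.7, p_visual=0.6),  # hints hurt, so it is dropped
    Candidate("c", p_text=0.2, p_visual=0.5),  # hints help somewhat
]
selected_ids = [c.sample_id for c in select_by_visual_gain(pool, top_k=2)]
print(selected_ids)  # ['a', 'c']
```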

Latent-span usage by reasoning depth

Figure 4. Latent-span usage by reasoning depth. Donuts show span-count distributions.

RL data scaling on TwiFF-Bench

Figure 5. RL data scaling on TwiFF-Bench. Scores improve as LA-DAPO uses 5K, 10K, and 20K retained visual-gain samples.


🚀 Getting Started

# Install
pip install -r requirements_sft.txt
pip install -r requirements_rl.txt
cd RL_v2 && pip install -e . && cd ..
cd lmms-eval && pip install -e . && cd ..

# Replace chat_template.json before training (once on the base Qwen3-VL checkpoint)
cp chat_template.json /path/to/Qwen3-VL-8B-Instruct/chat_template.json

# SFT — edit MODEL_NAME / DATA_PATH / OUTPUT_DIR in scripts/train_twiff.sh
bash scripts/train_twiff.sh

# RL — set checkpoint, data, and LLM-as-judge API (OpenAI-compatible, e.g. Qwen3.6-27B)
cd RL_v2
MODEL_PATH=/path/to/Future-L1-SFT \
TRAIN_FILES=/path/to/RL.json \
JUDGE_API_URL=http://localhost:8000/v1 \
JUDGE_API_NAME=your-judge-model \
JUDGE_API_KEY=your-api-key \
bash train.sh method
cd ..   # back to the repository root

# Evaluation — edit model_path in the eval scripts; TwiFF-Bench also needs lmms-eval/.env
cd lmms-eval
cp .env.example .env   # fill OPENAI_API_KEY, OPENAI_API_BASE, LOCAL_LLM
bash examples/eval_futurebench_future_l1.sh
bash examples/eval_twiffbench_future_l1.sh
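Before launching the RL step, it may help to confirm that the environment variables shown in the command above are actually set. The sketch below is a hypothetical pre-flight check, not a script shipped with the repository; it assumes `train.sh` needs all five variables, which the README implies but does not state.

```python
from typing import List, Mapping

# Variable names copied from the RL command above.
REQUIRED_RL_VARS = (
    "MODEL_PATH",
    "TRAIN_FILES",
    "JUDGE_API_URL",
    "JUDGE_API_NAME",
    "JUDGE_API_KEY",
)


def missing_rl_vars(env: Mapping[str, str]) -> List[str]:
    """Return the required variable names that are unset or blank in env."""
    return [name for name in REQUIRED_RL_VARS if not env.get(name, "").strip()]


# Example with a partial environment; pass os.environ to check the real one.
example_env = {
    "MODEL_PATH": "/path/to/Future-L1-SFT",
    "TRAIN_FILES": "/path/to/RL.json",
    "JUDGE_API_URL": "http://localhost:8000/v1",
}
missing = missing_rl_vars(example_env)
print(missing)  # ['JUDGE_API_NAME', 'JUDGE_API_KEY']
```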

🙏 Acknowledgements

We gratefully acknowledge the contributions of the open-source community.


📖 Citation

@article{tbd,
  title   = {TBD},
  author  = {TBD},
  year    = {TBD}
}
