💭 Imagine Before You Predict

Interleaved Latent Visual Reasoning for Video Event Prediction


Future-L1 teaches multimodal LLMs to alternate between language tokens and continuous latent visual spans, enabling compact future-state imagination before answering video event prediction questions.

Figure 1. Text-CoT can be verbose and visually lossy, while pixel-space future simulation is computationally heavy. Future-L1 inserts compact latent visual spans that preserve dynamic future semantics without generating full frames.

✨ Highlights

Interleaved latent visual reasoning. Future-L1 alternates between <reason> text and bounded <|latent_start|>…<|latent_end|> spans during autoregressive decoding, keeping dynamic visual structure in a continuous channel instead of verbalizing every intermediate hypothesis.
Future-L1-50K. We curate 50K high-utility examples from TwiFF-style trajectories by visual-gain selection: retain samples where intermediate future visual hints measurably improve prediction over a text-only baseline.
LA-DAPO RL. A latent-aware extension of DAPO with outcome-contrastive (R_ctr) and temporal-diversity (R_div) rewards that optimize sampled latent trajectories without intermediate-frame annotations at RL time.
State-of-the-art VEP performance. Future-L1-RL reaches 85.4% on FutureBench and 3.04 average score on TwiFF-Bench, with especially strong gains on multi-hop and non-consecutive future-event splits.
Compact inference. Accuracy improves through latent visual computation rather than long text-only chains or multi-turn search.
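
To make the interleaved format concrete, the following is a minimal sketch that splits a decoded trace into text segments and latent spans, using the `<|latent_start|>` and `<|latent_end|>` markers named above. It assumes the trace has been serialized to a string with some placeholder between the markers; the function names and the placeholder content are hypothetical and are not taken from the repository.

```python
import re
from typing import List, Tuple

# Marker strings as they appear in the highlights above.
LATENT_START = "<|latent_start|>"
LATENT_END = "<|latent_end|>"

# Non-greedy match so that each bounded span is captured separately.
_SPAN = re.compile(
    re.escape(LATENT_START) + r"(.*?)" + re.escape(LATENT_END),
    re.DOTALL,
)


def split_trajectory(trace: str) -> List[Tuple[str, str]]:
    """Return ("text", ...) and ("latent", ...) segments in decoding order.

    Assumption: latent spans are serialized as opaque placeholder strings;
    the real model carries continuous hidden states in these positions.
    """
    segments: List[Tuple[str, str]] = []
    cursor = 0
    for match in _SPAN.finditer(trace):
        text = trace[cursor:match.start()].strip()
        if text:
            segments.append(("text", text))
        segments.append(("latent", match.group(1)))
        cursor = match.end()
    tail = trace[cursor:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


def count_latent_spans(trace: str) -> int:
    """Count bounded latent spans in a serialized trace."""
    return sum(1 for kind, _ in split_trajectory(trace) if kind == "latent")
```

A helper like `count_latent_spans` is one way to gather the kind of span-count statistics that Figure 4 reports, though the repository may compute them differently.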

Figure 2. (Left) Future-L1-50K is built by ranking TwiFF candidates by visual gain p_v − p_t. (Center) SFT trains interleaved text–latent trajectories, aligning latent spans with future visual states. (Right) LA-DAPO further optimizes sampled trajectories with outcome-contrastive and temporal-diversity rewards.
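
The visual-gain ranking in the left panel can be sketched as below. This is an illustration only: it assumes p_v is a prediction score obtained with intermediate future visual hints and p_t is the score of the text-only baseline, and the `Candidate` class, the `min_gain` threshold, and the `budget` parameter are hypothetical names rather than the repository's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    sample_id: str
    p_visual: float  # assumed: score with future visual hints (p_v)
    p_text: float    # assumed: score of the text-only baseline (p_t)

    @property
    def visual_gain(self) -> float:
        # Visual gain as written in the caption: p_v - p_t.
        return self.p_visual - self.p_text


def select_by_visual_gain(
    candidates: List[Candidate],
    budget: int,
    min_gain: float = 0.0,
) -> List[Candidate]:
    """Keep candidates whose gain exceeds min_gain, ranked by gain, up to budget."""
    kept = [c for c in candidates if c.visual_gain > min_gain]
    kept.sort(key=lambda c: c.visual_gain, reverse=True)
    return kept[:budget]
```

With `budget=50_000` this would mirror the size of Future-L1-50K; the actual threshold and scoring procedure are not specified here.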

Figure 4. Latent-span usage by reasoning depth. Donuts show span-count distributions.

Figure 5. RL data scaling on TwiFF-Bench. Scores improve as LA-DAPO uses 5K, 10K, and 20K retained visual-gain samples.

🚀 Getting Started

# Install
pip install -r requirements_sft.txt
pip install -r requirements_rl.txt
cd RL_v2 && pip install -e . && cd ..
cd lmms-eval && pip install -e . && cd ..

# Replace chat_template.json before training (once on the base Qwen3-VL checkpoint)
cp chat_template.json /path/to/Qwen3-VL-8B-Instruct/chat_template.json

# SFT — edit MODEL_NAME / DATA_PATH / OUTPUT_DIR in scripts/train_twiff.sh
bash scripts/train_twiff.sh

# RL — set checkpoint, data, and LLM-as-judge API (OpenAI-compatible, e.g. Qwen3.6-27B)
cd RL_v2
MODEL_PATH=/path/to/Future-L1-SFT \
TRAIN_FILES=/path/to/RL.json \
JUDGE_API_URL=http://localhost:8000/v1 \
JUDGE_API_NAME=your-judge-model \
JUDGE_API_KEY=your-api-key \
bash train.sh method

# Evaluation (run from the repository root): edit model_path in the eval scripts; TwiFF-Bench also needs lmms-eval/.env
cd lmms-eval
cp .env.example .env   # fill OPENAI_API_KEY, OPENAI_API_BASE, LOCAL_LLM
bash examples/eval_futurebench_future_l1.sh
bash examples/eval_twiffbench_future_l1.sh
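
Because the RL step above is configured through environment variables, a small pre-flight check can catch a missing value before a long job starts. The sketch below is an optional helper, not part of the repository; it assumes all five variables shown in the RL command are required, which `train.sh` may not enforce.

```python
from typing import List, Mapping

# Variable names copied from the RL command above.
REQUIRED_RL_VARS = (
    "MODEL_PATH",
    "TRAIN_FILES",
    "JUDGE_API_URL",
    "JUDGE_API_NAME",
    "JUDGE_API_KEY",
)


def missing_rl_vars(env: Mapping[str, str]) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_RL_VARS if not env.get(name, "").strip()]
```

Calling `missing_rl_vars(os.environ)` before launching training returns an empty list when everything is set.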

🙏 Acknowledgements

We gratefully acknowledge the contributions of the open-source community, particularly:

Qwen-VL-Series-Finetune, Latent Visual Reasoning (LVR), SwimBird, EasyR1
Previous Work: LaViT — Aligning latent visual thoughts for multi-modal reasoning via teacher-extracted visual thought trajectories.

📖 Citation

@article{tbd,
  title   = {TBD},
  author  = {TBD},
  year    = {TBD}
}
