Future-L1 teaches multimodal LLMs to alternate between language tokens and continuous latent visual spans, enabling compact future-state imagination before answering video event prediction (VEP) questions.
Figure 1. Text-CoT can be verbose and visually lossy, while pixel-space future simulation is computationally heavy. Future-L1 inserts compact latent visual spans that preserve dynamic future semantics without generating full frames.
- Interleaved latent visual reasoning. Future-L1 alternates between <reason> text and bounded <|latent_start|>…<|latent_end|> spans during autoregressive decoding, keeping dynamic visual structure in a continuous channel instead of verbalizing every intermediate hypothesis.
- Future-L1-50K. We curate 50K high-utility examples from TwiFF-style trajectories by visual-gain selection: retain samples where intermediate future visual hints measurably improve prediction over a text-only baseline.
- LA-DAPO RL. A latent-aware extension of DAPO with outcome-contrastive (R_ctr) and temporal-diversity (R_div) rewards that optimize sampled latent trajectories without intermediate-frame annotations at RL time.
- State-of-the-art VEP performance. Future-L1-RL reaches 85.4% on FutureBench and 3.04 average score on TwiFF-Bench, with especially strong gains on multi-hop and non-consecutive future-event splits.
- Compact inference. Accuracy improves through latent visual computation rather than long text-only chains or multi-turn search.
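To make the interleaved format from the first highlight concrete, here is a minimal Python sketch that splits a decoded trajectory into text and latent segments. It assumes the decoded string keeps the `<|latent_start|>` and `<|latent_end|>` delimiters and that whatever sits between them is a placeholder for the continuous latent vectors; the function name `split_interleaved` is hypothetical and not part of the released code.

```python
import re

# Delimiters named in the highlights; everything else here is illustrative.
LATENT_START = "<|latent_start|>"
LATENT_END = "<|latent_end|>"

# One bounded latent span, matched non-greedily and across newlines.
_SPAN = re.compile(
    re.escape(LATENT_START) + r"(.*?)" + re.escape(LATENT_END), re.DOTALL
)


def split_interleaved(decoded: str) -> list[tuple[str, str]]:
    """Return ("text", ...) and ("latent", ...) segments in decoding order."""
    segments: list[tuple[str, str]] = []
    cursor = 0
    for match in _SPAN.finditer(decoded):
        # Text that precedes this latent span, if any.
        if match.start() > cursor:
            segments.append(("text", decoded[cursor:match.start()]))
        # Span body: a stand-in for the continuous latent vectors (assumption).
        segments.append(("latent", match.group(1)))
        cursor = match.end()
    # Trailing text after the last span.
    if cursor < len(decoded):
        segments.append(("text", decoded[cursor:]))
    return segments
```

Calling `split_interleaved` on a string with one span in the middle yields a text segment, a latent segment, and a final text segment, which is a convenient way to inspect how many latent spans a sampled trajectory contains.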
Figure 2. (Left) Future-L1-50K is built by ranking TwiFF candidates by visual gain p_v − p_t. (Center) SFT trains interleaved text–latent trajectories, aligning latent spans with future visual states. (Right) LA-DAPO further optimizes sampled trajectories with outcome-contrastive and temporal-diversity rewards.
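The visual-gain ranking in the left panel can be sketched as follows. This is a minimal illustration, not the released curation pipeline: it assumes each candidate carries two scalar scores, one obtained with intermediate future visual hints and one from the text-only baseline, and that selection keeps the highest-gain candidates above a threshold. The names `Candidate`, `select_by_visual_gain`, `budget`, and `min_gain` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    sample_id: str
    p_visual: float  # score with future visual hints (assumed scalar)
    p_text: float    # score of the text-only baseline (assumed scalar)


def select_by_visual_gain(
    candidates: list[Candidate], budget: int, min_gain: float = 0.0
) -> list[Candidate]:
    """Keep up to `budget` candidates with the largest visual gain."""
    # Visual gain: score with visual hints minus text-only score.
    scored = [(c.p_visual - c.p_text, c) for c in candidates]
    # Drop candidates whose visual hints do not help beyond the threshold.
    kept = [(gain, c) for gain, c in scored if gain > min_gain]
    # Highest gain first, then truncate to the budget.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in kept[:budget]]
```

In this sketch a budget of 50,000 would correspond to the dataset size stated above; the actual threshold and scoring procedure are not specified in this README.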
# Install
pip install -r requirements_sft.txt
pip install -r requirements_rl.txt
cd RL_v2 && pip install -e . && cd ..
cd lmms-eval && pip install -e . && cd ..
# Replace chat_template.json before training (once on the base Qwen3-VL checkpoint)
cp chat_template.json /path/to/Qwen3-VL-8B-Instruct/chat_template.json
# SFT — edit MODEL_NAME / DATA_PATH / OUTPUT_DIR in scripts/train_twiff.sh
bash scripts/train_twiff.sh
# RL — set checkpoint, data, and LLM-as-judge API (OpenAI-compatible, e.g. Qwen3.6-27B)
cd RL_v2
MODEL_PATH=/path/to/Future-L1-SFT \
TRAIN_FILES=/path/to/RL.json \
JUDGE_API_URL=http://localhost:8000/v1 \
JUDGE_API_NAME=your-judge-model \
JUDGE_API_KEY=your-api-key \
bash train.sh method
# Evaluation (run from the repository root): edit model_path in the eval scripts; TwiFF-Bench also needs lmms-eval/.env
cd lmms-eval
cp .env.example .env # fill OPENAI_API_KEY, OPENAI_API_BASE, LOCAL_LLM
bash examples/eval_futurebench_future_l1.sh
bash examples/eval_twiffbench_future_l1.sh

We gratefully acknowledge the contributions of the open-source community, particularly:
- Qwen-VL-Series-Finetune, Latent Visual Reasoning (LVR), SwimBird, EasyR1
- Previous Work: LaViT — Aligning latent visual thoughts for multi-modal reasoning via teacher-extracted visual thought trajectories.
@article{tbd,
title = {TBD},
author = {TBD},
year = {TBD}
}


