
Conversation

@SamuelMarks (Collaborator) commented Dec 11, 2025

Description

Contributions by both @SamuelMarks and @A9isha to address the following:

  1. Type the RL API using the more explicit, typed pydantic syntax.
  2. Fix the broken rl_llama3_demo.ipynb, where vllm.yml was not found on Colab because of how MAXTEXT_PKG_DIR is resolved.
  3. Mitigate the issue where logs generated by vllm-tpu flood the notebook's output so badly that the evaluation accuracy numbers are not visible. We can remove these mitigation steps once b/473703277 is resolved.

Tests

Ran rl_llama3_demo.ipynb for both grpo and gspo-token on a v5p-8

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@codecov (bot) commented Jan 5, 2026

Codecov Report

❌ Patch coverage is 0% with 37 lines in your changes missing coverage. Please review.

| Files with missing lines      | Patch % | Lines         |
| ----------------------------- | ------- | ------------- |
| src/MaxText/rl/train_rl.py    | 0.00%   | 21 Missing ⚠️ |
| src/MaxText/max_logging.py    | 0.00%   | 9 Missing ⚠️  |
| src/MaxText/rl/evaluate_rl.py | 0.00%   | 4 Missing ⚠️  |
| src/MaxText/rl/utils_rl.py    | 0.00%   | 3 Missing ⚠️  |


@SamuelMarks (Collaborator, Author):

@A9isha let me know if you need a hand

@A9isha force-pushed the rl-api branch 2 times, most recently from 9be9fb4 to bb2e146 on January 6, 2026 at 08:13
@A9isha (Collaborator) commented Jan 6, 2026

> @A9isha let me know if you need a hand

@SamuelMarks thanks a lot for the offer to help!

Do you think it is possible to make the following more Pythonic?

```python
config_argv = [
    "",  # argv[0] placeholder
    config_file,
    f"model_name={MODEL_NAME}",
    f"tokenizer_path={HF_REPO_ID}",
    f"run_name={RUN_NAME}",
    f"chat_template_path={CHAT_TEMPLATE_PATH}",
    f"load_parameters_path={MODEL_CHECKPOINT_PATH}",
    f"base_output_directory={OUTPUT_DIRECTORY}",
    f"hf_access_token={HF_TOKEN}",
    f"learning_rate={LEARNING_RATE}",
    f"steps={STEPS}",
    f"grpo.num_generations={NUM_GENERATIONS}",
    f"grpo.grpo_beta={GRPO_BETA}",
    f"grpo.grpo_epsilon={GRPO_EPSILON}",
    f"chips_per_vm={CHIPS_PER_VM}",
    f"grpo.loss_algo={LOSS_ALGO}",
    "use_pathways=False",
]
```

to become something like

```python
config = mt_types.MaxTextConfig.get_configs(
    run_name="foo",
    learning_rate=3e-6,
    ...,
    GRPO(grpo_beta=0.2, num_iteration=4),
    ...,
)
```

@SamuelMarks (Collaborator, Author):

@A9isha Sure. In #2775 I split the initialize function in two, so now you can use initialize_pydantic. To go further, you can also change the input, e.g., to:

```python
# NOTE: "types" here refers to the MaxText pydantic types module,
# not the Python stdlib "types" module.
from types import (
    MaxTextConfig,
    RunInfo,
    Tokenizer,
    Checkpointing,
    HfDataset,
    Optimizer,
    TrainingLoop,
    GRPO,
    RLHardware,
)

# 1. Construct semantic groupings using specific subclasses
run_config = RunInfo(
    base_config=config_file,
    model_name=MODEL_NAME,
    run_name=RUN_NAME,
    base_output_directory=OUTPUT_DIRECTORY
)

tokenizer_config = Tokenizer(
    tokenizer_path=HF_REPO_ID,
    chat_template_path=CHAT_TEMPLATE_PATH
)

checkpoint_config = Checkpointing(
    load_parameters_path=MODEL_CHECKPOINT_PATH
)

dataset_config = HfDataset(
    hf_access_token=HF_TOKEN
)

optimizer_config = Optimizer(
    learning_rate=LEARNING_RATE
)

training_config = TrainingLoop(
    steps=STEPS
)

# Note: The 'grpo.' prefix from argv is removed because these are direct fields of the GRPO class
grpo_config = GRPO(
    num_generations=NUM_GENERATIONS,
    grpo_beta=GRPO_BETA,
    grpo_epsilon=GRPO_EPSILON,
    loss_algo=LOSS_ALGO
)

hardware_config = RLHardware(
    chips_per_vm=CHIPS_PER_VM,
    use_pathways=False
)

# 2. Merge into the final MaxTextConfig
# We assume standard defaults for any mixins not explicitly initialized above.
final_config_dict = {}

# Merge order only matters if keys collide; these groupings are disjoint,
# so a sequential update captures all the overrides defined above.
sub_configs = [
    run_config,
    tokenizer_config,
    checkpoint_config,
    dataset_config,
    optimizer_config,
    training_config,
    grpo_config,
    hardware_config
]

for cfg in sub_configs:
    # exclude_unset=True keeps only the values explicitly provided above,
    # letting MaxTextConfig fall back to its own defaults for everything
    # else (exclude_defaults=True would also drop values explicitly set
    # to their defaults, which is usually not what you want here).
    final_config_dict.update(cfg.model_dump(exclude_unset=True))

max_text_config = MaxTextConfig(**final_config_dict)
```
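For context, here is a minimal sketch of how one of these pydantic groupings might be declared, patterned on the `class GRPO(BaseModel)` snippet visible later in this review; the defaults below are placeholders for illustration, not the values shipped in this PR:

```python
from pydantic import BaseModel, Field

class GRPO(BaseModel):
    """Sketch of the GRPO grouping used above; the real class in this PR
    likely carries more fields and different defaults."""
    num_generations: int = Field(4, description="Completions sampled per prompt.")
    grpo_beta: float = Field(0.04, description="KL penalty coefficient.")
    grpo_epsilon: float = Field(0.2, description="Clipping epsilon for the policy ratio.")
    loss_algo: str = Field("grpo", description="Either 'grpo' or 'gspo-token'.")
```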

@A9isha force-pushed the rl-api branch 7 times, most recently from af26177 to 3a8a55a on January 7, 2026 at 10:48
@A9isha (Collaborator) commented Jan 7, 2026

@SamuelMarks thank you for thinking about this. Do you think you could create a separate PR to change the config_argv in rl_llama3_demo.ipynb following your idea, such that config_argv is no longer a list but instead looks something like:

```python
config = mt_types.MaxTextConfig.get_configs(
    run_name="foo",
    learning_rate=3e-6,
    ...,
    GRPO(grpo_beta=0.2, num_iteration=4),
    ...,
)
```

@SamuelMarks (Collaborator, Author):

@A9isha sure, here you go #2915

PS: I've got some research deadlines, so if any follow-up reviews are required I should have some time over the weekend.

" raise RuntimeError(\"OUTPUT_DIRECTORY is not set\")\n",
" \n",
"os.environ[\"HF_TOKEN\"] = HF_TOKEN\n",
"if \"MAXTEXT_PKG_DIR\" not in os.environ:\n",
Collaborator:

Why do we need to set this env var?

Collaborator:

Great question! Basically, the path to vllm.yml is built using MAXTEXT_PKG_DIR, and if it is not set, the code falls back to the current file's path. On Colab that resolves to .../MaxText/examples, which causes the problem.
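To illustrate the failure mode (a sketch only; the actual path construction in the repo may differ, and the "configs/" subpath below is an assumption):

```python
import os

# If MAXTEXT_PKG_DIR is unset, fall back to the directory of the current
# file -- on Colab this resolves under .../MaxText/examples, so the
# relative lookup for vllm.yml misses. Setting the env var up front,
# as the notebook now does, avoids the bad fallback.
pkg_dir = os.environ.get("MAXTEXT_PKG_DIR", os.path.dirname(os.path.abspath(__file__)))
vllm_config_path = os.path.join(pkg_dir, "configs", "vllm.yml")  # "configs/" is assumed
```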

@NicoGrande for visibility

```python
vllm_hf_config_path: str = Field("", description="Path to HuggingFace model config for MaxText model.")


class GRPO(BaseModel):
```
Collaborator:

Should we call this RL instead?

```diff
 # Start training

-max_logging.log("Starting RL training...")
+max_logging.warning("Starting RL training...")
```
Collaborator:

It should be info, right?

Collaborator:

My goal is to make these logs from train_rl.py show up while filtering most of the logs from tpu-inference in the notebook's output cell (until this PR, the train_rl.py logs were not displayed at all in notebooks, which is bad). The problem is that a lot of info/warning logs get printed by tpu-inference, and even with the NoisyLogFilter not all of them are filtered out. Additionally, for some reason, the max_logging.log calls from train_rl.py also get hidden in the notebook. This complication arises from the various import orders of absl/logging (including the indirection through max_logging.py), from max_logging using absl while other dependencies use stdlib logging, and from the sheer volume of logs from tpu-inference. Once b/473703277 is resolved we can revert this.
So we could either use print or use max_logging.warning (to bump up the priority) so it gets shown for sure. WDYT?
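For readers following along, a filter like the NoisyLogFilter mentioned above might look roughly like this (a minimal stdlib-logging sketch; the real filter in this PR and the exact logger-name prefixes are assumptions):

```python
import logging

class NoisyLogFilter(logging.Filter):
    """Drop records emitted by chatty third-party loggers."""

    NOISY_PREFIXES = ("vllm", "tpu_inference")  # assumed logger-name prefixes

    def filter(self, record: logging.LogRecord) -> bool:
        # Keep the record only if it does not originate from a noisy logger.
        return not record.name.startswith(self.NOISY_PREFIXES)

# Attach to every root handler so filtered records never reach the notebook cell.
for handler in logging.getLogger().handlers:
    handler.addFilter(NoisyLogFilter())
```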
