
Conversation

@RaymondLi0 (Contributor) commented Dec 22, 2025

✨ Description

Tool to recursively discover datasets in a directory and generate a blended dataset config.
This tool walks through a directory tree, identifies datasets by their fast_llm_config*.yaml files,
and generates a config file that blends all discovered datasets with weights proportional to token counts.
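
For reference, the generated file is roughly of the following shape (illustrative paths and numbers only; the exact schema is whatever Fast-LLM's blended dataset config expects):

```yaml
# Illustrative example, not actual tool output.
type: blended
weights: [123.4, 45.6]   # proportional to the token counts of each dataset
datasets:
  - type: memmap
    path: /data/web/shard_0
  - type: memmap
    path: /data/code/shard_0
```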

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

@tscholak (Collaborator)

great!
though why isn't that built into fast-llm itself? ideally we shouldn't need this.
cc @jlamypoirier

@jlamypoirier (Collaborator) commented Dec 22, 2025

> great! though why isn't that built into fast-llm itself? ideally we shouldn't need this. cc @jlamypoirier

@tscholak @RaymondLi0 we already have all the machinery we need for blending datasets in GPTMemmapDatasetPreparator, and ideally we should extract and reuse it. What is missing for a standalone blending tool is dataset discovery and reading.

The discovery implementation here looks at the yaml configs, which causes complications because of all the ways they can be defined. A much simpler option would be to just look at the .fast_llm_dataset files and blend those.
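
Discovery would then reduce to a recursive glob, roughly (untested sketch):

```python
import pathlib


def discover_datasets(root: pathlib.Path) -> list[pathlib.Path]:
    # Each prepared dataset is identified directly by its .fast_llm_dataset file,
    # regardless of how the surrounding yaml configs are organized.
    return sorted(root.rglob("*.fast_llm_dataset"))
```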

As for reading, it should be done by instantiating the dataset; it will take care of everything and avoid trouble.

Lastly, note that almost every "tool" we've had so far quickly went out of sync and became useless. To prevent this script from suffering the same fate, we'll want to bring it inside fast_llm (it would be appropriate as a DatasetPreparator) and add tests for it.

@RaymondLi0 (Contributor, Author)

Made the following updates:

  • Moved the code under fast_llm/data/preparator as a DatasetPreparator subclass (rough skeleton sketched below).
  • Switched to finding .fast_llm_dataset files instead of the yaml configs.
  • Added tests.
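
Rough skeleton of where it ended up (simplified sketch, not the actual diff; the base-class import path, the run entry point and the input_path config field are assumptions):

```python
import pathlib

# Simplified sketch only: the exact DatasetPreparator interface in fast_llm may differ.
from fast_llm.data.preparator.config import DatasetPreparator


class BlendedConfigPreparator(DatasetPreparator):
    """Walk a directory tree, find .fast_llm_dataset files and emit a blended dataset config."""

    def run(self) -> None:
        # Discover every prepared dataset shard under the configured input directory.
        dataset_files = sorted(pathlib.Path(self._config.input_path).rglob("*.fast_llm_dataset"))
        ...  # read per-dataset token counts, build the blended config, dump it to yaml
```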

```python
def _read_memmap_num_tokens(memmap_path: pathlib.Path) -> int:
    """Read number of tokens from a .fast_llm_dataset memmap file."""
    # Import preprocessing and sample configs to register them
    import fast_llm.data.preprocessing.image_patch  # noqa
```
Collaborator:

import fast_llm.data.auto, though arguably not needed because the cli already does it.

```python
        return 0

    try:
        with memmap_path.open("rb") as stream:
```
Collaborator:

Not needed: you should be able to instantiate the dataset and get its num_tokens.
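
Something along these lines (untested sketch; the module path, class name and constructor below are assumptions, not verified against the current code):

```python
import pathlib

# Untested sketch: let the dataset class handle the memmap format instead of
# parsing the header by hand. Module path, class name and constructor signature
# are assumptions.
from fast_llm.data.dataset.memmap import MemmapDataset


def _read_memmap_num_tokens(memmap_path: pathlib.Path) -> int:
    dataset = MemmapDataset(name=memmap_path.stem, path=memmap_path)
    return dataset.num_tokens
```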

"weights": all_tokens,
}, total_tokens

def _create_hierarchical_config(
Collaborator:

Why make this hierarchical? It's not really needed, and it makes this class way more complicated than it needs to be. Plus, nested blended datasets aren't really a good idea and come with important downsides like extra overhead and randomness issues.

Contributor Author:

Creating groups of datasets makes it much more convenient to update the mixing weights.

Contributor Author:

Outputting a flat config would defeat the purpose of this tool, which is to enable easy configuration and manual update of the dataset mix.
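
For example (hypothetical paths and numbers), the grouped output is roughly:

```yaml
type: blended
weights: [0.8, 0.2]          # group-level mix: the only numbers a human needs to edit
datasets:
  - type: blended            # "web" group
    weights: [123.4, 98.7]   # generated from per-shard token counts
    datasets:
      - {type: memmap, path: /data/web/shard_0}
      - {type: memmap, path: /data/web/shard_1}
  - type: blended            # "code" group
    weights: [45.6, 44.1]
    datasets:
      - {type: memmap, path: /data/code/shard_0}
      - {type: memmap, path: /data/code/shard_1}
```

With a flat config, re-weighting "web" against "code" would mean recomputing every per-shard weight by hand.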

```python
        logger.warning(f" - {dataset_file.name}: skipping (0 tokens or read error)")
        return None
    logger.debug(f" - {dataset_file.name}: {num_tokens:,} tokens")
    return num_tokens / 1e9
```
Collaborator:

Better to leave it as an integer, as in the gpt memmap preparator.

Contributor Author:

This is also to facilitate human review of the configs. Is there some reason to keep it as an integer, other than consistency with gpt-memmap?

Collaborator:

The main goal was to avoid rounding errors and keep the token count explicit, compared to normalizing probabilities. The division by 1e9 doesn't prevent that though, so it's probably fine.

```
@@ -0,0 +1,56 @@
"""
```
Collaborator:

Why the new cli? The existing one already covers this.

Contributor Author:

It was for compatibility with commands that were run with the previous version, mostly for testing purposes. We can remove it before merging 👍
