
Conversation

@RaymondLi0 (Contributor) commented Dec 22, 2025

✨ Description

Tool to recursively discover datasets in a directory and generate a blended dataset config.
This tool walks through a directory tree, identifies datasets by their fast_llm_config*.yaml files,
and generates a config file that blends all discovered datasets with weights proportional to token counts.
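
For reference, the generated file is roughly of the following shape (illustrative paths and numbers only; the exact schema is whatever Fast-LLM's blended dataset config expects):

```yaml
# Illustrative example, not actual tool output.
type: blended
weights: [123.4, 45.6]   # proportional to the token counts of each dataset
datasets:
  - type: memmap
    path: /data/web/shard_0
  - type: memmap
    path: /data/code/shard_0
```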

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

@tscholak (Collaborator)

great!
though why isn't that built into fast-llm itself? ideally we shouldn't need this.
cc @jlamypoirier

@jlamypoirier (Collaborator) commented Dec 22, 2025

> great! though why isn't that built into fast-llm itself? ideally we shouldn't need this. cc @jlamypoirier

@tscholak @RaymondLi0 we already have all the machinery we need for blending datasets in GPTMemmapDatasetPreparator, and ideally we should extract and reuse it. What is missing for a standalone blending tool is dataset discovery and reading.

The discovery implementation here looks at the yaml configs, which causes complications because of all the ways they can be defined. A much simpler option would be to just look at the .fast_llm_dataset files and blend those.
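
Discovery would then reduce to a recursive glob, roughly (untested sketch):

```python
import pathlib


def discover_datasets(root: pathlib.Path) -> list[pathlib.Path]:
    # Each prepared dataset is identified directly by its .fast_llm_dataset file,
    # regardless of how the surrounding yaml configs are organized.
    return sorted(root.rglob("*.fast_llm_dataset"))
```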

As for reading, it should be done by instantiating the dataset; it will take care of everything and avoid trouble.

Lastly, note that almost every "tool" we've had so far quickly went out of sync and became useless. To prevent this script from suffering the same fate, we'll want to bring it inside fast_llm (it would be appropriate as a DatasetPreparator) and add tests for it.

@RaymondLi0 (Contributor, Author)

Made the following updates:

  • Moved the code under fast_llm/data/preparator as a DatasetPreparator subclass (rough skeleton sketched below).
  • Switched to finding .fast_llm_dataset files instead of the yaml configs.
  • Added tests.
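
Rough skeleton of where it ended up (simplified sketch, not the actual diff; the base-class import path, the run entry point and the input_path config field are assumptions):

```python
import pathlib

# Simplified sketch only: the exact DatasetPreparator interface in fast_llm may differ.
from fast_llm.data.preparator.config import DatasetPreparator


class BlendedConfigPreparator(DatasetPreparator):
    """Walk a directory tree, find .fast_llm_dataset files and emit a blended dataset config."""

    def run(self) -> None:
        # Discover every prepared dataset shard under the configured input directory.
        dataset_files = sorted(pathlib.Path(self._config.input_path).rglob("*.fast_llm_dataset"))
        ...  # read per-dataset token counts, build the blended config, dump it to yaml
```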

```python
def _read_memmap_num_tokens(memmap_path: pathlib.Path) -> int:
    """Read number of tokens from a .fast_llm_dataset memmap file."""
    # Import preprocessing and sample configs to register them
    import fast_llm.data.preprocessing.image_patch  # noqa
```
Collaborator:

import fast_llm.data.auto, though arguably not needed because the cli already does it.

```python
        return 0

    try:
        with memmap_path.open("rb") as stream:
```
Collaborator:

Not needed: you should be able to instantiate the dataset and get its num_tokens.
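
Something along these lines (untested sketch; the module path, class name and constructor below are assumptions, not verified against the current code):

```python
import pathlib

# Untested sketch: let the dataset class handle the memmap format instead of
# parsing the header by hand. Module path, class name and constructor signature
# are assumptions.
from fast_llm.data.dataset.memmap import MemmapDataset


def _read_memmap_num_tokens(memmap_path: pathlib.Path) -> int:
    dataset = MemmapDataset(name=memmap_path.stem, path=memmap_path)
    return dataset.num_tokens
```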

"weights": all_tokens,
}, total_tokens

def _create_hierarchical_config(
Collaborator:

Why make this hierarchical? It's not really needed, and it makes this class way more complicated than it needs to be. Plus, nested blended datasets aren't really a good idea and come with important downsides like extra overhead and randomness issues.

Contributor Author:

Creating groups of datasets makes it much more convenient to update the mixing weights.

Contributor Author:

Outputting a flat config would defeat the purpose of this tool, which is to enable easy configuration and manual update of the dataset mix.
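
For example (hypothetical paths and numbers), the grouped output is roughly:

```yaml
type: blended
weights: [0.8, 0.2]          # group-level mix: the only numbers a human needs to edit
datasets:
  - type: blended            # "web" group
    weights: [123.4, 98.7]   # generated from per-shard token counts
    datasets:
      - {type: memmap, path: /data/web/shard_0}
      - {type: memmap, path: /data/web/shard_1}
  - type: blended            # "code" group
    weights: [45.6, 44.1]
    datasets:
      - {type: memmap, path: /data/code/shard_0}
      - {type: memmap, path: /data/code/shard_1}
```

With a flat config, re-weighting "web" against "code" would mean recomputing every per-shard weight by hand.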

```python
        logger.warning(f" - {dataset_file.name}: skipping (0 tokens or read error)")
        return None
    logger.debug(f" - {dataset_file.name}: {num_tokens:,} tokens")
    return num_tokens / 1e9
```
Collaborator:

Better to leave it as an integer, as in the gpt memmap preparator.

Contributor Author:

This is also to facilitate human review of the configs. Is there some reason to keep it as an integer, other than consistency with gpt-memmap?

Collaborator:

The main goal was to avoid rounding errors and keep the token count explicit, compared to normalizing probabilities. The division by 1e9 doesn't prevent that though, so it's probably fine.

```
@@ -0,0 +1,56 @@
"""
```
Collaborator:

Why the new cli? The existing one already covers this.

Contributor Author:

It was for compatibility with commands that were run with the previous version, mostly for testing purposes. We can remove it before merging 👍
