Conversation
great!
@tscholak @RaymondLi0 we already have all the machinery we need for blending datasets in […]. The discovery implementation here looks at the yaml configs, which causes complications because of all the ways they could be defined. A much simpler option would be to just look at the […]. As for reading, it should be done by instantiating the dataset; it will take care of everything and avoid trouble. Lastly, note that almost every "tool" we've had so far quickly went out of sync and became useless. To prevent this script from suffering the same fate, we'll want to bring it inside […].
Made the following updates:
return 0

try:
    with memmap_path.open("rb") as stream:
Not needed; you should be able to instantiate the dataset and get its num_tokens.
Actually, the compatibility check here: https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/data/preprocessing/language_model.py#L35
prevents instantiating the dataset without reading the config from the .fast_llm_dataset file, it seems.
Refactored it a bit to remove some redundant code
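A minimal sketch of the delegation being suggested in this thread, assuming the instantiated dataset exposes a `num_tokens` attribute. `build_dataset` and `HasTokenCount` are hypothetical stand-ins, not the real Fast-LLM API; the point is only that the script asks the dataset rather than opening the memmap files itself.

```python
import pathlib
import typing


class HasTokenCount(typing.Protocol):
    num_tokens: int


def build_dataset(config_path: pathlib.Path) -> HasTokenCount:
    """Hypothetical stand-in: instantiate the dataset from its config so that it
    handles the .fast_llm_dataset compatibility check and memmap details itself."""
    raise NotImplementedError


def count_tokens(config_path: pathlib.Path) -> int:
    # The tool never opens the .bin/.idx files directly; it just asks the dataset.
    return build_dataset(config_path).num_tokens
```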
| "weights": all_tokens, | ||
| }, total_tokens | ||
|
|
||
| def _create_hierarchical_config( |
Why make this hierarchical? It's not really needed, and it makes this class way more complicated than it needs to be. Plus, nested blended datasets aren't really a good idea and come with important downsides like extra overhead and randomness issues.
The hierarchical part creates groups of datasets, which makes it much more convenient to update the mixing weights.
Outputting a flat config would defeat the purpose of this tool, which is to enable easy configuration and manual update of the dataset mix.
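A minimal sketch of the grouping argument, assuming a blended config shaped like `{"type": "blended", "datasets": [...], "weights": [...]}`; the exact Fast-LLM schema and the "memmap" leaf entries below are illustrative assumptions, not the tool's real output.

```python
def blended(datasets: list[dict], weights: list[float]) -> dict:
    # Hypothetical helper mirroring the assumed blended-config shape.
    return {"type": "blended", "datasets": datasets, "weights": weights}


# Each group blends its own shards with token-proportional weights, and the top level
# blends the groups. Rebalancing e.g. web vs. code then means editing two numbers by
# hand instead of re-normalizing the weight of every leaf dataset.
web = blended(
    [{"type": "memmap", "path": "web/shard_0"}, {"type": "memmap", "path": "web/shard_1"}],
    [1.2e9, 0.8e9],
)
code = blended([{"type": "memmap", "path": "code/shard_0"}], [0.5e9])
config = blended([web, code], [0.7, 0.3])
```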
logger.warning(f" - {dataset_file.name}: skipping (0 tokens or read error)")
return None
logger.debug(f" - {dataset_file.name}: {num_tokens:,} tokens")
return num_tokens / 1e9
Better to leave it as an integer, as in the GPT memmap preparator.
This is also to facilitate human review of the configs. Is there some reason to keep it as an integer, other than consistency with gpt-memmap?
The main goal was to avoid rounding errors and make the token count explicit, compared to normalizing probabilities. The division by 1e9 doesn't prevent that, though, so it's probably fine.
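For illustration only, here are the three weight formats discussed in this thread, with made-up token counts:

```python
# Made-up token counts for three hypothetical datasets.
token_counts = [123_456_789, 987_654_321, 42_000_000]

# Integer token counts, as in the GPT memmap preparator: exact and explicit.
weights_tokens = token_counts

# Billions of tokens (the num_tokens / 1e9 variant): easier to eyeball in a yaml file
# and still exactly proportional, so it doesn't introduce rounding by itself.
weights_billions = [n / 1e9 for n in token_counts]

# Normalized probabilities: readable, but rounding creeps in once the values are
# written to the config with limited precision.
total = sum(token_counts)
weights_normalized = [round(n / total, 4) for n in token_counts]
```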
✨ Description
Tool to recursively discover datasets in a directory and generate a blended dataset config.
This tool walks through a directory tree, identifies datasets by their fast_llm_config*.yaml files,
and generates a config file that blends all discovered datasets with weights proportional to token counts.
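A rough sketch of the described behaviour. The leaf entry schema (`{"type": "file", ...}`) and the injected `count_tokens` callable are assumptions made for illustration, not the tool's actual output format or implementation.

```python
import pathlib
import typing


def discover_dataset_configs(root: pathlib.Path) -> list[pathlib.Path]:
    # Datasets are identified by their fast_llm_config*.yaml marker files.
    return sorted(root.rglob("fast_llm_config*.yaml"))


def make_blended_config(
    root: pathlib.Path, count_tokens: typing.Callable[[pathlib.Path], int]
) -> dict:
    configs = discover_dataset_configs(root)
    tokens = [count_tokens(path) for path in configs]
    # Blend every discovered dataset, with weights proportional to its token count.
    return {
        "type": "blended",
        "datasets": [{"type": "file", "path": str(path.relative_to(root))} for path in configs],
        "weights": tokens,
    }
```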
🔍 Type of change
Select all that apply: