Skip to content

Conversation

@mpeex
Copy link

@mpeex mpeex commented Dec 8, 2025

…sample

Before submitting
  • [N ] Was this discussed/agreed via a Github issue? (no need for typos and docs improvements)
  • [ Y] Did you read the contributor guideline, Pull Request section?
  • [ N] Did you make sure to update the docs?
  • [ N] Did you write any new necessary tests?

What does this PR do?

  1. enable support for HuggingFace revision/branches as per 1 and 2
  2. transform function of StreamingDataset can return None to skip the sample

Fixes # (issue).

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun? oh ye

Make sure you had fun coding 🙃

item = self.transform(item)

return item
return item if item else self.__next__()
Copy link
Collaborator

@deependujha deependujha Dec 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes more sense to have this in __next__ method, and not in __getitem__.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the suggestion

@deependujha
Copy link
Collaborator

deependujha commented Dec 9, 2025

Hi @mpeex, can you share more details on when HF dataset's __getitem__ can return None.

If this is your specific use case, one option can be to override the __next__ method in your implementation.

class MyDS(StreamingDataset):
    def __next__(self, index):
         item = super().__next__(self)
         return item if item else self.__next__()

ds = MyDs(...)

Also, can you write some tests verifying the behaviour for HF dataset revision/branches?

@mpeex
Copy link
Author

mpeex commented Dec 9, 2025

hi @deependujha thanks for reviewing. The usecase is to filter samples while streaming using the transform function, for ex:

t = [partial(t_1, args..),
        ...
        partial(t_N, args..)
        ]
StreamingDatasets(uri, transform=t)

returning None in any of the t_x function can be used to skip the sample, very straightforward.
I understand your suggestion (I didn't think about it).

I will write some tests to verify the behaviour for HF dataset revision/branches.

thanks for your time

@philgzl
Copy link
Contributor

philgzl commented Dec 12, 2025

I think that skipping samples, whether that's via transform or by subclassing StreamingDataset, breaks integration with StreamingDataLoader and ParallelStreamingDataset. See #642.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants