TL;DR:
if self._cache_dir is None:
    # Only creates UUID with tables if NO cache_dir provided
    id_str = json.dumps({
        "root": self.root,
        "tables": sorted(self.tables),  # <-- Only used here
        "dataset_name": self.dataset_name,
        "dev": self.dev,
    })
    cache_dir = Path(...) / str(uuid.uuid5(uuid.NAMESPACE_DNS, id_str))
else:
    # If cache_dir IS provided explicitly, just use it as-is
    cache_dir = Path(self._cache_dir)
cache_dir.mkdir(parents=True, exist_ok=True)
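When the user specifies cache_dir explicitly, the path is used as-is and no UUID is computed, so self.tables, self.dev, and self.dataset_name play no part in identifying the cache. Two dataset configurations that differ in any of those fields but pass the same cache_dir therefore resolve to the same directory. This can cause confusion downstream during dataset initialization, because it is not obvious why certain tasks fail.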
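To make the collision concrete, here is a minimal standalone sketch that mirrors the branch logic above outside the class. The function names, the example paths, and the `default_base` parameter are all hypothetical stand-ins (the real default base path is elided in the snippet), and `resolve_cache_dir_namespaced` is only one possible mitigation, not the project's actual behavior.

```python
import json
import uuid
from pathlib import Path
from typing import List, Optional


def config_id(root: str, tables: List[str], dataset_name: str, dev: bool) -> str:
    """Deterministic ID built from the same fields the snippet hashes."""
    id_str = json.dumps({
        "root": root,
        "tables": sorted(tables),
        "dataset_name": dataset_name,
        "dev": dev,
    })
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, id_str))


def resolve_cache_dir(root: str, tables: List[str], dataset_name: str, dev: bool,
                      cache_dir: Optional[str] = None,
                      default_base: str = "/tmp/example_cache") -> Path:
    """Mirrors the reported logic (paths only, no directories are created)."""
    if cache_dir is None:
        return Path(default_base) / config_id(root, tables, dataset_name, dev)
    # Explicit cache_dir: the configuration is ignored entirely.
    return Path(cache_dir)


def resolve_cache_dir_namespaced(root: str, tables: List[str], dataset_name: str,
                                 dev: bool, cache_dir: Optional[str] = None,
                                 default_base: str = "/tmp/example_cache") -> Path:
    """Hypothetical alternative: always append the config ID as a subdirectory."""
    base = Path(default_base if cache_dir is None else cache_dir)
    return base / config_id(root, tables, dataset_name, dev)


# Two configurations that differ in tables and dev (example values only).
a = dict(root="/data/example", tables=["patients"], dataset_name="demo", dev=False)
b = dict(root="/data/example", tables=["patients", "labevents"],
         dataset_name="demo", dev=True)

shared = "/tmp/shared_cache"
default_differs = resolve_cache_dir(**a) != resolve_cache_dir(**b)
explicit_collides = (resolve_cache_dir(**a, cache_dir=shared)
                     == resolve_cache_dir(**b, cache_dir=shared))
namespaced_differs = (resolve_cache_dir_namespaced(**a, cache_dir=shared)
                      != resolve_cache_dir_namespaced(**b, cache_dir=shared))
```

With the default path the two configurations get separate directories, while with an explicit cache_dir they land in the same one. Appending the config ID under the user-supplied directory would keep them apart, at the cost of the cache no longer living at exactly the path the user passed in.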