
Conversation

@tkocmathla (Author):

Fixes #4210.

@sjarus requested a review from @zjgarvey (December 9, 2025, 17:47)
@zjgarvey (Collaborator) left a comment:

Nice! Thanks for finding this. It does seem strange that we would be checking for an explicit args.temp_dir for the external data.

I have a few comments. Most importantly, I think we should find a way to test without needing to generate a 2GB random tensor.

mlir_file = run_path / f"{model_name}-explicit_temp_implicit_data.torch.mlir"
onnx.save(onnx_model, model_file)
temp_dir = run_path / "temp"
temp_dir.mkdir(exist_ok=True)
@zjgarvey (Collaborator):

May be good to use tempfile.TemporaryDirectory in case there is an exception thrown before the temp files are cleaned up by the importer.
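
For illustration, a minimal sketch of that setup (reusing the model_name and onnx_model names from the test above; the exact wiring is illustrative):

import tempfile
from pathlib import Path

import onnx

# The context manager deletes the directory even if the importer or an
# assertion raises, so stray temp files don't leak between test runs.
with tempfile.TemporaryDirectory() as td:
    temp_dir = Path(td)
    model_file = temp_dir / f"{model_name}.onnx"  # model_name from the test
    onnx.save(onnx_model, model_file)             # onnx_model from the test
    # ... invoke the importer with temp_dir as its --temp-dir ...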

Comment on lines +94 to +97
byte_size = numpy.dtype(dtype).itemsize
tensor_size = onnx.checker.MAXIMUM_PROTOBUF // byte_size + 1
large_tensor = numpy.random.rand(tensor_size).astype(dtype)
assert large_tensor.nbytes > onnx.checker.MAXIMUM_PROTOBUF
@zjgarvey (Collaborator):

It does seem rather unfortunate that we need to make the CI create a 2GB tensor just to test this.

How long does this test take to run? If it takes more than a minute, I think we should refactor the load_onnx_model code so we can test the file-based shape inference directly.

@zjgarvey (Collaborator):

Maybe pulling out the file-based shape inference into a utility is just a good idea in general. https://github.com/llvm/torch-mlir/pull/4375/files#r2608194510

@zjgarvey (Collaborator):

Alternatively, we could try to mirror what is currently being done in IREE:

https://github.com/iree-org/iree/blob/d0dd8893758dcf40558a57916b869ba0babc0a95/compiler/bindings/python/iree/compiler/tools/import_onnx/__main__.py#L86

Ignore the params (this is primarily the reason the importer code has diverged in IREE, but there are some improvements which haven't found their way back to torch-mlir yet).

@tkocmathla (Author):

I agree this isn't a great unit test, and it takes too long to run for us to reasonably merge it. The point of it was to test the exact bug in this issue, though, not file-based shape inference in general. I'm definitely happy to hoist that code into a separate function, but I don't follow how that makes this bug easier to test. For this bug, I need a model that fails in-memory shape inference but passes file-based shape inference.

One other way to do this is to monkey-patch the onnx.checker.MAXIMUM_PROTOBUF constant in the test to a small value with unittest.mock.patch. What do you think of that?
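
Something like this minimal sketch (assuming a pytest-style test; run_import is a hypothetical stand-in for invoking the importer):

from unittest import mock

import numpy
import onnx.checker

def test_external_data_shape_inference(tmp_path):
    # Shrink the protobuf limit so a tiny tensor takes the same
    # file-based shape inference path as a real >2GB model. This works
    # as long as the importer reads onnx.checker.MAXIMUM_PROTOBUF at
    # call time rather than copying it at import time.
    with mock.patch.object(onnx.checker, "MAXIMUM_PROTOBUF", 1024):
        large_tensor = numpy.random.rand(1024).astype(numpy.float32)
        assert large_tensor.nbytes > onnx.checker.MAXIMUM_PROTOBUF
        run_import(tmp_path, large_tensor)  # hypothetical test helper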

@zjgarvey (Collaborator):

That makes sense. We do want to test the tool e2e to catch these issues.

A monkeypatch would be great if it works. Thanks for working with me on the test size.

# Load the temp file and the external data.
inferred_model = onnx.load(temp_inferred_file, load_external_data=False)
- data_dir = Path(input_dir if args.temp_dir is None else args.data_dir)
+ data_dir = Path(input_dir if args.data_dir is None else args.data_dir)
@zjgarvey (Collaborator):

I think this needs to be moved up above the condition on line 127 (if raw_model_modified), since the external data may end up being organized differently and this gets saved to the temp_dir.

E.g.,

data_dir = Path(args.data_dir or input_dir)
if raw_model_modified:
    ...
    data_dir = Path(temp_dir)
...

@tkocmathla (Author):

Great catch! I'll update this, and add a unit test for the raw_model_modified case.

Comment on lines 130 to 156
@zjgarvey (Collaborator):

Let's hoist this into a util function so we can test in isolation without needing to generate a 2GB tensor.

E.g.,

def _file_based_shape_infer(
    model_path: Path,
    data_dir: Path,
    temp_dir: Path,
) -> onnx.ModelProto:
    """docstring"""
    ...
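
For reference, a rough sketch of what the body could look like, built on onnx.shape_inference.infer_shapes_path (the details of what load_onnx_model currently does in this block are assumed here, not copied):

import onnx
import onnx.external_data_helper
import onnx.shape_inference
from pathlib import Path

def _file_based_shape_infer(
    model_path: Path,
    data_dir: Path,
    temp_dir: Path,
) -> onnx.ModelProto:
    """Infer shapes through files to sidestep the 2GB protobuf limit."""
    inferred_path = temp_dir / "inferred.onnx"
    # infer_shapes_path reads and writes the model on disk, so the
    # in-memory protobuf size limit never applies.
    onnx.shape_inference.infer_shapes_path(str(model_path), str(inferred_path))
    # Load the inferred graph without pulling in the external data...
    model = onnx.load(str(inferred_path), load_external_data=False)
    # ...then resolve external-data references against data_dir.
    onnx.external_data_helper.load_external_data_for_model(model, str(data_dir))
    return model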

@zjgarvey (Collaborator):

Re: the other comment. It may be good to do this still, but I agree that the monkeypatch is still a better way to test the actual issue you are addressing. Feel free to disregard this if you don't think it's worth the time now.

@tkocmathla requested a review from @zjgarvey (December 15, 2025, 15:14)
Linked issue #4210: ONNX importer fails when --temp-dir is set and --data-dir is not