Summary
PR #1056 implements float-domain bicubic resize for Gemma4 HF parity. Performance is competitive for typical VLM images (≤2MP) but 5× slower than torchvision for 12MP+ images.
Current performance (resize only, no JPEG decode)
| Image | Torchvision CPU | Our C++ | Ratio |
|---|---|---|---|
| 224×224 → 768×768 | 16.0 ms | 3.0 ms | 0.19× ✅ |
| 640×480 → 912×672 | 16.1 ms | 5.5 ms | 0.34× ✅ |
| 1300×876 → 960×624 | 16.1 ms | 10.5 ms | 0.65× ✅ |
| 1920×1080 → 1056×576 | 16.3 ms | 16.8 ms | 1.03× ✅ |
| 3024×4032 → 672×912 | 16.8 ms | 81.8 ms | 4.87× ❌ |
Why torchvision is faster for large images
Torchvision uses ATen's `_upsample_bicubic2d_aa` kernel (`aten/src/ATen/native/cpu/UpSampleKernel.cpp`), which has:
- Multi-threaded execution via `at::parallel_for` (OpenMP)
- AVX2/NEON SIMD vectorization (specialized uint8 paths, 8 floats per instruction)
- Precomputed weight tensors for batched multiply-add
- ~16 ms fixed overhead from tensor creation and thread-pool setup (which is why it is slower for small images)
Proposed optimizations
- OpenMP parallelism for vertical pass: The vertical resample loop iterates over output rows independently — adding `#pragma omp parallel for` would parallelize across cores with minimal code change.
- AVX2 intrinsics for horizontal kernel: The horizontal inner loop does 3 multiply-adds per kernel tap — it could process 8 taps simultaneously with `_mm256_fmadd_ps`.
- Row-batch vertical processing: Process multiple output columns per iteration to improve instruction-level parallelism in the vertical pass.
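The first optimization can be sketched as below. This is a hypothetical shape of the vertical pass, not the actual PR #1056 code — the function name `resample_vertical` and the `src_rows`/`weights` layout are assumptions for illustration. The key point is that each output row reads only its own weight slice and source rows, so the outer loop parallelizes safely:

```cpp
#include <cstddef>

// Hypothetical sketch: parallelizing the vertical resample pass with OpenMP.
// src:      in_height × width single-channel float plane
// dst:      out_height × width output plane
// src_rows: first contributing source row for each output row
// weights:  taps contiguous kernel weights per output row
void resample_vertical(const float* src, float* dst,
                       int width, int out_height, int taps,
                       const int* src_rows, const float* weights) {
  // Each output row depends only on its own weights and source rows,
  // so rows can be distributed across threads with no synchronization.
  #pragma omp parallel for schedule(static)
  for (int y = 0; y < out_height; ++y) {
    const float* w = weights + static_cast<std::size_t>(y) * taps;
    const int r0 = src_rows[y];
    for (int x = 0; x < width; ++x) {
      float acc = 0.0f;
      for (int t = 0; t < taps; ++t)
        acc += w[t] * src[static_cast<std::size_t>(r0 + t) * width + x];
      dst[static_cast<std::size_t>(y) * width + x] = acc;
    }
  }
}
```

Without `-fopenmp` the pragma is ignored and the function still compiles and runs serially, so this change degrades gracefully on builds where OpenMP is not enabled.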
Notes
- No precedent for OpenMP or SIMD in ort-extensions' own source code (checked: `shared/`, `operators/`). All existing image processing is single-threaded scalar C++.
- Would need CMake changes to enable OpenMP and SIMD compiler flags.
- The current implementation uses the same algorithm as Pillow/torchvision (Keys cubic a=-0.5, separable, antialias) and matches torchvision output to within 5×10⁻⁵ max pixel diff.
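For reference, the Keys cubic with a = −0.5 that Pillow, torchvision, and this implementation share can be written as a small weight function (a sketch of the standard formula; the actual code may fold these weights into precomputed tables):

```cpp
#include <cmath>

// Keys cubic convolution kernel with a = -0.5.
// w(x) = (a+2)|x|^3 - (a+3)|x|^2 + 1          for |x| <= 1
//      = a|x|^3 - 5a|x|^2 + 8a|x| - 4a        for 1 < |x| < 2
//      = 0                                    otherwise
float keys_cubic(float x) {
  const float a = -0.5f;
  x = std::fabs(x);
  if (x < 1.0f)
    return ((a + 2.0f) * x - (a + 3.0f)) * x * x + 1.0f;
  if (x < 2.0f)
    return ((((x - 5.0f) * x + 8.0f) * x) - 4.0f) * a;  // Horner form of the 1<|x|<2 branch
  return 0.0f;
}
```

The 4-tap weights for any sample phase sum to 1 (partition of unity), which is why the separable two-pass resize preserves constant regions exactly.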
Priority
Low — current performance is acceptable for typical VLM images (≤2MP). Only impacts 12MP+ unresized phone photos, which are uncommon in production VLM pipelines.
References
- `aten/src/ATen/native/cpu/UpSampleKernel.cpp` (lines 1994–2021)