Summary
PR #1056 implements float-domain bicubic resize for Gemma4 HF parity. Performance is competitive for typical VLM images (≤2MP) but 5× slower than torchvision for 12MP+ images.
Current performance (resize only, no JPEG decode)
| Image | Torchvision CPU | Our C++ | Ratio |
|---|---|---|---|
| 224×224 → 768×768 | 16.0 ms | 3.0 ms | 0.19× ✅ |
| 640×480 → 912×672 | 16.1 ms | 5.5 ms | 0.34× ✅ |
| 1300×876 → 960×624 | 16.1 ms | 10.5 ms | 0.65× ✅ |
| 1920×1080 → 1056×576 | 16.3 ms | 16.8 ms | 1.03× ✅ |
| 3024×4032 → 672×912 | 16.8 ms | 81.8 ms | 4.87× ❌ |
Why torchvision is faster for large images
Torchvision uses ATen's `_upsample_bicubic2d_aa` kernel (`aten/src/ATen/native/cpu/UpSampleKernel.cpp`), which has:
- Multi-threaded execution via `at::parallel_for` (OpenMP)
- AVX2/NEON SIMD vectorization (specialized uint8 paths, 8 floats per instruction)
- Precomputed weight tensors for batched multiply-add
- ~16 ms fixed overhead from tensor creation and thread-pool setup (which is why it is slower for small images)
Proposed optimizations
- OpenMP parallelism for vertical pass: The vertical resample loop iterates over output rows independently — adding `#pragma omp parallel for` would parallelize across cores with minimal code change.
- AVX2 intrinsics for horizontal kernel: The horizontal inner loop does 3 multiply-adds per kernel tap — it could process 8 taps simultaneously with `_mm256_fmadd_ps`.
- Row-batch vertical processing: Process multiple output columns per iteration to improve instruction-level parallelism in the vertical pass.
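The first optimization can be sketched as below. This is a hypothetical shape of the vertical pass, not the actual PR #1056 code — the function name `resample_vertical` and the `src_rows`/`weights` layout are assumptions for illustration. The key point is that each output row reads only its own weight slice and source rows, so the outer loop parallelizes safely:

```cpp
#include <cstddef>

// Hypothetical sketch: parallelizing the vertical resample pass with OpenMP.
// src:      in_height × width single-channel float plane
// dst:      out_height × width output plane
// src_rows: first contributing source row for each output row
// weights:  taps contiguous kernel weights per output row
void resample_vertical(const float* src, float* dst,
                       int width, int out_height, int taps,
                       const int* src_rows, const float* weights) {
  // Each output row depends only on its own weights and source rows,
  // so rows can be distributed across threads with no synchronization.
  #pragma omp parallel for schedule(static)
  for (int y = 0; y < out_height; ++y) {
    const float* w = weights + static_cast<std::size_t>(y) * taps;
    const int r0 = src_rows[y];
    for (int x = 0; x < width; ++x) {
      float acc = 0.0f;
      for (int t = 0; t < taps; ++t)
        acc += w[t] * src[static_cast<std::size_t>(r0 + t) * width + x];
      dst[static_cast<std::size_t>(y) * width + x] = acc;
    }
  }
}
```

Without `-fopenmp` the pragma is ignored and the function still compiles and runs serially, so this change degrades gracefully on builds where OpenMP is not enabled.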
Notes
- No precedent for OpenMP or SIMD in ort-extensions' own source code (checked: `shared/`, `operators/`). All existing image processing is single-threaded scalar C++.
- Would need CMake changes to enable OpenMP and SIMD compiler flags.
- The current implementation uses the same algorithm as Pillow/torchvision (Keys cubic a=-0.5, separable, antialias) and matches torchvision output to within 5×10⁻⁵ max pixel diff.
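For reference, the Keys cubic with a = −0.5 that Pillow, torchvision, and this implementation share can be written as a small weight function (a sketch of the standard formula; the actual code may fold these weights into precomputed tables):

```cpp
#include <cmath>

// Keys cubic convolution kernel with a = -0.5.
// w(x) = (a+2)|x|^3 - (a+3)|x|^2 + 1          for |x| <= 1
//      = a|x|^3 - 5a|x|^2 + 8a|x| - 4a        for 1 < |x| < 2
//      = 0                                    otherwise
float keys_cubic(float x) {
  const float a = -0.5f;
  x = std::fabs(x);
  if (x < 1.0f)
    return ((a + 2.0f) * x - (a + 3.0f)) * x * x + 1.0f;
  if (x < 2.0f)
    return ((((x - 5.0f) * x + 8.0f) * x) - 4.0f) * a;  // Horner form of the 1<|x|<2 branch
  return 0.0f;
}
```

The 4-tap weights for any sample phase sum to 1 (partition of unity), which is why the separable two-pass resize preserves constant regions exactly.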
Priority
Low — current performance is acceptable for typical VLM images (≤2MP). Only impacts 12MP+ unresized phone photos, which are uncommon in production VLM pipelines.
References
- `aten/src/ATen/native/cpu/UpSampleKernel.cpp` (lines 1994–2021)