Optimize bicubic resize with OpenMP + SIMD for large images #1057

@justinchuby

Description

Summary

PR #1056 implements float-domain bicubic resize for Gemma4 HF parity. Performance is competitive with torchvision for typical VLM images (≤2 MP) but roughly 5× slower for 12 MP+ images.

Current performance (resize only, no JPEG decode)

Resize (input → output)    Torchvision CPU    Our C++    Ratio (ours / torchvision)
224×224 → 768×768          16.0 ms            3.0 ms     0.19× ✅
640×480 → 912×672          16.1 ms            5.5 ms     0.34× ✅
1300×876 → 960×624         16.1 ms            10.5 ms    0.65× ✅
1920×1080 → 1056×576       16.3 ms            16.8 ms    1.03× ✅
3024×4032 → 672×912        16.8 ms            81.8 ms    4.87× ❌

Why torchvision is faster for large images

Torchvision uses ATen's _upsample_bicubic2d_aa kernel (aten/src/ATen/native/cpu/UpSampleKernel.cpp) which has:

  • Multi-threaded execution via at::parallel_for (OpenMP)
  • AVX2/NEON SIMD vectorization (specialized uint8 paths, 8 floats per instruction)
  • Precomputed weight tensors for batched multiply-add
  • ~16 ms fixed overhead from tensor creation and thread-pool setup (which is why it is slower than our scalar path for small images)

Proposed optimizations

  1. OpenMP parallelism for vertical pass: The vertical resample loop iterates over output rows independently — adding #pragma omp parallel for would parallelize across cores with minimal code change.

  2. AVX2 intrinsics for horizontal kernel: The horizontal inner loop does 3 multiply-adds per kernel tap — could process 8 taps simultaneously with _mm256_fmadd_ps.

  3. Row-batch vertical processing: Process multiple output columns per iteration to improve instruction-level parallelism in the vertical pass.
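The OpenMP change in item 1 could be sketched as below. This is a minimal illustration, not the actual ort-extensions code: `VerticalTap`, `ResampleVertical`, and the planar-float layout are all assumptions made for the example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical tap table: for each output row, the first contributing
// source row plus one weight per contributing row (precomputed once).
struct VerticalTap {
  int src_row_begin;
  std::vector<float> weights;
};

// Vertical resample over a planar float image. Each output row reads only
// its own tap entry and writes only its own row, so the outer loop is
// trivially parallel.
void ResampleVertical(const float* src, int width,
                      float* dst, int dst_h,
                      const std::vector<VerticalTap>& taps) {
#pragma omp parallel for  // ignored (warning only) without -fopenmp
  for (int y = 0; y < dst_h; ++y) {
    const VerticalTap& t = taps[y];
    for (int x = 0; x < width; ++x) {
      float acc = 0.0f;
      for (std::size_t k = 0; k < t.weights.size(); ++k)
        acc += t.weights[k] *
               src[(t.src_row_begin + static_cast<int>(k)) * width + x];
      dst[y * width + x] = acc;
    }
  }
}
```

Because the pragma degrades to a no-op when OpenMP is disabled, the scalar build keeps working while the CMake flags are being sorted out.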
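Item 2 could look roughly like this. The contiguous pixel/weight layout and the `HorizontalDot` helper are assumptions for illustration (the real inner loop works on interleaved RGB, hence the 3 multiply-adds per tap); the preprocessor guard keeps the file compiling and correct when the AVX2/FMA flags are off.

```cpp
#include <cstddef>
#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>
#endif

// Dot product of one output pixel's kernel taps against their weights.
// With -mavx2 -mfma, eight taps are fused-multiply-added per instruction;
// otherwise the scalar tail loop handles everything.
float HorizontalDot(const float* pixels, const float* weights, int ncoef) {
  float acc = 0.0f;
  int k = 0;
#if defined(__AVX2__) && defined(__FMA__)
  __m256 vacc = _mm256_setzero_ps();
  for (; k + 8 <= ncoef; k += 8) {
    __m256 p = _mm256_loadu_ps(pixels + k);
    __m256 w = _mm256_loadu_ps(weights + k);
    vacc = _mm256_fmadd_ps(p, w, vacc);  // vacc += p * w, 8 lanes at once
  }
  float lanes[8];
  _mm256_storeu_ps(lanes, vacc);  // horizontal sum of the 8 lanes
  for (int i = 0; i < 8; ++i) acc += lanes[i];
#endif
  for (; k < ncoef; ++k) acc += pixels[k] * weights[k];
  return acc;
}
```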

Notes

  • No precedent for OpenMP or SIMD in ort-extensions' own source code (checked: shared/, operators/). All existing image processing is single-threaded scalar C++.
  • Would need CMake changes to enable OpenMP and SIMD compiler flags.
  • The current implementation uses the same algorithm as Pillow/torchvision (Keys cubic a=-0.5, separable, antialias) and matches torchvision output to within 5×10⁻⁵ max pixel diff.
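For reference, the Keys cubic with a = −0.5 mentioned in the last bullet is the standard piecewise form below (written from the published formula, not copied from PR #1056); it makes a handy scalar oracle to validate any SIMD rewrite against.

```cpp
#include <cmath>

// Keys cubic convolution kernel with a = -0.5, the variant Pillow and
// torchvision use for antialiased bicubic resampling.
float CubicKernel(float x) {
  const float a = -0.5f;
  x = std::fabs(x);
  if (x <= 1.0f)
    return ((a + 2.0f) * x - (a + 3.0f)) * x * x + 1.0f;  // |x| in [0, 1]
  if (x < 2.0f)
    return (((x - 5.0f) * x + 8.0f) * x - 4.0f) * a;      // |x| in (1, 2)
  return 0.0f;
}
```

Weights sampled at half-pixel offsets (±0.5, ±1.5) sum to 1, the partition-of-unity property that precomputed weight tables rely on.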

Priority

Low — current performance is acceptable for typical VLM images (≤2MP). Only impacts 12MP+ unresized phone photos, which are uncommon in production VLM pipelines.
