Releases: amd/blis
AOCL 5.2 GA Release
AOCL-BLAS 5.2 Release Notes
Overview
This release includes significant performance improvements, new features, and critical bug fixes for the AOCL-BLAS linear algebra library, with optimizations specifically targeting AMD Zen4 and Zen5 architectures.
Performance Improvements
GEMM Improvements
- Tuned ZGEMM thresholds for Zen4 and Zen5 architectures
- Optimized AVX512 ZGEMM kernel and edge-case handling
- Improved ZGEMM packing kernel for M-dimension edge cases
- Developed optimal thread-selection logic for ZGEMM on Zen5
GEMV Enhancements
- Added DGEMV no-transpose multithreaded implementations
- Exported AVX512 DGEMV kernels
- DGEMV bug fixes and code cleanup
- Added ability to handle non-unit incx in GEMV transpose kernel
- Improved numerical precision in ZGEMV API
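For readers unfamiliar with the `incx` parameter mentioned above, the following is a minimal reference sketch of a transpose GEMV that reads `x` with a non-unit stride. The function name `ref_dgemv_t` is hypothetical, the code assumes column-major storage and positive strides only, and it is not the optimized AOCL-BLAS kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Reference y := alpha * A^T * x + beta * y.
 * A is m x n, column-major, with leading dimension lda.
 * Consecutive elements of x and y are incx and incy entries apart.
 * Simplification for this sketch: only positive strides are handled. */
void ref_dgemv_t(int m, int n, double alpha, const double *a, int lda,
                 const double *x, int incx, double beta, double *y, int incy)
{
    assert(m >= 0 && n >= 0 && incx > 0 && incy > 0);
    assert(lda >= (m > 0 ? m : 1));

    for (int j = 0; j < n; ++j) {
        /* Dot product of column j of A with the strided vector x. */
        double dot = 0.0;
        for (int i = 0; i < m; ++i)
            dot += a[(size_t)j * lda + i] * x[(size_t)i * incx];

        double *yj = &y[(size_t)j * incy];
        /* When beta is zero, y is overwritten without being read. */
        *yj = (beta == 0.0) ? alpha * dot : alpha * dot + beta * (*yj);
    }
}
```

With `incx = 2`, the routine reads every second entry of the buffer holding `x`, which is the access pattern a non-unit-stride kernel has to support.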
DCOPY Optimization
- Tuned DCOPY aocl_dynamic logic for Zen4/Zen5 architectures
New Features
- Additional build options to disable optimized code paths for smaller matrices in GEMM and TRSM
  - Useful for testing and benchmarking
  - Reduces numerical rounding differences when repeating calculations with different core counts
- Complete set of GEMMTR APIs implemented
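The rounding differences mentioned in the build-options item above arise because floating-point addition is not associative: splitting a reduction into a different number of partial sums can change the rounded result. The sketch below illustrates this in plain C. `chunked_sum` is a hypothetical helper written for this example and is not part of AOCL-BLAS; it assumes IEEE double arithmetic without compiler reassociation (no fast-math flags).

```c
#include <assert.h>

/* Sum x[0..n) as `parts` contiguous partial sums and then combine them,
 * mimicking how a reduction might be split across a number of threads. */
double chunked_sum(const double *x, int n, int parts)
{
    assert(n >= 0 && parts > 0);

    double total = 0.0;
    for (int p = 0; p < parts; ++p) {
        /* Boundaries of the p-th contiguous chunk. */
        int begin = (int)((long long)n * p / parts);
        int end   = (int)((long long)n * (p + 1) / parts);

        double partial = 0.0;
        for (int i = begin; i < end; ++i)
            partial += x[i];

        total += partial;
    }
    return total;
}
```

For the input `{1e30, 1.0, -1e30, 1.0}`, a single chunk yields 1.0, while two chunks yield 0.0, because each small term is absorbed by the large one in its own chunk.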
Bug Fixes
Critical Fixes
- Fixed potential integer overflow in TPSV
- Fixed ZTRSM accuracy for conjugate transpose
- Fixed DTRSM small threshold for extremely skinny sizes on Zen5
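The release note does not say where the TPSV overflow occurred. As an illustration only, the sketch below shows one common way index arithmetic for packed triangular storage can overflow a 32-bit integer, and how 64-bit arithmetic avoids it. The helper names are hypothetical and the code is not taken from AOCL-BLAS.

```c
#include <assert.h>
#include <stdint.h>

/* Number of stored elements for an order-n triangular matrix in packed form.
 * The intermediate product n * (n + 1) exceeds INT32_MAX from n = 46341,
 * even though the final result still fits in 32 bits up to n = 65535. */
int64_t packed_length(int64_t n)
{
    assert(n >= 0);
    return n * (n + 1) / 2;
}

/* 0-based offset of element (i, j), with i <= j, of an upper-triangular
 * matrix packed column by column. */
int64_t packed_upper_index(int64_t i, int64_t j)
{
    assert(0 <= i && i <= j);
    return i + j * (j + 1) / 2;
}
```

Computing these offsets in a 32-bit type would wrap for large matrix orders, so the sketch keeps every intermediate value in `int64_t`.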
Acknowledgments
This release is the result of contributions from the AOCL team at AMD and the broader BLIS community.
Release Date: January 2026
Version: 5.2 GA
AOCL-BLAS 5.1 GA
Performance Optimizations
- DGEMM, DTRSM, DGEMV, ZGEMM, DTRSV, DCOPYV on Zen4/5
- DSCALV, DDOTV on Zen3
- Benchmark support for ASUMV
- Minor bug fixes
aocl_gemm add-on module updates
- AOCL_ENABLE_INSTRUCTIONS support
- batch_gemm support for all data types
- New Output Datatype for Integer APIs
- BF16 Support on AVX2 Platforms
- WOQ with/without Group Quantization
- Threading Framework Optimizations
- Reference Kernels for all Reorder APIs
- Performance Optimizations for all APIs
- Additional APIs and Post-Ops support in addition to the improved performance for the existing APIs
AOCL-BLAS 5.0
AOCL-BLAS 5.0 Release Highlights
- Added zen5 support
- Turin specific tuning for the APIs: D/ZGEMM, DTRSM and DNRM2
- AVX512 improvements for the APIs: ZGEMV, D/ZAXPYF, D/ZDOTXF, ZDOTV, C/ZSCALV, DNRM2, S/D/ZCOPY, S/D/C/ZAXPBYV, DTRSV, DGEMMT, D/ZTRSM, and D/ZGEMM
- Improvements to the AOCL_ENABLE_INSTRUCTIONS functionality
- Additional APIs and Post-Ops support in addition to the improved performance for the existing APIs in aocl_gemm add-on
AOCL-BLAS 4.2
AOCL-BLAS 4.2 Release Highlights
- Added uint8 output and zero-point support in int8 APIs in aocl_gemm add-on (Low Precision GEMM / LPGEMM)
- Improved performance for all downscaled versions of all APIs in aocl_gemm add-on (Low Precision GEMM / LPGEMM)
- Improved multithreaded performance across APIs in aocl_gemm add-on (Low Precision GEMM / LPGEMM)
- Introduced AOCL_ENABLE_INSTRUCTIONS environment variable as an alternative to BLIS_ARCH_TYPE, but with slightly different semantics.
- Improved functionality of XERBLA error handling routine in AOCL-BLAS.
- Performance optimizations for the following APIs:
  - DGEMM for tiny sizes
  - S/ZGEMM, D/ZTRSM, ZAXPBYV, Z/ZDSCALV, S/D/ZGEMV, and D/DZNRM2
- The following BLAS extension APIs have been added only for AMD “Zen” code paths:
  - sgemm_pack_get_size(), sgemm_pack(), and sgemm_compute()
  - dgemm_pack_get_size(), dgemm_pack(), and dgemm_compute()
AOCL-BLAS 4.1
AOCL-BLAS 4.1 Release Highlights
- Additional APIs and Post-Ops support in addition to the improved performance for the existing APIs in aocl_gemm add-on
- Dynamic dispatch and amdzen configuration support added to aocl_gemm add-on
- Dynamic dispatch feature enhancements.
- AVX-512-based optimizations for AMD “Zen4” platform:
  - SGEMM, DGEMM, and ZGEMM
  - DTRSM, D/ZAXPY, ZGEMV, DDOTV, and D/ZSCALV
- Improved support for OpenMP nested parallelism.
AOCL-BLIS 4.0
Highlights of AOCL-BLIS 4.0
- The following LPGEMM (Low Precision GEMM) variants are added along with post-ops support:
  - aocl_gemm_u8s8s32os32 and aocl_gemm_u8s8s32os8 routines are added and optimized using AVX-512-VNNI
  - aocl_gemm_u8s8s16os16 and aocl_gemm_u8s8s16os8 routines are added and optimized using AVX2
  - aocl_gemm_bf16bf16f32of32 and aocl_gemm_bf16bf16f32obf16 routines are added and optimized using AVX-512
  - SGEMM with packed/reorder buffer support (aocl_gemm_f32f32f32f32)
- AMD “Zen4” support for BLIS
- Dynamic dispatch supports AMD “Zen4” configuration
- Optimizations and performance improvements for DGEMM, SGEMM, ZGEMM, DGEMMT, and DTRSM
- Framework design changes
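The LPGEMM routine names above appear to encode the data types involved. Assuming that `u8s8s32os32` means an unsigned 8-bit A matrix, a signed 8-bit B matrix, 32-bit integer accumulation, and 32-bit integer output, the arithmetic can be sketched as below. This is a plain reference loop under those assumptions, with a hypothetical function name and row-major layout; it does not reproduce the actual aocl_gemm signature, reordering, or post-ops.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference C := A * B with A (m x k) in uint8, B (k x n) in int8,
 * accumulation and output in int32. All matrices are row-major.
 * The accumulator can overflow for very large k; this sketch does not
 * guard against that. */
void ref_gemm_u8s8s32os32(int m, int n, int k,
                          const uint8_t *a, int lda,
                          const int8_t *b, int ldb,
                          int32_t *c, int ldc)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    assert(lda >= k && ldb >= n && ldc >= n);

    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            int32_t acc = 0;
            for (int p = 0; p < k; ++p) {
                /* Widen both operands to 32 bits before multiplying. */
                acc += (int32_t)a[(size_t)i * lda + p] *
                       (int32_t)b[(size_t)p * ldb + j];
            }
            c[(size_t)i * ldc + j] = acc;
        }
    }
}
```

A loop like this is useful mainly as a correctness baseline when comparing against an optimized low-precision kernel.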
AOCL-BLIS 3.2
New features:
- Extended BLAS function - DZGEMM
- Progress feature for xGEMM and xTRSM APIs: the time taken by these operations grows rapidly with input problem size, so this feature gives users periodic updates on the operation's progress.
- Runtime Threading control using OpenMP APIs
- Dynamic Dispatch covers APUs
- Improved detection of standard x86-64 feature support
- Minor bug fixes
Performance improvements in the following single-threaded and multi-threaded functions:
- DGEMM, SGEMM, ZGEMM, and CGEMM
- DTRSM, DGEMMT, ZTRSM, CTRSM, and DTRMM
- SGEMV, DHER2, ZTRSV, and DSYMV
- ?AXPBYV, SSCALV, DSCALV, ?DOTXV, and ZAXPY2V
AMD Optimized BLIS Version 3.1
Highlights of improvements on AMD EPYC™ processor family CPUs
- Supports Dynamic Dispatch and AOCL Dynamic feature
- Improvements in DGEMM, ZGEMM, DTRSM, DSYRK, xGEMV, and DOTV
AMD Optimized BLIS Version 3.0.1
Highlights of improvements on AMD EPYC™ processor family CPUs
- Improved performance of DGEMM for skinny matrix shapes.
- Improvements in SGEMM and ZGEMM
- Improved performance of Level-1 and Level-2 BLAS routines, including GEMV, DOT, and AXPY
- Improvements in DTRSM for small matrix sizes
AMD Optimized BLIS Version 3.0
Highlights of improvements on AMD EPYC™ processor family CPUs
- Includes support for AMD’s Zen3 architecture. The build can auto-detect whether it is running on Zen3 and enable Zen3-specific features and optimizations.
- Improved performance of ?dotv, ?gemv, ?axpyv for complex and double complex datatypes
- Includes support for copy transposition routines
- New BLAS extension APIs added including cblas_?cabs1, cblas_i?amin, cblas_?axpby, cblas_?gemm_batch, cblas_?gemm3m
- Debug trace and input logging support added for more BLIS APIs.
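Of the extension APIs listed above, the `axpby` family computes y := alpha * x + beta * y. The sketch below is a reference loop for the double-precision case. `ref_daxpby` is a hypothetical name, its argument list is an assumption modeled on the usual BLAS vector conventions rather than the exact cblas prototype, and only positive strides are handled.

```c
#include <assert.h>
#include <stddef.h>

/* Reference y := alpha * x + beta * y for n elements.
 * incx and incy are the strides between consecutive elements. */
void ref_daxpby(int n, double alpha, const double *x, int incx,
                double beta, double *y, int incy)
{
    assert(n >= 0 && incx > 0 && incy > 0);

    for (int i = 0; i < n; ++i) {
        double *yi = &y[(size_t)i * incy];
        *yi = alpha * x[(size_t)i * incx] + beta * (*yi);
    }
}
```

Compared with plain AXPY, the extra `beta` factor lets the caller scale `y` in the same pass instead of issuing a separate SCAL call.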