Skip to content

[SYCL] Implement sycl_ext_oneapi_fp8 for CRI#21568

Open
dklochkov-emb wants to merge 69 commits into
intel:syclfrom
dklochkov-emb:sycl-ext-one-api-fp8-new-arch
Open

[SYCL] Implement sycl_ext_oneapi_fp8 for CRI#21568
dklochkov-emb wants to merge 69 commits into
intel:syclfrom
dklochkov-emb:sycl-ext-one-api-fp8-new-arch

Conversation

@dklochkov-emb
Copy link
Copy Markdown
Contributor

@dklochkov-emb dklochkov-emb commented Mar 19, 2026

This PR adds FP8 types for CRI docs.
It adds the first part which includes:

  1. Initial implementation
  2. Unit tests which check expected behavior and values
  3. Unit tests which check the fact of builtin calls.
  4. E2E tests of all 3 data types: fp8_e4m3, fp8_e5m2 and fp8_e8m0

@dklochkov-emb dklochkov-emb requested a review from a team as a code owner March 19, 2026 13:48
#ifdef __SYCL_DEVICE_ONLY__
// New FP8 builtins
extern __DPCPP_SYCL_EXTERNAL sycl::half
__builtin_spirv_ClampConvertE4M3ToFP16INTEL(uint8_t) noexcept;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just talking with @bashbaug today, and he mentioned that there is a multi-vendor SPIR-V extension for FP8 conversions. It looks like this PR is using the proposed Intel extension. I think this is the multi-vendor extension:

https://github.khronos.org/SPIRV-Registry/extensions/EXT/SPV_EXT_float8.html

@bashbaug do you know if our driver supports the multi-vendor extension? @dklochkov-emb does the multi-vendor extension provide all the functionality we need, or would we still need something from the Intel extension?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you know if our driver supports the multi-vendor extension?

I believe there is some experimental support, but the extension is not formally supported currently.

It would be good to understand:

  1. Whether any functionality is fundamentally missing from the EXT extension. For example, the EXT extension currently does not include any stochastic rounding functionality. Is this needed, and if so, can it be provided by a layered extension?
  2. Any details about required support with the EXT extension. For example, which rounding modes should be supported, and which other floating-point types should be supported for conversions to/from the fp8 types?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, the EXT extension currently does not include any stochastic rounding functionality. Is this needed, and if so, can it be provided by a layered extension?

Yes, we need stochastic rounding for E5M2 when converting from half, bfloat16, and float. I suppose it could be implemented by a layered extension, but it would be up to use to define such an extension.

For example, which rounding modes should be supported

For E45M3, we need only RNE with saturation.
For E5M2, we need only RNE, but we need both saturation and non-saturation.

and which other floating-point types should be supported for conversions to/from the fp8 types

I think the Intel SPIRV extension supports conversions to/from half, bfloat16, and float.

It seems like we should do one of two things:

  1. Do all of the following:

    • Fully implement the EXT extension.
    • Define a layered SPIRV extension with the missing features.
    • Change this PR to use the EXT and that new layered extension.
  2. Clean up and publish the existing Intel SPIRV extension.

Copy link
Copy Markdown
Contributor

@vmaksimo vmaksimo Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we already have everything ready in terms of SPIR-V specs and support -- SPV_INTEL_fp_conversions was published in December 2025 as a layered extension on top of SPV_EXT_float8, and it covers all the functionality you mention.

Specifically:

  • RNE without saturation (E5M2): OpFConvert from SPV_EXT_float8
  • RNE with saturation (E4M3 and E5M2): OpClampConvertFToFINTEL from SPV_INTEL_fp_conversions
  • Stochastic rounding (E5M2, with and without saturation): OpStochasticRoundFToFINTEL/OpClampStochasticRoundFToFINTEL from SPV_INTEL_fp_conversions
  • Source types (half, bfloat16, float): supported in both extensions

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EXT does not support Stochastic rounding, @gmlueck @bashbaug , can it be used INTEL spirv in this case?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the EXT extension does not support stochastic rounding. It should be OK to use OpStochasticRoundFToFINTEL and OpClampStochasticRoundFToFINTEL for this instead, though.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SPV_INTEL_fp_conversions was published in December 2025 as a layered extension on top of SPV_EXT_float8, and it covers all the functionality you mention.

This extension is only "approved for public preview" at present. Since we plan to use it in SYCL, it should be promoted to "approved for publication", and it should be published on the Khronos website.

  • RNE with saturation (E4M3 and E5M2): OpClampConvertFToFINTEL from SPV_INTEL_fp_conversions

It seems like this functionality is also provided by the SPV_EXT_float8 extension via the SaturatedToLargestFloat8NormalConversionEXT decoration. If we want SPV_INTEL_fp_conversions to contain only additional functionality, should we remove OpClampConvertFToFINTEL? CC: @bashbaug

It also seems like there is some functionality required by SYCL that is missing from both SPIRV extensions. The SYCL extension requires saturation (clamped) conversion when converting to integer. The INTEL extension provides OpClampConvertFToSINTEL, but that only works when converting to a signed integer. For some reason, there is no clamped equivalent to OpConvertFToU. Was this an oversight in the SPIRV extension?

I was trying to remember why I specified the SYCL extension to require saturation (clamped) conversion for integers. C++ says that FP to integer conversions are undefined when the value cannot be represented after rounding. I think I required saturation because this is how the CUDA API works. If you scroll way down to "Integer notes", you can see that the PTX cvt instruction guarantees saturation when converting to an integer value.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to use EXT where it is possible

@dklochkov-emb
Copy link
Copy Markdown
Contributor Author

dklochkov-emb commented Mar 31, 2026

@intel/llvm-reviewers-runtime please, review

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces initial host-side implementations of OneAPI experimental FP8 types (E4M3, E5M2, E8M0) and adds unit tests validating conversions/encodings plus tests that verify the expected SPIR-V builtin call paths (via mocks) for device-only code paths.

Changes:

  • Added sycl::ext::oneapi::experimental FP8 type definitions and conversion logic in a new public header.
  • Added unit tests for value/encoding behavior for fp8 E4M3, E5M2, and E8M0 on CPU.
  • Added mocked builtin implementations and tests to confirm which builtins are invoked from constructors/conversions.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
sycl/include/sycl/ext/oneapi/experimental/float_8bit/types.hpp Adds the FP8 type implementations, conversion helpers, and builtin hooks.
sycl/unittests/Extensions/fp8/fp8_e4m3.cpp Adds host unit tests for E4M3 encoding/conversion behavior.
sycl/unittests/Extensions/fp8/fp8_e5m2.cpp Adds host unit tests for E5M2 encoding/conversion behavior.
sycl/unittests/Extensions/fp8/fp8_e8m0.cpp Adds host unit tests for E8M0 encoding/conversion behavior.
sycl/unittests/Extensions/fp8/builtin_mocks.hpp Provides mock builtin symbols and counters to validate builtin call paths.
sycl/unittests/Extensions/fp8/builtin_call_tests.cpp Adds tests asserting the expected builtin calls occur (via the mocks).
sycl/unittests/Extensions/fp8/CMakeLists.txt Registers the new FP8 unit test target and sources.
sycl/unittests/Extensions/CMakeLists.txt Adds the new fp8 unit test subdirectory to the Extensions suite.

Comment thread sycl/include/sycl/ext/oneapi/experimental/float_8bit/types.hpp Outdated
Comment thread sycl/include/sycl/ext/oneapi/experimental/float_8bit/types.hpp Outdated
Comment thread sycl/include/sycl/ext/oneapi/experimental/float_8bit/types.hpp Outdated
Comment thread sycl/include/sycl/ext/oneapi/experimental/float_8bit/types.hpp Outdated
Comment thread sycl/include/sycl/ext/oneapi/experimental/float_8bit/types.hpp Outdated
Comment thread sycl/include/sycl/ext/oneapi/experimental/float_8bit/types.hpp Outdated
Comment thread sycl/include/sycl/ext/oneapi/experimental/float_8bit/types.hpp Outdated
Comment thread sycl/unittests/Extensions/fp8/fp8_e8m0.cpp
Copy link
Copy Markdown
Contributor

@gmlueck gmlueck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are my comments so far. I haven't looked at E8M0 yet, or at any of the CPU conversions.

Comment thread sycl/include/sycl/ext/oneapi/experimental/float_8bit/types.hpp Outdated
Comment thread sycl/include/sycl/ext/oneapi/experimental/float_8bit/types.hpp Outdated
Comment thread sycl/include/sycl/ext/oneapi/experimental/float_8bit/types.hpp Outdated
Comment thread sycl/include/sycl/ext/oneapi/experimental/float_8bit/types.hpp Outdated
Comment thread sycl/include/sycl/ext/oneapi/experimental/float_8bit/types.hpp Outdated
Comment thread sycl/include/sycl/ext/oneapi/experimental/float_8bit/types.hpp Outdated
Comment thread sycl/include/sycl/ext/oneapi/experimental/float_8bit/types.hpp Outdated
Comment thread sycl/include/sycl/ext/oneapi/experimental/float_8bit/types.hpp Outdated
Comment thread sycl/include/sycl/ext/oneapi/experimental/float_8bit/types.hpp Outdated
Comment thread sycl/include/sycl/ext/oneapi/experimental/float_8bit/types.hpp Outdated
@dklochkov-emb dklochkov-emb requested a review from a team as a code owner May 18, 2026 08:20
@dklochkov-emb dklochkov-emb changed the title FP8 types for CRI PART 1 FP8 types for CRI May 18, 2026
dklochkov-emb and others added 4 commits May 18, 2026 10:25
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
@dm-vodopyanov dm-vodopyanov changed the title FP8 types for CRI [SYCL] Implement sycl_ext_oneapi_fp8 for CRI May 18, 2026
Comment thread llvm/lib/SYCLPostLink/ModuleSplitter.cpp
Co-authored-by: Copilot <copilot@github.com>
@dklochkov-emb dklochkov-emb requested a review from maksimsab May 18, 2026 14:10
Copy link
Copy Markdown
Contributor

@maksimsab maksimsab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SYCLPostLink part LGTM.

Co-authored-by: Copilot <copilot@github.com>
@dklochkov-emb
Copy link
Copy Markdown
Contributor Author

E2E tests were added but it is required driver version with bug fixes to pass on CI. Locally all tests pass

@dklochkov-emb dklochkov-emb requested a review from gmlueck May 19, 2026 09:39
@dklochkov-emb
Copy link
Copy Markdown
Contributor Author

@gmlueck Please, review it

Comment on lines +1650 to +1687
explicit fp8_e8m0_x(half const (&in)[N], rounding r = rounding::upward) {
CheckConstraints(r);
for (size_t i = 0; i < N; ++i)
vals[i] = detail::ConvertFloatToE8M0_CPU(in[i], r, saturation::finite);
}

explicit fp8_e8m0_x(bfloat16 const (&in)[N], rounding r = rounding::upward) {
CheckConstraints(r);
for (size_t i = 0; i < N; ++i)
vals[i] = detail::ConvertFloatToE8M0_CPU(in[i], r, saturation::finite);
}

explicit fp8_e8m0_x(float const (&in)[N], rounding r = rounding::upward) {
CheckConstraints(r);
for (size_t i = 0; i < N; ++i)
vals[i] = detail::ConvertFloatToE8M0_CPU(in[i], r, saturation::finite);
}

explicit fp8_e8m0_x(const marray<half, N> &in,
rounding r = rounding::upward) {
CheckConstraints(r);
for (size_t i = 0; i < N; ++i)
vals[i] = detail::ConvertFloatToE8M0_CPU(in[i], r, saturation::finite);
}

explicit fp8_e8m0_x(const marray<bfloat16, N> &in,
rounding r = rounding::upward) {
CheckConstraints(r);
for (size_t i = 0; i < N; ++i)
vals[i] = detail::ConvertFloatToE8M0_CPU(in[i], r, saturation::finite);
}

explicit fp8_e8m0_x(const marray<float, N> &in,
rounding r = rounding::upward) {
CheckConstraints(r);
for (size_t i = 0; i < N; ++i)
vals[i] = detail::ConvertFloatToE8M0_CPU(in[i], r, saturation::finite);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use templates with #define to avoid code duplication in the entire file?

Copy link
Copy Markdown
Contributor

@gmlueck gmlueck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't finish reviewing this yet, but I wanted to send the comments I have so far. So far, I have just reviewed the E4M3 and E5M2 device-side code.

#ifdef __SYCL_DEVICE_ONLY__
_Float16 v{0};
if constexpr (std::is_same_v<std::decay_t<T>, sycl::half>)
v = static_cast<_Float16>(static_cast<float>(h));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this line. Isn't half the same as _Float16? Why do we convert from half to float and then back to _Float16?

if constexpr (std::is_same_v<std::decay_t<T>, sycl::half>)
v = static_cast<_Float16>(static_cast<float>(h));
else
v = static_cast<_Float16>(h);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When T is float, doesn't this have the same two-step conversion problem that is described here: https://github.com/intel-tools/intel-xpu-backend-for-triton/issues/847 ?

I'm not sure if there are also similar problems when T is an integer type. Did you consider this and decide it is safe?

if constexpr (((std::is_same_v<std::decay_t<Types>, bfloat16>) && ...)) {
const bfloat16 in[N] = {static_cast<bfloat16>(v)...};
for (size_t i = 0; i < N; ++i)
vals[i] = ConvertBF16ToFP8(in[i]);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SPIR-V instructions __builtin_spirv_ClampConvertFP16ToE4M3INTEL and __builtin_spirv_ClampConvertBF16ToE4M3INTEL allow the input to be a vector. In the N == 2 case, I wonder if we are losing some performance by calling the instruction twice (each with a scalar input) rather than calling it once (with a vector of two elements).

explicit operator sycl::marray<sycl::half, N>() const {
sycl::marray<sycl::half, N> ret;
for (size_t i = 0; i < N; ++i)
ret[i] = ConvertFromFP8<sycl::half>(vals[i]);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here about performance. I wonder if we are losing some performance by calling this in a loop rather than using the vector versions of the conversion instructions.

}
current_seed = *seed.pseed;
}
#endif
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should either assert or throw an exception on host.

#ifdef __SYCL_DEVICE_ONLY__
uint32_t current_seed = *seed.pseed;
for (size_t i = 0; i < N; ++i) {
const _Float16 v = static_cast<_Float16>(static_cast<float>(in[i]));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as earlier, why convert from half to float and back to _Float16?

// Intentionally public to allow access to the raw values.
uint8_t vals[N];
};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the deduction guides are missing for all three of these classes. The spec calls for this deduction guide for the fp8_e4m3_x class, for example:

// Deduction guide available only when the size of the pack is greater than zero.
template<typename... Ts>
fp8_e4m3_x(Ts...) -> fp8_e4m3_x<sizeof...(Ts)>;

Copy link
Copy Markdown
Contributor

@gmlueck gmlueck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these comments are about the SPIR-V intrinsics. We should use the SPV_EXT_float8 extension as much as possible and only use SPV_INTEL_fp_conversions for the things it does not contain. It's not clear to me which SPIR-V instructions are being emitted from these intrinsics, but some of the names make it sound like you may be emitting SPV_INTEL_fp_conversions instead of SPV_EXT_float8.

I've listed the SPIR-V instructions I think we should use for each of the intrinsics below.

Tagging @bashbaug here also to make sure I got this right. :-)

// FP8 builtins

extern __DPCPP_SYCL_EXTERNAL uint8_t
__builtin_spirv_ClampConvertFP16ToE4M3INTEL(_Float16) noexcept;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be using OpFConvert with Float8E4M3EXT as the destination FP Encoding.

I see that the name of this builtin ends with "INTEL", so I presume it is not emitting OpFConvert with Float8E4M3EXT now?

__builtin_spirv_ClampConvertFP16ToE4M3INTEL(_Float16) noexcept;

extern __DPCPP_SYCL_EXTERNAL _Float16
__builtin_spirv_ConvertE4M3ToFP16EXT(char) noexcept;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this should be using OpFconvert with Float8E4M3EXT as the source FP Encoding.

I see that the name of this builtin ends with "EXT". Does that mean it is already emitting this SPIR-V instruction?

__builtin_spirv_ConvertE4M3ToFP16EXT(char) noexcept;

extern __DPCPP_SYCL_EXTERNAL uint8_t
__builtin_spirv_ClampConvertBF16ToE4M3INTEL(__bf16) noexcept;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use OpFconvert with Float8E4M3EXT as the destination FP Encoding and decorated with SaturatedToLargestFloat8NormalConversionEXT.

extern __DPCPP_SYCL_EXTERNAL uint8_t
__builtin_spirv_ClampConvertBF16ToE4M3INTEL(__bf16) noexcept;
extern __DPCPP_SYCL_EXTERNAL __bf16
__builtin_spirv_ConvertE4M3ToBF16EXT(uint8_t) noexcept;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use OpFConvert with Float8E4M3EXT as the source FP Encoding and BFloat16KHR as the destination FP Encoding.

extern __DPCPP_SYCL_EXTERNAL __bf16
__builtin_spirv_ConvertE4M3ToBF16EXT(uint8_t) noexcept;
extern __DPCPP_SYCL_EXTERNAL uint8_t
__builtin_spirv_ClampConvertFP16ToE5M2INTEL(_Float16) noexcept;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use OpFConvert with Float8E5M2EXT as the destination FP Encoding and the SaturatedToLargestFloat8NormalConversionEXT decoration.

extern __DPCPP_SYCL_EXTERNAL _Float16
__builtin_spirv_ConvertE5M2ToFP16EXT(uint8_t) noexcept;
extern __DPCPP_SYCL_EXTERNAL uint8_t
__builtin_spirv_ClampConvertBF16ToE5M2INTEL(__bf16) noexcept;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use OpFConvert with BFloat16KHR as the source FP Encoding and Float8E5M2EXT as the destination FP Encoding and decorated with SaturatedToLargestFloat8NormalConversionEXT.

extern __DPCPP_SYCL_EXTERNAL uint8_t
__builtin_spirv_ClampConvertBF16ToE5M2INTEL(__bf16) noexcept;
extern __DPCPP_SYCL_EXTERNAL uint8_t
__builtin_spirv_ConvertBF16ToE5M2EXT(__bf16) noexcept;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use OpFConvert with BFloat16KHR as the source FP Encoding and Float8E5M2EXT as the destination FP Encoding.

extern __DPCPP_SYCL_EXTERNAL uint8_t
__builtin_spirv_ConvertBF16ToE5M2EXT(__bf16) noexcept;
extern __DPCPP_SYCL_EXTERNAL __bf16
__builtin_spirv_ConvertE5M2ToBF16EXT(uint8_t) noexcept;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use OpFConvert with Float8E5M2EXT as the source FP Encoding and BFloat16KHR as the destination FP Encoding.

__builtin_spirv_ConvertE5M2ToBF16EXT(uint8_t) noexcept;
extern __DPCPP_SYCL_EXTERNAL uint8_t
__builtin_spirv_ClampStochasticRoundFP16ToE5M2INTEL(_Float16, uint32_t,
uint32_t *) noexcept;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use OpStochasticRoundFToFINTEL with the SaturatedToLargestFloat8NormalConversionEXT decoration.

uint32_t *) noexcept;
extern __DPCPP_SYCL_EXTERNAL uint8_t
__builtin_spirv_ClampStochasticRoundBF16ToE5M2INTEL(__bf16, uint32_t,
uint32_t *) noexcept;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use OpStochasticRoundFToFINTEL with BFloat16KHR as the source FP Encoding and the SaturatedToLargestFloat8NormalConversionEXT decoration.

#include <stdexcept>
#include <type_traits>

#define SYCL_EXT_ONEAPI_FP8 1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we normally define these macros in "sycl/source/feature_test.hpp.in".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants