Audio: STFT Process: Add Xtensa HiFi function versions#10638
Audio: STFT Process: Add Xtensa HiFi function versions#10638singalsu wants to merge 1 commit intothesofproject:mainfrom
Conversation
This patch adds to stft_process-hifi3.c the HiFi3 versions of higher complexity functions stft_process_apply_window() and stft_process_overlap_add_ifft_buffer(). The functions with no clear HiFi optimization benefit are moved from stft_process-generic.c to stft_process_common.c. Those functions move data with practically no processing to samples. This change saves 17 MCPS (from 63 MCPS to 46 MCPS). The test was done with script run: scripts/rebuild-testbench.sh -p mtl scripts/sof-testbench-helper.sh -x -m stft_process_1024_256_ \ -p profile-stft_process.txt The above STFT used FFT length 1024 with hop 256. Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
There was a problem hiding this comment.
Pull request overview
Adds HiFi3 SIMD implementations for STFT hot-path helpers and refactors shared, non-SIMD-specific routines into a common compilation unit to reduce MCPS.
Changes:
- Add HiFi3 intrinsic implementations of
stft_process_apply_window()andstft_process_overlap_add_ifft_buffer(). - Move source/sink and buffer-fill helper functions from
stft_process-generic.cintostft_process_common.c. - Introduce Kconfig SIMD level selection and update build sources to include the HiFi3 unit.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| src/audio/stft_process/stft_process_common.c | Adds shared source/sink and FFT buffer fill helpers (moved from generic). |
| src/audio/stft_process/stft_process-hifi3.c | New HiFi3 intrinsic implementations for windowing + overlap-add. |
| src/audio/stft_process/stft_process-generic.c | Removes moved helpers; wraps generic implementations behind SOF_USE_HIFI(NONE, ...). |
| src/audio/stft_process/Kconfig.simd | Adds Kconfig choice for SIMD optimization level selection. |
| src/audio/stft_process/Kconfig | Includes the new SIMD Kconfig via rsource. |
| src/audio/stft_process/CMakeLists.txt | Adds the HiFi3 compilation unit to the build. |
Comments suppressed due to low confidence (1)
src/audio/stft_process/stft_process-hifi3.c:1
- The function relies on 64-bit alignment and even-sample constraints but does not enforce either at runtime. Misalignment can cause load/store exceptions or significant penalties depending on the core/config, and “even samples” is already required to avoid the
>> 1infinite-loop hazard. Add an explicit alignment/size assertion (or a guarded scalar fallback when(uintptr_t)obuf->w_ptris not 8-byte aligned or when the contiguous region before wrap is odd-length) to make failures deterministic and easier to diagnose.
// SPDX-License-Identifier: BSD-3-Clause
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| while (samples_remain) { | ||
| n = stft_process_buffer_samples_without_wrap(obuf, (int32_t *)w); | ||
| n = MIN(samples_remain, n) >> 1; | ||
| for (i = 0; i < n; i++) { | ||
| /* Load two FFT real part values, combine into fft_data */ | ||
| AE_L32_IP(d0, fft_p, sizeof(ae_int32x2)); | ||
| AE_L32_IP(d1, fft_p, sizeof(ae_int32x2)); | ||
| fft_data = AE_SEL32_HH(d0, d1); | ||
|
|
||
| /* Load buffer data, multiply fft_data with gain and accumulate */ | ||
| buffer_data = AE_L32X2_I(w, 0); | ||
| AE_MULAFP32X2RS(buffer_data, fft_data, gain); | ||
| AE_S32X2_IP(buffer_data, w, sizeof(ae_int32x2)); | ||
| } | ||
| w = (ae_int32x2 *)stft_process_buffer_wrap(obuf, (int32_t *)w); | ||
| samples_remain -= n << 1; | ||
| } |
There was a problem hiding this comment.
n is converted to “pairs” via >> 1. If stft_process_buffer_samples_without_wrap() returns 1 sample at the end of the circular buffer (odd remaining before wrap), n becomes 0 and samples_remain is never decremented, resulting in an infinite loop. Fix by ensuring forward progress when fewer than 2 samples are available (e.g., handle a single-sample tail with a scalar path before wrapping, or compute n_pairs = MIN(samples_remain, n) / 2 and if n_pairs == 0 process 1 sample and wrap).
| ae_int32 *buf; | ||
| const ae_int32x2 *win; | ||
| ae_f32x2 data01, data23; | ||
| ae_f32x2 win01, win23; | ||
| ae_int32x2 d0, d1; |
There was a problem hiding this comment.
The HiFi3 intrinsic usage here appears type-inconsistent in a way that is likely to either not compile cleanly or produce incorrect results: AE_L32_I() is a 32-bit load but its result is assigned to ae_int32x2 (d0, d1) and then fed to AE_SEL32_HH(), which expects valid packed lanes. Similarly, AE_S32_L_IP() is a scalar store but is passed ae_f32x2 values (data01, data23). Consider restructuring to use consistent vector/scalar types (e.g., load complex samples with AE_L32X2_* into ae_int32x2, extract/insert the real lanes explicitly, and store back with AE_S32X2_*, or use scalar ae_int32 loads/stores for the real parts).
There was a problem hiding this comment.
You are wrong. From HiFi3 user's guide:
ae_int32x2 AE_L32_I (const ae_int32 * a, immediate i32);
It duplicates the loaded scalar to int32x2 vector.
| d0 = AE_L32_I(buf, 0 * sizeof(ae_int32x2)); | ||
| d1 = AE_L32_I(buf, 1 * sizeof(ae_int32x2)); | ||
| data01 = AE_SEL32_HH(d0, d1); | ||
| d0 = AE_L32_I(buf, 2 * sizeof(ae_int32x2)); | ||
| d1 = AE_L32_I(buf, 3 * sizeof(ae_int32x2)); | ||
| data23 = AE_SEL32_HH(d0, d1); |
There was a problem hiding this comment.
The HiFi3 intrinsic usage here appears type-inconsistent in a way that is likely to either not compile cleanly or produce incorrect results: AE_L32_I() is a 32-bit load but its result is assigned to ae_int32x2 (d0, d1) and then fed to AE_SEL32_HH(), which expects valid packed lanes. Similarly, AE_S32_L_IP() is a scalar store but is passed ae_f32x2 values (data01, data23). Consider restructuring to use consistent vector/scalar types (e.g., load complex samples with AE_L32X2_* into ae_int32x2, extract/insert the real lanes explicitly, and store back with AE_S32X2_*, or use scalar ae_int32 loads/stores for the real parts).
There was a problem hiding this comment.
Might be possible to change all to ae_f32x2 (that's what the fractional int data really is) and apply type cast to ae_int32 where needed.
| AE_S32_L_IP(AE_SEL32_LH(data01, data01), buf, sizeof(ae_int32x2)); | ||
| AE_S32_L_IP(data01, buf, sizeof(ae_int32x2)); | ||
| AE_S32_L_IP(AE_SEL32_LH(data23, data23), buf, sizeof(ae_int32x2)); | ||
| AE_S32_L_IP(data23, buf, sizeof(ae_int32x2)); |
There was a problem hiding this comment.
The HiFi3 intrinsic usage here appears type-inconsistent in a way that is likely to either not compile cleanly or produce incorrect results: AE_L32_I() is a 32-bit load but its result is assigned to ae_int32x2 (d0, d1) and then fed to AE_SEL32_HH(), which expects valid packed lanes. Similarly, AE_S32_L_IP() is a scalar store but is passed ae_f32x2 values (data01, data23). Consider restructuring to use consistent vector/scalar types (e.g., load complex samples with AE_L32X2_* into ae_int32x2, extract/insert the real lanes explicitly, and store back with AE_S32X2_*, or use scalar ae_int32 loads/stores for the real parts).
| ae_f32x2 gain = AE_MOVDA32(state->gain_comp); | ||
| ae_f32x2 buffer_data; | ||
| ae_f32x2 fft_data; |
There was a problem hiding this comment.
AE_L32X2_I() / AE_S32X2_IP() operate on packed integer registers (ae_int32x2), but buffer_data is declared as ae_f32x2. This mismatch is error-prone and may cause compilation issues or unintended lane interpretation. Use ae_int32x2 buffer_data for the circular buffer samples, and use the appropriate mul/acc intrinsic variants that match the operand types (and saturation requirements) for Q1.31 arithmetic.
| buffer_data = AE_L32X2_I(w, 0); | ||
| AE_MULAFP32X2RS(buffer_data, fft_data, gain); | ||
| AE_S32X2_IP(buffer_data, w, sizeof(ae_int32x2)); |
There was a problem hiding this comment.
AE_L32X2_I() / AE_S32X2_IP() operate on packed integer registers (ae_int32x2), but buffer_data is declared as ae_f32x2. This mismatch is error-prone and may cause compilation issues or unintended lane interpretation. Use ae_int32x2 buffer_data for the circular buffer samples, and use the appropriate mul/acc intrinsic variants that match the operand types (and saturation requirements) for Q1.31 arithmetic.
| comment "STFT Process optimization level select" | ||
|
|
||
| choice "COMP_STFT_PROCESS_SIMD_LEVEL_SELECT" | ||
| prompt "choose which SIMD level used for STFT Process module" |
There was a problem hiding this comment.
The prompt string is grammatically incorrect. Consider updating it for clarity.
| prompt "choose which SIMD level used for STFT Process module" | |
| prompt "Choose which SIMD level is used for the STFT Process module" |
This patch adds to stft_process-hifi3.c the HiFi3 versions of higher complexity functions stft_process_apply_window() and stft_process_overlap_add_ifft_buffer().
The functions with no clear HiFi optimization benefit are moved from stft_process-generic.c to stft_process_common.c. Those functions move data with practically no processing to samples.
This change saves 17 MCPS (from 63 MCPS to 46 MCPS). The test was done with script run:
scripts/rebuild-testbench.sh -p mtl
scripts/sof-testbench-helper.sh -x -m stft_process_1024_256_
-p profile-stft_process.txt
The above STFT used FFT length 1024 with hop 256.