Skip to content

Audio: STFT Process: Add Xtensa HiFi function versions#10638

Open
singalsu wants to merge 1 commit intothesofproject:mainfrom
singalsu:stft_add_hifi_optimize
Open

Audio: STFT Process: Add Xtensa HiFi function versions#10638
singalsu wants to merge 1 commit intothesofproject:mainfrom
singalsu:stft_add_hifi_optimize

Conversation

@singalsu
Copy link
Collaborator

This patch adds to stft_process-hifi3.c the HiFi3 versions of higher complexity functions stft_process_apply_window() and stft_process_overlap_add_ifft_buffer().

The functions with no clear HiFi optimization benefit are moved from stft_process-generic.c to stft_process_common.c. Those functions move data with practically no processing to samples.

This change saves 17 MCPS (from 63 MCPS to 46 MCPS). The test was done with script run:

scripts/rebuild-testbench.sh -p mtl
scripts/sof-testbench-helper.sh -x -m stft_process_1024_256_
-p profile-stft_process.txt

The above STFT used FFT length 1024 with hop 256.

This patch adds to stft_process-hifi3.c the HiFi3 versions of
higher complexity functions stft_process_apply_window() and
stft_process_overlap_add_ifft_buffer().

The functions with no clear HiFi optimization benefit are moved
from stft_process-generic.c to stft_process_common.c. Those
functions move data with practically no processing to samples.

This change saves 17 MCPS (from 63 MCPS to 46 MCPS). The test
was done with script run:

scripts/rebuild-testbench.sh -p mtl
scripts/sof-testbench-helper.sh -x -m stft_process_1024_256_ \
  -p profile-stft_process.txt

The above STFT used FFT length 1024 with hop 256.

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
@singalsu singalsu marked this pull request as ready for review March 20, 2026 16:03
Copilot AI review requested due to automatic review settings March 20, 2026 16:03
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds HiFi3 SIMD implementations for STFT hot-path helpers and refactors shared, non-SIMD-specific routines into a common compilation unit to reduce MCPS.

Changes:

  • Add HiFi3 intrinsic implementations of stft_process_apply_window() and stft_process_overlap_add_ifft_buffer().
  • Move source/sink and buffer-fill helper functions from stft_process-generic.c into stft_process_common.c.
  • Introduce Kconfig SIMD level selection and update build sources to include the HiFi3 unit.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
src/audio/stft_process/stft_process_common.c Adds shared source/sink and FFT buffer fill helpers (moved from generic).
src/audio/stft_process/stft_process-hifi3.c New HiFi3 intrinsic implementations for windowing + overlap-add.
src/audio/stft_process/stft_process-generic.c Removes moved helpers; wraps generic implementations behind SOF_USE_HIFI(NONE, ...).
src/audio/stft_process/Kconfig.simd Adds Kconfig choice for SIMD optimization level selection.
src/audio/stft_process/Kconfig Includes the new SIMD Kconfig via rsource.
src/audio/stft_process/CMakeLists.txt Adds the HiFi3 compilation unit to the build.
Comments suppressed due to low confidence (1)

src/audio/stft_process/stft_process-hifi3.c:1

  • The function relies on 64-bit alignment and even-sample constraints but does not enforce either at runtime. Misalignment can cause load/store exceptions or significant penalties depending on the core/config, and “even samples” is already required to avoid the >> 1 infinite-loop hazard. Add an explicit alignment/size assertion (or a guarded scalar fallback when (uintptr_t)obuf->w_ptr is not 8-byte aligned or when the contiguous region before wrap is odd-length) to make failures deterministic and easier to diagnose.
// SPDX-License-Identifier: BSD-3-Clause

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +117 to +133
while (samples_remain) {
n = stft_process_buffer_samples_without_wrap(obuf, (int32_t *)w);
n = MIN(samples_remain, n) >> 1;
for (i = 0; i < n; i++) {
/* Load two FFT real part values, combine into fft_data */
AE_L32_IP(d0, fft_p, sizeof(ae_int32x2));
AE_L32_IP(d1, fft_p, sizeof(ae_int32x2));
fft_data = AE_SEL32_HH(d0, d1);

/* Load buffer data, multiply fft_data with gain and accumulate */
buffer_data = AE_L32X2_I(w, 0);
AE_MULAFP32X2RS(buffer_data, fft_data, gain);
AE_S32X2_IP(buffer_data, w, sizeof(ae_int32x2));
}
w = (ae_int32x2 *)stft_process_buffer_wrap(obuf, (int32_t *)w);
samples_remain -= n << 1;
}
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n is converted to “pairs” via >> 1. If stft_process_buffer_samples_without_wrap() returns 1 sample at the end of the circular buffer (odd remaining before wrap), n becomes 0 and samples_remain is never decremented, resulting in an infinite loop. Fix by ensuring forward progress when fewer than 2 samples are available (e.g., handle a single-sample tail with a scalar path before wrapping, or compute n_pairs = MIN(samples_remain, n) / 2 and if n_pairs == 0 process 1 sample and wrap).

Copilot uses AI. Check for mistakes.
Comment on lines +34 to +38
ae_int32 *buf;
const ae_int32x2 *win;
ae_f32x2 data01, data23;
ae_f32x2 win01, win23;
ae_int32x2 d0, d1;
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HiFi3 intrinsic usage here appears type-inconsistent in a way that is likely to either not compile cleanly or produce incorrect results: AE_L32_I() is a 32-bit load but its result is assigned to ae_int32x2 (d0, d1) and then fed to AE_SEL32_HH(), which expects valid packed lanes. Similarly, AE_S32_L_IP() is a scalar store but is passed ae_f32x2 values (data01, data23). Consider restructuring to use consistent vector/scalar types (e.g., load complex samples with AE_L32X2_* into ae_int32x2, extract/insert the real lanes explicitly, and store back with AE_S32X2_*, or use scalar ae_int32 loads/stores for the real parts).

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are wrong. From HiFi3 user's guide:

ae_int32x2 AE_L32_I (const ae_int32 * a, immediate i32);

It duplicates the loaded scalar to int32x2 vector.

Comment on lines +66 to +71
d0 = AE_L32_I(buf, 0 * sizeof(ae_int32x2));
d1 = AE_L32_I(buf, 1 * sizeof(ae_int32x2));
data01 = AE_SEL32_HH(d0, d1);
d0 = AE_L32_I(buf, 2 * sizeof(ae_int32x2));
d1 = AE_L32_I(buf, 3 * sizeof(ae_int32x2));
data23 = AE_SEL32_HH(d0, d1);
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HiFi3 intrinsic usage here appears type-inconsistent in a way that is likely to either not compile cleanly or produce incorrect results: AE_L32_I() is a 32-bit load but its result is assigned to ae_int32x2 (d0, d1) and then fed to AE_SEL32_HH(), which expects valid packed lanes. Similarly, AE_S32_L_IP() is a scalar store but is passed ae_f32x2 values (data01, data23). Consider restructuring to use consistent vector/scalar types (e.g., load complex samples with AE_L32X2_* into ae_int32x2, extract/insert the real lanes explicitly, and store back with AE_S32X2_*, or use scalar ae_int32 loads/stores for the real parts).

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be possible to change all to ae_f32x2 (that's what the fractional int data really is) and apply type cast to ae_int32 where needed.

Comment on lines +84 to +87
AE_S32_L_IP(AE_SEL32_LH(data01, data01), buf, sizeof(ae_int32x2));
AE_S32_L_IP(data01, buf, sizeof(ae_int32x2));
AE_S32_L_IP(AE_SEL32_LH(data23, data23), buf, sizeof(ae_int32x2));
AE_S32_L_IP(data23, buf, sizeof(ae_int32x2));
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HiFi3 intrinsic usage here appears type-inconsistent in a way that is likely to either not compile cleanly or produce incorrect results: AE_L32_I() is a 32-bit load but its result is assigned to ae_int32x2 (d0, d1) and then fed to AE_SEL32_HH(), which expects valid packed lanes. Similarly, AE_S32_L_IP() is a scalar store but is passed ae_f32x2 values (data01, data23). Consider restructuring to use consistent vector/scalar types (e.g., load complex samples with AE_L32X2_* into ae_int32x2, extract/insert the real lanes explicitly, and store back with AE_S32X2_*, or use scalar ae_int32 loads/stores for the real parts).

Copilot uses AI. Check for mistakes.
Comment on lines +112 to +114
ae_f32x2 gain = AE_MOVDA32(state->gain_comp);
ae_f32x2 buffer_data;
ae_f32x2 fft_data;
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AE_L32X2_I() / AE_S32X2_IP() operate on packed integer registers (ae_int32x2), but buffer_data is declared as ae_f32x2. This mismatch is error-prone and may cause compilation issues or unintended lane interpretation. Use ae_int32x2 buffer_data for the circular buffer samples, and use the appropriate mul/acc intrinsic variants that match the operand types (and saturation requirements) for Q1.31 arithmetic.

Copilot uses AI. Check for mistakes.
Comment on lines +127 to +129
buffer_data = AE_L32X2_I(w, 0);
AE_MULAFP32X2RS(buffer_data, fft_data, gain);
AE_S32X2_IP(buffer_data, w, sizeof(ae_int32x2));
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AE_L32X2_I() / AE_S32X2_IP() operate on packed integer registers (ae_int32x2), but buffer_data is declared as ae_f32x2. This mismatch is error-prone and may cause compilation issues or unintended lane interpretation. Use ae_int32x2 buffer_data for the circular buffer samples, and use the appropriate mul/acc intrinsic variants that match the operand types (and saturation requirements) for Q1.31 arithmetic.

Copilot uses AI. Check for mistakes.
comment "STFT Process optimization level select"

choice "COMP_STFT_PROCESS_SIMD_LEVEL_SELECT"
prompt "choose which SIMD level used for STFT Process module"
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The prompt string is grammatically incorrect. Consider updating it for clarity.

Suggested change
prompt "choose which SIMD level used for STFT Process module"
prompt "Choose which SIMD level is used for the STFT Process module"

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants