[X86] combineStore - attempt to store i256/i512 types as v4i64/v8i64 vectors #172288
Conversation
If the larger-than-legal scalar integer is freely foldable to a vector type, or is likely to be custom lowered to an operation using vectors, attempt to convert it to a vector store. This doesn't appear to be worth it for i128 types, which can more easily convert between types with extra insert_subvector steps etc.

@llvm/pr-subscribers-backend-x86
Author: Simon Pilgrim (RKSimon)

Full diff: https://github.com/llvm/llvm-project/pull/172288.diff

3 Files Affected:
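To make the transform concrete, here is a minimal IR sketch in the spirit of the PR's vec512_i256 test (a hypothetical function, not one of the PR's test files). Per the updated CHECK lines in the diff below, on AVX targets the NOT of an i256 load is now stored with 256-bit vector moves instead of four scalar movq stores:

    ; Hypothetical sketch: the i256 xor is lowered using vector ops on
    ; AVX targets, so the combine can bitcast the stored value to v4i64
    ; and emit a single 256-bit vector store.
    define void @store_i256_not(ptr %src, ptr %dst) {
      %v = load i256, ptr %src, align 32
      %not = xor i256 %v, -1
      store i256 %not, ptr %dst, align 32
      ret void
    }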
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index a85ed96e11158..5d9e627e0484b 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -54078,6 +54078,28 @@ static SDValue combineStore(SDNode *N, SelectionDAG &DAG,
St->getMemOperand());
}
+ // All code below attempts to move stores to FPU, early out if we can't.
+ const Function &F = DAG.getMachineFunction().getFunction();
+ bool NoImplicitFloatOps = F.hasFnAttribute(Attribute::NoImplicitFloat);
+ if (Subtarget.useSoftFloat() || NoImplicitFloatOps)
+ return SDValue();
+
+ // If we are storing a larger than legal scalar integer, see if we can cheaply
+ // handle this as a vector store, either because it already bitcasts to a
+ // vector type or the operation is likely to expand to a vector type
+ // (legalization can scalarize back if the op fails).
+ if (VT == MVT::i256 || VT == MVT::i512) {
+ MVT VecVT = MVT::getVectorVT(MVT::i64, VT.getSizeInBits() / 64);
+ if (TLI.isTypeLegal(VecVT) && ISD::isNormalStore(St) &&
+ (mayFoldIntoVector(StoredVal, Subtarget) ||
+ (DCI.isBeforeLegalize() &&
+ TLI.getOperationAction(StoredVal.getOpcode(), VT) ==
+ TargetLowering::LegalizeAction::Custom)))
+ return DAG.getStore(St->getChain(), dl, DAG.getBitcast(VecVT, StoredVal),
+ St->getBasePtr(), St->getPointerInfo(),
+ St->getBaseAlign(), St->getMemOperand()->getFlags());
+ }
+
// Turn load->store of MMX types into GPR load/stores. This avoids clobbering
// the FP state in cases where an emms may be missing.
// A preferable solution to the general problem is to figure out the right
@@ -54087,11 +54109,7 @@ static SDValue combineStore(SDNode *N, SelectionDAG &DAG,
if (VT.getSizeInBits() != 64)
return SDValue();
- const Function &F = DAG.getMachineFunction().getFunction();
- bool NoImplicitFloatOps = F.hasFnAttribute(Attribute::NoImplicitFloat);
- bool F64IsLegal =
- !Subtarget.useSoftFloat() && !NoImplicitFloatOps && Subtarget.hasSSE2();
-
+ bool F64IsLegal = Subtarget.hasSSE2();
if (!F64IsLegal || Subtarget.is64Bit())
return SDValue();
diff --git a/llvm/test/CodeGen/X86/atomic-unordered.ll b/llvm/test/CodeGen/X86/atomic-unordered.ll
index e8e0ee0b7ef49..7946204865ab6 100644
--- a/llvm/test/CodeGen/X86/atomic-unordered.ll
+++ b/llvm/test/CodeGen/X86/atomic-unordered.ll
@@ -272,16 +272,11 @@ define i256 @load_i256(ptr %ptr) {
; CHECK-O0-NEXT: callq __atomic_load@PLT
; CHECK-O0-NEXT: movq (%rsp), %rdi # 8-byte Reload
; CHECK-O0-NEXT: movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
-; CHECK-O0-NEXT: movq {{[0-9]+}}(%rsp), %rcx
-; CHECK-O0-NEXT: movq {{[0-9]+}}(%rsp), %rdx
-; CHECK-O0-NEXT: movq {{[0-9]+}}(%rsp), %rsi
-; CHECK-O0-NEXT: movq {{[0-9]+}}(%rsp), %r8
-; CHECK-O0-NEXT: movq %r8, 24(%rdi)
-; CHECK-O0-NEXT: movq %rsi, 16(%rdi)
-; CHECK-O0-NEXT: movq %rdx, 8(%rdi)
-; CHECK-O0-NEXT: movq %rcx, (%rdi)
+; CHECK-O0-NEXT: vmovups {{[0-9]+}}(%rsp), %ymm0
+; CHECK-O0-NEXT: vmovups %ymm0, (%rdi)
; CHECK-O0-NEXT: addq $56, %rsp
; CHECK-O0-NEXT: .cfi_def_cfa_offset 8
+; CHECK-O0-NEXT: vzeroupper
; CHECK-O0-NEXT: retq
;
; CHECK-O3-LABEL: load_i256:
diff --git a/llvm/test/CodeGen/X86/elementwise-store-of-scalar-splat.ll b/llvm/test/CodeGen/X86/elementwise-store-of-scalar-splat.ll
index d9158c4af18fa..c3822b3f3021f 100644
--- a/llvm/test/CodeGen/X86/elementwise-store-of-scalar-splat.ll
+++ b/llvm/test/CodeGen/X86/elementwise-store-of-scalar-splat.ll
@@ -1,14 +1,14 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=-sse2 | FileCheck %s --check-prefixes=ALL,SCALAR
-; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+sse2 | FileCheck %s --check-prefixes=ALL,SSE,SSE2,SSE2-ONLY
-; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+sse3 | FileCheck %s --check-prefixes=ALL,SSE,SSE2,SSE3
-; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+ssse3 | FileCheck %s --check-prefixes=ALL,SSE,SSSE3,SSSE3-ONLY
-; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+sse4.1 | FileCheck %s --check-prefixes=ALL,SSE,SSSE3,SSE41
-; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+sse4.2 | FileCheck %s --check-prefixes=ALL,SSE,SSSE3,SSE42
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+sse2 | FileCheck %s --check-prefixes=ALL,SSE
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+sse3 | FileCheck %s --check-prefixes=ALL,SSE
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+ssse3 | FileCheck %s --check-prefixes=ALL,SSE
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+sse4.1 | FileCheck %s --check-prefixes=ALL,SSE
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+sse4.2 | FileCheck %s --check-prefixes=ALL,SSE
; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+avx | FileCheck %s --check-prefixes=ALL,AVX,AVX1
; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+avx2 | FileCheck %s --check-prefixes=ALL,AVX,AVX2
-; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+avx512vl | FileCheck %s --check-prefixes=ALL,AVX512,AVX512F
-; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+avx512vl,+avx512bw | FileCheck %s --check-prefixes=ALL,AVX512,AVX512BW
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+avx512vl | FileCheck %s --check-prefixes=ALL,AVX512
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+avx512vl,+avx512bw | FileCheck %s --check-prefixes=ALL,AVX512
define void @vec16_i8(ptr %in.elt.ptr, ptr %out.vec.ptr) nounwind {
; ALL-LABEL: vec16_i8:
@@ -1715,10 +1715,8 @@ define void @vec512_i256(ptr %in.elt.ptr, ptr %out.vec.ptr) nounwind {
; AVX1-NEXT: vxorps %xmm0, %xmm0, %xmm0
; AVX1-NEXT: vcmptrueps %ymm0, %ymm0, %ymm0
; AVX1-NEXT: vxorps (%rdi), %ymm0, %ymm0
-; AVX1-NEXT: vextractf128 $1, %ymm0, 16(%rsi)
-; AVX1-NEXT: vmovaps %xmm0, (%rsi)
-; AVX1-NEXT: vextractf128 $1, %ymm0, 48(%rsi)
-; AVX1-NEXT: vmovaps %xmm0, 32(%rsi)
+; AVX1-NEXT: vmovaps %ymm0, (%rsi)
+; AVX1-NEXT: vmovaps %ymm0, 32(%rsi)
; AVX1-NEXT: vzeroupper
; AVX1-NEXT: retq
;
@@ -1726,10 +1724,8 @@ define void @vec512_i256(ptr %in.elt.ptr, ptr %out.vec.ptr) nounwind {
; AVX2: # %bb.0:
; AVX2-NEXT: vpcmpeqd %ymm0, %ymm0, %ymm0
; AVX2-NEXT: vpxor (%rdi), %ymm0, %ymm0
-; AVX2-NEXT: vextracti128 $1, %ymm0, 16(%rsi)
-; AVX2-NEXT: vmovdqa %xmm0, (%rsi)
-; AVX2-NEXT: vextracti128 $1, %ymm0, 48(%rsi)
-; AVX2-NEXT: vmovdqa %xmm0, 32(%rsi)
+; AVX2-NEXT: vmovdqa %ymm0, (%rsi)
+; AVX2-NEXT: vmovdqa %ymm0, 32(%rsi)
; AVX2-NEXT: vzeroupper
; AVX2-NEXT: retq
;
@@ -1737,10 +1733,8 @@ define void @vec512_i256(ptr %in.elt.ptr, ptr %out.vec.ptr) nounwind {
; AVX512: # %bb.0:
; AVX512-NEXT: vpcmpeqd %ymm0, %ymm0, %ymm0
; AVX512-NEXT: vpxor (%rdi), %ymm0, %ymm0
-; AVX512-NEXT: vextracti128 $1, %ymm0, 16(%rsi)
-; AVX512-NEXT: vmovdqa %xmm0, (%rsi)
-; AVX512-NEXT: vextracti128 $1, %ymm0, 48(%rsi)
-; AVX512-NEXT: vmovdqa %xmm0, 32(%rsi)
+; AVX512-NEXT: vmovdqa %ymm0, (%rsi)
+; AVX512-NEXT: vmovdqa %ymm0, 32(%rsi)
; AVX512-NEXT: vzeroupper
; AVX512-NEXT: retq
%in.elt.not = load i256, ptr %in.elt.ptr, align 64
@@ -1751,13 +1745,3 @@ define void @vec512_i256(ptr %in.elt.ptr, ptr %out.vec.ptr) nounwind {
store i256 %in.elt, ptr %out.elt1.ptr, align 32
ret void
}
-;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
-; AVX512BW: {{.*}}
-; AVX512F: {{.*}}
-; SSE2: {{.*}}
-; SSE2-ONLY: {{.*}}
-; SSE3: {{.*}}
-; SSE41: {{.*}}
-; SSE42: {{.*}}
-; SSSE3: {{.*}}
-; SSSE3-ONLY: {{.*}}
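By contrast, the i128 case called out in the description is intentionally skipped: the VT == MVT::i256 || VT == MVT::i512 guard never fires for it. A hypothetical i128 analogue of the sketch above (not part of the PR) therefore keeps the existing scalar lowering, since i128 converts between scalar and vector forms cheaply enough (extra insert_subvector steps etc.) that forcing a vector store is not a clear win:

    ; Hypothetical i128 analogue: left to the existing lowering; the
    ; combine's type guard deliberately excludes i128.
    define void @store_i128_not(ptr %src, ptr %dst) {
      %v = load i128, ptr %src, align 16
      %not = xor i128 %v, -1
      store i128 %not, ptr %dst, align 16
      ret void
    }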
Review comment on:

    bool F64IsLegal = Subtarget.hasSSE2();
    if (!F64IsLegal || Subtarget.is64Bit())

Suggested change:

    - if (!F64IsLegal || Subtarget.is64Bit())
    + if (!Subtarget.hasSSE2() || Subtarget.is64Bit())
Review comment on:

    if (VT == MVT::i256 || VT == MVT::i512) {
      MVT VecVT = MVT::getVectorVT(MVT::i64, VT.getSizeInBits() / 64);
      if (TLI.isTypeLegal(VecVT) && ISD::isNormalStore(St) &&
          (mayFoldIntoVector(StoredVal, Subtarget) ||

Do we have a test for this?

Reply: yes - everything except the isNormalStore check, as it's tricky to persuade the DAG to create a truncated store from illegal source types.
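For context on the untested ISD::isNormalStore guard (a hypothetical sketch, not from the PR's tests): the guard excludes truncating stores, which write only part of the stored value and therefore cannot simply be bitcast to a v4i64/v8i64 store. IR along these lines would exercise the guard only if the DAG actually folded the trunc into the store, which, as noted in the reply above, is hard to provoke from illegal source types:

    ; Hypothetical: would only hit the guard if the DAG formed a
    ; truncating i512->i256 store; isNormalStore rejects such stores
    ; from the bitcast-to-vector path.
    define void @store_trunc_i512(ptr %dst, i512 %x) {
      %t = trunc i512 %x to i256
      store i256 %t, ptr %dst, align 32
      ret void
    }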
phoebewang left a comment:

LGTM.