
Performance regression: rustc failed to optimize specific x86-64 SIMD intrinsics after 1.75.0 #124216

Open
Nugine opened this issue Apr 21, 2024 · 16 comments
Labels
  • A-LLVM: Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
  • A-SIMD: Area: SIMD (Single Instruction Multiple Data)
  • C-bug: Category: This is a bug.
  • I-heavy: Issue: Problems and improvements with respect to binary size of generated code.
  • I-slow: Issue: Problems and improvements with respect to performance of generated code.
  • P-medium: Medium priority
  • regression-from-stable-to-stable: Performance or correctness regression from one stable version to another.
  • T-libs: Relevant to the library team, which will review and decide on the PR/issue.

Comments

@Nugine
Contributor

Nugine commented Apr 21, 2024

Code

I tried this code:

https://rust.godbolt.org/z/KG4cT6aPK

use std::arch::x86_64::*;

#[target_feature(enable = "avx2")]
pub unsafe fn decode(
    x: __m256i,
    ch: __m256i,
    ct: __m256i,
    dh: __m256i,
    dt: __m256i,
) -> Result<__m256i, ()> {
    let shr3 = _mm256_srli_epi32::<3>(x);

    let h1 = _mm256_avg_epu8(shr3, _mm256_shuffle_epi8(ch, x));
    let h2 = _mm256_avg_epu8(shr3, _mm256_shuffle_epi8(dh, x));

    let o1 = _mm256_shuffle_epi8(ct, h1);
    let o2 = _mm256_shuffle_epi8(dt, h2);

    let c1 = _mm256_adds_epi8(x, o1);
    let c2 = _mm256_add_epi8(x, o2);

    if _mm256_movemask_epi8(c1) != 0 {
        return Err(());
    }

    Ok(c2)
}

I expected to see this happen: This code should emit two vpavgb instructions.

Instead, this happened: One of the vpavgb instructions is missing.

Nugine/simd#43

Version it worked on

It most recently worked on: 1.74.1

Version with regression

1.75.0 through current nightly

rustc 1.79.0-nightly (dbce3b43b 2024-04-20)
binary: rustc
commit-hash: dbce3b43b6cb34dd3ba12c3ec6f708fe68e9c3df
commit-date: 2024-04-20
host: x86_64-unknown-linux-gnu
release: 1.79.0-nightly
LLVM version: 18.1.4

@rustbot modify labels: +regression-from-stable-to-stable -regression-untriaged

@Nugine Nugine added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Apr 21, 2024
@rustbot rustbot added I-prioritize Issue: Indicates that prioritization has been requested for this issue. needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. regression-from-stable-to-stable Performance or correctness regression from one stable version to another. and removed regression-untriaged Untriaged performance or correctness regression. labels Apr 21, 2024
@saethlin saethlin added A-SIMD Area: SIMD (Single Instruction Multiple Data) T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Apr 21, 2024
@saethlin
Member

> Blaming rust-lang/stdarch#1477

Did you confirm that this is the responsible change or are you guessing?

@workingjubilee workingjubilee added E-needs-bisection Call for participation: This issue needs bisection: https://github.com/rust-lang/cargo-bisect-rustc I-heavy Issue: Problems and improvements with respect to binary size of generated code. labels Apr 21, 2024
@workingjubilee
Member

@Nugine This is definitely more instructions, and more bytes for each, so I'm marking it with I-heavy, but it appears this also comes with a performance regression. Can you be precise about which of the ~19 benchmarks you appear to run have regressed, and on what architecture?

I would rather we not make the 2nd vpavgb instruction come back only for your algorithm to still be dog-slow because some of the other instructions are different.

Also, can you be more precise about which architectures and which target features you're testing with? GitHub is allowed to change the CPU you run benchmarks on, and does, because their fleet is not perfectly uniform, so -Ctarget-cpu=native makes it more likely that your benchmarks will be inconsistent run-to-run and job-to-job.

@Nugine
Contributor Author

Nugine commented Apr 21, 2024

Base64 decoding in base64-simd has been slower than radix64 since Rust 1.75.0. By comparing the asm generated by 1.74.1 and 1.75.0, I found that one of the vpavgb instructions is missing. LLVM doesn't emit vpavgb for one of the _mm256_avg_epu8 calls; it emits a longer sequence of equivalent instructions instead.

rust-lang/stdarch#1477 made the change. However, the root cause may be elsewhere, possibly LLVM.

To see the asm, you can use the following commands.

git clone https://github.com/Nugine/simd.git
cd simd
rustup override set 1.74.1 # or 1.75.0
RUSTFLAGS="--cfg vsimd_dump_symbols" cargo asm -p base64-simd --lib --simplify --target x86_64-unknown-linux-gnu  --context 1 -- base64_simd::multiversion::decode::avx2 > base64-decode-avx2.asm
cat base64-decode-avx2.asm

Target: x86_64-unknown-linux-gnu
Instruction: AVX2

I have extracted the decode function and reproduced the regression. https://rust.godbolt.org/z/KG4cT6aPK
I'm looking for:

  • a stable workaround method to generate vpavgb
  • why the optimization is missing

@workingjubilee
Member

workingjubilee commented Apr 21, 2024

@Nugine re: the workaround: On current Rust, stable, the decode_asm function here recovers exactly equivalent output to what you had before: https://rust.godbolt.org/z/fGEaYME1h
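
For readers who don't follow the Godbolt link: below is a minimal sketch of what such an inline-asm wrapper could look like. This is an assumption about the approach, not a copy of the linked decode_asm, and the operand constraints are illustrative. Note that an asm! block is opaque to LLVM, so it pins the vpavgb at the cost of hiding the operation from surrounding optimizations.

use std::arch::asm;
use std::arch::x86_64::__m256i;

/// Hypothetical wrapper that hard-codes a single `vpavgb`, bypassing the
/// generic lowering of `_mm256_avg_epu8` entirely.
#[target_feature(enable = "avx2")]
#[inline]
pub unsafe fn avg_epu8_asm(a: __m256i, b: __m256i) -> __m256i {
    let out: __m256i;
    asm!(
        "vpavgb {out}, {a}, {b}",
        out = lateout(ymm_reg) out,
        a = in(ymm_reg) a,
        b = in(ymm_reg) b,
        // pure + nomem: no side effects and no memory access, which lets
        // LLVM schedule and deduplicate calls to this wrapper.
        options(pure, nomem, nostack),
    );
    out
}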

@jhorstmann
Contributor

Seems the early exit somehow makes LLVM lose track of the equivalence to the vpavgb instruction. Another workaround thus seems to be to force LLVM to compute both the Ok and Err values:

#[target_feature(enable = "avx2")]
pub unsafe fn decode(
    x: __m256i,
    ch: __m256i,
    ct: __m256i,
    dh: __m256i,
    dt: __m256i,
) -> Result<__m256i, __m256i> {
    let shr3 = _mm256_srli_epi32::<3>(x);

    let h1 = _mm256_avg_epu8(shr3, _mm256_shuffle_epi8(ch, x));
    let h2 = _mm256_avg_epu8(shr3, _mm256_shuffle_epi8(dh, x));

    let o1 = _mm256_shuffle_epi8(ct, h1);
    let o2 = _mm256_shuffle_epi8(dt, h2);

    let c1 = _mm256_adds_epi8(x, o1);
    let c2 = _mm256_add_epi8(x, o2);

    if _mm256_movemask_epi8(c1) != 0 {
        return Err(c2);
    }

    Ok(c2)
}

But I guess this will break down as soon as the function gets inlined if the error value is not otherwise used.
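
A variant of this idea that keeps the original Result<__m256i, ()> signature would be to route c2 through std::hint::black_box on the error path, so LLVM cannot discard it after inlining. This is an untested sketch based on the discussion above; whether it actually preserves the second vpavgb was not verified in this thread.

use std::arch::x86_64::*;
use std::hint::black_box;

#[target_feature(enable = "avx2")]
pub unsafe fn decode(
    x: __m256i,
    ch: __m256i,
    ct: __m256i,
    dh: __m256i,
    dt: __m256i,
) -> Result<__m256i, ()> {
    let shr3 = _mm256_srli_epi32::<3>(x);

    let h1 = _mm256_avg_epu8(shr3, _mm256_shuffle_epi8(ch, x));
    let h2 = _mm256_avg_epu8(shr3, _mm256_shuffle_epi8(dh, x));

    let o1 = _mm256_shuffle_epi8(ct, h1);
    let o2 = _mm256_shuffle_epi8(dt, h2);

    let c1 = _mm256_adds_epi8(x, o1);
    let c2 = _mm256_add_epi8(x, o2);

    if _mm256_movemask_epi8(c1) != 0 {
        // Keep c2 observable on the error path so it cannot be
        // dead-code-eliminated after inlining (assumption, not verified).
        black_box(c2);
        return Err(());
    }

    Ok(c2)
}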

@Nugine
Contributor Author

Nugine commented Apr 21, 2024

> @Nugine re: the workaround: On current Rust, stable, the decode_asm function here recovers exactly equivalent output to what you had before: https://rust.godbolt.org/z/fGEaYME1h

Cool! I'll try the asm wrapper.

@saethlin saethlin removed the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Apr 21, 2024
@workingjubilee workingjubilee added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. labels Apr 21, 2024
@workingjubilee
Member

workingjubilee commented Apr 21, 2024

Based on jhorstmann's remark, it would be nicest to fix this in LLVM, since LLVM appears to have the information necessary to do this optimization; it just misses it in the early-return case. I don't think partially reverting the diff would be unwarranted, however.

@apiraino
Contributor

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-medium

@rustbot rustbot added P-medium Medium priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Apr 24, 2024
@Deniskore

I've encountered the same issue. I reverted the minor change concerning pavgb on the stable branch, rebuilt the compiler, and can confirm that this pull request was the primary cause. I understand that the changes in this PR were made with good intentions, so I'm going to use asm! for now.

diff --git a/crates/core_arch/src/x86/avx2.rs b/crates/core_arch/src/x86/avx2.rs
index 75a393d..b4dba69 100644
--- a/crates/core_arch/src/x86/avx2.rs
+++ b/crates/core_arch/src/x86/avx2.rs
@@ -365,10 +365,7 @@ pub unsafe fn _mm256_avg_epu16(a: __m256i, b: __m256i) -> __m256i {
 #[cfg_attr(test, assert_instr(vpavgb))]
 #[stable(feature = "simd_x86", since = "1.27.0")]
 pub unsafe fn _mm256_avg_epu8(a: __m256i, b: __m256i) -> __m256i {
-    let a = simd_cast::<_, u16x32>(a.as_u8x32());
-    let b = simd_cast::<_, u16x32>(b.as_u8x32());
-    let r = simd_shr(simd_add(simd_add(a, b), u16x32::splat(1)), u16x32::splat(1));
-    transmute(simd_cast::<_, u8x32>(r))
+    transmute(pavgb(a.as_u8x32(), b.as_u8x32()))
 }

 /// Blends packed 32-bit integers from `a` and `b` using control mask `IMM4`.
@@ -3645,6 +3642,8 @@ pub unsafe fn _mm256_extract_epi16<const INDEX: i32>(a: __m256i) -> i32 {

 #[allow(improper_ctypes)]
 extern "C" {
+    #[link_name = "llvm.x86.avx2.pavg.b"]
+    fn pavgb(a: u8x32, b: u8x32) -> u8x32;
     #[link_name = "llvm.x86.avx2.phadd.w"]
     fn phaddw(a: i16x16, b: i16x16) -> i16x16;
     #[link_name = "llvm.x86.avx2.phadd.d"]

I've used the code given by @Nugine since the example is much smaller. You can compare the differences between the stable branch and my changes. The issue persists regardless of whether the function is inlined, tested with both #[inline(always)] and #[inline(never)].

diff --git a/./issue_stable.s b/./issue_my.s
index fc2df6963c0..a11b08d6c92 100644
--- a/./issue_stable.s
+++ b/./issue_my.s
@@ -165,17 +165,11 @@ _ZN5issue6decode17hb5cdd31d54f96558E:
 .LBB6_1:
        mov     rdx, qword ptr [rsp + 40]
        mov     rax, qword ptr [rsp + 48]
-       vpmovzxbw       zmm1, ymm1
-       vpternlogd      zmm2, zmm2, zmm2, 255
-       vpsubw  zmm1, zmm1, zmm2
        vmovdqa ymm3, ymmword ptr [rdx]
        vmovdqa ymm2, ymmword ptr [rax]
        xor     eax, eax
        vpshufb ymm3, ymm3, ymm0
-       vpmovzxbw       zmm3, ymm3
-       vpaddw  zmm1, zmm1, zmm3
-       vpsrlw  zmm1, zmm1, 1
-       vpmovwb ymm1, zmm1
+       vpavgb  ymm1, ymm1, ymm3
        vpshufb ymm1, ymm2, ymm1
        vpaddb  ymm0, ymm1, ymm0
        vmovdqa ymmword ptr [rcx + 32], ymm0

@workingjubilee
Member

The -Ctarget-cpu=native in this Godbolt is somewhat confounding, because the host CPU can change, enabling or disabling various optimizations. It is also unnecessary for getting the correct codegen.

However, what is more interesting is that this:

pub unsafe extern "C" fn genericized_mm256_avg_epu8(a: __m256i, b: __m256i) -> __m256i {
    let a = simd_cast::<u8x32, u16x32>(a.into());
    let b = simd_cast::<u8x32, u16x32>(b.into());
    let r = simd_shr(simd_add(simd_add(a, b), u16x32::splat(1)), u16x32::splat(1));
    transmute(simd_cast::<_, u8x32>(r))
}

compiles to this:

example::genericized_mm256_avg_epu8::hcdc2eb986be9a161:
        vpavgb  ymm0, ymm0, ymm1
        ret

https://rust.godbolt.org/z/sxGY1ffWq

@Deniskore

@workingjubilee That's because LLVM's InstCombineVectorOps (or other InstCombine passes) can recognize this code and optimize it into the vpavgb instruction. I've demonstrated this with an IR function:

llc -O3 -mcpu=haswell -mattr=+avx2

define <32 x i8> @genericized_mm256_avg_epu8(<32 x i8> %a, <32 x i8> %b) {
entry:
  ; Cast <32 x i8> to <32 x i16>
  %a_wide = zext <32 x i8> %a to <32 x i16>
  %b_wide = zext <32 x i8> %b to <32 x i16>

  ; Add the two vectors: a + b
  %sum_ab = add <32 x i16> %a_wide, %b_wide

  ; Add 1 to each element in the result
  %sum_ab_plus_one = add <32 x i16> %sum_ab, <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, 
                                                    i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, 
                                                    i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, 
                                                    i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
  
  ; Right shift each element by 1 -> dividing by 2
  %average = lshr <32 x i16> %sum_ab_plus_one, <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, 
                                             i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, 
                                             i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, 
                                             i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
  
  ; Truncate <32 x i16> back to <32 x i8>
  %result = trunc <32 x i16> %average to <32 x i8>

  ret <32 x i8> %result
}

Output:

genericized_mm256_avg_epu8:             # @genericized_mm256_avg_epu8
        vpavgb  ymm0, ymm0, ymm1
        ret

@tgross35 tgross35 marked this as a duplicate of #138725 Mar 20, 2025
@tgross35 tgross35 changed the title Performance regression: rustc failed to optimize _mm256_avg_epu8 after 1.75.0 Performance regression: rustc failed to optimize specific x86-64 SIMD intrinsics after 1.75.0 Mar 20, 2025
@purplesyringa
Contributor

Here's a smaller example.

use std::arch::x86_64::*;

pub fn mul_and_shift(a: __m128i, b: __m128i) -> __m128i {
    unsafe { _mm_srli_epi16(_mm_mulhi_epu16(a, b), 1) }
}

Optimal codegen would be:

mul_and_shift:
        pmulhuw xmm0, xmm1
        psrlw   xmm0, 1
        ret

But rustc currently compiles this to:

mul_and_shift:
        pmulhuw xmm0, xmm1
        punpcklwd       xmm1, xmm0
        punpckhwd       xmm0, xmm0
        psrld   xmm0, 17
        psrld   xmm1, 17
        packssdw        xmm1, xmm0
        movdqa  xmm0, xmm1
        ret

This particular case (a 16 x 16 -> 32 multiplication shifted right by more than 16) might be actionable on the LLVM side, but I imagine there are many other special cases that are harder to catch (e.g. 16 x 16 -> 32 shifted right by 16, then left by 16, would probably be compiled to a 32-bit multiplication and a mask instead of a 16 x 16 -> 32 multiply with zeroes sprinkled in). It sounds to me like it might be better to roll back some parts of the stdarch PR.

@tgross35
Contributor

The stdarch PR nicely splits the changes to different functions across individual commits. I think anyone can probably post a revert of the relevant patches for any functions that wound up with worse codegen.

Cc @eduardosm in case you know any reason why the results in this issue might be different from what you saw before the PR.

@moxian
Contributor

moxian commented Mar 20, 2025

@rustbot label: -E-needs-bisection
(as per #124216 (comment); I can also confirm that cargo-bisect-rustc points at nightly-2023-10-29, which contains #116609)

@rustbot rustbot removed the E-needs-bisection Call for participation: This issue needs bisection: https://github.com/rust-lang/cargo-bisect-rustc label Mar 20, 2025
@tgross35
Contributor

@nikic created an LLVM issue llvm/llvm-project#132166

@eduardosm
Contributor

It looks like passing the result of _mm256_avg_epu8 to _mm256_movemask_epi8 makes LLVM miss the optimization, so it might not be related to the early return.

If rust-lang/stdarch#1477 gets reverted (possibly partially), please ping me so I can re-add the intrinsic implementations to Miri.

Godbolt link: https://rust.godbolt.org/z/M3rjqfT4j

use std::arch::x86_64::*;

#[target_feature(enable = "avx2")]
pub unsafe fn with_movemask(
    x: __m256i,
    y: __m256i,
) -> i32 {
    _mm256_movemask_epi8(_mm256_avg_epu8(x, y))
}

#[target_feature(enable = "avx2")]
pub unsafe fn without_movemask(
    x: __m256i,
    y: __m256i,
) -> __m256i {
    _mm256_avg_epu8(x, y)
}
example::with_movemask::h41cfcc3d303e3280:
        vpmovzxbw       ymm0, xmmword ptr [rdi]
        vpmovzxbw       ymm1, xmmword ptr [rdi + 16]
        vpmovzxbw       ymm2, xmmword ptr [rsi]
        vpaddw  ymm0, ymm0, ymm2
        vpmovzxbw       ymm2, xmmword ptr [rsi + 16]
        vpaddw  ymm1, ymm1, ymm2
        vpcmpeqd        ymm2, ymm2, ymm2
        vpsubw  ymm0, ymm0, ymm2
        vpsubw  ymm1, ymm1, ymm2
        vpsllw  ymm1, ymm1, 7
        vpsllw  ymm0, ymm0, 7
        vpacksswb       ymm0, ymm0, ymm1
        vpermq  ymm0, ymm0, 216
        vpmovmskb       eax, ymm0
        vzeroupper
        ret

example::without_movemask::h6e39cc68152d3e78:
        mov     rax, rdi
        vmovdqa ymm0, ymmword ptr [rsi]
        vpavgb  ymm0, ymm0, ymmword ptr [rdx]
        vmovdqa ymmword ptr [rdi], ymm0
        vzeroupper
        ret
