Performance regression: rustc failed to optimize specific x86-64 SIMD intrinsics after 1.75.0 #124216
Comments
Did you confirm that this is the responsible change or are you guessing? |
@Nugine This is definitely more instructions and more bytes on each, so I'm marking it with I-heavy, but it appears this comes with a performance regression. Can you be precise about which of the ~19 benchmarks you appear to run have regressed, and on what architecture? I would rather we not make the 2nd vpavgb instruction come back only for your algorithm to still be dog-slow because some of the other instructions are different. Also, can you be more precise on what architectures and with what target features you're testing on? GitHub is allowed to change the CPU you run benchmarks on, and does, because their fleet is not perfectly uniform, so |
Base64-decode. rust-lang/stdarch#1477 made the change. However, the root cause may be elsewhere, possibly LLVM. To see the asm, you can use the following commands:

```sh
git clone https://github.com/Nugine/simd.git
cd simd
rustup override set 1.74.1 # or 1.75.0
RUSTFLAGS="--cfg vsimd_dump_symbols" cargo asm -p base64-simd --lib --simplify --target x86_64-unknown-linux-gnu --context 1 -- base64_simd::multiversion::decode::avx2 > base64-decode-avx2.asm
cat base64-decode-avx2.asm
```

Target: x86_64-unknown-linux-gnu

I have extracted the decode function and reproduced the regression: https://rust.godbolt.org/z/KG4cT6aPK
|
@Nugine re: the workaround: On current Rust, stable, the |
Seems the early exit somehow makes llvm lose track of the equivalence to `vpavgb`.

```rust
use std::arch::x86_64::*;

#[target_feature(enable = "avx2")]
pub unsafe fn decode(
    x: __m256i,
    ch: __m256i,
    ct: __m256i,
    dh: __m256i,
    dt: __m256i,
) -> Result<__m256i, __m256i> {
    let shr3 = _mm256_srli_epi32::<3>(x);
    let h1 = _mm256_avg_epu8(shr3, _mm256_shuffle_epi8(ch, x));
    let h2 = _mm256_avg_epu8(shr3, _mm256_shuffle_epi8(dh, x));
    let o1 = _mm256_shuffle_epi8(ct, h1);
    let o2 = _mm256_shuffle_epi8(dt, h2);
    let c1 = _mm256_adds_epi8(x, o1);
    let c2 = _mm256_add_epi8(x, o2);
    if _mm256_movemask_epi8(c1) != 0 {
        return Err(c2);
    }
    Ok(c2)
}
```

But I guess this will break down as soon as the function gets inlined if the error value is not otherwise used. |
Cool! I'll try an asm wrapper. |
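For reference, here is a minimal sketch of what such an `asm!` wrapper could look like (an editorial illustration under assumptions, not code from this thread; the function name is made up). Forcing the instruction through inline assembly keeps LLVM from rewriting it into the widen/add/shift/truncate sequence, at the cost of making the value opaque to further optimization.

```rust
use std::arch::asm;
use std::arch::x86_64::__m256i;

/// Hypothetical wrapper that always emits `vpavgb` instead of relying on
/// the intrinsic's generic lowering.
#[target_feature(enable = "avx2")]
pub unsafe fn avg_epu8_via_asm(a: __m256i, b: __m256i) -> __m256i {
    let out: __m256i;
    asm!(
        "vpavgb {out}, {a}, {b}",
        out = lateout(ymm_reg) out,
        a = in(ymm_reg) a,
        b = in(ymm_reg) b,
        // No memory access, no stack use, and a pure function of its inputs.
        options(pure, nomem, nostack),
    );
    out
}
```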
Based on jhorstmann's remark, it would be nicest to fix this in LLVM, since LLVM appears to have the information necessary to do this optimization; it just misses it in the early-return case. I don't think partially reverting a diff is unwarranted, however. |
WG-prioritization assigning priority (Zulip discussion). @rustbot label -I-prioritize +P-medium |
I've encountered the same issue. I reverted the minor change to `_mm256_avg_epu8`:

```diff
diff --git a/crates/core_arch/src/x86/avx2.rs b/crates/core_arch/src/x86/avx2.rs
index 75a393d..b4dba69 100644
--- a/crates/core_arch/src/x86/avx2.rs
+++ b/crates/core_arch/src/x86/avx2.rs
@@ -365,10 +365,7 @@ pub unsafe fn _mm256_avg_epu16(a: __m256i, b: __m256i) -> __m256i {
 #[cfg_attr(test, assert_instr(vpavgb))]
 #[stable(feature = "simd_x86", since = "1.27.0")]
 pub unsafe fn _mm256_avg_epu8(a: __m256i, b: __m256i) -> __m256i {
-    let a = simd_cast::<_, u16x32>(a.as_u8x32());
-    let b = simd_cast::<_, u16x32>(b.as_u8x32());
-    let r = simd_shr(simd_add(simd_add(a, b), u16x32::splat(1)), u16x32::splat(1));
-    transmute(simd_cast::<_, u8x32>(r))
+    transmute(pavgb(a.as_u8x32(), b.as_u8x32()))
 }
 
 /// Blends packed 32-bit integers from `a` and `b` using control mask `IMM4`.
@@ -3645,6 +3642,8 @@ pub unsafe fn _mm256_extract_epi16<const INDEX: i32>(a: __m256i) -> i32 {
 #[allow(improper_ctypes)]
 extern "C" {
+    #[link_name = "llvm.x86.avx2.pavg.b"]
+    fn pavgb(a: u8x32, b: u8x32) -> u8x32;
     #[link_name = "llvm.x86.avx2.phadd.w"]
     fn phaddw(a: i16x16, b: i16x16) -> i16x16;
     #[link_name = "llvm.x86.avx2.phadd.d"]
```

I've used the code given by @Nugine since the example is much smaller. You can compare the differences between the stable branch and my changes. The issue remains regardless of whether the function is inlined or not.

```diff
diff --git a/./issue_stable.s b/./issue_my.s
index fc2df6963c0..a11b08d6c92 100644
--- a/./issue_stable.s
+++ b/./issue_my.s
@@ -165,17 +165,11 @@ _ZN5issue6decode17hb5cdd31d54f96558E:
.LBB6_1:
mov rdx, qword ptr [rsp + 40]
mov rax, qword ptr [rsp + 48]
- vpmovzxbw zmm1, ymm1
- vpternlogd zmm2, zmm2, zmm2, 255
- vpsubw zmm1, zmm1, zmm2
vmovdqa ymm3, ymmword ptr [rdx]
vmovdqa ymm2, ymmword ptr [rax]
xor eax, eax
vpshufb ymm3, ymm3, ymm0
- vpmovzxbw zmm3, ymm3
- vpaddw zmm1, zmm1, zmm3
- vpsrlw zmm1, zmm1, 1
- vpmovwb ymm1, zmm1
+ vpavgb ymm1, ymm1, ymm3
vpshufb ymm1, ymm2, ymm1
vpaddb ymm0, ymm1, ymm0
vmovdqa ymmword ptr [rcx + 32], ymm0
```
|
However, what is more interesting is that this:

```rust
pub unsafe extern "C" fn genericized_mm256_avg_epu8(a: __m256i, b: __m256i) -> __m256i {
    let a = simd_cast::<u8x32, u16x32>(a.into());
    let b = simd_cast::<u8x32, u16x32>(b.into());
    let r = simd_shr(simd_add(simd_add(a, b), u16x32::splat(1)), u16x32::splat(1));
    transmute(simd_cast::<_, u8x32>(r))
}
```

compiles to this:

```asm
example::genericized_mm256_avg_epu8::hcdc2eb986be9a161:
vpavgb ymm0, ymm0, ymm1
ret
```
|
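As a side note (an editorial sketch, not code from the thread): per lane, the generic implementation above is exactly a rounding byte average, which is the operation `vpavgb` performs.

```rust
// Scalar equivalent of one u8 lane of the widen/add/add-1/shift/truncate
// pattern shown above: a rounding average, i.e. (a + b + 1) / 2.
fn avg_u8(a: u8, b: u8) -> u8 {
    ((a as u16 + b as u16 + 1) >> 1) as u8
}
```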
@workingjubilee Because LLVM InstCombineVectorOps or other InstCombiners can recognize and optimize this code into the vpavgb instruction, I've proven it using an IR function.

```llvm
define <32 x i8> @genericized_mm256_avg_epu8(<32 x i8> %a, <32 x i8> %b) {
entry:
; Cast <32 x i8> to <32 x i16>
%a_wide = zext <32 x i8> %a to <32 x i16>
%b_wide = zext <32 x i8> %b to <32 x i16>
; Add the two vectors: a + b
%sum_ab = add <32 x i16> %a_wide, %b_wide
; Add 1 to each element in the result
%sum_ab_plus_one = add <32 x i16> %sum_ab, <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1,
i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1,
i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1,
i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
; Right shift each element by 1 -> dividing by 2
%average = lshr <32 x i16> %sum_ab_plus_one, <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1,
i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1,
i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1,
i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
; Truncate <32 x i16> back to <32 x i8>
%result = trunc <32 x i16> %average to <32 x i8>
ret <32 x i8> %result
}
```

Output:

```asm
genericized_mm256_avg_epu8: # @genericized_mm256_avg_epu8
vpavgb ymm0, ymm0, ymm1
ret
```
|
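For comparison, the same per-lane pattern can be written as plain scalar Rust (an editorial sketch with an assumed codegen outcome, not an assembly listing from the thread); built with optimizations and AVX2 enabled, LLVM's vectorizer is generally able to turn this loop into `vpavgb` as well.

```rust
// Rounding average over two 32-byte arrays. With `-C opt-level=3` and
// `-C target-feature=+avx2` this loop is expected (not verified here)
// to vectorize down to a vpavgb.
pub fn avg_bytes(out: &mut [u8; 32], a: &[u8; 32], b: &[u8; 32]) {
    for i in 0..32 {
        out[i] = ((a[i] as u16 + b[i] as u16 + 1) >> 1) as u8;
    }
}
```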
Here's a smaller example:

```rust
use std::arch::x86_64::*;

pub fn mul_and_shift(a: __m128i, b: __m128i) -> __m128i {
    unsafe { _mm_srli_epi16::<1>(_mm_mulhi_epu16(a, b)) }
}
```

Optimal codegen would be:

```asm
mul_and_shift:
pmulhuw xmm0, xmm1
psrlw xmm0, 1
ret
```

But rustc currently compiles this to:

```asm
mul_and_shift:
pmulhuw xmm0, xmm1
punpcklwd xmm1, xmm0
punpckhwd xmm0, xmm0
psrld xmm0, 17
psrld xmm1, 17
packssdw xmm1, xmm0
movdqa xmm0, xmm1
ret
```

This particular case (a 16 x 16 -> 32 multiplication, shifted right by more than 16) might be actionable on the LLVM side, but I imagine there are many other special cases that are harder to catch (e.g. 16 x 16 -> 32, shifted by 16 to the right and then 16 to the left, would probably be compiled to a 32-bit multiplication and a mask instead of a 16 x 16 -> 32 multiply and sprinkling in zeroes). It sounds to me like it might be better to roll back some parts of the stdarch PR. |
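To spell out the semantics of the example above (an editorial sketch, not code from the thread): per 16-bit lane, the whole sequence is just the full 32-bit product shifted right by 17 and truncated, which is why `pmulhuw` followed by `psrlw` is sufficient.

```rust
// Per-lane meaning of _mm_srli_epi16::<1>(_mm_mulhi_epu16(a, b)):
// the 16 x 16 -> 32 product, shifted right by 17, kept in 16 bits.
fn mulhi_then_shift(a: u16, b: u16) -> u16 {
    ((u32::from(a) * u32::from(b)) >> 17) as u16
}
```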
The stdarch PR nicely splits changes to different functions across individual commits. I think anyone can probably post a revert of the relevant patches for any that wound up with worse codegen. Cc @eduardosm in case you know any reason why the results in this issue might be different from what you saw before the PR. |
@rustbot label: -E-needs-bisection |
@nikic created an LLVM issue llvm/llvm-project#132166 |
It looks like passing the result of `_mm256_avg_epu8` to `_mm256_movemask_epi8` is what defeats the optimization.

If rust-lang/stdarch#1477 gets (possibly partially) reverted, please ping me so that I can re-add the intrinsic implementations to Miri.

Godbolt link: https://rust.godbolt.org/z/M3rjqfT4j

```rust
use std::arch::x86_64::*;

#[target_feature(enable = "avx2")]
pub unsafe fn with_movemask(
    x: __m256i,
    y: __m256i,
) -> i32 {
    _mm256_movemask_epi8(_mm256_avg_epu8(x, y))
}

#[target_feature(enable = "avx2")]
pub unsafe fn without_movemask(
    x: __m256i,
    y: __m256i,
) -> __m256i {
    _mm256_avg_epu8(x, y)
}
```

```asm
example::with_movemask::h41cfcc3d303e3280:
vpmovzxbw ymm0, xmmword ptr [rdi]
vpmovzxbw ymm1, xmmword ptr [rdi + 16]
vpmovzxbw ymm2, xmmword ptr [rsi]
vpaddw ymm0, ymm0, ymm2
vpmovzxbw ymm2, xmmword ptr [rsi + 16]
vpaddw ymm1, ymm1, ymm2
vpcmpeqd ymm2, ymm2, ymm2
vpsubw ymm0, ymm0, ymm2
vpsubw ymm1, ymm1, ymm2
vpsllw ymm1, ymm1, 7
vpsllw ymm0, ymm0, 7
vpacksswb ymm0, ymm0, ymm1
vpermq ymm0, ymm0, 216
vpmovmskb eax, ymm0
vzeroupper
ret
example::without_movemask::h6e39cc68152d3e78:
mov rax, rdi
vmovdqa ymm0, ymmword ptr [rsi]
vpavgb ymm0, ymm0, ymmword ptr [rdx]
vmovdqa ymmword ptr [rdi], ymm0
vzeroupper
ret
```
|
Code
I tried this code:
https://rust.godbolt.org/z/KG4cT6aPK
I expected to see this happen: the code should emit two `vpavgb` instructions.

Instead, this happened: one of the `vpavgb` instructions is missing.

Nugine/simd#43
Version it worked on
It most recently worked on: 1.74.1
Version with regression
1.75.0 ~ nightly
@rustbot modify labels: +regression-from-stable-to-stable -regression-untriaged