Request for `tune-cpu` code generation option to be promoted to stable #127961
I think this is one of those things where not many people have used it or are exposed to it, so nobody really knows what its status is. But there's also probably not much opposition if somebody wants to take the lead. Are you interested in moving this forward yourself? The process should be pretty easy if so.
At least for x86-64 you can pick one of the generic microarchitecture profiles instead, e.g. `target-cpu=x86-64-v2` or `target-cpu=x86-64-v3`.
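For illustration, a build that raises the feature baseline to one of those profiles might look like this (a hypothetical invocation, not taken from the comment above):

```sh
# Build for the x86-64-v2 feature level instead of a specific CPU model.
RUSTFLAGS="-C target-cpu=x86-64-v2" cargo build --release
```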
Tangentially related - not sure where to mention it, but I thought I'd throw it out there instead of never mentioning it anywhere. Sometimes when I optimize code, I need to manually add inline assembly in order to get the correct optimisations; setting `tune-cpu` appropriately helps avoid that. Given that most people just build projects with default settings, and those defaults tune for quite old hardware, it might be worth raising the default tuning target. That way programs would end up better optimized for the most popular hardware people are running, while not locking code behind CPU features in the way that `target-cpu` does. It's something that comes to mind as a library author, since I don't get to control how people build my code, and I want it to be the best it can be. Food for thought.
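As a minimal sketch of the inline-assembly workaround mentioned above (illustrative only, not code from the comment; the function name and operand choices are made up), assuming the goal is to pin a single unaligned 256-bit load:

```rust
use std::arch::asm;
use std::arch::x86_64::__m256;

/// Force a single unaligned 256-bit load via inline asm so the backend cannot
/// split it into two 128-bit halves, regardless of the CPU it is tuning for.
///
/// # Safety
/// `ptr` must be valid for reading 32 bytes and the CPU must support AVX.
#[target_feature(enable = "avx")]
pub unsafe fn load_256_unsplit(ptr: *const f32) -> __m256 {
    let result: __m256;
    // `asm!` uses Intel syntax on x86 by default.
    unsafe {
        asm!(
            "vmovups {dst}, ymmword ptr [{addr}]",
            addr = in(reg) ptr,
            dst = out(ymm_reg) result,
            options(pure, readonly, nostack),
        );
    }
    result
}
```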
Adding a data point on where this flag is useful: I stumbled upon the existence of `tune-cpu` while investigating a performance problem. After some investigation, my issue seems to have been caused by an LLVM optimization specific to x86-64-v2 where it splits unaligned 256-bit AVX loads into two 128-bit load instructions. This optimization is specifically meant for Intel Sandy Bridge CPUs (13-14 years old at this point), where performing two 128-bit loads was much faster than a single unaligned 256-bit load. This particular optimization is actually detrimental on any CPU recent enough to support AVX2, so LLVM does not split AVX loads when optimizing for x86-64-v3 or later. (It also does not seem to apply the optimization at the rustc default of `target-cpu=x86-64`.) The `tune-cpu` option makes it possible to keep `target-cpu=x86-64-v2` as the feature baseline while tuning for CPUs that don't benefit from the split loads.

As a fairly minimal example of this particular AVX issue, take this function that computes a vectorized sum of 32 floats using AVX intrinsics:

```rust
use std::arch::x86_64::*;
// SAFETY: Only safe to call on CPUs that support AVX
#[target_feature(enable = "avx")]
pub unsafe fn avxsum(nums: &[f32; 32]) -> [f32; 8] {
    let mut sum = _mm256_setzero_ps();
    for i in (0..32).step_by(8) {
        let elems = _mm256_loadu_ps(nums.as_ptr().add(i));
        sum = _mm256_add_ps(sum, elems);
    }
    std::mem::transmute(sum)
}
```

When compiled with `-C target-cpu=x86-64-v2`, each 256-bit load is split into a 128-bit `vmovups` plus a `vinsertf128`:

```asm
mov rax, rdi
vmovups xmm0, xmmword ptr [rsi]
vmovups xmm1, xmmword ptr [rsi + 32]
vmovups xmm2, xmmword ptr [rsi + 64]
vmovups xmm3, xmmword ptr [rsi + 96]
vinsertf128 ymm0, ymm0, xmmword ptr [rsi + 16], 1
vxorps xmm4, xmm4, xmm4
vaddps ymm0, ymm0, ymm4
vinsertf128 ymm1, ymm1, xmmword ptr [rsi + 48], 1
vinsertf128 ymm2, ymm2, xmmword ptr [rsi + 80], 1
vaddps ymm0, ymm0, ymm1
vaddps ymm0, ymm0, ymm2
vinsertf128 ymm1, ymm3, xmmword ptr [rsi + 112], 1
vaddps ymm0, ymm0, ymm1
vextractf128 xmmword ptr [rdi + 16], ymm0, 1
vmovups xmmword ptr [rdi], xmm0
vzeroupper
ret
```

With `tune-cpu` set to something newer (e.g. `x86-64-v3`), the loads stay 256 bits wide:

```asm
mov rax, rdi
vxorps xmm0, xmm0, xmm0
vaddps ymm0, ymm0, ymmword ptr [rsi]
vaddps ymm0, ymm0, ymmword ptr [rsi + 32]
vaddps ymm0, ymm0, ymmword ptr [rsi + 64]
vaddps ymm0, ymm0, ymmword ptr [rsi + 96]
vmovups ymmword ptr [rdi], ymm0
vzeroupper
ret
```

For a more extreme example that's closer to what I ran into, take this function which does something like a vectorized dot product using AVX and FMA intrinsics:

```rust
use std::arch::x86_64::*;
const N: usize = 256;

// SAFETY: Only safe to call on CPUs that support AVX and FMA
#[target_feature(enable = "avx,fma")]
pub unsafe fn avxdot(a: &[f32; N], b: &[f32; N]) -> [f32; 8] {
    let mut sum = _mm256_setzero_ps();
    for i in (0..N).step_by(8) {
        let a_elems = _mm256_loadu_ps(a.as_ptr().add(i));
        let b_elems = _mm256_loadu_ps(b.as_ptr().add(i));
        sum = _mm256_fmadd_ps(a_elems, b_elems, sum);
    }
    std::mem::transmute(sum)
}
```

I'm not going to paste the full ASM because it's quite long due to fully unrolling the loop, but with x86-64-v2 tuning every one of those 256-bit loads gets split in the same way. Adding a newer `tune-cpu` value makes the splitting go away here as well.
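If anyone wants to reproduce the comparison above, something along these lines should work (the file name is a placeholder, and the flag spelling assumes the current nightly `-Z tune-cpu` option):

```sh
# Feature level and tuning both x86-64-v2: the 256-bit loads get split.
rustc +nightly -O --emit=asm --crate-type=lib -C target-cpu=x86-64-v2 avx_example.rs -o tuned_v2.s

# Same feature level, but tuned for x86-64-v3: the loads stay 256 bits wide.
rustc +nightly -O --emit=asm --crate-type=lib -C target-cpu=x86-64-v2 -Z tune-cpu=x86-64-v3 avx_example.rs -o tuned_v3.s
```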
I could link to some examples of my own, but I think the person above has done a great job already. I essentially consider intrinsics on x86 unusable without either a sane default for `tune-cpu` or a stable way to override it. By default we're optimizing for nearly 20 year old chips, making choices which run sub-optimally for even 10 year old chips, and it's unfortunate. In a recent example, I wrote a simple loop of a few operations (AoS -> SoA) which wound up 20% slower than it could have been, because the compiler replaced my instructions with what it thought would be better for ancient 2010-era CPUs that people no longer use; and that's pretty sad. Even a modestly newer tuning target avoids this. I worry about how much performance is left on the table - performance that's basically free if we tuned for more recent-ish chips as a sane default. Even 0.5% is a massive saving in the global electricity use of Rust programs.
I'm raising this issue as I couldn't find an existing tracking issue for the `tune-cpu` code generation option being stabilized - and I would like it to be. A common use of `target-cpu` is to specify the oldest type of CPU an application supports. This will unfortunately result in a binary tuned for that specific CPU - since by default `tune-cpu` picks up the value from `target-cpu`. I would like to be able to use `tune-cpu=generic` to resolve this without needing to use the nightly compiler.
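For concreteness, a sketch of the kind of build this request is about (the CPU level is an arbitrary example, and today the tuning part requires nightly and the unstable flag spelling):

```sh
# Require x86-64-v2 everywhere, but keep generic tuning instead of tuning for
# the oldest supported microarchitecture.
RUSTFLAGS="-C target-cpu=x86-64-v2 -Z tune-cpu=generic" cargo +nightly build --release
```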