
Request for tune-cpu code generation option to be promoted to stable #127961

Open · pauldoo opened this issue Jul 19, 2024 · 5 comments
Labels: A-CLI (Area: Command-line interface (CLI) to the compiler), A-targets (Area: Concerning the implications of different compiler targets), T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue)

Comments

pauldoo commented Jul 19, 2024

I'm raising this issue as I couldn't find an existing tracking issue for the tune-cpu code generation option being stabilized - and I would like it to be.

A common use of target-cpu is to specify the oldest CPU an application supports. Unfortunately this also produces a binary tuned for that specific CPU, since by default tune-cpu picks up its value from target-cpu. I would like to be able to pass tune-cpu=generic to resolve this without needing to use the nightly compiler.
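
For reference, doing this today requires nightly, e.g. RUSTFLAGS="-C target-cpu=x86-64-v2 -Z tune-cpu=generic" cargo +nightly build --release (with x86-64-v2 standing in for whatever baseline is supported); stabilization would presumably expose the same knob as -C tune-cpu.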

rustbot added the needs-triage label Jul 19, 2024
tgross35 added the T-compiler, A-CLI, and A-targets labels and removed needs-triage Jul 19, 2024
tgross35 (Contributor) commented Jul 19, 2024

I think this is one of those things that not many people have used or been exposed to, so nobody really knows what its status is. But there's probably also not much opposition if somebody wants to take the lead.

Are you interested in moving this forward yourself? If so, the process should be pretty easy - roughly:

  1. Ask on Zulip (https://rust-lang.zulipchat.com) to figure out whether there is any reason this couldn't or shouldn't happen (seems unlikely, outside of maybe a lack of testing - so your experiences are useful here)
  2. Write a short stabilization report https://rustc-dev-guide.rust-lang.org/stabilization_guide.html#write-a-stabilization-report
  3. Make a PR to update the docs
  4. Make a PR to stabilize it. That is straightforward, mostly just some tweaks in this file https://github.com/rust-lang/rust/blob/8c3a94a1c79c67924558a4adf7fb6d98f5f0f741/compiler/rustc_session/src/options.rs

the8472 (Member) commented Jul 19, 2024

> This will unfortunately result in a binary tuned for that specific CPU - since by default tune-cpu picks up the value from target-cpu.

At least for x86-64 you can pick generic microarchitecture levels instead, e.g. -C target-cpu=x86-64-v3 for AVX2 and other features introduced around the same time.
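
For example, rustc --print cfg -C target-cpu=x86-64-v3 will list the target_feature cfgs that level enables, if you want to see exactly what you're opting into.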

Sewer56 commented Jan 13, 2025

Tangentially related - I'm not sure where else to mention this, so I thought I'd throw it out here rather than never mention it anywhere.

Sometimes when I optimize code, I need to manually add inline assembly to get the optimisations I want. Setting target-cpu to native, or to a recent-ish (released in the last ~6 years) Intel or AMD CPU, usually gets me what I need, but the default does not.

Given that most people just build projects with the default settings of a plain cargo build, I was wondering whether it would be beneficial to default tune-cpu to some recently released, popular CPU, updated every once in a while.

That way programs would end up better optimized for the most popular hardware people are running, while not locking code behind CPU features in the way that target-cpu does.

It's something that comes to mind as a library author, since I don't get to control how people build my code, and I want it to be the best it can be.

Food for thought.
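
For completeness: the usual workaround when you can't control build flags is runtime feature detection with std's is_x86_feature_detected!. A minimal sketch, with illustrative names not taken from any real library:

use std::arch::x86_64::*;

// x86_64-only sketch: detect CPU features at runtime and dispatch,
// so availability doesn't depend on how the user builds the crate.
pub fn sum32(nums: &[f32; 32]) -> f32 {
    if is_x86_feature_detected!("avx") {
        // SAFETY: we just verified at runtime that AVX is available.
        unsafe { sum32_avx(nums) }
    } else {
        nums.iter().sum()
    }
}

#[target_feature(enable = "avx")]
unsafe fn sum32_avx(nums: &[f32; 32]) -> f32 {
    let mut acc = _mm256_setzero_ps();
    for i in (0..32).step_by(8) {
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(nums.as_ptr().add(i)));
    }
    // Spill the 8 partial sums and finish the reduction in scalar code.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    lanes.iter().sum()
}

Note that this only gates availability: instruction selection inside sum32_avx is still scheduled for whatever tune-cpu resolves to, which is exactly the problem described above.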

jsgroth commented Jan 27, 2025

Adding a data point on where this flag is useful:

I stumbled upon the existence of tune-cpu while trying to figure out why -C target-cpu=x86-64-v2 combined with #[target_feature(enable = "avx2,fma")] was sometimes producing significantly worse AVX code than -C target-cpu=x86-64-v3, even if I used -C target-feature to explicitly enable every CPU feature included in x86-64-v3.

After some investigation, my issue seems to have been caused by an LLVM optimization specific to x86-64-v2 where it splits unaligned 256-bit AVX loads into two 128-bit load instructions. This optimization is specifically meant for Intel Sandy Bridge CPUs (13-14 years old at this point), where performing two 128-bit loads was much faster than a single unaligned 256-bit load. This particular optimization is actually detrimental on any CPU recent enough to support AVX2, so LLVM does not split AVX loads when optimizing for x86-64-v3 or later. (It also does not seem to apply the optimization at the rustc default of target-cpu=x86-64 (v1), only when targeting x86-64-v2.)

The tune-cpu flag makes it possible to disable this outdated optimization while still building a binary that can run on pre-v3 x86 CPUs, at the cost of somewhat degraded performance on the first x86 CPUs with AVX support (mainly 2nd gen Core i3/i5/i7).

As a fairly minimal example of this particular AVX issue, take this function that computes a vectorized sum of 32 floats using AVX intrinsics:

use std::arch::x86_64::*;

// SAFETY: Only safe to call on CPUs that support AVX
#[target_feature(enable = "avx")]
pub unsafe fn avxsum(nums: &[f32; 32]) -> [f32; 8] {
    let mut sum = _mm256_setzero_ps();
    for i in (0..32).step_by(8) {
        let elems = _mm256_loadu_ps(nums.as_ptr().add(i));
        sum = _mm256_add_ps(sum, elems);
    }

    std::mem::transmute(sum)
}

When compiled with -C opt-level=3 -C target-cpu=x86-64-v2, that generates this assembly, which only performs 128-bit loads and uses a whopping 5 vector registers:

        mov     rax, rdi
        vmovups xmm0, xmmword ptr [rsi]
        vmovups xmm1, xmmword ptr [rsi + 32]
        vmovups xmm2, xmmword ptr [rsi + 64]
        vmovups xmm3, xmmword ptr [rsi + 96]
        vinsertf128     ymm0, ymm0, xmmword ptr [rsi + 16], 1
        vxorps  xmm4, xmm4, xmm4
        vaddps  ymm0, ymm0, ymm4
        vinsertf128     ymm1, ymm1, xmmword ptr [rsi + 48], 1
        vinsertf128     ymm2, ymm2, xmmword ptr [rsi + 80], 1
        vaddps  ymm0, ymm0, ymm1
        vaddps  ymm0, ymm0, ymm2
        vinsertf128     ymm1, ymm3, xmmword ptr [rsi + 112], 1
        vaddps  ymm0, ymm0, ymm1
        vextractf128    xmmword ptr [rdi + 16], ymm0, 1
        vmovups xmmword ptr [rdi], xmm0
        vzeroupper
        ret

With -C opt-level=3 -C target-cpu=x86-64-v2 -Z tune-cpu=x86-64-v3, it generates this more compact assembly instead, which folds the 256-bit loads into the AVX add instructions and only uses 1 vector register:

        mov     rax, rdi
        vxorps  xmm0, xmm0, xmm0
        vaddps  ymm0, ymm0, ymmword ptr [rsi]
        vaddps  ymm0, ymm0, ymmword ptr [rsi + 32]
        vaddps  ymm0, ymm0, ymmword ptr [rsi + 64]
        vaddps  ymm0, ymm0, ymmword ptr [rsi + 96]
        vmovups ymmword ptr [rdi], ymm0
        vzeroupper
        ret

For a more extreme example that's closer to what I ran into, take this function which does something like a vectorized dot product using AVX and FMA intrinsics:

use std::arch::x86_64::*;

const N: usize = 256;

// SAFETY: Only safe to call on CPUs that support AVX and FMA
#[target_feature(enable = "avx,fma")]
pub unsafe fn avxdot(a: &[f32; N], b: &[f32; N]) -> [f32; 8] {
    let mut sum = _mm256_setzero_ps();
    for i in (0..N).step_by(8) {
        let a_elems = _mm256_loadu_ps(a.as_ptr().add(i));
        let b_elems = _mm256_loadu_ps(b.as_ptr().add(i));
        sum = _mm256_fmadd_ps(a_elems, b_elems, sum);
    }

    std::mem::transmute(sum)
}

I'm not going to paste the full ASM because it's quite long due to fully unrolling the loop, but using -C target-cpu=x86-64-v2 without tune-cpu causes Rust/LLVM to generate an incredibly inefficient assembly implementation that unnecessarily spills vector registers onto the stack (Compiler Explorer). This seems to be caused by a bad interaction between loop unrolling and x86-64-v2 AVX load splitting.

Adding -Z tune-cpu=x86-64-v3 leads to a much more reasonable implementation that uses 256-bit loads and doesn't touch the stack at all (Compiler Explorer).
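
If you want to reproduce this locally, note that the -Z flag requires a nightly toolchain; something like rustc +nightly --crate-type lib --emit asm -C opt-level=3 -C target-cpu=x86-64-v2 -Z tune-cpu=x86-64-v3 example.rs should work, where example.rs is a placeholder for a file containing the function above (add -C llvm-args=-x86-asm-syntax=intel for the Intel syntax shown here).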

Sewer56 commented Mar 17, 2025

I could link to some examples of my own, but I think the person above has done a great job already.

I essentially consider intrinsics on x86 unusable without either a saner, more recent default for tune-cpu or an explicit stable toggle.

By default we're optimizing for nearly 20-year-old chips, making choices that run sub-optimally even on 10-year-old chips. In a recent example, I wrote a simple loop of a few operations (AoS -> SoA) that wound up 20% slower than it could have been, because the compiler replaced my instructions with what it thought would be better for ancient 2010-era CPUs that people no longer use; that's pretty sad.
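
For illustration, the kind of loop I mean looks roughly like this (a hypothetical sketch, not my actual code):

// Hypothetical AoS -> SoA conversion of the shape described above;
// just the pattern the autovectorizer has to pick instructions for.
#[derive(Clone, Copy)]
struct Pixel {
    r: u8,
    g: u8,
    b: u8,
    a: u8,
}

fn aos_to_soa(pixels: &[Pixel]) -> (Vec<u8>, Vec<u8>, Vec<u8>, Vec<u8>) {
    let mut r = Vec::with_capacity(pixels.len());
    let mut g = Vec::with_capacity(pixels.len());
    let mut b = Vec::with_capacity(pixels.len());
    let mut a = Vec::with_capacity(pixels.len());
    for p in pixels {
        r.push(p.r);
        g.push(p.g);
        b.push(p.b);
        a.push(p.a);
    }
    (r, g, b, a)
}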

Even x86-64-v3 (a 2015-era CPU standard) or newer is sufficient to make it emit good code in most of my cases, and chances are a majority of people are running newer chips than that.


I worry about how much essentially free performance is left on the table by not tuning for more recent-ish chips as a sane default.

Even 0.5% would be a massive saving in the global electricity use of Rust programs.
