
Add a .bss-like scheme for encoded const allocs #137152

Merged
merged 1 commit into rust-lang:master from saethlin:bss-const-allocs
Mar 13, 2025

Conversation

saethlin
Member

@saethlin saethlin commented Feb 16, 2025

This check that all bytes are zero feels like it should be too slow; instead we should have a flag that we track, but that seems hard. Let's see how this perfs first.

Also we can probably stash the "it's all zero actually" flag inside one of the other struct members that's already not using an entire byte. This optimization doesn't fire all that often, so it's possible that by sticking it in the varint length field, this PR actually makes rmeta size worse.
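The scheme the PR description refers to can be sketched roughly like this. The tag/length layout and names here are made up for illustration and are not rustc's actual rmeta encoding:

```rust
// Illustrative sketch of a .bss-like encoding: if every byte of the const
// allocation is zero, emit only a tag and the length; otherwise emit the
// raw bytes after the tag and length.
fn encode_alloc(bytes: &[u8], out: &mut Vec<u8>) {
    let all_zero = bytes.iter().all(|&b| b == 0);
    out.push(if all_zero { 1 } else { 0 });
    out.extend_from_slice(&(bytes.len() as u64).to_le_bytes());
    if !all_zero {
        // Only non-zero allocations pay for their bytes in the output.
        out.extend_from_slice(bytes);
    }
}
```

With this shape, a zeroed 1000-byte allocation costs 9 bytes of metadata instead of ~1000, which is the rmeta size win being discussed.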

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Feb 16, 2025
@xTachyon
Contributor

I enjoy a little too much how LLVM just checks each constant to see if it's all zero.

Anyway, a stupid idea that's probably not worth it, but what do you think about having something like "repeat this byte n times"? That would cover both 1000 x 0 and 1000 x 5, for instance.

I got the idea while looking at this. I don't know if it's just visual trickery and it's not actually stored like that, but I thought it was worth at least thinking about a bit.
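A minimal sketch of the "repeat this byte n times" idea, as a hypothetical helper (not actual compiler code):

```rust
// Report Some((byte, len)) when the whole allocation is a single byte value
// repeated, e.g. 1000 zeros or 1000 fives; None otherwise (including for
// the empty allocation).
fn as_repeated_byte(bytes: &[u8]) -> Option<(u8, usize)> {
    let (&first, rest) = bytes.split_first()?;
    rest.iter().all(|&b| b == first).then(|| (first, bytes.len()))
}
```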

@saethlin
Member Author

Anyway, a stupid idea that's probably not worth it, but what do you think about having something like "repeat this byte n times"? That would cover both 1000 x 0 and 1000 x 5, for instance.

I tried that out on allocs more than 8 bytes in size and it only fired once. I also tried expanding the check to 2-byte sequences and that didn't fire at all, even in a stage2 build of cargo (I checked that it works on an artificial example).
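The 2-byte-sequence check generalizes to looking for a repeated chunk of any width. A hypothetical helper (n = 2 is the variant tried above; n = 1 recovers the single-byte case):

```rust
// Check whether the allocation consists of one n-byte chunk repeated.
fn is_repeated_chunk(bytes: &[u8], n: usize) -> bool {
    if n == 0 || bytes.len() < n || bytes.len() % n != 0 {
        return false;
    }
    let first = &bytes[..n];
    bytes.chunks_exact(n).all(|chunk| chunk == first)
}
```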

@saethlin
Member Author

@bors try @rust-timer queue


@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 16, 2025
bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 16, 2025
Add a .bss-like scheme for encoded const allocs

r? ghost

@bors
Contributor

bors commented Feb 16, 2025

⌛ Trying commit b84f6fe with merge 7333484...

@bors
Contributor

bors commented Feb 17, 2025

☀️ Try build successful - checks-actions
Build commit: 7333484 (7333484dd63aeaa04b3a0839d7d03667ec02a959)


@rust-timer
Collaborator

Finished benchmarking commit (7333484): comparison URL.

Overall result: ❌✅ regressions and improvements - please read the text below

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.

mean range count
Regressions ❌ (primary): - - 0
Regressions ❌ (secondary): 42.0% [0.4%, 151.4%] 10
Improvements ✅ (primary): -0.2% [-0.3%, -0.1%] 33
Improvements ✅ (secondary): -0.2% [-0.4%, -0.1%] 49
All ❌✅ (primary): -0.2% [-0.3%, -0.1%] 33

Max RSS (memory usage)

Results (primary 2.6%, secondary -53.2%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌ (primary): 2.6% [2.6%, 2.6%] 1
Regressions ❌ (secondary): - - 0
Improvements ✅ (primary): - - 0
Improvements ✅ (secondary): -53.2% [-66.7%, -26.2%] 3
All ❌✅ (primary): 2.6% [2.6%, 2.6%] 1

Cycles

Results (secondary 38.3%)


mean range count
Regressions ❌ (primary): - - 0
Regressions ❌ (secondary): 38.3% [33.8%, 46.7%] 3
Improvements ✅ (primary): - - 0
Improvements ✅ (secondary): - - 0
All ❌✅ (primary): - - 0

Binary size

Results (primary -0.0%, secondary -31.9%)


mean range count
Regressions ❌ (primary): - - 0
Regressions ❌ (secondary): - - 0
Improvements ✅ (primary): -0.0% [-0.0%, -0.0%] 4
Improvements ✅ (secondary): -31.9% [-31.9%, -31.9%] 6
All ❌✅ (primary): -0.0% [-0.0%, -0.0%] 4

Bootstrap: 790.255s -> 792.128s (0.24%)
Artifact size: 350.01 MiB -> 349.95 MiB (-0.02%)

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Feb 17, 2025
@saethlin
Member Author

The icount regression is inflated because currently this is only doing one-byte-at-a-time comparisons. For the same benchmark that regresses a lot, we get a wall-time improvement. I think that's because the relevant encoding there is the incr cache, and that's done on a background thread.

So I think there's a widespread small benefit here, the trick is to detect the all-zero case more efficiently.

@hanna-kruppe
Contributor

Doing the all-zeros test as "reduce by binary OR, check that the final result is 0" should be very amenable to autovectorization and run at basically memory-bandwidth speed for large allocations. The cost of trying it on many small allocations is probably also significant, so I'd consider not even doing the check for allocs < 128 bytes or something like that. That should be a super-predictable branch but still benefit the cases where the rmeta bloat is significant.
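A sketch of the suggested check. The 128-byte cutoff is the illustrative threshold from this comment, not a tuned value, and small allocations are simply treated as "not all zero":

```rust
// OR-reduce all bytes and compare the result to zero. LLVM can
// autovectorize this fold, so it runs near memory-bandwidth speed.
fn is_all_zero(buf: &[u8]) -> bool {
    if buf.len() < 128 {
        // Don't even try for small allocations; the branch is predictable.
        return false;
    }
    buf.iter().fold(0u8, |acc, &b| acc | b) == 0
}
```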

@saethlin
Member Author

The cost of trying it on many small allocations is probably also significant, so I'd consider not even doing the check for allocs < 128 bytes or something like that.

The 33 (small) icount improvements are due to this optimization firing on allocations < 128 bytes. And here is counts -w on all the allocation sizes where this optimization fires for x build --stage 2 src/tools/cargo (which is my go-to invocation for a chunk of real-world-ish code):

138177 counts (weighted integral)
(  1)    73872 (53.5%, 53.5%): 16
(  2)    17152 (12.4%, 65.9%): 8576
(  3)    13968 (10.1%, 76.0%): 8
(  4)     7104 ( 5.1%, 81.1%): 32
(  5)     3360 ( 2.4%, 83.6%): 3360
(  6)     2160 ( 1.6%, 85.1%): 216
(  7)     2048 ( 1.5%, 86.6%): 2048
(  8)     1824 ( 1.3%, 87.9%): 1824
(  9)     1720 ( 1.2%, 89.2%): 1720
( 10)     1324 ( 1.0%, 90.1%): 4
( 11)     1304 ( 0.9%, 91.1%): 1304
( 12)     1232 ( 0.9%, 92.0%): 1232
( 13)     1176 ( 0.9%, 92.8%): 1176
( 14)     1088 ( 0.8%, 93.6%): 1088
( 15)     1080 ( 0.8%, 94.4%): 1080
( 16)     1068 ( 0.8%, 95.2%): 2
( 17)     1008 ( 0.7%, 95.9%): 1008
( 18)      960 ( 0.7%, 96.6%): 40
( 19)      952 ( 0.7%, 97.3%): 952
( 20)      864 ( 0.6%, 97.9%): 48
( 21)      760 ( 0.6%, 98.4%): 760
( 22)      704 ( 0.5%, 99.0%): 64
( 23)      504 ( 0.4%, 99.3%): 24
( 24)      280 ( 0.2%, 99.5%): 56
( 25)      256 ( 0.2%, 99.7%): 128
( 26)      192 ( 0.1%, 99.8%): 96
( 27)       80 ( 0.1%, 99.9%): 80
( 28)       52 ( 0.0%, 99.9%): 52
( 29)       50 ( 0.0%,100.0%): 1
( 30)       20 ( 0.0%,100.0%): 20
( 31)       12 ( 0.0%,100.0%): 12
( 32)        3 ( 0.0%,100.0%): 3
( 33)        0 ( 0.0%,100.0%): 0

So over half the byte savings in real-world code (as opposed to ctfe-stress which is highly contrived) comes from allocations < 128 bytes.

@hanna-kruppe
Contributor

Welp, that makes the easy way out of the trade-off a lot less appealing 🙃

@saethlin
Member Author

@bors try @rust-timer queue


@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 17, 2025
@bors
Contributor

bors commented Feb 17, 2025

⌛ Trying commit 45d702d with merge 726f4aa...

bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 17, 2025
Add a .bss-like scheme for encoded const allocs

@saethlin
Member Author

saethlin commented Feb 17, 2025

After staring at profiles for a bit, I think adding a fast path for small allocations is probably not going to be measurable. It sure gives the warm fuzzies, but the relevant encode function is very cold; only under workloads that encode very large const allocations does the loop that checks for all zeros become hot.

@bors
Contributor

bors commented Feb 17, 2025

☀️ Try build successful - checks-actions
Build commit: 726f4aa (726f4aa01d229fe932ed987db2b95b38a8643760)


@rust-timer
Collaborator

Finished benchmarking commit (726f4aa): comparison URL.

Overall result: ❌ regressions - please read the text below


@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count


mean range count
Regressions ❌ (primary): - - 0
Regressions ❌ (secondary): 1.7% [0.3%, 4.7%] 6
Improvements ✅ (primary): - - 0
Improvements ✅ (secondary): - - 0
All ❌✅ (primary): - - 0

Max RSS (memory usage)

Results (primary 4.3%, secondary -52.9%)


mean range count
Regressions ❌ (primary): 4.3% [4.3%, 4.3%] 1
Regressions ❌ (secondary): - - 0
Improvements ✅ (primary): - - 0
Improvements ✅ (secondary): -52.9% [-66.6%, -25.9%] 3
All ❌✅ (primary): 4.3% [4.3%, 4.3%] 1

Cycles

Results (secondary -5.5%)


mean range count
Regressions ❌ (primary): - - 0
Regressions ❌ (secondary): 3.1% [3.0%, 3.3%] 3
Improvements ✅ (primary): - - 0
Improvements ✅ (secondary): -14.2% [-19.2%, -4.1%] 3
All ❌✅ (primary): - - 0

Binary size

Results (primary -0.0%, secondary -19.3%)


mean range count
Regressions ❌ (primary): - - 0
Regressions ❌ (secondary): 0.0% [0.0%, 0.0%] 1
Improvements ✅ (primary): -0.0% [-0.1%, -0.0%] 20
Improvements ✅ (secondary): -21.5% [-31.9%, -0.5%] 9
All ❌✅ (primary): -0.0% [-0.1%, -0.0%] 20

Bootstrap: 774.698s -> 774.652s (-0.01%)
Artifact size: 362.37 MiB -> 360.23 MiB (-0.59%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 17, 2025
@saethlin saethlin added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 18, 2025

@oli-obk
Contributor

oli-obk commented Feb 18, 2025

So the PR now is at the state where the perf report is #137152 (comment) ?

@saethlin
Member Author

The relevant perf report is the most recent one, but if we want to run another one just to be sure we can do that.

@oli-obk
Contributor

oli-obk commented Feb 18, 2025

Ah. Why not just take the stress test hit?

@saethlin
Member Author

The naive implementation has a possible upside of a 0.3% icount improvement, and no measured cycles changes. The downside is a 151% icount regression and 47% cycles regression.

The version that uses the autovec-friendly "OR all the bytes together" strategy has no significant icount improvement and a 19% cycles improvement, with a downside of a 4.1% icount regression and nothing significant in cycles.

So the choice is between the naive version, which yields a widespread but tiny icount improvement along with a large cycles regression, and the slightly clever version, which improves only a stress test and, as far as I can tell, has no downsides. I prefer the version with no downsides. (I don't really trust icount results with significance factors of about 2.)

Member

@wesleywiser wesleywiser left a comment


LGTM, and I agree with your reasoning about which perf tradeoff to pick, but let's see if @oli-obk disagrees 🙂

Comment on lines 161 to 166
if buf.is_empty() {
    return true;
}
// Just fast-rejecting based on the first element significantly reduces the amount that we end
// up walking the whole array.
if buf.get(0) != Some(&0) {
Member


Is it worth fusing this into one check?

match buf.get(0) {
    None => return true,
    Some(n) if *n != 0 => return false,
    _ => {}
}

Member Author


🤷 I wrote it the way I did because I liked the comment placement on both cases. I'm sure it optimizes the same.

Member

@RalfJung RalfJung Feb 19, 2025


I think the comments have natural places in both versions.

I suggested this because the Some pattern in your case requires like five seconds of thinking before realizing None is impossible. buf[0] != 0 would make that clearer, but I guess you want to avoid the panic case? Surely that optimizes away, though?

Member Author


buf[0] != 0 it is. The panic optimizes out or doesn't matter.
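Putting the review's conclusion together, the check ends up shaped roughly like this (a sketch, not necessarily the exact code merged in the PR):

```rust
// Empty buffers are trivially all-zero, a nonzero first byte fast-rejects
// (the indexing panic optimizes out or doesn't matter, per the discussion),
// and the remainder is an autovectorizable OR reduction.
fn all_zero(buf: &[u8]) -> bool {
    if buf.is_empty() {
        return true;
    }
    // Fast-rejecting on the first byte means we rarely walk the whole array.
    if buf[0] != 0 {
        return false;
    }
    buf.iter().fold(0u8, |acc, &b| acc | b) == 0
}
```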

@saethlin
Member Author

@oli-obk I think this is waiting for you: #137152 (review)

@oli-obk
Contributor

oli-obk commented Mar 13, 2025

Huh, sorry. I was sure I approved this after Wesley's message.

@bors r=wesleywiser

@bors
Contributor

bors commented Mar 13, 2025

📌 Commit b75b67f has been approved by wesleywiser

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 13, 2025
@bors
Contributor

bors commented Mar 13, 2025

⌛ Testing commit b75b67f with merge 52daa7d...

@bors
Contributor

bors commented Mar 13, 2025

☀️ Test successful - checks-actions
Approved by: wesleywiser
Pushing 52daa7d to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Mar 13, 2025
@bors bors merged commit 52daa7d into rust-lang:master Mar 13, 2025
7 checks passed
@rustbot rustbot added this to the 1.87.0 milestone Mar 13, 2025

Post-merge analysis result

Test differences

No test diffs found

@rust-timer
Collaborator

Finished benchmarking commit (52daa7d): comparison URL.

Overall result: ❌ regressions - please read the text below

Our benchmarks found a performance regression caused by this PR.
This might be an actual regression, but it can also be just noise.

Next Steps:

  • If the regression was expected or you think it can be justified,
    please write a comment with sufficient written justification, and add
    @rustbot label: +perf-regression-triaged to it, to mark the regression as triaged.
  • If you think that you know of a way to resolve the regression, try to create
    a new PR with a fix for the regression.
  • If you do not understand the regression or you think that it is just noise,
    you can ask the @rust-lang/wg-compiler-performance working group for help (members of this group
    were already notified of this PR).

@rustbot label: +perf-regression
cc @rust-lang/wg-compiler-performance

Instruction count


mean range count
Regressions ❌ (primary): - - 0
Regressions ❌ (secondary): 3.5% [3.2%, 4.1%] 3
Improvements ✅ (primary): - - 0
Improvements ✅ (secondary): - - 0
All ❌✅ (primary): - - 0

Max RSS (memory usage)

Results (primary -0.5%, secondary -39.1%)


mean range count
Regressions ❌ (primary): 1.7% [1.7%, 1.7%] 1
Regressions ❌ (secondary): 2.3% [2.3%, 2.3%] 1
Improvements ✅ (primary): -2.6% [-2.6%, -2.6%] 1
Improvements ✅ (secondary): -52.9% [-66.1%, -26.4%] 3
All ❌✅ (primary): -0.5% [-2.6%, 1.7%] 2

Cycles

Results (secondary -14.8%)


mean range count
Regressions ❌ (primary): - - 0
Regressions ❌ (secondary): - - 0
Improvements ✅ (primary): - - 0
Improvements ✅ (secondary): -14.8% [-19.3%, -6.0%] 3
All ❌✅ (primary): - - 0

Binary size

Results (primary -0.0%, secondary -8.1%)


mean range count
Regressions ❌ (primary): - - 0
Regressions ❌ (secondary): - - 0
Improvements ✅ (primary): -0.0% [-0.1%, -0.0%] 48
Improvements ✅ (secondary): -8.1% [-31.9%, -0.0%] 24
All ❌✅ (primary): -0.0% [-0.1%, -0.0%] 48

Bootstrap: 775.02s -> 774.166s (-0.11%)
Artifact size: 365.09 MiB -> 365.07 MiB (-0.01%)

@saethlin
Member Author

@rustbot label: +perf-regression-triaged #137152 (comment)

@saethlin saethlin deleted the bss-const-allocs branch March 13, 2025 21:08
@rustbot
Collaborator

rustbot commented Mar 13, 2025

Error: Parsing relabel command in comment failed: ...'aged https' | error: a label delta at >| '://github.'...

Please file an issue on GitHub at triagebot if there's a problem with this bot, or reach out on #t-infra on Zulip.

@saethlin saethlin added the perf-regression-triaged The performance regression has been triaged. label Mar 13, 2025

Successfully merging this pull request may close these issues.

Big static array causes rlib to bloat in size