Add a .bss-like scheme for encoded const allocs #137152
I enjoy way too much how LLVM just checks each constant for being all zeros. Anyway, a silly idea that's probably not worth it, but what do you think about having something like "repeat this byte n times"? That would cover both 1000 x 0 and 1000 x 5, for instance. I got the idea while looking at this. I don't know if it's just visual trickery and it's not actually stored like that, but I thought it was worth at least thinking a bit about.
I tried that out on allocs more than 8 bytes in size and it only fired once. I also tried expanding the check to 2-byte sequences and that didn't fire at all, even in a stage2 build of cargo (I checked that it works on an artificial example).
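The "repeat this byte n times" idea could be sketched as follows (a hypothetical helper for illustration, not code from this PR):

```rust
/// Returns Some(b) if the whole buffer is the single byte `b` repeated,
/// so an encoder could store just (byte, length) instead of the raw
/// bytes. Covers both 1000 x 0 and 1000 x 5.
fn repeated_byte(buf: &[u8]) -> Option<u8> {
    // Empty buffers have no repeated byte; `?` bails out with None.
    let (&first, rest) = buf.split_first()?;
    // All remaining bytes must match the first one.
    rest.iter().all(|&b| b == first).then_some(first)
}
```

As the follow-up comment notes, in practice this generalization rarely fires on non-zero bytes, which is why the PR sticks to the all-zeros special case.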
@bors try @rust-timer queue
Add a .bss-like scheme for encoded const allocs r? ghost This check whether all bytes are zero feels like it should be too slow; instead we should have a flag that we track, but that seems hard. Let's see how this perfs first. Also, we can probably stash the "it's all zero actually" flag inside one of the other struct members that's already not using an entire byte. This optimization doesn't fire all that often, so it's possible that by sticking it in the varint length field, this PR actually makes rmeta size worse.
☀️ Try build successful - checks-actions
Finished benchmarking commit (7333484): comparison URL. Overall result: ❌✅ regressions and improvements - please read the text below. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next steps: if you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never.
Instruction count: this is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.
Max RSS (memory usage): results (primary 2.6%, secondary -53.2%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Cycles: results (secondary 38.3%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary size: results (primary -0.0%, secondary -31.9%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 790.255s -> 792.128s (0.24%)
The icount regression is inflated because currently this is only doing one-byte-at-a-time comparisons. For the same benchmark that regresses a lot, we get a wall-time improvement. I think that's because the relevant encoding there is the incr cache, and that's done on a background thread. So I think there's a widespread small benefit here; the trick is to detect the all-zero case more efficiently.
Doing the all-zeros test as "reduce by binary or, check that the final result is 0" should be very amenable to auto-vectorization and run basically at memory-bandwidth speed for large allocations. The cost of trying it on many small allocations is probably also significant, so I'd consider not even trying to do the check for allocs < 128 bytes or something like that. That should be a super predictable branch but still benefit the cases where the rmeta bloat is significant.
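That "reduce by binary or" strategy could look roughly like this (my sketch, not the PR's exact code):

```rust
/// All-zeros check written as an OR-reduction: the loop body has no
/// early exit or data-dependent branch, so LLVM can auto-vectorize it
/// and large buffers are scanned at close to memory-bandwidth speed.
fn is_all_zero(buf: &[u8]) -> bool {
    buf.iter().fold(0u8, |acc, &b| acc | b) == 0
}
```

The < 128 bytes cutoff suggested above would just be a single length check in front of this call.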
The 33 (small) icount improvements are due to this optimization firing on allocations < 128 bytes. And here is
So over half the byte savings in real-world code (as opposed to ctfe-stress, which is highly contrived) comes from allocations < 128 bytes.
Welp, that makes the easy way out of the trade-off a lot less appealing 🙃
@bors try @rust-timer queue
Add a .bss-like scheme for encoded const allocs This check whether all bytes are zero feels like it should be too slow; instead we should have a flag that we track, but that seems hard. Let's see how this perfs first. Also, we can probably stash the "it's all zero actually" flag inside one of the other struct members that's already not using an entire byte. This optimization doesn't fire all that often, so it's possible that by sticking it in the varint length field, this PR actually makes rmeta size worse.
After staring at profiles for a bit, I think adding a fast path for small allocations is probably not going to be measurable. It sure gives the warm fuzzies, but the entire encode function that's relevant here is very cold, except under workloads that encode very large const allocations, where the loop that checks for all zeros becomes hot.
☀️ Try build successful - checks-actions
Finished benchmarking commit (726f4aa): comparison URL. Overall result: ❌ regressions - please read the text below. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next steps: if you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never.
Instruction count: this is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.
Max RSS (memory usage): results (primary 4.3%, secondary -52.9%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Cycles: results (secondary -5.5%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary size: results (primary -0.0%, secondary -19.3%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 774.698s -> 774.652s (-0.01%)
So the PR is now in the state where the relevant perf report is #137152 (comment)?
The relevant perf report is the most recent one, but if we want to run another one just to be sure, we can do that.
Ah. Why not just take the stress test hit?
The naive implementation has a possible upside of a 0.3% icount improvement and no measured cycles changes. The downside is a 151% icount regression and a 47% cycles regression. The version that uses the autovec-friendly "mask all the bytes together" strategy has no significant icount improvement and a 19% cycles improvement, with a downside of 4.1% icount and nothing significant in cycles. So the choice is between the naive version, which yields a widespread but tiny icount improvement and a large cycles regression, and the slightly clever version, which improves only a stress test and, as far as I can tell, has no downsides. I prefer the version with no downsides. (I don't really trust icount results with these significance factors of about 2.)
LGTM, and I agree with your reasoning regarding which perf tradeoff to pick, but let's see if @oli-obk disagrees 🙂
```rust
if buf.is_empty() {
    return true;
}
// Just fast-rejecting based on the first element significantly reduces the amount that we end
// up walking the whole array.
if buf.get(0) != Some(&0) {
```
Is it worth fusing this into one check?
```rust
match buf.get(0) {
    None => return true,
    Some(&n) if n != 0 => return false,
    _ => {}
}
```
🤷 I wrote it the way I did because I liked the comment placement in both cases. I'm sure it optimizes the same.
I think the comments have natural places in both versions. I suggested this because the `Some` pattern in your case requires like 5s of thinking until I realized `None` is impossible. `buf[0] != 0` would make that more clear, but I guess you want to avoid the panic case? Surely that optimizes away though?
`buf[0] != 0` it is. The panic optimizes out or doesn't matter.
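Putting the thread together, the check presumably ends up shaped roughly like this (my reconstruction combining the first-byte fast-reject with the OR-reduction; not the exact merged code):

```rust
fn all_zero(buf: &[u8]) -> bool {
    if buf.is_empty() {
        return true;
    }
    // Fast-reject on the first byte: this avoids walking the whole
    // array for the common case of a non-zero allocation. Indexing
    // can't panic here because the buffer is non-empty.
    if buf[0] != 0 {
        return false;
    }
    // Branch-free OR-reduction over all bytes; auto-vectorizes well.
    buf.iter().fold(0u8, |acc, &b| acc | b) == 0
}
```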
@oli-obk I think this is waiting for you: #137152 (review)
Huh, sorry. I was sure I approved this after Wesley's message. @bors r=wesleywiser
☀️ Test successful - checks-actions
Post-merge analysis result: no test diffs found.
Finished benchmarking commit (52daa7d): comparison URL. Overall result: ❌ regressions - please read the text below. Our benchmarks found a performance regression caused by this PR. Next steps:
@rustbot label: +perf-regression
Instruction count: this is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.
Max RSS (memory usage): results (primary -0.5%, secondary -39.1%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Cycles: results (secondary -14.8%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary size: results (primary -0.0%, secondary -8.1%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 775.02s -> 774.166s (-0.11%)
@rustbot label: +perf-regression-triaged #137152 (comment)