Add a .bss-like scheme for encoded const allocs #137152
I enjoy way too much how LLVM just checks each constant for being all zeros. Anyway, a silly idea that's probably not worth it, but what do you think about having something like "repeat this byte n times"? That would cover both 1000 x 0 and 1000 x 5, for instance. I got the idea while looking at this. I don't know if it's just visual trickery and it's not actually stored like that, but I thought it was worth at least thinking a bit about.
I tried that out on allocs more than 8 bytes in size and it only fired once. I also tried expanding the check to 2-byte sequences and that didn't fire at all, even in a stage2 build of cargo (I checked that it works on an artificial example).
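The "repeat this byte n times" idea could be sketched as follows (a hypothetical helper for illustration, not code from this PR):

```rust
/// Returns Some(b) if the whole buffer is the single byte `b` repeated,
/// so an encoder could store just (byte, length) instead of the raw
/// bytes. Covers both 1000 x 0 and 1000 x 5.
fn repeated_byte(buf: &[u8]) -> Option<u8> {
    // Empty buffers have no repeated byte; `?` bails out with None.
    let (&first, rest) = buf.split_first()?;
    // All remaining bytes must match the first one.
    rest.iter().all(|&b| b == first).then_some(first)
}
```

As the follow-up comment notes, in practice this generalization rarely fires on non-zero bytes, which is why the PR sticks to the all-zeros special case.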
@bors try @rust-timer queue
Add a .bss-like scheme for encoded const allocs r? ghost This check whether all bytes are zero feels like it should be too slow; instead we should have a flag that we track, but that seems hard. Let's see how this perfs first. Also, we can probably stash the "it's all zero actually" flag inside one of the other struct members that's already not using an entire byte. This optimization doesn't fire all that often, so it's possible that by sticking it in the varint length field, this PR actually makes rmeta size worse.
☀️ Try build successful - checks-actions
Finished benchmarking commit (7333484): comparison URL. Overall result: ❌✅ regressions and improvements - please read the text below. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next steps: if you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never.
Instruction count: this is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.
Max RSS (memory usage): results (primary 2.6%, secondary -53.2%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Cycles: results (secondary 38.3%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary size: results (primary -0.0%, secondary -31.9%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 790.255s -> 792.128s (0.24%)
The icount regression is inflated because currently this is only doing one-byte-at-a-time comparisons. For the same benchmark that regresses a lot, we get a wall-time improvement. I think that's because the relevant encoding there is the incr cache, and that's done on a background thread. So I think there's a widespread small benefit here; the trick is to detect the all-zero case more efficiently.
Doing the all-zeros test as "reduce by binary or, check that the final result is 0" should be very amenable to auto-vectorization and run basically at memory-bandwidth speed for large allocations. The cost of trying it on many small allocations is probably also significant, so I'd consider not even trying to do the check for allocs < 128 bytes or something like that. That should be a super predictable branch but still benefit the cases where the rmeta bloat is significant.
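That "reduce by binary or" strategy could look roughly like this (my sketch, not the PR's exact code):

```rust
/// All-zeros check written as an OR-reduction: the loop body has no
/// early exit or data-dependent branch, so LLVM can auto-vectorize it
/// and large buffers are scanned at close to memory-bandwidth speed.
fn is_all_zero(buf: &[u8]) -> bool {
    buf.iter().fold(0u8, |acc, &b| acc | b) == 0
}
```

The < 128 bytes cutoff suggested above would just be a single length check in front of this call.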
The 33 (small) icount improvements are due to this optimization firing on allocations < 128 bytes. And here is
So over half the byte savings in real-world code (as opposed to ctfe-stress, which is highly contrived) comes from allocations < 128 bytes.
Welp, that makes the easy way out of the trade-off a lot less appealing 🙃
@bors try @rust-timer queue
Add a .bss-like scheme for encoded const allocs This check whether all bytes are zero feels like it should be too slow; instead we should have a flag that we track, but that seems hard. Let's see how this perfs first. Also, we can probably stash the "it's all zero actually" flag inside one of the other struct members that's already not using an entire byte. This optimization doesn't fire all that often, so it's possible that by sticking it in the varint length field, this PR actually makes rmeta size worse.
After staring at profiles for a bit, I think adding a fast path for small allocations is probably not going to be measurable. It sure gives the warm fuzzies, but the entire encode function that's relevant here is very cold, except under workloads that encode very large const allocations, where the loop that checks for all zeros becomes hot.
☀️ Try build successful - checks-actions
Finished benchmarking commit (726f4aa): comparison URL. Overall result: ❌ regressions - please read the text below. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next steps: if you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never.
Instruction count: this is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.
Max RSS (memory usage): results (primary 4.3%, secondary -52.9%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Cycles: results (secondary -5.5%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary size: results (primary -0.0%, secondary -19.3%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 774.698s -> 774.652s (-0.01%)
So the PR is now in the state where the relevant perf report is #137152 (comment)?
The relevant perf report is the most recent one, but if we want to run another one just to be sure, we can do that.
Ah. Why not just take the stress test hit?
The naive implementation has a possible upside of a 0.3% icount improvement and no measured cycles changes. The downside is a 151% icount regression and a 47% cycles regression. The version that uses the autovec-friendly "mask all the bytes together" strategy has no significant icount improvement and a 19% cycles improvement, with a downside of 4.1% icount and nothing significant in cycles. So the choice is between the naive version, which yields a widespread but tiny icount improvement and a large cycles regression, and the slightly clever version, which improves only a stress test and, as far as I can tell, has no downsides. I prefer the version with no downsides. (I don't really trust icount results with these significance factors of about 2.)
LGTM, and I agree with your reasoning regarding which perf tradeoff to pick, but let's see if @oli-obk disagrees 🙂
```rust
if buf.is_empty() {
    return true;
}
// Just fast-rejecting based on the first element significantly reduces the amount that we end
// up walking the whole array.
if buf.get(0) != Some(&0) {
```
Is it worth fusing this into one check?
```rust
match buf.get(0) {
    None => return true,
    Some(&n) if n != 0 => return false,
    _ => {}
}
```
🤷 I wrote it the way I did because I liked the comment placement in both cases. I'm sure it optimizes the same.
I think the comments have natural places in both versions. I suggested this because the `Some` pattern in your case requires like 5s of thinking until I realized `None` is impossible. `buf[0] != 0` would make that more clear, but I guess you want to avoid the panic case? Surely that optimizes away though?
`buf[0] != 0` it is. The panic optimizes out or doesn't matter.
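Putting the thread together, the check presumably ends up shaped roughly like this (my reconstruction combining the first-byte fast-reject with the OR-reduction; not the exact merged code):

```rust
fn all_zero(buf: &[u8]) -> bool {
    if buf.is_empty() {
        return true;
    }
    // Fast-reject on the first byte: this avoids walking the whole
    // array for the common case of a non-zero allocation. Indexing
    // can't panic here because the buffer is non-empty.
    if buf[0] != 0 {
        return false;
    }
    // Branch-free OR-reduction over all bytes; auto-vectorizes well.
    buf.iter().fold(0u8, |acc, &b| acc | b) == 0
}
```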
@oli-obk I think this is waiting for you: #137152 (review)
Huh, sorry. I was sure I approved this after Wesley's message. @bors r=wesleywiser
☀️ Test successful - checks-actions
Post-merge analysis result: no test diffs found.
Finished benchmarking commit (52daa7d): comparison URL. Overall result: ❌ regressions - please read the text below. Our benchmarks found a performance regression caused by this PR. Next steps:
@rustbot label: +perf-regression
Instruction count: this is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.
Max RSS (memory usage): results (primary -0.5%, secondary -39.1%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Cycles: results (secondary -14.8%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary size: results (primary -0.0%, secondary -8.1%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 775.02s -> 774.166s (-0.11%)
@rustbot label: +perf-regression-triaged #137152 (comment)