-
Notifications
You must be signed in to change notification settings - Fork 13.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize core::str::Lines::count #123606
base: master
Are you sure you want to change the base?
Optimize core::str::Lines::count #123606
Conversation
All reactions
-
❤️ 7 reactions
Clearing assignee cuz it's a draft still. |
All reactions
Sorry, something went wrong.
b502143
to
1258d48
Compare
This comment has been minimized.
This comment has been minimized.
1258d48
to
1466713
Compare
I've updated the PR description to contain benchmarks. This PR needs more tests tho, almost missed that I got the logic wrong for going from newline count to line count. |
All reactions
Sorry, something went wrong.
i don't have time/energy to poke at this right now (maybe over the weekend)... but it seems it is used in the compiler so curiosity is getting the better of me. @bors try @rust-timer queue |
All reactions
Sorry, something went wrong.
This comment has been minimized.
This comment has been minimized.
Sorry, something went wrong.
Optimize core::str::Lines::count `s.lines().count()+1` is somewhat common as a way to find the line number given a byte position, so it'd be nice if it were faster. This just generalizes the SWAR-optimized char counting code so that it can be used for SWAR-optimized line counting, so it's actually not very complex of a PR. ## TODO - [x] benchmarks - [x] adjust comments - [ ] more tests ## Benchmarks `case00_libcore` is the new version, and `case01_fold_increment` is the previous implementation (the default impl of `Iterator::count()` is a fold that increments ``` str::line_count::all_newlines_32kib::case00_libcore 4.35µs/iter +/- 11.00ns str::line_count::all_newlines_32kib::case01_fold_increment 779.99µs/iter +/- 8.43µs str::line_count::all_newlines_4kib::case00_libcore 562.00ns/iter +/- 5.00ns str::line_count::all_newlines_4kib::case01_fold_increment 97.81µs/iter +/- 1.48µs str::line_count::all_newlines_64b::case00_libcore 21.00ns/iter +/- 0.00ns str::line_count::all_newlines_64b::case01_fold_increment 1.49µs/iter +/- 32.00ns str::line_count::en_huge::case00_libcore 45.58µs/iter +/- 122.00ns str::line_count::en_huge::case01_fold_increment 167.62µs/iter +/- 609.00ns str::line_count::en_large::case00_libcore 734.00ns/iter +/- 6.00ns str::line_count::en_large::case01_fold_increment 2.62µs/iter +/- 9.00ns str::line_count::en_medium::case00_libcore 100.00ns/iter +/- 0.00ns str::line_count::en_medium::case01_fold_increment 347.00ns/iter +/- 0.00ns str::line_count::en_small::case00_libcore 18.00ns/iter +/- 1.00ns str::line_count::en_small::case01_fold_increment 60.00ns/iter +/- 2.00ns str::line_count::en_tiny::case00_libcore 6.00ns/iter +/- 0.00ns str::line_count::en_tiny::case01_fold_increment 60.00ns/iter +/- 0.00ns str::line_count::zh_huge::case00_libcore 40.63µs/iter +/- 85.00ns str::line_count::zh_huge::case01_fold_increment 205.10µs/iter +/- 1.62µs str::line_count::zh_large::case00_libcore 655.00ns/iter +/- 1.00ns str::line_count::zh_large::case01_fold_increment 3.21µs/iter +/- 21.00ns str::line_count::zh_medium::case00_libcore 92.00ns/iter +/- 0.00ns str::line_count::zh_medium::case01_fold_increment 420.00ns/iter +/- 2.00ns str::line_count::zh_small::case00_libcore 20.00ns/iter +/- 1.00ns str::line_count::zh_small::case01_fold_increment 63.00ns/iter +/- 1.00ns str::line_count::zh_tiny::case00_libcore 6.00ns/iter +/- 0.00ns str::line_count::zh_tiny::case01_fold_increment 21.00ns/iter +/- 0.00ns ``` This is a speedup of around 2x-4x most of the time, but for some highly unrealistic scenarios (32KiB of newlines) it's up to almost 200x faster (because the time taken by the version in this PR is not dependent on the number of newlines in the input, but the old version is slower the more newlines are present). It's also much faster for small inputs, especially if they have newlines (10x faster for en_tiny). Real world cases will vary, don't read too much into these, I would expect 2x-4x speedup in general, since that's what it gets on the most realistic examples. Obviously a SIMD impl will beat this, but users who are really bottlenecked on this operation should probably just reach for crates.io (even if we provided a SIMD version, libcore can't use runtime CPU feature detection so they'd still be better off with something from crates.io).
☀️ Try build successful - checks-actions |
All reactions
Sorry, something went wrong.
1 similar comment
☀️ Try build successful - checks-actions |
All reactions
Sorry, something went wrong.
This comment has been minimized.
This comment has been minimized.
Finished benchmarking commit (eb24f76): comparison URL. Overall result: ❌✅ regressions and improvements - ACTION NEEDEDBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 676.389s -> 679.092s (0.40%) |
All reactions
Sorry, something went wrong.
Successfully merging this pull request may close these issues.
None yet
s.lines().count()+1
is somewhat common as a way to find the line number given a byte position, so it'd be nice if it were faster.This just generalizes the SWAR-optimized char counting code so that it can be used for SWAR-optimized line counting, so it's actually not very complex of a PR.
TODO
Benchmarks
case00_libcore
is the new version, andcase01_fold_increment
is the previous implementation (the default impl ofIterator::count()
is a fold that incrementsThis is a speedup of around 2x-4x most of the time, but for some highly unrealistic scenarios (32KiB of newlines) it's up to almost 200x faster (because the time taken by the version in this PR is not dependent on the number of newlines in the input, but the old version is slower the more newlines are present). It's also much faster for small inputs, especially if they have newlines (10x faster for en_tiny).
Real world cases will vary, don't read too much into these, I would expect 2x-4x speedup in general, since that's what it gets on the most realistic examples.
Obviously a SIMD impl will beat this, but users who are really bottlenecked on this operation should probably just reach for crates.io (even if we provided a SIMD version, libcore can't use runtime CPU feature detection so they'd still be better off with something from crates.io).