-
Notifications
You must be signed in to change notification settings - Fork 13.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Separate the unescape functions but avoid duplicating code #138163
base: master
Are you sure you want to change the base?
Conversation
rust-analyzer is developed in its own repository. If possible, consider making this change to rust-lang/rust-analyzer instead. cc @rust-lang/rust-analyzer |
/// Used for ASCII chars (written directly or via `\x01`..`\x7f` escapes) | ||
/// and Unicode chars (written directly or via `\u` escapes). | ||
/// | ||
/// For example, if '¥' appears in a string it is represented here as | ||
/// `MixedUnit::Char('¥')`, and it will be appended to the relevant byte | ||
/// string as the two-byte UTF-8 sequence `[0xc2, 0xa5]` | ||
Char(char), | ||
Char(NonZero<char>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UTF-8 includes the value 0x00, which may be written in a Rust string like so: "\0"
. The description of the MixedUnit type seems misleading?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
headscratch Stared at things a bit longer. If it is only used in the case of c""
, then the documentation of this type should be changed to reflect the fact that it is an implementation detail of Specifically That, as this is misleading as-is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding however was that there was a desire to eventually have "mixed strings which are not necessarily CStr", which is probably why the type is described in a more generic way and doesn't reference CStr currently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's currently only used for non-raw C strings AFAIK as raw C strings are perfectly happy with char.
@nnethercote's a1c0721 does mention that "it will soon be used in more than just C string literals.", but I don't know what that refers to. Seems it's for https://rust-lang.github.io/rfcs/3349-mixed-utf8-literals.html which would mean byte strings would also use this construction and those are fine with null bytes. So either we don't use NonZero for C strings or we use a different unit-type for C strings and for byte strings.
You seem to really want to separate those functions. What's the motivation here? What is the problem with the mode parameter? I don't think the new code is better. In particular, adding four new macros makes this code harder to read and understand. |
Perhaps it is completely naive, but I am still hoping this will unlock a little bit of performance. :) On top of that, I dislike the (unescape_unicode and unescape_mixed) functions that match on the Mode and have unreachable for some of the variants. It does not seem to provide any benefit over calling the right unescape function directly. And there are downsides: the reason that there are two unescape functions at the top level is because the signature of the callbacks are too different to unify them, unless you make all use MixedUnit. But this is only necessary because we try to push everything through these top-level functions in the first place. If we didn't then each function could have its natural type, eliminating the need for unreachable. Also the current common functions have many booleans that influence their behavior and a generic parameter which also makes the code quite hard to understand. Separating the code into these separate functions greatly helped me to make sense of everything. The macros also suffer a bit from being complicated, but it is easier to see the different instantations, which makes it less bad than the previous situation (in my author-biased opinion). And since you objected to the code duplication of the previous version, then eliminating that duplication via macros seems to be the logical answer. I propose that we do a perf run to see if there is any perf to be had. If I haven't made a mistake then it should at worst be neutral this time and we can discuss the comparative code qualities of common vs (macro or non-macro) separate. If there is a small perf win, then perhaps we can discuss the right mix of macro vs non-macro separate functions? |
An alternative to the macros may be using a trait:
I'm not sure it's really much better... |
@bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
… r=<try> Separate the unescape functions but avoid duplicating code Separate the unescape functions for string, byte string and C string, but avoid duplicating code via macro_rules. Also plays with NonZero, since C strings cannot contain null bytes, which can be captured in the type system. r? `@nnethercote`
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
Finished benchmarking commit (2d1434d): comparison URL. Overall result: ✅ improvements - no action neededBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. @bors rollup=never Instruction countThis is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.
Max RSS (memory usage)Results (primary 2.0%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResults (primary -2.6%, secondary -3.3%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeThis benchmark run did not return any relevant results for this metric. Bootstrap: 777.224s -> 776.837s (-0.05%) |
2e53992
to
848dd75
Compare
Some changes occurred in src/tools/clippy cc @rust-lang/clippy |
I've added a second commit which completes the removal of unescape_unicode and with it the final use of unreachable in this file. No more lying to the type system. |
b7ee702
to
61845a9
Compare
… but avoid duplicating code via macro_rules. Also plays with NonZero, since C strings cannot contain null bytes, which can be captured in the type system.
61845a9
to
c6eaeb4
Compare
This comment has been minimized.
This comment has been minimized.
c6eaeb4
to
30822ec
Compare
Separate the unescape functions for string, byte string and C string, but avoid duplicating code via macro_rules.
Also plays with NonZero, since C strings cannot contain null bytes, which can be captured in the type system.
r? @nnethercote