Separate the unescape functions but avoid duplicating code #138163

Open
wants to merge 2 commits into
base: master
from

Conversation

hkBst
Member

@hkBst hkBst commented Mar 7, 2025

Separate the unescape functions for string, byte string and C string, but avoid duplicating code via macro_rules.

Also plays with NonZero, since C strings cannot contain null bytes, which can be captured in the type system.

r? @nnethercote
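
As a rough illustration of what "captured in the type system" can mean here, the sketch below uses the stable `NonZero<u8>` rather than the `NonZero<char>` that this PR experiments with. The `NulInCStr` struct and the `c_string_bytes` helper are hypothetical names made up for this example; they are not part of the lexer's API.

```rust
use std::num::NonZero;

/// Hypothetical error for this sketch (the real lexer has its own `EscapeError`).
#[derive(Debug, PartialEq)]
pub struct NulInCStr {
    /// Byte offset of the offending NUL.
    pub pos: usize,
}

/// Convert raw bytes into units that cannot be NUL by construction.
/// The first NUL byte aborts the conversion with its position.
pub fn c_string_bytes(src: &[u8]) -> Result<Vec<NonZero<u8>>, NulInCStr> {
    src.iter()
        .enumerate()
        .map(|(pos, &b)| NonZero::new(b).ok_or(NulInCStr { pos }))
        .collect()
}
```

Once a value has type `NonZero<u8>`, later code does not need to re-check for NUL, and `Option<NonZero<u8>>` is the same size as a plain `u8`.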

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Mar 7, 2025
@rustbot
Collaborator

rustbot commented Mar 7, 2025

rust-analyzer is developed in its own repository. If possible, consider making this change to rust-lang/rust-analyzer instead.

cc @rust-lang/rust-analyzer

Comment on lines +88 to +94
/// Used for ASCII chars (written directly or via `\x01`..`\x7f` escapes)
/// and Unicode chars (written directly or via `\u` escapes).
///
/// For example, if '¥' appears in a string it is represented here as
/// `MixedUnit::Char('¥')`, and it will be appended to the relevant byte
/// string as the two-byte UTF-8 sequence `[0xc2, 0xa5]`
-Char(char),
+Char(NonZero<char>),
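
The doc comment above describes how a `Char` unit ends up in a byte string. A minimal sketch of that behaviour follows, using a simplified stand-in enum; the `HighByte` variant and the `push_unit` helper are assumptions for illustration and are not shown in the quoted snippet.

```rust
/// Simplified stand-in for the `MixedUnit` type under review.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum MixedUnit {
    /// A character, appended as its UTF-8 encoding.
    Char(char),
    /// A raw byte (assumed variant for this sketch), appended as-is.
    HighByte(u8),
}

/// Hypothetical helper: append one unit to a byte buffer.
pub fn push_unit(buf: &mut Vec<u8>, unit: MixedUnit) {
    match unit {
        MixedUnit::Char(c) => {
            // A char needs at most four bytes in UTF-8.
            let mut tmp = [0u8; 4];
            buf.extend_from_slice(c.encode_utf8(&mut tmp).as_bytes());
        }
        MixedUnit::HighByte(b) => buf.push(b),
    }
}
```

Pushing `MixedUnit::Char('¥')` into an empty buffer yields the two bytes `[0xc2, 0xa5]` mentioned in the comment.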
Member

@workingjubilee workingjubilee Mar 7, 2025

UTF-8 includes the value 0x00, which may be written in a Rust string like so: "\0". The description of the MixedUnit type seems misleading?
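
A small sketch of the point being made: NUL is an ordinary character in a Rust `str`, and only the C string representation rejects it. The constant and helper names below are made up for this example.

```rust
use std::ffi::CString;

/// A perfectly valid Rust string containing U+0000.
pub const WITH_NUL: &str = "a\0b";

/// Returns true if `s` has no interior NUL and so could be stored as a C string.
pub fn fits_in_c_string(s: &str) -> bool {
    // `CString::new` fails if the input contains a 0 byte.
    CString::new(s).is_ok()
}
```

Here `WITH_NUL.len()` is 3 and `fits_in_c_string(WITH_NUL)` is false, so a non-zero unit type only makes sense for the C string case.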

Member

headscratch Stared at things a bit longer. If it is only used in the case of c"", then the documentation of this type should be changed to reflect the fact that it is an implementation detail of Specifically That, as this is misleading as-is.

Member

My understanding however was that there was a desire to eventually have "mixed strings which are not necessarily CStr", which is probably why the type is described in a more generic way and doesn't reference CStr currently.

Member Author

It's currently only used for non-raw C strings AFAIK as raw C strings are perfectly happy with char.
@nnethercote's a1c0721 does mention that "it will soon be used in more than just C string literals.", but I don't know what that refers to. Seems it's for https://rust-lang.github.io/rfcs/3349-mixed-utf8-literals.html which would mean byte strings would also use this construction and those are fine with null bytes. So either we don't use NonZero for C strings or we use a different unit-type for C strings and for byte strings.
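
One way to picture the second option (a different unit type for C strings than for byte strings) is sketched below. All type names here are hypothetical, and `NonNulChar` is a hand-written stand-in for the PR's `NonZero<char>`; this is not the design the PR settles on.

```rust
use std::num::NonZero;

/// Hypothetical unit for byte-string-like mixed literals: NUL is allowed.
#[derive(Debug, PartialEq)]
pub enum ByteStrUnit {
    Char(char),
    HighByte(u8),
}

/// Hypothetical newtype standing in for `NonZero<char>`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct NonNulChar(char);

impl NonNulChar {
    pub fn new(c: char) -> Option<Self> {
        if c == '\0' { None } else { Some(Self(c)) }
    }

    pub fn get(self) -> char {
        self.0
    }
}

/// Hypothetical unit for C string literals: neither variant can hold NUL.
#[derive(Debug, PartialEq)]
pub enum CStrUnit {
    Char(NonNulChar),
    HighByte(NonZero<u8>),
}

impl CStrUnit {
    /// Narrow a general unit to a C string unit; `None` means it was NUL.
    pub fn from_general(unit: ByteStrUnit) -> Option<Self> {
        match unit {
            ByteStrUnit::Char(c) => NonNulChar::new(c).map(CStrUnit::Char),
            ByteStrUnit::HighByte(b) => NonZero::new(b).map(CStrUnit::HighByte),
        }
    }
}
```

The cost of this split is a second enum and a conversion; the benefit is that byte strings keep accepting NUL while C strings rule it out statically.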

@nnethercote
Contributor

> Separate the unescape functions for string, byte string and C string, but avoid duplicating code via macro_rules.

You seem to really want to separate those functions. What's the motivation here? What is the problem with the mode parameter? I don't think the new code is better. In particular, adding four new macros makes this code harder to read and understand.

@hkBst
Member Author

hkBst commented Mar 10, 2025

> > Separate the unescape functions for string, byte string and C string, but avoid duplicating code via macro_rules.
>
> You seem to really want to separate those functions. What's the motivation here? What is the problem with the mode parameter? I don't think the new code is better. In particular, adding four new macros makes this code harder to read and understand.

Perhaps it is completely naive, but I am still hoping this will unlock a little bit of performance. :)

On top of that, I dislike the functions (unescape_unicode and unescape_mixed) that match on the Mode and have unreachable for some of the variants. Matching on the mode does not seem to provide any benefit over calling the right unescape function directly. And there are downsides: the reason that there are two unescape functions at the top level is that the signatures of the callbacks are too different to unify, unless you make them all use MixedUnit. But this is only necessary because we try to push everything through these top-level functions in the first place. If we didn't, each function could have its natural type, eliminating the need for unreachable.
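
The contrast described above can be sketched in a few lines. This is a deliberately tiny model with made-up names and signatures (`Mode` has only two variants here, and the callbacks are simplified); it is not the lexer's actual code.

```rust
/// Simplified stand-in for the lexer's mode enum.
#[derive(Clone, Copy)]
pub enum Mode {
    Str,
    ByteStr,
}

/// Mode-dispatch style: one entry point, but only some modes are valid for it.
pub fn unescape_via_mode(src: &str, mode: Mode, out: &mut Vec<char>) {
    match mode {
        Mode::Str => out.extend(src.chars()),
        // Callers must simply know not to pass this; the compiler cannot check it.
        Mode::ByteStr => unreachable!("byte strings need a u8-based callback"),
    }
}

/// Separate-function style: each function has its natural callback type,
/// so there is no invalid combination left to mark as unreachable.
pub fn check_str(src: &str, callback: &mut impl FnMut(char)) {
    src.chars().for_each(|c| callback(c));
}

pub fn check_byte_str(src: &str, callback: &mut impl FnMut(Result<u8, char>)) {
    src.chars().for_each(|c| {
        // Non-ASCII chars are reported back as errors carrying the char.
        callback(if c.is_ascii() { Ok(c as u8) } else { Err(c) })
    });
}
```

With the separate functions, passing the wrong kind of callback is a compile error rather than a runtime panic.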

Also, the current common functions have many booleans that influence their behavior, plus a generic parameter, which makes the code quite hard to understand. Separating the code into these separate functions greatly helped me to make sense of everything. The macros also suffer a bit from being complicated, but it is easier to see the different instantiations, which makes it less bad than the previous situation (in my author-biased opinion). And since you objected to the code duplication of the previous version, eliminating that duplication via macros seems to be the logical answer.
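
To make "easier to see the different instantiations" concrete, here is a cut-down sketch of the macro approach. The `CheckError` enum, the `ascii_to_byte` helper and the simplified callback signature are assumptions for this example; the PR's real macros take more parameters.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum CheckError {
    BareCarriageReturn,
    NonAscii,
}

/// Generate one raw-literal checker per unit type from a shared body.
macro_rules! check {
    ($name:ident: $char2unit:expr => $unit:ty) => {
        pub fn $name(src: &str, callback: &mut impl FnMut(usize, Result<$unit, CheckError>)) {
            for (pos, c) in src.char_indices() {
                let res: Result<$unit, CheckError> = if c == '\r' {
                    Err(CheckError::BareCarriageReturn)
                } else {
                    $char2unit(c)
                };
                callback(pos, res);
            }
        }
    };
}

fn ascii_to_byte(c: char) -> Result<u8, CheckError> {
    if c.is_ascii() { Ok(c as u8) } else { Err(CheckError::NonAscii) }
}

// Each instantiation is one line that names the function, the conversion and the unit type.
check!(check_raw_str: Ok => char);
check!(check_raw_byte_str: ascii_to_byte => u8);
```

The shared body lives in one place, while the list of instantiations at the bottom shows at a glance how the variants differ.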

I propose that we do a perf run to see if there is any perf to be had. If I haven't made a mistake then it should at worst be neutral this time and we can discuss the comparative code qualities of common vs (macro or non-macro) separate. If there is a small perf win, then perhaps we can discuss the right mix of macro vs non-macro separate functions?

@hkBst
Member Author

hkBst commented Mar 12, 2025

An alternative to the macros may be using a trait:

-macro_rules! check {
-    ($string_ty:literal
-     ($check:ident: $char2unit:expr => $unit:ty)) => {
-        #[doc = concat!("Take the contents of a raw ", stringify!($string_ty),
-                        " literal (without quotes) and produce a sequence of results of ",
-                        stringify!($unit_ty), " or error (returned via `callback`).",
-                        "\nNB: Raw strings don't do any unescaping, but do produce errors on bare CR.")]
-        pub fn $check(src: &str, callback: &mut impl FnMut(Range<usize>, Result<$unit, EscapeError>))
-        {
-            src.char_indices().for_each(|(pos, c)| {
-                callback(
-                    pos..pos + c.len_utf8(),
-                    if c == '\r' { Err(EscapeError::BareCarriageReturnInRawString) } else { $char2unit(c) },
-                );
-            });
-        }
-    };
+mod private {
+    #[allow(unreachable_pub)]
+    pub trait Sealed {}
+
+    impl Sealed for str {}
+    impl Sealed for [u8] {}
+    impl Sealed for std::ffi::CStr {}
+}
+
+trait Check: private::Sealed {
+    type Unit;
+
+    fn char2unit(c: char) -> Result<Self::Unit, EscapeError>;
+
+    /// Take the contents of a raw literal (without quotes) and produce a sequence of
+    /// `Result<Self::Unit, EscapeError>` (returned via `callback`).
+    ///
+    /// NB: Raw strings don't do any unescaping, but do produce errors on bare CR.
+    fn check_raw(
+        src: &str,
+        callback: &mut impl FnMut(Range<usize>, Result<Self::Unit, EscapeError>),
+    ) {
+        src.char_indices().for_each(|(pos, c)| {
+            callback(
+                pos..pos + c.len_utf8(),
+                if c == '\r' {
+                    Err(EscapeError::BareCarriageReturnInRawString)
+                } else {
+                    Self::char2unit(c)
+                },
+            );
+        });
+    }
 }
 
-check!("string" (check_raw_str: Ok => char));
-check!("byte string" (check_raw_byte_str: ascii_char_to_byte => u8));
-check!("C string" (check_raw_cstr: |c| NonZero::<char>::new(c).ok_or(EscapeError::NulInCStr) => NonZero<char>));
+impl Check for str {
+    type Unit = char;
+
+    fn char2unit(c: char) -> Result<Self::Unit, EscapeError> {
+        Ok(c)
+    }
+}
+
+impl Check for [u8] {
+    type Unit = u8;
+
+    fn char2unit(c: char) -> Result<Self::Unit, EscapeError> {
+        ascii_char_to_byte(c)
+    }
+}
+
+impl Check for CStr {
+    type Unit = NonZero<char>;
+
+    fn char2unit(c: char) -> Result<Self::Unit, EscapeError> {
+        NonZero::<char>::new(c).ok_or(EscapeError::NulInCStr)
+    }
+}

I'm not sure it's really much better...
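
For readers who want to try the trait-based shape outside the compiler, below is a self-contained sketch along the lines of the diff above. It makes several assumptions: `EscapeError` is a local three-variant stand-in, the sealing module is omitted, the `CStr` impl uses plain `char` instead of `NonZero<char>`, and `collect_raw` is a hypothetical convenience helper.

```rust
use std::ffi::CStr;
use std::ops::Range;

/// Local stand-in for the lexer's error enum.
#[derive(Debug, PartialEq)]
pub enum EscapeError {
    BareCarriageReturnInRawString,
    NonAsciiCharInByte,
    NulInCStr,
}

pub trait Check {
    type Unit;

    fn char2unit(c: char) -> Result<Self::Unit, EscapeError>;

    /// Report one result per char of a raw literal's contents via `callback`.
    fn check_raw(
        src: &str,
        callback: &mut impl FnMut(Range<usize>, Result<Self::Unit, EscapeError>),
    ) {
        for (pos, c) in src.char_indices() {
            let res = if c == '\r' {
                Err(EscapeError::BareCarriageReturnInRawString)
            } else {
                Self::char2unit(c)
            };
            callback(pos..pos + c.len_utf8(), res);
        }
    }
}

impl Check for str {
    type Unit = char;
    fn char2unit(c: char) -> Result<char, EscapeError> {
        Ok(c)
    }
}

impl Check for [u8] {
    type Unit = u8;
    fn char2unit(c: char) -> Result<u8, EscapeError> {
        if c.is_ascii() { Ok(c as u8) } else { Err(EscapeError::NonAsciiCharInByte) }
    }
}

impl Check for CStr {
    // Plain `char` here; the diff above uses `NonZero<char>` instead.
    type Unit = char;
    fn char2unit(c: char) -> Result<char, EscapeError> {
        if c == '\0' { Err(EscapeError::NulInCStr) } else { Ok(c) }
    }
}

/// Hypothetical helper: gather all results for literal kind `T`.
pub fn collect_raw<T: Check + ?Sized>(src: &str) -> Vec<Result<T::Unit, EscapeError>> {
    let mut out = Vec::new();
    T::check_raw(src, &mut |_range, res| out.push(res));
    out
}
```

Callers select the literal kind by type, for example `collect_raw::<[u8]>("ab")` or `<CStr as Check>::check_raw(src, &mut callback)`.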

@nnethercote
Contributor

@bors try @rust-timer queue

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 13, 2025
bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 13, 2025
@bors
Contributor

bors commented Mar 13, 2025

⌛ Trying commit 2e53992 with merge 2d1434d...

@bors
Contributor

bors commented Mar 13, 2025

☀️ Try build successful - checks-actions
Build commit: 2d1434d (2d1434d32070e541dee233bbd2807e08ab4512d2)

@rust-timer
Collaborator

Finished benchmarking commit (2d1434d): comparison URL.

Overall result: ✅ improvements - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -0.7% | [-1.1%, -0.4%] | 16 |
| All ❌✅ (primary) | - | - | 0 |

Max RSS (memory usage)

Results (primary 2.0%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 2.0% | [2.0%, 2.0%] | 1 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 2.0% | [2.0%, 2.0%] | 1 |

Cycles

Results (primary -2.6%, secondary -3.3%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -2.6% | [-2.6%, -2.6%] | 1 |
| Improvements ✅ (secondary) | -3.3% | [-3.7%, -2.7%] | 3 |
| All ❌✅ (primary) | -2.6% | [-2.6%, -2.6%] | 1 |

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 777.224s -> 776.837s (-0.05%)
Artifact size: 365.30 MiB -> 365.29 MiB (-0.00%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 13, 2025
@rust-cloud-vms rust-cloud-vms bot force-pushed the cleanup_lexer_unescape_macros branch from 2e53992 to 848dd75 Compare March 14, 2025 09:44
@rustbot
Collaborator

rustbot commented Mar 14, 2025

Some changes occurred in src/tools/clippy

cc @rust-lang/clippy

@hkBst
Member Author

hkBst commented Mar 14, 2025

I've added a second commit which completes the removal of unescape_unicode and with it the final use of unreachable in this file. No more lying to the type system.

@rust-cloud-vms rust-cloud-vms bot force-pushed the cleanup_lexer_unescape_macros branch 2 times, most recently from b7ee702 to 61845a9 Compare March 19, 2025 21:01
@rust-cloud-vms rust-cloud-vms bot force-pushed the cleanup_lexer_unescape_macros branch from 61845a9 to c6eaeb4 Compare March 21, 2025 13:15
@hkBst
Member Author

hkBst commented Mar 21, 2025

Some churn to deal with #136355 getting merged and then reverted in #138661. It may be a good idea to improve the interface of unescape.rs before it gets turned into a crate.

@rust-cloud-vms rust-cloud-vms bot force-pushed the cleanup_lexer_unescape_macros branch from c6eaeb4 to 30822ec Compare March 21, 2025 13:43
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue.
7 participants