Separate the unescape functions but avoid duplicating code #138163

hkBst · 2025-03-07T14:24:08Z

Separate the unescape functions for string, byte string and C string, but avoid duplicating code via macro_rules.

Also plays with NonZero, since C strings cannot contain null bytes, which can be captured in the type system.
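As a rough illustration of what "captured in the type system" can mean here, the sketch below wraps a non-zero scalar value in a newtype. `NonNulChar` and `NulInCStr` are hypothetical names invented for this example, and it uses the stable `NonZeroU32` rather than the `NonZero<char>` in the PR, so it shows the idea only, not the PR's code.

```rust
use std::num::NonZeroU32;

/// Hypothetical error, standing in for `EscapeError::NulInCStr`.
#[derive(Debug, PartialEq)]
pub struct NulInCStr;

/// Hypothetical wrapper: a Unicode scalar value that is known not to be NUL.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct NonNulChar(NonZeroU32);

impl NonNulChar {
    /// Rejects `'\0'` once, at construction time.
    pub fn new(c: char) -> Result<Self, NulInCStr> {
        NonZeroU32::new(c as u32).map(NonNulChar).ok_or(NulInCStr)
    }

    /// Recovers the original `char`.
    pub fn get(self) -> char {
        // The stored value always came from a `char`, so this cannot fail.
        char::from_u32(self.0.get()).expect("constructed from a valid char")
    }
}
```

With this shape, `NonNulChar::new('\0')` returns `Err(NulInCStr)`, and any code that receives a `NonNulChar` no longer needs its own NUL check.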

rustbot · 2025-03-07T14:24:13Z

rust-analyzer is developed in its own repository. If possible, consider making this change to rust-lang/rust-analyzer instead.

cc @rust-lang/rust-analyzer

workingjubilee · 2025-03-07T21:09:36Z

compiler/rustc_lexer/src/unescape.rs

```diff
+    /// Used for ASCII chars (written directly or via `\x01`..`\x7f` escapes)
    /// and Unicode chars (written directly or via `\u` escapes).
    ///
    /// For example, if '¥' appears in a string it is represented here as
    /// `MixedUnit::Char('¥')`, and it will be appended to the relevant byte
    /// string as the two-byte UTF-8 sequence `[0xc2, 0xa5]`
-    Char(char),
+    Char(NonZero<char>),
```


UTF-8 includes the value 0x00, which may be written in a Rust string like so: `"\0"`. The description of the `MixedUnit` type seems misleading?

*headscratch* Stared at things a bit longer. If it is only used in the case of `c""`, then the documentation of this type should be changed to reflect the fact that it is an implementation detail of Specifically That, as this is misleading as-is.

My understanding however was that there was a desire to eventually have "mixed strings which are not necessarily CStr", which is probably why the type is described in a more generic way and doesn't reference CStr currently.

hkBst · 2025-03-08

It's currently only used for non-raw C strings, AFAIK, as raw C strings are perfectly happy with `char`.
@nnethercote's a1c0721 does mention that "it will soon be used in more than just C string literals", but I don't know what that refers to. It seems to be for https://rust-lang.github.io/rfcs/3349-mixed-utf8-literals.html, which would mean byte strings would also use this construction, and those are fine with null bytes. So either we don't use NonZero for C strings, or we use different unit types for C strings and for byte strings.
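To make the point about null bytes concrete, here is a minimal sketch of how mixed units could be appended to a byte buffer. The `MixedUnit` below is a simplified stand-in declared for this example, and `push_unit` is a hypothetical helper, not a function from `unescape.rs`.

```rust
/// Simplified stand-in for the `MixedUnit` type under discussion
/// (an assumption for this sketch, not the compiler's exact definition).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum MixedUnit {
    /// An ASCII or Unicode char, written directly or via an escape.
    Char(char),
    /// A single non-ASCII byte, written via a `\x80`..`\xff` escape.
    HighByte(u8),
}

/// Hypothetical helper: append one unit to a byte buffer.
/// Chars are encoded as UTF-8; high bytes are pushed verbatim.
pub fn push_unit(buf: &mut Vec<u8>, unit: MixedUnit) {
    match unit {
        MixedUnit::Char(c) => {
            let mut utf8 = [0u8; 4];
            buf.extend_from_slice(c.encode_utf8(&mut utf8).as_bytes());
        }
        MixedUnit::HighByte(b) => buf.push(b),
    }
}
```

Nothing in this generic path rejects `MixedUnit::Char('\0')`, which is fine for a byte string but not for a C string. That is why a NUL-free unit type would only fit the C string case.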

nnethercote · 2025-03-10T06:21:41Z

> Separate the unescape functions for string, byte string and C string, but avoid duplicating code via macro_rules.

You seem to really want to separate those functions. What's the motivation here? What is the problem with the mode parameter? I don't think the new code is better. In particular, adding four new macros makes this code harder to read and understand.

hkBst · 2025-03-10T09:10:14Z

> > Separate the unescape functions for string, byte string and C string, but avoid duplicating code via macro_rules.
>
> You seem to really want to separate those functions. What's the motivation here? What is the problem with the mode parameter? I don't think the new code is better. In particular, adding four new macros makes this code harder to read and understand.

Perhaps it is completely naive, but I am still hoping this will unlock a little bit of performance. :)

On top of that, I dislike the functions (`unescape_unicode` and `unescape_mixed`) that match on the `Mode` and have `unreachable!` for some of the variants. This does not seem to provide any benefit over calling the right unescape function directly. And there are downsides: the reason there are two unescape functions at the top level is that the signatures of the callbacks are too different to unify, unless you make them all use `MixedUnit`. But this is only necessary because we try to push everything through these top-level functions in the first place. If we didn't, each function could have its natural type, eliminating the need for `unreachable!`.

Also, the current common functions have many booleans that influence their behavior, plus a generic parameter, which makes the code quite hard to understand. Separating the code into these separate functions greatly helped me make sense of everything. The macros also suffer a bit from being complicated, but it is easier to see the different instantiations, which makes it less bad than the previous situation (in my author-biased opinion). And since you objected to the code duplication in the previous version, eliminating that duplication via macros seems to be the logical answer.

I propose that we do a perf run to see if there is any perf to be had. If I haven't made a mistake then it should at worst be neutral this time and we can discuss the comparative code qualities of common vs (macro or non-macro) separate. If there is a small perf win, then perhaps we can discuss the right mix of macro vs non-macro separate functions?
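For readers following along, the contrast between the two styles can be sketched roughly as below. Everything here is a trimmed-down assumption made for illustration: the `Mode` variants, the `EscapeError` variants, and both function bodies are simplified and are not the code from `unescape.rs`.

```rust
use std::ops::Range;

/// Hypothetical, trimmed-down error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum EscapeError {
    BareCarriageReturnInRawString,
    NonAsciiCharInByte,
}

/// Hypothetical mode enum, standing in for the `Mode` parameter.
#[derive(Clone, Copy)]
pub enum Mode {
    RawStr,
    RawByteStr,
    RawCStr,
}

/// Mode-dispatch style: one entry point whose callback type fits only some modes.
pub fn check_raw_unicode(
    src: &str,
    mode: Mode,
    callback: &mut impl FnMut(Range<usize>, Result<char, EscapeError>),
) {
    match mode {
        Mode::RawStr | Mode::RawByteStr => {
            for (pos, c) in src.char_indices() {
                let res = if c == '\r' {
                    Err(EscapeError::BareCarriageReturnInRawString)
                } else if matches!(mode, Mode::RawByteStr) && !c.is_ascii() {
                    Err(EscapeError::NonAsciiCharInByte)
                } else {
                    Ok(c)
                };
                callback(pos..pos + c.len_utf8(), res);
            }
        }
        // The signature cannot rule this call out, so it becomes a runtime panic.
        Mode::RawCStr => unreachable!("C strings go through a different entry point"),
    }
}

/// Separate style: the byte string function yields `u8` directly, with no mode to match on.
pub fn check_raw_byte_str(
    src: &str,
    callback: &mut impl FnMut(Range<usize>, Result<u8, EscapeError>),
) {
    for (pos, c) in src.char_indices() {
        let res = if c == '\r' {
            Err(EscapeError::BareCarriageReturnInRawString)
        } else if c.is_ascii() {
            Ok(c as u8)
        } else {
            Err(EscapeError::NonAsciiCharInByte)
        };
        callback(pos..pos + c.len_utf8(), res);
    }
}
```

In the first function a caller can still pass `Mode::RawCStr` and only find out at run time; in the second there is no mode to get wrong, and the caller receives bytes without a second conversion step.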

hkBst · 2025-03-12T12:35:02Z

An alternative to the macros may be using a trait:

```diff
-macro_rules! check {
-    ($string_ty:literal
-     ($check:ident: $char2unit:expr => $unit:ty)) => {
-        #[doc = concat!("Take the contents of a raw ", stringify!($string_ty),
-                        " literal (without quotes) and produce a sequence of results of ",
-                        stringify!($unit_ty), " or error (returned via `callback`).",
-                        "\nNB: Raw strings don't do any unescaping, but do produce errors on bare CR.")]
-        pub fn $check(src: &str, callback: &mut impl FnMut(Range<usize>, Result<$unit, EscapeError>))
-        {
-            src.char_indices().for_each(|(pos, c)| {
-                callback(
-                    pos..pos + c.len_utf8(),
-                    if c == '\r' { Err(EscapeError::BareCarriageReturnInRawString) } else { $char2unit(c) },
-                );
-            });
-        }
-    };
+mod private {
+    #[allow(unreachable_pub)]
+    pub trait Sealed {}
+
+    impl Sealed for str {}
+    impl Sealed for [u8] {}
+    impl Sealed for std::ffi::CStr {}
+}
+
+trait Check: private::Sealed {
+    type Unit;
+
+    fn char2unit(c: char) -> Result<Self::Unit, EscapeError>;
+
+    /// Take the contents of a raw literal (without quotes) and produce a sequence of
+    /// `Result<Self::Unit, EscapeError>` (returned via `callback`).
+    ///
+    /// NB: Raw strings don't do any unescaping, but do produce errors on bare CR.
+    fn check_raw(
+        src: &str,
+        callback: &mut impl FnMut(Range<usize>, Result<Self::Unit, EscapeError>),
+    ) {
+        src.char_indices().for_each(|(pos, c)| {
+            callback(
+                pos..pos + c.len_utf8(),
+                if c == '\r' {
+                    Err(EscapeError::BareCarriageReturnInRawString)
+                } else {
+                    Self::char2unit(c)
+                },
+            );
+        });
+    }
 }
 
-check!("string" (check_raw_str: Ok => char));
-check!("byte string" (check_raw_byte_str: ascii_char_to_byte => u8));
-check!("C string" (check_raw_cstr: |c| NonZero::<char>::new(c).ok_or(EscapeError::NulInCStr) => NonZero<char>));
+impl Check for str {
+    type Unit = char;
+
+    fn char2unit(c: char) -> Result<Self::Unit, EscapeError> {
+        Ok(c)
+    }
+}
+
+impl Check for [u8] {
+    type Unit = u8;
+
+    fn char2unit(c: char) -> Result<Self::Unit, EscapeError> {
+        ascii_char_to_byte(c)
+    }
+}
+
+impl Check for CStr {
+    type Unit = NonZero<char>;
+
+    fn char2unit(c: char) -> Result<Self::Unit, EscapeError> {
+        NonZero::<char>::new(c).ok_or(EscapeError::NulInCStr)
+    }
+}
```

I'm not sure it's really much better...
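For comparison, a call site under the trait-based variant might look roughly like the sketch below. It is a self-contained reduction of the diff above, under stated assumptions: the sealing module and the C string impl are left out, `EscapeError` is trimmed to two variants, the ASCII check is inlined in place of `ascii_char_to_byte`, and `raw_byte_units` is a hypothetical caller invented for this example.

```rust
use std::ops::Range;

/// Hypothetical, trimmed-down error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum EscapeError {
    BareCarriageReturnInRawString,
    NonAsciiCharInByte,
}

pub trait Check {
    type Unit;

    fn char2unit(c: char) -> Result<Self::Unit, EscapeError>;

    /// Shared default body: only `char2unit` differs per literal kind.
    fn check_raw(
        src: &str,
        callback: &mut impl FnMut(Range<usize>, Result<Self::Unit, EscapeError>),
    ) {
        for (pos, c) in src.char_indices() {
            let res = if c == '\r' {
                Err(EscapeError::BareCarriageReturnInRawString)
            } else {
                Self::char2unit(c)
            };
            callback(pos..pos + c.len_utf8(), res);
        }
    }
}

impl Check for str {
    type Unit = char;

    fn char2unit(c: char) -> Result<char, EscapeError> {
        Ok(c)
    }
}

impl Check for [u8] {
    type Unit = u8;

    fn char2unit(c: char) -> Result<u8, EscapeError> {
        // Inlined stand-in for the `ascii_char_to_byte` helper in the diff.
        if c.is_ascii() { Ok(c as u8) } else { Err(EscapeError::NonAsciiCharInByte) }
    }
}

/// Hypothetical caller: the literal kind is chosen by type, not by a mode value.
pub fn raw_byte_units(src: &str) -> Vec<Result<u8, EscapeError>> {
    let mut units = Vec::new();
    <[u8] as Check>::check_raw(src, &mut |_range, unit| units.push(unit));
    units
}
```

The trade-off is visible at the call site: `<[u8] as Check>::check_raw(..)` is noisier than a plain `check_raw_byte_str(..)`, but the shared body is ordinary code rather than a macro expansion.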

nnethercote · 2025-03-13T00:15:12Z

@bors try @rust-timer queue

… r=<try> Separate the unescape functions but avoid duplicating code Separate the unescape functions for string, byte string and C string, but avoid duplicating code via macro_rules. Also plays with NonZero, since C strings cannot contain null bytes, which can be captured in the type system. r? `@nnethercote`

bors · 2025-03-13T00:16:27Z

⌛ Trying commit 2e53992 with merge 2d1434d...

bors · 2025-03-13T02:19:33Z

☀️ Try build successful - checks-actions
Build commit: 2d1434d (2d1434d32070e541dee233bbd2807e08ab4512d2)

rust-timer · 2025-03-13T05:22:54Z

Finished benchmarking commit (2d1434d): comparison URL.

Overall result: ✅ improvements - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -0.7% | [-1.1%, -0.4%] | 16 |
| All ❌✅ (primary) | - | - | 0 |

Max RSS (memory usage)

Results (primary 2.0%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 2.0% | [2.0%, 2.0%] | 1 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 2.0% | [2.0%, 2.0%] | 1 |

Cycles

Results (primary -2.6%, secondary -3.3%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -2.6% | [-2.6%, -2.6%] | 1 |
| Improvements ✅ (secondary) | -3.3% | [-3.7%, -2.7%] | 3 |
| All ❌✅ (primary) | -2.6% | [-2.6%, -2.6%] | 1 |

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 777.224s -> 776.837s (-0.05%)
Artifact size: 365.30 MiB -> 365.29 MiB (-0.00%)
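
The percentages on the last two lines are consistent with a plain relative change, `(after - before) / before`. A small sketch of that arithmetic, assuming this is how the figures were derived (`percent_change` is a name made up for this example):

```rust
/// Relative change from `before` to `after`, expressed in percent.
pub fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}
```

For the bootstrap times, `percent_change(777.224, 776.837)` is about -0.0498, which rounds to the reported -0.05%.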

rustbot · 2025-03-14T09:44:50Z

Some changes occurred in src/tools/clippy

cc @rust-lang/clippy

hkBst · 2025-03-14T10:19:03Z

I've added a second commit which completes the removal of unescape_unicode and with it the final use of unreachable in this file. No more lying to the type system.

hkBst · 2025-03-21T13:19:59Z

Some churn to deal with #136355 getting merged and then reverted in #138661. It may be a good idea to improve the interface of unescape.rs before it gets turned into a crate.

rustbot assigned nnethercote Mar 7, 2025

rustbot added S-waiting-on-review T-compiler T-libs labels Mar 7, 2025

workingjubilee reviewed Mar 7, 2025

rustbot added the S-waiting-on-perf label Mar 13, 2025

rustbot removed the S-waiting-on-perf label Mar 13, 2025

rust-cloud-vms bot force-pushed the cleanup_lexer_unescape_macros branch from 2e53992 to 848dd75 Compare March 14, 2025 09:44

rust-cloud-vms bot force-pushed the cleanup_lexer_unescape_macros branch 2 times, most recently from b7ee702 to 61845a9 Compare March 19, 2025 21:01

Separate the unescape functions for string, byte string and C string,…

1cbecc3

… but avoid duplicating code via macro_rules. Also plays with NonZero, since C strings cannot contain null bytes, which can be captured in the type system.

rust-cloud-vms bot force-pushed the cleanup_lexer_unescape_macros branch from 61845a9 to c6eaeb4 Compare March 21, 2025 13:15

Replace all uses of unescape_unicode: no more unreachable!

30822ec

rust-cloud-vms bot force-pushed the cleanup_lexer_unescape_macros branch from c6eaeb4 to 30822ec Compare March 21, 2025 13:43
