Skip to content
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.

Commit 07b5f8c

Browse files
committedJan 11, 2025
move the unicode discussion to the design notes
There was a lobste.rs thread about this that was pretty disappointing, given that the purpose of this project is the big ideas and this is a pretty irrelevant detail. Moving this here means it's not the first thing you see clicking around on the project, and also puts the strings discussion in context of the larger question of encoding handling.
1 parent d75dca6 commit 07b5f8c

File tree

2 files changed

+36
-41
lines changed

2 files changed

+36
-41
lines changed
 

‎doc/design_notes.md

+36-20
Original file line numberDiff line numberDiff line change
@@ -118,35 +118,51 @@ Ninja
118118
which is to say it treated input build files as any byte stream that is ASCII
119119
compatible. In other words, any string of bytes found in a `build.ninja` is
120120
passed verbatim through printing to stdout and to the OS for path operations,
121-
which meant Ninja was compatible with both UTF-8 and other encodings. The intent
122-
is that those other encodings occur on Linuxes, especially in East Asia, and
123-
also it means Ninja doesn't need any specific handling of even UTF-8.
121+
which meant Ninja was compatible with both UTF-8 and other encodings where
122+
slashes were the ASCII slash byte. The intent is that those other encodings
123+
occur on Linuxes, especially in East Asia, and also it means Ninja doesn't need
124+
any specific handling of even UTF-8.
124125

125126
It looks like since my time,
126-
[Ninja on Windows changed its input files to require UTF-8](https://github.com/ninja-build/ninja/pull/1915).
127-
As you can see from the comments on that bug, this area is unfortunately pretty
128-
subtle.
127+
[Ninja on Windows changed to require UTF-8 in its input files](https://github.com/ninja-build/ninja/pull/1915).
128+
As you can see from the comments on that bug, this was a breaking change in an
129+
area that is unfortunately pretty subtle.
129130

130131
Windows is particularly fiddly not only because its native path representation
131132
is UTF-16 -- which is incompatible with the original byte stream assumption made
132133
by Ninja and which requires conversions -- but further because Ninja needs to
133-
parse the `/showIncludes` output from the MSVC compiler, which is localized. See
134-
the `msvc_deps_prefix` variable in
135-
[the Ninja docs on deps handling](https://ninja-build.org/manual.html#_deps);
136-
there have been lots of bug reports over the years from people with Chinese
134+
parse the `/showIncludes` output from the MSVC compiler, which includes
135+
localized strings. See the `msvc_deps_prefix` variable in
136+
[the Ninja docs on deps handling](https://ninja-build.org/manual.html#_deps).
137+
There have been lots of bug reports over the years from people with Chinese
137138
output that is failing to parse right due to Windows code page mess.
138139

140+
For these reasons it's not clear to me that it's a better design to require
141+
input files to always be UTF-8. Perhaps in a UTF-8 world if you needed something
142+
other than UTF-8 in your `msvc_deps_prefix` we could provide some mechanism to
143+
encode escaped byte strings in the file, but it feels pretty gross.
144+
145+
### Path handling and Unicode safety
146+
139147
In any case, n2 doesn't support any of this for now, and instead just follows
140-
Ninja in treating paths as bytes.
141-
142-
It's possibly a better design to require input files to always be UTF-8, though
143-
I think I'd want to better understand the `/showIncludes` situation. (The above
144-
Ninja bug observed this was a breaking change in Ninja, too.) Possibly we could
145-
mandate "input files are UTF-8, and if you need something other than UTF-8 in
146-
your `msvc_deps_prefix` it's on you to escape the exact byte sequence you
147-
desire". However, I'm not clear on whether all possible Windows paths are
148-
representable as UTF-8 or if we'd need to go down the WTF-8 route for full
149-
compatibility.
148+
old Ninja in treating paths as bytes. This will work up until people start
149+
attempting to use n2 with Chinese versions of Visual Studio, I guess.
150+
151+
Concretely, we currently use Rust `String` for all paths and file contents, but
152+
internally interpret them as as bytes (not UTF-8) including using `unsafe`
153+
sometimes to convert. Based on my superficial understanding of how safety
154+
relates to UTF-8 in Rust strings, it's probably harmless given that we never
155+
treat strings as Unicode, but it's also possible some code outside of our
156+
control relies on this. But it does mean there's some extra `unsafe`s in the
157+
code, and some of them are possibly actually doing something bad.
158+
159+
This has made some people upset, but in reality this is just some hobbyist code
160+
and non-ASCII build files are not at the top of my worries. Changing the type of
161+
strings to bags of bytes is a relatively straightforward but extremely tedious
162+
change, with impact on not only paths but also console output, error messages,
163+
debugging, etc. And it's ultimately not clear that using bags of bytes or even
164+
UTF-8 is the desired end state, so it's probably not worth doing until I figure
165+
that out.
150166

151167
## Tracking build state
152168

‎doc/development.md

-21
Original file line numberDiff line numberDiff line change
@@ -6,27 +6,6 @@ On a new checkout, run this to install the formatting check hook:
66
$ ln -s ../../git-pre-commit .git/hooks/pre-commit
77
```
88

9-
## Path handling and Unicode safety
10-
11-
See the longer discussion of Unicode in general in the
12-
[design notes](design_notes.md).
13-
14-
Concretely, we currently use Rust `String` for all paths and file contents, but
15-
internally interpret them as as bytes (not UTF-8) including using `unsafe`
16-
sometimes to convert.
17-
18-
Based on my superficial understanding of how safety relates to UTF-8 in Rust
19-
strings, it's probably harmless given that we never treat strings as Unicode,
20-
but it's also possible some code outside of our control relies on this. But it
21-
does mean there's a bunch of kind of needless `unsafe`s in the code, and some of
22-
them are possibly actually doing something bad.
23-
24-
We could fix this by switching to using a bag of bytes type, like
25-
https://crates.io/crates/bstr. But it is pretty invasive. We would need to use
26-
that not only for paths but also console output, error messages, etc. And it's
27-
not clear (again, see above design notes discussion) that using bags of bytes is
28-
the desired end state, so it's probably not worth doing.
29-
309
## Profiling
3110

3211
### gperftools

0 commit comments

Comments
 (0)
Failed to load comments.