@@ -118,35 +118,51 @@ Ninja
118
118
which is to say it treated input build files as any byte stream that is ASCII
119
119
compatible. In other words, any string of bytes found in a ` build.ninja ` is
120
120
passed verbatim through printing to stdout and to the OS for path operations,
121
- which meant Ninja was compatible with both UTF-8 and other encodings. The intent
122
- is that those other encodings occur on Linuxes, especially in East Asia, and
123
- also it means Ninja doesn't need any specific handling of even UTF-8.
121
+ which meant Ninja was compatible with both UTF-8 and other encodings where
122
+ slashes were the ASCII slash byte. The intent is that those other encodings
123
+ occur on Linuxes, especially in East Asia, and also it means Ninja doesn't need
124
+ any specific handling of even UTF-8.
124
125
125
126
It looks like since my time,
126
- [ Ninja on Windows changed its input files to require UTF-8] ( https://github.com/ninja-build/ninja/pull/1915 ) .
127
- As you can see from the comments on that bug, this area is unfortunately pretty
128
- subtle.
127
+ [ Ninja on Windows changed to require UTF-8 in its input files ] ( https://github.com/ninja-build/ninja/pull/1915 ) .
128
+ As you can see from the comments on that bug, this was a breaking change in an
129
+ area that is unfortunately pretty subtle.
129
130
130
131
Windows is particularly fiddly not only because its native path representation
131
132
is UTF-16 -- which is incompatible with the original byte stream assumption made
132
133
by Ninja and which requires conversions -- but further because Ninja needs to
133
- parse the ` /showIncludes ` output from the MSVC compiler, which is localized. See
134
- the ` msvc_deps_prefix ` variable in
135
- [ the Ninja docs on deps handling] ( https://ninja-build.org/manual.html#_deps ) ;
136
- there have been lots of bug reports over the years from people with Chinese
134
+ parse the ` /showIncludes ` output from the MSVC compiler, which includes
135
+ localized strings. See the ` msvc_deps_prefix ` variable in
136
+ [ the Ninja docs on deps handling] ( https://ninja-build.org/manual.html#_deps ) .
137
+ There have been lots of bug reports over the years from people with Chinese
137
138
output that is failing to parse right due to Windows code page mess.
138
139
140
+ For these reasons it's not clear to me that it's a better design to require
141
+ input files to always be UTF-8. Perhaps in a UTF-8 world if you needed something
142
+ other than UTF-8 in your ` msvc_deps_prefix ` we could provide some mechanism to
143
+ encode escaped byte strings in the file, but it feels pretty gross.
144
+
145
+ ### Path handling and Unicode safety
146
+
139
147
In any case, n2 doesn't support any of this for now, and instead just follows
140
- Ninja in treating paths as bytes.
141
-
142
- It's possibly a better design to require input files to always be UTF-8, though
143
- I think I'd want to better understand the ` /showIncludes ` situation. (The above
144
- Ninja bug observed this was a breaking change in Ninja, too.) Possibly we could
145
- mandate "input files are UTF-8, and if you need something other than UTF-8 in
146
- your ` msvc_deps_prefix ` it's on you to escape the exact byte sequence you
147
- desire". However, I'm not clear on whether all possible Windows paths are
148
- representable as UTF-8 or if we'd need to go down the WTF-8 route for full
149
- compatibility.
148
+ old Ninja in treating paths as bytes. This will work up until people start
149
+ attempting to use n2 with Chinese versions of Visual Studio, I guess.
150
+
151
+ Concretely, we currently use Rust ` String ` for all paths and file contents, but
152
+ internally interpret them as as bytes (not UTF-8) including using ` unsafe `
153
+ sometimes to convert. Based on my superficial understanding of how safety
154
+ relates to UTF-8 in Rust strings, it's probably harmless given that we never
155
+ treat strings as Unicode, but it's also possible some code outside of our
156
+ control relies on this. But it does mean there's some extra ` unsafe ` s in the
157
+ code, and some of them are possibly actually doing something bad.
158
+
159
+ This has made some people upset, but in reality this is just some hobbyist code
160
+ and non-ASCII build files are not at the top of my worries. Changing the type of
161
+ strings to bags of bytes is a relatively straightforward but extremely tedious
162
+ change, with impact on not only paths but also console output, error messages,
163
+ debugging, etc. And it's ultimately not clear that using bags of bytes or even
164
+ UTF-8 is the desired end state, so it's probably not worth doing until I figure
165
+ that out.
150
166
151
167
## Tracking build state
152
168
0 commit comments