Substitute a replacement character for invalid UTF-16 text in a TextSpan #84887

Merged
merged 1 commit into from
Jun 24, 2021

jason-simmons
Member

Fixes #84693

String text = 'Hello\uD83DWorld';
painter.text = TextSpan(text: text);
painter.layout();
expect(painter.width, greaterThan(0.0));
Contributor

It should be exactly 10.0, right? Assuming the default font size is 10 and the font is Ahem.

Member Author

done

Contributor

Hixie left a comment

LGTM, but travis is angry.

...I really should update this image.

Contributor

Hixie commented Jun 18, 2021

Bonus points if you mention this somewhere in the API docs, maybe for Text, TextSpan, and TextPainter in particular.

sm2017 commented Jun 19, 2021

@jason-simmons In this line, as I understand it, you replace the whole text with the REPLACEMENT CHARACTER. Why the whole text? We should replace only the invalid characters.

You convert 'Hello\uD83DWorld' to �, but I think it could be Hello�World, which is the desired output. See Specials_(Unicode_block):

U+FFFD (�) REPLACEMENT CHARACTER: used to replace an unknown, unrecognized, or unrepresentable character.

The replacement character � (often displayed as a black rhombus with a white question mark) is a symbol found in the Unicode standard at code point U+FFFD in the Specials table. It is used to indicate problems when a system is unable to render a stream of data to a correct symbol.[4] It is usually seen when the data is invalid and does not match any character:

Consider a text file containing the German word für (meaning 'for') in the ISO-8859-1 encoding (0x66 0xFC 0x72). This file is now opened with a text editor that assumes the input is UTF-8. The first and last byte are valid UTF-8 encodings of ASCII, but the middle byte (0xFC) is not a valid byte in UTF-8. Therefore, a text editor could replace this byte with the replacement character symbol to produce a valid string of Unicode code points. The whole string now displays like this: "f�r".

A poorly implemented text editor might save the replacement in UTF-8 form; the text file data will then look like this: 0x66 0xEF 0xBF 0xBD 0x72, which will be displayed in ISO-8859-1 as "fï¿½r" (this is called mojibake). Since the replacement is the same for all errors this makes it impossible to recover the original character. A better (but harder to implement) design is to preserve the original bytes, including the error, and only convert to the replacement when displaying the text. This will allow the text editor to save the original byte sequence, while still showing the error indicator to the user.

At one time the replacement character was often used when there was no glyph available in a font for that character. However most modern text rendering systems instead use a font's .notdef character, which in most cases is an empty box (or "?" or "X" in a box[5]), sometimes called a "tofu" (this browser displays �). There is no Unicode code point for this symbol.

Thus the replacement character is now only seen for encoding errors, such as invalid UTF-8. Some software attempts to hide this by translating the bytes of invalid UTF-8 to matching characters in Windows-1252 (since that is the most likely source of these errors), so that the replacement character is never seen.
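The "für" example quoted above can be reproduced with `dart:convert`. This is only an illustrative sketch: the helper names `decodeLossy` and `misreadAsLatin1` are made up for this example and are not part of Flutter or of this PR.

```dart
import 'dart:convert';

/// Hypothetical helper: decodes [bytes] as UTF-8 and substitutes U+FFFD
/// for malformed sequences instead of throwing a FormatException.
String decodeLossy(List<int> bytes) =>
    utf8.decode(bytes, allowMalformed: true);

/// Hypothetical helper: encodes [text] as UTF-8 and then misreads those
/// bytes as ISO-8859-1, which is how the mojibake in the quote arises.
String misreadAsLatin1(String text) => latin1.decode(utf8.encode(text));

/// The ISO-8859-1 bytes of "für"; 0xFC is not a valid byte in UTF-8.
const List<int> fuerLatin1 = <int>[0x66, 0xFC, 0x72];
```

Under these assumptions, `decodeLossy(fuerLatin1)` gives `f\uFFFDr`, and passing that result to `misreadAsLatin1` gives `fï¿½r`. The original 0xFC byte cannot be recovered from either string, which is the information loss the quoted passage describes.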

Contributor

Hixie commented Jun 19, 2021

Replacing individual characters is a lot of work. I really don't think we should do that, given that the whole point is to fail. If you want to replace invalid characters, that's something to do in your app.

sm2017 commented Jun 19, 2021

@Hixie I understand. Assume I want to do it: there are many, many strings and Text widgets in the application, and some Text widgets are in the dependencies of dependencies.
It's very cumbersome to do this in the application layer. At the least, there should be a global option to override text sanitization.

Contributor

Hixie commented Jun 19, 2021

You would have to provide a central place in your application where strings were sanitized, yes.

But fundamentally, strings should be sanitized long before they reach Text. If they're not, that indicates a more fundamental problem. I don't think it's Flutter's job to provide APIs to make it easier to work around such fundamental problems.
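For readers who do want per-character replacement at the application level, the kind of central sanitizer described above might look roughly like the following. This is a minimal sketch, not something this PR adds: `replaceUnpairedSurrogates` is a hypothetical name, and the sketch assumes the only problem to handle is unpaired UTF-16 surrogates.

```dart
/// Hypothetical app-level helper (not a Flutter API): returns [input] with
/// every unpaired UTF-16 surrogate replaced by U+FFFD. Well-formed
/// surrogate pairs and all other code units are copied unchanged.
String replaceUnpairedSurrogates(String input) {
  const int replacement = 0xFFFD;
  final List<int> units = input.codeUnits;
  final List<int> out = <int>[];

  bool isHigh(int u) => u >= 0xD800 && u <= 0xDBFF;
  bool isLow(int u) => u >= 0xDC00 && u <= 0xDFFF;

  for (int i = 0; i < units.length; i++) {
    final int unit = units[i];
    if (isHigh(unit) && i + 1 < units.length && isLow(units[i + 1])) {
      // A high surrogate followed by a low surrogate is a valid pair.
      out.add(unit);
      out.add(units[i + 1]);
      i++; // Skip the low surrogate that was just copied.
    } else if (isHigh(unit) || isLow(unit)) {
      // A surrogate without its partner is ill-formed UTF-16.
      out.add(replacement);
    } else {
      out.add(unit);
    }
  }
  return String.fromCharCodes(out);
}
```

With this sketch, `replaceUnpairedSurrogates('Hello\uD83DWorld')` yields `Hello\uFFFDWorld`. An app would call it at the point where untrusted strings enter, before they are handed to `Text` or `TextSpan`.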

Labels
a: typography (Text rendering, possibly libtxt)
framework (flutter/packages/flutter repository. See also f: labels.)
Development

Successfully merging this pull request may close these issues.

REPLACEMENT CHARACTER (�) for not well-formed UTF-16 character
5 participants