Digit Separators

ISO/IEC JTC1 SC22 WG21 N3499 - 2012-12-19

Lawrence Crowl

Problem
Solution
Constraints
    Program Ambiguity
    Lexical Language Compatibility
    Extension Language Compatibility
Existing Grammar
    2.3 Character sets [lex.charset]
    2.5 Preprocessing tokens [lex.pptoken]
    2.10 Preprocessing numbers [lex.ppnumber]
    2.11 Identifiers [lex.name]
    2.14.2 Integer literals [lex.icon]
    2.14.4 Floating literals [lex.fcon]
    2.14.8 User-defined literals [lex.ext]
    16 Preprocessing directives [cpp]
Approaches
    Remove User-Defined Literals
    Typographic
    Grave Accent
    Single Quote
    Underscore
        Double Underscore
        Scope Operator
        Non-Digit Literal Suffix
        Spacing
        Double Radix Point
        Backslash
Proposal
    2.10 Preprocessing numbers [lex.ppnumber]
    2.14.2 Integer literals [lex.icon]
    2.14.4 Floating literals [lex.fcon]
    2.14.8 User-defined literals [lex.ext]
References

Problem

Numeric literals of more than a few digits are hard to read. Consider the following tasks.

Solution

The problem has a long-standing solution in writing and typography: digit separators. In the English-speaking world, commas are usually used to separate groups of digits.

We wish to introduce digit separators into C++. The exact syntax is still open. The remainder of this paper discusses various approaches to the solution.

Constraints

Constraints on digit separators arise from three distinct sources.

Program Ambiguity

Adding digit separators introduces the potential for ambiguous C++ programs. We would prefer to avoid ambiguity, and failing that would prefer to have usable rules for disambiguating the source. In particular, the interaction with user-defined literals [N2747] [N2765] should be carefully considered.

Lexical Language Compatibility

The lexical structure of C++ is shared with C, Objective C/C++, and other tools through the preprocessor. Any introduction of digit separators should carefully consider compatibility with the existing lexical structure of these languages.

Richard Smith questions the value of compatibility here.

This problem only arises if:

  1. Someone is attempting to write a file which is to be shared between C++14 and other languages, and
  2. They include text in that header which simply does not work in those other languages.

I find it hard to believe that this will be a real problem, and it seems like a clear case of user error. (If you're writing a header which works in C and C++, the burden is on you to make sure it works in C).

This is not a new issue. The same problem already exists with C++11's raw string literals, and to a lesser extent with user-defined-literals and with C's hex floats (which allow 'p+' within pp-numbers).

Extension Language Compatibility

C++ is often used as the basis for extended languages, notably Objective C/C++, but also many languages that are smaller and less widely used. Invalidating those extension languages has costs that are hard to predict.

Existing Grammar

The existing grammar provides both constraints and opportunities.

2.3 Character sets [lex.charset]

Paragraph 1 is as follows.

The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters: [Footnote: The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files. —end footnote]

a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '

Of particular note, the only printable ASCII characters not used in the C++ basic source character set are $ (dollar), @ (commercial at sign), and ` (grave accent, back tick). All of these characters have been used for extension characters. Dollar has also been used as an identifier character, e.g. in VAX/VMS system function names.

2.5 Preprocessing tokens [lex.pptoken]

The grammar is as follows.

preprocessing-token:
    header-name
    identifier
    pp-number
    character-literal
    user-defined-character-literal
    string-literal
    user-defined-string-literal
    preprocessing-op-or-punc
    each non-white-space character that cannot be one of the above

Paragraph two is of special note.

A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6. The categories of preprocessing token are: header names, identifiers, preprocessing numbers, character literals (including user-defined character literals), string literals (including user-defined string literals), preprocessing operators and punctuators, and single non-white-space characters that do not lexically match the other preprocessing token categories. If a ' or a " character matches the last category, the behavior is undefined. Preprocessing tokens can be separated by white space; this consists of comments (2.8), or white-space characters (space, horizontal tab, new-line, vertical tab, and form-feed), or both. As described in Clause 16, in certain circumstances during translation phase 4, white space (or the absence thereof) serves as more than preprocessing token separation. White space can appear within a preprocessing token only as part of a header name or between the quotation characters in a character literal or string literal.

The implication here is that no valid C++ program should have an isolated single or double quote character. Unfortunately, that information is less useful than it might appear because an isolated single quote could be in use to signal an extension language interpretation.

2.10 Preprocessing numbers [lex.ppnumber]

The grammar is as follows.

pp-number:
    digit
    . digit
    pp-number digit
    pp-number nondigit
    pp-number e sign
    pp-number E sign
    pp-number .

We would like numeric literals to fit within this syntax, as that would require the least change to existing tools, e.g., editor syntax highlighting and mouse word grabbing.
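As a rough illustration of what "fitting within this syntax" means, the pp-number grammar above can be transcribed into a small recognizer. This is only a sketch: the namespace `ppnum` and the function names are hypothetical, the character tests assume an ASCII-compatible execution character set, and universal-character-names are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace ppnum {  // hypothetical namespace, for illustration only

inline bool is_digit(char c) { return c >= '0' && c <= '9'; }

// nondigit as defined in 2.11: a letter or an underscore (ASCII assumed).
inline bool is_nondigit(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

// Returns true when the whole string matches the pp-number grammar of 2.10.
inline bool is_pp_number(const std::string& s) {
    std::size_t i = 0;
    // Base cases: "digit" or ". digit".
    if (i < s.size() && s[i] == '.') ++i;
    if (i >= s.size() || !is_digit(s[i])) return false;
    ++i;
    // Recursive cases: each step appends one permitted continuation.
    while (i < s.size()) {
        const char c = s[i];
        const bool has_next = i + 1 < s.size();
        if ((c == 'e' || c == 'E') && has_next &&
            (s[i + 1] == '+' || s[i + 1] == '-')) {
            i += 2;  // "pp-number e sign" or "pp-number E sign"
        } else if (is_digit(c) || is_nondigit(c) || c == '.') {
            ++i;     // "pp-number digit", "pp-number nondigit", "pp-number ."
        } else {
            return false;  // e.g. a quote, grave accent, space, or backslash
        }
    }
    return true;
}

}  // namespace ppnum
```

Under this reading, `ppnum::is_pp_number("1_000")` and `ppnum::is_pp_number("12..km")` hold, whereas strings containing a single quote, a grave accent, a space, or a backslash are rejected by the existing grammar.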

2.11 Identifiers [lex.name]

The grammar is as follows.

nondigit: one of
    a b c d e f g h i j k l m
    n o p q r s t u v w x y z
    A B C D E F G H I J K L M
    N O P Q R S T U V W X Y Z _
digit: one of
    0 1 2 3 4 5 6 7 8 9

The implication in this grammar is that ignored code must still be made up of valid tokens.

2.14.2 Integer literals [lex.icon]

The grammar is as follows.

integer-literal:
    decimal-literal integer-suffix_opt
    octal-literal integer-suffix_opt
    hexadecimal-literal integer-suffix_opt
decimal-literal:
    nonzero-digit
    decimal-literal digit
octal-literal:
    0
    octal-literal octal-digit
hexadecimal-literal:
    0x hexadecimal-digit
    0X hexadecimal-digit
    hexadecimal-literal hexadecimal-digit
nonzero-digit: one of
    1 2 3 4 5 6 7 8 9
octal-digit: one of
    0 1 2 3 4 5 6 7
hexadecimal-digit: one of
    0 1 2 3 4 5 6 7 8 9
    a b c d e f
    A B C D E F

This syntax is entirely contained within the pp-number syntax.

2.14.4 Floating literals [lex.fcon]

The grammar is as follows.

floating-literal:
    fractional-constant exponent-part_opt floating-suffix_opt
    digit-sequence exponent-part floating-suffix_opt
fractional-constant:
    digit-sequence_opt . digit-sequence
    digit-sequence .
exponent-part:
    e sign_opt digit-sequence
    E sign_opt digit-sequence
sign: one of
    + -
digit-sequence:
    digit
    digit-sequence digit

This syntax is entirely contained within the pp-number syntax.

2.14.8 User-defined literals [lex.ext]

The grammar is as follows.

user-defined-literal:
    user-defined-integer-literal
    user-defined-floating-literal
    user-defined-string-literal
    user-defined-character-literal
user-defined-integer-literal:
    decimal-literal ud-suffix
    octal-literal ud-suffix
    hexadecimal-literal ud-suffix
user-defined-floating-literal:
    fractional-constant exponent-part_opt ud-suffix
    digit-sequence exponent-part ud-suffix
user-defined-string-literal:
    string-literal ud-suffix
user-defined-character-literal:
    character-literal ud-suffix
ud-suffix:
    identifier
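For readers less familiar with the feature, the four kinds of user-defined literal in this grammar can be illustrated as follows. The suffixes `_km`, `_len`, and `_code`, the namespace `udl_demo`, and the metres convention are all hypothetical and chosen only for this sketch.

```cpp
#include <cassert>
#include <cstddef>

namespace udl_demo {  // hypothetical namespace, for illustration only

// user-defined-integer-literal: the cooked form receives unsigned long long.
constexpr unsigned long long operator""_km(unsigned long long kilometres) {
    return kilometres * 1000ULL;  // assumed convention: result in metres
}

// user-defined-floating-literal: the cooked form receives long double.
constexpr long double operator""_km(long double kilometres) {
    return kilometres * 1000.0L;
}

// user-defined-string-literal: receives the characters and their count.
constexpr std::size_t operator""_len(const char*, std::size_t length) {
    return length;
}

// user-defined-character-literal: receives the character value.
constexpr int operator""_code(char c) {
    return static_cast<int>(c);
}

static_assert(123_km == 123000ULL, "decimal-literal ud-suffix");
static_assert(0x10_km == 16000ULL, "hexadecimal-literal ud-suffix");
static_assert(1.5_km == 1500.0L, "fractional-constant ud-suffix");
static_assert("abc"_len == 3, "string-literal ud-suffix");
static_assert('A'_code == static_cast<int>('A'), "character-literal ud-suffix");

}  // namespace udl_demo
```

Note that in each numeric case the suffix begins immediately after the last digit, which is why any separator character that is also valid in an identifier interacts with this grammar.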

16 Preprocessing directives [cpp]

The grammar is as follows.

text-line:
    pp-tokens_opt new-line
pp-tokens:
    preprocessing-token
    pp-tokens preprocessing-token

The implication here is that #if-ignored program source must still be made up of valid preprocessor tokens, not arbitrary text. Many preprocessors will skip arbitrary text, though.

Approaches

There are several approaches to the solution. We evaluate them in turn.

Remove User-Defined Literals

At least Daveed Vandevoorde and N.M. Maclaren have suggested removing user-defined literals. However, removing a feature that we just introduced could be difficult.

Typographic

There are three primary typographic conventions for digit separators: a comma, a baseline dot, and a (thin) space.

C++ already uses the comma for an operator, and using it for a digit separator would introduce ambiguities in expressions such as ++a-3,4-b++, or even more simply, f(12,345).

C++ already uses the base-line dot as a radix point, and so it is essentially not usable as a digit separator.
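The existing meanings of the comma and the dot can be checked directly. The following sketch uses hypothetical helper names and only shows how present-day C++ already interprets `f(12,345)`, a parenthesized `12,345`, and `1.000`.

```cpp
#include <cassert>

// Two overloads make it observable how many arguments a call site passes.
inline int argument_count(int) { return 1; }
inline int argument_count(int, int) { return 2; }

// Inside parentheses the comma is the comma operator: 12 is discarded
// (the cast to void only silences unused-value warnings) and 345 is the result.
inline int comma_operator_result() { return (static_cast<void>(12), 345); }

// The dot is the radix point, so 1.000 is the floating value one,
// not the integer one thousand.
inline bool dot_is_radix_point() { return 1.000 == 1.0; }
```

Here `argument_count(12,345)` selects the two-argument overload, so reading `12,345` as a single number would silently change the meaning of existing calls.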

Bjarne Stroustrup has suggested using a space as a separator.

While this approach is consistent with one common typographic style, it suffers from some compatibility problems.

Grave Accent

Ville Voutilainen, among others, suggests using a grave accent (`) (back tick) as a digit separator.

This character is not part of the C++ basic source character set. The proposal has the advantage that introducing it for this purpose cannot yield any ambiguity with existing C++ code. There are two disadvantages. First, using this character in the language invalidates any meta-languages using this character to distinguish between the C++ base layer and any meta information. Second, existing preprocessors would not recognize the grave accent as part of a preprocessor number, and may thus yield incorrect results.

Single Quote

Daveed Vandevoorde suggests using a single quote [N2747]. The single quote can be thought of as an "upper comma".

There are two problems with this approach. First, an odd number of single quotes would result in a line of text that does not meet the preprocessor syntax for a token. While most preprocessors do not tokenize lines that are ignored in #if/#else, some preprocessors are known to emit errors for such cases. Second, existing preprocessors would not recognize the single quote as part of a preprocessor number, and may thus yield incorrect results.

Daveed Vandevoorde explains the incompatibility in more detail.

For example:

#if defined(__cplusplus)
double pie = 3.141'593;
#endif

In C, the preprocessor-tokens that are #if'ed out are (not including the double quotes) "double", "pie", "=", "3.141", "'", "593", and ";".

However, single and double quotes that aren't part of a larger preprocessor-token are deemed undefined behavior (C99, 6.4/3).

Typical C compilers (GCC, clang, EDG, and MSVC for example) have no problem with it (presumably they don't try to tokenize #if'ed-out lines), but James Dennett mentioned at least one older C compiler didn't like it.

Pete Becker points out that many tools, such as syntax highlighting in editors, rely on quotes being paired. The adaptability of the tools to new expressions is an open issue.
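The quote-pairing concern can be made concrete with a deliberately naive scanner of the kind a simple tool might use. This is a sketch under stated assumptions: the function name is hypothetical, and the scanner ignores string literals and comments, which a real tool would have to handle.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true when some single quote on the line has no closing partner,
// treating every quote as the start of a character literal.
inline bool has_unpaired_single_quote(const std::string& line) {
    std::size_t i = 0;
    while (i < line.size()) {
        if (line[i] != '\'') {
            ++i;
            continue;
        }
        // Opening quote found: search for the closing quote on the same line.
        std::size_t j = i + 1;
        while (j < line.size() && line[j] != '\'') {
            if (line[j] == '\\' && j + 1 < line.size()) {
                ++j;  // skip the escaped character, e.g. the quote in '\''
            }
            ++j;
        }
        if (j >= line.size()) return true;  // ran off the line: unpaired
        i = j + 1;                          // continue after the closing quote
    }
    return false;
}
```

With this scanner, `double pie = 3.141'593;` is reported as containing an unpaired quote. A literal with an even number of separators, such as `1'000'000`, is not reported, but only because `'000'` is mistaken for a character literal, which is the kind of misreading that affects syntax highlighters.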

N.M. Maclaren suggests that single quote will lead to very bad error messages with some macro-based libraries.

Underscore

The Ada programming language uses an underscore (technically, a low line) for the digit separator [AdaLRMnumlit] [AdaRDnumlit]. This approach seems to be used in VHDL and Verilog, also possibly in Algol68. (VHDL also appears to have literal suffixes.) This approach has been proposed more than once for C++, going at least as far back as 1993 [N0259].

In all known cases, the primary proposal has been to permit only a single underscore between digits [N0259] [N2281] [N3342]. However, [N0259] presents an option to permit underscores between the digit sequence and any prefix or suffix.

Underscores work well as a digit separator for C++03 [N0259] [N2281]. But with C++11, there exists a potential ambiguity with user-defined literals [N2747]. While the likely resolution will be some form of "max munch" rule, some mechanism must be present to disambiguate when max munch is too much. We use the term suffix separator to indicate this mechanism.
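The ambiguity can be observed in C++11 as it stands, because an underscore followed by digits or letters is already a valid ud-suffix. In the sketch below the suffixes `_000` and `_beef`, the namespace, and the helper function are hypothetical and exist only to show that such spellings already have a meaning.

```cpp
#include <cassert>

namespace separator_clash {  // hypothetical namespace, for illustration only

// A (contrived) suffix consisting of an underscore and three zeros.
constexpr unsigned long long operator""_000(unsigned long long thousands) {
    return thousands * 1000ULL;
}

// A suffix whose spelling happens to consist of hexadecimal digits.
constexpr unsigned long long operator""_beef(unsigned long long value) {
    return value;
}

// Today 1_000 is the literal 1 with the suffix _000 ...
static_assert(1_000 == 1000ULL, "1_000 calls operator\"\"_000(1)");
// ... and 0xdead_beef is the literal 0xdead with the suffix _beef.
static_assert(0xdead_beef == 0xdeadULL, "0xdead_beef calls operator\"\"_beef(0xdead)");

inline unsigned long long one_underscore_000() { return 1_000; }

}  // namespace separator_clash
```

If the underscore became a digit separator, `0xdead_beef` would most naturally be read as 0xdeadbeef instead, so a program like this one would change meaning unless a disambiguation rule were provided.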

Double Underscore

[N2747] suggests a double underscore as a suffix separator.

Mike Miller provides more detail.

... one possibility that occurs to me would be to allow a trailing underscore in an integer literal. The ambiguity with user-defined literals would be resolved in favor of the plain integer literal; a user could disambiguate a user-defined literal by ending the integer part with a trailing underscore. (Double underscores would not be permitted in an integer literal.) Thus:

1_ => 1
1_2 => 12
1__2 => value 1 passed to operator "" _2
0xdead_bee_f => 0xdeadbeef
0xdead_bee__f => value 0xdeadbee passed to operator "" _f

The ambiguity with this approach arises when the suffix begins with one or more underscores.
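Miller's five examples can be reproduced with a short splitting routine. This is one possible reading of the quoted rule, not a specification: the type and function names are hypothetical, only decimal and `0x` literals are handled, and the treatment of an underscore followed by a non-digit (left to begin the suffix) is an assumption made here for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

struct SplitLiteral {
    unsigned long long value;  // value of the integer part, separators ignored
    std::string suffix;        // what remains, e.g. "_2", or "" if nothing
};

inline SplitLiteral split_with_trailing_underscore_rule(const std::string& text) {
    std::size_t i = 0;
    int base = 10;
    if (text.size() >= 2 && text[0] == '0' && (text[1] == 'x' || text[1] == 'X')) {
        base = 16;
        i = 2;
    }
    auto is_radix_digit = [base](char c) {
        const unsigned char u = static_cast<unsigned char>(c);
        return base == 16 ? std::isxdigit(u) != 0 : std::isdigit(u) != 0;
    };
    std::string digits;
    while (i < text.size()) {
        if (is_radix_digit(text[i])) {
            digits += text[i++];
        } else if (text[i] == '_' && !digits.empty()) {
            const bool at_end = i + 1 >= text.size();
            if (!at_end && is_radix_digit(text[i + 1])) {
                ++i;    // single underscore between digits: a separator
            } else if (at_end || text[i + 1] == '_') {
                ++i;    // trailing underscore: ends the integer part
                break;
            } else {
                break;  // assumption: this underscore begins the suffix
            }
        } else {
            break;
        }
    }
    SplitLiteral result;
    result.value = digits.empty() ? 0ULL : std::stoull(digits, nullptr, base);
    result.suffix = text.substr(i);
    return result;
}
```

For instance, `split_with_trailing_underscore_rule("0xdead_bee__f")` yields the value 0xdeadbee with the suffix `_f`, matching the last example in the quotation.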

John Spicer suggests something slightly different.

At some point I had suggested using underscore and having a special lookup rule so that something like 0xabc_de would look for the "de" user-defined literal operator, and if not found, would treat the "de" as part of the hex literal. If you wanted to force the use of the operator, you could write 0xabc__de. If you wanted to force the use of a _de operator, you would have to write 0xabc___de.

Another alternative would be to look for the "de" form and then the "_de" form if the first was not found. That way would only require the use of three underscores in cases where you had both a "de" and "_de" operator and wanted to force use of the second.

Scope Operator

[N2747] suggests the scope operator (::) as a potential suffix separator. The scope operator would be a pure syntactic extension, as it could not otherwise follow a literal. However, it would make substrings of a literal separately subject to preprocessor symbol substitution.

Non-Digit Literal Suffix

[N3342] suggests disallowing a leading underscore followed by a digit as a user-defined literal suffix. The intent was to make a suffix separator unnecessary. However, [N3448] points out that [N3342] fails to disambiguate hexadecimal digits, particularly in the example 0xdead_beef_db, where db could be either decibel or the hexadecimal digits d and b.
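The gap can be sketched with two small predicates. The function names are hypothetical, and the first predicate encodes only the rule as summarized in this paper (no decimal digit directly after the leading underscore), not the full wording of [N3342].

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Rule as summarized above: a suffix may not be "_" followed by a decimal digit.
inline bool allowed_as_suffix_under_rule(const std::string& suffix) {
    return suffix.size() >= 2 && suffix[0] == '_' &&
           std::isdigit(static_cast<unsigned char>(suffix[1])) == 0;
}

// After hexadecimal digits, "_db" could also be a separator followed by
// the hexadecimal digits d and b.
inline bool readable_as_hex_digits(const std::string& suffix) {
    if (suffix.size() < 2 || suffix[0] != '_') return false;
    for (std::size_t i = 1; i < suffix.size(); ++i) {
        if (std::isxdigit(static_cast<unsigned char>(suffix[i])) == 0) return false;
    }
    return true;
}

// A suffix that passes the rule yet reads as hexadecimal digits is ambiguous
// when it follows a hexadecimal literal.
inline bool ambiguous_after_hex_literal(const std::string& suffix) {
    return allowed_as_suffix_under_rule(suffix) && readable_as_hex_digits(suffix);
}
```

Under these assumptions `_db` is ambiguous after a hexadecimal literal, `_km` is not, and `_2` is excluded by the rule in the first place.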

One could simply not allow user-defined literals with hexadecimal literals. However, this restriction is not desirable.

Spacing

Discussions in the October 2012 standards meeting settled on using whitespace as the suffix separator. Unfortunately, that approach causes parsing problems for Objective C/C++.

Richard Smith explains.

An Objective-C message send works like this:

message-expression:
    [ expression message-selector ]
message-selector:
    identifier
    keyword-arguments
keyword-arguments:
    identifier_opt : expression keyword-arguments_opt

In particular, this is a valid Objective-C message send:

[self setValue: 0xff units: "cm"]

Hence any proposal which folds a pp-number followed by an identifier into a single literal will break a significant quantity of Objective-C code.

Doug Gregor elaborates.

There are two issues with allowing spaces between a literal and its suffix for Objective-C. One is a true ambiguity and one is a problem for error recovery.

The true ambiguity occurs because one can omit a parameter name from the method declaration, in which case there is no identifier before the ':' in the call. For example, one could have a message send that looks like this:

[a method:10 :11]

which calls the method "method::". Now, consider

[a method:10 _suffix:11]

Currently, this parses (unambiguously) as a message send to "method:_suffix:", i.e., it's parsed as

[a method:(10) _suffix:11]
// _suffix is the name of the second argument; calls method:_suffix:

However, if we allow a space between a literal and its suffix, there is a second potential parse:

[a method:(10_suffix) :11]
// _suffix is a suffix to the literal 10; calls method::

which is completely ambiguous.

The error-recovery issue is that Objective-C(++) parsers tend to rely heavily on the fact that an expression in C/C++ cannot be immediately followed by an identifier. If we see an expression followed by an identifier in an expression context, it's fairly likely that this is a message send for which the '[' has been dropped. For example, Clang detects these cases and automatically inserts the '[' for the user; this was one of the top error-recovery requests, and a regression here would be considered a major problem for our users.

Double Radix Point

Jeremiah Willcock suggests using ".." as the suffix separator. This notation is already permitted by the pp-number syntax. It is also not presently permitted by any numeric literal. Its primary disadvantage seems to be that it is unfamiliar.

Backslash

Clark Nelson suggests using "\" as the suffix separator. This notation is not permitted by the pp-number syntax. It is also not presently permitted by any numeric literal.

Proposal

In this section we present likely wording edits, parameterized by the possible choices.

2.10 Preprocessing numbers [lex.ppnumber]

Edit the grammar as follows. Note that the additional rule for pp-number may not be necessary, depending on the specific chosen format.

digit-separator:
    to be determined
pp-number:
    digit
    . digit
    pp-number digit
    pp-number nondigit
    pp-number e sign
    pp-number E sign
    pp-number .
    pp-number digit-separator

2.14.2 Integer literals [lex.icon]

Edit the grammar as follows.

integer-literal:
    decimal-literal integer-suffix_opt
    octal-literal integer-suffix_opt
    hexadecimal-literal integer-suffix_opt
decimal-literal:
    nonzero-digit
    decimal-literal digit-separator_opt digit
octal-literal:
    0
    octal-literal digit-separator_opt octal-digit
hexadecimal-literal:
    0x hexadecimal-digit
    0X hexadecimal-digit
    hexadecimal-literal digit-separator_opt hexadecimal-digit
nonzero-digit: one of
    1 2 3 4 5 6 7 8 9
octal-digit: one of
    0 1 2 3 4 5 6 7
hexadecimal-digit: one of
    0 1 2 3 4 5 6 7 8 9
    a b c d e f
    A B C D E F

Edit paragraph 1 as follows. Note that each ? will be replaced by the actual chosen digit separator character(s).

An integer literal is a sequence of digits that has no period or exponent part, with optional digit separators. These separators are ignored when determining its value. .... [Example: The number twelve can be written 12, 014, or 0XC. The literals 1048576, 1?048?576, 0X100000, 0x10?0000, and 0?004?000?000 all have the same value. —end example]
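The claim that the literals in the example share one value can be checked with a small run-time model. This is only a sketch: the function name is hypothetical, the separator is passed as a parameter because the actual character is still to be determined, and no attempt is made to validate where separators appear.

```cpp
#include <cassert>
#include <string>

// Drops every occurrence of the separator, then lets std::stoull with base 0
// detect the radix from the prefix (0x or 0X: hexadecimal, leading 0: octal).
inline unsigned long long parse_separated_integer(const std::string& text,
                                                  char separator) {
    std::string digits;
    for (char c : text) {
        if (c != separator) digits += c;  // separators do not affect the value
    }
    return std::stoull(digits, nullptr, 0);
}
```

For example, `parse_separated_integer("0?004?000?000", '?')` and `parse_separated_integer("0x10?0000", '?')` both produce 1048576.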

2.14.4 Floating literals [lex.fcon]

Edit the grammar as follows.

floating-literal:
    fractional-constant exponent-part_opt floating-suffix_opt
    digit-sequence exponent-part floating-suffix_opt
fractional-constant:
    digit-sequence_opt . digit-sequence
    digit-sequence .
exponent-part:
    e sign_opt digit-sequence
    E sign_opt digit-sequence
sign: one of
    + -
digit-sequence:
    digit
    digit-sequence digit-separator_opt digit

Edit within paragraph 1 as follows. Note that each ? will be replaced by the actual chosen digit separator character(s).

.... The integer and fraction parts both consist of a sequence of decimal (base ten) digits, with optional digit separators. These separators are ignored when determining its value. [Example: The literals 1.602?176?565e-19 and 1.602176565e-19 have the same value. —end example] ....
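The digit-sequence rule above admits a separator only between two digits. The following sketch models that constraint and the "ignored when determining its value" wording at run time; the function names are hypothetical and the separator character is a parameter because it has not yet been chosen.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// True when every separator has a decimal digit on both sides, as required by
// "digit-sequence digit-separator_opt digit".
inline bool separators_between_digits(const std::string& text, char separator) {
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] != separator) continue;
        if (i == 0 || i + 1 >= text.size()) return false;
        if (std::isdigit(static_cast<unsigned char>(text[i - 1])) == 0) return false;
        if (std::isdigit(static_cast<unsigned char>(text[i + 1])) == 0) return false;
    }
    return true;
}

// Rejects misplaced separators, then converts the remaining characters.
inline double parse_separated_floating(const std::string& text, char separator) {
    if (!separators_between_digits(text, separator)) {
        throw std::invalid_argument("separator must appear between two digits");
    }
    std::string stripped;
    for (char c : text) {
        if (c != separator) stripped += c;
    }
    return std::stod(stripped);
}
```

With these helpers, `1.602?176?565e-19` converts to the same value as `1.602176565e-19`, while a spelling such as `1.?602e-19` is rejected.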

2.14.8 User-defined literals [lex.ext]

Edit the grammar as follows.

user-defined-literal:
    user-defined-integer-literal
    user-defined-floating-literal
    user-defined-string-literal
    user-defined-character-literal
user-defined-integer-literal:
    decimal-literal separated-suffix
    octal-literal separated-suffix
    hexadecimal-literal separated-suffix
user-defined-floating-literal:
    fractional-constant exponent-part_opt separated-suffix
    digit-sequence exponent-part separated-suffix
user-defined-string-literal:
    string-literal ud-suffix
user-defined-character-literal:
    character-literal ud-suffix
separated-suffix:
    literal-separator_opt ud-suffix
literal-separator:
    to be determined
ud-suffix:
    identifier

Edit paragraph 1 as follows. Note that each ? will be replaced by the actual chosen digit separator character(s) and each ?? will be replaced by the actual chosen literal separator character(s).

If a token matches both user-defined-literal and another literal kind, it is treated as the latter. [Example: 123_km and 123??km are user-defined-literals, but 123?456 and 12LL are integer-literals. —end example] ....

References

[N0259]
A proposal to allow Binary Literals, and some other small changes to Chapter 2: Lexical Conventions, John Max Skaller, ISO/IEC JTC1 SC22 WG21 N0259, 1993-03-26
[N2281]
Digit Separators, Lawrence Crowl, ISO/IEC JTC1 SC22 WG21 N2281, 2007-05-02
[N2747]
Ambiguity and Insecurity with User-Defined Literals, Lawrence Crowl, ISO/IEC JTC1 SC22 WG21 N2747, 2008-08-24
[N2765]
User-defined Literals (aka. Extensible Literals (revision 5)), Ian McIntosh, Michael Wong, Raymond Mak, Robert Klarer, Jens Maurer, Alisdair Meredith, Bjarne Stroustrup, David Vandevoorde, ISO/IEC JTC1 SC22 WG21 N2765, 2008-09-18
[N3250]
US-18: Removing User-Defined Literals, Douglas Gregor, ISO/IEC JTC1 SC22 WG21 N3250, 2011-02-28
[N3402]
User-defined Literals for Standard Library Types, Peter Sommerlad, ISO/IEC JTC1 SC22 WG21 N3402, 2012-09-07
[N3342]
Digit Separators coming back, Jens Maurer, ISO/IEC JTC1 SC22 WG21 N3342, 2012-01-09
[N3448]
Painless Digit Separation, Daveed Vandevoorde, ISO/IEC JTC1 SC22 WG21 N3448, 2012-09-21
[N3472]
Binary Literals in the C++ Core Language, James Dennett, ISO/IEC JTC1 SC22 WG21 N3472, 2012-10-19
[AdaLRMnumlit]
Ada '83 Language Reference Manual, Section 2.4 Numeric Literals, http://archive.adaic.com/standards/83lrm/html/lrm-02-04.html#2.4
[AdaRDnumlit]
Rationale for the Design of the Ada Programming Language, Section 2.1 Lexical Structure, http://archive.adaic.com/standards/83rat/html/ratl-02-01.html#2.1