compiler (syntax)
Lex binary and octal literals as if they were decimal.
Lexing all digits (even ones not valid in the given base) allows for improved error messages & future proofing (this is more conservative than the current approach) and less confusion, with little downside.
Currently, the lexer stops lexing binary and octal literals (0b10
and
0o12345670
) as soon as it sees an invalid digit (2-9 or 8-9
respectively), and immediately starts lexing a new token,
e.g. 0b0123
is two tokens, 0b01
and 23
. Writing such a thing in
normal code gives a strange error message:
<anon>:2:9: 2:11 error: expected one of `.`, `;`, `}`, or an operator, found `23`
<anon>:2 0b0123
^~
However, it is valid to write such a thing in a macro (e.g. using the
tt
non-terminal), and thus lexing the adjacent digits as two tokens
can lead to unexpected behaviour.
macro_rules! expr { ($e: expr) => { $e } }
macro_rules! add {
($($token: tt)*) => {
0 $(+ expr!($token))*
}
}
fn main() {
println!("{}", add!(0b0123));
}
prints 24
(add
expands to 0 + 0b01 + 23
).
It would be nicer for both cases to print an error like:
error: found invalid digit `2` in binary literal
0b0123
^
(The non-macro case could be handled by detecting this pattern in the lexer and special casing the message, but this doesn't not handle the macro case.)
Code that wants two tokens can opt in to it by 0b01 23
, for
example. This is easy to write, and expresses the intent more clearly
anyway.
The grammar that the lexer uses becomes
(0b[0-9]+ | 0o[0-9]+ | [0-9]+ | 0x[0-9a-fA-F]+) suffix
instead of just [01]
and [0-7]
for the first two, respectively.
However, it is always an error (in the lexer) to have invalid digits
in a numeric literal beginning with 0b
or 0o
. In particular, even
a macro invocation like
macro_rules! ignore { ($($_t: tt)*) => { {} } }
ignore!(0b0123)
is an error even though it doesn't use the tokens.
This adds a slightly peculiar special case, that is somewhat unique to
Rust. On the other hand, most languages do not expose the lexical
grammar so directly, and so have more freedom in this respect. That
is, in many languages it is indistinguishable if 0b1234
is one or
two tokens: it is always an error either way.
Don't do it, obviously.
Consider 0b123
to just be 0b1
with a suffix of 23
, and this is
an error or not depending if a suffix of 23
is valid. Handling this
uniformly would require "foo"123
and 'a'123
also being lexed as a
single token. (Which may be a good idea anyway.)
None.