compiler (linkage)
This RFC proposes a new mangling scheme that describes what the symbol names generated by the Rust compiler look like. This new scheme has a number of advantages over the existing one which has grown over time without a clear direction. The new scheme is consistent, depends less on compiler internals, and the information it stores in symbol names can be decoded again which provides an improved experience for users of external tools that work with Rust symbol names.
Note that, at this point, the new mangling scheme would not be part of the language specification or the specification of a stable Rust ABI. In the future it could be part of both and it is designed to be stable and extensible; but for the time being it would still be an implementation detail of the Rust compiler.
Due to its ad-hoc nature, the compiler's current name mangling scheme has a number of drawbacks:
Information about generic parameters and other things is lost in the mangling process. One cannot extract the type arguments of a monomorphized function from its symbol name.
The current scheme is inconsistent: most paths use Itanium ABI style encoding, but some don't.
The symbol names it generates can contain .
characters which is
not generally supported on all platforms. [1]
[2] [3]
It depends on compiler internals and its results cannot be replicated by another compiler implementation or external tool.
The proposed scheme solves these problems:
A-Z
, a-z
,
0-9
, and _
.These properties should make it easier for third party tools to work with Rust binaries.
The following section will lay out the requirements for a name mangling scheme and then introduce the actual scheme through a series of ever more complex examples.
A symbol mangling scheme has a few goals, one of them essential, the rest of them desirable. The essential one is:
"Unambiguous" means that no two distinct compiler-generated entities (that is, mostly object code for functions) must be mapped to the same symbol name. This disambiguation is the main purpose of the hash-suffix in the current, legacy mangling scheme. The scheme proposed here, on the other hand, achieves it in a way that allows to also satisfy a number of additional desirable properties of a mangling scheme:
A mangled symbol should be decodable to some degree. That is, it is desirable to be able to tell which exact concrete instance of e.g. a polymorphic function a given symbol identifies. This is true for external tools, backtraces, or just people only having the binary representation of some piece of code available to them. With the current scheme, this kind of information gets lost in the magical hash-suffix.
A mangling scheme should be platform-independent. This is mainly
achieved by restricting the character set to A-Z
, a-z
, 0-9
,
_
. All other characters might have special meaning in some
context (e.g. .
for MSVC DEF
files) or are simply not
supported (e.g. Unicode).
The scheme should be efficient, meaning that the symbols it produces are not unnecessarily long (because that takes up space in object files and means more work for the compiler and the linker). In addition, generating or demangling a symbol name should not be too computationally expensive.
When used as part of a stable ABI, it should be possible to predict
the symbol name for a given source-level construct. For example,
given the definition fn foo<T>() { ... }
, the scheme should allow
to construct, by hand, the symbol names for e.g. foo<u32>
or
foo<extern fn(i32, &mut SomeStruct<(char, &str)>, ...) -> !>()
.
Since the current scheme generates its hash from the values of
various compiler internal data structures, an alternative compiler
implementation could not predict the symbol name, even for
simple cases. Note that the scheme proposed here does not fulfill
this requirement either (yet) as some things are still left to
the compiler implementation.
The RFC also has a couple of non-goals:
The mangling scheme does not try to be compatible with an existing (e.g. C++) mangling scheme. While it might sound tempting to encode Rust symbols with an existing scheme, it is the author's opinion that the actual benefits are small (C++ tools would not demangle to Rust syntax, demanglings would be hard to read) and at the same time supporting a Rust-specific scheme in existing tools seems quite feasible (many tools like GDB, LLDB, binutils, and valgrind already have specialized code paths for Rust symbols).
The RFC does not try to define a standardized demangled form for symbol names. It defines the mangled form and makes sure it can be demangled in an efficient manner but different demanglers still have some degree of freedom regarding how symbol names are presented to the user.
This section will develop an overview of the mangling scheme by walking through a number of examples. We'll start with the simplest case -- and will see how that already involves things that might be surprising.
A free-standing function is fully identified via its absolute path. For example, the following function
mod foo {
fn bar() {}
}
has the path foo::bar
and NN3foo3bar
is a possible mangling of that path
that complies to the character set we are restricted to. Why this format with
numbers embedded in it? It is a run-length encoding, similar to what the
Itanium C++ ABI name mangling scheme uses for
identifiers. The scheme proposed here will also use this
format because it does not need termination tokens for identifiers
(which are hard to come by with our limited character set).
Note that each component in the path (i.e. foo
and bar
) also has an
accompanying start-tag (here N
) at the beginning. This start-tag is
needed in order for the syntax to be able to represent complex, nested
structures as we will see later.
The symbol name above, unfortunately, does not unambiguously identify the
function in every context. It is perfectly valid for another crate
to also define mod foo { fn bar() {} }
somewhere. So in order to
avoid conflicts in such cases, the absolute path must always include
the crate-id, as in NNC7mycrate3foo3bar
. The crate-id has a C
start-tag.
There is another possible ambiguity that we have to take care of.
Rust has two distinct namespaces: the type and the value namespace.
This leads to a path of the form crate_id::foo::bar
not uniquely
identifying the item bar
because the following snippet is legal
Rust code:
fn foo() {
fn bar() {}
}
mod foo {
fn bar() {}
}
The function foo
lives in the value namespace while the module foo
lives in the type namespace. They don't interfere. In order to make the
symbol names for the two distinct bar
functions unique, we thus add a
namespace identifier to the start-tag of components where necessary, as in
NvNvC7mycrate3foo3bar
for the first case and NvNtC7mycrate3foo3bar
second case (notice the difference: NvNv...
vs NvNt...
).
There is one final case of name ambiguity that we have to take care of.
Because of macro hygiene, multiple items with the same name can appear in
the same context. The compiler internally disambiguates such names by
augmenting them with a numeric index. For example, the first occurrence
of the name foo
within its parent is actually treated as foo'0
, the
second occurrence would be foo'1
, the next foo'2
, and so one. The
mangling scheme will adopt this setup by prepending a disambiguation
prefix to each identifier with a non-zero index. So if macro expansion
would result in the following code:
mod foo {
fn bar'0() {}
// The second `bar` function was introduced by macro expansion.
fn bar'1() {}
}
then we would encode the two functions symbols as NvNtC7mycrate3foo3bar
and NvNtC7mycrate3foos_3bar
respectively (note the s_
prefix in the
second case). A very similar disambiguation is needed for avoiding
conflicts between crates of the same name but different versions. The
same syntactic prefix is thus used for crate-id where we encode the
crate disambiguator as in NtNvCs1234_7mycrate3foo3bar
. Details on
the shape of this prefix are provided in the reference-level description.
As opposed to C++ and other languages that support function overloading, we don't need to include function parameter types in the symbol name. Rust does not allow two functions of the same name but different arguments.
The final symbol name for the function would also include the prefix
_R
that is common to all symbol names generated by this scheme:
_RNvNtCs1234_7mycrate3foo3bar
<>^^^^^<----><------><--><-->
|||||| | | | |
|||||| | | | +--- "bar" identifier
|||||| | | +------- "foo" identifier
|||||| | +------------- "mycrate" identifier
|||||| +-------------------- disambiguator for "mycrate"
|||||+------------------------ start-tag for "mycrate"
||||+------------------------- namespace tag for "foo"
|||+-------------------------- start-tag for "foo"
||+--------------------------- namespace tag for "bar"
|+---------------------------- start-tag for "bar"
+----------------------------- common Rust symbol prefix
Each monomorphization of a generic function has its own symbol name.
The monomorphizations are disambiguated by the list of concrete generic
arguments. These arguments are added to the symbol name by a pair of I
start-tag at the beginning and a list of the actual arguments at the end.
So the instance
std::mem::align_of::<f64>
would be mangled to
_RINvNtC3std3mem8align_ofdE
^ ^^
| ||
| |+--- end of argument list
| +--- f64
+--- start-tag
where I
precedes the thing the arguments belong to, d
designates f64
and E
ends the argument list. As we can see, we need to be able to
represent all kinds of types that can be part of such an argument list.
(In the future, when const generics get added to the language, we will
also need to represent values) These kinds of types are:
char
, ()
, str
, !
, i8
, i16
, ...)mut
and const
[u8]
, [u8; 17]
)fn(&i32) -> u16
dyn
traitsBasic types are all encoded via a single lower-case letter, like in the Itanium scheme. Named types are encoded as their absolute path (including arguments) like is done for function symbols. Composites like references, tuples, and function types all follow a simple grammar given in the reference-level explanation below. Here are some example manglings to get a general feel of what they look like:
std::mem::align_of::<usize>
: _RINvNtC3std3mem8align_ofjE
std::mem::align_of::<&char>
: _RINvNtC3std3mem8align_ofRcE
std::mem::align_of::<std::mem::Discriminant>
:
_RINvNtC3std3mem8align_ofNtNtC3std3mem12DiscriminantE
std::mem::align_of::<&mut (&str,())>
: _RINvNtC3std3mem8align_ofQTReuEE
There's one more thing we have to take into account for generic functions:
The compiler may produce "crate-local" copies of a monomorphization.
That is, if there is a function foo<T>
which gets used as foo<u32>
in two different crates, the compiler (depending on the optimization level)
might generate two distinct functions at the LLVM IR level, each with it's
own symbol. In order to support this without running into conflicts, symbol
names for monomorphizations must include the id of the crate they are
instantiated for. This scheme does this by appending an <crate-id>
suffix
to the symbol. So for example the mangling for std::mem::align_of::<usize>
would actually look like this:
_RINvNtC3std3mem8align_ofjEC3foo (for crate "foo")
_RINvNtC3std3mem8align_ofjEC3bar (for crate "bar")
The scheme needs to be able to generate symbol names for the function containing the code of a closure and it needs to be able to refer to the type of a closure if it occurs as a type argument. As closures don't have a name, we need to generate one. The scheme proposes to use the namespace and disambiguation mechanisms already introduced above for this purpose. Closures get their own "namespace" (i.e. they are neither in the type nor the value namespace), and each closure has an empty name with a disambiguation index (like for macro hygiene) identifying them within their parent. The full name of a closure is then constructed like for any other named item:
mod foo {
fn bar(x: u32) {
let a = |x| { x + 1 }; // local name: NC<...>0
let b = |x| { x + 2 }; // local name: NC<...>s_0
a(b(x))
}
}
In the above example we have two closures, the one assigned to a
and the one assigned to b
. The first one would get the local name
NC<...>0
and the second one the name NC<...>s_0
. The 0
signifies
the length of their (empty) name. The <...>
part is the path of the
parent. The C
is the namespace tag, analogous to the v
tag for
the value namespace. The s_
for the second closure is the
disambiguation index (index 0
is, again, encoded by not prepending
a prefix). Their full names would then be NCNvNtC7mycrate3foo3bar0
and NCNvNtC7mycrate3foo3bars_0
respectively.
Methods are nested within impl
or trait
items. As such it would be
possible to construct their symbol names as paths like
my_crate::foo::{{impl}}::some_method
where {{impl}}
somehow identifies
the the impl
in question. Since impl
s don't have names, we'd have to
use an indexing scheme like the one used for closures (and indeed, this is
what the compiler does internally). Adding in generic arguments to, this
would lead to symbol names looking like
my_crate::foo::impl'17::<u32, char>::some_method
.
However, in the opinion of the author these symbols are very hard to map
back to the method they represent. Consider a module containing dozens of
types, each with multiple impl
blocks generated via #[derive(...)]
.
In order to find out which method a symbol maps to, one would have to count
the number of handwritten and macro generated impl
blocks in the module,
and hope that one correctly guessed the number of impl
blocks introduced
by the given derive-macro (each macro invocation can introduce 0..n
such
blocks). The name of the method might give a hint, but there are still
likely to be dozens of methods named clone
, hash
, eq
, et cetera.
The RFC therefore proposes to keep symbol names close to how methods are represented in error messages, that is:
<Foo<u32, char>>::some_method
for inherent methods, and<Foo<u32, char> as SomeTrait<Quux>>::some_method
for trait methods.This can be achieved by extending the definition of paths that we have
used so far. Instead of the path root always being a crate-id, we now
also allow a path to start with an impl
production that contains the
self-type and (for trait methods) the name of the trait being implemented.
Thus, this extended form of paths would have the following syntax:
<path> = C <identifier> // crate-id root
| M <type> // inherent impl root
| X <type> <path> // trait impl root
| N <namespace> <path> <identifier> // nested path
| I <path> {<generic-arg>} E // generic arguments
Here are some examples for complete symbol names:
<mycrate::Foo<u32>>::foo => _RNvMINtC7mycrate3FoomE3foo
<u32 as mycrate::Foo>::foo => _RNvXmNtC7mycrate3Foo3foo
<mycrate::Foo<u32> as mycrate::Bar<u64>>::foo => _RNvXINtC7mycrate3FoomEINtC7mycrate3BaryE3foo
In Rust one can define items within generic items, e.g. functions or impls, like in the following example:
struct Foo<T>(T);
impl<T> From<T> for Foo<T> {
fn from(x: T) -> Self {
static MSG: &str = "...";
panic!("{}", MSG)
}
}
The MSG
here (or any other such nested definition) does not inherit
the generic context from the impl
. MSG
is non-generic, and a
function defined in its place would be too. The fully qualified name
of MSG
, according to our examples so far, is thus
<mycrate::Foo<_> as std::convert::From<_>>::from::MSG
and its symbol name:
_RNvNvXINtC7mycrate3FoopEINtNtC3std7convert4FrompE4from3MSG
However, with trait specialization, this symbol can be ambiguous. Consider the following piece of code:
struct Foo<T>(T);
impl<T> From<T> for Foo<T> {
default fn from(x: T) -> Self {
static MSG: &str = "...";
panic!("{}", MSG)
}
}
impl<T: Default> From<T> for Foo<T> {
fn from(x: T) -> Self {
static MSG: &str = "123";
panic!("{}", MSG)
}
}
Notice that both MSG
statics have the path <Foo<_> as From<_>>::foo::MSG
.
We somehow have to disambiguate the impls
. We do so by adding the path of
the impl
to the symbol name.
<path> = C <identifier> // crate-id root
| M <impl-path> <type> // inherent impl root
| X <impl-path> <type> <path> // trait impl root
| N <namespace> <path> <identifier> // nested path
| I <path> {<generic-arg>} E // generic arguments
<impl-path> = [<disambiguator>] <path>
The two symbol names would then look something like:
_RNvNvXs2_C7mycrateINtC7mycrate3FoopEINtNtC3std7convert4FrompE4from3MSG
_RNvNvXs3_C7mycrateINtC7mycrate3FoopEINtNtC3std7convert4FrompE4from3MSG
<----------><----------------><----------------------->
impl-path self-type trait-name
Like other disambiguation information, this path would usually not actually be shown by demanglers.
Rust allows Unicode identifiers but our character set is restricted
to ASCII alphanumerics, and _
. In order to transcode the former to
the latter, we use the same approach as Swift, which is: encode all
non-ASCII identifiers via Punycode, a standardized and
efficient encoding that keeps encoded strings in a rather
human-readable format. So for example, the string
"Gödel, Escher, Bach"
is encoded as
"Gdel, Escher, Bach-d3b"
which, as opposed to something like Base64, still gives a pretty good idea of what the original string looked like.
Each component of a name, i.e. anything that starts with the number
of bytes to read in the examples above, is encoded individually.
Components encoded this way are augmented with a u
prefix so that
demanglers know that the identifier needs further decoding. As an
example, the function:
mod gödel {
mod escher {
fn bach() {}
}
}
would be mangled as:
_RNvNtNtC7mycrateu8gdel_5qa6escher4bach
<-------->
Unicode component
The length of symbol names has an influence on how much work the compiler,
linker, and loader have to perform. The shorter the names, the better.
At the same time, Rust's generics can lead to rather long names (which
are often not visible in the code because of type inference and
impl Trait
). For example, the return type of the following function:
fn quux(s: Vec<u32>) -> impl Iterator<Item = (u32, usize)> {
s.into_iter()
.map(|x| x + 1)
.filter(|&x| x > 10)
.zip(0..)
.chain(iter::once((0, 0)))
}
is
std::iter::Chain<
std::iter::Zip<
std::iter::Filter<
std::iter::Map<
std::vec::IntoIter<u32>,
[closure@src/main.rs:16:11: 16:18]>,
[closure@src/main.rs:17:14: 17:25]>,
std::ops::RangeFrom<usize>>,
std::iter::Once<(u32, usize)>>
It would make for a long symbol name if this type is used (maybe repeatedly) as a generic argument somewhere. C++ has the same problem with its templates; which is why the Itanium mangling introduces the concept of compression. If a component of a definition occurs more than once, it will not be repeated and instead be emitted as a substitution marker that allows to reconstruct which component it refers to. The scheme proposed here will use the same approach (but with a simpler definition).
The exact scheme will be described in detail in the reference level explanation below but it roughly works as follows: As a mangled symbol name is being built, we remember the position of every substitutable item in the output string, that is, we keep track of things a subsequent occurrence of which could be replaced by a back reference.
The things that are eligible for substitution are (1) all prefixes of paths (including the entire path itself), (2) all types except for basic types, and (3) type-level constants (array lengths and values passed to const generic params).
Here's an example in order to illustrate the concept. The name
std::iter::Chain<std::iter::Zip<std::vec::IntoIter<u32>, std::vec::IntoIter<u32>>>
is mangled to the following uncompressed string. The lines below show parts of the mangled string that already occurred before and can thus be replaced by a back reference. The number of at the beginning of each span given the 0-based byte position of where it occurred the first time.
0 10 20 30 40 50 60 70 80 90
_RINtNtC3std4iter5ChainINtNtC3std4iter3ZipINtNtC3std3vec8IntoItermEINtNtC3std3vec8IntoItermEEE
5---- 5---- 5----
3----------- 43---------
41--------------------
40-----------------------
The compiler is always supposed to use the longest replacement possible in order to achieve the best compression. The compressed symbol looks as follows:
_RINtNtC3std4iter5ChainINtB2_3ZipINtNtB4_3vec8IntoItermEBt_EE
^^^ ^^^ ^^^ back references
Back references have the form B<base-62-number>_
.
The reference-level explanation consists of three parts:
For implementing a demangler, only the first two sections are of interest, that is, a demangler only needs to understand syntax and compression of names, but it does not have to care about how the compiler generates mangled names.
The syntax of mangled names is given in extended Backus-Naur form:
<path>
)"_R"
),[<disambiguator>]
),{<type>}
)//
.Mangled names conform to the following grammar:
// The <decimal-number> specifies the encoding version.
<symbol-name> = "_R" [<decimal-number>] <path> [<instantiating-crate>]
<path> = "C" <identifier> // crate root
| "M" <impl-path> <type> // <T> (inherent impl)
| "X" <impl-path> <type> <path> // <T as Trait> (trait impl)
| "Y" <type> <path> // <T as Trait> (trait definition)
| "N" <namespace> <path> <identifier> // ...::ident (nested path)
| "I" <path> {<generic-arg>} "E" // ...<T, U> (generic args)
| <backref>
// Path to an impl (without the Self type or the trait).
// The <path> is the parent, while the <disambiguator> distinguishes
// between impls in that same parent (e.g. multiple impls in a mod).
// This exists as a simple way of ensure uniqueness, and demanglers
// don't need to show it (unless the location of the impl is desired).
<impl-path> = [<disambiguator>] <path>
// The <decimal-number> is the length of the identifier in bytes.
// <bytes> is the identifier itself, and it's optionally preceded by "_",
// to separate it from its length - this "_" is mandatory if the <bytes>
// starts with a decimal digit, or "_", in order to keep it unambiguous.
// If the "u" is present then <bytes> is Punycode-encoded.
<identifier> = [<disambiguator>] <undisambiguated-identifier>
<disambiguator> = "s" <base-62-number>
<undisambiguated-identifier> = ["u"] <decimal-number> ["_"] <bytes>
// Namespace of the identifier in a (nested) path.
// It's an a-zA-Z character, with a-z reserved for implementation-internal
// disambiguation categories (and demanglers should never show them), while
// A-Z are used for special namespaces (e.g. closures), which the demangler
// can show in a special way (e.g. `NC...` as `...::{closure}`), or just
// default to showing the uppercase character.
<namespace> = "C" // closure
| "S" // shim
| <A-Z> // other special namespaces
| <a-z> // internal namespaces
<generic-arg> = <lifetime>
| <type>
| "K" <const> // forward-compat for const generics
// An anonymous (numbered) lifetime, either erased or higher-ranked.
// Index 0 is always erased (can show as '_, if at all), while indices
// starting from 1 refer (as de Bruijn indices) to a higher-ranked
// lifetime bound by one of the enclosing <binder>s.
<lifetime> = "L" <base-62-number>
// Specify the number of higher-ranked (for<...>) lifetimes to bound.
// <lifetime> can then later refer to them, with lowest indices for
// innermost lifetimes, e.g. in `for<'a, 'b> fn(for<'c> fn(...))`,
// any <lifetime>s in ... (but not inside more binders) will observe
// the indices 1, 2, and 3 refer to 'c, 'b, and 'a, respectively.
// The number of bound lifetimes is value of <base-62-number> + 1.
<binder> = "G" <base-62-number>
<type> = <basic-type>
| <path> // named type
| "A" <type> <const> // [T; N]
| "S" <type> // [T]
| "T" {<type>} "E" // (T1, T2, T3, ...)
| "R" [<lifetime>] <type> // &T
| "Q" [<lifetime>] <type> // &mut T
| "P" <type> // *const T
| "O" <type> // *mut T
| "F" <fn-sig> // fn(...) -> ...
| "D" <dyn-bounds> <lifetime> // dyn Trait<Assoc = X> + Send + 'a
| <backref>
<basic-type> = "a" // i8
| "b" // bool
| "c" // char
| "d" // f64
| "e" // str
| "f" // f32
| "h" // u8
| "i" // isize
| "j" // usize
| "l" // i32
| "m" // u32
| "n" // i128
| "o" // u128
| "s" // i16
| "t" // u16
| "u" // ()
| "v" // ...
| "x" // i64
| "y" // u64
| "z" // !
| "p" // placeholder (e.g. for generic params), shown as _
// If the "U" is present then the function is `unsafe`.
// The return type is always present, but demanglers can
// choose to omit the ` -> ()` by special-casing "u".
<fn-sig> = [<binder>] ["U"] ["K" <abi>] {<type>} "E" <type>
<abi> = "C"
| <undisambiguated-identifier>
<dyn-bounds> = [<binder>] {<dyn-trait>} "E"
<dyn-trait> = <path> {<dyn-trait-assoc-binding>}
<dyn-trait-assoc-binding> = "p" <undisambiguated-identifier> <type>
<const> = <type> <const-data>
| "p" // placeholder, shown as _
| <backref>
// The encoding of a constant depends on its type. Integers use their value,
// in base 16 (0-9a-f), not their memory representation. Negative integer
// values are preceded with "n". The bool value false is encoded as `0_`, true
// value as `1_`. The char constants are encoded using their Unicode scalar
// value.
<const-data> = ["n"] {<hex-digit>} "_"
// <base-62-number> uses 0-9-a-z-A-Z as digits, i.e. 'a' is decimal 10 and
// 'Z' is decimal 61.
// "_" with no digits indicates the value 0, while any other value is offset
// by 1, e.g. "0_" is 1, "Z_" is 62, "10_" is 63, etc.
<base-62-number> = {<0-9a-zA-Z>} "_"
<backref> = "B" <base-62-number>
// We use <path> here, so that we don't have to add a special rule for
// compression. In practice, only a crate root is expected.
<instantiating-crate> = <path>
Namespaces are identified by an implementation defined single character tag
(the <namespace>
production). Only closures (C
) and shims (S
) have a
specific character assigned to them so that demanglers can reliable
adjust their output accordingly. Other namespace tags have to be omitted
or shown verbatim during demangling.
This is a concession to the compiler's current implementation. While the language only knows two namespaces (the type and the value namespace), the compiler uses many more in some important data structures and disambiguation indices are assigned according to these internal data structures. So, in order not to force the compiler to waste processing time on re-constructing different disambiguation indices, the internal unspecified "namespaces" are used. This may change in the future.
As described above, the grammar encodes constant values via the
<const-data> = {<hex-digit>} "_"
production, where {<hex-digit>}
is
the numeric value of the constant, not its representation as bytes. Using
the numeric value is platform independent but does not easily scale to
non-integer data types.
In the future it is likely that Rust will support complex type-level constants (i.e. not just integers). This RFC suggests to develop a proper mangling for these as part of the future const-generics work, and, for now, only define a mangling for integer values.
Punycode generates strings of the form ([[:ascii:]]+-)?[[:alnum:]]+
.
This is problematic because of the -
character, which is not in the
supported character set; Punycode uses it to separate the ASCII part
(if it exists), from the base-36 encoding of the non-ASCII characters.
For this reasons, we deviate from vanilla Punycode, by replacing
the -
character with a _
character.
Here are some examples:
Original | Punycode | Punycode + Encoding |
---|---|---|
føø | f-5gaa | f_5gaa |
α_ω | _-ylb7e | __ylb7e |
铁锈 | n84amf | n84amf |
🤦 | fq9h | fq9h |
ρυστ | 2xaedc | 2xaedc |
With this post-processing in place the Punycode strings can be treated like regular identifiers and need no further special handling.
Symbol name compression works by substituting parts of the mangled name that have already been seen for a back reference. Compression is directly built into the mangling algorithm, as shown by the following piece of pseudocode:
fn mangle(node, output_string, substitution_dictionary) {
if let Some(backref) = substitution_dictionary.get(node) {
// Emit the backref instead of the node's contents
mangle(backref, output_string)
} else {
// Remember where the current node starts in the output
let start_position = output_string.len()
// Do the actual mangling, including recursive mangling of child nodes
// Add the current node to the substitution dictionary
if node.is_substitutable() {
substitution_dictionary.insert(node, start_position)
}
}
}
This algorithm automatically chooses the best compression because parent nodes (which are always larger) are visited before child nodes.
Note that this kind of compression relies on the fact that all substitutable AST nodes have a self-terminating mangled form, that is, given the start position of the encoded node, the grammar guarantees that it is always unambiguous where the node ends. This is ensured by not allowing optional or repeating elements at the end of substitutable productions.
Decompression too is built directly into demangling/parsing. When a back reference is encountered, we decode the referenced position and use a temporary demangler/parser to do the decoding of the node's actual content:
fn demangle_at(&mut pos, mangled, output_string) {
if is_backref(*pos, mangled) {
// Read the byte offset of the referenced node and
// advance `pos` past the backref.
let mut referenced_pos = decode(pos, mangled);
demangle_at(&mut referenced_pos, mangled, output_string)
} else {
// do regular demangling
}
}
Using byte offsets as backref keys (as this RFC does) instead of post-order traversal indices (as Itanium mangling does) has the advantage that the demangler does not need to duplicate the mangler's substitution indexing logic, something that can become quite complex (as demonstrated by the compression scheme proposed in the initial version of this RFC).
The mangling syntax is constructed in a way that allows for implementing efficient demanglers:
Mangled names contain information in the same order as unmangled names are expected to contain it. Therefore, a demangler can directly generate its output while parsing the mangled form. There is no need to explicitly instantiate the AST in memory.
The same is true for decompression. Decompression can be done without
allocating memory outside of the stack. Alternatively the demangler
can keep a simple array that maps back-ref indices to ranges in the
already generated output. When it encounters a <backref>
in need
of expansion, it can just look up corresponding range and do a
simple memcpy
.
Parsing, decompression, and demangling can thus be done in a single pass
over the mangled name without the need for complex data structures, which
is useful when having to implement #[no_std]
or C demanglers. (Note that
Punycode can complicate decoding slightly because it needs dynamic memory
allocation in the general case but it can be implemented with an on-stack
buffer for a reasonable maximum supported length).
This RFC suggests the following mapping of Rust entities to mangled names:
Named functions, methods, and statics shall be represented by a
<path>
production.
Paths should be rooted at the inner-most entity that can act as a path root. Roots can be crate-ids, inherent impls, trait impls, and (for items within default methods) trait definitions.
The compiler is free to choose disambiguation indices and namespace tags from the reserved ranges as long as it ascertains identifier unambiguity.
Generic arguments that are equal to the default should not be encoded in order to save space.
Why should we not do this?
The alternatives considered are:
Keeping the current scheme. It does meet the minimum requirements after all. However, the general consensus seems to be that this leads to situations where people are unpleasantly surprised when they come across (demangled) symbol names in backtraces or profilers.
Keeping the current scheme but cleaning it up by making the non-hash part more consistent and more expressive. Keep the hash part as a safeguard against symbol conflicts and the rest as something just for demangling. The downside of this is that the hash would still not be predictable, and symbols would get rather long if they should contain more human-readable information about generic arguments.
Define a standardized pretty-printing format for things that end up as symbols, and then encode that via Punycode in order to meet the character set restrictions. This would be rather simple. Symbol names would remain somewhat human-readable (but not very, because all separators would be stripped out). But without some kind of additional compression, symbol names would become rather long.
Use the scheme from the previous bullet point but apply the compression scheme described above. We could do this but it wouldn't really be less complex than the scheme proposed by the RFC.
Define a standardized pretty-printing format for things that end up as
symbols, compress with zstd
(specially trained for Rust symbols) and
encode the result as base63
. This is rather simple but loses all
human-readability. It's unclear how well this would compress. It would
pull the zstd
specification into the mangling scheme specification,
as well as the pre-trained dictionary.
One of the major modern mangling schemes with a public specification is the Itanium C++ ABI scheme for C++ which is used by the GCC toolchain. An initial version of this RFC sticked closely to Itanium mangling, however, the latest version only retains the run-length encoding for identifiers and some literals for tagging things like basic types. The Itanium scheme has been criticized for being overly complex, due to its extensive grammar and two separate compression schemes.
The idea of using Punycode for handling of unicode identifiers is taken from the Swift programming language's mangling scheme.
During the pre-RFC phase, it has been suggested that Unicode identifiers should be encoded as UTF-8 instead of Punycode on platforms that allow it. GCC, Clang, and MSVC seem to do this. The author of the RFC has a hard time making up their mind about this issue. Here are some interesting points that might influence the final decision:
Using UTF-8 instead of Punycode would make mangled strings containing non-ASCII identifiers a bit more human-readable. For demangled strings, there would be no difference.
Punycode support is non-optional since some platforms only allow a very limited character set for symbol names. Thus, we would be using UTF-8 on some platforms and Punycode on others, making it harder to predict what a symbol name for a given item looks like.
Punycode encoding and decoding is more runtime effort for the mangler and demangler.
Once a demangler supports Punycode, it is not much effort to support
both encodings. The u
identifier prefix tells the demangler whether
it is Punycode. Otherwise it can just assume UTF-8 which already
subsumes ASCII.
UPDATE: This RFC recommends that Punycode encoded identifiers must be supported by demanglers but that it is up to the compiler implementation (for now) to decide whether to use it for a given platform. This question will have to be revisited if Rust ever wants to define a stable ABI.
It has been suggested that parameter types for functions and methods should be encoded in mangled form too. This is not necessary for symbol name uniqueness but it would provide an additional safeguard against silent ABI-related errors where definition and callers of some function make different assumptions about what parameters a function takes. The RFC does not propose to do this because:
However, a final decision on the topic has not been made yet.
UPDATE: This RFC suggests that parameter types are not encoded into function and method symbols. Symbol names will already get significantly longer due to encoding additional information and the additional safeguard provided against ABI mismatches is less relevant for Rust than it is for other languages that don't have a concept of library/crate metadata.
This RFC suggests that names are demangled to a form that matches Rust
syntax as it is used in source code, compiler error messages and rustdoc
:
Path components should be separated by ::
.
If the path root is a <crate-id>
it should be printed as the crate name.
If the context requires it for correctness, the crate disambiguator can be
printed too, as in, for example, std[a0b1c2d3]::collections::HashMap
.
In this case a0b1c2d3
would be the disambiguator. Usually, the
disambiguator can be omitted for better readability.
If the path root is an impl, it should be printed as <SelfType>
(for
inherent impls) or <SelfType as Trait>
(for trait impls), like the
compiler does in error messages. The <impl-path>
also contained in the
AST node should usually be omitted.
The list of generic arguments should be demangled as <T1, T2, T3>
.
Identifiers can have a numeric disambiguator
(the <disambiguator>
production). The syntactic version of the numeric
disambiguator maps to a numeric index. If the disambiguator is not
present, this index is 0. If it is of the form s_
then the index is 1.
If it is of the form s<base-62-digit>_
then the index is
<base-62-digit> + 2
. The suggested demangling of a disambiguator is
[<index>]
. However, for better readability, these disambiguators
should usually be omitted in the demangling altogether. Disambiguators
with index zero can always be omitted.
The exception here are closures. Since these do not have a name, the
disambiguator is the only thing identifying them. The suggested
demangling for closures is thus {closure}[<index>]
.
We assume that all examples are defined in a crate named mycrate[1234]
.
mod foo {
mod bar {
fn baz() {}
}
}
mycrate::foo::bar::baz
_RNvNtNtCs1234_7mycrate3foo3bar3baz
struct Foo<T>(T);
impl<T> Foo<T> {
pub fn bar<U>(_: U) {
static QUUX: u32 = 0;
// ...
}
}
<mycrate::Foo<_>>::bar::QUUX
_RNvNvMCs1234_7mycrateINtCs1234_7mycrate3FoopE3bar4QUUX
struct Foo<T>(T);
impl<T> Clone for Foo<T> {
fn clone(&self) -> Self {
static QUUX: u32 = 0;
// ...
}
}
<mycrate::Foo<_> as std::clone::Clone>::clone::QUUX
_RNvNvXCs1234_7mycrateINtCs1234_7mycrate3FoopENtNtC3std5clone5Clone5clone4QUUX
pub static QUUX: u32 = {
static FOO: u32 = 1;
FOO + FOO
};
mycrate::QUUX::FOO
_RNvNvCs1234_7mycrate4QUUX3FOO
mycrate::foo<mycrate::bar,mycrate::bar::baz>
_RINvCs1234_7mycrate3fooNvB4_3barNvBn_3bazE
std::foo<(std::Bar,std::Bar),(std::Bar,std::Bar)>
_RINxC3std3fooTNyB4_3BarBe_EBd_E
impl
disambiguation strategy changed to using the impl path instead of param bounds.<disambiguator>
for crate disambiguator".<binder>
optional in <fn-sig>
and <dyn-bounds>
productions.<const-data>
to include bool
values, char
values, and negative integer values.