RFC 2603: Name mangling reform

compiler (linkage)

Summary

This RFC proposes a new mangling scheme that describes what the symbol names generated by the Rust compiler look like. This new scheme has a number of advantages over the existing one which has grown over time without a clear direction. The new scheme is consistent, depends less on compiler internals, and the information it stores in symbol names can be decoded again which provides an improved experience for users of external tools that work with Rust symbol names.

Note that, at this point, the new mangling scheme would not be part of the language specification or the specification of a stable Rust ABI. In the future it could be part of both and it is designed to be stable and extensible; but for the time being it would still be an implementation detail of the Rust compiler.

Motivation

Due to its ad-hoc nature, the compiler's current name mangling scheme has a number of drawbacks:

The proposed scheme solves these problems:

These properties should make it easier for third party tools to work with Rust binaries.

Guide-level explanation

The following section will lay out the requirements for a name mangling scheme and then introduce the actual scheme through a series of ever more complex examples.

Requirements for a Symbol Mangling Scheme

A symbol mangling scheme has a few goals, one of them essential, the rest of them desirable. The essential one is:

"Unambiguous" means that no two distinct compiler-generated entities (that is, mostly object code for functions) must be mapped to the same symbol name. This disambiguation is the main purpose of the hash-suffix in the current, legacy mangling scheme. The scheme proposed here, on the other hand, achieves it in a way that allows to also satisfy a number of additional desirable properties of a mangling scheme:

The RFC also has a couple of non-goals:

The Mangling Scheme by Example

This section will develop an overview of the mangling scheme by walking through a number of examples. We'll start with the simplest case -- and will see how that already involves things that might be surprising.

Free-standing Functions and Statics

A free-standing function is fully identified via its absolute path. For example, the following function

mod foo {
  fn bar() {}
}

has the path foo::bar and NN3foo3bar is a possible mangling of that path that complies to the character set we are restricted to. Why this format with numbers embedded in it? It is a run-length encoding, similar to what the Itanium C++ ABI name mangling scheme uses for identifiers. The scheme proposed here will also use this format because it does not need termination tokens for identifiers (which are hard to come by with our limited character set).

Note that each component in the path (i.e. foo and bar) also has an accompanying start-tag (here N) at the beginning. This start-tag is needed in order for the syntax to be able to represent complex, nested structures as we will see later.

The symbol name above, unfortunately, does not unambiguously identify the function in every context. It is perfectly valid for another crate to also define mod foo { fn bar() {} } somewhere. So in order to avoid conflicts in such cases, the absolute path must always include the crate-id, as in NNC7mycrate3foo3bar. The crate-id has a C start-tag.

There is another possible ambiguity that we have to take care of. Rust has two distinct namespaces: the type and the value namespace. This leads to a path of the form crate_id::foo::bar not uniquely identifying the item bar because the following snippet is legal Rust code:

fn foo() {
    fn bar() {}
}

mod foo {
    fn bar() {}
}

The function foo lives in the value namespace while the module foo lives in the type namespace. They don't interfere. In order to make the symbol names for the two distinct bar functions unique, we thus add a namespace identifier to the start-tag of components where necessary, as in NvNvC7mycrate3foo3bar for the first case and NvNtC7mycrate3foo3bar second case (notice the difference: NvNv... vs NvNt...).

There is one final case of name ambiguity that we have to take care of. Because of macro hygiene, multiple items with the same name can appear in the same context. The compiler internally disambiguates such names by augmenting them with a numeric index. For example, the first occurrence of the name foo within its parent is actually treated as foo'0, the second occurrence would be foo'1, the next foo'2, and so one. The mangling scheme will adopt this setup by prepending a disambiguation prefix to each identifier with a non-zero index. So if macro expansion would result in the following code:

mod foo {
    fn bar'0() {}
    // The second `bar` function was introduced by macro expansion.
    fn bar'1() {}
}

then we would encode the two functions symbols as NvNtC7mycrate3foo3bar and NvNtC7mycrate3foos_3bar respectively (note the s_ prefix in the second case). A very similar disambiguation is needed for avoiding conflicts between crates of the same name but different versions. The same syntactic prefix is thus used for crate-id where we encode the crate disambiguator as in NtNvCs1234_7mycrate3foo3bar. Details on the shape of this prefix are provided in the reference-level description.

As opposed to C++ and other languages that support function overloading, we don't need to include function parameter types in the symbol name. Rust does not allow two functions of the same name but different arguments.

The final symbol name for the function would also include the prefix _R that is common to all symbol names generated by this scheme:

  _RNvNtCs1234_7mycrate3foo3bar
  <>^^^^^<----><------><--><-->
   ||||||   |      |     |   |
   ||||||   |      |     |   +--- "bar" identifier
   ||||||   |      |     +------- "foo" identifier
   ||||||   |      +------------- "mycrate" identifier
   ||||||   +-------------------- disambiguator for "mycrate"
   |||||+------------------------ start-tag for "mycrate"
   ||||+------------------------- namespace tag for "foo"
   |||+-------------------------- start-tag for "foo"
   ||+--------------------------- namespace tag for "bar"
   |+---------------------------- start-tag for "bar"
   +----------------------------- common Rust symbol prefix

Generic Functions

Each monomorphization of a generic function has its own symbol name. The monomorphizations are disambiguated by the list of concrete generic arguments. These arguments are added to the symbol name by a pair of I start-tag at the beginning and a list of the actual arguments at the end. So the instance

std::mem::align_of::<f64>

would be mangled to

_RINvNtC3std3mem8align_ofdE
  ^                      ^^
  |                      ||
  |                      |+--- end of argument list
  |                      +--- f64
  +--- start-tag

where I precedes the thing the arguments belong to, d designates f64 and E ends the argument list. As we can see, we need to be able to represent all kinds of types that can be part of such an argument list. (In the future, when const generics get added to the language, we will also need to represent values) These kinds of types are:

Basic types are all encoded via a single lower-case letter, like in the Itanium scheme. Named types are encoded as their absolute path (including arguments) like is done for function symbols. Composites like references, tuples, and function types all follow a simple grammar given in the reference-level explanation below. Here are some example manglings to get a general feel of what they look like:

There's one more thing we have to take into account for generic functions: The compiler may produce "crate-local" copies of a monomorphization. That is, if there is a function foo<T> which gets used as foo<u32> in two different crates, the compiler (depending on the optimization level) might generate two distinct functions at the LLVM IR level, each with it's own symbol. In order to support this without running into conflicts, symbol names for monomorphizations must include the id of the crate they are instantiated for. This scheme does this by appending an <crate-id> suffix to the symbol. So for example the mangling for std::mem::align_of::<usize> would actually look like this:

_RINvNtC3std3mem8align_ofjEC3foo (for crate "foo")
_RINvNtC3std3mem8align_ofjEC3bar (for crate "bar")

Closures and Closure Environments

The scheme needs to be able to generate symbol names for the function containing the code of a closure and it needs to be able to refer to the type of a closure if it occurs as a type argument. As closures don't have a name, we need to generate one. The scheme proposes to use the namespace and disambiguation mechanisms already introduced above for this purpose. Closures get their own "namespace" (i.e. they are neither in the type nor the value namespace), and each closure has an empty name with a disambiguation index (like for macro hygiene) identifying them within their parent. The full name of a closure is then constructed like for any other named item:

mod foo {
  fn bar(x: u32) {
    let a = |x| { x + 1 }; // local name: NC<...>0
    let b = |x| { x + 2 }; // local name: NC<...>s_0

    a(b(x))
  }
}

In the above example we have two closures, the one assigned to a and the one assigned to b. The first one would get the local name NC<...>0 and the second one the name NC<...>s_0. The 0 signifies the length of their (empty) name. The <...> part is the path of the parent. The C is the namespace tag, analogous to the v tag for the value namespace. The s_ for the second closure is the disambiguation index (index 0 is, again, encoded by not prepending a prefix). Their full names would then be NCNvNtC7mycrate3foo3bar0 and NCNvNtC7mycrate3foo3bars_0 respectively.

Methods

Methods are nested within impl or trait items. As such it would be possible to construct their symbol names as paths like my_crate::foo::{{impl}}::some_method where {{impl}} somehow identifies the the impl in question. Since impls don't have names, we'd have to use an indexing scheme like the one used for closures (and indeed, this is what the compiler does internally). Adding in generic arguments to, this would lead to symbol names looking like my_crate::foo::impl'17::<u32, char>::some_method.

However, in the opinion of the author these symbols are very hard to map back to the method they represent. Consider a module containing dozens of types, each with multiple impl blocks generated via #[derive(...)]. In order to find out which method a symbol maps to, one would have to count the number of handwritten and macro generated impl blocks in the module, and hope that one correctly guessed the number of impl blocks introduced by the given derive-macro (each macro invocation can introduce 0..n such blocks). The name of the method might give a hint, but there are still likely to be dozens of methods named clone, hash, eq, et cetera.

The RFC therefore proposes to keep symbol names close to how methods are represented in error messages, that is:

This can be achieved by extending the definition of paths that we have used so far. Instead of the path root always being a crate-id, we now also allow a path to start with an impl production that contains the self-type and (for trait methods) the name of the trait being implemented.

Thus, this extended form of paths would have the following syntax:

<path> = C <identifier>                      // crate-id root
       | M <type>                            // inherent impl root
       | X <type> <path>                     // trait impl root
       | N <namespace> <path> <identifier>   // nested path
       | I <path> {<generic-arg>} E          // generic arguments

Here are some examples for complete symbol names:

<mycrate::Foo<u32>>::foo => _RNvMINtC7mycrate3FoomE3foo
<u32 as mycrate::Foo>::foo => _RNvXmNtC7mycrate3Foo3foo
<mycrate::Foo<u32> as mycrate::Bar<u64>>::foo => _RNvXINtC7mycrate3FoomEINtC7mycrate3BaryE3foo

Items Within Generic Impls

In Rust one can define items within generic items, e.g. functions or impls, like in the following example:

struct Foo<T>(T);

impl<T> From<T> for Foo<T> {
  fn from(x: T) -> Self {
    static MSG: &str = "...";
    panic!("{}", MSG)
  }
}

The MSG here (or any other such nested definition) does not inherit the generic context from the impl. MSG is non-generic, and a function defined in its place would be too. The fully qualified name of MSG, according to our examples so far, is thus <mycrate::Foo<_> as std::convert::From<_>>::from::MSG and its symbol name:

_RNvNvXINtC7mycrate3FoopEINtNtC3std7convert4FrompE4from3MSG

However, with trait specialization, this symbol can be ambiguous. Consider the following piece of code:

struct Foo<T>(T);

impl<T> From<T> for Foo<T> {
  default fn from(x: T) -> Self {
    static MSG: &str = "...";
    panic!("{}", MSG)
  }
}

impl<T: Default> From<T> for Foo<T> {
  fn from(x: T) -> Self {
    static MSG: &str = "123";
    panic!("{}", MSG)
  }
}

Notice that both MSG statics have the path <Foo<_> as From<_>>::foo::MSG. We somehow have to disambiguate the impls. We do so by adding the path of the impl to the symbol name.

<path> = C <identifier>                      // crate-id root
       | M <impl-path> <type>                // inherent impl root
       | X <impl-path> <type> <path>         // trait impl root
       | N <namespace> <path> <identifier>   // nested path
       | I <path> {<generic-arg>} E          // generic arguments

<impl-path> = [<disambiguator>] <path>

The two symbol names would then look something like:

_RNvNvXs2_C7mycrateINtC7mycrate3FoopEINtNtC3std7convert4FrompE4from3MSG
_RNvNvXs3_C7mycrateINtC7mycrate3FoopEINtNtC3std7convert4FrompE4from3MSG
       <----------><----------------><----------------------->
        impl-path      self-type            trait-name

Like other disambiguation information, this path would usually not actually be shown by demanglers.

Unicode Identifiers

Rust allows Unicode identifiers but our character set is restricted to ASCII alphanumerics, and _. In order to transcode the former to the latter, we use the same approach as Swift, which is: encode all non-ASCII identifiers via Punycode, a standardized and efficient encoding that keeps encoded strings in a rather human-readable format. So for example, the string

"Gödel, Escher, Bach"

is encoded as

"Gdel, Escher, Bach-d3b"

which, as opposed to something like Base64, still gives a pretty good idea of what the original string looked like.

Each component of a name, i.e. anything that starts with the number of bytes to read in the examples above, is encoded individually. Components encoded this way are augmented with a u prefix so that demanglers know that the identifier needs further decoding. As an example, the function:

mod gödel {
  mod escher {
    fn bach() {}
  }
}

would be mangled as:

_RNvNtNtC7mycrateu8gdel_5qa6escher4bach
                 <-------->
              Unicode component

Compression/Substitution

The length of symbol names has an influence on how much work the compiler, linker, and loader have to perform. The shorter the names, the better. At the same time, Rust's generics can lead to rather long names (which are often not visible in the code because of type inference and impl Trait). For example, the return type of the following function:

fn quux(s: Vec<u32>) -> impl Iterator<Item = (u32, usize)> {
    s.into_iter()
     .map(|x| x + 1)
     .filter(|&x| x > 10)
     .zip(0..)
     .chain(iter::once((0, 0)))
}

is

std::iter::Chain<
  std::iter::Zip<
    std::iter::Filter<
      std::iter::Map<
        std::vec::IntoIter<u32>,
        [closure@src/main.rs:16:11: 16:18]>,
      [closure@src/main.rs:17:14: 17:25]>,
    std::ops::RangeFrom<usize>>,
  std::iter::Once<(u32, usize)>>

It would make for a long symbol name if this type is used (maybe repeatedly) as a generic argument somewhere. C++ has the same problem with its templates; which is why the Itanium mangling introduces the concept of compression. If a component of a definition occurs more than once, it will not be repeated and instead be emitted as a substitution marker that allows to reconstruct which component it refers to. The scheme proposed here will use the same approach (but with a simpler definition).

The exact scheme will be described in detail in the reference level explanation below but it roughly works as follows: As a mangled symbol name is being built, we remember the position of every substitutable item in the output string, that is, we keep track of things a subsequent occurrence of which could be replaced by a back reference.

The things that are eligible for substitution are (1) all prefixes of paths (including the entire path itself), (2) all types except for basic types, and (3) type-level constants (array lengths and values passed to const generic params).

Here's an example in order to illustrate the concept. The name

std::iter::Chain<std::iter::Zip<std::vec::IntoIter<u32>, std::vec::IntoIter<u32>>>

is mangled to the following uncompressed string. The lines below show parts of the mangled string that already occurred before and can thus be replaced by a back reference. The number of at the beginning of each span given the 0-based byte position of where it occurred the first time.

  0         10        20        30        40        50        60        70        80        90
_RINtNtC3std4iter5ChainINtNtC3std4iter3ZipINtNtC3std3vec8IntoItermEINtNtC3std3vec8IntoItermEEE
                            5----              5----                    5----
                          3-----------                                43---------
                                                                    41--------------------
                                                                   40-----------------------

The compiler is always supposed to use the longest replacement possible in order to achieve the best compression. The compressed symbol looks as follows:

_RINtNtC3std4iter5ChainINtB2_3ZipINtNtB4_3vec8IntoItermEBt_EE
                          ^^^         ^^^               ^^^     back references

Back references have the form B<base-62-number>_.

Reference-level explanation

The reference-level explanation consists of three parts:

  1. A specification of the syntax mangled names conform to.
  2. A specification of the compression scheme.
  3. A mapping of Rust entities to the mangling syntax.

For implementing a demangler, only the first two sections are of interest, that is, a demangler only needs to understand syntax and compression of names, but it does not have to care about how the compiler generates mangled names.

Syntax Of Mangled Names

The syntax of mangled names is given in extended Backus-Naur form:

Mangled names conform to the following grammar:

// The <decimal-number> specifies the encoding version.
<symbol-name> = "_R" [<decimal-number>] <path> [<instantiating-crate>]

<path> = "C" <identifier>                    // crate root
       | "M" <impl-path> <type>              // <T> (inherent impl)
       | "X" <impl-path> <type> <path>       // <T as Trait> (trait impl)
       | "Y" <type> <path>                   // <T as Trait> (trait definition)
       | "N" <namespace> <path> <identifier> // ...::ident (nested path)
       | "I" <path> {<generic-arg>} "E"      // ...<T, U> (generic args)
       | <backref>

// Path to an impl (without the Self type or the trait).
// The <path> is the parent, while the <disambiguator> distinguishes
// between impls in that same parent (e.g. multiple impls in a mod).
// This exists as a simple way of ensure uniqueness, and demanglers
// don't need to show it (unless the location of the impl is desired).
<impl-path> = [<disambiguator>] <path>

// The <decimal-number> is the length of the identifier in bytes.
// <bytes> is the identifier itself, and it's optionally preceded by "_",
// to separate it from its length - this "_" is mandatory if the <bytes>
// starts with a decimal digit, or "_", in order to keep it unambiguous.
// If the "u" is present then <bytes> is Punycode-encoded.
<identifier> = [<disambiguator>] <undisambiguated-identifier>
<disambiguator> = "s" <base-62-number>
<undisambiguated-identifier> = ["u"] <decimal-number> ["_"] <bytes>

// Namespace of the identifier in a (nested) path.
// It's an a-zA-Z character, with a-z reserved for implementation-internal
// disambiguation categories (and demanglers should never show them), while
// A-Z are used for special namespaces (e.g. closures), which the demangler
// can show in a special way (e.g. `NC...` as `...::{closure}`), or just
// default to showing the uppercase character.
<namespace> = "C"      // closure
            | "S"      // shim
            | <A-Z>    // other special namespaces
            | <a-z>    // internal namespaces

<generic-arg> = <lifetime>
              | <type>
              | "K" <const> // forward-compat for const generics

// An anonymous (numbered) lifetime, either erased or higher-ranked.
// Index 0 is always erased (can show as '_, if at all), while indices
// starting from 1 refer (as de Bruijn indices) to a higher-ranked
// lifetime bound by one of the enclosing <binder>s.
<lifetime> = "L" <base-62-number>

// Specify the number of higher-ranked (for<...>) lifetimes to bound.
// <lifetime> can then later refer to them, with lowest indices for
// innermost lifetimes, e.g. in `for<'a, 'b> fn(for<'c> fn(...))`,
// any <lifetime>s in ... (but not inside more binders) will observe
// the indices 1, 2, and 3 refer to 'c, 'b, and 'a, respectively.
// The number of bound lifetimes is value of <base-62-number> + 1.
<binder> = "G" <base-62-number>

<type> = <basic-type>
       | <path>                      // named type
       | "A" <type> <const>          // [T; N]
       | "S" <type>                  // [T]
       | "T" {<type>} "E"            // (T1, T2, T3, ...)
       | "R" [<lifetime>] <type>     // &T
       | "Q" [<lifetime>] <type>     // &mut T
       | "P" <type>                  // *const T
       | "O" <type>                  // *mut T
       | "F" <fn-sig>                // fn(...) -> ...
       | "D" <dyn-bounds> <lifetime> // dyn Trait<Assoc = X> + Send + 'a
       | <backref>

<basic-type> = "a"      // i8
             | "b"      // bool
             | "c"      // char
             | "d"      // f64
             | "e"      // str
             | "f"      // f32
             | "h"      // u8
             | "i"      // isize
             | "j"      // usize
             | "l"      // i32
             | "m"      // u32
             | "n"      // i128
             | "o"      // u128
             | "s"      // i16
             | "t"      // u16
             | "u"      // ()
             | "v"      // ...
             | "x"      // i64
             | "y"      // u64
             | "z"      // !
             | "p"      // placeholder (e.g. for generic params), shown as _

// If the "U" is present then the function is `unsafe`.
// The return type is always present, but demanglers can
// choose to omit the ` -> ()` by special-casing "u".
<fn-sig> = [<binder>] ["U"] ["K" <abi>] {<type>} "E" <type>

<abi> = "C"
      | <undisambiguated-identifier>

<dyn-bounds> = [<binder>] {<dyn-trait>} "E"
<dyn-trait> = <path> {<dyn-trait-assoc-binding>}
<dyn-trait-assoc-binding> = "p" <undisambiguated-identifier> <type>
<const> = <type> <const-data>
        | "p" // placeholder, shown as _
        | <backref>

// The encoding of a constant depends on its type. Integers use their value,
// in base 16 (0-9a-f), not their memory representation. Negative integer
// values are preceded with "n". The bool value false is encoded as `0_`, true
// value as `1_`. The char constants are encoded using their Unicode scalar
// value.
<const-data> = ["n"] {<hex-digit>} "_"

// <base-62-number> uses 0-9-a-z-A-Z as digits, i.e. 'a' is decimal 10 and
// 'Z' is decimal 61.
// "_" with no digits indicates the value 0, while any other value is offset
// by 1, e.g. "0_" is 1, "Z_" is 62, "10_" is 63, etc.
<base-62-number> = {<0-9a-zA-Z>} "_"

<backref> = "B" <base-62-number>

// We use <path> here, so that we don't have to add a special rule for
// compression. In practice, only a crate root is expected.
<instantiating-crate> = <path>

Namespace Tags

Namespaces are identified by an implementation defined single character tag (the <namespace> production). Only closures (C) and shims (S) have a specific character assigned to them so that demanglers can reliable adjust their output accordingly. Other namespace tags have to be omitted or shown verbatim during demangling.

This is a concession to the compiler's current implementation. While the language only knows two namespaces (the type and the value namespace), the compiler uses many more in some important data structures and disambiguation indices are assigned according to these internal data structures. So, in order not to force the compiler to waste processing time on re-constructing different disambiguation indices, the internal unspecified "namespaces" are used. This may change in the future.

Type-Level Constants

As described above, the grammar encodes constant values via the <const-data> = {<hex-digit>} "_" production, where {<hex-digit>} is the numeric value of the constant, not its representation as bytes. Using the numeric value is platform independent but does not easily scale to non-integer data types.

In the future it is likely that Rust will support complex type-level constants (i.e. not just integers). This RFC suggests to develop a proper mangling for these as part of the future const-generics work, and, for now, only define a mangling for integer values.

Punycode Identifiers

Punycode generates strings of the form ([[:ascii:]]+-)?[[:alnum:]]+. This is problematic because of the - character, which is not in the supported character set; Punycode uses it to separate the ASCII part (if it exists), from the base-36 encoding of the non-ASCII characters.

For this reasons, we deviate from vanilla Punycode, by replacing the - character with a _ character.

Here are some examples:

OriginalPunycodePunycode + Encoding
føøf-5gaaf_5gaa
α_ω_-ylb7e__ylb7e
铁锈n84amfn84amf
🤦fq9hfq9h
ρυστ2xaedc2xaedc

With this post-processing in place the Punycode strings can be treated like regular identifiers and need no further special handling.

Compression

Symbol name compression works by substituting parts of the mangled name that have already been seen for a back reference. Compression is directly built into the mangling algorithm, as shown by the following piece of pseudocode:

fn mangle(node, output_string, substitution_dictionary) {
    if let Some(backref) = substitution_dictionary.get(node) {
        // Emit the backref instead of the node's contents
        mangle(backref, output_string)
    } else {
        // Remember where the current node starts in the output
        let start_position = output_string.len()

        // Do the actual mangling, including recursive mangling of child nodes

        // Add the current node to the substitution dictionary
        if node.is_substitutable() {
            substitution_dictionary.insert(node, start_position)
        }
    }
}

This algorithm automatically chooses the best compression because parent nodes (which are always larger) are visited before child nodes.

Note that this kind of compression relies on the fact that all substitutable AST nodes have a self-terminating mangled form, that is, given the start position of the encoded node, the grammar guarantees that it is always unambiguous where the node ends. This is ensured by not allowing optional or repeating elements at the end of substitutable productions.

Decompression

Decompression too is built directly into demangling/parsing. When a back reference is encountered, we decode the referenced position and use a temporary demangler/parser to do the decoding of the node's actual content:

fn demangle_at(&mut pos, mangled, output_string) {
  if is_backref(*pos, mangled) {
    // Read the byte offset of the referenced node and
    // advance `pos` past the backref.
    let mut referenced_pos = decode(pos, mangled);
    demangle_at(&mut referenced_pos, mangled, output_string)
  } else {
    // do regular demangling
  }
}

Using byte offsets as backref keys (as this RFC does) instead of post-order traversal indices (as Itanium mangling does) has the advantage that the demangler does not need to duplicate the mangler's substitution indexing logic, something that can become quite complex (as demonstrated by the compression scheme proposed in the initial version of this RFC).

A Note On Implementing Efficient Demanglers

The mangling syntax is constructed in a way that allows for implementing efficient demanglers:

Parsing, decompression, and demangling can thus be done in a single pass over the mangled name without the need for complex data structures, which is useful when having to implement #[no_std] or C demanglers. (Note that Punycode can complicate decoding slightly because it needs dynamic memory allocation in the general case but it can be implemented with an on-stack buffer for a reasonable maximum supported length).

Mapping Rust Language Entities to Symbol Names

This RFC suggests the following mapping of Rust entities to mangled names:

Drawbacks

Why should we not do this?

Rationale and alternatives

The alternatives considered are:

  1. Keeping the current scheme. It does meet the minimum requirements after all. However, the general consensus seems to be that this leads to situations where people are unpleasantly surprised when they come across (demangled) symbol names in backtraces or profilers.

  2. Keeping the current scheme but cleaning it up by making the non-hash part more consistent and more expressive. Keep the hash part as a safeguard against symbol conflicts and the rest as something just for demangling. The downside of this is that the hash would still not be predictable, and symbols would get rather long if they should contain more human-readable information about generic arguments.

  3. Define a standardized pretty-printing format for things that end up as symbols, and then encode that via Punycode in order to meet the character set restrictions. This would be rather simple. Symbol names would remain somewhat human-readable (but not very, because all separators would be stripped out). But without some kind of additional compression, symbol names would become rather long.

  4. Use the scheme from the previous bullet point but apply the compression scheme described above. We could do this but it wouldn't really be less complex than the scheme proposed by the RFC.

  5. Define a standardized pretty-printing format for things that end up as symbols, compress with zstd (specially trained for Rust symbols) and encode the result as base63. This is rather simple but loses all human-readability. It's unclear how well this would compress. It would pull the zstd specification into the mangling scheme specification, as well as the pre-trained dictionary.

Prior art

One of the major modern mangling schemes with a public specification is the Itanium C++ ABI scheme for C++ which is used by the GCC toolchain. An initial version of this RFC sticked closely to Itanium mangling, however, the latest version only retains the run-length encoding for identifiers and some literals for tagging things like basic types. The Itanium scheme has been criticized for being overly complex, due to its extensive grammar and two separate compression schemes.

The idea of using Punycode for handling of unicode identifiers is taken from the Swift programming language's mangling scheme.

Unresolved questions

Punycode vs UTF-8

During the pre-RFC phase, it has been suggested that Unicode identifiers should be encoded as UTF-8 instead of Punycode on platforms that allow it. GCC, Clang, and MSVC seem to do this. The author of the RFC has a hard time making up their mind about this issue. Here are some interesting points that might influence the final decision:

UPDATE: This RFC recommends that Punycode encoded identifiers must be supported by demanglers but that it is up to the compiler implementation (for now) to decide whether to use it for a given platform. This question will have to be revisited if Rust ever wants to define a stable ABI.

Encoding parameter types for function symbols

It has been suggested that parameter types for functions and methods should be encoded in mangled form too. This is not necessary for symbol name uniqueness but it would provide an additional safeguard against silent ABI-related errors where definition and callers of some function make different assumptions about what parameters a function takes. The RFC does not propose to do this because:

However, a final decision on the topic has not been made yet.

UPDATE: This RFC suggests that parameter types are not encoded into function and method symbols. Symbol names will already get significantly longer due to encoding additional information and the additional safeguard provided against ABI mismatches is less relevant for Rust than it is for other languages that don't have a concept of library/crate metadata.

Appendix A - Suggested Demangling

This RFC suggests that names are demangled to a form that matches Rust syntax as it is used in source code, compiler error messages and rustdoc:

Appendix B - Examples

We assume that all examples are defined in a crate named mycrate[1234].

Free-standing Item

mod foo {
  mod bar {
    fn baz() {}
  }
}

Item Defined In Inherent Method

struct Foo<T>(T);

impl<T> Foo<T> {
  pub fn bar<U>(_: U) {
    static QUUX: u32 = 0;
    // ...
  }
}

Item Defined In Trait Method

struct Foo<T>(T);

impl<T> Clone for Foo<T> {
  fn clone(&self) -> Self {
    static QUUX: u32 = 0;
    // ...
  }
}

Item Defined In Initializer Of A Static

pub static QUUX: u32 = {
  static FOO: u32 = 1;
  FOO + FOO
};

Compressed Prefix Constructed From Prefix That Contains A Substitution Itself - TODO

Progressive type compression

Appendix C - Change LOG