RFC 2195: really-tagged-unions

lang (data-types | repr | ffi | machine)

Summary

Formally define the enum #[repr(u32, i8, etc..)] and #[repr(C)] attributes to force a non-C-like enum to have a defined layout. This serves two purposes: allowing low-level Rust code to independently initialize the tag and payload, and allowing C(++) to safely manipulate these types.

Motivation

Enums that contain data are very good and useful. Unfortunately, their layout is currently purposefully unspecified, which makes these kinds of enums unusable for FFI and for low-level code. To demonstrate this, this RFC will look at two examples from Firefox development where this has been a problem.

C(++) FFI

Consider a native Rust API for drawing a line, that uses a C-like LineStyle enum:

// In native Rust crate

pub fn draw_line(&mut self, bounds: &Rect, color: &Color, style: LineStyle) {
    ...
}

#[repr(u8)]
pub enum LineStyle {
    Solid,
    Dotted,
    Dashed,
}

#[repr(C)]
pub struct Rect { x: f32, y: f32, width: f32, height: f32 }

#[repr(C)]
pub struct Color { r: f32, g: f32, b: f32, a: f32 }

This API is fairly easy for us to write a machine-checked shim for C++ code to invoke:

// In Rust shim crate

#[no_mangle]
pub extern "C" fn wr_draw_line(
    state: &mut State, 
    bounds: &Rect, 
    color: &Color,
    style: LineStyle,
) {
    state.draw_line(bounds, color, style);
}

// In C++ shim header

// Autogenerated by cbindgen
extern "C" {
namespace wr {
struct State; // opaque

struct Rect { float x; float y; float width; float height; };
struct Color { float r; float g; float b; float a; };

enum class LineStyle: uint8_t {
    Solid,
    Dotted,
    Dashed,
};

void wr_draw_line(State *state,
                  const Rect *bounds,
                  const Color *aColor,
                  LineStyle aStyle);
} // namespace wr
} // extern



// Hand-written
void WrDrawLine(
    wr::State* aState, 
    const wr::Rect* aRect, 
    const wr::Color* aColor, 
    wr::LineStyle aStyle
) {
    wr_draw_line(aState, aRect, aColor, aStyle);
}

This works well, and we're happy.

Now consider adding a WavyLine style, which requires an extra thickness value:

// Native Rust crate

#[repr(u8)]         // Doesn't actually do anything we can rely on now
enum LineStyle {
    Solid,
    Dotted,
    Dashed,
    Wavy { thickness: f32 },
}

We cannot safely pass this to/from C(++), nor can we manipulate it there. As such, we're forced to take the thickness as an extra argument that is just ignored most of the time:

// Native Rust crate

pub fn draw_line(
    &mut self, 
    bounds: &Rect, 
    color: &Color, 
    style: LineStyle, 
    wavy_line_thickness: f32
) { ... }

#[repr(u8)]
enum LineStyle {
    Solid,
    Dotted,
    Dashed,
    Wavy,
}

This produces a worse API for everyone, while also throwing away the type-safety benefits of enums. This trick also doesn't scale: if you have many nested enums, the combinatorics eventually become completely intractable.

In-Place Construction

Popular deserialization APIs in Rust generally have a signature like deserialize<T>() -> Result<T, Error>. This works well for small values, but optimizes very poorly for large values, as Rust ends up copying the T many times. Further, in many cases we just want to overwrite an old value that we no longer care about.

In those cases, we could potentially use an API like deserialize_from<T>(&mut T) -> Result<(), Error>. However Rust currently requires enums to be constructed "atomically", so we can't actually take advantage of this API if our large value is an enum.

That is, we must do something like:

fn deserialize_from(dest: &mut MyBigEnum) -> Result<(), Error> {
    let tag = deserialize_tag()?;
    match tag {
        A => {
            let payload = deserialize_a()?;
            *dest = A(payload);
        }
        ..
    }
    Ok(())
}

We must construct the entire payload out-of-place, and then move it into place at the end, even though our API is specifically designed to let us construct in-place.

Now, this is pretty important for memory-safety in the general case, but there are many cases where this can be done safely. For instance, this is safe to do if the entire payload is plain-old-data, like [u8; 200], or if the code catches panics and fixes up the value.

Note that one cannot do something like:

*dest = A(mem::uninitialized());
if let A(ref mut payload_dest) = *dest {
    deserialize_a(payload_dest);
} else { unreachable!() }

because enum optimizations make it unsound to put mem::uninitialized in an enum. That is, checking whether dest is an A can require inspecting the payload.
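The kind of optimization at issue can be observed with Option: for reference and Box payloads, None is stored as the otherwise-impossible null pointer, so telling Some from None means reading the payload bytes. The following is a minimal sketch of that observation; the helper name option_is_free is made up for illustration.

```rust
use std::mem::size_of;

/// Returns true when `Option<T>` takes no more space than `T` itself,
/// meaning the discriminant is encoded inside the payload's own bits.
/// (`option_is_free` is a hypothetical helper, not an existing API.)
fn option_is_free<T>() -> bool {
    size_of::<Option<T>>() == size_of::<T>()
}
```

When option_is_free::<T>() holds, an uninitialized payload could happen to look like None, which is why the pattern above cannot be relied on.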

To accomplish this task, we need dedicated support from the language.

Guide-level explanation

An enum can currently be adorned with #[repr(Int)] where Int is one of Rust's integer types (u8, isize, etc). For C-like enums -- enums which have no variants with associated data -- this specifies that the enum should have the ABI of that integer type (size, alignment, and calling convention). #[repr(C)] currently just tells Rust to try to pick whatever integer type a C compiler for the target platform would use for an enum.

With this RFC, two new guaranteed, C(++)-compatible enum layouts will be added.

#[repr(Int)] on a non-C-like enum will now mean: the enum must be represented as a C-union of C-structs that each start with a C-like enum with #[repr(Int)]. The other fields of the structs are the payloads of the variants. This is a mouthful, so let's look at an example. This definition:

#[repr(Int)]
enum MyEnum {
    A(u32),
    B(f32, u64),
    C { x: u32, y: u8 },
    D,
}

Has the same layout as the following:

#[repr(C)]
union MyEnumRepr {
    A: MyEnumVariantA,
    B: MyEnumVariantB,
    C: MyEnumVariantC,
    D: MyEnumVariantD,
}

#[repr(Int)]
enum MyEnumTag { A, B, C, D }

#[repr(C)]
struct MyEnumVariantA(MyEnumTag, u32);

#[repr(C)]
struct MyEnumVariantB(MyEnumTag, f32, u64);

#[repr(C)]
struct MyEnumVariantC { tag: MyEnumTag, x: u32, y: u8 }

#[repr(C)]
struct MyEnumVariantD(MyEnumTag);

Note that the structs must be repr(C), because otherwise the MyEnumTag value wouldn't be guaranteed to have the same position in each variant.

C++ can also correctly manipulate this enum with the following definition:

#include <stdint.h>

enum class MyEnumTag: CppEquivalentOfInt { A, B, C, D };
struct MyEnumVariantA { MyEnumTag tag; uint32_t payload; };
struct MyEnumVariantB { MyEnumTag tag; float _0; uint64_t _1; };
struct MyEnumVariantC { MyEnumTag tag; uint32_t x; uint8_t y; };
struct MyEnumVariantD { MyEnumTag tag; };

union MyEnum {
    MyEnumVariantA A;
    MyEnumVariantB B;
    MyEnumVariantC C;
    MyEnumVariantD D;
};

The correct C definition is essentially the same, but with the enum class replaced with a plain integer of the appropriate type.
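As a rough check of the equivalence described above, the following sketch instantiates Int as u8, mirrors the enum by hand, and compares the two layouts. The helpers layouts and tag_of are hypothetical names introduced for this illustration, and the sketch assumes a compiler that implements the layout this RFC specifies.

```rust
use std::mem::{align_of, size_of};

#[allow(dead_code)]
#[repr(u8)]
enum MyEnum {
    A(u32),
    B(f32, u64),
    C { x: u32, y: u8 },
    D,
}

#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(u8)]
enum MyEnumTag { A, B, C, D }

#[allow(dead_code)]
#[derive(Clone, Copy)]
#[repr(C)]
struct MyEnumVariantA(MyEnumTag, u32);

#[allow(dead_code)]
#[derive(Clone, Copy)]
#[repr(C)]
struct MyEnumVariantB(MyEnumTag, f32, u64);

#[allow(dead_code)]
#[derive(Clone, Copy)]
#[repr(C)]
struct MyEnumVariantC { tag: MyEnumTag, x: u32, y: u8 }

#[derive(Clone, Copy)]
#[repr(C)]
struct MyEnumVariantD(MyEnumTag);

#[allow(dead_code, non_snake_case)]
#[repr(C)]
union MyEnumRepr {
    A: MyEnumVariantA,
    B: MyEnumVariantB,
    C: MyEnumVariantC,
    D: MyEnumVariantD,
}

/// (size, align) of the enum, then (size, align) of its hand-written mirror.
fn layouts() -> ((usize, usize), (usize, usize)) {
    (
        (size_of::<MyEnum>(), align_of::<MyEnum>()),
        (size_of::<MyEnumRepr>(), align_of::<MyEnumRepr>()),
    )
}

/// Reads the tag without matching on the enum.
fn tag_of(value: &MyEnum) -> MyEnumTag {
    let repr = value as *const MyEnum as *const MyEnumRepr;
    // Every variant struct starts with the tag, so the smallest one will do.
    unsafe { (*repr).D.0 }
}
```

Reading the tag through the D field works for any variant precisely because each variant struct is repr(C) and begins with MyEnumTag.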

This layout might be a bit surprising to those used to using tagged unions in C(++), which are commonly represented as a (tag, union) pair. There are two reasons to prefer this more complex layout. First, Rust has incidentally used this layout for a long time, so code that wants to begin relying on this layout will be compatible with old versions of Rust. Second, it can make slightly better use of space. For instance:

#[repr(u8)]
enum TwoCases {
    A(u8, u16),
    B(u16),
}

Becomes

#[repr(C)]
union TwoCasesRepr {
    A: TwoCasesVariantA,
    B: TwoCasesVariantB,
}

#[repr(u8)]
enum TwoCasesTag { A, B }

#[repr(C)]
struct TwoCasesVariantA(TwoCasesTag, u8, u16);

#[repr(C)]
struct TwoCasesVariantB(TwoCasesTag, u16);

Which ends up being 4 bytes large, because the TwoCasesVariantA struct can be laid out like:

[ u8  | u8  |   u16    ]
  --    --    --    --

While a (tag, union) pair would have to make it 6 bytes large:

[ u8  | pad | u8  | pad |   u16   ]
  --    --    --    --    --    --
        ^           ^- u16 needs 16-bit align
        |
    (u8, u16) struct needs 16-bit align 
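The two sizes can be checked with size_of. In this sketch the (tag, union) pair is written out by hand; PairPayloadA, PairPayload, TwoCasesPair and sizes are names invented for the illustration, and the numbers assume a target where u16 has 2-byte alignment.

```rust
use std::mem::size_of;

#[allow(dead_code)]
#[repr(u8)]
enum TwoCases {
    A(u8, u16),
    B(u16),
}

// Hypothetical hand-written (tag, union) pair holding the same data.
#[allow(dead_code)]
#[derive(Clone, Copy)]
#[repr(C)]
struct PairPayloadA(u8, u16);

#[allow(dead_code)]
#[repr(C)]
union PairPayload {
    a: PairPayloadA,
    b: u16,
}

#[allow(dead_code)]
#[repr(C)]
struct TwoCasesPair {
    tag: u8,
    payload: PairPayload,
}

/// Size of the union-of-structs layout, then size of the (tag, union) pair.
fn sizes() -> (usize, usize) {
    (size_of::<TwoCases>(), size_of::<TwoCasesPair>())
}
```

The two extra bytes in the pair are the padding after the tag and the padding inside the (u8, u16) payload struct.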

However, for better compatibility with common C(++) idioms, and better ergonomics for low-level Rust programs, this RFC defines #[repr(C, Int)] on a tagged enum to specify the (tag, union) representation. Specifically the layout will be equivalent to a C-struct containing a C-like #[repr(Int)] enum followed by a C-union containing each payload.

So for example this enum:

#[repr(C, Int)]
enum MyEnum {
    A(u32),
    B(f32, u64),
    C { x: u32, y: u8 },
    D,
}

Has the same layout as the following:

#[repr(C)]
struct MyEnumRepr {
    tag: MyEnumTag,
    payload: MyEnumPayload,
}

#[repr(Int)]
enum MyEnumTag { A, B, C, D }

#[repr(C)]
union MyEnumPayload {
   A: u32,
   B: MyEnumPayloadB,
   C: MyEnumPayloadC,
   D: (),
}

#[repr(C)]
struct MyEnumPayloadB(f32, u64);

#[repr(C)]
struct MyEnumPayloadC { x: u32, y: u8 }

C++ can also correctly manipulate this enum with the following definition:

#include <stdint.h>

enum class MyEnumTag: CppEquivalentOfInt { A, B, C, D };
struct MyEnumPayloadB { float _0; uint64_t _1;  };
struct MyEnumPayloadC { uint32_t x; uint8_t y; };

union MyEnumPayload {
   uint32_t A;
   MyEnumPayloadB B;   
   MyEnumPayloadC C;
};

struct MyEnum {
    MyEnumTag tag;
    MyEnumPayload payload;
};
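Applied to the LineStyle example from the motivation, the (tag, union) layout might be used as in the sketch below, with Int instantiated as u8. LineStyleTag, LineStyleWavy, LineStylePayload, LineStyleRepr, wavy_from_parts and wr_line_thickness are hypothetical names for this illustration; a real shim would also export the function symbol the way wr_draw_line is exported earlier.

```rust
use std::mem;

#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(C, u8)]
pub enum LineStyle {
    Solid,
    Dotted,
    Dashed,
    Wavy { thickness: f32 },
}

// Hand-written mirror of the layout: a tag followed by a payload union.
#[allow(dead_code)]
#[derive(Clone, Copy)]
#[repr(u8)]
pub enum LineStyleTag { Solid, Dotted, Dashed, Wavy }

#[derive(Clone, Copy)]
#[repr(C)]
pub struct LineStyleWavy { thickness: f32 }

// The fieldless variants contribute no payload, so only Wavy is listed.
#[derive(Clone, Copy)]
#[repr(C)]
pub union LineStylePayload {
    wavy: LineStyleWavy,
}

#[derive(Clone, Copy)]
#[repr(C)]
pub struct LineStyleRepr {
    tag: LineStyleTag,
    payload: LineStylePayload,
}

/// Builds a wavy style field by field, the way foreign code would.
pub fn wavy_from_parts(thickness: f32) -> LineStyle {
    let raw = LineStyleRepr {
        tag: LineStyleTag::Wavy,
        payload: LineStylePayload { wavy: LineStyleWavy { thickness } },
    };
    // Relies on the layout equivalence this RFC defines.
    unsafe { mem::transmute::<LineStyleRepr, LineStyle>(raw) }
}

/// Shim-style function taking the data-carrying enum across the C ABI.
pub extern "C" fn wr_line_thickness(style: &LineStyle) -> f32 {
    match *style {
        LineStyle::Wavy { thickness } => thickness,
        _ => 0.0,
    }
}
```

With this, draw_line no longer needs the separate wavy_line_thickness argument: the thickness travels inside the enum.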

If a non-C-like enum is only #[repr(C)], then the layout will be the same as #[repr(C, Int)], but the C-like tag enum will instead just be #[repr(C)] (so it will have whatever size C enums default to).

For both layouts, it is defined for Rust programs to cast/reinterpret/transmute such an enum into the equivalent Repr definition. Separately manipulating the tag and payload is also defined. The tag and payload need only be in a consistent/initialized state when the value is matched on (which includes Dropping it).

For instance, this code is valid (using the same definitions above):

/// Tries to parse a `#[repr(C, u8)] MyEnum` from a custom binary format, overwriting `dest`.
/// On Err, `dest` may be partially overwritten (but will be in a memory-safe state)
fn parse_my_enum_from<'a>(dest: &'a mut MyEnum, input: &mut &[u8]) -> Result<(), &'static str> {
    unsafe {
        // Convert to raw repr
        let dest: &'a mut MyEnumRepr = mem::transmute(dest);

        // If MyEnum was non-trivial, we might match on the tag and 
        // drop_in_place the payload here to start.

        // Read the tag
        let tag = input.get(0).ok_or("Couldn't Read Tag")?;
        dest.tag = match tag {
            0 => MyEnumTag::A,
            1 => MyEnumTag::B,
            2 => MyEnumTag::C,
            3 => MyEnumTag::D,
            _ => { return Err("Invalid Tag Value"); }
        };
        *input = &input[1..];

        // Note: it would be very bad if we panicked past this point, or if
        // the following methods didn't initialize the payload on Err!

        // Read the payload
        match dest.tag {
            MyEnumTag::A => parse_my_enum_a_from(&mut dest.payload.A, input),
            MyEnumTag::B => parse_my_enum_b_from(&mut dest.payload.B, input),
            MyEnumTag::C => parse_my_enum_c_from(&mut dest.payload.C, input),
            MyEnumTag::D => { Ok(()) /* do nothing */ }
        }
    }
}
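The example above leaves the per-variant parsers undefined. A smaller self-contained variant of the same idea might look like the following sketch, where Msg, MsgTag, MsgPayload, MsgRepr, parse_msg_from and the little-endian wire format are all assumptions made up for illustration.

```rust
#[derive(Debug, PartialEq)]
#[repr(C, u8)]
enum Msg {
    Num(u32),
    Empty,
}

#[derive(Clone, Copy)]
#[repr(u8)]
enum MsgTag { Num, Empty }

#[allow(dead_code)]
#[repr(C)]
union MsgPayload {
    num: u32,
    empty: (),
}

#[allow(dead_code)]
#[repr(C)]
struct MsgRepr {
    tag: MsgTag,
    payload: MsgPayload,
}

/// Overwrites `dest` in place from `input`, advancing `input` past what was read.
/// On Err, `dest` may be partially overwritten but is left in a valid state.
fn parse_msg_from(dest: &mut Msg, input: &mut &[u8]) -> Result<(), &'static str> {
    // Reinterpret as the raw repr; `Msg` has no destructor that would need running first.
    let dest: &mut MsgRepr = unsafe { &mut *(dest as *mut Msg as *mut MsgRepr) };

    // Read and validate the tag before touching `dest`.
    let (&byte, rest) = input.split_first().ok_or("couldn't read tag")?;
    let tag = match byte {
        0 => MsgTag::Num,
        1 => MsgTag::Empty,
        _ => return Err("invalid tag value"),
    };
    *input = rest;
    dest.tag = tag;

    // From here on, the payload must be initialized on every path.
    match tag {
        MsgTag::Num => {
            if input.len() < 4 {
                dest.payload.num = 0;
                return Err("couldn't read payload");
            }
            let (bytes, rest) = input.split_at(4);
            dest.payload.num = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
            *input = rest;
        }
        MsgTag::Empty => {}
    }
    Ok(())
}
```

Note how the short-input path still writes a payload: once the tag says Num, leaving the old bytes in place could expose uninitialized memory to the next match.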

It should be noted that Rust enums should still idiomatically not have any repr annotation, as this allows for maximum optimization opportunities and the precise layout is unlikely to matter. If a deterministic layout is required, repr(Int) should be preferred by default over repr(C, Int) as it has a strictly superior space-usage, and incidentally works in older versions of Rust. However repr(C, Int) is a reasonable choice for a more idiomatic-feeling tagged union, or to interoperate with an existing C(++) codebase.

There are a few enum repr combinations that are left unspecified under this proposal, and thus produce compiler warnings:

Reference-level explanation

Since the whole point of this proposal is to enable low-level control, the guide-level explanation should cover all the relevant corner-cases and details in sufficient detail. All that remains is to discuss implementation details.

It was informally decided earlier this year that repr(Int) should have the behaviour this RFC proposes, as it was being partially relied on (in that it suppressed dangerous optimizations) and it made sense to the developers. There is even a test in the rust-lang repo that was added to ensure that this behaviour doesn't regress. So this part of the proposal is already implemented and somewhat tested on stable Rust. This RFC just seeks to codify that this won't break in the future.

However repr(C, Int) currently doesn't do anything different from repr(Int). Changing this is a relatively minor tweak to the code that lowers Rust code to a particular ABI. Anyone relying on repr(C, Int) being the same as repr(Int) is relying on unspecified behaviour, but a cargo bomb run should still be done just to check.

A PR has been submitted to implement this, along with several tests.

Drawbacks

Half of this proposal is already implemented, and the other half has an implementation submitted (~20 line patch). The existence of this proposal can also be completely ignored by anyone who doesn't care about it, as they can keep using the default Rust repr. This is simply making things that exist sort-of-by-accident do something useful, which is basically a pure win considering the implementation/maintenance burden is minimal.

One minor issue with this proposal is that there's no way to request the repr(Int) layout with the repr(C) tag size. To be blunt, this doesn't seem very important. It's unclear if developers should even use bare repr(C) on tagged unions, as the default C enum size is actually quite large for a tag. This is also consistent with the Rust philosophy of trying to minimize unnecessary platform-specific details. Also, a desperate Rust programmer could acquire the desired behaviour with platform-specific cfgs (Rust has to basically guess at the type of a repr(C) enum anyway).

The remaining drawbacks amount to "what if this is the wrong interpretation", which shall be addressed in the alternatives.

Rationale and alternatives

There are a few alternative interpretations of repr(Int) on a non-C-like enum.

It should do nothing

In which case it should probably become an error/warning. This isn't particularly desirable, as was discussed when we decided to maintain this behaviour.

The tag should come after the union, and/or order should be manually specified

With the repr(C) layout, there isn't a particularly compelling reason to move the tag around because of how padding and alignment are handled: you can't actually save space by putting the tag after, as long as your tag is a reasonable size.

It's possible positioning the tag afterwards could be desirable to interoperate with a definition that is provided by a third party (hardware spec or some existing C library). However there are tons of other tag packing strategies that we also can't handle, so we'd probably want a more robust solution for those kinds of cases anyway.

With the repr(Int) layout, this could potentially save space (for instance, with a variant like A(u16, u8)). However the benefits are relatively minimal compared to the increased complexity. If that complexity is desirable, it can be addressed with a future extension.

Compound variants shouldn't automatically be marked as repr(C)

With the repr(Int) layout this isn't really possible, because the tag needs a deterministic position, and we can't "partially" repr(C) a struct.

With either layout, one can make the payload be a single repr(Rust) struct, and that will have its layout aggressively optimized, because repr(C) isn't infectious. So this is just a matter of "what is a good default". The FFI case clearly wants fully defined layouts, while the pure-Rust case seems like a toss up. It seems like repr(C) is therefore the better default.
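A minimal sketch of the single-struct payload approach follows. WavyParams, Style and outer_layout_is_tag_then_payload are hypothetical names. The check only pins down the outer (tag, union) shape, since the inner struct keeps the default representation and its size is up to the compiler.

```rust
use std::mem::{align_of, size_of};

// Default (repr(Rust)) layout: the compiler is free to reorder these fields.
#[allow(dead_code)]
struct WavyParams {
    flag: u8,
    thickness: f32,
    segments: u16,
}

#[allow(dead_code)]
#[repr(C, u8)]
enum Style {
    Solid,
    Wavy(WavyParams),
}

/// The outer layout is a u8 tag padded up to the payload's alignment,
/// followed by the payload, whatever the payload's internal layout is.
fn outer_layout_is_tag_then_payload() -> bool {
    size_of::<Style>() == align_of::<WavyParams>() + size_of::<WavyParams>()
}
```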

Opaque Tags

This code isn't valid under the main proposal:

let mut x: Option<MyEnum> = Some(mem::uninitialized());
if let Some(ref mut inner) = x {
    initialize(inner);
} else { unreachable!() }

It relies on the fact that the Some-ness of an Option (or the tag of any repr(Rust) enum) can't rely on the tag of a repr(C/Int) enum. Or in other words, repr(C/Int) enums have opaque tags. The cost of making this work is that Option<MyEnum> would have to be larger than MyEnum.

It would be nice for this to work, but if you really need it, you can just define #[repr(u8)] enum COption<T> { ... } and use that.
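Such a COption might be sketched as follows. The variant names and the sizes helper are assumptions for illustration. Because COption carries its own u8 tag in front of the payload, it is always larger than the payload, whereas Option<MyEnum> is allowed to reuse spare tag values of MyEnum.

```rust
use std::mem::size_of;

#[allow(dead_code)]
#[repr(u8)]
enum MyEnum {
    A(u32),
    B(f32, u64),
    C { x: u32, y: u8 },
    D,
}

/// An Option-like enum with a defined layout and its own separate tag.
#[allow(dead_code)]
#[repr(u8)]
enum COption<T> {
    None,
    Some(T),
}

/// Sizes of `MyEnum`, `Option<MyEnum>` and `COption<MyEnum>`, in that order.
fn sizes() -> (usize, usize, usize) {
    (
        size_of::<MyEnum>(),
        size_of::<Option<MyEnum>>(),
        size_of::<COption<MyEnum>>(),
    )
}
```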

Unresolved questions

Currently None. 🎉

Future Extensions

Here are some quick sketches of future extensions which could be done to this design.