RFC 2325: stable-simd

libs (simd | cfg | attributes)

Feature Name: vendor_intrinsics
Start Date: 2018-02-04
RFC PR: rust-lang/rfcs#2325
Rust Issue: rust-lang/rust#48556

Summary

The purpose of this RFC is to provide a framework for SIMD to be used on stable Rust. It proposes stabilizing x86-specific vendor intrinsics, but includes the scaffolding for other platforms as well as a future portable SIMD design (to be fleshed out in another RFC).

Motivation

Stable Rust today does not typically expose all of the capabilities of the platform that you're running on. A notable gap in Rust's support includes SIMD (single instruction multiple data) support. For example on x86 you don't currently have explicit access to the 128, 256, and 512 bit registers on the CPU. LLVM is in general an excellent optimizing compiler and often attempts to make use of these registers (auto vectorizing code), but it unfortunately is still somewhat limited and doesn't express the full power of the various SIMD intrinsics.

The goal of this RFC is to enable using SIMD intrinsics on stable Rust, and in general provide a means to access the architecture-specific functionality of each vendor. For example the AES intrinsics on x86 would also be made available through this RFC, not only the SIMD-related AVX intrinsics.

Note that this is certainly not the first discussion to broach the topic of SIMD in Rust, but rather this has been an ongoing discussion for quite some time now! For example the simd crate started long ago, we've had rfcs, we've had a lot of discussions on internals, and the stdsimd crate has been implemented.

This RFC draws from much of the historical feedback and design that we've done around SIMD in Rust and is targeted at providing path forward for using SIMD on stable Rust while allowing the compiler to change in the future and retain a stable interface.

Guide-level explanation

Let's say you've just heard about this fancy feature called "auto vectorization" in LLVM and you want to take advantage of it. For example you've got a function like this you'd like to make faster:

pub fn foo(a: &[u8], b: &[u8], c: &mut [u8]) {
    for ((a, b), c) in a.iter().zip(b).zip(c) {
        *c = *a + *b;
    }
}

When inspecting the assembly you notice that rustc is making use of the %xmmN registers which you've read is related to SSE on your CPU. You know, however, that your CPU supports up to AVX2 which has bigger registers, so you'd like to get access to them!

Your first solution to this problem is to compile with -C target-feature=+avx2, and after that you see the %ymmN registers being used, yay! Unfortunately though you're publishing this binary on CPUs which may not actually have AVX2 as a feature, so you don't want to enable AVX2 for the entire program. Instead what you can do is enable it for just this function:

#[target_feature(enable = "avx2")]
pub unsafe fn foo(a: &[u8], b: &[u8], c: &mut [u8]) {
    for ((a, b), c) in a.iter().zip(b).zip(c) {
        *c = *a + *b;
    }
}

And sure enough you see the %ymmN registers getting used in this function! Note, however, that because you've explicitly enabled a feature you're required to declare the function as unsafe, as specified in RFC 2045 (although this requirement is likely to be relaxed in RFC 2212). This worked as a proof of concept but what you still need to do is dispatch at runtime whether the local CPU that you're running on supports AVX2 or not. Thankfully, though, libstd has a handy macro for this!

pub fn foo(a: &[u8], b: &[u8], c: &mut [u8]) {
    // Note that this `unsafe` block is safe because we're testing
    // that the `avx2` feature is indeed available on our CPU.
    if is_target_feature_detected!("avx2") {
        unsafe { foo_avx2(a, b, c) }
    } else {
        foo_fallback(a, b, c)
    }
}

#[target_feature(enable = "avx2")]
unsafe fn foo_avx2(a: &[u8], b: &[u8], c: &mut [u8]) {
    foo_fallback(a, b, c) // the function below is inlined here
}

fn foo_fallback(a: &[u8], b: &[u8], c: &mut [u8]) {
    for ((a, b), c) in a.iter().zip(b).zip(c) {
        *c = *a + *b;
    }
}

And sure enough once again we see that foo is dispatching at runtime to the appropriate function, and only foo_avx2 is using our %ymmN registers!

Ok great! At this point we've seen how to enable CPU features for functions-at-a-time as well as how they could be used in a larger context to do runtime dispatch to the most appropriate implementation. As we saw in the motivation, however, we're just relying on LLVM to auto-vectorize here which often isn't good enough or otherwise doesn't expose the functionality we want.

For explicit and guaranteed simd on stable Rust you'll be using a new module in the standard library, std::arch. The std::arch module is defined by vendors/architectures, not us actually! For example Intel publishes a list of intrinsics as does ARM. These exact functions and their signatures will be available in std::arch with types translated to Rust (e.g. int32_t becomes i32). Vendor specific types like __m128i on Intel will also live in std::arch.

For example let's say that we're writing a function that encodes a &[u8] in ascii hex and we want to convert &[1, 2] to "0102". The stdsimd crate currently has this as an example, and let's take a look at a few snippets from that.

First up you'll see the dispatch routine like we wrote above:

fn hex_encode<'a>(src: &[u8], dst: &'a mut [u8]) -> Result<&'a str, usize> {
    let len = src.len().checked_mul(2).unwrap();
    if dst.len() < len {
        return Err(len);
    }

    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_target_feature_detected!("avx2") {
            return unsafe { hex_encode_avx2(src, dst) };
        }
        if is_target_feature_detected!("sse4.1") {
            return unsafe { hex_encode_sse41(src, dst) };
        }
    }

    hex_encode_fallback(src, dst)
}

Here we have some routine business about hex encoding in general, and then for x86/x86_64 platforms we have optimized versions specifically for avx2 and sse41. Using the is_target_feature_detected! macro in libstd we saw above we'll dispatch to the correct one at runtime.

Taking a closer look at hex_encode_sse41 we see that it starts out with a bunch of weird looking function calls:

let ascii_zero = _mm_set1_epi8(b'0' as i8);
let nines = _mm_set1_epi8(9);
let ascii_a = _mm_set1_epi8((b'a' - 9 - 1) as i8);
let and4bits = _mm_set1_epi8(0xf);

As it turns out though, these are all Intel SIMD intrinsics! For example _mm_set1_epi8 is defined as creating an instance of __m128i, a 128-bit integer register. The intrinsic specificall sets all bytes to the first argument.

These functions are all imported through std::arch::* at the top of the example (in this case stdsimd::vendor::*). We go on to use a bunch of these intrinsics throughout the hex_encode_sse41 function to actually do the hex encoding.

The example listed currently has some tests/benchmarks as well, and if we run the benchmarks we'll see:

test benches::large_default    ... bench:      73,432 ns/iter (+/- 12,526) = 14279 MB/s
test benches::large_fallback   ... bench:   1,711,030 ns/iter (+/- 286,642) = 612 MB/s
test benches::small_default    ... bench:          30 ns/iter (+/- 18) = 3900 MB/s
test benches::small_fallback   ... bench:         204 ns/iter (+/- 74) = 573 MB/s
test benches::x86::large_avx2  ... bench:      69,742 ns/iter (+/- 9,157) = 15035 MB/s
test benches::x86::large_sse41 ... bench:     108,463 ns/iter (+/- 70,250) = 9667 MB/s
test benches::x86::small_avx2  ... bench:          25 ns/iter (+/- 8) = 4680 MB/s
test benches::x86::small_sse41 ... bench:          25 ns/iter (+/- 14) = 4680 MB/s

Or in other words, our runtime dispatch implementation ("default") is 20 times faster than the fallback implementation (no explicit SIMD). Furthermore the AVX2 implementation is nearly 2x faster than the SSE4.1 implementation for large inputs, and the SSE4.1 implementation is over 10x faster than the default fallback as well.

With std::arch and is_target_feature_detected! we've now written a program that's 20x faster on supported hardware, yet it also continues to run on older hardware as well! Not bad for a few dozen lines on each function!

Note that this RFC is explicitly not attempting to stabilize/design a set of "portable simd operations". The contents of std::arch are platform specific and provide no guarantees about portability. Efforts in the past, however, such as with simd.js and the simd crate show that it's desirable and useful to have a set of types which are usable across platforms.

Furthermore LLVM does quite a good job with a portable u32x4 type, for example, in terms of platform support and speed on platforms that support it. This RFC is not going to go too much into the details about these types, but rather these guidelines will still hold:

The intrinsics will not take portable types as arguments. For example u32x4 and __m128i will be different types on x86. The two types, however, will be convertible between one another (either via transmutes or via explicit functions). This conversion will have zero run-time cost.
The portable simd types will likely live in a module like std::simd rather than std::arch.

The design around these portable types are ongoing, however, and stay tuned for an RFC for the std::simd module!

Reference-level explanation

Stable SIMD in Rust ends up requiring a surprising number of both language and library features to be productive. Thankfully, though, there's has been quite a bit of experimentation over time with SIMD in Rust and we've gotten a lot of good experience along the way! In this section, though, we'll be going into the various features in detail.

The `#[target_feature]` Attribute

The #[target_feature] attribute was specified in RFC 2045 and remains unchanged from that specification. As a quick recap it allows you to add this attribute to functions:

#[target_feature(enable = "avx2")]

The only currently allowed key is enable (one day we may allow disable). The string values accepted by enable will be separately stabilized but are likely to be guided by vendor definitions. For example in Intel's intrinsic guide it lists functions under "AVX2", so we're likely to stabilize the name avx2 for Rust.

There's a good number of these features supported by the compiler today. It's expected that when stabilizing other pieces of this RFC the names of the following existing features for x86 will be stabilized:

aes
avx2
avx
bmi2
bmi - to be renamed to bmi1, the name Intel gives it
fma
fxsr
lzcnt
popcnt
rdrnd
rdseed
sse2
sse3
sse4.1
sse4.2
sse
ssse3
xsave
xsavec
xsaveopt
xsaves

Note that AVX-512 names are missing from this list, but that's because we haven't implemented any AVX-512 intrinsics yet. Those'll get stabilized on their own once implemented. Additionally note that mmx is missing from this list. For reasons discussed later, it's proposed that MMX types are omitted from the first pass of stabilization. AMD also has some specific features supported (sse4a, tbm), and so do ARM, MIPS, and PowerPC, but none of these feature names a proposed for becoming stable in the first pass.

The `target_feature` value in `#[cfg]`

In addition to enabling target features for a function the compiler will also allow statically testing whether a particular target feature is enabled. This corresponds to the cfg_target_feature feature today in rustc, and can be seen via:

#[cfg(target_feature = "avx")]
fn foo() {
    // implementation that can use `avx`
}

#[cfg(not(target_feature = "avx"))]
fn foo() {
    // a fallback implementation
}

Additionally this is also made available to cfg!:

if cfg!(target_feature = "avx") {
    println!("this program was compiled with AVX support");
}

The #[cfg] attribute and cfg! macro statically resolve and do not do runtime dispatch. Tweaking these functions is currently done via the -C target-feature flag to the compiler. This flag to the compiler accepts a similar set of strings to the ones specified above and is already "stable".

The `is_target_feature_detected!` Macro

One mode of operation with intrinsics is to compile part of a program with certain CPU features enabled but not the entire program. This way a portable program can be compiled which runs across a broad range of hardware which can still benefit from optimized implementations for particular hardware at runtime.

The crux of this support in libstd is this macro provided by libstd, is_target_feature_detected!. The macro will accept one argument, a string literal. The string can be any feature passed to #[target_feature(enable = ...)] for the platform you're compiling for. Finally, the macro will resolve to a bool result.

For example on x86 you could write:

if is_target_feature_detected!("sse4.1") {
    println!("this cpu has sse4.1 features enabled!");
}

It would, however, be an error to write this on x86 cpus:

is_target_feature_detected!("neon"); //~ COMPILE ERROR: neon is an ARM feature, not x86
is_target_feature_detected!("foo"); //~ COMPILE ERROR: unknown target feature for x86

The macro is intended to be implemented in the std crate (not core) and made available via the normal macro preludes. The implementation of this macro is expected to be what stdsimd does today, notably:

The first time the macro is invoked all the local CPU features will be detected.
The detected features will then be cached globally (when possible and currently in a bitset) for the rest of the execution of the program.
Further invocations of is_target_feature_detected! are expected to be cheap runtime dispatches. (aka load a value and check whether a bit is set)
Exception: in some cases the result of the macro is statically known: for example, is_target_feature_detected!("sse2") when the binary is being compiled with "sse42" globally. In these cases, none of the steps above are performed and the macro just expands to true.

The exact method of CPU feature detection various by platform, OS, and architecture. For example on x86 we make heavy use of the cpuid instruction whereas on ARM the implementation currently uses getauxval//proc mounted information on Linux. It's expected that the detection will vary for each particular target, as necessary.

Note that the implementation details of the macro today prevent it from being located in libcore. If getauxval or /proc is used that requires libc to be available or File in one form or another. These concerns are currently std-only (not available in libcore). This is also a conservative route for x86 where it is possible to do CPU feature detection in libcore (as it's just the cpuid instruction), but for consistency across platforms the macro will only be available in libstd for now. This placement can of course be relaxed in the future if necessary.

The `std::arch` Module

This is where the real meat is. A new module will be added to the standard library, std::arch. This module will also be available in core::arch (and std will simply reexport it). The contents of this module provide no portability guarantees (like std::os and unlike the rest of std). APIs present on one platform may not be present on another.

The contents of the arch modules are defined by, well, architectures! For example Intel has an intrinsics guide which will serve as a guideline for all contents in the arch module itself. The standard library will not deviate in naming or type signature of any intrinsic defined by an architecture.

For example most Intel intrinsics start with _mm_ or _mm256_ for 128 and 256-bit registers. While perhaps unergonomic, we'll be sticking to what Intel says. Note that all intrinsics will also be unsafe, according to RFC 2045.

Function signatures defined by architectures are typically defined in terms of C types. In Rust, however, those aren't always available! Instead the intrinsics will be defined in terms of Rust-specific types. Some types are easily translated such as int32_t, but otherwise a different mapping may be applied per-architecture.

The current proposed mapping for x86 intrinsics is:

What Intel says	Rust Type
`void*`	`*mut u8`
`char`	`i8`
`short`	`i16`
`int`	`i32`
`long long`	`i64`
`const int`	`i32` [0]

[0] required to be compile-time constants.

Other than these exceptions the x86 intrinsics will be defined exactly as Intel defines them. This will necessitate new types in the std::arch modules for SIMD registers! For example these new types will all be present in std::arch on x86 platforms:

__m128
__m128d
__m128i
__m256
__m256d
__m256i

(note that AVX-512 types will come in the future!)

Infrastructure-wise the contents of std::arch are expected to continue to be defined in the stdsimd crate/repository. Intrinsics defined here go through a rigorous test suite involving automatic verification against the upstream architecture defintion, verification that the correct instruction is generated by LLVM, and at least one runtime test for each intrinsic to ensure it not only compiles but also produces correct results. It's expected that stabilized intrinsics will meet these critera to the best of their ability.

Currently today on x86 and ARM platforms the stdsimd crate performs all these checks, but these checks are not yet implemented for PowerPC/MIPS/etc, but that's always just some more work to do!

It's not expected that the contents of std::arch will remain static for all time. Rather intrinsics will continue to be implemented in stdsimd and make their way into the main Rust repository. For example there are not currently any implemented AVX-512 intrinsics, but that doesn't mean we won't implement them! Rather once implemented they'll be stabilized and included in libstd following the Rust release model.

The types in `std::arch`

It's worth paying close attention to the types in std::arch. Types like __m128i are intended to represent a 128-bit packed SIMD register on x86, but there's nothing stopping you from using types like Option<__m128i> in your program! Most generic containers and such probably aren't written with packed SIMD types in mind, and it'd be a bummer if everything stopped working once you used a packed SIMD type in one of them.

Instead it will be required that the types defined in std::arch do indeed work when used in "nonstandard" contexts. For example Option<__m128i> should never produce a compiler error or a codegen error when used (it may just be slower than you expect). This requires special care to be taken both in representation of these arguments as well as their ABI.

Implementation-wise these packed SIMD types are implemented in terms of a "portable" vector in LLVM. LLVM as a results gets most of this logic correct for us in terms of having these compile without errors in many contexts. The ABI, however, had to be special cased as it was a location where LLVM didn't always help us.

The Rust ABI will currently be implemented such that all related packed-SIMD types are passed via memory instead of by-value. This means that regardless of the target features enabled for a function everything should agree on how packed SIMD arguments are passed across boundaries and whatnot.

Again though, note that this section is largely an implementation detail of SIMD in Rust today, though it's enabling usage without a lot of codegen errors popping up all over the place.

Intrinsics in `std::arch` and constant arguments

There are a number of intrinsics on x86 (and other) platforms that require their arguments to be constants rather than decided at runtime. For example _mm_insert_pi16 requires its third argument to be a constant value where only the lowest two bits are used. The Rust type system, however, does not currently have a stable way of expressing this information.

Eventually we will likely have some form of const arguments or const machinery to guarantee that these functions are called and monomorphized with constant arguments, but for now this RFC proposes taking a more conservative route forward. Instead we'll, for the time being, forbid the functions from being invoked with non-constant arguments. Prototyped in #48018 the stdsimd crate will have an unstable attribute where the compiler can help provide this guarantee. As an extra precaution as well #48078 also implements disallowing taking a function pointer to these intrinsics, requiring a direct invocation.

It's hoped that this restriction will allow stdsimd to be forward compatible with a future const-powered world of Rust but in the meantime not otherwise block stabilization of these intrinsics.

Portable packed SIMD

So-called "portable" packed SIMD types are currently implemented in both the stdsimd and simd crates. These types look like u8x16 and explicitly specify how many lanes they have (16 in this case) and what type each line is (u8 in this case). These types are intended to unconditionally available (like the rest of libstd) and simply optimized much more aggressively on platforms that have native support for the various operations.

For example u8x16::add may be implemented differently on i586 vs i686, and also entirely differently implemented on ARM. The idea with portable packed SIMD types is that they represent a broad intersection of fast behavior across a broad range of platforms.

It's intended that this RFC neither includes nor rules out the addition of portable packed-SIMD types in Rust. It's expected that an upcoming RFC will propose the addition of these types in a std::simd module. These types will be orthogonal to scalable-vector types which are expected to be proposed in another, also different, RFC. What this RFC does do, however, is explicitly specify that:

The portable SIMD types (both packed and scalable) will not be used in intrinsics.
The per-architecture SIMD types will be distinct types from the portable SIMD types.

Or, in other words, it's intended that portable SIMD types are entirely decoupled from intrinsics. If they both end up being implemented then there will be jkro-cost interoperation between them, but neither will necessarily depend on the other.

Not stabilizing MMX in this RFC

This RFC proposed notably omitting the MMX intrinsics, or those related to __m64 in other words. The MMX type __m64 and the intrinsics have been somewhat problematic in a number of ways. Known cases include:

Due to these issues having an unclear conclusion as well as a seeming lack of desire to stabilize MMX intrinsics, the __m64 and all related intrinsics will not be stabilized via this RFC.

Drawbacks

This RFC represents a significant addition to the standard library, maybe one of the largest we've ever done! As a result alternate implementations of Rust will likely have a difficult time catching up to rustc/LLVM with all the SIMD intrinsics. Additionaly the semantics of "packed SIMD types should work everywhere" may be overly difficult to implement in alternate implementations. It is worth noting that both Cranelift and GCC support packed SIMD types.

Due to the enormity of what's being added to the standard library it's also infeasible to carefully review each addition in isolation. While there are a number of automatic verifications in place we're likely to inevitably make a mistake when stabilizing something. Fixing a stabilization can often be quite difficult and costly.

Rationale and alternatives

Over the years quite a few iterations have happened for SIMD in Rust. This RFC draws from as many of those as it can and attempts to strike a balance between exposing functionality while still allowing us to implement everything in a stable fashion for years to come (and without blocking us from updating LLVM, for example). Despite this there's a few alternatives we could do as well.

Portable types in architecture interfaces

It was initially attempted in the stdsimd crate that we would use the portable types on all of the intrinsics. For example instead of:

pub unsafe fn _mm_set1_epi8(val: i8) -> __m128i;

we would instead define

pub unsafe fn _mm_set1_epi8(val: i8) -> i8x16;

The latter definition here is much easier for a beginner to SIMD to read (or at least I gawked when I first saw __m128i).

The downside of this approach, however, is that Intel isn't telling us what to do. While that may sound simple, this RFC is proposing an addition of thousands of functions to the standard library in a stable manner. It's infeasible for any one person (or even the entire libs team) to scrutinize all functions and assess whether the correct signature is applied (aka was it i8x16 or i16x8?)

Furthermore not all intrinsics from Intel actually have an interpretation with one of the portable types. For example some intrinsics take an integer constant which when 0 interprets the input as u8x16 and when 1 interprets it as u16x8 (as an example). This effectively means that there isn't a correct choice in all situations for what portable type should be used.

Consequently it's proposed that instead of portable types the exact architecture types are used in all intrinsics. This provides us a much easier route to stabilization ("make sure it's what Intel says") along with no need to interpret what Intel does and attempt to find the most appropriate type.

There is interest by both current stdsimd maintainers and users to expose a "better-typed" SIMD API in crates.io that builds on top of the intrinsics proposed for stabilization here.

Stabilizing SIMD implementation details

Another alternative to the bulk of this RFC is allowing more raw access to the internals of LLVM. For example stabilizing #[repr(simd)] or the ability to write extern "platform-intrinsics" { ... } or #[link_llvm_intrinsic...]. This is certainly a much smaller surface area to stabilize (aka not thousands of intrinsics).

This avenue was decided against, however, for a few reasons:

Such raw interfaces may change over time as they simply represent LLVM as a current point in time rather than what LLVM wants to do in the future.
Alternate implementations of rustc or alternate rustc backends like Cranelift may not expose the same sort of functionality that LLVM provides, or implementing the interfaces may be much more difficult in alternate backends than in LLVM's.

As a result, it's intended that instead of exposing raw building blocks (and allowing stdsimd to live on crates.io) we'll instead pull in stdsimd to the standard library and expose it as the stable interface to SIMD in Rust.

Unresolved questions

There's a number of unresolved questions around stabilizing SIMD today which don't pose serious blockers and may also wish to be considered open bugs rather than blocking stabilization:

Relying on unexported LLVM APIs

The static detection performed by cfg! and #[cfg] currently relies on a Rust-specific patch to LLVM. LLVM internal knows all about hierarchies of features and such. For example if you compile with -C target-feature=+avx2 then cfg!(target_feature = "sse2") also needs to resolve to true. Rustc, however, does not know about these features and relies on learning this information through LLVM.

Unfortunately though LLVM does not actually export this information for us to consume (as far as we know). As a result we have a local patch which exposed this information for us to read. The consequence of this implementation detail is that when compiled against the system LLVM the cfg! macro may not work correctly when used in conjunction with -C target-feature or -C target-cpu flags.

It appears that clang vendors and/or duplicates LLVM's functionality in this regard. It's an option for rustc to do the same but it may also be an option to expose the information in upstream LLVM. So far there appears to have been no attempts to upstream this patch into LLVM itself.

Packed SIMD types in `extern` functions are not sound

The packed SIMD types have particular care paid to them with respect to their ABI in Rust and how they're passed between functions, notably to ensure that they work properly throughout Rust programs. The "fix" to pass them in memory over function calls, however, was only applied to the "Rust" ABI and not any other function ABIs.

A consequence of this change is that if you instead label all your functions extern then the same bug will arise. It may be possible to implement a "lint" or a compiler error of sorts to forbid this situation in the short term. We could also possibly accept this as a known bug for the time being.

What if we're wrong?

Despite the CI infrastructure of the stdsimd crate it seems inevitable that we'll get an intrinsic wrong at some point. What do we do in a situation like that? This situation is somewhat analagous to the libc crate but there you can fix the problem downstream (just have a corrected type/definition) for vendor intrinsics it's not so easy.

Currently it seems that our only recourse would be to add a 2 suffix to the function name or otherwise indicate there's a corrected version, but that's not always the best...