Niche optimizations in Rust
For my fellow Rust enthusiasts, it is well known that the compiler is able to perform some optimizations, called "niche" optimizations, that can make, for example, an `Option<NonZeroU32>` fit in 32 bits. In this article, I will try to understand how this extends to some other enum types, and how the compiler achieves it.
Note: at the time of writing this article, I am running the following toolchain: stable-x86_64-unknown-linux-gnu - rustc 1.79.0 (129f3b996 2024-06-10)
First of all, a quick disclaimer. I do not know much about language design, nor about compiler design. The most advanced thing I have done is writing a small compiler in an afternoon class, using JFlex for the lexer and another tool that I cannot even remember for the parser. This article will show my very brief understanding, gained by only looking at the code and maybe diving into some documentation, hoping that I can learn something, and you too along the way.
Basic understanding of the enum representation
First, let's check that I am not completely wrong about my current understanding of niche optimization. From what I understand, because `NonZeroU32` represents a `u32` that cannot be zero, the compiler understands that it can use the all-zeros bit pattern to represent the `None` variant of the `Option` enum.
To check this assumption, we can write a very simple main function that transmutes our `None` variant to a `u32`.
use std::num::NonZeroU32;

fn main() {
let variant: Option<NonZeroU32> = None;
let transmuted: u32 = unsafe {
std::mem::transmute(variant)
};
println!("{transmuted}");
}
// Output: 0
So indeed, `None` is represented as a zero. If I replaced `NonZeroU32` with a simple `u32`, I would expect the representation of the `None` variant to be different, because `u32` already uses zero to represent… well… zero.
fn main() {
let variant: Option<u32> = None;
let transmuted: u64 = unsafe {
std::mem::transmute(variant)
};
println!("{transmuted}");
println!("{transmuted:b}");
}
// Output:
// 139835545223168
// 11111110010111000000000000000000000000000000000
So here, the `None` variant is not represented by 0 anymore, because 0 is already taken, but by 139835545223168. This number does not ring a bell and appears absolutely random to me, but we will hopefully understand more about it later. One interesting thing to note, however, is that it does not use any of the 32 lowest bits (the 33 lowest bits, to be exact). If I run the program again (without even recompiling!), I get different representations for this `None` variant! Here is a sample of three outputs:
$ ./target/debug/niche
140127602999296
11111110111001000000000000000000000000000000000
$ ./target/debug/niche
140685948747776
11111111111010000000000000000000000000000000000
$ ./target/debug/niche
140166257704960
11111110111101100000000000000000000000000000000
What about the representation of `Some`, then?
fn main() {
let variant: Option<u32> = Some(0);
let transmuted: u64 = unsafe {
std::mem::transmute(variant)
};
println!("{transmuted}");
println!("{transmuted:#066b}");
}
// Some(0): 0b0000000000000000000000000000000000000000000000000000000000000001
// Some(1): 0b0000000000000000000000000000000100000000000000000000000000000001
// Some(2): 0b0000000000000000000000000000001000000000000000000000000000000001
So, it looks like the low 32 bits, which always seemed to be 0 with the `None` variant, now hold the value 1. This is called the discriminant. The upper 32 bits are then used to represent the `u32` placed within the `Option`. So matching the variant against `Some` or `None` actually amounts to checking whether the low 32 bits are 0 (for `None`) or not (for `Some`). The upper 32 bits do not matter for `None`, which is why we can allow them to be random, but they represent our value in the `Some` case.
If you have read my previous blog posts, you know that I am trying to get a bit familiar with LLVM, so let's look at the LLVM IR to see what happens when we match over our `Option<u32>`, and check that our understanding is correct.
fn main() {
let variant: Option<u32> = None;
let transmuted: u64 = unsafe {
std::mem::transmute(variant)
};
let help_debug = match variant {
None => 0,
Some(1) => 1,
Some(_) => 2
};
}
start:
%help_debug = alloca [4 x i8], align 4
%transmuted = alloca [8 x i8], align 8
%variant = alloca [8 x i8], align 4
call void @llvm.dbg.declare(metadata ptr %variant, metadata !186, metadata !DIExpression()), !dbg !208
call void @llvm.dbg.declare(metadata ptr %transmuted, metadata !203, metadata !DIExpression()), !dbg !209
call void @llvm.dbg.declare(metadata ptr %help_debug, metadata !206, metadata !DIExpression()), !dbg !210
store i32 0, ptr %variant, align 4, !dbg !211
%0 = load i32, ptr %variant, align 4, !dbg !212
%1 = getelementptr inbounds i8, ptr %variant, i64 4, !dbg !212
%2 = load i32, ptr %1, align 4, !dbg !212
store i32 %0, ptr %transmuted, align 8, !dbg !212
%3 = getelementptr inbounds i8, ptr %transmuted, i64 4, !dbg !212
store i32 %2, ptr %3, align 4, !dbg !212
%4 = load i32, ptr %variant, align 4, !dbg !213
%_4 = zext i32 %4 to i64, !dbg !213
%5 = icmp eq i64 %_4, 0, !dbg !214
br i1 %5, label %bb4, label %bb2, !dbg !214
bb4: ; preds = %start
store i32 0, ptr %help_debug, align 4, !dbg !215
br label %bb6, !dbg !215
bb2: ; preds = %start
%6 = getelementptr inbounds i8, ptr %variant, i64 4, !dbg !214
%7 = load i32, ptr %6, align 4, !dbg !214
%8 = icmp eq i32 %7, 1, !dbg !214
br i1 %8, label %bb5, label %bb3, !dbg !214
bb5: ; preds = %bb2
store i32 1, ptr %help_debug, align 4, !dbg !217
br label %bb6, !dbg !217
bb3: ; preds = %bb2
store i32 2, ptr %help_debug, align 4, !dbg !218
br label %bb6, !dbg !218
So, if we skip the first six lines of the `start` block, we see that we begin by setting the first 32 bits of `%variant` to 0, which is our way to represent `None`. Here, we need to note that my computer is little-endian, so these first 32 bits in memory correspond to the trailing 32 bits of the binary representation that we saw earlier.
Note that if I were to set `variant` to `Some(2)` instead of `None`, this line would be replaced by:
%0 = getelementptr inbounds i8, ptr %variant, i64 4, !dbg !211
store i32 2, ptr %0, align 4, !dbg !211
store i32 1, ptr %variant, align 4, !dbg !211
which actually sets the first 32 bits to 1, and the following 32 bits to 2. The 1 marks the `Some` variant, and then the following bits are the value contained by the option.
Then, we load the next 32 bits of `variant` (random) into `%2`. I think we need two 32-bit loads instead of a single 64-bit load because of the 4-byte alignment of `Option<u32>`. We then perform two stores to write the value of `variant` into `%transmuted`. Then, the `match` happens. We load the first 32 bits of `variant` into `%4`; they should represent the enum variant. We zero-extend them to a 64-bit integer and compare the result with 0. If it is zero, we jump to `bb4`, which stores 0 in `help_debug`. This is our `None` variant! Otherwise, we jump to `bb2`, where we read the next 32 bits of `variant` (the `u32` value) and compare them to 1. We then do another conditional jump, representing the two remaining arms of our `match`.
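To convince myself of this reading, here is a small round-trip sketch going in the other direction. It relies on the layout we just observed, which is not something the language guarantees, so treat it as an experiment rather than sound code:

use std::mem::transmute;

fn main() {
    // Assumed layout (observed above, not guaranteed): on little-endian,
    // the low 32 bits hold the discriminant and the high 32 bits hold the payload.
    let fake_none: u64 = 0xDEAD_BEEF_0000_0000; // garbage payload, discriminant 0
    let fake_some: u64 = 0x0000_0002_0000_0001; // payload 2, discriminant 1
    let a: Option<u32> = unsafe { transmute(fake_none) };
    let b: Option<u32> = unsafe { transmute(fake_some) };
    assert_eq!(a, None); // the garbage in the high bits is ignored for None
    assert_eq!(b, Some(2));
}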
Alright, we strayed a bit far from the niche optimization that we were initially looking for, but we have stepped up our enum representation game a little.
We will not analyze the `release` build to see if the representation is different, because… well… it simply does not compile.
$ cargo build --release
thread 'rustc' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/compiler/rustc_codegen_ssa/src/mir/operand.rs:137:9:
assertion failed: alloc_align >= layout.align.abi
stack backtrace:
<...>
error: the compiler unexpectedly panicked. this is a bug.
note: we would appreciate a bug report: https://github.com/rust-lang/rust/issues/new?labels=C-bug%2C+I-ICE%2C+T-compiler&template=ice.md
<...>
query stack during panic:
end of query stack
To keep the article readable, I redacted most of the stack backtrace, but because the compiler kindly asks us to create an issue, you can find the full backtrace here. The good side of this is that it gives us an entry point for later, when we will want to look into the code! Something seems to be happening in the assertion `alloc_align >= layout.align.abi`. Interestingly, it seems to crash because of an unexpected alignment. And if we look at the first two lines of our LLVM intermediate representation, we can see that the alignments of `u64` and `Option<u32>` are indeed different. Maybe I will finally have a reason to dive into alignment and understand why things are aligned the way they are.
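As a quick sanity check of that alignment difference, we can ask the compiler directly (these are the values I would expect on x86_64; they are platform-dependent):

use std::mem::align_of;

fn main() {
    // Option<u32> only needs the alignment of u32, while u64 is 8-byte aligned.
    assert_eq!(align_of::<Option<u32>>(), 4);
    assert_eq!(align_of::<u64>(), 8);
}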
Now, one question remains. How does the compiler choose the size of the type `Option<u32>`, and how does it choose which bits will represent the variant, and which bits will store the value? Answering this question will certainly guide us to the code for the niche optimization.
Looking at the code
I started by searching for any occurrence of the word niche in the Rust main repo on the `stable` branch, currently at commit 129f3b9964af4d4a709d1383930ade12dfe7c081. Two things popped up. First, the `rustc_abi` crate has some nice mentions of a struct called `Niche`, and contains a struct `LayoutS`, which has a field `largest_niche: Option<Niche>`. So we definitely want to look in there at some point. Also, the file RELEASES.md contains three occurrences of the word niche, with their related pull requests:
- [Improve niche placement by trying two strategies and picking the better result.](https://github.com/rust-lang/rust/pull/108106/)
- [Use niche-filling optimization even when multiple variants have data](https://github.com/rust-lang/rust/pull/94075/)
- [On Unix platforms, the `std::fs::File` type now has a "niche" of `-1`.][74699]
This value cannot be a valid file descriptor, and now means `Option<File>` takes
up the same amount of space as `File`.
With some side research (typing rust niche into my favorite search engine), I also found this merge request.
Hopefully, by the end of this article, we will be able to properly understand what is happening in these four merge requests.
To help myself understand the code a bit better, I checked out the Rust repo as it was before all these merge requests, in the hope that the code at that time would be simpler to walk through. Once we understand the niche handling at that commit more clearly, we will go sequentially through the merge requests and figure out what changes, and why.
So I am actually checking out the 9bb77da74da commit for now. Looking for the word niche, I ended up in the file rustc_middle/src/ty/layout.rs, and was specifically interested in the function `layout_of_uncached`. I will quickly skim over what I understand from this function. I will not copy-paste the code, because it is more than 300 lines long, and that would not be very digestible in a blog post. If you want to tag along, however, here is a link.
The goal of the function is to compute the layout for a given type. It starts with a huge `match` over the different kinds of types. The arm that interests us is the one for `Adt`, which in this case stands for Algebraic Data Type. The first thing we do is recursively compute the layout of the fields of each variant (if this is not an enum, then I guess there will be only one variant). This recursively calls the same `match` that we have seen, until it finds a base case (a scalar, a boolean, …). Then, there is a special treatment for `union` types, but I will not dive into it because I have actually never used `union`s in Rust.
Then we basically check whether the type that we are looking at is a struct (or a univariant enum equivalent to a struct, such as
enum A {
B {x: u8, y: u8}
}
), or not.
If it is, we call `self.univariant_uninterned` to compute the layout of a struct. A lot of things happen during this computation, but one important thing is that the `largest_niche` of the layout is set to the largest `largest_niche` of its fields. I think this means that if you actually have two niches inside a struct, only one will be used later. So the second assertion in the following snippet will fail:
use std::mem::size_of;
use std::num::NonZeroU32;

pub struct Test {
    a: NonZeroU32,
    b: NonZeroU32,
}

fn main() {
    assert_eq!(size_of::<Option<Test>>(), 2 * size_of::<u32>());
    assert_eq!(size_of::<Option<Option<Test>>>(), 2 * size_of::<u32>());
    // assertion `left == right` failed
    //   left: 12
    //  right: 8
}
Then, if we go back to our layout computation, we are left with the case that interested us in the first place: computing the layout of an enum. In this case, we start by trying to compute a first layout called `niche_filling_layout`.
For this, we iterate through the variants and try to find a dataful variant (a variant with data in it). At the same time, we also count the variants that will need to be encoded in the niche, by updating the `niche_variants` range.
let mut dataful_variant = None;
let mut niche_variants = VariantIdx::MAX..=VariantIdx::new(0);
// Find one non-ZST variant.
'variants: for (v, fields) in variants.iter_enumerated() {
if absent(fields) {
continue 'variants;
}
for f in fields {
if !f.is_zst() {
if dataful_variant.is_none() {
dataful_variant = Some(v);
continue 'variants;
} else {
dataful_variant = None;
break 'variants;
}
}
}
niche_variants = *niche_variants.start().min(&v)..=v;
}
From what I understand here, if we find more than one dataful variant, then `dataful_variant` will be set back to `None`, so we will not be able to perform niche optimization.
I think this has actually been improved in one of the merge requests mentioned earlier, Use niche-filling optimization even when multiple variants have data, which we will check later on.
In the case where there is only one dataful variant, we iterate through its fields and find the one whose niche has the most available space. We then try to `reserve` some space in that niche to store our variants (we computed the space we need earlier with the `niche_variants` variable). If we succeed, we proceed to compute alignment, offsets and size, and return the layout, stating that the variants are encoded with a niche, and that this niche is the one we found earlier. We also state that our type has a niche available, which is whatever is left of the niche we just used.
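To see this reservation step in action, here is a small experiment. The sizes are what I observe with the toolchain mentioned at the top; enum layout is not guaranteed by the language:

use std::mem::size_of;
use std::num::NonZeroU32;

// NonZeroU32 offers a niche of exactly one value (0), so reserving space for
// one dataless variant succeeds, but reserving space for two does not.
#[allow(dead_code)]
enum OneDataless { A, B(NonZeroU32) }
#[allow(dead_code)]
enum TwoDataless { A, B, C(NonZeroU32) }

fn main() {
    assert_eq!(size_of::<OneDataless>(), 4); // A is encoded as 0 in the niche
    assert_eq!(size_of::<TwoDataless>(), 8); // niche too small, a separate tag is needed
}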
Once this layout using the niche is computed (or not, if we did not find a niche, or it was too small, for example), we still compute another layout, one that does not use niche optimization. However, because it will take some new space to store the discriminant, this new space may itself contain a niche, which could be used for another niche optimization if our type is used inside a variant of another enum. For example,
enum E {
Something,
SomethingElse(Option<u32>)
}
assert_eq!(size_of::<E>(), size_of::<Option<u32>>());
because the type `Option<u32>` needed more space to store its own discriminant, a niche became available, and it has been used to store the discriminant of `E`, thus making the `Something` variant actually take no extra space.
Once we have our two layouts (if the niche layout exists at all), we pick the best one, meaning the smaller one, or the one with the largest niche if they have the same size. If the niches are also the same size, we take the one without niche optimization (the second one we computed), because its codegen will be simpler.
Nice! We now understand how the compiler chooses whether or not to use a niche. We have also seen that sometimes, niches are available because the discriminant itself contains a niche. But that does not explain all the niches that we know of.
How is the range of possible values defined?
A good place to start is certainly the example I mentioned at the beginning of the blog post. One of the most well-known examples of niche usage is the following.
assert_eq!(size_of::<Option<NonZeroU32>>(), size_of::<u32>());
From what we have seen earlier, because the compiler knows that `NonZeroU32` has a niche (the value 0), and that `Option` only has two variants, it can use 0 to store the discriminant of the enum. But we still need to understand how the compiler knows this. To discover it, we can go look at the code for the `NonZero` types. These types can be found in core/src/num/nonzero.rs.
nonzero_integer! {
Self = NonZeroU32,
Primitive = unsigned u32,
}
where
macro_rules! nonzero_integer {
(
#[$stability:meta]
Self = $Ty:ident,
Primitive = $signedness:ident $Int:ident,
$(UnsignedNonZero = $UnsignedNonZero:ident,)?
UnsignedPrimitive = $UnsignedPrimitive:ty,
// Used in doc comments.
leading_zeros_test = $leading_zeros_test:expr,
) => {
/// An integer that is known not to equal zero.
/// [A bunch of doc]
pub type $Ty = NonZero<$Int>;
/// [A bunch of impl]
}
The `NonZero` type uses a newtype pattern over a `NonZeroInner` type, as follows:
#[repr(transparent)]
#[rustc_nonnull_optimization_guaranteed]
#[rustc_diagnostic_item = "NonZero"]
pub struct NonZero<T: ZeroablePrimitive>(T::NonZeroInner);
and the zeroable primitives are implemented as follows
impl_zeroable_primitive!(
NonZeroU8Inner(u8),
NonZeroU16Inner(u16),
NonZeroU32Inner(u32),
NonZeroU64Inner(u64),
NonZeroU128Inner(u128),
NonZeroUsizeInner(usize),
NonZeroI8Inner(i8),
NonZeroI16Inner(i16),
NonZeroI32Inner(i32),
NonZeroI64Inner(i64),
NonZeroI128Inner(i128),
NonZeroIsizeInner(isize),
);
with the following macro:
macro_rules! impl_zeroable_primitive {
($($NonZeroInner:ident ( $primitive:ty )),+ $(,)?) => {
mod private {
#[unstable(
feature = "nonzero_internals",
reason = "implementation detail which may disappear or be replaced at any time",
issue = "none"
)]
#[const_trait]
pub trait Sealed {}
$(
#[derive(Debug, Clone, Copy, PartialEq)]
#[repr(transparent)]
#[rustc_layout_scalar_valid_range_start(1)]
#[rustc_nonnull_optimization_guaranteed]
#[unstable(
feature = "nonzero_internals",
reason = "implementation detail which may disappear or be replaced at any time",
issue = "none"
)]
pub struct $NonZeroInner($primitive);
)+
}
$(
#[unstable(
feature = "nonzero_internals",
reason = "implementation detail which may disappear or be replaced at any time",
issue = "none"
)]
impl const private::Sealed for $primitive {}
#[unstable(
feature = "nonzero_internals",
reason = "implementation detail which may disappear or be replaced at any time",
issue = "none"
)]
unsafe impl const ZeroablePrimitive for $primitive {
type NonZeroInner = private::$NonZeroInner;
}
)+
};
}
This is where we get our answer. The `NonZeroInner` type is defined here, with the following attribute above it:
#[rustc_layout_scalar_valid_range_start(1)]
This states that the range of valid values starts at 1, and thus, that it cannot be zero.
The value of this attribute is then read at compile time in rustc_middle/src/ty/context.rs, in `layout_scalar_valid_range`. This is then used when computing the layout of a struct, by finding niches in `scalar` fields, in the function `layout_of_uncached` that we mentioned earlier. Everything is now a bit clearer, from how the compiler knows there is a niche for a given type, to how it uses it to optimize the storage of the enum discriminant.
Trying to create our own niches
We have already seen a few examples where we created our own niches. For example, we have seen that when we add a discriminant to an enum, it can potentially create a niche that will be used to store further discriminants, if this enum is used in a field of a variant of another enum. We have also seen that the standard library uses `rustc_layout_scalar_valid_range_start` to tell the compiler that some values are invalid and that it can use them as a niche. In this section, we will see how to do the same with our own types.
For example, we could define a new type `NonMaxU8`, which would represent a `u8` that cannot take the value `u8::MAX` (255).
#![feature(rustc_attrs)]
#![allow(internal_features)]
use std::mem::size_of;
#[rustc_layout_scalar_valid_range_end(254)]
#[derive(Debug)]
pub struct NonMaxU8(u8);
fn main() {
assert_eq!(size_of::<Option<NonMaxU8>>(), size_of::<u8>());
}
// OK
Note that this only compiles on nightly, because the `#[rustc_layout_scalar_valid_range_end]` attribute is just used to enable niche optimizations in libcore and libstd and will never be stable.
Because we told the compiler that the range of possible values for the inner part of `NonMaxU8` ends at 254, it can use the value 255 as a niche to store the discriminant for the `Option`. However, we did not create a nice interface like the standard library does for the `NonZero` integers, so we could misuse our type as follows:
fn main() {
let x = Some(unsafe {NonMaxU8(255)});
assert!(x.is_none());
}
// OK
We can note that if we further reduce the range of possible values, the niche gets larger, and we can store even more discriminant values.
#[rustc_layout_scalar_valid_range_end(250)]
#[derive(Debug)]
pub struct NonMaxU8(u8);
fn main() {
assert_eq!(size_of::<Option<NonMaxU8>>(), size_of::<u8>());
assert_eq!(size_of::<Option<Option<NonMaxU8>>>(), size_of::<u8>());
assert_eq!(size_of::<Option<Option<Option<NonMaxU8>>>>(), size_of::<u8>());
}
It is worth noting, however, that while we gained some space in memory, this could be detrimental performance-wise. For example, the CPU would need to do a bitwise `&` to get the discriminant, and then match over it. If the discriminant were simply stored in 32 new bits, the CPU could simply check these bits without needing to do a bitwise `&`.
A pointer does not have that many niches
I always (naively) thought that pointers were a great source of niches. Indeed, on ARM, some of the higher bits of 64-bit pointers are used to store the PAC, because you do not need 64 bits to store an address unless you have 18446744 terabytes of RAM. However, when looking at the code, it occurred to me that 64-bit pointers (my machine is using x64) only have one niche!
assert_eq!(size_of::<Option<&u8>>(), size_of::<&u8>()); // OK
assert_eq!(size_of::<Option<Option<&u8>>>(), size_of::<&u8>()); // FAILS
It comes from the following code in `layout_of_uncached`.
ty::Ref(_, pointee, _) | ty::RawPtr(ty::TypeAndMut { ty: pointee, .. }) => {
let mut data_ptr = scalar_unit(Pointer);
if !ty.is_unsafe_ptr() {
data_ptr.valid_range = data_ptr.valid_range.with_start(1);
}
let pointee = tcx.normalize_erasing_regions(param_env, pointee);
if pointee.is_sized(tcx.at(DUMMY_SP), param_env) {
return Ok(tcx.intern_layout(Layout::scalar(self, data_ptr)));
}
}
As you can see, in our case the pointee has a sized type (`u8`), so the only niche that we have comes from the line `data_ptr.valid_range = data_ptr.valid_range.with_start(1);`, because our reference cannot be a null pointer. There is no upper bound on the pointer's value. If the type of our pointee were unsized (for example with a `dyn`), then we might have a larger niche in the metadata (e.g. the pointer to the vtable for `dyn`), but that's it.
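The `if !ty.is_unsafe_ptr()` check also means that raw pointers get no niche at all, which we can observe (sizes as I would expect on a 64-bit target):

use std::mem::size_of;

fn main() {
    // A reference has the null niche, a raw pointer does not.
    assert_eq!(size_of::<Option<&u8>>(), 8);
    assert_eq!(size_of::<Option<*const u8>>(), 16);
}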
Understanding the pull requests
Let's look at the four merge requests that we mentioned previously, and see how our newly acquired knowledge helps us understand them.
On Unix platforms, the `std::fs::File` type now has a "niche" of -1.
#[rustc_layout_scalar_valid_range_start(0)]
// libstd/os/raw/mod.rs assures me that every libstd-supported platform has a
// 32-bit c_int. Below is -2, in two's complement, but that only works out
// because c_int is 32 bits.
#[rustc_layout_scalar_valid_range_end(0xFF_FF_FF_FE)]
pub struct FileDesc {
fd: c_int,
}
impl FileDesc {
pub fn new(fd: c_int) -> FileDesc {
assert_ne!(fd, -1i32);
// SAFETY: we just asserted that the value is in the valid range and isn't `-1` (the only value bigger than `0xFF_FF_FF_FE` unsigned)
unsafe { FileDesc { fd } }
}
// <...>
}
This pull request is the first one (time-wise), and also the easiest to understand. It added the first five lines, specifying that -1 cannot be used as a value for the file descriptor. It then simply adds an assertion to avoid the kind of misuse that we saw earlier with our `NonMaxU8`, and uses `unsafe` to instantiate the file descriptor. This creates a potential niche for the compiler on the -1 value.
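The user-visible consequence, quoted in the release notes above, can be checked directly on a Unix platform:

use std::fs::File;
use std::mem::size_of;

fn main() {
    // The -1 niche in the underlying file descriptor lets the compiler fold
    // Option's discriminant into File itself.
    assert_eq!(size_of::<Option<File>>(), size_of::<File>());
}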
Enum should prefer discriminant zero for niche
The purpose of this pull request is to make the compiler use 0 as a niche when possible. This apparently allows some LLVM optimization afterwards. The following example is given:
pub enum Size {
One = 1,
Two = 2,
Three = 3,
}
pub fn handle(x: Option<Size>) -> u8 {
match x {
None => {0}
Some(size) => {size as u8}
}
}
that compiled as
mov eax, edi
cmp al, 4
jne .LBB0_2
xor eax, eax
.LBB0_2:
ret
before, and compiles to
mov eax, edi
ret
afterwards, thus sparing three instructions! We can see from this `cmp al, 4` that the discriminant was 4 before the pull request, which is the value directly after `Three = 3`. This pull request had a small bug in it, which I will mention when we review it.
Most of the code in this PR lives inside the `reserve` function of `impl Niche`. This function is called once we have found one dataful variant and the field with the largest available niche. It returns the start of the range that can be used to store the discriminant within the niche, and a scalar representing the niche that will be used. Before the pull request, the start of the range was simply the end of the valid values for the niche, and the end of the range was this start plus the number of values needed to store our discriminants. So, in our example, the end of the valid values being 3, our niche would take the range 4..=4.
The idea of the PR is the following: Instead of simply using the values after the valid range of the niche, we try to find a better place. A better place is defined as follows:
- If zero is already taken by a value of the variant, proceed as before and store the discriminant after the valid range
- If the end of the valid range is far from zero, or the start of the valid range is close to zero:
  - If the discriminant fits at the beginning of the range, use it. This will use zero, or make the start closer to zero for the next niche selection.
  - If it does not fit, then use the end of the valid range.
- If the start is far from zero, or the end is close to the max value:
  - If we would end up using zero when adding at the end (because of wrapping), then do not do it; leave zero available.
  - Otherwise, do it, to make the end closer to zero for a future niche selection.
So in practice, because most of the time the valid range will start at one, this means that only `Option` will fit at the beginning of the range and take the 0 spot, meaning that LLVM will be able to perform the optimizations. (I say `Option`, but any enum with only two variants, one of them dataless, would fit.)
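To check that the discriminant really ended up on zero, we can reuse the transmute trick from earlier. Again, this relies on an unspecified layout and on my toolchain, so it is only an experiment:

#[allow(dead_code)]
pub enum Size { One = 1, Two = 2, Three = 3 }

fn main() {
    let none: Option<Size> = None;
    // Option<Size> is a single byte; with this PR, None reuses the value 0
    // rather than 4 (the first value after Three = 3).
    let byte: u8 = unsafe { std::mem::transmute(none) };
    println!("{byte}"); // prints 0 on my machine
}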
The bug I mentioned before was in the closure responsible for computing the new start of the valid range. Instead of reserving enough space to store the discriminants of all the variants, it reserved only one value. This is fine for `Option`, because you only need to store one value for the discriminant, but it is not fine for enums with more variants.
let move_start = |v: WrappingRange| {
let start = v.start.wrapping_sub(1) & max_value;
Some((start, Scalar { value, valid_range: v.with_start(start) }))
};
needs to be changed to
let move_start = |v: WrappingRange| {
let start = v.start.wrapping_sub(count) & max_value;
Some((start, Scalar { value, valid_range: v.with_start(start) }))
};
Use niche-filling optimization even when multiple variants have data.
This pull request is a bigger chunk than the two previous ones. It took more than six months to be merged, and has more than a hundred comments on GitHub, especially because of performance regressions. It appears that this enabled more niche optimizations, but, while niche optimization is good for memory, it can make the subsequent `match`es harder for LLVM to optimize, and lead to sub-optimal code in some cases. Luckily, we arrive once everything is done, so we can just read the fruits of months of reflection.
This pull request aims at improving something that we have seen earlier. The part of the algorithm that was responsible for selecting the dataful variant would return `None` if there was more than one dataful variant, thus disabling niche optimization. Here, we would like to be able to consider all dataful variants (which get renamed to untagged variants), and make the best choice.
Most of the changes happen in the now-well-known `layout_of_uncached` function, and especially in the computation of the niche-filling layout. The first difference that we notice is that now, the layouts of the variants depend on the variant that we select for the niche, and on the final layout that we choose, so we cannot cache the layouts on the fly as we did before (even if we did not really mention it, because we were not worried about that).
What changes in the niche-filling layout computation is that instead of selecting the one and only untagged variant of the enum, we compute the layout for all variants. We then find the variant with the largest size (not the largest niche, just the largest size). We then iterate through the fields of this variant, find the one with the largest available niche, and reserve space for all of our variants that need a discriminant. Then, we need to check that all of our variants will fit gracefully if we use the niche that we just selected. If another variant is supposed to store data at the place where we selected the niche, this will be a problem.
For example, if all variants are smaller than the offset of the niche, this means that they will all fit before the niche, so the niche would indeed be available in all variants.
If a variant does not fit before the niche, then we check whether it fits after the niche. If it does, then cool, we just need to adapt the offsets of its layout to make them point after the niche.
If a variant does not fit before or after the niche, then we cannot use this niche, and we simply give up niche optimization.
If they all do fit before or after the niche, then we make the necessary adjustments and proceed with the rest of the algorithm as before (computing the default layout and choosing the better one).
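Here is a small experiment with what this unlocks. I have not traced this exact example through the compiler, so take the expected size as my reading of the algorithm rather than a guarantee:

use std::mem::size_of;

// Both variants carry data, so the old algorithm would have given up and added a
// separate tag byte (3 bytes in total). With multi-variant niche filling, the unused
// bit patterns of one of A's bools can encode the discriminant, and B's single bool
// fits before or after that niche.
#[allow(dead_code)]
enum E {
    A(bool, bool),
    B(bool),
}

fn main() {
    println!("{}", size_of::<E>()); // 2 on my toolchain
}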
Improve niche placement by trying two strategies and picking the better result
Remember when I said that the previous pull request was a bigger chunk than the other ones? Well, I was not ready for this one.
Most of the changes are in compiler/rustc_abi/src/layout.rs.
The main idea is actually linked to the previous PR. We saw previously that for a field to be used as a niche, we need all the other variants to fit either before or after the niche. This means that it is better if the niche is placed at the beginning or at the end of the variant, so we can fit the other variants before or after it. Previously, we were simply picking the largest niche in the largest variant, and crossing our fingers that the other variants fit before or after the niche. This PR aims to provide other strategies for picking the niche, and to define a way to choose the better one.
This can be summarized quite simply. We first compute a layout for the struct while trying to place the largest niche at the start of the struct. This can be done because Rust actually tries to reorder fields to reduce the memory lost to alignment, so we can try to reorder fields to get an early niche. We then compute the same thing, but with a bias towards putting the niche at the end. If the layout with the niche at the end provides more space before the niche, and in particular more space before the niche than the first layout provided after the niche, then we keep the layout with the niche at the end. This is how the better one is defined.
The only thing left to see is how the bias is included in the layout computation. This is done when computing the reordering of the fields. Unfortunately, my knowledge of field ordering is a bit too scarce at the moment to understand all the ins and outs of this reordering.
Key takeaways
- The variant of an enum is encoded in the discriminant, and this discriminant can be placed in a niche if there is one.
- Niches can come from:
  - Specifying a range of valid values (only available in the std or on nightly)
  - Types that naturally have a niche (`bool`)
  - Discriminants created inside the fields of a variant
- Placing niches in the right places can enable performance optimizations
- Using nightly, you can create your own niches
- Pointers do not have that many niches
- Niches are considered when reordering fields during layout computation, to place them at the beginning or the end of the struct.
- I can read some compiler/std code, and that’s nice !