Rust performance in debug mode
In this article, we will consider a simple program computing the sum of a vector of numbers, see the difference in performance when running it with cargo run and with cargo run --release, and then try to understand where this difference comes from, with a few surprises along the way.
I originally intended to write a blog post about how to compute the sum of all the elements of a vector quickly (using SIMD instructions, then parallelism, using scopes to share the same object across threads without lifetime issues), but I encountered an unexpected difference between the debug and release binaries, and decided to talk about that instead.
Let’s start by creating a new binary project with cargo, by running:
cargo new summing_numbers --bin
The program
The program will be very simple, and will consist of a naive loop over the elements of the Vec, summing them into a mutable variable.
const N: usize = 10_000_000_000;

fn main() {
    let data = vec![0u8; N];
    let mut s = 0u8;
    for x in data {
        s += x;
    }
    println!("{s}");
}
Because we are summing 10 billion numbers (around 10 GiB if everything was in RAM), this takes quite some time, even if the operations are simple. This will allow us to see noticeable differences across our various iterations with different compilation parameters.
If we run this program with cargo run, our computer hangs for a bit (approx. 30 seconds for me), and then outputs 0. Sweet, that’s the value we expected. But how long exactly did it take? A proper way to measure this would be to create a benchmark, run the program multiple times, and look at the average and the standard deviation of our measurements. Indeed, the time our program takes to execute may vary depending on how busy my computer is at the moment, and could even vary depending on the room temperature (modern CPUs may adapt their voltage depending on how hot they are, which has an impact on their frequency, and thus on the number of cycles per second).
But because we will be working with orders of magnitude, we will not bother with all of this, and will simply measure the time between the start and the end of the execution. To do so, we will use the Instant struct from the standard library.
use std::time::Instant;

const N: usize = 10_000_000_000;

fn main() {
    let start = Instant::now();
    let data = vec![0u8; N];
    println!("Time for allocation: {:.2?}", start.elapsed());

    let allocated_instant = Instant::now();
    let mut s = 0u8;
    for x in data {
        s += x;
    }
    println!("Time for sum: {:.2?}", allocated_instant.elapsed());
    println!("{s}");
}
With a simple cargo run, we can now see how long it took to execute our program.
Time for allocation: 3.87µs
Time for sum: 37.38s
0
Compiler optimizations
cargo run builds an unoptimized binary, so the performance is not as good as what you would have in production. Let’s check how it behaves if we run the same program in release mode. We can do this by running cargo run --release.
Time for allocation: 3.17µs
Time for sum: 979.44ms
Okay, that’s at least 37 times faster! The compiler optimizations are doing good work. To know if this is all there is, we can try disabling compiler optimizations for the release mode and running it again.
This is doable by adding the following option to our Cargo.toml.
[profile.release]
opt-level = 0
0 is the lowest optimization level that you can ask for, according to the cargo book.
However, if we run cargo run --release again, our release build is still faster than the debug one.
Time for allocation: 3.78µs
Time for sum: 22.00s
What could explain the difference? The code we wrote is the same, and the compiler should be making the same optimizations.
Integer overflows
This is a well-known difference between the debug profile and the release profile in Rust. When compiling with the debug profile, cargo sets the overflow-checks flag to true. This means that at runtime, every addition is followed by a check to see whether it resulted in an overflow. If it did, the program panics. In release mode, the overflow happens silently and the result wraps around. For example, in release, 255 + 1 = 0 for a u8.
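As a quick side note (not something we need for our benchmark), the standard library also provides explicit methods if we want one specific overflow behavior in both profiles, instead of relying on what the profile decides:

fn main() {
    let x: u8 = 255;

    // Wraps around in both debug and release: 255 + 1 == 0
    assert_eq!(x.wrapping_add(1), 0);

    // Returns None on overflow instead of panicking or wrapping
    assert_eq!(x.checked_add(1), None);

    // Returns the wrapped result together with an "overflow happened" flag
    assert_eq!(x.overflowing_add(1), (0, true));

    // Sticks to the maximum value instead of overflowing
    assert_eq!(x.saturating_add(1), 255);
}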
Our program did not panic earlier because we are summing zeroes, but if we initialized our vector to ones, we could see the following error.
Time for allocation: 2.15s
thread 'main' panicked at src/main.rs:13:9:
attempt to add with overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
On a side note, we can see that the allocation took much more time (about a million times more), but this is not what we are interested in for now. My guess is that to initialize the vector to ones, we have to call malloc and then set the memory to 1 everywhere, whereas when initializing to zero, we can just call calloc, which can be faster because the operating system can hand out pages that are already zeroed. But that may be a topic for another time.
If we look at the disassembly of both our debug and release binaries, we can indeed see that the sum of two numbers is done like this in release mode:
9c9b: 8a 84 24 f7 01 00 00 mov 0x1f7(%rsp),%al
9ca2: 02 84 24 8f 01 00 00 add 0x18f(%rsp),%al
9ca9: 88 84 24 8f 01 00 00 mov %al,0x18f(%rsp)
9cb0: e9 70 ff ff ff jmp 9c25 <_ZN15summing_numbers4main17hd87eec02e36de12cE+0x2a5>
and like this in debug mode:
8e65: 8a 84 24 ef 01 00 00 mov 0x1ef(%rsp),%al
8e6c: 88 84 24 37 03 00 00 mov %al,0x337(%rsp)
8e73: 02 84 24 8f 01 00 00 add 0x18f(%rsp),%al
8e7a: 88 44 24 39 mov %al,0x39(%rsp)
8e7e: 0f 92 c0 setb %al
8e81: a8 01 test $0x1,%al
8e83: 0f 85 ab 02 00 00 jne 9134 <_ZN15summing_numbers4main17had0d3439eef8be72E+0x644>
8e89: e9 96 02 00 00 jmp 9124 <_ZN15summing_numbers4main17had0d3439eef8be72E+0x634>
In release mode it is very simple to understand what is happening. We move a value from memory into the al register, add another value from memory to it, and store the result back at the same location. Then we jump back somewhere (most likely the beginning of the loop).
In debug mode, we do the same thing (or almost, as we will see later), but we have an additional setb instruction, which sets a byte to 1 if the carry flag is set (meaning that the previous add overflowed), then a test instruction, followed by a conditional jump. One of the two jumps goes to a place where we will panic, and the other one sends us back to the beginning of the loop.
As we can see, this integer overflow check adds multiple instructions inside the loop (which is executed 10 billion times), so the impact should be significant.
We can disable these checks by setting overflow-checks = false in the profile.dev section of Cargo.toml.
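Concretely, this means adding the following lines to Cargo.toml (alongside the [profile.release] override from earlier):

[profile.dev]
overflow-checks = false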
If we run in debug mode again, we get the new execution time:
Time for allocation: 3.97µs
Time for sum: 35.80s
That is actually a bit faster, but not as much as I was expecting, and it is still very far from the release performance (22 seconds with optimizations disabled). Maybe the improvement was small because the additional instructions related to the overflow checks were cheap. Indeed, as far as I can see, all their operands were registers, so there was no extra memory access, and because the sum never overflowed, the CPU must have been able to correctly predict which branch to take and keep executing speculatively.
However, that does not explain why our performance is still so far from that of the release build.
Just the fact that we are debugging
I was aware of the previous differences, but I was actually surprised that the performance was still so different even after setting the same optimization level and disabling the overflow checks, which is why I decided to write this article, in case you can also learn something from it.
According to the documentation, the remaining differences between our profiles are:
- debug = false in the release profile and debug = true in the dev profile
- debug-assertions
- incremental build
- number of codegen units
The documentation states that “more code generation units allows more of a crate to be processed in parallel possibly reducing compile time, but may produce slower code”. After testing, in our case, it does not change our performance much. This is most certainly because our program is really small and simple.
You will have to trust me when I say that incremental builds and debug-assertions do not change our performance much either.
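For reference, here is the kind of dev-profile override you can use to test this yourself; the values are simply cargo’s documented defaults for the release profile:

[profile.dev]
debug-assertions = false
incremental = false
codegen-units = 16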
This leaves us with only one difference: the debug setting. The documentation states: “The debug setting controls the -C debuginfo flag which controls the amount of debug information included in the compiled binary.” Reading this, I was not expecting to find such a big difference in performance. I only thought that the binary was going to be larger because of the additional debug information.
I was now expecting the debug build to be the same as the release build, just with debug information (like DWARF entries), but with no different code. However, if we look at the disassembly of our debug build with overflow checks turned off, the addition looks a lot more like the one we had in the release build.
8c72: 8a 84 24 ef 01 00 00 mov 0x1ef(%rsp),%al
8c79: 88 84 24 37 03 00 00 mov %al,0x337(%rsp)
8c80: 02 84 24 8f 01 00 00 add 0x18f(%rsp),%al
8c87: 88 84 24 8f 01 00 00 mov %al,0x18f(%rsp)
8c8e: e9 69 ff ff ff jmp 8bfc <_ZN15summing_numbers4main17h88468790e6b730b7E+0x2fc>
However, there is an additional mov instruction that writes the content of al to memory, which is not present in release. After disassembling our debug build in IDA (to get a clearer view than with objdump), it seemed like the memory location that al is written to, 0x337(%rsp), is never used anywhere else. So are we simply doing a useless write to memory? To test whether it actually changes the behavior of our program, I used IDA to patch our binary and replace this instruction with nop. Because the mov instruction is encoded on 7 bytes, we need to replace it with 7 nops; otherwise, we would also need to patch all the jmp offsets in the code.
Our disassembly now looks like this:
8c72: 8a 84 24 ef 01 00 00 mov 0x1ef(%rsp),%al
8c79: 90 nop
8c7a: 90 nop
8c7b: 90 nop
8c7c: 90 nop
8c7d: 90 nop
8c7e: 90 nop
8c7f: 90 nop
8c80: 02 84 24 8f 01 00 00 add 0x18f(%rsp),%al
8c87: 88 84 24 8f 01 00 00 mov %al,0x18f(%rsp)
8c8e: e9 69 ff ff ff jmp 8bfc <_ZN15summing_numbers4main17h88468790e6b730b7E+0x2fc>
This removes the memory access, but nops still cost a few cycles. If we then run our patched binary, we see more or less the same performance as with the memory access.
Time for allocation: 3.30µs
Time for sum: 35.90s
Looking more closely at the disassembly, I think this is because there are also a lot of additional movs to memory in the implementation of into_iter (certainly for the same reason), so removing one mov is not going to change much.
For example, here is the beginning of into_iter for the release version:
sub rsp, 38h
mov [rsp+38h+var_30], rdi
mov rax, [rdi+10h]
cmp rax, [rdi+18h]
jz short loc_891B
And for comparison, here is the version in debug mode:
sub rsp, 0D8h
mov [rsp+0D8h+var_D8], rdi
mov [rsp+0D8h+var_D0], 1
mov [rsp+0D8h+var_C8], 1
mov [rsp+0D8h+var_C0], 1
mov [rsp+0D8h+var_B8], 0
mov [rsp+0D8h+var_B0], 1
mov [rsp+0D8h+var_A8], 0FFFFFFFFFFFFFFFFh
mov [rsp+0D8h+var_99], 0
mov [rsp+0D8h+var_98], 1
mov [rsp+0D8h+var_89], 0
mov [rsp+0D8h+var_88], 1
mov [rsp+0D8h+var_80], 1
mov [rsp+0D8h+var_58], rdi
mov rax, [rdi+10h]
cmp rax, [rdi+18h]
jz short loc_A330
This function is called at every iteration of the loop, so replacing one small mov with nops is not going to change the overall performance much.
After all, that memory location is almost certainly in the L1 cache of the processor, so writing to it does not cost much. However, we confirmed that our program still behaves correctly if we remove this write, so it is most certainly used for debugging only.
We can actually measure the number of additional instructions executed in debug mode using perf stat.
With debug = true, we have:
$ perf stat ./target/debug/summing_numbers
Time for sum: 43.26s, result: 0
Performance counter stats for './target/debug/summing_numbers':
43,254.93 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
2,441,484 page-faults:u # 56.444 K/sec
209,055,629,529 cycles:u # 4.833 GHz
700,002,821,990 instructions:u # 3.35 insn per cycle
And with debug = false:
Performance counter stats for './target/debug/summing_numbers':
25,026.56 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
2,441,481 page-faults:u # 97.556 K/sec
121,457,238,882 cycles:u # 4.853 GHz
470,002,800,391 instructions:u # 3.87 insn per cycle
As we can see, we are executing almost 50% more instructions with debug info! So of course, it takes more time (even if we can notice here that the number of instructions per cycle is lower in debug mode).
Where do the additional instructions come from?
Assembly kind of erases the debug information, so maybe we can look at the LLVM Intermediate Representation (IR) instead. This can be done by setting the RUSTFLAGS environment variable when calling cargo, such as RUSTFLAGS='--emit=llvm-ir' cargo run. The compiler will then output a .ll file in the target folder, which we can open with our favorite text editor.
For the release build, we have something similar to the assembly: two loads, then the add, and a store.
%47 = getelementptr inbounds { i8, i8 }, ptr %_26, i32 0, i32 1
%x = load i8, ptr %47, align 1, !noundef !4
%48 = load i8, ptr %s, align 1, !noundef !4
%49 = add i8 %48, %x
store i8 %49, ptr %s, align 1
br label %bb11
For the debug build, there are more operations, as we could expect (the additional assembly instruction did not come from nowhere).
%49 = getelementptr inbounds { i8, i8 }, ptr %_26, i32 0, i32 1, !dbg !3188
%x = load i8, ptr %49, align 1, !dbg !3188, !noundef !23
store i8 %x, ptr %x.dbg.spill, align 1, !dbg !3188
call void @llvm.dbg.declare(metadata ptr %x.dbg.spill, metadata !3106, metadata !DIExpression()), !dbg !3189
%50 = load i8, ptr %s, align 1, !dbg !3190, !noundef !23
%51 = add i8 %50, %x, !dbg !3190
store i8 %51, ptr %s, align 1, !dbg !3190
br label %bb11, !dbg !3186
We have two instructions that did not appear in the release build: a store into x.dbg.spill, and a call to llvm.dbg.declare. I do not know much about LLVM IR yet, but after reading the documentation, it appears that llvm.dbg.declare is only used to provide debug information about x. To do so, the compiler stores the value of x in x.dbg.spill, which is defined earlier in the .ll file with %x.dbg.spill = alloca i8, align 1, and uses llvm.dbg.declare to attach metadata to x.dbg.spill. So it is now clear where the additional mov instruction comes from.
We cannot just write:
%49 = getelementptr inbounds { i8, i8 }, ptr %_26, i32 0, i32 1, !dbg !3188
%x = load i8, ptr %49, align 1, !dbg !3188, !noundef !23
call void @llvm.dbg.declare(metadata ptr %x, metadata !3106, metadata !DIExpression()), !dbg !3189
%50 = load i8, ptr %s, align 1, !dbg !3190, !noundef !23
%51 = add i8 %50, %x, !dbg !3190
store i8 %51, ptr %s, align 1, !dbg !3190
br label %bb11, !dbg !3186
because %x is not a pointer: it has no corresponding alloca at the beginning of the function. To be debugged, a variable apparently needs to be stored somewhere in memory that we can inspect. I am not sure why, though. The DWARF documentation (chapter 2.6.1.1.3) shows that a variable’s location can be a register, so I am not sure why it needs to be stored in memory here. I think that is what llvm.dbg.value is made for, but I do not understand why it is not used here.
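Based on my reading of the LLVM documentation, using it would look roughly like the line below, reusing the same metadata nodes; this is only a sketch of what I expected, not something rustc actually emits here:

call void @llvm.dbg.value(metadata i8 %x, metadata !3106, metadata !DIExpression()), !dbg !3189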
Even if the value really does need to live in memory, maybe we could just write something like:
%49 = getelementptr inbounds { i8, i8 }, ptr %_26, i32 0, i32 1, !dbg !3188
%x = load i8, ptr %49, align 1, !dbg !3188, !noundef !23
call void @llvm.dbg.declare(metadata ptr %49, metadata !3106, metadata !DIExpression()), !dbg !3189
%50 = load i8, ptr %s, align 1, !dbg !3190, !noundef !23
%51 = add i8 %50, %x, !dbg !3190
store i8 %51, ptr %s, align 1, !dbg !3190
br label %bb11, !dbg !3186
This is because %49 is actually a pointer to x, and x is its value. This makes me wonder: would we get LLVM IR looking like the above if we did the dereferencing manually? We can try changing our loop to look like this:
for x in data.iter() {
    s += *x;
}
In that case, because x is now a reference (so, basically a pointer), maybe there will be no need to store its value in a debug variable, since we already have a pointer. But the LLVM IR looks like this:
bb15: ; preds = %bb14
%x = load ptr, ptr %_28, align 8, !dbg !2485, !nonnull !23, !align !2375, !noundef !23
store ptr %x, ptr %x.dbg.spill, align 8, !dbg !2485
call void @llvm.dbg.declare(metadata ptr %x.dbg.spill, metadata !2405, metadata !DIExpression()), !dbg !2486
%_32 = load i8, ptr %x, align 1, !dbg !2487, !noundef !23
%49 = load i8, ptr %s, align 1, !dbg !2488, !noundef !23
%50 = add i8 %49, %_32, !dbg !2488
store i8 %50, ptr %s, align 1, !dbg !2488
br label %bb13, !dbg !2483
Even though %x is already a pointer, its value is still copied to another variable, %x.dbg.spill, with store ptr %x, ptr %x.dbg.spill, align 8, !dbg !2485. If you can explain why, feel free to contact me on Twitter and I will gladly update this post to include the explanation!
For now, we can simply remove the debug info from our debug build (which makes it a lot less useful as a debug build, I agree) by adding debug = 0 to the Cargo.toml.
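Our dev profile section then looks like this (I kept the overflow-checks override from before):

[profile.dev]
overflow-checks = false
debug = 0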
This indeed gives us approximately the same performance as the release build.
Time for allocation: 3.62µs
Time for sum: 22.62s
PS: After writing this article, I stumbled upon this open issue on GitHub. It appears that it would indeed be better to use llvm.dbg.value, but LLVM could then throw away the debug value if it decides that computing the value can be avoided.
Impact on recursion
Another fun fact we can notice is that, because values are spilled onto the stack for debugging, we may encounter stack overflows with debug = true that we would not see with debug = false. For example, the following (very poor) implementation of Fibonacci overflows the stack in debug mode but not in release mode.
fn main() {
    let v = fib(150000);
}

fn fib(n: usize) -> usize {
    if n == 0 || n == 1 {
        1
    } else {
        let x = fib(n - 1);
        let y = fib(n - 2);
        x + y
    }
}
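If we want the debug build to survive this recursion depth anyway, one possible workaround is to run the computation on a thread with a larger stack. Here is a minimal sketch; the 256 MiB stack size is an arbitrary value I picked, and this only addresses the stack depth, not the exponential runtime of this naive fib:

use std::thread;

fn main() {
    // Run the recursion on a thread whose stack is large enough to absorb
    // the bigger stack frames produced by the debug build.
    let handle = thread::Builder::new()
        .stack_size(256 * 1024 * 1024)
        .spawn(|| fib(150000))
        .unwrap();
    let v = handle.join().unwrap();
    println!("{v}");
}

fn fib(n: usize) -> usize {
    if n == 0 || n == 1 {
        1
    } else {
        fib(n - 1) + fib(n - 2)
    }
}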
Does optimization remove this issue?
We can note that if we put the debug info back in the debug build, but set the opt-level to 3 for both profiles (release and dev), the performances become somewhat similar.
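Concretely, that means setting something like this in Cargo.toml (the release line just restores the default we overrode earlier, and debug info stays at its dev default of true):

[profile.dev]
opt-level = 3

[profile.release]
opt-level = 3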
Debug:
Time for allocation: 2.92µs
Time for sum: 993.37ms
Release:
Time for allocation: 3.06µs
Time for sum: 987.25ms
This is because with optimizations, the compiler will use SIMD instructions that are available on my CPU and vectorize most of the computation. If you are unfamiliar with SIMD instructions, I recommend learning about them! But for now, it is enough to know that my CPU (like a lot of recent CPUs) has registers that are larger than 64 bits. For example, if we look at the disassembly, we can see that we are now using the xmm registers, which are 128 bits wide. This allows the program to load 16 u8s into a register at once and perform the 16 additions with a single add instruction. However, there is no information about x anymore, because x does not exist as such in the code that actually gets executed. This may be why the compiler drops the debug information about x, and why the performance is similar.
Key takeaways
- Differences in performance in our simple case come mostly from debug and opt-level.
- debug pushes every value onto the stack to prevent LLVM from optimizing it away.
- debug does more than simply making the binary bigger.