Rust performance in debug mode
In this article, we will consider a simple program computing the sum of a vector of numbers, see the difference in performance when running it with cargo run and with cargo run --release, and then try to understand where this difference comes from, with a few surprises along the way.
I originally intended to write a blog post about how to compute the sum of all the elements of a vector quickly (using SIMD instructions, then parallelism, using scopes to share the same object across threads without lifetime issues), but I encountered an unexpected difference between the debug and release binaries, and decided to talk about that instead.
Let’s start by creating a new binary project with cargo, by running:
cargo new summing_numbers --bin
The program
The program will be very simple, and will consist of a naive loop over the elements of the Vec, summing them into a mutable variable.
const N: usize = 10_000_000_000;

fn main() {
    let data = vec![0u8; N];
    let mut s = 0u8;
    for x in data {
        s += x;
    }
    println!("{s}");
}
Because we are summing 10 billion numbers (around 10 GiB if everything was in RAM), this takes quite some time, even if the operations are simple. This will allow us to see noticeable differences across our various iterations with different compilation parameters.
If we run this program with cargo run, our computer hangs for a bit (approx. 30 seconds for me), and then outputs 0. Sweet, that’s the value we expected. But how long exactly did it take? A proper way to measure this would be to create a benchmark, run the program multiple times, and look at the average and the standard deviation of our measurements. Indeed, the time our program takes to execute may vary depending on how busy my computer is at the moment, and could even vary depending on the room temperature (modern CPUs may adapt their voltage depending on how hot they are, which has an impact on their frequency, and thus on the number of cycles per second).
But because we will be working with orders of magnitude, we will not bother with all of this, and will simply measure the time between the start and the end of the execution. To do so, we will use the Instant struct from the standard library.
use std::time::Instant;

const N: usize = 10_000_000_000;

fn main() {
    let start = Instant::now();
    let data = vec![0u8; N];
    println!("Time for allocation: {:.2?}", start.elapsed());

    let allocated_instant = Instant::now();
    let mut s = 0u8;
    for x in data {
        s += x;
    }
    println!("Time for sum: {:.2?}", allocated_instant.elapsed());
    println!("{s}");
}
With a simple cargo run, we can now see how long it took to execute our program.
Time for allocation: 3.87µs
Time for sum: 37.38s
0
Compiler optimizations
cargo run builds an unoptimized binary, so the performance is not as good as what you would have in production. Let’s check how it behaves if we run the same program in release mode. We can do this by running cargo run --release.
Time for allocation: 3.17µs
Time for sum: 979.44ms
Okay, that’s at least 37 times faster! The compiler optimizations are doing good work. To know if this is all there is, we can try disabling compiler optimizations for the release mode and running it again.
This is doable by adding the following option to our Cargo.toml.
[profile.release]
opt-level = 0
0 is the lowest optimization level that you can ask for, according to the cargo book.
However, if we run cargo run --release again, our release build is still faster than the debug one.
Time for allocation: 3.78µs
Time for sum: 22.00s
What could explain the difference? The code we wrote is the same, and the compiler should be making the same optimizations.
Integer overflows
This is a well-known difference between the debug profile and the release profile in Rust. When compiling with the debug profile, cargo sets the overflow-checks flag to true. This means that at runtime, every addition is followed by a check to see whether it resulted in an overflow. If it did, the program panics. In release mode, the overflow happens silently and the result wraps around. For example, in release, 255 + 1 = 0 for a u8.
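As a quick side note (not something we need for our benchmark), the standard library also provides explicit methods if we want one specific overflow behavior in both profiles, instead of relying on what the profile decides:

fn main() {
    let x: u8 = 255;

    // Wraps around in both debug and release: 255 + 1 == 0
    assert_eq!(x.wrapping_add(1), 0);

    // Returns None on overflow instead of panicking or wrapping
    assert_eq!(x.checked_add(1), None);

    // Returns the wrapped result together with an "overflow happened" flag
    assert_eq!(x.overflowing_add(1), (0, true));

    // Sticks to the maximum value instead of overflowing
    assert_eq!(x.saturating_add(1), 255);
}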
Our program did not panic earlier because we are summing zeroes, but if we initialized our vector to ones, we could see the following error.
Time for allocation: 2.15s
thread 'main' panicked at src/main.rs:13:9:
attempt to add with overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
On a side note, we can see that the allocation took much more time (about a million times more), but this is not what we are interested in for now. My guess is that to initialize the vector to ones, we have to call malloc and then set the memory to 1 everywhere, whereas when initializing to zero, we can just call calloc, which can be faster because the operating system can hand out pages that are already zeroed. But that may be a topic for another time.
If we look at the disassembly of both our debug and release binaries, we can indeed see that the sum of two numbers is done like this in release mode:
9c9b: 8a 84 24 f7 01 00 00 mov 0x1f7(%rsp),%al
9ca2: 02 84 24 8f 01 00 00 add 0x18f(%rsp),%al
9ca9: 88 84 24 8f 01 00 00 mov %al,0x18f(%rsp)
9cb0: e9 70 ff ff ff jmp 9c25 <_ZN15summing_numbers4main17hd87eec02e36de12cE+0x2a5>
and like this in debug mode:
8e65: 8a 84 24 ef 01 00 00 mov 0x1ef(%rsp),%al
8e6c: 88 84 24 37 03 00 00 mov %al,0x337(%rsp)
8e73: 02 84 24 8f 01 00 00 add 0x18f(%rsp),%al
8e7a: 88 44 24 39 mov %al,0x39(%rsp)
8e7e: 0f 92 c0 setb %al
8e81: a8 01 test $0x1,%al
8e83: 0f 85 ab 02 00 00 jne 9134 <_ZN15summing_numbers4main17had0d3439eef8be72E+0x644>
8e89: e9 96 02 00 00 jmp 9124 <_ZN15summing_numbers4main17had0d3439eef8be72E+0x634>
In release mode it is very simple to understand what is happening. We move a value from memory into the al register, add another value from memory to it, and store the result back at the same location. Then we jump back somewhere (most likely the beginning of the loop).
In debug mode, we do the same thing (or almost, as we will see later), but we have an additional setb instruction, which sets a byte to 1 if the carry flag is set (meaning that the previous add overflowed), then a test instruction, followed by a conditional jump. One of the two jumps goes to a place where we will panic, and the other one sends us back to the beginning of the loop.
As we can see, this integer overflow check adds multiple instructions inside the loop (which is executed 10 billion times), so the impact should be significant.
We can disable these checks by setting overflow-checks = false in the profile.dev section of Cargo.toml.
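Concretely, this means adding the following lines to Cargo.toml (alongside the [profile.release] override from earlier):

[profile.dev]
overflow-checks = false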
If we run in debug mode again, we get the new execution time:
Time for allocation: 3.97µs
Time for sum: 35.80s
That is actually a bit faster, but not as much as I was expecting, and it is still very far from the release performance (22 seconds with optimizations disabled). Maybe the improvement was small because the additional instructions related to the overflow checks were cheap. Indeed, as far as I can see, all their operands were registers, so there was no extra memory access, and because the sum never overflowed, the CPU must have been able to correctly predict which branch to take and keep executing speculatively.
However, that does not explain why our performance is still so far from that of the release build.
Just the fact that we are debugging
I was aware of the previous differences, but I was actually surprised that the performance was still so different even after setting the same optimization level and disabling the overflow checks, which is why I decided to write this article, in case you can also learn something from it.
According to the documentation, the remaining differences between our profiles are:
- debug = false in the release profile and debug = true in the dev profile
- debug-assertions
- incremental build
- number of codegen units
The documentation states that “more code generation units allows more of a crate to be processed in parallel possibly reducing compile time, but may produce slower code”. After testing, in our case, it does not change our performance much. This is most certainly because our program is really small and simple.
You will have to trust me when I say that incremental builds and debug-assertions do not change our performance much either.
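For reference, here is the kind of dev-profile override you can use to test this yourself; the values are simply cargo’s documented defaults for the release profile:

[profile.dev]
debug-assertions = false
incremental = false
codegen-units = 16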
This leaves us with only one difference: the debug setting. The documentation states: “The debug setting controls the -C debuginfo flag which controls the amount of debug information included in the compiled binary.” Reading this, I was not expecting to find such a big difference in performance. I only thought that the binary was going to be larger because of the additional debug information.
I was now expecting the debug build to be the same as the release build, just with debug information (like DWARF entries), but with no different code. However, if we look at the disassembly of our debug build with overflow checks turned off, the addition looks a lot more like the one we had in the release build.
8c72: 8a 84 24 ef 01 00 00 mov 0x1ef(%rsp),%al
8c79: 88 84 24 37 03 00 00 mov %al,0x337(%rsp)
8c80: 02 84 24 8f 01 00 00 add 0x18f(%rsp),%al
8c87: 88 84 24 8f 01 00 00 mov %al,0x18f(%rsp)
8c8e: e9 69 ff ff ff jmp 8bfc <_ZN15summing_numbers4main17h88468790e6b730b7E+0x2fc>
However, there is an additional mov instruction that writes the content of al to memory, which is not present in release. After disassembling our debug build in IDA (to get a clearer view than with objdump), it seemed like the memory location that al is written to, 0x337(%rsp), is never used anywhere else. So are we simply doing a useless write to memory? To test whether it actually changes the behavior of our program, I used IDA to patch our binary and replace this instruction with nop. Because the mov instruction is encoded on 7 bytes, we need to replace it with 7 nops; otherwise, we would also need to patch all the jmp offsets in the code.
Our disassembly now looks like this:
8c72: 8a 84 24 ef 01 00 00 mov 0x1ef(%rsp),%al
8c79: 90 nop
8c7a: 90 nop
8c7b: 90 nop
8c7c: 90 nop
8c7d: 90 nop
8c7e: 90 nop
8c7f: 90 nop
8c80: 02 84 24 8f 01 00 00 add 0x18f(%rsp),%al
8c87: 88 84 24 8f 01 00 00 mov %al,0x18f(%rsp)
8c8e: e9 69 ff ff ff jmp 8bfc <_ZN15summing_numbers4main17h88468790e6b730b7E+0x2fc>
This removes the memory access, but nops still cost a few cycles. If we then run our patched binary, we see more or less the same performance as with the memory access.
Time for allocation: 3.30µs
Time for sum: 35.90s
Looking more closely at the disassembly, I think this is because there are also a lot of additional movs to memory in the implementation of into_iter (certainly for the same reason), so removing one mov is not going to change much.
For example, here is the beginning of into_iter for the release version:
sub rsp, 38h
mov [rsp+38h+var_30], rdi
mov rax, [rdi+10h]
cmp rax, [rdi+18h]
jz short loc_891B
And for comparison, here is the version in debug mode:
sub rsp, 0D8h
mov [rsp+0D8h+var_D8], rdi
mov [rsp+0D8h+var_D0], 1
mov [rsp+0D8h+var_C8], 1
mov [rsp+0D8h+var_C0], 1
mov [rsp+0D8h+var_B8], 0
mov [rsp+0D8h+var_B0], 1
mov [rsp+0D8h+var_A8], 0FFFFFFFFFFFFFFFFh
mov [rsp+0D8h+var_99], 0
mov [rsp+0D8h+var_98], 1
mov [rsp+0D8h+var_89], 0
mov [rsp+0D8h+var_88], 1
mov [rsp+0D8h+var_80], 1
mov [rsp+0D8h+var_58], rdi
mov rax, [rdi+10h]
cmp rax, [rdi+18h]
jz short loc_A330
This function is called at every iteration of the loop, so replacing one small mov with nops is not going to change the overall performance much.
After all, that memory location is almost certainly in the L1 cache of the processor, so writing to it does not cost much. However, we confirmed that our program still behaves correctly if we remove this write, so it is most certainly used for debugging only.
We can actually measure the number of additional instructions executed in debug mode using perf stat.
With debug = true, we have:
$ perf stat ./target/debug/summing_numbers
Time for sum: 43.26s, result: 0
Performance counter stats for './target/debug/summing_numbers':
43,254.93 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
2,441,484 page-faults:u # 56.444 K/sec
209,055,629,529 cycles:u # 4.833 GHz
700,002,821,990 instructions:u # 3.35 insn per cycle
And with debug = false:
Performance counter stats for './target/debug/summing_numbers':
25,026.56 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
2,441,481 page-faults:u # 97.556 K/sec
121,457,238,882 cycles:u # 4.853 GHz
470,002,800,391 instructions:u # 3.87 insn per cycle
As we can see, we are executing almost 50% more instructions with debug info! So of course, it takes more time (even if we can notice here that the number of instructions per cycle is lower in debug mode).
Where do the additional instructions come from?
Assembly kind of erases the debug information, so maybe we can look at the LLVM Intermediate Representation (IR) instead. This can be done by setting the RUSTFLAGS environment variable when calling cargo, such as RUSTFLAGS='--emit=llvm-ir' cargo run. The compiler will then output a .ll file in the target folder, which we can open with our favorite text editor.
For the release build, we have something similar to the assembly: two loads, then the add, and a store.
%47 = getelementptr inbounds { i8, i8 }, ptr %_26, i32 0, i32 1
%x = load i8, ptr %47, align 1, !noundef !4
%48 = load i8, ptr %s, align 1, !noundef !4
%49 = add i8 %48, %x
store i8 %49, ptr %s, align 1
br label %bb11
For the debug build, there are more operations, as we could expect (the additional assembly instruction did not come from nowhere).
%49 = getelementptr inbounds { i8, i8 }, ptr %_26, i32 0, i32 1, !dbg !3188
%x = load i8, ptr %49, align 1, !dbg !3188, !noundef !23
store i8 %x, ptr %x.dbg.spill, align 1, !dbg !3188
call void @llvm.dbg.declare(metadata ptr %x.dbg.spill, metadata !3106, metadata !DIExpression()), !dbg !3189
%50 = load i8, ptr %s, align 1, !dbg !3190, !noundef !23
%51 = add i8 %50, %x, !dbg !3190
store i8 %51, ptr %s, align 1, !dbg !3190
br label %bb11, !dbg !3186
We have two instructions that did not appear in the release build: a store into x.dbg.spill, and a call to llvm.dbg.declare. I do not know much about LLVM IR yet, but after reading the documentation, it appears that llvm.dbg.declare is only used to provide debug information about x. To do so, the compiler stores the value of x in x.dbg.spill, which is defined earlier in the .ll file with %x.dbg.spill = alloca i8, align 1, and uses llvm.dbg.declare to attach metadata to x.dbg.spill. So it is now clear where the additional mov instruction comes from.
We cannot just write:
%49 = getelementptr inbounds { i8, i8 }, ptr %_26, i32 0, i32 1, !dbg !3188
%x = load i8, ptr %49, align 1, !dbg !3188, !noundef !23
call void @llvm.dbg.declare(metadata ptr %x, metadata !3106, metadata !DIExpression()), !dbg !3189
%50 = load i8, ptr %s, align 1, !dbg !3190, !noundef !23
%51 = add i8 %50, %x, !dbg !3190
store i8 %51, ptr %s, align 1, !dbg !3190
br label %bb11, !dbg !3186
because %x is not a pointer: it has no corresponding alloca at the beginning of the function. To be debugged, a variable apparently needs to be stored somewhere in memory that we can inspect. I am not sure why, though. The DWARF documentation (chapter 2.6.1.1.3) shows that a variable’s location can be a register, so I am not sure why it needs to be stored in memory here. I think that is what llvm.dbg.value is made for, but I do not understand why it is not used here.
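Based on my reading of the LLVM documentation, using it would look roughly like the line below, reusing the same metadata nodes; this is only a sketch of what I expected, not something rustc actually emits here:

call void @llvm.dbg.value(metadata i8 %x, metadata !3106, metadata !DIExpression()), !dbg !3189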
Even if the value really does need to live in memory, maybe we could just write something like:
%49 = getelementptr inbounds { i8, i8 }, ptr %_26, i32 0, i32 1, !dbg !3188
%x = load i8, ptr %49, align 1, !dbg !3188, !noundef !23
call void @llvm.dbg.declare(metadata ptr %49, metadata !3106, metadata !DIExpression()), !dbg !3189
%50 = load i8, ptr %s, align 1, !dbg !3190, !noundef !23
%51 = add i8 %50, %x, !dbg !3190
store i8 %51, ptr %s, align 1, !dbg !3190
br label %bb11, !dbg !3186
This is because %49 is actually a pointer to x, and x is its value. This makes me wonder: would we get LLVM IR looking like the above if we did the dereferencing manually? We can try changing our loop to look like this:
for x in data.iter() {
    s += *x;
}
In that case, because x is now a reference (so, basically a pointer), maybe there will be no need to store its value in a debug variable, since we already have a pointer. But the LLVM IR looks like this:
bb15: ; preds = %bb14
%x = load ptr, ptr %_28, align 8, !dbg !2485, !nonnull !23, !align !2375, !noundef !23
store ptr %x, ptr %x.dbg.spill, align 8, !dbg !2485
call void @llvm.dbg.declare(metadata ptr %x.dbg.spill, metadata !2405, metadata !DIExpression()), !dbg !2486
%_32 = load i8, ptr %x, align 1, !dbg !2487, !noundef !23
%49 = load i8, ptr %s, align 1, !dbg !2488, !noundef !23
%50 = add i8 %49, %_32, !dbg !2488
store i8 %50, ptr %s, align 1, !dbg !2488
br label %bb13, !dbg !2483
Even though %x is already a pointer, its value is still copied to another variable, %x.dbg.spill, with store ptr %x, ptr %x.dbg.spill, align 8, !dbg !2485. If you can explain why, feel free to contact me on Twitter and I will gladly update this post to include the explanation!
For now, we can simply remove the debug info from our debug build (which makes it a lot less useful as a debug build, I agree) by adding debug = 0 to the Cargo.toml.
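Our dev profile section then looks like this (I kept the overflow-checks override from before):

[profile.dev]
overflow-checks = false
debug = 0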
This indeed gives us approximately the same performance as the release build.
Time for allocation: 3.62µs
Time for sum: 22.62s
PS: After writing this article, I stumbled upon this open issue on GitHub. It appears that it would indeed be better to use llvm.dbg.value, but LLVM could then throw away the debug value if it decides that computing the value can be avoided.
Impact on recursion
Another fun fact we can notice is that, because values are spilled onto the stack for debugging, we may encounter stack overflows with debug = true that we would not see with debug = false. For example, the following (very poor) implementation of Fibonacci overflows the stack in debug mode but not in release mode.
fn main() {
    let v = fib(150000);
}

fn fib(n: usize) -> usize {
    if n == 0 || n == 1 {
        1
    } else {
        let x = fib(n - 1);
        let y = fib(n - 2);
        x + y
    }
}
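If we want the debug build to survive this recursion depth anyway, one possible workaround is to run the computation on a thread with a larger stack. Here is a minimal sketch; the 256 MiB stack size is an arbitrary value I picked, and this only addresses the stack depth, not the exponential runtime of this naive fib:

use std::thread;

fn main() {
    // Run the recursion on a thread whose stack is large enough to absorb
    // the bigger stack frames produced by the debug build.
    let handle = thread::Builder::new()
        .stack_size(256 * 1024 * 1024)
        .spawn(|| fib(150000))
        .unwrap();
    let v = handle.join().unwrap();
    println!("{v}");
}

fn fib(n: usize) -> usize {
    if n == 0 || n == 1 {
        1
    } else {
        fib(n - 1) + fib(n - 2)
    }
}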
Does optimization remove this issue?
We can note that if we put the debug info back in the debug build, but set the opt-level to 3 for both profiles (release and dev), the performances become somewhat similar.
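Concretely, that means setting something like this in Cargo.toml (the release line just restores the default we overrode earlier, and debug info stays at its dev default of true):

[profile.dev]
opt-level = 3

[profile.release]
opt-level = 3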
Debug:
Time for allocation: 2.92µs
Time for sum: 993.37ms
Release:
Time for allocation: 3.06µs
Time for sum: 987.25ms
This is because with optimizations, the compiler will use SIMD instructions that are available on my CPU and vectorize most of the computation. If you are unfamiliar with SIMD instructions, I recommend learning about them! But for now, it is enough to know that my CPU (like a lot of recent CPUs) has registers that are larger than 64 bits. For example, if we look at the disassembly, we can see that we are now using the xmm registers, which are 128 bits wide. This allows the program to load 16 u8s into a register at once and perform the 16 additions with a single add instruction. However, there is no information about x anymore, because x does not exist as such in the code that actually gets executed. This may be why the compiler drops the debug information about x, and why the performance is similar.
Key takeaways
- Differences in performance in our simple case come mostly from debug and opt-level.
- debug pushes every value onto the stack to prevent LLVM from optimizing it away.
- debug does more than simply making the binary bigger.