Time Profiling

This chapter introduces how to profile Web pages using Rust and WebAssembly where the goal is improving throughput or latency.

⚡ Always make sure you are using a --release build when profiling! With our project template, that means using npm run build-release instead of npm run build-debug.

Available Tools

The performance.now() Timer

The performance.now() function returns a monotonic timestamp measured in milliseconds since the Web page was loaded. We can use it to time various operations, and we can access it from Rust with the following wasm_bindgen import declaration:


# #![allow(unused_variables)]
#fn main() {
#[wasm_bindgen]
extern {
    #[wasm_bindgen(js_namespace = performance)]
    fn now() -> f64;
}
#}

Calling performance.now has very little overhead, so we can create simple measurements from it without distorting the performance of the rest of the system.

For example, we can create a simple FPS counter that we update on each iteration of our renderLoop:

const fps = new class {
  constructor() {
    this.fps = document.getElementById("fps");
    this.frames = [];
    this.lastFrameTimeStamp = performance.now();
  }

  render() {
    // Convert the delta time since the last frame render into a measure
    // of frames per second.
    const now = performance.now();
    const delta = now - this.lastFrameTimeStamp;
    this.lastFrameTimeStamp = now;
    const fps = 1 / delta * 1000;

    // Save only the latest 100 timings.
    this.frames.push(fps);
    if (this.frames.length > 100) {
      this.frames.shift();
    }

    // Find the max, min, and mean of our 100 latest timings.
    let min = Infinity;
    let max = -Infinity;
    let sum = 0;
    for (let i = 0; i < this.frames.length; i++) {
      sum += this.frames[i];
      min = Math.min(this.frames[i], min);
      max = Math.max(this.frames[i], max);
    }
    let mean = sum / this.frames.length;

    // Render the statistics.
    this.fps.textContent = `
Frames per Second:
         latest = ${Math.round(fps)}
avg of last 100 = ${Math.round(mean)}
min of last 100 = ${Math.round(min)}
max of last 100 = ${Math.round(max)}
`.trim();
  }
};

Developer Tools Profilers

All Web browsers' builtin developer tools include a profiler. These profilers display which functions are taking the most time with the usual kinds of visualizations like call trees and flame graphs. If you build with debug symbols, then these profilers should display the Rust function names instead of something like wasm-function[123]. Note that these profilers won't show inlined functions, and since Rust and LLVM rely on inlining so heavily, the results might end up a bit perplexing.

Screenshot of profiler with Rust symbols

Resources

The console.time and console.timeEnd Functions

The console.time and console.timeEnd functions allow you to log the timing of named operations to the browser's developer tools console.

You can import them into Rust with this wasm-bindgen declaration:


# #![allow(unused_variables)]
#fn main() {
#[wasm_bindgen]
extern {
    #[wasm_bindgen(js_namespace = console)]
    fn time(name: &str);

    #[wasm_bindgen(js_namespace = console)]
    fn timeEnd(name: &str);
}
#}

Because there should be a corresponding console.timeEnd invocation for every console.time call, it is convenient to wrap them both up in an RAII type:


# #![allow(unused_variables)]
#fn main() {
pub struct Timer<'a> {
    name: &'a str,
}

impl<'a> Timer<'a> {
    pub fn new(name: &'a str) -> Timer<'a> {
        time(name);
        Timer { name }
    }
}

impl<'a> Drop for Timer<'a> {
    fn drop(&mut self) {
        timeEnd(self.name);
    }
}
#}

Then, we can time how long each Universe::tick takes by adding this snippet to the top of the method:


# #![allow(unused_variables)]
#fn main() {
let _timer = Timer::new("Universe::tick");
#}

The time of how long each call to Universe::tick took are now logged in the console:

Screenshot of console.time logs

Additionally, console.time and console.timeEnd pairs will show up in your browser's profiler's "timeline" or "waterfall" view:

Screenshot of console.time logs

Using #[bench] with Native Code

The same way we can often leverage our operating system's native code debugging tools by writing #[test]s rather than debugging on the Web, we can leverage our operating system's native code profiling tools by writing #[bench] functions.

However! Make sure that you know the bottleneck is in the WebAssembly before investing much energy in native code profiling! Use your browser's profiler to confirm this, or else you risk wasting your time optimizing code that isn't hot.

Growing our Game of Life Universe

What happens if we make our Game of Life universe larger? Replacing the 64 by 64 universe with a 128 by 128 universe results in FPS dropping from a smooth 60 to a choppy 40-ish on my machine.

If we record a profile and look at the waterfall view, we see that each animation frame is taking over 20 milliseconds. Recall that 60 frames per second leaves sixteen milliseconds for the whole process of rendering a frame. That's not just our JavaScript and WebAssembly, but also everything else the browser is doing, such as painting.

Screenshot of a waterfall view of rendering a frame

If we look at what happens within a single animation frame, we see that the CanvasRenderingContext2D.fillStyle setter is very expensive!

Screenshot of a flamegraph view of rendering a frame

And we can confirm that this isn't an abnormality by looking at the call tree's aggregation of many frames:

Screenshot of a flamegraph view of rendering a frame

Nearly 40% of our time is spent in this setter!

⚡ We might have expected something in the tick method to be the performance bottleneck, but it wasn't. Always let profiling guide your focus, since time may be spent in places you don't expect it to be.

In the drawCells function, fillStyle is set once for every cell in the universe, on every animation frame:

for (let row = 0; row < height; row++) {
  for (let col = 0; col < width; col++) {
    const idx = getIndex(row, col);

    ctx.fillStyle = cells[idx] === DEAD
      ? DEAD_COLOR
      : ALIVE_COLOR;

    ctx.fillRect(
      col * (CELL_SIZE + 1) + 1,
      row * (CELL_SIZE + 1) + 1,
      CELL_SIZE,
      CELL_SIZE
    );
  }
}

Now that we have discovered that setting fillStyle is so expensive, what can we do to avoid setting it so often? We need to change fillStyle depending on whether a cell is alive or dead. If we set fillStyle = ALIVE_COLOR and then draw every alive cell in one pass, and then set fillStyle = DEAD_COLOR and draw every dead cell in another pass, then we only end setting fillStyle twice, rather than once for every cell.

// Alive cells.
ctx.fillStyle = ALIVE_COLOR;
for (let row = 0; row < height; row++) {
  for (let col = 0; col < width; col++) {
    const idx = getIndex(row, col);
    if (cells[idx] !== ALIVE) {
      continue;
    }

    ctx.fillRect(
      col * (CELL_SIZE + 1) + 1,
      row * (CELL_SIZE + 1) + 1,
      CELL_SIZE,
      CELL_SIZE
    );
  }
}

// Dead cells.
ctx.fillStyle = DEAD_COLOR;
for (let row = 0; row < height; row++) {
  for (let col = 0; col < width; col++) {
    const idx = getIndex(row, col);
    if (cells[idx] !== DEAD) {
      continue;
    }

    ctx.fillRect(
      col * (CELL_SIZE + 1) + 1,
      row * (CELL_SIZE + 1) + 1,
      CELL_SIZE,
      CELL_SIZE
    );
  }
}

After saving these changes and refreshing http://localhost:8080/, rendering is back to a smooth 60 frames per second.

If we take another profile, we can see that only about ten milliseconds are spent in each animation frame now.

Screenshot of a waterfall view of rendering a frame after the drawCells changes

Breaking down a single frame, we see that the fillStyle cost is gone, and most of our frame's time is spent within fillRect, drawing each cell's rectangle.

Screenshot of a flamegraph view of rendering a frame after the drawCells changes

Making Time Run Faster

Some folks don't like waiting around, and would prefer if instead of one tick of the universe occurred per animation frame, nine ticks did. We can modify the renderLoop function in index.js to do this quite easily:

for (let i = 0; i < 9; i++) {
  universe.tick();
}

On my machine, this brings us back down to only 35 frames per second. No good. We want that buttery 60!

Now we know that time is being spent in Universe::tick, so let's add some Timers to wrap various bits of it in console.time and console.timeEnd calls, and see where that leads us. My hypothesis is that allocating a new vector of cells and freeing the old vector on every tick is costly, and taking up a significant portion of our time budget.


# #![allow(unused_variables)]
#fn main() {
pub fn tick(&mut self) {
    let _timer = Timer::new("Universe::tick");

    let mut next = {
        let _timer = Timer::new("allocate next cells");
        self.cells.clone()
    };

    {
        let _timer = Timer::new("new generation");
        for row in 0..self.height {
            for col in 0..self.width {
                let idx = self.get_index(row, col);
                let cell = self.cells[idx];
                let live_neighbors = self.live_neighbor_count(row, col);

                let next_cell = match (cell, live_neighbors) {
                    // Rule 1: Any live cell with fewer than two live neighbours
                    // dies, as if caused by underpopulation.
                    (Cell::Alive, x) if x < 2 => Cell::Dead,
                    // Rule 2: Any live cell with two or three live neighbours
                    // lives on to the next generation.
                    (Cell::Alive, 2) | (Cell::Alive, 3) => Cell::Alive,
                    // Rule 3: Any live cell with more than three live
                    // neighbours dies, as if by overpopulation.
                    (Cell::Alive, x) if x > 3 => Cell::Dead,
                    // Rule 4: Any dead cell with exactly three live neighbours
                    // becomes a live cell, as if by reproduction.
                    (Cell::Dead, 3) => Cell::Alive,
                    // All other cells remain in the same state.
                    (otherwise, _) => otherwise,
                };

                next[idx] = next_cell;
            }
        }
    }

    let _timer = Timer::new("free old cells");
    self.cells = next;
}
#}

Looking at the timings, it is clear that my hypothesis is incorrect: the vast majority of time is spent actually calculating the next generation of cells. Allocating and freeing a vector on every tick appears to have negligible cost, surprisingly. Another reminder to always guide our efforts with profiling!

Screenshot of a Universe::tick timer results

Let's write a native code #[bench] doing the same thing that our WebAssembly is doing, but where we can use more mature profiling tools. Here is the new benches/bench.rs:


# #![allow(unused_variables)]
#![feature(test)]

#fn main() {
extern crate test;
extern crate wasm_game_of_life;

#[bench]
fn universe_ticks(b: &mut test::Bencher) {
    let mut universe = wasm_game_of_life::Universe::new();

    b.iter(|| {
        universe.tick();
    });
}
#}

We also have to comment out all the #[wasm_bindgen] annotations, and the "cdylib" bits from Cargo.toml or else building native code will fail and have link errors.

With all that in place, we can run cargo bench to compile and run our benchmark!

$ cargo bench
    Finished release [optimized + debuginfo] target(s) in 0.0 secs
     Running target/release/deps/wasm_game_of_life-91574dfbe2b5a124

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

     Running target/release/deps/bench-8474091a05cfa2d9

running 1 test
test universe_ticks ... bench:     664,421 ns/iter (+/- 51,926)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 0 filtered out

This also tells us where the binary lives, and we can run the benchmarks again, but this time under our operating system's profiler. In my case, I'm running Linux, so perf is the profiler I'll use:

$ perf record -g target/release/deps/bench-8474091a05cfa2d9 --bench
running 1 test
test universe_ticks ... bench:     635,061 ns/iter (+/- 38,764)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 0 filtered out

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.178 MB perf.data (2349 samples) ]

Loading up the profile with perf report shows that all of our time is spent in Universe::tick, as expected:

Screenshot of perf report

perf will annotate which instructions in a function time is being spent at if you press a:

Screenshot of perf's instruction annotation

This tells us that 26.67% of time is being spent summing neighboring cells' values, 23.41% of time is spent getting the neighbor's column index, and another 15.42% of time is spent getting the neighbor's row index. Of these top three most expensive instructions, the second and third are both costly div instructions. These divs implement the modulo indexing logic in Universe::live_neighbor_count.

Recall the live_neighbor_count definition:


# #![allow(unused_variables)]
#fn main() {
fn live_neighbor_count(&self, row: u32, column: u32) -> u8 {
    let mut count = 0;
    for delta_row in [self.height - 1, 0, 1].iter().cloned() {
        for delta_col in [self.width - 1, 0, 1].iter().cloned() {
            if delta_row == 0 && delta_col == 0 {
                continue;
            }

            let neighbor_row = (row + delta_row) % self.height;
            let neighbor_col = (column + delta_col) % self.width;
            let idx = self.get_index(neighbor_row, neighbor_col);
            count += self.cells[idx] as u8;
        }
    }
    count
}
#}

The reason we used modulo was to avoid cluttering up the code with if branches for the first or last row or column edge cases. But we are paying the cost of a div instruction even for the most common case, when neither row nor column is on the edge of the universe and they don't need the modulo wrapping treatment. Instead, if we use ifs for the edge cases and unroll this loop, the branches should be very well-predicted by the CPU's branch predictor.

Let's rewrite live_neighbor_count like this:


# #![allow(unused_variables)]
#fn main() {
fn live_neighbor_count(&self, row: u32, column: u32) -> u8 {
    let mut count = 0;

    let north = if row == 0 {
        self.height - 1
    } else {
        row - 1
    };

    let south = if row == self.height - 1 {
        0
    } else {
        row + 1
    };

    let west = if column == 0 {
        self.width - 1
    } else {
        column - 1
    };

    let east = if column == self.width - 1 {
        0
    } else {
        column + 1
    };

    let nw = self.get_index(north, west);
    count += self.cells[nw] as u8;

    let n = self.get_index(north, column);
    count += self.cells[n] as u8;

    let ne = self.get_index(north, east);
    count += self.cells[ne] as u8;

    let w = self.get_index(row, west);
    count += self.cells[w] as u8;

    let e = self.get_index(row, east);
    count += self.cells[e] as u8;

    let sw = self.get_index(south, west);
    count += self.cells[sw] as u8;

    let s = self.get_index(south, column);
    count += self.cells[s] as u8;

    let se = self.get_index(south, east);
    count += self.cells[se] as u8;

    count
}
#}

Now let's run the benchmarks again!

$ cargo bench
   Compiling wasm_game_of_life v0.1.0 (file:///home/fitzgen/wasm_game_of_life)
    Finished release [optimized + debuginfo] target(s) in 0.82 secs
     Running target/release/deps/wasm_game_of_life-91574dfbe2b5a124

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

     Running target/release/deps/bench-8474091a05cfa2d9

running 1 test
test universe_ticks ... bench:      87,258 ns/iter (+/- 14,632)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 0 filtered out

That looks a whole lot better! We can see just how much better it is with the cargo benchcmp tool:

$ cargo benchcmp before.txt after.txt
 name            before.txt ns/iter  after.txt ns/iter  diff ns/iter   diff %  speedup
 universe_ticks  664,421             87,258                 -577,163  -86.87%   x 7.61

Wow! 7.61x speed up!

WebAssembly intentionally maps closely to common hardware architectures, but we do need to make sure that this native code speed up translates into a WebAssembly speed up as well.

Let's reinstate all the #[wasm_bindgen] annotations, rebuild the .wasm with npm run build-release, and refresh http://localhost:8080/. On my machine, the page is running at 60 frames per second again, and recording another profile with the browser's profiler reveals that each animation frame is taking about ten milliseconds.

Success!

Screenshot of a waterfall view of rendering a frame after replacing modulos with branches

Exercises

  • At this point, the next lowest hanging fruit for speeding up Universe::tick is removing the allocation and free. Implement double buffering of cells, where the Universe maintains two vectors, never frees either of them, and never allocates new buffers in tick.

  • Implement the alternative, delta-based design from the "Implementing Life" chapter, where the Rust code returns a list of cells that changed states to JavaScript. Does this make rendering to <canvas> faster? Can you implement this design without allocating a new list of deltas on every tick?

  • As our profiling has shown us, 2D <canvas> rendering is not particularly fast. Replace the 2D canvas renderer with a WebGL renderer. How much faster is the WebGL version? How large can you make the universe before WebGL rendering is a bottleneck?