Use fork of prost with changes to decode_varint_slice #4
landaire wants to merge 1 commit into v0y4g3r:step6/optimize-slice from
Conversation
I failed to recognize that the slow path exists for scenarios where …
That's really cool! I reproduced your result on the same machine as the previous benchmarks.
Are these changes you made to prost still correct?
No, unfortunately not -- at least not for the general case. For the purposes of your benchmark, or if you are 100% certain you will never have a fragmented buffer, they're fine. I'm trying to figure out how the … I found that rewriting it in a few different ways, including adding an extra condition, seems to speed it up:

```rust
    if len >= 10 || len == rem {
        let (value, advance) = decode_varint_slice(bytes)?;
        buf.advance(advance);
        Ok(value)
    } else {
        decode_varint_slow(buf)
    }
}
```

This yields the following results:

```
varint/small/decode   time:   [32.028 ns 32.147 ns 32.279 ns]
                      change: [-84.245% -84.141% -84.048%] (p = 0.00 < 0.05)
                      Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

varint/medium/decode  time:   [154.64 ns 155.26 ns 155.99 ns]
                      change: [-29.224% -28.274% -27.316%] (p = 0.00 < 0.05)
                      Performance has improved.

varint/large/decode   time:   [307.91 ns 309.42 ns 311.17 ns]
                      change: [-9.2315% -8.4735% -7.5222%] (p = 0.00 < 0.05)
                      Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

varint/mixed/decode   time:   [224.22 ns 228.21 ns 233.62 ns]
                      change: [-17.977% -16.333% -13.952%] (p = 0.00 < 0.05)
                      Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe
```

But again, no change in the Greptime benchmarks. I think this certainly highlights, though, that there are additional tradeoffs that were made, like accounting for discontinuous memory, that are not present in the Go version. This seems like the perfect case for specialization if it's ever stabilized...
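The extra `len == rem` condition from the snippet earlier in this comment can be modeled in isolation. The sketch below is my own illustration, not code from the patch (`chunk_len` and `remaining` are hypothetical names): the contiguous fast path is safe when the first chunk either can hold any 10-byte varint, or is the entire remaining buffer, so the varint cannot straddle a chunk boundary.

```rust
/// Sketch of the fast/slow dispatch condition (illustrative names,
/// not prost's): `chunk_len` is the length of the first contiguous
/// chunk, `remaining` the total bytes left in the possibly
/// fragmented buffer.
fn use_fast_path(chunk_len: usize, remaining: usize) -> bool {
    // A varint is at most 10 bytes, so a chunk of >= 10 bytes always
    // contains the whole varint; otherwise the chunk must be the
    // entire remaining buffer for the slice path to be safe.
    chunk_len >= 10 || chunk_len == remaining
}
```

With this shape, only a genuinely fragmented tail (short chunk that is not the whole buffer) falls through to the slow path.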
My latest change (tokio-rs/prost@7c6da11) on the same branch handles non-contiguous memory, and for me it saw a 19.318% performance improvement. Curiously, though, prost's own benchmarks still show otherwise overall:
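For context, a byte-at-a-time decode is what makes non-contiguous memory workable at all. Here is a minimal std-only sketch of my own (a stand-in for the role of a slow path over fragmented buffers, not prost's actual code; it omits the tenth-byte overflow check for brevity):

```rust
/// Decode a LEB128 varint from a sequence of chunks, byte by byte,
/// so the varint may straddle chunk boundaries. Returns None on
/// truncated input or a varint longer than 10 bytes.
fn decode_varint_chunked(chunks: &[&[u8]]) -> Option<u64> {
    let mut value = 0u64;
    for (i, &b) in chunks.iter().flat_map(|c| c.iter()).take(10).enumerate() {
        // low 7 bits are payload, placed 7 bits higher per byte
        value |= u64::from(b & 0x7f) << (i * 7);
        if b < 0x80 {
            // high bit clear: this was the last byte
            return Some(value);
        }
    }
    None
}
```

The per-byte iteration is exactly the cost the slice fast path avoids by operating on one contiguous `&[u8]`.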
This PR should not be merged.
I was reading the article yesterday and made some changes that allowed me to see a lot of time being spent in `decode_varint`. This is with the following patch applied (prost is the latest `master` commit). This removes any noise from reallocating vectors, but it is kind of cheating for this scenario and should only serve to reduce noise.
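The general trick behind removing reallocation noise can be shown in a few lines. This is my own sketch, not the actual patch: clearing a `Vec` keeps its capacity, so warm iterations append into the same allocation.

```rust
/// Demonstrates buffer reuse: with enough up-front capacity, a
/// clear-and-refill loop never reallocates, so the allocation's
/// pointer stays stable for the whole run.
fn demo_reuse() -> bool {
    let mut buf: Vec<u8> = Vec::with_capacity(64);
    let ptr_before = buf.as_ptr();
    for _ in 0..1_000 {
        buf.clear(); // length -> 0, capacity untouched
        buf.extend_from_slice(&[0u8; 64]);
    }
    // same allocation the whole time: no realloc noise in the loop
    ptr_before == buf.as_ptr() && buf.len() == 64
}
```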
The two biggest areas for improvement now are in the `RepeatedField<T>::push_default()` and `prost::encoding::decode_varint()` functions. I decided to look at the latter and noted a couple of things (reference: https://github.com/tokio-rs/prost/blob/e3deaa200b3a5500bf0403325d02716973b7296a/src/encoding.rs#L57), among them that the fast slice path is only taken when the buffer length is `> 10` or the last byte of the buffer is `< 0x80`.

Point 1 is fine.
Point 3 I think is fine.
Point 2 I think has some problems:

- In `decode_varint()` we check the same conditions that `decode_varint_slice()` asserts on, and even though `decode_varint_slice()` is marked as `#[inline]` I don't think the optimizer will remove these duplicate checks, since the behavior of the conditions is different: in `decode_varint()`, failing the checks results in `decode_varint_slow()` being called, while failing in `decode_varint_slice()` results in an assertion failure. I did not verify this by looking at the assembly.
- `decode_varint()` has two conditions in its main body: the byte value is `< 0x80` or `< 2` (checking for the presence of another byte and for overflow, respectively). I'm not sure if this makes a practical impact, but if we can eliminate both of these conditions or front-load them, we may be able to speed things up.

I took a stab at rewriting this logic using safe Rust and iterators here: tokio-rs/prost@master...landaire:prost:varint_slice_perf
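To make the direction concrete, here is a rough, self-contained sketch in safe Rust of a slice-based decoder in this spirit (my own approximation, not the code from the linked branch): a single pass with one `b < 0x80` test per byte, plus the `< 2` overflow test, applied only on the tenth byte.

```rust
/// Decode a LEB128-encoded u64 from the front of `bytes`.
/// Returns the value and the number of bytes consumed, or None if
/// the input is truncated, over-long, or overflows 64 bits.
fn decode_varint(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &b) in bytes.iter().take(10).enumerate() {
        // bits shifted past bit 63 are discarded, which is why the
        // tenth byte needs its own overflow check below
        value |= u64::from(b & 0x7f) << (i * 7);
        if b < 0x80 {
            // the tenth byte may only contribute the top bit of the u64
            if i == 9 && b >= 2 {
                return None;
            }
            return Some((value, i + 1));
        }
    }
    None // truncated input, or more than 10 continuation bytes
}
```

Because it walks byte by byte anyway, a decoder shaped like this can serve as both the fast and slow paths over a contiguous slice, which is what made removing the separate slow path possible in my branch.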
Assembly comparison: original, modified.
My solution ended up looking somewhat like the slow path while handling all cases of both the slow and fast paths, so I also removed the slow-path code. All prost tests pass.
For this benchmark it resulted in a 37% perf increase:
For the prost benchmarks the numbers are a little iffy:
And viewing the performance of the benchmark under a flamegraph shows that we no longer even see `decode_varint` in the sampling! I would ask if you could try out my branch to see whether you can reproduce these results. The PR here is just for the sake of showing the use of my branch of prost as a patched dependency.
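For anyone wanting to try the same thing, a `[patch]` entry is how a forked prost gets swapped in. A sketch of the relevant `Cargo.toml` lines (the branch name is taken from the compare link above; adjust it to whichever branch you want to test):

```toml
# Redirect every `prost` dependency in the tree to the fork.
[patch.crates-io]
prost = { git = "https://github.com/landaire/prost", branch = "varint_slice_perf" }
```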