Verify the block filter hash when reading the filter from disk. (utxo db and indexes)

https://github.com/bitcoin/bitcoin/pull/24832

Host: stickies-v  -  PR author: kcalvinalvin

The PR branch HEAD was e734228 at the time of this review club meeting.

Notes

  • Compact block filters (not to be confused with compact blocks) were introduced in BIP 158 as a more privacy-friendly and incentive-compatible alternative to Bloom filters.

  • The main purpose of PR #24832 is to improve performance when loading filters from disk by bypassing unnecessarily expensive checks. To verify this, the author (this PR builds on the work done in #19280) also introduced new benchmarks. Benchmarks are located in /src/bench, and the usage instructions explain how to compile and run them.

  • Once you’ve compiled bench_bitcoin, you can run the entire benchmark suite, or limit the run to the benchmarks related to this PR by using the filter option:

./bench_bitcoin --filter="GCS.*"
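
  • Benchmarks in /src/bench are registered with the nanobench-based framework via the BENCHMARK macro. The following is only a rough sketch of what the GCS decode benchmark added alongside this PR could look like; the element generation, parameter values, and the explicit skip_decode_check argument are illustrative and may differ slightly from the code on the PR branch:

    #include <bench/bench.h>
    #include <blockfilter.h>

    #include <vector>

    static void GCSFilterDecode(benchmark::Bench& bench)
    {
        // Build a deterministic set of dummy 32-byte elements to put in the filter.
        GCSFilter::ElementSet elements;
        for (int i = 0; i < 10000; ++i) {
            GCSFilter::Element element(32);
            element[0] = static_cast<unsigned char>(i);
            element[1] = static_cast<unsigned char>(i >> 8);
            elements.insert(std::move(element));
        }

        // Encode the filter once, then measure how long decoding the raw bytes takes.
        GCSFilter filter({0, 0, BASIC_FILTER_P, BASIC_FILTER_M}, elements);
        const std::vector<unsigned char> encoded{filter.GetEncoded()};

        bench.run([&] {
            GCSFilter decoded{{0, 0, BASIC_FILTER_P, BASIC_FILTER_M}, encoded, /*skip_decode_check=*/false};
        });
    }
    BENCHMARK(GCSFilterDecode);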

Questions

  1. Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK?

  2. When are compact block filters actually loaded from disk? Is this a frequent process?

  3. This PR introduces a new skip_decode_check bool parameter in the GCSFilter constructor that controls whether we check the size of the filter. In which scenarios will this parameter actually be true? Could we remove it? (A sketch of the new read path follows this list.)

  4. We usually don’t pass (u)int type variables by reference. Why is that different for the const uint256& hash parameter in BlockFilterIndex::ReadFilterFromDisk?

  5. In GCSFilterDecodeSkipCheck(), is it reasonable to construct GCSFilter filter with {0, 0} as the first two Params arguments? In your own words, what do the 4 Params parameters represent?

  6. Were you able to compile the bench_bitcoin benchmark tool?

  7. Were you able to run the benchmarks? In terms of ns/op (nanoseconds per operation), what results are you getting for GCSFilterDecode and for GCSFilterDecodeSkipCheck? Is that in line with expectations, and would you say there is sufficient benefit to warrant merging this PR?

  8. What are the risks and downsides of this PR? Are there any new attack vectors?
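
(Relevant to questions 3-5.) The core idea of the PR is that the block filter index already stores a hash for every filter, so a filter read from disk can be verified with a single hash comparison instead of the more expensive decode check. The sketch below only illustrates that idea and is not the PR's literal code: ReadAndVerifyFilter is a hypothetical helper (the real change lives in BlockFilterIndex::ReadFilterFromDisk), and the {0, 0} SipHash keys are placeholders, since in practice the BASIC filter derives them from the block hash. The four Params arguments are the two SipHash keys k0 and k1, the Golomb-Rice parameter P, and the inverse false-positive rate M.

    #include <blockfilter.h>
    #include <hash.h>
    #include <uint256.h>

    #include <optional>
    #include <utility>
    #include <vector>

    // Hypothetical helper, for illustration only: verify an encoded filter read from
    // disk against the hash already stored in the block filter index database, then
    // decode it with the expensive decode check skipped.
    std::optional<GCSFilter> ReadAndVerifyFilter(std::vector<unsigned char> encoded_filter,
                                                 const uint256& expected_hash)
    {
        // Cheap integrity check: hash the raw encoded bytes and compare with the stored hash.
        uint256 result;
        CHash256().Write(encoded_filter).Finalize(result);
        if (result != expected_hash) return std::nullopt; // corrupted filter on disk

        // Params: {siphash_k0, siphash_k1, P (Golomb-Rice parameter), M (inverse
        // false-positive rate)}. The hash above already proves integrity, so pass
        // skip_decode_check=true to avoid re-deriving the element count during decode.
        return GCSFilter({0, 0, BASIC_FILTER_P, BASIC_FILTER_M}, std::move(encoded_filter),
                         /*skip_decode_check=*/true);
    }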

Meeting Log

  17:00 <stickies-v> #startmeeting
  17:00 <BlueMoon> Hello!!
  17:00 <lightlike> hi
  17:01 <stickies-v> welcome everyone! On the menu this week is a PR by kcalvinalvin (building on earlier work by pstratem) that improves the performance of loading compact block filters from disk. In addition to those changes, we'll also look at the benchmarking tests and suite.
  17:01 <stickies-v> the notes and questions are available on https://bitcoincore.reviews/24832
  17:01 <yashraj> hi
  17:01 <Amirreza> Hello
  17:02 <Bitcoin_Hodler> hello
  17:02 <stickies-v> anyone joining us for the first time today? even if you're just lurking, feel free to say hi!
  17:02 <brunoerg> hi
  17:02 <nasser_saazi> hi
  17:02 <extheo[m]> Hi
  17:02 <ls55> Hi
  17:03 <schmidty_> hi
  17:03 <stickies-v> who got the chance to review the PR or read the notes? (y/n)
  17:03 <Amirreza> y
  17:03 <willcl_ark> Hi
  17:03 <Bitcoin_Hodler> y
  17:03 <lightlike> y
  17:04 <svav> Hi
  17:04 <TobiAdeyemi[m]> hi
  17:04 <glozow> hi
  17:04 <BlueMoon> y
  17:04 <brunoerg> y
  17:05 <stickies-v> for those of you who were able to review, would you give it a Concept ACK, Approach ACK, Tested ACK, or NACK?
  17:06 <Amirreza> Actually I'm still very new to the project :) but I think Concept ACK
  17:07 <stickies-v> luckily there's a bit less pressure on ACK'ing things just here in the review club :-D thanks for your input!
  17:07 <willcl_ark> It doesnt seem to make sense to perform expensive checks multiple times when we can just compare the hash, so concept ACK
  17:08 <Amirreza> For really understanding how the source code of this part works, do we need to know the GCS algorithm?
  17:10 <stickies-v> when reviewing a PR, you need to be comfortable about understanding the *changes* it introduces. In this PR, we don't really change how we use GCS, so I'd say it's not really super important. Of course, sometimes side effects can be difficult to understand, and having enough people with deep expertise of that part of the codebase is important too.
  17:10 <Bitcoin_Hodler> Concept ACK
  17:11 <stickies-v> we'll come back to benefits, risks and downsides about this PR later on in the discussion too! with that said, let's get to the questions
  17:11 <stickies-v> let's start with a quick refresher: how would you summarize what is a compact block filter, and what it's used for?
  17:12 <svav> Compact block filters - are a condensed representation of the contents of a block that allow wallets to determine whether the block contains any transactions involving the user’s keys.
  17:13 <Amirreza> It's for wallets to check if their addresses are involved in a given block or not.
  17:13 <Amirreza> Because checking all blocks is too expensive.
  17:13 <ls55> Compact block filters are a condensed representation of the contents of a block that allow wallets to determine whether the block contains any transactions involving the user’s keys.
  17:13 <sipa> well, they're still checking all blocks, but not by going through all transactions in those blocks individually
  17:14 <Amirreza> sipa: yeah you're right. ls55 explanation was better.
  17:14 <stickies-v> lots of good answers already - I'd say the main thing to add is that they're a compact representation of all the scriptPubKeys in a block - both the inputs and outputs
  17:15 <willcl_ark> They are used when clients don't want to have to download full-size blocks, but still want to check for transactions related to themselves (e.g. "light clients")
  17:16 <stickies-v> we already had bloom filters to achieve a similar goal, but they were leaking quite a bit of privacy and weren't incentive compatible (put a lot of load on full nodes), and that's fixed with compact block filters (at cost of higher bandwidth and CPU requirements for light nodes)
  17:16 <willcl_ark> When you find a match against the filter, you then request the full block to get full transaction information, without leaking too much privacy to the "server"
  17:16 <sipa> There is a more fundamental difference between the BIP37 bloom filters and the BIP157 golomb-coded filters than just the encoding.
  17:17 <sipa> BIP37 is server-side filtering: the client gives the server what they're interested in, and the server responds with matching transactions.
  17:17 <ls55> However, the Compact block filters approach consumes significantly more bandwidth.
  17:17 <sipa> With BIP157, the server gives the client a filter of what's in the block, and they can do the matching by themselves.
  17:17 <evanlinjin> @ls55 it's still way less bandwidth than running a full node though
  17:18 <ls55> evanlinjin: true
  17:18 <willcl_ark> With BIP157 theres a single filter per block, shared to all clients. With BIP37 each client can ask for matches against custom filters (and there's no DOS protection to requesting matching against many filters).
  17:18 <sipa> GCS is also smaller than Bloom (approximately 1.4x for the same false positive rate), but that's just an incremental change. The fundamental difference is that the client does not just tell the server what they care about anymore (a huge privacy leak).
  17:19 <Amirreza> What about probability of false positive? Bloom filter vs GCS.
  17:19 <stickies-v> willcl_ark: yes exactly, and that's quite relevant to this PR too, but we'll get back to that later
  17:19 <sipa> Both Bloom filters and GCS have tunable false positive rate - the lower your want the fprate to be, the bigger the filter becomes.
  17:20 <sipa> However, GCS is 1.4x smaller for the same fprate and data size, compared to Bloom.
  17:20 <Amirreza> sipa: got it. Thanks
  17:20 <sipa> The downside is that GCS cannot be updated efficiently once constructed; they're intended to be constructed once, and read multiple times.
  17:21 <stickies-v> the tunable false positive rate is also something we'll cover later on in the questions. Awesome discussion, I'll move on to the next question. As always, this discussion is async so feel free to continue discussing earlier questions
  17:21 <ls55> Does the wallet need to download the entire block when they learn that a block has relevant transactions or just the transactions?
  17:21 <stickies-v> when are compact block filters actually loaded from disk? Is this a frequent process?
  17:21 <Amirreza> ls55: no they download the merkle block AFAIK
  17:22 <Amirreza> which contains only block header and related TXs
  17:22 <sipa> ls55: Well, if they want the block, yes. There is no way to only download just the tx they care about (because that would reveal which tx they are interested in).
  17:22 <sipa> Amirreza: That's for BIP37, which is deprecated.
  17:22 <sipa> In the BIP157 way of working, yes, the client will download the full block if it matches.
  17:22 <Amirreza> sipa: oh, thanks for mentioning
  17:23 <ls55> sipa: Amirreza: Got it. Thanks.
  17:24 <stickies-v> ls55: this is also not consensus or P2P protocol or anything, I think any dev could implement this for their application however they want to, nothing's holding you back from querying individual TXs
  17:25 <stickies-v> (but you would be leaking privacy by doing that)
  17:25 <ls55> stickies-v: when are compact block filters actually loaded from disk? When `getblockfilter` RPC is called, I guess.
  17:25 <stickies-v> yes, that's one way! there are 2 more
  17:25 <lightlike> filters are loaded from disk when we get a request from a peer that wants them - that could be frequent
  17:26 <sipa> stickies-v: BIP158 defines the GCS filter. BIP157 exposes it over the P2P network.
  17:26 <ls55> stickies-v: Got it. Thanks. Downloading the entire block is more private.
  17:27 <ls55> How does a peer request a block filter ? Is there a specific message ?
  17:27 <Amirreza> sipa: Aren't bip numbers based on the order of issuing them? BIP157 used the concept that was introduced in BIP158?
  17:27 <stickies-v> sipa: hmm yeah my comment wasn't really relevant to BIP157/158, you're right
  17:27 <stickies-v> no, BIP numbers are not chronological
  17:27 <willcl_ark> Usually you request filter headers first, then filters which you are missing according to the filter header chain
  17:27 <sipa> ls55: Per BIP157, yes `getcfilters`
  17:28 <lightlike> I think filters can also be queried via REST
  17:28 <sipa> stickies-v: Outside of BIP37 there is no mechanism in the P2P protocol for a client to request a subset of the transactions of a block.
  17:29 <sipa> Oh, I guess BIP152 compact block relay also has a mechanism for transferring just a subset of transactions, but that's in a very different context.
  17:29 <stickies-v> lightlike: yes that's the third one!
  17:30 <ls55> spa: Thanks. Why would a peer request a block filter (`getcfilter` message) ? For the same reason as a lightweight wallet ?
  17:30 <stickies-v> so the compact block filters are only loaded on request, through RPC, through REST or through P2P networking with the GETCFILTERS message (if `NODE_COMPACT_FILTERS` service bit is set)
  17:30 <stickies-v> links to the relevant code:
  17:30 <stickies-v> RPC: https://github.com/bitcoin/bitcoin/blob/749b80b29e875cc6afa1c2674cccdfd7115cc16a/src/rpc/blockchain.cpp#L2226
  17:30 <stickies-v> REST: https://github.com/bitcoin/bitcoin/blob/749b80b29e875cc6afa1c2674cccdfd7115cc16a/src/rest.cpp#L519
  17:30 <sipa> @ls55 They're the same thing. That's how a lightweight wallet, being a peer of a full node, asks for a block filter.
  17:31 <stickies-v> GETCFILTERS: https://github.com/bitcoin/bitcoin/blob/11106a4722558765a44ae45c7892724a73ce514c/src/net_processing.cpp#L3514-L3516
  17:31 <ls55> sipa: Got it.
  17:32 <sipa> (That's assuming the lightweight wallet communicates over the P2P protocol; they don't have to, e.g. Electrum uses its own protocol, talking to Electrum servers, not Bitcoin P2P nodes - in that case BIP157 is obviously irrelevant)
  17:33 <stickies-v> next question: this PR introduces a new `skip_decode_check` bool parameter in the `GCSFilter` constructor that controls whether we check the size of the filter. In which scenarios will this parameter actually be `true`? Could we remove it?
  17:33 <lightlike> bitcoin core doesn't have or need any logic to request blockfilters via p2p itself, it just creates the filters to serve others
  17:35 <stickies-v> lightlike: good point. since Core doesn't have a "light client mode", there is this asymmetry where it only sends but never consumes compact block filters
  17:36 <willcl_ark> Wondering if that would make an interesting Tor/I2P mode…
  17:37 <stickies-v> woops forgot to provide the link for the previous question: link: https://github.com/kcalvinalvin/bitcoin/blob/e734228d8585c0870c71ce8ba8c037f8cf8b249a/src/blockfilter.h#L62
  17:37 <evanlinjin> Would it make sense to have core have a "light client mode"? Maybe, use blockfilters while the full node syncs in the background?
  17:38 <evanlinjin> *compact bloom filters
  17:38 <willcl_ark> We already sync headers first
  17:38 <sipa> @evanlinjin That question is meaningless without developer prioritization to make it happen.
  17:38 <sipa> Of course it's meaningful, but it'd be an enormously invasive change.
  17:39 <stickies-v> (*compact block filters, not bloom)
  17:39 <evanlinjin> willcl_ark: But you won't see txs
  17:39 <sipa> Bitcoin Core isn't designed around such a mode of operation.
  17:39 <sipa> assumeutxo goes in a similar direction, and it's not exactly moving alon quickly
  17:39 <evanlinjin> stickies-v: Thanks, I keep getting confused haha
  17:40 <evanlinjin> sipa: thank you for the insight
  17:42 <lightlike> stickies-v: I think skip_decode_check can be removed (and I think I suggested that in the PR some time ago). skip_decode_check=false is only used when deserializing blockfilters we get from others, and since core does not request filters from others, it doesn't need to do that (outside of tests)
  17:43 <stickies-v> lightlike: yes I was waiting for your input haha, see https://github.com/bitcoin/bitcoin/pull/24832#issuecomment-1098206739 for more discussion around this
  17:43 <stickies-v> BlockFilter::Unserialize() is the only place where `skip_decode_check` is true, and that's only called from the test suite
  17:44 <stickies-v> so kcalvinalvin is actually working on refactoring that
  17:44 <lightlike> though I'm not sure if we should remove Unserialize completely, as was also suggested. I'd be fine with just removing the extra check code.
  17:45 <stickies-v> alright next Q: we usually don't pass `(u)int` type variables by reference. Why is that different for the `const uint256& hash` parameter in `BlockFilterIndex::ReadFilterFromDisk`?
  17:45 <ls55> But isn't `skip_decode_check=true` the main benefit of this PR?
  17:45 <stickies-v> link: https://github.com/kcalvinalvin/bitcoin/blob/e734228d8585c0870c71ce8ba8c037f8cf8b249a/src/index/blockfilterindex.h#L34
  17:45 <stickies-v> ls55:
  17:45 <stickies-v> ls55: no the main benefit is that we can replace the previous decode check with a much cheaper hash check
  17:47 <stickies-v> since we already store the hash of each compact block filter in the db, we just compare that hash with the filter that we just loaded from disk
  17:48 <ls55> stickies-v: Got it. Thanks. `CHash256().Write(encoded_filter).Finalize(result); if (result != hash) ...`
  17:50 <stickies-v> hint for the current Q: what's the size of a reference/pointer?
  17:51 <sipa> nit: references don't have a size (from the perspective of the source code - since the reference is a perfect stand-in for what it is referencing). typically references are compiled to pointers, but the compiler isn't required to do so
  17:51 <svav> Reference is a lot smaller
  17:53 <stickies-v> sipa: interesting, so if not compiled to pointers, what are they compiled to?
  17:53 <ls55> `uint256` is often passed by reference in various parts of the codebase.
  17:53 <sipa> stickies-v: Sometimes nothing. The compiler may be able to inline a function call or so.
  17:54 <stickies-v> svav: exactly! pointers are usually 32/64 bit depending on OS, so much smaller than 256 bit. Also, uint256 is not a fundamental type but a class that we implemented in https://github.com/bitcoin/bitcoin/blob/bfc6070342b9f43bcf125526e6a3c8ed34e29a71/src/uint256.h#L112
  17:54 <sipa> My point is just that "reference" is a concept that only exists in the source code, and its size is meaningless. If you'd ask sizeof() on a reference, you'd just get the size of what it is referencing.
  17:55 <sipa> What you care about is what the overhead of passing an argument by reference or by pointer is, and while the compiler is free to do anything that works, typically both pointer and reference arguments result in 1 extra register being passed to the callee.
  17:56 <sipa> (sorry, language nittery, it's not actually relevant for this discussion)
  17:57 <stickies-v> no that's definitely interesting, it was kind of the point of the question!
  17:57 <stickies-v> alright quickly wanna cover the benchmarking too before we wrap up
  17:57 <stickies-v> who was able to compile the `bench_bitcoin` benchmarking tool?
  17:58 <stickies-v> and if so, were you able to run the benchmarks? ( ./bench_bitcoin --filter="GCS.*" )
  17:58 <willcl_ark> alas, I did not find the time :'(
  18:00 <stickies-v> willcl_ark: it's very quick both to compile and run actually! see https://github.com/bitcoin/bitcoin/blob/master/doc/benchmarking.md for making instructions
  18:00 <stickies-v> #endmeeting