Verify the block filter hash when reading the filter from disk. (utxo db and indexes)

https://github.com/bitcoin/bitcoin/pull/24832

Host: stickies-v  -  PR author: kcalvinalvin

The PR branch HEAD was e734228 at the time of this review club meeting.

Notes

  • Compact block filters (not to be confused with compact blocks) were introduced in BIP 158 as a more privacy-friendly and incentive-compatible alternative to Bloom filters.

  • The main purpose of PR #24832 is to improve performance when loading filters from disk by bypassing unnecessarily expensive checks. To verify this, the author (this PR builds on the work done in #19280) also introduced new benchmarks. Benchmarks are located in /src/bench, and the usage instructions explain how to compile and run them.

  • Once you’ve compiled bench_bitcoin, you can run the entire benchmark suite, or limit the run to the benchmarks related to this PR by using the filter option:

./bench_bitcoin --filter="GCS.*"
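
  • Benchmarks in /src/bench are registered with the nanobench-based framework via the BENCHMARK macro. The following is only a rough sketch of what the GCS decode benchmark added alongside this PR could look like; the element generation, parameter values, and the explicit skip_decode_check argument are illustrative and may differ slightly from the code on the PR branch:

    #include <bench/bench.h>
    #include <blockfilter.h>

    #include <vector>

    static void GCSFilterDecode(benchmark::Bench& bench)
    {
        // Build a deterministic set of dummy 32-byte elements to put in the filter.
        GCSFilter::ElementSet elements;
        for (int i = 0; i < 10000; ++i) {
            GCSFilter::Element element(32);
            element[0] = static_cast<unsigned char>(i);
            element[1] = static_cast<unsigned char>(i >> 8);
            elements.insert(std::move(element));
        }

        // Encode the filter once, then measure how long decoding the raw bytes takes.
        GCSFilter filter({0, 0, BASIC_FILTER_P, BASIC_FILTER_M}, elements);
        const std::vector<unsigned char> encoded{filter.GetEncoded()};

        bench.run([&] {
            GCSFilter decoded{{0, 0, BASIC_FILTER_P, BASIC_FILTER_M}, encoded, /*skip_decode_check=*/false};
        });
    }
    BENCHMARK(GCSFilterDecode);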

Questions

  1. Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK?

  2. When are compact block filters actually loaded from disk? Is this a frequent process?

  3. This PR introduces a new skip_decode_check bool parameter in the GCSFilter constructor that controls whether we check the size of the filter. In which scenarios will this parameter actually be true? Could we remove it? (A sketch of the new read path follows this list.)

  4. We usually don’t pass (u)int type variables by reference. Why is that different for the const uint256& hash parameter in BlockFilterIndex::ReadFilterFromDisk?

  5. In GCSFilterDecodeSkipCheck(), is it reasonable to construct GCSFilter filter with {0, 0} as the first two Params arguments? In your own words, what do the 4 Params parameters represent?

  6. Were you able to compile the bench_bitcoin benchmark tool?

  7. Were you able to run the benchmarks? In terms of ns/op (nanoseconds per operation), what results are you getting for GCSFilterDecode and for GCSFilterDecodeSkipCheck? Is that in line with expectations, and would you say there is sufficient benefit to warrant merging this PR?

  8. What are the risks and downsides of this PR? Are there any new attack vectors?
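
(Relevant to questions 3-5.) The core idea of the PR is that the block filter index already stores a hash for every filter, so a filter read from disk can be verified with a single hash comparison instead of the more expensive decode check. The sketch below only illustrates that idea and is not the PR's literal code: ReadAndVerifyFilter is a hypothetical helper (the real change lives in BlockFilterIndex::ReadFilterFromDisk), and the {0, 0} SipHash keys are placeholders, since in practice the BASIC filter derives them from the block hash. The four Params arguments are the two SipHash keys k0 and k1, the Golomb-Rice parameter P, and the inverse false-positive rate M.

    #include <blockfilter.h>
    #include <hash.h>
    #include <uint256.h>

    #include <optional>
    #include <utility>
    #include <vector>

    // Hypothetical helper, for illustration only: verify an encoded filter read from
    // disk against the hash already stored in the block filter index database, then
    // decode it with the expensive decode check skipped.
    std::optional<GCSFilter> ReadAndVerifyFilter(std::vector<unsigned char> encoded_filter,
                                                 const uint256& expected_hash)
    {
        // Cheap integrity check: hash the raw encoded bytes and compare with the stored hash.
        uint256 result;
        CHash256().Write(encoded_filter).Finalize(result);
        if (result != expected_hash) return std::nullopt; // corrupted filter on disk

        // Params: {siphash_k0, siphash_k1, P (Golomb-Rice parameter), M (inverse
        // false-positive rate)}. The hash above already proves integrity, so pass
        // skip_decode_check=true to avoid re-deriving the element count during decode.
        return GCSFilter({0, 0, BASIC_FILTER_P, BASIC_FILTER_M}, std::move(encoded_filter),
                         /*skip_decode_check=*/true);
    }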

Meeting Log

  17:00 <stickies-v> #startmeeting
  17:00 <BlueMoon> Hello!!
  17:00 <lightlike> hi
  17:01 <stickies-v> welcome everyone! On the menu this week is a PR by kcalvinalvin (building on earlier work by pstratem) that improves the performance of loading compact block filters from disk. In addition to those changes, we'll also look at the benchmarking tests and suite.
  17:01 <stickies-v> the notes and questions are available on https://bitcoincore.reviews/24832
  17:01 <yashraj> hi
  17:01 <Amirreza> Hello
  17:02 <Bitcoin_Hodler> hello
  17:02 <stickies-v> anyone joining us for the first time today? even if you're just lurking, feel free to say hi!
  17:02 <brunoerg> hi
  17:02 <nasser_saazi> hi
  17:02 <extheo[m]> Hi
  17:02 <ls55> Hi
  17:03 <schmidty_> hi
  17:03 <stickies-v> who got the chance to review the PR or read the notes? (y/n)
  17:03 <Amirreza> y
  17:03 <willcl_ark> Hi
  17:03 <Bitcoin_Hodler> y
  17:03 <lightlike> y
  17:04 <svav> Hi
  17:04 <TobiAdeyemi[m]> hi
  17:04 <glozow> hi
  17:04 <BlueMoon> y
  17:04 <brunoerg> y
  17:05 <stickies-v> for those of you who were able to review, would you give it a Concept ACK, Approach ACK, Tested ACK, or NACK?
  17:06 <Amirreza> Actually I'm still very new to the project :) but I think Concept ACK
  17:07 <stickies-v> luckily there's a bit less pressure on ACK'ing things just here in the review club :-D thanks for your input!
  17:07 <willcl_ark> It doesnt seem to make sense to perform expensive checks multiple times when we can just compare the hash, so concept ACK
  17:08 <Amirreza> For really understanding how the source code of this part works, do we need to know the GCS algorithm?
  17:10 <stickies-v> when reviewing a PR, you need to be comfortable about understanding the *changes* it introduces. In this PR, we don't really change how we use GCS, so I'd say it's not really super important. Of course, sometimes side effects can be difficult to understand, and having enough people with deep expertise of that part of the codebase is important too.
  17:10 <Bitcoin_Hodler> Concept ACK
  17:11 <stickies-v> we'll come back to benefits, risks and downsides about this PR later on in the discussion too! with that said, let's get to the questions
  17:11 <stickies-v> let's start with a quick refresher: how would you summarize what is a compact block filter, and what it's used for?
  17:12 <svav> Compact block filters - are a condensed representation of the contents of a block that allow wallets to determine whether the block contains any transactions involving the user’s keys.
  17:13 <Amirreza> It's for wallets to check if their addresses are involved in a given block or not.
  17:13 <Amirreza> Because checking all blocks is too expensive.
  17:13 <ls55> Compact block filters are a condensed representation of the contents of a block that allow wallets to determine whether the block contains any transactions involving the user’s keys.
  17:13 <sipa> well, they're still checking all blocks, but not by going through all transactions in those blocks individually
  17:14 <Amirreza> sipa: yeah you're right. ls55 explanation was better.
  17:14 <stickies-v> lots of good answers already - I'd say the main thing to add is that they're a compact representation of all the scriptPubKeys in a block - both the inputs and outputs
  17:15 <willcl_ark> They are used when clients don't want to have to download full-size blocks, but still want to check for transactions related to themselves (e.g. "light clients")
  17:16 <stickies-v> we already had bloom filters to achieve a similar goal, but they were leaking quite a bit of privacy and weren't incentive compatible (put a lot of load on full nodes), and that's fixed with compact block filters (at cost of higher bandwidth and CPU requirements for light nodes)
  17:16 <willcl_ark> When you find a match against the filter, you then request the full block to get full transaction information, without leaking too much privacy to the "server"
  17:16 <sipa> There is a more fundamental difference between the BIP37 bloom filters and the BIP157 golomb-coded filters than just the encoding.
  17:17 <sipa> BIP37 is server-side filtering: the client gives the server what they're interested in, and the server responds with matching transactions.
  17:17 <ls55> However, the Compact block filters approach consumes significantly more bandwidth.
  17:17 <sipa> With BIP157, the server gives the client a filter of what's in the block, and they can do the matching by themselves.
  17:17 <evanlinjin> @ls55 it's still way less bandwidth than running a full node though
  17:18 <ls55> evanlinjin: true
  17:18 <willcl_ark> With BIP157 theres a single filter per block, shared to all clients. With BIP37 each client can ask for matches against custom filters (and there's no DOS protection to requesting matching against many filters).
  17:18 <sipa> GCS is also smaller than Bloom (approximately 1.4x for the same false positive rate), but that's just an incremental change. The fundamental difference is that the client does not just tell the server what they care about anymore (a huge privacy leak).
  17:19 <Amirreza> What about probability of false positive? Bloom filter vs GCS.
  17:19 <stickies-v> willcl_ark: yes exactly, and that's quite relevant to this PR too, but we'll get back to that later
  17:19 <sipa> Both Bloom filters and GCS have tunable false positive rate - the lower your want the fprate to be, the bigger the filter becomes.
  17:20 <sipa> However, GCS is 1.4x smaller for the same fprate and data size, compared to Bloom.
  17:20 <Amirreza> sipa: got it. Thanks
  17:20 <sipa> The downside is that GCS cannot be updated efficiently once constructed; they're intended to be constructed once, and read multiple times.
  17:21 <stickies-v> the tunable false positive rate is also something we'll cover later on in the questions. Awesome discussion, I'll move on to the next question. As always, this discussion is async so feel free to continue discussing earlier questions
  17:21 <ls55> Does the wallet need to download the entire block when they learn that a block has relevant transactions or just the transactions?
  17:21 <stickies-v> when are compact block filters actually loaded from disk? Is this a frequent process?
  17:21 <Amirreza> ls55: no they download the merkle block AFAIK
  17:22 <Amirreza> which contains only block header and related TXs
  17:22 <sipa> ls55: Well, if they want the block, yes. There is no way to only download just the tx they care about (because that would reveal which tx they are interested in).
  17:22 <sipa> Amirreza: That's for BIP37, which is deprecated.
  17:22 <sipa> In the BIP157 way of working, yes, the client will download the full block if it matches.
  17:22 <Amirreza> sipa: oh, thanks for mentioning
  17:23 <ls55> sipa: Amirreza: Got it. Thanks.
  17:24 <stickies-v> ls55: this is also not consensus or P2P protocol or anything, I think any dev could implement this for their application however they want to, nothing's holding you back from querying individual TXs
  17:25 <stickies-v> (but you would be leaking privacy by doing that)
  17:25 <ls55> stickies-v: when are compact block filters actually loaded from disk? When `getblockfilter` RPC is called, I guess.
  17:25 <stickies-v> yes, that's one way! there are 2 more
  17:25 <lightlike> filters are loaded from disk when we get a request from a peer that wants them - that could be frequent
  17:26 <sipa> stickies-v: BIP158 defines the GCS filter. BIP157 exposes it over the P2P network.
  17:26 <ls55> stickies-v: Got it. Thanks. Downloading the entire block is more private.
  17:27 <ls55> How does a peer request a block filter ? Is there a specific message ?
  17:27 <Amirreza> sipa: Aren't bip numbers based on the order of issuing them? BIP157 used the concept that was introduced in BIP158?
  17:27 <stickies-v> sipa: hmm yeah my comment wasn't really relevant to BIP157/158, you're right
  17:27 <stickies-v> no, BIP numbers are not chronological
  17:27 <willcl_ark> Usually you request filter headers first, then filters which you are missing according to the filter header chain
  17:27 <sipa> ls55: Per BIP157, yes `getcfilters`
  17:28 <lightlike> I think filters can also be queried via REST
  17:28 <sipa> stickies-v: Outside of BIP37 there is no mechanism in the P2P protocol for a client to request a subset of the transactions of a block.
  17:29 <sipa> Oh, I guess BIP152 compact block relay also has a mechanism for transferring just a subset of transactions, but that's in a very different context.
  17:29 <stickies-v> lightlike: yes that's the third one!
  17:30 <ls55> spa: Thanks. Why would a peer request a block filter (`getcfilter` message) ? For the same reason as a lightweight wallet ?
  17:30 <stickies-v> so the compact block filters are only loaded on request, through RPC, through REST or through P2P networking with the GETCFILTERS message (if `NODE_COMPACT_FILTERS` service bit is set)
  17:30 <stickies-v> links to the relevant code:
  17:30 <stickies-v> RPC: https://github.com/bitcoin/bitcoin/blob/749b80b29e875cc6afa1c2674cccdfd7115cc16a/src/rpc/blockchain.cpp#L2226
  17:30 <stickies-v> REST: https://github.com/bitcoin/bitcoin/blob/749b80b29e875cc6afa1c2674cccdfd7115cc16a/src/rest.cpp#L519
  17:30 <sipa> @ls55 They're the same thing. That's how a lightweight wallet, being a peer of a full node, asks for a block filter.
  17:31 <stickies-v> GETCFILTERS: https://github.com/bitcoin/bitcoin/blob/11106a4722558765a44ae45c7892724a73ce514c/src/net_processing.cpp#L3514-L3516
  17:31 <ls55> sipa: Got it.
  17:32 <sipa> (That's assuming the lightweight wallet communicates over the P2P protocol; they don't have to, e.g. Electrum uses its own protocol, talking to Electrum servers, not Bitcoin P2P nodes - in that case BIP157 is obviously irrelevant)
  17:33 <stickies-v> next question: this PR introduces a new `skip_decode_check` bool parameter in the `GCSFilter` constructor that controls whether we check the size of the filter. In which scenarios will this parameter actually be `true`? Could we remove it?
  17:33 <lightlike> bitcoin core doesn't have or need any logic to request blockfilters via p2p itself, it just creates the filters to serve others
  17:35 <stickies-v> lightlike: good point. since Core doesn't have a "light client mode", there is this asymmetry where it only sends but never consumes compact block filters
  17:36 <willcl_ark> Wondering if that would make an interesting Tor/I2P mode…
  17:37 <stickies-v> woops forgot to provide the link for the previous question: link: https://github.com/kcalvinalvin/bitcoin/blob/e734228d8585c0870c71ce8ba8c037f8cf8b249a/src/blockfilter.h#L62
  17:37 <evanlinjin> Would it make sense to have core have a "light client mode"? Maybe, use blockfilters while the full node syncs in the background?
  17:38 <evanlinjin> *compact bloom filters
  17:38 <willcl_ark> We already sync headers first
  17:38 <sipa> @evanlinjin That question is meaningless without developer prioritization to make it happen.
  17:38 <sipa> Of course it's meaningful, but it'd be an enormously invasive change.
  17:39 <stickies-v> (*compact block filters, not bloom)
  17:39 <evanlinjin> willcl_ark: But you won't see txs
  17:39 <sipa> Bitcoin Core isn't designed around such a mode of operation.
  17:39 <sipa> assumeutxo goes in a similar direction, and it's not exactly moving alon quickly
  17:39 <evanlinjin> stickies-v: Thanks, I keep getting confused haha
  17:40 <evanlinjin> sipa: thank you for the insight
  17:42 <lightlike> stickies-v: I think skip_decode_check can be removed (and I think I suggested that in the PR some time ago). skip_decode_check=false is only used when deserializing blockfilters we get from others, and since core does not request filters from others, it doesn't need to do that (outside of tests)
  17:43 <stickies-v> lightlike: yes I was waiting for your input haha, see https://github.com/bitcoin/bitcoin/pull/24832#issuecomment-1098206739 for more discussion around this
  17:43 <stickies-v> BlockFilter::Unserialize() is the only place where `skip_decode_check` is true, and that's only called from the test suite
  17:44 <stickies-v> so kcalvinalvin is actually working on refactoring that
  17:44 <lightlike> though I'm not sure if we should remove Unserialize completely, as was also suggested. I'd be fine with just removing the extra check code.
  17:45 <stickies-v> alright next Q: we usually don't pass `(u)int` type variables by reference. Why is that different for the `const uint256& hash` parameter in `BlockFilterIndex::ReadFilterFromDisk`?
  17:45 <ls55> But isn't `skip_decode_check=true` the main benefit of this PR?
  17:45 <stickies-v> link: https://github.com/kcalvinalvin/bitcoin/blob/e734228d8585c0870c71ce8ba8c037f8cf8b249a/src/index/blockfilterindex.h#L34
  17:45 <stickies-v> ls55:
  17:45 <stickies-v> ls55: no the main benefit is that we can replace the previous decode check with a much cheaper hash check
  17:47 <stickies-v> since we already store the hash of each compact block filter in the db, we just compare that hash with the filter that we just loaded from disk
  17:48 <ls55> stickies-v: Got it. Thanks. `CHash256().Write(encoded_filter).Finalize(result); if (result != hash) ...`
  17:50 <stickies-v> hint for the current Q: what's the size of a reference/pointer?
  17:51 <sipa> nit: references don't have a size (from the perspective of the source code - since the reference is a perfect stand-in for what it is referencing). typically references are compiled to pointers, but the compiler isn't required to do so
  17:51 <svav> Reference is a lot smaller
  17:53 <stickies-v> sipa: interesting, so if not compiled to pointers, what are they compiled to?
  17:53 <ls55> `uint256` is often passed by reference in various parts of the codebase.
  17:53 <sipa> stickies-v: Sometimes nothing. The compiler may be able to inline a function call or so.
  17:54 <stickies-v> svav: exactly! pointers are usually 32/64 bit depending on OS, so much smaller than 256 bit. Also, uint256 is not a fundamental type but a class that we implemented in https://github.com/bitcoin/bitcoin/blob/bfc6070342b9f43bcf125526e6a3c8ed34e29a71/src/uint256.h#L112
  17:54 <sipa> My point is just that "reference" is a concept that only exists in the source code, and its size is meaningless. If you'd ask sizeof() on a reference, you'd just get the size of what it is referencing.
  17:55 <sipa> What you care about is what the overhead of passing an argument by reference or by pointer is, and while the compiler is free to do anything that works, typically both pointer and reference arguments result in 1 extra register being passed to the callee.
  17:56 <sipa> (sorry, language nittery, it's not actually relevant for this discussion)
  17:57 <stickies-v> no that's definitely interesting, it was kind of the point of the question!
  17:57 <stickies-v> alright quickly wanna cover the benchmarking too before we wrap up
  17:57 <stickies-v> who was able to compile the `bench_bitcoin` benchmarking tool?
  17:58 <stickies-v> and if so, were you able to run the benchmarks? ( ./bench_bitcoin --filter="GCS.*" )
  17:58 <willcl_ark> alas, I did not find the time :'(
  18:00 <stickies-v> willcl_ark: it's very quick both to compile and run actually! see https://github.com/bitcoin/bitcoin/blob/master/doc/benchmarking.md for making instructions
  18:00 <stickies-v> #endmeeting