wallet: fast rescan with BIP157 block filters for descriptor wallets (wallet)

https://github.com/bitcoin/bitcoin/pull/25957

Host: larryruane  -  PR author: theStack

Notes

  • This PR is a re-attempt of PR 15845 from 2019, which was closed without being merged. PR 15845 was the subject of an earlier review club. Its notes apply here as well.

  • This PR is a performance improvement (no functional difference).

  • BIP 157 (see also review club) adds the P2P support (light client protocol) for block filters, while BIP 158 specifies the filters themselves. This PR takes advantage of BIP 158.

  • One difference between this PR and 15845 is that this PR works only with descriptor wallets, which is a more recent type of wallet added in v0.17 (2019). (See doc/descriptors.md and Andrew Chow’s video)

  • To review this PR, you will need to create a descriptor wallet. This requires building your node with sqlite; see the build instructions for your environment (search for “sqlite”).

  • bitcoind does not automatically create a descriptor wallet (or any wallet). To create a wallet, run the createwallet RPC. You don’t need to specify any arguments except wallet name, such as my_wallet (the default is to create a descriptor wallet).

  • It’s probably best to also use -signet=1, since you can run a non-pruned node. You can get some coins to play with at the Signet Faucet.

  • When your node is finished syncing, run and time the rescanblockchain RPC.

  • You can restart with block filters enabled using -blockfilterindex=1, and run -rescanblockchain again to use the optimization.

  • The getindexinfo RPC will show you if block filter index is enabled.

  • The listreceivedbyaddress RPC will show you received transactions; this list should be the same with and without -blockfilterindex=1 (and with and without running this PR’s branch).

  • The PR description links to a benchmark script.

Questions

  1. Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK?

  2. Why would a node operator enable BIP 158 filters (-blockfilterindex=1)? Does the motivation make sense?

  3. What downsides, if any, are there to enabling BIP 158 filters?

  4. Were you able to set up and run the PR on signet as described in the notes? Did you see a difference in performance with and without -blockfilterindex?

  5. Were you able to run the benchmark script?

  6. What is the advantage of descriptor wallets compared to legacy wallets, especially in the creation of the filter set? (Hint: what exact type of data do we need to put into the filter set?)

  7. On a new descriptor wallet with default settings (i.e. ‘keypoolsize=1000’), how many elements would we need to put the filter set? (Hint: the listdescriptors RPC can be used to count the number of descriptors created)

  8. What is the difference between active and non-active descriptors, and why does this distinction matter for this PR? (Hint: see GetActiveScriptPubKeyMans() and GetAllScriptPubKeyMans() respectively.)

  9. What problem did the earlier version of this PR (15845 not address? (Hint) How this PR solve this problem?

  10. Why can’t we directly request the block filter index in the rescanning period? Why do we have to use the chain interface instead?

Meeting Log

  117:00 <larryruane_> #startmeeting
  217:00 <larryruane_> Hi!
  317:01 <willcl_ark> hi
  417:01 <lightlike> hi
  517:01 <glozow> hi
  617:01 <Kaizen_Kintsugi_> hi
  717:01 <larryruane_> Feel free to say hi, even if you're just lurking! Any first-time review club participants with us today?
  817:01 <theStack> hi
  917:02 <larryruane_> This week's PR is 25957: "wallet: fast rescan with BIP157 block filters for descriptor wallets". Notes and questions at https://bitcoincore.reviews/25957.html
 1017:03 <juancama> hi
 1117:03 <Kaizen_Kintsugi_> This one was super tough for me, very over my head
 1217:03 <ccdle12> hi
 1317:03 <Kaizen_Kintsugi_> reading over BIP158
 1417:03 <larryruane_> welcome to all! So yes, this one requires some background, but luckily we have the PR author with us, @theStack
 1517:04 <larryruane_> he will explain EVERYTHING :)
 1617:04 <larryruane_> Feel free to jump in with questions at any time, doesn't have to be related to the main discussion thread
 1717:04 <theStack> *blush* *cough*
 1817:04 <furszy> hi, lurking here.
 1917:04 <hernanmarino> Hi
 2017:04 <larryruane_> So what are some general impressions of the PR, other than it being tough?
 2117:05 <Kaizen_Kintsugi_> its a performance boost
 2217:05 <willcl_ark> It's nice to make slow operations faster!
 2317:05 <larryruane_> Did anyone have a chance to review the PR? Concept, approach, tested ACK, or NACK?
 2417:06 <willcl_ark> The wallet has historically suffered from lack of various indexes, so nice to use this one
 2517:06 <Kaizen_Kintsugi_> I reviewed and read the code, didnt get to testing
 2617:06 <Kaizen_Kintsugi_> I think I will today just to be thurough and practice testing prs
 2717:07 <larryruane_> willcl_ark: interesting point, are there some other indices the wallet should have? use of existing ones, or new ones? (I know slightly off-topic)
 2817:07 <larryruane_> (off-topic is our speciality here :) )
 2917:08 <Kaizen_Kintsugi_> question: the goal of this is to speed up the collection of txoutputs that a wallet address owns correct?
 3017:08 <Kaizen_Kintsugi_> *a descriptor wallet
 3117:09 <theStack> Kaizen_Kintsugi_: ad "reading over BIP158": i think understanding how exactly the filters are constructed in detail (BIP158) is not mandatory for reviewing this PR; knowing the basic idea should be sufficient
 3217:09 <Kaizen_Kintsugi_> is the basic idea faster filtering?
 3317:09 <larryruane_> Kaizen_Kintsugi_: yes, and also to identify transactions that pay TO the wallet (or more precisely, that the wallet has watch-only addresses of, or spending keys to)
 3417:10 <Kaizen_Kintsugi_> stack and larry: ty
 3517:11 <willcl_ark> Also it seems wasteful to be able to offer fast scans to SPV clients, but then to not use the filters for ourselves :)
 3617:11 <larryruane_> is that correct, @theStack? identifying transactions that either pay to this wallet (outputs), or are paid by this wallet (inputs)?
 3717:12 <larryruane_> willcl_ark: yes, that's how I think about it, if we have this index to benefit light client peers of ours, why not use it to benefit ourselves? no extra cost (other than a little more code)
 3817:13 <larryruane_> that leads into question 2, Why would a node operator enable BIP 158 filters (-blockfilterindex=1)? Does the motivation make sense?
 3917:13 <theStack> larryruane_: yes! the method checking if a tx is relevant for the wallet has the nice name `CWallet::AddToWalletIfInvolvingMe` (https://github.com/bitcoin/bitcoin/blob/fc44d1796e4df5824423d7d13de3082fe204db7d/src/wallet/wallet.cpp#L1093)
 4017:14 <theStack> and a bit below there is the condition `(fExisted || IsMine(tx) || IsFromMe(tx))`
 4117:14 <Kaizen_Kintsugi_> node operator would enable this to speed things up on rescanning
 4217:15 <larryruane_> Kaizen_Kintsugi_: yes, once this PR is merged.. What about before?
 4317:15 <willcl_ark> A few reasons: to offer better privacy to light clients connected to you, lower resource usage for yourself (as the server) and no ability for clients to DoS the server by requesting you monitor many unique filters (like BIP37 can do), and now faster rescans for yourself too!
 4417:15 <Kaizen_Kintsugi_> before? I'm not sure
 4517:16 <Kaizen_Kintsugi_> oh thats right, +1 will, blockfilters are better for privacy
 4617:16 <Kaizen_Kintsugi_> I didn't know about the reduced DoS. That is cool
 4717:17 <larryruane_> willcl_ark: good answer, before this PR, I would say it's providing a community service, not sure if there's any reason other than altruism (before this PR)
 4817:18 <larryruane_> so actually, this PR may lead to more nodes providing this service, since the incremental cost is smaller to do so!
 4917:19 <furszy> larryruane_, theStack: small add: not only txes that are sent from or received on the wallet are important. The wallet can watch scripts as well.
 5017:19 <larryruane_> side question, is it possible to enable the building and maintaining this index (`-blockfilterindex=1`) but not provide the BIP 157 peer-to-peer service?
 5117:19 <Kaizen_Kintsugi_> I'm going to say yes, because it requires a network flag thing?
 5217:19 <willcl_ark> I think yes, as you have to enable `peerblockfilters` too to serve them?
 5317:19 <theStack> furszy: good point
 5417:20 <theStack> willcl_ark: +1
 5517:20 <Kaizen_Kintsugi_> people can abuse these things that signal to eachother, I forget what they are called though
 5617:20 <larryruane_> willcl_ark: yes exactly
 5717:21 <larryruane_> are there any downsides to enabling BIP 158 block filters? (question 3)
 5817:22 <theStack> side-note: for people wanting to learn more details about block filters and BIP 157/158, there has been a row of interesting PR review clubs about that in 2020 (i think https://bitcoincore.reviews/18877 was the first one)
 5917:23 <willcl_ark> Not quite answering the question directly, but they do require more (client) bandwidth than BIP37 filters, IIRC
 6017:23 <lightlike> they require some disk space
 6117:23 <larryruane_> I think conceptually BIP 158 filter is similar to a bloom filter, but better for this use case (more efficient), but I don't know the details
 6217:23 <larryruane_> lightlike: +1
 6317:23 <Kaizen_Kintsugi_> +1 lightlike
 6417:24 <willcl_ark> but for a node operator with adequate CPU, RAM and disk space overhead, I'd say not many downsides
 6517:24 <Kaizen_Kintsugi_> is true that any index option requires rescan and additional disk space?
 6617:24 <Kaizen_Kintsugi_> I remember enabling txindex and having to wait.
 6717:26 <larryruane_> and more side node, the BIP 37 bloom filter had the light client provide the bloom filter to its server (the full node), and that was different for each light client (so the server had to remember a bunch of them), whereas with BIP 157/158, the server generates just one for each block, and can send it (the same filter) to ALL of its light clients
 6817:26 <larryruane_> so it's much less of a burden on (what i'm calling) the server (full node)
 6917:27 <willcl_ark> yes enabling the index requires a rescan to build the filters for each block
 7017:28 <larryruane_> Kaizen_Kintsugi_: I think the term rescan is specific to the wallet (?) ... but yes, enabling txindex, or the block filters, requires reading all the blocks again
 7117:28 <Kaizen_Kintsugi_> will: ty
 7217:29 <larryruane_> should we move on? (feel free to continue previous discussions) ... question 4, Why would a node operator enable BIP 158 filters (-blockfilterindex=1)? Does the motivation make sense?
 7317:30 <Kaizen_Kintsugi_> yea, net performance boost
 7417:30 <larryruane_> oh I'm sorry, that copy-paste was wrong, question 4 is: Were you able to set up and run the PR on signet as described in the notes? Did you see a difference in performance with and without -blockfilterindex?
 7517:32 <willcl_ark> I did not test it yet myself, but noticed that someone called LarryRuane on GH had some interesting signet results :)
 7617:33 <larryruane_> I myself did this, it's easier than enabling blockfilterindex on mainnet, you can build the blockfilter index in signet in a few minutes
 7717:34 <Kaizen_Kintsugi_> I think I read someone tested it on mainnet. I plan to do so.
 7817:34 <larryruane_> But for me, signet was slower with the PR than without the PR! Any ideas why that might be?
 7917:34 <Kaizen_Kintsugi_> Oh I read this, because there are empty blocks, which increases false positives?
 8017:35 <Kaizen_Kintsugi_> signet has a lot of empty blocks I think
 8117:35 <larryruane_> Yes I think so, and that's my guess.. but it doesn't increase false positives
 8217:36 <Kaizen_Kintsugi_> oh derp
 8317:36 <larryruane_> it ends up using the block filter to check each block (rather than checking each block directly), but using the filter seems to take longer than checking an empty (or near-empty) block!
 8417:36 <Kaizen_Kintsugi_> larry: thanks for cleaning that up
 8517:37 <Kaizen_Kintsugi_> why is that it goes through every block?
 8617:37 <larryruane_> PR author @theStack wrote a benchmark script https://github.com/theStack/bitcoin/blob/fast_rescan_functional_test_benchmark/test/functional/pr25957_benchmark.py
 8717:38 <Kaizen_Kintsugi_> and so it seems like there is a threshold of how many transactions are in a block to gain a performance boost
 8817:38 <larryruane_> did anyone have a chance to run that? (this is question 5)
 8917:38 <larryruane_> Kaizen_Kintsugi_: yes! does that suggest an optimization (to the overall optimization that this PR is)?
 9017:40 <Kaizen_Kintsugi_> I think it does
 9117:40 <willcl_ark> Perhaps GCSFilter::MatchInternal() is just always going to be slower than reading (nearly) empty blocks?
 9217:40 <Kaizen_Kintsugi_> but I think that transaction count would have to be discovered?
 9317:41 <larryruane_> Kaizen_Kintsugi_: +1 willcl_ark: yes I think so (I didn't analyze in detail)
 9417:41 <theStack> Kaizen_Kintsugi_: even if we know the transaction count, it's a bad metric to determine how long a block takes to rescan. has anyone an idea why?
 9517:42 <larryruane_> Kaizen_Kintsugi_: that's a really good point.. and technically the transaction count isn't enough, it depends on the number of tx inputs and outputs
 9617:42 <willcl_ark> ^
 9717:42 <Kaizen_Kintsugi_> I was going to guess the difference in wallet types
 9817:42 <Kaizen_Kintsugi_> err address types
 9917:42 <Kaizen_Kintsugi_> ah so every transaction would have to be decoded correct to discover if it is worth it?
10017:43 <larryruane_> or at least you'd need to know how many inputs and outputs there are to examine in a block ... which you don't really have easy access to
10117:43 <theStack> Kaizen_Kintsugi_: yes. as an extreme example, i've seen blocks every now and then that only consist of 10 txs but are still full (each one takes 100kvbytes, which is a policy limit IIRC)
10217:44 <Kaizen_Kintsugi_> yea damn that isn't implicit in a block header is it
10317:44 <larryruane_> personally I'd say it's not worth optimizing ... this inverted performance behavior wouldn't occur on mainnet, which is all we really care about
10417:44 <willcl_ark> agree
10517:44 <Kaizen_Kintsugi_> yea it is starting to sound like a headache
10617:45 <larryruane_> Kaizen_Kintsugi_: +1 ... the block header does include the transaction count but that's always zero (this is why block headers are 81 bytes serialized, not 80)
10717:45 <Kaizen_Kintsugi_> ah
10817:45 <sipa> RE BIP158's GCS filter: it is indeed similar to a Bloom filter (no false negatives, a controllable rate of false positives), but more compact (iirc around 1.3x-1.4x). The downsides are the GCSs are write-once (you can't update them once created), and querying is much slower. Bloom filters are effectively O(n) for finding n elements in them. GCS are O(m+n) for finding n elements in a filter of size m.
10917:46 <sipa> So Bloom filters are way faster if you're only going to do one or a few queries. But as you're querying for larger and larger number of elements, the relative downside of a GCS's performance goes down.
11017:47 <larryruane_> sipa: +1 that's very helpful, TIL
11117:47 <Kaizen_Kintsugi_> aye ty sipa
11217:47 <larryruane_> question 6: What is the advantage of descriptor wallets compared to legacy wallets, especially in the creation of the filter set? (Hint: what exact type of data do we need to put into the filter set?)
11317:47 <willcl_ark> Thanks! Hmmm, I wonder why I had in my head that they used more bandwidth on the client that BIP37 filters...
11417:49 <Kaizen_Kintsugi_> 6: I think you need pubkeys?
11517:49 <theStack> hint for answering the hint question: just look up how the block filter is created
11617:49 <furszy> willcl_ark: because clients request entire blocks instead merely upload the bloom filter and receive the txes that matches it directly.
11717:50 <theStack> Kaizen_Kintsugi_: right direction already, but it's "a bit more" than just pubkeys
11817:50 <Kaizen_Kintsugi_> save active ScriptPubKeyMans's end ranges
11917:50 <Kaizen_Kintsugi_> ?
12017:51 <sipa> Yeah BIP37 offered a way to just downloading matcing transactions in blocks. BIP157 does not, as the server judt doesn't know what it'd need to give. This is an advantage on its own, as it avoids gratuitously revealing which transactions are interesting to the client (BIP37 has terrible privacy for this reason)
12117:52 <larryruane_> sipa: so the only privacy leak is that the server knows that a particular light client is interested in *something* within this block (but not which tx(s))
12217:52 <larryruane_> (is that right?)
12317:53 <Kaizen_Kintsugi_> I'm looking at FastWalletRescanFilter, is that right?
12417:53 <willcl_ark> Do we create an SPKM for each pubkey in legacy wallets, for each address type, resulting in hundreds (thousands?) whereas for descriptor wallets we have 8 SPKMans, 2 for each of 4 address types, receive and change?
12517:53 <larryruane_> theStack: "just look up how the block filter is created" https://github.com/bitcoin/bitcoin/blob/fc44d1796e4df5824423d7d13de3082fe204db7d/src/blockfilter.cpp#L187
12617:54 <Kaizen_Kintsugi_> ah thanks
12717:54 <willcl_ark> hmmm, actually IIRC legacy wallets have 1 SPKM aliased to all 8 default SPKM slots now, so I don't think thats it
12817:54 <sipa> @larryruane Yes. Though obviously clients can leak information in different ways too (tx relay, for example).
12917:55 <Kaizen_Kintsugi_> is it some sort of delta?
13017:55 <Kaizen_Kintsugi_> what is a CBlockUndo?
13117:56 <sipa> FWIW, I have a writeup on the analysis for the size of GCS filters (which was used to set the BIP158 parameters): https://github.com/sipa/writeups/tree/main/minimizing-golomb-filters
13217:56 <willcl_ark> For descriptor wallets I know it's much easier to enumerate the set of SPKs that it involves
13317:58 <larryruane_> willcl_ark: yes I think that's correct, the SPKs are already determined and broken out
13417:58 <furszy> willcl_ark: pre or post migration?
13517:59 <theStack> willcl_ark: yes, and especially the scriptPubKeys are saved already, exactly the thing we need to put into the filter set
13617:59 <larryruane_> we're almost out of time, there are a few questions remaining (7-10) sorry we didn't get to them, any comments on those questions? or anything else?
13718:00 <theStack> 7 should be quick and easy to answer... just shout out a number :)
13818:00 <larryruane_> a really good question is 9, what problem does this PR fix that the earlier PR didn't?
13918:01 <Kaizen_Kintsugi_> allowed non SPV nodes to run this?
14018:01 <Kaizen_Kintsugi_> or give full nodes the ability of a faster rescan?
14118:02 <sipa> Bitcoin Core does not implement any SPV mode.
14218:02 <larryruane_> Kaizen_Kintsugi_: no... it has to do with the top-up of the keypool, we didn't have a chance to get into that
14318:02 <sipa> This PR only affects Bitcoin Core's wallet
14418:02 <larryruane_> We better stop here, thank you everyone!
14518:02 <larryruane_> #endmeeting