wallet: fast rescan with BIP157 block filters for descriptor wallets (wallet)

Sep 7, 2022

https://github.com/bitcoin/bitcoin/pull/25957

Host: larryruane - PR author: theStack

Notes

This PR is a re-attempt of PR 15845 from 2019, which was closed without being merged. PR 15845 was the subject of an earlier review club. Its notes apply here as well.
This PR is a performance improvement (no functional difference).
BIP 157 (see also review club) adds the P2P support (light client protocol) for block filters, while BIP 158 specifies the filters themselves. This PR takes advantage of BIP 158.
One difference between this PR and 15845 is that this PR works only with descriptor wallets, a more recent type of wallet. Output descriptors were introduced in v0.17 (2018), and descriptor wallets followed in v0.21 (2021). (See doc/descriptors.md and Andrew Chow’s video)
To review this PR, you will need to create a descriptor wallet. This requires building your node with sqlite; see the build instructions for your environment (search for “sqlite”).
bitcoind does not automatically create a descriptor wallet (or any wallet). To create a wallet, run the createwallet RPC. You don’t need to specify any arguments except wallet name, such as my_wallet (the default is to create a descriptor wallet).
It’s probably best to also use -signet=1, since the signet chain is small enough that you can run a non-pruned node. You can get some coins to play with at the Signet Faucet.
When your node is finished syncing, run and time the rescanblockchain RPC.
You can then restart with block filters enabled using -blockfilterindex=1, and run the rescanblockchain RPC again to use the optimization.
The getindexinfo RPC will show you if block filter index is enabled.
The listreceivedbyaddress RPC will show you received transactions; this list should be the same with and without -blockfilterindex=1 (and with and without running this PR’s branch).
The PR description links to a benchmark script.
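
The testing steps above can be scripted. The following is a minimal sketch, assuming bitcoin-cli is on your PATH and a signet bitcoind is already running. The helper names cli_command and timed_rpc are hypothetical (they are not part of Bitcoin Core), and my_wallet is just the example wallet name from the notes. As written, it only builds and prints the commands.

```python
import shlex
import subprocess
import time


def cli_command(rpc, *args, wallet=None, signet=True):
    """Build a bitcoin-cli argument list for one RPC call (hypothetical helper)."""
    cmd = ["bitcoin-cli"]
    if signet:
        cmd.append("-signet")
    if wallet is not None:
        # -rpcwallet selects which loaded wallet receives the RPC.
        cmd.append(f"-rpcwallet={wallet}")
    cmd.append(rpc)
    cmd.extend(str(a) for a in args)
    return cmd


def timed_rpc(cmd):
    """Run one command and return (stdout, elapsed seconds)."""
    start = time.monotonic()
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout, time.monotonic() - start


# The sequence described in the notes.
STEPS = [
    cli_command("createwallet", "my_wallet"),
    cli_command("getindexinfo"),
    cli_command("rescanblockchain", wallet="my_wallet"),
    cli_command("listreceivedbyaddress", wallet="my_wallet"),
]

if __name__ == "__main__":
    for step in STEPS:
        print(shlex.join(step))
```

To compare timings, pass the rescanblockchain command to timed_rpc once, restart bitcoind with -blockfilterindex=1, wait for the index to finish building, and run it again.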

Questions

Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK?
Why would a node operator enable BIP 158 filters (-blockfilterindex=1)? Does the motivation make sense?
What downsides, if any, are there to enabling BIP 158 filters?
Were you able to set up and run the PR on signet as described in the notes? Did you see a difference in performance with and without -blockfilterindex?
Were you able to run the benchmark script?
What is the advantage of descriptor wallets compared to legacy wallets, especially in the creation of the filter set? (Hint: what exact type of data do we need to put into the filter set?)
On a new descriptor wallet with default settings (i.e. ‘keypoolsize=1000’), how many elements would we need to put into the filter set? (Hint: the listdescriptors RPC can be used to count the number of descriptors created)
What is the difference between active and non-active descriptors, and why does this distinction matter for this PR? (Hint: see GetActiveScriptPubKeyMans() and GetAllScriptPubKeyMans() respectively.)
What problem did the earlier version of this PR (15845) not address? (Hint) How does this PR solve this problem?
Why can’t we directly request the block filter index in the rescanning period? Why do we have to use the chain interface instead?

Meeting Log

17:00

<larryruane_> #startmeeting

17:00

<larryruane_> Hi!

17:01

<willcl_ark> hi

17:01

<lightlike> hi

17:01

<glozow> hi

17:01

<Kaizen_Kintsugi_> hi

17:01

<larryruane_> Feel free to say hi, even if you're just lurking! Any first-time review club participants with us today?

17:01

<theStack> hi

17:02

<larryruane_> This week's PR is 25957: "wallet: fast rescan with BIP157 block filters for descriptor wallets". Notes and questions at https://bitcoincore.reviews/25957.html

17:03

<juancama> hi

17:03

<Kaizen_Kintsugi_> This one was super tough for me, very over my head

17:03

<ccdle12> hi

17:03

<Kaizen_Kintsugi_> reading over BIP158

17:03

<larryruane_> welcome to all! So yes, this one requires some background, but luckily we have the PR author with us, @theStack

17:04

<larryruane_> he will explain EVERYTHING :)

17:04

<larryruane_> Feel free to jump in with questions at any time, doesn't have to be related to the main discussion thread

17:04

<theStack> *blush* *cough*

17:04

<furszy> hi, lurking here.

17:04

<hernanmarino> Hi

17:04

<larryruane_> So what are some general impressions of the PR, other than it being tough?

17:05

<Kaizen_Kintsugi_> its a performance boost

17:05

<willcl_ark> It's nice to make slow operations faster!

17:05

<larryruane_> Did anyone have a chance to review the PR? Concept, approach, tested ACK, or NACK?

17:06

<willcl_ark> The wallet has historically suffered from lack of various indexes, so nice to use this one

17:06

<Kaizen_Kintsugi_> I reviewed and read the code, didnt get to testing

17:06

<Kaizen_Kintsugi_> I think I will today just to be thorough and practice testing PRs

17:07

<larryruane_> willcl_ark: interesting point, are there some other indices the wallet should have? use of existing ones, or new ones? (I know slightly off-topic)

17:07

<larryruane_> (off-topic is our speciality here :) )

17:08

<Kaizen_Kintsugi_> question: the goal of this is to speed up the collection of txoutputs that a wallet address owns correct?

17:08

<Kaizen_Kintsugi_> *a descriptor wallet

17:09

<theStack> Kaizen_Kintsugi_: ad "reading over BIP158": i think understanding how exactly the filters are constructed in detail (BIP158) is not mandatory for reviewing this PR; knowing the basic idea should be sufficient

17:09

<Kaizen_Kintsugi_> is the basic idea faster filtering?

17:09

<larryruane_> Kaizen_Kintsugi_: yes, and also to identify transactions that pay TO the wallet (or more precisely, that the wallet has watch-only addresses of, or spending keys to)

17:10

<Kaizen_Kintsugi_> stack and larry: ty

17:11

<willcl_ark> Also it seems wasteful to be able to offer fast scans to SPV clients, but then to not use the filters for ourselves :)

17:11

<larryruane_> is that correct, @theStack? identifying transactions that either pay to this wallet (outputs), or are paid by this wallet (inputs)?

17:12

<larryruane_> willcl_ark: yes, that's how I think about it, if we have this index to benefit light client peers of ours, why not use it to benefit ourselves? no extra cost (other than a little more code)

17:13

<larryruane_> that leads into question 2, Why would a node operator enable BIP 158 filters (-blockfilterindex=1)? Does the motivation make sense?

17:13

<theStack> larryruane_: yes! the method checking if a tx is relevant for the wallet has the nice name `CWallet::AddToWalletIfInvolvingMe` (https://github.com/bitcoin/bitcoin/blob/fc44d1796e4df5824423d7d13de3082fe204db7d/src/wallet/wallet.cpp#L1093)

17:14

<theStack> and a bit below there is the condition `(fExisted || IsMine(tx) || IsFromMe(tx))`

17:14

<Kaizen_Kintsugi_> node operator would enable this to speed things up on rescanning

17:15

<larryruane_> Kaizen_Kintsugi_: yes, once this PR is merged.. What about before?

17:15

<willcl_ark> A few reasons: to offer better privacy to light clients connected to you, lower resource usage for yourself (as the server) and no ability for clients to DoS the server by requesting you monitor many unique filters (like BIP37 can do), and now faster rescans for yourself too!

17:15

<Kaizen_Kintsugi_> before? I'm not sure

17:16

<Kaizen_Kintsugi_> oh thats right, +1 will, blockfilters are better for privacy

17:16

<Kaizen_Kintsugi_> I didn't know about the reduced DoS. That is cool

17:17

<larryruane_> willcl_ark: good answer, before this PR, I would say it's providing a community service, not sure if there's any reason other than altruism (before this PR)

17:18

<larryruane_> so actually, this PR may lead to more nodes providing this service, since the incremental cost is smaller to do so!

17:19

<furszy> larryruane_, theStack: small add: not only txes that are sent from or received on the wallet are important. The wallet can watch scripts as well.

17:19

<larryruane_> side question, is it possible to enable the building and maintaining this index (`-blockfilterindex=1`) but not provide the BIP 157 peer-to-peer service?

17:19

<Kaizen_Kintsugi_> I'm going to say yes, because it requires a network flag thing?

17:19

<willcl_ark> I think yes, as you have to enable `peerblockfilters` too to serve them?

17:19

<theStack> furszy: good point

17:20

<theStack> willcl_ark: +1

17:20

<Kaizen_Kintsugi_> people can abuse these things that signal to eachother, I forget what they are called though

17:20

<larryruane_> willcl_ark: yes exactly

17:21

<larryruane_> are there any downsides to enabling BIP 158 block filters? (question 3)

17:22

<theStack> side-note: for people wanting to learn more details about block filters and BIP 157/158, there has been a row of interesting PR review clubs about that in 2020 (i think https://bitcoincore.reviews/18877 was the first one)

17:23

<willcl_ark> Not quite answering the question directly, but they do require more (client) bandwidth than BIP37 filters, IIRC

17:23

<lightlike> they require some disk space

17:23

<larryruane_> I think conceptually BIP 158 filter is similar to a bloom filter, but better for this use case (more efficient), but I don't know the details

17:23

<larryruane_> lightlike: +1

17:23

<Kaizen_Kintsugi_> +1 lightlike

17:24

<willcl_ark> but for a node operator with adequate CPU, RAM and disk space overhead, I'd say not many downsides

17:24

<Kaizen_Kintsugi_> is it true that any index option requires rescan and additional disk space?

17:24

<Kaizen_Kintsugi_> I remember enabling txindex and having to wait.

17:26

<larryruane_> and one more side note, the BIP 37 bloom filter had the light client provide the bloom filter to its server (the full node), and that was different for each light client (so the server had to remember a bunch of them), whereas with BIP 157/158, the server generates just one for each block, and can send it (the same filter) to ALL of its light clients

17:26

<larryruane_> so it's much less of a burden on (what i'm calling) the server (full node)

17:27

<willcl_ark> yes enabling the index requires a rescan to build the filters for each block

17:28

<larryruane_> Kaizen_Kintsugi_: I think the term rescan is specific to the wallet (?) ... but yes, enabling txindex, or the block filters, requires reading all the blocks again

17:28

<Kaizen_Kintsugi_> will: ty

17:29

<larryruane_> should we move on? (feel free to continue previous discussions) ... question 4, Why would a node operator enable BIP 158 filters (-blockfilterindex=1)? Does the motivation make sense?

17:30

<Kaizen_Kintsugi_> yea, net performance boost

17:30

<larryruane_> oh I'm sorry, that copy-paste was wrong, question 4 is: Were you able to set up and run the PR on signet as described in the notes? Did you see a difference in performance with and without -blockfilterindex?

17:32

<willcl_ark> I did not test it yet myself, but noticed that someone called LarryRuane on GH had some interesting signet results :)

17:33

<larryruane_> I myself did this, it's easier than enabling blockfilterindex on mainnet, you can build the blockfilter index in signet in a few minutes

17:34

<Kaizen_Kintsugi_> I think I read someone tested it on mainnet. I plan to do so.

17:34

<larryruane_> But for me, signet was slower with the PR than without the PR! Any ideas why that might be?

17:34

<Kaizen_Kintsugi_> Oh I read this, because there are empty blocks, which increases false positives?

17:35

<Kaizen_Kintsugi_> signet has a lot of empty blocks I think

17:35

<larryruane_> Yes I think so, and that's my guess.. but it doesn't increase false positives

17:36

<Kaizen_Kintsugi_> oh derp

17:36

<larryruane_> it ends up using the block filter to check each block (rather than checking each block directly), but using the filter seems to take longer than checking an empty (or near-empty) block!

17:36

<Kaizen_Kintsugi_> larry: thanks for cleaning that up

17:37

<Kaizen_Kintsugi_> why is that it goes through every block?

17:37

<larryruane_> PR author @theStack wrote a benchmark script https://github.com/theStack/bitcoin/blob/fast_rescan_functional_test_benchmark/test/functional/pr25957_benchmark.py

17:38

<Kaizen_Kintsugi_> and so it seems like there is a threshold of how many transactions are in a block to gain a performance boost

17:38

<larryruane_> did anyone have a chance to run that? (this is question 5)

17:38

<larryruane_> Kaizen_Kintsugi_: yes! does that suggest an optimization (to the overall optimization that this PR is)?

17:40

<Kaizen_Kintsugi_> I think it does

17:40

<willcl_ark> Perhaps GCSFilter::MatchInternal() is just always going to be slower than reading (nearly) empty blocks?

17:40

<Kaizen_Kintsugi_> but I think that transaction count would have to be discovered?

17:41

<larryruane_> Kaizen_Kintsugi_: +1 willcl_ark: yes I think so (I didn't analyze in detail)

17:41

<theStack> Kaizen_Kintsugi_: even if we know the transaction count, it's a bad metric to determine how long a block takes to rescan. has anyone an idea why?

17:42

<larryruane_> Kaizen_Kintsugi_: that's a really good point.. and technically the transaction count isn't enough, it depends on the number of tx inputs and outputs

17:42

<willcl_ark> ^

17:42

<Kaizen_Kintsugi_> I was going to guess the difference in wallet types

17:42

<Kaizen_Kintsugi_> err address types

17:42

<Kaizen_Kintsugi_> ah so every transaction would have to be decoded correct to discover if it is worth it?

17:43

<larryruane_> or at least you'd need to know how many inputs and outputs there are to examine in a block ... which you don't really have easy access to

17:43

<theStack> Kaizen_Kintsugi_: yes. as an extreme example, i've seen blocks every now and then that only consist of 10 txs but are still full (each one takes 100kvbytes, which is a policy limit IIRC)

17:44

<Kaizen_Kintsugi_> yea damn that isn't implicit in a block header is it

17:44

<larryruane_> personally I'd say it's not worth optimizing ... this inverted performance behavior wouldn't occur on mainnet, which is all we really care about

17:44

<willcl_ark> agree

17:44

<Kaizen_Kintsugi_> yea it is starting to sound like a headache

17:45

<larryruane_> Kaizen_Kintsugi_: +1 ... the block header does include the transaction count but that's always zero (this is why block headers are 81 bytes serialized, not 80)

17:45

<Kaizen_Kintsugi_> ah

17:45

<sipa> RE BIP158's GCS filter: it is indeed similar to a Bloom filter (no false negatives, a controllable rate of false positives), but more compact (iirc around 1.3x-1.4x). The downsides are the GCSs are write-once (you can't update them once created), and querying is much slower. Bloom filters are effectively O(n) for finding n elements in them. GCS are O(m+n) for finding n elements in a filter of size m.

17:46

<sipa> So Bloom filters are way faster if you're only going to do one or a few queries. But as you're querying for larger and larger number of elements, the relative downside of a GCS's performance goes down.
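
The O(m+n) behaviour sipa describes comes from walking two sorted sequences in lockstep. Below is a minimal sketch of that idea, with loudly stated simplifications: it uses SHA-256 as a stand-in hash (BIP 158 specifies SipHash), and it stores the deltas in a plain Python list instead of Golomb-Rice coding them. The function names are hypothetical. Only the value of M is taken from BIP 158.

```python
import hashlib

M = 784931  # BIP 158 parameter for basic filters (controls the false-positive rate)


def hash_to_range(element: bytes, f: int) -> int:
    # Stand-in hash (SHA-256), NOT the SipHash that BIP 158 specifies.
    h = int.from_bytes(hashlib.sha256(element).digest()[:8], "big")
    return (h * f) >> 64  # map the 64-bit hash uniformly into [0, f)


def build_filter(elements):
    """Return (n, deltas): sorted hashed values stored as successive differences."""
    unique = set(elements)
    n = len(unique)
    values = sorted(hash_to_range(e, n * M) for e in unique)
    deltas, prev = [], 0
    for v in values:
        deltas.append(v - prev)
        prev = v
    return n, deltas


def match_any(filter_, queries):
    """True if any query might be in the filter (no false negatives)."""
    n, deltas = filter_
    if n == 0:
        return False
    # Sorting the queries costs O(n log n); the walk below is O(m + n).
    targets = sorted(hash_to_range(q, n * M) for q in queries)
    value, i = 0, 0
    for delta in deltas:          # one sequential pass over the filter
        value += delta
        while i < len(targets) and targets[i] < value:
            i += 1                # each query is stepped past at most once
        if i == len(targets):
            return False
        if targets[i] == value:
            return True
    return False
```

Because the filter can only be decoded front to back, a single lookup still pays for the pass over the filter, which fits the observation earlier in the meeting that checking a near-empty block directly can be cheaper than consulting its filter.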

17:47

<larryruane_> sipa: +1 that's very helpful, TIL

17:47

<Kaizen_Kintsugi_> aye ty sipa

17:47

<larryruane_> question 6: What is the advantage of descriptor wallets compared to legacy wallets, especially in the creation of the filter set? (Hint: what exact type of data do we need to put into the filter set?)

17:47

<willcl_ark> Thanks! Hmmm, I wonder why I had in my head that they used more bandwidth on the client than BIP37 filters...

17:49

<Kaizen_Kintsugi_> 6: I think you need pubkeys?

17:49

<theStack> hint for answering the hint question: just look up how the block filter is created

17:49

<furszy> willcl_ark: because clients request entire blocks instead of merely uploading the bloom filter and receiving the txes that match it directly.

17:50

<theStack> Kaizen_Kintsugi_: right direction already, but it's "a bit more" than just pubkeys

17:50

<Kaizen_Kintsugi_> save active ScriptPubKeyMans's end ranges

17:50

<Kaizen_Kintsugi_> ?

17:51

<sipa> Yeah BIP37 offered a way to just download matching transactions in blocks. BIP157 does not, as the server just doesn't know what it'd need to give. This is an advantage on its own, as it avoids gratuitously revealing which transactions are interesting to the client (BIP37 has terrible privacy for this reason)

17:52

<larryruane_> sipa: so the only privacy leak is that the server knows that a particular light client is interested in *something* within this block (but not which tx(s))

17:52

<larryruane_> (is that right?)

17:53

<Kaizen_Kintsugi_> I'm looking at FastWalletRescanFilter, is that right?

17:53

<willcl_ark> Do we create an SPKM for each pubkey in legacy wallets, for each address type, resulting in hundreds (thousands?) whereas for descriptor wallets we have 8 SPKMans, 2 for each of 4 address types, receive and change?

17:53

<larryruane_> theStack: "just look up how the block filter is created" https://github.com/bitcoin/bitcoin/blob/fc44d1796e4df5824423d7d13de3082fe204db7d/src/blockfilter.cpp#L187
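
As a rough illustration of what the linked code collects, here is a simplified sketch of the element set for a BIP 158 basic filter. The data model is a toy assumption: each transaction is just a list of output scriptPubKeys (bytes), and spent_scripts stands in for the scripts of the outputs the block spends, which the node keeps in the block's undo data. The function name is hypothetical and this is not the Bitcoin Core implementation.

```python
OP_RETURN = 0x6A


def basic_filter_elements(block_txs, spent_scripts):
    """Collect the scripts a basic block filter is built from.

    block_txs:     list of transactions, each a list of output scriptPubKeys
    spent_scripts: scriptPubKeys of the outputs spent by this block's inputs
    """
    elements = set()
    for outputs in block_txs:
        for script in outputs:
            # Empty scripts and OP_RETURN outputs are left out of the filter.
            if not script or script[0] == OP_RETURN:
                continue
            elements.add(script)
    for script in spent_scripts:
        if script:
            elements.add(script)
    return elements
```

The point for this PR is that every element is a scriptPubKey, so a wallet that already stores its scriptPubKeys can query the filter directly.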

17:54

<Kaizen_Kintsugi_> ah thanks

17:54

<willcl_ark> hmmm, actually IIRC legacy wallets have 1 SPKM aliased to all 8 default SPKM slots now, so I don't think thats it

17:54

<sipa> @larryruane Yes. Though obviously clients can leak information in different ways too (tx relay, for example).

17:55

<Kaizen_Kintsugi_> is it some sort of delta?

17:55

<Kaizen_Kintsugi_> what is a CBlockUndo?

17:56

<sipa> FWIW, I have a writeup on the analysis for the size of GCS filters (which was used to set the BIP158 parameters): https://github.com/sipa/writeups/tree/main/minimizing-golomb-filters

17:56

<willcl_ark> For descriptor wallets I know it's much easier to enumerate the set of SPKs that it involves

17:58

<larryruane_> willcl_ark: yes I think that's correct, the SPKs are already determined and broken out

17:58

<furszy> willcl_ark: pre or post migration?

17:59

<theStack> willcl_ark: yes, and especially the scriptPubKeys are saved already, exactly the thing we need to put into the filter set
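
The overall idea of the fast rescan can be sketched in a few lines. This is a toy model, not the PR's code: Python sets stand in for the per-block GCS filters (real filters can also report false positives), a dict of script lists stands in for reading blocks from disk, and the name fast_rescan is hypothetical.

```python
def fast_rescan(blocks, filters, wallet_scripts):
    """Scan only the blocks whose filter matches a wallet scriptPubKey.

    blocks:         height -> list of scripts appearing in that block
    filters:        height -> set of scripts (toy stand-in for a block filter)
    wallet_scripts: set of scriptPubKeys the wallet is watching
    """
    inspected = []  # heights of blocks that had to be read in full
    found = []      # (height, script) pairs relevant to the wallet
    for height in sorted(filters):
        if filters[height].isdisjoint(wallet_scripts):
            continue  # filter rules the block out, so it is never read
        inspected.append(height)
        for script in blocks[height]:
            if script in wallet_scripts:
                found.append((height, script))
    return inspected, found


blocks = {1: [b"a", b"b"], 2: [b"c"], 3: [b"d", b"a"]}
filters = {height: set(scripts) for height, scripts in blocks.items()}
print(fast_rescan(blocks, filters, {b"a"}))
```

In this example block 2 is skipped entirely. With real filters a false positive only costs one unnecessary block read, since the block contents are still checked against the wallet.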

17:59

<larryruane_> we're almost out of time, there are a few questions remaining (7-10) sorry we didn't get to them, any comments on those questions? or anything else?

18:00

<theStack> 7 should be quick and easy to answer... just shout out a number :)

18:00

<larryruane_> a really good question is 9, what problem does this PR fix that the earlier PR didn't?

18:01

<Kaizen_Kintsugi_> allowed non SPV nodes to run this?

18:01

<Kaizen_Kintsugi_> or give full nodes the ability of a faster rescan?