VerifyDB (added in PR 2145)
performs various checks for possible corruption of recent blocks stored on disk
and of the UTXO database. The checks are invoked during startup, but can also be
triggered by the verifychain RPC.
VerifyDB depends on two parameters that are available as startup options
or RPC arguments: -checkblocks defines how many blocks (from the tip) the
checks are applied to, and -checklevel defines how thorough these checks are.
It is possible that, due to an insufficient dbcache size, VerifyDB fails
to complete the level 3 checks; in that case it gracefully skips them, but still
attempts the checks of the other levels.
However, the level 4 checks depend on the level 3 checks having completed,
so bitcoind can currently hit an assert and crash if -checklevel=4 is
specified but the level 3 checks weren't able to finish
(Issue 25563).
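To make the failure mode concrete, here is a minimal, self-contained model of it. This is not the actual VerifyDB code: block heights stand in for blocks, a single integer stands in for the in-memory coins view, and the cache limit is made up. Setting apply_fix to false reproduces, in miniature, the assert described above; with the fix, level 4 is skipped whenever level 3 was incomplete.

    // Minimal model of the VerifyDB failure (illustrative, not Bitcoin Core code).
    // Levels 0-3 walk backwards from the tip; level 3 disconnects each block from
    // an in-memory view but is skipped once the (pretend) cache budget is exceeded.
    // Level 4 then reconnects everything up to the tip and, like
    // Chainstate::ConnectBlock(), insists the view sits at the parent of the block
    // being connected.
    #include <cassert>
    #include <cstdio>

    int main()
    {
        const int tip = 120;          // current chain height (arbitrary)
        const int check_blocks = 6;   // -checkblocks
        const int check_level = 4;    // -checklevel
        const bool apply_fix = true;  // set to false to model the pre-fix crash

        int view_best = tip;          // height the in-memory view currently represents
        bool skipped_l3 = false;

        const int deepest = tip - check_blocks + 1; // deepest block that gets checked
        for (int h = tip; h >= deepest; --h) {
            // (levels 0-2: context-free block and undo-data checks, omitted here)
            if (check_level >= 3) {
                const bool cache_ok = (h > tip - 3); // pretend only 3 disconnects fit in dbcache
                if (cache_ok) {
                    view_best = h - 1;  // "disconnect" block h: view now at its parent
                } else {
                    skipped_l3 = true;  // dbcache too small: block h is never disconnected
                }
            }
        }

        if (check_level >= 4) {
            if (apply_fix && skipped_l3) {
                std::printf("level 3 checks were incomplete, skipping level 4\n");
            } else {
                for (int h = deepest; h <= tip; ++h) {
                    // analogue of assert(hashPrevBlock == view.GetBestBlock()):
                    assert(h - 1 == view_best); // fires if block h was never disconnected
                    view_best = h;              // "reconnect" block h
                }
            }
        }
        std::printf("VerifyDB model finished, view at height %d\n", view_best);
        return 0;
    }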
Since the default values are DEFAULT_CHECKBLOCKS == 6 and DEFAULT_CHECKLEVEL == 3,
users can’t run into this issue unless they actively specify a higher checklevel
than the default.
This PR prevents the assert from being hit during the checks and also changes
the way errors are reported and logged in both places VerifyDB is used (Init and RPC).
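One way to report these different outcomes is to have VerifyDB return a small result enum instead of a plain bool. The sketch below only illustrates that idea, with placeholder names; see the PR itself for the actual interface.

    // Illustrative sketch only -- the identifiers are placeholders, not necessarily
    // those used in the PR. The idea: VerifyDB reports *why* it stopped, so the two
    // callers (init and the verifychain RPC) can react differently.
    enum class VerifyDBResult {
        SUCCESS,             // all requested checks completed without finding corruption
        CORRUPTED_BLOCK_DB,  // corruption (or an unreadable block/undo file) was detected
        SKIPPED_L3_CHECKS,   // dbcache too small: level 3 (and hence level 4) couldn't run fully
        INTERRUPTED,         // shutdown was requested while the checks were running
    };

With that granularity, "corruption was found" and "the requested checks could not be completed" can be logged and handled separately at each call site.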
What is the reason the VerifyDB checks are important? Considering that they
can delay the startup by several seconds, what would be possible consequences of not
having these checks in place?
Can you describe the checks that each of the four levels of VerifyDB perform?
For which of the levels is the dbcache size relevant?
In the failure case, why exactly is the assertion hashPrevBlock == view.GetBestBlock()
in Chainstate::ConnectBlock() hit?
Were you able to reproduce the VerifyDB crash on master and verify that
this branch prevents it? (Hint: use -dbcache=1 as a startup arg to bitcoind)
In addition to fixing the crash, this PR also changes the way both
init and the verifychain RPC handle errors. Can you describe these?
Do you agree with the new error handling? Do you think bitcoind should abort if
the default checks couldn’t be completed (but there was no corruption)?
<lightlike> What is the reason the VerifyDB checks are important? Considering that they can delay the startup by several seconds, what would be possible consequences of not having these checks in place?
<LarryRuane> I think when bitcoind experiences a non-clean shutdown (such as system OOM or power loss), there's a fear that the on-disk DB could get corrupted (even though it shouldn't in theory)
<pablomartin4btc> Without these checks, there is a risk that the database could become corrupted, either due to bugs in the software or due to external factors such as hardware failures or malicious attacks.
<pablomartin4btc> If the database were to become corrupted, it could cause the Bitcoin software to crash or behave unpredictably, which could potentially lead to the loss of funds or other problems.
<lightlike> I think another possibility is that we could fall out of consensus, e.g. rejecting a valid block because of some inconsistency with our chainstate. That would be really bad, maybe worse than crashing.
<lightlike> It's a tradeoff between the time it takes and thoroughness. Probably many of you have seen this slightly annoying 15%...30%... progress in the debug log during startup.
<LarryRuane> when I first started running bitcoind, I thought it could / should be more than 6 blocks, because that takes so few seconds on my laptop ... but after I got an RPi I noticed it takes quite a while there! So for that platform at least, it would be bad to be much more than 6
<lightlike> so it's important to note that levels 0-2 are independent of context ("the chain"), including the CheckBlock at level 1. They just test the blocks in isolation.
<lightlike> fun fact, until very recently VerifyDB wasn't able to verify the entire main chain. Does anyone have an idea why that might have been the case?
<lightlike> It was because it couldn't deal with the duplicate coinbase transactions (BIP30) which are in the history of the chain. https://github.com/bitcoin/bitcoin/pull/24851 fixed that, I think.
<d33r_gee> There's a mismatch between the current block and what the previous block should be, hinting at possible tampering or corruption of the database
<lightlike> yes, correct. We skip level 3, not disconnecting the block in our in-memory copy, but we continue to the next block (because levels 0-3 are done in a loop, together for each block)
<lightlike> Because Level 4 tries to re-connect all the blocks that were meant to be disconnected before - including the blocks that we skipped. So it tries to reconnect a block of a height that was previously never disconnected.
<lightlike> next question: Were you able to reproduce the VerifyDB crash on master and verify that this branch prevents it? (Hint: use -dbcache=1 as a startup arg to bitcoind)
<lightlike> LarryRuane: I think it could depend on how large the recent blocks were. If they were empty, disconnecting them doesn't need a lot of memory.
<lightlike> LarryRuane: I guess it depends on whether someone is currently using signet to test something. When I opened the PR, it did fail for me with these parameters, but maybe the recent blocks of signet are less full than the ones back then.
<rozehnal_paul> LarryRuane did you use the bitcoin cli sandbox on bitcoindev.network or a node in signet? for next week my goal is to test-ack from within the sandbox version
<lightlike> Next q: In addition to fixing the crash, this PR also changes the way both init and the verifychain RPC handle errors. Can you describe these?
<LarryRuane> lightlike: "do we want to fail init in this case?" -- not to spoil the party too much, but you answered this in the PR description (third bullet point)
<lightlike> so my thoughts were: If we actually fire up an RPC verifychain, the result should be false if we didn't complete the checks, so the user can do something about this (like increasing the cache).
<lightlike> In case of Init, my thinking was if we actually deviate from the defaults (e.g. to check more and deeper) then aborting is ok. However, if we just want to get our node starting (not touching the defaults) we don't want to abort if the checks are incomplete.
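As a rough model of the policy lightlike describes above, the sketch below reuses the placeholder VerifyDBResult enum from the notes; it is illustrative only, not the actual init or RPC code from the PR.

    // Rough model of the two call sites' policies (illustrative, not actual code).
    // VerifyDBResult repeats the placeholder enum sketched in the notes above.
    enum class VerifyDBResult { SUCCESS, CORRUPTED_BLOCK_DB, SKIPPED_L3_CHECKS, INTERRUPTED };

    // verifychain RPC: only return true if every requested check ran and passed,
    // so the caller notices incomplete checks and can e.g. raise -dbcache.
    bool VerifyChainRpcResult(VerifyDBResult result)
    {
        return result == VerifyDBResult::SUCCESS;
    }

    // Init: detected corruption is always fatal. Incomplete checks abort startup
    // only if the user explicitly asked for non-default -checklevel/-checkblocks.
    bool ContinueInit(VerifyDBResult result, bool user_changed_defaults)
    {
        if (result == VerifyDBResult::CORRUPTED_BLOCK_DB) return false;
        if (result == VerifyDBResult::SKIPPED_L3_CHECKS) return !user_changed_defaults;
        return true;
    }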