Improve error handling when VerifyDB fails due to insufficient dbcache (validation)

https://github.com/bitcoin/bitcoin/pull/25574

Host: mzumsande  -  PR author: mzumsande

Notes

  • VerifyDB (added in PR 2145) performs various checks for possible corruption of recent blocks stored on disk and of the UTXO database. The check is invoked during startup, but can also be triggered by the verifychain RPC.

  • VerifyDB is dependent on two parameters that are available as startup options or RPC arguments: -checkblocks defines to how many blocks (from the tip) the check is applied, -checklevel defines how thorough these checks are.

  • It is possible that that due to an insufficient dbcache size, VerifyDB can fail to complete the level 3 checks, in which it would gracefully skip these, but still attempt the checks of other levels. However, the level 4 checks are dependent on the level 3 checks being completed, so that bitcoind can currently hit an assert and crash if -checklevel=4 is specified, but the level 3 checks weren’t able to finish (Issue 25563).

  • Since the default values are DEFAULT_CHECKBLOCKS == 6 and DEFAULT_CHECKLEVEL == 3, users can’t run into this issue unless they actively specify a higher checklevel than the default.

  • This PR prevents the assert from being hit during the checks and also changes the way errors are reported and logged in both places VerifyDB is used (Init and RPC).

Questions

  1. Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK?

  2. What is the reason the VerifyDB checks are important? Considering that they can delay the startup by several seconds, what would be possible consequences of not having these checks in place?

  3. Can you describe the checks that each of the four levels of VerifyDB perform? For which of the levels is the dbcache size relevant?

  4. In the failure case, why exactly is the Assertion hashPrevBlock == view.GetBestBlock() in Chainstate::ConnectBlock() hit?

  5. Were you able to reproduce the VerifyDB crash on master and verify that this branch prevents it? (Hint: use -dbcache=1 as a startup arg to bitcoind)

  6. In addition to fixing the crash, this PR also changes the way both init and the verifychain RPC handle errors. Can you describe these?

  7. Do you agree with the new error handling? Do you think bitcoind should abort if the default checks couldn’t be completed (but there was no corruption)?

Meeting Log

  117:00 <lightlike> #startmeeting
  217:00 <pablomartin4btc> hello
  317:00 <lightlike> hi!
  417:00 <rozehnal_paul> hi
  517:01 <LarryRuane> hi
  617:01 <b_101> hi!
  717:01 <lightlike> welcome to review club - today's meeting is about the VerifyDB checks
  817:01 <d33r_gee> hello
  917:02 <glozow> hi
 1017:02 <lightlike> notes are at https://bitcoincore.reviews/25574
 1117:02 <hernanmarino> hi !
 1217:02 <lightlike> anyone here for the first time?
 1317:03 <lightlike> ok, let's start. Who had a chance to review the PR? (y/n)
 1417:03 <pablomartin4btc> y
 1517:04 <hernanmarino> yes
 1617:04 <d33r_gee> yes
 1717:04 <rozehnal_paul> yes
 1817:04 <b_101> y
 1917:04 <LarryRuane> 0.4
 2017:05 <lightlike> that's a lot! what's your impression (Concept ACK, approach ACK, tested ACK, or NACK?)
 2117:05 <hernanmarino> tested
 2217:05 <pablomartin4btc> tested ACK
 2317:05 <hernanmarino> ACK
 2417:05 <hernanmarino> :)
 2517:05 <d33r_gee> tested ACK
 2617:05 <LarryRuane> concept ACK, almost finished with tested ACK
 2717:05 <rozehnal_paul> concept ack
 2817:06 <b10c> hi
 2917:06 <effexzi> Hey every1
 3017:06 <lightlike> great - let's move to the first question!
 3117:06 <lightlike> What is the reason the VerifyDB checks are important? Considering that they can delay the startup by several seconds, what would be possible consequences of not having these checks in place?
 3217:06 <b_101> concept Ack
 3317:07 <rozehnal_paul> relaying corrupted blocks?
 3417:07 <LarryRuane> basic question, the DB being verified is levelDB? is that correct?
 3517:07 <d33r_gee> they are are important for ensuring the integrity and reliability of the local bitcoin database.
 3617:07 <hernanmarino> it s important to verify the integrity of the db
 3717:08 <hernanmarino> As to the consecuences, I'm not really sure, but I guess that greater problems might arise later
 3817:08 <lightlike> LarryRuane: it's a mix. there are different levels of check - some verify the blocks stored on disk, others also the chainstate.
 3917:08 <LarryRuane> I think when bitcoind experiences a non-clean shutdown (such as system OOM or power), there's a fear that the ondisk DB could get corrupted (even though it shouldn't in theory)
 4017:08 <pablomartin4btc> Without these checks, there is a risk that the database could become corrupted, either due to bugs in the software or due to external factors such as hardware failures or malicious attacks.
 4117:09 <lightlike> Good answers!
 4217:10 <pablomartin4btc> If the database were to become corrupted, it could cause the Bitcoin software to crash or behave unpredictably, which could potentially lead to the loss of funds or other problems.
 4317:10 <andrewtoth_> LarryRuane chainstate is leveldb, blocks are custom file format and also checked
 4417:10 <LarryRuane> andrewtoth_: lightlike: thanks
 4517:11 <lightlike> I think another possiblity is that we could fall out of consensus, e.g. rejecting a valid block because of some inconsistency with our chainstate. That would be really bad, maybe worse than crashing.
 4617:11 <andrewtoth_> it could cause the node to fork off the network if it doesn't just crash from other checks while activating the best chain
 4717:11 <LarryRuane> and it is possible to check the blocks (`blk*.dat` and `rev*.dat`) on pruned nodes, but only back to the prune height (IIUC)
 4817:12 <lightlike> LarryRuane: yes, the VerifyDB code accounts for that - if we are pruning and the files aren't there, we don't check any further.
 4917:12 <rozehnal_paul> an affected node could also lose reputation by relaying corrupted blocks?
 5017:13 <andrewtoth_> rozehnal_paul yes, relaying corrupted blocks would cause peers to immediately disconnect
 5117:14 <andrewtoth_> but I don't think this checks all historical blocks for corruption
 5217:14 <LarryRuane> and just to make sure ... VerifyDB does not check the wallet's DB in any way, does it? (i don't think so)
 5317:14 <lightlike> andrewtoth: yes, that is important. Does anyone know how many blocks we check by default?
 5417:15 <lightlike> LarryRuane: no, no connection with the wallet
 5517:15 <rozehnal_paul> 6 blocks by default
 5617:15 <b_101> lightlike: only 6
 5717:15 <LarryRuane> 6 blocks .. wonder if that's purposely the same as what's considered enough to "confirm" a tx?
 5817:15 <lightlike> rozehnal_paul b_101 correct!
 5917:17 <lightlike> It's a tradeoff between the time it takes and thouroughness. Probably many of you have seen this slightly annoying 15%...30%... progress in the debug log during startup.
 6017:18 <LarryRuane> when i first started running bitcoind, i thought it could / should be more than 6 blocks, because that takes so few seconds on my laptop ... but after I got a RPi I noticed it takes quite there! So for that platform at least, it would be bad to be much more than 6
 6117:18 <lightlike> if we'd scan the entire chain, it would take us hours (and we'd probably run out of memory, but we'll come to that later)
 6217:18 <andrewtoth_> It does verify the entire chainstate though right? It loads the entire block *index* into memory, just not blocks?
 6317:19 <LarryRuane> whoops i just started that (entire chain), but on signet, maybe it will be ok (and it has your PR fix)
 6417:19 <glozow> larryruane: how long did it take on your rpi? like, on the order of seconds or minutes?
 6517:19 <andrewtoth_> A -reindex basically does a verification on the entire blocks db though I believe
 6617:19 <LarryRuane> to do 6 blocks, i would say on the order of a minute.. but i'll check and get back to you!
 6717:20 <lightlike> andrewtoth_: what do you mean with "it" - the VerifyDB checks, or bitcoind in general?
 6817:20 <glozow> oh wow that's slow! cool to know
 6917:20 <andrewtoth_> the VerifyDB checks
 7017:20 <LarryRuane> andrewtoth_: a -reindex actually *rebuilds* the block index and chainstate
 7117:21 <lightlike> andrewtoth_: ok, in that case I don't think so. But this leads to the next question, what the checks actually do:
 7217:21 <lightlike> Can you describe the checks that each of the four levels of VerifyDB perform? For which of the levels is the dbcache size relevant?
 7317:21 <lightlike> actually, it's five levels, I forgot it starts with 0...
 7417:22 <rozehnal_paul> Sipa commented in january 2013 the following :
 7517:22 <rozehnal_paul> -checklevel gets a new meaning:
 7617:22 <rozehnal_paul> 0: verify blocks can be read from disk (like before)
 7717:22 <rozehnal_paul> 1: verify (contextless) block validity (like before)
 7817:22 <rozehnal_paul> 2: verify undo data can be read and matches checksums
 7917:22 <rozehnal_paul> 3: verify coin database is consistent with the last few blocks (close to level 6 before)
 8017:22 <rozehnal_paul> 4: verify all validity rules of the last few blocks (including signature checks)
 8117:22 <LarryRuane> glozow: I was off by a little, it takes 31 seconds (for the default 6 blocks, level 3) on raspberry pi
 8217:22 <glozow> larryruane: ah nice, thanks!
 8317:23 <d33r_gee> Check 0: reads blocks from diskfrom disk and returns and error if ReadBlockFromdisk is false
 8417:23 <d33r_gee> Check 1: runs CheckBlock returns error if false
 8517:23 <d33r_gee> Check 2: runs CBlockUndoin pindex returns false if bad data is found
 8617:23 <d33r_gee> Check 3: confirms that the best block matches pindex block hash
 8717:23 <d33r_gee> Check 4: last check try to reconnect if failure if previously interrupted (?)
 8817:23 <LarryRuane> you can see what each level does with `bitcoin-cli help verifychain`, it's pretty nice
 8917:23 <yashraj> what does 2 mean?
 9017:24 <lightlike> rozehnal_paul d33r_gee: yes exactly!
 9117:24 <rozehnal_paul> LarryRuane thx for the cmd tip
 9217:24 <b_101> lightlike: level 3 test can fail due to insuficient dbcache size
 9317:25 <b_101> LarryRuane: +1, thx
 9417:25 <lightlike> b_101: yes!
 9517:27 <lightlike> so it's important to note that Level 0-2 are note independent of context ("the chain"), including the CheckBlock at level 1. They just test the blocks in isolation.
 9617:29 <pablomartin4btc> yeah
 9717:29 <lightlike> fun fact, until very recently VerifyDB wasn't able to verify the entire main chain. Does anyone have an idea why that might have been the case?
 9817:32 <hernanmarino> lightlike: just guessing ... memory constraints combined with the implementation ?
 9917:32 <lightlike> that's an issue, but with enough memory it should be solvable.
10017:32 <LarryRuane> is it this maybe? (i'm sort of guessing) https://github.com/bitcoin/bitcoin/pull/21523
10117:33 <lightlike> It was because it couldn't deal with the duplicate coinbase transactions (BIP30) which are in the history of the chain. https://github.com/bitcoin/bitcoin/pull/24851 fixed that, I think.
10217:33 <LarryRuane> oh that's amazing!
10317:33 <hernanmarino> ohh, okey I understand
10417:33 <rozehnal_paul> I thought I remembered reading that!
10517:34 <lightlike> yes, I should have included that in the questions, thought about it too late...
10617:34 <lightlike> moving on to the next q:
10717:34 <LarryRuane> that was from before the soft-fork requiring the block height to be encoded into the coinbase tx
10817:34 <lightlike> In the failure case, why exactly is the Assertion hashPrevBlock == view.GetBestBlock() in Chainstate::ConnectBlock() hit?
10917:35 <LarryRuane> i'm sort of guessing here, but is it because `view` isn't populated (enough) due to the low value for dbcache?
11017:35 <lightlike> LarryRuane: yes, there are two pairs of identical txids around block ~100000 or so.
11117:38 <d33r_gee> There a mismatch between the current block and what the previous block should be, hinting at perhaps tampering or corruption of the database
11217:38 <lightlike> What do the level 3 checks do if the memory is insufficient?
11317:39 <b_101> lightlike: skip tests and continue with level 4?
11417:39 <LarryRuane> (this is probably wrong but...) seems like level 3 checks are skipped completely in that case
11517:40 <lightlike> yes, correct. We skip the level 3, not disconnecting the block in our in-memory copy, but we continue to the next block (because level 0-3 are done in a loop, together for each block)
11617:41 <lightlike> and why does level 4 have a problem with that?
11717:45 <lightlike> Ok, I'll answer myself:
11817:45 <lightlike> Because Level 4 tries to re-connect all the blocks that were meant to be disconnected before - including the blocks that we skipped. So it tries to reconnect a block of a height that was previously never disconnected.
11917:45 <lightlike> does that explanation make sense?
12017:46 <LarryRuane> yes, definitely makes sense!
12117:46 <hernanmarino> yes, thanks !
12217:46 <d33r_gee> yep it makes sense...
12317:46 <lightlike> next question: Were you able to reproduce the VerifyDB crash on master and verify that this branch prevents it? (Hint: use -dbcache=1 as a startup arg to bitcoind)
12417:47 <pablomartin4btc> yes
12517:47 <d33r_gee> nope
12617:47 <LarryRuane> yes i was able to reproduce, but I had to specify a larger depth than 1000 ... 5000 worked to reproduce the problem for me
12717:47 <rozehnal_paul> no, not spinning a full node right now.
12817:47 <hernanmarino> yes, tested both on master and on the PR branch
12917:47 <LarryRuane> (that was on signet, in case it matters)
13017:48 <rozehnal_paul> LarryRuane what values do 1000 and 5000 refer to? block depth?
13117:48 <rozehnal_paul> separate question, what are the units for dbcache = x ?
13217:49 <lightlike> LarryRuane: I think it could depend on how large the recent blocks were. If they were empty, disconnecting them doesn't need a lot of memory.
13317:50 <LarryRuane> rozehnal_paul: yes that number (last argument to verifydb RPC) indicates how far back in history from the current tip to check
13417:50 <lightlike> rozehnal_paul: It's MiB
13517:50 <LarryRuane> lightlike: i see, that makes sense.. i would imagine signet blocks are pretty light
13617:51 <LarryRuane> lightlike: I think he was referring to the depth or nblocks argument not -dbcache
13717:51 <lightlike> LarryRuane: I guess it depends on whether someone is currently using signet to test something. When I opened the PR, it did fail for me with these parameters, but maybe the recent blocks of signet are less full than the ones back then.
13817:52 <rozehnal_paul> LarryRuane did you use the bitcoin cli sandbox on bitcoindev.network or a node in signet? for next weeek my goal is to test-ack from with th sandbox version
13917:52 <LarryRuane> also if you specify 0 for this argument, just FYI, it checks all blocks back to the beginning
14017:53 <LarryRuane> rozehnal_paul: interesting, i'm not aware of that sandbox ... I'm running a local node
14117:53 <lightlike> Next q: In addition to fixing the crash, this PR also changes the way both init and the verifychain RPC handle errors. Can you describe these?
14217:53 <rozehnal_paul> I imagine the error thrown is more detailed
14317:54 <lightlike> rozehnal_paul: I also never use a sandbox. I have a local datadir for signet / testnet that I use when I need to test something.
14417:54 <hernanmarino> they are more verbose / descriptive when logging / reporting
14517:55 <lightlike> they also change the behavior in case of an incomplete check due to insufficient cache.
14617:55 <hernanmarino> rozehnal_paul: if you want to run a local node, you can sync the whole signet chain in just under an hour
14717:56 <lightlike> the question is basically: do we want to fail init in this case? There is no error, but we also didn't check everything we wanted.
14817:56 <hernanmarino> lightlike: are you referring to the return value, for example ?
14917:56 <lightlike> hernanmarino: yes, the return value of the -verifychain RPC
15017:56 <rozehnal_paul> hernanmarino looks like with these commands i can get it going https://en.bitcoin.it/wiki/Signet#Why_run_Signet.3F
15117:57 <rozehnal_paul> lightlike i think a warning error suffices, if it tells the node operator to increase dbcache in order to complete the tests.
15217:58 <LarryRuane> lightlike: "do we want to fail init in this case?" -- not to spoil the party too much, but you answered this in the PR description (third bullet point)
15317:58 <lightlike> so my thoughts were: If we actually fire up an RPC verifychain, the result should be false if we didn't complete the checks, so the user can do something about this (like increasing the cache).
15417:58 <pablomartin4btc> lightlike: I agree with it
15517:59 <LarryRuane> yes this is a nice PR! I hope it gets merged soon
15618:00 <LarryRuane> i'll review in detail this week
15718:00 <lightlike> In case of Init, my thinking was if we actually deviate from the defaults (e.g. to check more and deeper) then aborting is ok. However, if we just want to get our node starting (not touching the defaults) we don't want to abort if the checks are incomplete.
15818:00 <LarryRuane> lightlike: +1
15918:00 <rozehnal_paul> +1
16018:00 <lightlike> That's it - thanks for attending!
16118:00 <lightlike> #endmeeting