VerifyDB (added in PR 2145)
performs various checks for possible corruption of recent blocks stored on disk
and of the UTXO database. The checks are invoked during startup, but can also be
triggered by the verifychain RPC.
VerifyDB depends on two parameters that are available as startup options
or RPC arguments: -checkblocks defines how many blocks (from the tip) the
checks are applied to, and -checklevel defines how thorough these checks are.
It is possible that, due to an insufficient dbcache size, VerifyDB fails
to complete the level 3 checks; in that case it gracefully skips them, but still
attempts the checks of the other levels.
However, the level 4 checks depend on the level 3 checks having completed,
so bitcoind can currently hit an assert and crash if -checklevel=4 is
specified but the level 3 checks weren't able to finish
(Issue 25563).
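To make the failure mode concrete, here is a minimal, self-contained model of it. This is not the actual VerifyDB code: block heights stand in for blocks, a single integer stands in for the in-memory coins view, and the cache limit is made up. Setting apply_fix to false reproduces, in miniature, the assert described above; with the fix, level 4 is skipped whenever level 3 was incomplete.

    // Minimal model of the VerifyDB failure (illustrative, not Bitcoin Core code).
    // Levels 0-3 walk backwards from the tip; level 3 disconnects each block from
    // an in-memory view but is skipped once the (pretend) cache budget is exceeded.
    // Level 4 then reconnects everything up to the tip and, like
    // Chainstate::ConnectBlock(), insists the view sits at the parent of the block
    // being connected.
    #include <cassert>
    #include <cstdio>

    int main()
    {
        const int tip = 120;          // current chain height (arbitrary)
        const int check_blocks = 6;   // -checkblocks
        const int check_level = 4;    // -checklevel
        const bool apply_fix = true;  // set to false to model the pre-fix crash

        int view_best = tip;          // height the in-memory view currently represents
        bool skipped_l3 = false;

        const int deepest = tip - check_blocks + 1; // deepest block that gets checked
        for (int h = tip; h >= deepest; --h) {
            // (levels 0-2: context-free block and undo-data checks, omitted here)
            if (check_level >= 3) {
                const bool cache_ok = (h > tip - 3); // pretend only 3 disconnects fit in dbcache
                if (cache_ok) {
                    view_best = h - 1;  // "disconnect" block h: view now at its parent
                } else {
                    skipped_l3 = true;  // dbcache too small: block h is never disconnected
                }
            }
        }

        if (check_level >= 4) {
            if (apply_fix && skipped_l3) {
                std::printf("level 3 checks were incomplete, skipping level 4\n");
            } else {
                for (int h = deepest; h <= tip; ++h) {
                    // analogue of assert(hashPrevBlock == view.GetBestBlock()):
                    assert(h - 1 == view_best); // fires if block h was never disconnected
                    view_best = h;              // "reconnect" block h
                }
            }
        }
        std::printf("VerifyDB model finished, view at height %d\n", view_best);
        return 0;
    }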
Since the default values are DEFAULT_CHECKBLOCKS == 6 and DEFAULT_CHECKLEVEL == 3,
users can’t run into this issue unless they actively specify a higher checklevel
than the default.
This PR prevents the assert from being hit during the checks and also changes
the way errors are reported and logged in both places VerifyDB is used (Init and RPC).
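One way to report these different outcomes is to have VerifyDB return a small result enum instead of a plain bool. The sketch below only illustrates that idea, with placeholder names; see the PR itself for the actual interface.

    // Illustrative sketch only -- the identifiers are placeholders, not necessarily
    // those used in the PR. The idea: VerifyDB reports *why* it stopped, so the two
    // callers (init and the verifychain RPC) can react differently.
    enum class VerifyDBResult {
        SUCCESS,             // all requested checks completed without finding corruption
        CORRUPTED_BLOCK_DB,  // corruption (or an unreadable block/undo file) was detected
        SKIPPED_L3_CHECKS,   // dbcache too small: level 3 (and hence level 4) couldn't run fully
        INTERRUPTED,         // shutdown was requested while the checks were running
    };

With that granularity, "corruption was found" and "the requested checks could not be completed" can be logged and handled separately at each call site.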
What is the reason the VerifyDB checks are important? Considering that they
can delay the startup by several seconds, what would be possible consequences of not
having these checks in place?
Can you describe the checks that each of the four levels of VerifyDB perform?
For which of the levels is the dbcache size relevant?
In the failure case, why exactly is the assertion hashPrevBlock == view.GetBestBlock()
in Chainstate::ConnectBlock() hit?
Were you able to reproduce the VerifyDB crash on master and verify that
this branch prevents it? (Hint: use -dbcache=1 as a startup arg to bitcoind)
In addition to fixing the crash, this PR also changes the way both
init and the verifychain RPC handle errors. Can you describe these?
Do you agree with the new error handling? Do you think bitcoind should abort if
the default checks couldn’t be completed (but there was no corruption)?
<lightlike> What is the reason the VerifyDB checks are important? Considering that they can delay the startup by several seconds, what would be possible consequences of not having these checks in place?
<LarryRuane> I think when bitcoind experiences a non-clean shutdown (such as system OOM or power loss), there's a fear that the on-disk DB could get corrupted (even though it shouldn't in theory)
<pablomartin4btc> Without these checks, there is a risk that the database could become corrupted, either due to bugs in the software or due to external factors such as hardware failures or malicious attacks.
<pablomartin4btc> If the database were to become corrupted, it could cause the Bitcoin software to crash or behave unpredictably, which could potentially lead to the loss of funds or other problems.
<lightlike> I think another possibility is that we could fall out of consensus, e.g. rejecting a valid block because of some inconsistency with our chainstate. That would be really bad, maybe worse than crashing.
<lightlike> It's a tradeoff between the time it takes and thoroughness. Probably many of you have seen this slightly annoying 15%...30%... progress in the debug log during startup.
<LarryRuane> when I first started running bitcoind, I thought it could / should be more than 6 blocks, because that takes so few seconds on my laptop ... but after I got an RPi I noticed it takes quite a while there! So for that platform at least, it would be bad to be much more than 6
<lightlike> so it's important to note that levels 0-2 are independent of context ("the chain"), including the CheckBlock at level 1. They just test the blocks in isolation.
<lightlike> fun fact, until very recently VerifyDB wasn't able to verify the entire main chain. Does anyone have an idea why that might have been the case?
<lightlike> It was because it couldn't deal with the duplicate coinbase transactions (BIP30) which are in the history of the chain. https://github.com/bitcoin/bitcoin/pull/24851 fixed that, I think.
<d33r_gee> There's a mismatch between the current block and what the previous block should be, hinting at possible tampering or corruption of the database
<lightlike> yes, correct. We skip level 3, not disconnecting the block in our in-memory copy, but we continue to the next block (because levels 0-3 are done in a loop, together for each block)
<lightlike> Because Level 4 tries to re-connect all the blocks that were meant to be disconnected before - including the blocks that we skipped. So it tries to reconnect a block of a height that was previously never disconnected.
<lightlike> next question: Were you able to reproduce the VerifyDB crash on master and verify that this branch prevents it? (Hint: use -dbcache=1 as a startup arg to bitcoind)
<lightlike> LarryRuane: I think it could depend on how large the recent blocks were. If they were empty, disconnecting them doesn't need a lot of memory.
<lightlike> LarryRuane: I guess it depends on whether someone is currently using signet to test something. When I opened the PR, it did fail for me with these parameters, but maybe the recent blocks of signet are less full than the ones back then.
<rozehnal_paul> LarryRuane did you use the bitcoin cli sandbox on bitcoindev.network or a node in signet? for next week my goal is to test-ack from within the sandbox version
<lightlike> Next q: In addition to fixing the crash, this PR also changes the way both init and the verifychain RPC handle errors. Can you describe these?
<LarryRuane> lightlike: "do we want to fail init in this case?" -- not to spoil the party too much, but you answered this in the PR description (third bullet point)
<lightlike> so my thoughts were: If we actually fire up an RPC verifychain, the result should be false if we didn't complete the checks, so the user can do something about this (like increasing the cache).
<lightlike> In case of Init, my thinking was if we actually deviate from the defaults (e.g. to check more and deeper) then aborting is ok. However, if we just want to get our node starting (not touching the defaults) we don't want to abort if the checks are incomplete.
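As a rough model of the policy lightlike describes above, the sketch below reuses the placeholder VerifyDBResult enum from the notes; it is illustrative only, not the actual init or RPC code from the PR.

    // Rough model of the two call sites' policies (illustrative, not actual code).
    // VerifyDBResult repeats the placeholder enum sketched in the notes above.
    enum class VerifyDBResult { SUCCESS, CORRUPTED_BLOCK_DB, SKIPPED_L3_CHECKS, INTERRUPTED };

    // verifychain RPC: only return true if every requested check ran and passed,
    // so the caller notices incomplete checks and can e.g. raise -dbcache.
    bool VerifyChainRpcResult(VerifyDBResult result)
    {
        return result == VerifyDBResult::SUCCESS;
    }

    // Init: detected corruption is always fatal. Incomplete checks abort startup
    // only if the user explicitly asked for non-default -checklevel/-checkblocks.
    bool ContinueInit(VerifyDBResult result, bool user_changed_defaults)
    {
        if (result == VerifyDBResult::CORRUPTED_BLOCK_DB) return false;
        if (result == VerifyDBResult::SKIPPED_L3_CHECKS) return !user_changed_defaults;
        return true;
    }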