Coalesce Chainstate loading sequence between {,non-}unittest codepaths (refactoring)

https://github.com/bitcoin/bitcoin/pull/23280

Host: jnewbery  -  PR author: dongcarl

The PR branch HEAD was b8ea88926c at the time of this review club meeting.

Notes and questions to follow soon!

Notes

  • init.cpp is responsible for the startup sequence for bitcoind and bitcoin-qt. It parses and sanitizes configuration options, initializes the various components, starts threads, and so on.

  • For historic reasons, the initialization code is complex and difficult to work with - the file is very long (1861 lines prior to this PR), the individual functions within the file are long (for example, AppInitMain is 750 lines) and there’s global state being read and written.

  • The code is also brittle - since various components and global state depend on each other, even innocuous-looking changes that re-order events in minor ways can cause subtle changes in behaviour. This has been the source of several bugs in the past.

  • To make matters worse, the code in init.cpp is difficult to unit test. Since it’s tightly coupled with lots of other parts of the codebase, it’s difficult to isolate and test thoroughly. The unit tests have their own code for initializing their test setup, so the code is not shared between bitcoind initialization code and unit test initialization.

  • PR 23280 is an attempt to unify some of the initialization logic for the validation state. If that code is shared between bitcoind and the unit tests, then it’ll be better tested and less likely to be broken by future changes.

Questions

  1. Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK?

  2. This PR distinguishes between “soft failures” and “hard failures” when loading the chainstate. What is the difference between those two?

  3. The first commit (init: Extract chainstate loading sequence) touches a large number of lines and extracts logic from AppInitMain() into a dedicated LoadChainstateSequence() function. How did you satisfy yourself that this commit didn’t introduce any behaviour changes?

  4. Commit init/chainstate: Decouple from stringy errors changes the return type of LoadChainstateSequence() from a boolean to an (optional) enum. How did you satisfy yourself that the commit didn’t introduce any behaviour changes?

  5. In the same commit, why are the failed_chainstate_init and failed_verification local variables removed? What were they used for previously? Why are they no longer required?

  6. In the same commit, fLoaded is no longer passed into LoadChainstateSequence() (as a reference), and is now set in the outer AppInitMain() function. Why is this possible?

  7. In commit init/chainstate: Remove do/while loop, the do/while loop is removed from LoadChainstateSequence(). Why is this possible? What was it being used for prior to this PR?

  8. Commit init/chainstate: Decouple from concept of uiInterface adds this code to the call site for LoadChainstateSequence():

    []() {
        uiInterface.ThreadSafeMessageBox(
      	 _("Error reading from database, shutting down."),
      	 "", CClientUIInterface::MSG_ERROR);
    },
    []() {
        uiInterface.InitMessage(_("Verifying blocks…").translated);
    });
    

    What is that syntax? What is being passed in to the function?

  9. Commit validation: Call NotifyBlockTip in CCS::LoadChainTip removes a call to RPCNotifyBlockChange() and adds a call to uiInterface.NotifyBlockTip(). How are those two functions related?

Meeting Log

  117:00 <jnewbery> #startmeeting
  217:00 <glozow> hi
  317:00 <michaelfolkson> hi
  417:00 <glozow> #initmeeting
  517:00 <jnewbery> hi folks. Welcome to Bitcoin Core PR Review Club! Feel free to say hi to let everyone know you're here
  617:00 <sm0l> hi
  717:00 <jnewbery> glozow: hoho
  817:01 <dongcarl> hi
  917:01 <lightlike> hi
 1017:01 <stickies-v> hi everyone!
 1117:01 <seaona> hi!
 1217:01 <jnewbery> Anyone here for the first time?
 1317:01 <jnewbery> dongcarl: thanks for joining us!
 1417:01 <michaelfolkson> config option: enable glozow jokes
 1517:01 <dongcarl> Wouldn't miss it!
 1617:02 <jnewbery> Notes and questions are in the normal place: https://bitcoincore.reviews/23280
 1717:02 <jnewbery> Who had a chance to read the notes & questions / review the PR this week? (y/n)
 1817:02 <MaxDoronin> I'm new here, jnewbery
 1917:03 <michaelfolkson> 0.5y
 2017:03 <lightlike> y
 2117:03 <sm0l> 1/2 y, read throught the code and tried to answer the questions
 2217:03 <sm0l> through*
 2317:03 <seaona> 0.5y
 2417:03 <jnewbery> MaxDoronin: welcome! We love new participants. Feel free to ask questions at any time
 2517:04 <stickies-v> 0.5y
 2617:04 <jnewbery> Any concept/approach ACKs or NACKs? What do you all think of the PR from a high level?
 2717:05 <stickies-v> Concept ACK - carve outs like LoadChainstateSequence make the code so much easier to understand and test
 2817:06 <sm0l> Concept ACK from me, seems like a good idea to clean up this code to pave the way for better testing and future maintaining
 2917:06 <michaelfolkson> Certainly can't think of any reason to NACK it
 3017:06 <lightlike> concept ACK, also makes a lot of sense to me to reuse init code in the unit test setup
 3117:07 <jnewbery> stickies-v: I agree. AppInitMain() is a beast, so pulling out logical units of the code seem like an improvement, especially if it means improving the test quality.
 3217:07 <jnewbery> ok, let's get into the code. This PR distinguishes between “soft failures” and “hard failures” when loading the chainstate. What is the difference between those two?
 3317:08 <michaelfolkson> Whether a failure can be recovered from with a reindex
 3417:09 <glozow> copy-paste:
 3517:09 <glozow> Soft failure - a failure that might be recovered from with a reindex
 3617:09 <glozow> Hard failure - a failure that definitively cannot be recovered from with a reindex
 3717:09 <sm0l> A soft failure is a failure we might be able to recover from while a hard failure is a failure we definitely cannot recover from
 3817:09 <michaelfolkson> And a failure being a corrupted datadir?
 3917:09 <michaelfolkson> Or data missing...
 4017:10 <jnewbery> michaelfolkson glozow: right, a failure is 'soft' if we can recover from it with a reindex
 4117:10 <jnewbery> what do we mean by reindex?
 4217:10 <stickies-v> I think to add, from my understanding the diff between soft/hard used to be a boolean true/false response, whereas nowadays this difference is not as binary anymore since we return a ChainstateLoadingError?
 4317:10 <michaelfolkson> Or an already locked datadir would be a failure to I think
 4417:11 <michaelfolkson> *too
 4517:11 <dongcarl> stickies-v: It actually used to be that a hard failure used to return from AppInitMain altogether
 4617:11 <jnewbery> stickies-v: this PR introduces the terminology of hard/soft failure (in the commit messages)
 4717:12 <dongcarl> whereas a soft failure would break (basically a goto) and proceed to some failure handling logic
 4817:12 <lightlike> the distinction is a bit unclear though imo. "ERROR_BLOCK_FROM_FUTURE" is treated as a soft failure, but won't be healed by reindexing, instead of fixing your time
 4917:13 <michaelfolkson> A reindexing of the blockchain. So we are talking using block data we already have available to us to create a UTXO set....? I'm not actually sure
 5017:14 <jnewbery> lightlike: good point. Perhaps a better definition is that if it's a hard failure then we can never rebuild our chain state with the data in our datadir. If it's a soft failure, we may be able to recover with the data in our datadir. Is that more accurate?
 5117:15 <lightlike> jnewbery: yes, that makes sense to understand it that way
 5217:16 <jnewbery> reindex will rebuild both the chain state (UTXO set) and block index from the blk files on disk
 5317:16 <dongcarl> lightlike: Ah, I think that case is why I added the "might" in my message...
 5417:16 <jnewbery> can anyone give an example of a hard failure?
 5517:16 <Kaizen_Kintsugi> hi shit i'm late
 5617:16 <seaona> error bad genesis block ?
 5717:17 <michaelfolkson> Corrupted datadir?
 5817:17 <jnewbery> seanoa: bingo!
 5917:18 <jnewbery> That's kind of a pathological failure case where somehow the datadir has been mixed up with the datadir from a testnet node or something.
 6017:18 <jnewbery> I think that's the only hard failure. We should be able to recover from everything else
 6117:19 <jnewbery> seaona: oops sorry mistyped your name
 6217:19 <jnewbery> next question: The first commit (init: Extract chainstate loading sequence) touches a large number of lines and extracts logic from AppInitMain() into a dedicated LoadChainstateSequence() function. How did you satisfy yourself that this commit didn’t introduce any behaviour changes?
 6317:20 <seaona> I was thinking all these were hard failures:
 6417:20 <seaona> ERROR_LOADING_BLOCK_DB
 6517:20 <seaona> ERROR_BAD_GENESIS_BLOCK
 6617:20 <seaona> ERROR_PRUNED_NEEDS_REINDEX
 6717:20 <seaona> ERROR_LOAD_GENESIS_BLOCK_FAILED
 6817:20 <seaona> np =D
 6917:21 <Kaizen_Kintsugi> jnewbery: would this be solved by passed tests
 7017:22 <Kaizen_Kintsugi> in testing setup
 7117:22 <lightlike> using the git flags mentioned in the commit message helps
 7217:22 <Kaizen_Kintsugi> load chain state sequence is called
 7317:23 <Kaizen_Kintsugi> *LoadChainstateSequence
 7417:23 <michaelfolkson> Looking over the code or playing with James' new functional test, there are no unit tests
 7517:23 <jnewbery> seaona: take a look at the first commit (init: Extract chainstate loading sequence), and the commit log (Currently, LoadChainstateSequence returns a bool which: - if false - Definitely a "Hard failure"). There's only one place in the function that returns false, and it's the "If the loaded chain has a wrong genesis, bail out immediately" part
 7617:23 <jnewbery> Kaizen_Kintsugi: alas, this area of the code is not well covered by unit tests
 7717:23 <Kaizen_Kintsugi> ah
 7817:24 <jnewbery> lightlike: I agree. `--color-moved=dimmed_zebra --color-moved-ws=allow-indentation-change` are very helpful flags for reviewing these mostly move-only commits. Kudos to dongcarl for including that review tip in the commit log.
 7917:25 <Kaizen_Kintsugi> git flags?
 8017:25 <seaona> jnewberry: thank you!
 8117:25 <dongcarl> 😊
 8217:26 <jnewbery> I'd say that in the absence of test coverage, we need to be extra careful in our code review. Here, the commit is moving code into a separate function. We can satisfy ourselves that it's basically move only by using that `git diff` command, and the arguments are almost all (mutable) references, so there shouldn't be any difference between changing those variables inside the function from
 8317:26 <jnewbery> the way it was done before.
 8417:27 <lightlike> Kaizen_Kintsugi: you do "git diff <old commit hash> <new commit hash> --color-moved=dimmed_zebra --color-moved-ws=allow-indentation-change" and everything that is just moved is marked in gray, so you see the differences more easily.
 8517:27 <jnewbery> was this pure move-only or were there any differences before and after the move?
 8617:28 <Kaizen_Kintsugi> thx lightlike
 8717:28 <Kaizen_Kintsugi> jnewbery function inputs changed i think
 8817:28 <Kaizen_Kintsugi> ass things decoupled
 8917:28 <Kaizen_Kintsugi> *as
 9017:29 <jnewbery> I'm specifically talking about this commit: https://github.com/bitcoin-core-review-club/bitcoin/commit/f24e12f039edec70dfe30885b7a6082ccd3cf8e8
 9117:29 <stickies-v> lightlike Kaizen_Kintsugi you can also use a slightly shorter command with git show: "git show <commit hash> --color-moved=dimmed_zebra --color-moved-ws=allow-indentation-change"
 9217:30 <jnewbery> try it right now if you want
 9317:30 <jnewbery> git diff f24e12f039edec70dfe30885b7a6082ccd3cf8e8~ f24e12f039edec70dfe30885b7a6082ccd3cf8e8 --color-moved=dimmed_zebra --color-moved-ws=allow-indentation-change
 9417:30 <Kaizen_Kintsugi> the do while was removed, which confused me why that it was there in the first place
 9517:30 <jnewbery> Kaizen_Kintsugi: no, the do-while loop is still there after this commit
 9617:31 <jnewbery> it's important to review the PR commit-by-commit to be able to see what's changing
 9717:31 <Kaizen_Kintsugi> derp
 9817:31 <jnewbery> anyone see any non-move changes in that commit?
 9917:32 <jnewbery> how about this:
10017:32 <jnewbery> - chainstate->CoinsErrorCatcher().AddReadErrCallback([]() {
10117:32 <lightlike> there is a small diff in the first one, the line "chainstate->CoinsErrorCatcher().AddReadErrCallback([&]() {" gets an additional "&"
10217:32 <jnewbery> + chainstate->CoinsErrorCatcher().AddReadErrCallback([&]() {
10317:32 <jnewbery> lightlike: excellent observation!
10417:33 <jnewbery> if you're reviewing refactors, you need to be on the lookout for even tiny changes like that if you want to satisfy yourself that there aren't behaviour changes
10517:33 <jnewbery> So what does that extra & do?
10617:34 <glozow> capture local vars by reference
10717:35 <glozow> s/local/used
10817:36 <jnewbery> glozow: very nice! Yes, this is the captures for the lambda function. Previously it wasn't capturing anything, and now it's capturing local variables by reference
10917:36 <jnewbery> why is that change required here?
11017:36 <Kaizen_Kintsugi> needs fReindexChainState and fLoaded?
11117:37 <Kaizen_Kintsugi> and the chainstate i guess
11217:37 <pg156> jnewbery: "be on the lookout for even tiny changes" means looking at and reasoning about the code side by side? anything else? (as you also said "this area of the code is not well covered by unit tests")
11317:38 <jnewbery> Kaizen_Kintsugi: that's not it. Let's have a look at what's happening inside that lambda
11417:38 <jnewbery> {
11517:38 <jnewbery> uiInterface.ThreadSafeMessageBox(
11617:38 <jnewbery> _("Error reading from database, shutting down."),
11717:38 <jnewbery> "", CClientUIInterface::MSG_ERROR);
11817:38 <jnewbery> }
11917:38 <jnewbery> which variable needed to be captured?
12017:38 <glozow> mm easy to check, just need to remove the `&` and see what the compiler says
12117:39 <Kaizen_Kintsugi> the CClientUIInterface?
12217:39 <Kaizen_Kintsugi> uiInterface sry
12317:39 <Kaizen_Kintsugi> which we want to decouple
12417:39 <jnewbery> pg156: yes, use a diff tool to see the differences in the code and figure out what each of those differences imply
12517:40 <jnewbery> Kaizen_Kintsugi: yes, it's uiInterface! So why didn't we need to capture it before, and we do need to capture it inside the new function?
12617:41 <jnewbery> hint: https://github.com/bitcoin/bitcoin/pull/23280#discussion_r743923908
12717:42 <Kaizen_Kintsugi> is it that we are replacing the string errors with these enum values like ChainstateLoadingError::ERROR_LOADING_BLOCK_DB
12817:42 <lightlike> because it wasn't local in the context of init.cpp but global, but is local after the move?
12917:42 <dongcarl> lightlike: bingo!
13017:42 <jnewbery> lightlike: sir, you dropped this 👑
13117:43 <dongcarl> 😆
13217:43 <Kaizen_Kintsugi> cool
13317:43 <lightlike> is there a good way to check where the global var is defined? git grep is problematic
13417:44 <lightlike> too many lines
13517:44 <jnewbery> yes, these things called uiInterface are actually different things! In the code before the commit, uiInterface is a global variable. After the commit, uiInterface is a local variable in the new LoadChainstateSequence() function
13617:44 <dongcarl> lightlike: I use SourceTrail but there are other tools for this too
13717:45 <jnewbery> ligthlike: here you go: https://github.com/bitcoin/bitcoin/blob/38b2a0a3f933fef167274851acaad0fd9104302a/src/node/ui_interface.cpp#L12
13817:45 <jnewbery> I use exuberant ctags. Works most of the time, but didn't actually work here
13917:46 <jnewbery> I just happen to know that uiInterface is declared in ui_interface.cpp
14017:46 <dongcarl> The two vars do refer to the exact same thing though. I kept it named the same so that git’s move detection algorithm is happy.
14117:47 <michaelfolkson> https://www.sourcetrail.com/blog/discontinue_sourcetrail/ :(
14217:47 <jnewbery> right, so this function has a local variable called uiInterface, which is a reference to the global variable uiInterface. Having multiple variables named the same way is called "shadowing" and it's what Cory is talking about here: https://github.com/bitcoin/bitcoin/pull/23280#discussion_r743923908.
14317:48 <jnewbery> It's maybe not such a big deal because this is only an intermediate commit and it gets cleaned up later
14417:48 <jnewbery> but it is slightly confusing, as Cory points out
14517:49 <Kaizen_Kintsugi> shit, I see it now
14617:49 <dongcarl> True, perhaps I should add to the commit message
14717:50 <jnewbery> an alternative approach would be to have a commit before this one that creates a reference to uiInterface in AppInitMain() called ui_interface or whatever, and uses that in the code that's about to be moved, then moves it with that name
14817:50 <jnewbery> the first commit would have to update the lambda capture to capture that local reference
14917:50 <jnewbery> but the second commit would then be a pure move of the code
15017:51 <dongcarl> ah, good point!
15117:51 <jnewbery> it's a bit circuitous, but it's maybe less mental strain for reviewers
15217:51 <jnewbery> we have 9 minutes left. Let's get on to the next question
15317:51 <jnewbery> Commit init/chainstate: Decouple from stringy errors changes the return type of LoadChainstateSequence() from a boolean to an (optional) enum. How did you satisfy yourself that the commit didn’t introduce any behaviour changes?
15417:52 <pg156> The commit log addresses the question. I haven't completed this. But as a reviewer, I feel I need to enumerate all possible outcomes where `LoadChainstateSequence` is called, to compare before and after.
15517:52 <dongcarl> Yeah this is probably the second trickiest commit in this PR, after the RPCNotify one
15617:53 <dongcarl> Happy to improve the commit message if anything was confusing to folks
15717:53 <jnewbery> pg156: indeed, you need to carefully look at all of the places where we exit due to an error, and verify that we're doing the same thing before and after
15817:53 <dongcarl> Also, it might be easier to review after I implement https://github.com/bitcoin/bitcoin/pull/23280#discussion_r743703915
15917:55 <lightlike> actually an InitError seemed to have gotten lost in that commit, but there was already a review comment on that (as for everything else I noticed during review)
16017:55 <jnewbery> I think the key observation is that the break statements become return statements that return error codes, and then the caller handles those error codes
16117:55 <dongcarl> Yup, totally my b 😬
16217:55 <Kaizen_Kintsugi> oh cool that commit message explains that do/while(false) loop
16317:56 <Kaizen_Kintsugi> You don't need it anymore when returning a chainstateloading error
16417:57 <dongcarl> Right! Previously, the do/while(false) + break combination was basically used to emulate goto's
16517:57 <jnewbery> Kaizen_Kintsugi: exactly! The do-while is just there so it's possible to break out of that block of code. It's a very ugly way of replicating what can be done with a function and return statements
16617:57 <jnewbery> dongcarl: right, or goto
16717:57 <jnewbery> functions are much nicer :)
16817:57 <dongcarl> hear hear!
16917:58 <jnewbery> ok, maybe time for one very quick question to end
17017:58 <jnewbery> In the same commit, why are the failed_chainstate_init and failed_verification local variables removed? What were they used for previously? Why are they no longer required?
17117:58 <lightlike> i looked the pattern up on someone on stackoverflow called it "an idiotically disguised goto" 😉
17217:58 <pg156> - They were used previously to break out of the chainstate activation do-while loop to return true (more specifically indicating a soft failure, as it is neither "Success" nor "Shutdown requested").
17317:58 <pg156> - They are no longer needed and therefore are removed, because
17417:58 <pg156> - before the change: the effect of calling `LoadChainstateSequence` is captured by the function return value together with mutable function parameters (e.g. `fLoaded`, `strLoadError`). E.g. when db fails to upgrade, `failed_chainstate_init` is necessary to break out of the do-while loop, and set return value to true outside the loop.
17517:58 <pg156> - after the change, e.g. when db fails to upgrade, the function can directly return an error status value `ERROR_CHAINSTATE_UPGRADE_FAILED` capturing the effect.
17618:00 <Kaizen_Kintsugi> we now return before those local variables
17718:00 <jnewbery> pg156: yes, I think that's exactly it
17818:00 <jnewbery> now that we're returning an error code to indicate failure, we can do it directly at the point of the failure
17918:00 <jnewbery> ok, that's time
18018:00 <Kaizen_Kintsugi> dude this was awesome
18118:00 <Kaizen_Kintsugi> thanks
18218:00 <Kaizen_Kintsugi> learned a lot again today
18318:00 <jnewbery> thanks everyone for sticking with this. It was a really hard PR this week
18418:01 <pg156> Thank you jnewbery!
18518:01 <jnewbery> we'll do something a little easier next time :)
18618:01 <Kaizen_Kintsugi> everything that is hard is good for us
18718:01 <jnewbery> #endmeeting