Reverse cs_main, cs_wallet lock order and reduce cs_main locking (wallet)

https://github.com/bitcoin/bitcoin/pull/16426

Host: ariard  -  PR author: ariard

Notes

  • The wallet is an integral part of the bitcoind full node for historic reasons. See the first git commit where the wallet lock cs_wallet is taken after cs_main.

  • The wallet was split from the node in PR 10973.

  • The new interface interfaces::Chain lets the wallet access chain and mempool state for its own operations. Interface methods allow the wallet to do things like access the block height to know a transaction’s confirmation depth or estimate the feerate.

  • Some heavily-used interfaces::Chain methods impose synchronisation between wallet and node because Chain::Lock needs to be taken by the wallet before its own lock.

  • Switching the lock order from cs_main -> cs_wallet to cs_wallet -> cs_main would avoid needlessly stalling the node when there is no need to access chain data.

  • PR 15931 removed GetDepthInMainChain’s requirement to lock the chain and simplified the reasoning on locks and events ordering.

  • A long-term goal is to separate major components of the Bitcoin Core node (the wallet, the P2P layer, the GUI) to let them evolve independently and only communicate asynchronously through well-defined interfaces: See ryanofsky’s github issue or eklitzke’s blog post for reasoning.

  • To learn more about lock management in Bitcoin Core:

Questions

  1. Did you review the PR? Concept ACK, approach ACK, ACK, or NACK? (Don’t forget to put your PR review on GitHub.)

  2. Did you take any steps beyond reading the code? Did you run the tests on your local machine with --enable-debug?

  3. While reading the code, which commit(s) attracted your attention? What should be carefully examined? What could go wrong?

  4. What could happen if a deadlock was introduced in the code?

  5. Does this PR change the wallet behavior?

  6. Why does the lock order need to be inverted all at once?

  7. Do you expect this PR to have any impact on performance?

  8. Can you think of any other way to remove the cs_main lock from the wallet code?

  9. What further changes could be build on top of this PR?

Meeting Log

  113:00 <ariard> #startmeeting
  213:00 <emilengler> hello
  313:00 <jnewbery> hi
  413:00 <_andrewtoth_> hi
  513:00 <fjahr> hi
  613:00 <sanoj> hi
  713:00 <jonatack> hi
  813:01 <ariard> today's PR is https://bitcoincore.reviews/16426.html : Reverse cs_main, cs_wallet lock order and reduce cs_main locking
  913:01 <mattcruzz> hi
 1013:01 <ariard> who had a chance to review the PR ?
 1113:01 <jonatack> yup
 1213:01 <emilengler> negative
 1313:01 <ariard> jonatack: cool what was your review process?
 1413:02 <fjahr> started but not finished
 1513:02 <ariard> others did you read PR top message ? what the main problem this PR is trying to tackle ?
 1613:03 <jonatack> ariard: pulled the PR branch, opened diff locally with gitk, built with --enable-debug, ran tests, ran bitcoind, looked for warnings, read the commits one by on gitk, git grepped varois sites
 1713:03 <pinheadmz> hi!
 1813:03 <_andrewtoth_> it's removing a requirement of locking cs_main before taking cs_wallet lock
 1913:04 <ariard> _andrewtoth: correct, does locking cs_main still occurs afterwards?
 2013:04 <_andrewtoth_> there's a lot of backstory to this PR, as several of your refactor PRs are linked
 2113:04 <ariard> jonatack: "git grepped varois sites"?
 2213:04 <jonatack> ariard: after jkczyz commented on the build warning, i then wanted to reproduce, so switched from gcc to clang, which showed the warning for me (reassuring)
 2313:04 <_andrewtoth_> cs_main still needs to be taken yes
 2413:04 <jonatack> ariard: note to self: build with clang more often
 2513:05 <ariard> jonatack: yeah having a Mac env is great to enable all clang options
 2613:05 <ariard> jonatack: that's a bit cumbersome when you're working with locks on core, we have different lock safety mechanisms following the platforms
 2713:05 <jonatack> ariard: then after you rebased, i verified the warning was gone by re-pulling your latest rebase and re-building with clang
 2813:06 <ariard> _andrewtoth_: right where is cs_main taken now ?
 2913:06 <ariard> jonatack: you run built with --enable-debug what the flag is doing ?
 3013:06 <jonatack> ariard: it seems to me that passing `--enable-debug` with clang on debian works too... it wanted to check that!
 3113:06 <jonatack> s/it/i/
 3213:07 <_andrewtoth_> ariard: inside the Chain interface, as opposed to wallet code, right?
 3313:07 <jonatack> ariard: git grepped various call sites to see that you covered them all
 3413:08 <ariard> _andrewtoth_: correct did you dig a bit on Chain interface ?
 3513:08 <ariard> and why it's a performance improvement to not hold the locks in the wallet code ?
 3613:08 <_andrewtoth_> hmm no that part is not clear to me yet
 3713:08 <_andrewtoth_> if lock is taken inside chain anyways, why is it a benefit?
 3813:08 <jkczyz> hi
 3913:08 <ariard> jonatack: --enable-debug is a different check than clang, it's going to enable DEBUG_LOCKORDER
 4013:08 <jonatack> ariard: running with --enable-debug adds the -DDEBUG -DDEBUG_LOCKORDER flags afaik
 4113:09 <ariard> jonatack: exactly
 4213:09 <Talkless> hi
 4313:09 <Talkless> jonatack: do you ever use --enable-werror ?
 4413:09 <Talkless> or you just save build output & grep warnings..?
 4513:09 <jonatack> Talkless: no, i need to try that. the latter one, correct :)
 4613:10 <Talkless> ok
 4713:10 <ariard> jonatack: did you dig into src/sync.cpp to undersand effet of DEBUG_LOCKORDER?
 4813:10 <jonatack> i'd like to try the various options practicalswift lists here: https://github.com/bitcoin/bitcoin/issues/17344
 4913:11 <ariard> _andrewtoth_: it's a benefit because right now we hold the cs_main at beginning of every RPC call
 5013:11 <jonatack> since i reviewed the PR with the uninitialized bug that caused the patch release 0.19.0.1 and would like to catch those in the future
 5113:11 <ariard> _andrewtoth_ : and it's hold until the RPC finished
 5213:12 <ariard> with this patchset: cs_main is just hold to fetch chain data and no more
 5313:12 <_andrewtoth_> i see thanks
 5413:12 <ariard> we lock cs_main just-in-time if you prefer
 5513:13 <ariard> jonatack: oh that's an interesting list thanks
 5613:14 <ariard> jonatack: yes I think generally we have a good deal of improvements of using sanitizers/automated tests/fuzzing
 5713:14 <jonatack> ariard: sync.cpp::L31-215 took a cursory look at the effects of DEBUG_LOCKORDER, will spend more time in that file
 5813:14 <jonatack> ariard: yes!
 5913:14 <ariard> While where reading the code which commit(s) attracted your attention ?
 6013:15 <_andrewtoth_> there are comments mentioning this could potentially improve IBD. How would it? Wouldn't it only affect performance if wallet rpcs are called during IBD?
 6113:15 <ariard> What should be carefully examined ? What could go wrong ?
 6213:16 <ariard> _andrewtoth_ : have a look on validationinterface, right now wallet lock cs_main in BlockConnected/BlockDisconnected
 6313:16 <jonatack> ariard: the last commit that removes the locks seems potentially the most dangerous
 6413:16 <ariard> I mean wallet is an indirect consumer of validationinterface through the Chain interface
 6513:17 <ariard> jonatack: yes DEBUG_LOCKORDER enable a process struct to pair lock order and check at any new tacking if it doesn't invert
 6613:18 <ariard> and process struct have pointers back to a per-thread struct which trace lock position in file
 6713:18 <ariard> jonatack: yes last commit is the most painful one to review, forgetting a lock somewhere could cause deadlock
 6813:21 <jonatack> ariard: did you experience any deadlocks while working on this PR?
 6913:21 <ariard> okay next question : Did this PR changes the wallet behavior?
 7013:21 <jonatack> (if yes, how did they show themselves?)
 7113:22 <ariard> jonatack: ahah ofc but it was in the wallet test framework because we don't take locks through the Chain interface there
 7213:22 <ariard> but the wild way of LOCK(cs_main)
 7313:23 <ariard> jonatack: they showed up thanks to --enable-debug
 7413:23 <_andrewtoth_> ariard: I think it only changes it if cs_main is locked by something else, then it has to wait. Otherwise, no it does not
 7513:23 <_andrewtoth_> *doesn't have to wait
 7613:24 <_andrewtoth_> up until it needs to get something from the chain
 7713:25 <jonatack> ariard: What gave you the initial idea to begin working on this series of PRs? Something you read, that someone said, or the idea came to you on its own?
 7813:25 <ariard> _andrewtoth_: yes that should change thread concurrency which should be better, I was more thinking about wallet behavior if the sense of different return values or breaking some internal features
 7913:26 <_andrewtoth_> then the answer is no
 8013:27 <jonatack> Given that so few tests needed to be changed, I surmised no wallet behavior change other than perhaps performance. Did you bench it?
 8113:27 <ariard> jonatack: I wish a more modular core project with clean, well-functional interfaces, and this serie of PR was a good beginning
 8213:28 <jonatack> ariard: "a good beginning"... hehe, yes, i'd say so! +1
 8313:28 <ariard> _andrewtoth_: well actually it may have change if wallet do something like A=getheight(), B=getheight(), if A == B do something, if A != B do something else
 8413:29 <_andrewtoth_> ahh yes
 8513:29 <ariard> _andrewtoth_: I've verified all the callsites which are making assumptions on a particular tip and don't find anything wrong
 8613:29 <ariard> most of them are related to the rescan code, which currently don't lock the cs_main and verify every previous value returned from the Chain interface
 8713:30 <jonatack> ariard: did you use gdb to verify, or?
 8813:30 <ariard> you should grep all the "getBlockHeight" "getBlockTime"
 8913:30 <jonatack> or just grepping
 9013:31 <ariard> and try to think about what have changed in the sense of wallet view of the blockchain has changed or not
 9113:31 <ariard> jonatack: yeah just grep "get*" in wallet code
 9213:31 <ariard> jonatack: I don't see whagt gdb would discover there
 9313:32 <ariard> next question : Why does the lock order need to be inverted all at once?
 9413:32 <ariard> (easy one)
 9513:33 <jkczyz> otherwise there could be a deadlock
 9613:33 <jonatack> ariard: to check values, as you mentioned verifying values returned. It's interesting to see how others debug bitcoin core. For example, i've learned a lot from watching achow101's live coding sessions on twitch tv
 9713:34 <ariard> jkczyz: correct
 9813:34 <jonatack> as soon as he gets stuck, he often jumps right into gdb.
 9913:34 <ariard> jonatack: what were your biggest learnings ?
10013:35 <ariard> jonatack: honestly for debugging it's more adding more printers and take pen and papers and think what I got wrong
10113:35 <ariard> like drawing control flow, see where there is write/read on the modified data structures
10213:36 <ariard> depend what kind of bugs you're tracking but IMO gdb is overkill for a project like bitcoin core
10313:36 <jonatack> ariard: from watching achow101 (so far): compiler-driven debugging, gdb use, inspecting wallet files using Berkeley DB special commands (located in db4/bin), using flame graphs.
10413:36 <ariard> jonatack: yes we should all keep an eye on flame graph when we modify part of the code which touch performance
10513:37 <jonatack> ariard: yes, i tend to begin with prints and asserts but have not found myself deep in the weeds yet in my PRs since they have been simple ones
10613:37 <jonatack> so more for reviewing others' PRs
10713:37 <ariard> but I think how you're debugging is also super related to what part of codebase you're working on, P2P != wallet
10813:38 <ariard> jonatack: when reviewing others' PRs I like to go through modified changes and tweak them to see if there is a better way of doing it
10913:38 <_andrewtoth_> jonatack: what is compiler-driven debugging?
11013:38 <ariard> that force you to understand the code
11113:38 <jonatack> ariard: yes, i noticed that you have always been assidious and good wrt drawing control flow
11213:39 <jonatack> assiduous*
11313:39 <jonatack> tweaking others' code is a great way to go +1
11413:40 <jonatack> compiler-driven debugging: relying on compiler errors to see what's not working and fix it
11513:40 <ariard> next question: Do you expect this PR to have any impact on performace? But we already talk a bit about this, so let's readjust the question how could we make wallet code better to service concurrently multiple RPC calls ?
11613:41 <jkczyz> moving the code to a more modular state will also ease unit testing
11713:42 <jkczyz> ariard: perhaps using a per-wallet lock would help
11813:42 <ariard> jkczyz: ofc I think we were considering with Marco to disable block writing just to fuzz p2p/validation stack without burning your disks
11913:42 <jonatack> yes. could enable shifting more testing from functional to unit (faster), or better coverage
12013:43 <ariard> jkczyz: it's already the case
12113:43 <jkczyz> ah, didn't realize that
12213:44 <jonatack> and modularity could help improve fuzzing and PBT testing as well
12313:44 <ariard> jkczyz: I think the way of thinking what are the main data structures of the wallet, like coins, keys, address, what can be accessed concurrently and for what we need a synchronized view
12413:44 <ariard> like you don't want both RPC building a transaction spending the same coin
12513:45 <ariard> and you don't want to return same address twice
12613:45 <ariard> but all read rpc like displaying coins received could be concurrent
12713:45 <ariard> jonatack: PBT testing ?
12813:46 <jonatack> ariard: property-based testing
12913:46 <ariard> a bit off-topic but having more fine-grained lock in wallet is an interesting topic to dig in
13013:46 <ariard> if people are looking for cool PRs to work on (I don't plan to do this)
13113:47 <ariard> next question: Can you think of any other way to remove the cs_main lock from the wallet code?
13213:47 <jkczyz> I see what you are saying. I was thinking of different wallets. For the same wallet, should concurrent RPCs that "write" be disallowed then? Or at least while holding a higer-level lock?
13313:47 <jonatack> (chris stewart has a large PR to add a bunch of PBT that needs review and we ought to add PBT coverage for instance to the BIP157-158 changes, there may be low-hanging fruit there)
13413:48 <jkczyz> (or rather should require holding a higher-level lock)
13513:49 <ariard> jkczyz: "should concurrent RPcs that "write" be disallowed then" I think you should have a cs_coins lock to hold to avoid concurrent write
13613:49 <ariard> so yes should require holding a higher-level lock but not necessary the actual cs_wallet one
13713:49 <ariard> jonatack: what the PR already, would try to review it
13813:51 <jonatack> ariard: https://github.com/bitcoin/bitcoin/pull/14430
13913:51 <jonatack> "Add more property based tests for basic bitcoin data structures"
14013:51 <jonatack> it's on my review list too
14113:52 <jonatack> ariard: "having more fine-grained lock in wallet"... interesting
14213:53 <ariard> jonatack: yes you need #16426, because without that would be useless
14313:53 <jonatack> ariard: Do you envision other building other changes on this PR in addition to those you describe in your PR body?
14413:53 <ariard> next question: What further changes could be build on top of this PR?
14513:54 <jonatack> snap! :D
14613:55 <ariard> jonatack: okay I have some changes building on top of it on the rescan logic
14713:55 <ariard> well it's not exactly on top of it but it's going to be easier to review without having to care about cs_main
14813:56 <ariard> and more generally drying-up the Chain interface until it makes sense to have other types of client using it
14913:56 <ariard> like lightning daemons ;)
15013:57 <jonatack> afaict #16426 should get in soon if no bugs are found during review, better to merge it not too late in this release cycle to have time to catch anything unexpected
15113:57 <ariard> jonatack: it's really biased but yes that's a blocker to more broader improvements both on the wallet-side on interface-sides
15213:57 <ariard> and many of them can be worked in parallel after it
15313:59 <ariard> and the Chain interface itself maybe should be splitted in different ones, like at least one for fee estimation
15414:00 <ariard> #endmeeting
15514:00 <jonatack> ariard: interesting
15614:01 <_andrewtoth_> ariard: thanks!
15714:01 <ariard> okay thanks all for participation, will hang here a bit more if you have questions, and once #16426 merged I will open issues to keep track what can be do as follow-ups
15814:01 <jonatack> ariard: about lightning, fwiw i'll ping you soon to talk about the anchor outputs RFC PR 688
15914:01 <jonatack> thanks ariard and everyone!
16014:02 <mattcruzz> ariard: Thanks for hosting, there's some interesting topics I need dig into now
16114:02 <ariard> yes to everyone don't censor yourself for reviewing, everyone is doing mistake and you're not going to break anything by doing a review
16214:02 <ariard> quite the contrary :)
16314:04 <jonatack> ariard: perhaps add a roadmap about your ideas in https://github.com/bitcoin/bitcoin/projects/ (or as part of https://github.com/bitcoin/bitcoin/projects/10)
16414:04 <jonatack> "Process separation"
16514:05 <jonatack> and very good point about not censoring yourself and reviewing!
16614:06 <fanquake> ariard if you want open a follow ups issue in advance of that being merged, feel free. Better to capture all the issues etc now, rather than digging through comments post-merge.
16714:06 <ariard> fanquake: one issue per-idea or big issues and listing everything ?
16814:08 <fanquake> 1 issue per big idea/refactor is probably fine. If there's nits/small follows up, probably best to combine them all into a single issue.
16914:09 <fanquake> Main point is just not to lose PR commentary/things that should be done as follow ups post merge.