Reverse cs_main, cs_wallet lock order and reduce cs_main locking (wallet)

https://github.com/bitcoin/bitcoin/16426

Host: ariard

Notes

  • The wallet is an integral part of the bitcoind full node for historic reasons. See the first git commit where the wallet lock cs_wallet is taken after cs_main.

  • The wallet was split from the node in PR 10973.

  • The new interface interfaces::Chain lets the wallet access chain and mempool state for its own operations. Interface methods allow the wallet to do things like access the block height to know a transaction’s confirmation depth or estimate the feerate.

  • Some heavily-used interfaces::Chain methods impose synchronisation between wallet and node because Chain::Lock needs to be taken by the wallet before its own lock.

  • Switching the lock order from cs_main -> cs_wallet to cs_wallet -> cs_main would avoid needlessly stalling the node when there is no need to access chain data.

  • PR 15931 removed GetDepthInMainChain’s requirement to lock the chain and simplified the reasoning on locks and events ordering.

  • A long-term goal is to separate major components of the Bitcoin Core node (the wallet, the P2P layer, the GUI) to let them evolve independently and only communicate asynchronously through well-defined interfaces: See ryanofsky’s github issue or eklitzke’s blog post for reasoning.

  • To learn more about lock management in Bitcoin Core:

Questions

  1. Did you review the PR? Concept ACK, approach ACK, ACK, or NACK? (Don’t forget to put your PR review on GitHub.)

  2. Did you take any steps beyond reading the code? Did you run the tests on your local machine with --enable-debug?

  3. While reading the code, which commit(s) attracted your attention? What should be carefully examined? What could go wrong?

  4. What could happen if a deadlock was introduced in the code?

  5. Does this PR change the wallet behavior?

  6. Why does the lock order need to be inverted all at once?

  7. Do you expect this PR to have any impact on performance?

  8. Can you think of any other way to remove the cs_main lock from the wallet code?

  9. What further changes could be build on top of this PR?

Meeting Log

13:00 < ariard> #startmeeting
13:00 < emilengler> hello
13:00 < jnewbery> hi
13:00 < _andrewtoth_> hi
13:00 < fjahr> hi
13:00 < sanoj> hi
13:00 < jonatack> hi
13:01 < ariard> today's PR is https://bitcoincore.reviews/16426.html : Reverse cs_main, cs_wallet lock order and reduce cs_main locking
13:01 < mattcruzz> hi
13:01 < ariard> who had a chance to review the PR ?
13:01 < jonatack> yup
13:01 < emilengler> negative
13:01 < ariard> jonatack: cool what was your review process?
13:02 < fjahr> started but not finished
13:02 < ariard> others did you read PR top message ? what the main problem this PR is trying to tackle ?
13:03 < jonatack> ariard: pulled the PR branch, opened diff locally with gitk, built with --enable-debug, ran tests, ran bitcoind, looked for warnings, read the commits one by on gitk, git grepped varois sites
13:03 < pinheadmz> hi!
13:03 < _andrewtoth_> it's removing a requirement of locking cs_main before taking cs_wallet lock
13:04 < ariard> _andrewtoth: correct, does locking cs_main still occurs afterwards?
13:04 < _andrewtoth_> there's a lot of backstory to this PR, as several of your refactor PRs are linked
13:04 < ariard> jonatack: "git grepped varois sites"?
13:04 < jonatack> ariard: after jkczyz commented on the build warning, i then wanted to reproduce, so switched from gcc to clang, which showed the warning for me (reassuring)
13:04 < _andrewtoth_> cs_main still needs to be taken yes
13:04 < jonatack> ariard: note to self: build with clang more often
13:05 < ariard> jonatack: yeah having a Mac env is great to enable all clang options
13:05 < ariard> jonatack: that's a bit cumbersome when you're working with locks on core, we have different lock safety mechanisms following the platforms
13:05 < jonatack> ariard: then after you rebased, i verified the warning was gone by re-pulling your latest rebase and re-building with clang
13:06 < ariard> _andrewtoth_: right where is cs_main taken now ?
13:06 < ariard> jonatack: you run built with --enable-debug what the flag is doing ?
13:06 < jonatack> ariard: it seems to me that passing `--enable-debug` with clang on debian works too... it wanted to check that!
13:06 < jonatack> s/it/i/
13:07 < _andrewtoth_> ariard: inside the Chain interface, as opposed to wallet code, right?
13:07 < jonatack> ariard: git grepped various call sites to see that you covered them all
13:08 < ariard> _andrewtoth_: correct did you dig a bit on Chain interface ?
13:08 < ariard> and why it's a performance improvement to not hold the locks in the wallet code ?
13:08 < _andrewtoth_> hmm no that part is not clear to me yet
13:08 < _andrewtoth_> if lock is taken inside chain anyways, why is it a benefit?
13:08 < jkczyz> hi
13:08 < ariard> jonatack: --enable-debug is a different check than clang, it's going to enable DEBUG_LOCKORDER
13:08 < jonatack> ariard: running with --enable-debug adds the -DDEBUG -DDEBUG_LOCKORDER flags afaik
13:09 < ariard> jonatack: exactly
13:09 < Talkless> hi
13:09 < Talkless> jonatack: do you ever use --enable-werror ?
13:09 < Talkless> or you just save build output & grep warnings..?
13:09 < jonatack> Talkless: no, i need to try that. the latter one, correct :)
13:10 < Talkless> ok
13:10 < ariard> jonatack: did you dig into src/sync.cpp to undersand effet of DEBUG_LOCKORDER?
13:10 < jonatack> i'd like to try the various options practicalswift lists here: https://github.com/bitcoin/bitcoin/issues/17344
13:11 < ariard> _andrewtoth_: it's a benefit because right now we hold the cs_main at beginning of every RPC call
13:11 < jonatack> since i reviewed the PR with the uninitialized bug that caused the patch release 0.19.0.1 and would like to catch those in the future
13:11 < ariard> _andrewtoth_ : and it's hold until the RPC finished
13:12 < ariard> with this patchset: cs_main is just hold to fetch chain data and no more
13:12 < _andrewtoth_> i see thanks
13:12 < ariard> we lock cs_main just-in-time if you prefer
13:13 < ariard> jonatack: oh that's an interesting list thanks
13:14 < ariard> jonatack: yes I think generally we have a good deal of improvements of using sanitizers/automated tests/fuzzing
13:14 < jonatack> ariard: sync.cpp::L31-215 took a cursory look at the effects of DEBUG_LOCKORDER, will spend more time in that file
13:14 < jonatack> ariard: yes!
13:14 < ariard> While where reading the code which commit(s) attracted your attention ?
13:15 < _andrewtoth_> there are comments mentioning this could potentially improve IBD. How would it? Wouldn't it only affect performance if wallet rpcs are called during IBD?
13:15 < ariard> What should be carefully examined ? What could go wrong ?
13:16 < ariard> _andrewtoth_ : have a look on validationinterface, right now wallet lock cs_main in BlockConnected/BlockDisconnected
13:16 < jonatack> ariard: the last commit that removes the locks seems potentially the most dangerous
13:16 < ariard> I mean wallet is an indirect consumer of validationinterface through the Chain interface
13:17 < ariard> jonatack: yes DEBUG_LOCKORDER enable a process struct to pair lock order and check at any new tacking if it doesn't invert
13:18 < ariard> and process struct have pointers back to a per-thread struct which trace lock position in file
13:18 < ariard> jonatack: yes last commit is the most painful one to review, forgetting a lock somewhere could cause deadlock
13:21 < jonatack> ariard: did you experience any deadlocks while working on this PR?
13:21 < ariard> okay next question : Did this PR changes the wallet behavior?
13:21 < jonatack> (if yes, how did they show themselves?)
13:22 < ariard> jonatack: ahah ofc but it was in the wallet test framework because we don't take locks through the Chain interface there
13:22 < ariard> but the wild way of LOCK(cs_main)
13:23 < ariard> jonatack: they showed up thanks to --enable-debug
13:23 < _andrewtoth_> ariard: I think it only changes it if cs_main is locked by something else, then it has to wait. Otherwise, no it does not
13:23 < _andrewtoth_> *doesn't have to wait
13:24 < _andrewtoth_> up until it needs to get something from the chain
13:25 < jonatack> ariard: What gave you the initial idea to begin working on this series of PRs? Something you read, that someone said, or the idea came to you on its own?
13:25 < ariard> _andrewtoth_: yes that should change thread concurrency which should be better, I was more thinking about wallet behavior if the sense of different return values or breaking some internal features
13:26 < _andrewtoth_> then the answer is no
13:27 < jonatack> Given that so few tests needed to be changed, I surmised no wallet behavior change other than perhaps performance. Did you bench it?
13:27 < ariard> jonatack: I wish a more modular core project with clean, well-functional interfaces, and this serie of PR was a good beginning
13:28 < jonatack> ariard: "a good beginning"... hehe, yes, i'd say so! +1
13:28 < ariard> _andrewtoth_: well actually it may have change if wallet do something like A=getheight(), B=getheight(), if A == B do something, if A != B do something else
13:29 < _andrewtoth_> ahh yes
13:29 < ariard> _andrewtoth_: I've verified all the callsites which are making assumptions on a particular tip and don't find anything wrong
13:29 < ariard> most of them are related to the rescan code, which currently don't lock the cs_main and verify every previous value returned from the Chain interface
13:30 < jonatack> ariard: did you use gdb to verify, or?
13:30 < ariard> you should grep all the "getBlockHeight" "getBlockTime"
13:30 < jonatack> or just grepping
13:31 < ariard> and try to think about what have changed in the sense of wallet view of the blockchain has changed or not
13:31 < ariard> jonatack: yeah just grep "get*" in wallet code
13:31 < ariard> jonatack: I don't see whagt gdb would discover there
13:32 < ariard> next question : Why does the lock order need to be inverted all at once?
13:32 < ariard> (easy one)
13:33 < jkczyz> otherwise there could be a deadlock
13:33 < jonatack> ariard: to check values, as you mentioned verifying values returned. It's interesting to see how others debug bitcoin core. For example, i've learned a lot from watching achow101's live coding sessions on twitch tv
13:34 < ariard> jkczyz: correct
13:34 < jonatack> as soon as he gets stuck, he often jumps right into gdb.
13:34 < ariard> jonatack: what were your biggest learnings ?
13:35 < ariard> jonatack: honestly for debugging it's more adding more printers and take pen and papers and think what I got wrong
13:35 < ariard> like drawing control flow, see where there is write/read on the modified data structures
13:36 < ariard> depend what kind of bugs you're tracking but IMO gdb is overkill for a project like bitcoin core
13:36 < jonatack> ariard: from watching achow101 (so far): compiler-driven debugging, gdb use, inspecting wallet files using Berkeley DB special commands (located in db4/bin), using flame graphs.
13:36 < ariard> jonatack: yes we should all keep an eye on flame graph when we modify part of the code which touch performance
13:37 < jonatack> ariard: yes, i tend to begin with prints and asserts but have not found myself deep in the weeds yet in my PRs since they have been simple ones
13:37 < jonatack> so more for reviewing others' PRs
13:37 < ariard> but I think how you're debugging is also super related to what part of codebase you're working on, P2P != wallet
13:38 < ariard> jonatack: when reviewing others' PRs I like to go through modified changes and tweak them to see if there is a better way of doing it
13:38 < _andrewtoth_> jonatack: what is compiler-driven debugging?
13:38 < ariard> that force you to understand the code
13:38 < jonatack> ariard: yes, i noticed that you have always been assidious and good wrt drawing control flow
13:39 < jonatack> assiduous*
13:39 < jonatack> tweaking others' code is a great way to go +1
13:40 < jonatack> compiler-driven debugging: relying on compiler errors to see what's not working and fix it
13:40 < ariard> next question: Do you expect this PR to have any impact on performace? But we already talk a bit about this, so let's readjust the question how could we make wallet code better to service concurrently multiple RPC calls ?
13:41 < jkczyz> moving the code to a more modular state will also ease unit testing
13:42 < jkczyz> ariard: perhaps using a per-wallet lock would help
13:42 < ariard> jkczyz: ofc I think we were considering with Marco to disable block writing just to fuzz p2p/validation stack without burning your disks
13:42 < jonatack> yes. could enable shifting more testing from functional to unit (faster), or better coverage
13:43 < ariard> jkczyz: it's already the case
13:43 < jkczyz> ah, didn't realize that
13:44 < jonatack> and modularity could help improve fuzzing and PBT testing as well
13:44 < ariard> jkczyz: I think the way of thinking what are the main data structures of the wallet, like coins, keys, address, what can be accessed concurrently and for what we need a synchronized view
13:44 < ariard> like you don't want both RPC building a transaction spending the same coin
13:45 < ariard> and you don't want to return same address twice
13:45 < ariard> but all read rpc like displaying coins received could be concurrent
13:45 < ariard> jonatack: PBT testing ?
13:46 < jonatack> ariard: property-based testing
13:46 < ariard> a bit off-topic but having more fine-grained lock in wallet is an interesting topic to dig in
13:46 < ariard> if people are looking for cool PRs to work on (I don't plan to do this)
13:47 < ariard> next question: Can you think of any other way to remove the cs_main lock from the wallet code?
13:47 < jkczyz> I see what you are saying. I was thinking of different wallets. For the same wallet, should concurrent RPCs that "write" be disallowed then? Or at least while holding a higer-level lock?
13:47 < jonatack> (chris stewart has a large PR to add a bunch of PBT that needs review and we ought to add PBT coverage for instance to the BIP157-158 changes, there may be low-hanging fruit there)
13:48 < jkczyz> (or rather should require holding a higher-level lock)
13:49 < ariard> jkczyz: "should concurrent RPcs that "write" be disallowed then" I think you should have a cs_coins lock to hold to avoid concurrent write
13:49 < ariard> so yes should require holding a higher-level lock but not necessary the actual cs_wallet one
13:49 < ariard> jonatack: what the PR already, would try to review it
13:51 < jonatack> ariard: https://github.com/bitcoin/bitcoin/pull/14430
13:51 < jonatack> "Add more property based tests for basic bitcoin data structures"
13:51 < jonatack> it's on my review list too
13:52 < jonatack> ariard: "having more fine-grained lock in wallet"... interesting
13:53 < ariard> jonatack: yes you need #16426, because without that would be useless
13:53 < jonatack> ariard: Do you envision other building other changes on this PR in addition to those you describe in your PR body?
13:53 < ariard> next question: What further changes could be build on top of this PR?
13:54 < jonatack> snap! :D
13:55 < ariard> jonatack: okay I have some changes building on top of it on the rescan logic
13:55 < ariard> well it's not exactly on top of it but it's going to be easier to review without having to care about cs_main
13:56 < ariard> and more generally drying-up the Chain interface until it makes sense to have other types of client using it
13:56 < ariard> like lightning daemons ;)
13:57 < jonatack> afaict #16426 should get in soon if no bugs are found during review, better to merge it not too late in this release cycle to have time to catch anything unexpected
13:57 < ariard> jonatack: it's really biased but yes that's a blocker to more broader improvements both on the wallet-side on interface-sides
13:57 < ariard> and many of them can be worked in parallel after it
13:59 < ariard> and the Chain interface itself maybe should be splitted in different ones, like at least one for fee estimation
14:00 < ariard> #endmeeting
14:00 < jonatack> ariard: interesting
14:01 < _andrewtoth_> ariard: thanks!
14:01 < ariard> okay thanks all for participation, will hang here a bit more if you have questions, and once #16426 merged I will open issues to keep track what can be do as follow-ups
14:01 < jonatack> ariard: about lightning, fwiw i'll ping you soon to talk about the anchor outputs RFC PR 688
14:01 < jonatack> thanks ariard and everyone!
14:02 < mattcruzz> ariard: Thanks for hosting, there's some interesting topics I need dig into now
14:02 < ariard> yes to everyone don't censor yourself for reviewing, everyone is doing mistake and you're not going to break anything by doing a review
14:02 < ariard> quite the contrary :)
14:04 < jonatack> ariard: perhaps add a roadmap about your ideas in https://github.com/bitcoin/bitcoin/projects/ (or as part of https://github.com/bitcoin/bitcoin/projects/10)
14:04 < jonatack> "Process separation"
14:05 < jonatack> and very good point about not censoring yourself and reviewing!
14:06 < fanquake> ariard if you want open a follow ups issue in advance of that being merged, feel free. Better to capture all the issues etc now, rather than digging through comments post-merge.
14:06 < ariard> fanquake: one issue per-idea or big issues and listing everything ?
14:08 < fanquake> 1 issue per big idea/refactor is probably fine. If there's nits/small follows up, probably best to combine them all into a single issue.
14:09 < fanquake> Main point is just not to lose PR commentary/things that should be done as follow ups post merge.