Reverse cs_main, cs_wallet lock order and reduce cs_main locking (wallet)

Dec 4, 2019

https://github.com/bitcoin/bitcoin/pull/16426

Host: ariard - PR author: ariard

Notes

The wallet is an integral part of the bitcoind full node for historic reasons. See the first git commit where the wallet lock cs_wallet is taken after cs_main.
The wallet was split from the node in PR 10973.
The new interface interfaces::Chain lets the wallet access chain and mempool state for its own operations. Interface methods allow the wallet to do things like access the block height to know a transaction’s confirmation depth or estimate the feerate.
Some heavily-used interfaces::Chain methods impose synchronisation between wallet and node because Chain::Lock needs to be taken by the wallet before its own lock.
Switching the lock order from cs_main -> cs_wallet to cs_wallet -> cs_main would avoid needlessly stalling the node when there is no need to access chain data.
PR 15931 removed GetDepthInMainChain’s requirement to lock the chain and simplified the reasoning on locks and events ordering.
A long-term goal is to separate major components of the Bitcoin Core node (the wallet, the P2P layer, the GUI) to let them evolve independently and only communicate asynchronously through well-defined interfaces: See ryanofsky’s github issue or eklitzke’s blog post for reasoning.
To learn more about lock management in Bitcoin Core:
- The Bitcoin Core debug_lockorder notes explain how to compile and run bitcoind with deadlock detection.
- The Clang Thread Safety Analysis documentation gives information about static analysis of locks in the Clang compiler.
- Reading Bitcoin Core’s locking code in src/sync.h is also recommended.

Questions

Did you review the PR? Concept ACK, approach ACK, ACK, or NACK? (Don’t forget to put your PR review on GitHub.)
Did you take any steps beyond reading the code? Did you run the tests on your local machine with --enable-debug?
While reading the code, which commit(s) attracted your attention? What should be carefully examined? What could go wrong?
What could happen if a deadlock was introduced in the code?
Does this PR change the wallet behavior?
Why does the lock order need to be inverted all at once?
Do you expect this PR to have any impact on performance?
Can you think of any other way to remove the cs_main lock from the wallet code?
What further changes could be build on top of this PR?

Meeting Log

13:00

<ariard> #startmeeting

13:00

<emilengler> hello

13:00

<jnewbery> hi

13:00

<_andrewtoth_> hi

13:00

<fjahr> hi

13:00

<sanoj> hi

13:00

<jonatack> hi

13:01

<ariard> today's PR is https://bitcoincore.reviews/16426.html : Reverse cs_main, cs_wallet lock order and reduce cs_main locking

13:01

<mattcruzz> hi

13:01

<ariard> who had a chance to review the PR ?

13:01

<jonatack> yup

13:01

<emilengler> negative

13:01

<ariard> jonatack: cool what was your review process?

13:02

<fjahr> started but not finished

13:02

<ariard> others did you read PR top message ? what the main problem this PR is trying to tackle ?

13:03

<jonatack> ariard: pulled the PR branch, opened diff locally with gitk, built with --enable-debug, ran tests, ran bitcoind, looked for warnings, read the commits one by on gitk, git grepped varois sites

13:03

<pinheadmz> hi!

13:03

<_andrewtoth_> it's removing a requirement of locking cs_main before taking cs_wallet lock

13:04

<ariard> _andrewtoth: correct, does locking cs_main still occurs afterwards?

13:04

<_andrewtoth_> there's a lot of backstory to this PR, as several of your refactor PRs are linked

13:04

<ariard> jonatack: "git grepped varois sites"?

13:04

<jonatack> ariard: after jkczyz commented on the build warning, i then wanted to reproduce, so switched from gcc to clang, which showed the warning for me (reassuring)

13:04

<_andrewtoth_> cs_main still needs to be taken yes

13:04

<jonatack> ariard: note to self: build with clang more often

13:05

<ariard> jonatack: yeah having a Mac env is great to enable all clang options

13:05

<ariard> jonatack: that's a bit cumbersome when you're working with locks on core, we have different lock safety mechanisms following the platforms

13:05

<jonatack> ariard: then after you rebased, i verified the warning was gone by re-pulling your latest rebase and re-building with clang

13:06

<ariard> _andrewtoth_: right where is cs_main taken now ?

13:06

<ariard> jonatack: you run built with --enable-debug what the flag is doing ?

13:06

<jonatack> ariard: it seems to me that passing `--enable-debug` with clang on debian works too... it wanted to check that!

13:06

<jonatack> s/it/i/

13:07

<_andrewtoth_> ariard: inside the Chain interface, as opposed to wallet code, right?

13:07

<jonatack> ariard: git grepped various call sites to see that you covered them all

13:08

<ariard> _andrewtoth_: correct did you dig a bit on Chain interface ?

13:08

<ariard> and why it's a performance improvement to not hold the locks in the wallet code ?

13:08

<_andrewtoth_> hmm no that part is not clear to me yet

13:08

<_andrewtoth_> if lock is taken inside chain anyways, why is it a benefit?

13:08

<jkczyz> hi

13:08

<ariard> jonatack: --enable-debug is a different check than clang, it's going to enable DEBUG_LOCKORDER

13:08

<jonatack> ariard: running with --enable-debug adds the -DDEBUG -DDEBUG_LOCKORDER flags afaik

13:09

<ariard> jonatack: exactly

13:09

<Talkless> hi

13:09

<Talkless> jonatack: do you ever use --enable-werror ?

13:09

<Talkless> or you just save build output & grep warnings..?

13:09

<jonatack> Talkless: no, i need to try that. the latter one, correct :)

13:10

<Talkless> ok

13:10

<ariard> jonatack: did you dig into src/sync.cpp to undersand effet of DEBUG_LOCKORDER?

13:10

<jonatack> i'd like to try the various options practicalswift lists here: https://github.com/bitcoin/bitcoin/issues/17344

13:11

<ariard> _andrewtoth_: it's a benefit because right now we hold the cs_main at beginning of every RPC call

13:11

<jonatack> since i reviewed the PR with the uninitialized bug that caused the patch release 0.19.0.1 and would like to catch those in the future

13:11

<ariard> _andrewtoth_ : and it's hold until the RPC finished

13:12

<ariard> with this patchset: cs_main is just hold to fetch chain data and no more

13:12

<_andrewtoth_> i see thanks

13:12

<ariard> we lock cs_main just-in-time if you prefer

13:13

<ariard> jonatack: oh that's an interesting list thanks

13:14

<ariard> jonatack: yes I think generally we have a good deal of improvements of using sanitizers/automated tests/fuzzing

13:14

<jonatack> ariard: sync.cpp::L31-215 took a cursory look at the effects of DEBUG_LOCKORDER, will spend more time in that file

13:14

<jonatack> ariard: yes!

13:14

<ariard> While where reading the code which commit(s) attracted your attention ?

13:15

<_andrewtoth_> there are comments mentioning this could potentially improve IBD. How would it? Wouldn't it only affect performance if wallet rpcs are called during IBD?

13:15

<ariard> What should be carefully examined ? What could go wrong ?

13:16

<ariard> _andrewtoth_ : have a look on validationinterface, right now wallet lock cs_main in BlockConnected/BlockDisconnected

13:16

<jonatack> ariard: the last commit that removes the locks seems potentially the most dangerous

13:16

<ariard> I mean wallet is an indirect consumer of validationinterface through the Chain interface

13:17

<ariard> jonatack: yes DEBUG_LOCKORDER enable a process struct to pair lock order and check at any new tacking if it doesn't invert

13:18

<ariard> and process struct have pointers back to a per-thread struct which trace lock position in file

13:18

<ariard> jonatack: yes last commit is the most painful one to review, forgetting a lock somewhere could cause deadlock

13:21

<jonatack> ariard: did you experience any deadlocks while working on this PR?

13:21

<ariard> okay next question : Did this PR changes the wallet behavior?

13:21

<jonatack> (if yes, how did they show themselves?)

13:22

<ariard> jonatack: ahah ofc but it was in the wallet test framework because we don't take locks through the Chain interface there

13:22

<ariard> but the wild way of LOCK(cs_main)

13:23

<ariard> jonatack: they showed up thanks to --enable-debug

13:23

<_andrewtoth_> ariard: I think it only changes it if cs_main is locked by something else, then it has to wait. Otherwise, no it does not

13:23

<_andrewtoth_> *doesn't have to wait

13:24

<_andrewtoth_> up until it needs to get something from the chain

13:25

<jonatack> ariard: What gave you the initial idea to begin working on this series of PRs? Something you read, that someone said, or the idea came to you on its own?

13:25

<ariard> _andrewtoth_: yes that should change thread concurrency which should be better, I was more thinking about wallet behavior if the sense of different return values or breaking some internal features

13:26

<_andrewtoth_> then the answer is no

13:27

<jonatack> Given that so few tests needed to be changed, I surmised no wallet behavior change other than perhaps performance. Did you bench it?

13:27

<ariard> jonatack: I wish a more modular core project with clean, well-functional interfaces, and this serie of PR was a good beginning

13:28

<jonatack> ariard: "a good beginning"... hehe, yes, i'd say so! +1

13:28

<ariard> _andrewtoth_: well actually it may have change if wallet do something like A=getheight(), B=getheight(), if A == B do something, if A != B do something else

13:29

<_andrewtoth_> ahh yes

13:29

<ariard> _andrewtoth_: I've verified all the callsites which are making assumptions on a particular tip and don't find anything wrong

13:29

<ariard> most of them are related to the rescan code, which currently don't lock the cs_main and verify every previous value returned from the Chain interface

13:30

<jonatack> ariard: did you use gdb to verify, or?

13:30

<ariard> you should grep all the "getBlockHeight" "getBlockTime"

13:30

<jonatack> or just grepping

13:31

<ariard> and try to think about what have changed in the sense of wallet view of the blockchain has changed or not

13:31

<ariard> jonatack: yeah just grep "get*" in wallet code

13:31

<ariard> jonatack: I don't see whagt gdb would discover there

13:32

<ariard> next question : Why does the lock order need to be inverted all at once?

13:32

<ariard> (easy one)

13:33

<jkczyz> otherwise there could be a deadlock

13:33

<jonatack> ariard: to check values, as you mentioned verifying values returned. It's interesting to see how others debug bitcoin core. For example, i've learned a lot from watching achow101's live coding sessions on twitch tv

13:34

<ariard> jkczyz: correct

13:34

<jonatack> as soon as he gets stuck, he often jumps right into gdb.

13:34

<ariard> jonatack: what were your biggest learnings ?

100

13:35

<ariard> jonatack: honestly for debugging it's more adding more printers and take pen and papers and think what I got wrong

101

13:35

<ariard> like drawing control flow, see where there is write/read on the modified data structures

102

13:36

<ariard> depend what kind of bugs you're tracking but IMO gdb is overkill for a project like bitcoin core

103

13:36

<jonatack> ariard: from watching achow101 (so far): compiler-driven debugging, gdb use, inspecting wallet files using Berkeley DB special commands (located in db4/bin), using flame graphs.

104

13:36

<ariard> jonatack: yes we should all keep an eye on flame graph when we modify part of the code which touch performance

105

13:37

<jonatack> ariard: yes, i tend to begin with prints and asserts but have not found myself deep in the weeds yet in my PRs since they have been simple ones

106

13:37

<jonatack> so more for reviewing others' PRs

107

13:37

<ariard> but I think how you're debugging is also super related to what part of codebase you're working on, P2P != wallet

108

13:38

<ariard> jonatack: when reviewing others' PRs I like to go through modified changes and tweak them to see if there is a better way of doing it

109

13:38

<_andrewtoth_> jonatack: what is compiler-driven debugging?

110

13:38

<ariard> that force you to understand the code

111

13:38

<jonatack> ariard: yes, i noticed that you have always been assidious and good wrt drawing control flow

112

13:39

<jonatack> assiduous*

113

13:39

<jonatack> tweaking others' code is a great way to go +1

114

13:40

<jonatack> compiler-driven debugging: relying on compiler errors to see what's not working and fix it

115

13:40

<ariard> next question: Do you expect this PR to have any impact on performace? But we already talk a bit about this, so let's readjust the question how could we make wallet code better to service concurrently multiple RPC calls ?

116

13:41

<jkczyz> moving the code to a more modular state will also ease unit testing

117

13:42

<jkczyz> ariard: perhaps using a per-wallet lock would help

118

13:42

<ariard> jkczyz: ofc I think we were considering with Marco to disable block writing just to fuzz p2p/validation stack without burning your disks

119

13:42

<jonatack> yes. could enable shifting more testing from functional to unit (faster), or better coverage

120

13:43

<ariard> jkczyz: it's already the case

121

13:43

<jkczyz> ah, didn't realize that

122

13:44

<jonatack> and modularity could help improve fuzzing and PBT testing as well

123

13:44

<ariard> jkczyz: I think the way of thinking what are the main data structures of the wallet, like coins, keys, address, what can be accessed concurrently and for what we need a synchronized view

124

13:44

<ariard> like you don't want both RPC building a transaction spending the same coin

125

13:45

<ariard> and you don't want to return same address twice

126

13:45

<ariard> but all read rpc like displaying coins received could be concurrent

127

13:45

<ariard> jonatack: PBT testing ?

128

13:46

<jonatack> ariard: property-based testing

129

13:46

<ariard> a bit off-topic but having more fine-grained lock in wallet is an interesting topic to dig in

130

13:46

<ariard> if people are looking for cool PRs to work on (I don't plan to do this)

131

13:47

<ariard> next question: Can you think of any other way to remove the cs_main lock from the wallet code?

132

13:47

<jkczyz> I see what you are saying. I was thinking of different wallets. For the same wallet, should concurrent RPCs that "write" be disallowed then? Or at least while holding a higer-level lock?

133

13:47

<jonatack> (chris stewart has a large PR to add a bunch of PBT that needs review and we ought to add PBT coverage for instance to the BIP157-158 changes, there may be low-hanging fruit there)

134

13:48

<jkczyz> (or rather should require holding a higher-level lock)

135

13:49

<ariard> jkczyz: "should concurrent RPcs that "write" be disallowed then" I think you should have a cs_coins lock to hold to avoid concurrent write

136

13:49

<ariard> so yes should require holding a higher-level lock but not necessary the actual cs_wallet one

137

13:49

<ariard> jonatack: what the PR already, would try to review it

138

13:51

<jonatack> ariard: https://github.com/bitcoin/bitcoin/pull/14430

139

13:51

<jonatack> "Add more property based tests for basic bitcoin data structures"

140

13:51

<jonatack> it's on my review list too

141

13:52

<jonatack> ariard: "having more fine-grained lock in wallet"... interesting

142

13:53

<ariard> jonatack: yes you need #16426, because without that would be useless

143

13:53

<jonatack> ariard: Do you envision other building other changes on this PR in addition to those you describe in your PR body?

144

13:53

<ariard> next question: What further changes could be build on top of this PR?

145

13:54

<jonatack> snap! :D

146

13:55

<ariard> jonatack: okay I have some changes building on top of it on the rescan logic

147

13:55

<ariard> well it's not exactly on top of it but it's going to be easier to review without having to care about cs_main

148

13:56

<ariard> and more generally drying-up the Chain interface until it makes sense to have other types of client using it

149

13:56

<ariard> like lightning daemons ;)

150

13:57

<jonatack> afaict #16426 should get in soon if no bugs are found during review, better to merge it not too late in this release cycle to have time to catch anything unexpected

151

13:57

<ariard> jonatack: it's really biased but yes that's a blocker to more broader improvements both on the wallet-side on interface-sides

152

13:57

<ariard> and many of them can be worked in parallel after it

153

13:59

<ariard> and the Chain interface itself maybe should be splitted in different ones, like at least one for fee estimation

154

14:00

<ariard> #endmeeting

155

14:00

<jonatack> ariard: interesting

156

14:01

<_andrewtoth_> ariard: thanks!

157

14:01

<ariard> okay thanks all for participation, will hang here a bit more if you have questions, and once #16426 merged I will open issues to keep track what can be do as follow-ups

158

14:01

<jonatack> ariard: about lightning, fwiw i'll ping you soon to talk about the anchor outputs RFC PR 688

159

14:01

<jonatack> thanks ariard and everyone!

160

14:02

<mattcruzz> ariard: Thanks for hosting, there's some interesting topics I need dig into now

161

14:02

<ariard> yes to everyone don't censor yourself for reviewing, everyone is doing mistake and you're not going to break anything by doing a review

162

14:02

<ariard> quite the contrary :)

163

14:04

<jonatack> ariard: perhaps add a roadmap about your ideas in https://github.com/bitcoin/bitcoin/projects/ (or as part of https://github.com/bitcoin/bitcoin/projects/10)

164

14:04

<jonatack> "Process separation"

165

14:05

<jonatack> and very good point about not censoring yourself and reviewing!

166

14:06

<fanquake> ariard if you want open a follow ups issue in advance of that being merged, feel free. Better to capture all the issues etc now, rather than digging through comments post-merge.

167

14:06

<ariard> fanquake: one issue per-idea or big issues and listing everything ?

168

14:08

<fanquake> 1 issue per big idea/refactor is probably fine. If there's nits/small follows up, probably best to combine them all into a single issue.

169

14:09

<fanquake> Main point is just not to lose PR commentary/things that should be done as follow ups post merge.