Net_processing: lock clean up (p2p, refactoring)

Jun 2, 2021

https://github.com/bitcoin/bitcoin/pull/21527

Host: jnewbery - PR author: ajtowns

The PR branch HEAD was d6dfa5977a at the time of this review club meeting.

Notes

PR 21527 continues the work of better encapsulating net_processing that ajtowns has been tracking in WIP PR 20758.
The previous PR in the series (PR 21148) moved most of the orphan processing logic out of net_processing into its own subcomponent, txorphanage. Splitting self-contained units of logic into their own classes/translation units makes it easier to test that logic in isolation. Enforcing a well-defined interface to the component also makes it easier to reason about how the code will behave.
This PR does two things:
- introduces an m_mutex_message_handling mutex in PeerManager, which guards data that is used in PeerManager’s message handling. Doing that allows us to remove the cs_sendProcessing mutex from net.
- removes the g_cs_orphans mutex from net_processing and adds an internal mutex to TxOrphanage, making it responsible for its own locking and thread safety.

Questions

Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK? What was your review approach?
Does this PR change observable behaviour of the node in any way?
What was the cs_sendProcessing mutex responsible for before this PR? When was cs_sendProcessing introduced? (Hint use git blame and git log -S "string" to search back through previous commits).
This PR moves one mutex from net to net_processing (CNode.cs_sendProcessing is replaced by PeerManager.m_mutex_message_handler) and one mutex from net_processing to txorphanage (g_cs_orphans is replaced by TxOrphanage.m_mutex). What are the benefits of moving these global/externally visible mutexes to being defined internally in their respective classes?
What are vExtraTxnForCompact and vExtraTxnForCompactIt? Why is it ok to stop guarding them with g_cs_orphans and guard them with m_mutex_message_handling instead?
This PR removes the Peer.m_orphan_work_set member and replaces it with a m_peer_work_set map in TxOrphanage. What is the peer orphan work set used for?
(Bonus question) This PR originally included some behavioural changes in the way that orphans are handled. Those have now been moved from this PR to a separate branch and may be proposed as a follow-up PR. What are those behaviour changes?

Meeting Log

17:00

<jnewbery> #startmeeting

17:00

<glozow> hi

17:00

<jnewbery> Hi folks! Welcome to review club. Feel free to say hi to let everyon know you're here.

17:00

<jnewbery> Also feel free to not say hi and just hang out :)

17:00

<michaelfolkson> hi

17:00

<glozow> crap. can i still hang out?

17:00

<hernanmarino> hi !

17:00

<marqusat> hi

17:00

<emzy> hi

17:01

<jnewbery> you can {say|not say} hi and {hang out|not hang out}

17:01

<jnewbery> Is anyone here for the first time?

17:01

<lightlike> hi

17:01

<michaelfolkson> Is anyone in Miami?

17:01

<jnewbery> Today we're going to be looking at PR 21527: "Net_processing: lock clean up". Notes and questions are in the normal place: https://bitcoincore.reviews/21527

17:02

<svav> Hi

17:02

<jnewbery> I'll be asking questions from https://bitcoincore.reviews/21527 to guide the conversation, but feel free to jump in at any point and ask questions if anything's unclear

17:03

<jnewbery> ok, who had a chance to review the PR this week (y/n)?

17:03

<hernanmarino> I didn't have much time to read de PR notes today, but i would like to hang out here and read the discussion anyway :)

17:03

<emzy> n

17:03

<michaelfolkson> y

17:03

<svav> y

17:03

<jnewbery> hernanmarino: you're very welcome to!

17:03

<lightlike> n

17:04

<marqusat> .5y

17:04

<jnewbery> ok, let's get started. For those who reviewed the PR, can you give a short summary of the changes. How did you go about reviewing the PR?

17:06

<svav> continues the work of better encapsulating net_processing

17:06

<jnewbery> svav: yes indeed!

17:07

<svav> It is like a code cleanup operation

17:07

<svav> Which will help with clarity and performance

17:07

<svav> It is moving a few thinks relating to mutex locking

17:07

<svav> *things*

17:08

<jnewbery> svav: I agree with the clarity part. This PR shouldn't change performance, or maybe you're thinking of work that could be built on top of this PR?

17:08

<glozow> what could be built on top? 👀

17:08

<svav> It was a slight newbie guess because no-one else was answering

17:08

<michaelfolkson> Reducing globals, attempting to improve modularity or at least loosening ties

17:09

<jnewbery> svav: It was a good guess!

17:09

<michaelfolkson> There is kinda the clear separation of net and net_processing and also cleaning up within net_processing

17:09

<svav> but i see things like #include <txorphanage.h> being moved out of init.cpp where they are not needed

17:10

<jnewbery> Currently, the critical paths in Bitcoin Core are mostly single-threaded, and most of the important state is guarded by a single mutex called cs_main

17:10

<glozow> oo, so compilation performance

17:10

<jnewbery> if we can break that mutex up, it opens up the possibility of having a bit more concurrency, for example, having a validation thread independent from net_processing

17:11

<hernanmarino> that would be great

17:11

<jnewbery> michaelfolkson: yes! Better modularity and decoupling is a goal

17:11

<glozow> o so right now, if we receive a TX and validate it, the message processing thread sits there waiting for ATMP to finish?

17:12

<svav> For the newbies, can you define the difference between net and net_processing?

17:12

<michaelfolkson> And so the (interesting) discussion with vasild was a disagreement on stepping stones towards that goal? There are trade-offs depending on which stepping stones you pick?

17:12

<jnewbery> glozow: correct! If there was a separate thread for validation, the net_processing thread could start serving other peers while validation is validating the tx

17:12

<michaelfolkson> svav: https://bitcoin.stackexchange.com/questions/106751/what-does-net-processing-cover-conceptually-in-the-bitcoin-core-codebase

17:13

<ben3223> net is lower level that net_processing, which handles the protocol messages

17:13

<glozow> jnewbery has an issue for net vs net_processing: https://github.com/bitcoin/bitcoin/issues/19398

17:13

<ben3223> net -> network level / net_processing -> message protocol handling

17:13

<michaelfolkson> svav: Net handles P2P messages. Net processing works whether it needs to go to validation or not

17:13

<michaelfolkson> *works out

17:14

<jnewbery> svav: great question. The "net" layer is responsible for opening and maintaining connections to peers, and reading/writing messages to the wire. The main class in that layer is CConnman. The "net processing" layer sits above that, deserializes the p2p messages and then maybe passes them up to validation to be validated

17:14

<jnewbery> ben3223: michaelfolkson: correct!

17:15

<jnewbery> michaelfolkson: yes, the discussion with vasild was about approach rather than concept I think

17:15

<jnewbery> ok, next question: Does this PR change observable behaviour of the node in any way?

17:16

<svav> I'm guessing it shouldn't

17:16

<michaelfolkson> I'm also guessing that

17:16

<hernanmarino> i didn't read the code, but my guess is it doesn't

17:16

<marqusat> Does not seem so

17:17

<ben3223> I think it doesn't

17:17

<emzy> I hope not.

17:17

<michaelfolkson> I did look at the tx orphanage and that appears to have the same orphan processing logic as before (from a cursory glance)

17:17

<svav> It is just moving mutexes to better places, so should not affect behaviour

17:18

<jnewbery> that's right. This PR should be a refactor only. If you think it changes observable behaviour, you should speak up!

17:18

<jnewbery> The first version of the PR had some behavioural changes. We can talk about those changes at the end if we have time

17:18

<jnewbery> ok, next question. What was the cs_sendProcessing mutex responsible for before this PR? When was cs_sendProcessing introduced? (Hint use git blame and git log -S "string" to search back through previous commits).

17:19

<jnewbery> This is a good test of your git skills

17:20

<michaelfolkson> Ha there were a few movements before you got to the creator

17:20

<michaelfolkson> glozow's many layers of git blame

17:20

<jnewbery> Hint: run `git log -S cs_sendProcessing` in your git repo and scroll down to the bottom to see where the mutex was added

17:21

<michaelfolkson> I got a Matt PR in Jan 2017 I think

17:21

* michaelfolkson checks notes

17:21

<glozow> don't think i've touched `cs_sendProcessing` 🤔

17:21

<jnewbery> anyone get anything different?

17:23

<jnewbery> ok, if I run it, I get commit d7c58ad514ee00db00589216166808258bc16b60 from Dec 2016

17:23

<jnewbery> "Split CNode::cs_vSend: message processing and message sending"

17:23

<sipa> i don't think i've ever used `git log -S`... i should

17:24

<michaelfolkson> What's the PR number?

17:24

<hernanmarino> jnewbery: i got that too

17:24

<michaelfolkson> I had #9535

17:24

<glozow> #9535 is what i got as well

17:24

<jnewbery> being able to move back and forward through the git history is really helpful for understanding the historic context for changes.

17:25

<jnewbery> Yes, PR #9535 is where it was introduced

17:26

<jnewbery> I have a few tools that help me navigate the history of the repo. `git log -S` is one of them. The fugitive plugin for vim is another great tool

17:26

<jnewbery> it lets you walk back through the commits and see how things change over time

17:26

<ben3223> Did it replace cs_vSend ?

17:26

<michaelfolkson> Hmm nice. Always interested in a Vim plugin recommendation

17:27

<glozow> ben3223: just split it

17:27

<jnewbery> and then using a custom search engine in my browser https://github.com/bitcoin/bitcoin/search?q=%s to lookup which PR a commit was merged in

17:28

<jnewbery> ben3223: exactly - the PR split cs_vSend into two mutexes (mutices?)

17:28

<jnewbery> ok, next question

17:28

<jnewbery> This PR moves one mutex from net to net_processing (CNode.cs_sendProcessing is replaced by PeerManager.m_mutex_message_handler) and one mutex from net_processing to txorphanage (g_cs_orphans is replaced by TxOrphanage.m_mutex). What are the benefits of moving these global/externally visible mutexes to being defined internally in their respective classes?

17:29

<michaelfolkson> Makes it easier to test, safer for future changes?

17:30

<marqusat> Information hiding, harder to make a mistake and acquire a mutex when it shouldn’t be acquired and cause a deadlock.

17:30

<michaelfolkson> "Easier to reason about" is always the right answer :)

17:31

<jnewbery> michaelfolkson marqusat: Yes!

17:31

<svav> You will enable more concurrency, although what exactly I'm not sure!!

100

17:31

<jnewbery> by making the locking internal, you hide those requirements from the caller

101

17:31

<glozow> it'd be nice for modules to make themselves thread-safe instead of requiring others to grab locks all the time

102

17:32

<jnewbery> glozow: +1

103

17:32

<jnewbery> Next question. What are vExtraTxnForCompact and vExtraTxnForCompactIt? Why is it ok to stop guarding them with g_cs_orphans and guard them with m_mutex_message_handling instead?

104

17:32

<svav> The code will be easier to maintain, because you will have thread opening and closing in the same place, so less chance of messing up locking.

105

17:35

<jnewbery> svav: I think almost all of our locking uses RAII, so it's quite hard to mess up locking. The lock lasts as long as the std::unique_lock stays in scope (and that's wrapped by the LOCK() macro)

106

17:35

<jnewbery> We also have lock annotations now, which give us a bit more confidence, since the compiler enforces that the right locks are held when entering functions/accessing data

107

17:35

<michaelfolkson> Not quite sure on vExtraTxnForCompact. An extra transaction for compact block reconstruction whatever that means

108

17:36

<glozow> `vExtraTxnForCompact` = transactions you conflicted out, but might see in a block

109

17:36

<jnewbery> but yes, I agree that if the mutex is not exposed at all, there's no way for a caller to get locking wrong (since it's not responsible for locking)

110

17:36

<marqusat> Orphan/conflicted/etc transactions that are kept for compact block reconstruction, private in PeerManagerImpl which uses its internal mutex

111

17:36

<glozow> if you have compact blocks enabled, your peers won't send you those txns in the block?

112

17:36

<glozow> so you'll want to have them somewhere in case you need them?

113

17:37

<sipa> glozow: that's not how compact blocks work; the sender sends you a compact block with compact tx hashes; you then request the transactions which you miss

114

17:37

<jnewbery> glozow: very good! Yes, they're extra transactions that can be used for compact block reconstruction.

115

17:37

<sipa> glozow: by keeping conflicted/orphan transactions around for a bit, they are available for reconstruction, so you won't have to ask for them if they show up in a compact block

116

17:38

<jnewbery> sipa: is it different between high bandwidth/low bandwidth mode?

117

17:38

<sipa> jnewbery: no, high bandwidth just drops a roundtrip (you don't need to ask for the compact block after announcement; they announce through a compact block directly)

118

17:39

<jnewbery> sipa: thanks!

119

17:39

<jnewbery> https://github.com/bitcoin/bips/blob/master/bip-0152.mediawiki#Intended_Protocol_Flow for reference

120

17:39

<glozow> sipa: ah okay, got it

121

17:39

<glozow> for some reason i thought they might go by inventory

122

17:40

<jnewbery> ok, so part two of that question: Why is it ok to stop guarding them with g_cs_orphans and guard them with m_mutex_message_handling instead?

123

17:40

<sipa> (i have not reviewed the PR, i am guessing) there is no strict requirement that they are in sync with the rest of the orphan data structure

124

17:40

<jnewbery> sipa: yes!

125

17:41

<svav> Because they now have their own mutex lock and unlock in TXOrphanage which can be dealt with by m_mutex_message_handling

126

17:41

<jnewbery> svav: what are you referring to as they?

127

17:42

<sipa> right, locking achieves two things: multithreaded r/w access (you're instantly UB if you don't protect variabled read/written by multiple threads with some form of locking/synchronization)

128

17:42

<sipa> but it also achieves consistency; by putting multiple variables under the same lock, they will always be in sync from the perspective of other threads

129

17:42

<svav> jnewbury: vExtraTxnForCompact and vExtraTxnForCompactIt

130

17:42

<sipa> if consistency across certain variables isn't required, they can by protected by distinct locks

131

17:43

<svav> jnewbery: vExtraTxnForCompact and vExtraTxnForCompactIt

132

17:44

<jnewbery> sipa: exactly right. The class has certain invariants that must always be true for callers. We can lock a mutex, and then temporarily violate those invariants during processing, before restoring the invariants and releasing the lock.

133

17:44

<jnewbery> if there are no logical invariants between separate data, they don't need to be guarded by the same lock

134

17:44

<michaelfolkson> So there is always this trade-off re locks between few locks (guarantees re being in sync) versus many locks (freedom of concerns)

135

17:45

<sipa> right, and the consistency requirements for vExtraTxnForCompact etc are basically "whatever"

136

17:45

<jnewbery> sipa: yes

137

17:46

<jnewbery> svav: ah, I was a little confused because you said they have mutex lock and unlock in TXOrphange. vExtraTxnForCompact and vExtraTxnForCompactIt are not in TXOrphange

138

17:47

<jnewbery> you're right that they're not protected under a different lock (m_mutex_message_handling), which prevents multiple threads accessing them concurrently

139

17:47

<jnewbery> ok, next question. This PR removes the Peer.m_orphan_work_set member and replaces it with a m_peer_work_set map in TxOrphanage. What is the peer orphan work set used for?

140

17:49

<michaelfolkson> Didn't get this far, don't know

141

17:50

<glozow> peer -> orphans to be considered, for which the parents were provided by this peer?

142

17:50

<glozow> or is it that they provided the orphan?

143

17:50

<jnewbery> glozow: right first time. The parents were provided by this peer

144

17:50

<glozow> so you could have multiple entries if the parents were provided by multiple peers?

145

17:51

<glozow> er, multiple parents by different peers

146

17:51

<jnewbery> so if we receive a transaction where we don't have its inputs in our UTXO set or mempool from peer A, and then peer B provides the missing parent and it gets accepted to our mempool, whose orphan work set will the orphan go into?

147

17:51

<glozow> B?

148

17:52

<jnewbery> glozow: right. We'll validate the orphan transaction next time we go through the ProcessMessages loop for peer B.

149

17:53

<jnewbery> and if the orphan transaction was consensus-invalid, which peer would we punish?

150

17:53

<glozow> and punish B if it's a bad orphan?

151

17:53

<glozow> ha oops. B?

152

17:54

<jnewbery> glozow: that seems unfair. Peer B might not even know about the orphan

153

17:54

<glozow> yeah peer B's just parenting hey

154

17:54

<michaelfolkson> Punishment seems challenging/impossible with orphans

155

17:55

<jnewbery> no, we won't punish the peer that provided the parent. That'd be a good way to allow a malicious actor to get peers to disconnect other peers.

156

17:56

<jnewbery> we remember who provided the orphan (https://github.com/bitcoin/bitcoin/blob/1186910b6b7ba7c7e5193c76f33f25825e6cc0b7/src/txorphanage.h#L53), so if it's bad we'll punish the peer that sent us the orphan

157

17:56

<jnewbery> ok, final question. (Bonus question) This PR originally included some behavioural changes in the way that orphans are handled. Those have now been moved from this PR to a separate branch and may be proposed as a follow-up PR. What are those behaviour changes?

158

17:58

<jnewbery> alright, maybe one for people to go and investigate for themselves.

159

17:58

<michaelfolkson> A lot of commits on that branch

160

17:58

<jnewbery> the branch is here: https://github.com/ajtowns/bitcoin/commits/202104-whohandlesorphans

161

17:58

<michaelfolkson> Something about who handles orphans :)