Net_processing: lock clean up (p2p, refactoring)

https://github.com/bitcoin/bitcoin/pull/21527

Host: jnewbery  -  PR author: ajtowns

The PR branch HEAD was d6dfa5977a at the time of this review club meeting.

Notes

  • PR 21527 continues the work of better encapsulating net_processing that ajtowns has been tracking in WIP PR 20758.

  • The previous PR in the series (PR 21148) moved most of the orphan processing logic out of net_processing into its own subcomponent, txorphanage. Splitting self-contained units of logic into their own classes/translation units makes it easier to test that logic in isolation. Enforcing a well-defined interface to the component also makes it easier to reason about how the code will behave.

  • This PR does two things:

    • introduces an m_mutex_message_handling mutex in PeerManager, which guards data that is used in PeerManager’s message handling. Doing that allows us to remove the cs_sendProcessing mutex from net.

    • removes the g_cs_orphans mutex from net_processing and adds an internal mutex to TxOrphanage, making it responsible for its own locking and thread safety.
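To make the shape of these changes concrete, here is a minimal sketch of the pattern behind the second change, the one applied to TxOrphanage: the class owns a private mutex and takes it inside its public methods, instead of relying on a global mutex (g_cs_orphans) that every caller must remember to lock. Class, member, and type names below are illustrative only, not the PR's actual code.

```cpp
// Simplified sketch only: stands in for TxOrphanage after the change, where the
// old global g_cs_orphans is replaced by a private mutex taken inside each method.
#include <map>
#include <mutex>
#include <utility>

struct OrphanEntry { /* transaction data, announcing peer, expiry time, ... */ };

class OrphanageSketch
{
public:
    void AddTx(int wtxid, OrphanEntry entry)  // the real code keys by uint256, not int
    {
        std::lock_guard<std::mutex> lock{m_mutex};  // locking is now an internal detail
        m_orphans.emplace(wtxid, std::move(entry));
    }

    bool HaveTx(int wtxid) const
    {
        std::lock_guard<std::mutex> lock{m_mutex};
        return m_orphans.count(wtxid) > 0;
    }

private:
    mutable std::mutex m_mutex;              // replaces the externally visible g_cs_orphans
    std::map<int, OrphanEntry> m_orphans;    // only accessed with m_mutex held
};
```

Callers no longer need to know which mutex to take; they call the public methods and the class guarantees its own thread safety. The first change follows the same idea: net no longer takes cs_sendProcessing around message handling, because PeerManager guards that data with a mutex it manages internally.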

Questions

  1. Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK? What was your review approach?

  2. Does this PR change observable behaviour of the node in any way?

  3. What was the cs_sendProcessing mutex responsible for before this PR? When was cs_sendProcessing introduced? (Hint: use git blame and git log -S "string" to search back through previous commits).

  4. This PR moves one mutex from net to net_processing (CNode.cs_sendProcessing is replaced by PeerManager.m_mutex_message_handling) and one mutex from net_processing to txorphanage (g_cs_orphans is replaced by TxOrphanage.m_mutex). What are the benefits of moving these global/externally visible mutexes to being defined internally in their respective classes?

  5. What are vExtraTxnForCompact and vExtraTxnForCompactIt? Why is it ok to stop guarding them with g_cs_orphans and guard them with m_mutex_message_handling instead?

  6. This PR removes the Peer.m_orphan_work_set member and replaces it with an m_peer_work_set map in TxOrphanage. What is the peer orphan work set used for?

  7. (Bonus question) This PR originally included some behavioural changes in the way that orphans are handled. Those have now been moved from this PR to a separate branch and may be proposed as a follow-up PR. What are those behaviour changes?

Meeting Log

  17:00 <jnewbery> #startmeeting
  17:00 <glozow> hi
  17:00 <jnewbery> Hi folks! Welcome to review club. Feel free to say hi to let everyon know you're here.
  17:00 <jnewbery> Also feel free to not say hi and just hang out :)
  17:00 <michaelfolkson> hi
  17:00 <glozow> crap. can i still hang out?
  17:00 <hernanmarino> hi !
  17:00 <marqusat> hi
  17:00 <emzy> hi
  17:01 <jnewbery> you can {say|not say} hi and {hang out|not hang out}
  17:01 <jnewbery> Is anyone here for the first time?
  17:01 <lightlike> hi
  17:01 <michaelfolkson> Is anyone in Miami?
  17:01 <jnewbery> Today we're going to be looking at PR 21527: "Net_processing: lock clean up". Notes and questions are in the normal place: https://bitcoincore.reviews/21527
  17:02 <svav> Hi
  17:02 <jnewbery> I'll be asking questions from https://bitcoincore.reviews/21527 to guide the conversation, but feel free to jump in at any point and ask questions if anything's unclear
  17:03 <jnewbery> ok, who had a chance to review the PR this week (y/n)?
  17:03 <hernanmarino> I didn't have much time to read de PR notes today, but i would like to hang out here and read the discussion anyway :)
  17:03 <emzy> n
  17:03 <michaelfolkson> y
  17:03 <svav> y
  17:03 <jnewbery> hernanmarino: you're very welcome to!
  17:03 <lightlike> n
  17:04 <marqusat> .5y
  17:04 <jnewbery> ok, let's get started. For those who reviewed the PR, can you give a short summary of the changes. How did you go about reviewing the PR?
  17:06 <svav> continues the work of better encapsulating net_processing
  17:06 <jnewbery> svav: yes indeed!
  17:07 <svav> It is like a code cleanup operation
  17:07 <svav> Which will help with clarity and performance
  17:07 <svav> It is moving a few thinks relating to mutex locking
  17:07 <svav> *things*
  17:08 <jnewbery> svav: I agree with the clarity part. This PR shouldn't change performance, or maybe you're thinking of work that could be built on top of this PR?
  17:08 <glozow> what could be built on top? 👀
  17:08 <svav> It was a slight newbie guess because no-one else was answering
  17:08 <michaelfolkson> Reducing globals, attempting to improve modularity or at least loosening ties
  17:09 <jnewbery> svav: It was a good guess!
  17:09 <michaelfolkson> There is kinda the clear separation of net and net_processing and also cleaning up within net_processing
  17:09 <svav> but i see things like #include <txorphanage.h> being moved out of init.cpp where they are not needed
  17:10 <jnewbery> Currently, the critical paths in Bitcoin Core are mostly single-threaded, and most of the important state is guarded by a single mutex called cs_main
  17:10 <glozow> oo, so compilation performance
  17:10 <jnewbery> if we can break that mutex up, it opens up the possibility of having a bit more concurrency, for example, having a validation thread independent from net_processing
  17:11 <hernanmarino> that would be great
  17:11 <jnewbery> michaelfolkson: yes! Better modularity and decoupling is a goal
  17:11 <glozow> o so right now, if we receive a TX and validate it, the message processing thread sits there waiting for ATMP to finish?
  17:12 <svav> For the newbies, can you define the difference between net and net_processing?
  17:12 <michaelfolkson> And so the (interesting) discussion with vasild was a disagreement on stepping stones towards that goal? There are trade-offs depending on which stepping stones you pick?
  17:12 <jnewbery> glozow: correct! If there was a separate thread for validation, the net_processing thread could start serving other peers while validation is validating the tx
  17:12 <michaelfolkson> svav: https://bitcoin.stackexchange.com/questions/106751/what-does-net-processing-cover-conceptually-in-the-bitcoin-core-codebase
  17:13 <ben3223> net is lower level that net_processing, which handles the protocol messages
  17:13 <glozow> jnewbery has an issue for net vs net_processing: https://github.com/bitcoin/bitcoin/issues/19398
  17:13 <ben3223> net -> network level / net_processing -> message protocol handling
  17:13 <michaelfolkson> svav: Net handles P2P messages. Net processing works whether it needs to go to validation or not
  17:13 <michaelfolkson> *works out
  17:14 <jnewbery> svav: great question. The "net" layer is responsible for opening and maintaining connections to peers, and reading/writing messages to the wire. The main class in that layer is CConnman. The "net processing" layer sits above that, deserializes the p2p messages and then maybe passes them up to validation to be validated
  17:14 <jnewbery> ben3223: michaelfolkson: correct!
  17:15 <jnewbery> michaelfolkson: yes, the discussion with vasild was about approach rather than concept I think
  17:15 <jnewbery> ok, next question: Does this PR change observable behaviour of the node in any way?
  17:16 <svav> I'm guessing it shouldn't
  17:16 <michaelfolkson> I'm also guessing that
  17:16 <hernanmarino> i didn't read the code, but my guess is it doesn't
  17:16 <marqusat> Does not seem so
  17:17 <ben3223> I think it doesn't
  17:17 <emzy> I hope not.
  17:17 <michaelfolkson> I did look at the tx orphanage and that appears to have the same orphan processing logic as before (from a cursory glance)
  17:17 <svav> It is just moving mutexes to better places, so should not affect behaviour
  17:18 <jnewbery> that's right. This PR should be a refactor only. If you think it changes observable behaviour, you should speak up!
  17:18 <jnewbery> The first version of the PR had some behavioural changes. We can talk about those changes at the end if we have time
  17:18 <jnewbery> ok, next question. What was the cs_sendProcessing mutex responsible for before this PR? When was cs_sendProcessing introduced? (Hint use git blame and git log -S "string" to search back through previous commits).
  17:19 <jnewbery> This is a good test of your git skills
  17:20 <michaelfolkson> Ha there were a few movements before you got to the creator
  17:20 <michaelfolkson> glozow's many layers of git blame
  17:20 <jnewbery> Hint: run `git log -S cs_sendProcessing` in your git repo and scroll down to the bottom to see where the mutex was added
  17:21 <michaelfolkson> I got a Matt PR in Jan 2017 I think
  17:21 * michaelfolkson checks notes
  17:21 <glozow> don't think i've touched `cs_sendProcessing` 🤔
  17:21 <jnewbery> anyone get anything different?
  17:23 <jnewbery> ok, if I run it, I get commit d7c58ad514ee00db00589216166808258bc16b60 from Dec 2016
  17:23 <jnewbery> "Split CNode::cs_vSend: message processing and message sending"
  17:23 <sipa> i don't think i've ever used `git log -S`... i should
  17:24 <michaelfolkson> What's the PR number?
  17:24 <hernanmarino> jnewbery: i got that too
  17:24 <michaelfolkson> I had #9535
  17:24 <glozow> #9535 is what i got as well
  17:24 <jnewbery> being able to move back and forward through the git history is really helpful for understanding the historic context for changes.
  17:25 <jnewbery> Yes, PR #9535 is where it was introduced
  17:26 <jnewbery> I have a few tools that help me navigate the history of the repo. `git log -S` is one of them. The fugitive plugin for vim is another great tool
  17:26 <jnewbery> it lets you walk back through the commits and see how things change over time
  17:26 <ben3223> Did it replace cs_vSend ?
  17:26 <michaelfolkson> Hmm nice. Always interested in a Vim plugin recommendation
  17:27 <glozow> ben3223: just split it
  17:27 <jnewbery> and then using a custom search engine in my browser https://github.com/bitcoin/bitcoin/search?q=%s to lookup which PR a commit was merged in
  17:28 <jnewbery> ben3223: exactly - the PR split cs_vSend into two mutexes (mutices?)
  17:28 <jnewbery> ok, next question
  17:28 <jnewbery> This PR moves one mutex from net to net_processing (CNode.cs_sendProcessing is replaced by PeerManager.m_mutex_message_handler) and one mutex from net_processing to txorphanage (g_cs_orphans is replaced by TxOrphanage.m_mutex). What are the benefits of moving these global/externally visible mutexes to being defined internally in their respective classes?
  17:29 <michaelfolkson> Makes it easier to test, safer for future changes?
  17:30 <marqusat> Information hiding, harder to make a mistake and acquire a mutex when it shouldn’t be acquired and cause a deadlock.
  17:30 <michaelfolkson> "Easier to reason about" is always the right answer :)
  17:31 <jnewbery> michaelfolkson marqusat: Yes!
  17:31 <svav> You will enable more concurrency, although what exactly I'm not sure!!
  17:31 <jnewbery> by making the locking internal, you hide those requirements from the caller
  17:31 <glozow> it'd be nice for modules to make themselves thread-safe instead of requiring others to grab locks all the time
  17:32 <jnewbery> glozow: +1
  17:32 <jnewbery> Next question. What are vExtraTxnForCompact and vExtraTxnForCompactIt? Why is it ok to stop guarding them with g_cs_orphans and guard them with m_mutex_message_handling instead?
  17:32 <svav> The code will be easier to maintain, because you will have thread opening and closing in the same place, so less chance of messing up locking.
  17:35 <jnewbery> svav: I think almost all of our locking uses RAII, so it's quite hard to mess up locking. The lock lasts as long as the std::unique_lock stays in scope (and that's wrapped by the LOCK() macro)
  17:35 <jnewbery> We also have lock annotations now, which give us a bit more confidence, since the compiler enforces that the right locks are held when entering functions/accessing data
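A tiny sketch of the RAII pattern jnewbery describes above, written with plain standard-library types rather than Bitcoin Core's LOCK() macro and sync.h annotations (those are only referenced in the comments):

```cpp
// Sketch only: Bitcoin Core wraps this pattern in the LOCK() macro and adds
// Clang thread-safety annotations (GUARDED_BY, EXCLUSIVE_LOCKS_REQUIRED) so the
// compiler can warn when data is touched without its mutex held.
#include <mutex>
#include <vector>

std::mutex g_example_mutex;
std::vector<int> g_example_data;  // conceptually "GUARDED_BY(g_example_mutex)"

void AppendValue(int value)
{
    std::lock_guard<std::mutex> lock{g_example_mutex};  // acquired here...
    g_example_data.push_back(value);
}  // ...and released automatically when `lock` goes out of scope, even on an exception
```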
  17:35 <michaelfolkson> Not quite sure on vExtraTxnForCompact. An extra transaction for compact block reconstruction whatever that means
  17:36 <glozow> `vExtraTxnForCompact` = transactions you conflicted out, but might see in a block
  17:36 <jnewbery> but yes, I agree that if the mutex is not exposed at all, there's no way for a caller to get locking wrong (since it's not responsible for locking)
  17:36 <marqusat> Orphan/conflicted/etc transactions that are kept for compact block reconstruction, private in PeerManagerImpl which uses its internal mutex
  17:36 <glozow> if you have compact blocks enabled, your peers won't send you those txns in the block?
  17:36 <glozow> so you'll want to have them somewhere in case you need them?
  17:37 <sipa> glozow: that's not how compact blocks work; the sender sends you a compact block with compact tx hashes; you then request the transactions which you miss
  17:37 <jnewbery> glozow: very good! Yes, they're extra transactions that can be used for compact block reconstruction.
  17:37 <sipa> glozow: by keeping conflicted/orphan transactions around for a bit, they are available for reconstruction, so you won't have to ask for them if they show up in a compact block
  17:38 <jnewbery> sipa: is it different between high bandwidth/low bandwidth mode?
  17:38 <sipa> jnewbery: no, high bandwidth just drops a roundtrip (you don't need to ask for the compact block after announcement; they announce through a compact block directly)
  17:39 <jnewbery> sipa: thanks!
  17:39 <jnewbery> https://github.com/bitcoin/bips/blob/master/bip-0152.mediawiki#Intended_Protocol_Flow for reference
  17:39 <glozow> sipa: ah okay, got it
  17:39 <glozow> for some reason i thought they might go by inventory
  17:40 <jnewbery> ok, so part two of that question: Why is it ok to stop guarding them with g_cs_orphans and guard them with m_mutex_message_handling instead?
  17:40 <sipa> (i have not reviewed the PR, i am guessing) there is no strict requirement that they are in sync with the rest of the orphan data structure
  17:40 <jnewbery> sipa: yes!
  17:41 <svav> Because they now have their own mutex lock and unlock in TXOrphanage which can be dealt with by m_mutex_message_handling
  17:41 <jnewbery> svav: what are you referring to as they?
  17:42 <sipa> right, locking achieves two things: multithreaded r/w access (you're instantly UB if you don't protect variabled read/written by multiple threads with some form of locking/synchronization)
  17:42 <sipa> but it also achieves consistency; by putting multiple variables under the same lock, they will always be in sync from the perspective of other threads
  17:42 <svav> jnewbury: vExtraTxnForCompact and vExtraTxnForCompactIt
  17:42 <sipa> if consistency across certain variables isn't required, they can by protected by distinct locks
  17:43 <svav> jnewbery: vExtraTxnForCompact and vExtraTxnForCompactIt
  17:44 <jnewbery> sipa: exactly right. The class has certain invariants that must always be true for callers. We can lock a mutex, and then temporarily violate those invariants during processing, before restoring the invariants and releasing the lock.
  17:44 <jnewbery> if there are no logical invariants between separate data, they don't need to be guarded by the same lock
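A small illustrative sketch of that point (made-up names, not the PR's code): members tied together by an invariant share one mutex, while unrelated data can be guarded separately.

```cpp
// Sketch: m_wtxid_to_size must always describe entries of m_orphans, so both sit
// under one mutex; the "extra transactions" buffer has no such invariant with them
// and can live under a different mutex (as vExtraTxnForCompact does after this PR).
#include <cstddef>
#include <map>
#include <mutex>
#include <utility>
#include <vector>

class LockGranularitySketch
{
    std::mutex m_orphan_mutex;              // guards the two related members below
    std::map<int, std::vector<int>> m_orphans;
    std::map<int, size_t> m_wtxid_to_size;  // invariant: same keys as m_orphans

    std::mutex m_extra_mutex;               // independent data, independent mutex
    std::vector<int> m_extra_txn;

public:
    void AddOrphan(int wtxid, std::vector<int> tx_data)
    {
        std::lock_guard<std::mutex> lock{m_orphan_mutex};
        m_wtxid_to_size[wtxid] = tx_data.size();  // both updates happen under one lock,
        m_orphans[wtxid] = std::move(tx_data);    // so other threads never see them out of sync
    }

    void AddExtra(int value)
    {
        std::lock_guard<std::mutex> lock{m_extra_mutex};
        m_extra_txn.push_back(value);
    }
};
```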
  17:44 <michaelfolkson> So there is always this trade-off re locks between few locks (guarantees re being in sync) versus many locks (freedom of concerns)
  17:45 <sipa> right, and the consistency requirements for vExtraTxnForCompact etc are basically "whatever"
  17:45 <jnewbery> sipa: yes
  17:46 <jnewbery> svav: ah, I was a little confused because you said they have mutex lock and unlock in TXOrphange. vExtraTxnForCompact and vExtraTxnForCompactIt are not in TXOrphange
  17:47 <jnewbery> you're right that they're not protected under a different lock (m_mutex_message_handling), which prevents multiple threads accessing them concurrently
  17:47 <jnewbery> ok, next question. This PR removes the Peer.m_orphan_work_set member and replaces it with a m_peer_work_set map in TxOrphanage. What is the peer orphan work set used for?
  17:49 <michaelfolkson> Didn't get this far, don't know
  17:50 <glozow> peer -> orphans to be considered, for which the parents were provided by this peer?
  17:50 <glozow> or is it that they provided the orphan?
  17:50 <jnewbery> glozow: right first time. The parents were provided by this peer
  17:50 <glozow> so you could have multiple entries if the parents were provided by multiple peers?
  17:51 <glozow> er, multiple parents by different peers
  17:51 <jnewbery> so if we receive a transaction where we don't have its inputs in our UTXO set or mempool from peer A, and then peer B provides the missing parent and it gets accepted to our mempool, whose orphan work set will the orphan go into?
  17:51 <glozow> B?
  17:52 <jnewbery> glozow: right. We'll validate the orphan transaction next time we go through the ProcessMessages loop for peer B.
  17:53 <jnewbery> and if the orphan transaction was consensus-invalid, which peer would we punish?
  17:53 <glozow> and punish B if it's a bad orphan?
  17:53 <glozow> ha oops. B?
  17:54 <jnewbery> glozow: that seems unfair. Peer B might not even know about the orphan
  17:54 <glozow> yeah peer B's just parenting hey
  17:54 <michaelfolkson> Punishment seems challenging/impossible with orphans
  17:55 <jnewbery> no, we won't punish the peer that provided the parent. That'd be a good way to allow a malicious actor to get peers to disconnect other peers.
  17:56 <jnewbery> we remember who provided the orphan (https://github.com/bitcoin/bitcoin/blob/1186910b6b7ba7c7e5193c76f33f25825e6cc0b7/src/txorphanage.h#L53), so if it's bad we'll punish the peer that sent us the orphan
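A rough sketch of the bookkeeping described above (types simplified; the real TxOrphanage keys by wtxid/uint256 and stores transaction references): each orphan records which peer sent it, while the per-peer work set queues orphans whose missing parent that peer just provided.

```cpp
// Sketch only: illustrates why the "work set" peer and the "punish" peer can differ.
#include <map>
#include <set>

using NodeId = int;
using Wtxid  = int;  // stand-in for uint256

struct OrphanSketch {
    NodeId fromPeer;  // the peer that sent the orphan; punished if the tx is invalid
    // transaction data ...
};

std::map<Wtxid, OrphanSketch> g_orphans;
std::map<NodeId, std::set<Wtxid>> g_peer_work_set;  // like TxOrphanage's m_peer_work_set:
                                                    // peer -> orphans to revisit on that
                                                    // peer's next ProcessMessages pass

void OnParentAccepted(NodeId parent_provider, Wtxid orphan)
{
    // Peer B (parent_provider) supplied the missing parent, so the orphan goes into
    // B's work set, but an invalid orphan is still attributed to g_orphans[orphan].fromPeer.
    g_peer_work_set[parent_provider].insert(orphan);
}
```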
  17:56 <jnewbery> ok, final question. (Bonus question) This PR originally included some behavioural changes in the way that orphans are handled. Those have now been moved from this PR to a separate branch and may be proposed as a follow-up PR. What are those behaviour changes?
  17:58 <jnewbery> alright, maybe one for people to go and investigate for themselves.
  17:58 <michaelfolkson> A lot of commits on that branch
  17:58 <jnewbery> the branch is here: https://github.com/ajtowns/bitcoin/commits/202104-whohandlesorphans
  17:58 <michaelfolkson> Something about who handles orphans :)
  17:58 <jnewbery> any final questions before we wrap up?
  17:58 <svav> Dont process tx after processing orphans?
  17:59 <hernanmarino> :)
  17:59 <jnewbery> ok, let's call it. Thanks everyone
  18:00 <jnewbery> quiet week - I guess everyone else is hanging out on Saylor's yacht
  18:00 <jnewbery> #endmeeting