Move CNodeState members guarded by g_msgproc_mutex to Peer (p2p, refactoring)

Host: dergoegge  -  PR author: dergoegge


  • #26036 introduced a new mutex (NetEventsInterface::g_msgproc_mutex) to document the fact that our message processing code is single threaded (msghand thread). Any PeerManagerImpl or Peer members that are only ever accessed from that single thread should be annotated as GUARDED_BY(g_msgproc_mutex), to avoid bugs where those members are accessed by other threads as well (in which case they need to be guarded by a different mutex).

  • CNodeState is documented to only have validation specific members and is therefore entirely guarded by cs_main. However, not all members are validation specific, and the ones that aren’t should be moved to Peer.

  • #26140 is a simple refactor PR that moves some of the CNodeState members that are not validation relevant to Peer and annotates them as guarded by g_msgproc_mutex.


  1. Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK?

  2. What is cs_main and what is it used for?

  3. Which threads access state relevant for message processing (PeerManagerImpl, Peer, etc.)? (Hint: have a look at the developer notes for a list of all threads)

  4. What is the difference between CNodeState and Peer? How would you decide where to store new per-peer state? (Bonus points if you also mention CNode in your answer)

  5. The PR moves nUnconnectingHeaders, m_headers_sync_timeout, fPreferHeaders and m_recently_announced_invs from CNodeState to Peer. Multiple other members of CNodeState are also not validation specific and should also move to Peer. Which members are that and why is it not as trivial to move those in comparison to the ones that this PR moves?

  6. Why does the PR rename nUnconnectingHeaders and MAX_UNCONNECTING_HEADERS?

Meeting Log

  117:00 <dergoegge> #startmeeting
  217:00 <pablomartin> hello!
  317:00 <hernanmarino> Hello
  417:00 <dergoegge> Hi everyone! Welcome to this week's PR review club!
  517:00 <andrewtoth> hi
  617:00 <stickies-v> Hi!
  717:00 <dergoegge> Feel free to say hi to let people know you are here
  817:01 <amovfx> hu
  917:01 <amovfx> hi
 1017:01 <svav> Hi All
 1117:01 <dergoegge> This week we are looking at #26140 "Move CNodeState members guarded by g_msgproc_mutex to Peer", notes are in the usual place:
 1217:01 <effexzi> Hi every1
 1317:01 <dergoegge> Anyone here for the first time? :)
 1417:02 <Lov3r_Of_Bitcoin> Hello
 1517:02 <sprainhill> dergoegge: I am!
 1617:02 <amovfx> welcome sprain
 1717:02 <dergoegge> sprainhill: Welcome!
 1817:02 <sprainhill> ty!
 1917:02 <svav> sprainhill where did you hear of the meeting please, and Hi!
 2017:03 <LarryRuane> hi
 2117:03 <sprainhill> Svav: I think Twitter originally
 2217:03 <schmidty_> hi
 2317:04 <dergoegge> Lets get started! Who had a chance to look at the notes for this week? (y/n)
 2417:04 <amovfx> y
 2517:04 <hernanmarino> y
 2617:04 <stickies-v> y, mostly code review though!
 2717:04 <pablomartin> y
 2817:04 <svav> y
 2917:04 <sprainhill> n
 3017:04 <LarryRuane> y (but wasn't able to figure out all the answers to the questions)
 3117:05 <Lov3r_Of_Bitcoin> y
 3217:05 <dergoegge> Lots of ys, cool! Have you also reviewed the PR? Concept ACK, approach ACK, tested ACK, or NACK?
 3317:06 <pablomartin> approach CK
 3417:06 <stickies-v> Approach ACK
 3517:06 <hernanmarino> approach ACK, I will test it after this meeting
 3617:07 <Lov3r_Of_Bitcoin> approach ACK
 3717:08 <dergoegge> Great! The first question is a bit of a background question: What is `cs_main` and what is it used for?
 3817:08 <pablomartin> cs_main syncs access (recursive mutex) to the chain state in an atomic way
 3917:08 <hernanmarino> "critical section" mutex from the old main function
 4017:09 <svav> cs_main is a recursive mutex which is used to ensure that validation is carried out in an atomic way. It guards access to validation specific variables (such as CChainState and CNode) or mempool variables (in net_processing). The lock of cs_main is in validation.cpp.
 4117:10 <glozow> hi
 4217:10 <svav>
 4317:11 <dergoegge> pablomartin: yes but is the chainstate all that it guards?
 4417:11 <dergoegge> And can someone explain what the difference between a mutex and a recursive mutex is?
 4517:12 <pablomartin> nop, also de ones svav mentioned and on this pr
 4617:13 <LarryRuane> For any of you history fans ... I was thinking `cs_main` was the only mutex in the original code, but that's not true, there were several (but not many):
 4717:13 <amovfx> A recursive mutex can be locked many times
 4817:13 <stickies-v> A recursive mutex doesn't lock when locked multiple times in the same stack, I think?
 4917:13 <amovfx> and needs to be unlocked that many times
 5017:13 <dergoegge> svav: I don't think that CNode is guarded by cs_main (and it totally shouldn't be). That SO answer is over a year old, so it might be out of date. Although i also can't remember CNode being guarded by cs_main.
 5117:14 <amovfx> oh shit, maybe I got it backwards
 5217:14 <pablomartin> recursive mutex: could be locked multiple times by the same process/thread, without causing a deadlock.
 5317:14 <LarryRuane> if a thread acquires (locks) a mutex and then that same thread tries to lock it again, it will block... with a recursive mutex, the same thread can lock the mutex multiple times (simultaneously), and it's only unlocked on the last unlock
 5417:15 <svav> dergoegge ok I just Googled it
 5517:15 <dergoegge> stickies-v, amovfx: yes a recursive mutex can be locked multiple times but only by the same thread. A regular mutex will create a deadlock when locked twice from the same thread.
 5617:15 <amovfx> imo, Larry +1, he said what I was trying to say
 5717:15 <andrewtoth> pablomartin: I think that's true for thread, but a process can have multiple threads
 5817:15 <LarryRuane> the recursive mutex has a lock "counter" ... only if it's zero, is the lock unlocked
 5917:15 <dergoegge> pablomartin, LarryRuane: exactly +1
 6017:16 <pablomartin> andrewtoth: I agree
 6117:16 <amovfx> ah I missed the detail of the same thread has to lock multiple times, I thought it can be locked multiple times from wherever, good to know
 6217:16 <LarryRuane> but I'm pretty sure that we're trying to eliminate recursive mutexes, but it is a slow process
 6317:16 <dergoegge> andrewtoth: yes that is right, only relevant for threads
 6417:16 <LarryRuane> amovfx: well, if it could be locked multiple times from wherever, then it wouldn't have any effect!
 6517:17 <andrewtoth> a shared mutex can be locked multiple times from wherever, and only locked exclusively by a unique lock taken on it
 6617:17 <yashraj> just found out what a mutex is, with a fun, relevant analogy too!
 6717:17 <dergoegge> LarryRuane: yea recursive mutexes generally lead to worse code, i think we have a couple of open PRs that try to remove some of them.
 6817:17 <amovfx> Larry: good point, ty
 6917:18 <amovfx> imo, I don't like them at all
 7017:18 <andrewtoth> there is an open issue tracking removal of recursive mutexes
 7117:18 <stickies-v> dergoegge: the docs state that CNodeState's members are protected by cs_main, but I can't see that enforced in code anywhere. Do the docs mean that they _should_ only be accessed with a lock on cs_main?
 7217:18 <dergoegge> My prepared answer for this question: cs_main is our main validation mutex lock, used to guard any validation state from parallel access by different threads. It is also used to guard non-validation state which we should change (like this week's PR does)
 7317:19 <LarryRuane> my impression is that recursive mutexes, even though they seem useful. lead to lazy coding, and increase the possibility of synchronization bugs
 7417:19 <LarryRuane> dergoegge: very helpful, can you say a little on what validatation state means? what's an example of such state, and example of NOT such state?
 7517:19 <pablomartin> dergoegge: I agree, it's the main mutex lock
 7617:20 <dergoegge> stickies-v: yes thats what the docs mean, we could add annotations (i.e. GUARDED_BY(cs_main)) to get the compiler to check (i think i did that at some point privately just to check that cs_main is always held when they're accessed)
 7717:20 <amovfx> is validation state, things like chaintip, mempool?
 7817:21 <LarryRuane> maybe i can answer partially myself... our list of peers is part of our current state, but it's not *validation* state ... so changing that list should probably not require `cs_main`
 7917:22 <dergoegge> LarryRuane: validation state as in chainstate manager, chainstate but also non-validation state like CNodeState::nUnconnectingHeaders (which is strictly per peer p2p state)
 8017:23 <LarryRuane> dergoegge: +1 thanks
 8117:23 <stickies-v> dergoegge: okay cool thanks, that's what I thought. The docs just make it sound like the locking is already taken care of, instead of highlighting that it absolutely needs to be done - which I think is a bit confusing
 8217:23 <amovfx> ah, validation data can come in through peers, so we lock any state that is associated with those?
 8317:24 <dergoegge> amovfx: we use cs_main to keep the mempool in sync with the current chainstate iirc (e.g. utxo set) (eventually we could have an async mempool but thats a story for another time)
 8417:24 <amovfx> chainstate ==utxo state
 8517:24 <amovfx> ?
 8617:24 <dergoegge> OK we should move on: Which threads access state relevant for message processing (PeerManagerImpl, Peer, etc.)? (Hint: have a look at the developer notes for a list of all threads)
 8717:25 <amovfx>
 8817:25 <LarryRuane> amovfx: yes, I thought that naming is confusing, but yet, my impression is that chainstate is another term for UTXO state (utxo set)
 8917:25 <andrewtoth> amovfx: blocks added and indexed as well
 9017:25 <amovfx> ty
 9117:25 <LarryRuane> dergoegge: These "net threads"
 9217:26 <pablomartin> all net threads?
 9317:26 <dergoegge> amovfx: the naming can be confusing, you can have a look at our "Chainstate" to see what that includes (which is what i meant)
 9417:26 <amovfx> ty
 9517:27 <amovfx> dergoegge: ThreadMessageHandler?
 9617:27 <LarryRuane> (those threads are owned by `CConnman` of which there is one instance)
 9717:28 <dergoegge> Ideally the answer includes a list of thread names and an explanation on how they access the net processing (message processing) state
 9817:28 <pablomartin> LarryRuane: I see
 9917:29 <dergoegge> amovfx: yes the msghand thread is one of them, how does it access net processing state? (what PeerManager functions does it call?)
10017:30 <amovfx> .m_recently_announced_invs?
10117:30 <LarryRuane> amovfx: I think what i said is inaccurate, read this comment above the `Chainstate` class definition
10217:31 <amovfx> ty
10317:31 <LarryRuane> one of the members of that class is `m_coins_views` which is the UTXO set, but it's only one part of that class
10417:32 <dergoegge> amovfx: ThreadMessageHandler directly calls methods on the PeerManager (i.e. ProcessMessages, SendMessages) which do all the message handling
10517:32 <amovfx> ty
10617:32 <dergoegge> How about the scheduler thread, does that call into net processing?
10717:34 <andrewtoth> does it for scheduling feeler connections?
10817:35 <pablomartin> dergoegge: but it does asynchronous calls, no? not sure if it ended up callin the net processing...
10917:35 <pablomartin> *calling
11017:35 <dergoegge> andrewtoth: feeler connections are opened by the openconn thread (see ThreadOpenConnections in net.cpp)
11117:35 <amovfx> I would be shocked if the scheduler thread makes calls into netprocessing
11217:37 <dergoegge> PeerManager is registered as a CValidationInterface so it receives validation callbacks which are scheduled on the scheduler thread (e.g. BlockConnected, BlockDisconnected, NewPowValidBlock)
11317:37 <amovfx> Looks like there CConnman calls the scheduler
11417:38 <stickies-v> a bit to my surprise, the scheduler does not start new threads for the tasks it starts (, but rather they are indeed executed from the scheduler thread
11517:38 <LarryRuane> scheduler calls `CheckForStaleTipAndEvictPeers` and `ReattemptInitialBroadcast`
11617:38 <LarryRuane> (which are part of PeerManager)
11717:39 <dergoegge> LarryRuane: also correct!
11817:42 <dergoegge> Oh i think my last message didn't make it through...
11917:42 <dergoegge> Ok so we covered the msghand and scheduler threads. How about the openconn thread?
12017:44 <stickies-v> (note: the thread is called "opencon", if you're grepping)
12117:47 <stickies-v> dergoegge: I can't see any instances where `CConnman::ThreadOpenConnections` calls into net_processing, so... I think the answer is no?
12217:49 <dergoegge> ThreadOpenConnections calls OpenNetworkConnection which calls... IntializeNode on PeerManager!
12317:49 <dergoegge> I added this question to make it obvious that some functions on PeerManager are called by multiple threads and any state accessed/mutated in them needs protection against parallel access.
12417:49 <dergoegge> Some state in net processing however is only ever accessed by the msghand thread and therefore doesn't need a mutex.
12517:50 <LarryRuane> in case this is helpful, link to ThreadOpenConnections:
12617:50 <dergoegge> Next question: What is the difference between CNodeState and Peer? How would you decide where to store new per-peer state? (Bonus points if you also mention CNode in your answer)
12717:50 <stickies-v> oh. good point, I only checked the direct calls, not the entire stacks. that seems like a very difficult thing to verify if you're not reasonably familiar with all the functions called, or is there a trick?
12817:51 <stickies-v> (difficult as in very time consuming to do it manually)
12917:52 <LarryRuane> dergoegge: I think currently, some of CNodeState is not relevant to validation, and so better belongs in Peer.. Peer has non-validation stuff, CNodeState should have validation stuff -- did I get that right? :)
13017:54 <LarryRuane> stickies-v: +1 .. I sometimes find it helpful to run the node in the debugger to see those kind of dynamics
13117:54 <pablomartin> stickies-v: should be a map/ graph somewhere... not sure if I ever saw it... or dreamt it
13217:54 <dergoegge> stickies-v: yea you need to do some digging, i don't have a special trick for this (i think the call graphs in the doxygen docs are sometimes helpful:
13317:54 <pablomartin> dergoegge: yeah there
13417:55 <LarryRuane> dergoegge: that's cool, I wonder how good the call graphing is when using interface classes, that would make it tricky
13517:55 <dergoegge> LarryRuane: yes CNodeState is meant for validation specific per-peer state guarded by cs_main and Peer is meant for all other per-peer state (that is not guarded by cs_main)
13617:55 <stickies-v> intuitively, I'd say CNode mostly deals with communication between nodes, CNodeState with validating what we get from a node, and Peer to deal with policy and checking if peers are behaving correctly. Is that a fair approximation?
13717:56 <stickies-v> pablomartin dergoegge yeah but call graphs are function/class based, so you'd have to look at a lot of charts to figure out what calls into net_processing?
13817:56 <dergoegge> stickies-v: yea that sounds about right. Do you mean mempool policy?
13917:57 <stickies-v> cool. No I mean p2p policy, e.g. not exceeding the number of unconnected headers etc. I don't think this touches mempool policy at all, right?
14017:57 <dergoegge> stickies-v: yea the doxygen graphs are not perfect but in case of the addcon thread it would help
14117:58 <dergoegge> stickies-v: OK good! It has no mempool policy state (i hope)
14217:58 <amovfx> Is node state only for the operators node?
14317:58 <dergoegge> Lets skip Q. 5: Why does the PR rename nUnconnectingHeaders and MAX_UNCONNECTING_HEADERS?
14417:58 <LarryRuane> amovfx: no it's for each of our peers
14517:59 <LarryRuane> dergoegge: because Marco suggested it :)
14617:59 <amovfx> ty
14717:59 <pablomartin> dergoegge: about... where to store new peer-peer state? CNode (defined in net.h, used by m_nodes(CConnman) and covered by m_nodes_mutex) is concerned with the connection state of the peer.
14818:00 <LarryRuane> but seriously, is the rename because it's more accurate to include "msgs" in the name, since it's a count of headers *messages*? (i'm not sure)
14918:00 <dergoegge> LarryRuane: yes but the renaming is useful because the variable tracks the number of "headers" messages that did not connect, not the number of headers that didn't connect
15018:00 <dergoegge> LarryRuane: yes
15118:00 <LarryRuane> dergoegge: +1 that's what i was thinking
15218:00 <dergoegge> #endmeeting