Move CNodeState members guarded by g_msgproc_mutex to Peer (p2p, refactoring)

Nov 2, 2022

https://github.com/bitcoin/bitcoin/pull/26140

Host: dergoegge - PR author: dergoegge

Notes

#26036 introduced a new mutex (NetEventsInterface::g_msgproc_mutex) to document the fact that our message processing code is single-threaded (it runs on the msghand thread). Any PeerManagerImpl or Peer members that are only ever accessed from that single thread should be annotated as GUARDED_BY(g_msgproc_mutex), to avoid bugs where those members are also accessed by other threads (in which case they need to be guarded by a different mutex).
CNodeState is documented to only have validation specific members and is therefore entirely guarded by cs_main. However, not all members are validation specific, and the ones that aren’t should be moved to Peer.
#26140 is a simple refactoring PR that moves some of the CNodeState members that are not relevant to validation to Peer and annotates them as guarded by g_msgproc_mutex.
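
To illustrate what a GUARDED_BY annotation buys, here is a minimal sketch of Clang's thread-safety annotations. It is not the code from the PR: `SketchMutex`, `g_msgproc_mutex_sketch`, `PeerSketch`, `unconnecting_headers_msgs`, `OnUnconnectingHeaders` and `MessageHandlerPass` are hypothetical names, and the macros are defined locally so the snippet stands alone (Bitcoin Core ships its own versions). The attributes only have an effect when compiling with Clang and `-Wthread-safety`; on other compilers they expand to nothing.

```cpp
#include <mutex>

// Clang thread-safety attributes; they expand to nothing on other compilers.
#if defined(__clang__)
#define TS_ATTR(x) __attribute__((x))
#else
#define TS_ATTR(x)
#endif
#define CAPABILITY(x) TS_ATTR(capability(x))
#define GUARDED_BY(x) TS_ATTR(guarded_by(x))
#define ACQUIRE(...) TS_ATTR(acquire_capability(__VA_ARGS__))
#define RELEASE(...) TS_ATTR(release_capability(__VA_ARGS__))
#define REQUIRES(...) TS_ATTR(requires_capability(__VA_ARGS__))
#define NO_THREAD_SAFETY_ANALYSIS TS_ATTR(no_thread_safety_analysis)

// Hypothetical annotated mutex wrapper, for illustration only.
class CAPABILITY("mutex") SketchMutex
{
    std::mutex m_impl;

public:
    void lock() ACQUIRE() NO_THREAD_SAFETY_ANALYSIS { m_impl.lock(); }
    void unlock() RELEASE() NO_THREAD_SAFETY_ANALYSIS { m_impl.unlock(); }
};

// Stand-in for the message processing mutex (assumed name).
SketchMutex g_msgproc_mutex_sketch;

// Stand-in for per-peer p2p state that is not validation specific.
struct PeerSketch {
    int unconnecting_headers_msgs GUARDED_BY(g_msgproc_mutex_sketch){0};
};

// Callers must already hold the mutex; Clang warns at call sites that do not.
void OnUnconnectingHeaders(PeerSketch& peer) REQUIRES(g_msgproc_mutex_sketch)
{
    ++peer.unconnecting_headers_msgs;
}

// Models one pass of the message handling thread: take the mutex,
// touch the guarded member, release the mutex.
int MessageHandlerPass(PeerSketch& peer)
{
    g_msgproc_mutex_sketch.lock();
    OnUnconnectingHeaders(peer);
    const int count = peer.unconnecting_headers_msgs;
    g_msgproc_mutex_sketch.unlock();
    return count;
}
```

If some other function read `peer.unconnecting_headers_msgs` without holding `g_msgproc_mutex_sketch`, Clang would flag it at compile time, which is the kind of bug the annotation is meant to catch.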

Questions

Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK?
What is cs_main and what is it used for?
Which threads access state relevant for message processing (PeerManagerImpl, Peer, etc.)? (Hint: have a look at the developer notes for a list of all threads)
What is the difference between CNodeState and Peer? How would you decide where to store new per-peer state? (Bonus points if you also mention CNode in your answer)
The PR moves nUnconnectingHeaders, m_headers_sync_timeout, fPreferHeaders and m_recently_announced_invs from CNodeState to Peer. Multiple other members of CNodeState are also not validation specific and should also move to Peer. Which members are those, and why is moving them not as trivial as moving the ones that this PR moves?
Why does the PR rename nUnconnectingHeaders and MAX_UNCONNECTING_HEADERS?

Meeting Log

17:00

<dergoegge> #startmeeting

17:00

<pablomartin> hello!

17:00

<hernanmarino> Hello

17:00

<dergoegge> Hi everyone! Welcome to this week's PR review club!

17:00

<andrewtoth> hi

17:00

<stickies-v> Hi!

17:00

<dergoegge> Feel free to say hi to let people know you are here

17:01

<amovfx> hu

17:01

<amovfx> hi

17:01

<svav> Hi All

17:01

<dergoegge> This week we are looking at #26140 "Move CNodeState members guarded by g_msgproc_mutex to Peer", notes are in the usual place: https://bitcoincore.reviews/26140

17:01

<effexzi> Hi every1

17:01

<dergoegge> Anyone here for the first time? :)

17:02

<Lov3r_Of_Bitcoin> Hello

17:02

<sprainhill> dergoegge: I am!

17:02

<amovfx> welcome sprain

17:02

<dergoegge> sprainhill: Welcome!

17:02

<sprainhill> ty!

17:02

<svav> sprainhill where did you hear of the meeting please, and Hi!

17:03

<LarryRuane> hi

17:03

<sprainhill> Svav: I think Twitter originally

17:03

<schmidty_> hi

17:04

<dergoegge> Lets get started! Who had a chance to look at the notes for this week? (y/n)

17:04

<amovfx> y

17:04

<hernanmarino> y

17:04

<stickies-v> y, mostly code review though!

17:04

<pablomartin> y

17:04

<svav> y

17:04

<sprainhill> n

17:04

<LarryRuane> y (but wasn't able to figure out all the answers to the questions)

17:05

<Lov3r_Of_Bitcoin> y

17:05

<dergoegge> Lots of ys, cool! Have you also reviewed the PR? Concept ACK, approach ACK, tested ACK, or NACK?

17:06

<pablomartin> approach CK

17:06

<stickies-v> Approach ACK

17:06

<hernanmarino> approach ACK, I will test it after this meeting

17:07

<Lov3r_Of_Bitcoin> approach ACK

17:08

<dergoegge> Great! The first question is a bit of a background question: What is `cs_main` and what is it used for?

17:08

<pablomartin> cs_main syncs access (recursive mutex) to the chain state in an atomic way

17:08

<hernanmarino> "critical section" mutex from the old main function

17:09

<svav> cs_main is a recursive mutex which is used to ensure that validation is carried out in an atomic way. It guards access to validation specific variables (such as CChainState and CNode) or mempool variables (in net_processing). The lock of cs_main is in validation.cpp.

17:10

<glozow> hi

17:10

<svav> https://bitcoin.stackexchange.com/questions/106314/what-is-cs-main-why-is-it-called-cs-main

17:11

<dergoegge> pablomartin: yes but is the chainstate all that it guards?

17:11

<dergoegge> And can someone explain what the difference between a mutex and a recursive mutex is?

17:12

<pablomartin> nop, also de ones svav mentioned and on this pr

17:13

<LarryRuane> For any of you history fans ... I was thinking `cs_main` was the only mutex in the original code, but that's not true, there were several (but not many): https://github.com/JeremyRubin/satoshis-version/search?q=CCriticalSection

17:13

<amovfx> A recursive mutex can be locked many times

17:13

<stickies-v> A recursive mutex doesn't lock when locked multiple times in the same stack, I think?

17:13

<amovfx> and needs to be unlocked that many times

17:13

<dergoegge> svav: I don't think that CNode is guarded by cs_main (and it totally shouldn't be). That SO answer is over a year old, so it might be out of date. Although i also can't remember CNode being guarded by cs_main.

17:14

<amovfx> oh shit, maybe I got it backwards

17:14

<pablomartin> recursive mutex: could be locked multiple times by the same process/thread, without causing a deadlock.

17:14

<LarryRuane> if a thread acquires (locks) a mutex and then that same thread tries to lock it again, it will block... with a recursive mutex, the same thread can lock the mutex multiple times (simultaneously), and it's only unlocked on the last unlock

17:15

<svav> dergoegge ok I just Googled it

17:15

<dergoegge> stickies-v, amovfx: yes a recursive mutex can be locked multiple times but only by the same thread. A regular mutex will create a deadlock when locked twice from the same thread.

17:15

<amovfx> imo, Larry +1, he said what I was trying to say

17:15

<andrewtoth> pablomartin: I think that's true for thread, but a process can have multiple threads

17:15

<LarryRuane> the recursive mutex has a lock "counter" ... only if it's zero, is the lock unlocked

17:15

<dergoegge> pablomartin, LarryRuane: exactly +1
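
The lock "counter" behaviour described above can be sketched with the standard library's `std::recursive_mutex`. This is an illustration only, not Bitcoin Core code; `OtherThreadCanLock`, `RecursiveLockDemo` and `RunRecursiveLockDemo` are hypothetical names. The owning thread locks twice, and a second thread probes with `try_lock` to see whether the mutex is still held.

```cpp
#include <mutex>
#include <thread>

// Returns true if a different thread can acquire `m` right now.
bool OtherThreadCanLock(std::recursive_mutex& m)
{
    bool acquired = false;
    std::thread probe([&] {
        if (m.try_lock()) {
            acquired = true;
            m.unlock();
        }
    });
    probe.join(); // join() makes the write to `acquired` visible here
    return acquired;
}

struct RecursiveLockDemo {
    bool held_after_two_locks;
    bool held_after_one_unlock;
    bool free_after_two_unlocks;
};

RecursiveLockDemo RunRecursiveLockDemo()
{
    std::recursive_mutex m;
    RecursiveLockDemo result{};

    m.lock();
    m.lock(); // same thread locks again: fine for a recursive mutex
    result.held_after_two_locks = !OtherThreadCanLock(m);

    m.unlock(); // count drops from 2 to 1, the mutex is still owned
    result.held_after_one_unlock = !OtherThreadCanLock(m);

    m.unlock(); // count reaches 0, the mutex is released
    result.free_after_two_unlocks = OtherThreadCanLock(m);
    return result;
}
```

Doing the same double lock with a plain `std::mutex` is undefined behavior according to the standard and typically deadlocks in practice, which is why the sketch does not attempt it.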

17:16

<pablomartin> andrewtoth: I agree

17:16

<amovfx> ah I missed the detail of the same thread has to lock multiple times, I thought it can be locked multiple times from wherever, good to know

17:16

<LarryRuane> but I'm pretty sure that we're trying to eliminate recursive mutexes, but it is a slow process

17:16

<dergoegge> andrewtoth: yes that is right, only relevant for threads

17:16

<LarryRuane> amovfx: well, if it could be locked multiple times from wherever, then it wouldn't have any effect!

17:17

<andrewtoth> a shared mutex can be locked multiple times from wherever, and only locked exclusively by a unique lock taken on it

17:17

<yashraj> just found out what a mutex is, with a fun, relevant analogy too! https://stackoverflow.com/questions/34524/what-is-a-mutex

17:17

<dergoegge> LarryRuane: yea recursive mutexes generally lead to worse code, i think we have a couple of open PRs that try to remove some of them.

17:17

<amovfx> Larry: good point, ty

17:18

<amovfx> imo, I don't like them at all

17:18

<andrewtoth> there is an open issue tracking removal of recursive mutexes https://github.com/bitcoin/bitcoin/issues/19303

17:18

<stickies-v> dergoegge: the docs state that CNodeState's members are protected by cs_main, but I can't see that enforced in code anywhere. Do the docs mean that they _should_ only be accessed with a lock on cs_main?

17:18

<dergoegge> My prepared answer for this question: cs_main is our main validation mutex lock, used to guard any validation state from parallel access by different threads. It is also used to guard non-validation state which we should change (like this week's PR does)

17:19

<LarryRuane> my impression is that recursive mutexes, even though they seem useful. lead to lazy coding, and increase the possibility of synchronization bugs

17:19

<LarryRuane> dergoegge: very helpful, can you say a little on what validatation state means? what's an example of such state, and example of NOT such state?

17:19

<pablomartin> dergoegge: I agree, it's the main mutex lock

17:20

<dergoegge> stickies-v: yes thats what the docs mean, we could add annotations (i.e. GUARDED_BY(cs_main)) to get the compiler to check (i think i did that at some point privately just to check that cs_main is always held when they're accessed)

17:20

<amovfx> is validation state, things like chaintip, mempool?

17:21

<LarryRuane> maybe i can answer partially myself... our list of peers is part of our current state, but it's not *validation* state ... so changing that list should probably not require `cs_main`

17:22

<dergoegge> LarryRuane: validation state as in chainstate manager, chainstate but also non-validation state like CNodeState::nUnconnectingHeaders (which is strictly per peer p2p state)

17:23

<LarryRuane> dergoegge: +1 thanks

17:23

<stickies-v> dergoegge: okay cool thanks, that's what I thought. The docs just make it sound like the locking is already taken care of, instead of highlighting that it absolutely needs to be done - which I think is a bit confusing

17:23

<amovfx> ah, validation data can come in through peers, so we lock any state that is associated with those?

17:24

<dergoegge> amovfx: we use cs_main to keep the mempool in sync with the current chainstate iirc (e.g. utxo set) (eventually we could have an async mempool but thats a story for another time)

17:24

<amovfx> chainstate ==utxo state

17:24

<amovfx> ?

17:24

<dergoegge> OK we should move on: Which threads access state relevant for message processing (PeerManagerImpl, Peer, etc.)? (Hint: have a look at the developer notes for a list of all threads)

17:25

<amovfx> https://github.com/bitcoin/bitcoin/blob/master/doc/developer-notes.md#threads

17:25

<LarryRuane> amovfx: yes, I thought that naming is confusing, but yet, my impression is that chainstate is another term for UTXO state (utxo set)

17:25

<andrewtoth> amovfx: blocks added and indexed as well

17:25

<amovfx> ty

17:25

<LarryRuane> dergoegge: These "net threads" https://github.com/bitcoin/bitcoin/blob/master/src/net.h#L1105

17:26

<pablomartin> all net threads?

17:26

<dergoegge> amovfx: the naming can be confusing, you can have a look at our "Chainstate" to see what that includes (which is what i meant)

17:26

<amovfx> ty

17:27

<amovfx> dergoegge: ThreadMessageHandler?

17:27

<LarryRuane> (those threads are owned by `CConnman` of which there is one instance)

17:28

<dergoegge> Ideally the answer includes a list of thread names and an explanation on how they access the net processing (message processing) state

17:28

<pablomartin> LarryRuane: I see

17:29

<dergoegge> amovfx: yes the msghand thread is one of them, how does it access net processing state? (what PeerManager functions does it call?)

17:30

<amovfx> .m_recently_announced_invs?

17:30

<LarryRuane> amovfx: I think what i said is inaccurate, read this comment above the `Chainstate` class definition https://github.com/bitcoin/bitcoin/blob/master/src/validation.h#L404

17:31

<amovfx> ty

17:31

<LarryRuane> one of the members of that class is `m_coins_views` which is the UTXO set, but it's only one part of that class

17:32

<dergoegge> amovfx: ThreadMessageHandler directly calls methods on the PeerManager (i.e. ProcessMessages, SendMessages) which do all the message handling

17:32

<amovfx> ty

17:32

<dergoegge> How about the scheduler thread, does that call into net processing?

17:34

<andrewtoth> does it for scheduling feeler connections?

17:35

<pablomartin> dergoegge: but it does asynchronous calls, no? not sure if it ended up callin the net processing...

17:35

<pablomartin> *calling

17:35

<dergoegge> andrewtoth: feeler connections are opened by the openconn thread (see ThreadOpenConnections in net.cpp)

17:35

<amovfx> I would be shocked if the scheduler thread makes calls into netprocessing

17:37

<dergoegge> PeerManager is registered as a CValidationInterface so it receives validation callbacks which are scheduled on the scheduler thread (e.g. BlockConnected, BlockDisconnected, NewPowValidBlock)

17:37

<amovfx> Looks like there CConnman calls the scheduler

17:38

<stickies-v> a bit to my surprise, the scheduler does not start new threads for the tasks it starts (https://github.com/bitcoin/bitcoin/blob/5274f324375fd31cf8507531fbc612765d03092f/src/scheduler.cpp#L62), but rather they are indeed executed from the scheduler thread
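
The point that queued tasks execute on the scheduler thread itself, rather than on freshly spawned threads, can be sketched as follows. This is a simplified stand-in and not the code in scheduler.cpp; `TasksRunOnServiceThread` is a hypothetical name. Each task records the id of the thread it ran on, and the function checks that all of them match the single service thread.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical single-threaded task runner: drains a queue on one service thread.
// Returns true if every task ran on that service thread and none on the caller's.
bool TasksRunOnServiceThread(int num_tasks)
{
    std::vector<std::thread::id> ran_on;
    std::queue<std::function<void()>> tasks;
    for (int i = 0; i < num_tasks; ++i) {
        tasks.push([&ran_on] { ran_on.push_back(std::this_thread::get_id()); });
    }

    // The queue is filled before the thread starts and the results are read
    // after join(), so this sketch needs no locking.
    std::thread service([&tasks] {
        while (!tasks.empty()) {
            tasks.front()(); // run the task on the service thread
            tasks.pop();
        }
    });
    const std::thread::id service_id = service.get_id();
    service.join();

    if (ran_on.size() != static_cast<std::size_t>(num_tasks)) return false;
    for (const auto& id : ran_on) {
        if (id != service_id || id == std::this_thread::get_id()) return false;
    }
    return true;
}
```

Because every callback runs on that one thread, any state it touches is shared with whichever other threads also reach that state, which is why such state needs a mutex.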

17:38

<LarryRuane> scheduler calls `CheckForStaleTipAndEvictPeers` and `ReattemptInitialBroadcast`

17:38

<LarryRuane> (which are part of PeerManager)

17:39

<dergoegge> LarryRuane: also correct!

17:42

<dergoegge> Oh i think my last message didn't make it through...

17:42

<dergoegge> Ok so we covered the msghand and scheduler threads. How about the openconn thread?

17:44

<stickies-v> (note: the thread is called "opencon", if you're grepping)

17:47

<stickies-v> dergoegge: I can't see any instances where `CConnman::ThreadOpenConnections` calls into net_processing, so... I think the answer is no?

17:49

<dergoegge> ThreadOpenConnections calls OpenNetworkConnection which calls... IntializeNode on PeerManager!

17:49

<dergoegge> I added this question to make it obvious that some functions on PeerManager are called by multiple threads and any state accessed/mutated in them needs protection against parallel access.

17:49

<dergoegge> Some state in net processing however is only ever accessed by the msghand thread and therefore doesn't need a mutex.

17:50

<LarryRuane> in case this is helpful, link to ThreadOpenConnections: https://github.com/bitcoin/bitcoin/blob/master/src/net.cpp#L1577

17:50

<dergoegge> Next question: What is the difference between CNodeState and Peer? How would you decide where to store new per-peer state? (Bonus points if you also mention CNode in your answer)

17:50

<stickies-v> oh. good point, I only checked the direct calls, not the entire stacks. that seems like a very difficult thing to verify if you're not reasonably familiar with all the functions called, or is there a trick?

17:51

<stickies-v> (difficult as in very time consuming to do it manually)

17:52

<LarryRuane> dergoegge: I think currently, some of CNodeState is not relevant to validation, and so better belongs in Peer.. Peer has non-validation stuff, CNodeState should have validation stuff -- did I get that right? :)

17:54

<LarryRuane> stickies-v: +1 .. I sometimes find it helpful to run the node in the debugger to see those kind of dynamics

17:54

<pablomartin> stickies-v: should be a map/ graph somewhere... not sure if I ever saw it... or dreamt it

17:54

<dergoegge> stickies-v: yea you need to do some digging, i don't have a special trick for this (i think the call graphs in the doxygen docs are sometimes helpful: https://doxygen.bitcoincore.org/class_c_connman.html#a0b787caf95e52a346a2b31a580d60a62)

17:54

<pablomartin> dergoegge: yeah there

17:55

<LarryRuane> dergoegge: that's cool, I wonder how good the call graphing is when using interface classes, that would make it tricky

17:55

<dergoegge> LarryRuane: yes CNodeState is meant for validation specific per-peer state guarded by cs_main and Peer is meant for all other per-peer state (that is not guarded by cs_main)

17:55

<stickies-v> intuitively, I'd say CNode mostly deals with communication between nodes, CNodeState with validating what we get from a node, and Peer to deal with policy and checking if peers are behaving correctly. Is that a fair approximation?

17:56

<stickies-v> pablomartin dergoegge yeah but call graphs are function/class based, so you'd have to look at a lot of charts to figure out what calls into net_processing?

17:56

<dergoegge> stickies-v: yea that sounds about right. Do you mean mempool policy?

17:57

<stickies-v> cool. No I mean p2p policy, e.g. not exceeding the number of unconnected headers etc. I don't think this touches mempool policy at all, right?

17:57

<dergoegge> stickies-v: yea the doxygen graphs are not perfect but in case of the addcon thread it would help

17:58

<dergoegge> stickies-v: OK good! It has no mempool policy state (i hope)

17:58

<amovfx> Is node state only for the operators node?

17:58

<dergoegge> Lets skip Q. 5: Why does the PR rename nUnconnectingHeaders and MAX_UNCONNECTING_HEADERS?

17:58

<LarryRuane> amovfx: no it's for each of our peers

17:59

<LarryRuane> dergoegge: because Marco suggested it :) https://github.com/bitcoin/bitcoin/pull/26140#discussion_r1003303992

17:59

<amovfx> ty

17:59

<pablomartin> dergoegge: about... where to store new peer-peer state? CNode (defined in net.h, used by m_nodes(CConnman) and covered by m_nodes_mutex) is concerned with the connection state of the peer.

18:00

<LarryRuane> but seriously, is the rename because it's more accurate to include "msgs" in the name, since it's a count of headers *messages*? (i'm not sure)

18:00

<dergoegge> LarryRuane: yes but the renaming is useful because the variable tracks the number of "headers" messages that did not connect, not the number of headers that didn't connect

18:00

<dergoegge> LarryRuane: yes

18:00

<LarryRuane> dergoegge: +1 that's what i was thinking

18:00

<dergoegge> #endmeeting