The PR branch HEAD was 2cfa278 at the time of this review club meeting.
We can think of the Bitcoin Core node as a stack with three levels:
- Net - keeps track of potential nodes to peer with, handles active
  connections to nodes, and reads and writes messages to the wire. The main
  class in the net layer is CConnman, which is defined in …. Net hands
  messages up the stack to …
- Net Processing - reads and processes application-level messages, and
  handles the application-level view of peers. The main class in net
  processing is PeerManager, which is declared in …. Net processing uses
  function calls like ProcessNewBlock() and AcceptToMemoryPool() to call
  into …
- Validation - is responsible for our view of the block chain and the UTXO
  set. It may (or may not) include a mempool for unconfirmed transactions.
  Various classes and functions are exposed to the rest of the node. PR
  20158 and linked PRs are an attempt to clarify and rationalize the
  interface to validation.
In early versions of Bitcoin, this distinction wasn’t as clear. The first
commit in the
Bitcoin Core repository has a separate net.cpp file, but net processing and
validation are mixed in the main.cpp file. It was only in PR
9260 in 2016 that net processing
and validation were split into separate files.
Even today, the division between net, net processing and validation is not as
clear as we’d like. Clarifying and rationalizing these interfaces is an
ongoing project. See, for example, PR
15141, which moves the denial
of service logic from validation into net processing, PR
20158, which rationalizes the
interface to validation, and issue
19398, which moves
application-layer data and logic from net into net processing.
The high level goal of issue 19398 is to split responsibilities cleanly between
CConnman and PeerManager:
- CConnman should be responsible for connections - accepting incoming
  and opening outbound connections, opening and closing sockets, reading and
  writing data on those sockets, and first-line-of-defence validation of …
- PeerManager should be responsible for peers - parsing and acting on
  p2p messages from peers, storing state relating to those peers, and
  deciding when to send messages to those peers.
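The division of labour above can be sketched in miniature. This is an illustrative toy, not Bitcoin Core's actual interfaces: the class and method names (PeerLayer, ProcessMessage, MessagesFrom) are invented for the example. The point is only that protocol decisions and per-peer state live in the peer layer, while the connection layer would deal purely in sockets and bytes:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

using NodeId = int64_t;

// Application layer sketch: owns per-peer protocol state and decides
// what to send back. A connection layer (not shown) would own the
// sockets and call ProcessMessage() for each complete message it parses.
class PeerLayer {
public:
    // Returns the reply message type to send, or "" for none.
    std::string ProcessMessage(NodeId id, const std::string& msg_type) {
        ++m_messages_from[id];                  // per-peer application state
        if (msg_type == "ping") return "pong";  // protocol decision lives here
        return "";
    }
    int MessagesFrom(NodeId id) const {
        auto it = m_messages_from.find(id);
        return it == m_messages_from.end() ? 0 : it->second;
    }
private:
    std::map<NodeId, int> m_messages_from;
};
```

The connection layer never needs to know what a "ping" means; it only ferries bytes and asks the peer layer what, if anything, to reply.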
Today’s PR, PR 20721, is
part of the larger project tracked by issue 19398. It moves the ping data
from net into net processing.
The ping/pong message exchange acts as a keepalive between peers on the
peer-to-peer network, and also measures round-trip latency. Lowest ping time
and most recent ping time are presented to users in the GUI and RPC interface,
and lowest ping time is also used by the net layer in its peer eviction logic.
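The mechanics of that measurement can be sketched as follows. This is a simplified stand-in, assuming invented names (PingTracker, PingSent, PongReceived), not Bitcoin Core's real members: a nonce and a start time are recorded when the ping goes out, and only a pong echoing the same nonce completes the round trip and updates the best-ever latency:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <optional>

// Toy sketch of ping/pong round-trip measurement.
class PingTracker {
public:
    // Called when we send a ping carrying a random nonce.
    void PingSent(uint64_t nonce, std::chrono::microseconds now) {
        m_nonce_sent = nonce;
        m_ping_start = now;
    }

    // Called when a pong arrives. Returns the round-trip time if the
    // nonce matches the outstanding ping, otherwise std::nullopt.
    std::optional<std::chrono::microseconds> PongReceived(
            uint64_t nonce, std::chrono::microseconds now) {
        if (!m_nonce_sent || nonce != *m_nonce_sent) return std::nullopt;
        const auto rtt = now - m_ping_start;
        m_nonce_sent.reset();  // the ping is no longer outstanding
        if (rtt < m_min_ping_time) m_min_ping_time = rtt;  // best-ever RTT
        return rtt;
    }

    std::chrono::microseconds MinPingTime() const { return m_min_ping_time; }

private:
    std::optional<uint64_t> m_nonce_sent;
    std::chrono::microseconds m_ping_start{0};
    std::chrono::microseconds m_min_ping_time{std::chrono::microseconds::max()};
};
```

The nonce check is what lets a peer's stale or mismatched pong be ignored rather than corrupt the latency statistics.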
Some of the data members (nPingNonceSent, m_ping_start and fPingQueued) are
only ever used by net processing, so they are obvious candidates for moving
into the ….
nMinPingUsecTime is updated by net processing, but is used by net in its
inbound node eviction logic.
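Because net processing writes this value on the message-handler thread while net reads it from the eviction logic, it has to be safe to share across threads. One common pattern, sketched here with an invented struct and member name rather than CNode's actual code, is an atomic minimum updated with a compare-exchange loop:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <limits>

// Toy sketch of a lock-free "minimum ping time" shared between layers.
struct NodeStats {
    std::atomic<int64_t> m_min_ping_usec{std::numeric_limits<int64_t>::max()};

    // Net processing calls this when a valid pong completes a round trip.
    void RecordPing(int64_t rtt_usec) {
        int64_t cur = m_min_ping_usec.load();
        // Lower the stored minimum; on CAS failure, cur is reloaded and
        // the comparison is retried against the newer value.
        while (rtt_usec < cur &&
               !m_min_ping_usec.compare_exchange_weak(cur, rtt_usec)) {}
    }
};
```

The eviction logic can then `load()` the minimum at any time without taking a lock.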
What’s the difference between system time and mockable time? When is
it appropriate to use one or the other?
Why do we have a ping timeout? Why not just rely on the send buffer timeout and
receive buffer timeout?
Why do certain tests need to be updated in the second commit, "Use
-pingtimeout in tests that might have pings time out"? Why could pings start
timing out when the ping timeout logic is moved into net processing?
The first commit, "Add -pingtimeout to configure ping timeout", adds a new
configuration option. Is this desirable? When should we add new
configuration options, and when should we avoid adding new options?
If the goal of the PR is to move ping data into net processing, why is there
still a PingReceived function in CNode (in net) after this PR?
<dhruvm> The PR is trying to further separate connection and application layer logic. Bitcoin Core uses ping/pong p2p messages to maintain data about connection health. p2p logic belongs in the application layer, so this PR moves the associated variables and logic there.
<michaelfolkson> Yeah Concept ACK. I'm always interested in what makes it into net, net processing, validation etc because it makes sense logically or because it is convenient from codebase perspective
<jnewbery> great! It sounds like you've all got the high-level concept. I agree that it's a bit strange to think about whether ping belongs in net or net processing. We'll dive into the details in a bit.
<dhruvm> System time asks the OS for the time. Mockable time uses a mock time value if available, if not, uses system time. Mockable time can make tests more deterministic. However, mockable time does not tick forward so if the functionality being tested needs that, we have to `setmocktime` repeatedly.
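The pattern dhruvm describes can be sketched roughly like this. The function names (SetMockTime, GetMockableTime) and the use of 0 as the "no mock" sentinel are illustrative assumptions, not Bitcoin Core's exact implementation: a global mock value, settable by tests (e.g. via a setmocktime RPC), overrides the system clock when present:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

// Toy sketch of mockable time: tests freeze the clock at a chosen value.
std::atomic<int64_t> g_mock_time{0};  // 0 means "no mock time set"

void SetMockTime(int64_t t) { g_mock_time.store(t); }

int64_t GetMockableTime() {
    const int64_t mock = g_mock_time.load();
    if (mock != 0) return mock;  // frozen until a test changes it again
    return std::chrono::duration_cast<std::chrono::seconds>(
        std::chrono::system_clock::now().time_since_epoch()).count();
}
```

This also shows the caveat from the discussion: the mocked clock does not tick forward on its own, so a test that needs time to advance must call SetMockTime repeatedly.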
<pinheadmz> i was thinking nack at first, since keeping connections alive seems like a low level (net) function but since `ping` is defined as a bitcoin p2p protocol message I changed my mind, it does belong up in the application level
<amiti> so the idea is if the low level network stuff could be mocked, when we set mocktime to try to test higher level stuff, the low level stuff wouldn't work as expected? ... so we shouldn't use mocktime too far down?
<jnewbery> felixweis: if I move a data member, I'll add a scripted diff at the end to make the naming comply with current code style. All of the lines that the variable touches are being updated anyway, so it's no additional churn
<glozow> sorree, I was trying to answer the question "why do ping timeouts at all" and my guess was that if we didn't do that, we could accidentally be conflating "nothing to talk about" with "inactivity"
<jnewbery> so the ping timeout does seem at first glance to be slightly redundant. If the receive buffer timeout is hit (20 minutes of inactivity) then, by definition, we can't have received a pong in 20 minutes.
<jnewbery> 3. Why do certain tests need to be updated in the second commit Use -pingtimeout in tests that might have pings time out? Why could pings start timing out when the ping timeout logic is moved into net processing?
<AnthonyRonning90> Instead of only checking ping timeout when `InactivityCheck` runs & the peer has been connected longer than the threshold, pings timeouts are now checked during the `SendMessage` flow. Tests before would not have hit the 20 minute default, but now hit the ping check every `SendMessage`.
<dhruvm> nehan: The redundancy is useful although not needed iiuc. For example, if we wanted to remove pings in a future protocol version, we could because the transport layer is still checking for connection health?
<jnewbery> if a test runs for less than 60 seconds, none of the inactivity logic is exercised at all (before this PR). After this PR, the ping timeout logic is exercised as soon as the connection is fully up (on receipt of the verack)
<jnewbery> nehan: your instinct is right. If the application layer is exchanging pings with a peer, then the lower layer connection *must* be healthy. However, I think this is just good defense in depth and separation of responsibilities: connman is responsible for maintaining good connections, and peerman is responsible for maintaining good peers
<nehan> part of the reason i bring this up is because of the challenge of which times to mock. that seems pretty hard and not at all "settled"! it would be very tricky to make sure you're mocking/updating all the clocks appropriately
<glozow> for the "why could pings start timing out when the ping timeout logic is moved" question, I thought it boiled down to the fact that it's no longer preceded by the `m_peer_connect_timeout` check? so then even the `-peertimeout` config option wouldn't suffice
<jnewbery> 4. The first commit, Add -pingtimeout to configure ping timeout, adds a new configuration option. Is this desirable? When should we add new configuration options, and when should we avoid adding new options?
<glozow> Adding it made sense to me since it was debug-only and existing `-peertimeout` doesn't work. idk what the good practice would be, interested in hearing about that. However, since the only option ever used is pretty magic-number-y, `-pingtimeout=9999` how come it isn't just a flag option to extend/ignore the ping timeout?
<jnewbery> lightlike: If it makes testing specific functionality easier and there's no other way to do that, then maybe. That needs to be weighed up against the confusion of having multiple timestamps. It's already a bit of a mess
<nehan> lightlike: i worry it might be tricky to figure out how to set them all appropriately. like maybe you want to fast forward a few days in blocktime, but don't want to trigger inactivity checks. and maybe something else besides inactivity checks will come to rely on net time being accurate
<jnewbery> AnthonyRonning90: It could. But logically I don't think that makes sense. We want pings to be sent/tested as soon as we're fully connected. I don't like adding special case logic to the production code to work around difficulties in our testing
<jnewbery> pinheadmz: your earlier comment about setting a high pingtimeout for low bandwidth connections is interesting, but I think if you're not receiving ping responses in 20 minutes, you're probably not going to have a good time :)
<jnewbery> pinheadmz: before this PR, there were two default timeouts: DEFAULT_PEER_CONNECT_TIMEOUT (60 seconds) - we don't do any inactivity timeout checks until the peer has been connected for this long. TIMEOUT_INTERVAL (20 minutes) - we'll disconnect a peer if there's no activity on their send/recv buffers or they haven't sent a pong for this long
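The two timeouts jnewbery lists compose roughly as follows. The constants match the values in his message; the function shape and parameter names are invented for illustration and do not mirror CConnman's actual InactivityCheck:

```cpp
#include <cassert>
#include <cstdint>

constexpr int64_t DEFAULT_PEER_CONNECT_TIMEOUT = 60;  // seconds of grace
constexpr int64_t TIMEOUT_INTERVAL = 20 * 60;         // 20 minutes, in seconds

// Toy sketch of the pre-PR inactivity logic: no checks at all during the
// initial grace period, then disconnect after 20 minutes of send/recv
// silence or an unanswered ping. All times are unix timestamps in seconds.
bool ShouldDisconnect(int64_t now, int64_t connected_at,
                      int64_t last_send, int64_t last_recv,
                      int64_t ping_sent /*0 = no ping outstanding*/) {
    if (now - connected_at < DEFAULT_PEER_CONNECT_TIMEOUT) return false;
    if (now - last_send > TIMEOUT_INTERVAL) return true;   // send buffer idle
    if (now - last_recv > TIMEOUT_INTERVAL) return true;   // recv buffer idle
    if (ping_sent != 0 && now - ping_sent > TIMEOUT_INTERVAL) return true;
    return false;                                          // pong overdue above
}
```

This makes the test-breakage discussion concrete: under the old structure nothing here fires during a short test run, because the grace period alone outlasts most tests.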
<AnthonyRonning> the `CNode::m_min_ping_time` is used during `CConnman` eviction logic, so the ping processing data still needs to make it through to `CNode` (as accomplished via `CNode::PingReceived`).