While doing initial block download (IBD) over a fluctuating and slow internet
connection in El Salvador, I observed very frequent peer disconnections in the
debug log, on the order of 100+ per hour. These disconnections were often of
manually added “addnode” peers, and logged as Peer is stalling block download,
disconnecting <peer>. Ping requests to these peers often took 20-100 seconds.
Even after IBD was completed, addnode peer disconnections still happened:
When an addnode peer is disconnected by the IBD headers/blocks download timeout
or stalling logic, ThreadOpenAddedConnections attempts to immediately
reconnect it – unless “onetry” was passed to the addnode RPC – up to the limit
of 8 addnode connections. This limit is separate from the regular peer
connection limits.
ThreadOpenAddedConnections will continue to attempt reconnection of the
disconnected addnode peer until it succeeds.
When these disconnection/reconnection cycles happen frequently with
addnode peers, it is likely network, resource and time intensive.
This is particularly true for I2P peers, as these involve destroying and
rebuilding 2 tunnels for each peer connection. It seems worth avoiding this if
it is straightforward to do so.
Automatic (non-addnode) peers are also disconnected by the same logic, but they
are a different category and case (non-protected peers, no immediate
connection/reconnection) that would require monitoring over time to adjust the
timeouts accordingly. Martin Zumsande was looking into optimizing this (see
https://bitcoin-irc.chaincode.com/bitcoin-core-dev/2025-01-22#1083993): “The
challenge is to distinguish this situation from making things worse for
fast/reliable connections that just have some slow peers which should be
disconnected.”
The goal of this pull request is thus to avoid unnecessary frequent
disconnections and immediate reconnections of addnode peers, both during IBD and
afterwards.
Approach
The first commit, “p2p: protect addnode peers during IBD”, provides addnode
peers the max BLOCK_STALLING_TIMEOUT_MAX value of 64 seconds for the IBD
stalling logic (“Peer is stalling block download”) in src/net_processing.cpp.
The second commit, “p2p: don’t disconnect addnode peers for block download
timeout”, proposes to protect addnode peers from disconnection. Review
feedback
suggested that we also clear their block requests, so that these blocks can
be requested from other peers.
The third commit, “p2p: don’t disconnect addnode peers for slow
initial-headers-sync”, proposes the same protection for addnode peers that we
currently already provide to peers with NetPermissionFlags::NoBan
permission.
The fourth commit, “rpc, doc: update addnode documentation”, updates the RPC
addnode help documentation.
<jonatack> This discussion is ad hoc and asynchronous, so feel free to continue conversation on previous questions when we move on, or raise any other questions or thoughts you have.
<jonatack> still, I noticed the disconnections were heavily affecting some of my addnode peers that were being targeted for no fault of their own, some of which were low latency (like cjdns peers) or medium/higher latency (like tor and i2p peers)
<stringintech> it can help prevent the node from being isolated from the rest of the network (not having the knowledge of the best chain anymore) by malicious peers
<jonatack> "An eclipse attack targets particular nodes in a network by surrounding them and obscuring their view of the entire network. For example, if a Bitcoin node has eight connections to other nodes, and an attacker controls all eight of those nodes, the attacker can refuse to relay any new blocks that miners produce. Although the rest of the network continues to process new blocks, the
<stringintech> then if we are choosing more than one addnode peer, they should be also geographically diverse so that for example if one region is compromised, our node can still maintain connections to rest of the network...
<monlovesmango> if it isn't actually protect them from disconnection then I would say I concept ACK and approach ACK. but if it is protecting from disconnection then I think there needs to be more thought into what desired IBD behavior would be
<jonatack> I need to consider whether to potentially drop the second (and maybe third) commit to keep it focused on IBD only, when the high number of disconnections I was seeing took place
<enochazariah> does this not raise up a silent stall? i mean, if the system does not have the inhererent mechanism to re-request that block from another peer, then the IBD could effectively stall, preventing a disconnection, but introducing a silent stall
<jonatack> i'm not sure how useful the cpu load is, as I have been testing it, and the load seems to often be higher when the peer is first connected, and then go down to normal levels
<stringintech> Regarding the timeout logics this PR touches, I could only find p2p_ibd_stalling.py, which covers the IBD block stalling timeout (and should possibly be adapted to reflect the addnode changes). I didn't find any integration tests for the initial header sync timeout and regular block download timeout. Am I right??
<stringintech> Hmm... I'd be happy to work on the missing ones (regardless of the PR changes) and open a PR (in case it is out of the scope for this PR of course). If it merges first, the addnode PR could adapt them accordingly too.