Protect addnode peers during IBD (p2p)

https://github.com/bitcoin/bitcoin/pull/32051

Host: jonatack  -  PR author: jonatack

The PR branch HEAD was 93b07997e9a38523f5ab850aa32ca57983fd2552 at the time of this review club meeting.

Notes

Motivation

While doing initial block download (IBD) over a fluctuating and slow internet connection in El Salvador, I observed very frequent peer disconnections in the debug log, on the order of 100+ per hour. These disconnections were often of manually added “addnode” peers, and logged as Peer is stalling block download, disconnecting <peer>. Ping requests to these peers often took 20-100 seconds.

Even after IBD was completed, addnode peer disconnections still happened:

Timeout downloading block <hex>, disconnecting <peer>

Discussion

When an addnode peer is disconnected by the IBD headers/blocks download timeout or stalling logic, ThreadOpenAddedConnections attempts to immediately reconnect it – unless “onetry” was passed to the addnode RPC – up to the limit of 8 addnode connections. This limit is separate from the regular peer connection limits.

ThreadOpenAddedConnections will continue to attempt reconnection of the disconnected addnode peer until it succeeds.

When these disconnection/reconnection cycles happen frequently with addnode peers, it is likely network, resource and time intensive. This is particularly true for I2P peers, as these involve destroying and rebuilding 2 tunnels for each peer connection. It seems worth avoiding this if it is straightforward to do so.

Automatic (non-addnode) peers are also disconnected by the same logic, but they are a different category and case (non-protected peers, no immediate connection/reconnection) that would require monitoring over time to adjust the timeouts accordingly. Martin Zumsande was looking into optimizing this (see https://bitcoin-irc.chaincode.com/bitcoin-core-dev/2025-01-22#1083993): “The challenge is to distinguish this situation from making things worse for fast/reliable connections that just have some slow peers which should be disconnected.”

The goal of this pull request is thus to avoid unnecessary frequent disconnections and immediate reconnections of addnode peers, both during IBD and afterwards.

Approach

  1. The first commit, “p2p: protect addnode peers during IBD”, provides addnode peers the max BLOCK_STALLING_TIMEOUT_MAX value of 64 seconds for the IBD stalling logic (“Peer is stalling block download”) in src/net_processing.cpp.

  2. The second commit, “p2p: don’t disconnect addnode peers for block download timeout”, proposes to protect addnode peers from disconnection. Review feedback suggested that we also clear their block requests, so that these blocks can be requested from other peers.

  3. The third commit, “p2p: don’t disconnect addnode peers for slow initial-headers-sync”, proposes the same protection for addnode peers that we currently already provide to peers with NetPermissionFlags::NoBan permission.

  4. The fourth commit, “rpc, doc: update addnode documentation”, updates the RPC addnode help documentation.

Questions

  1. Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK? What was your review approach?

  2. What is an addnode peer? How can you specify them to your node?

  3. Do you provide addnode peers to the node(s) that you run? Why or why not? What kind of peers do you choose?

  4. What is a fundamental protection that addnode peers can provide to your node?

  5. What do you think of the review suggestion to rate peers and use that to scale the number of blocks we request from them?

  6. One reviewer suggested clearing block requests instead of disconnecting peers. How would you implement this?

  7. How is the test coverage for the code affected by this change? Can you think of any tests that would be worthwhile adding?

Erratum

Attendee stringintech correctly identified that the addnode code was introduced by Satoshi, namely in commit e66ec79b.

Meeting Log

  117:00 <jonatack> #startmeeting
  217:00 <corebot> jonatack: Meeting started at 2025-05-28T17:00+0000
  317:00 <corebot> jonatack: Current chairs: jonatack
  417:00 <corebot> jonatack: Useful commands: #action #info #idea #link #topic #motion #vote #close #endmeeting
  517:00 <corebot> jonatack: See also: https://hcoop-meetbot.readthedocs.io/en/stable/
  617:00 <corebot> jonatack: Participants should now identify themselves with '#here' or with an alias like '#here FirstLast'
  717:00 <jonatack> Hi and welcome to this week's review club!
  817:00 <stringintech> Hi!
  917:00 <monlovesmango> hello :)
 1017:00 <yuvic> hi
 1117:00 <jonatack> Today we'll be discussing "p2p: protect addnode peers during IBD"
 1217:00 <enochazariah> hello
 1317:00 <jonatack> Review club url:
 1417:01 <jonatack> https://bitcoincore.reviews/32051
 1517:01 <jonatack> Bitcoin Core PR url:
 1617:01 <jonatack> https://github.com/bitcoin/bitcoin/pull/32051
 1717:01 <jonatack> Anyone here for the first time? Feel free to say hi, even if you're only observing.
 1817:01 <jonatack> This discussion is ad hoc and asynchronous, so feel free to continue conversation on previous questions when we move on, or raise any other questions or thoughts you have.
 1917:02 <jonatack> To get the convo warmed up: Anyone know who originally introduced the addnode code into Bitcoin Core?
 2017:03 <jonatack> stringintech: monlovesmango: yuvic: enochazariah: welcome!
 2117:04 <monlovesmango> no idea..
 2217:04 <yuvic> maybe laanwj?
 2317:04 <enochazariah> i do not know that
 2417:04 <stringintech> satoshi? seeing him in addnode commits in git log :))
 2517:05 <jonatack> :D if my memory serves, it was gregory maxwell ("gmax" in the poll at https://x.com/jonatack/status/1927768398630387973)
 2617:05 <jonatack> So
 2717:05 <jonatack> an interesting observation, I have found, from out in the field in Central America,
 2817:05 <jonatack> is that bitcoind in general is quite robust in dealing with a hostile environment of poor or intermittent internet connection
 2917:05 <jonatack> e.g. where browsing the internet might be painfully slow or no longer really viable, but your bitcoind node survives/thrives quite well
 3017:06 <jonatack> in contrast
 3117:06 <jonatack> the stalling/timeout logic seems less tolerant and adapted, to a slow hostile environment
 3217:06 <jonatack> as it lacks the ability to monitor and adapt accordingly
 3317:06 <jonatack> which, as noted in the notes, mzumsande was looking at improving
 3417:07 <jonatack> still, I noticed the disconnections were heavily affecting some of my addnode peers that were being targeted for no fault of their own, some of which were low latency (like cjdns peers) or medium/higher latency (like tor and i2p peers)
 3517:07 <jonatack> which motivated this pull request.
 3617:07 <jonatack> Did anyone get the chance to review the notes and/or PR (y/n)?
 3717:07 <stringintech> y
 3817:07 <monlovesmango> yes
 3917:07 <yuvic> y
 4017:07 <enochazariah> yes
 4117:07 <jonatack> excellent
 4217:08 <jonatack> I'll add some bonus questions about the code, but let's begin with the more general ones
 4317:08 <jonatack> 1. What is an addnode peer? How can you specify them to your node?
 4417:09 <jonatack> (this is a practical question for anyone who runs a node)
 4517:09 <yuvic> peer which we manually add or connect, using addnode rpc or -addnode/-connect config.
 4617:10 <jonatack> yes! and why would you want to do that, as a node runner
 4717:10 <enochazariah> peer that a node can connect to, it can be speified to the node by using the command line
 4817:11 <jonatack> right, either via CLI/RPC addnode (onetry for a one-shot attempt, or "add" to add it to the addnode list)
 4917:11 <monlovesmango> so you can connect to a peer that you know to be honest
 5017:11 <yuvic> as I trust that peer or also to sync up faster during IBD
 5117:11 <stringintech> to maintain connections to nodes I trust in case for example the core peer selection fails for some reason
 5217:11 <enochazariah> I think a reason why someone would want to add node to add more trust in the network, verify don't trust
 5317:11 <jonatack> or in your bitcoin.conf file with addnode=<peer> ... one per line
 5417:12 <jonatack> So: to ensure a connection to a *trusted* peer
 5517:12 <jonatack> What is a fundamental protection that a trusted peer can provide to your node?
 5617:12 <jonatack> assuming they are an honest peer becaue you know/trust them
 5717:13 <stringintech> it can help prevent the node from being isolated from the rest of the network (not having the knowledge of the best chain anymore) by malicious peers
 5817:13 <yuvic> yes to sync with the best chain
 5917:13 <monlovesmango> would probably provide protection against a sybil attack
 6017:13 <enochazariah> protection from isolation
 6117:14 <jonatack> right!
 6217:14 <jonatack> https://river.com/learn/terms/e/eclipse-attack/
 6317:14 <jonatack> "An eclipse attack targets particular nodes in a network by surrounding them and obscuring their view of the entire network. For example, if a Bitcoin node has eight connections to other nodes, and an attacker controls all eight of those nodes, the attacker can refuse to relay any new blocks that miners produce. Although the rest of the network continues to process new blocks, the
 6417:14 <jonatack> victim node will be unaware that blocks are coming in."
 6517:14 <jonatack> It only takes one single honest peer connection to break out of an eclipe attack
 6617:14 <monlovesmango> ah so there is a term for it hah
 6717:15 <jonatack> yes
 6817:15 <yuvic> yes eclipse attack
 6917:15 <jonatack> So, by adding a trusted peer, your node cannot be successfully eclipsed unless your addnode peer connections are also eclipsed
 7017:17 <stringintech> then if we are choosing more than one addnode peer, they should be also geographically diverse so that for example if one region is compromised, our node can still maintain connections to rest of the network...
 7117:17 <jonatack> Therefore, it's a good idea to add some trusted peers using the addnode config option or rpc/cli
 7217:17 <jonatack> stringintech: sgtm
 7317:17 <jonatack> You can have up to 8 addnode peer connections simultaneously
 7417:18 <jonatack> In addition to the limit of 8 autamatic full outbound conns and 2 block-relay-only ones
 7517:18 <jonatack> (along with transient connections, like feelers or extra block-relay-only conns
 7617:19 <jonatack> or addr_fetch ones)
 7717:19 <jonatack> see src/node/connection_types.h for details
 7817:19 <stringintech> Thanks for the details
 7917:19 <jonatack> Now, onto the code
 8017:20 <jonatack> What function contains the Bitcoin Core stalling and timeout logic?
 8117:20 <stringintech> PeerManagerImpl::SendMessages()
 8217:20 <jonatack> this wasn't a question to prepare in advance, but you can search the codebase right now if you like
 8317:20 <enochazariah> SendMessages
 8417:20 <enochazariah> in PeerManagerImpl
 8517:20 <jonatack> excellent
 8617:20 <yuvic> SendMessages
 8717:21 <jonatack> From where is SendMessages() called?
 8817:21 <jonatack> (who it its caller)
 8917:21 <stringintech> connection manager
 9017:22 <jonatack> stringintech: can you elaborate?
 9117:22 <jonatack> it is called from a thread, currently inside src/net.cpp
 9217:23 <stringintech> I should go back for source for detail
 9317:23 <stringintech> but I guess we would loop over peers
 9417:23 <stringintech> can call this after ProcessMessages
 9517:23 <enochazariah> Not sure, but i think THe ThreadMessageHandler is the method that calls the SendMessages
 9617:23 <jonatack> correct for both
 9717:23 <enochazariah> *SendMessage
 9817:23 <jonatack> void CConnman::ThreadMessageHandler()
 9917:24 <jonatack> that calls ProcessMessages() and then SendMessages() for each peer, if not flagged for disconnection
10017:25 <jonatack> Now, on to the PR
10117:25 <jonatack> I was pleasantly surprised by pinheadz's review here https://github.com/bitcoin/bitcoin/pull/32051#pullrequestreview-2864676242
10217:26 <jonatack> "I believe this also closes an 11-year-old issue: https://github.com/bitcoin/bitcoin/issues/5097&quot;
10317:27 <yuvic> yes that was interesting
10417:27 <jonatack> and also:
10517:27 <jonatack> https://github.com/bitcoin/bitcoin/issues/9213
10617:27 <jonatack> that I need to read and look into TBH
10717:27 <stringintech> Nice!
10817:28 <jonatack> (as well as the review comments in https://github.com/bitcoin/bitcoin/pull/25880)
10917:29 <jonatack> Back to this PR
11017:29 <jonatack> The PR as it is currently, is actually not well-named
11117:29 <jonatack> because only the first commit affects the IBD issues that I observed
11217:29 <jonatack> that first commit gives addnode peers more time. apart from that, it doesn't protect them from disconnection.
11317:30 <stringintech> I had a question: Is the PR intentionally focusing on “-addnode” peers and not “-connect” peers?
11417:30 <jonatack> I'd be curious to hear if you have any thoughts or suggestions on the changes
11517:31 <jonatack> stringintech: good question. Yes, I was focusing on addnode peers.
11617:31 <monlovesmango> have you been able to test disconnection frequency after increasing timeout allowed?
11717:31 <jonatack> Making a note to verify the effect on a -connect peer.
11817:32 <jonatack> monlovesmango: yes, I saw much fewer disconnections with the ping times of my peers with my internet speed
11917:33 <jonatack> from 100 or more per minute to a few an hour
12017:33 <monlovesmango> nice
12117:33 <jonatack> it still took more than a month for that node to sync...
12217:34 <yuvic> I had a similar question as mzumsande's comment on the pr -> https://github.com/bitcoin/bitcoin/pull/32051/commits/3463a7f4813c3eece5ba9a260670a76e3f8d38ab#r1999313868
12317:34 <jonatack> like 5 weeks, but all those addnode disconnections were not helping, as those peers were being re-connected right away again afterward
12417:34 <monlovesmango> if it isn't actually protect them from disconnection then I would say I concept ACK and approach ACK. but if it is protecting from disconnection then I think there needs to be more thought into what desired IBD behavior would be
12517:34 <monlovesmango> jonatack: omg hahah
12617:36 <jonatack> yuvic: yes, the high frequency of disconnections I was seeing during IBD were not of that logic
12717:37 <jonatack> those are disconnections that occur infrequently (for me) after IBD is completed
12817:38 <yuvic> got it!
12917:38 <jonatack> I need to consider whether to potentially drop the second (and maybe third) commit to keep it focused on IBD only, when the high number of disconnections I was seeing took place
13017:39 <jonatack> or, alternatively, try to implement his review suggestion to clear the block requests
13117:39 <enochazariah> I've got a bit of a question
13217:39 <enochazariah> does this not raise up a silent stall? i mean, if the system does not have the inhererent mechanism to re-request that block from another peer, then the IBD could effectively stall, preventing a disconnection, but introducing a silent stall
13317:41 <jonatack> enochazariah: I need to look at that (test coverage could be useful, as well, but maybe non-trivial to do, and only if it is reliable)
13417:42 <enochazariah> okay
13517:42 <jonatack> The review comments turned up valuable history on this that I need to review.
13617:43 <jonatack> I didn't necessarily expect the PR to be merged quickly as-is, but hoped to gain insight as to what would be best.
13717:44 <jonatack> A long-term goal since years that comes up now and then in developer discussions
13817:44 <jonatack> is how to score peers based on the resources they consume
13917:45 <jonatack> see, for instance: https://github.com/bitcoin/bitcoin/pull/31672
14017:45 <yuvic> there was an issue for this
14117:45 <yuvic> yes by vasild
14217:46 <jonatack> yuvic: yes. how to measure this an open question.
14317:47 <jonatack> i'm not sure how useful the cpu load is, as I have been testing it, and the load seems to often be higher when the peer is first connected, and then go down to normal levels
14417:47 <jonatack> and not necessarily indicate a bad peer
14517:47 <jonatack> (to be continued)
14617:48 <jonatack> i haven't seen it yet as useful -- on its own -- to qualify a peer to be disconnected
14717:49 <jonatack> this perhaps connects to aj towns' review suggestion about scoring peers and using that to scale the number of blocks we request from them
14817:50 <jonatack> If anyone comes up with test coverage for the stalling or timeout logic here, I'd be very happy to look at it and bring it into the PR
14917:51 <enochazariah> scoring them and using as an order, so a much higher score would mean higher chances of being requested
15017:51 <jonatack> right now, the changes in this PR do not break any tests...
15117:52 <stringintech> Regarding the timeout logics this PR touches, I could only find p2p_ibd_stalling.py, which covers the IBD block stalling timeout (and should possibly be adapted to reflect the addnode changes). I didn't find any integration tests for the initial header sync timeout and regular block download timeout. Am I right??
15217:52 <yuvic> yes, test coverage would be interesting
15317:53 <jonatack> stringintech: neat, maybe addnode connections could be added to that functional test file
15417:54 <jonatack> stringintech: as for the header sync and block download, I have not yet looked
15517:55 <stringintech> Hmm... I'd be happy to work on the missing ones (regardless of the PR changes) and open a PR (in case it is out of the scope for this PR of course). If it merges first, the addnode PR could adapt them accordingly too.
15617:55 <stringintech> Have to double check to see if they are actually missing
15717:55 <jonatack> stringintech: that would be great!
15817:56 <jonatack> please ping me (on the github PR, or via IRC or DM on twitter/x) if you come up with coverage
15917:56 <enochazariah> stringintech that would be nice
16017:57 <jonatack> any final thoughts or questions?
16117:57 <jonatack> 2 minutes left
16217:58 <yuvic> thanks, nothing from my side!
16317:58 <enochazariah> Nothing from my end
16417:58 <jonatack> Appreciate you all participating!
16517:59 <jonatack> Don't hesitate to leave a review comment or feedback on that PR or propose test coverage or improvements to it there
16617:59 <enochazariah> Thank you jonatack
16717:59 <stringintech> Thank you jonatack!
16817:59 <jonatack> Thank you!
16917:59 <jonatack> #endmeeting