Allow inbound whitebind connections to more aggressively evict peers when slots are full (p2p)

https://github.com/bitcoin/bitcoin/pull/27600

Host: stickies-v  -  PR author: pinheadmz

The PR branch HEAD was e71d495ff at the time of this review club meeting.

Notes

  • Recommended reading from earlier review clubs:
  • Currently when connections are full, if we receive in inbound peer request, we look for a current connection to evict so the new peer can have a slot. To find an evict-able peer we go through all our peers and “protect” multiple categories of peers, then we evict the “worst” peer that is left unprotected. If there are no peers left to evict, the inbound connection is denied.

  • With this PR, if the inbound connection is on our whitelist we still loop through all our current connection, removing protected peers from the evict-able list. However, now we keep track of the last peer we protected (aka the last peer we removed from the evict-able list). If we end up with no peers left to evict, the last protected peer is evicted instead. This should preserve the peer eviction logic but just in the case of a whitebind inbound, we evict the “worst best” connection.

Questions

  1. Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK? What was your review approach?

  2. Why does this PR only apply to inbound peer requests?

  3. What is the impact of the force parameter of SelectNodeToEvict() on the return value?

  4. In your own words, what is the mechanism by which ProtectEvictionCandidatesByRatio() protects a node from being evicted?

  5. How is the function signature of EraseLastKElements changed in this PR? Why is this necessary? Do you see an alternative approach?

  6. In SelectNodeToEvict(), is the order in which we make the various EraseLastKElements() calls important? Why (not)?

  7. EraseLastKElements used to be a templated function, but this is changed in the first commit. Why is this the case? Do you see any downsides to this change?

  8. Suppose we pass a vector of 40 eviction candidates to SelectNodeToEvict(). Before and after this PR, what’s the theoretical maximum of Tor nodes that can be protected from eviction?

Meeting Log

  117:00 <stickies-v> #startmeeting
  217:00 <LarryRuane> hi
  317:00 <sebastianvanstaa> hi
  417:00 <lightlike> hi
  517:00 <alex_wiederin> hi
  617:00 <pinheadmz> hi B-)
  717:00 <abubakarsadiq> hello
  817:00 <brunoerg> hi
  917:01 <stickies-v> welcome everyone! Today we're looking at #27600, authored by pinheadmz, who I'm delighted is here as well! The notes and questions are available on https://bitcoincore.reviews/27600
 1017:01 <stickies-v> anyone joining us for the first time today? even if you're just lurking, feel free to say hi!
 1117:02 <LarryRuane> thanks for hosting again, @stickies-v!
 1217:03 <stickies-v> quietly hoping no one is getting sick of me 🤞 if so, next week we've got abubakarsadiq taking the mic, so hang in there!
 1317:03 <stickies-v> who got the chance to review the PR or read the notes? (y/n)
 1417:03 <sebastianvanstaa> y
 1517:03 <pinheadmz> y
 1617:03 <alex_wiederin> y
 1717:03 <lightlike> y
 1817:03 <abubakarsadiq> y
 1917:03 <LarryRuane> y
 2017:04 <abubakarsadiq> read the notes
 2117:04 <stickies-v> ohhhhh, such diligence this week, nice!
 2217:04 <stickies-v> for those of you who were able to review, would you give it a Concept ACK, Approach ACK, Tested ACK, or NACK? what was your review approach?
 2317:05 <LarryRuane> "such diligence" -- it helps that this PR isn't super difficult to understand!
 2417:05 <LarryRuane> I built the branch, ran the updated functional test, looked at the individual commits
 2517:06 <sebastianvanstaa> same here
 2617:06 <abubakarsadiq> concept Ack, was lost reading linked pr review club session and concept behind the pr. run the test, but did not review the code
 2717:06 <lightlike> Concept ACK, not completely sure about the approach yet (I just started reviewing 1 hour ago)
 2817:07 <LarryRuane> I find it easier to look at the diffs from within VScode (i.e. `code` on ubuntu), as opposed to GitHub, because then I can see more context, and search symbols and such
 2917:07 <alex_wiederin> Concept Ack. Read the code and tried to understand implications for SelectNodeToEvict() and EraseLastKElements()
 3017:08 <LarryRuane> concept ACK for me as well, but I'm not familiar with peer eviction, so I'd defer to people like @lightlike for this kind of PR
 3117:08 <stickies-v> LarryRuane: yeah, agreed. Also protects us from github potentially hiding nefarious changes on the webui, when wearing the tinfoil hat
 3217:08 <LarryRuane> oh that's interesting, hadn't thought of that!
 3317:09 <stickies-v> Let's start with building some more understanding/context. Why does this PR only apply to inbound peer requests?
 3417:09 <LarryRuane> (sorry for the sidetrack) one thing I don't like about using vscode is that when looking at a diff, it doesn't allow you to "tag" (jump to the definition of the symbol your cursor is on)
 3517:10 <LarryRuane> so i have to go to the non-diff file, find the right location, and tag from there... maybe someone has a better idea?
 3617:10 <sebastianvanstaa> stickies-v for outbound connections, the contacted node decides if it wants the connection?
 3717:10 <pinheadmz> LarryRuane i use sublime text with a plugin I forked for myself: https://github.com/CJTozer/SublimeDiffView/pull/73
 3817:10 <brunoerg> Concept ACK
 3917:10 <stickies-v> LarryRuane: yeah, that is by far my biggest annoyance in reviewing code on vscode. opening the file works, but in multi-commit PRs not ideal, have to actually check out the commit too
 4017:11 <LarryRuane> stickies-v: I think because it's changing the behavior of accepting an inbound connection, so our outbounds shouldn't be affected
 4117:11 <abubakarsadiq> outbound peers are protected from eviction
 4217:11 <LarryRuane> pinheadmz: thanks, i'll check that out!
 4317:12 <stickies-v> sebastianvanstaa: indeed, from the outbound node's perspective, we're the inbound, and we could get evicted
 4417:12 <lightlike> abubakarsadiq: outobund peers can be evicted too! but there is a completely seperate algorithm for this
 4517:12 <LarryRuane> abubakarsadiq: I was wondering about that ... there are times where we evict an outbound, aren't there? I'm not sure, but if it's not performing well?
 4617:12 <sebastianvanstaa> yes, when it behaves badly
 4717:13 <LarryRuane> lightlike: thanks (you beat me to it)
 4817:13 <sebastianvanstaa> there is a scoring system for that
 4917:14 <abubakarsadiq> thanks lightlike, LarryRuane.
 5017:14 <LarryRuane> i know that during IBD, if there's a peer who's holding us up (we're like 1000 blocks ahead of that peer), then we can kick it (but I forget the details)
 5117:14 <lightlike> abubakarsadiq: outbound peers get evicted if the don't keep up with the best chain or we have too many of them, see ConsiderEviction() https://github.com/bitcoin/bitcoin/blob/594f05db19fa2eaf6705f13bb0e147bce6ac21e5/src/net_processing.cpp#LL4955C6-L4955C27
 5217:14 <stickies-v> outbound peers are irrelevant here because we choose our outbounds, whereas for inbounds we are connected to, so we don't really control who fills up our slots (although we do filter, but it's more of a passive thing). but in some cases (e.g. with whitebind), we want to make sure we keep some space for certain nodes
 5317:15 <stickies-v> lightlike: there are a lot of ways an outbound can get evicted right? quite a few protocol violations also lead to this, I think?
 5417:15 <pinheadmz> theres also this idea of "churn" -- like just the idea that we evict any current peer when an inbound request comes in si interesting
 5517:15 <pinheadmz> it prevents aggresive eclipse attacks
 5617:16 <abubakarsadiq> thanks lightlike
 5717:16 <LarryRuane> stickies-v: lightlike: isn't there some kind of "ban" score? so if a peer misbehaves a little bit, we decrement its score (from 100), and only if it reaches zero do we kick it?
 5817:16 <stickies-v> pinheadmz: oh yeah, great point! we don't blindly always want to evict a peer whenever a new node tries to connect to us, because that would make it easy for an attacker to fill up all our (inbound) slots
 5917:17 <pinheadmz> as also dont want our slots to be filled up with the same peers forever, so even if we are full we allow some churn
 6017:17 <pinheadmz> s/as/we
 6117:17 <lightlike> stickies-v: yes - same for inbound peers. But I think these kind of misbehavior-base disconnections/bans aren't usually called "eviction"
 6217:17 <sebastianvanstaa> LarryRuane yes, but the banscore increases with time, until threshold reached
 6317:18 <stickies-v> ohh, I see. eviction more relates to when the peer isn't really misbehaving, but we still want to kick them out anyway?
 6417:18 <LarryRuane> lightlike: ah interesting, so the term eviction sort of implies well-behaved peers?
 6517:18 <LarryRuane> sebastianvanstaa: oh I see, thanks, that makes sense
 6617:18 <lightlike> stickies-v: yes, at least that's how I use these terms usually.
 6717:19 <sebastianvanstaa> I don't think outbounds get evicted in any case unless misbehaving
 6817:19 <stickies-v> I'm going to post the next question already, but as always - we're async, so feel free to keep discussing the previous question!
 6917:19 <sebastianvanstaa> this is also to harden it against eclipse attack
 7017:19 <stickies-v> What is the impact of the `force` parameter of `SelectNodeToEvict()` on the return value?
 7117:19 <stickies-v> (link: https://github.com/bitcoin-core-review-club/bitcoin/blob/e71d495ffbda3bc072bbaecd7580809d5087f9e6/src/node/eviction.h#L42)
 7217:20 <lightlike> sebastianvanstaa: they do if we have too many outbound peers (because we sometimes create extra ones), see EvictExtraOutboundPeers in net_processing
 7317:20 <sebastianvanstaa> lightlike cool, will check that out
 7417:21 <alex_wiederin> stickies-v: 'force' ensures that at least one is selected for eviction. Previously it would not pick one if it had already excluded all peers to protect
 7517:21 <LarryRuane> stickies-v: if our eviction candidate list is empty (everyone on it has been protected), then we normally return `nullopt`, but if force is set, then we return the least-good protected peer
 7617:21 <pinheadmz> * which could still be null!
 7717:22 <pinheadmz> if for example all our inbounds are already whitelisted
 7817:22 <abubakarsadiq> +1 larryRuane
 7917:22 <alex_wiederin> pinheadmz: gotcha!
 8017:22 <LarryRuane> pinheadmz: oh interesting! hadn't noticed that! might be worth adding a comment there
 8117:22 <pinheadmz> eviction.cpp:195
 8217:22 <pinheadmz> ProtectNoBanConnections(vEvictionCandidates);
 8317:22 <alex_wiederin> Yea, I think one of tests covers that case
 8417:22 <pinheadmz> this line does not return a "last out" for example
 8517:23 <pinheadmz> alex_wiederin yes the test tries to add additional whitebind inbounds that do NOT get accepted
 8617:24 <LarryRuane> pinheadmz: excellent work on the tests, sometimes good tests are harder to write than the code!
 8717:24 <pinheadmz> thanks i like writing tests lol and yes it can take all day to figure out how to reproduce a bug, then the fix is simple
 8817:25 <lightlike> another way to get into this situation is wehn running with a much smaller value of -maxconnections than the default (<40 or so?)
 8917:25 <LarryRuane> lightlike: good to know.. just curious, why would someone configure their node to run with a smaller maxconnections?
 9017:25 <pinheadmz> yeah and if you are expecting a lot of whitebind inbounds, you need to be conscious of that and set it higher
 9117:25 <stickies-v> pinheadmz: this makes me wonder what happens if we pass a vector containing only whitelisted candidates to `SelectNodeToEvict()`, I think that means we'll be returning a default-initialized `last_out.id`?
 9217:26 <lightlike> LarryRuane: for example to reduce traffic
 9317:26 <pinheadmz> LarryRuane just cpu/memory resources
 9417:26 <pinheadmz> stickies-v i *think* it should return nullopt
 9517:26 <LarryRuane> lightlike: oh right! like if you're on a limited bandwidth connection, hadn't thought of that, thanks
 9617:26 <pinheadmz> but i mighrve missed something
 9717:27 <stickies-v> pinheadmz: sorry, I mean when `force` is also set to `true`
 9817:27 <LarryRuane> pinheadmz: yes, also like you said, cpu / memory resources
 9917:27 <abubakarsadiq> what if you have small maxconnection and still wants the whitelisted inbound to be your peer
10017:28 <LarryRuane> abubakarsadiq: i'm guessing that's okay, as long as you have only ONE whitelisted inbound
10117:28 <LarryRuane> (or, at least, a smaller number than maxconnections)
10217:28 <stickies-v> abubakarsadiq: you just can't have more whitelisted inbound peers than the number specified in your maxconnections
10317:29 <stickies-v> increasing maxconnections seems like the straightforward solution
10417:29 <LarryRuane> basic question, is maxconnections applicable to inbound only?
10517:29 <pinheadmz> its less than that too, as i learned writing this, because inbound slots = maxconnections - 10 outbound - 1 feeler
10617:29 <stickies-v> LarryRuane: no, inbound is maxconnections minus outbound and feelers etc
10717:29 <pinheadmz> see p2p_eviction.py:49
10817:30 <stickies-v> oh beat me to it
10917:30 <LarryRuane> stickies-v: pinheadmz: thanks
11017:30 <stickies-v> In `SelectNodeToEvict()`, is the order in which we make the various `EraseLastKElements()` calls important? Why (not)?
11117:30 <LarryRuane> (and 2 of the 10 are blocks-only, IIRC)
11217:30 <lightlike> stickies-v: before or after this PR?
11317:30 <LarryRuane> *10 outbound that is
11417:31 <alex_wiederin> I believe order does make a difference. Before and after. No?
11517:31 <stickies-v> lightlike: both, sorry should have added that to the question
11617:32 <stickies-v> alex_wiederin: what is the impact of ordering them differently?
11717:33 <alex_wiederin> I believe impact depends on the number of peers that are present
11817:34 <stickies-v> I'm not sure I understand what you're saying, do you mean a different number of candidates will get protected?
11917:34 <lightlike> seems that after this PR the order is more important because the force-evicted peer would be the one from the last EraseLastKElements call
12017:35 <lightlike> before, I guess the order could have some small effects if some peers would be protected through multiple criteria, but I can't really see a systematic bias there.
12117:35 <stickies-v> lightlike: that's my understanding as well. On master, I don't think the order has any impact. After this PR, it affects which `std::optional<NodeId>` is returned, where the implicit assumption is that the "least important" protection rule needs to get executed last
12217:36 <pinheadmz> i definitely made that assumption!
12317:36 <stickies-v> oh
12417:37 <alex_wiederin> stickies-v: if say the number of current peers is 4 (which the first EraseCall covers) do the other EraseCalls take effect? Not sure I understand the EraseLast function - new to C++
12517:37 <stickies-v> wow, sorry. brain fart. I was only considering the edge case where all candidates get evicted, in which case the order does not matter on `master` (but it still does on this PR)
12617:39 <stickies-v> How is the function signature of `EraseLastKElements` changed in this PR? Why is this necessary? Do you see an alternative approach?
12717:39 <stickies-v> (link: https://github.com/bitcoin-core-review-club/bitcoin/blob/e71d495ffbda3bc072bbaecd7580809d5087f9e6/src/node/eviction.cpp#L78-L80)
12817:40 <yashraj> still confused about the answer to the previous one...
12917:40 <lightlike> I'm not sure that all the EraseLastKElements calls are ordered by importance. e.g. the "recently sent us novel blocks" seems more important/harder to game to me than having low minimum ping times.
13017:41 <pinheadmz> i guess it might make sense to specifically pick one quality (instaed of all) and for exmple "force' will just evict the worst ping time peer
13117:42 <LarryRuane> the signature changed because of the need to return the best protected peer, which we figure out as a side-effect of erasing elements
13217:44 <stickies-v> yashraj: with this PR, the order in which we execute the `EraseLastKElements` calls is important we keep overwriting `last_out` with the most recently erased candidate. does that make sense?
13317:44 <LarryRuane> stickies-v: "Do you see an alternative approach?" yes, `last` could be passed to `EraseLastKElements()` as an in-out argument (and make that function return void again),
13417:44 <lightlike> a very simple alternative approach would be to just pick a random peer before the first EraseLastKElements, and evict that one if at the end of the function there is no better peer left because every peer is protected.
13517:44 <pinheadmz> lightlike wow a random peer? i thought our eviction choise was super-sensitive
13617:44 <LarryRuane> and then the multiple instances of `if (last.has_value()) last_out = last.value();` would be eliminated
13717:45 <pinheadmz> LarryRuane that is a good option, thanks. yeah i think i do not need to modify that function as much as i did
13817:45 <pinheadmz> i actually did not know if I could "template" a return value and thats why I changed it so much
13917:45 <stickies-v> pinheadmz: but this only applies in case of `force==true`, i.e. we trust the inbound peer, so I think we can be less careful about eviction here?
14017:46 <yashraj> yes, thanks stickies-v:
14117:46 <pinheadmz> bc i also saw the function was not used in any other context
14217:46 <lightlike> pinheadmz: this would only apply when you have a whitelisted incoming peer connecting to you and all peers are protected. seems a like a very special-case scenario to me.
14317:46 <pinheadmz> stickies-v thats a good point but id still have to defer to lightlike and the p2p gurus
14417:47 <pinheadmz> right but even in that scenario random choice doesnt seem very bitcoin-core-y
14517:47 <LarryRuane> i actually like that change to eliminate the templating... I think it's confusing when code is over-generalized... if it's only instantiated one way, then why not "hard code" that way?
14617:47 <pinheadmz> oh ok
14717:47 <stickies-v> LarryRuane: that's a good alternative! generally i'm not a huge fan of out-parameters but in this case it would quite significantly clean up the code
14817:48 <pinheadmz> im sure the original author of "remove elements" designed the function to just remove elements in a generic way
14917:48 <LarryRuane> but on the other hand (arguing against myself), those `if` statements make it very clear to the reader what's going on, without having to dig into the function being called
15017:48 <stickies-v> I agree with LarryRuane about templating. Much easier to reason about what a function is doing when it's not templated, so I like to avoid it when possible
15117:49 <pinheadmz> ah but is it an unecessary code change for this PR
15217:49 <LarryRuane> pinheadmz: that's a good question, reviewers may say so ... it might help to make that change in a separate commit?
15317:50 <stickies-v> I think another approach would be to have `EraseLastKElements` return all (instead of just the last) removed elements, which imo is a bit more intuitive/general, but perhaps not worth it. And this can also be coupled with LarryRuane's idea of using an out-parameter to simplify the current code
15417:51 <stickies-v> we're nearing the end, so time for the last question for today
15517:51 <lightlike> pinheadmz: why does ProtectEvictionCandidatesByRatio need to be changed? It protects only 50% at best as far as I understand, so it should leave an unprotected peer usually so last wouldn't be needed if we get that far? Or is this just for the edge case of having 1 peer left?
15617:51 <stickies-v> Suppose we pass a vector of 40 eviction candidates to `SelectNodeToEvict()`. Before and after this PR, what's the theoretical maximum of Tor nodes that can be protected from eviction?
15717:52 <pinheadmz> lightlike yeah really wanted to make sure i caught the absolute last protected peer after we seal off the outbound an no-ban peers
15817:52 <pinheadmz> but mightve been overkill
15917:53 <LarryRuane> stickies-v: does it change from 10 to 9?
16017:54 <stickies-v> lightlike: I really like your suggestion to just select a random (non-whitelisted, inbound) peer, it seems sensible at first sight and would simplify this PR so much
16117:55 <stickies-v> LarryRuane: I've found different numbers but I want to give a bit more time for people to share their numbers (if any)
16217:55 <lightlike> stickies-v: I'll suggest it in the PR, would be good to have some input from other p2p contributors on that.
16317:55 <pinheadmz> yeah and actually we can just select a random peer from whatevers left at any point in SelectNodeToEvict()
16417:56 <pinheadmz> so like, protect outbound, noBan AND peers that sent us blocks, and then choose random
16517:57 <pinheadmz> whitebind is also not entirely attack-proof for example i have a ful node with whitebind on a non-standard port that i use for my SPV wallet... but anyone could discover that port by sweeping
16617:58 <pinheadmz> i dont know if we deal with that in any other way, i suppose bip324 maybe we can actually authenticate inboudns?
16717:58 <stickies-v> pinheadmz: that almost sounds like an invite
16817:58 <pinheadmz> go for it boss
16917:59 <pinheadmz> but yeah with that in mind (that a whitebind inbound could still be an attacker) i wonder if random choice is still safe
17018:00 <stickies-v> LarryRuane: I think both with and without this PR 34/40 Tor nodes would be the maximum number (assuming they're not NoBan and inbound, I underspecified the question a bit) of nodes that can be protected by `SelectNodeToEvict`
17118:01 <LarryRuane> stickies-v: thanks, i'll have to study that more!
17218:01 <stickies-v> feel free to ping me here later on if you disagree on the numbers, but since we're at time I'll be closing off the meeting now
17318:01 <stickies-v> thanks everyone for participating, and pinheadmz for authoring the PR and joining us today!
17418:02 <pinheadmz> thanks stickies-v !! and everyone for reviewing and thikning about this issue
17518:02 <stickies-v> #endmeeting