Connection eviction logic tests (p2p, tests)

https://github.com/bitcoin/bitcoin/pull/16756

Host: mzumsande  -  PR author: mzumsande

The PR branch HEAD was 32c6ce0be3 at the time of this review club meeting.

Notes

  • A node allows a limited number of connections, the default is 125. When this limit is reached, the node reacts to the connection of a new inbound peer by terminating the connection with an existing inbound peer (ā€œevictionā€).

  • Certain peers are protected from eviction by a heuristic algorithm, based on various metrics such as netgroup, connection time, ping time, them having sent us transactions or blocks. This is implemented in CConnman::AttemptToEvictConnection() and was added in PR #6374 and extended in PR #8084.

  • The eviction logic is currently covered by neither functional nor unit tests (Test Coverage), which was noticed in Issue #16660 .

  • This PR suggests a functional test for the eviction mechanism, covering protection of fast responding peers and peers that have sent us transactions and blocks. This is achieved by adding several mininode peers with different properties and testing that after reaching the maximum number of connections, a peer is being evicted, which is not among the peers assumed to be protected.

Questions

  1. Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK? (Donā€™t forget to put your PR review on GitHub.)

  2. Why is there an eviction mechanism for inbound peers, instead of simply not accepting new connections when full?

  3. Why are some peers protected from eviction?

  4. Can you summarize the algorithm by which the candidate for eviction is determined in AttemptToEvictConnection()?

  5. Why does the test choose maxconnections=32?

  6. What are the challenges in testing the eviction logic? Can you think of ways to extend the coverage of the functional test?

  7. What do you think of the algorithm for eviction/protection? Do you have ideas for improvement?

Meeting Log

  113:00 <jnewbery> #startmeeting
  213:00 <troygiorshev> hi
  313:00 <lightlike> hi
  413:00 <amiti> hi
  513:00 <emzy> hi
  613:00 <raj_149> hi
  713:00 <michaelfolkson> hi
  813:00 <jnewbery> Hi folks! Welcome to review club. Today, lightlike is going to be hosting the meeting on his PR #16756: Connection eviction logic tests.
  913:00 <nehan> hi
 1013:01 <jnewbery> Notes and questions in the normal place: https://bitcoincore.reviews/16756.html
 1113:01 <jnewbery> Before we start though, I want to draw everyone's attention to the code of conduct that we added to the website last week: https://bitcoincore.reviews/code-of-conduct. Please read it!
 1213:01 <jnewbery> We want to make sure that PR Review Club is a safe, respectful, and productive environment for everyone. The Code of Conduct documents the behaviour we expect to ensure that's the case.
 1313:01 <jnewbery> I don't think there's anything controversial in there. If you have any questions/concerns/comments, feel free to message me.
 1413:01 <pinheadmz> +1 for code of conduct
 1513:02 <decipher> hi
 1613:02 <jnewbery> With that, I'll hand over to lightlike.
 1713:03 <lightlike> Ok, so today will be on my PR on testing the eviction/protection logic for inbound peers.
 1813:03 <lightlike> I think that also the eviction algorithm in itself is quite interesting in itself.
 1913:03 <andrewtoth> hi
 2013:03 <lightlike> But let's start with the usual question - Who had a chance to review the PR? (y/n)
 2113:03 <raj_149> y
 2213:04 <troygiorshev> n
 2313:04 <nehan> n
 2413:04 <andrewtoth> y
 2513:04 <jnewbery> y
 2613:04 <gdejz> N
 2713:04 <amiti> mostly y
 2813:04 <emzy> n
 2913:05 <lightlike> Great - and there have been many helpful review comments so far.
 3013:05 <lightlike> So, moving to the first question:
 3113:05 <lightlike> Why is there an eviction mechanism for inbound peers, instead of simply not accepting new connections when full?
 3213:06 <michaelfolkson> To help bootstrap new nodes?
 3313:06 <raj_149> We might get better new peers than existing one? Having static peers might open risk of attacks?
 3413:06 <andrewtoth> To prevent an attacker from filling up all inbound slots
 3513:07 <decipher> preventing peering monopolization, potentially from an adversary
 3613:07 <troygiorshev> We ensure we're not stuck with a small (potentially malivious) subset of the network?
 3713:07 <amiti> it seems like we try to evict from the netgroup with the max number of connections of unprotected peers, so maybe as a method of ensuring peers across multiple netgroups?
 3813:08 <raj_149> quick question: what exactly is a netgroup?
 3913:08 <lightlike> I think all of these are valid reasons.
 4013:09 <lightlike> but probably, preventing an attacker to fill the inbound slots was the main reason why eviction was introduced?!
 4113:10 <sipa> just in general: to try to improve the quality of inbound connections (a combination of peers that relay blocks quickly, don't misbehave, and come from a variety of networks across the world)
 4213:10 <sipa> when we're at our inbound connection limit, we have a choice
 4313:11 <sipa> or rather: we need to make a choice, and not evicting means making the implicit "first connection is best" assumption
 4413:11 <decipher> "prioritize peer desirability"
 4513:12 <michaelfolkson> raj_149: I had to look it up. Connections with certain characteristics form one netgroup https://github.com/bitcoin/bitcoin/blob/657b82cef0e8e8695fc189d013a4353bdbebb041/src/net.cpp#L895
 4613:12 <jnewbery> raj_149: traditionally, a netgroup was a /16 subnet. From 0.20, it can optionally be based on ASNs. See https://github.com/bitcoin/bitcoin/issues/16599 for more details
 4713:13 <sipa> raj_149: it's a fairly arbitrary grouping of IP addresses we make; for IPv4 it's /16 subnets, though since 0.20 it's possible to use asmap
 4813:13 <lightlike> ok, next question
 4913:13 <lightlike> Why are some peers protected from eviction?
 5013:13 <sipa> raj_149: the idea is that it's harder for an attacker to control many networks rather than many IP addresses in one network
 5113:14 <raj_149> thanks, makes sense. Will look up on it more.
 5213:14 <andrewtoth> The same way an attacker can fill up all inbound slots, an attacker could evict all other connections by continually connecting to a node
 5313:15 <jonatack> raj_149: see https://bitcoincore.reviews/16702
 5413:15 <michaelfolkson> And if they have proved themselves useful by providing blocks or responding to ping quickly you don't want to ditch them
 5513:15 <amiti> +1 andrewtoth. protecting peers from eviction based on different desired attributes increases the cost of attack to fill up all the slots.
 5613:16 <michaelfolkson> So you kind of want to get best of both worlds or a happy medium. Go too far in either direction and you don't get benefits of both extremes
 5713:16 <gzhao408> this might be a big claim, but I actually think any eviction policy that doesn't protect certain characteristics (i.e. ones that are hard to fake like minping and longest connection time) would almost guarantee that an attacker could fill up your slots
 5813:17 <ecurrencyholder> It's preferable to keep a connection that has proven to be reputable. By evicting it, you are opening yourself up to potentially connect to a malicious one.
 5913:18 <lightlike> yes, i think it's a balance. we don't want to have our peers too static, but we also need to protect against attackers actively eviction other peers.
 6013:19 <decipher> having defined peering preferences gives an attacker the success criteria they need to meet to perform an attack -- regardless of what criteria are chosen. Our objective should be - choose the criteria which suite are desires best but also make it as costly as possible for an attacker
 6113:19 <decipher> suit our*
 6213:19 <nehan> decipher: +1
 6313:19 <gzhao408> costly, and also that they're doing useful work anyway?
 6413:20 <gzhao408> (for some of the characteristics)
 6513:20 <nehan> gzhao408: i don't think usefulness is helpful. an attacker who is useful up to eclipsing you has still eclipsed you.
 6613:21 <jnewbery> I think amiti and decipher's point about having different criteria is very important. We obviously want to receive blocks quickly so a naive approach would be to just protect the peers that send us blocks. If we did that would it be easier or harder for an attacker to eclipse you?
 6713:21 <raj_149> jnewbery: seems easier, if that was the only criteria?
 6813:22 <decipher> easier or harder relative to which initial case
 6913:22 <sipa> the network serves a variety of purposes; for blocks it's partition resistance and fast relay; for transactions it's original privacy; for addresses there are eclipse attack concerns, ...
 7013:22 <sipa> so that means we need to select peers that are good across multiple different dimensions
 7113:22 <sipa> *origin privacy
 7213:22 <gzhao408> nehan good point haha
 7313:23 <lightlike> next q: Can you summarize the algorithm by which the candidate for eviction is determined in AttemptToEvictConnection?
 7413:23 <jnewbery> decipher: compared to not just protecting block-serving peers. I think the answer is easier because a determined adversary can be a better block-serving peer than an average peer.
 7513:23 <michaelfolkson> nehan: They can. But you would still want to value node who has proved useful over a node who has not.
 7613:24 <nehan> michaelfolkson: I think the issue is that if you measure nodes by how "good" they are and it's something an attacker can optimize, they can become better than average nodes and take up all the good nodes slots. I think :)
 7713:24 <michaelfolkson> nehan: Yeah good point
 7813:24 <sipa> nehan: there is a risk for a repution attack"
 7913:25 <raj_149> It seems focusing on any sinle character will by definition open up gamability. the only way is to have as many character as possible as criteria?
 8013:25 <nehan> sipa: i don't know what that means
 8113:25 <raj_149> *single
 8213:25 <sipa> nehan: an attacker can act "good" before trying to perform an attack
 8313:25 <nehan> sipa: yes
 8413:25 <sipa> which is inherently a problem without persistent identities
 8513:26 <sipa> on the other hand, some things can't be faked: a peer with a low ping time is close to you - it doesn't mean they're not an attacker of course
 8613:26 <nehan> sipa: yep or some other hard to fake mechanism + honesty assumptions
 8713:26 <gzhao408> lightlike: make a list of eviction candidates, sort them by characteristics and remove the top ones of each category, whittling the list down based on peers that you definitely want to protect. if you still have candidates left over, make an eviction decision?
 8813:26 <sipa> or a peer who gives you a good block helped you and the network, regardless of their future actions
 8913:28 <amiti> lightlike: start by selecting eviction candidates that are inbound peers that are not on the NO BAN list or already marked for disconnect. remove the top candidates from each of the desirable categories (netgroup, ping time, recently sent txns, recently sent blocks, and the longest connected). If there are any candidates left over, choose from the netgroup with the most connections
 9013:28 <raj_149> is it implicite that there will always be a candidate for eviction (assuming peercount limit reached)? or it might be the case that none of them gets evicted and new connection is rejected?
 9113:28 <sipa> raj_149: the new connection may be the one that is evicted
 9213:28 <sipa> (i think, i haven't checked this code recently)
 9313:28 <nehan> so the idea is to optimize for metrics that are "proof of good and hard to fake behavior". like low ping times, bw usage for relaying blocks, etc. others? sort of like proof of work, ha
 9413:30 <nehan> random probably wrong thought: maybe more expensive the behavior is, the more it should be weighted
 9513:30 <lightlike> gzhao, amiti: right! we also preferably evict peers that are undesirable (HasAllDesirableServiceFlags)
 9613:31 <decipher> should we discuss the protected_peers weightings?
 9713:31 <gzhao408> I think it's important that it's not just favoring some but /protecting/ them - that's why we remove them from the list so even if a node is both a long-lasting connection and has short ping times we consider the same number regardless of overlap. and to nehan's point, the more expensive behavior comes at the end
 9813:32 <nehan> gzhao408: yeah i think i spoke too soon before, you are right that useful work is helpful especially if costly
 9913:32 <lightlike> raj_149: I'm not sure how we would not find a candidate - if we have maxconnections large enough that that there are more than the fixed 20 peers (4 block +4 tx + 4netgroup +8 ping).
10013:33 <raj_149> IMO weighting might be problematic as an attacker then gets higher probabailty to get into protected sets by playing good just before the attack?
10113:33 <troygiorshev> i'm not sure i follow precisely what the attack is, can anyone give a quick summary?
10213:33 <raj_149> lightlike: i was also thinking the same. Seems like it will always find a peer to evict.
10313:33 <sipa> i think it's important that this is not just a score that maps all criteria onto a single value
10413:34 <sipa> you want to keep the best peers according to many different criteria, so that an attacker has to beat your other peers in all those criteria
10513:34 <michaelfolkson> These eviction policies in the test are a simplified version of the eviction policies in the actual codebase right?
10613:34 <decipher> should have phrased it as protected_peers allocation not creating a score
10713:35 <raj_149> michaelfolkson: only the netgrouping part. rest of the criteria are being tested as it seems.
10813:37 <lightlike> ok, moving on to the test: Why does the test choose maxconnections=32?
10913:38 <gzhao408> I donā€™t think maxinbound can be specified, rather itā€™s implied from maxconnections, maxoutbound, and maxfeeler.Basic math: Maxconnections = maxoutboundfullrelay + maxoutboundblocksonly + maxfeelers + maxinboundWe wanted 21 inbound so 8 + 2 + 1 + 21 = 32
11013:38 <raj_149> so that it gives us total 20 peers to be protected, adding the 21st peer will trigger eviction of one?
11113:39 <lightlike> raj_149: I think 21 are connected, so adding the 22nd will evict (because 20 are protected, and one is not protected)
11213:39 <andrewtoth> it is calculated to give us exactly one more peer than protected, so we can see it get evicted
11313:39 <raj_149> lightlike: ah right..
11413:40 <lightlike> gzhao408: right. that's why maxconnections needed to be raised by two when the PR introducing two more outbound connections (blocks-only) was merged
11513:41 <lightlike> next q: What are the challenges in testing the eviction logic? Can you think of ways to extend the coverage of the functional test?
11613:41 <raj_149> lightlike: its hard to simulate real world networking situations in regtest.
11713:43 <michaelfolkson> I still haven't wrapped my head exactly what the test is doing at a high level. Maybe I am just slow :)
11813:43 <troygiorshev> in particular, we can't test the netgroup behaviour
11913:44 <michaelfolkson> Is it testing a Core node by spinning up a bunch of regtest nodes that connect to that Core node? Or a separate regtest network and testing how that independent network works?
12013:44 <andrewtoth> There's no clear interface for the eviction logic, so the test has to build an extensive workaround simulated network
12113:44 <raj_149> michaelfolkson: its making some peers satisfy the above criterias so to add them in protected group, then checking eviction is happening only from the non protected group, and protected peers are kept intact as intended.
12213:45 <troygiorshev> andrewtoth: for a functional test this is maybe an advantage?
12313:45 <jnewbery> michaelfolkson: it's a single node being tested, and a bunch of connections to it from the python test framework
12413:45 <andrewtoth> troygiorshev an advantage in what way?
12513:45 <michaelfolkson> Cool
12613:45 <troygiorshev> michaelfolkson: the former
12713:46 <raj_149> quickquestion: whats the diff between p2pdatastore and p2pinterface?
12813:47 <amiti> lightlike: did you consider testing the `m_prefer_evict` preferred-for-eviction logic? I haven't looked too closely at feasibility yet
12913:47 <lightlike> I originally had another commit with a unit test (testing the sorting in more detail), but that was controversial because in order to make the methods from Boost accessible, quite some code needed to be moved from net.cpp to net.h
13013:47 <troygiorshev> andrewtoth: a functional test should try and simulate the real world as closely as possible. forcing us to actually spin up multiple nodes and connect them could help us realize if we've made an incorrect assupmtion somewhere along the line
13113:47 <andrewtoth> i see
13213:48 <andrewtoth> but even this test is fairly contrived, having only 32 connections. real world nodes will usually use the default 125
13313:49 <raj_149> troygiorshev: but wont doing that in regtest will always be insufficient than real world?
13413:49 <michaelfolkson> What are you concerned about going from 32 to 125 andrewtoth? DoS?
13513:50 <lightlike> amiti: I think this should be possible. One problem is that due to the already mentioned impossibility to mock netgroup, there is some randomness involved (the first step kicks out 4 "random" peers which makes writing the tests harder)
13613:50 <sipa> lightlike: with asmap you can mock netgroups
13713:50 <andrewtoth> on one hand, it does simulate a lot of different things which would test for incorrect assumptions. on the other, it should be testing exactly the behaviour we want and not more.
13813:51 <lightlike> sipa: oh, that is cool, I'll have a look at that!
13913:51 <sipa> lightlike: ah, but all connections will still come from localhost, so probably not
14013:51 <andrewtoth> michaelfolkson nothing particularly but to simulate real world we should use defaults. Could be some incorrect assumptions about how eviction works that we are not considering with default peers
14113:52 <lightlike> last question: What do you think of the algorithm for eviction/protection? Do you have ideas for improvement?
14213:52 <troygiorshev> andrewtoth: good point. I think, if you have multiple peers left over after protecting the 20, the logic to choose which one to evict is very simple, so there isn't much benefit to testing that it picks the right one?
14313:52 <troygiorshev> now that I type that out I'm not sure that I agree with it ...
14413:53 <jnewbery> it'd be great if we could somehow mock remote IP ranges in the test framework. There are quite a few differences between how we treat local peers and remote peers
14513:53 <raj_149> jnewbery: mind pointing to few such diffs?
14613:55 <MarcoFalke> raj_149: Pretty much anything that deals with addresses (addrman, ...)
14713:55 <gzhao408> lightlike I wondered for a while why minping (best ping time ever) is used instead of pingtime (most recent ping time), but for the purposes of protection it seems like mingping would be much harder for an attacker to manipulate and thus...safer?
14813:55 <jnewbery> raj_149: I'm not sure if it's that interesting to get into low-level details here. Search for IsLocal() and see where it's used
14913:56 <emzy> lightlike: I think for ipv4 you can use as source every host in 127.0.0.0/8.
15013:57 <michaelfolkson> Would testing remote peers be in the remit of maintainers/long term contributors with resources set up? Like industrial testing?
15113:57 <emzy> Should be all localhost.
15213:57 <sipa> gzhao408: ping times vary due to network conditions, node overloading, ...; but an attacker still can't fake minping (if minping is N ms, then you know the node is at most N*150 km away)
15313:58 <lightlike> emzy: Probably those would all map to the same netgroup though?
15413:59 <emzy> lightlike: no default netgroups is /16 so you could use 127.3.0.1 127.4.0.1 ....
15513:59 <amiti> lightlike: hmmm, maybe a test could setup 5 nodes as preferred for eviction, and ensure that one of them is the one evicted? šŸ˜‚
15613:59 <amiti> not great, the netgroup definitely makes it tricky
15713:59 <raj_149> sipa: woh, cool formula.
15813:59 <MarcoFalke> michaelfolkson: I think having the tests call out to remote peers by default could be problematic, though if it is opt-in and can't be achieved in another way, why not
15914:00 <jnewbery> ok, I've got to run to my next call. Thanks for hosting lightlike!
16014:00 <jnewbery> Don't forget that tomorrow is the last of our BIP 157 specials, and probably the most interesting from a protocol-design perspective. Notes are here: https://bitcoincore.reviews/19070.html
16114:00 <jnewbery> And next week we'll get into some nice crypto implementation. Fabian will host a meeting on sipa's Python implementation of MuHash. Notes and questions will be up soon.
16214:00 <lightlike> ok, thanks all!
16314:00 <gzhao408> sipa: ah, that makes sense that it can serve as a measurement of physical distance. so this would fall under "hard for attacker to fake" category more than "favorable behavior" category? I was confused because I thought we used it because we cared about how fast our peers were responding
16414:00 <nehan> thanks lightlike!
16514:00 <raj_149> thanks lightlike for hosting. has been a great one.
16614:01 <troygiorshev> thanks lightlike!
16714:01 <gzhao408> thanks lightlike!
16814:01 <emzy> tnx everyone!
16914:01 <michaelfolkson> Nice hosting lightlike
17014:01 <amiti> thanks lightlike! great test addition :)
17114:01 * decipher sets up quantum tunneling peer to h4x minping
17214:02 <MarcoFalke> jup, thx for hosting lightlike. Good to see more test prs here :)
17314:02 <decipher> thanks lightlike
17414:02 <sipa> thanks lightlike!
17514:04 <andrewtoth> thanks lightlike!