The PR branch HEAD was cadd93f at the time of this review club meeting.
Notes
When all inbound connection slots are occupied and a new inbound connection
is received, the node attempts to evict an existing inbound connection in
order to open up a slot.
However, any eviction logic is susceptible to eclipse
attacks (see
previous review club meetings on PR 16702 and PR
17428 for more background on eclipse attacks). An attacker may try to
make lots of new inbound connections to us, and if we always evict another
honest node, we may eventually be eclipsed.
AttemptToEvictConnection in src/net.cpp contains logic to increase the
cost of such attacks. Among other things, in order to take over all of the
inbound slots for a node, the attacker would need to:
Obtain IP addresses in various net groups
Be amongst the lowest latency peers (geographic cost)
Provide us with transactions before other peers
Provide us with blocks before other peers
The scarcity of tests for the eviction logic was first noticed in
#16660 and addressed with
functional tests in #16756
(review club).
This PR, #20477 refactors
some of AttemptToEvictConnection() into SelectNodeToEvict() to increase
testability and adds unit tests for SelectNodeToEvict().
Questions
Why is it important to allow for evictions? What are the risks in rejecting
new inbounds if slots are full?
What are the various ways AttemptToEvictConnection() tries to increase the
cost of an eclipse attack?
<stacie> Allowing evictions opens up the opportunity to get better peers (i.e. ones w/ less latency, IP diversity for net groups, ones that provide transactions and blocks faster, etc.)
<amiti> an attacker can execute a connection starvation attack - just take up all the slots on the network, and then there's no way for a new nodes to connect to the honest network
<dhruvm> If honest peers in the network don't have available inbound slots, a new node may not be able to create any outbound connections into the honest network. This might make it easy for a dishonest node to attack the new node - all they have to do is be available.
<michaelfolkson> For the network (and new peers) it is obviously a problem. Less of a problem for you as a node I think (unless you are literally not finding out blocks, transactions anymore)
<Murch> dhruvm: we keep nodes that gave us blocks recently, protect nodes from different internet regions, keep the ones that we have been connected to the longest, keep peers with low latency, there was more I'm forgetting
<dhruvm> AttemptToEvictConnection protects some connections from eviction if they are in certain net groups, have had low latency ping times, or have provided us with valid transactions or blocks we did not know about. So to protect their own prior connection from eviction (when they make a new inbound), an attacker has to change real world parameters, and do useful work.
<dhruvm> Q: Why does AttemptToEvictConnection remove peers being disconnected from eviction candidates instead of waiting for them to disconnect (as that would open up a slot)?
<sipa> dhruvm: inbound variety helps, but given that not even all nodes permit incoming connections, it's hard to say variety in it is essential for eclipse protection
<dhruvm> I actually don't know the know the answer to this one. I suspect it is because there would be no way to know how many new inbounds were waiting on a disconnection in progress (given current code). But it can probably be changed?
<Murch> stacie: An attacker could add more connections of a specific net group with a vpn in that net group, but then that would also increase the chance of the attacker's nodes to get removed?
<mango> There is guaranteed eviction when the set of candidates is high enough (29?), so disconnect keeps the set reasonable and avoids unncessary evictions later.
<amiti> yeah, I don't know either, but it seems more reasonable to have too many available slots vs having too many connections competing for the limited slots
<dhruvm> amiti: yeah that makes sense. SInce ThreadSocketHandler and the logic in AttemptToEvictConnection are not synchronized, better to have more available slots than too few
<Murch> Right, and since we never tell about all of our connections, the attacker wouldn't know which of our peers' netgroups has the highest representation
<michaelfolkson> As an attacker you don't know much about the node's peers (other than what you can deduce from addr gossiping) unless you hack their node obviously
<Murch> stacie: perhaps the attacker could learn what netgroup we have a lot of nodes from when they get disconnected and focus on connecting from other netgroups thereafter
<dhruvm> ASNs are dynamic groups of IP address range assignments representing real world entities like ISPs, companies, etc. Grouping by ASNs is superior to /16 subnets because large entities would span multiple /16 subnets, lowering the cost for an attacker to obtain addresses in different groups.
<michaelfolkson> Murch: In this way a honest node is going to behave the same way as a malicious node. Both are trying to stop being disconnected. I guess a malicious node will make more effort and expend more resources to maintain a connection
<stacie> Murch I was thinking something along those lines, trying different net groups to see which ones are successful. But I was more trying to find out how including a disconnecting node in the node eviction candidates could possibly alter any of the logic in SelectNodeToEvict(). Wasn't really going anywhere more specific than that :) Thanks for digging into it deeper! A lot of interesting questions that I don't quite know
<dhruvm> An attacker cannot predict which net groups would shelter them from eviction because they'd need to know the net groups of all our peers. We never reveal more than 23% of our addrman to any single peer, and we definitely don't reveal a list of our active peers to anyone (that would open up other attack vectors).
<michaelfolkson> dhruvm: If the attacker knows which net groups are less populated on the network due to gossiping. Or are they evenly distributed somehow?
<jonatack> murtyjones: thinking about your question, ISTM it would be the opposite: CompareNodeBlockRelayOnlyTime() is used for protecting potential block-relay-ony peers from eviction
<prayank> I have a noob question. Everything that we are discussing related to this PR is same for nodes with and without tor or anything different in the eviction process? Or something that may work differently for Tor nodes?
<troygiorshev> (I've answered my own question. The third last line of net.cpp assigns netgroups, and it does so randomly. An attacker can not know which netgroups are prioritized, at least not easily)
<sipa> prayank: yes and no; tor connections don't have a meaningful "from IP" or equivalent to clasify them by, so it's much easier to e.g. exhaust incoming tor slots
<michaelfolkson> Murch: The whole point of this is to make attacks harder. You make it easier for the attacker if they know to join an underrepresented group
<jnewbery> perhaps the questions should be rephrased: "if our maximum number of inbound slots is more than 29, and they're all full, why is it guaranteed that we'll evict an inbound connection if AttemptToEvictConnection() is called?"
<jnewbery> schmidty: right. The deterministic randomizer mixes in a (private) salt, so an attacker shouldn't be able to know how the netgroups are ordered
<Murch> so, when we're full and have 29 or more nodes, there will at least be one unprotected and thus eligible for eviction. We do prefer the new connection, so an eviction is guaranteed
<stacie> 29 lines up with the protections, but I agree on the rephrasing of the question/referencing where it came from in the test - 4 by net group, 8 by lowest ping time, 4 by most recently sent novel tx, 8 for non-relay-tx peers that most recently sent novel blocks, 4 for nodes that sent novel blocks
<dhruvm> The code in net.cpp (https://github.com/bitcoin/bitcoin/blob/fabecce/src/net.cpp#L954-L968) protects at most 28 peers from eviction (4 by net group, 8 by lowest ping time, 4 by last time of novel tx, up to 8 non-tx-relay peers by last novel block time, and 4 more peers by last novel block time). So any additional peers are guaranteed to be candidates for eviction.
<Murch> Four nodes from highest keyed netgroup values, eight nodes with minimum bping, four nodes from novel transactions, eight non-tx-relay peers that sent us blocks last, up to 4 from localhost
<dhruvm> Q: Why is non-eviction guaranteed if we have no more than 20 eviction candidates? Is 20 the highest number of nodes that guarantees non-eviction?
<dhruvm> The protection at [net.cpp::961](https://github.com/bitcoin/bitcoin/blob/fabecce/src/net.cpp#L961) for up to 8 non-tx-relay peers may, or may not apply to the randomly generated eviction candidates since `!n.fRelayTxes && n.fRelevantServices` will evaluate to `randbool`. So we cannot count them in the _guaranteed_ non-eviction?
<dhruvm> guaranteed non-eviction is only on 4(CompareNetGroupKeyed) + 8(ReverseCompareNodeMinPingTime) + 4(CompareNodeTXTime) + 4(CompareNodeBlockTime) = 20.
<schmidty> Test case makes sure there is always an eviction at (29) GUARANTEED_EVICTION_AT_N_CANDIDATES_OR_ABOVE or above. Likewise no evictions at (20) GUARANTEED_NON_EVICTION_AT_N_CANDIDATES_OR_BELOW or below.
<dhruvm> The tests generate randomized eviction candidates and since non-eviction is guaranteed at <= 20 peers and eviction at >= 29 peers, the characteristics of the randomized peers are irrelevant, except the ones specifically overriden by each test case.
<Murch> The test counts up 10 times from zero to 199 nodes and checks that for each number of nodes under 21 there is no eviction and above 28 there is.
<dhruvm> std::move casts the parameters into an rvalue reference and invokes a move constructor. This (1) transfers the ownership of the vector to `SelectNodeToEvict`, (2) avoids a copy of the vector and (3) makes it clear to future contributors that they should not re-use the vector as the behavior will be unspecified.
<michaelfolkson> Formally should this be considered a functional test rather than a unit test? I know it doesn't matter but unsure on where the line is between unit and functional other than using Boost/functional test framework
<jnewbery> std::move doesn't move anything, it simply casts to rvalue. I'd recommend Effective Modern C++ by Scott Meyers for a really good explanation of move sematics
<dhruvm> On the lambdas question: After setting up random eviction candidates, `candidate_setup_fn` is used to give some candidates specific attributes, like net groups, etc.
<troygiorshev> michaelfolkson: if you write your test in python, then you'll have to spin up an entire bitcoin node to interact with through the public api. On the other hand, in C++ here we can call non-exposed methods directly
<dhruvm> Thank you all for coming. I had a ton of fun reviewing the PR and hosting today. Will definitely be doing it again. I know jnewbery is looking for more hosts. If you have the time and inclination, I highly recommend hosting, it's a great way to learn!