Connection eviction logic tests (p2p, tests)

Jun 3, 2020

https://github.com/bitcoin/bitcoin/pull/16756

Host: mzumsande - PR author: mzumsande

The PR branch HEAD was 32c6ce0be3 at the time of this review club meeting.

Notes

A node allows a limited number of connections, the default is 125. When this limit is reached, the node reacts to the connection of a new inbound peer by terminating the connection with an existing inbound peer (“eviction”).
Certain peers are protected from eviction by a heuristic algorithm, based on various metrics such as netgroup, connection time, ping time, them having sent us transactions or blocks. This is implemented in CConnman::AttemptToEvictConnection() and was added in PR #6374 and extended in PR #8084.
The eviction logic is currently covered by neither functional nor unit tests (Test Coverage), which was noticed in Issue #16660 .
This PR suggests a functional test for the eviction mechanism, covering protection of fast responding peers and peers that have sent us transactions and blocks. This is achieved by adding several mininode peers with different properties and testing that after reaching the maximum number of connections, a peer is being evicted, which is not among the peers assumed to be protected.

Questions

Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK? (Don’t forget to put your PR review on GitHub.)
Why is there an eviction mechanism for inbound peers, instead of simply not accepting new connections when full?
Why are some peers protected from eviction?
Can you summarize the algorithm by which the candidate for eviction is determined in AttemptToEvictConnection()?
Why does the test choose maxconnections=32?
What are the challenges in testing the eviction logic? Can you think of ways to extend the coverage of the functional test?
What do you think of the algorithm for eviction/protection? Do you have ideas for improvement?

Meeting Log

13:00

<jnewbery> #startmeeting

13:00

<troygiorshev> hi

13:00

<lightlike> hi

13:00

<amiti> hi

13:00

<emzy> hi

13:00

<raj_149> hi

13:00

<michaelfolkson> hi

13:00

<jnewbery> Hi folks! Welcome to review club. Today, lightlike is going to be hosting the meeting on his PR #16756: Connection eviction logic tests.

13:00

<nehan> hi

13:01

<jnewbery> Notes and questions in the normal place: https://bitcoincore.reviews/16756.html

13:01

<jnewbery> Before we start though, I want to draw everyone's attention to the code of conduct that we added to the website last week: https://bitcoincore.reviews/code-of-conduct. Please read it!

13:01

<jnewbery> We want to make sure that PR Review Club is a safe, respectful, and productive environment for everyone. The Code of Conduct documents the behaviour we expect to ensure that's the case.

13:01

<jnewbery> I don't think there's anything controversial in there. If you have any questions/concerns/comments, feel free to message me.

13:01

<pinheadmz> +1 for code of conduct

13:02

<decipher> hi

13:02

<jnewbery> With that, I'll hand over to lightlike.

13:03

<lightlike> Ok, so today will be on my PR on testing the eviction/protection logic for inbound peers.

13:03

<lightlike> I think that also the eviction algorithm in itself is quite interesting in itself.

13:03

<andrewtoth> hi

13:03

<lightlike> But let's start with the usual question - Who had a chance to review the PR? (y/n)

13:03

<raj_149> y

13:04

<troygiorshev> n

13:04

<nehan> n

13:04

<andrewtoth> y

13:04

<jnewbery> y

13:04

<gdejz> N

13:04

<amiti> mostly y

13:04

<emzy> n

13:05

<lightlike> Great - and there have been many helpful review comments so far.

13:05

<lightlike> So, moving to the first question:

13:05

<lightlike> Why is there an eviction mechanism for inbound peers, instead of simply not accepting new connections when full?

13:06

<michaelfolkson> To help bootstrap new nodes?

13:06

<raj_149> We might get better new peers than existing one? Having static peers might open risk of attacks?

13:06

<andrewtoth> To prevent an attacker from filling up all inbound slots

13:07

<decipher> preventing peering monopolization, potentially from an adversary

13:07

<troygiorshev> We ensure we're not stuck with a small (potentially malivious) subset of the network?

13:07

<amiti> it seems like we try to evict from the netgroup with the max number of connections of unprotected peers, so maybe as a method of ensuring peers across multiple netgroups?

13:08

<raj_149> quick question: what exactly is a netgroup?

13:08

<lightlike> I think all of these are valid reasons.

13:09

<lightlike> but probably, preventing an attacker to fill the inbound slots was the main reason why eviction was introduced?!

13:10

<sipa> just in general: to try to improve the quality of inbound connections (a combination of peers that relay blocks quickly, don't misbehave, and come from a variety of networks across the world)

13:10

<sipa> when we're at our inbound connection limit, we have a choice

13:11

<sipa> or rather: we need to make a choice, and not evicting means making the implicit "first connection is best" assumption

13:11

<decipher> "prioritize peer desirability"

13:12

<michaelfolkson> raj_149: I had to look it up. Connections with certain characteristics form one netgroup https://github.com/bitcoin/bitcoin/blob/657b82cef0e8e8695fc189d013a4353bdbebb041/src/net.cpp#L895

13:12

<jnewbery> raj_149: traditionally, a netgroup was a /16 subnet. From 0.20, it can optionally be based on ASNs. See https://github.com/bitcoin/bitcoin/issues/16599 for more details

13:13

<sipa> raj_149: it's a fairly arbitrary grouping of IP addresses we make; for IPv4 it's /16 subnets, though since 0.20 it's possible to use asmap

13:13

<lightlike> ok, next question

13:13

<lightlike> Why are some peers protected from eviction?

13:13

<sipa> raj_149: the idea is that it's harder for an attacker to control many networks rather than many IP addresses in one network

13:14

<raj_149> thanks, makes sense. Will look up on it more.

13:14

<andrewtoth> The same way an attacker can fill up all inbound slots, an attacker could evict all other connections by continually connecting to a node

13:15

<jonatack> raj_149: see https://bitcoincore.reviews/16702

13:15

<michaelfolkson> And if they have proved themselves useful by providing blocks or responding to ping quickly you don't want to ditch them

13:15

<amiti> +1 andrewtoth. protecting peers from eviction based on different desired attributes increases the cost of attack to fill up all the slots.

13:16

<michaelfolkson> So you kind of want to get best of both worlds or a happy medium. Go too far in either direction and you don't get benefits of both extremes

13:16

<gzhao408> this might be a big claim, but I actually think any eviction policy that doesn't protect certain characteristics (i.e. ones that are hard to fake like minping and longest connection time) would almost guarantee that an attacker could fill up your slots

13:17

<ecurrencyholder> It's preferable to keep a connection that has proven to be reputable. By evicting it, you are opening yourself up to potentially connect to a malicious one.

13:18

<lightlike> yes, i think it's a balance. we don't want to have our peers too static, but we also need to protect against attackers actively eviction other peers.

13:19

<decipher> having defined peering preferences gives an attacker the success criteria they need to meet to perform an attack -- regardless of what criteria are chosen. Our objective should be - choose the criteria which suite are desires best but also make it as costly as possible for an attacker

13:19

<decipher> suit our*

13:19

<nehan> decipher: +1

13:19

<gzhao408> costly, and also that they're doing useful work anyway?

13:20

<gzhao408> (for some of the characteristics)

13:20

<nehan> gzhao408: i don't think usefulness is helpful. an attacker who is useful up to eclipsing you has still eclipsed you.

13:21

<jnewbery> I think amiti and decipher's point about having different criteria is very important. We obviously want to receive blocks quickly so a naive approach would be to just protect the peers that send us blocks. If we did that would it be easier or harder for an attacker to eclipse you?

13:21

<raj_149> jnewbery: seems easier, if that was the only criteria?

13:22

<decipher> easier or harder relative to which initial case

13:22

<sipa> the network serves a variety of purposes; for blocks it's partition resistance and fast relay; for transactions it's original privacy; for addresses there are eclipse attack concerns, ...

13:22

<sipa> so that means we need to select peers that are good across multiple different dimensions

13:22

<sipa> *origin privacy

13:22

<gzhao408> nehan good point haha

13:23

<lightlike> next q: Can you summarize the algorithm by which the candidate for eviction is determined in AttemptToEvictConnection?

13:23

<jnewbery> decipher: compared to not just protecting block-serving peers. I think the answer is easier because a determined adversary can be a better block-serving peer than an average peer.

13:23

<michaelfolkson> nehan: They can. But you would still want to value node who has proved useful over a node who has not.

13:24

<nehan> michaelfolkson: I think the issue is that if you measure nodes by how "good" they are and it's something an attacker can optimize, they can become better than average nodes and take up all the good nodes slots. I think :)

13:24

<michaelfolkson> nehan: Yeah good point

13:24

<sipa> nehan: there is a risk for a repution attack"

13:25

<raj_149> It seems focusing on any sinle character will by definition open up gamability. the only way is to have as many character as possible as criteria?

13:25

<nehan> sipa: i don't know what that means

13:25

<raj_149> *single

13:25

<sipa> nehan: an attacker can act "good" before trying to perform an attack

13:25

<nehan> sipa: yes

13:25

<sipa> which is inherently a problem without persistent identities

13:26

<sipa> on the other hand, some things can't be faked: a peer with a low ping time is close to you - it doesn't mean they're not an attacker of course

13:26

<nehan> sipa: yep or some other hard to fake mechanism + honesty assumptions

13:26

<gzhao408> lightlike: make a list of eviction candidates, sort them by characteristics and remove the top ones of each category, whittling the list down based on peers that you definitely want to protect. if you still have candidates left over, make an eviction decision?

13:26

<sipa> or a peer who gives you a good block helped you and the network, regardless of their future actions

13:28

<amiti> lightlike: start by selecting eviction candidates that are inbound peers that are not on the NO BAN list or already marked for disconnect. remove the top candidates from each of the desirable categories (netgroup, ping time, recently sent txns, recently sent blocks, and the longest connected). If there are any candidates left over, choose from the netgroup with the most connections

13:28

<raj_149> is it implicite that there will always be a candidate for eviction (assuming peercount limit reached)? or it might be the case that none of them gets evicted and new connection is rejected?

13:28

<sipa> raj_149: the new connection may be the one that is evicted

13:28

<sipa> (i think, i haven't checked this code recently)

13:28

<nehan> so the idea is to optimize for metrics that are "proof of good and hard to fake behavior". like low ping times, bw usage for relaying blocks, etc. others? sort of like proof of work, ha

13:30

<nehan> random probably wrong thought: maybe more expensive the behavior is, the more it should be weighted

13:30

<lightlike> gzhao, amiti: right! we also preferably evict peers that are undesirable (HasAllDesirableServiceFlags)

13:31

<decipher> should we discuss the protected_peers weightings?

13:31

<gzhao408> I think it's important that it's not just favoring some but /protecting/ them - that's why we remove them from the list so even if a node is both a long-lasting connection and has short ping times we consider the same number regardless of overlap. and to nehan's point, the more expensive behavior comes at the end

13:32

<nehan> gzhao408: yeah i think i spoke too soon before, you are right that useful work is helpful especially if costly

13:32

<lightlike> raj_149: I'm not sure how we would not find a candidate - if we have maxconnections large enough that that there are more than the fixed 20 peers (4 block +4 tx + 4netgroup +8 ping).

100

13:33

<raj_149> IMO weighting might be problematic as an attacker then gets higher probabailty to get into protected sets by playing good just before the attack?

101

13:33

<troygiorshev> i'm not sure i follow precisely what the attack is, can anyone give a quick summary?

102

13:33

<raj_149> lightlike: i was also thinking the same. Seems like it will always find a peer to evict.

103

13:33

<sipa> i think it's important that this is not just a score that maps all criteria onto a single value

104

13:34

<sipa> you want to keep the best peers according to many different criteria, so that an attacker has to beat your other peers in all those criteria

105

13:34

<michaelfolkson> These eviction policies in the test are a simplified version of the eviction policies in the actual codebase right?

106

13:34

<decipher> should have phrased it as protected_peers allocation not creating a score

107

13:35

<raj_149> michaelfolkson: only the netgrouping part. rest of the criteria are being tested as it seems.

108

13:37

<lightlike> ok, moving on to the test: Why does the test choose maxconnections=32?

109

13:38

<gzhao408> I don’t think maxinbound can be specified, rather it’s implied from maxconnections, maxoutbound, and maxfeeler.Basic math: Maxconnections = maxoutboundfullrelay + maxoutboundblocksonly + maxfeelers + maxinboundWe wanted 21 inbound so 8 + 2 + 1 + 21 = 32

110

13:38

<raj_149> so that it gives us total 20 peers to be protected, adding the 21st peer will trigger eviction of one?

111

13:39

<lightlike> raj_149: I think 21 are connected, so adding the 22nd will evict (because 20 are protected, and one is not protected)

112

13:39

<andrewtoth> it is calculated to give us exactly one more peer than protected, so we can see it get evicted

113

13:39

<raj_149> lightlike: ah right..

114

13:40

<lightlike> gzhao408: right. that's why maxconnections needed to be raised by two when the PR introducing two more outbound connections (blocks-only) was merged

115

13:41

<lightlike> next q: What are the challenges in testing the eviction logic? Can you think of ways to extend the coverage of the functional test?

116

13:41

<raj_149> lightlike: its hard to simulate real world networking situations in regtest.

117

13:43

<michaelfolkson> I still haven't wrapped my head exactly what the test is doing at a high level. Maybe I am just slow :)

118

13:43

<troygiorshev> in particular, we can't test the netgroup behaviour

119

13:44

<michaelfolkson> Is it testing a Core node by spinning up a bunch of regtest nodes that connect to that Core node? Or a separate regtest network and testing how that independent network works?

120

13:44

<andrewtoth> There's no clear interface for the eviction logic, so the test has to build an extensive workaround simulated network

121

13:44

<raj_149> michaelfolkson: its making some peers satisfy the above criterias so to add them in protected group, then checking eviction is happening only from the non protected group, and protected peers are kept intact as intended.

122

13:45

<troygiorshev> andrewtoth: for a functional test this is maybe an advantage?

123

13:45

<jnewbery> michaelfolkson: it's a single node being tested, and a bunch of connections to it from the python test framework

124

13:45

<andrewtoth> troygiorshev an advantage in what way?

125

13:45

<michaelfolkson> Cool

126

13:45

<troygiorshev> michaelfolkson: the former

127

13:46

<raj_149> quickquestion: whats the diff between p2pdatastore and p2pinterface?

128

13:47

<amiti> lightlike: did you consider testing the `m_prefer_evict` preferred-for-eviction logic? I haven't looked too closely at feasibility yet

129

13:47

<lightlike> I originally had another commit with a unit test (testing the sorting in more detail), but that was controversial because in order to make the methods from Boost accessible, quite some code needed to be moved from net.cpp to net.h

130

13:47

<troygiorshev> andrewtoth: a functional test should try and simulate the real world as closely as possible. forcing us to actually spin up multiple nodes and connect them could help us realize if we've made an incorrect assupmtion somewhere along the line

131

13:47

<andrewtoth> i see

132

13:48

<andrewtoth> but even this test is fairly contrived, having only 32 connections. real world nodes will usually use the default 125

133

13:49

<raj_149> troygiorshev: but wont doing that in regtest will always be insufficient than real world?

134

13:49

<michaelfolkson> What are you concerned about going from 32 to 125 andrewtoth? DoS?

135

13:50

<lightlike> amiti: I think this should be possible. One problem is that due to the already mentioned impossibility to mock netgroup, there is some randomness involved (the first step kicks out 4 "random" peers which makes writing the tests harder)

136

13:50

<sipa> lightlike: with asmap you can mock netgroups

137

13:50

<andrewtoth> on one hand, it does simulate a lot of different things which would test for incorrect assumptions. on the other, it should be testing exactly the behaviour we want and not more.

138

13:51

<lightlike> sipa: oh, that is cool, I'll have a look at that!

139

13:51

<sipa> lightlike: ah, but all connections will still come from localhost, so probably not

140

13:51

<andrewtoth> michaelfolkson nothing particularly but to simulate real world we should use defaults. Could be some incorrect assumptions about how eviction works that we are not considering with default peers

141

13:52

<lightlike> last question: What do you think of the algorithm for eviction/protection? Do you have ideas for improvement?

142

13:52

<troygiorshev> andrewtoth: good point. I think, if you have multiple peers left over after protecting the 20, the logic to choose which one to evict is very simple, so there isn't much benefit to testing that it picks the right one?

143

13:52

<troygiorshev> now that I type that out I'm not sure that I agree with it ...

144

13:53

<jnewbery> it'd be great if we could somehow mock remote IP ranges in the test framework. There are quite a few differences between how we treat local peers and remote peers

145

13:53

<raj_149> jnewbery: mind pointing to few such diffs?

146

13:55

<MarcoFalke> raj_149: Pretty much anything that deals with addresses (addrman, ...)

147

13:55

<gzhao408> lightlike I wondered for a while why minping (best ping time ever) is used instead of pingtime (most recent ping time), but for the purposes of protection it seems like mingping would be much harder for an attacker to manipulate and thus...safer?

148

13:55

<jnewbery> raj_149: I'm not sure if it's that interesting to get into low-level details here. Search for IsLocal() and see where it's used

149

13:56

<emzy> lightlike: I think for ipv4 you can use as source every host in 127.0.0.0/8.

150

13:57

<michaelfolkson> Would testing remote peers be in the remit of maintainers/long term contributors with resources set up? Like industrial testing?

151

13:57

<emzy> Should be all localhost.

152

13:57

<sipa> gzhao408: ping times vary due to network conditions, node overloading, ...; but an attacker still can't fake minping (if minping is N ms, then you know the node is at most N*150 km away)

153

13:58

<lightlike> emzy: Probably those would all map to the same netgroup though?

154

13:59

<emzy> lightlike: no default netgroups is /16 so you could use 127.3.0.1 127.4.0.1 ....

155

13:59

<amiti> lightlike: hmmm, maybe a test could setup 5 nodes as preferred for eviction, and ensure that one of them is the one evicted? 😂

156

13:59

<amiti> not great, the netgroup definitely makes it tricky

157

13:59

<raj_149> sipa: woh, cool formula.

158

13:59

<MarcoFalke> michaelfolkson: I think having the tests call out to remote peers by default could be problematic, though if it is opt-in and can't be achieved in another way, why not

159

14:00

<jnewbery> ok, I've got to run to my next call. Thanks for hosting lightlike!

160

14:00

<jnewbery> Don't forget that tomorrow is the last of our BIP 157 specials, and probably the most interesting from a protocol-design perspective. Notes are here: https://bitcoincore.reviews/19070.html

161

14:00

<jnewbery> And next week we'll get into some nice crypto implementation. Fabian will host a meeting on sipa's Python implementation of MuHash. Notes and questions will be up soon.

162

14:00

<lightlike> ok, thanks all!

163

14:00

<gzhao408> sipa: ah, that makes sense that it can serve as a measurement of physical distance. so this would fall under "hard for attacker to fake" category more than "favorable behavior" category? I was confused because I thought we used it because we cared about how fast our peers were responding

164

14:00

<nehan> thanks lightlike!

165

14:00

<raj_149> thanks lightlike for hosting. has been a great one.

166

14:01

<troygiorshev> thanks lightlike!

167

14:01

<gzhao408> thanks lightlike!