Request NOTFOUND transactions immediately from other outbound peers, when possible (p2p)

Jul 31, 2019

https://github.com/bitcoin/bitcoin/pull/15505

Host: mzumsande - PR author: sdaftuar

Notes

The basic message flow for transaction relay is INV -> GETDATA -> TX. A node that cannot deliver a requested transaction answers the GETDATA request with NOTFOUND instead of TX.
NOTFOUND messages were introduced in Bitcoin Core PR 2192 and are documented in the Bitcoin developer reference.
Historically, Bitcoin Core would ignore NOTFOUND messages it received. This changed slightly with PR 15834, (Review Club Session), which introduced internal bookkeeping measures in response to a NOTFOUND (clearing the internal map of in-flight transactions).
This PR suggests to actually utilize the info of a NOTFOUND by immediately requesting the transaction from outbound peers having INV‘ed it to us before.
This PR touches the central P2P loop CConnman::ThreadMessageHandler() by adding a sequence counter to SendMessages(). This counter is used to deterministically select one outbound peer in each loop cycle from which we request transactions for which we have received a NOTFOUND from another peer.

Questions

What is the most typical situation when running a node in which NOTFOUND messages are encountered?
Why was the solution with the sequence counter chosen, instead of simpler alternatives like a random draw?
Why are only outbound peers considered for requesting transactions after receiving NOTFOUND?
What kind of options exist in the functional testing framework for P2P functionality? Why is it so hard to test PRs like this one with the existing framework?

Meeting Log

13:00

<jnewbery_> hi

13:00

<lightlike> hello

13:00

<jkczyz> hi

13:00

<michaelfolkson> Hey

13:01

<davterra> hi

13:01

<jnewbery_> lightlike is going to host today. He also prepared the notes and questions at https://bitcoin-core-review-club.github.io/15505.html . Thanks lightlike!

13:01

<fjahr> hi

13:01

<hugohn> hello

13:01

<i2> hallo

13:01

<jonatack> hi++

13:01

<jnewbery_> before I hand over, my periodic reminder that the point of PR review club is to help you review PRs! Please do review the PRs on github (an if you have any questions about how to do that, feel free to ask them here)

13:02

<jnewbery_> ok, let's get started. Over to you, lightlike

13:02

<michaelfolkson> Thanks for the reminder John :)

13:02

<lightlike> Thanks! I found this PR interesting because it deals with some of the very basic P2P functionality (ThreadMessageHandler)

13:02

<lightlike> to optimize how we deal with NOTFOND message.

13:03

<ariard> hi

13:03

<lightlike> Even though it got closed by the author recently

13:03

<lightlike> I think it is good for learning about P2P aspects of bitcoin - definitely helped my understand some new things.

13:04

<lightlike> First question (some overlap with the old session on #15834):

13:04

<lightlike> What is the most typical situation when running a node in which NOTFOUND messages are encountered?

13:04

<jkczyz> When the peer no longer has the transaction in the relay set

13:04

<michaelfolkson> Unpruned node requesting historic transaction?

13:05

<michaelfolkson> *Pruned

13:06

<lightlike> In my experience of running a node with logging, it is dealing with orphan transactions:

13:06

<jnewbery_> michaelfolkson: pruning is only relevant for old blocks in the chain. This PR deals with tx relay

13:06

<jnewbery_> (unconfirmed txs)

13:07

<lightlike> when we receive an tx from a peer for which we dont have the parent, we will ask him for the parent too

13:07

<lightlike> sending GETDATA without him having sent an INV before.

13:07

<lightlike> *received

13:08

<michaelfolkson> But a pruned node might request a tx it no longer has?

13:08

<lightlike> in this case, the other node often doesn't have the parent tx (or won't send it to us) and answers with NOTFOUND.

13:09

<jnewbery_> michaelfolkson: pruning is only for the blockchain. tx relay is for unconfirmed txs not in the blockchain yet

13:09

<michaelfolkson> Ok thanks

13:09

<hugohn> how likely is it to receive a txn for which we don't know the parent? is that related to CPFP or no?

13:10

<i2> answer to typical situation for NOTFOUND: maybe: a.) blocks-only Lite client? b.) full-node that's in process of syncing to tip, c.) adversarial peer? d.) full node drops tx if tx is included in block but a reorg happens to your peer?

13:10

<lightlike> when firing up a node, it happens a lot (we weren't there when the parent was relayed)

13:13

<lightlike> i2: i dont think an adversarial peer would send a NOTFOUND currently. He could just INV a tx and never follow up with the TX if we request it.

13:13

<jnewbery_> i2: a) blocks-only clients will advertise that they're not relaying txs with the relay flag in the VERSION message: https://btcinformation.org/en/developer-reference#version

13:14

<jnewbery_> i2: d) in the case of a reorg, the disconnected txs are added back into the mempool

13:14

<hugohn> lightlike: I see, so when a node comes back online after a while, the INVs typically arrive out-of-order? i.e., receiving INVs for children first before receiving INVs for parent

13:14

<lightlike> im not entirely sure if this is typical, but in a short test-run on mainnet all of the NOTFOUNDs i got were for parents of orphan transactions.

13:15

<jnewbery_> lightlike: I think that's fairly typical, but I haven't actually tested it myself

13:15

<i2> jnewbery_ re d) so the steps are 1.) new block 2.) shed txns in block from mempool 3) reorg 4) put txn from reorg'ed blocks back into mempool?

13:16

<lightlike> ok, maybe move on to the next q:

13:16

<lightlike> Why was the solution with the sequence counter chosen, instead of simpler alternatives like a random draw?

13:16

<jonatack> lightlike: my testing of the NOTFOUND PR showed the same as you

13:16

<i2> lightlike: maybe a.) Decrease p2p churn? b.) Equalize load requested from peers (no peer wins the INV-request lottery)

13:17

<jkczyz> Is it to ensure a uniform distribution of peer selection? I left a comment on the review regarding the sequence number but wasn't sure if this was the reason.

13:17

<hugohn> I guess for robustness: we want to retry fetching the txn from all eligible outbound peers, not just one

13:17

<jnewbery_> i2: not quite. It's 1) get new block and see that it's got more work than tip 2) run ActivateBestChain, which tries to move to the best chain, which will 2a) rewind blocks, adding txs to mempool 2b) connect new blocks, removing txs from mempool that in the new block

13:19

<lightlike> yes, i think that's correct, it's to not ask more transansaction from peers earlier in the loop.

13:20

<michaelfolkson> How many peers is a node on average connected to? I'm assuming most would run the default config settings

13:20

<jnewbery_> hugohn: no, we only want to request from one peer at a time. That tripped me up too. The logic to make sure we're not requesting from more than one peer is in TryRequestTx(): https://github.com/bitcoin/bitcoin/pull/15505/files#diff-eff7adeaec73a769788bb78858815c91R773

13:21

<lightlike> i think it is important here that there is an "early" and "late" in the loop, even though ThreadMessageHandler loops continuously through all our peer.

13:21

<jnewbery_> michaelfolkson: up to 8 outbound and (for listening nodes) up to 125 total (so 117 inbound)

13:21

<hugohn> jnewbery_: yes I saw that, the retrying is actually one peer at a time. but the queuing up for retries happen for all outbound peers, right? to ensure that if one outbound peer fails to give us the txn, the next one still could

13:22

<lightlike> because the order of our peers is not random (usually there are first our outbound peers to which we connect at startup, and after that the inbound peers)

13:22

<jnewbery_> hugohn: right, the txid will be added to m_tx_not_found for all outbound peers that have announced the tx

13:22

<jonatack> perhaps a random draw would not ensure the taking of turns e.g. the same peer could be picked twice in a row

13:23

<michaelfolkson> Yeah especially if only 8 outbound

13:24

<lightlike> jonatack: i think with the random draw, the node in our current position in the loop would get more action: it has now an 1/N chance, and in case there is an entire cycle without anohter node getting the INV, it gets a second chance first

13:24

<i2> jonatack "peer could be picked twice in a row" = INV 'lottery' winner ;]

13:24

<hugohn> jonatack: yeah I think so, so if we randomly choose only one outbound peer to retry, it's less robust if that peer _also_ fails to SENDDATA . but the advantage of random draw is no extra sequence_number, which is somewhat awkward to introduce in the main processing loop...

13:24

<jonatack> lightlike: right, in that case random would have more bias

13:24

<jnewbery_> one thing we could do is shuffle the order we visit peers in the ThreadMessageHandler here: https://github.com/bitcoin/bitcoin/blob/7821821a23b68cc9ec49d69829ad4c939cb762e8/src/net.cpp#L1939

13:25

<lightlike> jnewbery_: yes, I thought about that too, that could be randomized

13:26

<jkczyz> jnewbery_: how important is maintaining determinism for sake of testing?

13:26

<lightlike> i actually like the proposal of jkczyz https://github.com/bitcoin/bitcoin/pull/15505#issuecomment-516636613 - on first thought I think that might work

13:26

<jnewbery_> jkczyz: I don't think it's important. My guess is that no tests will break.

13:26

<hugohn> but one consequence of queuing up the txns in all outbound peers is there could potentially be duplicate entries in each m_tx_not_found queue? I don't see any logic checking against with duplicates

13:27

<hugohn> do you guys think it's necessary? to check against duplicates in m_tx_not_found?

13:28

<jnewbery_> hugohn: I think you may be right about duplicates in m_tx_not_found

13:30

<lightlike> ok, next question: Why are only outbound peers considered for requesting transactions after receiving NOTFOUND?

13:31

<lightlike> hugohn: I think so too, not sure thought how bad duplicates would be.

13:32

<i2> clarify my ignorance: outbound peers = non-listen-only peers. yes?

13:32

<hugohn> jnewbery_ lightlike: yeah so looks like we indiscriminately queue up the txn any time we get a NOTFOUND. there's no guarantee we can't get multiple NOTFOUNDs for the same txns.

13:32

<hugohn> or back-to-back

13:32

<hugohn> retry failure

13:32

<lightlike> i2: no, outbound peers are when we initiate the connection, vs inbound peers that connect to us

13:32

<michaelfolkson> https://bitcoin.stackexchange.com/questions/42286/what-is-the-difference-between-inbound-and-outbound-connections

13:33

<michaelfolkson> I made same mistake

13:34

<michaelfolkson> Because you would rather cycle through nodes you have personally selected rather than random nodes that you haven't selected?

13:35

<lightlike> michaelfolkson: yes, that's true, but why not just treat everyone the same?

13:35

<hugohn> we generally trust the outbound peers more than inbound ones, to guard against Sybil attacks?

13:36

<michaelfolkson> Because nodes aren't the same. Some will go offline and won't make timely responses?

13:36

<lightlike> hugohn: yes. I think if we also requested from inbounds, attackers could try to withhold a tx from us by establishing many connections to us

13:37

<lightlike> the normal logic if a GETDATA times out also treats outbounds nodes preferentially

13:37

<jonatack> yes, idea is attackers could use inbound conns to prevent us from seeing a txn

13:38

<jonatack> thus the delay placed on inbounds giving preferance to outbounds

13:38

<lightlike> so an attacker could circumvent that by sending us NOTFOUND right before the GETDATA times out, and then we would be likely to ask the tx from another inbound node, that is also controlled by the attacker.

13:38

<jonatack> nNow + 2 sec

13:39

<lightlike> and so on

13:40

<i2> so rational for txn requests to outbound peers before inbound is to prevent an eclipse attack on a node for txn awareness (say lighnting force close)?

13:41

<lightlike> i2: yes, I think so. it's much harder for an attacker to control our outbound peers that we chose carefully.

13:42

<hugohn> the time delay attack was the motivation for the fix https://github.com/bitcoin/bitcoin/pull/14897 wasn't it? which introduced a few bugs & motivated more robustness handling of NOTFOUND

13:43

<jonatack> hugohn: yes, iirc

13:43

<hugohn> retry logic is hard :-)

13:44

<hugohn> so many edge cases

13:44

<lightlike> ok, final question:

100

13:44

<lightlike> What kind of options exist in the functional testing framework for P2P functionality? Why is it so hard to test PRs like this one with the existing framework?

101

13:45

<jonatack> Good read on this at https://github.com/bitcoin/bitcoin/issues/14210

102

13:46

<jonatack> (current limitations and goal of an instantiable CConnman, then a MockNetworkRouter)

103

13:48

<lightlike> jonatack: yes, seems like it is not possible at the moment to fully mock inbound/outbound behavior in functional tests.

104

13:48

<ariard> We currently dissociate the node added from cli and from addrman

105

13:49

<jonatack> and the testing can do manual conns but not spontanous ones

106

13:49

<jnewbery_> IIRC there's also special logic for peers on the same subnet or IP address, which all our test nodes are for functional tests

107

13:50

<jnewbery_> yeah, I think it's the IsLocal() test

108

13:50

<hugohn> is the bucketing mechanism (putting nodes in new vs. tried tables) one of the main challenges? do you have to stub out the bucketing logic somehow, or create fake IP addresses?

109

13:51

<jnewbery_> there are a few things that would need to change. I think they're all documented in 14210

110

13:51

<ariard> I think we should create fake IP addresses for our nodes thanks to namespaces

111

13:52

<jnewbery_> ariard: I don't know anything about that, but it sounds like a good idea!

112

13:52

<jonatack> Ethan Deilman proposed that here https://github.com/bitcoin/bitcoin/issues/14210#issuecomment-469318300

113

13:53

<jonatack> * Heilman

114

13:53

<ariard> jnewbery: how IIRC that's why we conclude with dongcarl in 14210

115

13:54

<jonatack> ariard: right, namespaces as you proposed https://github.com/bitcoin/bitcoin/issues/14210#issuecomment-473361633

116

13:55

<ariard> not sure about all this stuff, it need more thought but first cconman encapsulation!

117

13:55

<ariard> *needs

118

13:55

<lightlike> do you know if the work towards an instantiable CConnman is still in an early stage, or close to being finished?

119

13:55

<jonatack> before we run out of time... is anyone working on this atm?

120

13:55

<hugohn> all roads point to more modular components :-)

121

13:55

<ariard> I do what I can https://github.com/bitcoin/bitcoin/pull/16503 ;)

122

13:56

<ariard> jonatack: dongcarl, theuni, ryanofksy

123

13:56

<hugohn> ariard: yes moar pls ! :D

124

13:56

<ariard> but it split between different parts of the codebase

125

13:57

<jonatack> ariard: currently?

126

13:57

<ariard> we'll try to remove more cconnman usages in RPC interface, don't remember what need to be done in net.cpp

127

13:57

<ariard> *will

128

13:57

<hugohn> also a basic q before we run out: does anyone know why g_outbound_peers_with_protect_from_disconnect is defined within net_processing.cpp’s namespace, but g_outbound_peers is defined outside of it?

129

13:58

<ariard> jonatack: dunno rn, ask on core dev

130

13:58

<lightlike> there is #14856, where there was not so much action recently

131

13:58

<lightlike> https://github.com/bitcoin/bitcoin/pull/14856

132

13:59

<ariard> lightlike: ping carl on it, he may need help