#15505 Request NOTFOUND transactions immediately from other outbound peers, when possible (p2p)
- The basic message flow for transaction relay is
INV -> GETDATA -> TX. A node that cannot deliver a requested transaction answers the
NOTFOUNDmessages were introduced in Bitcoin Core PR 2192 and are documented in the Bitcoin developer reference.
- Historically, Bitcoin Core would ignore
NOTFOUNDmessages it received. This changed slightly with PR 15834, (Review Club Session), which introduced internal bookkeeping measures in response to a NOTFOUND (clearing the internal map of in-flight transactions).
- This PR suggests to actually utilize the info of a
NOTFOUNDby immediately requesting the transaction from outbound peers having
INV‘ed it to us before.
- This PR touches the central P2P loop
CConnman::ThreadMessageHandler()by adding a sequence counter to
SendMessages(). This counter is used to deterministically select one outbound peer in each loop cycle from which we request transactions for which we have received a
NOTFOUNDfrom another peer.
- What is the most typical situation when running a node in which
NOTFOUNDmessages are encountered?
- Why was the solution with the sequence counter chosen, instead of simpler alternatives like a random draw?
- Why are only outbound peers considered for requesting transactions after receiving
- What kind of options exist in the functional testing framework for P2P functionality? Why is it so hard to test PRs like this one with the existing framework?
13:00 < jnewbery_> hi 13:00 < lightlike> hello 13:00 < jkczyz> hi 13:00 < michaelfolkson> Hey 13:01 < davterra> hi 13:01 < jnewbery_> lightlike is going to host today. He also prepared the notes and questions at https://bitcoin-core-review-club.github.io/15505.html . Thanks lightlike! 13:01 < fjahr> hi 13:01 < hugohn> hello 13:01 < i2> hallo 13:01 < jonatack> hi++ 13:01 < jnewbery_> before I hand over, my periodic reminder that the point of PR review club is to help you review PRs! Please do review the PRs on github (an if you have any questions about how to do that, feel free to ask them here) 13:02 < jnewbery_> ok, let's get started. Over to you, lightlike 13:02 < michaelfolkson> Thanks for the reminder John :) 13:02 < lightlike> Thanks! I found this PR interesting because it deals with some of the very basic P2P functionality (ThreadMessageHandler) 13:02 < lightlike> to optimize how we deal with NOTFOND message. 13:03 < ariard> hi 13:03 < lightlike> Even though it got closed by the author recently 13:03 < lightlike> I think it is good for learning about P2P aspects of bitcoin - definitely helped my understand some new things. 13:04 < lightlike> First question (some overlap with the old session on #15834): 13:04 < lightlike> What is the most typical situation when running a node in which NOTFOUND messages are encountered? 13:04 < jkczyz> When the peer no longer has the transaction in the relay set 13:04 < michaelfolkson> Unpruned node requesting historic transaction? 13:05 < michaelfolkson> *Pruned 13:06 < lightlike> In my experience of running a node with logging, it is dealing with orphan transactions: 13:06 < jnewbery_> michaelfolkson: pruning is only relevant for old blocks in the chain. This PR deals with tx relay 13:06 < jnewbery_> (unconfirmed txs) 13:07 < lightlike> when we receive an tx from a peer for which we dont have the parent, we will ask him for the parent too 13:07 < lightlike> sending GETDATA without him having sent an INV before. 13:07 < lightlike> *received 13:08 < michaelfolkson> But a pruned node might request a tx it no longer has? 13:08 < lightlike> in this case, the other node often doesn't have the parent tx (or won't send it to us) and answers with NOTFOUND. 13:09 < jnewbery_> michaelfolkson: pruning is only for the blockchain. tx relay is for unconfirmed txs not in the blockchain yet 13:09 < michaelfolkson> Ok thanks 13:09 < hugohn> how likely is it to receive a txn for which we don't know the parent? is that related to CPFP or no? 13:10 < i2> answer to typical situation for NOTFOUND: maybe: a.) blocks-only Lite client? b.) full-node that's in process of syncing to tip, c.) adversarial peer? d.) full node drops tx if tx is included in block but a reorg happens to your peer? 13:10 < lightlike> when firing up a node, it happens a lot (we weren't there when the parent was relayed) 13:13 < lightlike> i2: i dont think an adversarial peer would send a NOTFOUND currently. He could just INV a tx and never follow up with the TX if we request it. 13:13 < jnewbery_> i2: a) blocks-only clients will advertise that they're not relaying txs with the relay flag in the VERSION message: https://btcinformation.org/en/developer-reference#version 13:14 < jnewbery_> i2: d) in the case of a reorg, the disconnected txs are added back into the mempool 13:14 < hugohn> lightlike: I see, so when a node comes back online after a while, the INVs typically arrive out-of-order? i.e., receiving INVs for children first before receiving INVs for parent 13:14 < lightlike> im not entirely sure if this is typical, but in a short test-run on mainnet all of the NOTFOUNDs i got were for parents of orphan transactions. 13:15 < jnewbery_> lightlike: I think that's fairly typical, but I haven't actually tested it myself 13:15 < i2> jnewbery_ re d) so the steps are 1.) new block 2.) shed txns in block from mempool 3) reorg 4) put txn from reorg'ed blocks back into mempool? 13:16 < lightlike> ok, maybe move on to the next q: 13:16 < lightlike> Why was the solution with the sequence counter chosen, instead of simpler alternatives like a random draw? 13:16 < jonatack> lightlike: my testing of the NOTFOUND PR showed the same as you 13:16 < i2> lightlike: maybe a.) Decrease p2p churn? b.) Equalize load requested from peers (no peer wins the INV-request lottery) 13:17 < jkczyz> Is it to ensure a uniform distribution of peer selection? I left a comment on the review regarding the sequence number but wasn't sure if this was the reason. 13:17 < hugohn> I guess for robustness: we want to retry fetching the txn from all eligible outbound peers, not just one 13:17 < jnewbery_> i2: not quite. It's 1) get new block and see that it's got more work than tip 2) run ActivateBestChain, which tries to move to the best chain, which will 2a) rewind blocks, adding txs to mempool 2b) connect new blocks, removing txs from mempool that in the new block 13:19 < lightlike> yes, i think that's correct, it's to not ask more transansaction from peers earlier in the loop. 13:20 < michaelfolkson> How many peers is a node on average connected to? I'm assuming most would run the default config settings 13:20 < jnewbery_> hugohn: no, we only want to request from one peer at a time. That tripped me up too. The logic to make sure we're not requesting from more than one peer is in TryRequestTx(): https://github.com/bitcoin/bitcoin/pull/15505/files#diff-eff7adeaec73a769788bb78858815c91R773 13:21 < lightlike> i think it is important here that there is an "early" and "late" in the loop, even though ThreadMessageHandler loops continuously through all our peer. 13:21 < jnewbery_> michaelfolkson: up to 8 outbound and (for listening nodes) up to 125 total (so 117 inbound) 13:21 < hugohn> jnewbery_: yes I saw that, the retrying is actually one peer at a time. but the queuing up for retries happen for all outbound peers, right? to ensure that if one outbound peer fails to give us the txn, the next one still could 13:22 < lightlike> because the order of our peers is not random (usually there are first our outbound peers to which we connect at startup, and after that the inbound peers) 13:22 < jnewbery_> hugohn: right, the txid will be added to m_tx_not_found for all outbound peers that have announced the tx 13:22 < jonatack> perhaps a random draw would not ensure the taking of turns e.g. the same peer could be picked twice in a row 13:23 < michaelfolkson> Yeah especially if only 8 outbound 13:24 < lightlike> jonatack: i think with the random draw, the node in our current position in the loop would get more action: it has now an 1/N chance, and in case there is an entire cycle without anohter node getting the INV, it gets a second chance first 13:24 < i2> jonatack "peer could be picked twice in a row" = INV 'lottery' winner ;] 13:24 < hugohn> jonatack: yeah I think so, so if we randomly choose only one outbound peer to retry, it's less robust if that peer _also_ fails to SENDDATA . but the advantage of random draw is no extra sequence_number, which is somewhat awkward to introduce in the main processing loop... 13:24 < jonatack> lightlike: right, in that case random would have more bias 13:24 < jnewbery_> one thing we could do is shuffle the order we visit peers in the ThreadMessageHandler here: https://github.com/bitcoin/bitcoin/blob/7821821a23b68cc9ec49d69829ad4c939cb762e8/src/net.cpp#L1939 13:25 < lightlike> jnewbery_: yes, I thought about that too, that could be randomized 13:26 < jkczyz> jnewbery_: how important is maintaining determinism for sake of testing? 13:26 < lightlike> i actually like the proposal of jkczyz https://github.com/bitcoin/bitcoin/pull/15505#issuecomment-516636613 - on first thought I think that might work 13:26 < jnewbery_> jkczyz: I don't think it's important. My guess is that no tests will break. 13:26 < hugohn> but one consequence of queuing up the txns in all outbound peers is there could potentially be duplicate entries in each m_tx_not_found queue? I don't see any logic checking against with duplicates 13:27 < hugohn> do you guys think it's necessary? to check against duplicates in m_tx_not_found? 13:28 < jnewbery_> hugohn: I think you may be right about duplicates in m_tx_not_found 13:30 < lightlike> ok, next question: Why are only outbound peers considered for requesting transactions after receiving NOTFOUND? 13:31 < lightlike> hugohn: I think so too, not sure thought how bad duplicates would be. 13:32 < i2> clarify my ignorance: outbound peers = non-listen-only peers. yes? 13:32 < hugohn> jnewbery_ lightlike: yeah so looks like we indiscriminately queue up the txn any time we get a NOTFOUND. there's no guarantee we can't get multiple NOTFOUNDs for the same txns. 13:32 < hugohn> or back-to-back 13:32 < hugohn> retry failure 13:32 < lightlike> i2: no, outbound peers are when we initiate the connection, vs inbound peers that connect to us 13:32 < michaelfolkson> https://bitcoin.stackexchange.com/questions/42286/what-is-the-difference-between-inbound-and-outbound-connections 13:33 < michaelfolkson> I made same mistake 13:34 < michaelfolkson> Because you would rather cycle through nodes you have personally selected rather than random nodes that you haven't selected? 13:35 < lightlike> michaelfolkson: yes, that's true, but why not just treat everyone the same? 13:35 < hugohn> we generally trust the outbound peers more than inbound ones, to guard against Sybil attacks? 13:36 < michaelfolkson> Because nodes aren't the same. Some will go offline and won't make timely responses? 13:36 < lightlike> hugohn: yes. I think if we also requested from inbounds, attackers could try to withhold a tx from us by establishing many connections to us 13:37 < lightlike> the normal logic if a GETDATA times out also treats outbounds nodes preferentially 13:37 < jonatack> yes, idea is attackers could use inbound conns to prevent us from seeing a txn 13:38 < jonatack> thus the delay placed on inbounds giving preferance to outbounds 13:38 < lightlike> so an attacker could circumvent that by sending us NOTFOUND right before the GETDATA times out, and then we would be likely to ask the tx from another inbound node, that is also controlled by the attacker. 13:38 < jonatack> nNow + 2 sec 13:39 < lightlike> and so on 13:40 < i2> so rational for txn requests to outbound peers before inbound is to prevent an eclipse attack on a node for txn awareness (say lighnting force close)? 13:41 < lightlike> i2: yes, I think so. it's much harder for an attacker to control our outbound peers that we chose carefully. 13:42 < hugohn> the time delay attack was the motivation for the fix https://github.com/bitcoin/bitcoin/pull/14897 wasn't it? which introduced a few bugs & motivated more robustness handling of NOTFOUND 13:43 < jonatack> hugohn: yes, iirc 13:43 < hugohn> retry logic is hard :-) 13:44 < hugohn> so many edge cases 13:44 < lightlike> ok, final question: 13:44 < lightlike> What kind of options exist in the functional testing framework for P2P functionality? Why is it so hard to test PRs like this one with the existing framework? 13:45 < jonatack> Good read on this at https://github.com/bitcoin/bitcoin/issues/14210 13:46 < jonatack> (current limitations and goal of an instantiable CConnman, then a MockNetworkRouter) 13:48 < lightlike> jonatack: yes, seems like it is not possible at the moment to fully mock inbound/outbound behavior in functional tests. 13:48 < ariard> We currently dissociate the node added from cli and from addrman 13:49 < jonatack> and the testing can do manual conns but not spontanous ones 13:49 < jnewbery_> IIRC there's also special logic for peers on the same subnet or IP address, which all our test nodes are for functional tests 13:50 < jnewbery_> yeah, I think it's the IsLocal() test 13:50 < hugohn> is the bucketing mechanism (putting nodes in new vs. tried tables) one of the main challenges? do you have to stub out the bucketing logic somehow, or create fake IP addresses? 13:51 < jnewbery_> there are a few things that would need to change. I think they're all documented in 14210 13:51 < ariard> I think we should create fake IP addresses for our nodes thanks to namespaces 13:52 < jnewbery_> ariard: I don't know anything about that, but it sounds like a good idea! 13:52 < jonatack> Ethan Deilman proposed that here https://github.com/bitcoin/bitcoin/issues/14210#issuecomment-469318300 13:53 < jonatack> * Heilman 13:53 < ariard> jnewbery: how IIRC that's why we conclude with dongcarl in 14210 13:54 < jonatack> ariard: right, namespaces as you proposed https://github.com/bitcoin/bitcoin/issues/14210#issuecomment-473361633 13:55 < ariard> not sure about all this stuff, it need more thought but first cconman encapsulation! 13:55 < ariard> *needs 13:55 < lightlike> do you know if the work towards an instantiable CConnman is still in an early stage, or close to being finished? 13:55 < jonatack> before we run out of time... is anyone working on this atm? 13:55 < hugohn> all roads point to more modular components :-) 13:55 < ariard> I do what I can https://github.com/bitcoin/bitcoin/pull/16503 ;) 13:56 < ariard> jonatack: dongcarl, theuni, ryanofksy 13:56 < hugohn> ariard: yes moar pls ! :D 13:56 < ariard> but it split between different parts of the codebase 13:57 < jonatack> ariard: currently? 13:57 < ariard> we'll try to remove more cconnman usages in RPC interface, don't remember what need to be done in net.cpp 13:57 < ariard> *will 13:57 < hugohn> also a basic q before we run out: does anyone know why g_outbound_peers_with_protect_from_disconnect is defined within net_processing.cpp’s namespace, but g_outbound_peers is defined outside of it? 13:58 < ariard> jonatack: dunno rn, ask on core dev 13:58 < lightlike> there is #14856, where there was not so much action recently 13:58 < lightlike> https://github.com/bitcoin/bitcoin/pull/14856 13:59 < ariard> lightlike: ping carl on it, he may need help 13:59 < jnewbery_> hugohn: no idea! 13:59 < hugohn> ha 13:59 < jonatack> thanks, interesting. we need to be reviewing these PRs to encourage people 13:59 < jnewbery_> Shall we wrap up there? 14:00 < lightlike> yes, thanks everybody! 14:00 < i2> thanks lightlike! 14:00 < jnewbery_> that was great. Thanks so much for hosting and preparing notes lightlike! 14:00 < jonatack> great jon, lightlike :100: would attend again 14:00 < hugohn> thanks lightlike! learned a lot! 14:00 < ariard> thanks lightlike! 14:00 < jkczyz> yeah, thank you! 14:01 < jonatack> * job :-) 14:01 < jnewbery_> Before we go: next week PR review is looking at *two* competing PRs! https://bitcoin-core-review-club.github.io/16345.html 14:01 < jnewbery_> notes are already up 14:02 < jonatack> nice