The basic message flow for transaction relay is INV -> GETDATA -> TX. A node that cannot deliver a requested transaction answers the GETDATA request with NOTFOUND instead of TX.
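As a refresher, the responding side can be pictured with a minimal sketch (hypothetical types and helper names, not Bitcoin Core's actual ProcessGetData() code):

```cpp
#include <optional>
#include <string>
#include <vector>

// Minimal stand-ins for illustration; not Bitcoin Core's real types.
struct Tx { std::string txid; };
struct Peer { /* connection state elided */ };

std::optional<Tx> LookupTransaction(const std::string& txid);    // e.g. mempool lookup
void SendTx(Peer& peer, const Tx& tx);                           // send a "tx" message
void SendNotFound(Peer& peer, const std::vector<std::string>&);  // send a "notfound" message

// A peer that cannot deliver a requested transaction answers with NOTFOUND.
void HandleGetData(Peer& peer, const std::vector<std::string>& requested_txids)
{
    std::vector<std::string> not_found;
    for (const auto& txid : requested_txids) {
        if (auto tx = LookupTransaction(txid)) {
            SendTx(peer, *tx);          // happy path: INV -> GETDATA -> TX
        } else {
            not_found.push_back(txid);  // we no longer have this transaction
        }
    }
    if (!not_found.empty()) SendNotFound(peer, not_found);  // INV -> GETDATA -> NOTFOUND
}
```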
Historically, Bitcoin Core would ignore NOTFOUND messages it received. This changed slightly with PR 15834 (Review Club Session), which introduced internal bookkeeping in response to a NOTFOUND: clearing the affected entries from the internal map of in-flight transactions.
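Roughly, that bookkeeping amounts to the following (a sketch with assumed names; the real change lives in net_processing.cpp):

```cpp
#include <set>
#include <string>
#include <vector>

// Sketch of the PR 15834 bookkeeping idea (hypothetical names): on NOTFOUND,
// drop the listed txids from the per-peer in-flight map so those request slots
// free up and the transactions become eligible to be requested again.
struct PeerState {
    std::set<std::string> tx_in_flight;  // txids we have GETDATA'ed from this peer
};

void HandleNotFound(PeerState& state, const std::vector<std::string>& txids)
{
    for (const auto& txid : txids) {
        state.tx_in_flight.erase(txid);  // clear the in-flight entry; nothing more
    }
}
```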
This PR proposes to actually act on a NOTFOUND by immediately requesting the transaction from outbound peers that have previously INV'ed it to us.
This PR touches the central P2P loop, CConnman::ThreadMessageHandler(), by adding a sequence counter to SendMessages(). The counter is used to deterministically select, in each loop cycle, one outbound peer from which we request the transactions for which we have received a NOTFOUND from another peer.
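One way such a counter can drive a deterministic round-robin looks like this (a sketch under assumed names and a simplified matching rule; not the PR's exact code):

```cpp
#include <cstddef>
#include <cstdint>

// Sketch: ThreadMessageHandler passes an ever-increasing sequence number into
// SendMessages(); an outbound peer treats itself as "chosen" when the counter
// matches its position among the outbound peers, so exactly one outbound peer
// per cycle re-requests the transactions that were NOTFOUND elsewhere.
struct Peer {
    bool is_outbound = false;
    std::size_t outbound_index = 0;  // position among the outbound peers
};

bool ChosenThisCycle(const Peer& peer, std::uint64_t sequence, std::size_t num_outbound)
{
    if (!peer.is_outbound || num_outbound == 0) return false;
    return sequence % num_outbound == peer.outbound_index;  // deterministic round-robin
}
```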
Questions
What is the most typical situation when running a node in which NOTFOUND messages are encountered?
Why was the solution with the sequence counter chosen, instead of simpler alternatives like a random draw?
Why are only outbound peers considered for requesting transactions after receiving NOTFOUND?
What kind of options exist in the functional testing framework for P2P functionality? Why is it so hard to test PRs like this one with the existing framework?
<jnewbery_> before I hand over, my periodic reminder that the point of PR review club is to help you review PRs! Please do review the PRs on github (and if you have any questions about how to do that, feel free to ask them here)
<i2> answer to typical situation for NOTFOUND: maybe: a) a blocks-only lite client? b) a full node that's in the process of syncing to the tip? c) an adversarial peer? d) a full node drops a tx if the tx is included in a block but a reorg happens on your peer?
<lightlike> i2: I don't think an adversarial peer would send a NOTFOUND currently. They could just INV a tx and never follow up with the TX if we request it.
<hugohn> lightlike: I see, so when a node comes back online after a while, the INVs typically arrive out-of-order? i.e., receiving INVs for children before receiving INVs for the parent
<lightlike> I'm not entirely sure if this is typical, but in a short test run on mainnet, all of the NOTFOUNDs I got were for parents of orphan transactions.
<jkczyz> Is it to ensure a uniform distribution of peer selection? I left a comment on the review regarding the sequence number but wasn't sure if this was the reason.
<jnewbery_> i2: not quite. It's 1) get a new block and see that it has more work than the tip, 2) run ActivateBestChain, which tries to move to the best chain, which will 2a) rewind blocks, adding their txs back to the mempool, and 2b) connect new blocks, removing txs from the mempool that are in the new block
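In code terms, the reorg path jnewbery_ outlines looks roughly like this (a heavily condensed sketch with hypothetical helpers; the real logic spans ActivateBestChain() and related functions in validation.cpp):

```cpp
#include <vector>

struct Tx { /* elided */ };
struct Block { std::vector<Tx> txs; };

// Hypothetical helpers standing in for validation.cpp internals:
bool TipIsOnBestChain();
Block DisconnectTip();          // rewind one block off the active chain
Block* NextBlockOnBestChain();  // next block to connect, or nullptr when done
void ConnectTip(Block& block);
void AddToMempool(const Tx& tx);
void RemoveFromMempool(const Tx& tx);

void ActivateBestChainSketch()
{
    // 2a) rewind blocks: their transactions are returned to the mempool
    while (!TipIsOnBestChain()) {
        Block block = DisconnectTip();
        for (const Tx& tx : block.txs) AddToMempool(tx);
    }
    // 2b) connect new blocks: transactions confirmed in them leave the mempool
    while (Block* next = NextBlockOnBestChain()) {
        for (const Tx& tx : next->txs) RemoveFromMempool(tx);
        ConnectTip(*next);
    }
}
```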
<lightlike> I think it is important here that there is an "early" and a "late" in the loop, even though ThreadMessageHandler loops continuously through all our peers.
<hugohn> jnewbery_: yes, I saw that, the retrying is actually one peer at a time. But the queuing up for retries happens for all outbound peers, right? To ensure that if one outbound peer fails to give us the txn, the next one still could
<lightlike> because the order of our peers is not random (usually our outbound peers, to which we connect at startup, come first, followed by the inbound peers)
<lightlike> jonatack: I think with the random draw, the node at our current position in the loop would get more action: it has a 1/N chance now, and in case there is an entire cycle without another node getting the INV, it gets a second chance first
<hugohn> jonatack: yeah, I think so, so if we randomly choose only one outbound peer to retry, it's less robust if that peer _also_ fails to send the data. But the advantage of the random draw is no extra sequence_number, which is somewhat awkward to introduce in the main processing loop...
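The trade-off discussed here can be made concrete with a sketch (illustrative only, not code from the PR): an independent 1/N draw per peer can select zero or several peers in a cycle, while a shared counter selects exactly one and visits every outbound peer once every N cycles:

```cpp
#include <cstddef>
#include <cstdint>
#include <random>

// Random draw: each outbound peer independently has a 1/N chance per cycle,
// so a given cycle can select no peer at all, or several.
bool RandomDrawChosen(std::mt19937& rng, std::size_t num_outbound)
{
    std::uniform_int_distribution<std::size_t> dist(0, num_outbound - 1);
    return dist(rng) == 0;  // 1/N, evaluated independently for every peer
}

// Sequence counter: exactly one outbound peer is selected per cycle, and each
// peer is visited once every N cycles regardless of its position in the loop.
bool CounterChosen(std::uint64_t sequence, std::size_t outbound_index, std::size_t num_outbound)
{
    return sequence % num_outbound == outbound_index;
}
```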
<hugohn> but one consequence of queuing up the txns in all outbound peers is that there could potentially be duplicate entries in each m_tx_not_found queue? I don't see any logic checking for duplicates
<hugohn> jnewbery_ lightlike: yeah, so it looks like we indiscriminately queue up the txn any time we get a NOTFOUND. There's no guarantee we can't get multiple NOTFOUNDs for the same txns.
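If deduplication were wanted, one option would be a membership check on insert, along these lines (hypothetical; as hugohn notes, the PR as discussed contains no such guard):

```cpp
#include <deque>
#include <set>
#include <string>

// Hypothetical duplicate guard for the per-peer retry queue; the PR as
// discussed here performs no such check.
struct TxNotFoundQueue {
    std::deque<std::string> m_tx_not_found;  // txids to re-request from this peer
    std::set<std::string> m_queued;          // membership index used for dedup

    void Push(const std::string& txid)
    {
        if (m_queued.insert(txid).second) {  // insert succeeds only for new txids
            m_tx_not_found.push_back(txid);
        }
    }
};
```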
<lightlike> hugohn: yes. I think if we also requested from inbounds, attackers could try to withhold a tx from us by establishing many connections to us
<lightlike> so an attacker could circumvent that by sending us a NOTFOUND right before the GETDATA times out, and then we would be likely to request the tx from another inbound node that is also controlled by the attacker.
<i2> so the rationale for sending txn requests to outbound peers before inbound ones is to prevent an eclipse attack on a node's txn awareness (say, a lightning force close)?
<hugohn> the time delay attack was the motivation for the fix https://github.com/bitcoin/bitcoin/pull/14897, wasn't it? which introduced a few bugs & motivated more robust handling of NOTFOUND
<lightlike> What kind of options exist in the functional testing framework for P2P functionality? Why is it so hard to test PRs like this one with the existing framework?
<hugohn> is the bucketing mechanism (putting nodes in new vs. tried tables) one of the main challenges? do you have to stub out the bucketing logic somehow, or create fake IP addresses?
<hugohn> also a basic q before we run out of time: does anyone know why g_outbound_peers_with_protect_from_disconnect is defined within net_processing.cpp's namespace, but g_outbound_peers is defined outside of it?
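As general C++ background for this question (not an answer specific to this PR): a variable in an unnamed namespace has internal linkage and is only visible within its translation unit, while a file-scope global outside of it can be given external linkage and referenced from other translation units via an extern declaration:

```cpp
// General C++ illustration (not Bitcoin Core's code).
namespace {
    // Internal linkage: only code in this translation unit can name this.
    int g_internal_counter = 0;
}

// External linkage: other translation units can reach this variable by
// declaring `extern int g_shared_counter;`, typically in a header.
int g_shared_counter = 0;
```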