Request NOTFOUND transactions immediately from other outbound peers, when possible (p2p)

Host: mzumsande


  • The basic message flow for transaction relay is INV -> GETDATA -> TX. A node that cannot deliver a requested transaction answers the GETDATA request with NOTFOUND instead of TX.
  • NOTFOUND messages were introduced in Bitcoin Core PR 2192 and are documented in the Bitcoin developer reference.
  • Historically, Bitcoin Core would ignore NOTFOUND messages it received. This changed slightly with PR 15834, (Review Club Session), which introduced internal bookkeeping measures in response to a NOTFOUND (clearing the internal map of in-flight transactions).
  • This PR suggests to actually utilize the info of a NOTFOUND by immediately requesting the transaction from outbound peers having INV‘ed it to us before.
  • This PR touches the central P2P loop CConnman::ThreadMessageHandler() by adding a sequence counter to SendMessages(). This counter is used to deterministically select one outbound peer in each loop cycle from which we request transactions for which we have received a NOTFOUND from another peer.


  • What is the most typical situation when running a node in which NOTFOUND messages are encountered?
  • Why was the solution with the sequence counter chosen, instead of simpler alternatives like a random draw?
  • Why are only outbound peers considered for requesting transactions after receiving NOTFOUND?
  • What kind of options exist in the functional testing framework for P2P functionality? Why is it so hard to test PRs like this one with the existing framework?

Meeting log

13:00 < jnewbery_> hi
13:00 < lightlike> hello
13:00 < jkczyz> hi
13:00 < michaelfolkson> Hey
13:01 < davterra> hi
13:01 < jnewbery_> lightlike is going to host today. He also prepared the notes and questions at . Thanks lightlike!
13:01 < fjahr> hi
13:01 < hugohn> hello
13:01 < i2> hallo
13:01 < jonatack> hi++
13:01 < jnewbery_> before I hand over, my periodic reminder that the point of PR review club is to help you review PRs! Please do review the PRs on github (an if you have any questions about how to do that, feel free to ask them here)
13:02 < jnewbery_> ok, let's get started. Over to you, lightlike
13:02 < michaelfolkson> Thanks for the reminder John :)
13:02 < lightlike> Thanks! I found this PR interesting because it deals with some of the very basic P2P functionality (ThreadMessageHandler)
13:02 < lightlike> to optimize how we deal with NOTFOND message.
13:03 < ariard> hi
13:03 < lightlike> Even though it got closed by the author recently
13:03 < lightlike> I think it is good for learning about  P2P aspects of bitcoin - definitely helped my understand some new things.
13:04 < lightlike> First question (some overlap with the old session on #15834):
13:04 < lightlike> What is the most typical situation when running a node in which NOTFOUND messages are encountered?
13:04 < jkczyz> When the peer no longer has the transaction in the relay set
13:04 < michaelfolkson> Unpruned node requesting historic transaction?
13:05 < michaelfolkson> *Pruned
13:06 < lightlike> In my experience of running a node with logging, it is dealing with orphan transactions:
13:06 < jnewbery_> michaelfolkson: pruning is only relevant for old blocks in the chain. This PR deals with tx relay
13:06 < jnewbery_> (unconfirmed txs)
13:07 < lightlike> when we receive an tx from a peer for which we dont have the parent, we will ask him for the parent too
13:07 < lightlike> sending GETDATA without him having sent an INV before.
13:07 < lightlike> *received
13:08 < michaelfolkson> But a pruned node might request a tx it no longer has?
13:08 < lightlike> in this case, the other node often doesn't have the parent tx (or won't send it to us) and answers with NOTFOUND.
13:09 < jnewbery_> michaelfolkson: pruning is only for the blockchain. tx relay is for unconfirmed txs not in the blockchain yet
13:09 < michaelfolkson> Ok thanks
13:09 < hugohn> how likely is it to receive a txn for which we don't know the parent? is that related to CPFP or no?
13:10 < i2> answer to typical situation for NOTFOUND: maybe: a.) blocks-only Lite client? b.) full-node that's in process of syncing to tip, c.) adversarial peer? d.) full node drops tx if tx is included in block but a reorg happens to your peer?
13:10 < lightlike> when firing up a node, it happens a lot (we weren't there when the parent was relayed)
13:13 < lightlike> i2: i dont think an adversarial peer would send a NOTFOUND currently. He could just INV a tx and never follow up with the TX if we request it.
13:13 < jnewbery_> i2: a) blocks-only clients will advertise that they're not relaying txs with the relay flag in the VERSION message:
13:14 < jnewbery_> i2: d) in the case of a reorg, the disconnected txs are added back into the mempool
13:14 < hugohn> lightlike: I see, so when a node comes back online after a while, the INVs typically arrive out-of-order? i.e., receiving INVs for children first before receiving INVs for parent
13:14 < lightlike> im not entirely sure if this is typical, but in a short test-run on mainnet all of the NOTFOUNDs i got were for parents of orphan transactions.
13:15 < jnewbery_> lightlike: I think that's fairly typical, but I haven't actually tested it myself
13:15 < i2> jnewbery_ re d) so the steps are 1.) new block 2.) shed txns in block from mempool 3) reorg 4) put txn from reorg'ed blocks back into mempool?
13:16 < lightlike> ok, maybe move on to the next q:
13:16 < lightlike> Why was the solution with the sequence counter chosen, instead of simpler alternatives like a random draw?
13:16 < jonatack> lightlike: my testing of the NOTFOUND PR showed the same as you
13:16 < i2> lightlike: maybe a.) Decrease p2p churn?  b.) Equalize load requested from peers (no peer wins the INV-request lottery)
13:17 < jkczyz> Is it to ensure a uniform distribution of peer selection? I left a comment on the review regarding the sequence number but wasn't sure if this was the reason.
13:17 < hugohn> I guess for robustness: we want to retry fetching the txn from all eligible outbound peers, not just one
13:17 < jnewbery_> i2: not quite. It's 1) get new block and see that it's got more work than tip 2) run ActivateBestChain, which tries to move to the best chain, which will 2a) rewind blocks, adding txs to mempool 2b) connect new blocks, removing txs from mempool that in the new block
13:19 < lightlike> yes, i think that's correct, it's to not ask more transansaction from peers earlier in the loop.
13:20 < michaelfolkson> How many peers is a node on average connected to? I'm assuming most would run the default config settings
13:20 < jnewbery_> hugohn: no, we only want to request from one peer at a time. That tripped me up too. The logic to make sure we're not requesting from more than one peer is in TryRequestTx():
13:21 < lightlike> i think it is important here that there is an "early" and "late" in the loop, even though ThreadMessageHandler loops continuously through all our peer.
13:21 < jnewbery_> michaelfolkson: up to 8 outbound and (for listening nodes) up to 125 total (so 117 inbound)
13:21 < hugohn> jnewbery_: yes I saw that, the retrying is actually one peer at a time. but the queuing up for retries happen for all outbound peers, right? to ensure that if one outbound peer fails to give us the txn, the next one still could
13:22 < lightlike> because the order of our peers is not random (usually there are first our outbound peers to which we connect at startup, and after that the inbound peers)
13:22 < jnewbery_> hugohn: right, the txid will be added to m_tx_not_found for all outbound peers that have announced the tx
13:22 < jonatack> perhaps a random draw would not ensure the taking of turns e.g. the same peer could be picked twice in a row
13:23 < michaelfolkson> Yeah especially if only 8 outbound
13:24 < lightlike> jonatack: i think with the random draw, the node in our current position in the loop would get more action: it has now an 1/N chance, and in case there is an entire cycle without anohter node getting the INV, it gets a second chance first
13:24 < i2> jonatack "peer could be picked twice in a row" = INV 'lottery' winner ;]
13:24 < hugohn> jonatack: yeah I think so, so if we randomly choose only one outbound peer to retry, it's less robust if that peer _also_ fails to SENDDATA . but the advantage of random draw is no extra sequence_number, which is somewhat awkward to introduce in the main processing loop...
13:24 < jonatack> lightlike: right, in that case random would have more bias
13:24 < jnewbery_> one thing we could do is shuffle the order we visit peers in the ThreadMessageHandler here:
13:25 < lightlike> jnewbery_: yes, I thought about that too, that could be randomized
13:26 < jkczyz> jnewbery_: how important is maintaining determinism for sake of testing?
13:26 < lightlike> i actually like the proposal of jkczyz - on first thought I think that might work
13:26 < jnewbery_> jkczyz: I don't think it's important. My guess is that no tests will break.
13:26 < hugohn> but one consequence of queuing up the txns in all outbound peers is there could potentially be duplicate entries in each m_tx_not_found queue? I don't see any logic checking against with duplicates
13:27 < hugohn> do you guys think it's necessary? to check against duplicates in m_tx_not_found?
13:28 < jnewbery_> hugohn: I think you may be right about duplicates in m_tx_not_found
13:30 < lightlike> ok, next question: Why are only outbound peers considered for requesting transactions after receiving NOTFOUND?
13:31 < lightlike> hugohn: I think so too, not sure thought how bad duplicates would be.
13:32 < i2> clarify my ignorance: outbound peers = non-listen-only peers. yes?
13:32 < hugohn> jnewbery_ lightlike: yeah so looks like we indiscriminately queue up the txn any time we get a NOTFOUND. there's no guarantee we can't get multiple NOTFOUNDs for the same txns.
13:32 < hugohn> or back-to-back
13:32 < hugohn> retry failure
13:32 < lightlike> i2: no, outbound peers are when we initiate the connection, vs inbound peers that connect to us
13:32 < michaelfolkson>
13:33 < michaelfolkson> I made same mistake
13:34 < michaelfolkson> Because you would rather cycle through nodes you have personally selected rather than random nodes that you haven't selected?
13:35 < lightlike> michaelfolkson: yes, that's true, but why not just treat everyone the same?
13:35 < hugohn> we generally trust the outbound peers more than inbound ones, to guard against Sybil attacks?
13:36 < michaelfolkson> Because nodes aren't the same. Some will go offline and won't make timely responses?
13:36 < lightlike> hugohn: yes. I think if we also requested from inbounds, attackers could try to withhold a tx from us by establishing many connections to us
13:37 < lightlike> the normal logic if a GETDATA times out also treats outbounds nodes preferentially
13:37 < jonatack> yes, idea is attackers could use inbound conns to prevent us from seeing a txn
13:38 < jonatack> thus the delay placed on inbounds giving preferance to outbounds
13:38 < lightlike> so an attacker could circumvent that by sending us NOTFOUND right before the GETDATA times out, and then we would be likely to ask the tx from another inbound node, that is also controlled by the attacker.
13:38 < jonatack> nNow + 2 sec
13:39 < lightlike> and so on
13:40 < i2> so rational for txn requests to outbound peers before inbound is to prevent an eclipse attack on a node for txn awareness (say lighnting force close)?
13:41 < lightlike> i2: yes, I think so. it's much harder for an attacker to control our outbound peers that we chose carefully.
13:42 < hugohn> the time delay attack was the motivation for the fix wasn't it? which introduced a few bugs & motivated more robustness handling of NOTFOUND
13:43 < jonatack> hugohn: yes, iirc
13:43 < hugohn> retry logic is hard :-)
13:44 < hugohn> so many edge cases
13:44 < lightlike> ok, final question:
13:44 < lightlike> What kind of options exist in the functional testing framework for P2P functionality? Why is it so hard to test PRs like this one with the existing framework?
13:45 < jonatack> Good read on this at
13:46 < jonatack> (current limitations and goal of an instantiable CConnman, then a MockNetworkRouter)
13:48 < lightlike> jonatack: yes, seems like it is not possible at the moment to fully mock inbound/outbound behavior in functional tests.
13:48 < ariard> We currently dissociate the node added from cli and from addrman
13:49 < jonatack> and the testing can do manual conns but not spontanous ones
13:49 < jnewbery_> IIRC there's also special logic for peers on the same subnet or IP address, which all our test nodes are for functional tests
13:50 < jnewbery_> yeah, I think it's the IsLocal() test
13:50 < hugohn> is the bucketing mechanism (putting nodes in new vs. tried tables) one of the main challenges? do you have to stub out the bucketing logic somehow, or create fake IP addresses?
13:51 < jnewbery_> there are a few things that would need to change. I think they're all documented in 14210
13:51 < ariard> I think we should create fake IP addresses for our nodes thanks to namespaces
13:52 < jnewbery_> ariard: I don't know anything about that, but it sounds like a good idea!
13:52 < jonatack> Ethan Deilman proposed that here
13:53 < jonatack> * Heilman
13:53 < ariard> jnewbery: how IIRC that's why we conclude with dongcarl in 14210
13:54 < jonatack> ariard: right, namespaces as you proposed
13:55 < ariard> not sure about all this stuff, it need more thought but first cconman encapsulation!
13:55 < ariard> *needs
13:55 < lightlike> do you know if the work towards an instantiable CConnman is still in an early stage, or close to being finished?
13:55 < jonatack> before we run out of time... is anyone working on this atm?
13:55 < hugohn> all roads point to more modular components :-)
13:55 < ariard> I do what I can ;)
13:56 < ariard> jonatack: dongcarl, theuni, ryanofksy
13:56 < hugohn> ariard: yes moar pls ! :D
13:56 < ariard> but it split between different parts of the codebase
13:57 < jonatack> ariard: currently?
13:57 < ariard> we'll try to remove more cconnman usages in RPC interface, don't remember what need to be done in net.cpp
13:57 < ariard> *will
13:57 < hugohn> also a basic q before we run out: does anyone know why g_outbound_peers_with_protect_from_disconnect is defined within net_processing.cpp’s namespace, but g_outbound_peers is defined outside of it?
13:58 < ariard> jonatack: dunno rn, ask on core dev
13:58 < lightlike> there is #14856, where there was not so much action recently
13:58 < lightlike>
13:59 < ariard> lightlike: ping carl on it, he may need help
13:59 < jnewbery_> hugohn: no idea!
13:59 < hugohn> ha
13:59 < jonatack> thanks, interesting. we need to be reviewing these PRs to encourage people
13:59 < jnewbery_> Shall we wrap up there?
14:00 < lightlike> yes, thanks everybody!
14:00 < i2> thanks lightlike!
14:00 < jnewbery_> that was great. Thanks so much for hosting and preparing notes lightlike!
14:00 < jonatack> great jon, lightlike :100: would attend again
14:00 < hugohn> thanks lightlike! learned a lot!
14:00 < ariard> thanks lightlike!
14:00 < jkczyz> yeah, thank you!
14:01 < jonatack> * job :-)
14:01 < jnewbery_> Before we go: next week PR review is looking at *two* competing PRs!
14:01 < jnewbery_> notes are already up
14:02 < jonatack> nice