Request NOTFOUND transactions immediately from other outbound peers, when possible (
p2p) Jul 31, 2019
The basic message flow for transaction relay is
INV -> GETDATA -> TX. A node that cannot deliver a requested transaction answers the
GETDATA request with
NOTFOUND instead of
NOTFOUND messages were introduced in
Bitcoin Core PR 2192 and are documented in the Bitcoin developer reference. Historically, Bitcoin Core would ignore
NOTFOUND messages it received. This changed slightly with
PR 15834, ( Review Club Session), which introduced internal bookkeeping measures in response to a NOTFOUND (clearing the internal map of in-flight transactions). This PR suggests to actually utilize the info of a
NOTFOUND by immediately requesting the transaction from outbound peers having
INV‘ed it to us before.
This PR touches the central P2P loop
CConnman::ThreadMessageHandler() by adding a sequence counter to
SendMessages(). This counter is used to deterministically select one outbound peer in each loop cycle from which we request transactions for which we have received a
NOTFOUND from another peer.
What is the most typical situation when running a node in which
NOTFOUND messages are encountered?
Why was the solution with the sequence counter chosen, instead of simpler alternatives like a random draw?
Why are only outbound peers considered for requesting transactions after receiving
What kind of options exist in the functional testing framework for P2P functionality? Why is it so hard to test PRs like this one with the existing framework?
4 13:00 <michaelfolkson> Hey
11 13:01 <jnewbery_> before I hand over, my periodic reminder that the point of PR review club is to help you review PRs! Please do review the PRs on github (an if you have any questions about how to do that, feel free to ask them here)
12 13:02 <jnewbery_> ok, let's get started. Over to you, lightlike
13 13:02 <michaelfolkson> Thanks for the reminder John :)
14 13:02 <lightlike> Thanks! I found this PR interesting because it deals with some of the very basic P2P functionality (ThreadMessageHandler)
15 13:02 <lightlike> to optimize how we deal with NOTFOND message.
17 13:03 <lightlike> Even though it got closed by the author recently
18 13:03 <lightlike> I think it is good for learning about P2P aspects of bitcoin - definitely helped my understand some new things.
19 13:04 <lightlike> First question (some overlap with the old session on #15834):
20 13:04 <lightlike> What is the most typical situation when running a node in which NOTFOUND messages are encountered?
21 13:04 <jkczyz> When the peer no longer has the transaction in the relay set
22 13:04 <michaelfolkson> Unpruned node requesting historic transaction?
23 13:05 <michaelfolkson> *Pruned
24 13:06 <lightlike> In my experience of running a node with logging, it is dealing with orphan transactions:
25 13:06 <jnewbery_> michaelfolkson: pruning is only relevant for old blocks in the chain. This PR deals with tx relay
26 13:06 <jnewbery_> (unconfirmed txs)
27 13:07 <lightlike> when we receive an tx from a peer for which we dont have the parent, we will ask him for the parent too
28 13:07 <lightlike> sending GETDATA without him having sent an INV before.
29 13:07 <lightlike> *received
30 13:08 <michaelfolkson> But a pruned node might request a tx it no longer has?
31 13:08 <lightlike> in this case, the other node often doesn't have the parent tx (or won't send it to us) and answers with NOTFOUND.
32 13:09 <jnewbery_> michaelfolkson: pruning is only for the blockchain. tx relay is for unconfirmed txs not in the blockchain yet
33 13:09 <michaelfolkson> Ok thanks
34 13:09 <hugohn> how likely is it to receive a txn for which we don't know the parent? is that related to CPFP or no?
35 13:10 <i2> answer to typical situation for NOTFOUND: maybe: a.) blocks-only Lite client? b.) full-node that's in process of syncing to tip, c.) adversarial peer? d.) full node drops tx if tx is included in block but a reorg happens to your peer?
36 13:10 <lightlike> when firing up a node, it happens a lot (we weren't there when the parent was relayed)
37 13:13 <lightlike> i2: i dont think an adversarial peer would send a NOTFOUND currently. He could just INV a tx and never follow up with the TX if we request it.
39 13:14 <jnewbery_> i2: d) in the case of a reorg, the disconnected txs are added back into the mempool
40 13:14 <hugohn> lightlike: I see, so when a node comes back online after a while, the INVs typically arrive out-of-order? i.e., receiving INVs for children first before receiving INVs for parent
41 13:14 <lightlike> im not entirely sure if this is typical, but in a short test-run on mainnet all of the NOTFOUNDs i got were for parents of orphan transactions.
42 13:15 <jnewbery_> lightlike: I think that's fairly typical, but I haven't actually tested it myself
43 13:15 <i2> jnewbery_ re d) so the steps are 1.) new block 2.) shed txns in block from mempool 3) reorg 4) put txn from reorg'ed blocks back into mempool?
44 13:16 <lightlike> ok, maybe move on to the next q:
45 13:16 <lightlike> Why was the solution with the sequence counter chosen, instead of simpler alternatives like a random draw?
46 13:16 <jonatack> lightlike: my testing of the NOTFOUND PR showed the same as you
47 13:16 <i2> lightlike: maybe a.) Decrease p2p churn? b.) Equalize load requested from peers (no peer wins the INV-request lottery)
48 13:17 <jkczyz> Is it to ensure a uniform distribution of peer selection? I left a comment on the review regarding the sequence number but wasn't sure if this was the reason.
49 13:17 <hugohn> I guess for robustness: we want to retry fetching the txn from all eligible outbound peers, not just one
50 13:17 <jnewbery_> i2: not quite. It's 1) get new block and see that it's got more work than tip 2) run ActivateBestChain, which tries to move to the best chain, which will 2a) rewind blocks, adding txs to mempool 2b) connect new blocks, removing txs from mempool that in the new block
51 13:19 <lightlike> yes, i think that's correct, it's to not ask more transansaction from peers earlier in the loop.
52 13:20 <michaelfolkson> How many peers is a node on average connected to? I'm assuming most would run the default config settings
54 13:21 <lightlike> i think it is important here that there is an "early" and "late" in the loop, even though ThreadMessageHandler loops continuously through all our peer.
55 13:21 <jnewbery_> michaelfolkson: up to 8 outbound and (for listening nodes) up to 125 total (so 117 inbound)
56 13:21 <hugohn> jnewbery_: yes I saw that, the retrying is actually one peer at a time. but the queuing up for retries happen for all outbound peers, right? to ensure that if one outbound peer fails to give us the txn, the next one still could
57 13:22 <lightlike> because the order of our peers is not random (usually there are first our outbound peers to which we connect at startup, and after that the inbound peers)
58 13:22 <jnewbery_> hugohn: right, the txid will be added to m_tx_not_found for all outbound peers that have announced the tx
59 13:22 <jonatack> perhaps a random draw would not ensure the taking of turns e.g. the same peer could be picked twice in a row
60 13:23 <michaelfolkson> Yeah especially if only 8 outbound
61 13:24 <lightlike> jonatack: i think with the random draw, the node in our current position in the loop would get more action: it has now an 1/N chance, and in case there is an entire cycle without anohter node getting the INV, it gets a second chance first
62 13:24 <i2> jonatack "peer could be picked twice in a row" = INV 'lottery' winner ;]
63 13:24 <hugohn> jonatack: yeah I think so, so if we randomly choose only one outbound peer to retry, it's less robust if that peer _also_ fails to SENDDATA . but the advantage of random draw is no extra sequence_number, which is somewhat awkward to introduce in the main processing loop...
64 13:24 <jonatack> lightlike: right, in that case random would have more bias
66 13:25 <lightlike> jnewbery_: yes, I thought about that too, that could be randomized
67 13:26 <jkczyz> jnewbery_: how important is maintaining determinism for sake of testing?
69 13:26 <jnewbery_> jkczyz: I don't think it's important. My guess is that no tests will break.
70 13:26 <hugohn> but one consequence of queuing up the txns in all outbound peers is there could potentially be duplicate entries in each m_tx_not_found queue? I don't see any logic checking against with duplicates
71 13:27 <hugohn> do you guys think it's necessary? to check against duplicates in m_tx_not_found?
72 13:28 <jnewbery_> hugohn: I think you may be right about duplicates in m_tx_not_found
73 13:30 <lightlike> ok, next question: Why are only outbound peers considered for requesting transactions after receiving NOTFOUND?
74 13:31 <lightlike> hugohn: I think so too, not sure thought how bad duplicates would be.
75 13:32 <i2> clarify my ignorance: outbound peers = non-listen-only peers. yes?
76 13:32 <hugohn> jnewbery_ lightlike: yeah so looks like we indiscriminately queue up the txn any time we get a NOTFOUND. there's no guarantee we can't get multiple NOTFOUNDs for the same txns.
77 13:32 <hugohn> or back-to-back
78 13:32 <hugohn> retry failure
79 13:32 <lightlike> i2: no, outbound peers are when we initiate the connection, vs inbound peers that connect to us
81 13:33 <michaelfolkson> I made same mistake
82 13:34 <michaelfolkson> Because you would rather cycle through nodes you have personally selected rather than random nodes that you haven't selected?
83 13:35 <lightlike> michaelfolkson: yes, that's true, but why not just treat everyone the same?
84 13:35 <hugohn> we generally trust the outbound peers more than inbound ones, to guard against Sybil attacks?
85 13:36 <michaelfolkson> Because nodes aren't the same. Some will go offline and won't make timely responses?
86 13:36 <lightlike> hugohn: yes. I think if we also requested from inbounds, attackers could try to withhold a tx from us by establishing many connections to us
87 13:37 <lightlike> the normal logic if a GETDATA times out also treats outbounds nodes preferentially
88 13:37 <jonatack> yes, idea is attackers could use inbound conns to prevent us from seeing a txn
89 13:38 <jonatack> thus the delay placed on inbounds giving preferance to outbounds
90 13:38 <lightlike> so an attacker could circumvent that by sending us NOTFOUND right before the GETDATA times out, and then we would be likely to ask the tx from another inbound node, that is also controlled by the attacker.
91 13:38 <jonatack> nNow + 2 sec
92 13:39 <lightlike> and so on
93 13:40 <i2> so rational for txn requests to outbound peers before inbound is to prevent an eclipse attack on a node for txn awareness (say lighnting force close)?
94 13:41 <lightlike> i2: yes, I think so. it's much harder for an attacker to control our outbound peers that we chose carefully.
96 13:43 <jonatack> hugohn: yes, iirc
97 13:43 <hugohn> retry logic is hard :-)
98 13:44 <hugohn> so many edge cases
99 13:44 <lightlike> ok, final question:
100 13:44 <lightlike> What kind of options exist in the functional testing framework for P2P functionality? Why is it so hard to test PRs like this one with the existing framework?
102 13:46 <jonatack> (current limitations and goal of an instantiable CConnman, then a MockNetworkRouter)
103 13:48 <lightlike> jonatack: yes, seems like it is not possible at the moment to fully mock inbound/outbound behavior in functional tests.
104 13:48 <ariard> We currently dissociate the node added from cli and from addrman
105 13:49 <jonatack> and the testing can do manual conns but not spontanous ones
106 13:49 <jnewbery_> IIRC there's also special logic for peers on the same subnet or IP address, which all our test nodes are for functional tests
107 13:50 <jnewbery_> yeah, I think it's the IsLocal() test
108 13:50 <hugohn> is the bucketing mechanism (putting nodes in new vs. tried tables) one of the main challenges? do you have to stub out the bucketing logic somehow, or create fake IP addresses?
109 13:51 <jnewbery_> there are a few things that would need to change. I think they're all documented in 14210
110 13:51 <ariard> I think we should create fake IP addresses for our nodes thanks to namespaces
111 13:52 <jnewbery_> ariard: I don't know anything about that, but it sounds like a good idea!
113 13:53 <jonatack> * Heilman
114 13:53 <ariard> jnewbery: how IIRC that's why we conclude with dongcarl in 14210
116 13:55 <ariard> not sure about all this stuff, it need more thought but first cconman encapsulation!
118 13:55 <lightlike> do you know if the work towards an instantiable CConnman is still in an early stage, or close to being finished?
119 13:55 <jonatack> before we run out of time... is anyone working on this atm?
120 13:55 <hugohn> all roads point to more modular components :-)
122 13:56 <ariard> jonatack: dongcarl, theuni, ryanofksy
123 13:56 <hugohn> ariard: yes moar pls ! :D
124 13:56 <ariard> but it split between different parts of the codebase
125 13:57 <jonatack> ariard: currently?
126 13:57 <ariard> we'll try to remove more cconnman usages in RPC interface, don't remember what need to be done in net.cpp
128 13:57 <hugohn> also a basic q before we run out: does anyone know why g_outbound_peers_with_protect_from_disconnect is defined within net_processing.cpp’s namespace, but g_outbound_peers is defined outside of it?
129 13:58 <ariard> jonatack: dunno rn, ask on core dev
130 13:58 <lightlike> there is #14856, where there was not so much action recently
132 13:59 <ariard> lightlike: ping carl on it, he may need help
133 13:59 <jnewbery_> hugohn: no idea!
135 13:59 <jonatack> thanks, interesting. we need to be reviewing these PRs to encourage people
136 13:59 <jnewbery_> Shall we wrap up there?
137 14:00 <lightlike> yes, thanks everybody!
138 14:00 <i2> thanks lightlike!
139 14:00 <jnewbery_> that was great. Thanks so much for hosting and preparing notes lightlike!
140 14:00 <jonatack> great jon, lightlike :100: would attend again
141 14:00 <hugohn> thanks lightlike! learned a lot!
142 14:00 <ariard> thanks lightlike!
143 14:00 <jkczyz> yeah, thank you!
144 14:01 <jonatack> * job :-)
146 14:01 <jnewbery_> notes are already up