<jnewbery> This PR is for changes to the "Net Processing" layer. That layer sits between "Net" (which manages connections to peers and is responsible for reading/writing bytes off the wire) and "Validation" (which is responsible for our node's view of the state of the blockchain and mempool).
<jnewbery> "Net Processing" receives application level messages from peers, which can either be data messages (eg tx and block inventory) or control messages (eg gossiping addresses of peers or establishing new connections). The developer reference here: https://btcinformation.org/en/developer-reference#p2p-network is probably the best place for documentation on p2p messages.
<jnewbery> This change is in the handling of the INV/GETDATA/TX message exchange, and is an attempt to fix a bug introduced in #14897 where our node could potentially stop asking peers to send us new transactions.
<aj> that then showed that i was getting lots of INV messages from peers, telling me that new tx's were around, but i was never actually sending the GETDATA back to get the details of those transactions
<aj> talking about it on irc, people were wondering if i was connected to hostile peers that were deliberately attacking me; so i checked against gmaxwell's banlist, but didn't get any matches, and noted down the peers i was connected to
<aj> once my node restarted and was working fine, i manually reconnected to the peers that didn't work before, to check they did work now, and they did, so i took that to mean it wasn't a case of hostile peers, but a bug on my end
<Lightlike> A question concerning 15776: there is a rather special situation there, with the parent of the tx already expired from the relay set, leading to a NOTFOUND msg being sent. But wouldn't you need 100 different child txs, each with an expired parent, to reach MAX_PEER_TX_IN_FLIGHT, multiplied by the number of peers (8), to be cut off completely? Isn't that extremely unlikely?
<aj> Lightlike: and it's not that the parent expires from the relay set, it's that you receive a child you don't have a parent for (because you've just restarted your node after being down for a while, eg), and then you ask the peer for the parent even though they never told you they had it, so it's not in the collection of txs they'll relay to you even though it's in their mempool
<merehap> Beyond more testing, is there something, like better concurrency primitives or better code structuring, that could reduce the likelihood of concurrency problems like these? I feel like I've seen several different concurrency bugs filed against core. Since concurrency bugs are so difficult to discover and debug, it seems scary if there isn't a more systematic approach to preventing them.
<aj> Lightlike: (also, if the parent isn't in the mempool, but is in your utxo set, but the first two outputs from the transaction have already been spent, then you'll redundantly ask for the tx and get a notfound back)
<jnewbery> Good context to know is that if a peer asks you for a tx with GETDATA, you'll only send that tx if it's in a data structure called mapRelay. You won't fish the tx out of the mempool if it's not in mapRelay
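A toy sketch of that rule (in Python, not the real C++ in net_processing.cpp; only the name mapRelay comes from the discussion, the function and variable names here are illustrative):

```python
# Simplified model: GETDATA for a tx is only answered from mapRelay;
# we never fish the tx out of the mempool if it wasn't announced.

def handle_getdata(txid, map_relay, mempool):
    """Return the tx data to send, or 'NOTFOUND' if we won't serve it."""
    if txid in map_relay:        # only txs we announced live in mapRelay
        return map_relay[txid]
    # The tx may well be in our mempool (note: `mempool` is deliberately
    # unused here), but we never announced it, so we answer NOTFOUND.
    return "NOTFOUND"

map_relay = {"tx_a": "txdata_a"}                      # announced via INV
mempool = {"tx_a": "txdata_a", "tx_b": "txdata_b"}    # tx_b never announced

assert handle_getdata("tx_a", map_relay, mempool) == "txdata_a"
assert handle_getdata("tx_b", map_relay, mempool) == "NOTFOUND"
```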
<aj> INV(tx) GETDATA(tx) TX(txdata) GETDATA(tx.parent) NOTFOUND is common though, because our side thinks "if they have tx, they must have the parent" and their side, running the same software, thinks "i didn't INV tx.parent, so i won't send it"
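That sequence can be modelled in a few lines (a hedged toy model in Python, not the actual C++; the message handling is reduced to a single lookup). Each side serves only what it announced, so requesting the unannounced parent always comes back NOTFOUND:

```python
# Toy model of INV(tx) GETDATA(tx) TX(txdata) GETDATA(tx.parent) NOTFOUND:
# the peer answers GETDATA only for txids it previously announced.

def respond(msg, announced):
    kind, txid = msg
    if kind == "GETDATA":
        return ("TX", txid) if txid in announced else ("NOTFOUND", txid)
    return None

peer_announced = {"child"}          # peer INV'ed only the child tx
trace = [respond(("GETDATA", "child"), peer_announced),
         respond(("GETDATA", "parent"), peer_announced)]
assert trace == [("TX", "child"), ("NOTFOUND", "parent")]
```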
<aj> so one thing i wonder is about sdaftuar's question, "switch the uses of GetTimeMicros() in the logic to GetTime() so that we can mock it"; it doesn't seem to make a lot of sense to measure timeouts in the seconds-minutes range with microseconds; but using non-mocktime for "peer has been idle" in general (see #9606) certainly makes sense i think
<aj> Lightlike: there's a bug in the handling of missing parents. when actually testing the transaction the logic is done right (in AcceptToMemoryPool), but when that fails, it doesn't say exactly which txs are missing
<aj> Lightlike: so when the net_processing code works that out to try to ask for the parents (which generally won't work anyway) it tests to see if the parent is in the mempool, or txid:0 or txid:1 are in the UTXO set
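A minimal sketch of that parent check, as described (Python pseudocode, not the real net_processing logic; all names here are illustrative). Note aj's earlier point: if the parent is confirmed but its first two outputs are already spent, this check passes and we redundantly request the tx:

```python
# Before requesting a missing parent, test whether the parent txid is
# in the mempool, or whether outputs parent:0 or parent:1 are unspent
# in the UTXO set.

def should_request_parent(parent_txid, mempool, utxo_set):
    if parent_txid in mempool:
        return False                                  # already have it
    if (parent_txid, 0) in utxo_set or (parent_txid, 1) in utxo_set:
        return False                                  # already confirmed
    return True            # confirmed-but-outputs-spent slips through here

mempool = set()
utxo_set = {("p1", 0)}     # p1's first output is unspent
assert not should_request_parent("p1", mempool, utxo_set)
assert should_request_parent("p2", mempool, utxo_set)
```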
<aj> i think the txs in mapRelay are kept around so even if the tx expires from the mempool (by being mined, or because the mempool is full) it's still able to be relayed for ~15 minutes after being announced via INV
<aj> so previously, it would've just been "we sent a GETDATA and got a NOTFOUND? no big deal, moving on". but now it means a data structure doesn't get cleared, and eventually a limit gets hit, and behaviour changes
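The leak can be modelled like this (a Python toy, not the C++ implementation; the limit of 100 and the idea of per-peer in-flight slots come from the discussion, the class and method names are mine):

```python
# Each GETDATA occupies an in-flight slot. If a NOTFOUND reply no longer
# frees the slot, slots leak until the per-peer limit is hit and we stop
# sending GETDATAs to that peer entirely.

MAX_PEER_TX_IN_FLIGHT = 100

class Peer:
    def __init__(self, clear_on_notfound):
        self.in_flight = set()
        self.clear_on_notfound = clear_on_notfound

    def request(self, txid):
        if len(self.in_flight) >= MAX_PEER_TX_IN_FLIGHT:
            return False                 # stuck: no more GETDATAs go out
        self.in_flight.add(txid)
        return True

    def on_notfound(self, txid):
        if self.clear_on_notfound:
            self.in_flight.discard(txid)

buggy = Peer(clear_on_notfound=False)
for i in range(100):
    assert buggy.request(f"tx{i}")
    buggy.on_notfound(f"tx{i}")          # slot is never actually freed
assert not buggy.request("tx100")        # limit hit, requests stop

fixed = Peer(clear_on_notfound=True)
for i in range(200):
    assert fixed.request(f"tx{i}")       # slot freed each time, no limit
    fixed.on_notfound(f"tx{i}")
```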
<jnewbery> luckily, I think there are many parallels between p2p code and other software domains (eg networking, telecoms), so for people with experience in those areas, p2p might be somewhere you want to look at more.
<aj> jnewbery: i think if you used mocktime for everything you might get a bug where "setup test, connect peers. mocktime += 1 hour. oh all the peers have been non-responsive for an hour, mass disconnect. tests fail"
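That failure mode is easy to demonstrate (a Python sketch of the scenario aj describes; the timeout value and function names are illustrative, not Bitcoin Core's real constants):

```python
# If peer-inactivity timeouts run on the same mockable clock the test
# advances, jumping mocktime by an hour makes every peer look idle at
# once, and they'd all be disconnected together.

IDLE_TIMEOUT = 20 * 60  # seconds; illustrative value

def idle_peers(last_activity, now):
    return [p for p, t in last_activity.items() if now - t > IDLE_TIMEOUT]

last_activity = {"peer0": 0, "peer1": 5}
now = 0
assert idle_peers(last_activity, now) == []      # everyone looks fine
now += 3600                                      # mocktime += 1 hour
assert idle_peers(last_activity, now) == ["peer0", "peer1"]  # mass idle
```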
<Lightlike> A question concerning the fix in #15834: now we clear m_tx_inflight_it after our peer first sends an INV(tx), but is totally unresponsive after our GETDATA(tx) (not even a NOTFOUND). At the same time we accept other transactions from this peer. Is there a legit reason for our peer to behave like this besides censoring certain transactions (so that we don't disconnect instead)?
<aj> Lightlike: legit reasons: could be buggy, or maybe there's bugs in our logic where we're taking much longer to send the GETDATA than we think and it's actually totally reasonable for it to have expired, could be others