<Lightlike> just a small thing, could you update the old gist page (with the old agenda) to link to the new page, for the people who have bookmarked that.
<jnewbery> The reason I chose IRC is that it makes it easier for people who don't have somewhere to take a video/audio call to join, and for people who want to just dip in and out
<jnewbery> OK, onto #15834 - Fix NOTFOUND bug and expire getdata requests for transactions. As usual, I'll give a quick summary before we open up for questions.
<jb55> my only suggestion is that it might be neat to have a site where people can propose reviews and have people sign up to them, with a calendar and everything
<jnewbery> This PR is for changes to the "Net Processing" layer. That layer sits between "Net" (which manages connections to peers and is responsible for reading/writing bytes off the wire) and "Validation" (which is responsible for our node's view of the state of the blockchain and mempool).
<jnewbery> "Net Processing" receives application level messages from peers, which can either be data messages (eg tx and block inventory) or control messages (eg gossiping addresses of peers or establishing new connections). The developer reference here: https://btcinformation.org/en/developer-reference#p2p-network is probably the best place for documentation on p2p messages.
<jnewbery> This change is in the handling of the INV/GETDATA/TX message exchange, and is an attempt to fix a bug introduced in #14897 where our node could potentially stop asking peers to send us new transactions.
<jnewbery> as a result of that, we reverted #14897 in the 0.18 release, but kept it in master with the aim of fixing the bug while still keeping the good things about 14897
<jnewbery> It's a bit scary that this was only discovered by chance. I think there's more testing we could be doing to increase our chances of catching things like this earlier on.
<aj> that then showed that i was getting lots of INV messages from peers, telling me that new tx's were around, but i was never actually sending the GETDATA back to get the details of those transactions
<aj> i then tried staring at code to figure out what was wrong, and made guesses. also tried attaching gdb to the running process, but that caused it to crash, and when i restarted everything was fine
<aj> talking about it on irc, people were wondering if i was connected to hostile peers that were deliberately attacking me; so i checked against gmaxwell's banlist, but didn't get any matches, and noted down the peers i was connected to
<aj> once it restarted and was working fine, i manually reconnected to the peers that didn't work before, to check they did work now, and they did, so i took that to mean it wasn't a case of hostile peers, but a bug on my end
<Lightlike> In #15776, there is a rather special situation, with the parent of the tx already expired from the relay set, leading to a NOTFOUND msg being sent. But wouldn't you need 100 different child txs, each with an expired parent, to reach MAX_PEER_TX_IN_FLIGHT, multiplied by the number of peers (8) to be cut off completely? Is that not extremely unlikely?
<aj> Lightlike: and it's not that the parent expires from the relay set, it's that you receive a child you don't have the parent for (eg because you've just restarted your node after being down for a while), and then you ask the peer for the parent even though they never told you they had it, so it's not in the collection of tx's they'll relay to you even though it's in their mempool
<merehap> Beyond more testing, is there something like using better concurrency primitives or better code structuring that could reduce the likelihood of concurrency problems like these? I feel like I've seen several different concurrency bugs filed against Core. Since concurrency bugs are so difficult to discover and debug, it seems scary if there isn't a more systematic approach to preventing them.
<aj> Lightlike: (also, if the parent isn't in the mempool but is in your utxo set, and its first two outputs have already been spent, then you'll redundantly ask for the tx and get a notfound back)
<jnewbery> Good context to know is that if a peer asks you for a tx with GETDATA, you'll only send that tx if it's in a data structure called mapRelay. You won't fish the tx out of the mempool if it's not in mapRelay
<jnewbery> merehap: I wouldn't really categorise this as a concurrency bug. This is more like a resource leak (where the resource is txs in flight from a peer)
<aj> INV(tx) GETDATA(tx) TX(txdata) GETDATA(tx.parent) NOTFOUND is common though, because our side thinks "if they have tx, they must have the parent" and their side, running the same software, thinks "i didn't INV tx.parent, so i won't send it"
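To make that sequence concrete, here is a minimal, self-contained C++ sketch (illustrative only, not Bitcoin Core code): the responding peer only serves GETDATA for txids it has announced, so a follow-up GETDATA for the unannounced parent comes back as NOTFOUND.

```cpp
// Toy model of the INV/GETDATA exchange described above. The responder only
// serves txids it has announced (its relay set); anything else is NOTFOUND.
#include <iostream>
#include <set>
#include <string>

struct Responder {
    std::set<std::string> announced;  // txids this peer has INVed to us

    // GETDATA handler: serve the tx only if it was announced, else NOTFOUND.
    std::string GetData(const std::string& txid) const {
        return announced.count(txid) ? "TX(" + txid + ")" : "NOTFOUND(" + txid + ")";
    }
};

int main() {
    Responder peer;
    peer.announced.insert("child");  // peer INVs the child, but never the parent

    // We fetch the child, notice its parent is missing, and (wrongly) assume
    // "if they have the child, they must have the parent".
    std::cout << peer.GetData("child") << '\n';   // TX(child)
    std::cout << peer.GetData("parent") << '\n';  // NOTFOUND(parent)
}
```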
<aj> so one thing i wonder is about sdaftuar's question, "switch the uses of GetTimeMicros() in the logic to GetTime() so that we can mock it"; it doesn't seem to make a lot of sense to measure timeouts in the seconds-minutes range with microseconds; but using non-mocktime for "peer has been idle" in general (see #9606) certainly makes sense i think
<aj> Lightlike: there's a bug in the handling of missing parents. when actually testing the transaction the logic is done right (in AcceptToMemoryPool), but when that fails, it doesn't report exactly which tx's are missing
<aj> Lightlike: so when the net_processing code works that out to try to ask for the parents (which generally won't work anyway) it tests to see if the parent is in the mempool, or txid:0 or txid:1 are in the UTXO set
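A simplified sketch of that guess (hypothetical names, not the real net_processing interfaces): treat a parent as "already have" if it is in the mempool, or if output 0 or output 1 of that txid is still unspent. This is exactly the check that misfires when both of those outputs have already been spent.

```cpp
// Heuristic "do we already have this parent?" check, as described above.
#include <iostream>
#include <set>
#include <string>
#include <utility>

using TxId = std::string;
using OutPoint = std::pair<TxId, unsigned int>;  // (txid, output index)

bool AlreadyHaveGuess(const TxId& txid,
                      const std::set<TxId>& mempool,
                      const std::set<OutPoint>& utxo_set) {
    if (mempool.count(txid)) return true;
    // If either of the first two outputs is unspent, assume we have the tx.
    // The guess fails when outputs 0 and 1 have both been spent, which is why
    // we can redundantly request a tx we actually have and get NOTFOUND back.
    return utxo_set.count({txid, 0}) || utxo_set.count({txid, 1});
}

int main() {
    std::set<TxId> mempool;                    // parent not in the mempool
    std::set<OutPoint> utxos{{"parent", 2}};   // only output 2 is still unspent
    std::cout << std::boolalpha
              << AlreadyHaveGuess("parent", mempool, utxos) << '\n';  // false
}
```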
<aj> i think the tx's in mapRelay are kept around so that even if the tx expires from the mempool (by being mined, or because the mempool is full) it can still be relayed for the 15 minutes after being INV announced
<aj> so previously, it would've just been "we sent a GETDATA and got a NOTFOUND? no big deal, moving on". but now it means a data structure doesn't get cleared, and eventually a limit gets hit, and behaviour changes
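As a rough illustration of that "leak until a limit gets hit" failure mode and of the expiry fix, here is a self-contained sketch with assumed names and timeout values (the PR's actual constants and data structures differ): per-peer in-flight GETDATAs are capped, freed on TX or NOTFOUND, and, crucially, expired if the peer never answers.

```cpp
// Per-peer in-flight GETDATA tracking with a cap and time-based expiry.
#include <chrono>
#include <cstddef>
#include <map>
#include <string>

using TxId = std::string;
using Clock = std::chrono::steady_clock;

constexpr std::size_t MAX_PEER_TX_IN_FLIGHT = 100;             // cap per peer
constexpr auto GETDATA_TX_INTERVAL = std::chrono::minutes{2};  // illustrative timeout

struct PeerTxDownload {
    std::map<TxId, Clock::time_point> in_flight;  // txid -> when GETDATA was sent

    bool CanRequest() const { return in_flight.size() < MAX_PEER_TX_IN_FLIGHT; }

    void OnGetDataSent(const TxId& txid) { in_flight.emplace(txid, Clock::now()); }

    // Called when the peer answers with TX or NOTFOUND: free the slot.
    void OnAnswered(const TxId& txid) { in_flight.erase(txid); }

    // The fix: even if the peer never answers, stale requests are expired,
    // so the in-flight count can't stay pinned at the cap forever.
    void ExpireStale(Clock::time_point now) {
        for (auto it = in_flight.begin(); it != in_flight.end();) {
            if (now - it->second > GETDATA_TX_INTERVAL) it = in_flight.erase(it);
            else ++it;
        }
    }
};

int main() {
    PeerTxDownload peer;
    for (int i = 0; i < 100; ++i) peer.OnGetDataSent("tx" + std::to_string(i));
    // Peer never answers: without expiry we would be stuck at the cap forever.
    const bool stuck = !peer.CanRequest();
    peer.ExpireStale(Clock::now() + std::chrono::minutes{3});
    return (stuck && peer.CanRequest()) ? 0 : 1;
}
```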
<jnewbery> luckily, I think there are many parallels between p2p code and other software domains (eg networking, telecoms), so if you have experience in those areas, p2p might be somewhere you want to look more closely.
<jnewbery> jonatack: there are various calls for getting the current time used by bitcoind. GetTime() can be overridden by setmocktime but GetTimeMicros() always uses your system time
<aj> jnewbery: i think if you used mocktime for everything you might get a bug where "setup test, connect peers. mocktime += 1 hour. oh all the peers have been non-responsive for an hour, mass disconnect. tests fail"
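For context on the GetTime()/GetTimeMicros() distinction, a small sketch of the mocktime idea follows (assumed names and behaviour, not bitcoind's actual util/time code): a seconds-level clock that tests can pin via a setter, next to a microsecond clock that always reads the real system time.

```cpp
// Mockable seconds clock vs. non-mockable microsecond clock.
#include <atomic>
#include <chrono>
#include <cstdint>

std::atomic<int64_t> g_mock_time{0};  // 0 means "not mocked"

int64_t GetTime() {
    const int64_t mock = g_mock_time.load();
    if (mock != 0) return mock;  // tests can freeze or advance time here
    return std::chrono::duration_cast<std::chrono::seconds>(
               std::chrono::system_clock::now().time_since_epoch()).count();
}

// Not mockable: always the real system clock, so timeouts measured with it
// can't be exercised quickly in tests.
int64_t GetTimeMicros() {
    return std::chrono::duration_cast<std::chrono::microseconds>(
               std::chrono::system_clock::now().time_since_epoch()).count();
}

void SetMockTime(int64_t t) { g_mock_time.store(t); }

int main() {
    SetMockTime(1000000);  // a test pinning the clock
    return GetTime() == 1000000 ? 0 : 1;
}
```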
<Lightlike> A question concerning the fix in #15834: we now clear m_tx_inflight_it if our peer first sends an INV(tx) but is then totally unresponsive after our GETDATA(tx) (not even sending a NOTFOUND), while at the same time we accept other transactions from this peer. Is there a legit reason for our peer to behave like this besides censoring certain transactions (so that we don't disconnect instead)?
<aj> Lightlike: legit reasons: could be buggy, or maybe there's bugs in our logic where we're taking much longer to send the GETDATA than we think and it's actually totally reasonable for it to have expired, could be others
<jnewbery> In general, I think we want to be careful about introducing behaviour that could turn a relatively small bug into something that causes a bunch of peers to disconnect
<jnewbery> jonatack: I hope it can be tested with mocktime, or in some way that doesn't require very long-running tests. The problem with long-running tests is that no-one ever runs them
<jnewbery> Another thing that I'd like to see is more stress testing. Resource leaks like this are really difficult to catch in deterministic, short-running tests
<jnewbery> Net Processing is a tricky area. Probably second only to Validation in terms of subtlety and complexity. It'd be great if we had more eyes on it.