Fix NOTFOUND bug and expire getdata requests for transactions (net processing)
12:59 < afig> sup yall 13:00 < jb55> yo 13:00 < jnewbery> hi 13:00 < pinheadmz> hi 13:00 < Lightlike> hello 13:00 < merehap> hi 13:00 < amiti> hi 13:00 < b10c> hi 13:01 < jonatack> Hi 13:01 < jnewbery> Hi folks! Thanks for joining PR review club today. 13:01 < jnewbery> Before we start talking about #15834, I wanted to poll for feedback on these meetings. Is there any way we could make them more useful? 13:02 < jnewbery> I'm also taking suggestions for PRs to cover in future weeks. We don't have anything on the agenda yet. 13:02 < JulioBarros> I think the chat interface is interesting but wonder if an audio or video format would be helpful. 13:03 < peevsie> hi 13:03 < Lightlike> just a small thing, could you update the old gist page (with the old agenda) to link to the new page, for the people who have bookmarked that. 13:03 < aj> heyo 13:04 < jnewbery> The reason I chose IRC is that it makes it easier for people who don't have somewhere to take a video/audio call to join, and for people who want to just dip in and out 13:04 < merehap> I definitely prefer IRC over audio/video. Easier to reference the code for a few minutes without falling too far behind in the conversation. 13:04 < jnewbery> also means we have a text log for people who can't join in real time 13:05 < jnewbery> ok, I don't want to take too long on this. If you think of anything, please raise it here or just message me directly 13:05 < jnewbery> OK, onto #15834 - Fix NOTFOUND bug and expire getdata requests for transactions. As usual, I'll give a quick summary before we open up for questions. 13:05 < jb55> my only suggestion is that it might be neat to have a site where people can propose reviews and have people signup to them, with a calendar and everything 13:05 < jb55> perhaps when this gets more popular... 13:06 < jnewbery> jb55: until then, feel free to propose reviews in this channel 13:06 < jnewbery> This PR is for changes to the "Net Processing" layer. That layer sits between "Net" (which manages connections to peers and is responsible for reading/writing bytes off the wire) and "Validation" (which is responsible for our node's view of the state of the blockchain and mempool). 13:06 < jnewbery> "Net Processing" receives application level messages from peers, which can either be data messages (eg tx and block inventory) or control messages (eg gossiping addresses of peers or establishing new connections). The developer reference here: https://btcinformation.org/en/developer-reference#p2p-network is probably the best place for documentation on p2p messages. 13:07 < jnewbery> Data messages are parsed and then passed up to validation for consideration for us to update mempool/chainstate. 13:07 < jnewbery> This change is in the handling of the INV/GETDATA/TX message exchange, and is an attempt to fix a bug introduced in #14897 where our node could potentially stop asking peers to send us new transactions. 13:08 < jnewbery> As it says in the PR description, this is a smaller alternative to https://github.com/bitcoin/bitcoin/pull/15776 , who's author is in this chat 13:08 < jnewbery> aj was actually the first person to report the bug a couple of months ago 13:09 < jnewbery> as a result of that, we reverted #14897 in the 0.18 release, but kept it in master with the aim of fixing the bug while still keeping the good things about 14897 13:09 < jnewbery> I'll shut up now. Any questions? 13:09 < iiiiiiiiiii> >a bug introduced in #14897 https://github.com/bitcoin/bitcoin/pull/14897 13:09 < jb55> how would you find a bug like this? is there tracing diagnostics you can turn on to see these messages? 13:10 < jnewbery> aj: do you want to comment on how you found the bug? 13:10 < aj> so what happened is i ran "getmempoolinfo" and noticed that it was empty -- figured something weird was going on and mentioned it on irc 13:10 < aj> at which point people said "no, nothing weird is going on, your node is just having problems" or something similar 13:11 < aj> so i tried working out what was going on at that point, because my mempool had just like 12 tx's and wasn't getting any more 13:11 < aj> eventually figured out how to turn net message debugging on to the live system, bitcoin-cli logging '["net"]' or similar 13:12 < jnewbery> It's a bit scary that this was only discovered by chance. I think there's more testing we could be doing to increase our chances of catching things like this earlier on. 13:12 < aj> that then showed that i was getting lots of INV messages from peers, telling me that new tx's were around, but i was never actually sending the GETDATA back to get the details of those transactions 13:12 < jnewbery> that's a good tip. We have different categories of logging that can be enabled and disabled at run time. Use the `logging` RPC. 13:12 < aj> i then tried staring at code to figure out what was wrong, and made guesses. also tried attaching gdb to the running process, but that caused it to crash, and when i restarted everything was fine 13:13 < aj> talking about it on irc, people were wondering if i was connected to hostile peers that were deliberately attacking me; so i checked against gmaxwell's banlist, but didn't get any matches, and noted down the peers i was connected to 13:13 < aj> (i run my node behind NAT and not on tor, so i've only got 8 outgoing peers, no incoming peers) 13:14 < aj> once it restarted and was working fine, i manually reconnected to the peers that didn't work before, to check they did work now, and they did so i took that to mean it wasn't a case of hostile peers, but a bug on my end 13:14 < jonatack> aj, did the custom logging you mentioned in your PR shine light too? https://github.com/bitcoin/bitcoin/pull/15776#issuecomment-483916240 13:14 < Lightlike> In 15776, there is a rather special situation, with the parent of the tx already expired from the relay set, leading to NOTFOUND msg being sent. But wouldn't you need to have 100 different children tx, each with expired parents to reach MAX_PEER_IN_FLIGHT, multiplied by the number of peers (8) to be cut off completely. Is that not extremely unlikely? 13:14 < aj> in the log of #15776 i think you can see some of my guesses and corrections to that 13:15 < aj> Lightlike: it's cumulative, so you use up 1 slot with one tx, then it happens against and you use up another slot 13:16 < aj> Lightlike: and it's not that the parent expires from the relay set, it's that you receive a child you don't have a parent for (because you've just restarted your node after being down for a while, eg), and then you ask the peer for the parent even though they never told you they had the parent, so it's not in the collection of tx's they'll relay to you even though it's in the mempool 13:16 < merehap> Beyond more testing, is there something like using better concurrency primitives or better code structuring to reduce the likelihood of concurrency problems like these? I feel like I've seen several different concurrency bugs file against core. Since concurrency bugs are so difficult to discover and debug, it seems scary if there isn't a more systematic approach to preventing them. 13:17 < merehap> filed* against 13:17 < aj> Lightlike: (also, if the parent isn't in the mempool, but is in your utxo set, but the first two outputs from the transaction have already been spent, then you'll redundantly ask for the tx and get a notfound back) 13:18 < jnewbery> Good context to know is that if a peer asks you for a tx with GETDATA, you'll only send that tx if it's in a data structure called MapRelay. You won't fish the tx out of the mempool if it's not in MapRelay 13:18 < aj> jnewbery: (except you will if you're working with an SPV peer doing bloom filters) 13:19 < jnewbery> ah, yes. Thanks! 13:19 < aj> jnewbery: (there's always an exception. except when there isn't...) 13:19 < jnewbery> merehap: I wouldn't really categorise this as a concurrency bug. This is more like a resource leak (where the resource is txs in flight from a peer) 13:20 < MarcoFalke> Yeah, it is a logic bug. The code was written under the wrong assumptions (missing some "edge" cases) 13:21 < pinheadmz> so the story is: INV(tx)--> <--GETDATA(tx) (tx is dropped from relayset) NOTFOUND(tx)--> ? 13:21 < jnewbery> pinheadmz: yes 13:21 < pinheadmz> all three, the same tx - not GETDATA(parent of tx) 13:21 < jnewbery> (although of course we can't rely on the peer sending a NOTFOUND) 13:22 < aj> INV(tx) GETDATA(tx) TX(txdata) is normal 13:22 < pinheadmz> so is the tx not really in the maprealy from thebeginning? or is it actually dropped in the time between INV and GETDATA ? 13:23 < aj> INV(tx) GETDATA(tx) TX(txdata) GETDATA(tx.parent) NOTFOUND is common though, because our side thinks "if they have tx, they must have the parent" and their side, running the same software, thinks "i didn't INV tx.parent, so i won't send it" 13:23 < jnewbery> INV(tx_child) ---> 13:23 < jnewbery> <--- GETDATA(tx_child) 13:23 < jnewbery> TX(tx_child) ---> 13:23 < jnewbery> I don't have this child's parents! 13:23 < jnewbery> INV(tx_parent) ---> 13:23 < jnewbery> <--- GETDATA(tx_parent) 13:23 < jnewbery> TX(tx_parent) ---> 13:23 < aj> jnewbery: the special logging mostly proved the guesses i was making was wrong, fwiw 13:24 < jb55> I found reading the struct TxDownloadState comment helpful for anyone new to this part of the codebase 13:24 < jnewbery> oh sorry, that second INV shouldn't be there 13:24 < jnewbery> correction: 13:24 < jnewbery> INV(tx_child) ---> 13:24 < jnewbery> <--- GETDATA(tx_child) 13:24 < jnewbery> TX(tx_child) ---> 13:24 < jnewbery> I don't have this child's parents! 13:24 < jnewbery> <--- GETDATA(tx_parent) 13:24 < jnewbery> TX(tx_parent) ---> 13:25 < jnewbery> that last TX should actually be a NOTFOUND 13:25 < pinheadmz> thanks! visual aids are so helpful :-) 13:25 < pinheadmz> and why wouldnt the first node have the parent? bc it jsut expired? time-wise? 13:25 < jnewbery> expired from the MapRelay but still in the mempool, or cofirmed in a block 13:26 < Lightlike> but if it was confirmed in a block, wouldn't we know about this and not request it? 13:27 < aj> so one thing i wonder is about sdaftuar's question, "switch the uses of GetTimeMicros() in the logic to GetTime() so that we can mock it"; it doesn't seem to make a lot of sense to measure timeouts in the seconds-minutes range with microseconds; but using non-mocktime for "peer has been idle" in general (see #9606) certainly makes sense i think 13:27 < jnewbery> Lightlike: The peer might have it confirmed in a block, but we don't have the block yet? Not sure 13:28 < aj> Lightlike: there's a bug in the handling of missing parents. when actually testing the transaction the logic is done right (in AcceptToMempool), but when that fails, it doesn't say what tx's exactly are missing 13:29 < b10c> jnewbery: and the "expired from the MapRelay but still in the mempool" was introduced in #14897? 13:29 < aj> Lightlike: so when the net_processing code works that out to try to ask for the parents (which generally won't work anyway) it tests to see if the parent is in the mempool, or txid:0 or txid:1 are in the UTXO set 13:29 < aj> Lightlike: but if the output we're looking for is txid:3 and txid:0 and txid:1 are spent, we'll get a false positive 13:30 < jnewbery> b10c: no, I believe that's been normal behaviour since earlier than that 13:30 < aj> i think the tx's in maprelay are kept around so even if the tx expires from the mempool (by being mined, or because the mempool is full) it's still able to be relayed for the 15m's since being INV announced 13:31 < jnewbery> the problem introduced by 14897 is adding txs to the node's m_tx_in_flight and then there being edge-cases where they're never removed 13:32 < b10c> jnewbery: ah I see 13:33 < aj> so previously, it would've just been "we sent a GETDATA and got a NOTFOUND? not big deal, moving on". but now it means a data structure doesn't get cleared, and eventually a limit gets hit, and behaviour changes 13:34 < jb55> p2p code is scary 13:34 < jnewbery> jb55: yes! 13:34 < jb55> so much state 13:34 < jonatack> can you expand on the missing testing and mocktime? 13:35 < jnewbery> we talk a lot about consensus code being subtle and difficult, but problems in p2p code can be just as dangerous 13:35 * jb55 considering spending more time looking at p2p PRs 13:36 < jnewbery> luckily, I think there are more parallels between p2p code and other software domains (eg networking, telecoms), so for people with experience in those areas then p2p might be somewhere you want to look at more. 13:37 < jonatack> * jb55: there are todos in https://github.com/bitcoin/bitcoin/pull/9606/files 13:37 < jnewbery> jonatack: there are various calls for getting the current time used by bitcoind. GetTime() can be overridden by setmocktime but GetTimeMicros() always uses your system time 13:38 < jnewbery> I *think* that's just historic. I can't think of any good reason for it 13:38 < jnewbery> but it does mean that code that uses GetTimeMicros() can't be tested using mocktime 13:39 < aj> jnewbery: i think if you used mocktime for everything you might get a bug where "setup test, connect peers. mocktime += 1 hour. oh all the peers have been non-responsive for an hour, mass disconnect. tests fail" 13:39 < aj> jnewbery: GetTimeMicros() is used in bunches of places for "here's how long this piece of code took to run" logs, which is pretty reasonable too 13:39 < jonatack> thanks! wdyt about sdaftuar's last comment/question 13:40 < jonatack> could this be tested in compacted mocktime 13:41 < Lightlike> A question concerning the fix in #15834: Now we clear m_tx_inflight_it after our peer first sends an INV(tx), but is totally unresponsive after GETDATA(tx) (not even NOTFOUND). At the same time we accept other transactions from this peer. is there a legit reason for our peer to behave like this besides censoring certain transactions (so that we don't disconnect instead)? 13:42 < aj> i think it'd make sense to switch it to mockable GetTime,interested in what other people think, or if anyone sees any problems with that 13:42 < jnewbery> on the face of it, it seems reasonable. I can't immediately think of any reason that it'd be a problem 13:44 < aj> Lightlike: legit reasons: could be buggy, or maybe there's bugs in our logic where we're taking much longer to send the GETDATA than we think and it's actually totally reasonable for it to have expired, could be others 13:44 < jnewbery> Lightlike: unreliable connection? 13:45 < aj> jnewbery: unreliable with the bounds TCP/tor provides? 13:45 < jnewbery> In general, I think we want to be careful about introducing behaviour that could turn a relatively small bug into something that causes a bunch of peers to disconnect 13:45 < aj> yeah, that^^^ 13:47 < Lightlike> yes, that makes sense :-) 13:49 < jnewbery> aj: yeah TCP/tor should give us reliable messages 13:49 < jnewbery> jonatack: I hope it can be tested with mocktime, or in some way that doesn't require very long-running tests. The problem with long-running tests is that no-one ever runs them 13:50 < jnewbery> (and we explicitly disable them from our Travis CI because they'd cause the run to time out) 13:50 < jonatack> jnewbery: this 13:51 < jnewbery> A few other random thoughts I had about this PR: I liked aj's update to the getpeerinfo RPC here: https://github.com/bitcoin/bitcoin/pull/15776/files#diff-618d3a4f4fc0dc851ec29218c445c148R109 13:52 < jnewbery> Because I like monitoring in general 13:53 < jnewbery> I'd also love to see more work on stuff like https://forkmonitor.info/nodes/btc to provide live monitoring of the network 13:53 < jnewbery> I know that james o'beirne is planning to do some work on that, so if it's something that interests anyone here, please reach out to him 13:54 < jnewbery> Another thing that I'd like to see is more stress testing. Resource leaks like this are really difficult to catch in deterministic, short-running tests 13:55 < jnewbery> Any final questions? 13:55 < aj> https://github.com/ajtowns/bitcoin/commits/201904-reduce-notfounds -- is a branch i was playing with to reduce the underlying cause for getting notfounds btw. i haven't opened a PR for it yet, but it might be interesting to look at 13:55 < jnewbery> aj: cool, thanks! 13:56 < jnewbery> ok, seems like there aren't any more questions 13:56 < jnewbery> This was quite a step up from the previous two weeks in terms of complexity - I'd be interested in what people thought about reviewing. 13:57 < jnewbery> Net Processing is a tricky area. Probably second only to Validation in terms of subtlety and complexity. It'd be great if we had more eyes on it. 13:58 < jnewbery> ok, thanks everyone, and thanks especially to aj! 13:58 < jonatack> Good choice of PR -- I'm learning with these forays into new (for me) places in the codebase 13:58 < Lightlike> I found this very interesting (the p2p code was new to me), but also definitely harder to understand than the one last week. Thanks! 13:58 < merehap> jnewbery, aj: Thanks! 13:58 < jonatack> Thanks aj and jnewbery! 13:59 < aj> the p2p code was new to me too when trying to debug this :) 13:59 < jnewbery> let's wrap it up there. As always, feel free to ping me if you have any thoughts about these. 13:59 < pinheadmz> jnewbery: thank you! this is great, informative, fun 13:59 < pinheadmz> to the moon!