Fix NOTFOUND bug and expire getdata requests for transactions (net processing)

Host: jnewbery

Meeting Log

12:59 < afig> sup yall
13:00 < jb55> yo
13:00 < jnewbery> hi
13:00 < pinheadmz> hi
13:00 < Lightlike> hello
13:00 < merehap> hi
13:00 < amiti> hi
13:00 < b10c> hi
13:01 < jonatack> Hi
13:01 < jnewbery> Hi folks! Thanks for joining PR review club today.
13:01 < jnewbery> Before we start talking about #15834, I wanted to poll for feedback on these meetings. Is there any way we could make them more useful?
13:02 < jnewbery> I'm also taking suggestions for PRs to cover in future weeks. We don't have anything on the agenda yet.
13:02 < JulioBarros> I think the chat interface is interesting but wonder if an audio or video format would be helpful.
13:03 < peevsie> hi
13:03 < Lightlike> just a small thing, could you update the old gist page (with the old agenda) to link to the new page, for the people who have bookmarked that.
13:03 < aj> heyo
13:04 < jnewbery> The reason I chose IRC is that it makes it easier for people who don't have somewhere to take a video/audio call to join, and for people who want to just dip in and out
13:04 < merehap> I definitely prefer IRC over audio/video. Easier to reference the code for a few minutes without falling too far behind in the conversation.
13:04 < jnewbery> also means we have a text log for people who can't join in real time
13:05 < jnewbery> ok, I don't want to take too long on this. If you think of anything, please raise it here or just message me directly
13:05 < jnewbery> OK, onto #15834 - Fix NOTFOUND bug and expire getdata requests for transactions. As usual, I'll give a quick summary before we open up for questions.
13:05 < jb55> my only suggestion is that it might be neat to have a site where people can propose reviews and have people signup to them, with a calendar and everything
13:05 < jb55> perhaps when this gets more popular...
13:06 < jnewbery> jb55: until then, feel free to propose reviews in this channel
13:06 < jnewbery> This PR is for changes to the "Net Processing" layer. That layer sits between "Net" (which manages connections to peers and is responsible for reading/writing bytes off the wire) and "Validation" (which is responsible for our node's view of the state of the blockchain and mempool).
13:06 < jnewbery> "Net Processing" receives application level messages from peers, which can either be data messages (eg tx and block inventory) or control messages (eg gossiping addresses of peers or establishing new connections). The developer reference here: is probably the best place for documentation on p2p messages.
13:07 < jnewbery> Data messages are parsed and then passed up to validation for consideration for us to update mempool/chainstate.
13:07 < jnewbery> This change is in the handling of the INV/GETDATA/TX message exchange, and is an attempt to fix a bug introduced in #14897 where our node could potentially stop asking peers to send us new transactions.
13:08 < jnewbery> As it says in the PR description, this is a smaller alternative to , who's author is in this chat
13:08 < jnewbery> aj was actually the first person to report the bug a couple of months ago
13:09 < jnewbery> as a result of that, we reverted #14897 in the 0.18 release, but kept it in master with the aim of fixing the bug while still keeping the good things about 14897
13:09 < jnewbery> I'll shut up now. Any questions?
13:09 < iiiiiiiiiii> >a bug introduced in #14897
13:09 < jb55> how would you find a bug like this? is there tracing diagnostics you can turn on to see these messages?
13:10 < jnewbery> aj: do you want to comment on how you found the bug?
13:10 < aj> so what happened is i ran "getmempoolinfo" and noticed that it was empty -- figured something weird was going on and mentioned it on irc
13:10 < aj> at which point people said "no, nothing weird is going on, your node is just having problems" or something similar
13:11 < aj> so i tried working out what was going on at that point, because my mempool had just like 12 tx's and wasn't getting any more
13:11 < aj> eventually figured out how to turn net message debugging on to the live system, bitcoin-cli logging '["net"]' or similar
13:12 < jnewbery> It's a bit scary that this was only discovered by chance. I think there's more testing we could be doing to increase our chances of catching things like this earlier on.
13:12 < aj> that then showed that i was getting lots of INV messages from peers, telling me that new tx's were around, but i was never actually sending the GETDATA back to get the details of those transactions
13:12 < jnewbery> that's a good tip. We have different categories of logging that can be enabled and disabled at run time. Use the `logging` RPC.
13:12 < aj> i then tried staring at code to figure out what was wrong, and made guesses. also tried attaching gdb to the running process, but that caused it to crash, and when i restarted everything was fine
13:13 < aj> talking about it on irc, people were wondering if i was connected to hostile peers that were deliberately attacking me; so i checked against gmaxwell's banlist, but didn't get any matches, and noted down the peers i was connected to
13:13 < aj> (i run my node behind NAT and not on tor, so i've only got 8 outgoing peers, no incoming peers)
13:14 < aj> once it restarted and was working fine, i manually reconnected to the peers that didn't work before, to check they did work now, and they did so i took that to mean it wasn't a case of hostile peers, but a bug on my end
13:14 < jonatack> aj, did the custom logging you mentioned in your PR shine light too?
13:14 < Lightlike> In 15776, there is a rather special situation, with the parent of the tx already expired from the relay set, leading to NOTFOUND msg being sent. But wouldn't you need to have 100 different children tx, each with expired parents to reach MAX_PEER_IN_FLIGHT, multiplied by the number of peers (8) to be cut off completely. Is that not extremely unlikely?
13:14 < aj> in the log of #15776 i think you can see some of my guesses and corrections to that
13:15 < aj> Lightlike: it's cumulative, so you use up 1 slot with one tx, then it happens against and you use up another slot
13:16 < aj> Lightlike: and it's not that the parent expires from the relay set, it's that you receive a child you don't have a parent for (because you've just restarted your node after being down for a while, eg), and then you ask the peer for the parent even though they never told you they had the parent, so it's not in the collection of tx's they'll relay to you even though it's in the mempool
13:16 < merehap> Beyond more testing, is there something like using better concurrency primitives or better code structuring to reduce the likelihood of concurrency problems like these? I feel like I've seen several different concurrency bugs file against core. Since concurrency bugs are so difficult to discover and debug, it seems scary if there isn't a more systematic approach to preventing them.
13:17 < merehap> filed* against
13:17 < aj> Lightlike: (also, if the parent isn't in the mempool, but is in your utxo set, but the first two outputs from the transaction have already been spent, then you'll redundantly ask for the tx and get a notfound back)
13:18 < jnewbery> Good context to know is that if a peer asks you for a tx with GETDATA, you'll only send that tx if it's in a data structure called MapRelay. You won't fish the tx out of the mempool if it's not in MapRelay
13:18 < aj> jnewbery: (except you will if you're working with an SPV peer doing bloom filters)
13:19 < jnewbery> ah, yes. Thanks!
13:19 < aj> jnewbery: (there's always an exception. except when there isn't...)
13:19 < jnewbery> merehap: I wouldn't really categorise this as a concurrency bug. This is more like a resource leak (where the resource is txs in flight from a peer)
13:20 < MarcoFalke> Yeah, it is a logic bug. The code was written under the wrong assumptions (missing some "edge" cases)
13:21 < pinheadmz> so the story is: INV(tx)-->    <--GETDATA(tx)   (tx is dropped from relayset)   NOTFOUND(tx)-->      ?
13:21 < jnewbery> pinheadmz: yes
13:21 < pinheadmz> all three, the same tx - not GETDATA(parent of tx)
13:21 < jnewbery> (although of course we can't rely on the peer sending a NOTFOUND)
13:22 < aj> INV(tx) GETDATA(tx) TX(txdata) is normal
13:22 < pinheadmz> so is the tx not really in the maprealy from thebeginning? or is it actually dropped in the time between INV and GETDATA ?
13:23 < aj> INV(tx) GETDATA(tx) TX(txdata) GETDATA(tx.parent) NOTFOUND is common though, because our side thinks "if they have tx, they must have the parent" and their side, running the same software, thinks "i didn't INV tx.parent, so i won't send it"
13:23 < jnewbery> INV(tx_child) --->
13:23 < jnewbery> <--- GETDATA(tx_child)
13:23 < jnewbery> TX(tx_child) --->
13:23 < jnewbery>                   I don't have this child's parents!
13:23 < jnewbery> INV(tx_parent) --->
13:23 < jnewbery> <--- GETDATA(tx_parent)
13:23 < jnewbery> TX(tx_parent) --->
13:23 < aj> jnewbery: the special logging mostly proved the guesses i was making was wrong, fwiw
13:24 < jb55> I found reading the struct TxDownloadState comment helpful for anyone new to this part of the codebase
13:24 < jnewbery> oh sorry, that second INV shouldn't be there
13:24 < jnewbery> correction:
13:24 < jnewbery> INV(tx_child) --->
13:24 < jnewbery> <--- GETDATA(tx_child)
13:24 < jnewbery> TX(tx_child) --->
13:24 < jnewbery>                   I don't have this child's parents!
13:24 < jnewbery> <--- GETDATA(tx_parent)
13:24 < jnewbery> TX(tx_parent) --->
13:25 < jnewbery> that last TX should actually be a NOTFOUND
13:25 < pinheadmz> thanks! visual aids are so helpful :-)
13:25 < pinheadmz> and why wouldnt the first node have the parent? bc it jsut expired? time-wise?
13:25 < jnewbery> expired from the MapRelay but still in the mempool, or cofirmed in a block
13:26 < Lightlike> but if it was confirmed in a block, wouldn't we know about this and not request it?
13:27 < aj> so one thing i wonder is about sdaftuar's question, "switch the uses of GetTimeMicros() in the logic to GetTime() so that we can mock it"; it doesn't seem to make a lot of sense to measure timeouts in the seconds-minutes range with microseconds; but using non-mocktime for "peer has been idle" in general (see #9606) certainly makes sense i think
13:27 < jnewbery> Lightlike: The peer might have it confirmed in a block, but we don't have the block yet? Not sure
13:28 < aj> Lightlike: there's a bug in the handling of missing parents. when actually testing the transaction the logic is done right (in AcceptToMempool), but when that fails, it doesn't say what tx's exactly are missing
13:29 < b10c> jnewbery: and the "expired from the MapRelay but still in the mempool" was introduced in #14897?
13:29 < aj> Lightlike: so when the net_processing code works that out to try to ask for the parents (which generally won't work anyway) it tests to see if the parent is in the mempool, or txid:0 or txid:1 are in the UTXO set
13:29 < aj> Lightlike: but if the output we're looking for is txid:3 and txid:0 and txid:1 are spent, we'll get a false positive
13:30 < jnewbery> b10c: no, I believe that's been normal behaviour since earlier than that
13:30 < aj> i think the tx's in maprelay are kept around so even if the tx expires from the mempool (by being mined, or because the mempool is full) it's still able to be relayed for the 15m's since being INV announced
13:31 < jnewbery> the problem introduced by 14897 is adding txs to the node's m_tx_in_flight and then there being edge-cases where they're never removed
13:32 < b10c> jnewbery: ah I see
13:33 < aj> so previously, it would've just been "we sent a GETDATA and got a NOTFOUND? not big deal, moving on". but now it means a data structure doesn't get cleared, and eventually a limit gets hit, and behaviour changes
13:34 < jb55> p2p code is scary
13:34 < jnewbery> jb55: yes!
13:34 < jb55> so much state
13:34 < jonatack> can you expand on the missing testing and mocktime?
13:35 < jnewbery> we talk a lot about consensus code being subtle and difficult, but problems in p2p code can be just as dangerous
13:35  * jb55 considering spending more time looking at p2p PRs
13:36 < jnewbery> luckily, I think there are more parallels between p2p code and other software domains (eg networking, telecoms), so for people with experience in those areas then p2p might be somewhere you want to look at more.
13:37 < jonatack> * jb55: there are todos in
13:37 < jnewbery> jonatack: there are various calls for getting the current time used by bitcoind. GetTime() can be overridden by setmocktime but GetTimeMicros() always uses your system time
13:38 < jnewbery> I *think* that's just historic. I can't think of any good reason for it
13:38 < jnewbery> but it does mean that code that uses GetTimeMicros() can't be tested using mocktime
13:39 < aj> jnewbery: i think if you used mocktime for everything you might get a bug where "setup test, connect peers. mocktime += 1 hour. oh all the peers have been non-responsive for an hour, mass disconnect. tests fail"
13:39 < aj> jnewbery: GetTimeMicros() is used in bunches of places for "here's how long this piece of code took to run" logs, which is pretty reasonable too
13:39 < jonatack> thanks! wdyt about sdaftuar's last comment/question
13:40 < jonatack> could this be tested in compacted mocktime
13:41 < Lightlike> A question concerning the fix in #15834: Now we clear m_tx_inflight_it after our peer first sends an INV(tx), but is totally unresponsive after GETDATA(tx) (not even NOTFOUND). At the same time we accept other transactions from this peer. is there a legit reason for our peer to behave like this besides censoring certain transactions (so that we don't disconnect instead)?
13:42 < aj> i think it'd make sense to switch it to mockable GetTime,interested in what other people think, or if anyone sees any problems with that
13:42 < jnewbery> on the face of it, it seems reasonable. I can't immediately think of any reason that it'd be a problem
13:44 < aj> Lightlike: legit reasons: could be buggy, or maybe there's bugs in our logic where we're taking much longer to send the GETDATA than we think and it's actually totally reasonable for it to have expired, could be others
13:44 < jnewbery> Lightlike: unreliable connection?
13:45 < aj> jnewbery: unreliable with the bounds TCP/tor provides?
13:45 < jnewbery> In general, I think we want to be careful about introducing behaviour that could turn a relatively small bug into something that causes a bunch of peers to disconnect
13:45 < aj> yeah, that^^^
13:47 < Lightlike> yes, that makes sense :-)
13:49 < jnewbery> aj: yeah TCP/tor should give us reliable messages
13:49 < jnewbery> jonatack: I hope it can be tested with mocktime, or in some way that doesn't require very long-running tests. The problem with long-running tests is that no-one ever runs them
13:50 < jnewbery> (and we explicitly disable them from our Travis CI because they'd cause the run to time out)
13:50 < jonatack> jnewbery: this
13:51 < jnewbery> A few other random thoughts I had about this PR: I liked aj's update to the getpeerinfo RPC here:
13:52 < jnewbery> Because I like monitoring in general
13:53 < jnewbery> I'd also love to see more work on stuff like to provide live monitoring of the network
13:53 < jnewbery> I know that james o'beirne is planning to do some work on that, so if it's something that interests anyone here, please reach out to him
13:54 < jnewbery> Another thing that I'd like to see is more stress testing. Resource leaks like this are really difficult to catch in deterministic, short-running tests
13:55 < jnewbery> Any final questions?
13:55 < aj>  -- is a branch i was playing with to reduce the underlying cause for getting notfounds btw. i haven't opened a PR for it yet, but it might be interesting to look at
13:55 < jnewbery> aj: cool, thanks!
13:56 < jnewbery> ok, seems like there aren't any more questions
13:56 < jnewbery> This was quite a step up from the previous two weeks in terms of complexity - I'd be interested in what people thought about reviewing.
13:57 < jnewbery> Net Processing is a tricky area. Probably second only to Validation in terms of subtlety and complexity. It'd be great if we had more eyes on it.
13:58 < jnewbery> ok, thanks everyone, and thanks especially to aj!
13:58 < jonatack> Good choice of PR -- I'm learning with these forays into new (for me) places in the codebase
13:58 < Lightlike> I found this very interesting (the p2p code was new to me), but also definitely harder to understand than the one last week. Thanks!
13:58 < merehap> jnewbery, aj: Thanks!
13:58 < jonatack> Thanks aj and jnewbery!
13:59 < aj> the p2p code was new to me too when trying to debug this :)
13:59 < jnewbery> let's wrap it up there. As always, feel free to ping me if you have any thoughts about these.
13:59 < pinheadmz> jnewbery: thank you! this is great, informative, fun
13:59 < pinheadmz> to the moon!