Fix NOTFOUND bug and expire getdata requests for transactions (p2p)

Host: jnewbery  -  PR author: sdaftuar

Meeting Log

  112:59 <afig> sup yall
  213:00 <jb55> yo
  313:00 <jnewbery> hi
  413:00 <pinheadmz> hi
  513:00 <Lightlike> hello
  613:00 <merehap> hi
  713:00 <amiti> hi
  813:00 <b10c> hi
  913:01 <jonatack> Hi
 1013:01 <jnewbery> Hi folks! Thanks for joining PR review club today.
 1113:01 <jnewbery> Before we start talking about #15834, I wanted to poll for feedback on these meetings. Is there any way we could make them more useful?
 1213:02 <jnewbery> I'm also taking suggestions for PRs to cover in future weeks. We don't have anything on the agenda yet.
 1313:02 <JulioBarros> I think the chat interface is interesting but wonder if an audio or video format would be helpful.
 1413:03 <peevsie> hi
 1513:03 <Lightlike> just a small thing, could you update the old gist page (with the old agenda) to link to the new page, for the people who have bookmarked that.
 1613:03 <aj> heyo
 1713:04 <jnewbery> The reason I chose IRC is that it makes it easier for people who don't have somewhere to take a video/audio call to join, and for people who want to just dip in and out
 1813:04 <merehap> I definitely prefer IRC over audio/video. Easier to reference the code for a few minutes without falling too far behind in the conversation.
 1913:04 <jnewbery> also means we have a text log for people who can't join in real time
 2013:05 <jnewbery> ok, I don't want to take too long on this. If you think of anything, please raise it here or just message me directly
 2113:05 <jnewbery> OK, onto #15834 - Fix NOTFOUND bug and expire getdata requests for transactions. As usual, I'll give a quick summary before we open up for questions.
 2213:05 <jb55> my only suggestion is that it might be neat to have a site where people can propose reviews and have people signup to them, with a calendar and everything
 2313:05 <jb55> perhaps when this gets more popular...
 2413:06 <jnewbery> jb55: until then, feel free to propose reviews in this channel
 2513:06 <jnewbery> This PR is for changes to the "Net Processing" layer. That layer sits between "Net" (which manages connections to peers and is responsible for reading/writing bytes off the wire) and "Validation" (which is responsible for our node's view of the state of the blockchain and mempool).
 2613:06 <jnewbery> "Net Processing" receives application level messages from peers, which can either be data messages (eg tx and block inventory) or control messages (eg gossiping addresses of peers or establishing new connections). The developer reference here: is probably the best place for documentation on p2p messages.
 2713:07 <jnewbery> Data messages are parsed and then passed up to validation for consideration for us to update mempool/chainstate.
 2813:07 <jnewbery> This change is in the handling of the INV/GETDATA/TX message exchange, and is an attempt to fix a bug introduced in #14897 where our node could potentially stop asking peers to send us new transactions.
 2913:08 <jnewbery> As it says in the PR description, this is a smaller alternative to , who's author is in this chat
 3013:08 <jnewbery> aj was actually the first person to report the bug a couple of months ago
 3113:09 <jnewbery> as a result of that, we reverted #14897 in the 0.18 release, but kept it in master with the aim of fixing the bug while still keeping the good things about 14897
 3213:09 <jnewbery> I'll shut up now. Any questions?
 3313:09 <iiiiiiiiiii> >a bug introduced in #14897
 3413:09 <jb55> how would you find a bug like this? is there tracing diagnostics you can turn on to see these messages?
 3513:10 <jnewbery> aj: do you want to comment on how you found the bug?
 3613:10 <aj> so what happened is i ran "getmempoolinfo" and noticed that it was empty -- figured something weird was going on and mentioned it on irc
 3713:10 <aj> at which point people said "no, nothing weird is going on, your node is just having problems" or something similar
 3813:11 <aj> so i tried working out what was going on at that point, because my mempool had just like 12 tx's and wasn't getting any more
 3913:11 <aj> eventually figured out how to turn net message debugging on to the live system, bitcoin-cli logging '["net"]' or similar
 4013:12 <jnewbery> It's a bit scary that this was only discovered by chance. I think there's more testing we could be doing to increase our chances of catching things like this earlier on.
 4113:12 <aj> that then showed that i was getting lots of INV messages from peers, telling me that new tx's were around, but i was never actually sending the GETDATA back to get the details of those transactions
 4213:12 <jnewbery> that's a good tip. We have different categories of logging that can be enabled and disabled at run time. Use the `logging` RPC.
 4313:12 <aj> i then tried staring at code to figure out what was wrong, and made guesses. also tried attaching gdb to the running process, but that caused it to crash, and when i restarted everything was fine
 4413:13 <aj> talking about it on irc, people were wondering if i was connected to hostile peers that were deliberately attacking me; so i checked against gmaxwell's banlist, but didn't get any matches, and noted down the peers i was connected to
 4513:13 <aj> (i run my node behind NAT and not on tor, so i've only got 8 outgoing peers, no incoming peers)
 4613:14 <aj> once it restarted and was working fine, i manually reconnected to the peers that didn't work before, to check they did work now, and they did so i took that to mean it wasn't a case of hostile peers, but a bug on my end
 4713:14 <jonatack> aj, did the custom logging you mentioned in your PR shine light too?
 4813:14 <Lightlike> In 15776, there is a rather special situation, with the parent of the tx already expired from the relay set, leading to NOTFOUND msg being sent. But wouldn't you need to have 100 different children tx, each with expired parents to reach MAX_PEER_IN_FLIGHT, multiplied by the number of peers (8) to be cut off completely. Is that not extremely unlikely?
 4913:14 <aj> in the log of #15776 i think you can see some of my guesses and corrections to that
 5013:15 <aj> Lightlike: it's cumulative, so you use up 1 slot with one tx, then it happens against and you use up another slot
 5113:16 <aj> Lightlike: and it's not that the parent expires from the relay set, it's that you receive a child you don't have a parent for (because you've just restarted your node after being down for a while, eg), and then you ask the peer for the parent even though they never told you they had the parent, so it's not in the collection of tx's they'll relay to you even though it's in the mempool
 5213:16 <merehap> Beyond more testing, is there something like using better concurrency primitives or better code structuring to reduce the likelihood of concurrency problems like these? I feel like I've seen several different concurrency bugs file against core. Since concurrency bugs are so difficult to discover and debug, it seems scary if there isn't a more systematic approach to preventing them.
 5313:17 <merehap> filed* against
 5413:17 <aj> Lightlike: (also, if the parent isn't in the mempool, but is in your utxo set, but the first two outputs from the transaction have already been spent, then you'll redundantly ask for the tx and get a notfound back)
 5513:18 <jnewbery> Good context to know is that if a peer asks you for a tx with GETDATA, you'll only send that tx if it's in a data structure called MapRelay. You won't fish the tx out of the mempool if it's not in MapRelay
 5613:18 <aj> jnewbery: (except you will if you're working with an SPV peer doing bloom filters)
 5713:19 <jnewbery> ah, yes. Thanks!
 5813:19 <aj> jnewbery: (there's always an exception. except when there isn't...)
 5913:19 <jnewbery> merehap: I wouldn't really categorise this as a concurrency bug. This is more like a resource leak (where the resource is txs in flight from a peer)
 6013:20 <MarcoFalke> Yeah, it is a logic bug. The code was written under the wrong assumptions (missing some "edge" cases)
 6113:21 <pinheadmz> so the story is: INV(tx)--> <--GETDATA(tx) (tx is dropped from relayset) NOTFOUND(tx)--> ?
 6213:21 <jnewbery> pinheadmz: yes
 6313:21 <pinheadmz> all three, the same tx - not GETDATA(parent of tx)
 6413:21 <jnewbery> (although of course we can't rely on the peer sending a NOTFOUND)
 6513:22 <aj> INV(tx) GETDATA(tx) TX(txdata) is normal
 6613:22 <pinheadmz> so is the tx not really in the maprealy from thebeginning? or is it actually dropped in the time between INV and GETDATA ?
 6713:23 <aj> INV(tx) GETDATA(tx) TX(txdata) GETDATA(tx.parent) NOTFOUND is common though, because our side thinks "if they have tx, they must have the parent" and their side, running the same software, thinks "i didn't INV tx.parent, so i won't send it"
 6813:23 <jnewbery> INV(tx_child) --->
 6913:23 <jnewbery> <--- GETDATA(tx_child)
 7013:23 <jnewbery> TX(tx_child) --->
 7113:23 <jnewbery> I don't have this child's parents!
 7213:23 <jnewbery> INV(tx_parent) --->
 7313:23 <jnewbery> <--- GETDATA(tx_parent)
 7413:23 <jnewbery> TX(tx_parent) --->
 7513:23 <aj> jnewbery: the special logging mostly proved the guesses i was making was wrong, fwiw
 7613:24 <jb55> I found reading the struct TxDownloadState comment helpful for anyone new to this part of the codebase
 7713:24 <jnewbery> oh sorry, that second INV shouldn't be there
 7813:24 <jnewbery> correction:
 7913:24 <jnewbery> INV(tx_child) --->
 8013:24 <jnewbery> <--- GETDATA(tx_child)
 8113:24 <jnewbery> TX(tx_child) --->
 8213:24 <jnewbery> I don't have this child's parents!
 8313:24 <jnewbery> <--- GETDATA(tx_parent)
 8413:24 <jnewbery> TX(tx_parent) --->
 8513:25 <jnewbery> that last TX should actually be a NOTFOUND
 8613:25 <pinheadmz> thanks! visual aids are so helpful :-)
 8713:25 <pinheadmz> and why wouldnt the first node have the parent? bc it jsut expired? time-wise?
 8813:25 <jnewbery> expired from the MapRelay but still in the mempool, or cofirmed in a block
 8913:26 <Lightlike> but if it was confirmed in a block, wouldn't we know about this and not request it?
 9013:27 <aj> so one thing i wonder is about sdaftuar's question, "switch the uses of GetTimeMicros() in the logic to GetTime() so that we can mock it"; it doesn't seem to make a lot of sense to measure timeouts in the seconds-minutes range with microseconds; but using non-mocktime for "peer has been idle" in general (see #9606) certainly makes sense i think
 9113:27 <jnewbery> Lightlike: The peer might have it confirmed in a block, but we don't have the block yet? Not sure
 9213:28 <aj> Lightlike: there's a bug in the handling of missing parents. when actually testing the transaction the logic is done right (in AcceptToMempool), but when that fails, it doesn't say what tx's exactly are missing
 9313:29 <b10c> jnewbery: and the "expired from the MapRelay but still in the mempool" was introduced in #14897?
 9413:29 <aj> Lightlike: so when the net_processing code works that out to try to ask for the parents (which generally won't work anyway) it tests to see if the parent is in the mempool, or txid:0 or txid:1 are in the UTXO set
 9513:29 <aj> Lightlike: but if the output we're looking for is txid:3 and txid:0 and txid:1 are spent, we'll get a false positive
 9613:30 <jnewbery> b10c: no, I believe that's been normal behaviour since earlier than that
 9713:30 <aj> i think the tx's in maprelay are kept around so even if the tx expires from the mempool (by being mined, or because the mempool is full) it's still able to be relayed for the 15m's since being INV announced
 9813:31 <jnewbery> the problem introduced by 14897 is adding txs to the node's m_tx_in_flight and then there being edge-cases where they're never removed
 9913:32 <b10c> jnewbery: ah I see
10013:33 <aj> so previously, it would've just been "we sent a GETDATA and got a NOTFOUND? not big deal, moving on". but now it means a data structure doesn't get cleared, and eventually a limit gets hit, and behaviour changes
10113:34 <jb55> p2p code is scary
10213:34 <jnewbery> jb55: yes!
10313:34 <jb55> so much state
10413:34 <jonatack> can you expand on the missing testing and mocktime?
10513:35 <jnewbery> we talk a lot about consensus code being subtle and difficult, but problems in p2p code can be just as dangerous
10613:35 * jb55 considering spending more time looking at p2p PRs
10713:36 <jnewbery> luckily, I think there are more parallels between p2p code and other software domains (eg networking, telecoms), so for people with experience in those areas then p2p might be somewhere you want to look at more.
10813:37 <jonatack> * jb55: there are todos in
10913:37 <jnewbery> jonatack: there are various calls for getting the current time used by bitcoind. GetTime() can be overridden by setmocktime but GetTimeMicros() always uses your system time
11013:38 <jnewbery> I *think* that's just historic. I can't think of any good reason for it
11113:38 <jnewbery> but it does mean that code that uses GetTimeMicros() can't be tested using mocktime
11213:39 <aj> jnewbery: i think if you used mocktime for everything you might get a bug where "setup test, connect peers. mocktime += 1 hour. oh all the peers have been non-responsive for an hour, mass disconnect. tests fail"
11313:39 <aj> jnewbery: GetTimeMicros() is used in bunches of places for "here's how long this piece of code took to run" logs, which is pretty reasonable too
11413:39 <jonatack> thanks! wdyt about sdaftuar's last comment/question
11513:40 <jonatack> could this be tested in compacted mocktime
11613:41 <Lightlike> A question concerning the fix in #15834: Now we clear m_tx_inflight_it after our peer first sends an INV(tx), but is totally unresponsive after GETDATA(tx) (not even NOTFOUND). At the same time we accept other transactions from this peer. is there a legit reason for our peer to behave like this besides censoring certain transactions (so that we don't disconnect instead)?
11713:42 <aj> i think it'd make sense to switch it to mockable GetTime,interested in what other people think, or if anyone sees any problems with that
11813:42 <jnewbery> on the face of it, it seems reasonable. I can't immediately think of any reason that it'd be a problem
11913:44 <aj> Lightlike: legit reasons: could be buggy, or maybe there's bugs in our logic where we're taking much longer to send the GETDATA than we think and it's actually totally reasonable for it to have expired, could be others
12013:44 <jnewbery> Lightlike: unreliable connection?
12113:45 <aj> jnewbery: unreliable with the bounds TCP/tor provides?
12213:45 <jnewbery> In general, I think we want to be careful about introducing behaviour that could turn a relatively small bug into something that causes a bunch of peers to disconnect
12313:45 <aj> yeah, that^^^
12413:47 <Lightlike> yes, that makes sense :-)
12513:49 <jnewbery> aj: yeah TCP/tor should give us reliable messages
12613:49 <jnewbery> jonatack: I hope it can be tested with mocktime, or in some way that doesn't require very long-running tests. The problem with long-running tests is that no-one ever runs them
12713:50 <jnewbery> (and we explicitly disable them from our Travis CI because they'd cause the run to time out)
12813:50 <jonatack> jnewbery: this
12913:51 <jnewbery> A few other random thoughts I had about this PR: I liked aj's update to the getpeerinfo RPC here:
13013:52 <jnewbery> Because I like monitoring in general
13113:53 <jnewbery> I'd also love to see more work on stuff like to provide live monitoring of the network
13213:53 <jnewbery> I know that james o'beirne is planning to do some work on that, so if it's something that interests anyone here, please reach out to him
13313:54 <jnewbery> Another thing that I'd like to see is more stress testing. Resource leaks like this are really difficult to catch in deterministic, short-running tests
13413:55 <jnewbery> Any final questions?
13513:55 <aj> -- is a branch i was playing with to reduce the underlying cause for getting notfounds btw. i haven't opened a PR for it yet, but it might be interesting to look at
13613:55 <jnewbery> aj: cool, thanks!
13713:56 <jnewbery> ok, seems like there aren't any more questions
13813:56 <jnewbery> This was quite a step up from the previous two weeks in terms of complexity - I'd be interested in what people thought about reviewing.
13913:57 <jnewbery> Net Processing is a tricky area. Probably second only to Validation in terms of subtlety and complexity. It'd be great if we had more eyes on it.
14013:58 <jnewbery> ok, thanks everyone, and thanks especially to aj!
14113:58 <jonatack> Good choice of PR -- I'm learning with these forays into new (for me) places in the codebase
14213:58 <Lightlike> I found this very interesting (the p2p code was new to me), but also definitely harder to understand than the one last week. Thanks!
14313:58 <merehap> jnewbery, aj: Thanks!
14413:58 <jonatack> Thanks aj and jnewbery!
14513:59 <aj> the p2p code was new to me too when trying to debug this :)
14613:59 <jnewbery> let's wrap it up there. As always, feel free to ping me if you have any thoughts about these.
14713:59 <pinheadmz> jnewbery: thank you! this is great, informative, fun
14813:59 <pinheadmz> to the moon!