Fix NOTFOUND bug and expire getdata requests for transactions (p2p)

May 15, 2019

https://github.com/bitcoin/bitcoin/pull/15834

Host: jnewbery - PR author: sdaftuar

Meeting Log

12:59

<afig> sup yall

13:00

<jb55> yo

13:00

<jnewbery> hi

13:00

<pinheadmz> hi

13:00

<Lightlike> hello

13:00

<merehap> hi

13:00

<amiti> hi

13:00

<b10c> hi

13:01

<jonatack> Hi

13:01

<jnewbery> Hi folks! Thanks for joining PR review club today.

13:01

<jnewbery> Before we start talking about #15834, I wanted to poll for feedback on these meetings. Is there any way we could make them more useful?

13:02

<jnewbery> I'm also taking suggestions for PRs to cover in future weeks. We don't have anything on the agenda yet.

13:02

<JulioBarros> I think the chat interface is interesting but wonder if an audio or video format would be helpful.

13:03

<peevsie> hi

13:03

<Lightlike> just a small thing, could you update the old gist page (with the old agenda) to link to the new page, for the people who have bookmarked that.

13:03

<aj> heyo

13:04

<jnewbery> The reason I chose IRC is that it makes it easier for people who don't have somewhere to take a video/audio call to join, and for people who want to just dip in and out

13:04

<merehap> I definitely prefer IRC over audio/video. Easier to reference the code for a few minutes without falling too far behind in the conversation.

13:04

<jnewbery> also means we have a text log for people who can't join in real time

13:05

<jnewbery> ok, I don't want to take too long on this. If you think of anything, please raise it here or just message me directly

13:05

<jnewbery> OK, onto #15834 - Fix NOTFOUND bug and expire getdata requests for transactions. As usual, I'll give a quick summary before we open up for questions.

13:05

<jb55> my only suggestion is that it might be neat to have a site where people can propose reviews and have people signup to them, with a calendar and everything

13:05

<jb55> perhaps when this gets more popular...

13:06

<jnewbery> jb55: until then, feel free to propose reviews in this channel

13:06

<jnewbery> This PR is for changes to the "Net Processing" layer. That layer sits between "Net" (which manages connections to peers and is responsible for reading/writing bytes off the wire) and "Validation" (which is responsible for our node's view of the state of the blockchain and mempool).

13:06

<jnewbery> "Net Processing" receives application level messages from peers, which can either be data messages (eg tx and block inventory) or control messages (eg gossiping addresses of peers or establishing new connections). The developer reference here: https://btcinformation.org/en/developer-reference#p2p-network is probably the best place for documentation on p2p messages.

13:07

<jnewbery> Data messages are parsed and then passed up to validation for consideration for us to update mempool/chainstate.

13:07

<jnewbery> This change is in the handling of the INV/GETDATA/TX message exchange, and is an attempt to fix a bug introduced in #14897 where our node could potentially stop asking peers to send us new transactions.

13:08

<jnewbery> As it says in the PR description, this is a smaller alternative to https://github.com/bitcoin/bitcoin/pull/15776 , who's author is in this chat

13:08

<jnewbery> aj was actually the first person to report the bug a couple of months ago

13:09

<jnewbery> as a result of that, we reverted #14897 in the 0.18 release, but kept it in master with the aim of fixing the bug while still keeping the good things about 14897

13:09

<jnewbery> I'll shut up now. Any questions?

13:09

<iiiiiiiiiii> >a bug introduced in #14897 https://github.com/bitcoin/bitcoin/pull/14897

13:09

<jb55> how would you find a bug like this? is there tracing diagnostics you can turn on to see these messages?

13:10

<jnewbery> aj: do you want to comment on how you found the bug?

13:10

<aj> so what happened is i ran "getmempoolinfo" and noticed that it was empty -- figured something weird was going on and mentioned it on irc

13:10

<aj> at which point people said "no, nothing weird is going on, your node is just having problems" or something similar

13:11

<aj> so i tried working out what was going on at that point, because my mempool had just like 12 tx's and wasn't getting any more

13:11

<aj> eventually figured out how to turn net message debugging on to the live system, bitcoin-cli logging '["net"]' or similar

13:12

<jnewbery> It's a bit scary that this was only discovered by chance. I think there's more testing we could be doing to increase our chances of catching things like this earlier on.

13:12

<aj> that then showed that i was getting lots of INV messages from peers, telling me that new tx's were around, but i was never actually sending the GETDATA back to get the details of those transactions

13:12

<jnewbery> that's a good tip. We have different categories of logging that can be enabled and disabled at run time. Use the `logging` RPC.

13:12

<aj> i then tried staring at code to figure out what was wrong, and made guesses. also tried attaching gdb to the running process, but that caused it to crash, and when i restarted everything was fine

13:13

<aj> talking about it on irc, people were wondering if i was connected to hostile peers that were deliberately attacking me; so i checked against gmaxwell's banlist, but didn't get any matches, and noted down the peers i was connected to

13:13

<aj> (i run my node behind NAT and not on tor, so i've only got 8 outgoing peers, no incoming peers)

13:14

<aj> once it restarted and was working fine, i manually reconnected to the peers that didn't work before, to check they did work now, and they did so i took that to mean it wasn't a case of hostile peers, but a bug on my end

13:14

<jonatack> aj, did the custom logging you mentioned in your PR shine light too? https://github.com/bitcoin/bitcoin/pull/15776#issuecomment-483916240

13:14

<Lightlike> In 15776, there is a rather special situation, with the parent of the tx already expired from the relay set, leading to NOTFOUND msg being sent. But wouldn't you need to have 100 different children tx, each with expired parents to reach MAX_PEER_IN_FLIGHT, multiplied by the number of peers (8) to be cut off completely. Is that not extremely unlikely?

13:14

<aj> in the log of #15776 i think you can see some of my guesses and corrections to that

13:15

<aj> Lightlike: it's cumulative, so you use up 1 slot with one tx, then it happens against and you use up another slot

13:16

<aj> Lightlike: and it's not that the parent expires from the relay set, it's that you receive a child you don't have a parent for (because you've just restarted your node after being down for a while, eg), and then you ask the peer for the parent even though they never told you they had the parent, so it's not in the collection of tx's they'll relay to you even though it's in the mempool

13:16

<merehap> Beyond more testing, is there something like using better concurrency primitives or better code structuring to reduce the likelihood of concurrency problems like these? I feel like I've seen several different concurrency bugs file against core. Since concurrency bugs are so difficult to discover and debug, it seems scary if there isn't a more systematic approach to preventing them.

13:17

<merehap> filed* against

13:17

<aj> Lightlike: (also, if the parent isn't in the mempool, but is in your utxo set, but the first two outputs from the transaction have already been spent, then you'll redundantly ask for the tx and get a notfound back)

13:18

<jnewbery> Good context to know is that if a peer asks you for a tx with GETDATA, you'll only send that tx if it's in a data structure called MapRelay. You won't fish the tx out of the mempool if it's not in MapRelay

13:18

<aj> jnewbery: (except you will if you're working with an SPV peer doing bloom filters)

13:19

<jnewbery> ah, yes. Thanks!

13:19

<aj> jnewbery: (there's always an exception. except when there isn't...)

13:19

<jnewbery> merehap: I wouldn't really categorise this as a concurrency bug. This is more like a resource leak (where the resource is txs in flight from a peer)

13:20

<MarcoFalke> Yeah, it is a logic bug. The code was written under the wrong assumptions (missing some "edge" cases)

13:21

<pinheadmz> so the story is: INV(tx)--> <--GETDATA(tx) (tx is dropped from relayset) NOTFOUND(tx)--> ?

13:21

<jnewbery> pinheadmz: yes

13:21

<pinheadmz> all three, the same tx - not GETDATA(parent of tx)

13:21

<jnewbery> (although of course we can't rely on the peer sending a NOTFOUND)

13:22

<aj> INV(tx) GETDATA(tx) TX(txdata) is normal

13:22

<pinheadmz> so is the tx not really in the maprealy from thebeginning? or is it actually dropped in the time between INV and GETDATA ?

13:23

<aj> INV(tx) GETDATA(tx) TX(txdata) GETDATA(tx.parent) NOTFOUND is common though, because our side thinks "if they have tx, they must have the parent" and their side, running the same software, thinks "i didn't INV tx.parent, so i won't send it"

13:23

<jnewbery> INV(tx_child) --->

13:23

<jnewbery> <--- GETDATA(tx_child)

13:23

<jnewbery> TX(tx_child) --->

13:23

<jnewbery> I don't have this child's parents!

13:23

<jnewbery> INV(tx_parent) --->

13:23

<jnewbery> <--- GETDATA(tx_parent)

13:23

<jnewbery> TX(tx_parent) --->

13:23

<aj> jnewbery: the special logging mostly proved the guesses i was making was wrong, fwiw

13:24

<jb55> I found reading the struct TxDownloadState comment helpful for anyone new to this part of the codebase

13:24

<jnewbery> oh sorry, that second INV shouldn't be there

13:24

<jnewbery> correction:

13:24

<jnewbery> INV(tx_child) --->

13:24

<jnewbery> <--- GETDATA(tx_child)

13:24

<jnewbery> TX(tx_child) --->

13:24

<jnewbery> I don't have this child's parents!

13:24

<jnewbery> <--- GETDATA(tx_parent)

13:24

<jnewbery> TX(tx_parent) --->

13:25

<jnewbery> that last TX should actually be a NOTFOUND

13:25

<pinheadmz> thanks! visual aids are so helpful :-)

13:25

<pinheadmz> and why wouldnt the first node have the parent? bc it jsut expired? time-wise?

13:25

<jnewbery> expired from the MapRelay but still in the mempool, or cofirmed in a block

13:26

<Lightlike> but if it was confirmed in a block, wouldn't we know about this and not request it?

13:27

<aj> so one thing i wonder is about sdaftuar's question, "switch the uses of GetTimeMicros() in the logic to GetTime() so that we can mock it"; it doesn't seem to make a lot of sense to measure timeouts in the seconds-minutes range with microseconds; but using non-mocktime for "peer has been idle" in general (see #9606) certainly makes sense i think

13:27

<jnewbery> Lightlike: The peer might have it confirmed in a block, but we don't have the block yet? Not sure

13:28

<aj> Lightlike: there's a bug in the handling of missing parents. when actually testing the transaction the logic is done right (in AcceptToMempool), but when that fails, it doesn't say what tx's exactly are missing

13:29

<b10c> jnewbery: and the "expired from the MapRelay but still in the mempool" was introduced in #14897?

13:29

<aj> Lightlike: so when the net_processing code works that out to try to ask for the parents (which generally won't work anyway) it tests to see if the parent is in the mempool, or txid:0 or txid:1 are in the UTXO set

13:29

<aj> Lightlike: but if the output we're looking for is txid:3 and txid:0 and txid:1 are spent, we'll get a false positive

13:30

<jnewbery> b10c: no, I believe that's been normal behaviour since earlier than that

13:30

<aj> i think the tx's in maprelay are kept around so even if the tx expires from the mempool (by being mined, or because the mempool is full) it's still able to be relayed for the 15m's since being INV announced

13:31

<jnewbery> the problem introduced by 14897 is adding txs to the node's m_tx_in_flight and then there being edge-cases where they're never removed

13:32

<b10c> jnewbery: ah I see

100

13:33

<aj> so previously, it would've just been "we sent a GETDATA and got a NOTFOUND? not big deal, moving on". but now it means a data structure doesn't get cleared, and eventually a limit gets hit, and behaviour changes

101

13:34

<jb55> p2p code is scary

102

13:34

<jnewbery> jb55: yes!

103

13:34

<jb55> so much state

104

13:34

<jonatack> can you expand on the missing testing and mocktime?

105

13:35

<jnewbery> we talk a lot about consensus code being subtle and difficult, but problems in p2p code can be just as dangerous

106

13:35

* jb55 considering spending more time looking at p2p PRs

107

13:36

<jnewbery> luckily, I think there are more parallels between p2p code and other software domains (eg networking, telecoms), so for people with experience in those areas then p2p might be somewhere you want to look at more.

108

13:37

<jonatack> * jb55: there are todos in https://github.com/bitcoin/bitcoin/pull/9606/files

109

13:37

<jnewbery> jonatack: there are various calls for getting the current time used by bitcoind. GetTime() can be overridden by setmocktime but GetTimeMicros() always uses your system time

110

13:38

<jnewbery> I *think* that's just historic. I can't think of any good reason for it

111

13:38

<jnewbery> but it does mean that code that uses GetTimeMicros() can't be tested using mocktime

112

13:39

<aj> jnewbery: i think if you used mocktime for everything you might get a bug where "setup test, connect peers. mocktime += 1 hour. oh all the peers have been non-responsive for an hour, mass disconnect. tests fail"

113

13:39

<aj> jnewbery: GetTimeMicros() is used in bunches of places for "here's how long this piece of code took to run" logs, which is pretty reasonable too

114

13:39

<jonatack> thanks! wdyt about sdaftuar's last comment/question

115

13:40

<jonatack> could this be tested in compacted mocktime

116

13:41

<Lightlike> A question concerning the fix in #15834: Now we clear m_tx_inflight_it after our peer first sends an INV(tx), but is totally unresponsive after GETDATA(tx) (not even NOTFOUND). At the same time we accept other transactions from this peer. is there a legit reason for our peer to behave like this besides censoring certain transactions (so that we don't disconnect instead)?

117

13:42

<aj> i think it'd make sense to switch it to mockable GetTime,interested in what other people think, or if anyone sees any problems with that

118

13:42

<jnewbery> on the face of it, it seems reasonable. I can't immediately think of any reason that it'd be a problem

119

13:44

<aj> Lightlike: legit reasons: could be buggy, or maybe there's bugs in our logic where we're taking much longer to send the GETDATA than we think and it's actually totally reasonable for it to have expired, could be others

120

13:44

<jnewbery> Lightlike: unreliable connection?

121

13:45

<aj> jnewbery: unreliable with the bounds TCP/tor provides?

122

13:45

<jnewbery> In general, I think we want to be careful about introducing behaviour that could turn a relatively small bug into something that causes a bunch of peers to disconnect

123

13:45

<aj> yeah, that^^^

124

13:47

<Lightlike> yes, that makes sense :-)

125

13:49

<jnewbery> aj: yeah TCP/tor should give us reliable messages

126

13:49

<jnewbery> jonatack: I hope it can be tested with mocktime, or in some way that doesn't require very long-running tests. The problem with long-running tests is that no-one ever runs them

127

13:50

<jnewbery> (and we explicitly disable them from our Travis CI because they'd cause the run to time out)

128

13:50

<jonatack> jnewbery: this

129

13:51

<jnewbery> A few other random thoughts I had about this PR: I liked aj's update to the getpeerinfo RPC here: https://github.com/bitcoin/bitcoin/pull/15776/files#diff-618d3a4f4fc0dc851ec29218c445c148R109

130

13:52

<jnewbery> Because I like monitoring in general

131

13:53

<jnewbery> I'd also love to see more work on stuff like https://forkmonitor.info/nodes/btc to provide live monitoring of the network

132

13:53

<jnewbery> I know that james o'beirne is planning to do some work on that, so if it's something that interests anyone here, please reach out to him

133

13:54

<jnewbery> Another thing that I'd like to see is more stress testing. Resource leaks like this are really difficult to catch in deterministic, short-running tests

134

13:55

<jnewbery> Any final questions?

135

13:55

<aj> https://github.com/ajtowns/bitcoin/commits/201904-reduce-notfounds -- is a branch i was playing with to reduce the underlying cause for getting notfounds btw. i haven't opened a PR for it yet, but it might be interesting to look at

136

13:55

<jnewbery> aj: cool, thanks!

137

13:56

<jnewbery> ok, seems like there aren't any more questions

138

13:56

<jnewbery> This was quite a step up from the previous two weeks in terms of complexity - I'd be interested in what people thought about reviewing.

139

13:57

<jnewbery> Net Processing is a tricky area. Probably second only to Validation in terms of subtlety and complexity. It'd be great if we had more eyes on it.

140

13:58

<jnewbery> ok, thanks everyone, and thanks especially to aj!

141

13:58

<jonatack> Good choice of PR -- I'm learning with these forays into new (for me) places in the codebase

142

13:58

<Lightlike> I found this very interesting (the p2p code was new to me), but also definitely harder to understand than the one last week. Thanks!

143

13:58

<merehap> jnewbery, aj: Thanks!