Immediately disconnect on invalid net message checksum (p2p)

https://github.com/bitcoin/bitcoin/pulls/15206

Host: jnewbery  -  PR author: jonasschnelli

The PR branch HEAD was 1d9bc6a8c at the time of this review club meeting.

Note: an earlier version of this PR moved hash finalization from the Message Handler thread to the Socket Handler thread. That actually happened in PR 16202, so you can ignore discussion about moving the hash calculation.

Notes

  • Peer-to-peer message headers include a checksum for message integrity. If any of the message gets corrupted, then the calculated checksum from the payload won’t match the checksum in the header. See the developer notes for full details of the message header format.

  • The checksum is the first four bytes of the double-SHA256 of the message payload.

  • SHA256() is a Merkle–Damgård hash function. In this hash function construction, the pre-image is split into blocks, which are fed one-by-one into a compression function.

  • Because of the serial way that the hash is constructed, we can do some of the hashing work in the Socket Handler thread as we receive the bytes of the wire. See the readData() function, which calls into CSHA256::Write().

  • The final round of hashing is done in GetMessageHash(), which calls through to CSHA256::Finalize().

  • Prior to PR 16202, GetMessageHash() was called by ProcessMessages(), which is executed by the Message Handler thread. After that PR, the hash finalization is done in the Socket Hander thread, and the Message Handler thread just checks the saved hash data member.

  • If the calculated checksum does not match the checksum in the message header, then the message is dropped.

  • This PR proposes removing any checksum logic from net_processing, and moving it to net. It also proposes immediately disconnecting peers that send us a message with a bad checksum.

Questions

  1. Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK? (Don’t forget to put your PR review on GitHub.)

  2. What are the benefits of handling bad checksums in the net layer instead of in net processing?

  3. Are there any other checks that should be moved from net processing into net?

  4. What do you think about the behaviour change in this PR? Is it better to simply drop the message or should we disconnect from the peer?

Meeting Log

  113:00 <jnewbery> #startmeeing
  213:00 <MarcoFalke> hi
  313:00 <troygiorshev> hi
  413:00 <ariard> hi
  513:00 <nehan> hi
  613:00 <theStack> hi
  713:00 <kanzure> hi
  813:00 <thomasb06> hi
  913:00 <jnewbery> greetings earthlings (and other universe dwellers)
 1013:00 <raj_149> Hi..
 1113:00 <jnewbery> Notes and questions: https://bitcoincore.reviews/15206.html
 1213:00 <michaelfolkson> hi
 1313:00 <emzy> Hi
 1413:00 <felixweis> hi
 1513:00 <jkczyz> hi
 1613:00 <sipa> hi
 1713:00 <lightlike> hi
 1813:01 <pinheadmz> hi
 1913:01 <jnewbery> Normal reminders: all questions are welcome! If you're confused about something, other people probably are too, so ask! No need to ask to ask, just ask!
 2013:01 <andrewtoth> hi
 2113:01 <jnewbery> who had a chance to review this week? (y/n)
 2213:01 <pinheadmz> y
 2313:01 <ariard> y
 2413:01 <raj_149> y
 2513:01 <felixweis> n
 2613:01 <MarcoFalke> y
 2713:01 <theStack> n
 2813:01 <lightlike> y
 2913:01 <troygiorshev> y
 3013:02 <vasild> Hi, n
 3113:02 <gimballock> Hi
 3213:02 <nehan> y
 3313:02 <andrewtoth> y
 3413:02 <jkczyz> y
 3513:02 <jnewbery> Great, anyone want to summarize the PR? Are you concept ACKs/NACKs?
 3613:02 <emzy> n
 3713:03 <pinheadmz> it moves the p2p message checksum check up earlier in the message processing prrocess
 3813:04 <troygiorshev> Concept ACK. Two parts: 1) Move checksum check to net layer, 2) change behavior upon check failure to disconnecting the peer, from ignoring the message
 3913:04 <pinheadmz> and the reviewers are trying to decide if it should ban a peer that sends a malformed message
 4013:04 <jonasschnelli> Hi (only partially here)
 4113:04 <jnewbery> troygiorshev pinheadmz: exactly. This isn't a pure refactor, it does change behaviour
 4213:04 <raj_149> Move hashing op into network layer. Move it from message processing to socket Handler. .
 4313:04 <jonasschnelli> The current behavior of the PR is not a ban
 4413:05 <jnewbery> we'll go into that in the next questions
 4513:05 <jnewbery> hi jonasschnelli! Thanks for joining us
 4613:05 <nehan> also changes which thread does the Finalize, I think?
 4713:05 <sipa> only the finalization moves
 4813:06 <sipa> hashing for bulk of messages was already in net
 4913:06 <jonasschnelli> Only the finalize operation changes...
 5013:06 <jonasschnelli> Though we have a lot of small messages
 5113:06 <jnewbery> I don't think this PR actually moves which thread does the finalize
 5213:06 <pinheadmz> yeah looked to me like originally the checksum was checked and stored in a bool but not /acted/ upon until way later
 5313:06 <jonasschnelli> I think it does... but can’t say for sure. It been a while
 5413:07 <jnewbery> When the PR was opened it did, but since then, PR16202 was merged, which moves the finalize operation into the socket handler thread
 5513:07 <sipa> oh, ok!
 5613:07 <pinheadmz> oh interesting. yeah i had a pre-game quesiton about the timing of this one
 5713:07 <pinheadmz> has there been a recent flood of malformed messages or something?
 5813:07 <pinheadmz> what rises this PR from its year of sleep?
 5913:07 <jonasschnelli> Oh. Thanks for that info jnewbery
 6013:08 <michaelfolkson> I think some of the review comments touched upon this but what are the likely reasons for sending an incorrect checksum (other than maliciousness)?
 6113:08 <jonasschnelli> Some BIp324 discussion
 6213:08 <nehan> jnewbery: oh right, that's the first sentence of the review notes
 6313:08 <jnewbery> pinheadmz: I just saw it was rebased recently and started receiving some review attention and thought net/net_processing separation would be interesting to talk about at review club
 6413:09 <jnewbery> ok, so first question. What are the benefits of handling bad checksums in the net layer instead of in net processing?
 6513:09 <vasild> michaelfolkson: implementation bug, memory corruption, also memory corruption at the receiving end.
 6613:09 <sipa> just broken network infrastructure
 6713:10 <vasild> we can rely on TCP to not corrupt the data?
 6813:10 <nehan> jnewbery: get to it sooner, use fewer resources processing bad messages
 6913:10 <jnewbery> (I'll be a bit loose with language and use "net layer" == "socket handler thread" and "net processing layer" == "message handler thread")
 7013:10 <michaelfolkson> vasild: Which could all be honest failures right? We're disconnecting for our own benefit as a node rather than because we think the incorrect checksum was sent due to maliciousness
 7113:11 <emzy> For network problems there is l a TCP checksum. Ok could be manipulatet bei firewall stuff.
 7213:11 <jnewbery> vasild: great question. TCP should take care of errors for us. So how else could header checksum be wrong?
 7313:11 <theStack> jnewbery: i'd say it makes generally sense to detect failures at the lowest layer possible
 7413:11 <vasild> I mean we can rely that TCP will detect and re-transmit corrupted data. So there is no way to get corrupted data due to a malfuctioning router in between?
 7513:11 <vasild> michaelfolkson: yes
 7613:12 <ariard> vasild: TCP does integrity check ?
 7713:12 <emzy> vasild: yes, normaly.
 7813:12 <jonasschnelli> Would firewall randomly manipulate a single message? Or constant?
 7913:12 <sipa> jonasschnelli: intentionally it wouldn't change anything at all
 8013:12 <jonasschnelli> (Sry phone typing)
 8113:12 <lightlike> While reviewing I wondered if the changed disconnect behavior is a consequence of a change done mostly for architectural reasons (not so easy to imitate the old behavior after moving from net_processing to net) or something desirable in itself.
 8213:12 <vasild> ariard: yes
 8313:13 <jonasschnelli> Is the firewall a hypothetical issue or did someone report / monitor that?
 8413:13 <emzy> jonasschnelli: you never know. there is deep packet inspection.
 8513:13 <sipa> but routers can reassemble TCP packets
 8613:13 <jnewbery> nehan theStack: in the original version of this PR, I'd agree with you. I expect there was a performance benefit from moving the finalize operation to the socket handler thread, since the message handler thread is our bottleneck. Since 16202 was merged, there isn't a performance benefit ...
 8713:13 <felixweis> tcp has a 16 bit checksum
 8813:13 <sipa> so the TCP checksum is only an point-to-point checksum, not end-to-end
 8913:13 <jonasschnelli> Which is relatively weak
 9013:13 <jnewbery> ... but I think from a architecture/layering perspective, this makes sense
 9113:13 <MarcoFalke> lightlike: I posted a patch that keeps the check in net_processing, so no
 9213:13 <ariard> right but your NAT firewall may recompute the checksum after modification IIRC
 9313:13 <sipa> ariard: yes
 9413:13 <sipa> it will
 9513:14 <troygiorshev> jnewbery: agreement on the architecture/layering reason
 9613:14 <nehan> jnewbery: what is the architecture/layering reason?
 9713:15 <jnewbery> anyone want to answer nehan's question about layering?
 9813:15 <sipa> it's ugly that the checksum checking is in net_processing, because it's a network protocol level thing
 9913:15 * MarcoFalke move the checksum check to the wallet. *hides
10013:16 <troygiorshev> nehan: The header wraps around some data. (Just like how, say, an ethernet header wraps around some data). Ideally you do the checksums, check the header, and then hand the data to the layer above you. Then that layer doesn't have to worry about anythign
10113:16 <sipa> in particular when facing BIP324, where the checksum because dependent on the transport used... it's strange that the network processing layer would even know which transport is used
10213:16 <nehan> maybe you could share a bit about how you see hte split between net/net_processing
10313:16 <nehan> net is network protocol; net_processing is more application (node) semantics?
10413:17 <felixweis> stream vs message handling?
10513:17 <nehan> and you consider the "header" part of the network protocol? that's application-specific, right? it's not the packet header
10613:17 <troygiorshev> nehan: yes exactly
10713:17 <jnewbery> nehan: exactly
10813:17 <sipa> net is interaction with socket layer, and P2P transport layer
10913:17 <lightlike> the PR description gives another reason: It would be desirable to implement alternative transport layer protocols such as BIP155 in net, without having to change anything in net_processing
11013:17 <sipa> the P2P transport layer transports messages, which are handled by net_processing
11113:18 <jnewbery> The distinction on our software is not perfect. Take a look a CNode in net.h and CNodeState in net_processing.cpp. Ideally, CNode would just be about the connection to another node and CNodeState would just be about application-level stuff (ie inventory of transactions and blocks, address gossip, etc)
11213:19 <jnewbery> Greg's point here: https://github.com/bitcoin/bitcoin/pull/15206#issuecomment-460082894 is about firewalls trying to be application aware, and messing around with IP addresses inside application messages. That sometimes happens when firewalls are trying to deal with things like NAT traversal.
11313:19 <sipa> nehan: i guess you'd have to see the bitcoin p2p protocol as consisting of two layers itself
11413:19 <jnewbery> In a previous life I worked in telecoms, and we used to have lots of similar issues, where firewalls would mess around with IP addresses inside SIP and SDP messages and things would stop working.
11513:19 <gzhao408> jnewbery When you say that "message handler thread is our bottleneck" it seems like we want to protect that thread from DoS/wasted resources. Is this a concern here/in general?
11613:20 <jnewbery> gzhao408: that's a concern everywhere!
11713:20 <sipa> it depends whether you're talking about honest overload or DoS attack
11813:20 <gzhao408> jnewbery er sorry, I mean is it something we care /especially/ about e.g. because it's a common dos vector?
11913:20 <jnewbery> but yes, here is a good example. In general, like nehan and theStack said earlier, it's best to deal with failures and bad messages as early and as low in the stack as possible.
12013:21 <ariard> sipa: do we have a way to dissociate both right now ?
12113:21 <lightlike> if these corruption issues can happen spuriously to otherwise good peers in some cases, I would prefer not to disconnect.
12213:21 <ariard> like when downloading too much blocks become a DoS ?
12313:21 <theStack> would a proper analogy for net vs. net_processing be be ethernet (layer 2) vs ip (layer 3)? the ethernet frame has a checksum, which is also checked on layer 2, and layer3 never gets to see it
12413:21 <sipa> ariard: sure it can, but we have no protection against that
12513:22 <jnewbery> gzhao408: probably not. This isn't a big dos concern. We always have to do the checksum calculations, so a node trying to waste our resources gains no advantage by sending a message with a bad checksum.
12613:22 <ariard> I think there is another point by moving this from net_processing to net, it's happen in 2 different threads right now
12713:22 <emzy> What about Erlay BIP 330? Will it also bettter to move the checksum out to net for that?
12813:22 <felixweis> could ASMAP be used to limit upload rates per network?
12913:22 <ariard> and lets say you have a single-threaded system, you may have some buffer growing quickly before switch
13013:22 <troygiorshev> theStack: I think so. An interesting parellel is that just as the ethernet header contains a section saying "IP", the bitcoin header contains a sections specifying the command name
13113:22 <jnewbery> theStack: I'm not sure that analogy is useful. You can point out some similarities, but I don't think it's particularly illuminating
13213:22 <sipa> emzy: completely orthogonal; erlay is entirely net_processing
13313:23 <ariard> felixweis: I would fear someone in your ASN wasting resources for you, like some kind of impersonation
13413:24 <sipa> felixweis: the general solution to resource wasting attack is just keep track of how much resources every peer is using; if things get tight, slow down processing of the worst one
13513:24 <sipa> felixweis: that's the first, and hard, step
13613:24 <jnewbery> lightlike: To understand whether this is actually a concern, it'd be useful to look at real-world data.
13713:24 <sipa> doing it per asmap seems like a nice optimization on top of that
13813:24 <jnewbery> Has anyone grepped their debug log for "CHECKSUM ERROR"?
13913:24 <sipa> felixweis: sorry, how many resources you're consuming on behalf of every peer
14013:24 <sipa> jnewbery: yes
14113:25 <MarcoFalke> net_processing is similar to the rpc server I guess. They both can send data structures to validation, and they both read raw bytes from a socket or so
14213:25 <troygiorshev> sipa: What did you generally find?
14313:25 <felixweis> ariad, sipa: interesting points!
14413:25 <MarcoFalke> I found one error
14513:25 <sipa> troygiorshev: only one completely broken peer, which sends bogus messages with checksum all 0
14613:25 <MarcoFalke> 2020-03-03T23:14:54Z ProcessMessages(inv, 397 bytes): CHECKSUM ERROR peer=1849951
14713:25 <troygiorshev> sipa: bogus message with SMBr as the type?
14813:25 <sipa> indeed
14913:25 <MarcoFalke> 2020-03-03T22:49:52Z receive version message: /Satoshi:0.18.0/: version 70015, blocks=620054, us=35.247.102.9:8333, peer=1849951
15013:25 <sipa> or 0-byte or -1 messages
15113:26 <troygiorshev> Looks like that's spam from some group advertizing their coin
15213:26 <jnewbery> I found none (although I only had a couple of weeks' worth of debug logs)
15313:26 <MarcoFalke> So according to the user agent, it is a Bitcoin Core node
15413:26 <jonasschnelli> MarcoFalke: what does that peer does during its session... can you grep?
15513:26 <MarcoFalke> Normal relay, it was an inv message
15613:26 <jnewbery> sipa: how long does your debug log go back?
15713:27 <sipa> only 15 days
15813:27 <sipa> i have an older one though
15913:28 <sipa> i don't think this means much though - people on broken network infrastructure will see checksum errors and others won't
16013:28 <jnewbery> So it seems that checksum errors are very rare.
16113:28 <jnewbery> Did people see Nicolas Dorier's comment here: https://github.com/bitcoin/bitcoin/pull/15206#issuecomment-460024940. Any thoughts about that?
16213:29 <sipa> jnewbery: i don't think we can conclude that (i agree it's likely the case, but it's hard to know)
16313:29 <emzy> can't find a "CHECKSUM ERROR" on like 5 nodes/
16413:30 <jnewbery> sipa: right, there's at least one cow in scotland, and one side of it appears to be brown
16513:30 <sipa> ...
16613:30 <nehan> jnewbery: gmaxwell's response seemed reasonable
16713:30 <sipa> jnewbery: my point is that you wouldn't see problems if you're on a non-broken network
16813:30 <jnewbery> (https://mathworld.wolfram.com/AtLeastOne.html)
16913:30 <sipa> that doesn't mean it's not a problem for others
17013:31 <sipa> it may be fairly common even, for users we care about, but we wouldn't know as long as we are using good networks
17113:31 <sipa> (especially old home routers...)
17213:31 <lightlike> emzy: all nodes with debug=net enabled?
17313:31 <ariard> and there is no way to measure which nodes are on good-vs-bad network
17413:31 <troygiorshev> should we try and ensure that bitcoin can be used over an unreliable transport layer?
17513:32 <raj_149> I didn't find any in my node..
17613:32 <emzy> lightlike: no.
17713:32 <sipa> troygiorshev: that is a better question, i think
17813:32 <vasild> If I am sitting on a broken network infrastructure, checksum errors will be common for me.
17913:32 <sipa> to what extent can be guarantee reliable operation in such cases
18013:32 <lightlike> emzy: then it wouldn't appear in the logs even if it happened
18113:32 <MarcoFalke> Wouldn't a bad router cause disconnects anyway because the network magic is corrupted? I mean it is less likely because the magic is short compared to the message, but still
18213:32 <sipa> BIP324 would break these nodes anyway
18313:32 <troygiorshev> it certainly seems, uh, romantic at least...
18413:32 <ariard> troygiorshev: shilling this https://github.com/bitcoin/bitcoin/pull/18988 ;)
18513:33 <emzy> lightlike: I see. I will enable it on one of my nodes.
18613:33 <jnewbery> If the network infrastructure is breaking the version message, then those nodes won't be able to connect to peers
18713:33 <sipa> jnewbery: agree
18813:33 <troygiorshev> sipa: good point bringing up BIP324. Maybe it overrides this discussion
18913:33 <sipa> further, the protocol is stateful
19013:34 <sipa> if something happens in the middle of say a tx response, the peer may keep waiting, and time out
19113:34 <jnewbery> so there seems to be no downside to disconnecting peers that have a bad checksum in a version message (since the alternative is that they'll timeout their connection)
19213:34 <troygiorshev> ariard: i love it :D
19313:34 <sipa> jnewbery: why would they time out?
19413:34 <jnewbery> And if the version message is the one most likely to be corrupted by the firewall (which seems likely since it contains IP addresses), then if it's uncorrupted, then there are unlikely to be problems with subsequent messages
19513:35 <jonasschnelli> Good point
19613:35 <jnewbery> Do we not drop a connection if we don't receive a version within a time limit?
19713:35 <sipa> jnewbery: sure
19813:35 <sipa> but there are cases where a checksum failure would not result in any problems at all
19913:35 <jonasschnelli> MarcoFalke: is that checksum error you witnessed in a version message?
20013:35 <ariard> jnewbery: can you still have a bad checksum for version message, i.e switch IP address but this IP address being valid
20113:35 <ariard> for receiver
20213:35 <MarcoFalke> no
20313:35 <sipa> the message would just be ignored and skipped, and that'd be it
20413:35 <jonasschnelli> (Or did you say in a inv)
20513:35 <MarcoFalke> It was in an inv message
20613:36 <jnewbery> so the peer would send us a version with a bad checksum, we'd drop it, and then we'd eventually time out the connection because we never saw a version
20713:36 <troygiorshev> sipa: (Other than a single bit flip in the checksum field I assume?)
20813:36 <ariard> if it's a NATed address it's routable
20913:36 <sipa> jnewbery: if it's in the version message, sure
21013:36 <MarcoFalke> I was connected to the node for days
21113:36 <troygiorshev> jonasschnelli: that's a good point
21213:37 <MarcoFalke> No, only 15 minutes :sweat-smile:
21313:37 <jnewbery> sipa: right that was my point. If the version message is corrupted, then there's no downside to disconnecting
21413:37 <sipa> jnewbery: agree!
21513:37 <jnewbery> right, next question. Are there any other checks that should be moved from net processing into net?
21613:38 <jnewbery> (feel free to continue discussing/asking questions about disconnect. I just want to make sure we have a chance to discuss all the questions before the end of the hour)
21713:38 <troygiorshev> jnewbery: MarcoFalke brought up a great point in the PR that the checksum should be treated the same as the header check and netmagic. Maybe all of them should be moved to net?
21813:39 <jnewbery> troygiorshev: I agree!
21913:40 <MarcoFalke> I think both should live in the same place (whether that is net_processing due to historic reasons or net because we decided that is the better place)
22013:41 <MarcoFalke> But splitting them up felt a bit weird
22113:41 <jnewbery> MarcoFalke: I don't think it's all-or-nothing. There are plenty of examples where things live in the wrong layer because they haven't moved yet
22213:42 <ariard> its far better in net IMO, its really confusing while reviewing bip324 where these values are changed by deserializer and you have to go in net_processing to understand semantic
22313:42 <jonasschnelli> jnewbery: we might want to disconnect nodes on invalid netmagic (v1 only).
22413:42 <pinheadmz> jnewbery how about m_valid_netmagic ?
22513:42 <pinheadmz> ha
22613:42 <jonasschnelli> (in the socket thread)
22713:42 <pinheadmz> i was looking for something like message size etc
22813:42 <troygiorshev> pinheadmz: look at m_valid_header too
22913:42 <jonasschnelli> Isn't #15197 doing that? (netmagic)
23013:43 <jonasschnelli> https://github.com/bitcoin/bitcoin/pull/15197
23113:43 <pinheadmz> troygiorshev ah good call for some reason i saw that and thought block header, but ofc that was wrong contet
23213:44 <jnewbery> jonasschnelli: ah, I hadn't seen 15197. Thanks
23313:45 <MarcoFalke> Is there software that uses the net magic to seek to the next message in the stream?
23413:45 <jonasschnelli> MarcoFalke: that would be super weak
23513:46 <MarcoFalke> (Asking because btcinformation claimed so)
23613:46 <jonasschnelli> what if those 4 bytes match a hash of something sent over the wire?
23713:46 <MarcoFalke> jonasschnelli: Agree
23813:46 <jonasschnelli> we can't help someone doing that
23913:47 <MarcoFalke> I am asking because the meeting notes linked to btcinformation, which suggest this is good practice
24013:47 <MarcoFalke> Someone should probably edit that
24113:47 <troygiorshev> Is it maybe historical?
24213:48 <vasild> The decision whether to disconnect from a peer from whom we receive a bad checksum maybe should be based on how many other checksum errors am I seeing from other peers. For example if I get bad checksums from 1 peer but have plenty of other healthy peers from which I never get a bad checksum, then I would disconnect from the "bad" peer. But if I am getting checksum errors every few minutes
24313:48 <vasild> from e.g 80% of my peers then maybe I wouldn't be so eager to disconnect (from everybody).
24413:48 <jnewbery> MarcoFalke: you've missed out a part of the sentence "used to seek to next message _when stream state is unknown._"
24513:49 <jnewbery> I'm still not sure whether that's true, but it makes a bit more sense
24613:49 <jonasschnelli> vasild: that would be an option. But is it worth the effort?
24713:49 <vasild> dunno :)
24813:49 <MarcoFalke> But why would the stream state be unknown? (We disconnect when the header can't be deserialized)
24913:50 <vasild> jonasschnelli: I think definitely out of the scope of that PR, maybe not worth the effort at all.
25013:50 <michaelfolkson> Agree vasild. The danger here is that majority of your peers are sending you bad checksums and you rotate through iterations of new peers?
25113:50 <felixweis> is it known how much CPU time is spent in SHA256 in a typical bitcoin node during normal operation (not IBD)?
25213:51 <nehan> vasild: hmm this makes me wonder if this changes threatmodels at all. now an attacker could corrupt all incoming headers and force me to disconnect from all nodes
25313:51 <troygiorshev> nehan: this is discussed in the PR
25413:51 <vasild> michaelfolkson: yes, needlessly disconnecting from good peers due to bricked network router, but you also use the same router to connect to new peers...
25513:51 <nehan> but could an attacker with network control essentially do that anyway?
25613:51 <sipa> nehan: that kind of attacker can already just disconnect you
25713:51 <troygiorshev> nehan: yup that's it :)
25813:51 <nehan> troygiorshe: ah, must have missed it. thanks!
25913:51 <MarcoFalke> felixweis: Depends whether you use hardware hashing or do it in software, I guess
26013:51 <jnewbery> felixweis: I'm also curious about this. Does anyone know?
26113:52 <troygiorshev> felixweis: Do you mean in this checksum or SHA256 in general?
26213:52 <sipa> none of the changes in this PR change anything about attack models, as far as i can tell
26313:52 <emzy> There is also an attack vector. If you can insert one TCP packet into the stram with wrong checksum. Then you can disconnect all peers. Seeing from a network level attacker.
26413:52 <sipa> the only question is whether disconnecting honest but broken peers is preferable
26513:52 <jonasschnelli> well... there is a CVE.
26613:52 <felixweis> well all i know is there's a lot of sh256 hashing going, only 2 of them are used to verify the next block header.
26713:53 <michaelfolkson> The scenario where honest peers are sending you bad checksums. A dishonest node identifies this, sends you good checksums and controls all your connections
26813:53 <michaelfolkson> Unlikely but possible?
26913:53 <jnewbery> is the concern about bad firewalls about my local firewall or the remote peer's firewall?
27013:53 <ariard> emzy: network level attacker with such capabilities can just feed invalid blocks and get ban all your peers
27113:54 <ariard> jnewbery: good question, I think both
27213:54 <ariard> like can you know which one is faultive ?
27313:54 <sipa> felixweis: txid and merkle root computation are probably the majority
27413:54 <jnewbery> if it's about the remote's firewall, then I'd expect to see CHECKSUM ERROR messages in all node's debug logs
27513:54 <sipa> and p2p checksums...
27613:54 <pinheadmz> jonasschnelli CVE for what ?
27713:55 <jnewbery> because we'd probably connect to at least one peer with a bad firewall
27813:55 <sipa> jnewbery: it could be rare
27913:55 <emzy> ariard: You only have to hit the right TCP SEQ. number. That's not sp hard anymore.
28013:55 <jonasschnelli> There is a publicly revealed CVE that the PR fixes... but I don't think that CVE has reasonable weight
28113:55 <jnewbery> 5 minutes left!
28213:55 <MarcoFalke> jonasschnelli: Pretty sure anyone can come up with a DOS vector that is more severe and does not depend on the checksum disconnect logic
28313:55 <jonasschnelli> The PR focuses on layering issues and it "also fixes that CVE"
28413:55 <jnewbery> If you've been shy so far, now's your chance to ask your question
28513:56 <ariard> emzy: what do you mean by sp hard ? IIRC there is ban on mutated consensus data, you don't need hashrate for this
28613:57 <pinheadmz> jonasschnelli if possible id love to see the cve
28713:57 <lightlike> emzy, ariard: maybe there are types of attackers who could manage to flip a bit every now and then, but cannot carefully replace whole msgs with constructed ones? Or does that make no sense?
28813:57 <nehan> well, now i want to know about the CVE
28913:57 <troygiorshev> +!
29013:57 <troygiorshev> +1
29113:57 <jonasschnelli> MarcoFalke: probably... I guess the difference on attacking the checksum is, that the attacker doesn't have to calculate a valid SHA256 hash where the victim needs to do for the validation
29213:57 <ariard> jnewbery: it's a private node it may not have that much connections and not being seen that much
29313:57 <emzy> ariard: sorry typo. It's not so hard to hit the right TCP sequence number.
29413:57 <jonasschnelli> pinheadmz: the CVE has been publicly announced by the Bitcoin SV people...
29513:57 <pinheadmz> ha!
29613:57 <jonasschnelli> I'm only revealing it because its already public available on the web
29713:58 <nehan> ah. it's a pretty obvious one
29813:58 <pinheadmz> https://bitcoinsv.io/2019/03/01/denial-of-service-vulnerabilities-repaired-in-bitcoin-sv-version-0-1-1/
29913:59 <jonasschnelli> It is public since a year and it seems that no-one could explot it
30013:59 <sipa> heh, they can just send valid double spending transactions too
30113:59 <nehan> jonasschnelli: why not?
30213:59 <emzy> lightlike: yes, It would make an attack more easy, if you only have to get in the TCP stream and send someting random. Insted of getting the hash right.
30314:00 <jnewbery> That's time. Thanks everyone! Special thanks to jonasschnelli for dropping in.
30414:00 <jonasschnelli> nehan: don't know... because its probably an inefficient attack?
30514:00 <ariard> lightlike: I can't think about real-world infrastructure with this kind of capabilites, see Erebus paper for discussion on infrastructure attacker model
30614:00 <jnewbery> I have a call now so I have to run.
30714:00 <nehan> thanks!
30814:00 <jnewbery> Reminder that we have a BIP 157 special meeting at the same time tomorrow. I'll push the notes and questions for https://github.com/bitcoin/bitcoin/pull/19010 this afternoon.
30914:00 <troygiorshev> thanks jnewbery!
31014:00 <jnewbery> #endmeeting
31114:00 <sipa> jonasschnelli: there are dozens of ways to achieve the same outcome
31214:00 <jonasschnelli> thanks, Thanks for calling me in jnewbery!
31314:00 <pinheadmz> ty all!
31414:00 <jonasschnelli> sipa: Yes. I think so.
31514:00 <emzy> tnx everyone!
31614:00 <lightlike> thanks!
31714:00 <theStack> thanks!
31814:00 <vasild> Thanks everyone!
31914:00 <sipa> jonasschnelli: sending N bytes to cause the victim to waste N bytes of network + N bytes of hashing...
32014:01 <felixweis> thanks everyone!
32114:01 <pinheadmz> sipa jonasschnelli you mean just sending any type of invalid mesasge
32214:01 <jonasschnelli> indeed... yeah
32314:01 <thomasb06> thanks!
32414:01 <thomasb06>
32514:01 <pinheadmz> is as severe as bad checksum
32614:01 <sipa> pinheadmz: right
32714:01 <sipa> that too
32814:01 <pinheadmz> i.e. a ddos where entire 4MB block is valid.... except the last tx
32914:01 <nehan> pinheadz: that requires a bunch of pow
33014:01 <pinheadmz> ah good point
33114:01 <jonasschnelli> but for an invalid message (to pass the checksum test), the attacker needs a valid sha256 hash?
33214:01 <pinheadmz> well then a TX max-size where the last sig is invalid
33314:02 <MarcoFalke> jonasschnelli: Only needs to calculate once
33414:02 <jonasschnelli> But I guess you can send invalid blocks to make the victim hash-check that
33514:02 <nehan> don't nodes drop peers that send a lot of invalid messages/
33614:02 <jonasschnelli> MarcoFalke: yeah. We don't ignore dups
33714:02 <jonasschnelli> I guess that CVE is total bullshit
33814:02 <MarcoFalke> right
33914:02 <pinheadmz> even if its not severe, I do appreciate that altcoins look at the code from a different perspective and have an opportunity to contribute back to bitcoin
34014:03 <jonasschnelli> However, the optimization in layering (checksum belong to message transport and not to message processing) is real
34114:03 <jonasschnelli> I guess the SV people where happy they could create some CVE at all. :)
34214:04 <raj_149> Curious on estimate of performance optimization by this layering.
34314:05 <jonasschnelli> raj_149: near to zero probably
34414:06 <emzy> Again. I think the checksum prevents that an attacker could inject just one packet into the TCP stream to ban a node. I'ts far to easy to spoof the IP address of a peer and flood with TCP packets (random SEQ. number) the target node.
34514:07 <emzy> Would be better to just dissconnet the node not ban it.
34614:08 <sipa> who says anything about banning?
34714:08 <raj_149> Are we also banning the node along with disconnection?
34814:08 <sipa> no
34914:08 <raj_149> Ya. Right.
35014:08 <emzy> I think I got it wrong. It is only disconnect.
35114:08 <emzy> Sorry, that's fine then.
35214:10 <raj_149> Also is it possible to apply some kind of heuristics to assertain resource draining intention from a node? May be that way we can safeguard against disconecting honest nodes just because of bad transport?
35314:11 <sipa> raj_149: that's hard, as it's probably a cat and mouse game
35414:11 <sipa> there are certainly some behaviours that are easily detectable, but if someone really wants to exploit things, there are dozens of ways
35514:12 <sipa> i believe the real solution is just keeping track of resources used on behalf of every peer, and slow down/disconnect the worst offenders if you run low
35614:12 <desolate> raj_149 does the heuristic determine how much resources the attack is draining on the heuristic function? :P
35714:12 <sipa> with perhaps exceptions like giving you a valid block
35814:20 <desolate> I second sipa's KISS approach when receiving incompatible data: "turtle" strategy rather than attempting to build a highly optimized, hopefully omnipotent strategy