The PR branch HEAD was 1d9bc6a8c at the time of this review club meeting.
Note: an earlier version of this PR moved hash finalization from the Message
Handler thread to the Socket Handler thread. That actually happened in PR
16202, so you can ignore
discussion about moving the hash calculation.
Notes
Peer-to-peer message headers include a checksum for message integrity. If
any of the message gets corrupted, then the calculated checksum from the
payload won’t match the checksum in the header. See the developer
notes for
full details of the message header format.
The checksum is the first four bytes of the double-SHA256 of the message
payload.
SHA256() is a Merkle–Damgård hash
function.
In this hash function construction, the pre-image is split into blocks, which
are fed one-by-one into a compression function.
Because of the serial way that the hash is constructed, we can do some of the
hashing work in the Socket Handler thread as we receive the bytes of the
wire. See the readData()
function, which calls into CSHA256::Write().
Prior to PR 16202,
GetMessageHash() was called by ProcessMessages(), which is executed by
the Message Handler thread. After that PR, the hash finalization is done
in the Socket Hander thread, and the Message Handler thread just checks
the saved hash data member.
If the calculated checksum does not match the checksum in the message header,
then the message is dropped.
This PR proposes removing any checksum logic from net_processing, and moving
it to net. It also proposes immediately disconnecting peers that send us
a message with a bad checksum.
<jnewbery> Normal reminders: all questions are welcome! If you're confused about something, other people probably are too, so ask! No need to ask to ask, just ask!
<troygiorshev> Concept ACK. Two parts: 1) Move checksum check to net layer, 2) change behavior upon check failure to disconnecting the peer, from ignoring the message
<michaelfolkson> I think some of the review comments touched upon this but what are the likely reasons for sending an incorrect checksum (other than maliciousness)?
<jnewbery> pinheadmz: I just saw it was rebased recently and started receiving some review attention and thought net/net_processing separation would be interesting to talk about at review club
<michaelfolkson> vasild: Which could all be honest failures right? We're disconnecting for our own benefit as a node rather than because we think the incorrect checksum was sent due to maliciousness
<vasild> I mean we can rely that TCP will detect and re-transmit corrupted data. So there is no way to get corrupted data due to a malfuctioning router in between?
<lightlike> While reviewing I wondered if the changed disconnect behavior is a consequence of a change done mostly for architectural reasons (not so easy to imitate the old behavior after moving from net_processing to net) or something desirable in itself.
<jnewbery> nehan theStack: in the original version of this PR, I'd agree with you. I expect there was a performance benefit from moving the finalize operation to the socket handler thread, since the message handler thread is our bottleneck. Since 16202 was merged, there isn't a performance benefit ...
<troygiorshev> nehan: The header wraps around some data. (Just like how, say, an ethernet header wraps around some data). Ideally you do the checksums, check the header, and then hand the data to the layer above you. Then that layer doesn't have to worry about anythign
<sipa> in particular when facing BIP324, where the checksum because dependent on the transport used... it's strange that the network processing layer would even know which transport is used
<lightlike> the PR description gives another reason: It would be desirable to implement alternative transport layer protocols such as BIP155 in net, without having to change anything in net_processing
<jnewbery> The distinction on our software is not perfect. Take a look a CNode in net.h and CNodeState in net_processing.cpp. Ideally, CNode would just be about the connection to another node and CNodeState would just be about application-level stuff (ie inventory of transactions and blocks, address gossip, etc)
<jnewbery> Greg's point here: https://github.com/bitcoin/bitcoin/pull/15206#issuecomment-460082894 is about firewalls trying to be application aware, and messing around with IP addresses inside application messages. That sometimes happens when firewalls are trying to deal with things like NAT traversal.
<jnewbery> In a previous life I worked in telecoms, and we used to have lots of similar issues, where firewalls would mess around with IP addresses inside SIP and SDP messages and things would stop working.
<gzhao408> jnewbery When you say that "message handler thread is our bottleneck" it seems like we want to protect that thread from DoS/wasted resources. Is this a concern here/in general?
<jnewbery> but yes, here is a good example. In general, like nehan and theStack said earlier, it's best to deal with failures and bad messages as early and as low in the stack as possible.
<theStack> would a proper analogy for net vs. net_processing be be ethernet (layer 2) vs ip (layer 3)? the ethernet frame has a checksum, which is also checked on layer 2, and layer3 never gets to see it
<jnewbery> gzhao408: probably not. This isn't a big dos concern. We always have to do the checksum calculations, so a node trying to waste our resources gains no advantage by sending a message with a bad checksum.
<troygiorshev> theStack: I think so. An interesting parellel is that just as the ethernet header contains a section saying "IP", the bitcoin header contains a sections specifying the command name
<sipa> felixweis: the general solution to resource wasting attack is just keep track of how much resources every peer is using; if things get tight, slow down processing of the worst one
<MarcoFalke> net_processing is similar to the rpc server I guess. They both can send data structures to validation, and they both read raw bytes from a socket or so
<MarcoFalke> Wouldn't a bad router cause disconnects anyway because the network magic is corrupted? I mean it is less likely because the magic is short compared to the message, but still
<jnewbery> so there seems to be no downside to disconnecting peers that have a bad checksum in a version message (since the alternative is that they'll timeout their connection)
<jnewbery> And if the version message is the one most likely to be corrupted by the firewall (which seems likely since it contains IP addresses), then if it's uncorrupted, then there are unlikely to be problems with subsequent messages
<jnewbery> so the peer would send us a version with a bad checksum, we'd drop it, and then we'd eventually time out the connection because we never saw a version
<jnewbery> (feel free to continue discussing/asking questions about disconnect. I just want to make sure we have a chance to discuss all the questions before the end of the hour)
<troygiorshev> jnewbery: MarcoFalke brought up a great point in the PR that the checksum should be treated the same as the header check and netmagic. Maybe all of them should be moved to net?
<MarcoFalke> I think both should live in the same place (whether that is net_processing due to historic reasons or net because we decided that is the better place)
<jnewbery> MarcoFalke: I don't think it's all-or-nothing. There are plenty of examples where things live in the wrong layer because they haven't moved yet
<ariard> its far better in net IMO, its really confusing while reviewing bip324 where these values are changed by deserializer and you have to go in net_processing to understand semantic
<vasild> The decision whether to disconnect from a peer from whom we receive a bad checksum maybe should be based on how many other checksum errors am I seeing from other peers. For example if I get bad checksums from 1 peer but have plenty of other healthy peers from which I never get a bad checksum, then I would disconnect from the "bad" peer. But if I am getting checksum errors every few minutes
<michaelfolkson> Agree vasild. The danger here is that majority of your peers are sending you bad checksums and you rotate through iterations of new peers?
<nehan> vasild: hmm this makes me wonder if this changes threatmodels at all. now an attacker could corrupt all incoming headers and force me to disconnect from all nodes
<vasild> michaelfolkson: yes, needlessly disconnecting from good peers due to bricked network router, but you also use the same router to connect to new peers...
<emzy> There is also an attack vector. If you can insert one TCP packet into the stram with wrong checksum. Then you can disconnect all peers. Seeing from a network level attacker.
<michaelfolkson> The scenario where honest peers are sending you bad checksums. A dishonest node identifies this, sends you good checksums and controls all your connections
<lightlike> emzy, ariard: maybe there are types of attackers who could manage to flip a bit every now and then, but cannot carefully replace whole msgs with constructed ones? Or does that make no sense?
<jonasschnelli> MarcoFalke: probably... I guess the difference on attacking the checksum is, that the attacker doesn't have to calculate a valid SHA256 hash where the victim needs to do for the validation
<emzy> lightlike: yes, It would make an attack more easy, if you only have to get in the TCP stream and send someting random. Insted of getting the hash right.
<ariard> lightlike: I can't think about real-world infrastructure with this kind of capabilites, see Erebus paper for discussion on infrastructure attacker model
<jnewbery> Reminder that we have a BIP 157 special meeting at the same time tomorrow. I'll push the notes and questions for https://github.com/bitcoin/bitcoin/pull/19010 this afternoon.
<pinheadmz> even if its not severe, I do appreciate that altcoins look at the code from a different perspective and have an opportunity to contribute back to bitcoin
<emzy> Again. I think the checksum prevents that an attacker could inject just one packet into the TCP stream to ban a node. I'ts far to easy to spoof the IP address of a peer and flood with TCP packets (random SEQ. number) the target node.
<raj_149> Also is it possible to apply some kind of heuristics to assertain resource draining intention from a node? May be that way we can safeguard against disconecting honest nodes just because of bad transport?
<sipa> i believe the real solution is just keeping track of resources used on behalf of every peer, and slow down/disconnect the worst offenders if you run low
<desolate> I second sipa's KISS approach when receiving incompatible data: "turtle" strategy rather than attempting to build a highly optimized, hopefully omnipotent strategy