Erlay support signaling (p2p)

Dec 8, 2021

https://github.com/bitcoin/bitcoin/pull/23443

Host: naumenkogs - PR author: naumenkogs

The PR branch HEAD was ea3b87af90 at the time of this review club meeting.

Notes

Erlay is a proposed transaction relay protocol change to reduce the bandwidth used to announce transactions.
- We looked at Erlay in a previous review club.
- For this PR, it might make sense to overview the full PR again (note that this is a different PR number from what’s above because that one was closed), and the updated BIP. Don’t spend too much time on it though, the overall understanding of the protocol is sufficient.
PR #23443 contains the first batch of commits enabling Erlay:
- A node becomes able to negotiate the support for Erlay with another node by sending/receiving a new p2p message.
- Once the handshake is done, the node also initializes the “reconciliation state”, a variable to keep track of ongoing reconciliations with a particular peer.
- If the peer is disconnected, the corresponding reconciliation state should be cleared.
We moved forward with this PR once minisketch was merged, although minisketch is not required for this particular Erlay sub-PR. Another dependency was PR #18044: “Use wtxid for transaction relay”. We have discussed minisketch in previous review clubs, python part 1, python part 2, and C++ implementation.

Questions

Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK?
What are the benefits of splitting PRs into smaller parts? Are there any drawbacks to this approach? How might you review sub-PRs differently?
When are the nodes supposed to announce Erlay support? What should they consider before doing so and in which cases would it be clearly meaningless?
What is the overall handshake and “registration for reconciliation” protocol flow?
Why is the version field of sendrecon useful? Under which conditions it should be bumped?
What is the reason for generating reconciliation salt (m_local_salts)? How is it generated and why?
This PR adds 4 new conditions under which nodes will disconnect a peer. What are they? When does it make sense to disconnect a peer rather than just ignore the message?
What does TxReconciliationTracker::Impl::m_mutex guard and when does the lock need to be held?
Why might we use a std::unordered_map instead of a std::map to track salts and states?

Meeting Log

17:00

<gleb7> #startmeeting

17:00

<gleb7> hi?

17:00

<glozow> hi!

17:00

<jnewbery> hi!

17:00

<arythmetic> hi

17:00

<stickies-v> oi

17:00

<ccdle12> hi

17:00

<dergoegge> hi

17:00

<svav> Hi

17:00

<effexzi> Hi

17:00

<firsttimeprrevie> hi

17:00

<lightlike> hi

17:00

<larryruane> Hi

17:01

<gleb7> i don't know half of the attendees, cool! I also saw some new names already reviewing the PR

17:01

<gleb7> Today we gonna talk about erlay, a long-lasting project :)

17:01

<gleb7> More specifically, about a first chunk of it (apart from merging minisketch library to the repo). THis is a first p2p-level part

17:02

<gleb7> First usual question

17:02

<gleb7> Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK?

17:02

<stickies-v> Concept ACK, hooray for increased connectivity!

17:03

<glozow> y, though i realized i should also zoom out and look at the overall PR + try running an erlay node and stuff

17:03

<arythmetic> No, just lurking today :)

17:03

<jnewbery> glozow: I agree. It's hard to concept ACK/aproach ACK without looking at the full changeset

17:03

<gleb7> yeah, looking at the overall PR is a good idea for someone interested in this stuff. Original PR also links to 2 issues i made about understanding/measuring performance of this, which should help

17:04

<glozow> yeah, the stats look impressive and the measurement process thorough, but i'm trusting not verifying 😱

17:05

<gleb7> I tried to make reproducing easy, and at least couple people re-verified it, sooo :)

17:05

<gleb7> Looking forward for your verification too

17:05

<gleb7> For me making/reviewing prs splitted in parts is something new, and I don't think we have much experience with that in general (i recall a handfull of projects going through this)

17:06

<gleb7> Which leads us to the next set question, which is probably my favorite

17:06

<gleb7> 2a. What are the benefits of splitting PRs into smaller parts?

17:07

<gleb7> For the context, Erlay PR is 40-something commits, which is definitely more than average, and line-wise as well

17:07

<stickies-v> It makes reviewing much easier. Good and atomic commits already help a lot with this, but it's nice if you can fit the entire PR in your mind model, so ideally it's not too large

17:08

<glozow> i think people review more thoroughly when the PR is smaller (probably not a good thing but seems so)

17:08

<dergoegge> i think github is pretty bad for code review on large change sets with multiple people reviewing different parts, splitting it up makes it easier to have focused review

17:09

<gleb7> Yeah, so all these mean that we're probably gonna get less bugs, better code quality, and probably faster than all-at-once approach

17:09

<glozow> dergoegge: +1, if you have more than 150 comments it just becomes impossible

17:09

<stickies-v> It also reduces code going stale since the non-controversial code can be merged more quickly while the more controversial code is discussed in another PR

17:10

<gleb7> My first version was 7 commits I think and oh god it was hard to navigate even by myself, we already went through a lot of structural changes since then. Now we're here

17:10

<jnewbery> I think one downside is that if you're not reviewing the full change set, then you're slightly trusting the author that they're taking you in the right direction

17:10

<gleb7> Yeah, this is the next question :) 2b Are there any drawbacks to this approach?

17:11

<gleb7> jnewbery: this is indeed an issue. Not exactly for this PR, where it's a pretty general thing with not so many design choices.

17:11

<glozow> it might be weird if the sub-PRs don't do much by themselves. lightlike had a good comment about the fact that, if a release was branched off from this PR, nodes would be sending each other meaningless SENDRECONs

17:11

<gleb7> But I believe the next couple PRs would require more context from future chunks indeed.

17:12

<gleb7> glozow: I spent some time thinking about this too. I think this is first time we're merging a p2p message which is useless on its own.

17:13

<jnewbery> eg Carl split https://github.com/bitcoin/bitcoin/pull/20158 into many smaller PRs. Either you do a concept review of the full PR, or you're trusting that Carl is taking us in the right direction and just review the mechanical code changes in the sub-PRs

17:14

<lightlike> I think it's important to not only look at the current PR chunk, but also at the whole thing (simulations/mainnet testing) to get to a concept ACK.

17:16

<gleb7> Yeah, and making the whole thing accessible for reviewers is the whole other challenge. I hope, participants of this PR will have time to look at the big thing too :)

17:16

<glozow> not to shill my own bags but i've been trying to split the package mempool accept PRs into chunks where each one might be useful, e.g. package testmempoolaccept, rbf improvements, etc. i'm not sure if people like that or if it just seems disconnected but 🤷

17:17

<gleb7> glozow: i'll be honest, when i review your prs, I usually trust you and other reviewers w.r.t the big direction

17:18

<gleb7> Making sure it doesn't break anything terribly of course. They are usually just good things on them own, like, decoupling and stuff

17:18

<gleb7> Okay, let's move on to the actual code changes

17:18

<gleb7> When are the nodes supposed to announce Erlay support? What should they consider before doing so and in which cases would it be clearly meaningless

17:18

<gleb7> This is something we've been discussing in the PR already.

17:19

<stickies-v> We should consider if the peer is blocks-only, and if it supports wtxid relay. If both aren not fulfilled, SENDRECON should not be sent.

17:19

<gleb7> stickies-v: right. do you know why the wtxid relay thing matter?

17:20

<gleb7> anyone is welcome to answer too :)

17:20

<stickies-v> because sketches are based on the wtxids of transactions

17:21

<gleb7> exactly, this is something covered in a BIP. This will be used later on in Erlay commits, but for now it was sufficient to compare the code to BIP.

17:21

<gleb7> There is a dependency on wtxid relay protocol.

17:22

<larryruane> quick side-note, in case anyone wants to run the unit test, you'll need to add `test/txreconciliation_tests.cpp` to `src/Makefile.test.include`, and also change the unit test to `#include <node/txreconciliation.h>` .... was this intentional, that the unit test isn't being built yet?

17:22

<gleb7> larryruane: not at all, good catch.

17:22

<larryruane> (i'll make a comment on the PR)

17:23

<gleb7> Thank you! I was probably moving or renaming something and forgot to do this.

17:23

<gleb7> Next question. What is the overall handshake and “registration for reconciliation” protocol flow?

17:25

<svav> First, send a new p2p message sendrecon

17:25

<larryruane> after (existing) version handshake, then the peers have a defined order for trading recon salts (?), depending on who is inbound and who is outbound

17:25

<stickies-v> In the VERSION message, each peer can include some fields to indicate recon support. If such a VERSION message from a peer is received, we PreRegister the peer and send out a SENDRECON message. If we receive a SENDRECON message back from the peer and everything is valid, we initialize the reconciliation state for the peer.

17:25

<larryruane> whoever is outbound goes first

17:26

<gleb7> larryruane: hmmm, why do you think so?

17:27

<larryruane> did I get it wrong, I was looking at https://github.com/bitcoin/bitcoin/pull/23443/commits/b55cbf63e15766bdabcac1c08b4cbfea0badeb04#diff-62e6a7c4c23e68b88bd585db25bb0a10e6ccda2d7fff2f05769bc0a1ad81dcdcR38

17:27

<gleb7> Okay, good to know this is somewhat confusing.

17:28

<larryruane> :) note i'm easily confused

17:28

<gleb7> The 'we_initiate_recon' variable refers to actual transaction reconciliations in the future.

17:28

<gleb7> Not the handshake SENDRECON stuff we're dealing with here.

17:29

<gleb7> No, this is perfect I think. If you're confused, someone else will also be. And this code should be understandable even without my presence so yeah, I probably should check how good are those comments.

17:29

<lightlike> stickies-v: I don't think anything is changed in the VERSION message to explicitely indicate recon support.

17:30

<stickies-v> lightlike oh you're right!

17:31

<stickies-v> that also explains the confusion I had about one of your comments regarding peers getting disconnected about sending a SENDRECON message unexpectedly

17:31

<gleb7> The PR *used to* bump the general Bitcoin p2p protocol version (maybe a month ago), but I dropped this. Does anyone know why?

17:32

<gleb7> I'm referring to `P2P_VERSION = 70016`

17:33

<stickies-v> I'm not quite familiar with this, but could it be that an increased p2p protocol version would mean Erlay nodes won't communicate with pre-Erlay nodes anymore, and there's no real need to do that?

17:33

<lightlike> maybe because it is not really necessary for things to work?

17:34

<glozow> when do we change protocol version? when there'd be a compatibility issue?

17:35

<gleb77> Last time we changed the protocol version for wtxids

17:35

<lightlike> I think they would still be ale communicate, just via the old flooding mechanism (because the SENDRECON messages would be tied to the new protocol version).

17:35

<lightlike> *able to communicate

17:36

<jnewbery> previous protocol versions: https://github.com/bitcoin/bitcoin/blob/926fc2a0d4ff64cf2ff8e1dfa64eca2ebd24e090/src/version.h#L1-L39

17:37

<glozow> jnewbery: ah thanks

17:37

<gleb77> I actually don't have a good answer for this. At some point i just realized we don't need to bump it, so i dropped it.

17:38

<gleb77> There is clearly no issues of Erlay peers talking to non-Erlay peers.

17:38

<jnewbery> Generally, I don't think we need a new protocol version if we're just adding new optional message types. Nodes should ignore messages that they don't know the meaning of (although this hasn't always been the case for alternative implementations)

17:39

<gleb77> Yeah, this now goes out of scope of this PR. Probably deserves a blogpost, about protocol version and also service bits and messages like SENDRECON/SENDCMCPT

17:39

<gleb77> Let's move on

17:39

<gleb77> What is the reason for generating reconciliation salt (m_local_salts)? How is it generated and why?

17:40

<sipa> hi!

17:41

<stickies-v> Since we use short IDs instead of the full ID, multiple tx's could have the same short ID. An attacker could calculate such colliding transactions, but by having each peer calculate their own salt (and used in calculating the short IDs), these transactions would only collide on that specific link and not with the entire network

17:41

<glozow> also you could possibly tell which nodes are the same node if they use the same salt

17:41

<glozow> same node on different networks*

17:42

<svav> Before sending SENDRECON, a node is supposed to "pre-register" the peer by generating and storing an associated reconciliation salt component.

17:42

<sipa> salts are per connection (and side on that connection), not per node, right?

100

17:42

<glozow> yea

101

17:43

<stickies-v> It's generated via GetRand(UINT64_MAX)

102

17:43

<gleb77> All correct answers :)

103

17:44

<gleb77> We have 2 reasons for making salts per-connection, not per-node

104

17:44

<stickies-v> gleb77 as to your 'why?' part, I seem to remember we have multiple Rand functions, some faster than others. Is GetRand the fastest one we can use? I'd think it doesn't need to be very secure?

105

17:44

<lightlike> both sides contribute 1/2 of the salt and they order and combine them, so they end up with the same salt

106

17:45

<gleb77> stickies-v: honestly, I didn't think much about the exact random function... i think we use GetRand in a bunch of other places, and this is the number of bits i needed so i used it

107

17:45

<gleb77> My question was more about combining the salts, etc.

108

17:45

<lightlike> speed probably doesn't matter, it's only done once per connection.

109

17:46

<gleb77> Can anyone guess what would be beneficial of having per-node salts (as opposed to per-connection we do)

110

17:46

<jnewbery> Very good code comment about our different random functions: https://github.com/bitcoin/bitcoin/blob/926fc2a0d4ff64cf2ff8e1dfa64eca2ebd24e090/src/random.h#L18-L58

111

17:46

<gleb77> jnewbery: thank you, i wasn't aware

112

17:46

<glozow> reusing salts means u can cache stuff i guess

113

17:47

<stickies-v> Hmm but `m_local_salts` from your question doesn't combine anything yet right? That only happens in RegisterPeer when we're updating m_states?

114

17:47

<gleb77> glozow: maybe, but there's a really big reason

115

17:47

<gleb77> stickies-v: yeah, hence the 'local' part of the name :)

116

17:47

<jnewbery> stickies-v: right, the local salt is generated when sending the `sendrecon`, and then combined with the remoate salt when receiving the `sendrecon`

117

17:48

<gleb77> Okay, so having per-connection salt is a little problematic in terms of the efficiency gains.

118

17:48

<gleb77> Because nodes receiving announcements can't de-duplicate same transactions (they are hashed with different salts)

119

17:49

<gleb77> So we ended up sending 1 extra full WTXID message. If we had per-node salt, they would be able to de-duplicate and send/request TX right away after short id

120

17:49

<gleb77> You will think about this challenge we had to solve after reviewing later parts of Erlay I guess :)

121

17:50

<gleb77> Let's jump right to question 9

122

17:50

<gleb77> Why might we use a std::unordered_map instead of a std::map to track salts and states?

123

17:52

<larryruane> more efficient

124

17:52

<larryruane> (in time, and probably in space)

125

17:52

<glozow> std::map is really an ordered_map, and we don't need the map to be ordered. afaik also faster lookup times

126

17:53

<gleb77> larryruane: I guess this answers "why we want?", second part of the question :)

127

17:53

<larryruane> glozow: right, map lookup is O(log(n)), unordered map is O(1)

128

17:54

<gleb77> Yeah, the benefits of reads and writes are massive.

129

17:54

<gleb77> Probably not that a big deal for couple hundred of records we will have here, but still, there's literally no disadvantage I think in this case, right?

130

17:56

<stickies-v> I did some quick googling and from that I gather that unordered_map uses more memory than map, but I honestly woulnd't know haha

131

17:56

<stickies-v> https://thispointer.com/map-vs-unordered_map-when-to-choose-one-over-another/

132

17:56

<larryruane> my understanding is, many of our existing maps could be unordered_maps, but unordered maps weren't added (to std at least) until after a bunch of the core code had been written

133

17:56

<gleb77> stickies-v: cool, something i indeed have to refresh in my memory

134

17:57

<gleb77> okay, i think this is a good chill note to wrap up the meeting

135

17:58

<gleb77> I hope most of us got a good impression of the PR, and got inspired about different aspects of it to go ahead and review :)

136

17:59

<jnewbery> larryruane: right, I think std::unordered_map was only added in c++11

137

17:59

<larryruane> thank you gleb77! this was great!

138

17:59

<stickies-v> gleb77 if you still have a bit of time, could you just give some intuition about what happens when we have 2 tx's with colliding short IDs? to make it simple, assuming this happens accidentally and not through an attack

139

17:59

<glozow> thanks gleb77 :)

140

17:59

<lightlike> thank you gleb77 !

141

17:59

<gleb77> thank you guys, see you on github :)

142

17:59

<dergoegge> thank you gleb77!

143

18:00

<svav> Thanks gleb77

144

18:00

<lightlike> gleb77: btw, are your mainnet erlay nodes up? I tried to connect earlier but wasn't able to.

145

18:00

<gleb77> stickies-v: In the worst case (when they arrive at both nodes right before reconciliation or something like that), they will be "cancelled out" in reconciliation. Those particular colliding transactions won't be exchanged. No other txs or state affected.

146

18:01

<gleb77> lightlike: I probably have to restart them, I will do that tomorrow. Also, feel free to comment in the corresponding testing issue with your IPs, if you feel so

147

18:02

<gleb77> lightlike: https://github.com/naumenkogs/txrelaysim/issues/8

148

18:02

<glozow> gleb77: seems not too bad. hopefully, both nodes will just get those transactions from other peers with different salts

149

18:02

<glozow> #endmeeting