Erlay support signaling (
p2p) Dec 8, 2021
The PR branch HEAD was ea3b87af90 at the time of this review club meeting.
Erlay is a proposed transaction relay protocol change to reduce the bandwidth used to
We looked at Erlay in
a previous review club.
For this PR, it might make sense to overview
the full PR again
(note that this is a different PR number from what’s above because that one was closed), and
the updated BIP.
Don’t spend too much time on it though, the overall understanding of the protocol is sufficient.
PR #23443 contains the first batch of commits
A node becomes able to negotiate the support for Erlay with another node by sending/receiving
a new p2p message.
Once the handshake is done, the node also initializes the “reconciliation state”, a variable to
keep track of ongoing reconciliations with a particular peer.
If the peer is disconnected, the corresponding reconciliation state should be cleared.
We moved forward with this PR once
merged, although minisketch is not required for this particular Erlay sub-PR. Another dependency
was PR #18044: “Use wtxid for transaction
relay”. We have discussed minisketch in
previous review clubs, python part 1, python part
2, and C++ implementation. Questions
Did you review the PR?
Concept ACK, approach ACK, tested ACK, or NACK?
What are the benefits of splitting PRs into smaller parts? Are there any drawbacks to this
approach? How might you review sub-PRs differently?
When are the nodes supposed to announce Erlay support? What should they consider before doing
so and in which cases would it be clearly meaningless?
What is the overall handshake and “registration for reconciliation” protocol flow?
Why is the
version field of
useful? Under which conditions it should be bumped?
What is the reason for generating reconciliation salt (
m_local_salts)? How is it generated
This PR adds 4 new conditions under which nodes will disconnect a peer. What are they?
When does it make sense to disconnect a peer rather than just ignore the message?
TxReconciliationTracker::Impl::m_mutex guard and when does the lock need to be held?
Why might we use a
std::unordered_map instead of a
std::map to track salts and states?
1 17:00 <gleb7> #startmeeting
11 17:00 <firsttimeprrevie> hi
14 17:01 <gleb7> i don't know half of the attendees, cool! I also saw some new names already reviewing the PR
15 17:01 <gleb7> Today we gonna talk about erlay, a long-lasting project :)
16 17:01 <gleb7> More specifically, about a first chunk of it (apart from merging minisketch library to the repo). THis is a first p2p-level part
17 17:02 <gleb7> First usual question
18 17:02 <gleb7> Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK?
19 17:02 <stickies-v> Concept ACK, hooray for increased connectivity!
20 17:03 <glozow> y, though i realized i should also zoom out and look at the overall PR + try running an erlay node and stuff
21 17:03 <arythmetic> No, just lurking today :)
22 17:03 <jnewbery> glozow: I agree. It's hard to concept ACK/aproach ACK without looking at the full changeset
23 17:03 <gleb7> yeah, looking at the overall PR is a good idea for someone interested in this stuff. Original PR also links to 2 issues i made about understanding/measuring performance of this, which should help
24 17:04 <glozow> yeah, the stats look impressive and the measurement process thorough, but i'm trusting not verifying 😱
25 17:05 <gleb7> I tried to make reproducing easy, and at least couple people re-verified it, sooo :)
26 17:05 <gleb7> Looking forward for your verification too
27 17:05 <gleb7> For me making/reviewing prs splitted in parts is something new, and I don't think we have much experience with that in general (i recall a handfull of projects going through this)
28 17:06 <gleb7> Which leads us to the next set question, which is probably my favorite
29 17:06 <gleb7> 2a. What are the benefits of splitting PRs into smaller parts?
30 17:07 <gleb7> For the context, Erlay PR is 40-something commits, which is definitely more than average, and line-wise as well
31 17:07 <stickies-v> It makes reviewing much easier. Good and atomic commits already help a lot with this, but it's nice if you can fit the entire PR in your mind model, so ideally it's not too large
32 17:08 <glozow> i think people review more thoroughly when the PR is smaller (probably not a good thing but seems so)
33 17:08 <dergoegge> i think github is pretty bad for code review on large change sets with multiple people reviewing different parts, splitting it up makes it easier to have focused review
34 17:09 <gleb7> Yeah, so all these mean that we're probably gonna get less bugs, better code quality, and probably faster than all-at-once approach
35 17:09 <glozow> dergoegge: +1, if you have more than 150 comments it just becomes impossible
36 17:09 <stickies-v> It also reduces code going stale since the non-controversial code can be merged more quickly while the more controversial code is discussed in another PR
37 17:10 <gleb7> My first version was 7 commits I think and oh god it was hard to navigate even by myself, we already went through a lot of structural changes since then. Now we're here
38 17:10 <jnewbery> I think one downside is that if you're not reviewing the full change set, then you're slightly trusting the author that they're taking you in the right direction
39 17:10 <gleb7> Yeah, this is the next question :) 2b Are there any drawbacks to this approach?
40 17:11 <gleb7> jnewbery: this is indeed an issue. Not exactly for this PR, where it's a pretty general thing with not so many design choices.
41 17:11 <glozow> it might be weird if the sub-PRs don't do much by themselves. lightlike had a good comment about the fact that, if a release was branched off from this PR, nodes would be sending each other meaningless SENDRECONs
42 17:11 <gleb7> But I believe the next couple PRs would require more context from future chunks indeed.
43 17:12 <gleb7> glozow: I spent some time thinking about this too. I think this is first time we're merging a p2p message which is useless on its own.
44 17:13 <jnewbery> eg Carl split https://github.com/bitcoin/bitcoin/pull/20158 into many smaller PRs. Either you do a concept review of the full PR, or you're trusting that Carl is taking us in the right direction and just review the mechanical code changes in the sub-PRs
45 17:14 <lightlike> I think it's important to not only look at the current PR chunk, but also at the whole thing (simulations/mainnet testing) to get to a concept ACK.
46 17:16 <gleb7> Yeah, and making the whole thing accessible for reviewers is the whole other challenge. I hope, participants of this PR will have time to look at the big thing too :)
47 17:16 <glozow> not to shill my own bags but i've been trying to split the package mempool accept PRs into chunks where each one might be useful, e.g. package testmempoolaccept, rbf improvements, etc. i'm not sure if people like that or if it just seems disconnected but 🤷
48 17:17 <gleb7> glozow: i'll be honest, when i review your prs, I usually trust you and other reviewers w.r.t the big direction
49 17:18 <gleb7> Making sure it doesn't break anything terribly of course. They are usually just good things on them own, like, decoupling and stuff
50 17:18 <gleb7> Okay, let's move on to the actual code changes
51 17:18 <gleb7> When are the nodes supposed to announce Erlay support? What should they consider before doing so and in which cases would it be clearly meaningless
52 17:18 <gleb7> This is something we've been discussing in the PR already.
53 17:19 <stickies-v> We should consider if the peer is blocks-only, and if it supports wtxid relay. If both aren not fulfilled, SENDRECON should not be sent.
54 17:19 <gleb7> stickies-v: right. do you know why the wtxid relay thing matter?
55 17:20 <gleb7> anyone is welcome to answer too :)
56 17:20 <stickies-v> because sketches are based on the wtxids of transactions
57 17:21 <gleb7> exactly, this is something covered in a BIP. This will be used later on in Erlay commits, but for now it was sufficient to compare the code to BIP.
58 17:21 <gleb7> There is a dependency on wtxid relay protocol.
59 17:22 <larryruane> quick side-note, in case anyone wants to run the unit test, you'll need to add `test/txreconciliation_tests.cpp` to `src/Makefile.test.include`, and also change the unit test to `#include <node/txreconciliation.h>` .... was this intentional, that the unit test isn't being built yet?
60 17:22 <gleb7> larryruane: not at all, good catch.
61 17:22 <larryruane> (i'll make a comment on the PR)
62 17:23 <gleb7> Thank you! I was probably moving or renaming something and forgot to do this.
63 17:23 <gleb7> Next question. What is the overall handshake and “registration for reconciliation” protocol flow?
64 17:25 <svav> First, send a new p2p message sendrecon
65 17:25 <larryruane> after (existing) version handshake, then the peers have a defined order for trading recon salts (?), depending on who is inbound and who is outbound
66 17:25 <stickies-v> In the VERSION message, each peer can include some fields to indicate recon support. If such a VERSION message from a peer is received, we PreRegister the peer and send out a SENDRECON message. If we receive a SENDRECON message back from the peer and everything is valid, we initialize the reconciliation state for the peer.
67 17:25 <larryruane> whoever is outbound goes first
68 17:26 <gleb7> larryruane: hmmm, why do you think so?
70 17:27 <gleb7> Okay, good to know this is somewhat confusing.
71 17:28 <larryruane> :) note i'm easily confused
72 17:28 <gleb7> The 'we_initiate_recon' variable refers to actual transaction reconciliations in the future.
73 17:28 <gleb7> Not the handshake SENDRECON stuff we're dealing with here.
74 17:29 <gleb7> No, this is perfect I think. If you're confused, someone else will also be. And this code should be understandable even without my presence so yeah, I probably should check how good are those comments.
75 17:29 <lightlike> stickies-v: I don't think anything is changed in the VERSION message to explicitely indicate recon support.
76 17:30 <stickies-v> lightlike oh you're right!
77 17:31 <stickies-v> that also explains the confusion I had about one of your comments regarding peers getting disconnected about sending a SENDRECON message unexpectedly
78 17:31 <gleb7> The PR *used to* bump the general Bitcoin p2p protocol version (maybe a month ago), but I dropped this. Does anyone know why?
79 17:32 <gleb7> I'm referring to `P2P_VERSION = 70016`
80 17:33 <stickies-v> I'm not quite familiar with this, but could it be that an increased p2p protocol version would mean Erlay nodes won't communicate with pre-Erlay nodes anymore, and there's no real need to do that?
81 17:33 <lightlike> maybe because it is not really necessary for things to work?
82 17:34 <glozow> when do we change protocol version? when there'd be a compatibility issue?
83 17:35 <gleb77> Last time we changed the protocol version for wtxids
84 17:35 <lightlike> I think they would still be ale communicate, just via the old flooding mechanism (because the SENDRECON messages would be tied to the new protocol version).
85 17:35 <lightlike> *able to communicate
87 17:37 <glozow> jnewbery: ah thanks
88 17:37 <gleb77> I actually don't have a good answer for this. At some point i just realized we don't need to bump it, so i dropped it.
89 17:38 <gleb77> There is clearly no issues of Erlay peers talking to non-Erlay peers.
90 17:38 <jnewbery> Generally, I don't think we need a new protocol version if we're just adding new optional message types. Nodes should ignore messages that they don't know the meaning of (although this hasn't always been the case for alternative implementations)
91 17:39 <gleb77> Yeah, this now goes out of scope of this PR. Probably deserves a blogpost, about protocol version and also service bits and messages like SENDRECON/SENDCMCPT
92 17:39 <gleb77> Let's move on
93 17:39 <gleb77> What is the reason for generating reconciliation salt (m_local_salts)? How is it generated and why?
95 17:41 <stickies-v> Since we use short IDs instead of the full ID, multiple tx's could have the same short ID. An attacker could calculate such colliding transactions, but by having each peer calculate their own salt (and used in calculating the short IDs), these transactions would only collide on that specific link and not with the entire network
96 17:41 <glozow> also you could possibly tell which nodes are the same node if they use the same salt
97 17:41 <glozow> same node on different networks*
98 17:42 <svav> Before sending SENDRECON, a node is supposed to "pre-register" the peer by generating and storing an associated reconciliation salt component.
99 17:42 <sipa> salts are per connection (and side on that connection), not per node, right?
101 17:43 <stickies-v> It's generated via GetRand(UINT64_MAX)
102 17:43 <gleb77> All correct answers :)
103 17:44 <gleb77> We have 2 reasons for making salts per-connection, not per-node
104 17:44 <stickies-v> gleb77 as to your 'why?' part, I seem to remember we have multiple Rand functions, some faster than others. Is GetRand the fastest one we can use? I'd think it doesn't need to be very secure?
105 17:44 <lightlike> both sides contribute 1/2 of the salt and they order and combine them, so they end up with the same salt
106 17:45 <gleb77> stickies-v: honestly, I didn't think much about the exact random function... i think we use GetRand in a bunch of other places, and this is the number of bits i needed so i used it
107 17:45 <gleb77> My question was more about combining the salts, etc.
108 17:45 <lightlike> speed probably doesn't matter, it's only done once per connection.
109 17:46 <gleb77> Can anyone guess what would be beneficial of having per-node salts (as opposed to per-connection we do)
111 17:46 <gleb77> jnewbery: thank you, i wasn't aware
112 17:46 <glozow> reusing salts means u can cache stuff i guess
113 17:47 <stickies-v> Hmm but `m_local_salts` from your question doesn't combine anything yet right? That only happens in RegisterPeer when we're updating m_states?
114 17:47 <gleb77> glozow: maybe, but there's a really big reason
115 17:47 <gleb77> stickies-v: yeah, hence the 'local' part of the name :)
116 17:47 <jnewbery> stickies-v: right, the local salt is generated when sending the `sendrecon`, and then combined with the remoate salt when receiving the `sendrecon`
117 17:48 <gleb77> Okay, so having per-connection salt is a little problematic in terms of the efficiency gains.
118 17:48 <gleb77> Because nodes receiving announcements can't de-duplicate same transactions (they are hashed with different salts)
119 17:49 <gleb77> So we ended up sending 1 extra full WTXID message. If we had per-node salt, they would be able to de-duplicate and send/request TX right away after short id
120 17:49 <gleb77> You will think about this challenge we had to solve after reviewing later parts of Erlay I guess :)
121 17:50 <gleb77> Let's jump right to question 9
122 17:50 <gleb77> Why might we use a std::unordered_map instead of a std::map to track salts and states?
123 17:52 <larryruane> more efficient
124 17:52 <larryruane> (in time, and probably in space)
125 17:52 <glozow> std::map is really an ordered_map, and we don't need the map to be ordered. afaik also faster lookup times
126 17:53 <gleb77> larryruane: I guess this answers "why we want?", second part of the question :)
127 17:53 <larryruane> glozow: right, map lookup is O(log(n)), unordered map is O(1)
128 17:54 <gleb77> Yeah, the benefits of reads and writes are massive.
129 17:54 <gleb77> Probably not that a big deal for couple hundred of records we will have here, but still, there's literally no disadvantage I think in this case, right?
130 17:56 <stickies-v> I did some quick googling and from that I gather that unordered_map uses more memory than map, but I honestly woulnd't know haha
132 17:56 <larryruane> my understanding is, many of our existing maps could be unordered_maps, but unordered maps weren't added (to std at least) until after a bunch of the core code had been written
133 17:56 <gleb77> stickies-v: cool, something i indeed have to refresh in my memory
134 17:57 <gleb77> okay, i think this is a good chill note to wrap up the meeting
135 17:58 <gleb77> I hope most of us got a good impression of the PR, and got inspired about different aspects of it to go ahead and review :)
136 17:59 <jnewbery> larryruane: right, I think std::unordered_map was only added in c++11
137 17:59 <larryruane> thank you gleb77! this was great!
138 17:59 <stickies-v> gleb77 if you still have a bit of time, could you just give some intuition about what happens when we have 2 tx's with colliding short IDs? to make it simple, assuming this happens accidentally and not through an attack
139 17:59 <glozow> thanks gleb77 :)
140 17:59 <lightlike> thank you gleb77 !
141 17:59 <gleb77> thank you guys, see you on github :)
142 17:59 <dergoegge> thank you gleb77!
143 18:00 <svav> Thanks gleb77
144 18:00 <lightlike> gleb77: btw, are your mainnet erlay nodes up? I tried to connect earlier but wasn't able to.
145 18:00 <gleb77> stickies-v: In the worst case (when they arrive at both nodes right before reconciliation or something like that), they will be "cancelled out" in reconciliation. Those particular colliding transactions won't be exchanged. No other txs or state affected.
146 18:01 <gleb77> lightlike: I probably have to restart them, I will do that tomorrow. Also, feel free to comment in the corresponding testing issue with your IPs, if you feel so
148 18:02 <glozow> gleb77: seems not too bad. hopefully, both nodes will just get those transactions from other peers with different salts
149 18:02 <glozow> #endmeeting