Erlay support signaling (p2p)

https://github.com/bitcoin/bitcoin/pull/23443

Host: naumenkogs  -  PR author: naumenkogs

The PR branch HEAD was ea3b87af90 at the time of this review club meeting.

Notes

  • Erlay is a proposed transaction relay protocol change to reduce the bandwidth used to announce transactions.

    • We looked at Erlay in a previous review club.

    • For this PR, it might make sense to overview the full PR again (note that this is a different PR number from whatā€™s above because that one was closed), and the updated BIP. Donā€™t spend too much time on it though, the overall understanding of the protocol is sufficient.

  • PR #23443 contains the first batch of commits enabling Erlay:

    • A node becomes able to negotiate the support for Erlay with another node by sending/receiving a new p2p message.

    • Once the handshake is done, the node also initializes the ā€œreconciliation stateā€, a variable to keep track of ongoing reconciliations with a particular peer.

    • If the peer is disconnected, the corresponding reconciliation state should be cleared.

  • We moved forward with this PR once minisketch was merged, although minisketch is not required for this particular Erlay sub-PR. Another dependency was PR #18044: ā€œUse wtxid for transaction relayā€. We have discussed minisketch in previous review clubs, python part 1, python part 2, and C++ implementation.

Questions

  1. Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK?

  2. What are the benefits of splitting PRs into smaller parts? Are there any drawbacks to this approach? How might you review sub-PRs differently?

  3. When are the nodes supposed to announce Erlay support? What should they consider before doing so and in which cases would it be clearly meaningless?

  4. What is the overall handshake and ā€œregistration for reconciliationā€ protocol flow?

  5. Why is the version field of sendrecon useful? Under which conditions it should be bumped?

  6. What is the reason for generating reconciliation salt (m_local_salts)? How is it generated and why?

  7. This PR adds 4 new conditions under which nodes will disconnect a peer. What are they? When does it make sense to disconnect a peer rather than just ignore the message?

  8. What does TxReconciliationTracker::Impl::m_mutex guard and when does the lock need to be held?

  9. Why might we use a std::unordered_map instead of a std::map to track salts and states?

Meeting Log

  117:00 <gleb7> #startmeeting
  217:00 <gleb7> hi?
  317:00 <glozow> hi!
  417:00 <jnewbery> hi!
  517:00 <arythmetic> hi
  617:00 <stickies-v> oi
  717:00 <ccdle12> hi
  817:00 <dergoegge> hi
  917:00 <svav> Hi
 1017:00 <effexzi> Hi
 1117:00 <firsttimeprrevie> hi
 1217:00 <lightlike> hi
 1317:00 <larryruane> Hi
 1417:01 <gleb7> i don't know half of the attendees, cool! I also saw some new names already reviewing the PR
 1517:01 <gleb7> Today we gonna talk about erlay, a long-lasting project :)
 1617:01 <gleb7> More specifically, about a first chunk of it (apart from merging minisketch library to the repo). THis is a first p2p-level part
 1717:02 <gleb7> First usual question
 1817:02 <gleb7> Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK?
 1917:02 <stickies-v> Concept ACK, hooray for increased connectivity!
 2017:03 <glozow> y, though i realized i should also zoom out and look at the overall PR + try running an erlay node and stuff
 2117:03 <arythmetic> No, just lurking today :)
 2217:03 <jnewbery> glozow: I agree. It's hard to concept ACK/aproach ACK without looking at the full changeset
 2317:03 <gleb7> yeah, looking at the overall PR is a good idea for someone interested in this stuff. Original PR also links to 2 issues i made about understanding/measuring performance of this, which should help
 2417:04 <glozow> yeah, the stats look impressive and the measurement process thorough, but i'm trusting not verifying šŸ˜±
 2517:05 <gleb7> I tried to make reproducing easy, and at least couple people re-verified it, sooo :)
 2617:05 <gleb7> Looking forward for your verification too
 2717:05 <gleb7> For me making/reviewing prs splitted in parts is something new, and I don't think we have much experience with that in general (i recall a handfull of projects going through this)
 2817:06 <gleb7> Which leads us to the next set question, which is probably my favorite
 2917:06 <gleb7> 2a. What are the benefits of splitting PRs into smaller parts?
 3017:07 <gleb7> For the context, Erlay PR is 40-something commits, which is definitely more than average, and line-wise as well
 3117:07 <stickies-v> It makes reviewing much easier. Good and atomic commits already help a lot with this, but it's nice if you can fit the entire PR in your mind model, so ideally it's not too large
 3217:08 <glozow> i think people review more thoroughly when the PR is smaller (probably not a good thing but seems so)
 3317:08 <dergoegge> i think github is pretty bad for code review on large change sets with multiple people reviewing different parts, splitting it up makes it easier to have focused review
 3417:09 <gleb7> Yeah, so all these mean that we're probably gonna get less bugs, better code quality, and probably faster than all-at-once approach
 3517:09 <glozow> dergoegge: +1, if you have more than 150 comments it just becomes impossible
 3617:09 <stickies-v> It also reduces code going stale since the non-controversial code can be merged more quickly while the more controversial code is discussed in another PR
 3717:10 <gleb7> My first version was 7 commits I think and oh god it was hard to navigate even by myself, we already went through a lot of structural changes since then. Now we're here
 3817:10 <jnewbery> I think one downside is that if you're not reviewing the full change set, then you're slightly trusting the author that they're taking you in the right direction
 3917:10 <gleb7> Yeah, this is the next question :) 2b Are there any drawbacks to this approach?
 4017:11 <gleb7> jnewbery: this is indeed an issue. Not exactly for this PR, where it's a pretty general thing with not so many design choices.
 4117:11 <glozow> it might be weird if the sub-PRs don't do much by themselves. lightlike had a good comment about the fact that, if a release was branched off from this PR, nodes would be sending each other meaningless SENDRECONs
 4217:11 <gleb7> But I believe the next couple PRs would require more context from future chunks indeed.
 4317:12 <gleb7> glozow: I spent some time thinking about this too. I think this is first time we're merging a p2p message which is useless on its own.
 4417:13 <jnewbery> eg Carl split https://github.com/bitcoin/bitcoin/pull/20158 into many smaller PRs. Either you do a concept review of the full PR, or you're trusting that Carl is taking us in the right direction and just review the mechanical code changes in the sub-PRs
 4517:14 <lightlike> I think it's important to not only look at the current PR chunk, but also at the whole thing (simulations/mainnet testing) to get to a concept ACK.
 4617:16 <gleb7> Yeah, and making the whole thing accessible for reviewers is the whole other challenge. I hope, participants of this PR will have time to look at the big thing too :)
 4717:16 <glozow> not to shill my own bags but i've been trying to split the package mempool accept PRs into chunks where each one might be useful, e.g. package testmempoolaccept, rbf improvements, etc. i'm not sure if people like that or if it just seems disconnected but šŸ¤·
 4817:17 <gleb7> glozow: i'll be honest, when i review your prs, I usually trust you and other reviewers w.r.t the big direction
 4917:18 <gleb7> Making sure it doesn't break anything terribly of course. They are usually just good things on them own, like, decoupling and stuff
 5017:18 <gleb7> Okay, let's move on to the actual code changes
 5117:18 <gleb7> When are the nodes supposed to announce Erlay support? What should they consider before doing so and in which cases would it be clearly meaningless
 5217:18 <gleb7> This is something we've been discussing in the PR already.
 5317:19 <stickies-v> We should consider if the peer is blocks-only, and if it supports wtxid relay. If both aren not fulfilled, SENDRECON should not be sent.
 5417:19 <gleb7> stickies-v: right. do you know why the wtxid relay thing matter?
 5517:20 <gleb7> anyone is welcome to answer too :)
 5617:20 <stickies-v> because sketches are based on the wtxids of transactions
 5717:21 <gleb7> exactly, this is something covered in a BIP. This will be used later on in Erlay commits, but for now it was sufficient to compare the code to BIP.
 5817:21 <gleb7> There is a dependency on wtxid relay protocol.
 5917:22 <larryruane> quick side-note, in case anyone wants to run the unit test, you'll need to add `test/txreconciliation_tests.cpp` to `src/Makefile.test.include`, and also change the unit test to `#include <node/txreconciliation.h>` .... was this intentional, that the unit test isn't being built yet?
 6017:22 <gleb7> larryruane: not at all, good catch.
 6117:22 <larryruane> (i'll make a comment on the PR)
 6217:23 <gleb7> Thank you! I was probably moving or renaming something and forgot to do this.
 6317:23 <gleb7> Next question. What is the overall handshake and ā€œregistration for reconciliationā€ protocol flow?
 6417:25 <svav> First, send a new p2p message sendrecon
 6517:25 <larryruane> after (existing) version handshake, then the peers have a defined order for trading recon salts (?), depending on who is inbound and who is outbound
 6617:25 <stickies-v> In the VERSION message, each peer can include some fields to indicate recon support. If such a VERSION message from a peer is received, we PreRegister the peer and send out a SENDRECON message. If we receive a SENDRECON message back from the peer and everything is valid, we initialize the reconciliation state for the peer.
 6717:25 <larryruane> whoever is outbound goes first
 6817:26 <gleb7> larryruane: hmmm, why do you think so?
 6917:27 <larryruane> did I get it wrong, I was looking at https://github.com/bitcoin/bitcoin/pull/23443/commits/b55cbf63e15766bdabcac1c08b4cbfea0badeb04#diff-62e6a7c4c23e68b88bd585db25bb0a10e6ccda2d7fff2f05769bc0a1ad81dcdcR38
 7017:27 <gleb7> Okay, good to know this is somewhat confusing.
 7117:28 <larryruane> :) note i'm easily confused
 7217:28 <gleb7> The 'we_initiate_recon' variable refers to actual transaction reconciliations in the future.
 7317:28 <gleb7> Not the handshake SENDRECON stuff we're dealing with here.
 7417:29 <gleb7> No, this is perfect I think. If you're confused, someone else will also be. And this code should be understandable even without my presence so yeah, I probably should check how good are those comments.
 7517:29 <lightlike> stickies-v: I don't think anything is changed in the VERSION message to explicitely indicate recon support.
 7617:30 <stickies-v> lightlike oh you're right!
 7717:31 <stickies-v> that also explains the confusion I had about one of your comments regarding peers getting disconnected about sending a SENDRECON message unexpectedly
 7817:31 <gleb7> The PR *used to* bump the general Bitcoin p2p protocol version (maybe a month ago), but I dropped this. Does anyone know why?
 7917:32 <gleb7> I'm referring to `P2P_VERSION = 70016`
 8017:33 <stickies-v> I'm not quite familiar with this, but could it be that an increased p2p protocol version would mean Erlay nodes won't communicate with pre-Erlay nodes anymore, and there's no real need to do that?
 8117:33 <lightlike> maybe because it is not really necessary for things to work?
 8217:34 <glozow> when do we change protocol version? when there'd be a compatibility issue?
 8317:35 <gleb77> Last time we changed the protocol version for wtxids
 8417:35 <lightlike> I think they would still be ale communicate, just via the old flooding mechanism (because the SENDRECON messages would be tied to the new protocol version).
 8517:35 <lightlike> *able to communicate
 8617:36 <jnewbery> previous protocol versions: https://github.com/bitcoin/bitcoin/blob/926fc2a0d4ff64cf2ff8e1dfa64eca2ebd24e090/src/version.h#L1-L39
 8717:37 <glozow> jnewbery: ah thanks
 8817:37 <gleb77> I actually don't have a good answer for this. At some point i just realized we don't need to bump it, so i dropped it.
 8917:38 <gleb77> There is clearly no issues of Erlay peers talking to non-Erlay peers.
 9017:38 <jnewbery> Generally, I don't think we need a new protocol version if we're just adding new optional message types. Nodes should ignore messages that they don't know the meaning of (although this hasn't always been the case for alternative implementations)
 9117:39 <gleb77> Yeah, this now goes out of scope of this PR. Probably deserves a blogpost, about protocol version and also service bits and messages like SENDRECON/SENDCMCPT
 9217:39 <gleb77> Let's move on
 9317:39 <gleb77> What is the reason for generating reconciliation salt (m_local_salts)? How is it generated and why?
 9417:40 <sipa> hi!
 9517:41 <stickies-v> Since we use short IDs instead of the full ID, multiple tx's could have the same short ID. An attacker could calculate such colliding transactions, but by having each peer calculate their own salt (and used in calculating the short IDs), these transactions would only collide on that specific link and not with the entire network
 9617:41 <glozow> also you could possibly tell which nodes are the same node if they use the same salt
 9717:41 <glozow> same node on different networks*
 9817:42 <svav> Before sending SENDRECON, a node is supposed to "pre-register" the peer by generating and storing an associated reconciliation salt component.
 9917:42 <sipa> salts are per connection (and side on that connection), not per node, right?
10017:42 <glozow> yea
10117:43 <stickies-v> It's generated via GetRand(UINT64_MAX)
10217:43 <gleb77> All correct answers :)
10317:44 <gleb77> We have 2 reasons for making salts per-connection, not per-node
10417:44 <stickies-v> gleb77 as to your 'why?' part, I seem to remember we have multiple Rand functions, some faster than others. Is GetRand the fastest one we can use? I'd think it doesn't need to be very secure?
10517:44 <lightlike> both sides contribute 1/2 of the salt and they order and combine them, so they end up with the same salt
10617:45 <gleb77> stickies-v: honestly, I didn't think much about the exact random function... i think we use GetRand in a bunch of other places, and this is the number of bits i needed so i used it
10717:45 <gleb77> My question was more about combining the salts, etc.
10817:45 <lightlike> speed probably doesn't matter, it's only done once per connection.
10917:46 <gleb77> Can anyone guess what would be beneficial of having per-node salts (as opposed to per-connection we do)
11017:46 <jnewbery> Very good code comment about our different random functions: https://github.com/bitcoin/bitcoin/blob/926fc2a0d4ff64cf2ff8e1dfa64eca2ebd24e090/src/random.h#L18-L58
11117:46 <gleb77> jnewbery: thank you, i wasn't aware
11217:46 <glozow> reusing salts means u can cache stuff i guess
11317:47 <stickies-v> Hmm but `m_local_salts` from your question doesn't combine anything yet right? That only happens in RegisterPeer when we're updating m_states?
11417:47 <gleb77> glozow: maybe, but there's a really big reason
11517:47 <gleb77> stickies-v: yeah, hence the 'local' part of the name :)
11617:47 <jnewbery> stickies-v: right, the local salt is generated when sending the `sendrecon`, and then combined with the remoate salt when receiving the `sendrecon`
11717:48 <gleb77> Okay, so having per-connection salt is a little problematic in terms of the efficiency gains.
11817:48 <gleb77> Because nodes receiving announcements can't de-duplicate same transactions (they are hashed with different salts)
11917:49 <gleb77> So we ended up sending 1 extra full WTXID message. If we had per-node salt, they would be able to de-duplicate and send/request TX right away after short id
12017:49 <gleb77> You will think about this challenge we had to solve after reviewing later parts of Erlay I guess :)
12117:50 <gleb77> Let's jump right to question 9
12217:50 <gleb77> Why might we use a std::unordered_map instead of a std::map to track salts and states?
12317:52 <larryruane> more efficient
12417:52 <larryruane> (in time, and probably in space)
12517:52 <glozow> std::map is really an ordered_map, and we don't need the map to be ordered. afaik also faster lookup times
12617:53 <gleb77> larryruane: I guess this answers "why we want?", second part of the question :)
12717:53 <larryruane> glozow: right, map lookup is O(log(n)), unordered map is O(1)
12817:54 <gleb77> Yeah, the benefits of reads and writes are massive.
12917:54 <gleb77> Probably not that a big deal for couple hundred of records we will have here, but still, there's literally no disadvantage I think in this case, right?
13017:56 <stickies-v> I did some quick googling and from that I gather that unordered_map uses more memory than map, but I honestly woulnd't know haha
13117:56 <stickies-v> https://thispointer.com/map-vs-unordered_map-when-to-choose-one-over-another/
13217:56 <larryruane> my understanding is, many of our existing maps could be unordered_maps, but unordered maps weren't added (to std at least) until after a bunch of the core code had been written
13317:56 <gleb77> stickies-v: cool, something i indeed have to refresh in my memory
13417:57 <gleb77> okay, i think this is a good chill note to wrap up the meeting
13517:58 <gleb77> I hope most of us got a good impression of the PR, and got inspired about different aspects of it to go ahead and review :)
13617:59 <jnewbery> larryruane: right, I think std::unordered_map was only added in c++11
13717:59 <larryruane> thank you gleb77! this was great!
13817:59 <stickies-v> gleb77 if you still have a bit of time, could you just give some intuition about what happens when we have 2 tx's with colliding short IDs? to make it simple, assuming this happens accidentally and not through an attack
13917:59 <glozow> thanks gleb77 :)
14017:59 <lightlike> thank you gleb77 !
14117:59 <gleb77> thank you guys, see you on github :)
14217:59 <dergoegge> thank you gleb77!
14318:00 <svav> Thanks gleb77
14418:00 <lightlike> gleb77: btw, are your mainnet erlay nodes up? I tried to connect earlier but wasn't able to.
14518:00 <gleb77> stickies-v: In the worst case (when they arrive at both nodes right before reconciliation or something like that), they will be "cancelled out" in reconciliation. Those particular colliding transactions won't be exchanged. No other txs or state affected.
14618:01 <gleb77> lightlike: I probably have to restart them, I will do that tomorrow. Also, feel free to comment in the corresponding testing issue with your IPs, if you feel so
14718:02 <gleb77> lightlike: https://github.com/naumenkogs/txrelaysim/issues/8
14818:02 <glozow> gleb77: seems not too bad. hopefully, both nodes will just get those transactions from other peers with different salts
14918:02 <glozow> #endmeeting