#16939 Delay querying DNS seeds if addrman is populated (p2p)

https://github.com/bitcoin/bitcoin/16939

Host: jnewbery

We will discuss both PR 15558 (merged) and PR 16939 (open) in this week’s PR review club. Both are small code changes, so should be quick to review.

Notes

  • When bitcoind is started, it tries to connect to peers from its addrman database. Those new connections are opened in CConnman::ThreadOpenConnections(), which is called in CConnman::Start().
  • If the addrman database is empty (for example, the first time that bitcoind is run), or if bitcoind is unable to successfully connect to two peers, then it will query DNS seeds to get addresses of candidate peers to connect to.
  • Querying DNS seeds is done in a separate thread, threadDNSAddressSeed, which is also started in CConman::Start().
  • For completeness, if the DNS seed queries are unsuccesful, then bitcoind will fall back to connecting to a hard-coded list of seed nodes. This fall back functionality is only run as a last resort if all DNS seeds are down (eg if there is an attack on the DNS seed infrastructure).
  • In February, Issue 15434 was opened, which reported that DNS seeds were being queried too soon and too frequently. From that issue: “Despite being running most of the time, and having a database of tens of thousands of peers, my node seems to query the DNS seeds each time I restart it, which doesn’t seem ideal from a privacy perspective”.
  • PR 15558 (which was merged in time for V0.19) changed the ThreadOpenConnections() behaviour to only query three DNS seeds at a time (rather than all at once). According to the author: “This reduces the amount of information DNS seeds can observe about the requesters by spreading the load over all of them.”.
  • After that PR was merged, AJ Towns left a comment: “I think there’s a bug here: if this loop is hit with addrman.size()==0 and seeds_right_now==0 it will then query the first seed, and set seeds_right_now==-1, at which point this condition will keep failing on future loops, preventing a delay between batches of 3 seeds, and preventing early exit.”.
  • PR 16939 changes the DNS seed behaviour once again:
    • if there are 0 entries in addrman or -forcedns is set, it will query the DNS seeds immediately;
    • if there are fewer than 1000 entries in addrman, it will query DNS seeds after 11 seconds;
    • if there are more than 1000 entries in addrman, it will query DNS seeds after 5 minutes;
    • if there are still no entries in addrman 6 minutes after start up, then bitcoind will try to connect to the hard-coded seed nodes.

Questions

  1. Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK? (Don’t forget to put your PR review on GitHub.)

  2. What steps did you take, beyond reading the code?

  3. How would you test this manually? Could any automated tests be added?

  4. Do you agree with AJ’s comment that this is a bug?

  5. What implications do these PRs have for privacy? Robustness/redundancy? Usability?

  6. Why was three DNS seeds chosen in PR 15558?

Further reading

  • AJ Towns also proposes adding more threads to speed up connecting to peers at start-up in PR 15502.
  • Matt Corallo commented that having all DNS seeds run the same software is a risk. Issue 16938 was opened to track diversifying the DNS seed software.

Meeting Log

13:00 < jnewbery> hi
13:00 < sebastianvstaa> hi
13:00 < amiti> hi
13:00 < pinheadmz> hi
13:00 < aj> zzz
13:00 < lightlike> hi
13:00 < ajonas> hi
13:01 < b10c> hi
13:01 < jnewbery> Thanks to jonatack for suggesting this week's PRs, and for hosting the last two meetings!
13:01 < fanquake> Hi
13:01 < jonatack> hi
13:01 < fjahr> hi
13:01 < ariard_> hi
13:02 < jnewbery> If anyone else wants to host or has a suggestion for PRs to discuss, please let me know. Either message me here or leave a comment on https://github.com/bitcoin-core-review-club/bitcoin-core-review-club.github.io/issues/14
13:02 < jonatack> thanks, jnewbery
13:02 < jnewbery> ok, who had a chance to build/test this week's PRs?
13:03 < zenogais> hi all
13:03 < jnewbery> The first question (as always): Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK?
13:03 < fjahr> We are talking about the open PR now?
13:03 < jnewbery> both
13:04 < zenogais> Still in the process of reviewing, but 16939 will probably be a tested ACK from me
13:04 < lightlike> built it and did some testing on testnet/mainnet.
13:04 < jnewbery> lightlike: great! What did your testing involve?
13:05 < fjahr> built and ran tests on 16939, need some more time to look at code and think about tests
13:06 < jnewbery> that question is open to anyone: how did you test/how could you test these changes?
13:06 < lightlike> jnewbery: i started a node with the PR with and without peers.dat and also added some more precise logging to see how many IPs the different DNS seeds sent us.
13:06 < zenogais> Simplest test I thought of was syncing node from scratch without peers and check the behavior.
13:06 < jnewbery> lightlike: nice. Did you find anything interesting?
13:09 < lightlike> jnewbery: in one run, it took me like 5 minutes to get 2 peers on mainnet, which I found a bit too slow considering I had ~6k entries in addrman. Might just be an outlier though.
13:09 < jnewbery> one of the challenges for this PR is that we don't have any way to do integration testing of these changes - we can't manipulate the peers.dat file or set up DNS servers in our functional test framework
13:10 < jnewbery> did anyone take a look at how we could do automated testing? Adding some functionality to the test framework or adding unit tests?
13:10 < zenogais> One other option: Step through the DNS thread in gdb
13:10 < aj> lightlike: did you see how many connections it attempted before finding two peers?
13:11 < fanquake> Now that there's been more some PR review coming from the review club, this sort of "how did you actually review PRs" discussion is definitely the most interesting to me. Especially from anyone that has been ACKing PRs.
13:11 < jnewbery> lightlike: that doesn't seem great. How long had your node been running the previous time?
13:13 < aj> lightlike: (if you had net debug enabled, "trying connection .* lastseen=" in your logs in between when you started the test and when you got two peers should tell you)
13:13 < jnewbery> I thought aj's comment here: https://github.com/bitcoin/bitcoin/issues/15434#issuecomment-468166991 was interesting. "I'm getting maybe 20% of outbound attempts succeeding..."
13:14 < jnewbery> For those who are interested and want to read further, aj's PR here: https://github.com/bitcoin/bitcoin/pull/15502 is also interesting
13:16 < lightlike> aj: the 9th peer was the second one that worked. I think the issue is that the first peer I connected to sent me some new blocks (my node was still in final stages of IBD), and while they were being validates there were some ~30s breaks in which no new connections were attempted
13:17 < jnewbery> ok, let's move on to the next question. you agree with AJ’s comment (https://github.com/bitcoin/bitcoin/pull/15558#discussion_r327421987) that this is a bug?
13:18 < lightlike> jnewbery: my node was maybe one week behind, when I run for like 1-2 hours.
13:18 < fjahr> jnewbery: I did not get to think about about testing yet but why can we not manipulate peers.dat file?
13:19 < jnewbery> fjahr: it's possible, but there's nothing in the test framework that does it
13:19 < fjahr> got it
13:19 < aj> lightlike: oh, huh... that'd mean it'd be just as slow with dns peers, i think. probably worth commenting that somewhere
13:20 < jnewbery> I have a tool that parses the peers.dat file: https://github.com/jnewbery/bitcointools. That could be enhanced to edit the files for testing
13:20 < zenogais> Hmm, on testnet for me IBD seems to be stalled right now. It found 10 peers, but no blocks yet after ~15 mins or so.
13:20 < jnewbery> aj: did you look at adding unit tests?
13:21 < aj> jnewbery: nope
13:22 < zenogais> Hmm, deleting peers.dat and reseeding seemed to fix
13:22 < zenogais> I now see it syncing blockheaders
13:23 < jnewbery> Does anyone want to try answering q4: Do you agree with AJ’s comment that this is a bug?
13:25 < zenogais> It sounds like it might have been intended behavior, but wasn't documented. So unclear whether it was a bug.
13:25 < lightlike> re q4: hard to say if bug or intentional, but I'm not sure if it is important if a peer with an empty addrman asks all DNS seeds or just 3 of them.
13:26 < jnewbery> zenogais: lightlike: yeah, at the very least the code isn't as clear as it could be
13:27 < aj> if we had lots of DNS seeds (or lots of full nodes coming online) it would seem a bit spammy to always query all of them to me. but we don't have either of those things...
13:28 < jnewbery> Were there any other questions on the PRs? It seems like this week's PRs were perhaps difficult for people to review/test?
13:29 < zenogais> I wasn't able to review these as much as I would have liked. Seems like they need deeper review / testing since they're in a part of the codebase that isn't easily testable atm.
13:29 < aj> any good ideas on ways to make them easier to test?
13:29 < zenogais> I think jnewbery's peers.py is a good base for automating some testing around this.
13:30 < zenogais> At least getting the ability to manipulate peers.py so you can test present/missing behavior here.
13:30 < zenogais> I also still want to step through this in a debugger just to ensure the intended behaviors are really triggered and work as expected.
13:31 < jnewbery> I know that carldong was hoping that better separation of the net layer would make focused testing easier. I'm not sure how far he's got with that
13:31 < zenogais> ^ This would be a big help too
13:32 < jnewbery> zenogais: one potential difficulty with using a debugger with this functionality is that there are a bunch of threads with timers interacting with each other. Stepping through one thread doesn't capture all the behaviour
13:33 < lightlike> One thought I had is that that while querying the DNS seeds in almost every startup is obviously bad for decentralisation, it is also a defense against eclipse attacks.
13:34 < lightlike> So if someone manages to manipulate our addrman (like in the Erebus or Heilman paper), the DNS seeds won't counteract that after this PR is merged.
13:34 < fjahr> concerning peers.dat: we could make it easier to manipulate files in the node's datadir with some helper methods, i think this could be helpful for other tests
13:36 < jnewbery> lightlike: I think that's a good point. If an adversary can take over your addrman, then I think the only way out of that is to query the DNS seeds or the hard-coded seed nodes.
13:36 < jnewbery> fjahr: I agree. It shouldn't be too difficult to add those methods to TestNode
13:38 < jnewbery> next question: What implications do these PRs have for privacy? Robustness/redundancy? Usability?
13:39 < jnewbery> lightlike has already mentioned that not querying DNS seeds could potentially be bad for robustness/redundancy (because reducing the use of DNS seeds could *potentially* make us more vulnerable to having our addrman taken over)
13:40 < fjahr> what i took away from the discussion: higher privacy because less other nodes are pinged at startup but potentially lower usability because startup might take longer
13:40 < zenogais> It sounds like the originally motivator was to avoid leaking information to DNS seeds.
13:40 < zenogais> *original
13:41 < jnewbery> zenogais: yes, I think that's the motivation for both 15558 and 16939
13:41 < zenogais> It sounds like one other alternative here would be to query all DNS seeds in random order?
13:42 < jnewbery> zenogais: how does that help?
13:42 < zenogais> Not sure if it does. But it sounds like the order was originally fixed, and 15558 switched to it to random order in addition to reducing the number of seeds queried.
13:44 < jnewbery> the problem that they're trying to solve is not leaking information to _all_ of the DNS seeds about when you're staring a node. Changing the ordering doesn't improve that
13:44 < jnewbery> Final question: Why was three DNS seeds chosen in PR 15558?
13:44 < zenogais> +1
13:44 < sprawl-unsolved> re: Robustness I agree with BlueMatt's concern about a nodes isolation if homogenous DNS sources are queried (https://github.com/bitcoin/bitcoin/pull/15558#issuecomment-529128985)
13:45 < zenogais> It sounded like 1 seed makes eclipse attacks fairly easy, so 3 was selected somewhat arbitrarily to mitigate that.
13:45 < sprawl-unsolved> ... which is being discussed / mitigated here https://github.com/bitcoin/bitcoin/issues/16938
13:46 < sprawl-unsolved> What's not clear to me is if PR 15558 was modified to attempt to get a non homogenous set of DNS seeds to attempt to prevent eclipse attacks
13:46 < lightlike> there are only 8 DNS seeds currently, and choosing only 1 would again be bad for eclipse attacks, so 3 seems like a reasonable compromise.
13:47 < zenogais> sprawl-unsolved: This is interesting discussion, thanks for sharing
13:48 < jnewbery> sprawl-unsolved: I think 3 was simply chosen to mitigate against a single rogue DNS seed being able to eclipse new nodes. Issue 16938 is a very interesting discussion, but it's not the only reason you'd want redundancy in the DNS seeds you query
13:50 < jnewbery> The bitcoin-seeder repo is here: https://github.com/sipa/bitcoin-seeder for those interested
13:50 < jnewbery> ok, any other questions in the last 10 minutes?
13:51 < zenogais> Are there folks actively working on improving testing around multi-threaded code like this?
13:52 < jnewbery> zenogais: not that I'm aware of. What specifically do you mean when you say 'testing around multi-threaded code'? Do you have something specific in mind?
13:52 < zenogais> Seems difficult to integration test, but could perhaps benefit from plucking the logic out into isolated testable objects/methods.
13:53 < zenogais> jnewbery: The question is mostly motivated since you spoke about the difficult of testing/observing these threads in isolation because of their interactions
13:53 < zenogais> *difficulty
13:54 < zenogais> I'd need to dig more into the code for specific questions here, but was just an initial thought.
13:54 < jnewbery> I don't think it's that difficult to write integration tests, or at least it's probably easier than writing integration tests for some of the other p2p behaviour
13:55 < sprawl-unsolved> I think I misunderstood an attack vulnerability, I thought there were mutliple DNS seeds operated by the same entities but if https://github.com/bitcoin/bitcoin/blob/4b51ed89cfce9870a20d75001fae3b68ac1dfd86/src/chainparams.cpp#L116 is where all the DNS seeds are listed then it appears it wouldn't be possible for to query 3 DNS seeds and get a response from < 3 entities
13:55 < jnewbery> there would be a bit of framework code to write (mocking DNS server responses, manipulating .dat files)
13:56 < jnewbery> sprawl-unsolved: you're assuming that sipa, bluematt, luke, christian, ... are all different people :)
13:57 < fanquake> I don't think I've ever seen them all in the same room
13:57 < jonatack> zenogais: re folks working on this, I proposed these PRs out of personal interest and am beginning to look for funding to work on these areas.
13:57 < sprawl-unsolved> jnewbery lol I had written "apparently" in that line but maybe thought it was a bit too pedantic :{
13:58 < jnewbery> ok, let's wrap it up there. I'll try to get next week's notes and questions up sooner than last time so you all have a bit more time to review and test
13:58 < zenogais> Thanks for hosting jnewbery
13:58  * sprawl-unsolved claps
13:58 < lightlike> thanks!
13:59 < jnewbery> thanks all!
13:59 < jonatack> thanks!