BIP 157: Add compact block filter headers cache (utxo db and indexes)

https://github.com/bitcoin/bitcoin/pull/18960

Host: jnewbery  -  PR authors: jnewbery, jimpo

This is the second special Bitcoin Core PR Review Club looking at the BIP 157 implementation. Today, we’ll look at the second PR in the series: Add compact block filter headers cache.

Notes

  • BIP 157 defines checkpoints, which are compact block filter headers at 1000-block intervals. Newly syncing clients can fetch the checkpoints first (potentially from a trusted source), then fill in the intermediate filter headers, and finally fetch the filters themselves. Because the filter headers form a hash chain, the client can verify that it has been given the correct filters and headers by checking against the checkpoints (a sketch of this verification follows these notes).

  • Since the checkpoint filter headers are potentially the most frequently served data, it makes sense to cache them in memory rather than read them from disk every time.

  • I’ve taken a different approach from jimpo’s original implementation, so in this review club we’ll compare three different approaches:

    1. The original implementation, which added an active_chain_cf_headers vector in net_processing to cache the block filter headers, and updated that cache in the UpdatedBlockTip() callback.

    2. A second implementation, which uses the same active_chain_cf_headers, but updates it in the ProcessGetCFCheckPt() function.

    3. A third implementation, which moves the cache into the BlockFilterIndex class and doesn’t change net processing logic (a sketch of this approach is given below).
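
As a concrete illustration of the hash-chain verification described in the first note, here is a minimal Python sketch (the function names are illustrative, not from the codebase). It assumes the client already has the previous checkpoint and the filter hashes for one 1000-block interval; per BIP 157, each filter header is the double-SHA256 of the filter hash concatenated with the previous filter header.

```python
import hashlib

def dsha256(data: bytes) -> bytes:
    """Bitcoin-style double-SHA256."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def verify_checkpoint_interval(prev_checkpoint: bytes,
                               filter_hashes: list,
                               checkpoint: bytes) -> bool:
    """Recompute the filter header chain across one 1000-block interval
    and check that it ends at the expected checkpoint header."""
    header = prev_checkpoint
    for filter_hash in filter_hashes:
        # filter_header[i] = dsha256(filter_hash[i] || filter_header[i-1])
        header = dsha256(filter_hash + header)
    return header == checkpoint
```

If any filter hash or intermediate header is wrong, the recomputed chain no longer ends at the checkpoint, which is what lets a client detect bad data from untrusted peers.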

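To make the comparison more concrete, here is a rough Python sketch of the shape of the third approach (the actual PR code is C++ in the BlockFilterIndex class; the names and the lazy-fill details here are illustrative assumptions, not the PR's exact logic). The cache is keyed by block hash, lives entirely inside the index, and is filled on lookup, which is also why the lock question below arises: lookups can come from more than one thread.

```python
import threading

class BlockFilterIndexSketch:
    """Rough shape of approach (3): the checkpoint header cache lives in
    the index itself and is populated lazily on lookup."""

    def __init__(self, disk_lookup):
        self._disk_lookup = disk_lookup           # stands in for the on-disk (LevelDB) read
        self._headers_cache_lock = threading.Lock()
        self._headers_cache = {}                  # block hash -> filter header

    def lookup_filter_header(self, block_hash, height):
        is_checkpoint = height % 1000 == 0
        if is_checkpoint:
            # First locked section: serve from the cache if we can.
            with self._headers_cache_lock:
                if block_hash in self._headers_cache:
                    return self._headers_cache[block_hash]
        # Read from the on-disk index without holding the lock.
        header = self._disk_lookup(block_hash)
        if is_checkpoint and header is not None:
            # Second locked section: store the result. Another thread may
            # have inserted the same entry in the meantime; setdefault()
            # leaves an existing entry untouched (compare the emplace()
            # discussion in the meeting log below).
            with self._headers_cache_lock:
                self._headers_cache.setdefault(block_hash, header)
        return header
```
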
Questions

  • Why is a lock added in the first and second implementations? Is it also needed in the third implementation?

  • The first implementation could cause a data race on shutdown (full log). Can you find where the data race is?

  • What are the main differences between the implementations? Which approach do you prefer?

Meeting Log

  13:00 <jnewbery> #startmeeting
  13:00 <jnewbery> hi
  13:00 <michaelfolkson> hi
  13:00 <theStack> hi
  13:00 <troygiorshev> hi
  13:00 <fjahr> hi
  13:01 <jkczyz> hi
  13:01 <jonatack> hi
  13:01 <Caralie> hi
  13:01 <lightlike> hi
  13:01 <MarcoFalke> hi
  13:01 <andrewtoth> hi
  13:01 <jnewbery> thanks for joining pt 2 of the BIP 157 special edition review club series!
  13:01 <jnewbery> notes/questions in the normal place: https://bitcoincore.reviews/18960.html
  13:02 <jnewbery> who had a chance to review the PR? y/n
  13:02 <thomasb06> hi
  13:02 <theStack> y
  13:02 <troygiorshev> n
  13:02 <andrewtoth> n
  13:02 <jkczyz> y
  13:02 <thomasb06> n
  13:03 <fjahr> y earlier but missed that there were notes
  13:03 <jnewbery> and for those that did review it, did you look at the alternative cache implementations in https://github.com/bitcoin-core-review-club/bitcoin/releases/tag/pr18960.1 and https://github.com/bitcoin-core-review-club/bitcoin/releases/tag/pr18960.2 ?
  13:03 <jkczyz> yup
  13:04 <michaelfolkson> Somewhat. Not to a point where I understood exactly what was going on in all 3
  13:04 <jnewbery> great. So PR 18960 is fairly straightforward and small, so I think the interesting discussion this week will be around how the implementations differ
  13:05 <jnewbery> Is anyone able to give a high-level summary of the different approaches?
  13:06 <jkczyz> Roughly: (1) populates the cache as blocks are connected and needs to manage reorgs (2) populates when filters are requested in the message handler (3) moves the cache to the BlockFilterIndex
  13:06 <theStack> 1) update cache on block tip update, cache by height 2) update cache on getcfcheckpt request, cache by height 3) update cache on getcfcheckpt request, cache by block hash
  13:07 <jnewbery> jkczyz theStack: great summaries. Thanks!
  13:07 <jkczyz> theStack: ah yes, height vs hash too
  13:07 <jnewbery> So q1: why are locks added? Are they needed?
  13:08 <jkczyz> I suppose it would be needed if populating and querying are done on separate threads, which is not in (3) but I assume so in (1) and (2)
  13:09 <theStack> they are needed because LookupFilterHeader could also be called from a different thread -- currently an RPC call i think
  13:09 <jnewbery> jkczyz: yes. We don't want to read/write the same data in different threads without a lock
  13:10 <jnewbery> theStack: good spot. I think we can call LookupFilterHeader() from both the message handler thread (net processing) and an RPC thread
  13:10 <jnewbery> so without a lock there could be a race
  13:11 <jnewbery> I guess I should say http helper thread, since after https://github.com/bitcoin/bitcoin/pull/17631, the filter headers will also be fetchable over REST
  13:12 <jkczyz> so needed in (3) actually
  13:12 <jnewbery> I think implementation (2) strictly doesn't need a lock since all the reading/writing of the cache happens in the message handler thread
  13:12 <jnewbery> jkczyz: right
  13:13 <jnewbery> ok, onto q2. The first implementation could cause a data race on shutdown (full log). Can you find where the data race is?
  13:14 <jnewbery> (and as always, feel free to ask any questions any time. I only put the set questions out to prompt discussion)
  13:15 <theStack> i saw that the original implementation used std::mutex, while in the codebase mostly our custom (Recursive)Mutex seems to be used -- is there any guide on when to use what?
  13:16 <jnewbery> MarcoFalke is probably best placed to answer this, if he's still around
  13:17 <jnewbery> RecursiveMutex allows us to take locks multiple times, which happens in some places. I think a long term goal might be to get rid of that usage pattern?
  13:18 <aj> our custom Mutex (and RecursiveMutex) are just the standard ones, with extra code to debug lock order violations (which might result in deadlocks) and to support clang's thread safety compile-time checks
  13:20 <jnewbery> The data race on shutdown was in a UpdatedBlockTip() callback. See L5727 of https://travis-ci.org/github/bitcoin/bitcoin/jobs/665594933. I don't know exactly why that happens, but I guess it's what caused jimpo to move to implementation (2), which doesn't use that callback
  13:20 <theStack> aj: ok, so i deduce from that that it's generally a good idea to use them
  13:21 <jnewbery> perhaps it'd be a good idea to talk about performance. theStack, I see you've done some tests. Can you summarise what you did?
  13:21 <sipa> for a mutex which is a private field in a class, which is not exposed to any external modules, shouldn't need our lock assertion wrappers
  13:23 <aj> sipa: and none of the class's member functions take any other locks, or call functions that take any other locks? otherwise you could still get lock order violations?
  13:23 <fjahr> here is the change he pushed after sipa pointed out the data race: https://github.com/bitcoin/bitcoin/compare/f68aa2c1a40afb49969b9120ec43e2a5040724f1..7608b32fc27254c84c16090af9a32d6bbdf60ec7#diff-eff7adeaec73a769788bb78858815c91
  13:23 <theStack> jnewbery: sure. basically the idea was to repeatedly send getcfcheckpt requests to a full node (as it has enough blocks to take a substantial amount of time, compared to e.g. a regtest node) and measure how long it would take -- with and without the cache
  13:24 <theStack> i summarized the results here, but they are not very exciting, there doesn't seem to be much difference: https://github.com/bitcoin/bitcoin/pull/18960#issuecomment-628761578
  13:24 <theStack> maybe it's my testing approach though :)
  13:26 <jnewbery> theStack: thanks for sharing. I wonder how much of the time is parsing/serialization/message sending overhead
  13:27 <jnewbery> I think a better test might be to add LogPrint statements before and after the call to ProcessGetCFCheckPt() and then check debug.log
  13:28 <theStack> jnewbery: yes the mentioned overhead could definitely distort the result; also the ping part of send_and_ping(...) is included in the time, maybe it would be better to directly wait for the cfcheckpt answer
  13:28 <theStack> jnewbery: ah, nice idea with the LogPrint statements!
  13:28 <fjahr> testing this in isolation like this guarantees that the leveldb cache will be hit in the version where there is no in-memory cache, right? I am not sure if that would always be the case for a node running in the network.
  13:29 <fjahr> but probably more likely the more adoption we see, so I guess it's ok?
  13:29 <jnewbery> fjahr: right. I think that's where testing it gets tricky. I don't know where other caching happens
  13:31 <jnewbery> Perhaps running this on a mainnet node and periodically sending getcfcheckpt messages is a more realistic test
  13:31 <jkczyz> theStack: did you notice much difference between the first and subsequent runs? One difference between the implementations is (3) needs to warm up the cache on the initial fetch
  13:32 <jkczyz> I guess (2) does as well
  13:33 <jnewbery> jkczyz: right. (2) and (3) need an initial getcfcheckpt request to warm the cache. (1) needs a new block to warm the cache
  13:33 <theStack> jkczyz: not really, on the posted result the request for up to block 630000 (i.e. 630 checkpoint headers) takes initially 50.92ms and the average of 99 more subsequent requests is 50.8ms
  13:34 <jnewbery> It's possible to modify (3) to automatically fill the cache on startup, but I didn't think it was worth adding the code just to make the first getcfcheckpt response faster
  13:35 <jnewbery> theStack: I expect much of that 50ms is overhead. If you think about how the message gets sent from the python test framework to the node, and then added to the peer's receive buffer, and then the buffer processed on a message handler thread loop, that's got to be a lot of it.
  13:36 <theStack> jnewbery: yes, that's also my assumption here
  13:36 <jnewbery> There are 100ms sleeps in the message handler thread between looping through all peers: https://github.com/jnewbery/bitcoin/blob/7ad83ed252914d6d5b8ed81f103aecf052c68fb7/src/net.cpp#L2061
  13:37 <thomasb06> where are the corresponding tests?
  13:37 <jnewbery> on a non-busy node, 50ms is the average time you have to wait for the message handler thread to wake up
  13:37 <theStack> it's a pity that there is no easy way to just create a node with a huge block count
  13:37 <theStack> jnewbery: oh, that explains a lot then
  13:38 <jnewbery> thomasb06: https://github.com/jnewbery/bitcoin/blob/7ad83ed252914d6d5b8ed81f103aecf052c68fb7/test/functional/p2p_blockfilters.py#L71
  13:38 <thomasb06> thanks
  13:38 <jnewbery> theStack: yeah, I only just realised that myself
  13:38 <theStack> then i guess your suggested approach with logging directly in the C++ code itself is really the way to go; given that the time resolution in the log files is small enough
  13:39 <theStack> (if not, one can surely find another way to directly measure the elapsed time)
  13:39 <jnewbery> yes, time resolution in the logging files can be microseconds
  13:40 <jnewbery> take a look at https://github.com/jnewbery/bitcoin/blob/7ad83ed252914d6d5b8ed81f103aecf052c68fb7/test/functional/test_framework/test_node.py#L355. Perhaps that can be modified to take timestamps from the debug log.
  13:40 <jnewbery> it's a way for the test framework to read the node's debug log during a test
  13:43 <jkczyz> Is the cache recommended in BIP 157? I couldn't find it just now but thought I had read about it awhile back
  13:43 <jnewbery> ok, final question from me: Which approach do you prefer? Why? 1/2 are acceptable answers :)
  13:43 <theStack> jnewbery: good idea. i was thinking about manually analyzing the log in the worst case, but of course if that can be automated as well it'd be awesome
  13:43 <jnewbery> jkczyz: I thought I remembered reading it in a bip or somewhere else, but I couldn't see it last time I checked
  13:44 <theStack> jkczyz: i don't think BIPs usually describe implementation details about performance improvements in such a detailed manner
  13:44 <jkczyz> I like (3) for its simplicity (no need to worry about re-orgs) and because it separates caching from the network layer
  13:45 <jnewbery> It doesn't seem to be in the BIP history, so maybe I imagined it
  13:45 <jnewbery> ah. Here it is: https://github.com/bitcoin/bips/blob/master/bip-0157.mediawiki#node-operation: "They also SHOULD keep every checkpoint header in memory, so that getcfcheckpt requests do not result in many random-access disk reads."
  13:46 <thomasb06> talking about node operation, what are filters for?
  13:47 <theStack> jnewbery: also prefer (3) for its absolute simplicity! from an algorithmic perspective, the cache access in (1)/(2) happens in O(1) though, while in (3) it's O(log N), but considering the small number of elements in the cache, this doesn't really matter
  13:47 <fjahr> I also prefer (3) for its simplicity
  13:47 <michaelfolkson> Yeah 3 does seem better. Did you write it from scratch jnewbery or did you look at the first two and think there are better ways to do this?
  13:48 <michaelfolkson> thomasb06: Filters are for light clients. So they can work out which blocks they need to request from full nodes
  13:48 <thomasb06> michaelfolkson: ok
  13:49 <nehan> nice property of (3) (i think) is that the cache is only populated if this feature is actually used (requests are made)
  13:49 <jnewbery> theStack: BIPs do often go into implementation details that are not strictly necessary to the protocol. I'd argue that the Schnorr BIP's section on signing: https://github.com/bitcoin/bips/blob/master/bip-0340.mediawiki#default-signing, which documents lots of footguns, is similar in some ways.
  13:51 <jnewbery> michaelfolkson: moving the cache to the indexer was one of my requests in the original PR: https://github.com/bitcoin/bitcoin/pull/16442#issuecomment-555256244, so I wanted to implement it to see what it would look like
  13:51 <nehan> i had a question about (3): the lock is taken out twice in LookupFilterHeader(). what might have changed between those two locks? https://github.com/bitcoin-core-review-club/bitcoin/commit/7ad83ed252914d6d5b8ed81f103aecf052c68fb7#diff-5ad30723a078a5a4b5f92c9aa0369ac2R412
  13:52 <thomasb06> what is a DoS attack, denial of service?
  13:52 <nehan> could the header have been inserted somewhere else?
  13:52 <michaelfolkson> thomasb06: Yes
  13:53 <jnewbery> nehan: you're right that the cache is only populated if it's used in implementation (3). I think that's also true in implementation (2), but not (1), which will always fill the cache when the chain advances
  13:53 <jkczyz> nehan: I think it may be to avoid holding the lock when it is not necessary to do so
  13:53 <jnewbery> nehan: that's a great question! Does anyone have any thoughts?
  13:54 <thomasb06> michaelfolkson: so what's wrong if nodes generate filters dynamically?
  13:54 <jnewbery> when you say 'could the header have been inserted somewhere else?' do you mean in a different thread?
  13:54 <nehan> jnewbery: yes
  13:54 <theStack> i'd say it's generally a good practice to not hold locks longer than necessary
  13:55 <nehan> releasing and taking out a lock again in the same function means the world might have changed out from under you.
  13:55 <michaelfolkson> For the benefit of future code changes theStack?
  13:56 <sipa> michaelfolkson: to reduce latency of other code waiting on the lock
  13:56 <michaelfolkson> Ah ok
  13:56 <jnewbery> right. So we'd need to check where LookupFilterHeader() could be called from, which is net_processing (when a getcfcheckpt request is received) or rpc/blockchain (when the getblockfilter RPC is called)
  13:56 <jnewbery> so there is the possibility that there are two threads executing this function simultaneously. Is that a problem?
  13:57 <nehan> i guess they'd just both put the same thing in the map
  13:57 <jnewbery> does 'the world changing out from under us' cause a problem in this case?
  13:57 <fjahr> nehan: the operations don't depend on each other so I think it does not matter
  13:57 <nehan> fjahr: the operations definitely depend on each other. i'm not sure i understand what you're saying
  13:58 <michaelfolkson> thomasb06: I don't know what you mean in this example by "dynamically". What doc are you referring to?
  13:58 <nehan> fjahr: it's a get - no return - put paradigm. the thing might have been put in there by another thread
  13:59 <thomasb06> michaelfolkson: the Node Operation section: "Nodes SHOULD NOT generate filters dynamically on request, as malicious peers may be able to perform DoS attacks by requesting small filters derived from large blocks."
  13:59 <jnewbery> nehan: right. So we could have two threads not able to get from the cache and both going through this logic: https://github.com/jnewbery/bitcoin/blob/7ad83ed252914d6d5b8ed81f103aecf052c68fb7/src/index/blockfilterindex.cpp#L405-L410
  13:59 <jnewbery> and then both trying to put in the cache
  13:59 <jnewbery> so the second thread tries to emplace the same object into the map that's already there
  14:00 <jnewbery> which isn't a problem. emplace() is a no-op on the map if there's already something with that key: https://en.cppreference.com/w/cpp/container/map/emplace
  14:00 <jnewbery> That was a great question. I think it probably deserves a code comment to explain that.
  14:01 <nehan> jnewbery: yes that is a scary code pattern to me
  14:02 <michaelfolkson> thomasb06: I think that is explained in the preceding paragraph. Nodes should generate filters and persist them rather than generate them on request
  14:02 <jnewbery> There's no harm in locking for the entire function if you think that's better. There's going to be ~0 lock contention
  14:02 <fjahr> so the second pattern does not fail depending on the first pattern, that's what I meant
  14:02 <jnewbery> since it's only when a P2P message and RPC request arrive at the same time
  14:02 <jnewbery> ok, that's time. Thanks for participating, everyone!
  14:02 <jnewbery> See you all next week
  14:02 <nehan> jnewbery: less cognitive overhead for the next developer who reads the code if you just lock the whole function
  14:02 <jkczyz> drawback is you need to hold the lock even when not accessing the cache, unless refactoring and duplicating a little code
  14:02 <nehan> thanks!
  14:03 <jnewbery> and don't forget to leave your comments/review on the PR
  14:03 <jkczyz> thanks, jnewbery!
  14:03 <fjahr> Thanks jnewbery!
  14:03 <theStack> thanks for hosting jnewbery!
  14:03 <Caralie> thank you!
  14:03 <troygiorshev> thanks jnewbery
  14:03 <jnewbery> #endmeeting