During IBD, prune as much as possible until we get close to where we will eventually keep blocks (validation)

https://github.com/bitcoin/bitcoin/pull/20827

Host: 0xB10C  -  PR author: luke-jr

The PR branch HEAD was 24f3936 at the time of this review club meeting.

Notes

  • When pruning is enabled in Bitcoin Core, data about old blocks is deleted to limit disk space usage. Users can configure a pruning target with the -prune=<target in MB> argument defining how much disk space to use for block and undo data. The minimum target is 550MB.

  • Bitcoin Core keeps a write buffer of UTXOs (aka dbcache). If the buffer didn’t exist, creating a UTXO and deleting a UTXO would both cause a write operation to disk. As UTXOs are often short-lived, modifying the buffer is a lot faster than writing to disk. Reading from the buffer is also cheaper than looking UTXOs up on disk. The buffer is flushed to disk, for example, when it grows too large. Depending on the buffer size, flushes can take a while.
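
    A minimal illustrative sketch of this write-back idea (the names here are invented; this is not Bitcoin Core's actual CCoinsViewCache):

        // A coin that is created and spent while still buffered never
        // touches the database at all.
        #include <string>
        #include <unordered_map>

        struct Entry {
            std::string coin;  // stand-in for the real coin record
            bool fresh;        // created since the last flush; unknown to the DB
        };

        class UtxoBuffer {
            std::unordered_map<std::string, Entry> m_cache;  // outpoint -> entry
        public:
            void AddCoin(const std::string& outpoint, std::string coin) {
                m_cache[outpoint] = {std::move(coin), /*fresh=*/true};  // no disk write
            }
            void SpendCoin(const std::string& outpoint) {
                auto it = m_cache.find(outpoint);
                if (it != m_cache.end() && it->second.fresh) {
                    m_cache.erase(it);  // short-lived UTXO: the DB never saw it
                    return;
                }
                // otherwise, record a deletion to apply at the next flush (omitted)
            }
            void Flush() {
                // write all buffered changes to disk in one batch, then start over
                m_cache.clear();
            }
        };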

  • Node operators can control the buffer size with the -dbcache=<size in MB> argument. A larger buffer takes up more system memory but takes longer to fill and thus requires fewer flushes. This speeds up the initial block download (IBD).
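
    For example, a bitcoin.conf for a node that prunes to roughly 2 GB of block data while letting the UTXO buffer grow to about 1 GB (the values are arbitrary, for illustration):

        # bitcoin.conf
        prune=2000    # keep at most ~2000 MB of block and undo data
        dbcache=1000  # let the UTXO write buffer grow to ~1000 MB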

  • Pruning is one reason for us to flush the dbcache regardless of its memory usage: every prune event forces a flush, so the maximum configured dbcache size is often never reached.

  • This PR changes the pruning behavior. Previously, we’d prune just enough files to be able to continue IBD. We now aggressively prune all prunable files, which lets us continue IBD without having to prune again too soon. Fewer prunes also mean fewer dbcache flushes, potentially speeding up IBD for pruned nodes. Higher dbcache sizes can be reached before the dbcache is flushed.

  • PR #12404 attempted aggressive pruning too, but was closed in favor of PR #11658. PR #11658 added 10% of the prune target to the nBuffer used in the pruning check; that headroom is replaced by PR #20827, as sketched below.
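
    A simplified sketch of the buffer logic in BlockManager::FindFilesToPrune() (variable names approximate the real code; remaining_blocks stands for the number of blocks the node still expects to download):

        uint64_t nBuffer = BLOCKFILE_CHUNK_SIZE + UNDOFILE_CHUNK_SIZE;  // 16 MB + 1 MB
        if (is_ibd) {
            // Before #20827 (from #11658): leave 10% of the prune target as
            // headroom so the next prune/flush doesn't happen too soon:
            //     nBuffer += nPruneTarget / 10;

            // With #20827: reserve ~1 MB for each block still to be synced,
            // so a single prune frees as much space as possible.
            const uint64_t average_block_size = 1000000;  // 1 MB assumption
            nBuffer += average_block_size * remaining_blocks;
        }
        // Block/undo files are then pruned (oldest first) while
        // nCurrentUsage + nBuffer >= nPruneTarget, never touching files with
        // blocks within MIN_BLOCKS_TO_KEEP (288) of the chain tip.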

Questions

  1. What does this PR do? What is the goal of this PR?

  2. Where in the code do we check if we need to prune old block data? (hint: look for usages of the FindFilesToPrune function)

  3. What is removed during pruning and under which conditions? What is not pruned?

  4. What is the variable nBuffer in BlockManager::FindFilesToPrune() being used for? How large is the buffer (in MB)?

  5. The PR assumes 1MB for average_block_size. How accurate does this assumption have to be?

  6. The PR description mentions IBD speed improvements for pruned nodes. What can we measure to benchmark the improvement? With which prune targets and dbcache sizes should we test?

  7. Edge case: Is aggressively pruning during IBD a problem if there are longer forks in the chain?

Meeting Log

  18:00 <b10c> #startmeeting
  18:00 <b10c> Welcome to the last Bitcoin Core review club meeting of 2021!
  18:00 <b10c> Feel free to say hi!
  18:00 <shapleigh1842> hi!
  18:00 <scavr> hi
  18:00 <b10c> Today we are talking about PR 20827 "During IBD, prune as much as possible until we get close to where we will eventually keep blocks" https://github.com/bitcoin/bitcoin/pull/20827
  18:01 <b10c> Notes are on https://bitcoincore.reviews/20827
  18:01 <michaelfolkson> hi
  18:01 <b10c> anyone got a chance to have a look at this over the holidays?
  18:02 <scavr> yes read the notes and had a look at the changes
  18:02 <svav> Hi
  18:03 <michaelfolkson> Yup read the notes too
  18:04 <b10c> cool! the diff is only a few lines, this one is more about understanding how pruning in Bitcoin Core works. Let's dive right in with the questions, but feel free to ask questions any time!
  18:04 <b10c> What does this PR do? What is the goal of this PR?
  18:05 <svav> I read the notes too ...
  18:05 <scavr> The goal is to optimize the pruning strategy during initial block download
  18:06 <b10c> svav: was everything clear? any questions?
  18:06 <michaelfolkson> More aggressive pruning to speed up IBD for a pruned node
  18:06 <svav> Why was this PR felt necessary?
  18:07 <b10c> scavr michaelfolkson: correct! What do we prune now that we previously didn't prune?
  18:07 <michaelfolkson> IBD speedups are always good. I was more unsure why there wasn't already aggressive pruning in the original code
  18:07 <svav> I would say if the IBD acronym is used, it should be defined once as Initial Block Download for clarity for newbies
  18:08 <shapleigh1842> ^^yes I just figured this acronym out
  18:08 <michaelfolkson> ^
  18:09 <b10c> svav: performance improvements are always welcome. This helps people running Bitcoin Core on lower-end hardware, e.g. Raspberry Pis
  18:09 <michaelfolkson> b10c: Just more blocks right? blk.dat and rev.dat files
  18:09 <svav> b10c: and do we know how significant a performance increase this gives?
  18:09 <shapleigh1842> context Q: during an IBD on a "pruned" node, does the node still download the entire blockchain, albeit verifying and pruning as it goes?
  18:10 <sipa> IBD is no different from normal synchronization really. It just changes some heuristics/policies.
  18:10 <sipa> All blocks are still downloaded and verified the same, whether IBD or not.
  18:11 <michaelfolkson> Random q: Are rev.dat files what undo.dat files used to be called?
  18:11 <shapleigh1842> sipa: thank you.
  18:12 <b10c> michaelfolkson: yes! we just free up more space once we decide to prune
  18:12 <sipa> @michaelfolkson I can't remember Bitcoin Core ever having had an undo.dat file.
  18:13 <b10c> I think it's called undo data in the code and the files are called rev*.dat (?)
  18:13 <sipa> Yeah, rev*.dat files contain undo data.
  18:13 <michaelfolkson> sipa: This comment refers to undo.dat files https://github.com/bitcoin/bitcoin/blob/8c0bd871fcf6c5ff5851ccb18a7bc7554a0484b0/src/validation.h#L405
  18:14 <b10c> hm this should probably be rev*.dat files, not sure if this got changed at some point
  18:15 <b10c> Next question: Where in the code do we check if we need to prune old block data?
  18:15 <sipa> That's a typo I think.
  18:15 <sipa> In what branch do you see that?
  18:15 <michaelfolkson> Master
  18:16 <svav> validation.cpp
  18:16 <scavr> we check it in CChainState::FlushStateToDisk before we check if we need to flush the dbcache
  18:17 <shapleigh1842> so just browsing this diff [and I'm sure I could look this up in a readme] it looks like the bitcoin codebase standard is to only provide parameter comments for [out] parameters? (i.e. no comments required for normal params or return?)
  18:18 <sipa> michaelfolkson: It was introduced in commit f9ec3f0fadb11ee9889af977e16915f5d6e01944 in 2015, which introduced pruning in the first place. Even then the files were called rev*.dat.
  18:19 <b10c> svav scavr: correct! I somehow assumed it's done when connecting a new block, but I guess we call FlushStateToDisk often enough (but don't actually flush the cache)
  18:20 <b10c> sipa: maybe they were called undo*.dat in a first iteration of the pruning feature, but got renamed during development
  18:20 <michaelfolkson> sipa: So a typo then. I'll open a PR to correct (or someone new can)
  18:21 <sipa> No, because the rev*.dat files predate pruning. I introduced the concept of rev*.dat files ;)
  18:21 <b10c> sipa: oh, I see :D
  18:21 <b10c> next question: Under which conditions do we prune and what is removed? What is not pruned?
  18:22 <scavr> we prune once disk_usage + buffer >= prune target
  18:23 <svav> Pruning will never delete a block within a defined distance (currently 288) from the active chain's tip.
  18:23 <scavr> and we stop once that's no longer the case
  18:23 <b10c> scavr svav: correct!
  18:24 <b10c> svav: to be clear, we don't delete a block file containing a block within 288 blocks of the tip
  18:25 <b10c> rev*.dat files are also pruned
  18:25 <michaelfolkson> There is a block index and a UTXO database in dbcache... https://bitcoin.stackexchange.com/questions/99098/what-is-in-the-bitcoin-core-leveldb-dbcache-is-it-full-records-or-metadata
  18:25 <michaelfolkson> The block index isn't deleted?
  18:26 <sipa> No, only the blocks.
  18:26 <sipa> We don't want to forget about old blocks, just their contents are forgotten.
  18:27 <b10c> There are flags in the index that indicate if we HAVE_BLOCK_DATA and HAVE_UNDO_DATA
  18:27 <b10c> I guess they are set to false when we prune?
  18:28 <sipa> Yeah, I believe so.
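
  For reference, the selection logic discussed above lives in BlockManager::FindFilesToPrune(); a condensed sketch (simplified, not the exact code):

      // Walk blk/rev file pairs from oldest to newest and mark prunable ones
      // for deletion while disk usage plus the buffer still exceeds the target.
      for (int fileNumber = 0; fileNumber < m_last_blockfile; ++fileNumber) {
          uint64_t nBytesToPrune = vinfoBlockFile[fileNumber].nSize
                                 + vinfoBlockFile[fileNumber].nUndoSize;
          if (vinfoBlockFile[fileNumber].nSize == 0) continue;  // already pruned
          if (nCurrentUsage + nBuffer < nPruneTarget) break;    // freed enough
          // never prune a file containing blocks within 288 blocks of the tip
          if (vinfoBlockFile[fileNumber].nHeightLast > nLastBlockWeCanPrune) continue;
          PruneOneBlockFile(fileNumber);  // also clears the "have data/undo" index flags
          setFilesToPrune.insert(fileNumber);
          nCurrentUsage -= nBytesToPrune;
      }
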
  18:28 <b10c> What is the variable nBuffer in BlockManager::FindFilesToPrune() being used for? How large is the buffer (in MB)?
  18:29 <sipa> In the very first commit that added undo files they were called "<HEIGHT>.und", actually: 8adf48dc9b45816793c7b98e2f4fa625c2e09f2c.
  18:29 <michaelfolkson> I think of blocks as transactions (rather than UTXO diffs) and deleting transactions but this is deleting from a UTXO database right? Effectively spent txo
  18:29 <sipa> No, pruning is unrelated to the UTXO set.
  18:30 <scavr> the nBuffer and the current disk usage are summed up when checking if the prune target has been reached
  18:30 <sipa> The UTXO set is already UTXO: it only contains spent outputs already.
  18:30 <sipa> unspent, sorry
  18:31 <b10c> another acronym worth mentioning: UTXO = unspent transaction output
  18:31 <sipa> Pruning is literally deleting the block files (blk*.dat) and undo files (rev*.dat) from disk, nothing more. It does not touch the UTXO set, and doesn't delete anything from any database.
  18:31 <michaelfolkson> Ok thanks
  18:31 <sipa> (apart from marking the pruned blocks as pruned in the database).
  18:31 <b10c> scavr: do you know how big the buffer is?
  18:32 <scavr> it starts as 17 MB (16MB block chunk size + 1 MB undo chunk size)
  18:34 <b10c> correct! we pre-allocate the files in chunks so we want to keep this as a buffer
  18:35 <scavr> when in IBD we add 10% of the prune target to the buffer
  18:35 <scavr> so 17MB + 55MB = 72MB?
  18:35 <scavr> with a prune target of 550MB
  18:36 <michaelfolkson> 550 being the minimum
  18:36 <scavr> this is similar to PR 20827, also an optimization, right?
  18:36 <b10c> yes! that's my understanding too. why?
  18:37 <b10c> yep, we leave a somewhat bigger buffer to avoid having to prune again too soon (causing another dbcache flush)
  18:38 <b10c> With 20827 the buffer will be 17 MB + `number of blocks left to sync` * 1 MB
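
  Plugging in numbers for the minimum 550 MB prune target (the 200,000 blocks left to sync is an illustrative assumption):

      before #20827:  nBuffer = 17 MB + 550 MB / 10      =  72 MB
      with   #20827:  nBuffer = 17 MB + 200,000 * 1 MB   ≈ 200 GB

  Since the new buffer dwarfs any realistic prune target until the node gets close to the tip, each prune event deletes everything prunable.
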
  18:39 <michaelfolkson> The downside of having a big dbcache is that when it fills it takes longer to flush so time trade-offs I'm guessing. Saves time overall as infrequent flushing
  18:39 <sipa> The point is that any pruning involves flushing, whether it's deleting a small or large amount of blocks.
  18:40 <b10c> michaelfolkson: yes, with a large dbcache you could do IBD without ever flushing
  18:40 <sipa> And frequent flushing is what kills IBD performance.
  18:40 <sipa> So by deleting more when we prune, the pruning (and thus flushing) operation needs to be done less frequently.
  18:40 <michaelfolkson> Frequent flushing of small dbcaches is a lot worse than infrequent flushing of big dbcaches, right
  18:41 <sipa> The size isn't relevant. If you flush frequently, the dbcache just won't grow big.
  18:41 <b10c> not really about dbcache size in this PR, more about number of flushes we do
  18:41 <sipa> Speed is gained by having lots of things cached. When you flush, the cache is empty.
  18:42 <sipa> These two are related: the longer you go without flushing, the bigger the database cache grows.
  18:42 <sipa> The dbcache parameter sets when you have to forcefully flush because otherwise the cache size would exceed the dbcache parameter.
  18:42 <b10c> The PR assumes 1MB for average_block_size. How accurate does this assumption have to be?
  18:44 <michaelfolkson> b10c: Presumably not very accurate :)
  18:44 <scavr> i don't know but guess not too accurate as many blocks are >1MB
  18:44 <michaelfolkson> The average for the first few years must have been much lower than 1MB
  18:44 <michaelfolkson> The start of IBD
  18:45 <michaelfolkson> Today's blocks are consistently much bigger than 1MB, 1.6MB average?
  18:46 <Kaizen_Kintsugi_> I'm surprised at this as well, intuitively with my limited knowledge, I come to the conclusion that the average block size could be computed.
  18:46 <svav> b10c: Were people having performance problems that made this PR necessary? When is it hoped that this PR will be implemented?
  18:47 <b10c> agree with both of you, yes. It doesn't matter for the early blocks and it's an OK assumption for the later blocks. Might leave us with one more or one fewer set of blk/rev dat files when IBD is done
  18:47 <sipa> I mean, what counts as "performance problem"? Faster is always nicer, no?
  18:47 <michaelfolkson> I think if the estimate was totally wrong e.g. 0.1 MB or 6 MB it would be a problem
  18:47 <sipa> And certainly on some hardware IBD is painfully slow... hours to days to weeks sometimes
  18:47 <b10c> It is totally wrong for regtest, signet and testnet
  18:48 <michaelfolkson> Hmm but on the high side right
  18:48 <michaelfolkson> So just a massive underestimate would be a problem?
  18:49 <b10c> Kaizen: computing would be possible for sure, but this would probably be too complex here
  18:49 <michaelfolkson> So 0.1 MB would be a problem but 6MB wouldn't be a problem?
  18:50 <michaelfolkson> Just less efficient
  18:50 <b10c> e.g. on testnet you could finish IBD with only one pair of blk/rev files left when we prune just before we catch up with the tip
  18:51 <b10c> as we assume the next blocks will all be 1MB each and make space for them
  18:52 <michaelfolkson> So the pruning is too aggressive for testnet
  18:52 <b10c> could maybe even be a problem if there is a big reorg on testnet as we can't reverse to a previous UTXO set state?
  18:52 <b10c> michaelfolkson: yes, maybe. I need to think about this a bit more
  18:53 <sipa> During IBD there shouldn't be reorgs in the first place.
  18:53 <sipa> Which is presumably the justification why more aggressive pruning is acceptable there.
  18:53 <b10c> I mean after
  18:53 <sipa> Oh, oops.
  18:54 <b10c> assume we just finished IBD with only a few blocks+undo data left on disk
  18:54 <b10c> (you answered question 7 :) )
  18:55 <b10c> we do a headers-first sync, so we don't download any blocks from stale chains
  18:56 <b10c> (that's the answer to question 7, let's do question 6 now):
  18:56 <b10c> The PR description mentions IBD speed improvements for pruned nodes. What can we measure to benchmark the improvement? With which prune targets and dbcache sizes should we test?
  18:56 <michaelfolkson> A non-zero very low probability of having to deal with re-orgs during IBD. Even lower probability with headers-first sync
  18:57 <michaelfolkson> (if someone provides wrong headers)
  18:57 <scavr> ohh I didn't know about headers-first sync. you mean once we have the longest chain we ignore forks in IBD?
  18:58 <b10c> yes!
  18:58 <sipa> even with headers-first sync we download blocks simultaneously with headers
  18:58 <sipa> but we only fetch blocks along the path of what we currently know to be the best headers chain
  18:59 <sipa> and i'm not sure "probability" is the issue to discuss here; you can't *randomly* end up with an invalid headers chain if your peers are honest
  18:59 <sipa> if your peers are dishonest however, it is possible, but that's a question of attack cost, not probability
  18:59 <svav> To benchmark the improvement, can we measure IBD download time?
  19:00 <michaelfolkson> The header has the PoW in it.. so massive attack cost :)
  19:00 <b10c> svav: yes!
  19:00 <shapleigh1842> We should measure / benchmark IBD sync time with low-cost hardware and default dbcache and other settings
  19:01 <b10c> yup, preferably with different prune targets (more important) and dbcache sizes
  19:01 <shapleigh1842> yeah, well def with the minimum 550
  19:02 <b10c> I'd assume this has a bigger effect for people pruning with larger prune targets
  19:03 <b10c> since you still flush quite often with the 550 prune target, but if you can download 10GB and only need to flush (for pruning) once, that's a lot better than before
  19:03 <sipa> It should mostly matter for people with large prune target and large dbcache, I think.
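
  One possible benchmark setup (illustrative; -connect, -prune, -dbcache, and -stopatheight are existing bitcoind options, while the peer address and height are placeholders): time a fresh sync to a fixed height from a local, already-synced node, varying the prune target and dbcache size:

      $ time bitcoind -datadir=/tmp/bench -connect=192.168.0.2 \
            -prune=550 -dbcache=450 -stopatheight=500000
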
  19:03 <b10c> Ok, time to wrap up! Thanks for joining, I wish everyone a happy new year.
  19:04 <Kaizen_Kintsugi_> Thanks for hosting!
  19:04 <b10c> #endmeeting