The PR branch HEAD was 24f3936 at the time of this review club meeting.
Notes
When pruning is enabled in Bitcoin Core, data about old blocks is deleted to limit disk space usage.
Users can configure a pruning target with the -prune=<target in MB> argument, which defines how much disk space to use for block and undo data.
The minimum target is 550MB.
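As a minimal sketch of how such a limit might be enforced (hypothetical C++ with illustrative names, not Bitcoin Core's actual init code):

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical validation of -prune=<target in MB>; the constant and
// function names are illustrative, not Bitcoin Core's identifiers.
constexpr uint64_t MIN_PRUNE_TARGET_MB{550};

uint64_t ParsePruneTarget(const std::string& arg)
{
    const uint64_t target_mb{std::stoull(arg)}; // value of -prune=<n>
    if (target_mb < MIN_PRUNE_TARGET_MB) {
        throw std::runtime_error("Prune target cannot be below 550 MB");
    }
    return target_mb * 1024 * 1024; // track the target internally in bytes
}
```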
Bitcoin Core keeps a write buffer of UTXOs (aka dbcache).
Without the buffer, creating a UTXO and later spending it would each cause a write to disk.
Since UTXOs are often short-lived, modifying the buffer is much faster than writing to disk.
Reading from the buffer is also cheaper than looking up UTXOs on disk.
The buffer is flushed to disk, for example, when it grows too large.
Depending on the buffer size, flushes can take a while.
Node operators can control the buffer size with the -dbcache=<size in MB> argument.
A larger buffer takes up more system memory but takes longer to fill and thus requires fewer flushes.
This speeds up the initial block download (IBD).
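To see why the buffer helps, consider this toy write-back coin cache (a simplified model for illustration, not Bitcoin Core's actual CCoinsViewCache): a UTXO that is created and spent between two flushes never causes any disk operation.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Toy write-back coin cache. "Fresh" entries were created since the last
// flush, so the disk has never seen them; spending a fresh coin simply
// drops it from memory, and no disk write ever happens for it.
struct ToyCoinCache {
    struct Entry {
        std::optional<int64_t> amount; // nullopt = spent
        bool fresh;                    // created since the last flush
    };
    std::map<std::string, Entry> cache;  // dirty in-memory entries
    std::map<std::string, int64_t> disk; // stand-in for the on-disk database

    void AddCoin(const std::string& outpoint, int64_t amount)
    {
        cache[outpoint] = Entry{amount, /*fresh=*/true};
    }

    void SpendCoin(const std::string& outpoint)
    {
        auto it = cache.find(outpoint);
        if (it != cache.end() && it->second.fresh) {
            cache.erase(it); // created and spent in-cache: no disk work at all
        } else {
            cache[outpoint] = Entry{std::nullopt, /*fresh=*/false}; // deletion applied at flush
        }
    }

    void Flush()
    {
        for (const auto& [outpoint, entry] : cache) {
            if (entry.amount) {
                disk[outpoint] = *entry.amount; // one write per surviving coin
            } else {
                disk.erase(outpoint); // apply recorded deletions
            }
        }
        cache.clear();
    }
};
```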
Pruning is one reason to flush the dbcache regardless of its memory usage.
Because pruning forces these flushes, the configured maximum dbcache size is often never reached on pruned nodes.
This PR changes the pruning behavior during IBD.
Previously, we would prune just enough files to be able to continue with IBD.
Now, we aggressively prune all prunable files, so IBD can continue for longer before the next prune is needed.
Fewer prunes also mean fewer dbcache flushes, potentially speeding up IBD for pruned nodes.
The dbcache can grow closer to its configured maximum before it is flushed; a sketch of the change follows.
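In rough outline, the PR adjusts the buffer used in FindFilesToPrune like this (a paraphrase with approximate names; see the PR diff for the exact code):

```cpp
#include <cstdint>

// Paraphrased sketch of the PR's buffer adjustment in FindFilesToPrune.
// The buffer is the headroom kept under the prune target when selecting
// blk/rev files to delete; growing it during IBD means a single prune
// event removes everything that will not be kept anyway.
uint64_t PruneBuffer(uint64_t base_buffer, bool is_ibd,
                     int last_prune_height, int chain_tip_height)
{
    uint64_t buffer{base_buffer};
    if (is_ibd) {
        // The PR assumes blocks average 1 MB; this only matters during IBD.
        constexpr uint64_t average_block_size{1000000}; // 1 MB
        const uint64_t remaining_blocks{
            static_cast<uint64_t>(last_prune_height - chain_tip_height)};
        buffer += average_block_size * remaining_blocks;
    }
    return buffer;
}
```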
PR #12404 attempted aggressive pruning too, but was closed in favor of PR #11658.
PR #11658 instead added 10% of the prune target to the nBuffer during IBD; with the minimum prune target of 550MB, for example, that is 55MB of headroom.
That approach is superseded by PR #20827, as the comparison below illustrates.
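For a sense of scale, a small comparison of the two heuristics (illustrative code, not taken from either PR): with the minimum prune target of 550MB and, say, 300,000 blocks left to download, the old approach reserves about 55MB of headroom, while the new one reserves far more than the prune target itself, which in practice means everything prunable is pruned immediately.

```cpp
#include <cstdint>
#include <iostream>

int main()
{
    const uint64_t prune_target{550ull * 1024 * 1024}; // -prune=550
    const uint64_t blocks_remaining{300'000};          // example IBD position

    // PR #11658: a fixed 10% of the prune target during IBD.
    const uint64_t buffer_old{prune_target / 10}; // ~55 MB

    // PR #20827: ~1 MB for every block still expected.
    const uint64_t buffer_new{blocks_remaining * 1'000'000}; // ~300 GB

    std::cout << "old IBD buffer: " << buffer_old << " bytes\n"
              << "new IBD buffer: " << buffer_new << " bytes\n";
}
```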
Questions
What does this PR do? What is the goal of this PR?
Where in the code do we check if we need to prune old block data? (hint: look for usages of the FindFilesToPrune function)
What is removed during pruning and under which conditions? What is not pruned?
The PR assumes 1MB for average_block_size. How accurate does this assumption have to be?
The PR description mentions IBD speed improvements for pruned nodes. What can we measure to benchmark the improvement? With which prune targets and dbcache sizes should we test?
Edge case: Is aggressively pruning during IBD a problem if there are longer forks in the chain?
<b10c> Today we are talking about PR 20827 "During IBD, prune as much as possible until we get close to where we will eventually keep blocks" https://github.com/bitcoin/bitcoin/pull/20827
<b10c> cool! the diff is only a few lines, this one is more about understanding how pruning in Bitcoin Core works. Let's dive right in with the questions, but feel free to ask questions any time!
<shapleigh1842> context Q: during an IBD on a "pruned" node, does the node still download the entire blockchain, albeit verifying and pruning as it goes?
<shapleigh1842> so just browsing this diff [and I'm sure I could look this up in a readme] it looks like the bitcoin codebase standard is to only provide parameter comments for [out] parameters? (i.e. no comments required for normal params or return?)
<sipa> michaelfolkson: It was introduced in commit f9ec3f0fadb11ee9889af977e16915f5d6e01944 in 2015, which introduced pruning in the first place. Even then the files were called rev*.dat.
<b10c> svav scavr: correct! I somehow assumed it would be done when connecting a new block, but I guess we call FlushStateToDisk often enough (but don't actually flush the cache)
<michaelfolkson> I think of blocks as transactions (rather than UTXO diffs) and deleting transactions but this is deleting from a UTXO database right? Effectively spent txo
<sipa> Pruning is literally deleting the block files (blk*.dat) and undo files (rev*.dat) from disk, nothing more. It does not touch the UTXO set, and doesn't delete anything from any database.
<michaelfolkson> The downside of having a big dbcache is that when it fills it takes longer to flush so time trade-offs I'm guessing. Saves time overall as infrequent flushing
<Kaizen_Kintsugi_> I'm surprised at this as well; intuitively, with my limited knowledge, I come to the conclusion that the average block size could be computed.
<b10c> agree with both of you, yes. It doesn't matter for the early blocks and it's an OK assumption for the later blocks. Might leave us with one more or one fewer set of blk/rev dat files when IBD is done
<b10c> The PR description mentions IBD speed improvements for pruned nodes. What can we measure to benchmark the improvement? With which prune targets and dbcache sizes should we test?
<b10c> since you still flush quite often with the 550 prune target, but if you can download 10GB and only need to flush (for pruning) once, that's a lot better than before