During IBD, prune as much as possible until we get close to where we will eventually keep blocks (validation)

https://github.com/bitcoin/bitcoin/pull/20827

Host: 0xB10C  -  PR author: luke-jr

The PR branch HEAD was 24f3936 at the time of this review club meeting.

Notes

  • When pruning is enabled in Bitcoin Core, data about old blocks is deleted to limit disk space usage. Users can configure a pruning target with the -prune=<target in MB> argument defining how much disk space to use for block and undo data. The minimum target is 550MB.

  • Bitcoin Core keeps a write buffer of UTXOs (aka dbcache). If the buffer didn’t exist, creating a UTXO and deleting a UTXO would both cause a write operation to disk. As UTXOs are often short-lived, modifying the buffer is a lot faster than writing to disk. Reading from the buffer is also cheaper than looking UTXOs up on disk. The buffer is flushed to disk, for example, when it grows too large. Depending on the buffer size, flushes can take a while.
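
    A minimal illustrative sketch of this write-back idea (the names here are invented; this is not Bitcoin Core's actual CCoinsViewCache):

        // A coin that is created and spent while still buffered never
        // touches the database at all.
        #include <string>
        #include <unordered_map>

        struct Entry {
            std::string coin;  // stand-in for the real coin record
            bool fresh;        // created since the last flush; unknown to the DB
        };

        class UtxoBuffer {
            std::unordered_map<std::string, Entry> m_cache;  // outpoint -> entry
        public:
            void AddCoin(const std::string& outpoint, std::string coin) {
                m_cache[outpoint] = {std::move(coin), /*fresh=*/true};  // no disk write
            }
            void SpendCoin(const std::string& outpoint) {
                auto it = m_cache.find(outpoint);
                if (it != m_cache.end() && it->second.fresh) {
                    m_cache.erase(it);  // short-lived UTXO: the DB never saw it
                    return;
                }
                // otherwise, record a deletion to apply at the next flush (omitted)
            }
            void Flush() {
                // write all buffered changes to disk in one batch, then start over
                m_cache.clear();
            }
        };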

  • Node operators can control the buffer size with the -dbcache=<size in MB> argument. A larger buffer takes up more system memory but takes longer to fill and thus requires fewer flushes. This speeds up the initial block download (IBD).
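
    For example, a bitcoin.conf for a node that prunes to roughly 2 GB of block data while letting the UTXO buffer grow to about 1 GB (the values are arbitrary, for illustration):

        # bitcoin.conf
        prune=2000    # keep at most ~2000 MB of block and undo data
        dbcache=1000  # let the UTXO write buffer grow to ~1000 MB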

  • Pruning is one reason for us to flush the dbcache regardless of its memory usage: every prune event forces a flush, so the maximum configured dbcache size is often never reached.

  • This PR changes the pruning behavior. Previously, we’d prune just enough files to be able to continue IBD. We now aggressively prune all prunable files, which lets us continue IBD without having to prune again too soon. Fewer prunes also mean fewer dbcache flushes, potentially speeding up IBD for pruned nodes. Higher dbcache sizes can be reached before the dbcache is flushed.

  • PR #12404 attempted aggressive pruning too, but was closed in favor of PR #11658. PR #11658 added 10% of the prune target to the nBuffer used in the pruning check; that headroom is replaced by PR #20827, as sketched below.
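
    A simplified sketch of the buffer logic in BlockManager::FindFilesToPrune() (variable names approximate the real code; remaining_blocks stands for the number of blocks the node still expects to download):

        uint64_t nBuffer = BLOCKFILE_CHUNK_SIZE + UNDOFILE_CHUNK_SIZE;  // 16 MB + 1 MB
        if (is_ibd) {
            // Before #20827 (from #11658): leave 10% of the prune target as
            // headroom so the next prune/flush doesn't happen too soon:
            //     nBuffer += nPruneTarget / 10;

            // With #20827: reserve ~1 MB for each block still to be synced,
            // so a single prune frees as much space as possible.
            const uint64_t average_block_size = 1000000;  // 1 MB assumption
            nBuffer += average_block_size * remaining_blocks;
        }
        // Block/undo files are then pruned (oldest first) while
        // nCurrentUsage + nBuffer >= nPruneTarget, never touching files with
        // blocks within MIN_BLOCKS_TO_KEEP (288) of the chain tip.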

Questions

  1. What does this PR do? What is the goal of this PR?

  2. Where in the code do we check if we need to prune old block data? (hint: look for usages of the FindFilesToPrune function)

  3. What is removed during pruning and under which conditions? What is not pruned?

  4. What is the variable nBuffer in BlockManager::FindFilesToPrune() being used for? How large is the buffer (in MB)?

  5. The PR assumes 1MB for average_block_size. How accurate does this assumption have to be?

  6. The PR description mentions IBD speed improvements for pruned nodes. What can we measure to benchmark the improvement? With which prune targets and dbcache sizes should we test?

  7. Edge case: Is aggressively pruning during IBD a problem if there are longer forks in the chain?

Meeting Log

  18:00 <b10c> #startmeeting
  18:00 <b10c> Welcome to the last Bitcoin Core review club meeting of 2021!
  18:00 <b10c> Feel free to say hi!
  18:00 <shapleigh1842> hi!
  18:00 <scavr> hi
  18:00 <b10c> Today we are talking about PR 20827 "During IBD, prune as much as possible until we get close to where we will eventually keep blocks" https://github.com/bitcoin/bitcoin/pull/20827
  18:01 <b10c> Notes are on https://bitcoincore.reviews/20827
  18:01 <michaelfolkson> hi
  18:01 <b10c> anyone got a chance to have a look at this over the holidays?
  18:02 <scavr> yes read the notes and had a look at the changes
  18:02 <svav> Hi
  18:03 <michaelfolkson> Yup read the notes too
  18:04 <b10c> cool! the diff is only a few lines, this one is more about understanding how pruning in Bitcoin Core works. Let's dive right in with the questions, but feel free to ask questions any time!
  18:04 <b10c> What does this PR do? What is the goal of this PR?
  18:05 <svav> I read the notes too ...
  18:05 <scavr> The goal is to optimize the pruning strategy during initial block download
  18:06 <b10c> svav: was everything clear? any questions?
  18:06 <michaelfolkson> More aggressive pruning to speed up IBD for a pruned node
  18:06 <svav> Why was this PR felt necessary?
  18:07 <b10c> scavr michaelfolkson: correct! What do we prune now that we previously didn't prune?
  18:07 <michaelfolkson> IBD speedups are always good. I was more unsure why there wasn't already aggressive pruning in the original code
  18:07 <svav> I would say if the IBD acronym is used, it should be defined once as Initial Block Download for clarity for newbies
  18:08 <shapleigh1842> ^^yes I just figured this acronym out
  18:08 <michaelfolkson> ^
  18:09 <b10c> svav: performance improvements are always welcome. This helps people running Bitcoin Core on lower-end hardware, e.g. Raspberry Pis
  18:09 <michaelfolkson> b10c: Just more blocks right? blk.dat and rev.dat files
  18:09 <svav> b10c: and do we know how significant a performance increase this gives?
  18:09 <shapleigh1842> context Q: during an IBD on a "pruned" node, does the node still download the entire blockchain, albeit verifying and pruning as it goes?
  18:10 <sipa> IBD is no different from normal synchronization really. It just changes some heuristics/policies.
  18:10 <sipa> All blocks are still downloaded and verified the same, whether IBD or not.
  18:11 <michaelfolkson> Random q: Are rev.dat files what undo.dat files used to be called?
  18:11 <shapleigh1842> sipa: thank you.
  18:12 <b10c> michaelfolkson: yes! we just free up more space once we decide to prune
  18:12 <sipa> @michaelfolkson I can't remember Bitcoin Core ever having had an undo.dat file.
  18:13 <b10c> I think it's called undo data in the code and the files are called rev*.dat (?)
  18:13 <sipa> Yeah, rev*.dat files contain undo data.
  18:13 <michaelfolkson> sipa: This comment refers to undo.dat files https://github.com/bitcoin/bitcoin/blob/8c0bd871fcf6c5ff5851ccb18a7bc7554a0484b0/src/validation.h#L405
  18:14 <b10c> hm this should probably be rev*.dat files, not sure if this got changed at some point
  18:15 <b10c> Next question: Where in the code do we check if we need to prune old block data?
  18:15 <sipa> That's a typo I think.
  18:15 <sipa> In what branch do you see that?
  18:15 <michaelfolkson> Master
  18:16 <svav> validation.cpp
  18:16 <scavr> we check it in CChainState::FlushStateToDisk before we check if we need to flush the dbcache
  18:17 <shapleigh1842> so just browsing this diff [and I'm sure I could look this up in a readme] it looks like the bitcoin codebase standard is to only provide parameter comments for [out] parameters? (i.e. no comments required for normal params or return?)
  18:18 <sipa> michaelfolkson: It was introduced in commit f9ec3f0fadb11ee9889af977e16915f5d6e01944 in 2015, which introduced pruning in the first place. Even then the files were called rev*.dat.
  18:19 <b10c> svav scavr: correct! I somehow assumed it's done when connecting a new block, but I guess we call FlushStateToDisk often enough (but don't actually flush the cache)
  18:20 <b10c> sipa: maybe they were called undo*.dat in a first iteration of the pruning feature, but got renamed during development
  18:20 <michaelfolkson> sipa: So a typo then. I'll open a PR to correct (or someone new can)
  18:21 <sipa> No, because the rev*.dat files predate pruning. I introduced the concept of rev*.dat files ;)
  18:21 <b10c> sipa: oh, I see :D
  18:21 <b10c> next question: Under which conditions do we prune and what is removed? What is not pruned?
  18:22 <scavr> we prune once disk_usage + buffer >= prune target
  18:23 <svav> Pruning will never delete a block within a defined distance (currently 288) from the active chain's tip.
  18:23 <scavr> and we stop once that's no longer the case
  18:23 <b10c> scavr svav: correct!
  18:24 <b10c> svav: to be clear, we don't delete a block file containing a block within 288 blocks of the tip
  18:25 <b10c> rev*.dat files are also pruned
  18:25 <michaelfolkson> There is a block index and a UTXO database in dbcache... https://bitcoin.stackexchange.com/questions/99098/what-is-in-the-bitcoin-core-leveldb-dbcache-is-it-full-records-or-metadata
  18:25 <michaelfolkson> The block index isn't deleted?
  18:26 <sipa> No, only the blocks.
  18:26 <sipa> We don't want to forget about old blocks, just their contents are forgotten.
  18:27 <b10c> There are flags in the index that indicate if we HAVE_BLOCK_DATA and HAVE_UNDO_DATA
  18:27 <b10c> I guess they are set to false when we prune?
  18:28 <sipa> Yeah, I believe so.
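
  For reference, the selection logic discussed above lives in BlockManager::FindFilesToPrune(); a condensed sketch (simplified, not the exact code):

      // Walk blk/rev file pairs from oldest to newest and mark prunable ones
      // for deletion while disk usage plus the buffer still exceeds the target.
      for (int fileNumber = 0; fileNumber < m_last_blockfile; ++fileNumber) {
          uint64_t nBytesToPrune = vinfoBlockFile[fileNumber].nSize
                                 + vinfoBlockFile[fileNumber].nUndoSize;
          if (vinfoBlockFile[fileNumber].nSize == 0) continue;  // already pruned
          if (nCurrentUsage + nBuffer < nPruneTarget) break;    // freed enough
          // never prune a file containing blocks within 288 blocks of the tip
          if (vinfoBlockFile[fileNumber].nHeightLast > nLastBlockWeCanPrune) continue;
          PruneOneBlockFile(fileNumber);  // also clears the "have data/undo" index flags
          setFilesToPrune.insert(fileNumber);
          nCurrentUsage -= nBytesToPrune;
      }
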
  18:28 <b10c> What is the variable nBuffer in BlockManager::FindFilesToPrune() being used for? How large is the buffer (in MB)?
  18:29 <sipa> In the very first commit that added undo files they were called "<HEIGHT>.und", actually: 8adf48dc9b45816793c7b98e2f4fa625c2e09f2c.
  18:29 <michaelfolkson> I think of blocks as transactions (rather than UTXO diffs) and deleting transactions but this is deleting from a UTXO database right? Effectively spent txo
  18:29 <sipa> No, pruning is unrelated to the UTXO set.
  18:30 <scavr> the nBuffer and the current disk usage are summed up when checking if the prune target has been reached
  18:30 <sipa> The UTXO set is already UTXO: it only contains spent outputs already.
  18:30 <sipa> unspent, sorry
  18:31 <b10c> another acronym worth mentioning: UTXO = unspent transaction output
  18:31 <sipa> Pruning is literally deleting the block files (blk*.dat) and undo files (rev*.dat) from disk, nothing more. It does not touch the UTXO set, and doesn't delete anything from any database.
  18:31 <michaelfolkson> Ok thanks
  18:31 <sipa> (apart from marking the pruned blocks as pruned in the database).
  18:31 <b10c> scavr: do you know how big the buffer is?
  18:32 <scavr> it starts as 17 MB (16MB block chunk size + 1 MB undo chunk size)
  18:34 <b10c> correct! we pre-allocate the files in chunks so we want to keep this as a buffer
  18:35 <scavr> when in IBD we add 10% of the prune target to the buffer
  18:35 <scavr> so 17MB + 55MB = 72MB?
  18:35 <scavr> with a prune target of 550MB
  18:36 <michaelfolkson> 550 being the minimum
  18:36 <scavr> this is similar to PR 20827, also an optimization, right?
  18:36 <b10c> yes! that's my understanding too. why?
  18:37 <b10c> yep, we leave a somewhat bigger buffer to avoid having to prune again too soon (causing another dbcache flush)
  18:38 <b10c> With 20827 the buffer will be 17 MB + `number of blocks left to sync` * 1 MB
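
  Plugging in numbers for the minimum 550 MB prune target (the 200,000 blocks left to sync is an illustrative assumption):

      before #20827:  nBuffer = 17 MB + 550 MB / 10      =  72 MB
      with   #20827:  nBuffer = 17 MB + 200,000 * 1 MB   ≈ 200 GB

  Since the new buffer dwarfs any realistic prune target until the node gets close to the tip, each prune event deletes everything prunable.
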
  18:39 <michaelfolkson> The downside of having a big dbcache is that when it fills it takes longer to flush so time trade-offs I'm guessing. Saves time overall as infrequent flushing
  18:39 <sipa> The point is that any pruning involves flushing, whether it's deleting a small or large amount of blocks.
  18:40 <b10c> michaelfolkson: yes, with a large dbcache you could do IBD without ever flushing
  18:40 <sipa> And frequent flushing is what kills IBD performance.
  18:40 <sipa> So by deleting more when we prune, the pruning (and thus flushing) operation needs to be done less frequently.
  18:40 <michaelfolkson> Frequent flushing of small dbcaches is a lot worse than infrequent flushing of big dbcaches, right
  18:41 <sipa> The size isn't relevant. If you flush frequently, the dbcache just won't grow big.
  18:41 <b10c> not really about dbcache size in this PR, more about number of flushes we do
  18:41 <sipa> Speed is gained by having lots of things cached. When you flush, the cache is empty.
  18:42 <sipa> These two are related: the longer you go without flushing, the bigger the database cache grows.
  18:42 <sipa> The dbcache parameter sets when you have to forcefully flush because otherwise the cache size would exceed the dbcache parameter.
  18:42 <b10c> The PR assumes 1MB for average_block_size. How accurate does this assumption have to be?
  18:44 <michaelfolkson> b10c: Presumably not very accurate :)
  18:44 <scavr> i don't know but guess not too accurate as many blocks are >1MB
  18:44 <michaelfolkson> The average for the first few years must have been much lower than 1MB
  18:44 <michaelfolkson> The start of IBD
  18:45 <michaelfolkson> Today's blocks are consistently much bigger than 1MB, 1.6MB average?
  18:46 <Kaizen_Kintsugi_> I'm surprised at this as well, intuitively with my limited knowledge, I come to the conclusion that the average block size could be computed.
  18:46 <svav> b10c: Were people having performance problems that made this PR necessary? When is it hoped that this PR will be implemented?
  18:47 <b10c> agree with both of you, yes. It doesn't matter for the early blocks and it's an OK assumption for the later blocks. Might leave us with one more or one fewer set of blk/rev dat files when IBD is done
  18:47 <sipa> I mean, what counts as "performance problem"? Faster is always nicer, no?
  18:47 <michaelfolkson> I think if the estimate was totally wrong e.g. 0.1 MB or 6 MB it would be a problem
  18:47 <sipa> And certainly on some hardware IBD is painfully slow... hours to days to weeks sometimes
  18:47 <b10c> It is totally wrong for regtest, signet and testnet
  18:48 <michaelfolkson> Hmm but on the high side right
  18:48 <michaelfolkson> So just a massive underestimate would be a problem?
  18:49 <b10c> Kaizen: computing would be possible for sure, but this would probably be too complex here
  18:49 <michaelfolkson> So 0.1 MB would be a problem but 6MB wouldn't be a problem?
  18:50 <michaelfolkson> Just less efficient
  18:50 <b10c> e.g. on testnet you could finish IBD with only one pair of blk/rev files left when we prune just before we catch up with the tip
  18:51 <b10c> as we assume the next blocks will all be 1MB each and make space for them
  18:52 <michaelfolkson> So the pruning is too aggressive for testnet
  18:52 <b10c> could maybe even be a problem if there is a big reorg on testnet as we can't reverse to a previous UTXO set state?
  18:52 <b10c> michaelfolkson: yes, maybe. I need to think about this a bit more
  18:53 <sipa> During IBD there shouldn't be reorgs in the first place.
  18:53 <sipa> Which is presumably the justification why more aggressive pruning is acceptable there.
  18:53 <b10c> I mean after
  18:53 <sipa> Oh, oops.
  18:54 <b10c> assume we just finished IBD with only a few blocks+undo data left on disk
  18:54 <b10c> (you answered question 7 :) )
  18:55 <b10c> we do a headers-first sync, so we don't download any blocks from stale chains
  18:56 <b10c> (that's the answer to question 7, let's do question 6 now):
  18:56 <b10c> The PR description mentions IBD speed improvements for pruned nodes. What can we measure to benchmark the improvement? With which prune targets and dbcache sizes should we test?
  18:56 <michaelfolkson> A non-zero very low probability of having to deal with re-orgs during IBD. Even lower probability with headers-first sync
  18:57 <michaelfolkson> (if someone provides wrong headers)
  18:57 <scavr> ohh I didn't know about headers-first sync. you mean once we have the longest chain we ignore forks in IBD?
  18:58 <b10c> yes!
  18:58 <sipa> even with headers-first sync we download blocks simultaneously with headers
  18:58 <sipa> but we only fetch blocks along the path of what we currently know to be the best headers chain
  18:59 <sipa> and i'm not sure "probability" is the issue to discuss here; you can't *randomly* end up with an invalid headers chain if your peers are honest
  18:59 <sipa> if your peers are dishonest however, it is possible, but that's a question of attack cost, not probability
  18:59 <svav> To benchmark the improvement, can we measure IBD download time?
  19:00 <michaelfolkson> The header has the PoW in it.. so massive attack cost :)
  19:00 <b10c> svav: yes!
  19:00 <shapleigh1842> We should measure / benchmark IBD sync time with low-cost hardware and default dbcache and other settings
  19:01 <b10c> yup, preferably with different prune targets (more important) and dbcache sizes
  19:01 <shapleigh1842> yeah, well def with the minimum 550
  19:02 <b10c> I'd assume this has a bigger effect for people pruning with larger prune targets
  19:03 <b10c> since you still flush quite often with the 550 prune target, but if you can download 10GB and only need to flush (for pruning) once, that's a lot better than before
  19:03 <sipa> It should mostly matter for people with large prune target and large dbcache, I think.
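
  One possible benchmark setup (illustrative; -connect, -prune, -dbcache, and -stopatheight are existing bitcoind options, while the peer address and height are placeholders): time a fresh sync to a fixed height from a local, already-synced node, varying the prune target and dbcache size:

      $ time bitcoind -datadir=/tmp/bench -connect=192.168.0.2 \
            -prune=550 -dbcache=450 -stopatheight=500000
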
  19:03 <b10c> Ok, time to wrap up! Thanks for joining, I wish everyone a happy new year.
  19:04 <Kaizen_Kintsugi_> Thanks for hosting!
  19:04 <b10c> #endmeeting