incorrect blk file size calculation during reindex results in recoverable blk file corruption (block storage)

Jul 27, 2022

https://github.com/bitcoin/bitcoin/pull/24858

Host: LarryRuane - PR author: mruddy

The PR branch HEAD was bcb0cac at the time of this review club meeting.

Notes

During Initial block download (IBD), a new node automatically initializes and populates its data directory by fetching blocks from peers, validating them, and storing them in the blocks directory within the data directory (default $HOME/.bitcoin).
The blocks are stored in files named blknnnnn.dat (for example, blk01234.dat). These files are limited to 128 MiB, so each can hold about 60 blocks or more depending on their sizes.
The blknnnnn.dat files are in a custom format: a sequence of blocks, each preceded by a 4-byte “marker” or “magic number” and a 4-byte integer indicating the block’s size or length in bytes. The blocks need not be in height-order, either within a block file, or across block files. For example, block 2000 could be stored in blk00010.dat while block 1500 could be stored in blk00011.dat.
In order to save disk space, the node operator can enable pruning using -prune, so the node retains only the most recent few hundred blocks on disk, although IBD still downloads and verifies all blocks since the beginning.
During IBD, besides storing the raw blocks, several kinds of state are derived from the blocks and also stored the data directory as entries in LevelDB. The two most prominent are the block index and the chainstate (UTXO set). The block index, contains an entry for every block (including pruned ones) and indexes the locations (file and offset within the file) of unpruned raw blocks.
If corruption is suspected in the derived indices (the block index or chainstate), the user has the option of starting over with an empty data directory and performing IBD again, but that’s slow and uses a lot of network bandwidth. If the node isn’t pruned, an alternative is to start the node with the -reindex option. This will use the existing blocks files to rebuild all the derived state. This is simliar to IBD but obtains blocks from local files instead of network peers.
PR #24858 fixes a long-standing bug that can cause a form of mild corruption in the way blocks are stored within the blocks files following a reindex.

Questions

Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK?
Which parts of the bitcoind data directory are not derived from other parts of the data directory? What are some examples of parts that are?
Why are blocks in the block files disordered?
What is CBlockFileInfo used for?
What is reindexing, and how does it differ from IBD (initial block download)?
How does pruning interact with reindexing?
What is the format of the blknnnnn.dat files? Are these files portable across CPU architectures (big-endian, little-endian)?
What happens if a blocks file becomes corrupted? Partial hint: See this call to FindByte
Each network type (mainnet, testnet, signet, regtest) has its own “magic” bytes, for example, mainnet). Where else are these bytes used? Why do they differ across network types?
What’s the purpose of the loop in FindBlockPos()?
What bug does this PR purport to fix? Do you think it fixes it?
Did you reproduce the problem that this PR fixes?
A review comment suggested a slightly different way to fix the bug. Explain the alternate approach. How does it compare?
(Bonus question) The definition of BLOCK_SERIALIZATION_HEADER_SIZE must be the same across platforms (so that the blocks files are portable). Why is it okay to assume an int is 4 bytes? Couldn’t it be different on some platforms?

Meeting Log

17:00

<larryruane> #startmeeting

17:00

<josie[m]> hi

17:00

<michaelfolkson> hi

17:00

<larryruane> Hi everyone! today we'll review PR 24858, notes are here: https://bitcoincore.reviews/24858

17:01

<Amirreza> Hi

17:01

<BlueMoon> Hello!

17:02

<hernanmarino_> Hello everyone

17:02

<pablo_martin> Hello!

17:02

<Lov3r_Of_Bitcoin> hello

17:02

<rewe> Hi I'm new here

17:03

<brunoerg> hi!!

17:03

<brunoerg> rewe: welcome!

17:03

<larryruane> Is anyone here for the first time? Please feel free to say hi! Welcome, rewe !

17:03

<pablo_martin> yes, first time here... hi all!

17:04

<hernanmarino_> Welcome Pablo !

17:04

<glozow> hi

17:04

<rebroad> first time here also

17:04

<adam2k> 👋 hello

17:04

<larryruane> Welcome, Pablo! And rebroad!

17:05

<rebroad> there's a PR I wanted to ask about - to try to find out why it's not being given any attention

17:05

<larryruane> I'm having network trouble, Gloria, could you be host for a minute or 2?

17:06

<glozow> sure

17:06

<glozow> Did you all get a chance to review the PR or look at the notes? how about a y/n

17:06

<BlueMoon> I read a bit, it's interesting.

17:06

<brunoerg> y for the notes

17:07

<pablo_martin> yes glozow

17:07

<schmidty_> hola

17:07

<larryruane> glozow: (i'm back, thank you!)

17:07

<josie[m]> y - went through the PR and also read the notes

17:07

<glozow> Great! could any of you summarize what this PR is doing?

17:07

<hernanmarino_> notes n , tested the fix y

17:08

<adam2k> y

17:09

<michaelfolkson> Did you reproduce the issue hernanmarino_?

17:09

<hernanmarino_> michaelfolkson: yes

17:09

<michaelfolkson> Cool

17:09

<josie[m]> this PR is fixes a bug where extra bytes are added to a serialized during reindexing. the extra bytes cause an error to be printed to the log when it is later de-serialized

17:09

<BlueMoon> It deals with the files where the block information is stored, how the size calculation is incorrect.

17:10

<adam2k> There is a deserialization error in the log file that appears when doing a reindexing of the blknnnn.dat files. It's not fatal, but it's confusing and possibly alarming for people that see the issue in their logs.

17:10

<josie[m]> s/serialized/serialized block/

17:10

<larryruane> josie[m]: adam2k: BlueMoon: yes!

17:11

<BlueMoon> :)

17:11

<larryruane> Before we get into the questions, does anyone have any questions about the Notes? Anything unclear (or wrong)?

17:12

<michaelfolkson> Maybe this should be asked later but you can explain "A reindex, if it occurs, always happens immediately after the node starts up, before any blocks have been added from peers."?

17:12

<josie[m]> one thing that was a little unclear to me: the first few questions in the notes didn't seem directly related to the PR (tho helpful background knowledge to have)

17:12

<michaelfolkson> Why?

17:14

<larryruane> michaelfolkson: Reindex is requested by the user (node operator) as a configuration option (command line or in the config file, tho you probably would never put it in the file, or else it would reindex on every startup!),

17:15

<larryruane> and if specified (`-reindex` or `-reindex=1`), it will happen when the node first starts up ... after that process completes (which takes hours usually), then the node syncs with its peers, and you'll add more blocks as usual

17:16

<larryruane> josie[m]: yes, that's my fault, I thought it would be helpful to make sure people had the background needed before jumping into this particular PR

17:16

<Amirreza> A question. Notes say that blk???.dat files are 128 MiG but store 60 or more blocks. Why 60? I imagined the number is much higher while block size are at most 1 MiG.

17:17

<larryruane> Amirreza: well actually, since the segwit upgrade, blocks are almost always larger than 1 MiB .. they're often close to 2

17:17

<hernanmarino_> amirezza : that 's not the case really

17:18

<BlueMoon> I find this information on IBD very interesting. https://btctranscripts.com/andreas-antonopoulos/2018-10-23-andreas-antonopoulos-initial-blockchain-download/

17:18

<michaelfolkson> Yeah the SegWit block size -> block weight https://jimmysong.medium.com/understanding-segwit-block-size-fd901b87c9d4

17:18

<Amirreza> larryruane: Thanks, I should read more about segwit.

17:20

<larryruane> Thanks, BlueMoon: and michaelfolkson: - very useful links! Okay, here's question 2 (but feel free to bring up what we've already covered): Which parts of the bitcoind data directory are not derived from other parts of the data directory? What are some examples of parts that are?

17:20

<larryruane> (I should have said, "... that are part of the data directory")

17:22

<michaelfolkson> There are things not related to this PR. Like banlist.dat and peers.dat

17:23

<michaelfolkson> peers.dat must be updated from banlist.dat (is that what you mean by "derived")?

17:24

<hernanmarino_> Regarding this PR , the index is derived from the blocks

17:25

<larryruane> michaelfolkson: that's really good, I actually hadn't thought of P2P! I was thinking more that the `blocks` directory is not derived from other stuff in the datadir, but like ... yes, as hernanmarino_: said, exactly

17:25

<larryruane> the block index is derived from the blocks files ... also the chainstate, which is the UTXO set

17:28

<larryruane> Things in the data directory that are derived are, in a way, to make performance reasonable ... if the node is looking for information about a block (and has its hash), it would be impractical to linearly search the blocks files (blknnnn.dat)!

17:28

<michaelfolkson> For the new people the data directory: https://en.bitcoin.it/wiki/Data_directory

17:28

<BlueMoon> Thanks!!

17:29

<larryruane> michaelfolkson: +1 thanks! Okay how about question 3: "Why are blocks in the block files disordered?"

17:30

<adam2k> The blocks are added as they come in through the network, so there is no guarantee on order.

17:30

<BlueMoon> It's not necessary from what I've read, and each of these block files has a marker indicating the size of the block file, I imagine they are accessed via an index.

17:32

<josie[m]> the blk*.dat files have orphan blocks as well, which means we wouldn't expect a strict ordering

17:32

<michaelfolkson> "It is perfectly possible that *.blk files contains gaps of zeroes, or even partially written blocks" https://bitcoin.stackexchange.com/questions/49615/how-can-you-tell-if-youre-at-the-end-of-an-incomplete-blk-dat-file

17:32

<larryruane> adam2k: yes exactly, a long time ago, the blocks were ordered, but initial block download (IBD) got a major performance improvement by a feature called headers-first download

17:33

<blocknum256> The order of the blocks in the block files is set by the order you get them from peers during the IBD

17:34

<larryruane> So the node first downloads only headers (which are only 80 bytes each), figures out the best chain (assuming the blocks turn out to be valid), knows order of the blocks, so can request many blocks simultaneously from different peers ... and their reply times are kind of random .. so the blocks end up out of order

17:34

<larryruane> blocknum256: yes

17:35

<larryruane> (again, feel free to keep discussing, even if I go on...) Question 4: "What is CBlockFileInfo used for?"

17:36

<nassersaazi> larryruane: just for clarity....does that mean the headers are ordered?

17:37

<BlueMoon> Used to get information about the last block of files; blocks, size, heights, time....

17:37

<adam2k> CBlockFileInfo is a class that manages information about what is contained within the block files. Things like the number of blocks.

17:37

<josie[m]> CBlockFileInfo contains metadata about a block file, min and max block height in the file, timestamps, size, etc

17:38

<michaelfolkson> That's what it contains but what is it used for? :) (I'm still looking)

17:39

<larryruane> nassersaazi: Yes the headers come in ordered by height, and there are many in a single message (getheaders P2P message) .. so the node basically says "I know about block hash X, give me up to N headers that build on block X"

17:40

<larryruane> josie[m]: yes, there's one of those for each `blknnnn.dat` file ... are they persisted to disk?

17:41

<adam2k> it looks like that class is used in several places, like within the blockstorage.cpp there are methods like LoadBlockIndexDB and WriteBlockIndexDB.

17:41

<BlueMoon> CBlockFileInfo is used to obtain the block file information entry for a block file.

17:42

<larryruane> adam2k: yes exactly, they're written out to LevelDB, part of the "block index" ... BlueMoon: yes!

17:42

<BlueMoon> :)

17:42

<josie[m]> larryruane: hm, not sure about on disk.. tho it does have a serialize method that looks like it outputs json

17:44

<josie[m]> does levelDB write to disk? i thought it was an in memory db

17:44

<larryruane> josie[m]: I don't think it's json .. but yes, anytime you see those serialization methods, you know this data structure is getting saved to (and read from) disk or sent over (and received from) the network

17:44

<larryruane> josie[m]: it's both, it caches in memory but also persists to disk

17:44

<josie[m]> larryruane: cool, TIL!

17:45

<larryruane> so if you look in datadir/blocks/index you'll see those *.ldb files, those are leveldb

17:46

<larryruane> ok let's see.. question 5 "What is reindexing, and how does it differ from IBD (initial block download)?"

17:48

<adam2k> According to the help cmd "Rebuild chain state and block index from the blk*.dat files on disk. This will also rebuild active optional indexes."

17:49

<michaelfolkson> It is almost like restarting IBD but with blocks you already have stored

17:49

<adam2k> Does that mean it does not re-download the blockchain from the network and use your local file storage instead?

17:49

<larryruane> yes, good, reindexing sources blocks from local files, IBD sources blocks from peers

17:49

<blocknum256> It rebuilds the index of blocks based on the blk*. dat files saved in . bitcoin/blocks

17:50

<larryruane> adam2k: yes, so depending on your network speed, reindex could be quite a bit faster than IBD

17:50

<hernanmarino_> adam2k: yes

100

17:50

<larryruane> question 6: How does pruning interact with reindexing?

101

17:51

<josie[m]> +1 to other answers, but in how it's different from IBD, reindexing does not verify the blocks, right? its assumed the blk files on disk have already been verified?

102

17:51

<larryruane> (sorry I'm rushing a little, trying to have time to discuss the PR)

103

17:51

<michaelfolkson> If the reindex fails (multiple times) you'd have to instruct it to start IBD afresh. It wouldn't do it automatically?

104

17:51

<adam2k> I was looking at this earlier. I have a pruned client and it appears that I don't have blknnnn.dat files that go beyond the memory limit that I set.

105

17:51

<larryruane> josie[m]: I'm pretty sure it does verify the blocks

106

17:52

<josie[m]> larryruane: ah, that makes sense, because in the case of a pruned client it is requesting those blocks from peers?

107

17:52

<adam2k> So I'm assuming when I reindex with a pruned node I would need to do a partial IBD up until the point where I have the blknnnn.dat files?

108

17:52

<larryruane> michaelfolkson: I think reindexing does as much as it can (using the blk files), and then the node goes into its usual "sync with my peers" .. which may amount to IBD

109

17:52

<blocknum256> Q6: Since pruning raws block data for blocks older than a givven height, I think indexes are broke at that point

110

17:53

<josie[m]> adam2k: i think that's correct, but im guessing you wouldnt store blk files on disk in the case of a pruned node, you would just use the blocks to build the index?

111

17:53

<larryruane> adam2k: The reindex request (`-reindex` on the command line) would actually fail if the node is pruned

112

17:54

<michaelfolkson> larryruane: I'd guess the optimal behavior would be to after multiple corruptions or whatever just drop all blocks and start afresh without input from user

113

17:54

<michaelfolkson> Don't think it does that though

114

17:54

<michaelfolkson> I think the user would have to decide that

115

17:55

<adam2k> hmm...ok, maybe I don't have something configured correctly because I can see the blknnnn.dat files on disc and it seems like I can reindex without any failures.

116

17:55

<larryruane> michaelfolkson: Yes it does do that, or very close.. like say reindex is able to process the first 100,000 blocks and then ran into some kind of corruption ... it will automatically IBD starting at 100,001

117

17:56

<larryruane> adam2k: is your node pruned?

118

17:57

<adam2k> I know this is off topic, but in my bitcoin.conf I set `prune=10204` isn't that all that I need to do in order to prune?

119

17:57

<michaelfolkson> larryruane: So it detects when the corruption happened and drops blocks from the corruption onwards and requests them from peers? That's neat if so

120

17:57

<larryruane> yes, and you're able to `-reindex`? I'll have to try that!

121

17:58

<josie[m]> larryruane, adam2k: im pretty sure you can run reindex on a pruned node. i just tried by adding prune=1000 in bitcoin.conf and then restarted bitcoind with -reindex

122

17:58

<adam2k> yeah, I'm trying now and it appears to work. I have to run a pruned node because I only have 512GB of disc space total.

123

17:58

<larryruane> ok that's cool, I didn't know that! I wonder how that works?

124

17:59

<larryruane> Okay we're about out of time, any comments on the bug that the PR fixes, or how the PR fixes it?

125

17:59

<josie[m]> i think it behaves the same as IBD (again, TIL), but it deletes older blocks as normal once you hit the pruned size. but yeah, i also need to play with this a bit more

126

17:59

<larryruane> (feel free to stay on to continue to discuss, I'll stay for a while)

127

17:59

<BlueMoon> No comment, thank you very much for so much valuable information. :)

128

18:00

<michaelfolkson> Erm bug should definitely be fixed :)

129

18:00

<blocknum256> larryruane: Thank you for your time!