The PR branch HEAD was bcb0cac at the time of this review club meeting.
During initial block download (IBD),
a new node automatically initializes and populates its data directory by fetching blocks from peers,
validating them, and storing them in the blocks directory within the data directory (default
The blocks are stored in files named blknnnnn.dat (for example, blk01234.dat).
These files are limited to 128 MiB, so each can hold about 60 blocks or
more depending on their sizes.
The blknnnnn.dat files are in a custom format: a sequence of blocks, each preceded by a 4-byte
“marker” or “magic number” and a 4-byte integer indicating the block’s size or length in bytes.
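
The record layout just described can be illustrated with a minimal Python sketch. It is not Bitcoin Core's own reader. It assumes the mainnet magic bytes `f9 be b4 d9` and a little-endian size field, and the names `iter_block_records` and `make_record` are hypothetical helpers invented for this example.

```python
import struct

# Assumption: mainnet network magic; other networks use different bytes.
MAINNET_MAGIC = bytes.fromhex("f9beb4d9")


def make_record(raw_block, magic=MAINNET_MAGIC):
    """Build one record: 4-byte magic, 4-byte size, then the block bytes."""
    return magic + struct.pack("<I", len(raw_block)) + raw_block


def iter_block_records(data, magic=MAINNET_MAGIC):
    """Yield (offset_of_block_bytes, raw_block) for each record in `data`."""
    pos = 0
    while pos + 8 <= len(data):
        if data[pos:pos + 4] != magic:
            # Not the start of a record (for example, padding); move on one byte.
            pos += 1
            continue
        (size,) = struct.unpack_from("<I", data, pos + 4)
        start = pos + 8
        end = start + size
        if end > len(data):
            break  # truncated record, stop scanning
        yield start, data[start:end]
        pos = end


if __name__ == "__main__":
    # Placeholder payloads stand in for real serialized blocks.
    sample = make_record(b"block-A") + make_record(b"block-BB") + b"\x00" * 16
    for offset, raw in iter_block_records(sample):
        print(offset, len(raw))
```

The sketch skips bytes that do not match the magic number, which is one simple way to tolerate padding between or after records. A real reader would also deserialize and validate each block.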
The blocks need not be in height-order, either within a block file, or across block files. For
example, block 2000 could be stored in blk00010.dat while block 1500 could be stored in
In order to save disk space, the node operator can enable
using -prune, so the node retains only the most recent few hundred blocks on disk, although IBD
still downloads and verifies all blocks since the beginning.
During IBD, besides storing the raw blocks, several kinds of state are derived from
the blocks and also stored in the data directory as entries in LevelDB. The two most prominent are
the block index and the chainstate (UTXO set). The
contains an entry for every block (including pruned ones) and indexes the
locations (file and offset within the file) of unpruned raw blocks.
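
To make the idea of indexing block locations concrete, here is a rough Python sketch that maps each block's hash to a (file number, offset) pair. It is only an illustration: the real block index lives in LevelDB and stores far more per entry. The sketch assumes mainnet magic bytes, a little-endian size field, and that a block's hash is the double SHA-256 of its 80-byte header shown in reversed byte order. The names `block_hash` and `build_index` are hypothetical.

```python
import hashlib
import struct

# Assumption: mainnet network magic.
MAGIC = bytes.fromhex("f9beb4d9")


def block_hash(raw_block):
    """Double SHA-256 of the 80-byte header, hex-encoded in display order."""
    header = raw_block[:80]
    digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
    return digest[::-1].hex()


def build_index(files):
    """Map block hash -> (file number, offset of the raw block bytes).

    `files` maps a file number to that file's full contents as bytes.
    """
    index = {}
    for file_no, data in sorted(files.items()):
        pos = 0
        # Walk records back to back; stop at the first non-record bytes.
        while pos + 8 <= len(data) and data[pos:pos + 4] == MAGIC:
            (size,) = struct.unpack_from("<I", data, pos + 4)
            raw = data[pos + 8:pos + 8 + size]
            index[block_hash(raw)] = (file_no, pos + 8)
            pos += 8 + size
    return index
```

With such a map, finding a block by hash is a single lookup followed by a seek into the right file, rather than a linear scan of every block file.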
If corruption is suspected in the derived indices (the block index or
chainstate), the user has the option of starting over with an empty data directory
and performing IBD again, but that’s slow and uses a lot of network bandwidth.
If the node isn’t pruned, an alternative is to start the node with the -reindex option.
This will use the existing blocks files to rebuild all the derived state.
This is similar to IBD but obtains blocks from local files instead of network peers.
PR #24858 fixes a long-standing bug that can
cause a form of mild corruption in the way blocks are stored within the blocks files following a
A review comment
suggested a slightly different way to fix the bug. Explain the
alternate approach. How does it compare?
(Bonus question) The definition of BLOCK_SERIALIZATION_HEADER_SIZE
must be the same across platforms (so that the blocks files are portable).
Why is it okay to assume an int is 4 bytes? Couldn’t it be different on some platforms?
<adam2k> There is a deserialization error in the log file that appears when doing a reindexing of the blknnnn.dat files. It's not fatal, but it's confusing and possibly alarming for people that see the issue in their logs.
<larryruane> michaelfolkson: Reindex is requested by the user (node operator) as a configuration option (command line or in the config file, tho you probably would never put it in the file, or else it would reindex on every startup!),
<larryruane> and if specified (`-reindex` or `-reindex=1`), it will happen when the node first starts up ... after that process completes (which takes hours usually), then the node syncs with its peers, and you'll add more blocks as usual
<larryruane> Thanks, BlueMoon: and michaelfolkson: - very useful links! Okay, here's question 2 (but feel free to bring up what we've already covered): Which parts of the bitcoind data directory are not derived from other parts of the data directory? What are some examples of parts that are?
<larryruane> michaelfolkson: that's really good, I actually hadn't thought of P2P! I was thinking more that the `blocks` directory is not derived from other stuff in the datadir, but like ... yes, as hernanmarino_: said, exactly
<larryruane> Things in the data directory that are derived are, in a way, to make performance reasonable ... if the node is looking for information about a block (and has its hash), it would be impractical to linearly search the blocks files (blknnnn.dat)!
<larryruane> So the node first downloads only headers (which are only 80 bytes each), figures out the best chain (assuming the blocks turn out to be valid), knows order of the blocks, so can request many blocks simultaneously from different peers ... and their reply times are kind of random .. so the blocks end up out of order
<larryruane> nassersaazi: Yes the headers come in ordered by height, and there are many in a single message (getheaders P2P message) .. so the node basically says "I know about block hash X, give me up to N headers that build on block X"
<larryruane> josie[m]: I don't think it's json .. but yes, anytime you see those serialization methods, you know this data structure is getting saved to (and read from) disk or sent over (and received from) the network
<larryruane> michaelfolkson: Yes it does do that, or very close.. like say reindex is able to process the first 100,000 blocks and then ran into some kind of corruption ... it will automatically IBD starting at 100,001