The PR branch HEAD was ac94141af at the time of this review club meeting.
In this Review Club we will dive into how blocks are downloaded and stored on
disk, how disk access is arranged and a bug it has caused.
Notes
Concepts
When a node first connects to the network, it does an Initial Block
Download (IBD) to
download all the blocks to the tip, validate them and connect them in its block
chain.
Block undo
data
is all the information that is necessary to revert a block if the block needs
to be disconnected during a reorg.
The blocks database is represented by two instances of
FlatFileSeq -
one for all blocks/blk*.dat files and another one for all blocks/rev*.dat
files. FlatFilePos
is used to locate objects within those files.
The meta information about a single block file and its corresponding undo
file is represented by
CBlockFileInfo.
The fflush library call moves all
buffered data from the FILE* buffers to the OS (i.e. in kernel buffers). It
may not necessarily hit the disk yet.
The fsync system call moves all
modified data from the OS buffers to the disk.
Block and undo files
Blocks are stored in a custom format in the blocks directory. This
directory consists of a series of blk*.dat files (currently 128 MiB each)
that contain the raw blocks.
Each block is written to disk as it arrives from the network. Because blocks
are downloaded in parallel from more than one peer during initial block
download this means that a block with greater height can be received (and
written to disk) before a block with lower height. We call these out-of-order
blocks.
When initial block download finishes things calm down and new blocks arrive
every 10 minutes on average (assuming the node is online to receive them).
That means we’re much less likely to write blocks out of order.
In addition to the blk*.dat files, we also generate and store “undo”
information for each block in corresponding rev*.dat files. This can be used to
revert all the transactions from the block if we need to disconnect the block
from the chain during a reorg. Unlike blocks, this information is always stored
in block height order.
[Edit 2021-02-03: Note: Within a single rev*.dat file, the undo data
is ordered – a block’s undo data will never appear after the undo data for a
descendant block. However, this is not true if the rev*.dat files are
concatenated – a block’s undo data may appear in a later rev*.dat file
than the rev*.dat file containing its descendant’s undo data.]
We put block and undo data in corresponding blk*.dat and rev*.dat files,
but internally they may be in different order. For example, the undos for all
blocks in blk1234.dat will be in rev1234.dat, but maybe the block at height
100 is somewhere near the beginning of blk1234.dat whereas its undo is
somewhere near the end of rev1234.dat.
Questions
Overview
Can we
create
an undo file for a given block without having all prior blocks?
Do we ever modify existing block data in the blocks directory, or do we just append
new blocks and their undos?
Dive
How is space allocated in the files when new data is appended? Why?
What does it mean to “finalize” a file?
What is the bug that the PR is fixing?
How would you reproduce the bug or show it exists?
<robot-visions> I think we cannot, because without prior blocks, we would not be able to restore the UTXOs spent by the block (we'd have the `COutPoint`s, but not the `CTxOut`s)
<vasild> right, actually, now when I think about it - the question is misleading. We need the utxo for which we need the prior blocks. So the answer would be "no", but however if we somehow have the utxo - then we are good.
<jnewbery> vasild: the undo data is written in ConnectBlock(), which we only call for the tip block, so we need to have processed all the blocks up to that point
<sipa> well, you need *all* utxos spent by a block, which may include utxos created the block before, so it can't be created before at least having validated the previous block
<felixweis> to restore the the utxo we need key=(txid,index) value=(script,amount,blockheight,is_coincase). some info (like original txid) is inferred from the last block (in blk*.dat), the rest is stored in rev*.dat, inputs are in the same canonical order as the inputs in the block spending. (iirc)
<oyster> vasild how does "not modifying existing blk* and rev* data square with: "In addition to the blk*.dat files, we also generate and store “undo” information for each block in corresponding rev*.dat files. This can be used to revert all the transactions from the block if we need to disconnect the block from the chain during a reorg. Unlike blocks, this information is always stored in block height order."
<oyster> Does it mean that in order to ensure "information is always stored in block height order." that the rev data isn't written until all the blocks are processed up until that point?
<felixweis> im still stuggling to understand how blocks in blk* are out of order, but rev is in order. the pruning mechanism deletes blk and rev of the same number? what would happen if the tip is artificialy stored in blk000000.dat right after the genesis block?
<oyster> so if it's the case that rev data isn't written until block is activated/connected, how is the case that blk*.dat and rev*.dat files always contain the same block info, What if node starts and only hears about 128mb worth of blocks that are beyond it's tip? wouldn't those be written to blk* but couldn't be written to the corresponding rev number?
<vasild> so, the point is - we dump blocks in blk*.dat as they come from the network, which is usually out of order, but we cannot do the same with undo - which we only can generate once all previous blocks are activated
<vasild> oyster: we come back later, when we have the undo and append it to the proper rev*.dat file - the one which corresponds to the block's blk*.dat file. Notice that this is also always the last rev*.dat file.
<vasild> felixweis: yes :) I don't know if this guarding from disk full is very useful. We can still get other IO errors and we have to be prepared for that anyway, even if disk full is not one of them.
<vasild> willcl_ark: notice that if we tried to reorder them later we may have to move blocks across blk files, if e.g. block 100 is in blk5 and block 101 is in blk4
<vasild> nehan_: that is what the Flush() method does (fflush() + fsync()). Finalize is that + return the claimed space to the FS because we know that we will not need it because we will not append to this file again.
<theStack> if one would delete all blk*/rev* files from a (stopped) node and replace them by the copy from another node, would that cause any problems?
<vasild> to be sure, I would `rm -fr blocks/index/` so at least if it bricks it bricks with some "file not found" error instead of some obscure trying to lookup a block in position X at file Y and finding something else there
<fjahr> theStack: I was pretty confident until I saw jnewbery and vasild answers ;) But -reindex builds the blockindex so I don't know why it wouldn't work
<theStack> i was always assuming that everynode has the exact same block files, i.e. their hashes would. today i learned that this is obviously not the case :)
<pinheadmz> sipa: does it introsepct the bitcoin data? for example i can imagine pruning default values like nlocktime or nsequence, maybe squeeze some bytes out of DER format
<robot-visions> vasild: If we're receiving new black faster than the chain tip is moving, we could run into situations where we (1) finalize a rev* file, (2) write some more undo data to the file later, (3) don't finalize it again
<vasild> robot-visions: right, except it happens during IBD when we receive blocks out of order and finalize revN whenever we finalize blkN, but later come back and append stuff to revN and forget to finalize it
<robot-visions> (I think the benefit would be making the code simpler, and the downside would be sometimes an extra finalization—I'm not sure how much that would affect fragmentation)
<vasild> robot-visions: this is what the fix does - but not unconditionally "finalize revN whenever finalizing blkN". It is possible to detect if we will be going back to append
<robot-visions> Just to make sure I understand correctly: Could you flush more often than needed in the rare edge case where the same file has multiple blocks with `nHeight == vinfoBlockFile[_pos.nFile].nHeightLast`?
<vasild> What about even removing rev*.dat files altogether? We only disconnect the block from the tip and at this point we have the utxo, so we can generate the undo on the fly as needed, no?
<sipa> with txindex and no pruning they could in theory be recovered from the block data itself, but it would be slow (as they're scattered all over), and not be generically applocable
<vasild> long time ago I enabled lz4 on everything because the decompression is so fast that "read compressed from disk + uncompress" is faster than "read uncompressed from disk". Of course it depends on the data itself, disk and CPU speed.
<sipa> 2) shouldn't be, as the block data is always flushed before writing the block index, and the block index is always flushed before writing the chainstaye