- the raw, append-only `blk<nnnnn>.dat` files containing the serialized block data
- the raw, append-only `rev<nnnnn>.dat` files containing the undo data
- the `blocks/index/` block index, stored in a LevelDB database. The index allows quickly locating blocks without having to re-scan all the data files.
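To make the relationship between the index and the data files concrete, here is a minimal sketch of how a stored block position (a file plus a byte offset) is enough to find a block again. It assumes the standard blk file record layout of a 4-byte network magic followed by a 4-byte little-endian length prefix; the function name and exact offset semantics are illustrative, not Bitcoin Core's API.

```python
import struct

# Each record in a blk<nnnnn>.dat file is laid out as
# [4-byte network magic][4-byte little-endian length][serialized block].
MAINNET_MAGIC = bytes.fromhex("f9beb4d9")

def read_block_at(path, offset, magic=MAINNET_MAGIC):
    """Return the serialized block whose record starts at `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        header = f.read(8)
        if header[:4] != magic:
            raise ValueError("no block record at this offset")
        (length,) = struct.unpack("<I", header[4:])
        return f.read(length)
```

This is why the index can be rebuilt from the data files alone: scanning a file for magic bytes recovers every record's position.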
Flushing is the process of bringing the data that we keep in-memory in sync with the data we store on disk. We want to flush regularly enough that, when bitcoind is unexpectedly killed, we may have to redo some work (e.g. re-download a couple of blocks) but we don’t end up in an unrecoverable state (one that would e.g. require a completely new IBD). On the other hand, flushing too often hurts performance, since disk I/O is generally quite slow.
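The trade-off exists because a write passes through two layers of buffering before it is durable. A minimal sketch (not Core's code) of what "flushing" means at the file level:

```python
import os

def durable_write(path, data):
    """Append `data` and make it durable: flush userspace buffers to the
    OS, then ask the OS to push its page cache to the physical disk."""
    with open(path, "ab") as f:
        f.write(data)         # may only reach a userspace buffer
        f.flush()             # userspace buffer -> OS page cache
        os.fsync(f.fileno())  # OS page cache -> disk (the slow part)
```

Skipping the `fsync` is fast but means a crash or power loss can silently drop the write, which is exactly the work we then have to redo.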
A reindex, amongst other things, recreates the block index from the block files on disk. The block files themselves don’t need to be updated: blocks are not assumed to be stored sequentially in or across block files anyway.
This PR skips flushing to disk those blocks whose position is already known. With that change, users can write-protect the blk files on disk, which may be useful for security, or even for safely sharing that data between multiple bitcoind instances.
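Write-protecting the files is just a matter of clearing their write permission bits. A small sketch of what a user might run over archived `blk*.dat` files (the helper name is illustrative):

```python
import os
import stat

def write_protect(path):
    """Drop all write permission bits, as a user might do for shared or
    archived blk*.dat files once no new blocks will be appended."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
```

Before this PR, a reindex would fail on such files because it reopened them in write mode even when it had nothing new to write.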
Some of the functions that are relevant to the reindexing process behave differently than can perhaps be expected from their name:
`BlockManager::SaveBlockToDisk()` does not actually write anything to disk when the known block position is passed (as `dbp`), as is the case for a reindex, but we still need to call it to update the blockfile info stats.
`BlockManager::FindBlockPos()` is mostly used to find the position in a blockfile where a new block is going to be appended. However, it also updates the blockfile info and may trigger the flushing of block and undo files, which is why we still call it even when we already know the block's position.
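The dual role can be paraphrased in a loose Python sketch. This is a hypothetical simplification, not the real `FindBlockPos()` signature: when the position is unknown we pick an append offset (possibly rolling over to a new file), and when it is known (reindex) we only do the per-file bookkeeping.

```python
# Hypothetical paraphrase of the two roles of FindBlockPos().
# file_infos: one dict per block file, tracking block count and size.

def find_block_pos(file_infos, block_size, known_pos=None,
                   max_file_size=128 * 1024 * 1024):
    if known_pos is not None:
        n_file, n_pos = known_pos            # reindex: position already known
    else:
        n_file = len(file_infos) - 1
        if file_infos[n_file]["size"] + block_size > max_file_size:
            file_infos.append({"blocks": 0, "size": 0})  # start a new file
            n_file += 1
        n_pos = file_infos[n_file]["size"]   # append at the current end
    info = file_infos[n_file]
    info["blocks"] += 1                      # bookkeeping happens either way
    info["size"] = max(info["size"], n_pos + block_size)
    return n_file, n_pos
```

The point of the sketch: even on the `known_pos` path, the blockfile info must still be updated, which is why the function is called at all during a reindex.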
This behaviour could arguably be improved upon, as for example done in this WIP by mzumsande.
After this PR, can the whole blocks/ directory be made read-only? Why (not)?
In FindBlockPos(), what is the relevance of fKnown to this PR, conceptually? How does the behaviour of FindBlockPos() change when fKnown==true instead of false?
What happens if bitcoind is killed after block(s) have been processed, but before FlushBlockFile() is called? Does this PR affect that behaviour?
In the blockmanager_flush_block_file unit test, why are we pausing before starting a new blockfile? Can you think of issues with this approach, or alternatives?
In the blockmanager_flush_block_file unit test, why do we still only have two blocks in the file after we have called SaveBlockToDisk(block3)?
In BlockManager::FlushBlockFile(), why do we always flush the undo file except for when we’re finalizing the block file (and we’re not finalizing the undo file)?
<stickies-v> welcome everyone! Today we're looking at #27039, authored by pinheadmz. The notes and questions are available on https://bitcoincore.reviews/27039
<stickies-v> the PR only modifies a few lines of business logic, almost all of the PR is tests, but I think touches on a lot of interesting concepts re how we store and index blocks
<turkycat> this PR makes it possible to share all but the latest block file amongst multiple clients, also might prevent potential data corruption by preventing many unnecessary writes
<LarryRuane> It improves performance by refraining from writing blocks to disk that are already present on disk (so are no-op writes).. but also +1 to @turkycat 's comment about corruption, hadn't thought of that
<stickies-v> turkycat: ah, I was understanding "sharing" as copying the files to another client, instead of multiple clients using the same file (which is the more obvious interpretation). my bad
<stickies-v> (I think the error message could've been clearer if `FlatFileSeq::Flush` logged "failed to open file in write mode" instead of just "failed to open file")
<stickies-v> LarryRuane: I think that's where the `util::Result` class comes in handy, so we can more easily propagate detailed error messages all the way to where we decide if we want to log (which depends on the use case)
<LarryRuane> it probably would have been better if the actual blocks files were in a subdir one level lower .. and maybe the rev (undo) files also one level lower (in own subdir)
<lightlike> another thing is that if we -reindex and there is some corruption at some point, we'd stop at that point and rebuild the rest of the blocks - for that, the blocks would need to be writeable again.
<LarryRuane> one thing I've done often to get the size of the blockchain is `mkdir tmp ; ln ~/.bitcoin/blocks/blk*dat tmp ; du -sh tmp ; rm -r tmp` ... all that wouldn't be needed if blk*.dat files were in a separate subdir
<stickies-v> In `FindBlockPos()`, what is the relevance of `fKnown` to this PR, conceptually? How does the behaviour of `FindBlockPos()` change when `fKnown==true` instead of `false`?
<lightlike> stickies-v: I think if -reindex fails to continue at some point, we'd fall back to IBD, download the missing blocks from peers, and overwrite the existing higher block files
<alex-wiederin> stickies-v `fKnown` param determines whether `FlushBlockFile` is called in `FindBlockPos()`. The call to flush block file has been moved to the condition of `!fKnown` (i.e. if the position is unknown), I believe this is where we get the reduction in calls to write.
<stickies-v> LarryRuane: alex-wiederin: yeah you're both correct! I'd say the relevance is that `fKnown` is going to always be true when reindexing, which is what this PR is targeting
<LarryRuane> it's created with a somewhat larger than needed size (i think it's 16m unless in test mode), and grows by larger amounts (probably also 16m), because it's slightly more efficient than growing it on demand
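The chunked growth LarryRuane describes can be sketched as rounding the needed file size up to the next chunk boundary. The 16 MiB figure comes from the discussion above; the helper name is illustrative:

```python
BLOCKFILE_CHUNK = 16 * 1024 * 1024  # 16 MiB, per the discussion above

def allocated_size(needed_bytes, chunk=BLOCKFILE_CHUNK):
    """Round the needed file size up to the next chunk boundary, so the
    file grows in a few large steps instead of many small appends."""
    return ((needed_bytes + chunk - 1) // chunk) * chunk
```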
<stickies-v> What happens if bitcoind is killed after block(s) have been processed, but before `FlushBlockFile()` is called? Does this PR affect that behaviour?
<LarryRuane> "if bitcoind is killed" ... I'm not sure about this, but i think the block index isn't flushed until after the block file, so we re-do the block processing when we come back up
<LarryRuane> in general, when something on disk (A) refers to something else on disk (B), you always want to write out B first (and sync it) and THEN write A to disk
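LarryRuane's write-ordering rule can be sketched as follows. This is a generic illustration of the principle, not Core's flushing code: the referenced data (B) is made durable before the reference to it (A), so a crash can never leave an index entry pointing at data that was lost.

```python
import os

def checkpointed_update(data_path, index_path, data, index_entry):
    """Write B (the data), sync it, and only then write A (the index)."""
    with open(data_path, "ab") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())   # B is on disk before we touch A
    with open(index_path, "ab") as f:
        f.write(index_entry)
        f.flush()
        os.fsync(f.fileno())
```

If the process dies between the two writes, the index simply lacks an entry for data that is already safely on disk, which is the recoverable direction.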
<LarryRuane> lightlike: is that right, interesting, i thought it would continue (IF you don't specify `-reindex` again .. if you do then of course it does start over)
<sipa> The reindexing process consists of two phases really, the first is rebuilding the block index, the second is wiping the utxo set and rebuilding it. If you interrupt during the second one, it'll definitely just continue where it left off, because that is just the normal "try to activate the best known block" logic, unrelated to reindexing.
<LarryRuane> that second phase takes much longer in practice, BTW ... actually @sipa aren't there now 3 phases? first block headers (takes only a couple of minutes on most systems), then the two you mentioned?
<LarryRuane> I think the node does get bothered a little during reindex by P2P messages, which you can disable with `--maxconnections=0` if you're doing reindex performance comparisons
<pinheadmz> this was a really hard test to write, because "flushing" is something unpredictable the OS does and theres no great way to know if it happened or not
<stickies-v> pinheadmz: yeah, and i'm not sure if it's reliable now? i tried running the updated `feature_reindex.py` test with the changes to `blockstorage.cpp` reverted and the test keeps succeeding
<LarryRuane> stickies-v: I think it's because sequential operations can still happen within the same second (or whatever the file timestamp resolution is)
<turkycat> so, I believe the answer is that the OS writing the data and updating the timestamp on the file is async and there is some delay, for which `pause` is set at what should be the max delay
<LarryRuane> i think i suggested on the PR to compute the file checksum instead of depending on timestamps, but that's more work, probably not worth it
<pinheadmz> stickies-v yeah the test is, hard. I added it by jonatack request and it doesnt really cover the change in the PR as much as it just sets a regression test that the PR DOESNT BREAK
<turkycat> I found these pauses strange though, in their order. I made a comment on it and tbh still don't fully understand why we aren't 1) read timestamp 2) perform write 3) pause 4) read timestamp again 5) compare
<turkycat> so, perhaps I misunderstand the filesystem update delay- but I guess my thought is that since we're writing on line 150 we should pause again before reading `last_write_time` again, for comparison?
<turkycat> yea ok I think that was my fundamental misunderstanding. I assumed the fflush and fsync were async and you needed to check the timestamp after some period to be sure
<stickies-v> so on FAT filesystem, if you perform 10 operations within an (ideally timed) span of 2 seconds, you'd update the last modified timestamp 10 times, but they'd all have the same value
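The FAT granularity stickies-v describes can be shown deterministically. A small sketch (the helper is illustrative, modelling FAT's 2-second last-modified resolution):

```python
def fat_mtime(timestamp_s):
    """FAT stores last-modified times with 2-second granularity, so any
    timestamps within the same 2-second window become indistinguishable."""
    return (int(timestamp_s) // 2) * 2

# Ten operations spread over ~1 second all collapse to one stored value.
stamps = [fat_mtime(1000.0 + i * 0.1) for i in range(10)]
```

This is why a test that compares file timestamps before and after a write can pass spuriously: distinct operations can share one stored mtime.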
<stickies-v> with that said, we're out of time for this meeting so the remaining questions are left as an exercise to the reader - feel free to continue discussing here!