Do not flush block to disk if it is already there (block storage)

https://github.com/bitcoin/bitcoin/pull/27039

Host: stickies-v  -  PR author: pinheadmz

The PR branch HEAD was 93c70287a at the time of this review club meeting.

Notes

  • Recommended reading from earlier review clubs:
  • The blocks/ directory contains three datasets:
    • the raw append-only blknnnnn.dat files containing the serialized block data
    • the raw append-only revnnnnn.dat files containing the undo data,
    • the blocks/index/ block index data which is stored in a LevelDB database. The index helps to quickly find blocks without having to re-scan all the data files.
  • Flushing is the process of bringing the data that we keep in-memory in sync with the data we store on disk. We want to flush regularly enough to ensure that when bitcoind is unexpectedly killed, we may have to redo some work (e.g. re-download a couple of blocks) but we don’t end up in an unrecoverable state (that would e.g. require a completely new IBD). On the other hand, the (generally quite slow) disk I/O can have negative performance implications when we flush too often.

  • A reindex, amongst other things, recreates the block index from the block files on disk. The block files themselves don’t need to be updated: blocks are already assumed to not be stored sequentially in or across block files anyway.

  • This PR skips flushing to disk blocks whose positions are already known. Skipping this means that users can write-protect the blk files on disk which may be useful for security or even safely sharing that data between multiple bitcoind instances.

  • Some of the functions that are relevant to the reindexing process behave differently than can perhaps be expected from their name:
    • BlockManager::SaveBlockToDisk() does not actually write anything to disk when (as is the case for a reindex) the known block position is passed (as dbp), but we still need to call it to update the blockfile info stats
    • BlockManager::FindBlockPos() is mostly used to find the position in a blockfile where a new block is going to be appended. However, it also updates the blockfile info and may also trigger the flushing of block and undo files, which is why we still call it even if we already know the block’s position.
    • this behaviour arguably could arguably be improved upon, as for example done in this WIP by mzumsande

Questions

  1. Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK? What was your review approach?

  2. After this PR, can the whole blocks/ directory be made read-only? Why (not)?

  3. In FindBlockPos(), what is the relevance of fKnown to this PR, conceptually? How does the behaviour of FindBlockPos() change when fKnown==true instead of false?

  4. What happens if bitcoind is killed after when block(s) have been processed, but before FlushBlockFile() is called? Does this PR affect that behaviour?

  5. In the blockmanager_flush_block_file unit test, why are we pausing before starting a new blockfile? Can you think of issues with this approach, or alternatives?

  6. In the blockmanager_flush_block_file unit test, why do we still only have two blocks in the file after we have called SaveBlockToDisk(block3)?

  7. In BlockManager::FlushBlockFile(), why do we always flush the undo file except for when we’re finalizing the block file (and we’re not finalizing the undo file)?

Meeting Log

  117:00 <stickies-v> #startmeeting
  217:00 <LarryRuane> hi
  317:00 <pinheadmz> hi !
  417:00 <alex-wiederin> hi
  517:00 <DaveBeer> hi
  617:00 <lightlike> hi
  717:00 <pablomartin> hi all!
  817:00 <turkycat> hi everyone
  917:01 <stickies-v> welcome everyone! Today we're looking at #27039, authored by pinheadmz. The notes and questions are available on https://bitcoincore.reviews/27039
 1017:01 <stickies-v> anyone joining us for the first time today? even if you're just lurking, feel free to say hi!
 1117:03 <stickies-v> an old-timers only day it seems
 1217:03 <abubakar> hi
 1317:03 <stickies-v> who got the chance to review the PR or read the notes? (y/n)
 1417:03 <turkycat> y
 1517:03 <pablomartin> y
 1617:03 <abubakar> read the notes
 1717:03 <DaveBeer> y, read the notes
 1817:04 <LarryRuane> y
 1917:04 <alex-wiederin> read the notes
 2017:05 <stickies-v> the PR only modifies a few lines of business logic, almost all of the PR is tests, but I think touches on a lot of interesting concepts re how we store and index blocks
 2117:05 <stickies-v> how would you summarize this PR in your own words?
 2217:06 <DaveBeer> minimize unnecessary disk writes
 2317:06 <turkycat> this PR makes it possible to share all but the latest block file amongst multiple clients, also might prevent potential data corruption by preventing many unnecessary writes
 2417:07 <stickies-v> DaveBeer - it doesn't actually change any writing behaviour, as far as I understand
 2517:07 <pinheadmz> turkycat youre right about being able to share the files, but its not really about disk writes !
 2617:07 <turkycat> ack
 2717:08 <pinheadmz> stickies-v is right. did anyone check out the related issue?
 2817:08 <LarryRuane> It improves performance by refraining from writing blocks to disk that are already present on disk (so are no-op writes).. but also +1 to @turkycat 's comment about corruption, hadn't thought of that
 2917:08 <stickies-v> pinheadmz: what's the issue with sharing block files prior to this PR? I'd think that's possible either way?
 3017:08 <pinheadmz> https://github.com/bitcoin/bitcoin/issues/2039
 3117:08 <pinheadmz> you can try this with your own node - set the blocks dir to readonly and then start with -reindex
 3217:09 <pinheadmz> that *should* be possible but bitcoin throws an error - not because its trying to WRITE a file but .... ??? anyone ?
 3317:09 <stickies-v> LarryRuane: prior to this PR, we weren't writing any blocks to disk again either - I don't think this behaviour is changed?
 3417:09 <pinheadmz> correct!
 3517:09 <turkycat> it should be possible to share, but the latest files (blk and rev) might change while being read by another client
 3617:10 <pinheadmz> the error message is in this comment: https://github.com/bitcoin/bitcoin/issues/2039#issuecomment-1101330894
 3717:11 <turkycat> or maybe the OS would fail to open the file for reading if it was opened for writing by bitcoin core
 3817:11 <stickies-v> turkycat: ah, I was understanding "sharing" as copying the files to another client, instead of multiple clients using the same file (which is the more obvious interpretation). my bad
 3917:11 <LarryRuane> oh i see, we're trying to open the file with read-write permission, but if the file is RO, that fails
 4017:11 <pinheadmz> turkycat yes!
 4117:11 <pinheadmz> the bug is we cant OPEN the file, because we are trying to open it with W flag
 4217:11 <pablomartin> stickies-v I was thinking the same... no, same location, locking issue...
 4317:11 <pinheadmz> LarryRuane yep
 4417:12 <stickies-v> LarryRuane: indeed, see https://github.com/bitcoin/bitcoin/blob/d908877c4774c2456eed09167a5f382758e4a8a6/src/flatfile.cpp#L83 where we're calling `Open()` without the `read_only` parameter which defaults to `false`)
 4517:13 <LarryRuane> oh i see, so the PR avoids calling the function (Flush) that is calling Open()
 4617:13 <stickies-v> yessir
 4717:13 <LarryRuane> (unless necessary of course)
 4817:14 <LarryRuane> I even reviewed this PR but had forgotten... they say memory is the first thing to go... I don't recall what's the second thing
 4917:14 <stickies-v> (I think the error message could've been clearer if `FlatFileSeq::Flush` logged "failed to open file in write mode" instead of just "failed to open file")
 5017:15 <LarryRuane> should this PR improve that error message, maybe?
 5117:15 <pinheadmz> It should also log "Unable to open file" from flatfile.cpp Open() I think
 5217:15 <turkycat> lol LarryRuane
 5317:16 <stickies-v> gonna start moving on to the next questions, but as always - feel free to keep the discussion on previous questions going!
 5417:16 <LarryRuane> I've noticed it's really hard to decide which functions should log failures and which should not!
 5517:17 <stickies-v> LarryRuane: I think that's where the `util::Result` class comes in handy, so we can more easily propagate detailed error messages all the way to where we decide if we want to log (which depends on the use case)
 5617:17 <LarryRuane> stickies-v: +1 ... there's a review club for that in case anyone here isn't aware
 5717:17 <stickies-v> After this PR, can the whole `blocks/` directory be made read-only? Why (not)?
 5817:18 <LarryRuane> NO! because we still need to be able to write to the latest blocks file
 5917:18 <turkycat> ^
 6017:18 <DaveBeer> no it can't, client still needs to be able to create block files
 6117:18 <pablomartin> LarryRuane +1
 6217:18 <stickies-v> LarryRuane: correct! but... there's another reason
 6317:18 <LarryRuane> oh sorry, you were asking about the directory, @DaveBeer +1
 6417:18 <turkycat> we wouldn't be able to -reindex either
 6517:18 <stickies-v> turkycat: why?
 6617:19 <turkycat> since the index/ directory is a sub of blocks/
 6717:19 <pablomartin> blocks/index/
 6817:19 <LarryRuane> oh but that can have its own (write) permission enabled
 6917:19 <stickies-v> exactly! we store multiple kinds of data in the `blocks` directory, including the block index (which is stored in a leveldb db)
 7017:19 <stickies-v> LarryRuane: well yes but it's in the `blocks` directory. perhaps the question should have been clearer about including subdirectories too
 7117:20 <stickies-v> anyway, being able to write to the latest block file is reason enough
 7217:20 <LarryRuane> it probably would have been better if the actual blocks files were in a subdir one level lower .. and maybe the rev (undo) files also one level lower (in own subdir)
 7317:20 <LarryRuane> but too late now
 7417:21 <stickies-v> LarryRuane: what would that have improved?
 7517:21 <lightlike> another thing is that if we -reindex and there is some corruption at some point, we'd stop at that point and rebuild the rest of the blocks - for that, the blocks would need to be writeable again.
 7617:21 <turkycat> LarryRuane yea the chainstate/ folder is aso built while -reindex but it's a sibling of blocks/
 7717:21 <turkycat> lightlike good point
 7817:22 <LarryRuane> one thing I've done often to get the size of the blockchain is `mkdir tmp ; ln ~/.bitcoin/blocks/blk*dat tmp ; du -sh tmp ; rm -r tmp` ... all that wouldn't be needed if blk*.dat files were in a separate subdir
 7917:22 <stickies-v> lightlike: because we attempt to remove the corrupted data from the block files?
 8017:23 <stickies-v> In `FindBlockPos()`, what is the relevance of `fKnown` to this PR, conceptually? How does the behaviour of `FindBlockPos()` change when `fKnown==true` instead of `false`?
 8117:23 <LarryRuane> i think lightlike meant the directory needs to be writeable
 8217:23 <lightlike> stickies-v: I think if -reindex fails to continue at some point, we'd fall back to IBD, download the missing blocks from peers, and overwrite the existing higher block files
 8317:23 <turkycat> I guess I assume, to lightlike's comment, that we would delete all blk and revs from the corruption point onward when rebuilding
 8417:24 <lightlike> but not 100% sure about that...
 8517:24 <pinheadmz> lightlike im not sure thats an automatic thing either but makes sense
 8617:24 <turkycat> yea, interesting- this might be worth a test case
 8717:24 <LarryRuane> turkycat: I don't think anything ever automatically deletes blk and rev files
 8817:24 <LarryRuane> pinheadmz: i think that is automatic
 8917:25 <turkycat> so we would just insert over the corrupted data?
 9017:25 <pinheadmz> well yeah actually, bc reindex would return
 9117:25 <pinheadmz> then bitcoin would just... do bitcoin
 9217:25 <LarryRuane> no i think we just create new blk and rev files
 9317:25 <turkycat> ahh fair
 9417:25 <pinheadmz> yeah probably the old corrupted data stays put
 9517:25 <LarryRuane> one thing to be aware of is, it's okay for the blk files to have redundant blocks (multiple copies of the same block)
 9617:25 <LarryRuane> (it only wastes disk space)
 9717:26 <turkycat> yea I guess if our index points to the 'correct' one (or either if both are correct) it doesn't really matter
 9817:26 <LarryRuane> it's the first one we encounter
 9917:26 <LarryRuane> (when doing reindex)
10017:26 <pinheadmz> huh, what happens if reindex encounters a duplocate block ?
10117:27 <turkycat> right but if corrupted, I guess it would fail validation and find it later
10217:27 <stickies-v> on https://github.com/bitcoin/bitcoin/blob/d26a71a94ac4ae1b1a091f4412d390afba69b2f8/src/node/blockstorage.cpp#L877-L896 I can't immediately see any logic that deletes blk and rev files when reindex fails, but may need deeper digging
10317:27 <LarryRuane> pinheadmz: it just ignores it
10417:27 <pinheadmz> it wouldnt over-write the first idex ?
10517:27 <pinheadmz> ah
10617:27 <LarryRuane> stickies-v: right, i don't think anything auto-deletes those files
10717:27 <pinheadmz> only in pruning mode !
10817:28 <turkycat> yea I made a bold assumption and was wrong
10917:29 <stickies-v> just reposting the current question since we've had a lot of (good!) discussion after:
11017:29 <stickies-v> In `FindBlockPos()`, what is the relevance of `fKnown` to this PR, conceptually? How does the behaviour of `FindBlockPos()` change when `fKnown==true` instead of `false`?
11117:30 <LarryRuane> I think `fKnown` means the block already exists on disk .. maybe `fExists` would have been a better name?
11217:30 <alex-wiederin> stickies-v `fKnown` param determines whether `FlushBlockFile` is called in `FindBlockPos()`. The call to flush block file has been moved to the condition of `!fKnown` (i.e. if the position is unknown), I believe this is where get the reduction in calls to write.
11317:31 <turkycat> LarryRuane no harder problem in computer science than naming things
11417:31 <LarryRuane> turkycat: +1
11517:32 <LarryRuane> the block has a "known" location, meaning that it already exists on disk (so it's not a terrible name)
11617:32 <turkycat> we also only allocate if the position is not known, after finding a good position for it and deciding if we should finalize the file after
11717:32 <LarryRuane> `FindBlockPos()` flushes the block if the position isn't known
11817:32 <stickies-v> LarryRuane: alex-wiederin: yeah you're both correct! I'd say the relevance is that `fKnown` is going to always be true when reindexing, which is what this PR is targeting
11917:32 <pinheadmz> yeah it does a lot more than just find something!
12017:33 <pinheadmz> lightlike has a WIP to clean up that part of the code, but for now its a sort of typical bitcoin core codebase thing
12117:33 — pinheadmz grimace
12217:33 <turkycat> or, as a clarification, the finalize is for the current file (if there is not sufficient space for the new block) before the allocate logic
12317:33 <turkycat> none of that logic is necessary if we know where the file position is
12417:34 <lightlike> yeah, i plan on opening a cleanup PR for that, but only after this PR is merged.
12517:34 <LarryRuane> pinheadmz: agree needs cleanup, i think in general when there are boolean flags, something may not be designed in the cleanest way
12617:34 <pinheadmz> heh yeah but someone sometime made a nice little PR that was easy to review at the time :-)
12717:34 <stickies-v> a little bonus side question: what does finalizing a block or undo file mean? like - what happens, and why do we need this?
12817:34 <LarryRuane> (i say "may not" because it could be okay... but often isn't)
12917:35 <LarryRuane> i think finalizing means truncating the file to the minimum length (?)
13017:35 <stickies-v> LarryRuane: yes, and why does that need to happen?
13117:36 <turkycat> because we allocate in chunks that are likely larger than the actual space used
13217:36 <LarryRuane> it's created with a somewhat larger than needed size (i think it's 16m unless test mode), and grows by larger amounts (probably also 16m), because it's slightly more efficient than growing it on demand
13317:36 <turkycat> coming in with the assist for LarryRuane
13417:37 <stickies-v> exactly! we pre-allocate space, then fill it up best we can, and once we're about to exceed size we trim off whatever we didn't use
13517:37 <stickies-v> alright next question
13617:37 <stickies-v> What happens if bitcoind is killed after when block(s) have been processed, but before `FlushBlockFile()` is called? Does this PR affect that behaviour?
13717:37 <LarryRuane> makes sense because we know it's very likely that we'll be writing to the file again soon
13817:38 <LarryRuane> "if bitcoind is killed" ... I'm not sure about this, but i think the block index isn't flushed until after the block file, so we re-do the block processing when we come back up
13917:39 <turkycat> the buffered content won't be written to disk and we'll have to re-download the block
14017:39 <alex-wiederin> Agree. Don't think PR changes in that sense
14117:39 <LarryRuane> in general, when something on disk (A) refers to something else on disk (B), you always want to write out B first (and sync it) and THEN write A to disk
14217:40 <stickies-v> (we preallocate in chunks of 16MiB for block files and 1MiB for undo files: https://github.com/bitcoin/bitcoin/blob/d26a71a94ac4ae1b1a091f4412d390afba69b2f8/src/node/blockstorage.cpp#L587-L592)
14317:40 <LarryRuane> (filesystem code has many of those kinds of sequenced writes to try to be corruption-proof)
14417:40 <LarryRuane> in a way, the block index and the block data comprise a kind of primitive filesystem
14517:40 <lightlike> I think if bitcoind is killed during -reindex, it will always start the reindex from scratch (and not try to continue at the latest point)
14617:41 <LarryRuane> lightlike: is that right, interesting, i thought it would continue (IF you don't specify `-reindex` again .. if you do then of course it does start over)
14717:41 <pinheadmz> lightlike https://bitcoin.stackexchange.com/questions/32835/safely-interrupt-reindex
14817:41 <pinheadmz> sipa says it continutes
14917:42 <LarryRuane> i think many people specify `-reindex` on the restart, thinking it's needed ... but it's not
15017:42 <lightlike> oh, interesting. will try that out
15117:42 <LarryRuane> pinheadmz: good SE find
15217:42 <pinheadmz> i knew this one, i must have looked it up before lol
15317:43 <turkycat> yea for sure if I wasn't 100% sure it would resume I'd start it over on a relaunch just to be sure. this is good to know
15417:44 <sipa> The reindexing process consists of two phases really, the first is rebuilding the block index, the second is wiping the utxo set and rebuilding is. If you interrupt during the second one, it'll definitely just continue where it left off, because that is just the normal "try to activate the best known block" logic, unrelated to reindexing.
15517:44 <sipa> I'm not sure if progress is saved during the first block index rebuilding phase, but if past-me says so, it's probably true.
15617:44 <pinheadmz> past-you is so smart
15717:45 <pinheadmz> and is reindex-chainstate then just skipping right to that ssecond step ?
15817:45 <sipa> Exactly. `-reindex-chainstate` is exactly equivalent to `rm -rf ~/.bitcoin/chainstate`
15917:45 <LarryRuane> that second phase takes much longer in practice, BTW ... actually @sipa aren't there now 3 phases? first block headers (takes only a couple of minutes on most systems), then the two you mentioned?
16017:45 <lightlike> sure - but what if it's aborted midway through the first part
16117:45 <LarryRuane> and actually, block headers are now downloeaded twice
16217:46 <sipa> @LarryRuane You'll know this better now than I do.
16317:46 <stickies-v> I think this might be where we check if we need to continue reindexing: https://github.com/bitcoin/bitcoin/blob/d26a71a94ac4ae1b1a091f4412d390afba69b2f8/src/node/blockstorage.cpp#L364-L367
16417:46 <sipa> @stickies-v Nice catch, indeed.
16517:47 <stickies-v> In the `blockmanager_flush_block_file` unit test, why are we pausing before starting a new blockfile?
16617:47 <stickies-v> (link: https://github.com/bitcoin-core-review-club/bitcoin/commit/470ef396b5498d8689802c359a216d5a3c4749a5#diff-d6d633592a40f5f3d8b03863e41547de8751b874c1d20f129a616b9dd719b999R170)
16717:47 <LarryRuane> stickies-v: yes .. and IIUC, if for some reason that flag in the leveldb index weren't `true`, we'd IBD from that point?
16817:47 <lightlike> LarryRuane: redownloading the block headers via p2p isn't part of -reindex as far as I know. -reindex works without any peers.
16917:48 <LarryRuane> oh that's right
17017:49 <LarryRuane> I think the node does get bothered a little during reindex by P2P messages, which you can disable with `--maxconnections=0` if you're doing reindex performance comparisons
17117:49 <LarryRuane> (i'm not sure how much of a difference that makes tho)
17217:50 <LarryRuane> "why are we pausing" to detect if the file was written (timestamp will be different if so)
17317:51 <stickies-v> LarryRuane: but why do we need to pause to check for a different timestamp when these operations are sequential?
17417:51 <pinheadmz> this was a really hard test to write, because "flushing" is something unpredictable the OS does and theres no great way to know if it happened or not
17517:53 <stickies-v> pinheadmz: yeah, and i'm not sure if it's reliable now? i tried running the updated `feature_reindex.py` test with the changes to `blockstorage.cpp` reverted and the test keeps succeeding
17617:53 <LarryRuane> stickies-v: I think it's because sequential operations can still happen within the same second (or whatever the file timestamp resolution is)
17717:53 <pinheadmz> Thats correct LarryRuane on windows its 2 seconds!
17817:53 <turkycat> so, I believe the answer is that the OS writing the data and updating the timestamp on the file is async and there is some delay, for which `pause` is set at what should be the max delay
17917:53 <LarryRuane> i think i suggested on the PR to compute the file checksum instead of depending on timestamps, but that's more work, probably not worth it
18017:53 <stickies-v> LarryRuane: yeah the latter part is what kinda surprised me - that there's such a huge variance in resolution of last modified timestamps
18117:53 <stickies-v> see e.g. https://stackoverflow.com/a/31521204/734612 for an overview
18217:54 <pinheadmz> stickies-v yeah the test is, hard. I added it by jonatack request and it doesnt really cover the change in the PR as much as it just sets a regression test that the PR DOESNT BREAK
18317:54 <LarryRuane> stickies-v: madness :)
18417:54 <stickies-v> turkycat: this is more of a filesystem than an OS thing, afaik (which is also one concern I have with the current approach)
18517:54 <pablomartin> thankfuklly the intermittent failure of the test caught it
18617:55 <turkycat> I found these pauses strange though, in their order. I made a comment on it and tbh still don't fully understand why we aren't 1) read timestamp 2) perform write 3) pause 4) read timestamp again 5) compare
18717:55 <pinheadmz> write = update timestamp
18817:55 <pinheadmz> so reading timestamp before writing isnt useful
18917:55 <pinheadmz> then we want to know if we WROTE AGAIN
19017:55 <pinheadmz> and the only way to know is, the timestamp has changed
19117:56 <turkycat> right but reading it first gives you the comparison point, you want to make sure that the value didn't change
19217:56 <pinheadmz> but if the second write happened so fast that time itself hasnt advanced, the test would false-positibe
19317:56 <pinheadmz> actually we are doing setps (1,2,3,4,5) as yo umention
19417:57 <turkycat> ok, fair, I'll consider that
19517:57 <pinheadmz> but we do a write before step 1 as well
19617:57 <pinheadmz> er sorry no sorry
19717:57 <pinheadmz> yeah we need to pause between writes so that if a write happened, the time will definitely be updated
19817:58 <turkycat> so, perhaps I misunderstand the filesystem update delay- but I guess my thought is that since we're writing on line 150 we should pause again before reading `last_write_time` again, for comparison?
19917:58 <stickies-v> turkycat: it's not a lag thing, it's a resolution thing
20017:59 <stickies-v> the last modified timestamp just doesn't store any more specific timestamps than that resolution
20117:59 <pinheadmz> yeah imagine if timestamp resolution was one day
20217:59 <turkycat> yea ok I think that was my fundamental misunderstanding. I assumed the fflush and fsync were async and you needed to check the timestamp after some period to be sure
20317:59 <stickies-v> so on FAT filesystem, if you perform 10 operations within an (ideally timed) span of 2 seconds, you'd update the last modified timestamp 10 times, but they'd all have the same value
20417:59 <pinheadmz> you could update your file all day long and keep checking the time, itd never change
20518:00 <turkycat> got it, cheers guys
20618:00 <pinheadmz> yeah so if you happen to run this test on windows, its extra slow
20718:00 <pinheadmz> theres an ifdef that changes the pause length for linux
20818:00 <stickies-v> with that said, we're out of time for this meeting so the remaining questions are left as an exercise to the reader - feel free to continue discussing here!
20918:00 <LarryRuane> another reason to avoid windows haha
21018:00 <pinheadmz> thanks everyone for taking a look!
21118:00 <LarryRuane> thanks, @stickies-v and everyone else!
21218:01 <stickies-v> thank you everyone for participating, a bit less code heavy but i hope the concepts were interesting
21318:01 <LarryRuane> and @pinheadmz !!
21418:01 <stickies-v> and thank you pinheadmz for authoring the PR and helping out!
21518:01 <alex-wiederin> thanks stickies-v!
21618:01 <stickies-v> #endmeeting