Do not flush block to disk if it is already there (
block storage) Apr 19, 2023
The PR branch HEAD was 93c70287a at the time of this review club meeting.
Recommended reading from earlier review clubs:
blocks/ directory contains three datasets:
the raw append-only
blknnnnn.dat files containing the serialized block data
the raw append-only
revnnnnn.dat files containing the
undo data, the
blocks/index/ block index data which is stored in a LevelDB database. The index helps to quickly find blocks without having to re-scan all the data files.
Flushing is the process of bringing the data that we keep in-memory in sync with the data we store on disk. We want to flush regularly enough to ensure that when bitcoind is unexpectedly killed, we may have to redo some work (e.g. re-download a couple of blocks) but we don’t end up in an unrecoverable state (that would e.g. require a completely new IBD). On the other hand, the (generally quite slow) disk I/O can have negative performance implications when we flush too often.
A reindex, amongst other things, recreates the block index from the block files on disk. The block files themselves don’t need to be updated: blocks are already assumed to not be stored sequentially in or across block files anyway.
This PR skips flushing to disk blocks whose positions are already known. Skipping this means that users can write-protect the blk files on disk which may be useful for security or even safely sharing that data between multiple bitcoind instances.
Some of the functions that are relevant to the reindexing process behave differently than can perhaps be expected from their name:
does not actually write anything to disk when (as is the case for a reindex) the known block position is passed (as
dbp), but we still need to call it to update the blockfile info stats
is mostly used to find the position in a blockfile where a new block is going to be appended. However, it also updates the blockfile info and may also trigger the flushing of block and undo files, which is why we still call it even if we already know the block’s position.
this behaviour arguably could arguably be improved upon, as for example done in
this WIP by mzumsande Questions
Did you review the PR?
Concept ACK, approach ACK, tested ACK, or NACK? What was your review approach?
After this PR, can the whole
blocks/ directory be made read-only? Why (not)?
, what is the relevance of
fKnown to this PR, conceptually? How does the behaviour of
FindBlockPos() change when
fKnown==true instead of
What happens if bitcoind is killed after when block(s) have been processed, but before
FlushBlockFile() is called? Does this PR affect that behaviour?
blockmanager_flush_block_file unit test, why are we
pausing before starting a new blockfile? Can you think of issues with this approach, or alternatives?
blockmanager_flush_block_file unit test, why do we still
only have two blocks in the file after we have called
, why do we always flush the undo file
except for when we’re finalizing the block file (and we’re not finalizing the undo file)? Meeting Log
1 17:00 <stickies-v> #startmeeting
4 17:00 <alex-wiederin> hi
7 17:00 <pablomartin> hi all!
8 17:00 <turkycat> hi everyone
10 17:01 <stickies-v> anyone joining us for the first time today? even if you're just lurking, feel free to say hi!
11 17:03 <stickies-v> an old-timers only day it seems
13 17:03 <stickies-v> who got the chance to review the PR or read the notes? (y/n)
16 17:03 <abubakar> read the notes
17 17:03 <DaveBeer> y, read the notes
19 17:04 <alex-wiederin> read the notes
20 17:05 <stickies-v> the PR only modifies a few lines of business logic, almost all of the PR is tests, but I think touches on a lot of interesting concepts re how we store and index blocks
21 17:05 <stickies-v> how would you summarize this PR in your own words?
22 17:06 <DaveBeer> minimize unnecessary disk writes
23 17:06 <turkycat> this PR makes it possible to share all but the latest block file amongst multiple clients, also might prevent potential data corruption by preventing many unnecessary writes
24 17:07 <stickies-v> DaveBeer - it doesn't actually change any writing behaviour, as far as I understand
25 17:07 <pinheadmz> turkycat youre right about being able to share the files, but its not really about disk writes !
27 17:08 <pinheadmz> stickies-v is right. did anyone check out the related issue?
28 17:08 <LarryRuane> It improves performance by refraining from writing blocks to disk that are already present on disk (so are no-op writes).. but also +1 to @turkycat 's comment about corruption, hadn't thought of that
29 17:08 <stickies-v> pinheadmz: what's the issue with sharing block files prior to this PR? I'd think that's possible either way?
31 17:08 <pinheadmz> you can try this with your own node - set the blocks dir to readonly and then start with -reindex
32 17:09 <pinheadmz> that *should* be possible but bitcoin throws an error - not because its trying to WRITE a file but .... ??? anyone ?
33 17:09 <stickies-v> LarryRuane: prior to this PR, we weren't writing any blocks to disk again either - I don't think this behaviour is changed?
34 17:09 <pinheadmz> correct!
35 17:09 <turkycat> it should be possible to share, but the latest files (blk and rev) might change while being read by another client
37 17:11 <turkycat> or maybe the OS would fail to open the file for reading if it was opened for writing by bitcoin core
38 17:11 <stickies-v> turkycat: ah, I was understanding "sharing" as copying the files to another client, instead of multiple clients using the same file (which is the more obvious interpretation). my bad
39 17:11 <LarryRuane> oh i see, we're trying to open the file with read-write permission, but if the file is RO, that fails
40 17:11 <pinheadmz> turkycat yes!
41 17:11 <pinheadmz> the bug is we cant OPEN the file, because we are trying to open it with W flag
42 17:11 <pablomartin> stickies-v I was thinking the same... no, same location, locking issue...
43 17:11 <pinheadmz> LarryRuane yep
45 17:13 <LarryRuane> oh i see, so the PR avoids calling the function (Flush) that is calling Open()
46 17:13 <stickies-v> yessir
47 17:13 <LarryRuane> (unless necessary of course)
48 17:14 <LarryRuane> I even reviewed this PR but had forgotten... they say memory is the first thing to go... I don't recall what's the second thing
49 17:14 <stickies-v> (I think the error message could've been clearer if `FlatFileSeq::Flush` logged "failed to open file in write mode" instead of just "failed to open file")
50 17:15 <LarryRuane> should this PR improve that error message, maybe?
51 17:15 <pinheadmz> It should also log "Unable to open file" from flatfile.cpp Open() I think
52 17:15 <turkycat> lol LarryRuane
53 17:16 <stickies-v> gonna start moving on to the next questions, but as always - feel free to keep the discussion on previous questions going!
54 17:16 <LarryRuane> I've noticed it's really hard to decide which functions should log failures and which should not!
55 17:17 <stickies-v> LarryRuane: I think that's where the `util::Result` class comes in handy, so we can more easily propagate detailed error messages all the way to where we decide if we want to log (which depends on the use case)
56 17:17 <LarryRuane> stickies-v: +1 ... there's a review club for that in case anyone here isn't aware
57 17:17 <stickies-v> After this PR, can the whole `blocks/` directory be made read-only? Why (not)?
58 17:18 <LarryRuane> NO! because we still need to be able to write to the latest blocks file
60 17:18 <DaveBeer> no it can't, client still needs to be able to create block files
61 17:18 <pablomartin> LarryRuane +1
62 17:18 <stickies-v> LarryRuane: correct! but... there's another reason
63 17:18 <LarryRuane> oh sorry, you were asking about the directory, @DaveBeer +1
64 17:18 <turkycat> we wouldn't be able to -reindex either
65 17:18 <stickies-v> turkycat: why?
66 17:19 <turkycat> since the index/ directory is a sub of blocks/
67 17:19 <pablomartin> blocks/index/
68 17:19 <LarryRuane> oh but that can have its own (write) permission enabled
69 17:19 <stickies-v> exactly! we store multiple kinds of data in the `blocks` directory, including the block index (which is stored in a leveldb db)
70 17:19 <stickies-v> LarryRuane: well yes but it's in the `blocks` directory. perhaps the question should have been clearer about including subdirectories too
71 17:20 <stickies-v> anyway, being able to write to the latest block file is reason enough
72 17:20 <LarryRuane> it probably would have been better if the actual blocks files were in a subdir one level lower .. and maybe the rev (undo) files also one level lower (in own subdir)
73 17:20 <LarryRuane> but too late now
74 17:21 <stickies-v> LarryRuane: what would that have improved?
75 17:21 <lightlike> another thing is that if we -reindex and there is some corruption at some point, we'd stop at that point and rebuild the rest of the blocks - for that, the blocks would need to be writeable again.
76 17:21 <turkycat> LarryRuane yea the chainstate/ folder is aso built while -reindex but it's a sibling of blocks/
77 17:21 <turkycat> lightlike good point
78 17:22 <LarryRuane> one thing I've done often to get the size of the blockchain is `mkdir tmp ; ln ~/.bitcoin/blocks/blk*dat tmp ; du -sh tmp ; rm -r tmp` ... all that wouldn't be needed if blk*.dat files were in a separate subdir
79 17:22 <stickies-v> lightlike: because we attempt to remove the corrupted data from the block files?
80 17:23 <stickies-v> In `FindBlockPos()`, what is the relevance of `fKnown` to this PR, conceptually? How does the behaviour of `FindBlockPos()` change when `fKnown==true` instead of `false`?
81 17:23 <LarryRuane> i think lightlike meant the directory needs to be writeable
82 17:23 <lightlike> stickies-v: I think if -reindex fails to continue at some point, we'd fall back to IBD, download the missing blocks from peers, and overwrite the existing higher block files
83 17:23 <turkycat> I guess I assume, to lightlike's comment, that we would delete all blk and revs from the corruption point onward when rebuilding
84 17:24 <lightlike> but not 100% sure about that...
85 17:24 <pinheadmz> lightlike im not sure thats an automatic thing either but makes sense
86 17:24 <turkycat> yea, interesting- this might be worth a test case
87 17:24 <LarryRuane> turkycat: I don't think anything ever automatically deletes blk and rev files
88 17:24 <LarryRuane> pinheadmz: i think that is automatic
89 17:25 <turkycat> so we would just insert over the corrupted data?
90 17:25 <pinheadmz> well yeah actually, bc reindex would return
91 17:25 <pinheadmz> then bitcoin would just... do bitcoin
92 17:25 <LarryRuane> no i think we just create new blk and rev files
93 17:25 <turkycat> ahh fair
94 17:25 <pinheadmz> yeah probably the old corrupted data stays put
95 17:25 <LarryRuane> one thing to be aware of is, it's okay for the blk files to have redundant blocks (multiple copies of the same block)
96 17:25 <LarryRuane> (it only wastes disk space)
97 17:26 <turkycat> yea I guess if our index points to the 'correct' one (or either if both are correct) it doesn't really matter
98 17:26 <LarryRuane> it's the first one we encounter
99 17:26 <LarryRuane> (when doing reindex)
100 17:26 <pinheadmz> huh, what happens if reindex encounters a duplocate block ?
101 17:27 <turkycat> right but if corrupted, I guess it would fail validation and find it later
103 17:27 <LarryRuane> pinheadmz: it just ignores it
104 17:27 <pinheadmz> it wouldnt over-write the first idex ?
106 17:27 <LarryRuane> stickies-v: right, i don't think anything auto-deletes those files
107 17:27 <pinheadmz> only in pruning mode !
108 17:28 <turkycat> yea I made a bold assumption and was wrong
109 17:29 <stickies-v> just reposting the current question since we've had a lot of (good!) discussion after:
110 17:29 <stickies-v> In `FindBlockPos()`, what is the relevance of `fKnown` to this PR, conceptually? How does the behaviour of `FindBlockPos()` change when `fKnown==true` instead of `false`?
111 17:30 <LarryRuane> I think `fKnown` means the block already exists on disk .. maybe `fExists` would have been a better name?
112 17:30 <alex-wiederin> stickies-v `fKnown` param determines whether `FlushBlockFile` is called in `FindBlockPos()`. The call to flush block file has been moved to the condition of `!fKnown` (i.e. if the position is unknown), I believe this is where get the reduction in calls to write.
113 17:31 <turkycat> LarryRuane no harder problem in computer science than naming things
114 17:31 <LarryRuane> turkycat: +1
115 17:32 <LarryRuane> the block has a "known" location, meaning that it already exists on disk (so it's not a terrible name)
116 17:32 <turkycat> we also only allocate if the position is not known, after finding a good position for it and deciding if we should finalize the file after
117 17:32 <LarryRuane> `FindBlockPos()` flushes the block if the position isn't known
118 17:32 <stickies-v> LarryRuane: alex-wiederin: yeah you're both correct! I'd say the relevance is that `fKnown` is going to always be true when reindexing, which is what this PR is targeting
119 17:32 <pinheadmz> yeah it does a lot more than just find something!
120 17:33 <pinheadmz> lightlike has a WIP to clean up that part of the code, but for now its a sort of typical bitcoin core codebase thing
121 17:33 — pinheadmz grimace
122 17:33 <turkycat> or, as a clarification, the finalize is for the current file (if there is not sufficient space for the new block) before the allocate logic
123 17:33 <turkycat> none of that logic is necessary if we know where the file position is
124 17:34 <lightlike> yeah, i plan on opening a cleanup PR for that, but only after this PR is merged.
125 17:34 <LarryRuane> pinheadmz: agree needs cleanup, i think in general when there are boolean flags, something may not be designed in the cleanest way
126 17:34 <pinheadmz> heh yeah but someone sometime made a nice little PR that was easy to review at the time :-)
127 17:34 <stickies-v> a little bonus side question: what does finalizing a block or undo file mean? like - what happens, and why do we need this?
128 17:34 <LarryRuane> (i say "may not" because it could be okay... but often isn't)
129 17:35 <LarryRuane> i think finalizing means truncating the file to the minimum length (?)
130 17:35 <stickies-v> LarryRuane: yes, and why does that need to happen?
131 17:36 <turkycat> because we allocate in chunks that are likely larger than the actual space used
132 17:36 <LarryRuane> it's created with a somewhat larger than needed size (i think it's 16m unless test mode), and grows by larger amounts (probably also 16m), because it's slightly more efficient than growing it on demand
133 17:36 <turkycat> coming in with the assist for LarryRuane
134 17:37 <stickies-v> exactly! we pre-allocate space, then fill it up best we can, and once we're about to exceed size we trim off whatever we didn't use
135 17:37 <stickies-v> alright next question
136 17:37 <stickies-v> What happens if bitcoind is killed after when block(s) have been processed, but before `FlushBlockFile()` is called? Does this PR affect that behaviour?
137 17:37 <LarryRuane> makes sense because we know it's very likely that we'll be writing to the file again soon
138 17:38 <LarryRuane> "if bitcoind is killed" ... I'm not sure about this, but i think the block index isn't flushed until after the block file, so we re-do the block processing when we come back up
139 17:39 <turkycat> the buffered content won't be written to disk and we'll have to re-download the block
140 17:39 <alex-wiederin> Agree. Don't think PR changes in that sense
141 17:39 <LarryRuane> in general, when something on disk (A) refers to something else on disk (B), you always want to write out B first (and sync it) and THEN write A to disk
143 17:40 <LarryRuane> (filesystem code has many of those kinds of sequenced writes to try to be corruption-proof)
144 17:40 <LarryRuane> in a way, the block index and the block data comprise a kind of primitive filesystem
145 17:40 <lightlike> I think if bitcoind is killed during -reindex, it will always start the reindex from scratch (and not try to continue at the latest point)
146 17:41 <LarryRuane> lightlike: is that right, interesting, i thought it would continue (IF you don't specify `-reindex` again .. if you do then of course it does start over)
148 17:41 <pinheadmz> sipa says it continutes
149 17:42 <LarryRuane> i think many people specify `-reindex` on the restart, thinking it's needed ... but it's not
150 17:42 <lightlike> oh, interesting. will try that out
151 17:42 <LarryRuane> pinheadmz: good SE find
152 17:42 <pinheadmz> i knew this one, i must have looked it up before lol
153 17:43 <turkycat> yea for sure if I wasn't 100% sure it would resume I'd start it over on a relaunch just to be sure. this is good to know
154 17:44 <sipa> The reindexing process consists of two phases really, the first is rebuilding the block index, the second is wiping the utxo set and rebuilding is. If you interrupt during the second one, it'll definitely just continue where it left off, because that is just the normal "try to activate the best known block" logic, unrelated to reindexing.
155 17:44 <sipa> I'm not sure if progress is saved during the first block index rebuilding phase, but if past-me says so, it's probably true.
156 17:44 <pinheadmz> past-you is so smart
157 17:45 <pinheadmz> and is reindex-chainstate then just skipping right to that ssecond step ?
158 17:45 <sipa> Exactly. `-reindex-chainstate` is exactly equivalent to `rm -rf ~/.bitcoin/chainstate`
159 17:45 <LarryRuane> that second phase takes much longer in practice, BTW ... actually @sipa aren't there now 3 phases? first block headers (takes only a couple of minutes on most systems), then the two you mentioned?
160 17:45 <lightlike> sure - but what if it's aborted midway through the first part
161 17:45 <LarryRuane> and actually, block headers are now downloeaded twice
162 17:46 <sipa> @LarryRuane You'll know this better now than I do.
164 17:46 <sipa> @stickies-v Nice catch, indeed.
165 17:47 <stickies-v> In the `blockmanager_flush_block_file` unit test, why are we pausing before starting a new blockfile?
167 17:47 <LarryRuane> stickies-v: yes .. and IIUC, if for some reason that flag in the leveldb index weren't `true`, we'd IBD from that point?
168 17:47 <lightlike> LarryRuane: redownloading the block headers via p2p isn't part of -reindex as far as I know. -reindex works without any peers.
169 17:48 <LarryRuane> oh that's right
170 17:49 <LarryRuane> I think the node does get bothered a little during reindex by P2P messages, which you can disable with `--maxconnections=0` if you're doing reindex performance comparisons
171 17:49 <LarryRuane> (i'm not sure how much of a difference that makes tho)
172 17:50 <LarryRuane> "why are we pausing" to detect if the file was written (timestamp will be different if so)
173 17:51 <stickies-v> LarryRuane: but why do we need to pause to check for a different timestamp when these operations are sequential?
174 17:51 <pinheadmz> this was a really hard test to write, because "flushing" is something unpredictable the OS does and theres no great way to know if it happened or not
175 17:53 <stickies-v> pinheadmz: yeah, and i'm not sure if it's reliable now? i tried running the updated `feature_reindex.py` test with the changes to `blockstorage.cpp` reverted and the test keeps succeeding
176 17:53 <LarryRuane> stickies-v: I think it's because sequential operations can still happen within the same second (or whatever the file timestamp resolution is)
177 17:53 <pinheadmz> Thats correct LarryRuane on windows its 2 seconds!
178 17:53 <turkycat> so, I believe the answer is that the OS writing the data and updating the timestamp on the file is async and there is some delay, for which `pause` is set at what should be the max delay
179 17:53 <LarryRuane> i think i suggested on the PR to compute the file checksum instead of depending on timestamps, but that's more work, probably not worth it
180 17:53 <stickies-v> LarryRuane: yeah the latter part is what kinda surprised me - that there's such a huge variance in resolution of last modified timestamps
182 17:54 <pinheadmz> stickies-v yeah the test is, hard. I added it by jonatack request and it doesnt really cover the change in the PR as much as it just sets a regression test that the PR DOESNT BREAK
183 17:54 <LarryRuane> stickies-v: madness :)
184 17:54 <stickies-v> turkycat: this is more of a filesystem than an OS thing, afaik (which is also one concern I have with the current approach)
185 17:54 <pablomartin> thankfuklly the intermittent failure of the test caught it
186 17:55 <turkycat> I found these pauses strange though, in their order. I made a comment on it and tbh still don't fully understand why we aren't 1) read timestamp 2) perform write 3) pause 4) read timestamp again 5) compare
187 17:55 <pinheadmz> write = update timestamp
188 17:55 <pinheadmz> so reading timestamp before writing isnt useful
189 17:55 <pinheadmz> then we want to know if we WROTE AGAIN
190 17:55 <pinheadmz> and the only way to know is, the timestamp has changed
191 17:56 <turkycat> right but reading it first gives you the comparison point, you want to make sure that the value didn't change
192 17:56 <pinheadmz> but if the second write happened so fast that time itself hasnt advanced, the test would false-positibe
193 17:56 <pinheadmz> actually we are doing setps (1,2,3,4,5) as yo umention
194 17:57 <turkycat> ok, fair, I'll consider that
195 17:57 <pinheadmz> but we do a write before step 1 as well
196 17:57 <pinheadmz> er sorry no sorry
197 17:57 <pinheadmz> yeah we need to pause between writes so that if a write happened, the time will definitely be updated
198 17:58 <turkycat> so, perhaps I misunderstand the filesystem update delay- but I guess my thought is that since we're writing on line 150 we should pause again before reading `last_write_time` again, for comparison?
199 17:58 <stickies-v> turkycat: it's not a lag thing, it's a resolution thing
200 17:59 <stickies-v> the last modified timestamp just doesn't store any more specific timestamps than that resolution
201 17:59 <pinheadmz> yeah imagine if timestamp resolution was one day
202 17:59 <turkycat> yea ok I think that was my fundamental misunderstanding. I assumed the fflush and fsync were async and you needed to check the timestamp after some period to be sure
203 17:59 <stickies-v> so on FAT filesystem, if you perform 10 operations within an (ideally timed) span of 2 seconds, you'd update the last modified timestamp 10 times, but they'd all have the same value
204 17:59 <pinheadmz> you could update your file all day long and keep checking the time, itd never change
205 18:00 <turkycat> got it, cheers guys
206 18:00 <pinheadmz> yeah so if you happen to run this test on windows, its extra slow
207 18:00 <pinheadmz> theres an ifdef that changes the pause length for linux
208 18:00 <stickies-v> with that said, we're out of time for this meeting so the remaining questions are left as an exercise to the reader - feel free to continue discussing here!
209 18:00 <LarryRuane> another reason to avoid windows haha
210 18:00 <pinheadmz> thanks everyone for taking a look!
211 18:00 <LarryRuane> thanks, @stickies-v and everyone else!
212 18:01 <stickies-v> thank you everyone for participating, a bit less code heavy but i hope the concepts were interesting
213 18:01 <LarryRuane> and @pinheadmz !!
214 18:01 <stickies-v> and thank you pinheadmz for authoring the PR and helping out!
215 18:01 <alex-wiederin> thanks stickies-v!
216 18:01 <stickies-v> #endmeeting