Do not flush block to disk if it is already there (block storage)

Apr 19, 2023

https://github.com/bitcoin/bitcoin/pull/27039

Host: stickies-v - PR author: pinheadmz

The PR branch HEAD was 93c70287a at the time of this review club meeting.

Notes

Recommended reading from earlier review clubs:
- 17994: Flush undo files after last block write to learn more about how blocks are downloaded and stored
- 16981: Improve runtime performance of –reindex and 24858: incorrect blk file size calculation during reindex results in recoverable blk file corruption to learn more about reindexing
The blocks/ directory contains three datasets:
- the raw append-only blknnnnn.dat files containing the serialized block data
- the raw append-only revnnnnn.dat files containing the undo data,
- the blocks/index/ block index data which is stored in a LevelDB database. The index helps to quickly find blocks without having to re-scan all the data files.
  - At runtime, the block index is loaded into BlockManager::m_block_index. However, we also maintain statistics about the block files in BlockManager::m_blockfile_info.
Flushing is the process of bringing the data that we keep in-memory in sync with the data we store on disk. We want to flush regularly enough to ensure that when bitcoind is unexpectedly killed, we may have to redo some work (e.g. re-download a couple of blocks) but we don’t end up in an unrecoverable state (that would e.g. require a completely new IBD). On the other hand, the (generally quite slow) disk I/O can have negative performance implications when we flush too often.
A reindex, amongst other things, recreates the block index from the block files on disk. The block files themselves don’t need to be updated: blocks are already assumed to not be stored sequentially in or across block files anyway.
This PR skips flushing to disk blocks whose positions are already known. Skipping this means that users can write-protect the blk files on disk which may be useful for security or even safely sharing that data between multiple bitcoind instances.
Some of the functions that are relevant to the reindexing process behave differently than can perhaps be expected from their name:
- BlockManager::SaveBlockToDisk() does not actually write anything to disk when (as is the case for a reindex) the known block position is passed (as dbp), but we still need to call it to update the blockfile info stats
- BlockManager::FindBlockPos() is mostly used to find the position in a blockfile where a new block is going to be appended. However, it also updates the blockfile info and may also trigger the flushing of block and undo files, which is why we still call it even if we already know the block’s position.
- this behaviour arguably could arguably be improved upon, as for example done in this WIP by mzumsande

Questions

Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK? What was your review approach?
After this PR, can the whole blocks/ directory be made read-only? Why (not)?
In FindBlockPos(), what is the relevance of fKnown to this PR, conceptually? How does the behaviour of FindBlockPos() change when fKnown==true instead of false?
What happens if bitcoind is killed after when block(s) have been processed, but before FlushBlockFile() is called? Does this PR affect that behaviour?
In the blockmanager_flush_block_file unit test, why are we pausing before starting a new blockfile? Can you think of issues with this approach, or alternatives?
In the blockmanager_flush_block_file unit test, why do we still only have two blocks in the file after we have called SaveBlockToDisk(block3)?
In BlockManager::FlushBlockFile(), why do we always flush the undo file except for when we’re finalizing the block file (and we’re not finalizing the undo file)?

Meeting Log

17:00

<stickies-v> #startmeeting

17:00

<LarryRuane> hi

17:00

<pinheadmz> hi !

17:00

<alex-wiederin> hi

17:00

<DaveBeer> hi

17:00

<lightlike> hi

17:00

<pablomartin> hi all!

17:00

<turkycat> hi everyone

17:01

<stickies-v> welcome everyone! Today we're looking at #27039, authored by pinheadmz. The notes and questions are available on https://bitcoincore.reviews/27039

17:01

<stickies-v> anyone joining us for the first time today? even if you're just lurking, feel free to say hi!

17:03

<stickies-v> an old-timers only day it seems

17:03

<abubakar> hi

17:03

<stickies-v> who got the chance to review the PR or read the notes? (y/n)

17:03

<turkycat> y

17:03

<pablomartin> y

17:03

<abubakar> read the notes

17:03

<DaveBeer> y, read the notes

17:04

<LarryRuane> y

17:04

<alex-wiederin> read the notes

17:05

<stickies-v> the PR only modifies a few lines of business logic, almost all of the PR is tests, but I think touches on a lot of interesting concepts re how we store and index blocks

17:05

<stickies-v> how would you summarize this PR in your own words?

17:06

<DaveBeer> minimize unnecessary disk writes

17:06

<turkycat> this PR makes it possible to share all but the latest block file amongst multiple clients, also might prevent potential data corruption by preventing many unnecessary writes

17:07

<stickies-v> DaveBeer - it doesn't actually change any writing behaviour, as far as I understand

17:07

<pinheadmz> turkycat youre right about being able to share the files, but its not really about disk writes !

17:07

<turkycat> ack

17:08

<pinheadmz> stickies-v is right. did anyone check out the related issue?

17:08

<LarryRuane> It improves performance by refraining from writing blocks to disk that are already present on disk (so are no-op writes).. but also +1 to @turkycat 's comment about corruption, hadn't thought of that

17:08

<stickies-v> pinheadmz: what's the issue with sharing block files prior to this PR? I'd think that's possible either way?

17:08

<pinheadmz> https://github.com/bitcoin/bitcoin/issues/2039

17:08

<pinheadmz> you can try this with your own node - set the blocks dir to readonly and then start with -reindex

17:09

<pinheadmz> that *should* be possible but bitcoin throws an error - not because its trying to WRITE a file but .... ??? anyone ?

17:09

<stickies-v> LarryRuane: prior to this PR, we weren't writing any blocks to disk again either - I don't think this behaviour is changed?

17:09

<pinheadmz> correct!

17:09

<turkycat> it should be possible to share, but the latest files (blk and rev) might change while being read by another client

17:10

<pinheadmz> the error message is in this comment: https://github.com/bitcoin/bitcoin/issues/2039#issuecomment-1101330894

17:11

<turkycat> or maybe the OS would fail to open the file for reading if it was opened for writing by bitcoin core

17:11

<stickies-v> turkycat: ah, I was understanding "sharing" as copying the files to another client, instead of multiple clients using the same file (which is the more obvious interpretation). my bad

17:11

<LarryRuane> oh i see, we're trying to open the file with read-write permission, but if the file is RO, that fails

17:11

<pinheadmz> turkycat yes!

17:11

<pinheadmz> the bug is we cant OPEN the file, because we are trying to open it with W flag

17:11

<pablomartin> stickies-v I was thinking the same... no, same location, locking issue...

17:11

<pinheadmz> LarryRuane yep

17:12

<stickies-v> LarryRuane: indeed, see https://github.com/bitcoin/bitcoin/blob/d908877c4774c2456eed09167a5f382758e4a8a6/src/flatfile.cpp#L83 where we're calling `Open()` without the `read_only` parameter which defaults to `false`)

17:13

<LarryRuane> oh i see, so the PR avoids calling the function (Flush) that is calling Open()

17:13

<stickies-v> yessir

17:13

<LarryRuane> (unless necessary of course)

17:14

<LarryRuane> I even reviewed this PR but had forgotten... they say memory is the first thing to go... I don't recall what's the second thing

17:14

<stickies-v> (I think the error message could've been clearer if `FlatFileSeq::Flush` logged "failed to open file in write mode" instead of just "failed to open file")

17:15

<LarryRuane> should this PR improve that error message, maybe?

17:15

<pinheadmz> It should also log "Unable to open file" from flatfile.cpp Open() I think

17:15

<turkycat> lol LarryRuane

17:16

<stickies-v> gonna start moving on to the next questions, but as always - feel free to keep the discussion on previous questions going!

17:16

<LarryRuane> I've noticed it's really hard to decide which functions should log failures and which should not!

17:17

<stickies-v> LarryRuane: I think that's where the `util::Result` class comes in handy, so we can more easily propagate detailed error messages all the way to where we decide if we want to log (which depends on the use case)

17:17

<LarryRuane> stickies-v: +1 ... there's a review club for that in case anyone here isn't aware

17:17

<stickies-v> After this PR, can the whole `blocks/` directory be made read-only? Why (not)?

17:18

<LarryRuane> NO! because we still need to be able to write to the latest blocks file

17:18

<turkycat> ^

17:18

<DaveBeer> no it can't, client still needs to be able to create block files

17:18

<pablomartin> LarryRuane +1

17:18

<stickies-v> LarryRuane: correct! but... there's another reason

17:18

<LarryRuane> oh sorry, you were asking about the directory, @DaveBeer +1

17:18

<turkycat> we wouldn't be able to -reindex either

17:18

<stickies-v> turkycat: why?

17:19

<turkycat> since the index/ directory is a sub of blocks/

17:19

<pablomartin> blocks/index/

17:19

<LarryRuane> oh but that can have its own (write) permission enabled

17:19

<stickies-v> exactly! we store multiple kinds of data in the `blocks` directory, including the block index (which is stored in a leveldb db)

17:19

<stickies-v> LarryRuane: well yes but it's in the `blocks` directory. perhaps the question should have been clearer about including subdirectories too

17:20

<stickies-v> anyway, being able to write to the latest block file is reason enough

17:20

<LarryRuane> it probably would have been better if the actual blocks files were in a subdir one level lower .. and maybe the rev (undo) files also one level lower (in own subdir)

17:20

<LarryRuane> but too late now

17:21

<stickies-v> LarryRuane: what would that have improved?

17:21

<lightlike> another thing is that if we -reindex and there is some corruption at some point, we'd stop at that point and rebuild the rest of the blocks - for that, the blocks would need to be writeable again.

17:21

<turkycat> LarryRuane yea the chainstate/ folder is aso built while -reindex but it's a sibling of blocks/

17:21

<turkycat> lightlike good point

17:22

<LarryRuane> one thing I've done often to get the size of the blockchain is `mkdir tmp ; ln ~/.bitcoin/blocks/blk*dat tmp ; du -sh tmp ; rm -r tmp` ... all that wouldn't be needed if blk*.dat files were in a separate subdir

17:22

<stickies-v> lightlike: because we attempt to remove the corrupted data from the block files?

17:23

<stickies-v> In `FindBlockPos()`, what is the relevance of `fKnown` to this PR, conceptually? How does the behaviour of `FindBlockPos()` change when `fKnown==true` instead of `false`?

17:23

<LarryRuane> i think lightlike meant the directory needs to be writeable

17:23

<lightlike> stickies-v: I think if -reindex fails to continue at some point, we'd fall back to IBD, download the missing blocks from peers, and overwrite the existing higher block files

17:23

<turkycat> I guess I assume, to lightlike's comment, that we would delete all blk and revs from the corruption point onward when rebuilding

17:24

<lightlike> but not 100% sure about that...

17:24

<pinheadmz> lightlike im not sure thats an automatic thing either but makes sense

17:24

<turkycat> yea, interesting- this might be worth a test case

17:24

<LarryRuane> turkycat: I don't think anything ever automatically deletes blk and rev files

17:24

<LarryRuane> pinheadmz: i think that is automatic

17:25

<turkycat> so we would just insert over the corrupted data?

17:25

<pinheadmz> well yeah actually, bc reindex would return

17:25

<pinheadmz> then bitcoin would just... do bitcoin

17:25

<LarryRuane> no i think we just create new blk and rev files

17:25

<turkycat> ahh fair

17:25

<pinheadmz> yeah probably the old corrupted data stays put

17:25

<LarryRuane> one thing to be aware of is, it's okay for the blk files to have redundant blocks (multiple copies of the same block)

17:25

<LarryRuane> (it only wastes disk space)

17:26

<turkycat> yea I guess if our index points to the 'correct' one (or either if both are correct) it doesn't really matter

17:26

<LarryRuane> it's the first one we encounter

17:26

<LarryRuane> (when doing reindex)

100

17:26

<pinheadmz> huh, what happens if reindex encounters a duplocate block ?

101

17:27

<turkycat> right but if corrupted, I guess it would fail validation and find it later

102

17:27

<stickies-v> on https://github.com/bitcoin/bitcoin/blob/d26a71a94ac4ae1b1a091f4412d390afba69b2f8/src/node/blockstorage.cpp#L877-L896 I can't immediately see any logic that deletes blk and rev files when reindex fails, but may need deeper digging

103

17:27

<LarryRuane> pinheadmz: it just ignores it

104

17:27

<pinheadmz> it wouldnt over-write the first idex ?

105

17:27

<pinheadmz> ah

106

17:27

<LarryRuane> stickies-v: right, i don't think anything auto-deletes those files

107

17:27

<pinheadmz> only in pruning mode !

108

17:28

<turkycat> yea I made a bold assumption and was wrong

109

17:29

<stickies-v> just reposting the current question since we've had a lot of (good!) discussion after:

110

17:29

<stickies-v> In `FindBlockPos()`, what is the relevance of `fKnown` to this PR, conceptually? How does the behaviour of `FindBlockPos()` change when `fKnown==true` instead of `false`?

111

17:30

<LarryRuane> I think `fKnown` means the block already exists on disk .. maybe `fExists` would have been a better name?

112

17:30

<alex-wiederin> stickies-v `fKnown` param determines whether `FlushBlockFile` is called in `FindBlockPos()`. The call to flush block file has been moved to the condition of `!fKnown` (i.e. if the position is unknown), I believe this is where get the reduction in calls to write.

113

17:31

<turkycat> LarryRuane no harder problem in computer science than naming things

114

17:31

<LarryRuane> turkycat: +1

115

17:32

<LarryRuane> the block has a "known" location, meaning that it already exists on disk (so it's not a terrible name)

116

17:32

<turkycat> we also only allocate if the position is not known, after finding a good position for it and deciding if we should finalize the file after

117

17:32

<LarryRuane> `FindBlockPos()` flushes the block if the position isn't known

118

17:32

<stickies-v> LarryRuane: alex-wiederin: yeah you're both correct! I'd say the relevance is that `fKnown` is going to always be true when reindexing, which is what this PR is targeting

119

17:32

<pinheadmz> yeah it does a lot more than just find something!

120

17:33

<pinheadmz> lightlike has a WIP to clean up that part of the code, but for now its a sort of typical bitcoin core codebase thing

121

17:33

— pinheadmz grimace

122

17:33

<turkycat> or, as a clarification, the finalize is for the current file (if there is not sufficient space for the new block) before the allocate logic

123

17:33

<turkycat> none of that logic is necessary if we know where the file position is

124

17:34

<lightlike> yeah, i plan on opening a cleanup PR for that, but only after this PR is merged.

125

17:34

<LarryRuane> pinheadmz: agree needs cleanup, i think in general when there are boolean flags, something may not be designed in the cleanest way

126

17:34

<pinheadmz> heh yeah but someone sometime made a nice little PR that was easy to review at the time :-)

127

17:34

<stickies-v> a little bonus side question: what does finalizing a block or undo file mean? like - what happens, and why do we need this?

128

17:34

<LarryRuane> (i say "may not" because it could be okay... but often isn't)

129

17:35

<LarryRuane> i think finalizing means truncating the file to the minimum length (?)

130

17:35

<stickies-v> LarryRuane: yes, and why does that need to happen?

131

17:36

<turkycat> because we allocate in chunks that are likely larger than the actual space used

132

17:36

<LarryRuane> it's created with a somewhat larger than needed size (i think it's 16m unless test mode), and grows by larger amounts (probably also 16m), because it's slightly more efficient than growing it on demand

133

17:36

<turkycat> coming in with the assist for LarryRuane

134

17:37

<stickies-v> exactly! we pre-allocate space, then fill it up best we can, and once we're about to exceed size we trim off whatever we didn't use

135

17:37

<stickies-v> alright next question

136

17:37

<stickies-v> What happens if bitcoind is killed after when block(s) have been processed, but before `FlushBlockFile()` is called? Does this PR affect that behaviour?

137

17:37

<LarryRuane> makes sense because we know it's very likely that we'll be writing to the file again soon

138

17:38

<LarryRuane> "if bitcoind is killed" ... I'm not sure about this, but i think the block index isn't flushed until after the block file, so we re-do the block processing when we come back up

139

17:39

<turkycat> the buffered content won't be written to disk and we'll have to re-download the block

140

17:39

<alex-wiederin> Agree. Don't think PR changes in that sense

141

17:39

<LarryRuane> in general, when something on disk (A) refers to something else on disk (B), you always want to write out B first (and sync it) and THEN write A to disk

142

17:40

<stickies-v> (we preallocate in chunks of 16MiB for block files and 1MiB for undo files: https://github.com/bitcoin/bitcoin/blob/d26a71a94ac4ae1b1a091f4412d390afba69b2f8/src/node/blockstorage.cpp#L587-L592)

143

17:40

<LarryRuane> (filesystem code has many of those kinds of sequenced writes to try to be corruption-proof)

144

17:40

<LarryRuane> in a way, the block index and the block data comprise a kind of primitive filesystem

145

17:40

<lightlike> I think if bitcoind is killed during -reindex, it will always start the reindex from scratch (and not try to continue at the latest point)

146

17:41

<LarryRuane> lightlike: is that right, interesting, i thought it would continue (IF you don't specify `-reindex` again .. if you do then of course it does start over)

147

17:41

<pinheadmz> lightlike https://bitcoin.stackexchange.com/questions/32835/safely-interrupt-reindex

148

17:41

<pinheadmz> sipa says it continutes

149

17:42

<LarryRuane> i think many people specify `-reindex` on the restart, thinking it's needed ... but it's not

150

17:42

<lightlike> oh, interesting. will try that out

151

17:42

<LarryRuane> pinheadmz: good SE find

152

17:42

<pinheadmz> i knew this one, i must have looked it up before lol

153

17:43

<turkycat> yea for sure if I wasn't 100% sure it would resume I'd start it over on a relaunch just to be sure. this is good to know

154

17:44

<sipa> The reindexing process consists of two phases really, the first is rebuilding the block index, the second is wiping the utxo set and rebuilding is. If you interrupt during the second one, it'll definitely just continue where it left off, because that is just the normal "try to activate the best known block" logic, unrelated to reindexing.

155

17:44

<sipa> I'm not sure if progress is saved during the first block index rebuilding phase, but if past-me says so, it's probably true.

156

17:44

<pinheadmz> past-you is so smart

157

17:45

<pinheadmz> and is reindex-chainstate then just skipping right to that ssecond step ?

158

17:45

<sipa> Exactly. `-reindex-chainstate` is exactly equivalent to `rm -rf ~/.bitcoin/chainstate`

159

17:45

<LarryRuane> that second phase takes much longer in practice, BTW ... actually @sipa aren't there now 3 phases? first block headers (takes only a couple of minutes on most systems), then the two you mentioned?

160

17:45

<lightlike> sure - but what if it's aborted midway through the first part

161

17:45

<LarryRuane> and actually, block headers are now downloeaded twice

162

17:46

<sipa> @LarryRuane You'll know this better now than I do.

163

17:46

<stickies-v> I think this might be where we check if we need to continue reindexing: https://github.com/bitcoin/bitcoin/blob/d26a71a94ac4ae1b1a091f4412d390afba69b2f8/src/node/blockstorage.cpp#L364-L367

164

17:46

<sipa> @stickies-v Nice catch, indeed.

165

17:47

<stickies-v> In the `blockmanager_flush_block_file` unit test, why are we pausing before starting a new blockfile?

166

17:47

<stickies-v> (link: https://github.com/bitcoin-core-review-club/bitcoin/commit/470ef396b5498d8689802c359a216d5a3c4749a5#diff-d6d633592a40f5f3d8b03863e41547de8751b874c1d20f129a616b9dd719b999R170)

167

17:47

<LarryRuane> stickies-v: yes .. and IIUC, if for some reason that flag in the leveldb index weren't `true`, we'd IBD from that point?

168

17:47

<lightlike> LarryRuane: redownloading the block headers via p2p isn't part of -reindex as far as I know. -reindex works without any peers.

169

17:48

<LarryRuane> oh that's right

170

17:49

<LarryRuane> I think the node does get bothered a little during reindex by P2P messages, which you can disable with `--maxconnections=0` if you're doing reindex performance comparisons

171

17:49

<LarryRuane> (i'm not sure how much of a difference that makes tho)

172

17:50

<LarryRuane> "why are we pausing" to detect if the file was written (timestamp will be different if so)

173

17:51

<stickies-v> LarryRuane: but why do we need to pause to check for a different timestamp when these operations are sequential?

174

17:51

<pinheadmz> this was a really hard test to write, because "flushing" is something unpredictable the OS does and theres no great way to know if it happened or not

175

17:53

<stickies-v> pinheadmz: yeah, and i'm not sure if it's reliable now? i tried running the updated `feature_reindex.py` test with the changes to `blockstorage.cpp` reverted and the test keeps succeeding

176

17:53

<LarryRuane> stickies-v: I think it's because sequential operations can still happen within the same second (or whatever the file timestamp resolution is)

177

17:53

<pinheadmz> Thats correct LarryRuane on windows its 2 seconds!

178

17:53

<turkycat> so, I believe the answer is that the OS writing the data and updating the timestamp on the file is async and there is some delay, for which `pause` is set at what should be the max delay

179

17:53

<LarryRuane> i think i suggested on the PR to compute the file checksum instead of depending on timestamps, but that's more work, probably not worth it

180

17:53

<stickies-v> LarryRuane: yeah the latter part is what kinda surprised me - that there's such a huge variance in resolution of last modified timestamps

181

17:53

<stickies-v> see e.g. https://stackoverflow.com/a/31521204/734612 for an overview

182

17:54

<pinheadmz> stickies-v yeah the test is, hard. I added it by jonatack request and it doesnt really cover the change in the PR as much as it just sets a regression test that the PR DOESNT BREAK

183

17:54

<LarryRuane> stickies-v: madness :)

184

17:54

<stickies-v> turkycat: this is more of a filesystem than an OS thing, afaik (which is also one concern I have with the current approach)

185

17:54

<pablomartin> thankfuklly the intermittent failure of the test caught it

186

17:55

<turkycat> I found these pauses strange though, in their order. I made a comment on it and tbh still don't fully understand why we aren't 1) read timestamp 2) perform write 3) pause 4) read timestamp again 5) compare

187

17:55

<pinheadmz> write = update timestamp

188

17:55

<pinheadmz> so reading timestamp before writing isnt useful

189

17:55

<pinheadmz> then we want to know if we WROTE AGAIN

190

17:55

<pinheadmz> and the only way to know is, the timestamp has changed

191

17:56

<turkycat> right but reading it first gives you the comparison point, you want to make sure that the value didn't change

192

17:56

<pinheadmz> but if the second write happened so fast that time itself hasnt advanced, the test would false-positibe

193

17:56

<pinheadmz> actually we are doing setps (1,2,3,4,5) as yo umention

194

17:57

<turkycat> ok, fair, I'll consider that

195

17:57

<pinheadmz> but we do a write before step 1 as well

196

17:57

<pinheadmz> er sorry no sorry

197

17:57

<pinheadmz> yeah we need to pause between writes so that if a write happened, the time will definitely be updated

198

17:58

<turkycat> so, perhaps I misunderstand the filesystem update delay- but I guess my thought is that since we're writing on line 150 we should pause again before reading `last_write_time` again, for comparison?

199

17:58

<stickies-v> turkycat: it's not a lag thing, it's a resolution thing

200

17:59

<stickies-v> the last modified timestamp just doesn't store any more specific timestamps than that resolution

201

17:59

<pinheadmz> yeah imagine if timestamp resolution was one day

202

17:59

<turkycat> yea ok I think that was my fundamental misunderstanding. I assumed the fflush and fsync were async and you needed to check the timestamp after some period to be sure

203

17:59

<stickies-v> so on FAT filesystem, if you perform 10 operations within an (ideally timed) span of 2 seconds, you'd update the last modified timestamp 10 times, but they'd all have the same value

204

17:59

<pinheadmz> you could update your file all day long and keep checking the time, itd never change

205

18:00

<turkycat> got it, cheers guys

206

18:00

<pinheadmz> yeah so if you happen to run this test on windows, its extra slow

207

18:00

<pinheadmz> theres an ifdef that changes the pause length for linux

208

18:00

<stickies-v> with that said, we're out of time for this meeting so the remaining questions are left as an exercise to the reader - feel free to continue discussing here!

209

18:00

<LarryRuane> another reason to avoid windows haha

210

18:00

<pinheadmz> thanks everyone for taking a look!

211

18:00

<LarryRuane> thanks, @stickies-v and everyone else!

212

18:01

<stickies-v> thank you everyone for participating, a bit less code heavy but i hope the concepts were interesting