Don't allow a coin to be spent and FRESH. Improve commenting. (utxo db and indexes)

https://github.com/bitcoin/bitcoin/pull/18113

Host: jnewbery  -  PR author: jnewbery

The PR branch HEAD was 9288d747 at the time of this review club meeting.

Notes

  • The coins database is the structure in Bitcoin Core which keeps track of the UTXO set. We looked at the coins database in a previous PR Review Club (#17487). The notes for that Review Club meeting include an excellent description of the caching for the coins database.

  • The coins database is a key-value store. It was introduced in v0.8 (PR 1677 - Ultraprune) where it was keyed by transaction. Pieter Wuille describes this change in a recent Chaincode Podcast episode.

  • In v0.15 the coins database was changed to be keyed by transaction output (PR 10195). For more information on that change see this blog post.

  • The ‘prune’ in ultraprune came from the fact that once all of the outputs from a transaction had been spent, that entry could be removed or ‘pruned’ from the coins database (note that this is a different concept from a pruned node, which deletes old block files that it no longer needs). The ‘pruned’ terminology for the coins database no longer makes sense - either a coin is unspent (and exists in the database) or is spent (and removed from the database).

  • When the coins database was changed from per-transaction to per-txout, some of the ‘pruned’ code comments and variable names were left in. This PR was motivated by improving the clarity of those comments. Subsequently some additional small changes were added.

Questions

  1. Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK? (Don’t forget to put your PR review on GitHub.)

  2. This PR touches consensus-critical code. Does it meet the high bar for justifying a change to such critical code? Is improving code clarity justification enough?

  3. What does it mean for a coin to be FRESH? What about DIRTY?

  4. Coins in a coins cache can be:
    • spent or unspent,
    • FRESH or not FRESH, and
    • DIRTY or not DIRTY How many total states are there? How many of those are valid? Does this PR change that?
  5. Could this change have a performance impact during block validation or Initial Block Download?

Appendix

(notes added after meeting)

A summary of the validity of possible states, with explanation:

Image of coins table

  • For a description of FRESH & DIRTY, see comments before & after this PR.
  • For more details on possible cases, see here.

Meeting Log

  113:00 <jnewbery> #startmeeting
  213:00 <jkczyz> hi
  313:00 <willcl_ark> hi
  413:00 <fjahr> hi
  513:00 <emzy> hi
  613:00 <pinheadmz> hi
  713:00 <michaelfolkson> hi
  813:00 <amiti> hi
  913:01 <nehan_> hi
 1013:01 <jnewbery> Hi folks. Welcome to PR Review Club! Feel free to say hi to let everyone know you're here.
 1113:01 <jnewbery> As always, feel free to jump in at any point to ask questions. No need to ask to ask, just ask!
 1213:01 <Prayank> Hello Everyone 😀
 1313:02 <jnewbery> wow. Emojis. That's novel :)
 1413:02 <jnewbery> This week's PR is rather short, but I think it's a good opportunity to dig into the coins structures a bit more. Notes are in the normal place: https://bitcoincore.reviews/18113.html
 1513:02 <andrewtoth> hi
 1613:02 <jonatack> hi
 1713:02 <lightlike> hi
 1813:02 <earpiece> hi
 1913:02 <jnewbery> who had a chance to review the changes? y/n
 2013:02 <fjahr> y
 2113:02 <pinheadmz> y
 2213:02 <andrewtoth> y
 2313:02 <amiti> y
 2413:02 <jkczyz> y
 2513:02 <emzy> y
 2613:03 <nehan_> n
 2713:03 <willcl_ark> y
 2813:03 <jnewbery> great. Solitary confinement seems to be doing wonders for people's time for code review
 2913:03 <jnewbery> question 1: Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK?
 3013:04 <jnewbery> and we might as well do question 2 at the same time: This PR touches consensus-critical code. Does it meet the high bar for justifying a change to such critical code? Is improving code clarity justification enough?
 3113:04 <fjahr> Concept ACK, the code looks good but I don't feel comfortable enough with everything especially considering reorgs
 3213:05 <fjahr> If it’s a significant improvement I think it is justified because it reduces the risk of bugs being introduced in the future and makes “real” improvements and new feature development easier
 3313:05 <willcl_ark> tACK from me. But I have to confess I was a bit less clear on the code-side this week. It was a new area for me to be looking at (i didn't attent the previous related meeting in Jan)
 3413:05 <emzy> Conxept ACK, seens the code is more clear for future changes.
 3513:05 <jnewbery> (it's my code but NACK is an acceptable answer)
 3613:05 <jkczyz> Concept ACK. IMHO, consensus-critical code *should* be clear
 3713:06 <andrewtoth> Concept ACK, but it seems like it's doing too much for how critical the code is. Could the renaming of PRUNED to SPENT and comment updates be a separate change to the consensus code change?
 3813:06 <jonatack> scripted diffs and improving commenting, even in consensus code, seem to still have a worthwhile benefit/risk ratio
 3913:06 <jnewbery> willcl_ark: the notes and meeting log from https://bitcoincore.reviews/17487 are really good, so I'd recommend reading through those if this has piqued your interest
 4013:07 <michaelfolkson> Yeah we can't leave consensus code untouched in fear of breaking something if clarity can be improved. Needs lots of review though
 4113:07 <jnewbery> andrewtoth: yes, this could definitely be split in two. One that changes the commends and variable naming, and then a second that makes the logical changes
 4213:08 <andrewtoth> It just seems like review will not be as focused with such a large diff and essentially two different things being changed
 4313:08 <jnewbery> that's fair feedback
 4413:08 <pinheadmz> jnewbery: the organization of the different cache layers is whats missing from my overall view of this PR
 4513:08 <nehan_> i think it would be better to split this PR into two: one for comments/no consensus changes and one with the changes
 4613:09 <jnewbery> pinheadmz: definitely check out https://bitcoincore.reviews/17487 and james's diagram here: https://jameso.be/dev++2018/#54
 4713:09 <willcl_ark> One thing I couldn't tell, is there any guarantee that this could not create a chain split? At a high level, if the change is modifying _both_ the code and the associated test, it seems like that guarantee is harder to prove...
 4813:10 <jnewbery> willcl_ark: if the logical change is bad, then yes, this could cause a consensus failure. Does anyone want to expand on how?
 4913:11 <jkczyz> regarding splitting into PRs, if the changes don't bleed across commits then I think one PR is fine. Unless the logical is contentious. But in that case the comments would need to be updated if the comment PR goes in first
 5013:11 <fjahr> consensus failure: a block would be rejected because different nodes have a different utxo set, one of them would miss a coin that is spent in the block
 5113:11 <Prayank> Interesting
 5213:12 <jnewbery> fjahr: yeah, that's what we should be scared about with any change like this
 5313:13 <jnewbery> I'm going to move onto the next question, but feel free to ask questions/give input on previous questions as we go
 5413:13 <jnewbery> What does it mean for a coin to be FRESH? What about DIRTY?
 5513:14 <amiti> DIRTY means it exists in the parent cache, but has been changed in the child cache
 5613:14 <amiti> FRESH means it doesn't exist in the parent cache
 5713:14 <jnewbery> amiti: exactly right
 5813:15 <fjahr> in the comment it says *potentially* different from the parent cache
 5913:15 <jnewbery> a Coin object can also be spent or unspent. How many different potential states are there?
 6013:15 <jkczyz> 8
 6113:15 <jnewbery> fjahr: ah yes, thanks for clarifying
 6213:15 <jnewbery> jkczyz: yes, and which of those are valid?
 6313:16 <jkczyz> good question. Haven't gone through all them yet :)
 6413:16 <amiti> if a coin is FRESH, is it *always* DIRTY, or is it *never* DIRTY?
 6513:16 <amiti> I think it has to be one of those two, right?
 6613:16 <michaelfolkson> fjahr: Why only potentially? It has been changed?
 6713:17 <lightlike> why can't a coin be "both FRESH and spent" then as one of the commit messages says? Couldn't it be created in the child cache, spent there and still be FRESH because it never existed in the parent?
 6813:17 <amiti> lightlike: my understanding is because in that case you could delete it from the child cache
 6913:18 <jonatack> jkczyz: i agree, istm the main reason to split would be if over time there was a lack of review that could be alleviated by splitting, or to get in worthwhile standalone changes if others were contentious.
 7013:18 <jnewbery> lightlike: important to distinguish between before this PR and after this PR
 7113:18 <fjahr> michaelfolkson: I am not 100% clear on this in which cases this occurs, I am referring to the comment in coins.h: "DIRTY: This cache entry is potentially different from the version in the parent cache"
 7213:18 <jnewbery> before this PR, a Coin could be spent and FRESH
 7313:19 <willcl_ark> Semantically that reads badly, but according to the two definitions above... it appears to be an acceptable state.
 7413:19 <lightlike> oh ok, got it.
 7513:20 <jnewbery> michaelfolkson: I'd need to look back at the code in validation.cpp, but I think there could be a situation where a coin gets deleted in a child cache and then re-created. It would then be the same as the parent cache but would be marked as DIRTY
 7613:20 <michaelfolkson> Ah ok
 7713:21 <michaelfolkson> Should it be marked as DIRTY in this case?
 7813:21 <jnewbery> I think that might happen in a re-org - the block that spent the coin is disconnected, which adds the coin to the cache, and then the competing block also spends the coin and it gets removed
 7913:22 <jnewbery> or something like that. The point is that if we change a coin in a cache, it MUST be marked as DIRTY
 8013:22 <michaelfolkson> Ok
 8113:22 <jnewbery> if not, changes might not get saved when the cache is flushed
 8213:23 <fjahr> michaelfolkson: To be sure you can not mark it as DIRTY you would probably have to check the parent cache all the time and that would be too much of a performance penalty I guess
 8313:23 <Prayank> I have a noob question. Sorry have not read much details but it looked interesting and this is my first meeting in core or review club. Does this FRESH and DIRTY thing in any way affect privacy when a full node interacts with other full nodes.
 8413:24 <fjahr> I don't really know, just speculating
 8513:24 <jnewbery> Prayank: noob questions are welcome here :)
 8613:24 <andrewtoth> FRESH must always be DIRTY. You cannot have a state with FRESH and not DIRTY.
 8713:24 <an4s> I am a noob here as well. Joined late so kind of outta sync
 8813:24 <amiti> andrewtoth: gotcha. thanks
 8913:25 <andrewtoth> because to be FRESH it must necessarily be different than what the parent has, since the parent does not have it
 9013:25 <jnewbery> andrewtoth: Yes, I think that's right. FRESH implies DIRTY
 9113:26 <michaelfolkson> Prayank: I would say not with these definitions of fresh and dirty. There is another context where coins from a coinbase transaction are referred to as fresh and that's where there's a potential privacy benefit
 9213:26 <fjahr> Let me try to answer Prayank: We are discussing internal management of data, this not information that is shared between nodes so it can't impact privacy
 9313:26 <Prayank> Thanks
 9413:26 <jnewbery> michaelfolkson: fjahr: exactly right
 9513:26 <amiti> so what's the point of additionally marking FRESH? Is there a performance benefit when flushing the cache?
 9613:27 <amiti> (before this PR)
 9713:27 <jnewbery> there's also another meaning of 'dirty' in the wallet for addresses that have already been used, but that is unrelated to what we're talking about here
 9813:28 <jnewbery> can anyone answer amiti? Why do we have FRESH at all?
 9913:28 <willcl_ark> So will nodes from before/after this patch be marking coins in their caches differently to each other, and therefore flushing them differently (risking chainsplit)? That's what I can't quite tell..
10013:29 <andrewtoth> If it's marked FRESH and is spent, it can be removed instead of flushed to parent
10113:29 <pinheadmz> jnewbery: i think its bc if that coin is spent, the parent caceh never needs to know about it
10213:29 <andrewtoth> If it's not spent, it needs to be flushed to parent because it's also DIRTY
10313:30 <jnewbery> willcl_ark: the logical change in this PR is to do with the node's internal caching strategy. The underlying coins database shouldn't be affected
10413:30 <amiti> oh right. ah got confused and thought thats what this PR is introducing, but actually the change here is that we don't need to fetch spent coins
10513:30 <amiti> right?
10613:30 <jnewbery> (you could potentially remove the cache entirely and just use the disk's coins database and arrive at the same view of consensus, but performance would be horrendous)
10713:30 <earpiece> fjahr: the pr is not related to P2P layer but so not information intentionally shared but the privacy impact to be answered is, are the changes creating a side channel etc
10813:31 <willcl_ark> jnewbery: Ok thanks.
10913:31 <jnewbery> (but only kinda - an extra layer of cache is used during block connection)
11013:31 <jkczyz> Are the assumptions then (1) FRESH implies DIRTY and (2) spent coins should not have entries?
11113:32 <earpiece> could a probing peering observe additional information by using the fresh/dirty/spent/unspent to exfiltrate info about wallet etc, I don't think it could but not certain
11213:32 <jnewbery> amiti: right. That' the change
11313:32 <andrewtoth> Would spent and ~Dirty be an invalid state as well?
11413:32 <jnewbery> earpiece: nice question. What do people think?
11513:32 <fjahr> earpiece: good question. I don't think there is a chance because this is general chain state and it is not connected to the wallet
11613:33 <jnewbery> fjahr: I agree. This is logic that all nodes do on all transactions/blocks, so leaks no information about the wallet
11713:34 <jnewbery> earpiece: if you're interested in remote wallet side-channel attacks, check out the talk here: https://bitcoinops.org/en/newsletters/2020/02/26/#remote-side-channel-attacks-on-anonymous-transactions
11813:34 <jnewbery> that paper deals with monero/zcash, but I think it's generally interesting
11913:35 <earpiece> ah yeah, Tramer's work. it's a good recent reference :)
12013:35 <amiti> jkczyz: re (2) spent coins should not have entries -> if a coin is retrieved from the parent cache then spent in the child cache, it should have a DIRTY entry there until its flushed, right?
12113:36 <jkczyz> amiti: Yeah, that was my thinking
12213:36 <jkczyz> In which case I could only see three possible states
12313:37 <jnewbery> amiti: jkczyz: yes, spent and DIRTY is a valid state. It means the child cache has spent the coin and that spentness needs to be propogated to the parent when the cache is flushed
12413:37 <nehan_> jkczyz amiti: re(2) I think only coins that are created and spent in the child view should not have entries (with this change. instead of returning an entry with FRESH)
12513:39 <andrewtoth> spent and not DIRTY is not a valid state though right?
12613:39 <jnewbery> andrewtoth: it was before this PR. After this PR, it's not
12713:40 <andrewtoth> ahh gotcha
12813:40 <jnewbery> before this PR we could fetch a spent coin from the parent and keep it in the child cache. That would be spent and not-DIRTY
12913:40 <nehan_> jnewbery: does that mean you can remove the line return !coin.IsSpent() in GetCoin?
13013:40 <nehan_> remove the check, that is.
13113:41 <andrewtoth> I was looking at all the commenting changes and the actual logic change escaped me. Thanks!
13213:41 <jkczyz> jnewbery: In that case I'll change my answer to 5 valid states :)
13313:41 <amiti> jkczyz: before, or after the PR?
13413:42 <jkczyz> after
13513:42 <jnewbery> andrewtoth: that's another good reason to split this PR up!
13613:42 <michaelfolkson> Not DIRTY is FRESH right? There are only two states. DIRTY and FRESH
13713:43 <michaelfolkson> Oh no sorry, ignore me I'm with you
13813:43 <nehan_> michaelfolson: no this tripped me up last time. DIRTY is not the opposite of FRESH.
13913:43 <jnewbery> what do people think of nehan_'s question? Can we remove the check for spentness in https://github.com/bitcoin/bitcoin/blob/3a8d25064e700ff2e69600cc1ede597751283a85/src/coins.cpp#L63 after this PR?
14013:44 <nehan_> oh, no, because it might be spent in the child view
14113:44 <jnewbery> right
14213:45 <jnewbery> yes, FRESH and DIRTY aren't the most intuitive names. Perhaps NEWCOIN and MUSTFLUSH would be better
14313:45 <amiti> I'm seeing 4 valid states with the PR. 1) spent, not fresh, dirty 2) unspent, fresh, dirty 3) unspent, not fresh, dirty 4) unspent, not fresh, not dirty.
14413:45 <amiti> jkczyz: hows that match up to your 5?
14513:46 <amiti> jnewbery: NEWCOIN and MUSTFLUSH would be _so_ much better
14613:46 <nehan_> amiti: +1
14713:46 <michaelfolkson> Start with 8 (2^3) and then subtract from that
14813:46 <jonatack> +1 on better naming
14913:47 <jnewbery> well they say there are only two hard things in computer science: naming things and cache invalidation, and we've covered both of them today
15013:47 <andrewtoth> lol
15113:47 <michaelfolkson> Do "they" say that?! haha
15213:47 <jkczyz> amiti: also had (5) spent, fresh, dirty since FRESH implies DIRTY. Thinking if that one makes sense now...
15313:48 <andrewtoth> It does make sense
15413:48 <jnewbery> jkczyz: after this PR, a spent coin can't be FRESH
15513:48 <amiti> jkczyz: thats the one that this PR changes. if it's fresh and spent, drop it.
15613:48 <andrewtoth> It is a candidate to remove
15713:48 <amiti> wait wait
15813:49 <nehan_> i got four states as well
15913:49 <amiti> ok nevermind, don't wait
16013:49 <andrewtoth> it can't be spent and NOT dirty
16113:49 <nehan_> yeah
16213:49 <jnewbery> amiti: this PR doesn't change that. The behavior is already 'if it's spent and FRESH, drop it'. The change in this PR is 'don't fetch a spent coin from a parent cache'
16313:50 <amiti> yeah thats what I mixed up before 🤦‍♀️
16413:50 <jnewbery> which introduces a new invariant: a spent coin can now *never* be fresh
16513:50 <amiti> maybe this time I'll get it :)
16613:51 <jnewbery> ok 10 minutes left. I have one last question, but now is also your chance to say anything you've been holding back on
16713:51 <jnewbery> Could this change have a performance impact during block validation or Initial Block Download?
16813:51 <jkczyz> jnewbery: so 4 then?
16913:51 <jnewbery> oh yeah, I agree that the answer is 4
17013:51 <nehan_> ugh but which 4? i think i had the wrong ones too: 1) not dirty & not fresh & not spent (or does this mean it wouldn't be in the child cash?) 2) dirty & not fresh & spent 3) dirty & fresh & spent 4) dirty & not fresh & spent
17113:51 <nehan_> s/cash/cache
17213:52 <jnewbery> bitcoin cache?
17313:52 <amiti> nehan_: #1 would be when a child retrieves a not spent coin from the parent, right?
17413:52 <nehan_> because i think you're saying after this change #3 will no longer happen
17513:53 <nehan_> amiti: right. just not sure when that happens?
17613:53 <jnewbery> My 4 are: (spent & not-FRESH & not-DIRTY), (unspent & not-FRESH & not-DIRTY), (unspent & not-FRESH & DIRTY), (unspent & FRESH & DIRTY)
17713:54 <jnewbery> nehan_: retrieving unspent coins from a parent happens all the time. It's how we retrieve coins from the disk coins DB
17813:55 <amiti> jnewbery: why not (spent & not-FRESH & DIRTY)?
17913:55 <jnewbery> any thoughts about performance impact?
18013:56 <jonatack> This PR LGTM, but naming-wise I think "FRESH" and "DIRTY" generally imply mutually exclusive states to people.
18113:56 <michaelfolkson> Agreed
18213:56 <jnewbery> amiti: sorry you're right. Replace my first one with (spent and not-FRESH and DIRTY)
18313:56 <nehan_> why wouldn't the first one happen?
18413:57 <nehan_> that seems like the case where you read a spent UTXO from the parent
18513:57 <jnewbery> nehan_: after this PR, we wouldn't keep it: https://github.com/bitcoin-core-review-club/bitcoin/commit/7053da3#diff-cd7b305fd4b4280f22ae88960e60398eR48
18613:57 <willcl_ark> jonatack: agree, FRESH is too close to (the opposite of DIRTY,) CLEAN
18713:57 <nehan_> jnewbery: ah right, thanks
18813:58 <willcl_ark> ...although CCoinsCacheEntry::Flags does explain pretty clearly what the two flags mean
18913:58 <jnewbery> seems like we have almost unanimous agreement that the FRESH/DIRTY terminology is confusing!
19013:58 <Prayank> Yes
19113:59 <jnewbery> I think that naming should always be as intuitive as we can make it. Any confusion stemming from other associations with the name can eventually exhibit themselves as bugs
19213:59 <jnewbery> ok, that's about time. I just realised that I didn't shill my own podcast episode in the notes, which is uncharacteristic of me
19314:00 <jnewbery> if you like coins, you'll love https://podcast.chaincode.com/2020/02/26/utxos-5.html
19414:00 <Prayank> lol
19514:00 <jnewbery> #endmeeting
19614:00 <jnewbery> thanks all!
19714:00 <willcl_ark> so can we have an unspent && DIRTY coin?
19814:00 <Prayank> Thanks 👍
19914:00 <willcl_ark> thanks jnewbery
20014:00 <michaelfolkson> I have a backlog to catch up on. The sipa one was immense
20114:00 <michaelfolkson> Thanks
20214:00 <amiti> thanks!
20314:00 <nehan_> thanks
20414:00 <emzy> Thanks jnewbery and all!
20514:00 <jkczyz> thanks!
20614:01 <pinheadmz> ty all confusing and interesting today
20714:01 <jonatack> Thanks jnewbery and everyone! Good to see new people too :)
20814:01 <jnewbery> willcl_ark: yes, a Coin can be uspent and DIRTY
20914:01 <amiti> pinheadmz: well put 😂
21014:01 <willcl_ark> jnewbery: is that the re-org case?
21114:01 <fjahr> jnewbery: Since the spent coins from the parent cache are skipped there could be a small performance impact?
21214:01 <andrewtoth> Thanks jnewbery!
21314:01 <fjahr> Thanks!
21414:02 <jnewbery> fjahr: too late!
21514:02 <fjahr> jnewbery: i mean small performance improvement
21614:02 <fjahr> :(
21714:02 <jnewbery> I don't know the answer for sure. I was just throwing it out there as a conversation starter
21814:03 <fjahr> ok
21914:03 <jonatack> fjahr: I'm thinking it's possible but only benchmarking can tell. Have to measure.
22014:03 <jnewbery> jonatack: agree with benchmarking