The PR branch HEAD was 2faa5b5179 at the time of this review club meeting.
Notes
Indexes are optional modules that scan each block of the chain and store index-specific
additional data in a separate LevelDB database. There are currently three indexes in bitcoin core with very
different use cases that are all using a common framework (index/base).
All indexes have a best block which is the block height up to which they are synced.
In principle, indexes can also work with a node that is running in pruned mode - as long as the
data of each block was indexed at a point in time when the block was still available.
After a block has been processed by the index, this block could be pruned
(the indexed data itself is never pruned).
However, when pruning with an index enabled, extra care must be taken:
We must ensure no blocks are pruned that still need to be indexed.
We must take into account the possibility of a reorg. In particular,
coinstatsindex has a running hash for the UTXO set, so when a block is
disconnected from the tip, this block’s data needs to be available
in order to be able to adjust the running hash.
Being optional, indexes can be switched off (and switched back on again) at any point.
If it is impossible to fully sync an index because of missing block data, the user
must be notified.
PR #15946 enabled pruning for
blockfilterindex - however, a negative side-effect of this was that it introduced a circular
dependency between validation and the index code.
This PR introduces a new method for pruning, prune locks. It removes the circular
dependency and enables pruning for coinstatsindex.
<danielabrozzoni> coin stats index - statistics on the utxo set; block filter index - to retrieve BIP157 block filters, hashes and headers; tx index - to retrieve transactions by hash
<lightlike> ok, so all of these indexes use a common framework (index/base) and on top of that have their own specific code relating to what data they index and how to handle this
<lightlike> larryruane: yes! although blockfilterindex has the special property that the filters themselves are not saved in the levelDB, but in a flatfile - the levelDB has the file positions where one finds the filter.
<ls55> If A uses B and conversely then there is a circular dependency. However, the circular dependency maybe subtler. For instance, it may be A that uses B that uses C that uses A.
<larryruane> (sorry I'm late with this:) I always wonder how stuff is laid out on disk, it looks like the `$HOME/.bitcoin/index` directory has subdirectories for each kind of index
<lightlike> larryruane: yes, that's where the data is saved. one could also just delete the respective folder and reindex again. I find myself doing that a lot doing testing.
<lightlike> larryruane: no, it won't. You can enable it back on later on, and the index will catch up with the chain, starting from the point it was synced before (and not from genesis)
<sipa> there are also "conceptual circular dependencies" or so... as in: "module X cannot really be used without module Y"; this is much broader, and not necessarily a problem in the code - just in your design
<sipa> ideally, modules in your code, if well-organized, are layers - they build on top of each other. If two modules both cannot be used without the other, that's a sign that they should be either just one module, or the interface between them is bad
<larryruane> I think an easy code circular dependency to get your mind around, that I've seen often in other projects I've worked on, involves logging. If logging takes a mutex, and if the mutex code can possible write log messages... !
<lightlike> for example, I think that the existing circular dependency between validation and index caused the indexes to be part of the initial draft of Carl's libbitcoinkernel library for consensus code - as an optional module they really shouldn't be there, but it takes refactoring to make that possible.
<sipa> Right. Ideally it's always possible for any two modules to use at least one of them without the other one. Circular dependencies break that ability.
<ls55> It is a map representing the index name ("txindex", "coinstatsindex" or "basic blockfilterindex") and the height of the earliest block that should be kept and not pruned for each index.
<larryruane> prune blockers are a beautiful thing, it's a list of heights, one per kind of index, which set a lower bound on what can be pruned away (removed)
<larryruane> if we maintained just a single lowest-height (across all index types), then when that lowest index can advance to a higher height, we wouldn't know what would be the new "lowest" ... so they must be kept separately per-index
<lightlike> yes. and ls55 is right too, they are read in CChainState::FlushStateToDisk , which is the point where the node decides whether it should prune away blocks or not.
<ls55> The old code handled only the "blockfilterindex" and this new code iterates over all available indexes to find last height the node can prune. This data is got from `CBlockIndex BaseIndex::m_best_block_index`.
<lightlike> Now, the indexes tell validation their best height by themselves. So validation doesn't need to know about the indexes anymore, and there is no longer a circular dependency.
<lightlike> I also think that there is no meaningful cost related to it. I think one difference is that the prune locks might be updated more often (even when our node doesn't want to prune), but that should be completely negligible.
<otech> I think more than a few blocks reorg would be rare, but I can see it being prudent to set the buffer a bit higher in case of an targeted eclipse attack especially since the use case is for pruned nodes... but not sure if that intuition is wrong...
<sipa> that's just a guideline of course, but my point is more: "being unable to decide" or "feeling bad about hardcoding" should not be reasons to make something configurable. It's just you as designer not doing your job and trying to shove it off to the user.
<larryruane> I'm curious to know what happens if we DO have a very large reorg (such that some indices "break" because height decreases too much) ... does the node shut down? No, couldn't be, that would be bad!
<theStack> didn't look deeper into the code, but is there the theoretical possibility of an integer underflow? "const int lock_height{prune_lock.second.height_first - PRUNE_LOCK_BUFFER - 1};" (if height_first is <= 10)
<larryruane> sipa: very interesting ... would you ever be in favor of a temporary hidden config option so that if there's some unexpected problem, nodes can be "patched" without needing a new binary?