The PR branch HEAD was ec5fbd0b at the time of this review club meeting.
Notes
An address index is an index from addresses or scriptPubKeys to the
transactions in the block chain that create outputs to that address or spend
outputs from that address.
Adding an address index is a very common feature request for Bitcoin Core.
Being able to lookup transactions from an address is useful for block
explorers or other services that need to scan the block chain. Example
requests:
1,
2
There have been several patches and additional software to produce
transaction indexes over the years:
Bitcore,
electrs and
ElectrumX are three examples.
There are also centralized block explorer services that provide an
address-to-transactions lookup. These offer convenience at the cost
of privacy.
There have been several attempts to add an address index to Bitcoin Core:
Address indexes are inherently unscalable since they grow linearly with the
size of the block chain. Some contributors worry that offering an address
index in Bitcoin Core may encourage businesses to build Bitcoin services on
infrastructure that can’t scale as the size of the block chain increases.
Pieter Wuille expresses this concern in the post from his PR: “I hate making it
easy to build infrastructure that relies on a fully-indexed blockchain (which
shouldn’t be necessary), as it may remove the incentive to build more scalable
solutions. On the other hand, in cases where the alternative is relying on a
trusted third party, this approach is certainly preferable, and would allow an
RPC-based blockexplorer, for example.”
This latest attempt to add an address index builds on the new auxiliary
indexes base class introduced in PR
13243. The auxiliary index
infrastructure runs asynchronously and uses the CValidationInterface to
receive new block data. That means adding new indexes should have minimal
impact on performance and low risk of introducing a bug that impacts mainline
functionality.
The address index code and tests (index/addrindex.cpp, index/addrindex.h,
test/addrindex_tests.cpp) added in this PR are similar to the transaction
index code and tests (index/txindex.cpp, index/txindex.h,
test/txindex_tests.cpp). Comparing them by eye or using a diff tool may be
helpful to understand what the new code is doing.
Questions
Do you think Bitcoin Core should include an address index? What are the
arguments for and against?
<jnewbery> 1. You don't have to ask to ask a question (e.g. "I have a question about x but I don't know if it's on-topic?"). Just go ahead and ask. If it's off-topic we'll tell you.
<jnewbery> First question was about whether Bitcoin Core should include an address index. I don't think we want to get into a long debate, but did people think a bit about the arguments for and against?
<michaelfolkson> Yeah if it keeps coming up, there's almost an argument to just get it over with. Assuming the downsides aren't that severe which they don't seem to be (at least for the Core project itself)
<jnewbery> raj_: this wouldn't impact IBD time (well perhaps for people that decided to enable an address index, but this would be disabled by default)
<jnewbery> yeah, I think that's a good summary of the arguments. What did people think about this particular implementation. Does it address any of those concerns?
<fjahr> pro of this particular PR against the historical ones: it uses the base index which should make it easier to maintain, which means this is probably also much easier to maintain as an external patch if it does not get in
<_andrewtoth_> emzy: with https://github.com/bitcoin/bitcoin/pull/10785 it would make it much faster to rescan. I think that's ultimately the better solution, block filter indexing and fast rescanning for addresses you want to know about
<nehan_> it's useful for me to understand what's missing from something like this before it might be considered for merging. one is definitely stats/benchmarks.
<jnewbery> _andrewtoth_: how large is the blockchain up to height 473300? (I'm interested in how large the address index is in comparison to the blockchain)
<marcinja> nehan_: I definitely agree it should be documented better, that should at least make it easier to review. Stats and benchmarks would put people at ease maybe and also won't be hard to get
<jkczyz> Did jamesob report back on the use cases that companies had for this? Seems any solution that would scale requires multiple nodes and mechanism to resolve inconsistencies across nodes.
<michaelfolkson> I'm finding it hard to care about the Approach ACK when there isn't agreement on the Concept ACK. And at least the Concept nACK seems to be the primary reason why the other attempts were rejected
<jonatack> marcinja: Since you are here (which is great!) could you discuss your motivation for reviving this PR? I didn't you mention the *why* in the PR description, which seemed quite succinct for a change like this.
<jnewbery> michaelfolkson: maybe. The final comment on the last attempt was from sipa: "I vote to close this; this should be done in an external index. Perhaps at some point when we can a more modular design it can be supported as an optionally buildable module, but let's not complicate the current database tracking code."
<fjahr> raj_: that was the feedback in most of the discussions. As a standalone or patch. But of course people would like a really robust solution maintained by core because this is not so easy to maintain.
<jnewbery> ok, let's move on to the next question: What steps did you take to review and test this PR? Did you try to create an address index over the full mainnet block chain?
<jnewbery> and we can also talk about the next question at the same time: Did you review the unit and functional tests? Are there any additional test cases that you’d like to see added?
<marcinja> jonatack: Sure. One of the main justifications is that it seems like a lot of people don't want to run extra software in addition to Bitcoin Core to get block explorer functionality. If people can get that functionality easily without sacrificing privacy its a plus for me.
<_andrewtoth_> as i mentioned on github, we should have ability to skip transactions, so we can page through them. I'd like to see that implemented and tested
<ariard> jnewbery: architectural level, you should split all indexes in its own process (like bitcoin-server) so the server would be able to serve multiple clients without encumbering operation of the consensus/p2p module
<ariard> so easier to maintain, smaller binary for ones not interested, easier to have its improvements pace, like if we split the wallet from the node
<jonatack> marcinja: Thanks. It's true there is accumulated context behind the PR, but my intuition (which can be wrong ofc) is that providing context (and links to more) in your PR description can help "sell" the PR.
<jnewbery> ariard: binary size is negligable (a few hundred lines of code), I don't think there's any difference in maintenance - that's more of a code separation issue than process separation
<jonatack> marcinja: yes, and give it a bit of your personal conviction too perhaps, without being too subjective. Just a side thought, maybe not a helpful one, but I was looking for that.
<ariard> jnewbery: code separation let you do process separation so both of them are tied, having different processes would let them running on different machines
<jnewbery> there are lots of things that *could* be improved about Bitcoin Core architecture, but we shouldn't pull all of those into discussion of individual PRs
<ariard> jnewbery: I'm worried it would make it harder to split it latter, that's really IMO but if you take the wallet, more there is features, more the chain interface is stuffed, harder it make it to refactor
<ariard> nehan_: hmm that's a good question, harder point would be to memory isolate indexes from the rest of the node, cut dependencies between them, and get it its own p2p stack
<belcher> earlier people were saying this addrindex could be used as an electrum server, electrum's protocol uses hash(scriptPubKey) as the key for its queries, so if an electrum server implementation is desired that would probably need to be the key in core
<nehan_> belcher: hash(scriptPubKey) is the key prefix, so that works here too. How does electrum handle multiple outpoints with the same hash(scriptPubKey)?
<belcher> nehan_ the electrum client sends a list of its bitcoin addresses encoded as hash(spk) and the server replies with transactions on those addresses, so if an address contains multiple outputs then the server replies with all those txes
<jnewbery> belcher: sorry, the question might be a bit confusing. I was asking about the internal levelDB representation. The RPC query uses the script or address to lookup in the index
<nehan_> jnewbery: i believe it is always the outpoint where the scriptPubKey was created. I believe this is used to get unique keys when the same scriptPubKey is used multiple times.
<jnewbery> so I *think* that for spends, the txid given is actually the txid of the transaction creating the UTXO not the txid of the transaction spending the UTXO
<jnewbery> ok, time's running out so let's move on to the next question: Where is MurmurHash3 used in this PR? How are hash collisions resolved? Where else is MurmurHash3 used in the Bitcoin Core codebase?
<marcinja> jnewbery: I think the key is serialized in the order I gave, but the constructor shows them in a different order which is definitely confusing
<michaelfolkson> In this case why use MurmurHash3? Would another hash function offer better collision resistance? Understand it is a trade-off with simplicity, speed etc