Add address-based index (attempt 4?) (utxo db and indexes)

https://github.com/bitcoin/bitcoin/14053

Host: jnewbery

The PR branch HEAD was ec5fbd0b at the time of this review club meeting.

Notes

  • An address index is an index from addresses or scriptPubKeys to the transactions in the block chain that create outputs to that address or spend outputs from that address.

  • Adding an address index is a very common feature request for Bitcoin Core. Being able to lookup transactions from an address is useful for block explorers or other services that need to scan the block chain. Example requests: 1, 2

  • There have been several patches and additional software to produce transaction indexes over the years: Bitcore, electrs and ElectrumX are three examples.

  • There are also centralized block explorer services that provide an address-to-transactions lookup. These offer convenience at the cost of privacy.

  • There have been several attempts to add an address index to Bitcoin Core:
  • Address indexes are inherently unscalable since they grow linearly with the size of the block chain. Some contributors worry that offering an address index in Bitcoin Core may encourage businesses to build Bitcoin services on infrastructure that can’t scale as the size of the block chain increases. Pieter Wuille expresses this concern in the post from his PR: “I hate making it easy to build infrastructure that relies on a fully-indexed blockchain (which shouldn’t be necessary), as it may remove the incentive to build more scalable solutions. On the other hand, in cases where the alternative is relying on a trusted third party, this approach is certainly preferable, and would allow an RPC-based blockexplorer, for example.”

  • For more discussion on the merits of adding an address index to Bitcoin Core, see the discussion in the 13 Dec 2018 Bitcoin Core IRC meeting.

  • This latest attempt to add an address index builds on the new auxiliary indexes base class introduced in PR 13243. The auxiliary index infrastructure runs asynchronously and uses the CValidationInterface to receive new block data. That means adding new indexes should have minimal impact on performance and low risk of introducing a bug that impacts mainline functionality.

  • The address index code and tests (index/addrindex.cpp, index/addrindex.h, test/addrindex_tests.cpp) added in this PR are similar to the transaction index code and tests (index/txindex.cpp, index/txindex.h, test/txindex_tests.cpp). Comparing them by eye or using a diff tool may be helpful to understand what the new code is doing.

Questions

  1. Do you think Bitcoin Core should include an address index? What are the arguments for and against?

  2. Did you review the PR? Concept ACK, approach ACK, ACK <commit>, or NACK?  Don’t forget to put your PR review on GitHub or ask questions.

  3. What steps did you take to review and test this PR? Did you try to create an address index over the full mainnet block chain?

  4. Did you review the unit and functional tests? Are there any additional test cases that you’d like to see added?

  5. The address index is stored in LevelDB, which is a key-value store. What is used as the key, and what is used as the value?

  6. Where is MurmurHash3 used in this PR? How are hash collisions resolved? Where else is MurmurHash3 used in the Bitcoin Core codebase?

  7. How does the address index handle reorgs?

Meeting Log

13:00 < jnewbery> #startmeeting
13:00 < jkczyz> hi
13:00 < jonatack> hi
13:00 < jnewbery> hi
13:00 < _andrewtoth_> hi
13:00 < fjahr> hi
13:00 < jnewbery> Hi folks. Welcome to the first Review Club of 2020 :)
13:00 < michaelfolkson> hi
13:00 < emzy> Hi
13:00 < amiti> hi
13:00 < jnewbery> Notes are in the usual place: https://bitcoincore.reviews/14053.html
13:00 < jnewbery> Feel free to say "hi" to let everyone know that you're at keyboard.
13:00 < jnewbery> Before we begin, just a reminder of some of our meeting conventions:
13:00 < jnewbery> 1. You don't have to ask to ask a question (e.g. "I have a question about x but I don't know if it's on-topic?"). Just go ahead and ask. If it's off-topic we'll tell you.
13:00 < jnewbery> 2. You don't need to wait for the host to ask a specific question — just jump in at any point.
13:01 < van> hi
13:01 < jnewbery> Ok, let's get started. Did everyone get a chance to review the PR? How about a quick y/n from everyone
13:01 < _andrewtoth_> y
13:01 < raj_> y
13:01 < jkczyz> n (read the notes)
13:01 < emzy> y
13:01 < kanzure> hi
13:02 < michaelfolkson> y
13:02 < amiti> n - read notes and brief overview of PR
13:02 < fjahr> working on it
13:02 < jonatack> y
13:02 < jnewbery> First question was about whether Bitcoin Core should include an address index. I don't think we want to get into a long debate, but did people think a bit about the arguments for and against?
13:03 < marcinja> cccccclkninbdfgdifvjhklfgutvrdrtetdfbirvveuj
13:03 < kanzure> was there a recent address index proposal..?
13:03 < fjahr> pro: clearly there is a need for it based on user requests
13:03 < marcinja> oops sorry, new yubikey is being annoying
13:03 < fjahr> con: storage requirements
13:03 < amiti> con: unscalable in the long term
13:03 < jnewbery> kanzure: not recent. We're reviewing 14053 which was opened last year
13:03 < jonatack> I'll admit I'm having trouble concept acking this being in bitcoin core despite the long-term request for this
13:04 < nehan_> con: more code to maintain / added complexity
13:04 < raj_> pro: seems like an useful tool. con:  worried about storage and IBD time.
13:04 < amiti> pro: people need it so often depend on 3rd party services. having it in core would be preferable
13:04 < michaelfolkson> Yeah if it keeps coming up, there's almost an argument to just get it over with. Assuming the downsides aren't that severe which they don't seem to be (at least for the Core project itself)
13:04 < nehan_> pro: seems to be real desire for this and good use cases
13:04 < kanzure> jnewbery: thanks for the clarification.
13:05 < _andrewtoth_> pro: increased privacy and not forcing users to use third-party services/software
13:05 < jnewbery> raj_: this wouldn't impact IBD time (well perhaps for people that decided to enable an address index, but this would be disabled by default)
13:05 < fjahr> I did not see any stats and projections on storage usage, should not be to hard to I think, or did I miss something?
13:05 < _andrewtoth_> con: creating reliance on it being available when it won't scale
13:05 < jnewbery> yeah, I think that's a good summary of the arguments. What did people think about this particular implementation. Does it address any of those concerns?
13:05 < jonatack> Opportunity cost, maintainence, and disincentivises the more sustainable dedicated external solution, pushing it down the road
13:06 < jonatack> at first look in any case
13:06 < emzy> pro: you could habe a more simple elecrum wallet connector. Or use electrum direct with bitcoin core.
13:06 < _andrewtoth_> emzy: you can already do that with https://github.com/chris-belcher/electrum-personal-server
13:06 < _andrewtoth_> I guess it would be more simple though, yeah
13:07 < emzy> _andrewtoth_: I know. But you have te reindex und put ypu xpub in EPS.
13:07 < nehan_> jnewbery: does not address long-term scalability issue. doesn't seem too bad in terms of added code though.
13:07 < raj_> i feel like its a good addition, given optionality.
13:07 < emzy> xpub.
13:07 < jnewbery> I think using the base index infrastructure addresses some of the concerns about code complexity and performance
13:08 < fjahr> pro of this particular PR against the historical ones: it uses the base index which should make it easier to maintain, which means this is probably also much easier to maintain as an external patch if it does not get in
13:08 < michaelfolkson> Any reason why EPS wasn't included in the summary jonatack?
13:08 < jnewbery> nehan_: right, I agree
13:08 < jnewbery> fjahr: I agree that it'd be nice to see stats and storage projections
13:09 < jonatack> michaelfolkson: i did not understand your comment
13:09 < _andrewtoth_> emzy: with https://github.com/bitcoin/bitcoin/pull/10785 it would make it much faster to rescan. I think that's ultimately the better solution, block filter indexing and fast rescanning for addresses you want to know about
13:09 < jnewbery> marcinja: did you run this on mainnet? Do you know what the storage requirements are for indexing the full mainnet?
13:09 < raj_> yes stats would be very informative
13:09 < jnewbery> (or did anyone else run this on mainnet?)
13:09 < raj_> I am clueless about the potential impact otherwise.
13:10 < michaelfolkson> jonatack: "Bitcore, electrs and ElectrumX are three examples." I just wondered if EPS was missing something or not intentional.
13:10 < _andrewtoth_> i started running this on mainnet last evening. It's only at block 473300 now
13:10 < marcinja> jnewbery: I did a full sync on mainnet last year but I don't remember the size off the top of my head
13:10 < _andrewtoth_> storage is at 98 GB already
13:10 < _andrewtoth_> so it's several times slower than reindex-chainstate
13:10 < nehan_> it's useful for me to understand what's missing from something like this before it might be considered for merging.  one is definitely stats/benchmarks.
13:11 < nehan_> another is testing and the documentation jnewbery pointed out in his comment
13:11 < jnewbery> nehan_: that's a good question. What do other people think?
13:11 < nehan_> or perhaps others think it is mergeable without those things?
13:11 < michaelfolkson> About stats/benchmarks?
13:11 < raj_> agreed with nehan_
13:11 < nehan_> michaelfolkson: what else needs to be done before considering this PR for merging?
13:12 < _andrewtoth_> nehan_: agree, we need those numbers before considering
13:12 < jnewbery> _andrewtoth_: how large is the blockchain up to height 473300? (I'm interested in how large the address index is in comparison to the blockchain)
13:12 < raj_> i do have one point regarding testing. But i think we will hit that later anyway.
13:12 < nehan_> marcinja: what do you think?
13:12 < _andrewtoth_> jnewbery: du -h | grep addr_index -> 98G	./addr_index
13:13 < _andrewtoth_> oh sorry you want to know blockchain size, not sure
13:13 < marcinja> nehan_: I definitely agree it should be documented better, that should at least make it easier to review. Stats and benchmarks would put people at ease maybe and also won't be hard to get
13:13 < jonatack> michaelfolkson: Ah! I do not know. I did not write this summary :)
13:14 < jnewbery> _andrewtoth_: du -h blocks should tell you
13:14 < jkczyz> Did jamesob report back on the use cases that companies had for this? Seems any solution that would scale requires multiple nodes and mechanism to resolve inconsistencies across nodes.
13:14 < _andrewtoth_> oh, i'm already synced
13:14 < _andrewtoth_> it's just indexing in the addrindex thread
13:15 < jnewbery> _andrewtoth_: ah, ok. I forgot that the indexer could fall behind!
13:15 < michaelfolkson> jkczyz: So on the use case front I saw local block explorers. What else?
13:15 < _andrewtoth_> jnewbery: well i already had the node synced, i just built on the branch and ran with -addrindex
13:16 < jnewbery> jkczyz: I'd *hope* that there wouldn't be inconsistencies
13:17 < jnewbery> I linked to a couple of comments from engineers building services, and their use cases for an address index: https://github.com/bitcoin/bitcoin/pull/2802#issuecomment-26712372, https://github.com/bitcoin/bitcoin/pull/14053#issuecomment-558344637
13:17 < jkczyz> By inconsistencies I mean competing versions of the tip
13:17 < fjahr> michaelfolkson: people who use block explorers but don't want to use public ones
13:17 < jkczyz> chain tip
13:17 < fjahr> ah sorry, that's what you meant
13:18 < raj_> cant then this be done in standalone way for those who want to use their own explorer?
13:18 < jnewbery> jkczyz: Is that concern specific to address indexes or general to any service built on Bitcoin?
13:18 < michaelfolkson> I'm finding it hard to care about the Approach ACK when there isn't agreement on the Concept ACK. And at least the Concept nACK seems to be the primary reason why the other attempts were rejected
13:18 < jkczyz> any service really
13:19 < jonatack> marcinja: Since you are here (which is great!) could you discuss your motivation for reviving this PR? I didn't you mention the *why* in the PR description, which seemed quite succinct for a change like this.
13:19 < jonatack> I didn't see* you mention the *why*
13:19 < jnewbery> michaelfolkson: I don't see any concept NACKs on the PR
13:20 < michaelfolkson> jnewbery: But you agree that's the primary reason why other attempts failed?
13:20 < raj_> any service really. Seems like something you can write your own script for. This PR can be used as a refference.
13:21 < michaelfolkson> Don't mean to derail discussion :/ The PR seems good
13:21 < raj_> But yes, through discussion it does seem like a high demand functionality.
13:22 < jonatack> marcinja: In other words, you provide the how but without your why. I would have liked to hear your why :)
13:22 < jnewbery> michaelfolkson: maybe. The final comment on the last attempt was from sipa: "I vote to close this; this should be done in an external index. Perhaps at some point when we can a more modular design it can be supported as an optionally buildable module, but let's not complicate the current database tracking code."
13:22 < fjahr> raj_: that was the feedback in most of the discussions. As a standalone or patch. But of course people would like a really robust solution maintained by core because this is not so easy to maintain.
13:22 < jnewbery> we now have that more modular design!
13:23 < raj_> fjahr: any particular reason why its hard to maintain?
13:23 < michaelfolkson> Ok fair enough. Thanks :)
13:23 < jnewbery> ok, let's move on to the next question: What steps did you take to review and test this PR? Did you try to create an address index over the full mainnet block chain?
13:24 < raj_> standard unit and functional testing. Didint tried on mainnet.
13:24 < jnewbery> and we can also talk about the next question at the same time: Did you review the unit and functional tests? Are there any additional test cases that you’d like to see added?
13:24 < nehan_> i just read the code
13:24 < ariard> I agree with sipa we should do this in its own module post-multiprocess with the rest of the server code
13:24 < _andrewtoth_> built and ran tests, but as I mentioned, mainnet syncing is taking a long time
13:24 < fjahr> raj_: I don't have seen any particular reference to which problems these projects have.
13:24 < _andrewtoth_> however, testing already indexed addresses with searchrawtransactions works
13:25 < marcinja> jonatack: Sure. One of the main justifications is that it seems like a lot of people don't want to run extra software in addition to Bitcoin Core to get block explorer functionality. If people can get that functionality easily without sacrificing privacy its a plus for me.
13:25 < emzy> I just read the PR and compiled the PR. But had no time to run it. Mainet data is still rsyning
13:25 < jnewbery> ariard: what do you think having multiprocess provides as an improvement for an address index?
13:25 < _andrewtoth_> as i mentioned on github, we should have ability to skip transactions, so we can page through them. I'd like to see that implemented and tested
13:25 < raj_> functional testing does not include any check for spends_results. It only checks consistency of creation_results.
13:26 < jnewbery> raj_: yes, I'd definitly like to see that added to the tests. Did anyone test it manually?
13:26 < _andrewtoth_> yes i tested manually and got spend results
13:26 < ariard> jnewbery: architectural level, you should split all indexes in its own process (like bitcoin-server) so the server would be able to serve multiple clients without encumbering operation of the consensus/p2p module
13:27 < ariard> so easier to maintain, smaller binary for ones not interested, easier to have its improvements pace, like if we split the wallet from the node
13:29 < jonatack> marcinja: Thanks. It's true there is accumulated context behind the PR, but my intuition (which can be wrong ofc) is that providing context (and links to more) in your PR description can help "sell" the PR.
13:29 < jnewbery> ariard: binary size is negligable (a few hundred lines of code), I don't think there's any difference in maintenance - that's more of a code separation issue than process separation
13:29 < nehan_> ariard: what needs to be done before one could build this PR the way you suggest? How far out is that?
13:30 < jnewbery> Serving RPC requests to the index doesn't take cs_main, so I don't think it encumbers operation of consensus/p2p
13:30 < jnewbery> and improvements pace is another code separation thing, not process separation
13:30 < michaelfolkson> I agree jonatack. I think jnewbery point on the modularity of Core being improved since the previous attempts is a key "sell" too
13:30 < jonatack> marcinja: because if the *why* isn't justified, then the *how* doesn't matter.
13:30 < jnewbery> (I'm not arguing against multiprocess, but I don't think it directly impacts this PR)
13:30 < marcinja> jonatack: That's a good point. I'll try to summarize the background better for a new PR description
13:32 < jonatack> marcinja: yes, and give it a bit of your personal conviction too perhaps, without being too subjective. Just a side thought, maybe not a helpful one, but I was looking for that.
13:32 < ariard> jnewbery: code separation let you do process separation so both of them are tied, having different processes would let them running on different machines
13:32 < ariard> and if you server is doing heavy rescans, not taking cpu time for the validation
13:33 < jnewbery> ariard: I agree those things are true, but I don't think it's a requirement for this PR
13:33 < jnewbery> there are lots of things that *could* be improved about Bitcoin Core architecture, but we shouldn't pull all of those into discussion of individual PRs
13:34 < ariard> jnewbery: I'm worried it would make it harder to split it latter, that's really IMO but if you take the wallet, more there is features, more the chain interface is stuffed, harder it make it to refactor
13:34 < nehan_> ariard: you can run 2 bitcoin core nodes, one with this on and one with this off, if you wanted.
13:35 < ariard> nehan_: hmm that's a good question, harder point would be to memory isolate indexes from the rest of the node, cut dependencies between them, and get it its own p2p stack
13:36 < jnewbery> I believe the only validation interface method this uses is BlockConnected, which is in the base indexer class
13:36 < jnewbery> It's interface with the rest of the node is minimal
13:36 < jnewbery> *Its
13:36 < ariard> nehan_: okay but know you do 2 ibds, even if the second one is pruned, seems more a hack than doing it cleanly
13:37 < jnewbery> Next question!
13:37 < jnewbery> The address index is stored in LevelDB, which is a key-value store. What is used as the key, and what is used as the value?
13:37 < nehan_> ariard: if one implemented a blockexplorer from scratch they'd have to do IBD anyway.
13:38 < ariard> jnewbery: dunno if this is that hard to split, want to try at some point
13:38 < nehan_> this was confusing. marcinja please add documentation :) but i think the key is scripthash, outpoint
13:38 < ariard> nehan_: a pruned wallet does not need ibd, you already speaking about one specific application requirement I think
13:39 < belcher> earlier people were saying this addrindex could be used as an electrum server, electrum's protocol uses hash(scriptPubKey) as the key for its queries, so if an electrum server implementation is desired that would probably need to be the key in core
13:39 < raj_> The key seems to be [type, outpoint] and value: [CDiskTxPos, Script]
13:39 < nehan_> belcher: hash(scriptPubKey) is the key prefix, so that works here too. How does electrum handle multiple outpoints with the same hash(scriptPubKey)?
13:39 < raj_> I might be wrong, i am very new to c++
13:40 < jnewbery> nehan_: I think it's (type | script hash | outpoint) . Is that right, marcinja?
13:40 < nehan_> oh yes. type, sorry, forgot.
13:40 < belcher> nehan_ the electrum client sends a list of its bitcoin addresses encoded as hash(spk) and the server replies with transactions on those addresses, so if an address contains multiple outputs then the server replies with all those txes
13:41 < jnewbery> I was confused about the outpoint being used: https://github.com/bitcoin/bitcoin/pull/14053#discussion_r363971325
13:41 < michaelfolkson> Type being SEED, SPENT, CREATED
13:41 < belcher> electrum doesnt use outpoints in its protocol, it only uses hash(spk) and transactions encoded as hex
13:42 < nehan_> belcher: the rpc interface doesn't use outpoints either. so I think this is compatible.
13:42 < marcinja> jnewbery: it's (script hash | type | outpoint)
13:42 < jnewbery> belcher: sorry, the question might be a bit confusing. I was asking about the internal levelDB representation. The RPC query uses the script or address to lookup in the index
13:43 < michaelfolkson> +1 on nehan_ docs suggestion
13:44 < jnewbery> marcinja: you sure? I think it's (type | scripthash | outpoint) : https://github.com/bitcoin/bitcoin/pull/14053/files#diff-11cde577be5c21f1d0cb1527c50543b3R188
13:44 < nehan_> jnewbery: i believe it is always the outpoint where the scriptPubKey was created. I believe this is used to get unique keys when the same scriptPubKey is used multiple times.
13:45 < marcinja> nehan_: that's right. leveldb doesn't support multiple values for a single key so adding the outpoint is one way to have unique keys.
13:46 < jnewbery> It looked to me like that part of the key (the outpoint) is included in the RPC response as txid: https://github.com/bitcoin/bitcoin/pull/14053/files#diff-01aa7d1d32f1b9e5a836c9c411978918R314
13:46 < jnewbery> so I *think* that for spends, the txid given is actually the txid of the transaction creating the UTXO not the txid of the transaction spending the UTXO
13:48 < jnewbery> ok, time's running out so let's move on to the next question: Where is MurmurHash3 used in this PR? How are hash collisions resolved? Where else is MurmurHash3 used in the Bitcoin Core codebase?
13:49 < marcinja> jnewbery: yes that's right, the transaction from the value should be used for that txid. clearly that needs a functional test :)
13:50 < marcinja> jnewbery: I think the key is serialized in the order I gave, but the constructor shows them in a different order which is definitely confusing
13:50 < michaelfolkson> The address index stores a copy of the CScript in case of hash collisions
13:51 < nehan_> jnewbery: i think you caught a bug. I'm leaving a comment.
13:51 < jnewbery> michaelfolkson: yes, that's right
13:52 < jnewbery> ok, final question. How does the address index handle reorgs?
13:54 < jnewbery> ok, perhaps something to look into when you all do your reviews :)
13:54 < jnewbery> 5 minutes left. Any final questions?
13:55 < michaelfolkson> What is the answer to the reorg question? :)
13:55 < instagibbs> anyone run benchmarks up to different heights? both benchmarking IBD and query response times?
13:55 < instagibbs> michaelfolkson, "hopefully" ;)
13:56 < jnewbery> michaelfolkson: left as an exercise for the reader
13:56 < emzy> I will run a benchmark with a blockchain that has already a txindex.
13:57 < nehan_> jnewbery: https://github.com/bitcoin/bitcoin/pull/14053/files#r364389525
13:57 < jnewbery> emzy: the presence/absence of a txindex shouldn't have any impact
13:58 < _andrewtoth_> instagibbs: just generating the addrindex with a synced node has taken 16 hours to get to 478000 blocks :/
13:58 < jnewbery> nehan_: thanks!
13:58 < _andrewtoth_> IBD with assumevalid on has taken the same machine < 6 hours
13:59 < jnewbery> I'm slightly unconcerned about performance of building the address index for the first time, since it only has to be done once
13:59 < jnewbery> I wonder how this compares to external tools like electrs
13:59 < _andrewtoth_> also, querying can be optimized by taking count and returning early
13:59 < _andrewtoth_> i commented on github
13:59 < jnewbery> belcher: do you know how long building an index in electrs takes, or is it doing something different?
14:00 < emzy> electrs will do a full sync in about 16h
14:00 < michaelfolkson> In this case why use MurmurHash3? Would another hash function offer better collision resistance? Understand it is a trade-off with simplicity, speed etc
14:00 < belcher> i dont know
14:00 < michaelfolkson> "good enough" collision resistance I suppose
14:00 < jnewbery> That's time. Thanks everyone. Hope you all enjoyed!
14:00 < _andrewtoth_> thanks!
14:00 < jnewbery> And thanks to marcinja for dropping in
14:01 < marcinja> thanks jnewbery for hosting, and thanks to everyone who reviewed and gave feedback. This was very helpful :)
14:01 < amiti> thank you!
14:01 < nehan_> thanks!
14:01 < michaelfolkson> You mean in EPS jnewbery?
14:01 < michaelfolkson> Thanks!
14:01 < raj_> Thanks.
14:01 < fjahr> thanks, very good choice!
14:01 < emzy> thanks jnewbery, marcinja and everyone!
14:01 < _andrewtoth_> EPS uses the wallet and imports the addresses, then does a rescan
14:01 < jnewbery> Next week is fuzzing, hosted by MarcoFalke. Notes are already available: https://bitcoincore.reviews/17860.html
14:01 < jnewbery> #endmeeting