Add address-based index (attempt 4?) (utxo db and indexes)

https://github.com/bitcoin/bitcoin/pull/14053

Host: jnewbery  -  PR author: marcinja

The PR branch HEAD was ec5fbd0b at the time of this review club meeting.

Notes

  • An address index is an index from addresses or scriptPubKeys to the transactions in the block chain that create outputs to that address or spend outputs from that address.

  • Adding an address index is a very common feature request for Bitcoin Core. Being able to lookup transactions from an address is useful for block explorers or other services that need to scan the block chain. Example requests: 1, 2

  • There have been several patches and additional software to produce transaction indexes over the years: Bitcore, electrs and ElectrumX are three examples.

  • There are also centralized block explorer services that provide an address-to-transactions lookup. These offer convenience at the cost of privacy.

  • There have been several attempts to add an address index to Bitcoin Core:
  • Address indexes are inherently unscalable since they grow linearly with the size of the block chain. Some contributors worry that offering an address index in Bitcoin Core may encourage businesses to build Bitcoin services on infrastructure that can’t scale as the size of the block chain increases. Pieter Wuille expresses this concern in the post from his PR: “I hate making it easy to build infrastructure that relies on a fully-indexed blockchain (which shouldn’t be necessary), as it may remove the incentive to build more scalable solutions. On the other hand, in cases where the alternative is relying on a trusted third party, this approach is certainly preferable, and would allow an RPC-based blockexplorer, for example.”

  • For more discussion on the merits of adding an address index to Bitcoin Core, see the discussion in the 13 Dec 2018 Bitcoin Core IRC meeting.

  • This latest attempt to add an address index builds on the new auxiliary indexes base class introduced in PR 13243. The auxiliary index infrastructure runs asynchronously and uses the CValidationInterface to receive new block data. That means adding new indexes should have minimal impact on performance and low risk of introducing a bug that impacts mainline functionality.

  • The address index code and tests (index/addrindex.cpp, index/addrindex.h, test/addrindex_tests.cpp) added in this PR are similar to the transaction index code and tests (index/txindex.cpp, index/txindex.h, test/txindex_tests.cpp). Comparing them by eye or using a diff tool may be helpful to understand what the new code is doing.

Questions

  1. Do you think Bitcoin Core should include an address index? What are the arguments for and against?

  2. Did you review the PR? Concept ACK, approach ACK, ACK <commit>, or NACK?  Don’t forget to put your PR review on GitHub or ask questions.

  3. What steps did you take to review and test this PR? Did you try to create an address index over the full mainnet block chain?

  4. Did you review the unit and functional tests? Are there any additional test cases that you’d like to see added?

  5. The address index is stored in LevelDB, which is a key-value store. What is used as the key, and what is used as the value?

  6. Where is MurmurHash3 used in this PR? How are hash collisions resolved? Where else is MurmurHash3 used in the Bitcoin Core codebase?

  7. How does the address index handle reorgs?

Meeting Log

  113:00 <jnewbery> #startmeeting
  213:00 <jkczyz> hi
  313:00 <jonatack> hi
  413:00 <jnewbery> hi
  513:00 <_andrewtoth_> hi
  613:00 <fjahr> hi
  713:00 <jnewbery> Hi folks. Welcome to the first Review Club of 2020 :)
  813:00 <michaelfolkson> hi
  913:00 <emzy> Hi
 1013:00 <amiti> hi
 1113:00 <jnewbery> Notes are in the usual place: https://bitcoincore.reviews/14053.html
 1213:00 <jnewbery> Feel free to say "hi" to let everyone know that you're at keyboard.
 1313:00 <jnewbery> Before we begin, just a reminder of some of our meeting conventions:
 1413:00 <jnewbery> 1. You don't have to ask to ask a question (e.g. "I have a question about x but I don't know if it's on-topic?"). Just go ahead and ask. If it's off-topic we'll tell you.
 1513:00 <jnewbery> 2. You don't need to wait for the host to ask a specific question — just jump in at any point.
 1613:01 <van> hi
 1713:01 <jnewbery> Ok, let's get started. Did everyone get a chance to review the PR? How about a quick y/n from everyone
 1813:01 <_andrewtoth_> y
 1913:01 <raj_> y
 2013:01 <jkczyz> n (read the notes)
 2113:01 <emzy> y
 2213:01 <kanzure> hi
 2313:02 <michaelfolkson> y
 2413:02 <amiti> n - read notes and brief overview of PR
 2513:02 <fjahr> working on it
 2613:02 <jonatack> y
 2713:02 <jnewbery> First question was about whether Bitcoin Core should include an address index. I don't think we want to get into a long debate, but did people think a bit about the arguments for and against?
 2813:03 <marcinja> cccccclkninbdfgdifvjhklfgutvrdrtetdfbirvveuj
 2913:03 <kanzure> was there a recent address index proposal..?
 3013:03 <fjahr> pro: clearly there is a need for it based on user requests
 3113:03 <marcinja> oops sorry, new yubikey is being annoying
 3213:03 <fjahr> con: storage requirements
 3313:03 <amiti> con: unscalable in the long term
 3413:03 <jnewbery> kanzure: not recent. We're reviewing 14053 which was opened last year
 3513:03 <jonatack> I'll admit I'm having trouble concept acking this being in bitcoin core despite the long-term request for this
 3613:04 <nehan_> con: more code to maintain / added complexity
 3713:04 <raj_> pro: seems like an useful tool. con: worried about storage and IBD time.
 3813:04 <amiti> pro: people need it so often depend on 3rd party services. having it in core would be preferable
 3913:04 <michaelfolkson> Yeah if it keeps coming up, there's almost an argument to just get it over with. Assuming the downsides aren't that severe which they don't seem to be (at least for the Core project itself)
 4013:04 <nehan_> pro: seems to be real desire for this and good use cases
 4113:04 <kanzure> jnewbery: thanks for the clarification.
 4213:05 <_andrewtoth_> pro: increased privacy and not forcing users to use third-party services/software
 4313:05 <jnewbery> raj_: this wouldn't impact IBD time (well perhaps for people that decided to enable an address index, but this would be disabled by default)
 4413:05 <fjahr> I did not see any stats and projections on storage usage, should not be to hard to I think, or did I miss something?
 4513:05 <_andrewtoth_> con: creating reliance on it being available when it won't scale
 4613:05 <jnewbery> yeah, I think that's a good summary of the arguments. What did people think about this particular implementation. Does it address any of those concerns?
 4713:05 <jonatack> Opportunity cost, maintainence, and disincentivises the more sustainable dedicated external solution, pushing it down the road
 4813:06 <jonatack> at first look in any case
 4913:06 <emzy> pro: you could habe a more simple elecrum wallet connector. Or use electrum direct with bitcoin core.
 5013:06 <_andrewtoth_> emzy: you can already do that with https://github.com/chris-belcher/electrum-personal-server
 5113:06 <_andrewtoth_> I guess it would be more simple though, yeah
 5213:07 <emzy> _andrewtoth_: I know. But you have te reindex und put ypu xpub in EPS.
 5313:07 <nehan_> jnewbery: does not address long-term scalability issue. doesn't seem too bad in terms of added code though.
 5413:07 <raj_> i feel like its a good addition, given optionality.
 5513:07 <emzy> xpub.
 5613:07 <jnewbery> I think using the base index infrastructure addresses some of the concerns about code complexity and performance
 5713:08 <fjahr> pro of this particular PR against the historical ones: it uses the base index which should make it easier to maintain, which means this is probably also much easier to maintain as an external patch if it does not get in
 5813:08 <michaelfolkson> Any reason why EPS wasn't included in the summary jonatack?
 5913:08 <jnewbery> nehan_: right, I agree
 6013:08 <jnewbery> fjahr: I agree that it'd be nice to see stats and storage projections
 6113:09 <jonatack> michaelfolkson: i did not understand your comment
 6213:09 <_andrewtoth_> emzy: with https://github.com/bitcoin/bitcoin/pull/10785 it would make it much faster to rescan. I think that's ultimately the better solution, block filter indexing and fast rescanning for addresses you want to know about
 6313:09 <jnewbery> marcinja: did you run this on mainnet? Do you know what the storage requirements are for indexing the full mainnet?
 6413:09 <raj_> yes stats would be very informative
 6513:09 <jnewbery> (or did anyone else run this on mainnet?)
 6613:09 <raj_> I am clueless about the potential impact otherwise.
 6713:10 <michaelfolkson> jonatack: "Bitcore, electrs and ElectrumX are three examples." I just wondered if EPS was missing something or not intentional.
 6813:10 <_andrewtoth_> i started running this on mainnet last evening. It's only at block 473300 now
 6913:10 <marcinja> jnewbery: I did a full sync on mainnet last year but I don't remember the size off the top of my head
 7013:10 <_andrewtoth_> storage is at 98 GB already
 7113:10 <_andrewtoth_> so it's several times slower than reindex-chainstate
 7213:10 <nehan_> it's useful for me to understand what's missing from something like this before it might be considered for merging. one is definitely stats/benchmarks.
 7313:11 <nehan_> another is testing and the documentation jnewbery pointed out in his comment
 7413:11 <jnewbery> nehan_: that's a good question. What do other people think?
 7513:11 <nehan_> or perhaps others think it is mergeable without those things?
 7613:11 <michaelfolkson> About stats/benchmarks?
 7713:11 <raj_> agreed with nehan_
 7813:11 <nehan_> michaelfolkson: what else needs to be done before considering this PR for merging?
 7913:12 <_andrewtoth_> nehan_: agree, we need those numbers before considering
 8013:12 <jnewbery> _andrewtoth_: how large is the blockchain up to height 473300? (I'm interested in how large the address index is in comparison to the blockchain)
 8113:12 <raj_> i do have one point regarding testing. But i think we will hit that later anyway.
 8213:12 <nehan_> marcinja: what do you think?
 8313:12 <_andrewtoth_> jnewbery: du -h | grep addr_index -> 98G ./addr_index
 8413:13 <_andrewtoth_> oh sorry you want to know blockchain size, not sure
 8513:13 <marcinja> nehan_: I definitely agree it should be documented better, that should at least make it easier to review. Stats and benchmarks would put people at ease maybe and also won't be hard to get
 8613:13 <jonatack> michaelfolkson: Ah! I do not know. I did not write this summary :)
 8713:14 <jnewbery> _andrewtoth_: du -h blocks should tell you
 8813:14 <jkczyz> Did jamesob report back on the use cases that companies had for this? Seems any solution that would scale requires multiple nodes and mechanism to resolve inconsistencies across nodes.
 8913:14 <_andrewtoth_> oh, i'm already synced
 9013:14 <_andrewtoth_> it's just indexing in the addrindex thread
 9113:15 <jnewbery> _andrewtoth_: ah, ok. I forgot that the indexer could fall behind!
 9213:15 <michaelfolkson> jkczyz: So on the use case front I saw local block explorers. What else?
 9313:15 <_andrewtoth_> jnewbery: well i already had the node synced, i just built on the branch and ran with -addrindex
 9413:16 <jnewbery> jkczyz: I'd *hope* that there wouldn't be inconsistencies
 9513:17 <jnewbery> I linked to a couple of comments from engineers building services, and their use cases for an address index: https://github.com/bitcoin/bitcoin/pull/2802#issuecomment-26712372, https://github.com/bitcoin/bitcoin/pull/14053#issuecomment-558344637
 9613:17 <jkczyz> By inconsistencies I mean competing versions of the tip
 9713:17 <fjahr> michaelfolkson: people who use block explorers but don't want to use public ones
 9813:17 <jkczyz> chain tip
 9913:17 <fjahr> ah sorry, that's what you meant
10013:18 <raj_> cant then this be done in standalone way for those who want to use their own explorer?
10113:18 <jnewbery> jkczyz: Is that concern specific to address indexes or general to any service built on Bitcoin?
10213:18 <michaelfolkson> I'm finding it hard to care about the Approach ACK when there isn't agreement on the Concept ACK. And at least the Concept nACK seems to be the primary reason why the other attempts were rejected
10313:18 <jkczyz> any service really
10413:19 <jonatack> marcinja: Since you are here (which is great!) could you discuss your motivation for reviving this PR? I didn't you mention the *why* in the PR description, which seemed quite succinct for a change like this.
10513:19 <jonatack> I didn't see* you mention the *why*
10613:19 <jnewbery> michaelfolkson: I don't see any concept NACKs on the PR
10713:20 <michaelfolkson> jnewbery: But you agree that's the primary reason why other attempts failed?
10813:20 <raj_> any service really. Seems like something you can write your own script for. This PR can be used as a refference.
10913:21 <michaelfolkson> Don't mean to derail discussion :/ The PR seems good
11013:21 <raj_> But yes, through discussion it does seem like a high demand functionality.
11113:22 <jonatack> marcinja: In other words, you provide the how but without your why. I would have liked to hear your why :)
11213:22 <jnewbery> michaelfolkson: maybe. The final comment on the last attempt was from sipa: "I vote to close this; this should be done in an external index. Perhaps at some point when we can a more modular design it can be supported as an optionally buildable module, but let's not complicate the current database tracking code."
11313:22 <fjahr> raj_: that was the feedback in most of the discussions. As a standalone or patch. But of course people would like a really robust solution maintained by core because this is not so easy to maintain.
11413:22 <jnewbery> we now have that more modular design!
11513:23 <raj_> fjahr: any particular reason why its hard to maintain?
11613:23 <michaelfolkson> Ok fair enough. Thanks :)
11713:23 <jnewbery> ok, let's move on to the next question: What steps did you take to review and test this PR? Did you try to create an address index over the full mainnet block chain?
11813:24 <raj_> standard unit and functional testing. Didint tried on mainnet.
11913:24 <jnewbery> and we can also talk about the next question at the same time: Did you review the unit and functional tests? Are there any additional test cases that you’d like to see added?
12013:24 <nehan_> i just read the code
12113:24 <ariard> I agree with sipa we should do this in its own module post-multiprocess with the rest of the server code
12213:24 <_andrewtoth_> built and ran tests, but as I mentioned, mainnet syncing is taking a long time
12313:24 <fjahr> raj_: I don't have seen any particular reference to which problems these projects have.
12413:24 <_andrewtoth_> however, testing already indexed addresses with searchrawtransactions works
12513:25 <marcinja> jonatack: Sure. One of the main justifications is that it seems like a lot of people don't want to run extra software in addition to Bitcoin Core to get block explorer functionality. If people can get that functionality easily without sacrificing privacy its a plus for me.
12613:25 <emzy> I just read the PR and compiled the PR. But had no time to run it. Mainet data is still rsyning
12713:25 <jnewbery> ariard: what do you think having multiprocess provides as an improvement for an address index?
12813:25 <_andrewtoth_> as i mentioned on github, we should have ability to skip transactions, so we can page through them. I'd like to see that implemented and tested
12913:25 <raj_> functional testing does not include any check for spends_results. It only checks consistency of creation_results.
13013:26 <jnewbery> raj_: yes, I'd definitly like to see that added to the tests. Did anyone test it manually?
13113:26 <_andrewtoth_> yes i tested manually and got spend results
13213:26 <ariard> jnewbery: architectural level, you should split all indexes in its own process (like bitcoin-server) so the server would be able to serve multiple clients without encumbering operation of the consensus/p2p module
13313:27 <ariard> so easier to maintain, smaller binary for ones not interested, easier to have its improvements pace, like if we split the wallet from the node
13413:29 <jonatack> marcinja: Thanks. It's true there is accumulated context behind the PR, but my intuition (which can be wrong ofc) is that providing context (and links to more) in your PR description can help "sell" the PR.
13513:29 <jnewbery> ariard: binary size is negligable (a few hundred lines of code), I don't think there's any difference in maintenance - that's more of a code separation issue than process separation
13613:29 <nehan_> ariard: what needs to be done before one could build this PR the way you suggest? How far out is that?
13713:30 <jnewbery> Serving RPC requests to the index doesn't take cs_main, so I don't think it encumbers operation of consensus/p2p
13813:30 <jnewbery> and improvements pace is another code separation thing, not process separation
13913:30 <michaelfolkson> I agree jonatack. I think jnewbery point on the modularity of Core being improved since the previous attempts is a key "sell" too
14013:30 <jonatack> marcinja: because if the *why* isn't justified, then the *how* doesn't matter.
14113:30 <jnewbery> (I'm not arguing against multiprocess, but I don't think it directly impacts this PR)
14213:30 <marcinja> jonatack: That's a good point. I'll try to summarize the background better for a new PR description
14313:32 <jonatack> marcinja: yes, and give it a bit of your personal conviction too perhaps, without being too subjective. Just a side thought, maybe not a helpful one, but I was looking for that.
14413:32 <ariard> jnewbery: code separation let you do process separation so both of them are tied, having different processes would let them running on different machines
14513:32 <ariard> and if you server is doing heavy rescans, not taking cpu time for the validation
14613:33 <jnewbery> ariard: I agree those things are true, but I don't think it's a requirement for this PR
14713:33 <jnewbery> there are lots of things that *could* be improved about Bitcoin Core architecture, but we shouldn't pull all of those into discussion of individual PRs
14813:34 <ariard> jnewbery: I'm worried it would make it harder to split it latter, that's really IMO but if you take the wallet, more there is features, more the chain interface is stuffed, harder it make it to refactor
14913:34 <nehan_> ariard: you can run 2 bitcoin core nodes, one with this on and one with this off, if you wanted.
15013:35 <ariard> nehan_: hmm that's a good question, harder point would be to memory isolate indexes from the rest of the node, cut dependencies between them, and get it its own p2p stack
15113:36 <jnewbery> I believe the only validation interface method this uses is BlockConnected, which is in the base indexer class
15213:36 <jnewbery> It's interface with the rest of the node is minimal
15313:36 <jnewbery> *Its
15413:36 <ariard> nehan_: okay but know you do 2 ibds, even if the second one is pruned, seems more a hack than doing it cleanly
15513:37 <jnewbery> Next question!
15613:37 <jnewbery> The address index is stored in LevelDB, which is a key-value store. What is used as the key, and what is used as the value?
15713:37 <nehan_> ariard: if one implemented a blockexplorer from scratch they'd have to do IBD anyway.
15813:38 <ariard> jnewbery: dunno if this is that hard to split, want to try at some point
15913:38 <nehan_> this was confusing. marcinja please add documentation :) but i think the key is scripthash, outpoint
16013:38 <ariard> nehan_: a pruned wallet does not need ibd, you already speaking about one specific application requirement I think
16113:39 <belcher> earlier people were saying this addrindex could be used as an electrum server, electrum's protocol uses hash(scriptPubKey) as the key for its queries, so if an electrum server implementation is desired that would probably need to be the key in core
16213:39 <raj_> The key seems to be [type, outpoint] and value: [CDiskTxPos, Script]
16313:39 <nehan_> belcher: hash(scriptPubKey) is the key prefix, so that works here too. How does electrum handle multiple outpoints with the same hash(scriptPubKey)?
16413:39 <raj_> I might be wrong, i am very new to c++
16513:40 <jnewbery> nehan_: I think it's (type | script hash | outpoint) . Is that right, marcinja?
16613:40 <nehan_> oh yes. type, sorry, forgot.
16713:40 <belcher> nehan_ the electrum client sends a list of its bitcoin addresses encoded as hash(spk) and the server replies with transactions on those addresses, so if an address contains multiple outputs then the server replies with all those txes
16813:41 <jnewbery> I was confused about the outpoint being used: https://github.com/bitcoin/bitcoin/pull/14053#discussion_r363971325
16913:41 <michaelfolkson> Type being SEED, SPENT, CREATED
17013:41 <belcher> electrum doesnt use outpoints in its protocol, it only uses hash(spk) and transactions encoded as hex
17113:42 <nehan_> belcher: the rpc interface doesn't use outpoints either. so I think this is compatible.
17213:42 <marcinja> jnewbery: it's (script hash | type | outpoint)
17313:42 <jnewbery> belcher: sorry, the question might be a bit confusing. I was asking about the internal levelDB representation. The RPC query uses the script or address to lookup in the index
17413:43 <michaelfolkson> +1 on nehan_ docs suggestion
17513:44 <jnewbery> marcinja: you sure? I think it's (type | scripthash | outpoint) : https://github.com/bitcoin/bitcoin/pull/14053/files#diff-11cde577be5c21f1d0cb1527c50543b3R188
17613:44 <nehan_> jnewbery: i believe it is always the outpoint where the scriptPubKey was created. I believe this is used to get unique keys when the same scriptPubKey is used multiple times.
17713:45 <marcinja> nehan_: that's right. leveldb doesn't support multiple values for a single key so adding the outpoint is one way to have unique keys.
17813:46 <jnewbery> It looked to me like that part of the key (the outpoint) is included in the RPC response as txid: https://github.com/bitcoin/bitcoin/pull/14053/files#diff-01aa7d1d32f1b9e5a836c9c411978918R314
17913:46 <jnewbery> so I *think* that for spends, the txid given is actually the txid of the transaction creating the UTXO not the txid of the transaction spending the UTXO
18013:48 <jnewbery> ok, time's running out so let's move on to the next question: Where is MurmurHash3 used in this PR? How are hash collisions resolved? Where else is MurmurHash3 used in the Bitcoin Core codebase?
18113:49 <marcinja> jnewbery: yes that's right, the transaction from the value should be used for that txid. clearly that needs a functional test :)
18213:50 <marcinja> jnewbery: I think the key is serialized in the order I gave, but the constructor shows them in a different order which is definitely confusing
18313:50 <michaelfolkson> The address index stores a copy of the CScript in case of hash collisions
18413:51 <nehan_> jnewbery: i think you caught a bug. I'm leaving a comment.
18513:51 <jnewbery> michaelfolkson: yes, that's right
18613:52 <jnewbery> ok, final question. How does the address index handle reorgs?
18713:54 <jnewbery> ok, perhaps something to look into when you all do your reviews :)
18813:54 <jnewbery> 5 minutes left. Any final questions?
18913:55 <michaelfolkson> What is the answer to the reorg question? :)
19013:55 <instagibbs> anyone run benchmarks up to different heights? both benchmarking IBD and query response times?
19113:55 <instagibbs> michaelfolkson, "hopefully" ;)
19213:56 <jnewbery> michaelfolkson: left as an exercise for the reader
19313:56 <emzy> I will run a benchmark with a blockchain that has already a txindex.
19413:57 <nehan_> jnewbery: https://github.com/bitcoin/bitcoin/pull/14053/files#r364389525
19513:57 <jnewbery> emzy: the presence/absence of a txindex shouldn't have any impact
19613:58 <_andrewtoth_> instagibbs: just generating the addrindex with a synced node has taken 16 hours to get to 478000 blocks :/
19713:58 <jnewbery> nehan_: thanks!
19813:58 <_andrewtoth_> IBD with assumevalid on has taken the same machine < 6 hours
19913:59 <jnewbery> I'm slightly unconcerned about performance of building the address index for the first time, since it only has to be done once
20013:59 <jnewbery> I wonder how this compares to external tools like electrs
20113:59 <_andrewtoth_> also, querying can be optimized by taking count and returning early
20213:59 <_andrewtoth_> i commented on github
20313:59 <jnewbery> belcher: do you know how long building an index in electrs takes, or is it doing something different?
20414:00 <emzy> electrs will do a full sync in about 16h
20514:00 <michaelfolkson> In this case why use MurmurHash3? Would another hash function offer better collision resistance? Understand it is a trade-off with simplicity, speed etc
20614:00 <belcher> i dont know
20714:00 <michaelfolkson> "good enough" collision resistance I suppose
20814:00 <jnewbery> That's time. Thanks everyone. Hope you all enjoyed!
20914:00 <_andrewtoth_> thanks!
21014:00 <jnewbery> And thanks to marcinja for dropping in
21114:01 <marcinja> thanks jnewbery for hosting, and thanks to everyone who reviewed and gave feedback. This was very helpful :)
21214:01 <amiti> thank you!
21314:01 <nehan_> thanks!
21414:01 <michaelfolkson> You mean in EPS jnewbery?
21514:01 <michaelfolkson> Thanks!
21614:01 <raj_> Thanks.
21714:01 <fjahr> thanks, very good choice!
21814:01 <emzy> thanks jnewbery, marcinja and everyone!
21914:01 <_andrewtoth_> EPS uses the wallet and imports the addresses, then does a rescan
22014:01 <jnewbery> Next week is fuzzing, hosted by MarcoFalke. Notes are already available: https://bitcoincore.reviews/17860.html
22114:01 <jnewbery> #endmeeting