Add address-based index (attempt 4?) (
utxo db and indexes) Jan 8, 2020
The PR branch HEAD was ec5fbd0b at the time of this review club meeting.
address index is an index from addresses or scriptPubKeys to the
transactions in the block chain that create outputs to that address or spend
outputs from that address.
Adding an address index is a very common feature request for Bitcoin Core.
Being able to lookup transactions from an address is useful for block
explorers or other services that need to scan the block chain. Example
There have been several patches and additional software to produce
transaction indexes over the years:
ElectrumX are three examples.
There are also centralized block explorer services that provide an
address-to-transactions lookup. These offer convenience at the cost
There have been several attempts to add an address index to Bitcoin Core:
Address indexes are inherently unscalable since they grow linearly with the
size of the block chain. Some contributors worry that offering an address
index in Bitcoin Core may encourage businesses to build Bitcoin services on
infrastructure that can’t scale as the size of the block chain increases.
Pieter Wuille expresses this concern in the post from his PR: “I hate making it
easy to build infrastructure that relies on a fully-indexed blockchain (which
shouldn’t be necessary), as it may remove the incentive to build more scalable
solutions. On the other hand, in cases where the alternative is relying on a
trusted third party, this approach is certainly preferable, and would allow an
RPC-based blockexplorer, for example.”
For more discussion on the merits of adding an address index to Bitcoin Core,
see the discussion in the
13 Dec 2018 Bitcoin Core IRC
This latest attempt to add an address index builds on the new auxiliary
indexes base class introduced in
13243. The auxiliary index
infrastructure runs asynchronously and uses the CValidationInterface to
receive new block data. That means adding new indexes should have minimal
impact on performance and low risk of introducing a bug that impacts mainline
functionality. The address index code and tests (
test/addrindex_tests.cpp) added in this PR are similar to the transaction
index code and tests (
test/txindex_tests.cpp). Comparing them by eye or using a diff tool may be
helpful to understand what the new code is doing.
Do you think Bitcoin Core should include an address index? What are the
arguments for and against?
Did you review the PR?
Concept ACK, approach ACK, ACK <commit>, or
Don’t forget to put your PR review on GitHub or ask
What steps did you take to review and test this PR? Did you try to create an
address index over the full mainnet block chain?
Did you review the unit and functional tests? Are there any additional test
cases that you’d like to see added?
The address index is stored in
LevelDB, which is a key-value
store. What is used as the key, and what is used as the value?
Where is MurmurHash3 used in this PR? How are hash collisions resolved?
Where else is MurmurHash3 used in the Bitcoin Core codebase?
How does the address index handle reorgs?
1 13:00 <jnewbery> #startmeeting
7 13:00 <jnewbery> Hi folks. Welcome to the first Review Club of 2020 :)
8 13:00 <michaelfolkson> hi
12 13:00 <jnewbery> Feel free to say "hi" to let everyone know that you're at keyboard.
13 13:00 <jnewbery> Before we begin, just a reminder of some of our meeting conventions:
14 13:00 <jnewbery> 1. You don't have to ask to ask a question (e.g. "I have a question about x but I don't know if it's on-topic?"). Just go ahead and ask. If it's off-topic we'll tell you.
15 13:00 <jnewbery> 2. You don't need to wait for the host to ask a specific question — just jump in at any point.
17 13:01 <jnewbery> Ok, let's get started. Did everyone get a chance to review the PR? How about a quick y/n from everyone
20 13:01 <jkczyz> n (read the notes)
23 13:02 <michaelfolkson> y
24 13:02 <amiti> n - read notes and brief overview of PR
25 13:02 <fjahr> working on it
27 13:02 <jnewbery> First question was about whether Bitcoin Core should include an address index. I don't think we want to get into a long debate, but did people think a bit about the arguments for and against?
28 13:03 <marcinja> cccccclkninbdfgdifvjhklfgutvrdrtetdfbirvveuj
29 13:03 <kanzure> was there a recent address index proposal..?
30 13:03 <fjahr> pro: clearly there is a need for it based on user requests
31 13:03 <marcinja> oops sorry, new yubikey is being annoying
32 13:03 <fjahr> con: storage requirements
33 13:03 <amiti> con: unscalable in the long term
34 13:03 <jnewbery> kanzure: not recent. We're reviewing 14053 which was opened last year
35 13:03 <jonatack> I'll admit I'm having trouble concept acking this being in bitcoin core despite the long-term request for this
36 13:04 <nehan_> con: more code to maintain / added complexity
37 13:04 <raj_> pro: seems like an useful tool. con: worried about storage and IBD time.
38 13:04 <amiti> pro: people need it so often depend on 3rd party services. having it in core would be preferable
39 13:04 <michaelfolkson> Yeah if it keeps coming up, there's almost an argument to just get it over with. Assuming the downsides aren't that severe which they don't seem to be (at least for the Core project itself)
40 13:04 <nehan_> pro: seems to be real desire for this and good use cases
41 13:04 <kanzure> jnewbery: thanks for the clarification.
42 13:05 <_andrewtoth_> pro: increased privacy and not forcing users to use third-party services/software
43 13:05 <jnewbery> raj_: this wouldn't impact IBD time (well perhaps for people that decided to enable an address index, but this would be disabled by default)
44 13:05 <fjahr> I did not see any stats and projections on storage usage, should not be to hard to I think, or did I miss something?
45 13:05 <_andrewtoth_> con: creating reliance on it being available when it won't scale
46 13:05 <jnewbery> yeah, I think that's a good summary of the arguments. What did people think about this particular implementation. Does it address any of those concerns?
47 13:05 <jonatack> Opportunity cost, maintainence, and disincentivises the more sustainable dedicated external solution, pushing it down the road
48 13:06 <jonatack> at first look in any case
49 13:06 <emzy> pro: you could habe a more simple elecrum wallet connector. Or use electrum direct with bitcoin core.
51 13:06 <_andrewtoth_> I guess it would be more simple though, yeah
52 13:07 <emzy> _andrewtoth_: I know. But you have te reindex und put ypu xpub in EPS.
53 13:07 <nehan_> jnewbery: does not address long-term scalability issue. doesn't seem too bad in terms of added code though.
54 13:07 <raj_> i feel like its a good addition, given optionality.
56 13:07 <jnewbery> I think using the base index infrastructure addresses some of the concerns about code complexity and performance
57 13:08 <fjahr> pro of this particular PR against the historical ones: it uses the base index which should make it easier to maintain, which means this is probably also much easier to maintain as an external patch if it does not get in
58 13:08 <michaelfolkson> Any reason why EPS wasn't included in the summary jonatack?
59 13:08 <jnewbery> nehan_: right, I agree
60 13:08 <jnewbery> fjahr: I agree that it'd be nice to see stats and storage projections
61 13:09 <jonatack> michaelfolkson: i did not understand your comment
62 13:09 <_andrewtoth_> emzy: with https://github.com/bitcoin/bitcoin/pull/10785 it would make it much faster to rescan. I think that's ultimately the better solution, block filter indexing and fast rescanning for addresses you want to know about
63 13:09 <jnewbery> marcinja: did you run this on mainnet? Do you know what the storage requirements are for indexing the full mainnet?
64 13:09 <raj_> yes stats would be very informative
65 13:09 <jnewbery> (or did anyone else run this on mainnet?)
66 13:09 <raj_> I am clueless about the potential impact otherwise.
67 13:10 <michaelfolkson> jonatack: "Bitcore, electrs and ElectrumX are three examples." I just wondered if EPS was missing something or not intentional.
68 13:10 <_andrewtoth_> i started running this on mainnet last evening. It's only at block 473300 now
69 13:10 <marcinja> jnewbery: I did a full sync on mainnet last year but I don't remember the size off the top of my head
70 13:10 <_andrewtoth_> storage is at 98 GB already
71 13:10 <_andrewtoth_> so it's several times slower than reindex-chainstate
72 13:10 <nehan_> it's useful for me to understand what's missing from something like this before it might be considered for merging. one is definitely stats/benchmarks.
73 13:11 <nehan_> another is testing and the documentation jnewbery pointed out in his comment
74 13:11 <jnewbery> nehan_: that's a good question. What do other people think?
75 13:11 <nehan_> or perhaps others think it is mergeable without those things?
76 13:11 <michaelfolkson> About stats/benchmarks?
77 13:11 <raj_> agreed with nehan_
78 13:11 <nehan_> michaelfolkson: what else needs to be done before considering this PR for merging?
79 13:12 <_andrewtoth_> nehan_: agree, we need those numbers before considering
80 13:12 <jnewbery> _andrewtoth_: how large is the blockchain up to height 473300? (I'm interested in how large the address index is in comparison to the blockchain)
81 13:12 <raj_> i do have one point regarding testing. But i think we will hit that later anyway.
82 13:12 <nehan_> marcinja: what do you think?
83 13:12 <_andrewtoth_> jnewbery: du -h | grep addr_index -> 98G ./addr_index
84 13:13 <_andrewtoth_> oh sorry you want to know blockchain size, not sure
85 13:13 <marcinja> nehan_: I definitely agree it should be documented better, that should at least make it easier to review. Stats and benchmarks would put people at ease maybe and also won't be hard to get
86 13:13 <jonatack> michaelfolkson: Ah! I do not know. I did not write this summary :)
87 13:14 <jnewbery> _andrewtoth_: du -h blocks should tell you
88 13:14 <jkczyz> Did jamesob report back on the use cases that companies had for this? Seems any solution that would scale requires multiple nodes and mechanism to resolve inconsistencies across nodes.
89 13:14 <_andrewtoth_> oh, i'm already synced
90 13:14 <_andrewtoth_> it's just indexing in the addrindex thread
91 13:15 <jnewbery> _andrewtoth_: ah, ok. I forgot that the indexer could fall behind!
92 13:15 <michaelfolkson> jkczyz: So on the use case front I saw local block explorers. What else?
93 13:15 <_andrewtoth_> jnewbery: well i already had the node synced, i just built on the branch and ran with -addrindex
94 13:16 <jnewbery> jkczyz: I'd *hope* that there wouldn't be inconsistencies
96 13:17 <jkczyz> By inconsistencies I mean competing versions of the tip
97 13:17 <fjahr> michaelfolkson: people who use block explorers but don't want to use public ones
98 13:17 <jkczyz> chain tip
99 13:17 <fjahr> ah sorry, that's what you meant
100 13:18 <raj_> cant then this be done in standalone way for those who want to use their own explorer?
101 13:18 <jnewbery> jkczyz: Is that concern specific to address indexes or general to any service built on Bitcoin?
102 13:18 <michaelfolkson> I'm finding it hard to care about the Approach ACK when there isn't agreement on the Concept ACK. And at least the Concept nACK seems to be the primary reason why the other attempts were rejected
103 13:18 <jkczyz> any service really
104 13:19 <jonatack> marcinja: Since you are here (which is great!) could you discuss your motivation for reviving this PR? I didn't you mention the *why* in the PR description, which seemed quite succinct for a change like this.
105 13:19 <jonatack> I didn't see* you mention the *why*
106 13:19 <jnewbery> michaelfolkson: I don't see any concept NACKs on the PR
107 13:20 <michaelfolkson> jnewbery: But you agree that's the primary reason why other attempts failed?
108 13:20 <raj_> any service really. Seems like something you can write your own script for. This PR can be used as a refference.
109 13:21 <michaelfolkson> Don't mean to derail discussion :/ The PR seems good
110 13:21 <raj_> But yes, through discussion it does seem like a high demand functionality.
111 13:22 <jonatack> marcinja: In other words, you provide the how but without your why. I would have liked to hear your why :)
112 13:22 <jnewbery> michaelfolkson: maybe. The final comment on the last attempt was from sipa: "I vote to close this; this should be done in an external index. Perhaps at some point when we can a more modular design it can be supported as an optionally buildable module, but let's not complicate the current database tracking code."
113 13:22 <fjahr> raj_: that was the feedback in most of the discussions. As a standalone or patch. But of course people would like a really robust solution maintained by core because this is not so easy to maintain.
114 13:22 <jnewbery> we now have that more modular design!
115 13:23 <raj_> fjahr: any particular reason why its hard to maintain?
116 13:23 <michaelfolkson> Ok fair enough. Thanks :)
117 13:23 <jnewbery> ok, let's move on to the next question: What steps did you take to review and test this PR? Did you try to create an address index over the full mainnet block chain?
118 13:24 <raj_> standard unit and functional testing. Didint tried on mainnet.
119 13:24 <jnewbery> and we can also talk about the next question at the same time: Did you review the unit and functional tests? Are there any additional test cases that you’d like to see added?
120 13:24 <nehan_> i just read the code
121 13:24 <ariard> I agree with sipa we should do this in its own module post-multiprocess with the rest of the server code
122 13:24 <_andrewtoth_> built and ran tests, but as I mentioned, mainnet syncing is taking a long time
123 13:24 <fjahr> raj_: I don't have seen any particular reference to which problems these projects have.
124 13:24 <_andrewtoth_> however, testing already indexed addresses with searchrawtransactions works
125 13:25 <marcinja> jonatack: Sure. One of the main justifications is that it seems like a lot of people don't want to run extra software in addition to Bitcoin Core to get block explorer functionality. If people can get that functionality easily without sacrificing privacy its a plus for me.
126 13:25 <emzy> I just read the PR and compiled the PR. But had no time to run it. Mainet data is still rsyning
127 13:25 <jnewbery> ariard: what do you think having multiprocess provides as an improvement for an address index?
128 13:25 <_andrewtoth_> as i mentioned on github, we should have ability to skip transactions, so we can page through them. I'd like to see that implemented and tested
129 13:25 <raj_> functional testing does not include any check for spends_results. It only checks consistency of creation_results.
130 13:26 <jnewbery> raj_: yes, I'd definitly like to see that added to the tests. Did anyone test it manually?
131 13:26 <_andrewtoth_> yes i tested manually and got spend results
132 13:26 <ariard> jnewbery: architectural level, you should split all indexes in its own process (like bitcoin-server) so the server would be able to serve multiple clients without encumbering operation of the consensus/p2p module
133 13:27 <ariard> so easier to maintain, smaller binary for ones not interested, easier to have its improvements pace, like if we split the wallet from the node
134 13:29 <jonatack> marcinja: Thanks. It's true there is accumulated context behind the PR, but my intuition (which can be wrong ofc) is that providing context (and links to more) in your PR description can help "sell" the PR.
135 13:29 <jnewbery> ariard: binary size is negligable (a few hundred lines of code), I don't think there's any difference in maintenance - that's more of a code separation issue than process separation
136 13:29 <nehan_> ariard: what needs to be done before one could build this PR the way you suggest? How far out is that?
137 13:30 <jnewbery> Serving RPC requests to the index doesn't take cs_main, so I don't think it encumbers operation of consensus/p2p
138 13:30 <jnewbery> and improvements pace is another code separation thing, not process separation
139 13:30 <michaelfolkson> I agree jonatack. I think jnewbery point on the modularity of Core being improved since the previous attempts is a key "sell" too
140 13:30 <jonatack> marcinja: because if the *why* isn't justified, then the *how* doesn't matter.
141 13:30 <jnewbery> (I'm not arguing against multiprocess, but I don't think it directly impacts this PR)
142 13:30 <marcinja> jonatack: That's a good point. I'll try to summarize the background better for a new PR description
143 13:32 <jonatack> marcinja: yes, and give it a bit of your personal conviction too perhaps, without being too subjective. Just a side thought, maybe not a helpful one, but I was looking for that.
144 13:32 <ariard> jnewbery: code separation let you do process separation so both of them are tied, having different processes would let them running on different machines
145 13:32 <ariard> and if you server is doing heavy rescans, not taking cpu time for the validation
146 13:33 <jnewbery> ariard: I agree those things are true, but I don't think it's a requirement for this PR
147 13:33 <jnewbery> there are lots of things that *could* be improved about Bitcoin Core architecture, but we shouldn't pull all of those into discussion of individual PRs
148 13:34 <ariard> jnewbery: I'm worried it would make it harder to split it latter, that's really IMO but if you take the wallet, more there is features, more the chain interface is stuffed, harder it make it to refactor
149 13:34 <nehan_> ariard: you can run 2 bitcoin core nodes, one with this on and one with this off, if you wanted.
150 13:35 <ariard> nehan_: hmm that's a good question, harder point would be to memory isolate indexes from the rest of the node, cut dependencies between them, and get it its own p2p stack
151 13:36 <jnewbery> I believe the only validation interface method this uses is BlockConnected, which is in the base indexer class
152 13:36 <jnewbery> It's interface with the rest of the node is minimal
154 13:36 <ariard> nehan_: okay but know you do 2 ibds, even if the second one is pruned, seems more a hack than doing it cleanly
155 13:37 <jnewbery> Next question!
156 13:37 <jnewbery> The address index is stored in LevelDB, which is a key-value store. What is used as the key, and what is used as the value?
157 13:37 <nehan_> ariard: if one implemented a blockexplorer from scratch they'd have to do IBD anyway.
158 13:38 <ariard> jnewbery: dunno if this is that hard to split, want to try at some point
159 13:38 <nehan_> this was confusing. marcinja please add documentation :) but i think the key is scripthash, outpoint
160 13:38 <ariard> nehan_: a pruned wallet does not need ibd, you already speaking about one specific application requirement I think
161 13:39 <belcher> earlier people were saying this addrindex could be used as an electrum server, electrum's protocol uses hash(scriptPubKey) as the key for its queries, so if an electrum server implementation is desired that would probably need to be the key in core
162 13:39 <raj_> The key seems to be [type, outpoint] and value: [CDiskTxPos, Script]
163 13:39 <nehan_> belcher: hash(scriptPubKey) is the key prefix, so that works here too. How does electrum handle multiple outpoints with the same hash(scriptPubKey)?
164 13:39 <raj_> I might be wrong, i am very new to c++
165 13:40 <jnewbery> nehan_: I think it's (type | script hash | outpoint) . Is that right, marcinja?
166 13:40 <nehan_> oh yes. type, sorry, forgot.
167 13:40 <belcher> nehan_ the electrum client sends a list of its bitcoin addresses encoded as hash(spk) and the server replies with transactions on those addresses, so if an address contains multiple outputs then the server replies with all those txes
169 13:41 <michaelfolkson> Type being SEED, SPENT, CREATED
170 13:41 <belcher> electrum doesnt use outpoints in its protocol, it only uses hash(spk) and transactions encoded as hex
171 13:42 <nehan_> belcher: the rpc interface doesn't use outpoints either. so I think this is compatible.
172 13:42 <marcinja> jnewbery: it's (script hash | type | outpoint)
173 13:42 <jnewbery> belcher: sorry, the question might be a bit confusing. I was asking about the internal levelDB representation. The RPC query uses the script or address to lookup in the index
174 13:43 <michaelfolkson> +1 on nehan_ docs suggestion
176 13:44 <nehan_> jnewbery: i believe it is always the outpoint where the scriptPubKey was created. I believe this is used to get unique keys when the same scriptPubKey is used multiple times.
177 13:45 <marcinja> nehan_: that's right. leveldb doesn't support multiple values for a single key so adding the outpoint is one way to have unique keys.
179 13:46 <jnewbery> so I *think* that for spends, the txid given is actually the txid of the transaction creating the UTXO not the txid of the transaction spending the UTXO
180 13:48 <jnewbery> ok, time's running out so let's move on to the next question: Where is MurmurHash3 used in this PR? How are hash collisions resolved? Where else is MurmurHash3 used in the Bitcoin Core codebase?
181 13:49 <marcinja> jnewbery: yes that's right, the transaction from the value should be used for that txid. clearly that needs a functional test :)
182 13:50 <marcinja> jnewbery: I think the key is serialized in the order I gave, but the constructor shows them in a different order which is definitely confusing
183 13:50 <michaelfolkson> The address index stores a copy of the CScript in case of hash collisions
184 13:51 <nehan_> jnewbery: i think you caught a bug. I'm leaving a comment.
185 13:51 <jnewbery> michaelfolkson: yes, that's right
186 13:52 <jnewbery> ok, final question. How does the address index handle reorgs?
187 13:54 <jnewbery> ok, perhaps something to look into when you all do your reviews :)
188 13:54 <jnewbery> 5 minutes left. Any final questions?
189 13:55 <michaelfolkson> What is the answer to the reorg question? :)
190 13:55 <instagibbs> anyone run benchmarks up to different heights? both benchmarking IBD and query response times?
191 13:55 <instagibbs> michaelfolkson, "hopefully" ;)
192 13:56 <jnewbery> michaelfolkson: left as an exercise for the reader
193 13:56 <emzy> I will run a benchmark with a blockchain that has already a txindex.
195 13:57 <jnewbery> emzy: the presence/absence of a txindex shouldn't have any impact
196 13:58 <_andrewtoth_> instagibbs: just generating the addrindex with a synced node has taken 16 hours to get to 478000 blocks :/
197 13:58 <jnewbery> nehan_: thanks!
198 13:58 <_andrewtoth_> IBD with assumevalid on has taken the same machine < 6 hours
199 13:59 <jnewbery> I'm slightly unconcerned about performance of building the address index for the first time, since it only has to be done once
200 13:59 <jnewbery> I wonder how this compares to external tools like electrs
201 13:59 <_andrewtoth_> also, querying can be optimized by taking count and returning early
202 13:59 <_andrewtoth_> i commented on github
203 13:59 <jnewbery> belcher: do you know how long building an index in electrs takes, or is it doing something different?
204 14:00 <emzy> electrs will do a full sync in about 16h
205 14:00 <michaelfolkson> In this case why use MurmurHash3? Would another hash function offer better collision resistance? Understand it is a trade-off with simplicity, speed etc
206 14:00 <belcher> i dont know
207 14:00 <michaelfolkson> "good enough" collision resistance I suppose
208 14:00 <jnewbery> That's time. Thanks everyone. Hope you all enjoyed!
209 14:00 <_andrewtoth_> thanks!
210 14:00 <jnewbery> And thanks to marcinja for dropping in
211 14:01 <marcinja> thanks jnewbery for hosting, and thanks to everyone who reviewed and gave feedback. This was very helpful :)
212 14:01 <amiti> thank you!
213 14:01 <nehan_> thanks!
214 14:01 <michaelfolkson> You mean in EPS jnewbery?
215 14:01 <michaelfolkson> Thanks!
217 14:01 <fjahr> thanks, very good choice!
218 14:01 <emzy> thanks jnewbery, marcinja and everyone!
219 14:01 <_andrewtoth_> EPS uses the wallet and imports the addresses, then does a rescan
221 14:01 <jnewbery> #endmeeting