Add address-based index (attempt 4?) (utxo db and indexes)

Jan 8, 2020

https://github.com/bitcoin/bitcoin/pull/14053

Host: jnewbery - PR author: marcinja

The PR branch HEAD was ec5fbd0b at the time of this review club meeting.

Notes

An address index is an index from addresses or scriptPubKeys to the transactions in the block chain that create outputs to that address or spend outputs from that address.
Adding an address index is a very common feature request for Bitcoin Core. Being able to lookup transactions from an address is useful for block explorers or other services that need to scan the block chain. Example requests: 1, 2
There have been several patches and additional software to produce transaction indexes over the years: Bitcore, electrs and ElectrumX are three examples.
There are also centralized block explorer services that provide an address-to-transactions lookup. These offer convenience at the cost of privacy.
There have been several attempts to add an address index to Bitcoin Core:
Address indexes are inherently unscalable since they grow linearly with the size of the block chain. Some contributors worry that offering an address index in Bitcoin Core may encourage businesses to build Bitcoin services on infrastructure that can’t scale as the size of the block chain increases. Pieter Wuille expresses this concern in the post from his PR: “I hate making it easy to build infrastructure that relies on a fully-indexed blockchain (which shouldn’t be necessary), as it may remove the incentive to build more scalable solutions. On the other hand, in cases where the alternative is relying on a trusted third party, this approach is certainly preferable, and would allow an RPC-based blockexplorer, for example.”
For more discussion on the merits of adding an address index to Bitcoin Core, see the discussion in the 13 Dec 2018 Bitcoin Core IRC meeting.
This latest attempt to add an address index builds on the new auxiliary indexes base class introduced in PR 13243. The auxiliary index infrastructure runs asynchronously and uses the CValidationInterface to receive new block data. That means adding new indexes should have minimal impact on performance and low risk of introducing a bug that impacts mainline functionality.
The address index code and tests (index/addrindex.cpp, index/addrindex.h, test/addrindex_tests.cpp) added in this PR are similar to the transaction index code and tests (index/txindex.cpp, index/txindex.h, test/txindex_tests.cpp). Comparing them by eye or using a diff tool may be helpful to understand what the new code is doing.

Questions

Do you think Bitcoin Core should include an address index? What are the arguments for and against?
Did you review the PR? Concept ACK, approach ACK, ACK <commit>, or NACK? Don’t forget to put your PR review on GitHub or ask questions.
What steps did you take to review and test this PR? Did you try to create an address index over the full mainnet block chain?
Did you review the unit and functional tests? Are there any additional test cases that you’d like to see added?
The address index is stored in LevelDB, which is a key-value store. What is used as the key, and what is used as the value?
Where is MurmurHash3 used in this PR? How are hash collisions resolved? Where else is MurmurHash3 used in the Bitcoin Core codebase?
How does the address index handle reorgs?

Meeting Log

13:00

<jnewbery> #startmeeting

13:00

<jkczyz> hi

13:00

<jonatack> hi

13:00

<jnewbery> hi

13:00

<_andrewtoth_> hi

13:00

<fjahr> hi

13:00

<jnewbery> Hi folks. Welcome to the first Review Club of 2020 :)

13:00

<michaelfolkson> hi

13:00

<emzy> Hi

13:00

<amiti> hi

13:00

<jnewbery> Notes are in the usual place: https://bitcoincore.reviews/14053.html

13:00

<jnewbery> Feel free to say "hi" to let everyone know that you're at keyboard.

13:00

<jnewbery> Before we begin, just a reminder of some of our meeting conventions:

13:00

<jnewbery> 1. You don't have to ask to ask a question (e.g. "I have a question about x but I don't know if it's on-topic?"). Just go ahead and ask. If it's off-topic we'll tell you.

13:00

<jnewbery> 2. You don't need to wait for the host to ask a specific question — just jump in at any point.

13:01

<van> hi

13:01

<jnewbery> Ok, let's get started. Did everyone get a chance to review the PR? How about a quick y/n from everyone

13:01

<_andrewtoth_> y

13:01

<raj_> y

13:01

<jkczyz> n (read the notes)

13:01

<emzy> y

13:01

<kanzure> hi

13:02

<michaelfolkson> y

13:02

<amiti> n - read notes and brief overview of PR

13:02

<fjahr> working on it

13:02

<jonatack> y

13:02

<jnewbery> First question was about whether Bitcoin Core should include an address index. I don't think we want to get into a long debate, but did people think a bit about the arguments for and against?

13:03

<marcinja> cccccclkninbdfgdifvjhklfgutvrdrtetdfbirvveuj

13:03

<kanzure> was there a recent address index proposal..?

13:03

<fjahr> pro: clearly there is a need for it based on user requests

13:03

<marcinja> oops sorry, new yubikey is being annoying

13:03

<fjahr> con: storage requirements

13:03

<amiti> con: unscalable in the long term

13:03

<jnewbery> kanzure: not recent. We're reviewing 14053 which was opened last year

13:03

<jonatack> I'll admit I'm having trouble concept acking this being in bitcoin core despite the long-term request for this

13:04

<nehan_> con: more code to maintain / added complexity

13:04

<raj_> pro: seems like an useful tool. con: worried about storage and IBD time.

13:04

<amiti> pro: people need it so often depend on 3rd party services. having it in core would be preferable

13:04

<michaelfolkson> Yeah if it keeps coming up, there's almost an argument to just get it over with. Assuming the downsides aren't that severe which they don't seem to be (at least for the Core project itself)

13:04

<nehan_> pro: seems to be real desire for this and good use cases

13:04

<kanzure> jnewbery: thanks for the clarification.

13:05

<_andrewtoth_> pro: increased privacy and not forcing users to use third-party services/software

13:05

<jnewbery> raj_: this wouldn't impact IBD time (well perhaps for people that decided to enable an address index, but this would be disabled by default)

13:05

<fjahr> I did not see any stats and projections on storage usage, should not be to hard to I think, or did I miss something?

13:05

<_andrewtoth_> con: creating reliance on it being available when it won't scale

13:05

<jnewbery> yeah, I think that's a good summary of the arguments. What did people think about this particular implementation. Does it address any of those concerns?

13:05

<jonatack> Opportunity cost, maintainence, and disincentivises the more sustainable dedicated external solution, pushing it down the road

13:06

<jonatack> at first look in any case

13:06

<emzy> pro: you could habe a more simple elecrum wallet connector. Or use electrum direct with bitcoin core.

13:06

<_andrewtoth_> emzy: you can already do that with https://github.com/chris-belcher/electrum-personal-server

13:06

<_andrewtoth_> I guess it would be more simple though, yeah

13:07

<emzy> _andrewtoth_: I know. But you have te reindex und put ypu xpub in EPS.

13:07

<nehan_> jnewbery: does not address long-term scalability issue. doesn't seem too bad in terms of added code though.

13:07

<raj_> i feel like its a good addition, given optionality.

13:07

<emzy> xpub.

13:07

<jnewbery> I think using the base index infrastructure addresses some of the concerns about code complexity and performance

13:08

<fjahr> pro of this particular PR against the historical ones: it uses the base index which should make it easier to maintain, which means this is probably also much easier to maintain as an external patch if it does not get in

13:08

<michaelfolkson> Any reason why EPS wasn't included in the summary jonatack?

13:08

<jnewbery> nehan_: right, I agree

13:08

<jnewbery> fjahr: I agree that it'd be nice to see stats and storage projections

13:09

<jonatack> michaelfolkson: i did not understand your comment

13:09

<_andrewtoth_> emzy: with https://github.com/bitcoin/bitcoin/pull/10785 it would make it much faster to rescan. I think that's ultimately the better solution, block filter indexing and fast rescanning for addresses you want to know about

13:09

<jnewbery> marcinja: did you run this on mainnet? Do you know what the storage requirements are for indexing the full mainnet?

13:09

<raj_> yes stats would be very informative

13:09

<jnewbery> (or did anyone else run this on mainnet?)

13:09

<raj_> I am clueless about the potential impact otherwise.

13:10

<michaelfolkson> jonatack: "Bitcore, electrs and ElectrumX are three examples." I just wondered if EPS was missing something or not intentional.

13:10

<_andrewtoth_> i started running this on mainnet last evening. It's only at block 473300 now

13:10

<marcinja> jnewbery: I did a full sync on mainnet last year but I don't remember the size off the top of my head

13:10

<_andrewtoth_> storage is at 98 GB already

13:10

<_andrewtoth_> so it's several times slower than reindex-chainstate

13:10

<nehan_> it's useful for me to understand what's missing from something like this before it might be considered for merging. one is definitely stats/benchmarks.

13:11

<nehan_> another is testing and the documentation jnewbery pointed out in his comment

13:11

<jnewbery> nehan_: that's a good question. What do other people think?

13:11

<nehan_> or perhaps others think it is mergeable without those things?

13:11

<michaelfolkson> About stats/benchmarks?

13:11

<raj_> agreed with nehan_

13:11

<nehan_> michaelfolkson: what else needs to be done before considering this PR for merging?

13:12

<_andrewtoth_> nehan_: agree, we need those numbers before considering

13:12

<jnewbery> _andrewtoth_: how large is the blockchain up to height 473300? (I'm interested in how large the address index is in comparison to the blockchain)

13:12

<raj_> i do have one point regarding testing. But i think we will hit that later anyway.

13:12

<nehan_> marcinja: what do you think?

13:12

<_andrewtoth_> jnewbery: du -h | grep addr_index -> 98G ./addr_index

13:13

<_andrewtoth_> oh sorry you want to know blockchain size, not sure

13:13

<marcinja> nehan_: I definitely agree it should be documented better, that should at least make it easier to review. Stats and benchmarks would put people at ease maybe and also won't be hard to get

13:13

<jonatack> michaelfolkson: Ah! I do not know. I did not write this summary :)

13:14

<jnewbery> _andrewtoth_: du -h blocks should tell you

13:14

<jkczyz> Did jamesob report back on the use cases that companies had for this? Seems any solution that would scale requires multiple nodes and mechanism to resolve inconsistencies across nodes.

13:14

<_andrewtoth_> oh, i'm already synced

13:14

<_andrewtoth_> it's just indexing in the addrindex thread

13:15

<jnewbery> _andrewtoth_: ah, ok. I forgot that the indexer could fall behind!

13:15

<michaelfolkson> jkczyz: So on the use case front I saw local block explorers. What else?

13:15

<_andrewtoth_> jnewbery: well i already had the node synced, i just built on the branch and ran with -addrindex

13:16

<jnewbery> jkczyz: I'd *hope* that there wouldn't be inconsistencies

13:17

<jnewbery> I linked to a couple of comments from engineers building services, and their use cases for an address index: https://github.com/bitcoin/bitcoin/pull/2802#issuecomment-26712372, https://github.com/bitcoin/bitcoin/pull/14053#issuecomment-558344637

13:17

<jkczyz> By inconsistencies I mean competing versions of the tip

13:17

<fjahr> michaelfolkson: people who use block explorers but don't want to use public ones

13:17

<jkczyz> chain tip

13:17

<fjahr> ah sorry, that's what you meant

100

13:18

<raj_> cant then this be done in standalone way for those who want to use their own explorer?

101

13:18

<jnewbery> jkczyz: Is that concern specific to address indexes or general to any service built on Bitcoin?

102

13:18

<michaelfolkson> I'm finding it hard to care about the Approach ACK when there isn't agreement on the Concept ACK. And at least the Concept nACK seems to be the primary reason why the other attempts were rejected

103

13:18

<jkczyz> any service really

104

13:19

<jonatack> marcinja: Since you are here (which is great!) could you discuss your motivation for reviving this PR? I didn't you mention the *why* in the PR description, which seemed quite succinct for a change like this.

105

13:19

<jonatack> I didn't see* you mention the *why*

106

13:19

<jnewbery> michaelfolkson: I don't see any concept NACKs on the PR

107

13:20

<michaelfolkson> jnewbery: But you agree that's the primary reason why other attempts failed?

108

13:20

<raj_> any service really. Seems like something you can write your own script for. This PR can be used as a refference.

109

13:21

<michaelfolkson> Don't mean to derail discussion :/ The PR seems good

110

13:21

<raj_> But yes, through discussion it does seem like a high demand functionality.

111

13:22

<jonatack> marcinja: In other words, you provide the how but without your why. I would have liked to hear your why :)

112

13:22

<jnewbery> michaelfolkson: maybe. The final comment on the last attempt was from sipa: "I vote to close this; this should be done in an external index. Perhaps at some point when we can a more modular design it can be supported as an optionally buildable module, but let's not complicate the current database tracking code."

113

13:22

<fjahr> raj_: that was the feedback in most of the discussions. As a standalone or patch. But of course people would like a really robust solution maintained by core because this is not so easy to maintain.

114

13:22

<jnewbery> we now have that more modular design!

115

13:23

<raj_> fjahr: any particular reason why its hard to maintain?

116

13:23

<michaelfolkson> Ok fair enough. Thanks :)

117

13:23

<jnewbery> ok, let's move on to the next question: What steps did you take to review and test this PR? Did you try to create an address index over the full mainnet block chain?

118

13:24

<raj_> standard unit and functional testing. Didint tried on mainnet.

119

13:24

<jnewbery> and we can also talk about the next question at the same time: Did you review the unit and functional tests? Are there any additional test cases that you’d like to see added?

120

13:24

<nehan_> i just read the code

121

13:24

<ariard> I agree with sipa we should do this in its own module post-multiprocess with the rest of the server code

122

13:24

<_andrewtoth_> built and ran tests, but as I mentioned, mainnet syncing is taking a long time

123

13:24

<fjahr> raj_: I don't have seen any particular reference to which problems these projects have.

124

13:24

<_andrewtoth_> however, testing already indexed addresses with searchrawtransactions works

125

13:25

<marcinja> jonatack: Sure. One of the main justifications is that it seems like a lot of people don't want to run extra software in addition to Bitcoin Core to get block explorer functionality. If people can get that functionality easily without sacrificing privacy its a plus for me.

126

13:25

<emzy> I just read the PR and compiled the PR. But had no time to run it. Mainet data is still rsyning

127

13:25

<jnewbery> ariard: what do you think having multiprocess provides as an improvement for an address index?

128

13:25

<_andrewtoth_> as i mentioned on github, we should have ability to skip transactions, so we can page through them. I'd like to see that implemented and tested

129

13:25

<raj_> functional testing does not include any check for spends_results. It only checks consistency of creation_results.

130

13:26

<jnewbery> raj_: yes, I'd definitly like to see that added to the tests. Did anyone test it manually?

131

13:26

<_andrewtoth_> yes i tested manually and got spend results

132

13:26

<ariard> jnewbery: architectural level, you should split all indexes in its own process (like bitcoin-server) so the server would be able to serve multiple clients without encumbering operation of the consensus/p2p module

133

13:27

<ariard> so easier to maintain, smaller binary for ones not interested, easier to have its improvements pace, like if we split the wallet from the node

134

13:29

<jonatack> marcinja: Thanks. It's true there is accumulated context behind the PR, but my intuition (which can be wrong ofc) is that providing context (and links to more) in your PR description can help "sell" the PR.

135

13:29

<jnewbery> ariard: binary size is negligable (a few hundred lines of code), I don't think there's any difference in maintenance - that's more of a code separation issue than process separation

136

13:29

<nehan_> ariard: what needs to be done before one could build this PR the way you suggest? How far out is that?

137

13:30

<jnewbery> Serving RPC requests to the index doesn't take cs_main, so I don't think it encumbers operation of consensus/p2p

138

13:30

<jnewbery> and improvements pace is another code separation thing, not process separation

139

13:30

<michaelfolkson> I agree jonatack. I think jnewbery point on the modularity of Core being improved since the previous attempts is a key "sell" too

140

13:30

<jonatack> marcinja: because if the *why* isn't justified, then the *how* doesn't matter.

141

13:30

<jnewbery> (I'm not arguing against multiprocess, but I don't think it directly impacts this PR)

142

13:30

<marcinja> jonatack: That's a good point. I'll try to summarize the background better for a new PR description

143

13:32

<jonatack> marcinja: yes, and give it a bit of your personal conviction too perhaps, without being too subjective. Just a side thought, maybe not a helpful one, but I was looking for that.

144

13:32

<ariard> jnewbery: code separation let you do process separation so both of them are tied, having different processes would let them running on different machines

145

13:32

<ariard> and if you server is doing heavy rescans, not taking cpu time for the validation

146

13:33

<jnewbery> ariard: I agree those things are true, but I don't think it's a requirement for this PR

147

13:33

<jnewbery> there are lots of things that *could* be improved about Bitcoin Core architecture, but we shouldn't pull all of those into discussion of individual PRs

148

13:34

<ariard> jnewbery: I'm worried it would make it harder to split it latter, that's really IMO but if you take the wallet, more there is features, more the chain interface is stuffed, harder it make it to refactor

149

13:34

<nehan_> ariard: you can run 2 bitcoin core nodes, one with this on and one with this off, if you wanted.

150

13:35

<ariard> nehan_: hmm that's a good question, harder point would be to memory isolate indexes from the rest of the node, cut dependencies between them, and get it its own p2p stack

151

13:36

<jnewbery> I believe the only validation interface method this uses is BlockConnected, which is in the base indexer class

152

13:36

<jnewbery> It's interface with the rest of the node is minimal

153

13:36

<jnewbery> *Its

154

13:36

<ariard> nehan_: okay but know you do 2 ibds, even if the second one is pruned, seems more a hack than doing it cleanly

155

13:37

<jnewbery> Next question!

156

13:37

<jnewbery> The address index is stored in LevelDB, which is a key-value store. What is used as the key, and what is used as the value?

157

13:37

<nehan_> ariard: if one implemented a blockexplorer from scratch they'd have to do IBD anyway.

158

13:38

<ariard> jnewbery: dunno if this is that hard to split, want to try at some point

159

13:38

<nehan_> this was confusing. marcinja please add documentation :) but i think the key is scripthash, outpoint

160

13:38

<ariard> nehan_: a pruned wallet does not need ibd, you already speaking about one specific application requirement I think

161

13:39

<belcher> earlier people were saying this addrindex could be used as an electrum server, electrum's protocol uses hash(scriptPubKey) as the key for its queries, so if an electrum server implementation is desired that would probably need to be the key in core

162

13:39

<raj_> The key seems to be [type, outpoint] and value: [CDiskTxPos, Script]

163

13:39

<nehan_> belcher: hash(scriptPubKey) is the key prefix, so that works here too. How does electrum handle multiple outpoints with the same hash(scriptPubKey)?

164

13:39

<raj_> I might be wrong, i am very new to c++

165

13:40

<jnewbery> nehan_: I think it's (type | script hash | outpoint) . Is that right, marcinja?

166

13:40

<nehan_> oh yes. type, sorry, forgot.

167

13:40

<belcher> nehan_ the electrum client sends a list of its bitcoin addresses encoded as hash(spk) and the server replies with transactions on those addresses, so if an address contains multiple outputs then the server replies with all those txes

168

13:41

<jnewbery> I was confused about the outpoint being used: https://github.com/bitcoin/bitcoin/pull/14053#discussion_r363971325

169

13:41

<michaelfolkson> Type being SEED, SPENT, CREATED

170

13:41

<belcher> electrum doesnt use outpoints in its protocol, it only uses hash(spk) and transactions encoded as hex

171

13:42

<nehan_> belcher: the rpc interface doesn't use outpoints either. so I think this is compatible.

172

13:42

<marcinja> jnewbery: it's (script hash | type | outpoint)

173

13:42

<jnewbery> belcher: sorry, the question might be a bit confusing. I was asking about the internal levelDB representation. The RPC query uses the script or address to lookup in the index

174

13:43

<michaelfolkson> +1 on nehan_ docs suggestion

175

13:44

<jnewbery> marcinja: you sure? I think it's (type | scripthash | outpoint) : https://github.com/bitcoin/bitcoin/pull/14053/files#diff-11cde577be5c21f1d0cb1527c50543b3R188

176

13:44

<nehan_> jnewbery: i believe it is always the outpoint where the scriptPubKey was created. I believe this is used to get unique keys when the same scriptPubKey is used multiple times.

177

13:45

<marcinja> nehan_: that's right. leveldb doesn't support multiple values for a single key so adding the outpoint is one way to have unique keys.

178

13:46

<jnewbery> It looked to me like that part of the key (the outpoint) is included in the RPC response as txid: https://github.com/bitcoin/bitcoin/pull/14053/files#diff-01aa7d1d32f1b9e5a836c9c411978918R314

179

13:46

<jnewbery> so I *think* that for spends, the txid given is actually the txid of the transaction creating the UTXO not the txid of the transaction spending the UTXO

180

13:48

<jnewbery> ok, time's running out so let's move on to the next question: Where is MurmurHash3 used in this PR? How are hash collisions resolved? Where else is MurmurHash3 used in the Bitcoin Core codebase?

181

13:49

<marcinja> jnewbery: yes that's right, the transaction from the value should be used for that txid. clearly that needs a functional test :)

182

13:50

<marcinja> jnewbery: I think the key is serialized in the order I gave, but the constructor shows them in a different order which is definitely confusing

183

13:50

<michaelfolkson> The address index stores a copy of the CScript in case of hash collisions

184

13:51

<nehan_> jnewbery: i think you caught a bug. I'm leaving a comment.

185

13:51

<jnewbery> michaelfolkson: yes, that's right

186

13:52

<jnewbery> ok, final question. How does the address index handle reorgs?

187

13:54

<jnewbery> ok, perhaps something to look into when you all do your reviews :)

188

13:54

<jnewbery> 5 minutes left. Any final questions?

189

13:55

<michaelfolkson> What is the answer to the reorg question? :)

190

13:55

<instagibbs> anyone run benchmarks up to different heights? both benchmarking IBD and query response times?

191

13:55

<instagibbs> michaelfolkson, "hopefully" ;)

192

13:56

<jnewbery> michaelfolkson: left as an exercise for the reader

193

13:56

<emzy> I will run a benchmark with a blockchain that has already a txindex.

194

13:57

<nehan_> jnewbery: https://github.com/bitcoin/bitcoin/pull/14053/files#r364389525

195

13:57

<jnewbery> emzy: the presence/absence of a txindex shouldn't have any impact

196

13:58

<_andrewtoth_> instagibbs: just generating the addrindex with a synced node has taken 16 hours to get to 478000 blocks :/

197

13:58

<jnewbery> nehan_: thanks!

198

13:58

<_andrewtoth_> IBD with assumevalid on has taken the same machine < 6 hours

199

13:59

<jnewbery> I'm slightly unconcerned about performance of building the address index for the first time, since it only has to be done once

200

13:59

<jnewbery> I wonder how this compares to external tools like electrs

201

13:59

<_andrewtoth_> also, querying can be optimized by taking count and returning early

202

13:59

<_andrewtoth_> i commented on github

203

13:59

<jnewbery> belcher: do you know how long building an index in electrs takes, or is it doing something different?

204

14:00

<emzy> electrs will do a full sync in about 16h

205

14:00

<michaelfolkson> In this case why use MurmurHash3? Would another hash function offer better collision resistance? Understand it is a trade-off with simplicity, speed etc

206

14:00

<belcher> i dont know

207

14:00

<michaelfolkson> "good enough" collision resistance I suppose

208

14:00

<jnewbery> That's time. Thanks everyone. Hope you all enjoyed!

209

14:00

<_andrewtoth_> thanks!