Prefer to use txindex if available for GetTransaction (rpc/rest/zmq)

https://github.com/bitcoin/bitcoin/pull/22383

Host: jnewbery  -  PR author: jlopp

The PR branch HEAD was 78f4c8b98e at the time of this review club meeting.

Notes

  • Historic blocks that make up the block chain are stored on disk in blk?????.dat files. Those files each contain multiple serialized blocks, which consist of a header followed by the transactions contained in that block.

  • The transaction index is an optional index from transaction id (txid) to where that transaction is stored in the block files (file and offset), for all transactions that have been confirmed in a block. For normal operation of the node, a txindex is not required, and by default the txindex is disabled.

  • The getrawtransaction RPC is used to fetch a transaction from the node. When the verbose argument is set to false, it returns the hex-encoded serialized transaction. When verbose is set to true, it returns a JSON object containing information about the transaction.

  • getrawtransaction can retrieve the transaction in one of the following ways:

    • mempool: unconfirmed transactions can be looked up by txid in the mempool.

    • by block hash: if the user provides the hash of the block that the transaction is contained in, the block (including all transactions) is deserialized from disk, and the txid of each of the deserialized transactions is compared with the requested txid. If a matching transaction is found, it is returned to the user.

    • using the txindex: if the node has txindex enabled, then the txid is looked up in the transaction index to find the location of the transaction on disk. The transaction is then deserialized from that location and returned to the user.

  • If the user provides the wrong block hash, then the call to getrawtransaction will fail.

  • If txindex is enabled and the correct block hash is provided, then performance will be much slower than if no block hash has been provided. This PR attempts to fix that so that there is no performance penalty for providing the block hash.

Questions

  1. Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK?

  2. Without looking at the code, why do you think that performance is worse when providing a block hash (assuming txindex is enabled)?

  3. If we’re looking up the transaction by block hash, then what are the steps required? How do we know where to find that block on disk? How much data is deserialized?

  4. If we’re looking up the transaction using the txindex, how much data is deserialized?

  5. The first version of this PR included a behaviour change: when an incorrect block_index is provided to GetTransaction, but g_txindex->FindTx(hash, hashBlock, tx) finds and returns the tx. After this PR we would return the tx although it isn’t in the block the user asked for.

    This behaviour change was removed in a subsequent push to the PR. Do you think the behaviour change is an improvement? Should it be included in this PR?

  6. How can this PR be tested? Are any new test cases required?

Meeting Log

  117:00 <jnewbery> #startmeeting
  217:00 <willcl_ark> hi
  317:00 <stickies-v> hi!
  417:00 <larryruane> hi
  517:00 <theStack> hi
  617:00 <jnewbery> Hi folks! Welcome to Bitcoin Core PR Review Club
  717:00 <raj_> hi
  817:00 <emzy> hi
  917:00 <dariusp> hi!
 1017:00 <glozow> hi!
 1117:01 <jnewbery> The review club is a place for people to come and learn about the process of contributing to Bitcoin Core. Everyone is welcome to come and ask questions
 1217:01 <jnewbery> Feel free to say hi to let people know you're here
 1317:01 <jnewbery> Anyone here for the first time?
 1417:01 <lopp> First time participant, reporting for duty!
 1517:01 <Sachin> hi
 1617:01 <jnewbery> lopp: welcome!!
 1717:02 <Azorcode> Hello everyone
 1817:02 <jnewbery> Notes and questions are here: https://bitcoincore.reviews/22383. I'll use the questions as prompts to guide the conversation, but feel free to jump in at any point if you have any questions or points you want to add
 1917:02 <sanket1729_> Hi
 2017:03 <jnewbery> Who had a chance to review the PR this week? (y/n)
 2117:03 <raj_> y
 2217:03 <unplanted> hi
 2317:03 <larryruane> y
 2417:03 <stickies-v> y
 2517:03 <neha> hi
 2617:03 <glozow> 0.5y
 2717:03 <theStack> y
 2817:03 <willcl_ark> y
 2917:03 <pglazman> y
 3017:03 <unplanted> n
 3117:03 <dariusp> n
 3217:03 <emzy> 0,5y
 3317:04 <jnewbery> Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK? And for those that reviewed the PR, can you briefly summarise what it's doing?
 3417:04 <Murch[m]> N
 3517:04 <Murch[m]> Well sort of
 3617:05 <raj_> Using txindex if enabled irrespective of weather blockhash is provided or not. But return failure provided blockhash dont match with found blockhash.
 3717:05 <larryruane> tested ACK -- summary: if the transaction index is enabled (`-txindex`), then use that to loop up a transaction, rather than reading a block from disk
 3817:06 <theStack> code-review ACK
 3917:06 <jnewbery> raj_ larryruane: correct!
 4017:06 <stickies-v> Approach ACK - it resolves unexpected performance loss where (when having a txindex) GetTransaction becomes slower when providing the optional block_index parameter
 4117:06 <jnewbery> stickies-v: correct
 4217:06 <jnewbery> 2. Without looking at the code, why do you think that performance is worse when providing a block hash (assuming txindex is enabled)?
 4317:07 <stickies-v> because the entire block instead of just the transaction gets deserialized
 4417:07 <raj_> De-serialization of the entire block takes more time than finding via txindex?
 4517:07 <jnewbery> (and if you've reviewed the PR, you can ignore the part about not looking at the code)
 4617:07 <theStack> the txindex is not used in this case, and the slow path of deserializing the whole block from disk is taken
 4717:08 <jnewbery> stickies-v raj_: I think you're right. Reading the block from disk and deserializing is probably what's causing the delay.
 4817:08 <jnewbery> theStack: right
 4917:08 <larryruane> also, is it likely that the parts of the txindex we need are already in memory (so no disk read needed)? (But the deserialization is the most important thing)
 5017:08 <unplanted> jnewbery would it be appropriate/relevant to visualize the slowdown with a fireplot?
 5117:08 <jnewbery> Did anyone do any profiling to see how long deserialization of a block takes?
 5217:09 <stickies-v> didn't run any code, not on my dev machine...
 5317:10 <sipa> jnewbery: that's also what i assume
 5417:10 <jnewbery> larryruane: the txindex points to the location of the transaction on the disk, so a disk read is always required for that part. You may be right about the txindex itself being in memory. I'm not familiar with the txindex code and how often it gets flushed or if there's any caching
 5517:11 <murch> I mean, don't you also need to linearly search through the whole transaction set of the block if you're looking it up from the block?
 5617:11 <jnewbery> next q: If we’re looking up the transaction by block hash, then what are the steps required? How do we know where to find that block on disk? How much data is deserialized?
 5717:12 <larryruane> jnewbery: the block index (also leveldb, like txindex) stores the blk file number and byte offset of the start of the block on disk
 5817:12 <larryruane> it's indexed by block hash
 5917:13 <jnewbery> larryruane: I would assume that it's indexed by txid if it's a transaction index
 6017:13 <larryruane> murch: great point, may have to scan over a thousand transactions (on average) to find it
 6117:13 <sipa> the txindex stores the file number, begin offset of that block, and begin offset of that tx
 6217:13 <sipa> iirc
 6317:13 <larryruane> jnewbery: no the block index
 6417:13 <unplanted> when I said fireplot I meant flame graphs ':]
 6517:14 <jnewbery> larryruane: the txindex is an index from txid -> (file, block pos, tx offset) , no?
 6617:15 <jnewbery> so you can look things up by txid
 6717:15 <theStack> murch: i guess once the block is deserialied into memory searching a tx doesn't take that long (in comparison), considering the limited number of txs in a block
 6817:15 <larryruane> jnewbery: yes, I thought you were asking about having the block hash but not the txindex
 6917:16 <jnewbery> ooooh sorry I misread your earlier message!
 7017:17 <jnewbery> yes, you're right about the block index being an index on the block hash
 7117:17 <larryruane> sipa: so the txindex allows you to seek directly to the tx on disk, do a small read (relative to an entire block), deserialize only that one tx .. so that's why it's much faster
 7217:17 <sipa> right
 7317:17 <jnewbery> back to the question: what are the steps currently when looking up a transaction with a provided block hash?
 7417:18 <jnewbery> it's kind of been answered already
 7517:18 <jnewbery> we deserialize the entire block, and then scan through each transaction looking for a match: https://github.com/bitcoin/bitcoin/blob/a3791da0e80ab35e862989373f033e5be4dff26b/src/validation.cpp#L1163-L1170
 7617:19 <murch> Well, you find the block, deserialize it and then look through the transactions until you find the one in question
 7717:19 <jnewbery> murch: exactly right
 7817:19 <glozow> get block from disk, deserialize it, deserialize all the transactions and compare their txids with the one requested until we find it
 7917:19 <larryruane> glozow: I think deserializing the block implies deserializing all the transaction it contains (right?)
 8017:19 <raj_> Find the block, find the transaction, see if txid matches, get back the blockhash and tx.
 8117:19 <glozow> larryruane: ya sorree
 8217:19 <jnewbery> theoretically you could stop deserializing part way through if you found a tx with the right txid, but we don't have a way of doing that
 8317:20 <unplanted> jnewbery we don't have a method to do that, or we haven't implemented one?
 8417:20 <jnewbery> but yes, it involves deserializing 1-2MB on average from disk into CBlock and CTransaction objects
 8517:21 <jnewbery> unplanted: there's no method to do that. I'm not suggesting that it'd be a good idea, but it's at least theoretically possible
 8617:22 <jnewbery> but if searching for transactions in serialized blocks was a performance critical operation, then stopping as soon as you deserialized the right transaction would be a reasonable operation
 8717:22 <jnewbery> *reasonable optimization
 8817:22 <jnewbery> ok, next question: If we’re looking up the transaction using the txindex, how much data is deserialized?
 8917:23 <murch> Since we know the offset of the tx data in the block, only the transaction itself
 9017:23 <raj_> The block header and the transaction only?
 9117:23 <jnewbery> murch: right
 9217:23 <larryruane> murch: i agree
 9317:23 <jnewbery> raj_: I don't think we even need to deserialize the block header, do we?
 9417:24 <murch> I think larryruane said earlier that the txindex stored the offset of the block in the block file and the offset of the tx respective to that, so all credit to him :D
 9517:24 <larryruane> jnewbery: I think that's correct -- with txindex, you don't even know which block the tx belongs to
 9617:24 <unplanted> jnewbery txindex is providing the offsets, so no I think
 9717:24 <larryruane> murch: no i was referring to the block index
 9817:24 <murch> Oh, well then I just guessed well
 9917:24 <larryruane> murch: oh it does?? i was accidentally right! haha
10017:25 <murch> So would the txindex have the offset per blockfile?
10117:25 <sipa> yes, it has the offset within the blk*.dat file where the tx starts
10217:25 <raj_> jnewbery, yes but it seems its still reading the header and then seeking to the tx offset (or atleast thats how I am reading the code, so might be wrong)
10317:25 <sipa> (and also the offset within that file where the block it contains starts)
10417:26 <murch> So, it knows where the corresponding block starts, but does not explicitly store which block that is?
10517:26 <jnewbery> raj_: I take it back. You're right!
10617:26 <jnewbery> https://github.com/bitcoin/bitcoin/blob/a3791da0e80ab35e862989373f033e5be4dff26b/src/index/txindex.cpp#L242-L251
10717:27 <Sachin> ```if (g_txindex->FindTx(hash, block_hash, tx)) {
10817:27 <Sachin> if (!block_index || block_index->GetBlockHash() == block_hash) {
10917:27 <Sachin> hashBlock = block_hash;
11017:27 <Sachin> return tx;
11117:27 <Sachin> }
11217:27 <Sachin> }
11317:27 <Sachin> does this code not return the blockhash of the tx?
11417:27 <Sachin> or bind it to a pointer that is available?
11517:27 <theStack> ah, TxIndex::FindTx() uses the header to find out and set the block hash
11617:27 <jnewbery> Sachin: please don't paste code into the chat. You can paste a link to the code on github
11717:28 <Sachin> my bad, thank you
11817:28 <jnewbery> no problem! It just gets a bit noisy if we paste code in here
11917:28 <raj_> Sachin, I think the hash is returned in that `hashBlock` variable.
12017:29 <larryruane> TIL after it deserializes the header (`file >> header`), the file offset is immediately _after_ the header, so that's a relative seek to the transaction offset .. cool
12117:29 <jnewbery> Right, FindTx() returns the block hash, which it gets from deserializing the header and hashing it
12217:29 <raj_> jnewbery, is the reason that it still reads the block header is that tx offset starts counting after the header?
12317:29 <Sachin> yeah, that's what I thought but I wanted to confirm. So the user can still determine the blockhash after this call
12417:30 <glozow> right: https://github.com/bitcoin/bitcoin/blob/54e31742d208eb98ce706aaa6bbd4b023f42c3a5/src/index/txindex.cpp#L255
12517:30 <sipa> murch: correct, i think (re: "does not explicitly store which block that is")
12617:30 <murch> sipa: thanks
12717:31 <jnewbery> raj_: the reason is that it needs to calculate the block hash from the block header. If it didn't need to do that, the index could just store the offset of the tx from the beginning of the file
12817:31 <jnewbery> alright, next question: The first version of this PR included a behaviour change: when an incorrect block_index is provided to GetTransaction, but g_txindex->FindTx(hash, hashBlock, tx) finds and returns the tx. After this PR we would return the tx although it isn’t in the block the user asked for.
12917:31 <jnewbery> This behaviour change was removed in a subsequent push to the PR. Do you think the behaviour change is an improvement? Should it be included in this PR?
13017:33 <raj_> It doesn't seem to be an improvement to me. If the user provided wrong blockhash its better to notify that.
13117:33 <murch> No, I think it shouldn't. When you explicitly filter something for a specific block it would be odd to get data back that doesn't fit the search parameters
13217:33 <theStack> generally i'd argue it's more clean to separate performance optimizations and behavioural changes into different PRs, to not mix things up and lower review burden
13317:33 <murch> If the user doesn't know which block to expect it in, they shouldn't/wouldn't restrict the search to that block.
13417:34 <larryruane> jnewbery: I know you've got more questions, but if I may inject a question here: what is txindex used for; why do people sometimes enable it? One answer I can think of is: block explorers .. but maybe there are other reasons?
13517:34 <murch> It could perhaps give a different error message to be more helpful to the user though.
13617:34 <murch> "Tx found in a different castle"
13717:35 <jnewbery> larryruane: great question! Anyone have any thoughts about that? Why do people enable txindex?
13817:35 <larryruane> murch: ".. it would be odd to .." I agree completely
13917:35 <willcl_ark> It's easier to look up transactions which aren't related to your wallet, on your local node?
14017:35 <larryruane> obviously not for validity checking, or txindex wouldn't be optional!
14117:35 <raj_> To quickly find lots of transaction via txid?
14217:36 <jnewbery> theStack: Yes, regardless of whether it's an improvement or not, separating performance improvements from behavioural changes into different PRs is just good hygiene
14317:36 <murch> jnewbery: To run additional services on top of bitcoin
14417:36 <glozow> based on what the rpc docs suggest, `getrawtransaction(txid, blockhash)` means "look for this tx in this block" so if it's not in the block and a tx is returned, seems misleading
14517:36 <pglazman> In the case of duplicate TxIDs that BIP30 addresses, would providing a block_hash help filter the expected transaction?
14617:36 <larryruane> theStack: what's your opinion about separating those into different commits but still same PR? I've seen that done (i think)
14717:36 <stickies-v> raj_: I don't even think it's about performance, without txindex I don't think you can search for transactions (not in your wallet) at all?
14817:36 <jnewbery> murch raj_: I agree. Failing the request but returning a more specific error seems optimal
14917:37 <jnewbery> pglazman: great question! Has anyone tried using getrawtransaction with either of the duplicate txids?
15017:37 <raj_> stickies-v, ya makes sense, unless the user also happens to know the blockhash.
15117:37 <jnewbery> either with or without the txindex enabled
15217:37 <theStack> glozow: +1
15317:38 <sipa> pglazman: i believe it will only find the latter transaction copy
15417:38 <sipa> because the later one overwrites the former one
15517:38 <sipa> (when using txindex)
15617:38 <raj_> jnewbery, without txindex we would only get mempool txs right?
15717:38 <larryruane> sipa: yes because leveldb isn't a multimap (can it even be?)
15817:39 <sipa> no
15917:39 <sipa> it's a key-value store, and the key is the txid
16017:39 <jnewbery> does this PR prevent us from accessing the first one if txindex is enabled, even when providing the correct block hash?
16117:39 <larryruane> sipa: got it, thanks
16217:39 <lopp> jnewbery: it seems to me that anyone who has txindex enabled wouldn't want to bother searching by blockhash, though it could come in handy if you are looking for transactions in orphaned blocks.
16317:39 <sipa> jnewbery: i suspect so
16417:40 <sipa> i'm not sure we care for those two historical oddities...
16517:40 <theStack> larryruane: i think behvaioural and optimizing changes should always at least separated by commits, but generally i'd prefer different PRs; in a single PR some people (e.g. people deep into optimization, but not caring so much about behaviour) can only "half-ACK" the PR :p
16617:40 <jnewbery> I'm not sure what different information would be returned from the first and second ones. What block contextual information does getrawtransaction return?
16717:41 <larryruane> theStack: +1
16817:41 <murch> Just the blockheight maybe?
16917:41 <jnewbery> sipa: agree that in practice, we probably don't care. It's a good test case though
17017:41 <murch> and the hash of the block it was included in
17117:41 <murch> The tx itself would obviously be immutable, since if it had been malleated, it would have a different txid
17217:42 <glozow> jnewbery: also # confirmations and block time if verbose=true, it seems
17317:42 <larryruane> for us newbies, are there 2 tx on the mainnet blockchain with the same txids? (but i assume different wtxids) And only 2?
17417:42 <jnewbery> If anyone is scratching their heads about "multiple transactions with the same txid", it's all explained in BIP 30: https://github.com/bitcoin/bips/blob/master/bip-0030.mediawiki
17517:43 <sipa> larryruane: yes, exactly two; no more, no less
17617:43 <jnewbery> larryruane: 2 pairs, so 4 transactions
17717:43 <sipa> oh right, 2 times 2
17817:43 <jnewbery> or 2 transactions, depending on how you define transaction :)
17917:44 <jnewbery> There's one more question: How can this PR be tested? Are any new test cases required?
18017:44 <Sachin> larryruane https://github.com/bitcoin/bips/blob/master/bip-0030.mediawiki
18117:44 <raj_> sipa, Where can i read more about these 2 and how/why the occurred?
18217:44 <Sachin> raj_ see my link
18317:44 <theStack> raj_: see jnewbery's link to BIP30 above
18417:45 <murch> Yeahhttps://bitcoin.stackexchange.com/a/75301/5406
18517:45 <raj_> Sachin, theStack Thanks..
18617:46 <jnewbery> any thoughts about testing?
18717:46 <theStack> as far as i remember the first version of the PR with behavioural changes still passed the CI; so there seem to be test coverage missing related to GetTransaction
18817:46 <theStack> jonatack opened a PR with that
18917:46 <raj_> Jonatack has wrote some functional functional tests, yet to go through them..
19017:46 <lopp> this PR seemed to reveal a lack of testing around expected behavior of this function
19117:46 <larryruane> jnewbery: There's another PR to improve the functional testing of this RPC, https://github.com/bitcoin/bitcoin/pull/22437
19217:47 <theStack> lopp: +1
19317:47 <murch> lopp: Did you ever find out what caused the functional test hiccough?
19417:47 <jnewbery> lopp: I agree! It's always a shame when you change behaviour and nothing breaks
19517:47 <larryruane> jnewbery: Seems like RPCs like this should be unit-testable, but `src/test/rpc_tests.cpp` seems to mostly test just argument processing
19617:48 <sipa> damn tests, they're terrible. their sole purpose in life is to *fail* at the right time, and they can't even do that correctly
19717:48 <lopp> murch: oddly enough, the first few commits kept failing and then suddenly started working despite my changes being cosmetic
19817:48 <murch> Right, even though I could even reproduce the test failure locally
19917:49 <jnewbery> our rpc_tests unit tests are very limited
20017:49 * murch scratches head
20117:49 <larryruane> jnewbery: but could they be much better?
20217:49 <lopp> I assumed it was a CI issue because my tests passed locally
20317:49 * unplanted makes a test of a test
20417:49 <larryruane> (i'm not sure if they provide enough framework to be able to do more)
20517:50 <jnewbery> larryruane: I generally think we use the functional tests too much when a unit test would be more appropriate. In this case though, I think the functional tests are fine. This functionality involves a lot of components (rpc, validation, txindex), and we just don't have the unit test framework to mock all those pieces and isolate the behaviour we want to test
20617:50 <larryruane> I think some of the refactoring going on (such as eliminating globals) will make RPCs more unit-testable ... (?)
20717:50 <murch> sipa: Yeah, we should really get rid of them on grounds of underperforming
20817:51 <larryruane> :laugh:
20917:51 <murch> Although I'd say we should look into hiring replacements
21017:51 <jnewbery> larryruane: yes, there's ongoing work to elimate globals which would make all of our components more testable. It's really important work
21117:51 <murch> lopp: Mh, they did fail for me locally as well.
21217:52 <murch> I didn't try again after they passed in the build system, tho
21317:52 <jnewbery> Was anyone surprised that GetTransaction() is in validation.cpp? It seems to me that node/transaction.cpp would be a more appropriate place for it.
21417:53 <raj_> jnewbery, +1
21517:53 <stickies-v> agreed!
21617:54 <glozow> jnewbery ya
21717:54 <jnewbery> seems weird that validation would call into txindex. I wonder if we remove this function, then validation would no longer need to #include txindex
21817:54 <sipa> GetTransaction predates node/transaction.cpp, and even the generic index framework itself :)
21917:55 <sipa> (before 0.8, validation itself used the txindex)
22017:55 <jnewbery> (and GetTransaction() seems like a natural sibling to BroadcastTransaction(), which is already in node/transaction.cpp)
22117:55 <jnewbery> sipa: right, this is not meant as a criticism of course. Just wondering if we can organize things a bit more rationally now that we have better separation between things.
22217:55 <sipa> jnewbery: sure, just providing background
22317:56 <sipa> seems very reasonable to move it elsewhere now
22417:56 <jnewbery> ok, any other questions/thoughts before we wrap this up?
22517:56 <larryruane> one small thing,
22617:57 <larryruane> does rolling back the blockchain (reorg) need to update txindex?
22717:57 <jnewbery> larryruane: great question!
22817:57 <larryruane> i guess it must need to.. maybe using the rev files?
22917:57 <murch> larryruane: I would certainly expect that it does!
23017:57 <sipa> it doesn't need to
23117:57 <sipa> it can report transactions in non-active chains, as long as the transaction wasn't re-confirmed in the main chain
23217:58 <larryruane> oh that's cool!
23317:58 <murch> sipa: Right, so wouldn't it weird to get a tx with a blockhash that might not be confirmed in the best chain?
23417:58 <lopp> I'd expect txindex would remain the same until said transaction got confirmed in a new block, at which point the txindex pointer would get updated
23517:58 <sipa> murch: eh, maybe :)
23617:58 <glozow> murch: getrawtransaction tells you if it's in the active chain or not
23717:58 <murch> mh
23817:58 <murch> TIL
23917:59 <sipa> i assume it's pretty much unused functionality, but it has always worked that way
24017:59 <jnewbery> oh, I have one more question. Now that we're all experts in GDB thanks to larry's tutorial last week, did anyone step through the code using a debugger?
24117:59 <larryruane> i did! haha
24217:59 <jnewbery> larryruane: 🥇
24318:00 <raj_> I plan to with jonatack's test tomorrow. :)
24418:00 <jnewbery> Oh, there are also some gdb notes from last week that larry sent me, which I'll add to last week's meeting page
24518:00 <theStack> i'm still being a noob in GDB. is there a record of larryruane's video tutorial available?
24618:00 <theStack> jnewbery: ah, that sounds useful
24718:01 <stickies-v> theStack it's in the meeting notes: https://bitcoincore.reviews/22350
24818:01 <jnewbery> theStack: video is linked to from https://bitcoincore.reviews/22350
24918:01 <jnewbery> stickies-v: thanks!
25018:01 <larryruane> I forgot to record last week, but I re-did it on my own (it's a lot better)
25118:01 <jnewbery> ok, that's time. Thank you everyone. Next week, glozow is hosting
25218:01 <jnewbery> #endmeeting