The PR branch HEAD was 78f4c8b98e at the time of this review club meeting.
Notes
Historic blocks that make up the block chain are stored on disk in
blk?????.dat files. Those files each contain multiple serialized blocks,
which consist of a header followed by the transactions contained in that
block.
The transaction
index
is an optional index from transaction id (txid) to where that
transaction is stored in the block files (file and offset), for all
transactions that have been confirmed in a block. For normal operation of
the node, a txindex is not required, and by default the txindex is
disabled.
The getrawtransaction RPC is used to fetch a transaction from the node.
When the verbose argument is set to false, it returns the hex-encoded
serialized transaction. When verbose is set to true, it returns a JSON
object containing information about the transaction.
getrawtransaction can retrieve the transaction in one of the following
ways:
mempool: unconfirmed transactions can be looked up by txid in the
mempool.
by block hash: if the user provides the hash of the block that the
transaction is contained in, the block (including all transactions) is
deserialized from disk, and the txid of each of the deserialized
transactions is compared with the requested txid. If a matching
transaction is found, it is returned to the user.
using the txindex: if the node has txindex enabled, then the txid is
looked up in the transaction index to find the location of the
transaction on disk. The transaction is then deserialized from that location
and returned to the user.
If the user provides the wrong block hash, then the call to
getrawtransaction will fail.
Without looking at the code, why do you think that performance is worse
when providing a block hash (assuming txindex is enabled)?
If we’re looking up the transaction by block hash, then what are the
steps required? How do we know where to find that block on disk? How much
data is deserialized?
If we’re looking up the transaction using the txindex, how much data is
deserialized?
The first version of this PR included a behaviour
change:
when an incorrect block_index is provided to GetTransaction, but
g_txindex->FindTx(hash, hashBlock, tx) finds and returns the tx.
After this PR we would return the tx although it isn’t in the block
the user asked for.
This behaviour change was removed in a subsequent push to the PR. Do you
think the behaviour change is an improvement? Should it be included in
this PR?
How can this PR be tested? Are any new test cases required?
<jnewbery> The review club is a place for people to come and learn about the process of contributing to Bitcoin Core. Everyone is welcome to come and ask questions
<jnewbery> Notes and questions are here: https://bitcoincore.reviews/22383. I'll use the questions as prompts to guide the conversation, but feel free to jump in at any point if you have any questions or points you want to add
<jnewbery> Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK? And for those that reviewed the PR, can you briefly summarise what it's doing?
<raj_> Using txindex if enabled irrespective of weather blockhash is provided or not. But return failure provided blockhash dont match with found blockhash.
<larryruane> tested ACK -- summary: if the transaction index is enabled (`-txindex`), then use that to loop up a transaction, rather than reading a block from disk
<stickies-v> Approach ACK - it resolves unexpected performance loss where (when having a txindex) GetTransaction becomes slower when providing the optional block_index parameter
<larryruane> also, is it likely that the parts of the txindex we need are already in memory (so no disk read needed)? (But the deserialization is the most important thing)
<jnewbery> larryruane: the txindex points to the location of the transaction on the disk, so a disk read is always required for that part. You may be right about the txindex itself being in memory. I'm not familiar with the txindex code and how often it gets flushed or if there's any caching
<jnewbery> next q: If we’re looking up the transaction by block hash, then what are the steps required? How do we know where to find that block on disk? How much data is deserialized?
<theStack> murch: i guess once the block is deserialied into memory searching a tx doesn't take that long (in comparison), considering the limited number of txs in a block
<larryruane> sipa: so the txindex allows you to seek directly to the tx on disk, do a small read (relative to an entire block), deserialize only that one tx .. so that's why it's much faster
<jnewbery> but if searching for transactions in serialized blocks was a performance critical operation, then stopping as soon as you deserialized the right transaction would be a reasonable operation
<murch> I think larryruane said earlier that the txindex stored the offset of the block in the block file and the offset of the tx respective to that, so all credit to him :D
<raj_> jnewbery, yes but it seems its still reading the header and then seeking to the tx offset (or atleast thats how I am reading the code, so might be wrong)
<larryruane> TIL after it deserializes the header (`file >> header`), the file offset is immediately _after_ the header, so that's a relative seek to the transaction offset .. cool
<jnewbery> raj_: the reason is that it needs to calculate the block hash from the block header. If it didn't need to do that, the index could just store the offset of the tx from the beginning of the file
<jnewbery> alright, next question: The first version of this PR included a behaviour change: when an incorrect block_index is provided to GetTransaction, but g_txindex->FindTx(hash, hashBlock, tx) finds and returns the tx. After this PR we would return the tx although it isn’t in the block the user asked for.
<jnewbery> This behaviour change was removed in a subsequent push to the PR. Do you think the behaviour change is an improvement? Should it be included in this PR?
<murch> No, I think it shouldn't. When you explicitly filter something for a specific block it would be odd to get data back that doesn't fit the search parameters
<theStack> generally i'd argue it's more clean to separate performance optimizations and behavioural changes into different PRs, to not mix things up and lower review burden
<larryruane> jnewbery: I know you've got more questions, but if I may inject a question here: what is txindex used for; why do people sometimes enable it? One answer I can think of is: block explorers .. but maybe there are other reasons?
<jnewbery> theStack: Yes, regardless of whether it's an improvement or not, separating performance improvements from behavioural changes into different PRs is just good hygiene
<glozow> based on what the rpc docs suggest, `getrawtransaction(txid, blockhash)` means "look for this tx in this block" so if it's not in the block and a tx is returned, seems misleading
<stickies-v> raj_: I don't even think it's about performance, without txindex I don't think you can search for transactions (not in your wallet) at all?
<lopp> jnewbery: it seems to me that anyone who has txindex enabled wouldn't want to bother searching by blockhash, though it could come in handy if you are looking for transactions in orphaned blocks.
<theStack> larryruane: i think behvaioural and optimizing changes should always at least separated by commits, but generally i'd prefer different PRs; in a single PR some people (e.g. people deep into optimization, but not caring so much about behaviour) can only "half-ACK" the PR :p
<jnewbery> I'm not sure what different information would be returned from the first and second ones. What block contextual information does getrawtransaction return?
<theStack> as far as i remember the first version of the PR with behavioural changes still passed the CI; so there seem to be test coverage missing related to GetTransaction
<jnewbery> larryruane: I generally think we use the functional tests too much when a unit test would be more appropriate. In this case though, I think the functional tests are fine. This functionality involves a lot of components (rpc, validation, txindex), and we just don't have the unit test framework to mock all those pieces and isolate the behaviour we want to test
<jnewbery> Was anyone surprised that GetTransaction() is in validation.cpp? It seems to me that node/transaction.cpp would be a more appropriate place for it.
<jnewbery> seems weird that validation would call into txindex. I wonder if we remove this function, then validation would no longer need to #include txindex
<jnewbery> sipa: right, this is not meant as a criticism of course. Just wondering if we can organize things a bit more rationally now that we have better separation between things.
<lopp> I'd expect txindex would remain the same until said transaction got confirmed in a new block, at which point the txindex pointer would get updated
<jnewbery> oh, I have one more question. Now that we're all experts in GDB thanks to larry's tutorial last week, did anyone step through the code using a debugger?