Properly rebroadcast unconfirmed transaction chains (wallet)

https://github.com/bitcoin/bitcoin/pull/25768

Host: glozow  -  PR author: achow101

The PR branch HEAD was e40ddf36be at the time of this review club meeting.

Notes

  • The node has multiple local clocks, including the system clock and a mockable NodeClock.

  • The setmocktime regtest-only RPC sets the clock to a specific time in the past or future; the time does not continue to tick forward after that. This allows us to test node behaviors that happen over a long period of time, such as peer disconnection after timeouts.

  • The CScheduler maintains a map from time points to background tasks that should be called periodically or sometime in the future. It runs in its own thread and is used for a variety of jobs such as periodically checking if we should find new peers or rebroadcasting wallet transactions.

  • The mockscheduler regtest-only RPC ā€œfast-forwardsā€ scheduled jobs by making each time point sooner. It does not modify the nodeā€™s clock.

  • The wallet periodically rebroadcasts its transactions that havenā€™t been mined yet. ResendWalletTransactions() is a task scheduled every 1000milliseconds. The task itself compares the nodeā€™s current (mockable) time and the walletā€™s randomly chosen nNextResend time to decide whether or not itā€™s time to call SubmitTxMemoryPoolAndRelay(). As such, to trigger a rebroadcast, the clock must pass two time points: CWallet::nNextResend and the time at which the scheduler has MaybeResendWalletTxs() task set for.

  • PR #25768 addresses an issue in which the wallet rebroadcasts multiple transactions with dependency relationships. It must submit a parent transaction before its child, otherwise the mempool will reject it due to missing inputs.

  • Prior to this PR, the transactions are rebroadcast in whichever order they appear in CWallet::mapWallet, which is a std::unordered_map.

  • This PR reuses the logic in ReacceptWalletTransactions() to sort mapWallet transactions by CWalletTx::nOrderPos.

Questions

  1. Did you review the PR? What was your review approach?

  2. Did you leave a Concept ACK, approach ACK, tested ACK, or NACK on the PR?

Concept

  1. What incorrect behavior does this PR address?

  2. Were you able to reproduce the issue? Does the test added in the PR adequately test this behavior, i.e., does it succeed with this fix but fail without it?

  3. Why is it difficult to write a test that reliably demonstrates that transactions may be rebroadcast in the wrong order? How does std::unordered_map order its elements?

Approach

  1. Can you think of any other methods of getting the wallet transactions sorted in this order?

  2. Would it make sense to keep the wallet transactions sorted in this order? Why is mapWallet implemented as a std::unordered_map; would another data structure make sense?

  3. If you thought of an alternative approach, how does it compare? What is its runtime complexity? How much {more, less} memory is allocated? Does it require a lot of code to be (re)written? Is its complexity appropriate for this purpose?

Implementation

  1. How is GetSortedTxs(), the function extracted from ReacceptWalletTransactions(), implemented? How much memory is needed, as a function of the transactions in mapWallet?

  2. GetSortedTxs() is declared as requiring the wallet lock to already be held. Why or why not does this make sense?

  3. A structured binding declaration is used in two places in GetSortedTxs(). What are the types of each variable (wtxid, wtx, _, and wtx) and what do they represent? Why or why not is this better than declaring the types explicitly?

  4. Can you find any behavior changes to ReacceptWalletTransactions(), or is it a move-only code change?

  5. The test calls both setmocktime and mockscheduler to trigger a rebroadcast. What is the difference between these two calls, and why or why not is it necessary to call both of them?

  6. The test ā€œevictsā€ the transactions from the mempool by calling setmocktime. Why is this necessary, and why does it work?

Meeting Log

  117:00 <glozow> #startmeeting
  217:00 <stickies-v> hi
  317:00 <glozow> hi everyone! if you're looking for Bitcoin Core PR Review Club, you're in the right place. feel free to say hi :)
  417:00 <pablomartin> hello!
  517:00 <raj> hi..
  617:00 <brunoerg> hi
  717:00 <BlueMoon> Hello
  817:00 <hernanmarino> Hi
  917:01 <larryruane_> hi
 1017:01 <glozow> we're looking at a wallet PR today, "Properly rebroadcast unconfirmed transaction chains." Notes in the usual place https://bitcoincore.reviews/25768
 1117:01 <kevkevin> hi
 1217:01 <effexzi> Hi every1
 1317:01 <ishaanam[m]> hi
 1417:01 <lightlike> hi
 1517:01 <glozow> Did everybody get a chance to review the PR and/or look at the notes?
 1617:02 <juancama> hi
 1717:02 <hernanmarino> yes
 1817:02 <raj> y
 1917:02 <brunoerg> y
 2017:02 <juancama> y
 2117:02 <stickies-v> y
 2217:02 <kevkevin> y
 2317:02 <ishaanam[m]> y
 2417:02 <glozow> wonderful! for those of you who reviewed the PR, what was your review approach?
 2517:03 <pablomartin> I have reviewed the code, compiled it but still need to do some testing
 2617:03 <stickies-v> so far just code review and understanding what the functional test was trying to catch
 2717:03 <raj> Read through the code.. Makes sense of the logic.. Compile it.. Run the test.. Try to understand if testing covering the situation intended..
 2817:04 <kevkevin> Reviewd the code ran the functional tests but still need to do some manualy testing, still understanding the repo aswell as im still new here
 2917:04 <raj> Trying to recreate the bug would be cool.. But I am guessing theres no easy way to ensure child before parents in `mapWallet`?
 3017:04 <ishaanam[m]> I mainly looked at the first commit for any changes in behavior and made sure the added test fails on master sometimes
 3117:04 <hernanmarino> tested succesfully and read the code. I still have to dive deeply in the test code to think about alternatives and perhaps improvements
 3217:05 <larryruane_> I'm running the functional test (wallet_resendwallettransactions.py) in the debugger, and watching the node's `debug.log` file ... but I haven't really got to the new test code that this PR adds, still trying to understand the first (existing) part of the test
 3317:06 <glozow> great! glad to hear people are using the test to reproduce the issue. so what exactly is the incorrect behavior this PR addresses?
 3417:06 <larryruane_> It's amazing what the functional test framework can do, lots to learn just about that!
 3517:06 <larryruane_> (things like how the functional test sets itself up as a peer, for example)
 3617:08 <raj> I am guessing its about the situation where a child appears in `mapWallet` before its parent, and thus at the time of rebroadcasting the child would get rejected in mempoolcheck if mempool doesn't have the parent..
 3717:08 <stickies-v> `ResendWalletTransactions()` rebroadcasts transactions based on txid instead of their sequential order, which leads to child transactions not being broadcast if their parents txid is higher
 3817:08 <larryruane_> I think the problem is that if we (our node) broadcasts 2 transactions, A and B, and let's say B is a child of A (B spends an output of A), then if we happen to broadcast tx B first, then A, then our peers won't accept B into their mempool because it's not valid
 3917:08 <Zaidan> ResendWalletTranactions will fail to rebroadcast child transactions if their txids happen to be lexicographically less than their parent's txid (copy and paste)
 4017:09 <Zaidan> and man is this ever a subtle bug wow
 4117:09 <achow101> The lexicographically part is no longer true as mapWallet is now an std::unsorted_map instead of a normal std::map
 4217:09 <juancama> 50% of the time theĀ child is placed before the parent in mapWallet
 4317:09 <achow101> now it's basically just random of which will appear first
 4417:09 <glozow> raj: stickies-v: larryruane_: Zaidan: good answers! indeed, the problem is rebroadcasting children before parents. the order is not "by txid" or "lexicographical" though, as it's a `std::unordered_map`
 4517:10 <glozow> thanks achow101
 4617:10 <larryruane_> in general, is there a long-term effort to replace `std::map`s with unordered maps where possible? aren't the latter considered more efficient?
 4717:11 <larryruane_> (i'm not sure if that's space or time, or both)
 4817:11 <sipa> they're a tiny bit more memory efficient, but not much
 4917:11 <achow101> larryruane_: for me, yes. I've been trying to use unsorted_map and unsorted_set where possible
 5017:12 <glozow> larryruane_: i think it depends on the use case. and same here, I tend to use unsorted when possible
 5117:12 <sipa> for sufficiently large size, unordered_map is faster, due to O(1) lookups, but they're also more complicated (you need a hash function, possibly a salted one depending on whether it can be under attacker influence) rather than just a comparator
 5217:14 <glozow> good thing we have a `SaltedTxidHasher`
 5317:14 <glozow> I think we've just answered the next question, "Why is it difficult to write a test that reliably demonstrates that transactions may be rebroadcast in the wrong order?"
 5417:14 <glozow> i.e. because it's implemented as a hashmap. can't really guarantee which bucket keys are going to be in
 5517:15 <Zaidan> Don't answer this if it is out of scope I'm new: but why are these UTXO's transactions all seperate objects, and not wrapped up in a tree structure where their depth order can be easily maintained and referenced? It seems maintaining tx order is a pain.
 5617:15 <raj> Because its not possible to enforce a child will always appear before the parent in the mapWallet..
 5717:15 <larryruane_> because the test can't control all the randomness that it would need to in order to make the unordered_map iteration order predictable?
 5817:15 <sipa> Zaidan: the transaction graph is generally a DAG, not just a tree
 5917:15 <pablomartin> glozow: yeah perhaps we could force the rebroadcast in wrong order
 6017:16 <sipa> plus, we do need a way to access individual transactions by txid
 6117:16 <sipa> so just keeping them in a tree structure wouldn't work
 6217:16 <larryruane_> I'm not even sure, is `std::unordered_map` iteration deterministic for exactly the same inputs?
 6317:16 <glozow> i did wonder if we're going to end up with a boost multi index container rewrite for map wallet, heh
 6417:16 <Zaidan> sipa: ah ty
 6517:16 <sipa> larryruane_: no, depends on insertion order, even if the hash function is the same
 6617:17 <larryruane_> ok yes but I meant if insertion order is the same
 6717:17 <sipa> (and you don't want identical hash functions)
 6817:17 <glozow> next question is about the functional test: Does the test added in the PR adequately test this behavior, i.e., does it succeed with this fix but fail without it?
 6917:17 <raj> Isn't there any way to manually enforce the wallet to take to store the the child tx before the parent tx?
 7017:18 <raj> *manually enforce the wallet to stor
 7117:18 <sipa> raj: iteration order is determined by the map structure (std::map, or std::unordered_map), you can't control it.
 7217:19 <raj> Ah.. Okay..
 7317:19 <glozow> raj: no, and I don't think it's necessary
 7417:19 <hernanmarino> glozow: didn't test on master, but I'm sure it fails randomnly
 7517:19 <sipa> with a boost::multiindex you could have multiple indexes (one insertion ordered, another txid-based) simultaneously... but indeed I don't think that's useful.
 7617:20 <stickies-v> after `git revert 521169f2428be8e78599aa4fcb96f7ada7bb7e04` (and recompiling) the functional test does indeed fail again
 7717:20 <ishaanam[m]> On master the test fails around 50% of the time for me, though it would be nice if we had a higher chance of failing without the fix. I think glozow left a comment on the PR that would help with this?
 7817:20 <sipa> raj: Point is that the purpose of this data structure is lookup things based on txid - the iteration order follows that structure
 7917:20 <larryruane_> glozow: i didn't actually try it, but there's a great comment added to the functional test that explains that the new test will fail only about half of the time
 8017:21 <glozow> stickies-v: ishannam: wonderful, ā­ for you
 8117:21 <raj> I am seeing test failures too in the new test.. And I am not sure why that should happen..
 8217:21 <glozow> raj: hernanmarino: you're both seeing the test fail *with* the PR's changes?
 8317:21 <stickies-v> I don't think the new test is supposed to fail at all? without the fix it's meant to fail 50% of the time, I think
 8417:21 <raj> Yes.. Though I need to double check..
 8517:21 <larryruane_> I have a question, would it be possible to run this entire new part of the functional test a few times (within a single run of the functional test)? To prevent CI from passing when it shouldn't?
 8617:21 <glozow> yes the new test should pass every time with the changes
 8717:22 <pablomartin> stickies-v +1
 8817:22 <glozow> larryruane_: yes, that should be pretty simple. just do 10 parent-child pairs.
 8917:22 <larryruane_> I'm just imagining that a future change breaks this, and the dev runs the test suite (or CI does), and it passes, so the dev thinks it's okay
 9017:22 <larryruane_> glozow: +1
 9117:23 <lightlike> I wonder if that's actually 50%. I guess it's not a good idea to build a RNG based on the order of an unordered map...
 9217:23 <glozow> ok if you're seeing the test fail with the PR changes, that shouldn't happen, so let us know where it's failing...
 9317:23 <glozow> i'm going to continue with the questions
 9417:23 <glozow> Can you think of any other methods of getting the wallet transactions sorted in this order?
 9517:24 <glozow> Would it make sense to keep the wallet transactions sorted in this order?
 9617:24 <Zaidan> Wouldn't we want to depth sort?
 9717:24 <hernanmarino> glozow: No , within the PR it always succeeds
 9817:24 <glozow> hernanmarino: ah okay! that's good
 9917:25 <stickies-v> with std::sort you just need to allocate an additional vector instead of an additional vector and a map, but there's probably some downside to it I don't understand?
10017:26 <achow101> Zaidan: depth stops being relevant pretty quickly
10117:27 <Zaidan> achow101: okay ty
10217:28 <glozow> stickies-v: indeed, I think this could perhaps just be a std::sort where the comparison function uses `nOrderPos`
10317:28 <stickies-v> yeah it makes for quite a bit less code: https://pastebin.com/1Hxp4YRM
10417:30 <larryruane_> stickies-v: +1 seems like a good idea!
10517:31 <larryruane_> Just to confirm, `nOrderPos` is the order transactions entered the wallet, and that's guaranteed to be "parent before child"
10617:32 <larryruane_> it's a 64-bit integer so we probably assume it will never wrap around
10717:32 <glozow> larryruane_: AFAIK yes, parents always inserted before children. I suppose more precise would be if they cached their ancestor counts or something, but can't imagine why we wouldn't insert a parent before child.
10817:32 <glozow> well, 2^64 transactions would be a lot of transactions...
10917:33 <achow101> larryruane_: I think it is a reasonable assumption. I believe it is possible to mess with it using importprunedfunds though, however that only works on confirmed txs anyways
11017:33 <sipa> we're not even at 2^30 transactions on-chain
11117:34 <glozow> ok. I also wonder if filtering should happen before sorting
11217:34 <larryruane_> yes I'm not saying there's anything wrong, just understanding why it's okay :)
11317:36 <glozow> i.e. `transactions.filter(should_rebroadcast).sort(insertion_order)`
11417:36 <larryruane_> glozow: I was wondering about that, `mapWallet` contains confirmed transactions, right? so it could be pretty large (so we're adding many entries to the temporary map in `GetSortedTxs()`)?
11517:37 <larryruane_> (and there would be a large `out` vector that's returned?)
11617:38 <ishaanam[m]> glozow: yeah, it looks like the main reason that the filtering is done after is because both of the functions that call GetSortedTxs() do different kinds of filtering? why is that?
11717:38 <glozow> indeed. I'll say the inefficiency is probably not noticeable to users since this is only happing once a day, and there *usually* aren't that many transactions in mapWallet, but yes.
11817:39 <glozow> ishaanam: good question, now is a good time to ask: What's the difference between `ResendWalletTransactions` and `ReacceptWalletTransactions`?
11917:41 <Zaidan> Reaccept has to have a lock on the wallet
12017:42 <stickies-v> in terms of usage: resend is triggered ~once per day, reaccept is used every time a wallet is loaded
12117:42 <raj> `Resend` sends all the unconfirmed tx 5 mins before the last known block.. `ReAccept` sends all the wallet transaction unconfirmed, not coinbase and not abandoned.
12217:42 <lightlike> The filters in ReacceptWalletTransactions() seem unnecessary, since SubmitTxMemoryPoolAndRelay() seens to do the same checks again.
12317:42 <glozow> stickies-v: yes, thank you
12417:42 <ishaanam[m]> Zaidan: I think that Resend also obtains a lock on the wallet at some point?
12517:42 <glozow> lightlike: and so does `BroadcastTransaction()`, haha
12617:42 <sipa> Reaccept is loading mempool transactions into the wallet. Rebroadcast is dumping wallet transactions into the mempool.
12717:43 <sipa> I'm wrong, ignore me.
12817:43 <achow101> they both do ~same thing. I'm still trying to figure whether we can de-duplicate here, but both of these functions have existed for very long time and git blaming to figure out why is difficult.
12917:44 <Zaidan> Resend is when we think transactions should have been included in a block
13017:45 <Zaidan> ishaanam: yes, i see that now
13117:45 <glozow> Zaidan: right. rebroadcasting is done automatically, for transactions we're waiting to get confirmed but they haven't yet. reaccept is triggered once, at the start.
13217:45 <glozow> achow101: ideally we get rid of resend from the wallet.
13317:45 <stickies-v> I'm curious why for Reaccept we don't also check if we're in IBD or reindex, because that would raise the same concerns as mentioned for Resend?
13417:46 <pablomartin> stickies: i was thinking the same
13517:47 <pablomartin> isReadyToBroadcast()
13617:48 <glozow> stickies-v: I don't know the answer tbh
13717:48 <achow101> ah, right. so these functions date back to the first commit...
13817:49 <Zaidan> stickies-v: Does the lock handle that?
13917:49 <Zaidan> Can the wallet be locked before IBD or reindex is completed?
14017:50 <larryruane_> Zaidan: a lock would never be held for that long (im pretty sure)
14117:51 <glozow> achow101: woah yeah
14217:51 <glozow> we'll need to ask satoshi why it's this way
14317:51 <larryruane_> "date back to the first commit" -- let's just ask satoshi!
14417:51 <hernanmarino> :)
14517:51 <larryruane_> glozow: sorry .. you beat me to it!
14617:52 <stickies-v> larryruane_: +1, afaik we always try to minimize lock usage to just the bare minimum needed to avoid race conditions?
14717:52 <glozow> Zaidan: assuming you're talking about `cs_wallet`, the wallet lock doesn't have anything to do with the node's state. it's just a object used to make sure multiple threads don't try to access wallet members at the same time.
14817:53 <pablomartin> glozow: achow101: is the resend being used? i see it checks fBroadcastTransactions which is set to false in wallet.h and I don't see it's changed anywhere else
14917:53 <Zaidan> ah ty
15017:54 <glozow> good questions everybody. I'll throw out a few more from the review club notes. The test calls both setmocktime and mockscheduler to trigger a rebroadcast. What is the difference between these two calls, and why or why not is it necessary to call both of them?
15117:54 <achow101> pablomartin: fBroadcastTransactions depends on startup options
15217:54 <pablomartin> achow101: I see, thanks!
15317:55 <glozow> pablomartin: see `DEFAULT_WALLETBROADCAST` https://github.com/bitcoin/bitcoin/blob/1420547ec30a24fc82ba3ae5ac18374e8e5af5e5/src/wallet/wallet.h#L106
15417:56 <glozow> https://github.com/bitcoin/bitcoin/blob/1420547ec30a24fc82ba3ae5ac18374e8e5af5e5/src/wallet/wallet.cpp#L3018
15517:56 <Zaidan> it seems the scheduler runs the time
15617:57 <pablomartin> glozow: thank you! I see then it's passed to SetBroadcastTransactions
15717:57 <glozow> for reference, I'm talking about this part of the test: https://github.com/bitcoin-core-review-club/bitcoin/commit/e40ddf36bed81bdf28d386eb961c9ed22b69e207#diff-2dd85d481900d4ad19d113d2114861b0134bcd283435e95b18d10adf5ad381a0R111-R113
15817:58 <stickies-v> `ResendWalletTransactions()` is exclusively called from the scheduler (every second). I suppose we need to mock the scheduler because otherwise it'd use system time instead of mocked time?
15917:58 <Zaidan> mocktime, explicitly sets the time and then scheduler has the node run to that time?
16017:58 <Zaidan> I like stickies answer, +1
16117:58 <Zaidan> I'm getting on that bandwagon
16217:59 <glozow> stickies-v: yes, we need to mock the scheduler to make it call `MaybeResendWalletTxs` again. and why do we need to call setmocktime?
16318:59 <Zaidan> we have to pass the eviction time so the transactions are dropped?
16418:00 <stickies-v> oh because ResendWalletTransactions() has up to 36h to actually run so we need to advance the mocked time by enough to ensure that
16518:00 <Zaidan> dropped as in not included into a block
16618:00 <glozow> right, if we *just* mock the scheduler forward, `ResendWalletTransactions()` will execute but we won't be at the `nNextResend` time yet.
16718:01 <stickies-v> yeah so first we advance time by 336h (2w) to trigger mempool eviction, then we run node.syncwithvalidationinterfacequeue() to sync, and then we advance by another 36h to ensure ResendWalletTransactions() executes
16818:01 <Zaidan> ahh I missed that last step, I understand now
16918:01 <larryruane_> I'm not quite clear on why we need `mockscheduler` to back up the scheduler items to an earlier time (IIUC), why can't we instead just advance mock time?
17018:02 <larryruane_> stickies-v: +1 good explanation
17118:02 <glozow> stickies-v: exactly
17218:03 <glozow> oh sorry i just realized we're out of time o.O
17318:03 <Zaidan> Larry: I think thats what's happening, we go to the block time, + evict time + evict time + 36*60*60
17418:03 <glozow> #endmeeting
17518:03 <pablomartin> thanks glozow!
17618:03 <larryruane_> thank you @glozow this was super informative!
17718:03 <Zaidan> Thank you for hosting, I learned a lot.
17818:03 <pablomartin> thanks all! see you soon...
17918:03 <glozow> wow that was fast. thought someone had setmocktimed me
18018:03 <lightlike> thanks!
18118:03 <Zaidan> lmao
18218:03 <stickies-v> :-D
18318:03 <ishaanam[m]> thanks for hosting glozow!
18418:04 <hernanmarino> thanks glozow and everybody for your insights
18518:04 <glozow> sorry we didn't get through all the questions from the notes. hope it was helpful in reviewing the PR
18618:04 <stickies-v> thank you glozow , really interesting questions - made me think much more about the PR. and thank you achow101 for authoring
18718:04 <raj> larryruane_, one reason could be `mockscheduler` advances by a delta, where `setmocktime` sets it at a specific timepoint.. So I think they are both essentially same?
18818:04 <glozow> lightlike is hosting next week :)
18918:04 <raj> Thanks glozow , this was a nice one..
19018:05 <larryruane_> glozow: "thought someone had setmocktimed me" -- HAHAHA
19118:07 <glozow> to answer the question of "why do we need to mock the scheduler too," there are a few ways of answering. One is to remove that line from the test and see what happens. Another is to look at how the timepoints in `CScheduler::taskQueue` are handled. What exactly triggers a task to be executed?
19218:16 <achow101> on the question of resend vs reaccept: these functions were introduced in the first commit (or thereabouts) where everything was very tightly coupled together. At that point in time, the mempool was not in charge of deciding what got relayed and when. So Reaccept meant that the transactions would be re-added to the mempool after startup, but not necessarily rebroadcast. Resend meant that the transactions were actually being rebroadcast;
19318:16 <achow101> it would send out invs. Over time, as the mempool began to be in charge of tx relay, adding to the mempool also meant potentially relaying, and resending turned into adding to the mempool. And so now they are not really different and could be deduplicated.
19418:30 <sipa> interesting