Properly rebroadcast unconfirmed transaction chains (wallet)

Aug 24, 2022

https://github.com/bitcoin/bitcoin/pull/25768

Host: glozow - PR author: achow101

The PR branch HEAD was e40ddf36be at the time of this review club meeting.

Notes

The node has multiple local clocks, including the system clock and a mockable NodeClock.
The setmocktime regtest-only RPC sets the clock to a specific time in the past or future; the time does not continue to tick forward after that. This allows us to test node behaviors that happen over a long period of time, such as peer disconnection after timeouts.
The CScheduler maintains a map from time points to background tasks that should be called periodically or sometime in the future. It runs in its own thread and is used for a variety of jobs such as periodically checking if we should find new peers or rebroadcasting wallet transactions.
The mockscheduler regtest-only RPC “fast-forwards” scheduled jobs by making each time point sooner. It does not modify the node’s clock.
The wallet periodically rebroadcasts its transactions that haven’t been mined yet. ResendWalletTransactions() is a task scheduled every 1000milliseconds. The task itself compares the node’s current (mockable) time and the wallet’s randomly chosen nNextResend time to decide whether or not it’s time to call SubmitTxMemoryPoolAndRelay(). As such, to trigger a rebroadcast, the clock must pass two time points: CWallet::nNextResend and the time at which the scheduler has MaybeResendWalletTxs() task set for.
PR #25768 addresses an issue in which the wallet rebroadcasts multiple transactions with dependency relationships. It must submit a parent transaction before its child, otherwise the mempool will reject it due to missing inputs.
Prior to this PR, the transactions are rebroadcast in whichever order they appear in CWallet::mapWallet, which is a std::unordered_map.
This PR reuses the logic in ReacceptWalletTransactions() to sort mapWallet transactions by CWalletTx::nOrderPos.

Questions

Did you review the PR? What was your review approach?
Did you leave a Concept ACK, approach ACK, tested ACK, or NACK on the PR?

Concept

What incorrect behavior does this PR address?
Were you able to reproduce the issue? Does the test added in the PR adequately test this behavior, i.e., does it succeed with this fix but fail without it?
Why is it difficult to write a test that reliably demonstrates that transactions may be rebroadcast in the wrong order? How does std::unordered_map order its elements?

Approach

Can you think of any other methods of getting the wallet transactions sorted in this order?
Would it make sense to keep the wallet transactions sorted in this order? Why is mapWallet implemented as a std::unordered_map; would another data structure make sense?
If you thought of an alternative approach, how does it compare? What is its runtime complexity? How much {more, less} memory is allocated? Does it require a lot of code to be (re)written? Is its complexity appropriate for this purpose?

Implementation

How is GetSortedTxs(), the function extracted from ReacceptWalletTransactions(), implemented? How much memory is needed, as a function of the transactions in mapWallet?
GetSortedTxs() is declared as requiring the wallet lock to already be held. Why or why not does this make sense?
A structured binding declaration is used in two places in GetSortedTxs(). What are the types of each variable (wtxid, wtx, _, and wtx) and what do they represent? Why or why not is this better than declaring the types explicitly?
Can you find any behavior changes to ReacceptWalletTransactions(), or is it a move-only code change?
The test calls both setmocktime and mockscheduler to trigger a rebroadcast. What is the difference between these two calls, and why or why not is it necessary to call both of them?
The test “evicts” the transactions from the mempool by calling setmocktime. Why is this necessary, and why does it work?

Meeting Log

17:00

<glozow> #startmeeting

17:00

<stickies-v> hi

17:00

<glozow> hi everyone! if you're looking for Bitcoin Core PR Review Club, you're in the right place. feel free to say hi :)

17:00

<pablomartin> hello!

17:00

<raj> hi..

17:00

<brunoerg> hi

17:00

<BlueMoon> Hello

17:00

<hernanmarino> Hi

17:01

<larryruane_> hi

17:01

<glozow> we're looking at a wallet PR today, "Properly rebroadcast unconfirmed transaction chains." Notes in the usual place https://bitcoincore.reviews/25768

17:01

<kevkevin> hi

17:01

<effexzi> Hi every1

17:01

<ishaanam[m]> hi

17:01

<lightlike> hi

17:01

<glozow> Did everybody get a chance to review the PR and/or look at the notes?

17:02

<juancama> hi

17:02

<hernanmarino> yes

17:02

<raj> y

17:02

<brunoerg> y

17:02

<juancama> y

17:02

<stickies-v> y

17:02

<kevkevin> y

17:02

<ishaanam[m]> y

17:02

<glozow> wonderful! for those of you who reviewed the PR, what was your review approach?

17:03

<pablomartin> I have reviewed the code, compiled it but still need to do some testing

17:03

<stickies-v> so far just code review and understanding what the functional test was trying to catch

17:03

<raj> Read through the code.. Makes sense of the logic.. Compile it.. Run the test.. Try to understand if testing covering the situation intended..

17:04

<kevkevin> Reviewd the code ran the functional tests but still need to do some manualy testing, still understanding the repo aswell as im still new here

17:04

<raj> Trying to recreate the bug would be cool.. But I am guessing theres no easy way to ensure child before parents in `mapWallet`?

17:04

<ishaanam[m]> I mainly looked at the first commit for any changes in behavior and made sure the added test fails on master sometimes

17:04

<hernanmarino> tested succesfully and read the code. I still have to dive deeply in the test code to think about alternatives and perhaps improvements

17:05

<larryruane_> I'm running the functional test (wallet_resendwallettransactions.py) in the debugger, and watching the node's `debug.log` file ... but I haven't really got to the new test code that this PR adds, still trying to understand the first (existing) part of the test

17:06

<glozow> great! glad to hear people are using the test to reproduce the issue. so what exactly is the incorrect behavior this PR addresses?

17:06

<larryruane_> It's amazing what the functional test framework can do, lots to learn just about that!

17:06

<larryruane_> (things like how the functional test sets itself up as a peer, for example)

17:08

<raj> I am guessing its about the situation where a child appears in `mapWallet` before its parent, and thus at the time of rebroadcasting the child would get rejected in mempoolcheck if mempool doesn't have the parent..

17:08

<stickies-v> `ResendWalletTransactions()` rebroadcasts transactions based on txid instead of their sequential order, which leads to child transactions not being broadcast if their parents txid is higher

17:08

<larryruane_> I think the problem is that if we (our node) broadcasts 2 transactions, A and B, and let's say B is a child of A (B spends an output of A), then if we happen to broadcast tx B first, then A, then our peers won't accept B into their mempool because it's not valid

17:08

<Zaidan> ResendWalletTranactions will fail to rebroadcast child transactions if their txids happen to be lexicographically less than their parent's txid (copy and paste)

17:09

<Zaidan> and man is this ever a subtle bug wow

17:09

<achow101> The lexicographically part is no longer true as mapWallet is now an std::unsorted_map instead of a normal std::map

17:09

<juancama> 50% of the time the child is placed before the parent in mapWallet

17:09

<achow101> now it's basically just random of which will appear first

17:09

<glozow> raj: stickies-v: larryruane_: Zaidan: good answers! indeed, the problem is rebroadcasting children before parents. the order is not "by txid" or "lexicographical" though, as it's a `std::unordered_map`

17:10

<glozow> thanks achow101

17:10

<larryruane_> in general, is there a long-term effort to replace `std::map`s with unordered maps where possible? aren't the latter considered more efficient?

17:11

<larryruane_> (i'm not sure if that's space or time, or both)

17:11

<sipa> they're a tiny bit more memory efficient, but not much

17:11

<achow101> larryruane_: for me, yes. I've been trying to use unsorted_map and unsorted_set where possible

17:12

<glozow> larryruane_: i think it depends on the use case. and same here, I tend to use unsorted when possible

17:12

<sipa> for sufficiently large size, unordered_map is faster, due to O(1) lookups, but they're also more complicated (you need a hash function, possibly a salted one depending on whether it can be under attacker influence) rather than just a comparator

17:14

<glozow> good thing we have a `SaltedTxidHasher`

17:14

<glozow> I think we've just answered the next question, "Why is it difficult to write a test that reliably demonstrates that transactions may be rebroadcast in the wrong order?"

17:14

<glozow> i.e. because it's implemented as a hashmap. can't really guarantee which bucket keys are going to be in

17:15

<Zaidan> Don't answer this if it is out of scope I'm new: but why are these UTXO's transactions all seperate objects, and not wrapped up in a tree structure where their depth order can be easily maintained and referenced? It seems maintaining tx order is a pain.

17:15

<raj> Because its not possible to enforce a child will always appear before the parent in the mapWallet..

17:15

<larryruane_> because the test can't control all the randomness that it would need to in order to make the unordered_map iteration order predictable?

17:15

<sipa> Zaidan: the transaction graph is generally a DAG, not just a tree

17:15

<pablomartin> glozow: yeah perhaps we could force the rebroadcast in wrong order

17:16

<sipa> plus, we do need a way to access individual transactions by txid

17:16

<sipa> so just keeping them in a tree structure wouldn't work

17:16

<larryruane_> I'm not even sure, is `std::unordered_map` iteration deterministic for exactly the same inputs?

17:16

<glozow> i did wonder if we're going to end up with a boost multi index container rewrite for map wallet, heh

17:16

<Zaidan> sipa: ah ty

17:16

<sipa> larryruane_: no, depends on insertion order, even if the hash function is the same

17:17

<larryruane_> ok yes but I meant if insertion order is the same

17:17

<sipa> (and you don't want identical hash functions)

17:17

<glozow> next question is about the functional test: Does the test added in the PR adequately test this behavior, i.e., does it succeed with this fix but fail without it?

17:17

<raj> Isn't there any way to manually enforce the wallet to take to store the the child tx before the parent tx?

17:18

<raj> *manually enforce the wallet to stor

17:18

<sipa> raj: iteration order is determined by the map structure (std::map, or std::unordered_map), you can't control it.

17:19

<raj> Ah.. Okay..

17:19

<glozow> raj: no, and I don't think it's necessary

17:19

<hernanmarino> glozow: didn't test on master, but I'm sure it fails randomnly

17:19

<sipa> with a boost::multiindex you could have multiple indexes (one insertion ordered, another txid-based) simultaneously... but indeed I don't think that's useful.

17:20

<stickies-v> after `git revert 521169f2428be8e78599aa4fcb96f7ada7bb7e04` (and recompiling) the functional test does indeed fail again

17:20

<ishaanam[m]> On master the test fails around 50% of the time for me, though it would be nice if we had a higher chance of failing without the fix. I think glozow left a comment on the PR that would help with this?

17:20

<sipa> raj: Point is that the purpose of this data structure is lookup things based on txid - the iteration order follows that structure

17:20

<larryruane_> glozow: i didn't actually try it, but there's a great comment added to the functional test that explains that the new test will fail only about half of the time

17:21

<glozow> stickies-v: ishannam: wonderful, ⭐ for you

17:21

<raj> I am seeing test failures too in the new test.. And I am not sure why that should happen..

17:21

<glozow> raj: hernanmarino: you're both seeing the test fail *with* the PR's changes?

17:21

<stickies-v> I don't think the new test is supposed to fail at all? without the fix it's meant to fail 50% of the time, I think

17:21

<raj> Yes.. Though I need to double check..

17:21

<larryruane_> I have a question, would it be possible to run this entire new part of the functional test a few times (within a single run of the functional test)? To prevent CI from passing when it shouldn't?

17:21

<glozow> yes the new test should pass every time with the changes

17:22

<pablomartin> stickies-v +1

17:22

<glozow> larryruane_: yes, that should be pretty simple. just do 10 parent-child pairs.

17:22

<larryruane_> I'm just imagining that a future change breaks this, and the dev runs the test suite (or CI does), and it passes, so the dev thinks it's okay

17:22

<larryruane_> glozow: +1

17:23

<lightlike> I wonder if that's actually 50%. I guess it's not a good idea to build a RNG based on the order of an unordered map...

17:23

<glozow> ok if you're seeing the test fail with the PR changes, that shouldn't happen, so let us know where it's failing...

17:23

<glozow> i'm going to continue with the questions

17:23

<glozow> Can you think of any other methods of getting the wallet transactions sorted in this order?

17:24

<glozow> Would it make sense to keep the wallet transactions sorted in this order?

17:24

<Zaidan> Wouldn't we want to depth sort?

17:24

<hernanmarino> glozow: No , within the PR it always succeeds

17:24

<glozow> hernanmarino: ah okay! that's good

17:25

<stickies-v> with std::sort you just need to allocate an additional vector instead of an additional vector and a map, but there's probably some downside to it I don't understand?

100

17:26

<achow101> Zaidan: depth stops being relevant pretty quickly

101

17:27

<Zaidan> achow101: okay ty

102

17:28

<glozow> stickies-v: indeed, I think this could perhaps just be a std::sort where the comparison function uses `nOrderPos`

103

17:28

<stickies-v> yeah it makes for quite a bit less code: https://pastebin.com/1Hxp4YRM

104

17:30

<larryruane_> stickies-v: +1 seems like a good idea!

105

17:31

<larryruane_> Just to confirm, `nOrderPos` is the order transactions entered the wallet, and that's guaranteed to be "parent before child"

106

17:32

<larryruane_> it's a 64-bit integer so we probably assume it will never wrap around

107

17:32

<glozow> larryruane_: AFAIK yes, parents always inserted before children. I suppose more precise would be if they cached their ancestor counts or something, but can't imagine why we wouldn't insert a parent before child.

108

17:32

<glozow> well, 2^64 transactions would be a lot of transactions...

109

17:33

<achow101> larryruane_: I think it is a reasonable assumption. I believe it is possible to mess with it using importprunedfunds though, however that only works on confirmed txs anyways

110

17:33

<sipa> we're not even at 2^30 transactions on-chain

111

17:34

<glozow> ok. I also wonder if filtering should happen before sorting

112

17:34

<larryruane_> yes I'm not saying there's anything wrong, just understanding why it's okay :)

113

17:36

<glozow> i.e. `transactions.filter(should_rebroadcast).sort(insertion_order)`

114

17:36

<larryruane_> glozow: I was wondering about that, `mapWallet` contains confirmed transactions, right? so it could be pretty large (so we're adding many entries to the temporary map in `GetSortedTxs()`)?

115

17:37

<larryruane_> (and there would be a large `out` vector that's returned?)

116

17:38

<ishaanam[m]> glozow: yeah, it looks like the main reason that the filtering is done after is because both of the functions that call GetSortedTxs() do different kinds of filtering? why is that?

117

17:38

<glozow> indeed. I'll say the inefficiency is probably not noticeable to users since this is only happing once a day, and there *usually* aren't that many transactions in mapWallet, but yes.

118

17:39

<glozow> ishaanam: good question, now is a good time to ask: What's the difference between `ResendWalletTransactions` and `ReacceptWalletTransactions`?

119

17:41

<Zaidan> Reaccept has to have a lock on the wallet

120

17:42

<stickies-v> in terms of usage: resend is triggered ~once per day, reaccept is used every time a wallet is loaded

121

17:42

<raj> `Resend` sends all the unconfirmed tx 5 mins before the last known block.. `ReAccept` sends all the wallet transaction unconfirmed, not coinbase and not abandoned.

122

17:42

<lightlike> The filters in ReacceptWalletTransactions() seem unnecessary, since SubmitTxMemoryPoolAndRelay() seens to do the same checks again.

123

17:42

<glozow> stickies-v: yes, thank you

124

17:42

<ishaanam[m]> Zaidan: I think that Resend also obtains a lock on the wallet at some point?

125

17:42

<glozow> lightlike: and so does `BroadcastTransaction()`, haha

126

17:42

<sipa> Reaccept is loading mempool transactions into the wallet. Rebroadcast is dumping wallet transactions into the mempool.

127

17:43

<sipa> I'm wrong, ignore me.

128

17:43

<achow101> they both do ~same thing. I'm still trying to figure whether we can de-duplicate here, but both of these functions have existed for very long time and git blaming to figure out why is difficult.

129

17:44

<Zaidan> Resend is when we think transactions should have been included in a block

130

17:45

<Zaidan> ishaanam: yes, i see that now

131

17:45

<glozow> Zaidan: right. rebroadcasting is done automatically, for transactions we're waiting to get confirmed but they haven't yet. reaccept is triggered once, at the start.

132

17:45

<glozow> achow101: ideally we get rid of resend from the wallet.

133

17:45

<stickies-v> I'm curious why for Reaccept we don't also check if we're in IBD or reindex, because that would raise the same concerns as mentioned for Resend?

134

17:46

<pablomartin> stickies: i was thinking the same

135

17:47

<pablomartin> isReadyToBroadcast()

136

17:48

<glozow> stickies-v: I don't know the answer tbh

137

17:48

<achow101> ah, right. so these functions date back to the first commit...

138

17:49

<Zaidan> stickies-v: Does the lock handle that?

139

17:49

<Zaidan> Can the wallet be locked before IBD or reindex is completed?

140

17:50

<larryruane_> Zaidan: a lock would never be held for that long (im pretty sure)

141

17:51

<glozow> achow101: woah yeah

142

17:51

<glozow> we'll need to ask satoshi why it's this way

143

17:51

<larryruane_> "date back to the first commit" -- let's just ask satoshi!

144

17:51

<hernanmarino> :)

145

17:51

<larryruane_> glozow: sorry .. you beat me to it!

146

17:52

<stickies-v> larryruane_: +1, afaik we always try to minimize lock usage to just the bare minimum needed to avoid race conditions?

147

17:52

<glozow> Zaidan: assuming you're talking about `cs_wallet`, the wallet lock doesn't have anything to do with the node's state. it's just a object used to make sure multiple threads don't try to access wallet members at the same time.

148

17:53

<pablomartin> glozow: achow101: is the resend being used? i see it checks fBroadcastTransactions which is set to false in wallet.h and I don't see it's changed anywhere else

149

17:53

<Zaidan> ah ty

150

17:54

<glozow> good questions everybody. I'll throw out a few more from the review club notes. The test calls both setmocktime and mockscheduler to trigger a rebroadcast. What is the difference between these two calls, and why or why not is it necessary to call both of them?

151

17:54

<achow101> pablomartin: fBroadcastTransactions depends on startup options

152

17:54

<pablomartin> achow101: I see, thanks!

153

17:55

<glozow> pablomartin: see `DEFAULT_WALLETBROADCAST` https://github.com/bitcoin/bitcoin/blob/1420547ec30a24fc82ba3ae5ac18374e8e5af5e5/src/wallet/wallet.h#L106

154

17:56

<glozow> https://github.com/bitcoin/bitcoin/blob/1420547ec30a24fc82ba3ae5ac18374e8e5af5e5/src/wallet/wallet.cpp#L3018

155

17:56

<Zaidan> it seems the scheduler runs the time

156

17:57

<pablomartin> glozow: thank you! I see then it's passed to SetBroadcastTransactions

157

17:57

<glozow> for reference, I'm talking about this part of the test: https://github.com/bitcoin-core-review-club/bitcoin/commit/e40ddf36bed81bdf28d386eb961c9ed22b69e207#diff-2dd85d481900d4ad19d113d2114861b0134bcd283435e95b18d10adf5ad381a0R111-R113

158

17:58

<stickies-v> `ResendWalletTransactions()` is exclusively called from the scheduler (every second). I suppose we need to mock the scheduler because otherwise it'd use system time instead of mocked time?

159

17:58

<Zaidan> mocktime, explicitly sets the time and then scheduler has the node run to that time?

160

17:58

<Zaidan> I like stickies answer, +1

161

17:58

<Zaidan> I'm getting on that bandwagon

162

17:59

<glozow> stickies-v: yes, we need to mock the scheduler to make it call `MaybeResendWalletTxs` again. and why do we need to call setmocktime?

163

18:59

<Zaidan> we have to pass the eviction time so the transactions are dropped?

164

18:00

<stickies-v> oh because ResendWalletTransactions() has up to 36h to actually run so we need to advance the mocked time by enough to ensure that

165

18:00

<Zaidan> dropped as in not included into a block

166

18:00

<glozow> right, if we *just* mock the scheduler forward, `ResendWalletTransactions()` will execute but we won't be at the `nNextResend` time yet.

167

18:01

<stickies-v> yeah so first we advance time by 336h (2w) to trigger mempool eviction, then we run node.syncwithvalidationinterfacequeue() to sync, and then we advance by another 36h to ensure ResendWalletTransactions() executes

168

18:01

<Zaidan> ahh I missed that last step, I understand now

169

18:01

<larryruane_> I'm not quite clear on why we need `mockscheduler` to back up the scheduler items to an earlier time (IIUC), why can't we instead just advance mock time?

170

18:02

<larryruane_> stickies-v: +1 good explanation

171

18:02

<glozow> stickies-v: exactly

172

18:03

<glozow> oh sorry i just realized we're out of time o.O

173

18:03

<Zaidan> Larry: I think thats what's happening, we go to the block time, + evict time + evict time + 36*60*60

174

18:03

<glozow> #endmeeting

175

18:03

<pablomartin> thanks glozow!

176

18:03

<larryruane_> thank you @glozow this was super informative!

177

18:03

<Zaidan> Thank you for hosting, I learned a lot.

178

18:03

<pablomartin> thanks all! see you soon...

179

18:03

<glozow> wow that was fast. thought someone had setmocktimed me

180

18:03

<lightlike> thanks!

181

18:03

<Zaidan> lmao

182

18:03

<stickies-v> :-D

183

18:03

<ishaanam[m]> thanks for hosting glozow!

184

18:04

<hernanmarino> thanks glozow and everybody for your insights

185

18:04

<glozow> sorry we didn't get through all the questions from the notes. hope it was helpful in reviewing the PR

186

18:04

<stickies-v> thank you glozow , really interesting questions - made me think much more about the PR. and thank you achow101 for authoring

187

18:04

<raj> larryruane_, one reason could be `mockscheduler` advances by a delta, where `setmocktime` sets it at a specific timepoint.. So I think they are both essentially same?

188

18:04

<glozow> lightlike is hosting next week :)

189

18:04

<raj> Thanks glozow , this was a nice one..

190

18:05

<larryruane_> glozow: "thought someone had setmocktimed me" -- HAHAHA

191

18:07

<glozow> to answer the question of "why do we need to mock the scheduler too," there are a few ways of answering. One is to remove that line from the test and see what happens. Another is to look at how the timepoints in `CScheduler::taskQueue` are handled. What exactly triggers a task to be executed?

192

18:16

<achow101> on the question of resend vs reaccept: these functions were introduced in the first commit (or thereabouts) where everything was very tightly coupled together. At that point in time, the mempool was not in charge of deciding what got relayed and when. So Reaccept meant that the transactions would be re-added to the mempool after startup, but not necessarily rebroadcast. Resend meant that the transactions were actually being rebroadcast;

193

18:16

<achow101> it would send out invs. Over time, as the mempool began to be in charge of tx relay, adding to the mempool also meant potentially relaying, and resending turned into adding to the mempool. And so now they are not really different and could be deduplicated.

194

18:30

<sipa> interesting