Improve TxOrphanage denial of service bounds (p2p)

Jun 18, 2025

https://github.com/bitcoin/bitcoin/pull/31829

Host: glozow - PR author: glozow

The PR branch HEAD was 0d511965c91 at the time of this review club meeting.

Notes

The PR description summarizes the motivations for this change and the new eviction strategy: we must ensure the orphanage is DoS-resistant, but also want to prevent any peer from affecting another peer’s usage of orphanage space.
The PR does a few things: it virtualizes the TxOrphanage class inherited by TxOrphanageImpl, it implements a new eviction strategy, and it replaces the various TxOrphanage data structure(s) with a single boost::multi_index_container.
With these changes, we can now guarantee at least 1 maximum size-package worth of orphan resolution per peer at a time. This is tested by the txorphan_protected fuzz test and a few functional tests.
If most peers are not providing orphans, the unused space allocated to them can be used by peers that need it.
An earlier version of this PR implemented the new eviction strategy using the existing TxOrphanage data structures. This version makes the PR’s behavior changes clearer. It may also demonstrate why a boost::multi_index_container is the more natural data structure for TxOrphanage, as the original implementation is a bit convoluted. You can find that version of the PR here.

Questions

Concept

Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK? What was your review approach?
Why is the current TxOrphanage global maximum size limit of 100 transactions with random eviction problematic? Can you think of a concrete attack scenario that would affect 1-parent-1-child (1p1c) relay?
Can you summarize the changes to the eviction algorithm at a high level?
Why is it desirable to allow peers to exceed their individual limits while the global limits are not reached?
The new algorithm evicts announcements instead of transactions. What is the difference and why does it matter?
Why is there an announcement “limit” but a memory “reservation”?
How does the per-peer memory usage reservation change as the number of peers increases?
How does the per-peer announcement limit change as the number of peers increases? Why is this different from the per-peer memory usage reservation?
Why is it ok to remove orphan expiration?
Should we also remove EraseForBlock and instead rely on eviction to remove orphans that confirm or conflict with a block? Why or why not?
Going back to your attack scenario from an earlier question, how does this PR improve the reliability of 1p1c relay in adversarial environments?

Implementation

What is the purpose of reimplementing TxOrphanageImpl using a boost::multi_index_container along with the eviction changes?
What is a peer’s “DoS Score” and how is it calculated?
The global memory limit scales with the number of peers. Could this create new DoS vectors where an attacker opens many connections to increase the global limit?
Is it possible for the orphanage to NeedsTrim() when there is no peer whose “DoS Score” is greater than 1?
Is it possible that a peer’s “DoS Score” is greater than 1 but NeedsTrim() returns false?
When evicting orphans, why evict non-reconsiderable orphans before reconsiderable ones? What’s the difference between these categories?
How does the number of announcements represent a meaningful bound on “computation” in TxOrphanage operations?
What is the computational complexity of the LimitOrphans loop? How many heap operations can there be? How many erasures can there be?
How can we test (and do tests exist?) that TxOrphanage’s internal data structures are updated correctly, that its DoS limits are correctly enforced, and that the orphans of “honest” peers are protected from eviction?
How does the txorphan_protected fuzz harness test that orphans are protected from eviction?
The default announcement limit is increased from 100 to 3000. How was this number chosen and what are the tradeoffs?

Meeting Log

17:00

<glozow> #startmeeting

17:00

<monlovesmango> heyy

17:00

<pseudoramdom> hi hi

17:01

<glozow> welcome to PR Review Club! we're looking at Improve TxOrphanage denial of service bounds today: https://bitcoincore.reviews/31829

17:01

<marcofleon> hola

17:01

<theStack> hi

17:02

<glozow> did anybody get a chance to review the PR or the notes?

17:02

<instagibbs> reviewing the PR

17:02

<marcofleon> yes, mostly notes and focused on new txorphanage

17:02

<monlovesmango> yes, mostly. haven't reviewed test commits

17:03

<glozow> instagibbs marcofleon monlovesmango: awesome!

17:03

<glozow> Why is the current TxOrphanage global maximum size limit of 100 transactions with random eviction problematic? Can you think of a concrete attack scenario that would affect 1-parent-1-child (1p1c) relay?

17:03

<pseudoramdom> Just read the notes as well

17:03

<marcofleon> i'm also sending the fuzz tests cmooon

17:04

<monlovesmango> bc it enables peers to evict other peer's orphans from your orphanage

17:04

<marcofleon> Attack scenario could be an attacker node spamming the orphanage to prevent a child paying for a low fee parent

17:05

<marcofleon> if the child keeps getting evicted, then i guess parent would be dropped from mempool?

17:06

<glozow> marcofleon: in a 1p1c scenario, the parent isn't in mempool yet. We opportunistically pair it with its child in orphanage if we find one. But if the child gets evicted, then we're out of luck

17:06

<glozow> Can you summarize the changes to the eviction algorithm at a high level?

17:06

<monlovesmango> if a malicious peer floods you with orphans, its pretty likely that you will evict a child that could have otherwise been resolved as 1p1c?

17:06

<pseudoramdom> +1. Low fee rate parent + CPFP scenario, attacker floods with orphan tx?

17:08

<marcofleon> well eviction is no longer random, it's based on the "worst behaving" peer, and its the oldest announcement

17:08

<marcofleon> highest Dos score peer will have their annoucement removed

17:08

<glozow> marcofleon: yep! and we'll get into how we calculate DoS score in a later question

17:08

<glozow> Why is it desirable to allow peers to exceed their individual limits while the global limits are not reached?

17:10

<marcofleon> Because there could be a peer that is sending a lot of orphans, not necessarily dishonestly

17:10

<instagibbs> in the non-adversarial case, it could allow a lot more "honest" CPFPs through

17:10

<pseudoramdom> Not all peers may be active actively broadcasting at the same time?

17:10

<marcofleon> just makes sense to not waste the space by having an inflexible limit per peer

17:10

<glozow> marcofleon: instagibbs: yes exactly. often, peers are using a lot of resources simply because they are the most helpful peer

17:11

<pseudoramdom> Is it possible for attacker to game the DDoS scoring?

17:11

<monlovesmango> rather than having a common pool of orphans, this pr will restructure orphanage to track orphanage counts and usage by peer. each peer will be subject to a orphan announcment count limit that is the global max announcement count divided by the number of peers, and allowed a set amount of weight for the orphans they announce

17:12

<glozow> This was why I originally thought of doing a "token bucket" approach where we'd allow peers resources based on an amount of tokens, and then either replenished tokens if the orphans were useful or destroyed them if it was just spam

17:13

<glozow> The new algorithm evicts announcements instead of transactions. What is the difference and why does it matter?

17:13

<monlovesmango> pseudoramdom: I was thinking about this too. I think if you flood a node with peers with counts that are just over the limit you could theoretically evict a high weight tx from a peer with high weight announcements but low announcement count

17:13

<glozow> marcofleon: yeah, you might as well use the space

17:14

<marcofleon> annoucements are wtxid, peer. so if a peer is misbehaving then the orphan will only be removed for that peer. So a peer can't affect the orphan announcments of other peers

17:14

<monlovesmango> but i'm not sure thats really too much of a concern..?

17:15

<glozow> pseudoramdom: do you mean for an attacker to try to get us to evict a specific orphan? or to appear less resource-intensive than another peer and get them chosen for eviction instead?

17:15

<glozow> marcofleon: yes bingo!

17:16

<pseudoramdom> I was thinking of latter. Staying just under the limit but still managing to evict certain orphans.

17:16

<glozow> I think this relates to pseudoramdom's question: if they did something tricky to try to get a particular orphan of theirs chosen for eviction, that's fine, because we'll keep the transaction as long as another peer has announced it

17:17

<marcofleon> as long as you have at least one honest peer

17:18

<marcofleon> the orphan should (hopefully) remain in the orphanage

17:18

<glozow> pseudoramdom: the other peer won't experience eviction unless they exceed the limits. This still presents a limitation - peers might get evicted if they're just sending stuff at a far faster rate than we manage to process them - but the point is you can't influence the eviction of another peer's announcements

17:18

<monlovesmango> any peer that stays under peer limits can't have their orphan evicted by another peer

17:18

<glozow> monlovesmango: correct

17:19

<glozow> Why is there an announcement “limit” but a memory “reservation”?

17:19

<sipa> hi!

17:19

<glozow> sipa: hello hello

17:20

<glozow> Actually, I feel like you can call them both reservations, haha

17:21

<monlovesmango> I think bc announcement count affects CPU usage? and its not much of a concern to allocate a certain amount of memory to each peer. guessing here, think I read something like that in the PR notes

17:21

<sipa> glozow: both have a global limit, and a per-peer reservation

17:21

<marcofleon> but one can be exceeded and the other is a decreasing share of the pie

17:21

<sipa> the difference is the announcement global limit is a constant, but the global memory limit is a function of the number of peers

17:22

<sipa> so the "constant" is a global announcement limit, and per-peer memory reservation

17:22

<marcofleon> or i think we make the assumption that there is more memory that can be used up to a certain point, but for announcements we're trying to figure out which peer is "overusing" their share

17:22

<glozow> sipa: I wouldn't call it an announcement "reservation" though, because you aren't guaranteed it. if more peers appear, your announcement limit decreases.

17:22

<monlovesmango> hahah so many ways to state the same things

17:22

<glozow> On the other hand, your memory reservation is guaranteed and constant no matter how the peer set changes

17:22

<sipa> glozow: that just means the reservation is dynamic :p

17:23

<sipa> but yeah, the term reservation is weird in that context

17:24

<marcofleon> memory reservation is guaranteed per peer yes?

17:24

<marcofleon> oh yeah you said it above

17:24

<sipa> "Yes sir, your reservation for 4 tonight at FancyDining is confirmed." - "What do you mean my reservation was dropped to 3, because another group made a reservation?!"

17:24

<glozow> marcofleon: yes. you're also guaranteed a certain number of announcements, but it's dynamic

17:25

<glozow> sipa: yeah that's what I mean is weird about "reservation"

17:25

<sipa> fair enough

17:25

<sipa> We shall commence the bikeshedding for a better term now.

17:25

<glozow> How does the per-peer memory usage reservation change as the number of peers increases? How does the per-peer announcement limit change as the number of peers increases?

17:25

<marcofleon> the per peer memory usage doesn't change iiuc

17:26

<monlovesmango> wait, why is one a function of the number of peers and the other not?

17:26

<glozow> marcofleon: yes 💯

17:26

<marcofleon> but the per peer annoucement limit decreases?

17:26

<glozow> yup

17:26

<glozow> monlovesmango: indeed, why?

17:26

<glozow> What is the purpose of the announcement limit?

17:27

<monlovesmango> is it bc anncouncement count affects CPU usage? and so we want to limit this globally?

17:27

<glozow> monlovesmango: how does our "budget" for CPU usage change with more peers sending us orphans?

17:28

<sipa> monlovesmango: more specifically, the *global* announcement limit directly affects the *latency* of trimming announcements - it's not because we have more peers that we can tolerate a longer latency in processing transactions

17:28

<sipa> monlovesmango: but for memory usage, it is normal and expected that your maximum memory usage goes up with more peers - if you're memory-constrained, you should limit your number of peers anyway

17:28

<glozow> (the answer is it doesn't. we can't tolerate more announcements when we have more peers)

17:29

<sipa> (sorry if i spoiled it?)

17:29

<marcofleon> in LimitOrphans we're removing announcements one by one right? and so that's why we're using that limit as a proxy for cpu usage

17:29

<monlovesmango> sipa: thank you that answered my question

17:30

<monlovesmango> marcofleon: that is also a good point

17:30

<sipa> monlovesmango: yup, the number of iterations that loop in LimitOrphans scales directly with the *global* announcement limit

17:30

<monlovesmango> cool cool we can move on thanks all :)

17:30

<glozow> Why is it ok to remove orphan expiration?

17:31

<marcofleon> because we take care of oldest orphans now whenever we start evicting

17:31

<glozow> marcofleon: yes exactly, that's the intuition for why the number of announcements is the number we are interested in. not the number of unique orphans (which is what we used to limit)

17:31

<sipa> glozow: FWIW, have you benchmarked how long LimitOrphans can take?

17:32

<monlovesmango> bc the orhpanage now has other concrete metrics by which we can reliable evict orphans which guarantee the oldest and evicted first and orphans that are no longer needed are removed

17:32

<glozow> marcofleon: yep! but wait, doesn't this mean we can be holding on to orphans for days, or weeks, etc?

17:32

<monlovesmango> yes..?

17:33

<glozow> sipa: not since the rewrite. I can find my old benchmarks and run them. IIRC the `AddTx`s is what takes a really long time

17:34

<marcofleon> hmm yeah i guess we can hold onto it for a while now. as long as there's no conflicting txs that arrive in a block or something

17:34

<sipa> glozow: sure, but the reason for the existence of the global announcement limit is the time that LimitOrphans can take, not AddTx... so perhaps it's worth benchmarking, and seeing if we can perhaps tolerate a higher global announcement limit (or, otherwise, be sad to find out it needs to be reduced)

17:34

<glozow> marcofleon: is it problematic?

100

17:35

<marcofleon> i don't think so, as long as limits aren't being exceeded, seems fine to me

101

17:36

<glozow> sipa: yeah definitely. I just wanted to add some context for anybody who looks at the old benchmarks. What do you think is an acceptable amount of time?

102

17:36

<marcofleon> unless i'm missing something...

103

17:36

<sipa> glozow: probably in the millisecond range?

104

17:36

<glozow> marcofleon: I agree with you

105

17:36

<glozow> sipa: 👍

106

17:37

<glozow> Should we also remove EraseForBlock and instead rely on eviction to remove orphans that confirm or conflict with a block? Why or why not?

107

17:38

<marcofleon> could maybe be worked out somehow, but feels like a separate thing

108

17:39

<marcofleon> so i would say no

109

17:39

<marcofleon> it's not the same reason that an orphan is being removed

110

17:39

<monlovesmango> I would also say no, bc otherwise the caller would have to have knowledge of what is in the orphanage and individually evict txs

111

17:40

<glozow> fwiw, I think the worst case for EraseForBlock is probably worse than LimitOrphans. But EraseForBlock happens on the scheduler thread so speed might not be as much of an issue

112

17:40

<sipa> glozow: it also costs an attacker mining a valid block

113

17:41

<glozow> sipa: is that true? You could look at what's in the projected next block and just create conflicting transactions with a lot of nonexistent utxos

114

17:42

<sipa> glozow: sure, but the victim will still never experience it more than once per block, which is expensive. good point though that it's not necessarily the attacker themselves that pay this cost

115

17:43

<glozow> right. it's not a very worthwhile attack imo

116

17:45

<glozow> And the benefit of freeing up this space is probably worth it

117

17:45

<glozow> What is the purpose of reimplementing TxOrphanageImpl using a boost::multi_index_container along with the eviction changes?

118

17:46

<marcofleon> it's easier on the eyes :)

119

17:46

<instagibbs> glozow we could just look for txid matches instead of scanning inputs :)

120

17:46

<monlovesmango> so that you can look up orphan announcements by either wtxid or peer?

121

17:47

<glozow> instagibbs: indeed. It would require adding a txid index, but maybe that's what we're evicting in practice anyway! Could measure what it looks like in the wild

122

17:47

<glozow> monlovesmango: we can already do that though!

123

17:47

<instagibbs> oh right, wtxid would be the thing on hand

124

17:48

<glozow> instagibbs: right but same thing, maybe we're always evicting exact block txns

125

17:48

<instagibbs> 👍

126

17:48

<sipa> i think it was me who suggested using a multi_index, and the reason was because i saw the older implementation was effectively implementing a multi-index inefficiently, by having separate data structures for per-peer and per-wtxid information about announcements

127

17:48

<monlovesmango> glozow: haha it was just my best guess, didn't actually get around to understanding boost::multi_index_container better

128

17:49

<glozow> yeah it is the more natural data structure. I was also pleasantly surprised to realize that we only needed 2 indexes

129

17:49

<sipa> yeah, i was assuming we'd need 3

130

17:49

<sipa> nice find

131

17:50

<glozow> I've also been told many times that the existing `TxOrphanage` is hard to review

132

17:50

<glozow> so good to hear from marcofleon that it's easier this way

133

17:50

<glozow> What is a peer’s “DoS Score” and how is it calculated?

134

17:51

<monlovesmango> its the max bettween announcement_count/peer_announcement_limit and announcement_usage/peer_announcement_usage_reservation

135

17:52

<sipa> monlovesmango: think of an SQL database with multiple indexes on various columns, but then in-memory entirely, and in a C++y way; it's more efficient (both in memory and CPU) than having multiple independent maps (one for each index), and much easier to keep consistent (because there is no way for the different indexes to contain different information)

136

17:52

<marcofleon> maximum of cpu score and memory score. cpu score is a peers number of announcments / their per peer limit. and memory score is sum of the weight of a peers announced tx weights / the reserved memory usage per peer

137

17:52

<glozow> monlovesmango: marcofleon: yep. how does this compare to having 2 separate scores, and trimming "while CPU score is exceeded or memory score is exceeded" ?

138

17:54

<marcofleon> hmmm well a peer can have a dos score of more than 1

139

17:54

<marcofleon> or maybe i'm confused with the q

140

17:54

<monlovesmango> sipa: that helps my conceptual understanding a lot thanks

141

17:55

<monlovesmango> glozow: this is much simplier for sure

142

17:55

<glozow> Er, my point was "it's the same thing"

143

17:55

<sipa> i think the question is: why do we have a *single* DoS score = max(mem_score, ann_score), as opposed to two different DoS scores that are never compared with one another, and a rule "trim the worst announcement offenders while there are any" + 'trim the worst memory offenders while there are any"

144

17:55

<monlovesmango> bc we only have to track one score rather than two

145

17:56

<marcofleon> wait this is actually a good question, i'm not immediately seeing what the benefit of one score is over two

146

17:56

<marcofleon> is it more gameable somehow?

147

17:56

<monlovesmango> i think this way also allows more the advantage of allowing peers to exceed limits/reservations

148

17:56

<sipa> i don't think the two approaches are equivalent, fwiw, but the difference is small

149

17:57

<monlovesmango> as long as global limits are not reached

150

17:58

<marcofleon> hmm so global limits reached, we get dos scores and target a peer based on that

151

17:59

<glozow> So we're comparing this approach to having 2 loops. "While global announcement limit is exceeded, pick the peer with the most announcements, evict from them. Then, while global memory limit is exceeded, pick the per with the most memory usage, evict from them."

152

17:59

<monlovesmango> well practically speaking, usually only one limit will be reached at a time so it would usually only be one loop no?

153

17:59

<glozow> This approach basically rolls them into 1 loop. "While global announcement or memory limit is exceeded, pick the peer with the highest score (max ratio of both) and evict from them."

154

18:00

<Emc99> What is dos?

155

18:00

<instagibbs> denial of service

156

18:00

<Emc99> Thanks

157

18:00

<glozow> oh oops we are out of time!

158

18:00

<marcofleon> is the two loop approach worse in some other way i'm not seeing other than it's two loops

159

18:01

<marcofleon> thanks for hosting and answering qs glozow! good stuff as usual

160

18:01

<glozow> fwiw, I think that having a ratio-based score is good if we want to consider giving different peers different reservation amounts

161

18:01

<monlovesmango> one small flaw with this approach is that if count limit is reached first, the highest DOS peer might actually be violating the memory reservation and removing it won't immediately resolve the global limit being exceeded

162

18:02

<monlovesmango> thanks for hosting glozow!!

163

18:02

<glozow> #endmeeting