Move SaltedHashers to separate file and add some new ones (
refactoring) Dec 16, 2020
The PR branch HEAD was 281fd1a at the time of this review club meeting.
This refactor PR moves existing code into new
Along the way, it touches some interesting code in:
Hashers are generally used to construct hash maps, which is a data structure
that has fast item lookup, addition and removal. They allow lookup of elements
0(1) time in the typical/best-case scenario although implementation can
The C++ standard library includes a hash map
For a more detailed description of the hash functions used in
Hash functions for C++ Unordered
BlockHasher is used by
src/validation.h::BlockMap to construct a hash map
of blocks and their sequence ID (block height).
BlockManager uses the
BlockMap when adding a new block to the index with
AddBlockToIndex(), first checking whether it’s a duplicate
src/validation.cpp#L3106) and then adding the new block and sequence ID to
SignatureCacheHasher is used to avoid performing costly verification of ECDSA
signatures twice; once when accepted into the mempool and again when accepted
into a block.
The cache type currently used for
CSignatureCache is “cuckoo cache”,
FilterHeaderHasher is used by
src/index/blockfilterindex.h::BlockFilterIndex to construct an index of
filter headers for each filter type enabled.
BlockFilterIndex only consults the
m_headers_cache lookup if the
block index is at a “checkpoint” height – every 1000 blocks per BIP157.
Coin object requires
CCoinsMap to use
SaltedOutpointHasher in its map
CCoinsMap is used by
CCoinsViewCache with most of its methods. As discussed
review club meeting for PR 18113 each
CCoinsCacheEntry contains both the
Coin itself in addition to the flags
associated with that coin.
This is currently used in the
SaltedSipHasher is a generic
siphasher that can be used with many
objects (via the magic of
Did you review the PR?
Concept ACK, approach ACK, tested ACK, or
Why do we need to specify a “hasher” when defining a
What exactly is a
Coin as used by
CCoinsMap and hashed using
src/coins.cpp use a
SaltedOutpointHasher rather than a
What is a “salt” and why are they used in
Why are salts
not used in
What is a boost multi-index? How is the
SaltedTxidHasher used in the
What is the advantage of introducing a new
SaltedSipHasher that takes a
Span as input? Where would we
not be able to use this? Meeting Log
1 18:00 <willcl_ark> #startmeeting
2 18:00 <willcl_ark> Hello everyone, welcome to the final review club before christmas!
6 18:00 <samuel-pedraza> hi!
8 18:00 <murtyjones> hello!
9 18:00 <michaelfolkson> hi
11 18:00 <willcl_ark> Please feel free to say hi if you're here :)
16 18:01 <willcl_ark> Today we're going to be looking at #19935: "Move SaltedHashers to separate file and add some new ones."
17 18:02 <willcl_ark> A quick reminder of some conventions, you don't have to ask to ask a question (e.g. "I have a question about x but I don't know if it's on-topic?") just go ahead and ask.
18 18:02 <willcl_ark> OK, so did everyone get a chance to review the PR? How about a quick y/n?
23 18:02 <michaelfolkson> y
28 18:03 <thomasb06> n (studied Bloom filters instead)
29 18:03 <michaelfolkson> Why bloom filters?
31 18:03 <michaelfolkson> Just fancied it thomasb06?
33 18:03 <samuel-pedraza> y
34 18:04 <thomasb06> because they use hach functions too
35 18:04 <willcl_ark> that's great! The PR was primarily refactor, but it touches some interesting parts of code which I want to dive into
36 18:04 <willcl_ark> So first question to get us started, why do we need to specify a “hasher” when defining a std::unordered_map?
37 18:04 <dhruvm> The Hasher tells std::unordered_map how to make deterministic buckets based on the keys which result in O(1) lookups.
38 18:05 <thomasb06> michaelfolkson: if I understood well the table is T[h_i(m)] for h_i a hash function
39 18:05 <michaelfolkson> Without a hasher you wouldn't know which bucket an element should be placed in
40 18:05 <kouloumos> because the standard library has no implementations for user-defined classes
41 18:06 <joelklabo> user-defined objects need to have their own hashing implementation
42 18:06 <joelklabo> what kouloumos said
43 18:06 <willcl_ark> dhruvm, michaelfolkson, kouloumos joelklabo right! std::unordered_map doesn't know how it should hash our (custom) bitcoin objects.
44 18:08 <willcl_ark> next question: what exactly is a `Coin` as used by `CCoinsMap` and hashed using `SaltedOutpointHasher`?
45 18:08 <murtyjones> Is it a UTXO entry?
46 18:08 <dhruvm> A Coin is a UTXO. The canonical representation of that is CTXOutPoint(txid, n) which are the elements SipHash'ed by SaltedOutpointHasher.
47 18:10 <willcl_ark> murtyjones: dhruvm exactly!
49 18:11 <joelklabo> I have a quick C++ question, is putting SaltedOutpointHasher() in the public interface of itself how you create a singleton? A little confused by the syntax
51 18:12 <dhruvm> joelklabo: I am not sure it is a singleton
52 18:13 <michaelfolkson> That PR 10195 "switched chainstate db and cache to per-txout model". Before it was a per-tx model?
53 18:13 <michaelfolkson> Looks like an interesting PR
54 18:13 <dhruvm> It is used by stdlib when using CCoinsMap
56 18:13 <joelklabo> Ah, ok but I was thinking the salt would need to be consistent for the lifetime of the process, so things don't move around, I'll look that up. Thanks
58 18:14 <dhruvm> joelklabo: you're right that only one instance of SaltedOutpointHasher is ever created in binary lifetime
59 18:14 <dhruvm> but it is not using the singleton code pattern. You could initialize another instance if needed.
60 18:15 <joelklabo> thanks, dhruvm, still learning some basic C++ stuff
61 18:15 <dhruvm> joelklabo: we all are :)
62 18:15 <willcl_ark> joelklabo: we've also got a question on salts coming up soon
63 18:16 <michaelfolkson> Didn't realize only changed to per-txout model in 2017...
64 18:17 <willcl_ark> Alright, so why does src/coins.cpp use a SaltedOutpointHasher rather than a SaltedTxidHasher?
65 18:17 <murtyjones> Using SaltedTxidHasher would require storing all of the outputs in a list in the value, vs. using the outpoint to point to one specific UTXO.
66 18:17 <dhruvm> txid alone is not a canonical representation of a Coin/UTXO. The specific txOut must be referenced.
67 18:17 <jnewbery> dhruvm: I don't think you're right that only one instance is ever created. We'll create a new one for each CCoinsMap object.
68 18:18 <dhruvm> jnewbery: ah, you're right. it's just a typedef there
69 18:19 <willcl_ark> murtyjones: dhruvm right. If we used SaltedTxidHasher then all the UTXOs from the same transaction would hash to the same result which would be, sub-optimal.
70 18:20 <willcl_ark> Are there any potential downsides to using an Outpoint-based system?
71 18:20 <dhruvm> willcl_ark: More buckets?
72 18:20 <murtyjones> according to the PR linked, larger on disk representation
73 18:21 <joelklabo> also slower lookup due to one entry per output v one per tx
74 18:22 <sipa> before 0.15 a per-tx model was used for the chainstste
75 18:23 <willcl_ark> yes! all good potential tradeoffs to beware of.
76 18:23 <sipa> its primary downside was that it required reading all unspent outputs of a tx left from disk any time any of them were spent
77 18:23 <felixweis> leveldb has some logic to not use more disk data when the key prefix is the same
78 18:23 <sipa> this led to a DOS attack
79 18:23 <sipa> *vulnerability
80 18:24 <michaelfolkson> How would that vuln be exploited?
81 18:24 <michaelfolkson> By sending lots of fake transactions to a peer to validate?
82 18:25 <MarcoFalke> michaelfolkson: Create a tx with lots of outputs
83 18:25 <sanket1729> I guess by creating a transaction lotsof inputs
84 18:25 <michaelfolkson> Like lots and lots. On order of hundreds?
85 18:25 <felixweis> create many txs with many outputs
86 18:26 <sipa> first creste txn with lots of outputs, then create a tx that spends one utxo from each
87 18:26 <felixweis> then create a tx with many inputs, each spending from a different large prior tx
88 18:26 <sipa> now you may need gigabytes of memory to load all the utxos from all those
89 18:26 <felixweis> now you need to keep it all in mem
91 18:27 <felixweis> there were demos of how to OOM iirc
92 18:27 <michaelfolkson> Cool
93 18:27 <felixweis> i vaguely remember something about paris
94 18:27 <felixweis> and breaking bitcoin
96 18:27 <willcl_ark> Using an Outpoint-based model also affects the ability to test directly against the UTXO set, but I don't think this is an issue.
97 18:28 <felixweis> and some discussion about responsible disclosure
98 18:28 <jnewbery> I think we're getting a little bit off topic :)
99 18:28 <michaelfolkson> Oh it was that.... I remember haha
100 18:28 <willcl_ark> Ok next question: What is a “salt” and why are they used in SaltedTxidHasher, SaltedOutpointHasher and SaltedSipHasher?
101 18:29 <murtyjones> A salt is random data included in the input to a hash function. My guess is that it's used with those structures to prevent malicious construction of colliding values?
102 18:29 <dhruvm> The salt makes the output of the hash function unpredictable to parties that know the input but not the salt. The hash function remains deterministic so long as the same salt is used. The salt is updated every time the containers are initialized. As an example, SaltedTxidHasher is used in a multi_index for the mempool, which seems to be a non-persistent data structure.
103 18:29 <felixweis> salt is usually used in hashers so an attacker can not know how the buckets are being constructed. otherwise you can create O(n) behaviour
104 18:29 <michaelfolkson> Right... so salts are generally used for things like storing passwords. If you didn't salt them you would be able to see everyone who uses "Password" as their password
105 18:30 <michaelfolkson> By salting them you don't see lots of people with the same hashed password
106 18:30 <willcl_ark> correct! std::unordered_set puts objects into buckets based on their hash value. An adversary who can create hash collisions at will can force a large number of items into the same bucket meaning that lookup time goes from 0(1) to 0(n).
107 18:31 <dhruvm> Wouldn't that require useful work though?
108 18:31 <dhruvm> Like creating a valid transaction that results in a specific txid?
109 18:31 <joelklabo> I guess the same could happen if you had a "bad" hashing algorithm that wasn't evenly distrubuted
110 18:32 <sipa> note that "who can create hash collisions" doesn't mean much... for these data structures the hashes are maybe 10-230 bits in size
111 18:32 <sipa> eh, 10-30
112 18:32 <lightlike> Why are there two uint64_t (m_k0, m_k1) used as salts in SaltedSipHasher instead of just one?
113 18:33 <sipa> siphash happens to use a 128-bit key
114 18:33 <michaelfolkson> What was the DoS attack that motivated the invention of siphash?
115 18:33 <willcl_ark> Does anyone have any ideas about an attack that might be possible if lookup time was affected in this way?
117 18:34 <sipa> michaelfolkson: the paper would be the best place to look i think
118 18:34 <lightlike> thanks, will take a look
121 18:36 <sipa> as i know someone who was confused about this before: i didn't invent siphash
122 18:36 <michaelfolkson> haha not sipahash
123 18:36 <willcl_ark> hehe, easy to mistakenly assume
124 18:37 <dhruvm> i certainly thought that until i saw the paper
125 18:37 <michaelfolkson> jonatack: Coming up to a year open on that one...
126 18:37 <willcl_ark> Alright, why are salts *not* used in BlockHasher and FilterHeaderHasher?
127 18:37 <jnewbery> is it true that you invented sippy cups though?
128 18:38 <dhruvm> Maybe because those hashers are used in persistent data structures? BlockHasher is used in BlockMap/m_block_index. (Un)LoadBlockIndex persist it to disk.
129 18:38 <dhruvm> But I am not sure, as in that case, it'd be beneficial to just persist the salt as well?
130 18:38 <michaelfolkson> jnewbery: What is a sippy cup? A party cup? Off topic...
131 18:38 <willcl_ark> dhruvm: I thought so too at first, although I think we could easily also write the salt to disk
132 18:39 <dhruvm> willcl_ark: exactly
133 18:40 <michaelfolkson> OK... so I get the benefits of salting - use random number in your hash. Like a nonce in dig sigs I guess... I get the benefits of siphash, need secret key to generate.
134 18:40 <michaelfolkson> But why do you need a combination of both?
135 18:41 <michaelfolkson> To stop someone with the secret key from being able to identity hashes that are the same in the database?
136 18:41 <sipa> it's not a combination
137 18:41 <jnewbery> if the aim of the salt is to prevent collisions, then why do we not need to worry about that for blocks?
138 18:41 <felixweis> you need a keyed hash function, siphash is that
139 18:41 <michaelfolkson> A salted siphash is not a combination of salting and siphash?
140 18:41 <sipa> siphash is the gold standard for hash table bucket hashing to prptect against collisions... and siphash uses a salt
141 18:42 <michaelfolkson> Oh. All siphash uses a salt... ok
142 18:42 <sipa> no, siphash is a PRF. it takes as input a key (="salt") and a message
143 18:42 <felixweis> because the work to create a valid block hash is many magnitudes higher than creating an unconfimed tx
144 18:42 <willcl_ark> felixweis: bingo!
145 18:42 <jnewbery> felixweis: correct
146 18:43 <willcl_ark> OK next Q: what is a boost multi-index? How is the SaltedTxidHasher used in the mempool multi-index?
147 18:43 <dhruvm> A multi_index maintains multiple indices on the same data set. index_transaction_set is a multi_index on the mempool. SaltedTxidHasher is used to maintain the mempool sorted by txid for fast lookups by txid.
148 18:44 <jnewbery> if someone wanted to create a collision in our BlockMap, they'd need to be mining valid blocks which _also_ collide in the BlockHasher function
149 18:45 <willcl_ark> dhruvm: correct!
151 18:45 <jonatack> txid and wtxid, though i'm sure dhruvm meant both
152 18:46 <willcl_ark> jonatack: I thought he did :)
153 18:46 <dhruvm> good reminder. yeah i meant the canonical id used for the tx.
154 18:47 <willcl_ark> OK final question, this PR introduces a new `SaltedSipHasher` that takes a `Span` as input, what are the advantages to this?
155 18:47 <michaelfolkson> In addition to the advantages of using Span generally?
156 18:48 <dhruvm> The new implementation allows arbitrary strings to be SipHash'd.
157 18:48 <sipa> span span span span
158 18:48 <michaelfolkson> I guess if you are introducing Span then you want to be able to hash them
159 18:48 <willcl_ark> lovely span
160 18:48 <michaelfolkson> sipan
161 18:49 <michaelfolkson> invented by sipa
162 18:49 <willcl_ark> he's very conspicuous with the naming
163 18:49 <willcl_ark> dhruvm: right, this would allow a single hasher to be used with any of `uint256`, `CPubKey`, `CScript`, `CKeyId`, and generics like `std::vector<unsigned char>`
164 18:50 <willcl_ark> Are there any places where we _wouldn't_ be able to use such a "generic" hasher?
165 18:51 <dhruvm> i think we cannot use it where the type doesn't implement Span?
166 18:51 <dhruvm> *key type
167 18:53 <sipa> you don't need to "implement span" in general
168 18:53 <sipa> types just need a begin() and end() and size()
169 18:53 <dhruvm> oh sorry, i thought there were methods that need to exist on the type. But I need to study up on span.
170 18:54 <dhruvm> i see. not inheritance, or "implementing an interface". cool!
171 18:54 <joelklabo> what C++ version does core use? span seems new in C++ 20?
172 18:54 <willcl_ark> I think Core has a custom re-implementation
173 18:55 <jonatack> joelklabo: c++11 until very recently, now 17
174 18:55 <willcl_ark> (of Span)
176 18:55 <dhruvm> joelklabo: src/span.h implements a subset of the C++20 span
177 18:55 <jonatack> see src/span.h
178 18:55 <joelklabo> thanks
180 18:56 <michaelfolkson> C++17 due for v22 apparently
181 18:56 <felixweis> When leaving c++11 behind, why not go all the way to 20 ?
182 18:56 <jnewbery> felixweis: c++20 was finalized yesterday
183 18:57 <jnewbery> oh, maybe it was a few weeks ago
184 18:57 <willcl_ark> Well that was the final question, so I think it's time to
185 18:57 <willcl_ark> #endmeeting
186 18:58 <willcl_ark> thanks everyone for coming
188 18:58 <jonatack> thanks willcl_ark!
189 18:58 <murtyjones> thank you willcl_ark!
190 18:58 <dhruvm> thank you, willcl_ark !
191 18:58 <jnewbery> one thing before everyone goes
192 18:58 <felixweis> Thanks willcl_ark!
193 18:58 <joelklabo> thanks willcl_ark !
194 18:58 <finite> thanks willcl_ark
195 18:58 <michaelfolkson> I know you jnewbery wanted to only allow parts of C++17. Was it consensus that all C++17 should be allowed in Core v22? I know people argued for this
196 18:58 <felixweis> And Newbery for another awesome year of core review club
197 18:58 <fodediop> thank you willcl_ark!
198 18:59 <caralie> thank you!
199 18:59 <joelklabo> yeah, I look forward to these every week jnewbery
200 18:59 <emzy> thank you!
201 18:59 <michaelfolkson> Thanks willcl_ark jnewbery
202 18:59 <jnewbery> There's a short survey at https://forms.gle/ZLZcxVRoAozVcAyE6 about review club this year. It'd be very helpful if you could fill it out and let me know how we could make review club even better next year.
203 18:59 <lightlike> thanks!
204 19:00 <dhruvm> Thanks jnewbery for PR review club. It has been very useful to my onboarding journey.
205 19:00 <jnewbery> As always, feel free to contact me directly if you have any suggestions or feedback.
207 19:00 <willcl_ark> Thanks jnewbery for organising these!
208 19:00 <thomasb06> thanks
209 19:00 <jnewbery> And have a fantastic Christmas and New Year. See you all in 2021 🚀
210 19:01 <fodediop> Happy NNew Year 🎉
211 19:01 <jonatack> don't forget to leave your review feedback on the PR (and keep revewing other PRs if you're inclined to do so :)
212 19:02 <jonatack> s/revewing/reviewing and testing/
213 19:02 <felixweis> Guten Rutsch!
214 19:03 <anir> Thanks willcl_ark jnewbery!
215 19:08 <wumpus> felixweis | When leaving c++11 behind, why not go all the way to 20 ? <- if only :) every bump of the c++ version is a struggle, which depends on compiler support in even the most slow linux distros
216 19:09 <sipa> also afaik no fully C++20 compliant compiler exists today
217 19:21 <felixweis> wumpus, sipa: thanks