Move SaltedHashers to separate file and add some new ones (refactoring)

https://github.com/bitcoin/bitcoin/pulls/19935

Host: willcl-ark  -  PR author: achow101

The PR branch HEAD was 281fd1a at the time of this review club meeting.

Notes

This refactor PR moves existing code into new src/util/hasher.{cpp|h} files.

Along the way, it touches some interesting code in:

  • src/coins.{cpp|h}
  • src/index/blockfilter.h
  • src/script/sigcache.h
  • src/txmempool.{cpp|h}
  • src/validation.h

Hashers are generally used to construct hash maps, which is a data structure that has fast item lookup, addition and removal. They allow lookup of elements in 0(1) time in the typical/best-case scenario although implementation can affect this.

The C++ standard library includes a hash map (std::unordered_map). For a more detailed description of the hash functions used in std::unordered_map, see Hash functions for C++ Unordered Containers.

BlockHasher

BlockHasher is used by src/validation.h::BlockMap to construct a hash map of blocks and their sequence ID (block height).

BlockManager uses the BlockMap when adding a new block to the index with AddBlockToIndex(), first checking whether it’s a duplicate (src/validation.cpp#L3106) and then adding the new block and sequence ID to the index.

SignatureCacheHasher

SignatureCacheHasher is used to avoid performing costly verification of ECDSA signatures twice; once when accepted into the mempool and again when accepted into a block.

The cache type currently used for CSignatureCache is “cuckoo cache”, implemented in PR 8895.

FilterHeaderHasher

FilterHeaderHasher is used by src/index/blockfilterindex.h::BlockFilterIndex to construct an index of (BIP157) filter headers for each filter type enabled.

However, BlockFilterIndex only consults the m_headers_cache lookup if the block index is at a “checkpoint” height – every 1000 blocks per BIP157.

SaltedOutpointHasher

The Coin object requires CCoinsMap to use SaltedOutpointHasher in its map of CCoinsCacheEntry(s).

CCoinsMap is used by CCoinsViewCache with most of its methods. As discussed in the review club meeting for PR 18113 each CCoinsCacheEntry contains both the Coin itself in addition to the flags associated with that coin.

SaltedTxidHasher

This is currently used in the mempool’s multi-index.

SaltedSipHasher

SaltedSipHasher is a generic siphasher that can be used with many objects (via the magic of Span #18468)

Questions

  1. Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK?

  2. Why do we need to specify a “hasher” when defining a std::unordered_map?

  3. What exactly is a Coin as used by CCoinsMap and hashed using SaltedOutpointHasher?

  4. Why does src/coins.cpp use a SaltedOutpointHasher rather than a SaltedTxidHasher?

  5. What is a “salt” and why are they used in SaltedTxidHasher, SaltedOutpointHasher and SaltedSipHasher?

  6. Why are salts not used in BlockHasher and FilterHeaderHasher?

  7. What is a boost multi-index? How is the SaltedTxidHasher used in the mempool multi-index?

  8. What is the advantage of introducing a new SaltedSipHasher that takes a Span as input? Where would we not be able to use this?

Meeting Log

  118:00 <willcl_ark> #startmeeting
  218:00 <willcl_ark> Hello everyone, welcome to the final review club before christmas!
  318:00 <finite> hi
  418:00 <dhruvm> hi
  518:00 <felixweis> hi
  618:00 <samuel-pedraza> hi!
  718:00 <elle> hi
  818:00 <murtyjones> hello!
  918:00 <michaelfolkson> hi
 1018:00 <svav> Hi
 1118:00 <willcl_ark> Please feel free to say hi if you're here :)
 1218:01 <lightlike> hi
 1318:01 <thomasb06> hi
 1418:01 <joelklabo> hi
 1518:01 <emzy> hi
 1618:01 <willcl_ark> Today we're going to be looking at #19935: "Move SaltedHashers to separate file and add some new ones."
 1718:02 <willcl_ark> A quick reminder of some conventions, you don't have to ask to ask a question (e.g. "I have a question about x but I don't know if it's on-topic?") just go ahead and ask.
 1818:02 <willcl_ark> OK, so did everyone get a chance to review the PR? How about a quick y/n?
 1918:02 <murtyjones> y
 2018:02 <dhruvm> y
 2118:02 <joelklabo> y
 2218:02 <svav> y
 2318:02 <michaelfolkson> y
 2418:02 <emzy> n
 2518:03 <kouloumos> hi
 2618:03 <kouloumos> n
 2718:03 <jnewbery> hi
 2818:03 <thomasb06> n (studied Bloom filters instead)
 2918:03 <michaelfolkson> Why bloom filters?
 3018:03 <lightlike> y
 3118:03 <michaelfolkson> Just fancied it thomasb06?
 3218:03 <jnewbery> y
 3318:03 <samuel-pedraza> y
 3418:04 <thomasb06> because they use hach functions too
 3518:04 <willcl_ark> that's great! The PR was primarily refactor, but it touches some interesting parts of code which I want to dive into
 3618:04 <willcl_ark> So first question to get us started, why do we need to specify a “hasher” when defining a std::unordered_map?
 3718:04 <dhruvm> The Hasher tells std::unordered_map how to make deterministic buckets based on the keys which result in O(1) lookups.
 3818:05 <thomasb06> michaelfolkson: if I understood well the table is T[h_i(m)] for h_i a hash function
 3918:05 <michaelfolkson> Without a hasher you wouldn't know which bucket an element should be placed in
 4018:05 <kouloumos> because the standard library has no implementations for user-defined classes
 4118:06 <joelklabo> user-defined objects need to have their own hashing implementation
 4218:06 <joelklabo> what kouloumos said
 4318:06 <willcl_ark> dhruvm, michaelfolkson, kouloumos joelklabo right! std::unordered_map doesn't know how it should hash our (custom) bitcoin objects.
 4418:08 <willcl_ark> next question: what exactly is a `Coin` as used by `CCoinsMap` and hashed using `SaltedOutpointHasher`?
 4518:08 <murtyjones> Is it a UTXO entry?
 4618:08 <dhruvm> A Coin is a UTXO. The canonical representation of that is CTXOutPoint(txid, n) which are the elements SipHash'ed by SaltedOutpointHasher.
 4718:10 <willcl_ark> murtyjones: dhruvm exactly!
 4818:11 <willcl_ark> The concept of a `Coin` was introduced in https://github.com/bitcoin/bitcoin/pull/10195
 4918:11 <joelklabo> I have a quick C++ question, is putting SaltedOutpointHasher() in the public interface of itself how you create a singleton? A little confused by the syntax
 5018:12 <jnewbery> joelklabo: do you mean here: https://github.com/bitcoin/bitcoin/blob/b440c331/src/coins.h#L92 ? That's declaring the constructor
 5118:12 <dhruvm> joelklabo: I am not sure it is a singleton
 5218:13 <michaelfolkson> That PR 10195 "switched chainstate db and cache to per-txout model". Before it was a per-tx model?
 5318:13 <michaelfolkson> Looks like an interesting PR
 5418:13 <dhruvm> It is used by stdlib when using CCoinsMap
 5518:13 <dhruvm> here: https://github.com/bitcoin/bitcoin/blob/b440c331/src/coins.h#L157
 5618:13 <joelklabo> Ah, ok but I was thinking the salt would need to be consistent for the lifetime of the process, so things don't move around, I'll look that up. Thanks
 5718:14 <fodediop> hi
 5818:14 <dhruvm> joelklabo: you're right that only one instance of SaltedOutpointHasher is ever created in binary lifetime
 5918:14 <dhruvm> but it is not using the singleton code pattern. You could initialize another instance if needed.
 6018:15 <joelklabo> thanks, dhruvm, still learning some basic C++ stuff
 6118:15 <dhruvm> joelklabo: we all are :)
 6218:15 <willcl_ark> joelklabo: we've also got a question on salts coming up soon
 6318:16 <michaelfolkson> Didn't realize only changed to per-txout model in 2017...
 6418:17 <willcl_ark> Alright, so why does src/coins.cpp use a SaltedOutpointHasher rather than a SaltedTxidHasher?
 6518:17 <murtyjones> Using SaltedTxidHasher would require storing all of the outputs in a list in the value, vs. using the outpoint to point to one specific UTXO.
 6618:17 <dhruvm> txid alone is not a canonical representation of a Coin/UTXO. The specific txOut must be referenced.
 6718:17 <jnewbery> dhruvm: I don't think you're right that only one instance is ever created. We'll create a new one for each CCoinsMap object.
 6818:18 <dhruvm> jnewbery: ah, you're right. it's just a typedef there
 6918:19 <willcl_ark> murtyjones: dhruvm right. If we used SaltedTxidHasher then all the UTXOs from the same transaction would hash to the same result which would be, sub-optimal.
 7018:20 <willcl_ark> Are there any potential downsides to using an Outpoint-based system?
 7118:20 <dhruvm> willcl_ark: More buckets?
 7218:20 <murtyjones> according to the PR linked, larger on disk representation
 7318:21 <joelklabo> also slower lookup due to one entry per output v one per tx
 7418:22 <sipa> before 0.15 a per-tx model was used for the chainstste
 7518:23 <willcl_ark> yes! all good potential tradeoffs to beware of.
 7618:23 <sipa> its primary downside was that it required reading all unspent outputs of a tx left from disk any time any of them were spent
 7718:23 <felixweis> leveldb has some logic to not use more disk data when the key prefix is the same
 7818:23 <sipa> this led to a DOS attack
 7918:23 <sipa> *vulnerability
 8018:24 <michaelfolkson> How would that vuln be exploited?
 8118:24 <michaelfolkson> By sending lots of fake transactions to a peer to validate?
 8218:25 <MarcoFalke> michaelfolkson: Create a tx with lots of outputs
 8318:25 <sanket1729> I guess by creating a transaction lotsof inputs
 8418:25 <michaelfolkson> Like lots and lots. On order of hundreds?
 8518:25 <felixweis> create many txs with many outputs
 8618:26 <sipa> first creste txn with lots of outputs, then create a tx that spends one utxo from each
 8718:26 <felixweis> then create a tx with many inputs, each spending from a different large prior tx
 8818:26 <sipa> now you may need gigabytes of memory to load all the utxos from all those
 8918:26 <felixweis> now you need to keep it all in mem
 9018:26 <jonatack> hi
 9118:27 <felixweis> there were demos of how to OOM iirc
 9218:27 <michaelfolkson> Cool
 9318:27 <felixweis> i vaguely remember something about paris
 9418:27 <felixweis> and breaking bitcoin
 9518:27 <sipa> indeed
 9618:27 <willcl_ark> Using an Outpoint-based model also affects the ability to test directly against the UTXO set, but I don't think this is an issue.
 9718:28 <felixweis> and some discussion about responsible disclosure
 9818:28 <jnewbery> I think we're getting a little bit off topic :)
 9918:28 <michaelfolkson> Oh it was that.... I remember haha
10018:28 <willcl_ark> Ok next question: What is a “salt” and why are they used in SaltedTxidHasher, SaltedOutpointHasher and SaltedSipHasher?
10118:29 <murtyjones> A salt is random data included in the input to a hash function. My guess is that it's used with those structures to prevent malicious construction of colliding values?
10218:29 <dhruvm> The salt makes the output of the hash function unpredictable to parties that know the input but not the salt. The hash function remains deterministic so long as the same salt is used. The salt is updated every time the containers are initialized. As an example, SaltedTxidHasher is used in a multi_index for the mempool, which seems to be a non-persistent data structure.
10318:29 <felixweis> salt is usually used in hashers so an attacker can not know how the buckets are being constructed. otherwise you can create O(n) behaviour
10418:29 <michaelfolkson> Right... so salts are generally used for things like storing passwords. If you didn't salt them you would be able to see everyone who uses "Password" as their password
10518:30 <michaelfolkson> By salting them you don't see lots of people with the same hashed password
10618:30 <willcl_ark> correct! std::unordered_set puts objects into buckets based on their hash value. An adversary who can create hash collisions at will can force a large number of items into the same bucket meaning that lookup time goes from 0(1) to 0(n).
10718:31 <dhruvm> Wouldn't that require useful work though?
10818:31 <dhruvm> Like creating a valid transaction that results in a specific txid?
10918:31 <joelklabo> I guess the same could happen if you had a "bad" hashing algorithm that wasn't evenly distrubuted
11018:32 <sipa> note that "who can create hash collisions" doesn't mean much... for these data structures the hashes are maybe 10-230 bits in size
11118:32 <sipa> eh, 10-30
11218:32 <lightlike> Why are there two uint64_t (m_k0, m_k1) used as salts in SaltedSipHasher instead of just one?
11318:33 <sipa> siphash happens to use a 128-bit key
11418:33 <michaelfolkson> What was the DoS attack that motivated the invention of siphash?
11518:33 <willcl_ark> Does anyone have any ideas about an attack that might be possible if lookup time was affected in this way?
11618:34 <jnewbery> https://www.aumasson.jp/siphash/siphash.pdf for those interested in learning more about siphash
11718:34 <sipa> michaelfolkson: the paper would be the best place to look i think
11818:34 <lightlike> thanks, will take a look
11918:35 <jnewbery> (or https://github.com/bitcoin/bitcoin/blob/b440c331/src/crypto/siphash.cpp if you prefer reading code)
12018:35 <jonatack> and then review https://github.com/bitcoin/bitcoin/pull/18014
12118:36 <sipa> as i know someone who was confused about this before: i didn't invent siphash
12218:36 <michaelfolkson> haha not sipahash
12318:36 <willcl_ark> hehe, easy to mistakenly assume
12418:37 <dhruvm> i certainly thought that until i saw the paper
12518:37 <michaelfolkson> jonatack: Coming up to a year open on that one...
12618:37 <willcl_ark> Alright, why are salts *not* used in BlockHasher and FilterHeaderHasher?
12718:37 <jnewbery> is it true that you invented sippy cups though?
12818:38 <dhruvm> Maybe because those hashers are used in persistent data structures? BlockHasher is used in BlockMap/m_block_index. (Un)LoadBlockIndex persist it to disk.
12918:38 <dhruvm> But I am not sure, as in that case, it'd be beneficial to just persist the salt as well?
13018:38 <michaelfolkson> jnewbery: What is a sippy cup? A party cup? Off topic...
13118:38 <willcl_ark> dhruvm: I thought so too at first, although I think we could easily also write the salt to disk
13218:39 <dhruvm> willcl_ark: exactly
13318:40 <michaelfolkson> OK... so I get the benefits of salting - use random number in your hash. Like a nonce in dig sigs I guess... I get the benefits of siphash, need secret key to generate.
13418:40 <michaelfolkson> But why do you need a combination of both?
13518:41 <michaelfolkson> To stop someone with the secret key from being able to identity hashes that are the same in the database?
13618:41 <sipa> it's not a combination
13718:41 <jnewbery> if the aim of the salt is to prevent collisions, then why do we not need to worry about that for blocks?
13818:41 <felixweis> you need a keyed hash function, siphash is that
13918:41 <michaelfolkson> A salted siphash is not a combination of salting and siphash?
14018:41 <sipa> siphash is the gold standard for hash table bucket hashing to prptect against collisions... and siphash uses a salt
14118:42 <michaelfolkson> Oh. All siphash uses a salt... ok
14218:42 <sipa> no, siphash is a PRF. it takes as input a key (="salt") and a message
14318:42 <felixweis> because the work to create a valid block hash is many magnitudes higher than creating an unconfimed tx
14418:42 <willcl_ark> felixweis: bingo!
14518:42 <jnewbery> felixweis: correct
14618:43 <willcl_ark> OK next Q: what is a boost multi-index? How is the SaltedTxidHasher used in the mempool multi-index?
14718:43 <dhruvm> A multi_index maintains multiple indices on the same data set. index_transaction_set is a multi_index on the mempool. SaltedTxidHasher is used to maintain the mempool sorted by txid for fast lookups by txid.
14818:44 <jnewbery> if someone wanted to create a collision in our BlockMap, they'd need to be mining valid blocks which _also_ collide in the BlockHasher function
14918:45 <willcl_ark> dhruvm: correct!
15018:45 <willcl_ark> We can see this here: https://github.com/bitcoin/bitcoin/blob/6a480636/src/txmempool.h#L519-L527
15118:45 <jonatack> txid and wtxid, though i'm sure dhruvm meant both
15218:46 <willcl_ark> jonatack: I thought he did :)
15318:46 <dhruvm> good reminder. yeah i meant the canonical id used for the tx.
15418:47 <willcl_ark> OK final question, this PR introduces a new `SaltedSipHasher` that takes a `Span` as input, what are the advantages to this?
15518:47 <michaelfolkson> In addition to the advantages of using Span generally?
15618:48 <dhruvm> The new implementation allows arbitrary strings to be SipHash'd.
15718:48 <sipa> span span span span
15818:48 <michaelfolkson> I guess if you are introducing Span then you want to be able to hash them
15918:48 <willcl_ark> lovely span
16018:48 <michaelfolkson> sipan
16118:49 <michaelfolkson> invented by sipa
16218:49 <willcl_ark> he's very conspicuous with the naming
16318:49 <willcl_ark> dhruvm: right, this would allow a single hasher to be used with any of `uint256`, `CPubKey`, `CScript`, `CKeyId`, and generics like `std::vector<unsigned char>`
16418:50 <willcl_ark> Are there any places where we _wouldn't_ be able to use such a "generic" hasher?
16518:51 <dhruvm> i think we cannot use it where the type doesn't implement Span?
16618:51 <dhruvm> *key type
16718:53 <sipa> you don't need to "implement span" in general
16818:53 <sipa> types just need a begin() and end() and size()
16918:53 <dhruvm> oh sorry, i thought there were methods that need to exist on the type. But I need to study up on span.
17018:54 <dhruvm> i see. not inheritance, or "implementing an interface". cool!
17118:54 <joelklabo> what C++ version does core use? span seems new in C++ 20?
17218:54 <willcl_ark> I think Core has a custom re-implementation
17318:55 <jonatack> joelklabo: c++11 until very recently, now 17
17418:55 <willcl_ark> (of Span)
17518:55 <michaelfolkson> https://bitcoin.stackexchange.com/questions/100545/what-version-of-c-is-used-in-bitcoin-core
17618:55 <dhruvm> joelklabo: src/span.h implements a subset of the C++20 span
17718:55 <jonatack> see src/span.h
17818:55 <joelklabo> thanks
17918:55 <willcl_ark> It's not easy to use the `SaltedSipHasher` with `Span` in places where the keyed hash is saved to disk, e.g. https://github.com/bitcoin/bitcoin/blob/3f205808/src/addrman.cpp#L14. If the hash function is changed, then the object would no longer hash to the same digest.
18018:56 <michaelfolkson> C++17 due for v22 apparently
18118:56 <felixweis> When leaving c++11 behind, why not go all the way to 20 ?
18218:56 <jnewbery> felixweis: c++20 was finalized yesterday
18318:57 <jnewbery> oh, maybe it was a few weeks ago
18418:57 <willcl_ark> Well that was the final question, so I think it's time to
18518:57 <willcl_ark> #endmeeting
18618:58 <willcl_ark> thanks everyone for coming
18718:58 <jnewbery> oh it got published yesterday: https://www.iso.org/standard/79358.html
18818:58 <jonatack> thanks willcl_ark!
18918:58 <murtyjones> thank you willcl_ark!
19018:58 <dhruvm> thank you, willcl_ark !
19118:58 <jnewbery> one thing before everyone goes
19218:58 <felixweis> Thanks willcl_ark!
19318:58 <joelklabo> thanks willcl_ark !
19418:58 <finite> thanks willcl_ark
19518:58 <michaelfolkson> I know you jnewbery wanted to only allow parts of C++17. Was it consensus that all C++17 should be allowed in Core v22? I know people argued for this
19618:58 <felixweis> And Newbery for another awesome year of core review club
19718:58 <fodediop> thank you willcl_ark!
19818:59 <caralie> thank you!
19918:59 <joelklabo> yeah, I look forward to these every week jnewbery
20018:59 <emzy> thank you!
20118:59 <michaelfolkson> Thanks willcl_ark jnewbery
20218:59 <jnewbery> There's a short survey at https://forms.gle/ZLZcxVRoAozVcAyE6 about review club this year. It'd be very helpful if you could fill it out and let me know how we could make review club even better next year.
20318:59 <lightlike> thanks!
20419:00 <dhruvm> Thanks jnewbery for PR review club. It has been very useful to my onboarding journey.
20519:00 <jnewbery> As always, feel free to contact me directly if you have any suggestions or feedback.
20619:00 <svav> Thanks
20719:00 <willcl_ark> Thanks jnewbery for organising these!
20819:00 <thomasb06> thanks
20919:00 <jnewbery> And have a fantastic Christmas and New Year. See you all in 2021 🚀
21019:01 <fodediop> Happy NNew Year 🎉
21119:01 <jonatack> don't forget to leave your review feedback on the PR (and keep revewing other PRs if you're inclined to do so :)
21219:02 <jonatack> s/revewing/reviewing and testing/
21319:02 <felixweis> Guten Rutsch!
21419:03 <anir> Thanks willcl_ark jnewbery!
21519:08 <wumpus> felixweis | When leaving c++11 behind, why not go all the way to 20 ? <- if only :) every bump of the c++ version is a struggle, which depends on compiler support in even the most slow linux distros
21619:09 <sipa> also afaik no fully C++20 compliant compiler exists today
21719:21 <felixweis> wumpus, sipa: thanks