Bitcoin Core PR Review Club

Move SaltedHashers to separate file and add some new ones (refactoring)

Dec 16, 2020

https://github.com/bitcoin/bitcoin/pull/19935

Host: willcl-ark - PR author: achow101

The PR branch HEAD was 281fd1a at the time of this review club meeting.

Notes

This refactor PR moves existing code into new src/util/hasher.{cpp|h} files.

Along the way, it touches some interesting code in:

src/coins.{cpp|h}
src/index/blockfilter.h
src/script/sigcache.h
src/txmempool.{cpp|h}
src/validation.h

Hashers are generally used to construct hash maps, which is a data structure that has fast item lookup, addition and removal. They allow lookup of elements in 0(1) time in the typical/best-case scenario although implementation can affect this.

The C++ standard library includes a hash map (std::unordered_map). For a more detailed description of the hash functions used in std::unordered_map, see Hash functions for C++ Unordered Containers.

BlockHasher

BlockHasher is used by src/validation.h::BlockMap to construct a hash map of blocks and their sequence ID (block height).

BlockManager uses the BlockMap when adding a new block to the index with AddBlockToIndex(), first checking whether it’s a duplicate (src/validation.cpp#L3106) and then adding the new block and sequence ID to the index.

SignatureCacheHasher

SignatureCacheHasher is used to avoid performing costly verification of ECDSA signatures twice; once when accepted into the mempool and again when accepted into a block.

The cache type currently used for CSignatureCache is “cuckoo cache”, implemented in PR 8895.

FilterHeaderHasher

FilterHeaderHasher is used by src/index/blockfilterindex.h::BlockFilterIndex to construct an index of (BIP157) filter headers for each filter type enabled.

However, BlockFilterIndex only consults the m_headers_cache lookup if the block index is at a “checkpoint” height – every 1000 blocks per BIP157.

SaltedOutpointHasher

The Coin object requires CCoinsMap to use SaltedOutpointHasher in its map of CCoinsCacheEntry(s).

CCoinsMap is used by CCoinsViewCache with most of its methods. As discussed in the review club meeting for PR 18113 each CCoinsCacheEntry contains both the Coin itself in addition to the flags associated with that coin.

SaltedTxidHasher

This is currently used in the mempool’s multi-index.

SaltedSipHasher

SaltedSipHasher is a generic siphasher that can be used with many objects (via the magic of Span #18468)

Questions

Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK?
Why do we need to specify a “hasher” when defining a std::unordered_map?
What exactly is a Coin as used by CCoinsMap and hashed using SaltedOutpointHasher?
Why does src/coins.cpp use a SaltedOutpointHasher rather than a SaltedTxidHasher?
What is a “salt” and why are they used in SaltedTxidHasher, SaltedOutpointHasher and SaltedSipHasher?
Why are salts not used in BlockHasher and FilterHeaderHasher?
What is a boost multi-index? How is the SaltedTxidHasher used in the mempool multi-index?
What is the advantage of introducing a new SaltedSipHasher that takes a Span as input? Where would we not be able to use this?

Meeting Log

18:00

<willcl_ark> #startmeeting

18:00

<willcl_ark> Hello everyone, welcome to the final review club before christmas!

18:00

<finite> hi

18:00

<dhruvm> hi

18:00

<felixweis> hi

18:00

<samuel-pedraza> hi!

18:00

<elle> hi

18:00

<murtyjones> hello!

18:00

<michaelfolkson> hi

18:00

<svav> Hi

18:00

<willcl_ark> Please feel free to say hi if you're here :)

18:01

<lightlike> hi

18:01

<thomasb06> hi

18:01

<joelklabo> hi

18:01

<emzy> hi

18:01

<willcl_ark> Today we're going to be looking at #19935: "Move SaltedHashers to separate file and add some new ones."

18:02

<willcl_ark> A quick reminder of some conventions, you don't have to ask to ask a question (e.g. "I have a question about x but I don't know if it's on-topic?") just go ahead and ask.

18:02

<willcl_ark> OK, so did everyone get a chance to review the PR? How about a quick y/n?

18:02

<murtyjones> y

18:02

<dhruvm> y

18:02

<joelklabo> y

18:02

<svav> y

18:02

<michaelfolkson> y

18:02

<emzy> n

18:03

<kouloumos> hi

18:03

<kouloumos> n

18:03

<jnewbery> hi

18:03

<thomasb06> n (studied Bloom filters instead)

18:03

<michaelfolkson> Why bloom filters?

18:03

<lightlike> y

18:03

<michaelfolkson> Just fancied it thomasb06?

18:03

<jnewbery> y

18:03

<samuel-pedraza> y

18:04

<thomasb06> because they use hach functions too

18:04

<willcl_ark> that's great! The PR was primarily refactor, but it touches some interesting parts of code which I want to dive into

18:04

<willcl_ark> So first question to get us started, why do we need to specify a “hasher” when defining a std::unordered_map?

18:04

<dhruvm> The Hasher tells std::unordered_map how to make deterministic buckets based on the keys which result in O(1) lookups.

18:05

<thomasb06> michaelfolkson: if I understood well the table is T[h_i(m)] for h_i a hash function

18:05

<michaelfolkson> Without a hasher you wouldn't know which bucket an element should be placed in

18:05

<kouloumos> because the standard library has no implementations for user-defined classes

18:06

<joelklabo> user-defined objects need to have their own hashing implementation

18:06

<joelklabo> what kouloumos said

18:06

<willcl_ark> dhruvm, michaelfolkson, kouloumos joelklabo right! std::unordered_map doesn't know how it should hash our (custom) bitcoin objects.

18:08

<willcl_ark> next question: what exactly is a `Coin` as used by `CCoinsMap` and hashed using `SaltedOutpointHasher`?

18:08

<murtyjones> Is it a UTXO entry?

18:08

<dhruvm> A Coin is a UTXO. The canonical representation of that is CTXOutPoint(txid, n) which are the elements SipHash'ed by SaltedOutpointHasher.

18:10

<willcl_ark> murtyjones: dhruvm exactly!

18:11

<willcl_ark> The concept of a `Coin` was introduced in https://github.com/bitcoin/bitcoin/pull/10195

18:11

<joelklabo> I have a quick C++ question, is putting SaltedOutpointHasher() in the public interface of itself how you create a singleton? A little confused by the syntax

18:12

<jnewbery> joelklabo: do you mean here: https://github.com/bitcoin/bitcoin/blob/b440c331/src/coins.h#L92 ? That's declaring the constructor

18:12

<dhruvm> joelklabo: I am not sure it is a singleton

18:13

<michaelfolkson> That PR 10195 "switched chainstate db and cache to per-txout model". Before it was a per-tx model?

18:13

<michaelfolkson> Looks like an interesting PR

18:13

<dhruvm> It is used by stdlib when using CCoinsMap

18:13

<dhruvm> here: https://github.com/bitcoin/bitcoin/blob/b440c331/src/coins.h#L157

18:13

<joelklabo> Ah, ok but I was thinking the salt would need to be consistent for the lifetime of the process, so things don't move around, I'll look that up. Thanks

18:14

<fodediop> hi

18:14

<dhruvm> joelklabo: you're right that only one instance of SaltedOutpointHasher is ever created in binary lifetime

18:14

<dhruvm> but it is not using the singleton code pattern. You could initialize another instance if needed.

18:15

<joelklabo> thanks, dhruvm, still learning some basic C++ stuff

18:15

<dhruvm> joelklabo: we all are :)

18:15

<willcl_ark> joelklabo: we've also got a question on salts coming up soon

18:16

<michaelfolkson> Didn't realize only changed to per-txout model in 2017...

18:17

<willcl_ark> Alright, so why does src/coins.cpp use a SaltedOutpointHasher rather than a SaltedTxidHasher?

18:17

<murtyjones> Using SaltedTxidHasher would require storing all of the outputs in a list in the value, vs. using the outpoint to point to one specific UTXO.

18:17

<dhruvm> txid alone is not a canonical representation of a Coin/UTXO. The specific txOut must be referenced.

18:17

<jnewbery> dhruvm: I don't think you're right that only one instance is ever created. We'll create a new one for each CCoinsMap object.

18:18

<dhruvm> jnewbery: ah, you're right. it's just a typedef there

18:19

<willcl_ark> murtyjones: dhruvm right. If we used SaltedTxidHasher then all the UTXOs from the same transaction would hash to the same result which would be, sub-optimal.

18:20

<willcl_ark> Are there any potential downsides to using an Outpoint-based system?

18:20

<dhruvm> willcl_ark: More buckets?

18:20

<murtyjones> according to the PR linked, larger on disk representation

18:21

<joelklabo> also slower lookup due to one entry per output v one per tx

18:22

<sipa> before 0.15 a per-tx model was used for the chainstste

18:23

<willcl_ark> yes! all good potential tradeoffs to beware of.

18:23

<sipa> its primary downside was that it required reading all unspent outputs of a tx left from disk any time any of them were spent

18:23

<felixweis> leveldb has some logic to not use more disk data when the key prefix is the same

18:23

<sipa> this led to a DOS attack

18:23

<sipa> *vulnerability

18:24

<michaelfolkson> How would that vuln be exploited?

18:24

<michaelfolkson> By sending lots of fake transactions to a peer to validate?

18:25

<MarcoFalke> michaelfolkson: Create a tx with lots of outputs

18:25

<sanket1729> I guess by creating a transaction lotsof inputs

18:25

<michaelfolkson> Like lots and lots. On order of hundreds?

18:25

<felixweis> create many txs with many outputs

18:26

<sipa> first creste txn with lots of outputs, then create a tx that spends one utxo from each

18:26

<felixweis> then create a tx with many inputs, each spending from a different large prior tx

18:26

<sipa> now you may need gigabytes of memory to load all the utxos from all those

18:26

<felixweis> now you need to keep it all in mem

18:26

<jonatack> hi

18:27

<felixweis> there were demos of how to OOM iirc

18:27

<michaelfolkson> Cool

18:27

<felixweis> i vaguely remember something about paris

18:27

<felixweis> and breaking bitcoin

18:27

<sipa> indeed

18:27

<willcl_ark> Using an Outpoint-based model also affects the ability to test directly against the UTXO set, but I don't think this is an issue.

18:28

<felixweis> and some discussion about responsible disclosure

18:28

<jnewbery> I think we're getting a little bit off topic :)

18:28

<michaelfolkson> Oh it was that.... I remember haha

18:28

<willcl_ark> Ok next question: What is a “salt” and why are they used in SaltedTxidHasher, SaltedOutpointHasher and SaltedSipHasher?

18:29

<murtyjones> A salt is random data included in the input to a hash function. My guess is that it's used with those structures to prevent malicious construction of colliding values?

18:29

<dhruvm> The salt makes the output of the hash function unpredictable to parties that know the input but not the salt. The hash function remains deterministic so long as the same salt is used. The salt is updated every time the containers are initialized. As an example, SaltedTxidHasher is used in a multi_index for the mempool, which seems to be a non-persistent data structure.

18:29

<felixweis> salt is usually used in hashers so an attacker can not know how the buckets are being constructed. otherwise you can create O(n) behaviour

18:29

<michaelfolkson> Right... so salts are generally used for things like storing passwords. If you didn't salt them you would be able to see everyone who uses "Password" as their password

18:30

<michaelfolkson> By salting them you don't see lots of people with the same hashed password

18:30

<willcl_ark> correct! std::unordered_set puts objects into buckets based on their hash value. An adversary who can create hash collisions at will can force a large number of items into the same bucket meaning that lookup time goes from 0(1) to 0(n).

18:31

<dhruvm> Wouldn't that require useful work though?

18:31

<dhruvm> Like creating a valid transaction that results in a specific txid?

18:31

<joelklabo> I guess the same could happen if you had a "bad" hashing algorithm that wasn't evenly distrubuted

18:32

<sipa> note that "who can create hash collisions" doesn't mean much... for these data structures the hashes are maybe 10-230 bits in size

18:32

<sipa> eh, 10-30

18:32

<lightlike> Why are there two uint64_t (m_k0, m_k1) used as salts in SaltedSipHasher instead of just one?

18:33

<sipa> siphash happens to use a 128-bit key

18:33

<michaelfolkson> What was the DoS attack that motivated the invention of siphash?

18:33

<willcl_ark> Does anyone have any ideas about an attack that might be possible if lookup time was affected in this way?

18:34

<jnewbery> https://www.aumasson.jp/siphash/siphash.pdf for those interested in learning more about siphash

18:34

<sipa> michaelfolkson: the paper would be the best place to look i think

18:34

<lightlike> thanks, will take a look

18:35

<jnewbery> (or https://github.com/bitcoin/bitcoin/blob/b440c331/src/crypto/siphash.cpp if you prefer reading code)

18:35

<jonatack> and then review https://github.com/bitcoin/bitcoin/pull/18014

18:36

<sipa> as i know someone who was confused about this before: i didn't invent siphash

18:36

<michaelfolkson> haha not sipahash

18:36

<willcl_ark> hehe, easy to mistakenly assume

18:37

<dhruvm> i certainly thought that until i saw the paper

18:37

<michaelfolkson> jonatack: Coming up to a year open on that one...

18:37

<willcl_ark> Alright, why are salts *not* used in BlockHasher and FilterHeaderHasher?

18:37

<jnewbery> is it true that you invented sippy cups though?

18:38

<dhruvm> Maybe because those hashers are used in persistent data structures? BlockHasher is used in BlockMap/m_block_index. (Un)LoadBlockIndex persist it to disk.

18:38

<dhruvm> But I am not sure, as in that case, it'd be beneficial to just persist the salt as well?

18:38

<michaelfolkson> jnewbery: What is a sippy cup? A party cup? Off topic...

18:38

<willcl_ark> dhruvm: I thought so too at first, although I think we could easily also write the salt to disk

18:39

<dhruvm> willcl_ark: exactly

18:40

<michaelfolkson> OK... so I get the benefits of salting - use random number in your hash. Like a nonce in dig sigs I guess... I get the benefits of siphash, need secret key to generate.

18:40

<michaelfolkson> But why do you need a combination of both?

18:41

<michaelfolkson> To stop someone with the secret key from being able to identity hashes that are the same in the database?

18:41

<sipa> it's not a combination

18:41

<jnewbery> if the aim of the salt is to prevent collisions, then why do we not need to worry about that for blocks?

18:41

<felixweis> you need a keyed hash function, siphash is that

18:41

<michaelfolkson> A salted siphash is not a combination of salting and siphash?

18:41

<sipa> siphash is the gold standard for hash table bucket hashing to prptect against collisions... and siphash uses a salt

18:42

<michaelfolkson> Oh. All siphash uses a salt... ok

18:42

<sipa> no, siphash is a PRF. it takes as input a key (="salt") and a message

18:42

<felixweis> because the work to create a valid block hash is many magnitudes higher than creating an unconfimed tx

18:42

<willcl_ark> felixweis: bingo!

18:42

<jnewbery> felixweis: correct

18:43

<willcl_ark> OK next Q: what is a boost multi-index? How is the SaltedTxidHasher used in the mempool multi-index?

18:43

<dhruvm> A multi_index maintains multiple indices on the same data set. index_transaction_set is a multi_index on the mempool. SaltedTxidHasher is used to maintain the mempool sorted by txid for fast lookups by txid.

18:44

<jnewbery> if someone wanted to create a collision in our BlockMap, they'd need to be mining valid blocks which _also_ collide in the BlockHasher function

18:45

<willcl_ark> dhruvm: correct!

18:45

<willcl_ark> We can see this here: https://github.com/bitcoin/bitcoin/blob/6a480636/src/txmempool.h#L519-L527

18:45

<jonatack> txid and wtxid, though i'm sure dhruvm meant both

18:46

<willcl_ark> jonatack: I thought he did :)

18:46

<dhruvm> good reminder. yeah i meant the canonical id used for the tx.

18:47

<willcl_ark> OK final question, this PR introduces a new `SaltedSipHasher` that takes a `Span` as input, what are the advantages to this?

18:47

<michaelfolkson> In addition to the advantages of using Span generally?

18:48

<dhruvm> The new implementation allows arbitrary strings to be SipHash'd.

18:48

<sipa> span span span span

18:48

<michaelfolkson> I guess if you are introducing Span then you want to be able to hash them

18:48

<willcl_ark> lovely span

18:48

<michaelfolkson> sipan

18:49

<michaelfolkson> invented by sipa

18:49

<willcl_ark> he's very conspicuous with the naming

18:49

<willcl_ark> dhruvm: right, this would allow a single hasher to be used with any of `uint256`, `CPubKey`, `CScript`, `CKeyId`, and generics like `std::vector<unsigned char>`

18:50

<willcl_ark> Are there any places where we _wouldn't_ be able to use such a "generic" hasher?

18:51

<dhruvm> i think we cannot use it where the type doesn't implement Span?

18:51

<dhruvm> *key type

18:53

<sipa> you don't need to "implement span" in general

18:53

<sipa> types just need a begin() and end() and size()

18:53

<dhruvm> oh sorry, i thought there were methods that need to exist on the type. But I need to study up on span.

18:54

<dhruvm> i see. not inheritance, or "implementing an interface". cool!

18:54

<joelklabo> what C++ version does core use? span seems new in C++ 20?

18:54

<willcl_ark> I think Core has a custom re-implementation

18:55

<jonatack> joelklabo: c++11 until very recently, now 17

18:55

<willcl_ark> (of Span)

18:55

<michaelfolkson> https://bitcoin.stackexchange.com/questions/100545/what-version-of-c-is-used-in-bitcoin-core

18:55

<dhruvm> joelklabo: src/span.h implements a subset of the C++20 span

18:55

<jonatack> see src/span.h

18:55

<joelklabo> thanks

18:55

<willcl_ark> It's not easy to use the `SaltedSipHasher` with `Span` in places where the keyed hash is saved to disk, e.g. https://github.com/bitcoin/bitcoin/blob/3f205808/src/addrman.cpp#L14. If the hash function is changed, then the object would no longer hash to the same digest.

18:56

<michaelfolkson> C++17 due for v22 apparently

18:56

<felixweis> When leaving c++11 behind, why not go all the way to 20 ?

18:56

<jnewbery> felixweis: c++20 was finalized yesterday

18:57

<jnewbery> oh, maybe it was a few weeks ago

18:57

<willcl_ark> Well that was the final question, so I think it's time to

18:57

<willcl_ark> #endmeeting

18:58

<willcl_ark> thanks everyone for coming

18:58

<jnewbery> oh it got published yesterday: https://www.iso.org/standard/79358.html

18:58

<jonatack> thanks willcl_ark!

18:58

<murtyjones> thank you willcl_ark!

18:58

<dhruvm> thank you, willcl_ark !

18:58

<jnewbery> one thing before everyone goes

18:58

<felixweis> Thanks willcl_ark!

18:58

<joelklabo> thanks willcl_ark !

18:58

<finite> thanks willcl_ark

18:58

<michaelfolkson> I know you jnewbery wanted to only allow parts of C++17. Was it consensus that all C++17 should be allowed in Core v22? I know people argued for this

18:58

<felixweis> And Newbery for another awesome year of core review club

18:58

<fodediop> thank you willcl_ark!

18:59

<caralie> thank you!

18:59

<joelklabo> yeah, I look forward to these every week jnewbery

18:59

<emzy> thank you!

18:59

<michaelfolkson> Thanks willcl_ark jnewbery

18:59

<jnewbery> There's a short survey at https://forms.gle/ZLZcxVRoAozVcAyE6 about review club this year. It'd be very helpful if you could fill it out and let me know how we could make review club even better next year.

18:59

<lightlike> thanks!

19:00

<dhruvm> Thanks jnewbery for PR review club. It has been very useful to my onboarding journey.

19:00

<jnewbery> As always, feel free to contact me directly if you have any suggestions or feedback.

19:00

<svav> Thanks

19:00

<willcl_ark> Thanks jnewbery for organising these!

19:00

<thomasb06> thanks

19:00

<jnewbery> And have a fantastic Christmas and New Year. See you all in 2021 🚀

19:01

<fodediop> Happy NNew Year 🎉

19:01

<jonatack> don't forget to leave your review feedback on the PR (and keep revewing other PRs if you're inclined to do so :)

19:02

<jonatack> s/revewing/reviewing and testing/

19:02

<felixweis> Guten Rutsch!

19:03

<anir> Thanks willcl_ark jnewbery!

19:08

<wumpus> felixweis | When leaving c++11 behind, why not go all the way to 20 ? <- if only :) every bump of the c++ version is a struggle, which depends on compiler support in even the most slow linux distros

19:09

<sipa> also afaik no fully C++20 compliant compiler exists today

19:21

<felixweis> wumpus, sipa: thanks