Add Muhash3072 implementation in Python (tests, math and cryptography)

Jun 10, 2020

https://github.com/bitcoin/bitcoin/pull/19105

Host: fjahr - PR authors: sipa , fjahr

The PR branch HEAD was a6ee4c2cee at the time of this review club meeting.

Notes

This is the second in a series of PRs implementing the Coin Statistics Index. #18000 provides an overview of the work and is used to keep track of the different PRs.
The first PR in the series was #19055, featured in PR review club two weeks ago. If you haven’t seen it, it would be beneficial to read that conversation first. There are also some links to papers in the description that might help you understand the theoretical background of the algorithm better.
Based on the feedback from the last session, the scope of PR 19055 has since been reduced. Today’s PR adds the Python implementation of MuHash3072 which was originally in #19055. #19055 now only adds the C++ implementation.
The previous session was more heavily focussed on conceptual understanding of the UTXO set statistics index project as a whole. This week, we’ll focus on understanding the implementation of MuHash3072 in Python.
The next PR in the series is #19145. It may be interesting to skip ahead to see how the code is used but if you participated in the previous review club session you have already seen some of that code.
This PR follows the pattern to check the C++ implementation of an algorithm against the Python implementation of the same algorithm. Another example of this pattern is Siphash.

Questions

What is the API of MuHash3072 in Python? Is it different from the C++ implementation in PR 19055?
There is a numerator and a denominator property in the Python implementation. Can you explain how these are used in MuHash on a high level? Where are they in the C++ implementation?
ChaCha20 is used in MuHash. What are the properties of that algorithm?
What is ChaCha20 used for in particular in MuHash? Explain why this is needed.
What does the modinv method do? What is it’s role in finalizing the hash (in the digest method)?
What are your thoughts theStack’s suggestion to use Fermat’s little theorem in the code and the trade-offs involved?
How do you like the test to verify parity with the unit test of the C++ implementation? Would have done it differently? Should it be extended? Should it be moved to a different place?

Meeting Log

13:00

<fjahr> #startmeeting

13:00

<fjahr> Hello everyone, welcome to this weeks PR review club. We are reviewing the Python implementation of MuHash3072 (#19105) which part of CoinStatsIndex (overview in #18000). Notes and questions are here: https://bitcoincore.reviews/19105.html.

13:00

<fjahr> I am drawing some comparisons to the C++ implementation in the questions but the focus of this session should definitely be on understanding how the algorithm works in the Python implementation. If you have any questions don't hesitate to ask, there can be multiple discussions happening in parallel.

13:00

<fjahr> feel free to say hi :)

13:00

<jnewbery> hi

13:00

<kanzure> hi

13:00

<troygiorshev> hi

13:00

<gimballock> Hi

13:00

<andrewtoth> hi

13:00

<lightlike> hi

13:00

<michaelfolkson> hi

13:00

<raj_149> hi

13:00

<vindard> hi

13:00

<raj_149> hi

13:01

<fjahr> who had a chance to review the PR? (y/n)

13:01

<provoostenator> hi

13:01

<Ravi_21M> hi

13:01

<provoostenator> y

13:01

<jnewbery> y

13:01

<elichai2> hi

13:01

<raj_149> y

13:01

<emzy> HI

13:01

<MM77788811> Hi

13:01

<emzy> n

13:01

<vindard> y

13:01

<troygiorshev> n

13:01

<lightlike> y

13:01

<andrewtoth> n

13:01

<nehan> hi

13:01

<nehan> y

13:02

<fjahr> Great. There were already some nice review comments. Let's get started with a warm-up question: What is the API of MuHash3072 in Python? Is it different from the C++ implementation in PR 19055?

13:02

<sipa> hi, y

13:03

<raj_149> insert, remove and digest?

13:04

<elichai2> that it doesn't overload the `/=` and `*=` operators and instead use named functions?

13:04

<fjahr> raj_149: yes, how about the constructors?

13:04

<fjahr> elichai2: yepp, that as well

13:06

<sipa> look at the inputs to those insert/remove/*=//=

13:08

<elichai2> That the C++ API accepts MuHash3072 objects and the python accepts raw data and then turn it into a MuHash3072?

13:09

<fjahr> Yepp. And there is an empty default constructor in C++ which we don't have in Python.

13:09

<fjahr> Next Q: There is a numerator and a denominator property in the Python implementation. Can you explain how these are used in MuHash on a high level? Where are they in the C++ implementation?

13:09

<elichai2> isn't this an empty constructor though? https://github.com/bitcoin/bitcoin/blob/a6ee4c2ceee624d1d3ed1dfa4bd6f259139bb9d8/test/functional/test_framework/muhash.py#L63

13:10

<nehan> to keep running products in the numerator/denominator because inversing things is expensive. so inverse once at the end.

13:10

<fjahr> elichai2: Oh, ah, true :D

13:11

<lightlike> numerator to add elements to the set, denominator to remove elements from the set

13:11

<fjahr> nehan: lightlike: right, did you compare it to the C++ code?

13:12

<raj_149> fjahr: there isn't one in c++ it seems?

13:12

<troygiorshev> It looks like the c++ code doesn't current take advantage of this. It computes the inverse every time we want to remove an element

13:12

<nehan> yes but i don't see where there are numerators/denominators in the C++

13:13

<sipa> the MuHash3072 object in the C++ code is lower level; the numerator/denominator is something higher-level code can do if it needs it

13:13

<elichai2> LOL. I only now realized that 386BYTES == 3072bits. the "hash384" confused me to think it uses some 384bits hash(sha384 or something hehe)

13:13

<sipa> one reason is that you may want to cache MuHash3072 objects for running hashes or so, after the inversion (so it's only 384 bytes and not 768)

13:14

<sipa> elichai2: no, 3072/8 = 384

13:14

* elichai2 🤦

13:14

<elichai2> ops yeah 384 not 386, typo

13:14

<troygiorshev> sipa: ah makes sense

13:14

<lightlike> yes, that took me also more time than it should have :-)

13:15

<sipa> but if that's not going to happen, perhaps the numerator/denominator (and also the SHA256'ing) should be integrated into MuHash3072 rather than forcing higher-level code to take care of that

13:15

<jnewbery> sipa: I haven't looked at the c++ code. When you say caching, do you mean that you're saving the 384 byte chacha digest for each block?

13:16

<sipa> jnewbery: i don't know?

13:16

<sipa> it depends on the use case

13:17

<sipa> one could be that you'd precompute the muhash "effect" of every transaction in the mempool for example

13:17

<jnewbery> what do you mean by 'cache muhash 3072 objects for running hashes or so'?

13:17

<provoostenator> I think fjahr said in the C++ PR that only the sha256 hash is stored in an index

13:17

<jnewbery> ah ok

13:17

<sipa> or per block, if you want to be able to roll forward/backward from that block

13:17

<provoostenator> See discussion here-ish: https://github.com/bitcoin/bitcoin/pull/19055#issuecomment-641996489

13:17

<fjahr> provoostenator: this is something different

13:18

<jnewbery> and currently the c++ implementation inverts every utxo spend rather than multiplying them all and then inverting once?

13:18

<sipa> jnewbery: iirc fjahr's code does one inversion per block

13:19

<fjahr> it's on the calculation for each block where I add new utxos and removed utxos separately and then do one division att the end

13:19

<fjahr> sipa: yes

13:19

<jnewbery> sipa fjahr: thanks

13:19

<fjahr> provoostenator: you asked about this in your first pass on the index as well

13:20

<sipa> as it is currently used it would be perfectly reasonable to have the numerator/denominator inside the MuHash3072 object - but if there was ever a desire to cache a full MuHash3072 object in a place where memory usage matters, this may be undesirable

13:21

<nehan> This is where the division is done per block: https://github.com/bitcoin/bitcoin/pull/18000/commits/1502a08b4a7ea8987271e87ad73715c565a916fc#diff-56c85843f4edd7acc62fbf7af80a67d1R197

13:21

<fjahr> nehan: thank you!

13:22

<fjahr> yeah, so instead of removing each undo instead they are added up and then removed together

13:23

<troygiorshev> sipa: IIRC one of the other incremental hashes had a much smaller state

13:23

<sipa> troygiorshev: ECMH has 32 bytes state

13:23

<troygiorshev> ECMH maybe

13:23

<sipa> well, 33

13:24

<elichai2> in storage or in memory? :P

13:24

<sipa> it's complicated

13:25

<elichai2> sipa: I asked troygiorshev hehe

13:25

<raj_149> are we planning on storing the the rolling hash to disk?

13:25

<fjahr> Maybe next question but keep going if the last part is still unclear to anyone! ChaCha20 is used in MuHash. What are the properties of that algorithm? What is ChaCha20 used for in particular in MuHash?

13:25

<sipa> in compressed form, 33 bytes; in affine coordinates, 64 bytes; in jacobian coordinates, 96 bytes; in denormalized jacobian representation, 120 bytes :p

13:26

<elichai2> :D

13:26

<elichai2> still less than 384 bytes though 😅

13:26

<jnewbery> sipa elichai2: I think we're gettin a bit offtopic!

13:26

<sipa> sorry.

13:27

<raj_149> fjahr: its converting 32 bytes to 384 bytes firstly. Is there any other purpose its serving?

13:28

<fjahr> raj_149: yes, but why?

13:28

<raj_149> fjahr: because muhash numerator/denominator needs to be a 384 byte data?

13:29

<sipa> perhaps a useful question is: why not use ChaCha20 directly, without SHA256?

13:29

<jnewbery> sipa's original post here might be helpful: https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2017-May/014337.html

13:30

<jnewbery> Spcifically "As a result, MuHash modulo a sufficiently large safe prime is provably secure under the DL assumption. Common guidelines on security parameters [7] say that 3072-bit DL has about 128 bits of security."

13:31

<nehan> i think you need at least 3072 bits of security. 384 bytes is less than 512 bytes.

13:32

<raj_149> sipa: i am a little confused here, where exactly we are mixing chacha20 with sha256?

100

13:32

<provoostenator> Also if you transform 256 bits to 3072 bits, do you actually get 3072 bits of security? That always give me headaches.

101

13:32

<nehan> so if you were willing to use MuHash on 4096 bit integers you wouldn't need chacha

102

13:32

<sipa> nehan: i think you're confused, but i can't say where

103

13:32

<nehan> sipa: yeah probably :) i'm guessing

104

13:32

<sipa> nehan: maybe you're confusing bits with bytes?

105

13:32

<provoostenator> MuHash works with 256 bit blocks

106

13:33

<provoostenator> And doesn't compress

107

13:33

<sipa> what?

108

13:33

<provoostenator> Uhhh

109

13:33

<sipa> it deals with 3072 bit integers

110

13:33

<provoostenator> ChaCha

111

13:33

<jnewbery> raj_149: sha256 is used twice. We first hash the object using sha256 to a 32 byte digest, and then use that digest as the key in the chacha20 rounds

112

13:33

<lightlike> is there an attack scenario wrt the specific use case of UTXO statistics that we need to be secure against?

113

13:33

<nehan> My guess is that you *could* use MuHash with 4096 bit integers instead of 3072, but it would require extra storage space

114

13:34

<jnewbery> sha256 is then used once there's a muhash output for the entire set to reduce it from 384 bytes back down to 32 bytes

115

13:34

<sipa> nehan: and slower multiplication

116

13:34

<provoostenator> lightlike: not really, but if we ever want to use it for consensus stuff - e.g. UTXO commitments, then we do have to worry

117

13:34

<sipa> raj_149: the UTXO is hashed first with SHA256 (formerly truncated SHA512) to produce 32 bytes; those 32 bytes are expanded to 384 bytes using ChaCha20; those 384 bytes are interpreted as 3072-bit integers and multiplied together; the result of that multiplication is SHA256'd again to produce a 32-byte hash

118

13:34

<nehan> sipa: I also thought when you said "perhaps a useful question is: why not use ChaCha20 directly, without SHA256?" you meant SHA512.

119

13:35

<sipa> nehan: the SHA512 was replaced with SHA256

120

13:35

<sipa> after some benchmarking

121

13:35

<fjahr> sipa: will be :)

122

13:35

<sipa> ah

123

13:35

<nehan> it's still 512 here: https://github.com/bitcoin/bitcoin/pull/19105/commits/400ac1cc9e7fb2983d4dced9512819c970ef9638#diff-832874419bea3a4c22ea70b000657e3eR55

124

13:35

<fjahr> didn't get around to push the new code

125

13:35

<jnewbery> ignore truncated SHA512 and just pretend we're saying sha256 :)

126

13:35

<sipa> sorry for confusing people, i should have checked

127

13:36

<troygiorshev> sipa: in you're question are you referring to the first usage of sha256 or the second?

128

13:36

<raj_149> sipa: just so that i understand it right, this is currently not the case with the python code right?

129

13:36

<jnewbery> raj_149: yes, this is the case with the python code

130

13:36

<sipa> ignore everything i said about SHA256; it's SHA512 for now

131

13:37

<raj_149> jnewbery: isn't chacha here already expecting a 32 byte data? its not hashing it anywhere.

132

13:37

<nehan> ok so ChaCha20 is a way of securely blowing up 256 bytes to 384 bytes

133

13:37

<fjahr> but trunkated to 256bits, which adds to the confusion

134

13:37

<troygiorshev> nehan: 32 to 384 yeah

135

13:37

<nehan> troygiorshe: oh right, thanks

136

13:37

<troygiorshev> nehan: or 256 to 3072 bits, whichever you'd prefer :)

137

13:38

<sipa> nehan: 256 *bits* to 384 *bytes*

138

13:38

<nehan> 32->384. that was my confusion :)

139

13:38

<sipa> yeah

140

13:38

<jnewbery> the data parameter in data_to_num3072 here: https://github.com/bitcoin/bitcoin/pull/19105/files#diff-832874419bea3a4c22ea70b000657e3eR53 should already by a 32 byte digest

141

13:38

<nehan> or I guess what the Python code says now is 64 truncated to 32 to 384

142

13:38

<sipa> the answer is just that ChaCha20 isn't a hash algorithm; it's an encryption algorithm, and it needs a random key to be secure (which we produce by hashing); it's actually encrypting all zeroes

143

13:39

<michaelfolkson> What was the motivation for using truncated SHA512 over SHA256?

144

13:39

<troygiorshev> 64 bit processors, right?

145

13:39

<sipa> michaelfolkson: this was discussed in the previous meeting on this topic, i think

146

13:39

<raj_149> jnewbery: ok got it, i thought its somewhere enforced in the function itself.

147

13:40

<sipa> but given that we're changing it to SHA256 maybe it isn't worth rehashing (haha)

148

13:40

<fjahr> discusion: https://github.com/bitcoin/bitcoin/pull/19055#issuecomment-640169988

149

13:40

<fjahr> michaelfolkson: ^

150

13:40

<troygiorshev> fjahr: thx

151

13:40

<michaelfolkson> Thanks

152

13:40

<jnewbery> I don't think it's worth discussing truncatedSHA512 -vs- SHA256 here. Just imagine it's a random oracle giving you a 32 byte output.

153

13:41

<fjahr> Ok, another question that was raised in an early review: What are your thoughts theStack’s suggestion to use Fermat’s little theorem in the code and the trade-offs involved?

154

13:41

<thomasb06> what are the ranges of 'a' and 'n-2'?

155

13:41

<andrewtoth> what about chacha20 making 384 bytes out of 32? Does it get any other inputs? If it's deterministic, won't the 384 bytes have the same security as 32 bytes?

156

13:42

<sipa> andrewtoth: they have 256 bits of entropy

157

13:42

<troygiorshev> fjahr: I agree with sipa's comment, saying "for those who understand this stuff, it's just as readable either way, and it's black magic for anyone who isn't familiar"

158

13:42

<nehan> fjahr: how often is modinv actually getting called? it seemed like once per digest, right? and the test only calls digest a few times.

159

13:42

<raj_149> fjahr: similar to stack's observation, i am getting one order of mag faster execution with fermet's theorem.

160

13:43

<nehan> raj_149: i think fermat's little theorem is supposed to be *slower*...

161

13:43

<sipa> raj_149: that's unexpected!

162

13:43

<jnewbery> I don't think performance matter too much in the test framework

163

13:43

<sipa> i was expecting the extgcd approach to be faster

164

13:43

<sipa> but agree that performance shouldn't matter

165

13:44

<nehan> i think it's less likely python's pow() will have a bug than this custom implementation of modinv so i'd go with pow :)

166

13:44

<thomasb06> if it doesn't overflow too

167

13:44

<troygiorshev> fwiw my testing agrees with TheStack

168

13:44

<raj_149> nehan: right, opposite, slower.. my bad..

169

13:44

<fjahr> nehan: yes but I am adding a more extensive test where it get's called more often. But in general I was just thinking of how we already are waiting for the ci environment. But it does not have too much of an impact, not comparable to a unnecessary sleep() or so

170

13:45

<nehan> i was thinking if you keep modinv you should add a unittest to ensure it's producing the same results as pow()

171

13:45

<sipa> nehan: that's a good argument - though it's the kind of algorithm that's unlikely to be wrong in a very small subset of cases

172

13:45

<jnewbery> one nice thing about the python implementation is that it's more accessible and readable for most people. I think key.py does a great job at documenting the algorithms: https://github.com/bitcoin/bitcoin/blob/6762a627ecb89ba8d4ed81a049a5d802e6dd75c2/test/functional/test_framework/key.py#L13

173

13:46

<fjahr> jnewbery: yeah, I think it's a good idea to use that one

174

13:46

<jnewbery> so having the same algorithm used as in the c++ implementation is useful because it acts as a teaching aid for people looking at the code

175

13:46

<nehan> jnewbery: good point!

176

13:46

<compress> jnewbery agreed

177

13:46

<raj_149> jnewbery: +1

178

13:46

<sipa> jnewbery: alternatively... a different algorithm in the tests means less likely that there's a bug repeated in both

179

13:47

<michaelfolkson> 2 tests

180

13:47

<jnewbery> (an argument against is that you want a completely different algorithm so you don't replicate a bug in the test code)

181

13:47

<jnewbery> sipa: +1

182

13:48

<fjahr> on a higher level we are comparing to the results of the c++ code so a incorrect modinv should pop up

183

13:48

<compress> so its a trade off of security/soundness vs readability / tech debt?

184

13:48

<nehan> so have multiple algorithms in the test and compare them

185

13:48

<fjahr> this brings me to my last question: How do you like the test to verify parity with the unit test of the C++ implementation? Would have done it differently? Should it be extended? Should it be moved to a different place?

186

13:49

<troygiorshev> are we really worried at all that modinv has been implemented incorrectly? For something so standard, if we're worried that it's wrong we should also be worried about many other things

187

13:49

<michaelfolkson> I did want to understand what was going on in this PR that jnewbery referred to https://github.com/bitcoin/bitcoin/pull/18576

188

13:50

<michaelfolkson> We should be putting related unit tests in functional tests from now on?

189

13:50

<provoostenator> troygiorshev: I'm not worried modinv is implemented incorrectly in the Python library, but might be in our implementation. (though in this case I tested the result is the same)

190

13:50

<michaelfolkson> Is that the motivation?

191

13:52

<fjahr> michaelfolkson: not in the functional tests but into files of the functional test framework

192

13:52

<sipa> provoostenator: modinv isn't implemented in the Python library before 3.8 i think

193

13:52

<troygiorshev> fjahr: this seems like a great opportunity for fuzz testing

194

13:52

<elichai2> Maybe writing a list of test vectors can be nice

195

13:52

<lightlike> fjahr: it doesn't seem in the right place, because stuff tested in a functional tests should depend on a running node in some way - so I like jnewbery's suggestion to have python unit test.

196

13:52

<sipa> fuzz testing cryptographic code is extremely hard

197

13:52

<provoostenator> sipa: correct, I would just wait for that, and put a TODO until then

198

13:53

<fjahr> trygiorshev: yes, provoostenator has pointed this out as well

199

13:53

<troygiorshev> sipa: Ah I'm using the word too flexibly. I just mean giving both the same random inputs

200

13:53

<troygiorshev> would only check for parity

201

13:54

<provoostenator> One thing you can do, if you have hundreds of test vectors, is to run a tiny percentage through the slower algoritm. But we only have 1.

202

13:54

<michaelfolkson> fjahr: And the motivation for this was that tests were unreliable? What was happening because the unit tests weren't in the files of the functional test framework?

203

13:54

<jnewbery> it's a shame that sipa wrote the c++ and python implementation. Ideally the python implemenation would be written by someone who isn't sipa and has never talked to sipa about muhash :)

204

13:55

<sipa> i volunteer jnewbery to rewrite it :)

205

13:55

<michaelfolkson> I need to get to understand the test framework better, it is regularly tripping me up

206

13:55

<sipa> (you're right)

207

13:55

<thomasb06> jnewbery: sipa: maybe one day...

208

13:56

<jnewbery> too late. I've already read your implementation so I can't write one that is uninfluenced by you

209

13:56

* troygiorshev scrolls up to see who responded "n" to "reviewed?"

210

13:56

<fjahr> michaelfolkson: it's a better way to do this style of testing of the the python implementations of stuff, its really a unit test and feels weird in the functional test files

211

13:57

<sipa> i think this is actually one part where MuHash has an advantage... apart from gluing SHA256/ChaCha20 together, the code is trivial in python (it's multiplying and dividing numbers modulo a prime)

212

13:57

<fjahr> I have added a proper functional test in a later part of the series now: https://github.com/bitcoin/bitcoin/pull/19145/commits/6013f04f319188ddf4926142291ced3242817e95

213

13:57

<provoostenator> Agreed, the Python version is very nice and short. Probably the first time I could wrap my head around it.

214

13:58

<fjahr> sipa: agree

215

13:58

<provoostenator> If you chuck the ChaCha20 and ModInv stuff in another file, it's even better

216

13:58

<fjahr> ok, one more minute

217

13:59

<jnewbery> I know it's very late in the meeting, but I thought real_or_random's comment here was very interesting: https://github.com/bitcoin/bitcoin/pull/19055#discussion_r438026185

218

13:59

<jnewbery> I haven't fully digested it, but it's something I plan to dig into

219

13:59

<provoostenator> Haha, digested

220

13:59

<nehan> oh -- i found the use of "set" confusing given that you can add the same value multiple times (at least this is what i recall from some other discussion) and get a different hash

221

14:00

<provoostenator> nehan: that confused me too. It's not a set. Unless you treat it very carefully.

222

14:00

<fjahr> yeah, it's a bit confusing because what you can do =/= what you should do

223

14:01

<sipa> jnewbery: haha, digested

224

14:01

<fjahr> I will improve docs certainly :)