Add Muhash3072 implementation in Python (
tests, cryptography) Jun 10, 2020
The PR branch HEAD was a6ee4c2cee at the time of this review club meeting.
This is the second in a series of PRs implementing the Coin Statistics
an overview of the work and is used to keep track of the different PRs.
The first PR in the series was
featured in PR review club two weeks ago.
If you haven’t seen it, it would be beneficial to read that
conversation first. There are also some links to papers in the description
that might help you understand the theoretical background of the algorithm
Based on the feedback from the last session, the scope of
19055 has since been reduced.
Today’s PR adds the Python implementation of
which was originally in
#19055 now only adds the C++
The previous session was more heavily focussed on conceptual understanding
of the UTXO set statistics index project as a whole. This week, we’ll focus on
understanding the implementation of
MuHash3072 in Python.
The next PR in the series is
It may be interesting to skip ahead to see how the code is used but
if you participated in the previous review club session you have already seen
some of that code.
This PR follows the pattern to check the C++ implementation of an algorithm
against the Python implementation of the same algorithm. Another
example of this pattern is Siphash. Questions
What is the API of
MuHash3072 in Python? Is it different from the C++
There is a
numerator and a
denominator property in the Python
implementation. Can you explain how these are used in MuHash on a high level?
Where are they in the C++ implementation?
ChaCha20 is used in MuHash. What are the properties of that algorithm?
What is ChaCha20 used for in particular in MuHash? Explain why this is needed.
What does the
modinv method do? What is it’s role in finalizing the hash
What are your thoughts
theStack’s suggestion to use Fermat’s little theorem
in the code and the trade-offs involved?
How do you like the test to verify parity with the unit test of the C++
implementation? Would have done it differently? Should it be extended?
Should it be moved to a different place?
1 13:00 <fjahr> #startmeeting
2 13:00 <fjahr> Hello everyone, welcome to this weeks PR review club. We are reviewing the Python implementation of MuHash3072 (#19105) which part of CoinStatsIndex (overview in #18000). Notes and questions are here: https://bitcoincore.reviews/19105.html.
3 13:00 <fjahr> I am drawing some comparisons to the C++ implementation in the questions but the focus of this session should definitely be on understanding how the algorithm works in the Python implementation. If you have any questions don't hesitate to ask, there can be multiple discussions happening in parallel.
4 13:00 <fjahr> feel free to say hi :)
11 13:00 <michaelfolkson> hi
15 13:01 <fjahr> who had a chance to review the PR? (y/n)
16 13:01 <provoostenator> hi
18 13:01 <provoostenator> y
31 13:02 <fjahr> Great. There were already some nice review comments. Let's get started with a warm-up question: What is the API of MuHash3072 in Python? Is it different from the C++ implementation in PR 19055?
33 13:03 <raj_149> insert, remove and digest?
34 13:04 <elichai2> that it doesn't overload the `/=` and `*=` operators and instead use named functions?
35 13:04 <fjahr> raj_149: yes, how about the constructors?
36 13:04 <fjahr> elichai2: yepp, that as well
37 13:06 <sipa> look at the inputs to those insert/remove/*=//=
38 13:08 <elichai2> That the C++ API accepts MuHash3072 objects and the python accepts raw data and then turn it into a MuHash3072?
39 13:09 <fjahr> Yepp. And there is an empty default constructor in C++ which we don't have in Python.
40 13:09 <fjahr> Next Q: There is a numerator and a denominator property in the Python implementation. Can you explain how these are used in MuHash on a high level? Where are they in the C++ implementation?
42 13:10 <nehan> to keep running products in the numerator/denominator because inversing things is expensive. so inverse once at the end.
43 13:10 <fjahr> elichai2: Oh, ah, true :D
44 13:11 <lightlike> numerator to add elements to the set, denominator to remove elements from the set
45 13:11 <fjahr> nehan: lightlike: right, did you compare it to the C++ code?
46 13:12 <raj_149> fjahr: there isn't one in c++ it seems?
47 13:12 <troygiorshev> It looks like the c++ code doesn't current take advantage of this. It computes the inverse every time we want to remove an element
48 13:12 <nehan> yes but i don't see where there are numerators/denominators in the C++
49 13:13 <sipa> the MuHash3072 object in the C++ code is lower level; the numerator/denominator is something higher-level code can do if it needs it
50 13:13 <elichai2> LOL. I only now realized that 386BYTES == 3072bits. the "hash384" confused me to think it uses some 384bits hash(sha384 or something hehe)
51 13:13 <sipa> one reason is that you may want to cache MuHash3072 objects for running hashes or so, after the inversion (so it's only 384 bytes and not 768)
52 13:14 <sipa> elichai2: no, 3072/8 = 384
54 13:14 <elichai2> ops yeah 384 not 386, typo
55 13:14 <troygiorshev> sipa: ah makes sense
56 13:14 <lightlike> yes, that took me also more time than it should have :-)
57 13:15 <sipa> but if that's not going to happen, perhaps the numerator/denominator (and also the SHA256'ing) should be integrated into MuHash3072 rather than forcing higher-level code to take care of that
58 13:15 <jnewbery> sipa: I haven't looked at the c++ code. When you say caching, do you mean that you're saving the 384 byte chacha digest for each block?
59 13:16 <sipa> jnewbery: i don't know?
60 13:16 <sipa> it depends on the use case
61 13:17 <sipa> one could be that you'd precompute the muhash "effect" of every transaction in the mempool for example
62 13:17 <jnewbery> what do you mean by 'cache muhash 3072 objects for running hashes or so'?
63 13:17 <provoostenator> I think fjahr said in the C++ PR that only the sha256 hash is stored in an index
65 13:17 <sipa> or per block, if you want to be able to roll forward/backward from that block
67 13:17 <fjahr> provoostenator: this is something different
68 13:18 <jnewbery> and currently the c++ implementation inverts every utxo spend rather than multiplying them all and then inverting once?
69 13:18 <sipa> jnewbery: iirc fjahr's code does one inversion per block
70 13:19 <fjahr> it's on the calculation for each block where I add new utxos and removed utxos separately and then do one division att the end
71 13:19 <fjahr> sipa: yes
72 13:19 <jnewbery> sipa fjahr: thanks
73 13:19 <fjahr> provoostenator: you asked about this in your first pass on the index as well
74 13:20 <sipa> as it is currently used it would be perfectly reasonable to have the numerator/denominator inside the MuHash3072 object - but if there was ever a desire to cache a full MuHash3072 object in a place where memory usage matters, this may be undesirable
76 13:21 <fjahr> nehan: thank you!
77 13:22 <fjahr> yeah, so instead of removing each undo instead they are added up and then removed together
78 13:23 <troygiorshev> sipa: IIRC one of the other incremental hashes had a much smaller state
79 13:23 <sipa> troygiorshev: ECMH has 32 bytes state
80 13:23 <troygiorshev> ECMH maybe
82 13:24 <elichai2> in storage or in memory? :P
83 13:24 <sipa> it's complicated
84 13:25 <elichai2> sipa: I asked troygiorshev hehe
85 13:25 <raj_149> are we planning on storing the the rolling hash to disk?
86 13:25 <fjahr> Maybe next question but keep going if the last part is still unclear to anyone! ChaCha20 is used in MuHash. What are the properties of that algorithm? What is ChaCha20 used for in particular in MuHash?
87 13:25 <sipa> in compressed form, 33 bytes; in affine coordinates, 64 bytes; in jacobian coordinates, 96 bytes; in denormalized jacobian representation, 120 bytes :p
89 13:26 <elichai2> still less than 384 bytes though 😅
90 13:26 <jnewbery> sipa elichai2: I think we're gettin a bit offtopic!
92 13:27 <raj_149> fjahr: its converting 32 bytes to 384 bytes firstly. Is there any other purpose its serving?
93 13:28 <fjahr> raj_149: yes, but why?
94 13:28 <raj_149> fjahr: because muhash numerator/denominator needs to be a 384 byte data?
95 13:29 <sipa> perhaps a useful question is: why not use ChaCha20 directly, without SHA256?
97 13:30 <jnewbery> Spcifically "As a result, MuHash modulo a sufficiently large safe prime is provably secure under the DL assumption. Common guidelines on security parameters  say that 3072-bit DL has about 128 bits of security."
98 13:31 <nehan> i think you need at least 3072 bits of security. 384 bytes is less than 512 bytes.
99 13:32 <raj_149> sipa: i am a little confused here, where exactly we are mixing chacha20 with sha256?
100 13:32 <provoostenator> Also if you transform 256 bits to 3072 bits, do you actually get 3072 bits of security? That always give me headaches.
101 13:32 <nehan> so if you were willing to use MuHash on 4096 bit integers you wouldn't need chacha
102 13:32 <sipa> nehan: i think you're confused, but i can't say where
103 13:32 <nehan> sipa: yeah probably :) i'm guessing
104 13:32 <sipa> nehan: maybe you're confusing bits with bytes?
105 13:32 <provoostenator> MuHash works with 256 bit blocks
106 13:33 <provoostenator> And doesn't compress
108 13:33 <provoostenator> Uhhh
109 13:33 <sipa> it deals with 3072 bit integers
110 13:33 <provoostenator> ChaCha
111 13:33 <jnewbery> raj_149: sha256 is used twice. We first hash the object using sha256 to a 32 byte digest, and then use that digest as the key in the chacha20 rounds
112 13:33 <lightlike> is there an attack scenario wrt the specific use case of UTXO statistics that we need to be secure against?
113 13:33 <nehan> My guess is that you *could* use MuHash with 4096 bit integers instead of 3072, but it would require extra storage space
114 13:34 <jnewbery> sha256 is then used once there's a muhash output for the entire set to reduce it from 384 bytes back down to 32 bytes
115 13:34 <sipa> nehan: and slower multiplication
116 13:34 <provoostenator> lightlike: not really, but if we ever want to use it for consensus stuff - e.g. UTXO commitments, then we do have to worry
117 13:34 <sipa> raj_149: the UTXO is hashed first with SHA256 (formerly truncated SHA512) to produce 32 bytes; those 32 bytes are expanded to 384 bytes using ChaCha20; those 384 bytes are interpreted as 3072-bit integers and multiplied together; the result of that multiplication is SHA256'd again to produce a 32-byte hash
118 13:34 <nehan> sipa: I also thought when you said "perhaps a useful question is: why not use ChaCha20 directly, without SHA256?" you meant SHA512.
119 13:35 <sipa> nehan: the SHA512 was replaced with SHA256
120 13:35 <sipa> after some benchmarking
121 13:35 <fjahr> sipa: will be :)
124 13:35 <fjahr> didn't get around to push the new code
125 13:35 <jnewbery> ignore truncated SHA512 and just pretend we're saying sha256 :)
126 13:35 <sipa> sorry for confusing people, i should have checked
127 13:36 <troygiorshev> sipa: in you're question are you referring to the first usage of sha256 or the second?
128 13:36 <raj_149> sipa: just so that i understand it right, this is currently not the case with the python code right?
129 13:36 <jnewbery> raj_149: yes, this is the case with the python code
130 13:36 <sipa> ignore everything i said about SHA256; it's SHA512 for now
131 13:37 <raj_149> jnewbery: isn't chacha here already expecting a 32 byte data? its not hashing it anywhere.
132 13:37 <nehan> ok so ChaCha20 is a way of securely blowing up 256 bytes to 384 bytes
133 13:37 <fjahr> but trunkated to 256bits, which adds to the confusion
134 13:37 <troygiorshev> nehan: 32 to 384 yeah
135 13:37 <nehan> troygiorshe: oh right, thanks
136 13:37 <troygiorshev> nehan: or 256 to 3072 bits, whichever you'd prefer :)
137 13:38 <sipa> nehan: 256 *bits* to 384 *bytes*
138 13:38 <nehan> 32->384. that was my confusion :)
141 13:38 <nehan> or I guess what the Python code says now is 64 truncated to 32 to 384
142 13:38 <sipa> the answer is just that ChaCha20 isn't a hash algorithm; it's an encryption algorithm, and it needs a random key to be secure (which we produce by hashing); it's actually encrypting all zeroes
143 13:39 <michaelfolkson> What was the motivation for using truncated SHA512 over SHA256?
144 13:39 <troygiorshev> 64 bit processors, right?
145 13:39 <sipa> michaelfolkson: this was discussed in the previous meeting on this topic, i think
146 13:39 <raj_149> jnewbery: ok got it, i thought its somewhere enforced in the function itself.
147 13:40 <sipa> but given that we're changing it to SHA256 maybe it isn't worth rehashing (haha)
149 13:40 <fjahr> michaelfolkson: ^
150 13:40 <troygiorshev> fjahr: thx
151 13:40 <michaelfolkson> Thanks
152 13:40 <jnewbery> I don't think it's worth discussing truncatedSHA512 -vs- SHA256 here. Just imagine it's a random oracle giving you a 32 byte output.
153 13:41 <fjahr> Ok, another question that was raised in an early review: What are your thoughts theStack’s suggestion to use Fermat’s little theorem in the code and the trade-offs involved?
154 13:41 <thomasb06> what are the ranges of 'a' and 'n-2'?
155 13:41 <andrewtoth> what about chacha20 making 384 bytes out of 32? Does it get any other inputs? If it's deterministic, won't the 384 bytes have the same security as 32 bytes?
156 13:42 <sipa> andrewtoth: they have 256 bits of entropy
157 13:42 <troygiorshev> fjahr: I agree with sipa's comment, saying "for those who understand this stuff, it's just as readable either way, and it's black magic for anyone who isn't familiar"
158 13:42 <nehan> fjahr: how often is modinv actually getting called? it seemed like once per digest, right? and the test only calls digest a few times.
159 13:42 <raj_149> fjahr: similar to stack's observation, i am getting one order of mag faster execution with fermet's theorem.
160 13:43 <nehan> raj_149: i think fermat's little theorem is supposed to be *slower*...
161 13:43 <sipa> raj_149: that's unexpected!
162 13:43 <jnewbery> I don't think performance matter too much in the test framework
163 13:43 <sipa> i was expecting the extgcd approach to be faster
164 13:43 <sipa> but agree that performance shouldn't matter
165 13:44 <nehan> i think it's less likely python's pow() will have a bug than this custom implementation of modinv so i'd go with pow :)
166 13:44 <thomasb06> if it doesn't overflow too
167 13:44 <troygiorshev> fwiw my testing agrees with TheStack
168 13:44 <raj_149> nehan: right, opposite, slower.. my bad..
169 13:44 <fjahr> nehan: yes but I am adding a more extensive test where it get's called more often. But in general I was just thinking of how we already are waiting for the ci environment. But it does not have too much of an impact, not comparable to a unnecessary sleep() or so
170 13:45 <nehan> i was thinking if you keep modinv you should add a unittest to ensure it's producing the same results as pow()
171 13:45 <sipa> nehan: that's a good argument - though it's the kind of algorithm that's unlikely to be wrong in a very small subset of cases
173 13:46 <fjahr> jnewbery: yeah, I think it's a good idea to use that one
174 13:46 <jnewbery> so having the same algorithm used as in the c++ implementation is useful because it acts as a teaching aid for people looking at the code
175 13:46 <nehan> jnewbery: good point!
176 13:46 <compress> jnewbery agreed
177 13:46 <raj_149> jnewbery: +1
178 13:46 <sipa> jnewbery: alternatively... a different algorithm in the tests means less likely that there's a bug repeated in both
179 13:47 <michaelfolkson> 2 tests
180 13:47 <jnewbery> (an argument against is that you want a completely different algorithm so you don't replicate a bug in the test code)
181 13:47 <jnewbery> sipa: +1
182 13:48 <fjahr> on a higher level we are comparing to the results of the c++ code so a incorrect modinv should pop up
183 13:48 <compress> so its a trade off of security/soundness vs readability / tech debt?
184 13:48 <nehan> so have multiple algorithms in the test and compare them
185 13:48 <fjahr> this brings me to my last question: How do you like the test to verify parity with the unit test of the C++ implementation? Would have done it differently? Should it be extended? Should it be moved to a different place?
186 13:49 <troygiorshev> are we really worried at all that modinv has been implemented incorrectly? For something so standard, if we're worried that it's wrong we should also be worried about many other things
188 13:50 <michaelfolkson> We should be putting related unit tests in functional tests from now on?
189 13:50 <provoostenator> troygiorshev: I'm not worried modinv is implemented incorrectly in the Python library, but might be in our implementation. (though in this case I tested the result is the same)
190 13:50 <michaelfolkson> Is that the motivation?
191 13:52 <fjahr> michaelfolkson: not in the functional tests but into files of the functional test framework
192 13:52 <sipa> provoostenator: modinv isn't implemented in the Python library before 3.8 i think
193 13:52 <troygiorshev> fjahr: this seems like a great opportunity for fuzz testing
194 13:52 <elichai2> Maybe writing a list of test vectors can be nice
195 13:52 <lightlike> fjahr: it doesn't seem in the right place, because stuff tested in a functional tests should depend on a running node in some way - so I like jnewbery's suggestion to have python unit test.
196 13:52 <sipa> fuzz testing cryptographic code is extremely hard
197 13:52 <provoostenator> sipa: correct, I would just wait for that, and put a TODO until then
198 13:53 <fjahr> trygiorshev: yes, provoostenator has pointed this out as well
199 13:53 <troygiorshev> sipa: Ah I'm using the word too flexibly. I just mean giving both the same random inputs
200 13:53 <troygiorshev> would only check for parity
201 13:54 <provoostenator> One thing you can do, if you have hundreds of test vectors, is to run a tiny percentage through the slower algoritm. But we only have 1.
202 13:54 <michaelfolkson> fjahr: And the motivation for this was that tests were unreliable? What was happening because the unit tests weren't in the files of the functional test framework?
203 13:54 <jnewbery> it's a shame that sipa wrote the c++ and python implementation. Ideally the python implemenation would be written by someone who isn't sipa and has never talked to sipa about muhash :)
204 13:55 <sipa> i volunteer jnewbery to rewrite it :)
205 13:55 <michaelfolkson> I need to get to understand the test framework better, it is regularly tripping me up
206 13:55 <sipa> (you're right)
207 13:55 <thomasb06> jnewbery: sipa: maybe one day...
208 13:56 <jnewbery> too late. I've already read your implementation so I can't write one that is uninfluenced by you
209 13:56 * troygiorshev scrolls up to see who responded "n" to "reviewed?"
210 13:56 <fjahr> michaelfolkson: it's a better way to do this style of testing of the the python implementations of stuff, its really a unit test and feels weird in the functional test files
211 13:57 <sipa> i think this is actually one part where MuHash has an advantage... apart from gluing SHA256/ChaCha20 together, the code is trivial in python (it's multiplying and dividing numbers modulo a prime)
213 13:57 <provoostenator> Agreed, the Python version is very nice and short. Probably the first time I could wrap my head around it.
214 13:58 <fjahr> sipa: agree
215 13:58 <provoostenator> If you chuck the ChaCha20 and ModInv stuff in another file, it's even better
216 13:58 <fjahr> ok, one more minute
218 13:59 <jnewbery> I haven't fully digested it, but it's something I plan to dig into
219 13:59 <provoostenator> Haha, digested
220 13:59 <nehan> oh -- i found the use of "set" confusing given that you can add the same value multiple times (at least this is what i recall from some other discussion) and get a different hash
221 14:00 <provoostenator> nehan: that confused me too. It's not a set. Unless you treat it very carefully.
222 14:00 <fjahr> yeah, it's a bit confusing because what you can do =/= what you should do
223 14:01 <sipa> jnewbery: haha, digested
224 14:01 <fjahr> I will improve docs certainly :)
225 14:01 <fjahr> #endmeeting
226 14:01 <nehan> thanks fjahr!
227 14:01 <andrewtoth> thanks fjahr!
228 14:01 <fjahr> went by way too fast again. Thanks everyone!
229 14:01 <troygiorshev> thanks fjakr!
231 14:02 <troygiorshev> fjahr*
232 14:02 <jnewbery> thanks fjahr. Thanks sipa. Really interesting discussion
233 14:02 <fjahr> I really appreciate the feedback!
234 14:02 <emzy> thanks fjahr!
235 14:02 <lightlike> thank fjahr
236 14:02 <michaelfolkson> Time flies when you are having fun fjahr. Nice work
237 14:02 <compress> thanks fjahr
238 14:02 <fjahr> Thanks sipa!
239 14:02 <sipa> fjahr: yw, and thanks for hosting again
240 14:02 <figs> thanks everyone
241 14:06 <SirRichard> Thanks everyone, great discussion. Great to learn and follow along.