Add Muhash3072 implementation in Python (tests, math and cryptography)

https://github.com/bitcoin/bitcoin/pull/19105

Host: fjahr  -  PR authors: sipa , fjahr

The PR branch HEAD was a6ee4c2cee at the time of this review club meeting.

Notes

  • This is the second in a series of PRs implementing the Coin Statistics Index. #18000 provides an overview of the work and is used to keep track of the different PRs.

  • The first PR in the series was #19055, featured in PR review club two weeks ago. If you haven’t seen it, it would be beneficial to read that conversation first. There are also some links to papers in the description that might help you understand the theoretical background of the algorithm better.

  • Based on the feedback from the last session, the scope of PR 19055 has since been reduced. Today’s PR adds the Python implementation of MuHash3072 which was originally in #19055. #19055 now only adds the C++ implementation.

  • The previous session was more heavily focussed on conceptual understanding of the UTXO set statistics index project as a whole. This week, we’ll focus on understanding the implementation of MuHash3072 in Python.

  • The next PR in the series is #19145. It may be interesting to skip ahead to see how the code is used but if you participated in the previous review club session you have already seen some of that code.

  • This PR follows the pattern to check the C++ implementation of an algorithm against the Python implementation of the same algorithm. Another example of this pattern is Siphash.

Questions

  • What is the API of MuHash3072 in Python? Is it different from the C++ implementation in PR 19055?

  • There is a numerator and a denominator property in the Python implementation. Can you explain how these are used in MuHash on a high level? Where are they in the C++ implementation?

  • ChaCha20 is used in MuHash. What are the properties of that algorithm?

  • What is ChaCha20 used for in particular in MuHash? Explain why this is needed.

  • What does the modinv method do? What is it’s role in finalizing the hash (in the digest method)?

  • What are your thoughts theStack’s suggestion to use Fermat’s little theorem in the code and the trade-offs involved?

  • How do you like the test to verify parity with the unit test of the C++ implementation? Would have done it differently? Should it be extended? Should it be moved to a different place?

Meeting Log

  113:00 <fjahr> #startmeeting
  213:00 <fjahr> Hello everyone, welcome to this weeks PR review club. We are reviewing the Python implementation of MuHash3072 (#19105) which part of CoinStatsIndex (overview in #18000). Notes and questions are here: https://bitcoincore.reviews/19105.html.
  313:00 <fjahr> I am drawing some comparisons to the C++ implementation in the questions but the focus of this session should definitely be on understanding how the algorithm works in the Python implementation. If you have any questions don't hesitate to ask, there can be multiple discussions happening in parallel.
  413:00 <fjahr> feel free to say hi :)
  513:00 <jnewbery> hi
  613:00 <kanzure> hi
  713:00 <troygiorshev> hi
  813:00 <gimballock> Hi
  913:00 <andrewtoth> hi
 1013:00 <lightlike> hi
 1113:00 <michaelfolkson> hi
 1213:00 <raj_149> hi
 1313:00 <vindard> hi
 1413:00 <raj_149> hi
 1513:01 <fjahr> who had a chance to review the PR? (y/n)
 1613:01 <provoostenator> hi
 1713:01 <Ravi_21M> hi
 1813:01 <provoostenator> y
 1913:01 <jnewbery> y
 2013:01 <elichai2> hi
 2113:01 <raj_149> y
 2213:01 <emzy> HI
 2313:01 <MM77788811> Hi
 2413:01 <emzy> n
 2513:01 <vindard> y
 2613:01 <troygiorshev> n
 2713:01 <lightlike> y
 2813:01 <andrewtoth> n
 2913:01 <nehan> hi
 3013:01 <nehan> y
 3113:02 <fjahr> Great. There were already some nice review comments. Let's get started with a warm-up question: What is the API of MuHash3072 in Python? Is it different from the C++ implementation in PR 19055?
 3213:02 <sipa> hi, y
 3313:03 <raj_149> insert, remove and digest?
 3413:04 <elichai2> that it doesn't overload the `/=` and `*=` operators and instead use named functions?
 3513:04 <fjahr> raj_149: yes, how about the constructors?
 3613:04 <fjahr> elichai2: yepp, that as well
 3713:06 <sipa> look at the inputs to those insert/remove/*=//=
 3813:08 <elichai2> That the C++ API accepts MuHash3072 objects and the python accepts raw data and then turn it into a MuHash3072?
 3913:09 <fjahr> Yepp. And there is an empty default constructor in C++ which we don't have in Python.
 4013:09 <fjahr> Next Q: There is a numerator and a denominator property in the Python implementation. Can you explain how these are used in MuHash on a high level? Where are they in the C++ implementation?
 4113:09 <elichai2> isn't this an empty constructor though? https://github.com/bitcoin/bitcoin/blob/a6ee4c2ceee624d1d3ed1dfa4bd6f259139bb9d8/test/functional/test_framework/muhash.py#L63
 4213:10 <nehan> to keep running products in the numerator/denominator because inversing things is expensive. so inverse once at the end.
 4313:10 <fjahr> elichai2: Oh, ah, true :D
 4413:11 <lightlike> numerator to add elements to the set, denominator to remove elements from the set
 4513:11 <fjahr> nehan: lightlike: right, did you compare it to the C++ code?
 4613:12 <raj_149> fjahr: there isn't one in c++ it seems?
 4713:12 <troygiorshev> It looks like the c++ code doesn't current take advantage of this. It computes the inverse every time we want to remove an element
 4813:12 <nehan> yes but i don't see where there are numerators/denominators in the C++
 4913:13 <sipa> the MuHash3072 object in the C++ code is lower level; the numerator/denominator is something higher-level code can do if it needs it
 5013:13 <elichai2> LOL. I only now realized that 386BYTES == 3072bits. the "hash384" confused me to think it uses some 384bits hash(sha384 or something hehe)
 5113:13 <sipa> one reason is that you may want to cache MuHash3072 objects for running hashes or so, after the inversion (so it's only 384 bytes and not 768)
 5213:14 <sipa> elichai2: no, 3072/8 = 384
 5313:14 * elichai2 🤦
 5413:14 <elichai2> ops yeah 384 not 386, typo
 5513:14 <troygiorshev> sipa: ah makes sense
 5613:14 <lightlike> yes, that took me also more time than it should have :-)
 5713:15 <sipa> but if that's not going to happen, perhaps the numerator/denominator (and also the SHA256'ing) should be integrated into MuHash3072 rather than forcing higher-level code to take care of that
 5813:15 <jnewbery> sipa: I haven't looked at the c++ code. When you say caching, do you mean that you're saving the 384 byte chacha digest for each block?
 5913:16 <sipa> jnewbery: i don't know?
 6013:16 <sipa> it depends on the use case
 6113:17 <sipa> one could be that you'd precompute the muhash "effect" of every transaction in the mempool for example
 6213:17 <jnewbery> what do you mean by 'cache muhash 3072 objects for running hashes or so'?
 6313:17 <provoostenator> I think fjahr said in the C++ PR that only the sha256 hash is stored in an index
 6413:17 <jnewbery> ah ok
 6513:17 <sipa> or per block, if you want to be able to roll forward/backward from that block
 6613:17 <provoostenator> See discussion here-ish: https://github.com/bitcoin/bitcoin/pull/19055#issuecomment-641996489
 6713:17 <fjahr> provoostenator: this is something different
 6813:18 <jnewbery> and currently the c++ implementation inverts every utxo spend rather than multiplying them all and then inverting once?
 6913:18 <sipa> jnewbery: iirc fjahr's code does one inversion per block
 7013:19 <fjahr> it's on the calculation for each block where I add new utxos and removed utxos separately and then do one division att the end
 7113:19 <fjahr> sipa: yes
 7213:19 <jnewbery> sipa fjahr: thanks
 7313:19 <fjahr> provoostenator: you asked about this in your first pass on the index as well
 7413:20 <sipa> as it is currently used it would be perfectly reasonable to have the numerator/denominator inside the MuHash3072 object - but if there was ever a desire to cache a full MuHash3072 object in a place where memory usage matters, this may be undesirable
 7513:21 <nehan> This is where the division is done per block: https://github.com/bitcoin/bitcoin/pull/18000/commits/1502a08b4a7ea8987271e87ad73715c565a916fc#diff-56c85843f4edd7acc62fbf7af80a67d1R197
 7613:21 <fjahr> nehan: thank you!
 7713:22 <fjahr> yeah, so instead of removing each undo instead they are added up and then removed together
 7813:23 <troygiorshev> sipa: IIRC one of the other incremental hashes had a much smaller state
 7913:23 <sipa> troygiorshev: ECMH has 32 bytes state
 8013:23 <troygiorshev> ECMH maybe
 8113:23 <sipa> well, 33
 8213:24 <elichai2> in storage or in memory? :P
 8313:24 <sipa> it's complicated
 8413:25 <elichai2> sipa: I asked troygiorshev hehe
 8513:25 <raj_149> are we planning on storing the the rolling hash to disk?
 8613:25 <fjahr> Maybe next question but keep going if the last part is still unclear to anyone! ChaCha20 is used in MuHash. What are the properties of that algorithm? What is ChaCha20 used for in particular in MuHash?
 8713:25 <sipa> in compressed form, 33 bytes; in affine coordinates, 64 bytes; in jacobian coordinates, 96 bytes; in denormalized jacobian representation, 120 bytes :p
 8813:26 <elichai2> :D
 8913:26 <elichai2> still less than 384 bytes though 😅
 9013:26 <jnewbery> sipa elichai2: I think we're gettin a bit offtopic!
 9113:26 <sipa> sorry.
 9213:27 <raj_149> fjahr: its converting 32 bytes to 384 bytes firstly. Is there any other purpose its serving?
 9313:28 <fjahr> raj_149: yes, but why?
 9413:28 <raj_149> fjahr: because muhash numerator/denominator needs to be a 384 byte data?
 9513:29 <sipa> perhaps a useful question is: why not use ChaCha20 directly, without SHA256?
 9613:29 <jnewbery> sipa's original post here might be helpful: https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2017-May/014337.html
 9713:30 <jnewbery> Spcifically "As a result, MuHash modulo a sufficiently large safe prime is provably secure under the DL assumption. Common guidelines on security parameters [7] say that 3072-bit DL has about 128 bits of security."
 9813:31 <nehan> i think you need at least 3072 bits of security. 384 bytes is less than 512 bytes.
 9913:32 <raj_149> sipa: i am a little confused here, where exactly we are mixing chacha20 with sha256?
10013:32 <provoostenator> Also if you transform 256 bits to 3072 bits, do you actually get 3072 bits of security? That always give me headaches.
10113:32 <nehan> so if you were willing to use MuHash on 4096 bit integers you wouldn't need chacha
10213:32 <sipa> nehan: i think you're confused, but i can't say where
10313:32 <nehan> sipa: yeah probably :) i'm guessing
10413:32 <sipa> nehan: maybe you're confusing bits with bytes?
10513:32 <provoostenator> MuHash works with 256 bit blocks
10613:33 <provoostenator> And doesn't compress
10713:33 <sipa> what?
10813:33 <provoostenator> Uhhh
10913:33 <sipa> it deals with 3072 bit integers
11013:33 <provoostenator> ChaCha
11113:33 <jnewbery> raj_149: sha256 is used twice. We first hash the object using sha256 to a 32 byte digest, and then use that digest as the key in the chacha20 rounds
11213:33 <lightlike> is there an attack scenario wrt the specific use case of UTXO statistics that we need to be secure against?
11313:33 <nehan> My guess is that you *could* use MuHash with 4096 bit integers instead of 3072, but it would require extra storage space
11413:34 <jnewbery> sha256 is then used once there's a muhash output for the entire set to reduce it from 384 bytes back down to 32 bytes
11513:34 <sipa> nehan: and slower multiplication
11613:34 <provoostenator> lightlike: not really, but if we ever want to use it for consensus stuff - e.g. UTXO commitments, then we do have to worry
11713:34 <sipa> raj_149: the UTXO is hashed first with SHA256 (formerly truncated SHA512) to produce 32 bytes; those 32 bytes are expanded to 384 bytes using ChaCha20; those 384 bytes are interpreted as 3072-bit integers and multiplied together; the result of that multiplication is SHA256'd again to produce a 32-byte hash
11813:34 <nehan> sipa: I also thought when you said "perhaps a useful question is: why not use ChaCha20 directly, without SHA256?" you meant SHA512.
11913:35 <sipa> nehan: the SHA512 was replaced with SHA256
12013:35 <sipa> after some benchmarking
12113:35 <fjahr> sipa: will be :)
12213:35 <sipa> ah
12313:35 <nehan> it's still 512 here: https://github.com/bitcoin/bitcoin/pull/19105/commits/400ac1cc9e7fb2983d4dced9512819c970ef9638#diff-832874419bea3a4c22ea70b000657e3eR55
12413:35 <fjahr> didn't get around to push the new code
12513:35 <jnewbery> ignore truncated SHA512 and just pretend we're saying sha256 :)
12613:35 <sipa> sorry for confusing people, i should have checked
12713:36 <troygiorshev> sipa: in you're question are you referring to the first usage of sha256 or the second?
12813:36 <raj_149> sipa: just so that i understand it right, this is currently not the case with the python code right?
12913:36 <jnewbery> raj_149: yes, this is the case with the python code
13013:36 <sipa> ignore everything i said about SHA256; it's SHA512 for now
13113:37 <raj_149> jnewbery: isn't chacha here already expecting a 32 byte data? its not hashing it anywhere.
13213:37 <nehan> ok so ChaCha20 is a way of securely blowing up 256 bytes to 384 bytes
13313:37 <fjahr> but trunkated to 256bits, which adds to the confusion
13413:37 <troygiorshev> nehan: 32 to 384 yeah
13513:37 <nehan> troygiorshe: oh right, thanks
13613:37 <troygiorshev> nehan: or 256 to 3072 bits, whichever you'd prefer :)
13713:38 <sipa> nehan: 256 *bits* to 384 *bytes*
13813:38 <nehan> 32->384. that was my confusion :)
13913:38 <sipa> yeah
14013:38 <jnewbery> the data parameter in data_to_num3072 here: https://github.com/bitcoin/bitcoin/pull/19105/files#diff-832874419bea3a4c22ea70b000657e3eR53 should already by a 32 byte digest
14113:38 <nehan> or I guess what the Python code says now is 64 truncated to 32 to 384
14213:38 <sipa> the answer is just that ChaCha20 isn't a hash algorithm; it's an encryption algorithm, and it needs a random key to be secure (which we produce by hashing); it's actually encrypting all zeroes
14313:39 <michaelfolkson> What was the motivation for using truncated SHA512 over SHA256?
14413:39 <troygiorshev> 64 bit processors, right?
14513:39 <sipa> michaelfolkson: this was discussed in the previous meeting on this topic, i think
14613:39 <raj_149> jnewbery: ok got it, i thought its somewhere enforced in the function itself.
14713:40 <sipa> but given that we're changing it to SHA256 maybe it isn't worth rehashing (haha)
14813:40 <fjahr> discusion: https://github.com/bitcoin/bitcoin/pull/19055#issuecomment-640169988
14913:40 <fjahr> michaelfolkson: ^
15013:40 <troygiorshev> fjahr: thx
15113:40 <michaelfolkson> Thanks
15213:40 <jnewbery> I don't think it's worth discussing truncatedSHA512 -vs- SHA256 here. Just imagine it's a random oracle giving you a 32 byte output.
15313:41 <fjahr> Ok, another question that was raised in an early review: What are your thoughts theStack’s suggestion to use Fermat’s little theorem in the code and the trade-offs involved?
15413:41 <thomasb06> what are the ranges of 'a' and 'n-2'?
15513:41 <andrewtoth> what about chacha20 making 384 bytes out of 32? Does it get any other inputs? If it's deterministic, won't the 384 bytes have the same security as 32 bytes?
15613:42 <sipa> andrewtoth: they have 256 bits of entropy
15713:42 <troygiorshev> fjahr: I agree with sipa's comment, saying "for those who understand this stuff, it's just as readable either way, and it's black magic for anyone who isn't familiar"
15813:42 <nehan> fjahr: how often is modinv actually getting called? it seemed like once per digest, right? and the test only calls digest a few times.
15913:42 <raj_149> fjahr: similar to stack's observation, i am getting one order of mag faster execution with fermet's theorem.
16013:43 <nehan> raj_149: i think fermat's little theorem is supposed to be *slower*...
16113:43 <sipa> raj_149: that's unexpected!
16213:43 <jnewbery> I don't think performance matter too much in the test framework
16313:43 <sipa> i was expecting the extgcd approach to be faster
16413:43 <sipa> but agree that performance shouldn't matter
16513:44 <nehan> i think it's less likely python's pow() will have a bug than this custom implementation of modinv so i'd go with pow :)
16613:44 <thomasb06> if it doesn't overflow too
16713:44 <troygiorshev> fwiw my testing agrees with TheStack
16813:44 <raj_149> nehan: right, opposite, slower.. my bad..
16913:44 <fjahr> nehan: yes but I am adding a more extensive test where it get's called more often. But in general I was just thinking of how we already are waiting for the ci environment. But it does not have too much of an impact, not comparable to a unnecessary sleep() or so
17013:45 <nehan> i was thinking if you keep modinv you should add a unittest to ensure it's producing the same results as pow()
17113:45 <sipa> nehan: that's a good argument - though it's the kind of algorithm that's unlikely to be wrong in a very small subset of cases
17213:45 <jnewbery> one nice thing about the python implementation is that it's more accessible and readable for most people. I think key.py does a great job at documenting the algorithms: https://github.com/bitcoin/bitcoin/blob/6762a627ecb89ba8d4ed81a049a5d802e6dd75c2/test/functional/test_framework/key.py#L13
17313:46 <fjahr> jnewbery: yeah, I think it's a good idea to use that one
17413:46 <jnewbery> so having the same algorithm used as in the c++ implementation is useful because it acts as a teaching aid for people looking at the code
17513:46 <nehan> jnewbery: good point!
17613:46 <compress> jnewbery agreed
17713:46 <raj_149> jnewbery: +1
17813:46 <sipa> jnewbery: alternatively... a different algorithm in the tests means less likely that there's a bug repeated in both
17913:47 <michaelfolkson> 2 tests
18013:47 <jnewbery> (an argument against is that you want a completely different algorithm so you don't replicate a bug in the test code)
18113:47 <jnewbery> sipa: +1
18213:48 <fjahr> on a higher level we are comparing to the results of the c++ code so a incorrect modinv should pop up
18313:48 <compress> so its a trade off of security/soundness vs readability / tech debt?
18413:48 <nehan> so have multiple algorithms in the test and compare them
18513:48 <fjahr> this brings me to my last question: How do you like the test to verify parity with the unit test of the C++ implementation? Would have done it differently? Should it be extended? Should it be moved to a different place?
18613:49 <troygiorshev> are we really worried at all that modinv has been implemented incorrectly? For something so standard, if we're worried that it's wrong we should also be worried about many other things
18713:49 <michaelfolkson> I did want to understand what was going on in this PR that jnewbery referred to https://github.com/bitcoin/bitcoin/pull/18576
18813:50 <michaelfolkson> We should be putting related unit tests in functional tests from now on?
18913:50 <provoostenator> troygiorshev: I'm not worried modinv is implemented incorrectly in the Python library, but might be in our implementation. (though in this case I tested the result is the same)
19013:50 <michaelfolkson> Is that the motivation?
19113:52 <fjahr> michaelfolkson: not in the functional tests but into files of the functional test framework
19213:52 <sipa> provoostenator: modinv isn't implemented in the Python library before 3.8 i think
19313:52 <troygiorshev> fjahr: this seems like a great opportunity for fuzz testing
19413:52 <elichai2> Maybe writing a list of test vectors can be nice
19513:52 <lightlike> fjahr: it doesn't seem in the right place, because stuff tested in a functional tests should depend on a running node in some way - so I like jnewbery's suggestion to have python unit test.
19613:52 <sipa> fuzz testing cryptographic code is extremely hard
19713:52 <provoostenator> sipa: correct, I would just wait for that, and put a TODO until then
19813:53 <fjahr> trygiorshev: yes, provoostenator has pointed this out as well
19913:53 <troygiorshev> sipa: Ah I'm using the word too flexibly. I just mean giving both the same random inputs
20013:53 <troygiorshev> would only check for parity
20113:54 <provoostenator> One thing you can do, if you have hundreds of test vectors, is to run a tiny percentage through the slower algoritm. But we only have 1.
20213:54 <michaelfolkson> fjahr: And the motivation for this was that tests were unreliable? What was happening because the unit tests weren't in the files of the functional test framework?
20313:54 <jnewbery> it's a shame that sipa wrote the c++ and python implementation. Ideally the python implemenation would be written by someone who isn't sipa and has never talked to sipa about muhash :)
20413:55 <sipa> i volunteer jnewbery to rewrite it :)
20513:55 <michaelfolkson> I need to get to understand the test framework better, it is regularly tripping me up
20613:55 <sipa> (you're right)
20713:55 <thomasb06> jnewbery: sipa: maybe one day...
20813:56 <jnewbery> too late. I've already read your implementation so I can't write one that is uninfluenced by you
20913:56 * troygiorshev scrolls up to see who responded "n" to "reviewed?"
21013:56 <fjahr> michaelfolkson: it's a better way to do this style of testing of the the python implementations of stuff, its really a unit test and feels weird in the functional test files
21113:57 <sipa> i think this is actually one part where MuHash has an advantage... apart from gluing SHA256/ChaCha20 together, the code is trivial in python (it's multiplying and dividing numbers modulo a prime)
21213:57 <fjahr> I have added a proper functional test in a later part of the series now: https://github.com/bitcoin/bitcoin/pull/19145/commits/6013f04f319188ddf4926142291ced3242817e95
21313:57 <provoostenator> Agreed, the Python version is very nice and short. Probably the first time I could wrap my head around it.
21413:58 <fjahr> sipa: agree
21513:58 <provoostenator> If you chuck the ChaCha20 and ModInv stuff in another file, it's even better
21613:58 <fjahr> ok, one more minute
21713:59 <jnewbery> I know it's very late in the meeting, but I thought real_or_random's comment here was very interesting: https://github.com/bitcoin/bitcoin/pull/19055#discussion_r438026185
21813:59 <jnewbery> I haven't fully digested it, but it's something I plan to dig into
21913:59 <provoostenator> Haha, digested
22013:59 <nehan> oh -- i found the use of "set" confusing given that you can add the same value multiple times (at least this is what i recall from some other discussion) and get a different hash
22114:00 <provoostenator> nehan: that confused me too. It's not a set. Unless you treat it very carefully.
22214:00 <fjahr> yeah, it's a bit confusing because what you can do =/= what you should do
22314:01 <sipa> jnewbery: haha, digested
22414:01 <fjahr> I will improve docs certainly :)
22514:01 <fjahr> #endmeeting
22614:01 <nehan> thanks fjahr!
22714:01 <andrewtoth> thanks fjahr!
22814:01 <fjahr> went by way too fast again. Thanks everyone!
22914:01 <troygiorshev> thanks fjakr!
23014:01 <fjahr> r
23114:02 <troygiorshev> fjahr*
23214:02 <jnewbery> thanks fjahr. Thanks sipa. Really interesting discussion
23314:02 <fjahr> I really appreciate the feedback!
23414:02 <emzy> thanks fjahr!
23514:02 <lightlike> thank fjahr
23614:02 <michaelfolkson> Time flies when you are having fun fjahr. Nice work
23714:02 <compress> thanks fjahr
23814:02 <fjahr> Thanks sipa!
23914:02 <sipa> fjahr: yw, and thanks for hosting again
24014:02 <figs> thanks everyone
24114:06 <SirRichard> Thanks everyone, great discussion. Great to learn and follow along.