The PR branch HEAD was a6ee4c2cee at the time of this review club meeting.
Notes
This is the second in a series of PRs implementing the Coin Statistics
Index. #18000 provides
an overview of the work and is used to keep track of the different PRs.
The first PR in the series was #19055,
featured in PR review club two weeks ago.
If you haven’t seen it, it would be beneficial to read that
conversation first. There are also some links to papers in the description
that might help you understand the theoretical background of the algorithm
better.
Based on the feedback from the last session, the scope of PR
19055 has since been reduced.
Today’s PR adds the Python implementation of MuHash3072
which was originally in #19055.
#19055 now only adds the C++
implementation.
The previous session was more heavily focussed on conceptual understanding
of the UTXO set statistics index project as a whole. This week, we’ll focus on
understanding the implementation of MuHash3072 in Python.
The next PR in the series is #19145.
It may be interesting to skip ahead to see how the code is used but
if you participated in the previous review club session you have already seen
some of that code.
This PR follows the pattern to check the C++ implementation of an algorithm
against the Python implementation of the same algorithm. Another
example of this pattern is Siphash.
Questions
What is the API of MuHash3072 in Python? Is it different from the C++
implementation in PR 19055?
There is a numerator and a denominator property in the Python
implementation. Can you explain how these are used in MuHash on a high level?
Where are they in the C++ implementation?
ChaCha20 is used in MuHash. What are the properties of that algorithm?
What is ChaCha20 used for in particular in MuHash? Explain why this is needed.
What does the modinv method do? What is it’s role in finalizing the hash
(in the digest method)?
How do you like the test to verify parity with the unit test of the C++
implementation? Would have done it differently? Should it be extended?
Should it be moved to a different place?
<fjahr> Hello everyone, welcome to this weeks PR review club. We are reviewing the Python implementation of MuHash3072 (#19105) which part of CoinStatsIndex (overview in #18000). Notes and questions are here: https://bitcoincore.reviews/19105.html.
<fjahr> I am drawing some comparisons to the C++ implementation in the questions but the focus of this session should definitely be on understanding how the algorithm works in the Python implementation. If you have any questions don't hesitate to ask, there can be multiple discussions happening in parallel.
<fjahr> Great. There were already some nice review comments. Let's get started with a warm-up question: What is the API of MuHash3072 in Python? Is it different from the C++ implementation in PR 19055?
<fjahr> Next Q: There is a numerator and a denominator property in the Python implementation. Can you explain how these are used in MuHash on a high level? Where are they in the C++ implementation?
<sipa> but if that's not going to happen, perhaps the numerator/denominator (and also the SHA256'ing) should be integrated into MuHash3072 rather than forcing higher-level code to take care of that
<sipa> as it is currently used it would be perfectly reasonable to have the numerator/denominator inside the MuHash3072 object - but if there was ever a desire to cache a full MuHash3072 object in a place where memory usage matters, this may be undesirable
<fjahr> Maybe next question but keep going if the last part is still unclear to anyone! ChaCha20 is used in MuHash. What are the properties of that algorithm? What is ChaCha20 used for in particular in MuHash?
<jnewbery> Spcifically "As a result, MuHash modulo a sufficiently large safe prime is provably secure under the DL assumption. Common guidelines on security parameters [7] say that 3072-bit DL has about 128 bits of security."
<jnewbery> raj_149: sha256 is used twice. We first hash the object using sha256 to a 32 byte digest, and then use that digest as the key in the chacha20 rounds
<sipa> raj_149: the UTXO is hashed first with SHA256 (formerly truncated SHA512) to produce 32 bytes; those 32 bytes are expanded to 384 bytes using ChaCha20; those 384 bytes are interpreted as 3072-bit integers and multiplied together; the result of that multiplication is SHA256'd again to produce a 32-byte hash
<sipa> the answer is just that ChaCha20 isn't a hash algorithm; it's an encryption algorithm, and it needs a random key to be secure (which we produce by hashing); it's actually encrypting all zeroes
<fjahr> Ok, another question that was raised in an early review: What are your thoughts theStack’s suggestion to use Fermat’s little theorem in the code and the trade-offs involved?
<andrewtoth> what about chacha20 making 384 bytes out of 32? Does it get any other inputs? If it's deterministic, won't the 384 bytes have the same security as 32 bytes?
<troygiorshev> fjahr: I agree with sipa's comment, saying "for those who understand this stuff, it's just as readable either way, and it's black magic for anyone who isn't familiar"
<fjahr> nehan: yes but I am adding a more extensive test where it get's called more often. But in general I was just thinking of how we already are waiting for the ci environment. But it does not have too much of an impact, not comparable to a unnecessary sleep() or so
<fjahr> this brings me to my last question: How do you like the test to verify parity with the unit test of the C++ implementation? Would have done it differently? Should it be extended? Should it be moved to a different place?
<troygiorshev> are we really worried at all that modinv has been implemented incorrectly? For something so standard, if we're worried that it's wrong we should also be worried about many other things
<provoostenator> troygiorshev: I'm not worried modinv is implemented incorrectly in the Python library, but might be in our implementation. (though in this case I tested the result is the same)
<lightlike> fjahr: it doesn't seem in the right place, because stuff tested in a functional tests should depend on a running node in some way - so I like jnewbery's suggestion to have python unit test.
<provoostenator> One thing you can do, if you have hundreds of test vectors, is to run a tiny percentage through the slower algoritm. But we only have 1.
<michaelfolkson> fjahr: And the motivation for this was that tests were unreliable? What was happening because the unit tests weren't in the files of the functional test framework?
<jnewbery> it's a shame that sipa wrote the c++ and python implementation. Ideally the python implemenation would be written by someone who isn't sipa and has never talked to sipa about muhash :)
<fjahr> michaelfolkson: it's a better way to do this style of testing of the the python implementations of stuff, its really a unit test and feels weird in the functional test files
<sipa> i think this is actually one part where MuHash has an advantage... apart from gluing SHA256/ChaCha20 together, the code is trivial in python (it's multiplying and dividing numbers modulo a prime)
<nehan> oh -- i found the use of "set" confusing given that you can add the same value multiple times (at least this is what i recall from some other discussion) and get a different hash