libminisketch is a library implementing
the PinSketch set reconciliation algorithm.
It is the basis for the efficient relay protocol in
Erlay (covered in a previous review
club), but is generally usable for various
applications that use set reconciliation. PinSketch is effectively a collection
of algorithms doing the following:
Computing a “sketch” of a set of data elements (which are treated as
numbers)
Combining two sketches to compute a sketch of the
symmetric difference
between the two original sets (i.e., the set of elements occurring in one
of the original sets, but not both).
Recovering the set of elements from its sketch, under the assumption the
sketch is large enough.
In this review club, we will discuss some of the math behind the algorithm,
by reviewing the Python
reimplementation,
rather than the C++ code itself. It is a full implementation of all of
libminisketch’s functions natively in Python3. At a high level it uses the
same algorithms as the C++ implementation, but lacking several optimizations,
and using Python data types to represent everything. This makes it several
orders of magnitude slower, but hopefully also makes it easier to review the
algorithms.
The algorithms rely heavily on properties of finite
fields. Fields are sets with an
associated addition and multiplication operation, with all the properties you
want for solving sets of linear equations (commutativity, associativity,
distributivity, a “0” that is the neutral element for addition, a “1” that is
the neutral element for multiplication, and every element except 0 having a
multiplicative inverse). An example of a field is the rational numbers or the
real numbers. In order to be able to represent these objects on a computer we
however want fields with a finite number of elements. This 9-minute
video explains how to construct
finite fields with any size that is a power of a prime number, using
polynomials. In PinSketch, only finite fields whose size is a power of 2 is
used.
The PinSketch algorithm goes a step further, and actually operates on
polynomials whose coefficients are finite field elements. This may be
confusing, as the construction of those finite fields themselves also uses
polynomials, so it’s important to realize there are two layers.
A description of PinSketch’s algorithms at a high level is explained in this
document.
Questions
How are finite field elements represented in Python? How do we perform
additions, subtractions, multiplications, and divisions on them? How are
polynomials with finite field coefficients represented in Python?
Imagine Alice and Bob have sets {a,b,c,d,f} and {b,d,e,f} respectively,
where the variables represent distinct non-zero 8-bit field elements. Assume
that Alice knows ahead of time that the (symmetric) difference between those
two sets is not more than 3. Alice is going to send a sketch of her elements to
Bob, so that Bob can learn the differences. How many bits will her sketch be in
size? Looking at add, what field elements will that sketch consist of?
Following the same example, Bob will compute a sketch over his own elements,
and combine it with the sketch received from Alice. What field elements will
that combined sketch contain? Noting that it only contains odd powers, how
will Bob restore the even ones (see decode)? Would this work if we weren’t
using power-of-two fields?
The next step is converting the power sums to a polynomial using the
Berlekamp-Massey.
What does this algorithm do for us?
The final, and most complicated step to recovering the set of differences is
finding the roots of the obtained polynomial. To do so, we first verify that
the obtained nth degree polynomial actually has n distinct roots. It relies on
the property that (x^(fieldsize) - x) has every field element exactly once as
root. How does that let us test if the polynomial is fully factorizable into n
roots? Why is this test necessary?
To actually find the roots, the Berlekamp Trace Algorithm is used. It uses
the trace function
t(x) = x + x^2 + x^4 + ... + x^(fieldsize/2) which maps every element of a
field of size 2^n to 0 or 1. In our 8-bit field that means
t(x) = x + x^2 + x^4 + x^8 + x^16 + x^32 + x^64 + x^128. This means that
for any non-zero field element p, tr(p*x) also has this property, and
every choice of p will map a different subset of field elements to 0 (and the
others to 1). How is this property used to recursively split the polynomial
into smaller and smaller ones?
<glozow> In `GF2Ops`, elements are represented as integers. In the generation code I found it helpful to note that `GF(2^field_size)` is isomorphic to `Z_2[x] / (f)` for polynomial f of degree `field_size`, so when generating the field, the integers are like polynomials where the nth bit (from the right) is the coefficient for `x^{n-1}`.
<glozow> Oh, and polynomials in the poly_* functions with finite field coefficients are arrays of integers with the ith element of the array corresponding to the coefficient for `x^i` (with no trailing zeroes).
<michaelfolkson> "In theory it could be used for any problem where you have two sets of data and want to work out how those data sets differ (which specific elements are present in one but missing in another) as efficiently as possible."
<sipa> glozow: indeed, that's almost the definition of characteristic (the characteristic of a field is the maximum number of times you need to add an element to itself to be guaranteed to reach 0)
<sipa> so if we're working in the 8-bit field, what would {3} * {3} be (i'm going to write field elements in {} to not give the impression they're actually integers)
<sipa> ok, so, multiplication over the integers-representing-field-elements is really: taking them apart, interpreting the bits as coefficients of a polynomial over Z2 (integers mod 2), multiplying those polynomials, and then doing the result modulo an irreducible polynomial of degree (bits)
<sipa> the field elements are represented as integers in the python code; polynomials are represented as lists of integers (if p is a list, p[0] is the 0th degree coefficvients, p[17] is the 17th degree coefficient, etc)
<sipa> Imagine Alice and Bob have sets {a,b,c,d,f} and {b,d,e,f} respectively, where the variables represent distinct non-zero 8-bit field elements. Assume that Alice knows ahead of time that the (symmetric) difference between those two sets is not more than 3. Alice is going to send a sketch of her elements to Bob, so that Bob can learn the differences. How many bits will her sketch be in size?
<sipa> if Bob wants to learn what elements Alice has he doesn't, and the other way around, he will receive this sketch [a+b+c+d+f, a**3+b**3+c**3+d**3+f**3, a**5+b**5+c**5+d**5+f**5] from Alice
<jnewbery> dergoegge: to answer your question from earlier, it's to get the smallest number of bytes that will fit the right number of bits. // in python is floor division
<sipa> there are 3 steps in that: first computing the even power-sums from just the odd-power ones, applying Berlekamp-Massey to convert it to a polynomial, and then finding the roots of that polynomial
<glozow> Since the polynomials are over a finite field with a characteristic 2, `a + a = 0` for every `a` in the field and thus `(x + y)^2 = x^2 + y^2`. It's also known that `s_0 = n` This makes it possible to compute all even syndromes of the sketch from the odd syndromes since every even is some `s_{2i} = x^{2i} + y^{2i} + ... = (x + y + ...)^{2i} = s_i^2`
<glozow> BM finds the polynomnial (determined by coefficients) that generates the the syndromes. i.e. minimal polynomial L (its coefficients `l_0 ... l_n`) s.t. `S(M)*L` = the power sums.
<sipa> sdaftuar: BM will always find something, but if it finds a polynomial with degree higher than the capacity, we know something went wrong (as that would mean it's generated from more than capacity elements)
<sipa> so applying BM to the sketch in our problem, will find [a*b*c, a*b + b*c + a*c, a + b + c, 1] (i.e. the polynomial (a+b+c) + (a*b + b*c + a*c)*x + (a+b+c)*x^2 + x^3)
<sdaftuar> sipa: from a quick look at the python code, the runtime won't be crazy in those scenarios right? looks like it's just got runtime based on the number of syndromes?
<sipa> 5. The final, and most complicated step to recovering the set of differences is finding the roots of the obtained polynomial. To do so, we first verify that the obtained nth degree polynomial actually has n distinct roots. It relies on the property that (x^(fieldsize) - x) has every field element exactly once as root. How does that let us test if the polynomial is fully factorizable into n
<glozow> A n-degree polynomial can have at most n roots. It _may_ have fewer roots that are repeated (note n-degree polynomial with unique roots is synonymous with having n roots).
<sipa> glozow: This is important because our set {m_1 ... m_n} is derived from the roots of the polynomial. The Berklekamp Trace Algorithm also requires this I believe.
<sipa> that's exactly the answer; our root finding algorithm (next club!) requires that the polynomial is fully factorizable into distinct 1st degree factors